# Section 2: Fundamentals of Natural Language Processing

# Chapter 3 : NLP and Text Embeddings

There are many different ways of representing text in deep learning. While we have covered basic bag-of-words (BoW) representations, unsurprisingly, there is a far more sophisticated way of representing text data known as **embeddings**. While a **BoW vector acts only as a count of words within a sentence**, **embeddings help to numerically define the actual meaning of certain words**.

## Embeddings for NLP

Words do not have a natural way of representing their meaning. In images, we already have representations in rich vectors (containing the values of each pixel within the image). **When parts of language are represented in a high-dimensional vector format, they are known as embeddings**. Through analysis of a corpus of words, and **by determining which words appear frequently together**, we can obtain **an $n$-length vector for each word, which better represents the semantic relationship of each word to all other words**. We saw previously that we can easily represent words as one-hot encoded vectors:

![](ohe_vectors.png)

On the other hand, embeddings are vectors of length $n$ (in the following example, $n = 3$) that can take any real value:

![](vectors_n_equal_3.png)

These embeddings **represent the word's vector in $n$-dimensional space (where $n$ is the *length of the embedding vectors*), and words with similar vectors within this space are considered *to be more similar in meaning***. While these embeddings can be of any size, they are generally of much lower dimensionality than the BoW representation. BoW vectors are generally very sparse, consisting mostly of zeros, whereas embeddings are dense: every dimension contributes to the overall representation of the word. The lower dimensionality and the lack of sparsity make deep learning on embeddings much more efficient than on BoW representations.
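
To make the contrast concrete, here is a minimal sketch using a made-up three-word vocabulary. The dense values are hand-picked for illustration and are not learned embeddings; the helper `dot()` is likewise just an assumption for this example.

```python
# Hypothetical toy vocabulary; the dense values below are invented for illustration.
vocab = ["cat", "dog", "piano"]

# One-hot vectors: length equals the vocabulary size, with a single 1 per word.
one_hot = {word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, word in enumerate(vocab)}

# Dense embeddings: length n = 3 here, and any real value is allowed.
dense = {
    "cat":   [0.9, 0.8, 0.1],
    "dog":   [0.8, 0.9, 0.2],
    "piano": [0.1, 0.2, 0.9],
}

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Any two distinct one-hot vectors are orthogonal, so they carry no similarity.
print(dot(one_hot["cat"], one_hot["dog"]))   # 0.0

# Dense vectors can place related words closer together than unrelated ones.
print(dot(dense["cat"], dense["dog"]) > dot(dense["cat"], dense["piano"]))   # True
```

One-hot vectors can only say whether two words are identical, whereas dense vectors can express degrees of relatedness.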

### **GloVe**

Global Vectors for Word Representation (GloVe) is a set of pre-calculated word embeddings. It can be downloaded [here](https://nlp.stanford.edu/projects/glove/), and it will help us demonstrate how embeddings work.

These embeddings are calculated on a very large text corpus and are trained on a word co-occurrence matrix. This is based on the notion that words that appear together are more likely to have similar meanings.

In [11]:
# !iwr -outf glove.6B.50d.zip http://nlp.stanford.edu/data/glove.6B.50d.zip

In [8]:
# 1 We first create a simple function to load our GloVe vectors
import numpy as np

def loadGlove(path):
    # Each line holds a word followed by its vector components
    model = {}
    with open(path, 'r', encoding="utf8") as file:
        for l in file:
            line = l.split()
            word = line[0]
            value = np.array([float(val) for val in line[1:]])
            model[word] = value
    return model

glove = loadGlove(path='./glove.6B.50d.txt')

# 2 We can access a single vector by just looking it up
# in the dict
glove['python']

array([ 0.5897  , -0.55043 , -1.0106  ,  0.41226 ,  0.57348 ,  0.23464 ,
       -0.35773 , -1.78    ,  0.10745 ,  0.74913 ,  0.45013 ,  1.0351  ,
        0.48348 ,  0.47954 ,  0.51908 , -0.15053 ,  0.32474 ,  1.0789  ,
       -0.90894 ,  0.42943 , -0.56388 ,  0.69961 ,  0.13501 ,  0.16557 ,
       -0.063592,  0.35435 ,  0.42819 ,  0.1536  , -0.47018 , -1.0935  ,
        1.361   , -0.80821 , -0.674   ,  1.2606  ,  0.29554 ,  1.0835  ,
        0.2444  , -1.1877  , -0.60203 , -0.068315,  0.66256 ,  0.45336 ,
       -1.0178  ,  0.68267 , -0.20788 , -0.73393 ,  1.2597  ,  0.15425 ,
       -0.93256 , -0.15025 ])

We can see that this **returns a $50$-dimensional vector embedding for the word *python***. We will now introduce the concept of ***cosine similarity*** to **compare how similar two vectors are**. Vectors will have a similarity of $1$ if **the angle in the $n$-dimensional space between them is $0^{\circ}$**. **Vectors with high cosine similarity can be considered similar**, even if they are not equal. This can be calculated using the following formula, where $A$ and $B$ are the two embedding vectors being compared:

$\cos(\theta) = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^2} \times \sqrt{\sum_{i} B_i^2}}$
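
Before reaching for a library, it may help to see the formula written out by hand. The following is a minimal pure-Python sketch; the function name `cosine_sim` and the example vectors are assumptions made for illustration.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity of two equal-length vectors, following the formula above."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing in the same direction: similarity close to 1,
# even though the vectors themselves are not equal.
print(cosine_sim([1.0, 2.0], [2.0, 4.0]))

# Orthogonal vectors: similarity of 0.
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))
```

Because the result depends only on the angle between the vectors, scaling either vector leaves the similarity unchanged.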

In [9]:
from sklearn.metrics.pairwise import cosine_similarity
# 3 We can easily calculate this using the cosine_similarity() function of sklearn
cosine_similarity(glove['cat'].reshape(1, -1), glove['dog'].reshape(1, -1))

array([[0.92180053]])

In [10]:
# 4 However cat and piano are quite dissimilar as they
# are two seemingly unrelated items
cosine_similarity(glove['cat'].reshape(1, -1), glove['piano'].reshape(1, -1))

array([[0.19825255]])

### **Embedding operations**

Since embeddings are vectors, we can perform operations on them. For example, let's say we take the embeddings for the following words and calculate the following:

$Queen-Woman+Man$

With this, we can approximate the embedding for $King$. This essentially replaces the $Woman$ vector component of $Queen$ with the $Man$ vector to arrive at this approximation. We can graphically illustrate this as follows:

![](embeddings_operation.png)

Note that in this example, we illustrate this graphically in two dimensions. In the case of our embeddings, this is happening in a $50$-dimensional space. While this is not exact, we can verify that our calculated vector is indeed similar to the GloVe vector for *king*:

In [11]:
predicted_king_embedding = glove['queen'] - glove['woman'] + glove['man']
cosine_similarity(predicted_king_embedding.reshape(1, -1), glove['king'].reshape(1, -1))

array([[0.85888392]])
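
A common follow-up is to search the vocabulary for the word closest to the calculated vector. The sketch below does this with hypothetical 3-dimensional vectors chosen by hand so that the analogy holds exactly; they are not GloVe values, and the helper `nearest()` is an assumption for illustration.

```python
import math

# Hypothetical 3-dimensional vectors, chosen by hand so the analogy holds exactly.
toy = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "piano": [0.0, 1.0, -1.0],
}

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest(vector, vectors, exclude=()):
    """Return the word whose vector has the highest cosine similarity to `vector`."""
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine_sim(vector, vectors[w]))

# queen - woman + man, computed element by element.
predicted = [q - w + m for q, w, m in zip(toy["queen"], toy["woman"], toy["man"])]

# The input words are excluded so the search cannot simply return one of them.
print(nearest(predicted, toy, exclude=("queen", "woman", "man")))   # king
```

With real embeddings the match is approximate rather than exact, which is why the similarity above is about $0.86$ rather than $1$.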

While GloVe embeddings are very useful pre-calculated embeddings, it is actually possible for us to calculate our own embeddings. This may be useful when we are analyzing a particularly unique corpus. For example, **the language used on Twitter may differ from the language used on Wikipedia**, so embeddings trained on one may not be useful for the other. We will now demonstrate **how we can calculate our own embeddings using a continuous bag-of-words model**.

## Exploring CBOW

The **continuous bag-of-words (CBOW)** model forms part of **Word2Vec** – a model created by Google in order to **obtain vector representations of words**. By running these models over a very large corpus, we are able to obtain detailed representations of words that represent their semantic and contextual similarity to one another. The **Word2Vec** model consists of two main components:

* **CBOW**: This model attempts to predict the target word in a document, given the surrounding words.
* **Skip-gram**: This is the opposite of CBOW; this model attempts to predict the surrounding words, given the target word.

Consider the following sentence:

***PyTorch is a deep learning framework***

Let's say we want to predict the word *deep*, given the context words:

***PyTorch is a {target_word} learning framework***

We could look at this in a number of ways:

![](table%20of%20context%20and%20representation.png)

For our CBOW model, we will use a **window of length $2$**, which means that for our model's $(X, y)$ input/output pairs, we use $([n-2, n-1, n+1, n+2], n)$, where $n$ is the position of the target word being predicted.
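
The windowing described above can be sketched in a few lines. The function names `cbow_pairs()` and `skipgram_pairs()` are assumptions for this example; the skip-gram variant is included only to show how the same window yields the opposite input/output direction.

```python
def cbow_pairs(tokens, window=2):
    """(context words, target) pairs; only positions with a full window are used."""
    pairs = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        pairs.append((context, tokens[i]))
    return pairs

def skipgram_pairs(tokens, window=2):
    """(target, context word) pairs: one pair per surrounding word."""
    return [(target, c)
            for context, target in cbow_pairs(tokens, window)
            for c in context]

tokens = "pytorch is a deep learning framework".split()

# CBOW: four context words in, one target word out.
for context, target in cbow_pairs(tokens):
    print(context, "->", target)

# Skip-gram: one target word in, one context word out (per pair).
print(skipgram_pairs(tokens)[:4])
```

With only six words and a window of $2$, just two positions have a full window, which is why real training needs a much larger text.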

Using these as our model inputs, we will train a model that includes an embedding layer. This embedding layer automatically forms an $n$-dimensional representation of the words in our corpus. However, to begin with, this layer is initialized with random weights. These parameters are what our model will learn, so that after training has finished, this embedding layer can be used to encode our corpus in an embedded vector representation.

### **CBOW architecture**

Here, our model takes an input of $4$ words ($2$ before our target word and $2$ after) and trains it against an output (our target word). The following representation is an illustration of how this might look:

![](cbow%20architecture.png)

Our input words are first fed through an embedding layer, represented as a tensor of size $(n,l)$, where $n$ is the **specified length of our embeddings** and $l$ is **the number of words in our corpus**. This is because **every word within the corpus has its own unique tensor representation**.

Using our combined (summed) embeddings from our $4$ context words, this is then **fed into a fully connected layer in order to learn the final classification** of our target word against our embedded representation of our context words.

> Note that our predicted/target word is encoded as a vector that's the length of our corpus.

This is because our model effectively predicts the probability of each word in the corpus being the target word, and the final classification is the one with the highest probability. We then obtain a loss, backpropagate this through our network, and update the parameters of the fully connected layer, as well as the embeddings themselves.
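
The classification and loss steps can be illustrated without any deep learning library. This is a minimal sketch assuming a hypothetical four-word corpus and made-up raw scores from the final layer; it shows a log-softmax followed by a negative log-likelihood loss.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(scores)
    log_total = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_total for s in scores]

# Hypothetical corpus and raw scores (one score per word in the corpus).
corpus = ["a", "deep", "is", "learning"]
scores = [0.5, 2.0, 0.1, 1.0]

log_probs = log_softmax(scores)

# The final classification is the word with the highest (log-)probability.
predicted = corpus[log_probs.index(max(log_probs))]

# Negative log-likelihood of the true target word "deep".
loss = -log_probs[corpus.index("deep")]

print(predicted, loss)
```

The loss shrinks toward $0$ as the probability assigned to the true target approaches $1$, which is the signal that backpropagation uses to adjust the weights.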

The reason this methodology works is because our learned embeddings represent semantic similarity. Let's say we train our model on the following:

$X = ["is", "a", "learning", "framework"];~y = "deep"$

What our model is essentially learning is that **the combined embedding representation of our context words is semantically similar to our target word**. If we repeat this over a large enough corpus of words, we will find that our word embeddings begin to resemble our previously seen GloVe embeddings, where semantically similar words appear close to one another within the embedding space.

### **Building CBOW**

In [12]:
# 1 We first define some text and perform some basic text
# cleaning, removing basic punctuation and converting it all to
# lowercase

text = """
For us, the members of the AlphaGo team, the AlphaGo story was the adventure of a
lifetime. It began, as many great adventures do, with a small step—training a simple
convolutional neural network on records of Go games played by strong human play-
ers. This led to pivotal breakthroughs in the recent development of machine learning,
as well as a series of unforgettable events, including matches against the formidable
Go professionals Fan Hui, Lee Sedol, and Ke Jie. We’re proud to see the lasting
impact of these matches on the way Go is played around the world, as well as their role
in making more people aware of, and interested in, the field of artificial intelligence.
But why, you might ask, should we care about games? Just as children use games to
learn about aspects of the real world, so researchers in machine learning use them to
train artificial software agents. In this vein, the AlphaGo project is part of DeepMind’s
strategy to use games as simulated microcosms of the real world. This helps us study
artificial intelligence and train learning agents with the goal of one day building gen-
eral purpose learning systems capable of solving the world’s most complex problems.
AlphaGo works in a way that is similar to the two modes of thinking that Nobel
laureate Daniel Kahnemann describes in his book on human cognition, Thinking Fast
and Slow.
"""

text = text.replace(',', '').replace('.', '').lower().split()

# 2 We start by defining our corpus and its length
corpus = set(text)
corpus_length = len(corpus)

# 3 Note that we use a set instead of a list as we are only
# concerned with the unique words within our text. We then
# build our corpus index and our inverse corpus index. Our
# corpus index will allow us to obtain the index of a word
# given the word itself, which will be useful when encoding
# our words for entry into our network. Our inverse corpus
# index allows us to obtain a word, given the index value,
# which will be used to convert our predictions back into words:
word_dict = {}
inverse_word_dict = {}
for i, word in enumerate(corpus):
    word_dict[word] = i
    inverse_word_dict[i] = word

# 4 Next, we encode our data. We loop through our corpus
# and for each target word, we capture the context words
# (the two words before and the two words after). We append
# this with the target word itself to our dataset. Note how
# we begin this process from the third word in our corpus
# (index = 2) and stop it two steps before the end of the
# corpus. This is because the two words at the beginning
# won't have two words before them and, similarly, the two
# words at the end won't have two words after them:
data = []

for i in range(2, len(text) - 2):

    sentence = [text[i-2], text[i-1], text[i+1], text[i+2]]
    target = text[i]
    data.append((sentence, target))

print(data[3])

(['members', 'of', 'alphago', 'team'], 'the')


5. We then define the length of our embeddings. While this can technically be any number you wish, there are some tradeoffs to consider. While higher-dimensional embeddings can lead to a more detailed representation of the words, the feature space also becomes sparser, which means high-dimensional embeddings are only appropriate for large corpora. Furthermore, larger embeddings mean more parameters to learn, so increasing the embedding size can increase training time significantly. We are only training on a very small dataset, so we have opted to use embeddings of size $20$:

In [13]:
embedding_length = 20

Next, we define our **CBOW model in PyTorch**. We define our embeddings layer so that **it takes word indices (out of `corpus_length` possible words) in and outputs one embedding per index**. We define our linear layer as a fully connected layer that takes an embedding in and outputs a vector of length $64$. We define our final layer as a classification layer whose output is the same length as our text corpus.

6. We define our forward pass by obtaining and summing the embeddings for all input context words. This then passes through the fully connected layer with ReLU activation functions and finally into the classification layer, which predicts which word in the corpus corresponds to the summed embeddings of the context words the most:

In [14]:
import torch
from torch import nn

class CBOW(nn.Module):
    def __init__(self, corpus_length, embedding_dim):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(corpus_length, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, 64)
        self.linear2 = nn.Linear(64, corpus_length)
        self.activation_function1 = nn.ReLU()
        self.activation_function2 = nn.LogSoftmax(dim = -1)

    def forward(self, inputs):
        embeds = sum(self.embeddings(inputs)).view(1, -1)
        out = self.linear1(embeds)
        out = self.activation_function1(out)
        out = self.linear2(out)
        out = self.activation_function2(out)
        return out

    # 7 We can also define a get_word_embedding() function, which will
    # allow us to extract embeddings for a given word after our model
    # has been trained
    def get_word_embedding(self, word):
        word = torch.LongTensor([word_dict[word]])
        return self.embeddings(word).view(1, -1)

# 8 Now we are ready to train our model. We first create an instance of our
# model and define the loss function and optimizer
model = CBOW(corpus_length, embedding_length)
loss_function = nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=.01)

# 9 We then create a helper function that takes our input context words,
# gets the word indexes for each of these, and transforms them into a tensor
# (of length 4 during training), which forms the input to our neural network:
def make_sentence_vector(sentence, word_dict):
    idxs = [word_dict[w] for w in sentence]
    return torch.tensor(idxs, dtype=torch.long)

print(make_sentence_vector(['the', 'alphago', 'project', 'is', 'part', 'of'], word_dict))

tensor([ 78, 105,  60, 101, 110,  89])


Now, we train our network. We loop through $100$ epochs and, for each pass, we loop through all our (context words, target word) pairs. For each of these pairs, we load the context sentence using `make_sentence_vector()` and use our current model state to obtain predictions. We evaluate these predictions against our actual target in order to obtain our loss. We backpropagate to calculate the gradients and step through our optimizer to update the weights. Finally, we sum all our losses for the epoch and print this out. Here, we can see that our loss is decreasing, showing that our model is learning:

In [15]:
for epoch in range(100):
    epoch_loss = 0
    for sentence, target in data:
        model.zero_grad()
        sentence_vector = make_sentence_vector(sentence, word_dict)
        log_probs = model(sentence_vector)
        loss = loss_function(log_probs, torch.tensor([word_dict[target]], dtype=torch.long))
        loss.backward()
        optimizer.step()
        epoch_loss += loss.data
    print('Epoch: ' + str(epoch)+ ', Loss: ' + str(epoch_loss.item()))


Epoch: 0, Loss: 1148.0550537109375
Epoch: 1, Loss: 1023.6575927734375
Epoch: 2, Loss: 934.0814208984375
Epoch: 3, Loss: 850.8849487304688
Epoch: 4, Loss: 768.6152954101562
Epoch: 5, Loss: 684.1392211914062
Epoch: 6, Loss: 598.3729858398438
Epoch: 7, Loss: 513.0980224609375
Epoch: 8, Loss: 430.1632080078125
Epoch: 9, Loss: 352.50628662109375
Epoch: 10, Loss: 282.7103271484375
Epoch: 11, Loss: 221.9185028076172
Epoch: 12, Loss: 171.78469848632812
Epoch: 13, Loss: 131.42916870117188
Epoch: 14, Loss: 99.86688232421875
Epoch: 15, Loss: 76.31681060791016
Epoch: 16, Loss: 59.3114013671875
Epoch: 17, Loss: 47.06007766723633
Epoch: 18, Loss: 38.15354537963867
Epoch: 19, Loss: 31.66030502319336
Epoch: 20, Loss: 26.807104110717773
Epoch: 21, Loss: 23.121313095092773
Epoch: 22, Loss: 20.26970863342285
Epoch: 23, Loss: 17.99311637878418
Epoch: 24, Loss: 16.175214767456055
Epoch: 25, Loss: 14.655794143676758
Epoch: 26, Loss: 13.392498970031738
Epoch: 27, Loss: 12.318419456481934
Epoch: 28, Loss: 11.

Now that our model has been trained, we can make predictions. We define a couple of functions to allow us to do so. `get_predicted_result()` returns the predicted word from the array of predictions, while our `predict_sentence()` function makes a prediction based on the context words.

11. We split our sentences into individual words and transform them into an input vector. We then create our prediction array by feeding this into our model and get our final predicted word by using the `get_predicted_result()` function. We also print the $2$ words before and after the predicted target word for context. We can run a couple of predictions to validate that our model is working correctly.

In [24]:
def get_predicted_result(prediction, inverse_word_dict):
    # The index of the highest log-probability maps back to a word
    index = np.argmax(prediction)
    return inverse_word_dict[index]

def predict_sentence(sentence):
    sentence_split = sentence.replace('.', '').lower().split()
    sentence_vector = make_sentence_vector(sentence_split, word_dict)
    prediction_array = model(sentence_vector).data.numpy()
    print("Preceding words : {}".format(sentence_split[:2]))
    print("Predicted word : {}".format(get_predicted_result(prediction_array[0], inverse_word_dict)))
    print("Following words: {}\n".format(sentence_split[2:]))

predict_sentence("professionals")

Preceding words : ['professionals']
Predicted word : go
Following words: []



12. Now that we have a trained model, we are able to use the `get_word_embedding()` function in order to return the $20$-dimensional word embedding for any word in our corpus. If we needed our embeddings for another NLP task, we could actually extract the weights from the whole embedding layer and use them in our new model:

In [25]:
print(model.get_word_embedding('professionals'))

tensor([[-0.1237,  1.8168, -0.7407, -1.3450, -0.6330,  0.1734,  1.6975,  0.3242,
         -1.0633, -0.8235, -1.3310, -1.9816, -0.2267,  0.6849,  0.5219, -0.7008,
          0.3234, -0.2631, -0.6461, -0.1032]], grad_fn=<ViewBackward0>)


Here, we have demonstrated how to train a **CBOW** model for creating word embeddings. In reality, to create reliable embeddings for a corpus, we would require a very large dataset to be able to truly capture the semantic relationship between all the words. Because of this, it may be preferable to use pre-trained embeddings such as GloVe, which have been trained on a very large corpus of data, for your models. However, there may be cases where it would be preferable to train a brand new set of embeddings from scratch; for example, when analyzing a corpus that doesn't resemble standard written language (such as Twitter data, where users may write in short abbreviations and not use full sentences).

## Exploring $n$-grams

It is not only our context words that influence the meaning of words in a sentence, but the order of those words as well. Consider the following sentences:

***The cat sat on the dog***

***The dog sat on the cat***

If we were to transform these two sentences into a bag-of-words representation, we would see that they are identical. However, the sentences clearly have different meanings (in fact, they are the complete opposite!). This demonstrates that **the meaning of a sentence is not just the words it contains**, **but the order in which they occur**. One simple way of attempting to capture the order of words within a sentence is by using $n$-grams.

If we count the distinct two-word pairings that occur within our sentences, instead of counting individual words, we are using **bi-grams**:

![](bi_gram.png)

We can represent this as follows:

***The cat sat on the dog*** $ \to [1,1,1,0,1,1]$

***The dog sat on the cat*** $ \to [1,1,0,1,1,1]$

**We can use $n$-grams as inputs into our deep learning models instead of just a singular word, but when using $n$-gram models, it is worth noting that your feature space can become very large very quickly and may make machine learning very slow**. If a dictionary contains all the words in the English language, a dictionary containing all distinct pairs of words would be several orders of magnitude larger!
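
The comparison between the two sentences can be reproduced with a short sketch. The helper names `ngrams()` and `vectorize()` are assumptions for this example, and the bigram vocabulary is sorted alphabetically, so the column order differs from the vectors shown above.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token tuples, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

s1 = "the cat sat on the dog".split()
s2 = "the dog sat on the cat".split()

# Bag-of-words counts are identical for the two sentences...
print(Counter(s1) == Counter(s2))   # True

# ...but the bigram counts are not. The vocabulary is the union of all bigrams.
vocab = sorted(set(ngrams(s1, 2)) | set(ngrams(s2, 2)))

def vectorize(tokens):
    counts = Counter(ngrams(tokens, 2))
    return [counts[bigram] for bigram in vocab]

print(vocab)
print(vectorize(s1))   # [1, 0, 1, 1, 1, 1]
print(vectorize(s2))   # [0, 1, 1, 1, 1, 1]
```

Six distinct words already produce six distinct bigrams here; on a realistic vocabulary the number of possible pairs grows far faster than the number of words.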

### **$n$-gram language modeling**

If we think of a language as being made up of small word pairs (bigrams) instead of single words, we can begin to model language as a probabilistic model where the probability that a word appears in a sentence depends on the words that appeared before it.

In a **unigram** model, we assume that all the words have a finite probability of appearing based on the distribution of the words in a corpus or document. Consider the following sentence:

*My name is my name*

Based on this sentence, we can generate a distribution of words whereby each word has a given probability of occurring based on its frequency within the document:

![](unigram.png)

We could then draw words randomly from this distribution in order to generate new sentences:

*Name is Name my my*

This sentence doesn't make any sense, illustrating the problems of using a unigram model. Because the probability of each word occurring is independent of all the other words in the sentence, there is no consideration given to the order or context of the words appearing. This is where $n$-gram models are useful.
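
A unigram model of this sentence takes only a few lines to build. This is a minimal sketch; the fixed random seed is an assumption made only so that the sampling is repeatable.

```python
import random
from collections import Counter

tokens = "my name is my name".split()
counts = Counter(tokens)
total = sum(counts.values())

# Unigram probability = word count / total number of words.
unigram = {word: count / total for word, count in counts.items()}
print(unigram)   # {'my': 0.4, 'name': 0.4, 'is': 0.2}

# Draw five words independently; order and context are ignored entirely.
rng = random.Random(0)   # fixed seed only so the sketch is repeatable
words = list(unigram)
sample = rng.choices(words, weights=[unigram[w] for w in words], k=5)
print(" ".join(sample))
```

Each draw ignores every other draw, so nothing prevents the same word from being repeated several times in a row.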

We will now consider using a **bigram** language model. This calculation takes the probability of a word occurring, given the word that appears before it:

$p(W_n|W_{n-1}) = \frac{p(W_{n-1}, W_n)}{p(W_{n-1})}$

This means that **the probability of a word occurring, given the previous word, is the probability of the bigram occurring divided by the probability of the previous word occurring**. Let's say we are trying to predict the next word in the following sentence:

***My favourite language is ___***

Along with this, we're given the following bigram and word probabilities:

![](probabilities.png)

With this, we can calculate that the probability of *Python* occurring, given that the previous word is *is*, is only $20\%$, whereas the probability of *English* occurring is only $10\%$. We could expand this model further to use a trigram or any $n$-gram representation of words as we deem appropriate. We have demonstrated that $n$-gram language modeling can be used to introduce further information about words' relationships to one another into our models, rather than naively assuming that words are independently distributed.
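
In practice, these probabilities are estimated from counts. The sketch below uses a hypothetical toy corpus, so the resulting numbers are not those from the figure above; `bigram_prob()` is an assumed helper name.

```python
from collections import Counter

# Hypothetical toy corpus; the counts are not those behind the figure above.
corpus = [
    "my favourite language is python",
    "my favourite language is english",
    "python is fun",
    "english is fun",
    "my favourite language is python",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(previous, word):
    """Estimate p(word | previous) as count(previous, word) / count(previous)."""
    return bigram_counts[(previous, word)] / unigram_counts[previous]

print(bigram_prob("is", "python"))    # 0.4
print(bigram_prob("is", "english"))   # 0.2
```

To choose the next word after *is*, we would pick the candidate with the highest conditional probability, which in this toy corpus is a tie between *python* and *fun*.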

## Tokenization

Tokenization is a way of pre-processing text for entry into our models. It splits our sentences up into smaller parts. This could involve **splitting a sentence up into its individual words** or **splitting a whole document up into individual sentences**.

In [26]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
# 1 We first take a basic sentence and split this up
# into individual words using the word tokenizer in NLTK
text = 'This is a single sentence.'
tokens = word_tokenize(text)
print(tokens)

# 2 Note how a period is considered a token as it is a part of
# natural language. Depending on what we want to do with the text
# we may wish to keep or dispose of the punctuation
no_punctuation = [word.lower() for word in tokens if word.isalpha()]
print(no_punctuation)

# 3 We can also tokenize documents into individual sentences
# using sent_tokenize
text = "This document is the first sentence. THis is the second sentence. A document\
    contains many sentences."

print(sent_tokenize(text))

# 4 Alternatively, we can combine the two to split a document
# into individual sentences of words
print([word_tokenize(sentence) for sentence in sent_tokenize(text)])

# 5 One other optional step in the process of tokenization
# is the removal of stopwords.
stop_words = stopwords.words('english')
print(stop_words[:20])

# 6 Let's easily remove these stopwords from our words using basic list
# comprehension
tokens = [token for token in word_tokenize(text) if token not in stop_words]
print(tokens)

['This', 'is', 'a', 'single', 'sentence', '.']
['this', 'is', 'a', 'single', 'sentence']
['This document is the first sentence.', 'THis is the second sentence.', 'A document    contains many sentences.']
[['This', 'document', 'is', 'the', 'first', 'sentence', '.'], ['THis', 'is', 'the', 'second', 'sentence', '.'], ['A', 'document', 'contains', 'many', 'sentences', '.']]
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']
['This', 'document', 'first', 'sentence', '.', 'THis', 'second', 'sentence', '.', 'A', 'document', 'contains', 'many', 'sentences', '.']


While some NLP tasks (such as predicting the next word in the sentence) require stopwords, others (such as judging the sentiment of a film review) do not, as the stopwords do not contribute much toward the overall meaning of the document.
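
Note that the final output above still contains `'This'` and `'A'`: the stopword list is lowercase, and the comparison is case-sensitive. The sketch below shows one way to handle this, using a small hypothetical stopword set in place of NLTK's much longer list.

```python
# A small hypothetical stopword set standing in for NLTK's much longer list.
stop_words = {"this", "is", "the", "a"}

tokens = ["This", "document", "is", "the", "first", "sentence", ".", "A", "document"]

# Case-sensitive filtering keeps capitalized stopwords such as 'This' and 'A'.
case_sensitive = [t for t in tokens if t not in stop_words]

# Lowercasing before the comparison removes them; isalpha() also drops punctuation.
normalized = [t.lower() for t in tokens
              if t.isalpha() and t.lower() not in stop_words]

print(case_sensitive)
print(normalized)   # ['document', 'first', 'sentence', 'document']
```

Whether to lowercase first is a design choice: it gives cleaner filtering, but it also discards capitalization that some tasks may rely on.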

## Tagging and chunking for parts of speech

So far, we have covered several approaches for representing words and sentences, including bag-of-words, embeddings, and $n$-grams. However, **these representations fail to capture the structure of any given sentence**. Within natural language, different words can have different functions within a sentence. Consider the following:

*The big dog is sleeping on the bed*

We can "tag" the various words of this text, depending on the function of each word in the sentence. So, the preceding sentence becomes as follows:

The $\rarr$ big $\rarr$ dog $\rarr$ is $\rarr$ sleeping $\rarr$ on $\rarr$ the $\rarr$ bed

Determiner $\rarr$ Adjective $\rarr$ Noun $\rarr$ Verb $\rarr$ Verb $\rarr$ Preposition $\rarr$ Determiner $\rarr$ Noun

These different parts of speech can be used to better understand the structure of sentences. For example, ***adjectives* often precede *nouns* in English**. We can use these parts of speech and their relationships to one another in our models. For example, if we are **predicting the next word in the sentence and the context word is an adjective**, **we know the probability of the next *word* being a noun is high**.

### **Tagging**

Part-of-speech tagging is the act of assigning these part-of-speech tags to the various words within the sentence. Fortunately, NLTK has built-in tagging functionality:

In [28]:
from nltk.tag import pos_tag
sentence = "The big dog is sleeping on the bed"
token = word_tokenize(sentence)
pos_tag(token)

[('The', 'DT'),
 ('big', 'JJ'),
 ('dog', 'NN'),
 ('is', 'VBZ'),
 ('sleeping', 'VBG'),
 ('on', 'IN'),
 ('the', 'DT'),
 ('bed', 'NN')]

We can decode the meaning of a tag by calling `upenn_tagset()` on it. In this case, we can see that "VBG" corresponds to a verb in its present participle or gerund form:

In [30]:
from nltk.help import upenn_tagset
upenn_tagset("VBG")

VBG: verb, present participle or gerund
    telegraphing stirring focusing angering judging stalling lactating
    hankerin' alleging veering capping approaching traveling besieging
    encrypting interrupting erasing wincing ...


### **Chunking**

Chunking expands upon our initial part-of-speech tagging and aims to structure our sentences in small chunks, where each of these chunks represents a short phrase.

We may wish to split our text up into **entities**, where each entity is a separate object or thing. For example, *the red book* refers not to three separate entities, but to a single entity described by three words.

We must first define a grammar pattern to match using regular expressions. The pattern in question looks for **noun phrases (NP)**, where a noun phrase is defined as an **optional determiner (DT)**, followed by **any number of adjectives (JJ)**, followed by a **noun (NN)**:

In [32]:
expression = ('NP: {<DT>?<JJ>*<NN>}')

Using NLTK's `RegexpParser` class, we can match occurrences of this expression and tag them as noun phrases:

In [33]:
from nltk import RegexpParser
tagged = pos_tag(token)
REchunkParser = RegexpParser(expression)
tree = REchunkParser.parse(tagged)
print(tree)

(S
  (NP The/DT big/JJ dog/NN)
  is/VBZ
  sleeping/VBG
  on/IN
  (NP the/DT bed/NN))
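
To see what the pattern is doing, the following pure-Python sketch matches the same tag sequence by hand. It is an illustration of the idea only and makes no claim about how NLTK implements `RegexpParser`; the function name `chunk_noun_phrases()` is an assumption.

```python
def chunk_noun_phrases(tagged):
    """Greedy matcher for the pattern <DT>?<JJ>*<NN> over (word, tag) pairs."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if j < len(tagged) and tagged[j][1] == "DT":      # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":   # zero or more adjectives
            j += 1
        if j < len(tagged) and tagged[j][1] == "NN":      # required noun
            chunks.append([word for word, _ in tagged[i:j + 1]])
            i = j + 1
        else:
            i += 1   # no noun phrase starts here; move on
    return chunks

# The tagged sentence from the output above.
tagged = [("The", "DT"), ("big", "JJ"), ("dog", "NN"), ("is", "VBZ"),
          ("sleeping", "VBG"), ("on", "IN"), ("the", "DT"), ("bed", "NN")]

print(chunk_noun_phrases(tagged))   # [['The', 'big', 'dog'], ['the', 'bed']]
```

The two chunks found correspond to the two `NP` subtrees in the parse tree above.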


## TF-IDF

TF-IDF is yet another technique we can learn about to better represent natural language. **It is often used in text mining and information retrieval to match documents based on search terms, but can also be used in combination with embeddings to better represent sentences in embedding form**.

***This is a small giraffe***

Let's say we want a single embedding to represent the meaning of this sentence. One thing we could do is simply average the individual embeddings of each of the five words in this sentence:

![](words_embeddings_tfidf.png)

However, this methodology assigns equal weight to all the words in the sentence. We might want to assign more weight to the rarer words (like *giraffe* here). This methodology is known as **Term Frequency-Inverse Document Frequency (TF-IDF)**. We will now demonstrate how we can calculate TF-IDF weightings for our documents.
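
Assuming we already had such weights, a weighted sentence embedding could be sketched as follows. The 2-dimensional embeddings and the weights are hypothetical values invented for illustration, and `weighted_average()` is an assumed helper name.

```python
# Hypothetical 2-dimensional embeddings and weights, invented for illustration.
embeddings = {
    "this":    [0.1, 0.2],
    "is":      [0.2, 0.1],
    "a":       [0.1, 0.1],
    "small":   [0.5, 0.4],
    "giraffe": [0.9, 0.8],
}
weights = {"this": 0.1, "is": 0.1, "a": 0.1, "small": 0.3, "giraffe": 0.4}

def weighted_average(words, embeddings, weights):
    """Weighted mean of the word vectors; equal weights give the plain average."""
    total = sum(weights[w] for w in words)
    dim = len(next(iter(embeddings.values())))
    return [sum(weights[w] * embeddings[w][d] for w in words) / total
            for d in range(dim)]

sentence = ["this", "is", "a", "small", "giraffe"]

plain = weighted_average(sentence, embeddings, {w: 1.0 for w in sentence})
weighted = weighted_average(sentence, embeddings, weights)

print(plain)
print(weighted)
```

The weighted result sits closer to the vector for *giraffe* than the plain average does, which is the effect we want from giving rarer words more weight.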

### **Calculating TF-IDF**

TF-IDF consists of two separate parts: term frequency and inverse document frequency. Term frequency is a document-specific measure counting the frequency of a given word within the document being analyzed:

$tf(w, d) = \frac{\text{count of word w in document d}}{\text{words in document d}}$

Note that we divide this measure by the total number of words in the document as a longer document is more likely to contain any given word. **If a word appears many times in a document, it will receive a higher term frequency. However, this is the opposite of what we wish our TF-IDF weighting to do as we want to give a higher weight to occurrences of rare words within our document**. This is where IDF comes into play.

Document frequency measures the number of documents within the entire corpus in which the word being analyzed appears, and inverse document frequency calculates the ratio of the total number of documents, $N$, to the document frequency:

$df(w) = \text{number of documents containing word w}$\
$idf(w) = \frac{N}{df(w)}$

If we have a corpus of $100$ documents and our word appears in $5$ of them, we will have an **inverse document frequency** of $20$. This means that **a higher weight is given to words that occur in fewer documents**.

Now, consider a corpus of $100,000$ documents. If a word appears in just $1$ document, it will have an **IDF of** $100,000$, whereas a word appearing in $2$ documents would have an **IDF of** $50,000$.

These very large and volatile IDFs aren't ideal for our calculations, **so we first dampen them by taking logarithms**.
> Note how we add $1$ to the document frequency to prevent division by $0$ if we calculate TF-IDF for a word that doesn't appear in our corpus:

$idf(w) = \log\left(\frac{N}{df(w) + 1}\right)$
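
To see how much the logarithm tames these values, the raw ratio and the smoothed logarithmic version can be compared side by side. This is a rough sketch: the helper names are hypothetical, and the natural logarithm is assumed as the base.

```python
import math

def raw_idf(n_docs, doc_freq):
    """Plain ratio N / df(w), without any damping."""
    return n_docs / doc_freq

def smoothed_log_idf(n_docs, doc_freq):
    """log(N / (df(w) + 1)), using the natural logarithm (an assumption)."""
    return math.log(n_docs / (doc_freq + 1))

N = 100_000  # corpus size from the example above

for doc_freq in (1, 2, 1_000):
    print(doc_freq, raw_idf(N, doc_freq), round(smoothed_log_idf(N, doc_freq), 3))
```

Going from a document frequency of $1$ to $2$ halves the raw ratio, while the logarithmic version changes by less than one unit. The added $1$ also means that a document frequency of $0$ no longer raises an error.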

This makes our final TF-IDF equation look as follows:

$tfidf(w, d) = tf(w, d) \cdot \log\left(\frac{N}{df(w) + 1}\right)$
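
Before applying this to a real corpus, the whole formula can be tried on a tiny hand-made one. The sketch below assumes documents are lists of lowercase tokens; `toy_corpus`, `document_frequency`, and `tf_idf` are hypothetical names chosen for the illustration.

```python
import math

# Hypothetical corpus of four short documents
toy_corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["the", "giraffe", "ate", "the", "leaves"],
    ["the", "bird", "sang"],
]

def document_frequency(word, corpus):
    """Number of documents in `corpus` that contain `word`."""
    return sum(1 for doc in corpus if word in doc)

def tf_idf(word, document, corpus):
    """tf(w, d) * log(N / (df(w) + 1)) with the natural logarithm."""
    tf = document.count(word) / len(document)
    idf = math.log(len(corpus) / (document_frequency(word, corpus) + 1))
    return tf * idf

doc = toy_corpus[0]
for w in ("the", "sat", "cat"):
    print(w, round(tf_idf(w, doc, toy_corpus), 4))
```

Although `the` has the highest term frequency in the first document, `cat` ends up with the largest weight because it appears in only one document. With this particular smoothing, a word that occurs in every document gets a slightly negative weight, since $N / (N + 1) < 1$.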

### **Implementing TF-IDF**

In [41]:
import nltk
import numpy as np

# 1 Import the dataset. Each sentence of Emma is treated as one document.
# The corpus must be available locally, for example via nltk.download('gutenberg')
emma = nltk.corpus.gutenberg.sents('austen-emma.txt')
emma_sentences = []
emma_word_set = set()

for sentence in emma:
    # Keep alphabetic tokens only and lowercase them
    words = [word.lower() for word in sentence if word.isalpha()]
    emma_sentences.append(words)
    emma_word_set.update(words)

# 2 Next, we create a function that will return our Term Frequencies
# for a given word in our document
def TermFreq(document, word):
    doc_length = len(document)
    occurrences = document.count(word)
    return occurrences / doc_length

TermFreq(emma_sentences[5], 'ago')

0.024390243902439025

In [42]:
# 3 Next, we calculate our Document Frequency.
# In order to do this efficiently, we first
# pre-compute a Document Frequency dictionary
# that counts the number of documents each word
# in our corpus appears in. We pre-compute this
# so that we do not have to loop over the corpus
# every time we wish to calculate the Document
# Frequency for a given word
def build_DF_dict():
    output = {word: 0 for word in emma_word_set}
    for doc in emma_sentences:
        # set(doc) ensures each word is counted at most once per document
        for word in set(doc):
            output[word] += 1
    return output

df_dict = build_DF_dict()
df_dict['ago']

# 4 Here, we can see that the word ago appears
# in 32 of the documents in our corpus. Using this
# dictionary, we can very easily calculate our
# Inverse Document Frequency by dividing the total
# number of documents by our Document Frequency
# and taking the logarithm of this value. Note
# how we add one to the Document Frequency to
# avoid a divide by zero error when the word
# doesn't appear in the corpus
def InverseDocumentFreq(word):
    N = len(emma_sentences)
    # Words missing from the corpus get a Document Frequency of 0
    df = df_dict.get(word, 0) + 1
    return np.log(N / df)

InverseDocumentFreq('ago')

# 5 Finally, we simply combine the Term Frequency and
# Inverse Document Frequency to get the TF-IDF
# weighting for each word/document pair:
def TFIDF(doc, word):
    tf = TermFreq(doc, word)
    idf = InverseDocumentFreq(word)
    return tf * idf

print('ago - ' + str(TFIDF(emma_sentences[5], 'ago')))
print('indistinct - ' + str(TFIDF(emma_sentences[5], 'indistinct')))


Although the words `ago` and `indistinct` each appear only once in the given document, `indistinct` **occurs less frequently throughout the whole corpus, meaning it receives a higher TF-IDF weighting.**

### **Calculating TF-IDF weighted embeddings**

In [45]:
import numpy as np
# 1 Import GLoVe embeddings
def loadGlove(path):
    model = {}
    # The context manager closes the file once loading is finished
    with open(path, 'r', encoding="utf8") as file:
        for l in file:
            line = l.split()
            word = line[0]
            value = np.array([float(val) for val in line[1:]])
            model[word] = value
    return model

glove = loadGlove(path='./glove.6B.50d.txt')

# 2 We then calculate an unweighted mean
# of all the individual embeddings in our document
# to get a vector representation of the sentence
# as a whole. We simply loop through all the words
# in our document, extract the embedding from the
# GLoVe dictionary, and calculate the average
# over all these vectors:
embeddings = []

for word in emma_sentences[5]:
    embeddings.append(glove[word])

mean_embedding = np.mean(embeddings, axis = 0).reshape(1, -1)
print(mean_embedding)

[[ 3.32575634e-01  3.16596488e-01 -1.80050732e-01 -3.82070951e-01
   4.98493527e-01  5.33804805e-01 -5.46517073e-01  9.12476195e-02
  -1.31538483e-01 -2.71967805e-02  2.99867317e-02  2.64278024e-02
  -2.06519756e-01 -1.54796634e-01  4.28036366e-01 -5.74977317e-02
  -2.65928778e-01  1.60373902e-02 -2.84913561e-01 -2.01252268e-01
  -5.96390732e-02  5.72458220e-01  2.06195927e-01 -1.54312293e-01
   2.52049805e-01 -1.64638200e+00 -3.42686049e-01  1.02592522e-01
   1.42848000e-01 -1.09779902e-01  2.89345488e+00  7.36985634e-02
  -3.73648780e-03 -2.76292784e-01  1.50580049e-01  9.80399951e-02
   2.24408780e-03  2.83664024e-01  3.92979024e-02 -2.98091634e-01
  -1.17309171e-01  2.08815776e-01  6.89953902e-03  2.92777244e-02
   5.54180122e-02 -2.20519707e-01 -2.82007805e-01 -4.34917439e-01
  -9.69051537e-02 -1.67569878e-01]]


In [46]:
# 3 We repeat this process to calculate our TF-IDF
# weighted document vector, but this time, we
# multiply our vectors by their TF-IDF weighting
# before we average them:

embeddings = []

for word in emma_sentences[5]:
    tfidf = TFIDF(emma_sentences[5], word)
    embeddings.append(glove[word] * tfidf)

tfidf_weighted_embedding = np.mean(embeddings, axis = 0).reshape(1, -1)

print(tfidf_weighted_embedding)

[[ 0.03384888  0.04561131 -0.02508487 -0.05546237  0.0651311   0.07019455
  -0.06298467  0.02670422 -0.01072827 -0.00508234  0.00517652  0.00817101
  -0.01604324 -0.01483237  0.04946372 -0.01076198 -0.05021479  0.00040191
  -0.01920397 -0.01341318 -0.01123547  0.08492142  0.02142466 -0.01588025
   0.04405683 -0.17856836 -0.03999452  0.01601948  0.02088402 -0.01340125
   0.2829529   0.00694315  0.00485215 -0.02633143  0.01534283  0.01608815
   0.00316191  0.03238881  0.0082704  -0.04192922 -0.0058766   0.01992215
  -0.00304265 -0.00353939  0.01174628 -0.03416807 -0.02939215 -0.06798914
  -0.00774682 -0.01807456]]


In [47]:
# 4 We can then compare the TF-IDF weighted
# embedding with our average embedding to see
# how similar they are. We can do this using
# cosine similarity from scikit-learn, as follows:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(mean_embedding, tfidf_weighted_embedding)

array([[0.986523]])
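
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal NumPy sketch of that definition, not the library implementation used above; `cosine_sim` is a hypothetical helper name and the vectors are made up.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between the 1-D vectors u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])

# Same direction, different length: similarity is 1 up to rounding
print(cosine_sim(a, 0.1 * a))

# Perpendicular vectors: similarity is 0
print(cosine_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0])))
```

Because the measure depends only on direction, the smaller magnitude of the TF-IDF weighted embedding does not by itself lower the score. Any difference from $1$ comes from the reweighting changing the direction of the sentence vector.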