# **Tutorial:** Text Representation
---
Machines do not understand raw text, so many NLP applications (for example sentiment analysis, text classification, and question answering) first need to convert text into numbers.

In this tutorial, you will use various text representation techniques to convert text to numbers: one-hot encoding, Bag of Words (BoW), $N$-grams, term frequency-inverse document frequency (`tf-idf`), and word embeddings.

At the end of this tutorial, you will be able to:
* explain various text representations in NLP
* convert text to numbers using various text representations

## One-hot encoding
---
You can build your own one-hot encoding algorithm by building a custom vocabulary list from scratch.

Run the following cells to see one-hot encoding in action.

In [None]:
# the sample sentences you will use for this tutorial
S1 = 'Breakfast is at seven in the morning'
S2 = 'What time is breakfast'
S3 = 'I love nlp. nlp is so cool.'
S4 = 'what do you think of NLP?'

corpus = [S1, S2, S3, S4]
corpus

['Breakfast is at seven in the morning',
 'What time is breakfast',
 'I love nlp. nlp is so cool.',
 'what do you think of NLP?']

In [None]:
# simple preprocessing: lowercase, then remove "." and "?"
# The chained string methods run from left to right: sentence.lower() first,
# then .replace(".", ""), and finally .replace("?", "").
# Each sentence is therefore lowercased before its periods and question marks
# are removed, and the result is stored in processed_corpus.

processed_corpus = [sentence.lower().replace(".","").replace("?","") for sentence in corpus]
processed_corpus

['breakfast is at seven in the morning',
 'what time is breakfast',
 'i love nlp nlp is so cool',
 'what do you think of nlp']

The code above removes the full stop `.` and question mark `?` by hand. A more general (but more advanced) way is to use `string.punctuation`, which holds all predefined punctuation characters, and to remove all of them from the string with the `.translate()` method.

In [None]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [None]:
str.maketrans('','', string.punctuation)

{33: None,
 34: None,
 35: None,
 36: None,
 37: None,
 38: None,
 39: None,
 40: None,
 41: None,
 42: None,
 43: None,
 44: None,
 45: None,
 46: None,
 47: None,
 58: None,
 59: None,
 60: None,
 61: None,
 62: None,
 63: None,
 64: None,
 91: None,
 92: None,
 93: None,
 94: None,
 95: None,
 96: None,
 123: None,
 124: None,
 125: None,
 126: None}

In [None]:
"i am a girl...".translate(str.maketrans('','', string.punctuation))

'i am a girl'

**What is the following code doing? Work through it step by step.**

In [None]:
processed_corpus1 = [sentence.lower().translate(str.maketrans('','', string.punctuation)) \
                     for sentence in corpus]
processed_corpus1

['breakfast is at seven in the morning',
 'what time is breakfast',
 'i love nlp nlp is so cool',
 'what do you think of nlp']

Now that you have done basic preprocessing such as lowercasing and removing punctuation, you are ready to move on to the next step: building the vocabulary for one-hot encoding.

In [None]:
# Build the vocabulary
# end result is a vocabulary where each word from the processed sentences is mapped to a unique integer identifier.
vocab = {}
count = 0
for sentence in processed_corpus1:
    for word in sentence.split():
        if word not in vocab:
            count = count+1
            vocab[word] = count
print(vocab)

{'breakfast': 1, 'is': 2, 'at': 3, 'seven': 4, 'in': 5, 'the': 6, 'morning': 7, 'what': 8, 'time': 9, 'i': 10, 'love': 11, 'nlp': 12, 'so': 13, 'cool': 14, 'do': 15, 'you': 16, 'think': 17, 'of': 18}


---
**Task:** ✏️ How many words are there in total in the original corpus? And how many unique words are there in the vocabulary?

> *Type your answer here*

<details>
<summary><font color="red">Click to show solution</font></summary>
    
Total number of words = 24

```python
total_words = ' '.join(processed_corpus1).split()
print(len(total_words))
```

Total number of unique words = 18

```python
unique_words = set(' '.join(processed_corpus1).split())
print(len(unique_words))
```
</details>

In [None]:
test = [0]*18
test

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In *one-hot encoding*, each word $w$ in the corpus vocabulary is given a unique integer ID $w\_id$ between $1$ and $|V|$, where $V$ is the vocabulary of the corpus. Each word is then represented by a $|V|$-dimensional binary vector that is $0$ everywhere except for a single $1$ at the position given by $w\_id$.

In [None]:
# string is a sentence
def get_onehot_vector(string):
    onehot_encoded = []
    for word in string.split():
        temp = [0]*len(vocab)
        if word in vocab:
            # set the entry at the word's position to 1;
            # subtract 1 because list indices start at 0 while the word IDs start at 1
            temp[vocab[word]-1] = 1
        onehot_encoded.append(temp)
    return onehot_encoded

Run the following cell to show the one-hot vectors of a `sample_sentence` from `processed_corpus`.

In [None]:
# now, print one sample sentence and get its one-hot vector
sample_sentence = processed_corpus[1]

print(f'sample_sentence = {sample_sentence}')
print(f'vocab = {vocab}') #vocabulary of the corpus not just sentence itself
get_onehot_vector(sample_sentence)

sample_sentence = what time is breakfast
vocab = {'breakfast': 1, 'is': 2, 'at': 3, 'seven': 4, 'in': 5, 'the': 6, 'morning': 7, 'what': 8, 'time': 9, 'i': 10, 'love': 11, 'nlp': 12, 'so': 13, 'cool': 14, 'do': 15, 'you': 16, 'think': 17, 'of': 18}


[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

In [None]:
# beautify output: put in dataframe
import pandas as pd
pd.DataFrame(get_onehot_vector(sample_sentence), columns=vocab.keys())

Unnamed: 0,breakfast,is,at,seven,in,the,morning,what,time,i,love,nlp,so,cool,do,you,think,of
0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Notice how sparse the one-hot representation is! Most of the elements in the matrix are zeroes (`0`)! In fact, there is only one `1` for each row of the matrix.

Also notice that `sample_sentence` contains 4 words (`"what time is breakfast"`), so the one-hot matrix generated has size `4 x 18`: one row per word in the sentence and one column per word in the vocabulary.

---
**Task:** ✏️ Now, change the `sample_sentence` to be `processed_corpus[3]`, and observe the size of the one-hot vector generated.

In [None]:
# insert your code here



<details>
<summary><font color="red">Click to show solution</font></summary>
    
```python
sample_sentence = processed_corpus[3]
pd.DataFrame(get_onehot_vector(sample_sentence), columns=vocab.keys())
```
The size of the one-hot matrix is now `6 x 18`, because there are 6 words in the sentence and 18 words in the vocabulary.

</details>

---
**Task:** ✏️ Now, change the `sample_sentence` to be `"nlp is so cool but sooo difficult"`, and observe the one-hot vector generated.

<details>
<summary><font color="red">Click to show solution</font></summary>
    
```python
sample_sentence = "nlp is so cool but sooo difficult"
pd.DataFrame(get_onehot_vector(sample_sentence), columns=vocab.keys())
```
The words `"but"`, `"sooo"`, and `"difficult"` are not in the vocabulary, so their rows contain only zeroes.
</details>

---
<font color="blue"> **Pros of One-hot Encoding**</font>
* Easy to implement
* Intuitive to understand

---
<font color="red"> **Cons of One-hot Encoding**</font>
* The size of a one-hot vector is directly proportional to the **size of the vocabulary**. Most real-world corpora have large vocabularies! Imagine this: how many _common_ words are there in the English language?
* Most of the entries are **zeroes**, making the representation **computationally inefficient** to store, compute with, and learn from.
* No notion of similarity (or dissimilarity) between words. For example, consider three words: `"fruit"`, `"fruits"`, `"nlp"`. The one-hot vectors of any two different words are equally far apart, so the distance between `"fruit"` and `"fruits"` is exactly the same as the distance between `"fruit"` and `"nlp"`, even though `"fruit"` and `"fruits"` are very close in meaning!
* **Out-of-vocabulary (OOV) words**: If you try to encode a word that is not in the vocabulary, there is no way to represent it using a one-hot vector.
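
The lack of similarity information can be checked numerically. The sketch below is only an illustration: `toy_vocab`, `one_hot`, `euclidean`, and `cosine` are hypothetical names, and the three-word vocabulary is made up for this example rather than taken from the tutorial's `vocab`.

```python
import math

# Toy vocabulary, for illustration only (not the tutorial's vocab)
toy_vocab = ["fruit", "fruits", "nlp"]

def one_hot(word):
    """Return the one-hot vector of `word` over toy_vocab."""
    return [1 if word == entry else 0 for entry in toy_vocab]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

d_close = euclidean(one_hot("fruit"), one_hot("fruits"))
d_far = euclidean(one_hot("fruit"), one_hot("nlp"))
print(d_close == d_far)  # True: both pairs are equally far apart

s_close = cosine(one_hot("fruit"), one_hot("fruits"))
s_far = cosine(one_hot("fruit"), one_hot("nlp"))
print(s_close, s_far)  # both similarities are 0.0
```

Whichever metric is chosen, the related pair and the unrelated pair come out the same, which is why one-hot vectors carry no information about meaning.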

> <font color="blue">**Due to all these shortcomings, one-hot encoding is seldom used in NLP!**</font>

## Bag of Words (BoW)
---

Bag-of-words (BoW) puts the words of a text in a "bag" and counts how often each word occurs. Word order and context are not taken into account in BoW. For example, the sentences `"dog bites man"` and `"man bites dog"` have the same BoW representation.
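
The idea can be sketched in plain Python before turning to a library. This is a minimal illustration that assumes whitespace tokenisation; `bag_of_words` is a hypothetical helper name, not part of any library.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count how often each lowercased, whitespace-separated word occurs."""
    return Counter(sentence.lower().split())

bow_a = bag_of_words("dog bites man")
bow_b = bag_of_words("man bites dog")

print(bow_a)
print(bow_a == bow_b)  # True: the word order is lost
```

A `Counter` only stores the words that actually occur, whereas a BoW vector has one entry for every word in the vocabulary, most of them zero.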

To convert a text into its BoW representation, you can use `CountVectorizer()` from the `sklearn` library. `CountVectorizer()` counts how often each word occurs in each document of a corpus. Basic preprocessing such as lowercasing and punctuation removal is done automatically.

The result of the `.fit_transform()` method is a matrix with one row per sentence and one column per word in the vocabulary. Each element is the number of times that word occurs in that sentence.

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

test_corpus = ['The quick brown fox jumps over the lazy dog',
          'Mr Brown is very lazy',
          'Mr Brown has a brown dog']

cv = CountVectorizer()
X = cv.fit_transform(test_corpus)
pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())

Unnamed: 0,brown,dog,fox,has,is,jumps,lazy,mr,over,quick,the,very
0,1,1,1,0,0,1,1,0,1,1,2,0
1,1,0,0,0,1,0,1,1,0,0,0,1
2,2,1,0,1,0,0,0,1,0,0,0,0


You can also remove stopwords by specifying `stop_words='english'` as a parameter to `CountVectorizer()`.

In [None]:
cv = CountVectorizer(stop_words='english')
X_BoW = cv.fit_transform(corpus)

print(f'corpus = {corpus}')
pd.DataFrame(X_BoW.toarray(), columns=cv.get_feature_names_out())

corpus = ['Breakfast is at seven in the morning', 'What time is breakfast', 'I love nlp. nlp is so cool.', 'what do you think of NLP?']


Unnamed: 0,breakfast,cool,love,morning,nlp,seven,think,time
0,1,0,0,1,0,1,0,0
1,1,0,0,0,0,0,0,1
2,0,1,1,0,2,0,0,0
3,0,0,0,0,1,0,1,0


---
**Task:** ✏️ In the above corpus, sentences 0 and 1 are both about "breakfast", whereas sentences 2 and 3 are both about "nlp".
* Calculate the Manhattan distance between sentences 0 and 1.
* Calculate the Manhattan distance between sentences 0 and 2.

(The Manhattan distance between two vectors is defined as the sum of the absolute differences between their corresponding elements.)

Does BoW capture the semantic similarity of sentences? (i.e. is $distance(0,1) < distance(0,2)$?)

> *Type your answer here*

<details>
<summary><font color="red">Click to show solution</font></summary>
    
$distance(0,1)=0+0+0+1+0+1+0+1=3$

$distance(0,2)=1+1+1+1+2+1+0+0=7$

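Yes, in this example BoW captures the semantic similarity of the sentences: sentences 0 and 1 share the word "breakfast", so they are closer to each other than sentences 0 and 2, which share no words.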
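
The hand calculation can be double-checked with a few lines of Python. This is a minimal sketch: `manhattan` and `bow_rows` are illustrative names, and the rows are copied by hand from the BoW table above.

```python
def manhattan(a, b):
    """Sum of absolute differences between corresponding elements."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Rows of the BoW matrix (columns: breakfast, cool, love, morning, nlp, seven, think, time)
bow_rows = [
    [1, 0, 0, 1, 0, 1, 0, 0],  # sentence 0
    [1, 0, 0, 0, 0, 0, 0, 1],  # sentence 1
    [0, 1, 1, 0, 2, 0, 0, 0],  # sentence 2
]

d01 = manhattan(bow_rows[0], bow_rows[1])
d02 = manhattan(bow_rows[0], bow_rows[2])
print(d01, d02)  # 3 7
```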
</details>

---
<font color="blue"> **Pros of BoW**</font>
* Easy to implement and understand
* Semantic similarity of documents is captured, provided that the documents use the same words
* Fixed-length encoding for a sentence of any length

---
<font color="red"> **Cons of BoW**</font>
* The size of the vector increases with the size of the vocabulary (similar to one-hot)
* Out-of-vocabulary (OOV) words (similar to one-hot)
* Semantic similarity of documents is not captured, if different words are used (similar to one-hot)
* Word-order information is lost. For example, the sentences `"dog bites man"` and `"man bites dog"` have the same BoW representation



## Bag of $N$-grams (BoN)
---

BoN is very similar to BoW, except that now, you consider $N$ words together rather than one. For example, given the sentence `"James is the best person ever"`, with $N=2$, the bigrams are:

* `James is`
* `is the`
* `the best`
* `best person`
* `person ever`
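
The sliding window behind this list can be sketched in plain Python. The helper name `ngrams` is hypothetical, and the sketch assumes simple whitespace tokenisation, with no lowercasing or stop-word removal.

```python
def ngrams(sentence, n):
    """Return all runs of n consecutive words in the sentence."""
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "James is the best person ever"
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

print(bigrams)   # ['James is', 'is the', 'the best', 'best person', 'person ever']
print(trigrams)  # ['James is the', 'is the best', 'the best person', 'best person ever']
```

A sentence of 6 words yields 5 bigrams and 4 trigrams, since each window starts one word further along.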



In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

test_corpus = ['James is the best person ever']

# construct bigrams
cv = CountVectorizer(ngram_range=(2,2))
X = cv.fit_transform(test_corpus)

print(f'test_corpus = {test_corpus}')
pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())

test_corpus = ['James is the best person ever']


Unnamed: 0,best person,is the,james is,person ever,the best
0,1,1,1,1,1


In [None]:
test_corpus = ['James is the best person ever']

# construct trigrams
cv = CountVectorizer(ngram_range=(3,3))
X = cv.fit_transform(test_corpus)

print(f'test_corpus = {test_corpus}')
pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())

test_corpus = ['James is the best person ever']


Unnamed: 0,best person ever,is the best,james is the,the best person
0,1,1,1,1


In [None]:
# construct bigrams for the 4-sentence corpus
cv = CountVectorizer(stop_words='english', ngram_range=(2,2))
X = cv.fit_transform(corpus)

print(f'corpus = {corpus}')
pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())

corpus = ['Breakfast is at seven in the morning', 'What time is breakfast', 'I love nlp. nlp is so cool.', 'what do you think of NLP?']


Unnamed: 0,breakfast seven,love nlp,nlp cool,nlp nlp,seven morning,think nlp,time breakfast
0,1,0,0,0,1,0,0
1,0,0,0,0,0,0,1
2,0,1,1,1,0,0,0
3,0,0,0,0,0,1,0


In [None]:
# construct uni, bi, trigrams
cv = CountVectorizer(stop_words='english', ngram_range=(1,3))
X = cv.fit_transform(corpus)

print(f'corpus = {corpus}')
pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())

corpus = ['Breakfast is at seven in the morning', 'What time is breakfast', 'I love nlp. nlp is so cool.', 'what do you think of NLP?']


Unnamed: 0,breakfast,breakfast seven,breakfast seven morning,cool,love,love nlp,love nlp nlp,morning,nlp,nlp cool,nlp nlp,nlp nlp cool,seven,seven morning,think,think nlp,time,time breakfast
0,1,1,1,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1
2,0,0,0,1,1,1,1,0,2,1,1,1,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0


In [None]:
# unigram (1-gram) = BoW!
cv = CountVectorizer(stop_words='english', ngram_range=(1,1))
X = cv.fit_transform(corpus)

print(f'corpus = {corpus}')
pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())

corpus = ['Breakfast is at seven in the morning', 'What time is breakfast', 'I love nlp. nlp is so cool.', 'what do you think of NLP?']


Unnamed: 0,breakfast,cool,love,morning,nlp,seven,think,time
0,1,0,0,1,0,1,0,0
1,1,0,0,0,0,0,0,1
2,0,1,1,0,2,0,0,0
3,0,0,0,0,1,0,1,0


---
<font color="blue"> **Pros of BoN**</font>
* Capture some context and word-order information (in the form of $N$-grams)
* Capture some semantic similarity. Documents having the same $N$-grams will have their vectors closer to each other as compared to documents with completely different $N$-grams

---
<font color="red"> **Cons of BoN**</font>
* As $N$ increases, dimensionality (and therefore sparsity) also increases rapidly
* Out-of-vocabulary (OOV) words (similar to one-hot and BoW)



## TF-IDF
---

The intuition behind TF-IDF is that the weight assigned to each word depends not only on the word's frequency in a document, but also on how frequent that word is across the entire corpus.

TF-IDF starts from the word counts produced by `CountVectorizer()` (the term frequency, TF) and multiplies each count by the word's inverse document frequency (IDF) score. **Words that appear in many documents of the corpus (like stopwords) receive a low weight**, whereas **words that are frequent in a document but relatively rare in the corpus receive a high TF-IDF score, indicating their significance in that document.**
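
The arithmetic can be sketched by hand. The example below assumes the default settings of scikit-learn's `TfidfVectorizer` (smoothed IDF, $\text{idf}(t) = \ln\frac{1+n}{1+\text{df}(t)} + 1$, followed by L2 normalisation of each row). The token lists are written out by hand as the words left after lowercasing and English stop-word removal, and `idf` and `tfidf_row` are hypothetical helper names.

```python
import math

# Tokens per sentence after lowercasing and stop-word removal
# (written out by hand; an assumption for illustration)
docs = [
    ["breakfast", "seven", "morning"],
    ["time", "breakfast"],
    ["love", "nlp", "nlp", "cool"],
    ["think", "nlp"],
]
n_docs = len(docs)

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    """TF-IDF weights of one document, scaled to unit L2 norm."""
    raw = {term: doc.count(term) * idf(term) for term in set(doc)}
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}

row0 = tfidf_row(docs[0])
for term in sorted(row0):
    print(term, round(row0[term], 6))
```

In the first sentence, `breakfast` also occurs in the second sentence, so it ends up with a lower weight (about 0.487) than `morning` and `seven` (about 0.618 each), which occur in no other sentence.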

Next, you will use `TfidfVectorizer()` from the `sklearn` library to turn text into `tfidf` vectors.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf= TfidfVectorizer(stop_words='english')
X_tfidf = tfidf.fit_transform(corpus)

print(f'corpus = {corpus}')
pd.DataFrame(X_tfidf.toarray(), columns=tfidf.get_feature_names_out())

corpus = ['Breakfast is at seven in the morning', 'What time is breakfast', 'I love nlp. nlp is so cool.', 'what do you think of NLP?']


Unnamed: 0,breakfast,cool,love,morning,nlp,seven,think,time
0,0.486934,0.0,0.0,0.617614,0.0,0.617614,0.0,0.0
1,0.61913,0.0,0.0,0.0,0.0,0.0,0.0,0.785288
2,0.0,0.47212,0.47212,0.0,0.74445,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.61913,0.0,0.785288,0.0


There is the `tfidf` matrix! While it is a very useful matrix to indicate how important and unique a word is in each document, it is not very _interesting_. You can make the output more interpretable by printing out the most important words in each document.

Run the following cell to observe how cool `tfidf` is.

In [None]:
df = pd.DataFrame(X_tfidf.toarray(), columns = tfidf.get_feature_names_out())
data = df.transpose()
data.columns = ['document_0', 'document_1', 'document_2', 'document_3']

In [None]:
data

Unnamed: 0,document_0,document_1,document_2,document_3
breakfast,0.486934,0.61913,0.0,0.0
cool,0.0,0.0,0.47212,0.0
love,0.0,0.0,0.47212,0.0
morning,0.617614,0.0,0.0,0.0
nlp,0.0,0.0,0.74445,0.61913
seven,0.617614,0.0,0.0,0.0
think,0.0,0.0,0.0,0.785288
time,0.0,0.785288,0.0,0.0


In [None]:
# Inspect the 5 highest-scoring words of each document.
# `top` is overwritten on every iteration, so after the loop it holds document_3.
for c in range(4):
    top = data.iloc[:,c].sort_values(ascending=False).head(5)



In [None]:
top

Unnamed: 0,document_3
think,0.785288
nlp,0.61913
breakfast,0.0
cool,0.0
love,0.0


In [None]:
df = pd.DataFrame(X_tfidf.toarray(), columns = tfidf.get_feature_names_out())
data = df.transpose()
data.columns = ['document_0', 'document_1', 'document_2', 'document_3']

# Keep the 5 highest-scoring words of each document as (word, score) pairs
top_dict = {}
for c in range(4):
    sorted_data = data.iloc[:,c].sort_values(ascending=False).head(5)
    top_dict[data.columns[c]]= list(zip(sorted_data.index, sorted_data.values))

print(f'corpus = {corpus}')
# Print the top 2 words of each document
for document, top_words in top_dict.items():
    print(f'{document}: {", ".join([word for word, score in top_words[0:2]])}')

corpus = ['Breakfast is at seven in the morning', 'What time is breakfast', 'I love nlp. nlp is so cool.', 'what do you think of NLP?']
document_0: morning, seven
document_1: time, breakfast
document_2: nlp, cool
document_3: think, nlp


---
<font color="blue"> **Pros of TF-IDF**</font>
* Captures some similarity between documents: documents that share important words have similar vectors
* Provides some information on how **important** each word is to the respective document and how **unique** each word is relative to its frequency in the entire corpus

---
<font color="red"> **Cons of TF-IDF**</font>
* **Curse of dimensionality**: As vocab size increases, dimensionality (and therefore sparsity) also increases rapidly (similar to one-hot and BoW)
* **Out-of-vocabulary (OOV)** words (similar to one-hot and BoW)



## Cosine similarity
---



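### Calculating cosine similarity

How does cosine similarity actually work? It measures how similar two vectors are by taking the cosine of the angle between them: $\cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|}$. For vectors of word counts or `tfidf` weights, a value of $1$ means the two vectors point in the same direction, and a value of $0$ means the two documents have no words in common.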

![](https://drive.google.com/uc?export=view&id=1cehvtx7LKuFeq_LqfnLi-gzIz1D1wSf9)
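
The cosine formula (dot product divided by the product of the vector lengths) can be written out directly. This is a minimal sketch: `cosine_similarity_manual` is a hypothetical helper, not the `sklearn` function, and the vectors are the BoW rows of sentences 0, 1, and 2 copied by hand.

```python
import math

def cosine_similarity_manual(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# BoW rows (columns: breakfast, cool, love, morning, nlp, seven, think, time)
bow_0 = [1, 0, 0, 1, 0, 1, 0, 0]
bow_1 = [1, 0, 0, 0, 0, 0, 0, 1]
bow_2 = [0, 1, 1, 0, 2, 0, 0, 0]

sim_01 = cosine_similarity_manual(bow_0, bow_1)
sim_02 = cosine_similarity_manual(bow_0, bow_2)
print(round(sim_01, 4))  # 0.4082
print(sim_02)            # 0.0
```

Sentences 0 and 1 share one word, so their similarity is above zero, while sentences 0 and 2 share none and score exactly zero.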

In [None]:
# import necessary libraries
from itertools import combinations
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# display corpus
corpus

['Breakfast is at seven in the morning',
 'What time is breakfast',
 'I love nlp. nlp is so cool.',
 'what do you think of NLP?']

In [None]:
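# list all pairs of sentences in the corpus
# (note: the first sentence of each pair is taken from processed_corpus and the
# second from corpus, so the two are formatted differently in the output)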
pairs = list(combinations(range(len(corpus)),2))
combos = [(processed_corpus[a_index],corpus[b_index]) \
          for (a_index,b_index) in pairs]
combos

[('breakfast is at seven in the morning', 'What time is breakfast'),
 ('breakfast is at seven in the morning', 'I love nlp. nlp is so cool.'),
 ('breakfast is at seven in the morning', 'what do you think of NLP?'),
 ('what time is breakfast', 'I love nlp. nlp is so cool.'),
 ('what time is breakfast', 'what do you think of NLP?'),
 ('i love nlp nlp is so cool', 'what do you think of NLP?')]

In [None]:
# recap the earlier BoW representation
pd.DataFrame(X_BoW.toarray(), columns=cv.get_feature_names_out())

Unnamed: 0,breakfast,cool,love,morning,nlp,seven,think,time
0,1,0,0,1,0,1,0,0
1,1,0,0,0,0,0,0,1
2,0,1,1,0,2,0,0,0
3,0,0,0,0,1,0,1,0


In [None]:
# calculate the cosine similarity for all pairs of phrases and sort by most similar (using BoW)
results = [cosine_similarity(X_BoW[a_index],X_BoW[b_index]) \
           for (a_index,b_index) in pairs]
sorted(zip(results,combos),reverse=True)

[(array([[0.57735027]]),
  ('i love nlp nlp is so cool', 'what do you think of NLP?')),
 (array([[0.40824829]]),
  ('breakfast is at seven in the morning', 'What time is breakfast')),
 (array([[0.]]), ('what time is breakfast', 'what do you think of NLP?')),
 (array([[0.]]), ('what time is breakfast', 'I love nlp. nlp is so cool.')),
 (array([[0.]]),
  ('breakfast is at seven in the morning', 'what do you think of NLP?')),
 (array([[0.]]),
  ('breakfast is at seven in the morning', 'I love nlp. nlp is so cool.'))]

In [None]:
# recap the earlier Tf-idf representation (feature names come from the fitted tfidf vectorizer)
pd.DataFrame(X_tfidf.toarray(), columns=tfidf.get_feature_names_out())

Unnamed: 0,breakfast,cool,love,morning,nlp,seven,think,time
0,0.486934,0.0,0.0,0.617614,0.0,0.617614,0.0,0.0
1,0.61913,0.0,0.0,0.0,0.0,0.0,0.0,0.785288
2,0.0,0.47212,0.47212,0.0,0.74445,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.61913,0.0,0.785288,0.0


In [None]:
# calculate the cosine similarity for all pairs of phrases and sort by most similar (using Tf-idf)
results = [cosine_similarity(X_tfidf[a_index],X_tfidf[b_index]) \
           for (a_index, b_index) in pairs]
sorted(zip(results, combos), reverse=True)

[(array([[0.46091137]]),
  ('i love nlp nlp is so cool', 'what do you think of NLP?')),
 (array([[0.30147576]]),
  ('breakfast is at seven in the morning', 'What time is breakfast')),
 (array([[0.]]), ('what time is breakfast', 'what do you think of NLP?')),
 (array([[0.]]), ('what time is breakfast', 'I love nlp. nlp is so cool.')),
 (array([[0.]]),
  ('breakfast is at seven in the morning', 'what do you think of NLP?')),
 (array([[0.]]),
  ('breakfast is at seven in the morning', 'I love nlp. nlp is so cool.'))]

## Word embeddings
---

There are two ways to use word embeddings:
* You can choose to train your own embedding from scratch using your own dataset, or
* You can make use of *pretrained* word embeddings

Training your own embedding from scratch can be expensive and time-consuming. It takes a fast computer with a lot of RAM and disk space, and some expertise in preprocessing the input data and in deep learning, as many word embeddings use a neural network model to learn word associations.

For a start, you will begin by exploring the pretrained word embeddings available in `gensim`. Using pretrained word embeddings is an example of **transfer learning**: you make use of learnings done earlier on a large dataset, and use the trained representation on new tasks.

In [None]:
import gensim.downloader as api

# display all pretrained word embeddings
list(api.info()['models'])

['fasttext-wiki-news-subwords-300',
 'conceptnet-numberbatch-17-06-300',
 'word2vec-ruscorpora-300',
 'word2vec-google-news-300',
 'glove-wiki-gigaword-50',
 'glove-wiki-gigaword-100',
 'glove-wiki-gigaword-200',
 'glove-wiki-gigaword-300',
 'glove-twitter-25',
 'glove-twitter-50',
 'glove-twitter-100',
 'glove-twitter-200',
 '__testing_word2vec-matrix-synopsis']

In [None]:
# Display all pretrained word embeddings, with detailed information
api.info()

{'corpora': {'semeval-2016-2017-task3-subtaskBC': {'num_records': -1,
   'record_format': 'dict',
   'file_size': 6344358,
   'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/semeval-2016-2017-task3-subtaskB-eng/__init__.py',
   'license': 'All files released for the task are free for general research use',
   'fields': {'2016-train': ['...'],
    '2016-dev': ['...'],
    '2017-test': ['...'],
    '2016-test': ['...']},
   'description': 'SemEval 2016 / 2017 Task 3 Subtask B and C datasets contain train+development (317 original questions, 3,169 related questions, and 31,690 comments), and test datasets in English. The description of the tasks and the collected data is given in sections 3 and 4.1 of the task paper http://alt.qcri.org/semeval2016/task3/data/uploads/semeval2016-task3-report.pdf linked in section “Papers” of https://github.com/RaRe-Technologies/gensim-data/issues/18.',
   'checksum': '701ea67acd82e75f95e1d8e62fb0ad29',
   'file_name': 'se

Now, load in a pretrained model called `'glove-wiki-gigaword-100'`.

For more technical information on **GloVe** (Global Vectors for Word Representation), including the math behind it and how it is trained, you can watch this short [video](https://www.coursera.org/lecture/nlp-sequence-models/glove-word-vectors-IxDTG) by Andrew Ng. [Optional]

In [None]:
model_name = "glove-wiki-gigaword-100"
model = api.load(model_name)

# display the length of vocabulary in the pretrained model
print(f'Length of vocab in {model_name} is {len(model.key_to_index)}')


Length of vocab in glove-wiki-gigaword-100 is 400000


In [None]:
# print 10 most similar words
similar_beautiful = model.most_similar("beautiful")
similar_beautiful

[('lovely', 0.8909062743186951),
 ('gorgeous', 0.8721614480018616),
 ('wonderful', 0.8080509305000305),
 ('charming', 0.771931529045105),
 ('magnificent', 0.7331939339637756),
 ('elegant', 0.717603862285614),
 ('fabulous', 0.6914447546005249),
 ('splendid', 0.685012936592102),
 ('perfect', 0.6778431534767151),
 ('pretty', 0.6774278879165649)]

In [None]:
# display the vector of the word "beautiful"
model["beautiful"]

array([-0.18173 ,  0.49759 ,  0.46326 ,  0.22507 ,  0.46379 ,  0.70062 ,
       -0.55155 ,  0.79148 , -0.18582 ,  0.19755 ,  0.19881 ,  0.09037 ,
        0.02684 ,  0.036921,  0.25217 ,  0.30879 ,  0.33164 ,  0.2714  ,
       -0.12808 ,  1.1721  , -0.072969,  0.34904 ,  0.11161 , -0.36056 ,
        0.59628 ,  0.42417 , -0.69904 , -0.19768 , -0.35599 , -0.23141 ,
       -0.38503 , -0.12665 ,  0.77121 , -0.37397 ,  0.59642 , -0.24416 ,
       -0.25387 , -0.065911,  0.21035 , -0.83429 ,  0.28604 , -0.022707,
        0.06746 ,  0.088804,  0.23424 ,  0.20475 ,  0.085396,  0.55393 ,
        0.34153 , -0.095455, -0.19291 , -0.55262 ,  1.0229  ,  0.3866  ,
       -0.24254 , -2.3519  ,  0.43561 ,  1.1172  ,  0.77358 , -0.73769 ,
       -0.35302 ,  1.6699  , -0.63955 , -0.39244 ,  0.56454 , -0.27873 ,
        0.9252  , -0.13997 , -0.096213, -1.1242  ,  0.49031 ,  0.36918 ,
        0.41195 , -0.038159,  0.84123 ,  0.24619 ,  0.081767,  0.07483 ,
        0.44646 , -0.19423 ,  0.013369,  0.37712 , 

In [None]:
len(model["beautiful"])

100

In [None]:
# try with another word!
similar_computer = model.most_similar("computer")
print(f'The top 10 words similar to the word "computer" are {similar_computer}')
print(f'The vector representation of the word "computer" is \n {model["computer"]}')

The top 10 words similar to the word "computer" are [('computers', 0.8751984238624573), ('software', 0.8373122215270996), ('technology', 0.7642159461975098), ('pc', 0.7366448640823364), ('hardware', 0.7290390729904175), ('internet', 0.7286775708198547), ('desktop', 0.7234441637992859), ('electronic', 0.7221828699111938), ('systems', 0.7197922468185425), ('computing', 0.7141730785369873)]
The vector representation of the word "computer" is 
 [-1.6298e-01  3.0141e-01  5.7978e-01  6.6548e-02  4.5835e-01 -1.5329e-01
  4.3258e-01 -8.9215e-01  5.7747e-01  3.6375e-01  5.6524e-01 -5.6281e-01
  3.5659e-01 -3.6096e-01 -9.9662e-02  5.2753e-01  3.8839e-01  9.6185e-01
  1.8841e-01  3.0741e-01 -8.7842e-01 -3.2442e-01  1.1202e+00  7.5126e-02
  4.2661e-01 -6.0651e-01 -1.3893e-01  4.7862e-02 -4.5158e-01  9.3723e-02
  1.7463e-01  1.0962e+00 -1.0044e+00  6.3889e-02  3.8002e-01  2.1109e-01
 -6.6247e-01 -4.0736e-01  8.9442e-01 -6.0974e-01 -1.8577e-01 -1.9913e-01
 -6.9226e-01 -3.1806e-01 -7.8565e-01  2.3831

In [None]:
# what happens if you try to look for a word that does not exist in the vocabulary?
# the lookup raises a KeyError, which is caught here
try:
    model["hyperparameter"]
except KeyError:
    print("The word 'hyperparameter' is not found in the vocabulary")

The word 'hyperparameter' is not found in the vocabulary


You can also ask the pre-trained model to answer this question:

> $King - man + woman = \rule{2cm}{0.15mm}$

In [None]:
result = model.most_similar(positive=['woman','king'], negative=['man'])
result[0]

('queen', 0.7698540687561035)

Or answer this question:

> Jakarta is to Indonesia as $\rule{2cm}{0.15mm}$ is to the Philippines.

In [None]:
result = model.most_similar(positive=['jakarta','philippines'], negative=['indonesia'])
result[0]

('manila', 0.8788260221481323)

So far, you have seen how easy it is to use ***pretrained* word embeddings**. You can also fine-tune the pretrained word embeddings for your own applications. This is called **transfer learning**, and it is one of the reasons why word embeddings are so popular.

Other than using pretrained word embeddings, you can also choose to train your own word embedding from scratch, using your own custom dataset. You will learn this in the subsequent lessons.

## Transformer models (Dynamic Word Embeddings)
---

In [None]:
pip install transformers



In [None]:
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')




In [None]:
text = "This is an example sentence."
tokens = tokenizer.tokenize(text)
print(tokens)

['this', 'is', 'an', 'example', 'sentence', '.']


In [None]:
import torch

input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)

[2023, 2003, 2019, 2742, 6251, 1012]


In [None]:
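# Convert inputs to PyTorch tensors (a batch containing one sentence)
# Note: the tokens were converted by hand above, so the special [CLS] and [SEP]
# tokens that BERT normally adds around a sentence are not included here.
input_ids = torch.tensor([input_ids])
attention_mask = torch.ones_like(input_ids)

# Generate embeddings without tracking gradients
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
    embeddings = outputs[0]  # last hidden state: shape (batch, tokens, hidden size)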

embeddings

tensor([[[-1.8824e-01, -7.6451e-04,  1.0336e-01,  ..., -2.0809e-01,
          -3.9280e-01,  7.9072e-01],
         [-4.1441e-01, -1.7607e-01,  5.4727e-02,  ..., -1.0659e-01,
          -3.8406e-01,  7.9451e-01],
         [-5.5160e-01,  1.7656e-01,  2.4592e-01,  ...,  1.5594e-01,
          -5.1121e-01,  1.3524e+00],
         [-2.6705e-01,  2.0308e-01, -3.6436e-02,  ..., -9.6218e-02,
          -5.8836e-01,  6.6819e-01],
         [-1.7557e-01,  1.8462e-01,  3.5970e-02,  ..., -1.4965e-01,
          -3.1363e-01,  6.9866e-01],
         [-1.6752e-01, -3.2122e-01,  7.4659e-02,  ..., -2.1811e-01,
          -3.7288e-01,  7.0560e-01]]])

In [None]:
print(len(embeddings[0][0]))

768


Note: Each token is represented by an embedding vector with 768 dimensions.

## <font color="blue">**Conclusion**</font>
---
Congratulations! You have learnt how to represent text using numbers, with techniques such as one-hot encoding, Bag of Words (BoW), $N$-grams, term frequency-inverse document frequency (`tf-idf`), and word embeddings.

Text representation, together with text preprocessing, are the fundamentals of NLP. By now, you have mastered them to a certain extent. Give yourself a pat on the back! That is not an easy feat!

For the rest of this module, you will learn the various applications of NLP, such as sentiment analysis, text classification, text summarization, etc. You will also learn more advanced techniques such as training your own word embeddings, etc.

See you in the next lesson!