# **Deep Natural Language Processing @ PoliTO**

---


**Teaching Assistant:** Lorenzo Vaiani

**Credits:** Moreno La Quatra

**Practice 2:** Word and Sentence Embeddings

## Word Embedding

![](https://qph.fs.quoracdn.net/main-qimg-3e812fd164a08f5e4f195000fecf988f)


**Key takeaways** from lessons and in-class practices:
- Word embeddings are able to map words into a semantic-aware vector space
- There are multiple architectures for the generation of word embeddings
- Each architecture has its advantages and disadvantages
- Word embedding evaluation can be intrinsic (intermediate tasks) or extrinsic (downstream task)
- It is possible to use pre-trained word embedding models or to train one from scratch on a large amount of text
- Using pre-trained word embedding models is common practice in NLP and removes the need to train a model from scratch (which can be very time-consuming and computationally expensive)


### **Question 1**

Train a new Word2Vec model using gensim with the text8 corpus available in the python package ([reference](https://radimrehurek.com/gensim/downloader.html)). Compute the training time for the model and store it for subsequent steps.

**Hint:** you can use the following code to load the text8 corpus:

```python
import gensim.downloader as api
from gensim.models import Word2Vec
import time
dataset = api.load("text8")
```

In [1]:
! pip install --upgrade gensim

Collecting gensim
  Downloading gensim-4.3.2-cp311-cp311-macosx_11_0_arm64.whl.metadata (8.3 kB)
Collecting numpy>=1.18.5 (from gensim)
  Using cached numpy-1.26.1-cp311-cp311-macosx_11_0_arm64.whl.metadata (115 kB)
Collecting scipy>=1.7.0 (from gensim)
  Using cached scipy-1.11.3-cp311-cp311-macosx_12_0_arm64.whl.metadata (165 kB)
Collecting smart-open>=1.8.1 (from gensim)
  Downloading smart_open-6.4.0-py3-none-any.whl.metadata (21 kB)
Downloading gensim-4.3.2-cp311-cp311-macosx_11_0_arm64.whl (24.0 MB)
Using cached numpy-1.26.1-cp311-cp311-macosx_11_0_arm64.whl (14.0 MB)
Using cached scipy-1.11.3-cp311-cp311-macosx_12_0_arm64.whl (29.7 MB)
Downloading smart_open-6.4.0-py3-none-any.whl (57 kB)
Installing collected packages

In [24]:
# your code here

import gensim.downloader as api
from gensim.models import Word2Vec
import time
dataset = api.load("text8")


start_time = time.time()
model_W2V = Word2Vec(sentences=dataset, vector_size=100, window=5, min_count=5, workers=4) # Train Word2Vec model
end_time = time.time()

training_time = end_time - start_time
print(f"Word2Vec model trained in {training_time} seconds.")

Word2Vec model trained in 30.35967183113098 seconds.
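
Since the training time has to be kept for later comparisons, one option is to record each measurement under a model name instead of reusing a single variable. The following is a minimal sketch, assuming a hypothetical `timed` helper and `training_times` dictionary that are not part of the original notebook. It uses `time.perf_counter`, a monotonic clock intended for measuring intervals, and a stand-in workload in place of the actual training call.

```python
import time

training_times = {}  # hypothetical store: model name -> seconds


def timed(name, train_fn):
    """Run train_fn(), record its duration under `name`, and return its result."""
    start = time.perf_counter()
    result = train_fn()
    training_times[name] = time.perf_counter() - start
    return result


# Stand-in workload; in the notebook this could wrap the training call, e.g.
# lambda: Word2Vec(sentences=dataset, vector_size=100, window=5, min_count=5, workers=4)
total = timed("dummy", lambda: sum(range(100_000)))
print(f"dummy finished in {training_times['dummy']:.4f} seconds")
```

With this layout, the Word2Vec, FastText and Doc2Vec timings can be compared at the end without any of them being overwritten.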


### **Question 2**:
Perform **intrinsic** evaluation of the model for the task of word analogy by exploiting the data collection available [here](https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P2/google_analogies.csv).

1. read CSV file
2. group analogy entries by type (column: `type`)
3. for each type entry (**in the lab, just set type="family"** to reduce the required time) use the first 3 word vectors to compute the fourth
    - Entry: `Athens,Greece,Baghdad,Iraq`
    - `v(Greece) - v(Athens) + v(Baghdad) = res_v`
    - Get the most similar vectors to `res_v`
    - Compute in how many cases the correct word is among the top K (if `v[Iraq]` is among the K most similar words) with `K = 1, 3, 5, 10`

$top(k) = \dfrac{\sum_{i=1}^{|E|} f(i)}{|E|}$

where $f(i) = 1$ if the target word is among the top k and $f(i) = 0$ otherwise.

$|E|$ is the total number of entries for the considered type.
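
The metric above can be checked by hand on a tiny example. The sketch below uses hypothetical 2-D vectors chosen purely for illustration (they do not come from any trained model) and plain Python instead of gensim: it builds `res_v`, ranks all words by cosine similarity, and averages the hits over the entries.

```python
import math

# Hypothetical 2-D toy vectors, chosen by hand for illustration only.
vectors = {
    "man": (1.0, 0.0),
    "woman": (1.0, 1.0),
    "king": (3.0, 0.0),
    "queen": (3.0, 1.0),
    "apple": (0.0, -2.0),
}


def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k_hit(w1, w2, w3, target, k):
    """f(i): 1 if target is among the k nearest words to v(w2) - v(w1) + v(w3)."""
    res_v = tuple(b - a + c for a, b, c in zip(vectors[w1], vectors[w2], vectors[w3]))
    ranked = sorted(vectors, key=lambda w: cosine(vectors[w], res_v), reverse=True)
    return int(target in ranked[:k])


entries = [("man", "woman", "king", "queen"), ("king", "queen", "man", "woman")]
for k in (1, 3):
    top_k = sum(top_k_hit(*e, k) for e in entries) / len(entries)
    print(f"top({k}) = {top_k:.2f}")
```

Note that the ranking here does not exclude the three query words, which mirrors what happens when the nearest neighbours are searched by vector rather than by word.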

**Notes:**
1. Try the model trained on `text8`. Is there any issue? If so, how can you solve it?
2. Test the model trained on Google News available in gensim.

In [4]:
%%capture
! wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P2/google_analogies.csv
! pip install --upgrade pandas

In [5]:
# Executing this cell could take ~5 minutes
import gensim.downloader
w2v_google_news_model = gensim.downloader.load('word2vec-google-news-300')



In [27]:
# your code here
import pandas as pd
data_analogies = pd.read_csv("google_analogies.csv")


In [17]:
grouped_analogies = data_analogies.groupby('type')

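def calculate_analogy_accuracy(type_group, model, k_values):
    """Return the top-k accuracy on the analogy entries for each k in k_values.

    Entries whose first three words include an out-of-vocabulary word are skipped
    but still counted in the denominator, so they lower the reported accuracy.
    similar_by_vector does not exclude the three query words from the ranking.
    """
    total_entries = len(type_group)
    correct_counts = {k: 0 for k in k_values}

    for _, entry in type_group.iterrows():
        words = entry[['word1', 'word2', 'word3', 'target']].values
        try:
            # v(word2) - v(word1) + v(word3) should land close to v(target)
            analogy_vector = model[words[1]] - model[words[0]] + model[words[2]]
            similar_words = model.similar_by_vector(analogy_vector, topn=max(k_values))

            for k in k_values:
                correct_counts[k] += int(words[3] in [word[0] for word in similar_words[:k]])

        except KeyError:
            print(f"Skipping entry {words} due to missing word in the vocabulary.")

    accuracy_results = {k: count / total_entries for k, count in correct_counts.items()}
    return accuracy_results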



type_group = grouped_analogies.get_group('family')


k_values = [1, 3, 5, 10]
accuracy_results = calculate_analogy_accuracy(type_group, model_W2V.wv, k_values)
accuracy_results_google = calculate_analogy_accuracy(type_group, w2v_google_news_model , k_values)

for k, accuracy in accuracy_results.items():
    print(f"Accuracy for k = {k} and W2V trained on text8: {accuracy * 100:.2f}%")

for k, accuracy in accuracy_results_google.items():
    print(f"Accuracy for k = {k} and W2V pretrained on google news: {accuracy * 100:.2f}%")


Skipping entry ['boy' 'girl' 'stepbrother' 'stepsister'] due to missing word in the vocabulary.
Skipping entry ['brother' 'sister' 'stepbrother' 'stepsister'] due to missing word in the vocabulary.
Skipping entry ['brothers' 'sisters' 'stepbrother' 'stepsister'] due to missing word in the vocabulary.
Skipping entry ['dad' 'mom' 'stepbrother' 'stepsister'] due to missing word in the vocabulary.
Skipping entry ['father' 'mother' 'stepbrother' 'stepsister'] due to missing word in the vocabulary.
Skipping entry ['grandfather' 'grandmother' 'stepbrother' 'stepsister'] due to missing word in the vocabulary.
Skipping entry ['grandpa' 'grandma' 'stepbrother' 'stepsister'] due to missing word in the vocabulary.
Skipping entry ['grandson' 'granddaughter' 'stepbrother' 'stepsister'] due to missing word in the vocabulary.
Skipping entry ['groom' 'bride' 'stepbrother' 'stepsister'] due to missing word in the vocabulary.
Skipping entry ['he' 'she' 'stepbrother' 'stepsister'] due to missing word in t

Yes, there is an issue with the model trained on text8: some words of the analogy entries are missing from its vocabulary, so the vector lookup raises a `KeyError`. To get past this, we wrap the lookup in a try-except clause inside the function that computes the analogy accuracy.
The model pretrained on Google News performs better for every value of k.

Accuracy for k = 1 and W2V trained on text8: 5.14%
Accuracy for k = 3 and W2V trained on text8: 51.98%
Accuracy for k = 5 and W2V trained on text8: 58.50%
Accuracy for k = 10 and W2V trained on text8: 62.25%
Accuracy for k = 1 and W2V pretrained on google news: 34.98%
Accuracy for k = 3 and W2V pretrained on google news: 87.94%
Accuracy for k = 5 and W2V pretrained on google news: 93.68%
Accuracy for k = 10 and W2V pretrained on google news: 97.23%
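
Because the evaluation function keeps skipped entries in the denominator, out-of-vocabulary entries count as misses. An alternative is to report vocabulary coverage separately from accuracy. Below is a minimal sketch under that assumption, using a hypothetical toy vocabulary and a hypothetical `split_by_coverage` helper; with a gensim `KeyedVectors` object the membership test would be `word in model` instead of a lookup in a Python set.

```python
def split_by_coverage(entries, vocabulary):
    """Separate analogy entries whose four words are all in `vocabulary` from the rest."""
    covered, skipped = [], []
    for entry in entries:
        (covered if all(w in vocabulary for w in entry) else skipped).append(entry)
    return covered, skipped


# Hypothetical toy vocabulary and entries, for illustration only.
vocabulary = {"boy", "girl", "brother", "sister", "father", "mother"}
entries = [
    ("boy", "girl", "brother", "sister"),
    ("boy", "girl", "stepbrother", "stepsister"),
    ("father", "mother", "brother", "sister"),
]

covered, skipped = split_by_coverage(entries, vocabulary)
coverage = len(covered) / len(entries)
print(f"coverage: {coverage:.2%} ({len(skipped)} entries skipped)")
```

Accuracy can then be computed on `covered` only, and reported together with the coverage so that models with different vocabularies remain comparable.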

### **Question 3:**

Train a new FastText model using gensim with text8 corpus available in the python package ([reference](https://radimrehurek.com/gensim/downloader.html)). Compute the training time for the model and store it for subsequent steps.

- Is there any significant difference in training time if compared with Word2Vec training?

In [30]:
# your code here
from gensim.models import FastText
start_time = time.time()
model_FastText = FastText(sentences=dataset, vector_size=100, window=5, min_count=5, workers=4) # Train FastText model
end_time = time.time()

training_time = end_time - start_time
print(f"FastText model trained in {training_time} seconds.")

FastText model trained in 131.47197914123535 seconds.


### **Question 4:**
Provide the same evaluation done in Question 2 for the FastText model. In this case, you can use the same type of analogy (family) and the same K values.

**Notes:**
- Try the model trained on `text8`. Is there any issue? What does it mean?
- Test the model trained on Wikipedia+News available in gensim.

In [21]:
# Executing this cell could take ~5 minutes
import gensim.downloader
ft_wiki_news_model = gensim.downloader.load('fasttext-wiki-news-subwords-300')



In [23]:
# your code here
accuracy_results = calculate_analogy_accuracy(type_group, model_FastText.wv , k_values)
accuracy_results_wiki = calculate_analogy_accuracy(type_group, ft_wiki_news_model , k_values)

for k, accuracy in accuracy_results.items():
    print(f"Accuracy for k = {k} and FastText trained on text8: {accuracy * 100:.2f}%")

for k, accuracy in accuracy_results_wiki.items():
    print(f"Accuracy for k = {k} and FastText pretrained on wikipedia and news: {accuracy * 100:.2f}%")

Accuracy for k = 1 and FastText trained on text8: 2.37%
Accuracy for k = 3 and FastText trained on text8: 32.41%
Accuracy for k = 5 and FastText trained on text8: 39.92%
Accuracy for k = 10 and FastText trained on text8: 47.43%
Accuracy for k = 1 and FastText pretrained on wikipedia and news: 22.92%
Accuracy for k = 3 and FastText pretrained on wikipedia and news: 89.33%
Accuracy for k = 5 and FastText pretrained on wikipedia and news: 95.06%
Accuracy for k = 10 and FastText pretrained on wikipedia and news: 97.83%


There is no issue here: FastText builds word vectors from character n-grams, so a word does not have to be in the vocabulary for the lookup to work. The model pretrained on Wikipedia and news performs much better than the one trained on text8, although the result is still not entirely satisfying because accuracy is low for small k (22.92% at k = 1). A likely reason is that `similar_by_vector` does not exclude the three query words, so one of them can take the first rank.
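
To make the subword idea concrete, the sketch below decomposes words into character n-grams with `<` and `>` boundary markers, using lengths 3 to 6 (the gensim `FastText` defaults for `min_n` and `max_n`). It only illustrates the decomposition: the real model additionally hashes n-grams into buckets and combines their learned vectors. The `char_ngrams` helper and the choice of example words are assumptions made for illustration.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


# Pretend these two words were seen in training and 'stepbrother' was not.
known = set(char_ngrams("stepfather")) | set(char_ngrams("brother"))
unseen = char_ngrams("stepbrother")
shared = [g for g in unseen if g in known]
print(f"{len(shared)} of {len(unseen)} n-grams of 'stepbrother' were already seen")
```

An unseen word that shares many n-grams with training words can therefore still receive a meaningful vector.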

### **Question 5** (optional)

Provide a complete evaluation of the best performing models (Word2Vec and FastText) by leveraging the complete dataset of analogy entries. In this case, you should use all the analogy types, and you can use the same K values provided in Question 2.

In [31]:
# your code here

accuracy_results_W2V = calculate_analogy_accuracy(data_analogies, model_W2V.wv , k_values)
accuracy_results_google = calculate_analogy_accuracy(data_analogies, w2v_google_news_model , k_values)
accuracy_results_FastText = calculate_analogy_accuracy(data_analogies, model_FastText.wv , k_values)
accuracy_results_wiki = calculate_analogy_accuracy(data_analogies, ft_wiki_news_model , k_values)

for k, accuracy in accuracy_results_W2V.items():
    print(f"Accuracy for k = {k} and W2V trained on text8: {accuracy * 100:.2f}%")
for k, accuracy in accuracy_results_google.items():
    print(f"Accuracy for k = {k} and W2V trained on google news: {accuracy * 100:.2f}%")
for k, accuracy in accuracy_results_FastText.items():
    print(f"Accuracy for k = {k} and FastText trained on text8: {accuracy * 100:.2f}%")
for k, accuracy in accuracy_results_wiki.items():
    print(f"Accuracy for k = {k} and FastText pretrained on wikipedia and news: {accuracy * 100:.2f}%")


Skipping entry ['Athens' 'Greece' 'Baghdad' 'Iraq'] due to missing word in the vocabulary.
Skipping entry ['Athens' 'Greece' 'Bangkok' 'Thailand'] due to missing word in the vocabulary.
Skipping entry ['Athens' 'Greece' 'Beijing' 'China'] due to missing word in the vocabulary.
Skipping entry ['Athens' 'Greece' 'Berlin' 'Germany'] due to missing word in the vocabulary.
Skipping entry ['Athens' 'Greece' 'Bern' 'Switzerland'] due to missing word in the vocabulary.
Skipping entry ['Athens' 'Greece' 'Cairo' 'Egypt'] due to missing word in the vocabulary.
Skipping entry ['Athens' 'Greece' 'Canberra' 'Australia'] due to missing word in the vocabulary.
Skipping entry ['Athens' 'Greece' 'Hanoi' 'Vietnam'] due to missing word in the vocabulary.
Skipping entry ['Athens' 'Greece' 'Havana' 'Cuba'] due to missing word in the vocabulary.
Skipping entry ['Athens' 'Greece' 'Helsinki' 'Finland'] due to missing word in the vocabulary.
Skipping entry ['Athens' 'Greece' 'Islamabad' 'Pakistan'] due to missi

Accuracy for k = 1 and W2V trained on text8: 2.61%
Accuracy for k = 3 and W2V trained on text8: 13.86%
Accuracy for k = 5 and W2V trained on text8: 17.03%
Accuracy for k = 10 and W2V trained on text8: 19.92%
Accuracy for k = 1 and W2V trained on google news: 20.19%
Accuracy for k = 3 and W2V trained on google news: 78.35%
Accuracy for k = 5 and W2V trained on google news: 84.72%
Accuracy for k = 10 and W2V trained on google news: 89.35%
Accuracy for k = 1 and FastText trained on text8: 12.41%
Accuracy for k = 3 and FastText trained on text8: 27.89%
Accuracy for k = 5 and FastText trained on text8: 31.16%
Accuracy for k = 10 and FastText trained on text8: 34.43%
Accuracy for k = 1 and FastText pretrained on wikipedia and news: 33.62%
Accuracy for k = 3 and FastText pretrained on wikipedia and news: 88.96%
Accuracy for k = 5 and FastText pretrained on wikipedia and news: 92.14%
Accuracy for k = 10 and FastText pretrained on wikipedia and news: 94.65%

## Sentence Embeddings

Sentence embeddings are a way to represent a sentence in a vector space. The vector space is usually learned from a large corpus of text. They are used in many NLP tasks, such as text classification, text similarity, and question answering. In this practice, we will use and interact with both Doc2Vec and InferSent models.

Key takeaways from lessons and in-class practices:
- Doc2Vec is an extension of the Word2Vec framework.
- It incorporates a document ID to obtain a more accurate representation of a document/paragraph.
- Vectors for the training documents are pre-computed; however, you can infer vectors for new documents.
- InferSent exploits a deep learning architecture to learn sentence representations in a supervised manner.
- InferSent vectors can exploit either Word2Vec or FastText as word embedding models.

### **Question 6**

Train a new Doc2Vec model using the [APIs provided by gensim](https://radimrehurek.com/gensim/models/doc2vec.html) with the text8 corpus.

- What is the training time for the model? Is it comparable with the Word2Vec and FastText training times?

NB. **Store** the model to a file for subsequent steps.

In [36]:
# your code here
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument


tagged_data = [TaggedDocument(words=doc, tags=[str(i)]) for i, doc in enumerate(dataset)]
start_time = time.time()
doc2vec_model = Doc2Vec(vector_size=100, window=5, min_count=5, workers=4, epochs=10)
doc2vec_model.build_vocab(tagged_data)
doc2vec_model.train(tagged_data, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.epochs)
end_time = time.time()
print(f"Doc2Vec model trained in {end_time - start_time} seconds.")

doc2vec_model.save("doc2vec_model.bin")

Doc2Vec model trained in 47.36729025840759 seconds.


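Training takes about 47 seconds, which is somewhat longer than Word2Vec (about 30 seconds) and much shorter than FastText (about 131 seconds). Note that Doc2Vec was trained here with `epochs=10`, while the Word2Vec and FastText models used the gensim default number of epochs, so the timings are not strictly comparable.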

### **Question 7 (Doc2Vec qualitative evaluation)**
Perform some **qualitative** experiments by computing the cosine similarities between sentences that you write yourself.
For example, you can use the following sentences:

```python
s1 = "The president of the United States is Donald Trump"
s2 = "The president of the United States is Joe Biden"
s3 = "United States is a country"
s4 = "The cell phone is a device"
```

Please try to interact with the model by providing different sentences and check the results. Is the model able to capture the semantic meaning of the sentences? Are you satisfied with the results?

In [37]:
# your code here

from itertools import combinations

from sklearn.metrics.pairwise import cosine_similarity

sentences = {
    "s1": "The president of the United States is Donald Trump",
    "s2": "The president of the United States is Joe Biden",
    "s3": "United States is a country",
    "s4": "The cell phone is a device",
}

# Whitespace tokenization; infer_vector expects a list of tokens
tokenized = {name: text.split() for name, text in sentences.items()}


def calculate_cosine_similarity(sentence1, sentence2, model):
    """Infer a vector for each tokenized sentence and return their cosine similarity."""
    vector1 = model.infer_vector(sentence1)
    vector2 = model.infer_vector(sentence2)
    return cosine_similarity([vector1], [vector2])[0][0]


# Calculate and print cosine similarities for every pair of sentences
for name1, name2 in combinations(tokenized, 2):
    similarity = calculate_cosine_similarity(tokenized[name1], tokenized[name2], doc2vec_model)
    print(f"Similarity between {name1} and {name2}: {similarity}")

Similarity between s1 and s2: 0.8405611515045166
Similarity between s1 and s3: 0.6792886257171631
Similarity between s1 and s4: 0.6500366926193237
Similarity between s2 and s3: 0.7398487329483032
Similarity between s2 and s4: 0.6416105031967163
Similarity between s3 and s4: 0.7838231325149536
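
One thing to keep in mind when reading these scores: the text8 corpus contains only lowercase letters and spaces, so capitalized tokens such as `The` or `United` produced by a plain `split()` cannot match the model's vocabulary and are expected to contribute nothing to the inferred vector. A minimal sketch of a preprocessing step that mimics the text8 format is shown below; the `preprocess` and `cosine` helpers are hypothetical and written in plain Python so the example runs without the trained model.

```python
import math
import re


def preprocess(sentence):
    """Lowercase and keep only alphabetic tokens, mimicking the text8 format."""
    return re.findall(r"[a-z]+", sentence.lower())


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


s1 = "The president of the United States is Donald Trump"
print(preprocess(s1))
# In the notebook, the tokens would then be passed to the model, e.g.
# doc2vec_model.infer_vector(preprocess(s1))

print(cosine([1.0, 0.0], [1.0, 1.0]))
```

Re-running the comparison with preprocessed tokens is a quick way to check how much of the similarity pattern is driven by out-of-vocabulary words.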


### **Question 8**

Load the InferSent model provided by Facebook Research ([reference](https://github.com/facebookresearch/InferSent)) and perform the same qualitative evaluation done in Question 7. In this case, you can use the InferSent pretrained model (v2) - [reference](https://github.com/facebookresearch/InferSent).

Try to find some sentences for which InferSent is able to capture the semantic meaning of the sentences as opposed to Doc2Vec. Are you satisfied with the results? Which model is able to better capture the semantic meaning of the sentences? What can be the reason for this?

**Notes:**
Please find below the code to download the InferSent model.

In [38]:
%%capture
# InferSent download required files

! mkdir fastText
! curl -Lo fastText/crawl-300d-2M.vec.zip https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip
! unzip fastText/crawl-300d-2M.vec.zip -d fastText/
! mkdir encoder
! curl -Lo encoder/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl
! git clone https://github.com/facebookresearch/InferSent.git

In [39]:
from InferSent.models import InferSent
import torch
V = 2
MODEL_PATH = 'encoder/infersent%s.pkl' % V
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': V}
infersent = InferSent(params_model)
infersent.load_state_dict(torch.load(MODEL_PATH))

W2V_PATH = 'fastText/crawl-300d-2M.vec'
infersent.set_w2v_path(W2V_PATH)

In [None]:
# your code here

### **Question 9** (Extrinsic Evaluation)

**Extrinsic** evaluation aims at measuring the performance of the word/sentence/paragraph embedding model when used in a downstream task. In this case, we will use the model to perform a text classification task.
We can use different configurations, training corpora or even different models to build a complete architecture for the task at hand.

For this practice we use the text classification dataset available [here](https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P2/news_headline_classification.csv) - [source: Kaggle](https://www.kaggle.com/rmisra/news-category-dataset). It contains news headlines and the corresponding category. The dataset is composed of 200846 headlines divided into multiple categories (e.g. politics, business, sports, etc.).

**Note:** consider using just the first 10,000 headlines to reduce runtime during the lab. You can use the complete data collection at home to achieve better results.

Compute the accuracy of 3 classification models each one built with one of the models introduced in this practice:
- Word2Vec model pretrained on Google News corpus
- FastText model pretrained on Wikipedia+News corpus
- **[Optional]** Doc2Vec model pretrained on Text8 corpus
- **[Optional]** InferSent pretrained model (v2) - [reference](https://github.com/facebookresearch/InferSent)

The procedure to create a classification system is sketched below:
1. Choose a machine learning (multi-class) classifier (e.g., MLP)
2. Split the data collection into train/test (80%/20%)
3. Use the text vectors obtained from the pretrained model as input to the classifier
4. Measure the accuracy of the classification system
5. Repeat steps 3-4 using different embedding models
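
The procedure above can be sketched end to end on toy data. The example below is only an illustration under stated assumptions: the 2-D feature vectors and labels are hypothetical, and a nearest-centroid classifier written in plain Python stands in for the MLP suggested in step 1 so that the snippet runs without extra dependencies.

```python
import random

# Hypothetical toy data: (feature vector, label). In the practice, the vectors
# would be sentence embeddings and the labels news categories.
data = [((x, 0.0), "sports") for x in range(10)] + [((x, 10.0), "politics") for x in range(10)]

# Step 2: shuffle with a fixed seed and split 80%/20%
random.Random(0).shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]


def centroids(samples):
    """Mean vector per label."""
    sums, counts = {}, {}
    for vec, label in samples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(vec, cents):
    """Label of the closest centroid (squared Euclidean distance)."""
    return min(cents, key=lambda lab: sum((a - b) ** 2 for a, b in zip(vec, cents[lab])))


# Steps 3-4: fit on the training vectors, then measure accuracy on the test set
cents = centroids(train)
accuracy = sum(predict(vec, cents) == label for vec, label in test) / len(test)
print(f"accuracy on {len(test)} test samples: {accuracy:.2f}")
```

Swapping the embedding model (step 5) only changes how the feature vectors are produced; the split and the accuracy computation stay the same.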


**Note:** For word embedding models you must use an aggregation strategy to obtain a single vector for each sentence. You can use the average of the word vectors or the sum of the word vectors. In both cases, the output vector can be used as input of the classifier.
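
The averaging strategy can be sketched as follows. The embeddings are hypothetical 3-D toy vectors (a real pretrained model would provide 300-D vectors), the `average_vector` helper is an assumption made for illustration, and out-of-vocabulary tokens are simply skipped. Lowercasing the headline is also an assumption here; whether it helps depends on how the chosen pretrained model was cased.

```python
# Hypothetical 3-D toy embeddings; a real model would provide 300-D vectors.
embeddings = {
    "stocks": [1.0, 0.0, 2.0],
    "rise": [3.0, 2.0, 0.0],
    "sharply": [2.0, 4.0, 4.0],
}


def average_vector(tokens, embeddings, dim):
    """Average the vectors of in-vocabulary tokens; zero vector if none is known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


headline = "Stocks rise sharply today".lower().split()
print(average_vector(headline, embeddings, dim=3))
```

Using the sum instead of the average only requires dropping the division by `len(known)`.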

Report the performance of each classification pipeline. Which model has better performance? Why? Try to elaborate on the results.

In [None]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P2/news_headline_classification.csv

In [None]:
# your code here

**Word2Vec + Average aggregation function**

In [None]:
# your code here

**FastText + Average aggregation function**

In [None]:
# your code here

**Doc2Vec (Text8)**

In [None]:
# your code here

**InferSent**

In [None]:
# your code here