<a href="https://colab.research.google.com/github/Raffix-14/NLP/blob/main/Lab_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Deep Natural Language Processing @ PoliTO**

---


**Teaching Assistant:** Moreno La Quatra

**Practice 2:** Word and Sentence Embeddings

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MorenoLaQuatra/DeepNLP/blob/main/2022_2023/Practice_2_Word_and_Sentence_Embeddings.ipynb)

## Word Embedding 

![](https://qph.fs.quoracdn.net/main-qimg-3e812fd164a08f5e4f195000fecf988f)


**Key takeaways** from lessons and in-class practices:
- Word embeddings map words into a semantic-aware vector space
- There are multiple architectures for generating word embeddings
- Each architecture has its advantages and disadvantages
- Word embedding evaluation can be intrinsic (intermediate tasks) or extrinsic (downstream tasks)
- It is possible to use pre-trained word embedding models or to train one from scratch on a large amount of text
- Using pre-trained word embedding models is common practice in NLP and removes the need to train a word embedding model from scratch (which can be very time-consuming and computationally expensive)


### **Question 1**

Train a new Word2Vec model using gensim on the text8 corpus available through the gensim downloader ([reference](https://radimrehurek.com/gensim/downloader.html)). Measure the training time of the model and store it for subsequent steps.

**Hint:** you can use the following code to load the text8 corpus:

```python
import gensim.downloader as api
from gensim.models import Word2Vec
import time
dataset = api.load("text8")
```
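Since the training time has to be stored for later comparison, one possible approach is to wrap each training call in a small helper that records the elapsed seconds in a dictionary. This is only a sketch: `timed_call` and `training_times` are hypothetical names, and the `sum` call merely stands in for a real training call such as `Word2Vec(dataset)`.

```python
import time

# Hypothetical registry: maps a label (e.g. "word2vec") to elapsed seconds.
training_times = {}


def timed_call(label, fn, *args, **kwargs):
    """Run fn(*args, **kwargs), store its wall-clock duration under label, return its result."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    training_times[label] = time.perf_counter() - start
    return result


# Stand-in workload; in the lab this could be timed_call("word2vec", Word2Vec, dataset).
total = timed_call("toy", sum, range(1000))
print(total, training_times["toy"])
```

Keeping all timings in one dictionary makes the comparison requested in the later questions a simple lookup.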

In [1]:
%%capture
!pip install --upgrade gensim

In [8]:
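import gensim.downloader as api
from gensim.models import Word2Vec
import time

dataset = api.load("text8")  # iterable of tokenized text chunks
t0 = time.time()
w2v_text8_model = Word2Vec(dataset)  # default hyperparameters
w2v_training_time = time.time() - t0  # seconds, stored for later comparison
print(w2v_training_time)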

109.48493313789368


### **Question 2**:
Perform **intrinsic** evaluation of the model for the task of word analogy by exploiting the data collection available [here](https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P2/google_analogies.csv). 

1. read CSV file
2. group analogy entries by type (column: `type`)
3. for each type entry (**in the lab, just set type="family"** to reduce the required time) use the first 3 word vectors to compute the fourth
    - Entry: `Athens,Greece,Baghdad,Iraq`
    - `v(Greece) - v(Athens) + v(Baghdad) = res_v` 
    - Get the most similar vectors to `res_v`
    - Compute in how many cases the correct word is among the top K (i.e., whether `Iraq` is among the K most similar words), with `K = 1, 3, 5, 10`

$top(k) = \dfrac{\sum_{i=1}^{|E|} f(i)}{|E|}$

where $f(i) = 1$ if the target word of entry $i$ is among the top $k$ and $f(i) = 0$ otherwise.

$|E|$ is the total number of entries for the considered type.

**Notes:**
1. Try with the model trained on `text8`. Is there any issue? If so, how can you solve it?
2. Test the model trained on Google News available in gensim.
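The top-k metric defined above can be sketched as a small function that is independent of the embedding library. Everything here is illustrative: `top_k_accuracy`, `toy_most_similar` and the toy rankings are hypothetical, and the ranking function is injected as a callable so that a real model can be plugged in later.

```python
def top_k_accuracy(entries, most_similar_fn, ks=(1, 3, 5, 10)):
    """Return {k: fraction of entries whose target is among the top-k candidates}.

    entries: iterable of (word1, word2, word3, target) tuples.
    most_similar_fn(word1, word2, word3, topn): returns candidate words, best first.
    """
    entries = list(entries)
    if not entries:
        return {k: 0.0 for k in ks}
    max_k = max(ks)
    hits = {k: 0 for k in ks}
    for w1, w2, w3, target in entries:
        # Query once with the largest k, then check every prefix of the ranking.
        candidates = most_similar_fn(w1, w2, w3, max_k)
        for k in ks:
            if target in candidates[:k]:
                hits[k] += 1
    return {k: hits[k] / len(entries) for k in ks}


# Toy stand-in for a real model: a fixed ranking per query (hypothetical data).
toy_rankings = {
    ("boy", "girl", "king"): ["queen", "prince", "monarch"],
    ("boy", "girl", "father"): ["parent", "mother", "son"],
}


def toy_most_similar(w1, w2, w3, topn):
    return toy_rankings[(w1, w2, w3)][:topn]


toy_entries = [("boy", "girl", "king", "queen"), ("boy", "girl", "father", "mother")]
scores = top_k_accuracy(toy_entries, toy_most_similar, ks=(1, 3))
print(scores)
```

With a gensim `KeyedVectors` object `kv`, the callable could be something like `lambda a, b, c, topn: [w for w, _ in kv.most_similar(positive=[b, c], negative=[a], topn=topn)]`. Entries containing out-of-vocabulary words would then raise a `KeyError` and need to be skipped or counted as misses.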



In [9]:
%%capture
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P2/google_analogies.csv
!pip install --upgrade pandas

In [10]:
# Executing this cell could take ~5 minutes - you can write your code in the meantime
import gensim.downloader
w2v_google_news_model = gensim.downloader.load('word2vec-google-news-300')



In [59]:
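import pandas as pd

df = pd.read_csv("google_analogies.csv")
df_family = df[df['type']=='family']  # entries for the full metric (not used below)
v = w2v_google_news_model
K = [1, 3, 5, 10]

# Single-entry check on the example analogy Athens : Greece = Baghdad : ?
res_v = v['Greece'] - v['Athens'] + v['Baghdad']
count = 0
for k in K:
  # Querying with a raw vector does not exclude the input words, so 'Baghdad' ranks first
  top_similar = v.most_similar(res_v, topn=k)
  print(top_similar)
  if 'Iraq' in list(zip(*top_similar))[0]:
    count += 1
# Fraction of K values for which the target is retrieved (not top(k) over all entries)
print(count, '\n', 'Accuracy:', count/len(K))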



[('Baghdad', 0.7489827275276184)]
[('Baghdad', 0.7489827275276184), ('Iraqi', 0.6727433204650879), ('Mosul', 0.6652752757072449)]
[('Baghdad', 0.7489827275276184), ('Iraqi', 0.6727433204650879), ('Mosul', 0.6652752757072449), ('Iraq', 0.6355191469192505), ('Iraqis', 0.6133478879928589)]
[('Baghdad', 0.7489827275276184), ('Iraqi', 0.6727433204650879), ('Mosul', 0.6652752757072449), ('Iraq', 0.6355191469192505), ('Iraqis', 0.6133478879928589), ('Samarra', 0.6069650053977966), ('Sunni_Arab', 0.6064962148666382), ('Basra', 0.5986490845680237), ('Fallujah', 0.5971603989601135), ('Anbar', 0.5966798067092896)]
2 
 Accuracy: 0.5


### **Question 3:**

Train a new FastText model using gensim on the text8 corpus available through the gensim downloader ([reference](https://radimrehurek.com/gensim/downloader.html)). Measure the training time of the model and store it for subsequent steps. 

- Is there any significant difference in training time compared with Word2Vec training?

In [55]:
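import gensim.downloader as api
from gensim.models import FastText
import time

dataset = api.load("text8")
t0 = time.time()
ft_text8_model = FastText(dataset)  # default hyperparameters
ft_training_time = time.time() - t0  # seconds, stored for comparison with Word2Vec
print(ft_training_time)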

436.4086244106293


### **Question 4:**
Repeat the evaluation from Question 2 for the FastText model. In this case, you can use the same type of analogy (family) and the same K values.

**Notes:**
- Try with the model trained on `text8`. Is there any issue? What does it mean?
- Test the model trained on Wikipedia+News available in gensim.
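One likely reason FastText behaves differently from Word2Vec on unseen words is that it represents each word through its character n-grams, so a vector can still be composed for a word that never appeared in training. The sketch below only illustrates the n-gram extraction step with `<` and `>` as boundary markers; `char_ngrams` is a hypothetical helper, the 3 to 6 range mirrors gensim's default `min_n`/`max_n`, and the hashing of n-grams into buckets done by real implementations is omitted.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']
```

A vector obtained this way for an out-of-vocabulary word is only as good as the n-grams it shares with the training vocabulary, which is worth keeping in mind when interpreting the analogy results.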

In [56]:
# Executing this cell could take ~5 minutes - you can write your code in the meantime
import gensim.downloader
ft_wiki_news_model = gensim.downloader.load('fasttext-wiki-news-subwords-300')



In [63]:
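import pandas as pd

df = pd.read_csv("google_analogies.csv")
df_family = df[df['type']=='family']  # entries for the full metric (not used below)
v = ft_wiki_news_model
K = [1, 3, 5, 10]

# Single-entry check on the example analogy Athens : Greece = Baghdad : ?
res_v = v['Greece'] - v['Athens'] + v['Baghdad']
count = 0
for k in K:
  # Querying with a raw vector does not exclude the input words, so 'Baghdad' ranks first
  top_similar = v.most_similar(res_v, topn=k)
  print(top_similar)
  if 'Iraq' in list(zip(*top_similar))[0]:
    count += 1
# Fraction of K values for which the target is retrieved (not top(k) over all entries)
print(count, '\n', 'Accuracy:', count/len(K))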


[('Baghdad', 0.7944707274436951)]
[('Baghdad', 0.7944707274436951), ('Iraq', 0.7864524126052856), ('Mosul', 0.6864204406738281)]
[('Baghdad', 0.7944707274436951), ('Iraq', 0.7864524126052856), ('Mosul', 0.6864204406738281), ('Kuwait', 0.6606603264808655), ('Mesopotamia', 0.6495900750160217)]
[('Baghdad', 0.7944707274436951), ('Iraq', 0.7864524126052856), ('Mosul', 0.6864204406738281), ('Kuwait', 0.6606603264808655), ('Mesopotamia', 0.6495900750160217), ('Iraq-Syria', 0.6473897695541382), ('Syria', 0.645632266998291), ('Baghdady', 0.6396152377128601), ('Al-Basrah', 0.6377001404762268), ('Iraq-Kuwait', 0.6375606060028076)]
3 
 Accuracy: 0.75


### **Question 5** (optional) 

Provide a complete evaluation of the best-performing models (Word2Vec and FastText) by leveraging the complete dataset of analogy entries. In this case, you should use all the analogy types, and you can use the same K values provided in Question 2.

In [77]:
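df = pd.read_csv("google_analogies.csv")

# evaluate_word_analogies expects the questions-words format: a ": section" header
# followed by space-separated "word1 word2 word3 target" lines.
# Passing the raw CSV (no section headers) raises a ValueError.
with open("right_format_analogies.txt", "w") as f:
    for analogy_type, group in df.groupby("type"):
        f.write(f": {analogy_type}\n")
        for row in group[["word1", "word2", "word3", "target"]].itertuples(index=False):
            f.write(" ".join(map(str, row)) + "\n")

# The first returned element is the overall top-1 accuracy; the second holds per-section details
a, _ = w2v_google_news_model.evaluate_word_analogies("right_format_analogies.txt")
b, _ = ft_wiki_news_model.evaluate_word_analogies("right_format_analogies.txt")
print(a, '\n', b)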

## Sentence Embeddings

Sentence embeddings are a way to represent a sentence in a vector space. The vector space is usually learned from a large corpus of text. They are used in many NLP tasks, such as text classification, text similarity, and question answering. In this practice, we will use and interact with both Doc2Vec and InferSent models.

Key takeaways from lessons and in-class practices:
- Doc2Vec is an extension of the Word2Vec framework.
- It incorporates a document ID to obtain a more accurate representation of a document/paragraph.
- Vectors for the training documents are pre-computed; vectors for new documents can be inferred.
- InferSent exploits a deep learning architecture to learn sentence representations in a supervised way.
- InferSent vectors can exploit either Word2Vec or FastText as the word embedding model.

### **Question 6**

Train a new Doc2Vec model using the [APIs provided by gensim](https://radimrehurek.com/gensim/models/doc2vec.html) on the text8 corpus. 

- What is the training time of the model? Is it comparable with the Word2Vec and FastText training times?

NB. **Store** the model to a file for subsequent steps.

In [None]:
# Your code here

### **Question 7 (Doc2Vec qualitative evaluation)**
Perform some **qualitative** experiments by computing the cosine similarities between sentences that you write yourself.
For example, you can use the following sentences:

```python
s1 = "The president of the United States is Donald Trump"
s2 = "The president of the United States is Joe Biden"
s3 = "United States is a country"
s4 = "The cell phone is a device"
```

Please try to interact with the model by providing different sentences and check the results. Is the model able to capture the semantic meaning of the sentences? Are you satisfied with the results?
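The cosine similarity used for this comparison can be sketched in a few lines. The helper below is a minimal, library-free version: `cosine_similarity` is a hypothetical name, and the toy vectors stand in for sentence vectors produced by a model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors standing in for inferred sentence vectors (hypothetical values).
a = [1.0, 0.0, 1.0]
b = [2.0, 0.0, 2.0]  # same direction as a
c = [0.0, 3.0, 0.0]  # orthogonal to a
sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)
print(sim_ab, sim_ac)
```

With a trained gensim Doc2Vec model, the inputs could come from `model.infer_vector(sentence.lower().split())`. Inference is stochastic, so similarities may vary slightly between calls.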

In [None]:
# Your code here

### **Question 8** 

Load the InferSent model provided by Facebook Research ([reference](https://github.com/facebookresearch/InferSent)) and perform the same qualitative evaluation as in Question 7, using the pretrained InferSent model (v2).

Try to find some sentences for which InferSent is able to capture the semantic meaning of the sentences as opposed to Doc2Vec. Are you satisfied with the results? Which model is able to better capture the semantic meaning of the sentences? What can be the reason for this?

**Notes:**
Please find below the code to download the InferSent model.

In [None]:
%%capture
# InferSent download required files

! mkdir fastText
! curl -Lo fastText/crawl-300d-2M.vec.zip https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip
! unzip fastText/crawl-300d-2M.vec.zip -d fastText/
! mkdir encoder
! curl -Lo encoder/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl
! git clone https://github.com/facebookresearch/InferSent.git

In [None]:
from InferSent.models import InferSent
import torch
V = 2
MODEL_PATH = 'encoder/infersent%s.pkl' % V
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': V}
infersent = InferSent(params_model)
infersent.load_state_dict(torch.load(MODEL_PATH))

W2V_PATH = 'fastText/crawl-300d-2M.vec'
infersent.set_w2v_path(W2V_PATH)

In [None]:
# Your code here

### **Question 9** (Extrinsic Evaluation)

**Extrinsic** evaluation aims at measuring the performance of the word/sentence/paragraph embedding model when used in a downstream task. In this case, we will use the model to perform a text classification task.
We can use different configurations, training corpora or even different models to build a complete architecture for the task at hand.

For this practice we use the text classification dataset available [here](https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P2/news_headline_classification.csv) - [source: Kaggle](https://www.kaggle.com/rmisra/news-category-dataset). It contains news headlines and the corresponding category. The dataset is composed of 200,846 headlines divided into multiple categories (e.g. politics, business, sports, etc.).

**Note:** consider using just the first 10,000 headlines to reduce runtime during the lab. You can use the complete data collection at home to achieve better results.

Compute the accuracy of the classification models, each built with one of the models introduced in this practice:
- Word2Vec model pretrained on Google News corpus
- FastText model pretrained on Wikipedia+News corpus
- **[Optional]** Doc2Vec model pretrained on Text8 corpus
- **[Optional]** InferSent pretrained model (v2) - [reference](https://github.com/facebookresearch/InferSent)

The procedure to create a classification system is sketched below:
1. Choose a machine learning (multi-class) classifier (e.g., MLP)
2. Split the data collection into train/test (80%/20%)
3. Use the text vectors obtained from the pretrained model as input to the classifier
4. Measure the accuracy of the classification system
5. Repeat steps 3-4 using different embedding models 
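The 80%/20% split in step 2 can be sketched with the standard library alone. `train_test_split_indices` is a hypothetical helper; in practice scikit-learn's `train_test_split` is the usual choice. The fixed seed is an assumption made so that every embedding model is evaluated on the same split.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle the indices 0..n-1 reproducibly and return (train_indices, test_indices)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n * test_fraction))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(10)
print(len(train_idx), len(test_idx))  # 8 2
```

Reusing the same index lists for every embedding model keeps the accuracy figures comparable across pipelines.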


**Note:** For word embedding models you must use an aggregation strategy to obtain a single vector for each sentence. You can use the average of the word vectors or the sum of the word vectors. In both cases, the output vector can be used as input of the classifier.
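The average aggregation described in this note can be sketched as follows. The helper name `average_vector` and the toy embeddings are hypothetical. Skipping out-of-vocabulary tokens and returning a zero vector when no token is known are assumptions about how to handle missing words, not a prescribed method.

```python
def average_vector(tokens, embeddings, dim):
    """Average the vectors of in-vocabulary tokens; return a zero vector if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy 2-dimensional embeddings (hypothetical values, not from any real model).
toy_embeddings = {"stocks": [1.0, 0.0], "fall": [0.0, 1.0], "sharply": [1.0, 1.0]}
headline = "stocks fall today".split()  # "today" is out of vocabulary and is skipped
sentence_vector = average_vector(headline, toy_embeddings, dim=2)
print(sentence_vector)  # [0.5, 0.5]
```

A gensim `KeyedVectors` object supports the same `in` and `[]` operations, so it could be passed as `embeddings`. With real 300-dimensional vectors, a NumPy mean over the stacked vectors would be the more efficient choice.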

Report the performance of each classification pipeline. Which model has better performance? Why? Try to elaborate on the results.

In [None]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P2/news_headline_classification.csv

In [None]:
# Your code here

**Word2Vec + Average aggregation function**

In [None]:
# Your code here

**FastText + Average aggregation function**

In [None]:
# Your code here

**Doc2Vec (Text8)**

In [None]:
# Your code here

**InferSent**

In [None]:
# Your code here