<a href="https://colab.research.google.com/github/MorenoLaQuatra/DeepNLP/blob/main/practices/P2/Practice_2_Word_and_Sentence_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Deep Natural Language Processing @ PoliTO**

---


**Teaching Assistant:** Moreno La Quatra

**Practice 2:** Word and Sentence Embeddings

## Word Embedding 

![](https://qph.fs.quoracdn.net/main-qimg-3e812fd164a08f5e4f195000fecf988f)


**Key takeaways** from lessons and in-class practices:
- Word embeddings are able to map words into a semantic-aware vector space
- There are multiple architectures for the generation of word embeddings
- Each architecture has its advantages and disadvantages
- Word embedding evaluation can be intrinsic (intermediate tasks) or extrinsic (downstream tasks)
- It is possible to use a pre-trained word embedding model or to train one from scratch on a large amount of text


### **Question 1**

Train a new Word2Vec model using gensim with the text8 corpus, which is available through the gensim downloader ([reference](https://radimrehurek.com/gensim/downloader.html)). Measure the training time for the model and store it for subsequent steps.
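A minimal sketch of one way to measure and store a training time, using only the standard library. The helper name `timed`, the `training_times` dictionary, and the `fake_training` workload are hypothetical and stand in for the actual model training.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

def fake_training(n_steps):
    """Placeholder workload standing in for model training (assumption)."""
    return sum(i * i for i in range(n_steps))

# Keep the timings in one place so they can be compared in later questions.
training_times = {}
model, training_times["word2vec"] = timed(fake_training, 100_000)
print(f"word2vec stand-in trained in {training_times['word2vec']:.4f} s")
```

In the notebook, the placeholder could be swapped for something like `timed(Word2Vec, sentences=corpus)`, where `corpus = gensim.downloader.load('text8')` and `Word2Vec` is imported from `gensim.models`.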

In [None]:
! pip install --upgrade gensim

In [None]:
# Your code here

### **Question 2:**
Perform intrinsic evaluation of the model for the task of word analogy by exploiting the data collection available [here](https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P2/google_analogies.csv). 

1. Read the CSV file
2. Group the analogy entries by type (column: `type`)
3. For each type (**in the lab, just set type="family"** to reduce the required time), use the first 3 word vectors of each entry to compute the fourth
    - Entry: `Athens,Greece,Baghdad,Iraq`
    - `v(Greece) - v(Athens) + v(Baghdad) = res_v` 
    - Get the most similar vectors to `res_v`
    - Count in how many cases the correct word is among the top K (that is, `v(Iraq)` is among the K most similar vectors) with `K = 1, 3, 5, 10`
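The vector arithmetic in step 3 can be sketched with toy data as follows. The 3-dimensional vectors are invented for illustration, and excluding the three input words from the candidates is an assumption (a common convention), not something the exercise prescribes.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
vectors = {
    "Athens":  [1.0, 0.0, 1.0],
    "Greece":  [1.0, 1.0, 1.0],
    "Baghdad": [0.0, 0.0, 2.0],
    "Iraq":    [0.0, 1.0, 2.0],
    "Rome":    [2.0, 0.0, 0.0],
    "Italy":   [2.0, 1.0, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy_candidates(a, b, c, k):
    """Rank words by similarity to v(b) - v(a) + v(c), skipping a, b and c."""
    res_v = [vb - va + vc
             for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    scored = [(word, cosine(res_v, vec))
              for word, vec in vectors.items() if word not in (a, b, c)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:k]]

top3 = analogy_candidates("Athens", "Greece", "Baghdad", k=3)
print(top3)
```

With a gensim `KeyedVectors` object, `most_similar(positive=[b, c], negative=[a], topn=k)` runs a comparable query. It operates on unit-normalised vectors, so its ranking may differ slightly from the raw arithmetic above.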

$top(k) = \dfrac{\sum_{i=1}^{|E|} f(i)}{|E|}$

where $f(i) = 1$ if the target word of entry $i$ is among the top $k$ most similar words and $f(i) = 0$ otherwise.

$|E|$ is the total number of entries for the considered type.
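A minimal sketch of the $top(k)$ score, assuming the ranked candidates for each entry have already been computed. The function name `top_k_accuracy` and the toy rankings are hypothetical and not taken from the dataset.

```python
def top_k_accuracy(ranked_predictions, targets, k):
    """Fraction of entries whose target is among the first k candidates."""
    if not targets:
        return 0.0
    hits = sum(1 for ranked, target in zip(ranked_predictions, targets)
               if target in ranked[:k])
    return hits / len(targets)

# Toy ranked outputs for three entries (invented for illustration).
ranked_predictions = [
    ["Iraq", "Iran", "Syria"],
    ["queen", "princess", "king"],
    ["uncle", "nephew", "aunt"],
]
targets = ["Iraq", "princess", "niece"]

scores = {k: top_k_accuracy(ranked_predictions, targets, k)
          for k in (1, 3, 5, 10)}
for k, score in scores.items():
    print(f"top({k}) = {score:.3f}")
```

Each candidate list here has only three items, so the scores for `K = 5` and `K = 10` equal the score for `K = 3`. With a real model, each list would hold at least 10 candidates.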

**Notes:**
1. Try with the model trained on `text8`. Is there any issue?
2. Test the model trained on Google News, which is available in gensim.



In [None]:
%%capture
! wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P2/google_analogies.csv
! pip install --upgrade pandas

In [None]:
# Executing this cell could take ~5 minutes
import gensim.downloader
w2v_google_news_model = gensim.downloader.load('word2vec-google-news-300')

In [None]:
# Your code here

### **Question 3:**

Train a new FastText model using gensim with the text8 corpus, which is available through the gensim downloader ([reference](https://radimrehurek.com/gensim/downloader.html)). Measure the training time for the model and store it for subsequent steps.

- Is there any significant difference in training time compared with Word2Vec training?

In [None]:
# Your code here

### **Question 4:**
Score the FastText model using the same methodology presented in Question 2.

**Notes:**
- Is there any issue similar to the one observed with the Word2Vec model?
- Test the model trained on Wikipedia+News, which is available in gensim.

In [None]:
# Your code here

### **Question 5** (optional) 
Evaluate the Word2Vec and FastText models on the analogy task for the whole dataset (include all analogy types).

In [None]:
# Your code here

## Sentence Embeddings

Key takeaways from lessons and in-class practices:
- Doc2Vec is an extension of the Word2Vec framework
- It incorporates a document ID to obtain a more accurate representation of a document/paragraph
- Vectors for the training documents are pre-computed; vectors for new documents can be inferred
- InferSent exploits a deep learning architecture to learn sentence representations in a supervised manner
- InferSent can use either Word2Vec or FastText as its word embedding model

### **Question 6:**

Train a Doc2Vec model using gensim with the text8 corpus. Measure the training time for the model and store it for subsequent steps.

In [None]:
# Your code here

### **Question 7** (Qualitative Evaluation)
Perform some qualitative experiments by computing the cosine similarities between sentences that you write yourself.
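A minimal sketch of a pairwise cosine-similarity comparison between hand-written sentences. The `embed` function is a hypothetical stand-in based on bag-of-words counts, not Doc2Vec. In the notebook it would be replaced by the vector inferred by the trained model (for example via `infer_vector` on the tokenised sentence).

```python
import math
from collections import Counter
from itertools import combinations

def embed(sentence):
    """Stand-in embedding: bag-of-words counts (assumption, not Doc2Vec)."""
    return Counter(sentence.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

sentences = [
    "the cat sits on the mat",
    "the cat lies on the mat",
    "stock markets fell sharply today",
]

# Similarity for every unordered pair of sentences, keyed by index pair.
similarities = {
    (i, j): cosine(embed(sentences[i]), embed(sentences[j]))
    for i, j in combinations(range(len(sentences)), 2)
}
for (i, j), score in similarities.items():
    print(f"{sentences[i]!r} vs {sentences[j]!r}: {score:.3f}")
```

Count vectors only reward exact word overlap. A learned sentence embedding is expected to also score paraphrases with different wording as similar, which is the behaviour worth probing in this question.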

In [None]:
# Your code here

### **Question 8** (Extrinsic Evaluation)

Extrinsic evaluation measures the performance of a word/sentence/paragraph embedding model on a downstream NLP task (e.g., text classification).

We can use different configurations, training corpora, or even different models to build a complete architecture for the task at hand.

For this practice we use the text classification dataset available [here](https://github.com/MorenoLaQuatra/DeepNLP/blob/main/practices/P2/news_headline_classification.csv) - [source: Kaggle](https://www.kaggle.com/rmisra/news-category-dataset)

**Note:** consider using just the first 10,000 headlines to reduce runtime during the lab.

Compute the accuracy of 3 classification models, each built with one of the embedding models introduced in this practice:
- Word2Vec model pretrained on Google News corpus
- FastText model pretrained on Wikipedia+News corpus
- **[Optional]** Doc2Vec model trained on the text8 corpus (Question 6)
- **[Optional]** InferSent pretrained model (v2) - [reference](https://github.com/facebookresearch/InferSent)

The procedure to create a classification system is sketched below:
1. Choose a machine learning (multi-class) classifier (e.g., MLP)
2. Split the data collection into train/test sets (80%/20%)
3. Use the text vectors obtained from the pretrained model as input to the classifier
4. Measure the accuracy of the classification system
5. Repeat steps 3-4 using different embedding models
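The procedure above can be sketched end to end with the standard library only. To keep the example dependency-free, a nearest-centroid classifier is used here as a stand-in for the MLP mentioned in step 1. The toy 2-dimensional "sentence vectors", the labels, and all helper names are invented for illustration.

```python
import random

def train_test_split(samples, labels, test_ratio=0.2, seed=0):
    """Shuffle indices once and split both lists consistently."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(indices) * test_ratio)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return ([samples[i] for i in train_idx], [samples[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])

def fit_centroids(X, y):
    """Compute the mean vector of each class."""
    sums, counts = {}, {}
    for vec, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for d, value in enumerate(vec):
            acc[d] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}

def predict(centroids, vec):
    """Return the label of the closest centroid (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy "sentence vectors": two well-separated clusters, one per class.
rng = random.Random(42)
X = ([[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(50)]
     + [[rng.gauss(5, 0.1), rng.gauss(5, 0.1)] for _ in range(50)])
y = ["SPORTS"] * 50 + ["POLITICS"] * 50

X_train, X_test, y_train, y_test = train_test_split(X, y, test_ratio=0.2)
centroids = fit_centroids(X_train, y_train)
test_accuracy = accuracy(y_test, [predict(centroids, v) for v in X_test])
print(f"accuracy: {test_accuracy:.2f}")
```

In the notebook, scikit-learn's `train_test_split` and `MLPClassifier` would be natural replacements for the hand-written split and classifier. Only the feature extraction step changes from one embedding model to the next.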


**Note:** You need to choose an aggregation function (e.g., average) to obtain sentence embeddings from word vectors.
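A minimal sketch of the average aggregation function, assuming `word_vectors` behaves like a mapping from word to vector. Skipping out-of-vocabulary tokens and returning a zero vector when no token is known are design assumptions, not requirements of the exercise. The toy vectors are invented.

```python
def average_embedding(tokens, word_vectors, dim):
    """Mean of the vectors of in-vocabulary tokens; zero vector if none."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]

# Toy 2-dimensional word vectors, invented for illustration.
word_vectors = {
    "stocks": [1.0, 0.0],
    "rally":  [0.0, 1.0],
    "today":  [1.0, 1.0],
}

headline = "stocks rally again today".split()  # "again" is out of vocabulary
sentence_vector = average_embedding(headline, word_vectors, dim=2)
print(sentence_vector)
```

A gensim `KeyedVectors` object supports the same `in` test and `[]` lookup, so it could be passed as `word_vectors` with `dim` set to its `vector_size`.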

Which model has better performance? Report the performance of each variant of the classification system.

In [None]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P2/news_headline_classification.csv

In [None]:
# Your code here

**Word2Vec + Average aggregation function**

In [None]:
# Your code here

**FastText + Average aggregation function**

In [None]:
# Your code here

**Doc2Vec (Text8)**

In [None]:
# Your code here

**InferSent**

In [None]:
%%capture
# Download the files required by InferSent: fastText vectors, encoder weights, and source code

# -p avoids an error if the directory already exists (e.g., when re-running the cell)
! mkdir -p fastText
! curl -Lo fastText/crawl-300d-2M.vec.zip https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip
! unzip fastText/crawl-300d-2M.vec.zip -d fastText/
! mkdir -p encoder
! curl -Lo encoder/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl
! git clone https://github.com/facebookresearch/InferSent.git

In [None]:
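from InferSent.models import InferSent
import torch

# Version 2 of the encoder is the one paired with the fastText vectors downloaded above
V = 2
MODEL_PATH = 'encoder/infersent%s.pkl' % V
# Architecture parameters of the encoder whose weights are loaded below
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': V}
infersent = InferSent(params_model)
infersent.load_state_dict(torch.load(MODEL_PATH))

# Point the encoder to the word vectors it uses to embed input tokens
W2V_PATH = 'fastText/crawl-300d-2M.vec'
infersent.set_w2v_path(W2V_PATH)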

In [None]:
# Your code here