# NLP & representation learning: Neural Embeddings, Text Classification


To use statistical classifiers with text, it is first necessary to vectorize the text. In the first practical session we explored the **Bag of Words (BoW)** model.

Modern **state-of-the-art** methods use embeddings to vectorize the text before classification, in order to avoid feature engineering.

## [Dataset](https://thome.isir.upmc.fr/classes/RITAL/json_pol.json)


## "Modern" NLP pipeline

In contrast to the **bag of words** model, in the modern NLP pipeline everything is **embeddings**. Instead of encoding a text as a **sparse vector** of length $D$ (the size of the feature dictionary), the goal is to encode the text as a meaningful dense vector of small size $|e| \ll D$.
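To make the contrast concrete, here is a minimal sketch of both encodings on a toy example. The vocabulary, the embedding size and the random embedding table are assumptions chosen for illustration; they stand in for a real dictionary and for learnt vectors.

```python
import numpy as np

# Hypothetical toy vocabulary (D = 8) and embedding size (|e| = 3).
vocab = ["a", "bad", "film", "fine", "great", "movie", "the", "was"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def bow_vector(tokens):
    """Bag-of-words encoding: one count per vocabulary entry (length D)."""
    vec = np.zeros(len(vocab))
    for token in tokens:
        if token in word_to_index:
            vec[word_to_index[token]] += 1
    return vec

# Random stand-in for a learnt embedding table of shape (D, |e|).
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 3))

def dense_vector(tokens):
    """Dense encoding: average of the embeddings of the known tokens (length |e|)."""
    rows = [embedding_table[word_to_index[t]] for t in tokens if t in word_to_index]
    return np.mean(rows, axis=0) if rows else np.zeros(embedding_table.shape[1])

tokens = "the film was fine".split()
print(bow_vector(tokens))    # length D; mostly zeros with a realistic vocabulary
print(dense_vector(tokens))  # length |e|
```

With a real dictionary, $D$ is in the tens of thousands while $|e|$ is typically a few hundred.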


The raw classification pipeline is then the following:

```
raw text ---|embedding table|-->  vectors --|Neural Net|--> class 
```


### Using a language model:

How to tokenize the text and extract a feature dictionary is still a manual choice. To directly obtain meaningful embeddings, it is common to use a pre-trained language model such as `word2vec`, which we explore in this practical.

In this setting, the pipeline becomes the following:
```
      
raw text ---|(pre-trained) Language Model|--> vectors --|classifier (or fine-tuning)|--> class 
```


- #### Classic word embeddings

 - [Word2Vec](https://arxiv.org/abs/1301.3781)
 - [GloVe](https://nlp.stanford.edu/projects/glove/)


- #### Bleeding-edge language model techniques (see next)

 - [ULMFiT](https://arxiv.org/abs/1801.06146)
 - [ELMo](https://arxiv.org/abs/1802.05365)
 - [GPT](https://blog.openai.com/language-unsupervised/)
 - [BERT](https://arxiv.org/abs/1810.04805)






### Goal of this session:

1. Train word embeddings on training dataset
2. Tinker with the learnt embeddings and see learnt relations
3. Tinker with pre-trained embeddings.
4. Use those embeddings for classification
5. Compare different embedding models

## STEP 0: Loading data 

In [1]:
import json
from collections import Counter

# Loading json
file = 'json_pol.json'
with open(file,encoding="utf-8") as f:
    data = json.load(f)
    

# Quick Check
counter = Counter((x[1] for x in data))
print("Number of reviews : ", len(data))
print("----> # of positive : ", counter[1])
print("----> # of negative : ", counter[0])
print("")
print(data[0])

Number of reviews :  25000
----> # of positive :  12500
----> # of negative :  12500

['Although credit should have been given to Dr. Seuess for stealing the story-line of "Horton Hatches The Egg", this was a fine film. It touched both the emotions and the intellect. Due especially to the incredible performance of seven year old Justin Henry and a script that was sympathetic to each character (and each one\'s predicament), the thought provoking elements linger long after the tear jerking ones are over. Overall, superior acting from a solid cast, excellent directing, and a very powerful script. The right touches of humor throughout help keep a "heavy" subject from becoming tedious or difficult to sit through. Lastly, this film stands the test of time and seems in no way dated, decades after it was released.', 1]


## Word2Vec: Quick Recap

**[Word2Vec](https://arxiv.org/abs/1301.3781) is composed of two distinct language models (CBOW and SG), optimized to quickly learn word vectors**


Given an example sentence: `i'm taking the dog out for a walk`



### (a) Continuous Bag of Words (CBOW)
- predicts a word given its context

maximizing `p(dog | i'm taking the ___ out for a walk)`

### (b) Skip-Gram (SG)
- predicts the context given a word

maximizing `p(i'm taking the out for a walk | dog)`
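The difference between the two objectives is easiest to see on the training examples each one extracts from a sentence. The sketch below is an illustration only: the helper name `training_examples` and the window of 2 words on each side are assumptions, and the real Word2Vec training adds details (such as negative sampling) that are not shown.

```python
def training_examples(tokens, window=2):
    """Yield (context, target) tuples for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target

tokens = "i'm taking the dog out for a walk".split()

# CBOW: the whole context is used to predict the target word.
cbow_examples = list(training_examples(tokens))

# Skip-gram: the target word is used to predict each context word separately.
sg_pairs = [(target, ctx_word)
            for context, target in cbow_examples
            for ctx_word in context]

print(cbow_examples[3])  # (['taking', 'the', 'out', 'for'], 'dog')
print([pair for pair in sg_pairs if pair[0] == "dog"])
```

Skip-gram therefore produces several (word, context word) pairs per position, where CBOW produces a single (context, word) example.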



   

## STEP 1: train a language model (word2vec)

Gensim provides one of the fastest [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) implementations.


### Train:

In [None]:
# if gensim not installed yet
# ! pip install gensim

In [2]:
import gensim
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

text = [t.split() for t,p in data]

# Gensim's default configuration, except sg=1 (the default, sg=0, trains a cbow model)
w2v = gensim.models.word2vec.Word2Vec(sentences=text,
                                vector_size=100, window=5,
                                min_count=5,                      
                                sample=0.001, workers=3,
                                sg=1, hs=0, negative=5,        ### sg=1 trains a sg model; set sg to 0 to train a cbow model
                                cbow_mean=1, epochs=5)

2024-02-08 00:14:53,419 : INFO : collecting all words and their counts
2024-02-08 00:14:53,419 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2024-02-08 00:14:54,141 : INFO : PROGRESS: at sentence #10000, processed 2301366 words, keeping 153853 word types
2024-02-08 00:14:54,887 : INFO : PROGRESS: at sentence #20000, processed 4553558 words, keeping 240043 word types
2024-02-08 00:14:55,225 : INFO : collected 276678 word types from a corpus of 5713167 raw words and 25000 sentences
2024-02-08 00:14:55,225 : INFO : Creating a fresh vocabulary
2024-02-08 00:14:55,444 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 48208 unique words (17.42% of original 276678, drops 228470)', 'datetime': '2024-02-08T00:14:55.444444', 'gensim': '4.3.2', 'python': '3.11.7 (tags/v3.11.7:fa7a6f2, Dec  4 2023, 19:24:49) [MSC v.1937 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22621-SP0', 'event': 'prepare_vocab'}
2024-02-08 00:14:55,444 : INFO : Word2Vec l

In [3]:
# It is worth saving the trained embeddings
w2v.save("W2v-movies.dat")
# You will be able to reload them:
# w2v = gensim.models.Word2Vec.load("W2v-movies.dat")
# and you can continue the learning process if needed

2024-02-08 00:15:58,055 : INFO : Word2Vec lifecycle event {'fname_or_handle': 'W2v-movies.dat', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2024-02-08T00:15:58.055156', 'gensim': '4.3.2', 'python': '3.11.7 (tags/v3.11.7:fa7a6f2, Dec  4 2023, 19:24:49) [MSC v.1937 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22621-SP0', 'event': 'saving'}
2024-02-08 00:15:58,055 : INFO : not storing attribute cum_table
2024-02-08 00:15:58,183 : INFO : saved W2v-movies.dat


## STEP 2: Test learnt embeddings

The word embedding space directly encodes similarities between words: the vector for "great" will be closer to the vector for "good" than to the one for "bad". Closeness between vectors is generally measured with the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity), the cosine of the angle between two vectors (higher means more similar).
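As a reminder of what this measure computes, here is a minimal NumPy sketch. The three vectors are made up for illustration and are not taken from any trained model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v, in [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up 3-d vectors, for illustration only.
great = np.array([0.9, 0.8, 0.1])
good = np.array([0.8, 0.9, 0.2])
bad = np.array([-0.7, -0.6, 0.3])

print(cosine_similarity(great, good))  # close to 1: similar direction
print(cosine_similarity(great, bad))   # negative: roughly opposite direction
```

Because the measure only depends on the angle, rescaling a vector does not change its similarity to the others.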

KeyedVectors have a built-in [similarity](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.BaseKeyedVectors.similarity) method to compute the cosine similarity between two words.

In [8]:
# is great really closer to good than to bad ?
print("great and good:",w2v.wv.similarity("great","good"))
print("great and bad:",w2v.wv.similarity("great","bad"))

great and good: 0.76491594
great and bad: 0.4745189


Since cosine similarity measures closeness in the embedding space, neighboring words are supposed to be similar. The [most_similar](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.BaseKeyedVectors.most_similar) method returns the `topn` words closest to a query.

In [9]:
# The query can be as simple as a word, such as "movie"

# Try changing the word
w2v.wv.most_similar("movie",topn=5) # 5 most similar words
#w2v.wv.most_similar("awesome",topn=5)
#w2v.wv.most_similar("actor",topn=5)

[('film', 0.9290799498558044),
 ('"movie"', 0.8077793121337891),
 ('flick', 0.7757822871208191),
 ('movie,', 0.7666845321655273),
 ('dreck', 0.7309340834617615)]

But the query can be more complicated: word embedding spaces tend to encode much more than similarity, such as relations between words.

The most famous example is: `vec(king) - vec(man) + vec(woman) => vec(queen)`
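The arithmetic behind such a query can be sketched with hand-made vectors. Everything below is an assumption made for illustration: the 2-d vectors are built so that one axis loosely means "royalty" and the other "gender", and the helper `analogy` is a simplified stand-in, not Gensim's implementation.

```python
import numpy as np

# Hand-made 2-d vectors: axis 0 ~ "royalty", axis 1 ~ "gender". Illustration only.
vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def analogy(positive, negative):
    """Return the word closest (by cosine) to sum(positive) - sum(negative)."""
    query = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    query = query / np.linalg.norm(query)
    # The query words themselves are excluded from the candidates.
    candidates = [w for w in vectors if w not in positive + negative]
    return max(candidates,
               key=lambda w: np.dot(vectors[w], query) / np.linalg.norm(vectors[w]))

print(analogy(positive=["king", "woman"], negative=["man"]))  # queen
```

On real embeddings the relation only holds approximately, which is why the result is read off a ranked list of nearest neighbours.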

In [10]:
# What is movie - handsome + ugly ?
w2v.wv.most_similar(positive=["ugly","movie"],negative=["handsome"],topn=3)  

#w2v.wv.most_similar(positive=["actor","woman"],negative=["man"],topn=3) # does the famous example work for actor ?


# Try other things like plurals for example.


[('film', 0.7153176665306091),
 ('"movie"', 0.6472780108451843),
 ('pile', 0.6349601745605469)]

**To test learnt "syntactic" and "semantic" similarities, Mikolov et al. introduced a special dataset containing a wide variety of analogy questions, each made of four words (a is to b as c is to d).**

**You can download the dataset [here](https://thome.isir.upmc.fr/classes/RITAL/questions-words.txt).**

In [28]:
out = w2v.wv.evaluate_word_analogies("questions-words.txt",case_insensitive=True)  #original semantic syntactic dataset.

2024-02-07 09:54:41,273 : INFO : Evaluating word analogies for top 300000 words in the model on questions-words.txt


2024-02-07 09:54:41,568 : INFO : capital-common-countries: 4.4% (4/90)
2024-02-07 09:54:41,772 : INFO : capital-world: 0.0% (0/71)
2024-02-07 09:54:41,849 : INFO : currency: 0.0% (0/28)
2024-02-07 09:54:42,582 : INFO : city-in-state: 0.0% (0/329)
2024-02-07 09:54:43,238 : INFO : family: 34.5% (118/342)
2024-02-07 09:54:44,950 : INFO : gram1-adjective-to-adverb: 2.0% (19/930)
2024-02-07 09:54:45,997 : INFO : gram2-opposite: 2.2% (12/552)
2024-02-07 09:54:48,106 : INFO : gram3-comparative: 19.1% (241/1260)
2024-02-07 09:54:49,301 : INFO : gram4-superlative: 7.3% (51/702)
2024-02-07 09:54:50,420 : INFO : gram5-present-participle: 17.1% (129/756)
2024-02-07 09:54:51,515 : INFO : gram6-nationality-adjective: 2.9% (23/792)
2024-02-07 09:54:53,233 : INFO : gram7-past-tense: 14.3% (180/1260)
2024-02-07 09:54:54,317 : INFO : gram8-plural: 4.8% (39/812)
2024-02-07 09:54:55,396 : INFO : gram9-plural-verbs: 28.8% (218/756)
2024-02-07 09:54:55,399 : INFO : Quadruplets with out-of-vocabulary words: 

**The w2v model trained on the review dataset does not perform very well on this benchmark, because it was learnt from a small amount of data.**
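The value stored in `out` can also be inspected programmatically instead of reading the log. The sketch below assumes that `evaluate_word_analogies` returns a pair `(score, sections)` in which every section is a dict with the keys `'section'`, `'correct'` and `'incorrect'` (check this against your Gensim version). The helper `section_accuracies` and the `mock_out` data are made up so that the example runs without a trained model.

```python
def section_accuracies(sections):
    """Map each section name to its accuracy (None when nothing was evaluated)."""
    accuracies = {}
    for section in sections:
        n_correct = len(section["correct"])
        n_total = n_correct + len(section["incorrect"])
        accuracies[section["section"]] = n_correct / n_total if n_total else None
    return accuracies

# Hand-made stand-in for `out`, mimicking the assumed structure.
mock_out = (1 / 3, [
    {"section": "family",
     "correct": [("he", "she", "king", "queen")],
     "incorrect": [("boy", "girl", "uncle", "aunt")]},
    {"section": "currency",
     "correct": [],
     "incorrect": [("japan", "yen", "usa", "dollar")]},
])

score, sections = mock_out
print(section_accuracies(sections))  # {'family': 0.5, 'currency': 0.0}
```

With the real output, `section_accuracies(out[1])` would give the same per-section figures as the log lines above, under the stated assumption about the return format.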


## STEP 3: Loading a pre-trained model

In Gensim, embeddings are loaded and can be used via the ["KeyedVectors"](https://radimrehurek.com/gensim/models/keyedvectors.html) class

> Since trained word vectors are independent from the way they were trained (Word2Vec, FastText, WordRank, VarEmbed etc), they can be represented by a standalone structure, as implemented in this module.

>The structure is called “KeyedVectors” and is essentially a mapping between entities and vectors. Each entity is identified by its string id, so this is a mapping between {str => 1D numpy array}.

>The entity typically corresponds to a word (so the mapping maps words to 1D vectors), but for some models, the key can also correspond to a document, a graph node etc. To generalize over different use-cases, this module calls the keys entities. Each entity is always represented by its string id, no matter whether the entity is a word, a document or a graph node.

**You can download the pre-trained word embeddings [HERE](https://thome.isir.upmc.fr/classes/RITAL/word2vec-google-news-300.dat).**

In [4]:
import gensim.downloader as api
from gensim.models import KeyedVectors

bload = True  # True: load the local copy; False: download with gensim.downloader, then save it
fname = "word2vec-google-news-300"
sdir = "" # Change to the directory holding the .dat file

if bload:
    wv_pre_trained = KeyedVectors.load(sdir+fname+".dat")
else:
    wv_pre_trained = api.load(fname)
    wv_pre_trained.save(sdir+fname+".dat")
    

2024-02-08 00:16:06,203 : INFO : loading KeyedVectors object from word2vec-google-news-300.dat


2024-02-08 00:16:09,118 : INFO : loading vectors from word2vec-google-news-300.dat.vectors.npy with mmap=None
2024-02-08 00:16:18,156 : INFO : KeyedVectors lifecycle event {'fname': 'word2vec-google-news-300.dat', 'datetime': '2024-02-08T00:16:18.156747', 'gensim': '4.3.2', 'python': '3.11.7 (tags/v3.11.7:fa7a6f2, Dec  4 2023, 19:24:49) [MSC v.1937 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22621-SP0', 'event': 'loaded'}


**Perform the "syntactic" and "semantic" evaluations again. What do you conclude about the pre-trained embeddings?**

In [40]:
out = wv_pre_trained.evaluate_word_analogies("questions-words.txt",case_insensitive=True)  #original semantic syntactic dataset.

2024-02-07 10:24:57,444 : INFO : Evaluating word analogies for top 300000 words in the model on questions-words.txt
2024-02-07 10:25:09,370 : INFO : capital-common-countries: 83.2% (421/506)
2024-02-07 10:26:20,456 : INFO : capital-world: 81.3% (3552/4368)
2024-02-07 10:26:33,436 : INFO : currency: 28.5% (230/808)
2024-02-07 10:27:12,555 : INFO : city-in-state: 72.1% (1779/2467)
2024-02-07 10:27:20,487 : INFO : family: 86.2% (436/506)
2024-02-07 10:27:37,104 : INFO : gram1-adjective-to-adverb: 29.2% (290/992)
2024-02-07 10:27:50,769 : INFO : gram2-opposite: 43.5% (353/812)


KeyboardInterrupt: 

## STEP 4:  sentiment classification

In the previous practical session, we used a bag of words approach to transform text into vectors.
Here, we propose to use word vectors instead (previously learnt or loaded).


### <font color='green'> Since we only have word vectors and sentences are made of multiple words, we need to aggregate them. </font>


### (1) Vectorize reviews using word vectors:

Word aggregation can be done in different ways:

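- Sum
- Average
- Element-wise min (per feature)
- Element-wise max (per feature)

#### A few pointers:

- `w2v.wv.key_to_index` is a `dict` mapping every word of the vocabulary to its index (Gensim 4 removed the older `w2v.wv.vocab`); `word in w2v.wv` tests whether a word is known
- `np.minimum(a,b)` and `np.maximum(a,b)` respectively return the element-wise min/max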
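The four aggregations listed above can be sketched on a small matrix of word vectors. The values are made up for illustration; each row stands for the vector of one word of a review.

```python
import numpy as np

# Made-up word vectors for a three-word review (rows = words, columns = features).
word_vectors = np.array([
    [0.2, -0.5, 0.1],
    [0.6, 0.3, -0.4],
    [-0.1, 0.8, 0.0],
])

aggregations = {
    "sum": word_vectors.sum(axis=0),
    "mean": word_vectors.mean(axis=0),
    "min": word_vectors.min(axis=0),  # same result as chaining np.minimum over the rows
    "max": word_vectors.max(axis=0),  # same result as chaining np.maximum over the rows
}
for name, vec in aggregations.items():
    print(name, vec)
```

Whatever the aggregation, the review vector has the same dimension as a single word vector, so reviews of any length become comparable.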

In [13]:
import numpy as np
from sklearn.model_selection import train_test_split
import nltk
from nltk.tokenize import word_tokenize
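# We first need to vectorize text:
# here we aggregate the word vectors of a review by averaging (or summing) them

def randomvec():
    # Random unit vector: a possible stand-in for out-of-vocabulary words (not used below)
    default = np.random.randn(300)
    default = default  / np.linalg.norm(default)
    return default

def vectorize(text,mean=True):
    """
    Vectorize one review by aggregating the pre-trained vectors of its words.

    input: str
    output: np.array(float), the mean of the word vectors (their sum if mean=False)
    """    
    # word_tokenize requires the NLTK Punkt tokenizer data (see nltk.download)
    words=word_tokenize(text.lower())
    c=0
    vec=np.zeros(wv_pre_trained["the"].shape)
    for word in words:
        if word in wv_pre_trained:   # skip out-of-vocabulary words
            vec+=wv_pre_trained[word]
            c+=1
 
    if mean and c>0:                 # dividing when c == 0 would produce NaNs
        vec=vec/c
    return vec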
    
train,test=train_test_split(data,test_size=0.2,train_size=0.8)
classes = [pol for text,pol in train]
X = [vectorize(text) for text,pol in train]
X_test = [vectorize(text) for text,pol in test]
true = [pol for text,pol in test]

#let's see what a review vector looks like.
print(X[0])

[ 0.01765296  0.01079974  0.01314081  0.07143536 -0.04485846  0.01552931
  0.02808714 -0.0593172   0.04548616  0.05103862 -0.03631634 -0.12405038
 -0.02113517  0.02423506 -0.08011907  0.07529658  0.04926082  0.0994224
 -0.03263602 -0.02624044 -0.02225281  0.05222666  0.03915137 -0.01552693
  0.03860121 -0.01445112 -0.08239678  0.0475596   0.04513919  0.02161515
 -0.01342563  0.0030255  -0.04965216  0.00118342  0.04265644 -0.01144844
  0.0072607   0.03666089  0.04948972  0.04719726  0.09572689 -0.03390986
  0.06955991  0.00705163 -0.0150942  -0.01117392 -0.04584522  0.0035457
  0.03867861  0.00056748  0.0074515   0.0395701  -0.01166216  0.00562643
  0.00404136  0.0105268  -0.03696223 -0.04637394  0.01718493 -0.05615671
  0.00701199  0.10361707 -0.08462278 -0.05248498 -0.03134738 -0.00143203
 -0.03514716  0.06317281  0.00251136  0.05226024  0.04034332  0.02064062
  0.02458275  0.01235348 -0.09101634 -0.02428893  0.06711914  0.08997778
  0.03897506  0.11832432  0.00269912 -0.06010424  0.0

### (2) Train a classifier 
As in the previous practical session, train a logistic regression to do sentiment classification, this time with word vectors as input.



In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

scaler=StandardScaler()

X_scaled=scaler.fit_transform(X)
X_test_scaled=scaler.transform(X_test)

classifier=LogisticRegression(max_iter=1000)
classifier.fit(X_scaled,classes)
y_pred=classifier.predict(X_test_scaled)
accuracy=accuracy_score(true,y_pred)

print("Accuracy:", accuracy)




Accuracy: 0.8694


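Performance is expected to be worse than with bag of words (~80%), although the run above reaches about 87% with the pre-trained vectors. Sum/mean aggregation does not work well on long reviews: every word gets the same weight, so the many frequent, uninformative words add a lot of noise.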

## **Todo**: Try answering the following questions:

- Which word2vec model works best: skip-gram or cbow?
- Do pre-trained vectors work better than those learnt on the training dataset?



**(Bonus)** To get a better accuracy, we could try a few things:
- Better aggregation methods (weight by tf-idf ?)
- Another word vectorizing method such as [fasttext](https://radimrehurek.com/gensim/models/fasttext.html)
- A document vectorizing method such as [Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html)
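As a starting point for the first idea, here is a minimal sketch of an idf-weighted average. It is not the method used above: the idf formula `log(N / df) + 1` is one common variant chosen as an assumption, the helper names are hypothetical, and the 2-d vectors are made up to stand in for real embeddings. Term frequency is accounted for implicitly, since every occurrence of a word contributes to the average.

```python
import math
from collections import Counter

import numpy as np

def idf_weights(tokenized_docs):
    """idf(w) = log(N / df(w)) + 1, where df(w) counts the documents containing w."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    return {word: math.log(n_docs / df) + 1.0 for word, df in doc_freq.items()}

def weighted_average(tokens, vectors, idf):
    """Average the word vectors of `tokens`, weighting each occurrence by its idf."""
    known = [t for t in tokens if t in vectors and t in idf]
    if not known:
        return np.zeros(len(next(iter(vectors.values()))))
    weights = np.array([idf[t] for t in known])
    stacked = np.stack([vectors[t] for t in known])
    return weights @ stacked / weights.sum()

docs = [["the", "movie", "was", "great"], ["the", "movie", "was", "awful"]]
# Made-up 2-d vectors standing in for real embeddings.
vectors = {"the": np.array([0.0, 0.0]), "movie": np.array([0.1, 0.1]),
           "was": np.array([0.0, 0.1]), "great": np.array([1.0, 0.5]),
           "awful": np.array([-1.0, -0.5])}

idf = idf_weights(docs)
print(weighted_average(docs[0], vectors, idf))
```

Words that appear in every document ("the", "movie", "was") get the smallest weight, so the review vector moves towards the rarer, more informative words.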