# NLP & representation learning: Neural Embeddings, Text Classification


To use statistical classifiers with text, the text must first be vectorized. In the first practical session we explored the **Bag of Words (BoW)** model.

Modern **state-of-the-art** methods use embeddings to vectorize the text before classification, in order to avoid feature engineering.

## [Dataset](https://thome.isir.upmc.fr/classes/RITAL/json_pol.json)


## "Modern" NLP pipeline

In contrast to the **bag of words** model, in the modern NLP pipeline everything is **embeddings**. Instead of encoding a text as a **sparse vector** of length $D$ (the size of the feature dictionary), the goal is to encode the text as a meaningful dense vector of much smaller size $|e| \ll D$.


The raw classification pipeline is then the following:

```
raw text ---|embedding table|-->  vectors --|Neural Net|--> class
```
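As a rough illustration of this pipeline, the sketch below uses a hand-made embedding table and a single logistic unit in place of the neural net. All names and values (`embedding_table`, `embed`, `classify`, the weights) are made up for the example and use only the standard library; a real pipeline would learn both the table and the classifier.

```python
import math

# Toy embedding table: word -> dense vector (values are invented for illustration)
embedding_table = {
    "great": [0.9, 0.1],
    "film": [0.1, 0.3],
    "boring": [-0.8, 0.2],
}

def embed(text):
    """Look up each known word and average the vectors (one simple aggregation)."""
    vectors = [embedding_table[w] for w in text.lower().split() if w in embedding_table]
    if not vectors:
        return [0.0, 0.0]  # no known word: fall back to a zero vector
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def classify(vector, weights=(2.0, 0.0), bias=0.0):
    """Stand-in for the neural net: one logistic unit with hand-picked weights."""
    score = sum(w * x for w, x in zip(weights, vector)) + bias
    probability = 1.0 / (1.0 + math.exp(-score))
    return 1 if probability >= 0.5 else 0

print(classify(embed("great film")))   # 1 (positive class)
print(classify(embed("boring film")))  # 0 (negative class)
```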


### Using a language model:

How to tokenize the text and extract a feature dictionary is still a manual task. To obtain meaningful embeddings directly, it is common to use a pre-trained language model such as `word2vec`, which we explore in this practical.

In this setting, the pipeline becomes the following:
```
      
raw text ---|(pre-trained) Language Model|--> vectors --|classifier (or fine-tuning)|--> class
```


- #### Classic word embeddings

 - [Word2Vec](https://arxiv.org/abs/1301.3781)
 - [GloVe](https://nlp.stanford.edu/projects/glove/)


- #### Bleeding-edge language model techniques (see next)

 - [ULMFiT](https://arxiv.org/abs/1801.06146)
 - [ELMo](https://arxiv.org/abs/1802.05365)
 - [GPT](https://blog.openai.com/language-unsupervised/)
 - [BERT](https://arxiv.org/abs/1810.04805)






### Goal of this session:

1. Train word embeddings on the training dataset
2. Tinker with the learnt embeddings and inspect the relations they capture
3. Tinker with pre-trained embeddings.
4. Use those embeddings for classification
5. Compare different embedding models

## STEP 0: Loading data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import json
from collections import Counter

# Loading json
file = '/content/drive/My Drive/json_pol.json'
with open(file,encoding="utf-8") as f:
    data = json.load(f)


# Quick Check
counter = Counter((x[1] for x in data))
print("Number of reviews : ", len(data))
print("----> # of positive : ", counter[1])
print("----> # of negative : ", counter[0])
print("")
print(data[0])


Number of reviews :  25000
----> # of positive :  12500
----> # of negative :  12500

['Although credit should have been given to Dr. Seuess for stealing the story-line of "Horton Hatches The Egg", this was a fine film. It touched both the emotions and the intellect. Due especially to the incredible performance of seven year old Justin Henry and a script that was sympathetic to each character (and each one\'s predicament), the thought provoking elements linger long after the tear jerking ones are over. Overall, superior acting from a solid cast, excellent directing, and a very powerful script. The right touches of humor throughout help keep a "heavy" subject from becoming tedious or difficult to sit through. Lastly, this film stands the test of time and seems in no way dated, decades after it was released.', 1]



## Word2Vec: Quick Recap

**[Word2Vec](https://arxiv.org/abs/1301.3781) is composed of two distinct language models (CBOW and SG), optimized to quickly learn word vectors**


Given an example sentence: `i'm taking the dog out for a walk`



### (a) Continuous Bag of Words (CBOW)
- predicts a word given its context

maximizing `p(dog | i'm taking the ___ out for a walk)`

### (b) Skip-Gram (SG)
- predicts the context given a word

maximizing `p(i'm taking the ___ out for a walk | dog)`
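To make the two objectives concrete, here is a minimal sketch that enumerates the training examples each variant would see for the sentence above. It assumes a symmetric context window of 2 words; `training_pairs` is a hypothetical helper written for this illustration and is not part of gensim, which also applies subsampling and negative sampling on top of this.

```python
def training_pairs(tokens, window=2):
    """Return (cbow_pairs, sg_pairs) for one tokenized sentence.

    cbow_pairs: (context_words, target_word), one example per position
    sg_pairs:   (center_word, context_word), one example per neighbour
    """
    cbow_pairs, sg_pairs = [], []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        cbow_pairs.append((context, target))            # CBOW: context -> word
        sg_pairs.extend((target, c) for c in context)   # SG: word -> each context word
    return cbow_pairs, sg_pairs

tokens = "i'm taking the dog out for a walk".split()
cbow_pairs, sg_pairs = training_pairs(tokens, window=2)

print(cbow_pairs[3])
# (['taking', 'the', 'out', 'for'], 'dog')
print([pair for pair in sg_pairs if pair[0] == "dog"])
# [('dog', 'taking'), ('dog', 'the'), ('dog', 'out'), ('dog', 'for')]
```

CBOW produces one example per word, while Skip-Gram produces one example per (word, neighbour) pair.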



   

## STEP 1: train a language model (word2vec)

Gensim provides one of the fastest [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) implementations.


### Train:

In [None]:
# if gensim not installed yet
# ! pip install gensim

In [None]:
import gensim
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

text = [t.split() for t,p in data]

# gensim's default configuration, except sg=1 (skip-gram instead of the default CBOW)
w2v = gensim.models.word2vec.Word2Vec(sentences=text,
                                vector_size=100, window=5,
                                min_count=5,
                                sample=0.001, workers=3,
                                sg=1, hs=0, negative=5,        ### sg=1 trains a skip-gram model, sg=0 a cbow model
                                cbow_mean=1, epochs=5)

In [None]:
# It is worth saving the trained embeddings
w2v.save("W2v-movies.dat")
# You will be able to reload them later:
# w2v = gensim.models.Word2Vec.load("W2v-movies.dat")
# and continue the training process if needed

## STEP 2: Test learnt embeddings

The word embedding space directly encodes similarities between words: the vector coding for the word "great" will be closer to the vector coding for "good" than to the one coding for "bad". Generally, [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) is the measure used to compare vectors: the higher it is, the closer the two words.
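For reference, cosine similarity is the dot product of two vectors divided by the product of their norms. The sketch below computes it with the standard library only; the three vectors are invented toy values, not embeddings from the trained model.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||), a value between -1 and 1."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings" (made-up values)
great = [0.9, 0.8, 0.1]
good = [0.8, 0.9, 0.2]
bad = [-0.7, 0.6, 0.3]

print("great and good:", cosine_similarity(great, good))
print("great and bad:", cosine_similarity(great, bad))
```

Because the measure depends only on the angle between vectors, scaling a vector does not change its similarity to the others.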

KeyedVectors have a built-in [similarity](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.BaseKeyedVectors.similarity) method to compute the cosine similarity between words.

In [None]:
# is great really closer to good than to bad ?
print("great and good:",w2v.wv.similarity("great","good"))
print("great and bad:",w2v.wv.similarity("great","bad"))

great and good: 0.7647952
great and bad: 0.44735837


Since cosine similarity encodes semantic closeness, neighboring words are supposed to be similar. The [most_similar](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.BaseKeyedVectors.most_similar) method returns the `topn` most similar words for a given query.

In [None]:
# The query can be as simple as a word, such as "movie"

# Try changing the word
print(w2v.wv.most_similar("movie",topn=5)) # 5 most similar words
print(w2v.wv.most_similar("awesome",topn=5))
print(w2v.wv.most_similar("actor",topn=5))

[('film', 0.9303047060966492), ('"movie"', 0.7820390462875366), ('flick', 0.7710607051849365), ('movie,', 0.7621265053749084), ('movie...', 0.7466787099838257)]
[('amazing', 0.791512131690979), ('excellent', 0.7320544123649597), ('hoot.', 0.7044680118560791), ('terrific', 0.7000255584716797), ('incredible', 0.6958985328674316)]
[('actor,', 0.8249809741973877), ('actress', 0.7571878433227539), ('Reeves', 0.7537413239479065), ('actor.', 0.7450438141822815), ('Hopper', 0.7392629981040955)]


But the query can be more complicated: word embedding spaces tend to encode much more than plain similarity.

The most famous example is: `vec(king) - vec(man) + vec(woman) => vec(queen)`
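The arithmetic behind such a query can be sketched as follows: add the "positive" vectors, subtract the "negative" ones, then rank the remaining words by cosine similarity to the result. The `analogy` helper and the 2-dimensional toy vectors are assumptions made for this illustration; gensim's `most_similar` follows the same idea but differs in details such as vector normalization.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, positive, negative, topn=1):
    """Rank words by similarity to sum(positive) - sum(negative), excluding the query words."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    ranked = sorted(candidates, key=lambda w: cosine(query, vectors[w]), reverse=True)
    return ranked[:topn]

# Made-up vectors: first axis ~ "royalty", second axis ~ "gender"
toy = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.0],
}

print(analogy(toy, positive=["king", "woman"], negative=["man"]))  # ['queen']
```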

In [None]:
# What is awesome - good + bad ?
w2v.wv.most_similar(positive=["awesome","bad"],negative=["good"],topn=3)

[('awful', 0.7691811323165894),
 ('horrible', 0.6881365776062012),
 ('terrible', 0.6497606635093689)]

In [None]:
w2v.wv.most_similar(positive=["actor","woman"],negative=["man"],topn=3) # does the famous example work for actor?


[('actress', 0.8533424139022827),
 ('actress,', 0.7552055716514587),
 ('Glover', 0.6598100662231445)]

In [None]:
w2v.wv.most_similar(positive=["mother","male"],negative=["female"],topn=3) # does the analogy also work for mother - female + male?


[('sister', 0.7353665828704834),
 ('father', 0.7133386135101318),
 ('sister,', 0.7128931283950806)]

**To test the learnt "syntactic" and "semantic" similarities, Mikolov et al. introduced a dedicated dataset containing a wide variety of analogy questions.**

**You can download the dataset [here](https://thome.isir.upmc.fr/classes/RITAL/questions-words.txt).**

In [None]:
out = w2v.wv.evaluate_word_analogies("/content/drive/My Drive/questions-words.txt",case_insensitive=True)  #original semantic syntactic dataset.

In [None]:
out[0]

0.12188940092165898

In [None]:
# out[1] holds the per-section results; out[1][0] is the first section only
print("Correct :",len(out[1][0]['correct']))
print("Incorrect :", len(out[1][0]['incorrect']))


Correct : 2
Incorrect : 88


Very poor performance.

**The w2v model trained on the review dataset does not perform well on this benchmark, since it was learnt from a small amount of data.**
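The cells above look at the overall score `out[0]` and at the first section `out[1][0]` only. A per-section breakdown can be computed along the following lines. This is a sketch on a mocked structure: the `'correct'` and `'incorrect'` keys appear in the cells above, while the `'section'` key and the sample entries are assumptions for the illustration, and `section_accuracies` is a hypothetical helper.

```python
def section_accuracies(sections):
    """Map each section name to (n_correct, n_total, accuracy)."""
    summary = {}
    for sec in sections:
        n_correct = len(sec["correct"])
        n_total = n_correct + len(sec["incorrect"])
        accuracy = n_correct / n_total if n_total else 0.0
        summary[sec["section"]] = (n_correct, n_total, accuracy)
    return summary

# Mocked stand-in for out[1]; each entry is an (a, b, c, expected) analogy
mock_sections = [
    {"section": "capital-common-countries",
     "correct": [("ATHENS", "GREECE", "PARIS", "FRANCE")],
     "incorrect": [("ATHENS", "GREECE", "ROME", "ITALY"),
                   ("ATHENS", "GREECE", "OSLO", "NORWAY")]},
    {"section": "family",
     "correct": [("BOY", "GIRL", "KING", "QUEEN"),
                 ("BOY", "GIRL", "FATHER", "MOTHER")],
     "incorrect": []},
]

for name, (n_correct, n_total, acc) in section_accuracies(mock_sections).items():
    print(f"{name}: {n_correct}/{n_total} correct ({acc:.2%})")
```

With the real output, the same function could be called as `section_accuracies(out[1])`, provided the section dictionaries follow the assumed layout.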


## STEP 3: Loading a pre-trained model

In Gensim, embeddings are loaded and can be used via the ["KeyedVectors"](https://radimrehurek.com/gensim/models/keyedvectors.html) class

> Since trained word vectors are independent from the way they were trained (Word2Vec, FastText, WordRank, VarEmbed etc), they can be represented by a standalone structure, as implemented in this module.

>The structure is called “KeyedVectors” and is essentially a mapping between entities and vectors. Each entity is identified by its string id, so this is a mapping between {str => 1D numpy array}.

>The entity typically corresponds to a word (so the mapping maps words to 1D vectors), but for some models, they key can also correspond to a document, a graph node etc. To generalize over different use-cases, this module calls the keys entities. Each entity is always represented by its string id, no matter whether the entity is a word, a document or a graph node.

**You can download the pre-trained word embeddings [HERE](https://thome.isir.upmc.fr/classes/RITAL/word2vec-google-news-300.dat).**

In [None]:
#from gensim.test.utils import get_tmpfile
import gensim.downloader as api
from gensim.models import KeyedVectors

bload = True  # True: load the vectors already saved on disk; False: download them, then save them
fname = "word2vec-google-news-300"
sdir = "/content/drive/My Drive/" # Change to your own directory

if bload:
    wv_pre_trained = KeyedVectors.load(sdir+fname+".dat")
else:
    wv_pre_trained = api.load(fname)
    wv_pre_trained.save(sdir+fname+".dat")


**Perform the "syntactic" and "semantic" evaluations again. Conclude on the pre-trained embeddings.**

In [None]:
out = wv_pre_trained.evaluate_word_analogies("/content/drive/My Drive/questions-words.txt",case_insensitive=True)  #original semantic syntactic dataset.

In [None]:
out[0]

0.7401448525607863

In [None]:
# out[1][0] is the first section only; out[0] is the overall score
print("Correct :",len(out[1][0]['correct']))
print("Incorrect :", len(out[1][0]['incorrect']))


Correct : 421
Incorrect : 85


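Here we expect better results, and this is indeed the case: the overall score rises from about 0.12 with the embeddings trained on the reviews to about 0.74 with the pre-trained embeddings.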

## STEP 4:  sentiment classification

In the previous practical session, we used a bag of words approach to transform text into vectors.
Here, we propose to use word vectors instead (previously learnt or loaded).


### <font color='green'> Since we only have word vectors and sentences are made of multiple words, we need to aggregate them. </font>


### (1) Vectorize reviews using word vectors:

Word aggregation can be done in different ways:

- Sum
- Average
- Min/feature
- Max/feature

#### a few pointers:

- `w2v.wv.key_to_index` is a `dict` mapping each word of the vocabulary (all existing words in your model) to its index; `word in w2v.wv` also tests membership
- `np.minimum(a,b)` and `np.maximum(a,b)` respectively return the element-wise min and max
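The four aggregations listed above can be sketched on toy vectors as follows. The example is dependency-free and the `aggregate` helper is hypothetical; with NumPy, the same results would be obtained with `np.sum`, `np.mean`, `np.min` and `np.max` called with `axis=0`.

```python
def aggregate(vectors, how="mean"):
    """Combine a list of equal-length word vectors into a single vector, feature by feature."""
    columns = list(zip(*vectors))  # one tuple per embedding dimension
    if how == "sum":
        return [sum(col) for col in columns]
    if how == "mean":
        return [sum(col) / len(col) for col in columns]
    if how == "min":
        return [min(col) for col in columns]
    if how == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown aggregation: {how}")

# Three made-up word vectors of dimension 3
word_vectors = [
    [1.0, -2.0, 0.5],
    [3.0, 0.0, -0.5],
    [2.0, 4.0, 0.0],
]

for how in ("sum", "mean", "min", "max"):
    print(how, aggregate(word_vectors, how))
```

Whatever the number of words in the review, the result always has the dimension of a single word vector, which is what the classifier needs.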

In [None]:
'the' in w2v.wv


True

In [None]:
import pandas as pd
df = pd.DataFrame(data)
df[df[1]==0]

| | 0 | 1 |
|---|---|---|
| 12500 | The Plot: A group of young people with ridicul... | 0 |
| 12501 | Being stuck in bed with the flu and feeling to... | 0 |
| 12502 | The story was disjointed, the acting was not o... | 0 |
| 12503 | The great Yul Brynner, who won an 'Oscar', and... | 0 |
| 12504 | Just what the world needed-another superficial... | 0 |
| ... | ... | ... |
| 24995 | I wasn't impressed with the Graffiti Artist, d... | 0 |
| 24996 | Here is a rundown of a typical Rachael Ray Sho... | 0 |
| 24997 | Wow...not in a good way.<br /><br />I can't be... | 0 |
| 24998 | This movie made me want to bang my head agains... | 0 |


In [None]:
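import numpy as np
# We first need to vectorize the text:
# the word vectors of a review are aggregated (mean by default; np.sum, np.min and np.max also work)


def vectorize(text,model,agreg_fonction=np.mean):
    """
    Vectorize one review by aggregating the vectors of its in-vocabulary words.

    input: str (raw review), trained Word2Vec model, aggregation function
    output: np.array(float) of length model.wv.vector_size
    """
    # Keep only the words known to the model
    vec = [model.wv[word] for word in text.split() if word in model.wv.key_to_index]
    if not vec:
        # No known word: return a zero vector instead of aggregating an empty list
        return np.zeros(model.wv.vector_size)
    return agreg_fonction(vec,axis=0)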


# Build the train and test sets (positive and negative classes are evenly distributed in both)
from sklearn.model_selection import train_test_split

positive_reviews = data[:12500]
negative_reviews = data[12500:]

positive_train, positive_test = train_test_split(positive_reviews, train_size=0.5, test_size=0.5, random_state=42)

negative_train, negative_test = train_test_split(negative_reviews, train_size=0.5, test_size=0.5, random_state=42)

train = positive_train + negative_train
test = positive_test + negative_test

classes = [pol for text,pol in train]
X = [vectorize(text,w2v) for text,pol in train]
X_test = [vectorize(text,w2v) for text,pol in test]
true = [pol for text,pol in test]

#let's see what a review vector looks like.
print(X[0])

[-1.99517943e-02  6.40816689e-02 -5.72315119e-02 -1.95959471e-02
 -1.16117215e-02 -2.41031080e-01  1.33945681e-02  4.49112952e-01
 -1.49077341e-01 -9.99534950e-02  4.95982729e-02 -2.48819768e-01
 -8.83613303e-02  6.06548116e-02  2.45031882e-02 -5.67322336e-02
  4.09239158e-02 -8.65475237e-02 -2.06472035e-02 -4.07575011e-01
  1.65603682e-01  4.92223687e-02  3.37977082e-01 -1.61409289e-01
 -7.33310059e-02  1.35707781e-01 -1.48534164e-01  1.26234710e-01
 -1.78640395e-01  1.85368046e-01  1.66294202e-01  1.85749494e-02
  3.91112082e-02 -3.05938959e-01 -2.50995569e-02  1.17375433e-01
  1.29480734e-01 -1.05554678e-01 -1.98790848e-01 -3.42679292e-01
  5.83587661e-02 -2.78670311e-01 -1.28435180e-01  8.11256543e-02
  1.93677723e-01 -1.18101552e-01 -1.55706257e-01  1.32942963e-02
  4.88838367e-02  4.43212129e-02  2.26506032e-02 -1.72752813e-01
  7.74928704e-02 -1.03005521e-01 -9.69294012e-02 -1.63817108e-02
  9.35442299e-02  5.28341308e-02 -4.22388723e-04  1.20487340e-01
  2.14140192e-02 -5.09121

In [None]:
print(len(X[0]))

100


### (2) Train a classifier
As in the previous practical session, train a logistic regression to perform sentiment classification with word vectors.



In [None]:


from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Scikit Logistic Regression
t = 1e-8
C=100.0
lr_clf = LogisticRegression(random_state=0, solver='liblinear',max_iter=100, tol=t, C=C)

lr_clf.fit(X, classes)
pred_lrt = lr_clf.predict(X)
pred_lr = lr_clf.predict(X_test)
print(f"Logistic Regression accuracy train={accuracy_score(classes, pred_lrt)}, accuracy test={accuracy_score(true, pred_lr)}")


Logistic Regression accuracy train=0.83448, accuracy test=0.82616


Performance should be worse than with bag of words (~80%). Sum/mean aggregation does not work well on long reviews, especially those containing many frequent words, which add a lot of noise.

## **Todo** :  Try answering the following questions:

- Which word2vec model works best: skip-gram or cbow?
- Do pre-trained vectors work better than those learnt on the train dataset?
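One practical detail for the second question: the `vectorize` function above reads `model.wv`, whereas the pre-trained embeddings are loaded directly as a `KeyedVectors` object. A small adapter such as the hypothetical `get_keyed_vectors` below lets the same code handle both cases. The two toy classes are stand-ins written for this sketch so that it runs without gensim; they only mimic the `key_to_index` attribute and the `[]` lookup used in this notebook.

```python
def get_keyed_vectors(model):
    """Return `model.wv` for a full Word2Vec model, or `model` itself
    when it is already a KeyedVectors-like object."""
    return getattr(model, "wv", model)

class ToyKeyedVectors:
    """Hypothetical stand-in exposing the two members used by `vectorize`."""
    def __init__(self, table):
        self._table = table
        self.key_to_index = {word: i for i, word in enumerate(table)}

    def __getitem__(self, word):
        return self._table[word]

class ToyWord2Vec:
    """Hypothetical stand-in for a trained model: the vectors live in `.wv`."""
    def __init__(self, table):
        self.wv = ToyKeyedVectors(table)

table = {"good": [1.0, 0.0], "bad": [-1.0, 0.0]}
trained = ToyWord2Vec(table)          # plays the role of `w2v`
pre_trained = ToyKeyedVectors(table)  # plays the role of `wv_pre_trained`

for model in (trained, pre_trained):
    kv = get_keyed_vectors(model)
    print("good" in kv.key_to_index, kv["good"])
```

Inside `vectorize`, replacing `model.wv` with `get_keyed_vectors(model)` would then allow calls with either `w2v` or `wv_pre_trained`.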


In [None]:
w2v_cbow = gensim.models.word2vec.Word2Vec(sentences=text,
                                vector_size=100, window=5,               ### here we train a cbow model
                                min_count=5,
                                sample=0.001, workers=3,
                                sg=0, hs=0, negative=5,        ### set sg to 1 to train a sg model
                                cbow_mean=1, epochs=5)


In [None]:
w2v_sg = w2v

In [None]:
def evaluate(train,test,model,agreg_fct=np.mean):
    """Vectorize train/test with `model`, fit the logistic regression and print both accuracies."""
    classes = [pol for text,pol in train]
    X = [vectorize(text,model,agreg_fct) for text,pol in train]
    X_test = [vectorize(text,model,agreg_fct) for text,pol in test]
    true = [pol for text,pol in test]
    lr_clf.fit(X, classes)  # re-fits the global classifier defined above
    pred_lrt = lr_clf.predict(X)
    pred_lr = lr_clf.predict(X_test)
    print(f"\t\tLogistic Regression accuracy train={accuracy_score(classes, pred_lrt)}, accuracy test={accuracy_score(true, pred_lr)}")


In [None]:
functions_verbose = ['sum', 'mean', 'min', 'max']
models_verbose = ["Skip-Gram (SG)","Continuous Bag of Words (CBOW)" ]
models = [w2v_sg,w2v_cbow]
functions = [np.sum,np.mean,np.min,np.max]
# Evaluate every (aggregation function, model) pair
for fct_name, fct in zip(functions_verbose, functions):
    for model_name, m in zip(models_verbose, models):
        print("Aggregation function : ",fct_name)
        print("\t",model_name," : ")
        evaluate(train,test,m,fct)

Aggregation function :  sum
	 Skip-Gram (SG)  : 
		Logistic Regression accuracy train=0.83368, accuracy test=0.82688
Aggregation function :  sum
	 Continuous Bag of Words (CBOW)  : 
		Logistic Regression accuracy train=0.78, accuracy test=0.77544
Aggregation function :  mean
	 Skip-Gram (SG)  : 
		Logistic Regression accuracy train=0.83448, accuracy test=0.82616
Aggregation function :  mean
	 Continuous Bag of Words (CBOW)  : 
		Logistic Regression accuracy train=0.78112, accuracy test=0.7764
Aggregation function :  min
	 Skip-Gram (SG)  : 
		Logistic Regression accuracy train=0.71344, accuracy test=0.71264
Aggregation function :  min
	 Continuous Bag of Words (CBOW)  : 
		Logistic Regression accuracy train=0.66656, accuracy test=0.66016
Aggregation function :  max
	 Skip-Gram (SG)  : 
		Logistic Regression accuracy train=0.72632, accuracy test=0.71136
Aggregation function :  max
	 Continuous Bag of Words (CBOW)  : 
		Logistic Regression accuracy train=0.66608, accuracy test=0.6568

We observe that the Skip-Gram model gives better results than CBOW for every aggregation function, and that sum and mean clearly outperform min and max.


**(Bonus)** To get a better accuracy, we could try a few things:
- Better aggregation methods (weight by tf-idf ?)
- Another word vectorizing method such as [fasttext](https://radimrehurek.com/gensim/models/fasttext.html)
- A document vectorizing method such as [Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html)
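As one possible reading of the first suggestion, the sketch below weights each word vector by its inverse document frequency before averaging, so that very frequent words contribute less. Since every occurrence of a word is counted, the weighting amounts to term frequency times IDF. The smoothed IDF formula, the helper names (`idf_weights`, `weighted_average`) and the toy data are all assumptions for this illustration; nothing here was evaluated on the review dataset.

```python
import math
from collections import Counter

def idf_weights(tokenized_docs):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}

def weighted_average(tokens, vectors, idf):
    """Average the vectors of known words, each occurrence weighted by its IDF."""
    dim = len(next(iter(vectors.values())))
    total, weight_sum = [0.0] * dim, 0.0
    for word in tokens:
        if word in vectors and word in idf:
            total = [t + idf[word] * x for t, x in zip(total, vectors[word])]
            weight_sum += idf[word]
    return [t / weight_sum for t in total] if weight_sum else total

# Toy corpus and made-up 2-dimensional word vectors
docs = [
    "the movie was great".split(),
    "the movie was awful".split(),
    "the plot".split(),
]
vectors = {"the": [0.0, 1.0], "great": [1.0, 0.0], "awful": [-1.0, 0.0]}

idf = idf_weights(docs)
print(idf["the"], idf["great"])                 # "the" gets the smallest weight
print(weighted_average(docs[0], vectors, idf))  # pulled towards "great"
```

Compared with the plain mean `[0.5, 0.5]` of "the" and "great", the weighted vector leans towards the rarer, more informative word.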