# NLP & representation learning: Neural Embeddings, Text Classification


To use statistical classifiers with text, the text must first be vectorized. In the first practical session we explored the **Bag of Words (BoW)** model.

Modern **state-of-the-art** methods use embeddings to vectorize the text before classification, in order to avoid feature engineering.

## [Dataset](https://thome.isir.upmc.fr/classes/RITAL/json_pol)


## "Modern" NLP pipeline

In contrast to the **bag-of-words** model, in the modern NLP pipeline everything is an **embedding**. Instead of encoding a text as a **sparse vector** of length $D$ (the size of the feature dictionary), the goal is to encode the text as a meaningful dense vector of small size $|e| \ll D$.
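The difference between the two encodings can be illustrated with a minimal sketch. Everything in it is a toy assumption: the six-word vocabulary, the 2-dimensional embedding table (random here, whereas real embeddings are learnt), and the helper names `bow_vector` and `dense_vector`.

```python
import numpy as np

# Toy vocabulary (hypothetical); a real dictionary has tens of thousands of entries.
vocab = ["good", "bad", "movie", "film", "boring", "great"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def bow_vector(tokens):
    """Sparse count vector of length D = len(vocab)."""
    vec = np.zeros(len(vocab))
    for tok in tokens:
        if tok in word_to_index:
            vec[word_to_index[tok]] += 1
    return vec

# Hypothetical embedding table with |e| = 2 < D = 6 (random values, not learnt).
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 2))

def dense_vector(tokens):
    """Dense vector: average of the embeddings of the known tokens."""
    rows = [embedding_table[word_to_index[t]] for t in tokens if t in word_to_index]
    if not rows:
        return np.zeros(embedding_table.shape[1])
    return np.mean(rows, axis=0)

tokens = "great movie great film".split()
print(bow_vector(tokens))    # length 6, mostly zeros
print(dense_vector(tokens))  # length 2
```

The bag-of-words vector grows with the dictionary, while the dense vector keeps the size of the embedding whatever the vocabulary.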


The raw classification pipeline is then the following:

```
raw text ---|embedding table|-->  vectors --|Neural Net|--> class 
```


### Using a language model:

How to tokenize the text and extract a feature dictionary is still a manual choice. To obtain meaningful embeddings directly, it is common to use a pre-trained language model such as `word2vec`, which we explore in this practical.

In this setting, the pipeline becomes the following:
```
      
raw text ---|(pre-trained) Language Model|--> vectors --|classifier (or fine-tuning)|--> class 
```


- #### Classic word embeddings

  - [Word2Vec](https://arxiv.org/abs/1301.3781)
  - [GloVe](https://nlp.stanford.edu/projects/glove/)


- #### Bleeding-edge language model techniques (see next)

  - [ULMFiT](https://arxiv.org/abs/1801.06146)
  - [ELMo](https://arxiv.org/abs/1802.05365)
  - [GPT](https://blog.openai.com/language-unsupervised/)
  - [BERT](https://arxiv.org/abs/1810.04805)






### Goal of this session:

1. Train word embeddings on the training dataset
2. Tinker with the learnt embeddings and inspect the relations they capture
3. Tinker with pre-trained embeddings
4. Use those embeddings for classification
5. Compare different embedding models
6. PyTorch first look: learn to generate text

## STEP 0: Loading data 

In [1]:
import json
from collections import Counter

# Loading json
with open("ressources/json_pol",encoding="utf-8") as f:
    data = f.readlines()
    json_data = json.loads(data[0])
    train = json_data["train"]
    test = json_data["test"]
    

# Quick Check
counter_train = Counter((x[1] for x in train))
counter_test = Counter((x[1] for x in test))
print("Number of train reviews : ", len(train))
print("----> # of positive : ", counter_train[1])
print("----> # of negative : ", counter_train[0])
print("")
print(train[0])
print("")
print("Number of test reviews : ",len(test))
print("----> # of positive : ", counter_test[1])
print("----> # of negative : ", counter_test[0])

print("")
print(test[0])
print("")

Number of train reviews :  25000
----> # of positive :  12500
----> # of negative :  12500

["The undoubted highlight of this movie is Peter O'Toole's performance. In turn wildly comical and terribly terribly tragic. Does anybody do it better than O'Toole? I don't think so. What a great face that man has!<br /><br />The story is an odd one and quite disturbing and emotionally intense in parts (especially toward the end) but it is also oddly touching and does succeed on many levels. However, I felt the film basically revolved around Peter O'Toole's luminous performance and I'm sure I wouldn't have enjoyed it even half as much if he hadn't been in it.", 1]

Number of test reviews :  25000
----> # of positive :  12500
----> # of negative :  12500

['Although credit should have been given to Dr. Seuess for stealing the story-line of "Horton Hatches The Egg", this was a fine film. It touched both the emotions and the intellect. Due especially to the incredible performance of seven year old 

## Word2Vec: Quick Recap

**[Word2Vec](https://arxiv.org/abs/1301.3781) is composed of two distinct language models (CBOW and SG), optimized to quickly learn word vectors**


Given an example sentence: `i'm taking the dog out for a walk`



### (a) Continuous Bag of Words (CBOW)
- predicts a word given its context

maximizing `p(dog | i'm taking the ___ out for a walk)`

### (b) Skip-Gram (SG)
- predicts the context given a word

maximizing `p(i'm taking the ___ out for a walk | dog)`
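To make the two objectives concrete, here is a minimal sketch of the training pairs each model would see on the example sentence. The helper `training_pairs` and the fixed `window=2` are illustrative assumptions; an actual implementation such as gensim's adds further details (subsampling, negative sampling, and so on) that are left out here.

```python
def training_pairs(tokens, window=2):
    """Return (cbow_pairs, sg_pairs) for one tokenized sentence.

    cbow_pairs: list of (context_words, target_word)
    sg_pairs:   list of (target_word, one_context_word)
    """
    cbow, sg = [], []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        cbow.append((context, target))          # CBOW: context -> word
        sg.extend((target, c) for c in context)  # SG: word -> each context word
    return cbow, sg

tokens = "i'm taking the dog out for a walk".split()
cbow_pairs, sg_pairs = training_pairs(tokens, window=2)

print(cbow_pairs[3])                               # pair whose target is "dog"
print([p for p in sg_pairs if p[0] == "dog"])      # pairs whose input is "dog"
```

With a window of 2, the context of `dog` is restricted to `taking the ___ out for` rather than the whole sentence.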



   

## STEP 1: train a language model (word2vec)

Gensim provides one of the fastest implementations of [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html).


### Train:

In [2]:
# if gensim not installed yet
# ! pip install gensim

In [3]:
import gensim
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

text = [t.split() for t,p in train]

# all values below are gensim's defaults, except sg=1 (the default is sg=0, i.e. CBOW)
w2v = gensim.models.word2vec.Word2Vec(sentences=text,
                                vector_size=100, window=5,
                                min_count=5,                      
                                sample=0.001, workers=3,
                                sg=1, hs=0, negative=5,        ### sg=1 trains a skip-gram model; set sg to 0 to train a cbow model
                                cbow_mean=1, epochs=5)

2023-02-16 16:40:30,968 : INFO : collecting all words and their counts
2023-02-16 16:40:30,972 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2023-02-16 16:40:31,365 : INFO : PROGRESS: at sentence #10000, processed 2358544 words, keeping 155393 word types
2023-02-16 16:40:31,786 : INFO : PROGRESS: at sentence #20000, processed 4675912 words, keeping 243050 word types
2023-02-16 16:40:32,004 : INFO : collected 280617 word types from a corpus of 5844680 raw words and 25000 sentences
2023-02-16 16:40:32,004 : INFO : Creating a fresh vocabulary
2023-02-16 16:40:32,203 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 49345 unique words (17.58% of original 280617, drops 231272)', 'datetime': '2023-02-16T16:40:32.203135', 'gensim': '4.3.0', 'python': '3.10.10 (main, Feb 16 2023, 11:11:15) [GCC 12.2.1 20221121 (Red Hat 12.2.1-4)]', 'platform': 'Linux-6.1.11-200.fc37.x86_64-x86_64-with-glibc2.36', 'event': 'prepare_vocab'}
2023-02-16 16:40:32,

In [4]:
# It is worth saving the trained embeddings
w2v.save("W2v-movies.dat")
# You will be able to reload them with:
# w2v = gensim.models.Word2Vec.load("W2v-movies.dat")
# and continue the training process if needed

2023-02-16 16:41:43,218 : INFO : Word2Vec lifecycle event {'fname_or_handle': 'W2v-movies.dat', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2023-02-16T16:41:43.218107', 'gensim': '4.3.0', 'python': '3.10.10 (main, Feb 16 2023, 11:11:15) [GCC 12.2.1 20221121 (Red Hat 12.2.1-4)]', 'platform': 'Linux-6.1.11-200.fc37.x86_64-x86_64-with-glibc2.36', 'event': 'saving'}
2023-02-16 16:41:43,220 : INFO : not storing attribute cum_table
2023-02-16 16:41:43,297 : INFO : saved W2v-movies.dat


## STEP 2: Test learnt embeddings

The word embedding space directly encodes similarities between words: the vector coding for the word "great" will be closer to the vector coding for "good" than to the one coding for "bad". Generally, [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) is the measure used to compare vectors: the higher it is, the closer the two words are.

KeyedVectors has a built-in [similarity](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.BaseKeyedVectors.similarity) method that computes the cosine similarity between two words.
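As a reminder of what this measure computes, here is a minimal NumPy sketch of cosine similarity on hand-made 2-dimensional vectors. The function name `cosine_similarity` and the vectors are illustrative assumptions, not part of gensim.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot product over the product of norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])   # same direction as a, different length
c = np.array([0.0, 3.0])   # orthogonal to a
d = -a                     # opposite direction

print(cosine_similarity(a, b))  # same direction
print(cosine_similarity(a, c))  # orthogonal
print(cosine_similarity(a, d))  # opposite direction
```

The score depends only on the direction of the vectors, not on their norm, which is why `a` and `b` are judged identical.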

In [5]:
# is great really closer to good than to bad ?
print("great and good:",w2v.wv.similarity("great","good"))
print("great and bad:",w2v.wv.similarity("great","bad"))

great and good: 0.78008837
great and bad: 0.47419986


Since cosine similarity encodes relatedness, neighboring words are supposed to be similar. The [most_similar](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.BaseKeyedVectors.most_similar) method returns the `topn` words closest to a query.

In [6]:
# The query can be as simple as a word, such as "movie"

# Try changing the word
w2v.wv.most_similar("movie",topn=5) # 5 most similar words
#w2v.wv.most_similar("awesome",topn=5)
#w2v.wv.most_similar("actor",topn=5)

[('film', 0.9325805306434631),
 ('"film"', 0.8607557415962219),
 ('"movie"', 0.7760202288627625),
 ('programme', 0.7686837911605835),
 ('movie...', 0.7646211981773376)]

But the query can also be more complicated: word embedding spaces tend to encode much more than plain similarity.

The most famous example is: `vec(king) - vec(man) + vec(woman) => vec(queen)`
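The arithmetic behind such a query can be sketched on hand-made vectors. The 2-dimensional toy embeddings below are an assumption chosen so that one axis loosely encodes "royalty" and the other "gender"; the helper `analogy` is hypothetical and simpler than gensim's `most_similar`, which, among other differences, normalizes the vectors before combining them.

```python
import numpy as np

# Hand-made toy embeddings (NOT learnt): axis 0 ~ royalty, axis 1 ~ gender.
toy = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def analogy(positive, negative, vectors):
    """Return the word closest (by cosine) to sum(positive) - sum(negative)."""
    query = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    exclude = set(positive) | set(negative)   # never return a query word
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in exclude:
            continue
        score = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# vec(king) - vec(man) + vec(woman)
print(analogy(positive=["king", "woman"], negative=["man"], vectors=toy))
```

Excluding the query words matters: without it, the nearest neighbour of the combined vector is often one of the input words themselves.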

In [7]:
# What is awesome - good + bad ?
w2v.wv.most_similar(positive=["awesome","bad"],negative=["good"],topn=3)  

w2v.wv.most_similar(positive=["actor","woman"],negative=["man"],topn=3) # does the famous example work for actor ?


# Try other things, such as plurals.

[('actress', 0.8330235481262207),
 ('actress,', 0.7358322739601135),
 ('actress.', 0.6949314475059509)]

To test learnt "syntactic" and "semantic" similarities, Mikolov et al. introduced a dedicated dataset containing a wide variety of analogy questions built on three-way similarities (a is to b as c is to ?).

In [8]:
out = w2v.wv.evaluate_word_analogies("ressources/questions-words.txt",case_insensitive=True)  #original semantic syntactic dataset.

2023-02-16 16:41:43,684 : INFO : Evaluating word analogies for top 300000 words in the model on ressources/questions-words.txt
2023-02-16 16:41:44,156 : INFO : capital-common-countries: 0.6% (1/156)
2023-02-16 16:41:44,463 : INFO : capital-world: 0.9% (1/111)
2023-02-16 16:41:44,514 : INFO : currency: 0.0% (0/18)
2023-02-16 16:41:45,140 : INFO : city-in-state: 0.0% (0/301)
2023-02-16 16:41:46,033 : INFO : family: 31.7% (133/420)


**When the w2v model is trained only on the review dataset, it has not seen much data, so it does not perform very well on this benchmark.**
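The per-section figures in the log can also be recomputed from the returned value. The sketch below assumes that `evaluate_word_analogies` returns a pair `(score, sections)`, where each section is a dict with the keys `"section"`, `"correct"` and `"incorrect"`; check this against the gensim documentation for your version. The helper `section_accuracies` is hypothetical, and the data used here is a mock with that assumed shape, not a real result.

```python
def section_accuracies(sections):
    """Map each section name to (n_correct, n_total, accuracy)."""
    summary = {}
    for sec in sections:
        n_correct = len(sec["correct"])
        n_total = n_correct + len(sec["incorrect"])
        accuracy = n_correct / n_total if n_total else float("nan")
        summary[sec["section"]] = (n_correct, n_total, accuracy)
    return summary

# Mock sections with the assumed structure (illustrative only, NOT real results).
mock_sections = [
    {"section": "family",
     "correct": [("BOY", "GIRL", "KING", "QUEEN")],
     "incorrect": [("BOY", "GIRL", "UNCLE", "AUNT"),
                   ("BOY", "GIRL", "SON", "DAUGHTER")]},
    {"section": "currency",
     "correct": [],
     "incorrect": [("EUROPE", "EURO", "JAPAN", "YEN")]},
]

for name, (ok, total, acc) in section_accuracies(mock_sections).items():
    print(f"{name}: {acc:.1%} ({ok}/{total})")
```

With the real model, under the same assumption, the call would look like `score, sections = out` followed by `section_accuracies(sections)`.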


## STEP 3: Loading a pre-trained model

In Gensim, embeddings are loaded and can be used via the ["KeyedVectors"](https://radimrehurek.com/gensim/models/keyedvectors.html) class

> Since trained word vectors are independent from the way they were trained (Word2Vec, FastText, WordRank, VarEmbed etc), they can be represented by a standalone structure, as implemented in this module.

>The structure is called “KeyedVectors” and is essentially a mapping between entities and vectors. Each entity is identified by its string id, so this is a mapping between {str => 1D numpy array}.

>The entity typically corresponds to a word (so the mapping maps words to 1D vectors), but for some models, they key can also correspond to a document, a graph node etc. To generalize over different use-cases, this module calls the keys entities. Each entity is always represented by its string id, no matter whether the entity is a word, a document or a graph node.

In [1]:
# from gensim.test.utils import get_tmpfile
import gensim.downloader as api
from gensim.models import KeyedVectors

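bload = True  # True: load a local copy saved earlier; False: download the vectors, then save a local copy
fname = "word2vec-google-news-300"
sdir = "" # Change to the directory where the vectors are (or should be) stored

if bload:
    wv_pre_trained = KeyedVectors.load(sdir+fname+".dat")
else:
    wv_pre_trained = api.load(fname)
    wv_pre_trained.save(sdir+fname+".dat")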

**Perform the "syntactic" and "semantic" evaluations again. Draw conclusions about the pre-trained embeddings.**

In [9]:
wv_pre_trained

NameError: name 'wv_pre_trained' is not defined

## STEP 4:  sentiment classification

In the previous practical session, we used a bag-of-words approach to transform text into vectors.
Here, we propose to use word vectors instead (either learnt previously or loaded).


### <font color='green'> Since we only have word vectors and sentences are made of multiple words, we need to aggregate them. </font>


### (1) Vectorize reviews using word vectors:

Word aggregation can be done in different ways:

- Sum
- Average
- Feature-wise minimum
- Feature-wise maximum

#### A few pointers:

- `w2v.wv.key_to_index` is a `dict` mapping every word of the vocabulary (all words known to your model) to its index; in gensim 4 it replaces the older `w2v.wv.vocab`
- `np.minimum(a,b)` and `np.maximum(a,b)` respectively return the element-wise min and max
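The four aggregations can be sketched on a small hand-made matrix in which each row stands for one word vector. The values are arbitrary and only serve to show the effect of `axis=0`.

```python
import numpy as np

# Three hypothetical word vectors of dimension 3 (one row per word).
word_vectors = np.array([
    [ 1.0, -2.0,  0.5],
    [ 3.0,  0.0, -1.5],
    [-1.0,  4.0,  1.0],
])

# axis=0 aggregates over the words and keeps one value per feature.
agg = {
    "sum":  word_vectors.sum(axis=0),
    "mean": word_vectors.mean(axis=0),
    "min":  word_vectors.min(axis=0),
    "max":  word_vectors.max(axis=0),
}
for name, vec in agg.items():
    print(name, vec)

# The feature-wise min can equally be built from pairwise np.minimum calls.
pairwise_min = np.minimum(np.minimum(word_vectors[0], word_vectors[1]), word_vectors[2])
print(np.array_equal(pairwise_min, agg["min"]))
```

Whatever the number of words in a review, each aggregation returns a vector with the same dimension as a single word vector, which is what the classifier needs.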

In [22]:
w2v.wv.key_to_index

{'the': 0,
 'a': 1,
 'and': 2,
 'of': 3,
 'to': 4,
 'is': 5,
 'in': 6,
 'I': 7,
 'that': 8,
 'this': 9,
 'it': 10,
 '/><br': 11,
 'was': 12,
 'as': 13,
 'with': 14,
 'for': 15,
 'but': 16,
 'The': 17,
 'on': 18,
 'movie': 19,
 'are': 20,
 'his': 21,
 'film': 22,
 'have': 23,
 'not': 24,
 'be': 25,
 'you': 26,
 'he': 27,
 'by': 28,
 'at': 29,
 'one': 30,
 'an': 31,
 'from': 32,
 'who': 33,
 'like': 34,
 'all': 35,
 'they': 36,
 'has': 37,
 'so': 38,
 'just': 39,
 'about': 40,
 'or': 41,
 'her': 42,
 'out': 43,
 'some': 44,
 'very': 45,
 'more': 46,
 'This': 47,
 'would': 48,
 'what': 49,
 'when': 50,
 'good': 51,
 'only': 52,
 'their': 53,
 'It': 54,
 'if': 55,
 'had': 56,
 'really': 57,
 "it's": 58,
 'which': 59,
 'up': 60,
 'even': 61,
 'can': 62,
 'were': 63,
 'my': 64,
 'see': 65,
 'no': 66,
 'she': 67,
 'than': 68,
 '-': 69,
 'been': 70,
 'there': 71,
 'into': 72,
 'get': 73,
 'will': 74,
 'story': 75,
 'much': 76,
 'because': 77,
 'other': 78,
 'most': 79,
 'we': 80,
 'time': 81,


In [35]:
w2v.wv

<gensim.models.keyedvectors.KeyedVectors at 0x7f6e96f53070>

In [51]:
import numpy as np
# We first need to vectorize the text:
# each review is mapped to an aggregate (sum, mean, min, max) of its word vectors

def get_train_test_set(agg_func):
    def vectorize(text):
        """
        Vectorize one review by aggregating the vectors of its known words.

        input: str
        output: np.array(float)
        """
        # keep only the words that are in the model vocabulary
        vec = [w2v.wv[word] for word in text.split() if word in w2v.wv]
        if not vec:
            # no known word in the review: fall back to a zero vector
            return np.zeros(w2v.wv.vector_size)
        return agg_func(np.array(vec), axis=0)

    classes = [pol for text,pol in train]
    X = [vectorize(text) for text,pol in train]
    X_test = [vectorize(text) for text,pol in test]
    true = [pol for text,pol in test]

    return X, X_test, classes, true

In [52]:
X_train, X_test, y_train, y_test = get_train_test_set(np.mean)

In [49]:
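# Let's see what a review vector looks like.
# (X is local to get_train_test_set, so use X_train; the output shown below comes from an earlier run and may differ)
X_train[0]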

array([  0.9962789 ,  11.440914  ,   1.8048078 ,  -0.1913385 ,
         0.85355127, -24.18671   ,  16.140886  ,  37.643944  ,
       -34.91759   , -23.692745  ,  -1.1175997 , -25.628988  ,
        -1.3991727 ,  18.307074  ,   7.7939153 ,  -9.873245  ,
         5.995385  , -11.069575  ,  -5.351398  , -43.677418  ,
        17.827225  ,   8.9137125 ,  14.733991  , -18.565659  ,
        10.744837  ,  -4.4008193 , -15.416159  ,   1.1796215 ,
       -16.153578  ,  21.468283  ,  26.662832  , -11.013621  ,
         7.774525  , -25.564692  ,  -6.052724  ,  17.006271  ,
        -5.6226754 ,  12.455929  , -14.924302  , -13.561244  ,
        17.20747   ,  -9.256992  ,  -8.899597  ,   0.8016176 ,
         9.517336  ,  -4.8509493 , -17.599451  ,  -6.102292  ,
        21.656157  ,  20.079906  ,   7.9809413 , -20.20644   ,
        -2.7865844 ,   0.7548676 ,  -2.295291  ,  13.713524  ,
         7.071466  ,  14.156362  , -23.193888  ,  10.417994  ,
        -5.4104767 ,  -4.888612  ,  26.078571  ,   6.82

### (2) Train a classifier 
As in the previous practical session, train a logistic regression to perform sentiment classification with word vectors.



In [55]:
np.unique(y_test)

array([0, 1])

In [56]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Scikit-learn logistic regression
# (named `evaluate` to avoid shadowing Python's built-in `eval`)
def evaluate(agg_func):
    X_train, X_test, y_train, y_test = get_train_test_set(agg_func)
    lr = LogisticRegression(random_state=0)
    lr.fit(X_train, y_train)
    return lr.score(X_test, y_test)  # mean accuracy on the test set

print(f"""
sum : {evaluate(np.sum)}
mean : {evaluate(np.mean)}
min : {evaluate(np.min)}
max : {evaluate(np.max)}
""")

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


sum : 0.82284
mean : 0.817
min : 0.70212
max : 0.71076





Performance (~80% accuracy here) is expected to be worse than with bag of words. Sum/mean aggregation does not work well on long reviews, especially those with many frequent words, because these words add a lot of noise.
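The convergence messages in the output above suggest either increasing `max_iter` or scaling the data. A minimal sketch of both, on synthetic data standing in for the review vectors, could look as follows. The toy data, its scale, and `max_iter=1000` are assumptions; whether this changes the accuracy on the actual reviews has not been checked here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for aggregated review vectors: large, unscaled features.
rng = np.random.default_rng(0)
n, dim = 200, 10
X_neg = rng.normal(loc=-20.0, scale=30.0, size=(n, dim))
X_pos = rng.normal(loc=20.0, scale=30.0, size=(n, dim))
X_toy = np.vstack([X_neg, X_pos])
y_toy = np.array([0] * n + [1] * n)

# Standardize every feature, then fit a logistic regression with more iterations.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=0),
)
clf.fit(X_toy, y_toy)
toy_accuracy = clf.score(X_toy, y_toy)  # accuracy on the toy training data
print(toy_accuracy)
```

Putting the scaler inside the pipeline means it is fitted on the training data only and then reused as-is on the test data.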

## **Todo**: Try answering the following questions:

- Which word2vec model works best: skip-gram or CBOW?
- Do pre-trained vectors work better than those learnt on the training dataset?



**(Bonus)** To get a better accuracy, we could try a few things:
- Better aggregation methods (weight by tf-idf ?)
- Another word vectorizing method such as [fasttext](https://radimrehurek.com/gensim/models/fasttext.html)
- A document vectorizing method such as [Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html)
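As an illustration of the first idea, here is a minimal sketch of a tf-idf weighted average of word vectors. It rests on several assumptions: a three-document toy corpus, random 4-dimensional vectors in place of learnt embeddings, a plain `log(N / df)` idf, and a hypothetical helper `tfidf_average`.

```python
import math
from collections import Counter

import numpy as np

# Toy tokenized corpus (illustrative only).
docs = [
    "the movie was great".split(),
    "the movie was boring".split(),
    "the acting was great".split(),
]

# Document frequency and plain idf: words present in every document get weight 0.
df = Counter(word for doc in docs for word in set(doc))
idf = {w: math.log(len(docs) / df[w]) for w in df}

# Random stand-ins for learnt word vectors.
rng = np.random.default_rng(0)
toy_vectors = {w: rng.normal(size=4) for w in df}

def tfidf_average(tokens, vectors, idf):
    """Average of the word vectors, each weighted by (term count * idf)."""
    counts = Counter(t for t in tokens if t in vectors and t in idf)
    weights = {w: c * idf[w] for w, c in counts.items()}
    total = sum(weights.values())
    dim = len(next(iter(vectors.values())))
    if total == 0:
        return np.zeros(dim)  # no informative word: fall back to a zero vector
    return sum(wt * vectors[w] for w, wt in weights.items()) / total

doc_vec = tfidf_average(docs[0], toy_vectors, idf)
print(doc_vec)
```

In the first document, `the` and `was` appear in every document and therefore receive a zero weight, so the review vector is driven only by `movie` and `great`.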