# Activity : Word2Vec embeddings

In the section on text classification, we used various approaches to embed the sentences we wanted to classify. Some of the reference techniques relied on statistical modelling (e.g. BOW, TF-IDF), while others learned representations for words.

In this activity, we implement Word2Vec embeddings in PyTorch (in the previous activity we used the gensim implementation of the algorithm).

## 1. Corpus

We will work on a very simple corpus :

In [15]:
corpus = [
    'louis est roi de france',
    'marie est reine de france',
    'louis est un homme',
    'marie est une femme',
    'paris est la capitale de la france',
    'bruxelles est la capitale de la belgique',
    'tokyo est la capitale du japon',
]

We preprocess the corpus with the same tokenizer as in the earlier activity.

In [16]:
from nltk.stem.snowball import FrenchStemmer
from nltk import wordpunct_tokenize          
from nltk.corpus import stopwords
from nltk.corpus import words
from string import punctuation
import unidecode

class FrenchTokenizer(object):
    def __init__(self):
        self.stopwords = set(stopwords.words('french'))
        self.words = set(words.words())
    def __call__(self, doc):
        # tokenize words and punctuation
        word_list = wordpunct_tokenize(doc)
        # remove stopwords
        word_list = [word for word in word_list if word not in self.stopwords]
        # remove 1-character words
        word_list = [word for word in word_list if len(word)>1]
        # remove non alpha
        word_list = [word for word in word_list if word.isalpha()]
        return [unidecode.unidecode(t) for t in word_list]

#Tokenizing sentences
tok=FrenchTokenizer()
text_for_word2vec=[tok(sent) for sent in corpus]

In [17]:
#Vocabulary modelling : assigning indexes to words
vocabulary = []
for sentence in text_for_word2vec:
    for token in sentence:
        if token not in vocabulary:
            vocabulary.append(token)

word2idx = {w: idx for (idx, w) in enumerate(vocabulary)}
idx2word = {idx: w for (idx, w) in enumerate(vocabulary)}

vocabulary_size = len(vocabulary)
print("vocabulary size : ", vocabulary_size)

vocabulary size :  13


The Word2Vec architecture relies on predicting the tokens surrounding a given one (or vice versa). It is a shallow network, similar in shape to a simple auto-encoder, that uses one-hot encodings of the tokens as inputs and outputs, with a hidden layer of a chosen size whose activations serve as the embedding.

![skipgram](img/word2vecskipgram.png)

Because the input and output vectors are one-hot encodings of the tokens, their dimensionality is the size of the vocabulary (e.g. if there are 30 tokens in the vocabulary, each encoding has size 30, with a single 1 at the position of the encoded token and 0 elsewhere).
The hidden layer has size N. Because of this bottleneck structure, the hidden layer is trained to compress the information about tokens into a space of the chosen dimensionality.
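
To make these dimensions concrete, here is a minimal sketch in plain Python (no PyTorch). The sizes `V = 4` and `N = 2`, the weight values, and the helper names `one_hot` and `vecmat` are all made up for illustration, and bias terms are left out for brevity.

```python
def one_hot(index, size):
    """Return a list of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def vecmat(vec, matrix):
    """Multiply a row vector of length R by an R x C matrix (list of rows)."""
    n_cols = len(matrix[0])
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec)))
            for j in range(n_cols)]


V, N = 4, 2  # hypothetical vocabulary size and hidden-layer size

# Hypothetical weights: input -> hidden (V x N) and hidden -> output (N x V)
W_in = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
W_out = [[1.0, 0.0, -1.0, 0.5], [0.0, 1.0, 0.5, -1.0]]

x = one_hot(2, V)               # one-hot input of size V
hidden = vecmat(x, W_in)        # hidden representation of size N
scores = vecmat(hidden, W_out)  # one output score per vocabulary token

print(len(x), len(hidden), len(scores))  # 4 2 4
```

The vector goes from size V down to size N and back up to size V, which is the bottleneck described above.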

There are two approaches for Word2Vec :
- the Continuous Bag-of-Words (CBOW) approach, where the context tokens are used to predict the central one
- the Skip-Gram approach, where the central token is used to predict the surrounding ones
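
As an illustration of how the two approaches differ, the sketch below builds training examples for each from one tokenized sentence. The function names `skipgram_examples` and `cbow_examples` are hypothetical. In Skip-Gram, each example maps the central token to one context token; in CBOW, each example maps the context tokens in the window to the central token.

```python
def skipgram_examples(tokens, window_size=1):
    """Build (input=central token, target=one context token) pairs."""
    examples = []
    for pos, center in enumerate(tokens):
        for offset in range(-window_size, window_size + 1):
            ctx_pos = pos + offset
            # skip the central token itself and positions outside the sentence
            if offset == 0 or ctx_pos < 0 or ctx_pos >= len(tokens):
                continue
            examples.append((center, tokens[ctx_pos]))
    return examples


def cbow_examples(tokens, window_size=1):
    """Build (input=context tokens in the window, target=central token) pairs."""
    examples = []
    for pos, center in enumerate(tokens):
        start = max(0, pos - window_size)
        context = tokens[start:pos] + tokens[pos + 1:pos + 1 + window_size]
        if context:
            examples.append((context, center))
    return examples


sentence = ['louis', 'roi', 'france']
print(skipgram_examples(sentence))
print(cbow_examples(sentence))
```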

In this exercise, we implement the Skip-Gram approach. To train this model, we need token pairs (the central token and one of its context tokens).
Token pairs are generated by going through every token and pairing it with each context token inside a window of a given size around it.

<div class = "alert alert-warning">

We generate the word pairs on which the network is trained.

</div>

In [18]:
import numpy as np

window_size = 1

idx_pairs = []
# for each sentence
for sentence in text_for_word2vec:
    indices = [word2idx[word] for word in sentence]
    # for each word, treated as center word
    for center_word_pos in range(len(indices)):
        # for each window position
        for w in range(-window_size, window_size + 1):
            context_word_pos = center_word_pos + w
            # make sure we stay inside the sentence and skip the center word itself
            if context_word_pos < 0 or context_word_pos >= len(indices) or center_word_pos == context_word_pos:
                continue
            context_word_idx = indices[context_word_pos]
            idx_pairs.append((indices[center_word_pos], context_word_idx))

idx_pairs = np.array(idx_pairs)

#Printing first 10 pairs
for i in range(10):
    print(idx2word[idx_pairs[i][0]], idx2word[idx_pairs[i][1]])

louis roi
roi louis
roi france
france roi
marie reine
reine marie
reine france
france reine
louis homme
homme louis


To use these pairs to train the network, we need to encode each token index as a one-hot vector:

In [19]:
import torch

def get_layer(word_idx):
    x = torch.zeros(vocabulary_size).float()
    x[word_idx] = 1.0
    return x

# example
get_layer(2)

tensor([0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

<div class="alert alert-success">
Exercise:

Complete the class below to implement the Skip-Gram architecture.
The class should include a `get_wv` method that returns the word vector (i.e. the hidden-layer activation) for a given word.
</div>

In [20]:
#COMPLETE HERE
import torch.nn as nn
import numpy as np

# class skipgram(nn.Module):
#     def __init__(self, vocab_size, embedding_dim):
#         #vocab size : vocabulary size (corresponding to input and output dimensions)
#         #embedding_dim : dimension of the embedding (hidden) layer
#         super(skipgram, self).__init__()

#     def forward(self, input):

#         return output
#     def get_wv(self,input):
#         #get the word vector for a given input
#         return 

In [21]:
# %load solutions/1_skipgram.py
class skipgram(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        #vocab size : vocabulary size (corresponding to input and output dimensions)
        #embedding_dim : dimension of the embedding (hidden) layer
        super(skipgram, self).__init__()
        # no RNN here, just a linear network
        self.lin1 = nn.Linear(vocab_size, embedding_dim)  # layer mapping the input to the hidden state
        self.lin2 = nn.Linear(embedding_dim, vocab_size)  # layer mapping the hidden state to the output
        self.soft = nn.Softmax(dim=0)  # dim=0 because inputs are 1-D (unbatched) vectors

    def forward(self, input):
        hidden = self.lin1(input)
        output = self.lin2(hidden)
        output = self.soft(output)  # pass the output scores through the softmax
        return output

    def get_wv(self, input):
        # get the word vector for a given input: the activation of the hidden layer
        return self.lin1(input).detach().numpy()
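
Because the input is one-hot, the hidden activation returned by `get_wv` amounts to reading one column of the first layer's weight matrix and adding the bias. The sketch below checks this on a standalone `nn.Linear` layer; the toy sizes, the seed, and the variable names are assumptions made for this illustration, not part of the solution above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embedding_dim = 5, 2  # toy sizes chosen for this sketch

lin1 = nn.Linear(vocab_size, embedding_dim)

word_idx = 3
one_hot = torch.zeros(vocab_size)
one_hot[word_idx] = 1.0

# Forward pass through the linear layer, as get_wv does
via_linear = lin1(one_hot).detach()

# Direct lookup: column `word_idx` of the weight matrix, plus the bias
# (nn.Linear stores its weight with shape (out_features, in_features))
via_lookup = (lin1.weight[:, word_idx] + lin1.bias).detach()

print(torch.allclose(via_linear, via_lookup))
```

This is why embedding layers are often implemented as a table lookup rather than as a product with a one-hot vector.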

Training the network :

In [22]:
model = skipgram(vocab_size=vocabulary_size,embedding_dim=2)
loss_function = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

num_epochs = 100

for epo in range(num_epochs):
    loss_val = 0
    for center_word, target_word in idx_pairs:
        optimizer.zero_grad()
        input = get_layer(center_word)
        target = get_layer(target_word)
        output=model(input)
        loss = loss_function(output, target)
        loss_val += loss.item()
        loss.backward()
        optimizer.step()
    if epo%10==0:
        print(f'Loss at epoch {epo}: {loss_val/len(idx_pairs)}')

Loss at epoch 0: 0.286283270145456
Loss at epoch 10: 0.2713726833462715
Loss at epoch 20: 0.26150408138831455
Loss at epoch 30: 0.2539468618730704
Loss at epoch 40: 0.24764183349907398
Loss at epoch 50: 0.24146481789648533
Loss at epoch 60: 0.23474724404513836
Loss at epoch 70: 0.22746935797234377
Loss at epoch 80: 0.22002030878017345
Loss at epoch 90: 0.21287764764080444
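
The training loop above feeds softmax probabilities to `nn.BCELoss`. A more common pairing for a softmax output is the cross-entropy, i.e. the negative log-probability of the target token. The sketch below, in plain Python with made-up scores and hypothetical helper names, computes both quantities for a single pair so that they can be compared. It illustrates the two formulas and is not a replacement for the notebook's training code.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_idx):
    """Negative log-probability assigned to the target token."""
    return -math.log(probs[target_idx])


def mean_binary_cross_entropy(probs, target_idx):
    """Element-wise binary cross-entropy against a one-hot target, averaged."""
    total = 0.0
    for i, p in enumerate(probs):
        t = 1.0 if i == target_idx else 0.0
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(probs)


scores = [2.0, 0.5, -1.0, 0.0]  # hypothetical output scores for 4 tokens
probs = softmax(scores)
print(cross_entropy(probs, target_idx=0))
print(mean_binary_cross_entropy(probs, target_idx=0))
```

The cross-entropy depends only on the probability given to the target token, whereas the averaged binary cross-entropy also includes a term for every non-target token.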


Let's have a look at the word embeddings. They could be used, for instance, to represent individual tokens in a classification task.

In [23]:
word_vectors = {w : model.get_wv(get_layer(word2idx[w])) for w in vocabulary}

print(word_vectors)

{'louis': array([ 2.7596951e-04, -6.3711846e-01], dtype=float32), 'roi': array([ 0.31588852, -0.43377864], dtype=float32), 'france': array([0.23440444, 0.58936715], dtype=float32), 'marie': array([-0.6226934,  0.5823401], dtype=float32), 'reine': array([ 0.507831  , -0.48083264], dtype=float32), 'homme': array([0.66379166, 0.05007281], dtype=float32), 'femme': array([ 0.72154117, -0.02954477], dtype=float32), 'paris': array([1.0636444 , 0.58579195], dtype=float32), 'capitale': array([-0.3752228, -0.0287312], dtype=float32), 'bruxelles': array([0.99591804, 0.80952233], dtype=float32), 'belgique': array([1.1701809 , 0.53766656], dtype=float32), 'tokyo': array([1.1276237 , 0.62002265], dtype=float32), 'japon': array([1.2272844 , 0.46912107], dtype=float32)}
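
One simple way to inspect such embeddings is to compare words by cosine similarity. The sketch below hard-codes three of the vectors printed above (rounded to four decimals) and uses a hypothetical helper `cosine_similarity`. With only 2 dimensions and 7 sentences, any pattern it shows should be read as an illustration rather than as evidence of a well-trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Values copied (rounded) from the output above
vectors = {
    'paris': [1.0636, 0.5858],
    'tokyo': [1.1276, 0.6200],
    'louis': [0.0003, -0.6371],
}

print(cosine_similarity(vectors['paris'], vectors['tokyo']))
print(cosine_similarity(vectors['paris'], vectors['louis']))
```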


<div class = "alert alert-warning">

We have thus obtained a vector representation of the words:
- Words are represented by vectors (forced to size 2 here, since we chose a hidden layer of size 2)