<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 30px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
  Deep Learning for NLP
  </div> 
  
<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 30px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
    <font color=orange>I - 1 </font>
  Word Embedding
  </div> 

  <div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 20px; 
      text-align: center; 
      padding: 15px;">
  </div> 

  <div style=" float:right; 
      font-size: 12px; 
      line-height: 12px; 
  padding: 10px 15px 8px;">
  Jean-baptiste AUJOGUE
  </div> 

### Part I
1. <font color=orange>**Word Embedding**</font>

2. Sentence Classification

3. Language Modeling

4. Sequence Labelling


### Part II

1. Text Classification

2. Sequence to sequence



### Part III

1. Abstractive Summarization

2. Question Answering

3. Chatbot


</div>

***

<a id="plan"></a>

# Overview

The global purpose of Word Embedding is to represent a _Token_ , a raw string representing a unit of text, as a low dimensional (dense) vector. The way tokens are defined only depends on the method used to split a text into text units : using blank spaces as separators or using classical NLTK or SpaCy's segmentation models leave _words_ as tokens, but splitting protocols yielding _subword units_ , that are half-way between characters and full words, are also investigated :

- [Neural Machine Translation of Rare Words with Subword Units (2015)](https://www.aclweb.org/anthology/P16-1162.pdf)
- [Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (2016)](https://arxiv.org/pdf/1609.08144.pdf)). 
- [BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages (2018)](https://www.aclweb.org/anthology/L18-1473.pdf)
- [SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (2018)](https://arxiv.org/abs/1808.06226)


Here we broadly denote by _word_ any such token. Commonly followed approaches for the embedding of words (aka tokens) decompose into three levels of granularity :

| Level |  | |
|------|------|------|
| **Word** | [I.1 Custom model](#word_level_custom) | [I.2 Gensim Model](#gensim) |
| **sub-word unit** | [II.1 FastText model](#fastText) |  |
| **Character** |  |  |


<br>
Visualization with TensorBoard : https://www.tensorflow.org/guide/embedding (TODO)

# Training objectives

#### CBOW training objective

Cette méthode de vectorisation est introduite dans \cite{mikolov2013distributed, mikolov2013efficient}, et consiste à construire pour un vocabulaire de mots une table de vectorisation $T$ contenant un vecteur par mot. La spécificité de cette méthode est que cette vectorisation est faite de façon à pouvoir prédire chaque mot à partir de son contexte. La construction de cette table $T$ passe par la création d'un réseau de neurones, qui sert de modèle pour l'estimation de la probabilité de prédiction d'un mot $w_t$ d'après son contexte $c = w_{t-N}, \, ... \, , w_{t-1}$, $w_{t+1}, \, ... \, , w_{t+N}$. La table $T$ intégrée au modèle sera optimisée lorsque ce modèle sera entrainé de façon à ce qu'un mot $w_t$ maximise la vraisemblance de la probabilité $P(. \, | \, c)$ fournie par le modèle. 

Le réseau de neurones de décrit de la façon suivante :

![cbow](figs/CBOW.png)

Un contexte $c = w_{t-N}, \, ... \, , w_{t-1}$, $w_{t+1}, \, ... \, , w_{t+N}$ est vectorisé via une table $T$ fournissant un ensemble de vecteurs denses (typiquement de dimension comprise entre 50 et 300) $T(w_{t-N}), \, ... \, , T(w_{t-1})$, $T(w_{t+1}), \, ... \, , T(w_{t+N})$. Chaque vecteur est ensuite transformé via une transformation affine, dont les vecteurs résultants sont superposés en un unique vecteur

\begin{align*}
v_c = \sum _{i = - N}^N M_i T(w_{t+i}) + b_i
\end{align*}

Le vecteur $v_c$ est de dimension typiquement égale à la dimension de la vectorisation de mots. Une autre table $T'$ est utilisée pour une nouvelle vectorisation du vocabulaire, de sorte que le mot $w_{t}$ soit transformé en un vecteur $T'(w_{t})$ par cette table, et soit proposé en position $t$ avec probabilité

\begin{align*}
P(w_{t} \, | \, c\,) = \frac{\exp\left( T'(w_{t}) \cdot v_c \right) }{\displaystyle \sum _{w \in \mathcal{V}} \exp\left(   T'(w) \cdot v_c 
\right) }
\end{align*}

Ici $\cdot$ désigne le produit scalaire entre vecteurs. L'optimisation de ce modèle permet d'ajuster la table $T$ afin que les vecteurs de mots portent suffisamment d'information pour reformer un mot à partir du contexte.


#### Skip-Gram training objective


Cette méthode de vectorisation est introduite dans \cite{mikolov2013distributed, mikolov2013efficient} comme version mirroir au Continuous Bag Of Words, et consiste là encore à construire pour un vocabulaire de mots une table de vectorisation $T$ contenant un vecteur par mot. La spécificité de cette méthode est que cette vectorisation est faite non pas de façon prédire un mot central $w$ à partir d'un contexte $c $ comme pour CBOW, mais plutôt de prédire le contexte $c $ à partir du mot central $w$. La construction de cette table $T$ passe par la création d'un réseau de neurones servant de modèle pour l'estimation de la probabilité de prédiction d'un contexte $c = w_{t-N}, \, ... \, , w_{t-1}$, $w_{t+1}, \, ... \, , w_{t+N}$ à partir d'un mot central $w_t$. La table $T$ intégrée au modèle sera optimisée lorsque ce modèle sera entrainé de façon à ce que le contexte  $ c $ maximise la vraisemblance de la probabilité $P( . \, | \, w_t)$ fournie par le modèle.


Une implémentation de ce modèle est la suivante : 


![skipgram](figs/Skipgram.png)


Un mot courant $w_t$ est vectorisé par une table $T$ fournissant un vecteur dense (typiquement de dimension comprise entre 50 et 300) $T(w_t)$. Ce vecteur est alors transformé en un ensemble de $2N$ vecteurs

\begin{align*}
\sigma (M_{i} T(w_t) + b_{i}) \qquad \qquad i =-N,\, ...\, , -1, 1, \, ...\, , N
\end{align*}

où $N$ désigne la taille de la fenêtre retenue, d'une dimension typiquement égale à la dimension de la vectorisation de mots, et $\sigma$ une fonction non linéaire (typiquement la _Rectified Linear Unit_ $\sigma (x) = max (0, x)$). Une autre table $T'$ est utilisée pour une nouvelle vectorisation du vocabulaire, de sorte que chaque mot $w_{t+i}$, transformé en un vecteur $T'(w_{t+i})$ par cette table, soit proposé en position $t+i$ avec probabilité

\begin{align*}
P( w_{t+i} | \, w_t) = \frac{\exp\left(  T'(w_{t+i}) ^\perp \sigma \left( M_i T(w_t) + b_{i}\right) \right) }{\displaystyle \sum _{w \in \mathcal{V}} \exp\left(   T'(w) ^\perp \sigma \left( M_i T(w_t) + b_i\right) \right) }
\end{align*}

On modélise alors la probabilité qu'un ensemble de mots $c = w_{t-N}, \, ... \, , w_{t-1}$, $w_{t+1}, \, ... \, , w_{t+N}$ soit le contexte d'un mot $w_t$ par le produit

\begin{align*}
 P( c\, | \, w_t) = \prod _{i = -N}^N P( w_{t+i}\, | \, w_t)
\end{align*}

Ce modèle de probabilité du contexte d'un mot est naif au sens où les mots de contextes sont considérés comme indépendants deux à deux dès lors que le mot central est connu. Cette approximation rend cependant le calcul d'optimisation beaucoup plus court.



L'optimisation de ce modèle permet d'ajuster la table $T$ afin que les vecteurs de mots portent suffisamment d'information pour reformer l'intégralité du contexte à partir de ce seul mot. La vectorisation Skip-Gram est typiquement plus performante que CBOW, car la table $T$ subit plus de contrainte dans son optimisation, et puisque le vecteur d'un mot est obtenu de façon à pouvoir prédire l'utilisation réelle du mot, ici donnée par son contexte. 

# Packages

[Back to top](#plan)

In [1]:
from __future__ import unicode_literals, print_function, division
import sys
import warnings
import os
from io import open
import unicodedata
import string
import time
import math
import re
import random
import pickle
import copy
from unidecode import unidecode
import multiprocessing

# for special math operation
from sklearn.preprocessing import normalize


# for manipulating data 
import numpy as np
#np.set_printoptions(threshold=np.nan)
import pandas as pd
import bcolz # see https://bcolz.readthedocs.io/en/latest/intro.html
import pickle


# for text processing
import gensim
from gensim.models import KeyedVectors
import spacy
import nltk
#nltk.download()
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem.porter import PorterStemmer


# for deep learning
import torch
import torch.nn as nn
from torch.autograd import Variable
from torch import optim
import torch.nn.functional as F
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.backends.cudnn.benchmark = True


warnings.filterwarnings("ignore")
print('python version :', sys.version)
print('pytorch version :', torch.__version__)
print('DL device :', device)



python version : 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
pytorch version : 1.5.0
DL device : cuda


In [2]:
path_to_DL4NLP = os.path.dirname(os.getcwd())

In [3]:
sys.path.append(path_to_DL4NLP + '\\lib')

# Corpus

[Back to top](#plan)

Le texte est importé et mis sous forme de liste, où chaque élément représente un texte présenté sous forme d'une chaine de caractères.<br> 

In [4]:
df_AGnews = pd.read_csv(path_to_DL4NLP + "\\data\\AG News\\train.csv", sep = ',', header = None, error_bad_lines = False)

In [5]:
df_AGnews.columns = ['index', 'title', 'description']

In [6]:
df_AGnews.head()

Unnamed: 0,index,title,description
0,3,Wall St. Bears Claw Back Into the Black (Reuters),"Reuters - Short-sellers, Wall Street's dwindli..."
1,3,Carlyle Looks Toward Commercial Aerospace (Reu...,Reuters - Private investment firm Carlyle Grou...
2,3,Oil and Economy Cloud Stocks' Outlook (Reuters),Reuters - Soaring crude prices plus worries\ab...
3,3,Iraq Halts Oil Exports from Main Southern Pipe...,Reuters - Authorities have halted oil export\f...
4,3,"Oil prices soar to all-time record, posing new...","AFP - Tearaway world oil prices, toppling reco..."


On définit le tokeniseur :

In [7]:
def tokenize(s : str) :
    def unicodeToAscii(s):
        return ''.join( c for c in unicodedata.normalize('NFD', s)
                        if unicodedata.category(c) != 'Mn')

    def normalizeString(s):
        s = unicodeToAscii(s.strip())
        return s

    def cleanSentence(s) :
        s = s.lower()
        s = s.replace('\\', ' ')
        s = re.sub('[\.!?]+ ', ' . ', s)
        s = s.replace('%', ' % ')
        s = re.sub(' [0-9]*\.[0-9]', ' FLOAT ', ' ' + s).strip()
        s = re.sub(' [0-9,]*[0-9]', ' INT ', ' ' + s).strip()

        for w in ['"', "'", '”', '“', '/', '(', ')', '[', ']', '<', '>', ':', ','] : s = s.replace(w, '')
        return s

    def trueWord(w) :
        return len(w)>0 and re.sub('[^a-zA-Z0-9.,]', '', w) != ''

    # -- main --
    s = normalizeString(s)
    s = cleanSentence(s)
    s = nltk.tokenize.word_tokenize(s)
    s = [w for w in s if trueWord(w)]
    return s

In [8]:
corpus = [s1 + ' . ' + s2 for s1, s2 in zip(df_AGnews["title"].values.tolist(), 
                                            df_AGnews["description"].values.tolist()) if tokenize(s1) != []]

In [9]:
corpus_tokenized = [tokenize(s) for s in corpus]

<a id="word_level"></a>


# 1 Word Embedding
***

<a id="word_level_custom"></a>

## 1.1 Custom Word-level Embedding Model

[Back to top](#plan)

### 1.1.1 Model

#### Language

Classe de langage prennant en paramètre un corpus de la forme [[str]]

In [10]:
#from libDL4NLP.utils.Lang import Lang

In [11]:
class Lang:
    def __init__(self, corpus = None, base_tokens = ['UNK'], min_count = None):
        self.base_tokens = base_tokens
        self.initData(base_tokens)
        if    corpus is not None : self.addCorpus(corpus)
        if min_count is not None : self.removeRareWords(min_count)

        
    def initData(self, base_tokens) :
        self.word2index = {word : i for i, word in enumerate(base_tokens)}
        self.index2word = {i : word for i, word in enumerate(base_tokens)}
        self.word2count = {word : 0 for word in base_tokens}
        self.n_words = len(base_tokens)
        return
    
    def getIndex(self, word) :
        if    word in self.word2index : return self.word2index[word]
        elif 'UNK' in self.word2index : return self.word2index['UNK']
        return
        
    def addWord(self, word):
        '''Add a word to the language'''
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
        return 
            
    def addSentence(self, sentence):
        '''Add to the language all words of a sentence'''
        words = sentence if type(sentence) == list else nltk.word_tokenize(sentence)
        for word in words : self.addWord(word)          
        return
            
    def addCorpus(self, corpus):
        '''Add to the language all words contained into a corpus'''
        for text in corpus : self.addSentence(text)
        return 
                
    def removeRareWords(self, min_count):
        '''remove words appearing lesser than a min_count threshold'''
        kept_word2count = {word: count for word, count in self.word2count.items() if count >= min_count}
        self.initData(self.base_tokens)
        for word, count in kept_word2count.items(): 
            self.addWord(word)
            self.word2count[word] = kept_word2count[word]
        return

In [12]:
def saveLang(name, lang):
    with open(path_to_DL4NLP + '\\saves\\' + name + '.file', 'wb') as fil :
        pickle.dump(lang, fil)
    return

def importLang(name):
    with open(path_to_DL4NLP + '\\saves\\' + name + '.file', 'rb') as fil :
        lang = pickle.load(fil)
    return lang

In [13]:
lang = Lang(corpus_tokenized, base_tokens = ['SOS', 'EOS', 'UNK'])
print("Mots comptés avant : {}".format(lang.n_words))
lang.removeRareWords(min_count = 5)
print("Mots comptés après : {}".format(lang.n_words))

Mots comptés avant : 84308
Mots comptés après : 29846


In [14]:
#saveLang(name = 'DL4NLP_I1_lang', lang = lang)
#lang = importLang(name = 'DL4NLP_I1_lang')

#### Comparaison avec un vocabulaire de référence

In [14]:
#taken from https://medium.com/@martinpella/how-to-use-pre-trained-word-embeddings-in-pytorch-71ca59249f76

# --------------------- comparison with Glove vocab ------------------------
def vocabGlove(name, path = 'D:\\data\\vectors\\') :
    words = []
    path += name 
    with open(path + '.txt', 'rb') as f:
        for l in f:
            line = l.decode().split()
            word = line[0]
            words.append(word)
    return words

def intersection(lst1, lst2): 
    return list(set(lst1) & set(lst2))

def comparaison(lang) :
    vocab_lang = list(lang.word2index.keys())
    intersect_glove = intersection(vocab_glove, vocab_lang)
    reste_glove = np.setdiff1d(vocab_lang, intersect_glove)
    printComparaison('glove', vocab_lang, intersect_glove, reste_glove)
    return intersect_glove, reste_glove

def printComparaison(nom, vocab_lang, intersect, reste) :
    print('proportion de mots du langage appartenants à {}  {:.2f} % \nproportion de mots du langage ny appartenant pas     {:.2f} %'.format(nom, len(intersect)*100/len(vocab_lang),len(reste)*100/len(vocab_lang) ) )


# --------------------- detect missing spaces ------------------------
def checkWhetherBroken(vocab, clean_vocab) :
    exit = {}
    for word in vocab :
        exit[word] = True if word in clean_vocab else False
    return exit

def checkMissingSpaces(word, clean_vocab) :
    for word2 in clean_vocab :
        if word.startswith(word2) :
            rest = word.replace(word2, '')
            if rest in clean_vocab :
                return word2 + ' ' + rest
    return word

In [15]:
vocab_glove = vocabGlove('glove.6B.100d')

In [38]:
words_glove, reste_glove = comparaison(lang)

proportion de mots du langage appartenants à glove  93.37 % 
proportion de mots du langage ny appartenant pas     6.63 %


#### Word2Vec model

[Back to top](#plan)

In [16]:
#from libDL4NLP.models.Word_Embedding import Word2Vec as myWord2Vec

In [15]:
class myWord2Vec(nn.Module) :
    def __init__(self, lang, 
                 T = 100):
        super().__init__()
        
        self.lang = lang
        if type(T) == int :
            self.embedding = nn.Embedding(lang.n_words, T)  
        else :
            self.embedding = nn.Embedding(T.shape[0], T.shape[1])
            self.embedding.weight = nn.Parameter(torch.FloatTensor(T))
            
        self.out_dim = self.lookupTable().shape[1]
        self.sims = None
        
    def lookupTable(self) :
        return self.embedding.weight.cpu().detach().numpy()
        
    def computeSimilarities(self) :
        T = normalize(self.lookupTable(), norm = 'l2', axis = 1)
        self.sims = np.matmul(T, T.transpose())
        return

    def most_similar(self, word, bound = 10) :
        if word not in self.lang.word2index : return
        if self.sims is None : self.computeSimilarities()
        index = self.lang.word2index[word]
        coefs = self.sims[index]
        indices = coefs.argsort()[-bound -1 :-1]
        output = [(self.lang.index2word[i], coefs[i]) for i in reversed(indices)]
        return output
    
    def wv(self, word) :
        return self.lookupTable()[self.lang.getIndex(word)]
    
    def addWord(self, word, vector = None) :
        self.lang.addWord(word)
        T = self.lookupTable()
        v = np.random.rand(1, T.shape[1]) if vector is None else vector
        updated_T = np.concatenate((T, v), axis = 0)
        self.embedding = nn.Embedding(updated_T.shape[0], updated_T.shape[1])
        self.embedding.weight = nn.Parameter(torch.FloatTensor(updated_T))
        return
    
    def freeze(self) :
        for param in self.embedding.parameters() : param.requires_grad = False
        return self
    
    def unfreeze(self) :
        for param in self.embedding.parameters() : param.requires_grad = True
        return self
    
    def forward(self, words, device = None) :
        '''Transforms a list of n words into a torch.FloatTensor of size (1, n, emb_dim)'''
        indices  = [self.lang.getIndex(w) for w in words]
        indices  = [[i for i in indices if i is not None]]
        variable = Variable(torch.LongTensor(indices)) # size = (1, n)
        if device is not None : variable = variable.to(device)
        tensor   = self.embedding(variable)            # size = (1, n, emb_dim)
        return tensor

#### Word2Vec Shell

[Back to top](#plan)

Shell acting as a wrapper around the Word2Vec model, implementing :

- The layers suited for the training objective
- The methods for all optimization steps
- The methods for generating the data suitable for the optimization process

In [18]:
#from libDL4NLP.models.Word_Embedding import Word2VecShell

In [16]:
class Word2VecShell(nn.Module):
    '''Word2Vec model :
        - sg = 0 yields CBOW training procedure
        - sg = 1 yields Skip-Gram training procedure
    '''
    def __init__(self, word2vec, device, 
                 sg = 0, 
                 context_size = 5, 
                 weight_tying = True,
                 criterion = nn.NLLLoss(size_average = False), 
                 optimizer = optim.SGD):
        super().__init__()
        self.device = device
        
        # core of Word2Vec
        self.word2vec = word2vec
        
        # training layers
        self.in_n_word  = (2 * context_size if sg == 0 else 1)
        self.out_n_word = (1 if sg == 0 else 2 * context_size)
        self.word_size  = word2vec.embedding.weight.size(1)
        self.linear_1   = nn.Linear(self.in_n_word * self.word_size, self.out_n_word * self.word_size)
        self.linear_2   = nn.Linear(self.word_size, word2vec.lang.n_words, bias = False)
        
        # weight tying
        if weight_tying : self.linear_2.weight = self.word2vec.embedding.weight
        
        # training tools
        self.sg = sg
        self.weight_tying = weight_tying
        self.criterion = criterion
        self.optimizer = optimizer
        
        # load to device
        self.to(device)
        
    def forward(self, batch):
        '''Transforms a batch of Ngrams of size (batch_size, in_n_word)
           Into log probabilities of size (batch_size, lang.n_words, out_n_word)
           '''
        batch = batch.to(self.device)                 # size = (batch_size, self.in_n_word)
        embed = self.word2vec.embedding(batch)        # size = (batch_size, self.in_n_word, emb_dim)
        embed = embed.view((batch.size(0), -1))       # size = (batch_size, self.in_n_word * emb_dim)
        out = self.linear_1(embed)                    # size = (batch_size, self.out_n_word * hid_dim) 
        out = out.view((batch.size(0),self.out_n_word, -1))
        if not self.weight_tying : out = F.relu(out)  # size = (batch_size, self.out_n_word, hid_dim)                                         
        out = self.linear_2(out)                      # size = (batch_size, self.out_n_word, lang.n_words)
        out = torch.transpose(out, 1, 2)              # size = (batch_size, lang.n_words, self.out_n_word)
        log_probs = F.log_softmax(out, dim = 1)       # size = (batch_size, lang.n_words, self.out_n_word)
        return log_probs
    
    def generatePackedNgrams(self, corpus, 
                             context_size = 5, 
                             batch_size = 32, 
                             seed = 42) :
        # generate Ngrams
        data = []
        for text in corpus :
            text = [w for w in text if w in self.word2vec.lang.word2index]
            text = ['SOS' for i in range(context_size)] + text + ['EOS' for i in range(context_size)]
            for i in range(context_size, len(text) - context_size):
                context = text[i-context_size : i] + text[i+1 : i+context_size+1]
                word = text[i]
                data.append([word, context])
                
        # pack Ngrams into mini_batches
        random.seed(seed)
        random.shuffle(data)
        packed_data = []
        for i in range(0, len(data), batch_size):
            pack0 = [el[0] for el in data[i:i + batch_size]]
            pack0 = [[self.word2vec.lang.getIndex(w)] for w in pack0]
            pack0 = Variable(torch.LongTensor(pack0)) # size = (batch_size, 1)
            pack1 = [el[1] for el in data[i:i + batch_size]]
            pack1 = [[self.word2vec.lang.getIndex(w) for w in context] for context in pack1]
            pack1 = Variable(torch.LongTensor(pack1)) # size = (batch_size, 2*context_size)   
            if   self.sg == 1 : packed_data.append([pack0, pack1])
            elif self.sg == 0 : packed_data.append([pack1, pack0])
            else :
                print('sg should be either 0 or 1')
                pass
        return packed_data
    
    def train(self, ngrams, 
              iters = None, 
              epochs = None, 
              lr = 0.025, 
              random_state = 42,
              print_every = 10, 
              compute_accuracy = False):
        """Performs training over a given dataset and along a specified amount of loop
        s"""
        def asMinutes(s):
            m = math.floor(s / 60)
            s -= m * 60
            return '%dm %ds' % (m, s)

        def timeSince(since, percent):
            now = time.time()
            s = now - since
            rs = s/percent - s
            return '%s (- %s)' % (asMinutes(s), asMinutes(rs))

        def computeAccuracy(log_probs, targets) :
            accuracy = 0
            acc = sum([log_probs[i, :, j].data.topk(1)[1].item() == targets[i, j].item() 
                       for i in range(targets.size(0)) 
                       for j in range(targets.size(1))])
            return (acc * 100) / (targets.size(0) * targets.size(1))

        def printScores(start, iter, iters, tot_loss, tot_loss_words, print_every, compute_accuracy) :
            avg_loss = tot_loss / print_every
            avg_loss_words = tot_loss_words / print_every
            if compute_accuracy : print(timeSince(start, iter / iters) + ' ({} {}%) loss : {:.3f}  accuracy : {:.1f} %'.format(iter, int(iter / iters * 100), avg_loss, avg_loss_words))
            else                : print(timeSince(start, iter / iters) + ' ({} {}%) loss : {:.3f}                     '.format(iter, int(iter / iters * 100), avg_loss))
            return 0, 0

        def trainLoop(couple, optimizer, compute_accuracy = False):
            """Performs a training loop, with forward pass and backward pass for gradient optimisation."""
            optimizer.zero_grad()
            self.zero_grad()
            log_probs = self(couple[0])           # size = (batch_size, agent.out_n_word, agent.lang.n_words)
            targets   = couple[1].to(self.device) # size = (batch_size, agent.out_n_word)
            loss      = self.criterion(log_probs, targets)
            loss.backward()
            optimizer.step() 
            accuracy = computeAccuracy(log_probs, targets) if compute_accuracy else 0
            return float(loss.item() / (targets.size(0) * targets.size(1))), accuracy
        
        # -- main --
        np.random.seed(random_state)
        start = time.time()
        optimizer = self.optimizer([param for param in self.parameters() if param.requires_grad == True], lr = lr)
        tot_loss = 0  
        tot_loss_words = 0
        if epochs is None :
            for iter in range(1, iters + 1):
                couple = random.choice(ngrams)
                loss, loss_words = trainLoop(couple, optimizer, compute_accuracy)
                tot_loss += loss
                tot_loss_words += loss_words      
                if iter % print_every == 0 : 
                    tot_loss, tot_loss_words = printScores(start, iter, iters, tot_loss, tot_loss_words, print_every, compute_accuracy)
        else :
            iter = 0
            iters = len(ngrams) * epochs
            for epoch in range(1, epochs + 1):
                print('epoch ' + str(epoch))
                np.random.shuffle(ngrams)
                for couple in ngrams :
                    loss, loss_words = trainLoop(couple, optimizer, compute_accuracy)
                    tot_loss += loss
                    tot_loss_words += loss_words 
                    iter += 1
                    if iter % print_every == 0 : 
                        tot_loss, tot_loss_words = printScores(start, iter, iters, tot_loss, tot_loss_words, print_every, compute_accuracy)
        return

### 1.1.2 Training with CBOW objective

[Back to top](#plan)


Model

In [17]:
lang = Lang(corpus_tokenized, base_tokens = ['SOS', 'EOS', 'UNK'], min_count = 5)
lang.n_words

29846

In [18]:
word2vec = myWord2Vec(lang, T = 100)
cbow = Word2VecShell(word2vec, device, sg = 0, context_size = 5, weight_tying = False)
print('cbow.word2vec = word2vec :', cbow.word2vec == word2vec)

cbow.word2vec = word2vec : True


Data

In [45]:
Ngrams = cbow.generatePackedNgrams(corpus_tokenized, context_size = 5, batch_size = 64, seed = 42)
len(Ngrams)

74874

Training

The training methods allows to display accuracy over predicted target words. However, since the underlying computation is quite time consuming, we display accuracy only at the begining of training, and a few times periodically along the training process.

In [47]:
cbow.train(Ngrams, epochs = 1, lr = 0.0005, print_every = 100, compute_accuracy = True)

epoch 1
0m 10s (- 128m 33s) (100 0%) loss : 9.769  accuracy : 6.4 %
0m 19s (- 124m 2s) (200 0%) loss : 9.259  accuracy : 7.7 %
0m 30s (- 127m 17s) (300 0%) loss : 8.964  accuracy : 7.6 %
0m 40s (- 124m 59s) (400 0%) loss : 8.751  accuracy : 8.9 %
0m 49s (- 123m 38s) (500 0%) loss : 8.570  accuracy : 9.8 %
1m 0s (- 124m 51s) (600 0%) loss : 8.425  accuracy : 10.5 %
1m 9s (- 123m 35s) (700 0%) loss : 8.347  accuracy : 10.2 %
1m 19s (- 122m 34s) (800 1%) loss : 8.220  accuracy : 10.7 %
1m 29s (- 123m 7s) (900 1%) loss : 8.108  accuracy : 11.2 %
1m 40s (- 123m 20s) (1000 1%) loss : 8.050  accuracy : 11.3 %
1m 49s (- 122m 43s) (1100 1%) loss : 8.067  accuracy : 11.4 %
1m 59s (- 122m 30s) (1200 1%) loss : 7.974  accuracy : 11.8 %
2m 9s (- 121m 59s) (1300 1%) loss : 7.827  accuracy : 12.1 %
2m 19s (- 121m 48s) (1400 1%) loss : 7.811  accuracy : 12.0 %
2m 29s (- 121m 53s) (1500 2%) loss : 7.768  accuracy : 12.7 %
2m 38s (- 121m 18s) (1600 2%) loss : 7.752  accuracy : 11.7 %
2m 49s (- 121m 46s)

21m 56s (- 102m 31s) (13200 17%) loss : 6.490  accuracy : 17.5 %
22m 7s (- 102m 27s) (13300 17%) loss : 6.593  accuracy : 16.7 %
22m 18s (- 102m 22s) (13400 17%) loss : 6.463  accuracy : 17.2 %
22m 30s (- 102m 18s) (13500 18%) loss : 6.440  accuracy : 17.4 %
22m 41s (- 102m 14s) (13600 18%) loss : 6.495  accuracy : 17.5 %
22m 52s (- 102m 9s) (13700 18%) loss : 6.410  accuracy : 17.8 %
23m 3s (- 102m 5s) (13800 18%) loss : 6.444  accuracy : 17.5 %
23m 15s (- 101m 59s) (13900 18%) loss : 6.485  accuracy : 17.2 %
23m 26s (- 101m 54s) (14000 18%) loss : 6.515  accuracy : 17.3 %
23m 37s (- 101m 49s) (14100 18%) loss : 6.541  accuracy : 16.6 %
23m 48s (- 101m 45s) (14200 18%) loss : 6.566  accuracy : 16.7 %
24m 0s (- 101m 40s) (14300 19%) loss : 6.414  accuracy : 18.6 %
24m 11s (- 101m 35s) (14400 19%) loss : 6.446  accuracy : 18.3 %
24m 22s (- 101m 29s) (14500 19%) loss : 6.476  accuracy : 17.7 %
24m 33s (- 101m 24s) (14600 19%) loss : 6.483  accuracy : 17.0 %
24m 45s (- 101m 19s) (14700 19

45m 5s (- 84m 15s) (26100 34%) loss : 6.268  accuracy : 18.3 %
45m 16s (- 84m 7s) (26200 34%) loss : 6.171  accuracy : 19.4 %
45m 28s (- 83m 59s) (26300 35%) loss : 6.281  accuracy : 18.4 %
45m 40s (- 83m 51s) (26400 35%) loss : 6.193  accuracy : 19.1 %
45m 52s (- 83m 44s) (26500 35%) loss : 6.205  accuracy : 19.0 %
46m 4s (- 83m 37s) (26600 35%) loss : 6.215  accuracy : 19.2 %
46m 16s (- 83m 29s) (26700 35%) loss : 6.247  accuracy : 19.2 %
46m 28s (- 83m 22s) (26800 35%) loss : 6.227  accuracy : 18.8 %
46m 41s (- 83m 15s) (26900 35%) loss : 6.240  accuracy : 18.3 %
46m 53s (- 83m 8s) (27000 36%) loss : 6.236  accuracy : 18.5 %
47m 5s (- 83m 1s) (27100 36%) loss : 6.162  accuracy : 20.2 %
47m 18s (- 82m 54s) (27200 36%) loss : 6.182  accuracy : 19.2 %
47m 29s (- 82m 46s) (27300 36%) loss : 6.189  accuracy : 18.7 %
47m 41s (- 82m 38s) (27400 36%) loss : 6.210  accuracy : 19.2 %
47m 53s (- 82m 30s) (27500 36%) loss : 6.218  accuracy : 18.4 %
48m 5s (- 82m 21s) (27600 36%) loss : 6.252  a

67m 37s (- 62m 12s) (39000 52%) loss : 5.919  accuracy : 20.7 %
67m 48s (- 62m 2s) (39100 52%) loss : 6.076  accuracy : 19.8 %
67m 58s (- 61m 51s) (39200 52%) loss : 6.039  accuracy : 19.2 %
68m 8s (- 61m 41s) (39300 52%) loss : 5.958  accuracy : 20.5 %
68m 18s (- 61m 30s) (39400 52%) loss : 6.086  accuracy : 18.7 %
68m 29s (- 61m 19s) (39500 52%) loss : 6.092  accuracy : 19.6 %
68m 39s (- 61m 9s) (39600 52%) loss : 5.966  accuracy : 20.2 %
68m 57s (- 61m 5s) (39700 53%) loss : 5.992  accuracy : 20.6 %
69m 17s (- 61m 3s) (39800 53%) loss : 6.070  accuracy : 20.2 %
69m 27s (- 60m 52s) (39900 53%) loss : 5.957  accuracy : 20.7 %
69m 37s (- 60m 42s) (40000 53%) loss : 6.034  accuracy : 19.5 %
69m 47s (- 60m 31s) (40100 53%) loss : 6.064  accuracy : 19.7 %
69m 57s (- 60m 20s) (40200 53%) loss : 6.002  accuracy : 20.4 %
70m 8s (- 60m 10s) (40300 53%) loss : 6.112  accuracy : 19.2 %
70m 18s (- 59m 59s) (40400 53%) loss : 6.060  accuracy : 20.3 %
70m 28s (- 59m 48s) (40500 54%) loss : 6.018  

KeyboardInterrupt: 

Evaluation

In [22]:
word2vec.computeSimilarities()

In [50]:
word2vec.most_similar(word = 'car', bound = 10)

[('johan', 0.39462912),
 ('definition', 0.3945577),
 ('neighbourhood', 0.35367444),
 ('tonight', 0.34645903),
 ('mocking', 0.3464274),
 ('discoverer', 0.34625724),
 ('enact', 0.34261838),
 ('organized', 0.34073338),
 ('screens', 0.339069),
 ('memo', 0.33459648)]

Save & Load<br>

The lightweight word2vec model can be saved for further use, or alternatively the full shell wrapping the word2vec model can be saved for subsequent training.

In [52]:
# save
#torch.save(word2vec, path_to_DL4NLP + '\\saves\\DL4NLP_I1_cbow.pt')

# load
#word2vec = torch.load(path_to_DL4NLP + '\\saves\\DL4NLP_I1_cbow.pt')

### 1.1.3 Training with SkipGram objective

[Back to top](#plan)


Model

In [24]:
lang = Lang(corpus_tokenized, base_tokens = ['SOS', 'EOS', 'UNK'], min_count = 5)

In [25]:
word2vec = myWord2Vec(lang, T = 100)
skipgram = Word2VecShell(word2vec, device, sg = 1, context_size = 5)
print('skipgram.word2vec = word2vec :', skipgram.word2vec == word2vec)

skipgram.word2vec = word2vec : True


Data

In [55]:
Ngrams = skipgram.generatePackedNgrams(corpus_tokenized, context_size = 5, batch_size = 64, seed = 42)
len(Ngrams)

19435

Training

In [56]:
skipgram.train(Ngrams, epochs = 1, lr = 0.00025, print_every = 100, compute_accuracy = True)

epoch 1
0m 39s (- 127m 41s) (100 0%) loss : 14.977  accuracy : 0.6 %
1m 19s (- 127m 1s) (200 1%) loss : 10.444  accuracy : 1.9 %
1m 57s (- 124m 44s) (300 1%) loss : 9.782  accuracy : 2.7 %
2m 34s (- 122m 41s) (400 2%) loss : 9.573  accuracy : 3.4 %
3m 12s (- 121m 19s) (500 2%) loss : 9.494  accuracy : 3.7 %
3m 49s (- 120m 0s) (600 3%) loss : 9.405  accuracy : 3.7 %
4m 26s (- 118m 56s) (700 3%) loss : 9.318  accuracy : 4.0 %
5m 4s (- 118m 2s) (800 4%) loss : 9.236  accuracy : 4.1 %
5m 41s (- 117m 12s) (900 4%) loss : 9.177  accuracy : 4.1 %
6m 18s (- 116m 15s) (1000 5%) loss : 9.114  accuracy : 4.0 %
6m 55s (- 115m 18s) (1100 5%) loss : 9.034  accuracy : 4.0 %
7m 31s (- 114m 27s) (1200 6%) loss : 8.985  accuracy : 4.0 %
8m 9s (- 113m 41s) (1300 6%) loss : 8.963  accuracy : 3.9 %
8m 46s (- 112m 58s) (1400 7%) loss : 8.892  accuracy : 3.9 %
9m 23s (- 112m 12s) (1500 7%) loss : 8.847  accuracy : 4.1 %
9m 59s (- 111m 26s) (1600 8%) loss : 8.804  accuracy : 4.0 %
10m 36s (- 110m 38s) (1700 8

82m 31s (- 37m 10s) (13400 68%) loss : 7.688  accuracy : 5.1 %
83m 8s (- 36m 33s) (13500 69%) loss : 7.724  accuracy : 4.9 %
83m 46s (- 35m 56s) (13600 69%) loss : 7.698  accuracy : 5.0 %
84m 23s (- 35m 19s) (13700 70%) loss : 7.696  accuracy : 5.1 %
84m 59s (- 34m 42s) (13800 71%) loss : 7.677  accuracy : 5.2 %
85m 36s (- 34m 5s) (13900 71%) loss : 7.665  accuracy : 5.1 %
86m 13s (- 33m 28s) (14000 72%) loss : 7.714  accuracy : 5.0 %
86m 50s (- 32m 51s) (14100 72%) loss : 7.670  accuracy : 5.2 %
87m 27s (- 32m 14s) (14200 73%) loss : 7.676  accuracy : 5.0 %
88m 4s (- 31m 37s) (14300 73%) loss : 7.669  accuracy : 5.0 %
88m 41s (- 31m 0s) (14400 74%) loss : 7.701  accuracy : 5.2 %
89m 18s (- 30m 23s) (14500 74%) loss : 7.674  accuracy : 5.0 %
89m 55s (- 29m 46s) (14600 75%) loss : 7.678  accuracy : 5.0 %
90m 31s (- 29m 9s) (14700 75%) loss : 7.675  accuracy : 4.9 %
91m 8s (- 28m 32s) (14800 76%) loss : 7.623  accuracy : 5.2 %
91m 45s (- 27m 55s) (14900 76%) loss : 7.648  accuracy : 5.2 

KeyboardInterrupt: 

Evaluation

In [57]:
word2vec.most_similar(word = 'looking', bound = 10)

[('olympic', 0.35984373),
 ('acceleration', 0.35875782),
 ('bizarre', 0.35553902),
 ('democracies', 0.3541008),
 ('literally', 0.3506152),
 ('manual', 0.35027972),
 ('temptation', 0.3372059),
 ('personal', 0.3276703),
 ('chiefs', 0.3203342),
 ('sedan', 0.318881)]

Save & Load<br>

The lightweight word2vec model can be saved for further use, or alternatively the full shell wrapping the word2vec model can be saved for subsequent training.

In [None]:
# save
#torch.save(word2vec, path_to_DL4NLP + '\\saves\\DL4NLP_I1_skipgram.pt')

# load
#word2vec = torch.load(path_to_DL4NLP + '\\saves\\DL4NLP_I1_skipgram.pt')

<a id="gensim"></a>

## 1.2 Gensim Word2Vec

[Back to top](#plan)

Link : https://radimrehurek.com/gensim/models/word2vec.html<br>
Tutorials :

- https://cambridgespark.com/4046-2/
- https://rare-technologies.com/word2vec-tutorial/
- http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

### 1.2.1 Model

In [8]:
from gensim.models import Word2Vec
from gensim.test.utils import datapath, get_tmpfile

### 1.2.2 Training with CBOW objective

[Back to top](#plan)

Model & Data & Training

In [13]:
cbow_gensim = Word2Vec(corpus_tokenized, 
                       size = 100, 
                       window = 5, 
                       min_count = 5, 
                       negative = 20, 
                       iter = 50,
                       sg = 0,
                       workers = multiprocessing.cpu_count())

Evaluation

In [14]:
cbow_gensim.wv.most_similar('looking')

[('novice', 0.5856069326400757),
 ('amazed', 0.5641556978225708),
 ('searching', 0.5585623979568481),
 ('selling', 0.5566306710243225),
 ('writing', 0.5173830986022949),
 ('using', 0.4963681697845459),
 ('guessing', 0.48206374049186707),
 ('working', 0.4725167155265808),
 ('glad', 0.4712035655975342),
 ('hoping', 0.4707739055156708)]

Save & Load

The Gensim model can easily be saved & loaded :

In [15]:
# save
#file_name = get_tmpfile(path_to_DL4NLP + "\\saves\\DL4NLP_I1_cbow_gensim.model")
#cbow_gensim.save(file_name)

# load
#file_name = get_tmpfile(path_to_DL4NLP + "\\saves\\DL4NLP_I1_cbow_gensim.model")
#cbow_gensim = Word2Vec.load(file_name)

Alternatively it is direct to build a lightweight word2vec model out of a trained gensim model and then save & load it as done in previous section.

In [86]:
word2vec = myWord2Vec(lang = Lang(corpus = [list(cbow_gensim.wv.index2word)], base_tokens = []), T = cbow_gensim.wv.vectors)

In [None]:
word2vec.most_similar('looking')

### 1.2.3 Training with SkipGram objective

[Back to top](#plan)

Model & Data & Training

In [16]:
skipgram_gensim = Word2Vec(corpus_tokenized, 
                           size = 100, 
                           window = 5, 
                           min_count = 5, 
                           negative = 20, 
                           iter = 50,
                           sg = 1,
                           workers = multiprocessing.cpu_count())

Evaluation

In [17]:
skipgram_gensim.wv.most_similar('looking')

[('novice', 0.6076338291168213),
 ('trying', 0.5377198457717896),
 ('getting', 0.531008243560791),
 ('working', 0.5226007699966431),
 ('going', 0.5183506011962891),
 ('interested', 0.5151867270469666),
 ('need', 0.5117555856704712),
 ('sale', 0.5094774961471558),
 ('planning', 0.5061888098716736),
 ('wizard', 0.5041382312774658)]

Save & Load

In [18]:
# save
#file_name = get_tmpfile(path_to_DL4NLP + "\\saves\\DL4NLP_I1_skipgram_gensim.model")
#skipgram_gensim.save(file_name)

# load
#file_name = get_tmpfile(path_to_DL4NLP + "\\saves\\DL4NLP_I1_skipgram_gensim.model")
#skipgram_gensim = Word2Vec.load(file_name)

<a id="sub_word_level"></a>


# 2 Word Embedding via sub-word units
***

<a id="fastText"></a>

## 2.1 FastText's Word Embedding via character n-grams

[Back to top](#plan)


We consider the Gensim implementation of FastText, based on the CBOW training objective.<br>
Tutorial : [Gensim FastText](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/FastText_Tutorial.ipynb)<br>
Link to the original paper : [Enriching Word Vectors with Subword Information](https://arxiv.org/pdf/1607.04606.pdf).

### 2.1.1 Model

In [19]:
from gensim.models.fasttext import FastText as FT_gensim
from gensim.test.utils import datapath, get_tmpfile

### 2.1.2 Training with CBOW objective

[Back to top](#plan)

Model & Data

In [20]:
cbow_fastText_gensim = FT_gensim(size = 100, 
                                 window = 5, 
                                 min_count = 5, 
                                 negative = 20,
                                 sg = 0)

In [21]:
cbow_fastText_gensim.build_vocab(corpus_tokenized)

Training

In [22]:
cbow_fastText_gensim.train(sentences = corpus_tokenized, 
                           epochs = 50,
                           total_examples = cbow_fastText_gensim.corpus_count)

Evaluation

In [23]:
cbow_fastText_gensim.wv.most_similar('looking')

[('hooking', 0.864661693572998),
 ('overlooking', 0.855728268623352),
 ('cooking', 0.8507105708122253),
 ('joking', 0.8126223683357239),
 ('smoking', 0.753277599811554),
 ('-king', 0.7203484773635864),
 ('seeking', 0.7179160118103027),
 ('searching', 0.715704083442688),
 ('idling', 0.7143186330795288),
 ('scanning', 0.7119997143745422)]

Save & Load

In [24]:
# save
#file_name = get_tmpfile(path_to_DL4NLP + "\\saves\\DL4NLP_I1_cbow_fasttext.model")
#cbow_fastText_gensim.save(file_name)

# load
#file_name = get_tmpfile(path_to_DL4NLP + "\\saves\\DL4NLP_I1_cbow_fasttext.model")
#cbow_fastText_gensim = FT_gensim.load(file_name)

Alternatively it is direct to build a lightweight word2vec model out of a trained gensim model and then save & load it as done in previous section.

In [94]:
word2vec = myWord2Vec(lang = Lang(corpus = [list(cbow_fastText_gensim.wv.index2word)], base_tokens = []), T = cbow_fastText_gensim.wv.vectors)

However, the main advantage FastText offers is the possibility to get an embedding vector out of **any word**, and in fact any string thanks to the character-ngrams embedding trick :

In [None]:
cbow_fastText_gensim['HelloWorld']

Nonetheless, it can be interesting to load the look-up word vectors table into a lightweight word2vec module, as it allows to further optimize this table for any specific downstream task performed by a larger PyTorch model.

### 2.1.3 Training with SkipGram objective

[Back to top](#plan)

Model & Data

In [25]:
fastText_gensim = FT_gensim(size = 100, 
                           window = 5, 
                           min_count = 5, 
                           negative = 20,
                           sg = 1)

In [26]:
fastText_gensim.build_vocab(corpus_tokenized)

Training

In [27]:
fastText_gensim.train(sentences = corpus_tokenized, 
                      epochs = 50,
                      total_examples = fastText_gensim.corpus_count)

Evaluation

In [28]:
fastText_gensim.wv.most_similar('looking')

[('novice', 0.6242992877960205),
 ('hooking', 0.5955933928489685),
 ('getting', 0.5934851169586182),
 ('researching', 0.5598569512367249),
 ('converting', 0.5522629022598267),
 ('marching', 0.5496691465377808),
 ('designing', 0.5489811897277832),
 ('buying', 0.5446159839630127),
 ('searching', 0.5296090245246887),
 ('pd', 0.5229203104972839)]

Save & Load

In [29]:
# save
#file_name = get_tmpfile(path_to_DL4NLP + "\\saves\\DL4NLP_I1_skipgram_fasttext.model")
#fastText_gensim.save(file_name)

# load
#file_name = get_tmpfile(path_to_DL4NLP + "\\saves\\DL4NLP_I1_skipgram_fasttext.model")
#fastText_gensim = FT_gensim.load(file_name)

In [None]:
fastText_gensim[['13', 'to']]