<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 30px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
  Deep Learning for NLP
  </div> 
  
<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 30px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
  Part I - 1 Word Embedding
  </div> 

  <div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 20px; 
      text-align: center; 
      padding: 15px;">
  Jean-baptiste Aujogue
  </div> 

### Part I
1. <font color=red>**Word Embedding**</font>

2. Sentence Classification

    _Applications :_
    
    - Extractive Summarization
    - Sentiment Analysis
    - Text segmentation


3. Language Modeling

4. Sentence tagging

    _Applications :_
    
    - Part-of-speech Tagging
    - Named Entity Recognition
    - Automatic Value Extraction
    


### Part II

5. Auto-Encoding

6. Machine Translation

7. Text Classification




### Part III

8. Abstractive Summarization

9. Question Answering

10. Chatbot



***

<a id="plan"></a>

# Overview

The purpose of word embedding is to map a word, given as a raw string, to a dense vector. Three approaches are commonly followed for this task:

| Level |  | |
|------|------|------|
| **Word** | [I.1 Custom model](#word_level_custom) | [I.2 Gensim Model](#gensim) |
| **Subword** | [II.1 FastText model](#fastText) |  |
| **Character** |  |  |


<br>
Visualization with TensorBoard : https://www.tensorflow.org/guide/embedding (TODO)
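
The word-to-vector mapping described in this overview can be sketched as a plain lookup table. This is only an illustration in pure Python: the names `build_lookup_table` and `embed`, the toy vocabulary, and the random initialization are assumptions made for the example, not part of the models built later in the notebook.

```python
import random

def build_lookup_table(vocab, dim, seed=0):
    """Return a word -> row index mapping and a table holding one dense vector per word."""
    rng = random.Random(seed)
    word2index = {word: i for i, word in enumerate(vocab)}
    # Random initialization; training is what turns these rows into useful vectors.
    table = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
    return word2index, table

def embed(word, word2index, table, unk="UNK"):
    """Map a raw string to its dense vector, falling back to the UNK row for unseen words."""
    index = word2index.get(word, word2index[unk])
    return table[index]

vocab = ["UNK", "vaccine", "dose", "product"]   # hypothetical toy vocabulary
word2index, table = build_lookup_table(vocab, dim=4)
vector = embed("vaccine", word2index, table)
```

Unseen words share the `UNK` row here, which mirrors the role of the `UNK` base token in the `Lang` class defined further down.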

# Training objectives

#### CBOW training objective

This embedding method, introduced in \cite{mikolov2013distributed, mikolov2013efficient}, builds a lookup table $T$ that holds one vector per word of a vocabulary. Its distinguishing feature is that the vectors are learned so that each word can be predicted from its context. The table $T$ is built through a neural network that models the probability of a word $w_t$ given its context $c = w_{t-N}, \, ... \, , w_{t-1}$, $w_{t+1}, \, ... \, , w_{t+N}$. The table $T$, which is part of the model, is optimized by training the model so that the observed word $w_t$ maximizes the likelihood under the distribution $P(. \, | \, c)$ given by the model.

The neural network is described as follows:

![cbow](figs/CBOW.png)

A context $c = w_{t-N}, \, ... \, , w_{t-1}$, $w_{t+1}, \, ... \, , w_{t+N}$ is embedded through the table $T$, which yields a set of dense vectors (typically of dimension between 50 and 300) $T(w_{t-N}), \, ... \, , T(w_{t-1})$, $T(w_{t+1}), \, ... \, , T(w_{t+N})$. Each vector then goes through an affine transformation, and the resulting vectors are summed into a single vector

\begin{align*}
v_c = \sum _{\substack{i = -N \\ i \neq 0}}^{N} \left( M_i T(w_{t+i}) + b_i \right)
\end{align*}

The vector $v_c$ typically has the same dimension as the word embedding. A second table $T'$ provides another embedding of the vocabulary: the word $w_{t}$ is mapped to a vector $T'(w_{t})$ by this table, and is proposed at position $t$ with probability

\begin{align*}
P(w_{t} \, | \, c\,) = \frac{\exp\left( T'(w_{t}) \cdot v_c \right) }{\displaystyle \sum _{w \in \mathcal{V}} \exp\left(   T'(w) \cdot v_c 
\right) }
\end{align*}

Here $\cdot$ denotes the dot product between vectors. Optimizing this model adjusts the table $T$ so that the word vectors carry enough information to recover a word from its context.
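
The CBOW forward computation above can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions: the function name `cbow_probabilities`, the toy sizes, and the random parameters are hypothetical, and the notebook's actual model is the PyTorch `Word2VecShell` defined later.

```python
import math
import random

def matvec(M, v):
    """Matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_probabilities(context_ids, T, T_out, Ms, bs):
    """Return P(w | c) for every word w of the vocabulary.

    context_ids : indices of the 2N context words (center word excluded)
    T, T_out    : input table T and output table T'
    Ms, bs      : one affine map (M_i, b_i) per context position
    """
    v_c = [0.0] * len(bs[0])
    for M_i, b_i, word_id in zip(Ms, bs, context_ids):
        projected = matvec(M_i, T[word_id])
        v_c = [acc + p + b for acc, p, b in zip(v_c, projected, b_i)]
    # Score of each candidate word: dot product T'(w) . v_c
    scores = [sum(t * v for t, v in zip(row, v_c)) for row in T_out]
    return softmax(scores)

rng = random.Random(0)
vocab_size, dim, N = 6, 3, 2   # toy sizes, chosen for the example only

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

T, T_out = rand_matrix(vocab_size, dim), rand_matrix(vocab_size, dim)
Ms = [rand_matrix(dim, dim) for _ in range(2 * N)]
bs = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(2 * N)]
probs = cbow_probabilities([1, 2, 4, 5], T, T_out, Ms, bs)
```

The result is a probability distribution over the whole vocabulary; training would increase the entry corresponding to the observed center word.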


#### Skip-Gram training objective


This embedding method is introduced in \cite{mikolov2013distributed, mikolov2013efficient} as the mirror image of Continuous Bag of Words (CBOW), and again builds a lookup table $T$ that holds one vector per word of a vocabulary. Here the vectors are learned not to predict a center word $w$ from a context $c$, as in CBOW, but to predict the context $c$ from the center word $w$. The table $T$ is built through a neural network that models the probability of a context $c = w_{t-N}, \, ... \, , w_{t-1}$, $w_{t+1}, \, ... \, , w_{t+N}$ given a center word $w_t$. The table $T$, which is part of the model, is optimized by training the model so that the observed context $c$ maximizes the likelihood under the distribution $P( . \, | \, w_t)$ given by the model.


One implementation of this model is the following:


![skipgram](figs/Skipgram.png)


A current word $w_t$ is embedded by a table $T$ into a dense vector $T(w_t)$ (typically of dimension between 50 and 300). This vector is then transformed into a set of $2N$ vectors

\begin{align*}
\sigma (M_{i} T(w_t) + b_{i}) \qquad \qquad i =-N,\, ...\, , -1, 1, \, ...\, , N
\end{align*}

where $N$ is the chosen window size, each of the $2N$ vectors typically has the same dimension as the word embedding, and $\sigma$ is a nonlinear function (typically the _Rectified Linear Unit_ $\sigma (x) = \max (0, x)$). A second table $T'$ provides another embedding of the vocabulary: each word $w_{t+i}$ is mapped to a vector $T'(w_{t+i})$ by this table, and is proposed at position $t+i$ with probability

\begin{align*}
P( w_{t+i} \, | \, w_t) = \frac{\exp\left(  T'(w_{t+i}) ^\top \sigma \left( M_i T(w_t) + b_{i}\right) \right) }{\displaystyle \sum _{w \in \mathcal{V}} \exp\left(   T'(w) ^\top \sigma \left( M_i T(w_t) + b_i\right) \right) }
\end{align*}

The probability that a set of words $c = w_{t-N}, \, ... \, , w_{t-1}$, $w_{t+1}, \, ... \, , w_{t+N}$ is the context of a word $w_t$ is then modeled by the product

\begin{align*}
 P( c\, | \, w_t) = \prod _{\substack{i = -N \\ i \neq 0}}^{N} P( w_{t+i}\, | \, w_t)
\end{align*}

This model of the context probability is naive in the sense that the context words are treated as conditionally independent once the center word is known. This approximation nevertheless makes the optimization much cheaper to compute.
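
Under this independence assumption, the log-probability of a context is simply a sum of per-position log-probabilities. The following pure-Python sketch illustrates that factorization; the function name `skipgram_log_prob`, the toy sizes, and the random parameters are assumptions made for the example.

```python
import math
import random

def matvec(M, v):
    """Matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def log_softmax(scores):
    """Numerically stable log-softmax."""
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_norm for s in scores]

def skipgram_log_prob(center_id, context_ids, T, T_out, Ms, bs):
    """Return log P(c | w_t) as the sum over positions i of log P(w_{t+i} | w_t)."""
    total = 0.0
    for M_i, b_i, target_id in zip(Ms, bs, context_ids):
        # sigma(M_i T(w_t) + b_i), with sigma = ReLU
        hidden = [max(0.0, h + b) for h, b in zip(matvec(M_i, T[center_id]), b_i)]
        # Score of each candidate word at this position: T'(w)^T hidden
        scores = [sum(t * h for t, h in zip(row, hidden)) for row in T_out]
        total += log_softmax(scores)[target_id]
    return total

rng = random.Random(0)
vocab_size, dim, N = 6, 3, 2   # toy sizes, chosen for the example only

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

T, T_out = rand_matrix(vocab_size, dim), rand_matrix(vocab_size, dim)
Ms = [rand_matrix(dim, dim) for _ in range(2 * N)]
bs = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(2 * N)]
log_p = skipgram_log_prob(3, [1, 2, 4, 5], T, T_out, Ms, bs)
```

Working with log-probabilities turns the product into a sum, which is also what a negative log-likelihood loss summed over context positions computes.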



Optimizing this model adjusts the table $T$ so that the word vectors carry enough information to recover the entire context from that single word. Skip-Gram embeddings typically perform better than CBOW embeddings, because the table $T$ is subject to more constraints during optimization, and because the vector of a word is learned so as to predict the actual usage of the word, here given by its context.

# Packages

[Back to top](#plan)

In [2]:
from __future__ import unicode_literals, print_function, division  # must be the first statement; a no-op on Python 3
import sys
import warnings
import os
from io import open
import unicodedata
import string
import time
import math
import re
import random
import pickle
import copy
from unidecode import unidecode


# for special math operation
from sklearn.preprocessing import normalize


# for manipulating data 
import numpy as np
#np.set_printoptions(threshold=np.nan)
import pandas as pd
import bcolz # see https://bcolz.readthedocs.io/en/latest/intro.html


# for text processing
import gensim
from gensim.models import KeyedVectors
#import spacy
import nltk
#nltk.download()
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem.porter import PorterStemmer


# for deep learning
import torch
import torch.nn as nn
from torch.autograd import Variable
from torch import optim
import torch.nn.functional as F
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


warnings.filterwarnings("ignore")
print('python version :', sys.version)
print('pytorch version :', torch.__version__)
print('DL device :', device)



python version : 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
pytorch version : 0.4.0
DL device : cuda


In [3]:
path_to_NLP = 'C:\\Users\\Jb\\Desktop\\NLP'

In [4]:
#sys.path.append(path_to_NLP + '\\chatNLP')

# Corpus

[Back to top](#plan)

The text is imported as a list in which each element is one text, itself stored as a list of words.<br> Once imported, the corpus therefore has the form: [[str]]

In [5]:
def cleanSentence(sentence): # -------------------------  str
    sw = ['']
    #sw += nltk.corpus.stopwords.words('english')
    #sw += nltk.corpus.stopwords.words('french')

    def unicodeToAscii(s):
        """Turn a Unicode string to plain ASCII, thanks to http://stackoverflow.com/a/518232/2809427"""
        return ''.join( c for c in unicodedata.normalize('NFD', s)
                        if unicodedata.category(c) != 'Mn')

    def normalizeString(s):
        '''Remove rare symbols from a string'''
        s = unicodeToAscii(s.lower().strip()) # 
        #s = re.sub(r"[^a-zA-Z\.\(\)\[\]]+", r" ", s)  # 'r' before a string is for 'raw' # ?&\%\_\- removed # set('''.,:;()*#&-_%!?/\'")''')
        return s

    def wordTokenizerFunction():
        # base version
        function = lambda sentence : sentence.strip().split()

        # nltk version
        #function = word_tokenize    
        return function

    # 1 - special characters
    def clean_sentence_punct(text): # --------------  str
        text = normalizeString(text)
        # drop the trailing punctuation mark
        if (len(text) > 0 and text[-1] in ['.', ',', ';', ':', '!', '?']) : text = text[:-1]

        text = text.replace(r'(', r' ( ')
        text = text.replace(r')', r' ) ')
        text = text.replace(r'[', r' [ ')
        text = text.replace(r']', r' ] ')
        text = text.replace(r'<', r' < ')
        text = text.replace(r'>', r' > ')

        text = text.replace(r':', r' : ')
        text = text.replace(r';', r' ; ')
        # temporarily protect decimal separators between digits with the '__-__' marker
        for i in range(5) :
            text = re.sub(r'(?P<val1>[0-9])\.(?P<val2>[0-9])', r'\g<val1>__-__\g<val2>', text)
            text = re.sub(r'(?P<val1>[0-9]),(?P<val2>[0-9])', r'\g<val1>__-__\g<val2>', text)
        text = text.replace(r',', ' , ')
        text = text.replace(r'.', ' . ')
        # restore the protected separators as '.'
        for i in range(5) : text = re.sub(r'(?P<val1>[p0-9])__-__(?P<val2>[p0-9])', r'\g<val1>.\g<val2>', text)
        text = re.sub(r'(?P<val1>[0-9]) \. p \. (?P<val2>[0-9])', r'\g<val1>.p.\g<val2>', text)
        text = re.sub(r'(?P<val1>[0-9]) \. s \. (?P<val2>[0-9])', r'\g<val1>.s.\g<val2>', text)

        text = text.replace(r'"', r' " ')
        text = text.replace(r'’', r" ' ")
        text = text.replace(r'”', r' " ')
        text = text.replace(r'“', r' " ')
        text = text.replace(r'/', r' / ')

        text = re.sub('(…)+', ' … ', text)
        text = text.replace('≤', ' ≤ ')          
        text = text.replace('≥', ' ≥ ')
        text = text.replace('°c', ' °c ')
        text = text.replace('°C', ' °c ')
        text = text.replace('ºc', ' °c ')
        text = text.replace('n°', 'n° ')
        text = text.replace('%', ' % ')
        text = text.replace('*', ' * ')
        text = text.replace('+', ' + ')
        text = text.replace('-', ' - ')
        text = text.replace('_', ' ')
        text = text.replace('®', ' ')
        text = text.replace('™', ' ')
        text = text.replace('±', ' ± ')
        text = text.replace('÷', ' ÷ ')
        text = text.replace('–', ' - ')
        text = text.replace('μg', ' µg')
        text = text.replace('µg', ' µg')
        text = text.replace('µl', ' µl')
        text = text.replace('μl', ' µl')
        text = text.replace('µm', ' µm')
        text = text.replace('μm', ' µm')
        text = text.replace('ppm', ' ppm')
        text = re.sub(r'(?P<val1>[0-9])mm', r'\g<val1> mm', text)
        text = re.sub(r'(?P<val1>[0-9])g', r'\g<val1> g', text)
        text = text.replace('nm', ' nm')

        text = re.sub(r'fa(?P<val1>[0-9])', r'fa \g<val1>', text)
        text = re.sub(r'g(?P<val1>[0-9])', r'g \g<val1>', text)
        text = re.sub(r'n(?P<val1>[0-9])', r'n \g<val1>', text)
        text = re.sub(r'p(?P<val1>[0-9])', r'p \g<val1>', text)
        text = re.sub(r'q_(?P<val1>[0-9])', r'q_ \g<val1>', text)
        text = re.sub(r'u(?P<val1>[0-9])', r'u \g<val1>', text)
        text = re.sub(r'ud(?P<val1>[0-9])', r'ud \g<val1>', text)
        text = re.sub(r'ui(?P<val1>[0-9])', r'ui \g<val1>', text)

        text = text.replace('=', ' ')
        text = text.replace('!', ' ')
        text = text.replace('-', ' ')
        text = text.replace(r' , ', ' ')
        text = text.replace(r' . ', ' ')

        text = re.sub(r'(?P<val>[0-9])ml', r'\g<val> ml', text)
        text = re.sub(r'(?P<val>[0-9])mg', r'\g<val> mg', text)

        for i in range(5) : text = re.sub(r'( [0-9]+ )', ' ', text)
        #text = re.sub('cochran(\S)*', 'cochran ', text)
        return text

    # 3 - word splitting
    def wordSplit(sentence, tokenizeur): # ------------- [str]
        return tokenizeur(sentence)

    # 4 - stopword removal (lowercasing is already done in normalizeString)
    def stopwordsRemoval(sentence, sw): # ------------- [[str]]
        return [word for word in sentence if word not in sw]

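    # 6 - spelling correction
    def correction(text):
        def correct(word):
            # NOTE: `spelling` is not defined in this notebook; a spell-checker object must be provided before enabling this step
            return spelling.suggest(word)[0]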
        list_of_list_of_words = [[correct(word) for word in sentence] for sentence in text]
        return list_of_list_of_words

    # 7 - stemming
    def stemming(text): # ------------------------- [[str]]
        list_of_list_of_words = [[PorterStemmer().stem(word) for word in sentence if word not in sw] for sentence in text]
        return list_of_list_of_words


    tokenizeur = wordTokenizerFunction()
    sentence = clean_sentence_punct(str(sentence))
    sentence = wordSplit(sentence, tokenizeur)
    sentence = stopwordsRemoval(sentence, sw)
    #text = correction(text)
    #text = stemming(text)
    return sentence


def importSheet(file_name) :
    def cleanDatabase(db):
        words = []
        title = ''
        for pair in db :
            if pair[0] != title :
                words += cleanSentence(pair[0]) #[str]
                title  = pair[0]                # str
            words += cleanSentence(pair[1])     #[str]
        return words

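    df = pd.read_excel(file_name, header = None)   # `sep` is a read_csv option, not a read_excel one
    headers = [i for i, titre in enumerate(df.iloc[0,:].values) if i in [1, 2] or titre == 'score manuel'] 
    db = df.iloc[1:, headers].values.tolist()      # `.ix` was removed in pandas 1.0; positions equal labels here since header = None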
    db = [el[1: 3] for el in db if el[-1] in [0,1, 10]]
    words = cleanDatabase(db)
    return words


def importCorpus(path_to_data) :
    corpus = []
    reps = os.listdir(path_to_data)
    for rep in reps :
        files = os.listdir(os.path.join(path_to_data, rep))
        for file in files :
            file_name = os.path.join(path_to_data, rep, file)
            corpus.append(importSheet(file_name))
    return corpus
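
The decimal-protection step used in `clean_sentence_punct` can be isolated in a short standalone sketch. The function name `split_keep_decimals` and the sample sentence are hypothetical; only the `__-__` marker and the overall idea (protect separators between digits, space out the remaining punctuation, then restore) come from the cell above.

```python
import re

PLACEHOLDER = "__-__"   # same marker as in clean_sentence_punct

def split_keep_decimals(text):
    """Tokenize on whitespace after isolating ',' and '.', keeping decimal numbers whole."""
    # Several passes are needed because matches cannot overlap:
    # in "1.2.3" the first pass only protects the first separator.
    for _ in range(5):
        text = re.sub(r"(?P<a>[0-9])[.,](?P<b>[0-9])",
                      r"\g<a>" + PLACEHOLDER + r"\g<b>", text)
    # Remaining commas and periods are real punctuation: surround them with spaces.
    text = text.replace(",", " , ").replace(".", " . ")
    # Restore protected separators, normalizing decimal commas to periods.
    text = text.replace(PLACEHOLDER, ".")
    return text.split()

tokens = split_keep_decimals("a dose of 0.5 ml, stored at 2,5.")
# tokens == ['a', 'dose', 'of', '0.5', 'ml', ',', 'stored', 'at', '2.5', '.']
```

Unlike `cleanSentence`, this sketch keeps the isolated punctuation tokens; the notebook's function goes on to drop standalone commas and periods.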

In [6]:
corpus = importCorpus(path_to_NLP + '\\data\\AMM')

In [7]:
corpus[0]

['the',
 'testing',
 'performed',
 'on',
 'the',
 'finished',
 'product',
 '(',
 'fp',
 ')',
 'is',
 'in',
 'compliance',
 'with',
 'both',
 'current',
 'european',
 'pharmacopoeia',
 '(',
 'ph',
 'eur',
 ')',
 'and',
 'world',
 'health',
 'organization',
 '(',
 'who',
 ')',
 'requirements',
 'of',
 'the',
 'vaccine',
 '1',
 'once',
 'the',
 'lyophilizate',
 'has',
 'been',
 'reconstituted',
 'with',
 'the',
 'appropriate',
 'volume',
 'of',
 'diluent',
 '(',
 '0.4',
 '%',
 'sodium',
 'chloride',
 'solution',
 'for',
 'suspension',
 'for',
 'injection',
 ')',
 'a',
 'human',
 'dose',
 'of',
 'fp',
 'is',
 '0.5',
 'ml',
 '0',
 'the',
 'specifications',
 'of',
 'the',
 'drug',
 'product',
 'single',
 'dose',
 'are',
 'described',
 'in',
 'table',
 '1',
 '1',
 'table',
 ':',
 'specifications',
 'for',
 'the',
 'drug',
 'product',
 '1',
 'table',
 ':',
 'specifications',
 'for',
 'the',
 'drug',
 'product',
 '|',
 '*',
 'iu',
 ':',
 'international',
 'unit',
 '1']

<a id="word_level"></a>


# 1 Word-level Embedding
***

<a id="word_level_custom"></a>

## 1.1 Custom Word-level Embedding Model

[Back to top](#plan)

### 1.1.1 Model

#### Language

Language class taking as input a corpus of the form [[str]]

In [6]:
#from chatNLP.utils import Lang

In [7]:
class Lang:
    def __init__(self, corpus = None, base_tokens = ['SOS', 'EOS', 'UNK'], min_count = None):
        self.base_tokens = base_tokens
        self.initData(base_tokens)
        if    corpus is not None : self.addCorpus(corpus)
        if min_count is not None : self.removeRareWords(min_count)

        
    def initData(self, base_tokens) :
        self.word2index = {word : i for i, word in enumerate(base_tokens)}
        self.index2word = {i : word for i, word in enumerate(base_tokens)}
        self.word2count = {word : 0 for word in base_tokens}
        self.n_words = len(base_tokens)
        return
    
    def getIndex(self, word) :
        if    word in self.word2index : return self.word2index[word]
        elif 'UNK' in self.word2index : return self.word2index['UNK']
        return
        
    def addWord(self, word):
        '''Add a word to the language'''
        if word not in self.word2index:
            if word.strip() != '' :
                self.word2index[word] = self.n_words
                self.word2count[word] = 1
                self.index2word[self.n_words] = word
                self.n_words += 1
        else:
            self.word2count[word] += 1
        return 
            
    def addSentence(self, sentence):
        '''Add to the language all words of a sentence'''
        words = sentence if isinstance(sentence, list) else nltk.word_tokenize(sentence)
        for word in words : self.addWord(word)          
        return
            
    def addCorpus(self, corpus):
        '''Add to the language all words contained into a corpus'''
        for text in corpus : self.addSentence(text)
        return 
                
    def removeRareWords(self, min_count):
        '''Remove words appearing fewer than min_count times'''
        kept_word2count = {word: count for word, count in self.word2count.items() if count >= min_count}
        self.initData(self.base_tokens)
        for word, count in kept_word2count.items(): 
            self.addWord(word)
            self.word2count[word] = kept_word2count[word]
        return
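
The vocabulary logic of `Lang` (base tokens first, then every word that reaches a minimum count) can be sketched independently with `collections.Counter`. The function name `build_vocab` and the toy corpus are assumptions made for the example; this is not a drop-in replacement for the class above.

```python
from collections import Counter

def build_vocab(corpus, base_tokens=("SOS", "EOS", "UNK"), min_count=1):
    """Index the base tokens first, then every corpus word seen at least min_count times."""
    counts = Counter(word for text in corpus for word in text)
    word2index = {token: i for i, token in enumerate(base_tokens)}
    for word, count in counts.items():
        if count >= min_count and word not in word2index:
            word2index[word] = len(word2index)
    return word2index, counts

toy_corpus = [["the", "drug", "product"],
              ["the", "finished", "product"],
              ["the", "vaccine"]]
word2index, counts = build_vocab(toy_corpus, min_count=2)

# Words below the threshold fall back to the UNK index, as in Lang.getIndex.
encoded = [word2index.get(w, word2index["UNK"]) for w in ["the", "vaccine"]]
```

With `min_count=2`, only `the` and `product` survive next to the three base tokens, so `vaccine` is encoded as `UNK`.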

In [8]:
def saveLang(name, lang):
    with open(path_to_NLP + '\\saves\\lang\\' + name + '.file', 'wb') as fil :
        pickle.dump(lang, fil)
    return

def importLang(name):
    with open(path_to_NLP + '\\saves\\lang\\' + name + '.file', 'rb') as fil :
        lang = pickle.load(fil)
    return lang

In [7]:
lang = Lang(corpus, base_tokens = ['SOS', 'EOS', 'UNK'])
print("Word count before: {}".format(lang.n_words))
lang.removeRareWords(min_count = 4)
print("Word count after: {}".format(lang.n_words))

Word count before: 8055
Word count after: 4050


In [8]:
#saveLang(name = 'DL4NLP_I1', lang = lang)
#lang = importLang(name = 'DL4NLP_I1')

#### Comparison with a reference vocabulary

In [9]:
#taken from https://medium.com/@martinpella/how-to-use-pre-trained-word-embeddings-in-pytorch-71ca59249f76

# --------------------- comparison with Glove vocab ------------------------
def vocabGlove(name) :
    words = []
    path = path_to_NLP + '\\vectors\\' + name 
    with open(path + '.txt', 'rb') as f:
        for l in f:
            line = l.decode().split()
            word = line[0]
            words.append(word)
    return words

def intersection(lst1, lst2): 
    return list(set(lst1) & set(lst2))

def comparaison(lang) :
    vocab_lang = list(lang.word2index.keys())
    intersect_glove = intersection(vocab_glove, vocab_lang)
    reste_glove = np.setdiff1d(vocab_lang, intersect_glove)
    printComparaison('glove', vocab_lang, intersect_glove, reste_glove)
    return intersect_glove, reste_glove

def printComparaison(nom, vocab_lang, intersect, reste) :
    print('share of language words found in {}  {:.2f} % \nshare of language words missing from it  {:.2f} %'.format(nom, len(intersect)*100/len(vocab_lang),len(reste)*100/len(vocab_lang) ) )


# --------------------- detect missing spaces ------------------------
def checkWhetherBroken(vocab, clean_vocab) :
    '''Map each word of vocab to True when it belongs to clean_vocab'''
    return {word : word in clean_vocab for word in vocab}

def checkMissingSpaces(word, clean_vocab) :
    '''Split a word into two known words when a space seems to be missing'''
    for word2 in clean_vocab :
        if word.startswith(word2) :
            rest = word[len(word2):]   # strip the prefix only, not every occurrence of word2
            if rest in clean_vocab :
                return word2 + ' ' + rest
    return word

In [10]:
vocab_glove = vocabGlove('glove.6B.200d')

In [11]:
words_glove, reste_glove = comparaison(lang)

share of language words found in glove  85.90 % 
share of language words missing from it  14.10 %
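
The coverage figure reported above is a set intersection ratio. A minimal standalone sketch follows; the function name `vocabulary_coverage` and both toy vocabularies are hypothetical and stand in for the corpus vocabulary and the GloVe word list.

```python
def vocabulary_coverage(vocab, reference_vocab):
    """Return the percentage of vocab found in reference_vocab, and the sorted missing words."""
    vocab, reference = set(vocab), set(reference_vocab)
    covered = vocab & reference
    missing = vocab - reference
    return 100 * len(covered) / len(vocab), sorted(missing)

lang_vocab = ["the", "vaccine", "lyophilizate", "q_5"]   # toy corpus vocabulary
reference = ["the", "vaccine", "dose"]                   # toy reference vocabulary
coverage, missing = vocabulary_coverage(lang_vocab, reference)
```

Inspecting the missing words is usually more informative than the percentage alone, since they tend to be domain terms or tokenization artifacts.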


#### Word2Vec model

[Back to top](#plan)

In [12]:
#from chatNLP.models.Word_Embedding import Word2Vec as myWord2Vec

In [10]:
class myWord2Vec(nn.Module) :
    def __init__(self, lang, T = 100):
        super(myWord2Vec, self).__init__()
        self.lang = lang
        if type(T) == int :
            self.embedding = nn.Embedding(lang.n_words, T)  
        else :
            self.embedding = nn.Embedding(T.shape[0], T.shape[1])
            self.embedding.weight = nn.Parameter(torch.FloatTensor(T))
            
        self.output_dim = self.lookupTable().shape[1]
        self.sims = None
        
    def lookupTable(self) :
        return self.embedding.weight.cpu().detach().numpy()
        
    def computeSimilarities(self) :
        T = normalize(self.lookupTable(), norm = 'l2', axis = 1)
        self.sims = np.matmul(T, T.transpose())
        return

    def most_similar(self, word, bound = 10) :
        if word not in self.lang.word2index : return
        if self.sims is None : self.computeSimilarities()
        index = self.lang.word2index[word]
        coefs = self.sims[index]
        indices = coefs.argsort()[-bound -1 :-1]
        output = [(self.lang.index2word[i], coefs[i]) for i in reversed(indices)]
        return output
    
    def wv(self, word) :
        return self.lookupTable()[self.lang.getIndex(word)]
    
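    def addWord(self, word, vector = None) :
        if word in self.lang.word2index : return   # already known: a new row would misalign table and vocabulary
        self.lang.addWord(word)
        T = self.lookupTable()
        v = np.random.rand(1, T.shape[1]) if vector is None else vector
        updated_T = np.concatenate((T, v), axis = 0)
        self.embedding = nn.Embedding(updated_T.shape[0], updated_T.shape[1])
        self.embedding.weight = nn.Parameter(torch.FloatTensor(updated_T))
        self.sims = None                           # cached similarities are stale once the table grows
        return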
    
    def freeze(self) :
        for param in self.embedding.parameters() : param.requires_grad = False
        return self
    
    def unfreeze(self) :
        for param in self.embedding.parameters() : param.requires_grad = True
        return self
    
    def forward(self, words, device = None) :
        '''Transforms a list of n words into a torch.FloatTensor of size (1, n, emb_dim)'''
        indices  = [self.lang.getIndex(w) for w in words]
        indices  = [[i for i in indices if i is not None]]
        variable = Variable(torch.LongTensor(indices)) # size = (1, n)
        if device is not None : variable = variable.to(device)
        tensor   = self.embedding(variable)            # size = (1, n, emb_dim)
        return tensor
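
The nearest-neighbour search performed by `most_similar` ranks words by cosine similarity, which is what the product of the L2-normalized table with its transpose computes. A pure-Python sketch of the same idea, where the function names and the toy two-dimensional vectors are assumptions made for the example:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, top_n=3):
    """Return the top_n (word, similarity) pairs closest to `word`, excluding the word itself."""
    query = vectors[word]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

toy_vectors = {"dose": [1.0, 0.0], "dosage": [0.9, 0.1], "table": [0.0, 1.0]}
neighbours = most_similar("dose", toy_vectors, top_n=2)
```

The class above precomputes the full similarity matrix instead, which costs more memory but makes repeated queries cheap.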

#### Word2Vec Shell

[Back to top](#plan)

Shell acting as a wrapper around the Word2Vec model, implementing :

- The layers suited for the training objective
- The methods for all optimization steps
- The methods for generating the data suitable for the optimization process

In [14]:
#from chatNLP.models.Word_Embedding import Word2VecShell

In [60]:
class Word2VecShell(nn.Module):
    '''Word2Vec model :
        - sg = 0 yields CBOW training procedure
        - sg = 1 yields Skip-Gram training procedure
    '''
    def __init__(self, word2vec, device, sg = 0, context_size = 5, hidden_dim = 150, 
                 criterion = nn.NLLLoss(size_average = False), optimizer = optim.SGD):
        super(Word2VecShell, self).__init__()
        self.device = device
        
        # core of Word2Vec
        self.word2vec = word2vec
        
        # training layers
        self.input_n_words  = (2 * context_size if sg == 0 else 1)
        self.output_n_words = (1 if sg == 0 else 2 * context_size)
        self.linear_1  = nn.Linear(self.input_n_words * word2vec.embedding.weight.size(1), self.output_n_words * hidden_dim)
        self.linear_2  = nn.Linear(hidden_dim, word2vec.lang.n_words)   # use the model's own vocabulary, not the global `lang`
        
        # training tools
        self.sg = sg
        self.criterion = criterion
        self.optimizer = optimizer
        
        # load to device
        self.to(device)
        
    def forward(self, batch):
        '''Transforms a batch of Ngrams of size (batch_size, input_n_words)
           Into log probabilities of size (batch_size, lang.n_words, output_n_words)
           '''
        batch = batch.to(self.device)                 # size = (batch_size, self.input_n_words)
        embed = self.word2vec.embedding(batch)        # size = (batch_size, self.input_n_words, embedding_dim)
        embed = embed.view((batch.size(0), -1))       # size = (batch_size, self.input_n_words * embedding_dim)
        out = self.linear_1(embed)                    # size = (batch_size, self.output_n_words * hidden_dim) 
        out = out.view((batch.size(0),self.output_n_words, -1))
        out = F.relu(out)                             # size = (batch_size, self.output_n_words, hidden_dim)                                         
        out = self.linear_2(out)                      # size = (batch_size, self.output_n_words, lang.n_words)
        out = torch.transpose(out, 1, 2)              # size = (batch_size, lang.n_words, self.output_n_words)
        log_probs = F.log_softmax(out, dim = 1)       # size = (batch_size, lang.n_words, self.output_n_words)
        return log_probs
    
    def generatePackedNgrams(self, corpus, context_size = 5, batch_size = 32, seed = 42) :
        # generate Ngrams
        data = []
        for text in corpus :
            text = [w for w in text if w in self.word2vec.lang.word2index]
            text = ['SOS' for i in range(context_size)] + text + ['EOS' for i in range(context_size)]
            for i in range(context_size, len(text) - context_size):
                context = text[i-context_size : i] + text[i+1 : i+context_size+1]
                word = text[i]
                data.append([word, context])
        # pack Ngrams into mini_batches
        random.seed(seed)
        random.shuffle(data)
        packed_data = []
        for i in range(0, len(data), batch_size):
            pack0 = [el[0] for el in data[i:i + batch_size]]
            pack0 = [[self.word2vec.lang.getIndex(w)] for w in pack0]
            pack0 = Variable(torch.LongTensor(pack0)) # size = (batch_size, 1)
            pack1 = [el[1] for el in data[i:i + batch_size]]
            pack1 = [[self.word2vec.lang.getIndex(w) for w in context] for context in pack1]
            pack1 = Variable(torch.LongTensor(pack1)) # size = (batch_size, 2*context_size)   
            if   self.sg == 1 : packed_data.append([pack0, pack1])
            elif self.sg == 0 : packed_data.append([pack1, pack0])
            else :
                raise ValueError('sg must be 0 (CBOW) or 1 (Skip-Gram)')
        return packed_data
    
    def train(self, ngrams, iters = None, epochs = None, lr = 0.025, random_state = 42,
              print_every = 10, compute_accuracy = False):
        """Performs training over a given dataset and along a specified amount of loops"""
        def asMinutes(s):
            m = math.floor(s / 60)
            s -= m * 60
            return '%dm %ds' % (m, s)

        def timeSince(since, percent):
            now = time.time()
            s = now - since
            rs = s/percent - s
            return '%s (- %s)' % (asMinutes(s), asMinutes(rs))

        def computeAccuracy(log_probs, targets) :
            accuracy = 0
            for i in range(targets.size(0)) :
                for j in range(targets.size(1)) :
                    topv, topi = log_probs[i, :, j].data.topk(1) 
                    ni = topi.item()   # index of the most probable word; indexing a 0-dim tensor fails on recent PyTorch
                    if ni == targets[i, j].item() : accuracy += 1
            return (accuracy * 100) / (targets.size(0) * targets.size(1))

        def printScores(start, iter, iters, tot_loss, tot_loss_words, print_every, compute_accuracy) :
            avg_loss = tot_loss / print_every
            avg_loss_words = tot_loss_words / print_every
            if compute_accuracy : print(timeSince(start, iter / iters) + ' ({} {}%) loss : {:.3f}  accuracy : {:.1f} %'.format(iter, int(iter / iters * 100), avg_loss, avg_loss_words))
            else                : print(timeSince(start, iter / iters) + ' ({} {}%) loss : {:.3f}                     '.format(iter, int(iter / iters * 100), avg_loss))
            return 0, 0

        def trainLoop(couple, optimizer, compute_accuracy = False):
            """Performs a training loop, with forward pass and backward pass for gradient optimisation."""
            optimizer.zero_grad()
            self.zero_grad()
            log_probs = self(couple[0])           # size = (batch_size, lang.n_words, output_n_words)
            targets   = couple[1].to(self.device) # size = (batch_size, output_n_words)
            loss      = self.criterion(log_probs, targets)
            loss.backward()
            optimizer.step() 
            accuracy = computeAccuracy(log_probs, targets) if compute_accuracy else 0
            return loss.item() / (targets.size(0) * targets.size(1)), accuracy
        
        # --- main ---
        np.random.seed(random_state)
        random.seed(random_state)   # random.choice below draws from Python's generator, not NumPy's
        start = time.time()
        optimizer = self.optimizer([param for param in self.parameters() if param.requires_grad == True], lr = lr)
        tot_loss = 0  
        tot_loss_words = 0
        if epochs is None :
            for iter in range(1, iters + 1):
                couple = random.choice(ngrams)
                loss, loss_words = trainLoop(couple, optimizer, compute_accuracy)
                tot_loss += loss
                tot_loss_words += loss_words      
                if iter % print_every == 0 : 
                    tot_loss, tot_loss_words = printScores(start, iter, iters, tot_loss, tot_loss_words, print_every, compute_accuracy)
        else :
            iter = 0
            iters = len(ngrams) * epochs
            for epoch in range(1, epochs + 1):
                print('epoch ' + str(epoch))
                np.random.shuffle(ngrams)
                for couple in ngrams :
                    loss, loss_words = trainLoop(couple, optimizer, compute_accuracy)
                    tot_loss += loss
                    tot_loss_words += loss_words 
                    iter += 1
                    if iter % print_every == 0 : 
                        tot_loss, tot_loss_words = printScores(start, iter, iters, tot_loss, tot_loss_words, print_every, compute_accuracy)
        return
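
The window extraction done in `generatePackedNgrams` can be shown on a single short text. This is a simplified sketch: the function name `context_windows` is hypothetical, and the batching, shuffling, and index conversion of the method above are left out.

```python
def context_windows(tokens, context_size, pad_left="SOS", pad_right="EOS"):
    """Return (center word, context) pairs, padding the text so every word gets a full window."""
    padded = [pad_left] * context_size + list(tokens) + [pad_right] * context_size
    pairs = []
    for i in range(context_size, len(padded) - context_size):
        # context_size words on each side, center word excluded
        context = padded[i - context_size : i] + padded[i + 1 : i + context_size + 1]
        pairs.append((padded[i], context))
    return pairs

pairs = context_windows(["the", "drug", "product"], context_size=2)
# pairs[1] == ('drug', ['SOS', 'the', 'product', 'EOS'])
```

For CBOW the context is the input and the center word is the target; for Skip-Gram the roles are swapped, which is what the `sg` flag selects when the pairs are packed.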

### 1.1.2 Training with CBOW objective

[Back to top](#plan)


Model

In [11]:
lang = Lang(corpus, base_tokens = ['SOS', 'EOS', 'UNK'], min_count = 4)

In [61]:
word2vec = myWord2Vec(lang, T = 75)
cbow = Word2VecShell(word2vec, device, sg = 0, context_size = 5, hidden_dim = 150)
print('cbow.word2vec == word2vec : ', cbow.word2vec == word2vec)

cbow.word2vec == word2vec :  True


Data

In [62]:
Ngrams = cbow.generatePackedNgrams(corpus, context_size = 5, batch_size = 32, seed = 42)

Training

The training method can report accuracy over the predicted target words. Since this computation is quite time-consuming, accuracy is displayed only at the beginning of training and then periodically between training rounds.

In [21]:
cbow.train(Ngrams, iters = 100, lr = 0.005, print_every = 100, compute_accuracy = True)

for alpha in [0.005, 0.001, 0.0005, 0.00025, 0.0001] : 
    cbow.train(Ngrams, epochs = 3,  lr = alpha, print_every = 100)
    cbow.train(Ngrams, iters = 100, lr = alpha, print_every = 100, compute_accuracy = True)

0m 1s (- 0m 0s) (100 100%) loss : 1.568  accuracy : 64.9 %
epoch 1
0m 0s (- 3m 30s) (100 0%) loss : 1.605                     
0m 0s (- 3m 30s) (200 0%) loss : 1.466                     
0m 1s (- 3m 29s) (300 0%) loss : 1.534                     
0m 1s (- 3m 28s) (400 0%) loss : 1.533                     
0m 2s (- 3m 26s) (500 1%) loss : 1.541                     
0m 2s (- 3m 25s) (600 1%) loss : 1.496                     
0m 3s (- 3m 24s) (700 1%) loss : 1.455                     
0m 3s (- 3m 24s) (800 1%) loss : 1.459                     
0m 4s (- 3m 24s) (900 2%) loss : 1.488                     
0m 4s (- 3m 23s) (1000 2%) loss : 1.460                     
0m 5s (- 3m 23s) (1100 2%) loss : 1.432                     
0m 5s (- 3m 23s) (1200 2%) loss : 1.487                     
0m 6s (- 3m 23s) (1300 2%) loss : 1.457                     
0m 6s (- 3m 22s) (1400 3%) loss : 1.470                     
0m 7s (- 3m 22s) (1500 3%) loss : 1.444                     
0m 7s (- 3m 21s) (1600 3%) 

1m 2s (- 2m 30s) (13200 29%) loss : 1.510                     
1m 3s (- 2m 29s) (13300 29%) loss : 1.477                     
1m 3s (- 2m 29s) (13400 29%) loss : 1.469                     
1m 4s (- 2m 28s) (13500 30%) loss : 1.522                     
1m 4s (- 2m 28s) (13600 30%) loss : 1.400                     
1m 5s (- 2m 27s) (13700 30%) loss : 1.401                     
1m 5s (- 2m 27s) (13800 30%) loss : 1.422                     
1m 6s (- 2m 26s) (13900 31%) loss : 1.456                     
1m 6s (- 2m 26s) (14000 31%) loss : 1.545                     
1m 7s (- 2m 25s) (14100 31%) loss : 1.470                     
1m 7s (- 2m 25s) (14200 31%) loss : 1.492                     
1m 8s (- 2m 24s) (14300 31%) loss : 1.503                     
1m 8s (- 2m 24s) (14400 32%) loss : 1.494                     
1m 9s (- 2m 23s) (14500 32%) loss : 1.475                     
1m 9s (- 2m 23s) (14600 32%) loss : 1.487                     
1m 9s (- 2m 22s) (14700 32%) loss : 1.427              

2m 5s (- 1m 29s) (26100 58%) loss : 1.432                     
2m 5s (- 1m 29s) (26200 58%) loss : 1.383                     
2m 6s (- 1m 28s) (26300 58%) loss : 1.455                     
2m 6s (- 1m 28s) (26400 59%) loss : 1.451                     
2m 7s (- 1m 27s) (26500 59%) loss : 1.468                     
2m 7s (- 1m 27s) (26600 59%) loss : 1.402                     
2m 8s (- 1m 26s) (26700 59%) loss : 1.468                     
2m 8s (- 1m 26s) (26800 59%) loss : 1.480                     
2m 9s (- 1m 25s) (26900 60%) loss : 1.491                     
2m 9s (- 1m 25s) (27000 60%) loss : 1.447                     
2m 10s (- 1m 24s) (27100 60%) loss : 1.432                     
2m 10s (- 1m 24s) (27200 60%) loss : 1.475                     
2m 11s (- 1m 23s) (27300 61%) loss : 1.410                     
2m 11s (- 1m 23s) (27400 61%) loss : 1.480                     
2m 12s (- 1m 22s) (27500 61%) loss : 1.375                     
2m 12s (- 1m 22s) (27600 61%) loss : 1.478        

3m 6s (- 0m 27s) (39000 87%) loss : 1.401                     
3m 6s (- 0m 26s) (39100 87%) loss : 1.369                     
3m 7s (- 0m 26s) (39200 87%) loss : 1.413                     
3m 7s (- 0m 25s) (39300 87%) loss : 1.380                     
3m 8s (- 0m 25s) (39400 88%) loss : 1.475                     
3m 9s (- 0m 25s) (39500 88%) loss : 1.392                     
3m 9s (- 0m 24s) (39600 88%) loss : 1.455                     
3m 9s (- 0m 24s) (39700 88%) loss : 1.449                     
3m 10s (- 0m 23s) (39800 88%) loss : 1.460                     
3m 10s (- 0m 23s) (39900 89%) loss : 1.390                     
3m 11s (- 0m 22s) (40000 89%) loss : 1.384                     
3m 11s (- 0m 22s) (40100 89%) loss : 1.392                     
3m 12s (- 0m 21s) (40200 89%) loss : 1.365                     
3m 12s (- 0m 21s) (40300 90%) loss : 1.458                     
3m 13s (- 0m 20s) (40400 90%) loss : 1.419                     
3m 13s (- 0m 20s) (40500 90%) loss : 1.454      

0m 34s (- 2m 59s) (7300 16%) loss : 1.329                     
0m 35s (- 2m 58s) (7400 16%) loss : 1.390                     
0m 35s (- 2m 58s) (7500 16%) loss : 1.341                     
0m 36s (- 2m 57s) (7600 16%) loss : 1.405                     
0m 36s (- 2m 57s) (7700 17%) loss : 1.346                     
0m 37s (- 2m 56s) (7800 17%) loss : 1.349                     
0m 37s (- 2m 56s) (7900 17%) loss : 1.331                     
0m 38s (- 2m 55s) (8000 17%) loss : 1.428                     
0m 38s (- 2m 55s) (8100 18%) loss : 1.378                     
0m 39s (- 2m 54s) (8200 18%) loss : 1.312                     
0m 39s (- 2m 53s) (8300 18%) loss : 1.384                     
0m 40s (- 2m 53s) (8400 18%) loss : 1.392                     
0m 40s (- 2m 52s) (8500 19%) loss : 1.322                     
0m 41s (- 2m 52s) (8600 19%) loss : 1.375                     
0m 41s (- 2m 51s) (8700 19%) loss : 1.397                     
0m 41s (- 2m 51s) (8800 19%) loss : 1.405              

1m 36s (- 1m 56s) (20200 45%) loss : 1.311                     
1m 36s (- 1m 56s) (20300 45%) loss : 1.399                     
1m 36s (- 1m 55s) (20400 45%) loss : 1.342                     
1m 37s (- 1m 55s) (20500 45%) loss : 1.336                     
1m 37s (- 1m 54s) (20600 46%) loss : 1.315                     
1m 38s (- 1m 54s) (20700 46%) loss : 1.313                     
1m 38s (- 1m 53s) (20800 46%) loss : 1.314                     
1m 39s (- 1m 53s) (20900 46%) loss : 1.327                     
1m 39s (- 1m 52s) (21000 46%) loss : 1.343                     
1m 40s (- 1m 52s) (21100 47%) loss : 1.341                     
1m 40s (- 1m 51s) (21200 47%) loss : 1.344                     
1m 41s (- 1m 51s) (21300 47%) loss : 1.262                     
1m 41s (- 1m 50s) (21400 47%) loss : 1.395                     
1m 42s (- 1m 50s) (21500 48%) loss : 1.255                     
1m 42s (- 1m 49s) (21600 48%) loss : 1.304                     
1m 43s (- 1m 49s) (21700 48%) loss : 1.3

2m 37s (- 0m 55s) (33100 74%) loss : 1.319                     
2m 38s (- 0m 54s) (33200 74%) loss : 1.250                     
2m 38s (- 0m 54s) (33300 74%) loss : 1.319                     
2m 39s (- 0m 53s) (33400 74%) loss : 1.338                     
2m 39s (- 0m 53s) (33500 74%) loss : 1.312                     
2m 40s (- 0m 52s) (33600 75%) loss : 1.341                     
2m 40s (- 0m 52s) (33700 75%) loss : 1.376                     
2m 40s (- 0m 52s) (33800 75%) loss : 1.303                     
2m 41s (- 0m 51s) (33900 75%) loss : 1.366                     
2m 41s (- 0m 51s) (34000 76%) loss : 1.280                     
2m 42s (- 0m 50s) (34100 76%) loss : 1.386                     
2m 42s (- 0m 50s) (34200 76%) loss : 1.324                     
2m 43s (- 0m 49s) (34300 76%) loss : 1.372                     
2m 43s (- 0m 49s) (34400 76%) loss : 1.271                     
2m 44s (- 0m 48s) (34500 77%) loss : 1.342                     
2m 44s (- 0m 48s) (34600 77%) loss : 1.3

0m 6s (- 3m 28s) (1300 2%) loss : 1.324                     
0m 6s (- 3m 27s) (1400 3%) loss : 1.277                     
0m 7s (- 3m 28s) (1500 3%) loss : 1.255                     
0m 7s (- 3m 27s) (1600 3%) loss : 1.327                     
0m 8s (- 3m 27s) (1700 3%) loss : 1.297                     
0m 8s (- 3m 26s) (1800 4%) loss : 1.365                     
0m 9s (- 3m 25s) (1900 4%) loss : 1.283                     
0m 9s (- 3m 24s) (2000 4%) loss : 1.279                     
0m 10s (- 3m 24s) (2100 4%) loss : 1.291                     
0m 10s (- 3m 23s) (2200 4%) loss : 1.273                     
0m 11s (- 3m 23s) (2300 5%) loss : 1.282                     
0m 11s (- 3m 22s) (2400 5%) loss : 1.227                     
0m 12s (- 3m 22s) (2500 5%) loss : 1.272                     
0m 12s (- 3m 22s) (2600 5%) loss : 1.305                     
0m 13s (- 3m 22s) (2700 6%) loss : 1.321                     
0m 13s (- 3m 22s) (2800 6%) loss : 1.287                     
0m 14s (- 3m 22s

1m 8s (- 2m 24s) (14400 32%) loss : 1.331                     
1m 8s (- 2m 23s) (14500 32%) loss : 1.310                     
1m 9s (- 2m 23s) (14600 32%) loss : 1.289                     
1m 9s (- 2m 22s) (14700 32%) loss : 1.308                     
1m 10s (- 2m 22s) (14800 33%) loss : 1.315                     
1m 10s (- 2m 21s) (14900 33%) loss : 1.345                     
epoch 2
1m 11s (- 2m 21s) (15000 33%) loss : 1.233                     
1m 11s (- 2m 20s) (15100 33%) loss : 1.283                     
1m 12s (- 2m 20s) (15200 33%) loss : 1.257                     
1m 12s (- 2m 19s) (15300 34%) loss : 1.274                     
1m 13s (- 2m 19s) (15400 34%) loss : 1.241                     
1m 13s (- 2m 18s) (15500 34%) loss : 1.320                     
1m 14s (- 2m 18s) (15600 34%) loss : 1.154                     
1m 14s (- 2m 17s) (15700 35%) loss : 1.317                     
1m 15s (- 2m 17s) (15800 35%) loss : 1.244                     
1m 15s (- 2m 16s) (15900 35%) loss :

2m 9s (- 1m 22s) (27300 61%) loss : 1.294                     
2m 9s (- 1m 22s) (27400 61%) loss : 1.287                     
2m 10s (- 1m 21s) (27500 61%) loss : 1.305                     
2m 10s (- 1m 21s) (27600 61%) loss : 1.300                     
2m 11s (- 1m 20s) (27700 61%) loss : 1.255                     
2m 11s (- 1m 20s) (27800 62%) loss : 1.285                     
2m 12s (- 1m 19s) (27900 62%) loss : 1.268                     
2m 12s (- 1m 19s) (28000 62%) loss : 1.380                     
2m 13s (- 1m 18s) (28100 62%) loss : 1.319                     
2m 13s (- 1m 18s) (28200 63%) loss : 1.340                     
2m 14s (- 1m 17s) (28300 63%) loss : 1.253                     
2m 14s (- 1m 17s) (28400 63%) loss : 1.273                     
2m 15s (- 1m 17s) (28500 63%) loss : 1.344                     
2m 15s (- 1m 16s) (28600 63%) loss : 1.335                     
2m 16s (- 1m 16s) (28700 64%) loss : 1.334                     
2m 16s (- 1m 15s) (28800 64%) loss : 1.343

3m 11s (- 0m 21s) (40200 89%) loss : 1.279                     
3m 11s (- 0m 21s) (40300 90%) loss : 1.311                     
3m 12s (- 0m 20s) (40400 90%) loss : 1.243                     
3m 12s (- 0m 20s) (40500 90%) loss : 1.252                     
3m 13s (- 0m 19s) (40600 90%) loss : 1.289                     
3m 13s (- 0m 19s) (40700 90%) loss : 1.336                     
3m 14s (- 0m 18s) (40800 91%) loss : 1.283                     
3m 14s (- 0m 18s) (40900 91%) loss : 1.250                     
3m 15s (- 0m 17s) (41000 91%) loss : 1.278                     
3m 15s (- 0m 17s) (41100 91%) loss : 1.222                     
3m 15s (- 0m 16s) (41200 92%) loss : 1.245                     
3m 16s (- 0m 16s) (41300 92%) loss : 1.310                     
3m 16s (- 0m 15s) (41400 92%) loss : 1.345                     
3m 17s (- 0m 15s) (41500 92%) loss : 1.362                     
3m 17s (- 0m 14s) (41600 93%) loss : 1.311                     
3m 18s (- 0m 14s) (41700 93%) loss : 1.2

Evaluation

In [134]:
word2vec.most_similar(word = 'formaldehyde', bound = 10)

[('disruption', 0.4915212),
 ('gmu', 0.36797026),
 ('biostatistical', 0.3601455),
 ('milliliter', 0.34712657),
 ('xy', 0.3376171),
 ('diffuse', 0.3375387),
 ('sterile', 0.33623347),
 ('ah', 0.33311707),
 ('live', 0.33145988),
 ('analytical', 0.32609305)]
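
A nearest-neighbour query of this kind can be sketched as follows, assuming that the scores are cosine similarities between rows of the embedding table (this is what Gensim's `most_similar` computes; whether `myWord2Vec.most_similar` does the same is an assumption). The toy vectors are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def most_similar(word, table, bound=10):
    """Rank all other words by cosine similarity to `word`, keep the top `bound`."""
    query = table[word]
    scored = ((other, cosine(query, vec)) for other, vec in table.items() if other != word)
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:bound]

toy_table = {'phenol': [1.0, 0.0], 'formaldehyde': [0.9, 0.1], 'northern': [0.0, 1.0]}
neighbours = most_similar('formaldehyde', toy_table, bound=2)
```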

Save & Load<br>

The lightweight word2vec model can be saved for further use, or alternatively the full shell wrapping the word2vec model can be saved for subsequent training.

In [115]:
# save
#torch.save(word2vec, path_to_NLP + '\\saves\\models\\DL4NLP_I1_cbow.pt')

# load
#word2vec = torch.load(path_to_NLP + '\\saves\\models\\DL4NLP_I1_cbow.pt')

### 1.1.3 Training with SkipGram objective

[Back to top](#plan)


Model

In [102]:
lang = Lang(corpus, base_tokens = ['SOS', 'EOS', 'UNK'], min_count = 4)

In [103]:
word2vec = myWord2Vec(lang, T = 75)
skipgram = Word2VecShell(word2vec, device, sg = 1, context_size = 5, hidden_dim = 150)
print('skipgram.word2vec == word2vec : ', skipgram.word2vec == word2vec)

skipgram.word2vec == word2vec :  True


Data

In [104]:
Ngrams = skipgram.generatePackedNgrams(corpus, context_size = 5, batch_size = 32, seed = 42)
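
With the SkipGram objective the direction of prediction is reversed: the center word is used to predict each word of its context. A minimal sketch of the resulting training pairs is given below. `skipgram_pairs` is a hypothetical helper, and the actual output format of `generatePackedNgrams` may differ.

```python
def skipgram_pairs(sentence, context_size=2):
    """Return (center, context_word) pairs : one pair per word inside the window."""
    pairs = []
    for t, center in enumerate(sentence):
        lo = max(0, t - context_size)
        hi = min(len(sentence), t + context_size + 1)
        for j in range(lo, hi):
            if j != t:
                pairs.append((center, sentence[j]))
    return pairs

pairs = skipgram_pairs(['a', 'b', 'c'], context_size=1)
```

Each position yields up to `2 * context_size` pairs, so the number of training examples grows with the window size.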

Training

In [105]:
skipgram.train(Ngrams, iters = 100, lr = 0.005, print_every = 100, compute_accuracy = True)

for alpha in [0.005, 0.001, 0.0005] : 
    skipgram.train(Ngrams, epochs = 3,  lr = alpha, print_every = 100)
    skipgram.train(Ngrams, iters = 100, lr = alpha, print_every = 100, compute_accuracy = True)

0m 10s (- 0m 0s) (100 100%) loss : 6.551  accuracy : 7.6 %
epoch 1
0m 1s (- 11m 44s) (100 0%) loss : 6.082                     
0m 3s (- 11m 41s) (200 0%) loss : 6.001                     
0m 4s (- 11m 41s) (300 0%) loss : 5.951                     
0m 6s (- 11m 39s) (400 0%) loss : 5.921                     
0m 7s (- 11m 37s) (500 1%) loss : 5.912                     
0m 9s (- 11m 35s) (600 1%) loss : 5.881                     
0m 11s (- 11m 33s) (700 1%) loss : 5.837                     
0m 12s (- 11m 31s) (800 1%) loss : 5.824                     
0m 14s (- 11m 29s) (900 2%) loss : 5.828                     
0m 15s (- 11m 28s) (1000 2%) loss : 5.815                     
0m 17s (- 11m 26s) (1100 2%) loss : 5.782                     
0m 18s (- 11m 25s) (1200 2%) loss : 5.799                     
0m 20s (- 11m 23s) (1300 2%) loss : 5.775                     
0m 22s (- 11m 22s) (1400 3%) loss : 5.785                     
0m 23s (- 11m 20s) (1500 3%) loss : 5.766                     
0m 

3m 24s (- 8m 14s) (13100 29%) loss : 5.437                     
3m 26s (- 8m 12s) (13200 29%) loss : 5.397                     
3m 27s (- 8m 11s) (13300 29%) loss : 5.410                     
3m 29s (- 8m 9s) (13400 29%) loss : 5.365                     
3m 30s (- 8m 7s) (13500 30%) loss : 5.440                     
3m 32s (- 8m 6s) (13600 30%) loss : 5.431                     
3m 34s (- 8m 5s) (13700 30%) loss : 5.390                     
3m 35s (- 8m 3s) (13800 30%) loss : 5.406                     
3m 37s (- 8m 2s) (13900 31%) loss : 5.401                     
3m 38s (- 8m 0s) (14000 31%) loss : 5.376                     
3m 40s (- 7m 59s) (14100 31%) loss : 5.391                     
3m 42s (- 7m 57s) (14200 31%) loss : 5.411                     
3m 43s (- 7m 55s) (14300 31%) loss : 5.375                     
3m 45s (- 7m 54s) (14400 32%) loss : 5.392                     
3m 46s (- 7m 52s) (14500 32%) loss : 5.379                     
3m 48s (- 7m 51s) (14600 32%) loss : 5.395     

6m 47s (- 4m 53s) (26000 58%) loss : 5.331                     
6m 48s (- 4m 51s) (26100 58%) loss : 5.296                     
6m 50s (- 4m 50s) (26200 58%) loss : 5.334                     
6m 51s (- 4m 48s) (26300 58%) loss : 5.322                     
6m 53s (- 4m 46s) (26400 59%) loss : 5.247                     
6m 54s (- 4m 45s) (26500 59%) loss : 5.307                     
6m 56s (- 4m 43s) (26600 59%) loss : 5.269                     
6m 57s (- 4m 42s) (26700 59%) loss : 5.257                     
6m 59s (- 4m 40s) (26800 59%) loss : 5.308                     
7m 1s (- 4m 39s) (26900 60%) loss : 5.316                     
7m 2s (- 4m 37s) (27000 60%) loss : 5.310                     
7m 4s (- 4m 35s) (27100 60%) loss : 5.263                     
7m 5s (- 4m 34s) (27200 60%) loss : 5.310                     
7m 7s (- 4m 32s) (27300 61%) loss : 5.287                     
7m 8s (- 4m 31s) (27400 61%) loss : 5.300                     
7m 10s (- 4m 29s) (27500 61%) loss : 5.300    

10m 8s (- 1m 31s) (38900 86%) loss : 5.257                     
10m 10s (- 1m 29s) (39000 87%) loss : 5.217                     
10m 12s (- 1m 28s) (39100 87%) loss : 5.267                     
10m 13s (- 1m 26s) (39200 87%) loss : 5.200                     
10m 15s (- 1m 24s) (39300 87%) loss : 5.243                     
10m 16s (- 1m 23s) (39400 88%) loss : 5.251                     
10m 18s (- 1m 21s) (39500 88%) loss : 5.255                     
10m 19s (- 1m 20s) (39600 88%) loss : 5.207                     
10m 21s (- 1m 18s) (39700 88%) loss : 5.275                     
10m 22s (- 1m 17s) (39800 88%) loss : 5.215                     
10m 24s (- 1m 15s) (39900 89%) loss : 5.250                     
10m 26s (- 1m 13s) (40000 89%) loss : 5.230                     
10m 27s (- 1m 12s) (40100 89%) loss : 5.248                     
10m 29s (- 1m 10s) (40200 89%) loss : 5.210                     
10m 30s (- 1m 9s) (40300 90%) loss : 5.261                     
10m 32s (- 1m 7s) (40400 90

1m 49s (- 9m 48s) (7000 15%) loss : 5.074                     
1m 50s (- 9m 47s) (7100 15%) loss : 5.077                     
1m 52s (- 9m 46s) (7200 16%) loss : 5.112                     
1m 54s (- 9m 44s) (7300 16%) loss : 5.073                     
1m 55s (- 9m 43s) (7400 16%) loss : 5.084                     
1m 57s (- 9m 42s) (7500 16%) loss : 5.112                     
1m 58s (- 9m 40s) (7600 16%) loss : 5.113                     
2m 0s (- 9m 39s) (7700 17%) loss : 5.095                     
2m 1s (- 9m 37s) (7800 17%) loss : 5.096                     
2m 3s (- 9m 35s) (7900 17%) loss : 5.071                     
2m 5s (- 9m 34s) (8000 17%) loss : 5.119                     
2m 6s (- 9m 32s) (8100 18%) loss : 5.114                     
2m 8s (- 9m 31s) (8200 18%) loss : 5.080                     
2m 9s (- 9m 29s) (8300 18%) loss : 5.066                     
2m 11s (- 9m 28s) (8400 18%) loss : 5.052                     
2m 12s (- 9m 26s) (8500 19%) loss : 5.102                     

5m 13s (- 6m 27s) (20000 44%) loss : 5.069                     
5m 14s (- 6m 25s) (20100 44%) loss : 5.071                     
5m 16s (- 6m 24s) (20200 45%) loss : 5.083                     
5m 17s (- 6m 22s) (20300 45%) loss : 5.105                     
5m 19s (- 6m 21s) (20400 45%) loss : 5.053                     
5m 21s (- 6m 19s) (20500 45%) loss : 5.066                     
5m 22s (- 6m 18s) (20600 46%) loss : 5.096                     
5m 24s (- 6m 16s) (20700 46%) loss : 5.084                     
5m 25s (- 6m 14s) (20800 46%) loss : 5.076                     
5m 27s (- 6m 13s) (20900 46%) loss : 5.079                     
5m 29s (- 6m 11s) (21000 46%) loss : 5.056                     
5m 30s (- 6m 10s) (21100 47%) loss : 5.059                     
5m 32s (- 6m 8s) (21200 47%) loss : 5.059                     
5m 33s (- 6m 7s) (21300 47%) loss : 5.043                     
5m 35s (- 6m 5s) (21400 47%) loss : 5.054                     
5m 37s (- 6m 4s) (21500 48%) loss : 5.071  

8m 35s (- 3m 5s) (32900 73%) loss : 5.069                     
8m 36s (- 3m 3s) (33000 73%) loss : 5.033                     
8m 38s (- 3m 2s) (33100 74%) loss : 5.093                     
8m 39s (- 3m 0s) (33200 74%) loss : 5.033                     
8m 41s (- 2m 58s) (33300 74%) loss : 5.073                     
8m 42s (- 2m 57s) (33400 74%) loss : 5.032                     
8m 44s (- 2m 55s) (33500 74%) loss : 5.071                     
8m 45s (- 2m 54s) (33600 75%) loss : 5.062                     
8m 47s (- 2m 52s) (33700 75%) loss : 5.036                     
8m 49s (- 2m 51s) (33800 75%) loss : 5.052                     
8m 50s (- 2m 49s) (33900 75%) loss : 5.049                     
8m 52s (- 2m 47s) (34000 76%) loss : 5.046                     
8m 53s (- 2m 46s) (34100 76%) loss : 5.081                     
8m 55s (- 2m 44s) (34200 76%) loss : 5.063                     
8m 56s (- 2m 43s) (34300 76%) loss : 5.077                     
8m 58s (- 2m 41s) (34400 76%) loss : 5.079  

0m 14s (- 11m 23s) (900 2%) loss : 5.022                     
0m 15s (- 11m 21s) (1000 2%) loss : 5.023                     
0m 17s (- 11m 19s) (1100 2%) loss : 5.024                     
0m 18s (- 11m 18s) (1200 2%) loss : 5.023                     
0m 20s (- 11m 16s) (1300 2%) loss : 5.022                     
0m 21s (- 11m 15s) (1400 3%) loss : 5.003                     
0m 23s (- 11m 14s) (1500 3%) loss : 5.015                     
0m 24s (- 11m 12s) (1600 3%) loss : 5.031                     
0m 26s (- 11m 11s) (1700 3%) loss : 5.049                     
0m 28s (- 11m 9s) (1800 4%) loss : 5.000                     
0m 29s (- 11m 7s) (1900 4%) loss : 5.019                     
0m 31s (- 11m 5s) (2000 4%) loss : 5.045                     
0m 32s (- 11m 4s) (2100 4%) loss : 5.035                     
0m 34s (- 11m 2s) (2200 4%) loss : 5.021                     
0m 35s (- 11m 0s) (2300 5%) loss : 5.045                     
0m 37s (- 10m 59s) (2400 5%) loss : 5.047                     

3m 36s (- 8m 0s) (13900 31%) loss : 5.020                     
3m 38s (- 7m 59s) (14000 31%) loss : 5.085                     
3m 39s (- 7m 57s) (14100 31%) loss : 5.037                     
3m 41s (- 7m 55s) (14200 31%) loss : 5.032                     
3m 42s (- 7m 54s) (14300 31%) loss : 5.060                     
3m 44s (- 7m 52s) (14400 32%) loss : 5.041                     
3m 46s (- 7m 51s) (14500 32%) loss : 5.074                     
3m 47s (- 7m 49s) (14600 32%) loss : 5.075                     
3m 49s (- 7m 48s) (14700 32%) loss : 5.060                     
3m 50s (- 7m 46s) (14800 33%) loss : 5.032                     
3m 52s (- 7m 45s) (14900 33%) loss : 5.030                     
epoch 2
3m 53s (- 7m 43s) (15000 33%) loss : 5.005                     
3m 55s (- 7m 42s) (15100 33%) loss : 5.042                     
3m 57s (- 7m 40s) (15200 33%) loss : 5.031                     
3m 58s (- 7m 38s) (15300 34%) loss : 4.987                     
4m 0s (- 7m 37s) (15400 34%) loss

6m 58s (- 4m 39s) (26800 59%) loss : 5.021                     
7m 0s (- 4m 38s) (26900 60%) loss : 5.040                     
7m 1s (- 4m 36s) (27000 60%) loss : 5.069                     
7m 3s (- 4m 35s) (27100 60%) loss : 5.025                     
7m 4s (- 4m 33s) (27200 60%) loss : 5.032                     
7m 6s (- 4m 32s) (27300 61%) loss : 5.092                     
7m 8s (- 4m 30s) (27400 61%) loss : 5.048                     
7m 9s (- 4m 29s) (27500 61%) loss : 5.045                     
7m 11s (- 4m 27s) (27600 61%) loss : 5.005                     
7m 12s (- 4m 26s) (27700 61%) loss : 5.046                     
7m 14s (- 4m 24s) (27800 62%) loss : 5.038                     
7m 15s (- 4m 22s) (27900 62%) loss : 5.035                     
7m 17s (- 4m 21s) (28000 62%) loss : 5.027                     
7m 19s (- 4m 19s) (28100 62%) loss : 5.040                     
7m 20s (- 4m 18s) (28200 63%) loss : 5.045                     
7m 22s (- 4m 16s) (28300 63%) loss : 5.024     

10m 19s (- 1m 18s) (39700 88%) loss : 5.019                     
10m 21s (- 1m 16s) (39800 88%) loss : 5.013                     
10m 22s (- 1m 15s) (39900 89%) loss : 5.025                     
10m 24s (- 1m 13s) (40000 89%) loss : 5.010                     
10m 25s (- 1m 12s) (40100 89%) loss : 5.035                     
10m 27s (- 1m 10s) (40200 89%) loss : 5.012                     
10m 29s (- 1m 9s) (40300 90%) loss : 4.991                     
10m 30s (- 1m 7s) (40400 90%) loss : 5.012                     
10m 32s (- 1m 5s) (40500 90%) loss : 5.046                     
10m 33s (- 1m 4s) (40600 90%) loss : 5.018                     
10m 35s (- 1m 2s) (40700 90%) loss : 5.014                     
10m 36s (- 1m 1s) (40800 91%) loss : 5.065                     
10m 38s (- 0m 59s) (40900 91%) loss : 5.015                     
10m 39s (- 0m 58s) (41000 91%) loss : 5.028                     
10m 41s (- 0m 56s) (41100 91%) loss : 5.072                     
10m 43s (- 0m 55s) (41200 92%) 

Evaluation

In [119]:
word2vec.most_similar(word = 'final', bound = 10)

[('weights', 0.3850521),
 ('2.4', 0.36428553),
 ('cho', 0.35723716),
 ('0.20', 0.3491864),
 ('glucose', 0.3309465),
 ('double', 0.32847974),
 ('m5178', 0.3262306),
 ('northern', 0.32450148),
 ('normality', 0.32145855),
 ('9.0', 0.31833637)]

Save & Load<br>

The lightweight word2vec model can be saved for further use, or alternatively the full shell wrapping the word2vec model can be saved for subsequent training.

In [120]:
# save
#torch.save(word2vec, path_to_NLP + '\\saves\\models\\DL4NLP_I1_skipgram.pt')

# load
#word2vec = torch.load(path_to_NLP + '\\saves\\models\\DL4NLP_I1_skipgram.pt')

<a id="gensim"></a>

## 1.2 Gensim Word2Vec

[Back to top](#plan)

Link : https://radimrehurek.com/gensim/models/word2vec.html<br>
Tutorials :

- https://cambridgespark.com/4046-2/
- https://rare-technologies.com/word2vec-tutorial/
- http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

### 1.2.1 Model

In [56]:
from gensim.models import Word2Vec
import multiprocessing
from gensim.test.utils import datapath, get_tmpfile

### 1.2.2 Training with CBOW objective

[Back to top](#plan)

Model & Data & Training

In [52]:
cbow_gensim = Word2Vec(corpus, 
                       size = 75, 
                       window = 5, 
                       min_count = 4, 
                       negative = 15, 
                       iter = 50,
                       sg = 0,
                       workers = multiprocessing.cpu_count())
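
The call above uses the keyword names `size` and `iter`, which belong to Gensim 3.x. Gensim 4.0 renamed them to `vector_size` and `epochs`. The helper below is a small sketch for choosing the right names from a version string. `word2vec_kwargs` is a hypothetical function, not part of Gensim.

```python
def word2vec_kwargs(gensim_version, dim=75, epochs=50):
    """Pick keyword names matching the installed Gensim major version.

    Gensim 4.0 renamed `size` -> `vector_size` and `iter` -> `epochs`.
    """
    major = int(gensim_version.split('.')[0])
    if major >= 4:
        return {'vector_size': dim, 'epochs': epochs}
    return {'size': dim, 'iter': epochs}

kwargs_v3 = word2vec_kwargs('3.8.3')
kwargs_v4 = word2vec_kwargs('4.3.2')
```

The result could then be passed as `Word2Vec(corpus, window = 5, min_count = 4, negative = 15, sg = 0, **word2vec_kwargs(gensim.__version__))`.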

Evaluation

In [53]:
#help(cbow_gensim)
cbow_gensim.wv.most_similar('formaldehyde')

[('polysorbate', 0.6989661455154419),
 ('ovalbumin', 0.6707851886749268),
 ('thiomersal', 0.6553795337677002),
 ('phenol', 0.6534852981567383),
 ('phosphorus', 0.620326042175293),
 ('phenoxyethanol', 0.617948055267334),
 ('moisture', 0.5869265198707581),
 ('sucrose', 0.5719404220581055),
 ('2phenoxyethanol', 0.5603488683700562),
 ('aluminium', 0.5357670783996582)]

Save & Load

The Gensim model can easily be saved & loaded :

In [57]:
# save
#file_name = get_tmpfile(path_to_NLP + "\\saves\\models\\DL4NLP_I1_cbow_gensim.model")
#cbow_gensim.save(file_name)

# load
#file_name = get_tmpfile(path_to_NLP + "\\saves\\models\\DL4NLP_I1_cbow_gensim.model")
#cbow_gensim = Word2Vec.load(file_name)

Alternatively, it is straightforward to build a lightweight word2vec model from a trained Gensim model, and then save and load it as in the previous section.

In [86]:
word2vec = myWord2Vec(lang = Lang(corpus = [list(cbow_gensim.wv.index2word)], base_tokens = []), T = cbow_gensim.wv.vectors)

In [88]:
word2vec.most_similar('formaldehyde')

[('polysorbate', 0.6989662),
 ('ovalbumin', 0.67078525),
 ('thiomersal', 0.6553794),
 ('phenol', 0.6534851),
 ('phosphorus', 0.6203261),
 ('phenoxyethanol', 0.61794806),
 ('moisture', 0.5869265),
 ('sucrose', 0.5719405),
 ('2phenoxyethanol', 0.5603489),
 ('aluminium', 0.535767)]
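
The conversion above relies on one invariant: row `i` of the vector table must correspond to word `i` of the vocabulary list. The sketch below makes that pairing explicit with plain lists. `export_lookup` is a hypothetical helper, and the toy words and vectors are made up.

```python
def export_lookup(index2word, vectors):
    """Pair row i of `vectors` with word i of `index2word` (illustrative helper)."""
    if len(index2word) != len(vectors):
        raise ValueError('vocabulary and vector table must have the same length')
    word2index = {word: i for i, word in enumerate(index2word)}
    table = [list(row) for row in vectors]   # copy, so later updates leave the source untouched
    return word2index, table

word2index, table = export_lookup(['phenol', 'sucrose'], [(1.0, 0.0), (0.0, 1.0)])
sucrose_vector = table[word2index['sucrose']]
```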

### 1.2.3 Training with SkipGram objective

[Back to top](#plan)

Model & Data & Training

In [None]:
skipgram_gensim = Word2Vec(corpus, 
                           size = 75, 
                           window = 5, 
                           min_count = 4, 
                           negative = 15, 
                           iter = 50,
                           sg = 1,
                           workers = multiprocessing.cpu_count())
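
The `negative = 15` argument asks Gensim to draw 15 noise words for each positive example. In word2vec, noise words are sampled from the unigram distribution raised to the power 0.75, which is also the default value of Gensim's `ns_exponent` parameter. The sketch below illustrates this sampling with made-up counts. `noise_distribution` and `sample_negatives` are hypothetical helpers, not Gensim functions.

```python
import random

def noise_distribution(counts, exponent=0.75):
    """Normalize count ** exponent into sampling probabilities."""
    weights = {word: count ** exponent for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}

def sample_negatives(probs, k, exclude, rng):
    """Draw k noise words (with replacement), never returning the true target word."""
    words = [w for w in probs if w != exclude]
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=k)

toy_counts = {'the': 1000, 'phenol': 10, 'thiomersal': 1}   # made-up counts
probs = noise_distribution(toy_counts)
negatives = sample_negatives(probs, k=15, exclude='phenol', rng=random.Random(0))
```

The exponent flattens the distribution: frequent words are still drawn most often, but less often than their raw frequency would imply.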

Evaluation

In [132]:
#help(skipgram_gensim)
skipgram_gensim.wv.most_similar('formaldehyde')

[('phenoxyethanol', 0.687885046005249),
 ('thiomersal', 0.6411113739013672),
 ('polysorbate', 0.6349731683731079),
 ('ovalbumin', 0.6308521628379822),
 ('phenol', 0.5957325100898743),
 ('free', 0.5818331241607666),
 ('hcho', 0.580868124961853),
 ('triton', 0.5749791264533997),
 ('residual', 0.5559578537940979),
 ('acetyl', 0.5499086380004883)]

Save & Load

In [None]:
# save
#file_name = get_tmpfile(path_to_NLP + "\\saves\\models\\DL4NLP_I1_skipgram_gensim.model")
#skipgram_gensim.save(file_name)

# load
#file_name = get_tmpfile(path_to_NLP + "\\saves\\models\\DL4NLP_I1_skipgram_gensim.model")
#skipgram_gensim = Word2Vec.load(file_name)

<a id="sub_word_level"></a>


# 2 Subword-level Embedding
***

<a id="fastText"></a>

## 2.1 FastText Subword-level Embedding Model

[Back to top](#plan)


We use the Gensim implementation of FastText, trained below with the CBOW objective and then with the SkipGram objective.<br>
Tutorial : [Gensim FastText](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/FastText_Tutorial.ipynb)<br>
Link to the original paper : [Enriching Word Vectors with Subword Information](https://arxiv.org/pdf/1607.04606.pdf).

### 2.1.1 Model

In [89]:
from gensim.models.fasttext import FastText as FT_gensim
from gensim.test.utils import datapath, get_tmpfile

### 2.1.2 Training with CBOW objective

[Back to top](#plan)

Model & Data

In [90]:
cbow_fastText_gensim = FT_gensim(size = 75, 
                                 window = 5, 
                                 min_count = 4, 
                                 negative = 15,
                                 sg = 0)

In [91]:
cbow_fastText_gensim.build_vocab(corpus)

Training

In [92]:
cbow_fastText_gensim.train(sentences = corpus, 
                           epochs = 50,
                           total_examples = cbow_fastText_gensim.corpus_count)

Evaluation

In [93]:
cbow_fastText_gensim.wv.most_similar('formaldehyde')

[('glutaraldehyde', 0.7981022596359253),
 ('thiomersal', 0.6922125816345215),
 ('phenoxyethanol', 0.6608516573905945),
 ('formal', 0.6601725816726685),
 ('2phenoxyethanol', 0.6546887755393982),
 ('ovalbumin', 0.6526973247528076),
 ('acetaldehyde', 0.647036612033844),
 ('phenoxy', 0.6376691460609436),
 ('polysorbate', 0.6080763339996338),
 ('phenol', 0.5886105298995972)]

Save & Load

In [131]:
# save
#file_name = get_tmpfile(path_to_NLP + "\\saves\\models\\DL4NLP_I1_cbow_fasttext.model")
#cbow_fastText_gensim.save(file_name)

# load
#file_name = get_tmpfile(path_to_NLP + "\\saves\\models\\DL4NLP_I1_cbow_fasttext.model")
#cbow_fastText_gensim = FT_gensim.load(file_name)

Alternatively, it is straightforward to build a lightweight word2vec model from a trained Gensim model, and then save and load it as in the previous section.

In [94]:
word2vec = myWord2Vec(lang = Lang(corpus = [list(cbow_fastText_gensim.wv.index2word)], base_tokens = []), T = cbow_fastText_gensim.wv.vectors)

However, the main advantage of FastText is that it can produce an embedding vector for **any word**, and in fact for any string, because vectors are composed from character n-gram embeddings :

In [101]:
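# query an out-of-vocabulary string through the KeyedVectors object
# (indexing the model directly is deprecated in Gensim)
cbow_fastText_gensim.wv['HelloWorld']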

array([ 0.60040593, -0.61111313, -2.4772959 , -3.5749006 , -0.43338707,
       -2.2147598 , -1.1924213 , -1.3568501 ,  1.5161968 ,  1.8103373 ,
       -2.0902889 , -2.6638277 , -1.4272233 , -0.10690732, -1.9536633 ,
       -0.23879535,  1.602015  ,  0.01936079,  0.04406525, -0.3471393 ,
       -3.406037  ,  0.9583735 ,  0.7140704 ,  0.17500015,  0.46005052,
        2.257169  , -1.0044819 ,  0.5483043 , -0.9547367 , -0.49952805,
       -0.24594651, -0.21130262, -1.2208652 , -0.6694741 ,  0.87412256,
        1.7601272 ,  0.73085773,  0.10473473,  1.5312183 , -1.3219206 ,
       -1.3290527 ,  0.9072932 ,  1.3730991 , -0.90493995,  0.28533888,
        0.38472265, -1.5437093 ,  0.04730683, -0.2976788 ,  2.7981303 ,
       -1.7771748 ,  0.29214206, -0.9805347 , -0.35345754, -1.2273612 ,
        0.33308518,  1.4707417 , -1.3735195 , -1.2793874 ,  0.39640912,
       -0.04994425,  3.4968672 , -3.8160832 , -0.6460553 ,  1.6184001 ,
        1.4465878 , -1.3832889 , -1.1182241 , -1.0623002 ,  0.19
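
The character n-gram mechanism behind this out-of-vocabulary lookup can be sketched as follows. The word is wrapped in boundary markers `<` and `>`, split into n-grams of length 3 to 6 (the FastText defaults), and the vectors of those n-grams are combined. This is a simplified illustration: `char_ngrams` and `embed` are hypothetical helpers, the n-gram vectors are made up, averaging is one possible way to combine them, and the hashing of n-grams into buckets used by real implementations is left out.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `<word>`, with `<` and `>` marking the word boundaries."""
    wrapped = '<' + word + '>'
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

def embed(word, ngram_vectors, dim):
    """Average the vectors of the known n-grams; return zeros if none is known."""
    known = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]

trigrams = char_ngrams('where', min_n=3, max_n=3)
toy_ngram_vectors = {'<wh': [1.0, 0.0], 'whe': [0.0, 1.0]}   # made-up vectors
vector = embed('where', toy_ngram_vectors, dim=2)
```

Because any string decomposes into n-grams, a vector can be produced even for a string such as `'HelloWorld'` that never occurred in the corpus.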

Nonetheless, it can be useful to load the word vector look-up table into a lightweight word2vec module, since the table can then be fine-tuned for any specific downstream task performed by a larger PyTorch model.

### 2.1.3 Training with SkipGram objective

[Back to top](#plan)

Model & Data

In [128]:
fastText_gensim = FT_gensim(size = 75,
                            window = 5,
                            min_count = 4,
                            negative = 15,
                            sg = 1)

In [86]:
fastText_gensim.build_vocab(corpus)

Training

In [87]:
fastText_gensim.train(sentences = corpus, 
                      epochs = 500,
                      total_examples = fastText_gensim.corpus_count)

Evaluation

In [84]:
fastText_gensim.wv.most_similar('formaldehyde')

[('phenoxyethanol', 0.7689735889434814),
 ('thiomersal', 0.6627542972564697),
 ('polysorbate', 0.6582513451576233),
 ('residual', 0.6486191749572754),
 ('acetyl', 0.6439298391342163),
 ('free', 0.6395246386528015),
 ('hcho', 0.6370760798454285),
 ('ovalbumin', 0.6338086724281311),
 ('phosphorus', 0.6294593811035156),
 ('acetaldehyde', 0.618198037147522)]

Save & Load

In [130]:
# save
#file_name = get_tmpfile(path_to_NLP + "\\saves\\models\\DL4NLP_I1_fasttext.model")
#fastText_gensim.save(file_name)

# load
file_name = get_tmpfile(path_to_NLP + "\\saves\\models\\DL4NLP_I1_fasttext.model")
fastText_gensim = FT_gensim.load(file_name)

In [131]:
# query several words at once through the KeyedVectors object
fastText_gensim.wv[['13', 'to']]

array([[ 4.1185090e-01, -1.8866447e-01,  4.5702112e-01,  9.5058337e-02,
        -4.9046433e-01, -1.7982282e-01, -8.3711225e-01, -2.1798968e-01,
         1.3073857e+00,  5.6606084e-03,  2.3142101e-01,  6.9535589e-01,
         5.9402555e-02,  4.7985664e-01,  5.4549623e-01, -4.1711867e-01,
        -3.2760030e-01,  6.0999656e-01, -8.2229602e-01, -5.0304300e-01,
        -1.2463865e+00, -6.4032364e-01, -3.3593947e-01,  2.3377445e-01,
         8.3939898e-01, -5.6188452e-01,  7.1624267e-01,  3.1629649e-01,
        -6.5501964e-01, -1.3965420e-02, -2.3780808e-01,  3.3856243e-01,
         9.1053849e-01,  8.3878809e-01, -6.5342933e-01,  3.2637709e-01,
        -5.9420161e-02,  4.7055259e-04,  1.5663326e-02, -1.3566703e+00,
         3.0353647e-01,  4.4896945e-02, -3.3726567e-01, -3.1158471e-01,
         5.7190084e-01, -2.0193291e-01, -1.8398750e-01,  1.4189348e-01,
         3.6389869e-01,  8.2089943e-01, -2.3509660e-01, -2.9905848e-02,
        -4.0085071e-01, -9.8194093e-01,  8.6322200e-01,  2.04408