<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 30px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
  Deep Learning for NLP
  </div> 
  
<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 30px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
  Part I - 3 <br><br><br>
  Language Modeling
  </div> 

  <div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 20px; 
      text-align: center; 
      padding: 15px;">
  </div> 

  <div style=" float:right; 
      font-size: 12px; 
      line-height: 12px; 
  padding: 10px 15px 8px;">
  Jean-baptiste AUJOGUE
  </div> 

### Part I

1. Word Embedding

2. Sentence Classification

3. <font color=red>**Language Modeling**</font>

4. Sequence Labelling


### Part II

5. Auto-Encoding

6. Machine Translation

7. Text Classification




### Part III

8. Abstractive Summarization

9. Question Answering

10. Chatbot



***

<a id="plan"></a>

# Overview

The [language model](#language_model) is a pipeline of two modules followed by a final classification layer:



| | Module | | | |
|------|------|------|------|------|
| 1 | **Word Embedding** | [I.1 Custom model](#word_level_custom) | [I.2 Gensim Model](#gensim) | [I.3 FastText model](#fastText) |
| 2 | **Contextualization** | [II.1 Bidirectional GRU](#bi_gru) | [II.2 Transformer](#transformer) | |
| 3 | **Attention** | [III.1 Attention](#attention) | [III.2 Multi-head Attention](#attention) | |



All details on Word Embedding modules and their pre-training are found in **Part I - 1**.

Example implementations in PyTorch:

- https://github.com/pytorch/examples/blob/master/word_language_model/model.py


Several architectures are described in the literature:

- Regularizing and Optimizing LSTM Language Models - https://arxiv.org/pdf/1708.02182.pdf

A language model is interesting in its own right, but it can also be used to pre-train the lower layers of a more complex model:

- Deep contextualized word representations - https://arxiv.org/pdf/1802.05365.pdf
- Improving Language Understanding by Generative Pre-Training - https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf
- Language Models are Unsupervised Multitask Learners - https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf

# Packages

In [1]:
from __future__ import unicode_literals, print_function, division  # must come before any other import
import sys
import warnings
import os
from io import open
import unicodedata
import string
import time
import math
import re
import random
import pickle
import copy
from unidecode import unidecode
import itertools
import matplotlib
import matplotlib.pyplot as plt


# for special math operation
from sklearn.preprocessing import normalize


# for manipulating data 
import numpy as np
#np.set_printoptions(threshold=np.nan)
import pandas as pd
import bcolz # see https://bcolz.readthedocs.io/en/latest/intro.html


# for text processing
import gensim
from gensim.models import KeyedVectors
#import spacy
import nltk
#nltk.download()
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem.porter import PorterStemmer


# for deep learning
import torch
import torch.nn as nn
from torch.autograd import Variable
from torch import optim
import torch.nn.functional as F
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


warnings.filterwarnings("ignore")
print('python version :', sys.version)
print('pytorch version :', torch.__version__)
print('DL device :', device)



python version : 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
pytorch version : 1.3.1
DL device : cuda


In [2]:
path_to_NLP = 'C:\\Users\\Jb\\Desktop\\NLP'

In [3]:
sys.path.append(path_to_NLP + '\\libDL4NLP')

# Corpus

[Back to top](#plan)

The text is imported and stored as a list in which each element is one text, itself represented as a list of words.<br> Once imported, the corpus therefore has the form:<br>

- corpus = [text]<br>
- text   = [word]<br>
- word   = str<br>
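
As a small illustration of this nested structure, here is a toy corpus with the same shape. The words are made up for the example and are not taken from the actual data; the names `corpus_example`, `n_texts` and `vocabulary` are illustrative only.

```python
# Toy corpus with the same shape as the imported one (contents are illustrative).
corpus_example = [
    ['.', 'introduction', '.', 'this', 'section', 'provides', 'information', '.'],
    ['.', 'scope', '.', 'the', 'procedure', 'applies', 'to', 'the', 'final', 'product', '.'],
]

for text in corpus_example:
    assert isinstance(text, list)                  # text = [word]
    assert all(isinstance(w, str) for w in text)   # word = str

n_texts = len(corpus_example)                      # corpus = [text]
vocabulary = sorted({w for text in corpus_example for w in text})
print(n_texts, len(vocabulary))
```

Each text begins with a `'.'` token and uses `'.'` as a separator, as `importSheet` does.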

In [4]:
def cleanSentence(sentence): # -------------------------  str
    sw = ['']
    #sw += nltk.corpus.stopwords.words('english')
    #sw += nltk.corpus.stopwords.words('french')

    def unicodeToAscii(s):
        """Turn a Unicode string to plain ASCII, thanks to http://stackoverflow.com/a/518232/2809427"""
        return ''.join( c for c in unicodedata.normalize('NFD', s)
                        if unicodedata.category(c) != 'Mn')

    def normalizeString(s):
        '''Remove rare symbols from a string'''
        s = unicodeToAscii(s.lower().strip()) # 
        #s = re.sub(r"[^a-zA-Z\.\(\)\[\]]+", r" ", s)  # 'r' before a string is for 'raw' # ?&\%\_\- removed # set('''.,:;()*#&-_%!?/\'")''')
        return s

    def wordTokenizerFunction():
        # base version
        function = lambda sentence : sentence.strip().split()

        # nltk version
        #function = word_tokenize    
        return function

    # 1 - special characters
    def clean_sentence_punct(text): # --------------  str
        text = normalizeString(text)
        # remove the trailing punctuation mark
        if (len(text) > 0 and text[-1] in ['.', ',', ';', ':', '!', '?']) : text = text[:-1]

        text = text.replace(r'(', r' ( ')
        text = text.replace(r')', r' ) ')
        text = text.replace(r'[', r' [ ')
        text = text.replace(r']', r' ] ')
        text = text.replace(r'<', r' < ')
        text = text.replace(r'>', r' > ')

        text = text.replace(r':', r' : ')
        text = text.replace(r';', r' ; ')
        # protect decimal separators (1.5 or 1,5) with a placeholder before spacing out punctuation
        for i in range(5) :
            text = re.sub(r'(?P<val1>[0-9])\.(?P<val2>[0-9])', r'\g<val1>__-__\g<val2>', text)
            text = re.sub(r'(?P<val1>[0-9]),(?P<val2>[0-9])', r'\g<val1>__-__\g<val2>', text)
        text = text.replace(r',', ' , ')
        text = text.replace(r'.', ' . ')
        for i in range(5) : text = re.sub(r'(?P<val1>[p0-9])__-__(?P<val2>[p0-9])', r'\g<val1>.\g<val2>', text)
        text = re.sub(r'(?P<val1>[0-9]) \. p \. (?P<val2>[0-9])', r'\g<val1>.p.\g<val2>', text)
        text = re.sub(r'(?P<val1>[0-9]) \. s \. (?P<val2>[0-9])', r'\g<val1>.s.\g<val2>', text)

        text = text.replace(r'"', r' " ')
        text = text.replace(r'’', r" ' ")
        text = text.replace(r'”', r' " ')
        text = text.replace(r'“', r' " ')
        text = text.replace(r'/', r' / ')

        text = re.sub('(…)+', ' … ', text)
        text = text.replace('≤', ' ≤ ')          
        text = text.replace('≥', ' ≥ ')
        text = text.replace('°c', ' °c ')
        text = text.replace('°C', ' °c ')
        text = text.replace('ºc', ' °c ')
        text = text.replace('n°', 'n° ')
        text = text.replace('%', ' % ')
        text = text.replace('*', ' * ')
        text = text.replace('+', ' + ')
        text = text.replace('-', ' - ')
        text = text.replace('_', ' ')
        text = text.replace('®', ' ')
        text = text.replace('™', ' ')
        text = text.replace('±', ' ± ')
        text = text.replace('÷', ' ÷ ')
        text = text.replace('–', ' - ')
        text = text.replace('μg', ' µg')
        text = text.replace('µg', ' µg')
        text = text.replace('µl', ' µl')
        text = text.replace('μl', ' µl')
        text = text.replace('µm', ' µm')
        text = text.replace('μm', ' µm')
        text = text.replace('ppm', ' ppm')
        text = re.sub(r'(?P<val1>[0-9])mm', r'\g<val1> mm', text)
        text = re.sub(r'(?P<val1>[0-9])g', r'\g<val1> g', text)
        text = text.replace('nm', ' nm')

        text = re.sub(r'fa(?P<val1>[0-9])', r'fa \g<val1>', text)
        text = re.sub(r'g(?P<val1>[0-9])', r'g \g<val1>', text)
        text = re.sub(r'n(?P<val1>[0-9])', r'n \g<val1>', text)
        text = re.sub(r'p(?P<val1>[0-9])', r'p \g<val1>', text)
        text = re.sub(r'q_(?P<val1>[0-9])', r'q_ \g<val1>', text)
        text = re.sub(r'u(?P<val1>[0-9])', r'u \g<val1>', text)
        text = re.sub(r'ud(?P<val1>[0-9])', r'ud \g<val1>', text)
        text = re.sub(r'ui(?P<val1>[0-9])', r'ui \g<val1>', text)

        text = text.replace('=', ' ')
        text = text.replace('!', ' ')
        text = text.replace('-', ' ')
        text = text.replace(r' , ', ' ')
        text = text.replace(r' . ', ' ')

        text = re.sub(r'(?P<val>[0-9])ml', r'\g<val> ml', text)
        text = re.sub(r'(?P<val>[0-9])mg', r'\g<val> mg', text)

        for i in range(5) : text = re.sub('( [0-9]+ )', ' ', text)
        #text = re.sub('cochran(\S)*', 'cochran ', text)
        return text

    # 3 - word splitting
    def wordSplit(sentence, tokenizeur): # ------------- [str]
        return tokenizeur(sentence)

    # 4 - stopword removal (lowercasing is already done in normalizeString)
    def stopwordsRemoval(sentence, sw): # ------------- [[str]]
        return [word for word in sentence if word not in sw]

    # 6 - spelling correction (unused here: `spelling` is not defined in this notebook)
    def correction(text):
        def correct(word):
            return spelling.suggest(word)[0]
        list_of_list_of_words = [[correct(word) for word in sentence] for sentence in text]
        return list_of_list_of_words

    # 7 - stemming
    def stemming(text): # ------------------------- [[str]]
        list_of_list_of_words = [[PorterStemmer().stem(word) for word in sentence if word not in sw] for sentence in text]
        return list_of_list_of_words


    tokenizeur = wordTokenizerFunction()
    sentence = clean_sentence_punct(str(sentence))
    sentence = wordSplit(sentence, tokenizeur)
    sentence = stopwordsRemoval(sentence, sw)
    #text = correction(text)
    #text = stemming(text)
    return sentence


def importSheet(file_name) :
    def cleanDatabase(db):
        words = ['.']
        title = ''
        for pair in db :
            #print(pair)
            current_title = pair[0].split(' | ')[-1]
            if current_title != title :
                words += cleanSentence(current_title) + ['.']
                title  = current_title
            words += cleanSentence(str(pair[1]).split(' | ')[-1]) + ['.']
        return words

    # `sep` is not a read_excel option, and `.ix` is deprecated: use positional `.iloc` instead
    df = pd.read_excel(file_name, header = None)
    headers = [i for i, titre in enumerate(df.iloc[0,:].values) if i in [1, 2] or titre == 'score manuel'] 
    db = df.iloc[1:, headers].values.tolist()
    db = [el[:2] for el in db if el[-1] in [0,1, 10]]
    words = cleanDatabase(db)
    return words


def importCorpus(path_to_data) :
    corpus = []
    reps = os.listdir(path_to_data)
    for rep in reps :
        files = os.listdir(path_to_data + '\\' + rep)
        for file in files :
            file_name = path_to_data + '\\' + rep + '\\' + file
            corpus.append(importSheet(file_name))
    return corpus

In [5]:
corpus = importCorpus(path_to_NLP + '\\data\\AMM')
len(corpus)

510

# 1 Modules

## 1.1 Word Embedding module

[Back to top](#plan)

We consider here a FastText model trained following the Skip-Gram training objective.
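
To make the Skip-Gram objective concrete, the sketch below enumerates the (center, context) pairs that such a model learns to predict. It is a simplified illustration, not gensim's implementation: gensim additionally samples the effective window size, uses negative sampling, and (for FastText) builds word vectors from character n-grams. The function name `skipgram_pairs` is hypothetical.

```python
def skipgram_pairs(words, window=2):
    """Return (center, context) pairs for every word and its neighbors within `window`."""
    pairs = []
    for i, center in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # a word is not its own context
                pairs.append((center, words[j]))
    return pairs

pairs = skipgram_pairs(['the', 'final', 'bulk', 'product'], window=1)
print(pairs)
```

With `window=1` each word is paired only with its immediate neighbors; the `window = 5` setting used in this notebook widens that neighborhood to five words on each side.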

In [6]:
from libDL4NLP.models.Word_Embedding import Word2Vec as myWord2Vec
from libDL4NLP.models.Word_Embedding import Word2VecConnector
from libDL4NLP.utils.Lang import Lang

In [7]:
from gensim.models.fasttext import FastText
from gensim.test.utils import datapath, get_tmpfile

In [8]:
fastText_word2vec = FastText(size = 75, 
                             window = 5, 
                             min_count = 1, 
                             negative = 20,
                             sg = 1)

In [9]:
fastText_word2vec.build_vocab(corpus)

In [10]:
len(fastText_word2vec.wv.vocab)

8086

In [11]:
fastText_word2vec.train(sentences = corpus, 
                        epochs = 50,
                        total_examples = fastText_word2vec.corpus_count)

In [12]:
word2vec = Word2VecConnector(fastText_word2vec)

## 1.2 Contextualization module

[Back to top](#plan)

The contextualization layer transforms a sequence of word vectors into another sequence of the same length, in which each output vector is a version of the corresponding input vector contextualized with respect to its neighboring vectors.


This module consists of a bidirectional _Gated Recurrent Unit_ (GRU) that supports packed sentences:

In [13]:
from libDL4NLP.modules import RecurrentEncoder
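
The source of `RecurrentEncoder` is not shown here, so the following is only a minimal sketch of what a GRU does to a sequence, not the library's implementation. It assumes scalar inputs and a scalar hidden state, no biases, a single direction, and hand-picked weights; the names `gru_step`, `contextualize` and `W` are hypothetical. The gate equations follow the convention used by PyTorch's `nn.GRU`.

```python
import math

def gru_step(x, h, W):
    """One GRU step with scalar input and hidden state; returns the new hidden state."""
    sig = lambda a: 1.0 / (1.0 + math.exp(-a))
    z = sig(W['wz'] * x + W['uz'] * h)             # update gate
    r = sig(W['wr'] * x + W['ur'] * h)             # reset gate
    n = math.tanh(W['wn'] * x + r * W['un'] * h)   # candidate state
    return (1.0 - z) * n + z * h

def contextualize(xs, W):
    """Map a sequence of inputs to a same-length sequence of hidden states."""
    h, outputs = 0.0, []
    for x in xs:
        h = gru_step(x, h, W)
        outputs.append(h)
    return outputs

# Illustrative weights (assumed values, not trained).
W = {'wz': 0.5, 'uz': -0.3, 'wr': 0.8, 'ur': 0.2, 'wn': 1.0, 'un': 0.7}

a = contextualize([1.0, -0.5, 0.3], W)
b = contextualize([0.2, -0.5, 0.3], W)   # same inputs except the first one
print(a[-1], b[-1])
```

The two sequences share their last input, yet their last outputs differ, because each output depends on everything that came before it.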

<a id="language_model"></a>

# 2 Language Model

[Back to top](#plan)


In [19]:
class LanguageModel(nn.Module) :
    def __init__(self, device, tokenizer, word2vec, 
                 hidden_dim = 100, 
                 n_layers = 1, 
                 dropout = 0, 
                 class_weights = None, 
                 optimizer = optim.SGD
                 ):
        super(LanguageModel, self).__init__()
        
        # embedding
        self.tokenizer = tokenizer
        self.word2vec  = word2vec
        self.context   = RecurrentEncoder(self.word2vec.output_dim, hidden_dim, n_layers, dropout, bidirectional = False)
        self.out       = nn.Linear(self.context.output_dim, self.word2vec.lang.n_words)
        self.act       = F.softmax
        
        # optimizer
        self.criterion = nn.NLLLoss(reduction = 'sum', weight = class_weights) # `size_average` is deprecated
        self.optimizer = optimizer
        
        # load to device
        self.device = device
        self.to(device)
        
    def nbParametres(self) :
        return sum([p.data.nelement() for p in self.parameters() if p.requires_grad == True])
    
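    def forward(self, sentence = '.', hidden = None, limit = 10, color_code = '\033[94m'):
        # `limit` is either a maximum number of generated words (int) or a stop word (str)
        words  = self.tokenizer(sentence)
        result = words + [color_code]
        count, stop = 0, False # keep the `hidden` argument instead of resetting it to None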
        while not stop :
            # compute probs
            embeddings = self.word2vec(words, self.device)
            _, hidden  = self.context(embeddings, lengths = None, hidden = hidden) # WARNING : dim = (n_layers, batch_size, hidden_dim)
            probs      = self.act(self.out(hidden[-1, :, :]), dim = 1).view(-1)
            # get predicted word
            topv, topi = probs.data.topk(1)
            words = [self.word2vec.lang.index2word[topi.item()]]
            result += words
            # stopping criterion
            count += 1
            if count == limit or words == [limit] or count == 50 : stop = True
        print(' '.join(result + ['\033[0m']))
        return
    
    def generatePackedSentences(self, sentences, batch_size = 32, depth_range = (2, 10)) :
        sentences = [s[i: i+j] \
                     for s in sentences \
                     for i in range(len(s)-depth_range[0]) \
                     for j in range(depth_range[0], min(depth_range[1], len(s)-i)+1) \
                    ]
        sentences.sort(key = lambda s: len(s), reverse = True)
        packed_data = []
        for i in range(0, len(sentences), batch_size) :
            pack0 = sentences[i:i + batch_size]
            pack0 = [[self.word2vec.lang.getIndex(w) for w in s] for s in pack0]
            pack0 = [[w for w in words if w is not None] for words in pack0]
            pack0.sort(key = len, reverse = True)
            pack1 = Variable(torch.LongTensor([s[-1] for s in pack0]))
            pack0 = [s[:-1] for s in pack0]
            lengths = torch.tensor([len(p) for p in pack0]) # size = (batch_size) 
            pack0 = list(itertools.zip_longest(*pack0, fillvalue = self.word2vec.lang.getIndex('PADDING_WORD')))
            pack0 = Variable(torch.LongTensor(pack0).transpose(0, 1))   # size = (batch_size, max_length) 
            packed_data.append([[pack0, lengths], pack1])
        return packed_data
    
    def fit(self, batches, iters = None, epochs = None, lr = 0.025, random_state = 42,
              print_every = 10, compute_accuracy = True):
        """Performs training over a given dataset and along a specified amount of loops"""
        def asMinutes(s):
            m = math.floor(s / 60)
            s -= m * 60
            return '%dm %ds' % (m, s)

        def timeSince(since, percent):
            now = time.time()
            s = now - since
            rs = s/percent - s
            return '%s (- %s)' % (asMinutes(s), asMinutes(rs))
        
        def computeLogProbs(batch) :
            embeddings = self.word2vec.embedding(batch[0].to(self.device))
            _, hidden  = self.context(embeddings, lengths = batch[1].to(self.device)) # WARNING : dim = (n_layers, batch_size, hidden_dim)
            log_probs  = F.log_softmax(self.out(hidden[-1, :, :]), dim = 1)   # dim = (batch_size, lang_size)
            return log_probs

        def computeAccuracy(log_probs, targets) :
            return sum([targets[i].item() == log_probs[i].data.topk(1)[1].item() for i in range(targets.size(0))]) * 100 / targets.size(0)

        def printScores(start, iter, iters, tot_loss, tot_loss_words, print_every, compute_accuracy) :
            avg_loss = tot_loss / print_every
            avg_loss_words = tot_loss_words / print_every
            if compute_accuracy : print(timeSince(start, iter / iters) + ' ({} {}%) loss : {:.3f}  accuracy : {:.1f} %'.format(iter, int(iter / iters * 100), avg_loss, avg_loss_words))
            else                : print(timeSince(start, iter / iters) + ' ({} {}%) loss : {:.3f}                     '.format(iter, int(iter / iters * 100), avg_loss))
            return 0, 0

        def trainLoop(batch, optimizer, compute_accuracy = True):
            """Performs a training loop, with forward pass, backward pass and weight update."""
            optimizer.zero_grad()
            self.zero_grad()
            log_probs = computeLogProbs(batch[0])
            targets   = batch[1].to(self.device).view(-1)
            loss      = self.criterion(log_probs, targets)
            loss.backward()
            optimizer.step() 
            accuracy = computeAccuracy(log_probs, targets) if compute_accuracy else 0
            return float(loss.item() / targets.size(0)), accuracy
        
        # --- main ---
        self.train()
        np.random.seed(random_state)
        random.seed(random_state) # batches are drawn with random.choice below
        start = time.time()
        optimizer = self.optimizer([param for param in self.parameters() if param.requires_grad == True], lr = lr)
        tot_loss = 0  
        tot_acc  = 0
        if epochs is None :
            for iter in range(1, iters + 1):
                batch = random.choice(batches)
                loss, acc = trainLoop(batch, optimizer, compute_accuracy)
                tot_loss += loss
                tot_acc += acc      
                if iter % print_every == 0 : 
                    tot_loss, tot_acc = printScores(start, iter, iters, tot_loss, tot_acc, print_every, compute_accuracy)
        else :
            iter = 0
            iters = len(batches) * epochs
            for epoch in range(1, epochs + 1):
                print('epoch ' + str(epoch))
                np.random.shuffle(batches)
                for batch in batches :
                    loss, acc = trainLoop(batch, optimizer, compute_accuracy)
                    tot_loss += loss
                    tot_acc += acc 
                    iter += 1
                    if iter % print_every == 0 : 
                        tot_loss, tot_acc = printScores(start, iter, iters, tot_loss, tot_acc, print_every, compute_accuracy)
        return
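
The way `generatePackedSentences` turns a text into training examples can be sketched on a toy sentence. The snippet below mirrors its window enumeration and padding in plain Python, under simplifying assumptions: it works on words rather than vocabulary indices, skips the out-of-vocabulary filtering, and builds a single batch. The helper names `make_examples` and `pad_batch` are hypothetical.

```python
import itertools

def make_examples(sentence, depth_range=(2, 4)):
    """Enumerate windows as in generatePackedSentences, then split each into (context, target)."""
    lo, hi = depth_range
    windows = [sentence[i:i + j]
               for i in range(len(sentence) - lo)
               for j in range(lo, min(hi, len(sentence) - i) + 1)]
    return [(w[:-1], w[-1]) for w in windows]     # the last word is the prediction target

def pad_batch(examples, pad='PADDING_WORD'):
    """Sort by decreasing context length and pad contexts to a common length."""
    examples = sorted(examples, key=lambda e: len(e[0]), reverse=True)
    contexts = [c for c, _ in examples]
    targets = [t for _, t in examples]
    lengths = [len(c) for c in contexts]
    columns = itertools.zip_longest(*contexts, fillvalue=pad)   # one tuple per time step
    padded = [list(row) for row in zip(*columns)]               # back to one row per example
    return padded, lengths, targets

examples = make_examples(['a', 'b', 'c', 'd'], depth_range=(2, 3))
padded, lengths, targets = pad_batch(examples)
print(padded, lengths, targets)
```

Every window contributes one (context, next word) example, which is why a corpus of 510 texts yields a much larger number of batches.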

### Training

In [20]:
language_model = LanguageModel(device,
                               tokenizer = lambda s : s.split(' '),
                               word2vec = word2vec,
                               hidden_dim = 50, 
                               n_layers = 3, 
                               dropout = 0.1,
                               optimizer = optim.SGD)

language_model.nbParametres()

462138

In [21]:
language_model

LanguageModel(
  (word2vec): Word2VecConnector(
    (twin): Word2Vec(
      (embedding): Embedding(8088, 75)
    )
    (embedding): Embedding(8088, 75)
  )
  (context): RecurrentEncoder(
    (dropout): Dropout(p=0.1, inplace=False)
    (bigru): GRU(75, 50, num_layers=3, batch_first=True, dropout=0.1)
  )
  (out): Linear(in_features=50, out_features=8088, bias=True)
  (criterion): NLLLoss()
)

In [17]:
batches = language_model.generatePackedSentences(corpus, batch_size = 64, depth_range = (5, 20))
len(batches)

121513

In [22]:
language_model.fit(batches, iters = 20000, lr = 0.01, print_every = 500)
language_model.fit(batches, iters = 20000, lr = 0.0025, print_every = 500)
language_model.fit(batches, iters = 20000, lr = 0.0005, print_every = 500)

0m 22s (- 14m 42s) (500 2%) loss : 6.562  accuracy : 5.9 %
0m 44s (- 14m 13s) (1000 5%) loss : 6.021  accuracy : 10.5 %
1m 6s (- 13m 36s) (1500 7%) loss : 5.750  accuracy : 12.6 %
1m 27s (- 13m 5s) (2000 10%) loss : 5.578  accuracy : 14.2 %
1m 48s (- 12m 40s) (2500 12%) loss : 5.286  accuracy : 16.1 %
2m 8s (- 12m 10s) (3000 15%) loss : 5.201  accuracy : 17.3 %
2m 33s (- 12m 2s) (3500 17%) loss : 5.077  accuracy : 18.3 %
2m 58s (- 11m 55s) (4000 20%) loss : 4.978  accuracy : 18.7 %
3m 21s (- 11m 32s) (4500 22%) loss : 4.879  accuracy : 19.9 %
3m 44s (- 11m 12s) (5000 25%) loss : 4.825  accuracy : 20.4 %
4m 23s (- 11m 35s) (5500 27%) loss : 4.713  accuracy : 20.8 %
5m 12s (- 12m 10s) (6000 30%) loss : 4.707  accuracy : 21.2 %
6m 3s (- 12m 34s) (6500 32%) loss : 4.691  accuracy : 21.4 %
6m 39s (- 12m 22s) (7000 35%) loss : 4.541  accuracy : 22.8 %
7m 0s (- 11m 41s) (7500 37%) loss : 4.555  accuracy : 22.9 %
7m 21s (- 11m 1s) (8000 40%) loss : 4.587  accuracy : 23.0 %
7m 42s (- 10m 26s) (
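
The loss reported above is the negative log-likelihood averaged per predicted word (in nats), so it can be read as a perplexity by taking its exponential. The sketch below applies this to two values from the log and compares them with the loss of a uniform guess over the 8088 words of the output layer. The function name `perplexity` is illustrative.

```python
import math

def perplexity(avg_nll):
    """Perplexity corresponding to an average negative log-likelihood in nats."""
    return math.exp(avg_nll)

vocab_size = 8088                       # output size of the final linear layer
uniform_loss = math.log(vocab_size)     # loss of a model that guesses uniformly

first = perplexity(6.562)               # loss after 500 iterations
last = perplexity(4.587)                # loss after 8000 iterations
print(uniform_loss, first, last)
```

A perplexity well below the vocabulary size indicates that the model has narrowed down the set of plausible next words.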

In [23]:
# save
#torch.save(language_model.state_dict(), path_to_NLP + '\\saves\\models\\DL4NLP_I3_language_model.pth')

# load
#language_model.load_state_dict(torch.load(path_to_NLP + '\\saves\\models\\DL4NLP_I3_language_model.pth'))

In [28]:
torch.cuda.empty_cache()

### Evaluation

In [27]:
# fastText gensim, n_layers = 3, dh = 150
language_model.eval()
sentence = random.choice(corpus)
i = random.choice(range(int(len(sentence)/2)))
sentence = ' '.join(sentence[:i]) if i > 0 else '.'
language_model(sentence, limit = '.', color_code = '\x1b[48;2;255;229;217m') #  '\x1b[48;2;255;229;217m' '\x1b[31m'

. introduction . this section provides validation information for the analytical procedures used to release the tdap ipv final bulk product and filled product ( unlabeled ) . the in vitro analytical procedures listed in table are performed in compliance with [48;2;255;229;217m ph eur monograph no current edition . [0m
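
The generation performed by `LanguageModel.forward` is greedy: at each step the most probable next word is selected and fed back into the model. The sketch below reproduces this loop and its stopping rule with a toy lookup table standing in for the network. The table `toy_model` and the function `greedy_generate` are hypothetical. Unlike the GRU, which conditions on the whole history through its hidden state, this toy table conditions only on the last word.

```python
def greedy_generate(next_word_probs, prompt, limit=10, max_steps=50):
    """Greedy decoding: repeatedly append the most probable next word.

    `limit` is either a maximum number of generated words (int) or a stop word (str),
    mirroring the stopping rule of LanguageModel.forward.
    """
    generated = []
    last = prompt[-1]
    while True:
        probs = next_word_probs[last]
        last = max(probs, key=probs.get)          # top-1 choice
        generated.append(last)
        if len(generated) == limit or last == limit or len(generated) == max_steps:
            break
    return generated

# Toy next-word distributions (made-up probabilities).
toy_model = {
    'the':     {'final': 0.6, 'product': 0.4},
    'final':   {'product': 0.7, 'bulk': 0.3},
    'bulk':    {'product': 0.9, '.': 0.1},
    'product': {'.': 0.8, 'the': 0.2},
    '.':       {'the': 1.0},
}

out_stop = greedy_generate(toy_model, ['the'], limit='.')   # stop at the first '.'
out_n = greedy_generate(toy_model, ['the'], limit=2)        # stop after 2 words
print(out_stop, out_n)
```

Because the choice is always the top-1 word, the same prompt always produces the same continuation; sampling from the distribution instead would give more varied output.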
