<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 30px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
  Deep Learning for NLP
  </div> 
  
<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 30px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
    <font color=orange>I - 3 </font>
  Language Modeling
  </div> 

  <div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 20px; 
      text-align: center; 
      padding: 15px;">
  </div> 

  <div style=" float:right; 
      font-size: 12px; 
      line-height: 12px; 
  padding: 10px 15px 8px;">
  Jean-baptiste AUJOGUE
  </div> 

### Part I

1. Word Embedding

2. Sentence Classification

3. <font color=orange>**Language Modeling**</font>

4. Sequence Labelling


### Part II

1. Text Classification

2. Sequence to sequence



### Part III

1. Abstractive Summarization

2. Question Answering

3. Chatbot


</div>

***

<a id="plan"></a>

| | | | |
|------|------|------|------|
| **Content** | [Corpus](#corpus) | [Modules](#modules) | [Model](#model) |

# Overview


Example implementations in PyTorch:

- https://github.com/pytorch/examples/blob/master/word_language_model/model.py


Several architectures are described in the literature:

- Regularizing and Optimizing LSTM Language Models - https://arxiv.org/pdf/1708.02182.pdf

A language model is interesting in its own right, but it can also be used to pre-train the lower layers of a more complex model:

- Deep contextualized word representations - https://arxiv.org/pdf/1802.05365.pdf
- Improving Language Understanding by Generative Pre-Training - https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf
- Language Models are Unsupervised Multitask Learners - https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf

# Packages

In [1]:
from __future__ import unicode_literals, print_function, division
import sys
import warnings
import os
from io import open
import unicodedata
import string
import time
import math
import re
import random
import pickle
import copy
from unidecode import unidecode
import itertools
import gc
import multiprocessing

import matplotlib
import matplotlib.pyplot as plt


# for special math operation
from sklearn.preprocessing import normalize


# for manipulating data 
import numpy as np
#np.set_printoptions(threshold=np.nan)
import pandas as pd
import bcolz # see https://bcolz.readthedocs.io/en/latest/intro.html
import pickle


# for text processing
import gensim
from gensim.models import KeyedVectors
#import spacy
import nltk
#nltk.download()
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem.porter import PorterStemmer


# for deep learning
import torch
import torch.nn as nn
from torch.autograd import Variable
from torch import optim
import torch.nn.functional as F
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


warnings.filterwarnings("ignore")
print('python version :', sys.version)
print('pytorch version :', torch.__version__)
print('DL device :', device)



python version : 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
pytorch version : 1.5.0
DL device : cuda


In [2]:
path_to_DL4NLP = os.path.dirname(os.getcwd())

In [3]:
sys.path.append(path_to_DL4NLP + '\\lib')

<a id="corpus"></a>

# Corpus

[Back to top](#plan)

The text is imported and stored as a list, where each element represents one text (here, a sentence) given as a list of words.

In [4]:
df_GMB_extract = pd.read_csv(path_to_DL4NLP + "\\data\\Groningen Meaning Bank (extract)\\ner.csv", encoding = "ISO-8859-1", error_bad_lines = False)

b'Skipping line 281837: expected 25 fields, saw 34\n'


In [5]:
df_GMB_extract.dropna(inplace = True)
df_GMB_extract = df_GMB_extract[['sentence_idx', 'word', 'pos']]
print(df_GMB_extract.shape)
df_GMB_extract.head()

(1050794, 3)


Unnamed: 0,sentence_idx,word,pos
0,1.0,Thousands,NNS
1,1.0,of,IN
2,1.0,demonstrators,NNS
3,1.0,have,VBP
4,1.0,marched,VBN


In [6]:
# corpus with words lowered and stripped
corpus = df_GMB_extract.groupby("sentence_idx").apply(lambda s: [w.lower().strip() for w in s["word"].values.tolist()]).tolist()
corpus = [[w for w in s if w != ''] for s in corpus]

In [7]:
len(corpus)

35177
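
The grouping step above can be illustrated on a toy table. The sketch below is a minimal stand-in that uses only the standard library; the `rows` data and the `build_corpus` helper are hypothetical and not part of the notebook.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (sentence_idx, word) rows, mimicking the two columns used above.
rows = [
    (1.0, "Thousands"), (1.0, "of"), (1.0, " demonstrators "),
    (2.0, "They"), (2.0, " "), (2.0, "marched"),
]

def build_corpus(rows):
    """Group rows by sentence index, then lowercase, strip and drop empty words."""
    corpus = []
    # sorted() is stable, so the word order inside each sentence is preserved
    for _, group in groupby(sorted(rows, key=itemgetter(0)), key=itemgetter(0)):
        words = [word.lower().strip() for _, word in group]
        corpus.append([w for w in words if w != ""])
    return corpus

toy_corpus = build_corpus(rows)
print(toy_corpus)
```

Each inner list is one sentence, which is the format expected by both the embedding model and the batch generator further down.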

<a id="modules"></a>

# 1 Modules

### 1.1 Word Embedding module

[Back to top](#plan)

We consider here a gensim Word2Vec model trained with the Skip-Gram objective (`sg = 1`).

In [8]:
from libDL4NLP.models.Word_Embedding import Word2Vec as myWord2Vec
from libDL4NLP.models.Word_Embedding import Word2VecConnector
from libDL4NLP.utils.Lang import Lang

In [9]:
from gensim.models import Word2Vec
from gensim.test.utils import datapath, get_tmpfile

In [10]:
# load
file_name = get_tmpfile(path_to_DL4NLP + "\\saves\\DL4NLP_I3_skipgram_gensim.model")
sg_gensim = Word2Vec.load(file_name)

In [14]:
lang = Lang(corpus, min_count = 1)
print(lang.n_words)

27419


In [153]:
sg_gensim = Word2Vec(corpus, 
                     size = 100, 
                     window = 5, 
                     min_count = 1, 
                     negative = 20, 
                     iter = 100,
                     sg = 1,
                     workers = multiprocessing.cpu_count())

In [154]:
# save
#file_name = get_tmpfile(path_to_DL4NLP + "\\saves\\DL4NLP_I3_skipgram_gensim.model")
#sg_gensim.save(file_name)

In [11]:
word2vec = Word2VecConnector(sg_gensim)

In [12]:
# The ordered vocab is the same for both the original and its wrapped objects
# except the two last words 'PADDING_WORD' and 'UNK' added to the wrapped object
list(word2vec.word2vec.wv.index2word) == list(word2vec.twin.lang.word2index)[:-2]

True
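
As an illustration of the check above, here is a minimal sketch of a vocabulary that keeps the embedding model's word order and appends two special tokens. `build_word2index` and `get_index` are hypothetical helpers; the actual `Word2VecConnector` and `Lang` implementations are not shown in this notebook and may differ, in particular in how unknown words are handled.

```python
def build_word2index(ordered_vocab, specials=("PADDING_WORD", "UNK")):
    """Map words to indices in the given order, then append the special tokens."""
    word2index = {word: i for i, word in enumerate(ordered_vocab)}
    for token in specials:
        word2index[token] = len(word2index)
    return word2index

def get_index(word2index, word):
    """Return the index of a word, falling back to UNK (an assumed behaviour)."""
    return word2index.get(word, word2index["UNK"])

ordered_vocab = ["the", "of", "demonstrators"]   # toy stand-in for wv.index2word
word2index = build_word2index(ordered_vocab)

# Same check as in the cell above: the original order is preserved
# up to the two last entries.
print(list(word2index)[:-2] == ordered_vocab)
print(get_index(word2index, "marched"))
```

Placing the special tokens last means the pretrained vectors can be copied row by row into an embedding matrix without reindexing.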

### 1.2 Contextualization module

[Back to top](#plan)

The contextualization layer transforms a sequence of word vectors into another sequence of the same length, where each output vector is a version of the corresponding input vector that is contextualized with respect to its neighboring vectors.


This module consists of a _Gated Recurrent Unit_ (GRU) that supports packed sentences. It can be run bi-directionally, but the language model below uses it uni-directionally (`bidirectional = False`), so that each position only sees the words before it :

In [13]:
from libDL4NLP.modules import RecurrentEncoder
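
For reference, a GRU layer updates its hidden state $h_t$ from the input vector $x_t$ and the previous state $h_{t-1}$ as written below. This is the formulation documented for PyTorch's `nn.GRU`; that `RecurrentEncoder` wraps such a layer is an assumption, since its source is not shown here.

$$
\begin{aligned}
r_t &= \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{t-1} + b_{hr}) \\
z_t &= \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{t-1} + b_{hz}) \\
n_t &= \tanh\big(W_{in} x_t + b_{in} + r_t \odot (W_{hn} h_{t-1} + b_{hn})\big) \\
h_t &= (1 - z_t) \odot n_t + z_t \odot h_{t-1}
\end{aligned}
$$

Here $\sigma$ is the logistic sigmoid and $\odot$ the element-wise product. In the uni-directional case $h_t$ depends only on $x_1, \dots, x_t$, which is what next-word prediction requires.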

<a id="model"></a>

# 2 Language Model

[Back to top](#plan)


In [None]:
#from libDL4NLP.models import LanguageModel

In [100]:
class LanguageModel(nn.Module) :
    def __init__(self, device, tokenizer, word2vec, 
                 hidden_dim = 100, 
                 n_layer = 1, 
                 dropout = 0, 
                 class_weights = None, 
                 optimizer = optim.SGD
                 ):
        super().__init__()
        
        # layers
        self.tokenizer = tokenizer
        self.word2vec  = word2vec
        self.context   = RecurrentEncoder(
            emb_dim = self.word2vec.out_dim, 
            hid_dim = hidden_dim, 
            n_layer = n_layer, 
            dropout = dropout, 
            bidirectional = False)
        self.out = nn.Linear(self.context.out_dim, self.word2vec.lang.n_words)
        self.act = F.softmax
        
        # optimizer
        self.ignore_index = self.word2vec.lang.getIndex('PADDING_WORD')
        self.criterion = nn.NLLLoss(
            reduction    = 'mean',
            ignore_index = self.ignore_index,
            weight       = class_weights)
        self.optimizer = optimizer

        # load to device
        self.device = device
        self.to(device)
        
    def nbParametres(self) :
        return sum([p.data.nelement() for p in self.parameters() if p.requires_grad == True])
    
    def forward(self, 
                sentence = '.', 
                hidden = None, 
                limit = 10, 
                color_code = '\033[94m'):
        # init variables
        words  = self.tokenizer(sentence)
        result = words + [color_code]
        count, stop = 0, False # keep the hidden state passed as argument
        while not stop :
            # compute probs
            embeddings = self.word2vec(words, self.device)
            _, hidden  = self.context(embeddings, lengths = None, hidden = hidden) # size (n_layers, batch_size, hid_dim)
            probs      = self.act(self.out(hidden[-1]), dim = 1).view(-1)
            # get predicted word
            topv, topi = probs.data.topk(1)
            words = [self.word2vec.lang.index2word[topi.item()]]
            result += words
            # stopping criterion
            count += 1
            if count == limit or words == [limit] or count == 50 : stop = True
        print(' '.join(result + ['\033[0m']))
        return 
    
    def generatePackedSentences(self, sentences, batch_size = 32, max_length = 15) :
        sentences = [s[i: i+max_length] for s in sentences for i in range(0, len(s)- max_length)]
        sentences = [s for s in sentences if len(s) > 1]
        sentences.sort(key = lambda s: len(s), reverse = True)
        packed_data = []
        for i in range(0, len(sentences), batch_size) :
            pack0 = sentences[i:i + batch_size]
            pack0 = [[self.word2vec.lang.getIndex(w) for w in s] for s in pack0]
            pack0 = [[w for w in words if w is not None] for words in pack0]
            pack0.sort(key = len, reverse = True)
            lengths = torch.tensor([max(len(p) - 1, 1) for p in pack0]) # size (batch_size), true input lengths before padding
            pack0 = list(itertools.zip_longest(*pack0, fillvalue = self.ignore_index))
            pack0 = Variable(torch.LongTensor(pack0).transpose(0, 1)) # size (batch_size, max_length)
            pack1 = pack0[:, 1:]                                      # size (batch_size, max_length-1) 
            pack0 = pack0[:,:-1]                                      # size (batch_size, max_length-1) 
 
            packed_data.append([[pack0, lengths], pack1])
        return packed_data
    
    def fit(self, batches, iters = None, epochs = None, lr = 0.025, min_background_length = 3, random_state = 42,
              print_every = 10, compute_accuracy = True):
        """Performs training over a given dataset and along a specified amount of loops"""
        def asMinutes(s):
            m = math.floor(s / 60)
            s -= m * 60
            return '%dm %ds' % (m, s)

        def timeSince(since, percent):
            now = time.time()
            s = now - since
            rs = s/percent - s
            return '%s (- %s)' % (asMinutes(s), asMinutes(rs))
        
        def computeLogProbs(batch) :
            embeddings = self.word2vec.embedding(batch[0].to(self.device))
            lengths    = batch[1].to(self.device)
            hiddens, _ = self.context(embeddings, lengths = lengths) # size (batch_size, max_length-1, hid_dim)
            log_probs  = F.log_softmax(self.out(hiddens), dim = 2)   # size (batch_size, max_length-1, lang_size)
            return log_probs

        def computeAccuracy(log_probs, targets) :
            mask1 = (targets.data != self.ignore_index).int() # 1 for real words, 0 for padding
            mask2 = (targets.data == log_probs.data.topk(1, dim = 1)[1].squeeze(1)).int()
            good  = torch.sum(mask1 * mask2).item()
            alls  = torch.sum(mask1).item()
            return good, alls

        def printScores(start, iter, iters, tot_loss, accuracy, print_every, compute_accuracy) :
            # tot_loss is summed over the last print_every iterations; accuracy is already a percentage
            avg_loss = tot_loss / print_every
            if compute_accuracy : print(timeSince(start, iter / iters) + ' ({} {}%) loss : {:.3f}  accuracy : {:.1f} %'.format(iter, int(iter / iters * 100), avg_loss, accuracy))
            else                : print(timeSince(start, iter / iters) + ' ({} {}%) loss : {:.3f}                     '.format(iter, int(iter / iters * 100), avg_loss))
            return

        def trainLoop(batch, min_background_length, optimizer, compute_accuracy = True):
            """Performs a training loop, with forward pass, backward pass and weight update."""
            torch.cuda.empty_cache()
            optimizer.zero_grad()
            self.zero_grad()
            if min_background_length < batch[1].size(-1) :
                log_probs = computeLogProbs(batch[0]).transpose(1, 2) # size (batch_size, lang_size, max_length-1)
                log_probs = log_probs[:, :, min_background_length:]   # size (batch_size, lang_size, max_length-min_background_length-1)
                targets   = batch[1].to(self.device)                  # size (batch_size, max_length-1)
                targets   = targets[:, min_background_length:]        # size (batch_size, max_length-min_background_length-1)
                loss      = self.criterion(log_probs, targets)
                loss.backward()
                optimizer.step() 
                good, alls = computeAccuracy(log_probs, targets) if compute_accuracy else (0, 0)
                return loss.item(), good, alls
            else :
                return 0, 0, 0
        
        # --- main ---
        self.train()
        np.random.seed(random_state)
        random.seed(random_state) # batches are drawn with random.choice below
        start = time.time()
        optimizer = self.optimizer([param for param in self.parameters() if param.requires_grad == True], lr = lr)
        tot_loss = 0  
        tot_good = 0
        tot_alls = 0
        if epochs is None :
            for iter in range(1, iters + 1):
                batch = random.choice(batches)
                loss, good, alls = trainLoop(batch, min_background_length, optimizer, compute_accuracy)
                tot_loss += loss
                tot_good += good 
                tot_alls += alls
                if iter % print_every == 0 : 
                    tot_acc = tot_good * 100 / tot_alls if tot_alls > 0 else 0
                    printScores(start, iter, iters, tot_loss, tot_acc, print_every, compute_accuracy)
                    tot_loss = 0  
                    tot_good = 0
                    tot_alls = 0
        else :
            iter = 0
            iters = len(batches) * epochs
            for epoch in range(1, epochs + 1):
                print('epoch ' + str(epoch))
                np.random.shuffle(batches)
                for batch in batches :
                    loss, good, alls = trainLoop(batch, min_background_length, optimizer, compute_accuracy)
                    tot_loss += loss
                    tot_good += good 
                    tot_alls += alls
                    iter += 1
                    if iter % print_every == 0 : 
                        tot_acc = tot_good * 100 / tot_alls if tot_alls > 0 else 0
                        printScores(start, iter, iters, tot_loss, tot_acc, print_every, compute_accuracy)
                        tot_loss = 0  
                        tot_good = 0
                        tot_alls = 0
        return
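
The training step relies on two ideas: inputs and targets are the same padded batch shifted by one position (`pack0` and `pack1`), and padded positions are excluded from the loss through `ignore_index`. The sketch below reproduces both in plain Python on toy data; `PAD`, `shift` and `masked_nll` are hypothetical names, and the predicted probabilities are made up for illustration.

```python
import math

PAD = 0  # hypothetical padding index, playing the role of ignore_index

def shift(batch):
    """Split padded index sequences into inputs (all but last) and targets (all but first)."""
    inputs  = [seq[:-1] for seq in batch]
    targets = [seq[1:]  for seq in batch]
    return inputs, targets

def masked_nll(log_probs, targets, ignore_index=PAD):
    """Mean negative log-likelihood over non-padded target positions.

    log_probs[b][t] is a dict {word_index: log-probability} for sequence b at step t.
    """
    total, count = 0.0, 0
    for seq_log_probs, seq_targets in zip(log_probs, targets):
        for step_log_probs, target in zip(seq_log_probs, seq_targets):
            if target == ignore_index:
                continue                      # padded position: no contribution
            total -= step_log_probs[target]
            count += 1
    return total / count if count > 0 else 0.0

batch = [[5, 7, 9, 4], [6, 8, PAD, PAD]]
inputs, targets = shift(batch)

# Made-up uniform predictions over a 10-word vocabulary at every step.
uniform = {i: math.log(1 / 10) for i in range(10)}
log_probs = [[uniform] * len(seq) for seq in inputs]

loss = masked_nll(log_probs, targets)
print(inputs, targets)
print(round(loss, 4))
```

With uniform predictions the loss equals the logarithm of the vocabulary size, which gives a useful baseline when reading the training logs.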

### Training

In [101]:
language_model = LanguageModel(device,
                               tokenizer = lambda s : s.split(' '),
                               word2vec = word2vec,
                               hidden_dim = 150, 
                               n_layer = 3, 
                               dropout = 0.1,
                               optimizer = optim.AdamW)

language_model.nbParametres()

4525620
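
The reported figure can be cross-checked by hand. The sketch below assumes that the encoder is a standard GRU stack with two bias vectors per gate (as in `nn.GRU`), that the embedding weights are frozen and therefore not counted, and that the wrapped vocabulary holds 27,420 entries. None of these is stated explicitly in the notebook; they are inferred because together they reproduce the count exactly.

```python
def gru_param_count(input_dim, hidden_dim, n_layer):
    """Trainable parameters of a uni-directional GRU stack with input and hidden biases."""
    total = 0
    for layer in range(n_layer):
        in_dim = input_dim if layer == 0 else hidden_dim
        # 3 gates, each with an input matrix, a hidden matrix and two bias vectors
        total += 3 * (in_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)
    return total

def linear_param_count(in_dim, out_dim):
    """Weights plus biases of a fully connected layer."""
    return in_dim * out_dim + out_dim

emb_dim, hidden_dim, n_layer = 100, 150, 3   # values used in the cells above
n_words = 27420                              # assumed size of word2vec.lang (inferred)

total = gru_param_count(emb_dim, hidden_dim, n_layer) + linear_param_count(hidden_dim, n_words)
print(total)
```

Under these assumptions, the output projection accounts for roughly nine tenths of the trainable parameters, so the vocabulary size dominates the model size.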

In [74]:
batches = language_model.generatePackedSentences(corpus, batch_size = 2, max_length = 15)
len(batches)

270059
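
The batch count follows from how `generatePackedSentences` slices each sentence into overlapping windows. The sketch below isolates that slicing on toy data, using the same expression as the method; `sliding_windows` is a hypothetical helper name.

```python
def sliding_windows(sentence, max_length):
    """Return windows of max_length consecutive words, sliced as in generatePackedSentences."""
    return [sentence[i: i + max_length] for i in range(0, len(sentence) - max_length)]

sentence = ["w%d" % i for i in range(6)]       # toy 6-word sentence
windows = sliding_windows(sentence, max_length = 4)
for w in windows:
    print(w)

# A sentence of n words yields n - max_length windows, so sentences with
# at most max_length words contribute nothing with this range.
print(len(sliding_windows(sentence[:4], max_length = 4)))
```

Using `range(0, len(s) - max_length + 1)` would also include the window that ends on the last word of the sentence; whether leaving it out is intended is not stated in the notebook.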

In [102]:
language_model.fit(batches, iters = 1, lr = 0.001, min_background_length = 12, print_every = 50)


In [166]:
language_model.fit(batches[:40000], epochs = 1, lr = 0.00025, print_every = 250)
language_model.fit(batches[40000:], epochs = 1, lr = 0.0001, print_every = 250)

epoch 1
0m 21s (- 56m 46s) (250 0%) loss : 4.910  accuracy : 21.1 %
0m 42s (- 56m 16s) (500 1%) loss : 4.948  accuracy : 20.9 %
1m 4s (- 56m 28s) (750 1%) loss : 4.947  accuracy : 21.5 %
1m 26s (- 55m 59s) (1000 2%) loss : 5.021  accuracy : 20.9 %
1m 47s (- 55m 33s) (1250 3%) loss : 4.884  accuracy : 21.8 %
2m 8s (- 55m 10s) (1500 3%) loss : 4.918  accuracy : 21.4 %
2m 30s (- 54m 44s) (1750 4%) loss : 4.994  accuracy : 21.5 %
2m 51s (- 54m 19s) (2000 5%) loss : 4.999  accuracy : 21.1 %
3m 13s (- 53m 58s) (2250 5%) loss : 5.029  accuracy : 21.8 %
3m 34s (- 53m 34s) (2500 6%) loss : 5.013  accuracy : 21.3 %
3m 55s (- 53m 11s) (2750 6%) loss : 5.011  accuracy : 21.7 %
4m 16s (- 52m 48s) (3000 7%) loss : 4.989  accuracy : 21.8 %
4m 38s (- 52m 25s) (3250 8%) loss : 5.005  accuracy : 21.5 %
4m 59s (- 52m 5s) (3500 8%) loss : 4.973  accuracy : 22.0 %
5m 20s (- 51m 42s) (3750 9%) loss : 4.899  accuracy : 21.7 %
5m 42s (- 51m 21s) (4000 10%) loss : 4.966  accuracy : 21.7 %
6m 3s (- 50m 59s) (42

46m 38s (- 10m 19s) (32750 81%) loss : 4.799  accuracy : 22.2 %
46m 59s (- 9m 58s) (33000 82%) loss : 4.891  accuracy : 21.8 %
47m 21s (- 9m 36s) (33250 83%) loss : 4.816  accuracy : 22.3 %
47m 42s (- 9m 15s) (33500 83%) loss : 4.864  accuracy : 21.9 %
48m 4s (- 8m 54s) (33750 84%) loss : 4.826  accuracy : 22.3 %
48m 25s (- 8m 32s) (34000 85%) loss : 4.760  accuracy : 22.8 %
48m 46s (- 8m 11s) (34250 85%) loss : 4.853  accuracy : 22.2 %
49m 7s (- 7m 49s) (34500 86%) loss : 4.830  accuracy : 22.4 %
49m 29s (- 7m 28s) (34750 86%) loss : 4.761  accuracy : 23.2 %
49m 50s (- 7m 7s) (35000 87%) loss : 4.774  accuracy : 23.1 %
50m 11s (- 6m 45s) (35250 88%) loss : 4.764  accuracy : 22.9 %
50m 33s (- 6m 24s) (35500 88%) loss : 4.681  accuracy : 22.6 %
50m 54s (- 6m 3s) (35750 89%) loss : 4.923  accuracy : 22.0 %
51m 16s (- 5m 41s) (36000 90%) loss : 4.830  accuracy : 22.0 %
51m 37s (- 5m 20s) (36250 90%) loss : 4.737  accuracy : 22.7 %
51m 58s (- 4m 59s) (36500 91%) loss : 4.828  accuracy : 22

39m 36s (- 1m 0s) (25500 97%) loss : 4.698  accuracy : 23.5 %
39m 59s (- 0m 37s) (25750 98%) loss : 4.643  accuracy : 23.6 %
40m 23s (- 0m 13s) (26000 99%) loss : 4.601  accuracy : 24.2 %
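
Since the loss is a mean negative log-likelihood in nats (natural-log `log_softmax` followed by `NLLLoss`), it can be converted to a perplexity with an exponential. A minimal sketch, using loss values taken from the log above; note that the loss only covers positions after the first `min_background_length` words of each window, so the figure is not a full-sentence perplexity.

```python
import math

def perplexity(mean_nll):
    """Perplexity corresponding to a mean negative log-likelihood expressed in nats."""
    return math.exp(mean_nll)

for loss in (4.910, 4.601):      # first and last loss values printed above
    print('loss : {:.3f}  perplexity : {:.1f}'.format(loss, perplexity(loss)))
```

A perplexity of k can be read as the model hesitating, on average, between k equally likely next words.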


In [168]:
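# NOTE: `depth_range` comes from an earlier version of generatePackedSentences;
# the version defined above only accepts `batch_size` and `max_length`.
batches  = language_model.generatePackedSentences(corpus, batch_size = 64, depth_range = (6, 7))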
batches += language_model.generatePackedSentences(corpus, batch_size = 64, depth_range = (11, 12))
batches += language_model.generatePackedSentences(corpus, batch_size = 64, depth_range = (16, 17))
len(batches)

63064

In [169]:
language_model.fit(batches[:40000], epochs = 1, lr = 0.00005, print_every = 250)
language_model.fit(batches[40000:], epochs = 1, lr = 0.00001, print_every = 250)

epoch 1
0m 22s (- 58m 36s) (250 0%) loss : 4.705  accuracy : 22.9 %
0m 43s (- 57m 45s) (500 1%) loss : 4.697  accuracy : 23.2 %
1m 5s (- 57m 14s) (750 1%) loss : 4.633  accuracy : 23.0 %
1m 27s (- 56m 42s) (1000 2%) loss : 4.674  accuracy : 24.2 %
1m 49s (- 56m 19s) (1250 3%) loss : 4.683  accuracy : 23.4 %
2m 10s (- 55m 56s) (1500 3%) loss : 4.767  accuracy : 23.9 %
2m 32s (- 55m 31s) (1750 4%) loss : 4.759  accuracy : 23.0 %
2m 54s (- 55m 8s) (2000 5%) loss : 4.755  accuracy : 23.9 %
3m 15s (- 54m 47s) (2250 5%) loss : 4.633  accuracy : 24.1 %
3m 37s (- 54m 25s) (2500 6%) loss : 4.784  accuracy : 23.2 %
3m 59s (- 54m 2s) (2750 6%) loss : 4.761  accuracy : 23.0 %
4m 21s (- 53m 39s) (3000 7%) loss : 4.705  accuracy : 24.5 %
4m 42s (- 53m 17s) (3250 8%) loss : 4.742  accuracy : 23.8 %
5m 4s (- 52m 55s) (3500 8%) loss : 4.675  accuracy : 23.6 %
5m 26s (- 52m 33s) (3750 9%) loss : 4.696  accuracy : 23.8 %
5m 47s (- 52m 11s) (4000 10%) loss : 4.744  accuracy : 23.1 %
6m 9s (- 51m 49s) (425

47m 31s (- 10m 31s) (32750 81%) loss : 4.672  accuracy : 23.6 %
47m 53s (- 10m 9s) (33000 82%) loss : 4.711  accuracy : 23.3 %
48m 14s (- 9m 47s) (33250 83%) loss : 4.689  accuracy : 23.5 %
48m 36s (- 9m 25s) (33500 83%) loss : 4.674  accuracy : 23.5 %
48m 58s (- 9m 4s) (33750 84%) loss : 4.772  accuracy : 23.0 %
49m 19s (- 8m 42s) (34000 85%) loss : 4.841  accuracy : 22.8 %
49m 41s (- 8m 20s) (34250 85%) loss : 4.719  accuracy : 23.6 %
50m 3s (- 7m 58s) (34500 86%) loss : 4.713  accuracy : 23.2 %
50m 25s (- 7m 37s) (34750 86%) loss : 4.681  accuracy : 23.7 %
50m 46s (- 7m 15s) (35000 87%) loss : 4.661  accuracy : 23.9 %
51m 8s (- 6m 53s) (35250 88%) loss : 4.645  accuracy : 24.2 %
51m 30s (- 6m 31s) (35500 88%) loss : 4.650  accuracy : 24.0 %
51m 52s (- 6m 9s) (35750 89%) loss : 4.746  accuracy : 23.2 %
52m 13s (- 5m 48s) (36000 90%) loss : 4.733  accuracy : 24.0 %
52m 35s (- 5m 26s) (36250 90%) loss : 4.716  accuracy : 23.4 %
52m 57s (- 5m 4s) (36500 91%) loss : 4.692  accuracy : 23.

In [170]:
# save
#torch.save(language_model.state_dict(), path_to_DL4NLP + '\\saves\\DL4NLP_I3_language_model.pth')

# load
#language_model.load_state_dict(torch.load(path_to_DL4NLP + '\\saves\\DL4NLP_I3_language_model.pth'))

### Evaluation

In [27]:
# fastText gensim, n_layers = 3, dh = 50
language_model.eval()
sentence = random.choice(corpus)
i = random.randrange(max(1, len(sentence) // 2)) # also safe for one-word sentences
sentence = ' '.join(sentence[:i]) if i > 0 else '.'
language_model(sentence, limit = '.', color_code = '\x1b[48;2;255;229;217m') #  '\x1b[48;2;255;229;217m' '\x1b[31m'

mr. museveni told the u.n. leader he [48;2;255;229;217m the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the [0m

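
The `forward` method above decodes greedily: at each step it keeps the single most probable word (`topk(1)`). The sketch below contrasts this with temperature sampling on a made-up distribution. Sampling is not part of the notebook; it is shown only as a common alternative, and the `greedy` and `sample` helpers are hypothetical.

```python
import math
import random

def greedy(probs):
    """Index of the most probable word, as done with topk(1) in forward."""
    return max(range(len(probs)), key = lambda i: probs[i])

def sample(probs, temperature = 1.0, rng = random):
    """Draw an index after rescaling log-probabilities by 1 / temperature."""
    logits  = [math.log(p) / temperature for p in probs]
    top     = max(logits)
    weights = [math.exp(l - top) for l in logits]   # unnormalised, numerically stable
    return rng.choices(range(len(probs)), weights = weights, k = 1)[0]

vocab = ['the', 'president', 'said', '.']
probs = [0.40, 0.25, 0.20, 0.15]                    # made-up next-word distribution

rng = random.Random(0)
print('greedy  :', [vocab[greedy(probs)] for _ in range(5)])
print('sampled :', [vocab[sample(probs, temperature = 1.0, rng = rng)] for _ in range(5)])
```

Greedy decoding is deterministic, so as long as the model keeps ranking the same word first it will keep emitting it, which is one way a repeated output like the one above can arise. A low temperature makes sampling behave like greedy decoding, while a higher one spreads the draws over more words.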