<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 30px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
  Deep Learning for NLP
  </div> 
  
<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 30px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
    <font color=orange>I - 3 </font>
  Language Modeling
  </div> 


  <div style=" float:right; 
      font-size: 12px; 
      line-height: 12px; 
  padding: 10px 15px 8px;">
  Jean-baptiste AUJOGUE
  </div> 

### Part I

1. Word Embedding

2. Sentence Classification

3. <font color=orange>**Language Modeling**</font>

4. Sequence Labelling


### Part II

1. Text Classification

2. Sequence to sequence



### Part III

1. Abstractive Summarization

2. Question Answering

3. Chatbot


</div>

***

<a id="plan"></a>

| | | | |
|------|------|------|------|
| **Content** | [Corpus](#corpus) | [Modules](#modules) | [Model](#model) |

# Overview


Example implementations in PyTorch:

- https://github.com/pytorch/examples/blob/master/word_language_model/model.py


Several architectures are described in the literature:

- Regularizing and Optimizing LSTM Language Models - https://arxiv.org/pdf/1708.02182.pdf

A language model is interesting in its own right, but it can also be used to pre-train the lower layers of a more complex model:

- Deep contextualized word representations - https://arxiv.org/pdf/1802.05365.pdf
- Improving Language Understanding by Generative Pre-Training - https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf
- Language Models are Unsupervised Multitask Learners - https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf

# Packages

In [1]:
from __future__ import unicode_literals, print_function, division
import sys
import warnings
import os
from io import open
import unicodedata
import string
import time
import math
import re
import random
import pickle
import copy
from unidecode import unidecode
import itertools
import gc
import multiprocessing

import matplotlib
import matplotlib.pyplot as plt


# for special math operation
from sklearn.preprocessing import normalize


# for manipulating data 
import numpy as np
#np.set_printoptions(threshold=np.nan)
import pandas as pd
import bcolz # see https://bcolz.readthedocs.io/en/latest/intro.html


# for text processing
import gensim
from gensim.models import KeyedVectors
#import spacy
import nltk
#nltk.download()
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem.porter import PorterStemmer


# for deep learning
import torch
import torch.nn as nn
from torch.autograd import Variable
from torch import optim
import torch.nn.functional as F
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


warnings.filterwarnings("ignore")
print('python version :', sys.version)
print('pytorch version :', torch.__version__)
print('DL device :', device)



python version : 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
pytorch version : 1.5.0
DL device : cuda


In [2]:
path_to_DL4NLP = os.path.dirname(os.getcwd())

In [3]:
sys.path.append(path_to_DL4NLP + '\\lib')

<a id="corpus"></a>

# Corpus

[Back to top](#plan)

The text is imported and stored as a list, where each element represents one text given as a list of words.
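
As a minimal sketch of the target structure, the snippet below groups a few toy `(sentence_idx, word)` rows into a list of word lists using only the standard library. The rows are made up for illustration; the actual corpus is built with pandas from the GMB extract.

```python
from itertools import groupby

# Toy rows (sentence_idx, word); illustrative values, not taken from the dataset.
rows = [
    (1.0, "Thousands"), (1.0, "of "), (1.0, "demonstrators"),
    (2.0, "Families"), (2.0, " "), (2.0, "march"),
]

toy_corpus = []
# groupby expects the rows to be already ordered by sentence index.
for _, group in groupby(rows, key=lambda row: row[0]):
    words = [word.lower().strip() for _, word in group]
    # Drop tokens that are empty once stripped.
    toy_corpus.append([w for w in words if w != ""])

print(toy_corpus)
```

Each inner list is one sentence, lower-cased and stripped, which is the shape the later cells expect.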

In [4]:
df_GMB_extract = pd.read_csv(path_to_DL4NLP + "\\data\\Groningen Meaning Bank (extract)\\ner.csv", encoding = "ISO-8859-1", error_bad_lines = False)

b'Skipping line 281837: expected 25 fields, saw 34\n'


In [5]:
df_GMB_extract.dropna(inplace = True)
df_GMB_extract = df_GMB_extract[['sentence_idx', 'word', 'pos']]
print(df_GMB_extract.shape)
df_GMB_extract.head()

(1050794, 3)


| | sentence_idx | word | pos |
|------|------|------|------|
| 0 | 1.0 | Thousands | NNS |
| 1 | 1.0 | of | IN |
| 2 | 1.0 | demonstrators | NNS |
| 3 | 1.0 | have | VBP |
| 4 | 1.0 | marched | VBN |


In [6]:
# corpus with words lowered and stripped
corpus = df_GMB_extract.groupby("sentence_idx").apply(lambda s: [w.lower().strip() for w in s["word"].values.tolist()]).tolist()
corpus = [[w for w in s if w != ''] for s in corpus]

In [7]:
len(corpus)

35177

<a id="modules"></a>

# 1 Modules

### 1.1 Word Embedding module

[Back to top](#plan)

We consider here a gensim Word2Vec model trained with the Skip-Gram objective.
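
To make the Skip-Gram objective concrete, here is a minimal sketch of how (center word, context word) training pairs can be enumerated with a fixed window. The function name `skipgram_pairs` is hypothetical, and the sketch ignores what a real implementation adds on top (negative sampling, and, as far as I know, a randomized effective window size in gensim).

```python
def skipgram_pairs(sentence, window=2):
    """Enumerate (center, context) pairs within `window` words on each side."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        pairs.extend((center, sentence[j]) for j in range(lo, hi) if j != i)
    return pairs

pairs = skipgram_pairs(["thousands", "of", "demonstrators", "have"], window=1)
for center, context in pairs:
    print(center, "->", context)
```

The model is then trained to give a high score to each observed pair, so that words appearing in similar contexts end up with similar vectors.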

In [8]:
from libDL4NLP.models.Word_Embedding import Word2Vec as myWord2Vec
from libDL4NLP.models.Word_Embedding import Word2VecConnector
from libDL4NLP.utils.Lang import Lang

In [9]:
from gensim.models import Word2Vec
from gensim.test.utils import datapath, get_tmpfile

In [10]:
# load
file_name = get_tmpfile(path_to_DL4NLP + "\\saves\\DL4NLP_I3_skipgram_gensim.model")
sg_gensim = Word2Vec.load(file_name)

In [14]:
lang = Lang(corpus, min_count = 1)
print(lang.n_words)

27419


In [153]:
sg_gensim = Word2Vec(corpus, 
                     size = 100, 
                     window = 5, 
                     min_count = 1, 
                     negative = 20, 
                     iter = 100,
                     sg = 1,
                     workers = multiprocessing.cpu_count())

In [154]:
# save
#file_name = get_tmpfile(path_to_DL4NLP + "\\saves\\DL4NLP_I3_skipgram_gensim.model")
#sg_gensim.save(file_name)

In [11]:
word2vec = Word2VecConnector(sg_gensim)

In [12]:
# The ordered vocab is the same for both the original and its wrapped objects
# except the two last words 'PADDING_WORD' and 'UNK' added to the wrapped object
list(word2vec.word2vec.wv.index2word) == list(word2vec.twin.lang.word2index)[:-2]

True
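
The check above relies on the wrapped vocabulary keeping the pretrained word order and appending two special tokens at the end. A minimal sketch of such a vocabulary is given below. `ToyLang` is a hypothetical stand-in written for illustration; the real `Lang` class in `libDL4NLP` may behave differently, in particular for unknown words, which this sketch maps to `'UNK'` as an assumption.

```python
from collections import Counter

class ToyLang:
    """Hypothetical vocabulary: kept words first, special tokens last."""

    def __init__(self, corpus, min_count=1, specials=("PADDING_WORD", "UNK")):
        counts = Counter(word for sentence in corpus for word in sentence)
        kept = [word for word, count in counts.items() if count >= min_count]
        self.index2word = kept + list(specials)
        self.word2index = {word: i for i, word in enumerate(self.index2word)}
        self.n_words = len(self.index2word)

    def get_index(self, word):
        # Assumption: unknown words fall back to the 'UNK' index.
        return self.word2index.get(word, self.word2index["UNK"])

toy_lang = ToyLang([["a", "b"], ["a"]], min_count=1)
print(toy_lang.index2word, toy_lang.n_words)
```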

### 1.2 Contextualization module

[Back to top](#plan)

The contextualization layer transforms a sequence of word vectors into another sequence of the same length, where each output vector is a new version of the corresponding input vector, contextualized with respect to its neighboring vectors.


This module consists of a _Gated Recurrent Unit_ (GRU), optionally bi-directional, that supports packed sentences:

In [13]:
from libDL4NLP.modules import RecurrentEncoder
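
The internals of `RecurrentEncoder` are not shown in this notebook. As a rough sketch of the idea, assuming it wraps a standard `nn.GRU`, the snippet below runs a bi-directional GRU over a padded batch using PyTorch's packing utilities, so that padded positions are skipped. All dimensions are arbitrary toy values.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
emb_dim, hid_dim = 4, 6
gru = nn.GRU(emb_dim, hid_dim, num_layers=1, batch_first=True, bidirectional=True)

# Two sentences of lengths 3 and 2, padded to length 3.
embeddings = torch.randn(2, 3, emb_dim)
lengths = torch.tensor([3, 2])  # CPU tensor, sorted in decreasing order

packed = pack_padded_sequence(embeddings, lengths, batch_first=True)
packed_out, hidden = gru(packed)
outputs, _ = pad_packed_sequence(packed_out, batch_first=True)

print(outputs.shape)  # (batch_size, max_length, 2 * hid_dim)
print(hidden.shape)   # (n_layers * 2, batch_size, hid_dim)
```

The output has one contextualized vector per input position, while `hidden` holds the final state of each direction; positions beyond a sentence's length come back as zeros after unpacking.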

<a id="model"></a>

# 2 Language Model

[Back to top](#plan)


In [None]:
#from libDL4NLP.models import LanguageModel

In [139]:
class LanguageModel(nn.Module) :
    def __init__(self, device, tokenizer, word2vec, 
                 hidden_dim = 100, 
                 n_layer = 1, 
                 dropout = 0, 
                 class_weights = None, 
                 optimizer = optim.SGD
                 ):
        super().__init__()
        
        # layers
        self.tokenizer = tokenizer
        self.word2vec  = word2vec
        self.context   = RecurrentEncoder(
            emb_dim = self.word2vec.out_dim, 
            hid_dim = hidden_dim, 
            n_layer = n_layer, 
            dropout = dropout, 
            bidirectional = False)
        self.out       = nn.Linear(self.context.out_dim, self.word2vec.lang.n_words)
        self.act       = F.softmax
        
        # optimizer
        self.criterion = nn.NLLLoss(reduction = 'sum', weight = class_weights) # summed over the batch ('size_average' is deprecated)
        self.optimizer = optimizer
        
        # load to device
        self.device = device
        self.to(device)
        
    def nbParametres(self) :
        return sum([p.data.nelement() for p in self.parameters() if p.requires_grad == True])
    
    def forward(self, 
                sentence = '.', 
                hidden = None, 
                limit = 10, 
                color_code = '\033[94m'):
        # init variables
        words  = self.tokenizer(sentence)
        result = words + [color_code]
        count, stop = 0, False # the 'hidden' argument is kept as initial state instead of being reset to None
        while not stop :
            # compute probs
            embeddings = self.word2vec(words, self.device)
            _, hidden  = self.context(embeddings, lengths = None, hidden = hidden) # WARNING : dim = (n_layers, batch_size, hidden_dim)
            probs      = self.act(self.out(hidden[-1]), dim = 1).view(-1)
            # get predicted word
            topv, topi = probs.data.topk(1)
            words = [self.word2vec.lang.index2word[topi.item()]]
            result += words
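            # stopping criterion : 'limit' is either a maximal number of generated words (int)
            # or a stop word (str) ; generation is capped at 50 words in both cases
            count += 1
            if count == limit or words == [limit] or count == 50 : stop = True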
        print(' '.join(result + ['\033[0m']))
        return 
    
    def generatePackedSentences(self, sentences, batch_size = 32, lengths = [5, 10, 15]) :
        sentences = [s[i: i+j] \
                     for s in sentences \
                     for j in lengths \
                     for i in range(len(s)-j)]
        sentences.sort(key = lambda s: len(s), reverse = True)
        packed_data = []
        for i in range(0, len(sentences), batch_size) :
            pack0 = sentences[i:i + batch_size]
            pack0 = [[self.word2vec.lang.getIndex(w) for w in s] for s in pack0]
            pack0 = [[w for w in words if w is not None] for words in pack0]
            pack0.sort(key = len, reverse = True)
            pack1 = Variable(torch.LongTensor([s[-1] for s in pack0]))
            pack0 = [s[:-1] for s in pack0]
            lengths = torch.tensor([len(p) for p in pack0]) # size = (batch_size) 
            pack0 = list(itertools.zip_longest(*pack0, fillvalue = self.word2vec.lang.getIndex('PADDING_WORD')))
            pack0 = Variable(torch.LongTensor(pack0).transpose(0, 1))   # size = (batch_size, max_length) 
            packed_data.append([[pack0, lengths], pack1])
        return packed_data
    
    def fit(self, batches, iters = None, epochs = None, lr = 0.025, random_state = 42,
              print_every = 10, compute_accuracy = True):
        """Performs training over a given dataset and along a specified amount of loops"""
        def asMinutes(s):
            m = math.floor(s / 60)
            s -= m * 60
            return '%dm %ds' % (m, s)

        def timeSince(since, percent):
            now = time.time()
            s = now - since
            rs = s/percent - s
            return '%s (- %s)' % (asMinutes(s), asMinutes(rs))
        
        def computeLogProbs(batch) :
            embeddings = self.word2vec.embedding(batch[0].to(self.device))
            _, hidden  = self.context(embeddings, lengths = batch[1].to(self.device)) # WARNING : dim = (n_layers, batch_size, hidden_dim)
            log_probs  = F.log_softmax(self.out(hidden[-1]), dim = 1)   # dim = (batch_size, lang_size)
            return log_probs

        def computeAccuracy(log_probs, targets) :
            return sum([targets[i].item() == log_probs[i].data.topk(1)[1].item() for i in range(targets.size(0))]) * 100 / targets.size(0)

        def printScores(start, iter, iters, tot_loss, tot_acc, print_every, compute_accuracy) :
            # average the running totals over the last 'print_every' batches, then reset them
            avg_loss = tot_loss / print_every
            avg_acc  = tot_acc / print_every
            if compute_accuracy : print(timeSince(start, iter / iters) + ' ({} {}%) loss : {:.3f}  accuracy : {:.1f} %'.format(iter, int(iter / iters * 100), avg_loss, avg_acc))
            else                : print(timeSince(start, iter / iters) + ' ({} {}%) loss : {:.3f}                     '.format(iter, int(iter / iters * 100), avg_loss))
            return 0, 0

        def trainLoop(batch, optimizer, compute_accuracy = True):
            """Performs a training loop, with forward pass, backward pass and weight update."""
            torch.cuda.empty_cache()
            optimizer.zero_grad()
            self.zero_grad()
            log_probs = computeLogProbs(batch[0])
            targets   = batch[1].to(self.device).view(-1)
            loss      = self.criterion(log_probs, targets)
            loss.backward()
            optimizer.step() 
            accuracy = computeAccuracy(log_probs, targets) if compute_accuracy else 0
            return float(loss.item() / targets.size(0)), accuracy
        
        # --- main ---
        self.train()
        random.seed(random_state)
        start = time.time()
        optimizer = self.optimizer([param for param in self.parameters() if param.requires_grad == True], lr = lr)
        tot_loss = 0  
        tot_acc  = 0
        if epochs is None :
            for iter in range(1, iters + 1):
                batch = random.choice(batches)
                loss, acc = trainLoop(batch, optimizer, compute_accuracy)
                tot_loss += loss
                tot_acc += acc      
                if iter % print_every == 0 : 
                    tot_loss, tot_acc = printScores(start, iter, iters, tot_loss, tot_acc, print_every, compute_accuracy)
        else :
            iter = 0
            iters = len(batches) * epochs
            for epoch in range(1, epochs + 1):
                print('epoch ' + str(epoch))
                np.random.shuffle(batches)
                for batch in batches :
                    loss, acc = trainLoop(batch, optimizer, compute_accuracy)
                    tot_loss += loss
                    tot_acc += acc 
                    iter += 1
                    if iter % print_every == 0 : 
                        tot_loss, tot_acc = printScores(start, iter, iters, tot_loss, tot_acc, print_every, compute_accuracy)
        return
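
The training step above combines `F.log_softmax` with a summed `nn.NLLLoss`, then divides by the batch size for reporting. The following sketch reproduces that computation in plain Python on toy logits, to show what the logged loss and accuracy measure. The helper names are hypothetical and the numbers are illustrative only.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def summed_nll(batch_logits, targets):
    """Sum over the batch of minus the log-probability of each target word."""
    return -sum(log_softmax(logits)[t] for logits, t in zip(batch_logits, targets))

def accuracy(batch_logits, targets):
    """Percentage of samples whose highest-scoring word is the target."""
    hits = sum(max(range(len(l)), key=l.__getitem__) == t
               for l, t in zip(batch_logits, targets))
    return 100 * hits / len(targets)

batch_logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
targets = [0, 2]
per_word_loss = summed_nll(batch_logits, targets) / len(targets)
print(per_word_loss, accuracy(batch_logits, targets))
```

A uniform prediction over a vocabulary of size V gives a per-word loss of log(V), which is a useful reference point when reading the training logs.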

### Training

In [140]:
language_model = LanguageModel(device,
                               tokenizer = lambda s : s.split(' '),
                               word2vec = word2vec,
                               hidden_dim = 200, 
                               n_layer = 3, 
                               dropout = 0.1,
                               optimizer = optim.AdamW)

language_model.nbParametres()

6175020

In [141]:
batches = language_model.generatePackedSentences(corpus, batch_size = 64, lengths = [5, 7, 9, 11, 13, 15])
len(batches)

66035
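
The large number of batches comes from the sliding-window scheme in `generatePackedSentences`: every sentence contributes one sample per window length and start position, and the last word of each window is the prediction target. The sketch below mirrors that logic on words instead of indices; the function names are hypothetical.

```python
import itertools

def sliding_windows(sentence, lengths=(3, 4)):
    """All windows s[i:i+j] for i in range(len(s) - j), as in generatePackedSentences."""
    return [sentence[i:i + j] for j in lengths for i in range(len(sentence) - j)]

def pad_batch(windows, pad="PADDING_WORD"):
    """Split windows into (context, target) and right-pad contexts to equal length."""
    windows = sorted(windows, key=len, reverse=True)
    targets = [w[-1] for w in windows]
    contexts = [w[:-1] for w in windows]
    lengths = [len(c) for c in contexts]
    # zip_longest pads column-wise; the outer zip restores one row per sample.
    padded = list(zip(*itertools.zip_longest(*contexts, fillvalue=pad)))
    return padded, lengths, targets

sentence = ["thousands", "of", "demonstrators", "have", "marched"]
windows = sliding_windows(sentence)
padded, lengths, targets = pad_batch(windows)
for row, length, target in zip(padded, lengths, targets):
    print(row, length, "->", target)
```

Note that with `range(len(s) - j)` the window ending on the final word of a sentence is not produced, so the last word of each sentence never appears as a target.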

In [142]:
language_model.train()
language_model.fit(batches[:40000], epochs = 1, lr = 0.001, print_every = 250)
language_model.fit(batches[40000:], epochs = 1, lr = 0.0005, print_every = 250)

epoch 1
0m 27s (- 72m 12s) (250 0%) loss : 7.832  accuracy : 5.1 %
0m 54s (- 71m 33s) (500 1%) loss : 7.453  accuracy : 5.2 %
1m 21s (- 70m 51s) (750 1%) loss : 7.332  accuracy : 5.8 %
1m 48s (- 70m 18s) (1000 2%) loss : 7.247  accuracy : 7.3 %
2m 14s (- 69m 42s) (1250 3%) loss : 7.121  accuracy : 7.5 %
2m 41s (- 69m 11s) (1500 3%) loss : 6.993  accuracy : 8.0 %
3m 8s (- 68m 40s) (1750 4%) loss : 6.981  accuracy : 8.4 %
3m 36s (- 68m 35s) (2000 5%) loss : 6.786  accuracy : 9.8 %
4m 3s (- 68m 6s) (2250 5%) loss : 6.732  accuracy : 10.2 %
4m 30s (- 67m 35s) (2500 6%) loss : 6.671  accuracy : 10.3 %
4m 57s (- 67m 4s) (2750 6%) loss : 6.552  accuracy : 11.1 %
5m 24s (- 66m 37s) (3000 7%) loss : 6.474  accuracy : 12.0 %
5m 50s (- 66m 8s) (3250 8%) loss : 6.361  accuracy : 12.3 %
6m 17s (- 65m 38s) (3500 8%) loss : 6.335  accuracy : 12.0 %
6m 44s (- 65m 9s) (3750 9%) loss : 6.283  accuracy : 12.7 %
7m 11s (- 64m 40s) (4000 10%) loss : 6.254  accuracy : 12.9 %
7m 37s (- 64m 12s) (4250 10%) lo

58m 7s (- 12m 51s) (32750 81%) loss : 4.986  accuracy : 20.6 %
58m 33s (- 12m 25s) (33000 82%) loss : 4.972  accuracy : 19.4 %
59m 0s (- 11m 58s) (33250 83%) loss : 5.026  accuracy : 19.3 %
59m 26s (- 11m 32s) (33500 83%) loss : 4.932  accuracy : 20.2 %
59m 53s (- 11m 5s) (33750 84%) loss : 5.040  accuracy : 19.9 %
60m 19s (- 10m 38s) (34000 85%) loss : 4.973  accuracy : 20.1 %
60m 46s (- 10m 12s) (34250 85%) loss : 4.966  accuracy : 20.6 %
61m 13s (- 9m 45s) (34500 86%) loss : 4.962  accuracy : 19.7 %
61m 39s (- 9m 18s) (34750 86%) loss : 4.918  accuracy : 20.5 %
62m 6s (- 8m 52s) (35000 87%) loss : 4.960  accuracy : 19.9 %
62m 32s (- 8m 25s) (35250 88%) loss : 4.872  accuracy : 20.4 %
62m 59s (- 7m 59s) (35500 88%) loss : 4.918  accuracy : 20.6 %
63m 25s (- 7m 32s) (35750 89%) loss : 4.934  accuracy : 20.7 %
63m 52s (- 7m 5s) (36000 90%) loss : 4.969  accuracy : 19.5 %
64m 18s (- 6m 39s) (36250 90%) loss : 4.948  accuracy : 20.3 %
64m 45s (- 6m 12s) (36500 91%) loss : 4.905  accuracy

41m 16s (- 0m 51s) (25500 97%) loss : 5.037  accuracy : 21.7 %
41m 40s (- 0m 27s) (25750 98%) loss : 4.960  accuracy : 21.7 %
42m 5s (- 0m 3s) (26000 99%) loss : 4.970  accuracy : 22.1 %


In [143]:
language_model.train()
language_model.fit(batches[:40000], epochs = 1, lr = 0.00025, print_every = 250)
language_model.fit(batches[40000:], epochs = 1, lr = 0.0001, print_every = 250)

epoch 1
0m 26s (- 70m 13s) (250 0%) loss : 4.761  accuracy : 22.0 %
0m 53s (- 69m 48s) (500 1%) loss : 4.810  accuracy : 21.4 %
1m 19s (- 69m 23s) (750 1%) loss : 4.814  accuracy : 21.5 %
1m 46s (- 69m 1s) (1000 2%) loss : 4.856  accuracy : 22.6 %
2m 12s (- 68m 32s) (1250 3%) loss : 4.846  accuracy : 22.4 %
2m 39s (- 68m 5s) (1500 3%) loss : 4.813  accuracy : 22.6 %
3m 5s (- 67m 40s) (1750 4%) loss : 4.851  accuracy : 21.9 %
3m 32s (- 67m 13s) (2000 5%) loss : 4.798  accuracy : 22.9 %
3m 58s (- 66m 45s) (2250 5%) loss : 4.836  accuracy : 22.4 %
4m 25s (- 66m 18s) (2500 6%) loss : 4.860  accuracy : 22.2 %
4m 51s (- 65m 52s) (2750 6%) loss : 4.777  accuracy : 23.4 %
5m 18s (- 65m 28s) (3000 7%) loss : 4.794  accuracy : 23.0 %
5m 45s (- 65m 1s) (3250 8%) loss : 4.722  accuracy : 23.2 %
6m 11s (- 64m 34s) (3500 8%) loss : 4.773  accuracy : 23.2 %
6m 38s (- 64m 9s) (3750 9%) loss : 4.817  accuracy : 22.6 %
7m 4s (- 63m 43s) (4000 10%) loss : 4.779  accuracy : 22.6 %
7m 31s (- 63m 15s) (4250

57m 53s (- 12m 48s) (32750 81%) loss : 4.543  accuracy : 24.5 %
58m 20s (- 12m 22s) (33000 82%) loss : 4.530  accuracy : 23.9 %
58m 46s (- 11m 55s) (33250 83%) loss : 4.557  accuracy : 23.3 %
59m 13s (- 11m 29s) (33500 83%) loss : 4.510  accuracy : 24.1 %
59m 39s (- 11m 2s) (33750 84%) loss : 4.581  accuracy : 23.6 %
60m 6s (- 10m 36s) (34000 85%) loss : 4.551  accuracy : 24.2 %
60m 32s (- 10m 9s) (34250 85%) loss : 4.515  accuracy : 24.8 %
60m 59s (- 9m 43s) (34500 86%) loss : 4.524  accuracy : 23.5 %
61m 25s (- 9m 16s) (34750 86%) loss : 4.483  accuracy : 24.6 %
61m 52s (- 8m 50s) (35000 87%) loss : 4.521  accuracy : 23.6 %
62m 18s (- 8m 23s) (35250 88%) loss : 4.443  accuracy : 24.4 %
62m 45s (- 7m 57s) (35500 88%) loss : 4.508  accuracy : 24.6 %
63m 11s (- 7m 30s) (35750 89%) loss : 4.478  accuracy : 24.6 %
63m 38s (- 7m 4s) (36000 90%) loss : 4.536  accuracy : 23.7 %
64m 4s (- 6m 37s) (36250 90%) loss : 4.521  accuracy : 23.8 %
64m 31s (- 6m 11s) (36500 91%) loss : 4.501  accuracy

41m 15s (- 0m 51s) (25500 97%) loss : 4.835  accuracy : 23.8 %
41m 39s (- 0m 27s) (25750 98%) loss : 4.752  accuracy : 23.5 %
42m 3s (- 0m 3s) (26000 99%) loss : 4.754  accuracy : 23.7 %
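
The logged loss is the average negative log-likelihood per target word on the training batches, in nats. A common way to read it is through perplexity, the exponential of that value. The short sketch below converts two loss values copied from the logs above; note that these are training figures, since no held-out set is used in this notebook.

```python
import math

def perplexity(avg_nll):
    """Convert an average negative log-likelihood (in nats) into a perplexity."""
    return math.exp(avg_nll)

# First logged value of the first run and last logged value of the second run.
for loss in (7.832, 4.754):
    print("loss {:.3f} -> perplexity {:.1f}".format(loss, perplexity(loss)))
```

Perplexity can be read as the effective number of equally likely words the model hesitates between at each step.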


In [150]:
# save
#torch.save(language_model.state_dict(), path_to_DL4NLP + '\\saves\\DL4NLP_I3_language_model.pth')

# load
#language_model.load_state_dict(torch.load(path_to_DL4NLP + '\\saves\\DL4NLP_I3_language_model.pth'))

### Evaluation

In [149]:
# skip-gram embeddings (gensim), n_layer = 3, hidden_dim = 200
language_model.eval()
sentence = random.choice(corpus)
i = random.randrange(max(1, len(sentence) // 2)) # max(1, ...) avoids an empty range for one-word sentences
sentence = ' '.join(sentence[:i]) if i > 0 else '.'
language_model(sentence, limit = '.', color_code = '\x1b[48;2;255;229;217m') #  '\x1b[48;2;255;229;217m' '\x1b[31m'

iran [48;2;255;229;217m 's nuclear program , which is not to be used to develop nuclear weapons . [0m
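
The generation above is greedy: at each step the `forward` method keeps only the most probable next word (`topk(1)`) and stops on the stop word or after a maximal number of words. The sketch below shows the same loop on a toy lookup table. It is a simplification under a strong assumption: the table conditions on the last word only, whereas the model conditions on the whole prefix through the GRU hidden state. The table values are invented for illustration.

```python
def greedy_generate(next_word_probs, prompt, stop_word=".", max_words=50):
    """Repeatedly append the most probable next word until a stop condition."""
    generated = []
    last = prompt[-1]
    for _ in range(max_words):
        probs = next_word_probs[last]
        last = max(probs, key=probs.get)  # greedy choice, like topk(1)
        generated.append(last)
        if last == stop_word:
            break
    return list(prompt) + generated

# Toy next-word distributions (hypothetical numbers).
table = {
    "iran": {"'s": 0.6, "is": 0.4},
    "'s": {"nuclear": 0.7, "president": 0.3},
    "nuclear": {"program": 0.8, "weapons": 0.2},
    "program": {".": 0.9, ",": 0.1},
}
print(" ".join(greedy_generate(table, ["iran"])))
```

Greedy decoding is deterministic for a given prompt; sampling from the predicted distribution instead would produce more varied continuations.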
