<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 30px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
  Deep Learning for NLP
  </div> 


<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 30px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
    <font color=orange>I - 4 </font>
    Sequence Labelling
    
  </div> 

<div style="
      font-weight: normal; 
      font-size: 20px; 
      text-align: center; 
      padding: 20px; 
      margin: 10px;">
  b. Sentence Denoising
  </div>

  <div style=" float:right; 
      font-size: 12px; 
      line-height: 12px; 
  padding: 10px 15px 8px;">
  Jean-baptiste AUJOGUE
  </div> 

### Part I

1. Word Embedding

2. Sentence Classification

3. Language Modeling

4. <font color=orange>**Sequence Labelling**</font>


### Part II

1. Text Classification

2. Sequence to sequence



### Part III

1. Abstractive Summarization

2. Question Answering

3. Chatbot


</div>

***

<a id="plan"></a>

| | | | |
|------|------|------|------|
| **Content** | [Corpus](#corpus) | [Modules](#modules) | [Model](#model) | 


# Overview

A high-quality GitHub repository on sequence labelling is available [here](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Sequence-Labeling).

We treat **Sentence Denoising** as a sequence labelling task: the model receives a noisy sequence of words and must output the correctly formed sentence, one label (word) per input position. Training follows a denoising objective known as the _Cloze task_, which is used:

- For the BERT model in [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805.pdf)
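
The Cloze objective can be illustrated on a single sentence. The following is a minimal sketch, not the notebook's implementation: `cloze_mask`, its parameters and the `"UNK"` placeholder are hypothetical names chosen for the illustration. A fraction of the positions is hidden in the input, and only those positions carry a target.

```python
import random

def cloze_mask(tokens, mask_ratio=0.15, mask_token="UNK", seed=0):
    """Return (noisy_input, targets) for one non-empty sentence.

    targets[i] is the original token at masked positions and None elsewhere,
    so that only masked positions would contribute to a training loss.
    """
    rng = random.Random(seed)
    n_masked = int(mask_ratio * len(tokens) + 1)   # at least one masked position
    masked = set(rng.sample(range(len(tokens)), k=n_masked))
    noisy = [mask_token if i in masked else w for i, w in enumerate(tokens)]
    targets = [w if i in masked else None for i, w in enumerate(tokens)]
    return noisy, targets

tokens = "the cat sat on the mat".split()
noisy, targets = cloze_mask(tokens)
print(noisy)     # input seen by the model
print(targets)   # words the model has to recover
```

With six tokens and a ratio of 0.15, exactly one position is masked; which one depends on the seed.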

# Packages

In [1]:
from __future__ import unicode_literals, print_function, division
import sys
import warnings
import os
from io import open
import unicodedata
import string
import time
import math
import re
import random
import pickle
import copy
from unidecode import unidecode
import itertools
import matplotlib
import matplotlib.pyplot as plt


# for special math operation
from sklearn.preprocessing import normalize


# for manipulating data 
import numpy as np
#np.set_printoptions(threshold=np.nan)
import pandas as pd
import bcolz # see https://bcolz.readthedocs.io/en/latest/intro.html


# for text processing
import gensim
from gensim.models import KeyedVectors
#import spacy
import nltk
#nltk.download()
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem.porter import PorterStemmer


# for deep learning
import torch
import torch.nn as nn
from torch.autograd import Variable
from torch import optim
import torch.nn.functional as F
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
#device = torch.device("cpu")

warnings.filterwarnings("ignore")
print('python version :', sys.version)
print('pytorch version :', torch.__version__)
print('DL device :', device)



python version : 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
pytorch version : 1.4.0
DL device : cuda


In [2]:
path_to_DL4NLP = os.path.dirname(os.getcwd())

In [3]:
sys.path.append(path_to_DL4NLP + '\\lib')

<a id="corpus"></a>

# Corpus

[Back to top](#plan)

The text is imported and stored as a list, where each element is one text represented as a list of words.

In [4]:
df_AGnews_trn = pd.read_csv(path_to_DL4NLP + "\\data\\AG news\\train.csv", sep = ',', header = None, error_bad_lines = False)
df_AGnews_tst = pd.read_csv(path_to_DL4NLP + "\\data\\AG news\\test.csv" , sep = ',', header = None, error_bad_lines = False)

In [5]:
df_AGnews_trn.columns = ['index', 'title', 'description']
df_AGnews_tst.columns = ['index', 'title', 'description']

In [7]:
def unicodeToAscii(s):
    return ''.join( c for c in unicodedata.normalize('NFD', s)
                    if unicodedata.category(c) != 'Mn')

def normalizeString(s):
    s = unicodeToAscii(s.strip())
    return s

def cleanSentence(s) :
    s = s.lower()
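    s = re.sub(r'[\.!?]+ ', ' . ', s)                                  # isolate sentence-ending punctuation
    s = s.replace('%', ' % ')
    # numbers are matched only when surrounded by spaces, hence the padding with ' '
    s = re.sub(r' [0-9]*\.[0-9] ', ' FLOAT ', ' ' + s + ' ').strip()   # note: matches a single digit after the decimal point
    s = re.sub(r' [0-9,]*[0-9] ', ' INT ', ' ' + s + ' ').strip()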
    
    for w in ['"', "'", '”', '“', '/', '(', ')', '[', ']', '<', '>', ':', ','] : s = s.replace(w, '')
    return s

def trueWord(w) :
    return len(w)>0 and re.sub('[^a-zA-Z0-9.,]', '', w) != ''

def tokenize(s) :
    s = normalizeString(s)
    s = cleanSentence(s)
    s = nltk.tokenize.word_tokenize(s)
    s = [w for w in s if trueWord(w)]
    return s
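
The number normalization step of `cleanSentence` can be tried in isolation. The sketch below is a variant under stated assumptions, not the notebook's code: `normalize_numbers` is a hypothetical helper, it uses lookaround assertions instead of padding the string with spaces, and its `FLOAT` pattern accepts any number of digits after the decimal point.

```python
import re

def normalize_numbers(s):
    """Replace standalone decimal numbers with FLOAT and integers with INT."""
    # (?<!\S) and (?!\S) require whitespace or a string boundary on each side,
    # without consuming it, so that two adjacent numbers are both replaced.
    s = re.sub(r'(?<!\S)[0-9]*\.[0-9]+(?!\S)', 'FLOAT', s)
    s = re.sub(r'(?<!\S)[0-9,]*[0-9](?!\S)', 'INT', s)
    return s

print(normalize_numbers("oil rose 1.25 % to 55 dollars in 2004"))
# oil rose FLOAT % to INT dollars in INT
```

Mapping all numbers to two placeholder tokens keeps the vocabulary small, at the cost of losing the numeric values.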

In [16]:
# tokenize "title . description" for each article, skipping articles whose title yields no token
sentences_trn = [tokenize(s1 + ' . ' + s2) for s1, s2 in zip(df_AGnews_trn["title"].values.tolist(), df_AGnews_trn["description"].values.tolist()) if tokenize(s1) != []]
sentences_tst = [tokenize(s1 + ' . ' + s2) for s1, s2 in zip(df_AGnews_tst["title"].values.tolist(), df_AGnews_tst["description"].values.tolist()) if tokenize(s1) != []]

<a id="modules"></a>

# 1 Modules

### 1.1 Word Embedding module

[Back to top](#plan)

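Word Embedding modules and their pre-training are detailed in **Part I - 1**. Here the embeddings are trained with the Skip-Gram objective (`sg = 1`) using gensim's `Word2Vec` class. Note that the word correction code at the end of this notebook is written for a FastText model, which can still return neighbours for out-of-vocabulary words.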

In [10]:
from libDL4NLP.models.Word_Embedding import Word2Vec as myWord2Vec
from libDL4NLP.models.Word_Embedding import Word2VecConnector
from libDL4NLP.utils.Lang import Lang

In [27]:
from gensim.models import Word2Vec
from gensim.test.utils import datapath, get_tmpfile

In [20]:
import multiprocessing # provides cpu_count() for the workers argument

word2vec = Word2Vec(sentences_trn, 
                     size = 100, 
                     window = 5, 
                     min_count = 5, 
                     negative = 20, 
                     iter = 25,
                     sg = 1,
                     workers = multiprocessing.cpu_count())

In [29]:
# save
#word2vec.save(get_tmpfile(path_to_DL4NLP + '\\saves\\DL4NLP_I4b_skipgram_gensim.model'))

# load trained model
#word2vec = Word2Vec.load(get_tmpfile(path_to_DL4NLP + '\\saves\\DL4NLP_I4b_skipgram_gensim.model'))

### 1.2 Contextualization module

[Back to top](#plan)


In [30]:
from libDL4NLP.modules import RecurrentEncoder

<a id="model"></a>

# 2 Sentence denoising Model

[Back to top](#plan)


In [31]:
class SentenceDenoiser(nn.Module) :
    def __init__(self, device, tokenizer, word2vec, 
                 hidden_dim = 100, 
                 n_layers = 1, 
                 dropout = 0, 
                 class_weights = None, 
                 optimizer = optim.SGD
                 ):
        super(SentenceDenoiser, self).__init__()
        
        # embedding
        self.tokenizer = tokenizer
        self.word2vec  = word2vec
        self.context   = RecurrentEncoder(self.word2vec.output_dim, hidden_dim, n_layers, dropout, bidirectional = True)
        self.out       = nn.Linear(self.context.output_dim, self.word2vec.lang.n_words)
        self.act       = F.softmax
        
        # optimizer
        self.ignore_index = self.word2vec.lang.getIndex('PADDING_WORD')
        self.criterion = nn.NLLLoss(reduction = 'sum', # replaces the deprecated size_average = False
                                    ignore_index = self.ignore_index, 
                                    weight = class_weights)
        self.optimizer = optimizer
        
        # load to device
        self.device = device
        self.to(device)
        
    def nbParametres(self) :
        return sum([p.data.nelement() for p in self.parameters() if p.requires_grad == True])
    
    def predict_proba(self, words):
        embeddings = self.word2vec.twin(words, self.device) # dim = (1, input_length, embedding_dim)
        hiddens, _ = self.context(embeddings)               # dim = (1, input_length, context.output_dim)
        probs      = self.act(self.out(hiddens), dim = 2)   # dim = (1, input_length, lang_size)
        return probs

    # main method
    def forward(self, sentence = '.', color = '\033[94m'):
        def addColor(w1, w2, color) : return color + w2 + '\033[0m' if w1 != w2 else w2
        words  = self.tokenizer(sentence)
        probs  = self.predict_proba(words).squeeze(0) # dim = (input_length, lang_size)
        inds   = [probs[i].data.topk(1)[1].item() for i in range(probs.size(0))]
        new_ws = [self.word2vec.lang.index2word[ind] for ind in inds]
        print(' '.join([addColor(w1, w2, color) for w1, w2 in zip(words, new_ws)]))
        return

    # load data
    def generatePackedSentences(self, 
                                sentences, 
                                batch_size = 32, 
                                mask_ratio = 0.15,
                                max_sentence_length = 50,
                                tol = 10,
                                seed = 42) :
        def maskInput(index, b) :
            # for a selected position (b is True) : ~75 % UNK, ~22.5 % random word, ~2.5 % left unchanged
            if   b and random.random() > 0.25 : return self.word2vec.lang.getIndex('UNK')
            elif b and random.random() > 0.10 : return random.choice(list(self.word2vec.twin.lang.word2index.values()))
            else                              : return index
            
        def maskOutput(index, b) :
            return index if b else self.ignore_index
        
        def splitLongs(words, threshold = 50, tol = 10):
            news = []
            for i in range(0, len(words), threshold) :
                if len(words)-i-threshold > tol : 
                    news.append(words[i : i + threshold])
                else : 
                    news.append(words[i:])
                    break
            return news
        
        packed_data = []
        random.seed(seed)
        # prepare sentences
        #sentences = [self.tokenizer(s) for s in sentences]
        sentences = [[self.word2vec.lang.getIndex(w) for w in s] for s in sentences]
        sentences = [[w for w in words if w is not None] for words in sentences]
        sentences = [s for S in sentences for s in splitLongs(S, max_sentence_length, tol) if len([w for w in s if w != self.word2vec.lang.getIndex('UNK')]) > 1]
        sentences.sort(key = lambda s: len(s), reverse = True)
        # collect packs
        for i in range(0, len(sentences), batch_size) :
            pack = sentences[i:i + batch_size]
            # prepare mask
            mask_xl = [[i for i, w in enumerate(p) if w != self.word2vec.lang.getIndex('UNK')] for p in pack]
            mask_xs = [random.sample(m, k = int(mask_ratio*len(m) +1)) for m in mask_xl]
            # prepare input and target packs
            pack0    = [[ maskInput(s[i], i in m) for i in range(len(s))] for s, m in zip(pack, mask_xs)]
            pack1_xs = [[maskOutput(s[i], i in m) for i in range(len(s))] for s, m in zip(pack, mask_xs)]
            pack1_xl = [[maskOutput(s[i], i in m) for i in range(len(s))] for s, m in zip(pack, mask_xl)]
            lengths  = torch.tensor([len(p) for p in pack0]) # size = (batch_size) 
            # padd
            pack0    = list(itertools.zip_longest(*pack0, fillvalue = self.ignore_index)) 
            pack1_xs = list(itertools.zip_longest(*pack1_xs, fillvalue = self.ignore_index))
            pack1_xl = list(itertools.zip_longest(*pack1_xl, fillvalue = self.ignore_index))
            # turn into torch variables
            pack0 = Variable(torch.LongTensor(pack0).transpose(0, 1))   # size = (batch_size, max_length)
            pack1_xs = Variable(torch.LongTensor(pack1_xs).transpose(0, 1))   # size = (batch_size, max_length) 
            pack1_xl = Variable(torch.LongTensor(pack1_xl).transpose(0, 1))   # size = (batch_size, max_length) 
            # store pack
            packed_data.append([[pack0, lengths], [pack1_xs, pack1_xl]])
        return packed_data
    
    # fit model
    def fit(self, batches, 
            iters = None, epochs = None, lr = 0.025, unmasked_ratio = 0,
            random_state = 42, print_every = 10, compute_accuracy = 'xs'):
        """Performs training over a given dataset and along a specified amount of loops"""
        def asMinutes(s):
            m = math.floor(s / 60)
            s -= m * 60
            return '%dm %ds' % (m, s)

        def timeSince(since, percent):
            now = time.time()
            s = now - since
            rs = s/percent - s
            return '%s (- %s)' % (asMinutes(s), asMinutes(rs))
        
        def computeLogProbs(batch) :
            embeddings = self.word2vec.embedding(batch[0].to(self.device))
            hiddens,_  = self.context(embeddings, lengths = batch[1].to(self.device)) # dim = (batch_size, input_length, hidden_dim)
            log_probs  = F.log_softmax(self.out(hiddens), dim = 2)                    # dim = (batch_size, input_length, lang_size)
            return log_probs

        def computeAccuracy(log_probs, targets) :
            total   = np.sum(targets.data.cpu().numpy() != self.ignore_index)
            success = sum([self.ignore_index != targets[i, j].item() == log_probs[i, :, j].data.topk(1)[1].item() \
                           for i in range(targets.size(0)) \
                           for j in range(targets.size(1)) ])
            return  success * 100 / total

        def printScores(start, iter, iters, tot_loss, tot_acc, print_every, compute_accuracy) :
            avg_loss = tot_loss / print_every
            avg_acc  = tot_acc / print_every
            if compute_accuracy : print(timeSince(start, iter / iters) + ' ({} {}%) loss : {:.3f}  accuracy : {:.1f} %'.format(iter, int(iter / iters * 100), avg_loss, avg_acc))
            else                : print(timeSince(start, iter / iters) + ' ({} {}%) loss : {:.3f}                     '.format(iter, int(iter / iters * 100), avg_loss))
            return 0, 0

        def trainLoop(batch, optimizer, unmasked_ratio, compute_accuracy = True):
            """Performs a training loop, with forward pass, backward pass and weight update."""
            torch.cuda.empty_cache()
            optimizer.zero_grad()
            self.zero_grad()
            log_probs  = computeLogProbs(batch[0]).transpose(1, 2) # dim = (batch_size, lang_size, input_length)
            targets_xs = batch[1][0].to(self.device)               # dim = (batch_size, input_length)
            targets_xl = batch[1][1].to(self.device)               # dim = (batch_size, input_length)
            loss       = (1-unmasked_ratio)*self.criterion(log_probs, targets_xs) \
                       + unmasked_ratio    *self.criterion(log_probs, targets_xl)
            loss.backward()
            optimizer.step() 
            if compute_accuracy == 'xs': 
                accuracy = computeAccuracy(log_probs, targets_xs)
                error = float(loss.item() / np.sum(targets_xs.data.cpu().numpy() != self.ignore_index))
            elif compute_accuracy == 'xl': 
                accuracy = computeAccuracy(log_probs, targets_xl)
                error = float(loss.item() / np.sum(targets_xl.data.cpu().numpy() != self.ignore_index))
            else : 
                accuracy = 0
                error = float(loss.item() / np.sum(targets_xs.data.cpu().numpy() != self.ignore_index))
            return error, accuracy
        
        # --- main ---
        self.train()
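        random.seed(random_state)    # random.choice is used when training by iterations
        np.random.seed(random_state) # np.random.shuffle is used when training by epochs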
        start = time.time()
        optimizer = self.optimizer([param for param in self.parameters() if param.requires_grad == True], lr = lr)
        tot_loss = 0  
        tot_acc  = 0
        if epochs is None :
            for iter in range(1, iters + 1):
                batch = random.choice(batches)
                loss, acc = trainLoop(batch, optimizer, unmasked_ratio, compute_accuracy)
                tot_loss += loss
                tot_acc += acc      
                if iter % print_every == 0 : 
                    tot_loss, tot_acc = printScores(start, iter, iters, tot_loss, tot_acc, print_every, compute_accuracy)
        else :
            iter = 0
            iters = len(batches) * epochs
            for epoch in range(1, epochs + 1):
                print('epoch ' + str(epoch))
                np.random.shuffle(batches)
                for batch in batches :
                    loss, acc = trainLoop(batch, optimizer,unmasked_ratio, compute_accuracy)
                    tot_loss += loss
                    tot_acc += acc 
                    iter += 1
                    if iter % print_every == 0 : 
                        tot_loss, tot_acc = printScores(start, iter, iters, tot_loss, tot_acc, print_every, compute_accuracy)
        return
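
The nested random draws in `maskInput` make the effective corruption rates less obvious than the thresholds suggest. The sketch below is a standalone check, with hypothetical names (`corrupt`, `UNK`, `ORIGINAL`, `vocab`) and a hypothetical vocabulary of 1000 indices; it replays the same two draws many times and counts the outcomes.

```python
import random

def corrupt(index, unk_index, vocab_indices, rng):
    """Same decision rule as maskInput for a selected position."""
    if rng.random() > 0.25:          # about 75 %: replace with UNK
        return unk_index
    elif rng.random() > 0.10:        # about 90 % of the remaining 25 %: random word
        return rng.choice(vocab_indices)
    else:                            # about 2.5 %: keep the original word
        return index

rng = random.Random(0)
UNK, ORIGINAL = 0, 1               # hypothetical special indices, outside `vocab`
vocab = list(range(2, 1002))       # hypothetical vocabulary indices
n = 100_000
counts = {"unk": 0, "random": 0, "kept": 0}
for _ in range(n):
    out = corrupt(ORIGINAL, UNK, vocab, rng)
    key = "unk" if out == UNK else "kept" if out == ORIGINAL else "random"
    counts[key] += 1
print({k: round(v / n, 3) for k, v in counts.items()})
```

The frequencies come out close to 0.75 / 0.225 / 0.025. For comparison, the BERT paper replaces 80 % of the selected tokens with the mask token, 10 % with a random token, and leaves 10 % unchanged.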

#### Training

In [33]:
denoiser = SentenceDenoiser(device = device, # torch.device('cpu'),
                            tokenizer = lambda s : s.split(' '),
                            word2vec = Word2VecConnector(word2vec),
                            hidden_dim = 100, 
                            n_layers = 3, 
                            dropout = 0.1,
                            optimizer = optim.AdamW)

denoiser.nbParametres()

3566019

In [37]:
batches = denoiser.generatePackedSentences(sentences_trn, 
                                            batch_size = 16,
                                            mask_ratio = 0.15,
                                            max_sentence_length = 50,
                                            tol = 10,
                                            seed = 42)
len(batches)

7788
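
The `max_sentence_length` and `tol` arguments control how long texts are cut before batching. The following standalone copy of the `splitLongs` logic (renamed `split_longs` here, outside the class) shows the effect on a few lengths: a remainder of at most `tol` words is kept in the last chunk rather than forming a tiny chunk of its own.

```python
def split_longs(words, threshold=50, tol=10):
    """Cut `words` into chunks of `threshold` items; the last chunk absorbs
    the remainder, so it can hold up to `threshold + tol` items."""
    chunks = []
    for i in range(0, len(words), threshold):
        if len(words) - i - threshold > tol:
            chunks.append(words[i:i + threshold])
        else:
            chunks.append(words[i:])
            break
    return chunks

for n in (40, 58, 61, 125):
    print(n, [len(c) for c in split_longs(list(range(n)))])
# 40 [40]
# 58 [58]
# 61 [50, 11]
# 125 [50, 50, 25]
```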

In [38]:
denoiser.fit(batches, epochs = 1, lr = 0.001, unmasked_ratio = 0.15, compute_accuracy = 'xs', print_every = 100)

epoch 1
0m 37s (- 48m 19s) (100 1%) loss : 14.124  accuracy : 6.2 %
1m 14s (- 47m 15s) (200 2%) loss : 13.149  accuracy : 7.9 %
1m 52s (- 46m 43s) (300 3%) loss : 12.990  accuracy : 8.2 %
2m 29s (- 46m 0s) (400 5%) loss : 12.660  accuracy : 8.9 %
3m 7s (- 45m 34s) (500 6%) loss : 12.240  accuracy : 9.3 %
3m 47s (- 45m 24s) (600 7%) loss : 11.850  accuracy : 10.4 %
4m 24s (- 44m 35s) (700 8%) loss : 11.381  accuracy : 12.0 %
5m 2s (- 44m 0s) (800 10%) loss : 10.954  accuracy : 12.8 %
5m 39s (- 43m 20s) (900 11%) loss : 10.667  accuracy : 13.7 %
6m 18s (- 42m 48s) (1000 12%) loss : 10.381  accuracy : 14.2 %
6m 57s (- 42m 18s) (1100 14%) loss : 10.125  accuracy : 14.8 %
7m 34s (- 41m 36s) (1200 15%) loss : 9.969  accuracy : 15.1 %
8m 12s (- 40m 55s) (1300 16%) loss : 9.767  accuracy : 15.1 %
8m 49s (- 40m 15s) (1400 17%) loss : 9.500  accuracy : 15.7 %
9m 27s (- 39m 39s) (1500 19%) loss : 9.215  accuracy : 16.2 %
10m 5s (- 39m 0s) (1600 20%) loss : 9.199  accuracy : 16.2 %
10m 43s (- 38m 

In [39]:
denoiser.fit(batches, iters = 200, lr = 0.001, unmasked_ratio = 0.15, compute_accuracy = 'xl', print_every = 100)

0m 51s (- 0m 51s) (100 50%) loss : 1.042  accuracy : 72.6 %
1m 45s (- 0m 0s) (200 100%) loss : 1.087  accuracy : 71.9 %
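
The loss printed by `fit` is the summed, mixed loss divided by the number of targets in the set selected by `compute_accuracy`. The `'xl'` set contains every known word while `'xs'` contains only the masked ones, so the same model yields very different printed values in the two modes. The sketch below reproduces that arithmetic with hypothetical per-token losses; `reported_loss` and all numbers are illustrative, not measurements.

```python
def reported_loss(nll_masked, nll_all, unmasked_ratio, mode):
    """nll_masked: per-token NLL at the masked positions ('xs' targets).
    nll_all: per-token NLL at all known-word positions ('xl' targets)."""
    loss = (1 - unmasked_ratio) * sum(nll_masked) + unmasked_ratio * sum(nll_all)
    n_targets = len(nll_masked) if mode == 'xs' else len(nll_all)
    return loss / n_targets

# hypothetical sentence: 20 known words, 4 of them masked
nll_masked = [6.0] * 4               # masked words are hard to recover
nll_all = [6.0] * 4 + [0.2] * 16     # unmasked words can mostly be copied
for mode in ('xs', 'xl'):
    print(mode, round(reported_loss(nll_masked, nll_all, 0.15, mode), 3))
```

In this toy case the printed value is about five times larger in `'xs'` mode than in `'xl'` mode, although the model is the same. The two sets of logs above should therefore not be compared directly.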


In [40]:
batches2 = denoiser.generatePackedSentences(sentences_trn, 
                                            batch_size = 16,
                                            mask_ratio = 0.15,
                                            max_sentence_length = 50,
                                            tol = 10,
                                            seed = 4242)
len(batches2)

7788
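
Inside `generatePackedSentences`, each batch is padded to a rectangle with `itertools.zip_longest`, which transposes the batch and fills the gaps in one pass. The sketch below isolates that step in plain Python; `pad_batch` is a hypothetical helper, and the notebook performs the final transposition on a tensor instead of with `zip`.

```python
import itertools

def pad_batch(sequences, pad_index):
    """Pad variable-length index lists to a rectangular batch.

    zip_longest yields one tuple per position (a transposed batch) and fills
    missing entries; transposing back gives rows of equal length.
    """
    by_position = itertools.zip_longest(*sequences, fillvalue=pad_index)
    padded = [list(row) for row in zip(*by_position)]
    lengths = [len(s) for s in sequences]
    return padded, lengths

padded, lengths = pad_batch([[5, 8, 2, 9], [7, 3], [4]], pad_index=0)
print(padded)    # [[5, 8, 2, 9], [7, 3, 0, 0], [4, 0, 0, 0]]
print(lengths)   # [4, 2, 1]
```

The true lengths are kept alongside the padded batch so that the recurrent encoder can ignore the padded positions.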

In [41]:
denoiser.fit(batches2, epochs = 1, lr = 0.00025, unmasked_ratio = 0.15, compute_accuracy = 'xs', print_every = 100)

epoch 1
0m 36s (- 46m 49s) (100 1%) loss : 6.699  accuracy : 22.1 %
1m 13s (- 46m 28s) (200 2%) loss : 6.759  accuracy : 22.0 %
1m 50s (- 46m 8s) (300 3%) loss : 6.734  accuracy : 22.6 %
2m 27s (- 45m 30s) (400 5%) loss : 6.695  accuracy : 22.6 %
3m 5s (- 45m 7s) (500 6%) loss : 6.680  accuracy : 21.9 %
3m 45s (- 45m 3s) (600 7%) loss : 6.553  accuracy : 23.5 %
4m 22s (- 44m 17s) (700 8%) loss : 6.708  accuracy : 22.1 %
5m 0s (- 43m 44s) (800 10%) loss : 6.604  accuracy : 22.8 %
5m 37s (- 43m 4s) (900 11%) loss : 6.658  accuracy : 22.2 %
6m 16s (- 42m 32s) (1000 12%) loss : 6.557  accuracy : 22.7 %
6m 54s (- 42m 3s) (1100 14%) loss : 6.614  accuracy : 22.3 %
7m 32s (- 41m 21s) (1200 15%) loss : 6.669  accuracy : 22.3 %
8m 9s (- 40m 40s) (1300 16%) loss : 6.553  accuracy : 22.5 %
8m 46s (- 40m 0s) (1400 17%) loss : 6.582  accuracy : 23.0 %
9m 24s (- 39m 25s) (1500 19%) loss : 6.473  accuracy : 22.6 %
10m 1s (- 38m 46s) (1600 20%) loss : 6.553  accuracy : 22.2 %
10m 39s (- 38m 10s) (1700

KeyboardInterrupt: 

In [42]:
denoiser.fit(batches2, iters = 200, lr = 0.00025, unmasked_ratio = 0.15, compute_accuracy = 'xl', print_every = 100)

0m 53s (- 0m 53s) (100 50%) loss : 1.040  accuracy : 75.0 %
1m 47s (- 0m 0s) (200 100%) loss : 1.041  accuracy : 74.6 %


In [43]:
# save
#torch.save(denoiser.state_dict(), path_to_DL4NLP + '\\saves\\DL4NLP_I4b_sentence_denoiser.pth')

# load
#denoiser.load_state_dict(torch.load(path_to_DL4NLP + '\\saves\\DL4NLP_I4b_sentence_denoiser.pth'))

#### Evaluation

In [47]:
denoiser.eval()
sentence = ' '.join(sentences_tst[25]) #'what are you thinking of this'
print(sentence)
print('\n')
denoiser(sentence, color = '\033[93m')

news sluggish movement on power grid cyber security . industry cyber security standards fail to reach some of the most vulnerable components of the power grid.


news sluggish [93mgrowth[0m on power [93mcomputing[0m cyber security . industry cyber security standards fail to reach some of the most vulnerable components of the power [93m.[0m


#### Word correction

In [49]:
def correctCorpus(word2vec, corpus, threshold = 0.9) :
    new_sentences = []
    corrections = []
    for s in corpus :
        new_s = []
        for word in s :
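            try : # FastText raises an error if no character n-gram seen during training appears in the word
                # query the nearest neighbour only once, and only for out-of-vocabulary words
                best = word2vec.most_similar(word)[0] if word not in word2vec.wv.vocab else None # (word, similarity)
                if best is not None and best[1] >= threshold :
                    new_word = best[0]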
                    new_s.append(new_word)
                    corrections.append([word, new_word])
                else :
                    new_s.append(word)
            except : 
                new_s.append(word)
        new_sentences.append(new_s)
    return new_sentences, corrections


def contextualCorrectCorpus(denoiser, corpus, threshold = 0.9, print_every = 100) :
    word2vec = denoiser.word2vec.word2vec
    new_sentences = []
    corrections = []
    corpus = [s for s in corpus if len(s) > 0]
    for ws in corpus : 
        probs = denoiser.predict_proba(ws).squeeze(0) # size = (input_length, lang_size)
        new_s = []
        for i, word in enumerate(ws) :
            try : #fasttext raises an error if no character ngram seen during training appears in the word
                candidates = [wp[0] for wp in word2vec.most_similar(word) if wp[1] >= threshold]
                if (word not in word2vec.wv.vocab and candidates != []) :
                    indices  = [denoiser.word2vec.lang.getIndex(w) for w in candidates]
                    probsi   = [probs[i, j].item() for j in indices]
                    wps      = [[w, p] for w, p in zip(candidates, probsi)]
                    wps.sort(key = lambda wp : wp[1], reverse = True)
                    new_word = wps[0][0]
                    new_s.append(new_word)
                    corrections.append([word, new_word])
                else :
                    new_s.append(word)
            except : 
                new_s.append(word)
        new_sentences.append(new_s)
        if len(new_sentences) % print_every == 0 : print(len(new_sentences))
    return new_sentences, corrections
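
The difference between the two correction functions lies in how the replacement is chosen among the similar words. The sketch below isolates that choice under stated assumptions: `pick_correction` is a hypothetical helper, and the similarity scores and context probabilities are made-up values for a misspelled word, not model outputs.

```python
def pick_correction(candidates, context_probs, threshold=0.9):
    """candidates: list of (word, similarity) pairs, most similar first.
    context_probs: dict mapping a word to its probability at this position.
    Returns the candidate that fits the context best, or None if no
    candidate reaches the similarity threshold."""
    kept = [w for w, sim in candidates if sim >= threshold]
    if not kept:
        return None
    return max(kept, key=lambda w: context_probs.get(w, 0.0))

# hypothetical values for the misspelled word "goverment"
candidates = [("governments", 0.93), ("government", 0.91), ("govern", 0.70)]
context_probs = {"governments": 0.02, "government": 0.31, "govern": 0.05}
print(pick_correction(candidates, context_probs, threshold=0.85))  # government
```

A purely similarity-based correction would pick `governments`, the top neighbour; the contextual variant prefers `government` because the denoiser assigns it a higher probability at that position.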

In [62]:
corrected_corpus, corrections = correctCorpus(word2vec, sentences_tst, threshold = 0.85)

In [52]:
contextual_corrected_corpus, contextual_corrections = contextualCorrectCorpus(denoiser, sentences_tst[:2500], threshold = 0.85)

In [None]:
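# the two lists are assumed to be aligned (same words corrected in the same order)
for i in range(min(500, len(corrections), len(contextual_corrections))) : 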
    word  = corrections[i][0]
    pred1 = corrections[i][1]
    pred2 = contextual_corrections[i][1]
    if pred2 != pred1 : pred2 = '\033[93m' + pred2 + '\033[0m'
    print(word, '->', pred1, ' | ', pred2)