<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 30px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
  Deep Learning for NLP
  </div> 
  
<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 30px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
  Part I - 4 <br><br><br>
  Sequence Labelling
  </div> 

  <div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 20px; 
      text-align: center; 
      padding: 15px;">
  </div> 

  <div style=" float:right; 
      font-size: 12px; 
      line-height: 12px; 
  padding: 10px 15px 8px;">
  Jean-baptiste AUJOGUE
  </div> 

### Part I

1. Word Embedding

2. Sentence Classification

3. Language Modeling

4. <font color=red>**Sequence Labelling**</font>


### Part II

5. Auto-Encoding

6. Machine Translation

7. Text Classification




### Part III

8. Abstractive Summarization

9. Question Answering

10. Chatbot


</div>

***

<a id="plan"></a>

| | | | |
|------|------|------|------|
| **Content** | [Corpus](#corpus) | [Modules](#modules) | [Model](#model) | 


# Overview

We treat **Sentence Denoising** as a sequence labelling task: the goal is to transform a noisy sequence of words into a well-formed sentence.<br> Training follows a denoising objective known as the _Cloze task_, which is used:

- For the BERT model in [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805.pdf)
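
As an illustration of the Cloze objective, here is a minimal sketch of the masking scheme described in the BERT paper: about 15% of the positions are selected, and a selected token is replaced by a mask symbol 80% of the time, by a random word 10% of the time, and left unchanged otherwise. The names `cloze_mask` and `MASK` are hypothetical and chosen for this example only; the model trained in this notebook uses its own ratios.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, used for illustration only

def cloze_mask(tokens, vocab, mask_ratio=0.15, rng=None):
    """Return (noisy_tokens, targets) for a Cloze-style objective.

    targets[i] holds the original token at each selected position, None elsewhere.
    """
    rng = rng or random.Random()
    n_selected = max(1, int(mask_ratio * len(tokens)))
    selected = set(rng.sample(range(len(tokens)), n_selected))
    noisy, targets = [], []
    for i, tok in enumerate(tokens):
        if i not in selected:
            noisy.append(tok)        # untouched position, nothing to predict
            targets.append(None)
            continue
        targets.append(tok)          # the model must recover this token
        draw = rng.random()
        if draw < 0.8:
            noisy.append(MASK)               # 80%: mask symbol
        elif draw < 0.9:
            noisy.append(rng.choice(vocab))  # 10%: random word
        else:
            noisy.append(tok)                # 10%: token left unchanged
    return noisy, targets

tokens = "the patient received a single oral dose".split()
noisy, targets = cloze_mask(tokens, vocab=tokens, rng=random.Random(0))
print(noisy)
print(targets)
```

Only the selected positions carry a target, so the loss is computed on those positions alone.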

# Packages

In [1]:
from __future__ import unicode_literals, print_function, division  # __future__ imports must come first
import sys
import warnings
import os
from io import open
import unicodedata
import string
import time
import math
import re
import random
import pickle
import copy
from unidecode import unidecode
import itertools
import matplotlib
import matplotlib.pyplot as plt


# for special math operation
from sklearn.preprocessing import normalize


# for manipulating data 
import numpy as np
#np.set_printoptions(threshold=np.nan)
import pandas as pd
import bcolz # see https://bcolz.readthedocs.io/en/latest/intro.html
import pickle


# for text processing
import gensim
from gensim.models import KeyedVectors
#import spacy
import nltk
#nltk.download()
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem.porter import PorterStemmer


# for deep learning
import torch
import torch.nn as nn
from torch.autograd import Variable
from torch import optim
import torch.nn.functional as F
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
#device = torch.device("cpu")

warnings.filterwarnings("ignore")
print('python version :', sys.version)
print('pytorch version :', torch.__version__)
print('DL device :', device)



python version : 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
pytorch version : 1.3.1
DL device : cuda


In [2]:
path_to_NLP = 'C:\\Users\\Jb\\Desktop\\NLP'

In [3]:
sys.path.append(path_to_NLP + '\\libDL4NLP')

<a id="corpus"></a>

# Corpus

[Back to top](#plan)

The text is imported as a list in which each element is one text, itself given as a list of words.<br> Once imported, the corpus therefore has the form:<br>

- corpus = [text]<br>
- text   = [word]<br>
- word   = str<br>
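
To make this nested structure concrete, here is a small sketch that builds a toy corpus from raw strings. It only mimics the lowercasing, accent stripping and whitespace splitting performed by the cleaning code; the helper names `strip_accents` and `to_words` and the two raw texts are made up for this example.

```python
import unicodedata

def strip_accents(s):
    """Drop combining marks after NFD decomposition (same idea as unicodeToAscii)."""
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

def to_words(raw_text):
    """Lowercase, strip accents and split on whitespace."""
    return strip_accents(raw_text.lower().strip()).split()

raw_texts = ["Prise orale répétée .", "Store below 25 mg ."]  # made-up examples
toy_corpus = [to_words(t) for t in raw_texts]

print(toy_corpus)
# corpus = [text], text = [word], word = str
print(type(toy_corpus), type(toy_corpus[0]), type(toy_corpus[0][0]))
```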

In [4]:
def cleanSentence(sentence): # -------------------------  str
    sw = ['']
    #sw += nltk.corpus.stopwords.words('english')
    #sw += nltk.corpus.stopwords.words('french')

    def unicodeToAscii(s):
        """Turn a Unicode string to plain ASCII, thanks to http://stackoverflow.com/a/518232/2809427"""
        return ''.join( c for c in unicodedata.normalize('NFD', s)
                        if unicodedata.category(c) != 'Mn')

    def normalizeString(s):
        '''Remove rare symbols from a string'''
        s = unicodeToAscii(s.lower().strip()) # 
        #s = re.sub(r"[^a-zA-Z\.\(\)\[\]]+", r" ", s)  # 'r' before a string is for 'raw' # ?&\%\_\- removed # set('''.,:;()*#&-_%!?/\'")''')
        return s

    def wordTokenizerFunction():
        # base version
        function = lambda sentence : sentence.strip().split()

        # nltk version
        #function = word_tokenize    
        return function

    # 1 - special characters
    def clean_sentence_punct(text): # --------------  str
        text = normalizeString(text)
        # remove the trailing punctuation mark
        if (len(text) > 0 and text[-1] in ['.', ',', ';', ':', '!', '?']) : text = text[:-1]

        text = text.replace(r'(', r' ( ')
        text = text.replace(r')', r' ) ')
        text = text.replace(r'[', r' [ ')
        text = text.replace(r']', r' ] ')
        text = text.replace(r'<', r' < ')
        text = text.replace(r'>', r' > ')

        text = text.replace(r':', r' : ')
        text = text.replace(r';', r' ; ')
        # protect decimal separators between digits before isolating ',' and '.'
        # (raw strings avoid invalid-escape warnings for '\.' and '\g')
        for i in range(5) :
            text = re.sub(r'(?P<val1>[0-9])\.(?P<val2>[0-9])', r'\g<val1>__-__\g<val2>', text)
            text = re.sub(r'(?P<val1>[0-9]),(?P<val2>[0-9])', r'\g<val1>__-__\g<val2>', text)
        text = text.replace(r',', ' , ')
        text = text.replace(r'.', ' . ')
        for i in range(5) : text = re.sub(r'(?P<val1>[p0-9])__-__(?P<val2>[p0-9])', r'\g<val1>.\g<val2>', text)
        text = re.sub(r'(?P<val1>[0-9]) \. p \. (?P<val2>[0-9])', r'\g<val1>.p.\g<val2>', text)
        text = re.sub(r'(?P<val1>[0-9]) \. s \. (?P<val2>[0-9])', r'\g<val1>.s.\g<val2>', text)

        text = text.replace(r'"', r' " ')
        text = text.replace(r'’', r" ' ")
        text = text.replace(r'”', r' " ')
        text = text.replace(r'“', r' " ')
        text = text.replace(r'/', r' / ')

        text = re.sub('(…)+', ' … ', text)
        text = text.replace('≤', ' ≤ ')          
        text = text.replace('≥', ' ≥ ')
        text = text.replace('°c', ' °c ')
        text = text.replace('°C', ' °c ')
        text = text.replace('ºc', ' °c ')
        text = text.replace('n°', 'n° ')
        text = text.replace('%', ' % ')
        text = text.replace('*', ' * ')
        text = text.replace('+', ' + ')
        text = text.replace('-', ' - ')
        text = text.replace('_', ' ')
        text = text.replace('®', ' ')
        text = text.replace('™', ' ')
        text = text.replace('±', ' ± ')
        text = text.replace('÷', ' ÷ ')
        text = text.replace('–', ' - ')
        text = text.replace('μg', ' µg')
        text = text.replace('µg', ' µg')
        text = text.replace('µl', ' µl')
        text = text.replace('μl', ' µl')
        text = text.replace('µm', ' µm')
        text = text.replace('μm', ' µm')
        text = text.replace('ppm', ' ppm')
        text = re.sub(r'(?P<val1>[0-9])mm', r'\g<val1> mm', text)
        text = re.sub(r'(?P<val1>[0-9])g', r'\g<val1> g', text)
        text = text.replace('nm', ' nm')

        text = re.sub(r'fa(?P<val1>[0-9])', r'fa \g<val1>', text)
        text = re.sub(r'g(?P<val1>[0-9])', r'g \g<val1>', text)
        text = re.sub(r'n(?P<val1>[0-9])', r'n \g<val1>', text)
        text = re.sub(r'p(?P<val1>[0-9])', r'p \g<val1>', text)
        text = re.sub(r'q_(?P<val1>[0-9])', r'q_ \g<val1>', text)
        text = re.sub(r'u(?P<val1>[0-9])', r'u \g<val1>', text)
        text = re.sub(r'ud(?P<val1>[0-9])', r'ud \g<val1>', text)
        text = re.sub(r'ui(?P<val1>[0-9])', r'ui \g<val1>', text)

        text = text.replace('=', ' ')
        text = text.replace('!', ' ')
        text = text.replace('-', ' ')
        text = text.replace(r' , ', ' ')
        text = text.replace(r' . ', ' ')

        text = re.sub(r'(?P<val>[0-9])ml', r'\g<val> ml', text)
        text = re.sub(r'(?P<val>[0-9])mg', r'\g<val> mg', text)

        for i in range(5) : text = re.sub(r'( [0-9]+ )', ' ', text)
        #text = re.sub('cochran(\S)*', 'cochran ', text)
        return text

    # 3 - split into words
    def wordSplit(sentence, tokenizeur): # ------------- [str]
        return tokenizeur(sentence)

    # 4 - stopword removal (lowercasing is already done in normalizeString)
    def stopwordsRemoval(sentence, sw): # ------------- [[str]]
        return [word for word in sentence if word not in sw]

    # 6 - spelling correction
    def correction(text):
        def correct(word):
            return spelling.suggest(word)[0]
        list_of_list_of_words = [[correct(word) for word in sentence] for sentence in text]
        return list_of_list_of_words

    # 7 - stemming
    def stemming(text): # ------------------------- [[str]]
        list_of_list_of_words = [[PorterStemmer().stem(word) for word in sentence if word not in sw] for sentence in text]
        return list_of_list_of_words


    tokenizeur = wordTokenizerFunction()
    sentence = clean_sentence_punct(str(sentence))
    sentence = wordSplit(sentence, tokenizeur)
    sentence = stopwordsRemoval(sentence, sw)
    #text = correction(text)
    #text = stemming(text)
    return sentence


def importWords(file_name) :
    def cleanDatabase(db):
        words = ['.']
        title = ''
        for pair in db :
            #print(pair)
            current_tile = pair[0].split(' | ')[-1]
            if current_tile != title :
                words += cleanSentence(current_tile) + ['.']
                title  = current_tile
            words += cleanSentence(str(pair[1]).split(' | ')[-1]) + ['.']
        return words

    # read_excel takes no 'sep' argument, and DataFrame.ix was removed from pandas: use iloc
    df = pd.read_excel(file_name, header = None)
    headers = [i for i, titre in enumerate(df.iloc[0, :].values) if i in [1, 2] or titre == 'score manuel'] 
    db = df.iloc[1:, headers].values.tolist()
    db = [el[:2] for el in db if el[-1] in [0,1, 10]]
    words = cleanDatabase(db)
    return words


def importAllWords(path_to_data) :
    corpus = []
    reps = os.listdir(path_to_data)
    for rep in reps :
        files = os.listdir(path_to_data + '\\' + rep)
        for file in files :
            file_name = path_to_data + '\\' + rep + '\\' + file
            corpus.append(importWords(file_name))
    return corpus



def importSentences(file_name) :
    def cleanDatabase(db):
        sentences = []
        title = ''
        for pair in db :
            current_tile = pair[0].split(' | ')[-1]
            if current_tile != title :
                sentences.append(cleanSentence(current_tile))
                title = current_tile
            sentences.append(cleanSentence(str(pair[1]).split(' | ')[-1]))
        return sentences

    # read_excel takes no 'sep' argument, and DataFrame.ix was removed from pandas: use iloc
    df = pd.read_excel(file_name, header = None)
    headers = [i for i, titre in enumerate(df.iloc[0, :].values) if i in [1, 2] or titre == 'score manuel'] 
    db = df.iloc[1:, headers].values.tolist()
    db = [el[:2] for el in db if el[-1] in [0,1, 10]]
    sentences = cleanDatabase(db)
    return sentences


def importAllSentences(path_to_data) :
    corpus = []
    reps = os.listdir(path_to_data)
    for rep in reps :
        files = os.listdir(path_to_data + '\\' + rep)
        for file in files :
            file_name = path_to_data + '\\' + rep + '\\' + file
            corpus += importSentences(file_name)
    return corpus

In [5]:
corpus = importAllWords(path_to_NLP + '\\data\\AMM')
len(corpus)

510

In [6]:
sentences = importAllSentences(path_to_NLP + '\\data\\AMM')
len(sentences)

31574

<a id="modules"></a>

# 1 Modules

### 1.1 Word Embedding module

[Back to top](#plan)

Word Embedding modules and their pre-training are covered in detail in **Part I - 1**. Here we use a FastText model trained with the Skip-Gram objective.
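
As a reminder of what distinguishes FastText from a plain Word2Vec model, the sketch below enumerates the character n-grams of a word wrapped in the boundary markers `<` and `>`; FastText builds a word vector from the vectors of such n-grams. The function `char_ngrams` is a hypothetical helper written for this example (the library additionally hashes n-grams into buckets, which is omitted here), and the default range 3 to 6 is the one FastText uses by default.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word wrapped in '<' and '>'."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams("dose", min_n=3, max_n=4))
# ['<do', 'dos', 'ose', 'se>', '<dos', 'dose', 'ose>']
```

Because rare or unseen words share n-grams with known words, they still receive a meaningful vector, which is convenient with `min_count = 1` on a small corpus.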

In [7]:
from libDL4NLP.models.Word_Embedding import Word2Vec as myWord2Vec
from libDL4NLP.models.Word_Embedding import Word2VecConnector
from libDL4NLP.utils.Lang import Lang

In [8]:
from gensim.models.fasttext import FastText
from gensim.test.utils import datapath, get_tmpfile

In [9]:
fastText_word2vec = FastText(size = 75, 
                             window = 5, 
                             min_count = 1, 
                             negative = 20,
                             sg = 1)

In [10]:
fastText_word2vec.build_vocab(corpus)

In [11]:
len(fastText_word2vec.wv.vocab)

8086

In [12]:
fastText_word2vec.train(sentences = corpus, 
                        epochs = 50,
                        total_examples = fastText_word2vec.corpus_count)

In [13]:
word2vec = Word2VecConnector(fastText_word2vec)

### 1.2 Contextualization module

[Back to top](#plan)


In [14]:
from libDL4NLP.modules import RecurrentEncoder

<a id="model"></a>

# 2 Sentence denoising Model

[Back to top](#plan)


In [15]:
class SentenceDenoiser(nn.Module) :
    def __init__(self, device, tokenizer, word2vec, 
                 hidden_dim = 100, 
                 n_layers = 1, 
                 dropout = 0, 
                 class_weights = None, 
                 optimizer = optim.SGD
                 ):
        super(SentenceDenoiser, self).__init__()
        
        # embedding
        self.tokenizer = tokenizer
        self.word2vec  = word2vec
        self.context   = RecurrentEncoder(self.word2vec.output_dim, hidden_dim, n_layers, dropout, bidirectional = True)
        self.out       = nn.Linear(self.context.output_dim, self.word2vec.lang.n_words)
        self.act       = F.softmax
        
        # optimizer
        self.ignore_index = self.word2vec.lang.getIndex('PADDING_WORD')
        self.criterion = nn.NLLLoss(reduction = 'sum',  # replaces the deprecated size_average = False
                                    ignore_index = self.ignore_index, 
                                    weight = class_weights)
        self.optimizer = optimizer
        
        # load to device
        self.device = device
        self.to(device)
        
    def nbParametres(self) :
        return sum([p.data.nelement() for p in self.parameters() if p.requires_grad == True])
    
    def forward(self, sentence = '.', hidden = None, limit = 10, color_code = '\033[94m'):
        words  = self.tokenizer(sentence)
        result = words + [color_code]
        hidden, count, stop = None, 0, False
        while not stop :
            # compute probs
            embeddings = self.word2vec(words, self.device)
            _, hidden  = self.context(embeddings, lengths = None, hidden = hidden) # WARNING : dim = (n_layers, batch_size, hidden_dim)
            probs      = self.act(self.out(hidden[-1, :, :]), dim = 1).view(-1)
            # get predicted word
            topv, topi = probs.data.topk(1)
            words = [self.word2vec.lang.index2word[topi.item()]]
            result += words
            # stopping criterion
            count += 1
            if count == limit or words == [limit] or count == 50 : stop = True
        print(' '.join(result + ['\033[0m']))
        return
    
    def generatePackedSentences(self, 
                                sentences, 
                                batch_size = 32, 
                                mask_ratio = 0.15,
                                predict_masked_only = True,
                                seed = 42) :
        def maskInput(index, b) :
            if   b and random.random() > 0.25 : return self.ignore_index
            elif b and random.random() > 0.10 : return random.choice(list(self.word2vec.twin.lang.word2index.values()))
            else                              : return index
            
        def maskOutput(index, b) :
            return index if b else self.ignore_index
            
        random.seed(seed)
        sentences.sort(key = lambda s: len(s), reverse = True)
        packed_data = []
        for i in range(0, len(sentences), batch_size) :
            # prepare pack
            pack = sentences[i:i + batch_size]
            pack = [[self.word2vec.lang.getIndex(w) for w in s] for s in pack]
            pack = [[w for w in words if w is not None] for words in pack]
            mask = [random.sample([_ for _ in range(len(p))], k = int(mask_ratio*len(p) +1)) for p in pack]
            # split into input and target pack
            pack0 = [[ maskInput(s[i], i in m) for i in range(len(s))] for s, m in zip(pack, mask)]
            pack1 = [[maskOutput(s[i], i in m) for i in range(len(s))] for s, m in zip(pack, mask)] if predict_masked_only else pack
            lengths = torch.tensor([len(p) for p in pack0]) # size = (batch_size) 
            pack0 = list(itertools.zip_longest(*pack0, fillvalue = self.ignore_index))
            pack0 = Variable(torch.LongTensor(pack0).transpose(0, 1))   # size = (batch_size, max_length) 
            pack1 = list(itertools.zip_longest(*pack1, fillvalue = self.ignore_index))
            pack1 = Variable(torch.LongTensor(pack1).transpose(0, 1))   # size = (batch_size, max_length) 
            packed_data.append([[pack0, lengths], pack1])
        return packed_data
    
    def fit(self, batches, iters = None, epochs = None, lr = 0.025, random_state = 42,
              print_every = 10, compute_accuracy = True):
        """Performs training over a given dataset and along a specified amount of loops"""
        def asMinutes(s):
            m = math.floor(s / 60)
            s -= m * 60
            return '%dm %ds' % (m, s)

        def timeSince(since, percent):
            now = time.time()
            s = now - since
            rs = s/percent - s
            return '%s (- %s)' % (asMinutes(s), asMinutes(rs))
        
        def computeLogProbs(batch) :
            embeddings = self.word2vec.embedding(batch[0].to(self.device))
            hiddens,_  = self.context(embeddings, lengths = batch[1].to(self.device)) # dim = (batch_size, input_length, hidden_dim)
            log_probs  = F.log_softmax(self.out(hiddens), dim = 2)                    # dim = (batch_size, input_length, lang_size)
            return log_probs

        def computeAccuracy(log_probs, targets) :
            total   = np.sum(targets.data.cpu().numpy() != self.ignore_index)
            success = sum([self.ignore_index != targets[i, j].item() == log_probs[i, :, j].data.topk(1)[1].item() \
                           for i in range(targets.size(0)) \
                           for j in range(targets.size(1)) ])
            return  success * 100 / total

        def printScores(start, iter, iters, tot_loss, tot_loss_words, print_every, compute_accuracy) :
            avg_loss = tot_loss / print_every
            avg_loss_words = tot_loss_words / print_every
            if compute_accuracy : print(timeSince(start, iter / iters) + ' ({} {}%) loss : {:.3f}  accuracy : {:.1f} %'.format(iter, int(iter / iters * 100), avg_loss, avg_loss_words))
            else                : print(timeSince(start, iter / iters) + ' ({} {}%) loss : {:.3f}                     '.format(iter, int(iter / iters * 100), avg_loss))
            return 0, 0

        def trainLoop(batch, optimizer, compute_accuracy = True):
            """Performs a training loop, with forward pass, backward pass and weight update."""
            optimizer.zero_grad()
            self.zero_grad()
            log_probs = computeLogProbs(batch[0]).transpose(1, 2) # dim = (batch_size, lang_size, input_length)
            targets   = batch[1].to(self.device)                  # dim = (batch_size, input_length)
            loss      = self.criterion(log_probs, targets)
            loss.backward()
            optimizer.step() 
            accuracy = computeAccuracy(log_probs, targets) if compute_accuracy else 0
            return float(loss.item() / np.sum(targets.data.cpu().numpy() != self.ignore_index)), accuracy
        
        # --- main ---
        self.train()
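        np.random.seed(random_state)   # used by np.random.shuffle in the epoch-based loop
        random.seed(random_state)      # used by random.choice in the iteration-based loop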
        start = time.time()
        optimizer = self.optimizer([param for param in self.parameters() if param.requires_grad == True], lr = lr)
        tot_loss = 0  
        tot_acc  = 0
        if epochs is None :
            for iter in range(1, iters + 1):
                batch = random.choice(batches)
                loss, acc = trainLoop(batch, optimizer, compute_accuracy)
                tot_loss += loss
                tot_acc += acc      
                if iter % print_every == 0 : 
                    tot_loss, tot_acc = printScores(start, iter, iters, tot_loss, tot_acc, print_every, compute_accuracy)
        else :
            iter = 0
            iters = len(batches) * epochs
            for epoch in range(1, epochs + 1):
                print('epoch ' + str(epoch))
                np.random.shuffle(batches)
                for batch in batches :
                    loss, acc = trainLoop(batch, optimizer, compute_accuracy)
                    tot_loss += loss
                    tot_acc += acc 
                    iter += 1
                    if iter % print_every == 0 : 
                        tot_loss, tot_acc = printScores(start, iter, iters, tot_loss, tot_acc, print_every, compute_accuracy)
        return
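
The way `generatePackedSentences` builds its input and target batches can be sketched in plain Python. This is a simplified illustration, not the method above: a selected position is always replaced by the padding index, whereas `maskInput` does so with probability 0.75, substitutes a random word with probability 0.25 × 0.90 = 0.225, and keeps the original index otherwise (0.025). The names `PAD`, `build_pair` and `pad_batch` are hypothetical.

```python
from itertools import zip_longest

PAD = 0  # hypothetical padding index, standing in for the index of 'PADDING_WORD'

def build_pair(indices, masked_positions):
    """Mask the input at selected positions; keep targets only at those positions."""
    inputs  = [PAD if i in masked_positions else w for i, w in enumerate(indices)]
    targets = [w if i in masked_positions else PAD for i, w in enumerate(indices)]
    return inputs, targets

def pad_batch(rows):
    """Right-pad rows with PAD to a common length."""
    columns = zip_longest(*rows, fillvalue=PAD)  # pads, but yields (max_length, batch_size)
    return [list(r) for r in zip(*columns)]      # transpose back to (batch_size, max_length)

sentences = [[5, 8, 3, 9], [7, 2]]  # word indices, longest sentence first
masks = [{1}, {0}]                  # positions selected for prediction
pairs = [build_pair(s, m) for s, m in zip(sentences, masks)]
inputs = pad_batch([p[0] for p in pairs])
targets = pad_batch([p[1] for p in pairs])
print(inputs)   # [[5, 0, 3, 9], [0, 2, 0, 0]]
print(targets)  # [[0, 8, 0, 0], [7, 0, 0, 0]]
```

Since the loss is created with `ignore_index` set to the padding index, both padded positions and unselected positions contribute nothing to the loss.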

#### Training

In [16]:
denoiser = SentenceDenoiser(device,
                            tokenizer = lambda s : s.split(' '),
                            word2vec = word2vec,
                            hidden_dim = 75, 
                            n_layers = 3, 
                            dropout = 0.1,
                            optimizer = optim.SGD)

denoiser.nbParametres()

1493988

In [17]:
denoiser

SentenceDenoiser(
  (word2vec): Word2VecConnector(
    (twin): Word2Vec(
      (embedding): Embedding(8088, 75)
    )
    (embedding): Embedding(8088, 75)
  )
  (context): RecurrentEncoder(
    (dropout): Dropout(p=0.1, inplace=False)
    (bigru): GRU(75, 75, num_layers=3, batch_first=True, dropout=0.1, bidirectional=True)
  )
  (out): Linear(in_features=150, out_features=8088, bias=True)
  (criterion): NLLLoss()
)

In [18]:
torch.cuda.empty_cache()

In [19]:
batches = []
for seed in [42, 854, 3, 7956, 881125, 76, 5721, 1499, 8752, 374, 14758, 23, 9543, 856, 75] :
    batches += denoiser.generatePackedSentences(sentences, 
                                                batch_size = 16,
                                                mask_ratio = 0.15,
                                                predict_masked_only = True,
                                                seed = seed)
len(batches)

29610
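
The batch count can be checked by hand: each seed produces one pass over the 31574 sentences in batches of 16, the last batch of a pass being incomplete. A short sketch of that arithmetic, using only the figures reported in this notebook:

```python
import math

n_sentences = 31574  # len(sentences) reported earlier
batch_size = 16
n_seeds = 15         # number of seeds in the loop above

batches_per_pass = math.ceil(n_sentences / batch_size)
total_batches = n_seeds * batches_per_pass
print(batches_per_pass, total_batches)  # 1974 29610
```

Each pass draws a different set of masked positions, so the 15 passes act as 15 differently corrupted views of the same sentences.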

In [20]:
denoiser.fit(batches, epochs = 1, lr = 0.01,   print_every = 100)
denoiser.fit(batches, epochs = 1, lr = 0.0025, print_every = 100)
denoiser.fit(batches, epochs = 1, lr = 0.0005, print_every = 100)

epoch 1
0m 8s (- 41m 32s) (100 0%) loss : 7.736  accuracy : 4.2 %
0m 15s (- 39m 5s) (200 0%) loss : 7.512  accuracy : 3.5 %
0m 24s (- 39m 25s) (300 1%) loss : 6.721  accuracy : 5.3 %
0m 32s (- 40m 0s) (400 1%) loss : 6.660  accuracy : 6.1 %
0m 40s (- 38m 56s) (500 1%) loss : 6.481  accuracy : 6.6 %
0m 47s (- 37m 58s) (600 2%) loss : 6.425  accuracy : 7.4 %
0m 55s (- 37m 57s) (700 2%) loss : 6.346  accuracy : 7.0 %
1m 1s (- 37m 11s) (800 2%) loss : 6.160  accuracy : 8.3 %
1m 9s (- 37m 0s) (900 3%) loss : 5.973  accuracy : 9.8 %
1m 16s (- 36m 39s) (1000 3%) loss : 5.954  accuracy : 11.1 %
1m 24s (- 36m 20s) (1100 3%) loss : 5.802  accuracy : 11.8 %
1m 32s (- 36m 34s) (1200 4%) loss : 5.753  accuracy : 10.7 %
1m 40s (- 36m 30s) (1300 4%) loss : 5.732  accuracy : 13.8 %
1m 49s (- 36m 45s) (1400 4%) loss : 5.543  accuracy : 14.4 %
1m 57s (- 36m 34s) (1500 5%) loss : 5.392  accuracy : 14.2 %
2m 4s (- 36m 26s) (1600 5%) loss : 5.326  accuracy : 16.8 %
2m 13s (- 36m 28s) (1700 5%) loss : 5.331

18m 36s (- 22m 30s) (13400 45%) loss : 3.514  accuracy : 35.6 %
18m 44s (- 22m 21s) (13500 45%) loss : 3.472  accuracy : 36.0 %
18m 51s (- 22m 11s) (13600 45%) loss : 3.257  accuracy : 38.8 %
19m 0s (- 22m 4s) (13700 46%) loss : 3.317  accuracy : 36.6 %
19m 9s (- 21m 56s) (13800 46%) loss : 3.504  accuracy : 34.4 %
19m 16s (- 21m 47s) (13900 46%) loss : 3.354  accuracy : 36.9 %
19m 24s (- 21m 38s) (14000 47%) loss : 3.405  accuracy : 36.9 %
19m 32s (- 21m 29s) (14100 47%) loss : 3.620  accuracy : 32.8 %
19m 39s (- 21m 19s) (14200 47%) loss : 3.457  accuracy : 36.3 %
19m 46s (- 21m 10s) (14300 48%) loss : 3.385  accuracy : 35.1 %
19m 53s (- 21m 0s) (14400 48%) loss : 3.395  accuracy : 36.8 %
20m 0s (- 20m 50s) (14500 48%) loss : 3.433  accuracy : 32.9 %
20m 7s (- 20m 41s) (14600 49%) loss : 3.288  accuracy : 38.6 %
20m 15s (- 20m 32s) (14700 49%) loss : 3.247  accuracy : 38.8 %
20m 22s (- 20m 23s) (14800 49%) loss : 3.306  accuracy : 38.8 %
20m 29s (- 20m 14s) (14900 50%) loss : 3.341  

41m 3s (- 4m 59s) (26400 89%) loss : 3.205  accuracy : 38.2 %
41m 15s (- 4m 50s) (26500 89%) loss : 3.129  accuracy : 40.3 %
41m 26s (- 4m 41s) (26600 89%) loss : 3.199  accuracy : 38.8 %
41m 36s (- 4m 32s) (26700 90%) loss : 3.123  accuracy : 38.8 %
41m 47s (- 4m 22s) (26800 90%) loss : 3.215  accuracy : 37.1 %
41m 57s (- 4m 13s) (26900 90%) loss : 3.233  accuracy : 37.2 %
42m 7s (- 4m 4s) (27000 91%) loss : 3.170  accuracy : 37.6 %
42m 17s (- 3m 54s) (27100 91%) loss : 3.038  accuracy : 41.9 %
42m 28s (- 3m 45s) (27200 91%) loss : 3.124  accuracy : 38.6 %
42m 39s (- 3m 36s) (27300 92%) loss : 3.124  accuracy : 41.1 %
42m 52s (- 3m 27s) (27400 92%) loss : 3.074  accuracy : 39.7 %
43m 3s (- 3m 18s) (27500 92%) loss : 3.046  accuracy : 42.6 %
43m 14s (- 3m 8s) (27600 93%) loss : 3.092  accuracy : 41.4 %
43m 26s (- 2m 59s) (27700 93%) loss : 3.294  accuracy : 36.6 %
43m 36s (- 2m 50s) (27800 93%) loss : 3.139  accuracy : 41.0 %
43m 48s (- 2m 41s) (27900 94%) loss : 3.022  accuracy : 41.0

16m 50s (- 32m 31s) (10100 34%) loss : 2.778  accuracy : 44.7 %
17m 0s (- 32m 22s) (10200 34%) loss : 2.679  accuracy : 45.8 %
17m 13s (- 32m 16s) (10300 34%) loss : 2.833  accuracy : 44.6 %
17m 23s (- 32m 7s) (10400 35%) loss : 2.792  accuracy : 43.9 %
17m 33s (- 31m 57s) (10500 35%) loss : 2.794  accuracy : 43.4 %
17m 44s (- 31m 49s) (10600 35%) loss : 2.728  accuracy : 46.3 %
17m 56s (- 31m 41s) (10700 36%) loss : 2.658  accuracy : 47.0 %
18m 6s (- 31m 32s) (10800 36%) loss : 2.724  accuracy : 45.6 %
18m 17s (- 31m 23s) (10900 36%) loss : 2.717  accuracy : 45.1 %
18m 27s (- 31m 14s) (11000 37%) loss : 2.818  accuracy : 42.6 %
18m 38s (- 31m 4s) (11100 37%) loss : 2.810  accuracy : 45.1 %
18m 48s (- 30m 55s) (11200 37%) loss : 2.746  accuracy : 45.2 %
18m 57s (- 30m 43s) (11300 38%) loss : 2.788  accuracy : 44.2 %
19m 7s (- 30m 33s) (11400 38%) loss : 2.750  accuracy : 45.6 %
19m 17s (- 30m 22s) (11500 38%) loss : 2.651  accuracy : 46.5 %
19m 26s (- 30m 11s) (11600 39%) loss : 2.710 

34m 36s (- 9m 56s) (23000 77%) loss : 2.669  accuracy : 45.8 %
34m 45s (- 9m 47s) (23100 78%) loss : 2.823  accuracy : 42.6 %
34m 52s (- 9m 38s) (23200 78%) loss : 2.633  accuracy : 46.8 %
34m 59s (- 9m 28s) (23300 78%) loss : 2.648  accuracy : 45.0 %
35m 7s (- 9m 19s) (23400 79%) loss : 2.650  accuracy : 46.6 %
35m 15s (- 9m 9s) (23500 79%) loss : 2.626  accuracy : 46.3 %
35m 23s (- 9m 0s) (23600 79%) loss : 2.695  accuracy : 45.8 %
35m 31s (- 8m 51s) (23700 80%) loss : 2.714  accuracy : 44.2 %
35m 38s (- 8m 41s) (23800 80%) loss : 2.616  accuracy : 48.2 %
35m 45s (- 8m 32s) (23900 80%) loss : 2.673  accuracy : 47.2 %
35m 53s (- 8m 23s) (24000 81%) loss : 2.546  accuracy : 49.0 %
36m 0s (- 8m 14s) (24100 81%) loss : 2.881  accuracy : 42.7 %
36m 8s (- 8m 4s) (24200 81%) loss : 2.572  accuracy : 47.4 %
36m 16s (- 7m 55s) (24300 82%) loss : 2.507  accuracy : 48.2 %
36m 23s (- 7m 46s) (24400 82%) loss : 2.789  accuracy : 43.9 %
36m 30s (- 7m 36s) (24500 82%) loss : 2.702  accuracy : 44.4 

9m 45s (- 33m 22s) (6700 22%) loss : 2.668  accuracy : 45.2 %
9m 54s (- 33m 13s) (6800 22%) loss : 2.523  accuracy : 47.7 %
10m 2s (- 33m 4s) (6900 23%) loss : 2.558  accuracy : 46.2 %
10m 10s (- 32m 52s) (7000 23%) loss : 2.523  accuracy : 47.6 %
10m 17s (- 32m 39s) (7100 23%) loss : 2.542  accuracy : 47.6 %
10m 25s (- 32m 26s) (7200 24%) loss : 2.479  accuracy : 48.2 %
10m 33s (- 32m 15s) (7300 24%) loss : 2.538  accuracy : 48.6 %
10m 40s (- 32m 3s) (7400 24%) loss : 2.557  accuracy : 49.2 %
10m 48s (- 31m 52s) (7500 25%) loss : 2.433  accuracy : 49.9 %
10m 56s (- 31m 40s) (7600 25%) loss : 2.568  accuracy : 46.3 %
11m 4s (- 31m 30s) (7700 26%) loss : 2.487  accuracy : 48.6 %
11m 11s (- 31m 18s) (7800 26%) loss : 2.571  accuracy : 48.5 %
11m 19s (- 31m 6s) (7900 26%) loss : 2.644  accuracy : 45.0 %
11m 26s (- 30m 54s) (8000 27%) loss : 2.512  accuracy : 48.9 %
11m 34s (- 30m 45s) (8100 27%) loss : 2.465  accuracy : 49.3 %
11m 42s (- 30m 33s) (8200 27%) loss : 2.692  accuracy : 45.8 %

26m 17s (- 13m 13s) (19700 66%) loss : 2.579  accuracy : 46.6 %
26m 25s (- 13m 5s) (19800 66%) loss : 2.564  accuracy : 46.6 %
26m 33s (- 12m 57s) (19900 67%) loss : 2.540  accuracy : 48.2 %
26m 39s (- 12m 48s) (20000 67%) loss : 2.604  accuracy : 45.9 %
26m 47s (- 12m 40s) (20100 67%) loss : 2.587  accuracy : 47.5 %
26m 54s (- 12m 31s) (20200 68%) loss : 2.765  accuracy : 44.6 %
27m 1s (- 12m 23s) (20300 68%) loss : 2.549  accuracy : 46.8 %
27m 7s (- 12m 14s) (20400 68%) loss : 2.463  accuracy : 48.2 %
27m 15s (- 12m 6s) (20500 69%) loss : 2.505  accuracy : 48.4 %
27m 23s (- 11m 58s) (20600 69%) loss : 2.455  accuracy : 49.4 %
27m 30s (- 11m 50s) (20700 69%) loss : 2.649  accuracy : 45.8 %
27m 37s (- 11m 42s) (20800 70%) loss : 2.453  accuracy : 49.0 %
27m 45s (- 11m 33s) (20900 70%) loss : 2.542  accuracy : 48.1 %
27m 52s (- 11m 25s) (21000 70%) loss : 2.429  accuracy : 50.4 %
28m 0s (- 11m 17s) (21100 71%) loss : 2.520  accuracy : 50.0 %
28m 8s (- 11m 9s) (21200 71%) loss : 2.532  a
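
The two durations printed on each log line are the elapsed time and, in parentheses, the estimated remaining time that `timeSince` computes as `s / percent - s`. The sketch below reproduces that estimate for the line logged at iteration 13400, assuming whole seconds of elapsed time (the real run uses fractional seconds, so other lines may differ by a second); `remaining_seconds` and `as_minutes` are hypothetical standalone versions of the nested helpers.

```python
def remaining_seconds(elapsed, fraction_done):
    """Linear extrapolation: total time is elapsed / fraction_done."""
    return elapsed / fraction_done - elapsed

def as_minutes(seconds):
    minutes = int(seconds // 60)
    return "%dm %ds" % (minutes, seconds - minutes * 60)

elapsed = 18 * 60 + 36    # "18m 36s" on the log line for iteration 13400
fraction = 13400 / 29610  # iterations done / total iterations in one epoch
print(as_minutes(remaining_seconds(elapsed, fraction)))  # 22m 30s
```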

In [21]:
# save
#torch.save(denoiser.state_dict(), path_to_NLP + '\\saves\\models\\DL4NLP_I4a_sentence_denoiser_2.pth')

# load
#denoiser.load_state_dict(torch.load(path_to_NLP + '\\saves\\models\\DL4NLP_I4a_sentence_denoiser.pth'))

#### Evaluation

In [None]:
denoiser.eval()
sentence = random.choice(corpus)
i = random.choice(range(int(len(sentence)/2)))
sentence = ' '.join(sentence[:i]) if i > 0 else '.'
denoiser(sentence, limit = '.', color_code = '\x1b[48;2;255;229;217m') #  '\x1b[48;2;255;229;217m' '\x1b[31m'