<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 30px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
  Deep Learning for NLP
  </div> 
  
<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 30px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
  Part I - 4 <br><br><br>
  Sequence Labelling
  </div> 

  <div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 20px; 
      text-align: center; 
      padding: 15px;">
  </div> 

  <div style=" float:right; 
      font-size: 12px; 
      line-height: 12px; 
  padding: 10px 15px 8px;">
  Jean-baptiste AUJOGUE
  </div> 

### Part I

1. Word Embedding

2. Sentence Classification

3. Language Modeling

4. <font color=red>**Sequence Labelling**</font>


### Part II

5. Auto-Encoding

6. Machine Translation

7. Text Classification




### Part III

8. Abstractive Summarization

9. Question Answering

10. Chatbot



***

<a id="plan"></a>

| | | | |
|------|------|------|------|
| **Content** | [Corpus](#corpus) | [Modules](#modules) | [Model](#model) | 


# Overview

We treat **Sentence Denoising** as a sequence labelling task: the model receives a noisy sequence of words and must recover the correctly formed sentence.<br> Training follows a denoising objective known as the _Cloze task_, which is also used to pre-train the BERT model in [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805.pdf).
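
To make the Cloze objective concrete, here is a minimal sketch in plain Python. The helper name `cloze_mask`, the `[MASK]` token and the default ratio are illustrative assumptions, not the model code of this notebook: a few positions are hidden in the input, and the target keeps the original word only at those positions.

```python
import random

def cloze_mask(words, mask_ratio=0.15, mask_token='[MASK]', seed=0):
    """Hide a fraction of the words; return (noisy_input, targets).

    targets[i] is the original word where the input was masked, else None.
    All names and defaults here are illustrative assumptions.
    """
    if not words:
        return [], []
    rng = random.Random(seed)
    # always mask at least one position
    n_masked = max(1, int(mask_ratio * len(words)))
    positions = set(rng.sample(range(len(words)), k=n_masked))
    noisy   = [mask_token if i in positions else w for i, w in enumerate(words)]
    targets = [w if i in positions else None for i, w in enumerate(words)]
    return noisy, targets

words = 'the patient should take one tablet daily'.split()
noisy, targets = cloze_mask(words)
print(noisy)
print(targets)
```

The model is then trained to predict the non-`None` targets from the noisy input.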

# Packages

In [22]:
from __future__ import unicode_literals, print_function, division # must precede every other import
import sys
import warnings
import os
from io import open
import unicodedata
import string
import time
import math
import re
import random
import pickle
import copy
from unidecode import unidecode
import itertools
import matplotlib
import matplotlib.pyplot as plt


# for special math operation
from sklearn.preprocessing import normalize


# for manipulating data 
import numpy as np
#np.set_printoptions(threshold=np.nan)
import pandas as pd
import bcolz # see https://bcolz.readthedocs.io/en/latest/intro.html


# for text processing
import gensim
from gensim.models import KeyedVectors
#import spacy
import nltk
#nltk.download()
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem.porter import PorterStemmer


# for deep learning
import torch
import torch.nn as nn
from torch.autograd import Variable
from torch import optim
import torch.nn.functional as F
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
#device = torch.device("cpu")

warnings.filterwarnings("ignore")
print('python version :', sys.version)
print('pytorch version :', torch.__version__)
print('DL device :', device)

python version : 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
pytorch version : 1.3.1
DL device : cuda


In [23]:
path_to_NLP = 'C:\\Users\\Jb\\Desktop\\NLP'

In [24]:
sys.path.append(path_to_NLP + '\\libDL4NLP')

<a id="corpus"></a>

# Corpus

[Back to top](#plan)

The text is imported and stored as a list in which each element is one text, itself represented as a list of words.<br> Once imported, the corpus therefore has the form:<br>

- corpus = [text]<br>
- text   = [word]<br>
- word   = str<br>
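
As a toy illustration of this nested structure (the two short texts below are made up for the example and are not taken from the actual data):

```python
# Toy corpus following the structure above: corpus = [text], text = [word], word = str
corpus_example = [
    ['.', 'take', 'one', 'tablet', 'daily', '.'],
    ['.', 'store', 'in', 'a', 'dry', 'place', '.'],
]

def check_corpus(corpus):
    """Return True if corpus is a list of lists of strings."""
    return (isinstance(corpus, list)
            and all(isinstance(text, list) for text in corpus)
            and all(isinstance(word, str) for text in corpus for word in text))

print(check_corpus(corpus_example))
print(len(corpus_example), [len(text) for text in corpus_example])
```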

In [25]:
def cleanSentence(sentence): # -------------------------  str
    sw = ['']
    #sw += nltk.corpus.stopwords.words('english')
    #sw += nltk.corpus.stopwords.words('french')

    def unicodeToAscii(s):
        """Turn a Unicode string to plain ASCII, thanks to http://stackoverflow.com/a/518232/2809427"""
        return ''.join( c for c in unicodedata.normalize('NFD', s)
                        if unicodedata.category(c) != 'Mn')

    def normalizeString(s):
        '''Remove rare symbols from a string'''
        s = unicodeToAscii(s.lower().strip()) # 
        #s = re.sub(r"[^a-zA-Z\.\(\)\[\]]+", r" ", s)  # 'r' before a string is for 'raw' # ?&\%\_\- removed # set('''.,:;()*#&-_%!?/\'")''')
        return s

    def wordTokenizerFunction():
        # base version
        function = lambda sentence : sentence.strip().split()

        # nltk version
        #function = word_tokenize    
        return function

    # 1 - special characters
    def clean_sentence_punct(text): # --------------  str
        text = normalizeString(text)
        # remove the trailing punctuation mark
        if (len(text) > 0 and text[-1] in ['.', ',', ';', ':', '!', '?']) : text = text[:-1]

        text = text.replace(r'(', r' ( ')
        text = text.replace(r')', r' ) ')
        text = text.replace(r'[', r' [ ')
        text = text.replace(r']', r' ] ')
        text = text.replace(r'<', r' < ')
        text = text.replace(r'>', r' > ')

        text = text.replace(r':', r' : ')
        text = text.replace(r';', r' ; ')
        # protect decimal separators between digits with a placeholder before splitting punctuation
        for i in range(5) :
            text = re.sub(r'(?P<val1>[0-9])\.(?P<val2>[0-9])', r'\g<val1>__-__\g<val2>', text)
            text = re.sub(r'(?P<val1>[0-9]),(?P<val2>[0-9])', r'\g<val1>__-__\g<val2>', text)
        text = text.replace(r',', ' , ')
        text = text.replace(r'.', ' . ')
        # restore the protected separators as a decimal point
        for i in range(5) : text = re.sub(r'(?P<val1>[p0-9])__-__(?P<val2>[p0-9])', r'\g<val1>.\g<val2>', text)
        text = re.sub(r'(?P<val1>[0-9]) \. p \. (?P<val2>[0-9])', r'\g<val1>.p.\g<val2>', text)
        text = re.sub(r'(?P<val1>[0-9]) \. s \. (?P<val2>[0-9])', r'\g<val1>.s.\g<val2>', text)

        text = text.replace(r'"', r' " ')
        text = text.replace(r'’', r" ' ")
        text = text.replace(r'”', r' " ')
        text = text.replace(r'“', r' " ')
        text = text.replace(r'/', r' / ')

        text = re.sub('(…)+', ' … ', text)
        text = text.replace('≤', ' ≤ ')          
        text = text.replace('≥', ' ≥ ')
        text = text.replace('°c', ' °c ')
        text = text.replace('°C', ' °c ')
        text = text.replace('ºc', ' °c ')
        text = text.replace('n°', 'n° ')
        text = text.replace('%', ' % ')
        text = text.replace('*', ' * ')
        text = text.replace('+', ' + ')
        text = text.replace('-', ' - ')
        text = text.replace('_', ' ')
        text = text.replace('®', ' ')
        text = text.replace('™', ' ')
        text = text.replace('±', ' ± ')
        text = text.replace('÷', ' ÷ ')
        text = text.replace('–', ' - ')
        text = text.replace('μg', ' µg')
        text = text.replace('µg', ' µg')
        text = text.replace('µl', ' µl')
        text = text.replace('μl', ' µl')
        text = text.replace('µm', ' µm')
        text = text.replace('μm', ' µm')
        text = text.replace('ppm', ' ppm')
        text = re.sub(r'(?P<val1>[0-9])mm', r'\g<val1> mm', text)
        text = re.sub(r'(?P<val1>[0-9])g', r'\g<val1> g', text)
        text = text.replace('nm', ' nm')

        text = re.sub(r'fa(?P<val1>[0-9])', r'fa \g<val1>', text)
        text = re.sub(r'g(?P<val1>[0-9])', r'g \g<val1>', text)
        text = re.sub(r'n(?P<val1>[0-9])', r'n \g<val1>', text)
        text = re.sub(r'p(?P<val1>[0-9])', r'p \g<val1>', text)
        text = re.sub(r'q_(?P<val1>[0-9])', r'q_ \g<val1>', text)
        text = re.sub(r'u(?P<val1>[0-9])', r'u \g<val1>', text)
        text = re.sub(r'ud(?P<val1>[0-9])', r'ud \g<val1>', text)
        text = re.sub(r'ui(?P<val1>[0-9])', r'ui \g<val1>', text)

        text = text.replace('=', ' ')
        text = text.replace('!', ' ')
        text = text.replace('-', ' ')
        text = text.replace(r' , ', ' ')
        text = text.replace(r' . ', ' ')

        text = re.sub(r'(?P<val>[0-9])ml', r'\g<val> ml', text)
        text = re.sub(r'(?P<val>[0-9])mg', r'\g<val> mg', text)

        # drop standalone numbers
        for i in range(5) : text = re.sub(r'( [0-9]+ )', ' ', text)
        #text = re.sub('cochran(\S)*', 'cochran ', text)
        return text

    # 3 - word splitting
    def wordSplit(sentence, tokenizeur): # ------------- [str]
        return tokenizeur(sentence)

    # 4 - stopword removal (lowercasing is already done in normalizeString)
    def stopwordsRemoval(sentence, sw): # ------------- [[str]]
        return [word for word in sentence if word not in sw]

    # 6 - spelling correction (unused; relies on a `spelling` object that is not defined in this notebook)
    def correction(text):
        def correct(word):
            return spelling.suggest(word)[0]
        list_of_list_of_words = [[correct(word) for word in sentence] for sentence in text]
        return list_of_list_of_words

    # 7 - stemming
    def stemming(text): # ------------------------- [[str]]
        list_of_list_of_words = [[PorterStemmer().stem(word) for word in sentence if word not in sw] for sentence in text]
        return list_of_list_of_words


    tokenizeur = wordTokenizerFunction()
    sentence = clean_sentence_punct(str(sentence))
    sentence = wordSplit(sentence, tokenizeur)
    sentence = stopwordsRemoval(sentence, sw)
    #text = correction(text)
    #text = stemming(text)
    return sentence


def importWords(file_name) :
    def cleanDatabase(db):
        words = ['.']
        title = ''
        for pair in db :
            #print(pair)
            current_title = pair[0].split(' | ')[-1]
            if current_title != title :
                words += cleanSentence(current_title) + ['.']
                title  = current_title
            words += cleanSentence(str(pair[1]).split(' | ')[-1]) + ['.']
        return words

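    # `sep` is not a read_excel argument, and .ix was removed in pandas 1.0;
    # with header = None, labels and positions coincide, so .iloc is equivalent
    df = pd.read_excel(file_name, header = None)
    headers = [i for i, titre in enumerate(df.iloc[0,:].values) if i in [1, 2] or titre == 'score manuel'] 
    db = df.iloc[1:, headers].values.tolist()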
    db = [el[:2] for el in db if el[-1] in [0,1, 10]]
    words = cleanDatabase(db)
    return words


def importAllWords(path_to_data) :
    corpus = []
    reps = os.listdir(path_to_data)
    for rep in reps :
        files = os.listdir(path_to_data + '\\' + rep)
        for file in files :
            file_name = path_to_data + '\\' + rep + '\\' + file
            corpus.append(importWords(file_name))
    return corpus



def importSentences(file_name) :
    def cleanDatabase(db):
        sentences = []
        title = ''
        for pair in db :
            current_title = pair[0].split(' | ')[-1]
            if current_title != title :
                sentences.append(cleanSentence(current_title))
                title = current_title
            sentences.append(cleanSentence(str(pair[1]).split(' | ')[-1]))
        return sentences

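    # `sep` is not a read_excel argument, and .ix was removed in pandas 1.0;
    # with header = None, labels and positions coincide, so .iloc is equivalent
    df = pd.read_excel(file_name, header = None)
    headers = [i for i, titre in enumerate(df.iloc[0,:].values) if i in [1, 2] or titre == 'score manuel'] 
    db = df.iloc[1:, headers].values.tolist()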
    db = [el[:2] for el in db if el[-1] in [0,1, 10]]
    sentences = cleanDatabase(db)
    return sentences


def importAllSentences(path_to_data) :
    corpus = []
    reps = os.listdir(path_to_data)
    for rep in reps :
        files = os.listdir(path_to_data + '\\' + rep)
        for file in files :
            file_name = path_to_data + '\\' + rep + '\\' + file
            corpus += importSentences(file_name)
    return corpus
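
The cleaning step above has to separate punctuation from words without breaking decimal numbers such as `2.5` or `2,5`. Here is a minimal sketch of that idea, using a hypothetical helper `split_punct_keep_decimals` that is not part of the notebook. Lookarounds protect the separator between two digits in a single pass, so the substitution does not need to be repeated as in `clean_sentence_punct`. Unlike the notebook function, this sketch keeps the isolated punctuation tokens.

```python
import re

PLACEHOLDER = '__-__'  # same temporary marker as in clean_sentence_punct

def split_punct_keep_decimals(text):
    """Put spaces around '.' and ',' except between two digits (hypothetical helper)."""
    # protect a '.' or ',' that sits between two digits; lookarounds do not
    # consume the digits, so chained cases like 1.2.3 are handled in one pass
    text = re.sub(r'(?<=[0-9])[.,](?=[0-9])', PLACEHOLDER, text)
    # isolate the remaining punctuation as separate tokens
    text = text.replace(',', ' , ').replace('.', ' . ')
    # restore the protected separators as a decimal point
    text = text.replace(PLACEHOLDER, '.')
    return text.split()

print(split_punct_keep_decimals('dose of 2,5 mg, twice daily.'))
# ['dose', 'of', '2.5', 'mg', ',', 'twice', 'daily', '.']
```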

In [26]:
corpus = importAllWords(path_to_NLP + '\\data\\AMM')
len(corpus)

510

In [27]:
sentences = importAllSentences(path_to_NLP + '\\data\\AMM')
len(sentences)

31574

<a id="modules"></a>

# 1 Modules

### 1.1 Word Embedding module

[Back to top](#plan)

All details on Word Embedding modules and their pre-training are found in **Part I - 1**. We consider here a FastText model trained following the Skip-Gram training objective.
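
FastText differs from plain Word2Vec in that a word vector is built from the vectors of its character n-grams, which lets the model produce embeddings for rare or unseen words. Below is a minimal sketch of the n-gram extraction, assuming the usual `<` and `>` boundary markers and default n-gram lengths of 3 to 6 (the function name `char_ngrams` is illustrative):

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word, with boundary markers added."""
    token = '<' + word + '>'
    ngrams = []
    for n in range(min_n, max_n + 1):
        # slide a window of width n over the marked token
        for start in range(len(token) - n + 1):
            ngrams.append(token[start:start + n])
    return ngrams

print(char_ngrams('dose', min_n=3, max_n=4))
# ['<do', 'dos', 'ose', 'se>', '<dos', 'dose', 'ose>']
```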

In [28]:
from libDL4NLP.models.Word_Embedding import Word2Vec as myWord2Vec
from libDL4NLP.models.Word_Embedding import Word2VecConnector
from libDL4NLP.utils.Lang import Lang

In [29]:
from gensim.models.fasttext import FastText
from gensim.test.utils import datapath, get_tmpfile

In [30]:
# gensim 3.x API: in gensim >= 4.0 the `size` argument is named `vector_size`
fastText_word2vec = FastText(size = 75, 
                             window = 5, 
                             min_count = 1, 
                             negative = 20,
                             sg = 1)

In [31]:
fastText_word2vec.build_vocab(corpus)

In [32]:
len(fastText_word2vec.wv.vocab) # gensim >= 4.0: len(fastText_word2vec.wv.key_to_index)

8086

In [33]:
fastText_word2vec.train(sentences = corpus, 
                        epochs = 50,
                        total_examples = fastText_word2vec.corpus_count)

In [34]:
word2vec = Word2VecConnector(fastText_word2vec)

### 1.2 Contextualization module

[Back to top](#plan)


In [35]:
from libDL4NLP.modules import RecurrentEncoder

<a id="model"></a>

# 2 Sentence denoising Model

[Back to top](#plan)


In [61]:
class SentenceDenoiser(nn.Module) :
    def __init__(self, device, tokenizer, word2vec, 
                 hidden_dim = 100, 
                 n_layers = 1, 
                 dropout = 0, 
                 class_weights = None, 
                 optimizer = optim.SGD
                 ):
        super(SentenceDenoiser, self).__init__()
        
        # embedding
        self.tokenizer = tokenizer
        self.word2vec  = word2vec
        self.context   = RecurrentEncoder(self.word2vec.output_dim, hidden_dim, n_layers, dropout, bidirectional = True)
        self.out       = nn.Linear(self.context.output_dim, self.word2vec.lang.n_words)
        self.act       = F.softmax
        
        # optimizer
        self.ignore_index = self.word2vec.lang.getIndex('PADDING_WORD')
        self.criterion = nn.NLLLoss(reduction = 'sum', # replaces the deprecated size_average = False
                                    ignore_index = self.ignore_index, 
                                    weight = class_weights)
        self.optimizer = optimizer
        
        # load to device
        self.device = device
        self.to(device)
        
    def nbParametres(self) :
        return sum([p.data.nelement() for p in self.parameters() if p.requires_grad == True])
    
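    def forward(self, sentence = '.', hidden = None, limit = 10, color_code = '\033[94m'):
        """Prints the sentence followed by greedily predicted words, highlighted with color_code.
        Generation stops after `limit` words when limit is an int, when the word `limit` is
        predicted when it is a str, and in any case after 50 words."""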
        words  = self.tokenizer(sentence)
        result = words + [color_code]
        hidden, count, stop = None, 0, False
        while not stop :
            # compute probs
            embeddings = self.word2vec(words, self.device)
            _, hidden  = self.context(embeddings, lengths = None, hidden = hidden)
            if self.context.bidirectional :
                hidden = hidden.view(self.context.n_layers, 2, -1, self.context.hidden_dim)
                hidden = torch.sum(hidden, dim = 1) # size (n_layers, batch_size, hidden_dim)
            probs  = self.act(self.out(hidden[-1]), dim = 1).view(-1)
            # get predicted word
            topv, topi = probs.data.topk(1)
            words = [self.word2vec.lang.index2word[topi.item()]]
            result += words
            # stopping criterion
            count += 1
            if count == limit or words == [limit] or count == 50 : stop = True
        print(' '.join(result + ['\033[0m']))
        return
    
    def generatePackedSentences(self, 
                                sentences, 
                                batch_size = 32, 
                                mask_ratio = 0.15,
                                predict_masked_only = True,
                                seed = 42) :
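        def maskInput(index, b) :
            # for a selected position (b is True): 75% -> mask (the padding index doubles as mask token),
            # 22.5% -> random word index, 2.5% -> original index kept
            if   b and random.random() > 0.25 : return self.ignore_index
            elif b and random.random() > 0.10 : return random.choice(list(self.word2vec.twin.lang.word2index.values()))
            else                              : return index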
            
        def maskOutput(index, b) :
            return index if b else self.ignore_index
            
        random.seed(seed)
        sentences.sort(key = lambda s: len(s), reverse = True)
        packed_data = []
        for i in range(0, len(sentences), batch_size) :
            # prepare pack
            pack = sentences[i:i + batch_size]
            pack = [[self.word2vec.lang.getIndex(w) for w in s] for s in pack]
            pack = [[w for w in words if w is not None] for words in pack]
            mask = [random.sample([_ for _ in range(len(p))], k = int(mask_ratio*len(p) +1)) for p in pack]
            # split into input and target pack
            pack0 = [[ maskInput(s[i], i in m) for i in range(len(s))] for s, m in zip(pack, mask)]
            pack1 = [[maskOutput(s[i], i in m) for i in range(len(s))] for s, m in zip(pack, mask)] if predict_masked_only else pack
            lengths = torch.tensor([len(p) for p in pack0]) # size = (batch_size) 
            pack0 = list(itertools.zip_longest(*pack0, fillvalue = self.ignore_index))
            pack0 = Variable(torch.LongTensor(pack0).transpose(0, 1))   # size = (batch_size, max_length) 
            pack1 = list(itertools.zip_longest(*pack1, fillvalue = self.ignore_index))
            pack1 = Variable(torch.LongTensor(pack1).transpose(0, 1))   # size = (batch_size, max_length) 
            packed_data.append([[pack0, lengths], pack1])
        return packed_data
    
    def fit(self, batches, iters = None, epochs = None, lr = 0.025, random_state = 42,
              print_every = 10, compute_accuracy = True):
        """Performs training over a given dataset and along a specified amount of loops"""
        def asMinutes(s):
            m = math.floor(s / 60)
            s -= m * 60
            return '%dm %ds' % (m, s)

        def timeSince(since, percent):
            now = time.time()
            s = now - since
            rs = s/percent - s
            return '%s (- %s)' % (asMinutes(s), asMinutes(rs))
        
        def computeLogProbs(batch) :
            embeddings = self.word2vec.embedding(batch[0].to(self.device))
            hiddens,_  = self.context(embeddings, lengths = batch[1].to(self.device)) # dim = (batch_size, input_length, hidden_dim)
            log_probs  = F.log_softmax(self.out(hiddens), dim = 2)                    # dim = (batch_size, input_length, lang_size)
            return log_probs

        def computeAccuracy(log_probs, targets) :
            total   = np.sum(targets.data.cpu().numpy() != self.ignore_index)
            success = sum([self.ignore_index != targets[i, j].item() == log_probs[i, :, j].data.topk(1)[1].item() \
                           for i in range(targets.size(0)) \
                           for j in range(targets.size(1)) ])
            return  success * 100 / total

        def printScores(start, iter, iters, tot_loss, tot_acc, print_every, compute_accuracy) :
            avg_loss = tot_loss / print_every
            avg_acc  = tot_acc / print_every
            if compute_accuracy : print(timeSince(start, iter / iters) + ' ({} {}%) loss : {:.3f}  accuracy : {:.1f} %'.format(iter, int(iter / iters * 100), avg_loss, avg_acc))
            else                : print(timeSince(start, iter / iters) + ' ({} {}%) loss : {:.3f}                     '.format(iter, int(iter / iters * 100), avg_loss))
            return 0, 0

        def trainLoop(batch, optimizer, compute_accuracy = True):
            """Performs a training loop, with forward pass, backward pass and weight update."""
            optimizer.zero_grad()
            self.zero_grad()
            log_probs = computeLogProbs(batch[0]).transpose(1, 2) # dim = (batch_size, lang_size, input_length)
            targets   = batch[1].to(self.device)                  # dim = (batch_size, input_length)
            loss      = self.criterion(log_probs, targets)
            loss.backward()
            optimizer.step() 
            accuracy = computeAccuracy(log_probs, targets) if compute_accuracy else 0
            return float(loss.item() / np.sum(targets.data.cpu().numpy() != self.ignore_index)), accuracy
        
        # --- main ---
        self.train()
        np.random.seed(random_state)
        random.seed(random_state) # random.choice below draws from the `random` module, not from numpy
        start = time.time()
        optimizer = self.optimizer([param for param in self.parameters() if param.requires_grad == True], lr = lr)
        tot_loss = 0  
        tot_acc  = 0
        if epochs is None :
            for iter in range(1, iters + 1):
                batch = random.choice(batches)
                loss, acc = trainLoop(batch, optimizer, compute_accuracy)
                tot_loss += loss
                tot_acc += acc      
                if iter % print_every == 0 : 
                    tot_loss, tot_acc = printScores(start, iter, iters, tot_loss, tot_acc, print_every, compute_accuracy)
        else :
            iter = 0
            iters = len(batches) * epochs
            for epoch in range(1, epochs + 1):
                print('epoch ' + str(epoch))
                np.random.shuffle(batches)
                for batch in batches :
                    loss, acc = trainLoop(batch, optimizer, compute_accuracy)
                    tot_loss += loss
                    tot_acc += acc 
                    iter += 1
                    if iter % print_every == 0 : 
                        tot_loss, tot_acc = printScores(start, iter, iters, tot_loss, tot_acc, print_every, compute_accuracy)
        return
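
The nested thresholds in `maskInput` imply that a selected position is replaced by the mask index about 75% of the time, by a random word about 22.5% of the time (0.25 × 0.9), and left unchanged about 2.5% of the time (0.25 × 0.1). The standalone sketch below checks these proportions by simulation; `mask_decision` is a hypothetical stand-in that returns a label rather than a word index.

```python
import random
from collections import Counter

def mask_decision(rng):
    """Mimic the branch structure of maskInput for a selected position."""
    if   rng.random() > 0.25 : return 'mask'
    elif rng.random() > 0.10 : return 'random_word'
    else                     : return 'unchanged'

rng = random.Random(42)
n_draws = 100000
counts = Counter(mask_decision(rng) for _ in range(n_draws))
for label in ['mask', 'random_word', 'unchanged']:
    print(label, round(counts[label] / n_draws, 3))
```

These proportions differ slightly from the 80% / 10% / 10% split described in the BERT paper.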

#### Training

In [57]:
denoiser = SentenceDenoiser(device,
                            tokenizer = lambda s : s.split(' '),
                            word2vec = word2vec,
                            hidden_dim = 75, 
                            n_layers = 3, 
                            dropout = 0.1,
                            optimizer = optim.SGD)

denoiser.nbParametres()

887388

In [58]:
denoiser

SentenceDenoiser(
  (word2vec): Word2VecConnector(
    (twin): Word2Vec(
      (embedding): Embedding(8088, 75)
    )
    (embedding): Embedding(8088, 75)
  )
  (context): RecurrentEncoder(
    (dropout): Dropout(p=0.1, inplace=False)
    (bigru): GRU(75, 75, num_layers=3, batch_first=True, dropout=0.1, bidirectional=True)
  )
  (out): Linear(in_features=75, out_features=8088, bias=True)
  (criterion): NLLLoss()
)
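
The trainable-parameter count reported above (887388) can be reproduced by hand from the layer sizes in this printout. The sketch below assumes the standard PyTorch GRU parameter layout (three gates, two bias vectors per direction) and that the embedding weights are frozen and therefore not counted. The latter is an inference from the fact that the totals match, not something stated in the notebook.

```python
def gru_param_count(input_dim, hidden_dim, n_layers, bidirectional=True):
    """Number of parameters of a GRU with the standard PyTorch layout."""
    n_dirs = 2 if bidirectional else 1
    total = 0
    for layer in range(n_layers):
        # upper layers receive the concatenated outputs of both directions
        in_dim = input_dim if layer == 0 else hidden_dim * n_dirs
        # per direction: W_ih (3H x in), W_hh (3H x H), b_ih (3H), b_hh (3H)
        per_direction = 3 * hidden_dim * (in_dim + hidden_dim) + 2 * 3 * hidden_dim
        total += n_dirs * per_direction
    return total

def linear_param_count(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

gru    = gru_param_count(input_dim=75, hidden_dim=75, n_layers=3)
linear = linear_param_count(in_features=75, out_features=8088)
print(gru, linear, gru + linear)  # 272700 614688 887388
```

More than two thirds of the trainable parameters sit in the output layer, which is typical when predicting over a full vocabulary.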

In [39]:
torch.cuda.empty_cache()

In [40]:
batches = []
for seed in [42, 854, 3, 7956, 881125, 76, 5721, 1499, 8752, 374, 14758, 23, 9543, 856, 75] :
    batches += denoiser.generatePackedSentences(sentences, 
                                                batch_size = 16,
                                                mask_ratio = 0.15,
                                                predict_masked_only = True,
                                                seed = seed)
len(batches)

29610
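
This number follows from the corpus size: each of the 15 seeds yields one differently masked copy of the 31574 sentences, cut into batches of 16, with the last batch of each pass being incomplete. A quick check:

```python
import math

n_sentences = 31574   # len(sentences)
batch_size  = 16
n_seeds     = 15      # number of seeds in the loop above

# the last batch of each pass holds the remaining sentences
batches_per_pass = math.ceil(n_sentences / batch_size)
print(batches_per_pass, n_seeds * batches_per_pass)  # 1974 29610
```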

In [59]:
denoiser.fit(batches, epochs = 1, lr = 0.01,   print_every = 100)
denoiser.fit(batches, epochs = 1, lr = 0.0025, print_every = 100)
denoiser.fit(batches, epochs = 1, lr = 0.0005, print_every = 100)

epoch 1
0m 8s (- 43m 8s) (100 0%) loss : 9.148  accuracy : 3.6 %
0m 15s (- 38m 51s) (200 0%) loss : 7.102  accuracy : 3.5 %
0m 23s (- 38m 37s) (300 1%) loss : 6.739  accuracy : 4.6 %
0m 31s (- 37m 55s) (400 1%) loss : 6.767  accuracy : 4.9 %
0m 38s (- 37m 39s) (500 1%) loss : 6.602  accuracy : 5.4 %
0m 45s (- 36m 50s) (600 2%) loss : 6.389  accuracy : 5.6 %
0m 52s (- 36m 22s) (700 2%) loss : 6.484  accuracy : 6.2 %
0m 59s (- 35m 51s) (800 2%) loss : 6.311  accuracy : 6.5 %
1m 7s (- 35m 48s) (900 3%) loss : 6.386  accuracy : 5.8 %
1m 15s (- 36m 7s) (1000 3%) loss : 6.458  accuracy : 6.1 %
1m 23s (- 36m 12s) (1100 3%) loss : 6.316  accuracy : 6.8 %
1m 31s (- 36m 1s) (1200 4%) loss : 6.255  accuracy : 5.9 %
1m 38s (- 35m 47s) (1300 4%) loss : 6.289  accuracy : 6.9 %
1m 46s (- 35m 42s) (1400 4%) loss : 6.296  accuracy : 6.2 %
1m 54s (- 35m 49s) (1500 5%) loss : 6.238  accuracy : 6.0 %
2m 3s (- 35m 54s) (1600 5%) loss : 6.192  accuracy : 6.8 %
2m 11s (- 35m 52s) (1700 5%) loss : 6.174
[...]

16m 35s (- 20m 3s) (13400 45%) loss : 4.690  accuracy : 22.7 %
16m 42s (- 19m 56s) (13500 45%) loss : 4.584  accuracy : 21.9 %
16m 49s (- 19m 48s) (13600 45%) loss : 4.666  accuracy : 20.3 %
16m 58s (- 19m 42s) (13700 46%) loss : 4.735  accuracy : 20.8 %
17m 6s (- 19m 36s) (13800 46%) loss : 4.742  accuracy : 19.3 %
17m 14s (- 19m 28s) (13900 46%) loss : 4.687  accuracy : 21.3 %
17m 21s (- 19m 20s) (14000 47%) loss : 4.570  accuracy : 20.6 %
17m 29s (- 19m 13s) (14100 47%) loss : 4.488  accuracy : 22.1 %
17m 36s (- 19m 6s) (14200 47%) loss : 4.469  accuracy : 23.5 %
17m 43s (- 18m 58s) (14300 48%) loss : 4.536  accuracy : 21.8 %
17m 51s (- 18m 51s) (14400 48%) loss : 4.639  accuracy : 19.9 %
17m 58s (- 18m 43s) (14500 48%) loss : 4.392  accuracy : 23.6 %
18m 5s (- 18m 36s) (14600 49%) loss : 4.414  accuracy : 23.8 %
18m 13s (- 18m 28s) (14700 49%) loss : 4.413  accuracy : 23.4 %
18m 19s (- 18m 20s) (14800 49%) loss : 4.416  accuracy : 24.4 %
18m 27s (- 18m 13s) (14900 50%) loss : 4.438
[...]

32m 34s (- 3m 57s) (26400 89%) loss : 3.664  accuracy : 30.3 %
32m 41s (- 3m 50s) (26500 89%) loss : 3.725  accuracy : 29.6 %
32m 49s (- 3m 42s) (26600 89%) loss : 3.610  accuracy : 32.6 %
32m 55s (- 3m 35s) (26700 90%) loss : 3.594  accuracy : 33.2 %
33m 3s (- 3m 27s) (26800 90%) loss : 3.742  accuracy : 32.0 %
33m 11s (- 3m 20s) (26900 90%) loss : 3.703  accuracy : 33.5 %
33m 18s (- 3m 13s) (27000 91%) loss : 3.708  accuracy : 32.2 %
33m 26s (- 3m 5s) (27100 91%) loss : 3.917  accuracy : 30.5 %
33m 33s (- 2m 58s) (27200 91%) loss : 3.683  accuracy : 31.5 %
33m 40s (- 2m 50s) (27300 92%) loss : 3.822  accuracy : 31.4 %
33m 47s (- 2m 43s) (27400 92%) loss : 3.642  accuracy : 31.7 %
33m 54s (- 2m 36s) (27500 92%) loss : 3.738  accuracy : 31.8 %
34m 1s (- 2m 28s) (27600 93%) loss : 3.720  accuracy : 33.0 %
34m 8s (- 2m 21s) (27700 93%) loss : 3.581  accuracy : 32.8 %
34m 16s (- 2m 13s) (27800 93%) loss : 3.782  accuracy : 32.0 %
34m 23s (- 2m 6s) (27900 94%) loss : 3.801  accuracy : 30.1
[...]

12m 31s (- 24m 12s) (10100 34%) loss : 3.384  accuracy : 35.8 %
12m 38s (- 24m 3s) (10200 34%) loss : 3.302  accuracy : 36.7 %
12m 46s (- 23m 56s) (10300 34%) loss : 3.236  accuracy : 38.9 %
12m 53s (- 23m 48s) (10400 35%) loss : 3.273  accuracy : 37.6 %
13m 0s (- 23m 40s) (10500 35%) loss : 3.298  accuracy : 36.0 %
13m 7s (- 23m 32s) (10600 35%) loss : 3.307  accuracy : 38.4 %
13m 15s (- 23m 25s) (10700 36%) loss : 3.425  accuracy : 36.1 %
13m 22s (- 23m 18s) (10800 36%) loss : 3.277  accuracy : 38.4 %
13m 29s (- 23m 10s) (10900 36%) loss : 3.416  accuracy : 34.4 %
13m 36s (- 23m 2s) (11000 37%) loss : 3.169  accuracy : 39.2 %
13m 43s (- 22m 52s) (11100 37%) loss : 3.352  accuracy : 36.9 %
13m 49s (- 22m 44s) (11200 37%) loss : 3.407  accuracy : 35.8 %
13m 57s (- 22m 36s) (11300 38%) loss : 3.321  accuracy : 37.0 %
14m 4s (- 22m 28s) (11400 38%) loss : 3.349  accuracy : 39.2 %
14m 11s (- 22m 20s) (11500 38%) loss : 3.387  accuracy : 34.1 %
14m 18s (- 22m 13s) (11600 39%) loss : 3.320
[...]

28m 15s (- 8m 7s) (23000 77%) loss : 3.205  accuracy : 37.4 %
28m 23s (- 8m 0s) (23100 78%) loss : 3.076  accuracy : 40.6 %
28m 30s (- 7m 52s) (23200 78%) loss : 3.176  accuracy : 38.7 %
28m 37s (- 7m 45s) (23300 78%) loss : 3.109  accuracy : 40.6 %
28m 45s (- 7m 38s) (23400 79%) loss : 3.218  accuracy : 38.5 %
28m 52s (- 7m 30s) (23500 79%) loss : 3.264  accuracy : 38.1 %
29m 0s (- 7m 23s) (23600 79%) loss : 3.174  accuracy : 38.8 %
29m 8s (- 7m 16s) (23700 80%) loss : 3.203  accuracy : 38.5 %
29m 16s (- 7m 8s) (23800 80%) loss : 3.338  accuracy : 37.7 %
29m 23s (- 7m 1s) (23900 80%) loss : 3.240  accuracy : 38.6 %
29m 29s (- 6m 53s) (24000 81%) loss : 3.185  accuracy : 39.2 %
29m 37s (- 6m 46s) (24100 81%) loss : 3.332  accuracy : 36.0 %
29m 44s (- 6m 38s) (24200 81%) loss : 3.173  accuracy : 38.4 %
29m 51s (- 6m 31s) (24300 82%) loss : 3.082  accuracy : 41.9 %
29m 58s (- 6m 24s) (24400 82%) loss : 3.091  accuracy : 40.0 %
30m 5s (- 6m 16s) (24500 82%) loss : 3.128  accuracy : 39.1 %

8m 10s (- 27m 56s) (6700 22%) loss : 3.120  accuracy : 38.7 %
8m 17s (- 27m 48s) (6800 22%) loss : 3.182  accuracy : 38.6 %
8m 25s (- 27m 43s) (6900 23%) loss : 3.070  accuracy : 39.0 %
8m 32s (- 27m 34s) (7000 23%) loss : 3.049  accuracy : 41.6 %
8m 41s (- 27m 31s) (7100 23%) loss : 3.179  accuracy : 39.1 %
8m 47s (- 27m 23s) (7200 24%) loss : 3.024  accuracy : 41.4 %
8m 55s (- 27m 16s) (7300 24%) loss : 3.006  accuracy : 41.1 %
9m 3s (- 27m 10s) (7400 24%) loss : 3.066  accuracy : 39.7 %
9m 10s (- 27m 2s) (7500 25%) loss : 3.142  accuracy : 40.0 %
9m 17s (- 26m 54s) (7600 25%) loss : 3.017  accuracy : 41.4 %
9m 25s (- 26m 48s) (7700 26%) loss : 3.150  accuracy : 38.8 %
9m 33s (- 26m 42s) (7800 26%) loss : 3.117  accuracy : 40.4 %
9m 40s (- 26m 35s) (7900 26%) loss : 3.212  accuracy : 37.7 %
9m 48s (- 26m 30s) (8000 27%) loss : 3.101  accuracy : 39.5 %
9m 55s (- 26m 22s) (8100 27%) loss : 2.975  accuracy : 41.4 %
10m 3s (- 26m 14s) (8200 27%) loss : 2.995  accuracy : 43.3 %
[...]

24m 0s (- 12m 4s) (19700 66%) loss : 3.051  accuracy : 40.4 %
24m 6s (- 11m 56s) (19800 66%) loss : 3.194  accuracy : 37.7 %
24m 14s (- 11m 49s) (19900 67%) loss : 3.101  accuracy : 40.1 %
24m 21s (- 11m 42s) (20000 67%) loss : 3.183  accuracy : 38.9 %
24m 28s (- 11m 34s) (20100 67%) loss : 3.080  accuracy : 40.5 %
24m 36s (- 11m 27s) (20200 68%) loss : 3.080  accuracy : 39.1 %
24m 43s (- 11m 20s) (20300 68%) loss : 2.970  accuracy : 41.1 %
24m 50s (- 11m 13s) (20400 68%) loss : 3.116  accuracy : 39.9 %
24m 58s (- 11m 5s) (20500 69%) loss : 2.998  accuracy : 41.9 %
25m 4s (- 10m 58s) (20600 69%) loss : 3.039  accuracy : 40.6 %
25m 11s (- 10m 50s) (20700 69%) loss : 2.969  accuracy : 41.2 %
25m 19s (- 10m 43s) (20800 70%) loss : 3.152  accuracy : 39.7 %
25m 26s (- 10m 36s) (20900 70%) loss : 3.117  accuracy : 40.1 %
25m 34s (- 10m 29s) (21000 70%) loss : 3.148  accuracy : 40.2 %
25m 41s (- 10m 21s) (21100 71%) loss : 3.071  accuracy : 40.4 %
25m 49s (- 10m 14s) (21200 71%) loss : 3.209
[...]
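
Since `fit` reports the summed negative log-likelihood divided by the number of target tokens, the loss is expressed in nats per masked word and can be read as a perplexity via exp(loss). As a point of reference, predicting uniformly over the 8088 output classes would give a loss of ln(8088), which is close to the first value logged. A small sketch of this conversion:

```python
import math

vocab_size = 8088  # out_features of the output layer

def perplexity(loss_nats):
    """Convert an average negative log-likelihood (in nats) to a perplexity."""
    return math.exp(loss_nats)

uniform_loss = math.log(vocab_size)
print('uniform baseline loss : {:.3f}'.format(uniform_loss))
for loss in [9.148, 4.690, 3.051]:   # values taken from the log above
    print('loss {:.3f} -> perplexity {:.1f}'.format(loss, perplexity(loss)))
```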

In [60]:
# save
#torch.save(denoiser.state_dict(), path_to_NLP + '\\saves\\models\\DL4NLP_I4a_sentence_denoiser_2.pth')

# load
#denoiser.load_state_dict(torch.load(path_to_NLP + '\\saves\\models\\DL4NLP_I4a_sentence_denoiser.pth'))

#### Evaluation

In [None]:
denoiser.eval()
sentence = random.choice(corpus)
i = random.choice(range(int(len(sentence)/2)))
sentence = ' '.join(sentence[:i]) if i > 0 else '.'
denoiser(sentence, limit = '.', color_code = '\x1b[48;2;255;229;217m') #  '\x1b[48;2;255;229;217m' '\x1b[31m'