<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 30px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
  Deep Learning for NLP
  </div> 
  
<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 30px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
  Part I - 4 <br><br><br>
  Sequence Labelling
  </div> 

  <div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 20px; 
      text-align: center; 
      padding: 15px;">
  </div> 

  <div style=" float:right; 
      font-size: 12px; 
      line-height: 12px; 
  padding: 10px 15px 8px;">
  Jean-baptiste AUJOGUE
  </div> 

### Part I

1. Word Embedding

2. Sentence Classification

3. Language Modeling

4. <font color=red>**Sequence Labelling**</font>


### Part II

5. Auto-Encoding

6. Machine Translation

7. Text Classification




### Part III

8. Abstractive Summarization

9. Question Answering

10. Chatbot



***

<a id="plan"></a>

| | | | |
|------|------|------|------|
| **Content** | [Corpus](#corpus) | [Modules](#modules) | [Model](#model) | 


# Overview

We treat sequence labelling here as a **Sentence Denoising** problem, which consists in transforming a noisy sequence of words into a well-formed sentence.<br> Training follows a denoising objective known as the _Cloze task_, which is used:

- For the BERT model in [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805.pdf)
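
As a minimal sketch of the Cloze objective (not the implementation used later in this notebook), the snippet below hides a fraction of the tokens behind a placeholder and keeps the hidden words as prediction targets. The `[MASK]` placeholder, the `cloze_example` helper and the 15% ratio are illustrative assumptions.

```python
import random

def cloze_example(tokens, mask_ratio=0.15, mask_token="[MASK]", seed=0):
    """Hide a random subset of tokens and return (noisy_input, targets).

    targets[i] is the original word at masked positions and None elsewhere,
    so a model would only be scored on the hidden words.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    n_masked = max(1, int(mask_ratio * len(tokens)))
    masked_positions = set(rng.sample(range(len(tokens)), n_masked))
    noisy = [mask_token if i in masked_positions else w for i, w in enumerate(tokens)]
    targets = [w if i in masked_positions else None for i, w in enumerate(tokens)]
    return noisy, targets

tokens = "this test consists of assessing the dissolution time".split()
noisy, targets = cloze_example(tokens)
print(noisy)
print(targets)
```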

# Packages

In [2]:
from __future__ import unicode_literals, print_function, division
import sys
import warnings
import os
from io import open
import unicodedata
import string
import time
import math
import re
import random
import pickle
import copy
from unidecode import unidecode
import itertools
import matplotlib
import matplotlib.pyplot as plt


# for special math operation
from sklearn.preprocessing import normalize


# for manipulating data 
import numpy as np
#np.set_printoptions(threshold=np.nan)
import pandas as pd
import bcolz # see https://bcolz.readthedocs.io/en/latest/intro.html


# for text processing
import gensim
from gensim.models import KeyedVectors
#import spacy
import nltk
#nltk.download()
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem.porter import PorterStemmer


# for deep learning
import torch
import torch.nn as nn
from torch.autograd import Variable
from torch import optim
import torch.nn.functional as F
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
#device = torch.device("cpu")

warnings.filterwarnings("ignore")
print('python version :', sys.version)
print('pytorch version :', torch.__version__)
print('DL device :', device)



python version : 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
pytorch version : 1.4.0
DL device : cuda


In [3]:
path_to_NLP = 'C:\\Users\\Jb\\Desktop\\NLP'

In [4]:
sys.path.append(path_to_NLP + '\\libDL4NLP')

<a id="corpus"></a>

# Corpus

[Back to top](#plan)

The text is imported and stored as a list, where each element is one text represented as a list of words.<br> Once imported, the corpus therefore has the form:<br>

- corpus = [text]<br>
- text   = [word]<br>
- word   = str<br>
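
To make the nesting concrete, here is a toy illustration with two made-up texts. The strings and the `to_words` helper are placeholders, not data or code from the corpus used in this notebook.

```python
def to_words(raw_text):
    """Lowercase a raw string and split it on whitespace into a list of words."""
    return raw_text.lower().split()

raw_texts = ["The vaccine is freeze dried", "Add the diluent"]
toy_corpus = [to_words(t) for t in raw_texts]   # corpus = [text]

# text = [word], word = str
print(toy_corpus)
# [['the', 'vaccine', 'is', 'freeze', 'dried'], ['add', 'the', 'diluent']]
```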

In [5]:
def cleanSentence(sentence): # -------------------------  str
    sw = ['']
    #sw += nltk.corpus.stopwords.words('english')
    #sw += nltk.corpus.stopwords.words('french')

    def unicodeToAscii(s):
        """Turn a Unicode string to plain ASCII, thanks to http://stackoverflow.com/a/518232/2809427"""
        return ''.join( c for c in unicodedata.normalize('NFD', s)
                        if unicodedata.category(c) != 'Mn')

    def normalizeString(s):
        '''Remove rare symbols from a string'''
        s = unicodeToAscii(s.lower().strip()) # 
        #s = re.sub(r"[^a-zA-Z\.\(\)\[\]]+", r" ", s)  # 'r' before a string is for 'raw' # ?&\%\_\- removed # set('''.,:;()*#&-_%!?/\'")''')
        return s

    def wordTokenizerFunction():
        # base version
        function = lambda sentence : sentence.strip().split()

        # nltk version
        #function = word_tokenize    
        return function

    # 1 - special characters
    def clean_sentence_punct(text): # --------------  str
        text = normalizeString(text)
        # remove the trailing punctuation mark
        if (len(text) > 0 and text[-1] in ['.', ',', ';', ':', '!', '?']) : text = text[:-1]

        text = text.replace(r'(', r' ( ')
        text = text.replace(r')', r' ) ')
        text = text.replace(r'[', r' [ ')
        text = text.replace(r']', r' ] ')
        text = text.replace(r'<', r' < ')
        text = text.replace(r'>', r' > ')

        text = text.replace(r':', r' : ')
        text = text.replace(r';', r' ; ')
        for i in range(5) :
            text = re.sub(r'(?P<val1>[0-9])\.(?P<val2>[0-9])', r'\g<val1>__-__\g<val2>', text)
            text = re.sub(r'(?P<val1>[0-9]),(?P<val2>[0-9])', r'\g<val1>__-__\g<val2>', text)
        text = text.replace(r',', ' , ')
        text = text.replace(r'.', ' . ')
        for i in range(5) : text = re.sub(r'(?P<val1>[p0-9])__-__(?P<val2>[p0-9])', r'\g<val1>.\g<val2>', text)
        text = re.sub(r'(?P<val1>[0-9]) \. p \. (?P<val2>[0-9])', r'\g<val1>.p.\g<val2>', text)
        text = re.sub(r'(?P<val1>[0-9]) \. s \. (?P<val2>[0-9])', r'\g<val1>.s.\g<val2>', text)

        text = text.replace(r'"', r' " ')
        text = text.replace(r'’', r" ' ")
        text = text.replace(r'”', r' " ')
        text = text.replace(r'“', r' " ')
        text = text.replace(r'/', r' / ')

        text = re.sub('(…)+', ' … ', text)
        text = text.replace('≤', ' ≤ ')          
        text = text.replace('≥', ' ≥ ')
        text = text.replace('°c', ' °c ')
        text = text.replace('°C', ' °c ')
        text = text.replace('ºc', ' °c ')
        text = text.replace('n°', 'n° ')
        text = text.replace('%', ' % ')
        text = text.replace('*', ' * ')
        text = text.replace('+', ' + ')
        text = text.replace('-', ' - ')
        text = text.replace('_', ' ')
        text = text.replace('®', ' ')
        text = text.replace('™', ' ')
        text = text.replace('±', ' ± ')
        text = text.replace('÷', ' ÷ ')
        text = text.replace('–', ' - ')
        text = text.replace('μg', ' µg')
        text = text.replace('µg', ' µg')
        text = text.replace('µl', ' µl')
        text = text.replace('μl', ' µl')
        text = text.replace('µm', ' µm')
        text = text.replace('μm', ' µm')
        text = text.replace('ppm', ' ppm')
        text = re.sub(r'(?P<val1>[0-9])mm', r'\g<val1> mm', text)
        text = re.sub(r'(?P<val1>[0-9])g', r'\g<val1> g', text)
        text = text.replace('nm', ' nm')

        text = re.sub(r'fa(?P<val1>[0-9])', r'fa \g<val1>', text)
        text = re.sub(r'g(?P<val1>[0-9])', r'g \g<val1>', text)
        text = re.sub(r'n(?P<val1>[0-9])', r'n \g<val1>', text)
        text = re.sub(r'p(?P<val1>[0-9])', r'p \g<val1>', text)
        text = re.sub(r'q_(?P<val1>[0-9])', r'q_ \g<val1>', text)
        text = re.sub(r'u(?P<val1>[0-9])', r'u \g<val1>', text)
        text = re.sub(r'ud(?P<val1>[0-9])', r'ud \g<val1>', text)
        text = re.sub(r'ui(?P<val1>[0-9])', r'ui \g<val1>', text)

        text = text.replace('=', ' ')
        text = text.replace('!', ' ')
        text = text.replace('-', ' ')
        text = text.replace(r' , ', ' ')
        text = text.replace(r' . ', ' ')

        text = re.sub(r'(?P<val>[0-9])ml', r'\g<val> ml', text)
        text = re.sub(r'(?P<val>[0-9])mg', r'\g<val> mg', text)

        for i in range(5) : text = re.sub('( [0-9]+ )', ' ', text)
        #text = re.sub('cochran(\S)*', 'cochran ', text)
        return text

    # 3 - word splitting
    def wordSplit(sentence, tokenizeur): # ------------- [str]
        return tokenizeur(sentence)

    # 4 - stopword removal (lowercasing is already done in normalizeString)
    def stopwordsRemoval(sentence, sw): # ------------- [[str]]
        return [word for word in sentence if word not in sw]

    # 6 - spelling correction (unused here; relies on a `spelling` object that is not defined in this notebook)
    def correction(text):
        def correct(word):
            return spelling.suggest(word)[0]
        list_of_list_of_words = [[correct(word) for word in sentence] for sentence in text]
        return list_of_list_of_words

    # 7 - stemming
    def stemming(text): # ------------------------- [[str]]
        list_of_list_of_words = [[PorterStemmer().stem(word) for word in sentence if word not in sw] for sentence in text]
        return list_of_list_of_words


    tokenizeur = wordTokenizerFunction()
    sentence = clean_sentence_punct(str(sentence))
    sentence = wordSplit(sentence, tokenizeur)
    sentence = stopwordsRemoval(sentence, sw)
    #text = correction(text)
    #text = stemming(text)
    return sentence


def importWords(file_name) :
    def cleanDatabase(db):
        words = ['.']
        title = ''
        for pair in db :
            #print(pair)
            current_title = pair[0].split(' | ')[-1]
            if current_title != title :
                words += cleanSentence(current_title) + ['.']
                title  = current_title
            words += cleanSentence(str(pair[1]).split(' | ')[-1]) + ['.']
        return words

    df = pd.read_excel(file_name, header = None)
    headers = [i for i, titre in enumerate(df.iloc[0,:].values) if i in [1, 2] or titre == 'score manuel'] 
    db = df.iloc[1:, headers].values.tolist()
    db = [el[:2] for el in db if el[-1] in [0,1, 10]]
    words = cleanDatabase(db)
    return words


def importAllWords(path_to_data) :
    corpus = []
    reps = os.listdir(path_to_data)
    for rep in reps :
        files = os.listdir(path_to_data + '\\' + rep)
        for file in files :
            file_name = path_to_data + '\\' + rep + '\\' + file
            corpus.append(importWords(file_name))
    return corpus



def importSentences(file_name) :
    def cleanDatabase(db):
        sentences = []
        title = ''
        for pair in db :
            current_title = pair[0].split(' | ')[-1]
            if current_title != title :
                sentences.append(cleanSentence(current_title))
                title = current_title
            sentences.append(cleanSentence(str(pair[1]).split(' | ')[-1]))
        return sentences

    df = pd.read_excel(file_name, header = None)
    headers = [i for i, titre in enumerate(df.iloc[0,:].values) if i in [1, 2] or titre == 'score manuel'] 
    db = df.iloc[1:, headers].values.tolist()
    db = [el[:2] for el in db if el[-1] in [0,1, 10]]
    sentences = cleanDatabase(db)
    return sentences


def importAllSentences(path_to_data) :
    corpus = []
    reps = os.listdir(path_to_data)
    for rep in reps :
        files = os.listdir(path_to_data + '\\' + rep)
        for file in files :
            file_name = path_to_data + '\\' + rep + '\\' + file
            corpus += importSentences(file_name)
    return corpus

In [6]:
corpus = importAllWords(path_to_NLP + '\\data\\AMM')
len(corpus)

510

In [7]:
sentences = importAllSentences(path_to_NLP + '\\data\\AMM')
len(sentences)

31574

In [10]:
sentences[10]

['this',
 'test',
 'consists',
 'of',
 'assessing',
 'the',
 'dissolution',
 'time',
 'of',
 'the',
 'freeze',
 'dried',
 'yellow',
 'fever',
 'vaccine',
 'after',
 'adding',
 'the',
 'suitable',
 'diluent',
 'directly',
 'into',
 'the',
 'original',
 'container']

<a id="modules"></a>

# 1 Modules

### 1.1 Word Embedding module

[Back to top](#plan)

Word Embedding modules and their pre-training are detailed in **Part I - 1**. Here we use a FastText model trained with the Skip-Gram objective.
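
What distinguishes FastText from plain Word2Vec is that each word is represented through its character n-grams, which lets the model build vectors for rare or unseen words. The sketch below only illustrates how such n-grams can be extracted, with `<` and `>` as word boundary markers. The `char_ngrams` helper and its n-gram range are assumptions chosen for illustration and do not reproduce gensim's internals.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)   # ['<wh', 'whe', 'her', 'ere', 're>']
```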

In [11]:
from libDL4NLP.models.Word_Embedding import Word2Vec as myWord2Vec
from libDL4NLP.models.Word_Embedding import Word2VecConnector
from libDL4NLP.utils.Lang import Lang

In [12]:
from gensim.models.fasttext import FastText
from gensim.test.utils import datapath, get_tmpfile

In [31]:
word2vec = FastText(size = 75, 
                    window = 5, 
                    min_count = 3, 
                    negative = 20,
                    sg = 1)

In [32]:
word2vec.build_vocab(sentences)

In [33]:
len(word2vec.wv.vocab)

4662

In [34]:
word2vec.train(sentences = sentences, 
               epochs = 50,
               total_examples = word2vec.corpus_count)
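
The `sg = 1` argument used above selects the Skip-Gram objective, in which each word is trained to predict the words surrounding it within the window. As a rough sketch, the training pairs can be enumerated as follows. The `skipgram_pairs` helper is hypothetical, and the negative sampling requested through `negative = 20` is left out.

```python
def skipgram_pairs(sentence, window=2):
    """List (center, context) pairs for every word and its neighbours."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

pairs = skipgram_pairs(["yellow", "fever", "vaccine"], window=1)
print(pairs)
# [('yellow', 'fever'), ('fever', 'yellow'), ('fever', 'vaccine'), ('vaccine', 'fever')]
```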

### 1.2 Contextualization module

[Back to top](#plan)


In [18]:
from libDL4NLP.modules import RecurrentEncoder

<a id="model"></a>

# 2 Sentence denoising Model

[Back to top](#plan)


In [44]:
class SentenceDenoiser(nn.Module) :
    def __init__(self, device, tokenizer, word2vec, 
                 hidden_dim = 100, 
                 n_layers = 1, 
                 dropout = 0, 
                 class_weights = None, 
                 optimizer = optim.SGD
                 ):
        super(SentenceDenoiser, self).__init__()
        
        # embedding
        self.tokenizer = tokenizer
        self.word2vec  = word2vec
        self.context   = RecurrentEncoder(self.word2vec.output_dim, hidden_dim, n_layers, dropout, bidirectional = True)
        self.out       = nn.Linear(self.context.output_dim, self.word2vec.lang.n_words)
        self.act       = F.softmax
        
        # optimizer
        self.ignore_index = self.word2vec.lang.getIndex('PADDING_WORD')
        self.criterion = nn.NLLLoss(reduction = 'sum',  # replaces the deprecated size_average = False
                                    ignore_index = self.ignore_index, 
                                    weight = class_weights)
        self.optimizer = optimizer
        
        # load to device
        self.device = device
        self.to(device)
        
    def nbParametres(self) :
        return sum([p.data.nelement() for p in self.parameters() if p.requires_grad == True])
    
    def predict_proba(self, words):
        embeddings = self.word2vec.twin(words, self.device) # dim = (1, input_length, embedding_dim)
        hiddens, _ = self.context(embeddings)               # dim = (1, input_length, hidden_dim)
        probs      = self.act(self.out(hiddens), dim = 2)   # dim = (1, input_length, lang_size)
        return probs

    # main method
    def forward(self, sentence = '.', color = '\033[94m'):
        def addColor(w1, w2, color) : return color + w2 + '\033[0m' if w1 != w2 else w2
        words  = self.tokenizer(sentence)
        probs  = self.predict_proba(words).squeeze(0) # dim = (input_length, lang_size)
        inds   = [probs[i].data.topk(1)[1].item() for i in range(probs.size(0))]
        new_ws = [self.word2vec.lang.index2word[ind] for ind in inds]
        print(' '.join([addColor(w1, w2, color) for w1, w2 in zip(words, new_ws)]))
        return

    # load data
    def generatePackedSentences(self, 
                                sentences, 
                                batch_size = 32, 
                                mask_ratio = 0.15,
                                predict_masked_only = True,
                                max_sentence_length = 50,
                                tol = 10,
                                seed = 42) :
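        def maskInput(index, b) :
            # for a selected position (b is True): UNK with prob. 0.75,
            # random vocabulary index with prob. 0.25 * 0.90 = 0.225, unchanged otherwise
            if   b and random.random() > 0.25 : return self.word2vec.lang.getIndex('UNK')
            elif b and random.random() > 0.10 : return random.choice(list(self.word2vec.twin.lang.word2index.values()))
            else                              : return index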
            
        def maskOutput(index, b) :
            return index if b else self.ignore_index
        
        def splitLongs(words, threshold = 50, tol = 10):
            news = []
            for i in range(0, len(words), threshold) :
                if len(words)-i-threshold > tol : 
                    news.append(words[i : i + threshold])
                else : 
                    news.append(words[i:])
                    break
            return news
        
        packed_data = []
        random.seed(seed)
        # prepare sentences
        #sentences = [self.tokenizer(s) for s in sentences]
        sentences = [[self.word2vec.lang.getIndex(w) for w in s] for s in sentences]
        sentences = [[w for w in words if w is not None] for words in sentences]
        sentences = [s for S in sentences for s in splitLongs(S, max_sentence_length, tol) if len([w for w in s if w != self.word2vec.lang.getIndex('UNK')]) > 1]
        sentences.sort(key = lambda s: len(s), reverse = True)
        # collect packs
        for i in range(0, len(sentences), batch_size) :
            pack = sentences[i:i + batch_size]
            # prepare mask
            mask = [[i for i, w in enumerate(p) if w != self.word2vec.lang.getIndex('UNK')] for p in pack]
            mask = [random.sample(m, k = int(mask_ratio*len(m) +1)) for m in mask]
            # prepare input and target packs
            pack0 = [[ maskInput(s[i], i in m) for i in range(len(s))] for s, m in zip(pack, mask)]
            pack1 = [[maskOutput(s[i], i in m) for i in range(len(s))] for s, m in zip(pack, mask)] \
                    if predict_masked_only else \
                    [[maskOutput(w, w != self.word2vec.lang.getIndex('UNK')) for w in p] for p in pack]
            lengths = torch.tensor([len(p) for p in pack0]) # size = (batch_size) 
            # pad
            pack0 = list(itertools.zip_longest(*pack0, fillvalue = self.ignore_index)) 
            pack1 = list(itertools.zip_longest(*pack1, fillvalue = self.ignore_index))
            # turn into torch variables
            pack0 = Variable(torch.LongTensor(pack0).transpose(0, 1))   # size = (batch_size, max_length)
            pack1 = Variable(torch.LongTensor(pack1).transpose(0, 1))   # size = (batch_size, max_length) 
            # store pack
            packed_data.append([[pack0, lengths], pack1])
        return packed_data
    
    # fit model
    def fit(self, batches, iters = None, epochs = None, lr = 0.025, random_state = 42,
              print_every = 10, compute_accuracy = True):
        """Trains the model on the given batches, for either `iters` randomly drawn batches or `epochs` full passes."""
        def asMinutes(s):
            m = math.floor(s / 60)
            s -= m * 60
            return '%dm %ds' % (m, s)

        def timeSince(since, percent):
            now = time.time()
            s = now - since
            rs = s/percent - s
            return '%s (- %s)' % (asMinutes(s), asMinutes(rs))
        
        def computeLogProbs(batch) :
            embeddings = self.word2vec.embedding(batch[0].to(self.device))
            hiddens,_  = self.context(embeddings, lengths = batch[1].to(self.device)) # dim = (batch_size, input_length, hidden_dim)
            log_probs  = F.log_softmax(self.out(hiddens), dim = 2)                    # dim = (batch_size, input_length, lang_size)
            return log_probs

        def computeAccuracy(log_probs, targets) :
            total   = np.sum(targets.data.cpu().numpy() != self.ignore_index)
            success = sum([self.ignore_index != targets[i, j].item() == log_probs[i, :, j].data.topk(1)[1].item() \
                           for i in range(targets.size(0)) \
                           for j in range(targets.size(1)) ])
            return  success * 100 / total

        def printScores(start, iter, iters, tot_loss, tot_acc, print_every, compute_accuracy) :
            avg_loss = tot_loss / print_every
            avg_acc  = tot_acc / print_every
            if compute_accuracy : print(timeSince(start, iter / iters) + ' ({} {}%) loss : {:.3f}  accuracy : {:.1f} %'.format(iter, int(iter / iters * 100), avg_loss, avg_acc))
            else                : print(timeSince(start, iter / iters) + ' ({} {}%) loss : {:.3f}                     '.format(iter, int(iter / iters * 100), avg_loss))
            return 0, 0

        def trainLoop(batch, optimizer, compute_accuracy = True):
            """Performs one training step: forward pass, backward pass and weight update."""
            optimizer.zero_grad()
            self.zero_grad()
            log_probs = computeLogProbs(batch[0]).transpose(1, 2) # dim = (batch_size, lang_size, input_length)
            targets   = batch[1].to(self.device)                  # dim = (batch_size, input_length)
            loss      = self.criterion(log_probs, targets)
            loss.backward()
            optimizer.step() 
            accuracy = computeAccuracy(log_probs, targets) if compute_accuracy else 0
            return float(loss.item() / np.sum(targets.data.cpu().numpy() != self.ignore_index)), accuracy
        
        # --- main ---
        self.train()
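        random.seed(random_state)     # seeds random.choice, used when drawing batches by iteration
        np.random.seed(random_state)  # seeds np.random.shuffle, used when training by epoch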
        start = time.time()
        optimizer = self.optimizer([param for param in self.parameters() if param.requires_grad == True], lr = lr)
        tot_loss = 0  
        tot_acc  = 0
        if epochs is None :
            for iter in range(1, iters + 1):
                batch = random.choice(batches)
                loss, acc = trainLoop(batch, optimizer, compute_accuracy)
                tot_loss += loss
                tot_acc += acc      
                if iter % print_every == 0 : 
                    tot_loss, tot_acc = printScores(start, iter, iters, tot_loss, tot_acc, print_every, compute_accuracy)
        else :
            iter = 0
            iters = len(batches) * epochs
            for epoch in range(1, epochs + 1):
                print('epoch ' + str(epoch))
                np.random.shuffle(batches)
                for batch in batches :
                    loss, acc = trainLoop(batch, optimizer, compute_accuracy)
                    tot_loss += loss
                    tot_acc += acc 
                    iter += 1
                    if iter % print_every == 0 : 
                        tot_loss, tot_acc = printScores(start, iter, iters, tot_loss, tot_acc, print_every, compute_accuracy)
        return
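
The `maskInput` helper inside `generatePackedSentences` draws two random numbers in sequence, so the resulting proportions are not immediately obvious: a selected position becomes `UNK` with probability 0.75, is replaced by a random vocabulary index with probability 0.25 × 0.90 = 0.225, and is left unchanged with probability 0.025. The standalone simulation below checks this empirically. `mask_decision` and its labels are stand-ins introduced for the illustration.

```python
import random

def mask_decision(rng):
    """Mimic the two-step draw of maskInput for a position selected for masking."""
    if rng.random() > 0.25:
        return "unk"        # replaced by the UNK index
    elif rng.random() > 0.10:
        return "random"     # replaced by a random vocabulary index
    else:
        return "unchanged"  # original index kept

rng = random.Random(42)
n_draws = 100_000
counts = {"unk": 0, "random": 0, "unchanged": 0}
for _ in range(n_draws):
    counts[mask_decision(rng)] += 1

shares = {k: v / n_draws for k, v in counts.items()}
print(shares)   # roughly 0.75 / 0.225 / 0.025
```

For comparison, the BERT paper uses an 80% / 10% / 10% split between mask token, random word and unchanged word.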

#### Training

In [49]:
denoiser = SentenceDenoiser(device = torch.device('cpu'), # device
                            tokenizer = lambda s : s.split(' '),
                            word2vec = Word2VecConnector(word2vec),
                            hidden_dim = 75, 
                            n_layers = 3, 
                            dropout = 0.1,
                            optimizer = optim.AdamW)

denoiser.nbParametres()

627164

In [48]:
batches = []
for seed in [42, 854, 7956, 657, 124] :
    batches += denoiser.generatePackedSentences(sentences, 
                                                batch_size = 16,
                                                mask_ratio = 0.15,
                                                predict_masked_only = True,
                                                max_sentence_length = 50,
                                                tol = 10,
                                                seed = seed)
    print(len(batches))
len(batches)

1830
3660
5490
7320
9150


9150
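
Two steps of `generatePackedSentences` are worth seeing in isolation: long sentences are cut into chunks of `max_sentence_length` words, with the last chunk allowed to run up to `tol` words longer rather than leaving a very short tail, and each pack is then padded to a common length. The sketch below re-implements both steps on plain Python lists. `split_longs` mirrors the notebook's `splitLongs`, while `pad_pack` and `PAD` are illustrative names.

```python
import itertools

def split_longs(words, threshold=50, tol=10):
    """Cut a sequence into chunks of `threshold` items; the last chunk may
    hold up to `threshold + tol` items so that no very short tail is produced."""
    chunks = []
    for i in range(0, len(words), threshold):
        if len(words) - i - threshold > tol:
            chunks.append(words[i:i + threshold])
        else:
            chunks.append(words[i:])
            break
    return chunks

def pad_pack(pack, pad_value):
    """Pad variable-length sequences to the length of the longest one."""
    columns = itertools.zip_longest(*pack, fillvalue=pad_value)  # time-major
    return [list(row) for row in zip(*columns)]                  # batch-major

PAD = 0  # placeholder padding index
print([len(c) for c in split_longs(list(range(105)))])   # [50, 55]
print([len(c) for c in split_longs(list(range(120)))])   # [50, 50, 20]
print(pad_pack([[5, 6, 7], [8, 9], [4]], PAD))            # [[5, 6, 7], [8, 9, 0], [4, 0, 0]]
```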

In [50]:
denoiser.fit(batches, epochs = 1, lr = 0.001, print_every = 100)

epoch 1
0m 13s (- 20m 51s) (100 1%) loss : 6.979  accuracy : 6.1 %
0m 25s (- 19m 5s) (200 2%) loss : 6.401  accuracy : 6.5 %
0m 39s (- 19m 13s) (300 3%) loss : 6.426  accuracy : 7.7 %
0m 51s (- 18m 46s) (400 4%) loss : 6.137  accuracy : 9.5 %
1m 5s (- 18m 54s) (500 5%) loss : 6.047  accuracy : 9.0 %
1m 17s (- 18m 31s) (600 6%) loss : 5.885  accuracy : 10.7 %
1m 31s (- 18m 22s) (700 7%) loss : 5.729  accuracy : 12.7 %
1m 44s (- 18m 12s) (800 8%) loss : 5.481  accuracy : 14.5 %
1m 59s (- 18m 13s) (900 9%) loss : 5.596  accuracy : 14.0 %
2m 13s (- 18m 4s) (1000 10%) loss : 5.339  accuracy : 16.4 %
2m 27s (- 18m 1s) (1100 12%) loss : 5.288  accuracy : 17.0 %
2m 40s (- 17m 42s) (1200 13%) loss : 5.274  accuracy : 17.6 %
2m 54s (- 17m 33s) (1300 14%) loss : 5.218  accuracy : 17.3 %
3m 9s (- 17m 28s) (1400 15%) loss : 5.125  accuracy : 18.8 %
3m 22s (- 17m 10s) (1500 16%) loss : 4.939  accuracy : 18.6 %
3m 35s (- 16m 54s) (1600 17%) loss : 4.913  accuracy : 21.5 %
3m 48s (- 16m 40s) (1700 18%
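
The two figures in each log line above are computed over scored positions only: for every batch, the loss is the summed negative log-likelihood divided by the number of targets that differ from the ignored index, and the accuracy is the share of those targets matched by the top-1 prediction. A pure-Python sketch of the same bookkeeping on a toy example follows. `masked_nll_and_accuracy` and the numbers are made up for illustration, whereas the notebook relies on `nn.NLLLoss` and tensors.

```python
import math

def masked_nll_and_accuracy(log_probs, targets, ignore_index):
    """log_probs: one list of per-class log-probabilities per position.
    targets: one class index per position, or ignore_index when not scored."""
    total_nll, correct, scored = 0.0, 0, 0
    for position_log_probs, target in zip(log_probs, targets):
        if target == ignore_index:
            continue  # padding or non-target position: no contribution
        scored += 1
        total_nll -= position_log_probs[target]
        predicted = max(range(len(position_log_probs)),
                        key=position_log_probs.__getitem__)
        correct += int(predicted == target)
    if scored == 0:
        return 0.0, 0.0
    return total_nll / scored, 100 * correct / scored

IGNORE = 0
probs = [[0.1, 0.7, 0.2],    # scored, target 1, top-1 prediction 1
         [0.2, 0.3, 0.5],    # not scored
         [0.1, 0.6, 0.3]]    # scored, target 2, top-1 prediction 1
log_probs = [[math.log(p) for p in row] for row in probs]
loss, accuracy = masked_nll_and_accuracy(log_probs, [1, IGNORE, 2], IGNORE)
print(round(loss, 3), accuracy)
```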

In [53]:
batches2 = []
for seed in [627, 87569, 21, 334, 2] :
    batches2 += denoiser.generatePackedSentences(sentences, 
                                                batch_size = 16,
                                                mask_ratio = 0.15,
                                                predict_masked_only = False,
                                                max_sentence_length = 50,
                                                tol = 10,
                                                seed = seed)
    print(len(batches2))
len(batches2)

1830
3660
5490
7320
9150


9150
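
Apart from the seeds, the batches built just above differ from the first set only through `predict_masked_only = False`: instead of scoring the model on the masked positions alone, every position holding a known word becomes a target. The sketch below contrasts the two target layouts on a toy index sequence. `build_targets`, `UNK` and `IGNORE` are illustrative names and values, not the notebook's.

```python
UNK = 1      # placeholder index of the unknown word
IGNORE = 0   # placeholder index skipped by the loss (padding index)

def build_targets(indices, masked_positions, predict_masked_only=True):
    """Return the target sequence, with IGNORE at positions that are not scored."""
    if predict_masked_only:
        return [w if i in masked_positions else IGNORE
                for i, w in enumerate(indices)]
    return [w if w != UNK else IGNORE for w in indices]

sentence = [12, 7, UNK, 30, 9]
masked = {1, 3}
print(build_targets(sentence, masked, predict_masked_only=True))    # [0, 7, 0, 30, 0]
print(build_targets(sentence, masked, predict_masked_only=False))   # [12, 7, 0, 30, 9]
```

Since most scored positions are then left unmasked in the input, this may be part of why the run on these batches reports a lower loss.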

In [55]:
denoiser.fit(batches2, epochs = 1, lr = 0.0001, print_every = 100)

epoch 1
0m 15s (- 23m 28s) (100 1%) loss : 2.759  accuracy : 46.2 %
0m 29s (- 21m 43s) (200 2%) loss : 2.612  accuracy : 50.0 %
0m 44s (- 21m 39s) (300 3%) loss : 2.727  accuracy : 49.1 %
0m 58s (- 21m 18s) (400 4%) loss : 2.546  accuracy : 51.9 %
1m 14s (- 21m 27s) (500 5%) loss : 2.491  accuracy : 52.8 %
1m 28s (- 21m 2s) (600 6%) loss : 2.413  accuracy : 54.7 %
1m 43s (- 20m 53s) (700 7%) loss : 2.345  accuracy : 55.1 %
1m 58s (- 20m 40s) (800 8%) loss : 2.252  accuracy : 56.8 %
2m 14s (- 20m 36s) (900 9%) loss : 2.346  accuracy : 56.0 %
2m 29s (- 20m 18s) (1000 10%) loss : 2.172  accuracy : 59.5 %
2m 45s (- 20m 11s) (1100 12%) loss : 2.226  accuracy : 58.1 %
2m 59s (- 19m 47s) (1200 13%) loss : 2.177  accuracy : 59.7 %
3m 14s (- 19m 33s) (1300 14%) loss : 2.201  accuracy : 59.1 %
3m 30s (- 19m 22s) (1400 15%) loss : 2.145  accuracy : 60.3 %
3m 44s (- 19m 3s) (1500 16%) loss : 2.054  accuracy : 62.2 %
3m 59s (- 18m 48s) (1600 17%) loss : 2.186  accuracy : 60.5 %
4m 13s (- 18m 32s) (

In [58]:
denoiser.fit(batches, epochs = 1, lr = 0.00001, print_every = 100)

epoch 1
0m 14s (- 21m 56s) (100 1%) loss : 3.914  accuracy : 31.7 %
0m 27s (- 20m 39s) (200 2%) loss : 3.909  accuracy : 33.0 %
0m 39s (- 19m 36s) (300 3%) loss : 3.889  accuracy : 31.4 %
0m 52s (- 19m 16s) (400 4%) loss : 3.764  accuracy : 33.2 %
1m 6s (- 19m 11s) (500 5%) loss : 3.825  accuracy : 31.1 %
1m 17s (- 18m 23s) (600 6%) loss : 3.789  accuracy : 33.2 %
1m 31s (- 18m 20s) (700 7%) loss : 3.692  accuracy : 33.1 %
1m 43s (- 17m 57s) (800 8%) loss : 3.704  accuracy : 33.9 %
1m 57s (- 17m 54s) (900 9%) loss : 3.659  accuracy : 33.8 %
2m 8s (- 17m 30s) (1000 10%) loss : 3.604  accuracy : 33.9 %
2m 21s (- 17m 17s) (1100 12%) loss : 3.685  accuracy : 33.7 %
2m 36s (- 17m 13s) (1200 13%) loss : 3.698  accuracy : 33.1 %
2m 48s (- 16m 59s) (1300 14%) loss : 3.809  accuracy : 32.8 %
3m 2s (- 16m 48s) (1400 15%) loss : 3.662  accuracy : 34.4 %
3m 15s (- 16m 38s) (1500 16%) loss : 3.786  accuracy : 31.3 %
3m 28s (- 16m 23s) (1600 17%) loss : 3.682  accuracy : 32.8 %
3m 40s (- 16m 8s) (17

In [59]:
# save
#torch.save(denoiser.state_dict(), path_to_NLP + '\\saves\\models\\DL4NLP_I4a_sentence_denoiser.pth')

# load
#denoiser.load_state_dict(torch.load(path_to_NLP + '\\saves\\models\\DL4NLP_I4a_sentence_denoiser.pth'))

#### Evaluation

In [61]:
denoiser.eval()
sentence = ' '.join(sentences[10]) #'what are you thinking of this'
print(sentence)
print('\n')
denoiser(sentence, color = '\033[93m')

this test consists of assessing the dissolution time of the freeze dried yellow fever vaccine after adding the suitable diluent directly into the original container


this test consists of **determining** the dissolution time of the freeze **yellow** yellow fever vaccine after adding the **endotoxin** **is** **prepared** into the **appropriate** container
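
The highlighting performed in `forward` boils down to comparing the input and the output word by word and wrapping the words that changed in ANSI color codes. Here is a self-contained sketch of that comparison. `highlight_changes` is an illustrative name, and the toy prediction is not model output.

```python
YELLOW, RESET = "\033[93m", "\033[0m"

def highlight_changes(original_words, predicted_words, color=YELLOW):
    """Join the predicted words, coloring those that differ from the input."""
    return " ".join(color + new + RESET if old != new else new
                    for old, new in zip(original_words, predicted_words))

original = "this test consists of assessing the time".split()
predicted = "this test consists of determining the time".split()
print(highlight_changes(original, predicted))
```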
