<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 30px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
  Deep Learning for NLP
  </div> 
  
<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 30px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
    <font color=orange>II - 1 </font>
  Text Classification
  </div> 

  <div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 20px; 
      text-align: center; 
      padding: 15px;">
  </div> 

  <div style=" float:right; 
      font-size: 12px; 
      line-height: 12px; 
  padding: 10px 15px 8px;">
  Jean-baptiste AUJOGUE
  </div> 

### Part I

1. Word Embedding

2. Sentence Classification

3. Language Modeling

4. Sequence Labelling


### Part II

1. <font color=orange>**Text Classification**</font>

2. Sequence to sequence



### Part III

1. Abstractive Summarization

2. Question Answering

3. Chatbot



***

<a id="plan"></a>

| | | | | |
|------|------|------|------|------|
| **Content** | [Corpus](#corpus) | [Modules](#modules) | [Model](#model) | [Open source models](#open_source_models) | 


# Overview

A high-quality GitHub repository covering Hierarchical Attention Networks is available [here](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Text-Classification). <br>
That repository does not include the recurrence across successive attention passes that the multi-hop attention mechanism presented here provides.<br>


# Packages

In [1]:
from __future__ import unicode_literals, print_function, division

import sys
import warnings
import os
from io import open
import unicodedata
import string
import time
import math
import re
import random
import pickle
import copy
from unidecode import unidecode
import itertools
import gc
import multiprocessing

import matplotlib
import matplotlib.pyplot as plt


# for special math operation
from sklearn.preprocessing import normalize


# for manipulating data 
import numpy as np
#np.set_printoptions(threshold=np.nan)
import pandas as pd
import bcolz # see https://bcolz.readthedocs.io/en/latest/intro.html
import pickle


# for text processing
import gensim
from gensim.models import KeyedVectors
#import spacy
import nltk
#nltk.download()
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem.porter import PorterStemmer


# for deep learning
import torch
import torch.nn as nn
from torch.autograd import Variable
from torch import optim
import torch.nn.functional as F
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


warnings.filterwarnings("ignore")
print('python version :', sys.version)
print('pytorch version :', torch.__version__)
print('DL device :', device)



python version : 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
pytorch version : 1.5.0
DL device : cuda


In [2]:
path_to_DL4NLP = os.path.dirname(os.getcwd())

In [3]:
sys.path.append(path_to_DL4NLP + '\\lib')

<a id="corpus"></a>

# Corpus

[Back to top](#plan)

The corpus is loaded and converted to a list in which each element represents one text, stored as a list of tokenized sentences (each a list of words) paired with its label.

In [4]:
df_AGnews_trn = pd.read_csv(path_to_DL4NLP + "\\data\\AG News\\train.csv", sep = ',', header = None, error_bad_lines = False)
df_AGnews_tst = pd.read_csv(path_to_DL4NLP + "\\data\\AG News\\test.csv" , sep = ',', header = None, error_bad_lines = False)

In [5]:
df_AGnews_trn.columns = ['index', 'title', 'description']
df_AGnews_tst.columns = ['index', 'title', 'description']

In [6]:
df_AGnews_trn.head()

| | index | title | description |
|------|------|------|------|
| 0 | 3 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... |
| 1 | 3 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... |
| 2 | 3 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... |
| 3 | 3 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export\f... |
| 4 | 3 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... |


In [7]:
def unicodeToAscii(s):
    return ''.join( c for c in unicodedata.normalize('NFD', s)
                    if unicodedata.category(c) != 'Mn')

def normalizeString(s):
    s = unicodeToAscii(s.strip())
    return s

def cleanSentence(s) :
    s = s.lower()
    s = s.replace('\\', ' ')
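    s = re.sub(r'[.!?]+ ', ' . ', s)                                  # isolate sentence-ending punctuation
    s = s.replace('%', ' % ')
    s = re.sub(r' [0-9]*\.[0-9] ', ' FLOAT ', ' ' + s + ' ').strip()  # decimals with a single digit after the point
    s = re.sub(r' [0-9,]*[0-9] ', ' INT ', ' ' + s + ' ').strip()     # integers, possibly with comma separators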
    
    for w in ['"', "'", '”', '“', '/', '(', ')', '[', ']', '<', '>', ':', ','] : s = s.replace(w, '')
    return s

def trueWord(w) :
    return len(w)>0 and re.sub('[^a-zA-Z0-9.,]', '', w) != ''

def tokenize(s) :
    s = normalizeString(s)
    s = cleanSentence(s)
    S = s.split('.')
    S = [nltk.tokenize.word_tokenize(s) for s in S]
    S = [[w for w in s if trueWord(w)] for s in S]
    S = [s for s in S if s != []]
    return S
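
To make the output format concrete, here is a minimal, dependency-free sketch of the same idea: lowercase the text, isolate sentence boundaries, split into sentences, then into words. It is an approximation only: `toy_tokenize` is a hypothetical name, a regular expression stands in for `nltk.tokenize.word_tokenize`, and the accent stripping and `FLOAT` / `INT` substitutions are left out.

```python
import re

def toy_tokenize(text):
    """Split a text into sentences, each given as a list of lowercase words."""
    text = text.lower()
    # turn runs of sentence-ending punctuation followed by a space into ' . '
    text = re.sub(r'[.!?]+ ', ' . ', text)
    # split on '.', then keep alphanumeric runs as words (stand-in for word_tokenize)
    sentences = [re.findall(r'[a-z0-9]+', s) for s in text.split('.')]
    # drop empty sentences
    return [s for s in sentences if s]

print(toy_tokenize("Oil prices soar. Stocks fall again!"))
# [['oil', 'prices', 'soar'], ['stocks', 'fall', 'again']]
```

Note that an abbreviation such as "St." also ends a sentence here, since splitting is done on every period; the `tokenize` function above behaves the same way in that respect.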

In [8]:
# subtract 1 from each label so that classes start at 0 (the 'index' column holds the class label)
labelled_sentences_trn = [[tokenize(s1 + ' . ' + s2), l-1] for s1, s2, l in zip(df_AGnews_trn["title"].values.tolist(), df_AGnews_trn["description"].values.tolist(), df_AGnews_trn["index"].values.tolist()) if tokenize(s1) != []]
labelled_sentences_tst = [[tokenize(s1 + ' . ' + s2), l-1] for s1, s2, l in zip(df_AGnews_tst["title"].values.tolist(), df_AGnews_tst["description"].values.tolist(), df_AGnews_tst["index"].values.tolist()) if tokenize(s1) != []]

<a id="modules"></a>

# 1 Modules

### 1.1 Word Embedding module

[Back to top](#plan)

_Remark_ : The pre-trained Word2vec models are the same as those used in **Part I - 2 Sentence Classification**.

<a id="word_level_custom"></a>


#### 1.1.1 Custom model

In [9]:
from libDL4NLP.models.Word_Embedding import Word2Vec as myWord2Vec
from libDL4NLP.models.Word_Embedding import Word2VecConnector
from libDL4NLP.utils.Lang import Lang

<a id="gensim"></a>

#### 1.1.2 Gensim model

In [10]:
from gensim.models import Word2Vec
from gensim.test.utils import datapath, get_tmpfile

In [11]:
gensim_word2vec = Word2VecConnector(Word2Vec.load(get_tmpfile(path_to_DL4NLP + "\\saves\\DL4NLP_I2_skipgram_gensim.model")))

<a id="fastText"></a>

#### 1.1.3 FastText model

In [12]:
from gensim.models.fasttext import FastText
from gensim.test.utils import datapath, get_tmpfile

### 1.2 Contextualization module

[Back to top](#plan)

The contextualization layer transforms a sequence of word vectors into another sequence of the same length, in which each output vector is a version of the corresponding input vector that has been contextualized with respect to its neighboring vectors.

<a id="bi_gru"></a>

#### 1.2.1 Bi-directional GRU contextualization

This module consists of a bi-directional _Gated Recurrent Unit_ (GRU) that supports packed sentences :

In [13]:
from libDL4NLP.modules import RecurrentEncoder
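
The source of `RecurrentEncoder` is not shown here, so the following is only a rough sketch of what bi-directional GRU contextualization does, not the library's implementation. It is written in plain Python to keep the arithmetic visible, and it assumes a simplified GRU with diagonal (per-dimension, scalar) weights instead of full weight matrices; all names (`gru_step`, `run_gru`, `bi_gru`, the parameter keys) are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, p):
    """One GRU update with diagonal weights p (a simplification of a real GRU)."""
    z = [sigmoid(p['wz'] * xi + p['uz'] * hi) for xi, hi in zip(x, h)]   # update gate
    r = [sigmoid(p['wr'] * xi + p['ur'] * hi) for xi, hi in zip(x, h)]   # reset gate
    n = [math.tanh(p['wn'] * xi + p['un'] * ri * hi)                     # candidate state
         for xi, ri, hi in zip(x, r, h)]
    return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

def run_gru(seq, p):
    """Run the GRU over a sequence of vectors and return every hidden state."""
    h = [0.0] * len(seq[0])
    states = []
    for x in seq:
        h = gru_step(x, h, p)
        states.append(h)
    return states

def bi_gru(seq, p_fwd, p_bwd):
    """Concatenate left-to-right and right-to-left states, position by position."""
    fwd = run_gru(seq, p_fwd)
    bwd = run_gru(seq[::-1], p_bwd)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]

params = {'wz': 0.5, 'uz': 0.5, 'wr': 0.5, 'ur': 0.5, 'wn': 1.0, 'un': 1.0}
words = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # three toy word vectors
contextualized = bi_gru(words, params, params)
print(len(contextualized), len(contextualized[0]))
```

The output has one vector per input word. The forward half of each vector depends only on the words up to that position, and the backward half only on the words from that position onward.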

### 1.3 Attention module

[Back to top](#plan)

<a id="attention"></a>

#### 1.3.1 Classical Attention Module


In [14]:
# from libDL4NLP.modules import Attention
# from libDL4NLP.misc    import HighwayQ

In [15]:
class HighwayQ(nn.Module):
    def __init__(self, dim, 
                 query_dim = 0, 
                 dropout = 0,
                 act = torch.tanh):
        super().__init__()
        
        # relevant quantities
        self.dim      = dim + query_dim
        self.transf   = nn.Linear(self.dim, dim)
        self.gate     = nn.Linear(self.dim, dim)
        self.dropout  = nn.Dropout(p = dropout)
        self.act      = act

    def forward(self, vect, 
                query = None):
        '''vect and (optional) query must be 3D tensors with same size along dim 0 and 1'''
        if query is not None : merge = torch.cat((vect, query), dim = 2)
        else                 : merge = vect
        transf = self.act(self.transf(merge))
        gate   = torch.sigmoid(self.gate(merge))
        vect   = gate * transf + (1 - gate) * vect
        vect   = self.dropout(vect)
        return vect
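
The gating computed in `forward` can be written compactly. The notation below is introduced here for illustration only: $x$ is one embedding vector, $q$ the optional query, $[x;q]$ their concatenation, $W_t, b_t$ and $W_g, b_g$ the parameters of `self.transf` and `self.gate`, and $\odot$ the element-wise product. The default activation (tanh) is assumed.

```latex
\begin{aligned}
t &= \tanh\left(W_t\,[x;q] + b_t\right) \\
g &= \sigma\left(W_g\,[x;q] + b_g\right) \\
y &= g \odot t + (1 - g) \odot x
\end{aligned}
```

Dropout is then applied to $y$. When the gate $g$ is close to 0 the input passes through unchanged, and when it is close to 1 the output is the transformed vector $t$.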

In [16]:
class Attention(nn.Module):
    def __init__(self, emb_dim, query_dim, 
                 dropout = 0, 
                 method = 'concat'): 
        super().__init__()
        
        # relevant quantities
        self.method  = method
        self.emb_dim = emb_dim
        self.out_dim = emb_dim
        
        # layers
        self.dropout    = nn.Dropout(p = dropout)
        self.attn_layer = HighwayQ(emb_dim, query_dim, dropout)
        self.attn_v     = nn.Linear(emb_dim, 1, bias = False)
        self.value      = HighwayQ(emb_dim, query_dim, dropout)
        self.act        = F.softmax
        
    def forward(self, embeddings, query):
        '''embeddings of size (batch_size, input_length, emb_dim)
           query      of size (batch_size, input_length, query_dim) for 'concat' (optional, may be None)
                      of size (batch_size, 1, emb_dim) for 'dot' (required)
        '''
        # query is optional for this method
        if self.method == 'concat' :
            weights = self.attn_layer(embeddings, query)       # size (batch_size, input_length, embedding_dim)
            weights = self.act(self.attn_v(weights), dim = 1)  # size (batch_size, input_length, 1)
            weights = weights.transpose(1, 2)                  # size (batch_size, 1, input_length)
            
        # query is necessary for this method
        elif self.method == 'dot' :
            query   = query.transpose(1, 2)                    # size (batch_size, query_dim, 1)
            weights = torch.bmm(embeddings, query)             # size (batch_size, input_length, 1)
            weights = self.act(weights, dim = 1)               # size (batch_size, input_length, 1)
            weights = torch.transpose(weights, 1, 2)           # size (batch_size, 1, input_length)
        applied = self.dropout(torch.bmm(weights, embeddings)) # size (batch_size, 1, embedding_dim)
        return applied, weights
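
As a sanity check on the `'dot'` branch, here is a minimal plain-Python sketch of dot-product attention for a single sequence: score each embedding against the query, normalize the scores with a softmax, and return the weighted sum. The function names are hypothetical, and batching and dropout are left out.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_attention(embeddings, query):
    """embeddings: list of vectors; query: one vector of the same dimension."""
    scores = [sum(e_i * q_i for e_i, q_i in zip(e, query)) for e in embeddings]
    weights = softmax(scores)
    dim = len(embeddings[0])
    # weighted sum of the embeddings, one output coordinate at a time
    applied = [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)]
    return applied, weights

applied, weights = dot_attention([[1.0, 0.0], [0.0, 1.0]], query=[10.0, 0.0])
print(weights, applied)
```

With a zero query all scores are equal, so the weights are uniform and the output is the plain average of the embeddings; a query aligned with one embedding concentrates the weight on it.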

#### 1.3.2 Multi-hop Hierarchical Attention Module

A combination of ideas originating from :

- Hierarchical Attention : [Hierarchical Attention Networks for Document Classification (2016)](https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf)
- Hopping mechanism : [End-To-End Memory Networks (2015)](https://arxiv.org/pdf/1503.08895.pdf)

In [17]:
#from libDL4NLP.modules import HAN

In [18]:
class HAN(nn.Module):
    '''This attention module is :
    
    - hierarchical, with a bi-GRU between the two attention levels
    - globally multi-hop : several passes can be performed to accumulate information
    '''
    def __init__(self, emb_dim, hid_dim, query_dim,
                 n_layer = 1,
                 hops = 1,
                 share = True,
                 transf = False,
                 dropout = 0):
        super(HAN, self).__init__()
        
        # dimensions (argument and attribute names follow the calls made in TextClassifier)
        self.emb_dim   = emb_dim
        self.query_dim = query_dim
        self.hid_dim   = hid_dim
        self.out_dim   = self.query_dim if (self.query_dim > 0 and \
                                           (transf or (hops > 1 and query_dim != hid_dim))) \
                                        else hid_dim
        self.hops = hops
        self.share = share
        
        
        # modules
        self.dropout = nn.Dropout(p = dropout)
        
        # first attention module (when shared, the same instance is reused at every hop)
        if share : self.attn1 = nn.ModuleList([Attention(emb_dim, query_dim, dropout)] * hops)
        else     : self.attn1 = nn.ModuleList([Attention(emb_dim, query_dim, dropout) for _ in range(hops)])
            
        # intermediate encoder module
        self.bigru = RecurrentEncoder(emb_dim, hid_dim, n_layer, dropout, bidirectional = True)
        
        # second attention module (when shared, the same instance is reused at every hop)
        if share : self.attn2 = nn.ModuleList([Attention(self.bigru.out_dim, query_dim, dropout)] * hops)
        else     : self.attn2 = nn.ModuleList([Attention(self.bigru.out_dim, query_dim, dropout) for _ in range(hops)])
            
        # accumulation step
        self.transf = nn.Linear(self.bigru.out_dim, self.out_dim, bias = False) if (transf or (self.hops > 1 and query_dim != self.bigru.out_dim)) else None
        
        
    def singlePass(self, packed_embeddings, query, attn1, attn2): 
        # first attention, over the words of each sentence
        query1 = query.expand(packed_embeddings.size(0), 
                              packed_embeddings.size(1), 
                              query.size(2)) if query is not None else None
        output, weights1 = attn1(packed_embeddings, query1) # size (n_sentences, 1, emb_dim)
        
        # intermediate biGRU, over the sequence of sentence vectors
        output, _ = self.bigru(output.transpose(0, 1))      # size (1, n_sentences, bigru output dim)
        output = self.dropout(output)
        
        # second attention, over the sentences
        query2 = query.expand(output.size(0), 
                              output.size(1), 
                              query.size(2)) if query is not None else None
        output, weights2 = attn2(output, query2)            # size (1, 1, bigru output dim)
        
        # output decision vector
        if self.transf is not None : output = self.transf(output) # size (1, 1, out_dim)
        if query is not None       : output = output + query
            
        # return
        return output, weights1, weights2
        
        
    def forward(self, packed_embeddings, query = None):
        weights1_list = []
        weights2_list = []
        
        # perform attention loops
        if packed_embeddings is not None :
            for hop in range(self.hops) :
                
                # perform attention pass
                query, weights1, weights2 = self.singlePass(packed_embeddings, query, self.attn1[hop], self.attn2[hop])
                weights1_list.append(weights1)
                weights2_list.append(weights2)
                
        # output decision vector
        return query, weights1_list, weights2_list
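
The control flow of `HAN.forward` (word-level attention, then sentence-level attention, then accumulation into the query, repeated over several hops) can be sketched in plain Python. This is a deliberately reduced illustration under stated assumptions: it uses dot-product scoring rather than the default `'concat'` method, and it omits the intermediate bi-GRU, the Highway layers, dropout and the optional linear transform. `attend` and `toy_han` are hypothetical names.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(vectors, query):
    """Dot-product attention: weighted sum of `vectors`, scored against `query`."""
    weights = softmax([sum(v_i * q_i for v_i, q_i in zip(v, query)) for v in vectors])
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

def toy_han(text, hops=2):
    """text: list of sentences, each a list of word vectors of equal dimension."""
    dim = len(text[0][0])
    query = [0.0] * dim                                   # initial query is all zeros
    for _ in range(hops):
        sentence_vectors = [attend(words, query) for words in text]  # word level
        document_vector = attend(sentence_vectors, query)            # sentence level
        query = [d + q for d, q in zip(document_vector, query)]      # accumulate
    return query

text = [[[1.0, 0.0], [0.0, 1.0]],   # sentence 1: two words
        [[1.0, 0.0]]]               # sentence 2: one word
print(toy_han(text, hops=1))
print(toy_han(text, hops=2))
```

On the first hop the zero query gives uniform weights, so the result is an average of averages. From the second hop on, the accumulated query steers attention toward the words and sentences that dominated the previous pass.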

#### Visualisation of attention


In [19]:
#from libDL4NLP.utils import HANViewerOnWords

In [20]:
def HANViewerOnWords(attn_words, attn_sentences, sentences, 
                     colors = 'Reds', n = 8) : 
    '''attn_words = [1D np.array]
       attn_sentences = 1D np.array
       sentences = [[str]]
    '''
    def generateColors(colors, n):
        colors = plt.get_cmap(colors)
        Triplets = []
        for i in range(n) :
            triplet = [int(j * 255) for j in colors(i/10)[:3]]  # RGB channels must stay within 0-255
            Triplets.append(triplet)
        return Triplets
    
    def weight2color(weight, triplets):
        n = len(triplets)
        for i in range(n):
            if weight >= i/n and weight <= (i+1)/n : 
                return triplets[i]
            
    def addColor(texte, RGB = (100,100,100)):
        new_texte = '\x1b[48;2;'  + str(RGB[0]) + ";" + str(RGB[1]) + ";" + str(RGB[2]) + "m"  + texte + "\x1b[0m"
        return new_texte
    
    # -- main --
    Triplets = generateColors(colors, n)
    Colored_text = ''
    for i, s in enumerate(sentences) :
        s_color = weight2color(attn_sentences[i], Triplets)
        Colored_text += addColor('  ', s_color) + ' '
        for j, w in enumerate(s) :
            color = weight2color(attn_words[i][j], Triplets)
            Colored_text += addColor(w, color) + ' '
        Colored_text += '\n' 
    print(Colored_text)
    return
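
The viewer relies on two small mechanisms: mapping a weight in [0, 1] to one of `n` color buckets, and wrapping text in a 24-bit ANSI background-color escape sequence. The sketch below isolates both without matplotlib. The palette is a hypothetical white-to-red ramp, not the `'Reds'` colormap, and `weight_to_bucket` assigns boundary values to the upper bucket, which differs slightly from `weight2color` above.

```python
def add_background(text, rgb):
    """Wrap `text` in an ANSI 24-bit background color escape sequence."""
    r, g, b = rgb
    return '\x1b[48;2;{};{};{}m{}\x1b[0m'.format(r, g, b, text)

def weight_to_bucket(weight, n):
    """Index of the bucket of width 1/n containing `weight`, clamped to n - 1."""
    return min(int(weight * n), n - 1)

# hypothetical palette: 8 shades from white to red
palette = [(255, 255 - 30 * i, 255 - 30 * i) for i in range(8)]

words   = ['greek', 'sprinters', 'quit']
weights = [0.1, 0.6, 0.3]            # toy attention weights
line = ' '.join(add_background(w, palette[weight_to_bucket(a, len(palette))])
                for w, a in zip(words, weights))
print(line)
```

The colors only render in terminals and notebook front ends that support 24-bit ANSI color codes; elsewhere the raw escape sequences are shown.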

<a id="model"></a>

# 2 Text Classifier

[Back to top](#plan)


### Model

In [21]:
#from libDL4NLP.models import TextClassifier

In [22]:
class TextClassifier(nn.Module) :
    def __init__(self, device, tokenizer, word2vec, 
                 hidden1_dim = 100,
                 hidden2_dim = 100,
                 n1_layer = 1, 
                 n2_layer = 1,
                 hops = 1, 
                 share = True,
                 transf = False,
                 n_class = 2, 
                 dropout = 0, 
                 class_weights = None, 
                 optimizer = optim.SGD
                ):
        super(TextClassifier, self).__init__()

        # embedding
        self.bin_mode  = (n_class == 'binary')
        self.tokenizer = tokenizer
        self.word2vec  = word2vec
        self.context   = RecurrentEncoder(
            emb_dim = self.word2vec.out_dim, 
            hid_dim = hidden1_dim, 
            n_layer = n1_layer, 
            dropout = dropout, 
            bidirectional = True)
        self.query_dim = (self.context.out_dim if hops > 1 else 0)
        self.attention = HAN(
            emb_dim   = self.context.out_dim,
            hid_dim   = hidden2_dim,
            query_dim = self.query_dim,
            n_layer   = n2_layer,
            hops      = hops,
            share     = share,
            transf    = transf,
            dropout   = dropout)
        self.out = nn.Linear(self.attention.out_dim, (1 if self.bin_mode else n_class))
        self.act = torch.sigmoid if self.bin_mode else F.softmax
        
        # optimizer
        if self.bin_mode : self.criterion = nn.BCEWithLogitsLoss(reduction = 'sum')
        else             : self.criterion = nn.NLLLoss(reduction = 'sum', weight = class_weights)
        self.optimizer = optimizer
        
        # load to device
        self.device = device
        self.to(device)
        

    def nbParametres(self) :
        return sum([p.data.nelement() for p in self.parameters() if p.requires_grad == True])
    
    # main method
    def forward(self, text, 
                attention_method = None) :
        '''classifies a text given as a string'''
        # tokenize, embed and contextualize
        sentences   = self.tokenizer(text)
        embeddings  = [self.word2vec(words, self.device).squeeze(0) for words in sentences] # list of tensors of size (n_words, embedding_dim)
        embeddings  = nn.utils.rnn.pad_sequence(embeddings, batch_first = True, padding_value = 0)  # size (n_sentences, max_n_words, embedding_dim)
        hiddens, _  = self.context(embeddings, enforce_sorted = False) # size (n_sentences, max_n_words, context output dim)

        # init query if necessary
        if self.query_dim > 0 : query = torch.zeros(1, 1, self.query_dim).to(self.device)
        else                  : query = None

        # compute attention
        attended, w1, w2 = self.attention(hiddens, query)
        if self.bin_mode : prediction = self.act(self.out(attended).view(-1)).data.topk(1)[0].item()
        else             : prediction = self.act(self.out(attended.squeeze(1)), dim = 1).data.topk(1)[1].item()

        # display attention weights
        if attention_method is not None :
            attn_words     = [np.array(s.view(-1).data.cpu().numpy()) for s in w1[0]]
            attn_sentences = np.array(w2[0].view(-1).data.cpu().numpy())
            attention_method(attn_words, attn_sentences, sentences)
        return prediction
    
    # load data
    def generatePaddedTexts(self, texts) :
        padded_data = []
        for text, label in texts :
            pack0 = [[self.word2vec.lang.getIndex(w) for w in words] for words in text]
            pack0 = [[w for w in words if w is not None] for words in pack0]
            lengths = torch.tensor([len(p) for p in pack0])               # size = (text_length) 
            pack0 = list(itertools.zip_longest(*pack0, fillvalue = self.word2vec.lang.getIndex('PADDING_WORD')))
            pack0 = Variable(torch.LongTensor(pack0).transpose(0, 1))     # size = (text_length, max_length)
            pack1 = [label]
            if self.bin_mode : pack1 = Variable(torch.FloatTensor(pack1)) # size = (1) 
            else             : pack1 = Variable(torch.LongTensor(pack1))  # size = (1) 
            padded_data.append([[pack0, lengths], pack1])
        return padded_data
    
    # compute model perf
    def compute_accuracy(self, texts) :
        def compute_batch_accuracy(batch, target) :
            torch.cuda.empty_cache()
            # embed and contextualize
            embeddings       = self.word2vec.embedding(batch[0].to(self.device))
            hiddens, _       = self.context(embeddings, lengths = batch[1].to(self.device), enforce_sorted = False)
            # init query if necessary
            if self.query_dim > 0 : query = torch.zeros(1, 1, self.query_dim).to(self.device)
            else                  : query = None
            # compute attention
            attended, w1, w2 = self.attention(hiddens, query)
            # compute score
            if self.bin_mode : 
                pred  = self.act(self.out(attended).view(-1)).data.topk(1)[0].item()
                score = (abs(target.item() - pred) < 0.5)
            else : 
                pred  = self.act(self.out(attended.squeeze(1)), dim = 1).data.topk(1)[1].item()
                score = (target.item() == pred)
            return score

        # --- main ---
        batches = self.generatePaddedTexts(texts)
        score = 0
        for batch, target in batches : score += compute_batch_accuracy(batch, target)
        return score * 100 / len(texts)
    
    # fit model
    def fit(self, batches, 
            iters = None, 
            epochs = None, 
            lr = 0.025, 
            random_state = 42,
            print_every = 10, 
            compute_accuracy = True):
        """Performs training over a given dataset and along a specified amount of loops"""
        def asMinutes(s):
            m = math.floor(s / 60)
            s -= m * 60
            return '%dm %ds' % (m, s)

        def timeSince(since, percent):
            now = time.time()
            s = now - since
            rs = s/percent - s
            return '%s (- %s)' % (asMinutes(s), asMinutes(rs))
        
        def computeLogProbs(batch) :
            # embed and contextualize
            embeddings       = self.word2vec.embedding(batch[0].to(self.device))
            hiddens, _       = self.context(embeddings, lengths = batch[1].to(self.device), enforce_sorted = False)
            # init query if necessary
            if self.query_dim > 0 : query = torch.zeros(1, 1, self.query_dim).to(self.device)
            else                  : query = None
            # compute attention
            attended, w1, w2 = self.attention(hiddens, query)
            # compute log prob
            if self.bin_mode : return self.out(attended).view(-1)
            else             : return F.log_softmax(self.out(attended.squeeze(1)), dim = 1)

        def computeAccuracy(log_probs, targets) :
            if self.bin_mode : return sum(torch.abs(targets - self.act(log_probs)) < 0.5).item() * 100 / targets.size(0)
            else             : return sum([targets[i].item() == log_probs[i].data.topk(1)[1].item() for i in range(targets.size(0))]) * 100 / targets.size(0)
            
        def printScores(start, iter, iters, tot_loss, tot_acc, print_every, compute_accuracy) :
            avg_loss = tot_loss / print_every
            avg_acc  = tot_acc / print_every
            if compute_accuracy : print(timeSince(start, iter / iters) + ' ({} {}%) loss : {:.3f}  accuracy : {:.1f} %'.format(iter, int(iter / iters * 100), avg_loss, avg_acc))
            else                : print(timeSince(start, iter / iters) + ' ({} {}%) loss : {:.3f}                     '.format(iter, int(iter / iters * 100), avg_loss))
            return 0, 0  # reset the running loss and accuracy

        def trainLoop(batch, optimizer, compute_accuracy = True):
            """Performs a training loop, with forward pass, backward pass and weight update."""
            torch.cuda.empty_cache()
            optimizer.zero_grad()
            self.zero_grad()
            log_probs = computeLogProbs(batch[0])
            targets = batch[1].to(self.device).view(-1)
            loss    = self.criterion(log_probs, targets)
            loss.backward()
            optimizer.step() 
            accuracy = computeAccuracy(log_probs, targets) if compute_accuracy else 0
            return float(loss.item() / targets.size(0)), accuracy
        
        # --- main ---
        self.train()
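        np.random.seed(random_state)  # seeds np.random.shuffle (epoch mode)
        random.seed(random_state)     # seeds random.choice (iteration mode)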
        start = time.time()
        optimizer = self.optimizer([param for param in self.parameters() if param.requires_grad == True], lr = lr)
        tot_loss = 0  
        tot_acc  = 0
        if epochs is None :
            for iter in range(1, iters + 1):
                batch = random.choice(batches)
                loss, acc = trainLoop(batch, optimizer, compute_accuracy)
                tot_loss += loss
                tot_acc += acc      
                if iter % print_every == 0 : 
                    tot_loss, tot_acc = printScores(start, iter, iters, tot_loss, tot_acc, print_every, compute_accuracy)
        else :
            iter = 0
            iters = len(batches) * epochs
            for epoch in range(1, epochs + 1):
                print('epoch ' + str(epoch))
                np.random.shuffle(batches)
                for batch in batches :
                    loss, acc = trainLoop(batch, optimizer, compute_accuracy)
                    tot_loss += loss
                    tot_acc += acc 
                    iter += 1
                    if iter % print_every == 0 : 
                        tot_loss, tot_acc = printScores(start, iter, iters, tot_loss, tot_acc, print_every, compute_accuracy)
        return
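
In the multi-class case, `fit` feeds log-softmax outputs to `nn.NLLLoss` with a summed reduction. The plain-Python sketch below reproduces that computation on toy logits to show what the reported loss measures; `log_softmax` and `nll_loss_sum` are hypothetical helper names, and class weights are left out.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def nll_loss_sum(log_probs_batch, targets):
    """Sum over the batch of the negative log-probability of the target class."""
    return -sum(log_probs[t] for log_probs, t in zip(log_probs_batch, targets))

# a model with no preference among 4 classes
uniform = log_softmax([0.0, 0.0, 0.0, 0.0])
print(nll_loss_sum([uniform], [2]))   # equals log(4)

# a confident, correct prediction gives a loss close to 0
confident = log_softmax([0.0, 0.0, 8.0, 0.0])
print(nll_loss_sum([confident], [2]))
```

A 4-class model that assigns equal probability to every class has a per-sample loss of log 4 (about 1.386), which is a useful reference point when reading the first values printed during training.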

### Training

In [23]:
classifier = TextClassifier(device = torch.device("cpu"),
                            tokenizer = tokenize,
                            word2vec = gensim_word2vec,
                            hidden1_dim = 100,
                            hidden2_dim = 100,
                            n1_layer = 2,
                            n2_layer = 1,
                            hops = 1,
                            share = True,
                            n_class = 4, #'binary', 
                            dropout = 0.1,
                            optimizer = optim.AdamW)

classifier.nbParametres()

505004

In [24]:
batches = classifier.generatePaddedTexts(labelled_sentences_trn)
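
`generatePaddedTexts` pads the sentences of each text to a common length with `itertools.zip_longest` and keeps the true lengths alongside. The sketch below shows that step in isolation on word indices, without tensors; `pad_sentences` and the padding index `0` are illustrative assumptions.

```python
import itertools

def pad_sentences(sentences, pad_index):
    """Pad lists of word indices to the same length; also return the true lengths."""
    lengths = [len(s) for s in sentences]
    # zip_longest yields one tuple per word position: shape (max_length, n_sentences)
    columns = itertools.zip_longest(*sentences, fillvalue=pad_index)
    # transpose back to one row per sentence: shape (n_sentences, max_length)
    padded = [list(row) for row in zip(*columns)]
    return padded, lengths

padded, lengths = pad_sentences([[4, 7, 9], [5]], pad_index=0)
print(padded)    # [[4, 7, 9], [5, 0, 0]]
print(lengths)   # [3, 1]
```

Keeping the lengths is what later allows the recurrent encoder to ignore the padded positions.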

In [25]:
len(batches)

120000

In [None]:
classifier.fit(batches[:30000], epochs = 1, lr = 0.0025, print_every = 1000)
classifier.fit(batches[30000:60000], epochs = 1, lr = 0.001, print_every = 1000)
classifier.fit(batches[60000:90000], epochs = 1, lr = 0.00025, print_every = 1000)
classifier.fit(batches[90000:], epochs = 1, lr = 0.0001, print_every = 1000)
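
The four calls above train on consecutive quarters of the data with a decreasing learning rate. The same staged schedule can be generated rather than written out by hand; `staged_schedule` below is a hypothetical helper, shown only to make the slicing explicit.

```python
def staged_schedule(n_samples, learning_rates):
    """Split range(n_samples) into consecutive slices, one per learning rate."""
    size = n_samples // len(learning_rates)
    stages = []
    for i, lr in enumerate(learning_rates):
        start = i * size
        # the last stage absorbs any remainder
        end = n_samples if i == len(learning_rates) - 1 else start + size
        stages.append((start, end, lr))
    return stages

for start, end, lr in staged_schedule(120000, [0.0025, 0.001, 0.00025, 0.0001]):
    print(start, end, lr)
    # classifier.fit(batches[start:end], epochs = 1, lr = lr, print_every = 1000)
```

Since `fit` builds a new optimizer on every call, the AdamW state is reset at the start of each stage.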

In [29]:
# save
#torch.save(classifier.state_dict(), path_to_DL4NLP + '\\saves\\DL4NLP_II1_text_classifier.pth')

# load
#classifier.load_state_dict(torch.load(path_to_DL4NLP + '\\saves\\DL4NLP_II1_text_classifier.pth'))

#### Single-hop evaluation

In [30]:
# attention hops = 1
classifier.eval()
text = ' . '.join([' '.join(s) for s in labelled_sentences_tst[157][0]])
classifier(text, attention_method = HANViewerOnWords)

[48;2;252;138;106m  [0m [48;2;256;245;240mgreek[0m [48;2;256;245;240msprinters[0m [48;2;256;245;240mquit[0m [48;2;256;245;240mto[0m [48;2;256;245;240mend[0m [48;2;256;245;240mgames[0m [48;2;256;245;240mscandal[0m 
[48;2;252;171;143m  [0m [48;2;256;245;240mathens[0m [48;2;256;245;240mreuters[0m [48;2;256;245;240mgreece[0m [48;2;256;245;240mINT[0m [48;2;256;245;240ms[0m [48;2;256;245;240mtwo[0m [48;2;256;245;240mtop[0m [48;2;256;245;240mathletes[0m [48;2;256;245;240mhave[0m [48;2;256;245;240mpulled[0m [48;2;256;245;240mout[0m [48;2;256;245;240mof[0m [48;2;256;245;240mthe[0m [48;2;256;245;240mathens[0m [48;2;256;245;240molympics[0m [48;2;256;245;240mand[0m [48;2;256;245;240mapologised[0m [48;2;256;245;240mto[0m [48;2;256;245;240mthe[0m [48;2;256;245;240mgreek[0m [48;2;256;245;240mpeople[0m [48;2;256;245;240mfor[0m [48;2;256;245;240ma[0m [48;2;256;245;240mscandal[0m [48;2;256;245;240mover[0m [48;2;256;245;240mmissed[0m 

1

In [31]:
classifier.eval()
classifier.compute_accuracy(labelled_sentences_tst)

91.0657894736842