# <center> Quora insincere questions (2) </center>

Now that we have run our first basic model, we will construct a new model with several small improvements :

<i>1)</i> First, we will let the user specify whether they want to use an LSTM or a GRU unit, for the reasons explained in the first notebook. For the same hidden size, an LSTM has about a third more parameters to learn than a GRU (four weight blocks per layer instead of three), which makes the training process longer. As GRU and LSTM yield similar results on numerous tasks, it is convenient to let the user choose between the two.

<i>2)</i> We will also use <b>regularization techniques</b> to prevent overfitting and improve the training process. Batch normalization is mostly used in convolutional neural networks, so we will not make use of it. But we can add <b>dropout</b> layers, i.e. we randomly set some neurons to zero during training in order to prevent the model from learning co-adapted features. Note that this amounts to training several different models that we aggregate.

<i>3)</i> We will let the user decide whether they want to use <b>bi-directional recurrent layers</b> or not. A bi-directional RNN reads a sentence in both directions, which should help to better capture the structure of the sentence. The hidden states of the two directions are then concatenated to form the final hidden state.

<i>4)</i> We will also let them decide <b>how many recurrent layers</b> they want to stack. The idea here is that lower layers should capture low-level information whereas higher layers should capture high-level information.

<i>5)</i> A parameter `train_embedding` will let the user choose whether to re-train the embedding matrix or not. Setting this parameter to `True` dramatically increases the number of parameters to train, but it should also help to adapt the embeddings to the task at hand.
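
To make points 1) and 5) concrete, here is a small sketch that counts trainable parameters. The sizes (300-dimensional embeddings, 64 hidden units, a 1000-word toy vocabulary) and the helper `count_trainable` are illustrative assumptions, not values taken from the dataset.

```python
import torch
import torch.nn as nn


def count_trainable(module):
    """Number of parameters that receive gradients (hypothetical helper)."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


embedding_dim, hidden_dim = 300, 64  # illustrative sizes

# Same input and hidden sizes, one layer, one direction
lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
gru = nn.GRU(embedding_dim, hidden_dim, batch_first=True)
n_lstm = count_trainable(lstm)  # four weight blocks per layer
n_gru = count_trainable(gru)    # three weight blocks per layer

# Toy matrix standing in for the pre-trained vectors
toy_vectors = torch.randn(1000, embedding_dim)
frozen = nn.Embedding.from_pretrained(toy_vectors, freeze=True)
tuned = nn.Embedding.from_pretrained(toy_vectors, freeze=False)

print('LSTM parameters:', n_lstm)
print('GRU parameters: ', n_gru)
print('Trainable embedding weights (frozen / fine-tuned):',
      count_trainable(frozen), '/', count_trainable(tuned))
```

With a real vocabulary of tens of thousands of words, the fine-tuned embedding matrix quickly dominates the parameter count of the whole model.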

---

#### The big improvement : self-attention layer

In an <b>encoder - decoder</b> network, an attention layer allows the decoder to look at the entire input sequence at every decoding step, so that it can decide which input words are important at any point in time. The basic attention layer works as follows :

- We have encoder hidden states $h_1, h_2, ..., h_N$
- On timestep $t$, we have the decoder hidden state $s_t$
- For this step, we get the attention scores $e_t = (s_t^T h_1, ... s_t^T h_N)$
- Take softmax to get attention distribution $\alpha^t = softmax(e_t)$
- Compute the attention output $a_t = \sum_{i=1}^N \alpha_i^t h_i$
- Concatenate attention output and decoder hidden state $(a_t, s_t)$
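
The steps above can be sketched for a single decoding step as follows. This is a minimal illustration with random tensors; the sizes `N` and `hidden` are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, hidden = 6, 8                  # illustrative sizes

h = torch.randn(N, hidden)        # encoder hidden states h_1, ..., h_N
s_t = torch.randn(hidden)         # decoder hidden state at step t

e_t = h @ s_t                     # dot-product scores, shape (N,)
alpha_t = F.softmax(e_t, dim=0)   # attention distribution over the N inputs
a_t = alpha_t @ h                 # weighted sum of encoder states, shape (hidden,)
out = torch.cat([a_t, s_t])       # concatenation (a_t, s_t), shape (2 * hidden,)

print(alpha_t)
print(out.shape)
```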

There are dozens of attention models, which can be grouped into a single framework : <i>given a set of vector values and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query</i>.

<b>Self-Attention</b> is a mechanism that extracts different aspects of the sentence into multiple vector representations. Due to its direct access to hidden representations from previous time steps, self-attention relieves some of the long-term memorization burden from the LSTM. Moreover, interpreting the extracted embedding becomes very easy and explicit.

We will implement the self-attention mechanism proposed by Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou and Yoshua Bengio in their 2017 paper <i>"A Structured Self-Attentive Sentence Embedding"</i>.

- We have encoder hidden states $H = (h_1, h_2, ..., h_n)^T$ of shape $(n, u)$
- The aim is to encode a variable length sentence into a fixed size embedding, which is achieved by choosing a linear combination of the $n$ LSTM hidden vectors in $H$. A vector of weights $a$ is computed : $a = softmax(w_{s2} \tanh(W_{s1} H^T))$. However, the resulting weighted sum $m$ of the hidden vectors usually focuses on a specific component of the sentence, like a special set of related words or phrases. But there can be multiple components in a sentence that together form the overall semantics of the whole sentence. Thus, to represent the overall semantics of the sentence, we need multiple $m$'s that focus on different parts of the sentence. We thus compute a matrix of weights $A = softmax(W_{s2} \tanh(W_{s1} H^T))$, where the softmax is performed along the second dimension of its input.
- We then compute the sentence embedding $M = AH$.
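
As a sanity check on the shapes, the computation above can be sketched for a single sentence. The sizes are arbitrary, and the shapes of the weight matrices ($W_{s1}$ of shape $(d_a, u)$, $W_{s2}$ of shape $(r, d_a)$) are assumptions chosen so that the products are well defined; random matrices stand in for learned weights.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, u, d_a, r = 7, 10, 4, 3        # illustrative sizes

H = torch.randn(n, u)             # hidden states, one row per word
W_s1 = torch.randn(d_a, u)        # random stand-in for a learned matrix
W_s2 = torch.randn(r, d_a)        # random stand-in for a learned matrix

# Softmax along the second dimension: each of the r rows is a
# distribution over the n words of the sentence.
A = F.softmax(W_s2 @ torch.tanh(W_s1 @ H.T), dim=1)  # shape (r, n)
M = A @ H                                            # shape (r, u)

print(A.sum(dim=1))  # every row sums to 1
print(M.shape)
```

Whatever the sentence length $n$, the embedding $M$ has the fixed shape $(r, u)$, which is what lets us feed it to a linear classifier.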

## Load data

In [1]:
import random

import torch

import torchtext
from torchtext import data
from torchtext.vocab import Vectors, GloVe

In [2]:
# Select device (GPU or CPU)
USE_GPU = False
if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
print('Using device:', device)

Using device: cpu


In [3]:
# Create training data
random.seed(14)

tokenizer = lambda x: x.split()
ID = data.Field()
TEXT = data.Field(tokenize=tokenizer, init_token='<bos>', eos_token='<eos>', lower=True)
TARGET = data.LabelField(dtype=torch.float)
train_fields = [('id', None), ('text', TEXT), ('target', TARGET)]

# Data
train_data = data.TabularDataset(
    path='train.csv',
    format='csv',
    skip_header=True,
    fields=train_fields
)

# Split
train, val, test = train_data.split(split_ratio=[0.6, 0.2, 0.2], random_state=random.getstate())

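# Vocab (built from the training split only, so that no information
# from the validation and test sets leaks into the vocabulary)
TEXT.build_vocab(train, vectors=GloVe(name='6B', dim=300), min_freq=5)
TARGET.build_vocab(train)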

batch_size_train = 256
batch_size_val = 256
batch_size_test = 256

# Iterators
train_iter = data.BucketIterator(
    train,
    sort_key=lambda x: len(x.text),  # sort sequences by length (dynamic padding)
    batch_size=batch_size_train,  # batch size
    device=device  # select device (e.g. CPU)
)

val_iter = data.BucketIterator(
    val,
    sort_key=lambda x: len(x.text),
    batch_size=batch_size_val,
    device=device
)

test_iter = data.Iterator(
    test,
    batch_size=batch_size_test,
    device=device,
    train=False,
    sort=False,
    sort_within_batch=False
)

## Define model architecture

In this part, we define the model architecture presented above using, as usual, the `nn.Module` API. The `self_attention` method is where we perform the self-attention mechanism. The `_init_hidden` method initializes the hidden and cell states, whose size depends on whether we use a bidirectional RNN or not.
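
Before reading the class, it may help to check how the shapes change with the `bidirectional` flag. The sketch below uses arbitrary sizes and random inputs; note that the initial states are not batch-first, even when `batch_first=True`.

```python
import torch
import torch.nn as nn

batch_size, seq_len, embed_dim, hidden_dim, num_layers = 4, 9, 16, 8, 2

shapes = {}
for bidirectional in (False, True):
    num_directions = 2 if bidirectional else 1
    rnn = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                  bidirectional=bidirectional, batch_first=True)
    x = torch.randn(batch_size, seq_len, embed_dim)

    # One initial state per layer and per direction
    h0 = torch.zeros(num_layers * num_directions, batch_size, hidden_dim)
    c0 = torch.zeros_like(h0)

    out, (hn, cn) = rnn(x, (h0, c0))
    shapes[bidirectional] = (tuple(out.shape), tuple(hn.shape))
    print(bidirectional, shapes[bidirectional])
```

The last dimension of the output doubles in the bidirectional case, which is why the size of the final linear layer depends on this flag.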

In [4]:
import torch
import torch.nn as nn
from torch.nn import functional as F

In [5]:
class TextClassificationModel(nn.Module):
    def __init__(self, 
                 embedding_matrix,
                 hidden_dim,
                 da,
                 r,
                 output_size,
                 dropout,
                 num_layers=1,
                 use_lstm=True, 
                 bidirectional=True,
                 train_embedding=True):
        """
        A text classification model, made of an embedding matrix, one or several recurrent layers
        and a self-attention layer.
        
        Arguments:
        - embedding_matrix: pre-trained embedding matrix of size (vocab_size, embedding_dim).
        - hidden_dim: an integer giving the dimension of the hidden state of the recurrent layer.
        - da : Number of units in the Attention mechanism.
        - r : Number of Attention heads.
        - output_size: an integer giving the size of the output (2 for binary classification).
        - dropout: a float between 0.0 and 1.0 giving the dropout rate.
        - num_layers: (Optional) number of stacked recurrent layers. Default 1.
        - use_lstm: (Optional) boolean that indicates whether to use a LSTM layer or not. 
          When False, use a GRU instead. Default True.
        - bidirectional: (Optional) boolean for using a bidirectional recurrent layer. Default True.
        - train_embedding: (Optional) boolean to know if we fine tune the embedding matrix. 
          Default True.
        """
        super().__init__()
        vocab_size = embedding_matrix.shape[0]
        embedding_dim = embedding_matrix.shape[1]
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim
        self.bidirectional = bidirectional
        self.use_lstm = use_lstm
        self.da = da
        self.r = r
        
        # Embedding layer (frozen unless train_embedding is True)
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix, 
                                                      freeze=not train_embedding)

        # Recurrent layer
        # PyTorch applies this dropout between stacked layers only, so it is
        # disabled when num_layers == 1 (it would only trigger a warning).
        rnn_class = nn.LSTM if use_lstm else nn.GRU
        self.rnn = rnn_class(input_size=embedding_dim, 
                             hidden_size=hidden_dim, 
                             num_layers=num_layers, 
                             bidirectional=bidirectional,
                             dropout=dropout if num_layers > 1 else 0.0,
                             batch_first=True)
            
        # Fully connected layer
        num_directions = 2 if bidirectional else 1
        rnn_output_dim = num_directions * hidden_dim
        self.fully_connected = nn.Linear(r * rnn_output_dim, output_size)
        
        # Self-attention layers. They are created here, as attributes of the
        # module, so that their weights are registered and trained by the optimizer.
        self.W_s1 = nn.Linear(rnn_output_dim, da)
        self.W_s2 = nn.Linear(da, r)
        
        # Dropout
        self.dropout_layer = nn.Dropout(dropout)
        

    def forward(self, x):
        """  
        Perform a forward pass
        
        Arguments:
        - x: tensor of shape (batch_size, sequence_length)
        
        Returns:
        - Output of the linear layer of shape (batch_size, output_size)
        """
        
        # 1. Embeddings layer + dropout
        x = self.embedding(x)  # [batch_size, seq_len, embed_dim]
        x = self.dropout_layer(x)  # [batch_size, seq_len, embed_dim]
        
        # 2. Recurrent layer(s)
        # First, initialize hidden and cell states.
        # Note that only the LSTM requires the cell state.
        h0, c0 = self._init_hidden(self.num_layers, x.shape[0], self.hidden_dim)
        if self.use_lstm:
            x, (fn, cn) = self.rnn(x, (h0, c0))
        else:
            x, fn = self.rnn(x, h0)
        # x is now of shape [batch_size, seq_len, num_directions * hidden_dim]
            
        # 3. Attention layer + dropout
        x = self.self_attention(x)  # [batch_size, r, num_directions * hidden_dim] 
        x = self.dropout_layer(x)
        
        # 4. Final layer
        output = self.fully_connected(x.view(x.size(0), -1))  # [batch_size, output_size]
        
        return output
    
    
    def self_attention(self, x):
        """
        Self-attention mechanism of our model. 
        It computes r attention distributions over the time steps and uses them
        to build r weighted sums of the recurrent outputs.

        Arguments:
        - x : Output of the recurrent layer, of shape (batch_size, seq_len, num_directions * hidden_dim).
          Tensor containing the output features (h_t) from the last recurrent layer, for each t.
        
        Returns:
        - Tensor of size [batch_size, r, num_directions * hidden_dim]
        """
        weight_matrix = torch.tanh(self.W_s1(x))  # [batch_size, seq_len, da]
        weight_matrix = self.W_s2(weight_matrix)  # [batch_size, seq_len, r]
        weight_matrix = F.softmax(weight_matrix, dim=1)  # normalize over seq_len
        weight_matrix = weight_matrix.permute(0, 2, 1)  # [batch_size, r, seq_len]
        
        x = torch.bmm(weight_matrix, x)  # [batch_size, r, num_directions * hidden_dim]

        return x
    
    
    def _init_hidden(self, num_layers, batch_size, hidden_dim):
        """
        Initialize hidden states for the recurrent layers
        
        Arguments:
        - num_layers : number of stacked layers (int).
        - batch_size : batch size (int).
        - hidden_dim : hidden dimension (int).
        
        Returns:
        - A tuple (h0, c0) containing hidden and cell states.
        """
        
        # A bidirectional layer needs one state per direction and per layer,
        # so the first dimension is doubled.
        num_directions = 2 if self.bidirectional else 1
        device = self.embedding.weight.device  # same device as the model
        h0 = torch.zeros(num_layers * num_directions, batch_size, hidden_dim, device=device)
        c0 = torch.zeros(num_layers * num_directions, batch_size, hidden_dim, device=device)
            
        return (h0, c0)

## Training

Finally, the training process will be wrapped into a class `Solver` that handles several tasks :
- Train a model ;
- Compute loss or accuracy history ;
- Save a model, in case we want to keep some specific parameters.

The core of the `train` method is the same as the one we coded in the previous notebook, but we added additional information to display and an option to save a model's parameters.
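
The saving option relies on `state_dict()`, which returns references to the live tensors rather than copies. The sketch below, which uses a tiny `nn.Linear` as a stand-in for our classifier and a temporary directory as an assumed save location, shows one way to freeze a snapshot, write it to disk and load it back.

```python
import copy
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(3, 2)  # stand-in for the text classification model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# Deep copies freeze the current values of the parameters
snapshot = {
    'state_dict': copy.deepcopy(model.state_dict()),
    'optimizer': copy.deepcopy(optimizer.state_dict()),
}

# One training step changes the live weights, but not the snapshot
loss = model(torch.ones(1, 3)).sum()
loss.backward()
optimizer.step()
changed = not torch.equal(model.weight, snapshot['state_dict']['weight'])

# Save the snapshot, then restore it into a fresh model
path = os.path.join(tempfile.mkdtemp(), 'best_model.pkl')
torch.save(snapshot, path)

restored = nn.Linear(3, 2)
restored.load_state_dict(torch.load(path)['state_dict'])
same = torch.equal(restored.weight, snapshot['state_dict']['weight'])

print('Live weights changed after the step:', changed)
print('Restored weights match the snapshot:', same)
```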

In [6]:
import torch.optim as optim
import numpy as np
import time

In [14]:
class Solver():
    """
    Encapsulates all the logic necessary for training text classification
    models.
    
    The Solver accepts both training and validation data and labels so it can
    periodically check classification accuracy on both training and validation
    data to watch out for overfitting.
    
    - To train a model, construct a Solver instance, passing the model, the 
    optimizer, the training and validation iterators and various options 
    (verbose, save_model) to the constructor. 
    - Call the train() method to train the model.
    - Instance variable solver.loss_history contains a list of all losses 
    encountered during training.
    - Instance variables solver.train_accuracy_history and solver.val_accuracy_history 
    are lists of the accuracies of the model on the training and validation set at each epoch.
    
    Example usage :
    model = Model(*args)
    solver = Solver(model, 
                    optimizer,
                    loader_train,
                    loader_val)
    solver.train()
    """
    
    def __init__(self, model,  optimizer, loader_train, loader_val, **kwargs):
        """
        Required arguments:
        - model: A model constructed from PyTorch nn.Module.
        - optimizer: An Optimizer object we will use to train the model.
        - loader_train: An Iterator object on which iterating to construct batches of 
          training data.
        - loader_val: An Iterator object on which iterating to construct batches of 
          validation data.
          
        Optional arguments:
        - verbose: Boolean; if set to false then no output will be printed
          during training.
        - save_model: Boolean; if set to True then save best model.
        """
        self.model = model
        self.optimizer = optimizer
        self.loader_train = loader_train
        self.loader_val = loader_val

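        # Unpack arguments
        self.verbose = kwargs.pop('verbose', False)
        self.save_model = kwargs.pop('save_model', False)
        
        # Set self.device, which train() and compute_accuracy() rely on
        self._select_device(verbose=self.verbose)
        
        self._reset()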
        
        
    def _reset(self):
        """
        Reset some variables for book-keeping:
        - best validation accuracy
        - loss history
        - train accuracy history
        - validation accuracy history
        - best model parameters
        """
        self.best_val_accuracy = 0
        self.loss_history = []
        self.train_accuracy_history = []
        self.val_accuracy_history = []
        self.best_params = {}
        
        
    def _select_device(self, verbose=True):
        """
        Select device e.g. CPU / GPU
        """
        use_gpu = False
        if use_gpu and torch.cuda.is_available():
            device = torch.device('cuda')
        else:
            device = torch.device('cpu')
        if verbose:
            print('Using device:', device)

        self.device = device
        
    
    def train(self, print_every=10, epochs=1):
        """
        Train a model using the PyTorch Module API.

        Arguments:
        - print_every: (Optional) Print training accuracy every print_every iterations.
        - epochs: (Optional) A Python integer giving the number of epochs to train for.

        Returns: Nothing, but prints model accuracies during training.
        """

        # Move the model parameters to CPU / GPU
        model = self.model.to(device=self.device)
        optimizer = self.optimizer
        
        # Initialize iteration
        t = 0

        for epoch in range(epochs):
            start = time.time()
            for train_batch in self.loader_train:

                # Put model to training mode
                model.train()

                # Load x and y
                x = train_batch.text.transpose(1, 0)  # reshape to [batch_size, len_seq]
                y = train_batch.target.type(torch.LongTensor)

                # Move to device, e.g. CPU
                x = x.to(device=self.device)
                y = y.to(device=self.device)

                # Compute scores and softmax loss
                scores = model(x)
                loss = F.cross_entropy(scores, y)

                # Zero out all of the gradients for the variables which the optimizer
                # will update.
                optimizer.zero_grad()

                # Backwards pass: compute the gradient of the loss with
                # respect to each parameter of the model.
                loss.backward()

                # Update the parameters of the model using the gradients
                # computed by the backwards pass.
                optimizer.step()
                
                # Save loss
                self.loss_history.append(loss.item())

                # Display information
                if self.verbose and t % print_every == 0:
                    print('Iteration %d, loss = %.4f' % (t, self.loss_history[-1]))
                    acc = self.compute_accuracy(validation=True)
                    print('Accuracy :', acc)
                    print()
                
                t += 1
                
            end = time.time()
            print('Epoch {0} / {1}, time = {2:.1f} secs'.format(epoch + 1, epochs, end - start))
            
            # Compute train and val accuracy at the end of each epoch.
            train_accuracy = self.compute_accuracy(validation=False)
            val_accuracy = self.compute_accuracy(validation=True)
            
            self.train_accuracy_history.append(train_accuracy)
            self.val_accuracy_history.append(val_accuracy)
            
            # Print useful information
            if self.verbose:
                print('(Epoch %d / %d) Train acc: %f; Val acc: %f' % (epoch + 1, epochs, 
                                                                      train_accuracy, val_accuracy))

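            # Keep track of the best model
            if val_accuracy > self.best_val_accuracy:
                import copy
                self.best_val_accuracy = val_accuracy
                # update best params; state_dict() returns references to the live
                # tensors, so a deep copy is needed to keep the values of this epoch
                self.best_params['state_dict'] = copy.deepcopy(model.state_dict())
                self.best_params['optimizer'] = copy.deepcopy(optimizer.state_dict())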
                    
        # Save best model
        if self.save_model:
            self._save_model('/Users/robin/Projects/zelros/', 
                             self.best_params['state_dict'], 
                             self.best_params['optimizer'])
                
                
    def compute_accuracy(self, validation=True):
        """
        Compute accuracy of a model.
        
        Arguments:
        - validation: (Optional) If True, compute accuracy on the validation dataset.
        """
        if validation:
            loader = self.loader_val
        else:
            loader = self.loader_train
                        
        num_correct = 0
        num_samples = 0

        # Set model to evaluation mode : This has any effect only on certain modules. 
        # For example, behaviors of dropout layers during train or test differ.
        self.model.eval()

        # Tell PyTorch not to build computational graphs
        with torch.no_grad():
            for batch in loader:

                # Load x and y
                x = batch.text.transpose(1, 0)  # reshape to [batch_size, len_seq]
                y = batch.target.type(torch.LongTensor)

                # Move to device, e.g. CPU
                x = x.to(device=self.device)  
                y = y.to(device=self.device)

                # Compute scores and predictions
                scores = self.model(x)
                _, preds = scores.max(1)
                num_correct += (preds == y).sum()
                num_samples += preds.size(0)

        acc = float(num_correct) / num_samples
        return acc
    
    
    def _save_model(self, model_path, model_dict, optimizer_dict):
        """
        Save model parameters if we have to interrupt the training.
        Parameters are saved in a dictionary.
        
        Required arguments:
        - model_path: where to save the model.
        - model_dict: Python dictionary that maps each layer to its parameter tensor.
        - optimizer_dict: Python dictionary that contains info about the optimizer's 
          states and hyperparameters used.
        """
    
        state = {
            'state_dict': model_dict,   # model.state_dict()
            'optimizer' : optimizer_dict   # optimizer.state_dict()
        }
        filename = model_path + 'best_model.pkl'
        torch.save(state, filename)
        print('Model saved to %s' % filename)


### Usage

The way to put everything together is as follows :

In [15]:
# Learning rate
learning_rate = 1e-2

# Model
model = TextClassificationModel(embedding_matrix=TEXT.vocab.vectors, 
                                hidden_dim=64,
                                da=100,
                                r=5,
                                output_size=2,
                                dropout=0.5,
                                num_layers=1,
                                use_lstm=True,
                                bidirectional=True,
                                train_embedding=True)

# Optimizer
optimizer = optim.Adam(model.parameters(), 
                       lr=learning_rate, 
                       betas=(0.9, 0.999),  # recommended values
                       eps=1e-08)  # recommended value

# Solver
solver = Solver(model=model, 
                optimizer=optimizer, 
                loader_train=train_iter, 
                loader_val=val_iter, 
                verbose=True)

In [None]:
#solver.train()