# <center> Quora insincere questions </center>

##  1. Prepare and load data 

In this part, we preprocess the text data. Recall that the task at hand is to automatically classify Quora questions as sincere or insincere.

The main steps to prepare text data are the following :
- Read the data ;
- Do a bit of preprocessing, i.e. remove special characters, numbers, etc. This step really depends on the task ;
- Tokenize the text, i.e. each sentence is transformed into a list of words ;
- Build the vocabulary, i.e. map each word to a unique integer, handling beginning and ending of sentence, as well as unknown words ;
- Convert the tokenized text to the corresponding list of integers built by the mapping ;
- Pad the sentences in order to create batches of same length.

These steps can be a bit long. We will use ```torchtext```, a library specifically designed to handle text data and load them in the appropriate format.
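To make these steps concrete, here is a minimal pure-Python sketch of the pipeline on two toy sentences. The sentences, the variable names and the `numericalize` helper are illustrative assumptions; this is not what torchtext does internally, only the same idea in a few lines.

```python
from collections import Counter

# Two toy sentences (illustrative only)
sentences = ["How are you ?", "Is it safe to use American appliances in India ?"]

# 1. Tokenize (lowercase + whitespace split) and add sentence delimiters
tokenized = [["<bos>"] + s.lower().split() + ["<eos>"] for s in sentences]

# 2. Build the vocabulary: special tokens first, then words by frequency
specials = ["<unk>", "<pad>", "<bos>", "<eos>"]
counts = Counter(tok for sent in tokenized for tok in sent if tok not in specials)
itos = specials + [word for word, _ in counts.most_common()]
stoi = {word: index for index, word in enumerate(itos)}

# 3. Convert tokens to integers, mapping unknown words to <unk>
def numericalize(tokens):
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

ids = [numericalize(sent) for sent in tokenized]

# 4. Pad every sentence to the length of the longest one
max_len = max(len(seq) for seq in ids)
padded = [seq + [stoi["<pad>"]] * (max_len - len(seq)) for seq in ids]

for row in padded:
    print(row)
```

Each row now has the same length, so the two sentences can be stacked into a single batch.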

### Overview of the data

In [1]:
import pandas as pd

# Load train and test datasets
train_csv = pd.read_csv("train.csv")
test_csv = pd.read_csv("test.csv")

In [2]:
print(train_csv.shape)
print(test_csv.shape)

(1306122, 3)
(375806, 2)


In [3]:
train_csv.head(2)

| | qid | question_text | target |
|---|---|---|---|
| 0 | 00002165364db923c7e6 | How did Quebec nationalists see their province... | 0 |
| 1 | 000032939017120e6e44 | Do you have an adopted dog, how would you enco... | 0 |


In [4]:
# Check missing values
train_csv.isnull().any()

qid              False
question_text    False
target           False
dtype: bool

No missing value. We see here that we have 3 fields : `qid` corresponds to the question unique ID, `question_text` corresponds to the question itself and `target` corresponds to the label, i.e. 0 if sincere question, 1 if insincere question. The test dataset does not contain this `target` field.

In [5]:
print('Number of sincere questions', train_csv[train_csv['target'] == 0].shape[0])
print('Number of insincere questions', train_csv[train_csv['target'] == 1].shape[0])

Number of sincere questions 1225312
Number of insincere questions 80810


The dataset is unbalanced. We have 80,810 examples for positive targets though, which is clearly enough for a classifier to learn. Thus, we will not perform any <b>subsampling</b> or <b>oversampling</b> technique.

<i> The dataset is <b>huge</b>. As I do not possess any GPU, I will probably make small splits in order to be able to train the model.</i>

### Torchtext

Torchtext takes in our csv files and converts them to ```Dataset``` objects. An ```Iterator``` object then iterates over this ```Dataset``` object to construct batches of data, handling the various steps described above (converting words to integers, constructing batches, etc.).

First, we will define ```Fields``` to load the data. In our case, the data have 3 headers : ```qid```, ```question_text``` and ```target```. Declaring fields simply consists in telling torchtext about this structure and telling it that the ```question_text``` field should be treated as text, whereas the ```target``` field should be treated as label. 

We will define a custom tokenizer to handle the tokenization. We could use a tokenizer from `nltk`, but we will create a simple one. The argument `lower=True` tells torchtext to convert the text to lowercase. The `sequential` argument is set to true when we are dealing with text data. For the TARGET field in the cell below, we use a `LabelField`, which sets `sequential=False` since a label is not a sequence of tokens.
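A plain whitespace split keeps punctuation attached to words (for instance `india?` stays a single token). As a hedged alternative, the sketch below uses a regular expression to separate punctuation from words. The name `simple_tokenizer` and the pattern are assumptions made for illustration, not part of torchtext or of the original notebook.

```python
import re

# A token is either a run of word characters or a single punctuation mark
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_tokenizer(text):
    """Lowercase the text and split punctuation from words."""
    return TOKEN_PATTERN.findall(text.lower())

print(simple_tokenizer("Is it safe to use American appliances (110V) in India?"))
```

Such a function could be passed as the `tokenize` argument of a `Field` in place of the lambda. Whether it actually helps depends on how well the resulting tokens match the pre-trained vocabulary.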

In [6]:
import torchtext
import torch

In [8]:
from torchtext import data

# Define a custom tokenizer
# We could also use `from nltk import word_tokenize` for a more robust tokenizer
tokenizer = lambda x: x.split()

ID = data.Field()
TEXT = data.Field(tokenize=tokenizer, init_token='<bos>', eos_token='<eos>', lower=True)
TARGET = data.LabelField(dtype=torch.float)

# Declare field for training set
train_fields = [('id', None), ('text', TEXT), ('target', TARGET)]

We won't be using the test dataset because we do not have the targets to measure performance. Instead, we will split our training set into a training dataset and a validation dataset.

The `TabularDataset` class lets us construct a `Dataset` object from tabular data, typically a csv file.

We can split the data into a <b>training</b>, a <b>validation</b> and a <b>test</b> set. If we want to test our implementation, we can also define a <b>development</b> set as a small subset of the training set. Computation would then be much faster, which helps when debugging the code.
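The idea of a seeded split with an extra development subset can be sketched on plain indices. The function name `split_indices`, the default `dev_size` and the toy dataset size are assumptions for illustration; torchtext's own `split` method is what the notebook uses.

```python
import random

def split_indices(n_examples, ratios=(0.6, 0.2, 0.2), dev_size=1000, seed=14):
    """Shuffle indices and cut them into train / validation / test (+ small dev)."""
    rng = random.Random(seed)  # local generator, so the split is reproducible
    indices = list(range(n_examples))
    rng.shuffle(indices)

    n_train = round(ratios[0] * n_examples)
    n_val = round(ratios[1] * n_examples)

    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]
    dev_idx = train_idx[:dev_size]  # small subset of the training set for debugging
    return train_idx, val_idx, test_idx, dev_idx

train_idx, val_idx, test_idx, dev_idx = split_indices(10_000)
print(len(train_idx), len(val_idx), len(test_idx), len(dev_idx))
```

Because the development set is drawn from the training indices, it never overlaps with the validation or test sets.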

In [14]:
import random
random.seed(14)  # for reproducibility

# Create our train data
train_data = data.TabularDataset(
    path='train.csv',
    format='csv',
    skip_header=True,
    fields=train_fields
)

In [21]:
# Create train, validation and test datasets
train, val, test = train_data.split(split_ratio=[0.6, 0.2, 0.2], random_state=random.getstate())

Let's have a look at what `train` looks like.

In [10]:
print(type(train))
print(train[0])
print(train[0].__dict__.keys())

<class 'torchtext.data.dataset.Dataset'>
<torchtext.data.example.Example object at 0x14858b6a0>
dict_keys(['text', 'target'])


In [11]:
# One training example
vars(train.examples[0])

{'text': ['is',
  'it',
  'safe',
  'to',
  'use',
  'american',
  'appliances',
  '(110v)',
  'in',
  'india?'],
 'target': '0'}

In [22]:
print('Number of training examples:', len(train))
print('Number of validation examples:', len(val))
print('Number of test examples:', len(test))

Number of training examples: 783673
Number of validation examples: 261225
Number of test examples: 261224


In [29]:
print(type(train))
print(type(val))
print(type(test))

<class 'torchtext.data.dataset.Dataset'>
<class 'torchtext.data.dataset.Dataset'>
<class 'torchtext.data.dataset.Dataset'>


Now, build the vocabulary for the `TEXT` field from the entire training dataset. We can load pre-trained word vectors to embed the tokens. We will use <b>GloVe</b> vectors, a common choice for word embedding. Moreover, the construction of GloVe vectors ensures that the embedding takes into account both global statistics (like counts or frequencies in a corpus) and semantic information (like Word2Vec would do).

I was somehow unable to load pre-trained vectors from the zip file provided by Kaggle using this command :
`TEXT.vocab.load_vectors(torchtext.vocab.Vectors('glove.840B.300d/glove.840B.300d.txt'))`

Hence, I used the available Glove vectors from torchtext directly.

In [23]:
from torchtext.vocab import Vectors, GloVe

# minimum frequency needed to include a token in the vocabulary set to 5
TEXT.build_vocab(train_data, vectors=GloVe(name='6B', dim=300), min_freq=5)
TARGET.build_vocab(train_data)

In [24]:
print('Unique tokens in TEXT vocab:', len(TEXT.vocab))
print('Unique tokens in TARGET vocab:', len(TARGET.vocab))

Unique tokens in TEXT vocab: 80258
Unique tokens in TARGET vocab: 2


Now, let's have a look at our embedding matrix.

In [25]:
word_embeddings = TEXT.vocab.vectors
print ("Length of text vocabulary: " + str(len(TEXT.vocab)))
print ("Embedding size of text vocabulary: ", TEXT.vocab.vectors.shape)

Length of text vocabulary: 80258
Embedding size of text vocabulary:  torch.Size([80258, 300])


As we can see, we have a vocabulary of length 80,258 and an embedding size of 300. The first words in our vocabulary are the special tokens `<unk>`, `<pad>`, `<bos>` and `<eos>`. We can access the mapping between tokens and integers with the attributes `itos` and `stoi`. The special tokens have no pre-trained vector, so their embeddings are initialized to zero vectors, as shown below for `<bos>`.

In [26]:
# Print first tokens in vocabulary
for i in range(10):
    print(TEXT.vocab.itos[i])

<unk>
<pad>
<bos>
<eos>
the
what
is
a
to
in


In [27]:
# Print embeddings for `<bos>` token
word_embeddings[2]

tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 

In [28]:
# Print embeddings for `the` token
word_embeddings[4]

tensor([ 4.6560e-02,  2.1318e-01, -7.4364e-03, -4.5854e-01, -3.5639e-02,
         2.3643e-01, -2.8836e-01,  2.1521e-01, -1.3486e-01, -1.6413e+00,
        -2.6091e-01,  3.2434e-02,  5.6621e-02, -4.3296e-02, -2.1672e-02,
         2.2476e-01, -7.5129e-02, -6.7018e-02, -1.4247e-01,  3.8825e-02,
        -1.8951e-01,  2.9977e-01,  3.9305e-01,  1.7887e-01, -1.7343e-01,
        -2.1178e-01,  2.3617e-01, -6.3681e-02, -4.2318e-01, -1.1661e-01,
         9.3754e-02,  1.7296e-01, -3.3073e-01,  4.9112e-01, -6.8995e-01,
        -9.2462e-02,  2.4742e-01, -1.7991e-01,  9.7908e-02,  8.3118e-02,
         1.5299e-01, -2.7276e-01, -3.8934e-02,  5.4453e-01,  5.3737e-01,
         2.9105e-01, -7.3514e-03,  4.7880e-02, -4.0760e-01, -2.6759e-02,
         1.7919e-01,  1.0977e-02, -1.0963e-01, -2.6395e-01,  7.3990e-02,
         2.6236e-01, -1.5080e-01,  3.4623e-01,  2.5758e-01,  1.1971e-01,
        -3.7135e-02, -7.1593e-02,  4.3898e-01, -4.0764e-02,  1.6425e-02,
        -4.4640e-01,  1.7197e-01,  4.6246e-02,  5.8

### Use an Iterator to build batches

Now, an `Iterator` will iterate over the datasets to construct the batches. The nice thing with Torchtext is that it can handle <b>dynamic padding</b> itself. Basically, sentences are sequences of tokens of different sizes, but each batch needs to have a fixed size, for example (256, 32), where 256 is the batch size and 32 is the length of the sequences in the batch.

In order to have sentences of same length, we <b>pad</b> them within a batch (with the `<pad>` token) so that they all have the length of the longest sequence in the batch. To be more efficient, we create batches of sentences with approximately the same length so that we avoid too much padding. This is called dynamic padding, and luckily Torchtext will take care of it for us, i.e. it will shuffle the data and create batches made of similar length sequences.
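The bucketing idea can be sketched in pure Python as follows. This is a simplification under stated assumptions: the function `bucket_batches`, the `pad_id` value and the toy sequences are illustrative, and torchtext's `BucketIterator` uses a more elaborate strategy internally.

```python
import random

def bucket_batches(sequences, batch_size, pad_id=1, seed=0):
    """Group sequences of similar length, then pad each batch to its own max length."""
    # Sort by length so that neighbouring sequences have similar sizes
    ordered = sorted(sequences, key=len)
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    # Shuffle the batches (not the sentences) so the training order stays random
    random.Random(seed).shuffle(batches)
    padded = []
    for batch in batches:
        max_len = max(len(seq) for seq in batch)
        padded.append([seq + [pad_id] * (max_len - len(seq)) for seq in batch])
    return padded

# Six toy sequences of lengths 3, 12, 4, 11, 2 and 13 (token id 5 everywhere)
sequences = [[5] * n for n in (3, 12, 4, 11, 2, 13)]
for batch in bucket_batches(sequences, batch_size=3):
    print(len(batch), "sequences padded to length", len(batch[0]))
```

Here the short sequences are padded to length 4 and the long ones to length 13, instead of padding everything to 13.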

In [32]:
# Select device (GPU or CPU)

USE_GPU = False

if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

print('Using device:', device)

Using device: cpu


In [46]:
batch_size_train = 256
batch_size_val = 256
batch_size_test = 256

# Iterator for training set
train_iter = data.BucketIterator(
    train,
    sort_key=lambda x: len(x.text),  # sort sequences by length (dynamic padding)
    batch_size=batch_size_train,  # batch size
    device=device  # select device (e.g. CPU)
)

# Iterator for validation set
val_iter = data.BucketIterator(
    val,
    sort_key=lambda x: len(x.text),
    batch_size=batch_size_val,
    device=device
)

# Don't want to shuffle test data, so we use a standard iterator
test_iter = data.Iterator(
    test,
    batch_size=batch_size_test,
    device=device,
    train=False,
    sort=False,
    sort_within_batch=False
)

### Wrap in a function

Now that we have all the elements, we can wrap them into a single function `load_data` if we want to perform all of the above steps at once.

The function could for instance return `train_iter`, `val_iter`, `test_iter`, `dev_iter` as well as the embedding matrix `word_embeddings` and the vocabulary `TEXT.vocab`.

## 2. Model architecture

We will implement a simple model in PyTorch that is designed to handle text data. We will use a recurrent neural network with the following architecture :
- <b>Embedding layer</b>
- <b>GRU</b> or <b>LSTM</b> layer
- <b>Fully connected layer</b>

The GRU unit is computationally more efficient than an LSTM unit (fewer parameters to update), and in most cases results are quite similar. The default choice would be to use an LSTM unit, but a GRU would also work for this example (the code below uses an LSTM).

We will implement this model using the PyTorch `nn.Module` API, which requires declaring the different layers in `__init__` and the forward pass in a `forward` method. The `nn.Module` API lets us define arbitrary network architectures while tracking every learnable parameter. PyTorch also provides the `torch.optim` package, which implements all the common optimizers, such as RMSProp, Adagrad, and Adam.

#### Why use a recurrent unit ?

A recurrent unit processes text sequentially using the same parameters at every time step, which is suitable to encode a complete sentence. It contains a <b>hidden state</b> and several gates that select the information needed at each time step to update this hidden state. Recurrent models have become the <i>default choice for handling NLP tasks</i> and tend to perform better than other architectures such as convolutional networks (more suitable to image data). Another advantage of recurrent layers is that they can process sequences of arbitrary length.

#### LSTM in detail

At step $t$, there is a hidden state `ht` and a cell state `ct`. The cell stores long-term information, and the LSTM can erase, write and read information from the cell. The information that is erased, written and read is controlled by gates, whose values are dynamic (i.e. computed based on the current context) :
- forget gate : $f_t = \sigma (W_f h_{t-1} + U_f x_t + b_f)$ (which information are we going to throw away)
- input gate : $i_t = \sigma (W_i h_{t-1} + U_i x_t + b_i)$ (which values are we going to update)
- vector of candidates : $\hat{c_t} = \tanh (W_c h_{t-1} + U_c x_t + b_c)$ (create a vector of candidates for updating)
- cell state : $c_t = f_t \odot c_{t-1} + i_t \odot \hat{c_t}$ (erases some content from last cell and writes some new content)
- output gate : $o_t = \sigma (W_o h_{t-1} + U_o x_t + b_o)$ (what part of the cell state are we going to output)
- hidden state : $h_t = o_t \odot \tanh(c_t)$
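The gate equations translate almost line by line into code. The sketch below is a scalar version (input, hidden state and cell state are single numbers) so that it stays readable. The parameter values are arbitrary placeholders and the helper names `lstm_step` and `encode` are assumptions for illustration; a real LSTM layer works with vectors and learned weight matrices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step with scalar input, hidden state and cell state."""
    f_t = sigmoid(p["W_f"] * h_prev + p["U_f"] * x_t + p["b_f"])      # forget gate
    i_t = sigmoid(p["W_i"] * h_prev + p["U_i"] * x_t + p["b_i"])      # input gate
    c_hat = math.tanh(p["W_c"] * h_prev + p["U_c"] * x_t + p["b_c"])  # candidate
    c_t = f_t * c_prev + i_t * c_hat                                  # cell state
    o_t = sigmoid(p["W_o"] * h_prev + p["U_o"] * x_t + p["b_o"])      # output gate
    h_t = o_t * math.tanh(c_t)                                        # hidden state
    return h_t, c_t

def encode(sequence, p):
    """Run the LSTM over a whole sequence and return the final hidden state."""
    h, c = 0.0, 0.0  # initial states set to zero
    for x_t in sequence:
        h, c = lstm_step(x_t, h, c, p)
    return h

# Arbitrary illustrative parameter values (a real model learns them)
params = {f"{m}_{g}": 0.5 for m in ("W", "U") for g in "fico"}
params.update({f"b_{g}": 0.0 for g in "fico"})

print(encode([1.0, -0.5, 2.0], params))
print(encode([0.3] * 20, params))  # same parameters, longer sequence
```

The same `params` are reused at every time step, which is why sequences of any length can be encoded.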

A GRU does not have a cell state. Rather, an update gate controls what parts of the hidden state are updated and a reset gate controls what parts of the previous hidden state are used to compute new content.
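For comparison, a scalar GRU step can be sketched in the same way. Several equivalent conventions exist for the final interpolation; the one used here (update gate close to 1 means "take the new content") is an assumption, and the function name `gru_step` and parameter names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU time step with scalar input and hidden state."""
    z_t = sigmoid(p["W_z"] * h_prev + p["U_z"] * x_t + p["b_z"])  # update gate
    r_t = sigmoid(p["W_r"] * h_prev + p["U_r"] * x_t + p["b_r"])  # reset gate
    # The reset gate decides how much of the previous state feeds the new content
    h_hat = math.tanh(p["W_h"] * (r_t * h_prev) + p["U_h"] * x_t + p["b_h"])
    # The update gate interpolates between the old state and the new content
    return (1.0 - z_t) * h_prev + z_t * h_hat

# Arbitrary illustrative parameter values (a real model learns them)
params = {f"{m}_{g}": 0.5 for m in ("W", "U") for g in "zrh"}
params.update({f"b_{g}": 0.0 for g in "zrh"})

h = 0.0
for x_t in [1.0, -0.5, 2.0]:
    h = gru_step(x_t, h, params)
print(h)
```

With three gates' worth of parameters instead of four, the GRU has fewer weights to update than the LSTM for the same hidden size.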

We thus see that an LSTM / GRU is a good model to encode a sentence in a final hidden state $h_T$.

##### Potential Improvements

- <b>Stacked RNN</b> : the lower RNN should compute low level features, and the higher RNN should compute high level features ;
- <b>Bi-directional RNN</b> : reads the sentence from left to right and right to left, which allows to better encode the sentence ;
- <b>Deep bi-directional RNN</b> : combination of the two previous improvements.

In [34]:
import torch
import torch.nn as nn

In [35]:
class NaiveModel(nn.Module):
    def __init__(self, embedding_matrix, hidden_dim, output_size):
        """
        A simple text classification model based on the following architecture :
        Embeddings -> LSTM / GRU -> Linear
        
        Arguments:
        - embedding_matrix: pre-trained embedding matrix of size (vocab_size, embedding_dim).
        - hidden_dim: an integer giving the dimension of the hidden state of the recurrent layer.
        - output_size: size of output (2 for binary classification)
        """
        super(NaiveModel, self).__init__()
                
        vocab_size = embedding_matrix.shape[0]
        embedding_dim = embedding_matrix.shape[1]

        # 1. Embedding layer
        # We initialize the embedding matrix using our pre-trained GloVe vectors.
        # Setting `requires_grad = False` tells PyTorch that this matrix should not be
        # updated during training (`from_pretrained` already freezes it by default).
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix)
        self.embedding.weight.requires_grad = False

        # 2. Recurrent layer
        # The `batch_first` argument tells PyTorch that the data will contain batch size
        # as first dimension
        self.rnn = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, batch_first=True)
            
        # 3. Fully connected layer
        self.fully_connected = nn.Linear(hidden_dim, output_size)
        

    def forward(self, x):
        """  
        Perform a forward pass
        
        Arguments:
        - x: tensor of shape (batch_size, sequence_length)
        
        Returns:
        - Output of the linear layer of shape (batch_size, output_size)
        """
        
        # 1. Embeddings layer
        x = self.embedding(x)  # [batch_size, seq_len, embed_dim]
        
        # 2. Recurrent layer 
        # By default, the hidden state and cell state are initialized to zero
        x, (hn, cn) = self.rnn(x)  # hn is of shape [1, batch_size, hidden_dim]
            
        # 3. Final layer
        # `squeeze(0)` drops the leading dimension to put hn in shape [batch_size, hidden_dim]
        # (a bare `squeeze()` would also drop the batch dimension when batch_size is 1)
        output = self.fully_connected(hn.squeeze(0))  # [batch_size, output_size]
        
        return output

## 3. Training

Once our model is defined, we can train it using a simple `for` loop. In PyTorch, we must first declare an optimizer and put the model in training mode. We can then iterate over batches of data and update the parameters using gradient descent.

Several optimization algorithms are available :
- <b>Stochastic gradient descent (SGD)</b>
- <b>SGD with momentum</b>
- <b>Nesterov momentum</b>
- <b>Adagrad</b>
- <b>RMSProp</b>
- <b>Adam</b>

Basically, simple SGD is not very efficient because gradients can be high in some directions and low in others, causing an undesirable zigzagging movement. `SGD with momentum` tackles this issue by computing a momentum, a kind of running history of past gradients that better orients gradient descent (this is analogous to velocity in physics).
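The effect of momentum can be illustrated on a toy problem. The sketch below runs full-batch gradient descent, with and without momentum, on an elongated quadratic bowl. The loss function, learning rate and number of steps are arbitrary choices made for illustration, not values used elsewhere in this notebook.

```python
def loss(x, y):
    # Elongated bowl: gradients are much larger along y than along x
    return 0.5 * (x ** 2 + 25.0 * y ** 2)

def grad(x, y):
    return x, 25.0 * y

def run(lr=0.03, beta=0.0, steps=100):
    """Gradient descent with momentum; beta=0 recovers plain gradient descent."""
    x, y = 1.0, 1.0
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx = beta * vx + gx  # running accumulation of past gradients
        vy = beta * vy + gy
        x -= lr * vx
        y -= lr * vy
    return loss(x, y)

print("plain gradient descent:", run(beta=0.0))
print("with momentum (0.9):   ", run(beta=0.9))
```

On this particular problem, the momentum variant ends with a lower loss after the same number of steps, because the accumulated velocity speeds up progress along the flat `x` direction.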

`Adagrad` adapts the learning rate of each parameter by accumulating the sum of squares of its gradients. Thus, if gradients are high in one direction, we divide the learning rate by a large value, making the updates smaller. On the contrary, weights that receive small updates keep a relatively higher learning rate.

`RMSProp` is similar to `Adagrad` but attempts to reduce its aggressive, monotonically decreasing learning rate, by using a moving average of squared gradients instead.
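The difference between the two accumulators can be seen by tracking the effective learning rate, i.e. the base rate divided by the square root of the accumulated squared gradients. The constant gradient sequence and the helper name `effective_lr` are assumptions chosen to keep the sketch minimal.

```python
import math

def effective_lr(gradients, lr=0.1, decay=None, eps=1e-8):
    """Return the per-step learning rate lr / (sqrt(cache) + eps).

    decay=None accumulates a plain sum of squared gradients (Adagrad);
    a value such as 0.9 uses a moving average instead (RMSProp).
    """
    cache, rates = 0.0, []
    for g in gradients:
        if decay is None:
            cache += g ** 2
        else:
            cache = decay * cache + (1.0 - decay) * g ** 2
        rates.append(lr / (math.sqrt(cache) + eps))
    return rates

gradients = [1.0] * 200  # constant gradient, for illustration only
adagrad = effective_lr(gradients)
rmsprop = effective_lr(gradients, decay=0.9)
print("Adagrad first / last:", adagrad[0], adagrad[-1])
print("RMSProp first / last:", rmsprop[0], rmsprop[-1])
```

With Adagrad the effective rate keeps shrinking as the sum grows, whereas with the moving average it levels off near the base learning rate.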

Finally, `Adam` is a combination of `RMSProp` and momentum techniques, and is the recommended default choice in most problems. This is the one we will use. The update rule is :
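- $m = \beta m + (1-\beta)dx$
- $m_t = m / (1-\beta^t)$
- $v = \beta_2 v + (1-\beta_2)dx^2$
- $v_t = v / (1 - \beta_2^t)$
- $x += - lr \times m_t / (\sqrt{v_t} + \epsilon)$

We use $\beta=0.9$ and $\beta_2 = 0.999$. Here $t$ is the iteration number (starting at 1), and $m_t$ and $v_t$ are the bias-corrected estimates of $m$ and $v$.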
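As a sanity check of the update rule, here is a minimal scalar Adam sketch with the standard bias correction (dividing by $1-\beta^t$ and $1-\beta_2^t$, where $t$ is the iteration number). The toy objective, the learning rate and the function name `adam_minimize` are assumptions for illustration; in the notebook itself we rely on `torch.optim.Adam`.

```python
import math

def adam_minimize(grad_fn, x0, lr=0.1, beta=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a one-dimensional function with Adam, given its gradient."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        dx = grad_fn(x)
        m = beta * m + (1.0 - beta) * dx          # first moment (momentum)
        v = beta2 * v + (1.0 - beta2) * dx ** 2   # second moment (RMSProp-like)
        m_hat = m / (1.0 - beta ** t)             # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Toy objective f(x) = (x - 3)^2, whose gradient is 2 (x - 3)
x_final = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_final)
```

The bias correction matters mostly in the first iterations, when `m` and `v` are still close to their zero initialization.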

We also have to define a loss function for our problem. We will use the <b>cross entropy loss</b>, a common choice for classification tasks. 
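For a single example, the cross entropy loss is the negative log of the softmax probability assigned to the correct class. The sketch below computes it by hand for a two-class problem. The scores are made-up values and the function name `cross_entropy` is illustrative; PyTorch's `F.cross_entropy` averages this quantity over the batch by default.

```python
import math

def cross_entropy(scores, target):
    """Softmax cross-entropy for one example: -log softmax(scores)[target]."""
    max_score = max(scores)  # subtract the max for numerical stability
    log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_sum_exp - scores[target]

print(cross_entropy([0.0, 0.0], target=0))   # uninformative scores: log(2)
print(cross_entropy([4.0, -4.0], target=0))  # confident and correct: small loss
print(cross_entropy([4.0, -4.0], target=1))  # confident and wrong: large loss
```

The loss is therefore small only when the model puts most of its probability mass on the right class.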

In [44]:
import torch
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import time

First, define a function to check the accuracy of a model on either the validation or the test dataset.

Several metrics are available for classification tasks, such as :
- <b>accuracy</b>
- <b>recall</b>
- <b>precision</b>
- <b>F1 score</b>
- <b>ROC AUC</b>

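Choosing the right metric is essential but is really problem-dependent (e.g. do we want to classify all positive targets accurately, even if this implies misclassifying a lot of negative targets ?). For simplicity, we will use accuracy as our main metric. Keep in mind that about 93.8% of the training questions are sincere, so a classifier that always predicts "sincere" already reaches roughly that accuracy.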
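To see how these metrics differ on unbalanced data, the sketch below computes them from scratch for a trivial classifier that always predicts the negative class. The toy labels (6 positives out of 100, close to the roughly 6% of insincere questions in the training set) and the function name `classification_metrics` are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels with roughly the class ratio of this dataset
y_true = [1] * 6 + [0] * 94
always_sincere = [0] * 100
print(classification_metrics(y_true, always_sincere))
```

The accuracy looks high while recall and F1 are zero, which is why accuracy alone should be read with care on this task.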

In [40]:
def check_accuracy_naive(loader, model, validation=True):
    """
    Check accuracy of a model.
    
    Arguments:
    - loader: An Iterator object on which iterating to construct batches of data.
    - model: A PyTorch Module giving the model.
    - validation: (Optional) boolean which indicates if we check accuracy on the
      validation (True) or test (False) dataset.
    
    Returns:
    - the accuracy, computed on the first 11 batches of the loader only
    """
    
    # Use validation or test set
    if validation:
        print('Checking accuracy on validation set')
    else:
        print('Checking accuracy on test set')
        
    num_correct = 0
    num_samples = 0
    
    # Set model to evaluation mode: this only affects certain modules.
    # For example, dropout layers behave differently during training and evaluation.
    model.eval()
    
    with torch.no_grad():  # Indicate to PyTorch that we don't need to build computational graphs
        for t, batch in enumerate(loader):
            
            # Load x and y
            x = batch.text.transpose(1, 0)  # reshape to [batch_size, len_seq]
            y = batch.target.type(torch.LongTensor)
            
            # Move to device, e.g. CPU
            x = x.to(device=device)  
            y = y.to(device=device)
            
            # Compute scores and predictions
            scores = model(x)
            _, preds = scores.max(1)
            num_correct += (preds == y).sum()
            num_samples += preds.size(0)
            
            # Only evaluate on the first 11 batches to keep the check fast
            if t == 10:
                break
            
        acc = float(num_correct) / num_samples
        return acc

In [50]:
def train_naive(model, optimizer, loader_train, print_every=10, epochs=1, stop=True):
    """
    Train a model using the PyTorch Module API.
    
    Arguments:
    - model: A PyTorch Module giving the model to train.
    - optimizer: An Optimizer object we will use to train the model.
    - loader_train: An Iterator object on which iterating to construct batches of data.
    - print_every: (Optional) Print the loss and validation accuracy every print_every iterations.
    - epochs: (Optional) A Python integer giving the number of epochs to train for.
    - stop: (Optional) If True, stops each epoch once iteration 100 is reached.
    
    Returns: Nothing, but prints model accuracies during training.
    """
    
    # Move the model parameters to CPU / GPU
    model = model.to(device=device)
    
    start = time.time()
    
    for epoch in range(epochs):
        for t, train_batch in enumerate(loader_train):
            
            # Put model to training mode
            model.train()
            
            # Load x and y
            x = train_batch.text.transpose(1, 0)  # reshape to [batch_size, len_seq]
            y = train_batch.target.type(torch.LongTensor)
   
            # Move to device, e.g. CPU
            x = x.to(device=device)
            y = y.to(device=device)

            # Compute scores and softmax loss
            scores = model(x)
            loss = F.cross_entropy(scores, y)

            # Zero out all of the gradients for the variables which the optimizer
            # will update.
            optimizer.zero_grad()

            # Backwards pass: compute the gradient of the loss with
            # respect to each  parameter of the model.
            loss.backward()

            # Update the parameters of the model using the gradients
            # computed by the backwards pass.
            optimizer.step()
            
            if t % print_every == 0:
                print('Iteration %d, loss = %.4f' % (t, loss.item()))
                acc = check_accuracy_naive(val_iter, model, validation=True)
                print('Accuracy :', acc)
                print()
                
            if stop and t == 100:
                break
                
    end = time.time()
    print(end - start)

#### Check implementation

In [51]:
# Learning rate
learning_rate = 1e-2

# Model
model = NaiveModel(embedding_matrix=word_embeddings, 
                   hidden_dim=64, 
                   output_size=2)

# Optimizer
optimizer = optim.Adam(model.parameters(), 
                       lr=learning_rate, 
                       betas=(0.9, 0.999),  # recommended values
                       eps=1e-08)  # recommended value

# Train
train_naive(model, optimizer, train_iter)

Iteration 0, loss = 0.7549
Checking accuracy on validation set
Accuracy : 0.9392755681818182

Iteration 10, loss = 0.1781
Checking accuracy on validation set
Accuracy : 0.93359375

Iteration 20, loss = 0.2837
Checking accuracy on validation set
Accuracy : 0.9421164772727273

Iteration 30, loss = 0.2444
Checking accuracy on validation set
Accuracy : 0.9240056818181818

Iteration 40, loss = 0.2131
Checking accuracy on validation set
Accuracy : 0.94140625

Iteration 50, loss = 0.1615
Checking accuracy on validation set
Accuracy : 0.9378551136363636

Iteration 60, loss = 0.2342
Checking accuracy on validation set
Accuracy : 0.9364346590909091

Iteration 70, loss = 0.3271
Checking accuracy on validation set
Accuracy : 0.9364346590909091

Iteration 80, loss = 0.2312
Checking accuracy on validation set
Accuracy : 0.9431818181818182

Iteration 90, loss = 0.2182
Checking accuracy on validation set
Accuracy : 0.9339488636363636

Iteration 100, loss = 0.2157
Checking accuracy on validation set
Ac

897 seconds per epoch.