![MLU Logo](../images/MLU_Logo.png)

# MLU-NLP2 Final Project

## Problem Statement
The project focuses on answer selection and uses the WikiQA dataset. Each record in the dataset has a question, answer and relevance score. The relevance score is binary, 1/0 indicating whether the answer is relevant to the question. 

Each question can be repeated multiple times and can have multiple relevant answer statements. 

To make the problem less complex, we have considered only questions which have at least 1 relevant answer. This simplification results in train, validation and test datasets with 873, 126 and 243 questions respectively.

## Project Objective

In this notebook, you will start our jorney. It contains a baseline model that will give you a first performance score and ourse and all code necessary ready for your first submission.

__IMPORTANT__ 

Make sure you submit this notebook to get to know better how Leaderboard works and, also, make sure your completion will be granted :) .

## The Baseline Model

Here we are using Torchtext: an NLP specific package in Torch. 

We will generate 100 dim vector embeddings for each word using Glove and build a basic convolutional network which takes the text embeddings as input (50 * 100). The training dataset is trained in batches using this network and the losses in each epoch are backpropagated to update the weights and minimize losses in future iterations.

The trained model is then used to make predictions on test dataset and finally, a result dataset with the list of predictions and sequential ID is created for your first leaderboard submission

Notebook has been inspired from https://www.kaggle.com/ziliwang/pytorch-text-cnn

### __Dataset:__
The originial train and test datasets have questions for which there are no answers with relevance 1. To make the problem simpler, we have considered only questions which have atleast 1 answer with relevance score 1. This updated version of the datasets are used in the project

### __Table of Contents__
Here is the plan for this assignment.
<p>
<div class="lev1">
    <a href="#Reading the dataset"><span class="toc-item-num">1&nbsp;&nbsp;</span>
        Reading the dataset
    </a>
</div>
<div class="lev1">
    <a href="#Data-Preparation"><span class="toc-item-num">2&nbsp;&nbsp;</span>
        Data Preparation
    </a>
</div>
<div class="lev1">
    <a href="#Model-Building"><span class="toc-item-num">3&nbsp;&nbsp;</span>
        Model Building
    </a>
</div>
<div class="lev1">
    <a href="#Training"><span class="toc-item-num">4&nbsp;&nbsp;</span>
        Training
    </a>
</div>
<div class="lev1">
    <a href="#Prediction"><span class="toc-item-num">5&nbsp;&nbsp;</span>
        Prediction
    </a>
</div>
<div class="lev1">
    <a href="#Submit-Results"><span class="toc-item-num">6&nbsp;&nbsp;</span>
        Submit Results
    </a>
</div>

In [41]:
##torchtext is a package within pytorch consisting of data processing utilities and popular datasets for natural language
!pip -q install torchtext==0.4

In [42]:
import pandas as pd
import boto3
import os
import numpy as np
import torch
from torch import nn
from sklearn.metrics import f1_score
from tqdm import tqdm, tqdm_notebook
import torchtext
from nltk import word_tokenize
import random
from torch import optim
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Reading the dataset
The datasets are in our MLU datalake and can be downloaded to your local instance here

In [43]:
# import the datasets
bucketname = 'mlu-courses-datalake' 
s3 = boto3.resource('s3')

s3.Bucket(bucketname).download_file('NLP2/data/training.csv', 
                                         './training.csv') 
s3.Bucket(bucketname).download_file('NLP2/data/public_test_features.csv', 
                                         './public_test_features.csv')
s3.Bucket(bucketname).download_file('NLP2/data/glove.6B.100d.txt', 
                                         './glove.6B.100d.txt')

In [44]:
TRAIN_DATA_FILE ='./training.csv'
TEST_DATA_FILE = './public_test_features.csv'
GLOVE_DATA_FILE = './glove.6B.100d.txt'

Below, we are combining question and answer in each row as 1 single text column for simplicity. Alternatively, we can run two parallel networks for question and answer, merge the output of the 2 networks and have a classification layer as output. You may choose to save the files for ease of use, in future steps.

In [45]:
train=pd.read_csv(TRAIN_DATA_FILE)
test_original=pd.read_csv(TEST_DATA_FILE)
test = test_original.copy()
train['text']=train[['question','answer']].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
test['text']=test[['question','answer']].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)

train=train[['text','relevance']].rename(columns={'relevance':'label'})
test=test[['text']]
train.to_csv('train.csv',index=False)
test.to_csv('test.csv',index=False)

### Data Preparation

In the below steps, we are converting the pandas dataset into Torch datasets. 
1. Define the datatypes of columns: 
We define text column as field type text and set the parameters: 
   lower=True: sets the words in text to lowercase
   batch_first=True: set the dimensions such that batch dimension is first. For example, if length of text is fixed at 50 and batch size is 512, then the dimensions will be (512,50)
   tokenize=word_tokenize: the method used for tokenizing words. You can try other methods such as spacy here
   fix_length=50: all examples will be fixed or padded to this length
We define label column as field type label and set the parameters:
   sequential=False: if false, no tokenization will be applied
   use_vocab=False: if false, the data in this field is known to be numerical
   is_target=True: if true, this field is the target variable


2. Read both train and test datasets as tabular datasets in torch and define the text and label columns based on previously defined field types. Tabular Dataset is used for datasets of columns stored in CSV, TSV or JSON format.

In [46]:
text = torchtext.data.Field(lower=True, batch_first=True, tokenize=word_tokenize, fix_length=50)
label = torchtext.data.Field(sequential=False, use_vocab=False, is_target=True)
train = torchtext.data.TabularDataset(path='train.csv', format='csv',
                                      fields={'text': ('text',text),
                                              'label': ('label',label)})
test = torchtext.data.TabularDataset(path='test.csv', format='csv',
                                     fields={'text': ('text', text)
                                             })

In the next step, we build the vocabulary from the text columns in train and test data. 

text.build_vocab:
In order to build vocabulary, we pass datasets for which vocabulary has to be built and additional parameters such as min_freq and max_size. 
min_freq: minimum number of times a word has to occur in the datasets
max_size: maximum size of vocabulary to be created (not used in the example)

torchtext.vocab.Vectors
We define and load 100 dim glove embeddings to use for the vector representation

text.vocab.set_vectors
Maps each word in the vocabulary to glove embeddings. glove.stoi will have numerical identifier mapped for each token string and glove.vectors will have the 100 dim vector mapped for each token string

In [47]:
text.build_vocab(train, test, min_freq=3)
glove = torchtext.vocab.Vectors(GLOVE_DATA_FILE)
text.vocab.set_vectors(glove.stoi, glove.vectors, dim=100)

### Model Building

Below is the neural network architecture:
1. The input to the neural network is the padded glove embeddings for each word in text. The identifier for padding is also passed as a parameter
2. The input embeddings are passed through a convolutional layer with specified number of kernels and kernel size. You may want to experiment with more layers and more kernels/different kernel sizes
3. Convolutional layer is followed by a 2 * 2 max pooling layer and linear layer which generates 1 output
4. If static is set to True, then the weights of the embedding layer are not updated through gradient descent

In [48]:
class TextCNN(nn.Module):
    
    def __init__(self, lm, padding_idx, static=True, kernel_num=128, fixed_length=50, kernel_size=5, dropout=0.2):
        super(TextCNN, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        self.embedding = nn.Embedding.from_pretrained(lm)
        if static:
            self.embedding.weight.requires_grad = False
        self.embedding.padding_idx = padding_idx
        self.conv = nn.ModuleList([nn.Conv2d(1, kernel_num, (kernel_size, self.embedding.embedding_dim))])
        self.maxpools = [nn.MaxPool2d((fixed_length+1-kernel_size,1))]
        self.fc = nn.Linear(kernel_num, 1)
        
    def forward(self, input):
        x = self.embedding(input).unsqueeze(1)  
        x = [self.maxpools[i](torch.tanh(cov(x))).squeeze(3).squeeze(2) for i, cov in enumerate(self.conv)]  
        x = torch.cat(x, dim=1)  
        y = self.fc(self.dropout(x))
        return y

### Training

Below is the training loop. Each batch of training data is read, predictions are computed through forward propagation of batch inputs. Losses are computed between predictions and actual labels and back propagated to update the weights. In each epoch, we compute the f1 score with a preset threshold of 0.15 (this can be a tunable parameter and could provide better performance of other thresholds)

In [49]:
def training(epoch, model, loss_func, optimizer, train_iter):
    e = 0
    
    while e < epoch:
        train_iter.init_epoch()
        losses, preds, true = [], [], []
        for train_batch in tqdm(list(iter(train_iter)), 'epoch {} training'.format(e)):
            model.train()
            x = train_batch.text.cuda()
            y = train_batch.label.type(torch.Tensor).cuda()
            true.append(train_batch.label.numpy())
            model.zero_grad()
            pred = model.forward(x).view(-1)
            loss = loss_function(pred, y)
            preds.append(torch.sigmoid(pred).cpu().data.numpy())
            losses.append(loss.cpu().data.numpy())
            loss.backward()
            optimizer.step()
        train_f1=f1_score([j for i in true for j in i],np.array([j for i in preds for j in i])>0.15)
        alpha_train=0.15
        print('epoch {:02} - train_loss {:.4f} - train f1 {:.4f} - threshold {:.4f}'.format(
                            e, np.mean(losses), train_f1, alpha_train))
                
        e += 1
    return alpha_train

Setting random seed for reproducibility. Batch size is set as 512 and is a hyperparameter that can be varied. Torch uses bucket iterator for language modelling tasks. 

In [50]:
random.seed(1234)
batch_size = 512
train_iter = torchtext.data.BucketIterator(dataset=train,
                                               batch_size=batch_size,
                                               shuffle=True,
                                               sort=False)

Below function helps with weight initialization for model parameters. Xavier is typically used as a standard for weight initialization

In [51]:
def init_network(model, method='xavier', exclude='embedding', seed=123):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    for name, w in model.named_parameters():
        if not exclude in name:
            if 'weight' in name:
                if method is 'xavier':
                    nn.init.xavier_normal_(w)
                elif method is 'kaiming':
                    nn.init.kaiming_normal_(w)
                else:
                    nn.init.normal_(w)
            elif 'bias' in name:
                nn.init.constant_(w, 0.0)
            else: 
                pass

Below steps are used to define the model, initialize the model, the optimization algorithm and loss function. Learning rate is a tunable hyperparameter here

In [52]:
text.fix_length = 50
model = TextCNN(text.vocab.vectors, padding_idx=text.vocab.stoi[text.pad_token], kernel_size=5, kernel_num=128, static=False, fixed_length=text.fix_length, dropout=0.1).cuda()
init_network(model)
optimizer = optim.Adam(params=model.parameters(), lr=1e-3)
loss_function = nn.BCEWithLogitsLoss()

In [53]:
%%time
training(3, model, loss_function, optimizer, train_iter)

epoch 0 training: 100%|██████████| 14/14 [00:00<00:00, 81.77it/s]
epoch 1 training:   0%|          | 0/14 [00:00<?, ?it/s]

epoch 00 - train_loss 0.4256 - train f1 0.1556 - threshold 0.1500


epoch 1 training: 100%|██████████| 14/14 [00:00<00:00, 82.77it/s]
epoch 2 training:   0%|          | 0/14 [00:00<?, ?it/s]

epoch 01 - train_loss 0.3671 - train f1 0.2247 - threshold 0.1500


epoch 2 training: 100%|██████████| 14/14 [00:00<00:00, 93.63it/s]

epoch 02 - train_loss 0.3414 - train f1 0.3149 - threshold 0.1500
CPU times: user 809 ms, sys: 104 ms, total: 913 ms
Wall time: 905 ms





0.15

### Prediction

Below function is used to predict on test dataset using trained model. It returns a list of predicted probabilities

In [54]:
def predict(model, test_list):
    pred = []
    with torch.no_grad():
        for test_batch in test_list:
            model.eval()
            x = test_batch.text.cuda()
            pred += torch.sigmoid(model.forward(x).view(-1)).cpu().data.numpy().tolist()
    return pred

In [55]:
test_list = list(torchtext.data.BucketIterator(dataset=test,
                                    batch_size=batch_size,
                                    sort=False,
                                    train=False))

In [56]:
preds = predict(model, test_list)

### Submit Results

Create a new dataframe for submission. The list of predicted probabilities are converted to labels using the pre-defined threshold of 0.15 (can be tuned for better performance). The list of labels is concatenated with the original sequential ID from the test file downloaded from Leaderboard, to generate the final submission

For submission, follow these steps:
1. Go to the folder where your notebook is in Sagemaker
2. Donwload the file __test_submission_nlp2.csv__ to your local machine
3. On NLP2 Leaderboard contest, select option __My Submissions"__ and upload your file

In [68]:
result_df = pd.DataFrame(columns=["ID", "relevance"])
result_df["ID"] = test_original["ID"].tolist()
labels=[1 if pred>0.15 else 0 for pred in preds]
result_df["relevance"] = labels
result_df.to_csv("test_submission_nlp2.csv", index=False)