![MLU Logo](../images/MLU_Logo.png)

# MLU-NLP2 Final Project

## Problem Statement
The project focuses on answer selection and uses the WikiQA dataset. Each record in the dataset has a question, a candidate answer and a binary relevance score: 1 if the answer is relevant to the question, 0 otherwise.

A question can appear in several records and can have more than one relevant answer statement.

To simplify the problem, we have kept only questions that have at least one relevant answer. This results in train, validation and test datasets with 873, 126 and 243 questions respectively.
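The filtering rule described above can be sketched with pandas. This is only an illustration on toy records, assuming the column names `question`, `answer` and `relevance` shown later in the notebook; it is not the code that was used to prepare the project files.

```python
import pandas as pd

# Toy records that reuse the column names of the project CSVs.
records = pd.DataFrame({
    "question": ["q1", "q1", "q2", "q2", "q3"],
    "answer": ["a", "b", "c", "d", "e"],
    "relevance": [0, 1, 0, 0, 1],
})

# For every row, look up the best relevance score within its question group.
# A question is kept only if at least one of its answers has relevance 1.
has_relevant = records.groupby("question")["relevance"].transform("max") == 1
filtered = records[has_relevant].reset_index(drop=True)

print(filtered["question"].unique().tolist())
```

Here `q2` is dropped because none of its candidate answers is marked relevant.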

## Project Objective

In this notebook, you will start your journey. It contains a baseline model that will give you a first performance score and, of course, all the code necessary for your first submission.

__IMPORTANT__ 

Make sure you submit this notebook to get to know how the Leaderboard works and to make sure your completion is granted :)

## The Baseline Model

Here we are using Torchtext, an NLP-specific package for PyTorch.

We will generate a 100-dimensional embedding vector for each word using GloVe and build a basic convolutional network which takes the text embeddings as input (50 × 100). The network is trained on the training dataset in batches, and the loss is backpropagated to update the weights and reduce the loss in later iterations.

The trained model is then used to make predictions on the test dataset. Finally, a result dataset with the list of predictions and the sequential ID is created for your first leaderboard submission.
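To make the 50 × 100 input shape concrete, here is a minimal NumPy sketch of turning one text into an embedding matrix. It assumes that 50 is the number of tokens per example and 100 the embedding size; `load_glove`, `embed` and the randomly generated sample lines are hypothetical stand-ins for illustration, not the real contents of `glove.6B.100d.txt`.

```python
import numpy as np

EMBED_DIM = 100   # GloVe vector size mentioned in the text
MAX_LEN = 50      # assumed number of tokens per example

def load_glove(lines):
    """Parse GloVe text format: a word followed by its vector components."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def embed(text, vectors, max_len=MAX_LEN, dim=EMBED_DIM):
    """Return a (max_len, dim) matrix; unknown words and padding stay zero."""
    matrix = np.zeros((max_len, dim), dtype=np.float32)
    for i, token in enumerate(text.lower().split()[:max_len]):
        if token in vectors:
            matrix[i] = vectors[token]
    return matrix

# Hypothetical stand-in for a few lines of a GloVe file (random values).
rng = np.random.default_rng(0)
sample_lines = [
    word + " " + " ".join(f"{x:.4f}" for x in rng.normal(size=EMBED_DIM))
    for word in ["what", "is", "area", "code"]
]

glove = load_glove(sample_lines)
features = embed("what area code is 949", glove)
print(features.shape)
```

Texts shorter than 50 tokens are zero-padded and longer ones are truncated, so every example reaches the convolutional network with the same shape.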

This notebook was inspired by https://www.kaggle.com/ziliwang/pytorch-text-cnn

### __Dataset:__
The original train and test datasets contain questions for which no answer has relevance 1. To make the problem simpler, we have kept only questions that have at least one answer with a relevance score of 1. These updated versions of the datasets are used in the project.

### __Table of Contents__
Here is the plan for this assignment.
<p>
<div class="lev1">
    <a href="#Reading-the-dataset"><span class="toc-item-num">1&nbsp;&nbsp;</span>
        Reading the dataset
    </a>
</div>
<div class="lev1">
    <a href="#Data-Preparation"><span class="toc-item-num">2&nbsp;&nbsp;</span>
        Data Preparation
    </a>
</div>
<div class="lev1">
    <a href="#Model-Building"><span class="toc-item-num">3&nbsp;&nbsp;</span>
        Model Building
    </a>
</div>
<div class="lev1">
    <a href="#Training"><span class="toc-item-num">4&nbsp;&nbsp;</span>
        Training
    </a>
</div>
<div class="lev1">
    <a href="#Prediction"><span class="toc-item-num">5&nbsp;&nbsp;</span>
        Prediction
    </a>
</div>
<div class="lev1">
    <a href="#Submit-Results"><span class="toc-item-num">6&nbsp;&nbsp;</span>
        Submit Results
    </a>
</div>

In [1]:
# torchtext is a PyTorch package with data processing utilities and popular datasets for natural language
!pip -q install torchtext

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
fastai 1.0.61 requires nvidia-ml-py3, which is not installed.
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/pytorch_p36/bin/python -m pip install --upgrade pip' command.


In [2]:
import pandas as pd
import boto3
import os
import numpy as np
import torch
from torch import nn
from sklearn.metrics import f1_score
from tqdm import tqdm, tqdm_notebook
import torchtext
from nltk import word_tokenize
import random
from torch import optim
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### Reading the dataset
The datasets are stored in our MLU datalake and can be downloaded to your local instance with the cell below.

In [3]:
# import the datasets
bucketname = 'mlu-courses-datalake' 
s3 = boto3.resource('s3')

s3.Bucket(bucketname).download_file('NLP2/data/training.csv', 
                                         './training.csv') 
s3.Bucket(bucketname).download_file('NLP2/data/public_test_features.csv', 
                                         './public_test_features.csv')
s3.Bucket(bucketname).download_file('NLP2/data/glove.6B.100d.txt', 
                                         './glove.6B.100d.txt')

In [4]:
TRAIN_DATA_FILE ='./training.csv'
TEST_DATA_FILE = './public_test_features.csv'
GLOVE_DATA_FILE = './glove.6B.100d.txt'

Below, we combine the question and answer in each row into a single text column for simplicity. Alternatively, we could run two parallel networks, one for the question and one for the answer, merge their outputs and add a classification layer on top. You may choose to save the files for ease of use in future steps.

In [5]:
train=pd.read_csv(TRAIN_DATA_FILE)
test=pd.read_csv(TEST_DATA_FILE)
#test = test_original.copy()
#train['text']=train[['question','answer']].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
#test['text']=test[['question','answer']].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)

#train=train[['text','relevance']].rename(columns={'relevance':'label'})
#test=test[['text']]
#train.to_csv('train.csv',index=False)
#test.to_csv('test.csv',index=False)

In [6]:
train.head()

Unnamed: 0,ID,question,answer,relevance
0,2788,who kill franz ferdinand ww1,A plaque commemorating the location of the Sar...,0
1,8166,what is a medallion guarantee,Sample of a Medallion signature guarantee stampIn,0
2,4289,what does a vote to table a motion mean ?,The difference is the idea of what the table i...,0
3,8180,when was the lady gaga judas song released,`` Judas '' is a song by American recording ar...,1
4,725,How did Edgar Allan Poe die ?,His work forced him to move among several citi...,0


In [7]:
train.columns

Index(['ID', 'question', 'answer', 'relevance'], dtype='object')

In [8]:
test.head()

Unnamed: 0,ID,question,answer
0,917,when does the electoral college votes,The Twelfth Amendment specifies how a Presiden...
1,6587,what year lord of rings made ?,Tolkien 's work has been the subject of extens...
2,5227,what countries are under the buddhism religion,Estimate of the worldwide Buddhist population ...
3,4707,what does ( sic ) mean ?,Sic may also refer to:
4,700,when is it memorial day,In cases involving a family graveyard where re...


In [9]:
!pip install transformers
!pip install --upgrade torch
!pip install nvidia-ml-py3
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer

Collecting transformers
  Downloading transformers-4.8.2-py3-none-any.whl (2.5 MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting huggingface-hub==0.0.12
  Downloading huggingface_hub-0.0.12-py3-none-any.whl (37 kB)
Installing collected packages: tokenizers, sacremoses, huggingface-hub, transformers
Successfully installed huggingface-hub-0.0.12 sacremoses-0.0.45 tokenizers-0.10.3 transformers-4.8.2
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/pytorch_p36/bin/python -m pip install --upgrade pip' command.
You should consider upgrading

In [10]:
train

Unnamed: 0,ID,question,answer,relevance
0,2788,who kill franz ferdinand ww1,A plaque commemorating the location of the Sar...,0
1,8166,what is a medallion guarantee,Sample of a Medallion signature guarantee stampIn,0
2,4289,what does a vote to table a motion mean ?,The difference is the idea of what the table i...,0
3,8180,when was the lady gaga judas song released,`` Judas '' is a song by American recording ar...,1
4,725,How did Edgar Allan Poe die ?,His work forced him to move among several citi...,0
...,...,...,...,...
6856,1310,when is the wv state fair,Free parking is provided adjacent to the fairg...,0
6857,3413,what are square diamonds called ?,"However , while displaying the same high degre...",0
6858,9631,what is direct marketing channel,Direct marketing is practiced by businesses of...,0
6859,581,who was charged with murder after the massacre...,They received hate mail and death threats and ...,0


In [11]:
test

Unnamed: 0,ID,question,answer
0,917,when does the electoral college votes,The Twelfth Amendment specifies how a Presiden...
1,6587,what year lord of rings made ?,Tolkien 's work has been the subject of extens...
2,5227,what countries are under the buddhism religion,Estimate of the worldwide Buddhist population ...
3,4707,what does ( sic ) mean ?,Sic may also refer to:
4,700,when is it memorial day,In cases involving a family graveyard where re...
...,...,...,...
2936,5590,how many ports are there in networking,"That is , data packets are routed across the n..."
2937,5320,what genre is bloody beetroots,"In fact , the only identifying public feature ..."
2938,1664,where is green bay packers from,They are members of the North Division of the ...
2939,1245,when did the civil war start and where,The Union marshaled the resources and manpower...


In [12]:
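# Join question and answer into one text column. Note: '[SEP]' is BERT's separator token; the roberta-base
# tokenizer used below does not treat it as special (its separator is '</s>'), so it is tokenized as plain text.
train['text'] = train[['question','answer']].apply(lambda row: ' [SEP] '.join(row.values.astype(str)), axis=1)
test['text'] = test[['question','answer']].apply(lambda row: ' [SEP] '.join(row.values.astype(str)), axis=1)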

In [69]:
from sklearn.model_selection import train_test_split

model_name = 'roberta-base'#'bert-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(model_name)

train_, val_ = train_test_split(train, test_size=0.1, random_state = 17)

#train_encodings = tokenizer(train['text'].tolist(), truncation=True, padding=True)
#val_encodings = tokenizer(test['text'].tolist(), truncation=True, padding=True)
train_encodings = tokenizer(train_['text'].tolist(), truncation=True, padding=True)
val_encodings = tokenizer(val_['text'].tolist(), truncation=True, padding=True)
test_encodings = tokenizer(test['text'].tolist(), truncation=True, padding=True)

Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/roberta-base/resolve/main/config.json from cache at /home/ec2-user/.cache/huggingface/transformers/733bade19e5f0ce98e6531021dd5180994bb2f7b8bd7e80c7968805834ba351e.35205c6cfc956461d8515139f0f8dd5d207a2f336c0c3a83b4bc8dca3518e37b
Model config RobertaConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.8.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_siz

In [70]:
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

In [71]:
train_

Unnamed: 0,ID,question,answer,relevance,text
5303,8296,what is vitamin a for,Both structural features are essential for vit...,0,what is vitamin a for [SEP] Both structural fe...
426,9188,How Works Diaphragm Pump,Those employing one or more unsealed diaphragm...,0,How Works Diaphragm Pump [SEP] Those employing...
3003,1709,who is inventor of the radio,It is considered likely that the first intenti...,0,who is inventor of the radio [SEP] It is consi...
5665,4602,what day is the feast of st joseph 's ?,March 19 was dedicated to Saint Joseph in seve...,0,what day is the feast of st joseph 's ? [SEP] ...
5784,9072,what area code is 949,Area code 949 is an area code in California th...,1,what area code is 949 [SEP] Area code 949 is a...
...,...,...,...,...,...
1337,2808,how deep can be drill for deep underwater,“Not all oil is accessible on land or in shall...,0,how deep can be drill for deep underwater [SEP...
406,7344,how much does united states spend on health care,A 2013 study found that about 25 % of all seni...,0,how much does united states spend on health ca...
5510,8044,what is lean manufacturing and who developed,TPS is renowned for its focus on reduction of ...,0,what is lean manufacturing and who developed [...
2191,4501,what did ronald reagan do as president,"A conservative icon , he ranks highly in publi...",0,what did ronald reagan do as president [SEP] A...


In [72]:
#train_Dataset = CustomDataset(train_encodings, train.relevance.values)
#val_Dataset = CustomDataset(val_encodings, np.zeros(test.shape[0]))
train_Dataset = CustomDataset(train_encodings, train_.relevance.values)
val_Dataset = CustomDataset(val_encodings, val_.relevance.values)
# The test set has no labels, so placeholder labels are used. One entry is set to 1,
# presumably so that roc_auc_score in compute_metrics sees both classes during predict().
tmp = np.zeros(test.shape[0]).astype(int)
tmp[0] = 1
test_Dataset = CustomDataset(test_encodings, tmp)

In [73]:
np.zeros(test.shape[0])

array([0., 0., 0., ..., 0., 0., 0.])

In [74]:
from sklearn.metrics import precision_recall_fscore_support, accuracy_score, recall_score, precision_score, f1_score, roc_auc_score
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision_P, recall_P, f1_P, _ = precision_recall_fscore_support(labels, preds, average='binary')
    precision_N, recall_N, f1_N, _ = precision_recall_fscore_support(labels, preds, average='binary', pos_label=0)
    acc = accuracy_score(labels, preds)
    #fpr, tpr, thresholds = roc_curve(labels, pred, pos_label=2)
    #AUC = auc(fpr, tpr)
    AUC = roc_auc_score(labels, pred.predictions[:,1])
    return {
        'accuracy': acc,
        'auc': AUC,
        'f1_P': f1_P,
        'precision_P': precision_P,
        'recall_P': recall_P,
        'f1_N': f1_N,
        'precision_N': precision_N,
        'recall_N': recall_N,
    }
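The all-zero `f1_P`, `precision_P` and `recall_P` values in the first evaluation steps of the training log are what these metrics give when every prediction is class 0. The sketch below reproduces that case with NumPy only. It assumes a hypothetical validation split of 597 negatives and 90 positives (counts inferred from the logged 687 examples and 0.868996 accuracy, not read from the data), and `binary_prf` is an illustrative helper rather than a scikit-learn function.

```python
import numpy as np

def binary_prf(labels, preds, pos_label=1):
    """Precision, recall and F1 for one class; 0.0 when a denominator is empty."""
    labels, preds = np.asarray(labels), np.asarray(preds)
    tp = int(np.sum((preds == pos_label) & (labels == pos_label)))
    fp = int(np.sum((preds == pos_label) & (labels != pos_label)))
    fn = int(np.sum((preds != pos_label) & (labels == pos_label)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Assumed class counts (see the note above); the model predicts 0 everywhere.
labels = np.array([0] * 597 + [1] * 90)
preds = np.zeros_like(labels)

accuracy = float(np.mean(labels == preds))
positive = binary_prf(labels, preds, pos_label=1)
negative = binary_prf(labels, preds, pos_label=0)

print(round(accuracy, 6))
print(positive)
print(tuple(round(v, 6) for v in negative))
```

Accuracy alone looks strong on such an imbalanced split, which is why the positive-class F1 is the more informative number to track. With no positive predictions the positive-class precision is undefined; scikit-learn reports it as 0 and emits a warning, which is likely what the `_warn_prf` lines in the training log correspond to.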

In [75]:
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    evaluation_strategy="steps",
    eval_steps=25,
    num_train_epochs=5,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=32,   # batch size for evaluation
    warmup_steps=100,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    #logging_dir='./logs',            # directory for storing logs
    #logging_steps=10,
    save_total_limit=5,
    load_best_model_at_end=True,
    metric_for_best_model='f1_P', #'AUC'
    learning_rate=1e-5
)

model = AutoModelForSequenceClassification.from_pretrained(model_name)


trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_Dataset,         # training dataset
    eval_dataset=val_Dataset,            # evaluation dataset
    compute_metrics=compute_metrics
    #callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
loading configuration file https://huggingface.co/roberta-base/resolve/main/config.json from cache at /home/ec2-user/.cache/huggingface/transformers/733bade19e5f0ce98e6531021dd5180994bb2f7b8bd7e80c7968805834ba351e.35205c6cfc956461d8515139f0f8dd5d207a2f336c0c3a83b4bc8dca3518e37b
Model config RobertaConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_

In [None]:
trainer.train()

***** Running training *****
  Num examples = 6174
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1930
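The step count reported above follows from the number of examples, the batch size and the number of epochs. A quick check, assuming a single device and no gradient accumulation, as the log states:

```python
import math

num_examples = 6174   # "Num examples" in the training log
batch_size = 16       # per_device_train_batch_size
num_epochs = 5        # num_train_epochs

# The last batch of each epoch is partial, so round up.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```

This gives 386 steps per epoch and 1930 steps in total, so the rows shown in the table below (up to step 250) all fall within the first epoch.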


Step,Training Loss,Validation Loss,Accuracy,Auc,F1 P,Precision P,Recall P,F1 N,Precision N,Recall N
25,No log,0.640084,0.868996,0.52574,0.0,0.0,0.0,0.929907,0.868996,1.0
50,No log,0.484446,0.868996,0.542397,0.0,0.0,0.0,0.929907,0.868996,1.0
75,No log,0.388281,0.868996,0.748576,0.0,0.0,0.0,0.929907,0.868996,1.0
100,No log,0.365145,0.868996,0.7882,0.0,0.0,0.0,0.929907,0.868996,1.0
125,No log,0.3163,0.868996,0.831379,0.0,0.0,0.0,0.929907,0.868996,1.0
150,No log,0.313986,0.868996,0.834003,0.0,0.0,0.0,0.929907,0.868996,1.0
175,No log,0.33582,0.868996,0.812265,0.0,0.0,0.0,0.929907,0.868996,1.0
200,No log,0.290632,0.868996,0.850717,0.0,0.0,0.0,0.929907,0.868996,1.0
225,No log,0.328692,0.871907,0.827731,0.06383,0.75,0.033333,0.93125,0.872621,0.998325
250,No log,0.282899,0.898108,0.859296,0.520548,0.678571,0.422222,0.942997,0.917591,0.969849


***** Running Evaluation *****
  Num examples = 687
  Batch size = 32
  _warn_prf(average, modifier, msg_start, len(result))
Saving model checkpoint to ./results/checkpoint-25
Configuration saved in ./results/checkpoint-25/config.json
Model weights saved in ./results/checkpoint-25/pytorch_model.bin
Deleting older checkpoint [results/checkpoint-850] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 687
  Batch size = 32
  _warn_prf(average, modifier, msg_start, len(result))
Saving model checkpoint to ./results/checkpoint-50
Configuration saved in ./results/checkpoint-50/config.json
Model weights saved in ./results/checkpoint-50/pytorch_model.bin
Deleting older checkpoint [results/checkpoint-1175] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 687
  Batch size = 32
  _warn_prf(average, modifier, msg_start, len(result))
Saving model checkpoint to ./results/checkpoint-75
Configuration saved in ./results/checkpoint-75/config.json
Mode

In [58]:
pred = trainer.predict(test_Dataset)

***** Running Prediction *****
  Num examples = 2941
  Batch size = 32


In [67]:
np.argmax(pred.predictions, axis=1)

array([0, 0, 0, ..., 0, 0, 0])

### Submit Results

Create a new dataframe for submission. The model outputs are converted to labels and concatenated with the original sequential ID from the test file downloaded from Leaderboard to generate the final submission. The cell below takes the argmax over the two class scores, which corresponds to a 0.5 threshold on the predicted probability of relevance; a lower threshold, such as the pre-defined 0.15, can be applied instead and tuned for better performance.

For submission, follow these steps:
1. Go to the folder where your notebook is stored in SageMaker
2. Download the file __test_submission_nlp2.csv__ to your local machine
3. In the NLP2 Leaderboard contest, select the option __My Submissions__ and upload your file
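Converting model outputs to labels with a probability threshold rather than an argmax can be sketched as follows. This is a minimal NumPy example on hypothetical logits laid out like `pred.predictions`, with one row per example and one column per class; `positive_probability` is an illustrative helper, and 0.15 is the threshold quoted in the text, which would need tuning on the validation split.

```python
import numpy as np

def positive_probability(logits):
    """Softmax over the two class logits; returns P(relevance = 1) per row."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(shifted)
    return exp[:, 1] / exp.sum(axis=1)

# Hypothetical logits with shape (n_examples, 2).
logits = np.array([[2.0, -1.0], [1.0, 0.0], [0.2, 0.4], [-1.0, 3.0]])

THRESHOLD = 0.15   # assumed starting point, to be tuned
probs = positive_probability(logits)
labels_threshold = (probs >= THRESHOLD).astype(int)
labels_argmax = logits.argmax(axis=1)

print(labels_threshold.tolist())
print(labels_argmax.tolist())
```

The second example has a relevance probability of about 0.27, so it is labelled 1 with the 0.15 threshold but 0 with the argmax. On a dataset where relevant answers are rare, lowering the threshold trades precision for recall on the positive class.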

In [68]:
result_df = pd.DataFrame(columns=["ID", "relevance"])
result_df["ID"] = test["ID"].tolist()
labels=np.argmax(pred.predictions, axis=1)
result_df["relevance"] = labels
result_df.to_csv("test_submission_nlp2.csv", index=False)