![MLU Logo](../images/MLU_Logo.png)

# MLU-NLP2 Final Project

## Problem Statement
The project focuses on answer selection and uses the WikiQA dataset. Each record in the dataset has a question, an answer, and a relevance score. The relevance score is binary: 1 if the answer is relevant to the question and 0 otherwise.

Each question can be repeated multiple times and can have multiple relevant answer statements. 

To make the problem less complex, we have considered only questions which have at least 1 relevant answer. This simplification results in train, validation and test datasets with 873, 126 and 243 questions respectively.

## Project Objective

In this notebook, you will start your journey. It contains a baseline model that will give you a first performance score, along with all the code necessary for your first submission.

__IMPORTANT__ 

Make sure you submit this notebook, both to get to know how the Leaderboard works and to make sure your completion is granted :)

### __Dataset:__
The original train and test datasets contain questions for which no answer has relevance 1. To make the problem simpler, we have kept only questions that have at least 1 answer with relevance score 1. This updated version of the datasets is used in the project.
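
The filtering described above can be sketched with pandas. This is only an illustration on made-up rows, not the code that produced the project files; the column names `question`, `answer` and `relevance` follow the dataset description, and the toy values are assumptions.

```python
import pandas as pd

# Toy records in the same shape as the project data.
# The rows are made up for illustration only.
records = pd.DataFrame({
    "question": ["q1", "q1", "q2", "q2", "q3"],
    "answer": ["a1", "a2", "a3", "a4", "a5"],
    "relevance": [0, 1, 0, 0, 1],
})

# For every row, look up the highest relevance seen for its question.
# A question has a relevant answer exactly when that maximum is 1.
has_relevant = records.groupby("question")["relevance"].transform("max") == 1

# Keep only the rows whose question has at least one relevant answer.
filtered = records[has_relevant].reset_index(drop=True)
kept_questions = sorted(filtered["question"].unique())
print(kept_questions)
```

Here `q2` is dropped because none of its candidate answers is marked relevant, while every row of `q1` and `q3` is kept, including the non-relevant ones.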

### __Table of Contents__
Here is the plan for this assignment.
<p>
<div class="lev1">
    <a href="#Reading-the-dataset"><span class="toc-item-num">1&nbsp;&nbsp;</span>
        Reading the dataset
    </a>
</div>
<div class="lev1">
    <a href="#Data-Preparation"><span class="toc-item-num">2&nbsp;&nbsp;</span>
        Data Preparation
    </a>
</div>
<div class="lev1">
    <a href="#Model-Building"><span class="toc-item-num">3&nbsp;&nbsp;</span>
        Model Building
    </a>
</div>
<div class="lev1">
    <a href="#Training"><span class="toc-item-num">4&nbsp;&nbsp;</span>
        Training
    </a>
</div>
<div class="lev1">
    <a href="#Prediction"><span class="toc-item-num">5&nbsp;&nbsp;</span>
        Prediction
    </a>
</div>
<div class="lev1">
    <a href="#Submit-Results"><span class="toc-item-num">6&nbsp;&nbsp;</span>
        Submit Results
    </a>
</div>

In [1]:
#!pip install -U sentence-transformers


In [2]:
import pandas as pd
import boto3
import os
import numpy as np
import torch
from torch import nn
from sklearn.metrics import f1_score
from tqdm import tqdm, tqdm_notebook
# import torchtext
from nltk import word_tokenize
import random
from torch import optim
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Reading the dataset
The datasets are in our MLU datalake and can be downloaded to your local instance as follows.

In [3]:
# import the datasets
bucketname = 'mlu-courses-datalake' 
s3 = boto3.resource('s3')

s3.Bucket(bucketname).download_file('NLP2/data/training.csv', 
                                         './training.csv') 
s3.Bucket(bucketname).download_file('NLP2/data/public_test_features.csv', 
                                         './public_test_features.csv')
s3.Bucket(bucketname).download_file('NLP2/data/glove.6B.100d.txt', 
                                         './glove.6B.100d.txt')

In [4]:
TRAIN_DATA_FILE ='./training.csv'
TEST_DATA_FILE = './public_test_features.csv'
GLOVE_DATA_FILE = './glove.6B.100d.txt'

The cell below renames the `relevance` column to `label` and holds out the rows after the first 5,500 as a dev split. The commented-out lines combine the question and answer in each row into a single text column for simplicity. Alternatively, we can run two parallel networks, one for the question and one for the answer, merge their outputs, and add a classification layer on top. You may choose to save the files for ease of use in future steps.

In [5]:
train_original=pd.read_csv(TRAIN_DATA_FILE)
train_original=train_original.rename(columns={'relevance':'label'}) 
train = train_original.iloc[:5500,:]
dev = train_original.iloc[5500:,:]

# train['text']=train[['question','answer']].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
# test['text']=test[['question','answer']].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
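
The commented-out lines above build the combined column with a row-wise `apply`. A minimal sketch of the same idea using vectorised string concatenation is shown here on two made-up rows; the `_` separator is taken from the commented-out code, and the frame name `pairs` is hypothetical.

```python
import pandas as pd

# Two made-up question/answer rows, for illustration only.
pairs = pd.DataFrame({
    "question": ["what is a lapping machine", "when is it memorial day"],
    "answer": ["An example answer .", "Another example answer ."],
})

# Cast to str first so that missing or numeric values do not break the join,
# then concatenate question and answer with the "_" separator.
pairs["text"] = pairs["question"].astype(str) + "_" + pairs["answer"].astype(str)

combined = pairs["text"].tolist()
print(combined[0])
```

The cross-encoder used later in this notebook takes the question and answer as a pair, so this combined column is only needed for models that expect a single text input.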

In [6]:
train

| Unnamed: 0 | ID | question | answer | label |
|---|---|---|---|---|
| 0 | 2788 | who kill franz ferdinand ww1 | A plaque commemorating the location of the Sar... | 0 |
| 1 | 8166 | what is a medallion guarantee | Sample of a Medallion signature guarantee stampIn | 0 |
| 2 | 4289 | what does a vote to table a motion mean ? | The difference is the idea of what the table i... | 0 |
| 3 | 8180 | when was the lady gaga judas song released | `` Judas '' is a song by American recording ar... | 1 |
| 4 | 725 | How did Edgar Allan Poe die ? | His work forced him to move among several citi... | 0 |
| ... | ... | ... | ... | ... |
| 5495 | 9015 | what are the arb medications | Arbitron , a radio audience research company | 0 |
| 5496 | 340 | when was everybody hates chris made | The show is set between 1983 and 1987 ; howeve... | 0 |
| 5497 | 3596 | what is a lapping machine | Taken to a finer limit , this will produce a p... | 0 |
| 5498 | 4610 | what day is the feast of st joseph 's ? | As the traditional holiday of the Apostles Ss ... | 0 |


In [7]:
test=pd.read_csv(TEST_DATA_FILE) 
test

| Unnamed: 0 | ID | question | answer |
|---|---|---|---|
| 0 | 917 | when does the electoral college votes | The Twelfth Amendment specifies how a Presiden... |
| 1 | 6587 | what year lord of rings made ? | Tolkien 's work has been the subject of extens... |
| 2 | 5227 | what countries are under the buddhism religion | Estimate of the worldwide Buddhist population ... |
| 3 | 4707 | what does ( sic ) mean ? | Sic may also refer to: |
| 4 | 700 | when is it memorial day | In cases involving a family graveyard where re... |
| ... | ... | ... | ... |
| 2936 | 5590 | how many ports are there in networking | That is , data packets are routed across the n... |
| 2937 | 5320 | what genre is bloody beetroots | In fact , the only identifying public feature ... |
| 2938 | 1664 | where is green bay packers from | They are members of the North Division of the ... |
| 2939 | 1245 | when did the civil war start and where | The Union marshaled the resources and manpower... |


### Model Building

In [8]:
from torch.utils.data import DataLoader
import math
from sentence_transformers import SentenceTransformer, LoggingHandler, losses, util, InputExample 
from sentence_transformers.cross_encoder import CrossEncoder
from sentence_transformers.cross_encoder.evaluation import CEBinaryClassificationEvaluator
import logging
from datetime import datetime
import os
import gzip
import csv

#### Just some code to print debug information to stdout
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])
 
train_batch_size = 16
num_epochs = 4
model_save_path = 'output/continue_training-'+datetime.now().strftime("%Y-%m-%d_%H-%M-%S")


# Pre-trained cross encoder
# For binary tasks and tasks with continuous scores (like STS), we set num_labels=1
model = CrossEncoder('distilbert-base-uncased-distilled-squad', num_labels=1)

# Convert the dataset to a DataLoader ready for training 
def create_examples(df):
    samples = []
    for index, row in df.iterrows():
        samples.append(InputExample(texts=[row['question'], row['answer']], label=row['label'] )) 
    return samples


train_samples = create_examples(train)
dev_samples = create_examples(dev) 
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=train_batch_size)
 
evaluator = CEBinaryClassificationEvaluator.from_input_examples(dev_samples)


# Configure the training. The evaluator built above scores the model on the dev split during training
warmup_steps = math.ceil(len(train_dataloader) * num_epochs * 0.1) # 10% of the total training steps are used for warm-up
logging.info("Warmup-steps: {}".format(warmup_steps))


# Train the model
model.fit(train_dataloader=train_dataloader,
          evaluator=evaluator,
          epochs=num_epochs,
          evaluation_steps=1000,
          warmup_steps=warmup_steps,
          output_path=model_save_path)

Some weights of the model checkpoint at distilbert-base-uncased-distilled-squad were not used when initializing DistilBertForSequenceClassification: ['qa_outputs.bias', 'qa_outputs.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-distilled-squad and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifier.bias', 'pre_classifier.bias']
You should probably TRAIN this model on a down-

2021-07-22 16:31:00 - Use pytorch device: cuda
2021-07-22 16:31:01 - Warmup-steps: 138
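
The logged value of 138 warm-up steps can be checked by hand. The sketch below reproduces the arithmetic from the training cell, assuming the training split has the 5,500 rows selected by `iloc[:5500, :]` and that the `DataLoader` keeps the last partial batch (its default behaviour).

```python
import math

num_train_rows = 5500   # rows kept by train_original.iloc[:5500, :]
train_batch_size = 16
num_epochs = 4

# The last, partially filled batch still counts as one step.
steps_per_epoch = math.ceil(num_train_rows / train_batch_size)

# 10% of all training steps, rounded up, as in the training cell.
warmup_steps = math.ceil(steps_per_epoch * num_epochs * 0.1)

print(steps_per_epoch, warmup_steps)  # 344 138
```

The 344 steps per epoch match the iteration count shown in the progress bar, and 10% of 344 × 4 = 1376 total steps rounds up to 138.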


Epoch:   0%|          | 0/4 [00:00<?, ?it/s]

Iteration:   0%|          | 0/344 [00:00<?, ?it/s]

2021-07-22 16:32:26 - CEBinaryClassificationEvaluator: Evaluating the model on  dataset after epoch 0:
2021-07-22 16:32:33 - Accuracy:           93.31	(Threshold: 0.3836)
2021-07-22 16:32:33 - F1:                 64.86	(Threshold: 0.3836)
2021-07-22 16:32:33 - Precision:          77.78
2021-07-22 16:32:33 - Recall:             55.63
2021-07-22 16:32:33 - Average Precision:  64.46

2021-07-22 16:32:33 - Save model to output/continue_training-2021-07-22_16-30-59


Iteration:   0%|          | 0/344 [00:00<?, ?it/s]

2021-07-22 16:33:58 - CEBinaryClassificationEvaluator: Evaluating the model on  dataset after epoch 1:
2021-07-22 16:34:04 - Accuracy:           93.24	(Threshold: 0.7220)
2021-07-22 16:34:04 - F1:                 65.72	(Threshold: 0.4279)
2021-07-22 16:34:04 - Precision:          70.45
2021-07-22 16:34:04 - Recall:             61.59
2021-07-22 16:34:04 - Average Precision:  66.41

2021-07-22 16:34:04 - Save model to output/continue_training-2021-07-22_16-30-59


Iteration:   0%|          | 0/344 [00:00<?, ?it/s]

2021-07-22 16:35:30 - CEBinaryClassificationEvaluator: Evaluating the model on  dataset after epoch 2:
2021-07-22 16:35:37 - Accuracy:           93.09	(Threshold: 0.9437)
2021-07-22 16:35:37 - F1:                 65.37	(Threshold: 0.0914)
2021-07-22 16:35:37 - Precision:          63.92
2021-07-22 16:35:37 - Recall:             66.89
2021-07-22 16:35:37 - Average Precision:  64.52



Iteration:   0%|          | 0/344 [00:00<?, ?it/s]

2021-07-22 16:37:00 - CEBinaryClassificationEvaluator: Evaluating the model on  dataset after epoch 3:
2021-07-22 16:37:07 - Accuracy:           93.09	(Threshold: 0.9501)
2021-07-22 16:37:07 - F1:                 64.60	(Threshold: 0.5917)
2021-07-22 16:37:07 - Precision:          67.14
2021-07-22 16:37:07 - Recall:             62.25
2021-07-22 16:37:07 - Average Precision:  63.91



### Prediction

In [9]:
##############################################################################
#
# Load the stored model and use it to score the test dataset
#
##############################################################################

model = CrossEncoder(model_save_path) 
# test_evaluator = CEBinaryClassificationEvaluator.from_input_examples(test_samples)
# test_evaluator(model, output_path=model_save_path)

2021-07-22 16:37:07 - Use pytorch device: cuda


In [10]:
test_pairs = [[row['question'], row['answer']] for index, row in test.iterrows()]  
test_pairs[0]

['when does the electoral college votes',
 'The Twelfth Amendment specifies how a President and Vice President are elected and requires each elector to cast one vote for President and another vote for Vice President .']

In [11]:
preds = model.predict(test_pairs)
preds

Batches:   0%|          | 0/92 [00:00<?, ?it/s]

array([0.05267958, 0.01043369, 0.01169606, ..., 0.01145452, 0.01000954,
       0.01304464], dtype=float32)

### Submit Results

Create a new dataframe for submission. The predicted probabilities are converted to labels using a fixed threshold (0.5917 in the code below; it can be tuned for better performance). The labels are then combined with the original ID column from the test file downloaded from the Leaderboard to generate the final submission.

For submission, follow these steps:
1. Go to the folder where your notebook is in SageMaker
2. Download the file __test_submission_nlp2.csv__ to your local machine
3. On the NLP2 Leaderboard contest, select the option __My Submissions__ and upload your file

In [13]:
result_df = pd.DataFrame(columns=["ID", "relevance"])
result_df["ID"] = test["ID"].tolist()
labels=[1 if pred>0.5917 else 0 for pred in preds]
result_df["relevance"] = labels
result_df.to_csv("test_submission_nlp2.csv", index=False)
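
The fixed threshold used above (0.5917) can be tuned on the held-out dev split. The sketch below is one possible way to do that, not part of the baseline: it sweeps candidate thresholds and keeps the one with the highest F1. The helper `f1_at_threshold` and the score and label lists are hypothetical and made up for illustration; `sklearn.metrics.f1_score`, already imported in this notebook, could be used in place of the hand-written F1.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 of the positive class when scores above `threshold` are labelled 1."""
    preds = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:
        # No true positives: precision and recall are both zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up dev-set scores and gold labels, for illustration only.
dev_scores = [0.05, 0.10, 0.20, 0.40, 0.70, 0.90]
dev_labels = [0, 0, 1, 0, 1, 1]

# Try thresholds 0.01, 0.02, ..., 0.99 and keep the first one with the best F1.
candidates = [i / 100 for i in range(1, 100)]
best_threshold = max(
    candidates, key=lambda t: f1_at_threshold(dev_scores, dev_labels, t)
)
best_f1 = f1_at_threshold(dev_scores, dev_labels, best_threshold)
print(best_threshold, round(best_f1, 3))
```

With real data, `dev_scores` would come from `model.predict` on the dev question/answer pairs and `dev_labels` from the `label` column of `dev`; the chosen threshold would then replace the constant in the submission cell.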