# **Creating a Classification Model using CoronaBERT**

The purpose of this colab is to create a classification model to predict the relevance on documents, using the qrels for training, validation and testing data.  The queries and documents will be encoded using CoronaBERT.  The training data makes use of all the judgements made at round 4 of TREC-COVID, whereas the testing data makes use of the judgements made in round 5.

## **Setup**

In [1]:
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.utils import resample

In [None]:
#get all the doc embeddings - the process of doc encoding is performed in a complementary colab
!wget https://github.com/DavidONeill75101/level-4-project/blob/master/Datasets/coronaBERT_Embeddings/coronaBERT_doc_embeddings.pickle?raw=true

with open('/content/coronaBERT_doc_embeddings.pickle?raw=true', 'rb') as f:
  doc_embeddings = pickle.load(f)

In [None]:
len(doc_embeddings)

191175

In [2]:
#get all the query embeddings - the process of query encoding is performed in a complementary colab

!wget https://github.com/DavidONeill75101/level-4-project/blob/master/Datasets/coronaBERT%20Embeddings/coronaBERT_query_embeddings.pickle?raw=true
with open('/content/coronaBERT_query_embeddings.pickle?raw=true', 'rb') as f:
  query_embeddings = pickle.load(f)

--2022-03-15 14:43:43--  https://github.com/DavidONeill75101/level-4-project/blob/master/Datasets/coronaBERT%20Embeddings/coronaBERT_query_embeddings.pickle?raw=true
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/DavidONeill75101/level-4-project/raw/master/Datasets/coronaBERT%20Embeddings/coronaBERT_query_embeddings.pickle [following]
--2022-03-15 14:43:43--  https://github.com/DavidONeill75101/level-4-project/raw/master/Datasets/coronaBERT%20Embeddings/coronaBERT_query_embeddings.pickle
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/DavidONeill75101/level-4-project/master/Datasets/coronaBERT%20Embeddings/coronaBERT_query_embeddings.pickle [following]
--2022-03-15 14:43:43--  https://raw.githubusercontent.com/DavidONeill75101/level-4-project/ma

In [None]:
len(query_embeddings)

50

In [None]:
!wget https://raw.githubusercontent.com/DavidONeill75101/level-4-project/master/Datasets/DataSplit/training_validation_data.csv
training_qrels = pd.read_csv('/content/training_validation_data.csv').drop(columns=['Unnamed: 0'])

In [None]:
len(training_qrels)

46203

In [None]:
!wget https://raw.githubusercontent.com/DavidONeill75101/level-4-project/master/Datasets/DataSplit/testing_data.csv
test_qrels = pd.read_csv('/content/testing_data.csv').drop(columns=['Unnamed: 0'])

In [None]:
len(test_qrels)

23151

As there is some ambiguity around the validity of the "partially relevant" judgements, we will only consider the documents which are either "not relevant" or "full relevant".

In [None]:
training_qrels = training_qrels[training_qrels['label'].isin([0,2])]

In [None]:
len(training_qrels)

39377

In [None]:
test_qrels = test_qrels[test_qrels['label'].isin([0,2])]

In [None]:
len(test_qrels)

18916

In [None]:
#concatenate the query and document embeddings for every row in the training qrels

training_queries = list(training_qrels['qid'])
training_docnos = list(training_qrels['docno'])
training_labels = list(training_qrels['label'])

X_train = []
y_train = []

for query, docno, label in zip(training_queries, training_docnos, training_labels):
  
  try:
    query_embedding = query_embeddings[str(query)]
    doc_embedding = doc_embeddings[docno]
    input = np.concatenate([query_embedding, doc_embedding])

    X_train.append(input)
    y_train.append(label)

  except:
    print("No embedding")
  


No embedding
No embedding
No embedding
No embedding
No embedding
No embedding
No embedding
No embedding
No embedding
No embedding
No embedding
No embedding
No embedding
No embedding
No embedding
No embedding
No embedding
No embedding
No embedding
No embedding
No embedding
No embedding
No embedding
No embedding
No embedding
No embedding
No embedding
No embedding
No embedding
No embedding
No embedding
No embedding


In [None]:
len(X_train)==len(y_train)

True

We now have our training data - the list of concatenated query and doc embeddings - and our training labels - the corresponding relevance judgements.

Next we need to define a model on which to train these qrels - we will begin with a dense neural network.

First we will scale the data.

In [None]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)

In [None]:
len(X_train)==len(y_train)

True

Now we will do the same with the test qrels.

In [None]:
test_queries = list(test_qrels['qid'])
test_docnos = list(test_qrels['docno'])
test_labels = list(test_qrels['label'])



X_test = []
y_test = []

for query, docno, label in zip(test_queries, test_docnos, test_labels):
  
  try:
    query_embedding = query_embeddings[str(query)]
    doc_embedding = doc_embeddings[docno]
    input = np.concatenate([query_embedding, doc_embedding])

    X_test.append(input)
    y_test.append(label)

  except:
    print("No embedding")
  


In [None]:
sc = StandardScaler()
X_test = sc.fit_transform(X_test)

## **Creating the model**

In [None]:
classifier = MLPClassifier(hidden_layer_sizes=(150,100,50), max_iter=300,activation = 'relu',solver='adam',random_state=1)

In [None]:
import torch
device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(device)

cuda:0


In [None]:
classifier.fit(X_train, y_train)

MLPClassifier(hidden_layer_sizes=(150, 100, 50), max_iter=300, random_state=1)

In [None]:
#get the test predictions
y_true, y_pred = y_test , classifier.predict(X_test)

In [None]:
#show the results on the test set
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

           0       0.77      0.87      0.82     12239
           2       0.68      0.52      0.59      6677

    accuracy                           0.75     18916
   macro avg       0.73      0.69      0.70     18916
weighted avg       0.74      0.75      0.74     18916



## **Hyperparameter Tuning**

In [None]:
param_grid = {
    'hidden_layer_sizes': [(150,100,50), (120,80,40), (100,50,30)],
    'max_iter': [100],
    'activation': ['relu'],
    'solver': ['adam'],
}

In [None]:
grid = GridSearchCV(classifier, param_grid, n_jobs= -1, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_) 

{'activation': 'relu', 'hidden_layer_sizes': (150, 100, 50), 'max_iter': 100, 'solver': 'adam'}




In [None]:
grid_predictions = grid.predict(X_test) 

In [None]:
print(classification_report(y_test, grid_predictions))

              precision    recall  f1-score   support

           0       0.77      0.86      0.81     12239
           2       0.67      0.53      0.60      6677

    accuracy                           0.74     18916
   macro avg       0.72      0.70      0.70     18916
weighted avg       0.74      0.74      0.74     18916

