# **Project SPOTTED: Model Selection - BERT**

**<u>_Objective:_</u>** We fine-tune a pretrained BERT model to predict whether a tweet was written by an information operative (state-sponsored troll) or by a verified Twitter account. Our purpose is to make it more efficient for the defense and intelligence community to identify and disrupt state-sponsored disinformation campaigns.

---
## Introduction
---

In this notebook, we use a very powerful NLP model called BERT. There are several ways to use BERT as an embedding model, and we explore two of them.
In the first, we use the Hugging Face Trainer class on a combined BERT model and classifier called BertForSequenceClassification. In the second, we use native PyTorch to directly access the last hidden state for each sentence, and vertically stack the tensors to form the features vector. With the second method, we can feed the output of the BERT model into different classifiers of our choosing.

We try both methods and pick the better one, so that the final model is chosen on measured performance rather than by assumption.

BERT stands for Bidirectional Encoder Representations from Transformers. Similar to Word2Vec, it is an NLP model that learns representations of text. BERT was developed by Google AI Language in 2018, and is considered a Swiss Army knife for the majority of NLP tasks. Typically, one requires separate models for different NLP tasks, but most of these can be accomplished with BERT. This is possible because BERT is pretrained on a very large dataset: English Wikipedia (about 2.5 billion words) and BooksCorpus (about 800M words). It took Google 4 days of continuous training on custom TPUs to achieve this. BERT works by masking certain words, and using the words on either side of the mask (hence, bidirectional!) to guess the masked word. Lastly, BERT uses a Transformer architecture to learn context and relationships between words. It is this combination that makes BERT such a powerful deep learning NLP model.
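
To make the masking idea concrete, here is a minimal sketch in plain Python. It is a toy illustration and not BERT's actual pre-training code: the helper name `mask_tokens`, the whitespace tokenization and the fixed random seed are assumptions made for this example. The 15% masking rate follows the original BERT paper, which also sometimes swaps in random or unchanged tokens instead of `[MASK]`; that refinement is left out here.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace roughly `mask_rate` of the tokens with [MASK] (toy illustration)."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]   # the word the model must recover
        masked[pos] = "[MASK]"
    return masked, labels

tokens = "the troll account posted the same message many times".split()
masked, labels = mask_tokens(tokens)
print(masked)   # the sentence with one word hidden
print(labels)   # position -> original word, i.e. the training target
```

During pre-training, the model sees `masked` and is scored on how well it recovers the words stored in `labels`, using context from both sides of each mask.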

Using BERT in practice is easy because it is integrated with PyTorch and TensorFlow. This also allows us to leverage GPU-accelerated hardware to speed up calculations. In this notebook, we will use BERT with PyTorch. The notebook is executed in Google Colab with a GPU runtime.

Those interested in learning more about BERT can visit https://huggingface.co/blog/bert-101.

## Setup Environment

We first set up the environment for Google Colab

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%%capture
# Set up and download various dependencies (the %%capture cell magic must be the first line of the cell)

!pip install sentence_transformers
!pip install datasets
!pip install evaluate
!pip install fastBPE sacremoses subword_nmt sentencepiece
!pip install demoji
!pip install --upgrade gensim
!pip install torchvision
!pip install accelerate -U

Next, we set up the paths and directories for Google Colab

In [3]:
import os
# change working directory
%cd '/content/drive/MyDrive/Data Science and Analytics Portfolio/2 Projects/1 Project SPOTTED/2 Main'

cur_dir = os.getcwd()

utility_path = cur_dir + '/5_Utilities'
print(utility_path)

path = cur_dir

train_path = path + '/1_Data/SPOTTED_test_dataset.csv'
predict_path = path + '/1_Data/SPOTTED_validation_dataset.csv'
bert_output_dir = path + '/4_Notebooks/2_Model_Selection/BertForSequenceClassification'


/content/drive/MyDrive/Data Science and Analytics Portfolio/2 Projects/1 Project SPOTTED/2 Main
/content/drive/MyDrive/Data Science and Analytics Portfolio/2 Projects/1 Project SPOTTED/2 Main/5 utilities


In [4]:
# import modules and dependencies
import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as mpp
import re
import nltk
import time
import sys
import demoji
import joblib
import accelerate

# import the file with all the function definitions
sys.path.insert(0, utility_path)
from utility_functions import *


from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from torch.optim import AdamW

from transformers.optimization import Adafactor, AdafactorSchedule

from datasets.dataset_dict import DatasetDict, Dataset

from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix

from wordcloud import WordCloud
from matplotlib.pyplot import figure

from transformers import EarlyStoppingCallback
from transformers import AutoTokenizer, DataCollatorWithPadding, DataCollatorForLanguageModeling
from transformers import BertModel, BertTokenizer, BertForSequenceClassification
from transformers import TrainingArguments, Trainer
from transformers import AutoModelForSequenceClassification

pd.set_option('display.max_colwidth', None)

## Model Specification and Setting Up GPU Environment

We select the pretrained model and its tokenizer here.

We also detect the available device, so that we can push the models to the GPU later.

In [None]:
detected_device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# fall back to a plain 'CPU' label when no CUDA device is present
device_name = torch.cuda.get_device_name() if torch.cuda.is_available() else 'CPU'
print(detected_device, '\nName of device:', device_name)

cuda 
Name of device: Tesla T4


In [None]:
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)

# we will have two models - one for the bert for sequence classification and another for extracting the last hidden states
model = BertForSequenceClassification.from_pretrained(model_name, num_labels = 2)
model_lhs = BertModel.from_pretrained(model_name)

# push the two models to GPU
model_lhs.to(detected_device)
model.to(detected_device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

## Read Datasets

We read the test and evaluation datasets here. Remember that the "test" dataset is the one that goes into the train-test-split function; the evaluation dataset is held out for the final predictions.

In [None]:
df         = pd.read_csv(train_path).drop(columns = ['Unnamed: 0', 'hashtags'])
df_predict = pd.read_csv(predict_path).drop(columns = ['Unnamed: 0', 'hashtags'])

# let us change the datatype of the target column
df = df.astype({'target' : 'int64'})
df_predict = df_predict.astype({'target' : 'int64'})

print(df.dtypes)

print(len(df))
df.head()

tweet_text    object
target         int64
dtype: object
150000


Unnamed: 0,tweet_text,target
0,"As of 5 June 2020, 12pm, we have preliminarily confirmed an additional 261 cases of COVID-19 infection in Singapore. https://t.co/2RFMhrRkUw",0
1,"Boyfriend of missing Florida woman charged with murder: ""We wish Collin would provide us the information of where Kathleen is"" https://t.co/DBDJS5McdW",0
2,K-pop's BTS snags top prize at American Music Awards https://t.co/eR432aHJlm,0
3,RT @CincinnatiDays: Man killed in Bond Hill after altercation #news,1
4,Jared paying attention to his video game more than me pt 2 @juliakim52 http://t.co/0AHkR3K7Vg,1


In [None]:
df['target'].value_counts().to_frame()

Unnamed: 0,target
1,75118
0,74882


## Data Processing for BERT

The BERT model expects the input data to be in a certain format. Here, we convert the data into the Dataset structure from the Hugging Face datasets library. As before, we apply the train-test split only at the last step before sending the data into the classifier. Note that the keys of the dictionary cannot change, as BERT expects them when the data is fed into it.

But we need to perform some cleaning on the tweet text before we build the datasets.

In [None]:
# text_processing_bert (imported from utility_functions) performs basic text cleaning on the data
df_bert = df.copy()
df_bert['clean_tweet_text'] = df_bert['tweet_text'].apply(text_processing_bert)
df_bert.head()

Unnamed: 0,tweet_text,target,clean_tweet_text
0,"As of 5 June 2020, 12pm, we have preliminarily confirmed an additional 261 cases of COVID-19 infection in Singapore. https://t.co/2RFMhrRkUw",0,"as of 5 june 2020, 12pm, we have preliminarily confirmed an additional 261 cases of covid-19 infection in singapore."
1,"Boyfriend of missing Florida woman charged with murder: ""We wish Collin would provide us the information of where Kathleen is"" https://t.co/DBDJS5McdW",0,"boyfriend of missing florida woman charged with murder: ""we wish collin would provide us the information of where kathleen is"""
2,K-pop's BTS snags top prize at American Music Awards https://t.co/eR432aHJlm,0,k-pop's bts snags top prize at american music awards
3,RT @CincinnatiDays: Man killed in Bond Hill after altercation #news,1,rt : man killed in bond hill after altercation news
4,Jared paying attention to his video game more than me pt 2 @juliakim52 http://t.co/0AHkR3K7Vg,1,jared paying attention to his video game more than me pt 2
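
The helper `text_processing_bert` lives in the utilities module and is not shown in this notebook. Judging only from the sample rows above, it appears to lowercase the text and strip links, @handles and the `#` symbol. The sketch below is a hypothetical stand-in written to reproduce those sample rows; the name `clean_tweet_sketch` and both regular expressions are assumptions, and it does not handle emoji or every edge case the real helper may cover.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@[A-Za-z0-9]+")

def clean_tweet_sketch(text):
    """Hypothetical stand-in for text_processing_bert (the real helper is not shown)."""
    text = URL_RE.sub("", text)          # drop links
    text = MENTION_RE.sub("", text)      # drop @handles (letters and digits only)
    text = text.replace("#", "")         # keep the hashtag word, drop the symbol
    text = re.sub(r"[ \t]+", " ", text)  # collapse runs of spaces and tabs
    return text.strip().lower()

print(clean_tweet_sketch("RT @CincinnatiDays: Man killed in Bond Hill after altercation #news"))
```

Because the mention pattern stops at characters other than letters and digits, a handle such as `@unpuNISHAble_` leaves its trailing underscore behind, which is consistent with the cleaned text seen later in the prediction set.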


We now send the dataframe into the train-test-split function

In [None]:
# split the cleaned tweets and their labels into training and test segments
tweet_texts = list(df_bert['clean_tweet_text'])
target = list(df_bert['target'])

X_train, X_test, y_train, y_test = train_test_split(tweet_texts,
                                                    target,
                                                    random_state = 0)

Next, we tokenize the training and test segments of the split, using only the cleaned tweet text.

In [None]:
%%time
X_train_tokenized = tokenizer(X_train, padding = True, truncation = True, max_length = 512)
X_test_tokenized  = tokenizer(X_test, padding = True, truncation = True, max_length = 512)

CPU times: user 1min 17s, sys: 265 ms, total: 1min 18s
Wall time: 1min 18s


In [None]:
X_train_tokenized.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

We then need to convert the tokenized data into the Dataset data structure from the Hugging Face datasets library.

In [None]:
train_dict = {'label': y_train,
              'input_ids' : X_train_tokenized['input_ids']}

test_dict = {'label': y_test,
             'input_ids' : X_test_tokenized['input_ids']}

train_dataset = Dataset.from_dict(train_dict)
test_dataset = Dataset.from_dict(test_dict)

print(train_dataset, test_dataset)

Dataset({
    features: ['label', 'input_ids'],
    num_rows: 112500
}) Dataset({
    features: ['label', 'input_ids'],
    num_rows: 37500
})
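
The datasets above carry only `label` and `input_ids`, although the tokenizer also returned an `attention_mask`. The mask tells the model which positions are real tokens (1) and which are padding (0). The plain-Python sketch below shows how the two lists relate; `pad_batch` is a hypothetical helper written for illustration, and the token ids are made up apart from 101 (`[CLS]`), 102 (`[SEP]`) and the pad id 0 used by bert-base-uncased.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to a common length and build the matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)              # real ids, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)   # 1 = attend, 0 = ignore
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Illustrative token ids: a short and a longer "sentence"
batch = pad_batch([[101, 2023, 102], [101, 2023, 2003, 1037, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

One option, which is an assumption and not what produced the results reported in this notebook, would be to add `'attention_mask': X_train_tokenized['attention_mask']` (and the test equivalent) to the dictionaries so that the mask travels with the ids.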


### Using BertForSequenceClassification to Fine-Tune Pre-Trained Model

We first define the Trainer parameters and other specifications

In [None]:
Number_of_Epochs = 1
Evaluation_Steps = 500
Train_Batch_Size = 8
Test_Batch_Size  = 8
weight_decay     = 1e-4
lr               = 1e-5
warmup_steps     = 500
Seed             = 45

Next, we define the parameters for the Trainer class. This includes the evaluation metrics as well as the training arguments.

In [None]:
# Define Trainer parameters
def compute_metrics(p):

    logits, labels = p
    pred = np.argmax(logits, axis = 1)

    auc_score = roc_auc_score(y_true = labels, y_score = pred)
    recall    = recall_score(y_true = labels, y_pred = pred)
    precision = precision_score(y_true = labels, y_pred = pred)
    f1        = f1_score(y_true = labels, y_pred = pred)
    accuracy  = accuracy_score(y_true = labels, y_pred = pred)
    #confusion = confusion_matrix(y_true=labels, y_pred=pred) # cannot JSON confusion matrix

    return {'Accuracy': accuracy, 'Precision': precision, 'Recall': recall, 'f1': f1}


# Define Trainer
args = TrainingArguments(
    output_dir                  = bert_output_dir,
    overwrite_output_dir        = True,
    evaluation_strategy         = 'steps',
    save_strategy               = 'steps',
    eval_steps                  = Evaluation_Steps,
    per_device_train_batch_size = Train_Batch_Size,
    per_device_eval_batch_size  = Test_Batch_Size,
    num_train_epochs            = Number_of_Epochs,
    weight_decay                = weight_decay,
    seed                        = Seed,
    save_steps                  = Evaluation_Steps,
    learning_rate               = lr,
    load_best_model_at_end      = True,
    gradient_accumulation_steps = 8,
    #warmup_steps               = warmup_steps
)

Next is the Trainer class.

In [None]:
data_collator = DataCollatorWithPadding(tokenizer = tokenizer)

trainer = Trainer(
    model           = model,
    args            = args,
    train_dataset   = train_dataset,
    eval_dataset    = test_dataset,
    compute_metrics = compute_metrics,
    callbacks       = [EarlyStoppingCallback(early_stopping_patience = 3)],
    data_collator   = data_collator,
)
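
The `EarlyStoppingCallback` with `early_stopping_patience = 3` stops training once the monitored metric has failed to improve for three evaluations in a row. The counting logic can be sketched in plain Python as below. This is a simplified illustration, not the library's implementation: the function name and the loss values are made up, and the real callback also supports an improvement threshold.

```python
def early_stop_index(eval_losses, patience):
    """Return the index of the evaluation that triggers a stop, or None if none does."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if loss < best:
            best = loss        # improvement: remember it and reset the counter
            bad_evals = 0
        else:
            bad_evals += 1     # no improvement at this evaluation
            if bad_evals >= patience:
                return i
    return None

# Illustrative validation losses (made up for this sketch)
improving = [0.30, 0.25, 0.21, 0.20]
stalling  = [0.30, 0.25, 0.26, 0.27, 0.25, 0.24]

print(early_stop_index(improving, patience=3))
print(early_stop_index(stalling, patience=3))
```

With a patience of 3, a stop needs at least four evaluations (one to set the best value, then three without improvement), so a short run with few evaluation points is unlikely to trigger it.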

Train the model! (close your eyes and pray very very hard again)


In [None]:
trainer.train()

Step,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
500,0.2744,0.179322,0.928533,0.963138,0.891071,0.925704
1000,0.1767,0.143288,0.942907,0.949026,0.936009,0.942473
1500,0.149,0.133034,0.947627,0.953398,0.941186,0.947253


TrainOutput(global_step=1757, training_loss=0.19204613778424248, metrics={'train_runtime': 4097.2099, 'train_samples_per_second': 27.458, 'train_steps_per_second': 0.429, 'total_flos': 8032221409148160.0, 'train_loss': 0.19204613778424248, 'epoch': 1.0})
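
The step count in the training output can be cross-checked against the settings chosen earlier. The arithmetic below is a sketch that assumes a single GPU and that optimizer steps per epoch are the number of micro-batches divided (rounding down) by the gradient accumulation steps; under those assumptions it reproduces the reported `global_step` and the three evaluation rows.

```python
import math

n_train = 112_500            # rows in train_dataset
per_device_batch = 8         # Train_Batch_Size
grad_accum = 8               # gradient_accumulation_steps
eval_every = 500             # Evaluation_Steps

effective_batch = per_device_batch * grad_accum          # samples per optimizer step
micro_batches = math.ceil(n_train / per_device_batch)    # forward/backward passes per epoch
optimizer_steps = micro_batches // grad_accum            # weight updates per epoch
n_evaluations = optimizer_steps // eval_every            # evaluations during the epoch

print(effective_batch, optimizer_steps, n_evaluations)
```

So each weight update effectively sees 64 tweets, and the last checkpoint written during the epoch is the one at step 1500.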

## Using Trained Model for Prediction

Similarly, we need to create the Dataset data structure for the tweets we want to predict. We also need to load the trained model from a saved checkpoint. Then we pass the tweets to be predicted into the model.

We similarly need to clean the tweets.

In [None]:
df_predict['clean_tweet_text'] = df_predict['tweet_text'].apply(text_processing_bert)

print(len(df_predict))

df_predict.head()

5000


Unnamed: 0,tweet_text,target,clean_tweet_text
0,RT @kodiak149: .@PetersonUtah deserves more followers \nFollow @PetersonUtah \nSupport @PetersonUtah \nElections matter \n#wtpBlue https://t.co…,1,rt : . deserves more followers \nfollow \nsupport \nelections matter \nwtpblue https://t.co…
1,@unpuNISHAble_ @Desh3hunna like you'll avi's,1,_ like you'll avi's
2,"RT @MuslimIQ: Meanwhile millions of parents find it hard to cope that 1 in 6 kids go to bed hungry every night in America, &amp; the poverty th…",1,"rt : meanwhile millions of parents find it hard to cope that 1 in 6 kids go to bed hungry every night in america, &amp; the poverty th…"
3,.@TheRock put his #RedNotice co-stars @GalGadot and @VancityReynolds to the test in our first #MuseumFaceOff! 🎨 Follow @GoogleArts and find out which one of them can tell their Michelangelo from their Van Gogh. https://t.co/UIoTCadGzz @NetflixFilm https://t.co/ksE0Vg8zbR,0,. put his rednotice co-stars and to the test in our first museumfaceoff! follow and find out which one of them can tell their michelangelo from their van gogh.
4,"Parents who could work from home tried to multitask their way through, often at the cost of their productivity, sanity or both https://t.co/K0fzLY9kVe",0,"parents who could work from home tried to multitask their way through, often at the cost of their productivity, sanity or both"


In [None]:
tweets_eval = list(df_predict['clean_tweet_text'])

X_eval_tokenized = tokenizer(tweets_eval, padding = True, truncation = True, max_length = 512)

X_eval_dataset = Dataset.from_dict({'input_ids' : X_eval_tokenized['input_ids']})

X_eval_dataset

Dataset({
    features: ['input_ids'],
    num_rows: 5000
})

Next we load the trained model. The model needs to be fed into the Trainer class, so that it can be used to predict the tweets.

In [None]:
model_path = bert_output_dir + '/checkpoint-1500'
model = BertForSequenceClassification.from_pretrained(model_path, num_labels = 2)

test_trainer = Trainer(model)


Now we can make predictions

In [None]:
raw_pred, _, _ = test_trainer.predict(X_eval_dataset)

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


In [None]:
predicted_target = np.argmax(raw_pred, axis = 1)
df_predict['predicted target'] = predicted_target
df_predict

Unnamed: 0,tweet_text,target,clean_tweet_text,predicted target
0,RT @kodiak149: .@PetersonUtah deserves more followers \nFollow @PetersonUtah \nSupport @PetersonUtah \nElections matter \n#wtpBlue https://t.co…,1,rt : . deserves more followers \nfollow \nsupport \nelections matter \nwtpblue https://t.co…,1
1,@unpuNISHAble_ @Desh3hunna like you'll avi's,1,_ like you'll avi's,1
2,"RT @MuslimIQ: Meanwhile millions of parents find it hard to cope that 1 in 6 kids go to bed hungry every night in America, &amp; the poverty th…",1,"rt : meanwhile millions of parents find it hard to cope that 1 in 6 kids go to bed hungry every night in america, &amp; the poverty th…",1
3,.@TheRock put his #RedNotice co-stars @GalGadot and @VancityReynolds to the test in our first #MuseumFaceOff! 🎨 Follow @GoogleArts and find out which one of them can tell their Michelangelo from their Van Gogh. https://t.co/UIoTCadGzz @NetflixFilm https://t.co/ksE0Vg8zbR,0,. put his rednotice co-stars and to the test in our first museumfaceoff! follow and find out which one of them can tell their michelangelo from their van gogh.,0
4,"Parents who could work from home tried to multitask their way through, often at the cost of their productivity, sanity or both https://t.co/K0fzLY9kVe",0,"parents who could work from home tried to multitask their way through, often at the cost of their productivity, sanity or both",0
...,...,...,...,...
4995,Taiwanese Actress Bella Chang Had To Remove Her 1.5m-Long Large Intestine After Suffering From Chronic Constipation Since Young https://t.co/m4KpjEhtYJ,0,taiwanese actress bella chang had to remove her 1.5m-long large intestine after suffering from chronic constipation since young,0
4996,RT @jason_compsonIV: The fate of all dictators is the same\nWish of their death by people\n#DownWithKhamenei,1,rt _compsoniv: the fate of all dictators is the same\nwish of their death by people\ndownwithkhamenei,1
4997,"@mark61167237 Hi Mark, you may wish to submit more information via the i-Witness portal at https://t.co/1SxSjRqdGB. All information will be kept strictly confidential. Thank you.",0,"hi mark, you may wish to submit more information via the i-witness portal at",0
4998,It is 22:48 UTC now,1,it is 22:48 utc now,1
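
The array `raw_pred` holds one raw score (logit) per class for each tweet, and `np.argmax` simply picks the class with the larger score. The sketch below, in plain Python with made-up logits, shows the same choice and how a softmax would turn the scores into probabilities. The probabilities are not used in this notebook; they are shown only because a threshold-free metric such as ROC AUC is usually computed from them rather than from hard labels.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three tweets: [score for class 0, score for class 1]
example_logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1]]

predicted = []
for logits in example_logits:
    probs = softmax(logits)
    label = probs.index(max(probs))   # same choice as argmax on the raw logits
    predicted.append(label)
    print(label, [round(p, 3) for p in probs])
```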


Finally, we compute the accuracy of the model as the fraction of unseen tweets that are classified correctly.

In [None]:
wrong_predictions = df_predict[df_predict['target'] != df_predict['predicted target']]
no_wrong_predictions = len(wrong_predictions)

print('Final accuracy of model by predicting unseen tweets: {}'.format(1 - no_wrong_predictions / len(df_predict)))

target_true = df_predict['target']
target_predict = df_predict['predicted target']

auc_score = roc_auc_score(y_true = target_true, y_score = target_predict)
recall    = recall_score(y_true = target_true, y_pred = target_predict)
precision = precision_score(y_true = target_true, y_pred = target_predict)
f1        = f1_score(y_true = target_true, y_pred = target_predict)
accuracy  = accuracy_score(y_true = target_true, y_pred = target_predict)
confusion = confusion_matrix(y_true = target_true, y_pred = target_predict)

print('Confusion matrix:\n', confusion, '\n')

print('AUC Score : {}\n Recall : {}\n Precision : {}\n F1 Score : {}\n '.format(auc_score, recall, precision, f1))

Final accuracy of model by predicting unseen tweets: 0.9514
Confusion matrix:
 [[2422  116]
 [ 127 2335]] 

AUC Score : 0.9513553211333947
 Recall : 0.9484159220146222
 Precision : 0.952672378620971
 F1 Score : 0.9505393853042947
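
Every number printed above can be recovered from the four cells of the confusion matrix, which is a useful sanity check. The sketch below redoes the arithmetic by hand, reading the matrix in scikit-learn's layout (rows are true classes, columns are predicted classes). It also shows that the "AUC" reported here, being computed from hard 0/1 predictions, equals the average of the true positive rate and the true negative rate.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class
tn, fp = 2422, 116
fn, tp = 127, 2335

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)                  # true positive rate
specificity = tn / (tn + fp)                  # true negative rate
f1          = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2   # matches the reported AUC Score

print(accuracy, precision, recall, f1, balanced_accuracy)
```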
 


### Accessing Last Hidden State from BERT Model

This is the modified approach from the very first BERT tutorial that yielded good results. Here, we pass each tokenized sentence into the model, and access the last hidden state directly from the model output.

Similarly, we will need to vertically stack the last hidden states from the output, as in the case of Word2Vec. We still use the same tokenizer as above. We do not use the tokenizer's encode method, because it only encodes one sentence at a time, without padding. Calling the tokenizer on the list of sentences converts it into a list of encodings with padding, which we access with the input_ids key.

But we also do something slightly smarter: we split the tweets into two halves. We compute the last hidden states for each half, save each result to a file, then load both and stack them. This also implies that we need to restart the runtime in Google Colab. Although this is slightly tedious, it avoids the risk of exceeding the GPU memory limit.

In [None]:
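# Split the tweets into two halves; slicing at the same index keeps the
# halves contiguous and non-overlapping even when n is odd
n = len(tweet_texts)
half = n // 2
tweet_texts_top = tweet_texts[: half]
tweet_texts_bottom = tweet_texts[half :]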

In [None]:
tweets_token_top = tokenizer(tweet_texts_top, padding = True, truncation = True, max_length = 250)
tweets_token_bottom = tokenizer(tweet_texts_bottom, padding = True, truncation = True, max_length = 250)

# access the input ids, convert to tensor and push to GPU if available
input_ids_top = torch.tensor(np.array(tweets_token_top['input_ids'])).to(detected_device)
input_ids_bottom = torch.tensor(np.array(tweets_token_bottom['input_ids'])).to(detected_device)

print(len(input_ids_top))
print(len(input_ids_bottom))

75000
75000


Below we perform the procedure described above. **DO NOT RUN THE FOLLOWING CELL UNLESS YOU WANT TO RECALCULATE THE FEATURE TENSORS**

In [None]:
"""
%%time
features_vector = torch.zeros(1, 768)

for i, row in enumerate(input_ids_bottom):

    input_id = torch.unsqueeze(row, 0)

    with torch.no_grad():
        output = model_lhs(input_id)
        last_hidden_states = output[0][:,0,:]

    if i == 0:
        features_vector = last_hidden_states
    else:
        features_vector = torch.vstack((features_vector, last_hidden_states))

features_vector = features_vector.cpu()

print('[*]-----------------------------------------------  SUCCESS  -----------------------------------------------[*]')
"""

We save the top and bottom features vectors. Then, we load them and vertically stack them together to form the complete features vector. We then save this final vector as a file. This way, we only need to load the final vector and use it for our purpose. The final vector is read into a dataframe and later passed into the different classifiers.
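
The extraction loop feeds one row at a time to the model and stacks the result onto a growing tensor at every iteration, which re-copies everything gathered so far. A common alternative is to process the rows in batches, collect the per-batch outputs in a list, and join them once at the end. The plain-Python sketch below shows only that control flow: `embed_batch` is a hypothetical stand-in for a batched call to `model_lhs`, and the rows are made-up token ids. In the notebook, the final flattening step would correspond to a single `torch.cat` over the collected chunks. This is a suggested variation, not the code that produced the saved feature vectors.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_batch(batch):
    """Hypothetical stand-in for a batched model call; returns one 'vector' per row."""
    return [[float(len(row)), float(sum(row))] for row in batch]

rows = [[101, 7, 102], [101, 7, 8, 102], [101, 102], [101, 9, 9, 9, 102], [101, 5, 102]]

collected = []                      # gather per-batch outputs in a list ...
for batch in iter_batches(rows, batch_size=2):
    collected.append(embed_batch(batch))

# ... and flatten once at the end instead of stacking after every row
features = [vector for chunk in collected for vector in chunk]
print(len(collected), len(features), features[0])
```

Batching also keeps memory use predictable, since only one batch of activations needs to be held on the GPU at a time.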

In [None]:
features_vector = torch.load('/content/drive/MyDrive/Data Science and Analytics Portfolio/2 Projects/1 Project SPOTTED/2 Main/4_Notebooks/2_Model_Selection/feature vectors/features_vector_bert.pt')
features_vector

tensor([[-0.3798,  0.4210,  0.1782,  ..., -0.4502,  0.2521, -1.1289],
        [-0.3605,  0.4170,  0.3344,  ..., -0.4744,  0.5639, -0.7408],
        [-0.3127,  0.3273,  0.3748,  ..., -0.3724,  0.3606, -0.9193],
        ...,
        [-0.2973,  0.4203,  0.3777,  ..., -0.3895,  0.4951, -0.7913],
        [-0.2878,  0.4490,  0.2354,  ..., -0.4451,  0.2699, -0.9756],
        [-0.3603,  0.4863,  0.0358,  ..., -0.4842,  0.3976, -0.8951]])

In [None]:
# Same random_state as the earlier split, but the inputs are now the BERT feature vectors
X_train, X_test, y_train, y_test = train_test_split(features_vector,
                                                    target,
                                                    random_state = 0)

## Using Different Machine Learning Classifiers

Now with the embeddings, we can formally send that data to a few ML classifiers and compare their relative performance

K-Nearest Neighbour

In [None]:
knn_clf = KNeighborsClassifier(n_neighbors = 10)
knn_clf.fit(X_train, y_train)

# using the KNN model to predict the y values
y_predict = knn_clf.predict(X_test)

# calculate the evaluation metrics of the model
auc_score     = roc_auc_score(y_test, y_predict)
recall        = recall_score(y_test, y_predict)
precision     = precision_score(y_test, y_predict)
f1            = f1_score(y_test, y_predict)
knn_clf_score = knn_clf.score(X_test, y_test)
confusion     = confusion_matrix(y_test, y_predict)

print('Confusion matrix:\n', confusion, '\n')

performance_dict = {'K-Nearest Neighbour' : [auc_score, recall, precision, f1, knn_clf_score]}
performance_df_knn_clf = pd.DataFrame(data  = performance_dict,
                                      index = ['AUC', 'Recall', 'Precision', 'F1', 'Score'])

print('Evaluation metrices of K-Nearest Neighbour:')
performance_df_knn_clf

Confusion matrix:
 [[16585  2178]
 [ 4129 14608]] 

Evaluation metrices of K-Nearest Neighbour:


Unnamed: 0,K-Nearest Neighbour
AUC,0.831777
Recall,0.779634
Precision,0.870249
F1,0.822453
Score,0.831813


Logistic Regression

In [None]:
lr_clf = LogisticRegression(max_iter = 10000)
lr_clf.fit(X_train, y_train)

# using the logistic regression model to predict the y values
y_predict = lr_clf.predict(X_test)

# calculate the evaluation metrics of the model
auc_score    = roc_auc_score(y_test, y_predict)
recall       = recall_score(y_test, y_predict)
precision    = precision_score(y_test, y_predict)
f1           = f1_score(y_test, y_predict)
lr_clf_score = lr_clf.score(X_test, y_test)
confusion    = confusion_matrix(y_test, y_predict)

print('Confusion matrix:\n', confusion, '\n')

performance_dict = {'Logistic Regression' :  [auc_score, recall, precision, f1, lr_clf_score]}
performance_df_lr_clf = pd.DataFrame(data  = performance_dict,
                                     index = ['AUC', 'Recall', 'Precision', 'F1', 'Score'])

print('Evaluation metrices of Logistic Regression:')
performance_df_lr_clf

Confusion matrix:
 [[16699  2064]
 [ 2410 16327]] 

Evaluation metrices of Logistic Regression:


Unnamed: 0,Logistic Regression
AUC,0.880687
Recall,0.871377
Precision,0.887771
F1,0.879498
Score,0.880693


Support Vector Machines (SVM)

We do not run this classifier, because training takes too long on this dataset.

In [None]:
'''
#svm_clf = SVC(C = 1e9, gamma = 1e-07)
svm_clf = SVC()
svm_clf.fit(X_train, y_train)

# predict using SVM
y_predict = svm_clf.predict(X_test)

# calculate the evaluation metrics of the model
auc_score     = roc_auc_score(y_test, y_predict)
recall        = recall_score(y_test, y_predict)
precision     = precision_score(y_test, y_predict)
f1            = f1_score(y_test, y_predict)
svm_clf_score = svm_clf.score(X_test, y_test)
confusion     = confusion_matrix(y_test, y_predict)

print('Confusion matrix:\n', confusion, '\n')

performance_dict = {'Support Vector Machine' : [auc_score, recall, precision, f1, svm_clf_score]}
performance_df_svm_clf = pd.DataFrame(data  = performance_dict,
                                      index = ['AUC', 'Recall', 'Precision', 'F1', 'Score'])

print('Evaluation metrices of Support Vector Machine:')
performance_df_svm_clf
'''

Gaussian Naive Bayes

In [None]:
gnb_clf = GaussianNB()
gnb_clf.fit(X_train, y_train)

# predict using Gaussian Naive Bayes
y_predict = gnb_clf.predict(X_test)

# calculate the evaluation metrics of the model
auc_score     = roc_auc_score(y_test, y_predict)
recall        = recall_score(y_test, y_predict)
precision     = precision_score(y_test, y_predict)
f1            = f1_score(y_test, y_predict)
gnb_clf_score = gnb_clf.score(X_test, y_test)
confusion     = confusion_matrix(y_test, y_predict)

print('Confusion matrix:\n', confusion, '\n')

performance_dict = {'Gaussian Naive-Bayes' : [auc_score, recall, precision, f1, gnb_clf_score]}
performance_df_gnb_clf = pd.DataFrame(data  = performance_dict,
                                      index = ['AUC', 'Recall', 'Precision', 'F1', 'Score'])

print('Evaluation metrices of Gaussian Naive-Bayes:')
performance_df_gnb_clf

Confusion matrix:
 [[11082  7681]
 [ 5498 13239]] 

Evaluation metrices of Gaussian Naive-Bayes:


Unnamed: 0,Gaussian Naive-Bayes
AUC,0.6486
Recall,0.70657
Precision,0.632839
F1,0.667675
Score,0.64856


Random Forest

In [None]:
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train)

# predict using random forest
y_predict = rf_clf.predict(X_test)

# calculate the evaluation metrics of the model
auc_score    = roc_auc_score(y_test, y_predict)
recall       = recall_score(y_test, y_predict)
precision    = precision_score(y_test, y_predict)
f1           = f1_score(y_test, y_predict)
rf_clf_score = rf_clf.score(X_test, y_test)
confusion    = confusion_matrix(y_test, y_predict)

print('Confusion matrix:\n', confusion, '\n')

performance_dict = {'Random Forest' : [auc_score, recall, precision, f1, rf_clf_score]}
performance_df_rf_clf = pd.DataFrame(data  = performance_dict,
                                     index = ['AUC', 'Recall', 'Precision', 'F1', 'Score'])

print('Evaluation metrices of Random Forest:')
performance_df_rf_clf

Confusion matrix:
 [[16250  2513]
 [ 3113 15624]] 

Evaluation metrices of Random Forest:


Unnamed: 0,Random Forest
AUC,0.849962
Recall,0.833858
Precision,0.861443
F1,0.847426
Score,0.849973


Gradient-Boosted Decision Tree (GBDT)

Note that this classifier takes a long time to train.

In [None]:
#gbdt_clf = GradientBoostingClassifier(learning_rate = 0.1, max_depth = 10, random_state = 0)
gbdt_clf = GradientBoostingClassifier()
gbdt_clf.fit(X_train, y_train)

# predict using GBDT
y_predict = gbdt_clf.predict(X_test)

# calculate the evaluation metrics of the model
auc_score      = roc_auc_score(y_test, y_predict)
recall         = recall_score(y_test, y_predict)
precision      = precision_score(y_test, y_predict)
f1             = f1_score(y_test, y_predict)
gbdt_clf_score = gbdt_clf.score(X_test, y_test)
confusion      = confusion_matrix(y_test, y_predict)

print('Confusion matrix:\n', confusion, '\n')

performance_dict = {'Gradient-Boosted Decision Tree' : [auc_score, recall, precision, f1, gbdt_clf_score]}
performance_df_gbdt_clf = pd.DataFrame(data  = performance_dict,
                                       index = ['AUC', 'Recall', 'Precision', 'F1', 'Score'])

print('Evaluation metrices of Gradient-Boosted Decision Tree:')
performance_df_gbdt_clf

Confusion matrix:
 [[15480  3283]
 [ 3642 15095]] 

Evaluation metrices of Gradient-Boosted Decision Tree:


Unnamed: 0,Gradient-Boosted Decision Tree
AUC,0.815327
Recall,0.805625
Precision,0.821362
F1,0.813418
Score,0.815333


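As the outputs above show, the classifiers trained on BERT embeddings produce reasonable results, with logistic regression performing best at an accuracy of about 0.88. However, they fare poorly compared with the fine-tuned BertForSequenceClassification model, which reaches about 0.95.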
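
To see the comparison at a glance, the accuracy values reported in this notebook can be gathered and ranked. The snippet below only copies numbers from the outputs above; nothing is recomputed.

```python
# Accuracy values copied from the outputs in this notebook
reported_accuracy = {
    "BertForSequenceClassification (5,000 unseen tweets)": 0.9514,
    "BertForSequenceClassification (step 1500, test split)": 0.947627,
    "Logistic Regression": 0.880693,
    "Random Forest": 0.849973,
    "K-Nearest Neighbour": 0.831813,
    "Gradient-Boosted Decision Tree": 0.815333,
    "Gaussian Naive-Bayes": 0.64856,
}

# Rank the models from most to least accurate
ranking = sorted(reported_accuracy.items(), key=lambda item: item[1], reverse=True)
for name, acc in ranking:
    print(f"{name:<55} {acc:.4f}")
```

The classifier scores and the step-1500 BERT score both come from a 37,500-row test split, so those are the closest like-for-like comparison; the 0.9514 figure was measured on the separate set of 5,000 unseen tweets.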

## Conclusion

It is evident that the best performing model is BertForSequenceClassification. We choose this model over Word2Vec for the following reasons:
- seamless integration with PyTorch
- high accuracy (about 0.95 on the unseen tweets)


Although one is unable to change the type of classifier, the ease of use and the model's seamless integration with PyTorch make it extremely versatile. The only drawback is perhaps that you are at the mercy of Google Colab's free GPU. In fact, this notebook was run several times before a complete, successful execution was achieved and the best set of parameters was selected.

# PROJECT SPOTTED... SECURED!!