# **Project SPOTTED: Model Selection - Word2Vec**

**<u>_Objective:_</u>** We fine-tune a pretrained BERTModel to predict whether a tweet was written by an information operative (state-sponsored troll) or by a verified Twitter account. Our purpose is to increase the efficiency of identifying and disrupting state-sponsored disinformation campaigns for the defense and intelligence community.

---
### Introduction

In this notebook, we train a Word2Vec model with the Gensim package using the dataset curated in the data collection step. We then access the word embeddings in the model and feed them to different machine learning classifiers. We assess the performance of these classifiers by computing their evaluation metrics.

Word2Vec is a widely used word embedding algorithm, and we use the implementation in the Gensim library, which was designed for topic modelling. The distinctive aspect of Word2Vec is that the model learns the _context_ of the words in the corpus. Two training algorithms allow the model to learn this context: continuous bag of words (CBOW) and skip-gram. In essence, both look at a window of words around each target word: CBOW predicts the target word from its surrounding words, while skip-gram predicts the surrounding words from the target word. The model thereby captures the meaning of words, as words that appear in similar contexts end up close together in the embedding space. To exploit the full potential of Word2Vec, we need a large corpus of text, e.g. Wikipedia. Nonetheless, the 150,000 tweets we assembled are sufficient for our purpose.
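To make the idea of a context window concrete, here is a minimal sketch in plain Python of the training examples each algorithm would see. The function names `skipgram_pairs` and `cbow_examples` are ours and purely illustrative; this is not how Gensim implements training, which adds refinements such as subsampling and negative sampling.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every context word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    """Return (context_words, target) examples: CBOW predicts the target from its context."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples


sentence = ["state", "sponsored", "trolls", "spread", "disinformation"]
pairs = skipgram_pairs(sentence, window=1)
examples = cbow_examples(sentence, window=1)

print(pairs[:3])    # first few (target, context) pairs
print(examples[2])  # context and target for the word "trolls"
```

With `window=1`, the word "trolls" is paired only with its immediate neighbours "sponsored" and "spread"; a larger window widens the context each word is trained against.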

It may be interesting to note that Google has trained its own Word2Vec model on a large corpus of news articles. The resulting model is close to 4 GB in size.

For a useful high-level tutorial of how Word2Vec works, please check out https://machinelearningmastery.com/develop-word-embeddings-python-gensim/.

### Setting up Environment for Google Colab

We first set up the environment for Google Colab.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

# change working directory
%cd '/content/drive/MyDrive/Data Science and Analytics Portfolio/2 Projects/1 Project SPOTTED/2 Main'

Mounted at /content/drive
/content/drive/MyDrive/Data Science and Analytics Portfolio/2 Projects/1 Project SPOTTED/2 Main


In [2]:
%%capture
# Set up and download various dependencies (the cell magic must be the first line of the cell)
import nltk
nltk.download('brown')
nltk.download('punkt')

!pip install sentence_transformers
!pip install demoji
!pip install --upgrade numpy
!pip install --upgrade gensim
!pip install torchvision

#### Setting Directories for Google Colab


In [5]:
import os
cur_dir = os.getcwd()

utility_path = cur_dir + '/5_Utilities'
print(utility_path)

path = cur_dir

train_path = path + '/1_Data/SPOTTED_test_dataset.csv'
fv_path = cur_dir + '/4_Notebooks/2_Model_Selection/w2v/W2V_df_ML.csv'

print(fv_path)

/content/drive/MyDrive/Data Science and Analytics Portfolio/2 Projects/1 Project SPOTTED/2 Main/5 utilities
/content/drive/MyDrive/Data Science and Analytics Portfolio/2 Projects/1 Project SPOTTED/2 Main/4 notebooks/model selection/w2v/W2V_df_ML.csv


In [6]:
# import modules and dependencies
import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as mpp
import re
import nltk
import time
import sys
import demoji
import joblib

from nltk.corpus import brown

# import the file with all the function definitions
sys.path.insert(0, utility_path)
from utility_functions import *

from wordcloud import WordCloud
from nltk.corpus import stopwords
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics import roc_auc_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix

from matplotlib.pyplot import figure

from sentence_transformers import SentenceTransformer

pd.set_option('display.max_colwidth', None)

### Read Dataset

We read the SPOTTED training dataset, which consists of 150K tweets curated in the data collection step.

In [None]:
df = pd.read_csv(train_path).drop(columns = ['Unnamed: 0', 'hashtags'])

df.head()

Unnamed: 0,tweet_text,target
0,"As of 5 June 2020, 12pm, we have preliminarily confirmed an additional 261 cases of COVID-19 infection in Singapore. https://t.co/2RFMhrRkUw",0.0
1,"Boyfriend of missing Florida woman charged with murder: ""We wish Collin would provide us the information of where Kathleen is"" https://t.co/DBDJS5McdW",0.0
2,K-pop's BTS snags top prize at American Music Awards https://t.co/eR432aHJlm,0.0
3,RT @CincinnatiDays: Man killed in Bond Hill after altercation #news,1.0
4,Jared paying attention to his video game more than me pt 2 @juliakim52 http://t.co/0AHkR3K7Vg,1.0


### Training Word2Vec Model

Here we train our own Word2Vec model based on the test dataset tweets. First, we create a pretrained Word2Vec model by training it on the Brown Corpus sample text from NLTK. This step gives the initially empty model a baseline vocabulary and set of embeddings. We then sharpen the model by continuing to train it on the raw 150K tweets. Note that at this stage of model training we do not perform any text cleaning on the tweets: Word2Vec learns word meanings from the words that surround them in a sentence, so we keep the 'dirty' tweets as they were written.

#### Parameters Specification

In [None]:
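BrownCorpus = brown.sents()   # tokenized sentences from the Brown Corpus
size = 300                    # dimensionality of the word vectors
minimum_counts = 3            # ignore words that appear fewer than 3 times
number_of_epochs = 60         # number of passes over the corpus
downsampling = 1e-3           # subsampling threshold for frequent words
number_of_workers = 4         # number of worker threads
context = 8                   # not used below; the window is set by num_windows
num_windows = 8               # maximum distance between target and context word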

In [None]:
print('[*] Pre-training Word2Vec model on Brown Corpus of length', len(BrownCorpus))

start = time.time()
model = Word2Vec(sentences = BrownCorpus,
                 sg = 1,
                 hs = 0,
                 vector_size = size,
                 min_count = minimum_counts,
                 epochs = number_of_epochs,
                 sample = downsampling,
                 window = num_windows,
                 negative = 5,
                 workers = number_of_workers)
end = time.time()
print('[*] Total time elapsed to train model :', (end - start) / 60, 'minutes')

print('[*] Saving pretrained model...')
model.save('w2v/w2v_pretrained.model')
print('[*]------------------------------------------- Success -------------------------------------------[*]')

[*] Pre-training Word2Vec model on Brown Corpus of length 57340


Then, we load the pretrained model and continue training it with the raw tweet text from the dataframe above. This step sharpens the model's sensitivity towards the words used in the tweets of the test dataset.

In [None]:
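print('[*] Loading pretrained model...')
model = Word2Vec.load('w2v/w2v_pretrained.model')

# Gensim expects each example to be a list of tokens, not a raw string,
# so split every tweet on whitespace before training
tweet_tokens = [str(tweet).split() for tweet in df['tweet_text']]

# add words that appear in the tweets but not in the Brown Corpus
model.build_vocab(tweet_tokens, update = True)

print('[*] Training pretrained model, with input text data of length', len(tweet_tokens))
start = time.time()
model.train(tweet_tokens,
            total_examples = len(tweet_tokens),
            epochs = number_of_epochs)
end = time.time()
print('[*] Total time elapsed to train model :', (end - start) / 60, 'minutes')

print('[*] Saving pretrained model...')
model.save('w2v/SPOTTED_w2v_model.model')
print('[*]------------------------------------------- Success -------------------------------------------[*]')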

[*] Loading pretrained model...




[*] Training pretrained model, with input text data of length 150000
[*] Total time elapsed to train model : 74.94866852362951 minutes
[*] Saving pretrained model...
[*]------------------------------------------- Success -------------------------------------------[*]


### Data Cleaning and Preparation

For the data cleaning, we remove links, emojis and similar noise. These actions are implemented in the text_processing function.
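The text_processing function lives in the utilities module and is not shown in this notebook. As a rough sketch of what such a cleaning step could look like, the hypothetical `clean_and_tokenize` below strips links, drops non-ASCII characters as a crude stand-in for emoji removal, lowercases the text and splits it into word and punctuation tokens. All of these choices are our assumptions, not the actual implementation.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")


def clean_and_tokenize(tweet):
    """Hypothetical stand-in for text_processing: strip links and emojis, then tokenize."""
    text = URL_PATTERN.sub("", tweet)
    # Crude emoji removal (assumption): drop every non-ASCII character.
    text = text.encode("ascii", "ignore").decode("ascii")
    # Lowercase, then split into word tokens and single punctuation tokens.
    return re.findall(r"\w+|[^\w\s]", text.lower())


tokens = clean_and_tokenize("K-pop's BTS snags top prize \U0001F3C6 https://t.co/eR432aHJlm")
print(tokens)
```

Note that dropping all non-ASCII characters also removes accented letters and non-Latin scripts, so a dedicated emoji library is the safer choice for multilingual tweets.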

In [None]:
%%time
df_w2v = df.copy()

df_w2v['tokenized sentences'] = df_w2v['tweet_text'].apply(text_processing)
df_w2v['sentence vectors'] = df_w2v['tokenized sentences'].apply(lambda x : sentence_vectorizer(x, model, size))

df_w2v = df_w2v.reset_index(drop = True)

df_w2v.head(2)

CPU times: user 7min 44s, sys: 5.16 s, total: 7min 49s
Wall time: 7min 49s


Unnamed: 0,tweet_text,target,tokenized sentences,sentence vectors
0,"As of 5 June 2020, 12pm, we have preliminarily confirmed an additional 261 cases of COVID-19 infection in Singapore. https://t.co/2RFMhrRkUw",0.0,"[as, of, 5, june, 2020, ,, 12pm, ,, we, have, preliminarily, confirmed, an, additional, 261, cases, of, covid, -, 19, infection, in, singapore, .]","[0.20728610269725323, 0.1629854320164989, 0.04370561510543613, 0.052119998521554994, -0.0912680254482171, -0.2071911972016096, 0.15133531298488379, 0.2441127569798161, 0.022393231170580667, -0.040851018281982225, 0.09800730514175751, -0.02176838116172482, 0.12086785376510199, -0.09644921715645229, -0.20072636775234165, 0.08095596774536021, -0.15210429020226002, 0.14011825862176278, -0.0499332307235283, -0.22866791914052823, -0.027565131552846116, -0.017332199522677588, 0.12584667269359617, 0.08428054133101422, 0.11268534577068161, -0.0478215558344827, 0.004486977780128226, -0.03253365257967958, -0.2798793955760844, -0.12017105700557723, 0.14737443362965302, 0.015340353581396973, 0.05277219348970581, 0.011598007315221955, -0.1483278828946983, 0.09127139050842208, -0.024918912499047378, 0.04416643959634444, -0.0009412162882440231, 0.137889438034857, 0.06481510107679402, -0.13756573074223363, 0.043088554569027, -0.025187365053331152, 0.016492932432276362, 0.07731778824167765, 0.1672856281785404, 0.08594200693016105, -0.004352304755764849, 0.09397078195915502, -0.006615851348375573, 0.07686206303975161, -0.1916017403059146, 0.007106959929361063, -0.007671052425661508, 0.14409610656473568, 0.1352147238657755, -0.08683433942496777, 0.15431260164169705, -0.07586344074019614, 0.004978006075629417, 0.03850229476195048, -0.0922571765806745, 0.04792765407439541, 0.09160393991452806, -0.0656632359194405, 0.072937868645086, 0.10425335366059751, -0.016180445303154344, 0.06819745876333293, 0.0529613748042132, 0.026323320563225186, 0.1302582321359831, 0.036549323567134494, -0.016024458715144324, 0.14628741471096873, -0.0134999679730219, 0.09830314030542094, 0.024287479655707583, 
0.1435964390197221, -0.14938156098565636, -0.038359622224507964, 0.043204724295612645, 0.1620828943217502, 0.025656068834530955, 0.10605848153286121, -0.13322924563031205, 0.01704581709140364, 0.13052354402401867, 0.08685773983597755, -0.025358502506552374, 0.05577918207820724, 0.03122520130401587, 0.09627378951100742, 0.0847787499318228, 0.003397931837860276, 0.07477432467481669, -0.08705409087569398, -0.21014696622596069, 0.02307915797128397, ...]"
1,"Boyfriend of missing Florida woman charged with murder: ""We wish Collin would provide us the information of where Kathleen is"" https://t.co/DBDJS5McdW",0.0,"[boyfriend, of, missing, florida, woman, charged, with, murder, :, "", we, wish, collin, would, provide, us, the, information, of, where, kathleen, is, ""]","[0.0922131735338446, 0.1825027918881353, 0.010249690590974163, -0.10157739338191117, -0.18814165397163699, -0.12051812942851992, 0.18393393266288674, 0.2264960125526961, 0.04927326678572332, -0.10407082511879065, -0.058750785996808726, -0.05089271654758383, 0.05802109727964682, -0.05227707293541992, -0.27931097588118386, -0.025868938427747172, -0.09369815004688195, 0.13975215933340437, -0.12586590625783978, -0.07578383857274756, 0.06505289691609933, -0.024963691666284028, 0.20545569533372626, 0.10545095867093872, 0.15383675602702973, -0.05857763417503413, -0.0899840895743931, 0.054245714974754, -0.21588621795287027, -0.21296839205109896, 0.15476349083816304, 0.0552499519989771, -0.003928666770019952, 0.03119280912420329, -0.09143423962899867, 0.048121833845096475, 0.039349880672114736, 0.0031602104084894936, 0.020882388138595748, 0.11155043657430831, 0.05086111386909204, -0.09575714836554493, 0.029148530543727035, -0.19474180919282577, -0.021345799131428495, 0.13902873841716962, 0.0856896496432669, 0.020995438318042195, 0.0036948378033497755, 0.20067638640894608, -0.06536997583530404, 0.28089356422424316, -0.1626094019807437, -0.043742085204404944, 0.06171750123886501, 0.20614750080687158, 0.017705385299289927, 0.004735114436377497, -0.016543674994917476, -0.0749487034866915, -0.00587088822880212, -0.09740203728570658, 0.030537268913844052, -0.005844130536870044, 0.06284259687013485, -0.00805693946997909, -0.0014621563365354257, 0.025402629747986794, 0.03158333897590637, 0.02678314464933732, 0.001651712419355617, 0.01321618412347401, 0.19587641110753312, -0.030862373875125367, -0.018598991637939915, 0.09902084196972497, -0.0855893439558499, 
-0.0916270032963332, -0.025756360196015415, 0.07764447336632978, -0.058870369787601864, -0.054043566945063716, -0.036346073974581325, 0.21155557009008, 0.03227708790013019, 0.16511272370596142, -0.008986770901281168, -0.04927255629616625, 0.04735763488775667, 0.023331648834487972, 0.06520874853081562, 0.09149971702957854, 0.010559077265069765, 0.12268523389802259, 0.06926097687991227, -0.021868031869149384, 0.16525491140782833, 0.031933456823668056, -0.1915327145389336, -0.013001475483179092, ...]"
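The sentence_vectorizer helper is also defined in the utilities module and is not shown here. One common way to turn word vectors into a sentence vector, and the one we assume for this sketch, is to average the vectors of all tokens found in the vocabulary. The function name `average_sentence_vector` and the toy embeddings are illustrative only; with a trained Gensim model the lookup would go through `model.wv` instead of a plain dictionary.

```python
def average_sentence_vector(tokens, word_vectors, size):
    """Average the vectors of in-vocabulary tokens; return zeros if none are known."""
    known = [word_vectors[token] for token in tokens if token in word_vectors]
    if not known:
        return [0.0] * size
    # zip(*known) groups the values of each dimension across all known tokens.
    return [sum(values) / len(known) for values in zip(*known)]


# Toy 3-dimensional embeddings standing in for the trained model's vectors.
toy_vectors = {
    "troll": [1.0, 0.0, 2.0],
    "tweet": [3.0, 2.0, 0.0],
}

vector = average_sentence_vector(["troll", "tweet", "unknownword"], toy_vectors, size=3)
print(vector)
```

Out-of-vocabulary tokens are simply skipped, and a tweet with no known tokens maps to the zero vector so that every row still has the same number of features.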


Lastly, we convert the sentence vectors into a features matrix to be passed into the ML classifiers later. Without acceleration, this step is among the most time-consuming in the notebook. Hence, we exploit PyTorch's tensor data structure to speed up the operation: we convert the NumPy arrays to tensors and push them to the GPU. Finally, we copy the result back to the CPU and call .numpy() to recover our NumPy ndarray.

In [None]:
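detected_device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# get_device_name() raises an error without a GPU, so only call it when CUDA is available
device_name = torch.cuda.get_device_name() if torch.cuda.is_available() else 'CPU'
print(detected_device, '\nName of device:', device_name)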

cuda 
Name of device: Tesla T4


In [None]:
%%time
df_ML = vec_to_features_Tensor(df_w2v, 'target', 'sentence vectors', size)

df_ML.head()

CPU times: user 4min 9s, sys: 2 s, total: 4min 11s
Wall time: 4min 17s


Unnamed: 0,target,w2v_1,w2v_2,w2v_3,w2v_4,w2v_5,w2v_6,w2v_7,w2v_8,w2v_9,...,w2v_291,w2v_292,w2v_293,w2v_294,w2v_295,w2v_296,w2v_297,w2v_298,w2v_299,w2v_300
0,0.0,0.207286,0.162985,0.043706,0.05212,-0.091268,-0.207191,0.151335,0.244113,0.022393,...,-0.01507,0.181218,0.125042,0.089747,0.102303,0.096271,0.010417,-0.093091,0.153693,0.087534
1,0.0,0.092213,0.182503,0.01025,-0.101577,-0.188142,-0.120518,0.183934,0.226496,0.049273,...,0.024799,0.113711,0.004766,0.104695,0.100562,0.094758,-0.01,-0.070278,0.02616,-0.003086
2,0.0,0.161903,0.132092,-0.149791,-0.050299,-0.152066,-0.147933,-0.01235,0.548284,0.035683,...,0.144191,0.171807,-0.099145,0.086577,0.249119,0.289094,-0.069311,-0.075786,0.025588,-0.191049
3,1.0,-0.055043,0.355471,-0.010666,-0.043458,-0.054305,-0.313764,0.003867,0.231163,0.115497,...,-0.053422,0.204941,-0.043115,0.180672,0.141532,0.258362,-0.050155,-0.096396,0.128628,-0.070234
4,1.0,-0.067327,0.234934,0.096567,0.089803,0.052171,-0.33164,0.096493,0.242685,-0.014983,...,0.110775,0.236107,0.116955,0.044962,-0.048965,0.258062,0.051141,-0.024864,-0.004259,-0.011461


We can save this final dataframe here (to `fv_path`) so that we can load it for the ML classifiers later.

### Using Different Machine Learning Classifiers

We train several different classifiers on the same features and compare their evaluation metrics to choose the best one. We first perform the train-test split on the W2V dataframe.

In [None]:
%%time
df_ML = pd.read_csv(fv_path).astype({'target' : 'int32'}).drop(columns = ['Unnamed: 0'])
df_ML.head()

CPU times: user 14.5 s, sys: 1.95 s, total: 16.5 s
Wall time: 18.6 s


Unnamed: 0,target,w2v_1,w2v_2,w2v_3,w2v_4,w2v_5,w2v_6,w2v_7,w2v_8,w2v_9,...,w2v_291,w2v_292,w2v_293,w2v_294,w2v_295,w2v_296,w2v_297,w2v_298,w2v_299,w2v_300
0,0,0.207286,0.162985,0.043706,0.05212,-0.091268,-0.207191,0.151335,0.244113,0.022393,...,-0.01507,0.181218,0.125042,0.089747,0.102303,0.096271,0.010417,-0.093091,0.153693,0.087534
1,0,0.092213,0.182503,0.01025,-0.101577,-0.188142,-0.120518,0.183934,0.226496,0.049273,...,0.024799,0.113711,0.004766,0.104695,0.100562,0.094758,-0.01,-0.070278,0.02616,-0.003086
2,0,0.161903,0.132092,-0.149791,-0.050299,-0.152066,-0.147933,-0.01235,0.548284,0.035683,...,0.144191,0.171807,-0.099145,0.086577,0.249119,0.289094,-0.069311,-0.075786,0.025588,-0.191049
3,1,-0.055043,0.355471,-0.010666,-0.043458,-0.054305,-0.313764,0.003867,0.231163,0.115497,...,-0.053422,0.204941,-0.043115,0.180672,0.141532,0.258362,-0.050155,-0.096396,0.128628,-0.070234
4,1,-0.067327,0.234934,0.096567,0.089803,0.052171,-0.33164,0.096493,0.242685,-0.014983,...,0.110775,0.236107,0.116955,0.044962,-0.048965,0.258062,0.051141,-0.024864,-0.004259,-0.011461


In [None]:
# this one is for the Word2Vec model - we need to pass inside the sentence vectors
X_train, X_test, y_train, y_test = train_test_split(df_ML.drop(columns = ['target']),
                                                    df_ML['target'],
                                                    random_state = 14)

#### K-Nearest Neighbour classifier

In [None]:
knn_clf = KNeighborsClassifier(n_neighbors = 10)
knn_clf.fit(X_train, y_train)

# using the KNN model to predict the y values
y_predict = knn_clf.predict(X_test)

# calculate the evaluation metrics of the model
auc_score     = roc_auc_score(y_test, y_predict)
recall        = recall_score(y_test, y_predict)
precision     = precision_score(y_test, y_predict)
f1            = f1_score(y_test, y_predict)
knn_clf_score = knn_clf.score(X_test, y_test)
confusion     = confusion_matrix(y_test, y_predict)

print('Confusion matrix:\n', confusion, '\n')

performance_dict = {'K-Nearest Neighbour' : [auc_score, recall, precision, f1, knn_clf_score]}
performance_df_knn_clf = pd.DataFrame(data  = performance_dict,
                                      index = ['AUC', 'Recall', 'Precision', 'F1', 'Score'])

print('Evaluation metrices of K-Nearest Neighbour:')
performance_df_knn_clf

Confusion matrix:
 [[11838  6908]
 [11804  6950]] 

Evaluation metrices of K-Nearest Neighbour:


Unnamed: 0,K-Nearest Neighbour
AUC,0.501041
Recall,0.370588
Precision,0.501515
F1,0.426223
Score,0.501013


#### Logistic Regression classifier

In [None]:
lr_clf = LogisticRegression(max_iter = 10000)
lr_clf.fit(X_train, y_train)

# using the logistic regression model to predict the y values
y_predict = lr_clf.predict(X_test)

# calculate the evaluation metrics of the model
auc_score    = roc_auc_score(y_test, y_predict)
recall       = recall_score(y_test, y_predict)
precision    = precision_score(y_test, y_predict)
f1           = f1_score(y_test, y_predict)
lr_clf_score = lr_clf.score(X_test, y_test)
confusion    = confusion_matrix(y_test, y_predict)

print('Confusion matrix:\n', confusion, '\n')

performance_dict = {'Logistic Regression' : [auc_score, recall, precision, f1, lr_clf_score]}
performance_df_lr_clf = pd.DataFrame(data  = performance_dict,
                                     index = ['AUC', 'Recall', 'Precision', 'F1', 'Score'])

print('Evaluation metrices of Logistic Regression:')
performance_df_lr_clf

Confusion matrix:
 [[12102  6644]
 [ 5908 12846]] 

Evaluation metrices of Logistic Regression:


Unnamed: 0,Logistic Regression
AUC,0.665276
Recall,0.684974
Precision,0.659107
F1,0.671792
Score,0.66528
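As a sanity check, the metrics in the table can be reproduced by hand from the confusion matrix reported for Logistic Regression. The sketch below uses only the four counts printed above. It also shows that, because `roc_auc_score` is given hard 0/1 predictions in this notebook, the reported AUC reduces to the mean of recall and specificity.

```python
# Confusion matrix reported for Logistic Regression:
# rows are the true classes, columns are the predicted classes.
tn, fp = 12102, 6644
fn, tp = 5908, 12846

recall = tp / (tp + fn)             # true positive rate
specificity = tn / (tn + fp)        # true negative rate
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

# With hard 0/1 predictions the ROC curve has a single operating point,
# so the area under it is the mean of recall and specificity.
auc_from_labels = (recall + specificity) / 2

for name, value in [("AUC", auc_from_labels), ("Recall", recall),
                    ("Precision", precision), ("F1", f1), ("Score", accuracy)]:
    print(f"{name:<10}{value:.6f}")
```

A threshold-independent AUC would instead require continuous scores, for example from `predict_proba` or `decision_function` where the classifier provides them, so the AUC values in this notebook are best read as balanced accuracy.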


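We can also run a grid search with cross-validation (GridSearchCV) over the regularization parameter C of Logistic Regression. This may not be required if Logistic Regression is good enough on its own, so the cell below is left disabled inside a string.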

In [None]:
"""
lr_clf1 = LogisticRegression(penalty  = 'l1',
                             max_iter = 10000,
                             solver   = 'liblinear')
lr_clf1.fit(X_train, y_train)

grid_values = {'C': [0.01, 0.1, 1, 10, 100]}

# metric to optimize over grid parameters: recall
gridsearch_cv_lr_clf1 = GridSearchCV(lr_clf1, param_grid = grid_values, scoring = 'recall')
gridsearch_cv_lr_clf1.fit(X_train, y_train)
y_decision_fn_scores_recall = gridsearch_cv_lr_clf1.decision_function(X_test)

print('Test set AUC: ', roc_auc_score(y_test, y_decision_fn_scores_recall))
print('Grid best parameter (max. recall): ', gridsearch_cv_lr_clf1.best_params_)
print('Grid best score (recall): ', gridsearch_cv_lr_clf1.best_score_)
"""

#### Support Vector Machine (SVM)

In [None]:
svm_clf = SVC(C = 1e9, gamma = 1e-07)
svm_clf.fit(X_train, y_train)

# predict using SVM
y_predict = svm_clf.predict(X_test)

# calculate the evaluation metrics of the model
auc_score     = roc_auc_score(y_test, y_predict)
recall        = recall_score(y_test, y_predict)
precision     = precision_score(y_test, y_predict)
f1            = f1_score(y_test, y_predict)
svm_clf_score = svm_clf.score(X_test, y_test)
confusion     = confusion_matrix(y_test, y_predict)

print('Confusion matrix:\n', confusion, '\n')

performance_dict = {'Support Vector Machine' : [auc_score, recall, precision, f1, svm_clf_score]}
performance_df_svm_clf = pd.DataFrame(data  = performance_dict,
                                      index = ['AUC', 'Recall', 'Precision', 'F1', 'Score'])

print('Evaluation metrices of Support Vector Machine:')
performance_df_svm_clf

Confusion matrix:
 [[15564  3182]
 [ 7030 11724]] 

Evaluation metrices of Support Vector Machine:


Unnamed: 0,Support Vector Machine
AUC,0.727702
Recall,0.625147
Precision,0.786529
F1,0.696613
Score,0.72768


#### Gaussian Naive-Bayes

In [None]:
gnb_clf = GaussianNB()
gnb_clf.fit(X_train, y_train)

# predict using Gaussian Naive Bayes
y_predict = gnb_clf.predict(X_test)

# calculate the evaluation metrics of the model
# note: precision uses average = 'weighted' here, unlike the binary precision
# reported for the other classifiers, so the two are not directly comparable
auc_score     = roc_auc_score(y_test, y_predict)
recall        = recall_score(y_test, y_predict)
precision     = precision_score(y_test, y_predict, average = 'weighted')
f1            = f1_score(y_test, y_predict)
gnb_clf_score = gnb_clf.score(X_test, y_test)
confusion     = confusion_matrix(y_test, y_predict)

print('Confusion matrix:\n', confusion, '\n')

performance_dict = {'Gaussian Naive-Bayes' : [auc_score, recall, precision, f1, gnb_clf_score]}
performance_df_gnb_clf = pd.DataFrame(data  = performance_dict,
                                      index = ['AUC', 'Recall', 'Precision', 'F1', 'Score'])

print('Evaluation metrices of Gaussian Naive-Bayes:')
performance_df_gnb_clf

Confusion matrix:
 [[14931  3815]
 [10995  7759]] 

Evaluation metrices of Gaussian Naive-Bayes:


Unnamed: 0,Gaussian Naive-Bayes
AUC,0.605107
Recall,0.413725
Precision,0.623155
F1,0.511672
Score,0.605067


#### Random Forest

In [None]:
rf_clf = RandomForestClassifier().fit(X_train, y_train)

# predict using random forest
y_predict = rf_clf.predict(X_test)

# calculate the evaluation metrics of the model
auc_score    = roc_auc_score(y_test, y_predict)
recall       = recall_score(y_test, y_predict)
precision    = precision_score(y_test, y_predict)
f1           = f1_score(y_test, y_predict)
rf_clf_score = rf_clf.score(X_test, y_test)
confusion    = confusion_matrix(y_test, y_predict)

print('Confusion matrix:\n', confusion, '\n')

performance_dict = {'Random Forest' : [auc_score, recall, precision, f1, rf_clf_score]}
performance_df_rf_clf = pd.DataFrame(data  = performance_dict,
                                      index = ['AUC', 'Recall', 'Precision', 'F1', 'Score'])

print('Evaluation metrices of Random Forest:')
performance_df_rf_clf

Confusion matrix:
 [[14581  4165]
 [ 3850 14904]] 

Evaluation metrices of Random Forest:


Unnamed: 0,Random Forest
AUC,0.786265
Recall,0.79471
Precision,0.781583
F1,0.788092
Score,0.786267


#### Gradient-Boosted Decision Tree (GBDT)

Heads-up: GBDT takes the longest to train among the classifiers in this notebook.

In [None]:
gbdt_clf = GradientBoostingClassifier(learning_rate = 0.1, max_depth = 10, random_state = 0)
gbdt_clf.fit(X_train, y_train)

# predict using GBDT
y_predict = gbdt_clf.predict(X_test)

# calculate the evaluation metrics of the model
auc_score      = roc_auc_score(y_test, y_predict)
recall         = recall_score(y_test, y_predict)
precision      = precision_score(y_test, y_predict)
f1             = f1_score(y_test, y_predict)
gbdt_clf_score = gbdt_clf.score(X_test, y_test)
confusion      = confusion_matrix(y_test, y_predict)

print('Confusion matrix:\n', confusion, '\n')

performance_dict = {'Gradient-Boosted Decision Tree' : [auc_score, recall, precision, f1, gbdt_clf_score]}
performance_df_gbdt_clf = pd.DataFrame(data  = performance_dict,
                                       index = ['AUC', 'Recall', 'Precision', 'F1', 'Score'])

print('Evaluation metrices of Gradient-Boosted Decision Tree:')
performance_df_gbdt_clf

Confusion matrix:
 [[15256  3490]
 [ 3552 15202]] 

Evaluation metrices of Gradient-Boosted Decision Tree:


Unnamed: 0,Gradient-Boosted Decision Tree
AUC,0.812214
Recall,0.8106
Precision,0.813289
F1,0.811943
Score,0.812213


Now, we can concatenate all the evaluation metric dataframes into one to compare their relative performance.

But we can cheat a bit.

In [None]:
df_evaluation = pd.concat([performance_df_knn_clf, performance_df_lr_clf, performance_df_svm_clf,
                           performance_df_gnb_clf, performance_df_rf_clf, performance_df_gbdt_clf], axis = 1)
df_evaluation

Unnamed: 0,K-Nearest Neighbour,Logistic Regression,Support Vector Machine,Gaussian Naive-Bayes,Random Forest,Gradient-Boosted Decision Tree
0,0.501041,0.665276,0.727702,0.605107,0.786265,0.812214
1,0.370588,0.684974,0.625147,0.413725,0.79471,0.8106
2,0.501515,0.659107,0.786529,0.623155,0.781583,0.813289
3,0.426223,0.671792,0.696613,0.511672,0.788092,0.811943
4,0.501013,0.66528,0.72768,0.605067,0.786267,0.812213


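We can also find the best evaluation metric for each classifier. In the table above and the output below, the row indices 0 to 4 correspond to AUC, Recall, Precision, F1 and Score respectively.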

In [None]:
classifiers = ['K-Nearest Neighbour', 'Logistic Regression', 'Support Vector Machine', 'Gaussian Naive-Bayes', 'Random Forest', 'Gradient-Boosted Decision Tree']

for classifier in classifiers:
    max_value = df_evaluation[classifier].max()
    best_metric = df_evaluation.index[df_evaluation[classifier] == max_value].tolist()[0]
    print('* {} classifier has highest [{}] metric : {}\n'.format(classifier, best_metric, max_value))

* K-Nearest Neighbour classifier has highest [2] metric : 0.501515

* Logistic Regression classifier has highest [1] metric : 0.684974

* Support Vector Machine classifier has highest [2] metric : 0.786529

* Gaussian Naive-Bayes classifier has highest [2] metric : 0.623155

* Random Forest classifier has highest [1] metric : 0.79471

* Gradient-Boosted Decision Tree classifier has highest [2] metric : 0.813289
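The output above identifies metrics by row position rather than by name. As a small sketch, the same comparison can be done with named metrics using the values copied from the combined evaluation table. The dictionary layout and variable names are ours; the numbers are the ones reported above.

```python
metric_names = ["AUC", "Recall", "Precision", "F1", "Score"]

# Values copied from the combined evaluation table (rows 0 to 4).
results = {
    "K-Nearest Neighbour":            [0.501041, 0.370588, 0.501515, 0.426223, 0.501013],
    "Logistic Regression":            [0.665276, 0.684974, 0.659107, 0.671792, 0.66528],
    "Support Vector Machine":         [0.727702, 0.625147, 0.786529, 0.696613, 0.72768],
    "Gaussian Naive-Bayes":           [0.605107, 0.413725, 0.623155, 0.511672, 0.605067],
    "Random Forest":                  [0.786265, 0.79471, 0.781583, 0.788092, 0.786267],
    "Gradient-Boosted Decision Tree": [0.812214, 0.8106, 0.813289, 0.811943, 0.812213],
}

# Best metric for each classifier, reported by name instead of row index.
best_per_classifier = {}
for classifier, values in results.items():
    best_index = max(range(len(values)), key=values.__getitem__)
    best_per_classifier[classifier] = (metric_names[best_index], values[best_index])
    print(f"* {classifier}: highest metric is {metric_names[best_index]} = {values[best_index]}")

# Best classifier for each metric, which is the comparison that drives model selection.
best_per_metric = {
    name: max(results, key=lambda clf: results[clf][i])
    for i, name in enumerate(metric_names)
}
print(best_per_metric)
```

Comparing classifiers per metric is more useful for model selection than finding each classifier's own best metric, since it shows directly which model leads on AUC, recall, precision, F1 and accuracy.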



And there you have it! We used Word2Vec to create the sentence embeddings which, after some dataframe operations, were converted into a suitable format and passed into six machine learning classifiers: K-Nearest Neighbour, Logistic Regression, Support Vector Machine, Gaussian Naive-Bayes, Random Forest and Gradient-Boosted Decision Tree. We can then decide which of these classifiers to select for the final implementation.

### Conclusion

We have shown that we can train our own Word2Vec model on the corpus of tweets in the test dataset. We then access the keyed vectors in the model by passing the tokenized sentences into the trained model. Lastly, the dataframe of sentence vectors is passed into different classifiers, and their evaluation metrics are computed to measure how well they predict whether a tweet was written by a troll. It is clear from the last dataframe that the Gradient-Boosted Decision Tree (GBDT) is the best classifier of the six, with the highest value on every metric (about 0.81 on each).


In the end, however, we will not use Word2Vec in the final implementation.