# Text Classification using GloVe Embeddings
Machine learning models cannot operate on raw text directly. Word embeddings are therefore used to convert textual data into numerical vectors.

#### Word Embeddings:
A word embedding is a learned representation of text in an n-dimensional space in which words with similar meanings have similar representations. In other words, two similar words are represented by vectors that lie close together in the vector space.
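
Closeness between word vectors is commonly measured with cosine similarity. The sketch below illustrates this with hand-made 3-dimensional vectors; the words and values are purely illustrative assumptions (real GloVe vectors have 100 or more dimensions).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, chosen by hand for illustration only.
toy_embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "apple": [0.1, 0.2, 0.95],
}

sim_related = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_unrelated = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])
# The related pair scores higher than the unrelated pair.
print(sim_related > sim_unrelated)
```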

#### GloVe Word Embedding:
Global Vectors for Word Representation (GloVe) is a method for creating word embeddings by applying matrix factorisation techniques to the word-context co-occurrence matrix.
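
The word-context matrix records how often each word appears near each other word. A minimal sketch of collecting such counts follows; the toy corpus, the window size and the function name `cooccurrence_counts` are assumptions for illustration, and the GloVe training step that fits vectors to these counts is not shown.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=1):
    """Count how often each (word, context word) pair occurs within `window` tokens."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1.0
    return dict(counts)

# Toy corpus of two tokenised sentences (made up).
toy_corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
]
counts = cooccurrence_counts(toy_corpus, window=1)
print(counts[("the", "cat")], counts[("cat", "sat")])
```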

In [2]:
import numpy as np
import pandas as pd
import pickle
from os import path
from scipy.stats import uniform
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn import metrics
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors
from class_Word2Vec_vectorizer import Word2Vec_vectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

In [2]:
# ignores warnings 
import warnings
warnings.filterwarnings('ignore') 

## Preparing the GloVe pre-trained models

- Download the 100- and 200-dimensional vectors trained on the Wikipedia and Gigaword 5 data sets (6 billion tokens, 400K-word vocabulary) from https://nlp.stanford.edu/projects/glove/.

- Convert the GloVe file format to the word2vec file format and load the models with Gensim.
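
The two text formats differ only in that the word2vec format starts with a header line giving the vocabulary size and the vector dimension. The sketch below illustrates that difference on an in-memory toy file; the helper name `add_word2vec_header` is hypothetical, and this is not necessarily how `glove2word2vec` is implemented.

```python
def add_word2vec_header(glove_lines):
    """Return word2vec-format lines: a 'vocab_size dim' header followed by the GloVe rows."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    vocab_size = len(rows)
    # Each GloVe row is: token value_1 ... value_dim
    dim = len(rows[0].split(" ")) - 1
    return ["{} {}".format(vocab_size, dim)] + rows

# Toy stand-in for a GloVe text file (values are made up).
toy_glove = [
    "the 0.1 0.2 0.3\n",
    "cat 0.4 0.5 0.6\n",
]
for line in add_word2vec_header(toy_glove):
    print(line)
```

In Gensim 4.0 and later, `KeyedVectors.load_word2vec_format` also accepts `no_header=True` for text files, which may make the conversion step unnecessary; this depends on the installed version.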

In [3]:
# paths to the downloaded GloVe files
glove_path = './GloVe'
file_name = ['glove.6B.100d.txt', 'glove.6B.200d.txt', 'glove.6B.300d.txt']
glove_input_file = [path.join(glove_path, f) for f in file_name]
word2vec_output_file = [f + '.word2vec' for f in glove_input_file]

# convert the GloVe file format to the word2vec file format
for i in range(len(file_name)):
    glove2word2vec(glove_input_file[i], word2vec_output_file[i])
    
# load the models 
GloVe_100 = KeyedVectors.load_word2vec_format(word2vec_output_file[0])
GloVe_200 = KeyedVectors.load_word2vec_format(word2vec_output_file[1])
#GloVe_300 = KeyedVectors.load_word2vec_format(word2vec_output_file[2])

## Model Evaluation 

In this project, there are three versions of the text data, each produced by a different set of preprocessing steps. For a given Article, and for each version of the text and each GloVe model, the following steps were applied to find the optimal model:

- For each classification model:

1. Convert the text data into an average vector using a customised vectoriser.

2. Scale the feature matrix with a Min-Max scaler to normalise it.

3. Sample random combinations from a grid of hyperparameter values and train the model using 5-fold cross-validation. Select the set of parameters that gives the highest accuracy.

4. Refit the model with the best parameters found on the whole training set and store the parameter values and scores.

- Compare the performance of the models and find the best one.

- Make predictions on the test set using the best model.
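
The customised vectoriser used in step 1 is imported from a local module that is not shown in this notebook. A minimal sketch of what such an averaging vectoriser might look like is given below, assuming each document is a whitespace-separated string and that the embedding supports `word in embedding` and `embedding[word]` (as Gensim `KeyedVectors` does). The class name `MeanEmbeddingVectorizer`, the explicit `dim` argument and the toy dictionary are all hypothetical.

```python
import numpy as np

class MeanEmbeddingVectorizer:
    """Map each document to the mean of its in-vocabulary word vectors."""

    def __init__(self, embedding, dim):
        self.embedding = embedding  # mapping: word -> vector
        self.dim = dim              # vector dimension (e.g. 100 or 200)

    def fit(self, X, y=None):
        # Nothing to learn: the embedding is pre-trained.
        return self

    def transform(self, X):
        rows = []
        for doc in X:
            vectors = [self.embedding[w] for w in doc.split() if w in self.embedding]
            # Documents with no known words fall back to a zero vector.
            rows.append(np.mean(vectors, axis=0) if vectors else np.zeros(self.dim))
        return np.vstack(rows)

# Toy embedding with made-up 2-dimensional vectors.
toy_embedding = {"good": np.array([1.0, 0.0]), "film": np.array([0.0, 1.0])}
vectorizer = MeanEmbeddingVectorizer(toy_embedding, dim=2)
features = vectorizer.fit(["good film"]).transform(["good film", "unknown words"])
print(features)
```

To be cloned by `RandomizedSearchCV` inside a `Pipeline`, such a class would also need `get_params` and `set_params`, which are usually obtained by subclassing `sklearn.base.BaseEstimator` and `TransformerMixin`.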

In [14]:
def evaluate_performance_random(X_train, y_train, embedding, parameters, folds, scoring):
    # versions of the processed text
    ## 'standard' = lowercased, numbers and punctuation removed, stop words kept
    processed_data = ['standard', 'stopwords_nltk', 'stopwords_spacy']
    cols = ['Logistic', 'SVM', 'Naive_Bayes', 'LDA', 'QDA', 'Random_Forest','XGboost', 'Adaboost']
    classifiers = [LogisticRegression(), SVC(), GaussianNB(), LinearDiscriminantAnalysis(),
                  QuadraticDiscriminantAnalysis(), RandomForestClassifier(),
                   XGBClassifier(objective= 'binary:logistic', use_label_encoder = False),
                   AdaBoostClassifier()]
    res = {}
    for data in processed_data:
        res[data] = {} # dict for each processed text
        x_train = X_train[data] 
        for i in range(len(classifiers)):
            vectorizer = Word2Vec_vectorizer(embedding)
            classifier = classifiers[i]
            pipeline = Pipeline([('vectorizer', vectorizer),
                                       ('normalisation', MinMaxScaler()),
                                       ('clf', classifier)])
            random = RandomizedSearchCV(pipeline, parameters[cols[i]],  n_iter = 60, scoring = scoring,
                                cv = folds, refit = 'accuracy', random_state = 1)
            random.fit(x_train, y_train)
            # saving the best performed model based on 'accuracy'
            res[data].update({
                cols[i]:{
                    'random_search' : random,
                    'classifier': random.best_estimator_,
                    'best_score': random.best_score_,
                    'best_params': random.best_params_,
                    'accuracy': random.cv_results_['mean_test_accuracy'],
                    'precision': random.cv_results_['mean_test_precision'],
                    'recall': random.cv_results_['mean_test_recall'],
                    'f1_score': random.cv_results_['mean_test_f1_score']
                    }})
    index_1 = processed_data * 4
    index_1.sort()
    index_2 = ['Accuracy', 'Precision', 'Recall', 'F1 score'] * 3
    pd_index = pd.MultiIndex.from_arrays([index_1, index_2], 
                                         names = ['Processing_Type', 'Performance_Metrics'])
    # create a data frame with the models' performance metric scores
    score_table = pd.DataFrame(index = pd_index, columns = cols)
    # find the best model's scores for the other performance metrics
    for c in cols:
        ## find index for best score 
        best_score_std = res['standard'][c]['best_score']
        acc_std = res['standard'][c]['accuracy']
        index_std = np.where(acc_std == best_score_std)[0][0] ## choose first element if duplicates
        
        best_score_nltk = res['stopwords_nltk'][c]['best_score']
        acc_nltk = res['stopwords_nltk'][c]['accuracy']
        index_nltk = np.where(acc_nltk == best_score_nltk)[0][0]
        
        best_score_sp = res['stopwords_spacy'][c]['best_score']
        acc_sp = res['stopwords_spacy'][c]['accuracy']
        index_sp = np.where(acc_sp == best_score_sp)[0][0]
        
        score_table[c] =[
            best_score_std,
            res['standard'][c]['precision'][index_std],
            res['standard'][c]['recall'][index_std],
            res['standard'][c]['f1_score'][index_std],
            best_score_nltk,
            res['stopwords_nltk'][c]['precision'][index_nltk],
            res['stopwords_nltk'][c]['recall'][index_nltk],
            res['stopwords_nltk'][c]['f1_score'][index_nltk],
            best_score_sp,
            res['stopwords_spacy'][c]['precision'][index_sp],
            res['stopwords_spacy'][c]['recall'][index_sp],
            res['stopwords_spacy'][c]['f1_score'][index_sp]
        ]
    # best model per row, computed once after every classifier column is filled
    score_table['Best_model'] = score_table[cols].idxmax(axis=1)
    
    return res, score_table
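
The index lookup inside the function can be illustrated in isolation. The sketch below uses a mock dictionary shaped like `cv_results_`, with made-up scores for three sampled candidates, to show how the position of the best mean accuracy is used to read off the precision, recall and F1 score of the same candidate.

```python
import numpy as np

# Mock of the relevant cv_results_ entries for three candidates (made-up values).
mock_cv_results = {
    "mean_test_accuracy":  np.array([0.71, 0.78, 0.75]),
    "mean_test_precision": np.array([0.70, 0.76, 0.80]),
    "mean_test_recall":    np.array([0.73, 0.81, 0.69]),
    "mean_test_f1_score":  np.array([0.71, 0.78, 0.74]),
}

best_score = mock_cv_results["mean_test_accuracy"].max()
# First position whose accuracy equals the best score, as in the function above.
best_index = np.where(mock_cv_results["mean_test_accuracy"] == best_score)[0][0]

# Metrics of the candidate that achieved the best accuracy.
best_metrics = {name: float(values[best_index]) for name, values in mock_cv_results.items()}
print(best_index, best_metrics)
```

With a fitted `RandomizedSearchCV` that uses `refit = 'accuracy'`, the attribute `best_index_` is expected to give the same position directly.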

In [71]:
def test_performance(art_num, GloVe_dim, stop_word, classifier):
    # loading the best model
    with open('./models/model_art{}_{}.p'.format(art_num, GloVe_dim), 'rb') as f:
        res = pickle.load(f)
    best_model = res[stop_word][classifier]['classifier']
    # keep the best model
    with open('./models/best_model_art{}.p'.format(art_num), 'wb') as f:
        pickle.dump(best_model, f)
    
    # loading test set
    art_df = pd.read_csv('./processed_text_df/test_article_{}.csv'.format(art_num))
    test_x = art_df[stop_word]
    test_y = art_df.label
    
    # predict
    pred_y = best_model.predict(test_x)
    acc = accuracy_score(test_y, pred_y)
    pre = precision_score(test_y, pred_y)
    rec = recall_score(test_y, pred_y)
    f1 = f1_score(test_y, pred_y)
    
    # performance table
    test_res = pd.DataFrame([[classifier, acc, pre, rec, f1]], columns = ['Best_model', 'Accuracy', 'Precision', 'Recall', 'F1_score'])
    # save 
    test_res.to_csv('./models/art{}_test.csv'.format(art_num))
    
    return test_res
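
For reference, the four test metrics reported by the function can be derived from the confusion counts of a binary problem. The sketch below assumes labels coded as 0/1 with 1 as the positive class, which is the default for the scikit-learn metric functions used here; the helper name `binary_metrics` and the label vectors are made up.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 3 true positives, 1 false negative, 1 false positive, 1 true negative.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0]
scores = binary_metrics(y_true, y_pred)
print(scores)
```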

In [17]:
# define pipeline parameters
parameters = { 
    'Logistic': {
        'clf__penalty':['l1', 'l2', 'none'],
        'clf__C': uniform(0.001, 100),
        'clf__max_iter': np.arange(100, 2101),  # integers in [100, 2100]; max_iter must be an integer
        'clf__solver':['saga']
    },
    'SVM': {
        'clf__C': uniform(0.001, 1000),
        'clf__kernel': ['linear', 'rbf'],
        'clf__gamma': uniform(0.001, 10)
    },
    'Naive_Bayes': {
        'clf__var_smoothing':uniform(1e-15, 1e-2)
    },
    'LDA': {
        'clf__solver': ['lsqr', 'eigen'],
        'clf__shrinkage': uniform()
     },
    'QDA': {
        'clf__reg_param': uniform()
    },
    'Random_Forest': {
        'clf__n_estimators': [50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
        'clf__criterion': ['gini', 'entropy'],
        'clf__max_features': np.arange(1, 101)  # at least one feature; 0 is not a valid value
    },
    'XGboost': {
        'clf__learning_rate': uniform(0.001, 0.3),
        'clf__gamma': uniform(0, 10),
        'clf__n_estimators': np.arange(100, 900),
        'clf__max_depth': np.arange(2, 10),
        'clf__min_child_weight': np.arange(1, 9),
        'clf__colsample_bytree': uniform(0.3, 0.7),
        'clf__subsample': uniform(0.2, 0.8),
    },
    'Adaboost': {
        'clf__n_estimators': np.arange(50, 1000, 1),
        'clf__learning_rate': uniform(0, 1)
    }
}
# define dictionary with performance metrics
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score),
    'recall': make_scorer(recall_score),
    'f1_score': make_scorer(f1_score)
}
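
Note that `scipy.stats.uniform(loc, scale)` samples from the interval [loc, loc + scale], not [loc, scale]. The short check below draws samples from two of the distributions used above to make the resulting ranges explicit; the sample size and seed are arbitrary choices.

```python
from scipy.stats import uniform

# uniform(0.3, 0.7) covers [0.3, 1.0]; uniform(0.001, 100) covers [0.001, 100.001].
colsample = uniform(0.3, 0.7).rvs(size=1000, random_state=0)
C_values = uniform(0.001, 100).rvs(size=1000, random_state=0)

# Observed sample ranges for each distribution.
print(colsample.min(), colsample.max())
print(C_values.min(), C_values.max())
```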

### Article 2

In [7]:
# Loading train-test sets
art2_train = pd.read_csv('./processed_text_df/train_article_2.csv')
art2_test = pd.read_csv('./processed_text_df/test_article_2.csv')
# labels
train_y = art2_train.label
test_y = art2_test.label

#### 100-Dimensions

In [12]:
res_art2_100, art2_perform_100 = evaluate_performance_random(art2_train, train_y, GloVe_100, parameters, 5, scoring)
pickle.dump(res_art2_100, open( "./models/model_art2_100.p", "wb" ))
art2_perform_100.to_csv('./models/art2_100.csv')
art2_perform_100

Unnamed: 0,Processing_Type,Performance_Metrics,Logistic,SVM,Naive_Bayes,LDA,QDA,Random_Forest,XGboost,Adaboost,Best_model
0,standard,Accuracy,0.747312,0.759785,0.591828,0.773333,0.76086,0.740215,0.74043,0.747312,LDA
1,standard,Precision,0.739955,0.761849,0.655462,0.763611,0.773007,0.730247,0.713684,0.741865,QDA
2,standard,Recall,0.765,0.750833,0.344167,0.791667,0.7975,0.735833,0.829167,0.751667,XGboost
3,standard,F1 score,0.749704,0.753136,0.390169,0.776947,0.764777,0.728424,0.7601,0.743867,LDA
4,stopwords_nltk,Accuracy,0.753548,0.734194,0.681935,0.773548,0.741505,0.772903,0.773548,0.791828,Adaboost
5,stopwords_nltk,Precision,0.745455,0.746275,0.752843,0.751667,0.71637,0.776123,0.761328,0.782288,Adaboost
6,stopwords_nltk,Recall,0.763333,0.699167,0.56,0.803333,0.835,0.764167,0.805,0.8025,QDA
7,stopwords_nltk,F1 score,0.748698,0.719428,0.636428,0.772957,0.760904,0.765001,0.778735,0.789091,Adaboost
8,stopwords_spacy,Accuracy,0.753763,0.766882,0.714624,0.773333,0.760645,0.785806,0.785806,0.799355,Adaboost
9,stopwords_spacy,Precision,0.747604,0.781951,0.780375,0.759649,0.72381,0.778926,0.779035,0.787267,Adaboost


#### 200-Dimensions

In [15]:
res_art2_200, art2_perform_200 = evaluate_performance_random(art2_train, train_y, GloVe_200, parameters, 5, scoring)
pickle.dump(res_art2_200, open( "./models/model_art2_200.p", "wb" ))
art2_perform_200.to_csv('./models/art2_200.csv')

art2_perform_200

Unnamed: 0,Processing_Type,Performance_Metrics,Logistic,SVM,Naive_Bayes,LDA,QDA,Random_Forest,XGboost,Adaboost,Best_model
0,standard,Accuracy,0.747097,0.708387,0.578495,0.734194,0.70086,0.79914,0.76,0.779355,Random_Forest
1,standard,Precision,0.739455,0.718158,0.645201,0.717778,0.728419,0.795626,0.746754,0.778332,Random_Forest
2,standard,Recall,0.75,0.671667,0.358333,0.75,0.675833,0.804167,0.776667,0.7775,Random_Forest
3,standard,F1 score,0.739274,0.690019,0.378023,0.729978,0.697042,0.796925,0.760061,0.772421,Random_Forest
4,stopwords_nltk,Accuracy,0.766667,0.727957,0.701505,0.766237,0.760645,0.785806,0.773548,0.753763,Random_Forest
5,stopwords_nltk,Precision,0.747421,0.716053,0.758939,0.759524,0.727843,0.784866,0.760194,0.754769,Random_Forest
6,stopwords_nltk,Recall,0.789167,0.751667,0.613333,0.776667,0.8325,0.7775,0.791667,0.753333,QDA
7,stopwords_nltk,F1 score,0.762877,0.730484,0.671141,0.762869,0.774764,0.777073,0.773922,0.75285,Random_Forest
8,stopwords_spacy,Accuracy,0.773118,0.740645,0.727527,0.766667,0.740645,0.773548,0.766882,0.753763,Random_Forest
9,stopwords_spacy,Precision,0.7725,0.730947,0.778409,0.761836,0.700212,0.774447,0.765738,0.754726,Naive_Bayes


#### Test Performance of the best model for Article 2 

- Adaboost

- 100 dimensional GloVe

- Stop words removed with spaCy

- Validation score: 0.799355

In [39]:
art2_test = test_performance('2', '100', 'stopwords_spacy', 'Adaboost')
art2_test

Unnamed: 0,Accuracy,Precision,Recall,F1_score
0,0.701493,0.895833,0.741379,0.811321


### Article 3

In [11]:
# Loading train set
art3_train = pd.read_csv('./processed_text_df/train_article_3.csv')
# labels
train_y = art3_train.label

In [None]:
# 100d
res_art3_100, art3_perform_100 = evaluate_performance_random(art3_train, train_y, GloVe_100, parameters, 5, scoring)
pickle.dump(res_art3_100, open( "./models/model_art3_100.p", "wb" ))
art3_perform_100.to_csv('./models/art3_100.csv')

# 200d
res_art3_200, art3_perform_200 = evaluate_performance_random(art3_train, train_y, GloVe_200, parameters, 5, scoring)
pickle.dump(res_art3_200, open( "./models/model_art3_200.p", "wb" ))
art3_perform_200.to_csv('./models/art3_200.csv')

In [40]:
art3_perform_100

Unnamed: 0,Processing_Type,Performance_Metrics,Logistic,SVM,Naive_Bayes,LDA,QDA,Random_Forest,XGboost,Adaboost,Best_model
0,standard,Accuracy,0.740243,0.72512,0.620178,0.748776,0.742393,0.77035,0.751041,0.733814,Random_Forest
1,standard,Precision,0.736733,0.715071,0.576878,0.748428,0.740278,0.764326,0.744214,0.726139,Random_Forest
2,standard,Recall,0.750416,0.750231,0.961055,0.750231,0.750416,0.780851,0.76346,0.750694,Naive_Bayes
3,standard,F1 score,0.741746,0.729315,0.71897,0.746912,0.743969,0.772237,0.752823,0.737847,Random_Forest
4,stopwords_nltk,Accuracy,0.744498,0.738092,0.710181,0.757355,0.729444,0.757493,0.748936,0.731663,Random_Forest
5,stopwords_nltk,Precision,0.734131,0.732345,0.662082,0.750207,0.685998,0.757029,0.759527,0.735621,XGboost
6,stopwords_nltk,Recall,0.759019,0.746161,0.888067,0.767715,0.853747,0.754949,0.72914,0.729325,Naive_Bayes
7,stopwords_nltk,F1 score,0.743998,0.738824,0.756136,0.757352,0.759837,0.754841,0.743576,0.731217,QDA
8,stopwords_spacy,Accuracy,0.744475,0.744475,0.708076,0.750904,0.718783,0.748868,0.755228,0.73386,XGboost
9,stopwords_spacy,Precision,0.737892,0.742286,0.661995,0.74528,0.710746,0.74672,0.764494,0.73292,XGboost


In [41]:
art3_perform_200

Unnamed: 0,Processing_Type,Performance_Metrics,Logistic,SVM,Naive_Bayes,LDA,QDA,Random_Forest,XGboost,Adaboost,Best_model
0,standard,Accuracy,0.746694,0.742439,0.624411,0.768131,0.748845,0.76598,0.757401,0.742302,LDA
1,standard,Precision,0.742118,0.737679,0.580332,0.7571,0.723835,0.75767,0.758505,0.75207,XGboost
2,standard,Recall,0.759019,0.758927,0.948011,0.789177,0.820074,0.780759,0.759019,0.720537,Naive_Bayes
3,standard,F1 score,0.748302,0.746255,0.717893,0.771554,0.766731,0.768107,0.757042,0.734848,LDA
4,stopwords_nltk,Accuracy,0.750949,0.720956,0.712263,0.770281,0.753169,0.763921,0.763876,0.742416,LDA
5,stopwords_nltk,Precision,0.745022,0.718719,0.66477,0.755908,0.709006,0.75458,0.75707,0.744356,XGboost
6,stopwords_nltk,Recall,0.763367,0.733302,0.879741,0.793617,0.8716,0.781036,0.776503,0.746161,Naive_Bayes
7,stopwords_nltk,F1 score,0.75249,0.723677,0.756021,0.772399,0.779323,0.766441,0.765569,0.742309,QDA
8,stopwords_spacy,Accuracy,0.757401,0.718852,0.701579,0.768108,0.742393,0.761748,0.750972,0.736033,LDA
9,stopwords_spacy,Precision,0.7536,0.71532,0.657692,0.757589,0.691353,0.763674,0.742737,0.738807,Random_Forest


#### Test Performance of the best model for Article 3

- Random Forest

- 100 dimensional GloVe

- No stop words removed

- Validation score: 0.770350

In [47]:
art3_test = test_performance('3', '100', 'standard', 'Random_Forest')
art3_test                             

Unnamed: 0,Accuracy,Precision,Recall,F1_score
0,0.730392,0.948905,0.730337,0.825397


### Article 5

In [12]:
# Loading train set
art5_train = pd.read_csv('./processed_text_df/train_article_5.csv')
# labels
train_y = art5_train.label

In [None]:
# 100d
res_art5_100, art5_perform_100 = evaluate_performance_random(art5_train, train_y, GloVe_100, parameters, 5, scoring)
pickle.dump(res_art5_100, open( "./models/model_art5_100.p", "wb" ))
art5_perform_100.to_csv('./models/art5_100.csv')

# 200d
res_art5_200, art5_perform_200 = evaluate_performance_random(art5_train, train_y, GloVe_200, parameters, 5, scoring)
pickle.dump(res_art5_200, open( "./models/model_art5_200.p", "wb" ))
art5_perform_200.to_csv('./models/art5_200.csv')

In [42]:
art5_perform_100

Unnamed: 0,Processing_Type,Performance_Metrics,Logistic,SVM,Naive_Bayes,LDA,QDA,Random_Forest,XGboost,Adaboost,Best_model
0,standard,Accuracy,0.751415,0.730636,0.639927,0.728139,0.710057,0.759207,0.740992,0.751315,Random_Forest
1,standard,Precision,0.757197,0.73714,0.764118,0.749784,0.732146,0.768149,0.747541,0.767877,Random_Forest
2,standard,Recall,0.746694,0.720513,0.404588,0.689474,0.65857,0.751552,0.735223,0.725101,Random_Forest
3,standard,F1 score,0.749844,0.726904,0.524482,0.71728,0.692145,0.75836,0.737606,0.744048,Random_Forest
4,stopwords_nltk,Accuracy,0.74622,0.735864,0.665801,0.748884,0.712587,0.756577,0.728072,0.709957,Random_Forest
5,stopwords_nltk,Precision,0.758369,0.745707,0.782693,0.766024,0.710621,0.77565,0.742924,0.703193,Naive_Bayes
6,stopwords_nltk,Recall,0.731039,0.71525,0.456005,0.720513,0.715655,0.736167,0.699595,0.730769,Random_Forest
7,stopwords_nltk,F1 score,0.740968,0.728251,0.575233,0.739887,0.712494,0.753904,0.719795,0.715978,Random_Forest
8,stopwords_spacy,Accuracy,0.751415,0.72291,0.660606,0.74632,0.707459,0.748818,0.730769,0.709923,Logistic
9,stopwords_spacy,Precision,0.76391,0.714642,0.787535,0.75744,0.71741,0.76431,0.746327,0.743048,Naive_Bayes


In [43]:
art5_perform_200

Unnamed: 0,Processing_Type,Performance_Metrics,Logistic,SVM,Naive_Bayes,LDA,QDA,Random_Forest,XGboost,Adaboost,Best_model
0,standard,Accuracy,0.746187,0.72291,0.642557,0.733267,0.704729,0.759274,0.728039,0.707326,Random_Forest
1,standard,Precision,0.759248,0.72904,0.829899,0.748143,0.693201,0.779113,0.750506,0.714789,Naive_Bayes
2,standard,Recall,0.725506,0.710256,0.358165,0.710121,0.736842,0.741161,0.689069,0.694062,Random_Forest
3,standard,F1 score,0.739231,0.718686,0.496535,0.726622,0.712235,0.75656,0.716435,0.702174,Random_Forest
4,stopwords_nltk,Accuracy,0.738428,0.74349,0.670929,0.735897,0.712554,0.748785,0.733233,0.712621,Random_Forest
5,stopwords_nltk,Precision,0.744629,0.743459,0.786377,0.738808,0.920287,0.774255,0.739292,0.748335,QDA
6,stopwords_nltk,Recall,0.730904,0.746424,0.466397,0.736032,0.483806,0.71525,0.731039,0.647773,SVM
7,stopwords_nltk,F1 score,0.735979,0.743764,0.584885,0.735942,0.611498,0.741895,0.733783,0.693294,SVM
8,stopwords_spacy,Accuracy,0.753946,0.728039,0.676124,0.743623,0.707326,0.746387,0.735864,0.712654,Logistic
9,stopwords_spacy,Precision,0.760242,0.735012,0.786609,0.747532,0.899419,0.762202,0.734371,0.732999,QDA


#### Test Performance of the best model for Article 5

- Random Forest

- 200 dimensional GloVe

- No stop words removed

- Validation score: 0.759274

In [48]:
art5_test = test_performance('5', '200', 'standard', 'Random_Forest')
art5_test 

Unnamed: 0,Accuracy,Precision,Recall,F1_score
0,0.670157,0.94958,0.664706,0.782007


### Article 6

In [15]:
# Loading train set
art6_train = pd.read_csv('./processed_text_df/train_article_6.csv')
# labels
train_y = art6_train.label

In [None]:
# 100d
res_art6_100, art6_perform_100 = evaluate_performance_random(art6_train, train_y, GloVe_100, parameters, 5, scoring)
pickle.dump(res_art6_100, open( "./models/model_art6_100.p", "wb" ))
art6_perform_100.to_csv('./models/art6_100.csv')

# 200d
res_art6_200, art6_perform_200 = evaluate_performance_random(art6_train, train_y, GloVe_200, parameters, 5, scoring)
pickle.dump(res_art6_200, open( "./models/model_art6_200.p", "wb" ))
art6_perform_200.to_csv('./models/art6_200.csv')

In [44]:
art6_perform_100

Unnamed: 0,Processing_Type,Performance_Metrics,Logistic,SVM,Naive_Bayes,LDA,QDA,Random_Forest,XGboost,Adaboost,Best_model
0,standard,Accuracy,0.772516,0.773405,0.62501,0.773379,0.760376,0.777734,0.771636,0.76038,Random_Forest
1,standard,Precision,0.797353,0.799538,0.578568,0.797905,0.762321,0.798733,0.785245,0.779,SVM
2,standard,Recall,0.730885,0.7291,0.927061,0.732564,0.756852,0.742984,0.748186,0.727331,Naive_Bayes
3,standard,F1 score,0.762428,0.762298,0.712115,0.763614,0.759404,0.76958,0.766033,0.751923,Random_Forest
4,stopwords_nltk,Accuracy,0.760376,0.752569,0.684032,0.767333,0.764717,0.77773,0.77601,0.760429,Random_Forest
5,stopwords_nltk,Precision,0.775473,0.767201,0.642003,0.772198,0.762132,0.791425,0.786707,0.76936,Random_Forest
6,stopwords_nltk,Recall,0.732579,0.723973,0.838471,0.758561,0.770735,0.753373,0.758591,0.746462,Naive_Bayes
7,stopwords_nltk,F1 score,0.752971,0.744643,0.726597,0.764847,0.76582,0.771506,0.771456,0.756818,Random_Forest
8,stopwords_spacy,Accuracy,0.76472,0.749972,0.683166,0.766452,0.762981,0.774259,0.774259,0.762127,Random_Forest
9,stopwords_spacy,Precision,0.781634,0.752595,0.640476,0.786523,0.770392,0.790947,0.785835,0.76583,Random_Forest


In [45]:
art6_perform_200

Unnamed: 0,Processing_Type,Performance_Metrics,Logistic,SVM,Naive_Bayes,LDA,QDA,Random_Forest,XGboost,Adaboost,Best_model
0,standard,Accuracy,0.769034,0.764728,0.63889,0.781197,0.766452,0.782074,0.77601,0.769923,Random_Forest
1,standard,Precision,0.793748,0.775866,0.59029,0.813713,0.800528,0.804704,0.799103,0.799594,LDA
2,standard,Recall,0.725652,0.746462,0.918381,0.729085,0.710015,0.744708,0.737781,0.72042,Naive_Bayes
3,standard,F1 score,0.757932,0.760394,0.717997,0.768867,0.751878,0.773234,0.76692,0.757626,Random_Forest
4,stopwords_nltk,Accuracy,0.770777,0.76125,0.696183,0.779469,0.765575,0.780327,0.776025,0.765598,Random_Forest
5,stopwords_nltk,Precision,0.789823,0.770561,0.655594,0.80466,0.829699,0.790489,0.794481,0.782224,QDA
6,stopwords_nltk,Recall,0.736102,0.744708,0.833313,0.737781,0.668351,0.762084,0.746432,0.736042,Naive_Bayes
7,stopwords_nltk,F1 score,0.761345,0.757036,0.733145,0.769344,0.739637,0.77567,0.768794,0.758288,Random_Forest
8,stopwords_spacy,Accuracy,0.766437,0.757779,0.693578,0.778592,0.766433,0.776864,0.770792,0.754312,LDA
9,stopwords_spacy,Precision,0.785184,0.758843,0.653882,0.798436,0.810876,0.792226,0.789842,0.768597,QDA


#### Test Performance of the best model for Article 6

- Random Forest

- 200 dimensional GloVe

- No stop words removed

- Validation score: 0.782074

In [49]:
art6_test = test_performance('6', '200', 'standard', 'Random_Forest')
art6_test

Unnamed: 0,Accuracy,Precision,Recall,F1_score
0,0.746627,0.977974,0.736318,0.840114


In [75]:
# concatenating results 
perform = pd.concat([art2_test, art3_test, art5_test, art6_test]).reset_index(drop = True)
perform = perform.set_index(pd.Index(['Article 2', 'Article 3', 'Article 5', 'Article 6']))
perform['Best_model'] = ['Adaboost', 'Random Forest', 'Random Forest', 'Random Forest']
perform = perform[['Best_model', 'Accuracy', 'Precision', 'Recall', 'F1_score' ]]

In [77]:
perform

Unnamed: 0,Best_model,Accuracy,Precision,Recall,F1_score
Article 2,Adaboost,0.701493,0.895833,0.741379,0.811321
Article 3,Random Forest,0.730392,0.948905,0.730337,0.825397
Article 5,Random Forest,0.670157,0.94958,0.664706,0.782007
Article 6,Random Forest,0.746627,0.977974,0.736318,0.840114
