In [1]:
import numpy as np
import pandas as pd
from WMVE import *

import pickle
from sklearn import preprocessing


from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Load data

To keep this notebook's run time under one minute, we use a sample of length 11000 to show how our project works.

Because computing the word embeddings takes a long time, we ran the embedding and data-splitting steps beforehand and saved the classifier training, voting testing, and validation sets locally. Here we load these sets from the local files.

In [2]:
# Data which has been embedded and split properly for training
with open(".\\tempfile\\news_data.txt", "rb") as fp:  # Unpickling
    X_train, Y_train, X_test, Y_test, X_val, Y_val = pickle.load(fp)

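# Fit the scaler on the training and validation features combined,
# then apply the same transformation to all three sets
X = X_train.copy()
X.extend(X_val)
scaler = preprocessing.StandardScaler().fit(X)
X_train_1 = scaler.transform(X_train)
X_test_1 = scaler.transform(X_test)
X_val_1 = scaler.transform(X_val)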
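
As a rough, dependency-free sketch of what the scaling step does, the snippet below standardizes each feature column with the mean and population standard deviation of the fitting data, which is the formula `StandardScaler` applies. The helper names `fit_standardizer` and `standardize` and the toy numbers are illustrative assumptions, not part of the project code.

```python
from statistics import fmean, pstdev


def fit_standardizer(rows):
    """Return per-column means and standard deviations for a list of rows."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Population standard deviation; a constant column gets a scale of 1
    # so that the division below never fails.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardize(rows, means, stds):
    """Apply z = (x - mean) / std column by column."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Toy stand-ins for X_train and X_val (hypothetical values)
toy_train = [[1.0, 10.0], [3.0, 10.0]]
toy_val = [[5.0, 10.0]]

# Fit on the training and validation rows together, then transform
means, stds = fit_standardizer(toy_train + toy_val)
toy_train_scaled = standardize(toy_train, means, stds)
print(means, stds)
print(toy_train_scaled)
```

The same `means` and `stds` would then be reused for every other set, so that all sets share one scale.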

The ratio of the three sets should be approximately 4:1:1.

In [3]:
print(len(Y_train), len(Y_test), len(Y_val))

5918 1480 1479
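
The 4:1:1 ratio can be checked by dividing each printed size by the smallest one. This is only a quick arithmetic check; the dictionary keys are illustrative labels for the training, voting testing, and validation sets.

```python
# Set sizes printed above: training, voting testing, validation
sizes = {"train": 5918, "test": 1480, "val": 1479}

smallest = min(sizes.values())
ratios = {name: round(n / smallest, 2) for name, n in sizes.items()}
print(ratios)  # {'train': 4.0, 'test': 1.0, 'val': 1.0}
```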


All the classifiers were trained beforehand, and the model files (including the deep learning model) were saved locally. Here we load the random forest, naive Bayes, and logistic regression models.

BertForSequenceClassification (BFSC) is a deep learning model, and predicting labels with it is very time-consuming, even on this small sample. We therefore also saved the BFSC predictions locally.

In [4]:
# Load classifiers
forest = load(forest_news_model)
nb = load(nb_news_model)
lr = load(lr_news_model)
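
The save-once, load-later pattern behind these model files and the saved BFSC predictions can be sketched with the standard library alone. This is only an illustration: `cache_or_compute`, `slow_predict`, and the file name are hypothetical, and the `load` helper used above comes from WMVE.py, whose implementation is not shown here.

```python
import pickle
import tempfile
from pathlib import Path


def cache_or_compute(path, compute):
    """Load a pickled result from `path`, or compute and save it first."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as fp:  # Unpickling
            return pickle.load(fp)
    result = compute()
    with path.open("wb") as fp:  # Pickling
        pickle.dump(result, fp)
    return result


calls = []


def slow_predict():
    """Hypothetical stand-in for an expensive prediction step."""
    calls.append(1)
    return [0, 1, 1, 0]


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "demo_pred.txt"
    first = cache_or_compute(cache_file, slow_predict)   # computes and saves
    second = cache_or_compute(cache_file, slow_predict)  # loads from disk

print(first == second, len(calls))
```

Pickle files should only be loaded from a trusted source, since unpickling can execute arbitrary code.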

# Evaluate Models on the Validation Set

Here we use the random forest, naive Bayes, and logistic regression models to predict labels for the validation set, load the saved BFSC predictions, and evaluate the performance of all four classifiers.

In [6]:
with open(".\\tempfile\\news_val_pred.txt", "rb") as fp:  # Unpickling
    bert_val_pred = pickle.load(fp)  # saved BFSC predictions

# The random forest receives the unscaled features;
# naive Bayes and logistic regression receive the standardized ones
forest_val_pred = forest.predict(X_val)
nb_val_pred = nb.predict(X_val_1)
lr_val_pred = lr.predict(X_val_1)

binary_eval('bert', bert_val_pred, Y_val)
binary_eval('forest', forest_val_pred, Y_val)
binary_eval('nb', nb_val_pred, Y_val)
binary_eval('lr', lr_val_pred, Y_val)

bert accuracy 0.9051299759172687
bert f1 0.9108522100832451
bert roc_auc 0.9051299759172687
bert confusion matrix [[867  69]
 [ 63 480]]

forest accuracy 0.7571667907669397
forest f1 0.7560845946242135
forest roc_auc 0.7571667907669397
forest confusion matrix [[908 435]
 [ 22 114]]

nb accuracy 0.5700728887200551
nb f1 0.5132307389241781
nb roc_auc 0.5700728887200551
nb confusion matrix [[282 103]
 [648 446]]

lr accuracy 0.5418783715677673
lr f1 0.5171417582625685
lr roc_auc 0.5418783715677674
lr confusion matrix [[429 205]
 [501 344]]
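
A minimal sketch of the kind of metrics an evaluation helper such as `binary_eval` reports is given below. It is not the code in WMVE.py: `binary_eval_sketch` is a hypothetical stand-in that assumes 0/1 labels, plain binary F1 for class 1, and the argument order (name, y_true, y_pred). The real function may differ, for example in how it averages F1.

```python
def binary_eval_sketch(name, y_true, y_pred):
    """Print and return accuracy, F1 for class 1, and a 2x2 confusion matrix."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)

    accuracy = (tp + tn) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    matrix = [[tn, fp], [fn, tp]]  # rows: true label, columns: predicted label

    print(name, "accuracy", accuracy)
    print(name, "f1", f1)
    print(name, "confusion matrix", matrix)
    return accuracy, f1, matrix


# Toy labels (hypothetical)
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
acc, f1, cm = binary_eval_sketch("toy", y_true, y_pred)
```

In this sketch, swapping `y_true` and `y_pred` transposes the confusion matrix but leaves accuracy and binary F1 unchanged, so the argument order matters mainly when reading the matrix.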



`comments_voting` is a function defined in WMVE.py. For each piece of news, it extracts all Reddit comments (if any exist) and uses the BFSC model to predict a label for each comment. It then applies an unweighted majority vote to these predictions and takes the result as the label of that piece of news. The function also prints a performance evaluation of the voting results.

Because this step is also time-consuming, the next cell only displays the code and the evaluation output. The comment voting result has been saved to a local file so that we can load it directly.

In [7]:
'''
comment_train_pred = comments_voting(mode='val')

with open(".\\tempfile\\comment_train_pred.txt", "wb") as fp:  # Pickling
    pickle.dump(comment_train_pred, fp)
'''

Token indices sequence length is longer than the specified maximum sequence length for this model (917 > 512). Running this sequence through the model will result in indexing errors


comment_val accuracy 0.6535433070866141
comment_val f1 0.7904761904761904
comment_val confusion matrix [[83 44]
 [ 0  0]]
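
The unweighted majority vote over comment predictions can be sketched as follows. This is an illustration under assumptions, not the `comments_voting` code: the function name `majority_vote`, the tie-breaking rule, and the use of `None` for news without comments are all hypothetical choices.

```python
from collections import Counter


def majority_vote(comment_labels, tie_label=1):
    """Unweighted majority vote over per-comment labels (0 or 1).

    Returns None when there are no comments; `tie_label` (an assumption)
    decides exact ties.
    """
    if not comment_labels:
        return None
    counts = Counter(comment_labels)
    if counts[0] == counts[1]:
        return tie_label
    return 0 if counts[0] > counts[1] else 1


# Hypothetical per-comment predictions for four pieces of news
per_news_comment_preds = [[1, 1, 0], [0, 0, 0, 1], [], [1, 0]]
news_labels = [majority_vote(preds) for preds in per_news_comment_preds]
print(news_labels)  # [1, 0, None, 1]
```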





The voting weights were calculated on the voting testing set, as described in our final report. Here we load and display the weights we calculated previously.

In [39]:
with open(".\\tempfile\\voting_weight.txt", "rb") as fp:  # Unpickling
    weight = pickle.load(fp)

# News without comments has no comment vote, so its comment weight is 0
weight[0].append(0)

# Each weight row is indexed as [bert, forest, lr, nb, comment]
bert_weight = [weight[0][0], weight[1][0]]
forest_weight = [weight[0][1], weight[1][1]]
lr_weight = [weight[0][2], weight[1][2]]
nb_weight = [weight[0][3], weight[1][3]]
comment_weight = [weight[0][4], weight[1][4]]
total = [sum(weight[0][:5]), sum(weight[1][:5])]
name = ['Weights for News Without Comments','Weights for News With Comments']

pd.DataFrame({'Type': name,
              'BertForSequenceClassification': bert_weight,
              'Random Forest': forest_weight,
              'Logistic Regression': lr_weight,
              'Naive Bayes': nb_weight,
              'Comment Voting': comment_weight,
              'Total': total})

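|   | Type | BertForSequenceClassification | Random Forest | Logistic Regression | Naive Bayes | Comment Voting | Total |
|---|------|-------------------------------|---------------|---------------------|-------------|----------------|-------|
| 0 | Weights for News Without Comments | 0.250916 | 0.361104 | 0.203274 | 0.184706 | 0.0 | 1.0 |
| 1 | Weights for News With Comments | 0.166348 | 0.273423 | 0.143403 | 0.128107 | 0.288719 | 1.0 |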


In [42]:
with open(".\\tempfile\\comment_train_pred.txt", "rb") as fp:  # Unpickling
    comment_train_pred = pickle.load(fp)

# Combine all predictions to feed to the weighted voting model
classifiers_pred_val = pd.DataFrame({'bert': bert_val_pred,
                                     'forest': forest_val_pred,
                                     'nb': nb_val_pred,
                                     'lr': lr_val_pred,
                                     'comment': comment_train_pred})


voting_val_pred, prob = WMVEpredict(weight, classifiers_pred_val, use_softmax=False, final=True)

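# Note: the labels and predictions are passed in the reverse order of the
# binary_eval calls in cell In [6], so this confusion matrix is transposed
# relative to the ones printed there
binary_eval('voting_val', Y_val, voting_val_pred)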

voting_val accuracy 0.7496562665256478
voting_val f1 0.7875736441988426
voting_val roc_auc 0.7496562665256478
voting_val confusion matrix [[876  54]
 [243 306]]
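
The weighted vote itself can be illustrated as follows. This is a sketch under stated assumptions, not the `WMVEpredict` implementation: we assume that each classifier adds its weight to the class it predicts, that the class with the larger total wins, and that news without comments (comment prediction `None`) uses the first weight row. The function name `weighted_vote` and the sample predictions are hypothetical; the weights are the rounded values displayed in the weight table above.

```python
# Rounded weights from the weight table, keyed by classifier name
WEIGHTS_NO_COMMENTS = {"bert": 0.250916, "forest": 0.361104,
                       "lr": 0.203274, "nb": 0.184706}
WEIGHTS_WITH_COMMENTS = {"bert": 0.166348, "forest": 0.273423,
                         "lr": 0.143403, "nb": 0.128107,
                         "comment": 0.288719}


def weighted_vote(preds):
    """Return (label, score_for_class_1) for one piece of news.

    `preds` maps classifier name to 0 or 1; a missing or None "comment"
    entry means the news has no comments. Ties go to class 0 in this sketch.
    """
    has_comment = preds.get("comment") is not None
    weights = WEIGHTS_WITH_COMMENTS if has_comment else WEIGHTS_NO_COMMENTS
    score_1 = sum(w for name, w in weights.items() if preds[name] == 1)
    score_0 = sum(w for name, w in weights.items() if preds[name] == 0)
    return (1 if score_1 > score_0 else 0), score_1


# Same classifier predictions, without and with a comment vote (hypothetical)
label_a, score_a = weighted_vote(
    {"bert": 1, "forest": 0, "lr": 1, "nb": 0, "comment": None})
label_b, score_b = weighted_vote(
    {"bert": 1, "forest": 0, "lr": 1, "nb": 0, "comment": 1})
print(label_a, round(score_a, 6))
print(label_b, round(score_b, 6))
```

Keying the weights by classifier name, rather than by position, means the result does not depend on the column order of the prediction table.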

