In [80]:
import pickle  # used below to load the cached results

import numpy as np
import pandas as pd
from prediction import *

# Load data

To keep the runtime of this notebook under one minute, we use a small sample of 550 news items to show how our project predicts the authenticity of news.

In [81]:
news = pd.read_csv('pred_news.csv',index_col=0)
comments = pd.read_csv('pred_comments.csv',index_col=0)

In [5]:
news.head(3)

Unnamed: 0,title,text,label,author,topic,perception,news_index
0,report donald trump leaves 10000 tip 82 bill,editors note story determined hoax click read ...,0,,13,0.559441,0
1,largest great white shark ever recorded,jaws quint notes monstrous prey 25 feet long r...,0,,9,0.830583,1
2,barclays handed biggest bank fine uk history b...,barclays handed biggest uk bank fine history s...,0,,0,0.520255,2


In [6]:
comments.head(3)

Unnamed: 0,title,text,label,author,topic,perception,news_index,comment_author,comment_text,comment_score,comment_subreddit,have_comment
0,long could survive coffin buried alive,normal healthy person might 10 minutes hour si...,1,,9,0.751051,25,crackhappy,quick someone tell ryan reynolds,19.0,todayilearned,1
1,long could survive coffin buried alive,normal healthy person might 10 minutes hour si...,1,,9,0.751051,25,-BigSexy-,makes escape scene kill bill much realistic now,10.0,todayilearned,1
2,long could survive coffin buried alive,normal healthy person might 10 minutes hour si...,1,,9,0.751051,25,running_uphill,person coffin picture must 4 inches tall since...,2.0,todayilearned,1


Because computing the word embeddings takes a long time, we ran that step in advance and saved the results locally. The next cell shows the embedding code, where 'embedding_news' is a function in prediction.py. The cell after it loads the saved results from local files.

In [None]:
# Code for word embedding
'''
X, Y, dataloader = embedding_news(news)

with open(".\\tempfile\\pred_data.txt", "wb") as fp:  # Pickling
    pickle.dump([X, Y], fp)
'''
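The compute-once, load-later pattern used throughout this notebook can be sketched as follows. The helper name `load_or_compute`, the toy `expensive_step` function, and the demo file name are hypothetical and not part of prediction.py. `pathlib` is used here so that the path also works outside Windows.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(cache_path, compute):
    """Return the pickled result at cache_path, computing and saving it if absent."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as fp:
            return pickle.load(fp)
    result = compute()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as fp:
        pickle.dump(result, fp)
    return result


# Toy stand-in for an expensive step such as embedding_news(news).
calls = []


def expensive_step():
    calls.append(1)
    return [[101, 2023, 102], [0, 1]]


cache_file = Path(tempfile.mkdtemp()) / "tempfile" / "pred_data_demo.pkl"
first = load_or_compute(cache_file, expensive_step)   # computes and saves
second = load_or_compute(cache_file, expensive_step)  # loads from disk
```

The second call returns the cached value without running the expensive step again. As with any pickle file, only load files you created yourself.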

In [7]:
# Load X, Y from local files
with open(".\\tempfile\\pred_data.txt", "rb") as fp:  # Unpickling
    X, Y = pickle.load(fp)
    
# Scale X to feed to LR and NB models
scaler = preprocessing.StandardScaler().fit(X)
X_scale = scaler.transform(X)
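For reference, `StandardScaler` rescales each column to zero mean and unit variance, using the population standard deviation. The pure-Python sketch below applies the same transformation to toy data. The helper `standardize_columns` is illustrative only and is not the scikit-learn implementation.

```python
from statistics import fmean, pstdev


def standardize_columns(rows):
    """Column-wise z-score: (value - column mean) / column population std."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # A constant column has zero spread; divide by 1.0 so it maps to zeros
    # (scikit-learn's StandardScaler handles zero variance the same way).
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


toy_X = [[1.0, 10.0, 5.0], [2.0, 20.0, 5.0], [3.0, 30.0, 5.0]]
toy_scaled = standardize_columns(toy_X)
```

Each scaled column now has mean 0, and the constant third column becomes all zeros.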

In each row of X, the first 128 values are the embedding of the text and title, the next 5 values are the embedding of the author (if the news has no author, we fill these with five 0s), and the last value is the dominant topic label obtained from topic modeling.

In [8]:
len(X[0]) # 128 + 5 + 1 = 134

134
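The layout described above can be made explicit by slicing a row into its three parts. This is a sketch on a made-up row. The helper name `split_features` is hypothetical, and the lengths are the ones stated in the text.

```python
TEXT_LEN, AUTHOR_LEN = 128, 5  # lengths stated in the text; the topic label is 1 value


def split_features(x):
    """Split one 134-value feature row into (text_part, author_part, topic)."""
    values = list(x)
    expected = TEXT_LEN + AUTHOR_LEN + 1
    if len(values) != expected:
        raise ValueError(f"expected {expected} values, got {len(values)}")
    text_part = values[:TEXT_LEN]
    author_part = values[TEXT_LEN:TEXT_LEN + AUTHOR_LEN]
    topic = values[-1]
    return text_part, author_part, topic


# Toy row: made-up text values, no author (five zeros), topic label 13.
toy_row = list(range(1000, 1128)) + [0] * 5 + [13]
text_part, author_part, topic = split_features(toy_row)
has_author = any(author_part)  # an all-zero author part means "no author"
```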

In [9]:
X[0]

array([  101, 10195,  3602,  2466,  4340, 28520, 11562,  3191,  7172,
        2466,  6866,  2285,  1021,  6221,  8398, 10036,  8337,  2618,
        3310,  2204,  2326,  2613,  4355,  3686,  9587, 24848,  2187,
        6694,  2692,  5955,  6445, 22907,  3021,  6928,  4825,  4203,
        9018, 10250, 10128,  4315, 20891, 29566,  2213,  4311, 11562,
        2156, 24306,  4315, 20891, 29566,  2213,  5006,  1040, 15610,
        6901,  2252,  2288, 11721, 16998, 26302,  2078,  5955,  6221,
       27175,  4596,  2326,  4315, 20891, 29566,  2213,  2988,  8398,
       11729, 11586,  2098, 24857, 11350,  6178, 14289, 14693, 15460,
       18064, 11562,  2156,  2678, 15610,  5221,  5955,  2412,  2288,
        8398,  7283,  2356,  5006,  6128,  7987, 12722, 18826,  3310,
        2843,  5006,  5838, 11182,  3156,  6694,  4638,  2320,  2204,
        3105,  8398,  2409, 15610,  2091,  2239, 24158,  7630,  3600,
       15610,  2387,  5955,  8398,  2187,  2903,  2009,  2609,  2988,
        2941,  2699,

Y is a list of 0/1 labels. We convert it into a tensor since we will use PyTorch to run a deep learning model on it.

In [10]:
Y[:10]

tensor([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

All the classifiers have been trained previously, and the model files were saved locally. Here we load the random forest, naive Bayes, and logistic regression models. 

The BertForSequenceClassification (BFSC) model is a deep learning model. Using it to predict labels is very time-consuming, even on this small sample, so we have also pickled its predictions in a local file.

In [14]:
# Code for using BFSC to predict labels
'''
from train_bert import bertpredict_withbatch

bertnews = torch.load(save_news_model, map_location=device)
bert_pred = bertpredict_withbatch(dataloader, bertnews)

with open(".\\tempfile\\bert_pred.txt", "wb") as fp:  # Pickling
    pickle.dump(bert_pred, fp)
'''


Running Validation...


In [11]:
# Load prediction result of BFSC model from local file
with open(".\\tempfile\\bert_pred.txt", "rb") as fp:  # Unpickling
    bert_pred = pickle.load(fp)
    
# Load classifiers
forest = load(forest_news_model)
nb = load(nb_news_model)
lr = load(lr_news_model)

# Prediction & Evaluation

## Use all classifiers to predict labels

Here we use the random forest, naive Bayes, and logistic regression models to predict labels for the validation set, then evaluate all four classifiers, including the saved BFSC predictions.

In [15]:
forest_pred = forest.predict(X)
nb_pred = nb.predict(X_scale)
lr_pred = lr.predict(X_scale)

binary_eval('bert', Y, bert_pred)
binary_eval('forest', Y, forest_pred)
binary_eval('nb', Y, nb_pred)
binary_eval('lr', Y, lr_pred)

bert accuracy 0.458154588631837
bert f1 0.619536019597501
bert confusion matrix [[315 113]
 [100  22]]

forest accuracy 0.8313926765742301
forest f1 0.9146912728527078
forest confusion matrix [[424   4]
 [ 40  82]]

nb accuracy 0.5769304427761606
nb f1 0.4370490495012234
nb confusion matrix [[129 299]
 [ 18 104]]

lr accuracy 0.5813543741381952
lr f1 0.5865512068448224
lr confusion matrix [[224 204]
 [ 44  78]]
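'binary_eval' is defined in the project code and is not shown here. The printed scores are consistent with balanced accuracy (the mean of per-class recall) and support-weighted F1 rather than plain accuracy. For example, the forest matrix gives a plain accuracy of 506/550 but a balanced accuracy of about 0.8314. The sketch below recomputes both scores from the confusion matrices above, assuming rows are true labels and columns are predicted labels (the scikit-learn convention). The function names are illustrative.

```python
def balanced_accuracy(cm):
    """Mean of per-class recall; cm[i][j] = count of true class i predicted as j."""
    recalls = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]
    return sum(recalls) / len(recalls)


def weighted_f1(cm):
    """Per-class F1 averaged with weights proportional to each class's support."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    score = 0.0
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp
        fp = sum(cm[i][k] for i in range(n)) - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * sum(cm[k]) / total
    return score


# Confusion matrices copied from the output above.
matrices = {
    "bert": [[315, 113], [100, 22]],
    "forest": [[424, 4], [40, 82]],
    "nb": [[129, 299], [18, 104]],
    "lr": [[224, 204], [44, 78]],
}
scores = {name: (balanced_accuracy(cm), weighted_f1(cm))
          for name, cm in matrices.items()}
for name, (acc, f1) in scores.items():
    print(f"{name}: balanced accuracy {acc:.4f}, weighted F1 {f1:.4f}")
```

Because the sample is imbalanced (428 items with label 0 and 122 with label 1), balanced accuracy penalizes a model that does well only on the majority class.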



"comments_voting_pred" is a self-defined function in WMVE.py. It extracts all Reddit comments (if any) for each piece of news and uses the BFSC model to predict a label for every comment. An unweighted majority vote over these predictions then gives the label of the news item. The function also prints a performance evaluation of the voting results.

Since this step is time-consuming too, we only display the code and the evaluation message in the next chunk. The comment voting result has been saved into a local file so that we can load it directly.
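A minimal sketch of unweighted majority voting over per-comment predictions follows. It is not the code in WMVE.py. The function name `majority_vote`, the tie-breaking rule (ties go to label 1), and the toy data are assumptions. The value 2 as the marker for news without comments is inferred from the result cell, which maps 2 to NaN.

```python
from collections import Counter

NO_COMMENT = 2  # assumed marker for news without comments


def majority_vote(comment_labels):
    """Return the most common 0/1 label, or NO_COMMENT for an empty list."""
    if not comment_labels:
        return NO_COMMENT
    counts = Counter(comment_labels)
    # Ties are resolved in favor of label 1 (an arbitrary choice for this sketch).
    return 1 if counts[1] >= counts[0] else 0


# Toy per-news comment predictions keyed by news_index (made-up values).
toy_comment_labels = {25: [1, 1, 0], 31: [0, 0, 1, 0], 40: []}
toy_votes = {idx: majority_vote(labels)
             for idx, labels in toy_comment_labels.items()}
```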

In [14]:
'''
relationship = pd.read_csv('pred_relationship.csv',index_col=0)
comment_pred = comments_voting_pred(news,comments,relationship)

with open(".\\tempfile\\comment_pred.txt", "wb") as fp:  # Pickling
    pickle.dump(comment_pred, fp)
'''

comment_voting accuracy 0.759469696969697
comment_voting f1 0.775983436853002
comment_voting confusion matrix [[19  5]
 [ 3  8]]



In [86]:
# Load the comment voting result
with open(".\\tempfile\\comment_pred.txt", "rb") as fp:  # Unpickling
    comment_pred = pickle.load(fp)

## Voting Weights

The voting weights were computed on the voting test set, as described in our final report.

In [17]:
# Load the voting weights
with open(".\\tempfile\\voting_weight.txt", "rb") as fp:  # Unpickling
    weight = pickle.load(fp) 
    
weight[0].append(None)

bert_weight = [weight[0][0],weight[1][0]]
forest_weight = [weight[0][1],weight[1][1]]
lr_weight = [weight[0][2],weight[1][2]]
nb_weight = [weight[0][3],weight[1][3]]
comment_weight = [weight[0][4],weight[1][4]]
total = [sum([weight[0][0],weight[0][1],weight[0][2],weight[0][3]]),
        sum([weight[1][0],weight[1][1],weight[1][2],weight[1][3],weight[1][4]])]
name = ['Weights for News Without Comments','Weights for News With Comments']

pd.DataFrame({'Type':name,
              'BertForSequenceClassification':bert_weight,
              'Random Forest':forest_weight,
             'Logistic Regression':lr_weight,
             'Naive Bayes':nb_weight,
             'Comment Voting':comment_weight,
             'Total':total})

Unnamed: 0,Type,BertForSequenceClassification,Random Forest,Logistic Regression,Naive Bayes,Comment Voting,Total
0,Weights for News Without Comments,0.249231,0.350231,0.202418,0.19812,,1.0
1,Weights for News With Comments,0.190105,0.254369,0.155149,0.146276,0.254101,1.0


"WMVEpredict" is a self-defined function in WMVE.py. It takes the predictions from all our classifiers, together with the voting weights, and performs weighted majority voting.

In [31]:
# Combine all predictions to feed to the weighted voting model
classifiers_pred = pd.DataFrame({'bert': bert_pred,
                                 'forest': forest_pred,
                                 'nb': nb_pred,
                                 'lr': lr_pred,
                                 'comment': comment_pred})

voting_pred, prob = WMVEpredict(weight, classifiers_pred, use_softmax=False, final=True)

binary_eval('voting_val', Y, voting_pred)

voting_val accuracy 0.7653401256319903
voting_val f1 0.8395253231616867
voting_val confusion matrix [[385  43]
 [ 45  77]]
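The implementation of 'WMVEpredict' is not shown in this notebook. As a sketch of plain weighted majority voting, each classifier can add its weight to the label it predicts, and the label with the larger total wins. The function name `weighted_vote`, the rule that ties go to label 0, and the use of 2 as the "no comments" marker are assumptions. The weights are the rounded values from the table above, and the three prediction patterns are taken from rows of the result tables in this notebook.

```python
NO_COMMENT = 2  # assumed marker for news without comments

# Rounded weights copied from the voting weight table.
WEIGHTS_WITHOUT_COMMENTS = {"bert": 0.249231, "forest": 0.350231,
                            "lr": 0.202418, "nb": 0.19812}
WEIGHTS_WITH_COMMENTS = {"bert": 0.190105, "forest": 0.254369, "lr": 0.155149,
                         "nb": 0.146276, "comment": 0.254101}


def weighted_vote(preds):
    """Return (label, weight share for label 1) for one news item.

    preds maps classifier name to a 0/1 prediction; preds["comment"] may be
    NO_COMMENT, in which case the four-classifier weights are used.
    """
    if preds.get("comment", NO_COMMENT) == NO_COMMENT:
        weights = WEIGHTS_WITHOUT_COMMENTS
    else:
        weights = WEIGHTS_WITH_COMMENTS
    share_for_1 = sum(w for name, w in weights.items() if preds[name] == 1)
    share_for_1 /= sum(weights.values())
    # Assumption: a tie (share exactly 0.5) goes to label 0.
    return (1 if share_for_1 > 0.5 else 0), share_for_1


row_a = {"bert": 0, "forest": 0, "nb": 1, "lr": 1, "comment": NO_COMMENT}
row_b = {"bert": 1, "forest": 0, "nb": 1, "lr": 1, "comment": 0}
row_c = {"bert": 0, "forest": 1, "nb": 1, "lr": 1, "comment": 1}
votes = [weighted_vote(row) for row in (row_a, row_b, row_c)]
```

In `row_b`, three of the five voters predict 1, yet the combined weight of the random forest and the comment vote is slightly larger, so the weighted result is 0. This matches the voting_pred shown for that pattern in the result table.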



# Result

Here we show the news data frame with the predictions from all models.

In [85]:
# comment_pred uses 2 for news without comments; show those as NaN
comment_col = [np.nan if p == 2 else p for p in comment_pred]

result_df = news[['title', 'text', 'author', 'topic', 'label']].assign(
    bert_pred=bert_pred,
    forest_pred=forest_pred,
    nb_pred=nb_pred,
    lr_pred=lr_pred,
    comment_pred=comment_col,
    voting_pred=voting_pred,
)

# Flag rows where the weighted vote matches the true label
result_df = result_df.assign(
    voting_correct=result_df['label'] == result_df['voting_pred'])
result_df

Unnamed: 0,title,text,author,topic,label,bert_pred,forest_pred,nb_pred,lr_pred,comment_pred,voting_pred,voting_correct
0,report donald trump leaves 10000 tip 82 bill,editors note story determined hoax click read ...,,13,0,0,0,1,1,,0,True
1,largest great white shark ever recorded,jaws quint notes monstrous prey 25 feet long r...,,9,0,1,0,1,0,,0,True
2,barclays handed biggest bank fine uk history b...,barclays handed biggest uk bank fine history s...,,0,0,1,0,0,0,,0,True
3,10 things school obsolete,3 teachercentered classroom classrooms designe...,,0,0,0,0,1,0,,0,True
4,country singer shot dead nashville bar,country music singer shot dead inside tennesse...,,4,0,0,0,1,0,,0,True
...,...,...,...,...,...,...,...,...,...,...,...,...
545,Painting of woman holding a lotus was created ...,Painting of woman holding a lotus was created ...,Fact Crescendo,9,0,1,0,0,0,,0,True
546,Photos of an Indian Air Force (IAF) jet shot d...,Photos of an Indian Air Force (IAF) jet shot d...,FACTLY,8,0,0,0,0,0,,0,True
547,The number of COVID-19 cases in the University...,The number of COVID-19 cases in the University...,Vera Files,8,0,0,0,0,0,,0,True
548,"Facebook pages Fაქტები (Facts), განსხვავებული...","Facebook pages Fაქტები (Facts), განსხვავებული...",Myth Detector,13,0,0,0,0,0,,0,True


## News without comment

A NaN in the comment_pred column means that the news item has no comments, so the voting weights for news without comments are used to get the voting result.

In [83]:
without_comment = result_df[result_df['comment_pred'].isna()].reset_index(drop=True)
without_comment.head(3)

Unnamed: 0,title,text,author,topic,label,bert_pred,forest_pred,nb_pred,lr_pred,comment_pred,voting_pred,voting_correct
0,report donald trump leaves 10000 tip 82 bill,editors note story determined hoax click read ...,,13,0,0,0,1,1,,0,True
1,largest great white shark ever recorded,jaws quint notes monstrous prey 25 feet long r...,,9,0,1,0,1,0,,0,True
2,barclays handed biggest bank fine uk history b...,barclays handed biggest uk bank fine history s...,,0,0,1,0,0,0,,0,True


515 of the 550 news items in our sample have no comments. Of these 515 items, our voting model predicts 429 correctly and 86 incorrectly. Some examples of correctly and incorrectly predicted news without comments are shown below:

In [62]:
print(len(without_comment),
      len(without_comment[without_comment['voting_correct']==True]),
      len(without_comment[without_comment['voting_correct']==False]))

515 429 86


In [51]:
without_comment[without_comment['voting_correct']==True].head(3)

Unnamed: 0,title,text,author,topic,label,bert_pred,forest_pred,nb_pred,lr_pred,comment_pred,voting_pred,voting_correct
0,report donald trump leaves 10000 tip 82 bill,editors note story determined hoax click read ...,,13,0,0,0,1,1,,0,True
1,largest great white shark ever recorded,jaws quint notes monstrous prey 25 feet long r...,,9,0,1,0,1,0,,0,True
2,barclays handed biggest bank fine uk history b...,barclays handed biggest uk bank fine history s...,,0,0,1,0,0,0,,0,True


In [54]:
without_comment[without_comment['voting_correct']==False].head(3)

Unnamed: 0,title,text,author,topic,label,bert_pred,forest_pred,nb_pred,lr_pred,comment_pred,voting_pred,voting_correct
8,articles leesburg police department,local 5 arrested 75000 drugs seized sting oper...,,8,1,0,0,1,1,,0,False
11,mcdonalds stop serving overweight customers fi...,mcdonalds stop serving overweight customers fi...,,2,0,1,0,1,1,,1,False
12,prayer request,hi everyonei got email wanted pass along every...,,8,1,0,0,1,1,,0,False


## News with comment

The table below shows some news with comments. Here the voting weights for news with comments will be used to get the voting result.

In [79]:
with_comment = result_df[result_df['comment_pred'].notna()].reset_index(drop=True)
with_comment.head(5)

Unnamed: 0,title,text,author,topic,label,bert_pred,forest_pred,nb_pred,lr_pred,comment_pred,voting_pred,voting_correct
0,long could survive coffin buried alive,normal healthy person might 10 minutes hour si...,,9,1,0,1,1,1,1.0,1,True
1,exit polls work whether trust,frustrating thing election day one actually in...,,12,0,0,0,1,1,0.0,0,True
2,refugees find hostility hope soccer field,fugees indeed refugees troubled corners afghan...,,14,0,1,0,1,0,0.0,0,True
3,steal book abbie hoffman,dedicated jerry lefcourt lawyer brother librar...,,9,0,1,0,1,1,0.0,0,True
4,hillary clinton president,2016 campaign brought surface despair rage poo...,,7,0,1,0,1,0,0.0,0,True


35 of the 550 news items in our sample have comments. Of these 35 items, our voting model predicts 33 correctly and 2 incorrectly. Some examples of correctly and incorrectly predicted news with comments are shown below:

In [63]:
print(len(with_comment),
      len(with_comment[with_comment['voting_correct']==True]),
      len(with_comment[with_comment['voting_correct']==False]))

35 33 2
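The counts printed for the two groups can be turned into plain accuracy (correct / total) per group and overall. This is simple arithmetic on the numbers reported in this notebook. The variable names are illustrative.

```python
# Counts reported by the two print cells in this notebook.
groups = {
    "without comments": {"total": 515, "correct": 429},
    "with comments": {"total": 35, "correct": 33},
}

accuracy = {name: g["correct"] / g["total"] for name, g in groups.items()}
overall_total = sum(g["total"] for g in groups.values())
overall_correct = sum(g["correct"] for g in groups.values())
overall_accuracy = overall_correct / overall_total

for name, acc in accuracy.items():
    print(f"{name}: {acc:.1%}")
print(f"overall: {overall_correct}/{overall_total} = {overall_accuracy:.1%}")
```

The overall count of 462 correct predictions agrees with the diagonal of the voting confusion matrix (385 + 77). With only 35 items, the figure for news with comments rests on a small sample.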


In [68]:
with_comment[with_comment['voting_correct']==True].head(3)

Unnamed: 0,title,text,author,topic,label,bert_pred,forest_pred,nb_pred,lr_pred,comment_pred,voting_pred,voting_correct
0,long could survive coffin buried alive,normal healthy person might 10 minutes hour si...,,9,1,0,1,1,1,1.0,1,True
1,exit polls work whether trust,frustrating thing election day one actually in...,,12,0,0,0,1,1,0.0,0,True
2,refugees find hostility hope soccer field,fugees indeed refugees troubled corners afghan...,,14,0,1,0,1,0,0.0,0,True


In [69]:
with_comment[with_comment['voting_correct']==False]

Unnamed: 0,title,text,author,topic,label,bert_pred,forest_pred,nb_pred,lr_pred,comment_pred,voting_pred,voting_correct
5,unintentionally brilliant marketing blairwitchcom,eduardo nchez dan myrick met film school unive...,,14,0,1,0,1,1,1.0,1,False
15,call gerrymandering,looking news trust subscribe free newsletters ...,,12,1,0,0,1,1,0.0,0,False
