# ML Model Training #2 - Fakeddit

Dataset used:

* Fakeddit, a dataset containing more than 1 million Reddit samples based on the assumption of "self-moderation" on Reddit (content that stays online for extended periods of time has gone through multiple levels of "filtering", e.g. user voting and subreddit moderation): https://arxiv.org/abs/1911.03854

## Importing packages and datasets

In [2]:
import pandas as pd
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import ComplementNB
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score
import spacy_sentence_bert
import skops.io as sio

fakeddit_train = pd.read_table("data/fakeddit_train.tsv")
fakeddit_test = pd.read_table("data/fakeddit_test.tsv")
fakeddit_validate = pd.read_table("data/fakeddit_validate.tsv")

## Analyzing the Fakeddit dataset 

Having a first look at the dataset, its columns and content:

In [3]:
fakeddit_train

Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,author,clean_title,created_utc,domain,hasImage,id,image_url,linked_submission_id,num_comments,score,subreddit,title,upvote_ratio,2_way_label,3_way_label,6_way_label
0,0,0,,,Alexithymia,my walgreens offbrand mucinex was engraved wit...,1.551641e+09,i.imgur.com,True,awxhir,https://external-preview.redd.it/WylDbZrnbvZdB...,,2.0,12,mildlyinteresting,My Walgreens offbrand Mucinex was engraved wit...,0.84,1,0,0
1,1,1,155885.0,714550.0,RickSisco,,1.443822e+09,,True,cvm5uy4,http://i.imgur.com/yxrkYT8.jpg,3n7fld,,5,psbattle_artwork,,,0,2,4
2,2,2,,,VIDCAs17,this concerned sink with a tiny hat,1.534727e+09,i.redd.it,True,98pbid,https://preview.redd.it/wsfx0gp0f5h11.jpg?widt...,,2.0,119,pareidolia,This concerned sink with a tiny hat,0.99,0,2,2
3,3,3,,,prometheus1123,hackers leak emails from uae ambassador to us,1.496511e+09,aljazeera.com,True,6f2cy5,https://external-preview.redd.it/6fNhdbc6K1vFA...,,1.0,44,neutralnews,Hackers leak emails from UAE ambassador to US,0.92,1,0,0
4,4,4,282323.0,1228398.0,,,1.378792e+09,,True,cc5cbon,http://i.imgur.com/M8KTWMx.jpg,1lz1q0,,3,psbattle_artwork,,,0,2,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
878213,878213,878213,,,PotatoSlapper,clinton campaign uses noise machine to block r...,1.460096e+09,dailycaller.com,False,4dv6j8,,,1.0,7,nottheonion,Clinton Campaign Uses Noise Machine To Block R...,1.00,1,0,0
878214,878214,878214,,,fatkiddown,a squirrels imprint in wet concrete,1.472570e+09,i.redd.it,True,50bmkc,https://preview.redd.it/mhph05cnveix.jpg?width...,,1.0,9,photoshopbattles,PsBattle: A Squirrel's Imprint In Wet Concrete...,0.90,1,0,0
878215,878215,878215,,,MrChrisOD,he keeps an eye on the burner whilst i cook,1.520807e+09,i.imgur.com,True,83q3ot,https://external-preview.redd.it/afTHQfSIpVY52...,,0.0,2,pareidolia,He keeps an eye on the burner whilst I cook,0.75,0,2,2
878216,878216,878216,,,,video game kingpin super mario is finally capt...,1.565283e+09,i.redd.it,False,cnoe3v,,,0.0,16,fakehistoryporn,"Video Game kingpin ""Super Mario"" is finally ca...",1.00,0,2,2


Combining all of the three datasets (we will split the dataframe into training and test data on our own later):

In [4]:
fakeddit_combined = pd.concat([fakeddit_train, fakeddit_test, fakeddit_validate])
fakeddit_combined

Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,author,clean_title,created_utc,domain,hasImage,id,image_url,linked_submission_id,num_comments,score,subreddit,title,upvote_ratio,2_way_label,3_way_label,6_way_label
0,0,0,,,Alexithymia,my walgreens offbrand mucinex was engraved wit...,1.551641e+09,i.imgur.com,True,awxhir,https://external-preview.redd.it/WylDbZrnbvZdB...,,2.0,12,mildlyinteresting,My Walgreens offbrand Mucinex was engraved wit...,0.84,1,0,0
1,1,1,155885.0,714550.0,RickSisco,,1.443822e+09,,True,cvm5uy4,http://i.imgur.com/yxrkYT8.jpg,3n7fld,,5,psbattle_artwork,,,0,2,4
2,2,2,,,VIDCAs17,this concerned sink with a tiny hat,1.534727e+09,i.redd.it,True,98pbid,https://preview.redd.it/wsfx0gp0f5h11.jpg?widt...,,2.0,119,pareidolia,This concerned sink with a tiny hat,0.99,0,2,2
3,3,3,,,prometheus1123,hackers leak emails from uae ambassador to us,1.496511e+09,aljazeera.com,True,6f2cy5,https://external-preview.redd.it/6fNhdbc6K1vFA...,,1.0,44,neutralnews,Hackers leak emails from UAE ambassador to US,0.92,1,0,0
4,4,4,282323.0,1228398.0,,,1.378792e+09,,True,cc5cbon,http://i.imgur.com/M8KTWMx.jpg,1lz1q0,,3,psbattle_artwork,,,0,2,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
92439,92439,92439,,,dannylenwinn,nicaraguan bank sanctioned by us shuts down,1.556673e+09,dailyjournal.net,True,bjb8g9,https://external-preview.redd.it/rRJn2A584GGhv...,,0.0,2,usnews,Nicaraguan bank sanctioned by US shuts down,1.00,1,0,0
92440,92440,92440,,,gergbeef91,this column and emergency light,1.506690e+09,i.redd.it,True,737mcu,https://preview.redd.it/be71j19dltoz.jpg?width...,,0.0,7,pareidolia,This column and emergency light,1.00,0,2,2
92441,92441,92441,,,,former royal marine selling medals to help fun...,1.560033e+09,theroyalmarinescharity.org.uk,False,bydg0l,,,0.0,5,upliftingnews,Former royal marine selling medals to help fun...,0.86,1,0,0
92442,92442,92442,307338.0,1330751.0,undercoveruser,so proud,1.361106e+09,,True,c8gnd59,http://i.imgur.com/6OGdxDB.jpg,18oixl,,10,psbattle_artwork,So proud!,,0,2,4
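
The tail of the combined frame still ends at row label 92442, which suggests that `pd.concat` kept each split's own row labels, so the same label can occur up to three times. This is harmless here because the index is reset after the subreddit filter further down. As a minimal sketch on hypothetical toy frames (the frame names and rows are illustrative, not taken from Fakeddit), `ignore_index=True` would renumber the rows at concatenation time:

```python
import pandas as pd

# Hypothetical stand-ins for the three Fakeddit splits
train = pd.DataFrame({"id": ["a1", "a2"], "2_way_label": [1, 0]})
test = pd.DataFrame({"id": ["b1"], "2_way_label": [1]})
validate = pd.DataFrame({"id": ["c1"], "2_way_label": [0]})

# Default behaviour: every frame keeps its own row labels
stacked = pd.concat([train, test, validate])

# ignore_index=True assigns a fresh 0..n-1 index instead
renumbered = pd.concat([train, test, validate], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Either variant gives the same rows; only the row labels differ.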


Keeping only the relevant columns:

In [5]:
df = fakeddit_combined[["id", "author", "clean_title", "title", "num_comments", "linked_submission_id", "score", "upvote_ratio", "subreddit", "2_way_label"]]
df

Unnamed: 0,id,author,clean_title,title,num_comments,linked_submission_id,score,upvote_ratio,subreddit,2_way_label
0,awxhir,Alexithymia,my walgreens offbrand mucinex was engraved wit...,My Walgreens offbrand Mucinex was engraved wit...,2.0,,12,0.84,mildlyinteresting,1
1,cvm5uy4,RickSisco,,,,3n7fld,5,,psbattle_artwork,0
2,98pbid,VIDCAs17,this concerned sink with a tiny hat,This concerned sink with a tiny hat,2.0,,119,0.99,pareidolia,0
3,6f2cy5,prometheus1123,hackers leak emails from uae ambassador to us,Hackers leak emails from UAE ambassador to US,1.0,,44,0.92,neutralnews,1
4,cc5cbon,,,,,1lz1q0,3,,psbattle_artwork,0
...,...,...,...,...,...,...,...,...,...,...
92439,bjb8g9,dannylenwinn,nicaraguan bank sanctioned by us shuts down,Nicaraguan bank sanctioned by US shuts down,0.0,,2,1.00,usnews,1
92440,737mcu,gergbeef91,this column and emergency light,This column and emergency light,0.0,,7,1.00,pareidolia,0
92441,bydg0l,,former royal marine selling medals to help fun...,Former royal marine selling medals to help fun...,0.0,,5,0.86,upliftingnews,1
92442,c8gnd59,undercoveruser,so proud,So proud!,,18oixl,10,,psbattle_artwork,0


After removing all image-based content, what remains comes mostly from news and satire subreddits, so the next cell keeps only those subreddits. All of the remaining rows have a value for "upvote_ratio" (an attribute that the Reddit API only provides for submissions) and only null values for "linked_submission_id", so the dataframe now contains only submissions and no comments.

In [6]:
subreddits = ["neutralnews", "nottheonion", "upliftingnews", "satire", "savedyouaclick", "theonion", "usanews", "usnews", "fakefacts", "waterfordwhispersnews"]
df = df.loc[df['subreddit'].isin(subreddits)].reset_index(drop = True)

df = df[["id", "author", "clean_title", "title", "subreddit", "2_way_label"]]

df

# Show only true content
# df[df["2_way_label"] == 1]

# Show only fake content
# df[df["2_way_label"] == 0]

Unnamed: 0,id,author,clean_title,title,subreddit,2_way_label
0,6f2cy5,prometheus1123,hackers leak emails from uae ambassador to us,Hackers leak emails from UAE ambassador to US,neutralnews,1
1,2vkbtj,CrimsonBlue90,bride and groom exchange vows after fatal shoo...,Bride and groom exchange vows after fatal shoo...,nottheonion,1
2,86byl8,nyswagggggggg,rabbi meat from cloned pig could be kosher for...,Rabbi: Meat from cloned pig could be kosher fo...,nottheonion,1
3,1pulau,RetroEyes,billionaire feels guilty about being so rich,Billionaire Feels Guilty About Being So Rich,nottheonion,1
4,4zd1tb,cutiefoodie,english village becomes climate leader by quie...,English Village Becomes Climate Leader by Quie...,upliftingnews,1
...,...,...,...,...,...,...
246380,1tbjsx,Fewcifur,halfbaked burglar plunders familys pavlova,Half-baked burglar plunders family's pavlova,nottheonion,1
246381,2mbsg2,,uranus might be full of surprises,Uranus might be full of surprises,nottheonion,1
246382,az1zlp,Ghdust2,russian bid to influence brexit vote detailed ...,“Russian bid” to influence “Brexit” vote detai...,neutralnews,1
246383,bjb8g9,dannylenwinn,nicaraguan bank sanctioned by us shuts down,Nicaraguan bank sanctioned by US shuts down,usnews,1
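
The claim that only submissions remain could be checked before the extra columns are dropped. A minimal sketch, assuming a toy frame whose four rows mirror values shown in the tables above and a shortened, hypothetical `keep` list:

```python
import pandas as pd

# Toy rows modelled on the samples shown earlier in this notebook
toy = pd.DataFrame({
    "subreddit": ["neutralnews", "psbattle_artwork", "nottheonion", "pareidolia"],
    "upvote_ratio": [0.92, None, 1.00, 0.99],
    "linked_submission_id": [None, "3n7fld", None, None],
})

# Shortened, hypothetical version of the subreddit list
keep = ["neutralnews", "nottheonion"]
filtered = toy.loc[toy["subreddit"].isin(keep)].reset_index(drop=True)

# Submissions carry an upvote ratio and never link to a parent submission
has_ratio = bool(filtered["upvote_ratio"].notna().all())
no_parent = bool(filtered["linked_submission_id"].isna().all())

print(has_ratio, no_parent)
```

If either flag were `False` on the real data, the filtered frame would still contain comments.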


Column explanations:

<b>author:</b> Username of Reddit user <br>
<b>clean_title:</b> Title of the Reddit submission, cleaned by the dataset creators for language-based machine learning <br>
<b>title:</b> Title of the Reddit submission, unedited <br>
<b>subreddit:</b> Name of the subreddit in which the submission was posted <br>
<b>2_way_label:</b> 1 stands for "true" and 0 for "fake". <br>

Checking for null values and showing the unique values:

In [7]:
print('General information:')
print(df.info())

print('\n' + 'Null values:')
print(df.isnull().sum())

for col in df:
    print(col + ": " + str(df[col].unique()))

General information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 246385 entries, 0 to 246384
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   id           246385 non-null  object
 1   author       188613 non-null  object
 2   clean_title  246313 non-null  object
 3   title        246385 non-null  object
 4   subreddit    246385 non-null  object
 5   2_way_label  246385 non-null  int64 
dtypes: int64(1), object(5)
memory usage: 11.3+ MB
None

Null values:
id                 0
author         57772
clean_title       72
title              0
subreddit          0
2_way_label        0
dtype: int64
id: ['6f2cy5' '2vkbtj' '86byl8' ... 'az1zlp' 'bjb8g9' 'bydg0l']
author: ['prometheus1123' 'CrimsonBlue90' 'nyswagggggggg' ... 'dylansucks'
 'Yourmomsbiggay' 'Nuyz']
clean_title: ['hackers leak emails from uae ambassador to us'
 'bride and groom exchange vows after fatal shooting at their wedding'
 'rabbi meat from cloned p

Dropping all rows that have a Null value for the "clean_title" attribute:

In [8]:
# Drop rows whose title is actually missing, then make sure every title is a string
df = df.dropna(subset = ["clean_title"]).copy()
df["clean_title"] = df["clean_title"].astype(str)
df

Unnamed: 0,id,author,clean_title,title,subreddit,2_way_label
0,6f2cy5,prometheus1123,hackers leak emails from uae ambassador to us,Hackers leak emails from UAE ambassador to US,neutralnews,1
1,2vkbtj,CrimsonBlue90,bride and groom exchange vows after fatal shoo...,Bride and groom exchange vows after fatal shoo...,nottheonion,1
2,86byl8,nyswagggggggg,rabbi meat from cloned pig could be kosher for...,Rabbi: Meat from cloned pig could be kosher fo...,nottheonion,1
3,1pulau,RetroEyes,billionaire feels guilty about being so rich,Billionaire Feels Guilty About Being So Rich,nottheonion,1
4,4zd1tb,cutiefoodie,english village becomes climate leader by quie...,English Village Becomes Climate Leader by Quie...,upliftingnews,1
...,...,...,...,...,...,...
246380,1tbjsx,Fewcifur,halfbaked burglar plunders familys pavlova,Half-baked burglar plunders family's pavlova,nottheonion,1
246381,2mbsg2,,uranus might be full of surprises,Uranus might be full of surprises,nottheonion,1
246382,az1zlp,Ghdust2,russian bid to influence brexit vote detailed ...,“Russian bid” to influence “Brexit” vote detai...,neutralnews,1
246383,bjb8g9,dannylenwinn,nicaraguan bank sanctioned by us shuts down,Nicaraguan bank sanctioned by US shuts down,usnews,1
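
Rows with a missing title can be removed either by comparing the stringified column against the text "nan" or with `dropna`. A small sketch on hypothetical rows shows where the two approaches differ: a title that literally reads "nan" is lost by the string comparison but kept by `dropna`.

```python
import pandas as pd

# Hypothetical rows: one missing title and one title that is literally "nan"
toy = pd.DataFrame({
    "id": ["x1", "x2", "x3"],
    "clean_title": ["uranus might be full of surprises", float("nan"), "nan"],
})

# Variant 1: stringify every value, then filter out the text "nan"
as_text = toy["clean_title"].apply(str)
by_string = toy[as_text != "nan"]

# Variant 2: drop only the rows whose title is actually missing
by_dropna = toy.dropna(subset=["clean_title"])

print(by_string["id"].tolist())
print(by_dropna["id"].tolist())
```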


## Training the ML model

Splitting the data into training and testing data:

In [9]:
X = df['clean_title'].tolist()
y = df["2_way_label"].tolist()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 100)
test_data = pd.concat([pd.DataFrame(X_test), pd.DataFrame(y_test, columns=["2_way_label"])], axis=1)
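
The split above shuffles the rows without regard to the label. Because "fake" titles are the minority (roughly 12% of the test rows, judging by the confusion matrices further down), one option that this notebook does not use is the `stratify` argument of `train_test_split`, which keeps the label proportions the same in both parts. A minimal sketch on hypothetical, similarly imbalanced toy data:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 "fake" (0) and 90 "true" (1) titles
X = [f"title {i}" for i in range(100)]
y = [0] * 10 + [1] * 90

# stratify=y keeps the 10/90 label ratio in both the training and the test part
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=100, stratify=y
)

train_counts = Counter(y_train)
test_counts = Counter(y_test)
print(train_counts, test_counts)
```

With a dataset of this size an unstratified split is usually close to the overall ratio anyway, so the difference matters mainly for small or heavily skewed data.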

Training the model using a scikit-learn pipeline for preprocessing and classification:

In [10]:
# Naïve Bayes
nb_model = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', ComplementNB()),
])

nb_model.fit(X_train, y_train)
test_data["naive_bayes"] = nb_model.predict(X_test)

y_prediction_new = test_data["naive_bayes"]
y_actual = test_data["2_way_label"]

print(f'Accuracy: {round(accuracy_score(y_actual, y_prediction_new) * 100, 2)}%')
print(f'Precision: {round(precision_score(y_actual, y_prediction_new, pos_label = 0) * 100, 2)}%')
print(f'Recall: {round(recall_score(y_actual, y_prediction_new, pos_label = 0) * 100, 2)}%')
print(f'F1 Score: {round(f1_score(y_actual, y_prediction_new, pos_label = 0) * 100, 2)}%')

confusion_matrix_new = pd.crosstab(y_prediction_new, y_actual)
confusion_matrix_new

Accuracy: 88.25%
Precision: 51.64%
Recall: 56.43%
F1 Score: 53.93%


predicted naive_bayes / actual 2_way_label,0,1
0,3387,3172
1,2615,40089
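
Since `pd.crosstab(y_prediction_new, y_actual)` puts the predictions in the rows and the actual labels in the columns, the printed metrics can be recomputed by hand from the four counts. A short sketch, with the counts copied from the matrix above and label 0 ("fake") treated as the positive class (the variable names are illustrative):

```python
# Counts from the Naive Bayes confusion matrix
# (first index: predicted label, second index: actual label)
pred0_actual0, pred0_actual1 = 3387, 3172
pred1_actual0, pred1_actual1 = 2615, 40089

total = pred0_actual0 + pred0_actual1 + pred1_actual0 + pred1_actual1

# Label 0 is the positive class, matching pos_label = 0 in the cell above
accuracy = (pred0_actual0 + pred1_actual1) / total
precision = pred0_actual0 / (pred0_actual0 + pred0_actual1)  # share of "fake" predictions that are right
recall = pred0_actual0 / (pred0_actual0 + pred1_actual0)     # share of actual "fake" titles that are found
f1 = 2 * precision * recall / (precision + recall)

for name, value in [("Accuracy", accuracy), ("Precision", precision),
                    ("Recall", recall), ("F1 Score", f1)]:
    print(f"{name}: {round(value * 100, 2)}%")
```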


In [12]:
# Neural Network
nn_model = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MLPClassifier()),
])

nn_model.fit(X_train, y_train)
test_data["neural_network"] = nn_model.predict(X_test)

y_prediction_new = test_data["neural_network"]
y_actual = test_data["2_way_label"]

print(f'Accuracy: {round(accuracy_score(y_actual, y_prediction_new) * 100, 2)}%')
print(f'Precision: {round(precision_score(y_actual, y_prediction_new, pos_label = 0) * 100, 2)}%')
print(f'Recall: {round(recall_score(y_actual, y_prediction_new, pos_label = 0) * 100, 2)}%')
print(f'F1 Score: {round(f1_score(y_actual, y_prediction_new, pos_label = 0) * 100, 2)}%')

confusion_matrix_new = pd.crosstab(y_prediction_new, y_actual)
confusion_matrix_new

Accuracy: 91.33%
Precision: 68.57%
Recall: 53.28%
F1 Score: 59.97%


predicted neural_network / actual 2_way_label,0,1
0,3198,1466
1,2804,41795


This model detects misinformation better than the ones trained in the Truthseeker notebook. Note that the scores are computed with label 0 ("fake") as the target, not label 1 ("true"). Setting the target to 1 would vastly improve the scores and make it easy to show "90% results" for every metric, but it would not change the fact that while "true" content is classified correctly most of the time, "fake" content is much harder to detect.
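
The effect of the target label can be reproduced from the neural network confusion matrix alone. The sketch below recomputes precision, recall and F1 for both choices of the target label and also derives the accuracy of a hypothetical baseline that always predicts "true"; the helper `scores` and the `counts` dictionary are illustrative names, not part of the notebook.

```python
# Counts from the neural network confusion matrix, keyed as (predicted, actual)
counts = {(0, 0): 3198, (0, 1): 1466, (1, 0): 2804, (1, 1): 41795}


def scores(pos):
    """Return (precision, recall, f1) with `pos` treated as the positive label."""
    neg = 1 - pos
    tp = counts[(pos, pos)]  # predicted pos, actually pos
    fp = counts[(pos, neg)]  # predicted pos, actually neg
    fn = counts[(neg, pos)]  # predicted neg, actually pos
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


fake_scores = scores(0)
true_scores = scores(1)

# Accuracy of always predicting label 1 ("true") on this test split
total = sum(counts.values())
majority_accuracy = (counts[(0, 1)] + counts[(1, 1)]) / total

print("target 0:", [round(s * 100, 2) for s in fake_scores])
print("target 1:", [round(s * 100, 2) for s in true_scores])
print("always 'true' accuracy:", round(majority_accuracy * 100, 2))
```

With label 1 as the target, all three metrics land above 90%, and the always-"true" baseline already reaches close to 88% accuracy on this split, which is why accuracy alone says little about how well "fake" titles are detected.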

## Persisting the trained ML models

In [13]:
# File name to be used
# file_name = "models/fakeddit_nb_model.skops"
file_name = "models/fakeddit_nn_model.skops"

# Persist the model to a file
# sio.dump(obj = nb_model, file = file_name)
sio.dump(obj = nn_model, file = file_name)

# Load the model from the file, trusting only the types that skops reports for it
# (review this list first if the file comes from an untrusted source)
untrusted_types = sio.get_untrusted_types(file = file_name)
trained_model = sio.load(file = file_name, trusted = untrusted_types)
print(trained_model)

# Shortcut, only where the installed skops version still accepts it:
# trained_model = sio.load(file = file_name, trusted = True)

Pipeline(steps=[('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('clf', MLPClassifier())])


## Results:

Naïve Bayes (< 5 seconds training time): <br>
<img src="results/fakeddit_nb.png" alt="Naïve Bayes Results" style="width: 200px;"/>

Neural Network (1 hour training time): <br>
<img src="results/fakeddit_nn.png" alt="Neural Network Results" style="width: 200px;"/>