# ML Model Training #1 - Truthseeker

Dataset used:

* Truthseeker, a dataset containing more than 180,000 labeled tweets: https://www.techrxiv.org/articles/preprint/TruthSeeker_The_Largest_Social_Media_Ground-Truth_Dataset_for_Real_Fake_Content/22795130

## Importing packages and datasets

In [125]:
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score
import spacy_sentence_bert
import skops.io as sio
truthseeker_labeled = pd.read_csv("data/truthseeker_labeled.csv")
truthseeker_language = pd.read_csv("data/truthseeker_language.csv")

## Analyzing the Truthseeker dataset 

Having a first look at the dataset, its columns and content:

In [126]:
truthseeker_labeled

Unnamed: 0.1,Unnamed: 0,author,statement,target,BinaryNumTarget,manual_keywords,tweet,5_label_majority_answer,3_label_majority_answer
0,0,D.L. Davis,End of eviction moratorium means millions of A...,True,1.0,"Americans, eviction moratorium",@POTUS Biden Blunders - 6 Month Update\n\nInfl...,Mostly Agree,Agree
1,1,D.L. Davis,End of eviction moratorium means millions of A...,True,1.0,"Americans, eviction moratorium",@S0SickRick @Stairmaster_ @6d6f636869 Not as m...,NO MAJORITY,Agree
2,2,D.L. Davis,End of eviction moratorium means millions of A...,True,1.0,"Americans, eviction moratorium",THE SUPREME COURT is siding with super rich pr...,Agree,Agree
3,3,D.L. Davis,End of eviction moratorium means millions of A...,True,1.0,"Americans, eviction moratorium",@POTUS Biden Blunders\n\nBroken campaign promi...,Mostly Agree,Agree
4,4,D.L. Davis,End of eviction moratorium means millions of A...,True,1.0,"Americans, eviction moratorium",@OhComfy I agree. The confluence of events rig...,Agree,Agree
...,...,...,...,...,...,...,...,...,...
134193,134193,Tom Kertscher,Joe Bidens great-grandfather Joseph J. Biden w...,False,0.0,"Biden, great grandfather, slave owner",Joe Biden's family owned African slaves....\n\...,Mostly Agree,Agree
134194,134194,Tom Kertscher,Joe Bidens great-grandfather Joseph J. Biden w...,False,0.0,"Biden, great grandfather, slave owner","Joe Bidens great, great grandfather was a slav...",Agree,Agree
134195,134195,Tom Kertscher,Joe Bidens great-grandfather Joseph J. Biden w...,False,0.0,"Biden, great grandfather, slave owner","@ChevyChaseToGo ""Joe Bidens great-grandfather ...",Mostly Agree,Agree
134196,134196,Tom Kertscher,Joe Bidens great-grandfather Joseph J. Biden w...,False,0.0,"Biden, great grandfather, slave owner",@JoeBiden Facts are Bidens VP Kamala Harris Gr...,NO MAJORITY,Agree


Column explanations (copied from https://www.unb.ca/cic/datasets/truthseeker-2023.html):

<b>author:</b> Author of news article (headline) <br>
<b>statement:</b> News article headline <br>
<b>target:</b> Truth value of the article headline <br>
<b>BinaryNumTarget:</b> Binary representation of the truth value (1 = True / 0 = False) <br>
<b>manual_keywords:</b> Manually created keywords used to search Twitter <br>
<b>tweet:</b> Twitter post <br>
<b>5_label_majority_answer:</b> Majority answer using 5 labels (Agree, Mostly Agree, Disagree, Mostly Disagree, Unrelated), NO MAJORITY means that there was no consensus <br>
<b>3_label_majority_answer:</b> Majority answer using 3 labels (Agree, Disagree, Unrelated), NO MAJORITY means that there was no consensus <br>

Checking for null values:

In [127]:
print('General information:')
print(truthseeker_labeled.info())
print('\n' + 'Null values:')
print(truthseeker_labeled.isnull().sum())

General information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 134198 entries, 0 to 134197
Data columns (total 9 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   Unnamed: 0               134198 non-null  int64  
 1   author                   134198 non-null  object 
 2   statement                134198 non-null  object 
 3   target                   134198 non-null  bool   
 4   BinaryNumTarget          134198 non-null  float64
 5   manual_keywords          134198 non-null  object 
 6   tweet                    134198 non-null  object 
 7   5_label_majority_answer  134198 non-null  object 
 8   3_label_majority_answer  134198 non-null  object 
dtypes: bool(1), float64(1), int64(1), object(6)
memory usage: 8.3+ MB
None

Null values:
Unnamed: 0                 0
author                     0
statement                  0
target                     0
BinaryNumTarget            0
manual_keywords            0
tweet                      0
5_label_majority_answer    0
3_label_majority_answer    0
dtype: int64


Checking the number of unique values per column:

In [128]:
columns = ["author", "statement", "target", "BinaryNumTarget", "manual_keywords",
           "tweet", "5_label_majority_answer", "3_label_majority_answer"]
# Print the number of distinct values in each column, one per line
for column in columns:
    print(truthseeker_labeled[column].nunique())

161
1058
2
2
1056
134198
5
2


All 134,198 tweets are unique, so no duplicates need to be removed.
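
Uniqueness can also be checked directly instead of by comparing counts. A small sketch on a toy frame (the sample rows are made up for illustration), using `Series.is_unique` and `Series.duplicated`:

```python
import pandas as pd

# Toy frame standing in for truthseeker_labeled (hypothetical rows)
toy = pd.DataFrame({"tweet": ["first tweet", "second tweet", "first tweet"]})

# True only if no value occurs more than once
print(toy["tweet"].is_unique)  # False

# Number of rows that repeat an earlier tweet
print(toy["tweet"].duplicated().sum())  # 1

# Dropping the repeats keeps the first occurrence of each tweet
deduplicated = toy.drop_duplicates(subset="tweet")
print(len(deduplicated))  # 2
```

On the real data, `truthseeker_labeled["tweet"].is_unique` would be expected to return `True`, given the counts above.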

We label each tweet as "true" (1) or "fake" (0) based on the truth value of the target (= the linked article) and on whether the tweet agrees with it, following Table 6 (Label Conversion Truth Table) of the paper:

In [129]:
def label_tweet(row):
    # Tweet agrees with true news
    if row["target"] == 1 and row["3_label_majority_answer"] == "Agree":
        return 1
    # Tweet disagrees with fake news
    elif row["target"] == 0 and row["3_label_majority_answer"] == "Disagree":
        return 1
    # Tweet disagrees with true news
    elif row["target"] == 1 and row["3_label_majority_answer"] == "Disagree":
        return 0
    # Tweet agrees with fake news
    elif row["target"] == 0 and row["3_label_majority_answer"] == "Agree":
        return 0
    
truthseeker_labeled['tweet_label'] = truthseeker_labeled.apply(label_tweet, axis=1)
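
The four branches of `label_tweet` reduce to a single comparison: a tweet is labeled true exactly when its agreement matches the truth value of the statement. A minimal sketch of that equivalence, assuming plain Python values instead of DataFrame rows; the function name `label_from_agreement` is hypothetical, and returning `None` for any other answer mirrors the implicit behavior of `label_tweet`:

```python
from typing import Optional


def label_from_agreement(target: bool, answer: str) -> Optional[int]:
    """Return 1 (true) or 0 (fake), or None if the answer is neither Agree nor Disagree."""
    if answer not in ("Agree", "Disagree"):
        return None
    agrees = answer == "Agree"
    # Agreeing with true news or disagreeing with fake news -> 1, otherwise 0
    return int(bool(target) == agrees)


# Print the full truth table
for target in (True, False):
    for answer in ("Agree", "Disagree"):
        print(target, answer, label_from_agreement(target, answer))
```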

## Training the ML model

Resources used:

* https://newscatcherapi.com/blog/how-to-classify-text-with-python-transformers-and-scikit-learn
* https://github.com/MartinoMensio/spacy-sentence-bert/
* https://spacy.io/models

In [131]:
# load one of the models listed at https://github.com/MartinoMensio/spacy-sentence-bert/
nlp = spacy_sentence_bert.load_model('en_stsb_distilbert_base')

In [132]:
# statement_vectors = truthseeker_labeled['statement'].apply(lambda x: nlp(x).vector)
tweet_vectors = truthseeker_labeled['tweet'].apply(lambda x: nlp(x).vector)

In [133]:
# truthseeker_labeled['statement_vectors'] = statement_vectors
truthseeker_labeled['tweet_vectors'] = tweet_vectors
truthseeker_labeled

Unnamed: 0.1,Unnamed: 0,author,statement,target,BinaryNumTarget,manual_keywords,tweet,5_label_majority_answer,3_label_majority_answer,tweet_label,tweet_vectors
0,0,D.L. Davis,End of eviction moratorium means millions of A...,True,1.0,"Americans, eviction moratorium",@POTUS Biden Blunders - 6 Month Update\n\nInfl...,Mostly Agree,Agree,1,"[-0.0017305756, 0.35551363, 0.2279567, -0.4057..."
1,1,D.L. Davis,End of eviction moratorium means millions of A...,True,1.0,"Americans, eviction moratorium",@S0SickRick @Stairmaster_ @6d6f636869 Not as m...,NO MAJORITY,Agree,1,"[0.5489139, 0.195778, 0.42961097, -0.63502806,..."
2,2,D.L. Davis,End of eviction moratorium means millions of A...,True,1.0,"Americans, eviction moratorium",THE SUPREME COURT is siding with super rich pr...,Agree,Agree,1,"[0.25383365, 0.027936008, 0.2982609, -0.627730..."
3,3,D.L. Davis,End of eviction moratorium means millions of A...,True,1.0,"Americans, eviction moratorium",@POTUS Biden Blunders\n\nBroken campaign promi...,Mostly Agree,Agree,1,"[-0.09393569, 0.18510589, -0.2616876, -0.30429..."
4,4,D.L. Davis,End of eviction moratorium means millions of A...,True,1.0,"Americans, eviction moratorium",@OhComfy I agree. The confluence of events rig...,Agree,Agree,1,"[0.14379501, -0.02429921, -0.22568427, -0.2979..."
...,...,...,...,...,...,...,...,...,...,...,...
134193,134193,Tom Kertscher,Joe Bidens great-grandfather Joseph J. Biden w...,False,0.0,"Biden, great grandfather, slave owner",Joe Biden's family owned African slaves....\n\...,Mostly Agree,Agree,0,"[0.55703735, -0.37823865, -0.8882882, -0.54941..."
134194,134194,Tom Kertscher,Joe Bidens great-grandfather Joseph J. Biden w...,False,0.0,"Biden, great grandfather, slave owner","Joe Bidens great, great grandfather was a slav...",Agree,Agree,0,"[1.1357609, -0.7327957, -0.52137864, 0.2452978..."
134195,134195,Tom Kertscher,Joe Bidens great-grandfather Joseph J. Biden w...,False,0.0,"Biden, great grandfather, slave owner","@ChevyChaseToGo ""Joe Bidens great-grandfather ...",Mostly Agree,Agree,0,"[0.4829451, -0.25695887, -0.834214, -0.4299301..."
134196,134196,Tom Kertscher,Joe Bidens great-grandfather Joseph J. Biden w...,False,0.0,"Biden, great grandfather, slave owner",@JoeBiden Facts are Bidens VP Kamala Harris Gr...,NO MAJORITY,Agree,0,"[-0.32603276, 0.12232426, -0.2881443, -0.39833..."


Splitting the data into a training set (80%) and a test set (20%):

In [134]:
# X = truthseeker_labeled['statement_vectors'].tolist()
X = truthseeker_labeled['tweet_vectors'].tolist()
y = truthseeker_labeled["tweet_label"].tolist()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=44)
test_data = pd.concat([pd.DataFrame(X_test), pd.DataFrame(y_test, columns=["tweet_label"])], axis=1)
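
As a sanity check on the split sizes: a 20% test split of the 134,198 rows should yield 26,840 test rows, which matches the total number of entries in each confusion matrix further down. A small arithmetic sketch, assuming that scikit-learn rounds a fractional `test_size` up and assigns the remaining rows to the training set:

```python
import math

n_samples = 134198   # rows reported by truthseeker_labeled.info()
test_size = 0.20

# Assumption: the test set size is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 107358 26840

# Sum of the four cells of the logistic regression confusion matrix
confusion_total = 10612 + 2342 + 2455 + 11431
print(confusion_total == n_test)  # True
```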

Training the models:

In [135]:
# Logistic regression
lr_model = LogisticRegression(solver='lbfgs')
lr_model.fit(X_train, y_train)
test_data["logistic_regression"] = lr_model.predict(X_test)

# Support Vector Machine
svm_model = svm.LinearSVC(dual = False)
svm_model.fit(X_train, y_train)
test_data["support_vector_machine"] = svm_model.predict(X_test)

# Random Forest
rf_model = RandomForestClassifier(n_estimators=100)
rf_model.fit(X_train, y_train)
test_data["random_forest"] = rf_model.predict(X_test)

# Neural Network (Multi-layer Perceptron)
nn_model = MLPClassifier(hidden_layer_sizes=(4, 4, 4, 4, 3), max_iter = 500, solver='adam', random_state=44)
nn_model.fit(X_train, y_train)
test_data["neural_network"] = nn_model.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
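
The warning above comes from the `lbfgs` solver of the logistic regression reaching its iteration limit. A minimal sketch of the two remedies the message suggests (scaling the features and raising `max_iter`), run on synthetic data from `make_classification` as a stand-in for the tweet vectors. The value `max_iter=1000` is an assumption rather than a tuned setting, and the scores reported in this notebook were not produced with this pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the embedding vectors (not the Truthseeker data)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=44)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=44)

# Scale the features, then fit with a higher iteration limit
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)

# n_iter_ holds the number of iterations the solver actually used
print(scaled_lr[-1].n_iter_)
print(scaled_lr.score(X_te, y_te))
```

Wrapping the scaler and the classifier in one pipeline ensures the scaling statistics are learned from the training split only and reapplied unchanged at prediction time.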


Calculating the scores (in each confusion matrix, rows are the predicted labels and columns are the actual labels):

In [137]:
y_prediction_lr = test_data["logistic_regression"]
y_actual = test_data["tweet_label"]

print(f'Accuracy: {round(accuracy_score(y_actual, y_prediction_lr) * 100, 2)}%')
print(f'Precision: {round(precision_score(y_actual, y_prediction_lr, pos_label=1) * 100, 2)}%')
print(f'Recall: {round(recall_score(y_actual, y_prediction_lr, pos_label=1) * 100, 2)}%')
print(f'F1 Score: {round(f1_score(y_actual, y_prediction_lr, pos_label=1) * 100, 2)}%')

confusion_matrix_lr = pd.crosstab(y_prediction_lr, y_actual)
confusion_matrix_lr

Accuracy: 82.13%
Precision: 82.32%
Recall: 83.0%
F1 Score: 82.66%


predicted (logistic_regression) \ actual (tweet_label),0,1
0,10612,2342
1,2455,11431
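
The four scores can be reproduced from the confusion matrix alone. In the crosstab above, rows are the predicted labels and columns are the actual labels, so with 1 as the positive class the cells are TN = 10612, FN = 2342, FP = 2455 and TP = 11431. A short sketch of the arithmetic (the variable names are illustrative):

```python
# Cells of the logistic regression confusion matrix (rows: predicted, columns: actual)
tn, fn = 10612, 2342
fp, tp = 2455, 11431

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted 1s that are actually 1
recall = tp / (tp + fn)      # share of actual 1s that were predicted as 1
f1 = 2 * precision * recall / (precision + recall)

print(f'Accuracy: {round(accuracy * 100, 2)}%')    # Accuracy: 82.13%
print(f'Precision: {round(precision * 100, 2)}%')  # Precision: 82.32%
print(f'Recall: {round(recall * 100, 2)}%')        # Recall: 83.0%
print(f'F1 Score: {round(f1 * 100, 2)}%')          # F1 Score: 82.66%
```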


In [138]:
y_prediction_svm = test_data["support_vector_machine"]
y_actual = test_data["tweet_label"]

print(f'Accuracy: {round(accuracy_score(y_actual, y_prediction_svm) * 100, 2)}%')
print(f'Precision: {round(precision_score(y_actual, y_prediction_svm, pos_label=1) * 100, 2)}%')
print(f'Recall: {round(recall_score(y_actual, y_prediction_svm, pos_label=1) * 100, 2)}%')
print(f'F1 Score: {round(f1_score(y_actual, y_prediction_svm, pos_label=1) * 100, 2)}%')

confusion_matrix_svm = pd.crosstab(y_prediction_svm, y_actual)
confusion_matrix_svm

Accuracy: 82.27%
Precision: 82.52%
Recall: 83.05%
F1 Score: 82.78%


predicted (support_vector_machine) \ actual (tweet_label),0,1
0,10644,2335
1,2423,11438


In [139]:
y_prediction_rf = test_data["random_forest"]
y_actual = test_data["tweet_label"]

print(f'Accuracy: {round(accuracy_score(y_actual, y_prediction_rf) * 100, 2)}%')
print(f'Precision: {round(precision_score(y_actual, y_prediction_rf, pos_label=1) * 100, 2)}%')
print(f'Recall: {round(recall_score(y_actual, y_prediction_rf, pos_label=1) * 100, 2)}%')
print(f'F1 Score: {round(f1_score(y_actual, y_prediction_rf, pos_label=1) * 100, 2)}%')

confusion_matrix_rf = pd.crosstab(y_prediction_rf, y_actual)
confusion_matrix_rf

Accuracy: 82.25%
Precision: 81.79%
Recall: 84.16%
F1 Score: 82.96%


predicted (random_forest) \ actual (tweet_label),0,1
0,10486,2182
1,2581,11591


In [140]:
y_prediction_nn = test_data["neural_network"]
y_actual = test_data["tweet_label"]

print(f'Accuracy: {round(accuracy_score(y_actual, y_prediction_nn) * 100, 2)}%')
print(f'Precision: {round(precision_score(y_actual, y_prediction_nn, pos_label=1) * 100, 2)}%')
print(f'Recall: {round(recall_score(y_actual, y_prediction_nn, pos_label=1) * 100, 2)}%')
print(f'F1 Score: {round(f1_score(y_actual, y_prediction_nn, pos_label=1) * 100, 2)}%')

confusion_matrix_nn = pd.crosstab(y_prediction_nn, y_actual)
confusion_matrix_nn

Accuracy: 85.5%
Precision: 85.56%
Recall: 86.32%
F1 Score: 85.94%


predicted (neural_network) \ actual (tweet_label),0,1
0,11060,1884
1,2007,11889
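
The four scoring cells above differ only in the prediction column they read. A possible refactoring, sketched here on toy labels rather than on `test_data`, is a small helper that scores every model in one loop; the name `score_models` and the toy inputs are hypothetical:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def score_models(y_true, predictions):
    """Return rounded percentage scores for each model's predictions."""
    scores = {}
    for name, y_pred in predictions.items():
        scores[name] = {
            "accuracy": round(accuracy_score(y_true, y_pred) * 100, 2),
            "precision": round(precision_score(y_true, y_pred, pos_label=1) * 100, 2),
            "recall": round(recall_score(y_true, y_pred, pos_label=1) * 100, 2),
            "f1": round(f1_score(y_true, y_pred, pos_label=1) * 100, 2),
        }
    return scores


# Toy data standing in for test_data["tweet_label"] and the prediction columns
y_true = [1, 0, 1, 1, 0, 0]
toy_predictions = {
    "model_a": [1, 0, 1, 0, 0, 1],
    "model_b": [1, 0, 1, 1, 0, 0],
}

results = score_models(y_true, toy_predictions)
for name, result in results.items():
    print(name, result)
```

In this notebook, the helper would presumably be called with `test_data["tweet_label"]` and a dictionary built from the four prediction columns of `test_data`.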


## Persisting the trained ML model

In [145]:
# File name to be used
file_name = "models/tweet_nn_model.skops"

# Persist the model to a file
sio.dump(obj = nn_model, file = file_name)

# Load the model from the file
trained_model = sio.load(file = file_name, trusted = True)
print(trained_model)

# Optional: for security reasons, first check the data types before loading the file
# data_types = sio.get_untrusted_types(file = file_name)
# print(data_types)
# trained_model = sio.load(file = file_name, trusted = data_types)
# print(trained_model)

LinearSVC(dual=False)


## Results for "statement" column:

Statement - Logistic Regression: <br>
<img src="results/statement_lr.png" alt="Logistic Regression Results" style="width: 200px;"/>

Statement - Support Vector Machine: <br>
<img src="results/statement_svm.png" alt="Support Vector Machine Results" style="width: 200px;"/>

Statement - Random Forest: <br>
<img src="results/statement_rf.png" alt="Random Forest Results" style="width: 200px;"/>

Statement - Neural Network: <br>
<img src="results/statement_nn.png" alt="Neural Network Results" style="width: 200px;"/>

## Results for "tweet" column:

Tweets - Logistic Regression: <br>
<img src="results/tweets_lr.png" alt="Logistic Regression Results" style="width: 200px;"/>

Tweets - Support Vector Machine: <br>
<img src="results/tweets_svm.png" alt="Support Vector Machine Results" style="width: 200px;"/>

Tweets - Random Forest: <br>
<img src="results/tweets_rf.png" alt="Random Forest Results" style="width: 200px;"/>

Tweets - Neural Network: <br>
<img src="results/tweets_nn.png" alt="Neural Network Results" style="width: 200px;"/>