# Master's thesis: Automated truth discovery
- Tutorial of the models implemented in this thesis
- Author: Jan Koci
- May 2023

Table of Contents:

- [Datasets Used in this Thesis](#datasets-used-in-this-thesis)
  - [NELA Dataset](#nela-dataset)
  - [Merged Dataset](#merged-dataset)
  - [FNI Dataset](#fni-dataset)
- [Models for Classification](#models-for-classification)
  - [Baseline Model](#baseline-model)
  - [BERT Model](#bert-model)
- [Qualitative Analysis](#qualitative-analysis)
- [Credibility of Sources](#credibility-of-sources)
- [Challenge Set](#challenge-set)

# Datasets Used in this Thesis
This thesis uses three different datasets:
- __The NELA dataset__: preprocessed version of the NELA-GT-2021 dataset created by extending its source labels to all articles (used for training of both classifiers)
- __The Merged dataset__: created by merging three fake news datasets (used only for testing)
- __The FNI dataset__: created by the author of this thesis by manually selecting 23 reliable and 23 unreliable articles

## NELA Dataset

The NELA dataset is used to train both classifiers implemented in this thesis. It is split into training, validation and test sets.
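
The label extension mentioned in the dataset overview can be pictured as follows. This is only a minimal sketch, not the preprocessing code of the thesis: the `extend_source_labels` helper, the `source_labels` mapping and the toy articles are hypothetical, and it assumes that 0 marks a reliable source and 1 an unreliable one.

```python
def extend_source_labels(articles, source_labels):
    """Give every article the label of its source.

    Articles whose source has no label are dropped (an assumption of this sketch).
    """
    labeled = []
    for article in articles:
        label = source_labels.get(article["source"])
        if label is None:
            continue  # unknown source: no label to propagate
        labeled.append({**article, "label": label})
    return labeled


# Hypothetical source-level labels and articles, for illustration only.
source_labels = {"source_a": 0, "source_b": 1}
articles = [
    {"source": "source_a", "title": "Article one"},
    {"source": "source_b", "title": "Article two"},
    {"source": "source_c", "title": "Article from an unlabeled source"},
]

labeled = extend_source_labels(articles, source_labels)
print([(a["source"], a["label"]) for a in labeled])
```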

In [1]:
import pandas as pd
from data_loaders import BayesLoader

train_df = BayesLoader.load_train_data('../data/nela_dataset/nela_train.gzip') 
test_df = BayesLoader.load_test_data('../data/nela_dataset/nela_test.gzip')
validation_df = BayesLoader.load_test_data('../data/nela_dataset/nela_validation.gzip')

train_df.shape, test_df.shape, validation_df.shape



((616798, 12), (196440, 12), (157197, 12))

In [2]:
train_df.iloc[0]

id                usatoday--2021-02-11--Justice Department drops...
date                                                     2021-02-11
source                                                     usatoday
title             Justice Department drops lawsuit against Melan...
content           Former first lady Melania Trump 's ex-best fri...
author                                      Maria Puente, USA TODAY
url               https://feeds.feedblitz.com/~/644023128/0/usat...
published                           Thu, 11 Feb 2021 23:44:17 +0000
published_utc                                            1613105057
collection_utc                                           1613101770
label                                                             0
text              Justice Department drops lawsuit Melania Trump...
Name: 0, dtype: object

## Merged Dataset

The merged dataset is only used for testing. It was created by merging three fake news datasets.

In [3]:
from data_loaders import BayesLoader
merged = BayesLoader.load_merged_dataset('../data/merged_dataset/merged.gzip', compression='gzip')
merged.shape



(27518, 3)

In [4]:
merged.label.value_counts()

0.0    13769
1.0    13749
Name: label, dtype: int64

In [5]:
merged.head()

Unnamed: 0,title,text,label
0,House Dem Aide: We Didn’t Even See Comey’s Let...,House Dem Aide: Didn’t Even See Comey’s Letter...,1.0
1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...","FLYNN: Hillary Clinton, Big Woman Campus Breit...",0.0
2,Why the Truth Might Get You Fired,Truth Might Get Fired Truth Might Get Fired Oc...,1.0
3,15 Civilians Killed In Single US Airstrike Hav...,15 Civilians Killed Single US Airstrike Identi...,1.0
4,Iranian woman jailed for fictional unpublished...,Iranian woman jailed fictional unpublished sto...,1.0


## FNI Dataset

The FNI dataset was created in this thesis. It is used for testing and analysis. A detailed description of the articles can be found in the text of this thesis and in the file __../data/fni_dataset/fni_analysis.ipynb__.

In [6]:
fni = pd.read_csv('../data/fni_dataset/fni.tsv', sep='\t')
fni.fillna('Unknown', inplace=True)
fni.shape

(46, 9)

In [7]:
fni.label.value_counts()

fake    23
true    23
Name: label, dtype: int64

In [8]:
fni.iloc[13]

title         FBI Released a Document Proving Adolf Hitler a...
text          The FBI.gov website reveals the government kne...
label                                                      fake
url           http://web.archive.org/web/20221209182855/http...
source                                           weareanonymous
topic                                                conspiracy
mbfc_bias                              conspiracy-pseudoscience
factuality                                                  0.0
date                                                 2016-05-05
Name: 13, dtype: object

# Models for Classification

Two classifiers are implemented in this thesis. The first combines TF-IDF features with a Multinomial Naive Bayes classifier. The second uses the BERT transformer architecture.

## Baseline Model
The baseline model is implemented in the MnbClassifier class.
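
To illustrate the idea behind the baseline (TF-IDF features fed into Multinomial Naive Bayes), here is a minimal standard-library sketch. It is not the `MnbClassifier` implementation: the `TinyTfidfNB` class, the whitespace tokenizer, the smoothed IDF formula and the toy texts are assumptions made for illustration, and the label convention (0 = reliable, 1 = unreliable) is assumed as well.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplistic whitespace tokenizer (assumption; real preprocessing differs).
    return text.lower().split()


class TinyTfidfNB:
    """Toy Multinomial Naive Bayes over TF-IDF weights."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing

    def _tfidf(self, tokens):
        counts = Counter(t for t in tokens if t in self.idf)
        return {w: c * self.idf[w] for w, c in counts.items()}

    def fit(self, texts, labels):
        docs = [tokenize(t) for t in texts]
        n_docs = len(docs)
        doc_freq = Counter(w for d in docs for w in set(d))
        # Smoothed inverse document frequency.
        self.idf = {w: math.log((1 + n_docs) / (1 + df)) + 1.0
                    for w, df in doc_freq.items()}
        self.classes = sorted(set(labels))
        # Sum the TF-IDF weight of every word per class.
        class_weight = {c: Counter() for c in self.classes}
        for d, y in zip(docs, labels):
            for w, v in self._tfidf(d).items():
                class_weight[y][w] += v
        label_counts = Counter(labels)
        self.log_prior = {c: math.log(label_counts[c] / n_docs)
                          for c in self.classes}
        self.log_lik = {}
        for c in self.classes:
            total = sum(class_weight[c].values()) + self.alpha * len(self.idf)
            self.log_lik[c] = {
                w: math.log((class_weight[c].get(w, 0.0) + self.alpha) / total)
                for w in self.idf
            }
        return self

    def predict_proba(self, text):
        feats = self._tfidf(tokenize(text))
        scores = {c: self.log_prior[c]
                  + sum(v * self.log_lik[c][w] for w, v in feats.items())
                  for c in self.classes}
        # Normalize the log scores into probabilities in a stable way.
        top = max(scores.values())
        exp = {c: math.exp(s - top) for c, s in scores.items()}
        norm = sum(exp.values())
        return {c: e / norm for c, e in exp.items()}


texts = [
    "officials confirmed the report today",
    "the report cites official data",
    "shocking secret they hide",
    "shocking truth exposed secret",
]
labels = [0, 0, 1, 1]

model = TinyTfidfNB().fit(texts, labels)
probs = model.predict_proba("shocking secret exposed")
print(probs)
```

Unlike a production pipeline, this sketch skips stop-word removal, n-grams and vector normalization; it only shows how TF-IDF weights can replace raw counts in the Naive Bayes likelihood.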

In [9]:
from bayes_model import MnbClassifier

baseline = MnbClassifier(ngram_range=(1, 1))
baseline.fit(train_df)

Training complete


In [10]:
i = 43
predicted_class = baseline.predict_text(test_df.iloc[i].text)
predicted_probabilities = baseline.predict_proba_text(test_df.iloc[i].text)

print(f'Actual class: {test_df.iloc[i].label}')
print(f'Predicted class: {predicted_class}')
print(f'Predicted probabilities: {predicted_probabilities}')

Actual class: 1
Predicted class: [1]
Predicted probabilities: [[0.19466289 0.80533711]]


In [11]:
baseline.test_report(test_df)

              precision    recall  f1-score   support

           0       0.79      0.80      0.80     98337
           1       0.80      0.78      0.79     98103

    accuracy                           0.79    196440
   macro avg       0.79      0.79      0.79    196440
weighted avg       0.79      0.79      0.79    196440



In [12]:
import nela_helpers as nh
nh.test_report_merged(baseline)



              precision    recall  f1-score   support

           0       0.72      0.65      0.68     13769
           1       0.68      0.74      0.71     13749

    accuracy                           0.70     27518
   macro avg       0.70      0.70      0.69     27518
weighted avg       0.70      0.70      0.69     27518



In [13]:
nh.test_report_fni_dataset(baseline)

              precision    recall  f1-score   support

           0       0.61      0.61      0.61        23
           1       0.61      0.61      0.61        23

    accuracy                           0.61        46
   macro avg       0.61      0.61      0.61        46
weighted avg       0.61      0.61      0.61        46



## BERT Model
The BERT model is the BERT transformer (base-uncased) fine-tuned on the NELA dataset. The implementation uses the Hugging Face framework.

In [14]:
from bert_model import BertClassifier
import bert_helpers as bh

bert = BertClassifier(model_name='../data/bert_model/checkpoint-98176')



In [15]:
report = bh.test_report_fni_dataset(bert)

  0%|          | 0/1 [00:00<?, ?ba/s]
The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: date, factuality, topic, mbfc_bias, url, title, text, source. If date, factuality, topic, mbfc_bias, url, title, text, source are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 46
  Batch size = 8
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
100%|██████████| 6/6 [00:28<00:00,  4.72s/it]

              precision    recall  f1-score   support

           0       0.90      0.78      0.84        23
           1       0.81      0.91      0.86        23

    accuracy                           0.85        46
   macro avg       0.85      0.85      0.85        46
weighted avg       0.85      0.85      0.85        46






# Qualitative Analysis
To analyze which textual cues the classifiers rely on, this thesis implements two interpretability methods. For the baseline, the importance of each word __x__ is computed as the ratio __P(x|reliable)__ / __P(x|unreliable)__. For the BERT model, the method uses __Integrated Gradients__.
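
The word-importance ratio for the baseline can be sketched as below. This is an illustrative approximation, not the `BaselineInterpreter` code: the `word_log_ratios` helper and the toy documents are hypothetical, raw word counts with Laplace smoothing stand in for the fitted model's likelihoods, and the ratio is reported as a logarithm so that positive values point to the reliable class and negative values to the unreliable class.

```python
import math
from collections import Counter


def word_log_ratios(reliable_docs, unreliable_docs, alpha=1.0):
    """Return log(P(x|reliable) / P(x|unreliable)) for every word x."""
    rel = Counter(w for d in reliable_docs for w in d.lower().split())
    unrel = Counter(w for d in unreliable_docs for w in d.lower().split())
    vocab = set(rel) | set(unrel)
    # Laplace smoothing keeps the ratio finite for words seen in one class only.
    rel_total = sum(rel.values()) + alpha * len(vocab)
    unrel_total = sum(unrel.values()) + alpha * len(vocab)
    return {
        w: math.log((rel[w] + alpha) / rel_total)
        - math.log((unrel[w] + alpha) / unrel_total)
        for w in vocab
    }


# Toy documents, for illustration only.
reliable_docs = ["officials confirmed the report", "the report cites data"]
unreliable_docs = ["shocking secret exposed", "shocking truth they hide"]

ratios = word_log_ratios(reliable_docs, unreliable_docs)
for word in sorted(ratios, key=ratios.get):
    print(f"{word:10s} {ratios[word]:+.2f}")
```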

In [16]:
from interpret_baseline import BaselineInterpreter
from interpret_bert import BertInterpreter

text = "The Dark Side of Football: The Shocking Story of Marco Rodriguez The Dark Side of football: The Shocking Story of Marco Rodriguez  Marco Rodriguez, one of the most talented football players of his generation, has made headlines once again - this time for all the wrong reasons. The striker, who was once regarded as a role model for young athletes, has been accused of engaging in a string of shocking and immoral behavior both on and off the field.  Sources close to the athlete claim that he has been involved in numerous scandals, including incidents of drug abuse, domestic violence, and even assault. In one particularly shocking incident, Rodriguez was caught on camera punching a fellow player during a heated match.  Despite these allegations, Rodriguez has managed to maintain his status as one of the top football players in the world. His incredible skill on the field has earned him a legion of loyal fans, who continue to support him even in the face of controversy.  However, critics argue that Rodriguez's behavior sets a dangerous precedent for young athletes and sends a message that it's acceptable to engage in immoral and unethical behavior as long as you are successful on the field.  As the controversy surrounding Rodriguez continues to grow, many are calling for a re-examination of the way football players are idolized and celebrated in the media. The dark side of football may be ugly, but it's time to confront it head-on and take steps to ensure that players like Rodriguez are held accountable for their actions."

baseline_interpreter = BaselineInterpreter(baseline)
bert_interpreter = BertInterpreter(bert.model, bert.tokenizer)

In [17]:
baseline_interpreter.vizualize_interpretation(text, true_class=1)

True Label,Predicted Label,Attribution Label,Attribution Score,Word Importance
1.0,0 (0.74),"The Dark Side of Football: The Shocking Story of Marco Rodriguez The Dark Side of football: The Shocking Story of Marco Rodriguez Marco Rodriguez, one of the most talented football players of his generation, has made headlines once again - this time for all the wrong reasons. The striker, who was once regarded as a role model for young athletes, has been accused of engaging in a string of shocking and immoral behavior both on and off the field. Sources close to the athlete claim that he has been involved in numerous scandals, including incidents of drug abuse, domestic violence, and even assault. In one particularly shocking incident, Rodriguez was caught on camera punching a fellow player during a heated match. Despite these allegations, Rodriguez has managed to maintain his status as one of the top football players in the world. His incredible skill on the field has earned him a legion of loyal fans, who continue to support him even in the face of controversy. However, critics argue that Rodriguez's behavior sets a dangerous precedent for young athletes and sends a message that it's acceptable to engage in immoral and unethical behavior as long as you are successful on the field. As the controversy surrounding Rodriguez continues to grow, many are calling for a re-examination of the way football players are idolized and celebrated in the media. The dark side of football may be ugly, but it's time to confront it head-on and take steps to ensure that players like Rodriguez are held accountable for their actions.",-13.59,"the dark side of football: the shocking story of marco rodriguez the dark side of football: the shocking story of marco rodriguez marco rodriguez, one of the most talented football players of his generation, has made headlines once again - this time for all the wrong reasons. 
the striker, who was once regarded as a role model for young athletes, has been accused of engaging in a string of shocking and immoral behavior both on and off the field. sources close to the athlete claim that he has been involved in numerous scandals, including incidents of drug abuse, domestic violence, and even assault. in one particularly shocking incident, rodriguez was caught on camera punching a fellow player during a heated match. despite these allegations, rodriguez has managed to maintain his status as one of the top football players in the world. his incredible skill on the field has earned him a legion of loyal fans, who continue to support him even in the face of controversy. however, critics argue that rodriguez's behavior sets a dangerous precedent for young athletes and sends a message that it's acceptable to engage in immoral and unethical behavior as long as you are successful on the field. as the controversy surrounding rodriguez continues to grow, many are calling for a re-examination of the way football players are idolized and celebrated in the media. the dark side of football may be ugly, but it's time to confront it head-on and take steps to ensure that players like rodriguez are held accountable for their actions."


In [18]:
bert_interpreter.interpret_text(text, true_class=1)

True Label,Predicted Label,Attribution Label,Attribution Score,Word Importance
1.0,1 (0.99),"The Dark Side of Football: The Shocking Story of Marco Rodriguez The Dark Side of football: The Shocking Story of Marco Rodriguez Marco Rodriguez, one of the most talented football players of his generation, has made headlines once again - this time for all the wrong reasons. The striker, who was once regarded as a role model for young athletes, has been accused of engaging in a string of shocking and immoral behavior both on and off the field. Sources close to the athlete claim that he has been involved in numerous scandals, including incidents of drug abuse, domestic violence, and even assault. In one particularly shocking incident, Rodriguez was caught on camera punching a fellow player during a heated match. Despite these allegations, Rodriguez has managed to maintain his status as one of the top football players in the world. His incredible skill on the field has earned him a legion of loyal fans, who continue to support him even in the face of controversy. However, critics argue that Rodriguez's behavior sets a dangerous precedent for young athletes and sends a message that it's acceptable to engage in immoral and unethical behavior as long as you are successful on the field. As the controversy surrounding Rodriguez continues to grow, many are calling for a re-examination of the way football players are idolized and celebrated in the media. The dark side of football may be ugly, but it's time to confront it head-on and take steps to ensure that players like Rodriguez are held accountable for their actions.",0.3,"[CLS] the dark side of football : the shocking story of marco rodriguez the dark side of football : the shocking story of marco rodriguez marco rodriguez , one of the most talented football players of his generation , has made headlines once again - this time for all the wrong reasons . 
the striker , who was once regarded as a role model for young athletes , has been accused of engaging in a string of shocking and im ##moral behavior both on and off the field . sources close to the athlete claim that he has been involved in numerous scandals , including incidents of drug abuse , domestic violence , and even assault . in one particularly shocking incident , rodriguez was caught on camera punching a fellow player during a heated match . despite these allegations , rodriguez has managed to maintain his status as one of the top football players in the world . his incredible skill on the field has earned him a legion of loyal fans , who continue to support him even in the face of controversy . however , critics argue that rodriguez ' s behavior sets a dangerous precedent for young athletes and sends a message that it ' s acceptable to engage in im ##moral and une ##thic ##al behavior as long as you are successful on the field . as the controversy surrounding rodriguez continues to grow , many are calling for a re - examination of the way football players are idol ##ized and celebrated in the media . the dark side of football may be ugly , but it ' s time to confront it head - on and take steps to ensure that players like rodriguez are held accountable for their actions . [SEP]"


Report written to data.html file


# Credibility of Sources
The classifiers were also used to predict the credibility of media sources. The results were compared with reference values obtained from a graph-neighbourhood exploitation method kindly provided by Sergio Burdisso (sergio.burdisso@idiap.ch). Two methods were used to predict the reliability of sources: the first averages the predicted credibility of a source's articles; the second builds embeddings from article credibilities and trains a logistic regression on them.
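
The averaging method can be sketched as follows. This is a hedged illustration rather than the thesis code: the `average_source_credibility` helper and the toy inputs are hypothetical, and it assumes that each article comes with a predicted probability of being reliable.

```python
from collections import defaultdict


def average_source_credibility(article_sources, reliable_probs):
    """Average the per-article reliability probabilities of each source."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for source, prob in zip(article_sources, reliable_probs):
        sums[source] += prob
        counts[source] += 1
    return {source: sums[source] / counts[source] for source in sums}


# Hypothetical predictions for three articles from two sources.
article_sources = ["source_a", "source_a", "source_b"]
reliable_probs = [0.9, 0.7, 0.2]

scores = average_source_credibility(article_sources, reliable_probs)
print(scores)
```

With the predictions stored in a pandas DataFrame, the same aggregation could be written as a `groupby` on the source column followed by `mean()`.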

In [19]:
from scipy.stats import kendalltau
from scipy.spatial.distance import jensenshannon

embeddings = pd.read_pickle('../data/embeddings_30.pkl')
common_sources = pd.read_pickle('../data/common_sources.pkl')

temp = common_sources.logreg_22.values
tau = kendalltau(common_sources.sergio.values, temp)[0]
js_distance = jensenshannon(common_sources.sergio.values, temp)
print(f'Embedding method (k=22): Tau: {tau}, JS: {js_distance}')

temp = common_sources.logreg_4.values
tau = kendalltau(common_sources.sergio.values, temp)[0]
js_distance = jensenshannon(common_sources.sergio.values, temp)
print(f'Embedding method (k=4): Tau: {tau}, JS: {js_distance}')

temp = common_sources.avg_prob.values
tau = kendalltau(common_sources.sergio.values, temp)[0]
js_distance = jensenshannon(common_sources.sergio.values, temp)
print(f'Average reliability method: Tau: {tau}, JS: {js_distance}')

Embedding method (k=22): Tau: 0.628395646395461, JS: 0.24584616883834953
Embedding method (k=4): Tau: 0.5590718559925969, JS: 0.20174563513629926
Average reliability method: Tau: 0.5139124798911968, JS: 0.20552843921377978
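
The two metrics reported above can be made concrete with a small standard-library sketch. The functions `kendall_tau_b` and `js_distance` are illustrative re-implementations, not the SciPy routines: the first follows the tau-b definition, which handles ties, and the second normalizes both vectors to sum to 1 and uses the natural logarithm, which is an assumption about how the score vectors are treated as distributions.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Rank correlation in [-1, 1]; 1 means the two orderings agree."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: ignored by tau-b
        elif dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x)
                      * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence; 0 means identical."""
    p_sum, q_sum = sum(p), sum(q)
    p = [v / p_sum for v in p]
    q = [v / q_sum for v in q]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, ref):
        # Terms with a_i == 0 contribute nothing to the divergence.
        return sum(ai * math.log(ai / ri) for ai, ri in zip(a, ref) if ai > 0)

    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))


# Toy credibility scores for four sources, for illustration only.
reference = [0.9, 0.7, 0.4, 0.1]
predicted = [0.8, 0.75, 0.2, 0.3]

print(kendall_tau_b(reference, predicted))
print(js_distance(reference, predicted))
```

A higher tau means the predicted ordering of sources agrees more closely with the reference ordering, while a lower JS distance means the two score vectors are closer when viewed as distributions.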


# Challenge Set
Another output of this thesis is a challenge set consisting of articles that the BERT classifier failed to classify correctly.
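
Conceptually, such a set is obtained by keeping only the articles whose predicted label differs from the true label. The sketch below illustrates this under stated assumptions: the `select_misclassified` helper, the `predicted` field and the toy rows are hypothetical and do not describe how the actual challenge set file was produced.

```python
def select_misclassified(rows, predictions):
    """Keep the rows whose predicted label differs from the true label."""
    return [
        {**row, "predicted": pred}
        for row, pred in zip(rows, predictions)
        if pred != row["label"]
    ]


# Toy articles and predictions, for illustration only.
rows = [
    {"title": "Article one", "label": 0},
    {"title": "Article two", "label": 1},
    {"title": "Article three", "label": 1},
]
predictions = [0, 0, 1]

challenge_rows = select_misclassified(rows, predictions)
print(challenge_rows)
```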

In [20]:
challenge_set = pd.read_csv('../data/challenge_set/challenge_set.gzip', compression='gzip')
challenge_set.shape

(10875, 13)

In [21]:
challenge_set.label.value_counts()

0    5629
1    5246
Name: label, dtype: int64