## Import Data

In [1]:
import pandas as pd
train_data = pd.read_csv('https://raw.githubusercontent.com/Tariq60/LIAR-PLUS/refs/heads/master/dataset/tsv/train2.tsv', sep = "\t")
train_data.head(1)

Unnamed: 0,0,2635.json,false,Says the Annies List political group supports third-trimester abortions on demand.,abortion,dwayne-bohac,State representative,Texas,republican,0.0,1.0,0.0.1,0.0.2,0.0.3,a mailer,"That's a premise that he fails to back up. Annie's List makes no bones about being comfortable with candidates who oppose further restrictions on late-term abortions. Then again, this year its backing two House candidates who voted for more limits."
0,1.0,10540.json,half-true,When did the decline of coal start? It started...,"energy,history,job-accomplishments",scott-surovell,State delegate,Virginia,democrat,0.0,0.0,1.0,1.0,0.0,a floor speech.,"Surovell said the decline of coal ""started whe..."


In [2]:
# Column 1: the ID of the statement ([ID].json).
# Column 2: the label.
# Column 3: the statement.
# Column 4: the subject(s).
# Column 5: the speaker.
# Column 6: the speaker's job title.
# Column 7: the state info.
# Column 8: the party affiliation.
# Columns 9-13: the total credit history count, including the current statement.
# 9: barely true counts.
# 10: false counts.
# 11: half true counts.
# 12: mostly true counts.
# 13: pants on fire counts.
# Column 14: the context (venue / location of the speech or statement).
# Column 15: the extracted justification

In [3]:
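# read_csv treated the first record as the header row, so keep it to restore it as data
first_data = train_data.columns

In [4]:
# append the recovered record as the last row, then assign descriptive column names
# note: pandas renamed repeated header values (e.g. '0.0.1'), so the count fields of
# this restored row are not the original values
train_data.loc[train_data.shape[0]] = first_data
train_data.columns = ['index', 'ID of statement', 'label', 'statement', 'subject', 'speaker', "speaker's job title", 'state info',
                      'party affiliation', 'barely true counts', 'false counts', 'half true counts', 'mostly true counts',
                      'pants on fire counts', 'context', 'extracted justification']
train_data = train_data.drop(columns=['index'])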
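
The header fix-up above can also be avoided at load time. The following is a minimal sketch, assuming the file has no header row and sixteen tab-separated fields (an index column followed by the fifteen documented columns). The two-row inline sample and the `COLUMN_NAMES` constant are made up for illustration and are not part of the dataset.

```python
import io
import pandas as pd

# Column names as assigned in the notebook; COLUMN_NAMES is an illustrative constant.
COLUMN_NAMES = ['index', 'ID of statement', 'label', 'statement', 'subject', 'speaker',
                "speaker's job title", 'state info', 'party affiliation',
                'barely true counts', 'false counts', 'half true counts',
                'mostly true counts', 'pants on fire counts', 'context',
                'extracted justification']

# Tiny made-up stand-in for the TSV file (no header row, tab-separated).
sample_tsv = (
    "0\t1.json\tfalse\tStatement A\ttaxes\tspeaker-a\tSenator\tOhio\tdemocrat\t"
    "0\t1\t0\t0\t0\ta mailer\tJustification A\n"
    "1\t2.json\thalf-true\tStatement B\tenergy,history\tspeaker-b\tGovernor\tTexas\trepublican\t"
    "0\t0\t1\t1\t0\ta speech\tJustification B\n"
)

# header=None keeps the first record as data; names= labels the columns directly.
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", header=None, names=COLUMN_NAMES)
sample_df = sample_df.drop(columns=['index'])
print(sample_df.shape)
```

Loading this way keeps the first record in its original position and leaves the count columns numeric, since no header strings are mixed into them.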

## Data Cleaning

In [6]:
# label -> clean :)
# subject -> maybe change str to list
# speaker -> clean :)
# speaker's job title -> need to normalize letter case (lowercase everything)
# State info -> clean :)
# party affiliation -> clean :)
# barely true counts -> not gonna modify
# false counts -> not gonna modify
# half true counts -> not gonna modify
# mostly true counts -> not gonna modify
# pants on fire counts -> not gonna modify
# context -> assume clean :)
# extracted -> str to list

In [7]:
# subject
train_data['subject'] = train_data['subject'].str.split(",")

# speaker's job title
train_data["speaker's job title"] = train_data["speaker's job title"].str.lower()

# extracted
train_data["extracted justification"] = train_data["extracted justification"].str.split(" ")

train_data.head(3)

Unnamed: 0,ID of statement,label,statement,subject,speaker,speaker's job title,state info,party affiliation,barely true counts,false counts,half true counts,mostly true counts,pants on fire counts,context,extracted justification
0,10540.json,half-true,When did the decline of coal start? It started...,"[energy, history, job-accomplishments]",scott-surovell,state delegate,Virginia,democrat,0.0,0.0,1.0,1.0,0.0,a floor speech.,"[Surovell, said, the, decline, of, coal, ""star..."
1,324.json,mostly-true,"Hillary Clinton agrees with John McCain ""by vo...",[foreign-policy],barack-obama,president,Illinois,democrat,70.0,71.0,160.0,163.0,9.0,Denver,"[Obama, said, he, would, have, voted, against,..."
2,1123.json,false,Health care reform legislation is likely to ma...,[health-care],blog-posting,,,none,7.0,19.0,3.0,5.0,44.0,a news release,"[The, release, may, have, a, point, that, Miku..."


## Factuality Factor

* Social Credibility: People are more likely to perceive a source as credible if others perceive it as credible.
* Stance Detection: What is the political or issue stance of the article or text corpus? How does that affect the veracity of the article or text?
* Stance Features: Indicate the user's opinion about the news, such as supporting or denying it.
* Title vs Body: Does the title agree with, discuss, negate, or bear no relation to the body of the text?

#### Social Credibility
* Source History: Delve into the past of the post or source to understand its track record
* Endorsement Checks: A post or source that has been endorsed or validated by external reputable entities gains credibility
* Revision Analysis: Check if the content has been revised, updated, or retracted in the past
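
One way to approximate the Source History idea with this dataset is to summarize each speaker's credit-history counts. The sketch below is only an illustration under assumptions: the two speakers and their counts are made up, and the "unreliable share" measure (barely true, false, and pants on fire ratings divided by all past ratings) is a hypothetical choice, not a method taken from the notebook.

```python
import pandas as pd

# Made-up credit-history counts for two hypothetical speakers.
history = pd.DataFrame({
    "speaker": ["speaker-a", "speaker-b"],
    "barely true counts": [1, 10],
    "false counts": [1, 20],
    "half true counts": [4, 5],
    "mostly true counts": [4, 5],
    "pants on fire counts": [0, 10],
})

count_cols = ["barely true counts", "false counts", "half true counts",
              "mostly true counts", "pants on fire counts"]
# Hypothetical grouping of ratings treated as signs of a poor track record.
unreliable_cols = ["barely true counts", "false counts", "pants on fire counts"]

total = history[count_cols].sum(axis=1)
# Share of past statements that fall in the "unreliable" ratings.
history["unreliable share"] = history[unreliable_cols].sum(axis=1) / total
print(history[["speaker", "unreliable share"]])
```

A speaker with no recorded history yields a 0/0 division and therefore NaN, which would need separate handling before such a feature is used.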

* Basic info:
    * 10243 total rows in df
    * 4346 unique context values
    * top five contexts (with counts)
        * a news release: 241
        * an interview: 229
        * a press release: 223
        * a speech: 214
        * a TV ad: 180
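
Summary figures like these can be computed with a few pandas calls. The following is a minimal sketch on a made-up `contexts` series standing in for `train_data['context']`; the values are illustrative only.

```python
import pandas as pd

# Made-up stand-in for train_data['context'].
contexts = pd.Series(["a speech", "a TV ad", "a speech", "an interview", "a speech", "a TV ad"])

n_rows = contexts.shape[0]                       # total number of rows
n_unique = contexts.nunique()                    # number of distinct context values
top_contexts = contexts.value_counts().head(5)   # most frequent contexts, in descending order

print(n_rows, n_unique)
print(top_contexts)
```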

In [51]:
# all imports
import numpy as np
import scipy
import sklearn
import keras

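from sklearn.preprocessing import OneHotEncoder
# note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; newer releases need sparse_output=False here
ohe = OneHotEncoder(handle_unknown = "ignore", sparse = False)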

# citation: https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.OneHotEncoder.html

In [52]:
train_data['party affiliation'].unique()

array(['democrat', 'none', 'republican', 'organization', 'independent',
       'columnist', 'activist', 'talk-show-host', 'libertarian',
       'newsmaker', 'journalist', 'labor-leader', 'state-official',
       'business-leader', 'education-official', 'tea-party-member', nan,
       'green', 'liberal-party-canada', 'government-body', 'Moderate',
       'democratic-farmer-labor', 'ocean-state-tea-party-action',
       'constitution-party'], dtype=object)

In [54]:
# splitting the training data set 8:2 (sequential split, no shuffling)
split_at = int(train_data.shape[0] * 0.8)
small_training = train_data[:split_at]
small_testing = train_data[split_at:]

# dropping rows with a null in any column
small_training = small_training.dropna()
small_testing = small_testing.dropna()

# keeping only the label and the three categorical features
feature_cols = ["speaker", "context", "party affiliation"]
small_training = small_training[["label"] + feature_cols]
small_testing = small_testing[["label"] + feature_cols]

# separate encoders for labels and features; each is fit on the training split only
# and reused on the testing split so that both splits share the same columns
# (the default sparse output is densified with .toarray())
label_ohe = OneHotEncoder(handle_unknown="ignore")
feature_ohe = OneHotEncoder(handle_unknown="ignore")

# ohe (training data)
# OneHotEncoder sorts its categories, so column names come from categories_, not .unique()
small_training_ohe_label_df = pd.DataFrame(
    label_ohe.fit_transform(small_training[["label"]]).toarray(),
    columns=label_ohe.categories_[0])
small_training_ohe_speaker_context_party_df = pd.DataFrame(
    feature_ohe.fit_transform(small_training[feature_cols]).toarray(),
    columns=np.concatenate(feature_ohe.categories_))

# ohe (testing data): transform only, categories unseen in training become all-zero rows
small_testing_ohe_label_df = pd.DataFrame(
    label_ohe.transform(small_testing[["label"]]).toarray(),
    columns=label_ohe.categories_[0])
small_testing_ohe_speaker_context_party_df = pd.DataFrame(
    feature_ohe.transform(small_testing[feature_cols]).toarray(),
    columns=np.concatenate(feature_ohe.categories_))
# citation: https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.OneHotEncoder.html
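
One-hot encodings for a held-out split are usually produced by an encoder that was fitted on the training split, so that both splits end up with the same columns. The following is a minimal sketch of that pattern with made-up toy values; the variable names are illustrative and the data is not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up toy data: a single categorical feature column per split.
train_feature = np.array(["democrat", "republican", "none", "democrat"]).reshape(-1, 1)
test_feature = np.array(["republican", "green"]).reshape(-1, 1)  # "green" is unseen in training

encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train_feature).toarray()  # categories are learned here only
test_encoded = encoder.transform(test_feature).toarray()        # and reused for the test split

print(encoder.categories_[0])  # categories are stored in sorted order
print(test_encoded)            # the unseen "green" row is encoded as all zeros
```

With `handle_unknown="ignore"`, a category that never appeared in training does not raise an error; its row is simply all zeros, and the column count stays identical across splits.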

In [61]:
# algorithm/method #1: small fully connected network with sigmoid activations
from keras import layers

nn = keras.Sequential([
    # shape must be a tuple, hence the trailing comma
    keras.Input(shape=(small_training_ohe_speaker_context_party_df.shape[1],)),
    layers.Dense(10, activation='sigmoid'),
    layers.Dense(10, activation='sigmoid'),
    layers.Dense(10, activation='sigmoid'),
    layers.Dense(6),  # one linear output per label class
])

# MSE against the one-hot labels; metrics is passed as a list
nn.compile(optimizer = keras.optimizers.Adam(learning_rate = 0.01), loss = 'mse', metrics = ["accuracy"])

nn.fit(small_training_ohe_speaker_context_party_df, small_training_ohe_label_df, batch_size = 10, epochs = 100)

# citation: https://keras.io/guides/sequential_model/

Epoch 1/100
...
Epoch 78

<keras.src.callbacks.History at 0x7fa1338cd550>

### Findings: with both context and speaker as features, the accuracy of a simple neural network with sigmoid activations rose from ~32% to 66%
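
Because the output layer has six linear units trained against one-hot labels, a predicted class can be read off as the index of the largest output. The following is a minimal sketch of an accuracy computation along those lines using NumPy. The arrays are made-up stand-ins for model outputs and one-hot labels, and `onehot_accuracy` is a hypothetical helper, not part of Keras.

```python
import numpy as np

def onehot_accuracy(predicted_scores, onehot_labels):
    """Fraction of rows whose largest predicted score matches the one-hot label."""
    predicted_class = np.argmax(predicted_scores, axis=1)
    true_class = np.argmax(onehot_labels, axis=1)
    return float(np.mean(predicted_class == true_class))

# Made-up scores for four samples and three classes.
scores = np.array([[0.1, 0.7, 0.2],
                   [0.6, 0.3, 0.1],
                   [0.2, 0.2, 0.9],
                   [0.5, 0.4, 0.1]])
labels = np.array([[0, 1, 0],
                   [1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]])

print(onehot_accuracy(scores, labels))
```

Applying the same helper to predictions on a held-out split would give a test accuracy to compare against the figure reported during fitting.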

In [43]:
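# algorithm/method #2: same architecture with relu activations
# note: small_training_ohe_speaker_context_df is not built in the cells shown here;
# it is presumably the speaker + context encoding from an earlier run of the preprocessing cell
nn = keras.Sequential([
    # shape must be a tuple, hence the trailing comma
    keras.Input(shape=(small_training_ohe_speaker_context_df.shape[1],)),
    layers.Dense(10, activation='relu'),
    layers.Dense(10, activation='relu'),
    layers.Dense(10, activation='relu'),
    layers.Dense(6),  # one linear output per label class
])

nn.compile(optimizer = keras.optimizers.Adam(learning_rate = 0.01), loss = 'mse', metrics = ["accuracy"])

nn.fit(small_training_ohe_speaker_context_df, small_training_ohe_label_df, batch_size = 10, epochs = 50)

# citation: https://keras.io/guides/sequential_model/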

Epoch 1/50
...
Epoch 50/50


<keras.src.callbacks.History at 0x7fa100225130>

### Findings: with both context and speaker as features, the accuracy of a simple neural network with relu activations dropped from ~32% to 20%

In [45]:
# algorithm/method #3: decision tree with 5-fold cross-validation
# (same speaker + context features as method #2)
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
dt = DecisionTreeClassifier(criterion = "entropy", splitter = "best")

# one accuracy score per fold
accuracy = cross_val_score(dt, small_training_ohe_speaker_context_df, small_training_ohe_label_df, cv=5)

accuracy

# citation: https://scikit-learn.org/dev/modules/generated/sklearn.tree.DecisionTreeClassifier.html
# https://scikit-learn.org/stable/modules/cross_validation.html

array([0.18650422, 0.20075047, 0.19043152, 0.18761726, 0.18949343])

### Findings: with both context and speaker as features, the accuracy of a simple decision tree rose slightly, from ~6% to ~19%
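
Passing one-hot label columns to `DecisionTreeClassifier` makes scikit-learn treat the task as multi-output, and the default score then counts a sample as correct only when every output column matches. A common alternative is to collapse the labels to a single class index per sample first. The following is a minimal sketch on made-up random data (not the LIAR-PLUS features), so the scores it prints carry no meaning beyond showing the mechanics.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Made-up data: 60 samples, 4 binary features, 3 classes stored as one-hot rows.
features = rng.integers(0, 2, size=(60, 4))
class_ids = rng.integers(0, 3, size=60)
onehot_labels = np.eye(3)[class_ids]

# Collapse each one-hot row back to a single class index.
flat_labels = np.argmax(onehot_labels, axis=1)

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
scores = cross_val_score(tree, features, flat_labels, cv=5)  # one accuracy value per fold
print(scores.shape)
```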

#### Stance Detection (Political Affiliation)
* Language Inspection: Scrutinize content for language indicative of political inclination
* Disclosure Checks: Ensure any affiliations of the author or source are openly disclosed
* Fact-checker Comparison: Contrast content claims against neutral, non-partisan fact-checkers

#### Stance Features (Political Bias)
* Bias Detection: Determine if content distorts facts to favor a specific political entity
* Tonal Analysis: Review tone and language for signs of political partiality
* Event Portrayal: Ensure political events are represented without consistent bias

#### Title vs Body (Don't see anything similar to this in the PDF...)
