# Fake News Detection Baseline (LIAR Dataset)

This notebook walks through a baseline workflow for fake news detection on the LIAR dataset, a six-class benchmark of short political statements labeled by truthfulness. It covers classical and transformer baselines and records the metrics of each run for reference.

## 1. Environment Setup

Install required packages for classical and transformer baselines.

In [4]:
# Install required packages
!pip install transformers datasets accelerate adapter-transformers scikit-learn pandas numpy nltk joblib lightgbm



## 2. Load and Preprocess LIAR Dataset

Load the LIAR train, validation, and test splits, then clean each statement: lowercase it, strip URLs, collapse whitespace, remove English stopwords, and lemmatize the remaining tokens.

In [5]:
import pandas as pd
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

columns = [
    "id", "label", "statement", "subject", "speaker", "job_title", "state", "party_affiliation", "barely_true_counts", "false_counts", "half_true_counts", "mostly_true_counts", "pants_on_fire_counts", "context"
]
train_df = pd.read_csv("liar_dataset/train.tsv", sep='\t', names=columns, header=None)
valid_df = pd.read_csv("liar_dataset/valid.tsv", sep='\t', names=columns, header=None)
test_df  = pd.read_csv("liar_dataset/test.tsv",  sep='\t', names=columns, header=None)

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

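def preprocess_text(text):
    """Lowercase, strip URLs, collapse whitespace, drop stopwords, lemmatize."""
    text = str(text).lower()
    text = re.sub(r'http\S+', '', text)       # remove http(s) URLs
    text = re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace
    # Whitespace tokenization: punctuation stays attached to tokens (e.g. "start?")
    tokens = text.split()
    # Drop English stopwords, then lemmatize with WordNet's default (noun) POS
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return ' '.join(tokens)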

train_df['clean_text'] = train_df['statement'].apply(preprocess_text)
valid_df['clean_text'] = valid_df['statement'].apply(preprocess_text)
test_df['clean_text']  = test_df['statement'].apply(preprocess_text)

print("Train shape:", train_df.shape)
print("Valid shape:", valid_df.shape)
print("Test shape:", test_df.shape)
train_df[['label', 'clean_text']].head()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kanaa/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\kanaa/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Train shape: (10240, 15)
Valid shape: (1284, 15)
Test shape: (1267, 15)


|   | label | clean_text |
|---|-------|------------|
| 0 | false | say annies list political group support third-... |
| 1 | half-true | decline coal start? started natural gas took s... |
| 2 | mostly-true | hillary clinton agrees john mccain "by voting ... |
| 3 | false | health care reform legislation likely mandate ... |
| 4 | half-true | economic turnaround started end term. |
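
The sample rows above contain tokens such as `start?` and `term.`, because splitting on whitespace leaves punctuation attached to words. A side effect is that a stopword followed by punctuation is no longer recognized as a stopword. The sketch below contrasts whitespace splitting with a regex tokenizer. It is an illustration only: `DEMO_STOP_WORDS`, `split_tokens`, and `regex_tokens` are hypothetical names, the stopword set is a tiny stand-in for NLTK's list so that the snippet runs without downloads, and lemmatization is left out.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (assumption: kept
# tiny so the sketch runs without any downloads).
DEMO_STOP_WORDS = {"the", "of", "a", "in", "and", "to", "is", "when"}

def split_tokens(text):
    """Whitespace tokenization, mirroring preprocess_text (no lemmatization)."""
    text = re.sub(r"http\S+", "", str(text).lower())
    return [t for t in text.split() if t not in DEMO_STOP_WORDS]

def regex_tokens(text):
    """Alternative: keep alphanumeric runs, allowing inner apostrophes/hyphens."""
    text = re.sub(r"http\S+", "", str(text).lower())
    tokens = re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text)
    return [t for t in tokens if t not in DEMO_STOP_WORDS]

sample = "The decline of coal started when? See http://example.com"
whitespace_result = split_tokens(sample)
regex_result = regex_tokens(sample)

print(whitespace_result)  # "when?" survives: punctuation hides the stopword
print(regex_result)       # "when" is matched without "?" and then filtered
```

Whether stripping punctuation helps on LIAR is an empirical question. Swapping the tokenizer changes the TF-IDF vocabulary, so the metrics recorded in this notebook would no longer apply.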


## 3. Baseline 1: TF-IDF + Logistic Regression

Train and evaluate a classical baseline using TF-IDF features and Logistic Regression.

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, f1_score
import joblib

vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_df['clean_text'])
X_valid = vectorizer.transform(valid_df['clean_text'])
X_test = vectorizer.transform(test_df['clean_text'])

y_train = train_df['label']
y_valid = valid_df['label']
y_test = test_df['label']

lr = LogisticRegression(max_iter=500)
lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)
print(classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred, average='weighted'))

# Save model and vectorizer
joblib.dump(lr, "tfidf_lr_model.pkl")
joblib.dump(vectorizer, "tfidf_vectorizer.pkl")

              precision    recall  f1-score   support

 barely-true       0.23      0.18      0.21       212
       false       0.26      0.31      0.29       249
   half-true       0.25      0.28      0.26       265
 mostly-true       0.18      0.20      0.19       241
  pants-fire       0.17      0.03      0.05        92
        true       0.24      0.24      0.24       208

    accuracy                           0.23      1267
   macro avg       0.22      0.21      0.21      1267
weighted avg       0.23      0.23      0.22      1267

Confusion Matrix:
 [[39 58 55 34  1 25]
 [28 77 52 54  5 33]
 [38 47 75 68  3 34]
 [32 39 64 49  2 55]
 [15 29 18 12  3 15]
 [16 41 42 55  4 50]]
F1 Score: 0.22472185423428373


['tfidf_vectorizer.pkl']
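
To put these numbers in context, the reported metrics can be recomputed from the confusion matrix alone and compared with a majority-class baseline. The sketch below copies the matrix from the output above and assumes scikit-learn's convention (rows are true labels, columns are predicted labels, classes in sorted order). Variable names are illustrative.

```python
# Confusion matrix copied from the output above
# (rows = true label, columns = predicted label).
labels = ["barely-true", "false", "half-true", "mostly-true", "pants-fire", "true"]
cm = [
    [39, 58, 55, 34, 1, 25],
    [28, 77, 52, 54, 5, 33],
    [38, 47, 75, 68, 3, 34],
    [32, 39, 64, 49, 2, 55],
    [15, 29, 18, 12, 3, 15],
    [16, 41, 42, 55, 4, 50],
]

n = len(labels)
support = [sum(row) for row in cm]                               # true count per class
predicted = [sum(cm[i][j] for i in range(n)) for j in range(n)]  # predicted count per class
total = sum(support)

accuracy = sum(cm[i][i] for i in range(n)) / total
# Accuracy of always predicting the most frequent test class
majority_baseline = max(support) / total

# Per-class F1 = 2*TP / (true count + predicted count)
f1 = [2 * cm[i][i] / (support[i] + predicted[i]) for i in range(n)]
macro_f1 = sum(f1) / n
weighted_f1 = sum(s * f for s, f in zip(support, f1)) / total

print(f"accuracy          = {accuracy:.4f}")
print(f"majority baseline = {majority_baseline:.4f}")
print(f"macro F1          = {macro_f1:.4f}")
print(f"weighted F1       = {weighted_f1:.4f}")
for name, p in zip(labels, predicted):
    print(f"{name:>12}: predicted {p} times")
```

The model's accuracy (about 0.23) is only slightly above what always predicting `half-true` would achieve on this test set (about 0.21), and `pants-fire` is predicted just 18 times out of 1267, which explains its very low recall.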

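## 4. Baseline 2: Hybrid TF-IDF + LightGBM on LIAR

Train a LightGBM classifier on unigram and bigram TF-IDF features, using the validation split for early stopping, then report validation and test metrics.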

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from lightgbm import LGBMClassifier
from lightgbm import early_stopping, log_evaluation

# 1. Prepare data
X_train = train_df["clean_text"].astype(str)
y_train = train_df["label"]
X_valid = valid_df["clean_text"].astype(str)
y_valid = valid_df["label"]
X_test  = test_df["clean_text"].astype(str)
y_test  = test_df["label"]

# 2. TF-IDF features (unigrams + bigrams)
tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1,2))
X_train_tfidf = tfidf.fit_transform(X_train)
X_valid_tfidf = tfidf.transform(X_valid)
X_test_tfidf  = tfidf.transform(X_test)

# 3. Train LightGBM Classifier
lgbm = LGBMClassifier(
    num_leaves=64,
    n_estimators=1000,
    learning_rate=0.05,
    objective="multiclass",
    random_state=42
)
lgbm.fit(
    X_train_tfidf, y_train,
    eval_set=[(X_valid_tfidf, y_valid)],
    eval_metric="multi_logloss",
    callbacks=[early_stopping(50), log_evaluation(100)]
)

# 4. Validation Results
y_pred_val = lgbm.predict(X_valid_tfidf)
print("\nValidation Results:")
print(classification_report(y_valid, y_pred_val))
print("F1 Score:", f1_score(y_valid, y_pred_val, average='weighted'))
print("Confusion Matrix:\n", confusion_matrix(y_valid, y_pred_val))

# 5. Final Test Results
y_pred_test = lgbm.predict(X_test_tfidf)
print("\nFinal Test Results:")
print(classification_report(y_test, y_pred_test))
print("F1 Score:", f1_score(y_test, y_pred_test, average='weighted'))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_test))


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.008455 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 29421
[LightGBM] [Info] Number of data points in the train set: 10240, number of used features: 1217
[LightGBM] [Info] Start training from score -1.823105
[LightGBM] [Info] Start training from score -1.635658
[LightGBM] [Info] Start training from score -1.577720
[LightGBM] [Info] Start training from score -1.652337
[LightGBM] [Info] Start training from score -2.501846
[LightGBM] [Info] Start training from score -1.809892
Training until validation scores don't improve for 50 rounds
Early stopping, best iteration is:
[20]	valid_0's multi_logloss: 1.73911
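
The six "Start training from score" values in the log above appear to be the natural logarithms of the training class frequencies. That reading is an assumption about how LightGBM initializes its multiclass objective, but it is consistent with the numbers: exponentiating the scores gives proportions that sum to 1. The sketch below recovers the implied class counts and computes two reference points for the validation log loss of 1.73911: a uniform guess over six classes, and always predicting the class frequencies (evaluated here on the training distribution, so only an approximation for the validation split).

```python
import math

# Initial scores copied from the LightGBM log above, one per class
# (assumed to follow sorted label order).
start_scores = [-1.823105, -1.635658, -1.577720, -1.652337, -2.501846, -1.809892]
n_train = 10240  # "Number of data points in the train set" from the same log

# If each score is log(class frequency), exp() recovers the frequency.
priors = [math.exp(s) for s in start_scores]
implied_counts = [round(p * n_train) for p in priors]

# Log loss of a uniform guess over K classes is ln(K).
uniform_logloss = math.log(len(start_scores))
# Log loss of always predicting the priors, measured on the training
# distribution, equals the entropy of the priors.
prior_logloss = -sum(p * math.log(p) for p in priors)

print("sum of priors  :", round(sum(priors), 4))
print("implied counts :", implied_counts, "total", sum(implied_counts))
print("uniform logloss:", round(uniform_logloss, 4))
print("prior logloss  :", round(prior_logloss, 4))
```

Under these assumptions, a uniform guess scores about 1.79 and the frequency-only predictor roughly 1.76, so the early-stopped model at iteration 20 improves on class frequencies by only a small margin.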

Validation Results:
              precision    recall  f1-score   support

 barely-true       0.23      0.12      0.16       237
       false       0.28      0.35      0.31       263
   half-true       0.21      0.32      0.26       248
 mostly-true

