# LIAR Dataset: Fake News Classification

**6-Class Classification using TF-IDF + Logistic Regression**

This notebook implements a simple fake news classifier using:
- TF-IDF for feature extraction
- Logistic Regression for classification
- 6 classes: pants-fire, false, barely-true, half-true, mostly-true, true


## 1. Imports & Setup


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix
import joblib
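import warnings
# Silence all warnings for cleaner output (note: this also hides scikit-learn's ConvergenceWarning)
warnings.filterwarnings('ignore')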

# Set random seed for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

# Define label order mapping (from most false to most true)
# This maps string labels to integers for classification
LABEL_MAPPING = {
    'pants-fire': 0,
    'false': 1,
    'barely-true': 2,
    'half-true': 3,
    'mostly-true': 4,
    'true': 5
}

# Reverse mapping for displaying results
LABEL_NAMES = {v: k for k, v in LABEL_MAPPING.items()}

print("✓ Imports and setup complete")
print(f"Label mapping: {LABEL_MAPPING}")


## 2. Load Data

**What happens here:**
- We read the TSV files from the `Dataset/` folder
- The files don't have headers, so we use `header=None`
- We assign column names based on the dataset documentation
- We extract only the label (column 2) and statement (column 3) columns
- We drop any rows whose statement is missing

**Note:** If your data is in a `data/` folder instead, change `Dataset/` to `data/` in the file paths below.


In [None]:
# Define column names based on LIAR dataset documentation
# Column 1: ID, Column 2: Label, Column 3: Statement, ...
column_names = [
    'id', 'label', 'statement', 'subject', 'speaker', 'job_title',
    'state', 'party', 'barely_true_counts', 'false_counts',
    'half_true_counts', 'mostly_true_counts', 'pants_on_fire_counts', 'context'
]

# Load the datasets
# Note: Update the path if your data is in a different folder (e.g., 'data/' instead of 'Dataset/')
# quoting=3 (csv.QUOTE_NONE) treats quote characters as literal text, so a stray
# double quote inside a statement should not cause pandas to merge several rows into one
print("Loading datasets...")
read_kwargs = dict(sep='\t', header=None, names=column_names, quoting=3)
train_df = pd.read_csv('Dataset/train.tsv', **read_kwargs)
valid_df = pd.read_csv('Dataset/valid.tsv', **read_kwargs)
test_df = pd.read_csv('Dataset/test.tsv', **read_kwargs)

print(f"\nDataset shapes:")
print(f"Train: {train_df.shape}")
print(f"Valid: {valid_df.shape}")
print(f"Test: {test_df.shape}")

# Select only label and statement columns (ignore other metadata)
train_df = train_df[['label', 'statement']].copy()
valid_df = valid_df[['label', 'statement']].copy()
test_df = test_df[['label', 'statement']].copy()

# Handle missing statements (drop rows with NaN statements)
print(f"\nMissing statements:")
print(f"Train: {train_df['statement'].isna().sum()}")
print(f"Valid: {valid_df['statement'].isna().sum()}")
print(f"Test: {test_df['statement'].isna().sum()}")

train_df = train_df.dropna(subset=['statement'])
valid_df = valid_df.dropna(subset=['statement'])
test_df = test_df.dropna(subset=['statement'])

print(f"\nAfter dropping missing statements:")
print(f"Train: {train_df.shape}")
print(f"Valid: {valid_df.shape}")
print(f"Test: {test_df.shape}")

# Show label distribution (ordered from most false to most true)
print(f"\nLabel distribution in training set:")
print(train_df['label'].value_counts().reindex(list(LABEL_MAPPING)))


## 3. Preprocess

**What happens here:**
- We clean the statement text (strip whitespace, convert to lowercase)
- We encode string labels into integers using our mapping
- This prepares the data for machine learning algorithms


In [None]:
def preprocess_text(text):
    """Basic text cleaning: strip and lowercase"""
    if pd.isna(text):
        return ""
    return str(text).strip().lower()

def encode_labels(df, label_mapping):
    """Encode string labels to integers"""
    df = df.copy()
    df['label_encoded'] = df['label'].map(label_mapping)
    # Drop rows with unmapped labels (if any)
    df = df.dropna(subset=['label_encoded'])
    # map() produces floats when any label was unmapped; cast back to integer class ids
    return df.astype({'label_encoded': int})

# Clean statement text
print("Preprocessing statements...")
train_df['statement'] = train_df['statement'].apply(preprocess_text)
valid_df['statement'] = valid_df['statement'].apply(preprocess_text)
test_df['statement'] = test_df['statement'].apply(preprocess_text)

# Encode labels to integers
print("Encoding labels...")
train_df = encode_labels(train_df, LABEL_MAPPING)
valid_df = encode_labels(valid_df, LABEL_MAPPING)
test_df = encode_labels(test_df, LABEL_MAPPING)

# Extract X (features) and y (labels) for each split
X_train = train_df['statement'].values
y_train = train_df['label_encoded'].values

X_valid = valid_df['statement'].values
y_valid = valid_df['label_encoded'].values

X_test = test_df['statement'].values
y_test = test_df['label_encoded'].values

print(f"\nPreprocessing complete!")
print(f"Train: {len(X_train)} samples")
print(f"Valid: {len(X_valid)} samples")
print(f"Test: {len(X_test)} samples")
print(f"\nSample statement: {X_train[0][:100]}...")
print(f"Sample label: {y_train[0]} ({LABEL_NAMES[y_train[0]]})")


## 4. Feature Extraction (TF-IDF)

**What happens here:**
- TF-IDF (Term Frequency-Inverse Document Frequency) converts text into numerical vectors
- **CRITICAL:** We fit the vectorizer ONLY on training data to avoid data leakage
- Data leakage would occur if we fit on all data (train+valid+test): the vocabulary and IDF weights would then encode information from the validation/test sets
- After fitting on train, we transform train/valid/test separately with the same fitted vectorizer
- Each statement becomes a sparse vector of TF-IDF scores, one per term (word or word pair) in the vocabulary


In [None]:
# Initialize TF-IDF vectorizer
# max_features limits vocabulary size (keeps the N terms with the highest corpus frequency)
# This helps reduce dimensionality and training time
vectorizer = TfidfVectorizer(
    max_features=10000,  # Keep at most 10,000 terms (unigrams and bigrams combined)
    ngram_range=(1, 2),  # Use unigrams and bigrams (single words and word pairs)
    min_df=2,  # Ignore terms that appear in fewer than 2 documents
    max_df=0.95  # Ignore terms that appear in more than 95% of documents (corpus-specific stop words)
)

print("Fitting TF-IDF vectorizer on TRAINING data only...")
print("(This is important to avoid data leakage!)")

# Fit ONLY on training data
X_train_vectors = vectorizer.fit_transform(X_train)

# Transform validation and test sets (using the vocabulary learned from train)
X_valid_vectors = vectorizer.transform(X_valid)
X_test_vectors = vectorizer.transform(X_test)

print(f"\nFeature extraction complete!")
print(f"Train vectors shape: {X_train_vectors.shape}")
print(f"Valid vectors shape: {X_valid_vectors.shape}")
print(f"Test vectors shape: {X_test_vectors.shape}")
print(f"\nVocabulary size: {len(vectorizer.vocabulary_)} terms (unigrams and bigrams)")
print(f"\nEach statement is now represented as a {X_train_vectors.shape[1]}-dimensional vector")
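
To make the effect of `ngram_range`, `min_df`, and `max_df` concrete, here is a minimal sketch on a toy corpus. The four sentences are made up for illustration and are not taken from LIAR; `max_features` is left out because the toy vocabulary is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical sentences, not from the LIAR dataset)
toy_corpus = [
    "the tax cut passed",
    "the tax cut failed",
    "the budget passed",
    "the senate voted",
]

# No pruning: every unigram and bigram is kept
unpruned = TfidfVectorizer(ngram_range=(1, 2)).fit(toy_corpus)

# Pruned: drop terms seen in fewer than 2 documents or in more than 95% of them
pruned = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.95).fit(toy_corpus)

unpruned_terms = sorted(unpruned.vocabulary_)
pruned_terms = sorted(pruned.vocabulary_)
print("Unpruned:", len(unpruned_terms), "terms")
print("Pruned:  ", pruned_terms)
```

Here "the" occurs in every document and is removed by `max_df`, while the bigram "the tax" survives because the thresholds are applied to each term separately.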


## 5. Train Model

**What happens here:**
- We train a Logistic Regression classifier on the training vectors
- We use the validation set to tune the hyperparameter C (the inverse of regularization strength)
- We try multiple C values and choose the one with the best macro-F1 score on the validation set
- Macro-F1 is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its size (useful for imbalanced datasets)


In [None]:
# Hyperparameter tuning: Try different values of C (inverse of regularization strength)
# Lower C = stronger regularization (helps prevent overfitting)
# Higher C = weaker regularization (model can fit training data more closely)
C_values = [0.1, 1.0, 10.0, 100.0, 1000.0]

print("Tuning hyperparameter C on validation set...")
print("C values to try:", C_values)
print("\n" + "="*60)

best_score = -1
best_C = None
best_model = None

for C in C_values:
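    # Train model with current C value
    # With the lbfgs solver, recent scikit-learn versions fit a multinomial (softmax)
    # model for multi-class targets by default; the explicit multi_class argument is
    # deprecated since scikit-learn 1.5, so it is omitted here.
    model = LogisticRegression(
        C=C,
        max_iter=1000,  # Maximum iterations for convergence
        random_state=RANDOM_SEED,
        solver='lbfgs'  # Supports the multinomial (softmax) formulation
    )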
    
    model.fit(X_train_vectors, y_train)
    
    # Evaluate on validation set
    y_valid_pred = model.predict(X_valid_vectors)
    macro_f1 = f1_score(y_valid, y_valid_pred, average='macro')
    accuracy = accuracy_score(y_valid, y_valid_pred)
    
    print(f"C = {C:6.1f} | Validation Accuracy: {accuracy:.4f} | Validation Macro-F1: {macro_f1:.4f}")
    
    # Keep track of best model
    if macro_f1 > best_score:
        best_score = macro_f1
        best_C = C
        best_model = model

print("="*60)
print(f"\n✓ Best C value: {best_C}")
print(f"✓ Best validation macro-F1: {best_score:.4f}")
print(f"\nSelected model: C = {best_C} (the model fitted in the loop is reused, not retrained)")
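
The reason for selecting on macro-F1 rather than accuracy can be seen in a small sketch. The labels below are hypothetical: a two-class toy problem where a degenerate classifier always predicts the majority class.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels (hypothetical): class 0 is common, class 1 is rare
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A degenerate classifier that always predicts the majority class
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores the never-predicted class as 0 without emitting a warning
macro = f1_score(y_true, y_pred, average='macro', zero_division=0)
weighted = f1_score(y_true, y_pred, average='weighted', zero_division=0)

print(f"Accuracy:    {acc:.3f}")
print(f"Macro-F1:    {macro:.3f}")
print(f"Weighted-F1: {weighted:.3f}")
```

Accuracy and weighted-F1 still look respectable because the majority class dominates them, while macro-F1 drops sharply since the ignored class contributes an F1 of zero with equal weight.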


## 6. Evaluate on Test Set

**What happens here:**
- We evaluate the best model (chosen via validation) on the test set
- This is the FINAL evaluation - we only touch the test set once
- We compute accuracy, macro-F1, detailed classification report, and confusion matrix
- The confusion matrix shows which classes are confused with each other


In [None]:
# Evaluate on test set (FINAL evaluation - only done once!)
print("Evaluating best model on TEST set...")
print("="*60)

y_test_pred = best_model.predict(X_test_vectors)

# Calculate metrics
test_accuracy = accuracy_score(y_test, y_test_pred)
test_macro_f1 = f1_score(y_test, y_test_pred, average='macro')
test_weighted_f1 = f1_score(y_test, y_test_pred, average='weighted')

print(f"\nTest Set Results:")
print(f"Accuracy: {test_accuracy:.4f}")
print(f"Macro-F1: {test_macro_f1:.4f}")
print(f"Weighted-F1: {test_weighted_f1:.4f}")

# Classification report (precision, recall, F1 for each class)
print("\n" + "="*60)
print("Classification Report:")
print("="*60)
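class_ids = sorted(LABEL_NAMES.keys())
target_names = [LABEL_NAMES[i] for i in class_ids]
# Pass labels explicitly so names stay aligned even if a class is absent from the test split
print(classification_report(y_test, y_test_pred, labels=class_ids, target_names=target_names))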

# Confusion matrix
print("="*60)
print("Confusion Matrix:")
print("="*60)
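# Fix the label order so the matrix is always 6x6 and matches the row/column names below
cm = confusion_matrix(y_test, y_test_pred, labels=sorted(LABEL_NAMES.keys()))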

# Display confusion matrix with labels
cm_df = pd.DataFrame(
    cm,
    index=[f"True {LABEL_NAMES[i]}" for i in sorted(LABEL_NAMES.keys())],
    columns=[f"Pred {LABEL_NAMES[i]}" for i in sorted(LABEL_NAMES.keys())]
)
print(cm_df)

print("\n" + "="*60)
print("Confusion Matrix Interpretation:")
print("="*60)
print("\nThe confusion matrix shows how predictions compare to true labels.")
print("Diagonal values = correct predictions")
print("Off-diagonal values = misclassifications")
print("\nA pattern to look for: adjacent labels (e.g., 'false' vs 'barely-true') tend to be confused,")
print("which would make sense since they represent similar levels of truthfulness.")
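
Whether errors really concentrate on adjacent labels can be measured instead of assumed. The sketch below is one possible check; the helper name `adjacent_error_share` and the toy labels are hypothetical, and it relies on the integer encoding being ordered from most false to most true.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def adjacent_error_share(y_true, y_pred, n_classes=6):
    """Fraction of misclassifications that land on a neighbouring label."""
    cm = confusion_matrix(y_true, y_pred, labels=list(range(n_classes)))
    rows, cols = np.indices(cm.shape)
    distance = np.abs(rows - cols)        # ordinal distance between true and predicted label
    n_errors = cm[distance > 0].sum()     # all off-diagonal counts
    n_adjacent = cm[distance == 1].sum()  # errors that are off by exactly one step
    return n_adjacent / n_errors if n_errors else 0.0

# Toy example (hypothetical labels, not LIAR results)
y_true = [0, 1, 2, 3, 4, 5, 1, 3]
y_pred = [0, 2, 2, 4, 4, 0, 1, 1]
share = adjacent_error_share(y_true, y_pred)
print(f"Share of errors on adjacent labels: {share:.2f}")
```

In this notebook the same helper could be called as `adjacent_error_share(y_test, y_test_pred)`.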


## 7. Save Artifacts (Optional)

**What happens here:**
- We save the trained TF-IDF vectorizer and model using joblib
- This allows us to reuse the model later without retraining
- The vectorizer must be saved too, since we need it to transform new text


In [None]:
# Save the vectorizer and model
print("Saving artifacts...")

joblib.dump(vectorizer, 'tfidf_vectorizer.joblib')
joblib.dump(best_model, 'logistic_regression_model.joblib')

print("✓ Saved: tfidf_vectorizer.joblib")
print("✓ Saved: logistic_regression_model.joblib")

# Example: How to load and use the saved model
print("\n" + "="*60)
print("Example: Loading and using saved model:")
print("="*60)
print("""
# Load artifacts (preprocess_text and LABEL_NAMES must be defined as in this notebook)
import joblib
vectorizer = joblib.load('tfidf_vectorizer.joblib')
model = joblib.load('logistic_regression_model.joblib')

# Predict on new text
new_text = "Your statement here"
new_text_cleaned = preprocess_text(new_text)
new_vector = vectorizer.transform([new_text_cleaned])
prediction = model.predict(new_vector)[0]
predicted_label = LABEL_NAMES[prediction]
print(f"Predicted label: {predicted_label}")
""")

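An alternative to saving two files is to bundle the vectorizer and classifier in a scikit-learn `Pipeline`, so they cannot get out of sync. This is a minimal sketch and not what the notebook above does; the toy data and the file name `liar_pipeline.joblib` are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data (hypothetical statements and labels, for illustration only)
texts = ["taxes went up", "taxes went down", "crime went up", "crime went down"]
labels = [0, 1, 0, 1]

# Bundle vectorizer and classifier so they are always saved and loaded together
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "liar_pipeline.joblib")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline accepts raw text directly
original_pred = pipeline.predict(["crime went up"])
restored_pred = restored.predict(["crime went up"])
print(original_pred, restored_pred)
```

Any custom cleaning such as `preprocess_text` would still need to be applied before calling `predict`, unless it is also moved into the pipeline.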

---

# Execution Flow Summary

This section explains the complete pipeline as a sequence of data transformations:

## Pipeline Overview

```
Raw TSV Files → Loaded DataFrames → Preprocessed Text → TF-IDF Vectors → Trained Model → Predictions
```

## Step-by-Step Flow

### 1. **Data Loading**
   - **Input:** TSV files (train.tsv, valid.tsv, test.tsv)
   - **Process:** Read files without headers, assign column names, extract label + statement columns
   - **Output:** Three DataFrames (train_df, valid_df, test_df) with 'label' and 'statement' columns

### 2. **Text Preprocessing**
   - **Input:** Raw statement text strings
   - **Process:** Strip whitespace, convert to lowercase
   - **Output:** Cleaned text strings

### 3. **Label Encoding**
   - **Input:** String labels ('pants-fire', 'false', etc.)
   - **Process:** Map strings to integers (0-5) using LABEL_MAPPING
   - **Output:** Integer labels (y_train, y_valid, y_test)
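
The encoding step can be sketched on a toy frame. The `'unknown'` label below is hypothetical and only there to show what happens to a value missing from the mapping.

```python
import pandas as pd

LABEL_MAPPING = {
    'pants-fire': 0, 'false': 1, 'barely-true': 2,
    'half-true': 3, 'mostly-true': 4, 'true': 5,
}

# Toy frame with one label ('unknown') that is not in the mapping
raw = pd.DataFrame({'label': ['true', 'pants-fire', 'unknown', 'half-true']})

# Unmapped labels become NaN, which forces a float dtype
n_unmapped = int(raw['label'].map(LABEL_MAPPING).isna().sum())

encoded = (
    raw.assign(label_encoded=raw['label'].map(LABEL_MAPPING))
       .dropna(subset=['label_encoded'])   # drop rows with unmapped labels
       .astype({'label_encoded': int})     # cast back to integer class ids
)

print(f"Unmapped rows dropped: {n_unmapped}")
print(encoded)
```

Because the integers follow the order from most false to most true, the distance between two encoded labels can also be read as a rough ordinal distance.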

### 4. **Feature Extraction (TF-IDF)**
   - **Input:** Cleaned text strings (X_train, X_valid, X_test)
   - **Process:** 
     - **Fit** vectorizer on X_train only (learns vocabulary)
     - **Transform** X_train, X_valid, X_test into sparse matrices
   - **Output:** Numerical vectors (X_train_vectors, X_valid_vectors, X_test_vectors)
   - **Why fit only on train?** Prevents data leakage - the vocabulary and IDF weights must not be influenced by validation/test text

### 5. **Model Training & Validation**
   - **Input:** Training vectors (X_train_vectors, y_train)
   - **Process:** 
     - Train Logistic Regression models with different C values
     - Evaluate each on validation set (X_valid_vectors, y_valid)
     - Select best C based on macro-F1 score
   - **Output:** Best trained model (best_model)

### 6. **Final Evaluation**
   - **Input:** Test vectors (X_test_vectors, y_test) and best_model
   - **Process:** 
     - Make predictions on test set
     - Calculate accuracy, macro-F1, classification report, confusion matrix
   - **Output:** Performance metrics and predictions
   - **Important:** Test set is only evaluated once - this is the final, unbiased estimate of model performance

### 7. **Model Persistence**
   - **Input:** Trained vectorizer and model
   - **Process:** Save to disk using joblib
   - **Output:** Saved files (tfidf_vectorizer.joblib, logistic_regression_model.joblib)

## Key Concepts

### Data Leakage Prevention
- **Problem:** If we fit TF-IDF on all data (train+valid+test), the model gets information about validation/test sets
- **Solution:** Fit vectorizer only on training data, then transform all sets
- **Result:** Model only learns from training data, ensuring fair evaluation
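
What leaks when the vectorizer is fitted on all splits can be shown on toy data. The sentences below are hypothetical; the sketch compares a vectorizer fitted on the training split only with one fitted on train plus test.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy splits (hypothetical statements, not from LIAR)
train_texts = ["taxes rose last year", "crime fell last year"]
test_texts = ["unemployment rose last quarter"]

# Correct: statistics come from the training split only
train_only = TfidfVectorizer().fit(train_texts)

# Leaky: the test split contributes vocabulary and document frequencies
leaky = TfidfVectorizer().fit(train_texts + test_texts)

unseen = set(leaky.vocabulary_) - set(train_only.vocabulary_)
print("Terms learned only because test data was included:", sorted(unseen))

# The IDF weight of a shared term also shifts once test documents are counted
idf_train = train_only.idf_[train_only.vocabulary_["rose"]]
idf_leaky = leaky.idf_[leaky.vocabulary_["rose"]]
print(f"idf('rose') train-only: {idf_train:.4f}, leaky: {idf_leaky:.4f}")

# With the train-only vectorizer, unseen test terms are silently ignored
test_vector = train_only.transform(test_texts)
print("Non-zero features in test vector:", test_vector.nnz)
```

The train-only vectorizer simply drops "unemployment" and "quarter" at transform time, which is the behaviour a deployed model would face with genuinely new text.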

### Train/Validation/Test Split Purpose
- **Training set:** Used to learn model parameters and vocabulary
- **Validation set:** Used to tune hyperparameters (C) and select best model
- **Test set:** Used for final, unbiased evaluation (touched only once)

### TF-IDF Vectorization
- Converts text documents into numerical feature vectors
- Each term (word or n-gram) gets a TF-IDF score based on:
  - **TF (Term Frequency):** How often the term appears in the document
  - **IDF (Inverse Document Frequency):** How rare the term is across all documents
- Terms that appear in many documents (like "the", "is") get lower IDF scores
- Terms that are frequent in a document but rare across the corpus get the highest TF-IDF scores
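
With scikit-learn's default settings, the score is the raw term count multiplied by a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, after which each document vector is scaled to unit L2 norm. The sketch below reproduces that computation by hand on a hypothetical toy corpus and compares it with `TfidfVectorizer`; it assumes default parameters and would not match if options such as `sublinear_tf` or `smooth_idf=False` were set.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical sentences)
docs = ["the tax cut", "the tax plan", "the vote"]

# Reference: scikit-learn's TF-IDF with default settings
tfidf = TfidfVectorizer()
reference = tfidf.fit_transform(docs).toarray()

# Manual computation with the same (default) formula
counts = CountVectorizer().fit_transform(docs).toarray()  # TF: raw term counts
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                        # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1            # smoothed IDF
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalise each row

# Terms in column order (the vocabulary maps term -> column index)
terms = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)
print(dict(zip(terms, np.round(idf, 3))))
print("Matches scikit-learn:", np.allclose(manual, reference))
```

The term "the" appears in every document and therefore receives the smallest IDF in this corpus.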

### Logistic Regression
- A linear classifier that learns weights for each feature (word)
- For each class, it computes a score based on weighted sum of TF-IDF features
- The class with highest score is predicted
- The C parameter is the inverse of regularization strength: smaller C means stronger regularization, which helps prevent overfitting
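
The scoring rule can be reproduced by hand from a fitted model. This is a minimal sketch on hypothetical 2-feature data; it assumes a recent scikit-learn in which the default `lbfgs` solver fits a multinomial (softmax) model for multi-class targets.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 3-class problem with 2 features (hypothetical data)
X = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [1.2, 0.9], [2.0, 2.1], [2.2, 1.9]])
y = np.array([0, 0, 1, 1, 2, 2])

clf = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)

# Per-class score: weighted sum of the features plus an intercept
scores = X @ clf.coef_.T + clf.intercept_

# Softmax turns scores into class probabilities
exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)

# The predicted class is the one with the highest score
manual_pred = clf.classes_[scores.argmax(axis=1)]

print("Scores match decision_function:", np.allclose(scores, clf.decision_function(X)))
print("Probabilities match predict_proba:", np.allclose(probs, clf.predict_proba(X)))
print("Predictions match predict:", (manual_pred == clf.predict(X)).all())

# Smaller C means stronger regularization, which shrinks the learned weights
norm_small_c = np.linalg.norm(LogisticRegression(C=0.01, max_iter=1000).fit(X, y).coef_)
norm_large_c = np.linalg.norm(LogisticRegression(C=100.0, max_iter=1000).fit(X, y).coef_)
print(f"Weight norm at C=0.01: {norm_small_c:.3f}, at C=100: {norm_large_c:.3f}")
```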

## Data Flow Diagram

```
train.tsv ──> Load & Extract ──> Preprocess ──> Fit TF-IDF + Transform ───> Train one model per C
valid.tsv ──> Load & Extract ──> Preprocess ──> Transform (train vocab) ──> Select best C ──> Best Model
test.tsv ───> Load & Extract ──> Preprocess ──> Transform (train vocab) ──> Evaluate Best Model ──> Metrics
```

This pipeline ensures that:
1. No information from validation/test sets leaks into training
2. Hyperparameters are tuned on validation set (not test set)
3. Test set provides unbiased final evaluation
4. The model can be saved and reused for new predictions
