# Group 1 — NLP Fake News 

**Project:** Fake News vs Real News (Tab-separated dataset)  
**Bootcamp:** Ironhack — NLP Challenge  
**Last updated:** 2026-02-04




### Goal
Build an NLP classifier that predicts whether a news item is **FAKE** or **REAL** (binary classification).

### Deliverables (typical)
- Clear preprocessing + feature engineering approach
- One or more models with evaluation on a validation split
- Final predictions on the **test** file
- (Optional) Explainability: most informative features / error analysis
- Reproducible code via **pipelines** and fixed random seeds


## 1) Setup & Imports
- Keep imports centralized.
- Use pipelines so preprocessing + vectorization + model training is one reproducible object.


In [None]:
# Core
import os
from pathlib import Path
import re
import time
import numpy as np
import pandas as pd

# Viz (optional)
import matplotlib.pyplot as plt

# Sklearn
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score,
    classification_report, confusion_matrix, ConfusionMatrixDisplay
)

# Models (baselines)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)


## 2) Config
Update these values to match your dataset.


In [None]:
# ======== CONFIG (EDIT IF NEEDED) ========
DATA_DIR = Path("..") / "data"          # notebook lives in /code, so .. points to repo root
TRAIN_FILE = DATA_DIR / "train.tsv"
TEST_FILE  = DATA_DIR / "test.tsv"

# Column names (edit these after you inspect df.columns)
TEXT_COL   = "text"     # e.g. "text", "content", "article", "title_text", etc.
LABEL_COL  = "label"    # e.g. "label", "target", "class"

# If your labels are strings like "FAKE"/"REAL", keep as-is.
# If your labels are 0/1, also fine.
# =========================================

assert TRAIN_FILE.exists(), f"Missing: {TRAIN_FILE}"
assert TEST_FILE.exists(), f"Missing: {TEST_FILE}"
print("Files found:", TRAIN_FILE.name, "and", TEST_FILE.name)


## 3) Load data
Assumes **tab-separated** files. By default, `pd.read_csv` treats the first row as the header.
If your files have no header row, pass `header=None` and set the column names yourself (e.g. via `names=[...]`).


In [None]:
# Load TSV
train_df = pd.read_csv(TRAIN_FILE, sep="\t")
test_df  = pd.read_csv(TEST_FILE, sep="\t")

print("Train shape:", train_df.shape)
print("Test shape :", test_df.shape)
train_df.head()
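
If the files turn out to have no header row, the `header=None` variant can be sketched as follows. The in-memory sample and the column names `label` and `text` are assumptions for illustration only; substitute the real column order of your files.

```python
import io

import pandas as pd

# Hypothetical two-row TSV without a header line (stands in for train.tsv)
raw_tsv = "0\tSenate passes budget bill\n1\tMiracle pill melts fat overnight\n"

# header=None stops pandas from consuming the first data row as column names;
# names=... supplies the names instead (assumed order: label, then text)
headerless_df = pd.read_csv(
    io.StringIO(raw_tsv),
    sep="\t",
    header=None,
    names=["label", "text"],
)

print(headerless_df.shape)
print(list(headerless_df.columns))
```

Without `header=None`, the first article would silently become the column names and the training set would be one row short.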


## 4) Quick schema checks
- Inspect columns
- Check missing values
- Confirm label distribution


In [None]:
print("Train columns:", list(train_df.columns))
print("Test columns :", list(test_df.columns))

# Basic NA checks
display(train_df.isna().mean().sort_values(ascending=False).head(10))
display(test_df.isna().mean().sort_values(ascending=False).head(10))

# Label distribution (if LABEL_COL exists)
if LABEL_COL in train_df.columns:
    display(train_df[LABEL_COL].value_counts(dropna=False))
else:
    print(f"⚠️ LABEL_COL='{LABEL_COL}' not found. Update CONFIG.")


## 5) Define features and target
We keep a clean split between:
- `X` (text)
- `y` (labels)


In [None]:
assert TEXT_COL in train_df.columns, f"TEXT_COL='{TEXT_COL}' not found in train_df.columns"
assert LABEL_COL in train_df.columns, f"LABEL_COL='{LABEL_COL}' not found in train_df.columns"
assert TEXT_COL in test_df.columns,  f"TEXT_COL='{TEXT_COL}' not found in test_df.columns"

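# Fill missing text with "" first; astype(str) alone would turn NaN into the literal string "nan"
X = train_df[TEXT_COL].fillna("").astype(str)
y = train_df[LABEL_COL]

X_test = test_df[TEXT_COL].fillna("").astype(str)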

print("X shape:", X.shape, "| y shape:", y.shape, "| X_test shape:", X_test.shape)


## 6) Baseline split (Train/Validation)
We start with a **single stratified split** for quick iteration. Cross-validation comes later (section 12).


In [None]:
X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.2,
    random_state=RANDOM_STATE,
    stratify=y
)

print("Train:", X_train.shape, "Val:", X_val.shape)
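
To confirm that stratification preserved the class balance, one option is to compare label proportions across the splits. This is a minimal sketch on hypothetical toy labels (30% FAKE, 70% REAL); the `toy_*` names are not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy labels: 30 FAKE, 70 REAL
toy_y = pd.Series(["FAKE"] * 30 + ["REAL"] * 70)
toy_X = pd.Series([f"document {i}" for i in range(len(toy_y))])

_, _, toy_y_train, toy_y_val = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# One column per split; rows are aligned on the label values
balance = pd.DataFrame({
    "full": toy_y.value_counts(normalize=True),
    "train": toy_y_train.value_counts(normalize=True),
    "val": toy_y_val.value_counts(normalize=True),
})
print(balance)
```

On the real data, the same table can be built from `y`, `y_train` and `y_val`; the three columns should be close to identical.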


## 7) Text cleaning (lightweight)
This is intentionally conservative:
- replace non-letters with spaces
- remove standalone single letters
- collapse whitespace
- lowercase

> Important: With TF-IDF, you often get strong performance with *minimal* cleaning.
If you add heavy cleaning (stemming/lemmatization), track whether it actually improves metrics.


In [None]:
def clean_text(text: str) -> str:
    text = str(text)
    # Keep letters and spaces
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    # Remove standalone single letters (word boundaries also catch text edges and consecutive letters)
    text = re.sub(r"\b[a-zA-Z]\b", " ", text)
    # Collapse spaces
    text = re.sub(r"\s+", " ", text).strip()
    # Lowercase
    text = text.lower()
    return text

# Quick sanity check
sample = X_train.iloc[0]
print("BEFORE:", sample[:200])
print("AFTER :", clean_text(sample)[:200])
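
Even light cleaning discards information, which is why it is worth tracking its effect on metrics. The sketch below is standalone and follows the steps listed above; the function name `clean_text_sketch`, the sample headline, and the word-boundary pattern for single letters are assumptions of this example.

```python
import re


def clean_text_sketch(text: str) -> str:
    """Standalone sketch of the cleaning steps: letters only, no single letters, lowercase."""
    text = re.sub(r"[^a-zA-Z\s]", " ", str(text))   # replace non-letters with spaces
    text = re.sub(r"\b[a-zA-Z]\b", " ", text)       # drop standalone single letters
    return re.sub(r"\s+", " ", text).strip().lower()


# Hypothetical headline with digits, punctuation and an abbreviation
sample_headline = "BREAKING: 5 reasons U.S. stocks fell 3.2% today!!!"
cleaned_headline = clean_text_sketch(sample_headline)

print(cleaned_headline)  # breaking reasons stocks fell today
```

The numbers, the exclamation marks and "U.S." are all gone. If such cues turn out to separate FAKE from REAL in your data, a gentler cleaner may score better.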


## 8) Modeling approach
We'll use **pipelines** so that:
- text cleaning
- vectorization (Count/TF-IDF)
- classifier

…are all packaged into a single object.


In [None]:
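from sklearn.preprocessing import FunctionTransformer

def clean_series(texts):
    # Apply clean_text element-wise; a named function (unlike a lambda) can be pickled with the pipeline
    return pd.Series(texts).apply(clean_text)

cleaner = FunctionTransformer(clean_series, validate=False)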

def make_pipeline(vectorizer, model):
    return Pipeline(steps=[
        ("clean", cleaner),
        ("vec", vectorizer),
        ("clf", model),
    ])


## 9) Baseline models
Try 2–3 baselines quickly and compare using the same metrics.
Recommended quick baselines:
- **TF-IDF + Logistic Regression**
- **TF-IDF + LinearSVC**
- **CountVectorizer + MultinomialNB**


In [None]:
models = {
    "tfidf_logreg": make_pipeline(
        TfidfVectorizer(
            ngram_range=(1,2),
            min_df=2,
            max_df=0.95
        ),
        LogisticRegression(max_iter=2000, random_state=RANDOM_STATE)
    ),
    "tfidf_linearsvc": make_pipeline(
        TfidfVectorizer(
            ngram_range=(1,2),
            min_df=2,
            max_df=0.95
        ),
        LinearSVC(random_state=RANDOM_STATE)
    ),
    "count_mnb": make_pipeline(
        CountVectorizer(
            ngram_range=(1,2),
            min_df=2,
            max_df=0.95
        ),
        MultinomialNB()
    ),
}


In [None]:
def evaluate(model, X_tr, y_tr, X_va, y_va, name="model"):
    t0 = time.perf_counter()
    model.fit(X_tr, y_tr)
    fit_s = time.perf_counter() - t0

    y_pred = model.predict(X_va)

    metrics = {
        "model": name,
        "fit_seconds": round(fit_s, 3),
        "accuracy": accuracy_score(y_va, y_pred),
        "f1": f1_score(y_va, y_pred, average="weighted"),
        "precision": precision_score(y_va, y_pred, average="weighted", zero_division=0),
        "recall": recall_score(y_va, y_pred, average="weighted", zero_division=0),
    }
    return metrics, y_pred

results = []
preds_by_model = {}

for name, pipe in models.items():
    m, y_pred = evaluate(pipe, X_train, y_train, X_val, y_val, name=name)
    results.append(m)
    preds_by_model[name] = y_pred

results_df = pd.DataFrame(results).sort_values(by="f1", ascending=False)
results_df
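
Weighted averages can hide how well the FAKE class itself is detected. As a minimal sketch, the hypothetical labels below compare the weighted F1 with metrics computed for FAKE only via `pos_label`; the `toy_*` values are invented for illustration.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical validation labels and predictions (3 FAKE, 5 REAL)
toy_true = ["FAKE", "FAKE", "FAKE", "REAL", "REAL", "REAL", "REAL", "REAL"]
toy_pred = ["FAKE", "FAKE", "REAL", "REAL", "REAL", "REAL", "REAL", "FAKE"]

# Support-weighted mean of the per-class F1 scores
f1_weighted = f1_score(toy_true, toy_pred, average="weighted")

# Metrics for the FAKE class alone; pos_label is required with string labels
f1_fake = f1_score(toy_true, toy_pred, pos_label="FAKE", average="binary")
precision_fake = precision_score(toy_true, toy_pred, pos_label="FAKE")
recall_fake = recall_score(toy_true, toy_pred, pos_label="FAKE")

print(f"weighted F1: {f1_weighted:.3f}")
print(f"FAKE F1: {f1_fake:.3f} | precision: {precision_fake:.3f} | recall: {recall_fake:.3f}")
```

With imbalanced labels the gap between the two views tends to grow, so reporting both is usually safer.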


## 10) Best model deep-dive
- Classification report
- Confusion matrix
- Error analysis (optional but valuable)


In [None]:
best_name = results_df.iloc[0]["model"]
best_model = models[best_name]

print("Best model:", best_name)
best_model.fit(X_train, y_train)
y_val_pred = best_model.predict(X_val)

print("\nClassification report:")
print(classification_report(y_val, y_val_pred))

cm = confusion_matrix(y_val, y_val_pred, labels=np.unique(y))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=np.unique(y))
disp.plot()
plt.show()
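
For the error analysis, one possible approach is to collect the misclassified validation rows in a DataFrame and read through them. The sketch below uses hypothetical stand-ins for `X_val`, `y_val` and the predictions; the column names `text`, `true`, `pred` and `n_chars` are assumptions.

```python
import pandas as pd

# Hypothetical stand-ins for X_val, y_val and the validation predictions
toy_texts = pd.Series(
    [
        "Miracle pill melts fat overnight",
        "Senate passes budget bill after long debate",
        "Aliens secretly control the government",
        "Court rules on data privacy case",
    ],
    index=[10, 11, 12, 13],
)
toy_true = pd.Series(["FAKE", "REAL", "FAKE", "REAL"], index=toy_texts.index)
toy_pred = ["FAKE", "FAKE", "REAL", "REAL"]  # predict() returns an array without an index

# Re-attach the validation index so text, truth and prediction line up row by row
errors = pd.DataFrame({
    "text": toy_texts,
    "true": toy_true,
    "pred": pd.Series(toy_pred, index=toy_texts.index),
})
errors = errors[errors["true"] != errors["pred"]].assign(
    n_chars=lambda d: d["text"].str.len()
)

print(errors)
print(errors.groupby(["true", "pred"]).size())
```

Sorting by `n_chars` or grouping by error direction (FAKE predicted as REAL, and the reverse) often surfaces patterns such as very short articles being misclassified.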


## 11) Feature inspection (optional)
For linear models, you can inspect the highest-weighted tokens for each class.
This helps you explain what the model is “using” to decide.


In [None]:
def top_features_linear(pipeline: Pipeline, top_n=20):
    vec = pipeline.named_steps["vec"]
    clf = pipeline.named_steps["clf"]

    if not hasattr(clf, "coef_"):
        print("This classifier doesn't expose coef_. Try LogisticRegression or LinearSVC.")
        return

    feature_names = np.array(vec.get_feature_names_out())
    coef = clf.coef_

    # Binary case: coef shape (1, n_features)
    if coef.shape[0] == 1:
        weights = coef[0]
        top_pos = feature_names[np.argsort(weights)[-top_n:]][::-1]
        top_neg = feature_names[np.argsort(weights)[:top_n]]
        # Positive weights push toward clf.classes_[1], negative weights toward clf.classes_[0]
        print(f"\nTop features for class={clf.classes_[1]}:")
        print(top_pos)
        print(f"\nTop features for class={clf.classes_[0]}:")
        print(top_neg)
    else:
        # Multiclass: show per class
        for i, cls in enumerate(clf.classes_):
            weights = coef[i]
            top = feature_names[np.argsort(weights)[-top_n:]][::-1]
            print(f"\nTop features for class={cls}:")
            print(top)

# Fit then inspect
best_model.fit(X_train, y_train)
top_features_linear(best_model, top_n=20)


## 12) Cross-validation (recommended)
Once you have a “best” pipeline, do stratified CV for a more reliable estimate.


In [None]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
cv_scores = cross_val_score(best_model, X, y, cv=skf, scoring="f1_weighted")
print("CV F1 (weighted):", cv_scores.round(4))
print("Mean:", cv_scores.mean().round(4), "| Std:", cv_scores.std().round(4))
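
The same stratified folds can also drive a small hyperparameter search. The following is a sketch only: the toy texts, the grid values and the pipeline without a cleaning step are assumptions chosen to keep the example self-contained, and scores on twelve invented headlines mean nothing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 6 FAKE-style and 6 REAL-style headlines
toy_texts = [
    "shocking secret cure doctors hate",
    "you will not believe this miracle trick",
    "aliens secretly control the government",
    "miracle pill melts fat overnight",
    "celebrity clone spotted shocking photos",
    "secret plot revealed by anonymous insider",
    "senate passes budget bill after debate",
    "central bank holds interest rates steady",
    "city council approves new transit plan",
    "court rules on data privacy case",
    "company reports quarterly earnings growth",
    "health ministry publishes vaccination figures",
]
toy_labels = ["FAKE"] * 6 + ["REAL"] * 6

toy_pipe = Pipeline(steps=[
    ("vec", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=2000)),
])

# Keys follow the <step name>__<parameter> convention of sklearn pipelines
param_grid = {
    "vec__ngram_range": [(1, 1), (1, 2)],
    "vec__stop_words": [None, "english"],
    "clf__class_weight": [None, "balanced"],
}

toy_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
search = GridSearchCV(toy_pipe, param_grid, cv=toy_cv, scoring="f1_weighted")
search.fit(toy_texts, toy_labels)

print("Best params:", search.best_params_)
print("Best CV F1 (weighted):", round(search.best_score_, 4))
```

Because vectorizer settings are tuned inside the pipeline, each fold fits its vocabulary on its own training part, which avoids leaking validation text into the features.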


## 13) Train final model on full training data
Then generate predictions for the test set.


In [None]:
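from sklearn.base import clone

# Clone so the final fit on all training data does not overwrite the validation-fitted best_model
final_model = clone(best_model)
final_model.fit(X, y)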

test_pred = final_model.predict(X_test)
pd.Series(test_pred).value_counts().head(10)


## 14) Create submission file
This is a generic template. Edit `ID_COL` if your test file includes an ID column.
If not, we create one from the row index.


In [None]:
# ======= OPTIONAL: ID COLUMN (EDIT IF NEEDED) =======
ID_COL = "id"  # set to None if there's no ID column in test_df
# ================================================

if ID_COL is not None and ID_COL in test_df.columns:
    submission = pd.DataFrame({ID_COL: test_df[ID_COL], LABEL_COL: test_pred})
else:
    submission = pd.DataFrame({"id": np.arange(len(test_df)), LABEL_COL: test_pred})

submission.head()


In [None]:
OUT_DIR = Path(".") / "outputs"
OUT_DIR.mkdir(exist_ok=True)

out_path = OUT_DIR / "submission.csv"
submission.to_csv(out_path, index=False)
print("✅ Saved:", out_path.resolve())
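
Before handing in the file, a few basic checks can catch shape or label mistakes. The helper below is a hypothetical sketch, not part of any required format: `check_submission`, its default column names and the toy frames are assumptions.

```python
import pandas as pd

# Hypothetical stand-ins for the training labels, the test rows and the submission
toy_train_labels = pd.Series(["FAKE", "REAL", "REAL", "FAKE"])
toy_test = pd.DataFrame({"id": [101, 102, 103], "text": ["a", "b", "c"]})
toy_submission = pd.DataFrame({"id": toy_test["id"], "label": ["REAL", "FAKE", "REAL"]})


def check_submission(submission, test_frame, train_labels, id_col="id", label_col="label"):
    """Return a list of problems found; an empty list means the basic checks passed."""
    problems = []
    if len(submission) != len(test_frame):
        problems.append("row count differs from the test file")
    if submission[id_col].duplicated().any():
        problems.append("duplicate ids")
    if submission[label_col].isna().any():
        problems.append("missing predictions")
    unseen = set(submission[label_col].dropna()) - set(train_labels)
    if unseen:
        problems.append(f"labels never seen in training: {sorted(unseen)}")
    return problems


print(check_submission(toy_submission, toy_test, toy_train_labels))
```

On the real objects, the call would take `submission`, `test_df` and `y`, with `label_col=LABEL_COL`.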


## 15) Notes for the presentation
Use this section as a scratchpad for your group presentation:
- What dataset columns exist?
- What preprocessing did you choose and why?
- What baselines did you test?
- What is your best model and what metrics did it achieve?
- What are common failure cases (misclassifications)?
- Any improvements tried (stopword removal, n-grams, class weights, etc.)?


In [None]:
# Write your bullet points / TODOs here for the group.
