# NLP Fake vs Real News — Complete Notebook

This notebook implements a full pipeline on the provided headlines dataset: it trains text classifiers, evaluates them, tunes hyperparameters, and writes a predictions file for the test set in which the placeholder label `2` is replaced by a predicted `0` or `1`.

## 1) Environment & imports

In [3]:
# Environment & imports
import os
import re
import string
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import joblib
print("Libraries imported.")

Libraries imported.


## 2) Read data (tab-separated without header)

In [4]:
data_dir = r"C:\Users\Chantal Silva\OneDrive\Escritorio\Ironhack\Week_7\Project\project-3-nlp\dataset"  # where the uploader extracted the files
train_path = os.path.join(data_dir, 'training_data.csv')
test_path = os.path.join(data_dir, 'testing_data.csv')

train = pd.read_csv(train_path, sep='\t', header=None, names=['label','headline'])
test = pd.read_csv(test_path, sep='\t', header=None, names=['label','headline'])

print('Train shape:', train.shape)
print('Test shape:', test.shape)
display(train.head())

Train shape: (34152, 2)
Test shape: (9984, 2)


| | label | headline |
|---|---|---|
| 0 | 0 | donald trump sends out embarrassing new year‚s... |
| 1 | 0 | drunk bragging trump staffer started russian c... |
| 2 | 0 | sheriff david clarke becomes an internet joke ... |
| 3 | 0 | trump is so obsessed he even has obama‚s name ... |
| 4 | 0 | pope francis just called out donald trump duri... |
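
The effect of `sep='\t'`, `header=None` and `names=[...]` can be checked on a small in-memory sample before pointing at the real files. The sketch below uses two made-up rows (not taken from the dataset) and compares the result with a read that omits `header=None`.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same layout as the dataset files:
# label <TAB> headline, with no header row.
sample = "0\tsome fake headline\n1\tsome real headline\n"

# Same arguments as the cell above: explicit column names, no header row.
df = pd.read_csv(io.StringIO(sample), sep="\t", header=None, names=["label", "headline"])

# Without header=None, pandas treats the first data row as column names.
df_wrong = pd.read_csv(io.StringIO(sample), sep="\t")

print("with header=None:", df.shape)
print("default header:  ", df_wrong.shape)
```

Comparing the two shapes is a quick way to confirm that no headline is silently lost to the header row.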


## 3) Quick exploratory check

In [5]:
print("Label distribution (train):")
print(train['label'].value_counts())

print("\nExample headlines:")
for i in range(5):
    print('-', train['headline'].iloc[i])

Label distribution (train):
label
0    17572
1    16580
Name: count, dtype: int64

Example headlines:
- donald trump sends out embarrassing new year‚s eve message; this is disturbing
- drunk bragging trump staffer started russian collusion investigation
- sheriff david clarke becomes an internet joke for threatening to poke people ‚in the eye‚
- trump is so obsessed he even has obama‚s name coded into his website (images)
- pope francis just called out donald trump during his christmas speech
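
The counts above can be turned into class shares, which also give the accuracy of a trivial "always predict the majority class" baseline. The sketch below re-enters the two printed counts by hand; in the notebook itself, `train['label'].value_counts(normalize=True)` would give the same shares directly.

```python
import pandas as pd

# Counts copied from the output above (label 0 and label 1 in the training set).
counts = pd.Series({0: 17572, 1: 16580}, name="count")

# Share of each class in the training data.
share = counts / counts.sum()
print(share.round(4))

# Accuracy of a classifier that always predicts the most frequent label.
majority_baseline = share.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```

The classes are close to balanced, so plain accuracy is a reasonable headline metric for the models that follow.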


## 4) Preprocessing function
- Lowercase
- Replace special quotes with plain ASCII quotes
- Remove URLs
- Remove punctuation (apostrophes inside words are kept) and digits
- Remove stopwords (simple built-in list to avoid external downloads)

Returns a cleaned string.

In [6]:
# Lightweight stopword list (keeps notebook self-contained)
_stopwords = set([
 'a','an','the','and','or','if','in','on','for','of','to','is','are','was','were','be','has','had','have',
 'with','as','by','this','that','it','from','at','but','not','they','their','them','he','she','his','her','you','your'
])
def preprocess_text(text):
    if not isinstance(text, str):
        return ''
    text = text.lower()
    text = text.replace('‚', "'").replace('’',"'").replace('“','"').replace('”','"')
    text = re.sub(r'http\S+|www\.\S+', '', text)  # remove URLs
    text = re.sub(r'[^\w\s\']', ' ', text)  # remove punctuation except apostrophe
    text = re.sub(r'\d+', '', text)  # remove digits
    tokens = [t.strip("'") for t in text.split()]  # strip quotes first so a token like 'the is also caught
    tokens = [t for t in tokens if t and t not in _stopwords]
    return ' '.join(tokens)
# Apply to data (create new column)
train['clean'] = train['headline'].astype(str).apply(preprocess_text)
test['clean'] = test['headline'].astype(str).apply(preprocess_text)
display(train[['headline','clean']].head())

| | headline | clean |
|---|---|---|
| 0 | donald trump sends out embarrassing new year‚s... | donald trump sends out embarrassing new year's... |
| 1 | drunk bragging trump staffer started russian c... | drunk bragging trump staffer started russian c... |
| 2 | sheriff david clarke becomes an internet joke ... | sheriff david clarke becomes internet joke thr... |
| 3 | trump is so obsessed he even has obama‚s name ... | trump so obsessed even obama's name coded into... |
| 4 | pope francis just called out donald trump duri... | pope francis just called out donald trump duri... |
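
To see what each preprocessing step does, it can help to run a compact version on one hand-written string. The sketch below is an illustration, not the notebook's function: `preprocess_sketch`, the shortened `STOPWORDS` set and the sample sentence are all made up for this example. It strips surrounding apostrophes before the stopword lookup, so a quoted stopword such as `'in` is dropped as well.

```python
import re

# Shortened, hypothetical stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "is", "of", "to", "in"}


def preprocess_sketch(text):
    """Lowercase, normalise quotes, drop URLs, punctuation, digits and stopwords."""
    if not isinstance(text, str):
        return ""
    text = text.lower()
    # Map the low-9 quote seen in the data and curly quotes to ASCII quotes.
    text = (text.replace("\u201a", "'").replace("\u2019", "'")
                .replace("\u201c", '"').replace("\u201d", '"'))
    text = re.sub(r"http\S+|www\.\S+", "", text)  # remove URLs
    text = re.sub(r"[^\w\s']", " ", text)         # remove punctuation except apostrophes
    text = re.sub(r"\d+", "", text)               # remove digits
    tokens = (t.strip("'") for t in text.split()) # strip quotes before the stopword test
    return " ".join(t for t in tokens if t and t not in STOPWORDS)


sample = "Trump is \u201ain the eye\u201a of a storm: see http://example.com 2017"
cleaned = preprocess_sketch(sample)
print(cleaned)
```

Running a few such strings through the function is a cheap check that quotes, URLs and digits are handled as intended before the full dataset is transformed.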


## 5) Simple baseline: TF-IDF + Logistic Regression pipeline

In [7]:
X = train['clean']
y = train['label']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

pipe_tfidf_lr = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1,2), min_df=2, max_df=0.9)),
    ('clf', LogisticRegression(max_iter=2000, solver='liblinear'))
])

pipe_tfidf_lr.fit(X_train, y_train)
pred_val = pipe_tfidf_lr.predict(X_val)

print('Validation metrics for TF-IDF + LogisticRegression:')
print(classification_report(y_val, pred_val))
print('Accuracy:', accuracy_score(y_val, pred_val))

Validation metrics for TF-IDF + LogisticRegression:
              precision    recall  f1-score   support

           0       0.95      0.94      0.94      3515
           1       0.93      0.94      0.94      3316

    accuracy                           0.94      6831
   macro avg       0.94      0.94      0.94      6831
weighted avg       0.94      0.94      0.94      6831

Accuracy: 0.9398331137461572
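
The classification report summarises precision and recall per class, while a confusion matrix shows the raw counts behind them and which direction the errors go. `confusion_matrix` is imported in section 1 but not used. The sketch below shows its layout on made-up labels; in the notebook the call would presumably be `confusion_matrix(y_val, pred_val)`.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, not taken from the notebook's validation split.
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0]

# Rows are true labels, columns are predicted labels, in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# For a binary problem the flattened matrix reads:
# true 0 -> pred 0, true 0 -> pred 1, true 1 -> pred 0, true 1 -> pred 1.
c00, c01, c10, c11 = cm.ravel()
print(cm)
print(f"label 0 predicted as 1: {c01}, label 1 predicted as 0: {c10}")
```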


## 6) Try a few different classifiers quickly (same TF-IDF)

In [8]:
models = {
    'LogisticRegression': LogisticRegression(max_iter=2000, solver='liblinear'),
    'MultinomialNB': MultinomialNB(),
    'LinearSVC': LinearSVC(max_iter=2000),
    'RandomForest': RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
}

results = []
for name, clf in models.items():
    pipe = Pipeline([
        ('tfidf', TfidfVectorizer(ngram_range=(1,2), min_df=2, max_df=0.9)),
        ('clf', clf)
    ])
    pipe.fit(X_train, y_train)
    p = pipe.predict(X_val)
    acc = accuracy_score(y_val, p)
    results.append((name, acc))
    print(f'-- {name}: Accuracy={acc:.4f}')
print('\nSummary sorted:')
for r in sorted(results, key=lambda x: x[1], reverse=True):
    print(r)


-- LogisticRegression: Accuracy=0.9398
-- MultinomialNB: Accuracy=0.9385
-- LinearSVC: Accuracy=0.9501
-- RandomForest: Accuracy=0.9281

Summary sorted:
('LinearSVC', 0.9500805152979066)
('LogisticRegression', 0.9398331137461572)
('MultinomialNB', 0.9385155906895037)
('RandomForest', 0.9281217976870151)
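
The accuracies above come from a single 80/20 split, and the models are only about one or two percentage points apart. Cross-validation gives a mean and a spread per model, which makes such small gaps easier to judge. The sketch below is a minimal illustration on a made-up toy corpus: `texts` and `labels` are hypothetical stand-ins for `X_train` and `y_train`, and the scores it prints say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for X_train / y_train.
texts = [
    "shocking video exposes secret plot", "you will not believe this scandal",
    "watch senator destroyed in viral clip", "outrageous hoax shocks internet",
    "senate passes budget bill", "central bank holds interest rates",
    "minister meets trade delegation", "court rules on election appeal",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Stratified folds keep the class ratio the same in every fold.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)

candidates = {"MultinomialNB": MultinomialNB(), "LinearSVC": LinearSVC(random_state=0)}
summary = {}
for name, clf in candidates.items():
    # The vectorizer sits inside the pipeline, so it is refit on each training fold.
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="accuracy")
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

On the real data, a larger `n_splits` (for example 5) would be the more usual choice; 2 is used here only because the toy corpus is tiny.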


## 7) Quick hyperparameter search (using LogisticRegression; LinearSVC scored higher in section 6 but is not tuned here)

In [9]:

param_grid = {
    'tfidf__ngram_range': [(1,1),(1,2)],
    'tfidf__max_df': [0.85, 0.9],
    'tfidf__min_df': [1,2],
    'clf__C': [0.1, 1, 5]
}

grid = GridSearchCV(Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(max_iter=2000, solver='liblinear'))
]), param_grid, cv=3, n_jobs=-1, scoring='f1_macro', verbose=1)

grid.fit(X_train, y_train)
print('Best params:', grid.best_params_)
best_model = grid.best_estimator_

pred_best = best_model.predict(X_val)
print('\nValidation with best model:')
print(classification_report(y_val, pred_best))
print('Accuracy:', accuracy_score(y_val, pred_best))


Fitting 3 folds for each of 24 candidates, totalling 72 fits
Best params: {'clf__C': 5, 'tfidf__max_df': 0.85, 'tfidf__min_df': 2, 'tfidf__ngram_range': (1, 2)}

Validation with best model:
              precision    recall  f1-score   support

           0       0.95      0.95      0.95      3515
           1       0.94      0.95      0.95      3316

    accuracy                           0.95      6831
   macro avg       0.95      0.95      0.95      6831
weighted avg       0.95      0.95      0.95      6831

Accuracy: 0.9478846435368175


## 8) Produce predictions for test set and save them
- Replace original label (which is `2`) with predicted 0/1
- Save as tab-separated, no header, same order: label then headline

In [10]:

# Use best_model from grid if available, otherwise fallback to baseline pipe
model_to_use = globals().get('best_model', pipe_tfidf_lr)

test_preds = model_to_use.predict(test['clean'])

out = test.copy()
out['label'] = test_preds
out_path = r"C:\Users\Chantal Silva\OneDrive\Escritorio\Ironhack\Week_7\Project\project-3-nlp\dataset\predictions.tsv"
out[['label', 'headline']].to_csv(out_path, sep='\t', header=False, index=False)  # write only label and headline, not the helper column 'clean'
print("Saved predictions to:", out_path)
display(out.head())


Saved predictions to: C:\Users\Chantal Silva\OneDrive\Escritorio\Ironhack\Week_7\Project\project-3-nlp\dataset\predictions.tsv


| | label | headline | clean |
|---|---|---|---|
| 0 | 0 | copycat muslim terrorist arrested with assault... | copycat muslim terrorist arrested assault weapons |
| 1 | 0 | wow! chicago protester caught on camera admits... | wow chicago protester caught camera admits vio... |
| 2 | 1 | germany's fdp look to fill schaeuble's big shoes | germany's fdp look fill schaeuble's big shoes |
| 3 | 0 | mi school sends welcome back packet warning ki... | mi school sends welcome back packet warning ki... |
| 4 | 1 | u.n. seeks 'massive' aid boost amid rohingya '... | u n seeks massive aid boost amid rohingya emer... |
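
Because `out` also carries the helper column `clean`, it is worth checking that the saved file really has the two-column layout described at the top of this section. The sketch below writes a small made-up frame to a temporary file and reads it back; the frame contents and the temporary path are placeholders, not the notebook's data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the `out` frame (label, headline, clean).
out = pd.DataFrame({
    "label": [0, 1],
    "headline": ["first example headline", "second example headline"],
    "clean": ["first example headline", "second example headline"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "predictions.tsv")
    # Select only the two columns the output format asks for.
    out[["label", "headline"]].to_csv(path, sep="\t", header=False, index=False)
    # Read the file back without assuming column names, to see what was written.
    check = pd.read_csv(path, sep="\t", header=None)

print("Columns written:", check.shape[1])
print("Labels present:", sorted(check[0].unique()))
```

The same read-back can be run against the real `out_path` to confirm that no placeholder label `2` remains.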


## 9) Save model (optional)

In [11]:
import joblib
import os

model_path = r"C:\Users\Chantal Silva\OneDrive\Escritorio\Ironhack\Week_7\Project\project-3-nlp\best_model.joblib"

joblib.dump(model_to_use, model_path)

print("Model saved to:", model_path)


Model saved to: C:\Users\Chantal Silva\OneDrive\Escritorio\Ironhack\Week_7\Project\project-3-nlp\best_model.joblib
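
A saved pipeline can be restored with `joblib.load` and used for prediction straight away. The sketch below is a round trip on a tiny made-up pipeline (a stand-in for `model_to_use`) in a temporary directory. Note that the saved pipeline contains only the vectorizer and the classifier, so new headlines would still need to go through `preprocess_text` first, and joblib files are generally best loaded with the same scikit-learn version that wrote them.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical tiny pipeline standing in for `model_to_use`.
texts = ["shocking hoax shocks internet", "you will not believe this scandal",
         "senate passes budget bill", "central bank holds interest rates"]
labels = [0, 0, 1, 1]
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(solver="liblinear")),
])
model.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    joblib.dump(model, path)       # save the fitted pipeline
    restored = joblib.load(path)   # load it back as a new object

# The restored pipeline should reproduce the original predictions.
new_texts = ["senate passes new budget", "shocking scandal hoax"]
same = bool((restored.predict(new_texts) == model.predict(new_texts)).all())
print("Predictions match after reload:", same)
```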
