# 📰 News Topic Classification

Goal: Build a model that reads a news article and guesses its topic.

We'll do 2 things:

1. Train a **simple model**: TF-IDF + Naive Bayes  
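2. Train a **deep learning model**: a pretrained transformer (DistilBERT is set up first, then `bert-base-uncased` is the one I actually fine-tune)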

Then we compare how good they are using **F1 score**.
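
Since everything later is scored with the weighted F1, here is a minimal pure-Python sketch of what that number means: the F1 of each class, averaged with weights equal to how many true examples each class has. The function name `weighted_f1` and the toy labels are my own, for illustration only; the notebook itself uses `sklearn.metrics.f1_score(..., average="weighted")`.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives: examples of this class that were predicted as this class
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score


# Toy labels (hypothetical): one "World" article is misread as "Sports"
toy_true = [0, 0, 1, 1, 2, 3]
toy_pred = [0, 1, 1, 1, 2, 3]
toy_score = weighted_f1(toy_true, toy_pred)
print(round(toy_score, 4))  # 0.8222
```

Weighting by support matters little on AG News because the four classes are balanced, but it keeps the metric meaningful on subsets where they are not.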


In [None]:
!pip install -q pandas numpy scikit-learn
!pip install -q torch transformers datasets


## 1. Imports and label names

Here I import all the libraries I'll need.

- `datasets` → to load the AG News dataset
- `sklearn` → for the simple TF-IDF + Naive Bayes model
- `transformers` + `torch` → for BERT


In [None]:
import numpy as np
import pandas as pd

from datasets import load_dataset

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, f1_score

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)

import torch

# AG News has 4 categories:
label_names = ["World", "Sports", "Business", "Sci/Tech"]
label_names


## 2. Load the AG News dataset

I don't have my own dataset, so I'm using **AG News**, a public dataset:

- Each row = a news text
- `label` is a number from 0 to 3 (corresponding to the topics above)


In [None]:
import torch

# Check whether a GPU is available; fine-tuning BERT is much slower on CPU
torch.cuda.is_available()


In [None]:
dataset = load_dataset("ag_news")

dataset


In [None]:
# Let's look at one training example
dataset["train"][0]


## 3. Simple model: TF-IDF + Naive Bayes


- TF-IDF turns text into numbers based on how important words are.
- Naive Bayes looks at those numbers and learns which patterns match which topic.
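
To make those two bullets concrete, here is a small sketch on a made-up four-sentence corpus (the texts and the variable names `toy_texts`, `toy_labels`, `toy_pred` are hypothetical, not from AG News). It uses the same two scikit-learn classes as the real pipeline, just without the n-gram and feature-limit settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: two sports sentences, two business sentences
toy_texts = [
    "team wins the final match",
    "striker scores in cup match",
    "stocks fall as markets close",
    "bank shares and stocks rally",
]
toy_labels = [1, 1, 2, 2]  # 1 = Sports, 2 = Business (same ids as AG News)

# Step 1: TF-IDF turns each sentence into a row of word-importance scores
vectorizer = TfidfVectorizer(stop_words="english")
X_toy = vectorizer.fit_transform(toy_texts)

# Step 2: Naive Bayes learns which words tend to appear with which label
clf = MultinomialNB()
clf.fit(X_toy, toy_labels)

# A new sentence is vectorized with the SAME fitted vocabulary, then classified
new_vec = vectorizer.transform(["markets and stocks drop"])
toy_pred = clf.predict(new_vec)[0]
print(X_toy.shape[0], "documents,", X_toy.shape[1], "features")
print("Predicted label id:", toy_pred)
```

Words the vectorizer never saw during fitting (like "drop" here) are simply ignored at prediction time, which is one reason a bigger training vocabulary helps.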

First, I grab the training and test texts and labels.


In [None]:
train_texts = dataset["train"]["text"]
train_labels = dataset["train"]["label"]

test_texts  = dataset["test"]["text"]
test_labels = dataset["test"]["label"]

len(train_texts), len(test_texts)


In [None]:
# Pipeline: TF-IDF vectorizer + Naive Bayes classifier
nb_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        max_features=30000,      # up to 30k words/phrases
        ngram_range=(1, 2),      # unigrams + bigrams (single words and pairs)
        stop_words="english"     # remove super common words like "the", "is", etc.
    )),
    ("clf", MultinomialNB())
])

# Train the simple model
nb_pipeline.fit(train_texts, train_labels)

# Predict on test set
nb_preds = nb_pipeline.predict(test_texts)

print("=== TF-IDF + Naive Bayes ===")
print(classification_report(test_labels, nb_preds, target_names=label_names))

f1_nb = f1_score(test_labels, nb_preds, average="weighted")
print("Naive Bayes weighted F1:", f1_nb)


## 4. Deep learning model: DistilBERT

Now I use a pretrained model called **DistilBERT**.

My understanding:

- DistilBERT has already learned a lot of English from tons of text.
- I "fine-tune" it on this news dataset so it learns to classify news topics.

First I need a **tokenizer** to turn text into tokens that BERT understands.
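
Before the real tokenizer, a toy sketch of what `truncation`, `padding="max_length"` and the attention mask do. This is not how BERT's tokenizer works internally: it splits on whitespace with a tiny made-up vocabulary, whereas the real one uses WordPiece subwords and adds special `[CLS]`/`[SEP]` tokens. The names `toy_encode` and `toy_vocab` are hypothetical.

```python
def toy_encode(text, vocab, max_length=8, pad_id=0, unk_id=1):
    """Whitespace 'tokenizer' that truncates and pads to a fixed length."""
    # Truncation: keep at most max_length tokens
    tokens = text.lower().split()[:max_length]
    # Map each token to an integer id; unknown words share one id
    input_ids = [vocab.get(tok, unk_id) for tok in tokens]
    # Attention mask: 1 for real tokens, 0 for padding
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }


toy_vocab = {"stocks": 2, "rally": 3, "after": 4, "earnings": 5}
encoded = toy_encode("Stocks rally after strong earnings", toy_vocab, max_length=8)
print(encoded["input_ids"])       # [2, 3, 4, 1, 5, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Every example ends up with the same length, which is what lets the model process them in batches; the mask tells it to ignore the padded positions.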


In [None]:
model_name = "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_batch(batch):
    # This will turn the texts into input_ids + attention_masks
    return tokenizer(
        batch["text"],
        truncation=True,
        padding="max_length",
        max_length=128
    )

# Apply tokenizer to the whole dataset (train + test)
tokenized_dataset = dataset.map(tokenize_batch, batched=True)

# Remove the original text column
tokenized_dataset = tokenized_dataset.remove_columns(["text"])

# Use PyTorch tensors
tokenized_dataset.set_format("torch")

tokenized_dataset


In [None]:
# ⚡ Use a smaller subset so training is faster
small_train_dataset = tokenized_dataset["train"].shuffle(seed=42).select(range(2000))
small_test_dataset  = tokenized_dataset["test"].shuffle(seed=42).select(range(1000))

len(small_train_dataset), len(small_test_dataset)


Now I create the DistilBERT model with a classification head on top.

- `num_labels` = 4 topics


In [None]:
num_labels = len(label_names)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels
)


## 5. Training setup for BERT (bert-base-uncased)

For the BERT part, I need:

- a function to compute **F1 score**
- a smaller subset of the data so training is not too slow
- a tokenizer for `bert-base-uncased`
- Hugging Face `Dataset` objects for train/validation
- training arguments (learning rate, batch size, epochs, etc.)
- a `Trainer` object that handles the training loop


In [None]:
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
import numpy as np
from datasets import Dataset
from transformers import AutoTokenizer

# 1) Metric function for Trainer
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    f1 = f1_score(labels, preds, average="weighted")
    return {"f1": f1}

# 2) Make a smaller subset from the training data for BERT fine-tuning
X = np.array(train_texts)
y = np.array(train_labels)

# For example, take 12,000 samples from the training set
N = 12000
X_small = X[:N]
y_small = y[:N]

# Train/validation split for BERT
X_train_small, X_val_small, y_train_small, y_val_small = train_test_split(
    X_small, y_small,
    test_size=0.2,
    random_state=42,
    stratify=y_small
)

len(X_train_small), len(X_val_small)


In [None]:
# 3) Tokenizer for BERT-base
bert_model_name = "bert-base-uncased"
bert_tokenizer = AutoTokenizer.from_pretrained(bert_model_name)

def tokenize_batch(batch):
    return bert_tokenizer(
        batch["text"],
        truncation=True,
        padding="max_length",
        max_length=128
    )

# 4) Wrap the small train/val splits into Hugging Face Datasets
train_ds_raw = Dataset.from_dict({"text": list(X_train_small), "label": list(y_train_small)})
val_ds_raw   = Dataset.from_dict({"text": list(X_val_small),   "label": list(y_val_small)})

train_ds = train_ds_raw.map(tokenize_batch, batched=True)
val_ds   = val_ds_raw.map(tokenize_batch,   batched=True)

train_ds = train_ds.remove_columns(["text"])
val_ds   = val_ds.remove_columns(["text"])

train_ds.set_format("torch")
val_ds.set_format("torch")

train_ds, val_ds


## 6. Train and evaluate BERT

Now I actually train the `bert-base-uncased` model on the smaller subset and
evaluate it on the validation split.


In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = len(label_names)

# 1) Create BERT-base classification model
model = AutoModelForSequenceClassification.from_pretrained(
    bert_model_name,
    num_labels=num_labels
)

# 2) Training arguments
training_args = TrainingArguments(
    output_dir="./bert-base-agnews-small",
    num_train_epochs=2,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_steps=50
)

# 3) Trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    compute_metrics=compute_metrics
)

print("Trainer OK")


In [None]:
# 4) Train the BERT-base model
trainer.train()

# 5) Evaluate on the small validation set
metrics = trainer.evaluate()

print("=== Raw metrics dictionary ===")
print(metrics)

# Safely grab the F1 score
f1_bert = metrics.get("eval_f1", metrics.get("f1", None))

if f1_bert is not None:
    print("\n=== BERT-base (small subset) ===")
    print("BERT-base weighted F1 on validation set:", f1_bert)
else:
    print("\nNo F1 score found. Check the printed metrics dictionary above.")


## 7. Compare Naive Bayes vs BERT (on the same validation data)

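To compare the two models on the same data, I:

- run **Naive Bayes** on the same validation split (`X_val_small`, `y_val_small`)
- compare its F1 to my fine-tuned **BERT-base** F1 on that same split

One caveat: `nb_pipeline` was fitted on the full training set in section 3, and this validation split is drawn from that same training set, so Naive Bayes has already seen these rows. Its score here is likely optimistic, while BERT-base was fine-tuned on only 9,600 examples and never saw the validation rows.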
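
A stricter version of this comparison would refit Naive Bayes on the training part of the split only, so that neither model has seen the validation rows. The sketch below shows the pattern on hypothetical toy sentences (`toy_X`, `toy_y`, `nb_fair` are my own names); in the notebook the same idea would mean fitting a fresh pipeline on `X_train_small` / `y_train_small` and scoring it on `X_val_small`. I have not rerun the notebook this way, so I don't know how the numbers would change.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical stand-ins for X_small / y_small
toy_X = [
    "team wins the cup final",
    "striker scores twice in derby",
    "coach praises young goalkeeper",
    "club signs star midfielder",
    "stocks slide on weak earnings",
    "bank raises interest rates",
    "oil prices lift energy shares",
    "retailer reports record profit",
]
toy_y = [1, 1, 1, 1, 2, 2, 2, 2]

X_tr, X_va, y_tr, y_va = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y
)

# Fresh pipeline, fitted ONLY on the training part of the split
nb_fair = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
nb_fair.fit(X_tr, y_tr)

# Scored on rows the pipeline has never seen
f1_fair = f1_score(y_va, nb_fair.predict(X_va), average="weighted")
print("Leak-free Naive Bayes weighted F1 (toy data):", f1_fair)
```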


In [None]:
# Naive Bayes on the same small validation split
nb_val_preds = nb_pipeline.predict(X_val_small)
f1_nb_small = f1_score(y_val_small, nb_val_preds, average="weighted")

print("Naive Bayes F1 on small validation set: ", f1_nb_small)
print("BERT-base F1 on small validation set:   ", f1_bert)

# Relative improvement of BERT-base over Naive Bayes, in percent
improvement = (f1_bert - f1_nb_small) / f1_nb_small * 100

print(f"\nRelative F1 improvement (BERT-base vs Naive Bayes on small val set): {improvement:.2f}%")


## 8. Use the trained BERT model on my own text

Now I write a small helper function that:

- takes a string (news text)
- returns the predicted topic as a word (e.g. "Sports")
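
The core of that helper is the step from the model's raw scores (logits) to a topic name. Here is a minimal sketch of just that step in plain Python, with made-up logits; `softmax` and `logits_to_topic` are hypothetical names, and the confidence value is something I add for illustration (the helper in this notebook returns only the label).

```python
import math

label_names = ["World", "Sports", "Business", "Sci/Tech"]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_topic(logits):
    """Pick the highest-scoring class and return (topic name, probability)."""
    probs = softmax(logits)
    pred_id = max(range(len(probs)), key=probs.__getitem__)
    return label_names[pred_id], probs[pred_id]


# Made-up logits for one article: index 1 ("Sports") has the highest score
topic, confidence = logits_to_topic([0.2, 3.1, -0.5, 0.4])
print(topic, round(confidence, 2))
```

Taking the argmax of the logits directly gives the same label as taking it after softmax; softmax is only needed if I also want a probability.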


In [None]:
id2label = {i: name for i, name in enumerate(label_names)}
id2label


In [None]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

device


In [None]:
def predict_topic_bert(text: str):
    model.eval()

    # Tokenize text
    inputs = bert_tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding="max_length",
        max_length=128
    )

    # Move inputs to the same device as the model (GPU or CPU)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        pred_id = logits.argmax(dim=1).item()

    return id2label[pred_id]


# Example
example_text = "The party secured a last-minute win in the election."
print("Text:", example_text)
print("Predicted topic (BERT-base):", predict_topic_bert(example_text))


## 9. What I learned

I used a public dataset called AG News, which contains short news articles labeled into four topics: World, Sports, Business, and Sci/Tech.

I built a strong baseline model using TF-IDF + Naive Bayes, which represents each article using word-importance scores and classifies the text based on patterns learned from these features.

I then implemented a deep learning approach using `bert-base-uncased`. Because fine-tuning BERT on the full training set is computationally heavy, I fine-tuned it on a 12,000-example subset of AG News (9,600 examples for training, 2,400 for validation) that could be trained on the available hardware.

I evaluated both models on the same validation split using the weighted F1 score.

Naive Bayes achieved an F1 score of 0.8941.

My fine-tuned BERT-base model achieved an F1 score of 0.9361,
a relative improvement of about 4.70% (roughly 4.2 points of F1).
This suggests that even a short fine-tuning run (two epochs on 9,600 examples) gave BERT enough data and training steps to outperform the classical baseline on these short texts.
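
For reference, the 4.70% figure is a relative gain, not a difference in F1 points. The arithmetic, using the two scores reported above (the variable names are just for this illustration):

```python
reported_f1_nb = 0.8941    # Naive Bayes, from the run above
reported_f1_bert = 0.9361  # fine-tuned BERT-base, from the run above

# Absolute gain: difference in F1
absolute_gain = reported_f1_bert - reported_f1_nb

# Relative gain: difference as a percentage of the Naive Bayes score
relative_gain_pct = absolute_gain / reported_f1_nb * 100

print(f"Absolute gain: {absolute_gain:.4f}")       # Absolute gain: 0.0420
print(f"Relative gain: {relative_gain_pct:.2f}%")  # Relative gain: 4.70%
```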

Through this comparison, I learned that deep learning models like BERT typically require more data, time, and computational resources to reach their potential, while classical models such as Naive Bayes can perform surprisingly well on short, clean, and structured text.

Overall, this project taught me how to build, fine-tune, and evaluate two very different NLP models (TF-IDF + Naive Bayes and BERT-base) and how to compare them on the same evaluation split with the same metric, the weighted F1 score. The result also shows the value of transformer-based models when given sufficient training data and compute.
