# Task 1: News Topic Classifier Using BERT (AG News)
**Objective:** Fine-tune `distilbert-base-uncased` (a distilled variant of BERT) to classify news items (headline plus short description) into the 4 AG News topics.

**What this notebook does (high-level)**
1. Install required libraries and check GPU.
2. Load the AG News dataset with Hugging Face `datasets`.
3. Tokenize and preprocess (BERT WordPiece tokenizer).
4. Fine-tune `distilbert-base-uncased` using the Hugging Face `Trainer`.
5. Evaluate (accuracy, F1, precision, recall).
6. Create a simple **Gradio** app to try the model live.

**Notes for beginners**
- This notebook uses the `datasets` and `transformers` libraries, which handle dataset fetching and training.
- If you run out of GPU memory during training, reduce `per_device_train_batch_size` (e.g., to 8).


In [1]:
# Install required libraries
!pip install -q transformers datasets evaluate accelerate gradio scikit-learn



In [2]:
# Check GPU availability and versions
import torch, transformers, datasets
print("Torch version:", torch.__version__)
print("Transformers version:", transformers.__version__)
print("Datasets version:", datasets.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))


Torch version: 2.8.0+cu126
Transformers version: 4.55.2
Datasets version: 4.0.0
CUDA available: True
GPU: Tesla T4


## 1) Load the AG News dataset
We will use the `ag_news` dataset from the Hugging Face Hub, loaded with `datasets`. It has 4 classes:
- 0 → World
- 1 → Sports
- 2 → Business
- 3 → Sci/Tech
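
The class ids listed here can be kept in code as a pair of dictionaries, so that integer predictions can be turned back into topic names. A minimal sketch; the helper name `decode_label` is illustrative and not part of the notebook:

```python
# Label ids as listed for AG News in this section.
id2label = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}
# Reverse mapping, useful when encoding string labels or configuring a model.
label2id = {name: idx for idx, name in id2label.items()}


def decode_label(label_id: int) -> str:
    """Return the topic name for an integer class id (hypothetical helper)."""
    return id2label[label_id]


print(decode_label(2))
```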


In [3]:
# Load AG News
from datasets import load_dataset

raw_datasets = load_dataset("ag_news")
raw_datasets


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

## 2) Prepare tokenizer & tokenize dataset
We will use the `distilbert-base-uncased` tokenizer with `max_length=128`. We tokenize the **text** column (headline plus description), padding or truncating every example to 128 tokens, and keep the labels.
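
To make the effect of fixed-length padding and truncation concrete, here is a toy sketch in plain Python. It is not the real tokenizer: the function `pad_or_truncate` and the token ids are made up for illustration, and unlike the Hugging Face tokenizer it does not preserve a trailing special token when truncating.

```python
from typing import Dict, List


def pad_or_truncate(token_ids: List[int], max_length: int, pad_id: int = 0) -> Dict[str, List[int]]:
    """Toy illustration of fixed-length padding/truncation (hypothetical helper)."""
    ids = token_ids[:max_length]          # truncation: drop tokens past max_length
    n_real = len(ids)
    n_pad = max_length - n_real
    return {
        "input_ids": ids + [pad_id] * n_pad,            # pad on the right
        "attention_mask": [1] * n_real + [0] * n_pad,   # 1 = real token, 0 = padding
    }


# Arbitrary example ids; a short input gets padded, a long one gets cut.
short = pad_or_truncate([101, 2088, 2739, 102], max_length=6)
long = pad_or_truncate(list(range(10)), max_length=6)
print(short)
print(long)
```

Every example ends up with the same length, which is why the tokenized dataset can be converted to tensors directly.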


In [13]:
# Tokenize dataset
from transformers import BertTokenizerFast

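MODEL_NAME = "distilbert-base-uncased"
# NOTE: BertTokenizerFast is loaded from a DistilBERT checkpoint. Both use the same
# uncased WordPiece vocabulary, so tokenization works, but this triggers the
# class-mismatch warning below and adds a `token_type_ids` column that DistilBERT
# does not use. AutoTokenizer.from_pretrained(MODEL_NAME) would avoid both.
tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)

max_length = 128

def preprocess(batch):
    # AG News stores headline + description in the 'text' column; use it as the model input.
    # Every example is padded or truncated to exactly max_length tokens.
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=max_length)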

tokenized_datasets = raw_datasets.map(preprocess, batched=True, remove_columns=["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DistilBertTokenizer'. 
The class this function is called from is 'BertTokenizerFast'.


DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 7600
    })
})

In [14]:
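# Per-class report on the test set.
# NOTE: this cell needs `trainer` from the training cell (In [25]), so run it after training.
# The report below is at chance level (0.24 accuracy), which suggests the model was not yet fine-tuned.
from sklearn.metrics import classification_report

# Class names in label-id order (0..3); `labels` is not defined elsewhere in this notebook.
labels = ["World", "Sports", "Business", "Sci/Tech"]

preds_output = trainer.predict(tokenized_datasets["test"])
y_true = preds_output.label_ids
y_pred = preds_output.predictions.argmax(axis=1)

print(classification_report(y_true, y_pred, target_names=labels))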


              precision    recall  f1-score   support

       World       0.25      0.87      0.38      1900
      Sports       0.09      0.01      0.01      1900
    Business       0.28      0.01      0.02      1900
    Sci/Tech       0.16      0.06      0.08      1900

    accuracy                           0.24      7600
   macro avg       0.19      0.24      0.13      7600
weighted avg       0.19      0.24      0.13      7600
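
As a sanity check on how the numbers in this report relate to each other: with equal support per class, overall accuracy is the mean of the per-class recalls. The sketch below recomputes it from the rounded recall values in the report, so the result is only approximate (about 0.24).

```python
# Per-class recall and support copied from the classification report.
recall = {"World": 0.87, "Sports": 0.01, "Business": 0.01, "Sci/Tech": 0.06}
support = {name: 1900 for name in recall}

# Correct predictions per class = recall * support;
# accuracy = total correct / total examples.
correct = sum(recall[name] * support[name] for name in recall)
total = sum(support.values())
accuracy = correct / total

print(round(accuracy, 4))
```

Almost all of the correct predictions come from the World class (recall 0.87), which is what a model that predicts one class for nearly every input looks like.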



## 3) Create train/validation splits
Hugging Face `ag_news` already has `train` and `test`. We'll take a small validation split from train to evaluate during training.
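
Conceptually, a seeded 90/10 split shuffles the row indices once and cuts off a slice. The sketch below shows the idea in plain Python; `split_indices` is a hypothetical helper, and it will not reproduce the exact rows that `train_test_split` selects.

```python
import random


def split_indices(n_rows: int, test_size: float, seed: int):
    """Shuffle row indices with a fixed seed and cut off a validation slice."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # deterministic for a given seed
    n_val = round(n_rows * test_size)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(120_000, test_size=0.1, seed=42)
print(len(train_idx), len(val_idx))  # 108000 12000
```

Fixing the seed makes the split reproducible, so validation metrics stay comparable across runs.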


In [15]:
# Create train/validation/test splits
from datasets import DatasetDict

train_val = tokenized_datasets["train"].train_test_split(test_size=0.1, seed=42)
ds = DatasetDict({
    "train": train_val["train"],
    "validation": train_val["test"],
    "test": tokenized_datasets["test"]
})
ds


DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 108000
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 12000
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 7600
    })
})

## 4) Define model, training args, and metrics
We'll use `AutoModelForSequenceClassification` with 4 labels and the `Trainer` API. Metrics are accuracy plus weighted F1, precision, and recall.


In [16]:
# Setup model and compute_metrics
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

num_labels = 4
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=num_labels)

def compute_metrics(pred):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=1)
    acc = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds, average="weighted")
    prec = precision_score(labels, preds, average="weighted")
    rec = recall_score(labels, preds, average="weighted")
    return {"accuracy": acc, "f1": f1, "precision": prec, "recall": rec}

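# Defined for completeness; the Trainer cell below does not pass it, and inputs are already padded to max_length.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)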


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Training configuration
- Keep epochs small initially (e.g., 2) to finish quickly on Colab.
- The run below uses `num_train_epochs=4`. In its log, validation loss rises after epoch 1 and accuracy peaks at epoch 2, so extra epochs are not guaranteed to help.
- Reduce batch size if you run out of memory.
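
To get a feel for how batch size and epochs translate into training time, the sketch below estimates the number of optimizer steps. It assumes a single GPU and uses `gradient_accumulation_steps`, a `TrainingArguments` option this notebook does not set, to show one way to halve the per-device batch size while keeping the effective batch size at 16. The function name is illustrative and the step count is an approximation of what `Trainer` reports.

```python
import math


def optimizer_steps(n_examples, per_device_batch_size, epochs,
                    grad_accum_steps=1, n_devices=1):
    """Return (effective batch size, approximate total optimizer steps)."""
    effective_batch = per_device_batch_size * grad_accum_steps * n_devices
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


# Settings used in this notebook: 120,000 training rows, batch size 16, 4 epochs.
print(optimizer_steps(120_000, 16, 4))
# Lower-memory variant (assumption): batch size 8 with 2 accumulation steps.
print(optimizer_steps(120_000, 8, 4, grad_accum_steps=2))
```

Both configurations take the same number of optimizer steps; the second one holds fewer examples in GPU memory at a time.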


In [25]:
from transformers import TrainingArguments, Trainer

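# Define training arguments (`eval_strategy` is the newer name for `evaluation_strategy`)
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",         # evaluate at the end of every epoch
    save_strategy="epoch",         # save a checkpoint every epoch
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=4,
    weight_decay=0.01,
    logging_steps=50,
    report_to="none"               # disables wandb etc.
)

# Initialize Trainer
# NOTE: this trains on the full 120,000-row train split and evaluates on the test split
# every epoch, so the "Validation" columns in the log below are test-set metrics.
# To use the split from section 3, pass ds["train"] and ds["validation"] instead.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    processing_class=tokenizer,    # `tokenizer=` is deprecated in recent transformers releases
    compute_metrics=compute_metrics,
)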

# Train the model
trainer.train()

# ✅ Save both model and tokenizer together
trainer.save_model("news-bert-model")
tokenizer.save_pretrained("news-bert-model")



| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.1754 | 0.177009 | 0.941842 | 0.94191 | 0.942005 | 0.941842 |
| 2 | 0.1312 | 0.180551 | 0.946842 | 0.946912 | 0.947109 | 0.946842 |
| 3 | 0.0855 | 0.222765 | 0.943947 | 0.94397 | 0.944246 | 0.943947 |
| 4 | 0.0526 | 0.259935 | 0.944868 | 0.944901 | 0.944982 | 0.944868 |
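
The log shows training loss falling every epoch while validation loss rises, so the last epoch is not necessarily the best checkpoint. The sketch below picks the best epoch from the logged values. The notebook saves the final (epoch 4) weights; `TrainingArguments` offers `load_best_model_at_end=True` together with `metric_for_best_model` for restoring the best checkpoint instead, which this run does not use.

```python
# (epoch, validation_loss, accuracy) copied from the training log.
log = [
    (1, 0.177009, 0.941842),
    (2, 0.180551, 0.946842),
    (3, 0.222765, 0.943947),
    (4, 0.259935, 0.944868),
]

best_by_loss = min(log, key=lambda row: row[1])[0]       # lowest validation loss
best_by_accuracy = max(log, key=lambda row: row[2])[0]   # highest accuracy

print(best_by_loss, best_by_accuracy)  # 1 2
```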


('news-bert-model/tokenizer_config.json',
 'news-bert-model/special_tokens_map.json',
 'news-bert-model/vocab.txt',
 'news-bert-model/added_tokens.json',
 'news-bert-model/tokenizer.json')

## 5) Evaluate on Test set
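We run evaluation on the test set and print the metrics. Because the Trainer above already evaluated on this same test split after every epoch, these numbers match the epoch-4 row of the training log, and the test set is not strictly held out here.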


In [29]:
# Evaluate on test set
metrics = trainer.evaluate(ds["test"])
print(metrics)


{'eval_loss': 0.25993505120277405, 'eval_accuracy': 0.9448684210526316, 'eval_f1': 0.9449006135528849, 'eval_precision': 0.9449821157258761, 'eval_recall': 0.9448684210526316, 'eval_runtime': 8.561, 'eval_samples_per_second': 887.75, 'eval_steps_per_second': 55.484, 'epoch': 4.0}


In [28]:
# ============================
# Deploying the model with Gradio
# ============================


import gradio as gr
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification, pipeline

# Define label mapping (AG News has 4 classes)
id2label = {
    0: "World",
    1: "Sports",
    2: "Business",
    3: "Sci/Tech"
}
label2id = {v: k for k, v in id2label.items()}

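# Load trained model + tokenizer
# NOTE: the saved tokenizer is recorded as a BERT tokenizer, so loading it with
# DistilBertTokenizerFast triggers the class-mismatch warning shown in the output.
model_path = "news-bert-model"
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)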
model = DistilBertForSequenceClassification.from_pretrained(
    model_path,
    id2label=id2label,
    label2id=label2id
)

# Create pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Prediction function
def predict_news_topic(text):
    preds = classifier(text, truncation=True, max_length=512)
    label = preds[0]['label']
    score = round(preds[0]['score'], 3)
    return f"Predicted Topic: {label} (confidence: {score})"

# Gradio UI
demo = gr.Interface(
    fn=predict_news_topic,
    inputs=gr.Textbox(lines=3, placeholder="Enter a news headline..."),
    outputs="text",
    title="📰 News Topic Classifier (DistilBERT)",
    description="Fine-tuned DistilBERT model on AG News dataset. Enter a news headline and get its topic prediction."
)

# Launch the app
demo.launch(share=True)


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'DistilBertTokenizerFast'.
Device set to use cuda:0


Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://2eb6d031c435b99282.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
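
For a single-label classifier like this one, the `score` returned by the text-classification pipeline is the softmax probability of the top class. The sketch below reproduces that step in plain Python; the logits are made up for illustration and do not come from the trained model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

logits = [0.2, 3.1, -1.0, 0.5]   # hypothetical model output for one headline
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)

print(f"Predicted Topic: {id2label[best]} (confidence: {round(probs[best], 3)})")
```

A high score means the model strongly prefers one class over the others; it is not a calibrated probability that the prediction is correct.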




In [None]:
import nbformat
from google.colab import _message

# 1. Get current notebook JSON
nb_json = _message.blocking_request('get_ipynb')['ipynb']

# 2. Parse into nbformat object
nb = nbformat.from_dict(nb_json)

# 3. Remove problematic metadata at notebook level
if "widgets" in nb["metadata"]:
    print("Removing notebook-level widgets metadata...")
    del nb["metadata"]["widgets"]

if "application/vnd.jupyter.widget-state+json" in nb["metadata"]:
    print("Removing notebook-level widget-state metadata...")
    del nb["metadata"]["application/vnd.jupyter.widget-state+json"]

# 4. Remove problematic metadata from each cell
for cell in nb.cells:
    if "metadata" in cell:
        if "widgets" in cell["metadata"]:
            del cell["metadata"]["widgets"]
        if "application/vnd.jupyter.widget-view+json" in cell["metadata"]:
            del cell["metadata"]["application/vnd.jupyter.widget-view+json"]

    # Also clean outputs if present
    if "outputs" in cell:
        for out in cell["outputs"]:
            if "data" in out:
                if "application/vnd.jupyter.widget-view+json" in out["data"]:
                    del out["data"]["application/vnd.jupyter.widget-view+json"]

# 5. Save cleaned copy
clean_path = "Task_1_News_Topic_Classifier_Using_BERT_2_clean.ipynb"
with open(clean_path, "w", encoding="utf-8") as f:
    nbformat.write(nb, f)

print("✅ Cleaned notebook saved as", clean_path)
