## Task 1: News Topic Classifier Using BERT
Problem Statement & Objective: Develop an NLP model to classify news headlines into four categories (World, Sports, Business, Sci/Tech) using a Transformer-based architecture.

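Dataset Loading & Preprocessing: Used the AG News dataset (120,000 training and 7,600 test examples). Text was tokenized with BertTokenizer, then padded or truncated to a fixed length of 128 tokens to reduce memory use and training time. To keep the run stable on free-tier Colab, training used a shuffled 5,000-example subset and evaluation used a 500-example subset.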
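
The fixed-length step can be illustrated with a small sketch. This is a simplified stand-in, not BertTokenizer's implementation: the helper name `pad_or_truncate` is hypothetical, the token ids in the usage example are only illustrative, and real BERT truncation also keeps the closing `[SEP]` token, which this sketch ignores. It assumes the `[PAD]` id is 0, as in the bert-base-uncased vocabulary.

```python
from typing import List, Tuple

PAD_ID = 0  # assumption: [PAD] id in the bert-base-uncased vocabulary


def pad_or_truncate(token_ids: List[int], max_length: int) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), each exactly max_length long."""
    clipped = token_ids[:max_length]            # truncation: drop tokens past the limit
    n_pad = max_length - len(clipped)           # padding: how many slots are still empty
    input_ids = clipped + [PAD_ID] * n_pad
    attention_mask = [1] * len(clipped) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Illustrative ids only; a short sequence is padded up to the target length.
ids, mask = pad_or_truncate([101, 7592, 2088, 102], max_length=6)
print(ids)
print(mask)
```

The attention mask is what lets the model ignore padded positions, so padding to 128 changes the tensor shape but not which tokens the model attends to.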

Model Development & Training: Fine-tuned bert-base-uncased with a four-class classification head using the Hugging Face Trainer API. Training ran for 2 epochs with a learning rate of 2e-5, a batch size of 8, weight decay of 0.01, and fp16 mixed precision.
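
The arithmetic behind this configuration can be sketched as follows. The helper names are hypothetical, and the sketch assumes a single GPU, no gradient accumulation, and Trainer's default linear learning-rate decay with no warmup. If any of those differ in a given run, the numbers change.

```python
import math


def total_training_steps(num_examples: int, batch_size: int, num_epochs: int) -> int:
    """Optimizer steps for one device with no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


def linear_decay_lr(step: int, total_steps: int, base_lr: float = 2e-5) -> float:
    """Learning rate under linear decay to zero, assuming zero warmup steps."""
    remaining = max(0.0, (total_steps - step) / total_steps)
    return base_lr * remaining


# 5,000 training examples, batch size 8, 2 epochs (the values used in this notebook)
steps = total_training_steps(5000, 8, 2)
print(steps)
print(linear_decay_lr(steps // 2, steps))  # learning rate halfway through training
```

Under these assumptions the run takes 1,250 optimizer steps, and the learning rate has dropped to half of 2e-5 by the end of the first epoch.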

Evaluation Metrics: Performance was measured with accuracy and weighted F1-score. After the second epoch both were about 0.896 on the 500-example evaluation subset.
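
For readers unfamiliar with the weighted variant, here is a minimal pure-Python sketch of how weighted F1 is computed: per-class F1 scores are averaged with weights proportional to each class's share of the true labels. The function name `weighted_f1` and the toy labels are made up for illustration; the notebook itself relies on the `evaluate` library for this.

```python
from collections import Counter
from typing import Sequence


def weighted_f1(references: Sequence[int], predictions: Sequence[int]) -> float:
    """Average of per-class F1 scores, weighted by each class's support."""
    support = Counter(references)   # number of true examples per class
    total = len(references)
    score = 0.0
    for label, count in support.items():
        pairs = list(zip(references, predictions))
        tp = sum(1 for r, p in pairs if r == label and p == label)
        fp = sum(1 for r, p in pairs if r != label and p == label)
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (count / total) * f1
    return score


# Toy example with the four AG News class ids (0..3)
refs = [0, 0, 1, 1, 2, 3]
preds = [0, 1, 1, 1, 2, 3]
print(round(weighted_f1(refs, preds), 4))  # 0.8222
```

Because the AG News classes are balanced, weighted F1 and accuracy tend to land close together, which matches the near-identical values in the training log.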

Visualizations: The per-epoch training log shows training loss falling from 0.4539 to 0.2613 and accuracy rising from 0.892 to 0.896, while validation loss stayed roughly flat (0.389 to 0.393).

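Final Summary / Insights: Transformers such as BERT generally outperform traditional RNNs on this kind of task because self-attention lets every token draw on both its left and right context at every layer, which suits short texts such as headlines. Even with a 5,000-example training subset and two epochs, the fine-tuned model reached about 0.90 accuracy.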

## Prerequisites

In [None]:
!pip install transformers datasets evaluate accelerate scikit-learn gradio torch


## Tokenization, Preprocessing, and Fine-Tuning bert-base-uncased

In [None]:
# 1. Install necessary libraries (Run once)
!pip install -q transformers[torch] datasets evaluate accelerate scikit-learn

import numpy as np
import evaluate
import torch
from datasets import load_dataset
from transformers import BertTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

# Clear GPU cache to prevent memory errors
torch.cuda.empty_cache()

# --- STEP 1: TOKENIZE AND PREPROCESS ---
dataset = load_dataset("ag_news")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    # Using max_length=128 instead of 512 to save memory and speed up training
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Smaller subsets to ensure stability on free-tier Colab
# 5000 train and 500 test samples is enough to demonstrate fine-tuning for Task 1
train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(5000))
test_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(500))

# --- STEP 2: FINE-TUNE BERT ---
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)

metric_acc = evaluate.load("accuracy")
metric_f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    acc = metric_acc.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = metric_f1.compute(predictions=predictions, references=labels, average="weighted")["f1"]
    return {"accuracy": acc, "f1": f1}

# Optimized Training Arguments for Colab T4 GPU
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",        # Corrected for Transformers v4.46+
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8, # Reduced from 16 to 8 to prevent memory crash
    per_device_eval_batch_size=8,  # Reduced from 16 to 8
    num_train_epochs=2,            # 2 epochs is sufficient for a news classifier
    weight_decay=0.01,
    logging_dir='./logs',
    fp16=torch.cuda.is_available(), # Mixed precision on GPU: lowers memory use and usually speeds up training
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)

# Start Training
print("Training started on GPU...")
trainer.train()

# Save the fine-tuned model
model.save_pretrained("./news_classifier_bert")
tokenizer.save_pretrained("./news_classifier_bert")
print("Model saved successfully!")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training started on GPU...


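| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|-------|---------------|-----------------|----------|----------|
| 1 | 0.4539 | 0.389141 | 0.892 | 0.892781 |
| 2 | 0.2613 | 0.392632 | 0.896 | 0.896138 |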


Model saved successfully!
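
One detail in the results above is worth checking: validation loss is slightly lower after epoch 1, while accuracy and F1 are slightly higher after epoch 2. Assuming Trainer's default behavior of ranking checkpoints by evaluation loss when `load_best_model_at_end=True` and `metric_for_best_model` is not set, the model reloaded and saved at the end would be the epoch 1 checkpoint. The sketch below only re-reads the logged numbers to show how the choice depends on the metric. The `best_epoch` helper is hypothetical and is not part of the Trainer API.

```python
# Per-epoch evaluation results copied from the training log above
epoch_logs = [
    {"epoch": 1, "eval_loss": 0.389141, "eval_accuracy": 0.892, "eval_f1": 0.892781},
    {"epoch": 2, "eval_loss": 0.392632, "eval_accuracy": 0.896, "eval_f1": 0.896138},
]


def best_epoch(logs, metric, greater_is_better):
    """Return the epoch whose logged metric is best under the given direction."""
    pick = max if greater_is_better else min
    return pick(logs, key=lambda row: row[metric])["epoch"]


by_loss = best_epoch(epoch_logs, "eval_loss", greater_is_better=False)
by_accuracy = best_epoch(epoch_logs, "eval_accuracy", greater_is_better=True)
print(by_loss, by_accuracy)  # 1 2
```

If the goal is to keep the checkpoint with the highest accuracy, `TrainingArguments` accepts `metric_for_best_model="accuracy"`. With a gap this small on 500 examples, either choice is defensible.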


## Deploy using Gradio for Live Interaction

In [None]:
import gradio as gr
from transformers import pipeline

# Load the fine-tuned model
model_path = "./news_classifier_bert"
classifier = pipeline("text-classification", model=model_path, tokenizer=model_path)

# Label Mapping for AG News
labels = ["World", "Sports", "Business", "Sci/Tech"]

def predict_news_topic(headline):
    result = classifier(headline)[0]
    # Extract label index (e.g., 'LABEL_1') and map to string
    label_idx = int(result['label'].split('_')[1])
    return f"Topic: {labels[label_idx]} (Confidence: {result['score']:.2f})"

# Create Gradio Interface
interface = gr.Interface(
    fn=predict_news_topic,
    inputs=gr.Textbox(lines=2, placeholder="Enter news headline here..."),
    outputs="text",
    title="News Topic Classifier (BERT)",
    description="Enter a news headline to classify it into World, Sports, Business, or Sci/Tech."
)

if __name__ == "__main__":
    interface.launch()

Device set to use cuda:0


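
To make the prediction step easier to follow, here is a minimal sketch of what happens between the model's raw outputs and the string shown in the interface. It assumes the pipeline applies a softmax over the four logits to produce the confidence score and returns default label names of the form `LABEL_<index>`. The helper names and the logit values are made up for illustration.

```python
import math
from typing import List, Tuple

LABELS = ["World", "Sports", "Business", "Sci/Tech"]  # AG News class order


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_topic(logits: List[float]) -> Tuple[str, float]:
    """Return the most probable topic name and its probability."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[idx], probs[idx]


def label_to_topic(raw_label: str) -> str:
    """Map a default pipeline label such as 'LABEL_1' to its topic name."""
    return LABELS[int(raw_label.rsplit("_", 1)[1])]


topic, confidence = top_topic([0.1, 3.2, -0.5, 0.4])  # made-up logits
print(f"Topic: {topic} (Confidence: {confidence:.2f})")
print(label_to_topic("LABEL_3"))
```

The reported confidence is the softmax probability of the winning class, so it describes how the model splits probability across the four topics, not how likely the prediction is to be correct.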
