# 📰 Task 1: News Topic Classifier Using BERT

# 📌 Objective
The goal of this task is to fine-tune a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model to classify news headlines into four distinct categories: World, Sports, Business, and Sci/Tech.

We utilize Transfer Learning to leverage BERT's existing knowledge of the English language, adapting it to our specific classification task using the AG News dataset.

# 🛠️ Prerequisites
Before running the code, ensure the following libraries are installed.

In [1]:
!pip install transformers datasets torch accelerate evaluate scikit-learn -q


# ⚙️ Step 1: Data Loading & Exploration
We use the **AG News** dataset, a benchmark dataset for text classification.

What each example looks like:

- **Input:** News text: a headline, often followed by a short description (e.g., "Stock market hits record high").

- **Output:** Class ID (0-3).

- **Mapping:** 0: World, 1: Sports, 2: Business, 3: Sci/Tech.
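
As a small illustration of the mapping above, the sketch below turns a class ID into its category name. The names `AG_NEWS_LABELS` and `label_name` are hypothetical helpers introduced here, and the record is a shortened stand-in for a dataset row.

```python
# Class IDs follow the mapping listed above: list index == class ID.
AG_NEWS_LABELS = ["World", "Sports", "Business", "Sci/Tech"]


def label_name(class_id: int) -> str:
    """Return the human-readable category for an AG News class ID (0-3)."""
    if not 0 <= class_id < len(AG_NEWS_LABELS):
        raise ValueError(f"class ID must be in 0-3, got {class_id}")
    return AG_NEWS_LABELS[class_id]


# A record shaped like a dataset row (text shortened for illustration).
record = {"text": "Wall St. Bears Claw Back Into the Black (Reuters)", "label": 2}
print(label_name(record["label"]))  # Business
```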

In [2]:
# Load Data
from datasets import load_dataset

print("⏳ Loading AG News Dataset...")
dataset = load_dataset("ag_news")

# Print dataset structure to verify
print(dataset['train'][0])

⏳ Loading AG News Dataset...



{'text': "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.", 'label': 2}


# 🧹 Step 2: Data Preprocessing (Tokenization)
Neural networks cannot read raw text; they need numbers. We use the BERT Tokenizer to convert text into "Input IDs".

**Key Technical Details:**

- **Padding:** Extends shorter inputs to a fixed length (128 tokens) so that every example fits into one rectangular matrix.

- **Truncation:** Cuts off inputs longer than 128 tokens.

- **Attention Mask:** Marks real tokens with 1 and padding positions with 0, so the model ignores the padding.
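
To make these three ideas concrete, here is a minimal pure-Python sketch of padding, truncation, and the attention mask. It is not the tokenizer's actual implementation: the helper `pad_and_truncate` is hypothetical, the token IDs other than 101 (`[CLS]`), 102 (`[SEP]`), and 0 (`[PAD]`) are placeholders, and unlike the real tokenizer it does not preserve the closing `[SEP]` when truncating.

```python
PAD_ID = 0  # ID of the [PAD] token in the bert-base-uncased vocabulary


def pad_and_truncate(token_ids, max_length):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    kept = token_ids[:max_length]            # truncation: drop anything past max_length
    n_pad = max_length - len(kept)           # how many padding slots are needed
    input_ids = kept + [PAD_ID] * n_pad      # padding: fill the rest with [PAD]
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Short input: 101 = [CLS], 102 = [SEP]; 7, 8, 9 are placeholder word IDs.
short_ids, short_mask = pad_and_truncate([101, 7, 8, 9, 102], max_length=8)
print(short_ids)   # [101, 7, 8, 9, 102, 0, 0, 0]
print(short_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]

# Long input: ten placeholder IDs are cut down to eight, with no padding.
long_ids, long_mask = pad_and_truncate(list(range(1, 11)), max_length=8)
print(long_ids)    # [1, 2, 3, 4, 5, 6, 7, 8]
print(long_mask)   # [1, 1, 1, 1, 1, 1, 1, 1]
```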

In [3]:
# Tokenization
from transformers import AutoTokenizer

model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

print("⚙️ Tokenizing data...")
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# OPTIONAL: Use a smaller subset for faster training (Speed Run)
# Remove .select() if you want to train on the full dataset (120k rows)
train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(2000))
eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(500))

print(f"✅ Training Set: {len(train_dataset)} examples")
print(f"✅ Validation Set: {len(eval_dataset)} examples")

⚙️ Tokenizing data...


✅ Training Set: 2000 examples
✅ Validation Set: 500 examples


# 📏 Step 3: Define Evaluation Metrics
We use accuracy and F1-score to measure performance. F1 is the harmonic mean of precision and recall; because this is a multi-class task, we report the weighted average of the per-class F1 scores.
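
For intuition, the sketch below computes accuracy and a support-weighted F1 by hand on a tiny made-up example. The functions `accuracy` and `weighted_f1` are hypothetical helpers for illustration only; the notebook itself relies on the `evaluate` library, and the labels and predictions here are not model outputs.

```python
from collections import Counter


def accuracy(preds, labels):
    """Fraction of predictions that match the reference labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def weighted_f1(preds, labels):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(labels)  # number of true examples per class
    total = 0.0
    for cls, n in support.items():
        tp = sum(p == cls and y == cls for p, y in zip(preds, labels))
        fp = sum(p == cls and y != cls for p, y in zip(preds, labels))
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n  # weight the class F1 by its support
    return total / len(labels)


# Made-up toy data: one "World" (0) example is misclassified as "Sports" (1).
labels = [0, 0, 1, 1, 2, 3]
preds = [0, 1, 1, 1, 2, 3]
print(f"accuracy    = {accuracy(preds, labels):.4f}")
print(f"weighted F1 = {weighted_f1(preds, labels):.4f}")
```

The single mistake lowers recall for class 0 and precision for class 1, so both classes contribute an F1 below 1.0 to the weighted average.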

In [4]:
# Metrics
import evaluate
import numpy as np

accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Compute Accuracy
    acc = accuracy_metric.compute(predictions=predictions, references=labels)
    # Compute F1 (Weighted for multi-class)
    f1 = f1_metric.compute(predictions=predictions, references=labels, average="weighted")

    return {**acc, **f1}


# 🧠 Step 4: Model Initialization
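We load the model through `AutoModelForSequenceClassification` with `num_labels=4`; for this checkpoint it resolves to `BertForSequenceClassification`. This adds a freshly initialized classification layer on top of the pre-trained BERT body, which is why the output below warns that `classifier.weight` and `classifier.bias` are newly initialized.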
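
As a rough picture of what that new layer is, the sketch below builds a single linear layer that maps a 768-dimensional sentence vector (the hidden size of `bert-base-uncased`) to 4 logits. It is pure Python for illustration, not the library's code: `init_head` and `head_forward` are hypothetical names, the 0.02 initialization scale is illustrative, and `pooled` is a stand-in vector rather than a real BERT output.

```python
import random

HIDDEN_SIZE = 768  # hidden size of bert-base-uncased
NUM_LABELS = 4     # World, Sports, Business, Sci/Tech


def init_head(hidden_size, num_labels, seed=0):
    """Create a randomly initialized linear layer: weight matrix plus bias."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
              for _ in range(num_labels)]
    bias = [0.0] * num_labels
    return weight, bias


def head_forward(pooled, weight, bias):
    """Compute one logit per class: logits = weight @ pooled + bias."""
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weight, bias)]


weight, bias = init_head(HIDDEN_SIZE, NUM_LABELS)
n_params = sum(len(row) for row in weight) + len(bias)
print(n_params)  # 3076 = 768 * 4 + 4

pooled = [0.1] * HIDDEN_SIZE  # stand-in for BERT's pooled sentence vector
logits = head_forward(pooled, weight, bias)
print(len(logits))  # 4
```

These few thousand parameters are the only ones that start from scratch; fine-tuning updates them together with the pre-trained weights.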

In [5]:
# Loading Model
from transformers import AutoModelForSequenceClassification

num_labels = 4
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, num_labels=num_labels)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# 🏋️ Step 5: Training Configuration
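We configure `TrainingArguments` and pass them to the `Trainer`.

- `eval_strategy="epoch"`: Evaluates the model at the end of every epoch.

- `logging_steps=50`: Logs the training loss every 50 steps. With 2,000 examples and a batch size of 16, an epoch is only 125 steps, so the default interval of 500 would never log and the results table would show "No log" for the training loss.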
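
The step arithmetic behind the logging interval can be checked with a short sketch. It assumes a single device, no gradient accumulation, and that the last partial batch is kept; `training_schedule` is a hypothetical helper, and 500 is the library's default `logging_steps`.

```python
import math


def training_schedule(num_examples, batch_size, num_epochs, logging_steps):
    """Return (steps per epoch, total steps, steps at which loss is logged)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    logged_at = list(range(logging_steps, total_steps + 1, logging_steps))
    return steps_per_epoch, total_steps, logged_at


# Settings used in this notebook: 2,000 examples, batch size 16, 3 epochs.
steps_per_epoch, total_steps, logged_at = training_schedule(2000, 16, 3, 50)
print(steps_per_epoch)  # 125
print(total_steps)      # 375
print(logged_at)        # [50, 100, 150, 200, 250, 300, 350]

# With the default interval of 500, the run ends before the first log step.
_, _, default_logs = training_schedule(2000, 16, 3, 500)
print(default_logs)     # []
```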

In [6]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir="bert-news-classifier",
    eval_strategy="epoch",       # Evaluate every epoch
    save_strategy="epoch",       # Save checkpoint every epoch
    learning_rate=2e-5,          # Commonly used learning rate for fine-tuning BERT
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_steps=50,            # Log loss every 50 steps
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

print("🚀 Starting BERT Fine-Tuning...")
trainer.train()
print("🎉 Training Complete!")

🚀 Starting BERT Fine-Tuning...


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.4645 | 0.389262 | 0.882 | 0.881751 |
| 2 | 0.2506 | 0.374025 | 0.89 | 0.889904 |
| 3 | 0.1531 | 0.375417 | 0.89 | 0.889982 |


🎉 Training Complete!


# 🔍 Step 6: Inference (Real-World Test)
We create a pipeline to test the model on unseen headlines to demonstrate real-world applicability.

In [7]:
# Testing
import torch
from transformers import pipeline

# Load the trained model into a pipeline (GPU if available, otherwise CPU)
device = 0 if torch.cuda.is_available() else -1
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer, device=device)

# Map labels manually (AG News classes)
label_map = {
    "LABEL_0": "World",
    "LABEL_1": "Sports",
    "LABEL_2": "Business",
    "LABEL_3": "Sci/Tech"
}

# Test Headlines
headlines = [
    "Stock market hits record high as tech giants rally.",
    "Manchester United wins the championship in a stunning game.",
    "New AI model solves complex physics problems.",
    "Peace treaty signed between the two nations."
]

print("\n🔍 INFERENCE RESULTS:")
for text in headlines:
    result = classifier(text)[0]
    human_label = label_map.get(result['label'], result['label'])
    print(f"📰 Text: '{text}'")
    print(f"🏷️ Prediction: {human_label} (Confidence: {result['score']:.2f})\n")

Device set to use cuda:0



🔍 INFERENCE RESULTS:
📰 Text: 'Stock market hits record high as tech giants rally.'
🏷️ Prediction: Business (Confidence: 0.96)

📰 Text: 'Manchester United wins the championship in a stunning game.'
🏷️ Prediction: Sports (Confidence: 0.98)

📰 Text: 'New AI model solves complex physics problems.'
🏷️ Prediction: Sci/Tech (Confidence: 0.98)

📰 Text: 'Peace treaty signed between the two nations.'
🏷️ Prediction: World (Confidence: 0.96)
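
The confidence values above are probabilities. Assuming the pipeline's default behavior for a single-label, multi-class model, the score is the softmax of the model's logits for the winning class. The sketch below shows that conversion in pure Python; the logits are hypothetical numbers chosen for illustration, not values produced by the trained model.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


LABELS = ["World", "Sports", "Business", "Sci/Tech"]

logits = [-1.2, -0.8, 3.9, 0.3]  # hypothetical logits for one headline
probs = softmax(logits)

# The predicted class is the one with the highest probability.
best = max(range(len(probs)), key=probs.__getitem__)
print(f"Prediction: {LABELS[best]} (Confidence: {probs[best]:.2f})")
```

A high score means the model strongly prefers one class over the others; it does not by itself guarantee that the prediction is correct.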



# 💾 Step 7: Saving the Model and Configuration for Further Use



In [8]:
# Saving the Model & Tokenizer
# save_pretrained writes the weights, config, and tokenizer files to a directory that from_pretrained can load later
model_path = "./bert_news_model"

print(f"💾 Saving model to {model_path}...")
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)
print("✅ Model saved successfully!")

# Zip it up so you can download it easily from Colab
!zip -r bert_news_model.zip bert_news_model
print("📦 Zipped! Check your files tab to download 'bert_news_model.zip'")

💾 Saving model to ./bert_news_model...
✅ Model saved successfully!
  adding: bert_news_model/ (stored 0%)
  adding: bert_news_model/special_tokens_map.json (deflated 42%)
  adding: bert_news_model/tokenizer_config.json (deflated 75%)
  adding: bert_news_model/config.json (deflated 52%)
  adding: bert_news_model/vocab.txt (deflated 53%)
  adding: bert_news_model/model.safetensors (deflated 7%)
  adding: bert_news_model/tokenizer.json (deflated 71%)
📦 Zipped! Check your files tab to download 'bert_news_model.zip'
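
Before relying on a saved or unzipped copy, it can help to confirm that nothing is missing. The sketch below is one simple way to do that, assuming the directory should contain exactly the files shown in the zip listing above; `EXPECTED_FILES` and `missing_files` are hypothetical names introduced here.

```python
from pathlib import Path

# File names taken from the zip listing above.
EXPECTED_FILES = [
    "config.json",
    "model.safetensors",
    "special_tokens_map.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "vocab.txt",
]


def missing_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


missing = missing_files("./bert_news_model")
print("All files present" if not missing else f"Missing: {missing}")
```

Once the directory is complete, passing its path to `AutoModelForSequenceClassification.from_pretrained` and `AutoTokenizer.from_pretrained` reloads the fine-tuned model and tokenizer.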
