# Task 1: News Topic Classifier Using BERT

## 1. Problem Statement & Objective

### Problem Statement:
News articles come in various categories like World, Sports, Business, and Science/Technology. Manually categorizing them is time-consuming.

### Objective:
Build a machine learning model using a pre-trained BERT transformer to automatically classify news headlines into one of four categories.

You will:
- Load the AG News dataset
- Tokenize and fine-tune a BERT model
- Evaluate performance using accuracy and F1 score
- Optionally deploy the model using Streamlit or Gradio


## Install Libraries

In [3]:
!pip install --upgrade --force-reinstall \
  transformers==4.38.2 \
  accelerate==0.27.2 \
  peft==0.10.0 \
  torch==2.6.0 \
  torchvision==0.21.0 \
  torchaudio==2.6.0 \
  pandas==2.2.2 \
  numpy==1.26.4 \
  fsspec==2025.3.2 \
  pyarrow==14.0.0 \
  scikit-learn


Collecting transformers==4.38.2
  Downloading transformers-4.38.2-py3-none-any.whl.metadata (130 kB)
Collecting accelerate==0.27.2
  Downloading accelerate-0.27.2-py3-none-any.whl.metadata (18 kB)
Collecting peft==0.10.0
  Downloading peft-0.10.0-py3-none-any.whl.metadata (13 kB)
Collecting torch==2.6.0
  Downloading torch-2.6.0-cp311-cp311-manylinux1_x86_64.whl.metadata (28 kB)
Collecting torchvision==0.21.0
  Downloading torchvision-0.21.0-cp311-cp311-manylinux1_x86_64.whl.metadata (6.1 kB)
Collecting torchaudio==2.6.0
  Downloading torchaudio-2.6.0-cp311-cp311-manylinux1_x86_64.whl.metadata

Check that each package is installed and print its version

In [46]:
import importlib.util

def check_package(package_name):
    spec = importlib.util.find_spec(package_name)
    if spec is not None:
        try:
            module = __import__(package_name)
            print(f" {package_name} is installed (version: {module.__version__})")
        except AttributeError:
            print(f" {package_name} is installed")
    else:
        print(f" {package_name} is NOT installed")

#  Check essential packages for BERT fine-tuning
required_packages = [
    "transformers",
    "torch",
    "torchvision",
    "torchaudio",
    "peft",
    "accelerate",
    "pandas",
    "numpy",
    "fsspec",
    "pyarrow",
    "sklearn"
]

for pkg in required_packages:
    check_package(pkg)


 transformers is installed (version: 4.38.2)
 torch is installed (version: 2.6.0+cu124)
 torchvision is installed (version: 0.21.0+cu124)
 torchaudio is installed (version: 2.6.0+cu124)
 peft is installed (version: 0.10.0)
 accelerate is installed (version: 0.27.2)
 pandas is installed (version: 2.2.2)
 numpy is installed (version: 1.26.4)
 fsspec is installed (version: 2025.3.2)
 pyarrow is installed (version: 14.0.0)
 sklearn is installed (version: 1.7.0)
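
The check above prints versions but does not compare them with the pins from the install cell. A minimal sketch of that comparison follows, assuming the pins are copied by hand into a dictionary. The helper names `installed_version`, `normalize_version`, and `matches_pin` are illustrative, not part of any library. Note that `importlib.metadata` looks packages up by distribution name, so scikit-learn would be queried as `"scikit-learn"` rather than `"sklearn"`.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def normalize_version(version):
    """Drop a local version label such as '+cu124' so '2.6.0+cu124' compares as '2.6.0'."""
    return version.split("+", 1)[0]


def matches_pin(installed, pinned):
    """True when the installed version equals the pinned one, ignoring local labels."""
    return installed is not None and normalize_version(installed) == pinned


# Pins copied from the install cell; installed strings copied from the output above.
pins = {"transformers": "4.38.2", "torch": "2.6.0", "numpy": "1.26.4"}
reported = {"transformers": "4.38.2", "torch": "2.6.0+cu124", "numpy": "1.26.4"}

mismatches = [name for name, pin in pins.items() if not matches_pin(reported[name], pin)]
print(mismatches)
```

In a live environment, `reported[name]` could be replaced by `installed_version(name)` to query the actual installation.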


## 2. Dataset Loading & Preprocessing

In [53]:
!mkdir -p ag_news
!wget -O ag_news/train.csv https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv
!wget -O ag_news/test.csv https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/test.csv


--2025-07-13 11:06:39--  https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 29470338 (28M) [text/plain]
Saving to: ‘ag_news/train.csv’


2025-07-13 11:06:40 (260 MB/s) - ‘ag_news/train.csv’ saved [29470338/29470338]

--2025-07-13 11:06:40--  https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/test.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1857427 (1.8M) [text/plain]
Saving to: ‘ag_news/test.csv

Note: only 2,000 training samples and 500 test samples are used, because training on the full dataset would take around 60 hours.

In [54]:
import pandas as pd

# Load the CSV files
train_df = pd.read_csv("ag_news/train.csv", header=None)
test_df = pd.read_csv("ag_news/test.csv", header=None)

# Rename columns for clarity
train_df.columns = ["label", "title", "description"]
test_df.columns = ["label", "title", "description"]

# Convert labels from 1–4 to 0–3
train_df["label"] = train_df["label"] - 1
test_df["label"] = test_df["label"] - 1


# Subsample to 2,000 training rows and 500 test rows for quick experimentation
train_df = train_df.sample(n=2000, random_state=42).reset_index(drop=True)
test_df = test_df.sample(n=500, random_state=42).reset_index(drop=True)
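
A random subsample is not guaranteed to keep the four classes balanced, so it can be worth checking the label counts after the 1–4 to 0–3 shift. The sketch below uses plain Python on a small hypothetical list of raw labels. The helpers `to_zero_based` and `class_counts` are illustrative names, not part of pandas or the dataset.

```python
from collections import Counter

CLASS_NAMES = ["World", "Sports", "Business", "Sci/Tech"]


def to_zero_based(raw_labels):
    """Shift AG News labels from 1-4 to 0-3, rejecting anything out of range."""
    shifted = []
    for raw in raw_labels:
        if raw not in (1, 2, 3, 4):
            raise ValueError(f"unexpected label: {raw!r}")
        shifted.append(raw - 1)
    return shifted


def class_counts(labels):
    """Count zero-based labels per class name, including classes with no samples."""
    counts = Counter(labels)
    return {name: counts.get(i, 0) for i, name in enumerate(CLASS_NAMES)}


raw = [3, 3, 2, 4, 3, 1]  # hypothetical raw labels, not taken from the dataset
labels = to_zero_based(raw)
print(class_counts(labels))
```

With the real data, the same check could be run as `class_counts(train_df["label"].tolist())` once the labels have been shifted.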



In [26]:
train_df.head()


|   | label | title | description |
|---|---|---|---|
| 0 | 2 | BBC set for major shake-up, claims newspaper | London - The British Broadcasting Corporation,... |
| 1 | 2 | Marsh averts cash crunch | Embattled insurance broker #39;s banks agree t... |
| 2 | 1 | Jeter, Yankees Look to Take Control (AP) | AP - Derek Jeter turned a season that started ... |
| 3 | 3 | Flying the Sun to Safety | When the Genesis capsule comes back to Earth w... |
| 4 | 2 | Stocks Seen Flat as Nortel and Oil Weigh | NEW YORK (Reuters) - U.S. stocks were set to ... |


## 3. Tokenize News Titles with BERT Tokenizer

Import Tokenizer

In [55]:
from transformers import AutoTokenizer

# Load BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")




Tokenize the Titles

In [56]:
# Sample data
titles = train_df["title"].tolist()
labels = train_df["label"].tolist()


In [57]:
tokenized = tokenizer(
    titles,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt"
)
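
To make the `padding=True`, `truncation=True`, and `max_length` arguments concrete, here is a toy sketch of what they do to a batch of token-id lists. It is not the Hugging Face implementation: the real tokenizer also preserves the trailing `[SEP]` token when truncating, which this simplification ignores. The ids 101, 102, and 0 are the `[CLS]`, `[SEP]`, and `[PAD]` ids in `bert-base-uncased`; the other ids are made up, and `pad_and_truncate` is a hypothetical helper.

```python
def pad_and_truncate(token_lists, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad all to the longest one in the batch."""
    truncated = [tokens[:max_length] for tokens in token_lists]
    longest = max((len(tokens) for tokens in truncated), default=0)

    input_ids, attention_mask = [], []
    for tokens in truncated:
        n_pad = longest - len(tokens)
        input_ids.append(tokens + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(tokens) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_and_truncate([[101, 7, 8, 102], [101, 9, 102]], max_length=128)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because headlines are short, the batch is padded only to its longest title, which is usually far below the 128-token limit.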


Prepare the Dataset for Training

In [58]:
import torch
from torch.utils.data import Dataset

class NewsDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Create dataset
train_dataset = NewsDataset(tokenized, labels)


For test data

In [59]:
test_titles = test_df["title"].tolist()
test_labels = test_df["label"].tolist()

test_tokenized = tokenizer(
    test_titles,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

test_dataset = NewsDataset(test_tokenized, test_labels)


## 4. Model Development & Training (Fine-Tuning BERT)

 Load Pre-trained BERT Model

In [60]:
from transformers import AutoModelForSequenceClassification

# Load model: 4 output classes (World, Sports, Business, Sci/Tech)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Define Training Arguments

In [61]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    logging_steps=50,
    logging_dir='./logs',
    save_steps=500,
    report_to=[],  # disables wandb
)


 Define Evaluation Metrics

In [62]:
from sklearn.metrics import accuracy_score, f1_score
import numpy as np

def compute_metrics(pred):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=1)
    acc = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds, average="macro")
    return {"accuracy": acc, "f1": f1}
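
For intuition about `average="macro"`, the sketch below computes a macro F1 score by hand: one F1 per class, then an unweighted mean, so every class counts equally regardless of its size. The labels are a small hypothetical example, and `macro_f1` is an illustrative helper rather than scikit-learn's implementation. One difference is that this version scores a class as 0 when it never appears in either list.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined here as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes


y_true = [0, 0, 1, 1, 2, 2, 3, 3]  # hypothetical labels
y_pred = [0, 0, 1, 0, 2, 2, 3, 2]  # hypothetical predictions

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                         # 0.75
print(round(macro_f1(y_true, y_pred, 4), 4))  # 0.7333
```

Here the per-class F1 scores are 0.8, 2/3, 0.8, and 2/3, which average to 11/15.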


 Use Trainer to Fine-Tune

In [63]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)


 Start Training

In [64]:
trainer.train()



| Step | Training Loss |
|---|---|
| 50 | 1.0844 |
| 100 | 0.6908 |
| 150 | 0.6762 |
| 200 | 0.5979 |
| 250 | 0.6473 |
| 300 | 0.3239 |
| 350 | 0.3934 |
| 400 | 0.3283 |
| 450 | 0.3258 |
| 500 | 0.4753 |


Checkpoint destination directory ./results/checkpoint-500 already exists and is non-empty. Saving will proceed but saved results may be invalid.


TrainOutput(global_step=750, training_loss=0.4346499048868815, metrics={'train_runtime': 3436.728, 'train_samples_per_second': 1.746, 'train_steps_per_second': 0.218, 'total_flos': 126418909968000.0, 'train_loss': 0.4346499048868815, 'epoch': 3.0})
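
The `global_step=750` and throughput figures in the output above can be reproduced from the settings used earlier. The sketch assumes a single device, no gradient accumulation, and a final partial batch that is kept rather than dropped. `training_schedule` is an illustrative helper, not a Transformers function.

```python
import math


def training_schedule(num_samples, batch_size, epochs, save_steps):
    """Derive step counts from dataset size and training arguments."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "checkpoints": total_steps // save_steps,
    }


schedule = training_schedule(num_samples=2000, batch_size=8, epochs=3, save_steps=500)
print(schedule)

# Throughput implied by the reported train_runtime of 3436.728 seconds.
train_runtime = 3436.728
samples_per_second = round(2000 * 3 / train_runtime, 3)
steps_per_second = round(schedule["total_steps"] / train_runtime, 3)
print(samples_per_second, steps_per_second)
```

With `save_steps=500` and 750 total steps, only one checkpoint falls inside the run, which matches the `checkpoint-500` directory named in the warning.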

## 5. Evaluate the Model

In [65]:
# Evaluate on test data
eval_result = trainer.evaluate(eval_dataset=test_dataset)
print(" Evaluation results:")
print(eval_result)


 Evaluation results:
{'eval_loss': 0.8082232475280762, 'eval_accuracy': 0.822, 'eval_f1': 0.8234538225513137, 'eval_runtime': 76.9514, 'eval_samples_per_second': 6.498, 'eval_steps_per_second': 0.819, 'epoch': 3.0}


## 6. Make Predictions

In [66]:
# Map label numbers to actual category names
label_map = {
    0: "World",
    1: "Sports",
    2: "Business",
    3: "Sci/Tech"
}


In [67]:
# Pick some test samples
sample_titles = test_df["title"].iloc[:5].tolist()

# Tokenize
inputs = tokenizer(
    sample_titles,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

# Run model in eval mode
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=1)

# Show results
for i, title in enumerate(sample_titles):
    pred_label = predictions[i].item()
    print(f" Title: {title}")
    print(f" Predicted label: {label_map[pred_label]}")
    print("---")



 Title: Fan v Fan: Manchester City-Tottenham Hotspur
 Predicted label: Sports
---
 Title: Paris Tourists Search for Key to 'Da Vinci Code' (Reuters)
 Predicted label: Sci/Tech
---
 Title: Net firms: Don't tax VoIP
 Predicted label: Sci/Tech
---
 Title: Dependent species risk extinction
 Predicted label: Sci/Tech
---
 Title: EDS Is Charter Member of Siebel BPO Alliance (NewsFactor)
 Predicted label: Business
---


In [68]:
custom_titles = [
    "Apple releases new iPhone with AI camera features",
    "NASA plans mission to explore Europa",
    "Cristiano Ronaldo scores winning goal for Portugal",
    "Stock market drops after inflation report"
]

# Tokenize your input
inputs = tokenizer(
    custom_titles,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

# Get predictions (model.eval() was already called above)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=1)

# Label mapping
label_map = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

# Display results
for i, title in enumerate(custom_titles):
    print(f" Title: {title}")
    print(f" Predicted label: {label_map[predictions[i].item()]}")
    print("---")


 Title: Apple releases new iPhone with AI camera features
 Predicted label: Sci/Tech
---
 Title: NASA plans mission to explore Europa
 Predicted label: Sci/Tech
---
 Title: Cristiano Ronaldo scores winning goal for Portugal
 Predicted label: Sports
---
 Title: Stock market drops after inflation report
 Predicted label: Business
---
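
The predictions above report only the top class. A confidence score can be added by passing the logits through a softmax. The sketch below does this in plain Python on hypothetical logits for one headline; with the real model, the equivalent would be `torch.softmax(outputs.logits, dim=1)`. The helper `predict_with_confidence` is an illustrative name.

```python
import math

LABEL_MAP = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}


def softmax(logits):
    """Convert raw logits to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_confidence(logits):
    """Return the predicted category name and its softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_MAP[best], probs[best]


example_logits = [0.1, -1.2, 0.3, 2.5]  # hypothetical logits, not model output
label, confidence = predict_with_confidence(example_logits)
print(f"{label} ({confidence:.2f})")
```

A softmax probability is not a calibrated estimate of correctness, so a high value should be read as a relative score rather than a guarantee.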


## 7. Save the Model

In [43]:
model.save_pretrained("bert-news-topic-model")
tokenizer.save_pretrained("bert-news-topic-model")


('bert-news-topic-model/tokenizer_config.json',
 'bert-news-topic-model/special_tokens_map.json',
 'bert-news-topic-model/vocab.txt',
 'bert-news-topic-model/added_tokens.json',
 'bert-news-topic-model/tokenizer.json')

## 8. Gradio Setup

In [48]:
!pip install gradio




In [69]:
import gradio as gr
import torch

# Label mapping from class number to actual category
label_map = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

# Prediction function
def classify_news(title):
    # Tokenize user input
    inputs = tokenizer(
        title,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt"
    )

    # Predict
    with torch.no_grad():
        outputs = model(**inputs)
        prediction = torch.argmax(outputs.logits, dim=1).item()

    return label_map[prediction]

# Gradio interface
interface = gr.Interface(
    fn=classify_news,
    inputs=gr.Textbox(lines=2, placeholder="Enter a news headline here..."),
    outputs="text",
    title="📰 News Topic Classifier",
    description="Enter any news headline and this app will predict whether it's about World, Sports, Business, or Sci/Tech."
)

# Launch app
interface.launch(debug=False)


It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://974e0461ea3d573c28.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


