# News headline classifier

Fine-tuning a transformer model (BERT) on the AG News dataset to categorize news headlines.

Steps:  
1: Load the dataset using Hugging Face `datasets`  
2: Explore the dataset: print its columns and some sample rows  
3:

## Change runtime to CUDA
As a first step, I switch the Colab runtime to a CUDA GPU and check that the change took effect. The Hugging Face `Trainer` used later detects the GPU and uses it automatically.

In [21]:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

cuda


## Loading AG News dataset
`datasets` is the Hugging Face datasets library. Behind the scenes it uses the `HF_TOKEN` environment variable to authenticate with Hugging Face when downloading the dataset. If you have not set that token on your system or in Colab secrets, copy it from your Hugging Face profile and add it here.

In [22]:
from datasets import load_dataset

ds = load_dataset("fancyzhx/ag_news")
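
Before calling `load_dataset`, it can help to confirm that a token is actually visible to the process. The snippet below is a minimal sketch using only the standard library; the helper name `hf_token_status` is hypothetical, and it reports only the length of the token so the secret never ends up in the notebook output.

```python
import os

def hf_token_status(env=None):
    """Report whether an HF_TOKEN value is visible, without printing the secret."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "")
    if not token:
        return "HF_TOKEN is not set"
    return f"HF_TOKEN is set ({len(token)} characters)"

print(hf_token_status())
```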

## Exploring the dataset

Below is a `DatasetDict`, which holds two datasets (`train` and `test`).

In [23]:
ds

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

In [24]:
ds.keys()

dict_keys(['train', 'test'])

In [25]:
ds["train"].column_names

['text', 'label']

In [26]:
ds["train"][1:2]

{'text': ['Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.'],
 'label': [2]}

In [27]:
ds["train"].features

{'text': Value('string'),
 'label': ClassLabel(names=['World', 'Sports', 'Business', 'Sci/Tech'])}
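
The `ClassLabel` feature above fixes the mapping between the integer labels and the class names. As a minimal sketch, the same mapping can be written in plain Python (the `datasets` library exposes it through `ClassLabel.int2str` and `ClassLabel.str2int`); the variable names `id2label` and `label2id` are chosen here for illustration.

```python
# Class names in the order reported by ds["train"].features
names = ["World", "Sports", "Business", "Sci/Tech"]

id2label = dict(enumerate(names))                      # 0 -> "World", ...
label2id = {name: i for i, name in id2label.items()}   # "World" -> 0, ...

print(id2label[2])           # Business
print(label2id["Sci/Tech"])  # 3
```

This is why the two sample rows shown in this section, both with label 2, are Business stories.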

In [28]:
ds["train"][0]

{'text': "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.",
 'label': 2}

## Tokenization

In [29]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Define the tokenization function with some configuration: pad every example to `max_length`, and truncate anything longer than 128 tokens.

In [30]:
def tokenize_fn(batch):
    return tokenizer(
        batch["text"],
        padding="max_length",
        truncation=True,
        max_length=128
    )
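
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a simplified sketch in plain Python. The helper `pad_or_truncate` is hypothetical and ignores one detail of the real tokenizer, which keeps the closing `[SEP]` token when it truncates; the sketch simply cuts the list.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]        # truncation: drop anything past max_length
    n_pad = max_length - len(kept)       # padding: fill the remainder with pad_id
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

ids, mask = pad_or_truncate([101, 2813, 2358, 1012, 102], max_length=8)
print(ids)   # [101, 2813, 2358, 1012, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The attention mask tells the model which positions hold real tokens and which are padding to be ignored.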

Here we map `tokenize_fn` over `ds`, our `DatasetDict` that holds both the `train` and `test` datasets. With `batched=True`, `tokenize_fn` receives a batch of examples (a dictionary of columns, not the whole dataset at once) and picks the "text" column from it. Once tokenization completes, the dataset has two new columns: `input_ids` and `attention_mask`.

In [31]:
tokenized_dataset = ds.map(
    tokenize_fn,
    batched=True
)

In [32]:
print(tokenized_dataset["train"].column_names)

['text', 'label', 'input_ids', 'attention_mask']
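
The batched mapping can be pictured with a small stand-in written in plain Python. This is only a sketch of the idea, not how `datasets` implements `map`: the function `batched_map`, the toy function `count_chars`, and the toy columns are all hypothetical. Each call receives a dictionary of column slices, and the columns it returns are merged back into the dataset.

```python
def batched_map(columns, fn, batch_size=1000):
    """Apply fn to slices of a column dict and merge the columns it returns."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # A batch is a dict of lists: one slice per column
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    return {**columns, **new_columns}

def count_chars(batch):
    # Like tokenize_fn, this reads only the "text" column of the batch
    return {"n_chars": [len(t) for t in batch["text"]]}

toy = {"text": ["a", "bb", "ccc"], "label": [0, 1, 2]}
result = batched_map(toy, count_chars, batch_size=2)
print(list(result))       # ['text', 'label', 'n_chars']
print(result["n_chars"])  # [1, 2, 3]
```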


In [33]:
tokenized_dataset["train"][0]

{'text': "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.",
 'label': 2,
 'input_ids': [101, 2813, 2358, 1012, 6468, 15020, 2067, 2046, 1996, 2304, 1006,
  26665, 1007, 26665, 1011, 2460, 1011, 19041, 1010, 2813, 2395, 1005, 1055,
  1040, 11101, 2989, 1032, 2316, 1997, 11087, 1011, 22330, 8713, 2015, 1010,
  2024, 3773, 2665, 2153, 1012, 102,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1,

`tokenized_dataset` now has the columns 'text', 'label', 'input_ids', and 'attention_mask'. The model doesn't need the raw text, so we can optionally remove that column.

In [34]:
tokenized_dataset = tokenized_dataset.remove_columns(["text"])

In [35]:
tokenized_dataset["train"][0]

{'label': 2,
 'input_ids': [101, 2813, 2358, 1012, 6468, 15020, 2067, 2046, 1996, 2304, 1006,
  26665, 1007, 26665, 1011, 2460, 1011, 19041, 1010, 2813, 2395, 1005, 1055,
  1040, 11101, 2989, 1032, 2316, 1997, 11087, 1011, 22330, 8713, 2015, 1010,
  2024, 3773, 2665, 2153, 1012, 102,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

We use PyTorch to fine-tune the BERT model, so we change the tokenized dataset's format to return PyTorch tensors.

In [36]:
tokenized_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "label"]
)

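Our dataset has four labels, so we set `num_labels=4`. The tokenizer loaded earlier comes from `distilbert-base-uncased`; it shares the uncased WordPiece vocabulary with `bert-base-uncased`, so its `input_ids` line up with this model's embeddings.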

In [37]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=4
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
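
The warning above is expected: the pretrained checkpoint has no classification head, so `classifier.weight` and `classifier.bias` start from random values and are learned during fine-tuning. As a rough sketch of how small that head is, assuming the hidden size of 768 used by `bert-base-uncased` and a single linear layer from the hidden size to `num_labels`:

```python
hidden_size = 768   # hidden size of bert-base-uncased (assumed here, not read from the config)
num_labels = 4      # World, Sports, Business, Sci/Tech

weight_params = hidden_size * num_labels  # classifier.weight has shape (4, 768)
bias_params = num_labels                  # classifier.bias has shape (4,)

print(weight_params + bias_params)  # 3076
```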


Below we set the `TrainingArguments`: learning rate, number of epochs, and a few other options.

In [38]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./bert_ag_news",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,  # a single epoch, to limit resource usage on Google Colab
    weight_decay=0.01,
    logging_steps=100,
    push_to_hub=False,
    load_best_model_at_end=True,

    # wandb related configs
    report_to="wandb",
    run_name="ag-news-bert-fine-tuned-run_01"
)

Pass the model, the training arguments, the train and test datasets, and the tokenizer to the `Trainer`.

In [39]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer
)


Before starting training, I set up Weights & Biases (wandb) to collect training metrics, so I set some wandb environment variables first.

In [40]:
import os
from google.colab import userdata

# Your W&B project details
os.environ["WANDB_PROJECT"] = "ag-news-bert-fine-tuned"
os.environ["WANDB_ENTITY"] = "naveedahmadhematmal"
os.environ["WANDB_RUN_NAME"] = "ag-news-bert-fine-tuned-run_01"
os.environ["WANDB_API_KEY"] = userdata.get('WANDB_API_KEY')
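
The cell above only runs inside Colab, because `google.colab` is not importable elsewhere. A minimal sketch of a fallback, assuming the same secret is available as an ordinary environment variable outside Colab; the helper name `get_secret` is hypothetical.

```python
import os

def get_secret(name, env=None):
    """Read a secret from Colab's userdata if available, else from the environment."""
    env = os.environ if env is None else env
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return env.get(name)               # outside Colab: plain environment lookup
    return userdata.get(name)

api_key = get_secret("WANDB_API_KEY")
print("WANDB_API_KEY found" if api_key else "WANDB_API_KEY missing")
```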

Start training to fine-tune the model.

In [41]:
trainer.train()

wandb: Currently logged in as: naveedhematmal (naveedahmadhematmal) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.1865 | 0.17694 |


TrainOutput(global_step=7500, training_loss=0.2182170716603597, metrics={'train_runtime': 2572.0868, 'train_samples_per_second': 46.655, 'train_steps_per_second': 2.916, 'total_flos': 7893473402880000.0, 'train_loss': 0.2182170716603597, 'epoch': 1.0})
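
The numbers in `TrainOutput` can be checked by hand from the dataset size and the training arguments. The arithmetic below is a sketch that assumes a single GPU and no gradient accumulation, so one optimizer step consumes exactly one batch of 16 examples.

```python
import math

train_rows = 120_000        # num_rows of ds["train"]
batch_size = 16             # per_device_train_batch_size
epochs = 1                  # num_train_epochs
train_runtime = 2572.0868   # seconds, from TrainOutput

steps = math.ceil(train_rows / batch_size) * epochs
print(steps)                                           # 7500
print(round(train_rows * epochs / train_runtime, 3))   # 46.655 samples per second
print(round(steps / train_runtime, 3))                 # 2.916 steps per second
```

All three values match the reported `global_step`, `train_samples_per_second`, and `train_steps_per_second`.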

Save the fine-tuned model and tokenizer, then upload the saved files to wandb.

In [42]:
import wandb

model.save_pretrained("./bert_ag_news")
tokenizer.save_pretrained("./bert_ag_news")
wandb.save("./bert_ag_news/*")



['/content/wandb/run-20251216_113551-0t0gzy46/files/bert_ag_news/checkpoint-7500',
 '/content/wandb/run-20251216_113551-0t0gzy46/files/bert_ag_news/config.json',
 '/content/wandb/run-20251216_113551-0t0gzy46/files/bert_ag_news/model.safetensors',
 '/content/wandb/run-20251216_113551-0t0gzy46/files/bert_ag_news/special_tokens_map.json',
 '/content/wandb/run-20251216_113551-0t0gzy46/files/bert_ag_news/tokenizer.json',
 '/content/wandb/run-20251216_113551-0t0gzy46/files/bert_ag_news/tokenizer_config.json',
 '/content/wandb/run-20251216_113551-0t0gzy46/files/bert_ag_news/vocab.txt']

Evaluate the model on the test set.

In [43]:
results = trainer.evaluate()
print(results)

{'eval_loss': 0.17693980038166046, 'eval_runtime': 52.321, 'eval_samples_per_second': 145.257, 'eval_steps_per_second': 9.079, 'epoch': 1.0}
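
The result above contains only the loss because no metric function was given to the `Trainer`. A possible accuracy metric is sketched below; it could be passed through the `compute_metrics` argument of `Trainer`. It is written in plain Python so it runs without extra libraries, and the toy logits and labels are made up for illustration, not taken from this model.

```python
def compute_metrics(eval_pred):
    """Accuracy from (logits, labels); each logits row holds one score per class."""
    logits, labels = eval_pred
    # argmax per row: index of the largest score
    preds = [max(range(len(row)), key=lambda j: row[j]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return {"accuracy": correct / len(preds)}

# Toy example: three predictions over four classes, two of them correct
toy_logits = [[0.1, 0.2, 3.0, 0.0], [0.0, 2.5, 0.3, 0.1], [1.0, 0.2, 0.1, 0.4]]
toy_labels = [2, 1, 3]
print(compute_metrics((toy_logits, toy_labels)))  # {'accuracy': 0.6666666666666666}
```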


Run inference on new headlines.

In [46]:
import torch

# Example texts
texts = ["The stock market crashed today", "The football match was exciting"]

# Tokenize
tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Move inputs to same device as model
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
tokens = {k: v.to(device) for k, v in tokens.items()}

# Set human-readable labels (AG News)
model.config.id2label = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

# Forward pass
with torch.no_grad():
    outputs = model(**tokens)

# Predictions
preds = torch.argmax(outputs.logits, dim=1)

# Print numeric labels
print("Numeric labels:", preds)

# Print human-readable labels
labels = [model.config.id2label[i.item()] for i in preds]
print("Predicted classes:", labels)


Numeric labels: tensor([2, 1], device='cuda:0')
Predicted classes: ['Business', 'Sports']
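
`argmax` returns only the winning class. To also see how confident the model is, the logits can be turned into probabilities with a softmax (in PyTorch, `torch.softmax(outputs.logits, dim=1)`). The sketch below shows the computation in plain Python on made-up logits, not on real model outputs.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}
toy_logits = [0.5, 0.1, 3.2, -1.0]             # illustrative values only
probs = softmax(toy_logits)
best = max(range(len(probs)), key=lambda j: probs[j])
print(id2label[best], round(probs[best], 3))
```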
