# News headline classifier

Fine-tuning a transformer model (BERT) on the AG News dataset to categorize news headlines.

Steps:  
1: load the dataset using the Hugging Face `datasets` library  
2: explore the dataset: print its columns and inspect a few rows  
3: tokenize the text, fine-tune the model, evaluate it, and run inference on new headlines

## Change runtime to CUDA
As a first step, I switch the Colab runtime to a GPU and check that CUDA is actually available. The Hugging Face `Trainer` detects the GPU and uses it automatically.

In [None]:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

## Loading AG News dataset
`datasets` is the Hugging Face datasets library. Behind the scenes it uses the `HF_TOKEN` environment variable to authenticate with Hugging Face when it downloads a dataset. If you haven't set that token on your system or in Colab secrets, copy it from your Hugging Face profile and add it here.

In [None]:
from datasets import load_dataset

ds = load_dataset("fancyzhx/ag_news")

## Exploring the dataset

Below is a `DatasetDict`, which holds two datasets (train, test)

In [None]:
ds

In [None]:
ds.keys()

In [None]:
ds["train"].column_names

In [None]:
ds["train"][1:2]

In [None]:
ds["train"].features

In [None]:
ds["train"][0]

## Tokenization

In [None]:
from transformers import AutoTokenizer

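# Note: the model loaded later is "bert-base-uncased". The DistilBERT tokenizer
# shares the same WordPiece vocabulary, so the token ids are compatible, but it
# does not return token_type_ids.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")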

Define the tokenization function, which pads or truncates every headline to exactly 128 tokens.

In [None]:
def tokenize_fn(batch):
    return tokenizer(
        batch["text"],
        padding="max_length",
        truncation=True,
        max_length=128
    )
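
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal pure-Python sketch of what happens to one sequence of token ids. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the pad id of 0 is an assumption (it matches `[PAD]` in the BERT uncased vocabulary), and the sketch ignores the special handling of `[CLS]` and `[SEP]` during truncation.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=0):
    """Mimic padding="max_length" with truncation=True for one sequence."""
    # Truncation: keep at most max_length tokens.
    ids = list(token_ids[:max_length])
    # Attention mask: 1 for real tokens, 0 for padding positions.
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": mask + [0] * n_pad,
    }


short = pad_or_truncate([101, 2054, 102], max_length=5)
print(short["input_ids"])       # [101, 2054, 102, 0, 0]
print(short["attention_mask"])  # [1, 1, 1, 0, 0]

long = pad_or_truncate(list(range(200)), max_length=128)
print(len(long["input_ids"]))   # 128
```

Every row ends up with the same length, which is what lets the rows be stacked into a single tensor later.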

Here we apply `tokenize_fn` to `ds`, our `DatasetDict`, so both the `train` and `test` splits get tokenized. With `batched=True`, `tokenize_fn` does not receive the whole dataset at once. It receives one batch at a time (up to 1000 rows by default) as a dictionary that maps each column name to a list of values, and it picks the "text" column from that batch. Once tokenization is complete, the dataset has two new columns: `input_ids` and `attention_mask`.

In [None]:
tokenized_dataset = ds.map(
    tokenize_fn,
    batched=True
)
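
The batching contract of `map(..., batched=True)` can be imitated in a few lines of plain Python. This is only a sketch of the idea, not how `datasets` is implemented: `iter_batches` and the toy data are hypothetical, and the default of 1000 mirrors the default `batch_size` of `map`.

```python
def iter_batches(columns, batch_size=1000):
    """Yield column-oriented slices, like the batches a batched map function sees."""
    n_rows = len(next(iter(columns.values())))
    for start in range(0, n_rows, batch_size):
        yield {
            name: values[start:start + batch_size]
            for name, values in columns.items()
        }


# Toy stand-in for a dataset split with "text" and "label" columns.
toy = {"text": ["a", "b", "c", "d", "e"], "label": [0, 1, 2, 3, 0]}
batches = list(iter_batches(toy, batch_size=2))

print(len(batches))        # 3
print(batches[0])          # {'text': ['a', 'b'], 'label': [0, 1]}
print(batches[-1]["text"]) # ['e']
```

Each batch is a dictionary of lists, which is why `tokenize_fn` can pass `batch["text"]` straight to the tokenizer.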

In [None]:
print(tokenized_dataset["train"].column_names)

In [None]:
tokenized_dataset["train"][0]

The `tokenized_dataset` now has the columns 'text', 'label', 'input_ids', and 'attention_mask'. The model doesn't need the raw text, so we can optionally remove that column.

In [None]:
tokenized_dataset = tokenized_dataset.remove_columns(["text"])

In [None]:
tokenized_dataset["train"][0]

We fine-tune the BERT model with PyTorch, so we change the tokenized dataset's format to return PyTorch tensors.

In [None]:
tokenized_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "label"]
)

Our dataset has four labels, so we set `num_labels=4`.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=4
)

Below we set the `TrainingArguments`: learning rate, number of epochs, batch sizes, and a few other options.

In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./bert_ag_news",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,  # only 1 epoch, to limit resource usage on Google Colab
    weight_decay=0.01,
    logging_steps=100,
    push_to_hub=False,
    load_best_model_at_end=True,

    # wandb related configs
    report_to="wandb",
    run_name="ag-news-bert-fine-tuned-run_01"
)
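
As a rough sanity check on these settings, the number of optimizer steps and log events can be estimated by hand. The sketch below assumes the AG News train split has 120,000 examples (this number is not stated in the notebook; check it with `len(ds["train"])`), a single GPU, and no gradient accumulation.

```python
import math

train_examples = 120_000           # assumption: size of the AG News train split
per_device_train_batch_size = 16   # from TrainingArguments
num_train_epochs = 1               # from TrainingArguments
logging_steps = 100                # from TrainingArguments

# One optimizer step per batch (single device, no gradient accumulation).
steps_per_epoch = math.ceil(train_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
log_events = total_steps // logging_steps

print(steps_per_epoch)  # 7500
print(total_steps)      # 7500
print(log_events)       # 75
```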

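Pass the model, the training arguments, the train and test datasets, and the tokenizer to the `Trainer`. Recent releases of `transformers` deprecate the `tokenizer` argument in favor of `processing_class`, so a deprecation warning here is expected on newer versions.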

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer
)

Before training starts, I configure Weights & Biases (wandb), which I use to track training metrics, by setting a few wandb environment variables.

In [None]:
import os
from google.colab import userdata

# Your W&B project details
os.environ["WANDB_PROJECT"] = "ag-news-bert-fine-tuned"
os.environ["WANDB_ENTITY"] = "naveedahmadhematmal"
os.environ["WANDB_RUN_NAME"] = "ag-news-bert-fine-tuned-run_01"
os.environ["WANDB_API_KEY"] = userdata.get('WANDB_API_KEY')
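
The cell above only works inside Colab, because `google.colab` cannot be imported elsewhere. If you run the notebook on another machine, one option is a small fallback that reads the key from the environment instead. This is a sketch under that assumption; `get_secret` is a hypothetical helper, not part of wandb or Colab.

```python
import os


def get_secret(name):
    """Return a secret from Colab secrets if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        # Outside Colab: fall back to an environment variable (None if unset).
        return os.environ.get(name)
    return userdata.get(name)


# Usage: os.environ["WANDB_API_KEY"] = get_secret("WANDB_API_KEY")
```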

Start training to fine-tune the model.

In [None]:
trainer.train()

Save the fine-tuned model and tokenizer locally, then upload the saved files to wandb.

In [None]:
import wandb

model.save_pretrained("./bert_ag_news")
tokenizer.save_pretrained("./bert_ag_news")
wandb.save("./bert_ag_news/*")

Evaluate the model on the test split.

In [None]:
results = trainer.evaluate()
print(results)
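
Because the `Trainer` above was created without a `compute_metrics` function, `evaluate()` reports the evaluation loss and runtime statistics but no accuracy. A minimal sketch of an accuracy metric is shown below, assuming NumPy is available. The function name and the toy arrays are illustrative only.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Compute accuracy from (logits, labels), as handed over by the Trainer."""
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row.
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}


# Toy check with made-up logits for three examples and four classes.
toy_logits = np.array([[2.0, 0.0, 0.0, 0.0],
                       [0.0, 3.0, 0.0, 0.0],
                       [0.0, 0.0, 0.0, 1.0]])
toy_labels = np.array([0, 1, 2])
toy_result = compute_metrics((toy_logits, toy_labels))
print(toy_result)  # two of three predictions are correct
```

To use it, pass `compute_metrics=compute_metrics` when constructing the `Trainer`. The value then shows up in the results as `eval_accuracy`.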

Run inference on new headlines.

In [None]:
import torch

# Example texts
texts = ["The stock market crashed today", "The football match was exciting"]

# Tokenize
tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

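# Move model and inputs to the same device, and switch to evaluation mode
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()  # disables dropout so predictions are deterministic
tokens = {k: v.to(device) for k, v in tokens.items()}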

# Set human-readable labels (AG News)
model.config.id2label = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

# Forward pass
with torch.no_grad():
    outputs = model(**tokens)

# Predictions
preds = torch.argmax(outputs.logits, dim=1)

# Print numeric labels
print("Numeric labels:", preds)

# Print human-readable labels
labels = [model.config.id2label[i.item()] for i in preds]
print("Predicted classes:", labels)
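
The `argmax` above gives only the winning class. To see how confident the model is, the logits can be turned into probabilities with a softmax. The sketch below does this in plain Python for a single row of made-up logits; with the real outputs, `torch.softmax(outputs.logits, dim=1)` does the same for the whole batch.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

# Hypothetical logits for one headline, not real model output.
example_logits = [0.2, -1.0, 3.1, 0.5]
probs = softmax(example_logits)

# Index of the most probable class, then its readable label.
best = max(range(len(probs)), key=probs.__getitem__)
print(id2label[best], round(probs[best], 3))
```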
