## **Problem Statement & Objective**

The goal of this project is to build a news topic classifier: a model that assigns each news article to one of four predefined classes based on its text. To do this, we fine-tune a pre-trained transformer, BERT (`bert-base-uncased`), on the AG News dataset and measure its accuracy and weighted F1-score on the test split. The project also shows how to deploy the trained model in a small interactive demo.



## **Install libraries**



In [1]:
!pip install transformers datasets torch scikit-learn gradio



## **Load dataset**

Load the AG News dataset.


In [2]:
from datasets import load_dataset

ag_news_dataset = load_dataset('ag_news')
print(ag_news_dataset)



DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})
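
The `label` column holds integer class ids. According to the AG News dataset card, ids 0 to 3 stand for World, Sports, Business, and Sci/Tech. The sketch below shows one way to translate ids into names and check the class balance. It runs on a small made-up list of labels rather than the real split, and `AG_NEWS_LABELS` and `class_counts` are illustrative names that do not appear elsewhere in this notebook.

```python
from collections import Counter

# Label names as listed on the AG News dataset card (ids 0-3).
AG_NEWS_LABELS = ["World", "Sports", "Business", "Sci/Tech"]

def class_counts(label_ids):
    """Count examples per class name for an iterable of integer label ids."""
    counts = Counter(AG_NEWS_LABELS[i] for i in label_ids)
    # Include classes with zero examples so imbalance is easy to spot.
    return {name: counts.get(name, 0) for name in AG_NEWS_LABELS}

# Hypothetical sample of label ids, for illustration only.
sample_labels = [2, 3, 3, 0, 1, 2, 2]
print(class_counts(sample_labels))
```

On the real data, the same function could be called with `ag_news_dataset['train']['label']`.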


## **Preprocess data**

Tokenize the dataset with the `bert-base-uncased` tokenizer. Every example is padded or truncated to 256 tokens, and the tokenizer also produces the matching attention masks. Then remove the original `text` column and set the dataset format to `torch`.




In [4]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=256)

tokenized_datasets = ag_news_dataset.map(tokenize_function, batched=True)

tokenized_datasets = tokenized_datasets.remove_columns(['text'])

tokenized_datasets.set_format('torch')

print(tokenized_datasets)


DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 7600
    })
})
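
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a simplified sketch in plain Python. It works on token ids that are already numeric and does not handle special tokens such as `[CLS]` and `[SEP]`, which the real tokenizer manages itself. `pad_and_mask` is a hypothetical helper, and the middle token id in the example is arbitrary.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token ids to max_length and build the attention mask."""
    kept = list(token_ids[:max_length])             # truncation
    n_pad = max_length - len(kept)                  # number of padding slots
    input_ids = kept + [pad_id] * n_pad             # right-padding with the pad id
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

ids, mask = pad_and_mask([101, 2739, 102], max_length=6)
print(ids)   # [101, 2739, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 0, 0, 0]
```

Padding every example to 256 tokens is simple but spends compute on padding for short articles. Padding each batch only to its longest example, for instance with `DataCollatorWithPadding`, is a common alternative.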


## **Load model**

Load the pre-trained `bert-base-uncased` model from Hugging Face Transformers.


In [5]:
from transformers import BertForSequenceClassification

num_labels = 4
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)
print(model.config)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.55.4",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
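
The warning about `classifier.weight` and `classifier.bias` is expected: the checkpoint was pre-trained for masked language modeling, so the classification head is new and starts from random values. That head is a single linear layer from the 768-dimensional pooled output to 4 logits. As a quick sanity check, the sketch below counts its parameters; `linear_head_params` is an illustrative helper, not part of Transformers.

```python
def linear_head_params(hidden_size, num_labels):
    """Parameter count of a linear layer hidden_size -> num_labels with bias."""
    weight = hidden_size * num_labels  # shape of classifier.weight
    bias = num_labels                  # shape of classifier.bias
    return weight + bias

print(linear_head_params(hidden_size=768, num_labels=4))  # 3076
```

Only these 3,076 parameters start from scratch; the rest of the network is initialized from the pre-trained weights and is then updated during fine-tuning.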



## **Fine-tune and evaluate model**



In [6]:
from transformers import TrainingArguments, Trainer, BertForSequenceClassification
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

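# Re-create the model so this cell always starts from the pre-trained weights,
# even if it is re-run after an earlier training attempt.
num_labels = 4
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)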

training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
)


trainer.train()

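# predictions.predictions holds the raw logits, one row per test example.
predictions = trainer.predict(tokenized_datasets['test'])

# The predicted class is the index of the largest logit in each row.
predicted_labels = np.argmax(predictions.predictions, axis=1)
true_labels = predictions.label_ids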

accuracy = accuracy_score(true_labels, predicted_labels)
f1 = f1_score(true_labels, predicted_labels, average='weighted')

print(f'Accuracy: {accuracy}')
print(f'F1-score (weighted): {f1}')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.1862        | 0.17503         |
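
For a sense of scale, the training configuration can be turned into a step count. The sketch below assumes a single device and no gradient accumulation, and it assumes the `Trainer` default schedule of linear decay with no warmup. The helper names are illustrative.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Optimizer steps in one epoch; a final partial batch counts as a step."""
    return math.ceil(num_examples / batch_size)

def linear_decay_lr(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Values from the training cell: 120,000 examples, batch size 16, 1 epoch.
total = steps_per_epoch(120_000, 16) * 1
print(total)  # 7500

# Learning rate at the start, halfway through, and at the end of training.
for step in (0, total // 2, total):
    print(step, linear_decay_lr(step, total, base_lr=2e-5))
```

Under these assumptions the learning rate starts at 2e-5 and falls steadily to zero over the 7,500 steps.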


Accuracy: 0.9451315789473684
F1-score (weighted): 0.9451670903538101
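
To clarify what the two numbers measure, the sketch below recomputes accuracy and weighted F1 in plain Python. It is meant to mirror the definitions behind `accuracy_score` and `f1_score(average='weighted')`, not to replace them, and the labels in the example are made up.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)  # number of true examples per class
    total = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)

# Hypothetical labels, for illustration only.
y_true = [0, 0, 1, 1, 2, 3]
y_pred = [0, 1, 1, 1, 2, 3]
print(accuracy(y_true, y_pred))
print(weighted_f1(y_true, y_pred))
```

When the classes are about equally frequent, weighted F1 and accuracy tend to land close together, which is consistent with the near-identical values reported above.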


## **Save model and tokenizer for deployment**

In [7]:
model.save_pretrained("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")

('./fine_tuned_model/tokenizer_config.json',
 './fine_tuned_model/special_tokens_map.json',
 './fine_tuned_model/vocab.txt',
 './fine_tuned_model/added_tokens.json')
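
The saved configuration still uses the generic names `LABEL_0` to `LABEL_3`, so any app built on it will display those. One possible fix, sketched below, is to rewrite `id2label` and `label2id` in the saved `config.json`. This assumes the file has the layout printed in the model config above and that ids 0 to 3 follow the AG News dataset card order. `write_label_names` is a hypothetical helper.

```python
import json
from pathlib import Path

# Label names as listed on the AG News dataset card (ids 0-3).
AG_NEWS_LABELS = ["World", "Sports", "Business", "Sci/Tech"]

def write_label_names(model_dir, names=AG_NEWS_LABELS):
    """Rewrite id2label/label2id in <model_dir>/config.json with readable names."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text())
    # config.json stores ids as string keys in id2label and as ints in label2id.
    config["id2label"] = {str(i): name for i, name in enumerate(names)}
    config["label2id"] = {name: i for i, name in enumerate(names)}
    config_path.write_text(json.dumps(config, indent=2))
    return config

# Example call, to be run after the model has been saved:
# write_label_names("./fine_tuned_model")
```

An alternative is to pass `id2label` and `label2id` to `from_pretrained` when the model is first created, which stores the names in the config from the start.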

## **Load Model and Deploy with Gradio**

In [10]:
import gradio as gr
from transformers import BertTokenizer, BertForSequenceClassification, pipeline

# Load fine-tuned model
model_path = "./fine_tuned_model"
tokenizer = BertTokenizer.from_pretrained(model_path)
model = BertForSequenceClassification.from_pretrained(model_path)

# Create a pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

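# Define prediction function
def predict(text):
    # top_k=None asks the pipeline for a score for every class, not just the
    # best one, so the Label component can show all four probabilities.
    results = classifier(text, top_k=None)
    return {res["label"]: float(res["score"]) for res in results}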

# Launch Gradio app
gr.Interface(
    fn=predict,
    inputs=gr.Textbox(lines=2, placeholder="Enter text for classification..."),
    outputs=gr.Label(num_top_classes=4),
    title="News Topic Classifier",
    description="Enter text and see predicted class probabilities."
).launch()


Device set to use cuda:0


It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://1c651d21f62049d222.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
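
The scores shown in the app are probabilities derived from the model's logits. For a single-label model with several classes, the text-classification pipeline applies a softmax by default, as far as we are aware. The sketch below reproduces that step in plain Python on made-up logits; `label_probabilities` is an illustrative helper and the `id2label` mapping copies the generic names from the model config.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_probabilities(logits, id2label):
    """Map each class name to its softmax probability."""
    probs = softmax(logits)
    return {id2label[i]: p for i, p in enumerate(probs)}

# Generic label names, as in the model config printed earlier.
id2label = {0: "LABEL_0", 1: "LABEL_1", 2: "LABEL_2", 3: "LABEL_3"}

# Hypothetical logits for one article, for illustration only.
probs = label_probabilities([0.2, 4.1, -1.3, 0.5], id2label)
print(max(probs, key=probs.get))  # LABEL_1
```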




## **Final Summary / Insights**

This project demonstrates fine-tuning BERT for news topic classification on the AG News dataset. The workflow covered installing the required libraries, loading and preprocessing the dataset, fine-tuning the pre-trained `bert-base-uncased` model for one epoch, and evaluating the result. The model reached an accuracy of 0.945 and a weighted F1-score of 0.945 on the 7,600-example test set, so it assigns roughly 19 out of 20 articles to the correct category. The fine-tuned model and tokenizer were saved to `./fine_tuned_model` and then served through a Gradio app as a practical demonstration of the classifier.