# **AG News Dataset Overview**

The AG News dataset is a widely used benchmark for text classification in Natural Language Processing (NLP). Its key characteristics are summarized below.


# About the Dataset

**Source**: The dataset was constructed by Xiang Zhang and his team by selecting the 4 largest categories from AG’s Corpus of News Articles.

# Categories

**World**: International news and global events.

**Sports**: News related to various sports and sporting events.

**Business**: Financial news, economic updates, and business-related articles.

**Sci/Tech**: Articles about science, technology, and innovations.


# Dataset Structure

**Training Set**: 120,000 news articles (30,000 articles per category).

**Test Set**: 7,600 news articles (1,900 articles per category).

**Format**: Each article consists of a title and a short description.
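
The split sizes follow directly from the per-category counts. A minimal Python check of that arithmetic, using only the numbers quoted above (the constant names are illustrative):

```python
# Per-category counts quoted in the dataset description above
NUM_CATEGORIES = 4
TRAIN_PER_CATEGORY = 30_000
TEST_PER_CATEGORY = 1_900

train_size = NUM_CATEGORIES * TRAIN_PER_CATEGORY
test_size = NUM_CATEGORIES * TEST_PER_CATEGORY

print(f"train: {train_size:,}  test: {test_size:,}")
# Share of all articles that is held out for testing
print(f"test share: {test_size / (train_size + test_size):.1%}")
```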

# Packages to install

pip install transformers datasets torch matplotlib wordcloud scikit-learn

# **AG News Classification Using BERT Transformers**

# Importing Libraries

In [2]:
import pandas as pd
from datasets import load_dataset
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from transformers import DataCollatorWithPadding
from sklearn.metrics import classification_report
import torch

# Load AG News dataset

In [3]:
dataset = load_dataset('ag_news')

# Subsample 1,000 examples each from the train and test splits

In [4]:
train_dataset = dataset["train"].shuffle(seed=42).select(range(1000))  
test_dataset = dataset["test"].shuffle(seed=42).select(range(1000))  
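
Shuffling and then keeping the first 1,000 rows draws a random sample, so the four classes are only approximately balanced in the subset. The sketch below imitates the idea with the standard library on synthetic labels. It is an illustration, not how `datasets` implements `shuffle` and `select`, and `subsample` is a hypothetical helper.

```python
import random
from collections import Counter

def subsample(labels, n, seed=42):
    """Shuffle indices with a fixed seed and keep the first n (hypothetical helper)."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    return [labels[i] for i in indices[:n]]

# Synthetic stand-in for a balanced test split: 1,900 examples per class
labels = [c for c in range(4) for _ in range(1900)]

sample = subsample(labels, 1000)
print(Counter(sample))  # counts per class are close to 250 but usually not equal
```

Fixing the seed makes the subset reproducible across runs, which is why the notebook passes `seed=42`.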

# Load pre-trained tokenizer

In [5]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the data

In [6]:
def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, padding=True)

train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)
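
With `truncation=True` and `padding=True`, the tokenizer cuts sequences that exceed the maximum length and pads the rest to the longest sequence in each batch, returning an attention mask that marks the real tokens. A minimal pure-Python sketch of that behaviour on made-up token IDs follows. `pad_batch`, the pad ID of 0, and the tiny `max_length` are assumptions for illustration. The real tokenizer also keeps the closing special token when it truncates, which this sketch does not attempt.

```python
def pad_batch(sequences, max_length=6, pad_id=0):
    """Truncate to max_length, then pad to the longest sequence in the batch."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Three made-up sequences of different lengths
batch = pad_batch([[101, 7, 8, 102], [101, 5, 102], [101, 1, 2, 3, 4, 5, 6, 102]])
for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids, mask)
```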

# Set format for PyTorch tensors

In [7]:
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
test_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

# Load pre-trained BERT model for sequence classification and define data collator

In [8]:

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=4)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Define training arguments

In [9]:
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    weight_decay=0.01,
)




# Define Trainer and train the model

In [9]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()


                                                 
100%|██████████| 125/125 [17:23<00:00,  8.35s/it]

{'eval_loss': 0.5574982166290283, 'eval_runtime': 295.5518, 'eval_samples_per_second': 3.384, 'eval_steps_per_second': 0.423, 'epoch': 1.0}
{'train_runtime': 1043.8733, 'train_samples_per_second': 0.958, 'train_steps_per_second': 0.12, 'train_loss': 0.8215930786132812, 'epoch': 1.0}





TrainOutput(global_step=125, training_loss=0.8215930786132812, metrics={'train_runtime': 1043.8733, 'train_samples_per_second': 0.958, 'train_steps_per_second': 0.12, 'total_flos': 100210111560000.0, 'train_loss': 0.8215930786132812, 'epoch': 1.0})
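
The 125 steps in the progress bar follow from the settings above: 1,000 training examples at a per-device batch size of 8 for one epoch. A small sketch of that arithmetic, assuming a single device and no gradient accumulation (neither is stated in the notebook); `total_steps` is a hypothetical helper, not a `transformers` function:

```python
import math

def total_steps(num_examples, batch_size, epochs, num_devices=1, grad_accum=1):
    """Optimizer steps for a run (single device, no accumulation by default)."""
    effective_batch = batch_size * num_devices * grad_accum
    return math.ceil(num_examples / effective_batch) * epochs

steps = total_steps(num_examples=1000, batch_size=8, epochs=1)
print(steps)  # 125, matching the progress bar

# Throughput implied by the logged train_runtime
samples_per_second = 1000 / 1043.8733
print(round(samples_per_second, 3))
```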

# Evaluate the model and Classification Report

In [10]:
predictions = trainer.predict(test_dataset)
preds = predictions.predictions.argmax(-1)

labels = predictions.label_ids  # true labels returned by trainer.predict as a NumPy array
print("Classification Report:")
print(classification_report(labels, preds, target_names=['World', 'Sports', 'Business', 'Sci/Tech']))

100%|██████████| 125/125 [04:53<00:00,  2.35s/it]

Classification Report:
              precision    recall  f1-score   support

       World       0.86      0.86      0.86       266
      Sports       0.95      0.99      0.97       246
    Business       0.71      0.84      0.77       246
    Sci/Tech       0.85      0.65      0.74       242

    accuracy                           0.84      1000
   macro avg       0.84      0.84      0.83      1000
weighted avg       0.84      0.84      0.84      1000
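
The summary rows can be reproduced approximately from the per-class rows: F1 is the harmonic mean of precision and recall, the macro average is the unweighted mean over classes, and the weighted average weights each class by its support. The sketch below recomputes them from the rounded values printed above, so the results can differ from the report in the last digit.

```python
# (precision, recall, support) per class, copied from the report above
rows = {
    "World":    (0.86, 0.86, 266),
    "Sports":   (0.95, 0.99, 246),
    "Business": (0.71, 0.84, 246),
    "Sci/Tech": (0.85, 0.65, 242),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_scores = {name: f1(p, r) for name, (p, r, _) in rows.items()}
total = sum(s for _, _, s in rows.values())

# Macro: plain mean over classes; weighted: mean weighted by class support
macro_f1 = sum(f1_scores.values()) / len(rows)
weighted_f1 = sum(f1_scores[name] * s for name, (_, _, s) in rows.items()) / total

for name, score in f1_scores.items():
    print(f"{name:>8}: {score:.2f}")
print(f"macro F1: {macro_f1:.3f}, weighted F1: {weighted_f1:.3f}")
```

Sci/Tech has the lowest recall (0.65) and Business the lowest precision (0.71), which points to the two classes being confused with each other more than any other pair in this small run.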






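# Predict the category of a new text

In [16]:
def predict(text):
    # Tokenize a single text; padding is unnecessary for one input
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    # Move inputs to the model's device (the Trainer may have placed the model on a GPU)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    model.eval()  # disable dropout for inference
    with torch.no_grad():  # no gradients are needed for prediction
        outputs = model(**inputs)
    return torch.argmax(outputs.logits, dim=-1).item()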

# Example input text
sample_text = "New iPad released Just like every other September, this one is no different. Apple is planning to release a bigger, heavier, fatter iPad that..."
predicted_label = predict(sample_text)
label_names = dataset['train'].features['label'].names

print(f"Sample Text: {sample_text}")
print(f"Predicted Label: {label_names[predicted_label]}")

Sample Text: New iPad released Just like every other September, this one is no different. Apple is planning to release a bigger, heavier, fatter iPad that...
Predicted Label: Business
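
`predict` returns only the highest-scoring class. When a prediction looks debatable, as with this iPad text being labeled Business where Sci/Tech might have been expected, the class probabilities can show how close the decision was. A softmax over the logits gives those probabilities. The sketch below uses made-up logits (not output from this model) and plain Python; in the notebook the same step could be done with `torch.softmax(outputs.logits, dim=-1)`.

```python
import math

LABEL_NAMES = ["World", "Sports", "Business", "Sci/Tech"]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one text, chosen for illustration only
logits = [-1.2, -2.0, 1.4, 1.1]
probs = softmax(logits)

# Rank classes from most to least probable
ranked = sorted(zip(LABEL_NAMES, probs), key=lambda pair: pair[1], reverse=True)
for name, p in ranked:
    print(f"{name:>8}: {p:.2f}")
```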
