# Text Classification with Fine-tuned DistilBERT

This notebook demonstrates how to fine-tune a DistilBERT model for text classification using the AG News dataset. We will:

1. Load and preprocess the AG News dataset
2. Set up the DistilBERT model and tokenizer
3. Create data loaders for training
4. Configure training parameters and optimization
5. Fine-tune the model
6. Evaluate performance
7. Perform inference on new text samples

The AG News dataset contains news articles classified into 4 categories: World, Sports, Business, and Sci/Tech.

## Step 1: Import Required Libraries

Import all necessary libraries for data processing, model training, and evaluation.

In [7]:
!pip install evaluate fsspec
!pip install -U datasets


In [1]:
# Import datasets library for loading and processing machine learning datasets
from datasets import load_dataset

# Import Hugging Face transformers components for pre-trained models
from transformers import (
    AutoTokenizer,                      # For text tokenization
    AutoModelForSequenceClassification, # For classification models
    get_scheduler                       # For learning rate scheduling
)

# Import PyTorch utilities for data handling and training
from torch.utils.data import DataLoader  # For batch data loading
import torch                             # Core PyTorch library
from torch.optim import AdamW            # Adam optimizer with weight decay

# Import utilities for training progress and evaluation
from tqdm import tqdm                    # For progress bars during training
import evaluate                          # Hugging Face evaluate library for metrics
import numpy as np                       # Numerical computing library

## Step 2: Load Dataset

Load the AG News dataset, which contains news articles classified into 4 categories:
- World (0)
- Sports (1)
- Business (2)
- Sci/Tech (3)
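
The integer labels listed above can be turned into readable names with a small lookup table. The sketch below hard-codes the four names; `id2label` and `label2id` are illustrative variable names chosen here, not something the notebook defines.

```python
# Label ids and names as listed above for AG News.
id2label = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

# Reverse lookup from category name to integer id.
label2id = {name: idx for idx, name in id2label.items()}

print(id2label[2])
print(label2id["Sci/Tech"])
```

Hugging Face model configurations generally accept mappings of this shape, so the same dictionaries could plausibly be passed when loading the model to make saved predictions self-describing.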

In [2]:
# Load the AG News dataset from Hugging Face datasets hub
# This dataset contains news articles with their corresponding category labels
dataset = load_dataset("ag_news")

# Display basic information about the dataset
print(f"Dataset structure: {dataset}")
print(f"Training samples: {len(dataset['train'])}")
print(f"Test samples: {len(dataset['test'])}")
print(f"Features: {dataset['train'].features}")

# Show a sample from the dataset to understand the data structure
print("\nSample from training set:")
sample = dataset['train'][0]
print(f"Text: {sample['text'][:100]}...")
print(f"Label: {sample['label']}")

Dataset structure: DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})
Training samples: 120000
Test samples: 7600
Features: {'text': Value('string'), 'label': ClassLabel(names=['World', 'Sports', 'Business', 'Sci/Tech'])}

Sample from training set:
Text: Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\b...
Label: 2


## Step 3: Initialize Model and Tokenizer

Set up the DistilBERT model and tokenizer for sequence classification. DistilBERT is a distilled version of BERT that is about 40% smaller and 60% faster while retaining roughly 97% of BERT's language-understanding performance.

In [3]:
# Define the pre-trained model name
# DistilBERT is a distilled version of BERT that's faster and smaller
model_name = "distilbert-base-uncased"

# Load the tokenizer for text preprocessing
# The tokenizer converts text into tokens that the model can understand
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the pre-trained DistilBERT model for sequence classification
# num_labels=4 specifies that we have 4 categories in AG News dataset
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)

# Display model information
print(f"Model: {model_name}")
print(f"Model parameters: {model.num_parameters():,}")
print(f"Tokenizer vocab size: {len(tokenizer)}")
print(f"Number of classification labels: {model.config.num_labels}")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model: distilbert-base-uncased
Model parameters: 66,956,548
Tokenizer vocab size: 30522
Number of classification labels: 4
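
The warning above names four newly initialized tensors: the weights and biases of `pre_classifier` and `classifier`. As a rough check on how small that head is, the arithmetic below assumes a hidden width of 768 and a head made of a 768-to-768 layer followed by a 768-to-4 layer; both are assumptions about this checkpoint, not values printed by the notebook.

```python
hidden_size = 768   # assumed hidden width of distilbert-base-uncased
num_labels = 4      # AG News categories

def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

pre_classifier = linear_params(hidden_size, hidden_size)
classifier = linear_params(hidden_size, num_labels)
head_total = pre_classifier + classifier

total_reported = 66_956_548  # parameter count printed above
print(f"pre_classifier: {pre_classifier:,}")
print(f"classifier:     {classifier:,}")
print(f"head total:     {head_total:,}")
print(f"share of model: {head_total / total_reported:.2%}")
```

Under these assumptions the randomly initialized head is under one percent of the model, so fine-tuning mostly adjusts weights that were already pre-trained.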


## Step 4: Data Preprocessing

Tokenize the text data and prepare it for training. This includes converting text to tokens, adding padding, and setting up the proper format for PyTorch.

In [4]:
# Define preprocessing function to tokenize the text data
def preprocess_function(examples):
    # Tokenize the text with the following parameters:
    # - truncation=True: Cut off text that exceeds max_length
    # - padding="max_length": Pad shorter sequences to max_length
    # - max_length=128: Set maximum sequence length to 128 tokens
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

# Apply the preprocessing function to the entire dataset
# batched=True processes multiple examples at once for efficiency
encoded_dataset = dataset.map(preprocess_function, batched=True)

# Rename the 'label' column to 'labels' as expected by the model
# This is required for Hugging Face transformers compatibility
encoded_dataset = encoded_dataset.rename_column("label", "labels")

# Set the format to PyTorch tensors for the required columns
# This converts the data to the format expected by PyTorch models
encoded_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

# Display information about the processed dataset
print("Preprocessing completed!")
print(f"Encoded dataset structure: {encoded_dataset}")
print(f"Sample encoded data shape:")
sample_encoded = encoded_dataset['train'][0]
for key, value in sample_encoded.items():
    print(f"  {key}: {value.shape if hasattr(value, 'shape') else type(value)}")

Preprocessing completed!
Encoded dataset structure: DatasetDict({
    train: Dataset({
        features: ['text', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 7600
    })
})
Sample encoded data shape:
  labels: torch.Size([])
  input_ids: torch.Size([128])
  attention_mask: torch.Size([128])
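
To see why every `input_ids` and `attention_mask` tensor ends up with exactly 128 entries, here is a minimal pure-Python sketch of what `truncation=True` with `padding="max_length"` does to a single sequence. The special-token ids (101, 102, 0) and the helper name `pad_or_truncate` are assumptions for illustration; the real tokenizer handles more cases.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed special-token ids

def pad_or_truncate(content_ids, max_length=128):
    """Mimic truncation=True with padding='max_length' for one sequence."""
    # Reserve two positions for the [CLS] and [SEP] markers.
    kept = content_ids[: max_length - 2]
    input_ids = [CLS_ID] + kept + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    # Pad on the right until the fixed length is reached.
    pad_count = max_length - len(input_ids)
    input_ids += [PAD_ID] * pad_count
    attention_mask += [0] * pad_count
    return input_ids, attention_mask

# A short text (3 hypothetical token ids) and a long one (300 ids).
short_ids, short_mask = pad_or_truncate([2000, 2001, 2002])
long_ids, long_mask = pad_or_truncate(list(range(1000, 1300)))

print(len(short_ids), sum(short_mask))
print(len(long_ids), sum(long_mask))
```

The attention mask marks real tokens with 1 and padding with 0, which is how the model knows to ignore the padded positions.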


## Step 5: Reduce Dataset Size for Faster Training

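For demonstration purposes, we'll use a smaller subset of the data to speed up training. Note that taking the first rows does not guarantee balanced classes, as the label counts printed below show. In production, you would typically use the full dataset.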

In [6]:
# Reduce dataset size for quicker training and experimentation
# Select only the first 1000 samples from training set for fine-tuning
# This allows for faster iteration during development and testing
encoded_dataset["train"] = encoded_dataset["train"].select(range(1000))

# Select only the first 500 samples from test set for evaluation
# This reduces evaluation time while still providing meaningful metrics
encoded_dataset["test"] = encoded_dataset["test"].select(range(500))

# Display the reduced dataset sizes
print("Dataset size after reduction:")
print(f"Training samples: {len(encoded_dataset['train'])}")
print(f"Test samples: {len(encoded_dataset['test'])}")

# Calculate and display the distribution of labels in the training set
train_labels = encoded_dataset['train']['labels']
# Convert the labels to a PyTorch Tensor
train_labels_tensor = torch.tensor(train_labels)
unique_labels, counts = torch.unique(train_labels_tensor, return_counts=True)
print("\nLabel distribution in training set:")
label_names = ['World', 'Sports', 'Business', 'Sci/Tech']
for label, count in zip(unique_labels, counts):
    print(f"  {label_names[label]} ({label}): {count} samples")

Dataset size after reduction:
Training samples: 1000
Test samples: 500

Label distribution in training set:
  World (0): 212 samples
  Sports (1): 142 samples
  Business (2): 174 samples
  Sci/Tech (3): 472 samples
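
The counts above show that the first 1,000 rows lean heavily toward Sci/Tech. One possible alternative, sketched here under the assumption that a random sample is acceptable for the demo, is to draw seeded random indices and pass them to `select`. The seed value and the name `subset_indices` are illustrative, and using this subset would change the numbers reported in the rest of the notebook.

```python
import random

full_train_size = 120_000   # training rows reported earlier in the notebook
subset_size = 1_000

# Seeded generator so the same subset is drawn on every run.
rng = random.Random(42)
subset_indices = sorted(rng.sample(range(full_train_size), subset_size))

print(len(subset_indices), len(set(subset_indices)))

# Hypothetical use in this notebook, replacing select(range(1000)):
# encoded_dataset["train"] = encoded_dataset["train"].select(subset_indices)
```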


## Step 6: Create Data Loaders

Set up PyTorch DataLoaders to handle batch processing during training and evaluation.

In [7]:
# Create DataLoader for training data
# batch_size=16: Process 16 samples at a time (balanced between memory usage and efficiency)
# shuffle=True: Randomize the order of samples to improve training convergence
train_loader = DataLoader(encoded_dataset["train"], batch_size=16, shuffle=True)

# Create DataLoader for test data
# batch_size=16: Same batch size as training for consistency
# shuffle=False: No need to shuffle test data since we're not training on it
test_loader = DataLoader(encoded_dataset["test"], batch_size=16)

# Display DataLoader information
print("DataLoaders created successfully!")
print(f"Training batches: {len(train_loader)}")
print(f"Test batches: {len(test_loader)}")
print(f"Batch size: {train_loader.batch_size}")

# Calculate total number of training steps
num_epochs = 5  # We'll train for 5 epochs
total_training_steps = len(train_loader) * num_epochs
print(f"Total training steps (for {num_epochs} epochs): {total_training_steps}")

DataLoaders created successfully!
Training batches: 63
Test batches: 32
Batch size: 16
Total training steps (for 5 epochs): 315
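
The batch counts above follow from simple arithmetic, because a `DataLoader` keeps the final partial batch by default. A small sketch (the helper name `num_batches` is made up for this example):

```python
import math

def num_batches(num_samples, batch_size):
    """Batches per pass when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)

train_batches = num_batches(1000, 16)
test_batches = num_batches(500, 16)
total_steps = train_batches * 5          # 5 epochs
last_train_batch = 1000 - (train_batches - 1) * 16

print(train_batches, test_batches, total_steps, last_train_batch)
```

The last training batch holds only 8 samples, so its loss value is a noisier estimate than the others.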


## Step 7: Training Setup

Configure the device, optimizer, learning rate scheduler, and evaluation metrics for training.

In [8]:
# Set up the computing device (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Move the model to the selected device for computation
model.to(device)

# Initialize the AdamW optimizer
# AdamW is Adam with decoupled weight decay, a regularizer that helps limit overfitting
# (PyTorch's default weight_decay=0.01 applies because none is passed here)
# lr=2e-5 is a commonly used learning rate for fine-tuning BERT-like models
optimizer = AdamW(model.parameters(), lr=2e-5)

# Calculate the total number of training steps for the learning rate scheduler
# num_epochs (5) was defined alongside the data loaders
num_training_steps = len(train_loader) * num_epochs

# Create a linear learning rate scheduler
# This decreases the learning rate linearly from the initial value to 0
# over num_training_steps, which tends to stabilize the end of fine-tuning
# num_warmup_steps=0: No warmup period (the first step uses the full learning rate)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)

# Load the accuracy metric for evaluation
# This will be used to measure model performance during and after training
accuracy_metric = evaluate.load("accuracy")

print("Training setup completed!")
print(f"Optimizer: AdamW with learning rate {optimizer.param_groups[0]['lr']}")
print(f"Learning rate scheduler: Linear decay over {num_training_steps} steps")
print(f"Evaluation metric: Accuracy")

Using device: cuda


Training setup completed!
Optimizer: AdamW with learning rate 2e-05
Learning rate scheduler: Linear decay over 315 steps
Evaluation metric: Accuracy
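
The shape of the linear schedule can be written down directly. The function below is a sketch of linear warmup followed by linear decay, using the values from this notebook as defaults; `linear_lr` is a hypothetical helper, and the library's scheduler may differ in implementation details.

```python
def linear_lr(step, base_lr=2e-5, num_warmup_steps=0, num_training_steps=315):
    """Learning rate in effect after `step` scheduler updates."""
    if step < num_warmup_steps:
        # Ramp up from 0 toward base_lr during warmup.
        return base_lr * step / max(1, num_warmup_steps)
    # Decay linearly so the rate reaches 0 at the final step.
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)

# Start of training, end of epoch 1, roughly the midpoint, and the end.
for step in (0, 63, 157, 315):
    print(step, linear_lr(step))
```

With no warmup, the rate starts at the full 2e-5 and has already dropped by a fifth after the first of five epochs.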


## Step 8: Training Loop

Execute the fine-tuning process for 5 epochs with progress tracking and loss monitoring.

In [9]:
# Set model to training mode
# This enables dropout (DistilBERT uses layer normalization, which behaves the same in both modes)
model.train()

# Training loop for num_epochs (5) epochs, matching the scheduler's step count
for epoch in range(num_epochs):
    # Initialize total loss for this epoch
    total_loss = 0.0

    # Create progress bar for this epoch
    # desc parameter shows which epoch we're currently training
    loop = tqdm(train_loader, desc=f"Epoch {epoch+1}")

    # Iterate through batches in the training data
    for batch in loop:
        # Move batch data to the same device as the model (GPU/CPU)
        batch = {k: v.to(device) for k, v in batch.items()}

        # Forward pass: compute model outputs and loss
        outputs = model(**batch)
        loss = outputs.loss

        # Backward pass: compute gradients
        loss.backward()

        # Gradient clipping to prevent exploding gradients
        # max_norm=1.0 clips gradients that exceed this threshold
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        # Update model parameters using computed gradients
        optimizer.step()

        # Update learning rate according to the scheduler
        lr_scheduler.step()

        # Clear gradients for the next iteration
        # This prevents accumulation of gradients from previous steps
        optimizer.zero_grad()

        # Accumulate loss for reporting
        total_loss += loss.item()

        # Update progress bar with current loss
        loop.set_postfix(loss=loss.item())

    # Calculate and print average loss for this epoch
    avg_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch+1} finished. Average Loss: {avg_loss:.4f}")

print("Training completed!")

Epoch 1: 100%|██████████| 63/63 [00:11<00:00,  5.72it/s, loss=0.601]


Epoch 1 finished. Average Loss: 0.9762


Epoch 2: 100%|██████████| 63/63 [00:10<00:00,  6.30it/s, loss=0.175]


Epoch 2 finished. Average Loss: 0.4633


Epoch 3: 100%|██████████| 63/63 [00:10<00:00,  6.24it/s, loss=0.0722]


Epoch 3 finished. Average Loss: 0.2967


Epoch 4: 100%|██████████| 63/63 [00:10<00:00,  6.20it/s, loss=0.281]


Epoch 4 finished. Average Loss: 0.2201


Epoch 5: 100%|██████████| 63/63 [00:10<00:00,  6.17it/s, loss=0.063]

Epoch 5 finished. Average Loss: 0.1695
Training completed!
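
The training loop clips gradients by their global norm before each optimizer step. As an illustration of that idea, the pure-Python sketch below rescales a flat list of gradient values so their combined L2 norm does not exceed `max_norm`. The helper name and the small `eps` constant are assumptions made for this example, not a copy of PyTorch's implementation.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients together so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The scale factor is capped at 1 so small gradients pass through unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# Gradients with norm 5.0 are shrunk to roughly norm 1.0.
clipped, norm_before = clip_by_global_norm([3.0, 4.0])
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)

# Gradients already below the threshold are left as they are.
unchanged, _ = clip_by_global_norm([0.3, 0.4])
print(unchanged)
```

Because every value is multiplied by the same factor, clipping changes the size of the update but not its direction.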





## Step 9: Save the Fine-tuned Model

Save the trained model and tokenizer for future use and deployment.

In [10]:
# Save the fine-tuned model to a local directory
# This saves the model weights, configuration, and architecture
model.save_pretrained("./my_finetuned_model")

# Save the tokenizer to the same directory
# This ensures we can properly tokenize text during inference
tokenizer.save_pretrained("./my_finetuned_model")

print("Model and tokenizer saved successfully!")
print("Saved to directory: ./my_finetuned_model")

# List the files that were saved
import os
if os.path.exists("./my_finetuned_model"):
    saved_files = os.listdir("./my_finetuned_model")
    print("Saved files:")
    for file in saved_files:
        print(f"  - {file}")

Model and tokenizer saved successfully!
Saved to directory: ./my_finetuned_model
Saved files:
  - special_tokens_map.json
  - model.safetensors
  - config.json
  - tokenizer.json
  - vocab.txt
  - tokenizer_config.json


## Step 10: Model Evaluation

Define an evaluation function and test the model's performance on the test dataset.

In [11]:
# Define evaluation function to assess model performance
def evaluate_model(model, dataloader):
    # Set model to evaluation mode
    # This disables dropout so that predictions are deterministic
    model.eval()

    # Load the accuracy metric for evaluation
    accuracy_metric = evaluate.load("accuracy")

    # Disable gradient computation for faster inference and lower memory usage
    with torch.no_grad():
        # Iterate through evaluation batches
        for batch in dataloader:
            # Move batch data to the same device as the model
            batch = {k: v.to(device) for k, v in batch.items()}

            # Forward pass: get model predictions
            outputs = model(**batch)

            # Get predicted class labels by taking the argmax of logits
            # logits are raw prediction scores before softmax
            predictions = torch.argmax(outputs.logits, dim=-1)

            # Add predictions and true labels to the metric
            # This accumulates results across all batches
            accuracy_metric.add_batch(predictions=predictions, references=batch["labels"])

    # Compute the final accuracy score
    result = accuracy_metric.compute()

    # Display the evaluation result
    print(f"Evaluation Accuracy: {result['accuracy']:.4f}")

    return result

# Evaluate the model on the test set
print("Evaluating model on test set...")
test_results = evaluate_model(model, test_loader)

Evaluating model on test set...


Evaluation Accuracy: 0.8540
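
Accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand on toy labels (not taken from the notebook's test set) and then converts the reported score back into a count of correct predictions.

```python
def accuracy(predictions, references):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Toy labels for illustration only.
toy_predictions = [3, 2, 1, 3, 0]
toy_references = [2, 2, 1, 3, 0]
print(accuracy(toy_predictions, toy_references))

# Correct predictions implied by 0.8540 accuracy on 500 test samples.
print(round(0.8540 * 500))
```

Since the 500 evaluated rows are the first rows of the test split rather than a random sample, the score is best read as a rough indicator, not a benchmark result.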


## Step 11: Inference Function

Create a function to make predictions on new text samples using the fine-tuned model.

In [12]:
# Define inference function for making predictions on new text
def predict(texts, model_path="./my_finetuned_model"):
    # Set up device for inference
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load the fine-tuned model from the saved directory
    model = AutoModelForSequenceClassification.from_pretrained(model_path)

    # Load the corresponding tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Move model to the appropriate device and set to evaluation mode
    model.to(device)
    model.eval()

    # Tokenize the input texts
    # return_tensors="pt": Return PyTorch tensors
    # padding=True: Pad sequences to the same length within the batch
    # truncation=True: Truncate sequences that exceed max_length
    # max_length=128: Maximum sequence length (same as training)
    inputs = tokenizer(
        texts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=128
    ).to(device)

    # Disable gradient computation for inference
    with torch.no_grad():
        # Forward pass to get model predictions
        outputs = model(**inputs)

        # Get predicted class labels by taking argmax of logits
        # dim=-1 means we take argmax along the last dimension (class dimension)
        predictions = torch.argmax(outputs.logits, dim=-1)

    # Convert predictions to CPU and return as a Python list
    return predictions.cpu().tolist()

print("Inference function defined successfully!")
print("You can now use predict(texts) to classify new text samples.")

Inference function defined successfully!
You can now use predict(texts) to classify new text samples.
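
`predict` returns only the winning class index. If a confidence score is also wanted, the logits can be passed through a softmax first. The pure-Python sketch below shows the computation on made-up logits for a single text; inside `predict`, `torch.softmax(outputs.logits, dim=-1)` would play the same role.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    # Subtract the maximum for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

label_names = ["World", "Sports", "Business", "Sci/Tech"]
example_logits = [-1.2, -0.8, 0.9, 2.4]  # hypothetical scores for one text

probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(label_names[best], round(probs[best], 3))
```

The argmax of the probabilities is the same as the argmax of the logits, so adding the softmax changes what is reported but not which class is chosen.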


## Step 12: Sample Predictions

Test the inference function with sample news articles to see how well the model classifies different types of content.

In [13]:
# Define sample texts representing different news categories
sample_texts = [
    "Apple is looking at buying a UK startup for $1 billion.",           # Business/Tech
    "The stock market crashed today due to inflation fears.",            # Business
    "Local team wins the championship in a thrilling match!",            # Sports
    "NASA is planning a new moon mission for 2026.",                     # Sci/Tech
    "The latest iPhone was released with improved camera features."      # Sci/Tech
]

# Make predictions on the sample texts
predicted_labels = predict(sample_texts)

# Get the label names from the original dataset for human-readable output
label_names = dataset['train'].features['label'].names
print("Label mapping:")
for i, name in enumerate(label_names):
    print(f"  {i}: {name}")

# Convert numeric predictions to human-readable category names
human_readable_preds = [label_names[label] for label in predicted_labels]

# Display the results
print("\nSample Predictions:")
print("=" * 80)
for i, (text, pred_num, pred_name) in enumerate(zip(sample_texts, predicted_labels, human_readable_preds)):
    print(f"\nText {i+1}: {text}")
    print(f"Predicted Label: {pred_num} ({pred_name})")
    print("-" * 80)

# Summary of predictions
print(f"\nRaw Predictions: {predicted_labels}")
print(f"Human-readable Predictions: {human_readable_preds}")

# Analysis of predictions
print("\nPrediction Analysis:")
for category in label_names:
    count = human_readable_preds.count(category)
    print(f"  {category}: {count} predictions")

Label mapping:
  0: World
  1: Sports
  2: Business
  3: Sci/Tech

Sample Predictions:
================================================================================

Text 1: Apple is looking at buying a UK startup for $1 billion.
Predicted Label: 3 (Sci/Tech)
--------------------------------------------------------------------------------

Text 2: The stock market crashed today due to inflation fears.
Predicted Label: 2 (Business)
--------------------------------------------------------------------------------

Text 3: Local team wins the championship in a thrilling match!
Predicted Label: 1 (Sports)
--------------------------------------------------------------------------------

Text 4: NASA is planning a new moon mission for 2026.
Predicted Label: 3 (Sci/Tech)
--------------------------------------------------------------------------------

Text 5: The latest iPhone was released with improved camera features.
Predicted Label: 3 (Sci/Tech)
--------------------------------------------------------------------------------

Raw Predictions: [3, 2, 1, 3, 3]
Human-r