# DistilBERT Training for Advertisement Detection

This notebook trains a DistilBERT model to classify RSS article titles as advertisements or news.

## Goals:
1. Load the training dataset (title + label only)
2. Prepare data for DistilBERT training
3. Fine-tune DistilBERT on our labeled data
4. Evaluate model performance
5. Save the trained model

## Model: DistilBERT
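- Smaller and faster than BERT (about 40% fewer parameters and about 60% faster, per the DistilBERT authors)
- Retains most of BERT's accuracy on language-understanding tasks (about 97%, per the same authors)
- Well suited to short-text binary classification such as advertisement vs. news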


In [1]:
# Install required packages
%pip install "transformers[torch]" torch datasets scikit-learn accelerate


Note: you may need to restart the kernel to use updated packages.


In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
import torch
from transformers import (
    DistilBertTokenizer,
    DistilBertForSequenceClassification,
    TrainingArguments,
    Trainer,
)
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")


✅ Libraries imported successfully!
PyTorch version: 2.8.0
CUDA available: False
Using device: cpu


In [3]:
# Load the training dataset
print("📊 Loading training dataset...")

# Load the CSV file created by the previous notebook
df = pd.read_csv('../data/training_dataset.csv')

print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")

# Keep only title and label columns
df = df[['title', 'label']].copy()

# Check for missing values
print(f"\nMissing values:")
print(df.isnull().sum())

# Remove any rows with missing values
df = df.dropna()
print(f"\nDataset after removing missing values: {df.shape}")

# Check label distribution
print(f"\nLabel distribution:")
print(df['label'].value_counts())
print(f"\nLabel percentages:")
print((df['label'].value_counts() / len(df) * 100).round(1))


📊 Loading training dataset...
Dataset shape: (1213, 5)
Columns: ['title', 'description', 'source', 'label', 'text']

Missing values:
title    0
label    0
dtype: int64

Dataset after removing missing values: (1213, 2)

Label distribution:
label
news             1034
advertisement     179
Name: count, dtype: int64

Label percentages:
label
news             85.2
advertisement    14.8
Name: count, dtype: float64
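
The counts above show roughly a 6:1 imbalance between news and advertisements. One common response, which this notebook does not use, is to weight the loss by inverse class frequency. The sketch below applies the usual "balanced" heuristic, `n_samples / (n_classes * count)`, to the counts printed above. The helper name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(counts):
    """Return inverse-frequency weights: n_samples / (n_classes * count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Counts taken from the label distribution printed above.
label_counts = {"news": 1034, "advertisement": 179}
class_weights = balanced_class_weights(label_counts)

for label, weight in class_weights.items():
    print(f"{label}: {weight:.3f}")
```

Weights like these could be passed to `torch.nn.CrossEntropyLoss(weight=...)` inside a custom loss. Whether that helps on this dataset is an open question that would need its own experiment.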


In [4]:
# Prepare data for training
print("🔧 Preparing data for training...")

# Create label mapping (text to numeric)
label_mapping = {'news': 0, 'advertisement': 1}
df['label_id'] = df['label'].map(label_mapping)

print(f"Label mapping: {label_mapping}")
print(f"Label distribution after mapping:")
print(df['label_id'].value_counts().sort_index())

# Split data into train/validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df['title'].tolist(),
    df['label_id'].tolist(),
    test_size=0.2,
    random_state=42,
    stratify=df['label_id']
)

print(f"\nTrain set: {len(train_texts)} samples")
print(f"Validation set: {len(val_texts)} samples")

# Check distribution in splits
print(f"\nTrain set label distribution:")
print(pd.Series(train_labels).value_counts().sort_index())
print(f"\nValidation set label distribution:")
print(pd.Series(val_labels).value_counts().sort_index())


🔧 Preparing data for training...
Label mapping: {'news': 0, 'advertisement': 1}
Label distribution after mapping:
label_id
0    1034
1     179
Name: count, dtype: int64

Train set: 970 samples
Validation set: 243 samples

Train set label distribution:
0    827
1    143
Name: count, dtype: int64

Validation set label distribution:
0    207
1     36
Name: count, dtype: int64


In [5]:
# Initialize DistilBERT tokenizer and model
print("🤖 Initializing DistilBERT...")

model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(
    model_name, 
    num_labels=2,
    id2label={0: "news", 1: "advertisement"},
    label2id={"news": 0, "advertisement": 1}
)

print(f"✅ Loaded model: {model_name}")
print(f"Model parameters: {model.num_parameters():,}")

# Move model to device
model = model.to(device)
print(f"Model moved to: {device}")


🤖 Initializing DistilBERT...


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Loaded model: distilbert-base-uncased
Model parameters: 66,955,010
Model moved to: cpu


In [6]:
# Tokenize the data
print("🔤 Tokenizing data...")

def tokenize_function(texts):
    return tokenizer(
        texts,
        truncation=True,
        padding=True,
        max_length=128,
        return_tensors="pt"
    )

# Tokenize training data
train_encodings = tokenize_function(train_texts)
val_encodings = tokenize_function(val_texts)

print(f"✅ Tokenized training data: {train_encodings['input_ids'].shape}")
print(f"✅ Tokenized validation data: {val_encodings['input_ids'].shape}")

# Create PyTorch datasets
class NewsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Encodings are already tensors (return_tensors="pt"), so copy them instead of re-wrapping
        item = {key: val[idx].clone().detach() for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = NewsDataset(train_encodings, train_labels)
val_dataset = NewsDataset(val_encodings, val_labels)

print(f"✅ Created datasets - Train: {len(train_dataset)}, Val: {len(val_dataset)}")


🔤 Tokenizing data...
✅ Tokenized training data: torch.Size([970, 53])
✅ Tokenized validation data: torch.Size([243, 52])
✅ Created datasets - Train: 970, Val: 243
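
The two tensors have different widths (53 and 52) because `padding=True` pads each tokenizer call to the longest sequence in that call, not to `max_length`. The plain-Python sketch below mimics that behavior with made-up token IDs. The helper `pad_to_longest` and the pad ID of 0 are assumptions for illustration, not the tokenizer's actual implementation.

```python
def pad_to_longest(sequences, pad_id=0, max_length=128):
    """Truncate to max_length, then pad every sequence to the longest one in this batch."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return input_ids, attention_mask

# Made-up token IDs, for illustration only.
batch_a = [[101, 7592, 102], [101, 2023, 2003, 1037, 102]]
batch_b = [[101, 102], [101, 7592, 2088, 102]]

ids_a, mask_a = pad_to_longest(batch_a)
ids_b, mask_b = pad_to_longest(batch_b)

# Each batch is padded to its own longest sequence, so the widths differ.
print(len(ids_a[0]), len(ids_b[0]))
```

Because the attention mask marks padded positions with 0, the extra padding affects compute cost rather than predictions.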


In [7]:
# Define evaluation metrics
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='weighted')
    accuracy = accuracy_score(labels, predictions)
    
    return {
        'accuracy': accuracy,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

# Set up training arguments
training_args = TrainingArguments(
    output_dir='../models/distilbert-ad-detection',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir='../logs',
    logging_steps=50,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    report_to="none",  # Disable wandb/tensorboard logging (the string "none", not None)
)

print("✅ Training arguments configured")
print(f"Output directory: {training_args.output_dir}")
print(f"Training epochs: {training_args.num_train_epochs}")
print(f"Batch size: {training_args.per_device_train_batch_size}")


✅ Training arguments configured
Output directory: ../models/distilbert-ad-detection
Training epochs: 3
Batch size: 16
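
It helps to translate these arguments into optimizer steps, since `warmup_steps` and `logging_steps` are both expressed in steps rather than epochs. The arithmetic below uses the values configured above and assumes a single device with no gradient accumulation.

```python
import math

train_samples = 970   # size of the training split
batch_size = 16       # per_device_train_batch_size
num_epochs = 3        # num_train_epochs
warmup_steps = 100    # warmup_steps

# Assumes one device, no gradient accumulation, and that the last partial batch is kept.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_fraction = warmup_steps / total_steps

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total optimizer steps: {total_steps}")
print(f"Warmup share of training: {warmup_fraction:.0%}")
```

Under these assumptions the learning rate is still warming up for more than half of the run, so a smaller `warmup_steps` may be worth trying on a dataset this size.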


In [8]:
# Initialize trainer
print("🚀 Initializing trainer...")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

print("✅ Trainer initialized")
print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")

# Start training
print("\n🎯 Starting training...")
print("This may take several minutes depending on your hardware...")

trainer.train()

print("✅ Training completed!")


🚀 Initializing trainer...
✅ Trainer initialized
Training samples: 970
Validation samples: 243

🎯 Starting training...
This may take several minutes depending on your hardware...


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.5205,0.288675,0.851852,0.783704,0.725652,0.851852
2,0.229,0.144454,0.946502,0.944066,0.944868,0.946502
3,0.1275,0.239796,0.942387,0.94018,0.940254,0.942387


✅ Training completed!
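
The epoch 1 row deserves a second look. Its metrics match what a model that labels every title as news would score on this validation set (207 news, 36 advertisements). The sketch below reproduces the weighted metrics for such an all-news predictor. It assumes that precision for a class that is never predicted counts as 0.

```python
# Validation-set class counts from the split above.
n_news, n_ads = 207, 36
n_total = n_news + n_ads

# A predictor that labels every title "news".
news_precision = n_news / n_total   # all predictions are "news"; 207 of them are right
news_recall = 1.0                   # every true news title is found
news_f1 = 2 * news_precision * news_recall / (news_precision + news_recall)

# No title is ever predicted "advertisement"; its precision is taken as 0 by assumption.
ad_precision = ad_recall = ad_f1 = 0.0

def weighted(news_value, ad_value):
    """Support-weighted average over the two classes."""
    return (n_news * news_value + n_ads * ad_value) / n_total

accuracy = n_news / n_total
print(f"accuracy:  {accuracy:.6f}")
print(f"f1:        {weighted(news_f1, ad_f1):.6f}")
print(f"precision: {weighted(news_precision, ad_precision):.6f}")
print(f"recall:    {weighted(news_recall, ad_recall):.6f}")
```

The values agree with the epoch 1 row to six decimals. This suggests the model had not yet learned to flag any advertisement at that point, and that 85% accuracy is simply the majority-class baseline here.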


In [9]:
# Evaluate the model
print("📊 Evaluating model performance...")

# Get evaluation results
eval_results = trainer.evaluate()
print(f"\nValidation Results:")
for key, value in eval_results.items():
    print(f"  {key}: {value:.4f}")

# Make predictions on validation set
print("\n🔍 Making predictions on validation set...")
predictions = trainer.predict(val_dataset)
pred_labels = np.argmax(predictions.predictions, axis=1)

# Detailed classification report
print("\n📋 Detailed Classification Report:")
print(classification_report(val_labels, pred_labels, 
                          target_names=['news', 'advertisement']))

# Show some example predictions
print("\n🔍 Example Predictions:")
print("=" * 80)
for i in range(10):
    true_label = "news" if val_labels[i] == 0 else "advertisement"
    pred_label = "news" if pred_labels[i] == 0 else "advertisement"
    # NOTE: predictions.predictions holds raw logits, so this is the max logit, not a probability
    confidence = max(predictions.predictions[i])
    
    print(f"Title: {val_texts[i][:60]}...")
    print(f"True: {true_label} | Predicted: {pred_label} | Confidence: {confidence:.3f}")
    print("-" * 80)


📊 Evaluating model performance...



Validation Results:
  eval_loss: 0.1445
  eval_accuracy: 0.9465
  eval_f1: 0.9441
  eval_precision: 0.9449
  eval_recall: 0.9465
  eval_runtime: 1.0471
  eval_samples_per_second: 232.0660
  eval_steps_per_second: 15.2800
  epoch: 3.0000

🔍 Making predictions on validation set...

📋 Detailed Classification Report:
               precision    recall  f1-score   support

         news       0.95      0.99      0.97       207
advertisement       0.90      0.72      0.80        36

     accuracy                           0.95       243
    macro avg       0.92      0.85      0.88       243
 weighted avg       0.94      0.95      0.94       243
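
The weighted averages logged during training (F1 around 0.944) mostly reflect the news class. The confusion counts in the sketch below are not printed by the notebook. They are reconstructed from the rounded report and the reported accuracy (230 of 243 correct), so treat them as an inference. The sketch recomputes the per-class and weighted metrics from those counts.

```python
# Reconstructed confusion counts (an inference from the report, not notebook output).
tn, fp = 204, 3    # true news: predicted news / predicted advertisement
fn, tp = 10, 26    # true advertisements: predicted news / predicted advertisement

def prf(true_pos, false_pos, false_neg):
    """Precision, recall, and F1 for one class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Advertisement as the positive class, then news as the positive class.
ad_p, ad_r, ad_f1 = prf(tp, fp, fn)
news_p, news_r, news_f1 = prf(tn, fn, fp)

support_news, support_ads = tn + fp, fn + tp
total = support_news + support_ads
weighted_f1 = (support_news * news_f1 + support_ads * ad_f1) / total
accuracy = (tn + tp) / total

print(f"advertisement  P={ad_p:.2f} R={ad_r:.2f} F1={ad_f1:.2f}")
print(f"news           P={news_p:.2f} R={news_r:.2f} F1={news_f1:.2f}")
print(f"accuracy={accuracy:.4f} weighted F1={weighted_f1:.4f}")
```

Under this reconstruction, 10 of the 36 advertisements are missed. If catching advertisements is the priority, selecting the best checkpoint on the advertisement-class F1 (for example `average='binary'` in `precision_recall_fscore_support`) may be more informative than the weighted F1.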


🔍 Example Predictions:
Title: Hundreds of flights delayed at Heathrow and other airports a...
True: news | Predicted: news | Confidence: 1.899
--------------------------------------------------------------------------------
Title: [Sponsor] Dekáf Coffee Roasters...
True: news | Predicted: news | Confidence: 1.524
-----------------------------------
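
The "Confidence" values above are greater than 1 because they are raw logits rather than probabilities. A softmax over the two logits turns them into values that sum to 1. The sketch below uses only the standard library. Its logit pair is hypothetical, since the notebook prints only the larger logit.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for [news, advertisement]; the second value is made up
# for illustration because only the max logit appears in the output above.
logits = [1.899, -1.750]
probs = softmax(logits)

print(f"P(news)={probs[0]:.3f}  P(advertisement)={probs[1]:.3f}")
```

The same conversion on real model output can be done with `torch.nn.functional.softmax`, as the final cell of this notebook does.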

In [10]:
# Save the trained model and tokenizer
print("💾 Saving trained model...")

# Save model and tokenizer
model.save_pretrained('../models/distilbert-ad-detection')
tokenizer.save_pretrained('../models/distilbert-ad-detection')

print("✅ Model saved to: ../models/distilbert-ad-detection")

# Test the saved model with a few examples
print("\n🧪 Testing saved model...")

# Load the saved model
loaded_model = DistilBertForSequenceClassification.from_pretrained('../models/distilbert-ad-detection')
loaded_tokenizer = DistilBertTokenizer.from_pretrained('../models/distilbert-ad-detection')

# Test predictions
test_titles = [
    "Apple Announces New iPhone with Advanced AI Features",
    "50% OFF - Limited Time Offer on Premium Headphones!",
    "Scientists Discover New Method for Carbon Capture",
    "Buy Now! Get Free Shipping on All Electronics Today Only!"
]

print("\nTest Predictions:")
print("=" * 80)
for title in test_titles:
    inputs = loaded_tokenizer(title, return_tensors="pt", truncation=True, max_length=128)
    
    with torch.no_grad():
        outputs = loaded_model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_class = torch.argmax(predictions, dim=-1).item()
        confidence = torch.max(predictions).item()
    
    # Use the label names stored in the saved model config instead of hard-coding them
    predicted_label = loaded_model.config.id2label[predicted_class]
    
    print(f"Title: {title}")
    print(f"Prediction: {predicted_label} (confidence: {confidence:.3f})")
    print("-" * 80)

print("\n🎉 Model training and evaluation completed!")
print("Your DistilBERT model is ready for inference! 🚀")


💾 Saving trained model...
✅ Model saved to: ../models/distilbert-ad-detection

🧪 Testing saved model...

Test Predictions:
Title: Apple Announces New iPhone with Advanced AI Features
Prediction: news (confidence: 0.994)
--------------------------------------------------------------------------------
Title: 50% OFF - Limited Time Offer on Premium Headphones!
Prediction: advertisement (confidence: 0.951)
--------------------------------------------------------------------------------
Title: Scientists Discover New Method for Carbon Capture
Prediction: news (confidence: 0.993)
--------------------------------------------------------------------------------
Title: Buy Now! Get Free Shipping on All Electronics Today Only!
Prediction: advertisement (confidence: 0.929)
--------------------------------------------------------------------------------

🎉 Model training and evaluation completed!
Your DistilBERT model is ready for inference! 🚀
