# Ad Analysis with Knowledge Distillation

- **Objective:**
  - Show how to use knowledge distillation to train a compact student model from a larger teacher model.
  - Predict ad conversion rates to evaluate ad quality.

- **Dataset:**
  - Synthetic CSV dataset with 1,000 rows.
  - **Columns:**
    - `ad_type` (e.g., video, banner, print)
    - `demographic` (e.g., teens, young adults, adults, seniors, professionals)
    - `conversion_rate` (regression target)

- **Models:**
  - **Teacher Model:** BERT-based model fine-tuned for regression.
  - **Student Model:** DistilBERT model trained with knowledge distillation.

- **Workflow:**
  - Generate and download a synthetic CSV dataset.
  - Preprocess and tokenize the dataset.
  - Fine-tune the teacher model.
  - Train the student model using combined ground-truth and distillation loss.
  - Evaluate the student model on sample ad inputs.


## Generate a Synthetic Ad CSV Dataset

In [None]:
import pandas as pd
import numpy as np

# Set a random seed for reproducibility
np.random.seed(42)

# Define the number of samples
num_samples = 1000

# Define possible values for ad_type and demographic
ad_types = ["video", "banner", "print", "social", "email", "tv", "radio", "outdoor", "search"]
demographics = ["teens", "young adults", "adults", "seniors", "professionals"]

# Randomly select ad types and demographics for each sample
ad_type_data = np.random.choice(ad_types, num_samples)
demographic_data = np.random.choice(demographics, num_samples)

# Generate random conversion rates between 0.05 and 0.25 (rounded to 2 decimals)
conversion_rate_data = np.round(np.random.uniform(0.05, 0.25, num_samples), 2)

# Create a DataFrame
df_large = pd.DataFrame({
    "ad_type": ad_type_data,
    "demographic": demographic_data,
    "conversion_rate": conversion_rate_data
})

# Save the DataFrame as a CSV file
csv_filename = "ad_data_large.csv"
df_large.to_csv(csv_filename, index=False)
print(f"CSV file created: {csv_filename}")


CSV file created: ad_data_large.csv
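Because `conversion_rate` is drawn uniformly and independently of `ad_type` and `demographic`, the input text carries no signal about the target, so a model can do little better than predict the mean. A minimal sketch of that baseline, using only the standard library (the names `low`, `high`, `theoretical_mse`, and `baseline_mse` are illustrative and not part of the notebook):

```python
import random

low, high = 0.05, 0.25  # same range as the generator above

# Variance of Uniform(low, high), i.e. the MSE of always predicting the mean
theoretical_mse = (high - low) ** 2 / 12

# Empirical check on freshly drawn samples (the 2-decimal rounding is ignored here)
rng = random.Random(0)
samples = [rng.uniform(low, high) for _ in range(100_000)]
mean = sum(samples) / len(samples)
baseline_mse = sum((x - mean) ** 2 for x in samples) / len(samples)

print(f"theoretical: {theoretical_mse:.5f}, empirical: {baseline_mse:.5f}")
```

An MSE of roughly 0.0033 against the true labels is therefore about the best to expect on this synthetic data, and the training losses are easier to interpret against that number.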


In [None]:
# Download the CSV to the local machine (works only inside Google Colab)
from google.colab import files
files.download("ad_data_large.csv")



## Imports and Setup

In [None]:
import torch
from torch.utils.data import DataLoader
import torch.nn.functional as F
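from transformers import AutoTokenizer, AutoModelForSequenceClassification
# transformers' own AdamW is deprecated in favor of the PyTorch implementation
from torch.optim import AdamW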
from datasets import Dataset
import pandas as pd
import numpy as np

# Set device (GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)


Using device: cpu


## Load and Preprocess the CSV Dataset

In [None]:
# Load the CSV dataset using pandas
df = pd.read_csv("ad_data_large.csv")

# Create a new text column by combining ad_type and demographic.
# This text will serve as the input to the model.
df["text"] = "Ad Type: " + df["ad_type"] + ". Demographic: " + df["demographic"] + "."

# Convert the pandas DataFrame to a Hugging Face Dataset
dataset = Dataset.from_pandas(df)

# Initialize the tokenizer from the student checkpoint. DistilBERT reuses BERT's
# uncased WordPiece vocabulary, so the same input_ids are valid for the teacher too.
model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Tokenization function: adjust max_length if needed.
def tokenize_function(example):
    return tokenizer(example["text"], padding="max_length", truncation=True, max_length=128)

# Tokenize the dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Remove unneeded columns and rename the target column to "label" (for regression)
tokenized_dataset = tokenized_dataset.remove_columns(["ad_type", "demographic", "text"])
tokenized_dataset = tokenized_dataset.rename_column("conversion_rate", "label")
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])


# Create a DataLoader for training (using a small batch size for demonstration)
train_dataloader = DataLoader(tokenized_dataset, batch_size=8, shuffle=True)


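The models only ever see the sentence built from the two categorical columns, so the template used at inference time has to match the one used for training exactly. One way to guarantee that is a small shared helper. `build_ad_text` below is a hypothetical name, not something the notebook defines:

```python
def build_ad_text(ad_type: str, demographic: str) -> str:
    """Render one ad in the sentence format used for the `text` column above."""
    return f"Ad Type: {ad_type}. Demographic: {demographic}."

# Two example rows rendered with the shared template
rows = [("video", "teens"), ("social", "young adults")]
texts = [build_ad_text(ad, demo) for ad, demo in rows]
print(texts)
```

With pandas, the same helper could be applied row-wise, for example `df.apply(lambda r: build_ad_text(r["ad_type"], r["demographic"]), axis=1)`.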

## Initialize Teacher and Student Models for Regression

In [None]:
# Initialize teacher and student models for regression (num_labels=1)
teacher_model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1).to(device)
student_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=1).to(device)

# Set the problem type to regression for both models
teacher_model.config.problem_type = "regression"
student_model.config.problem_type = "regression"

# Set models to appropriate modes
teacher_model.train()  # teacher will be fine-tuned first
student_model.train()  # student will be trained with distillation later



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
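The size gap between teacher and student can be estimated from the architecture hyperparameters alone. The sketch below counts encoder parameters only (the regression head is excluded) and assumes the standard base configurations: hidden size 768, feed-forward size 3072, vocabulary 30,522, and 512 positions. `encoder_params` is an illustrative helper, not a library function.

```python
def encoder_params(layers, type_vocab, pooler,
                   vocab=30522, max_pos=512, hidden=768, ffn=3072):
    """Count weights and biases of a BERT-style encoder."""
    # Word, position, and (optional) token-type embeddings plus one LayerNorm
    emb = (vocab + max_pos + type_vocab) * hidden + 2 * hidden
    # Query, key, value, and output projections
    attn = 4 * (hidden * hidden + hidden)
    # Two-layer feed-forward network
    ff = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # Two LayerNorms per transformer block
    norms = 2 * (2 * hidden)
    pool = hidden * hidden + hidden if pooler else 0
    return emb + layers * (attn + ff + norms) + pool

# BERT-base: 12 layers, token-type embeddings, pooler
bert = encoder_params(layers=12, type_vocab=2, pooler=True)
# DistilBERT-base: 6 layers, no token-type embeddings, no pooler
distil = encoder_params(layers=6, type_vocab=0, pooler=False)

print(f"BERT: {bert:,}  DistilBERT: {distil:,}  ratio: {distil / bert:.2f}")
```

Under these assumptions the student has about 60% of the teacher's encoder parameters, with the saving coming almost entirely from halving the number of transformer layers.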




## Fine-Tune the Teacher Model on the CSV Data

In [14]:
teacher_optimizer = AdamW(teacher_model.parameters(), lr=5e-5)
num_epochs_teacher = 5

print("Training Teacher Model...\n")
for epoch in range(num_epochs_teacher):
    total_loss = 0.0
    for batch in train_dataloader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        # Cast labels to float and reshape to (batch_size, 1) to match the logits
        labels = batch["label"].to(device).float().unsqueeze(1)

        outputs = teacher_model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = outputs.logits  # shape: [batch_size, 1]

        loss = F.mse_loss(predictions, labels)

        teacher_optimizer.zero_grad()
        loss.backward()
        teacher_optimizer.step()

        total_loss += loss.item()
    avg_loss = total_loss / len(train_dataloader)
    print(f"Epoch {epoch+1}/{num_epochs_teacher} - Average Loss: {avg_loss:.4f}")

# Set teacher to evaluation mode after training
teacher_model.eval()




Training Teacher Model...

Epoch 1/5 - Average Loss: 0.0178
Epoch 2/5 - Average Loss: 0.0066
Epoch 3/5 - Average Loss: 0.0054
Epoch 4/5 - Average Loss: 0.0051
Epoch 5/5 - Average Loss: 0.0052



## Train the Student Model Using Knowledge Distillation

In [15]:
student_optimizer = AdamW(student_model.parameters(), lr=5e-5)
num_epochs_student = 5
alpha = 0.5  # Weighting factor: 50% distillation loss, 50% ground-truth loss

print("\nTraining Student Model with Knowledge Distillation...\n")
for epoch in range(num_epochs_student):
    total_loss = 0.0
    student_model.train()  # ensure student is in training mode
    for batch in train_dataloader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        # Cast labels to float and reshape to (batch_size, 1) to match the logits
        labels = batch["label"].to(device).float().unsqueeze(1)

        # Get teacher predictions (using no_grad)
        with torch.no_grad():
            teacher_outputs = teacher_model(input_ids=input_ids, attention_mask=attention_mask)
            teacher_preds = teacher_outputs.logits

        # Get student predictions
        student_outputs = student_model(input_ids=input_ids, attention_mask=attention_mask)
        student_preds = student_outputs.logits

        # Compute distillation loss (MSE between teacher and student outputs)
        loss_kd = F.mse_loss(student_preds, teacher_preds)
        # Compute ground-truth loss (MSE between student output and true conversion rate)
        loss_gt = F.mse_loss(student_preds, labels)
        # Combined loss: weighted sum of both losses
        loss = alpha * loss_kd + (1 - alpha) * loss_gt

        student_optimizer.zero_grad()
        loss.backward()
        student_optimizer.step()

        total_loss += loss.item()
    avg_loss = total_loss / len(train_dataloader)
    print(f"Epoch {epoch+1}/{num_epochs_student} - Average Loss: {avg_loss:.4f}")

# Set student model to evaluation mode after training
student_model.eval()



Training Student Model with Knowledge Distillation...

Epoch 1/5 - Average Loss: 0.0033
Epoch 2/5 - Average Loss: 0.0021
Epoch 3/5 - Average Loss: 0.0019
Epoch 4/5 - Average Loss: 0.0020
Epoch 5/5 - Average Loss: 0.0019
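The per-batch objective in the loop above is `alpha * MSE(student, teacher) + (1 - alpha) * MSE(student, labels)`. A dependency-free sketch of the same arithmetic on plain lists follows. The function names and the numbers are made up for illustration and are not outputs of the models:

```python
def mse(preds, targets):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def distillation_loss(student, teacher, labels, alpha=0.5):
    """Weighted sum of the teacher-matching loss and the ground-truth loss."""
    loss_kd = mse(student, teacher)  # distance from the teacher's predictions
    loss_gt = mse(student, labels)   # distance from the true conversion rates
    return alpha * loss_kd + (1 - alpha) * loss_gt

# Illustrative values only
student = [0.14, 0.16, 0.15]
teacher = [0.15, 0.15, 0.15]
labels = [0.10, 0.20, 0.06]

loss = distillation_loss(student, teacher, labels, alpha=0.5)
print(f"combined loss: {loss:.6f}")
```

Because of this weighting, the student's reported loss is not directly comparable to the teacher's plain MSE. When the student tracks the teacher closely, the first term is near zero and the combined value is roughly half of the student's ground-truth MSE.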




## Evaluate the Student Model on a Sample Advertisement

In [16]:
# Sample ad features: modify these values to test with your own ad data.
sample_ad_type = "social"
sample_demographic = "young adults"
sample_text = f"Ad Type: {sample_ad_type}. Demographic: {sample_demographic}."

# Tokenize the sample text
inputs = tokenizer(sample_text, return_tensors="pt", truncation=True, padding=True, max_length=128).to(device)

with torch.no_grad():
    outputs = student_model(**inputs)
    prediction = outputs.logits.item()  # predicted conversion rate

print("Sample Ad Evaluation:")
print("Ad Type:", sample_ad_type)
print("Demographic:", sample_demographic)
print("Predicted Conversion Rate:", prediction)


Sample Ad Evaluation:
Ad Type: social
Demographic: young adults
Predicted Conversion Rate: 0.15807195007801056
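A single prediction says little about model quality. A fuller evaluation would compare predictions against known targets on rows held out from training, which the workflow here does not set aside. A minimal sketch of the metric computation, with made-up numbers standing in for real predictions and a hypothetical helper named `regression_metrics`:

```python
import math

def regression_metrics(preds, targets):
    """Return mean absolute error and root mean squared error."""
    errors = [p - t for p, t in zip(preds, targets)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return {"mae": mae, "rmse": rmse}

# Illustrative values only; not outputs of the models above
preds = [0.15, 0.16, 0.14, 0.15]
targets = [0.10, 0.20, 0.14, 0.19]

metrics = regression_metrics(preds, targets)
print(metrics)
```

In practice, `preds` would be collected by running the student in `eval()` mode over a held-out `DataLoader`, and the same metrics could be computed for the teacher to quantify what the distillation step costs in accuracy.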
