# Model Training for Cross-Modal Audience Intelligence

This notebook demonstrates the end-to-end training process for the multimodal fusion model used in audience engagement prediction, including:
- Data preparation
- Feature extraction
- Model architecture
- Training process
- Model evaluation
- Model optimization
- Saving trained models

In [1]:
%pip install pandas numpy matplotlib seaborn networkx scipy tqdm torch torchvision scikit-learn pillow
%pip install caip 
%pip install onnx onnxruntime

Note: you may need to restart the kernel to use updated packages.
ERROR: Could not find a version that satisfies the requirement caip (from versions: none)
ERROR: No matching distribution found for caip
Note: you may need to restart the kernel to use updated packages.
Collecting onnx
  Downloading onnx-1.17.0-cp311-cp311-macosx_12_0_universal2.whl.metadata (16 kB)
Collecting protobuf>=3.20.2 (from onnx)
  Using cached protobuf-6.30.2-cp39-abi3-macosx_10_9_universal2.whl.metadata (593 bytes)
Downloading onnx-1.17.0-cp311-cp311-macosx_12_0_universal2.whl (16.6 MB)
Downloading protobuf-6.30.2-cp39-abi3-macosx_10_9_universal2.whl (417 kB)
Installing collected packages: protobuf, onnx
  Attempting uninstall: protobuf
    Found existing installation: protobuf 3.20.0
    Uninstalling protobuf-3.20.0:
      Successful

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset, random_split
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error, r2_score
from PIL import Image
from tqdm.notebook import tqdm
import os
import json

# Import platform components
from models.fusion.fusion_model import MultimodalFusionModel
from models.visual.clip_model import CLIPWrapper
from models.text.roberta_model import RoBERTaWrapper
from models.optimization.quantization import ModelQuantizer
from models.optimization.onnx_export import ONNXExporter
from data.data_loader import DataLoader as DataImporter

# Set up plotting
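# 'seaborn-whitegrid' was renamed to 'seaborn-v0_8-whitegrid' in Matplotlib 3.6; use whichever is available
plt.style.use('seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid')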
plt.rcParams['figure.figsize'] = (14, 8)
plt.rcParams['font.size'] = 12

# Set random seed for reproducibility
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)

## Data Loading and Preparation

Let's load the preprocessed data and prepare it for model training.

In [None]:
# Configure paths
DATA_DIR = "./data"
MODEL_DIR = "./models/saved"
os.makedirs(MODEL_DIR, exist_ok=True)

# Initialize data loader
data_importer = DataImporter(cache_dir=f"{DATA_DIR}/cache")

# Load processed data (assuming you've already run the data_exploration notebook)
try:
    # Try to load processed data from cache
    processed_data = data_importer.load_processed_data()
    print("Loaded processed data from cache")
except FileNotFoundError:
    print("Processed data not found in cache. Creating sample data...")
    # Create sample data for demonstration
    processed_data = {
        "content_ids": [f"SHOW{i}" for i in range(100)],
        "text_content": [f"Sample description for content {i}" for i in range(100)],
        "image_paths": [f"{DATA_DIR}/images/sample_{i}.jpg" for i in range(100)],
        "engagement": np.random.rand(100) * 0.8 + 0.1  # Random values between 0.1 and 0.9
    }
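
Before building a dataset from `processed_data`, it can help to confirm that the four fields line up. The following is a minimal sketch, not part of the platform code: `check_processed_data` is a hypothetical helper, and the assumption that engagement values lie in [0, 1] is taken from the sample data above rather than from any documented contract.

```python
import numpy as np

REQUIRED_KEYS = ("content_ids", "text_content", "image_paths", "engagement")

def check_processed_data(data):
    """Return the number of samples, raising ValueError on inconsistent input.

    Hypothetical helper: the key names mirror the sample dictionary above.
    """
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"Missing keys: {missing}")

    # Every field must describe the same number of content items.
    lengths = {key: len(data[key]) for key in REQUIRED_KEYS}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Field lengths differ: {lengths}")

    engagement = np.asarray(data["engagement"], dtype=np.float32)
    # Assumption: engagement is a normalized score in [0, 1].
    if not np.all((engagement >= 0.0) & (engagement <= 1.0)):
        raise ValueError("Engagement values fall outside [0, 1]")

    return lengths["content_ids"]

sample = {
    "content_ids": [f"SHOW{i}" for i in range(5)],
    "text_content": [f"Sample description for content {i}" for i in range(5)],
    "image_paths": [f"./data/images/sample_{i}.jpg" for i in range(5)],
    "engagement": np.linspace(0.1, 0.9, 5),
}
n_samples = check_processed_data(sample)
```

Running a check like this right after loading surfaces mismatched field lengths before they show up as index errors inside a data loader.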

## Create Dataset for PyTorch

We'll create a custom dataset that returns the image, text, and engagement target for each content item.

In [None]:
class AudienceDataset(Dataset):
    """Dataset for audience engagement prediction."""
    
    def __init__(self, content_ids, text_content, image_paths, engagement, 
                 transform=None, text_transform=None):
        """Initialize dataset.
        
        Args:
            content_ids: List of content identifiers
            text_content: List of text descriptions
            image_paths: List of paths to images
            engagement: Array of engagement values
            transform: Image transformation function
            text_transform: Text transformation function
        """
        self.content_ids = content_ids
        self.text_content = text_content
        self.image_paths = image_paths
        self.engagement = engagement
        self.transform = transform
        self.text_transform = text_transform
        
    def __len__(self):
        return len(self.content_ids)
    
    def __getitem__(self, idx):
        # Get image; fall back to a zero tensor if the file is missing or unreadable
        image_path = self.image_paths[idx]
        try:
            if os.path.exists(image_path):
                image = Image.open(image_path).convert('RGB')
            else:
                # Create a dummy image if file doesn't exist
                image = torch.zeros((3, 224, 224))
        except Exception as e:
            # Create a dummy image if loading fails
            print(f"Error loading image {image_path}: {e}")
            image = torch.zeros((3, 224, 224))

        # Apply the transform to real and dummy images alike so that
        # every item in a batch has the same type and shape
        if self.transform:
            image = self.transform(image)
        
        # Get text
        text = self.text_content[idx]
        if self.text_transform:
            text = self.text_transform(text)
        
        # Get engagement (target)
        engagement = torch.tensor(self.engagement[idx], dtype=torch.float32)
        
        # Get content ID for reference
        content_id = self.content_ids[idx]
        
        return {
            'content_id': content_id,
            'image': image,
            'text': text,
            'engagement': engagement
        }

In [None]:
# Initialize feature extractors for images and text
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Initialize CLIP model for image features
clip_model = CLIPWrapper(model_name="openai/clip-vit-base-patch32", device=device)

# Initialize RoBERTa model for text features
roberta_model = RoBERTaWrapper(model_name="roberta-base", device=device)

In [None]:
# Define transformations
def encode_image(image):
    """Encode image using CLIP."""
    if isinstance(image, torch.Tensor):
        # Handle dummy images
        if image.shape == (3, 224, 224):
            # Create a zero embedding
            return torch.zeros((1, clip_model.model.config.projection_dim), device=device)
    
    # Encode real image
    return clip_model.encode_images(image)

def encode_text(text):
    """Encode text using RoBERTa."""
    return roberta_model.encode_text(text)

In [None]:
# Create dataset
dataset = AudienceDataset(
    content_ids=processed_data["content_ids"],
    text_content=processed_data["text_content"],
    image_paths=processed_data["image_paths"],
    engagement=processed_data["engagement"],
    transform=encode_image,
    text_transform=encode_text
)

# Split into training and validation sets (80% training, 20% validation)
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print(f"Training set size: {len(train_dataset)}")
print(f"Validation set size: {len(val_dataset)}")
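
`random_split` draws from PyTorch's global random number generator by default, so the resulting split depends on everything that consumed random numbers earlier in the session. One way to pin the split regardless of cell execution order is to pass a dedicated generator. The sketch below uses a stand-in `TensorDataset`; `split_indices` is a hypothetical helper for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in dataset with 100 items; the notebook would pass `dataset` instead.
toy_dataset = TensorDataset(torch.arange(100))

def split_indices(seed):
    """Split 80/20 with a dedicated generator and return both index lists."""
    generator = torch.Generator().manual_seed(seed)
    train_part, val_part = random_split(toy_dataset, [80, 20], generator=generator)
    return train_part.indices, val_part.indices

# Two independent calls with the same seed select the same items.
train_a, val_a = split_indices(42)
train_b, val_b = split_indices(42)
```

Because the generator is created inside the helper, the split no longer depends on how many random numbers were drawn beforehand.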

In [None]:
# Create data loaders
batch_size = 16
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,
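    num_workers=0  # the transforms call the CLIP/RoBERTa encoders; CUDA models generally cannot be used from forked workers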
)

val_loader = torch.utils.data.DataLoader(
    val_dataset,
    batch_size=batch_size,
    shuffle=False,
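    num_workers=0  # the transforms call the CLIP/RoBERTa encoders; CUDA models generally cannot be used from forked workers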
)
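
The training code later reads `batch['image']`, `batch['text']`, and `batch['engagement']`, so it is worth knowing what PyTorch's default collation does with a dictionary of items. The sketch below uses a stand-in dataset, not `AudienceDataset`; the per-item embedding shapes `(1, 512)` and `(1, 768)` are assumptions modeled on the zero embedding returned by `encode_image`, and the real wrappers may return different shapes.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyEmbeddingDataset(Dataset):
    """Stand-in for AudienceDataset that returns fixed-size embeddings."""

    def __len__(self):
        return 10

    def __getitem__(self, idx):
        return {
            "content_id": f"SHOW{idx}",
            "image": torch.zeros(1, 512),   # assumed per-item visual embedding shape
            "text": torch.zeros(1, 768),    # assumed per-item text embedding shape
            "engagement": torch.tensor(0.5, dtype=torch.float32),
        }

loader = DataLoader(ToyEmbeddingDataset(), batch_size=4, shuffle=False)
batch = next(iter(loader))

image_shape = tuple(batch["image"].shape)            # tensors are stacked along a new batch axis
engagement_shape = tuple(batch["engagement"].shape)  # scalar tensors become a 1-D tensor
content_ids = batch["content_id"]                    # strings are kept as a plain list

# If the model expects (batch, dim), drop the singleton axis explicitly.
flat_image = batch["image"].squeeze(1)
```

If the encoders do return a leading singleton axis, squeezing it once per batch (or once per item in the transform) keeps the tensors fed to the model two-dimensional.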

## Initialize Fusion Model

We'll create and initialize the multimodal fusion model for training.

In [None]:
# Get feature dimensions
visual_dim = clip_model.model.config.projection_dim  # 512 for CLIP ViT-B/32 (768 is the vision tower's hidden size)
text_dim = roberta_model.model.config.hidden_size    # Usually 768 for RoBERTa-base

print(f"Visual feature dimension: {visual_dim}")
print(f"Text feature dimension: {text_dim}")

# Initialize the fusion model
fusion_model = MultimodalFusionModel(
    visual_dim=visual_dim,
    text_dim=text_dim,
    fusion_dim=512,
    num_layers=4,
    num_heads=8,
    feedforward_dim=2048,
    dropout=0.1,
    num_engagement_classes=5,  # For classification mode
    engagement_type="regression",  # Use regression for continuous engagement values
    device=device
)

fusion_model.to(device)
print(fusion_model)
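
The internals of `MultimodalFusionModel` are not shown in this notebook. As a rough mental model only, the sketch below projects both modalities into a shared `fusion_dim`, treats them as a two-token sequence, and passes that through a Transformer encoder before a regression head. `TinyFusionSketch` and its layer layout are assumptions made for illustration; only the hyperparameter names and the `outputs["engagement"]["score"]` access pattern are taken from the cells in this notebook.

```python
import torch
import torch.nn as nn

class TinyFusionSketch(nn.Module):
    """Illustrative stand-in, NOT the project's MultimodalFusionModel."""

    def __init__(self, visual_dim=512, text_dim=768, fusion_dim=512,
                 num_layers=4, num_heads=8, feedforward_dim=2048, dropout=0.1):
        super().__init__()
        # Project each modality into the shared fusion space.
        self.visual_proj = nn.Linear(visual_dim, fusion_dim)
        self.text_proj = nn.Linear(text_dim, fusion_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=fusion_dim,
            nhead=num_heads,
            dim_feedforward=feedforward_dim,
            dropout=dropout,
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(fusion_dim, 1)

    def forward(self, visual_features, text_features):
        # Build a two-token sequence: (batch, 2, fusion_dim).
        tokens = torch.stack(
            [self.visual_proj(visual_features), self.text_proj(text_features)],
            dim=1,
        )
        # Self-attention lets each modality attend to the other; then pool.
        fused = self.encoder(tokens).mean(dim=1)
        return {"engagement": {"score": self.head(fused)}}  # (batch, 1)

model = TinyFusionSketch().eval()
with torch.no_grad():
    out = model(torch.randn(4, 512), torch.randn(4, 768))
score_shape = tuple(out["engagement"]["score"].shape)
```

The `(batch, 1)` score shape is why the training code calls `.squeeze(-1)` before comparing predictions with the engagement targets.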

## Training Loop

Let's define the training loop and train the model.

In [None]:
# Define optimizer and loss function
optimizer = optim.Adam(fusion_model.parameters(), lr=0.0001)
criterion = nn.MSELoss()  # Mean Squared Error for regression

# Learning rate scheduler
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=5  # `verbose` is deprecated in recent PyTorch releases and is omitted here
)
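
To see what `factor=0.5` and `patience=5` mean in practice, the following sketch feeds a scheduler a validation loss that never improves and records the learning rate after each call. The `nn.Linear` model is a placeholder whose only job is to give the optimizer some parameters.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 1)  # placeholder model, not the fusion model
optimizer = optim.Adam(model.parameters(), lr=0.0001)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5
)

lr_history = []
for epoch in range(8):
    # Simulate a validation loss that stays flat at 1.0.
    scheduler.step(1.0)
    lr_history.append(optimizer.param_groups[0]["lr"])
```

The first call sets the baseline, the next five count as non-improving epochs that are tolerated, and the learning rate is halved on the seventh call, once the count of non-improving epochs exceeds `patience`.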

In [None]:
def train_epoch(model, dataloader, optimizer, criterion, device):
    """Train model for one epoch."""
    model.train()
    running_loss = 0.0
    
    for batch in tqdm(dataloader, desc="Training"):
        # Get data
        visual_features = batch['image'].to(device)
        text_features = batch['text'].to(device)
        engagement = batch['engagement'].to(device)
        
        # Forward pass
        outputs = model(visual_features, text_features)
        
        # Get predicted engagement
        if model.engagement_type == "regression":
            predicted_engagement = outputs["engagement"]["score"].squeeze(-1)
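        else:  # classification
            # NOTE: argmax is not differentiable, so loss.backward() cannot train through this branch;
            # classification mode needs a loss computed on the class probabilities instead.
            predicted_engagement = torch.argmax(outputs["engagement"]["probabilities"], dim=1).float() / model.num_engagement_classes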
        
        # Calculate loss
        loss = criterion(predicted_engagement, engagement)
        
        # Backward pass and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Record loss
        running_loss += loss.item() * visual_features.size(0)
    
    # Calculate epoch loss
    epoch_loss = running_loss / len(dataloader.dataset)
    
    return epoch_loss

def validate(model, dataloader, criterion, device):
    """Validate model."""
    model.eval()
    running_loss = 0.0
    all_predictions = []
    all_targets = []
    
    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Validating"):
            # Get data
            visual_features = batch['image'].to(device)
            text_features = batch['text'].to(device)
            engagement = batch['engagement'].to(device)
            
            # Forward pass
            outputs = model(visual_features, text_features)
            
            # Get predicted engagement
            if model.engagement_type == "regression":
                predicted_engagement = outputs["engagement"]["score"].squeeze(-1)
            else:  # classification
                predicted_engagement = torch.argmax(outputs["engagement"]["probabilities"], dim=1).float() / model.num_engagement_classes
            
            # Calculate loss
            loss = criterion(predicted_engagement, engagement)
            
            # Record loss
            running_loss += loss.item() * visual_features.size(0)
            
            # Record predictions and targets for metrics
            all_predictions.extend(predicted_engagement.cpu().numpy())
            all_targets.extend(engagement.cpu().numpy())
    
    # Calculate validation loss
    val_loss = running_loss / len(dataloader.dataset)
    
    # Calculate additional metrics
    mse = mean_squared_error(all_targets, all_predictions)
    r2 = r2_score(all_targets, all_predictions)
    
    return val_loss, mse, r2, all_predictions, all_targets
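
For readers less familiar with the two metrics returned by `validate`, here is a small worked example computed by hand with NumPy. `mse_and_r2` is a hypothetical helper written for this illustration; it uses the standard definition R² = 1 - SS_res / SS_tot and assumes the targets are not all identical.

```python
import numpy as np

def mse_and_r2(targets, predictions):
    """Compute mean squared error and R² from first principles."""
    targets = np.asarray(targets, dtype=np.float64)
    predictions = np.asarray(predictions, dtype=np.float64)
    residual_ss = np.sum((targets - predictions) ** 2)   # unexplained variation
    total_ss = np.sum((targets - targets.mean()) ** 2)   # variation around the mean
    mse = residual_ss / targets.size
    r2 = 1.0 - residual_ss / total_ss
    return mse, r2

targets = [0.2, 0.4, 0.6, 0.8]

# Perfect predictions: no residual error.
perfect_mse, perfect_r2 = mse_and_r2(targets, targets)

# Always predicting the mean target (0.5): no better than a constant baseline.
baseline_mse, baseline_r2 = mse_and_r2(targets, [0.5, 0.5, 0.5, 0.5])
```

An R² near zero therefore means the model does no better than predicting the mean engagement, and a negative R² means it does worse.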

In [None]:
# Training hyperparameters
num_epochs = 30
best_val_loss = float('inf')

# Track metrics
train_losses = []
val_losses = []
val_mses = []
val_r2s = []

# Training loop
for epoch in range(num_epochs):
    print(f"Epoch {epoch+1}/{num_epochs}")
    
    # Train
    train_loss = train_epoch(fusion_model, train_loader, optimizer, criterion, device)
    train_losses.append(train_loss)
    
    # Validate
    val_loss, val_mse, val_r2, predictions, targets = validate(fusion_model, val_loader, criterion, device)
    val_losses.append(val_loss)
    val_mses.append(val_mse)
    val_r2s.append(val_r2)
    
    # Update learning rate
    scheduler.step(val_loss)
    
    # Print epoch results
    print(f"  Train Loss: {train_loss:.4f}")
    print(f"  Val Loss: {val_loss:.4f}, MSE: {val_mse:.4f}, R²: {val_r2:.4f}")
    
    # Save best model
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        print(f"  Saving new best model with val_loss: {val_loss:.4f}")
        fusion_model.save(f"{MODEL_DIR}/fusion_model_best.pt")
    
    print()
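
The losses printed by this loop are per-sample averages: `train_epoch` and `validate` multiply each batch's mean loss by the batch size and divide by the dataset size at the end. The sketch below shows why that weighting matters when the last batch is short, as happens with 20 validation samples and a batch size of 16. The loss values are made up for illustration.

```python
def epoch_loss(batch_losses, batch_sizes):
    """Per-sample mean loss reconstructed from per-batch mean losses."""
    running = sum(loss * size for loss, size in zip(batch_losses, batch_sizes))
    return running / sum(batch_sizes)

# 20 samples with batch_size=16 -> one full batch and one batch of 4.
batch_losses = [0.10, 0.50]   # illustrative per-batch mean losses
batch_sizes = [16, 4]

weighted = epoch_loss(batch_losses, batch_sizes)
unweighted = sum(batch_losses) / len(batch_losses)
```

Averaging the batch means directly would let the four samples in the short batch count as much as the sixteen in the full one; the weighted form avoids that.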

## Visualize Training Progress

Let's visualize the training and validation metrics.

In [None]:
# Plot training and validation loss
plt.figure(figsize=(12, 6))
plt.plot(range(1, num_epochs+1), train_losses, marker='o', linestyle='-', label='Training Loss')
plt.plot(range(1, num_epochs+1), val_losses, marker='o', linestyle='-', label='Validation Loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss (MSE)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig(f"{MODEL_DIR}/training_loss.png")
plt.show()

# Plot validation metrics
fig, ax1 = plt.subplots(figsize=(12, 6))

# MSE on left axis
ax1.set_xlabel('Epoch')
ax1.set_ylabel('MSE', color='tab:red')
ax1.plot(range(1, num_epochs+1), val_mses, marker='o', linestyle='-', color='tab:red', label='MSE')
ax1.tick_params(axis='y', labelcolor='tab:red')

# R² on right axis
ax2 = ax1.twinx()
ax2.set_ylabel('R²', color='tab:blue')
ax2.plot(range(1, num_epochs+1), val_r2s, marker='o', linestyle='-', color='tab:blue', label='R²')
ax2.tick_params(axis='y', labelcolor='tab:blue')

# Add title and grid
plt.title('Validation Metrics')
plt.grid(True, alpha=0.3)

# Add legend
lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(lines1 + lines2, labels1 + labels2, loc='upper right')

plt.savefig(f"{MODEL_DIR}/validation_metrics.png")
plt.show()

## Model Evaluation

Let's evaluate the trained model on the validation set.

In [None]:
# Load the best model
best_model = MultimodalFusionModel.load(f"{MODEL_DIR}/fusion_model_best.pt", device=device)

# Evaluate on validation set
val_loss, val_mse, val_r2, predictions, targets = validate(best_model, val_loader, criterion, device)

print("Final Validation Results:")
print(f"  Loss: {val_loss:.4f}")
print(f"  MSE: {val_mse:.4f}")
print(f"  R²: {val_r2:.4f}")

In [None]:
# Visualize predictions vs targets
plt.figure(figsize=(10, 8))
plt.scatter(targets, predictions, alpha=0.5)
plt.plot([min(targets), max(targets)], [min(targets), max(targets)], 'r--')  # Perfect prediction line
plt.title('Predictions vs Targets')
plt.xlabel('Actual Engagement')
plt.ylabel('Predicted Engagement')
plt.grid(True, alpha=0.3)

# Add metrics to plot
plt.text(0.05, 0.95, f"MSE: {val_mse:.4f}\nR²: {val_r2:.4f}", transform=plt.gca().transAxes,
         bbox=dict(facecolor='white', alpha=0.8))

plt.savefig(f"{MODEL_DIR}/predictions_vs_targets.png")
plt.show()

## Model Optimization

Let's optimize the model for deployment.

In [None]:
# Initialize quantizer
quantizer = ModelQuantizer(best_model)

# Quantize the model
print("Quantizing model...")
quantized_model = quantizer.dynamic_quantization(dtype="int8")

# Save quantized model
torch.save(quantized_model, f"{MODEL_DIR}/fusion_model_quantized.pt")
print(f"Saved quantized model to {MODEL_DIR}/fusion_model_quantized.pt")
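
`ModelQuantizer` is a project class whose implementation is not shown here. For orientation, this is what dynamic int8 quantization looks like with PyTorch's own API on a placeholder network; whether `ModelQuantizer.dynamic_quantization` wraps the same call is an assumption. The example runs on CPU, which is where PyTorch's dynamically quantized layers execute, and it needs a PyTorch build with a quantization backend available.

```python
import torch
import torch.nn as nn

# Placeholder network; dynamic quantization here targets the nn.Linear layers.
float_model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 1),
).eval()

# Weights are stored as int8; activations are quantized on the fly at inference.
quantized = torch.ao.quantization.quantize_dynamic(
    float_model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    sample = torch.randn(1, 512)  # CPU tensor
    float_out = float_model(sample)
    quant_out = quantized(sample)
```

The call returns a converted copy and leaves `float_model` untouched, so both versions can be compared side by side.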

In [None]:
# Benchmark original vs quantized model
# Create sample inputs for benchmarking
sample_visual = torch.randn(1, visual_dim).to(device)
sample_text = torch.randn(1, text_dim).to(device)

# Run benchmark
benchmark_results = quantizer.benchmark(
    input_data=(sample_visual, sample_text),
    num_runs=100,
    warmup_runs=10
)

# Print benchmark results
print("Benchmark Results:")
print(f"  Original model inference time: {benchmark_results['original_time_s']*1000:.2f} ms")
print(f"  Quantized model inference time: {benchmark_results['quantized_time_s']*1000:.2f} ms")
print(f"  Speedup: {benchmark_results['speedup_factor']:.2f}x")
print(f"  Original model size: {benchmark_results['original_size_mb']:.2f} MB")
print(f"  Quantized model size: {benchmark_results['quantized_size_mb']:.2f} MB")
print(f"  Size reduction: {benchmark_results['size_reduction']*100:.2f}%")

In [None]:
# Export to ONNX format
# Initialize ONNX exporter
onnx_exporter = ONNXExporter(best_model, device=device)

# Export model to ONNX
print("Exporting model to ONNX...")
onnx_path = onnx_exporter.export(
    dummy_input=(sample_visual, sample_text),
    output_path=f"{MODEL_DIR}/fusion_model.onnx",
    input_names=["visual_features", "text_features"],
    output_names=["engagement", "sentiment", "content_features"],
    verbose=True,
    optimize=True
)

print(f"Exported ONNX model to {onnx_path}")

## Save Model Configuration

Let's save the model configuration for easy loading.

In [None]:
# Save model configuration
model_config = {
    "model_type": "MultimodalFusionModel",
    "visual_dim": visual_dim,
    "text_dim": text_dim,
    "fusion_dim": 512,
    "num_layers": 4,
    "num_heads": 8,
    "engagement_type": "regression",
    "training_info": {
        "num_epochs": num_epochs,
        "best_val_loss": best_val_loss,
        "final_mse": val_mse,
        "final_r2": val_r2,
        "date_trained": pd.Timestamp.now().strftime("%Y-%m-%d %H:%M:%S")
    },
    "files": {
        "pytorch_model": "fusion_model_best.pt",
        "quantized_model": "fusion_model_quantized.pt",
        "onnx_model": "fusion_model.onnx"
    }
}

# Save to JSON
with open(f"{MODEL_DIR}/model_config.json", "w") as f:
    json.dump(model_config, f, indent=2)

print(f"Saved model configuration to {MODEL_DIR}/model_config.json")
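
A saved configuration is only useful if it can be read back into constructor arguments. The sketch below round-trips a configuration through a temporary file. `load_model_kwargs` is a hypothetical helper, and the example values are placeholders rather than results from a real training run.

```python
import json
import os
import tempfile

# Keys in the saved config that map directly onto constructor arguments.
CONSTRUCTOR_KEYS = (
    "visual_dim", "text_dim", "fusion_dim",
    "num_layers", "num_heads", "engagement_type",
)

def load_model_kwargs(config_path):
    """Read a saved config and return (constructor kwargs, file table)."""
    with open(config_path) as f:
        config = json.load(f)
    kwargs = {key: config[key] for key in CONSTRUCTOR_KEYS}
    return kwargs, config["files"]

# Placeholder configuration with the same layout as the one saved above.
example_config = {
    "model_type": "MultimodalFusionModel",
    "visual_dim": 512,
    "text_dim": 768,
    "fusion_dim": 512,
    "num_layers": 4,
    "num_heads": 8,
    "engagement_type": "regression",
    "files": {
        "pytorch_model": "fusion_model_best.pt",
        "quantized_model": "fusion_model_quantized.pt",
        "onnx_model": "fusion_model.onnx",
    },
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_config.json")
    with open(path, "w") as f:
        json.dump(example_config, f, indent=2)
    kwargs, files = load_model_kwargs(path)
```

Note that the configuration saved by this notebook does not record `feedforward_dim` or `dropout`, so code that rebuilds the model from it has to supply those values separately or rely on the constructor's defaults.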

## Conclusion

In this notebook, we demonstrated the end-to-end training process for the multimodal fusion model used in audience engagement prediction:

1. We prepared the data and created a custom PyTorch dataset
2. We initialized the fusion model architecture
3. We trained the model and tracked its performance
4. We evaluated the model on validation data
5. We optimized the model through quantization and ONNX export
6. We saved the model configuration for future use

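Once trained on real engagement data (rather than the synthetic fallback created when no cached data is found), the saved PyTorch, quantized, and ONNX artifacts can be loaded for audience engagement prediction in production systems.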