# 🚀 Multimodal Price Prediction Pipeline
## Combining Text (DistilBERT) + Vision (ResNet/EfficientNet) for Enhanced Performance

### Current Baseline (Text-only)
- **Model**: DistilBERT + MLP Regressor
- **Performance**: 18.04% SMAPE, 185.47 MAE, 421.33 RMSE, 0.752 R²

### Target Multimodal Architecture
- **Text Branch**: DistilBERT → 768-dim features
- **Image Branch**: ResNet-50/EfficientNet → 512-dim features  
- **Fusion Layer**: Concatenation + projection (default) or gated fusion → 640-dim
- **Output**: MLP Regressor → Price prediction

### Expected Improvement
- **Target SMAPE**: ~13.5% (25% improvement)
- **Target MAE**: ~143 (23% improvement)
- **Target RMSE**: ~331 (21% improvement)
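
The percentages above are relative reductions from the text-only baseline. A quick check of that arithmetic, using only the figures quoted in this section (the helper name `relative_improvement` is illustrative, not part of the notebook):

```python
# Relative improvement = (baseline - target) / baseline, as a percentage.
# All figures are the baseline and target values quoted above.
baseline = {"smape": 18.04, "mae": 185.47, "rmse": 421.33}
target = {"smape": 13.5, "mae": 143.0, "rmse": 331.0}


def relative_improvement(base: float, new: float) -> float:
    """Percentage reduction of an error metric relative to its baseline."""
    return (base - new) / base * 100.0


improvements = {k: relative_improvement(baseline[k], target[k]) for k in baseline}
for name, pct in improvements.items():
    print(f"{name.upper()}: {pct:.1f}% lower than baseline")
```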

---

## 📋 Notebook Structure
1. Environment Setup & Dependencies
2. Data Loading & Preprocessing
3. Image Processing Pipeline (Local Loading, Augmentation)
4. Multimodal Dataset & DataLoaders
5. Model Architecture (Text Encoder, Image Encoder, Fusion, Regressor)
6. Training Pipeline (Progressive Training, Mixed Precision)
7. Evaluation & Analysis
8. Test Predictions & Submission

In [1]:
# ========================================
# 1. ENVIRONMENT SETUP & DEPENDENCIES
# ========================================

import torch
import numpy as np
import transformers

print("="*60)
print("ENVIRONMENT INFORMATION")
print("="*60)
print(f"NumPy version: {np.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"\nCUDA available: {torch.cuda.is_available()}")
print(f"CUDA device count: {torch.cuda.device_count()}")

if torch.cuda.is_available():
    print(f"GPU name: {torch.cuda.get_device_name(0)}")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("⚠️  No CUDA GPU detected - training will be slower on CPU")

print("="*60)

ENVIRONMENT INFORMATION
NumPy version: 2.1.3
PyTorch version: 2.8.0+cu128
Transformers version: 4.55.2

CUDA available: True
CUDA device count: 1
GPU name: NVIDIA GeForce RTX 3050 Ti Laptop GPU
CUDA version: 12.8
GPU Memory: 4.0 GB


In [None]:
# Import all required libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from transformers import (
    AutoTokenizer, 
    AutoModel,
    get_linear_schedule_with_warmup
)

# Image processing imports
from PIL import Image
import requests
from torchvision import transforms, models
from pathlib import Path
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# ML utilities
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

import re
import json
from datetime import datetime
from typing import Optional, Tuple, List, Dict, Union
import logging
import math

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
logger = logging.getLogger(__name__)

# Set compute device
compute_device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"\n✓ Using device: {compute_device}")
if torch.cuda.is_available():
    print(f"✓ GPU: {torch.cuda.get_device_name(0)}")
    print(f"✓ GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

print("\n✓ All libraries imported successfully!")

Loading datasets...

✓ Data loaded successfully
Train shape: (75000, 4)
Test shape: (75000, 3)

First 3 rows of training data:


| | sample_id | catalog_content | image_link | price |
|---|---|---|---|---|
| 0 | 33127 | Item Name: La Victoria Green Taco Sauce Mild, ... | https://m.media-amazon.com/images/I/51mo8htwTH... | 4.89 |
| 1 | 198967 | Item Name: Salerno Cookies, The Original Butte... | https://m.media-amazon.com/images/I/71YtriIHAA... | 13.12 |
| 2 | 261251 | Item Name: Bear Creek Hearty Soup Bowl, Creamy... | https://m.media-amazon.com/images/I/51+PFEe-w-... | 1.97 |


## 2️⃣ Data Loading & Preprocessing

In [None]:
# Load datasets
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

print("="*60)
print("DATASET OVERVIEW")
print("="*60)
print(f"Training data shape: {train_df.shape}")
print(f"Test data shape: {test_df.shape}")
print(f"\nColumns: {list(train_df.columns)}")

# Display sample
print("\n" + "="*60)
print("SAMPLE DATA")
print("="*60)
print(train_df.head(3))

# Check for missing values
print("\n" + "="*60)
print("MISSING VALUES")
print("="*60)
print(train_df.isnull().sum())
print(f"\nImage availability: {train_df['image_link'].notna().sum()}/{len(train_df)} ({train_df['image_link'].notna().mean()*100:.1f}%)")

In [None]:
# Text preprocessing function
def clean_text_data(text):
    """Clean and normalize text data."""
    if pd.isna(text):
        return ""
    
    text = str(text).strip()
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace (incl. newlines/tabs) into one space
    
    return text

# Apply text preprocessing
train_df['catalog_content_processed'] = train_df['catalog_content'].apply(clean_text_data)
test_df['catalog_content_processed'] = test_df['catalog_content'].apply(clean_text_data)

# Analyze text lengths
text_lengths = train_df['catalog_content_processed'].str.len()

print("="*60)
print("TEXT STATISTICS")
print("="*60)
print(f"Mean length: {text_lengths.mean():.1f} characters")
print(f"Median length: {text_lengths.median():.1f} characters")
print(f"Min length: {text_lengths.min()} characters")
print(f"Max length: {text_lengths.max()} characters")

# Price statistics
print("\n" + "="*60)
print("PRICE DISTRIBUTION")
print("="*60)
print(f"Price range: ${train_df['price'].min():.2f} - ${train_df['price'].max():.2f}")
print(f"Price mean: ${train_df['price'].mean():.2f}")
print(f"Price median: ${train_df['price'].median():.2f}")
print(f"Price std: ${train_df['price'].std():.2f}")

print("\n✓ Data preprocessing completed!")
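
To make the effect of the cleaning step concrete, here is a small standalone sketch of the same strip-and-collapse logic. It uses a plain `None` check in place of `pd.isna` so it runs without pandas; the function name `normalize_whitespace` and the sample string are illustrative only.

```python
import re


def normalize_whitespace(text) -> str:
    """Strip the ends and collapse any whitespace run into a single space."""
    if text is None:  # stand-in for the notebook's pd.isna(text) check
        return ""
    return re.sub(r"\s+", " ", str(text).strip())


# Made-up catalog string with tabs, newlines and repeated spaces
sample = "  Item Name:\tLa Victoria   Green Taco Sauce\nMild  "
cleaned = normalize_whitespace(sample)
print(repr(cleaned))
```

Tabs and newlines count as whitespace for `\s`, so multi-line catalog entries end up on a single line before tokenization.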

## 3️⃣ Image Processing Pipeline
### Load Images from Local Directory and Preprocess

In [None]:
# ========================================
# IMAGE PROCESSOR CLASS (LOCAL FILES)
# ========================================

class ImageProcessor:
    """
    Handles local image loading, preprocessing, and augmentation.
    
    Features:
    - Load images from local directory
    - ImageNet normalization
    - Data augmentation for training
    - Error handling for broken/missing images
    """
    
    def __init__(
        self,
        image_dir: str = "./images",
        img_size: int = 224,
        use_augmentation: bool = True
    ):
        self.image_dir = Path(image_dir)
        self.img_size = img_size
        self.use_augmentation = use_augmentation
        
        # ImageNet normalization
        self.mean = [0.485, 0.456, 0.406]
        self.std = [0.229, 0.224, 0.225]
        
        # Training transforms (with augmentation)
        self.train_transform = transforms.Compose([
            transforms.Resize((img_size + 32, img_size + 32)),
            transforms.RandomCrop(img_size),
            transforms.RandomHorizontalFlip(p=0.5),
            transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
            transforms.RandomRotation(degrees=10),
            transforms.ToTensor(),
            transforms.Normalize(mean=self.mean, std=self.std)
        ])
        
        # Validation transforms (no augmentation)
        self.val_transform = transforms.Compose([
            transforms.Resize((img_size, img_size)),
            transforms.ToTensor(),
            transforms.Normalize(mean=self.mean, std=self.std)
        ])
        
        # Statistics
        self.stats = {'success': 0, 'failed': 0, 'missing': 0}
        
        logger.info(f"ImageProcessor initialized: image_dir={image_dir}, img_size={img_size}")
    
    def get_image_path(self, image_link: str) -> Optional[Path]:
        """
        Extract image filename from link and construct local path.
        
        Example:
        https://m.media-amazon.com/images/I/81WiVwz7KkL.jpg -> ./images/81WiVwz7KkL.jpg
        """
        if pd.isna(image_link) or not image_link:
            return None
        
        # Extract filename from URL
        filename = Path(image_link).name
        return self.image_dir / filename
    
    def load_image(self, image_link: str) -> Optional[Image.Image]:
        """Load image from local directory."""
        image_path = self.get_image_path(image_link)
        
        if image_path is None:
            self.stats['missing'] += 1
            return None
        
        if not image_path.exists():
            self.stats['missing'] += 1
            return None
        
        try:
            img = Image.open(image_path).convert('RGB')
            self.stats['success'] += 1
            return img
        except Exception as e:
            self.stats['failed'] += 1
            logger.warning(f"Failed to load image: {image_path} - {e}")
            return None
    
    def preprocess_image(self, image: Union[Image.Image, str], training: bool = False) -> Optional[torch.Tensor]:
        """Preprocess image for model input."""
        if isinstance(image, str):
            image = self.load_image(image)
        
        if image is None:
            return None
        
        try:
            transform = self.train_transform if (training and self.use_augmentation) else self.val_transform
            return transform(image)
        except Exception as e:
            logger.warning(f"Failed to preprocess image: {e}")
            return None
    
    def get_placeholder_tensor(self) -> torch.Tensor:
        """Create placeholder tensor for missing images (gray image)."""
        placeholder = torch.zeros(3, self.img_size, self.img_size)
        for c in range(3):
            placeholder[c] = (0.5 - self.mean[c]) / self.std[c]
        return placeholder
    
    def get_stats(self) -> dict:
        """Get image loading statistics."""
        total = sum(self.stats.values())
        return {
            **self.stats,
            'total': total,
            'success_rate': self.stats['success'] / total if total > 0 else 0
        }

print("✓ ImageProcessor class defined (LOCAL mode)")
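
The placeholder returned by `get_placeholder_tensor` is a mid-gray image expressed in normalized space: each channel is filled with `(0.5 - mean) / std`. The sketch below reproduces that arithmetic in plain Python and checks that undoing the normalization gives back 0.5. The helper names are illustrative.

```python
# ImageNet statistics, as used by ImageProcessor
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalized_gray(level: float = 0.5) -> list:
    """Per-channel value of a uniform gray pixel after ImageNet normalization."""
    return [(level - m) / s for m, s in zip(IMAGENET_MEAN, IMAGENET_STD)]


def denormalize(values: list) -> list:
    """Invert the normalization to recover pixel intensities in [0, 1]."""
    return [v * s + m for v, m, s in zip(values, IMAGENET_MEAN, IMAGENET_STD)]


channel_values = normalized_gray()
print([round(v, 4) for v in channel_values])
print([round(v, 4) for v in denormalize(channel_values)])
```

Because the fill values are not zero, a placeholder is distinguishable from an all-zero tensor, which would correspond to a different (darker, tinted) color after denormalization.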

In [None]:
# ========================================
# TEST IMAGE LOADING (Optional - Run to verify)
# ========================================

print("="*60)
print("TESTING IMAGE LOADING")
print("="*60)

# Test with a few sample images
test_processor = ImageProcessor(image_dir="./images", img_size=224, use_augmentation=False)

print("\nTesting image loading from local directory...")
sample_links = train_df['image_link'].head(5).tolist()

for i, link in enumerate(sample_links):
    img = test_processor.load_image(link)
    if img is not None:
        print(f"✅ Image {i+1}: Loaded successfully - Size: {img.size}")
    else:
        print(f"❌ Image {i+1}: Failed to load from {link}")

stats = test_processor.get_stats()
print(f"\nTest Statistics:")
print(f"  Success: {stats['success']}")
print(f"  Missing: {stats['missing']}")
print(f"  Failed: {stats['failed']}")

print("="*60)

### 📁 Image Directory Setup

**Important**: Ensure your images are organized as follows:

```
./images/
├── 81WiVwz7KkL.jpg
├── 71AbCdEfGh.jpg
├── 91XyZaBcDe.jpg
└── ... (other images)
```

The image filenames should match the filenames in the `image_link` URLs.

**Example**:
- If `image_link` = `"https://m.media-amazon.com/images/I/81WiVwz7KkL.jpg"`
- Then the file should be: `./images/81WiVwz7KkL.jpg`

Missing images will automatically use placeholder tensors, so training won't break.
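
The URL-to-file mapping described above can be sketched as follows. This is a standalone variant of `get_image_path`, not the notebook's exact code: it parses the URL with `urllib.parse` so that a query string (should one ever appear) does not end up in the filename, which is an added assumption. The function name `local_image_path` is hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse


def local_image_path(image_link: Optional[str],
                     image_dir: str = "./images") -> Optional[PurePosixPath]:
    """Map an image URL to its expected location inside image_dir."""
    if not image_link:
        return None  # missing link: the caller falls back to a placeholder
    # Take only the path component of the URL, then its final segment
    filename = PurePosixPath(urlparse(image_link).path).name
    if not filename:
        return None
    return PurePosixPath(image_dir) / filename


url = "https://m.media-amazon.com/images/I/81WiVwz7KkL.jpg"
print(local_image_path(url))
```

Running a mapping like this over `train_df['image_link']` before training and checking which paths exist is a cheap way to learn how many samples will use the placeholder.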

## 4️⃣ Loss Functions & Metrics

In [None]:
# ========================================
# LOSS FUNCTIONS
# ========================================

class SymmetricLoss(nn.Module):
    """Symmetric Mean Absolute Percentage Error (SMAPE) loss, returned as a fraction in [0, 2]."""
    
    def __init__(self):
        super(SymmetricLoss, self).__init__()
    
    def forward(self, y_pred, y_true):
        y_pred = torch.clamp(y_pred, min=1e-8)
        y_true = torch.clamp(y_true, min=1e-8)
        
        numerator = torch.abs(y_pred - y_true)
        denominator = (torch.abs(y_true) + torch.abs(y_pred)) / 2
        smape = torch.mean(numerator / denominator)
        return smape


def calculate_symmetric_error(y_true, y_pred):
    """Calculate SMAPE metric."""
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 2
    denominator = np.where(denominator == 0, 1e-8, denominator)
    
    smape_value = np.mean(np.abs(y_pred - y_true) / denominator)
    return smape_value


def calculate_metrics(y_true, y_pred):
    """Calculate comprehensive metrics."""
    smape = calculate_symmetric_error(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    
    return {
        'smape': smape,
        'mae': mae,
        'rmse': rmse,
        'r2': r2
    }

print("✓ Loss functions and metrics defined")
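
As a worked example of the SMAPE definition used above, the following pure-Python sketch computes the metric for two made-up predictions. It mirrors `calculate_symmetric_error` (mean of `|pred - true| / ((|true| + |pred|) / 2)`), returning a fraction rather than a percentage; the function name `smape` and the sample prices are illustrative.

```python
def smape(y_true, y_pred, eps: float = 1e-8) -> float:
    """Mean of |pred - true| / ((|true| + |pred|) / 2), as a fraction in [0, 2]."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        denom = (abs(t) + abs(p)) / 2
        total += abs(p - t) / (denom if denom != 0 else eps)
    return total / len(y_true)


# Two products priced at $100 and $200, predicted as $110 and $180:
#   10 / 105 ≈ 0.0952 and 20 / 190 ≈ 0.1053, so the mean is about 0.1003
score = smape([100.0, 200.0], [110.0, 180.0])
print(f"SMAPE: {score:.4f} ({score * 100:.2f}%)")
```

The metric is bounded: a perfect prediction scores 0, and predicting 0 for a non-zero price scores the maximum of 2 (200%).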

## 5️⃣ Multimodal Dataset Class
### Combines Text + Image Data for Efficient Loading

In [None]:
# ========================================
# MULTIMODAL DATASET CLASS
# ========================================

class MultimodalDataset(Dataset):
    """
    Dataset for multimodal price prediction combining text and images.
    """
    
    def __init__(
        self,
        df: pd.DataFrame,
        text_tokenizer,
        image_processor: ImageProcessor,
        max_text_length: int = 256,
        training: bool = True,
        use_images: bool = True
    ):
        self.df = df.reset_index(drop=True)
        self.text_tokenizer = text_tokenizer
        self.image_processor = image_processor
        self.max_text_length = max_text_length
        self.training = training
        self.use_images = use_images
        self.has_labels = 'price' in df.columns
        
        # Fill missing values
        self.df['catalog_content_processed'] = self.df['catalog_content_processed'].fillna('')
        self.df['image_link'] = self.df['image_link'].fillna('')
    
    def __len__(self) -> int:
        return len(self.df)
    
    def __getitem__(self, idx: int) -> Dict:
        row = self.df.iloc[idx]
        
        # Process text
        text = str(row['catalog_content_processed'])
        text_encoding = self.text_tokenizer(
            text,
            max_length=self.max_text_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        
        # Process image
        image_tensor = None
        image_available = False
        
        if self.use_images:
            image_url = str(row['image_link'])
            image_tensor = self.image_processor.preprocess_image(image_url, training=self.training)
            
            if image_tensor is None:
                image_tensor = self.image_processor.get_placeholder_tensor()
                image_available = False
            else:
                image_available = True
        else:
            image_tensor = self.image_processor.get_placeholder_tensor()
        
        output = {
            'input_ids': text_encoding['input_ids'].squeeze(0),
            'attention_mask': text_encoding['attention_mask'].squeeze(0),
            'image': image_tensor,
            'image_available': torch.tensor(image_available, dtype=torch.bool)
        }
        
        if self.has_labels:
            output['labels'] = torch.tensor(row['price'], dtype=torch.float)
        
        return output


class MultimodalCollator:
    """Custom collator for batching multimodal data."""
    
    def __call__(self, batch: List[Dict]) -> Dict:
        input_ids = torch.stack([item['input_ids'] for item in batch])
        attention_mask = torch.stack([item['attention_mask'] for item in batch])
        images = torch.stack([item['image'] for item in batch])
        image_available = torch.stack([item['image_available'] for item in batch])
        
        collated = {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'images': images,
            'image_available': image_available
        }
        
        if 'labels' in batch[0]:
            labels = torch.stack([item['labels'] for item in batch])
            collated['labels'] = labels
        
        return collated

print("✓ Multimodal dataset classes defined")

## 6️⃣ Multimodal Model Architecture
### Text Encoder + Image Encoder + Fusion Layer + Regressor

In [None]:
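# ========================================
# INITIALIZE COMPONENTS
# ========================================
# NOTE: this cell uses BASE_MODEL, IMG_SIZE, SEQ_LENGTH, BATCH_SIZE and USE_IMAGES,
# which are defined in the Configuration cell (section 8); run that cell first.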

# Initialize tokenizer
print("Loading tokenizer...")
text_tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
print(f"✓ Tokenizer loaded: {BASE_MODEL}")

# Initialize image processor
print("\nInitializing image processor...")
image_processor = ImageProcessor(
    image_dir="./images",
    img_size=IMG_SIZE,
    use_augmentation=True
)
print(f"✓ Image processor initialized (LOCAL mode)")

# Split data
print("\nSplitting data...")
X_train_text = train_df['catalog_content_processed']
y_train = train_df['price']

# Split the full frame (rows keep their 'price' column) alongside the price targets
X_train, X_val, y_train_split, y_val_split = train_test_split(
    train_df, y_train, test_size=0.2, random_state=42
)

print(f"✓ Training samples: {len(X_train)}")
print(f"✓ Validation samples: {len(X_val)}")

# Create datasets
print("\nCreating multimodal datasets...")
train_dataset = MultimodalDataset(
    df=X_train,
    text_tokenizer=text_tokenizer,
    image_processor=image_processor,
    max_text_length=SEQ_LENGTH,
    training=True,
    use_images=USE_IMAGES
)

val_dataset = MultimodalDataset(
    df=X_val,
    text_tokenizer=text_tokenizer,
    image_processor=image_processor,
    max_text_length=SEQ_LENGTH,
    training=False,
    use_images=USE_IMAGES
)

print(f"✓ Train dataset: {len(train_dataset)} samples")
print(f"✓ Val dataset: {len(val_dataset)} samples")

# Create dataloaders
print("\nCreating dataloaders...")
collator = MultimodalCollator()

train_loader = DataLoader(
    train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=2,
    collate_fn=collator,
    pin_memory=torch.cuda.is_available()
)

val_loader = DataLoader(
    val_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    num_workers=2,
    collate_fn=collator,
    pin_memory=torch.cuda.is_available()
)

print(f"✓ Train batches: {len(train_loader)}")
print(f"✓ Val batches: {len(val_loader)}")

print("\n✅ All components initialized successfully!")

In [None]:
# ========================================
# FUSION LAYER (Multimodal Feature Fusion)
# ========================================

class FusionLayer(nn.Module):
    """
    Fusion layer for combining text and image features.
    Supports two fusion strategies: concatenation ('concat') and gated ('gated').
    """
    
    def __init__(
        self,
        text_dim: int = 768,
        image_dim: int = 512,
        output_dim: int = 640,
        fusion_type: str = 'concat',
        dropout: float = 0.3
    ):
        super().__init__()
        
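        if fusion_type not in ('concat', 'gated'):
            raise ValueError(f"Unsupported fusion_type: {fusion_type!r} (expected 'concat' or 'gated')")
        self.fusion_type = fusion_type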
        
        if fusion_type == 'concat':
            # Simple concatenation + projection
            concat_dim = text_dim + image_dim
            self.fusion = nn.Sequential(
                nn.Linear(concat_dim, output_dim * 2),
                nn.BatchNorm1d(output_dim * 2),
                nn.ReLU(inplace=True),
                nn.Dropout(dropout),
                nn.Linear(output_dim * 2, output_dim),
                nn.BatchNorm1d(output_dim),
                nn.ReLU(inplace=True)
            )
            
        elif fusion_type == 'gated':
            # Gated fusion with learned weights
            self.text_projection = nn.Linear(text_dim, output_dim)
            self.image_projection = nn.Linear(image_dim, output_dim)
            self.gate = nn.Sequential(
                nn.Linear(output_dim * 2, output_dim),
                nn.Sigmoid()
            )
            self.output = nn.Sequential(
                nn.Linear(output_dim, output_dim),
                nn.BatchNorm1d(output_dim),
                nn.ReLU(inplace=True),
                nn.Dropout(dropout)
            )
        
        # Initialize weights
        for module in self.modules():
            if isinstance(module, nn.Linear):
                nn.init.kaiming_normal_(module.weight, mode='fan_out', nonlinearity='relu')
                if module.bias is not None:
                    nn.init.constant_(module.bias, 0)
    
    def forward(self, text_features: torch.Tensor, image_features: torch.Tensor) -> torch.Tensor:
        """Fuse text and image features."""
        if self.fusion_type == 'concat':
            combined = torch.cat([text_features, image_features], dim=1)
            return self.fusion(combined)
            
        elif self.fusion_type == 'gated':
            text_proj = self.text_projection(text_features)
            image_proj = self.image_projection(image_features)
            gate_input = torch.cat([text_proj, image_proj], dim=1)
            gate_values = self.gate(gate_input)
            fused = gate_values * text_proj + (1 - gate_values) * image_proj
            return self.output(fused)

print("✓ FusionLayer class defined")
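
The gated branch of `FusionLayer` computes an element-wise convex combination, `gate * text + (1 - gate) * image`, where the gate is a sigmoid output. The sketch below shows that arithmetic on plain Python lists. The gate logits are passed in directly here, whereas in the layer they come from a learned linear map over the concatenated projections; the function names are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(text_proj, image_proj, gate_logits):
    """Blend two feature vectors element-wise with a sigmoid gate."""
    fused = []
    for t, i, g in zip(text_proj, image_proj, gate_logits):
        gate = sigmoid(g)  # in (0, 1): the share taken from the text feature
        fused.append(gate * t + (1.0 - gate) * i)
    return fused


text_vec = [1.0, 2.0]
image_vec = [3.0, 6.0]
print(gated_fuse(text_vec, image_vec, [0.0, 0.0]))    # gate = 0.5: plain average
print(gated_fuse(text_vec, image_vec, [20.0, -20.0]))  # near text, then near image
```

A gate close to 1 lets the text feature dominate that dimension, which gives the model a way to down-weight image features, for example when the image is a placeholder.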

In [None]:
# ========================================
# MULTIMODAL REGRESSOR (Complete Integrated Model)
# ========================================

class MultimodalRegressor(nn.Module):
    """
    Complete multimodal price prediction model.
    Combines text encoder (DistilBERT), image encoder (CNN), fusion layer, and MLP regressor.
    """
    
    def __init__(
        self,
        bert_model_name: str = 'distilbert-base-uncased',
        image_model_name: str = 'resnet50',
        text_dropout: float = 0.2,
        image_dropout: float = 0.3,
        fusion_dropout: float = 0.3,
        image_output_dim: int = 512,
        fusion_output_dim: int = 640,
        fusion_type: str = 'concat',
        freeze_image_backbone: bool = True,
        use_images: bool = True
    ):
        super().__init__()
        
        self.use_images = use_images
        
        # Text encoder (DistilBERT)
        self.text_encoder = AutoModel.from_pretrained(bert_model_name)
        text_dim = self.text_encoder.config.hidden_size  # 768 for DistilBERT
        
        # Image encoder (ResNet/EfficientNet)
        if use_images:
            self.image_encoder = ImageEncoder(
                model_name=image_model_name,
                pretrained=True,
                output_dim=image_output_dim,
                dropout=image_dropout,
                freeze_backbone=freeze_image_backbone
            )
            
            # Fusion layer
            self.fusion = FusionLayer(
                text_dim=text_dim,
                image_dim=image_output_dim,
                output_dim=fusion_output_dim,
                fusion_type=fusion_type,
                dropout=fusion_dropout
            )
            regressor_input_dim = fusion_output_dim
        else:
            self.image_encoder = None
            self.fusion = None
            regressor_input_dim = text_dim
        
        # MLP Regressor head
        self.regressor = nn.Sequential(
            nn.Dropout(text_dropout),
            nn.Linear(regressor_input_dim, regressor_input_dim // 2),
            nn.ReLU(),
            nn.Dropout(text_dropout),
            nn.Linear(regressor_input_dim // 2, 1)
        )
        
        # Initialize regressor weights
        for module in self.regressor:
            if isinstance(module, nn.Linear):
                nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
                nn.init.constant_(module.bias, 0)
    
    def forward(self, input_ids, attention_mask, images=None, image_available=None):
        """
        Forward pass through the multimodal model.
        
        Args:
            input_ids: Text token IDs [B, seq_len]
            attention_mask: Text attention mask [B, seq_len]
            images: Image tensors [B, 3, H, W]
            image_available: Boolean tensor indicating valid images [B]
                (currently unused; missing images arrive as gray placeholder tensors)
        
        Returns:
            Price predictions [B]
        """
        # Extract text features
        text_outputs = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        text_features = text_outputs.last_hidden_state[:, 0]  # CLS token [B, 768]
        
        if self.use_images and images is not None:
            # Extract image features
            image_features = self.image_encoder(images)  # [B, image_output_dim]
            
            # Fuse text and image features
            fused_features = self.fusion(text_features, image_features)  # [B, fusion_output_dim]
            
            # Predict price
            logits = self.regressor(fused_features)
        else:
            # Text-only mode
            logits = self.regressor(text_features)
        
        return logits.squeeze(-1)
    
    def unfreeze_image_backbone(self, layers: Optional[int] = None):
        """Unfreeze image encoder backbone for fine-tuning."""
        if self.image_encoder is not None:
            self.image_encoder.unfreeze_backbone(layers)
    
    def get_num_trainable_params(self):
        """Count trainable parameters."""
        return sum(p.numel() for p in self.parameters() if p.requires_grad)

print("✓ MultimodalRegressor class defined")

## 7️⃣ Training Pipeline
### Progressive Training with Mixed Precision Support

In [None]:
# ========================================
# TRAINING FUNCTION
# ========================================

def train_multimodal_model(
    model,
    train_loader,
    val_loader,
    criterion,
    optimizer,
    scheduler,
    epochs,
    device,
    use_amp=True,
    early_stopping_patience=3
):
    """
    Train the multimodal model with comprehensive monitoring.
    
    Args:
        model: Multimodal regressor model
        train_loader: Training data loader
        val_loader: Validation data loader
        criterion: Loss function (SMAPE)
        optimizer: Optimizer
        scheduler: Learning rate scheduler
        epochs: Number of training epochs
        device: Compute device (CPU/GPU)
        use_amp: Use automatic mixed precision
        early_stopping_patience: Patience for early stopping
    
    Returns:
        Training history (losses, metrics)
    """
    from torch.cuda.amp import autocast, GradScaler
    
    scaler = GradScaler() if use_amp and device.type == 'cuda' else None
    
    train_losses = []
    val_losses = []
    val_smapes = []
    
    best_val_smape = float('inf')
    best_model_state = None
    patience_counter = 0
    
    for epoch in range(epochs):
        # ============ TRAINING ============
        model.train()
        train_loss = 0.0
        train_predictions = []
        train_labels = []
        
        print(f"\n{'='*60}")
        print(f"Epoch {epoch + 1}/{epochs}")
        print(f"{'='*60}")
        
        for batch_idx, batch in enumerate(train_loader):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            images = batch['images'].to(device)
            image_available = batch['image_available'].to(device)
            labels = batch['labels'].to(device)
            
            optimizer.zero_grad()
            
            # Forward pass with mixed precision
            if use_amp and scaler is not None:
                with autocast():
                    outputs = model(input_ids, attention_mask, images, image_available)
                    loss = criterion(outputs, labels)
                
                scaler.scale(loss).backward()
                scaler.unscale_(optimizer)
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
                scaler.step(optimizer)
                scaler.update()
            else:
                outputs = model(input_ids, attention_mask, images, image_available)
                loss = criterion(outputs, labels)
                loss.backward()
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
                optimizer.step()
            
            scheduler.step()
            
            train_loss += loss.item()
            train_predictions.extend(outputs.detach().cpu().numpy())
            train_labels.extend(labels.detach().cpu().numpy())
            
            if batch_idx % 50 == 0:
                print(f"Batch {batch_idx}/{len(train_loader)}, Loss: {loss.item():.4f}")
        
        avg_train_loss = train_loss / len(train_loader)
        train_metrics = calculate_metrics(train_labels, train_predictions)
        
        # ============ VALIDATION ============
        model.eval()
        val_loss = 0.0
        val_predictions = []
        val_labels = []
        
        with torch.no_grad():
            for batch in val_loader:
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                images = batch['images'].to(device)
                image_available = batch['image_available'].to(device)
                labels = batch['labels'].to(device)
                
                if use_amp and device.type == 'cuda':
                    with autocast():
                        outputs = model(input_ids, attention_mask, images, image_available)
                        loss = criterion(outputs, labels)
                else:
                    outputs = model(input_ids, attention_mask, images, image_available)
                    loss = criterion(outputs, labels)
                
                val_loss += loss.item()
                val_predictions.extend(outputs.cpu().numpy())
                val_labels.extend(labels.cpu().numpy())
        
        avg_val_loss = val_loss / len(val_loader)
        val_metrics = calculate_metrics(val_labels, val_predictions)
        
        # Store history
        train_losses.append(avg_train_loss)
        val_losses.append(avg_val_loss)
        val_smapes.append(val_metrics['smape'])
        
        # Print metrics
        print(f"\n📊 Training Metrics:")
        print(f"   Loss: {avg_train_loss:.4f} | SMAPE: {train_metrics['smape']:.4f} ({train_metrics['smape']*100:.2f}%)")
        print(f"   MAE: ${train_metrics['mae']:.2f} | RMSE: ${train_metrics['rmse']:.2f} | R²: {train_metrics['r2']:.4f}")
        
        print(f"\n📈 Validation Metrics:")
        print(f"   Loss: {avg_val_loss:.4f} | SMAPE: {val_metrics['smape']:.4f} ({val_metrics['smape']*100:.2f}%)")
        print(f"   MAE: ${val_metrics['mae']:.2f} | RMSE: ${val_metrics['rmse']:.2f} | R²: {val_metrics['r2']:.4f}")
        
        # Early stopping check
        if val_metrics['smape'] < best_val_smape:
            best_val_smape = val_metrics['smape']
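            # Snapshot the weights: state_dict().copy() is shallow and would keep tracking later updates
            best_model_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}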
            patience_counter = 0
            print(f"\n✅ New best validation SMAPE: {best_val_smape:.4f} ({best_val_smape*100:.2f}%)")
        else:
            patience_counter += 1
            print(f"\n⚠️  No improvement ({patience_counter}/{early_stopping_patience})")
            
            if patience_counter >= early_stopping_patience:
                print(f"\n🛑 Early stopping triggered after {epoch + 1} epochs")
                break
    
    # Load best model
    if best_model_state is not None:
        model.load_state_dict(best_model_state)
        print(f"\n🎯 Best validation SMAPE: {best_val_smape:.4f} ({best_val_smape*100:.2f}%)")
    
    return {
        'train_losses': train_losses,
        'val_losses': val_losses,
        'val_smapes': val_smapes,
        'best_val_smape': best_val_smape
    }

print("✓ Training function defined")
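
The early-stopping rule inside `train_multimodal_model` can be isolated from the training code and traced on a list of validation scores. This sketch follows the same logic (reset the counter on improvement, stop once `patience` consecutive epochs fail to improve); the function name and the score sequence are made up for illustration.

```python
def early_stopping_trace(val_scores, patience: int = 3):
    """Return (epoch at which training stops, epoch of the best score), 0-indexed."""
    best_score = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch, score in enumerate(val_scores):
        if score < best_score:
            # Improvement: remember it and reset the patience counter
            best_score, best_epoch, bad_epochs = score, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    return len(val_scores) - 1, best_epoch


# Validation SMAPE per epoch; the final 0.20 is never reached
scores = [0.30, 0.25, 0.26, 0.27, 0.28, 0.20]
stopped_at, best_at = early_stopping_trace(scores, patience=3)
print(f"stopped after epoch {stopped_at + 1}, best epoch was {best_at + 1}")
```

The trace shows the usual trade-off: with a patience of 3, training halts before a later improvement could be observed, so the restored weights are those of the second epoch.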

## 8️⃣ Model Configuration & Setup

In [None]:
# ========================================
# CONFIGURATION
# ========================================

# Model configuration
BASE_MODEL = 'distilbert-base-uncased'
IMAGE_MODEL = 'resnet50'  # Options: 'resnet50', 'efficientnet_b3'
SEQ_LENGTH = 256
IMG_SIZE = 224
BATCH_SIZE = 16  # Reduced for multimodal (memory intensive)
NUM_EPOCHS = 10
LEARNING_RATE_TEXT = 2e-5
LEARNING_RATE_IMAGE = 1e-4
LEARNING_RATE_FUSION = 5e-4

# Multimodal settings
USE_IMAGES = True  # Set to False for text-only baseline
FUSION_TYPE = 'concat'  # Options: 'concat', 'gated'
IMAGE_OUTPUT_DIM = 512
FUSION_OUTPUT_DIM = 640
FREEZE_IMAGE_BACKBONE = True  # Start with frozen, unfreeze later

# Training settings
EARLY_STOPPING_PATIENCE = 3
USE_MIXED_PRECISION = torch.cuda.is_available()

print("="*60)
print("MODEL CONFIGURATION")
print("="*60)
print(f"Text Model: {BASE_MODEL}")
print(f"Image Model: {IMAGE_MODEL}")
print(f"Max Sequence Length: {SEQ_LENGTH}")
print(f"Image Size: {IMG_SIZE}")
print(f"Batch Size: {BATCH_SIZE}")
print(f"Epochs: {NUM_EPOCHS}")
print(f"Use Images: {USE_IMAGES}")
print(f"Fusion Type: {FUSION_TYPE}")
print(f"Mixed Precision: {USE_MIXED_PRECISION}")
print("="*60)

In [None]:
# ========================================
# INITIALIZE COMPONENTS
# ========================================

# Initialize tokenizer
print("Loading tokenizer...")
text_tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
print(f"✓ Tokenizer loaded: {BASE_MODEL}")

# Initialize image processor
print("\nInitializing image processor...")
image_processor = ImageProcessor(
    cache_dir="./image_cache",
    img_size=IMG_SIZE,
    use_augmentation=True
)
print(f"✓ Image processor initialized")

# Split data
print("\nSplitting data...")
X_train_text = train_df['catalog_content_processed']
y_train = train_df['price']

# Split rows and their price targets with the same shuffle
X_train, X_val, y_train_split, y_val_split = train_test_split(
    train_df, train_df['price'], test_size=0.2, random_state=42
)

print(f"✓ Training samples: {len(X_train)}")
print(f"✓ Validation samples: {len(X_val)}")

# Create datasets
print("\nCreating multimodal datasets...")
train_dataset = MultimodalDataset(
    df=X_train,
    text_tokenizer=text_tokenizer,
    image_processor=image_processor,
    max_text_length=SEQ_LENGTH,
    training=True,
    use_images=USE_IMAGES
)

val_dataset = MultimodalDataset(
    df=X_val,
    text_tokenizer=text_tokenizer,
    image_processor=image_processor,
    max_text_length=SEQ_LENGTH,
    training=False,
    use_images=USE_IMAGES
)

print(f"✓ Train dataset: {len(train_dataset)} samples")
print(f"✓ Val dataset: {len(val_dataset)} samples")

# Create dataloaders
print("\nCreating dataloaders...")
collator = MultimodalCollator()

train_loader = DataLoader(
    train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=2,
    collate_fn=collator,
    pin_memory=torch.cuda.is_available()
)

val_loader = DataLoader(
    val_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    num_workers=2,
    collate_fn=collator,
    pin_memory=torch.cuda.is_available()
)

print(f"✓ Train batches: {len(train_loader)}")
print(f"✓ Val batches: {len(val_loader)}")

print("\n✅ All components initialized successfully!")

In [None]:
# ========================================
# BUILD MULTIMODAL MODEL
# ========================================

print("Building multimodal model...")
model = MultimodalRegressor(
    bert_model_name=BASE_MODEL,
    image_model_name=IMAGE_MODEL,
    text_dropout=0.2,
    image_dropout=0.3,
    fusion_dropout=0.3,
    image_output_dim=IMAGE_OUTPUT_DIM,
    fusion_output_dim=FUSION_OUTPUT_DIM,
    fusion_type=FUSION_TYPE,
    freeze_image_backbone=FREEZE_IMAGE_BACKBONE,
    use_images=USE_IMAGES
)

# Multi-GPU support
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs")
    model = nn.DataParallel(model)

model.to(compute_device)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"\n{'='*60}")
print(f"MODEL STATISTICS")
print(f"{'='*60}")
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Model size (float32, 4 bytes/param): {total_params * 4 / 1024 / 1024:.1f} MB")
print(f"{'='*60}")

# Test forward pass
print("\nTesting forward pass...")
model.eval()
with torch.no_grad():
    sample_batch = next(iter(train_loader))
    input_ids = sample_batch['input_ids'].to(compute_device)
    attention_mask = sample_batch['attention_mask'].to(compute_device)
    images = sample_batch['images'].to(compute_device)
    image_available = sample_batch['image_available'].to(compute_device)
    labels = sample_batch['labels'].to(compute_device)
    
    outputs = model(input_ids, attention_mask, images, image_available)
    print(f"✓ Input shape: {input_ids.shape}")
    print(f"✓ Output shape: {outputs.shape}")
    print(f"✓ Sample predictions: {outputs[:5].cpu().numpy()}")
    print(f"✓ Sample labels: {labels[:5].cpu().numpy()}")

print("\n✅ Model built and tested successfully!")

In [None]:
# ========================================
# SETUP TRAINING
# ========================================

# Loss function
criterion = SymmetricLoss()

# Optimizer with different learning rates for different components
# Unwrap nn.DataParallel (if used) so submodules can be accessed by name
base_model = model.module if hasattr(model, 'module') else model

if USE_IMAGES:
    # Separate parameter groups for text, image, and fusion
    param_groups = [
        {'params': base_model.text_encoder.parameters(), 'lr': LEARNING_RATE_TEXT},
        {'params': base_model.image_encoder.parameters(), 'lr': LEARNING_RATE_IMAGE},
        {'params': base_model.fusion.parameters(), 'lr': LEARNING_RATE_FUSION},
        {'params': base_model.regressor.parameters(), 'lr': LEARNING_RATE_FUSION}
    ]
else:
    # Text-only: set the learning rate explicitly, otherwise AdamW falls back to its default (1e-3)
    param_groups = [{'params': base_model.parameters(), 'lr': LEARNING_RATE_TEXT}]

optimizer = AdamW(param_groups, weight_decay=0.01)

# Learning rate scheduler
total_steps = len(train_loader) * NUM_EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=total_steps // 10,
    num_training_steps=total_steps
)

print("="*60)
print("TRAINING CONFIGURATION")
print("="*60)
print(f"Loss function: SMAPE (SymmetricLoss)")
print(f"Optimizer: AdamW")
if USE_IMAGES:
    print(f"Learning rates:")
    print(f"  - Text encoder: {LEARNING_RATE_TEXT}")
    print(f"  - Image encoder: {LEARNING_RATE_IMAGE}")
    print(f"  - Fusion & Regressor: {LEARNING_RATE_FUSION}")
else:
    print(f"Learning rate: {LEARNING_RATE_TEXT}")
print(f"Total training steps: {total_steps}")
print(f"Warmup steps: {total_steps // 10}")
print(f"Early stopping patience: {EARLY_STOPPING_PATIENCE}")
print("="*60)

print("\n✅ Training setup complete!")
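
For reference, `get_linear_schedule_with_warmup` multiplies each parameter group's base learning rate by a factor that ramps linearly from 0 to 1 over the warmup steps and then decays linearly to 0 at the final step. The following is a minimal pure-Python sketch of that multiplier; the function name `linear_warmup_factor` and the step count of 1000 are illustrative and not part of the notebook.

```python
def linear_warmup_factor(step: int, warmup_steps: int, total_steps: int) -> float:
    """Learning-rate multiplier: linear ramp up during warmup, linear decay afterwards."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Illustrative numbers: 1000 total steps with 10% warmup, as in the setup above
total_steps = 1000
warmup_steps = total_steps // 10

for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_factor(step, warmup_steps, total_steps))
```

Because the schedule is sized for `NUM_EPOCHS` full epochs, a run that stops early ends before the multiplier has decayed to 0.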

## 9️⃣ Training Execution
### Run this cell to train the multimodal model

In [None]:
# ========================================
# START TRAINING
# ========================================

print("\n" + "🚀 "*20)
print("STARTING MULTIMODAL TRAINING")
print("🚀 "*20 + "\n")

training_history = train_multimodal_model(
    model=model,
    train_loader=train_loader,
    val_loader=val_loader,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    epochs=NUM_EPOCHS,
    device=compute_device,
    use_amp=USE_MIXED_PRECISION,
    early_stopping_patience=EARLY_STOPPING_PATIENCE
)

print("\n" + "🎉 "*20)
print("TRAINING COMPLETED!")
print("🎉 "*20)

## 🔟 Model Saving & Evaluation

In [None]:
# ========================================
# SAVE MODEL
# ========================================

# Create save directory
model_type = "multimodal" if USE_IMAGES else "text_only"
save_directory = f"./{model_type}_{BASE_MODEL.replace('/', '_')}_{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}"
os.makedirs(save_directory, exist_ok=True)

# Save model weights
model_save_path = os.path.join(save_directory, "model_best.pt")
if hasattr(model, 'module'):
    torch.save(model.module.state_dict(), model_save_path)
else:
    torch.save(model.state_dict(), model_save_path)
print(f"✅ Model weights saved to: {model_save_path}")

# Save tokenizer
try:
    text_tokenizer.save_pretrained(save_directory)
    print(f"✅ Tokenizer saved to: {save_directory}")
except Exception as e:
    print(f"⚠️  Tokenizer save failed: {e}")

# Save optimizer and scheduler states
torch.save({
    "optimizer_state_dict": optimizer.state_dict(),
    "scheduler_state_dict": scheduler.state_dict(),
}, os.path.join(save_directory, "training_states.pt"))
print("✅ Optimizer & scheduler states saved")

# Save training metrics
metrics_path = os.path.join(save_directory, "training_metrics.json")
metrics_to_save = {
    "train_losses": [float(x) for x in training_history['train_losses']],
    "val_losses": [float(x) for x in training_history['val_losses']],
    "val_smapes": [float(x) for x in training_history['val_smapes']],
    "best_val_smape": float(training_history['best_val_smape']),
    "config": {
        "base_model": BASE_MODEL,
        "image_model": IMAGE_MODEL if USE_IMAGES else None,
        "seq_length": SEQ_LENGTH,
        "batch_size": BATCH_SIZE,
        "use_images": USE_IMAGES,
        "fusion_type": FUSION_TYPE if USE_IMAGES else None
    }
}

with open(metrics_path, "w") as f:
    json.dump(metrics_to_save, f, indent=4)

print(f"✅ Training metrics saved to: {metrics_path}")
print(f"\n🎉 All model components successfully saved in: {save_directory}")

In [None]:
# ========================================
# VISUALIZE TRAINING HISTORY
# ========================================

plt.figure(figsize=(18, 5))

# Plot 1: Training and Validation Loss
plt.subplot(1, 3, 1)
plt.plot(training_history['train_losses'], label='Training Loss', color='blue', linewidth=2)
plt.plot(training_history['val_losses'], label='Validation Loss', color='red', linewidth=2)
plt.title('Training and Validation Loss', fontsize=14, fontweight='bold')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 2: Validation SMAPE
plt.subplot(1, 3, 2)
plt.plot(training_history['val_smapes'], label='Validation SMAPE', color='green', linewidth=2, marker='o')
plt.axhline(y=training_history['best_val_smape'], color='red', linestyle='--', label=f'Best: {training_history["best_val_smape"]:.4f}')
plt.title('Validation SMAPE Over Epochs', fontsize=14, fontweight='bold')
plt.xlabel('Epoch')
plt.ylabel('SMAPE')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 3: Validation SMAPE (%)
plt.subplot(1, 3, 3)
smape_percentages = [smape * 100 for smape in training_history['val_smapes']]
plt.plot(smape_percentages, label='Validation SMAPE (%)', color='purple', linewidth=2, marker='s')
plt.axhline(y=training_history['best_val_smape'] * 100, color='red', linestyle='--', label=f'Best: {training_history["best_val_smape"]*100:.2f}%')
plt.title('Validation SMAPE (%)', fontsize=14, fontweight='bold')
plt.xlabel('Epoch')
plt.ylabel('SMAPE (%)')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(os.path.join(save_directory, 'training_curves.png'), dpi=300, bbox_inches='tight')
plt.show()

print(f"✅ Training curves saved to: {os.path.join(save_directory, 'training_curves.png')}")

In [None]:
# ========================================
# VALIDATION SET EVALUATION
# ========================================

print("="*60)
print("FINAL VALIDATION EVALUATION")
print("="*60)

model.eval()
val_predictions = []
val_true_labels = []

with torch.no_grad():
    for batch in val_loader:
        input_ids = batch['input_ids'].to(compute_device)
        attention_mask = batch['attention_mask'].to(compute_device)
        images = batch['images'].to(compute_device)
        image_available = batch['image_available'].to(compute_device)
        labels = batch['labels'].to(compute_device)
        
        outputs = model(input_ids, attention_mask, images, image_available)
        val_predictions.extend(outputs.cpu().numpy())
        val_true_labels.extend(labels.cpu().numpy())

# Calculate final metrics
final_metrics = calculate_metrics(val_true_labels, val_predictions)

print(f"\n📊 Final Validation Metrics:")
print(f"{'='*60}")
print(f"SMAPE: {final_metrics['smape']:.4f} ({final_metrics['smape']*100:.2f}%)")
print(f"MAE:   ${final_metrics['mae']:.2f}")
print(f"RMSE:  ${final_metrics['rmse']:.2f}")
print(f"R²:    {final_metrics['r2']:.4f}")
print(f"{'='*60}")

# Scatter plot: Predictions vs Actual
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
plt.scatter(val_true_labels, val_predictions, alpha=0.5, s=10)
plt.plot([min(val_true_labels), max(val_true_labels)], 
         [min(val_true_labels), max(val_true_labels)], 
         'r--', lw=2, label='Perfect Prediction')
plt.xlabel('Actual Price ($)', fontsize=12)
plt.ylabel('Predicted Price ($)', fontsize=12)
plt.title(f'Predictions vs Actual\nSMAPE: {final_metrics["smape"]:.4f} ({final_metrics["smape"]*100:.2f}%)', 
          fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.scatter(val_true_labels, val_predictions, alpha=0.5, s=10)
plt.plot([min(val_true_labels), max(val_true_labels)], 
         [min(val_true_labels), max(val_true_labels)], 
         'r--', lw=2, label='Perfect Prediction')
plt.xlabel('Actual Price ($)', fontsize=12)
plt.ylabel('Predicted Price ($)', fontsize=12)
plt.title(f'Predictions vs Actual (Log Scale)\nR²: {final_metrics["r2"]:.4f}', 
          fontsize=14, fontweight='bold')
plt.xscale('log')
plt.yscale('log')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(os.path.join(save_directory, 'predictions_vs_actual.png'), dpi=300, bbox_inches='tight')
plt.show()

print(f"\n✅ Prediction plot saved to: {os.path.join(save_directory, 'predictions_vs_actual.png')}")

In [None]:
# ========================================
# SAMPLE PREDICTIONS ANALYSIS
# ========================================

print("\n" + "="*60)
print("SAMPLE PREDICTIONS")
print("="*60)

n_samples = min(15, len(val_true_labels))  # avoid an error when the validation set is small
sample_indices = np.random.choice(len(val_true_labels), n_samples, replace=False)
sample_data = []

for i, idx in enumerate(sample_indices):
    actual = val_true_labels[idx]
    predicted = val_predictions[idx]
    error = abs(predicted - actual) / ((abs(actual) + abs(predicted)) / 2) * 100
    
    sample_data.append({
        'Actual': f'${actual:.2f}',
        'Predicted': f'${predicted:.2f}',
        'Error (%)': f'{error:.2f}%'
    })
    
    print(f"{i+1:2d}. Actual: ${actual:8.2f} | Predicted: ${predicted:8.2f} | Error: {error:5.2f}%")

# Save sample predictions
sample_df = pd.DataFrame(sample_data)
sample_df.to_csv(os.path.join(save_directory, 'sample_predictions.csv'), index=False)
print(f"\n✅ Sample predictions saved to: {os.path.join(save_directory, 'sample_predictions.csv')}")
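
The per-sample error printed above is the symmetric absolute percentage error; averaging it over all samples gives SMAPE. The sketch below assumes `calculate_metrics` (not shown in this section) uses the same definition and returns SMAPE as a fraction. The function name `smape` and the `eps` guard against a zero denominator are additions for illustration.

```python
from typing import Sequence

def smape(actual: Sequence[float], predicted: Sequence[float], eps: float = 1e-8) -> float:
    """Mean symmetric absolute percentage error as a fraction (0.18 means 18%)."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    total = 0.0
    for a, p in zip(actual, predicted):
        denominator = (abs(a) + abs(p)) / 2
        total += abs(p - a) / max(denominator, eps)  # eps avoids division by zero
    return total / len(actual)

# One prediction is off by $10 on a ~$105 average, the other is exact
print(f"{smape([100.0, 50.0], [110.0, 50.0]) * 100:.2f}%")
```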

## 1️⃣1️⃣ Test Set Predictions & Submission

In [None]:
# ========================================
# GENERATE TEST SET PREDICTIONS
# ========================================

print("="*60)
print("GENERATING TEST SET PREDICTIONS")
print("="*60)

# Create test dataset
test_dataset = MultimodalDataset(
    df=test_df,
    text_tokenizer=text_tokenizer,
    image_processor=image_processor,
    max_text_length=SEQ_LENGTH,
    training=False,
    use_images=USE_IMAGES
)

test_loader = DataLoader(
    test_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    num_workers=2,
    collate_fn=MultimodalCollator(),
    pin_memory=torch.cuda.is_available()
)

print(f"Test samples: {len(test_dataset)}")
print(f"Test batches: {len(test_loader)}")

# Generate predictions
model.eval()
test_predictions = []

print("\nGenerating predictions...")
with torch.no_grad():
    for batch_idx, batch in enumerate(test_loader):
        input_ids = batch['input_ids'].to(compute_device)
        attention_mask = batch['attention_mask'].to(compute_device)
        images = batch['images'].to(compute_device)
        image_available = batch['image_available'].to(compute_device)
        
        outputs = model(input_ids, attention_mask, images, image_available)
        test_predictions.extend(outputs.cpu().numpy())
        
        if batch_idx % 50 == 0:
            print(f"Processed batch {batch_idx}/{len(test_loader)}")

test_predictions = np.array(test_predictions)
test_predictions = np.clip(test_predictions, 1e-8, None)  # Ensure positive prices

print(f"\n{'='*60}")
print(f"TEST PREDICTIONS SUMMARY")
print(f"{'='*60}")
print(f"Number of predictions: {len(test_predictions)}")
print(f"Price range: ${test_predictions.min():.2f} - ${test_predictions.max():.2f}")
print(f"Price mean: ${test_predictions.mean():.2f}")
print(f"Price median: ${np.median(test_predictions):.2f}")
print(f"{'='*60}")

In [None]:
# ========================================
# CREATE SUBMISSION FILE
# ========================================

# Create submission dataframe
submission_df = pd.DataFrame({
    'sample_id': test_df['sample_id'],
    'price': test_predictions
})

print(f"\n{'='*60}")
print(f"SUBMISSION FILE")
print(f"{'='*60}")
print(f"\nFirst 10 predictions:")
print(submission_df.head(10))

# Save submission
submission_filename = f'submission_{model_type}_{datetime.now().strftime("%Y%m%d_%H%M%S")}.csv'
submission_df.to_csv(submission_filename, index=False)
print(f"\n✅ Submission saved as: {submission_filename}")

# Also save in model directory
submission_df.to_csv(os.path.join(save_directory, 'submission.csv'), index=False)
print(f"✅ Submission also saved to: {os.path.join(save_directory, 'submission.csv')}")

# Verification
print(f"\n{'='*60}")
print(f"SUBMISSION VERIFICATION")
print(f"{'='*60}")
print(f"Shape: {submission_df.shape}")
print(f"Columns: {list(submission_df.columns)}")
print(f"Sample IDs range: {submission_df['sample_id'].min()} - {submission_df['sample_id'].max()}")
print(f"All prices positive: {(submission_df['price'] > 0).all()}")
print(f"No missing values: {submission_df.isnull().sum().sum() == 0}")

# Price distribution
print(f"\n{'='*60}")
print(f"PREDICTION DISTRIBUTION")
print(f"{'='*60}")
bins = [0, 10, 50, 100, 500, float('inf')]
labels = ['< $10', '$10-$50', '$50-$100', '$100-$500', '> $500']

for i, (lower, upper) in enumerate(zip(bins[:-1], bins[1:])):
    count = ((submission_df['price'] >= lower) & (submission_df['price'] < upper)).sum()
    percentage = count / len(submission_df) * 100
    print(f"{labels[i]:12s}: {count:5d} ({percentage:5.1f}%)")

print(f"\n{'='*60}")
print(f"✅ SUBMISSION READY!")
print(f"{'='*60}")

## 1️⃣2️⃣ Performance Summary & Next Steps

### 📊 Model Performance Comparison

| Metric | Text-Only Baseline | Multimodal Target | Your Result |
|--------|-------------------|-------------------|-------------|
| SMAPE | 18.04% | ~13.5% | _Run to see_ |
| MAE | $185.47 | ~$143 | _Run to see_ |
| RMSE | $421.33 | ~$331 | _Run to see_ |
| R² | 0.752 | ~0.842 | _Run to see_ |
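
The target column corresponds to relative error reductions versus the text-only baseline. As a quick check of that arithmetic (the helper name `relative_improvement` is illustrative and not part of the notebook; R² is left out because higher is better for that metric):

```python
def relative_improvement(baseline: float, target: float) -> float:
    """Percentage reduction of an error metric relative to its baseline value."""
    return (baseline - target) / baseline * 100.0

# Baseline and target values copied from the comparison table
comparisons = {
    "SMAPE": (18.04, 13.5),
    "MAE": (185.47, 143.0),
    "RMSE": (421.33, 331.0),
}

for name, (baseline, target) in comparisons.items():
    print(f"{name}: {relative_improvement(baseline, target):.1f}% lower error")
```

This prints reductions of 25.2%, 22.9%, and 21.4%. The same helper can be applied to your own validation metrics once training has finished.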

### 🚀 Advanced Optimization Strategies

#### 1. Progressive Image Backbone Unfreezing
```python
# After initial training with frozen backbone, gradually unfreeze:
model.unfreeze_image_backbone(layers=10)  # Unfreeze last 10 layers
# Train for 2-3 more epochs with lower learning rate
```
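
The internals of `unfreeze_image_backbone` are not shown in this section. One plausible way to implement "unfreeze the last N" is to flip `requires_grad` on the trailing parameter tensors, as sketched below. The helper name `unfreeze_last_n` is hypothetical, and it counts parameter tensors rather than layers. It accepts any objects with a `requires_grad` attribute, so with PyTorch it could be called as `unfreeze_last_n(backbone.parameters(), 10)`.

```python
from types import SimpleNamespace
from typing import Iterable

def unfreeze_last_n(parameters: Iterable, n: int) -> int:
    """Set requires_grad=True on the last n parameters; return how many are now trainable."""
    params = list(parameters)
    start = max(len(params) - max(n, 0), 0)  # guards against n <= 0 and n > len(params)
    for param in params[start:]:
        param.requires_grad = True
    return sum(1 for param in params if param.requires_grad)

# Stand-in for a frozen backbone with 6 parameter tensors (no PyTorch needed to try this)
frozen_backbone = [SimpleNamespace(requires_grad=False) for _ in range(6)]
trainable = unfreeze_last_n(frozen_backbone, 2)
print(trainable)
```

After unfreezing, it is usually worth rebuilding the optimizer with lower learning rates, as the optional fine-tuning cell in this notebook does.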

#### 2. Test-Time Augmentation (TTA)
```python
# Average predictions across multiple augmented views of each image
# May reduce SMAPE slightly (rough estimate: 0.5-1 percentage points; not measured here)
```
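
A minimal sketch of the averaging step, assuming a hypothetical `predict_fn` that maps one input to a scalar price and a list of hypothetical augmentation callables; neither exists in the notebook under these names. Plain floats stand in for image tensors so the example runs without PyTorch.

```python
from statistics import mean
from typing import Callable, Sequence

def tta_predict(predict_fn: Callable[[float], float],
                sample: float,
                augmentations: Sequence[Callable[[float], float]]) -> float:
    """Average predict_fn over the original sample and each augmented view of it."""
    views = [sample] + [augment(sample) for augment in augmentations]
    return mean(predict_fn(view) for view in views)

# Toy stand-ins: a "model" that doubles its input and two small perturbations
toy_model = lambda x: 2.0 * x
toy_augmentations = [lambda x: x + 1.0, lambda x: x - 1.0]

print(tta_predict(toy_model, 10.0, toy_augmentations))
```

In the real pipeline the augmentations would be mild image transforms (for example a horizontal flip), and the text inputs would stay unchanged across views.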

#### 3. Model Ensembling
- Train multiple models with different:
  - Random seeds
  - Fusion strategies ('concat' vs 'gated')
  - Image encoders (ResNet50 vs EfficientNet)
- Average predictions
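
The final averaging step can be sketched as follows, assuming each model's test predictions are available as an equal-length list in the same sample order. The function name `average_ensemble` and the example values are illustrative only.

```python
from typing import List, Sequence

def average_ensemble(model_predictions: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of per-model prediction lists (all must have equal length)."""
    if not model_predictions:
        raise ValueError("need predictions from at least one model")
    if len({len(preds) for preds in model_predictions}) != 1:
        raise ValueError("all models must predict the same number of samples")
    n_models = len(model_predictions)
    return [sum(sample_preds) / n_models for sample_preds in zip(*model_predictions)]

# Made-up predictions from two model variants for three test samples
preds_concat = [10.0, 20.0, 30.0]
preds_gated = [12.0, 18.0, 33.0]

ensemble = average_ensemble([preds_concat, preds_gated])
print(ensemble)
```

An unweighted mean is the simplest choice; weighting models by their validation SMAPE is a possible refinement.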

#### 4. Hyperparameter Tuning
- Adjust fusion output dimension
- Experiment with dropout rates
- Try different batch sizes
- Tune learning rates
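
One way to organize such a search is to enumerate a small grid of configurations and train one model per entry. In the sketch below, the key names and the alternative values are assumptions for illustration; only 640, 0.3, 16, and 5e-4 are the notebook's current settings.

```python
from itertools import product
from typing import Dict, Iterator, List

# Candidate values per hyperparameter (second value in each list is the current setting)
search_space: Dict[str, List] = {
    "fusion_output_dim": [512, 640],
    "fusion_dropout": [0.2, 0.3],
    "batch_size": [8, 16],
    "lr_fusion": [1e-4, 5e-4],
}

def iter_configs(space: Dict[str, List]) -> Iterator[dict]:
    """Yield one dict per combination of values in the search space."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))

configs = list(iter_configs(search_space))
print(len(configs))   # number of training runs a full grid would need
print(configs[0])
```

A full grid grows multiplicatively, so with training runs this long it may be more practical to vary one or two settings at a time.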

### 📁 Files Created
- Model weights: `model_best.pt`
- Training metrics: `training_metrics.json`
- Submission file: `submission_*.csv`
- Visualizations: `training_curves.png`, `predictions_vs_actual.png`

In [None]:
# ========================================
# OPTIONAL: PROGRESSIVE FINE-TUNING
# ========================================
# Uncomment and run this cell to further improve performance
# by unfreezing the image backbone and fine-tuning

"""
print("="*60)
print("PROGRESSIVE FINE-TUNING")
print("="*60)

# Unfreeze image backbone (gradually)
if USE_IMAGES:
    print("Unfreezing image encoder backbone...")
    ft_model = model.module if hasattr(model, 'module') else model  # unwrap nn.DataParallel if used
    ft_model.unfreeze_image_backbone(layers=10)
    
    # Lower learning rate for fine-tuning
    fine_tune_optimizer = AdamW([
        {'params': ft_model.text_encoder.parameters(), 'lr': 1e-5},
        {'params': ft_model.image_encoder.parameters(), 'lr': 5e-5},
        {'params': ft_model.fusion.parameters(), 'lr': 1e-4},
        {'params': ft_model.regressor.parameters(), 'lr': 1e-4}
    ], weight_decay=0.01)
    
    fine_tune_steps = len(train_loader) * 3  # 3 more epochs
    fine_tune_scheduler = get_linear_schedule_with_warmup(
        fine_tune_optimizer,
        num_warmup_steps=fine_tune_steps // 10,
        num_training_steps=fine_tune_steps
    )
    
    # Fine-tune for 3 more epochs
    fine_tune_history = train_multimodal_model(
        model=model,
        train_loader=train_loader,
        val_loader=val_loader,
        criterion=criterion,
        optimizer=fine_tune_optimizer,
        scheduler=fine_tune_scheduler,
        epochs=3,
        device=compute_device,
        use_amp=USE_MIXED_PRECISION,
        early_stopping_patience=2
    )
    
    print(f"\n✅ Fine-tuning completed!")
    print(f"New best SMAPE: {fine_tune_history['best_val_smape']:.4f}")
"""

print("💡 Tip: Uncomment the code above to run progressive fine-tuning")

In [None]:
# ========================================
# IMAGE LOADING STATISTICS
# ========================================

print("="*60)
print("IMAGE PROCESSING STATISTICS")
print("="*60)

stats = image_processor.get_stats()
print(f"\nImage Loading Statistics:")
print(f"  Total attempts: {stats['total']}")
print(f"  Successfully loaded: {stats['success']}")
print(f"  Missing images: {stats['missing']}")
print(f"  Failed to load: {stats['failed']}")
print(f"  Success rate: {stats['success_rate']:.2%}")

print(f"\nImages loaded from: './images/' directory")
print(f"Missing/failed images use gray placeholder tensors")
print("="*60)

---

## 🎓 Quick Start Guide

### Running the Notebook

1. **Ensure images are in `./images/` folder** - Images should already be downloaded
2. **Run all cells sequentially** from top to bottom
3. **Training takes ~1.5-2 hours** on GPU with images (~30-45 minutes text-only), longer on CPU

### Configuration Options

To **disable multimodal** and run text-only (baseline):
```python
USE_IMAGES = False  # In the configuration cell
```

To change **image model**:
```python
IMAGE_MODEL = 'efficientnet_b3'  # More powerful but slower
```

To adjust **batch size** (if running out of memory):
```python
BATCH_SIZE = 8  # Reduce if GPU memory is limited
```

To change **fusion strategy**:
```python
FUSION_TYPE = 'gated'  # Alternative to concatenation
```

### Expected Performance

**With Images (Multimodal)**:
- Target SMAPE: 13-15%
- Training time: ~1.5-2 hours (GPU)
- Memory: ~8GB GPU RAM

**Without Images (Text-Only)**:
- Expected SMAPE: 18-20%
- Training time: ~30-45 minutes (GPU)
- Memory: ~4GB GPU RAM

### Troubleshooting

**Out of Memory Error**:
- Reduce `BATCH_SIZE` to 8 or 4
- Set `USE_IMAGES = False` for text-only mode
- Keep mixed precision enabled (`USE_MIXED_PRECISION = True`); disabling it increases GPU memory use
- Reduce `SEQ_LENGTH` (for example, from 256 to 128)

**Missing Images**:
- Ensure images are in `./images/` directory
- Image filenames should match the URLs in `image_link` column
- Missing images will use gray placeholder tensors (training continues)
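
To see which images are missing before training, a check along these lines may help. It assumes that each local filename is the last path component of the URL in `image_link`; that naming convention is a guess, so adjust it to match how the images were actually saved. The function name `find_missing_images` is hypothetical.

```python
import os
from typing import Iterable, List
from urllib.parse import urlparse

def find_missing_images(image_links: Iterable[str], image_dir: str = "./images") -> List[str]:
    """Return the links whose expected local file does not exist in image_dir."""
    missing = []
    for link in image_links:
        # Assumed convention: local filename == last component of the URL path
        filename = os.path.basename(urlparse(link).path)
        if not filename or not os.path.isfile(os.path.join(image_dir, filename)):
            missing.append(link)
    return missing

# Example with a directory that does not exist, so the link is reported as missing
print(find_missing_images(["https://example.com/img/a.jpg"], image_dir="./does_not_exist"))
```

In the notebook this could be called as `find_missing_images(train_df['image_link'])` and compared against the counts reported by `image_processor.get_stats()`.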

**Model Not Improving**:
- Try unfreezing image backbone (progressive fine-tuning cell)
- Increase number of epochs
- Adjust learning rates
- Ensure images are being used (`USE_IMAGES = True`)

---

### 🎯 Final Checklist

- ✅ Images folder exists at `./images/` with product images
- ✅ All cells run without errors
- ✅ Training completes successfully
- ✅ Validation SMAPE < 20%
- ✅ Submission file created
- ✅ Predictions look reasonable (no negative prices, reasonable distribution)

**Good luck! 🚀**