# 🔧 CORRECTED VERSION V3

## ✅ Critical Fixes Applied

### 1. **Model Architecture**
- ✅ The revenue head now ends in `Softplus()`, which keeps its log-space output positive
- ✅ The buyer and revenue heads are independent; the conditioning between them caused problems and was removed
- ✅ Simpler architecture (256→128) to reduce overfitting on limited data

### 2. **Loss Function**
- ✅ The loss is computed consistently in log-space (no more mixing of scales)
- ✅ Revenue loss is computed only on buyers (masked)
- ✅ Removed the approximate MSLE term that confused the model
- ✅ Balanced weights: 40% buyer + 60% revenue
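
The loss rules in this list can be illustrated without any framework. The following is a minimal sketch in plain Python, assuming per-row lists instead of tensors; `bce_with_logits` and `two_head_loss` are hypothetical helper names used only for illustration, not functions from this notebook.

```python
import math

def bce_with_logits(logit: float, label: float) -> float:
    """Numerically stable binary cross-entropy computed from a raw logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def two_head_loss(buyer_logits, log_revenue_preds, buyers, log_revenue_targets,
                  w_buyer=0.4, w_revenue=0.6):
    """Weighted sum of buyer BCE (all rows) and revenue MSE (buyer rows only)."""
    n = len(buyers)
    loss_buyer = sum(bce_with_logits(z, b) for z, b in zip(buyer_logits, buyers)) / n
    # Mask: only rows labelled as buyers contribute to the revenue term.
    sq_errors = [(p - t) ** 2
                 for p, t, b in zip(log_revenue_preds, log_revenue_targets, buyers)
                 if b > 0.5]
    loss_revenue = sum(sq_errors) / len(sq_errors) if sq_errors else 0.0
    return w_buyer * loss_buyer + w_revenue * loss_revenue

# One buyer and one non-buyer: the non-buyer's revenue prediction (5.0) is ignored.
loss = two_head_loss([0.0, 0.0], [1.0, 5.0], [1.0, 0.0], [2.0, 0.0])
```

Both revenue values are on the `log1p` scale, so the squared error never mixes log-space and original-scale quantities.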

### 3. **Distillation**
- ✅ The student learns from BOTH teacher heads
- ✅ Soft loss on buyer AND revenue (not just one)
- ✅ Alpha=0.7 (more weight on ground truth than on the teacher)
- ✅ Gradient clipping for stability
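
As a small worked example of how alpha trades off the two signals, the sketch below blends a hard (ground-truth) loss and a soft (teacher) loss. The function name `distillation_loss` and the numbers are illustrative assumptions, not values from a training run.

```python
def distillation_loss(hard_loss: float, soft_loss: float, alpha: float = 0.7) -> float:
    """Blend ground-truth (hard) and teacher (soft) losses; alpha weights the hard term."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * hard_loss + (1.0 - alpha) * soft_loss

# With alpha = 0.7, the ground truth contributes 70% of the total.
blended = distillation_loss(hard_loss=1.0, soft_loss=2.0, alpha=0.7)
```

Setting `alpha=1.0` disables distillation entirely, while `alpha=0.0` trains the student on the teacher's outputs alone.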

### 4. **Data & Training**
- ✅ Sample fraction increased: 10% → 25% (more data)
- ✅ Epochs increased: 5 → 8
- ✅ Embeddings reserve a slot for "unknown" categories
- ✅ AdamW + weight decay for regularization

### 5. **Prediction**
- ✅ Correct conversion: log-space → original scale
- ✅ Final prediction: `P(buyer) * revenue_if_buyer`
- ✅ Comprehensive metrics (MSLE, AUC, MAE, distributions)
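
The conversion in this list can be sketched for a single row as follows. This is a plain-Python illustration that assumes the revenue head predicts `log1p(revenue)`; `sigmoid` and `expected_revenue` are hypothetical helper names.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def expected_revenue(buyer_logit: float, log_revenue: float) -> float:
    """P(buyer) * revenue_if_buyer on the original (non-log) scale."""
    prob_buyer = sigmoid(buyer_logit)
    # expm1 inverts log1p; the max() is only a guard against tiny negative values.
    revenue_if_buyer = max(math.expm1(log_revenue), 0.0)
    return prob_buyer * revenue_if_buyer

# A 50% chance of buying with a conditional revenue of 10 gives an expectation of 5.
value = expected_revenue(buyer_logit=0.0, log_revenue=math.log1p(10.0))
```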

---

**⚠️ IMPORTANT:** Run ALL cells in order after this fix.

## 🎯 Next Steps to Improve Results

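### Immediate Improvements (Easy)
1. **More data:** Increase `TRAIN_SAMPLE_FRAC` from 0.25 to 0.5 or more
2. **Feature engineering:** Review the label columns before discarding them all. An early signal such as `buyer_d1` can be a feature only if it is known at prediction time; longer-horizon columns such as `iap_revenue_d14` are measured after the `iap_revenue_d7` window closes, so using them would leak the target
3. **Class balancing:** Oversample buyers (they are the minority class)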
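
One way to oversample buyers is to draw rows with class-balanced weights. The sketch below uses only the standard library; `balanced_sample_weights` and `oversample_indices` are hypothetical names. In PyTorch, weights like these could be passed to `torch.utils.data.WeightedRandomSampler`, though that wiring is not shown or tested here.

```python
import random
from collections import Counter

def balanced_sample_weights(labels):
    """Per-row weights such that every class carries the same total weight."""
    counts = Counter(labels)
    return [1.0 / (len(counts) * counts[y]) for y in labels]

def oversample_indices(labels, n_samples, seed=42):
    """Draw row indices with replacement so classes appear roughly equally often."""
    rng = random.Random(seed)
    weights = balanced_sample_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=n_samples)

# Toy data with 3% buyers, close to the ratio seen in this notebook.
labels = [0] * 97 + [1] * 3
weights = balanced_sample_weights(labels)
indices = oversample_indices(labels, n_samples=10_000)
buyer_share = sum(labels[i] for i in indices) / len(indices)
```

Oversampling changes the buyer rate the model sees, so the predicted `P(buyer)` would likely need recalibration before it is multiplied by the revenue estimate.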

### Advanced Improvements (More Work)
1. **Ensemble:** Combine teacher + student predictions (weighted average)
2. **Cross-validation:** 5-fold CV for a better performance estimate
3. **Hyperparameter tuning:** Grid search over learning rate, dropout, and architecture
4. **Feature selection:** Remove noisy or redundant features
5. **Quantile regression:** Predict percentiles instead of the mean (better suited to a skewed distribution)
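
Quantile regression is usually trained with the pinball (quantile) loss in place of MSE. A minimal plain-Python sketch, with `pinball_loss` as a hypothetical helper name:

```python
def pinball_loss(y_true, y_pred, q: float) -> float:
    """Mean pinball loss for quantile q in (0, 1)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must be strictly between 0 and 1")
    total = 0.0
    for t, p in zip(y_true, y_pred):
        diff = t - p
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)

# For the 90th percentile, missing low by 10 costs 9x more than missing high by 10.
under = pinball_loss([10.0], [0.0], q=0.9)
over = pinball_loss([0.0], [10.0], q=0.9)
```

The asymmetry is what pushes the prediction toward the chosen percentile rather than the mean.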

### Debugging
- If MSLE stays high: compare the distribution of predictions against the ground truth
- If buyer AUC is low: add more features related to user behavior
- If overfitting: increase dropout or weight decay
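
For the MSLE check, a useful reference point (suggested here as an assumption, not something the notebook computes) is the score of an all-zero prediction, since most users are not buyers. The sketch below is a plain-Python equivalent of MSLE plus a small distribution summary; `msle` and `summarize` are hypothetical helper names.

```python
import math

def msle(y_true, y_pred) -> float:
    """Mean squared logarithmic error, using log1p on both sides."""
    return sum((math.log1p(p) - math.log1p(t)) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)

def summarize(values):
    """Share of zeros, median, and max: enough to spot a collapsed prediction."""
    ordered = sorted(values)
    n = len(ordered)
    return {
        "zero_share": sum(v == 0 for v in ordered) / n,
        "median": ordered[n // 2],
        "max": ordered[-1],
    }

# Toy ground truth: three non-buyers and one buyer whose log1p(revenue) is 1.
y_true = [0.0, 0.0, 0.0, math.e - 1.0]
baseline = msle(y_true, [0.0] * len(y_true))
truth_summary = summarize(y_true)
```

A model whose validation MSLE is not clearly below this baseline is adding little beyond predicting zero for everyone.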

**Run the notebook and check the validation metrics!**

## Configuration

In [1]:
import dask
import dask.dataframe as dd

dask.config.set({"dataframe.convert-string": False})

from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_log_error
import gc

TRAIN_PATH = "./train/train"
TEST_PATH  = "./test/test"

TARGET_COL = "iap_revenue_d7"

In [None]:
# ========== CONFIGURATION ==========
TARGET_COL = "iap_revenue_d7"
TRAIN_SAMPLE_FRAC = 0.25  # ✅ Increased from 0.10 to 0.25 for more data

# PyTorch settings
import torch

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
BATCH_SIZE = 256
TEACHER_EPOCHS = 8  # ✅ More epochs
STUDENT_EPOCHS = 8
LEARNING_RATE = 1e-3
DISTILL_ALPHA = 0.7  # ✅ More weight on the hard loss (ground truth)

print(f"Device: {DEVICE}")
print(f"Sample fraction: {TRAIN_SAMPLE_FRAC}")

Device: cuda
Sample fraction: 0.1


## Imports

⚠️ **IMPORTANT:** If you see CUDA errors such as `cudaErrorUnknown`:

1. **RESTART THE KERNEL** → "Restart" button in the top toolbar, or `Ctrl+Shift+P` → "Restart Kernel"
2. Run all cells from the beginning
3. Do NOT run individual cells after a CUDA error

This happens because CUDA is left in a corrupted state after an error and must be fully reinitialized.


In [3]:
# IMPORTANT: If you get CUDA errors, RESTART THE KERNEL first!
# This cell checks CUDA health
import torch
import gc

gc.collect()

if torch.cuda.is_available():
    try:
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
        print(f"✓ CUDA available: {torch.cuda.get_device_name(0)}")
        print(f"  Memory: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB allocated")
    except Exception as e:
        print(f"⚠ CUDA ERROR DETECTED: {e}")
        print("  → PLEASE RESTART THE KERNEL (Ctrl+Shift+P → 'Restart Kernel')")
        print("  → Then run all cells from the beginning")
        # Chain the original exception so the root cause stays visible in the traceback
        raise RuntimeError("CUDA is in a corrupted state. Restart kernel required.") from e
else:
    print("ℹ CUDA not available, using CPU")


✓ CUDA available: NVIDIA GeForce RTX 5070 Laptop GPU
  Memory: 0.00 GB allocated


In [4]:
import numpy as np
import pandas as pd
import dask
import dask.dataframe as dd
from sklearn.metrics import mean_squared_log_error, roc_auc_score
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import gc
import os
from glob import glob

# Reproducibility
RSEED = 42
np.random.seed(RSEED)

# Set torch seed with error handling
try:
    torch.manual_seed(RSEED)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(RSEED)
    print("✓ Libraries imported successfully")
except RuntimeError as e:
    if "CUDA" in str(e):
        print("=" * 60)
        print("⚠ CRITICAL: CUDA ERROR DURING INITIALIZATION")
        print("=" * 60)
        print(f"Error: {e}")
        print("\nSOLUTION:")
        print("1. Click 'Restart' button in the notebook toolbar")
        print("2. Or: Ctrl+Shift+P → type 'Restart Kernel'")
        print("3. Run all cells from the beginning")
        print("=" * 60)
    raise

dask.config.set({"dataframe.convert-string": False})


✓ Libraries imported successfully


<dask.config.set at 0x7929983ab4c0>

## Helper Functions

In [5]:
# Problematic columns (lists/dicts) that are ignored
IGNORE_BIG_COLS = [
    "bundles_ins", "user_bundles", "user_bundles_l28d",
    "city_hist", "country_hist", "region_hist",
    "dev_language_hist", "dev_osv_hist",
    "bcat", "bcat_bottom_taxonomy",
    "bundles_cat", "bundles_cat_bottom_taxonomy",
    "first_request_ts_bundle", "first_request_ts_category_bottom_taxonomy",
    "last_buy_ts_bundle", "last_buy_ts_category",
    "last_install_ts_bundle", "last_install_ts_category",
    "advertiser_actions_action_count", "advertiser_actions_action_last_timestamp",
    "user_actions_bundles_action_count", "user_actions_bundles_action_last_timestamp",
    "new_bundles",
    "whale_users_bundle_num_buys_prank", "whale_users_bundle_revenue_prank",
    "whale_users_bundle_total_num_buys", "whale_users_bundle_total_revenue",
]

LABEL_COLS = [
    "buyer_d1", "buyer_d7", "buyer_d14", "buyer_d28",
    "buy_d7", "buy_d14", "buy_d28",
    "iap_revenue_d7", "iap_revenue_d14", "iap_revenue_d28",
    "registration",
    "retention_d1_to_d7", "retention_d3_to_d7", "retention_d7_to_d14",
    "retention_d1", "retention_d3", "retention_d7",
]

def reduce_memory(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns to save memory."""
    df = df.copy()
    for col in df.columns:
        col_type = df[col].dtype
        if col_type == "float64":
            df[col] = df[col].astype("float32")
        elif col_type == "int64":
            df[col] = df[col].astype("int32")
    return df

def detect_listlike_columns(df: pd.DataFrame, cols=None):
    """Detect columns containing lists or dicts."""
    if cols is None:
        cols = df.columns
    listlike = []
    for c in cols:
        sample_vals = df[c].head(100)
        if sample_vals.apply(lambda v: isinstance(v, (list, dict))).any():
            listlike.append(c)
    return listlike

def preprocess_train_valid(X_train, X_valid, num_cols, cat_cols):
    """Preprocess train and validation sets."""
    X_train = X_train.copy()
    X_valid = X_valid.copy()
    
    # Numeric: fill NaN with 0
    for c in num_cols:
        X_train[c] = X_train[c].fillna(0)
        X_valid[c] = X_valid[c].fillna(0)
    
    # Categorical: convert to strings and encode as integers
    cat_mappings = {}
    for c in cat_cols:
        X_train[c] = X_train[c].astype("object").fillna("unknown").astype(str)
        X_train[c] = X_train[c].astype("category")
        
        # Create mapping
        cats = X_train[c].cat.categories
        cat_mappings[c] = {cat: i for i, cat in enumerate(cats)}
        
        # Encode train
        X_train[c] = X_train[c].cat.codes
        
        # Encode valid (handle unseen categories)
        X_valid[c] = X_valid[c].astype("object").fillna("unknown").astype(str)
        X_valid[c] = X_valid[c].map(cat_mappings[c]).fillna(-1).astype(np.int32)
    
    return X_train, X_valid, cat_mappings

print("Helper functions loaded.")

Helper functions loaded.


## Load and Prepare Data

In [6]:
# Train: Oct 1-5, Valid: Oct 6
filters_train = [("datetime", ">=", "2025-10-01-00-00"),
                 ("datetime", "<",  "2025-10-06-00-00")]
filters_valid = [("datetime", ">=", "2025-10-06-00-00"),
                 ("datetime", "<",  "2025-10-07-00-00")]

# Get list of parquet files
parquet_files_all = glob(os.path.join(TRAIN_PATH, '**/part-*.parquet'), recursive=True)

# Reduce number of files for faster training
num_files_train = max(1, int(len(parquet_files_all) * 0.15))
parquet_files_train = parquet_files_all[:num_files_train]

print(f"Using {num_files_train} out of {len(parquet_files_all)} train files")

# Columns to drop early
cols_to_drop_early = IGNORE_BIG_COLS + ["row_id", "datetime"]

# Load TRAIN
print("Loading train data...")
dd_train = dd.read_parquet(
    parquet_files_train, 
    filters=filters_train,
    engine='pyarrow'
)

# Drop heavy columns BEFORE compute
existing_cols = [c for c in cols_to_drop_early if c in dd_train.columns]
dd_train = dd_train.drop(columns=existing_cols)

# Sample in Dask
train_sample = dd_train.sample(frac=TRAIN_SAMPLE_FRAC, random_state=RSEED).compute()
train_sample = reduce_memory(train_sample)

print(f"Train loaded: {train_sample.shape}, Memory: {train_sample.memory_usage(deep=True).sum() / 1e9:.2f} GB")

# Clean memory
del dd_train
gc.collect()

# Load VALID
print("\nLoading validation data...")
dd_valid = dd.read_parquet(
    parquet_files_train,
    filters=filters_valid,
    engine='pyarrow'
)

existing_cols = [c for c in cols_to_drop_early if c in dd_valid.columns]
dd_valid = dd_valid.drop(columns=existing_cols)

# Sample validation at the same fraction as train, capped at 0.5
valid_df = dd_valid.sample(frac=min(0.5, TRAIN_SAMPLE_FRAC), random_state=RSEED).compute()
valid_df = reduce_memory(valid_df)

print(f"Valid loaded: {valid_df.shape}, Memory: {valid_df.memory_usage(deep=True).sum() / 1e9:.2f} GB")

del dd_valid
gc.collect()

print("\n✓ Data loaded successfully")
print(f"Total memory: ~{(train_sample.memory_usage(deep=True).sum() + valid_df.memory_usage(deep=True).sum()) / 1e9:.2f} GB")

Using 21 out of 144 train files
Loading train data...
Train loaded: (271487, 56), Memory: 0.40 GB

Loading validation data...
Valid loaded: (28373, 56), Memory: 0.04 GB

✓ Data loaded successfully
Total memory: ~0.44 GB


In [7]:
# Extract targets
y_train = train_sample[TARGET_COL].values
y_valid = valid_df[TARGET_COL].values

# Extract buyer labels
y_train_buyer = train_sample["buyer_d7"].values
y_valid_buyer = valid_df["buyer_d7"].values

print(f"Buyer ratio in train: {y_train_buyer.mean():.4f}")
print(f"Buyer ratio in valid: {y_valid_buyer.mean():.4f}")

# Target transform: log1p for stability (MSLE)
y_train_log = np.log1p(y_train.clip(min=0.0))
y_valid_log = np.log1p(y_valid.clip(min=0.0))

# Prepare features
cols_to_drop = ["row_id", "datetime"] + LABEL_COLS
feature_cols = [c for c in train_sample.columns if c not in cols_to_drop]

X_train = train_sample[feature_cols].copy()
X_valid = valid_df[feature_cols].copy()

# Detect and remove list-like columns
listlike_cols = detect_listlike_columns(X_train, cols=feature_cols)
print(f"Removing {len(listlike_cols)} list-like columns: {listlike_cols}")
X_train = X_train.drop(columns=listlike_cols)
X_valid = X_valid.drop(columns=listlike_cols)

# Identify numeric and categorical columns
num_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = [c for c in X_train.columns if c not in num_cols]

print(f"Features: {len(X_train.columns)} ({len(num_cols)} numeric, {len(cat_cols)} categorical)")

# Preprocess
X_train_prep, X_valid_prep, cat_mappings = preprocess_train_valid(X_train, X_valid, num_cols, cat_cols)
print(f"Data prepared: X_train {X_train_prep.shape}, X_valid {X_valid_prep.shape}")

Buyer ratio in train: 0.0320
Buyer ratio in valid: 0.0330
Removing 14 list-like columns: ['avg_daily_sessions', 'avg_duration', 'cpm', 'cpm_pct_rk', 'ctr', 'ctr_pct_rk', 'hour_ratio', 'iap_revenue_usd_bundle', 'iap_revenue_usd_category', 'iap_revenue_usd_category_bottom_taxonomy', 'num_buys_bundle', 'num_buys_category', 'num_buys_category_bottom_taxonomy', 'rwd_prank']
Features: 26 (11 numeric, 15 categorical)
Data prepared: X_train (271487, 26), X_valid (28373, 26)


## PyTorch Dataset

In [None]:
class TabularDataset(Dataset):
    def __init__(self, df, cat_cols, num_cols, y, y_buyer=None, emb_sizes=None):
        if len(cat_cols) > 0:
            cat_data = df[cat_cols].values.astype(np.int64)
            if emb_sizes is not None:
                # Map unknown categories (-1) to a special index
                for i in range(cat_data.shape[1]):
                    n_categories = emb_sizes[i][0]
                    # -1 (unknown) -> n_categories - 1 (last embedding reserved for unknown)
                    cat_data[cat_data[:, i] == -1, i] = n_categories - 1
                    # Safety net: keep every index inside the embedding range [0, n_categories - 1]
                    cat_data[:, i] = np.clip(cat_data[:, i], 0, n_categories - 1)
            self.cat = cat_data
        else:
            self.cat = None
        self.num = df[num_cols].values.astype(np.float32) if len(num_cols) > 0 else None
        self.y = y.astype(np.float32)
        self.buyer = y_buyer.astype(np.float32) if y_buyer is not None else None
        
    def __len__(self):
        return len(self.y)
    
    def __getitem__(self, idx):
        item = {}
        if self.cat is not None:
            item['cat'] = torch.tensor(self.cat[idx], dtype=torch.long)
        if self.num is not None:
            item['num'] = torch.tensor(self.num[idx], dtype=torch.float32)
        item['y'] = torch.tensor(self.y[idx], dtype=torch.float32)
        if self.buyer is not None:
            item['buyer'] = torch.tensor(self.buyer[idx], dtype=torch.float32)
        return item

print("TabularDataset class defined.")

TabularDataset class defined.
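
The index handling inside `TabularDataset` can be checked in isolation. The sketch below reproduces the same two steps on a plain list; `remap_codes` is a hypothetical helper name, and `n_categories` stands for the first element of an `emb_sizes` entry.

```python
def remap_codes(codes, n_categories: int):
    """Send -1 (unseen category) to the last index and clamp the rest into range."""
    unknown_index = n_categories - 1
    return [unknown_index if c == -1 else min(max(c, 0), unknown_index)
            for c in codes]

# An embedding table with 5 rows: codes 0..2 come from train, -1 marks an unseen value.
remapped = remap_codes([0, 2, -1], n_categories=5)
```

Every remapped value is a valid row of an `nn.Embedding(n_categories, dim)` table, which avoids out-of-range index errors on the GPU.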


## Model Definitions

In [None]:
# Calculate embedding sizes BEFORE creating datasets
def get_embedding_sizes(cat_cols, cat_mappings, max_emb_dim=50):
    """Calculate embedding sizes for categorical features."""
    emb_sizes = []
    for c in cat_cols:
        n_unique = len(cat_mappings[c]) + 2  # +1 for unknown, +1 for padding
        emb_dim = min(max(1, n_unique // 10), max_emb_dim)
        emb_sizes.append((n_unique, emb_dim))
    return emb_sizes

emb_sizes = get_embedding_sizes(cat_cols, cat_mappings, max_emb_dim=50)
print(f"Embedding sizes calculated: {len(emb_sizes)} categorical features")
print(f"Sample sizes (categories, dim): {emb_sizes[:5]}...")  # Show first 5

# NOW create datasets WITH embedding sizes to clamp indices properly
print("\nCreating datasets with proper index clamping...")
train_ds = TabularDataset(X_train_prep, cat_cols, num_cols, y_train_log, y_train_buyer, emb_sizes=emb_sizes)
val_ds = TabularDataset(X_valid_prep, cat_cols, num_cols, y_valid_log, y_valid_buyer, emb_sizes=emb_sizes)

train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, drop_last=False)
val_loader = DataLoader(val_ds, batch_size=BATCH_SIZE, shuffle=False, drop_last=False)

print(f"✓ Dataloaders ready. Batches per epoch: train={len(train_loader)}, val={len(val_loader)}")

Embedding sizes calculated: 15 categorical features
Sample sizes (categories, dim): [(459, 45), (24, 2), (58, 5), (92, 9), (5895, 50)]...

Creating datasets with proper index clamping...
✓ Dataloaders ready. Batches per epoch: train=1061, val=111
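
The sizing rule in `get_embedding_sizes` can be reproduced for a single column to check the printed pairs. The sketch below is a standalone restatement of that rule; `embedding_size` is a hypothetical name, and its argument is the number of categories seen in train (the length of one `cat_mappings` entry).

```python
def embedding_size(n_train_categories: int, max_emb_dim: int = 50):
    """Return (table rows, embedding dim) following the notebook's heuristic."""
    n_unique = n_train_categories + 2  # +1 for unknown, +1 for padding
    emb_dim = min(max(1, n_unique // 10), max_emb_dim)
    return n_unique, emb_dim

# 457 train categories give a 459-row table with dimension 459 // 10 = 45.
sizes = [embedding_size(n) for n in (457, 22, 5893)]
```

The dimension grows with cardinality at one tenth of the table size, floors at 1 for small columns, and is capped at `max_emb_dim` for very large ones.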


In [None]:
class TeacherModel(nn.Module):
    """
    Teacher model V3 - CORRECTED ARCHITECTURE
    
    Critical improvements:
    1. Predicts log1p(revenue) directly (same scale as the target)
    2. Uses Softplus to keep the revenue output positive
    3. Simpler architecture to reduce overfitting
    4. Separate, independent heads
    """
    def __init__(self, emb_sizes, num_len):
        super().__init__()
        self.embs = nn.ModuleList([nn.Embedding(categories, dim) for categories, dim in emb_sizes])
        emb_dim_sum = sum([dim for _, dim in emb_sizes]) if len(emb_sizes) > 0 else 0
        input_dim = emb_dim_sum + (num_len if num_len > 0 else 0)
        
        # Simpler main network (256->128)
        self.net = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.1),  # Less dropout
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.1),
        )
        
        # Buyer head: predicts the logit of P(buyer=1)
        self.buyer_head = nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
            # No sigmoid here; it is applied in forward() and inside BCEWithLogitsLoss
        )
        
        # Revenue head: predicts log1p(revenue) given that the user is a buyer
        # CRITICAL: the target log1p(revenue) is >= 0, so the output must be positive
        self.revenue_head = nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Softplus()  # ✅ Keeps the output > 0, matching log1p(revenue)
        )
        
    def forward(self, x_cat, x_num, return_components=False):
        """
        Args:
            return_components: If True, return (buyer_logit, log_revenue, prob_buyer).
                              If False, return (buyer_logit, log_revenue).
        """
        if x_cat is not None and len(self.embs) > 0:
            embs = [emb(x_cat[:, i]) for i, emb in enumerate(self.embs)]
            x = torch.cat(embs + ([x_num] if x_num is not None else []), dim=1)
        else:
            x = x_num
        
        feat = self.net(x)
        
        # Buyer prediction (logit)
        buyer_logit = self.buyer_head(feat).view(-1)
        prob_buyer = torch.sigmoid(buyer_logit)
        
        # Revenue prediction (log-space, positive because of Softplus)
        log_revenue = self.revenue_head(feat).view(-1)
        
        if return_components:
            return buyer_logit, log_revenue, prob_buyer
        
        # Final prediction on the original scale: P(buyer) * expm1(log_revenue),
        # because expm1 inverts the log1p applied to the target.
        # During training we work in log-space directly, so return the raw heads.
        return buyer_logit, log_revenue


class StudentModel(nn.Module):
    """
    Student model V3 - CORRECTED ARCHITECTURE
    
    Improvements:
    1. Mimics the teacher's structure at a smaller size
    2. Embedding dimensions halved
    3. More compact network (128->64)
    """
    def __init__(self, emb_sizes, num_len):
        super().__init__()
        # Reduced embeddings (half the teacher's dimension, minimum 1)
        small_embs = [(n, max(1, d // 2)) for n, d in emb_sizes]
        self.embs = nn.ModuleList([nn.Embedding(categories, dim) for categories, dim in small_embs])
        emb_dim_sum = sum([dim for _, dim in small_embs]) if len(small_embs) > 0 else 0
        input_dim = emb_dim_sum + (num_len if num_len > 0 else 0)
        
        # Compact network
        self.net = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(128, 64),
            nn.ReLU(),
        )
        
        # Buyer head
        self.buyer_head = nn.Sequential(
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1)
        )
        
        # Revenue head
        self.revenue_head = nn.Sequential(
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Softplus()  # ✅ Same as teacher
        )
        
    def forward(self, x_cat, x_num, return_components=False):
        if x_cat is not None and len(self.embs) > 0:
            embs = [emb(x_cat[:, i]) for i, emb in enumerate(self.embs)]
            x = torch.cat(embs + ([x_num] if x_num is not None else []), dim=1)
        else:
            x = x_num
        
        feat = self.net(x)
        
        buyer_logit = self.buyer_head(feat).view(-1)
        prob_buyer = torch.sigmoid(buyer_logit)
        
        log_revenue = self.revenue_head(feat).view(-1)
        
        if return_components:
            return buyer_logit, log_revenue, prob_buyer
        
        return buyer_logit, log_revenue


print("✓ Model classes defined (V3 - CORRECTED)")

Model classes defined.


In [30]:
# Initialize models (emb_sizes already calculated above)
teacher = TeacherModel(emb_sizes, len(num_cols)).to(DEVICE)
student = StudentModel(emb_sizes, len(num_cols)).to(DEVICE)

teacher_params = sum(p.numel() for p in teacher.parameters() if p.requires_grad)
student_params = sum(p.numel() for p in student.parameters() if p.requires_grad)

print(f"Teacher params: {teacher_params:,}")
print(f"Student params: {student_params:,}")
print(f"Compression ratio: {teacher_params / student_params:.2f}x")


Teacher params: 1,316,168
Student params: 504,709
Compression ratio: 2.61x


## Train Teacher Model

In [None]:
def train_teacher(model, train_loader, val_loader, epochs=5, lr=1e-3, device='cpu'):
    """
    Train teacher model V3 - CORRECTED LOSS FUNCTION
    
    Critical changes:
    1. Loss is consistent in log-space
    2. Revenue loss only on buyers (masked)
    3. Balanced weighted loss
    4. No more approximate MSLE that confused the model
    """
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-4)
    mse_loss = nn.MSELoss()
    bce_loss = nn.BCEWithLogitsLoss()  # ✅ Takes logits directly
    
    model.to(device)
    best_val_loss = float('inf')
    
    for epoch in range(epochs):
        # ========== TRAINING ==========
        model.train()
        train_loss_total = 0.0
        train_loss_buyer = 0.0
        train_loss_revenue = 0.0
        
        for batch in train_loader:
            x_cat = batch.get('cat', None)
            x_num = batch.get('num', None)
            y_log = batch['y']  # log1p(revenue)
            buyer = batch['buyer']  # 0 or 1
            
            x_cat = x_cat.to(device) if x_cat is not None else None
            x_num = x_num.to(device) if x_num is not None else None
            y_log = y_log.to(device)
            buyer = buyer.to(device)
            
            opt.zero_grad()
            
            # Forward pass
            buyer_logit, log_revenue = model(x_cat, x_num)
            
            # Loss 1: buyer classification (all samples)
            loss_buyer = bce_loss(buyer_logit, buyer)
            
            # Loss 2: revenue regression (buyers ONLY)
            mask_buyers = buyer > 0.5
            if mask_buyers.sum() > 0:
                # log_revenue is the Softplus output, trained to match log1p(revenue)
                # y_log is log1p(revenue)
                # Both are on the log scale, so they are compared directly
                loss_revenue = mse_loss(log_revenue[mask_buyers], y_log[mask_buyers])
            else:
                loss_revenue = torch.tensor(0.0, device=device)
            
            # Total loss: weighted 40% buyer classification / 60% revenue regression
            loss = 0.4 * loss_buyer + 0.6 * loss_revenue
            
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # ✅ Gradient clipping
            opt.step()
            
            train_loss_total += loss.item() * len(y_log)
            train_loss_buyer += loss_buyer.item() * len(y_log)
            train_loss_revenue += loss_revenue.item() * len(y_log)
        
        train_loss_total /= len(train_loader.dataset)
        train_loss_buyer /= len(train_loader.dataset)
        train_loss_revenue /= len(train_loader.dataset)
        
        # ========== VALIDATION ==========
        model.eval()
        val_loss_total = 0.0
        val_loss_buyer = 0.0
        val_loss_revenue = 0.0
        
        with torch.no_grad():
            for batch in val_loader:
                x_cat = batch.get('cat', None)
                x_num = batch.get('num', None)
                y_log = batch['y']
                buyer = batch['buyer']
                
                x_cat = x_cat.to(device) if x_cat is not None else None
                x_num = x_num.to(device) if x_num is not None else None
                y_log = y_log.to(device)
                buyer = buyer.to(device)
                
                buyer_logit, log_revenue = model(x_cat, x_num)
                
                loss_buyer = bce_loss(buyer_logit, buyer)
                
                mask_buyers = buyer > 0.5
                if mask_buyers.sum() > 0:
                    loss_revenue = mse_loss(log_revenue[mask_buyers], y_log[mask_buyers])
                else:
                    loss_revenue = torch.tensor(0.0, device=device)
                
                loss = 0.4 * loss_buyer + 0.6 * loss_revenue
                
                val_loss_total += loss.item() * len(y_log)
                val_loss_buyer += loss_buyer.item() * len(y_log)
                val_loss_revenue += loss_revenue.item() * len(y_log)
        
        val_loss_total /= len(val_loader.dataset)
        val_loss_buyer /= len(val_loader.dataset)
        val_loss_revenue /= len(val_loader.dataset)
        
        # Print progress
        print(f"Epoch {epoch+1}/{epochs}")
        print(f"  Train - Total: {train_loss_total:.6f} | Buyer: {train_loss_buyer:.6f} | Revenue: {train_loss_revenue:.6f}")
        print(f"  Valid - Total: {val_loss_total:.6f} | Buyer: {val_loss_buyer:.6f} | Revenue: {val_loss_revenue:.6f}")
        
        # Save best model
        if val_loss_total < best_val_loss:
            best_val_loss = val_loss_total
            torch.save(model.state_dict(), 'teacher_model_v3.pt')
            print(f"  ✓ New best model saved (val_loss: {val_loss_total:.6f})")
    
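    # Reload the best checkpoint so later cells (e.g. distillation) use it, not the last epoch
    if best_val_loss < float('inf'):
        model.load_state_dict(torch.load('teacher_model_v3.pt', map_location=device))
    print(f"\n✓ Training complete. Best val loss: {best_val_loss:.6f}")
    print("✓ Best model saved to teacher_model_v3.pt")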

In [32]:
# Train the teacher model
print("=" * 50)
print("TRAINING TEACHER MODEL")
print("=" * 50)
train_teacher(teacher, train_loader, val_loader, epochs=TEACHER_EPOCHS, lr=LEARNING_RATE, device=DEVICE)


TRAINING TEACHER MODEL
Epoch 1/5
  Train - Total: 0.966467, Reg: 2.169034, Buyer: 0.148178
  Valid - Total: 0.987933
Epoch 2/5
  Train - Total: 0.859143, Reg: 1.906765, Buyer: 0.142937
  Valid - Total: 0.842455
Epoch 3/5
  Train - Total: 0.817449, Reg: 1.802508, Buyer: 0.143129
  Valid - Total: 0.830797
Epoch 4/5
  Train - Total: 0.789213, Reg: 1.732576, Buyer: 0.142440
  Valid - Total: 0.843236
Epoch 5/5
  Train - Total: 0.774294, Reg: 1.696304, Buyer: 0.141585
  Valid - Total: 0.808383

✓ Teacher saved to teacher_model_v2.pt

## Knowledge Distillation

In [None]:
def train_student_with_distillation(student, teacher, train_loader, val_loader, epochs=5, lr=1e-3, alpha=0.7, device='cpu'):
    """
    Train student V3 - CORRECTED DISTILLATION
    
    Changes:
    1. Student learns from both teacher heads
    2. Loss is consistent in log-space
    3. Distillation on buyer AND revenue
    4. High alpha = more weight on ground truth
    """
    opt = torch.optim.AdamW(student.parameters(), lr=lr, weight_decay=1e-4)
    mse_loss = nn.MSELoss()
    bce_loss = nn.BCEWithLogitsLoss()
    
    teacher.to(device)
    teacher.eval()
    student.to(device)
    best_val_loss = float('inf')
    
    for epoch in range(epochs):
        # ========== TRAINING ==========
        student.train()
        train_loss_total = 0.0
        train_loss_hard = 0.0
        train_loss_soft = 0.0
        
        for batch in train_loader:
            x_cat = batch.get('cat', None)
            x_num = batch.get('num', None)
            y_log = batch['y']
            buyer = batch['buyer']
            
            x_cat = x_cat.to(device) if x_cat is not None else None
            x_num = x_num.to(device) if x_num is not None else None
            y_log = y_log.to(device)
            buyer = buyer.to(device)
            
            opt.zero_grad()
            
            # Student predictions
            s_buyer_logit, s_log_revenue = student(x_cat, x_num)
            
            # Teacher predictions (soft targets)
            with torch.no_grad():
                t_buyer_logit, t_log_revenue = teacher(x_cat, x_num)
            
            # ===== HARD LOSS (ground truth) =====
            # Buyer classification
            hard_buyer = bce_loss(s_buyer_logit, buyer)
            
            # Revenue regression (buyers only)
            mask_buyers = buyer > 0.5
            if mask_buyers.sum() > 0:
                hard_revenue = mse_loss(s_log_revenue[mask_buyers], y_log[mask_buyers])
            else:
                hard_revenue = torch.tensor(0.0, device=device)
            
            loss_hard = 0.4 * hard_buyer + 0.6 * hard_revenue
            
            # ===== SOFT LOSS (teacher knowledge) =====
            # Distill both buyer and revenue predictions
            soft_buyer = mse_loss(s_buyer_logit, t_buyer_logit.detach())
            soft_revenue = mse_loss(s_log_revenue, t_log_revenue.detach())
            
            loss_soft = 0.4 * soft_buyer + 0.6 * soft_revenue
            
            # ===== COMBINED LOSS =====
            # high alpha = more weight on ground truth
            loss = alpha * loss_hard + (1.0 - alpha) * loss_soft
            
            loss.backward()
            torch.nn.utils.clip_grad_norm_(student.parameters(), max_norm=1.0)
            opt.step()
            
            train_loss_total += loss.item() * len(y_log)
            train_loss_hard += loss_hard.item() * len(y_log)
            train_loss_soft += loss_soft.item() * len(y_log)
        
        train_loss_total /= len(train_loader.dataset)
        train_loss_hard /= len(train_loader.dataset)
        train_loss_soft /= len(train_loader.dataset)
        
        # ========== VALIDATION ==========
        student.eval()
        val_loss_total = 0.0
        val_loss_buyer = 0.0
        val_loss_revenue = 0.0
        
        with torch.no_grad():
            for batch in val_loader:
                x_cat = batch.get('cat', None)
                x_num = batch.get('num', None)
                y_log = batch['y']
                buyer = batch['buyer']
                
                x_cat = x_cat.to(device) if x_cat is not None else None
                x_num = x_num.to(device) if x_num is not None else None
                y_log = y_log.to(device)
                buyer = buyer.to(device)
                
                s_buyer_logit, s_log_revenue = student(x_cat, x_num)
                
                loss_buyer = bce_loss(s_buyer_logit, buyer)
                
                mask_buyers = buyer > 0.5
                if mask_buyers.sum() > 0:
                    loss_revenue = mse_loss(s_log_revenue[mask_buyers], y_log[mask_buyers])
                else:
                    loss_revenue = torch.tensor(0.0, device=device)
                
                loss = 0.4 * loss_buyer + 0.6 * loss_revenue
                
                val_loss_total += loss.item() * len(y_log)
                val_loss_buyer += loss_buyer.item() * len(y_log)
                val_loss_revenue += loss_revenue.item() * len(y_log)
        
        val_loss_total /= len(val_loader.dataset)
        val_loss_buyer /= len(val_loader.dataset)
        val_loss_revenue /= len(val_loader.dataset)
        
        print(f"Epoch {epoch+1}/{epochs}")
        print(f"  Train - Total: {train_loss_total:.6f} | Hard: {train_loss_hard:.6f} | Soft: {train_loss_soft:.6f}")
        print(f"  Valid - Total: {val_loss_total:.6f} | Buyer: {val_loss_buyer:.6f} | Revenue: {val_loss_revenue:.6f}")
        
        # Save best model
        if val_loss_total < best_val_loss:
            best_val_loss = val_loss_total
            torch.save(student.state_dict(), 'student_model_v3.pt')
            print(f"  ✓ New best model saved (val_loss: {val_loss_total:.6f})")
    
    print(f"\n✓ Training complete. Best val loss: {best_val_loss:.6f}")
    print("✓ Best model saved to student_model_v3.pt")

print("=" * 50)
print("TRAINING STUDENT WITH DISTILLATION")
print("=" * 50)
train_student_with_distillation(student, teacher, train_loader, val_loader, 
                                epochs=STUDENT_EPOCHS, lr=LEARNING_RATE, 
                                alpha=DISTILL_ALPHA, device=DEVICE)

TRAINING STUDENT WITH DISTILLATION
Epoch 1/5
  Train - Total: 0.934213, Hard: 0.689769, Soft: 1.300881
  Valid - MSE: 0.881699
Epoch 2/5
  Train - Total: 0.911937, Hard: 0.704582, Soft: 1.222970
  Valid - MSE: 1.892810
Epoch 3/5
  Train - Total: 0.909223, Hard: 0.704558, Soft: 1.216220
  Valid - MSE: 1.035643
Epoch 4/5
  Train - Total: 0.906923, Hard: 0.704100, Soft: 1.211158
  Valid - MSE: 1.047304
Epoch 5/5
  Train - Total: 0.906184, Hard: 0.704108, Soft: 1.209299
  Valid - MSE: 0.807849

✓ Student saved to student_model_v2.pt
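
The logged totals can be checked against the combined-loss formula `loss = alpha * loss_hard + (1 - alpha) * loss_soft`. The sketch below solves that formula for `alpha` using the per-epoch values printed in this output; `implied_alpha` is a hypothetical helper, not part of the notebook. Assuming the logged run used the same formula, every epoch implies an alpha of roughly 0.6, which suggests this output came from an earlier run rather than from the current configuration.

```python
def implied_alpha(total, hard, soft):
    """Solve total = alpha * hard + (1 - alpha) * soft for alpha."""
    return (soft - total) / (soft - hard)

# (total, hard, soft) per epoch, copied from the training log in this output
logged = [
    (0.934213, 0.689769, 1.300881),
    (0.911937, 0.704582, 1.222970),
    (0.909223, 0.704558, 1.216220),
    (0.906923, 0.704100, 1.211158),
    (0.906184, 0.704108, 1.209299),
]

alphas = [implied_alpha(t, h, s) for t, h, s in logged]
for epoch, a in enumerate(alphas, start=1):
    print(f"Epoch {epoch}: implied alpha = {a:.3f}")
```

This kind of check is a cheap way to confirm that a saved log and the current hyperparameters actually belong together before comparing runs.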

## Evaluation

In [None]:
def predict_model(model, loader, device='cpu'):
    """
    Generate predictions from model V3 (corrected prediction).
    
    Returns predictions on the original revenue scale.
    """
    model.to(device)
    model.eval()
    
    all_preds = []
    all_trues = []
    all_buyers_true = []
    all_buyers_pred = []
    
    with torch.no_grad():
        for batch in loader:
            x_cat = batch.get('cat', None)
            x_num = batch.get('num', None)
            y_log = batch['y']  # log1p(revenue)
            
            x_cat = x_cat.to(device) if x_cat is not None else None
            x_num = x_num.to(device) if x_num is not None else None
            
            # Forward pass
            buyer_logit, log_revenue = model(x_cat, x_num)
            prob_buyer = torch.sigmoid(buyer_logit)
            
            # Final prediction on the ORIGINAL scale.
            # The revenue head ends in Softplus, so log_revenue is non-negative.
            # It is trained against y_log = log1p(revenue), so we interpret it
            # as log1p(revenue) and invert it with expm1.
            revenue_pred = torch.expm1(log_revenue)  # inverse of log1p
            
            # Final prediction: P(buyer) * revenue_if_buyer
            final_pred = prob_buyer * revenue_pred
            
            # Store
            all_preds.append(final_pred.cpu().numpy())
            all_buyers_pred.append(prob_buyer.cpu().numpy())
            all_trues.append(torch.expm1(y_log).cpu().numpy())  # Convert to original scale
            
            if 'buyer' in batch:
                all_buyers_true.append(batch['buyer'].cpu().numpy())
    
    preds = np.concatenate(all_preds)
    trues = np.concatenate(all_trues)
    buyers_pred = np.concatenate(all_buyers_pred)
    buyers_true = np.concatenate(all_buyers_true) if all_buyers_true else None
    
    return preds, trues, buyers_true, buyers_pred


def evaluate_model(model, loader, device='cpu', model_name='Model'):
    """
    Evaluate model and print comprehensive metrics.
    """
    preds, trues, buyers_true, buyers_pred = predict_model(model, loader, device)
    
    # Clip predictions to avoid invalid MSLE
    preds = np.clip(preds, 0, None)
    trues = np.clip(trues, 0, None)
    
    # MSLE
    msle = mean_squared_log_error(trues, preds)
    
    # MAE
    mae = np.mean(np.abs(preds - trues))
    
    # Buyer metrics (if available)
    if buyers_true is not None:
        auc = roc_auc_score(buyers_true, buyers_pred)
        
        # Accuracy with threshold 0.5
        buyer_pred_binary = (buyers_pred > 0.5).astype(int)
        buyer_acc = np.mean(buyer_pred_binary == buyers_true)
        
        # Fraction of true buyers correctly identified (recall)
        buyers_mask = buyers_true > 0.5
        if buyers_mask.sum() > 0:
            buyer_recall = np.mean(buyers_pred[buyers_mask] > 0.5)
        else:
            buyer_recall = 0.0
    else:
        auc = None
        buyer_acc = None
        buyer_recall = None
    
    # Print metrics
    print(f"\n{'='*50}")
    print(f"{model_name} EVALUATION")
    print(f"{'='*50}")
    print(f"MSLE (Primary Metric):  {msle:.6f}")
    print(f"MAE:                     {mae:.2f}")
    
    if auc is not None:
        print(f"\nBuyer Classification:")
        print(f"  AUC:                  {auc:.4f}")
        print(f"  Accuracy (t=0.5):     {buyer_acc:.4f}")
        print(f"  Buyer Recall:         {buyer_recall:.4f}")
    
    print(f"\nRevenue Distribution:")
    print(f"  Pred mean:            ${np.mean(preds):.2f}")
    print(f"  Pred median:          ${np.median(preds):.2f}")
    print(f"  Pred max:             ${np.max(preds):.2f}")
    print(f"  True mean:            ${np.mean(trues):.2f}")
    print(f"  True median:          ${np.median(trues):.2f}")
    print(f"  True max:             ${np.max(trues):.2f}")
    
    return {
        'msle': msle,
        'mae': mae,
        'auc': auc,
        'buyer_acc': buyer_acc,
        'buyer_recall': buyer_recall,
        'preds': preds,
        'trues': trues
    }

print("✓ Evaluation utilities defined (V3)")

Evaluation utilities defined.
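
To make the final prediction step concrete, here is a minimal pure-Python sketch of the same arithmetic for a single sample: `P(buyer) * expm1(log_revenue)`, plus MSLE computed by hand. The helper names (`sigmoid`, `expected_revenue`, `msle`) are illustrative and not part of the notebook, which uses `torch` and scikit-learn's `mean_squared_log_error` instead.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def expected_revenue(buyer_logit, log_revenue):
    """P(buyer) * revenue_if_buyer, with log_revenue read as log1p(revenue)."""
    return sigmoid(buyer_logit) * math.expm1(log_revenue)

def msle(trues, preds):
    """Mean squared error between log1p(true) and log1p(pred)."""
    return sum((math.log1p(t) - math.log1p(p)) ** 2
               for t, p in zip(trues, preds)) / len(trues)

# A 50% buyer probability and a predicted revenue of 10 give an expectation of 5.
pred = expected_revenue(0.0, math.log1p(10.0))
print(f"{pred:.2f}")

# expm1 grows exponentially: a log-space output of 30 already maps to ~1e13.
print(f"{math.expm1(30.0):.3e}")
```

Because `expm1` is exponential, a modest overshoot in log-space turns into a very large revenue value, which is one possible explanation for extreme maxima in prediction summaries.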


In [None]:
# Evaluate Teacher
teacher_results = evaluate_model(teacher, val_loader, device=DEVICE, model_name='TEACHER')

# Evaluate Student
student_results = evaluate_model(student, val_loader, device=DEVICE, model_name='STUDENT')

# Comparison
print(f"\n{'='*50}")
print("MODEL COMPARISON")
print(f"{'='*50}")
print(f"Teacher MSLE: {teacher_results['msle']:.6f}")
print(f"Student MSLE: {student_results['msle']:.6f}")
print(f"Difference:   {abs(teacher_results['msle'] - student_results['msle']):.6f}")
print(f"\nCompression: {teacher_params / student_params:.2f}x smaller")

TEACHER MODEL EVALUATION
Teacher MSLE: 3.536911
Teacher Buyer AUC: 0.6124

STUDENT MODEL EVALUATION
Student MSLE: 0.807849

Baseline (all zeros) MSLE: 0.228072

COMPARISON
Teacher improvement: -1450.79%
Student improvement: -254.21%
Student vs Teacher gap: -77.16%

Teacher predictions:
  Mean: 5034707.5000, Median: 4.8450, Max: 118296944640.0000
  % Non-zero: 100.00%

Student predictions:
  Mean: 7305063424.0000, Median: 1.0833, Max: 207266565324800.0000
  % Non-zero: 100.00%
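
The improvement percentages in this output are consistent with a relative change against the all-zeros baseline, `(baseline - model) / baseline`, and the gap with `(student - teacher) / teacher`. These formulas are inferred from the printed numbers, not taken from the notebook's code; the sketch below reproduces the arithmetic with hypothetical helper names.

```python
def relative_improvement(baseline, model):
    """Percent improvement over a baseline (negative = worse; lower MSLE is better)."""
    return (baseline - model) / baseline * 100.0

def relative_gap(teacher, student):
    """Percent difference of the student's MSLE relative to the teacher's."""
    return (student - teacher) / teacher * 100.0

# MSLE values copied from the evaluation output
baseline_msle = 0.228072
teacher_msle = 3.536911
student_msle = 0.807849

teacher_impr = relative_improvement(baseline_msle, teacher_msle)
student_impr = relative_improvement(baseline_msle, student_msle)
gap = relative_gap(teacher_msle, student_msle)

print(f"Teacher improvement: {teacher_impr:.2f}%")
print(f"Student improvement: {student_impr:.2f}%")
print(f"Student vs Teacher gap: {gap:.2f}%")
```

Negative improvements mean that, in this logged run, both models score a higher (worse) MSLE than predicting zero revenue for every user.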


## Summary

This notebook combines:
1. **Robust data loading** from `simplified_model_comparison.ipynb` (parquet handling with Dask, sampling, preprocessing)
2. **Deep learning models** (Teacher-Student distillation in PyTorch)

### Advantages of the Student Model:
- **~50% fewer parameters** than the teacher
- **Faster inference** in production
- **Learns from the teacher** (soft targets) in addition to the ground-truth labels

### For production:
- Export the student to ONNX: `torch.onnx.export(student, ...)`
- Apply quantization to reduce model size and speed up inference further
- Serve only the student model (discard the teacher)

### Next steps:
- Tune hyperparameters (learning rate, architecture, alpha)
- Try different embedding dimensions
- Train for more epochs if memory and time allow
- Implement early stopping based on validation loss
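
Early stopping on validation loss, the last item above, could be sketched as follows. `EarlyStopping`, `patience`, and `min_delta` are hypothetical names, not part of the notebook; the idea is to call `step(val_loss_total)` once per epoch and break out of the training loop when it returns `True`.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses, for illustration only
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, val_loss in enumerate([0.90, 0.85, 0.87, 0.86, 0.84], start=1):
    if stopper.step(val_loss):
        stopped_at = epoch
        break

print(f"Stopped at epoch {stopped_at}, best val loss {stopper.best:.2f}")
```

Combined with the existing best-checkpoint save, this would stop training once the validation loss plateaus while still keeping the best weights on disk.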