# TabTransformer++ for Residual Learning

This notebook demonstrates a **residual learning approach** using a TabTransformer architecture. The key idea is:

1. Train a simple "base" model (Ridge Regression) to make initial predictions
2. Train a TabTransformer to predict the **residuals** (errors) of the base model
3. Combine: `Final Prediction = Base Prediction + Predicted Residual`

Because the TabTransformer only has to correct the base model's errors, the combination often outperforms either model on its own.
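The three steps above can be sketched on toy data. This is a minimal illustration, not the notebook's pipeline: the synthetic data is made up, a small `DecisionTreeRegressor` stands in for the TabTransformer, and everything is fitted and scored in-sample for brevity (the notebook itself uses out-of-fold predictions).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(500, 2))
y = X[:, 0] + np.sin(3.0 * X[:, 1])      # linear part + nonlinear part

# Step 1: the base model captures the linear trend.
base = Ridge(alpha=1.0).fit(X, y)
base_pred = base.predict(X)

# Step 2: a second model is trained on what the base model got wrong.
# (Assumption: a shallow tree stands in for the TabTransformer here.)
residual = y - base_pred
corrector = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, residual)

# Step 3: final prediction = base prediction + predicted residual.
final_pred = base_pred + corrector.predict(X)

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

print(f"base RMSE:  {rmse(y, base_pred):.3f}")
print(f"final RMSE: {rmse(y, final_pred):.3f}")
```

The base model cannot represent the sine term, so that structure is left in the residual for the second model to pick up.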

## Key Components
- **Quantile Binning**: Converts continuous features into discrete tokens
- **Gated Fusion**: Learns to balance binned tokens with raw scalar values
- **EMA (Exponential Moving Average)**: Polyak averaging for more stable predictions
- **Isotonic Calibration**: Monotonic post-processing that maps the network's z-scored outputs back to the residual scale

## 1. Setup & Configuration

Import required libraries and define hyperparameters:

- **Feature Engineering**: Number of bins for quantile discretization
- **Model Architecture**: Embedding dimensions, attention heads, transformer layers
- **Training**: Learning rate, batch size, EMA decay for Polyak averaging

In [3]:
import os
import gc
import time
import warnings
import numpy as np
import pandas as pd
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import KFold, train_test_split
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import root_mean_squared_error

# Suppress warnings
warnings.filterwarnings("ignore")
pd.set_option("mode.copy_on_write", True)

# ================= Configuration =================
class Config:
    SEED            = 2025
    
    # --- Feature Engineering ---
    NBINS           = 32        # Quantile bins for raw numeric features
    NBINS_BASE      = 128       # Finer bins for Base model prediction
    NBINS_DT        = 64        # Bins for the second-model ("DT") prediction; a Random Forest in this demo
    
    # --- Model Architecture ---
    EMB_DIM         = 64
    N_HEADS         = 4
    N_LAYERS        = 3
    MLP_HID         = 192
    DROPOUT         = 0.1
    EMB_DROPOUT     = 0.05
    TOKENDROP_P     = 0.12      # Probability of zeroing each feature token during training
    
    # --- Training ---
    EPOCHS          = 10        # Shortened for demo
    BATCH_SIZE      = 1024
    LR              = 2e-3
    WEIGHT_DECAY    = 1e-5
    EMA_DECAY       = 0.995     # Polyak Averaging
    DEVICE          = "cuda" if torch.cuda.is_available() else "cpu"

def seed_everything(seed):
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)

seed_everything(Config.SEED)
print(f"Running on {Config.DEVICE}")

Running on cpu


## 2. Data Simulation: Building the "Stack"

This section simulates a real-world stacking scenario:

1. **Load Data**: California Housing dataset (predicting median house values)
2. **Train Base Models** using K-Fold cross-validation:
   - `Ridge` regression → generates `base_pred` (our primary predictions)
   - `RandomForest` → generates `dt_pred` (provides additional signal)
3. **Calculate Residuals**: `residual = target - base_pred`
   - This is what the TabTransformer will learn to predict

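Every training row's `base_pred` and `dt_pred` come from a model that never saw that row (out-of-fold, OOF), so the residuals the TabTransformer trains on are not contaminated by target leakage.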
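As a minimal sketch of the OOF idea, assuming synthetic data and a single `Ridge` model, the loop below fills one prediction slot per training row and counts how often each row is predicted. The array names `oof` and `times_predicted` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

oof = np.full(len(y), np.nan)                  # one slot per training row
times_predicted = np.zeros(len(y), dtype=int)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for tr_idx, val_idx in kf.split(X):
    model = Ridge(alpha=1.0).fit(X[tr_idx], y[tr_idx])   # never sees val_idx
    oof[val_idx] = model.predict(X[val_idx])
    times_predicted[val_idx] += 1

residual = y - oof                             # leak-free target for the second stage
```

Each row lands in exactly one validation fold, so `oof` ends up fully populated with predictions from models that were not trained on the row they score.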

In [4]:
def get_simulated_data():
    """
    1. Loads California Housing.
    2. Trains 'Base' (Ridge) and 'DT' (Random Forest) models to generate OOF predictions.
    3. Calculates the 'Residual' (Target - Base).
    """
    print("\n--- 1. Simulating Base & DT Models (The 'Stack') ---")
    data = fetch_california_housing(as_frame=True)
    df = data.frame
    target_col = "MedHouseVal"
    
    # Split off a holdout test set (stands in for a competition's private leaderboard)
    train_df, test_df = train_test_split(df, test_size=0.2, random_state=Config.SEED)
    train_df = train_df.reset_index(drop=True)
    test_df = test_df.reset_index(drop=True)
    
    # Placeholders
    train_df["base_pred"] = 0.0
    train_df["dt_pred"] = 0.0
    train_df["fold"] = -1
    
    # Create Folds for OOF generation
    kf = KFold(n_splits=5, shuffle=True, random_state=Config.SEED)
    
    # Models to simulate your stack
    model_base = Ridge(alpha=1.0)
    model_dt = RandomForestRegressor(n_estimators=20, max_depth=8, n_jobs=-1, random_state=Config.SEED)
    
    print("Generating OOF predictions...")
    for fold, (tr_idx, val_idx) in enumerate(kf.split(train_df)):
        X_tr = train_df.loc[tr_idx].drop(columns=[target_col, "base_pred", "dt_pred", "fold"])
        y_tr = train_df.loc[tr_idx, target_col]
        X_val = train_df.loc[val_idx].drop(columns=[target_col, "base_pred", "dt_pred", "fold"])
        
        # Fit & Predict Base
        model_base.fit(X_tr, y_tr)
        train_df.loc[val_idx, "base_pred"] = model_base.predict(X_val)
        
        # Fit & Predict DT
        model_dt.fit(X_tr, y_tr)
        train_df.loc[val_idx, "dt_pred"] = model_dt.predict(X_val)
        
        train_df.loc[val_idx, "fold"] = fold

    # Generate Test Preds (Trained on full train)
    print("Generating Test predictions...")
    X_full = train_df.drop(columns=[target_col, "base_pred", "dt_pred", "fold"])
    y_full = train_df[target_col]
    X_test = test_df.drop(columns=[target_col])  # same feature columns as X_full
    
    # Refit both models on the full training set before predicting the test set
    model_base.fit(X_full, y_full)
    test_df["base_pred"] = model_base.predict(X_test)
    
    model_dt.fit(X_full, y_full)
    test_df["dt_pred"] = model_dt.predict(X_test)
    
    # Calculate Residuals (Target - Base)
    # The Transformer will try to predict THIS column
    train_df["residual"] = train_df[target_col] - train_df["base_pred"]
    
    # Identify feature columns
    features = [c for c in train_df.columns if c not in [target_col, "base_pred", "dt_pred", "fold", "residual"]]
    
    base_rmse = root_mean_squared_error(train_df[target_col], train_df['base_pred'])
    print(f"Base Model RMSE (Train OOF): {base_rmse:.4f}")
    
    return train_df, test_df, features

# Run simulation
train_df, test_df, features = get_simulated_data()


--- 1. Simulating Base & DT Models (The 'Stack') ---
Generating OOF predictions...
Generating Test predictions...
Base Model RMSE (Train OOF): 0.8094


## 3. Tabular Tokenizer

The `TabularTokenizer` prepares data for the transformer:

### Quantile Binning (Discretization)
- Converts continuous features into discrete "tokens" (like words in NLP)
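- Uses quantile-based edges so each bin holds roughly the same number of samples; duplicate edges are merged, so heavily tied features can end up with fewer bins than requested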
- Different bin counts for features (32), base predictions (128), and tree predictions (64)

### Z-Score Normalization
- Standardizes raw values: `(x - mean) / std`
- Preserves the original numeric information alongside tokens

This dual representation (tokens + scalars) gives the model both discrete patterns and continuous precision.
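The dual representation can be worked through on a single skewed feature. This is a small NumPy sketch under assumed toy data (an exponential sample and 4 bins), not the `TabularTokenizer` itself; `to_tokens` is a hypothetical helper that mirrors the `searchsorted` plus `clip` logic.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.exponential(scale=2.0, size=1000)   # skewed toy feature
nbins = 4

# Edges at the 0%, 25%, 50%, 75%, and 100% quantiles of the training data.
edges = np.unique(np.quantile(x_train, np.linspace(0.0, 1.0, nbins + 1)))

def to_tokens(x, edges):
    # Index of the bin each value falls into, clipped so that values outside
    # the training range land in the first or last bin.
    idx = np.searchsorted(edges, x, side="right") - 1
    return np.clip(idx, 0, len(edges) - 2)

tokens = to_tokens(x_train, edges)
counts = np.bincount(tokens, minlength=nbins)     # samples per bin

# Z-score of the same values: the continuous half of the representation.
mu, sd = x_train.mean(), x_train.std() + 1e-8
z = (x_train - mu) / sd

x_new = np.array([-5.0, 1e6])                     # far outside the training range
new_tokens = to_tokens(x_new, edges)

print(counts)
print(new_tokens)
```

The token says which quarter of the distribution a value falls in, while the z-score keeps its exact position; out-of-range values still get a valid token, but their z-scores can be arbitrarily large.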

In [5]:
class TabularTokenizer:
    """
    Handles Quantile Binning and Z-Scoring.
    Fits on 'train' subset, transforms 'val'/'test'.
    """
    def __init__(self, cols):
        self.cols = cols
        self.edges = {}
        self.stats = {} # (mean, std)
        
    def _make_edges(self, x, nbins):
        x = x[np.isfinite(x)]
        if len(x) == 0: return np.array([0.0, 1.0])
        qs = np.linspace(0.0, 1.0, nbins+1)
        edges = np.unique(np.quantile(x, qs))
        if len(edges) < 2: edges = np.array([x.min(), x.max()+1e-6])
        return edges

    def fit(self, df):
        # 1. Numeric Features
        for c in self.cols:
            self.edges[c] = self._make_edges(df[c].values, Config.NBINS)
            self.stats[c]  = (df[c].mean(), df[c].std() + 1e-8)
            
        # 2. Special Tokens (Base & DT)
        self.edges["_base_"] = self._make_edges(df["base_pred"].values, Config.NBINS_BASE)
        self.stats["_base_"] = (df["base_pred"].mean(), df["base_pred"].std() + 1e-8)
        
        self.edges["_dt_"] = self._make_edges(df["dt_pred"].values, Config.NBINS_DT)
        self.stats["_dt_"] = (df["dt_pred"].mean(), df["dt_pred"].std() + 1e-8)
        
        # 3. Target Stats (for Residual scaling)
        self.stats["_target_"] = (df["residual"].mean(), df["residual"].std() + 1e-8)

    def transform(self, df):
        # Returns: Tokens (Int), Values (Float)
        N = len(df)
        T = len(self.cols) + 2 # cols + base + dt
        
        toks = np.zeros((N, T), dtype=np.int64)
        vals = np.zeros((N, T), dtype=np.float32)
        
        def _proc(col_name, edge_key, stat_key, out_idx):
            v = df[col_name].values
            # Digitize (Binning)
            idx = np.searchsorted(self.edges[edge_key], v, side="right") - 1
            toks[:, out_idx] = np.clip(idx, 0, len(self.edges[edge_key]) - 2)
            # Standardize (Z-Score)
            mu, sd = self.stats[stat_key]
            vals[:, out_idx] = (v - mu) / sd

        # Features
        for i, c in enumerate(self.cols):
            _proc(c, c, c, i)
            
        # Base & DT
        _proc("base_pred", "_base_", "_base_", T-2)
        _proc("dt_pred",   "_dt_",   "_dt_",   T-1)
        
        return toks, vals
    
    def get_vocab_sizes(self):
        s = [len(self.edges[c])-1 for c in self.cols]
        s.append(len(self.edges["_base_"])-1)
        s.append(len(self.edges["_dt_"])-1)
        return s

## 4. Model Architecture

### TabTransformerGated

The model combines the following components:

| Component | Purpose |
|-----------|---------|
| **Token Embeddings** | Learns representations for each quantile bin |
| **Value MLPs** | Projects raw scalar values to embedding space |
| **Learnable Gates** | Sigmoid gates that blend tokens + scalars per feature |
| **CLS Token** | Special token that aggregates information for prediction |
| **TokenDrop** | Regularization that randomly masks features during training |
| **Transformer Encoder** | Self-attention layers that model feature interactions |
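To make the TokenDrop row concrete, here is a NumPy sketch of the masking step, assuming an input of shape `[B, 1+T, D]` with the CLS token at position 0. The function name `token_drop` is illustrative; the notebook implements the same idea as an `nn.Module`.

```python
import numpy as np

def token_drop(x, p, rng):
    """Zero whole token vectors with probability p; x has shape [B, 1+T, D]."""
    keep = (rng.random((x.shape[0], x.shape[1], 1)) > p).astype(x.dtype)
    keep[:, 0, :] = 1.0        # position 0 is the CLS token and is never dropped
    return x * keep

rng = np.random.default_rng(0)
x = np.ones((4, 1 + 10, 8), dtype=np.float32)   # 4 rows, CLS + 10 features, dim 8
out = token_drop(x, p=0.12, rng=rng)

dropped = out.sum(axis=-1) == 0                 # [B, 1+T], True where a token was zeroed
print(f"dropped {int(dropped.sum())} of {x.shape[0] * 10} feature tokens")
```

Unlike standard dropout, surviving tokens are not rescaled by `1 / (1 - p)`; the mask removes an entire feature's embedding rather than individual activations.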

### Gated Fusion Formula
```
embedding[i] = token_emb[i] + σ(gate[i]) × value_emb[i]
```
The model learns how much to rely on discrete vs. continuous representations for each feature. Each gate is a single learnable scalar initialized to 0, so `σ(gate)` starts at 0.5.
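The formula can be checked by hand with a few NumPy arrays. In this sketch the random `token_table` and `value_emb` are stand-ins for the learned `nn.Embedding` weights and the value MLP output, and `fuse` is a hypothetical helper for one feature of one row.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n_bins, emb_dim = 32, 4

token_table = rng.normal(size=(n_bins, emb_dim))   # stand-in for nn.Embedding weights
value_emb = rng.normal(size=(emb_dim,))            # stand-in for the value MLP output
token_id = 7                                       # quantile bin of this feature value

def fuse(gate):
    # embedding = token_emb + sigmoid(gate) * value_emb
    return token_table[token_id] + sigmoid(gate) * value_emb

at_init = fuse(0.0)     # sigmoid(0) = 0.5: half of the value embedding is added
closed = fuse(-20.0)    # sigmoid ~ 0: almost pure token embedding
opened = fuse(20.0)     # sigmoid ~ 1: token embedding plus the full value embedding
```

Only the scalar path is gated. The token embedding always contributes in full, so the gate controls how much continuous detail is added on top of the bin identity rather than switching between the two.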

In [6]:
class TokenDrop(nn.Module):
    """Masks tokens during training for regularization."""
    def __init__(self, p=0.1):
        super().__init__()
        self.p = p
    def forward(self, x):
        # x: [B, 1+T, D]
        if not self.training or self.p <= 0: return x
        mask = (torch.rand(x.shape[0], x.shape[1], 1, device=x.device) > self.p).float()
        mask[:, 0, :] = 1.0 # Never drop CLS
        return x * mask

class PerTokenValMLP(nn.Module):
    """Small MLP to project scalar value to embedding space."""
    def __init__(self, emb_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, emb_dim),
            nn.GELU(),
            nn.Linear(emb_dim, emb_dim),
            nn.LayerNorm(emb_dim)
        )
    def forward(self, x): return self.net(x)

class TabTransformerGated(nn.Module):
    def __init__(self, vocab_sizes):
        super().__init__()
        self.num_tokens = len(vocab_sizes)
        
        # 1. Embeddings (for Bins)
        self.embs = nn.ModuleList([nn.Embedding(v+1, Config.EMB_DIM) for v in vocab_sizes])
        
        # 2. Value MLPs (for Scalars)
        self.val_mlps = nn.ModuleList([PerTokenValMLP(Config.EMB_DIM) for _ in vocab_sizes])
        
        # 3. Learnable Gates
        self.gates = nn.ParameterList([nn.Parameter(torch.zeros(1)) for _ in vocab_sizes])
        self.sigmoid = nn.Sigmoid()
        
        # 4. Transformer Backbone
        self.cls_token = nn.Parameter(torch.zeros(1, 1, Config.EMB_DIM))
        self.emb_dropout = nn.Dropout(Config.EMB_DROPOUT)
        self.tokendrop = TokenDrop(Config.TOKENDROP_P)
        
        enc_layer = nn.TransformerEncoderLayer(
            d_model=Config.EMB_DIM, nhead=Config.N_HEADS, dim_feedforward=Config.EMB_DIM*4,
            dropout=Config.DROPOUT, batch_first=True, norm_first=True, activation="gelu"
        )
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=Config.N_LAYERS)
        
        # 5. Head
        self.head = nn.Sequential(
            nn.LayerNorm(Config.EMB_DIM),
            nn.Linear(Config.EMB_DIM, Config.MLP_HID), nn.GELU(), nn.Dropout(Config.DROPOUT),
            nn.Linear(Config.MLP_HID, 1)
        )
        
    def forward(self, x_tok, x_val):
        B = x_tok.shape[0]
        
        # Gated Fusion Loop
        emb_list = []
        for i in range(self.num_tokens):
            tok_e = self.embs[i](x_tok[:, i])
            val_e = self.val_mlps[i](x_val[:, i:i+1])
            g = self.sigmoid(self.gates[i])
            emb_list.append(tok_e + g * val_e)
            
        x = torch.stack(emb_list, dim=1) # [B, T, D]
        x = self.emb_dropout(x)
        
        # Prepend CLS
        cls = self.cls_token.expand(B, 1, -1)
        x = torch.cat([cls, x], dim=1) # [B, 1+T, D]
        
        # Encoder
        x = self.tokendrop(x)
        x = self.encoder(x)
        
        return self.head(x[:, 0, :]).squeeze(-1)

class TTDataset(Dataset):
    def __init__(self, toks, vals, y=None):
        self.toks = torch.as_tensor(toks, dtype=torch.long)
        self.vals = torch.as_tensor(vals, dtype=torch.float32)
        self.y = torch.as_tensor(y, dtype=torch.float32) if y is not None else None
    def __len__(self): return len(self.toks)
    def __getitem__(self, i): 
        return (self.toks[i], self.vals[i]), (self.y[i] if self.y is not None else 0.0)

## 5. Training Loop

### Cross-Validation Strategy
For each of the 5 folds:

1. **Leak-Free Tokenization**: Fit tokenizer only on training data
2. **Z-Score Targets**: Normalize residuals for stable training
3. **Train with EMA**: 
   - Main model learns via gradient descent
   - EMA model maintains exponential moving average of weights (Polyak averaging)
   - The EMA weights often generalize better than the final raw weights, so they are the ones used for prediction
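The EMA rule is a one-liner, and a toy run shows how slowly a decay of 0.995 forgets the starting point. The sketch below tracks a single scalar "weight". The step count of 130 is an assumption: roughly 13 batches per epoch for 10 epochs, which is what a 20,640-row dataset gives after the 80/20 split, 5-fold CV, and a batch size of 1024.

```python
def ema_update(ema_w, w, decay):
    """One Polyak/EMA step: ema <- decay * ema + (1 - decay) * w."""
    return decay * ema_w + (1.0 - decay) * w

decay = 0.995
w_init, w_trained = 0.0, 1.0   # toy scalar weight that sits at 1.0 after initialization

ema = w_init
steps = 130                    # assumed: about 13 batches per epoch x 10 epochs
for _ in range(steps):
    ema = ema_update(ema, w_trained, decay)

init_share = decay ** steps    # fraction of the EMA still owed to the initial weights
print(f"EMA after {steps} steps: {ema:.3f}")
print(f"share of initial weights remaining: {init_share:.3f}")
```

Under these assumptions about half of the EMA model is still the random initialization when training stops. On a short schedule it may be worth trying a lower decay, a decay warm-up, or more epochs, and comparing against the raw weights.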

### Isotonic Calibration
After training, we calibrate predictions using **Isotonic Regression**:
- Maps the model's z-scored outputs back to actual residual values
- Monotonic transformation that can correct systematic biases
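- Fitted on each fold's validation predictions, then applied to that fold's test predictions. Because the calibrator is scored on the same rows it was fitted on, the per-fold and CV RMSE values are optimistic; the holdout RMSE is the cleaner estimate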
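A small example shows what the calibrator does to scores it has and has not seen. The data is synthetic, and the linear relationship between scores and residuals is an assumption made only so that there is something to fit.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Pretend network outputs in z-space and the residuals they should map to.
preds_val = rng.normal(size=300)
y_val = 0.6 * preds_val + 0.1 + rng.normal(scale=0.2, size=300)

iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(preds_val, y_val)

# New scores, including values far outside the range seen during fitting.
preds_test = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
calibrated = iso.predict(preds_test)

print(calibrated)
```

The fitted map is a non-decreasing step function, so it can rescale and shift predictions but never reorder them. With `out_of_bounds="clip"`, scores beyond the fitted range receive the calibrated value at the nearest boundary instead of being extrapolated.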

### Final Prediction
```
final_prediction = base_pred + calibrated_residual
```

In [7]:
# Storage for results
oof_preds = np.zeros(len(train_df))
test_preds_accum = np.zeros(len(test_df))

folds = sorted(train_df["fold"].unique())
print(f"\n--- 2. Training Residual TabTransformer ({len(folds)} folds) ---")

for k in folds:
    # A. Split & Leak-Free Tokenization
    tr_mask = train_df["fold"] != k
    va_mask = train_df["fold"] == k
    
    tokenizer = TabularTokenizer(features)
    tokenizer.fit(train_df[tr_mask]) # Fit on TRAIN ONLY
    
    # Transform
    X_tr_tok, X_tr_val = tokenizer.transform(train_df[tr_mask])
    X_va_tok, X_va_val = tokenizer.transform(train_df[va_mask])
    X_te_tok, X_te_val = tokenizer.transform(test_df)
    
    # Targets (Z-Scored)
    y_mu, y_std = tokenizer.stats["_target_"]
    y_tr = (train_df.loc[tr_mask, "residual"].values - y_mu) / y_std
    y_va_raw = train_df.loc[va_mask, "residual"].values 
    
    # B. Dataloaders
    dl_tr = DataLoader(TTDataset(X_tr_tok, X_tr_val, y_tr), batch_size=Config.BATCH_SIZE, shuffle=True)
    dl_va = DataLoader(TTDataset(X_va_tok, X_va_val), batch_size=Config.BATCH_SIZE, shuffle=False)
    dl_te = DataLoader(TTDataset(X_te_tok, X_te_val), batch_size=Config.BATCH_SIZE, shuffle=False)
    
    # C. Init Models (Main + EMA)
    model = TabTransformerGated(tokenizer.get_vocab_sizes()).to(Config.DEVICE)
    ema_model = TabTransformerGated(tokenizer.get_vocab_sizes()).to(Config.DEVICE)
    ema_model.load_state_dict(model.state_dict())
    
    opt = torch.optim.AdamW(model.parameters(), lr=Config.LR, weight_decay=Config.WEIGHT_DECAY)
    loss_fn = nn.SmoothL1Loss(beta=1.0)
    
    # D. Train Loop
    for epoch in range(Config.EPOCHS):
        model.train()
        for (xt, xv), y in dl_tr:
            xt, xv, y = xt.to(Config.DEVICE), xv.to(Config.DEVICE), y.to(Config.DEVICE)
            opt.zero_grad()
            pred = model(xt, xv)
            loss = loss_fn(pred, y)
            loss.backward()
            opt.step()
            
            # Update EMA weights in place (no gradients are tracked here)
            with torch.no_grad():
                for p, ema_p in zip(model.parameters(), ema_model.parameters()):
                    ema_p.mul_(Config.EMA_DECAY).add_(p, alpha=1 - Config.EMA_DECAY)
    
    # E. Evaluation & Isotonic Calibration
    ema_model.eval()
    
    # 1. Predict Validation (Z-space)
    preds_z = []
    with torch.no_grad():
        for (xt, xv), _ in dl_va:
            preds_z.append(ema_model(xt.to(Config.DEVICE), xv.to(Config.DEVICE)).cpu().numpy())
    preds_z = np.concatenate(preds_z)
    
    # 2. Calibrate: Map Z-score Preds -> Real Residuals
    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(preds_z, y_va_raw) 
    calib_preds = iso.predict(preds_z)
    
    oof_preds[va_mask] = calib_preds
    rmse = root_mean_squared_error(y_va_raw, calib_preds)
    print(f"Fold {k} | Residual RMSE: {rmse:.4f}")
    
    # 3. Predict Test (Apply fold's calibration)
    preds_te_z = []
    with torch.no_grad():
        for (xt, xv), _ in dl_te:
            preds_te_z.append(ema_model(xt.to(Config.DEVICE), xv.to(Config.DEVICE)).cpu().numpy())
    preds_te_z = np.concatenate(preds_te_z)
    test_preds_accum += iso.predict(preds_te_z) / len(folds)
    
    del model, ema_model, opt, dl_tr
    if Config.DEVICE == "cuda": torch.cuda.empty_cache()

# ================= Results =================
# Final Prediction = Base Prediction + Predicted Residual
final_oof = train_df["base_pred"] + oof_preds
final_test = test_df["base_pred"] + test_preds_accum

base_cv = root_mean_squared_error(train_df["MedHouseVal"], train_df["base_pred"])
tt_cv   = root_mean_squared_error(train_df["MedHouseVal"], final_oof)

base_test = root_mean_squared_error(test_df["MedHouseVal"], test_df["base_pred"])
tt_test   = root_mean_squared_error(test_df["MedHouseVal"], final_test)

print("\n" + "="*45)
print("FINAL RESULTS SUMMARY")
print("="*45)
print("TRAIN (CV) RMSE:")
print(f"  Base Model Only:      {base_cv:.5f}")
print(f"  Base + TT Residual:   {tt_cv:.5f}")
print("-" * 20)
print("TEST (Holdout) RMSE:")
print(f"  Base Model Only:      {base_test:.5f}")
print(f"  Base + TT Residual:   {tt_test:.5f}")
print("="*45)


--- 2. Training Residual TabTransformer (5 folds) ---
Fold 0 | Residual RMSE: 0.6073
Fold 1 | Residual RMSE: 0.9915
Fold 2 | Residual RMSE: 0.6077
Fold 3 | Residual RMSE: 0.6098
Fold 4 | Residual RMSE: 0.6089

FINAL RESULTS SUMMARY
TRAIN (CV) RMSE:
  Base Model Only:      0.80939
  Base + TT Residual:   0.70200
--------------------
TEST (Holdout) RMSE:
  Base Model Only:      0.73611
  Base + TT Residual:   0.59240
