# Final Project - example solution

This notebook demonstrates the integration of TabPFN for two tasks:
1.  **Regression**: Predicting the target variable `y`.
2.  **Markov Blanket (MB) Discovery**: Identifying the optimal feature set using TabPFN embeddings.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import logging
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from dotenv import load_dotenv
from datasets import load_dataset
from pathlib import Path
import torch
from tqdm.notebook import tqdm

from blanket.plots import plot_graph
from blanket.metrics import rmse, jaccard_score

from tabpfn import TabPFNRegressor
from tabpfn_extensions.embedding import TabPFNEmbedding

# Sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

load_dotenv()
logging.basicConfig(level=logging.ERROR)

# Load data

In [3]:
develop = load_dataset("CSE472-blanket-challenge/final-dataset", 'develop', split='train')
submit = load_dataset("CSE472-blanket-challenge/final-dataset", 'submit', split='train')

In [4]:
print(len(develop), len(submit))

182 46


[🤗 Dataset](https://huggingface.co/datasets/CSE472-blanket-challenge/final-dataset)

Develop: 182 datasets

- X_train, y_train, X_test, y_test, metadata

Submit: 46 datasets

- X_train, y_train, X_test

Develop and Submit use the same script for data generation.

For generation details, refer to <https://huggingface.co/datasets/CSE472-blanket-challenge/final-dataset>

Your task:

1. Train a model using `develop` to predict `y` and `markov_blanket`
2. Test your model on `submit`


## TabPFN models

In [5]:
from tabpfn.model_loading import ModelSource
regressor_models = ModelSource.get_regressor_v2_5()

print("Available TabPFN Regressor Models:\n")
for model_name in regressor_models.filenames:
    print(f"  {model_name}")

Available TabPFN Regressor Models:

  tabpfn-v2.5-regressor-v2.5_default.ckpt
  tabpfn-v2.5-regressor-v2.5_low-skew.ckpt
  tabpfn-v2.5-regressor-v2.5_quantiles.ckpt
  tabpfn-v2.5-regressor-v2.5_real-variant.ckpt
  tabpfn-v2.5-regressor-v2.5_real.ckpt
  tabpfn-v2.5-regressor-v2.5_small-samples.ckpt
  tabpfn-v2.5-regressor-v2.5_variant.ckpt


TabPFN loads `tabpfn-v2.5-regressor-v2.5_default.ckpt` by default.

The `real` variants are fine-tuned on real-world datasets.

For details, see TabPFN's [technical report](https://storage.googleapis.com/prior-labs-tabpfn-public/reports/TabPFN_2_5_tech_report.pdf)

In [6]:
import os
from huggingface_hub import hf_hub_download

cache_dir = os.getenv("TABPFN_MODEL_CACHE_DIR", None)
TABPFN_MODEL_CACHE_DIR = Path(cache_dir) if cache_dir else None

model_path = hf_hub_download(repo_id=regressor_models.repo_id, filename="tabpfn-v2.5-regressor-v2.5_real.ckpt", local_dir=TABPFN_MODEL_CACHE_DIR)

### TabPFN toy example
Mode 1: in-context learning

Mode 2: fine-tuning

Fine-tuning does improve performance over in-context learning alone.

---

Tips

1. Use the MB results to help the regression task.
2. Instead of fine-tuning on each dataset separately, fine-tune on all datasets together to improve generalization.

## Predicting MB

In the following, we describe a simple approach to estimate MB using TabPFN.

A key characteristic of the MB is that it belongs to the dataset as a whole: it is the same for a dataset's training and testing sets.
Therefore, unlike standard target prediction, the unit of prediction is a dataset rather than a sample: we fit the MB predictor on the MB masks of the training datasets
and evaluate it on the MB masks of the testing datasets.

However, TabPFN natively supports only single-target prediction, so we need to adapt it for multi-output classification to predict every entry of the MB mask.

Another challenge is that different datasets have different MB dimensions.
A simple solution is to train separate models for each MB dimension.
In our case, there are only two MB sizes (9 and 19), so we train two independent models.

Alternatively, one could pad MB vectors to a uniform size and train a single shared model across datasets.
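
The padding alternative could look roughly like the sketch below. It is an illustration only and is not used in the rest of this notebook: `pad_masks`, `max_dim`, and the companion validity mask are hypothetical names introduced here.

```python
def pad_masks(masks, max_dim):
    """Pad each binary MB mask with zeros up to max_dim.

    Returns (padded, valid), where valid[i][k] is 1 if position k is a real
    feature of dataset i and 0 if it is padding.
    """
    padded, valid = [], []
    for mask in masks:
        if len(mask) > max_dim:
            raise ValueError("mask is longer than max_dim")
        n_pad = max_dim - len(mask)
        padded.append(list(mask) + [0] * n_pad)
        valid.append([1] * len(mask) + [0] * n_pad)
    return padded, valid


masks = [
    [1, 0, 1, 0, 0, 1, 0, 0, 0],  # a 9-feature task
    [0, 1] * 9 + [1],             # a 19-feature task
]
padded, valid = pad_masks(masks, max_dim=19)
```

With a shared model, the validity mask would typically be used to exclude padded positions from the loss and to truncate predictions back to each dataset's true size.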

---

Implementation details

For each MB size, we train a model that maps  
(n_dataset, n_estimators, n_samples, embedding_dim) -> (n_dataset, mb_dim)

To achieve this, we proceed as follows:

1. Compute embeddings of $X$ for each dataset, using `n_fold` folds to obtain more robust embeddings (https://arxiv.org/pdf/2502.17361).  
   The resulting embeddings have shape (n_estimators, n_samples, embedding_dim), with embedding_dim = 192.
2. Aggregate and reshape embeddings across estimators: 
   (n_dataset, n_estimators, n_samples, embedding_dim) -> (n_dataset * n_samples, embedding_dim).
3. Expand MB labels accordingly:  
   (n_dataset, mb_dim) -> (n_dataset * n_samples, mb_dim).
4. Train a multi-output model (e.g., MultiOutputClassifier) on these processed embeddings.
5. On the testing datasets, extract embeddings of $X$ and repeat step 2 for consistency.
6. Predict MB for all samples, obtaining outputs of shape (n_dataset * n_samples, mb_dim).
7. Aggregate predictions across samples within each dataset using a majority vote (or averaging with a 0.5 threshold) to recover dataset-level MB predictions:  
   (n_dataset * n_samples, mb_dim) -> (n_dataset, mb_dim).
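
The shape bookkeeping in steps 2, 3, and 7 can be checked with a small NumPy sketch. Random arrays stand in for the TabPFN embeddings and for the classifier outputs, so only the shapes are meaningful; the sizes chosen here are arbitrary, and averaging over estimators is one possible aggregation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dataset, n_estimators, n_samples, embedding_dim, mb_dim = 3, 8, 5, 192, 9

# Stand-ins: real embeddings come from the TabPFN embedding extractor.
emb = rng.normal(size=(n_dataset, n_estimators, n_samples, embedding_dim))
mb = rng.integers(0, 2, size=(n_dataset, mb_dim))

# Step 2: average over estimators, then stack the samples of all datasets.
X = emb.mean(axis=1).reshape(n_dataset * n_samples, embedding_dim)

# Step 3: repeat each dataset's MB mask once per sample (same row order as X).
Y = np.repeat(mb, n_samples, axis=0)

# Steps 6-7: fake per-sample probabilities, then vote within each dataset.
probs = np.clip(Y + rng.normal(scale=0.3, size=Y.shape), 0.0, 1.0)
mb_pred = (probs.reshape(n_dataset, n_samples, mb_dim).mean(axis=1) > 0.5).astype(int)

print(X.shape, Y.shape, mb_pred.shape)
```

Because `reshape` and `np.repeat` both keep datasets in contiguous blocks of `n_samples` rows, the rows of `X` and `Y` stay aligned, and the final reshape recovers one block per dataset for the vote.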



Tips

1. `MultiOutputClassifier` treats each target independently, which is not ideal because the MB features are correlated.
2. Use a neural network to better predict the MB from the embeddings (e.g. MLP, GNN, seq2seq).

## Submission

For each dataset (`data_id`) in the submission set, predict both `y` and `markov_blanket`, and save the results in `submission.csv` with the following format:

| data_id | y_pred | markov_blanket_pred |
|---------|--------|---------------------|
| int     | float  | list of int         |
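
List-valued columns need a text encoding inside a CSV file. The sketch below shows one way to write and read them back with the standard library, assuming JSON-encoded lists; the organizers' expected encoding is not stated here, and the IDs and values are made up for illustration. An in-memory buffer stands in for the real file.

```python
import csv
import io
import json

rows = [
    {"data_id": "data_a", "y_pred": [0.12, -0.5], "markov_blanket_pred": [1, 0, 1]},
    {"data_id": "data_b", "y_pred": [1.0, 2.0], "markov_blanket_pred": [0, 1, 1]},
]
columns = ["data_id", "y_pred", "markov_blanket_pred"]

buffer = io.StringIO()  # stands in for open("submission.csv", "w", newline="")
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
for row in rows:
    writer.writerow({
        "data_id": row["data_id"],
        "y_pred": json.dumps(row["y_pred"]),  # list -> "[0.12, -0.5]"
        "markov_blanket_pred": json.dumps(row["markov_blanket_pred"]),
    })

# Read the file back and decode the list columns.
buffer.seek(0)
parsed = [
    {
        "data_id": r["data_id"],
        "y_pred": json.loads(r["y_pred"]),
        "markov_blanket_pred": json.loads(r["markov_blanket_pred"]),
    }
    for r in csv.DictReader(buffer)
]
```

Reading the file back before submitting is a cheap check that every row still has the right number of predictions and mask entries.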

## Evaluation

For each dataset $j$ in `submit` with testing set $\mathcal{D}^{(j)}_{\text{test}}$ we compute:

- **RMSE** on regression:
$$ \text{RMSE}_j = \sqrt{\frac{1}{N_{\text{test}}} \sum_{(x, y) \in \mathcal{D}^{(j)}_{\text{test}}} (y - \hat{y}(x))^2 }. $$
- **Jaccard score** on MB masks:
$$ \text{Jaccard}_j = \frac{|\hat{m}^{(j)} \cap m^{(j)}_{\text{true}}|}{|\hat{m}^{(j)} \cup m^{(j)}_{\text{true}}|}. $$

Averaging over the $N$ tasks gives $\overline{\text{RMSE}}$ and $\overline{\text{Jaccard}}$. The final challenge **score** is the mean of the per-task products (lower is better):
$$ \text{Score} = \frac{1}{N} \sum_{j=1}^{N} \text{RMSE}_j \, (1 - \text{Jaccard}_j). $$

- On the **develop test**, you can compute this score yourself.
- On the **submission test**, the organizers run the same computation on a
  hidden set of tasks.
- Ground truth data will be released after the challenge ends.
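
As a worked example of the metrics, the sketch below computes the score for two tiny made-up tasks in plain Python. `rmse_example` and `jaccard_example` are stand-ins written for this illustration, not the `blanket.metrics` functions, and returning 1.0 for two empty masks is an assumed convention.

```python
import math


def rmse_example(y_true, y_pred):
    """Root mean squared error over paired values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def jaccard_example(mask_true, mask_pred):
    """Intersection over union of two binary masks."""
    inter = sum(1 for t, p in zip(mask_true, mask_pred) if t and p)
    union = sum(1 for t, p in zip(mask_true, mask_pred) if t or p)
    return inter / union if union else 1.0  # assumed convention for two empty masks


tasks = [
    # (y_true, y_pred, mb_true, mb_pred)
    ([0.0, 1.0], [0.0, 3.0], [1, 1, 0, 0], [1, 0, 0, 0]),
    ([1.0, 2.0], [2.0, 3.0], [1, 0, 1, 0], [1, 0, 1, 0]),
]

per_task = []
for y_true, y_pred, mb_true, mb_pred in tasks:
    r = rmse_example(y_true, y_pred)
    j = jaccard_example(mb_true, mb_pred)
    per_task.append((r, j, r * (1.0 - j)))

score = sum(s for _, _, s in per_task) / len(per_task)
print(per_task, score)
```

The second task recovers the MB exactly, so it contributes zero to the score even though its RMSE is 1. The score averages the per-task products, which generally differs from multiplying the averaged RMSE by one minus the averaged Jaccard.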


In [7]:
def evaluate_single_task(
    y_query_true: np.ndarray,
    y_query_pred: np.ndarray,
    mb_true: np.ndarray,
    mb_pred: np.ndarray,
) -> dict:
    rmse_val = rmse(y_query_true, y_query_pred)
    jaccard_val = jaccard_score(mb_true, mb_pred)
    score_val = rmse_val * (1.0 - jaccard_val)
    return {"rmse": rmse_val, "jaccard": jaccard_val, "score": score_val}


# Complete Solution: MB-First Pipeline

Strategy: Predict MB masks first using embeddings, then use filtered features for regression.

Architecture:
1. Extract TabPFN embeddings for each dataset
2. Train MLP classifiers to predict MB masks (separate models for 9-feat and 19-feat)
3. Use predicted MB to filter features for TabPFN regression
4. Generate predictions for submission

In [8]:
# Split by n_features
data_9 = [d for d in develop if d['n_features'] == 9]
data_19 = [d for d in develop if d['n_features'] == 19]

print(f"9-feature tasks: {len(data_9)}")
print(f"19-feature tasks: {len(data_19)}")

# Train/val split (80/20)
train_9, val_9 = data_9[:70], data_9[70:]
train_19, val_19 = data_19[:76], data_19[76:]

print(f"\nTrain/Val split:")
print(f"  9-feat:  train={len(train_9)}, val={len(val_9)}")
print(f"  19-feat: train={len(train_19)}, val={len(val_19)}")

9-feature tasks: 87
19-feature tasks: 95

Train/Val split:
  9-feat:  train=70, val=17
  19-feat: train=76, val=19
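
The split above takes the first 70 and 76 tasks in stored order. If that order is not random (an assumption; the notebook does not say), a seeded shuffle before cutting may give a more representative validation set. A minimal sketch, where `split_train_val` is a hypothetical helper and integer IDs stand in for the task records:

```python
import random


def split_train_val(items, train_fraction=0.8, seed=42):
    """Shuffle a copy of items with a fixed seed, then cut at train_fraction."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# 87 is the number of 9-feature tasks reported above.
train_ids, val_ids = split_train_val(range(87))
print(len(train_ids), len(val_ids))
```

Using a private `random.Random(seed)` instance keeps the split reproducible without touching the global random state.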


In [9]:
def extract_embeddings_for_dataset(dataset_list, n_fold=5, device='cuda'):
    """Extract TabPFN embeddings for a list of datasets."""
    regressor_config = {
        "ignore_pretraining_limits": True,
        "device": device,
        "n_estimators": 8,  # Reduced for speed
        "random_state": 42,
        "inference_precision": "auto"
    }
    
    regressor = TabPFNRegressor(model_path=model_path, **regressor_config)
    embedding_extractor = TabPFNEmbedding(tabpfn_reg=regressor, n_fold=n_fold)
    
    all_embeddings = []
    all_mb_masks = []
    
    print(f"Extracting embeddings for {len(dataset_list)} datasets...")
    for d in tqdm(dataset_list, desc="Embedding extraction"):
        X_train = np.asarray(d['X_train'])
        y_train = np.asarray(d['y_train'])
        X_test = np.asarray(d['X_test'])
        
        # Get embeddings for train data
        emb = embedding_extractor.get_embeddings(X_train, y_train, X_test, data_source="train")
        
        # Aggregate across estimators (mean)
        emb_agg = np.mean(emb, axis=0)  # (n_samples, embed_dim)
        
        all_embeddings.append(emb_agg)
        all_mb_masks.append(np.asarray(d['feature_mask']))
    
    return all_embeddings, all_mb_masks

# Extract for both 9-feat and 19-feat
print("\n=== Extracting embeddings for 9-feature datasets ===")
train_9_emb, train_9_mb = extract_embeddings_for_dataset(train_9, device='cuda')

print("\n=== Extracting embeddings for 19-feature datasets ===")
train_19_emb, train_19_mb = extract_embeddings_for_dataset(train_19, device='cuda')

print("\nEmbedding extraction complete!")

=== Extracting embeddings for 9-feature datasets ===
Extracting embeddings for 70 datasets...

=== Extracting embeddings for 19-feature datasets ===
Extracting embeddings for 76 datasets...

Embedding extraction complete!


In [10]:
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

class MBPredictor(nn.Module):
    def __init__(self, embed_dim=192, hidden_dims=(256, 128), n_features=9, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dims[0]),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dims[0], hidden_dims[1]),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dims[1], n_features),
            nn.Sigmoid()
        )
    
    def forward(self, x):
        return self.net(x)

def prepare_mb_training_data(embeddings_list, mb_masks_list, n_features):
    """Prepare training data for MB predictor.
    
    Args:
        embeddings_list: List of (n_samples, embed_dim) arrays
        mb_masks_list: List of (n_features,) binary masks
        n_features: Number of features (9 or 19)
    
    Returns:
        X: (n_datasets * n_samples, embed_dim) tensor
        y: (n_datasets * n_samples, n_features) tensor
    """
    X_all = []
    y_all = []
    
    for emb, mb_mask in zip(embeddings_list, mb_masks_list):
        n_samples = emb.shape[0]
        # Replicate MB mask for each sample in the dataset
        mb_replicated = np.tile(mb_mask, (n_samples, 1))
        
        X_all.append(emb)
        y_all.append(mb_replicated)
    
    X = torch.FloatTensor(np.vstack(X_all))
    y = torch.FloatTensor(np.vstack(y_all))
    
    return X, y

# Prepare training data
print("Preparing training data for MB predictors...")
X_train_9, y_train_9 = prepare_mb_training_data(train_9_emb, train_9_mb, 9)
X_train_19, y_train_19 = prepare_mb_training_data(train_19_emb, train_19_mb, 19)

print(f"9-feat training data: X={X_train_9.shape}, y={y_train_9.shape}")
print(f"19-feat training data: X={X_train_19.shape}, y={y_train_19.shape}")

Preparing training data for MB predictors...
9-feat training data: X=torch.Size([28000, 192]), y=torch.Size([28000, 9])
19-feat training data: X=torch.Size([30400, 192]), y=torch.Size([30400, 19])


In [11]:
def train_mb_predictor(X_train, y_train, n_features, epochs=100, batch_size=128, lr=0.001, device='cuda'):
    """Train MB predictor model."""
    model = MBPredictor(embed_dim=192, hidden_dims=[256, 128], n_features=n_features, dropout=0.3).to(device)
    
    # nn.BCELoss has no `pos_weight` argument, so positives and negatives are
    # weighted equally here. To reweight the positive class, switch to
    # nn.BCEWithLogitsLoss(pos_weight=...) and drop the final nn.Sigmoid.
    criterion = nn.BCELoss()  # Binary cross-entropy on sigmoid outputs
    optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
    
    # Create dataloader
    dataset = TensorDataset(X_train, y_train)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    
    model.train()

    for epoch in tqdm(range(epochs), desc=f"Training MB-{n_features}"):
        total_loss = 0.0
        for X_batch, y_batch in loader:
            X_batch = X_batch.to(device)
            y_batch = y_batch.to(device)

            optimizer.zero_grad()
            y_pred = model(X_batch)
            loss = criterion(y_pred, y_batch)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        # Report the mean training loss every 20 epochs
        avg_loss = total_loss / len(loader)
        if (epoch + 1) % 20 == 0:
            print(f"  Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}")
    
    return model

# Train models
print("\n=== Training 9-feature MB Predictor ===")
mb_model_9 = train_mb_predictor(X_train_9, y_train_9, n_features=9, epochs=100, device='cuda')

print("\n=== Training 19-feature MB Predictor ===")
mb_model_19 = train_mb_predictor(X_train_19, y_train_19, n_features=19, epochs=100, device='cuda')

print("\nTraining complete!")

=== Training 9-feature MB Predictor ===
  Epoch 20/100, Loss: 0.3970
  Epoch 40/100, Loss: 0.2830
  Epoch 60/100, Loss: 0.2154
  Epoch 80/100, Loss: 0.1660
  Epoch 100/100, Loss: 0.1428

=== Training 19-feature MB Predictor ===
  Epoch 20/100, Loss: 0.2853
  Epoch 40/100, Loss: 0.1664
  Epoch 60/100, Loss: 0.1191
  Epoch 80/100, Loss: 0.0988
  Epoch 100/100, Loss: 0.0867

Training complete!


In [12]:
def predict_task(task_data, mb_model_9, mb_model_19, device='cuda'):
    """Complete pipeline: predict MB and y for a single task.
    
    Args:
        task_data: Dict with 'X_train', 'y_train', 'X_test', optionally 'n_features'
        mb_model_9, mb_model_19: Trained MB predictor models
        device: 'cuda' or 'cpu'
    
    Returns:
        y_pred: (n_test,) array of predictions
        mb_pred: (n_features,) binary array of predicted MB mask
    """
    X_train = np.asarray(task_data['X_train'])
    y_train = np.asarray(task_data['y_train'])
    X_test = np.asarray(task_data['X_test'])
    
    # Infer n_features from X_train shape (submit tasks don't have this field)
    n_features = task_data.get('n_features', X_train.shape[1])
    
    # Step 1: Extract embeddings
    regressor_config = {
        "ignore_pretraining_limits": True,
        "device": device,
        "n_estimators": 8,
        "random_state": 42,
        "inference_precision": "auto"
    }
    
    regressor = TabPFNRegressor(model_path=model_path, **regressor_config)
    embedding_extractor = TabPFNEmbedding(tabpfn_reg=regressor, n_fold=5)
    
    # Embed the training rows (data_source="train"), as was done when fitting the MB predictors
    embeddings = embedding_extractor.get_embeddings(X_train, y_train, X_test, data_source="train")
    
    # Aggregate across estimators
    emb_agg = np.mean(embeddings, axis=0)  # (n_samples, 192)
    
    # Step 2: Predict MB mask
    mb_model = mb_model_9 if n_features == 9 else mb_model_19
    mb_model.eval()
    
    with torch.no_grad():
        X_emb = torch.FloatTensor(emb_agg).to(device)
        mb_probs = mb_model(X_emb)  # (n_samples, n_features)
        
        # Aggregate predictions across samples (majority vote via mean > 0.5)
        mb_pred = (mb_probs.mean(dim=0) > 0.5).int().cpu().numpy()
    
    # Ensure at least one feature is selected
    if mb_pred.sum() == 0:
        # Fallback: select top 3 features by probability
        top_k = min(3, n_features)
        top_indices = mb_probs.mean(dim=0).argsort(descending=True)[:top_k].cpu().numpy()
        mb_pred[top_indices] = 1
    
    # Step 3: Filter features and run regression
    X_train_filt = X_train[:, mb_pred == 1]
    X_test_filt = X_test[:, mb_pred == 1]
    
    # TabPFN regression
    regressor_final = TabPFNRegressor(
        model_path=model_path,
        device=device,
        n_estimators=24,
        ignore_pretraining_limits=True,
        random_state=42
    )
    
    regressor_final.fit(X_train_filt, y_train)
    y_pred = regressor_final.predict(X_test_filt)
    
    return y_pred, mb_pred

print("Prediction pipeline defined!")

Prediction pipeline defined!
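
The voting and fallback logic inside `predict_task` can be exercised in isolation. The sketch below mirrors that logic in NumPy on hand-made probabilities; `vote_with_fallback` is a hypothetical helper written for this illustration, not a function defined elsewhere in the notebook.

```python
import numpy as np


def vote_with_fallback(sample_probs, threshold=0.5, top_k=3):
    """Average per-sample probabilities and threshold them into a binary mask.

    If no feature passes the threshold, select the top_k most probable
    features so that the regression step always receives some columns.
    """
    mean_probs = np.asarray(sample_probs, dtype=float).mean(axis=0)
    mask = (mean_probs > threshold).astype(int)
    if mask.sum() == 0:
        k = min(top_k, mean_probs.shape[0])
        mask[np.argsort(-mean_probs)[:k]] = 1
    return mask


confident = vote_with_fallback([[0.9, 0.2], [0.7, 0.4]])
uncertain = vote_with_fallback([[0.1, 0.4, 0.3, 0.05]])
print(confident, uncertain)
```

In the second call no probability exceeds 0.5, so the three largest entries are selected instead. Whether three is a sensible default depends on the typical MB size and is a tunable choice.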


In [13]:
# Validate on the validation splits (first 5 tasks of each size, for speed)
val_results = []

for n_feat, val_tasks in [(9, val_9), (19, val_19)]:
    print(f"\n=== Validating on {n_feat}-feature tasks ===")
    for task in tqdm(val_tasks[:5], desc=f"Val {n_feat}-feat"):
        y_pred, mb_pred = predict_task(task, mb_model_9, mb_model_19, device='cuda')

        # Reuse the metric helper defined in the Evaluation section
        metrics = evaluate_single_task(
            y_query_true=np.asarray(task['y_test']),
            y_query_pred=y_pred,
            mb_true=np.asarray(task['feature_mask']),
            mb_pred=mb_pred,
        )
        val_results.append({'n_features': n_feat, **metrics})

# Compute average metrics
val_df = pd.DataFrame(val_results)
print("\n" + "="*60)
print("VALIDATION RESULTS")
print("="*60)
print(f"Average RMSE: {val_df['rmse'].mean():.4f}")
print(f"Average Jaccard: {val_df['jaccard'].mean():.4f}")
print(f"Average Score: {val_df['score'].mean():.4f} (lower is better)")
print("="*60)

=== Validating on 9-feature tasks ===

=== Validating on 19-feature tasks ===

VALIDATION RESULTS
Average RMSE: 0.5997
Average Jaccard: 0.7583
Average Score: 0.1822 (lower is better)


In [14]:
# Generate submission
submission_results = []

print("\n=== Generating Predictions for Submission ===")
print(f"Total submit tasks: {len(submit)}")

for task in tqdm(submit, desc="Submission"):
    data_id = task['data_id']
    
    # Run prediction
    y_pred, mb_pred = predict_task(task, mb_model_9, mb_model_19, device='cuda')
    
    submission_results.append({
        'data_id': data_id,
        'y_pred': y_pred.tolist(),  # Convert to list for CSV
        'markov_blanket_pred': mb_pred.tolist()
    })

# Create submission dataframe
submission_df = pd.DataFrame(submission_results)

# Save to CSV
submission_path = '/home/mtopiwal/CSE472-blanket-challenge/submission.csv'
submission_df.to_csv(submission_path, index=False)

print(f"\n✓ Submission saved to: {submission_path}")
print(f"✓ Total tasks: {len(submission_df)}")
print("\nFirst 3 submissions:")
print(submission_df.head(3))

=== Generating Predictions for Submission ===
Total submit tasks: 46

✓ Submission saved to: /home/mtopiwal/CSE472-blanket-challenge/submission.csv
✓ Total tasks: 46

First 3 submissions:
         data_id                                             y_pred  \
0  data_199180bb  [-0.913066565990448, -0.0921970009803772, 0.11...   
1  data_36d2b833  [0.7441467046737671, 0.978523850440979, 0.8209...   
2  data_e84cde01  [-0.6426916122436523, -0.26027601957321167, -1...   

                                 markov_blanket_pred  
0  [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, ...  
1                        [1, 1, 1, 0, 0, 1, 0, 1, 0]  
2                        [1, 1, 1, 0, 0, 1, 0, 0, 0]  


## Step 1: Data Preparation

Split develop set by n_features and create train/val splits.

## Step 2: Extract TabPFN Embeddings

Extract embeddings from training data for MB prediction.

## Step 3: Define MLP MB Predictor

Multi-layer perceptron for predicting binary MB masks from embeddings.

## Step 4: Train MB Predictors

Train separate MLP models for 9-feature and 19-feature tasks.

## Step 5: Complete Prediction Pipeline

Predict MB → Filter features → Regress with TabPFN.

## Step 6: Validation on Develop Set

Evaluate performance on validation split to estimate final score.

## Step 7: Generate Submission

Run prediction pipeline on all 46 submit tasks and create submission.csv.