# Final Project - example solution

This notebook demonstrates the integration of TabPFN for two tasks:
1.  **Regression**: Predicting the target variable `y`.
2.  **Markov Blanket (MB) Discovery**: Identifying the optimal feature set using TabPFN embeddings.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import logging
import numpy as np
import matplotlib.pyplot as plt
from dotenv import load_dotenv
from datasets import load_dataset
from pathlib import Path
import torch

from blanket.plots import plot_graph
from blanket.metrics import rmse, jaccard_score

from tabpfn import TabPFNRegressor
from tabpfn_extensions.embedding import TabPFNEmbedding

# Sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

load_dotenv()
logging.basicConfig(level=logging.ERROR)

# Load data

In [None]:
develop = load_dataset("CSE472-blanket-challenge/final-dataset", 'develop', split='train')
submit = load_dataset("CSE472-blanket-challenge/final-dataset", 'submit', split='train')

In [None]:
print(len(develop), len(submit))

[🤗 Dataset](https://huggingface.co/datasets/CSE472-blanket-challenge/final-dataset)

Develop: 182 datasets

- X_train, y_train, X_test, y_test, metadata

Submit: 46 datasets

- X_train, y_train, X_test

Develop and Submit use the same script for data generation.

For generation details, refer to <https://huggingface.co/datasets/CSE472-blanket-challenge/final-dataset>

Your task:

1. Train a model using `develop` to predict `y` and `markov_blanket`
2. Test your model on `submit`


In [None]:
# Select Test Case
example_data = develop[89]
X_train = np.asarray(example_data['X_train'])
y_train = np.asarray(example_data['y_train'])
X_test = np.asarray(example_data['X_test'])
y_test = np.asarray(example_data['y_test'])

print(f"Example data id: {example_data['data_id']}")
print(f"Train shape: {X_train.shape}")
print(f"Test shape: {X_test.shape}")
print(f"Env: {example_data['environment']}")
print(f"SCM: {example_data['scm']}")

In [None]:
# Visualize Causal Graph
plot_graph(example_data['adjacency_matrix'], title=f"Causal Graph: {example_data['graph_id']}", figsize=(8, 8))
plt.show()

## TabPFN models

In [None]:
from tabpfn.model_loading import ModelSource
regressor_models = ModelSource.get_regressor_v2_5()

print("Available TabPFN Regressor Models:\n")
for model_name in regressor_models.filenames:
    print(f"  {model_name}")

TabPFN loads `tabpfn-v2.5-regressor-v2.5_default.ckpt` by default.

The `real` variants are fine-tuned on real-world datasets.

For details, see TabPFN's [technical report](https://storage.googleapis.com/prior-labs-tabpfn-public/reports/TabPFN_2_5_tech_report.pdf)

In [None]:
import os
from huggingface_hub import hf_hub_download

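# Fall back to the default Hugging Face cache when the variable is unset;
# Path(None) would raise a TypeError.
cache_dir = os.getenv("TABPFN_MODEL_CACHE_DIR")
TABPFN_MODEL_CACHE_DIR = Path(cache_dir) if cache_dir else None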

model_path = hf_hub_download(repo_id=regressor_models.repo_id, filename="tabpfn-v2.5-regressor-v2.5_real.ckpt", local_dir=TABPFN_MODEL_CACHE_DIR)

### TabPFN toy example
Mode 1: in-context learning

In [None]:
# TabPFNRegressor follows the scikit-learn estimator API (fit / predict).
# Toy example
regressor_config = {
    "ignore_pretraining_limits": True,
    "device": "cuda" if torch.cuda.is_available() else "cpu",
    "n_estimators": 24,
    "random_state": 42,
    "inference_precision": "auto",
}

regressor = TabPFNRegressor(model_path=model_path, **regressor_config)

regressor.fit(X_train, y_train)

preds = regressor.predict(X_test)

print("RMSE:", rmse(y_test, preds))

Mode 2: fine-tuning

In [None]:
from tabpfn.utils import meta_dataset_collator
from tabpfn.finetune_utils import clone_model_for_evaluation
from torch.utils.data import DataLoader
from torch.optim import Adam

from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm

regressor_config = {
    "ignore_pretraining_limits": True,
    "device": "cuda" if torch.cuda.is_available() else "cpu",
    "n_estimators": 24,
    "random_state": 42,
    "inference_precision": "auto",
}

# --- Setup model ---
regressor = TabPFNRegressor(
    **regressor_config,
    fit_mode="batched", differentiable_input=False
)
# initialize model weights
regressor._initialize_model_variables()

optimizer = Adam(regressor.model_.parameters(), lr=1.5e-6)

# --- Dataloader setup ---
datasets = regressor.get_preprocessed_datasets(X_train, y_train, train_test_split, 10000)
loader = DataLoader(datasets, batch_size=1, collate_fn=meta_dataset_collator)

# --- Fine-tuning loop ---
for epoch in tqdm(range(10), desc="Fine-tuning Epochs", leave=False):
    for data_batch in tqdm(loader, desc=f"Epoch {epoch}"):
        optimizer.zero_grad()
        (X_tr, X_te, y_tr, y_te, cat_ixs, confs, raw_space, znorm_space, _, _) = data_batch
        regressor.raw_space_bardist_ = raw_space[0]
        regressor.znorm_space_bardist_ = znorm_space[0]
        regressor.fit_from_preprocessed(X_tr, y_tr, cat_ixs, confs)
        preds, _, _ = regressor.forward(X_te)
        loss_fn = znorm_space[0]
        loss = loss_fn(preds, y_te.to(regressor.device)).mean()
        loss.backward()
        optimizer.step()

# --- Evaluation ---
eval_reg = clone_model_for_evaluation(regressor, {}, TabPFNRegressor)
eval_reg.fit(X_train, y_train)
preds = eval_reg.predict(X_test)

print("RMSE:", rmse(y_test, preds))

In this example, fine-tuning improves the RMSE over in-context learning (mode 1).

---

Tips

1. Use the MB results to help the regression task.
2. Instead of fine-tuning on each dataset separately, you can fine-tune on all datasets together to improve generalization.
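
Tip 1 can be sketched as follows: restrict the regression inputs to the columns selected by a (predicted) MB mask before fitting. To keep the sketch runnable without TabPFN, it uses ordinary least squares from NumPy as a stand-in regressor. The synthetic data, the `mb_mask` values, and the helper name `fit_predict_ols` are all hypothetical.

```python
import numpy as np

def fit_predict_ols(X_tr, y_tr, X_te):
    """Stand-in regressor (ordinary least squares with intercept); not TabPFN."""
    A_tr = np.column_stack([X_tr, np.ones(len(X_tr))])
    A_te = np.column_stack([X_te, np.ones(len(X_te))])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return A_te @ coef

rng = np.random.default_rng(0)
X_tr_demo = rng.normal(size=(200, 9))
X_te_demo = rng.normal(size=(50, 9))
# Synthetic target that depends on features 0 and 3 only
y_tr_demo = 2.0 * X_tr_demo[:, 0] - X_tr_demo[:, 3] + 0.1 * rng.normal(size=200)

# Hypothetical predicted MB mask (1 = feature is in the Markov blanket)
mb_mask = np.array([1, 0, 0, 1, 0, 0, 0, 0, 0], dtype=bool)

# Keep only the MB columns for both training and test inputs
preds_mb = fit_predict_ols(X_tr_demo[:, mb_mask], y_tr_demo, X_te_demo[:, mb_mask])
print(preds_mb.shape)
```

With TabPFN, the same masked arrays would be passed to `regressor.fit` and `regressor.predict` instead of the stand-in helper.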

## Predicting MB

In the following, we describe a simple approach to estimate MB using TabPFN.

A key characteristic of MB is that it remains the same for both the training and testing sets.
Therefore, unlike standard target prediction tasks, we train TabPFN on MB from the training datasets 
and evaluate it on MB from the testing datasets.

However, TabPFN natively supports only single-target prediction, so we need to adapt it for multi-output classification to predict the dimensions of MB.

Another challenge is that different datasets have different MB dimensions.
A simple solution is to train separate models for each MB dimension.
In our case, there are only two MB sizes (9 and 19), so we train two independent models.

Alternatively, one could pad MB vectors to a uniform size and train a single shared model across datasets.
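
The padding alternative could look like the sketch below: each MB mask is right-padded with zeros to the largest size, and a companion validity mask records which positions correspond to real features, so padded positions can be ignored in the loss and at prediction time. The helper name `pad_masks` and the toy masks are illustrative only.

```python
import numpy as np

def pad_masks(masks, max_dim):
    """Right-pad binary MB masks to max_dim; return (padded, valid)."""
    padded = np.zeros((len(masks), max_dim), dtype=int)
    valid = np.zeros((len(masks), max_dim), dtype=bool)
    for i, m in enumerate(masks):
        m = np.asarray(m, dtype=int)
        padded[i, : len(m)] = m      # copy the real mask entries
        valid[i, : len(m)] = True    # mark them as real (non-padding) positions
    return padded, valid

# Two toy masks of sizes 9 and 19
masks = [np.array([1, 0, 1] + [0] * 6), np.array([0, 1] * 9 + [1])]
padded, valid = pad_masks(masks, max_dim=19)
print(padded.shape, valid.sum(axis=1))
```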

---

Implementation details

For each MB size, we train a model that maps  
(n_dataset, n_estimators, n_samples, embedding_dim) -> (n_dataset, mb_dim)

To achieve this, we proceed as follows:

1. Compute embeddings of $X$ for each dataset using `n_fold` cross-validation to obtain more robust embeddings (https://arxiv.org/pdf/2502.17361).  
   The resulting embeddings have shape (n_estimators, n_samples, embedding_dim), with embedding_dim = 192.
2. Aggregate and reshape embeddings across estimators: 
   (n_dataset, n_estimators, n_samples, embedding_dim) -> (n_dataset * n_samples, embedding_dim).
3. Expand MB labels accordingly:  
   (n_dataset, mb_dim) -> (n_dataset * n_samples, mb_dim).
4. Train a multi-output model (e.g., MultiOutputClassifier) on these processed embeddings.
5. On the testing datasets, extract embeddings of $X$ and repeat step 2 for consistency.
6. Predict MB for all samples, obtaining outputs of shape (n_dataset * n_samples, mb_dim).
7. Aggregate predictions across samples within each dataset using a majority vote (or averaging with a 0.5 threshold) to recover dataset-level MB predictions:  
   (n_dataset * n_samples, mb_dim) -> (n_dataset, mb_dim).
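
The shape bookkeeping in steps 2, 3, and 7 can be checked on random arrays before running TabPFN. The sketch below uses made-up sizes and random numbers in place of real embeddings and classifier outputs; only the reshaping logic is meant to carry over.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dataset, n_estimators, n_samples, embedding_dim, mb_dim = 4, 3, 50, 192, 9

# Stand-in for stacked TabPFN embeddings (random values, not real embeddings)
emb = rng.normal(size=(n_dataset, n_estimators, n_samples, embedding_dim))

# Step 2: average over estimators, then flatten datasets and samples
emb_flat = emb.mean(axis=1).reshape(-1, embedding_dim)

# Step 3: repeat each dataset-level MB mask once per sample
mb = rng.integers(0, 2, size=(n_dataset, mb_dim))
mb_flat = np.repeat(mb, n_samples, axis=0)

# Step 7: majority vote over samples within each dataset.
# The per-sample "predictions" here are the labels with 20% of entries flipped.
flip = rng.random(mb_flat.shape) < 0.2
sample_preds = np.where(flip, 1 - mb_flat, mb_flat)
votes = sample_preds.reshape(n_dataset, n_samples, mb_dim).mean(axis=1)
mb_hat = (votes >= 0.5).astype(int)

print(emb_flat.shape, mb_flat.shape, mb_hat.shape)
```

Because `reshape` and `np.repeat` both keep the dataset index as the slowest-varying axis, row `i` of `emb_flat` and row `i` of `mb_flat` refer to the same dataset.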



In [None]:
# Filter datasets with MB size 9.
# For this demo, we only use a small subset of them to keep the runtime short.
data_mb9 = [d for d in develop if d['n_features'] == 9]
train_data_mb9 = data_mb9[:20]    # datasets used to train the MB classifier
test_data_mb9 = data_mb9[20:30]   # held-out datasets, disjoint from the training subset

In [None]:
# Initialize TabPFN model
regressor_config = {
    "ignore_pretraining_limits": True,
    "device": "cuda" if torch.cuda.is_available() else "cpu",
    "n_estimators": 24,
    "random_state": 42,
    "inference_precision": "auto",
}

# --- Setup model ---
regressor = TabPFNRegressor(
    **regressor_config,
)

# create an embedding extractor
embedding_extractor = TabPFNEmbedding(tabpfn_reg=regressor, n_fold=5)

train_embeddings = []
for d in tqdm(train_data_mb9):
    train_embedding = embedding_extractor.get_embeddings(np.asarray(d['X_train']), np.asarray(d['y_train']), np.asarray(d['X_test']), data_source="train")
    train_embeddings.append(train_embedding)

test_embeddings = []
for d in tqdm(test_data_mb9):
    test_embedding = embedding_extractor.get_embeddings(np.asarray(d['X_train']), np.asarray(d['y_train']), np.asarray(d['X_test']), data_source="test")
    test_embeddings.append(test_embedding)

In [None]:
np.stack(train_embeddings).shape  # (n_dataset, n_estimators, n_samples, embedding_dim=192)

In [None]:
# aggregate embeddings by averaging over n_estimators
agg_embeddings = np.mean(np.stack(train_embeddings), axis=1).reshape(-1, 192)
agg_embeddings.shape

In [None]:
mb_train = np.stack([np.tile(d['feature_mask'], (d['n_train'], 1)) for d in train_data_mb9]).reshape(-1, 9)
mb_train.shape

In [None]:
clf = MultiOutputClassifier(LogisticRegression(), n_jobs=4).fit(agg_embeddings, mb_train)

In [None]:
# Predict on the test embeddings, aggregated the same way as the training embeddings
test_embeddings_agg = np.mean(np.stack(test_embeddings), axis=1).reshape(-1, 192)
test_embeddings_agg.shape

predicted_mb = clf.predict(test_embeddings_agg)

In [None]:
pred_mb_new = predicted_mb.reshape(len(test_data_mb9), -1, 9)
for d in range(len(test_data_mb9)):
    pred_mb = np.mean(pred_mb_new[d, :], axis=0) >= 0.5
    pred_mb = pred_mb.astype(int)
    true_mb = np.asarray(test_data_mb9[d]['feature_mask'])

    print(f"True MB: {true_mb}")
    print(f"Predicted MB: {pred_mb}")
    print("Jaccard Score: ", jaccard_score(true_mb, pred_mb))
    print("---")


Tips

1. `MultiOutputClassifier` treats each target independently, which is not ideal because MB features are correlated.
2. Use a neural network (e.g., MLP, GNN, seq2seq) to better predict the MB from the embeddings.
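
One lightweight way to address tip 1 without moving to a neural network is scikit-learn's `ClassifierChain`, which feeds earlier label predictions to later classifiers so that dependencies between labels can be modelled. The sketch below runs on synthetic features and labels (not TabPFN embeddings); whether it beats `MultiOutputClassifier` on this challenge is untested.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 8))

# Synthetic labels: y1 mostly copies y0, so the two labels are correlated
y0 = (X_demo[:, 0] > 0).astype(int)
y1 = np.where(X_demo[:, 1] > 1.0, 1 - y0, y0)
y2 = (X_demo[:, 2] > 0).astype(int)
Y_demo = np.column_stack([y0, y1, y2])

# Each classifier in the chain sees the features plus the earlier labels
chain = ClassifierChain(LogisticRegression(max_iter=1000), order=[0, 1, 2], random_state=0)
chain.fit(X_demo, Y_demo)
Y_hat = chain.predict(X_demo).astype(int)
print(Y_hat.shape)
```

In the notebook, `X_demo` and `Y_demo` would be replaced by `agg_embeddings` and `mb_train`.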

## Submission

On the submission set, for each dataset (`data_id`), predict both `y` and `markov_blanket`, and save the results in `submission.csv` in the following format:

| data_id | y_pred | markov_blanket_pred |
|---------|--------|---------------------|
| int     | float  | list of int         |
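
A file with these three columns could be written as sketched below. The exact serialization is not specified here, so this is only one possible layout: the table lists `y_pred` as a float, and since each dataset has many test rows, the sketch assumes one row per `data_id` with the per-row predictions and the MB mask stored as JSON-encoded lists. The helper name `write_submission`, the demo file name, and the placeholder values are hypothetical; check the challenge instructions for the required layout.

```python
import csv
import json

def write_submission(rows, path="submission.csv"):
    """Write one row per data_id; list-valued fields are JSON-encoded (an assumption)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["data_id", "y_pred", "markov_blanket_pred"])
        for data_id, y_pred, mb_pred in rows:
            writer.writerow([
                int(data_id),
                json.dumps([float(v) for v in y_pred]),
                json.dumps([int(v) for v in mb_pred]),
            ])

# Placeholder predictions for two hypothetical data_ids
demo_rows = [
    (0, [0.12, -1.5], [1, 0, 1]),
    (1, [2.0, 0.3], [0, 0, 1]),
]
write_submission(demo_rows, path="submission_demo.csv")
```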

## Evaluation

For each dataset $j$ in `submit` with testing set $\mathcal{D}^{(j)}_{\text{test}}$ we compute:

- **RMSE** on regression:
$$ \text{RMSE}_j = \sqrt{\frac{1}{N_{\text{test}}} \sum_{(x, y) \in \mathcal{D}^{(j)}_{\text{test}}} (y - \hat{y}(x))^2 }. $$
- **Jaccard score** on MB masks:
$$ \text{Jaccard}_j = \frac{|\hat{m}^{(j)} \cap m^{(j)}_{\text{true}}|}{|\hat{m}^{(j)} \cup m^{(j)}_{\text{true}}|}. $$

Averaging over the $N$ tasks gives $\overline{\text{RMSE}}$ and $\overline{\text{Jaccard}}$, and
the final challenge **score** is
$$ \text{Score} = \frac{1}{N} \sum_{j=1}^{N} \text{RMSE}_j \cdot (1 - \text{Jaccard}_j). $$

- On the **develop test**, you can compute this score yourself.
- On the **submission test**, the organizers run the same computation on a
  hidden set of tasks.
- Ground truth data will be released after the challenge ends.


In [None]:
def evaluate_single_task(
    y_query_true: np.ndarray,
    y_query_pred: np.ndarray,
    mb_true: np.ndarray,
    mb_pred: np.ndarray,
) -> dict:
    rmse_val = rmse(y_query_true, y_query_pred)
    jaccard_val = jaccard_score(mb_true, mb_pred)
    score_val = rmse_val * (1.0 - jaccard_val)
    return {"rmse": rmse_val, "jaccard": jaccard_val, "score": score_val}
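
To aggregate over tasks as in the score formula above, the per-task products are averaged. The sketch below re-implements RMSE and the Jaccard score in plain NumPy from the formulas in this section so that it runs standalone. The `blanket.metrics` implementations may differ in details such as the handling of an empty union, which is set to 1.0 here as an assumption; the function names and demo tasks are illustrative.

```python
import numpy as np

def rmse_np(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def jaccard_np(m_true, m_pred):
    m_true = np.asarray(m_true, dtype=bool)
    m_pred = np.asarray(m_pred, dtype=bool)
    union = np.logical_or(m_true, m_pred).sum()
    if union == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return float(np.logical_and(m_true, m_pred).sum() / union)

def challenge_score(tasks):
    """Mean of RMSE_j * (1 - Jaccard_j) over (y_true, y_pred, mb_true, mb_pred) tuples."""
    per_task = [rmse_np(yt, yp) * (1.0 - jaccard_np(mt, mp)) for yt, yp, mt, mp in tasks]
    return float(np.mean(per_task))

demo_tasks = [
    ([0.0, 0.0], [3.0, 4.0], [1, 1, 0, 0], [1, 0, 0, 0]),  # RMSE = sqrt(12.5), Jaccard = 0.5
    ([1.0, 2.0], [1.0, 2.0], [1, 0, 1, 0], [0, 1, 0, 1]),  # RMSE = 0, Jaccard = 0
]
print(challenge_score(demo_tasks))  # approximately 0.884
```

Lower is better: the score shrinks when either the regression error falls or the MB overlap rises.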
