# Using cPCA for Tabular In-Context Learning

[TabPFN](https://arxiv.org/abs/2207.01848) is a transformer model capable of strong in-context learning (ICL) on tabular data. This project evaluates contrastive PCA (cPCA) as a preprocessing step for improving TabPFN's performance on tabular data.

Benefits of adding cPCA as a preprocessing step before TabPFN include:

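- an improvement in classification accuracy (0.941 to 0.956 on the dataset below, for the best-scoring value of alpha)
- a reduction in the number of features (4 to 2 on the dataset below), which shrinks the dataset and is expected to reduce TabPFN inference time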
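
For readers unfamiliar with cPCA: it looks for directions that have high variance in a foreground dataset but low variance in a background dataset, by taking the top eigenvectors of the foreground covariance minus `alpha` times the background covariance. The sketch below illustrates that idea in plain NumPy on synthetic data. The helper name `cpca_directions` and the data are made up for illustration, and this is not the `contrastive` package's implementation, which may preprocess the data differently.

```python
import numpy as np

def cpca_directions(foreground, background, alpha, n_components=2):
    """Return the top contrastive directions for one alpha (illustrative helper)."""
    # Center each dataset on its own mean before computing covariances.
    fg = foreground - foreground.mean(axis=0)
    bg = background - background.mean(axis=0)
    cov_fg = fg.T @ fg / (fg.shape[0] - 1)
    cov_bg = bg.T @ bg / (bg.shape[0] - 1)
    # Contrast matrix: foreground variance minus alpha times background variance.
    contrast = cov_fg - alpha * cov_bg
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(contrast)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order]

rng = np.random.default_rng(0)
# Synthetic data: feature 0 is high-variance noise shared by both sets,
# feature 1 has substantial variance only in the foreground.
background = rng.normal(size=(500, 3)) * np.array([10.0, 0.1, 1.0])
foreground = rng.normal(size=(500, 3)) * np.array([10.0, 3.0, 1.0])

v_pca = cpca_directions(foreground, background, alpha=0.0, n_components=1)
v_cpca = cpca_directions(foreground, background, alpha=2.0, n_components=1)
top_pca_feature = int(np.argmax(np.abs(v_pca[:, 0])))
top_cpca_feature = int(np.argmax(np.abs(v_cpca[:, 0])))
print(top_pca_feature, top_cpca_feature)
```

With `alpha=0` the contrast matrix is just the foreground covariance, so the result matches ordinary PCA and follows the shared noise feature. A larger `alpha` penalizes directions that are also noisy in the background, which is why the value of `alpha` matters in the results further down.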

### Summary of Results

All results use the [balance-scale (UCI)](https://www.openml.org/search?type=data&id=11) dataset from OpenML. Accuracy is measured on a held-out 20% test split of the foreground data.

| Preprocessing | Accuracy |
| --- | --- |
| None | 0.941 |
| PCA (2 components) | 0.853 |
| cPCA (2 components, best of 4 alpha values) | **0.956** |

### Install Packages

In [74]:
!pip install tabpfn openml contrastive tqdm



### Get Dataset from OpenML

In [98]:
import openml

suite = openml.study.get_suite(99)
print(suite)

OpenML Benchmark Suite
ID..............: 99
Name............: OpenML-CC18 Curated Classification benchmark
Status..........: active
Main Entity Type: task
Study URL.......: https://www.openml.org/s/99
# of Data.......: 72
# of Tasks......: 72
Creator.........: https://www.openml.org/u/1
Upload Time.....: 2019-02-21 18:47:13


In [97]:
VERBOSE = False

if VERBOSE:
    for task_id in suite.tasks[:30]:
        task = openml.tasks.get_task(task_id)
        print(task)
        print(dir(task))

### Split Foreground and Background Data

In [99]:
import tqdm
import numpy as np


TASK_ID=11
task = openml.tasks.get_task(TASK_ID)
print(task)

X, y = task.get_X_and_y()
X = np.asarray(X)
y = np.asarray(y)

X_foreground = []
y_foreground = []

X_background = []
y_background = []

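# Rows of class 2 form the background set; all other rows form the foreground.
for i in tqdm.trange(X.shape[0]):
    if y[i] == 2:
        X_background.append(X[i])
        y_background.append(y[i])
    else:
        X_foreground.append(X[i])
        y_foreground.append(y[i])

X_foreground = np.asarray(X_foreground)
y_foreground = np.asarray(y_foreground)
X_background = np.asarray(X_background)
y_background = np.asarray(y_background)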



OpenML Classification Task
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 11
Task URL.............: https://www.openml.org/t/11
Estimation Procedure.: crossvalidation
Target Feature.......: class
# of Classes.........: 3
Cost Matrix..........: Available


100%|██████████| 625/625 [00:00<00:00, 439175.74it/s]


In [100]:
print(f"foreground shape: {X_foreground.shape}")
print(f"background shape: {X_background.shape}")

foreground shape: (337, 4)
background shape: (288, 4)
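
The same split can be written without an explicit loop by using a boolean mask. This is a minimal sketch on a tiny made-up array, not the balance-scale data, and `split_by_class` is a hypothetical helper name.

```python
import numpy as np

def split_by_class(X, y, background_class):
    """Split rows into (foreground X, foreground y, background X, background y)."""
    X = np.asarray(X)
    y = np.asarray(y)
    # True for rows whose label equals the background class.
    is_background = y == background_class
    return X[~is_background], y[~is_background], X[is_background], y[is_background]

# Tiny made-up example with six rows and two features.
X_demo = np.arange(12).reshape(6, 2)
y_demo = np.array([0, 2, 1, 2, 0, 1])
X_fg, y_fg, X_bg, y_bg = split_by_class(X_demo, y_demo, background_class=2)
print(X_fg.shape, X_bg.shape)
```

The mask keeps row order intact and returns the labels as NumPy arrays, so they can be passed straight to `train_test_split`.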


### Run TabPFN with no PCA or cPCA

In [103]:
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

from tabpfn import TabPFNClassifier

# Split the foreground data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_foreground, y_foreground, test_size=0.2, random_state=42)

# Initialize a classifier
clf = TabPFNClassifier()
clf.fit(X_train, y_train)

print("Accuracy with no PCA or cPCA:")

# Predict probabilities
prediction_probabilities = clf.predict_proba(X_test)
print("ROC AUC:", round(roc_auc_score(y_test, prediction_probabilities[:, 1]),3))

# Predict labels
predictions = clf.predict(X_test)
print("Accuracy", round(accuracy_score(y_test, predictions),3))



Accuracy with no PCA or cPCA:
ROC AUC: 0.969
Accuracy 0.941


### Run TabPFN with PCA

In [105]:
from sklearn.decomposition import PCA

pca_model = PCA(n_components=2)
X_data_original_compress = pca_model.fit_transform(X_foreground)

X_train, X_test, y_train, y_test = train_test_split(X_data_original_compress, y_foreground, test_size=0.2, random_state=42)

# Initialize a classifier
clf = TabPFNClassifier()
clf.fit(X_train, y_train)

print("Accuracy with PCA:")

# Predict probabilities
prediction_probabilities = clf.predict_proba(X_test)
print("ROC AUC:", round(roc_auc_score(y_test, prediction_probabilities[:, 1]),3))

# Predict labels
predictions = clf.predict(X_test)
print("Accuracy", round(accuracy_score(y_test, predictions),3))


Accuracy with PCA:
ROC AUC: 0.925
Accuracy 0.853
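
The PCA cell above fits the projection on all foreground rows before the train/test split. A common alternative is to learn the projection from the training rows only and then apply it to the test rows. The sketch below shows that pattern with NumPy's SVD on random data. `fit_pca` and `apply_pca` are hypothetical helpers, and the numbers reported in this notebook were not produced this way.

```python
import numpy as np

def fit_pca(X_train, n_components=2):
    """Learn the PCA mean and principal directions from training rows only."""
    mean = X_train.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(X, mean, components):
    """Project rows using a mean and directions learned elsewhere."""
    return (X - mean) @ components.T

rng = np.random.default_rng(42)
X_all = rng.normal(size=(100, 4))   # random stand-in for a 4-feature dataset
perm = rng.permutation(100)
train_idx, test_idx = perm[:80], perm[80:]

mean, components = fit_pca(X_all[train_idx])
Z_train = apply_pca(X_all[train_idx], mean, components)
Z_test = apply_pca(X_all[test_idx], mean, components)
print(Z_train.shape, Z_test.shape)
```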


### Run TabPFN with cPCA

In [108]:
from contrastive import CPCA

mdl = CPCA(n_components=2)
projected_data = mdl.fit_transform(X_foreground, X_background)

# fit_transform returns a list, 'projected_data', of 2-dimensional projections
# of the foreground data, one for each automatically chosen value of 'alpha'
# (by default, 4 values of alpha are chosen).

print("Accuracy with cPCA:")
print("-------------------")

for i, projection in enumerate(projected_data):
    X_train, X_test, y_train, y_test = train_test_split(np.asarray(projection), y_foreground, test_size=0.2, random_state=42)
    
    # Initialize a classifier
    clf = TabPFNClassifier()
    clf.fit(X_train, y_train)

    print(f"choice {i+1} of alpha:")
    # Predict probabilities
    prediction_probabilities = clf.predict_proba(X_test)
    print("ROC AUC:", round(roc_auc_score(y_test, prediction_probabilities[:, 1]),3))
    
    # Predict labels
    predictions = clf.predict(X_test)
    print("Accuracy", round(accuracy_score(y_test, predictions),3))
    print()

Accuracy with cPCA:
-------------------
choice 1 of alpha:
ROC AUC: 0.971
Accuracy 0.956

choice 2 of alpha:
ROC AUC: 0.964
Accuracy 0.897

choice 3 of alpha:
ROC AUC: 0.955
Accuracy 0.868

choice 4 of alpha:
ROC AUC: 0.962
Accuracy 0.897
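
To put the accuracy figures in context, they can be converted back into counts of correctly classified test rows. The arithmetic below uses the foreground size (337) and `test_size=0.2` from the cells above, and assumes scikit-learn rounds the test-set size up when `test_size` is a float.

```python
import math

n_foreground = 337      # foreground rows reported above
test_fraction = 0.2     # test_size passed to train_test_split
# Assumption: the test-set size is the fraction of rows, rounded up.
n_test = math.ceil(n_foreground * test_fraction)

reported = {"none": 0.941, "pca": 0.853, "cpca_alpha_1": 0.956}
correct_counts = {}
for name, acc in reported.items():
    # Find the number of correct predictions that rounds to the reported accuracy.
    correct_counts[name] = next(
        k for k in range(n_test + 1) if round(k / n_test, 3) == acc
    )
print(n_test, correct_counts)
```

Under that assumption, the reported accuracies correspond to whole numbers of correct predictions on the same test split, so the differences between methods can be read directly as a number of test rows.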

