# Using cPCA for Tabular In-Context Learning

Following the release of [TabPFN](https://arxiv.org/abs/2207.01848), a transformer model capable of strong in-context learning (ICL) on tabular data, this project evaluates contrastive PCA (cPCA) as a preprocessing step for improving TabPFN's performance on tabular data.

Potential benefits of adding cPCA as a preprocessing step before TabPFN include:

- improved classification accuracy
- fewer features, which shrinks the dataset and can reduce TabPFN inference time
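
For readers unfamiliar with cPCA, the core idea can be sketched in a few lines of NumPy: instead of the top eigenvectors of the foreground covariance (ordinary PCA), cPCA takes the top eigenvectors of the foreground covariance minus `alpha` times the background covariance. This is a minimal illustration on hypothetical toy data, not the implementation used by the `contrastive` package; the function name `cpca_directions` and the toy variances are assumptions made for the example.

```python
import numpy as np

def cpca_directions(foreground, background, alpha, n_components=2):
    """Return the top eigenvectors of C_fg - alpha * C_bg as columns."""
    fg = foreground - foreground.mean(axis=0)
    bg = background - background.mean(axis=0)
    cov_fg = fg.T @ fg / (len(fg) - 1)
    cov_bg = bg.T @ bg / (len(bg) - 1)
    # The contrast matrix is symmetric, so eigh applies. It returns
    # eigenvalues in ascending order, so reverse the columns before slicing.
    _, eigvecs = np.linalg.eigh(cov_fg - alpha * cov_bg)
    return eigvecs[:, ::-1][:, :n_components]

rng = np.random.default_rng(0)
# Hypothetical toy data: feature 0 has large variance in both sets (nuisance),
# feature 1 has extra variance only in the foreground (signal of interest).
background = rng.normal(scale=[5.0, 0.5, 0.5], size=(500, 3))
foreground = rng.normal(scale=[5.0, 2.0, 0.5], size=(500, 3))

# alpha = 0 reduces to ordinary PCA on the foreground.
pca_dir = cpca_directions(foreground, background, alpha=0.0, n_components=1)
cpca_dir = cpca_directions(foreground, background, alpha=2.0, n_components=1)

print("PCA picks feature: ", int(np.argmax(np.abs(pca_dir[:, 0]))))
print("cPCA picks feature:", int(np.argmax(np.abs(cpca_dir[:, 0]))))
```

With `alpha = 0` the direction follows the shared high-variance feature, while a positive `alpha` penalizes variance that is also present in the background, so the direction moves toward the feature that is distinctive to the foreground.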

### Summary of Results

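These results were obtained on the [balance-scale (UCI)](https://www.openml.org/search?type=data&id=11) dataset from OpenML. The cell outputs recorded further down come from a run with `TASK_ID = 3560`, so they do not reproduce this table.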

| Preprocessing | Accuracy |
| --- | --- |
| None | 0.941 |
| PCA (2 dimensions) | 0.853 |
| cPCA (2 dimensions, best score) | **0.956** |

### Install Packages

In [74]:
!pip install tabpfn openml contrastive tqdm

In [96]:
!cd tabicl && pip install -e .

Obtaining file:///workspace/additional-cpca-experiments/notebooks/tabpfn_results/tabicl
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Installing backend dependencies ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: tabicl
  Building editable for tabicl (pyproject.toml) ... done
  Created wheel for tabicl: filename=tabicl-0.1.1-py3-none-any.whl size=6915 sha256=df9027659eb2947b9023d4e683f6874e1c586afde4e21dc993883d2faa16d2c3
  Stored in directory: /tmp/pip-ephem-wheel-cache-ghoz_mwv/wheels/a0/35/75/4b858c0eb991723035f86da0ea792a905b36f780c1e2bd04f0
Successfully built tabicl
Installing collected packages: tabicl
  Attempting uninstall: tabicl
    Found existing installation: tabicl 0.1.1
    Uninstalling tabicl-0.1.1:
      Successfully uninstalled tabicl-0.1.1


### Get Dataset from OpenML

In [2]:
import openml

suite = openml.study.get_suite(99)
print(suite)

OpenML Benchmark Suite
ID..............: 99
Name............: OpenML-CC18 Curated Classification benchmark
Status..........: active
Main Entity Type: task
Study URL.......: https://www.openml.org/s/99
# of Data.......: 72
# of Tasks......: 72
Creator.........: https://www.openml.org/u/1
Upload Time.....: 2019-02-21 18:47:13


In [3]:
VERBOSE = False

if VERBOSE:
    for task_id in suite.tasks[:30]:
        task = openml.tasks.get_task(task_id)
        print(task)
        print(dir(task))

### Split Foreground and Background Data

In [100]:
import tqdm
import numpy as np

# Other task IDs tried during experimentation:
# TASK_ID = 11      # balance-scale
# TASK_ID = 53      # vehicle
# TASK_ID = 2074    # satimage
# TASK_ID = 167140  # dna
# TASK_ID = 12
TASK_ID = 3560  # target feature "Prevention", 6 classes (see the task printout below)

task = openml.tasks.get_task(TASK_ID)
print(task)

X, y = task.get_X_and_y()
X = np.asarray(X)
y = np.asarray(y)

X_foreground = []
y_foreground = []

X_background = []
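y_background = []

# Classes 0 and 1 form the foreground (the binary problem to classify);
# samples from every other class serve as the cPCA background.
for i in tqdm.trange(X.shape[0]):
    if y[i] not in [0,1]: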
        X_background.append(X[i])
        y_background.append(y[i])
    else:
        X_foreground.append(X[i])
        y_foreground.append(y[i])

X_foreground = np.asarray(X_foreground)
X_background = np.asarray(X_background)



OpenML Classification Task
Task Type Description: https://www.openml.org/tt/TaskType.SUPERVISED_CLASSIFICATION
Task ID..............: 3560
Task URL.............: https://www.openml.org/t/3560
Estimation Procedure.: crossvalidation
Target Feature.......: Prevention
# of Classes.........: 6
Cost Matrix..........: Available


100%|██████████| 797/797 [00:00<00:00, 794972.72it/s]


In [101]:
print(f"foreground shape: {X_foreground.shape}")
print(f"background shape: {X_background.shape}")

foreground shape: (259, 4)
background shape: (538, 4)
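
The row-by-row loop above can also be written with a boolean mask, which keeps the labels as NumPy arrays as well. The sketch below assumes the same convention as the loop (labels 0 and 1 are foreground, everything else is background); the helper name `split_foreground_background` and the toy arrays are hypothetical and only serve the illustration.

```python
import numpy as np

def split_foreground_background(X, y, foreground_labels=(0, 1)):
    """Split rows into foreground (labels of interest) and background (the rest)."""
    X = np.asarray(X)
    y = np.asarray(y)
    is_foreground = np.isin(y, foreground_labels)
    return X[is_foreground], y[is_foreground], X[~is_foreground], y[~is_foreground]

# Hypothetical toy data: six samples, two features, labels from several classes.
X_toy = np.arange(12).reshape(6, 2)
y_toy = np.array([0, 2, 1, 3, 0, 5])

X_fg, y_fg, X_bg, y_bg = split_foreground_background(X_toy, y_toy)
print(f"foreground shape: {X_fg.shape}")
print(f"background shape: {X_bg.shape}")
```

Because the mask preserves row order, the resulting arrays should match what the loop produces, just without the Python-level iteration.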


### Run TabPFN with no PCA or cPCA

In [102]:
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

from tabpfn import TabPFNClassifier

# Split the foreground data (classes 0 and 1) into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_foreground, y_foreground, test_size=0.2, random_state=42)

# Initialize a classifier
clf = TabPFNClassifier(ignore_pretraining_limits=True)
clf.fit(X_train, y_train)

print("Accuracy with no PCA or cPCA:")

# Predict probabilities
prediction_probabilities = clf.predict_proba(X_test)
print("ROC AUC:", round(roc_auc_score(y_test, prediction_probabilities[:, 1]),3))

# Predict labels
predictions = clf.predict(X_test)
print("Accuracy", round(accuracy_score(y_test, predictions),3))

Accuracy with no PCA or cPCA:
ROC AUC: 0.633
Accuracy 0.577


In [94]:
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(X_foreground, y_foreground, test_size=0.2, random_state=42)

# Initialize a classifier
clf = SVC()
clf.fit(X_train, y_train)

print("Accuracy with no PCA or cPCA:")

# Predict labels
predictions = clf.predict(X_test)
print("Accuracy", round(accuracy_score(y_test, predictions),3))

Accuracy with no PCA or cPCA:
Accuracy 1.0


In [97]:
from tabicl import TabICLClassifier

clf = TabICLClassifier()
clf.fit(X_train, y_train)  # this is cheap
clf.predict(X_test)  # in-context learning happens here

ImportError: cannot import name 'TabICLClassifier' from 'tabicl' (unknown location)

### Run TabPFN with PCA

In [87]:
from sklearn.decomposition import PCA

pca_model = PCA(n_components=2)
X_data_original_compress = pca_model.fit_transform(X_foreground)

X_train, X_test, y_train, y_test = train_test_split(X_data_original_compress, y_foreground, test_size=0.2, random_state=42)

# Initialize a classifier
clf = TabPFNClassifier()
clf.fit(X_train, y_train)

print("Accuracy with PCA:")

# Predict probabilities
prediction_probabilities = clf.predict_proba(X_test)
print("ROC AUC:", round(roc_auc_score(y_test, prediction_probabilities[:, 1]),3))

# Predict labels
predictions = clf.predict(X_test)
print("Accuracy", round(accuracy_score(y_test, predictions),3))


Accuracy with PCA:
ROC AUC: 0.606
Accuracy 0.577


In [88]:
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(X_data_original_compress, y_foreground, test_size=0.2, random_state=42)

# Initialize a classifier
clf = SVC()
clf.fit(X_train, y_train)

print("Accuracy with PCA:")

# Predict labels
predictions = clf.predict(X_test)
print("Accuracy", round(accuracy_score(y_test, predictions),3))

Accuracy with PCA:
Accuracy 0.558
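
The PCA cells above fit the projection on the full foreground set before splitting, so the test rows influence the learned components. One way to avoid this, sketched below, is to wrap the projection and the classifier in a scikit-learn pipeline so that PCA is fitted on the training rows only. The synthetic data is hypothetical, and `SVC` stands in for the classifier so the example runs without downloading TabPFN weights; this is an illustration of the pattern, not a rerun of the experiment.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical stand-in for the foreground data: 200 samples, 4 features.
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Split first, so the test rows never reach the PCA fit.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() learns the PCA components from X_train only;
# predict() applies the same projection to X_test.
model = make_pipeline(PCA(n_components=2), SVC())
model.fit(X_train, y_train)

demo_accuracy = accuracy_score(y_test, model.predict(X_test))
print("Accuracy", round(demo_accuracy, 3))
```

Any estimator with `fit` and `predict` can take the place of `SVC` in the pipeline, which is the assumption that would let the same pattern wrap `TabPFNClassifier`.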


### Run TabPFN with cPCA

In [89]:
from contrastive import CPCA

mdl = CPCA(n_components=2)

# fit_transform returns a list of 2-dimensional projections of the foreground
# data, one per automatically chosen value of alpha (4 values by default).
projected_data = mdl.fit_transform(X_foreground, X_background)

print("Accuracy with cPCA:")
print("-------------------")

for i, projection in enumerate(projected_data):
    X_train, X_test, y_train, y_test = train_test_split(projection, y_foreground, test_size=0.2, random_state=42)

    print(f"choice {i+1} of alpha:")

    # TabPFN on the cPCA projection
    clf = TabPFNClassifier()
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    print("tabpfn Accuracy", round(accuracy_score(y_test, predictions),3))

    # SVC baseline on the same projection
    clf = SVC()
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    print("svc Accuracy", round(accuracy_score(y_test, predictions),3))
    print()


Accuracy with cPCA:
-------------------
choice 1 of alpha:
tabpfn Accuracy 0.558
svc Accuracy 0.558

choice 2 of alpha:
tabpfn Accuracy 0.481
svc Accuracy 0.5

choice 3 of alpha:
tabpfn Accuracy 0.462
svc Accuracy 0.654

choice 4 of alpha:
tabpfn Accuracy 0.5
svc Accuracy 0.635

