<a href="https://www.kaggle.com/code/youneseloiarm/simple-tabpfn-approach-for-score-of-15-in-1-min?scriptVersionId=258909371" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

#### First, I thank `Carl McBride Ellis` and `Samuel` for making it possible to use TabPFN in this competition. Much appreciation to them.

The approach has the following components:
- Use an ensemble of TabPFN and XGBoost (default settings)
- Reweight the probabilities to match the balanced log loss used in this competition
- Impute NaN values with the column median
- Use the time column from the training data; for the test data, set the time to (max time in training) + 1
- Use all four classes provided in greeks.Alpha and aggregate the probabilities of the latter three classes, as they all correspond to different illnesses
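
For reference, the balanced log loss that the reweighting targets can be sketched as follows. This is a minimal NumPy sketch assuming the usual definition (the average of the per-class mean log losses). The function name `balanced_log_loss` and the clipping constant `eps` are illustrative choices, not taken from the competition code.

```python
import numpy as np

def balanced_log_loss(y_true, p1, eps=1e-15):
    """Average of the per-class mean log losses for a binary target.

    y_true: array of 0/1 labels; p1: predicted probability of class 1.
    Hypothetical helper, for illustration only.
    """
    y_true = np.asarray(y_true)
    p1 = np.clip(np.asarray(p1, dtype=float), eps, 1 - eps)
    loss_0 = -np.log(1 - p1[y_true == 0]).mean()  # mean loss over class-0 rows
    loss_1 = -np.log(p1[y_true == 1]).mean()      # mean loss over class-1 rows
    return (loss_0 + loss_1) / 2

# Predicting 0.5 everywhere scores log(2), however imbalanced the labels are.
y = np.array([0, 0, 0, 0, 1])
score = balanced_log_loss(y, np.full(5, 0.5))
```

Because each class contributes half of the loss regardless of how many rows it has, a model trained with plain cross-entropy on imbalanced data tends to under-predict the minority class relative to what this metric rewards.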

In [1]:
!pip install /kaggle/input/tabpfn-019-whl/tabpfn-0.1.9-py3-none-any.whl

Processing /kaggle/input/tabpfn-019-whl/tabpfn-0.1.9-py3-none-any.whl
Installing collected packages: tabpfn
Successfully installed tabpfn-0.1.9

In [2]:
!mkdir -p /opt/conda/lib/python3.10/site-packages/tabpfn/models_diff
!cp /kaggle/input/tabpfn-019-whl/prior_diff_real_checkpoint_n_0_epoch_100.cpkt /opt/conda/lib/python3.10/site-packages/tabpfn/models_diff/

In [3]:
import numpy as np
import pandas as pd
import json



pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import warnings
warnings.filterwarnings("ignore")

In [4]:
# LOAD THE DATA

BASE_DIR = '/kaggle/input/icr-identify-age-related-conditions'
# Read the CSV files into pandas DataFrames
maindf = pd.read_csv(f'{BASE_DIR}/train.csv')
greeksdf = pd.read_csv(f'{BASE_DIR}/greeks.csv')
testdf = pd.read_csv(f'{BASE_DIR}/test.csv')

print(maindf.EJ.unique())
first_cat = maindf.EJ.unique()[0]
maindf.EJ = maindf.EJ.eq(first_cat).astype('int')
testdf.EJ = testdf.EJ.eq(first_cat).astype('int')

['B' 'A']


In [5]:
# greeks.csv contains time information (Epsilon) that we can use; we just need to parse it to int / NaN.

from datetime import datetime
known = greeksdf.Epsilon != 'Unknown'
times = greeksdf.Epsilon.copy()
# Known dates become proleptic Gregorian ordinals (day counts); unknown dates become NaN
times[known] = greeksdf.Epsilon[known].map(lambda x: datetime.strptime(x, '%m/%d/%Y').toordinal())
times[~known] = np.nan
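
To make the date handling concrete, here is a small self-contained sketch of the same parsing on a toy series. The three values are made up for illustration and only stand in for `greeksdf.Epsilon`.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Toy stand-in for greeksdf.Epsilon (values are invented for this example).
eps = pd.Series(['3/19/2019', 'Unknown', '1/1/2020'])

known = eps != 'Unknown'
ordinals = pd.Series(np.nan, index=eps.index)  # start with NaN everywhere
ordinals.loc[known] = eps[known].map(
    lambda s: datetime.strptime(s, '%m/%d/%Y').toordinal()
)

# Differences between ordinals are differences in days.
gap_days = ordinals.iloc[2] - ordinals.iloc[0]
```

Ordinals keep the spacing between dates, so the model sees time as an ordinary numeric feature, and the NaN entries are later filled by the median imputer.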

In [6]:
# Set predictor and target columns
target = 'Class'
predictors = [n for n in maindf.columns if n != target and n != 'Id']

In [7]:
from sklearn.base import BaseEstimator
from sklearn.impute import SimpleImputer
import xgboost
from tabpfn import TabPFNClassifier


class WeightedEns(BaseEstimator):
    """Average XGBoost and TabPFN probabilities, then reweight them for the balanced log loss."""

    def __init__(self):
        self.classifiers = [xgboost.XGBClassifier(), TabPFNClassifier(N_ensemble_configurations=64, device='cuda:0')]
        self.imp = SimpleImputer(missing_values=np.nan, strategy='median')

    def fit(self, X, y):
        # Encode the labels as integers 0..k-1 and remember the original class names
        cls, y = np.unique(y, return_inverse=True)
        self.classes_ = cls
        X = self.imp.fit_transform(X)
        for cl in self.classifiers:
            cl.fit(X, y)
        return self

    def predict_proba(self, X):
        X = self.imp.transform(X)
        # Average the class probabilities over the ensemble members
        ps = np.stack([cl.predict_proba(X) for cl in self.classifiers])
        p = np.mean(ps, axis=0)
        # Estimated number of class-0 rows and of all other rows in this batch
        class_0_est_instances = p[:, 0].sum()
        others_est_instances = p[:, 1:].sum()
        # We reweight the probabilities because the competition loss is balanced in the same way.
        # Out of the box, the models optimize cross-entropy (CE);
        # with this change they target balanced CE.
        new_p = p * np.array([[1/(class_0_est_instances if i==0 else others_est_instances) for i in range(p.shape[1])]])
        return new_p / np.sum(new_p, axis=1, keepdims=True)
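
The effect of the reweighting step is easier to see on a tiny example. The sketch below repeats the arithmetic of `predict_proba` in a standalone function. The name `reweight` and the input probabilities are made up for illustration.

```python
import numpy as np

def reweight(p):
    """Divide column 0 by its total mass and the remaining columns by their
    combined mass, then renormalise each row (same arithmetic as predict_proba)."""
    mass_0 = p[:, 0].sum()
    mass_rest = p[:, 1:].sum()
    weights = np.array([1 / mass_0] + [1 / mass_rest] * (p.shape[1] - 1))
    new_p = p * weights
    return new_p / new_p.sum(axis=1, keepdims=True)

# Invented averaged probabilities for a batch in which class 0 dominates.
p = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.7, 0.3]])
q = reweight(p)
```

Here the minority-class probability rises in every row; the middle row, for instance, moves from 0.2 to 0.5. Note that the weights are computed from the batch passed in, so the same row can receive a different output depending on which other rows are predicted alongside it.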

In [8]:
pred_and_time = pd.concat((maindf[predictors], times), axis=1)

In [9]:
test_predictors = np.array(testdf[predictors])
test_pred_and_time = np.concatenate((test_predictors, np.zeros((len(test_predictors),1)) + pred_and_time.Epsilon.max()+1),1)
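
As a small illustration of the test-time feature built above, the following sketch appends a constant "max training time + 1" column to a toy test matrix. All values are invented, and `np.nanmax` is used here to mirror how the pandas `max()` above skips NaN.

```python
import numpy as np

# Invented training times (ordinals) with one unknown date.
train_time = np.array([737000.0, np.nan, 737500.0])

# Toy test features: 2 rows, 3 columns.
test_X = np.zeros((2, 3))

# Every test row gets the same time value: one day after the latest training date.
test_time = np.full((len(test_X), 1), np.nanmax(train_time) + 1)
test_with_time = np.concatenate((test_X, test_time), axis=1)
```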

In [10]:
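m = WeightedEns()
m.fit(np.array(pred_and_time), np.array(greeksdf['Alpha']))
p = m.predict_proba(test_pred_and_time)
# Class 'A' (no age-related condition) must be the first column
assert (m.classes_[0] == 'A')
# Collapse the three condition classes into a single class_1 probability
p = np.concatenate((p[:,:1], np.sum(p[:,1:], 1, keepdims=True)), 1)
# Snap very confident class_0 probabilities to exactly 0 or 1
p0 = p[:,:1]
p0[p0 > 0.863] = 1
p0[p0 < 0.137] = 0
submit = pd.DataFrame(testdf["Id"], columns=["Id"])
submit["class_0"] = p0
submit["class_1"] = 1 - p0
submit.to_csv('submission.csv', index=False)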

Loading model that can be used for inference only
Using a Transformer with 25.82 M parameters
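
The post-processing in the cell above (collapsing four classes to two, then snapping confident predictions) can be traced on a toy array. The probabilities below are invented, and the column order A, B, D, G is an assumption based on `np.unique` sorting the labels alphabetically.

```python
import numpy as np

# Invented 4-class probabilities, columns assumed to be A, B, D, G.
p4 = np.array([[0.90, 0.05, 0.03, 0.02],
               [0.50, 0.20, 0.20, 0.10],
               [0.10, 0.60, 0.20, 0.10]])

# Collapse the three condition classes into one "class 1" column.
p2 = np.concatenate((p4[:, :1], p4[:, 1:].sum(axis=1, keepdims=True)), axis=1)

# Snap confident class-0 probabilities to 0 or 1, using the thresholds from the cell above.
p0 = p2[:, 0].copy()
p0[p0 > 0.863] = 1
p0[p0 < 0.137] = 0
```

Only the first and last rows cross a threshold; the middle row keeps its probability of 0.5.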


In [11]:
pd.read_csv('submission.csv')

   Id            class_0  class_1
0  00eed32682bb  0.5      0.5
1  010ebe33f668  0.5      0.5
2  02fa521e1838  0.5      0.5
3  040e15f562a2  0.5      0.5
4  046e85c7cc7f  0.5      0.5
