In [36]:
!pip install /kaggle/input/tabpfn-019-whl/tabpfn-0.1.9-py3-none-any.whl

Processing /kaggle/input/tabpfn-019-whl/tabpfn-0.1.9-py3-none-any.whl
tabpfn is already installed with the same version as the provided wheel. Use --force-reinstall to force an installation of the wheel.


The TabPFN 0.1.9 wheel is installed from a Kaggle dataset, which must first be added through the Data panel on the right side of the notebook: https://www.kaggle.com/datasets/carlmcbrideellis/tabpfn-019-whl

In [37]:
!mkdir /opt/conda/lib/python3.10/site-packages/tabpfn/models_diff
!cp /kaggle/input/tabpfn-019-whl/prior_diff_real_checkpoint_n_0_epoch_100.cpkt /opt/conda/lib/python3.10/site-packages/tabpfn/models_diff/

mkdir: cannot create directory ‘/opt/conda/lib/python3.10/site-packages/tabpfn/models_diff’: File exists


In [38]:
from tabpfn import TabPFNClassifier

Import libraries

In [39]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder,normalize
from sklearn.ensemble import GradientBoostingClassifier,RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer
import imblearn
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
import xgboost
import inspect
from collections import defaultdict
from tabpfn import TabPFNClassifier
import warnings
warnings.filterwarnings('ignore')

Import datasets in the Input section

In [40]:
train = pd.read_csv('/kaggle/input/icr-identify-age-related-conditions/train.csv')
test = pd.read_csv('/kaggle/input/icr-identify-age-related-conditions/test.csv')
sample = pd.read_csv('/kaggle/input/icr-identify-age-related-conditions/sample_submission.csv')
greeks = pd.read_csv('/kaggle/input/icr-identify-age-related-conditions/greeks.csv')

As already seen in the Dataset Visualization notebook, the EJ column is categorical and needs to be transformed to numeric.

In [41]:
# Get the first unique category from the 'EJ' column in the 'train' dataset
first_category = train.EJ.unique()[0]

# Convert 'EJ' column in 'train' dataset to binary format: 1 if it matches the first category, 0 otherwise
train.EJ = train.EJ.eq(first_category).astype('int')

# Convert 'EJ' column in 'test' dataset to binary format: 1 if it matches the first category, 0 otherwise
test.EJ = test.EJ.eq(first_category).astype('int')
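
The encoding above depends on which category happens to appear first in `train`. A minimal sketch of an order-independent alternative, assuming the column holds exactly two string categories. The toy values `'A'` and `'B'` and the helper name `encode_binary` are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

def encode_binary(series, positive_category):
    """Return 1 where the value equals positive_category, else 0."""
    return series.eq(positive_category).astype(int)

# Toy frames standing in for train/test (values are assumptions)
toy_train = pd.DataFrame({"EJ": ["B", "A", "B"]})
toy_test = pd.DataFrame({"EJ": ["A", "B"]})

# Pick the reference category once, independent of row order
positive = sorted(toy_train["EJ"].unique())[0]

# Apply the same reference category to both frames
toy_train["EJ"] = encode_binary(toy_train["EJ"], positive)
toy_test["EJ"] = encode_binary(toy_test["EJ"], positive)
```

Choosing the reference category by sorting rather than by position means that reshuffling or resampling the rows would not change the encoding.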

Our target "Class" is imbalanced. To balance the class distribution, the function below performs random undersampling on the training data: it randomly selects a subset of the majority class (class 0) whose size matches the number of samples in the minority class (class 1).

In [42]:
def random_under_sampler(df):
    # Calculate the number of samples for each label. 
    neg, pos = np.bincount(df['Class'])

    # Choose the samples with class label `1`.
    one_df = df.loc[df['Class'] == 1] 
    # Choose the samples with class label `0`.
    zero_df = df.loc[df['Class'] == 0]
    # Select `pos` number of negative samples.
    # This makes sure that we have equal number of samples for each label.
    zero_df = zero_df.sample(n=pos)

    # Join both label dataframes.
    undersampled_df = pd.concat([zero_df, one_df])

    # Shuffle the data and return
    return undersampled_df.sample(frac = 1)

In [43]:
train_good = random_under_sampler(train)

Now train_good holds the undersampled training DataFrame with a balanced class distribution: an equal number of samples from both classes.

Preparing the data for training a machine learning model.

In [44]:
# Get a list of predictor columns by excluding the 'Class' and 'Id' columns from the 'train' dataset
predictor_columns = [n for n in train.columns if n != 'Class' and n != 'Id']

# Assign the predictor columns to the variable 'x' for training the model
x = train[predictor_columns]

# Assign the 'Class' column as the target variable to be predicted and assign it to the variable 'y'
y = train['Class']

Now x and y can be used to train a model with an appropriate algorithm.

Two K-Fold cross-validation splitters are set up in order to get a more reliable estimate of the model's performance on unseen data and to help detect overfitting.


In [45]:
# Importing the necessary modules
from sklearn.model_selection import KFold, GridSearchCV

# Setting up the outer cross-validation strategy with 10 folds
cv_outer = KFold(n_splits=10, shuffle=True, random_state=0)

# Setting up the inner cross-validation strategy with 5 folds
cv_inner = KFold(n_splits=5, shuffle=True, random_state=0)

Submissions are evaluated using a balanced logarithmic loss, so that each class is roughly equally important for the final score. For this reason, the balanced_log_loss function below implements a balanced log loss metric for evaluating binary classifiers.

In [46]:
def balanced_log_loss(y_true, y_pred):
    # y_true: correct labels 0, 1
    # y_pred: predicted probabilities of class=1
    # calculate the number of observations for each class
    N_0 = np.sum(1 - y_true)
    N_1 = np.sum(y_true)
    # calculate the weights for each class to balance classes
    w_0 = 1 / N_0
    w_1 = 1 / N_1
    # calculate the predicted probabilities for each class
    p_1 = np.clip(y_pred, 1e-15, 1 - 1e-15)
    p_0 = 1 - p_1
    # calculate the summed log loss for each class
    log_loss_0 = -np.sum((1 - y_true) * np.log(p_0))
    log_loss_1 = -np.sum(y_true * np.log(p_1))
    # calculate the weighted summed logarithmic loss
    # (factor of 2 included to give same result as LL with balanced input)
    balanced_log_loss = 2*(w_0 * log_loss_0 + w_1 * log_loss_1) / (w_0 + w_1)
    # return the average log loss
    return balanced_log_loss/(N_0+N_1)
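
A quick sanity check of the comment about the factor of 2: with an equal number of samples in each class, the balanced loss should coincide with the ordinary log loss. The sketch below restates the same computation compactly under a different name (`balanced_log_loss_compact`, chosen here to avoid shadowing the function above) and compares it with a plain log loss on toy values, which are illustrative only:

```python
import numpy as np

def balanced_log_loss_compact(y_true, y_pred):
    # Same computation as the function above, restated compactly
    y_true = np.asarray(y_true, dtype=float)
    p_1 = np.clip(np.asarray(y_pred, dtype=float), 1e-15, 1 - 1e-15)
    p_0 = 1 - p_1
    n_0, n_1 = np.sum(1 - y_true), np.sum(y_true)
    w_0, w_1 = 1 / n_0, 1 / n_1
    ll_0 = -np.sum((1 - y_true) * np.log(p_0))
    ll_1 = -np.sum(y_true * np.log(p_1))
    return 2 * (w_0 * ll_0 + w_1 * ll_1) / (w_0 + w_1) / (n_0 + n_1)

def plain_log_loss(y_true, y_pred):
    # Ordinary (unweighted) binary log loss, averaged over all samples
    y_true = np.asarray(y_true, dtype=float)
    p_1 = np.clip(np.asarray(y_pred, dtype=float), 1e-15, 1 - 1e-15)
    return -np.mean(y_true * np.log(p_1) + (1 - y_true) * np.log(1 - p_1))

# Balanced toy input: two samples per class
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.8, 0.6])

balanced = balanced_log_loss_compact(y_true, y_prob)
plain = plain_log_loss(y_true, y_prob)
print(np.isclose(balanced, plain))
```

Note that both class counts must be non-zero, since the weights are computed as 1 / N_0 and 1 / N_1.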

The Ensemble class is a custom ensemble model that combines the predictions of two classifiers: XGBClassifier from XGBoost and TabPFNClassifier from the tabpfn package.

The fit method fits both classifiers on the training data (X and y) after median-imputing missing values. The predict_proba method returns probability predictions from the fitted ensemble by averaging the probabilities of the two classifiers.

It's worth noting that predict_proba also adjusts for class imbalance. Each class column of the averaged probabilities is divided by an estimated instance count (the summed probability of class 0 for the first column, the summed probability of all other classes for the remaining columns), and each row is then renormalized to sum to 1. This can be beneficial when dealing with imbalanced datasets.

In [47]:
class Ensemble():
    def __init__(self):
        # Initializing the Ensemble class
        self.imputer = SimpleImputer(missing_values=np.nan, strategy='median')
        self.classifiers = [ xgboost.XGBClassifier(),
                            TabPFNClassifier(N_ensemble_configurations=64)]
    
    def fit(self, X, y):
        # Preparing the data for training the ensemble
        y = y.values
        unique_classes, y = np.unique(y, return_inverse=True)
        self.classes_ = unique_classes
        
        # Converting the 'EJ' column to binary format
        # (if 'EJ' is already 0/1, the result depends on which value appears first in X)
        first_category = X.EJ.unique()[0]
        X.EJ = X.EJ.eq(first_category).astype('int')
        
        # Imputing missing values in the data
        X = self.imputer.fit_transform(X)
        
        # Fitting the classifiers in the ensemble
        for classifier in self.classifiers:
            if classifier == self.classifiers[1]:
                # Special case for TabPFNClassifier with overwrite_warning=True
                classifier.fit(X, y, overwrite_warning=True)
            else:
                classifier.fit(X, y)
     
    def predict_proba(self, x):
        # Preprocessing the data for prediction
        x = self.imputer.transform(x)
        
        # Obtaining probabilities from each classifier in the ensemble
        probabilities = np.stack([classifier.predict_proba(x) for classifier in self.classifiers])
        
        # Averaging the probabilities across classifiers
        averaged_probabilities = np.mean(probabilities, axis=0)
        
        # Calculating class imbalance weights
        class_0_est_instances = averaged_probabilities[:, 0].sum()
        others_est_instances = averaged_probabilities[:, 1:].sum()
        
        # Weighted probabilities based on class imbalance
        new_probabilities = averaged_probabilities * np.array([[1/(class_0_est_instances if i==0 else others_est_instances) for i in range(averaged_probabilities.shape[1])]])
        
        # Normalizing the probabilities
        return new_probabilities / np.sum(new_probabilities, axis=1, keepdims=1)
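
The reweighting step in predict_proba can be traced on a small example. This is a sketch on made-up probabilities for two classes, not output from the actual classifiers:

```python
import numpy as np

# Toy averaged probabilities: rows are samples, columns are classes
averaged = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.6, 0.4],
    [0.3, 0.7],
])

# Estimated number of instances: class 0 vs. all other classes
class_0_est = averaged[:, 0].sum()
others_est = averaged[:, 1:].sum()

# One weight per column: 1/class_0_est for column 0, 1/others_est for the rest
weights = np.where(np.arange(averaged.shape[1]) == 0,
                   1 / class_0_est, 1 / others_est)

# Scale each column, then renormalize every row to sum to 1
reweighted = averaged * weights
reweighted = reweighted / reweighted.sum(axis=1, keepdims=True)
print(reweighted.round(3))
```

Because class 0 carries more total probability mass in this toy input, its column is scaled down relative to the other one. The third sample, which slightly favored class 0 before reweighting, ends up favoring class 1.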

In [48]:
# tqdm displays a progress bar, which gives a visual indication of progress and helps estimate the time remaining.
from tqdm.notebook import tqdm

The training function trains and evaluates a given model with K-Fold cross-validation. Only the 5-fold cv_inner splitter is used here; cv_outer is not used, so this is not nested cross-validation. In each split the model is fit on the y_meta labels, its predicted probabilities are collapsed to class 0 versus all other classes and converted to hard class labels, and these labels are scored against y with the balanced log loss.

In [49]:
def training(model, x, y, y_meta):
    # Initialize variables for tracking results
    outer_results = list()
    best_loss = np.inf
    split = 0
    splits = 5
    
    # Perform inner cross-validation loop
    for train_idx, val_idx in tqdm(cv_inner.split(x), total=splits):
        split += 1
        
        # Split data into train and validation sets
        x_train, x_val = x.iloc[train_idx], x.iloc[val_idx]
        y_train, y_val = y_meta.iloc[train_idx], y.iloc[val_idx]
                
        # Fit the model on the training data
        model.fit(x_train, y_train)
        
        # Make predictions on the validation data
        y_pred = model.predict_proba(x_val)
        
        # Transform predicted probabilities into class labels
        probabilities = np.concatenate((y_pred[:, :1], np.sum(y_pred[:, 1:], 1, keepdims=True)), axis=1)
        p0 = probabilities[:, :1]
        p0[p0 > 0.86] = 1
        p0[p0 < 0.14] = 0
        y_p = np.empty((y_pred.shape[0],))
        
        # Assign class labels based on transformed probabilities
        for i in range(y_pred.shape[0]):
            if p0[i] >= 0.5:
                y_p[i] = False
            else:
                y_p[i] = True
        y_p = y_p.astype(int)
        
        # Calculate the balanced log loss for validation data
        loss = balanced_log_loss(y_val, y_p)

        # Update the best model and loss if current loss is lower
        if loss < best_loss:
            best_model = model
            best_loss = loss
            print('best_model_saved')
        
        # Append the loss to outer_results list and print current loss
        outer_results.append(loss)
        print('>val_loss=%.5f, split = %.1f' % (loss, split))
    
    # Print the mean loss across all splits
    print('LOSS: %.5f' % (np.mean(outer_results)))
    
    # Return the best model
    return best_model
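
Note that `best_model = model` binds a second name to the same object, so later calls to `fit` also change what `best_model` refers to. A minimal sketch of how a snapshot could be kept with `copy.deepcopy`, using a hypothetical `ToyModel` class in place of the ensemble:

```python
import copy

class ToyModel:
    """Hypothetical stand-in for a model whose fit() overwrites its state."""
    def __init__(self):
        self.fitted_on = None

    def fit(self, fold_id):
        self.fitted_on = fold_id

model = ToyModel()

model.fit(1)
alias = model                    # same object as `model`
snapshot = copy.deepcopy(model)  # independent copy of the fold-1 state

model.fit(2)                     # refit on another fold

print(alias.fitted_on, snapshot.fitted_on)
```

The alias follows the refit while the snapshot keeps the earlier state. Whether deep-copying is practical for a large model such as TabPFN is not something this notebook establishes.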

greeks.csv is supplemental metadata, available only for the training set.
Below, the 'times' Series holds datetime ordinal values for known dates and NaN for 'Unknown' dates, which makes it usable as a numeric feature.

In [50]:
# Importing the necessary module
from datetime import datetime

# Copying the 'Epsilon' column to 'times'
times = greeks.Epsilon.copy()

# Converting non-'Unknown' values in 'Epsilon' column to datetime ordinal
times[greeks.Epsilon != 'Unknown'] = greeks.Epsilon[greeks.Epsilon != 'Unknown'].map(lambda x: datetime.strptime(x, '%m/%d/%Y').toordinal())

# Setting 'Unknown' values in 'Epsilon' column to NaN
times[greeks.Epsilon == 'Unknown'] = np.nan
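
The same conversion can be illustrated on a few strings using only the standard library. The sample values and the helper name `to_ordinal` are illustrative assumptions, not rows from greeks.csv:

```python
from datetime import datetime
import math

def to_ordinal(value):
    """Map an 'MM/DD/YYYY' string to its date ordinal, and 'Unknown' to NaN."""
    if value == "Unknown":
        return math.nan
    return datetime.strptime(value, "%m/%d/%Y").toordinal()

# Illustrative values only
sample = ["3/19/2019", "Unknown", "3/20/2019"]
ordinals = [to_ordinal(v) for v in sample]
print(ordinals)
```

Consecutive days map to consecutive integers, so the ordinal behaves like a day counter that a model can treat as an ordinary numeric column.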

Next, the data is prepared for making predictions on the 'test' dataset. The 'train' dataset and the 'times' column are concatenated column-wise, the predictor columns are selected from 'test', and a time column is appended to them in which every row is set to the maximum of 'train_pred_and_time.Epsilon' plus 1.

In [51]:
# Concatenate the 'train' dataset and 'times' column along the columns axis
train_pred_and_time = pd.concat((train, times), axis=1)

# Select predictor columns from the 'test' dataset
test_predictors = test[predictor_columns]

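# Convert the 'EJ' column in 'test_predictors' to binary format
# ('EJ' in 'test' was already encoded above, so this second pass depends on which value appears first)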
first_category = test_predictors.EJ.unique()[0]
test_predictors.EJ = test_predictors.EJ.eq(first_category).astype('int')

# Append to 'test_predictors' a constant column equal to the maximum of 'train_pred_and_time.Epsilon' plus 1
test_pred_and_time = np.concatenate((test_predictors, np.zeros((len(test_predictors), 1)) + train_pred_and_time.Epsilon.max() + 1), axis=1)

Random oversampling with RandomOverSampler is used to address the class imbalance in the 'train_pred_and_time' dataset with respect to the 'greeks.Alpha' target variable. Random oversampling increases the number of instances of each minority class by randomly duplicating its samples until it has as many samples as the majority class.

In [52]:
# Initialize the RandomOverSampler with a random state of 0
ros = RandomOverSampler(random_state=0)

# Perform oversampling on the 'train_pred_and_time' dataset and 'greeks.Alpha' target variable
train_ros, y_ros = ros.fit_resample(train_pred_and_time, greeks.Alpha)

# Print the value counts of the original 'Alpha' classes
print('Original dataset shape')
print(greeks.Alpha.value_counts())

# Print the value counts of the resampled 'Alpha' classes
print('Resample dataset shape')
print(y_ros.value_counts())

Original dataset shape
A    509
B     61
G     29
D     18
Name: Alpha, dtype: int64
Resample dataset shape
B    509
A    509
D    509
G    509
Name: Alpha, dtype: int64


The variables x_ros and y_ are created for training a machine learning model on the oversampled dataset (train_ros) to predict the target variable Class.

* x_ros: A DataFrame containing the input features from the oversampled dataset: the predictor columns plus the Epsilon time column. It is ready to be used as input for training a machine learning model.

* y_: A Series containing the target variable 'Class' (binary labels) from the oversampled dataset. It represents the class labels corresponding to the input features in x_ros and is used as the target variable during model training.

In [53]:
x_ros = train_ros.drop(['Class', 'Id'],axis=1)
y_ = train_ros.Class

Variable yt is created as an instance of the Ensemble class. 

In [54]:
yt = Ensemble()

Loading model that can be used for inference only
Using a Transformer with 25.82 M parameters


The training function is used to train the Ensemble model yt on the oversampled dataset x_ros. The model is fit on the Alpha labels y_ros and validated against the binary Class labels y_. The function runs 5-fold cross-validation (not nested) and returns the model as m, which can then be used to make predictions on new data. Because the same model object is refit in every split, m holds the fit from the last split.

In [55]:
m = training(yt,x_ros,y_,y_ros)

  0%|          | 0/5 [00:00<?, ?it/s]

best_model_saved
>val_loss=0.00000, split = 1.0


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X.EJ = X.EJ.eq(first_category).astype('int')


best_model_saved
>val_loss=0.00000, split = 2.0



>val_loss=0.13011, split = 3.0



>val_loss=0.39033, split = 4.0



>val_loss=0.12761, split = 5.0
LOSS: 0.12961


In [56]:
# Relative frequency (proportion) of each class in the target variable y_ after random oversampling. The oversampling balanced Alpha, not Class, so Class itself is split 75/25.
y_.value_counts()/y_.shape[0]

1    0.75
0    0.25
Name: Class, dtype: float64

y_pred stores the probability predictions made by the trained ensemble model m on the test dataset test_pred_and_time, obtained with the model's predict_proba method.

In [57]:
y_pred = m.predict_proba(test_pred_and_time)



The predicted probabilities y_pred for the test dataset are collapsed into two columns: class 0 and the sum of all other classes. Thresholding is then applied to the class 0 probability: confident values are snapped to exactly 0 or 1, and values in between are left unchanged. The variable p0 holds the resulting class 0 probabilities.

In [58]:
# Transform the predicted probabilities into class probabilities
probabilities = np.concatenate((y_pred[:, :1], np.sum(y_pred[:, 1:], 1, keepdims=True)), axis=1)

# Extract the column of probabilities for class 0
p0 = probabilities[:, :1]

# Threshold the probabilities to assign class labels
p0[p0 > 0.62] = 1  # Set class 0 probabilities above 0.62 to 1
p0[p0 < 0.26] = 0  # Set class 0 probabilities below 0.26 to 0
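
The same snapping logic can be written without modifying the input array. In the cell above, `p0` is a slice of `probabilities`, so the in-place assignments also change `probabilities`. The sketch below uses a hypothetical helper name, `snap_confident`, and toy values:

```python
import numpy as np

def snap_confident(p0, lower, upper):
    """Set values above `upper` to 1 and below `lower` to 0; leave the rest unchanged."""
    p0 = np.asarray(p0, dtype=float)
    return np.where(p0 > upper, 1.0, np.where(p0 < lower, 0.0, p0))

# Toy class 0 probabilities (illustrative only)
toy_p0 = np.array([0.05, 0.30, 0.50, 0.70, 0.95])
snapped = snap_confident(toy_p0, lower=0.26, upper=0.62)
print(snapped)
```

Values between the two thresholds pass through unchanged, which is why a submission can still contain probabilities such as 0.5.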

Prepare the submission file for the binary classification task. The DataFrame will have three columns: 'Id', 'class_0', and 'class_1'.

In [59]:
# Create a DataFrame for the submission with the 'Id' column from the 'test' dataset
submission = pd.DataFrame(test["Id"], columns=["Id"])

# Add the 'class_0' column containing the thresholded predictions for class 0
submission["class_0"] = p0

# Add the 'class_1' column containing the complement of the thresholded predictions for class 0
submission["class_1"] = 1 - p0

# Save the submission DataFrame to a CSV file named 'submission.csv' without including the index column
submission.to_csv('submission.csv', index=False)

In [60]:
#Visualization of submission table
submission_df = pd.read_csv('submission.csv')
submission_df

   Id            class_0  class_1
0  00eed32682bb  0.5      0.5
1  010ebe33f668  0.5      0.5
2  02fa521e1838  0.5      0.5
3  040e15f562a2  0.5      0.5
4  046e85c7cc7f  0.5      0.5
