# 📈 Boosting Tabular Performance with TabPFN, Optuna & Voting Ensemble

> 📌 **Inspired by:** [H-Z-Ning on Kaggle](https://www.kaggle.com/hzning)
> ✍️ **Extended & Enhanced by:** \[Your Name]
> 📅 **Date:** July 12, 2025
> 🎯 **Competition:** [Kaggle Playground Series - Season 5, Episode 7](https://www.kaggle.com/competitions/playground-series-s5e7)

---

## 🔍 Smart Feature Engineering + Optuna Tuning

* **Correlation & Redundancy Checks**
* **Missing Value Imputation**
* **Feature Engineering (new features, noise removal)**

> 🎯 **Bonus: Automated Tuning with Optuna**
> To push model performance further, I used **Optuna** to tune the hyperparameters of XGBoost, LightGBM, and CatBoost. Each model gets 30 trials scored by 3-fold cross-validated accuracy, which replaces manual search with an automated one.

---

## 🧠 Ensemble Learning with TabPFN + Voting

Rather than relying on a single model, I used a **voting ensemble**, combining strengths across a set of strong learners:

### 🔧 Models Included:

* XGBoost
* LightGBM
* CatBoost
* TabPFN ([Transformer for Tabular Data](https://github.com/PriorLabs/TabPFN))
* MLP (Multi-Layer Perceptron)

### 🧪 Voting Strategy:

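The ensemble takes a weighted average of the predicted class probabilities across all models (**soft voting**, with weights 3:3:2:1:1 for XGBoost, CatBoost, LightGBM, TabPFN, and MLP) to make final predictions, which tends to improve both **robustness** and **performance consistency**.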
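
As a rough illustration of what weighted soft voting computes, the sketch below averages class-1 probabilities from five models for three samples. The probabilities are made up for the example; only the weights are taken from the ensemble defined later in this notebook.

```python
import numpy as np

# Hypothetical class-1 probabilities (rows = models, columns = samples).
probs = np.array([
    [0.90, 0.40, 0.20],  # xgb
    [0.80, 0.55, 0.10],  # cat
    [0.85, 0.45, 0.30],  # lgbm
    [0.70, 0.60, 0.25],  # tabpfn
    [0.60, 0.35, 0.40],  # mlp
])
weights = np.array([3, 3, 2, 1, 1])

# Weighted average over the model axis, then a 0.5 cut-off for the label.
avg = np.average(probs, axis=0, weights=weights)
labels = (avg >= 0.5).astype(int)
print(avg, labels)
```

The second sample shows why the weights matter: two of the five models vote above 0.5, but the heavier-weighted models pull the average below the cut-off.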

# **1. Install TabPFN**

In [1]:
%%capture
! pip install tabpfn
! git clone https://github.com/priorlabs/tabpfn-extensions.git
! pip install -e tabpfn-extensions
! pip install holoviews hvplot

# **2. Import Libraries**

In [2]:
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import optuna

from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, StandardScaler
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.cluster import KMeans
from sklearn.ensemble import VotingClassifier
from sklearn.neural_network import MLPClassifier
from category_encoders import TargetEncoder
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

warnings.filterwarnings("ignore")
plt.style.use("seaborn-v0_8-darkgrid")
plt.rc("font", size=15)

# Load the competition data and the original personality dataset
train_df = pd.read_csv("/kaggle/input/playground-series-s5e7/train.csv")
datasert_df = pd.read_csv("/kaggle/input/extrovert-vs-introvert-behavior-data-backup/personality_datasert.csv")
test_df = pd.read_csv("/kaggle/input/playground-series-s5e7/test.csv")

# **3. Feature Engineering**

In [3]:
merge_cols = ['Time_spent_Alone', 'Stage_fear', 'Social_event_attendance',
              'Going_outside', 'Drained_after_socializing',
              'Friends_circle_size', 'Post_frequency']

# Keep one row per feature combination so the left merge cannot duplicate rows
datasert_df = (
    datasert_df
    .rename(columns={'Personality': 'match_p'})
    .drop_duplicates(merge_cols)
)

# Attach the original dataset's label as 'match_p' wherever all features match exactly
train_df = train_df.merge(datasert_df, how='left', on=merge_cols)
test_df = test_df.merge(datasert_df, how='left', on=merge_cols)

train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18524 entries, 0 to 18523
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         18524 non-null  int64  
 1   Time_spent_Alone           17334 non-null  float64
 2   Stage_fear                 16631 non-null  object 
 3   Social_event_attendance    17344 non-null  float64
 4   Going_outside              17058 non-null  float64
 5   Drained_after_socializing  17375 non-null  object 
 6   Friends_circle_size        17470 non-null  float64
 7   Post_frequency             17260 non-null  float64
 8   Personality                18524 non-null  object 
 9   match_p                    178 non-null    object 
dtypes: float64(5), int64(1), object(4)
memory usage: 1.4+ MB
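
The `info()` output above shows that only 178 of the 18,524 training rows found an exact match in the original dataset, and that the row count is unchanged by the merge. The toy sketch below (made-up frames and column names, not the competition data) illustrates why the `drop_duplicates` step matters: without it, a left merge repeats every row that matches more than one key.

```python
import pandas as pd

keys = ["a", "b"]
# Hypothetical stand-ins for the competition data and the original dataset.
comp = pd.DataFrame({"id": [0, 1, 2], "a": [1, 2, 3], "b": [0, 0, 1]})
orig = pd.DataFrame({"a": [1, 1, 9], "b": [0, 0, 9], "label": ["X", "X", "Y"]})

# Without de-duplication, id 0 matches two rows and is repeated.
naive = comp.merge(orig, how="left", on=keys)

# With de-duplication, the row count of the left frame is preserved.
deduped = comp.merge(orig.drop_duplicates(keys), how="left", on=keys)

print(len(naive), len(deduped))
print(deduped)
```

Rows with no match keep a missing `label`, which is the same pattern as the mostly empty `match_p` column.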


In [4]:
numeric_df = train_df.select_dtypes(include='number').drop(columns=['id'])
corr = numeric_df.corr()

fig = go.Figure(
    data=go.Heatmap(
        z=corr.values,
        x=corr.columns,
        y=corr.columns,
        colorscale='RdBu',
        zmin=-1, zmax=1,
        colorbar=dict(title="Correlation"),
    )
)

for i in range(len(corr)):
    for j in range(len(corr)):
        fig.add_annotation(
            text=str(round(corr.values[i][j], 2)),
            x=corr.columns[j],
            y=corr.columns[i],
            showarrow=False,
            font=dict(color='black' if abs(corr.values[i][j]) < 0.5 else 'white')
        )

fig.update_layout(
    title="Correlation Heatmap (Interactive)",
    xaxis=dict(tickangle=45),
    yaxis=dict(autorange='reversed'),
    width=700, height=600
)

fig.show()

In [5]:
train_ID = train_df['id']
test_ID = test_df['id']

# Drop the 'id' column since it is not needed for prediction.
train_df.drop("id", axis = 1, inplace = True)
test_df.drop("id", axis = 1, inplace = True)

ntrain = train_df.shape[0] 
ntest = test_df.shape[0]
y_train = train_df['Personality'].map({'Extrovert': 1, 'Introvert': 0}).values  # training-set target

all_data = pd.concat((train_df, test_df)).reset_index(drop=True)
all_data.drop(['Personality'], axis=1, inplace=True)

In [6]:
def fill_missing_by_quantile_group(df, group_source_col, target_col, quantiles=[0, 0.25, 0.5, 0.75, 1.0], labels=None):
    """
    Group the target_col based on quantiles of group_source_col, and fill missing values in target_col
    within each group using the group's median.
    
    Parameters:
        df (pd.DataFrame): Original dataset
        group_source_col (str): Column used for grouping (numerical)
        target_col (str): Target column to fill missing values
        quantiles (list): Quantile cut points for grouping (default is quartiles)
        labels (list): Labels for each group (default auto-generated as Q1, Q2, ...)
    
    Returns:
        pd.DataFrame: DataFrame with missing values filled (in-place modification)
    """
    #  Automatically generate group labels
    if labels is None:
        labels = [f'Q{i+1}' for i in range(len(quantiles)-1)]

    temp_bin_col = f'{group_source_col}_bin'

    # Step 1: Create grouping column
    df[temp_bin_col] = pd.qcut(df[group_source_col], q=quantiles, labels=labels)

    # Step 2: Fill missing values within each group using the group's median
    df[target_col] = df[target_col].fillna(df.groupby(temp_bin_col)[target_col].transform('median'))

    # Step 3: Remove the temporary grouping column
    df.drop(columns=[temp_bin_col], inplace=True)

    return df

all_data = fill_missing_by_quantile_group(
    df=all_data,
    group_source_col='Social_event_attendance',
    target_col='Time_spent_Alone'
)

all_data = fill_missing_by_quantile_group(
    df=all_data,
    group_source_col='Going_outside',
    target_col='Time_spent_Alone'
)

all_data = fill_missing_by_quantile_group(
    df=all_data,
    group_source_col='Friends_circle_size',
    target_col='Social_event_attendance'
)

all_data = fill_missing_by_quantile_group(
    df=all_data,
    group_source_col='Going_outside',
    target_col='Social_event_attendance'
)

all_data = fill_missing_by_quantile_group(
    df=all_data,
    group_source_col='Post_frequency',
    target_col='Social_event_attendance'
)


all_data = fill_missing_by_quantile_group(
    df=all_data,
    group_source_col='Social_event_attendance',
    target_col='Going_outside'
)

all_data = fill_missing_by_quantile_group(
    df=all_data,
    group_source_col='Post_frequency',
    target_col='Friends_circle_size'
)
all_data = fill_missing_by_quantile_group(
    df=all_data,
    group_source_col='Going_outside',
    target_col='Friends_circle_size'
)
all_data = fill_missing_by_quantile_group(
    df=all_data,
    group_source_col='Friends_circle_size',
    target_col='Post_frequency'
)
all_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24699 entries, 0 to 24698
Data columns (total 8 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Time_spent_Alone           24699 non-null  float64
 1   Stage_fear                 22208 non-null  object 
 2   Social_event_attendance    24699 non-null  float64
 3   Going_outside              24699 non-null  float64
 4   Drained_after_socializing  23118 non-null  object 
 5   Friends_circle_size        24699 non-null  float64
 6   Post_frequency             24699 non-null  float64
 7   match_p                    236 non-null    object 
dtypes: float64(5), object(3)
memory usage: 1.5+ MB
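
To make the imputation step concrete, here is a small sketch of the same idea on made-up data: bin one column by quantiles, then fill gaps in another column with the median of its bin. The frame and the two-bin split are assumptions for illustration; the notebook itself uses quartiles.

```python
import numpy as np
import pandas as pd

# Toy data: 'source' defines the bins, 'target' has two missing values.
toy = pd.DataFrame({
    "source": [1, 2, 3, 4, 5, 6, 7, 8],
    "target": [10.0, np.nan, 12.0, 14.0, 20.0, 22.0, np.nan, 30.0],
})

# Split 'source' at its median into a low and a high bin.
bins = pd.qcut(toy["source"], q=[0, 0.5, 1.0], labels=["low", "high"])

# Median of 'target' within each bin, aligned to the original rows.
group_median = toy.groupby(bins, observed=True)["target"].transform("median")
toy["target_filled"] = toy["target"].fillna(group_median)
print(toy)
```

A row whose `source` value is itself missing falls into no bin and stays unfilled in that pass, which is presumably why the notebook repeats the fill with several different grouping columns.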


In [7]:
all_data.fillna({
    'Stage_fear': 'UnKnow',
    'Drained_after_socializing': 'UnKnow'
}, inplace=True)
all_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24699 entries, 0 to 24698
Data columns (total 8 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Time_spent_Alone           24699 non-null  float64
 1   Stage_fear                 24699 non-null  object 
 2   Social_event_attendance    24699 non-null  float64
 3   Going_outside              24699 non-null  float64
 4   Drained_after_socializing  24699 non-null  object 
 5   Friends_circle_size        24699 non-null  float64
 6   Post_frequency             24699 non-null  float64
 7   match_p                    236 non-null    object 
dtypes: float64(5), object(3)
memory usage: 1.5+ MB


In [8]:
all_data = pd.get_dummies(all_data, columns=['Stage_fear', 'Drained_after_socializing','match_p'], prefix=['Stage', 'Drained','match'])
all_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24699 entries, 0 to 24698
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Time_spent_Alone         24699 non-null  float64
 1   Social_event_attendance  24699 non-null  float64
 2   Going_outside            24699 non-null  float64
 3   Friends_circle_size      24699 non-null  float64
 4   Post_frequency           24699 non-null  float64
 5   Stage_No                 24699 non-null  bool   
 6   Stage_UnKnow             24699 non-null  bool   
 7   Stage_Yes                24699 non-null  bool   
 8   Drained_No               24699 non-null  bool   
 9   Drained_UnKnow           24699 non-null  bool   
 10  Drained_Yes              24699 non-null  bool   
 11  match_Extrovert          24699 non-null  bool   
 12  match_Introvert          24699 non-null  bool   
dtypes: bool(8), float64(5)
memory usage: 1.1 MB
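
Note that `match_p` was still missing for most rows before encoding, yet no `match_UnKnow` column appears above. With its default settings `pd.get_dummies` does not create an indicator for missing values, so unmatched rows end up with both `match_` columns set to False. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy column with two missing entries.
toy = pd.DataFrame({"match_p": ["Extrovert", np.nan, "Introvert", np.nan]})
dummies = pd.get_dummies(toy, columns=["match_p"], prefix=["match"])

# Rows that had a missing value have no active indicator at all.
row_sums = dummies.sum(axis=1)
print(dummies)
print(row_sums.tolist())
```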


# **4. Models Training**

In [9]:
X_train = all_data[:ntrain]
X_test = all_data[ntrain:]
X=X_train
y=y_train

In [10]:
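# y_train encodes Extrovert as 1, so the sum counts the positive class
n_pos = y_train.sum()
n_neg = len(y_train) - n_pos
# XGBoost convention: negatives / positives
scale_pos_weight = n_neg / n_pos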
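
For reference, the XGBoost documentation suggests `sum(negative instances) / sum(positive instances)` as a typical value for `scale_pos_weight`. The sketch below works through that ratio on a made-up label vector; the variable names are illustrative only.

```python
import numpy as np

# Toy labels: 6 positives (1) and 2 negatives (0).
y_toy = np.array([1, 1, 1, 0, 1, 1, 0, 1])

n_pos_toy = int(y_toy.sum())
n_neg_toy = len(y_toy) - n_pos_toy
ratio = n_neg_toy / n_pos_toy
print(n_pos_toy, n_neg_toy, ratio)
```

When the positive class is the majority, as in this toy vector, the ratio is below 1 and the positive class is down-weighted.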

In [11]:
def tune_xgb_with_optuna(X, y, scale_pos_weight):
    # Settings that are not tuned. They must be passed to the final model as well,
    # because study.best_params only contains the values suggested by the trial.
    fixed_params = {
        'scale_pos_weight': scale_pos_weight,
        'random_state': 42,
        'eval_metric': 'logloss'
    }

    def objective(trial):
        params = {
            'max_depth': trial.suggest_int('max_depth', 3, 10),
            'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
            'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
            'subsample': trial.suggest_float('subsample', 0.5, 1.0),
            'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
            'gamma': trial.suggest_float('gamma', 0, 5),
            'reg_alpha': trial.suggest_float('reg_alpha', 0, 5),
            'reg_lambda': trial.suggest_float('reg_lambda', 0, 5),
            **fixed_params
        }
        model = XGBClassifier(**params)
        # Mean 3-fold cross-validated accuracy is the value Optuna maximizes
        return cross_val_score(model, X, y, cv=3, scoring='accuracy').mean()

    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=30)
    print("✅ Best XGB params:", study.best_params)
    return XGBClassifier(**study.best_params, **fixed_params)


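def tune_cat_with_optuna(X, y, scale_pos_weight):
    # Settings that are not tuned. They must be passed to the final model as well,
    # because study.best_params only contains the values suggested by the trial.
    fixed_params = {
        'random_seed': 42,
        'verbose': 0,
        # [weight for class 0, weight for class 1]; the positive class gets
        # scale_pos_weight, matching how XGBoost applies it
        'class_weights': [1, scale_pos_weight]
    }

    def objective(trial):
        params = {
            'iterations': trial.suggest_int('iterations', 100, 500),
            'depth': trial.suggest_int('depth', 4, 10),
            'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
            'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1, 10),
            **fixed_params
        }
        model = CatBoostClassifier(**params)
        # Mean 3-fold cross-validated accuracy is the value Optuna maximizes
        return cross_val_score(model, X, y, cv=3, scoring='accuracy').mean()

    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=30)
    print("✅ Best CatBoost params:", study.best_params)
    return CatBoostClassifier(**study.best_params, **fixed_params)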


def tune_lgbm_with_optuna(X, y, scale_pos_weight):
    # Settings that are not tuned. They must be passed to the final model as well,
    # because study.best_params only contains the values suggested by the trial.
    fixed_params = {
        # The positive class gets scale_pos_weight, matching how XGBoost applies it
        'class_weight': {0: 1, 1: scale_pos_weight},
        # In LightGBM, 'subsample' only takes effect when subsample_freq > 0
        'subsample_freq': 1,
        'random_state': 42
    }

    def objective(trial):
        params = {
            'num_leaves': trial.suggest_int('num_leaves', 20, 60),
            'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
            'n_estimators': trial.suggest_int('n_estimators', 100, 500),
            'subsample': trial.suggest_float('subsample', 0.5, 1.0),
            'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
            'reg_alpha': trial.suggest_float('reg_alpha', 0.0, 5.0),
            'reg_lambda': trial.suggest_float('reg_lambda', 0.0, 5.0),
            **fixed_params
        }
        model = LGBMClassifier(**params)
        # Mean 3-fold cross-validated accuracy is the value Optuna maximizes
        return cross_val_score(model, X, y, cv=3, scoring='accuracy').mean()

    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=30)
    print("✅ Best LGBM params:", study.best_params)
    return LGBMClassifier(**study.best_params, **fixed_params)

In [12]:
import os
from contextlib import redirect_stdout, redirect_stderr

# Suppress Optuna logging
optuna.logging.set_verbosity(optuna.logging.WARNING)

# Silence all stdout/stderr during tuning (this also hides the "Best params" prints)
with open(os.devnull, 'w') as fnull:
    with redirect_stdout(fnull), redirect_stderr(fnull):
        xgb = tune_xgb_with_optuna(X, y, scale_pos_weight)
        cat = tune_cat_with_optuna(X, y, scale_pos_weight)
        lgbm = tune_lgbm_with_optuna(X, y, scale_pos_weight)

In [13]:
mlp = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    activation='relu',
    solver='adam',
    learning_rate_init=0.001,
    max_iter=300,
    random_state=42
)

In [14]:
from tabpfn import TabPFNClassifier

# Initialize a classifier
clf = TabPFNClassifier(ignore_pretraining_limits=True)

In [15]:
ensemble = VotingClassifier(
    estimators=[
        ('xgb', xgb),
        ('cat', cat),
        ('lgbm', lgbm),
        ('tabpfn', clf),
        ('mlp', mlp)
    ],
    voting='soft',
    weights=[3, 3, 2, 1, 1]
)

In [16]:
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

In [17]:
with open(os.devnull, 'w') as fnull:
    with redirect_stdout(fnull), redirect_stderr(fnull):
        ensemble.fit(X_train, y_train)


In [18]:
best_threshold = 0.5
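
The notebook holds out 20% of the training data but keeps the threshold fixed at 0.5. If one wanted to choose the threshold from the hold-out set instead, a minimal sketch could look like the following. `pick_threshold` is a hypothetical helper, the probabilities are made up, and in the notebook the inputs would presumably be `ensemble.predict_proba(X_val)[:, 1]` and `y_val`.

```python
import numpy as np

def pick_threshold(probs, y_true, grid=None):
    """Return (threshold, accuracy) for the grid value with the highest accuracy.

    Ties are resolved in favour of the smallest threshold.
    """
    if grid is None:
        grid = np.linspace(0.1, 0.9, 17)  # 0.10, 0.15, ..., 0.90
    probs = np.asarray(probs)
    y_true = np.asarray(y_true)
    accs = [((probs >= t).astype(int) == y_true).mean() for t in grid]
    best = int(np.argmax(accs))
    return float(grid[best]), float(accs[best])

# Made-up validation probabilities and labels for illustration.
val_probs = np.array([0.95, 0.80, 0.65, 0.47, 0.32, 0.12])
val_true = np.array([1, 1, 1, 1, 0, 0])

thr, acc = pick_threshold(val_probs, val_true)
print(thr, acc)
```

A threshold tuned on a single 20% split can overfit that split, so any gain over 0.5 is worth confirming with cross-validation before relying on it.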

# **5. Prediction**

In [19]:
test_probs = ensemble.predict_proba(X_test)[:, 1]
test_preds = (test_probs >= best_threshold).astype(int)

# Create submission
submission = pd.DataFrame({
    'id': test_ID,
    'Personality': test_preds
})
print(submission.head())
submission['Personality'] = submission['Personality'].map({1: 'Extrovert', 0: 'Introvert'})
submission.to_csv('submission.csv', index=False)
print("Submitted successfully")

      id  Personality
0  18524            1
1  18525            0
2  18526            1
3  18527            1
4  18528            0
Submitted successfully
