## Introduction: Interaction Features & Target Encoding with CatBoost

In my previous [notebook](https://www.kaggle.com/code/masayakawamata/s5e11-te-xgb-interaction-features/notebook), I used a feature engineering (FE) strategy with two steps:
1.  Manually creating interaction terms.
2.  Applying Target Encoding (TE) to these new, high-cardinality features to make them useful for the model.
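
To make the two steps concrete, here is a minimal pure-Python sketch. It is not the code from the previous notebook: the helper names `make_interaction` and `oof_target_encode`, the toy data, and the simple index-based fold assignment are all assumptions chosen for illustration.

```python
from collections import defaultdict

def make_interaction(col_a, col_b):
    """Join two columns value by value into one string-valued categorical column."""
    return [f'{a}_{b}' for a, b in zip(col_a, col_b)]

def oof_target_encode(categories, targets, n_folds=2):
    """Encode each row with its category's target mean computed on the other folds only."""
    n = len(categories)
    global_mean = sum(targets) / n
    # Fall back to the global mean when a category is unseen outside the held-out fold.
    encoded = [global_mean] * n
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        # Rows with index % n_folds == fold are held out; the rest provide the statistics.
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
        for i in range(n):
            if i % n_folds == fold and counts[categories[i]] > 0:
                encoded[i] = sums[categories[i]] / counts[categories[i]]
    return encoded

# Toy data (hypothetical values, not taken from the competition files).
grade = ['A', 'A', 'B', 'B']
purpose = ['car', 'car', 'home', 'home']
paid_back = [1, 0, 1, 1]

interaction = make_interaction(grade, purpose)
encoded = oof_target_encode(interaction, paid_back, n_folds=2)
print(interaction)
print(encoded)
```

Computing the means out of fold keeps a row's own target out of its encoding, which matters most for high-cardinality interaction columns where many categories contain only a few rows.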

The purpose of *this* notebook is to demonstrate how to implement that same core methodology using **CatBoost** and to explore the unique advantages this framework offers for such a task.

Reference: [CatBoost: unbiased boosting with categorical features](https://arxiv.org/pdf/1706.09516)

### Built-in Categorical Feature Handling

CatBoost handles categorical columns natively, and this notebook relies on two of its built-in mechanisms.

* **Ordered Target Encoding:** The main feature of interest is CatBoost's internal TE, which the paper calls ordered target statistics. When we pass the names of our categorical columns (including the manually created interactions) to the `cat_features` parameter, CatBoost encodes each row using only the target values of rows that precede it in a random permutation of the training data, which limits target leakage.

* **Internal Feature Interactions:** CatBoost can also build combinations of categorical features on its own while growing trees. Because our *manually created interaction features* are passed as categorical features, CatBoost may combine them further, so the model can potentially capture relationships *between these already-combined interaction terms* that involve more than two of the original columns.
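
The idea behind ordered target encoding can be sketched in a few lines of plain Python. This is a simplified illustration, not CatBoost's implementation: the function name `ordered_target_stats`, the smoothing weight `a`, and the toy data are assumptions, and CatBoost itself works with several permutations and additional details that are omitted here.

```python
import random

def ordered_target_stats(categories, targets, prior, a=1.0, order=None, seed=0):
    """Encode each row from the targets of rows that come before it in `order`."""
    n = len(categories)
    if order is None:
        # Use a random permutation of the row indices when no order is given.
        order = list(range(n))
        random.Random(seed).shuffle(order)

    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running number of rows seen so far, per category
    encoded = [0.0] * n

    for i in order:
        cat = categories[i]
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        # Smoothed mean of earlier rows only; the row's own target is excluded.
        encoded[i] = (s + a * prior) / (c + a)
        # Only now does this row's target become visible to later rows.
        sums[cat] = s + targets[i]
        counts[cat] = c + 1
    return encoded

# Toy data (hypothetical values for illustration).
cats = ['A_car', 'A_car', 'B_home', 'A_car', 'B_home']
ys = [1, 0, 1, 1, 0]
print(ordered_target_stats(cats, ys, prior=sum(ys) / len(ys)))
```

The first row of each category receives the prior, and later rows move toward the category's observed mean as more history accumulates.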

### A Key Advantage: Multi-GPU Support

Beyond its handling of categorical data, CatBoost offers a practical advantage in the Kaggle environment: **native multi-GPU support**.

Kaggle Notebooks offer a dual T4 GPU accelerator. With their default settings, XGBoost and LightGBM typically train on only one of these GPUs.

With `task_type='GPU'` and no device restriction, CatBoost trains on **both** available T4 GPUs. This can substantially shorten training time, which makes it easier to iterate quickly.
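
If you want to control GPU usage explicitly, CatBoost's `devices` parameter accepts a string of device IDs. The sketch below only builds parameter dictionaries, with illustrative names (`gpu_params`, `single_gpu_params`) and untuned values, and assumes the two T4 GPUs are visible as devices 0 and 1.

```python
# Hypothetical parameter dictionaries for illustration; the values are not tuned.
gpu_params = {
    'task_type': 'GPU',
    # '0:1' selects GPU 0 and GPU 1. Leaving 'devices' out lets CatBoost
    # use every GPU that is visible to the process.
    'devices': '0:1',
}

# Restrict training to one GPU, e.g. to compare timings against the dual-GPU run.
single_gpu_params = {**gpu_params, 'devices': '0'}

print(gpu_params)
print(single_gpu_params)
```

Either dictionary can be merged into the model parameters and unpacked into the constructor, for example `CatBoostClassifier(**gpu_params)`.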

In [None]:
import warnings
warnings.simplefilter('ignore')

In [None]:
import pandas as pd, numpy as np

train = pd.read_csv('/kaggle/input/playground-series-s5e11/train.csv')
test = pd.read_csv('/kaggle/input/playground-series-s5e11/test.csv')
orig = pd.read_csv('/kaggle/input/loan-prediction-dataset-2025/loan_dataset_20000.csv')
print('Train Shape:', train.shape)
print('Test Shape:', test.shape)
print('Orig Shape:', orig.shape)

train.head(3)

In [None]:
TARGET = 'loan_paid_back'
CATS = ['gender', 'marital_status', 'education_level', 'employment_status', 'loan_purpose', 'grade_subgrade']
BASE = [col for col in train.columns if col not in ['id', TARGET]]

In [None]:
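from itertools import combinations

INTER = []

# Build one string-valued interaction column for every pair of base features.
for col1, col2 in combinations(BASE, 2):
    new_col_name = f'{col1}_{col2}'
    INTER.append(new_col_name)
    # Apply the same transformation to all three frames so the categories line up.
    for df in [train, test, orig]:
        df[new_col_name] = df[col1].astype(str) + '_' + df[col2].astype(str)

print(f'{len(INTER)} Features.')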

In [None]:
ORIG = []

for col in BASE:
    # MEAN
    mean_map = orig.groupby(col)[TARGET].mean()
    new_mean_col_name = f"orig_mean_{col}"
    mean_map.name = new_mean_col_name
    
    train = train.merge(mean_map, on=col, how='left')
    test = test.merge(mean_map, on=col, how='left')
    ORIG.append(new_mean_col_name)

    # COUNT
    new_count_col_name = f"orig_count_{col}"
    count_map = orig.groupby(col).size().reset_index(name=new_count_col_name)
    
    train = train.merge(count_map, on=col, how='left')
    test = test.merge(count_map, on=col, how='left')
    ORIG.append(new_count_col_name)

print(len(ORIG), 'Orig Features Created!!')

In [None]:
FEATURES = BASE + ORIG + INTER
print(len(FEATURES), 'Features.')

In [None]:
X = train[FEATURES]
y = train[TARGET]

In [None]:
from sklearn.model_selection import StratifiedKFold

N_SPLITS = 5
skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=42)

In [None]:
from catboost import CatBoostClassifier
from sklearn.metrics import roc_auc_score

In [None]:
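cat_params = {
    'loss_function': 'Logloss',
    'bootstrap_type': 'Bernoulli',  # row sampling, controlled by 'subsample'
    'eval_metric': 'AUC',
    'iterations': 100000,           # upper bound; early stopping ends training sooner
    'learning_rate': 0.01,
    'max_depth': 5,
    'subsample': 0.8,
    'early_stopping_rounds': 100,
    'random_seed': 42,
    'thread_count': -1,             # use all available CPU cores
    'verbose': 1000,                # log every 1000 iterations
    'task_type': 'GPU'
}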

In [None]:
all_categorical_features = INTER + CATS

oof_preds = np.zeros(len(X))
test_preds = np.zeros(len(test))
X_test = test[FEATURES]  # identical for every fold, so build it once

for fold, (train_idx, val_idx) in enumerate(skf.split(X, y), 1):
    print(f'--- Fold {fold}/{N_SPLITS} ---')

    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

    model = CatBoostClassifier(**cat_params)
    model.fit(X_train, y_train,
              eval_set=(X_val, y_val),      
              cat_features=all_categorical_features 
             )

    val_preds = model.predict_proba(X_val)[:, 1]
    oof_preds[val_idx] = val_preds
    
    fold_score = roc_auc_score(y_val, val_preds)
    print(f'Fold {fold} AUC: {fold_score:.4f}')
    
    test_preds += model.predict_proba(X_test)[:, 1] / N_SPLITS

overall_auc = roc_auc_score(y, oof_preds)
print('====================')
print(f'Overall OOF AUC: {overall_auc:.4f}')
print('====================')

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Importances come from the last fold's model only, not an average over folds.
feature_importances = model.feature_importances_

importance_df = pd.DataFrame({
    'feature': X_train.columns, 
    'importance': feature_importances
})

importance_df = importance_df.sort_values('importance', ascending=False)

plt.style.use('fivethirtyeight')
plt.figure(figsize=(12, 20))
sns.barplot(x='importance', 
            y='feature', 
            data=importance_df.head(50)) 
plt.title('Feature Importance (Fold5 model)')
plt.xlabel('Importance Score')
plt.ylabel('Features')
plt.tight_layout()
plt.show()

In [None]:
# Round the CV score in the file names so they stay short and readable.
pd.DataFrame({'id': train['id'], TARGET: oof_preds}).to_csv(f'oof_cat_cv_{overall_auc:.5f}.csv', index=False)
pd.DataFrame({'id': test['id'], TARGET: test_preds}).to_csv(f'test_cat_cv_{overall_auc:.5f}.csv', index=False)