# 🏦 LightGBM Starter for Bank Classification (90%+ AUC, CPU only)

This notebook is a **beginner-friendly baseline** for the [Kaggle Playground Series S5E8](https://www.kaggle.com/competitions/playground-series-s5e8).  
The goal is to predict whether a client subscribes to a bank term deposit (`y`).

### ✅ Key points:
- Uses **LightGBM**, fast and efficient on CPU (no GPU required).
- Minimal preprocessing: median imputation for numeric features and one-hot encoding for categorical features.
- Achieves **~90–92% ROC AUC** with a very short runtime.
- Designed for **clarity and reproducibility**, not leaderboard chasing.

This is a good starting point if you're new to Kaggle or tabular ML. You can build on it with feature engineering, hyperparameter tuning, or model ensembling.


In [27]:
import pandas as pd
import numpy as np

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

from lightgbm import LGBMClassifier


In [28]:
# Load datasets
train = pd.read_csv("/kaggle/input/playground-series-s5e8/train.csv")
test = pd.read_csv("/kaggle/input/playground-series-s5e8/test.csv")
sample_submission = pd.read_csv("/kaggle/input/playground-series-s5e8/sample_submission.csv")

print(train.shape, test.shape)
train.head()


(750000, 18) (250000, 17)


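|   | id | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|----|-----|-----|---------|-----------|---------|---------|---------|------|---------|-----|-------|----------|----------|-------|----------|----------|---|
| 0 | 0 | 42 | technician | married | secondary | no | 7 | no | no | cellular | 25 | aug | 117 | 3 | -1 | 0 | unknown | 0 |
| 1 | 1 | 38 | blue-collar | married | secondary | no | 514 | no | no | unknown | 18 | jun | 185 | 1 | -1 | 0 | unknown | 0 |
| 2 | 2 | 36 | blue-collar | married | secondary | no | 602 | yes | no | unknown | 14 | may | 111 | 2 | -1 | 0 | unknown | 0 |
| 3 | 3 | 27 | student | single | secondary | no | 34 | yes | no | unknown | 28 | may | 10 | 2 | -1 | 0 | unknown | 0 |
| 4 | 4 | 26 | technician | married | secondary | no | 889 | yes | no | cellular | 3 | feb | 902 | 1 | -1 | 0 | unknown | 1 |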


In [30]:
print(train.dtypes)
print(train['y'].value_counts(normalize=True))


id            int64
age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
y             int64
dtype: object
y
0    0.879349
1    0.120651
Name: proportion, dtype: float64
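With only about 12% positives, plain accuracy is a misleading yardstick, which is one reason this notebook reports ROC AUC instead. The sketch below illustrates the point on synthetic labels (not the competition data) that mimic the class balance printed above; the variable names are made up for the illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(42)

# Synthetic labels with roughly the same positive rate as train["y"] (~12%).
y_true = (rng.random(10_000) < 0.12).astype(int)

# A "model" that always predicts class 0 and gives every row the same score.
constant_pred = np.zeros_like(y_true)
constant_score = np.zeros(len(y_true), dtype=float)

majority_accuracy = accuracy_score(y_true, constant_pred)
constant_auc = roc_auc_score(y_true, constant_score)

print(f"accuracy of always predicting 0: {majority_accuracy:.3f}")
print(f"ROC AUC of a constant score:     {constant_auc:.3f}")
```

The constant predictor looks strong on accuracy but carries no ranking information, so its AUC sits at chance level.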


In [31]:
# Drop id from both train and test right here
X = train.drop(columns=["y", "id"])
y = train["y"]

X_test = test.drop(columns=["id"])
test_ids = test["id"]


In [32]:
# Identify categorical vs numerical
cat_cols = X.select_dtypes(include=["object"]).columns.tolist()
num_cols = X.select_dtypes(exclude=["object"]).columns.tolist()

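# Preprocessing: median-impute numeric columns, one-hot encode categoricals
# (categories unseen during fit are encoded as all zeros)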
preprocessor = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)
])


In [33]:
# Model: simple LightGBM
model = LGBMClassifier(
    n_estimators=2000,
    learning_rate=0.05,
    random_state=42,
    n_jobs=-1
)

pipe = Pipeline([
    ("prep", preprocessor),
    ("clf", model)
])


In [34]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []

for train_idx, val_idx in cv.split(X, y):
    X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]
    
    pipe.fit(X_tr, y_tr)
    preds = pipe.predict_proba(X_val)[:, 1]  # probability of the positive class
    score = roc_auc_score(y_val, preds)
    scores.append(score)

print("CV AUC scores:", scores)
print("Mean AUC:", np.mean(scores))


[LightGBM] [Info] Number of positive: 72391, number of negative: 527609
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.036241 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1046
[LightGBM] [Info] Number of data points in the train set: 600000, number of used features: 51
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.120652 -> initscore=-1.986273
[LightGBM] [Info] Start training from score -1.986273
[LightGBM] [Info] Number of positive: 72391, number of negative: 527609
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.036153 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1044
[LightGBM] [Info] Number of data points in the train set: 600000, number of used features: 51
[LightGBM] [In
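The fold loop above can also be expressed with out-of-fold (OOF) predictions, which are handy later for blending or threshold analysis. This is a sketch on synthetic data, with `LogisticRegression` standing in for `pipe` to keep it quick and dependency-light; any scikit-learn compatible estimator could be dropped in its place.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic data with about 12% positives (not the competition data).
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.88, 0.12], random_state=42
)
stand_in = LogisticRegression(max_iter=1000)  # placeholder for `pipe`
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each row's prediction comes from the model that did not train on that row.
oof = cross_val_predict(
    stand_in, X_syn, y_syn, cv=cv, method="predict_proba"
)[:, 1]

# The fixed random_state makes cv.split reproduce the same folds here.
fold_aucs = [
    roc_auc_score(y_syn[val_idx], oof[val_idx])
    for _, val_idx in cv.split(X_syn, y_syn)
]

print(f"fold AUC: {np.mean(fold_aucs):.4f} +/- {np.std(fold_aucs):.4f}")
print(f"overall OOF AUC: {roc_auc_score(y_syn, oof):.4f}")
```

Reporting the spread across folds alongside the mean gives a rough sense of how much of a score change is noise.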

In [36]:
# Fit on full data
pipe.fit(X, y)

# Predict positive-class probabilities for the test set
test_preds = pipe.predict_proba(X_test)[:, 1]

# Create submission
submission = pd.DataFrame({
    "id": test_ids,
    "y": test_preds
})

submission.to_csv("submission.csv", index=False)
submission.head()


[LightGBM] [Info] Number of positive: 90488, number of negative: 659512
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.045052 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1045
[LightGBM] [Info] Number of data points in the train set: 750000, number of used features: 51
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.120651 -> initscore=-1.986283
[LightGBM] [Info] Start training from score -1.986283


|   | id | y |
|---|----|---|
| 0 | 750000 | 0.001691 |
| 1 | 750001 | 0.080367 |
| 2 | 750002 | 0.000207 |
| 3 | 750003 | 2e-05 |
| 4 | 750004 | 0.014951 |
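`sample_submission` is loaded earlier but never used. One possible use is a quick layout check before uploading. The helper below is hypothetical (`check_submission` is not part of any library), and it assumes the sample file has the same `id` and `y` columns as the submission built above; the toy frames are placeholders.

```python
import numpy as np
import pandas as pd

def check_submission(submission, sample_submission, target_col="y"):
    """Hypothetical helper: raise AssertionError if the layout looks wrong."""
    assert list(submission.columns) == list(sample_submission.columns), \
        "column names or order differ"
    assert len(submission) == len(sample_submission), "row count differs"
    assert np.array_equal(submission["id"].to_numpy(),
                          sample_submission["id"].to_numpy()), \
        "ids differ or are reordered"
    assert submission[target_col].notna().all(), "missing predictions"
    assert submission[target_col].between(0.0, 1.0).all(), \
        "probabilities outside [0, 1]"
    return True

# Toy frames that mimic the expected layout (values are placeholders).
toy_sample = pd.DataFrame({"id": [750000, 750001, 750002], "y": [0.5, 0.5, 0.5]})
toy_submission = pd.DataFrame({"id": [750000, 750001, 750002],
                               "y": [0.001691, 0.080367, 0.000207]})

print(check_submission(toy_submission, toy_sample))
```

In the notebook itself, the call would be `check_submission(submission, sample_submission)`, assuming the sample file really uses that layout.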
