
# 🏦 Binary Classification with a Bank Dataset (Kaggle Playground S5E8) — macOS (VS Code)

This notebook gives you an **end-to-end, well-structured pipeline**:
- Data loading (local or Kaggle CLI)
- Minimal feature engineering
- **CatBoost** (CPU) with robust Stratified K-Fold CV
- **LightGBM** (CPU) with native categoricals
- Simple **blend** for better AUC
- Optional **PyTorch MLP** that runs on **Apple GPU (MPS)** (note: NNs may underperform gradient boosting on this dataset, but this shows GPU usage on Mac)

> **GPU note for macOS:** CatBoost/LightGBM GPU backends require CUDA/OpenCL and **do not use Apple's MPS**. So tree models here run on CPU. The **optional PyTorch** section demonstrates Apple GPU (MPS) acceleration.


## 0. Hardware & Environment Checks

In [1]:

import platform, sys, subprocess, shutil

print("Python:", sys.version)
print("OS:", platform.platform())
print("Machine:", platform.machine())
print("Processor:", platform.processor())

# Optional: check if kaggle CLI exists (installed via pip or system)
print("kaggle CLI available:", shutil.which("kaggle") is not None)

# Optional: check for PyTorch + MPS (Apple GPU)
try:
    import torch
    mps_ok = hasattr(torch.backends, "mps") and torch.backends.mps.is_available()
    print("PyTorch version:", torch.__version__)
    print("Apple GPU (MPS) available:", mps_ok)
except Exception as e:
    print("PyTorch not installed or import failed. MPS check skipped.", e)


Python: 3.12.4 | packaged by Anaconda, Inc. | (main, Jun 18 2024, 10:07:17) [Clang 14.0.6 ]
OS: macOS-15.6-arm64-arm-64bit
Machine: arm64
Processor: arm
kaggle CLI available: True
PyTorch version: 2.8.0
Apple GPU (MPS) available: True
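
The availability check above can be folded into a small helper that later cells call when they need a device string. This is a minimal sketch: `pick_device` is a hypothetical name that does not appear elsewhere in the notebook, and it falls back to `"cpu"` whenever PyTorch is missing or no accelerator is found.

```python
def pick_device() -> str:
    """Return "mps" when Apple's Metal backend is usable, else "cuda", else "cpu"."""
    try:
        import torch  # optional dependency; a failed import falls through to CPU
    except Exception:
        return "cpu"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"

device = pick_device()
print("Using device:", device)
```

Returning a plain string keeps the helper usable in cells that never import PyTorch themselves.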


## 1. Install/Verify Dependencies

In [2]:

# Skip this cell if the packages are already installed in the active environment.
# The installs can also be run one at a time.
%pip install -U pip setuptools wheel
%pip install -U numpy pandas scikit-learn matplotlib
%pip install -U catboost lightgbm xgboost
# Optional (for the Apple GPU demo with a simple MLP):
%pip install -U torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu
# ^ On Apple Silicon with macOS 12.3+, the macOS PyTorch wheel includes the MPS backend.


Note: you may need to restart the kernel to use updated packages.
Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cpu


## 2. Data: Local Path or Kaggle Download

In [6]:
import os
os.environ["KAGGLE_CONFIG_DIR"] = "/Users/sanyuktatuti/Documents/Bank_Binary_Classification/.kaggle"

In [8]:
# ✅ Robust Kaggle download for VS Code/macOS using the Kaggle Python API (no shell CLI)

from pathlib import Path
import os, stat, zipfile, glob

# Where to put the CSVs
DATA_DIR = Path("./data")
DATA_DIR.mkdir(parents=True, exist_ok=True)

# If you keep your token in the project, export this before running (or set it here):
# os.environ["KAGGLE_CONFIG_DIR"] = "/Users/sanyuktatuti/Documents/Bank_Binary_Classification/.kaggle"

# Fallback to ~/.kaggle if KAGGLE_CONFIG_DIR not set
cfg_dir = Path(os.environ.get("KAGGLE_CONFIG_DIR", Path.home() / ".kaggle"))
cfg_dir.mkdir(parents=True, exist_ok=True)
token = cfg_dir / "kaggle.json"
if not token.exists():
    raise SystemExit(
        f"kaggle.json not found at {token}.\n"
        "Download from https://www.kaggle.com/settings (Create New API Token) and place it there.\n"
        "Or set os.environ['KAGGLE_CONFIG_DIR'] to your project .kaggle folder."
    )
# Fix permissions (Kaggle requires 600)
token.chmod(stat.S_IRUSR | stat.S_IWUSR)

# Make sure 'kaggle' package is installed in THIS kernel/env
import sys, subprocess
subprocess.run([sys.executable, "-m", "pip", "install", "-qU", "kaggle"], check=True)

# Use the official Python API (avoids broken shell scripts)
from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()

print("Downloading to:", DATA_DIR.resolve())
api.competition_download_files("playground-series-s5e8", path=str(DATA_DIR), quiet=False)

# Unzip everything we just downloaded
zips = sorted(glob.glob(str(DATA_DIR / "*.zip")))
if not zips:
    raise SystemExit("No ZIPs found after download — check auth output above.")
for z in zips:
    with zipfile.ZipFile(z) as zf:
        zf.extractall(DATA_DIR)

# Verify expected files
expected = ["train.csv", "test.csv", "sample_submission.csv"]
missing = [f for f in expected if not (DATA_DIR / f).exists()]
if missing:
    raise SystemExit(f"Missing files after unzip: {missing}")
print("✅ Kaggle files ready:", expected)

Downloading to: /Users/sanyuktatuti/Documents/Bank_Binary_Classification/data
Downloading playground-series-s5e8.zip to data


100%|██████████| 14.7M/14.7M [00:00<00:00, 2.85GB/s]


✅ Kaggle files ready: ['train.csv', 'test.csv', 'sample_submission.csv']
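
Re-running the download cell fetches the archive again even when the CSVs are already on disk. A small guard, sketched below, could be used to skip the API call in that case. `missing_files` is a hypothetical helper; the file names are the ones the cell above verifies.

```python
from pathlib import Path

EXPECTED = ["train.csv", "test.csv", "sample_submission.csv"]

def missing_files(data_dir, expected=EXPECTED):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]

todo = missing_files("./data")
if todo:
    print("Need to download:", todo)
else:
    print("All competition files present; skipping download.")
```

Wrapping the authentication and download calls in the `if todo:` branch would keep repeat runs offline.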





## 3. Load Data & Quick Sanity Checks

In [9]:
import pandas as pd, pathlib
DATA_DIR = pathlib.Path("./data")
train = pd.read_csv(DATA_DIR / "train.csv")
test  = pd.read_csv(DATA_DIR / "test.csv")
sample= pd.read_csv(DATA_DIR / "sample_submission.csv")
print(train.shape, test.shape)
train.head()

(750000, 18) (250000, 17)


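| | id | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 42 | technician | married | secondary | no | 7 | no | no | cellular | 25 | aug | 117 | 3 | -1 | 0 | unknown | 0 |
| 1 | 1 | 38 | blue-collar | married | secondary | no | 514 | no | no | unknown | 18 | jun | 185 | 1 | -1 | 0 | unknown | 0 |
| 2 | 2 | 36 | blue-collar | married | secondary | no | 602 | yes | no | unknown | 14 | may | 111 | 2 | -1 | 0 | unknown | 0 |
| 3 | 3 | 27 | student | single | secondary | no | 34 | yes | no | unknown | 28 | may | 10 | 2 | -1 | 0 | unknown | 0 |
| 4 | 4 | 26 | technician | married | secondary | no | 889 | yes | no | cellular | 3 | feb | 902 | 1 | -1 | 0 | unknown | 1 |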


## 4. Tiny Feature Engineering (cheap but helpful)

In [10]:

import numpy as np

for df in (train, test):
    # Special code used in bank-style data
    df["pdays_is_neg1"] = (df["pdays"] == -1).astype("str")
    # Whether any previous contacts exist
    df["has_previous"]  = (df["previous"] > 0).astype("str")
    # Balance sign bucket
    df["balance_sign"]  = np.where(df["balance"] < 0, "neg",
                            np.where(df["balance"] == 0, "zero", "pos")).astype("str")
    # Extra small features
    month_order = {'jan':1,'feb':2,'mar':3,'apr':4,'may':5,'jun':6,
                   'jul':7,'aug':8,'sep':9,'oct':10,'nov':11,'dec':12}
    df["month_idx"]   = df["month"].map(month_order).fillna(0).astype(int)
    df["age_bin"]     = pd.cut(df["age"], bins=[-1,25,35,45,55,65,120]).astype(str)
    df["campaign_cap"]= np.minimum(df["campaign"], 10).astype(int)
    df["pdays_bin"]   = pd.cut(df["pdays"], bins=[-2,-1,5,30,90,365,100000], right=True).astype(str)

print("Engineered columns added. Example:")
display(train[["pdays_is_neg1","has_previous","balance_sign","month","month_idx","age","age_bin","campaign","campaign_cap","pdays","pdays_bin"]].head())


Engineered columns added. Example:


| | pdays_is_neg1 | has_previous | balance_sign | month | month_idx | age | age_bin | campaign | campaign_cap | pdays | pdays_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | True | False | pos | aug | 8 | 42 | (35, 45] | 3 | 3 | -1 | (-2, -1] |
| 1 | True | False | pos | jun | 6 | 38 | (35, 45] | 1 | 1 | -1 | (-2, -1] |
| 2 | True | False | pos | may | 5 | 36 | (35, 45] | 2 | 2 | -1 | (-2, -1] |
| 3 | True | False | pos | may | 5 | 27 | (25, 35] | 2 | 2 | -1 | (-2, -1] |
| 4 | True | False | pos | feb | 2 | 26 | (25, 35] | 1 | 1 | -1 | (-2, -1] |
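
The `pdays_bin` labels above, such as `(-2, -1]`, are right-closed intervals: each bin excludes its left edge and includes its right edge, so the special code `-1` gets a bin of its own. The pure-Python sketch below mirrors that rule with `bisect` to make the edge cases explicit. `right_closed_bin` is a hypothetical helper for illustration only; it is not how pandas implements `pd.cut`.

```python
from bisect import bisect_left

PDAYS_EDGES = [-2, -1, 5, 30, 90, 365, 100000]  # same edges as the cell above

def right_closed_bin(x, edges=PDAYS_EDGES):
    """Return the label "(lo, hi]" with lo < x <= hi, or None if x is outside all bins."""
    i = bisect_left(edges, x)          # first index with edges[i] >= x
    if i == 0 or i == len(edges):      # x <= first edge, or x > last edge
        return None
    return f"({edges[i - 1]}, {edges[i]}]"

for value in (-1, 0, 5, 6, 400):
    print(value, "->", right_closed_bin(value))
```

Values that fall outside every interval return `None` here, which plays the role of the missing value that `pd.cut` would produce.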


## 5. Prepare Features & Inferred Categoricals

In [27]:
from sklearn.metrics import roc_auc_score
import numpy as np
import pandas as pd

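# `y` is otherwise created in the feature-preparation cell; define it here so this cell runs on its own.
y = train["y"].astype(int)
d = pd.to_numeric(train["duration"], errors="coerce").fillna(0)

# AUC depends only on the ordering of the scores, so the monotonic log1p transform cannot change it.
print("AUC(duration raw):", roc_auc_score(y, d))
print("AUC(log1p(duration)):", roc_auc_score(y, np.log1p(d.clip(lower=0))))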


AUC(duration raw): 0.8895128234037682
AUC(log1p(duration)): 0.8895128234037682
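
The two AUCs printed above agree to every digit, which is expected: ROC AUC depends only on how the scores rank the examples, and `log1p` is strictly increasing on non-negative durations, so it leaves that ranking unchanged. The sketch below checks this on a toy example using a plain pairwise definition of AUC. The data and the `pairwise_auc` helper are made up for illustration.

```python
import math

def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels    = [0, 0, 1, 0, 1, 1, 0, 1]
durations = [10, 111, 117, 185, 300, 902, 95, 60]   # toy call durations

auc_raw = pairwise_auc(labels, durations)
auc_log = pairwise_auc(labels, [math.log1p(d) for d in durations])
print(auc_raw, auc_log)
```

The same reasoning explains why a log transform of a single feature adds nothing to a rank-based metric on its own; it can still matter for models that are sensitive to feature scale.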


In [29]:
import numpy as np, pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression

assert "duration" in train.columns, "duration missing. Rebuild features with DROP_DURATION=False."

FOLDS = 10
rs = 42
skf = StratifiedKFold(n_splits=FOLDS, shuffle=True, random_state=rs)

dur_tr = train["duration"].astype(float).to_numpy()
dur_te = test["duration"].astype(float).to_numpy()
y_arr  = y.values
# Fill missing values before casting: astype(str) would first turn NaN into the string "nan".
contact_tr = train["contact"].fillna("NA").astype(str).values
contact_te = test["contact"].fillna("NA").astype(str).values
month_tr   = train["month"].fillna("NA").astype(str).values
month_te   = test["month"].fillna("NA").astype(str).values

def rankit(a):
    return pd.Series(a).rank(method="average").to_numpy() / (len(a)+1e-9)

# 1) Global isotonic (raw + log1p), blended
oof_raw  = np.zeros_like(dur_tr, dtype=np.float32); pred_raw  = np.zeros_like(dur_te, dtype=np.float32)
oof_log  = np.zeros_like(dur_tr, dtype=np.float32); pred_log  = np.zeros_like(dur_te, dtype=np.float32)
for tr_idx, va_idx in skf.split(dur_tr.reshape(-1,1), y_arr):
    ir = IsotonicRegression(out_of_bounds="clip").fit(dur_tr[tr_idx], y_arr[tr_idx])
    oof_raw[va_idx]  = ir.predict(dur_tr[va_idx]);  pred_raw += ir.predict(dur_te)/FOLDS
    ir2 = IsotonicRegression(out_of_bounds="clip")
    xtr = np.log1p(np.clip(dur_tr[tr_idx],0,None)); xva = np.log1p(np.clip(dur_tr[va_idx],0,None)); xte = np.log1p(np.clip(dur_te,0,None))
    ir2.fit(xtr, y_arr[tr_idx])
    oof_log[va_idx]  = ir2.predict(xva);           pred_log += ir2.predict(xte)/FOLDS

auc_raw = roc_auc_score(y_arr, oof_raw)
auc_log = roc_auc_score(y_arr, oof_log)
w_raw = 0.5 if abs(auc_raw-auc_log)<0.002 else (1.0 if auc_raw>auc_log else 0.0)
oof_iso_global  = w_raw*oof_raw + (1-w_raw)*oof_log
pred_iso_global = w_raw*pred_raw + (1-w_raw)*pred_log
print(f"[Iso-Global] raw={auc_raw:.6f} log1p={auc_log:.6f} → blended={roc_auc_score(y_arr,oof_iso_global):.6f}")

# 2) Grouped isotonic — per contact and per month, fold-safe OOF with global fallback
def grouped_iso_oof_pred(groups_tr, groups_te, min_group=2000, label="contact"):
    oof_g = np.zeros_like(dur_tr, dtype=np.float32)
    pred_g = np.zeros_like(dur_te, dtype=np.float32)
    for tr_idx, va_idx in skf.split(dur_tr.reshape(-1,1), y_arr):
        g_ir = IsotonicRegression(out_of_bounds="clip").fit(dur_tr[tr_idx], y_arr[tr_idx])  # fallback
        per = {}
        for g in np.unique(groups_tr[tr_idx]):
            m = (groups_tr[tr_idx]==g)
            if m.sum() >= min_group:
                per[g] = IsotonicRegression(out_of_bounds="clip").fit(dur_tr[tr_idx][m], y_arr[tr_idx][m])
        # OOF for this fold
        for g in np.unique(groups_tr[va_idx]):
            m = (groups_tr[va_idx]==g)
            model = per.get(g, g_ir)
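            # Assign via va_idx[m]; oof_g[va_idx][m] = ... would write into a temporary copy and leave oof_g at zero.
            oof_g[va_idx[m]] = model.predict(dur_tr[va_idx][m])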
    # Test using full train fit
    g_ir_full = IsotonicRegression(out_of_bounds="clip").fit(dur_tr, y_arr)
    per_full = {}
    for g in np.unique(groups_tr):
        m = (groups_tr==g)
        if m.sum() >= min_group:
            per_full[g] = IsotonicRegression(out_of_bounds="clip").fit(dur_tr[m], y_arr[m])
    for g in np.unique(groups_te):
        m = (groups_te==g)
        model = per_full.get(g, g_ir_full)
        pred_g[m] = model.predict(dur_te[m])
    auc = roc_auc_score(y_arr, oof_g)
    print(f"[Iso-{label}] OOF AUC={auc:.6f} | groups kept>={min_group}: {len(per_full)}")
    return oof_g, pred_g

oof_iso_contact, pred_iso_contact = grouped_iso_oof_pred(contact_tr, contact_te, min_group=2000, label="contact")
oof_iso_month,   pred_iso_month   = grouped_iso_oof_pred(month_tr,   month_te,   min_group=2000, label="month")

# Combine duration channels (global + contact + month)
dur_channels = {
    "global": (oof_iso_global,  pred_iso_global),
    "contact":(oof_iso_contact, pred_iso_contact),
    "month":  (oof_iso_month,   pred_iso_month),
}
# Simple weight grid to find best duration-only mix
best_dur_auc, best_dur_w = -1.0, None
grid = np.linspace(0,1,11)
for wg in grid:           # weight for global
    for wc in grid:       # weight for contact
        if wg+wc<=1:
            wm = 1-(wg+wc)
            oof_dur = wg*dur_channels["global"][0] + wc*dur_channels["contact"][0] + wm*dur_channels["month"][0]
            auc = roc_auc_score(y_arr, oof_dur)
            if auc > best_dur_auc:
                best_dur_auc = auc; best_dur_w = (wg,wc,wm)
best_oof_dur = best_dur_w[0]*dur_channels["global"][0] + best_dur_w[1]*dur_channels["contact"][0] + best_dur_w[2]*dur_channels["month"][0]
best_pred_dur= best_dur_w[0]*dur_channels["global"][1] + best_dur_w[1]*dur_channels["contact"][1] + best_dur_w[2]*dur_channels["month"][1]
print(f"[Iso-DUR*] best OOF AUC={best_dur_auc:.6f} at (w_global={best_dur_w[0]:.2f}, w_contact={best_dur_w[1]:.2f}, w_month={best_dur_w[2]:.2f})")

# A) Submit pure duration isotonic (strong baseline)
subA = pd.DataFrame({"id": test["id"], "y": best_pred_dur})
subA.to_csv("submission_duration_iso_contact_month.csv", index=False)
print("Wrote submission_duration_iso_contact_month.csv")

# 3) Blend duration with boosters — rank vs mean, search weights finely
def best_blend_3way(cb_oof, lgb_oof, dur_oof, cb_pred, lgb_pred, dur_pred):
    best_auc, best_name, best_test = -1.0, None, None
    # mean simplex search (fine grid)
    for w_cb in np.linspace(0,1,41):
        for w_lgb in np.linspace(0,1,41):
            if w_cb + w_lgb <= 1:
                w_dur = 1 - (w_cb + w_lgb)
                oof = w_cb*cb_oof + w_lgb*lgb_oof + w_dur*dur_oof
                auc = roc_auc_score(y_arr, oof)
                if auc > best_auc:
                    best_auc, best_name, best_test = auc, f"mean(wcb={w_cb:.3f},wlgb={w_lgb:.3f},wdur={w_dur:.3f})", \
                        (w_cb*cb_pred + w_lgb*lgb_pred + w_dur*dur_pred)
    # rank equal-weights (often robust)
    oof_r = (rankit(cb_oof)+rankit(lgb_oof)+rankit(dur_oof))/3.0
    auc_r = roc_auc_score(y_arr, oof_r)
    if auc_r > best_auc:
        best_auc, best_name, best_test = auc_r, "rank_equal", \
            (rankit(cb_pred)+rankit(lgb_pred)+rankit(dur_pred))/3.0
    print(f"[Blend3] best OOF AUC={best_auc:.6f} via {best_name}")
    return best_test, best_name, best_auc

best_test_blend, best_name_blend, best_auc_blend = best_blend_3way(
    cb_oof, lgb_oof, best_oof_dur, cb_pred, lgb_pred, best_pred_dur
)
subB = pd.DataFrame({"id": test["id"], "y": best_test_blend})
subB.to_csv("submission_rank_or_mean_blend_cb_lgb_dur.csv", index=False)
print("Wrote submission_rank_or_mean_blend_cb_lgb_dur.csv")

# 4) Proper K-fold meta stack on OOF features (more honest than fitting LR on full OOF)
meta_feats_tr = np.column_stack([
    cb_oof, lgb_oof, best_oof_dur,
    rankit(dur_tr), np.log1p(np.clip(dur_tr,0,None))
])
meta_feats_te = np.column_stack([
    cb_pred, lgb_pred, best_pred_dur,
    rankit(dur_te), np.log1p(np.clip(dur_te,0,None))
])

meta_oof = np.zeros(len(y_arr), dtype=np.float32)
meta_pred = np.zeros(len(test), dtype=np.float32)
for tr_idx, va_idx in skf.split(meta_feats_tr, y_arr):
    m = LogisticRegression(C=3.0, solver="lbfgs", max_iter=500)
    m.fit(meta_feats_tr[tr_idx], y_arr[tr_idx])
    meta_oof[va_idx] = m.predict_proba(meta_feats_tr[va_idx])[:,1]
    meta_pred += m.predict_proba(meta_feats_te)[:,1] / FOLDS
auc_meta = roc_auc_score(y_arr, meta_oof)
print(f"[Meta-LR KFold] OOF AUC={auc_meta:.6f}")
subC = pd.DataFrame({"id": test["id"], "y": meta_pred})
subC.to_csv("submission_meta_lr_kfold_cb_lgb_dur.csv", index=False)
print("Wrote submission_meta_lr_kfold_cb_lgb_dur.csv")


[Iso-Global] raw=0.891598 log1p=0.891598 → blended=0.891598
[Iso-contact] OOF AUC=0.500000 | groups kept>=2000: 3
[Iso-month] OOF AUC=0.500000 | groups kept>=2000: 12
[Iso-DUR*] best OOF AUC=0.891598 at (w_global=0.10, w_contact=0.00, w_month=0.90)
Wrote submission_duration_iso_contact_month.csv
[Blend3] best OOF AUC=0.956735 via mean(wcb=0.025,wlgb=0.050,wdur=0.925)
Wrote submission_rank_or_mean_blend_cb_lgb_dur.csv
[Meta-LR KFold] OOF AUC=0.958932
Wrote submission_meta_lr_kfold_cb_lgb_dur.csv
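
An AUC of exactly 0.500000, as logged for the `contact` and `month` channels, is what a constant prediction vector produces. One way to end up with such a vector in NumPy is chained fancy indexing: `arr[idx][mask] = values` writes into the temporary copy created by `arr[idx]`, so `arr` itself never changes. The sketch below contrasts that with a single indexing step; the array names and values are illustrative.

```python
import numpy as np

oof    = np.zeros(6, dtype=np.float32)
va_idx = np.array([1, 3, 4])            # validation rows of one fold
mask   = np.array([True, False, True])  # rows of that fold belonging to one group

# Chained indexing: oof[va_idx] is a copy, so the assignment is discarded.
oof[va_idx][mask] = 0.9
after_chained = oof.copy()
print(after_chained)   # still all zeros

# Single indexing step: select the global row positions first, then assign.
oof[va_idx[mask]] = 0.9
print(oof)             # rows 1 and 4 now hold 0.9
```

A quick `assert oof.any()` after an out-of-fold loop is a cheap way to catch this class of bug before the scores are blended.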


In [None]:
# # === Duration-centric booster: global iso, per-contact iso, best blend, and a simple stack ===
# import numpy as np, pandas as pd
# from sklearn.isotonic import IsotonicRegression
# from sklearn.model_selection import StratifiedKFold
# from sklearn.metrics import roc_auc_score
# from sklearn.linear_model import LogisticRegression

# assert "duration" in train.columns, "duration missing. Rebuild features with DROP_DURATION=False."

# FOLDS = 10
# skf   = StratifiedKFold(n_splits=FOLDS, shuffle=True, random_state=42)

# dur_tr = train["duration"].astype(float).to_numpy()
# dur_te = test["duration"].astype(float).to_numpy()
# y_arr  = y.values

# # ------------------ 1) Global isotonic (raw + log1p, pick/blend) ------------------
# oof_raw  = np.zeros_like(dur_tr, dtype=np.float32)
# pred_raw = np.zeros_like(dur_te,  dtype=np.float32)
# oof_log  = np.zeros_like(dur_tr, dtype=np.float32)
# pred_log = np.zeros_like(dur_te,  dtype=np.float32)

# for tr_idx, va_idx in skf.split(dur_tr.reshape(-1,1), y_arr):
#     # raw
#     ir = IsotonicRegression(out_of_bounds="clip")
#     ir.fit(dur_tr[tr_idx], y_arr[tr_idx])
#     oof_raw[va_idx]  = ir.predict(dur_tr[va_idx])
#     pred_raw        += ir.predict(dur_te) / FOLDS
#     # log1p
#     xtr = np.log1p(np.clip(dur_tr[tr_idx], 0, None))
#     xva = np.log1p(np.clip(dur_tr[va_idx], 0, None))
#     xte = np.log1p(np.clip(dur_te,        0, None))
#     ir2 = IsotonicRegression(out_of_bounds="clip")
#     ir2.fit(xtr, y_arr[tr_idx])
#     oof_log[va_idx] = ir2.predict(xva)
#     pred_log       += ir2.predict(xte) / FOLDS

# auc_raw = roc_auc_score(y_arr, oof_raw)
# auc_log = roc_auc_score(y_arr, oof_log)
# w_raw = 0.5 if abs(auc_raw-auc_log)<0.002 else (1.0 if auc_raw>auc_log else 0.0)
# oof_iso_global  = w_raw*oof_raw + (1-w_raw)*oof_log
# pred_iso_global = w_raw*pred_raw + (1-w_raw)*pred_log
# print(f"[Iso-Global] OOF raw={auc_raw:.6f} log1p={auc_log:.6f} blended={roc_auc_score(y_arr,oof_iso_global):.6f}")

# # ------------------ 2) Per-contact isotonic (fallback to global) ------------------
# contact_tr = train["contact"].fillna("NA").astype(str).values
# contact_te = test["contact"].fillna("NA").astype(str).values
# min_group = 2000  # need enough samples to fit stable iso per group

# oof_iso_contact = np.zeros_like(dur_tr, dtype=np.float32)
# pred_iso_contact = np.zeros_like(dur_te, dtype=np.float32)

# for tr_idx, va_idx in skf.split(dur_tr.reshape(-1,1), y_arr):
#     # fit a global fallback on fold's train idx
#     g_ir = IsotonicRegression(out_of_bounds="clip")
#     g_ir.fit(dur_tr[tr_idx], y_arr[tr_idx])
#     # groups in this fold
#     groups = np.unique(contact_tr[tr_idx])
#     per_group_models = {}
#     for g in groups:
#         mask = (contact_tr[tr_idx] == g)
#         if mask.sum() >= min_group:
#             ir = IsotonicRegression(out_of_bounds="clip")
#             ir.fit(dur_tr[tr_idx][mask], y_arr[tr_idx][mask])
#             per_group_models[g] = ir
#     # OOF preds for this fold
#     for g in np.unique(contact_tr[va_idx]):
#         mask = (contact_tr[va_idx]==g)
#         model = per_group_models.get(g, g_ir)
#         # Assign via va_idx[mask]; chained indexing would write into a temporary copy.
#         oof_iso_contact[va_idx[mask]] = model.predict(dur_tr[va_idx][mask])

# # For test, fit per-contact on full train (with same fallback)
# g_ir_full = IsotonicRegression(out_of_bounds="clip").fit(dur_tr, y_arr)
# per_group_full = {}
# for g in np.unique(contact_tr):
#     mask = (contact_tr==g)
#     if mask.sum() >= min_group:
#         ir = IsotonicRegression(out_of_bounds="clip")
#         ir.fit(dur_tr[mask], y_arr[mask])
#         per_group_full[g] = ir

# for g in np.unique(contact_te):
#     mask = (contact_te==g)
#     model = per_group_full.get(g, g_ir_full)
#     pred_iso_contact[mask] = model.predict(dur_te[mask])

# print(f"[Iso-PerContact] OOF AUC={roc_auc_score(y_arr, oof_iso_contact):.6f}")

# # ------------------ 3) Blends including duration models ------------------
# def rankit(a):
#     return pd.Series(a).rank(method="average").to_numpy() / (len(a)+1e-9)

# cands = {}
# # two duration channels (global + per-contact) averaged for stability
# oof_dur = 0.5*(oof_iso_global + oof_iso_contact)
# pred_dur = 0.5*(pred_iso_global + pred_iso_contact)
# print(f"[Iso-Avg] OOF AUC={roc_auc_score(y_arr, oof_dur):.6f}")

# # Baselines you already have
# cands["mean_cb_lgb"]     = (0.5*(cb_oof + lgb_oof), 0.5*(cb_pred + lgb_pred))
# cands["rank_cb_lgb"]     = (0.5*(rankit(cb_oof)+rankit(lgb_oof)),
#                             0.5*(rankit(cb_pred)+rankit(lgb_pred)))

# # Add duration into blends
# cands["mean_cb_lgb_dur"] = ((cb_oof + lgb_oof + oof_dur)/3.0,
#                             (cb_pred + lgb_pred + pred_dur)/3.0)
# cands["rank_cb_lgb_dur"] = ((rankit(cb_oof)+rankit(lgb_oof)+rankit(oof_dur))/3.0,
#                             (rankit(cb_pred)+rankit(lgb_pred)+rankit(pred_dur))/3.0)

# # Simplex grid over 3-way MEAN blend (CB/LGB/DUR)
# best_auc, best_name, best_test = -1.0, None, None
# weights = np.linspace(0,1,11)  # 0.0..1.0 step 0.1
# for w_cb in weights:
#     for w_lgb in weights:
#         if w_cb + w_lgb <= 1.0:
#             w_dur = 1.0 - (w_cb + w_lgb)
#             oof_m = w_cb*cb_oof + w_lgb*lgb_oof + w_dur*oof_dur
#             auc = roc_auc_score(y_arr, oof_m)
#             if auc > best_auc:
#                 best_auc, best_name, best_test = auc, f"mean_grid(w_cb={w_cb:.1f},w_lgb={w_lgb:.1f},w_dur={w_dur:.1f})", \
#                     (w_cb*cb_pred + w_lgb*lgb_pred + w_dur*pred_dur)

# # Compare with rank 3-way blend too
# oof_rank3 = (rankit(cb_oof)+rankit(lgb_oof)+rankit(oof_dur))/3.0
# auc_rank3 = roc_auc_score(y_arr, oof_rank3)
# if auc_rank3 > best_auc:
#     best_auc, best_name, best_test = auc_rank3, "rank_equal_3way", \
#         (rankit(cb_pred)+rankit(lgb_pred)+rankit(pred_dur))/3.0

# print(f"[BLEND+] Best OOF AUC={best_auc:.6f} via {best_name}")

# sub = pd.DataFrame({"id": test["id"], "y": best_test})
# fname = f"submission_best_blend_with_duration.csv"
# sub.to_csv(fname, index=False)
# print("Wrote:", fname)

# # ------------------ 4) Bonus: tiny meta-stack (logistic on OOF) ------------------
# # Features to meta: [cb_oof, lgb_oof, oof_dur, duration_rank]
# dur_rank = rankit(dur_tr)
# meta_X = np.column_stack([cb_oof, lgb_oof, oof_dur, dur_rank])
# meta_te = np.column_stack([cb_pred, lgb_pred, pred_dur, rankit(dur_te)])

# # In-sample AUC for the meta-model: the LR is fit and scored on the same OOF rows, so this figure is optimistic.
# lr = LogisticRegression(C=2.0, solver="lbfgs", max_iter=500)
# lr.fit(meta_X, y_arr)
# oof_meta = lr.predict_proba(meta_X)[:,1]
# auc_meta = roc_auc_score(y_arr, oof_meta)
# pred_meta = lr.predict_proba(meta_te)[:,1]
# print(f"[META-LR] OOF AUC={auc_meta:.6f}")

# sub2 = pd.DataFrame({"id": test["id"], "y": pred_meta})
# sub2_name = "submission_meta_lr_cb_lgb_dur.csv"
# sub2.to_csv(sub2_name, index=False)
# print("Wrote:", sub2_name)


[Iso-Global] OOF raw=0.891598 log1p=0.891598 blended=0.891598
[Iso-PerContact] OOF AUC=0.500000
[Iso-Avg] OOF AUC=0.891598
[BLEND+] Best OOF AUC=0.956628 via mean_grid(w_cb=0.1,w_lgb=0.2,w_dur=0.7)
Wrote: submission_best_blend_with_duration.csv
[META-LR] OOF AUC=0.958436
Wrote: submission_meta_lr_cb_lgb_dur.csv
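
Every duration channel in this section comes from `IsotonicRegression`, which fits a non-decreasing step function of duration to the 0/1 target. The pool-adjacent-violators idea behind such a fit can be sketched in a few lines of plain Python. This is an illustrative sketch, not scikit-learn's implementation: `pav_fit` and the toy data are hypothetical, and sample weights and tied feature values are ignored.

```python
def pav_fit(y_sorted):
    """Least-squares non-decreasing fit to y_sorted (targets ordered by the feature)."""
    blocks = []                          # each block is [sum, count]
    for y in y_sorted:
        blocks.append([float(y), 1])
        # Merge while the previous block's mean exceeds the last block's mean.
        while len(blocks) > 1 and (
            blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]
        ):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)       # every point in a block gets the block mean
    return fitted

# Toy targets ordered by increasing call duration.
y_by_duration = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]
fit = pav_fit(y_by_duration)
print(fit)
```

Each merged block reports the positive rate of its members, which is why the fitted values can be read as a monotone estimate of the probability of `y = 1` given duration.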


In [None]:
# # === FINAL MODE: CatBoost + LightGBM + best-of mean/rank blend (full data, 10-fold) ===
# import os, time, numpy as np, pandas as pd
# from sklearn.model_selection import StratifiedKFold
# from sklearn.metrics import roc_auc_score
# from catboost import CatBoostClassifier, Pool
# import lightgbm as lgb

# # Sanity: ensure 'duration' is in features (this comp rewards it)
# if "duration" not in X.columns:
#     print("⚠️ 'duration' not in features. Top scores usually include it. "
#           "Rebuild X with DROP_DURATION=False if you want max AUC.")

# # Keep your Mac cooler
# avail = os.cpu_count() or 8
# threads = min(max(avail - 2, 6), 12)
# os.environ["OMP_NUM_THREADS"] = str(threads)
# os.environ["MKL_NUM_THREADS"] = str(threads)

# FOLDS = 10
# RANDOM_STATE = 42

# # ---------- CatBoost (full strength CPU preset) ----------
# cb_params = dict(
#     loss_function="Logloss",
#     eval_metric="AUC",
#     iterations=12000,            # ES will trim
#     learning_rate=0.03,
#     depth=8,
#     l2_leaf_reg=20,
#     random_strength=0.2,
#     bootstrap_type="Bayesian",
#     bagging_temperature=1.0,
#     rsm=0.8,
#     one_hot_max_size=64,
#     max_ctr_complexity=2,
#     border_count=254,
#     auto_class_weights="Balanced",
#     early_stopping_rounds=800,
#     random_seed=RANDOM_STATE,
#     thread_count=threads,
#     verbose=200,
#     allow_writing_files=False
# )

# skf = StratifiedKFold(n_splits=FOLDS, shuffle=True, random_state=RANDOM_STATE)
# cb_oof = np.zeros(len(X), dtype=np.float32)
# cb_pred = np.zeros(len(X_test), dtype=np.float32)

# t0 = time.time()
# for fold, (tr, va) in enumerate(skf.split(X, y), 1):
#     X_tr, y_tr = X.iloc[tr], y.iloc[tr]
#     X_va, y_va = X.iloc[va], y.iloc[va]

#     model = CatBoostClassifier(**cb_params)
#     model.fit(
#         Pool(X_tr, y_tr, cat_features=cat_idx),
#         eval_set=Pool(X_va, y_va, cat_features=cat_idx),
#         use_best_model=True
#     )

#     cb_oof[va] = model.predict_proba(X_va)[:,1]
#     cb_pred   += model.predict_proba(Pool(X_test, cat_features=cat_idx))[:,1] / FOLDS
#     print(f"[CB] fold {fold}: AUC={roc_auc_score(y_va, cb_oof[va]):.6f}")
# cb_oof_auc = roc_auc_score(y, cb_oof)
# print(f"[CB] OOF AUC={cb_oof_auc:.6f}  (time {time.time()-t0:.1f}s)")

# # ---------- LightGBM (strong CPU preset with cat smoothing) ----------
# X_lgb = X.copy(); X_test_lgb = X_test.copy()
# for c in cat_cols:
#     X_lgb[c] = X_lgb[c].astype("category")
#     X_test_lgb[c] = X_test_lgb[c].astype("category")

# lgb_params = dict(
#     objective="binary",
#     boosting_type="gbdt",
#     metric="auc",
#     learning_rate=0.02,
#     n_estimators=40000,            # ES trims
#     num_leaves=255,
#     max_depth=-1,
#     min_data_in_leaf=40,
#     feature_fraction=0.8,
#     bagging_fraction=0.8,
#     bagging_freq=1,
#     lambda_l1=1.0, lambda_l2=2.0,
#     max_bin=255,
#     min_data_per_group=50,
#     cat_l2=20.0, cat_smooth=20.0,  # categorical smoothing
#     force_col_wise=True,
#     deterministic=True,
#     n_jobs=threads,
#     verbose=-1
# )

# lgb_oof = np.zeros(len(X_lgb), dtype=np.float32)
# lgb_pred = np.zeros(len(X_test_lgb), dtype=np.float32)

# t1 = time.time()
# for fold, (tr, va) in enumerate(skf.split(X_lgb, y), 1):
#     X_tr, y_tr = X_lgb.iloc[tr], y.iloc[tr]
#     X_va, y_va = X_lgb.iloc[va], y.iloc[va]

#     lgbm = lgb.LGBMClassifier(**lgb_params)
#     lgbm.fit(
#         X_tr, y_tr,
#         eval_set=[(X_va, y_va)],
#         eval_metric="auc",
#         categorical_feature=cat_cols,
#         callbacks=[lgb.early_stopping(stopping_rounds=1200, verbose=False)]
#     )
#     lgb_oof[va] = lgbm.predict_proba(X_va)[:,1]
#     lgb_pred   += lgbm.predict_proba(X_test_lgb)[:,1] / FOLDS
#     print(f"[LGBM] fold {fold}: AUC={roc_auc_score(y_va, lgb_oof[va]):.6f}")
# lgb_oof_auc = roc_auc_score(y, lgb_oof)
# print(f"[LGBM] OOF AUC={lgb_oof_auc:.6f}  (time {time.time()-t1:.1f}s)")

# # ---------- Best-of mean vs rank blend ----------
# def rankit(a):
#     return pd.Series(a).rank(method="average").to_numpy() / (len(a) + 1e-9)

# best_auc, best_name, best_w = -1.0, None, None
# for w in np.linspace(0, 1, 41):  # 0.00..1.00 step 0.025
#     oof_mean = w*cb_oof + (1-w)*lgb_oof
#     auc_mean = roc_auc_score(y, oof_mean)
#     if auc_mean > best_auc:
#         best_auc, best_name, best_w = auc_mean, f"mean (w_cb={w:.3f})", w
#     oof_rank = w*rankit(cb_oof) + (1-w)*rankit(lgb_oof)
#     auc_rank = roc_auc_score(y, oof_rank)
#     if auc_rank > best_auc:
#         best_auc, best_name, best_w = auc_rank, f"rank (w_cb={w:.3f})", w

# print(f"[BLEND] Best OOF AUC={best_auc:.6f} using {best_name}")
# use_rank = best_name.startswith("rank")
# test_blend = best_w*(rankit(cb_pred) if use_rank else cb_pred) + \
#              (1-best_w)*(rankit(lgb_pred) if use_rank else lgb_pred)

# # ---------- Submission ----------
# sub = pd.DataFrame({"id": test["id"], "y": test_blend})
# fname = f"submission_cb_lgb_{'rank' if use_rank else 'mean'}_blend_final.csv"
# sub.to_csv(fname, index=False)
# print("Wrote:", fname)


⚠️ 'duration' not in features. Top scores usually include it. Rebuild X with DROP_DURATION=False if you want max AUC.
0:	test: 0.7929568	best: 0.7929568 (0)	total: 51.3ms	remaining: 10m 14s
200:	test: 0.8434138	best: 0.8434138 (200)	total: 5.67s	remaining: 5m 32s
400:	test: 0.8502251	best: 0.8502251 (400)	total: 11.5s	remaining: 5m 34s
600:	test: 0.8536598	best: 0.8536598 (600)	total: 17.4s	remaining: 5m 29s
800:	test: 0.8554976	best: 0.8554976 (800)	total: 23.2s	remaining: 5m 23s
1000:	test: 0.8568576	best: 0.8568591 (998)	total: 28.9s	remaining: 5m 17s
1200:	test: 0.8577602	best: 0.8577602 (1200)	total: 34.7s	remaining: 5m 11s
1400:	test: 0.8585213	best: 0.8585213 (1400)	total: 40.4s	remaining: 5m 5s
1600:	test: 0.8589860	best: 0.8589912 (1598)	total: 46.2s	remaining: 4m 59s
1800:	test: 0.8594119	best: 0.8594119 (1800)	total: 51.9s	remaining: 4m 53s
2000:	test: 0.8597962	best: 0.8597986 (1997)	total: 57.6s	remaining: 4m 47s
2200:	test: 0.8599679	best: 0.8599679 (2200)	total: 1m 3s	re
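
The blend step in the cell above compares averaging raw scores with averaging ranks. Rank averaging matters when the inputs live on different scales, since the input with the wider range otherwise dominates the mean. A dependency-free sketch of the idea follows; `rank01` is a hypothetical stand-in for the notebook's `rankit` (average ranks divided by the sample count), and the scores are invented.

```python
def rank01(scores):
    """Average ranks (1-based, ties share the mean rank) scaled to (0, 1]."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return [r / len(scores) for r in ranks]

model_a = [0.10, 0.20, 0.90, 0.20]       # probabilities in [0, 1]
model_b = [5.0, 40.0, 30.0, 10.0]        # uncalibrated scores on a larger scale

mean_blend = [(a + b) / 2 for a, b in zip(model_a, model_b)]
rank_blend = [(a + b) / 2 for a, b in zip(rank01(model_a), rank01(model_b))]
print(mean_blend)
print(rank_blend)
```

In this toy case the mean blend simply reproduces the ordering of `model_b`, while the rank blend lets both inputs influence which row comes out on top.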

In [None]:
# # === Checkpoint: make sure data & key column names are defined ===
# from pathlib import Path
# import pandas as pd
# import numpy as np

# # Load if not already loaded in this kernel
# DATA_DIR = Path("./data")
# if "train" not in globals():
#     train = pd.read_csv(DATA_DIR / "train.csv")
# if "test" not in globals():
#     test  = pd.read_csv(DATA_DIR / "test.csv")

# # Detect id/target columns robustly
# ID_COL = next((c for c in ["id","Id","ID"] if c in train.columns), None)
# TARGET = next((c for c in ["y","target","label"] if c in train.columns), None)

# assert ID_COL is not None and TARGET is not None, (
#     f"Could not find id/target columns in train. Columns: {list(train.columns)}"
# )

# print("ID_COL:", ID_COL, "| TARGET:", TARGET)
# print("train shape:", train.shape, "| test shape:", test.shape)


ID_COL: id | TARGET: y
train shape: (750000, 25) | test shape: (250000, 24)


In [None]:

# # Toggle: safer baseline is to drop 'duration'. Try both.
# DROP_DURATION = True
# features = [c for c in train.columns if c not in [ID_COL, TARGET]]
# if DROP_DURATION and "duration" in features:
#     features.remove("duration")

# def infer_cats(df, cols, max_int_card=30):
#     cats, nums = [], []
#     for c in cols:
#         if df[c].dtype == "object":
#             cats.append(c)
#         elif pd.api.types.is_integer_dtype(df[c]) and df[c].nunique() <= max_int_card:
#             cats.append(c)
#         else:
#             nums.append(c)
#     return cats, nums

# cat_cols, num_cols = infer_cats(train, features)

# # Cast categoricals to str for CatBoost; LightGBM will accept category dtype later
# for c in cat_cols:
#     train[c] = train[c].astype("str")
#     test[c]  = test[c].astype("str")

# X      = train[features].copy()
# y      = train[TARGET].astype(int).copy()
# X_test = test[features].copy()

# cat_idx = [X.columns.get_loc(c) for c in cat_cols]
# print(f"Using {len(features)} features → {len(cat_cols)} categorical / {len(num_cols)} numeric. DROP_DURATION={DROP_DURATION}")


Using 22 features → 16 categorical / 6 numeric. DROP_DURATION=True


## 6. CatBoost (CPU) — Stratified K-Fold CV

In [None]:
# import os
# os.environ["OMP_NUM_THREADS"] = "6"   # or 4
# os.environ["MKL_NUM_THREADS"] = "6"   # or 4

In [None]:
# # === Fast & Strong: CatBoost + LightGBM + Smart Blend (macOS CPU) ===
# import os, time, numpy as np, pandas as pd
# from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
# from sklearn.metrics import roc_auc_score
# from catboost import CatBoostClassifier, Pool
# import lightgbm as lgb

In [None]:
# # ---------- Dev vs Final switch ----------
# DEV_MODE = True     # True = quick iteration; False = full strength for final
# FOLDS    = 3 if DEV_MODE else 5

# # ---------- CPU hygiene ----------
# avail = os.cpu_count() or 8
# threads = min(max(avail - 2, 4), 8)  # leave headroom for macOS; cap at 8
# os.environ["OMP_NUM_THREADS"] = str(threads)
# os.environ["MKL_NUM_THREADS"] = str(threads)

# # Optional: train on a subset in dev to iterate quickly (keeps class balance)
# if DEV_MODE:
#     sss = StratifiedShuffleSplit(n_splits=1, test_size=0.6, random_state=42)  # keep 40%
#     keep_idx, _ = next(sss.split(X, y))
#     X_run, y_run = X.iloc[keep_idx], y.iloc[keep_idx]
# else:
#     X_run, y_run = X, y
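
The dev-mode subset above uses `StratifiedShuffleSplit` to keep 40% of the rows while preserving the class ratio. The sketch below shows that idea without any dependencies. The helper name `stratified_keep` and the toy labels are hypothetical, and this is not scikit-learn's implementation, only an illustration of per-class sampling.

```python
import random
from collections import defaultdict

def stratified_keep(labels, keep_frac=0.4, seed=42):
    """Return sorted row indices that keep ~keep_frac of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    keep = []
    for idxs in by_class.values():
        rng.shuffle(idxs)                      # random order within the class
        n_keep = round(len(idxs) * keep_frac)  # same fraction for each class
        keep.extend(idxs[:n_keep])
    return sorted(keep)

# Hypothetical imbalanced target: 880 negatives, 120 positives
labels = [0] * 880 + [1] * 120
keep_idx = stratified_keep(labels)
kept = [labels[i] for i in keep_idx]

pos_rate_full = sum(labels) / len(labels)
pos_rate_kept = sum(kept) / len(kept)
print(len(kept), pos_rate_full, pos_rate_kept)
```

Because each class is sampled separately, the positive rate of the subset matches the full data, which keeps dev-mode AUC figures comparable in spirit to a full run.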

In [None]:
# # ---------- CatBoost (lighter/faster for mac CPU) ----------
# cb_params = dict(
#     loss_function="Logloss",
#     eval_metric="AUC",
#     iterations=4000 if DEV_MODE else 8000,
#     learning_rate=0.05 if DEV_MODE else 0.035,
#     depth=6 if DEV_MODE else 8,
#     l2_leaf_reg=20,
#     random_strength=0.2,
#     bootstrap_type="Bernoulli",             # faster than Bayesian on CPU
#     subsample=0.8,                           # row subsampling
#     rsm=0.7,                                 # col subsampling
#     one_hot_max_size=32 if DEV_MODE else 64,
#     max_ctr_complexity=1 if DEV_MODE else 2,
#     border_count=128 if DEV_MODE else 254,   # fewer bins -> faster
#     early_stopping_rounds=300 if DEV_MODE else 600,
#     auto_class_weights="Balanced",
#     thread_count=threads,
#     random_seed=42,
#     verbose=200,
#     allow_writing_files=False
# )

# skf_cb = StratifiedKFold(n_splits=FOLDS, shuffle=True, random_state=42)
# cb_oof = np.zeros(len(X_run), dtype=np.float32)
# cb_pred = np.zeros(len(X_test), dtype=np.float32)

# t0 = time.time()
# for f, (tr, va) in enumerate(skf_cb.split(X_run, y_run), 1):
#     X_tr, y_tr = X_run.iloc[tr], y_run.iloc[tr]
#     X_va, y_va = X_run.iloc[va], y_run.iloc[va]
#     model = CatBoostClassifier(**cb_params)
#     model.fit(
#         Pool(X_tr, y_tr, cat_features=cat_idx),
#         eval_set=Pool(X_va, y_va, cat_features=cat_idx),
#         use_best_model=True
#     )
#     cb_oof[va] = model.predict_proba(X_va)[:,1]
#     cb_pred += model.predict_proba(Pool(X_test, cat_features=cat_idx))[:,1] / FOLDS
#     print(f"[CB] fold {f}: AUC={roc_auc_score(y_va, cb_oof[va]):.6f}")
# cb_oof_auc = roc_auc_score(y_run, cb_oof)
# print(f"[CB] OOF AUC={cb_oof_auc:.6f}  (time {time.time()-t0:.1f}s)")

0:	test: 0.7702030	best: 0.7702030 (0)	total: 11.3ms	remaining: 45.2s
200:	test: 0.8381570	best: 0.8381570 (200)	total: 1.79s	remaining: 33.9s
400:	test: 0.8438095	best: 0.8438095 (400)	total: 3.5s	remaining: 31.4s
600:	test: 0.8459956	best: 0.8459956 (600)	total: 5.21s	remaining: 29.5s
800:	test: 0.8473366	best: 0.8473409 (790)	total: 6.95s	remaining: 27.8s
1000:	test: 0.8480945	best: 0.8480991 (993)	total: 8.64s	remaining: 25.9s
1200:	test: 0.8485436	best: 0.8485505 (1167)	total: 10.3s	remaining: 24.1s
1400:	test: 0.8486800	best: 0.8486877 (1398)	total: 12s	remaining: 22.3s
1600:	test: 0.8487266	best: 0.8488507 (1487)	total: 13.7s	remaining: 20.6s
Stopped by overfitting detector  (300 iterations wait)

bestTest = 0.8488507412
bestIteration = 1487

Shrink model to first 1488 iterations.
[CB] fold 1: AUC=0.848851
0:	test: 0.7665027	best: 0.7665027 (0)	total: 9.82ms	remaining: 39.3s
200:	test: 0.8407328	best: 0.8407328 (200)	total: 1.73s	remaining: 32.6s
400:	test: 0.8463732	best: 0.846
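
The log above ends a fold with "Stopped by overfitting detector (300 iterations wait)" and then shrinks the model to the best iteration. A minimal sketch of patience-based early stopping, assuming a higher-is-better metric; the function name `early_stop` and the scores are hypothetical, and CatBoost's exact counting inside its detector is not reproduced here.

```python
def early_stop(scores, patience):
    """Return (best_iter, stop_iter) for a higher-is-better validation metric."""
    best_iter, best_score = 0, float("-inf")
    for i, score in enumerate(scores):
        if score > best_score:
            best_iter, best_score = i, score
        elif i - best_iter >= patience:
            return best_iter, i        # no improvement for `patience` iterations
    return best_iter, len(scores) - 1  # ran out of iterations without stopping

# Hypothetical validation AUC per boosting iteration
scores = [0.70, 0.75, 0.78, 0.80, 0.79, 0.80, 0.795, 0.79, 0.81]
best_iter, stop_iter = early_stop(scores, patience=3)
n_trees_kept = best_iter + 1  # iterations are 0-based, like bestIteration in the log
print(best_iter, stop_iter, n_trees_kept)
```

The later score of 0.81 is never reached because training stops first. That is the trade-off behind choosing a larger patience for the final run than for dev mode.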

In [None]:

# import numpy as np, time
# from sklearn.model_selection import StratifiedKFold
# from sklearn.metrics import roc_auc_score
# from catboost import CatBoostClassifier, Pool

# N_FOLDS, RANDOM_STATE = 5, 42

# cb_params = dict(
#     loss_function="Logloss",
#     eval_metric="AUC",
#     iterations=4000,            # was 12000
#     learning_rate=0.05,         # a bit faster per-iter
#     depth=6,                    # shallower trees
#     l2_leaf_reg=20,
#     random_strength=0.2,
#     bootstrap_type="Bernoulli", # cheaper than Bayesian
#     subsample=0.8,              # row subsampling
#     rsm=0.7,                    # column subsampling
#     one_hot_max_size=32,
#     max_ctr_complexity=1,
#     border_count=128,           # fewer bins → faster
#     early_stopping_rounds=300,
#     auto_class_weights="Balanced",
#     thread_count=6,             # match OMP/MKL
#     random_seed=42,
#     verbose=200,
#     allow_writing_files=False
# )
# N_FOLDS = 3  # use 3 folds while iterating; switch back to 5–10 for final


# skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=RANDOM_STATE)
# cb_oof = np.zeros(len(X), dtype=np.float32)
# cb_pred = np.zeros(len(X_test), dtype=np.float32)
# fold_aucs = []

# start = time.time()
# for fold, (tr_idx, va_idx) in enumerate(skf.split(X, y), 1):
#     X_tr, y_tr = X.iloc[tr_idx], y.iloc[tr_idx]
#     X_va, y_va = X.iloc[va_idx], y.iloc[va_idx]

#     train_pool = Pool(X_tr, y_tr, cat_features=cat_idx)
#     valid_pool = Pool(X_va, y_va, cat_features=cat_idx)
#     test_pool  = Pool(X_test,      cat_features=cat_idx)

#     model = CatBoostClassifier(**cb_params)
#     model.fit(train_pool, eval_set=valid_pool, use_best_model=True)

#     cb_oof[va_idx] = model.predict_proba(valid_pool)[:, 1]
#     cb_pred += model.predict_proba(test_pool)[:, 1] / N_FOLDS

#     auc = roc_auc_score(y_va, cb_oof[va_idx])
#     fold_aucs.append(auc)
#     print(f"[CB] Fold {fold}: AUC={auc:.6f}")

# oof_auc = roc_auc_score(y, cb_oof)
# print(f"\n[CB] OOF AUC={oof_auc:.6f} | folds: {', '.join(f'{a:.5f}' for a in fold_aucs)}")
# print(f"Time: {time.time()-start:.1f}s")

# # Save OOF for later blends
# import pandas as pd
# pd.DataFrame({ID_COL: train[ID_COL], TARGET: y, "oof_cb": cb_oof}).to_csv(
#     f"oof_catboost_{'noDur' if DROP_DURATION else 'withDur'}.csv", index=False
# )

# # Feature importance (PredictionValuesChange)
# importances = pd.Series(
#     model.get_feature_importance(type="PredictionValuesChange"), index=X.columns
# ).sort_values(ascending=False)
# display(importances.head(25).to_frame("importance"))


0:	test: 0.7787472	best: 0.7787472 (0)	total: 24.3ms	remaining: 1m 37s
200:	test: 0.8412393	best: 0.8412393 (200)	total: 4.38s	remaining: 1m 22s
400:	test: 0.8472951	best: 0.8472951 (400)	total: 8.77s	remaining: 1m 18s
600:	test: 0.8499414	best: 0.8499414 (600)	total: 13.2s	remaining: 1m 14s
800:	test: 0.8515696	best: 0.8515696 (800)	total: 17.6s	remaining: 1m 10s
1000:	test: 0.8525174	best: 0.8525174 (1000)	total: 22s	remaining: 1m 5s
1200:	test: 0.8532080	best: 0.8532086 (1198)	total: 26.4s	remaining: 1m 1s
1400:	test: 0.8536296	best: 0.8536296 (1400)	total: 30.7s	remaining: 57s
1600:	test: 0.8539648	best: 0.8539648 (1600)	total: 35.1s	remaining: 52.6s
1800:	test: 0.8542443	best: 0.8542443 (1800)	total: 39.5s	remaining: 48.2s
2000:	test: 0.8543972	best: 0.8543972 (2000)	total: 43.9s	remaining: 43.8s
2200:	test: 0.8544883	best: 0.8544969 (2183)	total: 48.2s	remaining: 39.4s
2400:	test: 0.8545722	best: 0.8545722 (2400)	total: 52.5s	remaining: 35s
2600:	test: 0.8545553	best: 0.8545752 (

| feature | importance |
|---|---|
| balance | 19.898697 |
| poutcome | 10.605021 |
| day | 10.070681 |
| month | 8.510089 |
| month_idx | 8.447581 |
| contact | 7.947882 |
| age | 5.872536 |
| housing | 4.790214 |
| pdays | 4.10069 |
| campaign | 3.548862 |


## 7. LightGBM (CPU, native categoricals) — Stratified K-Fold CV

In [None]:
# # ---------- LightGBM (fast + categorical smoothing) ----------
# # Ensure category dtype for LGBM
# X_lgb = X_run.copy(); X_test_lgb = X_test.copy()
# for c in cat_cols:
#     X_lgb[c] = X_lgb[c].astype("category")
#     X_test_lgb[c] = X_test_lgb[c].astype("category")

# lgb_params = dict(
#     objective="binary",
#     boosting_type="gbdt",
#     metric="auc",
#     learning_rate=0.03 if DEV_MODE else 0.025,
#     n_estimators=10000 if DEV_MODE else 20000,  # early stopping trims
#     num_leaves=63 if DEV_MODE else 127,
#     max_depth=-1,
#     min_data_in_leaf=80 if DEV_MODE else 60,
#     feature_fraction=0.8,
#     bagging_fraction=0.8,
#     bagging_freq=1,
#     lambda_l1=1.0, lambda_l2=2.0,
#     max_bin=255,
#     # categorical smoothing knobs:
#     min_data_per_group=50,
#     cat_l2=10.0, cat_smooth=10.0,
#     # speed:
#     force_col_wise=True,
#     deterministic=True,
#     n_jobs=threads,
#     verbose=-1
# )

# skf_lgb = StratifiedKFold(n_splits=FOLDS, shuffle=True, random_state=42)
# lgb_oof = np.zeros(len(X_lgb), dtype=np.float32)
# lgb_pred = np.zeros(len(X_test_lgb), dtype=np.float32)

# t1 = time.time()
# for f, (tr, va) in enumerate(skf_lgb.split(X_lgb, y_run), 1):
#     X_tr, y_tr = X_lgb.iloc[tr], y_run.iloc[tr]
#     X_va, y_va = X_lgb.iloc[va], y_run.iloc[va]
#     lgbm = lgb.LGBMClassifier(**lgb_params)
#     lgbm.fit(
#         X_tr, y_tr,
#         eval_set=[(X_va, y_va)],
#         eval_metric="auc",
#         categorical_feature=cat_cols,
#         callbacks=[lgb.early_stopping(stopping_rounds=600 if DEV_MODE else 800, verbose=False)]
#     )
#     lgb_oof[va] = lgbm.predict_proba(X_va)[:,1]
#     lgb_pred += lgbm.predict_proba(X_test_lgb)[:,1] / FOLDS
#     print(f"[LGBM] fold {f}: AUC={roc_auc_score(y_va, lgb_oof[va]):.6f}")
# lgb_oof_auc = roc_auc_score(y_run, lgb_oof)
# print(f"[LGBM] OOF AUC={lgb_oof_auc:.6f}  (time {time.time()-t1:.1f}s)")

[LGBM] fold 1: AUC=0.850863
[LGBM] fold 2: AUC=0.854482
[LGBM] fold 3: AUC=0.853723
[LGBM] OOF AUC=0.853002  (time 72.1s)


In [None]:

# import lightgbm as lgb
# from sklearn.model_selection import StratifiedKFold
# from sklearn.metrics import roc_auc_score
# import numpy as np, pandas as pd

# # Ensure categorical dtype for LightGBM
# X_lgb = X.copy()
# X_test_lgb = X_test.copy()
# for c in cat_cols:
#     X_lgb[c] = X_lgb[c].astype("category")
#     X_test_lgb[c] = X_test_lgb[c].astype("category")

# lgb_params = dict(
#     objective="binary", metric="auc",
#     learning_rate=0.03, n_estimators=10000,
#     num_leaves=63, min_data_in_leaf=80,
#     feature_fraction=0.8, bagging_fraction=0.8, bagging_freq=1,
#     lambda_l1=1.0, lambda_l2=2.0,
#     max_bin=255, n_jobs=6, verbose=-1
# )


# skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# lgb_oof = np.zeros(len(X_lgb), dtype=np.float32)
# lgb_pred = np.zeros(len(X_test_lgb), dtype=np.float32)
# fold_aucs_lgb = []

# for fold, (tr, va) in enumerate(skf.split(X_lgb, y), 1):
#     X_tr, y_tr = X_lgb.iloc[tr], y.iloc[tr]
#     X_va, y_va = X_lgb.iloc[va], y.iloc[va]
#     lgbm = lgb.LGBMClassifier(**lgb_params)
#     lgbm.fit(
#         X_tr, y_tr,
#         eval_set=[(X_va, y_va)],
#         eval_metric="auc",
#         categorical_feature=cat_cols,
#         callbacks=[lgb.early_stopping(stopping_rounds=800, verbose=False)]
#     )
#     lgb_oof[va] = lgbm.predict_proba(X_va)[:,1]
#     lgb_pred += lgbm.predict_proba(X_test_lgb)[:,1] / skf.n_splits
#     auc = roc_auc_score(y_va, lgb_oof[va]); fold_aucs_lgb.append(auc)
#     print(f"[LGBM] Fold {fold}: {auc:.6f}")

# print(f"[LGBM] OOF AUC: {roc_auc_score(y, lgb_oof):.6f}")


[LGBM] Fold 1: 0.860405
[LGBM] Fold 2: 0.857053
[LGBM] Fold 3: 0.857265
[LGBM] Fold 4: 0.857450
[LGBM] Fold 5: 0.855750
[LGBM] OOF AUC: 0.857582
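
The per-fold AUCs and the OOF AUC printed above measure slightly different things: each fold AUC ranks predictions within one validation fold, while the OOF AUC ranks all out-of-fold predictions together. A small sketch that summarises the logged fold values (copied from the output above) so the spread across folds is visible next to the pooled figure:

```python
from statistics import mean, stdev

# Values copied from the LightGBM 5-fold log above
fold_aucs = [0.860405, 0.857053, 0.857265, 0.857450, 0.855750]
oof_auc = 0.857582

fold_mean = mean(fold_aucs)
fold_sd = stdev(fold_aucs)  # sample standard deviation across folds

print(f"fold mean={fold_mean:.6f}  fold sd={fold_sd:.6f}  pooled OOF={oof_auc:.6f}")
```

A fold standard deviation that is large relative to the gain from a change suggests the change may be noise rather than a real improvement.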


## 8. Blend (CatBoost + LightGBM) & Create Submission

In [None]:

# # ---------- Mean vs Rank blend; pick best on OOF ----------
# def rankit(a):
#     # Average ranks (ties share a rank), scaled to (0, 1]
#     return pd.Series(a).rank(method="average").to_numpy() / (len(a) + 1e-9)

# best_auc, best_name, best_w = -1.0, None, None
# for w in np.linspace(0, 1, 21):
#     # mean blend
#     oof_mean = w*cb_oof + (1-w)*lgb_oof
#     auc_mean = roc_auc_score(y_run, oof_mean)
#     if auc_mean > best_auc:
#         best_auc, best_name, best_w = auc_mean, f"mean (w_cb={w:.2f})", w
#     # rank blend
#     oof_rank = w*rankit(cb_oof) + (1-w)*rankit(lgb_oof)
#     auc_rank = roc_auc_score(y_run, oof_rank)
#     if auc_rank > best_auc:
#         best_auc, best_name, best_w = auc_rank, f"rank (w_cb={w:.2f})", w

# print(f"[BLEND] Best OOF AUC={best_auc:.6f} using {best_name}")

# # Build test preds with the same strategy
# use_rank = best_name.startswith("rank")
# if use_rank:
#     test_blend = best_w*rankit(cb_pred) + (1-best_w)*rankit(lgb_pred)
# else:
#     test_blend = best_w*cb_pred + (1-best_w)*lgb_pred

# # If we trained on a subset in DEV_MODE, rebuild full-data models quickly for test preds (optional).
# # (Often not needed; you can just switch DEV_MODE=False for your final run.)
# # -- Skipping here for simplicity.

[BLEND] Best OOF AUC=0.853226 using rank (w_cb=0.20)
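
The blend cell compares a mean blend with a rank blend. AUC depends only on the ordering of scores, so ranking a single model's predictions leaves its AUC unchanged; the two blends differ only in how the models are combined. The dependency-free sketch below illustrates this on hypothetical toy scores. The helpers `average_ranks`, `auc`, and `rankit` are written for this example (AUC is computed through the Mann-Whitney rank-sum identity) and are not the notebook's functions.

```python
def average_ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def auc(y_true, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) identity."""
    ranks = average_ranks(scores)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def rankit(a):
    return [r / len(a) for r in average_ranks(a)]

# Hypothetical scores: model_b is squeezed into a narrow range
y = [0, 0, 1, 0, 1, 1]
model_a = [0.10, 0.40, 0.35, 0.20, 0.80, 0.90]
model_b = [0.03, 0.01, 0.04, 0.02, 0.025, 0.99]

w = 0.5
mean_blend = [w * a + (1 - w) * b for a, b in zip(model_a, model_b)]
rank_blend = [w * a + (1 - w) * b for a, b in zip(rankit(model_a), rankit(model_b))]

auc_a, auc_b = auc(y, model_a), auc(y, model_b)
auc_mean, auc_rank = auc(y, mean_blend), auc(y, rank_blend)
print(auc_a, auc_b, auc_mean, auc_rank)
```

On this toy data both models and the mean blend score 8/9, while the rank blend reaches 1.0, because ranking puts the narrow-range `model_b` on the same scale as `model_a`. This is only an illustration; on real data either blend can win, which is why the cell above picks the strategy by OOF AUC.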


In [None]:
# # ---------- Write submission ----------
# sub = pd.DataFrame({"id": test["id"], "y": test_blend})
# fname = f"submission_cb_lgb_{'rank' if use_rank else 'mean'}_blend.csv"
# sub.to_csv(fname, index=False)
# print("Wrote:", fname)

# # ---------- (Optional) Persist OOF from full set if not in DEV_MODE ----------
# if not DEV_MODE:
#     pd.DataFrame({ID_COL: train.loc[X_run.index, ID_COL], TARGET: y_run, 
#                   "oof_cb": cb_oof, "oof_lgb": lgb_oof}).to_csv(
#         f"oof_cb_lgb_{'rank' if use_rank else 'mean'}_blend.csv", index=False
#     )

Wrote: submission_cb_lgb_rank_blend.csv


In [None]:

# import numpy as np, pandas as pd
# from sklearn.metrics import roc_auc_score

# # Grid-search a simple linear weight on OOF
# best_w, best_auc = None, -1.0
# for w in np.linspace(0, 1, 21):
#     oof_blend = w*cb_oof + (1-w)*lgb_oof
#     auc = roc_auc_score(y, oof_blend)
#     if auc > best_auc:
#         best_auc, best_w = auc, w

# print(f"[BLEND] Best OOF AUC = {best_auc:.6f} at w_cb={best_w:.2f}, w_lgb={1-best_w:.2f}")

# # Apply the same weight to test preds
# test_blend = best_w*cb_pred + (1-best_w)*lgb_pred

# sub = pd.DataFrame({"id": test["id"], "y": test_blend})
# sub_path = "submission_cb_lgb_blend.csv"
# sub.to_csv(sub_path, index=False)
# print("Wrote:", sub_path)
# display(sub.head())


[BLEND] Best OOF AUC = 0.857638 at w_cb=0.05, w_lgb=0.95
Wrote: submission_cb_lgb_blend.csv


|  | id | y |
|---|---|---|
| 0 | 750000 | 0.045683 |
| 1 | 750001 | 0.08843 |
| 2 | 750002 | 0.102545 |
| 3 | 750003 | 0.001105 |
| 4 | 750004 | 0.227797 |


## 9. (Optional) PyTorch MLP using Apple GPU (MPS)

In [None]:

# # ⚠️ Note: Tree boosters usually outperform simple MLPs on tabular data.
# # This section is just to demonstrate Apple GPU (MPS) usage on macOS.
# # If you want best AUC, rely on CatBoost/LightGBM above.

# import numpy as np, pandas as pd
# from sklearn.model_selection import StratifiedKFold
# from sklearn.metrics import roc_auc_score
# from sklearn.preprocessing import OneHotEncoder
# from sklearn.compose import ColumnTransformer
# from sklearn.pipeline import Pipeline
# import torch
# import torch.nn as nn
# from torch.utils.data import TensorDataset, DataLoader

# device = (
#     torch.device("mps")
#     if hasattr(torch.backends, "mps") and torch.backends.mps.is_available()
#     else torch.device("cuda" if torch.cuda.is_available() else "cpu")
# )
# print("Using device:", device)

# # Build a preprocessing pipeline: one-hot categoricals, pass-through numerics
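# # `sparse` was renamed to `sparse_output` in scikit-learn 1.2 and the old name was removed in 1.4
# ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=True)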
# pre = ColumnTransformer(
#     transformers=[("ohe", ohe, cat_cols)],
#     remainder="passthrough"
# )

# # Fit on full train, transform to sparse -> dense (careful with memory; data here is moderate)
# X_all = pre.fit_transform(X)
# X_test_all = pre.transform(X_test)

# # Convert to torch tensors (use float32)
# import scipy.sparse as sp
# if sp.issparse(X_all):
#     X_all = X_all.astype(np.float32).toarray()
#     X_test_all = X_test_all.astype(np.float32).toarray()
# else:
#     X_all = X_all.astype(np.float32)
#     X_test_all = X_test_all.astype(np.float32)

# X_all_t = torch.tensor(X_all, dtype=torch.float32)
# y_all_t = torch.tensor(y.values, dtype=torch.float32).view(-1,1)

# def make_mlp(in_dim):
#     return nn.Sequential(
#         nn.Linear(in_dim, 256), nn.ReLU(),
#         nn.Linear(256, 128), nn.ReLU(),
#         nn.Linear(128, 64), nn.ReLU(),
#         nn.Linear(64, 1), nn.Sigmoid()
#     )

# # 5-fold CV for MLP (quick)
# skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# mlp_oof = np.zeros(len(X_all), dtype=np.float32)
# mlp_pred = np.zeros(len(X_test_all), dtype=np.float32)

# for fold, (tr_idx, va_idx) in enumerate(skf.split(X_all, y), 1):
#     model = make_mlp(X_all.shape[1]).to(device)
#     opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
#     bce = nn.BCELoss()

#     tr_ds = TensorDataset(X_all_t[tr_idx], y_all_t[tr_idx])
#     va_ds = TensorDataset(X_all_t[va_idx], y_all_t[va_idx])
#     tr_dl = DataLoader(tr_ds, batch_size=4096, shuffle=True)
#     va_dl = DataLoader(va_ds, batch_size=8192, shuffle=False)

#     best_auc, best_state = 0.0, None
#     for epoch in range(15):  # short training
#         model.train()
#         for xb, yb in tr_dl:
#             xb, yb = xb.to(device), yb.to(device)
#             pred = model(xb)
#             loss = bce(pred, yb)
#             opt.zero_grad(); loss.backward(); opt.step()
#         # quick eval
#         model.eval()
#         with torch.no_grad():
#             all_preds = []
#             for xb, yb in va_dl:
#                 xb = xb.to(device)
#                 all_preds.append(model(xb).detach().cpu().numpy())
#             va_p = np.vstack(all_preds).ravel()
#         auc = roc_auc_score(y_all_t[va_idx].numpy(), va_p)
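#         if auc > best_auc:
#             # Clone the tensors: state_dict() returns references that later epochs overwrite
#             best_auc = auc
#             best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}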
#         # print(f"Fold {fold} Epoch {epoch+1}: AUC={auc:.4f}")
#     if best_state:
#         model.load_state_dict(best_state)

#     # Save OOF for fold
#     model.eval()
#     with torch.no_grad():
#         va_dl = DataLoader(va_ds, batch_size=8192, shuffle=False)
#         all_preds = []
#         for xb, yb in va_dl:
#             xb = xb.to(device)
#             all_preds.append(model(xb).detach().cpu().numpy())
#         mlp_oof[va_idx] = np.vstack(all_preds).ravel()

#     # Test preds
#     Xt_t = torch.tensor(X_test_all, dtype=torch.float32)
#     t_dl = DataLoader(TensorDataset(Xt_t, torch.zeros(len(Xt_t),1)), batch_size=8192, shuffle=False)
#     with torch.no_grad():
#         all_preds = []
#         for xb, _ in t_dl:
#             xb = xb.to(device)
#             all_preds.append(model(xb).detach().cpu().numpy())
#         mlp_pred += np.vstack(all_preds).ravel() / skf.n_splits

# print("[MLP] OOF AUC:", roc_auc_score(y, mlp_oof))
