<a href="https://colab.research.google.com/github/Sornambal/Predicting-Loan-Payback-Kaggle-Playground-Series-S5E11/blob/main/Predicting_Loan_Payback_%E2%80%93_Kaggle_Playground_Series_S5E11_pynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Explanation:**

gc – Python’s garbage collector, used to free memory manually (gc.collect()).

warnings.filterwarnings("ignore") – suppresses all Python warnings (including deprecation notices) for cleaner logs; this can also hide real problems, so remove it when debugging.

numpy, pandas – standard libraries for numerical operations and tabular data.

StratifiedKFold – splits the data into folds while preserving the class ratio (important for classification).

SimpleImputer – fills missing values.

OrdinalEncoder – converts categorical features to integer codes.

roc_auc_score – metric used in this competition (AUC).

compute_class_weight – computes weights for imbalanced classes.

lightgbm – main model: Gradient Boosted Decision Trees (LGBMClassifier).

In [1]:

import gc
import warnings
from datetime import datetime

import numpy as np
import pandas as pd

from sklearn.model_selection import StratifiedKFold
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import roc_auc_score
from sklearn.utils.class_weight import compute_class_weight

import lightgbm as lgb

warnings.filterwarnings("ignore")

**Explanation:**

Defines file paths for train.csv, test.csv, and the output submission.csv.

target_col is the label to predict (loan_paid_back).

id_col is the row identifier (id) required in the submission.

In [None]:

TRAIN_PATH = "/content/train.csv"
TEST_PATH  = "/content/test.csv"
SUBMISSION_PATH = "submission_lgbm_multiseed.csv"

target_col = "loan_paid_back"
id_col = "id"

In [None]:
print("Loading data ...")
train = pd.read_csv(TRAIN_PATH)
test  = pd.read_csv(TEST_PATH)

print("Train shape:", train.shape)
print("Test shape :", test.shape)

# Drop NaN targets if any
if train[target_col].isnull().any():
    print(f"Dropping {train[target_col].isnull().sum()} rows with NaN target")
    train = train.dropna(subset=[target_col]).reset_index(drop=True)

y = train[target_col].astype(int).values
train_ids = train[id_col].values
test_ids  = test[id_col].values


**Explanation:**

Reads train.csv and test.csv into pandas DataFrames.

Prints their shapes for a quick sanity check.

If any rows in the target column are NaN, they are dropped (can happen in some Kaggle datasets).

y – numpy array of target labels.

train_ids, test_ids – store id values for later use in submission.

In [None]:
# ========= CLASS WEIGHTS =========
classes = np.unique(y)
cw = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight_dict = {cls: w for cls, w in zip(classes, cw)}
print("Class weights:", class_weight_dict)


**Explanation:**

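Computes "balanced" class weights, which are inversely proportional to class frequency, so that both classes (0 and 1) contribute equally to the training loss even if they are imbalanced.

These weights are passed into LightGBM as class_weight so that the minority class is not drowned out during training. Re-weighting shifts the predicted probabilities toward the minority class; AUC depends only on how the predictions are ranked, so that shift alone does not change the score.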
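As a worked illustration, the "balanced" heuristic gives each class the weight n_samples / (n_classes * class_count). The sketch below reproduces that formula with NumPy on a made-up label vector; `y_toy` and `toy_weight_dict` are hypothetical names, not part of the notebook or the competition data.

```python
import numpy as np

# Hypothetical toy labels: 2 negatives and 8 positives (not the competition data).
y_toy = np.array([0, 0, 1, 1, 1, 1, 1, 1, 1, 1])

n_samples = len(y_toy)
counts = np.bincount(y_toy)      # samples per class -> [2, 8]
n_classes = len(counts)

# "balanced" heuristic: n_samples / (n_classes * count of that class)
weights = n_samples / (n_classes * counts)
toy_weight_dict = {int(cls): float(w) for cls, w in enumerate(weights)}
print(toy_weight_dict)  # {0: 2.5, 1: 0.625}
```

With these weights each class carries the same total weight (2 × 2.5 = 8 × 0.625 = 5), which is what "equal importance" means here.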

In [None]:
# ========= DETECT FEATURES =========
cat_cols = [c for c in train.columns
            if train[c].dtype == "object"
            and c not in [target_col, id_col]]

for c in train.select_dtypes(include=["int64", "int32"]).columns:
    if c in [id_col, target_col]:
        continue
    if train[c].nunique() < 30:
        cat_cols.append(c)

cat_cols = sorted(list(set(cat_cols)))
num_cols = [c for c in train.columns if c not in cat_cols + [target_col, id_col]]

print(f"Categorical columns ({len(cat_cols)}): {cat_cols}")
print(f"Numeric columns      ({len(num_cols)}): {num_cols[:10]}{' ...' if len(num_cols) > 10 else ''}")


**Explanation:**

Detects categorical columns in two ways:

dtype == object → text-like columns.

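Integer columns with a small number of unique values (< 30) → treated as categorical (e.g., codes, flags). The threshold is a heuristic: a genuinely numeric integer column with few distinct values will also be treated as categorical.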

cat_cols – list of categorical feature names.

num_cols – all remaining features (numeric).

This dynamic detection makes the script more general.
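The detection rule can be tried on a small frame. The sketch below is a hypothetical helper (`split_columns`) applied to made-up data; it follows the same two rules but also accepts pandas string dtypes, and it uses a threshold of 3 only so that the toy example shows both outcomes.

```python
import pandas as pd

def split_columns(df, target_col, id_col, max_unique=30):
    """Return (cat_cols, num_cols) using dtype plus a cardinality threshold."""
    skip = {target_col, id_col}
    cat_cols = []
    for col in df.columns:
        if col in skip:
            continue
        s = df[col]
        # Rule 1: text-like columns (object or pandas string dtype).
        is_text = pd.api.types.is_object_dtype(s) or pd.api.types.is_string_dtype(s)
        # Rule 2: integer columns with few distinct values.
        is_low_card_int = pd.api.types.is_integer_dtype(s) and s.nunique() < max_unique
        if is_text or is_low_card_int:
            cat_cols.append(col)
    num_cols = [c for c in df.columns if c not in skip and c not in cat_cols]
    return sorted(cat_cols), num_cols

# Hypothetical toy data; column names other than id / loan_paid_back are made up.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "grade": ["A", "B", "A", "C"],
    "term_code": [0, 1, 0, 1],
    "age": [25, 31, 47, 52],
    "income": [40.5, 52.0, 61.2, 38.9],
    "loan_paid_back": [1, 0, 1, 1],
})

cat_cols_toy, num_cols_toy = split_columns(toy, "loan_paid_back", "id", max_unique=3)
print(cat_cols_toy)  # ['grade', 'term_code']
print(num_cols_toy)  # ['age', 'income']
```

With the notebook's threshold of 30, `age` in this four-row toy frame would also be classified as categorical, which shows why the cutoff is worth checking against the printed column lists.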

In [None]:
# ========= PREPROCESSING =========
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="constant", fill_value="__MISSING__")
ordinal_enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# concatenate train+test for consistent transforms
full = pd.concat([train.drop(columns=[target_col]), test], axis=0, ignore_index=True)

full_num = pd.DataFrame(num_imputer.fit_transform(full[num_cols]), columns=num_cols)
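if cat_cols:
    # Impute inside the guard: SimpleImputer raises on an empty column selection.
    # Casting to object keeps the string fill value valid for integer-coded columns.
    full_cat = pd.DataFrame(
        cat_imputer.fit_transform(full[cat_cols].astype(object)),
        columns=cat_cols
    )
    full_cat_enc = pd.DataFrame(
        ordinal_enc.fit_transform(full_cat),
        columns=cat_cols
    ).astype(np.int32)
else:
    full_cat_enc = pd.DataFrame(index=full.index)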

full_processed = pd.concat(
    [full_num.reset_index(drop=True), full_cat_enc.reset_index(drop=True)],
    axis=1
)

X_all  = full_processed.iloc[:len(train), :].reset_index(drop=True)
X_test = full_processed.iloc[len(train):, :].reset_index(drop=True)

print("Processed train shape:", X_all.shape)
print("Processed test shape :", X_test.shape)


**Explanation:**

SimpleImputer:

For numeric columns: median imputation.

For categorical columns: fill missing values with "__MISSING__".

OrdinalEncoder:

Converts categories to integer codes,

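handle_unknown="use_encoded_value", unknown_value=-1 maps any category not seen during fit to -1 instead of raising an error. Here the encoder is fitted on train and test together, so no test category is actually unseen; the option is only a safeguard.

train and test are concatenated into full so that imputation and encoding are consistent across both. As a side effect, the numeric medians are computed with the test rows included.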

After preprocessing:

X_all – model features for training (rows from original train).

X_test – model features for test set.
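To see what the unknown-category setting does, the sketch below fits an OrdinalEncoder on "train" values only and then transforms a value it has never seen. The arrays are made-up toy data, and fitting on train alone is an assumption for the illustration; it is not how the notebook fits its encoder.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Hypothetical toy category column, fitted on "train" values only.
train_cat = np.array([["A"], ["B"], ["C"]], dtype=object)
test_cat = np.array([["B"], ["D"]], dtype=object)  # "D" never appeared during fit

enc.fit(train_cat)
codes = enc.transform(test_cat).ravel().astype(int)
print(codes.tolist())  # [1, -1]
```

The known category "B" keeps its fitted code (1), while the unseen "D" becomes -1 instead of raising an error.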

In [None]:
# ========= LIGHTGBM PARAMS TEMPLATE =========
def get_lgb_params(seed):
    return {
        "objective": "binary",
        "boosting_type": "gbdt",
        "metric": "auc",
        "learning_rate": 0.015,
        "num_leaves": 63,
        "max_depth": -1,
        "min_child_samples": 60,
        "min_gain_to_split": 0.01,
        "feature_fraction": 0.7,   # colsample_bytree
        "bagging_fraction": 0.8,   # subsample
        "bagging_freq": 1,
        "lambda_l1": 0.3,
        "lambda_l2": 0.6,
        "max_bin": 255,
        "n_estimators": 7000,
        "n_jobs": -1,
        "verbose": -1,
        "force_col_wise": True,
        "class_weight": class_weight_dict,
        "random_state": seed,
    }


**Explanation:**

Defines a function that returns a LightGBM parameter dictionary for a given seed.

Important choices:

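learning_rate = 0.015 – a small step size; each tree contributes little, so many trees are needed, which usually generalizes better at the cost of longer training.

num_leaves = 63 – maximum number of leaves per tree, the main control on tree complexity here because max_depth = -1 leaves depth unlimited.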

feature_fraction, bagging_fraction – column and row sampling for regularization.

lambda_l1, lambda_l2 – L1 & L2 regularization to reduce overfitting.

n_estimators = 7000 – large upper bound on trees; actual used trees are controlled by early stopping.

class_weight = class_weight_dict – handles class imbalance.

random_state = seed – makes the model reproducible per seed.
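The interplay between a large n_estimators and early stopping can be sketched in plain Python. This is a simplified simulation under stated assumptions, not LightGBM's implementation: `simulate_early_stopping` and the AUC sequence are hypothetical, and a patience of 2 stands in for the much larger value used in training.

```python
def simulate_early_stopping(val_scores, patience):
    """Return (best_round, rounds_run), both 1-based, for a higher-is-better metric."""
    best_score = float("-inf")
    best_round = 0
    for round_no, score in enumerate(val_scores, start=1):
        if score > best_score:
            # Strict improvement resets the patience window.
            best_score, best_round = score, round_no
        elif round_no - best_round >= patience:
            # No improvement for `patience` rounds: stop training here.
            return best_round, round_no
    return best_round, len(val_scores)

# Hypothetical validation AUC after each boosting round.
toy_auc = [0.70, 0.72, 0.75, 0.74, 0.75, 0.73, 0.76]
best_round, rounds_run = simulate_early_stopping(toy_auc, patience=2)
print(best_round, rounds_run)  # 3 5
```

Training stops after round 5 and the model from round 3 is kept; the later 0.76 is never reached, which is why patience is normally set generously when the learning rate is low.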

In [None]:
# ========= MULTI-SEED CV TRAINING =========
SEEDS = [42, 2024, 7]   # you can tweak seeds here
NFOLD = 5

oof_blend = np.zeros(len(train))
test_blend = np.zeros(len(test))

for seed in SEEDS:
    print(f"\n================== SEED {seed} ==================")
    params = get_lgb_params(seed)

    skf = StratifiedKFold(n_splits=NFOLD, shuffle=True, random_state=seed)

    oof_seed = np.zeros(len(train))
    test_seed = np.zeros(len(test))

    for fold, (tr_idx, val_idx) in enumerate(skf.split(X_all, y), 1):
        print(f"\n--- Seed {seed} | Fold {fold}/{NFOLD} ---")

        X_tr, X_val = X_all.iloc[tr_idx], X_all.iloc[val_idx]
        y_tr, y_val = y[tr_idx], y[val_idx]

        model = lgb.LGBMClassifier(**params)
        model.fit(
            X_tr, y_tr,
            eval_set=[(X_val, y_val)],
            eval_metric="auc",
            callbacks=[lgb.early_stopping(200, verbose=True)]
        )

        val_pred = model.predict_proba(X_val)[:, 1]
        test_pred = model.predict_proba(X_test)[:, 1]

        oof_seed[val_idx] = val_pred
        test_seed += test_pred / NFOLD

        fold_auc = roc_auc_score(y_val, val_pred)
        print(f"Seed {seed} | Fold {fold} AUC: {fold_auc:.6f}")

        del model, X_tr, X_val, y_tr, y_val
        gc.collect()

    seed_auc = roc_auc_score(y, oof_seed)
    print(f"\nSeed {seed} OOF AUC: {seed_auc:.6f}")

    oof_blend += oof_seed / len(SEEDS)
    test_blend += test_seed / len(SEEDS)


**Explanation:**

SEEDS = [42, 2024, 7]: the same model is trained 3 times with different random seeds.

This is a standard Kaggle trick called multi-seed ensembling.

NFOLD = 5: 5-fold StratifiedKFold cross-validation for each seed.

For each seed:

Create new LightGBM parameters with that seed.

Do 5-fold CV:

Split into training and validation sets.

Train LightGBM with early stopping (early_stopping(200)).

Generate:

val_pred → out-of-fold predictions.

test_pred → test predictions for that fold.

Within each seed, across its folds:

oof_seed collects the OOF predictions.

test_seed is the average of test predictions over folds.

After each seed:

Compute seed_auc (OOF AUC for that seed).

Add oof_seed and test_seed into global blended arrays, averaging over number of seeds.

This reduces variance and usually improves leaderboard performance slightly.
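The out-of-fold mechanics can be checked on toy data. In the sketch below the "model" is a hypothetical stand-in that predicts the training-fold positive rate; the point is only that every row receives exactly one out-of-fold prediction and that each validation fold keeps the overall class ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy labels: 6 negatives and 4 positives (positive rate 0.4).
y_toy = np.array([0] * 6 + [1] * 4)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

oof_toy = np.full(len(y_toy), np.nan)
fold_pos_rates = []

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for tr_idx, val_idx in skf.split(X_toy, y_toy):
    # Stand-in "model": predict the training-fold positive rate for each validation row.
    oof_toy[val_idx] = y_toy[tr_idx].mean()
    fold_pos_rates.append(float(y_toy[val_idx].mean()))

print(np.isnan(oof_toy).any())  # False: every row got an out-of-fold prediction
print(fold_pos_rates)           # each validation fold has a positive rate of 0.4
```

Because each row is predicted by a model that never saw it, scoring the filled array against the labels gives an estimate that is not inflated by training-set memorization.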

In [None]:
final_oof_auc = roc_auc_score(y, oof_blend)
print(f"\n================== FINAL BLENDED OOF AUC: {final_oof_auc:.6f} ==================")

# ========= SUBMISSION =========
final_pred = np.clip(test_blend, 0.0, 1.0)

submission = pd.DataFrame({
    id_col: test_ids,
    target_col: final_pred
})

submission.to_csv(SUBMISSION_PATH, index=False)
print(f"\nSaved submission to {SUBMISSION_PATH}")
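# datetime.utcnow() is deprecated since Python 3.12; use a timezone-aware timestamp instead.
print("Finished at", pd.Timestamp.now(tz="UTC").strftime("%Y-%m-%d %H:%M:%S UTC"))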


**Explanation:**

final_oof_auc: AUC on the entire training set using the blended OOF predictions from all seeds.

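This is your best internal estimate of how good the model is, though it can be slightly optimistic because each fold's early stopping is tuned on the same validation rows that are then scored.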

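np.clip(test_blend, 0.0, 1.0): ensures no predicted probability goes outside [0, 1]. An average of probabilities already lies in that range, so this is only a guard against floating-point drift.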

Creates the submission DataFrame with:

id column

loan_paid_back (predicted probabilities)

Saves submission_lgbm_multiseed.csv in the required Kaggle format.

Prints the finish time.
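Before uploading, a few basic properties of the submission can be verified. The helper below is a hypothetical sketch (`check_submission` is not part of the notebook), and the checks reflect the assumed format of one id column plus one probability column; the competition page remains the authority on the exact requirements.

```python
import pandas as pd

def check_submission(sub, id_col, target_col, expected_ids):
    """Raise AssertionError if the frame breaks one of the basic expectations."""
    assert list(sub.columns) == [id_col, target_col], "unexpected columns"
    assert len(sub) == len(expected_ids), "row count differs from test set"
    assert sub[id_col].tolist() == list(expected_ids), "ids missing or reordered"
    assert sub[target_col].notna().all(), "NaN predictions"
    assert sub[target_col].between(0.0, 1.0).all(), "predictions outside [0, 1]"
    return True

# Hypothetical toy submission with made-up ids and probabilities.
toy_ids = [101, 102, 103]
toy_sub = pd.DataFrame({"id": toy_ids, "loan_paid_back": [0.91, 0.12, 0.55]})
is_valid = check_submission(toy_sub, "id", "loan_paid_back", toy_ids)
print(is_valid)  # True
```

In the notebook the same call would take submission, id_col, target_col and test_ids as arguments.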