# Give Me Some Credit – v2

This notebook is a reproducible version of my **credit risk modeling** work on the *Give Me Some Credit* dataset.

**Key ideas implemented:**
- Predictive imputation (MICE + RandomForest single-target fills)
- Outlier capping (winsorization)
- Custom delinquency severity score (1×30–59, 2×60–89, 3×90+)
- Manual decile binning + Weight of Evidence (WoE) encoding (for scorecard & interpretability)
- Class imbalance strategies: cost-sensitive `scale_pos_weight` and SMOTEENN resampling
- Model comparison (XGBoost variants; WoE Logistic baseline)
- F₂-optimized hyperparameter search (Recall-weighted)
- Final fit on full training data and **test-set predictions** (test labels are unavailable, so no test metrics are reported)

> **Run top-to-bottom.** Later sections reuse objects built in earlier ones (e.g. `results`, `add_custom_features`), and every cell is safe to re-run.


In [None]:
# Install external packages (safe to re-run)
!pip install -q xgboost category_encoders imbalanced-learn

In [None]:
# Imports ---------------------------------------------------------------------
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Sklearn core
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV, cross_val_score
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    roc_auc_score, accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix, roc_curve, auc,
    precision_recall_curve, make_scorer, fbeta_score
)

# Imbalance tools
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN

# Encoding
from category_encoders.woe import WOEEncoder

# XGBoost
from xgboost import XGBClassifier

# Repro
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)


## 1. Load Data & Basic Cleaning

In [None]:
# Load training & test CSVs (Kaggle format). Adjust paths as needed.
train = pd.read_csv('cs-training.csv', index_col=0)
test  = pd.read_csv('cs-test.csv',     index_col=0)

TARGET = 'SeriousDlqin2yrs'

print(f"Train shape: {train.shape}")
print(f"Test  shape: {test.shape}\n")

# Drop duplicate rows (rare but good hygiene)
n_dups = train.duplicated().sum()
print(f"Found {n_dups} duplicate rows in training data.")
if n_dups:
    train = train.drop_duplicates()
    print(f"After dropping dupes: {train.shape}")
print()

# Quick missing summary
print("Missing (train):\n", train.isnull().sum())
print("\nMissing (test):\n", test.isnull().sum())

Train shape: (150000, 11)
Test  shape: (101503, 11)

Found 609 duplicate rows in training data.
After dropping dupes: (149391, 11)

Missing (train):
 SeriousDlqin2yrs                            0
RevolvingUtilizationOfUnsecuredLines        0
age                                         0
NumberOfTime30-59DaysPastDueNotWorse        0
DebtRatio                                   0
MonthlyIncome                           29221
NumberOfOpenCreditLinesAndLoans             0
NumberOfTimes90DaysLate                     0
NumberRealEstateLoansOrLines                0
NumberOfTime60-89DaysPastDueNotWorse        0
NumberOfDependents                       3828
dtype: int64

Missing (test):
 SeriousDlqin2yrs                        101503
RevolvingUtilizationOfUnsecuredLines         0
age                                          0
NumberOfTime30-59DaysPastDueNotWorse         0
DebtRatio                                    0
MonthlyIncome                            20103
NumberOfOpenCreditLinesAndLoans

## 2. Predictive Imputation
We try two strategies:
- **MICE** (IterativeImputer w/ small RandomForest)
- **Single-target RF** for `MonthlyIncome` & `NumberOfDependents`.

In [None]:
# --- Prep feature matrices ---
X_train = train.drop(columns=TARGET)
y_train = train[TARGET]
# Some Kaggle test files include the target (all NaN); drop if present
X_test = test.drop(columns=[TARGET], errors='ignore')
# align columns
X_test = X_test[X_train.columns]

# --- MICE Imputation --------------------------------------------------------
imp = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, random_state=RANDOM_STATE),
    initial_strategy='median',
    max_iter=10,
    random_state=RANDOM_STATE
)
X_train_mice = pd.DataFrame(imp.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test_mice  = pd.DataFrame(imp.transform(X_test),      columns=X_train.columns, index=X_test.index)
train_mice   = X_train_mice.assign(**{TARGET: y_train})

print("MICE → remaining nulls (train):", train_mice.isnull().sum().sum())
print("MICE → remaining nulls (test):",  X_test_mice.isnull().sum().sum())

# --- Manual RF for MonthlyIncome & NumberOfDependents -----------------------
cols_to_impute = ['MonthlyIncome', 'NumberOfDependents']
X_train_rf = X_train.copy()
X_test_rf  = X_test.copy()

for col in cols_to_impute:
    # predictors: every column except the two being imputed
    feats = X_train_rf.columns.difference(cols_to_impute)
    rf_imp = RandomForestRegressor(n_estimators=100, random_state=RANDOM_STATE)
    mask_tr = X_train_rf[col].notnull()
    rf_imp.fit(X_train_rf.loc[mask_tr, feats], X_train_rf.loc[mask_tr, col])
    # fill train
    X_train_rf.loc[~mask_tr, col] = rf_imp.predict(X_train_rf.loc[~mask_tr, feats])
    # fill test
    mask_te = X_test_rf[col].isnull()
    X_test_rf.loc[mask_te, col] = rf_imp.predict(X_test_rf.loc[mask_te, feats])

train_rf = X_train_rf.assign(**{TARGET: y_train})

print("RF → remaining nulls (train):", train_rf.isnull().sum().sum())
print("RF → remaining nulls (test):",  X_test_rf.isnull().sum().sum())

MICE → remaining nulls (train): 0
MICE → remaining nulls (test): 0
RF → remaining nulls (train): 0
RF → remaining nulls (test): 0
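
Zero remaining nulls shows that both strategies ran, not how accurate the fills are. One possible check, sketched below on synthetic data, is to hide some known values, impute them, and measure the error against the truth. The toy frame, its column names, and the helper `rmse_on_hidden` are illustrative assumptions and not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in (assumption): "income" depends strongly on x1 and x2
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
toy["income"] = 3.0 * toy["x1"] - 2.0 * toy["x2"] + rng.normal(scale=0.1, size=n)

# Hide roughly 20% of the known income values
hidden = rng.random(n) < 0.2
masked = toy.copy()
masked.loc[hidden, "income"] = np.nan

def rmse_on_hidden(imputer):
    """Impute the masked frame and score only the cells that were hidden."""
    filled = imputer.fit_transform(masked)
    income_col = masked.columns.get_loc("income")
    err = filled[hidden, income_col] - toy.loc[hidden, "income"].to_numpy()
    return float(np.sqrt(np.mean(err ** 2)))

rmse_median = rmse_on_hidden(SimpleImputer(strategy="median"))
rmse_model = rmse_on_hidden(
    IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=20, random_state=0),
        initial_strategy="median",
        max_iter=5,
        random_state=0,
    )
)
print(f"median fill RMSE: {rmse_median:.2f}")
print(f"model fill RMSE:  {rmse_model:.2f}")
```

On the real data the same idea would mask a sample of observed `MonthlyIncome` values and compare the MICE and single-target RF fills against them.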


## 3. Class Imbalance Strategies
We compute `scale_pos_weight` for cost-sensitive boosting and also create SMOTEENN-balanced versions of the data.

In [None]:
def _print_balance(y, label):
    print(f"{label}:", dict(pd.Series(y).value_counts()))

datasets = {'MICE': train_mice, 'RF': train_rf}
results = {}
for name, df in datasets.items():
    X = df.drop(columns=TARGET)
    y = df[TARGET]
    print(f"\n— {name} Imputed Data —")
    spw = y.value_counts()[0] / y.value_counts()[1]
    print(f"  scale_pos_weight = {spw:.2f}")
    smenn = SMOTEENN(random_state=RANDOM_STATE)
    X_sme, y_sme = smenn.fit_resample(X, y)
    _print_balance(y_sme, "  SMOTEENN class balance")
    results[name] = {
        'X_orig': X, 'y_orig': y,
        'spw': spw,
        'X_sme': X_sme, 'y_sme': y_sme
    }


— MICE Imputed Data —
  scale_pos_weight = 13.93
  SMOTEENN class balance: {1: np.int64(120820), 0: np.int64(94600)}

— RF Imputed Data —
  scale_pos_weight = 13.93
  SMOTEENN class balance: {1: np.int64(120941), 0: np.int64(94865)}
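
As a sanity check on the ratio above, the sketch below recomputes `scale_pos_weight` on a toy label vector and converts a ratio back into a positive-class share. The helper names `scale_pos_weight` and `positive_share` are hypothetical, introduced only for this illustration.

```python
import pandas as pd

def scale_pos_weight(y):
    """Negatives divided by positives, the ratio passed to XGBoost."""
    counts = pd.Series(y).value_counts()
    return counts[0] / counts[1]

def positive_share(spw):
    """Share of positives implied by a negatives/positives ratio."""
    return 1.0 / (1.0 + spw)

# Toy labels (assumption): 14 non-defaulters for every defaulter
y_toy = [0] * 14 + [1]
spw_toy = scale_pos_weight(y_toy)

# The ratio printed above (13.93) implies about 6.7% positives
share = positive_share(13.93)
print(f"toy ratio = {spw_toy:.2f}, implied positive share = {share:.1%}")
```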


## 4. Feature Engineering & WoE

Steps:
1. **Winsorize** heavy-tailed numeric vars (1st–99th %ile).
2. **Custom delinquency score** = 1×30–59 + 2×60–89 + 3×90+.
3. Alias `RevolvingUtilizationOfUnsecuredLines` → `revol_util`.
4. Build **decile bins** (edges fit on the MICE-imputed, non-resampled training data).
5. Derive **Weight of Evidence** per bin (also fit on the MICE-imputed, non-resampled training data) and map it onto all datasets.
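
To make step 5 concrete, here is a minimal sketch of the textbook WoE definition, WoE = ln(share of goods in the bin / share of bads in the bin), on a hypothetical 12-row toy table. The bin labels and data are made up for illustration; the sign convention (class 0 over class 1) follows the orientation used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): three bins with rising default rates
toy = pd.DataFrame({
    "bin":    ["low"] * 4 + ["mid"] * 4 + ["high"] * 4,
    "target": [0, 0, 0, 1,   0, 0, 1, 1,   0, 1, 1, 1],
})

# Counts of goods (0) and bads (1) per bin
ct = pd.crosstab(toy["bin"], toy["target"])

# Column-normalise: what fraction of ALL goods / ALL bads lands in each bin
dist = ct / ct.sum(axis=0)

# WoE per bin; positive values mean the bin holds relatively more goods
woe = np.log(dist[0] / dist[1])
print(woe.round(3))
```

Here the `low` bin holds 3 of 6 goods and 1 of 6 bads, so its WoE is ln(3) ≈ 1.099; the `mid` bin is balanced and scores 0.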


In [None]:
# --- Winsorization helper ---------------------------------------------------
def winsorize_series(s, lower_pct=0.01, upper_pct=0.99):
    lo, hi = s.quantile([lower_pct, upper_pct])
    return s.clip(lo, hi)

# --- Feature builder --------------------------------------------------------
def add_custom_features(df):
    df = df.copy()
    for var in ['MonthlyIncome','DebtRatio','RevolvingUtilizationOfUnsecuredLines','NumberOfOpenCreditLinesAndLoans']:
        df[var] = winsorize_series(df[var])
    df['delinq_score'] = (
        1 * df['NumberOfTime30-59DaysPastDueNotWorse'] +
        2 * df['NumberOfTime60-89DaysPastDueNotWorse'] +
        3 * df['NumberOfTimes90DaysLate']
    )
    df['revol_util'] = df['RevolvingUtilizationOfUnsecuredLines']
    return df

# reference df for calculating bin edges (MICE original, engineered)
ref = add_custom_features(results['MICE']['X_orig'])
num_vars = ['age','DebtRatio','MonthlyIncome','NumberOfOpenCreditLinesAndLoans','revol_util','delinq_score']

# build bin edges dict
bins = {var: pd.qcut(ref[var], q=10, duplicates='drop', retbins=True)[1] for var in num_vars}
bin_cols = [f"{v}_bin" for v in num_vars]

# apply feature builder + bins to each dataset version
for name in ['MICE','RF']:
    for mode in ['orig','sme']:
        X = results[name][f'X_{mode}']
        Xp = add_custom_features(X)
        for var in num_vars:
            Xp[f'{var}_bin'] = pd.cut(Xp[var], bins=bins[var], include_lowest=True).astype(str)
        results[name][f'X_{mode}_fe'] = Xp

# build WoE maps using MICE original
woe_maps = {}
y_ref = results['MICE']['y_orig']
for col in bin_cols:
    ct = pd.crosstab(results['MICE']['X_orig_fe'][col], y_ref)
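    # share of all goods (class 0) and of all bads (class 1) that fall in each bin
    dist = ct.div(ct.sum(axis=0), axis=1)
    # WoE = ln(%good / %bad); clip tiny shares to avoid log(0) and divide-by-zero
    woe_maps[col] = np.log(dist[0].clip(lower=1e-6) / dist[1].clip(lower=1e-6))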

# map WoE to all datasets
for name in ['MICE','RF']:
    for mode in ['orig','sme']:
        Xb = results[name][f'X_{mode}_fe']
        W = pd.DataFrame(index=Xb.index)
        for col in bin_cols:
            W[col] = Xb[col].map(woe_maps[col]).fillna(0)
        results[name][f'X_{mode}_fe_woe'] = W

print("Feature engineering & WoE complete.")

Feature engineering & WoE complete.
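
One caveat of reusing bin edges: a value outside the reference range falls in no interval, so `pd.cut` returns `NaN` for it. In the cell above such a value becomes the string `'nan'` and ends up with a WoE of 0 through `fillna(0)`. The sketch below reproduces the effect on toy data and shows one possible remedy, opening the outer edges to ±infinity. The remedy is an assumption for illustration, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Reference data (assumption): the integers 1..100
ref_toy = pd.Series(np.arange(1, 101), dtype=float)
_, edges = pd.qcut(ref_toy, q=10, duplicates="drop", retbins=True)

# New data with one value below and one above the reference range
new_toy = pd.Series([0.0, 50.0, 150.0])
binned = pd.cut(new_toy, bins=edges, include_lowest=True)
print(binned.isna().tolist())  # [True, False, True]

# Possible remedy: let the first and last bins absorb out-of-range values
open_edges = edges.copy()
open_edges[0], open_edges[-1] = -np.inf, np.inf
binned_open = pd.cut(new_toy, bins=open_edges)
print(binned_open.isna().tolist())  # [False, False, False]
```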


## 5. Model Comparison (Recall-focused CV)

We compare:
- XGBoost w/ cost-sensitive weighting (orig data)
- XGBoost trained on SMOTEENN-balanced data
- Logistic Regression on WoE bins (scorecard-style)

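Metric: **Recall** (catch defaulters). The SMOTEENN score is computed on folds of the resampled data, which contain synthetic positives and are roughly balanced, so it is not directly comparable with the two scores measured on the original class distribution.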

In [None]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

for name in ['MICE','RF']:
    print(f"\n=== {name}-imputed data ===")
    spw = results[name]['spw']

    # XGB cost-sensitive
    Xo = results[name]['X_orig_fe'].select_dtypes(include=[np.number])
    yo = results[name]['y_orig']
    xgb_cs = XGBClassifier(eval_metric='auc', scale_pos_weight=spw, random_state=RANDOM_STATE)
    recall_cs = cross_val_score(xgb_cs, Xo, yo, cv=cv, scoring='recall')
    print(f"XGB (orig, numeric)      Recall: {recall_cs.mean():.3f} ± {recall_cs.std():.3f}")

    # XGB SMOTEENN
    Xs = results[name]['X_sme_fe'].select_dtypes(include=[np.number])
    ys = results[name]['y_sme']
    xgb_sm = XGBClassifier(eval_metric='auc', random_state=RANDOM_STATE)
    recall_sm = cross_val_score(xgb_sm, Xs, ys, cv=cv, scoring='recall')
    print(f"XGB (SMOTEENN, numeric)  Recall: {recall_sm.mean():.3f} ± {recall_sm.std():.3f}")

    # Logistic WoE
    Xw = results[name]['X_orig_fe_woe']
    lr = LogisticRegression(max_iter=1000, random_state=RANDOM_STATE)
    recall_lr = cross_val_score(lr, Xw, yo, cv=cv, scoring='recall')
    print(f"LR  (WoE bins)           Recall: {recall_lr.mean():.3f} ± {recall_lr.std():.3f}")


=== MICE-imputed data ===
XGB (orig, numeric)      Recall: 0.704 ± 0.005
XGB (SMOTEENN, numeric)  Recall: 0.948 ± 0.001
LR  (WoE bins)           Recall: 0.149 ± 0.008

=== RF-imputed data ===
XGB (orig, numeric)      Recall: 0.708 ± 0.003
XGB (SMOTEENN, numeric)  Recall: 0.903 ± 0.002
LR  (WoE bins)           Recall: 0.149 ± 0.007


## 6. F₂-Optimized Hyperparameter Search (Recall-weighted)

We tune XGBoost to maximize F₂, the F-beta score with β = 2: recall carries β² = 4 times the weight of precision in the weighted harmonic mean.
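
The validation output in this section prints F1 rather than F₂. As a worked example, F₂ can be recovered from a confusion matrix with F_β = (1 + β²)·TP / ((1 + β²)·TP + β²·FN + FP). The sketch below plugs in the counts from the MICE validation confusion matrix reported in this section; the helper name `fbeta_from_counts` is hypothetical.

```python
def fbeta_from_counts(tp, fp, fn, beta=2.0):
    """F-beta from raw counts; beta > 1 favours recall over precision."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

# MICE validation confusion matrix, laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 22242, 5635, 431, 1571

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = fbeta_from_counts(tp, fp, fn, beta=1.0)
f2 = fbeta_from_counts(tp, fp, fn, beta=2.0)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f} F2={f2:.3f}")
# precision=0.218 recall=0.785 F1=0.341 F2=0.516
```

F₂ sits well above F1 here because the model's recall is much higher than its precision.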

In [None]:
best_params_mice = None
best_params_rf   = None

param_grid = {
    'max_depth':       [4,5,7],
    'learning_rate':   [0.05,0.06,0.07],
    'n_estimators':    [200,250],
    'colsample_bytree':[0.7,0.8,0.9]
}

fb2 = make_scorer(fbeta_score, beta=2)
cv3 = StratifiedKFold(n_splits=3, shuffle=True, random_state=RANDOM_STATE)

for name in ['MICE','RF']:
    print(f"\n===== {name}-imputed data =====\n-- F₂ GridSearch XGBoost --")
    spw = results[name]['spw']
    Xo = results[name]['X_orig_fe'].select_dtypes(include=[np.number])
    yo = results[name]['y_orig']

    X_tr, X_val, y_tr, y_val = train_test_split(
        Xo, yo, test_size=0.2, stratify=yo, random_state=RANDOM_STATE
    )

    base = XGBClassifier(eval_metric='auc', scale_pos_weight=spw, random_state=RANDOM_STATE)
    grid = GridSearchCV(base, param_grid, scoring=fb2, cv=cv3, n_jobs=-1, verbose=1)
    grid.fit(X_tr, y_tr)
    print("Best params:", grid.best_params_)

    if name=='MICE':
        best_params_mice = grid.best_params_
    else:
        best_params_rf = grid.best_params_

    best = grid.best_estimator_
    y_prob = best.predict_proba(X_val)[:,1]
    y_pred = best.predict(X_val)
    print(f"ROC AUC:   {roc_auc_score(y_val, y_prob):.3f}")
    print(f"Precision: {precision_score(y_val, y_pred):.3f}")
    print(f"Recall:    {recall_score(y_val, y_pred):.3f}")
    print(f"F1 Score:  {f1_score(y_val, y_pred):.3f}")
    print("Confusion matrix:\n", confusion_matrix(y_val, y_pred))


===== MICE-imputed data =====
-- F₂ GridSearch XGBoost --
Fitting 3 folds for each of 54 candidates, totalling 162 fits
Best params: {'colsample_bytree': 0.7, 'learning_rate': 0.05, 'max_depth': 5, 'n_estimators': 200}
ROC AUC:   0.865
Precision: 0.218
Recall:    0.785
F1 Score:  0.341
Confusion matrix:
 [[22242  5635]
 [  431  1571]]

===== RF-imputed data =====
-- F₂ GridSearch XGBoost --
Fitting 3 folds for each of 54 candidates, totalling 162 fits
Best params: {'colsample_bytree': 0.7, 'learning_rate': 0.07, 'max_depth': 4, 'n_estimators': 200}
ROC AUC:   0.866
Precision: 0.217
Recall:    0.788
F1 Score:  0.340
Confusion matrix:
 [[22193  5684]
 [  425  1577]]


## 7. Final Model Fit & Test Predictions

We refit XGBoost with the **best F₂-tuned hyperparameters from the MICE run** on the full training data and score the unlabeled test set.
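
Note that `add_custom_features` recomputes the winsorization limits from whichever frame it receives, so calling it on the test set caps at the test set's own quantiles. A possible alternative, sketched below on toy numbers, learns the caps on training data and reuses them on test data. The helpers `fit_caps` and `apply_caps` are hypothetical and not part of the notebook.

```python
import pandas as pd

def fit_caps(s, lower_pct=0.01, upper_pct=0.99):
    """Learn winsorization limits from a training column."""
    lo, hi = s.quantile([lower_pct, upper_pct])
    return float(lo), float(hi)

def apply_caps(s, caps):
    """Clip any column (train or test) to previously learned limits."""
    lo, hi = caps
    return s.clip(lower=lo, upper=hi)

# Toy columns (assumption): training values 1..100, test values with outliers
train_income = pd.Series(range(1, 101), dtype=float)
test_income = pd.Series([0.0, 50.0, 1000.0])

caps = fit_caps(train_income)
capped_test = apply_caps(test_income, caps)
print(caps)
print(capped_test.tolist())
```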

In [None]:
# Use MICE best params; if tuning was skipped, fall back to a hand-picked configuration
if best_params_mice is None:
    best_params_mice = {'max_depth':5,'learning_rate':0.05,'n_estimators':200,'colsample_bytree':0.8}

# feature sets
X_tr_full = results['MICE']['X_orig_fe'].select_dtypes(include=[np.number])
y_tr_full = results['MICE']['y_orig']

# build test features from the MICE-imputed test set
X_test_mice_fe = add_custom_features(X_test_mice)
# add_custom_features already adds delinq_score and revol_util; keep numeric columns only
X_te_full = X_test_mice_fe.select_dtypes(include=[np.number])

# final model
final_model = XGBClassifier(
    **best_params_mice,
    eval_metric='auc',
    scale_pos_weight=results['MICE']['spw'],
    random_state=RANDOM_STATE
)
final_model.fit(X_tr_full, y_tr_full)

# predictions (class + probability)
test_probs = final_model.predict_proba(X_te_full)[:,1]
test_preds = final_model.predict(X_te_full)

# assemble submission (target is unknown; produce probability)
pred_df = pd.DataFrame({
    'Id': X_te_full.index,
    'ProbabilityOfDefault': test_probs,
    'PredictedClass_0_1': test_preds
})
pred_df.to_csv('GiveMeSomeCredit_test_predictions.csv', index=False)
print("Saved predictions → GiveMeSomeCredit_test_predictions.csv")
pred_df.head()

Saved predictions → GiveMeSomeCredit_test_predictions.csv


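|   | Id | ProbabilityOfDefault | PredictedClass_0_1 |
|---|----|----------------------|--------------------|
| 0 | 1  | 0.507322             | 1                  |
| 1 | 2  | 0.372901             | 0                  |
| 2 | 3  | 0.14671              | 0                  |
| 3 | 4  | 0.558543             | 1                  |
| 4 | 5  | 0.602537             | 1                  |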
