The purpose of this notebook is to calculate the Weight of Evidence (WOE) and Information Value (IV) for each predictor variable in our cleaned dataset.

WOE and IV help us identify which variables have the strongest predictive power in distinguishing good (non-default) vs bad (default) borrowers for our Elder-Care Support Loan credit scorecard.

In [1]:
import pandas as pd
import numpy as np
import json
from pathlib import Path

pd.set_option('display.max_columns', 120)

DATA_PATH = Path("merged_applicant_and_bureau_cleaned_2.csv")   
OUT_DIR = Path("woe_iv_outputs")                              # local subfolder for outputs
OUT_DIR.mkdir(parents=True, exist_ok=True)

TARGET_COL = "TARGET"   # 0 = good, 1 = bad 
ID_COL     = "SK_ID_CURR"  
SEED       = 42    # ensures reproducibility for random operations.



In [2]:
# Load your cleaned dataset
df = pd.read_csv(DATA_PATH)

# Quick checks
print("Shape:", df.shape)
print("Columns:", df.columns.tolist()[:20])  # show first 20 columns
df[TARGET_COL].value_counts(dropna=False)
df.head()


Shape: (254358, 29)
Columns: ['Unnamed: 0', 'SK_ID_CURR', 'TARGET', 'NAME_INCOME_TYPE', 'NAME_FAMILY_STATUS', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'OCCUPATION_TYPE', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'AGE', 'AMT_CREDIT_SUM_sum', 'AMT_CREDIT_SUM_DEBT_sum', 'AMT_CREDIT_SUM_OVERDUE_max', 'CREDIT_DAY_OVERDUE_max', 'CNT_CREDIT_PROLONG_sum', 'CREDIT_ACTIVE_Active']


Unnamed: 0.1,Unnamed: 0,SK_ID_CURR,TARGET,NAME_INCOME_TYPE,NAME_FAMILY_STATUS,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,OCCUPATION_TYPE,FLAG_OWN_CAR,FLAG_OWN_REALTY,AGE,AMT_CREDIT_SUM_sum,AMT_CREDIT_SUM_DEBT_sum,AMT_CREDIT_SUM_OVERDUE_max,CREDIT_DAY_OVERDUE_max,CNT_CREDIT_PROLONG_sum,CREDIT_ACTIVE_Active,CREDIT_ACTIVE_Closed,CREDIT_TYPE_Consumer credit,CREDIT_TYPE_Credit card,CREDIT_TYPE_Microloan,CREDIT_TYPE_Unknown type of loan,CREDIT_TYPE_Another type of loan,DEBT_RATIO,OVERDUE_RATIO,YEARS_EMPLOYED
0,0,100003,0,State servant,Married,0,270000.0,1293502.5,35698.5,1129500.0,Core staff,N,N,45.931507,94900.5,0.0,0.0,0.0,0.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,3.254795
1,1,100004,0,Working,Single / not married,0,67500.0,135000.0,6750.0,135000.0,Laborers,Y,Y,52.180822,,,,,,,,,,,,,,,0.616438
2,2,100006,0,Working,Civil marriage,0,135000.0,312682.5,29686.5,297000.0,Laborers,N,Y,52.068493,,,,,,,,,,,,,,,8.326027
3,3,100007,0,Working,Single / not married,0,121500.0,513000.0,21865.5,513000.0,Core staff,N,Y,54.608219,,,,,,,,,,,,,,,8.323288
4,4,100008,0,State servant,Married,0,99000.0,490495.5,27517.5,454500.0,Laborers,N,Y,46.413699,,,,,,,,,,,,,,,4.350685


To prevent information leakage, we split the dataset into:

70% training data (used for binning and IV computation)

30% testing data (held out for later model validation)

The split is stratified by TARGET so that the proportion of good vs bad loans remains consistent across both sets.

In [3]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(
    df, 
    test_size=0.3, 
    random_state=SEED, 
    stratify=df[TARGET_COL]
)

print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)


Train shape: (178050, 29)
Test shape: (76308, 29)


**Helper functions defined as:**

1. Automatically bin continuous variables into quantile-based bins

2. Group categorical values with small counts into “OTHER”

3. Compute WOE and IV for each variable

4. Store binning specifications for later WOE transformation

5. These functions ensure consistent binning across all variables.

- Which function is for what(John)
- What is _safe_div used for (John)

In [5]:
# Helpers for binning, WOE, IV

def _safe_div(a, b):
    return a / b if b != 0 else 0.0

def compute_woe_iv_for_binned_series(y_true, bins_series):
    """
    y_true: 1D array-like of TARGET (0 good, 1 bad)
    bins_series: categorical/ordinal bins for the same rows
    Returns: (df_bins, IV)
    df_bins columns: [bin, good, bad, dist_good, dist_bad, WOE, IV_component]
    """
    tmp = pd.DataFrame({"bin": bins_series, "y": y_true})
    grp = tmp.groupby("bin")["y"].agg(["count","sum"])
    grp = grp.rename(columns={"count":"total","sum":"bad"})
    grp["good"] = grp["total"] - grp["bad"]

    total_good = grp["good"].sum()
    total_bad  = grp["bad"].sum()
    # Add small epsilon to avoid log(0)
    eps = 1e-6

    grp["dist_good"] = grp["good"] / (total_good + eps)
    grp["dist_bad"]  = grp["bad"]  / (total_bad  + eps)
    grp["WOE"] = np.log((grp["dist_good"] + eps) / (grp["dist_bad"] + eps))
    grp["IV_component"] = (grp["dist_good"] - grp["dist_bad"]) * grp["WOE"]
    iv = grp["IV_component"].sum()

    grp = grp.reset_index()
    return grp, iv

def quantile_binner(x, n_bins=5, min_unique=10):
    """
    Numeric binning via quantiles. Returns pd.Categorical bin labels.
    Falls back to unique sorting if very few unique values.
    """
    s = pd.Series(x)
    # Handle all-missing / all-constant edge cases
    if s.dropna().nunique() < max(2, min_unique//2):
        # Just a single-bin category to avoid explode
        return pd.Categorical(["All"]*len(s))

    try:
        # qcut may fail if many ties; we rank with method='first' to stabilize
        binned = pd.qcut(s.rank(method="first"), q=n_bins, duplicates="drop")
        return binned
    except Exception:
        # fallback: cut into equal width
        return pd.cut(s, bins=n_bins, duplicates="drop")

def categorical_binner(x, top_k=10):
    """
    Convert categorical into limited levels:
    - Keep top_k frequent levels; others -> 'OTHER'
    - Missing -> 'MISSING'
    """
    s = pd.Series(x).astype("object")
    s = s.fillna("MISSING")
    vc = s.value_counts(dropna=False)
    keep = set(vc.index[:top_k])
    s2 = s.where(s.isin(keep), other="OTHER")
    return pd.Categorical(s2)

def is_numeric_series(s):
    return pd.api.types.is_numeric_dtype(s)

def woe_iv_for_column(train_df, col, target=TARGET_COL, n_bins=5, top_k=10):
    """
    Bins a column (numeric or categorical), computes WOE/IV on TRAIN.
    Returns: bin_table (with WOE, IV_component), iv_value, bin_labels (spec)
    """
    s = train_df[col]
    y = train_df[target].astype(int)

    if is_numeric_series(s):
        # Treat special values
        s_clean = s.replace([np.inf, -np.inf], np.nan)
        binned = quantile_binner(s_clean, n_bins=n_bins)
        label_type = "numeric_quantile"
        # Add an explicit MISSING bin if there are NaNs
        if s_clean.isna().any():
            binned = binned.astype(object)
            binned = pd.Series(binned)
            binned = binned.where(~s_clean.isna(), other="MISSING")
    else:
        binned = categorical_binner(s, top_k=top_k)
        label_type = "categorical_topk"

    bins_table, iv = compute_woe_iv_for_binned_series(y, pd.Categorical(binned))
    # Save a lightweight "spec" you can re-apply later
    bin_spec = {
        "type": label_type,
        "has_missing": bool(train_df[col].isna().any()),
        "levels": [str(l) for l in pd.Categorical(binned).categories] if label_type=="categorical_topk" else None,
        "quantiles": None
    }

    if label_type == "numeric_quantile":
        # Persist bin edges for later application
        tmp = pd.Series(train_df[col].replace([np.inf,-np.inf], np.nan))
        try:
            q = tmp.quantile(np.linspace(0,1,6))  # 5 bins => 6 edges
            bin_spec["quantiles"] = [None if pd.isna(v) else float(v) for v in q.values]
        except Exception:
            bin_spec["quantiles"] = None

    return bins_table, float(iv), bin_spec


**Start Applying woe calculation for each column**

In [6]:
# Choose candidate columns
exclude_cols = {TARGET_COL, ID_COL}
candidates = [c for c in train_df.columns if c not in exclude_cols]

iv_rows = []
bin_specs = {}

for col in candidates:
    try:
        bins_table, iv, spec = woe_iv_for_column(train_df, col, target=TARGET_COL)
        iv_rows.append({"variable": col, "IV": iv})
        bin_specs[col] = spec
    except Exception as e:
        iv_rows.append({"variable": col, "IV": np.nan, "error": str(e)})

iv_df = pd.DataFrame(iv_rows).sort_values("IV", ascending=False)
iv_df.head(15)


  grp = tmp.groupby("bin")["y"].agg(["count","sum"])
  grp = tmp.groupby("bin")["y"].agg(["count","sum"])
  grp = tmp.groupby("bin")["y"].agg(["count","sum"])
  grp = tmp.groupby("bin")["y"].agg(["count","sum"])
  grp = tmp.groupby("bin")["y"].agg(["count","sum"])
  grp = tmp.groupby("bin")["y"].agg(["count","sum"])
  grp = tmp.groupby("bin")["y"].agg(["count","sum"])
  grp = tmp.groupby("bin")["y"].agg(["count","sum"])
  grp = tmp.groupby("bin")["y"].agg(["count","sum"])
  grp = tmp.groupby("bin")["y"].agg(["count","sum"])
  grp = tmp.groupby("bin")["y"].agg(["count","sum"])
  grp = tmp.groupby("bin")["y"].agg(["count","sum"])
  grp = tmp.groupby("bin")["y"].agg(["count","sum"])
  grp = tmp.groupby("bin")["y"].agg(["count","sum"])
  grp = tmp.groupby("bin")["y"].agg(["count","sum"])
  grp = tmp.groupby("bin")["y"].agg(["count","sum"])
  grp = tmp.groupby("bin")["y"].agg(["count","sum"])
  grp = tmp.groupby("bin")["y"].agg(["count","sum"])
  grp = tmp.groupby("bin")["y"].agg(["count","

Unnamed: 0,variable,IV
26,YEARS_EMPLOYED,0.09063
8,OCCUPATION_TYPE,0.07445
11,AGE,0.053058
1,NAME_INCOME_TYPE,0.048554
24,DEBT_RATIO,0.045424
13,AMT_CREDIT_SUM_DEBT_sum,0.043672
5,AMT_CREDIT,0.035502
7,AMT_GOODS_PRICE,0.034763
17,CREDIT_ACTIVE_Active,0.024696
6,AMT_ANNUITY,0.017279


Weight of Evidence (WOE) measures how well a feature separates good and bad borrowers.

Information Value (IV) quantifies a variable’s overall predictive strength:

IV Range	Predictive Power
< 0.02	Not predictive
0.02 – 0.1	Weak
0.1 – 0.3	Medium
0.3 – 0.5	Strong
more than 0.5	Suspicious (potential leakage)

In [None]:
iv_df.to_csv(OUT_DIR / "woe_iv_summary.csv", index=False)
with open(OUT_DIR / "woe_bin_specs.json", "w") as f:
    json.dump(bin_specs, f, indent=2)

print("Files saved to:", OUT_DIR)


Files saved to: woe_iv_outputs


woe_iv_outputs/woe_iv_summary.csv → IV summary table

woe_iv_outputs/woe_bin_specs.json → binning details for each variable

If you look at woe_iv_summary.csv, 
- DAYS_EMPLOYED
- DEBT_RATIO
- AMT_CREDIT_SUM_DEBT_sum

are strongest predictors

