### Weight of Evidence and Information Value Computation

##### Understanding WoE

Weight of Evidence (WoE) measures the predictive strength and direction of a feature (or of a single bin of a feature) in relation to a binary target variable (e.g., default vs non-default). It shows how much each group (bin) contributes to distinguishing between the two target classes. Replacing raw values with WoE improves interpretability and gives the feature a more linear relationship with the log-odds, which can improve model performance. It also dampens the effect of outliers, because continuous variables are grouped into bins and an extreme value simply falls into the first or last bin.
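Written as a formula, and following the goods-over-bads convention used by the code later in this section, the WoE of bin $i$ can be sketched as below. The symbols $G_i$, $B_i$, $G$ and $B$ are notation introduced here for illustration: the goods and bads in bin $i$, and the total goods and bads across all bins.

$$
\mathrm{WoE}_i = \ln\left(\frac{G_i / G}{B_i / B}\right)
$$

Under this convention a positive WoE means the bin holds a larger share of the goods than of the bads, and a negative WoE means the opposite.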

- WoE is expected to follow a monotonic trend, i.e., it should either increase or decrease across bins.
- Missing values are expected to be treated as a separate bin.
- Bins should be heterogeneous, meaning each bin ought to exhibit distinct characteristics. To ensure this, bins with identical WoE values, which indicate similar behavior, should be combined into a single consolidated bin.
- All of the above is taken into account when performing Classing, i.e., Fine Classing and Coarse Classing.
- Here, the aim is only to understand how WoE is computed; classing will probably be covered in a separate exercise.
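The monotonic-trend expectation from the list above can be checked with a few lines of pandas. This is only a minimal sketch: the WoE values and bin names are hypothetical, and `is_monotonic` is a helper name introduced here for illustration.

```python
import pandas as pd

# Hypothetical WoE values, listed in the natural order of the bins
# (lowest to highest value of the underlying variable).
woe_by_bin = pd.Series(
    [-0.80, -0.35, 0.10, 0.45, 0.90],
    index=["bin_1", "bin_2", "bin_3", "bin_4", "bin_5"],
)


def is_monotonic(woe: pd.Series) -> bool:
    """Return True if WoE never decreases, or never increases, across bins."""
    return bool(woe.is_monotonic_increasing or woe.is_monotonic_decreasing)


print(is_monotonic(woe_by_bin))  # True

# Swapping two neighbouring values breaks the trend.
broken = pd.Series([-0.80, 0.10, -0.35, 0.45, 0.90])
print(is_monotonic(broken))  # False
```

When the check fails, neighbouring bins are typically merged during coarse classing until the trend holds.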

##### Understanding IV

Information Value (IV) measures how well a variable separates two binary classes (e.g., default vs non-default). It is a single summary score computed from the Weight of Evidence (WoE) across the bins of a variable. It is generally used as a variable reduction technique.
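As a formula, IV sums one term per bin. The notation is introduced here for illustration: $G_i$ and $B_i$ are the goods and bads in bin $i$, $G$ and $B$ are the totals across all $n$ bins, and $\mathrm{WoE}_i = \ln\big((G_i/G)/(B_i/B)\big)$ follows the goods-over-bads convention used by the code in this section.

$$
\mathrm{IV} = \sum_{i=1}^{n} \left(\frac{G_i}{G} - \frac{B_i}{B}\right)\mathrm{WoE}_i
$$

Each term is non-negative, because the difference in shares and the logarithm of their ratio always have the same sign, so every bin can only add to the total.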

- We follow the rules of thumb below for interpreting IV (source: Listen Data):

| **Information Value**   | **Variable Predictiveness**    |
|-------------------------|--------------------------------|
|    < 0.02               |     Not useful for prediction  |
|    0.02 - 0.1           |     Weak predictive power      |
|    0.1 - 0.3            |     Medium predictive power    |
|    0.3 - 0.5            |     Strong predictive power    |
|    > 0.5                |     Suspicious predictive power|
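The bands can be turned into a small lookup helper. The sketch below assumes the conventional cut-offs (weak from 0.02 to 0.1, medium from 0.1 to 0.3, strong from 0.3 to 0.5). The function name `iv_band` and the treatment of values that fall exactly on a boundary are assumptions made here, since the table does not say which side a boundary belongs to.

```python
def iv_band(iv: float) -> str:
    """Map an IV score to a predictiveness band (boundary handling is an assumption)."""
    if iv < 0.02:
        return "Not useful for prediction"
    if iv < 0.1:
        return "Weak predictive power"
    if iv < 0.3:
        return "Medium predictive power"
    if iv <= 0.5:
        return "Strong predictive power"
    return "Suspicious predictive power"


print(iv_band(0.2485))  # Medium predictive power
```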

- Disclaimer: The code below is meant to build an understanding of how WoE and IV are computed in Python. It may not be the most optimized code. The same computation will then be repeated in Excel for comparison.

In [1]:
## Libraries
import pandas as pd
import numpy as np

## Data Loading
Data = pd.read_excel(r"C:\Users\Hp\Downloads\WoE_Data.xlsx", sheet_name="data")

## Dividing the data into quantile bins
## q in pd.qcut sets the number of bins requested; it is set to 11 here, and duplicates="drop" merges bins whose edges coincide, so fewer bins can result
Data['Decile'] = pd.qcut(Data['DAINC'], q = 11, labels=False, duplicates="drop")  # labels=False returns integer bin codes starting at 0

## Creating an empty dataframe to store the final outcome
df = pd.DataFrame()

## Defining Custom Labels, based on min and max value of each bin
decile_max = Data.groupby('Decile')['DAINC'].max().sort_index().tolist()

## Function for custom labeling; the hard-coded index 9 assumes the data ends up in 10 bins
def custom_label(x):
    for i in range(11):
        if i == 0:
            if x <= decile_max[i]:
                return f"<= {decile_max[i]:.2f}"
        elif i == 9:
            if x > decile_max[i-1]:
                return f"> {decile_max[i-1]:.2f}"
        else:
            if decile_max[i-1] < x <= decile_max[i]:
                return f"> {decile_max[i-1]:.2f} and <= {decile_max[i]:.2f}"

## Applying custom labels, stored in the column 'Labels'
Data['Labels'] = Data['DAINC'].apply(custom_label)

## Computing WoE and IV corresponding to each bin

## Counting Goods and Bads
Counts = Data.groupby('Labels').agg({'BAD': ['count', 'sum']})
Counts.columns = ["Total","Bads"]
Counts["Goods"] = Counts["Total"] - Counts["Bads"]
Counts["Bins"] = Counts.index
df = Counts[["Bins","Total","Goods","Bads"]].reset_index(drop = True)

## Computing Good% and Bads%
df['Good%'] = df['Goods'] / (df['Goods'].sum()) * 100
df['Bad%'] = df['Bads'] / (df['Bads'].sum()) * 100

## WoE computation as the natural log of (Good% / Bad%)
df['WoE'] = np.log(df['Good%']/df['Bad%'])

## IV computation as (Good% - Bad%) * WoE, divided by 100 to convert percentages to proportions
df['IV'] = (df['WoE'] * (df['Good%']- df['Bad%']))/100

## Printing the final df
df

## The final IV is the sum of the per-bin IV values, i.e. df['IV'].sum() = 0.24854

|   | Bins | Total | Goods | Bads | Good% | Bad% | WoE | IV |
|---|------|-------|-------|------|-------|------|-----|----|
| 0 | <= 4326.00 | 223 | 125 | 98 | 13.858093 | 30.340557 | -0.783616 | 0.129159 |
| 1 | > 15000.00 and <= 18000.00 | 104 | 80 | 24 | 8.86918 | 7.430341 | 0.177011 | 0.002547 |
| 2 | > 18000.00 and <= 21474.00 | 102 | 85 | 17 | 9.423503 | 5.263158 | 0.582476 | 0.024233 |
| 3 | > 21474.00 and <= 24900.00 | 111 | 83 | 28 | 9.201774 | 8.668731 | 0.059674 | 0.000318 |
| 4 | > 24900.00 and <= 30000.00 | 136 | 107 | 29 | 11.862528 | 8.978328 | 0.278571 | 0.008035 |
| 5 | > 30000.00 and <= 36000.00 | 103 | 75 | 28 | 8.314856 | 8.668731 | -0.041679 | 0.000147 |
| 6 | > 36000.00 and <= 45000.00 | 106 | 89 | 17 | 9.866962 | 5.263158 | 0.628461 | 0.028933 |
| 7 | > 4326.00 and <= 9630.00 | 111 | 77 | 34 | 8.536585 | 10.526316 | -0.209517 | 0.004169 |
| 8 | > 45000.00 | 101 | 88 | 13 | 9.756098 | 4.024768 | 0.885425 | 0.050747 |
| 9 | > 9630.00 and <= 15000.00 | 128 | 93 | 35 | 10.310421 | 10.835913 | -0.049711 | 0.000261 |


In [2]:
df['IV'].sum()

0.24854903563884123

**IV = 0.2485**. Following the interpretation provided earlier, the variable **DAINC has Medium predictive Power**. Hence, it can be considered for further analysis.
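As a cross-check, the result can be reproduced from the Goods and Bads counts in the output table alone, without the original Excel file. The sketch below copies those counts and reorders the bins by income range, since the table above is sorted alphabetically by label. The function name `woe_iv` and the column names `Good_share` and `Bad_share` are introduced here for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Goods and Bads per bin, copied from the output table above
# and reordered from the lowest to the highest income range.
counts = pd.DataFrame(
    {
        "Bins": [
            "<= 4326", "4326-9630", "9630-15000", "15000-18000", "18000-21474",
            "21474-24900", "24900-30000", "30000-36000", "36000-45000", "> 45000",
        ],
        "Goods": [125, 77, 93, 80, 85, 83, 107, 75, 89, 88],
        "Bads": [98, 34, 35, 24, 17, 28, 29, 28, 17, 13],
    }
)


def woe_iv(table: pd.DataFrame, good_col: str = "Goods", bad_col: str = "Bads") -> pd.DataFrame:
    """Add per-bin WoE and IV columns, working in proportions rather than percentages."""
    out = table.copy()
    out["Good_share"] = out[good_col] / out[good_col].sum()
    out["Bad_share"] = out[bad_col] / out[bad_col].sum()
    out["WoE"] = np.log(out["Good_share"] / out["Bad_share"])
    out["IV"] = (out["Good_share"] - out["Bad_share"]) * out["WoE"]
    return out


result = woe_iv(counts)
total_iv = result["IV"].sum()
print(result[["Bins", "WoE", "IV"]].round(4))
print(f"Total IV: {total_iv:.4f}")  # Total IV: 0.2485

# With the bins in income order, the WoE trend is not monotonic.
print(result["WoE"].is_monotonic_increasing)  # False
```

Working in proportions removes the need for the division by 100. The last line suggests that some neighbouring bins would need to be merged during coarse classing before the monotonic-trend expectation is met.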