# Configuration


sphinx:
  config:
    mathjax_path: https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js

# Library

In [1]:
import pandas as pd
import numpy as np

# Functions

The functions used in this notebook are available as gists, so you can easily access them and use them in your own data analysis projects.

Each function is linked below:

- [Discrete Variable WoE and IV](https://gist.github.com/DeborahBarbedo/08ed242316fe3b9ed3350460e2a140f3)
- [Continuous Variable WoE and IV](https://gist.github.com/DeborahBarbedo/d9ddd529f9b4359e4a867a649ab9544b)
- [Combined Discrete and Continuous Variable WoE and IV](https://gist.github.com/DeborahBarbedo/bc3597b64ad2fcd54266664c62adbe3f)

Give them a try and let me know what you think!


In [2]:
#########################################
###### Discrete Variable WoE and IV #####
#########################################


def Woe_IV_Dis(df, features, target):
    aux = features + [target] 
    
    df = df[aux].copy()
    
    # Empty dataframe
    df_woe_iv = pd.DataFrame({},index=[])
    
    for feature in features:
        # Column-normalized crosstab: share of each category within target = 0 and within target = 1
        df_woe_iv_aux = pd.crosstab(df[feature], df[target], normalize='columns') \
                        .assign(WoE=lambda i: np.log(i[1] / i[0])) \
                        .assign(IV=lambda i: (i['WoE']*(i[1]-i[0]))) \
                        .assign(IV_total=lambda i: np.sum(i['IV']))

        df_woe_iv = pd.concat([df_woe_iv, df_woe_iv_aux])
    
    return df_woe_iv    
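To make the crosstab step easier to check by hand, here is a minimal sketch on a made-up six-row frame. The `toy` frame and its `color` column are illustrative assumptions and are not part of the Titanic data; the calculation mirrors the chain used in `Woe_IV_Dis`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one discrete feature and a binary target.
toy = pd.DataFrame({
    "color": ["red", "red", "blue", "blue", "blue", "red"],
    "target": [1, 0, 0, 0, 1, 1],
})

# Share of each category within target = 0 and within target = 1.
shares = pd.crosstab(toy["color"], toy["target"], normalize="columns")

# WoE per category and IV summed over the categories.
woe = np.log(shares[1] / shares[0])
iv = (woe * (shares[1] - shares[0])).sum()

print(shares)
print(woe)
print("IV:", iv)
```

Here 'red' holds 1/3 of the target = 0 rows and 2/3 of the target = 1 rows, so its WoE is ln(2), and 'blue' is the mirror image. Note that a category with no rows in one of the two classes would produce an infinite WoE with this formula, so sparse categories may need to be grouped first.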

In [3]:
###########################################
###### Continuous Variable WoE and IV #####
###########################################


def Woe_IV_cont(df, features, target):
    
    aux = features + [target] 
    
    df = df[aux].copy()
    
    # Empty dataframe
    df_woe_iv = pd.DataFrame({},index=[])
    
    # Number of rows with target = 1
    _t1 = sum(df[target])
    # Number of rows with target = 0
    _t0 =  len(df[target]) - _t1
    
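    # Decile cut points (10th to 90th percentile) of the continuous variables.
    # Note: bins are built only up to the 90th percentile, so values above it
    # (and missing values) are not assigned to any bin.
    _quantile = df.iloc[:, :-1].quantile([.1, .2, .3, .4, .5, .6, .7, .8, .9], axis = 0)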
    
    
    for _column in _quantile.columns:
   
        # Non-duplicated quantile limit values
        list_aux = _quantile[[_column]].drop_duplicates().to_numpy()
    
        _tiv = 0
        
        for q in range(len(list_aux)):
            
            
            if q>0:
                location = df[(df.loc[:,_column] > float(list_aux[q-1])) & (df.loc[:,_column] <= float(list_aux[q]))].index
                limit = str(list_aux[q-1]) + ' a ' + str(list_aux[q])
            else:
                location = df[(df.loc[:,_column] <= float(list_aux[q]))].index
                limit = '<=' + str(list_aux[q])
                
            _many = len(location)  
            
            # Target = 1
            _1 = sum(df.loc[location,target])
            _p1 = _1/_t1
            
            # Target = 0
            _0 = _many - _1
            _p0 = _0/_t0
            
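            # Ratio of the target = 1 share to the target = 0 share;
            # set to 1 (WoE = 0) when either share is 0 to avoid log(0) or division by zero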
            if _p1 == 0 or _p0 == 0:
                _Distr = 1
            else:
                _Distr = _p1/_p0
            
            # Weight of evidence
            _woe = np.log(_Distr)
            
            # Information value
            _iv = round(_woe*(_p1-_p0),2)
            
            # Information value - total
            _tiv = _tiv+_iv
                    
            
            dframe = pd.DataFrame({'variable': _column , 'limit':limit , '0': _p0 , '1': _p1, 'WoE': _woe , 'IV':  _iv}
                                  , index = [ _column])  
            
            df_woe_iv = pd.concat([df_woe_iv, dframe], ignore_index=True)
            
        dframe = pd.DataFrame({'variable': _column ,'limit': ' ' , '0': 1 , '1': 1, 'WoE': 0 , 'IV':  _tiv}
                                  , index =[ _column])
         
        df_woe_iv = pd.concat([df_woe_iv, dframe], ignore_index=True)
            
    return df_woe_iv
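The loop above stops at the 90th percentile, so the top decile never gets a bin. As a hedged alternative, the sketch below uses `pd.qcut`, which assigns every non-missing value to a quantile bin. The helper name `woe_iv_qcut`, the column names `share_0` and `share_1`, and the synthetic data are all assumptions made for illustration; this is not the author's function and its output will differ from the tables in this notebook.

```python
import numpy as np
import pandas as pd

def woe_iv_qcut(df, feature, target, bins=10):
    """Hypothetical helper: WoE and IV per quantile bin of one numeric column."""
    data = df[[feature, target]].dropna()
    # duplicates="drop" merges repeated quantile edges instead of raising an error.
    binned = pd.qcut(data[feature], q=bins, duplicates="drop")
    # Share of each bin within target = 0 and within target = 1.
    shares = pd.crosstab(binned, data[target], normalize="columns")
    shares = shares.rename(columns={0: "share_0", 1: "share_1"})
    shares["WoE"] = np.log(shares["share_1"] / shares["share_0"])
    shares["IV"] = shares["WoE"] * (shares["share_1"] - shares["share_0"])
    return shares

# Synthetic data (assumption): larger x makes target = 1 more likely.
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = (rng.random(2000) < 1 / (1 + np.exp(-x))).astype(int)
demo = pd.DataFrame({"x": x, "y": y})

table = woe_iv_qcut(demo, "x", "y")
print(table)
print("Total IV:", table["IV"].sum())
```

Because every row falls in some bin, the two share columns each sum to 1, which is a quick sanity check that no observations were dropped.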

In [4]:
#################################################################
###### Combined Discrete and Continuous Variable WoE and IV #####
#################################################################

def Woe_IV(df, features_dis, features_cont, target):
    
    df_dis =  Woe_IV_Dis(df, features_dis, target)
    df_cont =  Woe_IV_cont(df, features_cont, target)
    
    df_dis.reset_index(inplace=True)
    df_dis = df_dis.rename(columns = {'index':'variable',0: '0', 1: '1'})
    df_dis.insert(loc = 1, column = 'limit', value = ' ')

    df_cont['IV_total'] = ' '
    
    df_woe_iv = pd.concat([df_dis, df_cont])
    
    return df_woe_iv

# Data

The data comes from the [Kaggle](https://www.kaggle.com/) competition [Titanic - Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic/data).



## Data import

In [5]:
df = pd.read_csv('titanic/train.csv')

## Visualization

In [6]:
df.head(100)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
95,96,0,3,"Shorney, Mr. Charles Joseph",male,,0,0,374910,8.0500,,S
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
97,98,1,1,"Greenfield, Mr. William Bertram",male,23.0,0,1,PC 17759,63.3583,D10 D12,C
98,99,1,2,"Doling, Mrs. John T (Ada Julia Bone)",female,34.0,0,1,231919,23.0000,,S


# IV and WoE




In [7]:
features_dis = ['Sex','Embarked']
features_cont = ['Age','Fare']
target = 'Survived'

In [8]:
Woe_IV_cont(df, features_cont,target )

,variable,limit,0,1,WoE,IV
0,Age,<=[14.],0.058288,0.131579,0.814214,0.06
1,Age,[14.] a [19.],0.096539,0.099415,0.029356,0.0
2,Age,[19.] a [22.],0.087432,0.055556,-0.453474,0.01
3,Age,[22.] a [25.],0.080146,0.076023,-0.052805,0.0
4,Age,[25.] a [28.],0.067395,0.070175,0.040424,0.0
5,Age,[28.] a [31.8],0.07286,0.076023,0.042505,0.0
6,Age,[31.8] a [36.],0.08561,0.128655,0.40733,0.02
7,Age,[36.] a [41.],0.061931,0.055556,-0.108634,0.0
8,Age,[41.] a [50.],0.08561,0.090643,0.057127,0.0
9,Age,,1.0,1.0,0.0,0.09


In [9]:
Woe_IV_Dis(df, features_dis,target )

Survived,0,1,WoE,IV,IV_total
female,0.147541,0.681287,1.529877,0.816565,1.341681
male,0.852459,0.318713,-0.983833,0.525116,1.341681
C,0.136612,0.273529,0.694264,0.095057,0.122728
Q,0.08561,0.088235,0.030203,7.9e-05,0.122728
S,0.777778,0.638235,-0.197734,0.027592,0.122728


In [10]:
Woe_IV(df,features_dis,features_cont,target)

,variable,limit,0,1,WoE,IV,IV_total
0,female,,0.147541,0.681287,1.529877,0.816565,1.341681
1,male,,0.852459,0.318713,-0.983833,0.525116,1.341681
2,C,,0.136612,0.273529,0.694264,0.095057,0.122728
3,Q,,0.08561,0.088235,0.030203,7.9e-05,0.122728
4,S,,0.777778,0.638235,-0.197734,0.027592,0.122728
0,Age,<=[14.],0.058288,0.131579,0.814214,0.06,
1,Age,[14.] a [19.],0.096539,0.099415,0.029356,0.0,
2,Age,[19.] a [22.],0.087432,0.055556,-0.453474,0.01,
3,Age,[22.] a [25.],0.080146,0.076023,-0.052805,0.0,
4,Age,[25.] a [28.],0.067395,0.070175,0.040424,0.0,


## Formulas


In [11]:
pd.crosstab(df['Sex'], df[target])

Sex \ Survived,0,1
female,81,233
male,468,109


What is commonly referred to as 'good' is the $target_{0}$.

 $ \% Target_{0, sector_i} = \frac{ \# Target_{0, sector_i}}{ \# Target_{0}} $
 
Let's assume the chosen $sector$ is 'female'.

For this case:

$ \% Survived_{0, female } = \frac{ \# Survived_{0, female}}{ \# Survived_{0}}  = \frac{81}{81+468} \approx 0.147541 $



What is commonly described as 'bad' is the $target_{1}$.


 $ \% Target_{1, sector_i} = \frac{ \# Target_{1, sector_i}}{ \# Target_{1}} $ 
 
 For this problem, in the 'female' $sector$:
 
 $ \% Survived_{1, female} = \frac{ \# Survived_{1, female}}{ \# Survived_{1}  	} = \frac{233}{233+109} \approx 0.681287 $ 
 
 Percentage of the population in the study sector:
 
 $ \% Pop_{sector_i} = \frac{ \# Pop_{sector_i}}{ \# Pop} $ 
 
 $ \% Pop_{female} = \frac{ \# Pop_{female}}{ \# Pop} = \frac{81 + 233}{81 + 233 +  468 +109} \approx 0.352413 $ 

Ratio of the 'bad' share to the 'good' share within each sector:
    

 $  Distr_{sector_i} = \frac{ \% Target_{1, sector_i}}{ \% Target_{0, sector_i}} $ 
 
 $  Distr_{female} = \frac{ \% Survived_{1, female}}{ \% Survived_{0, female }} =  \frac{ \frac{233}{233+109}}{ \frac{81}{81+468} }  \approx 4.617609 $ 
 


$ WoE_{sector_i} = \ln(Distr_{sector_i}) $ 

For this problem, in the 'female' $sector$:


$ WoE_{female} = \ln(Distr_{female}) \approx  1.529877 $ 

$ IV_{sector_i} = WoE_{sector_i} \times (\% Target_{1, sector_i} - \% Target_{0, sector_i} ) $ 

For this problem, in the 'female' $sector$:

$ IV_{female} = WoE_{female} \times (\% Survived_{1, female} - \% Survived_{0, female} )  = 1.529877 \times (0.681287 - 0.147541 ) \approx 0.816566 $ 
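The formulas above can be checked directly from the counts in the crosstab of Sex against Survived. This is a minimal sketch using only the four counts shown earlier; the variable names (`counts`, `results`, `iv_total`) are illustrative.

```python
import math

# Counts from the crosstab of Sex against Survived shown earlier.
counts = {"female": {0: 81, 1: 233}, "male": {0: 468, 1: 109}}

total_0 = sum(c[0] for c in counts.values())  # all rows with Survived = 0
total_1 = sum(c[1] for c in counts.values())  # all rows with Survived = 1

results = {}
for sector, c in counts.items():
    share_0 = c[0] / total_0          # % Target_0 in this sector
    share_1 = c[1] / total_1          # % Target_1 in this sector
    woe = math.log(share_1 / share_0)  # WoE = ln(Distr)
    iv = woe * (share_1 - share_0)     # IV contribution of this sector
    results[sector] = {"share_0": share_0, "share_1": share_1, "WoE": woe, "IV": iv}

iv_total = sum(r["IV"] for r in results.values())

for sector, r in results.items():
    print(sector, {k: round(v, 6) for k, v in r.items()})
print("IV total:", round(iv_total, 6))
```

The values agree with the `Woe_IV_Dis` output for Sex up to rounding in the last decimal place.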

IV classification:

- < 0.02: unpredictive
- 0.02 to 0.1: weak
- 0.1 to 0.3: medium
- 0.3 to 0.5: strong
- > 0.5: should be checked for overpredicting
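The bands above can be encoded as a small lookup. `classify_iv` is a hypothetical helper written for illustration; the list does not say which band a boundary value belongs to, so treating each lower bound as inclusive (and 0.5 itself as 'strong') is an assumption.

```python
def classify_iv(iv: float) -> str:
    """Map a total Information Value to the bands listed above.

    Boundary handling is an assumption: lower bounds are inclusive,
    and exactly 0.5 is still treated as 'strong'.
    """
    if iv < 0.02:
        return "unpredictive"
    if iv < 0.1:
        return "weak"
    if iv < 0.3:
        return "medium"
    if iv <= 0.5:
        return "strong"
    return "check for overpredicting"

# Total IV values taken from the outputs in this notebook.
for name, iv in [("Age", 0.09), ("Embarked", 0.122728), ("Fare", 0.37), ("Sex", 1.341681)]:
    print(name, iv, classify_iv(iv))
```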
    
    
 

In [12]:
Woe_IV_Dis(df, ['Sex'],target )

Sex \ Survived,0,1,WoE,IV,IV_total
female,0.147541,0.681287,1.529877,0.816565,1.341681
male,0.852459,0.318713,-0.983833,0.525116,1.341681


Regarding the Weight of Evidence (WoE), the farther it is from 0, the more strongly the category discriminates between the two target classes. A negative value indicates that the category is associated with fewer occurrences of the event (target = 1) than in the population overall, while a positive value indicates that it favors the event.

Therefore, being female clearly favors the passenger's survival (WoE of about 1.53).

The total Information Value (IV) of Sex is about 1.34, well above 0.5, which flags a suspiciously high predictive power: the variable is very strongly associated with the response variable.

How to perform category grouping?

- Analyze how similar the categories are in their power to discriminate between the targets.
- Check the number of cases in each category to ensure it is representative.
- Group categories that have a meaningful relationship.

* Benefits:

    - Prepares variables for the model.
    - Simplifies the model equation.
    - Enhances model stability, reducing the risk of overfitting.

* Note:

    - Grouping categories never increases the information value: it decreases unless the merged categories have identical WoE.
    - Combine categories with similar WoE values to keep this loss small.
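The effect of grouping on IV can be illustrated with the Embarked shares from the `Woe_IV_Dis` output earlier in this notebook. This is a sketch under stated assumptions: `information_value` is a hypothetical helper, the shares are the rounded values printed in that table, and the two groupings are chosen only to contrast similar and dissimilar WoE values.

```python
import numpy as np

def information_value(share_0, share_1):
    """Hypothetical helper: total IV from per-category shares of target 0 and target 1."""
    share_0 = np.asarray(share_0, dtype=float)
    share_1 = np.asarray(share_1, dtype=float)
    woe = np.log(share_1 / share_0)
    return float(np.sum(woe * (share_1 - share_0)))

# Rounded shares for Embarked = C, Q, S copied from the Woe_IV_Dis output.
share_0 = [0.136612, 0.085610, 0.777778]
share_1 = [0.273529, 0.088235, 0.638235]

# No grouping: three separate categories.
iv_separate = information_value(share_0, share_1)

# Group Q with S (similar WoE: about 0.03 and -0.20). Shares of merged categories add up.
iv_similar = information_value(
    [share_0[0], share_0[1] + share_0[2]],
    [share_1[0], share_1[1] + share_1[2]],
)

# Group C with S (dissimilar WoE: about 0.69 and -0.20).
iv_dissimilar = information_value(
    [share_0[0] + share_0[2], share_0[1]],
    [share_1[0] + share_1[2], share_1[1]],
)

print("separate:        ", round(iv_separate, 6))
print("Q + S grouped:   ", round(iv_similar, 6))
print("C + S grouped:   ", round(iv_dissimilar, 6))
```

Merging Q with S loses only a little IV, while merging C with S removes almost all of it, which is why categories with similar WoE are the ones to combine.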

In [13]:
Woe_IV_cont(df, ["Fare"],target )

,variable,limit,0,1,WoE,IV
0,Fare,<=[7.55],0.143898,0.038012,-1.331211,0.14
1,Fare,[7.55] a [7.8542],0.111111,0.076023,-0.37949,0.01
2,Fare,[7.8542] a [8.05],0.15847,0.055556,-1.048181,0.11
3,Fare,[8.05] a [10.5],0.10929,0.052632,-0.730685,0.04
4,Fare,[10.5] a [14.4542],0.087432,0.105263,0.185606,0.0
5,Fare,[14.4542] a [21.6792],0.092896,0.108187,0.15238,0.0
6,Fare,[21.6792] a [27.],0.078324,0.134503,0.540729,0.03
7,Fare,[27.] a [39.6875],0.103825,0.099415,-0.043403,0.0
8,Fare,[39.6875] a [77.9583],0.076503,0.137427,0.585766,0.04
9,Fare,,1.0,1.0,0.0,0.37


The Fare variable has strong predictive power: its total IV of 0.37 falls in the 0.3 to 0.5 band. For a simpler model, a natural choice is a binary variable indicating whether Fare is <= 10.5, since the four bins up to 10.5 all have a negative WoE while most of the bins above it have a positive one. Keep in mind that grouping categories reduces the information value, so only categories with similar Weight of Evidence (WoE) values should be combined.

In this case, the Fare bin from 27 to 39.6875 (row 7 in the table above; row 17 when Age and Fare are processed together) has a negative WoE, but it is very close to zero. This places it in a neutral range with respect to survival, so including it in the group that favors survival poses no problem.

In [14]:
df['FLG_female'] = df['Sex'].map({'male': 0, 'female': 1})


In [15]:
df['FLG_Fare_leq_10.5'] = np.where(df.loc[:,'Fare' ] <= 10.5, 1, 0)



In [16]:
df

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FLG_female,FLG_Fare_leq_10.5
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,1,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,1,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,0,0
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,1,0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,1,0
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,0,0


# References:

* Beraldi, Fidel. Atualização dinâmica de modelo de regressão logística binária para detecção de fraudes em transações eletrônicas com cartão de crédito. Universidade de São Paulo, 1 Dec. 2014. DOI.org (Crossref), https://doi.org/10.11606/D.45.2014.tde-05022015-232801.  

* Siddiqi, Naeem. Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. Wiley, 2006.


# Notes:

This is supporting material for my [blog](https://deborahbarbedo.github.io/), which focuses on introducing readers to the tools and knowledge needed to harness the power of data through statistics and machine learning.

## Posts about WoE and Information Value:

- [Function in Python to calculate WoE and IV](https://deborahbarbedo.github.io/posts/2023-04-09-WoE_IV_Python_Function)
- [Função em Python para calcular WoE e IV](https://deborahbarbedo.github.io/pt/2023-04-17-WoE_IV_Python_Function)
- [Mastering Logistic Regression: Unpacking WOE and IV Metrics for Variable Selection and Interpretation.](https://deborahbarbedo.github.io/posts/2023-04-24-Unpacking_WOE_and_IV)
- [Dominando a Regressão Logística: Desvendando as Métricas WoE e IV para Seleção e Interpretação de Variáveis.](https://deborahbarbedo.github.io/pt/2023-05-08-Unpacking_WOE_and_IV)
- [Mastering Logistic Regression: A Comprehensive Guide to WoE and IV Calculation.](https://deborahbarbedo.github.io/posts/2023-06-05-WoE_IV_Calculation)
