# Configuration


sphinx:
  config:
    mathjax_path: https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js

# Library

In [1]:
import pandas as pd
import numpy as np

# Functions

I've made the functions below available as gists, so you can easily access them and use them in your own data analysis projects.

Each function is listed here with its link:

- [Discrete Variable WoE and IV](https://gist.github.com/DeborahBarbedo/08ed242316fe3b9ed3350460e2a140f3)
- [Continuous Variable WoE and IV](https://gist.github.com/DeborahBarbedo/d9ddd529f9b4359e4a867a649ab9544b)
- [Combined Discrete and Continuous Variable WoE and IV](https://gist.github.com/DeborahBarbedo/bc3597b64ad2fcd54266664c62adbe3f)

Give them a try and let me know what you think!


In [2]:
#################################
###### IV and WoE (discrete) ####
#################################


def Woe_IV_Dis(df, features, target):
    aux = features + [target] 
    
    df = df[aux].copy()
    
    # Empty dataframe
    df_woe_iv = pd.DataFrame({},index=[])
    
    for feature in features:
        df_woe_iv_aux = pd.crosstab(df[feature], df[target], normalize='columns') \
                        .assign(RR=lambda dfx: dfx[1] / dfx[0]) \
                        .assign(WoE=lambda dfx: np.log(dfx[1] / dfx[0])) \
                        .assign(IV=lambda dfx: (dfx['WoE']*(dfx[1]-dfx[0]))) \
                        .assign(IV_total=lambda dfx: np.sum(dfx['IV']))

        df_woe_iv = pd.concat([df_woe_iv, df_woe_iv_aux])
    
    return df_woe_iv  
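
To see what the crosstab chain in `Woe_IV_Dis` produces, here is a minimal sketch for a single feature. The dataframe `toy` and its columns `grp` and `y` are made up for illustration and are not part of the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one discrete feature and a binary target.
toy = pd.DataFrame({
    "grp": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "y":   [1, 1, 1, 0, 0, 0, 0, 1],
})

# Same steps as inside Woe_IV_Dis, written out for a single feature.
table = (
    pd.crosstab(toy["grp"], toy["y"], normalize="columns")  # share of each class per category
    .assign(RR=lambda d: d[1] / d[0])                        # ratio of the two shares
    .assign(WoE=lambda d: np.log(d["RR"]))                   # weight of evidence
    .assign(IV=lambda d: d["WoE"] * (d[1] - d[0]))           # contribution of each category
    .assign(IV_total=lambda d: d["IV"].sum())                # total, repeated on every row
)
print(table)
```

In this toy case category `a` holds 75% of the ones and 25% of the zeros, so its WoE is ln 3 (about 1.10), and the total IV also comes out as ln 3.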

In [3]:
#######################
###### IV and WoE #####
#######################

def Woe_IV(df, features_dis, features_cont, target):
    
    df_dis =  Woe_IV_Dis(df, features_dis, target)
    df_cont =  Woe_IV_cont(df, features_cont, target)
    
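    # Move the category labels from the index into a 'variable' column
    df_dis.reset_index(inplace=True)
    df_dis = df_dis.rename(columns = {'index':'variable',0: '0', 1: '1'})
    # Discrete features have no bin limits, so this column is left blank
    df_dis.insert(loc = 1, column = 'limit', value = ' ')

    # Continuous features report their total IV on a separate summary row
    df_cont['IV_total'] = ' '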
    
    df_woe_iv = pd.concat([df_dis, df_cont])
    
    return df_woe_iv

In [4]:
#################################
###### IV and WoE (continuous) ##
#################################


def Woe_IV_cont(df, features, target):
    
    aux = features + [target] 
    
    df = df[aux].copy()
    
    # Empty dataframe
    df_woe_iv = pd.DataFrame({},index=[])
    
    # Number of rows with target = 1
    _t1 = sum(df[target])
    # Number of rows with target = 0
    _t0 =  len(df[target]) - _t1
    
    # Percentile of continuous variables
    _quantile = df.iloc[:, :-1].quantile([.1, .2, .3, .4, .5, .6, .7, .8, .9], axis = 0)
    
    
    for _column in _quantile.columns:
   
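        # Non-duplicated quantile limit values, as a flat array of scalars.
        # np.inf is appended so rows above the 90th percentile get a bin too.
        list_aux = np.append(_quantile[_column].drop_duplicates().to_numpy(), np.inf)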
    
        _tiv = 0
        
        for q in range(len(list_aux)):
            
            
            if q>0:
                location = df[(df.loc[:,_column] > float(list_aux[q-1])) & (df.loc[:,_column] <= float(list_aux[q]))].index
                limit = str(list_aux[q-1]) + ' to ' + str(list_aux[q])
            else:
                location = df[(df.loc[:,_column] <= float(list_aux[q]))].index
                limit = '<=' + str(list_aux[q])
                
            _many = len(location)  
            
            # Target = 1
            _1 = sum(df.loc[location,target])
            _p1 = _1/_t1
            
            # Target = 0
            _0 = _many - _1
            _p0 = _0/_t0
            
            # Relative risk
            if _p1 == 0 or _p0 == 0:
                _RR = 1
            else:
                _RR = _p1/_p0
            
            # Weight of evidence
            _woe = np.log(_RR)
            
            # Information value
            _iv = round(_woe*(_p1-_p0),2)
            
            # Information value - total
            _tiv = _tiv+_iv
                    
            
            dframe = pd.DataFrame({'variable': _column , 'limit':limit , '0': _p0 , '1': _p1, 'RR':_RR, 'WoE': _woe , 'IV':  _iv}
                                  , index = [ _column])  
            
            df_woe_iv = pd.concat([df_woe_iv, dframe], ignore_index=True)
            
        dframe = pd.DataFrame({'variable': _column ,'limit': ' ' , '0': 1 , '1': 1, 'RR': 1 , 'WoE': 0 , 'IV':  _tiv}
                                  , index =[ _column])
         
        df_woe_iv = pd.concat([df_woe_iv, dframe], ignore_index=True)
            
    return df_woe_iv
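
The binning rule used in the loop (greater than the lower limit, less than or equal to the upper limit) can also be expressed with `pd.cut`. The sketch below is an alternative formulation, not the function above. The toy dataframe and the names `x`, `y` and `edges` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a continuous feature and a binary target.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x": rng.normal(size=200),
    "y": rng.integers(0, 2, size=200),
})

# Decile limits taken from the data, with duplicates removed.
edges = toy["x"].quantile(np.linspace(0.1, 0.9, 9)).drop_duplicates().to_numpy()

# Open-ended outer edges so that every non-missing value falls in some bin.
edges = np.concatenate(([-np.inf], edges, [np.inf]))

# pd.cut builds right-closed intervals (lower, upper] by default.
toy["bin"] = pd.cut(toy["x"], bins=edges)

# Share of each target class per bin; each column sums to 1.
shares = pd.crosstab(toy["bin"], toy["y"], normalize="columns")
print(shares)
```

Because the outer edges are infinite, the shares in each column add up to 1, which is a quick sanity check before taking logs.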

# Data

The data comes from the [Kaggle](https://www.kaggle.com/) competition [Titanic - Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic/data).



## Data import

In [5]:
df = pd.read_csv('titanic/train.csv')

## Visualization

In [6]:
df.head(100)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 96 | 0 | 3 | Shorney, Mr. Charles Joseph | male | NaN | 0 | 0 | 374910 | 8.0500 | NaN | S |
| 96 | 97 | 0 | 1 | Goldschmidt, Mr. George B | male | 71.0 | 0 | 0 | PC 17754 | 34.6542 | A5 | C |
| 97 | 98 | 1 | 1 | Greenfield, Mr. William Bertram | male | 23.0 | 0 | 1 | PC 17759 | 63.3583 | D10 D12 | C |
| 98 | 99 | 1 | 2 | Doling, Mrs. John T (Ada Julia Bone) | female | 34.0 | 0 | 1 | 231919 | 23.0000 | NaN | S |


# IV and WoE




In [7]:
features_dis = ['Sex','Embarked']
features_cont = ['Age','Fare']
target = 'Survived'

In [None]:
Woe_IV_cont(df, features_cont, target)

In [None]:
Woe_IV_Dis(df, features_dis, target)

In [None]:
Woe_IV(df, features_dis, features_cont, target)

## Formulas


In [11]:
pd.crosstab(df['Sex'], df[target])

| Sex | Survived = 0 | Survived = 1 |
|---|---|---|
| female | 81 | 233 |
| male | 468 | 109 |


What is usually described as 'good':

$ \% Target_{0, sector_i} = \frac{\# Target_{0, sector_i}}{\# Target_{0}} $

Suppose the chosen $sector$ is 'female'.

For this problem:

$ \% Survived_{0, female} = \frac{\# Survived_{0, female}}{\# Survived_{0}} = \frac{81}{81+468} = \frac{81}{549} \approx 0.148 $

What is usually described as 'bad':

$ \% Target_{1, sector_i} = \frac{\# Target_{1, sector_i}}{\# Target_{1}} $

And the share of the whole population that falls in each sector:

$ \% Pop_{sector_i} = \frac{\# Pop_{sector_i}}{\# Pop} $
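
As a rough check, the shares can be evaluated directly from the `Sex` crosstab counts shown earlier. The sketch below also computes WoE and IV with the same orientation as `Woe_IV_Dis` (log of the share of 1 over the share of 0). The variable names `counts`, `pct`, `pct_pop`, `woe` and `iv` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Counts copied from the crosstab of Sex against Survived.
counts = pd.DataFrame({0: [81, 468], 1: [233, 109]}, index=["female", "male"])

# Share of each target class that falls in each sector (columns sum to 1).
pct = counts / counts.sum(axis=0)

# Share of the whole population in each sector.
pct_pop = counts.sum(axis=1) / counts.to_numpy().sum()

# WoE per sector and total IV, following the orientation used in Woe_IV_Dis.
woe = np.log(pct[1] / pct[0])
iv = ((pct[1] - pct[0]) * woe).sum()

print(pct.round(4))
print(pct_pop.round(4))
print(woe.round(4))
print(round(iv, 4))
```

With these counts, 'female' accounts for 81/549 of the non-survivors and 233/342 of the survivors, which gives a positive WoE for 'female' and a negative one for 'male'.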