## Weight of evidence (WOE) and information value (IV) 

A Python example is included at the end.

### What is Weight of Evidence (WOE)?
Weight of evidence (WoE) measures the predictive power of an independent variable in relation to the dependent variable. Because it originated in the credit scoring world, it is usually described as a measure of how well a variable separates good customers from bad ones. "Bad customers" are those who defaulted on a loan, and "good customers" are those who paid the loan back.
The WoE of each category or bin of a feature is calculated as
![title](https://editor.analyticsvidhya.com/uploads/30208woe.png)

If a category or bin of a feature has a large proportion of events compared to its proportion of non-events, its WoE is high, which in turn says that this category separates the events from the non-events well.
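
As a small worked example, the sketch below computes the WoE of a single bin by hand. The counts are copied from the first `gre` bin of the WoE table printed at the end of this notebook, and the totals are the column sums of that table. It assumes the convention WoE = ln(% of events / % of non-events), which is the one the `iv_woe` function at the end uses.

```python
import math

# Counts for the first gre bin, (219.999, 440.0], taken from the WoE table
# printed at the end of this notebook. Totals are the sums over all ten bins.
events_in_bin, total_events = 6, 127            # admit == 1
non_events_in_bin, total_non_events = 42, 273   # admit == 0

pct_events = events_in_bin / total_events
pct_non_events = non_events_in_bin / total_non_events

# Assumed convention: WoE = ln(% of events / % of non-events)
woe_bin = math.log(pct_events / pct_non_events)
print(woe_bin)
```

The result is negative because this bin holds a smaller share of the events than of the non-events, and it should agree with the WoE of about -1.1806 reported for that bin in the table.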

### Why is WoE often seen in logistic regression?
> - First, logistic regression is a parametric method that fits a linear equation, which requires all features to be numerical. However, a dataset may contain categorical features that are either nominal or ordinal. In that case, the WoE values of the categories can be used to encode the categorical feature and convert it into a numerical one.
> - Second, the WoE of a feature has a linear relationship with the log odds. This satisfies the requirement that the features be linearly related to the log odds.
> - Third, if a continuous feature does not have a linear relationship with the log odds, it can be binned into groups, and a new feature created by replacing each bin with its WoE value can be used instead of the original feature. In other words, the WoE transformation helps you build a strictly linear relationship with the log odds, which is not easy to achieve with other transformations such as log or square root.
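
The linear relationship in the second point can be made explicit with a short derivation. The symbols are introduced here for illustration only: $E_i$ and $N_i$ are the numbers of events and non-events in bin $i$, and $E$ and $N$ are the totals over all bins. Assuming the convention WoE = ln(% of events / % of non-events):

$$
\ln\frac{E_i}{N_i}
= \ln\frac{E_i / E}{N_i / N} + \ln\frac{E}{N}
= \mathrm{WoE}_i + \ln\frac{E}{N}
$$

Under that convention, the observed log odds inside a bin equal the bin's WoE plus a constant (the overall log odds of the sample), so the log odds are a straight line in the WoE-encoded feature.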

### Other benefits of WoE:
> - It can treat outliers. Suppose you have a continuous variable such as annual salary, with extreme values reaching 500 million dollars. These values would be grouped into a single class (say, 250-500 million dollars). Later, instead of the raw values, we would use the WoE score of each class.
> - It can handle missing values, as missing values can be binned separately.
> - Since the WoE transformation handles categorical variables, there is no need for dummy variables.

### Usage of WOE

Weight of Evidence (WOE) helps to transform a continuous independent variable into a set of groups or bins based on similarity of dependent variable distribution i.e. number of events and non-events. 

__For continuous independent variables__: First, create bins (categories / groups) for a continuous independent variable and then combine categories with similar WOE values and replace categories with WOE values. Use WOE values rather than input values in your model.

__For categorical independent variables__: Combine categories with similar WOE and then replace each resulting category with its WOE value. In other words, use WOE values rather than raw categories in your model. The transformed variable is continuous and can be treated like any other continuous variable.
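
A minimal sketch of the replacement step for a categorical variable follows. The data frame, the column names `grade` and `default`, and the values are hypothetical toy data made up for illustration, and the WoE convention ln(% of events / % of non-events) is assumed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'grade' is a categorical predictor and
# 'default' is the binary target (1 = event, 0 = non-event).
df = pd.DataFrame({
    "grade":   ["A", "A", "A", "B", "B", "B", "B", "C", "C", "C"],
    "default": [0,   0,   1,   0,   1,   0,   1,   1,   1,   0],
})

# Events and total observations per category
stats = df.groupby("grade")["default"].agg(events="sum", n="count")
stats["non_events"] = stats["n"] - stats["events"]

# Assumed convention: WoE = ln(% of events / % of non-events)
pct_events = stats["events"] / stats["events"].sum()
pct_non_events = stats["non_events"] / stats["non_events"].sum()
stats["woe"] = np.log(pct_events / pct_non_events)

# Replace each raw category with its WoE value
df["grade_woe"] = df["grade"].map(stats["woe"])
print(stats)
```

The new column `grade_woe` is numeric and can be passed to a model directly. This toy data has no empty cells; a category with zero events or zero non-events would need the kind of adjustment the `iv_woe` function at the end of this notebook applies.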

### How to check correct binning with WOE?
- Each category (bin) should have at least 5% of the observations.
    - The number of bins determines the amount of smoothing: the fewer the bins, the more the smoothing. If someone asks "why not form 1000 bins?", the answer is that fewer bins capture the important patterns in the data while leaving out the noise. Bins with fewer than 5% of the cases might not give a true picture of the data distribution and might lead to model instability.
- Each category (bin) should be non-zero for both non-events and events.
- The WOE should be distinct for each category. Similar groups should be aggregated.
    - Why? Categories with similar WOE have almost the same proportion of events and non-events. In other words, the two categories behave the same way.
- The WOE should be monotonic, i.e. either growing or decreasing with the groupings.
- Missing values are binned separately.
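
Some of these rules of thumb can be checked mechanically. The sketch below applies three of them (minimum share, no empty cells, monotonic WoE) to the 10-bin `gre` table printed at the end of this notebook. The helper `check_binning` and its `min_share` parameter are hypothetical names introduced here, not part of any library.

```python
import numpy as np
import pandas as pd

# N and Events per gre bin, copied from the WoE table at the end of this notebook.
bins = pd.DataFrame({
    "N":      [48, 51, 24, 51, 29, 53, 45, 20, 44, 35],
    "Events": [6, 12, 10, 15, 6, 21, 17, 9, 12, 19],
})
bins["Non-Events"] = bins["N"] - bins["Events"]
# Assumed convention: WoE = ln(% of events / % of non-events)
bins["WoE"] = np.log(
    (bins["Events"] / bins["Events"].sum())
    / (bins["Non-Events"] / bins["Non-Events"].sum())
)

def check_binning(table, min_share=0.05):
    """Hypothetical helper: evaluate three of the rules of thumb listed above."""
    share = table["N"] / table["N"].sum()
    woe = table["WoE"]
    return {
        # every bin holds at least min_share of the observations
        "min_share_ok": bool((share >= min_share).all()),
        # every bin has at least one event and one non-event
        "no_empty_cells": bool(((table["Events"] > 0) & (table["Non-Events"] > 0)).all()),
        # WoE only grows or only shrinks across the ordered bins
        "monotonic_woe": bool(woe.is_monotonic_increasing or woe.is_monotonic_decreasing),
    }

report = check_binning(bins)
print(report)
```

For this table the size and empty-cell checks pass, but the WoE is not monotonic across the ten bins, which by the rules above suggests merging adjacent bins before using the variable.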

### Information Value
The WoE value tells us the predictive power of each bin of a feature. For feature selection, however, a single value that represents the predictive power of the entire feature is more useful. This is what the information value (IV) provides.

The equation for IV is
![title](https://editor.analyticsvidhya.com/uploads/66585IV.png)

> Note that the term (percentage of events – the percentage of non-events) always has the same sign as the WoE, so every term of the sum is non-negative and the IV can never be negative.
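
To make the sum concrete, the sketch below recomputes the IV of `gre` from the per-bin counts in the WoE table printed at the end of this notebook. It assumes IV = Σ (% of events − % of non-events) × WoE over the bins, with WoE = ln(% of events / % of non-events), which is how the `iv_woe` function at the end computes it.

```python
import numpy as np

# Events and non-events per gre bin, copied from the WoE table
# printed at the end of this notebook.
events = np.array([6, 12, 10, 15, 6, 21, 17, 9, 12, 19])
non_events = np.array([42, 39, 14, 36, 23, 32, 28, 11, 32, 16])

pct_events = events / events.sum()
pct_non_events = non_events / non_events.sum()

woe = np.log(pct_events / pct_non_events)
# Both factors share the same sign, so no term is negative
iv_per_bin = (pct_events - pct_non_events) * woe
iv_total = iv_per_bin.sum()
print(iv_total)
```

The total should agree with the IV of about 0.3129 that the notebook reports for `gre`.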

### Interpreting the IV value
<table>
<tr>
<td>Information Value </td>
<td>Variable Predictiveness</td>
</tr>
<tr>
<td>Less than 0.02 </td>
<td>Not useful for prediction</td>
</tr>
<tr>
<td>0.02 to 0.1 </td>
<td>Weak predictive Power</td>
</tr>
<tr>
<td>0.1 to 0.3 </td>
<td>Medium predictive Power</td>
</tr>
<tr>
<td>0.3 to 0.5</td>
<td>Strong predictive Power</td>
</tr>
<tr>
<td>Greater than 0.5 </td>
<td>Suspicious Predictive Power</td>
</tr>
</table>

If the IV statistic is:
- Less than 0.02, then the predictor is not useful for modeling (separating the Goods from the Bads).
- 0.02 to 0.1, then the predictor has only a weak relationship to the Goods/Bads odds ratio.
- 0.1 to 0.3, then the predictor has a medium-strength relationship to the Goods/Bads odds ratio.
- 0.3 to 0.5, then the predictor has a strong relationship to the Goods/Bads odds ratio.
- Greater than 0.5, then the relationship is suspiciously strong and the predictor should be double-checked.
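
These bands can be wrapped in a small lookup. The helper `iv_label` is a hypothetical name introduced here. The bands do not say which side a boundary value such as 0.1 belongs to, so treating each lower bound as inclusive is an assumption of this sketch.

```python
def iv_label(iv):
    """Hypothetical helper: map an IV to the rule-of-thumb bands above.

    Assumption: lower bounds are inclusive, and exactly 0.5 counts as strong.
    """
    if iv < 0.02:
        return "not useful"
    if iv < 0.1:
        return "weak"
    if iv < 0.3:
        return "medium"
    if iv <= 0.5:
        return "strong"
    return "suspicious"

# IV values reported for the three predictors at the end of this notebook
for name, value in [("gre", 0.312882), ("gpa", 0.27002), ("rank", 0.292044)]:
    print(name, iv_label(value))
```

With the values reported at the end of this notebook, `gre` falls in the strong band, while `gpa` and `rank` fall in the medium band.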

### Important Points
1. Information value increases as the number of bins / groups of an independent variable increases. Be careful when there are more than 20 bins, as some bins may contain very few events and non-events.
2. Information value is not an optimal feature (variable) selection method when you are building a classification model other than binary logistic regression (e.g. random forest or SVM), because the calculation of weight of evidence is closely tied to the conditional log odds that a logistic regression model predicts. In other words, it is designed mainly for binary logistic regression. Think of it this way: a random forest can detect non-linear relationships very well, so selecting variables via information value and using them in a random forest model might not produce the most accurate and robust predictive model.

### Conclusion
Calculating WoE and IV is beneficial and helps us analyze multiple points, as listed below:

1. WoE helps check the linear relationship between a feature and the dependent variable used in the model.

2. WoE is a good variable transformation method for both continuous and categorical features.

3. WoE is better than one-hot encoding, as this method of variable transformation does not increase the complexity of the model.

4. IV is a good measure of the predictive power of a feature, and it also helps point out suspicious features.

Though WoE and IV are highly useful, make sure they are used only with logistic regression. Unlike other feature selection methods, the features selected using IV might not be the best feature set for building a non-linear model.

In [1]:
import pandas as pd
import numpy as np
mydata = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
mydata.head()

| | admit | gre | gpa | rank |
|---|---|---|---|---|
| 0 | 0 | 380 | 3.61 | 3 |
| 1 | 1 | 660 | 3.67 | 3 |
| 2 | 1 | 800 | 4.0 | 1 |
| 3 | 1 | 640 | 3.19 | 4 |
| 4 | 0 | 520 | 2.93 | 4 |


In [2]:
def iv_woe(data, target, bins=10, show_woe=False):
    """Return (IV per variable, WoE table per bin) for a binary 0/1 target."""

    # Empty DataFrames that accumulate the results
    newDF, woeDF = pd.DataFrame(), pd.DataFrame()

    # Extract column names
    cols = data.columns

    # Run WoE and IV on all the independent variables
    for ivars in cols[~cols.isin([target])]:
        # Numeric variables with more than 10 distinct values are cut into quantile bins;
        # every other variable is grouped by its raw values
        if (data[ivars].dtype.kind in 'bifc') and (len(np.unique(data[ivars])) > 10):
            binned_x = pd.qcut(data[ivars], bins, duplicates='drop')
            d0 = pd.DataFrame({'x': binned_x, 'y': data[target]})
        else:
            d0 = pd.DataFrame({'x': data[ivars], 'y': data[target]})

        # Count observations and events (target == 1) per bin;
        # observed=True keeps only the bins that actually occur
        d = d0.groupby("x", as_index=False, observed=True).agg({"y": ["count", "sum"]})
        d.columns = ['Cutoff', 'N', 'Events']

        # Counts are floored at 0.5 so an empty cell never produces log(0) or a division by zero
        d['% of Events'] = np.maximum(d['Events'], 0.5) / d['Events'].sum()
        d['Non-Events'] = d['N'] - d['Events']
        d['% of Non-Events'] = np.maximum(d['Non-Events'], 0.5) / d['Non-Events'].sum()
        d['WoE'] = np.log(d['% of Events'] / d['% of Non-Events'])
        d['IV'] = d['WoE'] * (d['% of Events'] - d['% of Non-Events'])
        d.insert(loc=0, column='Variable', value=ivars)
        print("Information value of " + ivars + " is " + str(round(d['IV'].sum(), 6)))
        temp = pd.DataFrame({"Variable": [ivars], "IV": [d['IV'].sum()]}, columns=["Variable", "IV"])
        newDF = pd.concat([newDF, temp], axis=0)
        woeDF = pd.concat([woeDF, d], axis=0)

        # Show WoE table
        if show_woe:
            print(d)
    return newDF, woeDF

In [3]:
iv, woe = iv_woe(data = mydata, target = 'admit', bins=10, show_woe = False)

Information value of gre is 0.312882
Information value of gpa is 0.27002
Information value of rank is 0.292044


In [4]:
iv

| | Variable | IV |
|---|---|---|
| 0 | gre | 0.312882 |
| 0 | gpa | 0.27002 |
| 0 | rank | 0.292044 |


In [5]:
woe

| | Variable | Cutoff | N | Events | % of Events | Non-Events | % of Non-Events | WoE | IV |
|---|---|---|---|---|---|---|---|---|---|
| 0 | gre | (219.999, 440.0] | 48 | 6 | 0.047244 | 42 | 0.153846 | -1.180625 | 0.125857 |
| 1 | gre | (440.0, 500.0] | 51 | 12 | 0.094488 | 39 | 0.142857 | -0.41337 | 0.019994 |
| 2 | gre | (500.0, 520.0] | 24 | 10 | 0.07874 | 14 | 0.051282 | 0.428812 | 0.011774 |
| 3 | gre | (520.0, 560.0] | 51 | 15 | 0.11811 | 36 | 0.131868 | -0.110184 | 0.001516 |
| 4 | gre | (560.0, 580.0] | 29 | 6 | 0.047244 | 23 | 0.084249 | -0.57845 | 0.021406 |
| 5 | gre | (580.0, 620.0] | 53 | 21 | 0.165354 | 32 | 0.117216 | 0.344071 | 0.016563 |
| 6 | gre | (620.0, 660.0] | 45 | 17 | 0.133858 | 28 | 0.102564 | 0.266294 | 0.008333 |
| 7 | gre | (660.0, 680.0] | 20 | 9 | 0.070866 | 11 | 0.040293 | 0.564614 | 0.017262 |
| 8 | gre | (680.0, 740.0] | 44 | 12 | 0.094488 | 32 | 0.117216 | -0.215545 | 0.004899 |
| 9 | gre | (740.0, 800.0] | 35 | 19 | 0.149606 | 16 | 0.058608 | 0.937135 | 0.085278 |
