[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/bads/blob/master/tutorials/11_nb_imbalance_n_costs.ipynb) 

# The notebook is under development and not ready for use by our students. Please ignore this notebook for the time being.

# Chapter 11 - Imbalanced and cost-sensitive learning
Welcome to chapter 11 of [Business Analytics and Data Science](). This week, we revisit the lecture on imbalanced and cost-sensitive learning. Imbalance and asymmetric error costs occur frequently in business applications, and our well-known credit scoring setting exemplifies both. Banks approve credit applications selectively and have achieved high sophistication in default prediction. Consequently, observed default rates are often low, which implies that a credit scoring data set will typically exhibit class imbalance. Good customers represent the majority class while defaulting clients represent the minority class. Using strategies from the realm of imbalanced learning, we can enhance the recognition of the minority class during classifier training.

Likewise, error costs are typically asymmetric. Approaching credit risk prediction as a binary classification problem, the two possible errors are accepting a client who will default (false positive error) and rejecting a client who would have repaid had we approved the credit (false negative error). To be clear, we define the good payers as the *positive* class here. Thus, accepting a bad risk is a false positive error: the classifier predicts a client to be a good payer (i.e., positive), and we therefore accept the client, whereas the client eventually defaults (i.e., turns out to be a negative client). Likewise, *false negative* means that the classifier predicts a client to belong to the negative class, that is, the bad risks, whereas the client is actually a good payer. In other words, the classifier predicts the negative class and this prediction turns out to be false; hence false negative. False positive errors are more costly than false negative errors because they imply an actual loss. A false negative error, on the other hand, implies an opportunity cost from not lending to, and not earning interest from, a good payer. The literature on cost-sensitive learning has developed approaches to address asymmetric error costs during the *training* and/or the *evaluation* of a classifier.
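To make the naming convention concrete, here is a minimal sketch that counts both error types and prices them. The labels, the predictions, and the cost values (5 for a false positive, 1 for a false negative) are made-up illustration values, not figures from the lecture.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Convention from the text: 1 = good payer (positive), 0 = defaulter (negative).
# Both vectors are invented for illustration.
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 1, 1])
y_pred = np.array([1, 1, 1, 0, 1, 1, 1, 0, 1, 1])

# With labels=[0, 1], ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# Hypothetical costs: accepting a defaulter hurts more than rejecting a good payer
COST_FP, COST_FN = 5, 1
total_cost = COST_FP * fp + COST_FN * fn

print("False positives (accepted defaulters):", fp)
print("False negatives (rejected good payers):", fn)
print("Total error cost:", total_cost)
```

Note that the HMEQ target `BAD` encodes defaulters as 1, so the labels would have to be flipped (or the roles of the two costs swapped) before applying this convention to that data.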

Time to explore some instruments for class imbalance and cost-sensitivity. Since the corresponding literature routinely considers classification, we will also stick to this form of predictive modeling. Thus, the tutorial will not touch on imbalance and cost-asymmetry in regression. 

The outline of the tutorial is as follows:
- Preliminaries
- Imbalanced learning
- Cost-sensitive learning
- Conclusions

# Preliminaries
We begin with the usual preparations.

In [1]:
# Import standard packages. 
import pandas as pd
import numpy as np
import time

import matplotlib.pyplot as plt
%matplotlib inline  
plt.rcParams["figure.figsize"] = (12,6)

# Load the data for this tutorial directly from GitHub
data_url = 'https://raw.githubusercontent.com/Humboldt-WI/bads/master/data/hmeq_modeling.csv'

df = pd.read_csv(data_url, index_col='index')

# Extract target variable and feature matrix 
X = df.drop(['BAD'], axis=1) 
y = df[['BAD']].values

# Data partitioning
from sklearn.model_selection import train_test_split

# Select random state to make results reproducible
rnd_state = 888 

# Create training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = rnd_state)  # 30% of the data as hold-out

# Make yourself familiar with these vectors
print('Shape of the data ', y_train.shape, X_train.shape, y_test.shape, X_test.shape)

Shape of the data  (4172, 1) (4172, 18) (1788, 1) (1788, 18)


In [2]:
type(X)

pandas.core.frame.DataFrame

In [3]:
X

index,LOAN,MORTDUE,VALUE,YOJ,CLAGE,NINQ,CLNO,DEBTINC,DEROGzero,REASON_HomeImp,REASON_IsMissing,JOB_Office,JOB_Other,JOB_ProfExe,JOB_Sales,JOB_Self,DELINQcat_1,DELINQcat_1+
0,-1.832283,-1.295882,-1.335526,0.266788,-1.075278,-0.065054,-1.297476,0.137456,True,1,0,0,1,0,0,0,0,0
1,-1.810666,-0.013474,-0.672699,-0.236615,-0.723092,-0.826792,-0.756608,0.137456,True,1,0,0,1,0,0,0,0,1
2,-1.789048,-1.654549,-1.839275,-0.668103,-0.368769,-0.065054,-1.189302,0.137456,True,1,0,0,1,0,0,0,0,0
3,-1.789048,-0.159552,-0.202559,-0.236615,-0.061033,-0.065054,-0.107566,0.137456,True,0,1,0,1,0,0,0,0,0
4,-1.767431,0.791699,0.311107,-0.811933,-1.088528,-0.826792,-0.756608,0.137456,True,1,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5955,2.545249,-0.384589,-0.181135,1.057849,0.558823,-0.826792,-0.540260,0.354834,True,0,0,0,1,0,0,0,0,0
5956,2.545249,-0.462591,-0.119037,1.057849,0.390638,-0.826792,-0.648434,0.312440,True,0,0,0,1,0,0,0,0,0
5957,2.545249,-0.478000,-0.119331,0.914020,0.436639,-0.826792,-0.648434,0.261479,True,0,0,0,1,0,0,0,0,0
5958,2.545249,-0.584642,-0.143317,0.770191,0.457322,-0.826792,-0.540260,0.057266,True,0,0,0,1,0,0,0,0,0


# Imbalanced learning

## Imbalance, accuracy, and how the two don't go together
We made the point in the lecture that one problem of class imbalance is that standard indicators of classification performance, such as accuracy, provide misleading signals when the data is imbalanced. Let's start by demonstrating the issue with a little experiment.

# TODO

1. Introduce imbalance ratio
2. Create a plot of classification accuracy (y-axis) versus imbalance ratio (x-axis) using logistic regression and our toy data from tutorial 5 (code already below). You need to repeat the standard training/test workflow multiple times using different data samples (i.e., so that the imbalance increases)  
3. Briefly discuss the plot. At some point, logit should become a naive classifier. This is what we want to show. If it doesn't work, you can use a tree instead of logit.
4. Have a 2nd plot like the first one but display for each imbalance ratio the accuracy of a naive classifier that always predicts the majority class.
5. State an exercise task to repeat the analysis using different classifiers and the HMEQ data set.
6. Showcase how AUC is robust by charting AUC versus imbalance ratio

In [3]:
def toy_data(n=1000, mu1=[1,1], mu2=[4, 4], sig1=1, sig2=1):
    """ Custom function to generate linearly separable toy data. The code has been discussed in more detail in Tutorial #3.
        
        The arguments represent, respectively, the size of the data, the mean vectors of the two Gaussians from which we
        sample class 1 and class 2 data points, and their standard deviations.
    """
    
    class1_x1 = np.random.normal(loc=mu1[0], scale=sig1, size=n)
    class1_x2 = np.random.normal(loc=mu1[1], scale=sig1, size=n)

    class2_x1 = np.random.normal(loc=mu2[0], scale=sig2, size=n)
    class2_x2 = np.random.normal(loc=mu2[1], scale=sig2, size=n)

    y1 = np.repeat(0, n)
    y2 = np.repeat(1, n)

    class1 = np.vstack((class1_x1, class1_x2)).T
    class2 = np.vstack((class2_x1, class2_x2)).T

    X = np.vstack((class1,class2))
    y = np.concatenate((y1,y2))
    
    return X, y

In [4]:
# Create and plot the data
X, y = toy_data()

# Always useful to remind oneself of the dimensions of a data set
print("Shape of X {}".format(X.shape))  
print("Shape of y {}".format(y.shape))

Shape of X (2000, 2)
Shape of y (2000,)
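The experiment outlined in the TODO list could be sketched as follows. This is a rough draft under several assumptions: the imbalance ratio is taken to be the number of majority cases divided by the number of minority cases, the helper `imbalanced_toy_data` is a hypothetical variant of `toy_data` with unequal class sizes and overlapping Gaussians (so that the task is not trivially separable), and the sample sizes and ratios are arbitrary choices.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(888)
N_MAJORITY = 5000  # assumed size of the majority class

def imbalanced_toy_data(n_majority, n_minority, rng):
    """Hypothetical helper: two overlapping 2-D Gaussians with unequal class sizes."""
    X0 = rng.normal(loc=1.0, scale=1.0, size=(n_majority, 2))  # majority, class 0
    X1 = rng.normal(loc=2.5, scale=1.0, size=(n_minority, 2))  # minority, class 1
    X = np.vstack((X0, X1))
    y = np.concatenate((np.zeros(n_majority, dtype=int), np.ones(n_minority, dtype=int)))
    return X, y

results = []  # rows: (imbalance ratio, logit accuracy, naive accuracy, logit AUC)
for ratio in [1, 2, 5, 10, 20, 50]:
    X_toy, y_toy = imbalanced_toy_data(N_MAJORITY, N_MAJORITY // ratio, rng)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=888)

    logit = LogisticRegression().fit(X_tr, y_tr)
    # Naive benchmark that always predicts the majority class
    naive = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

    results.append((
        ratio,
        accuracy_score(y_te, logit.predict(X_te)),
        accuracy_score(y_te, naive.predict(X_te)),
        roc_auc_score(y_te, logit.predict_proba(X_te)[:, 1]),
    ))

print("ratio  acc_logit  acc_naive      auc")
for row in results:
    print("{:>5d} {:>10.3f} {:>10.3f} {:>8.3f}".format(*row))
```

The columns of `results` can be passed to `plt.plot` to chart accuracy and AUC against the imbalance ratio. The accuracy of the naive benchmark rises mechanically with the ratio, which is the misleading signal the experiment is meant to expose.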


## Remedying class imbalance via resampling
We introduced *resampling* as one model-agnostic way to address class imbalance. The functioning of simple under/oversampling is trivial, so let us focus on slightly more sophisticated resampling techniques. The *SMOTE* algorithm is clearly one of the best-known strategies. It is widely used in industry and serves as a benchmark for new imbalanced learning methods in virtually any academic paper. So let's take a deep dive into SMOTE.

### SMOTE from scratch

# TODO 
1. Code to build SMOTE from scratch
  - ideally the implementation should also support categories
  - however, if this proves too difficult, we can restrict the implementation to numerical variables
2. Showcasing the from-scratch version of SMOTE with a little example
  - You can create an imbalanced toy data set and show how a classifier trained on that data performs worse than a classifier trained on the same data + SMOTE
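As a starting point for the from-scratch version, the core idea of SMOTE for numerical features can be sketched in a few lines: pick a minority point, pick one of its k nearest minority neighbours, and create a synthetic point somewhere on the line segment between the two. The function name `smote_numeric` and its parameters are hypothetical, categorical features are not handled, and the sketch assumes more than `k` minority cases without exact duplicates.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_numeric(X_min, n_synthetic, k=5, rng=None):
    """Hypothetical minimal SMOTE for numerical features only.

    X_min       : array of minority-class cases, shape (n, d), with n > k
    n_synthetic : number of synthetic minority cases to create
    k           : number of nearest minority neighbours to interpolate with
    """
    rng = np.random.default_rng() if rng is None else rng
    X_min = np.asarray(X_min, dtype=float)

    # Ask for k + 1 neighbours because every point is its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbour_idx = nn.kneighbors(X_min, return_distance=False)[:, 1:]

    # Randomly choose a base point and one of its k neighbours per synthetic case
    base = rng.integers(0, len(X_min), size=n_synthetic)
    chosen = neighbour_idx[base, rng.integers(0, k, size=n_synthetic)]

    # Interpolate: base + gap * (neighbour - base), with gap drawn from [0, 1)
    gap = rng.random((n_synthetic, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Small usage example on invented data
rng = np.random.default_rng(888)
X_minority = rng.normal(loc=4.0, scale=1.0, size=(30, 2))
X_new = smote_numeric(X_minority, n_synthetic=70, k=5, rng=rng)
print("Synthetic minority cases:", X_new.shape)
```

Because every synthetic case is a convex combination of two existing minority cases, the new points never leave the region spanned by the observed minority class.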

### Libraries for imbalanced learning

# TODO
The idea in this part is to showcase library implementations of SMOTE and other resampling algorithms. I guess **imbalanced-learn** is the go-to library, so let's stick to that one. Refer to its documentation for a number of useful examples.

For our demo, we can use the HMEQ data set and try out how different classifiers perform in conjunction with different resampling strategies supported by the library (i.e., SMOTE + others). It may be that HMEQ is a poor choice. You will have to first increase the class imbalance by removing bad cases. Afterwards, you can hopefully find a classifier that experiences problems with the data and performs better after some resampling techniques. If not, we might have to switch to another data set.

It is not straightforward to get the resampling right. For the demo described above, you can simply train/test split the data and make sure to resample only the training set while leaving the test set untouched. Afterwards, you can refer to https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/ where demos of resampling + cross-validation are available. By the way, the page is very nice, as usual for Jason, so you might want to draw on the demos for the tutorial. I'm sure you already saw the page ;)
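The workflow of resampling only the training set might look as follows. To keep the sketch dependent on scikit-learn alone, it uses plain random oversampling via `sklearn.utils.resample` on synthetic data from `make_classification`; both are stand-ins, and a library resampler such as SMOTE would take the place of the oversampling step in the actual demo.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in data with roughly 5% minority cases (class 1)
X_syn, y_syn = make_classification(n_samples=5000, n_features=10,
                                   weights=[0.95, 0.05], random_state=888)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=888)

# Resample the TRAINING data only; the test set keeps its natural imbalance
X_maj, X_min = X_tr[y_tr == 0], X_tr[y_tr == 1]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=888)
X_tr_bal = np.vstack((X_maj, X_min_up))
y_tr_bal = np.concatenate((np.zeros(len(X_maj), dtype=int),
                           np.ones(len(X_min_up), dtype=int)))

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=1000).fit(X_tr_bal, y_tr_bal)

# Minority-class recall on the untouched test set
recall_plain = recall_score(y_te, plain.predict(X_te))
recall_bal = recall_score(y_te, balanced.predict(X_te))
print("Minority recall without resampling: {:.3f}".format(recall_plain))
print("Minority recall with oversampling:  {:.3f}".format(recall_bal))
```

When cross-validating, the same rule applies within every fold: the resampling step must see only the training part of the fold, which is why the step is usually wrapped in a pipeline.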


# Cost-sensitive learning

- briefly discuss similarity between class imbalance and cost-sensitive learning

We have already established at the beginning of the notebook that credit scoring is a good environment to think about error costs. Therefore, the rest of the tutorial will focus on our good old HMEQ data set.

- state the cost matrix for the following exercises

## Cost-minimal threshold
We have discussed Bayesian decision theory in the lecture and arrived at an expression for the cost-optimal classification threshold. 

- reproduce equation from slide 37
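
Until the equation from the slide is added, the standard result from Bayesian decision theory can serve as a placeholder; the notation here is an assumption and may differ from the lecture. Let $p = p(+ \mid x)$ denote the predicted probability of the positive class (good payer) and $C_{FP}, C_{FN}, C_{TP}, C_{TN}$ the costs of the four possible outcomes. Predicting the positive class is cost-minimal whenever its expected cost does not exceed that of predicting the negative class:

$$
\begin{aligned}
(1-p)\,C_{FP} + p\,C_{TP} &\le p\,C_{FN} + (1-p)\,C_{TN} \\
\Longleftrightarrow \quad p &\ge \tau^{*} = \frac{C_{FP} - C_{TN}}{(C_{FP} - C_{TN}) + (C_{FN} - C_{TP})}
\end{aligned}
$$

If correct decisions incur no cost ($C_{TP} = C_{TN} = 0$), the threshold simplifies to $\tau^{*} = C_{FP} / (C_{FP} + C_{FN})$. For instance, a hypothetical cost ratio of 5:1 gives $\tau^{*} = 5/6 \approx 0.83$, so we would accept a client only if the predicted probability of repayment is at least 83 percent.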


# TODO
- calculate the optimal cut-off for the cost matrix stated above (use whatever cost ratio gives nice results. Yet, the cost of a false positive error must be higher than the cost of a false negative error, as explained in the intro)
- calculate error costs on the HMEQ test set of a logit model when using a default threshold and when using the optimal threshold. Supposedly the optimal threshold will give better results. Correct?
- Next, introduce a stronger benchmark than the default cut-off. Use a validation set drawn from the training data or cross-validate the training data to tune the cut-off empirically. That is, empirically determine the threshold that gives the lowest error costs on the validation set and check how this classifier performs (in costs) on the test set. Does it perform better than the Bayes-optimal threshold?
- The logit model is expected to give well-calibrated predictions. Exercise for the students is to repeat the analysis using a tree, which will supposedly not provide calibrated predictions.
- Maybe we can show a reliability plot at some point to show the calibration of the logit model.

I leave it to you how you implement the threshold tuning. I guess a function to do it is available in sklearn. If yes, please demonstrate its use. Or else, you could use some functionality of the Yellowbrick lib (e.g., https://www.scikit-yb.org/en/latest/api/classifier/threshold.html).

A manual implementation can come in addition to a demo of a sklearn or other official package, but could also be left as an exercise. Again, Jason has a potentially useful tutorial: https://machinelearningmastery.com/threshold-moving-for-imbalanced-classification/
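A manual implementation of the comparison could be sketched as below. Several assumptions apply: the data is a synthetic stand-in from `make_classification` with class 1 playing the role of the good payers, the costs (5 for a false positive, 1 for a false negative, 0 for correct decisions) are hypothetical, and the Bayes-optimal threshold is taken to be the standard expression C_FP / (C_FP + C_FN) for this cost structure. The helper `error_cost` is not part of any library.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

C_FP, C_FN = 5.0, 1.0  # hypothetical error costs; correct decisions cost nothing

def error_cost(y_true, p_good, threshold):
    """Hypothetical helper: total error cost when accepting clients with p_good >= threshold."""
    y_pred = (p_good >= threshold).astype(int)
    fp = np.sum((y_pred == 1) & (y_true == 0))  # accepted defaulters
    fn = np.sum((y_pred == 0) & (y_true == 1))  # rejected good payers
    return C_FP * fp + C_FN * fn

# Synthetic stand-in data: class 1 = good payer (about 80%), class 0 = defaulter
X_syn, y_syn = make_classification(n_samples=6000, n_features=10,
                                   weights=[0.2, 0.8], random_state=888)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=888)
# Carve a validation set out of the training data for threshold tuning
X_fit, X_val, y_fit, y_val = train_test_split(
    X_tr, y_tr, test_size=0.3, stratify=y_tr, random_state=888)

logit = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
p_val = logit.predict_proba(X_val)[:, 1]
p_te = logit.predict_proba(X_te)[:, 1]

# Candidate 1: default threshold; candidate 2: Bayes-optimal threshold
tau_default = 0.5
tau_bayes = C_FP / (C_FP + C_FN)

# Candidate 3: threshold with the lowest error cost on the validation set
grid = np.linspace(0.01, 0.99, 99)
val_costs = [error_cost(y_val, p_val, t) for t in grid]
tau_tuned = grid[int(np.argmin(val_costs))]

for name, tau in [("default", tau_default), ("Bayes", tau_bayes), ("tuned", tau_tuned)]:
    print("{:<8s} threshold {:.2f}  test cost {:.0f}".format(
        name, tau, error_cost(y_te, p_te, tau)))
```

If the logit predictions are well calibrated, the tuned threshold should land close to the Bayes-optimal one; a large gap between the two would hint at calibration problems.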


## Cost-sensitive classifier learning

# TODO
In this part, I'd simply like some demos of training sklearn classifiers with weights. Linear models and trees support this, and maybe others as well. I envision a list of models that are supported and a demo using HMEQ. For example, train a classifier with and without class weights (i.e., costs) and compare which one works better. By 'works', I mean costs as the performance measure. You could also showcase other performance measures like AUC and discuss how they differ from or agree with costs.
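A first version of such a demo might look like this. It is a sketch on synthetic stand-in data rather than HMEQ, with class 1 playing the role of the good payers and hypothetical costs of 5 (false positive) and 1 (false negative). Since a false positive is a misclassified defaulter, the higher cost is attached to class 0 through the `class_weight` argument of `LogisticRegression`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

C_FP, C_FN = 5.0, 1.0  # hypothetical error costs

# Synthetic stand-in data: class 1 = good payer (about 80%), class 0 = defaulter
X_syn, y_syn = make_classification(n_samples=6000, n_features=10,
                                   weights=[0.2, 0.8], random_state=888)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=888)

models = {
    "unweighted": LogisticRegression(max_iter=1000),
    # Errors on true defaulters (class 0) are false positives, hence weight C_FP
    "cost-weighted": LogisticRegression(max_iter=1000,
                                        class_weight={0: C_FP, 1: C_FN}),
}

summary = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    tn, fp, fn, tp = confusion_matrix(y_te, model.predict(X_te), labels=[0, 1]).ravel()
    summary[name] = {
        "fp": int(fp),
        "fn": int(fn),
        "cost": float(C_FP * fp + C_FN * fn),
        "auc": float(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])),
    }
    print("{:<14s} FP={fp:4d}  FN={fn:4d}  cost={cost:7.0f}  AUC={auc:.3f}".format(
        name, **summary[name]))
```

`DecisionTreeClassifier`, `RandomForestClassifier`, and `SVC` accept the same `class_weight` argument, so the dictionary of models can be extended accordingly. Because AUC depends only on the ranking of the predictions, it is expected to change little between the two variants even when the error costs differ clearly.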


# Conclusions