# Modeling and Evaluation with scikit-learn

# Part 4: Data wrangling and model performance

Some data wrangling operations, such as converting data to formats acceptable to scikit-learn and dealing with erroneous or missing data, are necessary before we can conduct supervised learning.

Other data wrangling operations are optional, although they may affect the performance of a trained model. In this part of the lecture, we consider the following two common data wrangling operations and see how they may or may not affect model performance.

+ ***balancing data w.r.t. the target variable***
+ ***data normalization/standardization***
    
We use the LendingClub dataset.

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 50)

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier



In [2]:
# Prepare the LendingClub dataset

df = pd.read_csv('LendingClub.csv')

# Convert categorical variable "purpose" to dummies, and drop the most frequent dummy
df = pd.get_dummies(df, columns=['purpose']).drop(columns=['purpose_debt_consolidation'])

X = df.drop(columns=['not_fully_paid'])
y = df['not_fully_paid']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=365)
X_train = X_train.copy()
X_test = X_test.copy()

In [3]:
y.value_counts(normalize=True)

0    0.839946
1    0.160054
Name: not_fully_paid, dtype: float64

## Dealing with severely unbalanced data

### Unbalanced data and the resulting bias in predictions

In [15]:
# Logistic regression (LogisticRegression was already imported in the first cell)
# Recent scikit-learn versions expect penalty=None; older versions expect penalty='none'
clf = LogisticRegression(penalty=None, max_iter=1000)
clf.fit(X_train,y_train)

y_predict = clf.predict(X_test)

accuracy = round(accuracy_score(y_test, y_predict), 4)
print(f"The accuracy is: {accuracy:.2%}")
print("The confusion matrix is:")
cm = confusion_matrix(y_test, y_predict)
print(cm)

# save the results for later comparison
clf_lr = clf
accuracy_lr = accuracy
cm_lr = cm

The accuracy is: 84.39%
The confusion matrix is:
[[1616    0]
 [ 299    1]]


The accuracy above, while seemingly high, is misleading because the trained model is extremely biased: it almost always predicts that borrowers will not default, as is evident from the confusion matrix (only 1 of the 300 actual defaults in the test set is predicted correctly).
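To make this concrete, the figures below are read directly off the confusion matrix printed above. This is a minimal sketch that hard-codes that matrix and uses only NumPy; the variable names are illustrative and not part of the notebook.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows = actual class, columns = predicted class).
cm = np.array([[1616, 0],
               [299, 1]])

total = cm.sum()
accuracy = np.trace(cm) / total

# Accuracy of a "model" that always predicts the majority class 0
baseline_accuracy = cm[0].sum() / total

# Recall per class: correct predictions divided by the actual class size
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(f"Accuracy:          {accuracy:.2%}")
print(f"Always-0 baseline: {baseline_accuracy:.2%}")
print(f"Recall, class 0:   {recall_per_class[0]:.2%}")
print(f"Recall, class 1:   {recall_per_class[1]:.2%}")
```

The trained model beats the always-0 baseline by exactly one correct prediction, which is why accuracy alone says little here.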

This extreme bias is caused by the severely unbalanced dataset:

In [6]:
y_test.value_counts(normalize=True)

0    0.843424
1    0.156576
Name: not_fully_paid, dtype: float64

In [4]:
y.value_counts()

0    8045
1    1533
Name: not_fully_paid, dtype: int64

### Options for dealing with severely unbalanced data

+ **Option 1. Re-sampling the data to make it balanced.** This can be done in two ways:
  + **undersampling** the majority class
    + this is the usual choice when we have large enough data
  + **oversampling** the minority class
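    + if applied before the train/test split, it can cause the [data leakage](https://towardsdatascience.com/data-leakage-in-machine-learning-10bdd3eec742) problem (copies of the same record can end up in both sets), so it should be avoided unless the dataset is too small to undersample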
+ **Option 2. Do not use "accuracy" as the performance metric.** Instead, 
  + use alternative metrics that can weight the classes of the target differently, e.g., metrics that count '1' more heavily than '0' in the target of the LendingClub dataset (to be discussed in the next lecture)
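As a preview of Option 2, the sketch below applies three metrics from `sklearn.metrics` to label vectors rebuilt to match the confusion matrix `[[1616, 0], [299, 1]]` reported earlier. The rebuilt vectors are an assumption made for illustration (only the counts come from the notebook), and the choice of metrics is one possibility among several.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score, recall_score

# Label vectors that reproduce the confusion matrix [[1616, 0], [299, 1]]
y_true = np.array([0] * 1616 + [1] * 300)
y_pred = np.array([0] * 1616 + [0] * 299 + [1] * 1)

# Recall of the positive class (label 1): share of actual defaults that were caught
recall_1 = recall_score(y_true, y_pred)

# Balanced accuracy: the mean of the per-class recalls
bal_acc = balanced_accuracy_score(y_true, y_pred)

# F1: harmonic mean of precision and recall for label 1
f1 = f1_score(y_true, y_pred)

print(f"Recall (class 1):  {recall_1:.4f}")
print(f"Balanced accuracy: {bal_acc:.4f}")
print(f"F1 (class 1):      {f1:.4f}")
```

Unlike plain accuracy, all three scores stay low for a model that almost never predicts the minority class.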

### Undersampling the majority class

The function for this is `sklearn.utils.resample()`.

In [5]:
# First, separate the classes, where we already know 'not_fully_paid==0' is the majority class
df_0 = df[df.not_fully_paid==0]
df_1 = df[df.not_fully_paid==1]

# Remember the sizes of the two classes
n_majority_class = df_0.shape[0]
n_minority_class = df_1.shape[0]
print(f"The majority class contains {n_majority_class} records. \nThe minority class contains {n_minority_class} records. ")

The majority class contains 8045 records. 
The minority class contains 1533 records. 


In [6]:
n_minority_class

1533

In [8]:
from sklearn.utils import resample

# undersample the majority class
df_0_undersampled = resample(df_0, replace=False, 
                             n_samples=n_minority_class, 
                             random_state=1234)
df_0_undersampled.shape

(1533, 19)

In [9]:
# oversample the minority class
# bootstrapping

df_1_oversampled = resample(df_1, replace=True, 
                             n_samples=n_majority_class, 
                             random_state=1234)
df_1_oversampled.shape

(8045, 19)

### Combining the two classes into a single (resampled) dataset

In [10]:
df_balanced = pd.concat([df_0_undersampled, df_1])
df_balanced.not_fully_paid.value_counts()

0    1533
1    1533
Name: not_fully_paid, dtype: int64

In [21]:
# Save the balanced data for future use
df_balanced.to_csv('LendingClub_balanced.csv', index=False)

### Comments on oversampling

The reason to avoid it when possible is the [data leakage](https://towardsdatascience.com/data-leakage-in-machine-learning-10bdd3eec742) problem.

However, if you have to use it because the size of the minority class is too small, here are a few hints:
+ make sure you do `train_test_split()` *before* oversampling (why?)
+ ways to oversample:
  + Use [bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)), a.k.a. `resample()` with the option `replace=True`.
  + Use [`imblearn.over_sampling.SMOTE`](https://imbalanced-learn.org/stable/over_sampling.html) -- a k-NN inspired method to create synthetic records
    + [A nice tutorial on SMOTE](https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/)
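The first hint can be sketched as follows. The data here is synthetic and the column names `x` and `target` are hypothetical; `stratify=` is an extra option not used elsewhere in this notebook, added only so that the tiny example keeps both classes in each split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for an unbalanced dataset (hypothetical columns)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x": rng.normal(size=200),
    "target": np.array([0] * 170 + [1] * 30),
})

# 1. Split first, so the test set contains only original, unduplicated rows
train, test = train_test_split(toy, test_size=0.2,
                               stratify=toy["target"], random_state=0)

# 2. Oversample the minority class inside the training set only
train_0 = train[train["target"] == 0]
train_1 = train[train["target"] == 1]
train_1_over = resample(train_1, replace=True,
                        n_samples=len(train_0), random_state=0)
train_balanced = pd.concat([train_0, train_1_over])

print(train_balanced["target"].value_counts())
print(test["target"].value_counts())
```

Because the duplicates are drawn after the split, no test row can also appear among the training rows.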

### Splitting this balanced data into train and test

In [23]:
X = df_balanced.drop(columns=['not_fully_paid'])
y = df_balanced['not_fully_paid']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=365)
X_train = X_train.copy()
X_test = X_test.copy()

### Training the logistic regression model over this balanced data

In [24]:
clf.fit(X_train,y_train)

y_predict = clf.predict(X_test)

accuracy = round(accuracy_score(y_test, y_predict), 4)
print(f"The accuracy is: {accuracy:.2%}")
print("The confusion matrix is:")
cm = confusion_matrix(y_test, y_predict)
print(cm)

# save the results for later comparison
clf_lr = clf
accuracy_lr = accuracy
cm_lr = cm

The accuracy is: 53.58%
The confusion matrix is:
[[172 126]
 [159 157]]


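As shown above, the predictions are no longer extremely biased: the model now predicts both classes in similar proportions. Note, however, that the accuracy of 53.58% on this roughly balanced test set is only slightly above the 50% level of random guessing.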

## Normalize/standardize the data

Recall that "normalize" means rescaling values to the range [0,1], and "standardize" means rescaling them to mean 0 and standard deviation 1.
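As a small worked example of the two definitions, the sketch below applies both transformations to a made-up array using plain NumPy; the numbers are illustrative only.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Normalization (min-max scaling): map the values onto [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): subtract the mean, divide by the standard deviation
x_std = (x - x.mean()) / x.std()

print("normalized:  ", x_norm)
print("standardized:", x_std)
print("mean and std after standardization:", x_std.mean(), x_std.std())
```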

The LendingClub dataset consists of columns of varying scales. In addition, some columns are significantly skewed.

In [27]:
X_train.agg(['mean','std','skew'])

| | credit_policy | int_rate | installment | log_annual_inc | dti | fico | days_with_cr_line | revol_bal | revol_util | inq_last_6mths | delinq_2yrs | pub_rec | purpose_all_other | purpose_credit_card | purpose_educational | purpose_home_improvement | purpose_major_purchase | purpose_small_business |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mean | 0.743883 | 0.126485 | 330.044372 | 10.928757 | 12.788018 | 705.790783 | 4543.281522 | 19162.296493 | 48.881656 | 1.923736 | 0.167618 | 0.077488 | 0.249592 | 0.122349 | 0.035481 | 0.073409 | 0.040375 | 0.084421 |
| std | 0.436576 | 0.026826 | 215.253362 | 0.641827 | 6.987826 | 37.485151 | 2453.147008 | 44268.83769 | 29.295286 | 2.633675 | 0.553948 | 0.271957 | 0.432865 | 0.327755 | 0.18503 | 0.260861 | 0.196878 | 0.278075 |
| skew | -1.118162 | 0.147838 | 0.846862 | 0.025453 | -0.016562 | 0.595084 | 1.051591 | 11.791871 | -0.021394 | 3.537308 | 6.176599 | 3.343906 | 1.157923 | 2.306349 | 5.025096 | 3.273309 | 4.67295 | 2.991415 |


Variables of varying scales, and skewed variables, are commonly seen in business datasets. 
+ E.g., salary is in the tens of thousands, while age is usually in two digits
+ E.g., monetary variables (salary, spending, ...) are often right skewed 
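The effect of a log transform on a right-skewed variable can be seen on synthetic data. The sketch below draws a made-up log-normal "balance" variable (the distribution parameters are assumptions chosen for illustration, not estimates from the LendingClub data) and compares the skewness before and after `np.log1p`, which computes log(1 + x).

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "monetary" variable (hypothetical)
rng = np.random.default_rng(42)
balance = pd.Series(rng.lognormal(mean=9.0, sigma=1.2, size=5000))

skew_raw = balance.skew()
skew_log = np.log1p(balance).skew()

print(f"Skewness before log transform: {skew_raw:.2f}")
print(f"Skewness after log transform:  {skew_log:.2f}")
```

On real data the result depends on the column; for instance, many zero values can leave the transformed column skewed in the opposite direction.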

### *Do we need to normalize/standardize the data?*

Nowadays, almost always **yes**, because:
+ Many learning algorithms are sensitive to differences in scale across columns (e.g., kNN, SVM) or to the shape of the data distribution (e.g., regression)
+ **Regularization** is heavily used in modern machine learning, and it does NOT work as intended without data normalization/standardization: the penalty acts on the coefficients, whose sizes depend on the scale of each column
    + See these two brief posts on the concept of regularization: [Over-fitting and Regularization](https://towardsdatascience.com/over-fitting-and-regularization-64d16100f45c), [L1 and L2 Regularization Methods](https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c)

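Tree-based classifiers are an exception because each split compares the values of a single column against a threshold; values are never compared across columns, so the scale of a column does not change the resulting splits.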
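The exception for trees can be checked empirically. The sketch below uses synthetic integer-valued features and rescales each column by a different factor before refitting the same `DecisionTreeClassifier`; the data, the factors, and `max_depth=4` are assumptions made for the demonstration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: three integer-valued features, target depends on the first two
rng = np.random.default_rng(0)
X = rng.integers(0, 1000, size=(300, 3)).astype(float)
y = (X[:, 0] + X[:, 1] > 1000).astype(int)

# Rescale each column by a different factor
# (powers of two keep the floating-point arithmetic exact)
X_rescaled = X * np.array([1024.0, 1.0, 1 / 1024])

tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
tree_rescaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_rescaled, y)

# The two trees should make the same prediction for every record
same = np.array_equal(tree_raw.predict(X), tree_rescaled.predict(X_rescaled))
print("Identical predictions after rescaling:", same)
```

Repeating the experiment with a distance-based model such as kNN would generally give different predictions, since distances mix all columns.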


### Manually performing data normalization/standardization

In [37]:
# Make a copy, as later we'll also try another standardization method
X_train_std_manual = X_train.copy()
X_test_std_manual = X_test.copy()

In [38]:
# Below we normalize/standardize some input columns
# Remember we need to work on both train and test datasets
# In practice, remember to update your data description file afterwards!

for x in [X_train_std_manual, X_test_std_manual]:
    x['installment1000'] = x.installment / 1000
    x.drop('installment', axis=1, inplace=True)

    x['fico_ratio'] = x.fico / 850
    x.drop('fico', axis=1, inplace=True)

    x['decades_with_cr_line'] = x.days_with_cr_line / 3650
    x.drop('days_with_cr_line', axis=1, inplace=True)

    x['log_revol_bal'] = np.log(x.revol_bal + 1)
    x.drop('revol_bal', axis=1, inplace=True)

    x.revol_util = x.revol_util / 100

In [39]:
# Check the summary statistics of the transformed data
X_train_std_manual.agg(['mean','std','skew'])

| | credit_policy | int_rate | log_annual_inc | dti | revol_util | inq_last_6mths | delinq_2yrs | pub_rec | purpose_all_other | purpose_credit_card | purpose_educational | purpose_home_improvement | purpose_major_purchase | purpose_small_business | installment1000 | fico_ratio | decades_with_cr_line | log_revol_bal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mean | 0.743883 | 0.126485 | 10.928757 | 12.788018 | 0.488817 | 1.923736 | 0.167618 | 0.077488 | 0.249592 | 0.122349 | 0.035481 | 0.073409 | 0.040375 | 0.084421 | 0.330044 | 0.830342 | 1.244735 | 8.650228 |
| std | 0.436576 | 0.026826 | 0.641827 | 6.987826 | 0.292953 | 2.633675 | 0.553948 | 0.271957 | 0.432865 | 0.327755 | 0.18503 | 0.260861 | 0.196878 | 0.278075 | 0.215253 | 0.0441 | 0.672095 | 2.202951 |
| skew | -1.118162 | 0.147838 | 0.025453 | -0.016562 | -0.021394 | 3.537308 | 6.176599 | 3.343906 | 1.157923 | 2.306349 | 5.025096 | 3.273309 | 4.67295 | 2.991415 | 0.846862 | 0.595084 | 1.051591 | -2.108405 |


In [40]:
# Now let's run the logistic regression again with this transformed data
clf.fit(X_train_std_manual,y_train)

y_predict = clf.predict(X_test_std_manual)

accuracy = round(accuracy_score(y_test, y_predict), 4)
print(f"The accuracy is: {accuracy:.2%}")
print("The confusion matrix is:")
cm = confusion_matrix(y_test, y_predict)
print(cm)

The accuracy is: 59.28%
The confusion matrix is:
[[178 120]
 [130 186]]


### Automatically performing data normalization/standardization

We can automatically standardize data using `sklearn.preprocessing.StandardScaler`.

In [11]:
# Make a copy so that the original X_train and X_test stay unchanged
X_train_std_auto = X_train.copy()
X_test_std_auto = X_test.copy()

In [12]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [13]:
# Warning: we don't want to standardize any categorical columns!
# Therefore, let's pick out only the numerical ones.
num_columns = ['int_rate', 'installment', 'log_annual_inc', 'dti', 
               'fico', 'days_with_cr_line', 'revol_bal', 'revol_util',
               'inq_last_6mths', 'delinq_2yrs', 'pub_rec']

In [14]:
scaler.fit(X_train_std_auto[num_columns])

StandardScaler()

In [15]:
print(scaler.mean_)

[1.22602793e-01 3.18479901e+02 1.09329970e+01 1.25938684e+01
 7.10858001e+02 4.57314781e+03 1.68191928e+04 4.69906630e+01
 1.57478465e+00 1.66405638e-01 6.08196293e-02]


In [48]:
X_train_std_auto[num_columns] = scaler.transform(X_train_std_auto[num_columns])
X_test_std_auto[num_columns] = scaler.transform(X_test_std_auto[num_columns])

In [49]:
# Verify that standardization is done
X_train_std_auto.agg(['mean','std','skew'])

| | credit_policy | int_rate | installment | log_annual_inc | dti | fico | days_with_cr_line | revol_bal | revol_util | inq_last_6mths | delinq_2yrs | pub_rec | purpose_all_other | purpose_credit_card | purpose_educational | purpose_home_improvement | purpose_major_purchase | purpose_small_business |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mean | 0.743883 | 3.821712e-16 | -1.505049e-16 | 1.596512e-16 | -1.014686e-16 | 1.515463e-16 | 2.491097e-16 | -1.684351e-17 | -8.575703e-17 | -1.886292e-16 | -3.889403e-17 | -3.164951e-17 | 0.249592 | 0.122349 | 0.035481 | 0.073409 | 0.040375 | 0.084421 |
| std | 0.436576 | 1.000204 | 1.000204 | 1.000204 | 1.000204 | 1.000204 | 1.000204 | 1.000204 | 1.000204 | 1.000204 | 1.000204 | 1.000204 | 0.432865 | 0.327755 | 0.18503 | 0.260861 | 0.196878 | 0.278075 |
| skew | -1.118162 | 0.1478385 | 0.8468624 | 0.0254527 | -0.01656245 | 0.5950842 | 1.051591 | 11.79187 | -0.02139405 | 3.537308 | 6.176599 | 3.343906 | 1.157923 | 2.306349 | 5.025096 | 3.273309 | 4.67295 | 2.991415 |
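The `std` row above shows 1.000204 rather than exactly 1 for the standardized columns. A likely explanation is that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `std` uses the sample standard deviation (`ddof=1`) by default, so the two differ by a factor of sqrt(n / (n - 1)). The sketch below checks this on synthetic data, assuming a training set of n = 2452 rows (the 3,066 balanced records minus the 614 test records counted in the confusion matrices).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

n = 2452  # assumed training-set size, see the note above
rng = np.random.default_rng(1)
col = pd.DataFrame({"v": rng.normal(loc=50, scale=10, size=n)})

scaled = pd.DataFrame(StandardScaler().fit_transform(col), columns=["v"])

std_population = scaled["v"].std(ddof=0)  # the convention StandardScaler uses
std_sample = scaled["v"].std()            # pandas default, ddof=1
expected_ratio = np.sqrt(n / (n - 1))

print(f"std with ddof=0:   {std_population:.6f}")
print(f"std with ddof=1:   {std_sample:.6f}")
print(f"sqrt(n / (n - 1)): {expected_ratio:.6f}")
```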


In [51]:
X_train_std_auto.head()

| | credit_policy | int_rate | installment | log_annual_inc | dti | fico | days_with_cr_line | revol_bal | revol_util | inq_last_6mths | delinq_2yrs | pub_rec | purpose_all_other | purpose_credit_card | purpose_educational | purpose_home_improvement | purpose_major_purchase | purpose_small_business |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2790 | 1 | 1.249599 | 0.916854 | -0.16983 | -0.696785 | -0.101148 | -0.164444 | -0.29924 | 1.697491 | -0.350812 | -0.30265 | -0.284985 | 0 | 0 | 0 | 0 | 0 | 1 |
| 9203 | 0 | -0.174668 | -0.759931 | 1.542214 | 1.058053 | -0.101148 | 0.960888 | 3.261643 | 1.246815 | -0.350812 | -0.30265 | -0.284985 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5127 | 1 | 1.249599 | 0.426773 | 0.398419 | 0.671588 | 1.099572 | -0.078415 | -0.404731 | -1.460658 | 1.548063 | -0.30265 | -0.284985 | 0 | 0 | 0 | 0 | 0 | 1 |
| 8921 | 0 | 0.306301 | -1.060708 | 0.114295 | 0.700215 | -0.367975 | 0.12915 | -0.428477 | -1.563084 | 0.788513 | 3.308536 | -0.284985 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2129 | 1 | 1.603802 | -0.738975 | -2.046065 | -0.49067 | -1.168455 | -1.335798 | -0.379448 | 1.219501 | -0.350812 | -0.30265 | -0.284985 | 0 | 0 | 1 | 0 | 0 | 0 |


In [None]:
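## For the training data, the above scaler.fit() and scaler.transform() steps can be combined into one.
## The test data must still go through scaler.transform() only, so that it reuses the training statistics.
# X_train_std_auto[num_columns] = scaler.fit_transform(X_train_std_auto[num_columns])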
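One way to guarantee that the scaler is fitted on the training data only, and applied to the numerical columns only, is to wrap it in a `ColumnTransformer` inside a `Pipeline`. This is not what the notebook does; it is a minimal sketch on synthetic data, and the column names (`income`, `rate`, `is_small_business`) are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical columns)
rng = np.random.default_rng(7)
n = 400
toy = pd.DataFrame({
    "income": rng.normal(60000, 15000, size=n),
    "rate": rng.normal(0.12, 0.03, size=n),
    "is_small_business": rng.integers(0, 2, size=n),
})
target = (toy["rate"] + rng.normal(0, 0.03, size=n) > 0.12).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(toy, target, test_size=0.2, random_state=0)

numeric_cols = ["income", "rate"]
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), numeric_cols)],
    remainder="passthrough",  # leave the dummy column untouched
)
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit() learns the scaling statistics from X_tr only;
# score() reuses them when transforming X_te
model.fit(X_tr, y_tr)
print(f"Test accuracy: {model.score(X_te, y_te):.2%}")
```

With this design, the train/test distinction is handled by the pipeline itself, so there is no separate `transform` call to forget.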

In [50]:
# Now let's run the logistic regression again with this standardized data
clf.fit(X_train_std_auto,y_train)

y_predict = clf.predict(X_test_std_auto)

accuracy = round(accuracy_score(y_test, y_predict), 4)
print(f"The accuracy is: {accuracy:.2%}")
print("The confusion matrix is:")
cm = confusion_matrix(y_test, y_predict)
print(cm)

The accuracy is: 60.59%
The confusion matrix is:
[[180 118]
 [124 192]]


# Summary of lecture "Modeling and Evaluation with scikit-learn"

To recap, in this lecture we studied:
+ (Part 1) How to conduct supervised learning using the `scikit-learn` package. 
+ (Part 2) A walk-through of several popular supervised learning algorithms.
+ (Part 3) Various performance metrics: why many, and when to use each.
+ (Part 4) Several focused discussions on data wrangling and model performance:
    + Imputation (embedded in earlier parts; not a standalone discussion)
    + Unbalanced data
    + Data standardization/normalization    

In addition to learning Python-based machine learning, we learned some general guidelines that help us plan our analytical work more efficiently (rather than wasting time on the wrong paths). For example, we have some ideas about which classifiers to try (and not to try) depending on whether performance or interpretability is the priority, and about which data wrangling choices have a better chance of helping.

That said, by now you should realize that we will also face many questions that *only data can answer*. For example:
+ For a random forest, should we set max_depth at 3 or 4 or 5? Should we use 100 trees or 50 trees?
+ Should we standardize data before training, or not?

Answering such empirical questions requires us to keep trying and comparing different combinations of model hyperparameters and data wrangling choices. Doing so manually is very time-consuming; in fact, this is what analysts spend a lot of their time on! The good news is that we can use **hyperparameter tuning** to shift much of this work to computers, significantly cutting down the human effort. We will study hyperparameter tuning in the next lecture.
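As a brief preview, one common tool for this is `sklearn.model_selection.GridSearchCV`. The sketch below searches over the random forest settings mentioned above on synthetic data from `make_classification`; the data, the 3-fold cross-validation, and the accuracy scoring are assumptions made for illustration rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for a real training set
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Candidate values taken from the questions above
param_grid = {"max_depth": [3, 4, 5], "n_estimators": [50, 100]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation for each combination
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("Best combination:", search.best_params_)
print(f"Cross-validated accuracy: {search.best_score_:.3f}")
```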

