## Introduction

When creating predictive models, analysts often find themselves working with imbalanced datasets. These datasets, in which observations are distributed disproportionately across the categories of a given field, are quite common, appearing in applications such as fraud detection, disease screening, and more.

What is particularly frustrating about imbalanced datasets is the fact that they render conventional accuracy measurements, at best, unhelpful, and at worst, virtually useless. This is because classification models trained on such datasets tend to (rather intelligently) predict that nearly all observations belong to the category overrepresented in the data. Put differently, a model trained on a dataset with 100 observations, 95 of which belong to category “A”, will likely achieve an accuracy of around 95%, predicting nearly all observations belong to category “A”. Although a model with such a high accuracy score might seem useful, it would prove untenable if the 5 observations in category “B” were the fraudsters or diseases analysts had set out to identify. 
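The 95/5 example above can be reproduced in a few lines. This is only a toy sketch using Python's standard library: the labels are made up to match the example, and the "model" is simply a rule that always predicts the majority category.

```python
# Toy labels matching the 95/5 example: "A" is the majority class, "B" the minority.
y_true = ["A"] * 95 + ["B"] * 5

# A "model" that always predicts the majority class.
y_pred = ["A"] * len(y_true)

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for "B": share of true "B" observations the model actually finds.
true_b = sum(t == "B" for t in y_true)
found_b = sum(t == "B" and p == "B" for t, p in zip(y_true, y_pred))
recall_b = found_b / true_b

print(f"accuracy = {accuracy:.2f}, recall for B = {recall_b:.2f}")
# accuracy = 0.95, recall for B = 0.00
```

The accuracy looks excellent, yet the rule never identifies a single member of category "B".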

In this post, I will tackle the problem of class imbalance and discuss how to use common resampling techniques to create models which better classify underrepresented categories. The analysis will utilize one quarter of mortgage acquisition and performance data from Fannie Mae.

### When and Why “Accuracy” Doesn’t Cut It

To demonstrate the ineffectiveness of accuracy measurements when dealing with imbalanced datasets, let’s attempt to create a model to predict whether or not a mortgage will receive a modification at any point during its lifetime. Our first step in this process is to import both our dependencies and the datasets we will be using. To both speed up and simplify our analysis, let's use just one quarter of Fannie Mae data. 

In [95]:
######################################
# Step 1) Import Dependencies & Data #
######################################

import numpy as np
import pandas as pd

#set root directory
root_directory = "/Users/scottinderbitzen/Desktop/Fannie Data"

#acquisition data
acquisition_data_path = root_directory + "/Acquisition_2006Q1.txt"
df_acquisition = pd.read_table(acquisition_data_path, delimiter = "|",
    names = ['LOAN_IDENTIFIER', 'CHANNEL', 'SELLER_NAME', 'ORIGINAL_INTEREST_RATE', 
             'ORIGINAL_UNPAID_PRINCIPAL_BALANCE_(UPB)', 'ORIGINAL_LOAN_TERM', 
             'ORIGINATION_DATE ', 'FIRST_PAYMENT_DATE', 'ORIGINAL_LOAN-TO-VALUE_(LTV)', 
             'ORIGINAL_COMBINED_LOAN-TO-VALUE_(CLTV)', 'NUMBER_OF_BORROWERS', 
             'DEBT-TO-INCOME_RATIO_(DTI)', 'BORROWER_CREDIT_SCORE', 
             'FIRST-TIME_HOME_BUYER_INDICATOR','LOAN_PURPOSE', 'PROPERTY_TYPE',
             'NUMBER_OF_UNITS', 'OCCUPANCY_STATUS', 'PROPERTY_STATE', 'ZIP_(3-DIGIT)', 
             'MORTGAGE_INSURANCE_PERCENTAGE', 'PRODUCT_TYPE', 'CO-BORROWER_CREDIT_SCORE', 
             'MORTGAGE_INSURANCE_TYPE', 'RELOCATION_MORTGAGE_INDICATOR'])


#performance data
performance_data_path = root_directory + "/Performance_2006Q1.txt"
df_performance = pd.read_table(performance_data_path, delimiter = "|",
    names = ['Loan_Identifier', 'Monthly_Reporting_Period', 'Servicer_Name', 
    'Current_Interest_Rate', 'Current_Actual_UPB', 'Loan_Age', 'Remaining_Months_Legal_Matrty', 
    'Adjusted_Months_Remaining_Matrty', 'Matrty_Date',
    'Metropolitan_Statistical_Area', 'Current_Loan_Delq_Status', 'Modification_Flag', 
    'Zero_Balance_Code', 'Zero_Balance_Effective_Flag', 'Last_Paid_Installment_date', 
    'Foreclosure_Date', 'Disposition_Date', 'Foreclosure_Costs', 'Property_Pres_Ren_Costs', 
    'Asset_Recovery_Costs', 'Misc_Holding_Expenses_Credits', 'Associated_Taxes_Holding', 
    'Net_Sale_Proceeds', 'Credit_Enhancements_Proceeds', 'Repurchase_Make_Whole_Proceeds', 
    'Other_Foreclosure_Proceeds', 'Non_Interest_Bearing_UPB', 'Principal_Forgiveness_UPB', 
    'Repurchase_Make_Whole_Flag', 'Foreclosure_Principal_Write_Off_Amount', 
    'Servicer_Activity_Indicator'])

Now that we have imported our data and stored them in pandas dataframes, our next step is to clean the dataframes and prepare them for analysis. To do this, we will first determine which loans in the "df_performance" dataframe have received modifications by isolating unique loan identifiers where the "Modification_Flag" is equal to "Y". Then, we will add a field to our "df_acquisition" dataframe called "MODIFIED". We will assign a value of "Yes" to our newly created "MODIFIED" field for each observation with a unique loan identifier isolated in the previous step. We will assign a value of "No" to this field for all other observations.

In [96]:
#############################################
# Step 2) Clean Data & Prepare for Analysis #
#############################################

#Find all unique loan ids that have a modification flag with a value of "Y"
modified_loans = df_performance.loc[df_performance["Modification_Flag"]=="Y", "Loan_Identifier"].unique()

#Add modification flag to acquisition dataset
df_acquisition["MODIFIED"] = np.where(df_acquisition["LOAN_IDENTIFIER"].isin(modified_loans),"Yes","No")

Now that our "MODIFIED" field has been created, we will next convert it into a dummy variable where a "1" indicates a loan has received a modification and a "0" indicates a loan has not been modified. Then, we will create dummy variables for "MORTGAGE_INSURANCE_TYPE". Finally, we will clean observations with "NaN" values, dropping them or converting them to zero where sensible.

In [97]:
#Convert modification flag to dummy variable (adds a "Yes" column: 1 = modified, 0 = not modified)
df_acq_mod_dummies = pd.concat([df_acquisition, pd.get_dummies(df_acquisition["MODIFIED"], drop_first=True)],
                                axis=1)

#Create mortgage insurance type dummy
df_acq_mod_dummies["MORTGAGE_INSURANCE_TYPE_1"] = np.where(df_acq_mod_dummies["MORTGAGE_INSURANCE_TYPE"]==1,1,0)
df_acq_mod_dummies["MORTGAGE_INSURANCE_TYPE_2"] = np.where(df_acq_mod_dummies["MORTGAGE_INSURANCE_TYPE"]==2,1,0)
df_acq_mod_dummies = df_acq_mod_dummies.drop("MORTGAGE_INSURANCE_TYPE", axis=1)

#Make mortgage insurance percentage and dti 0 for nan
df_acq_mod_dummies["MORTGAGE_INSURANCE_PERCENTAGE"] = df_acq_mod_dummies["MORTGAGE_INSURANCE_PERCENTAGE"].fillna(0)
df_acq_mod_dummies["DEBT-TO-INCOME_RATIO_(DTI)"] = df_acq_mod_dummies["DEBT-TO-INCOME_RATIO_(DTI)"].fillna(0)

#Drop remaining N/A
df_acq_mod_dummies = df_acq_mod_dummies.dropna(axis=0, how="any")

The next step in preparing our data for analysis is choosing which features we want to include in our model. To keep things simple, let's include nearly all of the fields, leaving out only the unique "LOAN_IDENTIFIER" and the "CO-BORROWER_CREDIT_SCORE". Lastly, we will split our data into training and testing sets. Our training sets will include 67% of our data and we will hold out the remaining 33% for use in model testing.

In [98]:
#Create dataframe of independent variables, dropping loan ID and "Yes" (modification dummy)
features = ['CHANNEL', 'SELLER_NAME', 'ORIGINAL_INTEREST_RATE', 
            'ORIGINAL_UNPAID_PRINCIPAL_BALANCE_(UPB)', 'ORIGINAL_LOAN_TERM', 
            'ORIGINATION_DATE ', 'FIRST_PAYMENT_DATE', 'ORIGINAL_LOAN-TO-VALUE_(LTV)', 
            'ORIGINAL_COMBINED_LOAN-TO-VALUE_(CLTV)', 'NUMBER_OF_BORROWERS', 
            'DEBT-TO-INCOME_RATIO_(DTI)', 'BORROWER_CREDIT_SCORE', 
            'FIRST-TIME_HOME_BUYER_INDICATOR','LOAN_PURPOSE', 'PROPERTY_TYPE','NUMBER_OF_UNITS', 
            'OCCUPANCY_STATUS', 'PROPERTY_STATE', 'ZIP_(3-DIGIT)', 
            'MORTGAGE_INSURANCE_PERCENTAGE', 'PRODUCT_TYPE','RELOCATION_MORTGAGE_INDICATOR', 
            "MORTGAGE_INSURANCE_TYPE_1", "MORTGAGE_INSURANCE_TYPE_2"]
x = pd.get_dummies(df_acq_mod_dummies[features], drop_first = True)
y = df_acq_mod_dummies["Yes"]

from sklearn.model_selection import train_test_split
pd.options.mode.chained_assignment = None
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)

Our data are now clean and ready to be used in model training. To demonstrate the misleading nature of the accuracy measurement when used with imbalanced datasets, let’s train a simple k-nearest neighbors (KNN) model on our training set and see how it performs. 

In [99]:
##################################################
# Step 3) Create basic K-Nearest Neighbors Model #
##################################################

#Test accuracy of a k-nearest neighbors classifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

#set nearest neighbors classifier (scikit-learn defaults, n_neighbors=5)
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
y_predict_knn = knn.predict(x_test)

#print the accuracy of the KNN model
print("The model achieves an accuracy score of:")
print(accuracy_score(y_test, y_predict_knn)*100, "%")

The model achieves an accuracy score of:
95.03218438538205 %


Even without any hyperparameter tuning, the KNN model achieves an accuracy of 95.03%. In some cases, this would be good enough that an analyst might call it a day and head home. In this case, however, our model may not actually be doing what we want it to -- there is more work to be done. Let's examine.

In [116]:
##################################################
# Step 4) Examine Accuracy Against Other Metrics #
##################################################

print("The model predicted", y_predict_knn.sum(), "modifications out of a total", y_predict_knn.size)
print("predictions, representing <", np.round(y_predict_knn.sum()/y_predict_knn.size,2)*100,"%.")
print("This stems from the fact that modifications make up only")
print(np.round((df_acq_mod_dummies.Yes.sum()/df_acq_mod_dummies.Yes.count())*100,2),"% of the dataset.")

The model predicted 195 modifications out of a total 38528
predictions, representing < 1.0 %.
This stems from the fact that modifications make up only
4.64 % of the dataset.


Even though our accuracy score is quite high, we can see that our model is predicting a negligible number of modifications. In other words, if we simply predicted no loans would receive modifications, we would achieve a very similar accuracy score. Given we want our model to be able to isolate observations indicative of modifications, our current KNN model will not suffice. Let's use the area under a ROC curve (AUROC) to further demonstrate this point. 
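Before turning to AUROC, the claim about the naive baseline can be checked with some quick arithmetic. The sketch below uses only the two figures printed earlier in this post. It assumes the share of modifications in the test set is close to the 4.64% measured on the full cleaned dataset, which a random split makes likely but does not guarantee.

```python
# Figures printed earlier in this post.
modification_share = 4.64   # % of loans in the cleaned dataset that were modified
knn_accuracy = 95.03        # % accuracy of the KNN model on the test set

# A "model" that predicts "no modification" for every loan is right whenever
# the loan was not modified, so its accuracy is the non-modified share.
# Assumption: the test set has roughly the same share as the full dataset.
baseline_accuracy = 100 - modification_share

print(f"always-'No' baseline: {baseline_accuracy:.2f}%")
print(f"KNN model:            {knn_accuracy:.2f}%")
print(f"difference:           {knn_accuracy - baseline_accuracy:+.2f} points")
```

Under that assumption, the trivial baseline would score about 95.36%, slightly above the KNN model's 95.03%.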

In [102]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
print(np.round(roc_auc_score(y_test,y_predict_knn)*100,2))

50.56


As you can see, our AUROC score is about 51%. While a full description of ROC curves falls outside the scope of this post, you can think of the AUROC score as representing the probability that, were we to randomly select both a positive observation (modification) and a negative observation (non-modification) from our dataset, the model would assign a higher predicted probability to the positive observation. In other words, the AUROC score is a useful way to measure our model's classification capability.
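The pairwise interpretation above can be written out directly. The sketch below is a toy illustration in plain Python, not the scikit-learn implementation: the labels and predicted probabilities are made up, and "pairwise_auroc" is a hypothetical helper name. It also shows what happens when AUROC is computed from hard 0/1 predictions (as this post does with the output of "knn.predict") rather than from predicted probabilities.

```python
from itertools import product

def pairwise_auroc(y_true, scores):
    """Share of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 2 modifications (1) and 4 non-modifications (0).
y_true = [1, 1, 0, 0, 0, 0]
probabilities = [0.40, 0.30, 0.35, 0.20, 0.10, 0.05]  # made-up predicted probabilities

# Hard labels from a 0.5 cut-off: every loan is predicted "0".
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]

print(pairwise_auroc(y_true, probabilities))  # 0.875
print(pairwise_auroc(y_true, hard_labels))    # 0.5
```

In this toy case the probabilities rank most modifications above most non-modifications, but thresholding them turns every pair into a tie and the score collapses to 0.5. In scikit-learn, scores such as the positive-class column of "predict_proba" can be passed to "roc_auc_score" in place of hard labels. The figures in this post are based on hard labels, so they may understate how well the models rank loans.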

Given our model's AUROC score is only 51%, we can conclude that our current model is virtually no better at classifying than a coin flip. Fortunately, there are easy and intuitive ways to enhance our model's performance.

### Upsampling & Downsampling

A great way to correct for the impacts of an imbalanced dataset is to use common resampling methods like upsampling or downsampling. The general intuition of upsampling and downsampling is straightforward: models trained on datasets with equally represented categories tend to prove better classifiers. In the case of upsampling, equal representation is achieved by increasing the number of observations in the minority category (by sampling with replacement) to match the number of observations in the majority category. In the case of downsampling, on the other hand, equal representation is achieved by decreasing the number of observations in the majority category to match the number of observations in the minority category. Let's use scikit-learn's "resample" utility to test these methods and their effectiveness.
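Before applying this to the mortgage data, here is a toy sketch of the two ideas using only Python's standard library. The rows, the class sizes, and the "class_counts" helper are all made up for illustration; the real analysis below uses scikit-learn instead.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical toy training rows: (row_id, label), 3 minority vs 9 majority.
minority = [(i, 1) for i in range(3)]
majority = [(i, 0) for i in range(3, 12)]

# Upsampling: draw minority rows WITH replacement until they match the majority.
minority_up = rng.choices(minority, k=len(majority))
upsampled = majority + minority_up

# Downsampling: draw majority rows WITHOUT replacement down to the minority size.
majority_down = rng.sample(majority, k=len(minority))
downsampled = majority_down + minority

def class_counts(rows):
    """Count how many rows carry each label."""
    counts = {0: 0, 1: 0}
    for _, label in rows:
        counts[label] += 1
    return counts

print(class_counts(upsampled))    # {0: 9, 1: 9}
print(class_counts(downsampled))  # {0: 3, 1: 3}
```

Note the trade-off: upsampling repeats minority rows, so the same observation can appear many times, while downsampling throws away majority rows that might have carried useful information.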

In [103]:
##################################
# Step 5) Upsample Modifications #
##################################
from sklearn.utils import resample

#combine training sets 
comb_train_up = x_train.join(y_train)

#separate by modifications & no modifications
comb_train_up_mod = comb_train_up[comb_train_up.Yes==1]
comb_train_up_no_mod = comb_train_up[comb_train_up.Yes==0]

#make size of mods equal to size of no mods
comb_train_up_mod = resample(comb_train_up_mod, 
                                    replace=True,
                                    n_samples=comb_train_up_no_mod.Yes.count())

#concatenate into one dataframe
comb_train_up = pd.concat([comb_train_up_no_mod, comb_train_up_mod])

#check that value_counts are equivalent
comb_train_up.Yes.value_counts()

1    74579
0    74579
Name: Yes, dtype: int64

As you can see, after upsampling, our training dataset now has an even distribution of both modifications and non-modifications (where "1" represents a modification and "0" represents a non-modification). To demonstrate how this helps improve classification power, let's re-train our KNN model on the upsampled dataset and calculate a new AUROC score.

In [104]:
#create new split and then retrain knn model on upsampled set
x_train_up = comb_train_up.drop("Yes", axis=1)
y_train_up = comb_train_up["Yes"]

#refit our knn model
knn.fit(x_train_up, y_train_up)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [105]:
#use our newly trained model to predict classifications
y_predict_knn_up = knn.predict(x_test)
print(np.round(roc_auc_score(y_test,y_predict_knn_up)*100,2))

57.88


Our new AUROC score reveals that, by simply upsampling our modification data, we can raise our model's AUROC score by about 7 percentage points (from 50.56 to 57.88) -- an impressive increase gained through just a few lines of code. Let's try downsampling to see if we can do even better.

In [106]:
############################################
# Step 6) Downsample our Non-Modifications #
############################################
#combine training sets 
comb_train_down = x_train.join(y_train)

#separate by modifications & no modifications
comb_train_down_mod = comb_train_down[comb_train_down.Yes==1]
comb_train_down_no_mod = comb_train_down[comb_train_down.Yes==0]

#make size of no mods equal to size of mods
comb_train_down_no_mod = resample(comb_train_down_no_mod, 
                                    replace=False,
                                    n_samples=comb_train_down_mod.Yes.count())

#concatenate into one dataframe
comb_train_down = pd.concat([comb_train_down_no_mod, comb_train_down_mod])

#check that value_counts are equivalent
comb_train_down.Yes.value_counts()

1    3643
0    3643
Name: Yes, dtype: int64

In [107]:
x_train_down = comb_train_down.drop("Yes", axis=1)
y_train_down = comb_train_down["Yes"]
knn.fit(x_train_down, y_train_down)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [110]:
y_predict_knn_down = knn.predict(x_test)
print(np.round(roc_auc_score(y_test,y_predict_knn_down)*100,2))

60.46


As you can see, downsampling does yield a better AUROC score than upsampling, adding roughly 2.6 percentage points (57.88 to 60.46) and bringing the total gain over our original model to nearly 10 percentage points (50.56 to 60.46).

While an AUROC score of 60% is still fairly low, this example is simply meant to demonstrate the power and ease of use of resampling. 

If we need improved performance, there is much we can do to train better models with our downsampled data. A full discussion of model optimization falls outside the scope of this post, but in the next section, I will briefly touch on some common techniques used to improve classification model performance.

### Improving Classification Model Performance

One thing we can do to further improve our model's performance is test different classification methods. Thus far, we have only used a KNN model, but there are many other popular algorithms readily available to analysts seeking to build classifiers. For this analysis, let's try using our downsampled data to train and test a Random Forest Classifier (RFC), a model which trains many decision trees, each on a bootstrap sample of the data and with only a random subset of features considered at each split, and then averages the trees' predictions.
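Before fitting the model, the ingredients described above can be sketched in plain Python. This is deliberately not a working random forest: it only illustrates the two sources of randomness and the averaging step. The function names, feature names, and per-tree probabilities are all hypothetical.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def bootstrap_rows(n_rows, rng):
    """Row indices for one tree: n_rows draws WITH replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def candidate_features(feature_names, k, rng):
    """Features a single split may consider: k names drawn WITHOUT replacement."""
    return rng.sample(feature_names, k)

def forest_probability(tree_probabilities):
    """The forest's score is the average of its trees' predicted probabilities."""
    return sum(tree_probabilities) / len(tree_probabilities)

feature_names = ["rate", "ltv", "dti", "credit_score"]  # hypothetical names
rows = bootstrap_rows(n_rows=6, rng=rng)
split_features = candidate_features(feature_names, k=2, rng=rng)

print("rows used by one tree:", rows)
print("features considered at one split:", split_features)

# Made-up probabilities from three trees for a single loan.
print(forest_probability([0.25, 0.5, 0.75]))  # 0.5
```

Because each tree sees slightly different rows and features, the trees make different mistakes, and averaging them tends to be more stable than relying on any single tree.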

In [111]:
#########################################
# Step 7) Try a RandomForestClassifier  #
#########################################

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(x_train_down, y_train_down)
y_predict_rfc_down = rfc.predict(x_test)

In [112]:
print(np.round(roc_auc_score(y_test,y_predict_rfc_down)*100,2))

70.9


Wow! Simply changing our classification algorithm increased our AUROC score by more than 10 percentage points (from 60.46 to 70.9). In just a few short steps, we have gone from a model that couldn't outperform random guessing to a model that is beginning to classify observations reasonably well.

This is impressive, but if you're a competitive analyst like me, you're probably wondering, "Can we do even better?" The answer -- a resounding "Yes". 

The truth is, these techniques only scratch the surface of model optimization. Analysts can take their models in a multitude of directions to improve performance by testing a variety of different feature sets, classifiers, and hyperparameters. While this can prove to be a long and arduous process, it can also be insightful and, I daresay, fun. Additionally, there are many easy-to-use packages designed to expedite this process. In the next section of this post, we will examine an introductory example of how to use grid search cross validation to speed up the process of hyperparameter tuning. 

### Using Grid Search Cross Validation to Tune Models

Grid search cross validation is a tool provided by scikit-learn which allows analysts to both quickly test a wide range of hyperparameters and efficiently determine which model, of the ones tested, produces the best results according to a specified scoring criterion. Let's use grid search cross validation below to see if we can improve the performance of our RFC model. Note that the grid search function tests every combination of hyperparameters the user passes it. As a consequence, the number of models it must fit, and therefore its run time, grows multiplicatively with every additional hyperparameter and value the analyst includes. If you are building a model and need to optimize it in a short timeframe, consider using RandomizedSearchCV instead.
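To get a feel for how quickly a grid grows, the sketch below counts the combinations in the grid used in Step 8 of this post. The hyperparameter values come from that step. The number of cross-validation folds and the randomized-search budget are assumptions chosen for illustration; the actual fold count depends on the "cv" setting and the scikit-learn version.

```python
from itertools import product
import random

# Hyperparameter values taken from the grid used in Step 8 of this post.
param_grid = {
    "n_estimators": [250, 500, 750, 1000, 1250],
    "max_features": [5, "sqrt", 25, None],
    "min_samples_leaf": [1, 5, 10, 25, 50],
    "oob_score": [True],
}

# Grid search tries the Cartesian product of every list of values.
combinations = list(product(*param_grid.values()))
n_folds = 5  # assumption: 5-fold cross validation
print(len(combinations), "combinations")          # 100 combinations
print(len(combinations) * n_folds, "model fits")  # 500 model fits

# A randomized search instead scores only a fixed number of sampled combinations.
n_iter = 20  # hypothetical budget
sampled = random.Random(0).sample(combinations, n_iter)
print(len(sampled) * n_folds, "model fits")       # 100 model fits
```

Adding just one more hyperparameter with four candidate values would quadruple the size of the full grid, while the randomized search would still cost the same fixed budget.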

In [113]:
######################################
# Step 8) Tune RFC with GridSearchCV #
######################################

#set random forest classifier and the hyperparameter values to test
rfc = RandomForestClassifier()
estimators = [250, 500, 750, 1000, 1250]
max_features_options = [5, "sqrt", 25, None]  #renamed so the earlier "features" list is not overwritten
samples = [1, 5, 10, 25, 50]
params = {"n_estimators": estimators, "oob_score": [True], "max_features": max_features_options, "min_samples_leaf": samples}

#create grid to search
rfc_grid = GridSearchCV(rfc, param_grid=params, scoring="roc_auc", n_jobs=-1)
rfc_grid.fit(x_train_down, y_train_down)
#print best estimator
print(rfc_grid.best_estimator_)
y_predict_rfc_down_grid = rfc_grid.predict(x_test)

print(np.round(roc_auc_score(y_test, y_predict_rfc_down_grid)*100,2))

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=25, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=5, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
            oob_score=True, random_state=None, verbose=0, warm_start=False)
74.62


As you can see, using grid search cross validation has increased our AUROC score by nearly 4 additional percentage points (from 70.9 to 74.62). To further enhance performance, analysts could continue to use grid search, testing a variety of classifiers, features, and parameters.

### Conclusion

At the end of the day, there is no single perfect method analysts can use to always produce the best optimized classification model. That said, when dealing with imbalanced datasets, analysts can achieve superior classification performance through the use of easily implementable resampling methods. Additionally, to further improve performance, analysts can train various types of classification models on resampled datasets, seeking to iteratively improve AUROC (or other) scores. While at first glance imbalanced datasets might seem daunting, dealing with them can be simple, enjoyable, and can ultimately produce models which classify quite well. 