Supervised ML evaluation of credit risk data
This analysis uses machine learning to train a model that predicts high risk credit loans. The low risk loans in the dataset far outnumber the high risk loans, so the imbalanced-learn library was used to deal with this asymmetry.
Several algorithms were compared to determine which gives the best model performance. In particular, resampling and ensemble techniques were used.
The RandomOverSampler and SMOTE algorithms were used to oversample the high risk cases and balance the number of cases being analyzed. The ClusterCentroids algorithm was used to undersample the low risk cases for the same purpose. Additionally, SMOTEENN, a combination of oversampling and undersampling, was used to balance the inputs.
### Ensemble Classifiers
The BalancedRandomForestClassifier and Easy Ensemble AdaBoost algorithms combine many decision trees to produce a more robust and accurate model.
The recall score is the best indicator of how many high risk cases are caught: it is the fraction of actual high risk loans that the model labels as high risk. Optimizing for recall may label some low risk cases as high risk, but a higher recall score means that more of the high risk cases are identified.
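To make the trade-off concrete, here is a small worked example in plain Python. The labels and predictions are made up for illustration, and the helper names `recall` and `precision` are hypothetical, not taken from the analysis.

```python
def recall(y_true, y_pred, positive="high_risk"):
    """Fraction of actual positive cases that were predicted positive."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    total = true_pos + false_neg
    return true_pos / total if total else 0.0


def precision(y_true, y_pred, positive="high_risk"):
    """Fraction of predicted positive cases that were actually positive."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_pos = sum(1 for t, p in pairs if t != positive and p == positive)
    total = true_pos + false_pos
    return true_pos / total if total else 0.0


# Made-up example: 4 high risk loans and 6 low risk loans.
y_true = ["high_risk"] * 4 + ["low_risk"] * 6
# The model catches 3 of the 4 high risk loans and wrongly flags 2 low risk loans.
y_pred = [
    "high_risk", "high_risk", "high_risk", "low_risk",
    "high_risk", "high_risk", "low_risk", "low_risk", "low_risk", "low_risk",
]

print(recall(y_true, y_pred))     # 3 caught out of 4 high risk loans
print(precision(y_true, y_pred))  # 3 correct out of 5 loans flagged as high risk
```

Here the recall is 3/4 while the precision is 3/5: the two low risk loans flagged as high risk lower the precision but leave the recall unchanged.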
Naive Random Oversampling
High risk recall score of 0.72
SMOTE Oversampling
High risk recall score of 0.61
Cluster Centroids Undersampling
High risk recall score of 0.69
SMOTEENN Combination Sampling
High risk recall score of 0.78
Balanced Random Forest Classifier
High risk recall score of 0.70
Easy Ensemble AdaBoost Classifier
High risk recall score of 0.92
The Easy Ensemble AdaBoost Classifier is the best choice by this measure, because it correctly identified 92% of the high risk cases, the highest recall of the six approaches.