To apply machine learning to a real-world challenge: assessing credit card credit risk.
To employ different techniques for training and evaluating models with unbalanced classes, since credit risk is an inherently unbalanced classification problem: good loans easily outnumber risky loans.
In random oversampling, instances of the minority class are randomly selected and added to the training set until the majority and minority classes are balanced.
Precision for high-risk loans is very low, around 1%, but the model is very good at predicting low-risk loans, with precision of almost 100%.
Recall is around 70% for high-risk loans; that is, the model identifies about 70% of risky loans, but only about 63% of good ones.
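The oversampling step described above can be sketched with plain numpy (the project itself presumably used imbalanced-learn's RandomOverSampler; the function and toy data here are hypothetical):

```python
import numpy as np

def random_oversample(X, y, minority_label, rng=None):
    # Randomly duplicate minority-class rows (with replacement) until
    # the two classes are the same size.
    rng = np.random.default_rng(rng)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    n_extra = len(majority_idx) - len(minority_idx)
    extra = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([majority_idx, minority_idx, extra])
    return X[keep], y[keep]

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)       # 8 low-risk rows vs 2 high-risk rows
X_res, y_res = random_oversample(X, y, minority_label=1, rng=42)
print(np.bincount(y_res))             # → [8 8]
```

Only the labels of the sampled rows change in frequency; no new feature values are invented, which is what distinguishes random oversampling from SMOTE.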
In the synthetic minority oversampling technique (SMOTE), new instances are interpolated between existing minority-class points, increasing the size of the minority class.
Precision for high-risk loans is again very low, around 1%, but the model is very good at predicting low-risk loans, with precision of almost 100%.
Recall is around 63% for high-risk loans; that is, the model identifies about 63% of risky loans, and about 69% of good ones.
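The interpolation idea behind SMOTE can be sketched in a few lines (a minimal toy version, not imbalanced-learn's implementation; the function name and sample points are hypothetical):

```python
import numpy as np

def smote_sample(X_min, rng=None, k=1):
    # Create one synthetic point per minority row by stepping a random
    # fraction of the way toward one of its k nearest minority neighbours.
    rng = np.random.default_rng(rng)
    synthetic = []
    for i, x in enumerate(X_min):
        d = np.linalg.norm(X_min - x, axis=1)   # distances to other rows
        d[i] = np.inf                           # exclude the point itself
        neighbour = X_min[np.argsort(d)[rng.integers(k)]]
        gap = rng.random()                      # position along the segment
        synthetic.append(x + gap * (neighbour - x))
    return np.vstack(synthetic)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
new_pts = smote_sample(X_min, rng=0)
print(new_pts.shape)    # → (3, 2)
```

Each synthetic row lies on a line segment between two real minority points, so SMOTE never duplicates an existing row the way random oversampling does.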
Cluster centroid undersampling is akin to SMOTE, but works on the majority class. The algorithm identifies clusters of the majority class, then generates synthetic data points, called centroids, that are representative of those clusters.
Precision for high-risk loans is again very low, around 1%, but the model is very good at predicting low-risk loans, with precision of almost 100%.
Recall is around 69% for high-risk loans; that is, the model identifies about 69% of risky loans, but only about 40% of good ones.
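The centroid step can be sketched with scikit-learn's KMeans (a simplified stand-in for imbalanced-learn's ClusterCentroids; the helper and data below are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_centroid_undersample(X_maj, n_minority, seed=0):
    # Replace the majority class with one KMeans centroid per
    # minority-class row, so both classes end up the same size.
    km = KMeans(n_clusters=n_minority, n_init=10, random_state=seed)
    km.fit(X_maj)
    return km.cluster_centers_

X_maj = np.random.default_rng(0).normal(size=(100, 2))  # 100 "low-risk" rows
centroids = cluster_centroid_undersample(X_maj, n_minority=5)
print(centroids.shape)    # → (5, 2)
```

Note that the majority class is now represented by synthetic averages rather than real loans, which is why undersampling can cost recall on the good-loan class.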
SMOTEENN combines the SMOTE and Edited Nearest Neighbors (ENN) algorithms: SMOTE oversamples the minority class, then ENN cleans the result by dropping points whose nearest neighbors mostly belong to the other class.
Precision for high-risk loans is again very low, around 1%, but the model is very good at predicting low-risk loans, with precision of almost 100%.
Recall is around 72% for high-risk loans; that is, the model identifies about 72% of risky loans, but only about 57% of good ones.
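The ENN cleaning half of SMOTEENN can be sketched on its own (a toy version with hypothetical data; imbalanced-learn's SMOTEENN chains this after SMOTE):

```python
import numpy as np

def enn_clean(X, y, k=3):
    # Edited Nearest Neighbours: drop any row whose label disagrees
    # with the majority vote of its k nearest neighbours.
    keep = []
    for i, x in enumerate(X):
        d = np.linalg.norm(X - x, axis=1)
        d[i] = np.inf                       # a point is not its own neighbour
        nn = np.argsort(d)[:k]
        if np.bincount(y[nn], minlength=2).argmax() == y[i]:
            keep.append(i)
    keep = np.array(keep)
    return X[keep], y[keep]

# Two tight clusters plus one mislabelled point sitting inside cluster 0.
X = np.array([[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [0.05, 0.05]])
y = np.array([0, 0, 0, 1, 1, 1])    # last row: label 1 deep inside cluster 0
X_c, y_c = enn_clean(X, y, k=3)
print(len(y_c))    # → 5
```

The mislabelled point is removed because all of its neighbors carry the other label; this pruning of ambiguous points is what ENN adds on top of SMOTE.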
A balanced random forest randomly under-samples each bootstrap sample, so every tree in the forest is trained on a balanced subset of the data.
Precision for high-risk loans is low, around 3%, but the model is very good at predicting good loans, with precision of almost 100%.
Recall is around 70% for high-risk loans; that is, the model identifies about 70% of risky loans, and 87% of good loans.
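The per-tree balanced bootstrap can be sketched as follows (the helper and label vector are hypothetical; imbalanced-learn's BalancedRandomForestClassifier does this internally for every tree):

```python
import numpy as np

def balanced_bootstrap(y, rng):
    # Draw a bootstrap sample that takes equally many rows from each
    # class, i.e. the majority class is under-sampled for this tree.
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()
    idx = [rng.choice(np.flatnonzero(y == c), size=n, replace=True)
           for c in classes]
    return np.concatenate(idx)

rng = np.random.default_rng(1)
y = np.array([0] * 90 + [1] * 10)     # 90 low-risk rows vs 10 high-risk rows
sample = balanced_bootstrap(y, rng)
print(np.bincount(y[sample]))         # → [10 10]
```

Because each tree sees a different balanced draw, the forest as a whole still uses most of the majority-class data, unlike a single global undersample.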
Ensemble learning is the process of combining multiple models, such as decision trees, to improve accuracy and robustness and to decrease the variance of the model.
Precision for high-risk loans is low, around 9%, but the model is very good at predicting good loans, with precision of almost 100%.
Recall is around 92% for high-risk loans; that is, the model identifies about 92% of risky loans, and about 94% of good ones.
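An Easy-Ensemble-style combination can be sketched like this (hypothetical data and helper; the project presumably used imblearn's EasyEnsembleClassifier, which additionally boosts each balanced bag with AdaBoost):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(180, 2)),           # "good" loans
               rng.normal(loc=2.0, size=(20, 2))])  # rare "risky" loans
y = np.array([0] * 180 + [1] * 20)

def balanced_subset(y, rng):
    # Sample equally many rows from each class (under-samples class 0).
    n = np.bincount(y).min()
    return np.concatenate([rng.choice(np.flatnonzero(y == c), n, replace=False)
                           for c in (0, 1)])

# Fit one small tree per balanced subsample, then combine by vote.
models = [DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[idx], y[idx])
          for idx in (balanced_subset(y, rng) for _ in range(10))]
votes = np.mean([m.predict(X) for m in models], axis=0)
pred = (votes >= 0.5).astype(int)
print(pred.shape)    # → (200,)
```

Each learner sees all of the minority class but only a slice of the majority class, so the ensemble as a whole wastes little data while every member trains on balanced input.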
Although SMOTE reduces the risk of overfitting to duplicated rows, it does not always outperform random oversampling. While resampling can attempt to address imbalance, it does not guarantee better results. Resampling with SMOTEENN did not work miracles, but some of its metrics show an improvement over undersampling. The balanced random forest model has a precision of 3% for bad loan applications, which indicates a large number of false positives: many good loans are flagged as risky.
The EasyEnsembleClassifier model identifies 92% of risky loans and 94% of good loans. It has a precision of almost 100% for good loans but only 9% for bad loans; that is, there are a lot of false positives, and it wrongly flags a number of good loan applications as risky. Still, it is overall the best model of the group.