To apply machine learning to solve a real-world challenge: credit card risk.

Ayesha-da/Credit_Risk_Analysis

Credit_Risk_Analysis

Overview of the analysis:

This project applies machine learning to a real-world challenge: credit card risk.

Purpose

To train and evaluate models on unbalanced classes using several different techniques. Credit risk is an inherently unbalanced classification problem, because good loans easily outnumber risky loans.

Results:

Oversampling

Naive Random Oversampling

In random oversampling, instances of the minority class are randomly selected and added to the training set until the majority and minority classes are balanced.
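The idea can be sketched in plain Python. This is a minimal illustration on made-up toy data, not the code used in this repository; the function name `random_oversample` and the sample values are assumptions. In practice a library class such as imbalanced-learn's `RandomOverSampler` would normally be used instead.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_res, y_res = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_res.append(rng.choice(rows))  # sampling with replacement
            y_res.append(label)
    return X_res, y_res

# Toy data (hypothetical): 6 low-risk loans, 2 high-risk loans
X = [[1.0], [1.2], [0.9], [1.1], [1.3], [0.8], [5.0], [5.5]]
y = ["low_risk"] * 6 + ["high_risk"] * 2

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))  # both classes now have the same number of rows
```

Only the training set should be resampled this way; the test set keeps its original class balance so that the reported metrics reflect real conditions.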

[Image: naiveOversample]

The accuracy score of this model is around 66%.
Precision for high-risk loans is very low, around 1%, while precision for low-risk loans is almost 100%.
Recall for high-risk loans is around 70%, meaning the model identifies about 70% of risky loans, but it identifies only about 63% of good ones.

SMOTE Oversampling

In synthetic minority oversampling technique (SMOTE), new instances are interpolated and the size of the minority is increased.
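The interpolation step might look roughly like the sketch below. It is a simplified stand-in written for illustration, not imbalanced-learn's `SMOTE` implementation and not this repository's code; the function name `smote_sketch`, the value of `k`, and the toy points are all assumptions.

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on a segment between a minority point and a near neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding the point itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy minority class (hypothetical high-risk loans with two features)
minority = [[5.0, 1.0], [5.5, 1.2], [6.0, 0.8]]
new_points = smote_sketch(minority, n_new=4)
print(new_points)
```

Because every synthetic point lies between two existing minority points, the new rows are not exact duplicates, which is the main difference from random oversampling.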

[Image: smoteOversample]

The accuracy score of this model is around 66%.
Precision for high-risk loans is very low, around 1%, while precision for low-risk loans is almost 100%.
Recall for high-risk loans is around 63%, meaning the model identifies about 63% of risky loans, and it identifies about 69% of good ones.

Undersampling

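Cluster centroid undersampling is akin to SMOTE in that it generates synthetic data points. The algorithm identifies clusters of the majority class, then generates synthetic data points, called centroids, that are representative of the clusters. The majority class is then reduced to these centroids, bringing it down to the size of the minority class.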
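A rough sketch of the idea, using a small hand-written k-means, is shown below. It is illustrative only: the helper name `kmeans_centroids`, the iteration count, and the toy data are assumptions, and imbalanced-learn's `ClusterCentroids` class is the usual way to do this in practice.

```python
import random

def kmeans_centroids(points, k, n_iter=20, seed=0):
    """Plain k-means; returns k centroids summarising the given points."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest centroid
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

# Toy data (hypothetical): 6 majority rows, 2 minority rows
majority = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.2], [3.0, 3.1], [3.2, 2.9], [2.9, 3.0]]
minority = [[5.0, 1.0], [5.5, 1.2]]

# Replace the majority class with as many centroids as there are minority rows
majority_resampled = kmeans_centroids(majority, k=len(minority))
X_res = majority_resampled + minority
y_res = ["low_risk"] * len(majority_resampled) + ["high_risk"] * len(minority)
print(X_res, y_res)
```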

[Image: undersampling]

The accuracy score of this model is around 54%.
Precision for high-risk loans is very low, around 1%, while precision for low-risk loans is almost 100%.
Recall for high-risk loans is around 69%, meaning the model identifies about 69% of risky loans, but it identifies only about 40% of good ones.

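Combination (Over and Under) Sampling

SMOTEENN combines the SMOTE and Edited Nearest Neighbors (ENN) algorithms: SMOTE first oversamples the minority class, then ENN cleans the result by removing samples whose nearest neighbors belong to a different class.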
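The ENN cleaning half of this combination can be sketched as follows. This is a simplified majority-vote variant written for illustration; the function name `enn_clean`, the choice of `k=3`, and the toy data are assumptions, and it is not the implementation behind imbalanced-learn's `SMOTEENN`.

```python
from collections import Counter

def enn_clean(X, y, k=3):
    """Drop every sample whose label disagrees with the majority of its k nearest neighbours."""
    keep_X, keep_y = [], []
    for i, (x, label) in enumerate(zip(X, y)):
        others = [j for j in range(len(X)) if j != i]
        others.sort(key=lambda j: sum((a - b) ** 2 for a, b in zip(x, X[j])))
        votes = Counter(y[j] for j in others[:k])
        if votes.most_common(1)[0][0] == label:
            keep_X.append(x)
            keep_y.append(label)
    return keep_X, keep_y

# Toy data (hypothetical): the last "high" point sits inside the "low" cluster
X = [[1.0], [1.1], [1.2], [1.3], [5.0], [5.1], [5.2], [1.15]]
y = ["low", "low", "low", "low", "high", "high", "high", "high"]

X_clean, y_clean = enn_clean(X, y)
print(Counter(y_clean))
```

In the toy data, the point at 1.15 is surrounded by points of the other class, so the cleaning step removes it while the rest of the data is kept.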

[Image: combination]

The accuracy score of this model is around 64%.
Precision for high-risk loans is very low, around 1%, while precision for low-risk loans is almost 100%.
Recall for high-risk loans is around 72%, meaning the model identifies about 72% of risky loans, but it identifies only about 57% of good ones.

Balanced Random Forest Classifier

A balanced random forest randomly under-samples each bootstrap sample so that every tree is trained on balanced classes.
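One way to picture the balanced bootstrap is the sketch below, which draws the same number of rows from each class for a single tree. It is an assumption-laden illustration (the function name `balanced_bootstrap` and the toy labels are hypothetical), not the internals of imbalanced-learn's `BalancedRandomForestClassifier`.

```python
import random
from collections import Counter

def balanced_bootstrap(y, seed=0):
    """Return row indices for one bootstrap sample with equal rows per class."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    sample = []
    for idx in by_class.values():
        # draw with replacement, limited to the size of the smallest class
        sample.extend(rng.choices(idx, k=n_min))
    return sample

# Toy labels (hypothetical): 8 low-risk loans, 2 high-risk loans
y = ["low_risk"] * 8 + ["high_risk"] * 2
sample = balanced_bootstrap(y)
print(Counter(y[i] for i in sample))
```

Each tree in the forest would get its own sample (a different seed here), so across many trees most majority rows are still seen at least once.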

[Image: randomForest]

The accuracy score of this model is around 87%.
Precision for high-risk loans is low, around 3%, while precision for good loans is almost 100%.
Recall for high-risk loans is around 70%, meaning the model identifies about 70% of risky loans, and it identifies about 87% of good ones.

Easy Ensemble AdaBoost Classifier

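Ensemble learning is the process of combining multiple models, like decision tree algorithms, to help improve accuracy and robustness and to decrease the variance of the model. The Easy Ensemble classifier is an ensemble of AdaBoost learners, each trained on a different balanced sample obtained by random under-sampling.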

[Image: easyEnsemble]

The accuracy score of this model is around 93%.
Precision for high-risk loans is low, around 9%, while precision for good loans is almost 100%.
Recall for high-risk loans is around 92%, meaning the model identifies about 92% of risky loans, and it identifies about 94% of good ones.

Summary:

Although SMOTE avoids simply duplicating minority samples, it does not always outperform random oversampling; here its recall for high-risk loans was lower (63% versus 70%). While resampling can attempt to address imbalance, it does not guarantee better results. Resampling with SMOTEENN did not work miracles, but some of the metrics show an improvement over undersampling. The balanced random forest model has a precision of only 3% for high-risk loans, which indicates a large number of false positives, that is, good loans flagged as high risk.

Recommendation:

The EasyEnsembleClassifier model identifies 92% of risky loans and 94% of good loans. Its precision is almost 100% for good loans but only 9% for high-risk loans, which means there are many false positives: a number of good loan applications are wrongly flagged as high risk. Overall, it is still the better model to use.
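The arithmetic below illustrates how a model can recall 92% of risky loans and still have a precision under 10% for that class when risky loans are rare. The counts are hypothetical, chosen only to reproduce the recall figures quoted above; they are not this repository's confusion matrix.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for the positive (high-risk) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical test set: 100 high-risk loans and 17,000 low-risk loans
tp, fn = 92, 8          # 92% of high-risk loans are caught
tn, fp = 15980, 1020    # 94% of low-risk loans are cleared, 6% are wrongly flagged

precision, recall = precision_recall(tp, fp, fn)
low_risk_recall = tn / (tn + fp)
# Mean of the two recalls; assumes the reported "accuracy" is a balanced accuracy
balanced_accuracy = (recall + low_risk_recall) / 2

print(f"high-risk precision: {precision:.3f}")
print(f"high-risk recall:    {recall:.3f}")
print(f"low-risk recall:     {low_risk_recall:.3f}")
print(f"balanced accuracy:   {balanced_accuracy:.3f}")
```

With these counts, the 1,020 wrongly flagged good loans far outnumber the 92 correctly flagged risky loans, so precision for the high-risk class stays low even though recall is high.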
