- Overview of the analysis: Explain the purpose of this analysis.
Credit risk is an inherently unbalanced classification problem, as good loans easily outnumber risky loans, so different techniques are needed to train and evaluate models with unbalanced classes. Jill asked me to use the imbalanced-learn and scikit-learn libraries to build and evaluate models using resampling. Using the credit card credit dataset from LendingClub, a peer-to-peer lending services company, I oversampled the data with the RandomOverSampler and SMOTE algorithms and undersampled it with the ClusterCentroids algorithm. I then applied a combinatorial approach of over- and undersampling with the SMOTEENN algorithm. Next, I compared two newer machine learning models that reduce bias, BalancedRandomForestClassifier and EasyEnsembleClassifier, at predicting credit risk. Finally, I evaluated the performance of these models and made a written recommendation on whether they should be used to predict credit risk.
- Results: Using bulleted lists, describe the balanced accuracy scores and the precision and recall scores of all six machine learning models. Use screenshots of your outputs to support your results.
- RandomOverSampler: balanced accuracy score of 66.29%
- Combination sampling with SMOTEENN: balanced accuracy score of 66.29%
- BalancedRandomForestClassifier: balanced accuracy score of 78.78%
- EasyEnsembleClassifier: balanced accuracy score of 92.5%
- Confusion matrix accuracy score: 64.39%
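For reference, the balanced accuracy score reported above is the average of the recall on each class, which is why it differs from plain accuracy on unbalanced data. A toy example (not the project data) shows the computation:

```python
# Balanced accuracy = mean of per-class recall
# (sensitivity and specificity for a binary problem). Toy data only.
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall_pos = tp / (tp + fn)   # recall on class 1
recall_neg = tn / (tn + fp)   # recall on class 0
manual = (recall_pos + recall_neg) / 2

print(manual)                                      # 0.625
print(balanced_accuracy_score(y_true, y_pred))     # 0.625
```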
- Summary: Summarize the results of the machine learning models, and include a recommendation on the model to use, if any. If you do not recommend any of the models, justify your reasoning.
Based on the results, the EasyEnsembleClassifier, an ensemble of AdaBoost learners, is the best method: it achieved a 92.54% balanced accuracy score. Its class balancing is achieved by random under-sampling of the majority class in each bag.