![Classification%20Algo.png](attachment:Classification%20Algo.png)

# <font color='purple'>Classification of Credit Card Default

Credit card companies rely on their customers to pay at least some of the balance they owe. How much is due depends on the agreement between the customer and the company. Churn rate, also known as attrition rate, is the rate at which customers stop doing business with a company. The loss of customers is a major concern for banks and card companies, which depend on keeping their customers to stay profitable. Customers who default on their payments cannot continue to do business with the company, so it is in a company's interest to predict which of its current customers will default. Knowing which customers will default, and which features lead to that outcome, could help a company retain its customers. High retention involves more than knowing which customers are less likely to default: the features that influence default rates can also inform strategic decisions, which could lead to increased customer loyalty, service, and satisfaction.

# <font color='purple'>1. Data Origins

The dataset used in this project was obtained from the University of California, Irvine (UCI), School of Computer Science. UCI provides a variety of datasets that are popular among data scientists and machine learning engineers. The repository started in 1987 and now houses over 600 datasets. The data for this capstone comes from the Department of Information Management, Chung Hua University, Taiwan, and was donated to UCI on January 26, 2016. It contains observational data on the payment history of customers.

[Credit Card Default Dataset](https://archive-beta.ics.uci.edu/ml/datasets/default+of+credit+card+clients)

# <font color='purple'>2. Data Wrangling

[Data Wrangling Notebook](https://github.com/JideOkesanjo/DataScience-Capstone-Projects/blob/c25f4472b1c7b262c480a44146785d7d4c93b11d/Capstone_Two%20.ipynb)

The data contains 30,000 customers with 24 features. The features __LIMIT_BAL__ and __default payment next month__ were renamed __CREDIT_LIMIT__ and __default__ for clarity during exploratory data analysis and, eventually, modeling. The education feature was originally represented with ordinal values from 0 to 6. It was recategorized so that 0, 1, 2, and 3 represent no education, high school, undergraduate, and graduate, while ordinal values of 4 or higher were categorized as __other__. The marriage feature was also recategorized into __single__, __married__, and __unknown__ for customers who did not provide their marital status.
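
The renaming and recoding described above could be sketched in pandas roughly as follows. The raw column names follow the UCI file, but the tiny frame and the marriage code mapping (1 = married, 2 = single, everything else unknown) are assumptions for illustration and may differ from the linked notebook.

```python
import pandas as pd

# Tiny stand-in for the raw data (values are made up for illustration).
raw = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000, 90000, 50000],
    "EDUCATION": [0, 2, 5, 3],
    "MARRIAGE": [1, 2, 0, 3],
    "default payment next month": [1, 0, 0, 1],
})

df = raw.rename(columns={
    "LIMIT_BAL": "CREDIT_LIMIT",
    "default payment next month": "default",
})

# Education labels as described in the text; codes 4 and above become "other".
education_labels = {0: "no education", 1: "high school", 2: "undergraduate", 3: "graduate"}
df["EDUCATION"] = df["EDUCATION"].map(education_labels).fillna("other")

# Assumed marriage coding: 1 = married, 2 = single, anything else = unknown.
marriage_labels = {1: "married", 2: "single"}
df["MARRIAGE"] = df["MARRIAGE"].map(marriage_labels).fillna("unknown")

print(df)
```

Mapping through a dictionary and filling the unmatched codes keeps the catch-all categories (__other__, __unknown__) explicit instead of hiding them in a chain of conditions.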

# <font color='purple'>3. Exploratory Data Analysis

# Overall Default Rate
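The default rate, measured here as the ratio of the number of people who defaulted to the number of people who did not, was 28%. Because this ratio is taken against non-defaulters rather than against all customers, it corresponds to roughly 22% of all customers (0.28 / 1.28), or about 2 out of every 10, defaulting on their credit card payments the following month.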
![overall_default_rates.png](attachment:overall_default_rates.png)
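
A ratio against non-defaulters is not the same as a share of all customers. As a quick check, and assuming the 28% figure is defaulters divided by non-defaulters as defined above, the conversion can be sketched like this (`share_from_ratio` is a hypothetical helper name):

```python
def share_from_ratio(ratio):
    """Convert defaulters / non-defaulters into defaulters / all customers."""
    return ratio / (1 + ratio)

ratio = 0.28  # defaulters per non-defaulter, as reported above
share = share_from_ratio(ratio)
print(f"{share:.1%} of all customers defaulted")
```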

# How Do Men And Women Compare In Default Rates?
Men and women defaulted at almost the same level the following month. Interestingly, more women than men paid their credit card bills the following month. This difference may be because some men, unlike the women, did not have balances to pay the next month.
![barplot_default_rate_by_sex.png](attachment:barplot_default_rate_by_sex.png)


# Does Education Among Customers Affect Default Rates?
Customers with an undergraduate education had the highest number of defaults. They also had the highest number of non-defaulters. Both observations are understandable, since undergraduates appear to be the largest group of customers in the dataset. Customers with a high school education, the smallest group in the dataset, had the fewest defaults.
![barplot_default_rate_by_education.png](attachment:barplot_default_rate_by_education.png)

# How Are Default Rates Distributed Across This Age?
A kernel density estimate (KDE) plot was used to visualize how defaults were distributed across age in the dataset. Analogous to a histogram, a KDE represents the data using a continuous probability density curve. The green curve represents people who did not default, while the blue curve shows people who did. Defaults rose sharply and peaked between ages 25 and 30. In the same age range, however, the number of people who did not default on their bills the following month also peaked. It was hard to tell from the KDE plot how age affected default rates, since people in this age group are also more likely to open a credit card.
![kdeplot_default_by_age.png](attachment:kdeplot_default_by_age.png)
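
To make the idea of a continuous density curve concrete, here is a minimal hand-written Gaussian KDE. It is a sketch of what a KDE computes, not the plotting call used in the notebook. The ages and the bandwidth of 3 years are hypothetical.

```python
import math

def gaussian_kde(samples, x, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated at x."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

# Hypothetical ages, not taken from the dataset.
ages_default = [24, 26, 27, 29, 31, 35, 42]
ages_no_default = [25, 27, 28, 28, 30, 33, 38, 45, 51]

for age in (25, 35, 45):
    d = gaussian_kde(ages_default, age, bandwidth=3.0)
    n = gaussian_kde(ages_no_default, age, bandwidth=3.0)
    print(f"age {age}: default density {d:.4f}, non-default density {n:.4f}")
```

In this sketch each curve integrates to 1 over its own group, so the two curves compare the shape of the age distribution within each group rather than the size of the groups.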


# <font color='purple'> 4. Preprocessing 
[Preprocessing & Modeling Notebook](https://github.com/JideOkesanjo/DataScience-Capstone-Projects/blob/main/Capstone_Two%20.ipynb)

Customers who defaulted were labeled 1, while those who did not were labeled 0. One-hot encoding was applied to the sex, marriage, and education columns to make these categorical features usable by the models and improve model performance. 70% of the dataset was used to train the models, and the remaining 30% was used to validate them. Because the target variable, default status, was imbalanced, the synthetic minority oversampling technique (SMOTE) was used to address this issue.
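
The core idea of SMOTE is to create new minority-class rows by interpolating between an existing minority row and one of its nearest minority neighbours. The sketch below is a simplified NumPy version for illustration only. The notebook most likely relied on a library implementation, and the function name `smote_sketch`, the toy data, and `k=2` are assumptions.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    # Pairwise distances between minority rows; a row is not its own neighbour.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X))
        neighbour = rng.choice(neighbours[base])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic[i] = X[base] + gap * (X[neighbour] - X[base])
    return synthetic

# Hypothetical training split: 8 non-default rows, 3 default rows, 2 features.
X_majority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5],
                       [1.8, 1.9], [2.1, 2.4], [1.4, 2.1], [1.9, 2.6]])
X_minority = np.array([[4.0, 5.0], [4.5, 5.5], [5.0, 4.8]])

X_new = smote_sketch(X_minority, n_new=len(X_majority) - len(X_minority), k=2)
X_minority_balanced = np.vstack([X_minority, X_new])
print(X_minority_balanced.shape)
```

Oversampling is normally applied to the training portion only, so that the validation rows remain real, untouched observations.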


# <font color='purple'>5. Modeling

The PyCaret module was used to train and validate models for this classification problem. PyCaret is an open-source, low-code machine learning library in Python. I used it because I wanted to spend more time experimenting with several machine learning algorithms to find the most effective one.
![Classification_Models.png](attachment:Classification_Models.png)
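
A comparison like the one in the table above is commonly set up in PyCaret along the following lines. This is a hedged sketch rather than the notebook's actual code: the wrapper name `compare_default_models`, the seed, and the choice to sort by recall are assumptions, and `df` is expected to be the preprocessed frame with a `default` column.

```python
def compare_default_models(df, target="default"):
    """Set up a PyCaret experiment and rank its classifiers by recall."""
    # Imported lazily because PyCaret is a heavy optional dependency.
    from pycaret.classification import setup, compare_models, pull

    setup(
        data=df,
        target=target,
        train_size=0.7,       # 70/30 split described in the preprocessing step
        fix_imbalance=True,   # PyCaret applies SMOTE by default when this is set
        session_id=42,        # arbitrary seed, an assumption for reproducibility
    )
    best = compare_models(sort="Recall")
    leaderboard = pull()      # score grid produced by the last PyCaret call
    return best, leaderboard
```

Sorting the leaderboard by a metric other than accuracy matters here because the classes are imbalanced.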

# <font color='purple'>6. Which Model Is Best?

__Accuracy__:
A perfect model would capture all default and non-default customers without any trade-offs. To determine the best model, we could use accuracy. Accuracy is not a good metric for this dataset, however, because a model could score well simply by classifying every observation as the majority class (non-default customers). When dealing with an imbalanced dataset such as this one, choosing the correct performance metric is important. Since we need to know more about the customers who default, a model chosen only for its high accuracy, such as the __light gradient boosting model__, will not be helpful. What is needed is a more robust model that classifies most of the customers likely to default (__recall__) while keeping the cost of doing so under control (__precision__).
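
The weakness of accuracy on imbalanced data can be seen with a "model" that always predicts the majority class. The 78/22 class split below is a hypothetical illustration, not a figure from the notebook.

```python
# Hypothetical labels: 1 = default, 0 = no default (78 vs 22 is illustrative).
y_true = [0] * 78 + [1] * 22
y_pred = [0] * len(y_true)  # a "model" that always predicts the majority class

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

The accuracy looks respectable even though not a single defaulting customer is found.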




__Recall__ is the model's ability to find all default status customers in the dataset. It is defined as the number of true positives divided by the number of true positives plus the number of false negatives. True positives are customers the model classifies as defaulting who actually did default (correct), and false negatives are customers the model classifies as non-default who actually did default (incorrect).

With __Precision__ we are looking to answer the question: __"Of all the customers predicted to default, how many actually did?"__ Precision is defined as the number of true positives divided by the sum of true positives and false positives. False positives are non-default customers that the model incorrectly identifies as defaulting.
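
In symbols, writing $TP$, $FP$, and $FN$ for the counts of true positives, false positives, and false negatives, the two definitions are:

$$
\text{Recall} = \frac{TP}{TP + FN}
\qquad\qquad
\text{Precision} = \frac{TP}{TP + FP}
$$

The numerators are identical. Only the denominators differ: recall is measured against everyone who actually defaulted, while precision is measured against everyone the model flagged as defaulting.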

# <font color='purple'> 7. The Battle Between Precision and Recall

To fully evaluate the best model for the dataset, precision and recall must be examined together. Unfortunately, there is tension between precision and recall: when recall is increased to find all default status customers, precision is reduced, and vice versa. For example, the __naive Bayes__ model has the highest recall _(87.4%)_ of the fourteen models compared, but its precision is relatively low. The __naive Bayes classifier__ catches more defaults, at the cost of poor precision. Since we need a classifier that tells us who is likely to default and who is not, one that tells us almost all the customers are defaulting cannot be useful.
![naive_bayes_classifier_diagram.png](attachment:naive_bayes_classifier_diagram.png)

On the other hand, the __gradient boosting classifier (gbc)__, which has the highest precision _(68.25%)_ of the fourteen models, has very poor recall. The gbc model is somewhat reliable when it predicts that a customer will default, but it cannot be trusted when it predicts a negative (non-default customer), because it misses many customers who actually default. Since defaulting customers are likely to leave the business, the cost of a false negative is much higher than the cost of a false positive.



![Precision_Recall_Plot.png](attachment:Precision_Recall_Plot.png)
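
The tension between the two metrics shows up when the decision threshold on a model's predicted probability is moved. The sketch below uses hypothetical scores for ten customers, and `precision_recall_at` is an illustrative helper rather than part of the notebook.

```python
def precision_recall_at(threshold, y_true, scores):
    """Precision and recall when scores >= threshold are flagged as default."""
    y_pred = [int(s >= threshold) for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical predicted default probabilities for ten customers.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.30, 0.55, 0.70, 0.90]

for threshold in (0.25, 0.50, 0.65):
    p, r = precision_recall_at(threshold, y_true, scores)
    print(f"threshold {threshold:.2f}: precision {p:.2f}, recall {r:.2f}")
```

On this toy data, raising the threshold from 0.25 to 0.65 lifts precision from 0.57 to 1.00 while recall drops from 1.00 to 0.50. Lowering the threshold is one way to favour recall when false negatives are the more expensive error.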

# <font color='purple'>8. Area Under The Curve
The gradient boosting classifier also had the highest area under the curve (AUC) of the fourteen models. AUC represents the degree of separability between defaulters and non-defaulters. In other words, the AUC tells us how well the gradient boosting classifier distinguishes between the two groups (defaulters and non-defaulters) in the dataset. The higher the AUC, the better the model is at telling non-default and default customers apart. The ROC curve, whose area the AUC measures, is a great way to visualize the performance of the model at various threshold settings. It plots the true positive rate (TPR) against the false positive rate (FPR).
![ROC_curve_defualt_rates.png](attachment:ROC_curve_defualt_rates.png)
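
One way to read the AUC is as the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. The sketch below computes it that way on hypothetical scores. The helper name `auc_by_ranking` is an assumption, and a library routine would normally be used in practice.

```python
def auc_by_ranking(y_true, scores):
    """Share of (defaulter, non-defaulter) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted default probabilities for ten customers.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.30, 0.55, 0.70, 0.90]

auc = auc_by_ranking(y_true, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 means the scores separate the two groups no better than chance, while 1.0 means every defaulter is ranked above every non-defaulter.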

# <font color='purple'>8. The Bottomline

None of the fourteen models was good at properly identifying default and non-default customers for the business problem. A model with both high precision and high recall is needed. Such a model would be excellent at showing who is likely to default and who is not. The gbc model does somewhat well at telling us which of the customers it flagged actually defaulted. It did not, however, do well at identifying all defaulting customers in the dataset. Although the naive Bayes classifier identified more defaulting customers than the gradient boosting classifier, a model that identifies almost every customer as defaulting cannot be a good model. One cannot be sure of its performance when predicting non-default customers.

# <font color='purple'>9. Future Improvements 

I really wish I had more time on this project. I would have enjoyed improving it with:
- __1. Dimensionality Reduction:__ Too many features were used to predict default in the data, and I regret not reducing them. Reducing the dimensions of the data should have played a major part in my exploratory data analysis. I would use PCA to reduce the noise in the data. The features that play a major role in predicting defaults would be included in the model, while those that do not would be excluded.
- __2. Deep Learning:__ Deep learning has made great strides in driverless cars, speech recognition, and image recognition, to name a few. I think a deep learning model would perform better at predicting default and non-default customers. I intend to run the data through an artificial neural network to see how well it predicts the default and non-default classes.
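
As a starting point for the dimensionality reduction idea, PCA can be sketched with a singular value decomposition in NumPy. The data here is synthetic and the helper name `pca_sketch` is an assumption. On the real dataset the features would normally be standardized first. Note that PCA builds new components as combinations of the original features rather than selecting a subset of them.

```python
import numpy as np

def pca_sketch(X, n_components):
    """Project X onto its top principal components via SVD."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return X_centered @ Vt[:n_components].T, explained[:n_components]

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
# Hypothetical data: 5 columns that are noisy mixtures of 2 underlying signals.
X = base @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(200, 5))

X_reduced, explained = pca_sketch(X, n_components=2)
print(X_reduced.shape, explained.sum())
```

The explained-variance ratios give a simple way to decide how many components to keep before feeding the reduced data to a model.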

# <font color='purple'> 10. Credits
- [Precision vs Recall by Kimberly Fessel](https://www.youtube.com/watch?v=qWfzIYCvBqo)
- [Beyond accuracy by Will Koehrsen](https://towardsdatascience.com/beyond-accuracy-precision-and-recall-3da06bea9f6c)
- [Different Metrics to Evaluate Binary Classification Models and Some Strategies to Choose The Right One by Zoumana Keita](https://towardsdatascience.com/different-metrics-to-evaluate-binary-classification-models-and-some-strategies-to-choose-the-right-911ef72a107b)
- [Threshold Moving For Imbalanced Classification by Jason Brownlee](https://machinelearningmastery.com/threshold-moving-for-imbalanced-classification/)
- [Stable variable selection of class-imbalanced data with precision-recall criterion](https://www.sciencedirect.com/science/article/abs/pii/S0169743917303441)
