# **Competition #2 - Audit Report**

April 8, 2019 

Juliana Hoffmann, Kendall Stopa & Jonathan Uy

# **Business Understanding**

**Framing Analytical Questions**

- Who is likely to default on their credit card?
- Which instances are associated with those most likely to default?
- Could other features affect whether someone defaults?
- Which modeling technique best predicts who is likely to default?

*What does it mean to default?*
- To default means going 180 days without making a credit card payment. 
- The credit card company writes off the charge as a bad debt expense. 
- The credit card company will sell the customer's account to a collection agency.

*What are the characteristics of credit card default customers generally?* 
- Barely keeping up with minimum credit card payments
- Growing debt
- Falling credit score
- Customer avoidance of collection agencies 
- Several maxed out credit cards
- Rising interest rates on personal debt
- Credit card payments exceed 15% of gross income 
- High ratio of debt to income

*What additional information would be useful if included in the dataset?*
- Credit score 
- Number of credit cards the individual has 
- Credit card interest rate
- Gross income of customer 


# **Understanding Raw Data** 

We ran the raw dataset through a Logistic Regression and a Decision Tree to establish a baseline before any processing. The notebook is here: [RawData](./RawData.ipynb).

The logistic regression provided us with: 

1. Accuracy: 0.7873333333333333
2. Precision: 0.0
3. Recall: 0.0

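A precision and recall of zero mean the model did not correctly identify a single default, so the accuracy on its own is misleading. We concluded that binning and feature engineering are necessary parts of preparing the data for more advanced modeling.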
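
To illustrate how accuracy can look reasonable while precision and recall are both zero, here is a minimal sketch in plain Python. The labels are made up (78 non-defaulters and 22 defaulters, chosen only to roughly mirror the accuracy above), and `classification_metrics` is a hypothetical helper, not a function from our notebooks.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels where 1 = default."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # defaults correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-defaults wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # defaults that were missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-defaults correctly passed
    accuracy = (tp + tn) / len(pairs)
    # With no positive predictions precision is undefined; report 0.0 by convention
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Made-up labels: 78 non-defaulters followed by 22 defaulters
y_true = [0] * 78 + [1] * 22
# A model that never predicts "default"
y_pred = [0] * 100

accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall)  # 0.78 0.0 0.0
```

A model that always predicts the majority class scores the majority-class share as its accuracy, which is why we compare all three metrics rather than accuracy alone.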

# **Descriptive Statistics and Visualizations of Raw Data**

# **Preparing Data in Pipelines**

In all pipelines, we reused the data processing techniques from our pipelines in the first competition.

All **ODD** numbered pipelines follow this order: 

1. Adjusting Skew
2. Fixing Outliers using 3 Standard Deviation Method
3. Normalization using Scaler method
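
The odd-numbered order can be sketched on a single numeric column as below. This is an illustration under assumptions, not the code in our notebooks: the `log1p` skew transform, capping (rather than dropping) outliers, and z-score standardization for the "Scaler method" are all assumed choices, and `odd_pipeline` is a hypothetical name.

```python
import numpy as np


def odd_pipeline(column):
    """Skew adjustment -> 3-standard-deviation capping -> z-score scaling."""
    x = np.asarray(column, dtype=float)
    # 1. Adjust skew (assumed transform): shift so the minimum is 0, then log1p
    x = np.log1p(x - x.min())
    # 2. Cap values that lie more than 3 standard deviations from the mean
    mean, std = x.mean(), x.std()
    x = np.clip(x, mean - 3 * std, mean + 3 * std)
    # 3. Standardize to zero mean and unit variance (assumed meaning of "Scaler")
    return (x - x.mean()) / x.std()


# Synthetic right-skewed column standing in for a monetary feature
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=10, sigma=1, size=1000)
scaled = odd_pipeline(sample)
```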

All **EVEN** numbered pipelines follow this order: 

1. Fixing Outliers using IQR
2. Normalization using MinMax 
3. Adjusting Skew 
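
A comparable sketch for the even-numbered order follows. The 1.5 x IQR fences and the square-root skew transform are assumptions made for illustration, and `even_pipeline` is a hypothetical name.

```python
import numpy as np


def even_pipeline(column):
    """IQR capping -> min-max scaling -> skew adjustment."""
    x = np.asarray(column, dtype=float)
    # 1. Cap values outside the conventional 1.5 * IQR fences (assumed multiplier)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    x = np.clip(x, q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    # 2. Rescale to the [0, 1] range
    x = (x - x.min()) / (x.max() - x.min())
    # 3. Adjust skew (assumed transform): square root keeps values in [0, 1]
    return np.sqrt(x)


# Synthetic right-skewed column standing in for a monetary feature
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=10, sigma=1, size=1000)
transformed = even_pipeline(sample)
```

Because the skew adjustment runs last here, it operates on values already squeezed into [0, 1], which is one reason the two orders can produce slightly different model scores.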

Each pipeline ends with an RFE, a Logistic Regression and a Decision Tree so that we can compare the results to the Raw Data. 
We want to ensure that any adjustments we make to the dataset result in higher accuracy, precision and recall than the Raw Data. 
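
The evaluation step at the end of each pipeline could look roughly like the following scikit-learn sketch. It runs on synthetic data from `make_classification`, not the competition data, and the number of features kept by RFE, the tree depth, and the split size are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a processed pipeline output (23 features, ~22% positives)
X, y = make_classification(n_samples=1000, n_features=23, n_informative=8,
                           weights=[0.78, 0.22], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# RFE repeatedly drops the weakest feature until 10 remain (10 is an assumed value)
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train_sel, y_train)
    pred = model.predict(X_test_sel)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        # zero_division=0 reports 0.0 instead of warning when nothing is predicted positive
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
    }

for name, metrics in results.items():
    print(name, metrics)
```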

# **Breakdown of 10 Pipelines**

**Pipeline 1** 
Data is processed; however, no attributes are binned. 

Here is the notebook: [Pipeline 1](./Pipe1.ipynb)

The results of the Logistic Regression are: 
1. Accuracy: 0.784
2. Precision: 0.0
3. Recall: 0.0

**Pipeline 2** 
Data is processed; however, no attributes are binned. 

Here is the notebook: [Pipeline 2](./Pipe2.ipynb)

The results of the Logistic Regression are: 
1. Accuracy: 0.784
2. Precision: 0.0
3. Recall: 0.0

**Pipeline 3** 
Data is processed and X5 is binned. 

Here is the notebook: [Pipeline 3](./Pipe3.ipynb)

The results of the Logistic Regression are: 
1. Accuracy: 0.784
2. Precision: 0.0
3. Recall: 0.0

**Pipeline 4** 
Data is processed; however, no attributes are binned. 

Here is the notebook: [Pipeline 4](./Pipe4.ipynb)

The results of the Logistic Regression are: 
1. Accuracy: 0.784
2. Precision: 0.0
3. Recall: 0.0

**Because precision and recall are still zero and accuracy did not improve, Pipelines 1-4 will not be used in advanced modeling.**

**Pipeline 5** 
Data is processed, X5-X11 & X18-X23 are binned and X12-X17 are feature engineered. The original columns have been replaced by the new.  

Here is the notebook: [Pipeline 5](./Pipe5.ipynb)

The results of the Logistic Regression are: 
1. Accuracy: 0.8176
2. Precision: 0.6726027397260274
3. Recall: 0.3030864197530864

**Pipeline 6** 
Data is processed, X5-X11 & X18-X23 are binned and X12-X17 are feature engineered. The original columns have been replaced by the new.  

Here is the notebook: [Pipeline 6](./Pipe6.ipynb)

The results of the Logistic Regression are: 
1. Accuracy: 0.8172
2. Precision: 0.6717241379310345
3. Recall: 0.30061728395061726

**Pipeline 7** 
Data is processed, X5-X11 are binned and X12-X17 are feature engineered. The original columns have been replaced by the new.  

Here is the notebook: [Pipeline 7](./Pipe7.ipynb)

The results of the Logistic Regression are: 
1. Accuracy: 0.8158666666666666
2. Precision: 0.670958512160229
3. Recall: 0.2895061728395062

**Pipeline 8** 
Data is processed, X5-X11 are binned and X12-X17 are feature engineered. The original columns have been replaced by the new.  

Here is the notebook: [Pipeline 8](./Pipe8.ipynb)

The results of the Logistic Regression are: 
1. Accuracy: 0.8152
2. Precision: 0.6666666666666666
3. Recall: 0.28888888888888886

**Pipeline 9** 
Data is processed, X3-X11 & X18-X23 are binned and X12-X17 are feature engineered. The original columns have been replaced by the new.  

Here is the notebook: [Pipeline 9](./Pipe9.ipynb)

The results of the Logistic Regression are: 
1. Accuracy: 0.8174666666666667
2. Precision: 0.6707482993197279
3. Recall: 0.304320987654321

**Pipeline 10** 
Data is processed, X3-X11 & X18-X23 are binned and X12-X17 are feature engineered. The original columns have been replaced by the new.  

Here is the notebook: [Pipeline 10](./Pipe10.ipynb)

The results of the Logistic Regression are: 
1. Accuracy: 0.8161333333333334
2. Precision: 0.6718972895863052
3. Recall: 0.29074074074074074


# **Binning**

The processes for binning can be found in these notebooks: 
1. [Binning X3 & X4](./Binnings/BinningCat.ipynb)
2. [Binning X5](./Binnings/BinningX5.ipynb)
3. [Binning X6-X11](./Binnings/BinningX6.ipynb)
4. [Feature Engineering X12-X17](./Binnings/FeatEnginX12.ipynb)
5. [Binning X18](./Binnings/BinningX18.ipynb)
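
As a rough illustration of what binning a numeric column and replacing the original looks like, here is a small pandas sketch. The values, bin edges, and labels are invented for the example and are not the ones used in the notebooks above.

```python
import pandas as pd

# Made-up values for a numeric column; only the column name comes from the dataset
df = pd.DataFrame({"X5": [22, 27, 34, 41, 48, 56, 63, 70]})

# Assumed bin edges and labels; right=False makes each bin include its left edge
edges = [20, 30, 40, 50, 60, 80]
labels = ["20-29", "30-39", "40-49", "50-59", "60-79"]
df["X5_bin"] = pd.cut(df["X5"], bins=edges, labels=labels, right=False)

# Replace the original column with the binned version
df = df.drop(columns="X5")
print(df["X5_bin"].astype(str).tolist())
```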

# **Extra Credit**

We are going to attempt either Cross-Validation or an advanced modeling technique such as Random Forest. 

# **What's Next?**

For now, our pipelines are finished and our preliminary modeling is mostly finished. We still need to work out Naive Bayes for the pipelines that we are going to use in the Advanced Modeling stage. 

After that, we move on to the Advanced Modeling stage to see which combination of model and pipeline produces the best results. 

We want to try Cross-Validation on the pipelines because we currently split our data only after the processing is complete. 
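
One way this could be set up is to wrap the scaling step and the model in a scikit-learn `Pipeline`, so the scaler is refit on the training folds only and never sees the held-out fold. The sketch below uses synthetic data, and the choice of `MinMaxScaler`, five folds, and Logistic Regression are assumptions for illustration rather than our final setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the dataset (23 features, ~22% positives)
X, y = make_classification(n_samples=1000, n_features=23, n_informative=8,
                           weights=[0.78, 0.22], random_state=0)

# Scaling lives inside the pipeline, so it is fit separately within each fold
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the default rate similar across folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(model, X, y, cv=cv,
                        scoring=["accuracy", "precision", "recall"])

mean_scores = {metric: scores[f"test_{metric}"].mean()
               for metric in ["accuracy", "precision", "recall"]}
print(mean_scores)
```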

If the results are not strong, we will circle back and create more pipelines that could produce better results. For now, Pipelines 5-10 will go into our advanced models. 