# Competition 2 - Credit Default
**Alex Cattie, Mike Szemenyei, Kerry Clarke**

## Part 1 - Understanding & Exploration
### **Business Understanding - Framing the Analytical Question**
#### - Using existing customer information, how accurately are we able to predict that a customer is going to default?

Full Notebook: [Business Understanding](EDA/Business_Understanding.ipynb)

### **Data Dictionary**
| Column Name  | Contents |
| ------------- | ------------- |
| **X1** | Amount of credit that an individual or family was given |
| **X2** | Gender (1=M; 2=F) | 
| **X3** | Education level (1 = graduate school; 2 = university; 3 = high school; 4 = others; 5, 6, 0 = unknown) | 
| **X4** | Marital Status (1=married; 2=single; 3=others; 0=unknown) | 
| **X5** | Age (year) | 
| **X6** | Repayment status in September 2005 | 
| **X7** | Repayment status in August 2005 | 
| **X8** | Repayment status in July 2005 | 
| **X9** | Repayment status in June 2005 | 
| **X10** | Repayment status in May 2005 | 
| **X11** | Repayment status in April 2005 | 
| **X12** | Amount of bill statement in September 2005 | 
| **X13** | Amount of bill statement in August 2005 | 
| **X14** | Amount of bill statement in July 2005 | 
| **X15** | Amount of bill statement in June 2005 | 
| **X16** | Amount of bill statement in May 2005 | 
| **X17** | Amount of bill statement in April 2005 | 
| **X18** | Amount paid in September 2005 | 
| **X19** | Amount paid in August 2005 | 
| **X20** | Amount paid in July 2005 | 
| **X21** | Amount paid in June 2005 | 
| **X22** | Amount paid in May 2005 | 
| **X23** | Amount paid in April 2005 | 
| **Y** | Target - Default (Yes=1; No=0) | 

Full Notebook: [Data Dictionary](EDA/Data_Dictionary.ipynb)

### **Data Understanding - EDA**

- We began the competition with basic EDA to understand the dataset and the characteristics of its variables. This ranged from printing the head and column names of the dataset to building a correlation heatmap to see whether any of the variables were closely related. We also checked the dimensions of the data and used the describe function to look at summary statistics. We observed 25 columns with 30,000 entries in each, which is what we expected. Next, we created bar charts to see the distributions of the categorical variables: gender, marital status, education level, and whether individuals defaulted. For the eight numeric columns we examined (X1, X5, and X12-X17), we created histograms to show their distributions. The last EDA step was a correlation heatmap of all the variables, which gave us a good idea of how the variables related to one another. We felt that this amount of EDA gave us a good initial handle on the dataset before getting started on the competition.  
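
The first-pass checks described above could look roughly like the sketch below. It uses a small synthetic frame as a stand-in for the competition data, so the values, the number of columns, and the choice of columns are assumptions for illustration only; the real analysis is in the linked notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the credit data (assumption: the real file is not loaded here).
rng = np.random.default_rng(2019)
df = pd.DataFrame({
    "X1": rng.integers(10_000, 500_000, size=200),  # credit amount
    "X2": rng.integers(1, 3, size=200),             # gender code (1 or 2)
    "X5": rng.integers(21, 70, size=200),           # age
    "Y": rng.integers(0, 2, size=200),              # default flag
})

print(df.head())                        # first few rows
print(df.columns.tolist())              # column names
n_rows, n_cols = df.shape               # (rows, columns)
summary = df.describe()                 # count, mean, std, quartiles per column
target_counts = df["Y"].value_counts()  # class balance of the target
corr = df.corr()                        # pairwise correlations, the input to a heatmap
```

The `corr` frame is what a plotting library would render as the correlation heatmap.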


![Gender Counts](EDA/images/gender.PNG)

![Education Level](EDA/images/education.PNG)

![Marital Status](EDA/images/maritalstatus.png)

![Age Histogram](EDA/images/age.PNG)

![Target](EDA/images/y_imbalanced.png)

![X1 and X5](EDA/images/x1x5.PNG)

![X12 and X13](EDA/images/x12x13.PNG)

![X14 and X15](EDA/images/x14x15.PNG)

![X16 and X17](EDA/images/x16x17.PNG)  



Full Notebook: [Data Understanding](EDA/Data_Understanding_EDA.ipynb)

### **Initial Modeling with Raw Data**
- **Decision Tree Results**  
![DT Results](Raw/images/dt_raw_results.PNG)    

- **Logistic Regression Results**  
![Log Reg Results](Raw/images/logreg_raw_results.PNG)    

- We introduced resampling in these models to address the imbalanced nature of the Target that we discovered during EDA
- We also ran a TPOT model on the raw data. The best model it identified with the raw data was a Decision Tree

Notebooks:
- [Decision Tree & Logistic Regression](Raw/decision_tree_logreg.ipynb)
- [TPOT](Raw/TPOT_raw.ipynb)

## Part 2 - Preprocessing 
### **Data Preprocessing**
**Pipeline 1 - Assumes no Normal Distribution**
- IQR
    - Found the IQR for each column and created a function that goes through each column and determines whether each value is an outlier
    - If a value is a high or low outlier, replaced it with the corresponding high or low IQR outlier bound
- Min/Max 
    - Used the updated dataset, with outliers capped, to scale the data
    - Utilized sklearn’s preprocessing package (MinMaxScaler()) to find the min and max values in each column
    - The fit_transform() method then scaled each column to a value from 0 to 1 based on the min and max of that column
- Skew
    - Analyzed the distribution of each variable to account for any skewness 
    - Applied the sqrt or cbrt function to each individual variable to reduce its skew  
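
A minimal sketch of the Pipeline 1 steps above, in the same order (IQR capping, min/max scaling, skew transform). The helper name `cap_iqr_outliers`, the 1.5 x IQR fence, and the synthetic columns are assumptions for illustration; the actual implementation is in the Pipeline 1 notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def cap_iqr_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Clip each column to [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is an assumed fence)."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    return frame.clip(lower=q1 - k * iqr, upper=q3 + k * iqr, axis=1)

# Synthetic right-skewed columns standing in for amount features.
rng = np.random.default_rng(2019)
raw = pd.DataFrame({
    "X1": rng.lognormal(mean=11, sigma=1.0, size=500),
    "X12": rng.lognormal(mean=9, sigma=1.5, size=500),
})

capped = cap_iqr_outliers(raw)                                # 1) cap outliers
scaled = pd.DataFrame(MinMaxScaler().fit_transform(capped),   # 2) scale to [0, 1]
                      columns=raw.columns)
deskewed = np.sqrt(scaled)                                    # 3) sqrt is safe on [0, 1]
```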
  
**Pipeline 2 - Assumes Normal Distribution** 
- Skew
    - Analyzed the distribution of each variable to account for any skewness before addressing outliers
    - Applied the sqrt or cbrt function to each individual variable to reduce its skew
- 3 Standard Deviations
    - Found the upper and lower bounds for each column: 3 standard deviations from the mean of each column
    - Used a for loop to go through each column and determine whether a data value is an outlier and if so, replace it with the lower or upper bound
- Z-Score 
    - Used the updated dataset with outliers capped
    - Utilized sklearn’s preprocessing package (StandardScaler()) to calculate z-scores for each data value
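
The Pipeline 2 order (skew transform, then 3 standard deviation capping, then z-scores) can be sketched as follows. The synthetic data and the choice of cbrt for every column are assumptions; the notebook picks sqrt or cbrt per variable.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic right-skewed columns standing in for amount features.
rng = np.random.default_rng(2019)
raw = pd.DataFrame({
    "X1": rng.lognormal(mean=11, sigma=1.0, size=500),
    "X18": rng.lognormal(mean=7, sigma=1.5, size=500),
})

# 1) Reduce skew first (the cube root also accepts negative values).
deskewed = np.cbrt(raw)

# 2) Cap values that lie more than 3 standard deviations from the column mean.
mean, std = deskewed.mean(), deskewed.std()
lower, upper = mean - 3 * std, mean + 3 * std
capped = deskewed.clip(lower=lower, upper=upper, axis=1)

# 3) Convert every value to a z-score.
z = pd.DataFrame(StandardScaler().fit_transform(capped), columns=raw.columns)
```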


### **Balancing the Target**
- Critical Step to take before modeling
- With a balanced target we can be more confident that our modeling results are xx% better than random guessing
- With an imbalanced target we are less confident because a model can score well simply by predicting the majority class most of the time
- We used the **imblearn package** to help balance the target (see code below)  
![Code for Balancing the Target](images/balancing_target.PNG)
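
For readers who cannot view the screenshot, here is a rough illustration of what balancing a binary target means. This is not the imblearn code from the screenshot: it is a plain NumPy random oversampler written as a stand-in, and the function name `random_oversample` and the toy data are hypothetical.

```python
import numpy as np

def random_oversample(X, y, seed=2019):
    """Duplicate minority-class rows at random until every class matches the largest."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    majority_size = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Draw extra row indices with replacement to fill the gap to the majority size.
        extra = rng.choice(idx, size=majority_size - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    order = rng.permutation(np.concatenate(keep))
    return X[order], y[order]

# Toy imbalanced data: 80 non-defaults, 20 defaults.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)
X_bal, y_bal = random_oversample(X, y)
```

Resampling is normally applied to the training portion only, so that duplicated rows do not leak into the data used for evaluation.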

### **Cross Validation vs. Train_Test_Split** 
- Using the train_test_split, we only have 1 static split of the data for training and testing
    - We built most of our models using train_test_split first and then using cross validation
    - In this presentation we only highlight the models using Cross Validation
- Cross Validation is a better way to split the data for modeling because it gives us k folds for training and testing
    - We used sklearn's **StratifiedKFold** to do Cross Validation 
        - Code will be shown below in the modeling section 
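
A minimal sketch of stratified k-fold cross validation with sklearn, scored on F1 and AUC. The synthetic dataset, the 5 folds, and the Decision Tree used as the example estimator are assumptions; the notebooks may use different settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the preprocessed features.
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.78, 0.22], random_state=2019)

# Each fold keeps roughly the same class ratio as the full dataset.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2019)
scores = cross_validate(DecisionTreeClassifier(random_state=2019),
                        X, y, cv=cv, scoring=["f1", "roc_auc"])

mean_f1 = scores["test_f1"].mean()
mean_auc = scores["test_roc_auc"].mean()
```

Averaging over folds gives a less split-dependent estimate than a single train_test_split.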

### **Feature Engineering**
- Based on our discoveries from EDA we decided to do some feature engineering

- **X3-X5: Education, Marital Status, Age**
    - We created bins for these variables as we saw that they contained values that were not defined in the data dictionary provided
    - By binning these variables we made them easier to work with and we have accounted for all the unknown values in the features  
    
![Creating Bins Code Example](images/edu_bins.PNG)
![Education Bins](images/edu_bins_pic.PNG)
![Marital Status Bins](images/marital_bins_pic.PNG)
![Age Bins](images/age_bins_pic.PNG)
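
As a rough text version of the binning idea, the sketch below folds the undocumented codes into an existing category and cuts age into ranges. The target codes, the age bin edges, and the `_bin` column names are assumptions; the bins actually used are the ones shown in the screenshots above.

```python
import pandas as pd

df = pd.DataFrame({
    "X3": [0, 1, 2, 3, 4, 5, 6],         # education codes, including undocumented ones
    "X4": [0, 1, 2, 3, 1, 2, 3],         # marital status codes
    "X5": [21, 29, 35, 44, 52, 61, 75],  # age in years
})

# Education: fold the undocumented codes 0, 5 and 6 into "others" (4).
df["X3_bin"] = df["X3"].replace({0: 4, 5: 4, 6: 4})

# Marital status: fold the undocumented code 0 into "others" (3).
df["X4_bin"] = df["X4"].replace({0: 3})

# Age: decade-wide bins with illustrative edges; labels=False returns the bin index.
age_edges = [20, 30, 40, 50, 60, 80]
df["X5_bin"] = pd.cut(df["X5"], bins=age_edges, labels=False)
```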
  
  
- **X6-X11: History of Past Payment**
    - Only concerned about whether the customer paid on time or was late 
      - Binning makes the feature binary - much easier to work with
           - 1 = on time
           - 2 = delayed/late  
           
![Binning X6-X11](images/x6bins.PNG)  
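
One possible way to express this binary recoding is shown below. The cutoff is an assumption: it treats any status value of 0 or below as on time and any positive value as late, which may differ from the rule in the screenshot above.

```python
import pandas as pd

status = pd.DataFrame({
    "X6": [-2, -1, 0, 1, 2, 8],
    "X7": [0, 0, -1, 2, 3, 0],
})

# Assumed rule: values <= 0 are on time, positive values are months of delay.
binned = status.gt(0).astype(int) + 1  # 1 = on time, 2 = delayed/late
```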
  
  
- **X12-X17: Amount of Bill Statement**
     - **3 new variables will be created for these features**
          - 1: The first will be the absolute value of the original bill statement amount
          - 2: The second will be a binary variable for whether the original amount was positive or negative. This accounts for the negative values that the first new feature removes and maintains the integrity of the original data
          - 3: The third will address the time series nature of these features and will be the monthly difference from when the data was collected to the time when the payment was made  
  
  
- **X18-X23: Amount of Previous Payment**
     - Only concerned if they paid or not that month - not the specific amount 
          - 0 = no payment
          - 1 = payment
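
The absolute value, sign flag, and paid/not-paid features could be derived along these lines. The new column names and the 0/1 encoding of the sign flag are assumptions, and the third bill statement feature (the monthly difference) is left out because its exact construction is not spelled out here.

```python
import pandas as pd

df = pd.DataFrame({
    "X12": [3913, -200, 0, 15000],  # bill statement amounts (toy values)
    "X18": [0, 2000, 0, 1500],      # amounts paid (toy values)
})

# X12-X17: keep the magnitude, plus a flag that preserves the sign information.
df["X12_abs"] = df["X12"].abs()
df["X12_neg"] = (df["X12"] < 0).astype(int)   # 1 = negative balance (assumed encoding)

# X18-X23: only record whether any payment was made that month.
df["X18_paid"] = (df["X18"] > 0).astype(int)  # 0 = no payment, 1 = payment
```

The same three lines would be repeated, or looped, over the remaining monthly columns.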



Notebooks:
- [Pipeline 1](pipeline_1/p1_full.ipynb)
- [Pipeline 2](pipeline_2/p2_full.ipynb)
- [Feature Engineering](comp2_initial.ipynb)

## Part 3 - Modeling 
- The majority of our time was spent working with various models, from basic models like Decision Tree and Logistic Regression to much more advanced models like XGBoost and Random Forest  
- These base models have no parameters specified other than the random state, plus n_estimators for XGBoost & Random Forest
- The screenshots below are only of Pipeline 1. The Pipeline 2 work is exactly the same and its results can be found in the notebook linked below
- Random State for all models was 2019; XGBoost base n_estimators=150; Random Forest base n_estimators=100
- The same hyperparameters and parameters were used for both Pipeline 1 and Pipeline 2

### **All Models:**
- Decision Tree
- Logistic Regression
- KNeighborsClassifier
- AdaBoost
- XGBoost
- Random Forest

![P1 All Models Results](images/p1cv_USE.PNG)  
![P1 Random Forest](images/random_forest_base_USE.PNG)  

- The results of the models were mediocre and the ones that seemed to perform well are most likely overfitted. 


### **Selected Models With Parameters**
- **Decision Tree**
    - Random State 
        - Utilized the random_state parameter because we ran the program numerous times
        - If you don’t specify a random_state, a different random seed could be used every time the program is run, which could produce unpredictable and unreliable results (random_state=2019)
    - Max Depth
        - Utilized the max_depth parameter to limit how many levels of splits the tree can grow, which controls the complexity of the tree and the risk of overfitting
        - Tweaked to 15
    - Max Features
         - Limits the number of features that the tree can consider for each split in the tree
         - Tweaked to 10
    - Criterion parameter
        - The default for this parameter is the gini index, which measures how likely a randomly chosen sample in a node is to be misclassified
        - The other criterion option is entropy; both are classification criteria, and we chose to keep the default
    - Splitter parameter
        - The default for this parameter is “best”, which means that when the tree splits a node, it will choose the most relevant feature at that time to split on
        - If we had chosen “random” as the splitter parameter, there would be more of a chance of being forced to go more deeply into the tree and introducing more error and less precision

        

- **XGBoost** (Definitions taken from XGBoost classwork notebook in week 10) 
    - Learning_rate
        - Step size shrinkage used to prevent overfitting. Range is `[0,1]`
        - Tweaked to 0.5
    - Max_depth
        - Determines how deeply each tree is allowed to grow during any boosting round
        - Tweaked to 4
    - Colsample_bytree
        - Percentage of features used per tree. High value can lead to overfitting
        - Tweaked to 0.5
    - N_estimators
        - Number of trees you want to build - the more trees you build, the longer the training will be
        - Tweaked to 150
    - Random_state
        - Ensures that results can be reproduced later on. Random_state=2019
        
    
- **Random Forest**
    - Max Depth
        - Defined as the maximum depth of each tree in the forest
        - By limiting to 13 we increase the speed of the model 
    - Max Features
        - Defined as the number of features to consider when looking for the best split
        - By limiting to 8, the model may only consider 8 features at each split. This increases the speed of the model
    - Min Samples Split
        - Defined as the minimum number of samples required to split an internal node
        - By specifying 10, we have told the model that a node must contain at least 10 samples before it can be split further
    - N_Estimators
        - Limited to 100 to improve the run time of the model (it would run and find a solution faster)
    - Random State
        - Specified as 2019 to ensure that results could be reproduced
    - Used this article for help on understanding Random Forest parameters [Random Forest Article](https://medium.com/all-things-ai/in-depth-parameter-tuning-for-random-forest-d67bb7e920d)
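
The parameter values listed above translate to estimator configurations roughly like the sketch below. The variable names are hypothetical, and the XGBoost settings are kept as a plain dictionary (an assumption made so the sketch does not require xgboost to be installed); they would be passed as `XGBClassifier(**xgb_params)`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Decision Tree with the values listed above; criterion and splitter are the defaults.
decision_tree = DecisionTreeClassifier(
    random_state=2019, max_depth=15, max_features=10,
    criterion="gini", splitter="best",
)

# Random Forest with the values listed above.
random_forest = RandomForestClassifier(
    n_estimators=100, max_depth=13, max_features=8,
    min_samples_split=10, random_state=2019,
)

# Keyword arguments intended for xgboost.XGBClassifier.
xgb_params = {
    "learning_rate": 0.5, "max_depth": 4, "colsample_bytree": 0.5,
    "n_estimators": 150, "random_state": 2019,
}
```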
    
![P1 Selected Models Results](images/p1cv_selected_USE.PNG)
![P1 Random Forest Parameters](images/random_forest_tweaked_USE.PNG)  


### **Consolidated Best Results**
- **Pipeline 1**
    - Decision Tree: F1=0.74115, AUC=0.81325
    - XGBoost: F1=0.75409, AUC=0.82760
    - Random Forest: F1=0.76286, AUC=0.85512
- **Pipeline 2**
    - Decision Tree: F1=0.74852, AUC=0.81803
    - XGBoost: F1=0.75599, AUC=0.82677
    - Random Forest: F1=0.76941, AUC=0.85767


Notebooks:
- [Pipeline 1 Models](pipeline_1/p1_CVmodels.ipynb)
- [Pipeline 1 Random Forest](p1_Random_Forest.ipynb)
- [Pipeline 2 Models](pipeline_2/p2_CVmodels.ipynb)
- [Pipeline 2 Random Forest](p2_Random_Forest.ipynb)

## Part 4  - Explanation of Best Model 
### **Best Model: XGBoost - Pipeline 1**
- **Original Results**
    - **F1:** 0.65837
    - **AUC:** 0.74466  
    

- **"Tweaking" of the model's hyperparameters**
    - **Colsample_bytree**
        - The first parameter that we tried tuning was colsample_bytree, which sets the percentage of features used each time the model builds a tree. We were well aware that raising this too high could potentially cause over-fitting. We started with it at 0.5 and changed it in 0.1 increments all the way up to 1.0 and did not see our score change that much. Because of this we decided to keep it at 0.5 to ensure that we do not overfit the model
    - **Learning Rate**
        - The second parameter that we tuned was learning rate. We initially started with it at 0.5. We adjusted it in 0.1 increments between 0.4 and 1.0 and did not find a noticeable difference. We ended up keeping 0.5 for our model
    - **Max Depth**
        - Next, we tuned max_depth, which is how deep each of the trees is allowed to grow when the model is run. This was one of the two parameters that we found could most impact the model. When we increased the depth, the score increased greatly. While it was great to get high scores, we were concerned that the model might become over-fitted with a higher max_depth. We chose max_depth=4
    - **N_Estimators**
        - The last parameter that we tuned was n_estimators, which is the number of trees that we wanted the model to build. We realized that the more trees we tried to create, the longer the model would take to run. We started with 100 trees and increased the number all the way up to 250. Each time we increased the number of estimators we received a slightly higher score; however, to keep the run time reasonable we decided that the best number of estimators was 150 
  

- **Fine Tuned Results**
    - **F1:** 0.75409
    - **AUC:** 0.82760
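
The one-parameter-at-a-time tuning described above can be sketched as a small sweep. This example swaps in sklearn's GradientBoostingClassifier as a stand-in for XGBoost and uses synthetic data, 3 folds, and a short list of depths, all of which are assumptions chosen to keep the example quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=2019)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=2019)

# Sweep max_depth while holding the other hyperparameters fixed.
results = {}
for depth in (2, 4, 6):
    model = GradientBoostingClassifier(
        max_depth=depth, n_estimators=20, learning_rate=0.5, random_state=2019
    )
    results[depth] = cross_val_score(model, X, y, cv=cv, scoring="f1").mean()

best_depth = max(results, key=results.get)  # depth with the highest mean F1
```

A higher cross-validated score at a larger depth does not rule out overfitting, which is why the final choice above stayed at a moderate depth.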


- **Feature Importance**
![XGBoost Feature Importance](images/XGB_featureimport.PNG)
- The most important features were the absolute values of the bill statement amounts and the original amount of credit given to the individual or family. These features can be used to help identify the customers that may be more likely to default on their payments

- **We chose XGBoost as the best model because it showed the largest margin of improvement over the baseline model, and we felt that the Random Forest was too prone to over-fitting even though its results were higher.** 


- Notebooks:
    - [Pipeline 1 XGBoost](p1_XGBoost.ipynb)
    - [Pipeline 2 XGBoost](p2_XGBoost.ipynb)

## Part 5 - TPOT
- TPOT does preprocessing and modeling automatically 
- TPOT also optimizes **performance** rather than **reproducibility**
- It's important to know how to balance the two aspects
    - On the one hand, with our own pipelines we can explain the process start to finish, but we may sacrifice performance and risk overfitting 
    - On the other hand, with TPOT we cannot explain the preprocessing or the modeling, but we tend to get better results
- According to TPOT, **Logistic Regression** is the best performing model
- Notebooks:
    - [Pipeline 1 TPOT](pipeline_1/p1_TPOT.ipynb)
    - [Pipeline 2 TPOT](pipeline_2/p2_TPOT.ipynb)

## Conclusion

### Original Question: How accurately are we able to predict that a customer is going to default?

### Our Results: F1: 0.75, AUC: 0.82 