# Competition 2 - Credit Default
**Alex Cattie, Mike Szemenyei, Kerry Clarke**

## Part 1 - Understanding & Exploration
### **Business Understanding - Framing the Analytical Question**
- How accurately are we able to predict that a customer is going to default?

Full Notebook: [Business Understanding](EDA/Business_Understanding.ipynb)

### **Data Dictionary**
| Column Name  | Contents |
| ------------- | ------------- |
| **X1** | Amount of credit that an individual or family was given |
| **X2** | Gender (1=M; 2=F) | 
| **X3** | Education level (1 = graduate school; 2 = university; 3 = high school; 4 = others; 5, 6, 0 = unknown) | 
| **X4** | Marital Status (1=married; 2=single; 3=others; 0=unknown) | 
| **X5** | Age (year) | 
| **X6** | Repayment status in September 2005 | 
| **X7** | Repayment status in August 2005 | 
| **X8** | Repayment status in July 2005 | 
| **X9** | Repayment status in June 2005 | 
| **X10** | Repayment status in May 2005 | 
| **X11** | Repayment status in April 2005 | 
| **X12** | Amount of bill statement in September 2005 | 
| **X13** | Amount of bill statement in August 2005 | 
| **X14** | Amount of bill statement in July 2005 | 
| **X15** | Amount of bill statement in June 2005 | 
| **X16** | Amount of bill statement in May 2005 | 
| **X17** | Amount of bill statement in April 2005 | 
| **X18** | Amount paid in September 2005 | 
| **X19** | Amount paid in August 2005 | 
| **X20** | Amount paid in July 2005 | 
| **X21** | Amount paid in June 2005 | 
| **X22** | Amount paid in May 2005 | 
| **X23** | Amount paid in April 2005 | 
| **Y** | Target - Default (Yes=1; No=0) | 

Full Notebook: [Data Dictionary](EDA/Data_Dictionary.ipynb)

### **Data Understanding - EDA**


![Gender Counts](EDA/images/gender.PNG)

![Education Level](EDA/images/education.PNG)

![Marital Status](EDA/images/maritalstatus.png)

![Age Histogram](EDA/images/age.PNG)

![Target](EDA/images/y_imbalanced.png)

![X1 and X5](EDA/images/x1x5.PNG)

![X12 and X13](EDA/images/x12x13.PNG)

![X14 and X15](EDA/images/x14x15.PNG)

![X16 and X17](EDA/images/x16x17.PNG)  



Full Notebook: [Data Understanding](EDA/Data_Understanding_EDA.ipynb)

### **Initial Modeling with Raw Data**
- **Decision Tree Results**  
![DT Results](Raw/images/dt_raw_results.PNG)    

- **Logistic Regression Results**  
![Log Reg Results](Raw/images/logreg_raw_results.PNG)    

- We introduced resampling in these models to address the imbalanced nature of the Target that we discovered during EDA.  

Notebooks:
- [Decision Tree & Logistic Regression](Raw/decision_tree_logreg.ipynb)
- [TPOT](Raw/TPOT_raw.ipynb)

## Part 2 - Preprocessing 
### **Data Preprocessing**
**Pipeline 1 - Assumes no Normal Distribution**
- IQR
    - Computed the IQR for each column and wrote a function that checks every value in the column against the outlier bounds
    - Any value below the lower bound or above the upper bound is replaced with that bound, so outliers are capped rather than dropped
- Min/Max 
    - Applied to the dataset after outliers were capped, to put every column on the same scale
    - Used sklearn's preprocessing package (`MinMaxScaler()`) to find the min and max values in each column
    - The `fit_transform()` method then rescaled each column to the range 0 to 1 based on that column's min and max
- Skew
    - Reviewed the distribution of each variable to check for skewness
    - Applied the sqrt or cbrt function to individual variables to reduce their skew
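
The three Pipeline 1 steps can be sketched roughly as follows. This is a minimal illustration on made-up data, not the notebook code: the helper name `cap_outliers_iqr`, the 1.5 multiplier on the IQR, and the choice of which column gets the sqrt transform are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def cap_outliers_iqr(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at the nearest bound.

    k=1.5 is the conventional multiplier and an assumption here.
    """
    capped = df.copy()
    for col in capped.columns:
        q1, q3 = capped[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        capped[col] = capped[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return capped


# Toy data standing in for two columns of the credit dataset (made-up values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "X1": rng.lognormal(mean=11, sigma=1, size=200),  # right-skewed amounts
    "X5": rng.integers(21, 70, size=200).astype(float),
})

# Step 1: cap outliers at the IQR bounds
capped = cap_outliers_iqr(toy)

# Step 2: rescale every column to the range 0 to 1
scaled = pd.DataFrame(MinMaxScaler().fit_transform(capped), columns=capped.columns)

# Step 3: reduce right skew with sqrt (np.cbrt is the alternative mentioned above)
unskewed = scaled.copy()
unskewed["X1"] = np.sqrt(scaled["X1"])
```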
  
**Pipeline 2 - Assumes Normal Distribution**  

### **Balancing the Target**
- Critical Step to take before modeling
- With a balanced target we can be more confident that our modeling results are xx% better than random guessing
- With an imbalanced target we are less confident, because a model can score well simply by predicting the majority class 
- We used the **imblearn package** to help balance the target (see code below)  
![Code for Balancing the Target](images/balancing_target.PNG)
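
For readers who cannot view the image, here is a rough sketch of the idea behind balancing. It is not the code from the image: it uses `sklearn.utils.resample` as a stand-in so that it runs without imblearn, and the helper name `oversample_minority` and the toy data are hypothetical. imblearn's `RandomOverSampler` offers a comparable random oversampling.

```python
import pandas as pd
from sklearn.utils import resample


def oversample_minority(df: pd.DataFrame, target: str = "Y", random_state: int = 42) -> pd.DataFrame:
    """Randomly duplicate minority-class rows until both classes are the same size.

    Assumes a binary target that is actually imbalanced.
    """
    counts = df[target].value_counts()
    majority = df[df[target] == counts.idxmax()]
    minority = df[df[target] == counts.idxmin()]
    # Sample the minority class with replacement up to the majority size
    minority_up = resample(
        minority, replace=True, n_samples=len(majority), random_state=random_state
    )
    # Shuffle so the classes are not grouped together
    balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=random_state)
    return balanced.reset_index(drop=True)


# Toy data: 8 non-defaults and 2 defaults (made-up values)
toy = pd.DataFrame({"X1": range(10), "Y": [0] * 8 + [1] * 2})
balanced = oversample_minority(toy)
print(balanced["Y"].value_counts())
```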

### **Cross Validation vs. Train_Test_Split** 
- Using train_test_split, we only have one static split of the data for training and testing
- Cross Validation is a better way to split the data for modeling because it gives us k folds for training and testing
    - Cross Validation also reduces the "leaking" of the testing data into the training data
    - We used sklearn's **StratifiedKFold** class to do Cross Validation 
        - Code will be shown below in the modeling section 
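
As a quick illustration of stratified k-fold cross validation, here is a minimal sketch on synthetic data. The dataset, the 5 folds, and the choice of a Decision Tree are assumptions for the example, not the settings from our notebooks.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the credit dataset
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.78, 0.22], random_state=42
)

# Each fold keeps roughly the same class ratio as the full dataset
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_validate(
    DecisionTreeClassifier(random_state=42),
    X,
    y,
    cv=cv,
    scoring=["f1", "roc_auc"],
)

mean_f1 = scores["test_f1"].mean()
mean_auc = scores["test_roc_auc"].mean()
print(f"F1: {mean_f1:.3f}  AUC: {mean_auc:.3f}")
```

Every row is used for testing exactly once across the folds, which is what a single `train_test_split` cannot give us.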

### **Feature Engineering**
- Based on our discoveries from EDA we decided to do some feature engineering

- **X3-X5: Education, Marital Status, Age**
    - We created bins for these variables as we saw that they contained values that were not defined in the data dictionary provided
    - By binning these variables we made them easier to work with and we have accounted for all the unknown values in the features  
    
![Creating Bins Code Example](images/edu_bins.PNG)
![Education Bins](images/edu_bins_pic.PNG)
![Marital Status Bins](images/marital_bins_pic.PNG)
![Age Bins](images/age_bins_pic.PNG)
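
A rough sketch of this kind of binning with pandas is below. The specific choices (folding education codes 0, 5 and 6 into 4, folding marital status 0 into 3, and the age bin edges) are assumptions for illustration; the bins we actually used are the ones shown in the images above.

```python
import pandas as pd

# Toy rows covering every education code, including the undocumented ones
toy = pd.DataFrame({
    "X3": [1, 2, 3, 4, 5, 6, 0],
    "X4": [1, 2, 3, 0, 1, 2, 3],
    "X5": [22, 29, 35, 41, 58, 63, 75],
})

# Education: fold the unknown codes into "others" (assumed mapping)
toy["X3_bin"] = toy["X3"].replace({0: 4, 5: 4, 6: 4})

# Marital status: fold the unknown code into "others" (assumed mapping)
toy["X4_bin"] = toy["X4"].replace({0: 3})

# Age: bucket into ranges; the edges here are hypothetical
age_edges = [20, 30, 40, 50, 60, 80]
toy["X5_bin"] = pd.cut(toy["X5"], bins=age_edges, labels=False)

print(toy)
```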
  
  
- **X6-X11: History of Past Payment**
    - Only concerned about whether the customer paid on time or was late 
      - Binning makes the feature binary - much easier to work with
           - 1 = on time
           - 2 = delayed/late  
           
![Binning X6-X11](images/x6bins.PNG)  
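
A minimal sketch of the binary recoding follows. It assumes that a repayment status of 0 or below means the customer was on time and that any positive value means a delay; that cutoff is an assumption, since the data dictionary above does not define the status codes.

```python
import numpy as np
import pandas as pd

PAY_COLS = [f"X{i}" for i in range(6, 12)]  # X6 through X11

# Toy repayment statuses (made-up values)
toy = pd.DataFrame(
    [[-1, 0, 2, -2, 0, 1],
     [0, 0, 0, 0, 0, 0]],
    columns=PAY_COLS,
)

binned = toy.copy()
# 1 = on time, 2 = delayed/late (assumed cutoff: status <= 0 is on time)
binned[PAY_COLS] = np.where(toy[PAY_COLS] <= 0, 1, 2)

print(binned)
```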
  
  
- **X12-X17: Amount of Bill Statement**
     - **3 new variables will be created for these features**
          - 1: The first will be the absolute value of the original amount
          - 2: The second will be a binary variable recording whether the original amount was positive or negative. It preserves the sign information that the absolute value in the first feature discards, which maintains the integrity of the original data
          - 3: The third will address the time series nature of these features: the difference in months between when the data was collected and when the payment was made  
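
The three derived variables could look something like this. It is a sketch under stated assumptions: the column suffixes are hypothetical, and `COLLECTION_MONTH = 10` (October 2005) is a guess at the collection date used only to make the month difference concrete.

```python
import pandas as pd

# Statement month in 2005 for each bill column, taken from the data dictionary
STATEMENT_MONTH = {"X12": 9, "X13": 8, "X14": 7, "X15": 6, "X16": 5, "X17": 4}
COLLECTION_MONTH = 10  # assumption: data collected in October 2005

# Toy bill amounts, including a negative balance (made-up values)
toy = pd.DataFrame({"X12": [3913.0, -200.0], "X13": [3102.0, 0.0]})

features = pd.DataFrame(index=toy.index)
for col in toy.columns:
    # 1: magnitude of the amount
    features[f"{col}_abs"] = toy[col].abs()
    # 2: binary flag that keeps the sign information dropped by abs()
    features[f"{col}_is_negative"] = (toy[col] < 0).astype(int)
    # 3: months between the statement and the assumed collection month
    features[f"{col}_months_since"] = COLLECTION_MONTH - STATEMENT_MONTH[col]

print(features)
```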
  
  
- **X18-X23: Amount of Previous Payment**
     - Only concerned if they paid or not that month - not the specific amount 
          - 0 = no payment
          - 1 = payment
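
A short sketch of the paid/not-paid flag, assuming that any amount greater than zero counts as a payment (the toy values are made up):

```python
import pandas as pd

PAID_COLS = [f"X{i}" for i in range(18, 24)]  # X18 through X23

# Toy payment amounts (made-up values)
toy = pd.DataFrame(
    [[0, 689, 0, 0, 0, 0],
     [1500, 1000, 1000, 0, 1000, 2000]],
    columns=PAID_COLS,
)

# 0 = no payment, 1 = payment
paid_flags = (toy[PAID_COLS] > 0).astype(int)

print(paid_flags)
```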


Notebooks:
- [Pipeline 1](pipeline_1/p1_full.ipynb)
- [Pipeline 2](pipeline_2/p2_full.ipynb)
- [Feature Engineering](comp2_initial.ipynb)

## Part 3 - Modeling 
- The majority of our time was spent working with various models, from basic models like Decision Tree and Logistic Regression to more advanced models like XGBoost and Random Forest  
- These models have no parameters specified other than the random state and n_estimators for XGBoost & Random Forest

### **All Models:**
- Decision Tree
- Logistic Regression
- KNeighborsClassifier
- AdaBoost
- XGBoost
- Random Forest
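
A baseline comparison loop along these lines could look like the sketch below. It runs on synthetic data, so the scores will not match the results images, and XGBoost is left out only because it lives in the separate xgboost package. The 5 folds and `n_estimators=100` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the preprocessed credit dataset
X, y = make_classification(n_samples=400, n_features=10, random_state=42)

# Default settings apart from random_state (and n_estimators for Random Forest)
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(random_state=42),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=["f1", "roc_auc"])
    results[name] = {
        "f1": scores["test_f1"].mean(),
        "auc": scores["test_roc_auc"].mean(),
    }

for name, metrics in results.items():
    print(f"{name}: F1={metrics['f1']:.5f}, AUC={metrics['auc']:.5f}")
```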

![P1 All Models Results](images/p1cv_all_results.PNG)  
![P1 Random Forest](images/p1_rf.PNG)


### **Selected Models With Parameters**
- Decision Tree
- XGBoost
- Random Forest  
    
![P1 Selected Models Results](images/p1cv_selected_results.PNG)
![P1 Random Forest Parameters](images/p1_params_fr.PNG)  


### **Consolidated Best Results**
- **Pipeline 1**
    - Decision Tree: F1=0.74115, AUC=0.81325
    - XGBoost: F1=0.77072, AUC=0.84540
    - Random Forest: F1=0.76286, AUC=0.85512
- **Pipeline 2**
    - Decision Tree: F1=0.74852, AUC=0.81803
    - XGBoost: F1=0.76949, AUC=0.84001
    - Random Forest: F1=0.76941, AUC=0.85767
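
To compare the six results at a glance, the numbers above can be ranked with a few lines of Python. The scores are copied from the list; the tuple layout is just one convenient way to hold them.

```python
# (pipeline, model, F1, AUC) copied from the consolidated results above
results = [
    ("Pipeline 1", "Decision Tree", 0.74115, 0.81325),
    ("Pipeline 1", "XGBoost", 0.77072, 0.84540),
    ("Pipeline 1", "Random Forest", 0.76286, 0.85512),
    ("Pipeline 2", "Decision Tree", 0.74852, 0.81803),
    ("Pipeline 2", "XGBoost", 0.76949, 0.84001),
    ("Pipeline 2", "Random Forest", 0.76941, 0.85767),
]

best_f1 = max(results, key=lambda row: row[2])
best_auc = max(results, key=lambda row: row[3])

print(f"Highest F1:  {best_f1[1]} ({best_f1[0]}), F1={best_f1[2]}")
print(f"Highest AUC: {best_auc[1]} ({best_auc[0]}), AUC={best_auc[3]}")
```

The highest F1 belongs to XGBoost on Pipeline 1, while the highest AUC belongs to Random Forest on Pipeline 2.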


Notebooks:
- [Pipeline 1 Models](pipeline_1/p1_CVmodels.ipynb)
- [Pipeline 1 Random Forest](p1_Random_Forest.ipynb)
- [Pipeline 2 Models](pipeline_2/p2_CVmodels.ipynb)
- [Pipeline 2 Random Forest](p2_Random_Forest.ipynb)

## Part 4 - Explanation of Best Model 
### **Best Model: XGBoost - Pipeline 1**
- Original Results
    - **F1:** 0.65837
    - **AUC:** 0.74466  
    

- "Tweaking" of the model's hyperparameters (Definitions taken from week 10 XGBoost classwork notebook)
    - **learning_rate:** step size shrinkage used to prevent overfitting. Range is `[0,1]`
    - **max_depth:** determines how deeply each tree is allowed to grow during any boosting round
    - **colsample_bytree:** percentage of features used per tree. High value can lead to overfitting
    - **n_estimators:** number of trees you want to build - the more trees you build, the longer the training will be
    - **random_state:** ensures that results can be reproduced later on 
    
  
- Fine Tuned Results
    - **F1:** 0.77072
    - **AUC:** 0.84540


- Why it may have been the best model
    - Explain  
  

- While the 0.84 AUC is a bit high and may indicate overfitting, we went with this model because it had a higher F1 score than the other models. XGBoost also showed the largest margin of improvement over the baseline model
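
The size of that improvement can be worked out directly from the original and fine-tuned scores reported above:

```python
# Scores copied from the Original Results and Fine Tuned Results above
baseline = {"F1": 0.65837, "AUC": 0.74466}
tuned = {"F1": 0.77072, "AUC": 0.84540}

# Absolute gain for each metric, rounded to the precision of the inputs
gains = {metric: round(tuned[metric] - baseline[metric], 5) for metric in baseline}

for metric, gain in gains.items():
    relative = gain / baseline[metric]
    print(f"{metric}: +{gain:.5f} absolute ({relative:.1%} relative to baseline)")
```

Tuning added roughly 0.11 to F1 and roughly 0.10 to AUC.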

## Part 5 - TPOT
- TPOT does preprocessing and modeling automatically 
- TPOT also optimizes **performance** rather than **reproducibility**
- It's important to know how to balance the two aspects
    - On the one hand, we can explain the process from start to finish, but we may sacrifice performance and risk overfitting 
    - On the other hand, with TPOT we cannot explain the preprocessing or the modeling, but we tend to get better results
- According to TPOT, **Logistic Regression** is the best performing model
- Notebooks:
    - [Pipeline 1 TPOT](pipeline_1/p1_TPOT.ipynb)
    - [Pipeline 2 TPOT](pipeline_2/p2_TPOT.ipynb)

## Conclusion

### Original Question: How accurately are we able to predict that a customer is going to default?

### Our Results: F1: 77%, AUC: 84% 