***

# Credit Default Modelling

***

This Jupyter Notebook compares different models for time series forecasting in the financial sector.<br>
The goal of this exercise is to XXXXX<br>
The main structure of this notebook is as follows:

1. **[Service factory](#service_factory)**: Python functions for general data and system manipulation
<br>

2. **[Exploratory data analysis](#eda)**: 

    2.1: Start by *gathering financial time series data*: this could be daily stock prices, currency exchange rates, or other related data. For this project, the data can be found [here](https://www.kaggle.com/datasets/szrlee/stock-time-series-20050101-to-20171231?resource=download).
    
    2.2: *Normalize or standardize the data* to make it suitable for machine learning algorithms: create lag features, moving averages, and other relevant features to aid in prediction<br>
    
    2.3: *Random Forest can be particularly useful for this task*: its importance score can help identify which features are most informative for predicting future values.
<br>

3. **[Model Selection](#model_selection)**:
<br>

4. **[Hyperparameter Optimization](#hyperparameter_optimization)**:
<br>

5. **[Performance Metrics and Results](#performance_metrics_and_results)**:
<br>

### Service Factory <a class="anchor" id="service_factory"></a>

### Exploratory data analysis <a class="anchor" id="eda"></a>

In [2]:
import warnings
warnings.filterwarnings("ignore")

In [3]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt



In [4]:
os.chdir('..')
print(f'Main path for this project is {os.getcwd()}')

Main path for this project is /Users/andrea/Documents/Github/Capstone


In [5]:
# Import the data and perform EDA

credit_data = pd.read_csv('./data/UCI_Credit_Card.csv')
credit_data = credit_data.rename(columns = {'default.payment.next.month':'default'})

In [6]:
# Let's see the shape and the column datatypes of the credit dataset

credit_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   ID         30000 non-null  int64  
 1   LIMIT_BAL  30000 non-null  float64
 2   SEX        30000 non-null  int64  
 3   EDUCATION  30000 non-null  int64  
 4   MARRIAGE   30000 non-null  int64  
 5   AGE        30000 non-null  int64  
 6   PAY_0      30000 non-null  int64  
 7   PAY_2      30000 non-null  int64  
 8   PAY_3      30000 non-null  int64  
 9   PAY_4      30000 non-null  int64  
 10  PAY_5      30000 non-null  int64  
 11  PAY_6      30000 non-null  int64  
 12  BILL_AMT1  30000 non-null  float64
 13  BILL_AMT2  30000 non-null  float64
 14  BILL_AMT3  30000 non-null  float64
 15  BILL_AMT4  30000 non-null  float64
 16  BILL_AMT5  30000 non-null  float64
 17  BILL_AMT6  30000 non-null  float64
 18  PAY_AMT1   30000 non-null  float64
 19  PAY_AMT2   30000 non-null  float64
 20  PAY_AM

In [7]:
# Let's see if there are null values...

credit_data.isnull().sum()

ID           0
LIMIT_BAL    0
SEX          0
EDUCATION    0
MARRIAGE     0
AGE          0
PAY_0        0
PAY_2        0
PAY_3        0
PAY_4        0
PAY_5        0
PAY_6        0
BILL_AMT1    0
BILL_AMT2    0
BILL_AMT3    0
BILL_AMT4    0
BILL_AMT5    0
BILL_AMT6    0
PAY_AMT1     0
PAY_AMT2     0
PAY_AMT3     0
PAY_AMT4     0
PAY_AMT5     0
PAY_AMT6     0
default      0
dtype: int64

The dataset is complete: **no column contains missing (NA) values.**

In [8]:
credit_data.head(3)

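| | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000.0 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | 2 | 120000.0 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272.0 | 3455.0 | 3261.0 | 0.0 | 1000.0 | 1000.0 | 1000.0 | 0.0 | 2000.0 | 1 |
| 2 | 3 | 90000.0 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |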


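We see that all the data is **numeric**, so we do not need any **categorical-to-numeric** encoding. Note that `SEX`, `EDUCATION` and `MARRIAGE` are categorical variables that are already stored as integer codes.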

In [9]:
# The problem we are going to solve is imbalanced: let's see how imbalanced it is

credit_data['default'].value_counts(normalize=True)

default
0    0.7788
1    0.2212
Name: proportion, dtype: float64

We see that **77.88% of our data are good payers, while the remaining 22.12% are "defaulters"**. <br>
Since our dataset has 30,000 observations, this means we have ca. 6,636 bad payers.
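
The defaulter count follows directly from the class proportions printed above. As a quick check, using only the figures shown in this notebook (the variable names are illustrative):

```python
# Figures taken from the outputs above
n_observations = 30_000
default_share = 0.2212

# Number of defaulters implied by the class proportion
n_defaulters = round(n_observations * default_share)
n_good_payers = n_observations - n_defaulters

# How many good payers there are for each defaulter
imbalance_ratio = n_good_payers / n_defaulters

print(n_defaulters, n_good_payers, round(imbalance_ratio, 2))
# 6636 23364 3.52
```

In other words, there are roughly 3.5 good payers for every defaulter.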

Our data is not as imbalanced as we might have expected at the beginning, which means, from a business point of view, that the institution holding this dataset has many bad payers overall. <br>

To improve the accuracy of our predictions, we are going to leverage the following techniques to balance this dataset:

1. [SMOTE](https://learn.microsoft.com/en-us/azure/machine-learning/component-reference/smote?view=azureml-api-2)
2. [Oversampling](https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis)
3. [Undersampling](https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis)

The code below applies each of these techniques to the training data.

In [10]:
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

In [11]:
# Step 1: Split the data into training and testing sets

X = credit_data.drop('default', axis=1)
y = credit_data['default']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

In [None]:
# Step 2: Apply SMOTE

smote            = SMOTE(random_state=44)
X_smote, y_smote = smote.fit_resample(X_train, y_train)
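
For intuition, SMOTE builds each synthetic sample by interpolating between a minority-class point and one of its nearest minority-class neighbours. The sketch below illustrates only that interpolation step on toy 2-D points. The helper `smote_like_sample` and the data are hypothetical, and the sketch is not meant to mirror how `imblearn` implements `SMOTE` internally.

```python
import math
import random


def smote_like_sample(points, k, rng):
    """Create one synthetic point between a random point and one of its k nearest neighbours."""
    base = rng.choice(points)
    # Rank the other points by Euclidean distance to the base point
    others = [p for p in points if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    # Interpolate: new = base + lam * (neighbour - base), with lam in [0, 1)
    lam = rng.random()
    return tuple(b + lam * (n - b) for b, n in zip(base, neighbour))


# Toy minority-class points (hypothetical, for illustration only)
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
rng = random.Random(44)
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(5)]

for point in synthetic:
    print(point)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region already covered by the minority class instead of being exact duplicates.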

In [20]:
# Step 3: apply random oversampling

oversampler                  = RandomOverSampler(random_state=123)
X_oversampled, y_oversampled = oversampler.fit_resample(X_train, y_train)

print(X_oversampled.shape)
y_oversampled.value_counts()

(37336, 24)


default
0    18668
1    18668
Name: count, dtype: int64

In [19]:
# Step 4: apply random undersampling

undersampler                   = RandomUnderSampler(random_state=123)
X_undersampled, y_undersampled = undersampler.fit_resample(X_train, y_train)

print(X_undersampled.shape)
y_undersampled.value_counts()

(10664, 24)


default
0    5332
1    5332
Name: count, dtype: int64

To summarise the code above:

>- **SMOTE**: we applied the SMOTE technique to generate synthetic samples of the minority class in the training data, helping to balance the class distribution.
>- **Random Oversampling**: we oversampled the minority class by randomly selecting samples with replacement, increasing the number of minority class samples in the training data.
>- **Random Undersampling**: we undersampled the majority class by randomly removing samples, reducing the number of majority class samples in the training data to balance the class distribution.
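
The shapes printed above can be reproduced from the training-set class counts alone. The following sketch uses plain Python rather than `imblearn`, with a label list built from the counts shown in the outputs (18,668 good payers and 5,332 defaulters). The function names are hypothetical and the sketch only returns row indices.

```python
import random
from collections import Counter


def random_oversample(labels, rng):
    """Add randomly chosen minority indices (with replacement) until classes match."""
    counts = Counter(labels)
    majority_label, majority_n = counts.most_common(1)[0]
    indices = list(range(len(labels)))
    for label, n in counts.items():
        if label == majority_label:
            continue
        pool = [i for i, y in enumerate(labels) if y == label]
        indices += rng.choices(pool, k=majority_n - n)
    return indices


def random_undersample(labels, rng):
    """Keep a random subset (without replacement) of each class, sized to the smallest class."""
    counts = Counter(labels)
    minority_n = min(counts.values())
    indices = []
    for label in counts:
        pool = [i for i, y in enumerate(labels) if y == label]
        indices += rng.sample(pool, minority_n)
    return indices


# Class counts of the training split, as printed in the outputs above
y_train_toy = [0] * 18_668 + [1] * 5_332
rng = random.Random(123)

over_idx = random_oversample(y_train_toy, rng)
under_idx = random_undersample(y_train_toy, rng)

print(len(over_idx), Counter(y_train_toy[i] for i in over_idx))
print(len(under_idx), Counter(y_train_toy[i] for i in under_idx))
```

The resulting lengths are 37,336 and 10,664 rows, matching the shapes reported by `RandomOverSampler` and `RandomUnderSampler`. Undersampling reaches balance by discarding 13,336 majority-class rows, whereas oversampling keeps every row and repeats existing defaulters.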

The "undersampled dataset" has the best final ratio since 

### Model Selection <a class="anchor" id="model_selection"></a>

### Hyperparameter Optimization <a class="anchor" id="hyperparameter_optimization"></a>

### Performance Metrics and Results <a class="anchor" id="performance_metrics_and_results"></a>

In [None]:
import sys
print(f'The current Python interpreter is {sys.executable}')

In [None]:
import pandas as pd
import numpy as np
import xgboost
