| Name | Admin No | Class |
| --- | --- | --- |
| Goh Rui Zhuo | 2222329 | DAAA/2A/05 |

# __Default Payment Classification Model__

## <font color='#63B2CB'> __Table of Contents__</font>
1. [Problem Statement](#1)     
1. [Background Research](#2) 
1. [Import the library](#3) 
1. [Import the data](#4) 
1. [Exploratory Data Analysis](#5) 
1. [Feature Engineering](#6) 
1. [Data Preprocessing](#7) 
1. [Model (Baseline)](#8) 
1. [Analysis of the model](#9) 
1. [Advanced Model](#10) 
1. [Hyperparameter Tune](#11) 
1. [Stacking Classifier](#15) 
1. [Final Model](#12) 
1. [Feature Importance](#13) 
1. [Summary](#14) 

<a class="anchor" id="1"></a>
## <font color='#63B2CB'> __Problem Statement / Objective__</font>



>To predict which customer will have default payment in the next month.

> Output variable: Default payment next month , 0 means paid while 1 means not paid


![Bank2](bank.jpg)

<a class="anchor" id="2"></a>
## <font color='#63B2CB'>__Background research__</font>

- In "default payment", default means failing to meet the legal obligations (or conditions) of a loan, for example when a home buyer fails to make a mortgage payment, or when a corporation or government fails to pay a bond that has reached maturity. A national or sovereign default is the failure or refusal of a government to repay its national debt.

- In addition, each credit card has a credit limit that determines how much the holder can spend on it. The limit can be based on annual income, credit report and other factors.

- Overall, the aim is to predict, from a customer's information, whether that customer will be able to repay properly.

<a class="anchor" id="3"></a>
## <font color='#63B2CB'>__Import Libraries__</font>

In [None]:
!pip install pandas-profiling
!pip install imblearn

Importing all the required libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas_profiling import ProfileReport
from termcolor import colored

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split, cross_validate, StratifiedKFold, learning_curve ,RepeatedStratifiedKFold
from sklearn.feature_selection import SelectFromModel, RFE, SelectKBest, chi2, mutual_info_classif,f_classif
from sklearn.metrics import  roc_curve, make_scorer, fbeta_score, confusion_matrix, classification_report,RocCurveDisplay,ConfusionMatrixDisplay
from sklearn.tree import plot_tree
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, RobustScaler, OneHotEncoder, LabelEncoder, OrdinalEncoder, KBinsDiscretizer,MinMaxScaler,Normalizer
from sklearn.impute import SimpleImputer 
from imblearn.over_sampling import ADASYN
from sklearn.decomposition import PCA
from sklearn.dummy import DummyClassifier
from sklearn.preprocessing import FunctionTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier , ExtraTreesClassifier,RandomForestClassifier ,AdaBoostClassifier,HistGradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB,CategoricalNB
from sklearn.linear_model import RidgeClassifierCV,RidgeClassifier,SGDClassifier,LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV, validation_curve
from imblearn.over_sampling import SMOTE,KMeansSMOTE,SMOTEN
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import StackingClassifier
from imblearn.over_sampling import SMOTEN,ADASYN, RandomOverSampler
from imblearn.combine import SMOTEENN
from sklearn.inspection import permutation_importance
sns.set_style('darkgrid')
import warnings
warnings.filterwarnings('ignore')

<a class="anchor" id="4"></a>
## <font color='#63B2CB'>__Import dataset__</font>

In [None]:
df = pd.read_csv('credit_dataset.csv')
df = df.drop('Customer ID',axis=1)
df

Dataset contains:

- `Customer ID`: Unique customer identifier ranging from 1 to 1600
- `Credit Limit`: The  credit limit for the customer
- `Gender`: Customer gender
- `Education`: Customer education level
- `Marriage Status`: Customer marital status
- `Age`: Customer age
- `Bill_Amount1`: Customer credit card bill amount 1 month ago
- `Bill_Amount2`: Customer credit card bill amount 2 months ago
- `Bill_Amount3`: Customer credit card bill amount 3 months ago
- `Pay_Amount1`: The amount customer paid 1 month ago
- `Pay_Amount2`: The amount customer paid 2 months ago
- `Pay_Amount3`: The amount customer paid 3 months ago
- `Default payment next month`: Whether customer will default next month, 1 means default (customer will not pay the bill), 0 means non default (customer will pay the bill)

<a class="anchor" id="5"></a>
## <font color='#63B2CB'>__Exploratory Data Analysis__</font>


- The dataset is loaded with pandas so that each column can be identified as a feature or as the target variable
- The columns are then sorted into categorical features and numerical features
(Answering to `How do you represent your data as features?`)

#### General info

In [None]:
df.info()

<u><b>Things Observed</b></u>
- From the above dataset, we can conclude that there is `anomaly dtype`  in the dataset

In [None]:
df.describe()

<u><b>Things Observed</b></u>
- While the values for credit limit, age and payment amount 1 to 3 months ago are all reasonable, we note that there are `negative values in bill amount`, as shown by the minimum values
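
The negative minimums can be quantified before deciding how to treat them. The following is a minimal sketch on a small invented frame that reuses the `Bill_Amount` column names; the values are illustrative and are not taken from `credit_dataset.csv`.

```python
import pandas as pd

# Toy stand-in for the real data; all values are invented for illustration.
toy = pd.DataFrame({
    "Bill_Amount1": [1200, -50, 0, 300],
    "Bill_Amount2": [900, -20, -10, 250],
    "Bill_Amount3": [700, 0, 40, -5],
})

bill_cols = [c for c in toy.columns if c.startswith("Bill_Amount")]

# Number of strictly negative entries in each bill column.
negative_counts = (toy[bill_cols] < 0).sum()
print(negative_counts)
```

A negative bill amount may indicate an overpaid account, although the dataset description here does not confirm this.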

#### Check for null values in the dataset

In [None]:
df.isnull().sum()

<u><b>Things Observed</b></u>
- This dataset is clean (no null values), so we can proceed with exploratory data analysis on it

#### Analysis on categorical data

> Function for analysing categorical features

In [None]:
class catAnalyser:
    def __init__(self):
        pass

    def barPiePlot(self, col):
        fig, ax = plt.subplots(2, 1, figsize=(9, 9))
        fig.set_facecolor('lightgray')
        fig.suptitle(f'Analysis on {col}', size=20, color='darkblue')
        df[col].value_counts().plot(
            kind='pie', title=f"Distribution in the {col} column (Pie Chart)", autopct='%1.1f%%',
                shadow=True, startangle=0,  label='index', ax=ax[0])
        sns.countplot(y=col, data=df, ax=ax[1])
        ax[1].title.set_text(f'Distribution in the {col} column (Count Plot)')
        ax[1].set_ylabel(f"{col}", labelpad=10)
        ax[1].set_xlabel("Count of People", labelpad=10)
        for bars in ax[1].containers:
            ax[1].bar_label(bars)
        plt.show()

    def plotByTarget(self, col):
        fig, ax = plt.subplots(3, 1, figsize=(8, 8))
        fig.set_facecolor('lightblue')
        fig.suptitle(
            f'Analysis on {col} by Target Variable', size=20, color='darkblue')
        for index, value in enumerate(df['Default payment next month'].unique()):
            df[df['Default payment next month'] == value][[col]].value_counts().sort_values().plot(
                kind='pie', title=f"Distribution with respect to value {value}", autopct='%1.1f%%',
                shadow=True, startangle=0, label='index', ax=ax[index])
        sns.countplot(y=col, hue='Default payment next month',
                      data=df, ax=ax[2])
        for bars in ax[2].containers:
            ax[2].bar_label(bars)
        ax[2].title.set_text(f'Distribution in the {col} column (Count Plot)')
        ax[2].set_ylabel(f"{col}", labelpad=10)
        ax[2].set_xlabel("Count of People", labelpad=10)
        plt.show()


catFunc = catAnalyser()


Retrieve the `categorical` features in the dataset

In [None]:
cat_features = [col for col in df.columns if str(df[col].dtype) !='int64'] + ['Default payment next month']
print(f'The categorical features are {cat_features}')

Produce the graph for insights of the dataset

In [None]:
for feature in cat_features:
    catFunc.barPiePlot(feature)

<u><b>Things Observed</b></u>
- From the first graph, we can see that the dataset has a higher proportion of females than males, at 58.7% compared to 41.2%.
    - In addition, the countplot shows that female has over 800 rows while male has more than 600 rows
- From the second graph, we can see that university is the largest education group, at 37.7%
    - From the countplot, we can see that university has over 700 rows in total
- From the third graph, we can see that a higher proportion of people are single, at 56.2%, compared to married at 43.8%.
    - From the countplot, we can see that single has over 800 rows
- From the fourth graph, we can see that the classes are imbalanced: output 0 accounts for 78.8% of the rows
    - From the countplot, this amounts to more than 1200 rows of class 0
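
The class shares read off the pie chart can also be computed directly. This is a small sketch on an invented label vector with an 80/20 split; the real proportions come from `df['Default payment next month']`.

```python
import pandas as pd

# Invented labels with an imbalance similar to the one described in the text.
target = pd.Series([0] * 8 + [1] * 2, name="Default payment next month")

# normalize=True returns proportions instead of raw counts.
class_share = target.value_counts(normalize=True)
imbalance_ratio = class_share[0] / class_share[1]

print(class_share)
print(f"majority/minority ratio: {imbalance_ratio:.1f}")
```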

Analysis on each column with `respect to target variable`

In [None]:
for col in cat_features[0:-1]:
    catFunc.plotByTarget(col)

<u><b>Things Observed</b></u>
- From the first graph of gender against default payment, we can see that females make up the higher share at 59.4%, with males at 40.6%
    - From the countplot, we can also see that far fewer people overall have default payment 1

- From the second graph, we can see that university-educated customers make up the largest share of those classified as default payment 1, but this could be because university is the largest education group in the dataset

- From the third graph, we can see that single customers make up the larger share of both default payment 1 and default payment 0, but this could again be due to the larger number of single customers
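
Because raw shares are driven by group size, a default rate per category separates the two effects. The sketch below uses a row-normalised crosstab on invented data; the level name `postgraduate` is a placeholder and may not match the labels in the real `Education` column.

```python
import pandas as pd

# Invented rows: university is the largest group, as in the observation above.
toy = pd.DataFrame({
    "Education": ["university"] * 6 + ["high school"] * 2 + ["postgraduate"] * 2,
    "Default payment next month": [1, 0, 0, 0, 0, 0, 1, 0, 0, 0],
})

# Row-normalised crosstab: each row sums to 1, so group size no longer matters.
default_rate = pd.crosstab(
    toy["Education"], toy["Default payment next month"], normalize="index"
)[1]
print(default_rate)
```

In this toy frame both university and high school contain one defaulter, yet the default rate is much higher for the smaller high school group.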


#### Analysis on numerical features 

> Function for analysing numerical features

In [None]:
class numericAnalyser:
    def __init__(self):
        pass

    def numPlot(self, col):
        fig, ax = plt.subplots(2, 1, figsize=(9, 9))
        fig.suptitle('Analysis on Univariate data',size=20,color='darkblue')
        fig.set_facecolor('lightgray')
        ax[0].title.set_text(f'Boxplot of {col}')
        ax[1].title.set_text(f'Distribution of {col}')
        sns.boxplot(x="Default payment next month", y=col, data=df,ax=ax[0])
        sns.histplot(x=col,
                     data=df,
                     stat='density',
                     bins=50,
                     kde=True,
                     line_kws={'color': 'red', 'linewidth': 3}, ax=ax[1])
        plt.show()
    def ratio(self,col1,col2,ax):
        ax.hist(df[col1],alpha=0.5,bins=10)
        ax.hist(df[col2],alpha=0.5,bins=10)
        ax.set_title(f'{col1} vs {col2}')
numFunc = numericAnalyser()

Retrieve the `numerical` features in the dataset

In [None]:
num_features = [col for col in df.columns if col not in cat_features]
num_features

Produce the graph for insights of the dataset

In [None]:
for col in num_features:
    numFunc.numPlot(col)

<u><b>Things Observed</b></u>
- From the first plot on credit limit, we can see that those with default payment 0 have a larger range and interquartile range than those with default payment 1
    - From the histogram, we can see that credit limit is positively skewed, with the majority in the 10000 to 20000 range

- From the second plot on age, we can see that those with default payment 0 have a smaller range and interquartile range than those with default payment 1
    - From the histogram, we can see that age is positively skewed, with the majority under the age of 40

- From the plots on bill amount 1 to 3 months ago, we can see that those with default payment 0 have a wider range and interquartile range than those with default payment 1. Both groups contain numerous outliers
    - From the histograms, we can see that bill amounts 1 to 3 are positively skewed, with the majority less than 10000
    - Also, there are negative values in the bill amounts

- From the plots on pay amount 1 to 3 months ago, we can see that those with default payment 0 have a wider range and interquartile range than those with default payment 1. Both groups contain numerous outliers
    - From the histograms, we can see that pay amounts 1 to 3 are positively skewed, with the majority less than around 3000
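
Positive skew can be measured rather than judged by eye. As a rough sketch, the code below draws an invented long-tailed sample (log-normal, not the real amounts) and compares the skewness before and after a `log1p` transform.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Log-normal draws mimic a long right tail; they are not the real amounts.
amounts = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=2000))

skew_raw = amounts.skew()            # clearly positive for a right-skewed sample
skew_log = np.log1p(amounts).skew()  # close to 0 after the log transform

print(f"raw skew: {skew_raw:.2f}, log1p skew: {skew_log:.2f}")
```

Note that `log1p` is undefined for values of -1 or below, so it would not apply directly to the bill amount columns, which contain negative values.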

#### Analysis on pay vs bill

In [None]:
col1s = [f'Pay_Amount{i}' for i in range(1,4)]
col2s = [f'Bill_Amount{i}' for i in range(1,4)]
fig,ax = plt.subplots(3,1,figsize=(12,10))
fig.suptitle('Analysis on pay vs bill',size=20,color='darkblue')
for axis, col1, col2 in zip(ax, col1s, col2s):
    # ratio() draws on the given axis in place and returns None
    numFunc.ratio(col1, col2, axis)

plt.show()

<u><b>Things Observed</b></u>
- From the above graph, we can see that the pay amounts are far lower than the bill amounts in every month, which means that people are underpaying their bills and only a small proportion is paid off each month

#### Check on the correlation and covariance between each variable

In [None]:
types = ['pearson','spearman','kendall']

In [None]:
fig,ax = plt.subplots(3,1,figsize=(12,13))
fig.suptitle('Correlations plot',size=20,color='darkblue')
# Correlation is only defined for numeric columns
numeric_df = df.select_dtypes(include='number')
for index, tp in enumerate(types):
  # Create the graph and dataframe 
  corr = numeric_df.corr(method=tp)
  ax[index].title.set_text(f'{tp.upper()} correlation')
  sns.heatmap(corr, cmap="BuPu", annot=True, ax=ax[index])
  display(corr.style.bar(color='green'))
plt.tight_layout()
plt.show()


In [None]:
fig,ax = plt.subplots(1,1,figsize=(8,8))
sns.heatmap(df.select_dtypes(include='number').cov()).set(
    title="Relationship between the different variables (Covariance)")
plt.show()

<u><b>Things Observed</b></u>

- From the different correlation plots, we observe a strong relationship between the bill amounts and pay amounts of different months
  - However, bill amounts 1, 2 and 3 are successive readings of the same account and are expected to depend on each other, so this is `not treated as a multicollinearity problem` here

- From the covariance plot, we observe a similar pattern, and there could also be a relationship between credit limit and age
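
One way to put a number on how much the bill amounts overlap is the variance inflation factor (VIF), which equals the diagonal of the inverse correlation matrix. The sketch below applies it to invented columns; it is an illustration of the check, not a result from this dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(size=500)

# Two nearly identical "bill" columns and one unrelated column (all invented).
toy = pd.DataFrame({
    "Bill_Amount1": base + rng.normal(scale=0.1, size=500),
    "Bill_Amount2": base + rng.normal(scale=0.1, size=500),
    "Age": rng.normal(size=500),
})

# VIF_j is the j-th diagonal entry of the inverse correlation matrix.
corr = toy.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=toy.columns)
print(vif.round(1))
```

A VIF near 1 means a column is almost unrelated to the others, while large values flag columns that the remaining ones can largely reproduce. Linear models are the most sensitive to this; tree-based models are usually less affected.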



#### Analysis on target variable

In [None]:
plt.title('Analysis on the target variable')
df['Default payment next month'].value_counts().plot(kind='barh')
plt.show()

<u><b>Things Observed</b></u>
- Here, the target variable is `imbalanced`

#### Additional analysis with pandas profiling

In [None]:
profile = ProfileReport(df)
profile

<u><b>Things Observed</b></u>
- Similar to the correlation plot, we can see high correlation between the bill amounts; however, this is not a concern as they are not independent variables

<a class="anchor" id="6"></a>
## <font color='#63B2CB'>__Feature Engineering__</font>

In [None]:
df = pd.read_csv('credit_dataset.csv')
df

- From the above exploratory data analysis, I have come up with multiple possible features that can be extracted

\begin{align*}
\text{Difference Between Pay and Bill} &= Pay - Bill
\end{align*}

\begin{align*}
\text{percentage of credit limit (pay)} &= \frac{\text{Pay}}{\text{Credit Limit}}\times{100}
\end{align*}

\begin{align*}
\text{percentage of credit limit (bill)} &= \frac{\text{Bill}}{\text{Credit Limit}}\times{100}
\end{align*}

\begin{align*}
\text{marriage and gender} &= marriage + gender
\end{align*}

\begin{align*}
\text{Difference from Mean E} &= \text{Credit Limit} - Mean(\text{Credit Limit by Education})
\end{align*}

\begin{align*}
\text{Difference from Mean A} &= \text{Credit Limit} - Mean(\text{Credit Limit by Age})
\end{align*}

> Function for feature extraction here

In [None]:
class FeatureExtraction():
    def __init__(self):
        pass

    def new_features(self, n: int):
        # Adds the engineered columns to the global `df` for months 1..n-1
        for i in range(1, n):
            df[f'Difference_month{i}'] = df[f'Pay_Amount{i}'] - df[f'Bill_Amount{i}']
            df[f'per_of_pay_df_limit{i}'] = (df[f'Pay_Amount{i}'] / df['Credit Limit']) * 100
            df[f'per_of_bill_df_limit{i}'] = (df[f'Bill_Amount{i}'] / df['Credit Limit']) * 100

        # Combined category: marriage status followed by gender
        df['MarriageGender'] = df['Marriage Status'] + ' ' + df['Gender']

        # Credit limit minus the mean credit limit of the customer's group
        for group_col, new_col in [('Education', 'DifferenceMeanE'),
                                   ('Gender', 'DifferenceMeanG'),
                                   ('MarriageGender', 'DifferenceMeanMG')]:
            group_mean = df.groupby(group_col)['Credit Limit'].transform('mean')
            df[new_col] = df['Credit Limit'] - group_mean
        return df
featureFunc = FeatureExtraction()



In [None]:
featureFunc.new_features(4)

#### Feature selection 

- Drop customer id due to no importance in the dataset

In [None]:
df = df.drop('Customer ID',axis=1)
display(df)

#### Rows Selection
- As observed in the exploratory data analysis, some bill amounts are zero or negative. We drop the rows where all three bill amounts are at most 0 and the default payment is 0; restricting the filter to class 0 means that the minority class is not affected

In [None]:
df = df[~((df['Bill_Amount3'] <= 0) & (df['Bill_Amount2'] <= 0) &(df['Bill_Amount1'] <= 0) & (df['Default payment next month'] == 0))]
df = df.reset_index(drop=True)
df

Create an original dataframe for future comparison

In [None]:
original_df = pd.read_csv('credit_dataset.csv')
original_df = original_df.drop('Customer ID',axis=1)
original_df

<a class="anchor" id="7"></a>
## <font color='#63B2CB'>__Data Preprocessing__</font>

Answering to `Did you process the features in any way?`

#### Encoding for those that are categorical data

In [None]:
df_encode = df.copy()

- One Hot encoding
- Label encoder
- Ordinal Encoder
- Custom Encoding

Here we apply ordinal encoding to the binary columns
- Marriage status has only two unique values
- Gender has only two unique values

In [None]:
oe = OrdinalEncoder()
df_encode['Marriage Status'] = oe.fit_transform(df_encode[['Marriage Status']])
pd.DataFrame(df_encode)

In [None]:
oe = OrdinalEncoder()
df_encode['Gender'] = oe.fit_transform(df_encode[['Gender']])
pd.DataFrame(df_encode)

Here we will create a class for custom encoding
- For education, a higher education level is assumed to make default less likely because of a more regular income, so the levels are encoded in order

In [None]:
class CustomEncoder(BaseEstimator, TransformerMixin):
    # Ordinal mapping for Education; any other level is encoded as 2
    mapping = {'high school': 0, 'university': 1}

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Flatten so that a Series and an (n, 1) array are handled alike
        values = np.asarray(X, dtype=object).ravel()
        return np.array([[self.mapping.get(value, 2)] for value in values])

    def fit_transform(self, X, y=None):
        return self.transform(X)
custom_encoder = CustomEncoder()

In [None]:
df_encode['Education'] = custom_encoder.fit_transform(df_encode['Education'])
df_encode

- The dataframe above shows the final encoded result
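
As a possible alternative to a hand-written encoder, scikit-learn's `OrdinalEncoder` accepts an explicit category order through its `categories` parameter. The sketch below assumes three education levels; `postgraduate` is a placeholder for whatever the third level is called in the real data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# "postgraduate" is a hypothetical name for the third education level.
levels = [["high school", "university", "postgraduate"]]
toy = pd.DataFrame({"Education": ["university", "high school", "postgraduate"]})

# The position of each level in `levels` becomes its integer code.
encoder = OrdinalEncoder(categories=levels)
codes = encoder.fit_transform(toy[["Education"]]).ravel()
print(codes)
```

Unlike a catch-all `else` branch, this approach raises an error on an unseen level by default, which can surface unexpected labels early.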

#### Create the target variable and the features

- Here the target variable is `Default payment next month`, so y contains just that column while X contains the other variables

In [None]:
X = df.drop(['Default payment next month'],axis=1)
y = df['Default payment next month']
X

- Here we will do the same to the original dataset

In [None]:
X_original = original_df.drop('Default payment next month',axis=1)
y_original = original_df['Default payment next month']

#### Split the dataset into train and test

- Set the train test split for data that `have not been` feature engineered

In [None]:
X_train_original, X_test_original, y_train_original, y_test_original = train_test_split(X_original, y_original, test_size=.2, stratify=y_original, shuffle=True,random_state = 32)

- Set the train test split for data that `have been` feature engineered

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, stratify=y, shuffle=True,random_state = 32)
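
The effect of `stratify` can be checked on a small invented label vector: with a 20% positive class, both the train and test parts keep the same 20% share. The arrays below are made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 invented samples with a 20% positive class.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=32
)

# The mean of a 0/1 label vector is the share of the positive class.
print(y_tr.mean(), y_te.mean())
```

Without stratification, a small test set could by chance contain very few defaulters, which would make recall estimates unstable.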

<a class="anchor" id="8"></a>
## <font color='#63B2CB'>__Pipeline Building__ (Baseline)</font>

- Get the numeric features on data that `have not been` feature engineered

In [None]:
num_features_original = [col for col in X_train_original.columns if str(original_df[col].dtype) != 'object']
num_features_original

- Get the numeric features on data that `have been` feature engineered

In [None]:
num_features = [ col for col in df.columns if str(df[col].dtype) != 'object']
num_features.remove('Default payment next month')
print(num_features)

> Functions for pipeline building 
- Numeric values are scaled using `Normalizer`
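
It may help to recall that `Normalizer` rescales each row (sample) to unit norm, whereas scalers such as `StandardScaler` work column by column. The toy matrix below is invented to show the difference.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

X_toy = np.array([[3.0, 4.0], [6.0, 8.0], [1.0, 0.0]])

row_scaled = Normalizer().fit_transform(X_toy)      # each row -> unit L2 norm
col_scaled = StandardScaler().fit_transform(X_toy)  # each column -> mean 0, std 1

print(row_scaled)
print(col_scaled.mean(axis=0))
```

The first two rows become identical after `Normalizer`, because only the direction of a row is kept and its overall magnitude is discarded.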

In [None]:
class pipeline:
    def __init__(self):
        pass

    def pipeline_step(self, original, num_features):
      # Numeric transformer for numeric values
        numeric_transformer = Pipeline([("imputer", SimpleImputer(
            strategy="median")), ('scaler', Normalizer())])
        
        # Categorical transformer for ordinal encoding
        categorical_transformer = Pipeline([("imputer", SimpleImputer(
            strategy='most_frequent')),
            ('oe', OrdinalEncoder())])
        
        # Custom encoder
        categorical_transformer2 = Pipeline([("imputer", SimpleImputer(
            strategy='most_frequent')),
            ('custom',  CustomEncoder())])
        
        # One hot encoding
        categorical_transformer3 = Pipeline([("imputer", SimpleImputer(
            strategy='most_frequent')),
            ('ohe', OneHotEncoder())])
        
        # Preprocessing step for original vs feature-engineered data
        if original:
            preprocessing_step = ColumnTransformer(
                [
                    ("numeric", numeric_transformer, num_features),
                    ('oe', categorical_transformer,
                     ['Marriage Status', 'Gender']),
                    ('custom_e', categorical_transformer2, ['Education']),
                ],
                remainder="passthrough",
            )
        else:
            preprocessing_step = ColumnTransformer(
                [
                    ("numeric", numeric_transformer, num_features),
                    ('oe', categorical_transformer,
                     ['Marriage Status', 'Gender']),
                    ('custom_e', categorical_transformer2, ['Education']),
                    ('ohe', categorical_transformer3, ['MarriageGender'])
                ],
                remainder="passthrough",
            )
        process = [('preprocessing', preprocessing_step)]
        return process




p = pipeline()


- Generate the pipeline for `original` data

In [None]:
pipeline_step_original = p.pipeline_step(True,num_features_original)
pipeline_step_original

- Generate the pipeline for `feature engineered` data

In [None]:
pipeline_step = p.pipeline_step(False,num_features)
pipeline_step

<a class="anchor" id="8"></a>
## <font color='#63B2CB'> __Model (Baseline)__</font>

- Here we will fit baseline models across different types of machine learning algorithms

In [None]:
class model:
    def __init__(self):
        pass
   
    def plot_learning_curve(self, name, model,X,y,ax):
        train_sizes, train_scores, test_scores = learning_curve(
            estimator=model, X=X, y=y,   cv=StratifiedKFold(n_splits=10), train_sizes=np.linspace(0.1, 1.0, 10),
                                                     n_jobs=1)
        train_mean = np.mean(train_scores, axis=1)
        train_std = np.std(train_scores, axis=1)
        test_mean = np.mean(test_scores, axis=1)
        test_std = np.std(test_scores, axis=1)
        ax.set_ylim([-0.2,1.2])
        ax.plot(train_sizes, train_mean, color='blue', marker='o',
                markersize=5, label='Training Accuracy')
        ax.plot(train_sizes, test_mean, color='green', marker='+',
                markersize=5, linestyle='--', label='Validation Accuracy')
        ax.set_title(f'Learning Curve {name}')
        ax.set_xlabel('Training Data Size')
        ax.set_ylabel('Model accuracy')
        ax.grid()
        ax.legend(loc='lower right')
modelFunc = model()

#### Setting the evaluation metrics for classification model

In [None]:
evaluation = ["accuracy", "f1_weighted","recall", "precision","roc_auc"]
evaluation

- Set up the columns of the evaluation dataframe (train and test scores)

In [None]:
def dataframeColumns(keys):
    col = ['model','fit_time','score_time']
    for i in keys:
        col.append(f'test_{i}')
        col.append(f'train_{i}')
    return col
evaluation_col = (dataframeColumns(evaluation))  
evaluation_col

#### Set the model

Answering to `How did you select which learning algorithms to use?`

- LogisticRegressionClf
  - Simple classification model that is easy to interpret and performs well when relationships are close to linear
- KNeighborsClf
  - One of the simplest algorithms; it labels a new data point based on its closest neighbours
- DecisionTreeClf
  - Subdivides the feature space into regions with mostly the same label
  - Can capture linear and non-linear relationships in both numerical and categorical data
- ExtraTreesClf
  - Ensemble method based on decision trees that reduces overfitting by using randomization during the tree-building process
- RandomForestClf
  - Ensemble method that combines multiple decision trees, which reduces overfitting, generalizes better, and can handle high-dimensional data and capture non-linear relationships
- GradientBoostingClf
  - Builds models sequentially, each correcting the mistakes of the previous one, which can capture complex relationships
- HistGBClf
  - Variant of gradient boosting that uses histograms to speed up training, which can handle large datasets and provides efficient training and prediction
- AdaBoostClf
  - Ensemble method that combines multiple weak classifiers and focuses on difficult-to-classify examples, improving overall accuracy
- SVC
  - Versatile classifier for binary and multi-class classification tasks that can handle high-dimensional data and capture non-linear relationships using kernel tricks
- Neural Network
  - Highly flexible and can learn complex patterns and relationships


In [None]:
classification_models = {
    "LogisticRegressionClf": LogisticRegression(solver='newton-cg'),
    "KNeighborsClf": KNeighborsClassifier(),
    'DecisionTreeClf': DecisionTreeClassifier(),
    "ExtraTreesClf": ExtraTreesClassifier(),
    "RandomForestClf": RandomForestClassifier(class_weight={0: 1, 1: 10}),
    "GradientBoostingClf": GradientBoostingClassifier(),
    'HistGBClf': HistGradientBoostingClassifier(learning_rate=0.15),
    "AdaBoostClf": AdaBoostClassifier(),
    'SVC': SVC(),
    'Neural Network': MLPClassifier()
}

# Reserve the last slot of each step list for the classifier; it is overwritten per model
pipeline_step.append(0)
pipeline_step_original.append(0)


#### Testing against different models

In [None]:
def run_models(models,step,X,y,X_test,y_test):
  # Run the different models in a loop
    baseline_df = pd.DataFrame(columns = evaluation_col)
    i = 0 
    fig, ax = plt.subplots(2, 5, figsize=(15, 13))
    for model_name,model in models.items():
        step[-1] = ('clf',model)

        clf = Pipeline(steps=step)
        
        clf.fit(X, y)
        print(colored(model_name, 'green'),'is running')
        print(f"Training Accuracy Score :{clf.score(X, y)}")
        
        scores = cross_validate(clf,X,y,cv=StratifiedKFold(n_splits=10),scoring=evaluation,n_jobs=-1,return_train_score=True )

        display(pd.DataFrame(scores).mean())

        new_row = dict(pd.DataFrame(scores).mean())
        new_row.update({'model':model_name})
        
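        # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
        baseline_df = pd.concat([baseline_df, pd.DataFrame([new_row])], ignore_index=True)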
        fig.suptitle('Learning Curves')
        row = i // 5
        col = i % 5  
        modelFunc.plot_learning_curve(model_name,clf,X,y,ax=ax[row,col])
        i += 1
    plt.tight_layout()
    plt.show()
    return baseline_df


#### Dummy Classifier

In [None]:
pipeline_step_original[-1] = ('dummy_model',DummyClassifier())
dummy_model = Pipeline(steps=pipeline_step_original)
dummy_model.fit(X_train_original, y_train_original)
print(colored('Dummy Classifier', 'green'),'is running')
print(f"Model Accuracy Score :{dummy_model.score(X_train_original, y_train_original)}")
scores = cross_validate(dummy_model,X_train_original,y_train_original,cv=StratifiedKFold(n_splits=10),scoring=evaluation,n_jobs=-1,return_train_score=True )
dummy_scores = pd.DataFrame(scores)
display(dummy_scores)
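
To see why the dummy model is a useful floor, consider a classifier that always predicts the majority class. On invented labels with 80% non-defaults it reaches 80% accuracy while catching no defaulters at all. The sketch sets `strategy="most_frequent"` explicitly for clarity.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

y_toy = np.array([0] * 8 + [1] * 2)
X_toy = np.zeros((10, 1))  # features are ignored by the dummy model

dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
pred = dummy.predict(X_toy)

acc = accuracy_score(y_toy, pred)  # share of the majority class
rec = recall_score(y_toy, pred)    # no positive is ever predicted
print(acc, rec)
```

Any real model therefore needs to beat the majority share on accuracy and, more importantly for this task, achieve a recall above zero.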

- Run the different classification models on the `original` data

In [None]:
baseline_original_df = run_models(classification_models,pipeline_step_original,X_train_original,y_train_original,X_test_original,y_test_original)

<u><b>Things Observed</b></u>
- None of the models showed signs of overfitting

- Run the different classification models on the `feature engineered` data

In [None]:
baselin2_df  = run_models(classification_models,pipeline_step,X_train,y_train,X_test,y_test)

<u><b>Things Observed</b></u>
- None of the models showed signs of overfitting

<a class="anchor" id="9"></a>
## <font color='#63B2CB'>__Analysis of the result of each model__</font>

- Accuracy Score (Focus)
    - The accuracy score is the percentage of correct predictions
\begin{align*}
\text{Accuracy} &= \frac{\text{TP + TN}}{\text{TP + TN + FP + FN}}
\end{align*}
- Recall (Focus)
    - Proportion of actual positives that were predicted correctly
\begin{align*}
\text{Recall} &= \frac{\text{TP}}{\text{TP + FN}}
\end{align*}
- Precision
    - Proportion of positive predictions that were actually correct
\begin{align*}
\text{Precision} &= \frac{\text{TP}}{\text{TP + FP}}
\end{align*}
- F1 score
    - Harmonic mean of precision and recall
- ROC 
    - The ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds
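
The metric definitions can be worked through by hand. The counts below are hypothetical confusion-matrix values chosen for easy arithmetic, not results from any model in this notebook.

```python
# Hypothetical confusion-matrix counts for a default-prediction model.
tp, fn, fp, tn = 30, 20, 10, 140

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} recall={recall:.3f} "
      f"precision={precision:.3f} f1={f1:.3f}")
```

Here accuracy is 0.85 even though 20 of the 50 actual defaulters are missed (recall 0.6), which illustrates why accuracy alone can look flattering on an imbalanced target.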

> Note for this model I will focus on recall and accuracy score due to the need of predicting true positive and it is always better to be safe than sorry for default payment

Original data without feature engineering

In [None]:
baseline_original_df.sort_values(by='test_accuracy',ascending=False).style.bar(subset =['test_accuracy','test_f1_weighted','test_recall','test_precision','test_roc_auc'],color='green')

<u><b>Things Observed</b></u>
- In this original dataframe, we can see that `Logistic Regression` has the highest test_accuracy
- In addition, we can see that the test precision and test recall are all below 0.5, reflecting the class imbalance in the dataset

Check with classification report for logistic regression

In [None]:
pipeline_step_original[-1] =  ("LogisticRegressionClf", LogisticRegression(solver='newton-cg'))
clf = Pipeline(steps=pipeline_step_original)

clf.fit(X_train_original,y_train_original)
print(classification_report(y_test_original,clf.predict(X_test_original)))

<u><b>Things Observed</b></u>
- From the classification report, we can see that the f1-score is `0.69`, precision is `0.62` and recall is `0.79`


Data after feature engineering

In [None]:
baselin2_df.sort_values(by='test_accuracy',ascending=False).style.bar(subset =['test_accuracy','test_f1_weighted','test_recall','test_precision','test_roc_auc'],color='green')

<u><b>Things Observed</b></u>
- From the above styled dataframe, we can see that `RandomForestClassifier` has the highest test accuracy at 0.793
- In addition, the test recall and test precision are all below 0.5

In [None]:
pipeline_step[-1] =  ("RandomForestClf", RandomForestClassifier())
clf = Pipeline(steps=pipeline_step)

clf.fit(X_train,y_train)
print(classification_report(y_test,clf.predict(X_test)))

<u><b>Things Observed</b></u>
- From the classification report, we can see that the f1-score has a score of  `0.8`, recall of `0.81` and precision of `0.79`

#### Compare the difference

In [None]:
# Setting the difference 
difference = pd.DataFrame(columns = [evaluation_col[0]]+evaluation_col[3:])
for i in evaluation_col[3:]:
  dif = baselin2_df[i] - baseline_original_df[i]
  difference[i] = dif
difference['model'] = baselin2_df['model']
difference

<u><b>Things Observed</b></u>
- Test accuracy: about half of the models improved while the rest stayed the same
- Test recall increased for the majority of the models


<a class="anchor" id="10"></a>
## <font color='#63B2CB'>__Advanced model (with discretization and resample on variables)__</font>

From the earlier observations, some variables can be discretized, such as age, credit amount, percentage of credit (bill and pay) and the difference comparisons

- Types of discretization
    - Pandas get dummies
    - KBins (Selected)
    

#### Additional feature selection with sklearn

- Select K best
    - Chi-square
    - Fisher’s Score
    - Correlation Coefficient
    
This is implemented in the pipeline below using `SelectKBest` with the `f_classif` (ANOVA F-value) score, keeping the 25 best features

In [None]:
best_features = Pipeline([('skb', SelectKBest(f_classif, k=25))])
best_features
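
As a rough illustration of what `SelectKBest(f_classif, k=...)` does, the sketch below runs it on synthetic data from `make_classification`. The data, the variable names and `k=3` are assumptions made for the example and are not taken from the credit dataset.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data (assumption): 10 features, only 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Score every feature with the ANOVA F-test and keep the 3 highest-scoring ones
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X_demo, y_demo)

print(selector.scores_.round(2))            # one F-score per input feature
print(selector.get_support(indices=True))   # column indices that were kept
print(X_selected.shape)
```

The selector only ranks features one at a time, so it can miss features that are useful only in combination with others.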

#### Discretization

In [None]:
def kbins(col,n_bins,strategy):
    # Bin one column of df_encode in place into ordinal integer codes
    discretizer = KBinsDiscretizer(n_bins=n_bins, strategy=strategy, encode='ordinal')
    df_encode[col] = discretizer.fit_transform(np.array(df_encode[col]).reshape(-1,1))
    df_encode[col] = df_encode[col].astype(int)
    return df_encode

Bin the age column first so that one additional engineered feature can be derived from it

In [None]:
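# Use a separate name so the kbins() helper defined above is not overwritten
age_binner = KBinsDiscretizer(n_bins=6, strategy='quantile', encode='ordinal')
df['Age'] = age_binner.fit_transform(np.array(df['Age']).reshape(-1, 1))
df['Age'] = df['Age'].astype(int)

# Mean credit limit per age bin, sorted by value for printing
mean = df.groupby(by='Age')['Credit Limit'].mean().to_dict()
mean = dict(sorted(mean.items(), key=lambda x: x[1]))
print(mean)
# Difference between each customer's credit limit and the mean of their age bin
df['DifferenceMeanA'] = df['Credit Limit'] - df['Age'].map(mean)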
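
The same idea can be seen on a tiny example: quantile binning followed by a "difference from the bin mean" feature. The `demo` frame and its values are made up for illustration, and `AgeBin` is a hypothetical column name.

```python
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Made-up ages and credit limits (illustration only)
demo = pd.DataFrame({
    "Age": [22, 25, 29, 33, 38, 41, 47, 52, 58, 63, 30, 45],
    "Credit Limit": [20, 30, 50, 80, 120, 150, 200, 180, 160, 100, 60, 170],
})

# Quantile strategy puts roughly the same number of rows in each bin
binner = KBinsDiscretizer(n_bins=3, strategy="quantile", encode="ordinal")
demo["AgeBin"] = binner.fit_transform(demo[["Age"]]).astype(int).ravel()

# Subtract the mean credit limit of each row's own age bin
group_mean = demo.groupby("AgeBin")["Credit Limit"].transform("mean")
demo["DifferenceMeanA"] = demo["Credit Limit"] - group_mean

print(demo.sort_values("Age"))
```

Using `groupby(...).transform("mean")` keeps the result aligned with the original rows, so no dictionary lookup is needed.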

In [None]:
class Pipeline2:
    def __init__(self):
        pass

    def pipeline_step(self):
        numeric_transformer = Pipeline([("imputer", SimpleImputer(
            strategy="median")), ('scaler', Normalizer())])

        categorical_transformer = Pipeline([("imputer", SimpleImputer(
            strategy='most_frequent')),
            ('oe', OrdinalEncoder())])

        categorical_transformer2 = Pipeline([("imputer", SimpleImputer(
            strategy='most_frequent')),
            ('ce', CustomEncoder())])

        categorical_transformer3 = Pipeline([("imputer", SimpleImputer(
            strategy='most_frequent')),
            ('ohe', OneHotEncoder())])

        preprocessing_step = ColumnTransformer(
            [
                ("numeric", numeric_transformer, ['Bill_Amount1', 'Bill_Amount2', 'Bill_Amount3', 'Pay_Amount1',
                                                  'Pay_Amount2', 'Pay_Amount3']),
                ('oe', categorical_transformer,
                 ['Marriage Status', 'Gender']),
                ('custom_e', categorical_transformer2, ['Education']),
                ('ohe', categorical_transformer3, ['MarriageGender']),
                ('Kbins', self.bins_transformer(10), [
                    'Credit Limit', 'DifferenceMeanMG', 'DifferenceMeanG', 'DifferenceMeanE','DifferenceMeanA']),
                ('KBins2', self.bins_transformer(8), ['per_of_pay_df_limit1', 'per_of_pay_df_limit2', 'per_of_pay_df_limit3',
                                                      'per_of_bill_df_limit1', 'per_of_bill_df_limit2', 'per_of_bill_df_limit3'])
            ],
            remainder="passthrough",
        )
        process = [('preprocessing', preprocessing_step),('select',best_features)]
        return process


    def bins_transformer(self, n):
        bins_transformer = Pipeline([('KBinsDiscretizer', KBinsDiscretizer(
            n_bins=n, strategy='quantile', encode='ordinal'))])
        return bins_transformer


p2 = Pipeline2()


Initialise pipeline 2 with the additional feature engineering

In [None]:
pipeline2_step = p2.pipeline_step()
pipeline2_step

#### Oversampling the imbalanced target variable

- From the dataset, we observed a class imbalance, hence the training data needs to be resampled to obtain a balanced dataset
- Types of oversampling method
    - SMOTE
        - SMOTE stands for Synthetic Minority Oversampling Technique. It uses a k-nearest neighbour algorithm to create synthetic data.
    - ADASYN 
        - The amount of synthetic data generated is inversely proportional to the density of the minority class: more synthetic data is created in regions where the minority class is sparse than in regions where it is dense.
    - KMeansSMOTE
        - K-Means SMOTE aids classification by generating minority class samples in safe and crucial areas of the input space. The method avoids the generation of noise and effectively overcomes imbalances between and within classes.
    - SMOTEN
        - Synthetic Minority Over-sampling Technique for Nominal features (Selected)
    - RandomOverSampler
        - Random over-sampling balances the data by replicating minority class samples. No information is lost, but the model is prone to overfitting because the same information is copied.

- Make a second train test split

In [None]:
X = df.drop('Default payment next month',axis=1)
y = df['Default payment next month']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.15, stratify=y, shuffle=True,random_state = 32)

In [None]:
a = SMOTEN(random_state=32)

# Report the size before resampling, otherwise both lines print the same numbers
print(f'Original dataset has shape of {len(X_train),len(y_train)}')
X_train, y_train = a.fit_resample(X_train,y_train)
print(f'Resampled dataset has shape of {len(X_train),len(y_train)}')
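
For intuition, the basic SMOTE idea described above (interpolating between a minority sample and one of its nearest neighbours) can be sketched in plain NumPy. This is a simplified illustration on made-up numeric points, not the imblearn implementation, and it is not how `SMOTEN` works internally, since `SMOTEN` targets nominal features. The helper name `smote_like` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(32)

# Hypothetical minority-class samples with two numeric features
minority = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))

def smote_like(samples, n_new, k=5, rng=rng):
    """Create n_new synthetic points by interpolating towards a random near neighbour."""
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base_idx = rng.integers(0, len(samples), size=n_new)
    neigh_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                 # interpolation weight in [0, 1)
    return samples[base_idx] + gap * (samples[neigh_idx] - samples[base_idx])

synthetic = smote_like(minority, n_new=30)
print(minority.shape, synthetic.shape)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.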

Analysis on resampled data

In [None]:
fig,ax = plt.subplots(2,1,figsize=(10,10))
fig.suptitle("Analysis on Target Variable (Before vs After)",size=20,color='darkblue')
y_train_original.value_counts().plot(kind="barh",
                             title="Target Variable Distribution (Before)",
                             ylabel="Default payment next month",
                             xlabel="Count",
                             ax=ax[0])
y_train.value_counts().plot(kind="barh",
                             title="Target Variable Distribution (After)",
                             ylabel="Default payment next month",
                             xlabel="Count",
                             ax=ax[1])
plt.show()

#### Preparing the pipeline

In [None]:
# Reserve the last slot of the step list; it is overwritten with (name, estimator) before each fit
pipeline2_step.append(0)

In [None]:
additional_df= run_models(classification_models,pipeline2_step,X_train,y_train,X_test,y_test)

Checking the average score of the different models

In [None]:
additional_df.sort_values(by='test_accuracy',ascending=False).style.bar(subset =['test_accuracy','test_f1_weighted','test_recall','test_precision','test_roc_auc'],color='green')

<u><b> Things Observed </b></u>
- From the result above, almost all models cross 80% test accuracy, showing that the advanced model is an improvement over the baseline model, with the highest being the Extra Tree classifier

Top 3
- Extra Tree
- Random Forest Classifier
- Gradient Boosting Classifier

Compare the difference with baseline 2

In [None]:
# Setting the difference 
difference2 = pd.DataFrame(columns = [evaluation_col[0]]+evaluation_col[3:])
for i in evaluation_col[3:]:
  dif = additional_df[i] - baselin2_df[i]
  difference2[i] = dif
difference2['model'] = baselin2_df['model']
difference2

#### Run the graph for comparison between before advanced method and after

In [None]:
def model_graph(models):
    for model_name,model in models.items():
        fig, ax = plt.subplots(2, 1, figsize=(8, 9))
        pipeline_step_original[-1] = (model_name,model)

        clf = Pipeline(steps=pipeline_step_original)
        
        pipeline2_step[-1] = (model_name,model)
        clf2 = Pipeline(steps=pipeline2_step)
        modelFunc.plot_learning_curve(model_name,clf,X_train_original,y_train_original,ax[0])
        modelFunc.plot_learning_curve(model_name,clf2,X_train,y_train,ax[1])
        plt.show()
# model_graph(classification_models)

def compare(models,pipelines,model_name):
    # Plot the ROC curve of each fitted pipeline on its own axis
    fig, ax = plt.subplots(1, len(pipelines), figsize=(12, 5), squeeze=False)
    for index,(model,pipe) in enumerate(zip(models,pipelines)):
        RocCurveDisplay.from_estimator(pipe, X_test, y_test, ax=ax[0][index], name=f"{model_name[index]} (Additional)")
        # Diagonal reference line of a random classifier
        x = np.linspace(0, 1, 2)
        ax[0][index].plot(x, x, ":", color="red")
        ax[0][index].set_title(f"ROC Curve {model_name[index]}")

    plt.show()

def confusion(y_pred,y_test,name,word):
    # Plot confusion matrix (rows are actual classes, columns are predicted classes)
    fig, ax = plt.subplots(figsize=(7, 7))
    sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt="d",cmap="YlGnBu", ax=ax)
    # The names passed in already contain the word "Classifier"
    fig.suptitle(f'{name} {word}',size=20)
    fig.set_facecolor('lightgray')
    ax.set_ylabel("Actual Values")
    ax.set_xlabel("Predicted Values")
    ax.set_xticklabels(["0", "1"])
    ax.set_yticklabels(["0", "1"])
    plt.show()

#### Classification report of the top 3 models

Random Forest Classifier

In [None]:
pipeline2_step[-1] = ('clf',RandomForestClassifier())
pipeline1 = Pipeline(steps=pipeline2_step)
pipeline1.fit(X_train, y_train)
y_pred = pipeline1.predict(X_test)
print(classification_report(y_test, pipeline1.predict(X_test)))

- confusion matrix

In [None]:
confusion(y_pred,y_test,'Random Forest Classifier','Additional')

Extra Tree classifier

<u><b> Things Observed </b></u>
- For the Gradient Boosting Classifier, test accuracy increased after tuning, from `85.4` to `86.0`
- In the classification report, precision decreased to 0.71

In [None]:
pipeline2_step[-1] = ('clf',ExtraTreesClassifier())
pipeline2 = Pipeline(steps=pipeline2_step)
pipeline2.fit(X_train, y_train)
y_pred = pipeline2.predict(X_test)
print(classification_report(y_test, pipeline2.predict(X_test)))

- confusion matrix

In [None]:
confusion(y_pred,y_test,'Extra Tree classifier','Additional')

Gradient Boosting Classifier

In [None]:
pipeline2_step[-1] = ('clf',GradientBoostingClassifier())
pipeline3 = Pipeline(steps=pipeline2_step)
pipeline3.fit(X_train, y_train)
y_pred = pipeline3.predict(X_test)
print(classification_report(y_test, pipeline3.predict(X_test)))

- confusion matrix

In [None]:
confusion(y_pred,y_test,'Gradient Boosting Classifier','Additional')

<u><b>Things Observed</b></u>
- The top 3 models in terms of test accuracy were chosen after assessing the dataframe
- Among the top 3 models, the classification reports show that Gradient Boosting performed best, with a recall of 0.8
- The models improved slightly after resampling


<a class="anchor" id="11"></a>
## <font color='#63B2CB'>__Hyperparameter Tuning__</font>

Randomised search CV was used over grid search because it runs faster, which makes it possible to use more folds in a shorter period of time

Answering `Did you try to tune the hyperparameters of the learning algorithm, and in that case how?`

From the models above, the Extra Tree classifier, gradient boosting classifier and random forest classifier were chosen for further tuning

#### Tuning

Setting up the class for tuning

In [None]:
class Tuning:
    def __init__(self):
        pass

    def tune(self, param, model):
        # Tune the model here with added step
        pipeline2_step[-1] = ("tuning", RandomizedSearchCV(model, param, cv=StratifiedKFold(n_splits=10), n_jobs=-1, scoring="recall_weighted", n_iter=15
                                                           ),
                              )

        tuned = Pipeline(steps=pipeline2_step)
        tuned.fit(X_train, y_train)
        print(tuned.named_steps["tuning"].best_params_)
        return tuned


tune = Tuning()
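
For comparison, a common alternative layout wraps the whole pipeline in `RandomizedSearchCV` and addresses the classifier's parameters with the `clf__` prefix, so the preprocessing is refitted inside every fold. The sketch below uses synthetic data and a stand-in `StandardScaler` step; `demo_pipe`, the parameter values and `n_iter=4` are assumptions for the example, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# "<step name>__<parameter>" routes each value to the right pipeline step
param_distributions = {
    "clf__n_estimators": [10, 20, 30],
    "clf__max_depth": [None, 3, 5],
    "clf__min_samples_split": [2, 5],
}

search = RandomizedSearchCV(
    demo_pipe,
    param_distributions,
    n_iter=4,                          # number of random combinations to try
    cv=StratifiedKFold(n_splits=3),
    scoring="recall_weighted",
    random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
print(round(search.best_score_, 3))
```

Setting `random_state` on the search makes the sampled parameter combinations repeatable between runs.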


#### Setting up the param grid for the three models

##### Extra Tree Classifier

In [None]:
param_grid_etc = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    # 'auto' was an alias of 'sqrt' for classifiers and is rejected by recent scikit-learn releases
    'max_features': ['sqrt', 'log2'],
    'bootstrap': [True, False],
    'criterion': ['gini', 'entropy'],
}
etc_tuned_result = tune.tune(param_grid_etc,ExtraTreesClassifier())
(pd.DataFrame(etc_tuned_result['tuning'].cv_results_))

Run the test result here

In [None]:
pipeline2_step[-1] = ("clf", etc_tuned_result.named_steps["tuning"].best_estimator_)
etc_tuned = Pipeline(steps=pipeline2_step)
etc_tuned.fit(X_train, y_train)
etc_tuned_y_pred = etc_tuned.predict(X_test)
scores = pd.DataFrame(cross_validate(etc_tuned,X_train,y_train,cv=StratifiedKFold(n_splits=10),scoring=evaluation))

Results mean

In [None]:
scores.mean()

Classification report

In [None]:
print(classification_report(y_test,etc_tuned_y_pred))

Confusion matrix

In [None]:
confusion(etc_tuned_y_pred,y_test,'Extra Tree Classifier','tuned')

Learning curve

In [None]:
fig,ax = plt.subplots(1,1,figsize=(7,7))
modelFunc.plot_learning_curve('Extra Tree classifier Tuned',etc_tuned,X_train,y_train,ax)
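
`modelFunc.plot_learning_curve` is a helper defined outside this section, so its internals are not shown here. As a hedged sketch of the numbers such a plot is typically built from, scikit-learn's `learning_curve` can be called directly; the synthetic data and the small forest below are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, learning_curve

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Fit the model on growing fractions of the training folds and score each fit
train_sizes, train_scores, valid_scores = learning_curve(
    RandomForestClassifier(n_estimators=20, random_state=0),
    X_demo,
    y_demo,
    train_sizes=np.linspace(0.2, 1.0, 4),
    cv=StratifiedKFold(n_splits=3),
    scoring="accuracy",
)

# One row per training size: mean training score vs mean validation score
for size, tr, va in zip(train_sizes, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(size, round(tr, 3), round(va, 3))
```

A large, persistent gap between the training and validation columns is the usual sign of overfitting.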

Comparing with dummy model

In [None]:
difference_dummy1 = scores.mean() - dummy_scores.mean() 
difference_dummy1.dropna()
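
To show why a dummy model is a useful reference point on an imbalanced target, the sketch below fits a `DummyClassifier` on made-up labels. The `most_frequent` strategy and the 80/20 class split are assumptions for the example; the `dummy_model` used in this notebook is defined elsewhere and may be configured differently.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 80 non-defaults (0) and 20 defaults (1)
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.zeros((100, 1))   # the dummy model ignores the features

# Always predict the majority class
dummy = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_hat = dummy.predict(X_demo)

dummy_accuracy = accuracy_score(y_demo, y_hat)
dummy_recall = recall_score(y_demo, y_hat)   # recall for the default class (1)
print(dummy_accuracy, dummy_recall)
```

A model that never predicts a default still reaches the majority-class share in accuracy while catching no defaulters, so improvements should be judged against both numbers.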

<u><b> Things Observed </b></u>
- For the Extra Tree Classifier, test accuracy increased compared to before tuning
- In the classification report, recall increased from 0.77 to 0.78 and precision increased too

##### Gradient Boosting Classifier

In [None]:
param_grid_gbc = {
    "max_depth":[1,3,100,200,300],
    "learning_rate":[0.01,0.1,0.15,0.18],
}
gbc_tuned_result = tune.tune(param_grid_gbc,GradientBoostingClassifier())
(pd.DataFrame(gbc_tuned_result['tuning'].cv_results_))

In [None]:
pipeline2_step[-1] = ("clf", gbc_tuned_result.named_steps["tuning"].best_estimator_)
gbc_tuned = Pipeline(steps=pipeline2_step)
gbc_tuned.fit(X_train, y_train)
gbc_tuned_y_pred = gbc_tuned.predict(X_test)
scores = pd.DataFrame(cross_validate(gbc_tuned,X_train,y_train,cv=StratifiedKFold(n_splits=10),scoring=evaluation))

In [None]:
scores.mean()

In [None]:
print(classification_report(y_test, gbc_tuned_y_pred))

In [None]:
confusion(gbc_tuned_y_pred,y_test,'Gradient Boosting Classifier','tuned')

In [None]:
fig,ax = plt.subplots(1,1,figsize=(7,7))
modelFunc.plot_learning_curve('Gradient Boosting Tuned',gbc_tuned,X_train,y_train,ax)

Comparing with dummy model

In [None]:
difference_dummy1 = scores.mean() - dummy_scores.mean() 
difference_dummy1.dropna()

<u><b> Things Observed </b></u>
- For the Gradient Boosting Classifier, test accuracy decreased slightly compared to before tuning
- The classification report scores also decreased slightly

##### Random Forest Classifier

In [None]:
param_grid_rfc = {
    'max_depth': [10, 20, 100, 300, 400, 500],
    'max_features': ['sqrt'],
    'min_samples_leaf': [1, 2],
    # min_samples_split must be at least 2 when given as an integer
    'min_samples_split': [2, 3, 5],
    'n_estimators': [25, 50, 100, 300, 500],
    'min_weight_fraction_leaf': [0.0, 0.1, 0.2],
    'criterion': ['gini', 'entropy'],
}
rfc_tuned_result = tune.tune(param_grid_rfc,RandomForestClassifier())
(pd.DataFrame(rfc_tuned_result['tuning'].cv_results_))

Refit the tuned model and run the test result here

In [None]:
pipeline2_step[-1] = ("clf", rfc_tuned_result.named_steps["tuning"].best_estimator_)
rfc_tuned = Pipeline(steps=pipeline2_step)
rfc_tuned.fit(X_train, y_train)
rfc_tuned_y_pred = rfc_tuned.predict(X_test)
scores = pd.DataFrame(cross_validate(rfc_tuned,X_train,y_train,cv=StratifiedKFold(n_splits=10),scoring=evaluation))

Results mean

In [None]:
scores.mean()

Classification report

In [None]:
print(classification_report(y_test, rfc_tuned_y_pred))

Confusion matrix

In [None]:
confusion(rfc_tuned_y_pred,y_test,'Random Forest Classifier','tuned')

Learning curve

In [None]:
fig,ax = plt.subplots(1,1,figsize=(7,7))
modelFunc.plot_learning_curve('Random Forest Tuned',rfc_tuned,X_train,y_train,ax)

Comparing with dummy model

In [None]:
difference_dummy1 = scores.mean() - dummy_scores.mean() 
difference_dummy1.dropna()

<u><b> Things Observed </b></u>
- For the Random Forest Classifier, test accuracy increased compared to before tuning
- In the classification report, recall stayed the same at 0.79 while precision improved

#### ROC Curve (AUC)

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(12, 5))
fig.suptitle('ROC Curve on Tuned Models',size=20,color='darkblue')
x = np.linspace(0, 1, 2)
fig.set_facecolor('lightgray')
ax[0].plot(x, x, ":", color="red")
ax[1].plot(x, x, ":", color="red")
ax[2].plot(x, x, ":", color="red")
ax[0].set_title("Extra Tree Classifier")
ax[1].set_title("Random Forest Classifier")
ax[2].set_title("Gradient Boosting Classifier")
RocCurveDisplay.from_estimator(
    etc_tuned, X_test, y_test, ax=ax[0], name=f"Extra Tree (tuned)")
RocCurveDisplay.from_estimator(
    rfc_tuned, X_test, y_test, ax=ax[1], name=f"Random forest (tuned)")
RocCurveDisplay.from_estimator(
    gbc_tuned, X_test, y_test, ax=ax[2], name=f"Gradient Boosting (tuned)")


<u><b> Things Observed </b></u>
- Here we can see that Extra Tree and Random Forest have the highest AUC at 0.68, while Gradient Boosting has the lowest
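
As a reminder of what the AUC value measures, the sketch below computes it with `roc_auc_score` and again by counting how often a positive example receives a higher score than a negative one. The labels and scores are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up true labels and predicted probabilities of default (illustration only)
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.2, 0.6, 0.7, 0.4, 0.5, 0.9])

auc = roc_auc_score(y_true, y_score)

# Same quantity: fraction of (positive, negative) pairs ranked in the right order
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = (pos[:, None] > neg[None, :]).mean()

print(round(auc, 4), round(pairwise, 4))
```

An AUC of 0.5 corresponds to the red diagonal in the plots, so values close to it mean the model ranks defaulters only slightly better than chance.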

<a class="anchor" id="15"></a>
## <font color='#63B2CB'>__Try Stacking together__</font>


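From the three models above, I try stacking the tuned random forest with the tuned gradient boosting classifier, using random forest as the base estimator and gradient boosting as the final estimator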

In [None]:
ensemble = StackingClassifier(
    estimators=[
        ("modelrfc", rfc_tuned_result.named_steps["tuning"].best_estimator_)
    ],
    final_estimator=gbc_tuned_result.named_steps["tuning"].best_estimator_,
    
    cv=StratifiedKFold(n_splits=10),
    passthrough=False,
    n_jobs=-1,
)
pipeline2_step[-1] = ('clf',ensemble)
stacking_classifiers = Pipeline(steps = pipeline2_step)
stacking_classifiers.fit(X_train, y_train)
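
For reference, a more typical stacking layout uses several base estimators and a simple final estimator that learns from their cross-validated predictions. The sketch below shows that variant on synthetic data; it is an alternative arrangement under assumed settings, not the configuration used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=0)

demo_stack = StackingClassifier(
    estimators=[
        ("rfc", RandomForestClassifier(n_estimators=30, random_state=0)),
        ("gbc", GradientBoostingClassifier(n_estimators=30, random_state=0)),
    ],
    # The final estimator is trained on out-of-fold predictions of the base models
    final_estimator=LogisticRegression(),
    cv=StratifiedKFold(n_splits=3),
)
demo_stack.fit(X_tr, y_tr)
stack_accuracy = demo_stack.score(X_te, y_te)
print(round(stack_accuracy, 3))
```

With a single base estimator the final estimator sees only one column of predictions, so most of the benefit of stacking comes from combining models that make different mistakes.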


In [None]:
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
fig.suptitle('ROC curve of Stacking Classifier')
RocCurveDisplay.from_estimator(stacking_classifiers, X_test, y_test, ax=ax, name="Untuned")
plt.show()

- Check the model with cross validate score

In [None]:
scores = pd.DataFrame(cross_validate(stacking_classifiers,X_train,y_train,cv=StratifiedKFold(n_splits=10),scoring=evaluation))

In [None]:
scores.mean()

In [None]:
print(classification_report(y_test, stacking_classifiers.predict(X_test)))

In [None]:
# Predict with the stacking pipeline itself and keep the fitted pipeline intact
stacking_y_pred = stacking_classifiers.predict(X_test)
confusion(stacking_y_pred,y_test,' Stacking Classifier','tuned')

<u><b> Things Observed </b></u>
- From stacking the two models, the f1-score is the same as when the models are run alone, while precision improved to `0.84`
- AUC is lower than for the individual models

<a class="anchor" id="12"></a>
## <font color='#63B2CB'>__Final Model__</font>

- From the above analysis, even though the stacking classifier produced better results in some respects, it is more expensive to implement, hence ` Random Forest Classifier is my final model`

Running the final model again

In [None]:
pipeline2_step[-1] = ("clf", rfc_tuned_result.named_steps["tuning"].best_estimator_)
rfc_tuned = Pipeline(steps=pipeline2_step)
rfc_tuned.fit(X_train, y_train)
rfc_tuned_y_pred = rfc_tuned.predict(X_test)
scores = pd.DataFrame(cross_validate(rfc_tuned,X_train,y_train,cv=StratifiedKFold(n_splits=10),scoring=evaluation))

In [None]:
scores.mean()

Classification report

In [None]:
print(classification_report(y_test, rfc_tuned_y_pred))

In [None]:
confusion(rfc_tuned_y_pred,y_test,'Random Forest Classifier','tuned')

In [None]:
dummy_ypred = dummy_model.predict(X_test_original)
confusion(dummy_ypred,y_test_original,'Base model','')

Learning curve comparison with the dummy model

In [None]:
fig,ax = plt.subplots(1,2,figsize=(7,7))
fig.suptitle("Comparison with dummy model")
modelFunc.plot_learning_curve('Random Forest Tuned',rfc_tuned,X_train,y_train,ax[0])
modelFunc.plot_learning_curve('Dummy Model',dummy_model,X_train_original,y_train_original,ax[1])

Comparing with base model

In [None]:
difference_dummy1 = scores.mean() - dummy_scores.mean() 
difference_dummy1.dropna()

Get the feature importances for analysis

In [None]:
final_estimator = rfc_tuned.named_steps['clf']
# feature_importances_ holds one importance value per feature seen by the forest
feature_importances = final_estimator.feature_importances_
feature_importances

In [None]:
fig, ax = plt.subplots(figsize=(35, 30))
plot_tree(rfc_tuned[-1][42], filled=True, ax=ax, feature_names=X_train.columns, fontsize=12, class_names=['No default','default'], proportion=True, rounded=True)
plt.show()


<u><b>Things Observed</b></u>
- From the confusion matrix, the model's weakness is false negatives: their count is high, which lowers the accuracy
- From the above evaluation, it has a precision of `0.78`, a recall of `0.79` and an accuracy of `0.86`, which is the best of both worlds. The previous model has a higher accuracy score, but this is due to overfitting, which makes it unreliable
- Compared with the base model, the Random Forest Classifier performed much better overall

Answering `Can you say anything about the errors that the system makes?`

- This system tends to make mistakes when predicting the positive class, which could be because the dataset is imbalanced and too small

<a class="anchor" id="13"></a>
## <font color='#63B2CB'>__Feature Importance Analysis__ </font>

In [None]:
fig,ax = plt.subplots(figsize=(8,8))
pd.DataFrame(rfc_tuned[-1].feature_importances_, index = X_train.columns,columns=["Feature Importance"]).sort_values("Feature Importance").plot(kind="barh", figsize=(4, 6),ax=ax)
fig.suptitle("Feature Importances of Random Forest")
fig.set_facecolor('lightblue')
plt.show()

<u><b>Things Observed</b></u>
- From the feature importances above, the most important feature is the difference from the gender-group mean, which could mean that a large deviation from the group average is associated with a higher risk of defaulting

<a class="anchor" id="14"></a>
## <font color='#63B2CB'>__Summary__</font>

- Overall, the model improved compared to the baseline model. While the stacking classifier performed well, its AUC score is not ideal, hence I reverted to Random Forest for a more balanced score across the board
- In addition, this model has a recall of 0.99 when predicting class 0
