<a href="https://colab.research.google.com/github/Nakulcj7/Creditcard/blob/main/Credit_card_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Credit Card Default Prediction



##### **Project Type**    - Classification
##### **Contribution**    - Individual


# **Project Summary -**



The aim of a credit card default prediction project is to develop a machine learning model that can accurately predict which credit card users are likely to default on their payments in the future. The model should use historical data of credit card users such as their payment history, credit limit, age, education, and other demographic information to identify patterns and trends that can help predict default behavior. This project uses historical data on customers' default payments in Taiwan.


*   There were 30000 records and 25 attributes in the dataset.

*   I started by importing the dataset and the necessary libraries, then conducted exploratory data analysis (EDA) to get a clear insight into each feature, separating the dataset into numeric and categorical features. The EDA covered univariate, bivariate, and multivariate analyses.

*   After that, the raw data was checked for outliers and null values, and transformed so that it was compatible with machine learning models.

*   In feature engineering, the raw data was turned into a more useful and informative form through encoding, feature manipulation, and feature selection. Target class imbalance was handled using SMOTE.

*   Finally, the cleaned and scaled data was passed to several models, evaluation metrics were computed for each, and hyperparameters were tuned. The final model was selected by comparing the results collected in model_result.

*   When developing a machine learning model, it is generally recommended to track multiple metrics because each one highlights distinct aspects of model performance. Here the focus is on the Recall score and F1 score, because the data is imbalanced and missing an actual defaulter is typically the costlier error for a credit card issuer.
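
To make the choice of metrics concrete, the sketch below computes accuracy, precision, recall, and F1 from confusion-matrix counts. The counts are invented for illustration (they are not results from this project) and `classification_metrics` is a hypothetical helper, not part of the notebook's pipeline.

```python
# Illustrative only: the confusion-matrix counts below are invented, not project results.

def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical test set: 1000 customers, 220 of whom actually default.
# The model flags only 60 customers and gets 50 of them right.
scores = classification_metrics(tp=50, fp=10, fn=170, tn=770)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts the accuracy is 0.82 while recall is only about 0.23, which is the kind of gap that motivates tracking recall and F1 on imbalanced data.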



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Despite high returns, credit cards carry significant risks. The growing number of credit cards in circulation has increased the volume of credit card defaults, and the resulting mass of billing and repayment data has made the work of risk controllers harder. As a result, one of the primary concerns of banks is how to use the data generated by users and extract useful information to control risks, reduce the default rate, and control the growth of non-performing assets.

A credit card issuer based in Taiwan wants to learn more about how likely its customers are to default on their payments and the main factors that influence this probability. This information would inform the issuer's decisions about whom to issue a credit card to and what credit limit to offer. A better understanding of current and potential customers would also inform the issuer's future strategy, including plans to offer targeted credit products.

# Objective

*   To determine the main factors that influence the likelihood of defaulting on a credit card.
*   To determine the likelihood that the bank's customers will default on their credit card payments.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# libraries that are used for analysis and visualization
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# libraries to do statistical analysis
import math
from scipy.stats import *

# libraries used to pre-process
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# libraries used to implement models
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier


# libraries to evaluate performance
import sklearn.metrics as metrics
from sklearn.metrics import auc, accuracy_score, roc_auc_score, roc_curve, confusion_matrix
from sklearn.metrics import precision_score, f1_score, recall_score
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

# library of warnings would assist in ignoring warnings issued
import warnings
warnings.filterwarnings("ignore")

# to set max column display
pd.set_option('display.max_columns', None)

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')


In [None]:
# Loading the Dataset
clients_df = pd.read_csv('/content/drive/MyDrive/Almabetter/Creditcarddataset.csv', encoding='ISO-8859-1')

### Dataset First View

In [None]:
# Dataset First Look

# Viewing the top 5 rows to take a glimpse of the data
clients_df.head()


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
clients_df.shape

In [None]:
print(f'number of rows : {clients_df.shape[0]}  \nnumber of columns : {clients_df.shape[1]}')

### Dataset Information

In [None]:
# Dataset Info
clients_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

value=len(clients_df[clients_df.duplicated()])
print("The number of duplicate values in the data set is = ",value)

This shows that there are no duplicate values in the dataset.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(clients_df.isnull().sum())

In [None]:
# Visualizing the missing values
import missingno as msno
msno.bar(clients_df, color='red',sort='ascending', figsize=(7,3), fontsize=12)

This confirms that there are no null values in the dataset.

### What did you know about your dataset?



There are 30000 rows and 25 columns in the dataset. The dataset does not contain any duplicate or missing values.

The given dataset is from the banking industry. Our task is to examine customer credit default and its causes. Proactively identifying the customers most likely to default on their payments is the first step in predicting customer default. This is typically done by dynamically analyzing pertinent customer data and actions.


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
clients_df.columns

In [None]:
# Dataset Describe
clients_df.describe().T

### Variables Description

The dataset contains data from the credit card industry in Taiwan, covering the usage, historical payments, and default status of customers.

# Attribute Information:

*   ID : ID of each client

*   LIMIT_BAL : Amount of given credit in NT dollars

*   SEX : Gender (1=male, 2=female)

*   EDUCATION : (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)

*   MARRIAGE : Marital status (1=married, 2=single, 3=others)

*   AGE : Age in years


*   PAY_0 : Repayment status in September, 2005 (-2=no consumption, -1=pay duly, 0=the use of revolving credit, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)


*   PAY_2 : Repayment status in August, 2005 (scale same as above)


*   PAY_3 : Repayment status in July, 2005 (scale same as above)

*   PAY_4 : Repayment status in June, 2005 (scale same as above)

*   PAY_5 : Repayment status in May, 2005 (scale same as above)

*   PAY_6 : Repayment status in April, 2005 (scale same as above)

*   BILL_AMT1 : Amount of bill statement in September, 2005 (NT dollar)
*   BILL_AMT2 : Amount of bill statement in August, 2005 (NT dollar)

*   BILL_AMT3 : Amount of bill statement in July, 2005 (NT dollar)
*   BILL_AMT4 : Amount of bill statement in June, 2005 (NT dollar)

*   BILL_AMT5 : Amount of bill statement in May, 2005 (NT dollar)

*   BILL_AMT6 : Amount of bill statement in April, 2005 (NT dollar)
*   PAY_AMT1 : Amount of previous payment in September, 2005 (NT dollar)


*   PAY_AMT2 : Amount of previous payment in August, 2005 (NT dollar)

*   PAY_AMT3 : Amount of previous payment in July, 2005 (NT dollar)
*   PAY_AMT4 : Amount of previous payment in June, 2005 (NT dollar)

*   PAY_AMT5 : Amount of previous payment in May, 2005 (NT dollar)

*   PAY_AMT6 : Amount of previous payment in April, 2005 (NT dollar)
*   default.payment.next.month : Default payment (1=yes, 0=no)
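
As a small illustration of how the coded attributes above can be read, the sketch below maps the numeric codes of `SEX`, `EDUCATION`, and `MARRIAGE` to the labels given in the attribute list. The rows in `sample_df` are toy values invented for the example, and the names `SEX_LABELS`, `EDUCATION_LABELS`, and `MARRIAGE_LABELS` are hypothetical helpers rather than part of the notebook.

```python
import pandas as pd

# Code-to-label mappings taken from the attribute description above.
SEX_LABELS = {1: "male", 2: "female"}
EDUCATION_LABELS = {1: "graduate school", 2: "university", 3: "high school",
                    4: "others", 5: "unknown", 6: "unknown"}
MARRIAGE_LABELS = {1: "married", 2: "single", 3: "others"}

# Toy rows for illustration only (not taken from the real dataset).
sample_df = pd.DataFrame({"SEX": [1, 2, 2],
                          "EDUCATION": [2, 1, 6],
                          "MARRIAGE": [1, 2, 3]})

# Series.map looks each code up in the dictionary; codes missing from it become NaN.
decoded_df = pd.DataFrame({
    "SEX": sample_df["SEX"].map(SEX_LABELS),
    "EDUCATION": sample_df["EDUCATION"].map(EDUCATION_LABELS),
    "MARRIAGE": sample_df["MARRIAGE"].map(MARRIAGE_LABELS),
})
print(decoded_df)
```

Any code that is not listed in a mapping shows up as `NaN` after `map`, which is a quick way to spot undocumented category values.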

















### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
clients_df.nunique()


In [None]:
# Check Unique Values for each variable.
for i in clients_df.columns.tolist():
  print("No. of unique values in ",i,"is",clients_df[i].nunique())

# Observations:

*   We are focusing on several key columns of our dataset, including 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', and 'PAY_AMT' as they contain a wealth of information.
*   By utilizing these features, we plan to create a classification model and implement various classification algorithms.



## 3. ***EDA***

# Renaming Features

In [None]:
# Renaming complex column names for the sake of simplicity (not a necessary step)
# 'PAY_0' becomes 'PAY_1' to match PAY_2..PAY_6; the target column becomes 'DP_NEXT_MONTH'.
# Both spellings of the target name are listed because rename() silently skips keys that are absent.
clients_df.rename(columns={'PAY_0': 'PAY_1',
                           'default payment next month': 'DP_NEXT_MONTH',
                           'default.payment.next.month': 'DP_NEXT_MONTH'}, inplace=True)
clients_df.columns

# Column: 'DP_NEXT_MONTH'

In [None]:
fig,ax = plt.subplots(1,2, figsize=(12,4))

# Univariate analysis
# Count Plot of Default Payment
count = sns.countplot(data=clients_df, x='DP_NEXT_MONTH', ax=ax[0])
count.set_title('Count Plot of Default Payment')

# adding value count on the top of bar
for p in count.patches:
  count.annotate(format(p.get_height(), '.0f'), (p.get_x(), p.get_height()))

# Univariate analysis
# Percentage of Default and Non-Default Payment
pie = clients_df['DP_NEXT_MONTH'].value_counts().plot(kind='pie',autopct="%1.1f%%",labels=['Not Defaulted','Defaulted'], ax=ax[1])
pie.set_title('Percentage of Default and Non-Default Payment')

**Observation:**

*   The graphs show that default payments are far fewer than non-default payments: 6,636 customers defaulted while 23,364 did not.
*   By percentage, 22.1% of customers defaulted on their payment whereas 77.9% did not.
*   The target classes are therefore imbalanced and need to be balanced. We will do that in the feature engineering step.
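
The percentages quoted above follow directly from the two counts. The short sketch below reproduces the arithmetic; the variable names are illustrative.

```python
# Counts quoted in the observation above.
defaulted = 6636
not_defaulted = 23364

total = defaulted + not_defaulted
default_rate = defaulted / total
non_default_rate = not_defaulted / total
imbalance_ratio = not_defaulted / defaulted  # majority rows per minority row

print(f"total customers : {total}")
print(f"defaulted       : {default_rate:.1%}")
print(f"not defaulted   : {non_default_rate:.1%}")
print(f"imbalance ratio : {imbalance_ratio:.2f} to 1")
```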





# Column: 'LIMIT_BAL'

In [None]:
fig,ax = plt.subplots(1,4, figsize=(15,5))

# Distribution analysis of Limit Balance
hist = sns.histplot(clients_df['LIMIT_BAL'],bins=10, ax=ax[0])
hist.set_title('Distribution Plot of Limit Balance', size=15)

# Bi-variate analysis
# Limit Balance Vs Default Payment
hist = sns.histplot(data=clients_df, x='LIMIT_BAL', hue='DP_NEXT_MONTH',bins=10, ax=ax[1])
hist.set_title('Limit Balance Vs Default Payment', size=15)

# Multi-variate analysis
# Limit Balance Vs SEX
bar = sns.barplot(data=clients_df, x='SEX', y='LIMIT_BAL',hue='DP_NEXT_MONTH', ax=ax[2])
bar.set_title('Limit Balance Vs SEX', size=15)

# adding value count on the top of bar
for p in bar.patches:
    bar.annotate(format(p.get_height(), '.0f'), (p.get_x(), p.get_height()))

# Assign labels to the x-axis categories
# Gender (1=male, 2=female)
bar.set_xticklabels(['Male', 'Female'])

# Bi-variate analysis
# Limit Balance Vs EDUCATION
bar = sns.barplot(data=clients_df, x='EDUCATION', y='LIMIT_BAL', ax=ax[3])
bar.set_title('Limit Balance Vs EDUCATION', size=15)

# adding value count on the top of bar
for p in bar.patches:
    bar.annotate(format(p.get_height(), '.0f'), (p.get_x(), p.get_height()))

# Assign labels to the x-axis categories
# EDUCATION (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
bar.set_xticklabels(['Unknown','Graduate School', 'University', 'High School', 'Others', 'Unknown', 'Unknown'])

# Set x-ticks rotation to 90 degrees
plt.xticks(rotation=90)

plt.tight_layout()
plt.show()

Observation:

*   Most customers have a credit limit of up to 2 lakh (200,000) NT dollars.
*   There appears to be a negative correlation between credit limit and the percentage of defaults.
*   On average, females get a higher limit than males: about 170k for females versus 163k for males.
*   The graph also indicates that higher education goes with a higher credit limit. The several unknown education categories should be merged into one.
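
One way the unknown education categories could be merged is sketched below on a toy column. The choice of code 5 as the single "unknown" value is an assumption made for this example, and `UNKNOWN_CODE` is a hypothetical name; the notebook's own feature engineering step may handle this differently.

```python
import pandas as pd

# Toy column for illustration; codes 0, 5 and 6 are the ones labelled 'Unknown' in the chart above.
education = pd.Series([1, 2, 3, 4, 5, 6, 0, 2], name="EDUCATION")

# Assumption: use a single code (here 5) for every unknown category.
UNKNOWN_CODE = 5
education_merged = education.replace({0: UNKNOWN_CODE, 6: UNKNOWN_CODE})

# Count how many rows fall in each remaining category.
print(education_merged.value_counts().sort_index())
```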



# Column: 'SEX'

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(15, 5))

# Univariate analysis
count = sns.countplot(data=clients_df, x='SEX', ax=ax[0])
count.set_title('Count Plot of Gender', size=15)

# adding value count on the top of the bar
for p in count.patches:
    count.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha = 'center', va = 'center')

# Assign labels to the x-axis categories
count.set_xticks([0, 1])  # Set the tick locations for Male and Female
count.set_xticklabels(['Male', 'Female'])  # Set the labels for the tick locations

# Bivariate analysis
# SEX Vs Default Payment
bar = sns.barplot(data=clients_df, x='SEX', y='DP_NEXT_MONTH', ax=ax[1])
bar.set_title('Proportion of Default Payment in Different Gender', size=15)

# Assign labels to the x-axis categories
bar.set_xticks([0, 1])  # Set the tick locations for Male and Female
bar.set_xticklabels(['Male', 'Female'])  # Set the labels for the tick locations

# adding value count on the top of the bar
for p in bar.patches:
    bar.annotate(format(p.get_height() * 100, '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center')

# Multivariate analysis
# SEX Vs Default Payment with Limit Balance
bar = sns.barplot(data=clients_df, x='SEX', y='LIMIT_BAL', hue='DP_NEXT_MONTH', ax=ax[2])
bar.set_title('SEX Vs Default Payment with Limit Balance', size=15)

# adding value count on the top of the bar
for p in bar.patches:
    bar.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center')

# Assign labels to the x-axis categories
bar.set_xticks([0, 1])  # Set the tick locations for Male and Female
bar.set_xticklabels(['Male', 'Female'])  # Set the labels for the tick locations

plt.tight_layout()
plt.show()

Observation:


*   There are 18112 females and 11888 males in the data set.

*   About 24% of males defaulted, compared with about 21% of females.
*   Fewer males defaulted in absolute numbers, but their default proportion is higher. One possible reason is that males have lower credit limits on their cards, as the graph also shows.
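
The difference between a count and a proportion can be sketched with a group-by on a 0/1 target: the sum of the flag gives the number of defaulters, while its mean gives the default rate. The rows in `toy_df` are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Toy data for illustration only: SEX (1=male, 2=female) and a 0/1 default flag.
toy_df = pd.DataFrame({
    "SEX":           [1] * 4 + [2] * 10,
    "DP_NEXT_MONTH": [1, 1, 0, 0] + [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
})

summary = toy_df.groupby("SEX")["DP_NEXT_MONTH"].agg(
    customers="count",    # group size
    defaults="sum",       # number of defaulters (sum of the 1s)
    default_rate="mean",  # share of defaulters (mean of a 0/1 column)
)
print(summary)
```

In this toy data the smaller group has fewer defaulters in absolute terms but a higher default rate, the same pattern described in the observation above.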



# Column: 'EDUCATION'

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(15, 5))

# Univariate analysis
count = sns.countplot(data=clients_df, x='EDUCATION', ax=ax[0])
count.set_title('Count Plot of Education')

# adding value count on the top of the bar
for p in count.patches:
    count.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center')

# Assign labels to the x-axis categories
education_labels = ['Unknown', 'Graduate School', 'University', 'High School', 'Others', 'Unknown', 'Unknown']
count.set_xticks(range(len(education_labels)))  # Set the tick locations
count.set_xticklabels(education_labels, rotation=90)  # Set the labels with rotation

# Bivariate analysis
# EDUCATION Vs Default Payment
bar = sns.barplot(data=clients_df, x='EDUCATION', y='DP_NEXT_MONTH', ax=ax[1])
bar.set_title('Proportion of Default Payment in Different Education Level')

# adding value count on the top of the bar
for p in bar.patches:
    bar.annotate(format(p.get_height() * 100, '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center')

# Assign labels to the x-axis categories
bar.set_xticks(range(len(education_labels)))  # Set the tick locations
bar.set_xticklabels(education_labels, rotation=90)  # Set the labels with rotation

# Multivariate analysis
# EDUCATION Vs Default Payment with SEX
bar = sns.barplot(data=clients_df, x='EDUCATION', y='DP_NEXT_MONTH', hue='SEX', ax=ax[2])
bar.set_title('Education Vs Default Payment with SEX')

# Assign labels to the x-axis categories
bar.set_xticks(range(len(education_labels)))  # Set the tick locations
bar.set_xticklabels(education_labels, rotation=90)  # Set the labels with rotation

plt.tight_layout()
plt.show()

Observation:

*   There are 14,030 customers with a university degree, 10,585 with a graduate school degree, and 4,917 with a high school education. University graduates are the largest group, followed by graduate school and high school.

*   The proportion of defaults decreases as education level rises: 19% for graduate school, 24% for university, and 25% for high school.
*   In almost all education levels, females have a lower default percentage than males.


# Column: 'MARRIAGE'

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(1, 3, figsize=(15, 5))

# Univariate analysis
count = sns.countplot(data=clients_df, x='MARRIAGE', ax=ax[0])
count.set_title('Count Plot of Marriage')

# adding value count on the top of the bar
for p in count.patches:
    count.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center')

# Assign labels to the x-axis categories
marriage_labels = ['Others', 'Married', 'Single', 'Divorce']
count.set_xticks(range(len(marriage_labels)))  # Set the tick locations
count.set_xticklabels(marriage_labels, rotation=90)  # Set the labels with rotation

# Bivariate analysis
# MARRIAGE Vs Default Payment
bar = sns.barplot(data=clients_df, x='MARRIAGE', y='DP_NEXT_MONTH', ax=ax[1])
bar.set_title('Proportion of Default Payment in Different Marital Status')

# adding value count on the top of the bar
for p in bar.patches:
    bar.annotate(format(p.get_height() * 100, '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center')

# Assign labels to the x-axis categories
bar.set_xticks(range(len(marriage_labels)))  # Set the tick locations
bar.set_xticklabels(marriage_labels, rotation=90)  # Set the labels with rotation

# Multivariate analysis
# MARRIAGE Vs Default Payment with SEX
bar = sns.barplot(data=clients_df, x='MARRIAGE', y='DP_NEXT_MONTH', hue='SEX', ax=ax[2])
bar.set_title('Marital Status Vs Default Payment with SEX')

# Assign labels to the x-axis categories
bar.set_xticks(range(len(marriage_labels)))  # Set the tick locations
bar.set_xticklabels(marriage_labels, rotation=90)  # Set the labels with rotation

plt.tight_layout()
plt.show()

Observation:

*   There are 15,964 single customers, 13,659 married customers, 323 divorced customers, and 54 classified as "others". Single customers are the largest group, followed by married and divorced.

*   The default rate appears to be highest among divorced customers (26%) and lowest among single customers (21%), ignoring "Others" due to the low count.
*   In every marital status, females have a lower default percentage than males.



## Column: 'AGE'

In [None]:
fig,ax = plt.subplots(1,2, figsize=(12,5))

# Distribution analysis of Age
hist = sns.histplot(clients_df['AGE'],bins=6, ax=ax[0])
hist.set_title('Histogram Plot of Age', size=15)

# Bi-variate analysis
# Age Vs Default Payment
hist = sns.histplot(data=clients_df, x='AGE', hue='DP_NEXT_MONTH', bins=6, ax=ax[1])
hist.set(title='Age Vs Default Payment',ylabel='Default Payments Count')

plt.tight_layout()
plt.show()

Observation:


*   With the increase in age the count of customers decreases. Most of the customers belong to the 20-30 year age group followed by the 30-40 age group.
*   With an increase in the age group the count of default payments decreases.



# Columns: 'Payment History'

In [None]:
# Melt the dataset to transform the categorical columns to rows
melted_df = clients_df.melt(value_vars=['PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'], var_name='Category', value_name='Value')

# Group the data by category and value and count the number of occurrences
grouped_df = melted_df.groupby(['Category', 'Value']).size().reset_index(name='Count')

# Create a dictionary to rename old values to new values
# (-2=no consumption, -1=pay duly, 0=the use of revolving credit, 1=payment delay for one month, 2=payment delay for two months,
# … 8=payment delay for eight months, 9=payment delay for nine months and above)
value_map = {-2:'no consumption', -1:'paid', 0:'revolving credit', 1:'1 month delay', 2:'2 month delay', 3:'3 month delay',
              4:'4 month delay', 5:'5 month delay', 6:'6 month delay', 7:'7 month delay', 8:'8 month delay', 9:'9 month and more delay'}

# Replace the old values with the new values
grouped_df['Value'] = grouped_df['Value'].replace(value_map)

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))

# Univariate analysis
bar = sns.barplot(data=grouped_df, x='Category', y='Count',palette='pastel', hue='Value')
bar.set_title('Bar Plot of Payment History')
lgd = plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))

plt.show()

Observation:

*   In every month's payment history, most customers are in the revolving credit category, followed by paid.
*   Among customers with a payment delay, a 2-month delay is the most common in every month, which suggests that a 2-month delay is a critical warning sign of default.



## Columns: 'Bill Amounts'

In [None]:
# Creating few columns to consolidate all the bill amounts
clients_df['Sum_all_bill'] = clients_df['BILL_AMT1']+clients_df['BILL_AMT2']+clients_df['BILL_AMT3']+\
                             clients_df['BILL_AMT4']+clients_df['BILL_AMT5']+clients_df['BILL_AMT6']

clients_df['Avg_bill'] =    (clients_df['BILL_AMT1']+clients_df['BILL_AMT2']+clients_df['BILL_AMT3']+\
                             clients_df['BILL_AMT4']+clients_df['BILL_AMT5']+clients_df['BILL_AMT6'])/6

In [None]:
fig,ax = plt.subplots(1,3, figsize=(15,5))

# Distribution analysis of Bill Amount
hist = sns.histplot(clients_df['Sum_all_bill'],bins=11, ax=ax[0])
hist.set_title('Distribution Plot of Bill Amount', size=15)

# Bi-variate analysis
# Bill amount Vs Default Payment Count
hist = sns.histplot(data=clients_df, x='Avg_bill', hue='DP_NEXT_MONTH',bins=11, ax=ax[1])
hist.set_title('Bill amount Vs Default Payment Count', size=15)

# Bi-variate analysis
# Bill amount Vs Proportion of Default Payment
hist = sns.histplot(data=clients_df, x='Avg_bill', hue='DP_NEXT_MONTH', bins=11, multiple='fill', stat='probability', ax=ax[2])
hist.set_title('Bill amount Vs Proportion of Default Payment', size=15)

plt.tight_layout()
plt.show()

Observation:

*   Every bill amount column contains some negative records, meaning the bill amount is less than zero.
*   Most defaults come from customers whose average bill amount over the last 6 months is negative or up to 2 lakh.

*   However, when the bill amount is compared with default payment, the proportion of defaults rises as the average bill amount rises.
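
The consolidated columns used in these plots can also be built with row-wise aggregation instead of adding the six columns by hand. The sketch below shows this on toy rows invented for the example; `bills_df` and `bill_cols` are illustrative names.

```python
import pandas as pd

# Toy rows for illustration only (not taken from the real dataset).
bills_df = pd.DataFrame({
    "BILL_AMT1": [1000, -200], "BILL_AMT2": [2000, 0], "BILL_AMT3": [3000, 100],
    "BILL_AMT4": [4000, 0],    "BILL_AMT5": [5000, 0], "BILL_AMT6": [6000, 400],
})

bill_cols = [f"BILL_AMT{i}" for i in range(1, 7)]

# Row-wise total and average across the six monthly bill columns.
bills_df["Sum_all_bill"] = bills_df[bill_cols].sum(axis=1)
bills_df["Avg_bill"] = bills_df[bill_cols].mean(axis=1)

print(bills_df[["Sum_all_bill", "Avg_bill"]])
```

One difference worth noting: `sum(axis=1)` and `mean(axis=1)` skip missing values by default, whereas adding columns with `+` propagates them. Since this dataset has no nulls, both approaches give the same result here.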





# Columns: 'Pay Amounts'

In [None]:
# Creating a few columns to consolidate all the pay amounts
clients_df['Sum_all_pay_amount'] = clients_df['PAY_AMT1']+clients_df['PAY_AMT2']+clients_df['PAY_AMT3']+\
                             clients_df['PAY_AMT4']+clients_df['PAY_AMT5']+clients_df['PAY_AMT6']

clients_df['Avg_pay_amount'] =    (clients_df['PAY_AMT1']+clients_df['PAY_AMT2']+clients_df['PAY_AMT3']+\
                             clients_df['PAY_AMT4']+clients_df['PAY_AMT5']+clients_df['PAY_AMT6'])/6

In [None]:
fig,ax = plt.subplots(1,3, figsize=(15,5))

# Distribution analysis of Pay Amount
hist = sns.histplot(clients_df['Sum_all_pay_amount'],bins=11, ax=ax[0])
hist.set_title('Distribution Plot of Pay Amount', size=15)

# Bi-variate analysis
# Pay amount Vs Default Payment Count
hist = sns.histplot(data=clients_df, x='Avg_pay_amount', hue='DP_NEXT_MONTH',bins=11, ax=ax[1])
hist.set_title('Pay Amount Vs Default Payment Count', size=15)

# Bi-variate analysis
# Pay amount Vs Proportion of Default Payment
hist = sns.histplot(data=clients_df, x='Avg_pay_amount', hue='DP_NEXT_MONTH', bins=11, multiple='fill', stat='probability', ax=ax[2])
hist.set_title('Pay Amount Vs Proportion of Default Payment', size=15)

plt.tight_layout()
plt.show()

Observation:

*   In all the pay amount columns, most payments are up to 50000.

*   Bill amounts reach up to 2 lakh, but average pay amounts do not, which is expected because a default occurs when the customer does not pay the credit card bill.
*   When the pay amount is compared with default payment, the proportion of defaults decreases as the payment amount rises.



## ***4. Data Cleaning***

 **Duplicate Values**

In [None]:
# counting duplicate values
clients_df.duplicated().sum()

There are no duplicate records.


 **Missing Values**

In [None]:
# Missing Values/Null Values Count
print(clients_df.isnull().sum())

This shows that there are no missing values.

**Skewness**

In [None]:
# statistical summary
clients_df.describe().T



As can be seen in the statistical summary for numerical features, there is a large gap between the 75th percentile and the maximum value, indicating that the dataset is skewed and contains outliers.
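
The link between a large gap above the 75th percentile and right skew can be checked numerically. The sketch below uses a toy series invented for the example and pandas' `Series.skew`, where a positive value indicates a right-skewed distribution.

```python
import pandas as pd

# Toy values for illustration only: most values are small, one is very large.
amounts = pd.Series([100, 120, 130, 150, 160, 180, 200, 5000])

q3 = amounts.quantile(0.75)
print(f"75th percentile : {q3}")
print(f"maximum         : {amounts.max()}")
print(f"skewness        : {amounts.skew():.2f}")  # positive value = right-skewed
```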


In [None]:
numerical_features = []
categorical_features = []

# splitting features into numeric and categoric.
'''
If feature has more than 15 categories we will consider it
as numerical_features, remaining features will be added to categorical_features.
'''
for col in clients_df.columns:
  if clients_df[col].nunique() > 15:
    numerical_features.append(col)
  else:
    categorical_features.append(col)

print(f'Numerical Features : {numerical_features}')
print(f'Categorical Features : {categorical_features}')

In [None]:
# figsize
plt.figure(figsize=(15,12))
# title
plt.suptitle('Data Distribution of Numerical Features', fontsize=20, fontweight='bold', y=1.02)

for i,col in enumerate(numerical_features):
  plt.subplot(4, 5, i+1)                       # subplots 4 rows, 5 columns

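  # distribution plot (histogram with KDE); distplot is deprecated in recent seaborn releases
  sns.histplot(clients_df[col], kde=True, stat='density')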
  # x-axis label
  plt.xlabel(col)
  plt.tight_layout()

Observation:

*   For numerical features, we can see that the majority of distributions are right-skewed. The distribution of all the bill amounts and pay amounts is highly skewed to the right. It demonstrates that these columns have many outliers.
*   Some of the variables can get a normal distribution when outliers are removed. As a result, it appears that outliers should be removed before the transformation. First, we will get rid of outliers, and then we check to see if we need to use the transformation technique again.



**Treating Outliers**

In [None]:
# figsize
plt.figure(figsize=(15,12))
# title
plt.suptitle('Outlier Analysis of Numerical Features', fontsize=20, fontweight='bold', y=1.02)

for i,col in enumerate(numerical_features):
  plt.subplot(4, 5, i+1)            # subplot of 4 rows and 5 columns

  # boxplot
  sns.boxplot(x=clients_df[col])
  # x-axis label
  plt.xlabel(col)
  plt.tight_layout()

Observation:


*   Outliers are visible in all the bill amount features, all the pay amount features, and the 'LIMIT_BAL' column.



Clipping Method: In this method, we set a cap on outlying data: any value above an upper threshold or below a lower threshold is treated as an outlier. Each such value is replaced with the nearest bound of the allowed range, so values below the range become the lower bound and values above it become the upper bound.
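
A worked example on five toy values shows what clipping does when the bounds come from the IQR rule (1.5 times the interquartile range beyond the first and third quartiles). The values are invented for the example.

```python
import pandas as pd

# Toy values for illustration only; 100 is the obvious outlier.
values = pd.Series([1, 2, 3, 4, 100])

q1 = values.quantile(0.25)      # 2.0
q3 = values.quantile(0.75)      # 4.0
iqr = q3 - q1                   # 2.0
lower_bound = q1 - 1.5 * iqr    # -1.0
upper_bound = q3 + 1.5 * iqr    # 7.0

# Values outside [lower_bound, upper_bound] are replaced by the nearest bound.
clipped = values.clip(lower_bound, upper_bound)
print(clipped.tolist())
```

Only the value 100 lies outside the range, so it is pulled down to the upper bound of 7 while the other four values are unchanged. No rows are dropped, which keeps the dataset size intact.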

In [None]:
# replace outlying datapoints with the upper or lower bound of the allowed range

def clip_outliers(clients_df):
    for col in clients_df[numerical_features]:
        # using IQR method to define range of upper and lower limit.
        q1 = clients_df[col].quantile(0.25)
        q3 = clients_df[col].quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr

        # replacing the outliers with upper and lower bound
        clients_df[col] = clients_df[col].clip(lower_bound, upper_bound)
    return clients_df

In [None]:
# using the function to treat outliers
clients_df = clip_outliers(clients_df)

In [None]:
# checking the boxplot after outlier treatment

# figsize
plt.figure(figsize=(15,12))
# title
plt.suptitle('Outlier Analysis of Numerical Features', fontsize=20, fontweight='bold', y=1.02)

for i,col in enumerate(numerical_features):
  plt.subplot(4, 5, i+1)            # subplot of 4 rows and 5 columns

  # boxplot
  sns.boxplot(x=clients_df[col])
  # x-axis label
  plt.xlabel(col)
  plt.tight_layout()

In [None]:
# checking for distribution after treating outliers.

# figsize
plt.figure(figsize=(15,12))
# title
plt.suptitle('Data Distribution of Numerical Features', fontsize=20, fontweight='bold', y=1.02)

for i,col in enumerate(numerical_features):
  plt.subplot(4, 5, i+1)                       # subplots 4 rows, 5 columns

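  # distribution plot (histogram with KDE); distplot is deprecated in recent seaborn releases
  sns.histplot(clients_df[col], kde=True, stat='density')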
  # x-axis label
  plt.xlabel(col)
  plt.tight_layout()



*   The distributions shift after outlier treatment. Some features were heavily skewed before the outliers were handled; afterwards they are much closer to a normal distribution. Therefore, we are not applying any further transformation to the numerical features.



# 4. Hypothesis Testing



Based on the charts above, we define three hypothetical statements about the dataset and test each one statistically to reach a conclusion.

The cell below defines helpers that compute the test statistics and p-values needed for the hypothesis tests.


In [None]:
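# Imports needed by the hypothesis-testing helpers
import math
import numpy as np
from scipy import stats
from scipy.stats import norm

# Helper class that computes the test statistics
class findz:
  # z statistic for a proportion
  def proportion(self,sample,hyp,size):
    return (sample - hyp)/math.sqrt(hyp*(1-hyp)/size)
  # z statistic for a mean
  def mean(self,hyp,sample,size,std):
    return (sample - hyp)*math.sqrt(size)/std
  # chi-square statistic for a variance
  def varience(self,hyp,sample,size):
    return (size-1)*sample/hyp

# Sample variance (n-1 in the denominator)
variance = lambda x : np.var(np.asarray(x, dtype=float), ddof=1)
# CDF of the standard normal distribution
zcdf = lambda x: norm(0,1).cdf(x)

# Function for getting the P value
# tailed: 'l' = left-tailed, 'r' = right-tailed, 'd' = two-tailed
# t: "true" runs a one-sample t-test on df[col]; anything else uses the z statistic
def p_value(z,tailed,t,hypothesis_number,df,col):
  if t!="true":
    z=zcdf(z)
    if tailed=='l':
      return z
    elif tailed == 'r':
      return 1-z
    elif tailed == 'd':
      if z>0.5:
        return 2*(1-z)
      else:
        return 2*z
    else:
      return np.nan
  else:
    # map the tail code to scipy's `alternative` argument (needs SciPy >= 1.6)
    # so that one-tailed tests do not silently return a two-sided p-value
    alternative = {'l':'less', 'r':'greater', 'd':'two-sided'}.get(tailed)
    if alternative is None:
      return np.nan
    _, p = stats.ttest_1samp(df[col],hypothesis_number,alternative=alternative)
    return p


# Conclusion about the P - Value
def conclusion(p):
  significance_level = 0.05
  if p>significance_level:
    return f"Failed to reject the Null Hypothesis for p = {p}."
  else:
    return f"Null Hypothesis rejected Successfully for p = {p}"

# Initializing the class (the instance reuses the class name)
findz = findz()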



1.   Men not defaulting are more than or equal to 40 years of AGE
2.   Customers defaulting have limit balance less than 100000
3.   Customers defaulting have total last bill amount of 50000.

In all of the hypothesis tests in this notebook, we will use a significance level of α = 0.05
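To make the tail conventions concrete, the sketch below converts a z statistic into a p-value for each of the three test types and applies the α = 0.05 rule. The function name `z_p_value` and the z value are illustrative assumptions, not part of the notebook.

```python
from scipy.stats import norm

def z_p_value(z, tail):
    """p-value for a z statistic; tail is 'left', 'right', or 'two-sided'."""
    if tail == "left":
        return norm.cdf(z)
    if tail == "right":
        return norm.sf(z)            # sf(z) = 1 - cdf(z)
    if tail == "two-sided":
        return 2 * norm.sf(abs(z))
    raise ValueError(f"unknown tail: {tail}")

alpha = 0.05
z = -1.96  # illustrative value, not computed from the dataset
for tail in ("left", "right", "two-sided"):
    p = z_p_value(z, tail)
    decision = "reject H0" if p <= alpha else "fail to reject H0"
    print(f"{tail:>9}: p = {p:.4f} -> {decision}")
```

The same statistic can lead to different decisions depending on the tail, which is why the test type is stated alongside each hypothesis.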




**Hypothetical Statement - 1**

Men not defaulting are more than or equal to 40 years of AGE.



State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis: N = 40, where N is the mean AGE of male customers who did not default

Alternate Hypothesis : N < 40

Test Type: Left Tailed Test


In [None]:
# Perform Statistical Test to obtain P-Value

# SEX:
# 1 = male; 2 = female

# DP_NEXT_MONTH:
# 0 = non-default; 1 = default

hypo_1 = clients_df[(clients_df['SEX']==1) & (clients_df["DP_NEXT_MONTH"]==0)]

# Getting the required parameter values for hypothesis testing
hypothesis_number = 40
sample_mean = hypo_1["AGE"].mean()
size = len(hypo_1)
std=(variance(hypo_1["AGE"]))**0.5

In [None]:
# Getting Z value
z = findz.mean(hypothesis_number,sample_mean,size,std)

# Getting P - Value
p = p_value(z=z,tailed='l',t="false",hypothesis_number=hypothesis_number,df=hypo_1,col="AGE")

# Getting Conclusion
print(conclusion(p))

Which statistical test have you done to obtain P-Value?


I used a one-sample Z-test to obtain the p-value. The reported result was a failure to reject the null hypothesis, so the test gives no evidence that the mean age of non-defaulting male customers is below 40. Failing to reject the null does not by itself prove that the mean is 40 or higher.

Why did you choose the specific statistical test?

In [None]:
# Histogram with a KDE curve for the tested column, to inspect the data distribution

fig=plt.figure(figsize=(9,6))
ax=fig.gca()
feature= hypo_1["AGE"]
sns.histplot(feature, kde=True, stat="density", ax=ax)   # replaces the deprecated distplot
ax.axvline(feature.mean(),color='magenta', linestyle='dashed', linewidth=2)    # mean
ax.axvline(feature.median(),color='cyan', linestyle='dashed', linewidth=2)     # median
ax.set_title("AGE")   # previously used a leftover loop variable `col`
plt.show()

In [None]:
mean_median_difference=hypo_1["AGE"].mean()- hypo_1["AGE"].median()
print("Mean Median Difference is :-",mean_median_difference)



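The figure shows that the mean and median are close; the difference between them is 1.38, below the cut-off of 10 used in this notebook. This suggests that the AGE distribution is roughly symmetric, and with a large sample the sample mean is approximately normally distributed, so I used the Z-test directly.

We have failed to reject the null hypothesis N = 40 in favor of the alternative N < 40.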


**Hypothetical Statement - 2**

Customers defaulting have limit balance less than 100000



State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis: N = 100000, where N is the mean LIMIT_BAL of customers who defaulted

Alternate Hypothesis : N > 100000

Test Type: Right Tailed Test


In [None]:
# Perform Statistical Test to obtain P-Value

# DP_NEXT_MONTH:
# 0 = non-default; 1 = default
hypo_2=clients_df[(clients_df["DP_NEXT_MONTH"]==1)]

# Getting the required parameter values for hypothesis testing
hypothesis_number = 100000
sample_mean = hypo_2["LIMIT_BAL"].mean()
size = len(hypo_2)
std=(variance(hypo_2["LIMIT_BAL"]))**0.5

In [None]:
# Getting Z value
z = findz.mean(hypothesis_number,sample_mean,size,std)

# Getting P - Value
p = p_value(z=z,tailed='r',t="true",hypothesis_number=hypothesis_number,df=hypo_2,col="LIMIT_BAL")

# Getting Conclusion
print(conclusion(p))

Which statistical test have you done to obtain P-Value?

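I used a one-sample t-test to obtain the p-value. The reported result rejected the null hypothesis N = 100000. Because the alternative hypothesis stated above is N > 100000, a rejection in that direction is evidence that the mean limit balance of defaulting customers exceeds 100,000, so the test as set up does not by itself support the statement that it is below 100,000. Comparing the sample mean with 100,000 shows which direction the data favor.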

Why did you choose the specific statistical test?

In [None]:
# Histogram with a KDE curve for the tested column, to inspect the data distribution

fig=plt.figure(figsize=(9,6))
ax=fig.gca()
feature= hypo_2["LIMIT_BAL"]
sns.histplot(feature, kde=True, stat="density", ax=ax)   # replaces the deprecated distplot
ax.axvline(feature.mean(),color='magenta', linestyle='dashed', linewidth=2)    # mean
ax.axvline(feature.median(),color='cyan', linestyle='dashed', linewidth=2)     # median
ax.set_title("LIMIT_BAL")   # previously used a leftover loop variable `col`
plt.show()

In [None]:
mean_median_difference=hypo_2["LIMIT_BAL"].mean()- hypo_2["LIMIT_BAL"].median()
print("Mean Median Difference is :-",mean_median_difference)



The graph above shows that the mean is greater than the median by more than 10,000, so the distribution is positively skewed. A Z-test that relies on normally distributed data is therefore harder to justify here.

Non-parametric tests are most useful for small studies. In large studies they may answer a different question from the one intended, which can confuse readers. With large sample sizes, t-tests and their accompanying confidence intervals remain appropriate even for heavily skewed data.

Therefore, I used the t-test, which is a reasonable choice for skewed data when the sample is large.
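The sketch below shows a one-sample t-test on a deliberately skewed synthetic sample, with the tail chosen through SciPy's `alternative` argument (available from SciPy 1.6). The exponential sample and the hypothesized mean are assumptions made for the illustration; they are not the LIMIT_BAL data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic right-skewed sample; the true mean of this exponential is 120000
sample = rng.exponential(scale=120_000, size=5_000)

popmean = 100_000
# One-sided test: the alternative says the population mean is greater than popmean
t_stat, p_right = stats.ttest_1samp(sample, popmean, alternative="greater")
# Two-sided test on the same data, for comparison
_, p_two = stats.ttest_1samp(sample, popmean)

print(f"t = {t_stat:.2f}, right-tailed p = {p_right:.4g}, two-sided p = {p_two:.4g}")
```

When the t statistic is positive, the two-sided p-value is twice the right-tailed one, so reporting a two-sided p-value for a one-tailed hypothesis changes the evidence by a factor of two.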


**Hypothetical Statement - 3**

Customers defaulting have total last bill amount of 50000.



State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis: N = 50000, where N is the mean BILL_AMT1 (most recent bill amount) of customers who defaulted

Alternate Hypothesis : N != 50000

Test Type: Two Tailed test


In [None]:
# Perform Statistical Test to obtain P-Value

# DP_NEXT_MONTH:
# 0 = non-default; 1 = default
hypo_3=clients_df[(clients_df["DP_NEXT_MONTH"]==1)]

# Getting the required parameter values for hypothesis testing
hypothesis_number = 50000
sample_mean = hypo_3["BILL_AMT1"].mean()
size = len(hypo_3)
std=(variance(hypo_3["BILL_AMT1"]))**0.5

In [None]:
# Getting Z value
z = findz.mean(hypothesis_number,sample_mean,size,std)

# Getting P - Value
p = p_value(z=z,tailed='d',t="true",hypothesis_number=hypothesis_number,df=hypo_3,col="BILL_AMT1")

# Getting Conclusion
print(conclusion(p))

Which statistical test have you done to obtain P-Value?

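I used a one-sample t-test to obtain the p-value. The reported result was a failure to reject the null hypothesis, so the data are consistent with the statement that defaulting customers have a mean last bill amount of 50,000. Failing to reject the null does not prove the statement; it only means the sample gives no significant evidence against it.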

Why did you choose the specific statistical test?

In [None]:
# Histogram with a KDE curve for the tested column, to inspect the data distribution

fig=plt.figure(figsize=(9,6))
ax=fig.gca()
feature= hypo_3["BILL_AMT1"]
sns.histplot(feature, kde=True, stat="density", ax=ax)   # replaces the deprecated distplot
ax.axvline(feature.mean(),color='magenta', linestyle='dashed', linewidth=2)    # mean
ax.axvline(feature.median(),color='cyan', linestyle='dashed', linewidth=2)     # median
ax.set_title("BILL_AMT1")   # previously used a leftover loop variable `col`
plt.show()

In [None]:
mean_median_difference=hypo_3["BILL_AMT1"].mean()- hypo_3["BILL_AMT1"].median()   # mean minus median, as in the earlier checks
print("Mean Median Difference is :-",mean_median_difference)



The graph above shows that the mean is greater than the median by more than 10,000, so the distribution is positively skewed. A Z-test that relies on normally distributed data is therefore harder to justify here.

Non-parametric tests are most useful for small studies. In large studies they may answer a different question from the one intended, which can confuse readers. With large sample sizes, t-tests and their accompanying confidence intervals remain appropriate even for heavily skewed data.

Therefore, I used the t-test, which is a reasonable choice for skewed data when the sample is large.


# 5. Feature Engineering

**5.1 Feature Manipulation**

In [None]:
# copying this data to protect the work done till now
df_feature = clients_df.copy()

In [None]:
numerical_features = []
categorical_features = []

# splitting features into numeric and categoric.
'''
If feature has more than 15 categories we will consider it
as numerical_features, remaining features will be added to categorical_features.
'''
for col in df_feature.columns:
  if df_feature[col].nunique() > 15:
    numerical_features.append(col)
  else:
    categorical_features.append(col)

print(f'Numerical Features : {numerical_features}')
print(f'Categorical Features : {categorical_features}')

5.1.1 Bill_AMT

A bill amount normally represents money the card holder owes the bank, so it is expected to be non-negative. Negative values can appear when an account carries a credit balance (for example, after an overpayment) or because of data entry errors. We treat them as anomalous here and drop every row with a negative BILL_AMT1.

In [None]:
df_feature = df_feature[df_feature['BILL_AMT1'] >= 0]

5.1.2 EDUCATION

The EDUCATION column contains several unknown sub-categories, so we merge codes 0, 5, and 6 into code 4 to form a single catch-all sub-category.

In [None]:
# Checking the value counts of each sub-category of EDUCATION
df_feature['EDUCATION'].value_counts()

In [None]:
# Lambda Function can be used to convert all unknown sub-category as one unknown sub-category
df_feature['EDUCATION'] = df_feature['EDUCATION'].apply(lambda x: 4 if x in [0, 5, 6] else x)

In [None]:
# Checking the result
df_feature['EDUCATION'].value_counts()

**5.2 Encoding**

In [None]:
# Check Unique Values for each categorical variable.
for i in categorical_features:
  print("No. of unique values in",i,"is",df_feature[i].nunique())

In [None]:
# dropping our target variable from categorical features list
categorical_features.remove('DP_NEXT_MONTH')

In [None]:
# checking the data type of each feature
df_feature.info()

Observation

*   All the categorical columns are already encoded as integers; we only need to convert their data type to object (string) or category.



In [None]:
# Cast values in the categorical columns as type str.                 # can use astype('category') too.
df_feature[categorical_features] = df_feature[categorical_features].astype(str)

# checking the result
df_feature.dtypes

**5.3 Correlation Coefficient and Heatmap**

In [None]:
# Plotting correlation heatmap
plt.figure(figsize=(15,5))
sns.heatmap(df_feature.corr(numeric_only=True), annot=True)   # categorical columns are strings now, so restrict to numeric ones

In [None]:
# find and remove correlated features

def correlation(dataset, threshold):
    col_corr = set()                                           # Set of all the names of correlated features
    corr_matrix = dataset.corr(numeric_only=True)              # ignore the string-typed categorical columns
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold:        # we are interested in absolute coeff value
                colname = corr_matrix.columns[i]               # getting the name of column
                col_corr.add(colname)
    return col_corr

In [None]:
# checking the highly correlated features
correlation(df_feature, 0.7)          # setting threshold of 0.7

In [None]:
# dropping columns due to multi-collinearity

df_feature.drop(['Avg_bill','Avg_pay_amount','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6','Sum_all_bill'], axis=1, inplace=True)

In [None]:
# Plotting correlation heatmap again
plt.figure(figsize=(15,5))
sns.heatmap(df_feature.corr(numeric_only=True), annot=True)   # numeric columns only

**5.4 Feature Selection**



Feature selection is a technique in machine learning where you select a subset of the most important features from a larger set of features to use as inputs for a model. The goal of feature selection is to reduce the number of features used in the model, while retaining the most important and relevant information from the data.
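One simple selection heuristic, the one applied in section 5.3, is to drop one feature from every highly correlated pair. The sketch below shows that idea on a toy DataFrame; the helper name `correlated_columns`, the column names, and the data are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def correlated_columns(df, threshold):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # keep only the strict upper triangle so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

rng = np.random.default_rng(1)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a * 2 + rng.normal(scale=0.01, size=200),  # almost perfectly correlated with "a"
    "b": rng.normal(size=200),                           # independent of "a"
})

to_drop = correlated_columns(toy, threshold=0.7)
reduced = toy.drop(columns=to_drop)
print(to_drop)
print(reduced.columns.tolist())
```

Only the later column of each correlated pair is flagged, so the first occurrence of the shared information is kept in the reduced frame.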



*   Dropping unnecessary columns



In [None]:
# dropping the ID column
df_feature.drop('ID',axis = 1, inplace = True)

# Dropping Sum_all_pay_amount because it was created for EDA
df_feature.drop(['Sum_all_pay_amount'],axis=1, inplace=True)


**5.5 Handling Imbalance**

Checking if our target variable is balanced or not

In [None]:
# Dependent Column Value Counts
print(df_feature.DP_NEXT_MONTH.value_counts())
print(" ")

# Dependent Variable Column Visualization
fig,ax = plt.subplots(1,2, figsize=(15,6))

# pie chart for percentage
df_feature['DP_NEXT_MONTH'].value_counts().plot(kind='pie',autopct="%1.1f%%",startangle=90, ax=ax[0])

# bar chart for count
df_feature['DP_NEXT_MONTH'].value_counts().plot(kind='bar', ax=ax[1])
plt.show()

Class imbalance arises when some classes have significantly more instances than others. It is a problem for machine learning models because it can bias predictions toward the majority class. That is why we need to balance the target class.

The two target classes differ substantially in size, so our data are imbalanced. We will use the Synthetic Minority Oversampling Technique (SMOTE) to resolve this issue.

*   SMOTE works by randomly selecting a minority class point and finding its k nearest minority-class neighbors. A synthetic point is then placed at a random position on the line segment between the selected point and one of those neighbors. These steps are repeated until the classes are balanced.
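The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, not the `imblearn` implementation; the function name `smote_sketch` and the toy points are assumptions made for the example.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # pairwise Euclidean distances between minority points
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(X_min))             # pick a random minority point
        j = rng.choice(neighbours[i])            # pick one of its k nearest neighbours
        gap = rng.random()                       # position along the segment, in [0, 1)
        synthetic[row] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic

# Toy example: 6 minority points in 2-D, with 20 majority points assumed elsewhere
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.7], [1.3, 1.0]])
n_majority = 20
X_new = smote_sketch(X_min, n_new=n_majority - len(X_min), k=3)
print(X_new.shape)
```

Because every synthetic row lies between two real minority rows, categorical codes can receive fractional values under this scheme, which is worth keeping in mind for encoded categorical columns.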



In [None]:
## Handling target class imbalance using SMOTE
from collections import Counter
from imblearn.over_sampling import SMOTE

X = df_feature.drop(columns='DP_NEXT_MONTH')     # independent features
y = df_feature['DP_NEXT_MONTH']                  # dependent features

print(f'Before Handling Imbalanced class {Counter(y)}')

# Resampling the minority class
smote = SMOTE(random_state=42)

# fit predictor and target variable
X, y = smote.fit_resample(X, y)

print(f'After Handling Imbalanced class {Counter(y)}')

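We have successfully balanced the target variable. Note that SMOTE is applied here before the train/test split, so synthetic rows interpolated from points that later land in the test set can end up in the training set. Applying SMOTE to the training split only would give a less optimistic test estimate.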

# 6. Model Building

**6.1 Train Test Split**

In [None]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape)
print(X_test.shape)

**6.2 Scaling Data**

In [None]:
# Scaling Data

# Initialize the scaler
scaler = StandardScaler()

# Scale the features using StandardScaler
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
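What the scaler does can be written out by hand: the mean and standard deviation are learned from the training rows only and then applied to both splits. The sketch below uses synthetic arrays with made-up names (`X_train_toy`, `X_test_toy`) as stand-ins for the real splits.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for a train/test split of two numeric features
X_train_toy = rng.normal(loc=[50.0, 1000.0], scale=[5.0, 200.0], size=(80, 2))
X_test_toy = rng.normal(loc=[50.0, 1000.0], scale=[5.0, 200.0], size=(20, 2))

# "fit": learn the statistics from the training rows only
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)      # population std (ddof=0), which StandardScaler also uses

# "transform": apply the same statistics to both splits
train_scaled = (X_train_toy - mean) / std
test_scaled = (X_test_toy - mean) / std

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(test_scaled.mean(axis=0).round(3))
```

The scaled training columns have mean 0 and standard deviation 1 by construction, while the test columns are only close to that, because the test rows did not contribute to the statistics.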



**6.3 Model Training**

In [None]:
# empty list for appending performance metric score
model_result = []

def predict(ml_model, model_name):

  '''
  Fit the given model and predict on the train and test data.
  Calculates all the evaluation metrics and appends the scores to the model_result list.
  Plots the confusion matrix and the KS chart (built from roc_curve) for the test data.
  '''

  # model fitting
  model = ml_model.fit(X_train, y_train)

  # predicting value and probability
  y_train_pred = model.predict(X_train)
  y_test_pred = model.predict(X_test)
  y_train_prob = model.predict_proba(X_train)[:,1]
  y_test_prob = model.predict_proba(X_test)[:,1]


  ''' Performance Metrics '''
  # accuracy score  ---->  (TP+TN)/(TP+FP+TN+FN)
  train_accuracy = accuracy_score(y_train, y_train_pred)
  test_accuracy = accuracy_score(y_test, y_test_pred)
  print(f'train accuracy : {round(train_accuracy,3)}')
  print(f'test accuracy : {round(test_accuracy,3)}')

  # precision score  ---->  TP/(TP+FP)
  train_precision = precision_score(y_train, y_train_pred)
  test_precision = precision_score(y_test, y_test_pred)
  print(f'train precision : {round(train_precision,3)}')
  print(f'test precision : {round(test_precision,3)}')

  # recall score  ---->  TP/(TP+FN)
  train_recall = recall_score(y_train, y_train_pred)
  test_recall = recall_score(y_test, y_test_pred)
  print(f'train recall : {round(train_recall,3)}')
  print(f'test recall : {round(test_recall,3)}')

  # f1 score  ---->  Harmonic Mean of Precision and Recall
  train_f1 = f1_score(y_train, y_train_pred)
  test_f1 = f1_score(y_test, y_test_pred)
  print(f'train f1 : {round(train_f1,3)}')
  print(f'test f1 : {round(test_f1,3)}')

  # roc_auc score  ---->  It shows how well the model can differentiate between classes.
  train_roc_auc = roc_auc_score(y_train, y_train_prob)
  test_roc_auc = roc_auc_score(y_test, y_test_prob)
  print(f'train roc_auc : {round(train_roc_auc,3)}')
  print(f'test roc_auc : {round(test_roc_auc,3)}')
  print('-'*80)

  # classification report
  print(f'classification report for test data \n{classification_report(y_test, y_test_pred)}')
  print('-'*80)


  ''' plotting Confusion Matrix '''
  ConfusionMatrixDisplay.from_predictions(y_test, y_test_pred)
  plt.title('confusion matrix on Test data', weight='bold')
  plt.show()
  print('-'*80)


  ''' actual value vs predicted value on test data'''
  d = {'y_actual':y_test, 'y_predict':y_test_pred}
  print(pd.DataFrame(data=d).head(10).T)                   # constructing a dataframe with both actual and predicted values
  print('-'*80)

  '''Calculate threshold values for K-S chart'''

  # Compute the false positive rate, true positive rate, and thresholds for the ROC curve
  fpr, tpr, thresholds = roc_curve(y_test, y_test_prob)

  # Difference between the true positive rate and the false positive rate at each threshold
  ks_stat = tpr - fpr

  # Threshold at which that difference is largest (the KS statistic)
  ks_threshold = thresholds[np.argmax(ks_stat)]

  # Plot the KS chart; the first threshold returned by roc_curve is a sentinel above every score, so skip it
  plt.plot(thresholds[1:], tpr[1:], label='True Positive Rate')
  plt.plot(thresholds[1:], fpr[1:], label='False Positive Rate')
  plt.plot(thresholds[1:], ks_stat[1:], label='KS Statistic')
  plt.axvline(ks_threshold, color='black', linestyle='--', label=f'KS Threshold: {ks_threshold:.2f}')
  plt.title('KS Chart')
  plt.xlabel('Threshold')
  plt.ylabel('Rate')
  plt.legend()
  plt.show()


  '''Using the score from the performance metrics to create the final model_result'''
  model_result.append({'model':model_name,
                       'train_accuracy':train_accuracy,
                       'test_accuracy':test_accuracy,
                       'train_precision':train_precision,
                       'test_precision':test_precision,
                       'train_recall':train_recall,
                       'test_recall':test_recall,
                       'train_f1':train_f1,
                       'test_f1':test_f1,
                       'train_roc_auc':train_roc_auc,
                       'test_roc_auc':test_roc_auc})
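The formulas in the comments above can be checked by hand on a tiny example. The sketch below computes the confusion-matrix counts, the derived metrics, and the KS statistic with plain NumPy; the labels and probabilities are hypothetical values chosen for the illustration.

```python
import numpy as np

# Hypothetical labels and predicted probabilities for eight test rows
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.65, 0.4, 0.55, 0.1, 0.8, 0.3])
y_pred = (y_prob >= 0.5).astype(int)

tp = int(((y_pred == 1) & (y_true == 1)).sum())
fp = int(((y_pred == 1) & (y_true == 0)).sum())
fn = int(((y_pred == 0) & (y_true == 1)).sum())
tn = int(((y_pred == 0) & (y_true == 0)).sum())

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

# KS statistic: largest gap between TPR and FPR over all candidate thresholds
thresholds = np.sort(np.unique(y_prob))[::-1]
tpr = np.array([((y_prob >= t) & (y_true == 1)).sum() / (y_true == 1).sum() for t in thresholds])
fpr = np.array([((y_prob >= t) & (y_true == 0)).sum() / (y_true == 0).sum() for t in thresholds])
ks = (tpr - fpr).max()

print(tp, fp, fn, tn)
print(accuracy, precision, recall, f1, ks)
```

Recall counts how many actual defaulters are caught, which is why it is weighted heavily in this project: a missed defaulter (a false negative) is usually the costlier error for a lender.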



# 7. Model Implementation

**7.1 Logistic Regression**

In [None]:
predict(LogisticRegression(), 'LogisticRegression')

**7.2 KNN (K-Nearest Neighbours)**

In [None]:
# Checking the optimum value of the k:
accuracy=[]

# Iterating over candidate values of k
for i in range(1,15):
  knn=KNeighborsClassifier(n_neighbors=i)
  knn.fit(X_train,y_train)
  accuracy.append(knn.score(X_test, y_test))

#plotting the k-value vs accuracy
plt.title('k-NN Varying number of neighbors')
plt.plot(range(1,15), accuracy)
plt.xlabel('number of neighbours')
plt.ylabel('Accuracy')
plt.show()

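The best accuracy is at K=1, so we use k=1 for the KNN model. Note that k is chosen here using accuracy on the test set, so the test score reported for this model is likely to be optimistic.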
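An alternative that keeps the test set untouched is to choose k by cross-validation on the training data. The sketch below shows the pattern on a synthetic dataset from `make_classification`; the data and the variable names are assumptions made for the example, so the selected k says nothing about the credit card data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled training data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

cv_scores = {}
for k in range(1, 15):
    knn = KNeighborsClassifier(n_neighbors=k)
    # mean accuracy over 5 folds of the training data; no test rows are used
    cv_scores[k] = cross_val_score(knn, X_demo, y_demo, cv=5, scoring="accuracy").mean()

best_k = max(cv_scores, key=cv_scores.get)
print(best_k, round(cv_scores[best_k], 3))
```

Passing `scoring="recall"` or `scoring="f1"` instead would align the selection with the metrics this project prioritizes.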

In [None]:
predict(KNeighborsClassifier(n_neighbors=1), 'KNN')

**7.3 Decision Tree**

In [None]:
predict(DecisionTreeClassifier(), 'DecisionTree')

**7.4 Random Forest**

Hyperparameter Tuning using GridSearchCV

In [None]:
rf_params = {'n_estimators': [50,75],           # number of trees in the ensemble
             'max_depth': [70,80],              # maximum number of levels allowed in each tree.
             'min_samples_split': [2,5],        # minimum number of samples necessary in a node to cause node splitting.
             'min_samples_leaf': [3,4]}         # minimum number of samples which can be stored in a tree leaf.



# performing Hyperparameter Tuning using GridSearchCV
rf = RandomForestClassifier(random_state=42)
rf_gridsearch = GridSearchCV(estimator=rf, param_grid=rf_params, cv=5, verbose=2, n_jobs=-1)

# model fitting
rf_gridsearch.fit(X_train,y_train)
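GridSearchCV tries every combination in the grid, whereas RandomizedSearchCV samples a fixed number of combinations (`n_iter`). The short sketch below counts how much work the grid above implies; it copies the grid from the cell above and uses scikit-learn's `ParameterGrid` only for counting.

```python
from sklearn.model_selection import ParameterGrid

rf_params = {'n_estimators': [50, 75],
             'max_depth': [70, 80],
             'min_samples_split': [2, 5],
             'min_samples_leaf': [3, 4]}

n_candidates = len(ParameterGrid(rf_params))   # every combination of the listed values
n_fits = n_candidates * 5                      # one fit per candidate per fold with cv=5
print(n_candidates, n_fits)                    # 16 80
```

With the default `refit=True`, one more fit on the full training data follows for the best combination. For larger grids the count grows multiplicatively, which is the usual reason to switch to a randomized search.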

In [None]:
optimal_model = rf_gridsearch.best_estimator_
optimal_model

In [None]:
optimal_model =RandomForestClassifier(max_depth=70, min_samples_leaf=3, n_estimators=75,
                       random_state=42)
predict(optimal_model, 'RandomForest')
