<a href="https://colab.research.google.com/github/Nakulcj7/Creditcard/blob/main/Credit_card_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Credit Card Default Prediction



##### **Project Type**    - Classification
##### **Contribution**    - Individual


# **Project Summary -**



The aim of this credit card default prediction project is to develop a machine learning model that can accurately predict which credit card users are likely to default on their payments in the future. The model uses historical data about credit card users, such as their payment history, credit limit, age, education, and other demographic information, to identify patterns and trends that help predict default behavior. The project uses historical data on customers' default payments in Taiwan.


*   The dataset contains 30,000 records and 25 attributes.

*   I started by importing the dataset and the necessary libraries, then conducted exploratory data analysis (EDA) to get a clear insight into each feature by separating the dataset into numeric and categorical features. I performed univariate, bivariate, and multivariate analyses.

*   After that, I checked the raw data for outliers and null values, and transformed the data so that it was compatible with machine learning models.
*   In feature engineering, I transformed the raw data into a more useful and informative form through encoding, feature manipulation, and feature selection. I handled the target class imbalance using SMOTE.


*   Finally, the cleaned and scaled data was passed to various models, evaluation metrics were computed for each model, and the hyperparameters were tuned so that suitable parameters were passed to each model. To select the final model based on the requirements, I checked model_result.


*   When developing a machine learning model, it is generally recommended to track multiple metrics because each one highlights a distinct aspect of model performance. Here, however, the focus is on the Recall and F1 scores, because the classes are imbalanced and, in credit card data, a missed defaulter is typically the costlier error.
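To make the emphasis on Recall and F1 concrete, the sketch below computes the usual metrics from confusion-matrix counts for a hypothetical baseline that labels every customer as "not defaulted". The function name `classification_metrics` is illustrative only, and the class counts (23,364 non-defaulters and 6,636 defaulters) are the ones reported in the EDA section of this notebook.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical baseline: every customer is predicted as "not defaulted",
# so all 6,636 defaulters become false negatives.
baseline = classification_metrics(tp=0, fp=0, fn=6636, tn=23364)
print(baseline)
```

This baseline reaches roughly 78% accuracy while catching no defaulters at all, which is why accuracy alone is a weak guide for this problem.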



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Despite high returns, credit cards carry significant risks. As the number of credit cards in circulation keeps growing, so does the volume of credit card defaults, and the resulting mass of billing and repayment data creates difficulties of its own for risk controllers. As a result, one of the primary concerns of banks is how to use the data generated by users and extract useful information to control risks, reduce the default rate, and control the growth of non-performing assets.

A credit card issuer based in Taiwan wants to learn more about how likely its customers are to default on their payments and the main factors that influence this probability. The issuer's decisions regarding who to issue a credit card to and what credit limit to offer would be informed by this information. The issuer's future strategy, including plans to offer targeted credit products to their customers, would be informed by a better understanding of their current and potential customers as a result of this.
# Objective

*   To determine the main factors that influence the likelihood of defaulting on a credit card.
*   To determine the likelihood that the bank's customers will default on their credit card payments.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# libraries that are used for analysis and visualization
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# libraries to do statistical analysis
import math
from scipy.stats import *

# libraries used to pre-process
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# libraries used to implement models
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier


# libraries to evaluate performance
import sklearn.metrics as metrics
from sklearn.metrics import auc, accuracy_score, roc_auc_score, roc_curve, confusion_matrix
from sklearn.metrics import precision_score, f1_score, recall_score
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

# library of warnings would assist in ignoring warnings issued
import warnings
warnings.filterwarnings("ignore")

# to set max column display
pd.set_option('display.max_columns', None)

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')


In [None]:
# Loading the Dataset
clients_df = pd.read_csv('/content/drive/MyDrive/Almabetter/Creditcarddataset.csv', encoding='ISO-8859-1')

### Dataset First View

In [None]:
# Dataset First Look

# Viewing the top 5 rows to take a glimpse of the data
clients_df.head()


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
clients_df.shape

In [None]:
print(f'number of rows : {clients_df.shape[0]}  \nnumber of columns : {clients_df.shape[1]}')

### Dataset Information

In [None]:
# Dataset Info
clients_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

value=len(clients_df[clients_df.duplicated()])
print("The number of duplicate values in the data set is = ",value)

This shows that there are no duplicate values in the dataset.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(clients_df.isnull().sum())

In [None]:
# Visualizing the missing values
import missingno as msno
msno.bar(clients_df, color='red',sort='ascending', figsize=(7,3), fontsize=12)

This confirms that there are no null values in the dataset.
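The duplicate and null checks above can also be bundled into one reusable helper. The following is a minimal sketch, assuming a hypothetical function name `data_quality_summary` and a made-up toy frame in place of `clients_df`:

```python
import pandas as pd


def data_quality_summary(df):
    """Count rows, fully duplicated rows and missing cells in a DataFrame."""
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_cells": int(df.isnull().sum().sum()),
    }


# Toy frame standing in for clients_df (values are made up for illustration).
toy_df = pd.DataFrame({
    "ID": [1, 2, 3, 3],
    "LIMIT_BAL": [20000.0, None, 90000.0, 90000.0],
})
summary = data_quality_summary(toy_df)
print(summary)
```

Calling `data_quality_summary(clients_df)` on the real data should report zero duplicates and zero missing cells if the checks above hold.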

### What did you know about your dataset?



There are 30000 rows and 25 columns in the dataset. The dataset does not contain any duplicate or missing values.

The given dataset is from the banking industry. Our task is to examine customer credit default and its causes. The proactive identification of customers most likely to default on loan payments is the first step in predicting customer loan default. This is typically done by dynamically analyzing pertinent customer data and actions.


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
clients_df.columns

In [None]:
# Dataset Describe
clients_df.describe().T

### Variables Description

The dataset contains data from the credit card industry in Taiwan, covering
the usage, historical payments, and default status of customers.
# Attribute Information:

*   ID : ID of each client

*   LIMIT_BAL : Amount of given credit in NT dollars

*   SEX : Gender (1=male, 2=female)

*   EDUCATION : (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)

*   MARRIAGE : Marital status (1=married, 2=single, 3=others)

*   AGE : Age in years


*   PAY_0 : Repayment status in September, 2005 (-2=no consumption, -1=pay duly, 0=the use of revolving credit, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)


*   PAY_2 : Repayment status in August, 2005 (scale same as above)


*   PAY_3 : Repayment status in July, 2005 (scale same as above)

*   PAY_4 : Repayment status in June, 2005 (scale same as above)

*   PAY_5 : Repayment status in May, 2005 (scale same as above)

*   PAY_6 : Repayment status in April, 2005 (scale same as above)

*   BILL_AMT1 : Amount of bill statement in September, 2005 (NT dollar)
*   BILL_AMT2 : Amount of bill statement in August, 2005 (NT dollar)

*   BILL_AMT3 : Amount of bill statement in July, 2005 (NT dollar)
*   BILL_AMT4 : Amount of bill statement in June, 2005 (NT dollar)

*   BILL_AMT5 : Amount of bill statement in May, 2005 (NT dollar)

*   BILL_AMT6 : Amount of bill statement in April, 2005 (NT dollar)
*   PAY_AMT1 : Amount of previous payment in September, 2005 (NT dollar)


*   PAY_AMT2 : Amount of previous payment in August, 2005 (NT dollar)

*   PAY_AMT3 : Amount of previous payment in July, 2005 (NT dollar)
*   PAY_AMT4 : Amount of previous payment in June, 2005 (NT dollar)

*   PAY_AMT5 : Amount of previous payment in May, 2005 (NT dollar)

*   PAY_AMT6 : Amount of previous payment in April, 2005 (NT dollar)
*   default.payment.next.month : Default payment (1=yes, 0=no)

















### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
clients_df.nunique()


In [None]:
# Check Unique Values for each variable.
for i in clients_df.columns.tolist():
  print("No. of unique values in ",i,"is",clients_df[i].nunique())

# Observations:

*   We are focusing on several key columns of our dataset, including 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', and 'PAY_AMT' as they contain a wealth of information.
*   By utilizing these features, we plan to create a classification model and implement various classification algorithms.



## 3. ***EDA***

# Renaming Features

In [None]:
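# Renaming columns for the sake of simplicity    **(Not a necessary step to do)**
# Changing the inconsistent column name 'PAY_0' to 'PAY_1' and the target column to 'DP_NEXT_MONTH'.
# The target may be spelled 'default.payment.next.month' or 'default payment next month' depending on
# the copy of the data, so both are mapped; rename() silently skips keys that are not present.
clients_df.rename(columns={'PAY_0': 'PAY_1', 'default payment next month': 'DP_NEXT_MONTH',
                           'default.payment.next.month': 'DP_NEXT_MONTH'}, inplace=True)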
clients_df.columns

# Column: 'DP_NEXT_MONTH'

In [None]:
fig,ax = plt.subplots(1,2, figsize=(12,4))

# Univariate analysis
# Count Plot of Default Payment
count = sns.countplot(data=clients_df, x='DP_NEXT_MONTH', ax=ax[0])
count.set_title('Count Plot of Default Payment')

# adding value count on the top of bar
for p in count.patches:
  count.annotate(format(p.get_height(), '.0f'), (p.get_x(), p.get_height()))

# Univariate analysis
# Percentage of Default and Non-Default Payment
pie = clients_df['DP_NEXT_MONTH'].value_counts().plot(kind='pie',autopct="%1.1f%%",labels=['Not Defaulted','Defaulted'], ax=ax[1])
pie.set_title('Percentage of Default and Non-Default Payment')

**Observation:**


*   We can observe from the graphs that default payments are far fewer than non-default payments: 6,636 defaults against 23,364 non-defaults.
*   By percentage, 22.1% of customers defaulted on their payment, whereas 77.9% did not.


*   The data is therefore imbalanced and needs to be balanced. We will do that in the feature engineering step.
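The notebook imports `SMOTE` from `imblearn.over_sampling` for the balancing step. The block below is not that implementation; it is only a small NumPy sketch of the interpolation idea behind SMOTE, in which a synthetic minority sample is placed on the line segment between a minority point and one of its nearest minority neighbours. The function name `smote_like_samples`, the parameters `k` and `seed`, and the toy points are all assumptions made for illustration.

```python
import numpy as np


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances between minority points.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is never its own neighbour

    # Indices of the k nearest minority neighbours of every point.
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))       # pick a minority point
        j = neighbours[i, rng.integers(k)]    # pick one of its k neighbours
        lam = rng.random()                    # interpolation weight in [0, 1)
        synthetic.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(synthetic)


# Toy minority-class points in two made-up feature dimensions.
minority_points = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.9]])
synthetic = smote_like_samples(minority_points, n_new=5)
print(synthetic)
```

Oversampling of this kind is generally applied to the training split only, so that the test set keeps the original class proportions.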





# Column: 'LIMIT_BAL'

In [None]:
fig,ax = plt.subplots(1,4, figsize=(15,5))

# Distribution analysis of Limit Balance
hist = sns.histplot(clients_df['LIMIT_BAL'],bins=10, ax=ax[0])
hist.set_title('Distribution Plot of Limit Balance', size=15)

# Bi-variate analysis
# Limit Balance Vs Default Payment
hist = sns.histplot(data=clients_df, x='LIMIT_BAL', hue='DP_NEXT_MONTH',bins=10, ax=ax[1])
hist.set_title('Limit Balance Vs Default Payment', size=15)

# Multi-variate analysis
# Limit Balance Vs SEX
bar = sns.barplot(data=clients_df, x='SEX', y='LIMIT_BAL',hue='DP_NEXT_MONTH', ax=ax[2])
bar.set_title('Limit Balance Vs SEX', size=15)

# adding value count on the top of bar
for p in bar.patches:
    bar.annotate(format(p.get_height(), '.0f'), (p.get_x(), p.get_height()))

# Assign labels to the x-axis categories
# Gender (1=male, 2=female)
bar.set_xticklabels(['Male', 'Female'])

# Bi-variate analysis
# Limit Balance Vs EDUCATION
bar = sns.barplot(data=clients_df, x='EDUCATION', y='LIMIT_BAL', ax=ax[3])
bar.set_title('Limit Balance Vs EDUCATION', size=15)

# adding value count on the top of bar
for p in bar.patches:
    bar.annotate(format(p.get_height(), '.0f'), (p.get_x(), p.get_height()))

# Assign labels to the x-axis categories
# EDUCATION (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
bar.set_xticklabels(['Unknown','Graduate School', 'University', 'High School', 'Others', 'Unknown', 'Unknown'])

# Set x-ticks rotation to 90 degrees
plt.xticks(rotation=90)

plt.tight_layout()
plt.show()

Observation:


*   Most customers have a credit limit of up to 2 lakh (200,000) NT dollars.

*   The proportion of defaults appears to fall as the credit limit rises.
*   On average, females get a higher limit than males: about 170k for females versus about 163k for males.


*   The graph also indicates that a higher education level goes with a higher credit limit. All the unknown education categories should be merged into one.
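Merging the unknown education categories can be sketched as follows. This is only an illustration on a toy column: it assumes that codes 0, 5 and 6 all mean "unknown" (as labelled in the plot above) and, as a further assumption, keeps 5 as the single merged code.

```python
import pandas as pd

# Toy EDUCATION column covering every code seen in the plots (0 to 6).
toy_df = pd.DataFrame({"EDUCATION": [0, 1, 2, 3, 4, 5, 6, 2, 1]})

# Assumption: 0, 5 and 6 all mean "unknown" and are merged into code 5.
unknown_codes = {0: 5, 6: 5}
toy_df["EDUCATION"] = toy_df["EDUCATION"].replace(unknown_codes)

print(toy_df["EDUCATION"].value_counts().sort_index())
```

The same `replace` call could be applied to `clients_df['EDUCATION']` once the target code for "unknown" has been decided.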



# Column: 'SEX'

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(15, 5))

# Univariate analysis
count = sns.countplot(data=clients_df, x='SEX', ax=ax[0])
count.set_title('Count Plot of Gender', size=15)

# adding value count on the top of the bar
for p in count.patches:
    count.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha = 'center', va = 'center')

# Assign labels to the x-axis categories
count.set_xticks([0, 1])  # Set the tick locations for Male and Female
count.set_xticklabels(['Male', 'Female'])  # Set the labels for the tick locations

# Bivariate analysis
# SEX Vs Default Payment
bar = sns.barplot(data=clients_df, x='SEX', y='DP_NEXT_MONTH', ax=ax[1])
bar.set_title('Proportion of Default Payment in Different Gender', size=15)

# Assign labels to the x-axis categories
bar.set_xticks([0, 1])  # Set the tick locations for Male and Female
bar.set_xticklabels(['Male', 'Female'])  # Set the labels for the tick locations

# adding value count on the top of the bar
for p in bar.patches:
    bar.annotate(format(p.get_height() * 100, '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center')

# Multivariate analysis
# SEX Vs Default Payment with Limit Balance
bar = sns.barplot(data=clients_df, x='SEX', y='LIMIT_BAL', hue='DP_NEXT_MONTH', ax=ax[2])
bar.set_title('SEX Vs Default Payment with Limit Balance', size=15)

# adding value count on the top of the bar
for p in bar.patches:
    bar.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center')

# Assign labels to the x-axis categories
bar.set_xticks([0, 1])  # Set the tick locations for Male and Female
bar.set_xticklabels(['Male', 'Female'])  # Set the labels for the tick locations

plt.tight_layout()
plt.show()

Observation:


*   There are 18,112 females and 11,888 males in the dataset.

*   About 24% of males defaulted, compared with about 21% of females.
*   Fewer males than females defaulted in absolute terms, but the proportion is greater among males. One possible reason is that males have lower credit limits on their cards, as the graph also shows.
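The figures read off the bar charts can also be computed directly with a grouped aggregation. The sketch below uses a made-up toy frame with the same column names as `clients_df`; the numbers it prints are not the dataset's.

```python
import pandas as pd

# Toy data with the notebook's column names (values are made up).
toy_df = pd.DataFrame({
    "SEX": [1, 1, 1, 1, 2, 2, 2, 2],
    "LIMIT_BAL": [50000, 100000, 150000, 200000, 100000, 150000, 200000, 250000],
    "DP_NEXT_MONTH": [1, 0, 0, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 target is the default rate within each group.
by_sex = (
    toy_df.groupby("SEX")
    .agg(customers=("DP_NEXT_MONTH", "size"),
         default_rate=("DP_NEXT_MONTH", "mean"),
         mean_limit=("LIMIT_BAL", "mean"))
    .rename(index={1: "Male", 2: "Female"})
)
print(by_sex)
```

Reporting the group size next to the rate helps separate "fewer defaults" from "a lower default rate", which is the distinction the observation above relies on.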



# Column: 'EDUCATION'

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(15, 5))

# Univariate analysis
count = sns.countplot(data=clients_df, x='EDUCATION', ax=ax[0])
count.set_title('Count Plot of Education')

# adding value count on the top of the bar
for p in count.patches:
    count.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center')

# Assign labels to the x-axis categories
education_labels = ['Unknown', 'Graduate School', 'University', 'High School', 'Others', 'Unknown', 'Unknown']
count.set_xticks(range(len(education_labels)))  # Set the tick locations
count.set_xticklabels(education_labels, rotation=90)  # Set the labels with rotation

# Bivariate analysis
# EDUCATION Vs Default Payment
bar = sns.barplot(data=clients_df, x='EDUCATION', y='DP_NEXT_MONTH', ax=ax[1])
bar.set_title('Proportion of Default Payment in Different Education Level')

# adding value count on the top of the bar
for p in bar.patches:
    bar.annotate(format(p.get_height() * 100, '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center')

# Assign labels to the x-axis categories
bar.set_xticks(range(len(education_labels)))  # Set the tick locations
bar.set_xticklabels(education_labels, rotation=90)  # Set the labels with rotation

# Multivariate analysis
# EDUCATION Vs Default Payment with SEX
bar = sns.barplot(data=clients_df, x='EDUCATION', y='DP_NEXT_MONTH', hue='SEX', ax=ax[2])
bar.set_title('Education Vs Default Payment with SEX')

# Assign labels to the x-axis categories
bar.set_xticks(range(len(education_labels)))  # Set the tick locations
bar.set_xticklabels(education_labels, rotation=90)  # Set the labels with rotation

plt.tight_layout()
plt.show()

Observation:

*   There are 10,585 customers with graduate school degrees, 14,030 with university degrees, and 4,917 with high school education. University is the most common education level, followed by graduate school and high school.

*   The proportion of defaults decreases as the education level rises: graduate school customers default at 19%, university customers at 24%, and high school customers at 25%.
*   In almost all education levels, females have a lower default percentage than males.


# Column: 'MARRIAGE'

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(1, 3, figsize=(15, 5))

# Univariate analysis
count = sns.countplot(data=clients_df, x='MARRIAGE', ax=ax[0])
count.set_title('Count Plot of Marriage')

# adding value count on the top of the bar
for p in count.patches:
    count.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center')

# Assign labels to the x-axis categories
marriage_labels = ['Others', 'Married', 'Single', 'Divorce']
count.set_xticks(range(len(marriage_labels)))  # Set the tick locations
count.set_xticklabels(marriage_labels, rotation=90)  # Set the labels with rotation

# Bivariate analysis
# MARRIAGE Vs Default Payment
bar = sns.barplot(data=clients_df, x='MARRIAGE', y='DP_NEXT_MONTH', ax=ax[1])
bar.set_title('Proportion of Default Payment in Different Marital Status')

# adding value count on the top of the bar
for p in bar.patches:
    bar.annotate(format(p.get_height() * 100, '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center')

# Assign labels to the x-axis categories
bar.set_xticks(range(len(marriage_labels)))  # Set the tick locations
bar.set_xticklabels(marriage_labels, rotation=90)  # Set the labels with rotation

# Multivariate analysis
# MARRIAGE Vs Default Payment with SEX
bar = sns.barplot(data=clients_df, x='MARRIAGE', y='DP_NEXT_MONTH', hue='SEX', ax=ax[2])
bar.set_title('Marital Status Vs Default Payment with SEX')

# Assign labels to the x-axis categories
bar.set_xticks(range(len(marriage_labels)))  # Set the tick locations
bar.set_xticklabels(marriage_labels, rotation=90)  # Set the labels with rotation

plt.tight_layout()
plt.show()

Observation:

*   There are 13,659 married customers, 15,964 single customers, 323 divorced customers, and 54 customers classed as "others". Single customers are the largest group, followed by married and divorced customers.

*   The default rate appears to be highest among divorced customers (26%) and lowest among single customers (21%), ignoring "others" because of its low count.
*   For every marital status, females have a lower default percentage than males.



# Column: 'AGE'

In [None]:
fig,ax = plt.subplots(1,2, figsize=(12,5))

# Distribution analysis of Age
hist = sns.histplot(clients_df['AGE'],bins=6, ax=ax[0])
hist.set_title('Histogram Plot of Age', size=15)

# Bi-variate analysis
# Age Vs Default Payment
hist = sns.histplot(data=clients_df, x='AGE', hue='DP_NEXT_MONTH', bins=6, ax=ax[1])
hist.set(title='Age Vs Default Payment',ylabel='Default Payments Count')

plt.tight_layout()
plt.show()

Observation:


*   With the increase in age the count of customers decreases. Most of the customers belong to the 20-30 year age group followed by the 30-40 age group.
*   With an increase in the age group the count of default payments decreases.
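Because the number of customers also falls with age, a falling count of defaults does not by itself say anything about the default rate. A minimal sketch of checking the rate per age group with `pd.cut` is shown below; the 10-year bin edges and the toy values are assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy data with the notebook's column names (values are made up).
toy_df = pd.DataFrame({
    "AGE": [22, 25, 28, 29, 33, 36, 38, 45, 52, 61],
    "DP_NEXT_MONTH": [1, 0, 0, 1, 0, 0, 1, 0, 1, 0],
})

# Assumed 10-year bins; right=False makes each bin closed on the left, e.g. [20, 30).
age_bins = [20, 30, 40, 50, 60, 80]
toy_df["AGE_GROUP"] = pd.cut(toy_df["AGE"], bins=age_bins, right=False)

# Group size and default rate (mean of the 0/1 target) per age group.
age_summary = toy_df.groupby("AGE_GROUP", observed=False)["DP_NEXT_MONTH"].agg(["size", "mean"])
age_summary.columns = ["customers", "default_rate"]
print(age_summary)
```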



# Columns: 'Payment History'

In [None]:
# Melt the dataset to transform the categorical columns to rows
melted_df = clients_df.melt(value_vars=['PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'], var_name='Category', value_name='Value')

# Group the data by category and value and count the number of occurrences
grouped_df = melted_df.groupby(['Category', 'Value']).size().reset_index(name='Count')

# Create a dictionary to rename old values to new values
# (-2=no consumption, -1=pay duly, 0=the use of revolving credit, 1=payment delay for one month, 2=payment delay for two months,
# … 8=payment delay for eight months, 9=payment delay for nine months and above)
value_map = {-2:'no consumption', -1:'paid', 0:'revolving credit', 1:'1 month delay', 2:'2 month delay', 3:'3 month delay',
              4:'4 month delay', 5:'5 month delay', 6:'6 month delay', 7:'7 month delay', 8:'8 month delay', 9:'9 month and more delay'}

# Replace the old values with the new values
grouped_df['Value'] = grouped_df['Value'].replace(value_map)

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))

# Univariate analysis
bar = sns.barplot(data=grouped_df, x='Category', y='Count',palette='pastel', hue='Value')
bar.set_title('Bar Plot of Payment History')
lgd = plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))

plt.show()

Observation:

*   In every month's payment history, most customers fall under revolving credit, followed by paid.
*   Among customers with a payment delay, a 2-month delay is the most common in every month, which suggests that a 2-month payment delay is a critical warning sign of default.
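One way to test whether a given repayment status really signals default is to look at the default rate within each status rather than at the counts alone. The sketch below does this with a row-normalised crosstab on a made-up toy frame; only the column names come from the notebook.

```python
import pandas as pd

# Toy data with the notebook's column names (values are made up).
toy_df = pd.DataFrame({
    "PAY_1": [-1, -1, 0, 0, 0, 0, 1, 2, 2, 2],
    "DP_NEXT_MONTH": [0, 0, 0, 0, 0, 1, 0, 1, 1, 0],
})

# Row-normalised crosstab: each row gives the share of non-defaults (0) and defaults (1)
# among customers with that repayment status.
rate_by_status = pd.crosstab(toy_df["PAY_1"], toy_df["DP_NEXT_MONTH"], normalize="index")
print(rate_by_status)
```

On the real data, the column for `1` in this table would show how the default rate changes from "paid" through "revolving credit" to the delayed statuses.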



# Columns: 'Bill Amounts'

In [None]:
# Creating a few columns to consolidate all the bill amounts
clients_df['Sum_all_bill'] = clients_df['BILL_AMT1']+clients_df['BILL_AMT2']+clients_df['BILL_AMT3']+\
                             clients_df['BILL_AMT4']+clients_df['BILL_AMT5']+clients_df['BILL_AMT6']

clients_df['Avg_bill'] =    (clients_df['BILL_AMT1']+clients_df['BILL_AMT2']+clients_df['BILL_AMT3']+\
                             clients_df['BILL_AMT4']+clients_df['BILL_AMT5']+clients_df['BILL_AMT6'])/6

In [None]:
fig,ax = plt.subplots(1,3, figsize=(15,5))

# Distribution analysis of Bill Amount
hist = sns.histplot(clients_df['Sum_all_bill'],bins=11, ax=ax[0])
hist.set_title('Distribution Plot of Bill Amount', size=15)

# Bi-variate analysis
# Bill amount Vs Default Payment Count
hist = sns.histplot(data=clients_df, x='Avg_bill', hue='DP_NEXT_MONTH',bins=11, ax=ax[1])
hist.set_title('Bill amount Vs Default Payment Count', size=15)

# Bi-variate analysis
# Bill amount Vs Proportion of Default Payment
hist = sns.histplot(data=clients_df, x='Avg_bill', hue='DP_NEXT_MONTH', bins=11, multiple='fill', stat='probability', ax=ax[2])
hist.set_title('Bill amount Vs Proportion of Default Payment', size=15)

plt.tight_layout()
plt.show()

Observation:

*   Every bill amount column contains some negative records, that is, bill amounts below zero.
*   Most defaults come from customers whose average bill amount over the last 6 months is negative or up to 2 lakh.

*   However, when the bill amount is compared with default payment, the proportion of defaults rises as the average bill amount rises.
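The consolidated bill columns and the count of negative bill amounts can also be obtained with row-wise and column-wise reductions. The following is a sketch on a made-up toy frame; `bill_cols` and `negative_counts` are illustrative names, while `Sum_all_bill` and `Avg_bill` match the columns created above.

```python
import pandas as pd

bill_cols = [f"BILL_AMT{i}" for i in range(1, 7)]

# Toy rows standing in for clients_df (values are made up).
toy_df = pd.DataFrame(
    [[3000, 2500, 2000, 1500, 1000, 500],
     [-200, 0, 400, 600, -100, 500],
     [0, 0, 0, 0, 0, 0]],
    columns=bill_cols,
)

# Row-wise total and mean over the six monthly bill columns.
toy_df["Sum_all_bill"] = toy_df[bill_cols].sum(axis=1)
toy_df["Avg_bill"] = toy_df[bill_cols].mean(axis=1)

# Number of negative bill amounts in each monthly column.
negative_counts = (toy_df[bill_cols] < 0).sum()

print(toy_df[["Sum_all_bill", "Avg_bill"]])
print(negative_counts)
```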





# Columns: 'Pay Amounts'

In [None]:
# Creating a few columns to consolidate all the pay amounts
clients_df['Sum_all_pay_amount'] = clients_df['PAY_AMT1']+clients_df['PAY_AMT2']+clients_df['PAY_AMT3']+\
                             clients_df['PAY_AMT4']+clients_df['PAY_AMT5']+clients_df['PAY_AMT6']

clients_df['Avg_pay_amount'] =    (clients_df['PAY_AMT1']+clients_df['PAY_AMT2']+clients_df['PAY_AMT3']+\
                             clients_df['PAY_AMT4']+clients_df['PAY_AMT5']+clients_df['PAY_AMT6'])/6

In [None]:
fig,ax = plt.subplots(1,3, figsize=(15,5))

# Distribution analysis of Pay Amount
hist = sns.histplot(clients_df['Sum_all_pay_amount'],bins=11, ax=ax[0])
hist.set_title('Distribution Plot of Pay Amount', size=15)

# Bi-variate analysis
# Pay amount Vs Default Payment Count
hist = sns.histplot(data=clients_df, x='Avg_pay_amount', hue='DP_NEXT_MONTH',bins=11, ax=ax[1])
hist.set_title('Pay Amount Vs Default Payment Count', size=15)

# Bi-variate analysis
# Pay amount Vs Proportion of Default Payment
hist = sns.histplot(data=clients_df, x='Avg_pay_amount', hue='DP_NEXT_MONTH', bins=11, multiple='fill', stat='probability', ax=ax[2])
hist.set_title('Pay Amount Vs Proportion of Default Payment', size=15)

plt.tight_layout()
plt.show()

Observation:

*   In all the pay amount columns, most payments are up to 50,000.

*   Bill amounts go up to 2 lakh, but the average pay amount does not reach 2 lakh, which is expected because a default occurs when the customer does not pay the credit card bill.
*   When the pay amount is compared with default payment, the proportion of defaults decreases as the payment amount rises.



### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Chart - 2 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing through code and statistical tests to reach a final conclusion about each statement.

Answer Here.

### Hypothetical Statement - 1

#### 1. State your research hypothesis as a null hypothesis and an alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
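
As a minimal sketch, assuming the first statement concerns whether a categorical feature (education level, for illustration) is associated with default, a chi-square test of independence is one reasonable choice. The counts below are hypothetical placeholders, not values from the dataset, and the 0.05 significance level is also an assumption.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table (NOT taken from the dataset):
# rows = education level, columns = [no default, default]
observed = np.array([
    [8500, 2000],   # graduate school
    [10700, 3300],  # university
    [3700, 1200],   # high school
])

# H0: education level and default status are independent.
# H1: education level and default status are associated.
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05  # assumed significance level
print(f"chi2 = {chi2_stat:.2f}, dof = {dof}, p-value = {p_value:.4g}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

With real data, the table would typically come from `pd.crosstab` on the two columns of interest.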

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State your research hypothesis as a null hypothesis and an alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
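
As a sketch under stated assumptions, if the second statement compares the mean of a numeric feature (credit limit, for illustration) between defaulters and non-defaulters, Welch's t-test is a common option because it does not assume equal variances. The two samples below are synthetic and only stand in for the real columns.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)

# Synthetic credit limits (made-up parameters, for illustration only)
limit_no_default = rng.normal(loc=180_000, scale=130_000, size=500)
limit_default = rng.normal(loc=130_000, scale=115_000, size=200)

# H0: both groups have the same mean credit limit.
# H1: the mean credit limits differ.
# equal_var=False selects Welch's t-test rather than Student's t-test.
t_stat, p_value = ttest_ind(limit_no_default, limit_default, equal_var=False)

alpha = 0.05  # assumed significance level
print(f"t = {t_stat:.3f}, p-value = {p_value:.4g}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```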

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State your research hypothesis as a null hypothesis and an alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
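
If the third statement involves a heavily skewed numeric feature (bill amount is used here purely as an illustration), a rank-based test such as Mann-Whitney U may be more appropriate than a t-test. This is a hedged sketch on synthetic log-normal data, not the notebook's actual result.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(7)

# Synthetic, right-skewed bill amounts (illustrative parameters only)
bill_no_default = rng.lognormal(mean=10.0, sigma=1.0, size=400)
bill_default = rng.lognormal(mean=10.2, sigma=1.0, size=150)

# H0: the two groups come from the same distribution.
# H1: one group tends to have larger values than the other.
u_stat, p_value = mannwhitneyu(bill_no_default, bill_default, alternative="two-sided")

alpha = 0.05  # assumed significance level
print(f"U = {u_stat:.1f}, p-value = {p_value:.4g}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```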

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
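
One common outlier treatment is capping values at the interquartile-range (IQR) fences. The sketch below assumes this approach; the helper name `cap_outliers_iqr` and the sample values are hypothetical and only illustrate the idea.

```python
import pandas as pd


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest fence."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Made-up credit limits with one extreme value
limits = pd.Series([20_000, 50_000, 80_000, 120_000, 150_000, 200_000, 1_000_000])
capped = cap_outliers_iqr(limits)
print(capped.tolist())
```

Capping keeps every row, which matters when the minority class is already small. Dropping rows would be an alternative with a different trade-off.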

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
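
A minimal sketch of one-hot encoding with pandas follows. The column names `EDUCATION`, `MARRIAGE`, and `LIMIT_BAL` are assumptions about how the dataset labels these features, and the rows are invented for illustration.

```python
import pandas as pd

# Tiny made-up frame; the column names are assumed, not confirmed by the notebook
df = pd.DataFrame({
    "LIMIT_BAL": [20_000, 120_000, 90_000, 50_000],
    "EDUCATION": [1, 2, 2, 3],
    "MARRIAGE": [1, 2, 1, 3],
})

# One-hot encode the nominal columns; drop_first avoids a redundant
# (perfectly collinear) dummy column per feature.
encoded = pd.get_dummies(
    df, columns=["EDUCATION", "MARRIAGE"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
```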

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(Mandatory only for textual datasets, e.g., NLP, sentiment analysis, text clustering.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words containing digits

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (e.g., Stemming, Lemmatization)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### Which feature selection methods have you used and why?

Answer Here.

##### Which features did you find important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used, and why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data
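
A hedged sketch of standardization with scikit-learn's `StandardScaler` follows. The synthetic arrays stand in for already-split training and test features. The point of the example is that the scaler is fitted on the training portion only and then reused on the test portion, which avoids leaking test statistics into training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-ins for the numeric training and test features
X_train = rng.normal(loc=50_000, scale=20_000, size=(200, 3))
X_test = rng.normal(loc=50_000, scale=20_000, size=(50, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```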

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
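
As a sketch, a stratified split keeps the class ratio the same in both partitions, which is useful when the target is imbalanced. The 80/20 ratio, the roughly 22% positive rate, and the random features are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features and an imbalanced target (22% positives, assumed)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 220 + [0] * 780)

# stratify=y preserves the default/non-default ratio in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(f"positive rate: train={y_train.mean():.3f}, test={y_test.mean():.3f}")
```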

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)
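
The project summary mentions SMOTE for handling class imbalance. The sketch below hand-rolls the core SMOTE idea in NumPy: each synthetic point is an interpolation between a minority sample and one of its nearest minority neighbours. It is an illustration only. The helper name `smote_like_oversample` is hypothetical, and in practice the `SMOTE` class from the imbalanced-learn package is the usual choice. Oversampling should be applied to the training split only.

```python
import numpy as np


def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by neighbour interpolation.

    Simplified illustration of SMOTE; requires k < len(X_min).
    """
    rng = np.random.default_rng(seed)
    # Pairwise squared Euclidean distances between minority samples
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)             # seed samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                               # position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Synthetic minority-class features (illustrative only)
X_min = np.random.default_rng(3).normal(size=(30, 4))
synthetic = smote_like_oversample(X_min, n_new=70)
print(synthetic.shape)
```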

##### What technique did you use to handle the imbalanced dataset and why? (If it needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model
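
A minimal baseline sketch follows, assuming logistic regression as the first model (the notebook does not name its models, so this choice is illustrative). It runs on synthetic imbalanced data from `make_classification` and reports recall and F1, the two metrics the project summary says it prioritizes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the prepared features
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.78, 0.22], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the algorithm; class_weight="balanced" is one way to offset imbalance
model = LogisticRegression(max_iter=1000, class_weight="balanced")
model.fit(X_train, y_train)

# Predict on the held-out data and score
y_pred = model.predict(X_test)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"recall = {recall:.3f}, F1 = {f1:.3f}")
```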

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian Optimization)

# Fit the Algorithm

# Predict on the model
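
As a sketch of cross-validated tuning, the block below runs scikit-learn's `GridSearchCV` over a small, assumed grid for logistic regression and selects by recall. The model, the grid values, and the synthetic data are illustrative choices, not the notebook's confirmed setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic imbalanced data standing in for the prepared features
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.78, 0.22], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}  # assumed search space
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    LogisticRegression(max_iter=1000, class_weight="balanced"),
    param_grid,
    scoring="recall",  # optimize the metric the project prioritizes
    cv=cv,
)
search.fit(X_train, y_train)  # fit the algorithm on training folds only

tuned_recall = recall_score(y_test, search.predict(X_test))
print(search.best_params_, f"CV recall = {search.best_score_:.3f}")
print(f"test recall = {tuned_recall:.3f}")
```

Grid search is exhaustive, so it suits small grids like this one. A randomized or Bayesian search usually scales better to larger spaces.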

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with an updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian Optimization)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with an updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian Optimization)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with an updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model you have used and its feature importance using any model explainability tool.

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best-performing ML model in a pickle or joblib file for the deployment process.


In [None]:
# Save the File

### 2. Load the saved model file again and predict on unseen data as a sanity check.


In [None]:
# Load the File and predict unseen data.
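
A hedged sketch of the save-and-reload round trip using `joblib` follows. The model, the synthetic data, and the file name `credit_default_model.joblib` are placeholders. A temporary directory is used here so the example leaves no files behind, whereas a real deployment would write to a persistent path.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Placeholder model trained on synthetic data
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "credit_default_model.joblib")
    joblib.dump(model, model_path)         # save the fitted model
    loaded_model = joblib.load(model_path)  # load it back

# Sanity check: the reloaded model should reproduce the original predictions
same_predictions = bool((loaded_model.predict(X) == model.predict(X)).all())
print("Round trip OK:", same_predictions)
```

Only load model files from trusted sources, since pickle-based formats can execute arbitrary code when loaded.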

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***