<a href="https://colab.research.google.com/github/DharmendraYadav96/Credit-Card-Default-Prediction/blob/main/Credit_Card_Default_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  Credit Card Default Prediction



##### **Project Type**    - EDA/Classification/supervised
##### **Contribution**    - Team
##### **Team Member 1 -**  Dharmendra Yadav
##### **Team Member 2 -**  Pranita Tiwari
##### **Team Member 3 -**  Kratika Jawariya

# **Project Summary -**

This project revolves around the crucial task of predicting credit card payment defaults among customers in Taiwan. Rather than focusing solely on binary classification (credibility or not), our primary objective is to estimate the probability of default, which offers deeper insights into risk assessment. The dataset used encompasses 23 explanatory variables, including credit amount, gender, education, marital status, age, and extensive payment history, with the ultimate goal of developing a predictive model that can effectively identify customers at risk of defaulting on their credit card payments.

**Data Overview:**

The dataset at the core of this project consists of a binary response variable, "Default Payment," where 1 indicates a default, and 0 signifies no default. It's paired with a comprehensive set of explanatory variables. These variables encompass a wide range of aspects, such as the credit amount extended, gender of the cardholder, their educational background, marital status, age, and the history of past payments over several months. Additionally, it includes data on bill statement amounts and previous payment amounts, providing a holistic view of each customer's financial behavior.


**Business Objective:**

The primary aim of this project is to develop a robust predictive model capable of identifying customers who are likely to default on their credit card payments in the upcoming months. Credit card default, in this context, refers to the scenario where individuals consistently fail to pay the Minimum Amount Due for consecutive months. By predicting potential defaults proactively, our objective is to empower credit card companies with the tools to make informed decisions. This, in turn, can significantly reduce the incidence of defaults and facilitate targeted engagement with low-risk customer segments.

**Key Insights and Impact:**

**The impact of this project extends to several crucial areas:**

**Accurate Prediction:**

The developed predictive models offer the ability to identify potential defaulters at an early stage of delinquency.

**Risk Reduction:**

Enhanced risk assessment and management strategies enable credit card companies to minimize the impact of defaults on their financial health.

**Targeted Marketing:**

With the ability to predict potential defaults, credit card companies can tailor their credit offerings and engage effectively with low-risk customers.

**Operational Efficiency:**

Efficient resource allocation and streamlined credit approval processes lead to a more optimized and cost-effective operation.

**In conclusion,** this project addresses a significant concern within the credit card industry. It provides insights and tools that can help manage credit card default risk effectively, resulting in improved financial outcomes and operational efficiency. By predicting and proactively addressing defaults, credit card companies can navigate the challenges of risk management more effectively and offer better services to their customers.






# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


This project is dedicated to forecasting customer payment defaults in Taiwan. From a risk management viewpoint, the precision of predicting the probability of default holds greater significance than simply categorizing clients as either credible or not. We can employ the K-S chart to assess which customers are likely to experience credit card payment defaults.
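
The K-S idea mentioned above can be illustrated with a small sketch. It assumes a model has already produced default probabilities; the scores and labels below are made-up toy values, and `ks_statistic` is a hypothetical helper that is not part of this notebook.

```python
import numpy as np

def ks_statistic(y_true, y_score):
    """Maximum gap between the cumulative score distributions of
    defaulters (1) and non-defaulters (0). Hypothetical helper."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pos = np.sort(y_score[y_true == 1])
    neg = np.sort(y_score[y_true == 0])
    # Evaluate both empirical CDFs at every observed score
    thresholds = np.unique(y_score)
    cdf_pos = np.searchsorted(pos, thresholds, side="right") / len(pos)
    cdf_neg = np.searchsorted(neg, thresholds, side="right") / len(neg)
    return float(np.max(np.abs(cdf_pos - cdf_neg)))

# Toy example: illustrative values only, not results from this dataset
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.9, 0.8]
ks = ks_statistic(y_true, y_score)
print(f"KS statistic: {ks:.2f}")
```

A value near 1 means the predicted probabilities separate the two groups almost perfectly, while a value near 0 means the two score distributions overlap.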

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# importing the library
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset

from google.colab import drive
drive.mount('/content/drive')


In [None]:
# Importing the dataset
df = pd.read_excel('/content/drive/MyDrive/Dataset/default of credit card clients.xls')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isna().sum()

In [None]:
# Visualizing the missing values
missing_values_per = pd.DataFrame((df.isnull().sum()/len(df))*100).reset_index()
plt.figure(figsize=(15,5))
plt.stem(missing_values_per['index'],missing_values_per[0])
plt.xticks(rotation=45,fontsize=10)
plt.title('Percentage of Missing Values')
plt.ylabel('%')
plt.show()

### What did you know about your dataset?

There are 4 object type variables which need to be converted to numerical data type for applying a machine learning algorithm. Additionally, it's worth noting that all columns have no missing values, and the dataset comprises 30,000 rows and 25 columns.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

In [None]:
df.describe().T

### Variables Description

**Breakdown of Our Features:**

We possess data for 30,000 customers, and the following describes all the available features.


**ID:** ID of each client


**LIMIT_BAL:** Amount of given credit in NT dollars (includes individual and family/supplementary credit)


**SEX:** Gender (1 = male, 2 = female)


**EDUCATION:** (1 = graduate school, 2 = university, 3 = high school, 0,4,5,6 = others)


**MARRIAGE:** Marital status (0 = others, 1 = married, 2 = single, 3 = others)


**AGE:** Age in years


**Scale for PAY_0 to PAY_6 :**


(-2 = No consumption, -1 = paid in full, 0 = use of revolving credit (paid minimum only), 1 = payment delay for one month, 2 = payment delay for two months, ... 8 = payment delay for eight months, 9 = payment delay for nine months and above)


**PAY_0:** Repayment status in September, 2005 (scale same as above)


**PAY_2:** Repayment status in August, 2005 (scale same as above)


**PAY_3:** Repayment status in July, 2005 (scale same as above)


**PAY_4:** Repayment status in June, 2005 (scale same as above)


**PAY_5:** Repayment status in May, 2005 (scale same as above)


**PAY_6:** Repayment status in April, 2005 (scale same as above)


**BILL_AMT1:**  Amount of bill statement in September, 2005 (NT dollar)


**BILL_AMT2:** Amount of bill statement in August, 2005 (NT dollar)


**BILL_AMT3:** Amount of bill statement in July, 2005 (NT dollar)


**BILL_AMT4:** Amount of bill statement in June, 2005 (NT dollar)


**BILL_AMT5:** Amount of bill statement in May, 2005 (NT dollar)


**BILL_AMT6:** Amount of bill statement in April, 2005 (NT dollar)


**PAY_AMT1:** Amount of previous payment in September, 2005 (NT dollar)


**PAY_AMT2:** Amount of previous payment in August, 2005 (NT dollar)


**PAY_AMT3:** Amount of previous payment in July, 2005 (NT dollar)


**PAY_AMT4:** Amount of previous payment in June, 2005 (NT dollar)


**PAY_AMT5:** Amount of previous payment in May, 2005 (NT dollar)


**PAY_AMT6:** Amount of previous payment in April, 2005 (NT dollar)


**default payment next month:** Default payment (1=yes, 0=no)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Rename the columns

df = df.rename(columns={'default payment next month': 'DEFAULT_PAYMENT','PAY_0': 'PAY_1'})
df.head(2)

In [None]:
# Creating a list of categorical independent variable

independent_variable = ['SEX', 'EDUCATION', 'MARRIAGE', 'AGE']

In [None]:
# Value counts for each column in independent_variable
# (EDUCATION: 1 = graduate school, 2 = university, 3 = high school, 0,4,5,6 = others)

for col in independent_variable:
  print(df[col].value_counts())

In [None]:
# Replacing the value as according to feature description
df["EDUCATION"] = df["EDUCATION"].replace({0:4,5:4,6:4})
df["MARRIAGE"] = df["MARRIAGE"].replace({0:3})

In [None]:
# Value counts after replacing of "EDUCATION", "MARRIAGE"
print(df['EDUCATION'].value_counts())
print(df['MARRIAGE'].value_counts())

### What all manipulations have you done and insights you found?

The 'default payment next month' column was renamed to 'DEFAULT_PAYMENT' and the 'PAY_0' column to 'PAY_1', so the repayment-status columns run consistently from PAY_1 to PAY_6.

We created a list called independent_variable holding the demographic columns 'SEX', 'EDUCATION', 'MARRIAGE', and 'AGE'. The list is reused to subset the DataFrame and to loop over these columns in later analyses.

We then printed the value counts of each column in the list to see how the categorical variables are distributed and what range of values 'AGE' takes.

Finally, we recoded the undocumented category values: 0, 5, and 6 in 'EDUCATION' were replaced with 4, and 0 in 'MARRIAGE' was replaced with 3. This matches the feature description, which lists all of those values as "others", and leaves 'EDUCATION' with the levels 1 to 4 and 'MARRIAGE' with the levels 1 to 3. The value counts printed after the replacement let us verify the recoding.
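
The recoding described above can be checked on a small example. This is a minimal sketch on a made-up frame (`toy` is a hypothetical name, not the notebook's `df`) that applies the same `replace` mappings and confirms that only the documented levels remain.

```python
import pandas as pd

# Toy frame containing every raw code listed in the feature description
toy = pd.DataFrame({
    "EDUCATION": [0, 1, 2, 3, 4, 5, 6],
    "MARRIAGE": [0, 1, 2, 3, 1, 2, 0],
})

# Same mappings as in the wrangling cell: fold undocumented codes into "others"
toy["EDUCATION"] = toy["EDUCATION"].replace({0: 4, 5: 4, 6: 4})
toy["MARRIAGE"] = toy["MARRIAGE"].replace({0: 3})

education_levels = sorted(toy["EDUCATION"].unique().tolist())
marriage_levels = sorted(toy["MARRIAGE"].unique().tolist())
print(education_levels, marriage_levels)
```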

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Printing and plotting value counts of the dependent variable
print(df['DEFAULT_PAYMENT'].value_counts())

plt.figure(figsize=(10,6))
sns.countplot(x = 'DEFAULT_PAYMENT', data = df)
# Label the two bars directly; a legend cannot tell them apart without a hue
plt.xticks([0, 1], ['Not Default (0)', 'Default (1)'])
plt.title('Default Credit Card Clients')
plt.show()

##### 1. Why did you pick the specific chart?

A count plot is suitable for visualizing the distribution of categorical data.

##### 2. What is/are the insight(s) found from the chart?

The dataset is imbalanced: non-default cases (0) clearly outnumber default cases (1).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the distribution of default and non-default cases is crucial for credit card companies as it can help in risk assessment.
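
To put a number on the imbalance, the class shares can be computed directly. The sketch below uses a made-up ten-row target rather than the real column, so the printed ratio is illustrative only; `toy_target` is a hypothetical name.

```python
import pandas as pd

# Toy target: 8 non-defaults and 2 defaults (illustrative, not the real counts)
toy_target = pd.Series([0] * 8 + [1] * 2, name="DEFAULT_PAYMENT")

# Share of each class, indexed by class label
class_share = toy_target.value_counts(normalize=True).sort_index()
imbalance_ratio = class_share.loc[0] / class_share.loc[1]

print(class_share)
print(f"Non-default to default ratio: {imbalance_ratio:.1f}")
```

On the notebook's data, the same two lines could be applied to `df['DEFAULT_PAYMENT']`.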

#### Chart - 2

In [None]:
# Chart - 2 visualization code

# Plotting the countplot graph for "independent_variable"
plt.figure(figsize=(12, 7))
rows=2
cols=2
counter=1

for col in independent_variable:
  plt.subplot(rows,cols,counter)
  sns.countplot(x = col, data= df)
  plt.title(f'COUNT V/S {col}')
  counter=counter+1
  plt.tight_layout()

##### 1. Why did you pick the specific chart?

A count plot is suitable for visualizing the distribution of categorical data.

##### 2. What is/are the insight(s) found from the chart?

**'SEX':** There are more female clients (coded as 2) than male clients (coded as 1) in the dataset.

**'EDUCATION':** The majority of clients have either a university education (coded as 2) or a graduate school education (coded as 1), with a smaller number having a high school education (coded as 3). The category "others" (coded as 4) also appears.

**'MARRIAGE':** The majority of clients are either single (coded as 2) or married (coded as 1), while the category "others" (coded as 3) has a smaller count.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Knowing the gender distribution can help tailor marketing and product offerings to specific gender demographics.

Understanding the educational background of clients can be useful for customizing financial products and services to cater to different educational levels.

Marital status information can be used for targeted promotions and services for married and single clients.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Creating a box plot of AGE V/S DEFAULT_PAYMENT with respect to sex
plt.figure(figsize=(10,7))
sns.boxplot(x='DEFAULT_PAYMENT',hue='SEX', y='AGE',data=df)

##### 1. Why did you pick the specific chart?

A box plot is chosen for this analysis because it provides a clear representation of the distribution of 'AGE' for different categories ('DEFAULT_PAYMENT' and 'SEX') and allows for the comparison of central tendencies and spreads.

##### 2. What is/are the insight(s) found from the chart?

**For clients who did not default (DEFAULT_PAYMENT = 0)**, the box plot shows that the median age is fairly similar for both genders (male and female). There are some outliers on the older side for both genders.

**For clients who defaulted (DEFAULT_PAYMENT = 1)**, the box plot indicates that the median age for both genders is lower than for non-default clients. Again, there are outliers on the older side, but the majority of default cases seem to occur among younger clients.

The use of 'SEX' as a hue parameter allows for a gender-based comparison within each category. It appears that the age distribution for both genders is similar within the 'DEFAULT_PAYMENT' categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The age distribution differences between default and non-default clients can be considered when setting credit limits or interest rates.

Understanding that defaults are more prevalent among younger clients can lead to targeted financial education or risk management strategies for this demographic.

The gender-based analysis within each category can inform marketing and product strategies tailored to specific age and gender groups.
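
The medians read off the box plot can also be tabulated. The following is a minimal sketch on made-up rows (`toy` is hypothetical, and the ages are not taken from the dataset) showing how a grouped median mirrors the grouping used in the plot.

```python
import pandas as pd

# Toy rows: illustrative ages only
toy = pd.DataFrame({
    "DEFAULT_PAYMENT": [0, 0, 0, 0, 1, 1, 1, 1],
    "SEX": [1, 1, 2, 2, 1, 1, 2, 2],
    "AGE": [30, 40, 28, 36, 25, 35, 24, 30],
})

# Median age per (default status, sex); one row per status, one column per sex
median_age = (
    toy.groupby(["DEFAULT_PAYMENT", "SEX"])["AGE"]
    .median()
    .unstack("SEX")
)
print(median_age)
```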

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Creating a box plot of AGE V/S DEFAULT_PAYMENT with respect to marriage
plt.figure(figsize=(20,10))
sns.boxplot(x='DEFAULT_PAYMENT',hue='MARRIAGE', y='AGE',data=df)

##### 1. Why did you pick the specific chart?

A box plot is an appropriate choice because it allows for the comparison of age distributions across different combinations of 'DEFAULT_PAYMENT' and 'MARRIAGE' categories. It effectively displays the central tendency and spread of the age data.

##### 2. What is/are the insight(s) found from the chart?

**Clients who did not default (DEFAULT_PAYMENT = 0):**

For clients who are single (MARRIAGE = 2), the median age appears to be relatively lower.

For married clients (MARRIAGE = 1), the median age seems slightly higher compared to single clients within this group.

There are some outliers on both the lower and higher age sides for both marital status categories within the 'not default' group.

**Clients who defaulted (DEFAULT_PAYMENT = 1):**

The median age for both single and married clients within this group appears to be lower compared to the 'not default' group.

There are outliers on both the lower and higher age sides, indicating that default cases include clients across a wider age range.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding that default cases span a wider age range suggests that age alone may not be a strong predictor of default.


Marital status may have some influence, as seen in the slight differences in median age between single and married clients within each category.

These insights can contribute to more informed credit risk assessment and targeted strategies.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Creating a box plot of AGE V/S DEFAULT_PAYMENT with respect to education
plt.figure(figsize=(20,10))
sns.boxplot(x='DEFAULT_PAYMENT',hue='EDUCATION', y='AGE',data=df)

##### 1. Why did you pick the specific chart?

To visualize the relationship between 'AGE' and 'DEFAULT_PAYMENT' with respect to 'EDUCATION' levels

##### 2. What is/are the insight(s) found from the chart?

**Clients who did not default (DEFAULT_PAYMENT = 0):**

Among clients with a graduate school education (EDUCATION = 1), the median age appears to be relatively higher compared to other education levels within this 'not default' group.

Clients with a university education (EDUCATION = 2) also show a slightly higher median age compared to high school-educated clients (EDUCATION = 3) in the 'not default' group.

There are some outliers on both the lower and higher age sides for different education level categories within the 'not default' group, indicating the presence of clients with exceptional ages.

**Clients who defaulted (DEFAULT_PAYMENT = 1):**

The median age for clients with a high school education (EDUCATION = 3) within this group appears to be relatively lower compared to other education levels.
Graduate school (EDUCATION = 1) and university-educated (EDUCATION = 2) clients who defaulted have a slightly higher median age than high school-educated clients.

Similar to the 'not default' group, there are outliers on both the lower and higher age sides for various education level categories within the 'default' group.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights can be valuable for risk assessment and business strategies, especially when combined with additional factors and analyses for a comprehensive understanding of credit risk.

#### Chart - 6

In [None]:
# Creating a function to get column names in the given range.
def getColumnsNames(prefix):
  '''
  Return the six monthly column names for a prefix, e.g. 'PAY_' -> ['PAY_1', ..., 'PAY_6'].
  '''
  return [prefix+str(x) for x in range(1,7)]

In [None]:
# Chart - 6 visualization code
# PAY_1 , PAY_2 , PAY_3 , PAY_4 , PAY_5, PAY_6
pay_status_columns = getColumnsNames('PAY_')
figure, ax = plt.subplots(2,3)
figure.set_size_inches(18,10)
for i in range(len(pay_status_columns)):
    row,col = i//3, i%3

    # Yellow bars count all clients; green bars (drawn on top) count defaulters only
    d = df[pay_status_columns[i]].value_counts()
    x = df[pay_status_columns[i]][(df['DEFAULT_PAYMENT']==1)].value_counts()
    ax[row,col].bar(d.index, d, align='center', color='y',alpha = 0.7)
    ax[row,col].bar(x.index, x, align='center', color='g')
    ax[row,col].set_title(pay_status_columns[i])
plt.suptitle("Monthwise payment status for defaulters and non-defaulters \n Defaulters=Green, Non-defaulters=Yellow")

##### 1. Why did you pick the specific chart?

This choice of chart is appropriate because it allows for a clear comparison of payment statuses over time and highlights any differences between defaulters and non-defaulters.

##### 2. What is/are the insight(s) found from the chart?

**Payment Status (PAY_1 to PAY_6)**:  Each bar chart represents the payment status for a specific month, with different payment delay categories (ranging from -2 to 9) on the x-axis and the frequency of clients on the y-axis.

**Defaulters (Green):** For each month, the green bars represent the payment status distribution of clients who defaulted (DEFAULT_PAYMENT = 1).

**Non-defaulters (Yellow):** The yellow bars are drawn from all clients, with the green bars overlaid, so the yellow portion visible above each green bar corresponds to clients who did not default (DEFAULT_PAYMENT = 0).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Across all months, clients who did not default (yellow bars) have a higher frequency of on-time statuses (values -2, -1, and 0) than clients who defaulted (green bars).

Clients who defaulted show a higher frequency of payment delays, especially for values 1, 2, and 3, which represent payment delays of one, two, and three months, respectively.

Moving from 'PAY_1' (September, the most recent month) back to 'PAY_6' (April, the earliest month), the frequency of delays tends to decrease for both defaulters and non-defaulters.
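
The overlaid bars compare raw counts, which can hide how risky each status is. One complementary view, sketched here on made-up rows (`toy` is hypothetical and the values are not taken from the dataset), is the default rate within each repayment status.

```python
import pandas as pd

# Toy rows: repayment status for one month and the default flag
toy = pd.DataFrame({
    "PAY_1": [-1, -1, 0, 0, 0, 0, 1, 1, 2, 2],
    "DEFAULT_PAYMENT": [0, 0, 0, 0, 0, 1, 0, 1, 1, 1],
})

# The mean of a 0/1 flag is the share of defaulters within each status
default_rate = toy.groupby("PAY_1")["DEFAULT_PAYMENT"].mean()
print(default_rate)
```

Applied to the notebook's `df`, the same `groupby` would give one rate per status code for any of the PAY_1 to PAY_6 columns.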

#### Chart - 7

In [None]:
# Chart - 7 visualization code

def plot_graph(prefix):
  '''
  Plot log-scaled histograms of the six monthly columns that share the given prefix:
  all clients in yellow, defaulters overlaid in green.
  '''
  pay_columns = getColumnsNames(prefix)
  figure, ax = plt.subplots(3,2)
  figure.set_size_inches(18,10)
  for i in range(len(pay_columns)):
    row,col =  i%3, i//3

    ax[row,col].hist(df[pay_columns[i]], 30, color='y', alpha=0.7)
    ax[row,col].hist(df[pay_columns[i]][(df['DEFAULT_PAYMENT']==1)], 30, color='g')
    ax[row,col].set_title(pay_columns[i])
    ax[row,col].set_yscale('log')  # log scale keeps the sparse high-amount bins visible
  plt.suptitle(f"Monthwise {prefix} distribution for defaulters and non-defaulters \n Defaulters=Green, Non-defaulters=Yellow")


In [None]:
plot_graph('PAY_AMT')

##### 1. Why did you pick the specific chart?

Histograms are chosen because they allow for a clear representation of the distribution of payment amounts, making it possible to compare the distribution shapes between the two groups.

##### 2. What is/are the insight(s) found from the chart?

**Payment Amounts (PAY_AMT1 to PAY_AMT6):** Each histogram represents the distribution of payment amounts for a specific month, with the x-axis indicating payment amounts and the y-axis representing the frequency or count of clients.

**Defaulters (Green):** The green histograms represent the payment amount distribution for clients who defaulted (DEFAULT_PAYMENT = 1).

**Non-defaulters (Yellow):** The yellow histograms indicate the payment amount distribution for clients who did not default (DEFAULT_PAYMENT = 0).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Across all months,** the majority of clients, both defaulters and non-defaulters, tend to make lower payment amounts, as shown by the high peaks on the left side of each histogram.

While the distributions for both groups (defaulters and non-defaulters) have similar shapes, it appears that defaulters have a slightly higher frequency of lower payment amounts compared to non-defaulters.

In some months, defaulters have a noticeable peak at very low payment amounts, indicating a higher frequency of clients making minimal payments before defaulting.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plot_graph('BILL_AMT')

##### 1. Why did you pick the specific chart?

Histograms allow for a clear representation of the distribution of bill amounts, making it possible to compare the distribution shapes between the two groups.

##### 2. What is/are the insight(s) found from the chart?

**Bill Amounts (BILL_AMT1 to BILL_AMT6):** Each histogram represents the distribution of bill statement amounts for a specific month, with the x-axis indicating bill amounts and the y-axis representing the frequency or count of clients.

**Defaulters (Green):** The green histograms represent the bill amount distribution for clients who defaulted (DEFAULT_PAYMENT = 1).

**Non-defaulters (Yellow):** The yellow histograms indicate the bill amount distribution for clients who did not default (DEFAULT_PAYMENT = 0).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The distributions for both groups (defaulters and non-defaulters) have similar shapes, although defaulters appear to have a slightly higher frequency of lower bill amounts compared to non-defaulters.



#### Chart - 9

In [None]:
# Chart - 9 visualization code
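# Creating a histogram with a KDE curve for "LIMIT_BAL"
# (sns.distplot is deprecated; histplot with stat='density' draws the equivalent plot)
plt.style.use('fivethirtyeight')
plt.figure(figsize=(10,6))
sns.histplot(df['LIMIT_BAL'],kde=True,bins=30,stat='density')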

##### 1. Why did you pick the specific chart?

This chart shows the distribution of credit limits on the x-axis and the probability density on the y-axis. The KDE curve offers a smooth representation of the distribution.



##### 2. What is/are the insight(s) found from the chart?

The distribution of credit limits appears to be right-skewed, with a higher concentration of clients having lower credit limits.

There are spikes in the distribution, indicating that certain credit limit values are more common than others.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the distribution of credit limits helps in setting appropriate credit limits for clients, considering the majority of clients have lower limits.

Identifying the spikes in the distribution can help tailor marketing efforts and financial product offerings to clients with specific credit limit preferences.

#### Chart - 10

In [None]:
# Chart - 10 visualization code

# Dropping the column "ID"
df = df.drop(['ID'],axis=1)

In [None]:
# Finding the correlation between different attribute
plt.figure(figsize=(22,12))
sns.heatmap(df.corr(),annot=True,cmap="coolwarm")

##### 1. Why did you pick the specific chart?

A heatmap is chosen because it effectively displays the correlation matrix, allowing for a quick and clear assessment of relationships between variables.

##### 2. What is/are the insight(s) found from the chart?

**BILL_AMT1 to BILL_AMT6:** The bill amounts for each month are highly positively correlated with each other. This indicates that clients who have higher bills in one month tend to have higher bills in other months as well.

**PAY_1 to PAY_6:** The repayment statuses for each month are moderately correlated with each other. Clients who have delayed payments in one month are likely to exhibit similar payment behavior in other months.

**LIMIT_BAL:** Credit limit shows a relatively weak positive correlation with some of the bill amounts (e.g., BILL_AMT1) and a weak negative correlation with some of the payment statuses (e.g., PAY_1). This suggests that clients with higher credit limits tend to have higher bills but may have better payment behavior.
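
With two dozen columns the heatmap is dense, so it can help to list the strongest pairs explicitly. The sketch below is one way to do that under stated assumptions: `top_correlated_pairs` is a hypothetical helper, and the three-column toy frame stands in for the real data.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, n=3):
    """Return the n column pairs with the largest absolute correlation.
    Hypothetical helper, not part of the notebook."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair appears exactly once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
    return pairs.head(n)

# Toy data: two bill columns that move together and one unrelated payment column
toy = pd.DataFrame({
    "BILL_AMT1": [100, 200, 300, 400, 500],
    "BILL_AMT2": [110, 190, 310, 390, 520],
    "PAY_AMT1": [50, 10, 40, 20, 30],
})
pairs = top_correlated_pairs(toy, n=1)
print(pairs)
```

Highly correlated pairs such as consecutive bill amounts are candidates for multicollinearity checks before fitting linear models.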

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the correlations between attributes helps in identifying relationships that can inform credit risk assessment and product strategies.

Recognizing the relationships between attributes allows for tailored financial product offerings and marketing strategies.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Correlation between independent variables and dependent variable.
X = df.drop(['DEFAULT_PAYMENT'],axis=1)
sns.set_style('whitegrid')  # the 'seaborn-whitegrid' style name is not available in newer matplotlib releases
X.corrwith(df['DEFAULT_PAYMENT']).plot.bar(figsize = (20, 10), title = "Correlation with Default_Payment",
                                        fontsize = 20,rot = 90, grid = True)

##### 1. Why did you pick the specific chart?

 It allows for a clear comparison of correlation values for each independent variable, making it easy to identify which features have a stronger or weaker correlation with the target variable.

##### 2. What is/are the insight(s) found from the chart?

The education and age variables have a positive correlation with 'DEFAULT_PAYMENT', indicating that higher values of these variables are associated with a higher likelihood of default.

The sex and marriage variables have a negative correlation with 'DEFAULT_PAYMENT', indicating that higher values of these variables are associated with a lower likelihood of default.

The magnitude of the correlation (the height of the bars) represents the strength of the linear relationship. Features with taller bars, in either direction, have a stronger correlation with 'DEFAULT_PAYMENT'.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Features with a strong positive correlation with 'DEFAULT_PAYMENT' may be useful in predicting default risk. They can be used as important factors in credit scoring models.

Features with a strong negative correlation with 'DEFAULT_PAYMENT' may indicate factors that reduce the risk of default. These factors can inform credit risk assessment and product offerings.
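
Since both strong positive and strong negative correlations are informative, features can be ranked by absolute correlation while keeping the sign. This is a minimal sketch on a made-up frame (`toy` and its values are hypothetical, not results from the dataset).

```python
import pandas as pd

# Toy data: illustrative values only
toy = pd.DataFrame({
    "PAY_1": [0, 0, 1, 2, 2, 0],
    "LIMIT_BAL": [500, 400, 300, 200, 100, 450],
    "AGE": [30, 45, 28, 50, 33, 41],
    "DEFAULT_PAYMENT": [0, 0, 1, 1, 1, 0],
})

features = toy.drop(columns="DEFAULT_PAYMENT")
corr = features.corrwith(toy["DEFAULT_PAYMENT"])

# Order by strength (absolute value) but report the signed correlation
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```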

#### Chart - 12

**SMOTE (Synthetic Minority Oversampling Technique)**

In [None]:
# Importing SMOTE
from imblearn.over_sampling import SMOTE
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Creating a SMOTE object (random_state fixed so the resampling is reproducible)
smote = SMOTE(random_state=42)

# Fit on the predictors and target, then resample
x_smote, y_smote = smote.fit_resample(X, df['DEFAULT_PAYMENT'])

print('Original number of rows:', len(X))
print('Resampled number of rows:', len(y_smote))

In [None]:
# Creating a new dataframe after using SMOTE
balance_df = pd.DataFrame(x_smote, columns = list(X.columns))
balance_df['DEFAULT_PAYMENT'] = y_smote

In [None]:
# Chart - 12 visualization code
# Creating a count plot for "DEFAULT_PAYMENT"
plt.style.use('fivethirtyeight')
plt.figure(figsize=(10,6))
sns.countplot(x='DEFAULT_PAYMENT', data=balance_df)


##### 1. Why did you pick the specific chart?

It is suitable for visualizing the distribution of categorical data.

##### 2. What is/are the insight(s) found from the chart?

The count plot shows how many rows fall into each category after resampling: those who did not default (0) and those who defaulted (1). Because SMOTE adds synthetic minority samples, the two classes now have equal counts.

Checking this balance matters because class imbalance affects both the training and the evaluation of classification models.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the distribution of default and non-default cases is essential for assessing the effectiveness of credit risk models and strategies.

It helps in evaluating whether a class imbalance issue remains, which can affect the performance of predictive models. Here the imbalance has already been addressed with SMOTE.
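
SMOTE creates synthetic minority rows by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified NumPy illustration of that idea, not imbalanced-learn's implementation; `smote_like_oversample`, its parameters, and the toy points are all hypothetical.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=2, seed=0):
    """Simplified SMOTE-style interpolation (illustrative only)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances among the minority samples
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))          # random minority sample
        nb = neighbours[base, rng.integers(k)]   # one of its k nearest neighbours
        gap = rng.random()                       # position along the connecting segment
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Toy minority class with two features
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [3.0, 3.0]])
new_points = smote_like_oversample(minority, n_new=5)
print(new_points.shape)
```

Because every synthetic row lies on a segment between two real minority rows, the new points stay inside the region the minority class already occupies.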

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

In [None]:
balance_df[balance_df['DEFAULT_PAYMENT']==1] # filtering rows where DEFAULT_PAYMENT is 1

In [None]:
# importing library
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
x_train_sm, x_test_sm, y_train_sm, y_test_sm = train_test_split(x_smote,y_smote, test_size = 0.2, random_state = 24,stratify = y_smote)

## ***6. Feature Engineering & Data Pre-processing***

In [None]:
# Creating a dummy copy of our dataset
df = balance_df.copy()

In [None]:
# Bin ‘AGE’ data to 6 groups
bins= [21,30,40,50,60,70,80]
labels = list(range(6))
df['AGE'] = pd.cut(df['AGE'],bins=bins, labels=labels,right=False)

In [None]:
# Convert the categorical column into integers by casting the bin labels to int64
df = df.astype({"AGE":'int64'})
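
The binning step above relies on `pd.cut` with `right=False`, which makes every bin closed on the left and open on the right. The sketch below, using a few hypothetical ages, shows how boundary values are assigned and why it may be worth checking for unbinned rows before casting to an integer type: any age outside `[21, 80)` receives no label.

```python
import pandas as pd

ages = pd.Series([21, 29, 30, 45, 79, 80])  # hypothetical ages, not taken from the dataset
bins = [21, 30, 40, 50, 60, 70, 80]
labels = list(range(6))

# right=False gives the intervals [21, 30), [30, 40), ..., [70, 80)
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

# Ages outside [21, 80) are left unbinned (NaN)
n_unbinned = int(age_group.isna().sum())

print(age_group.tolist())
print(n_unbinned)
```

If `n_unbinned` is greater than zero, the subsequent integer cast is expected to fail, so widening the last bin edge or handling those rows first would be one option.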

In [None]:
# Define predictor variables and target variable
X = df.drop(columns=['DEFAULT_PAYMENT'])
y = df['DEFAULT_PAYMENT']

# Save all feature names as list
feature_cols = X.columns.tolist()

# Extract numerical columns and save as a list for rescaling
X_num = X.drop(columns=['SEX', 'EDUCATION', 'MARRIAGE', 'AGE'])
num_cols = X_num.columns.tolist()

In [None]:
# Define function to split data

def data_split(X, y):
  '''
  This function is used for splitting data into train and test.
  '''
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
  return X_train, X_test, y_train, y_test
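
To illustrate what `stratify=y` does in the split above, here is a minimal sketch on hypothetical toy labels (`X_demo` and `y_demo` are made up for illustration). With stratification, the train and test parts keep the same class proportions as the full data.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 rows of class 0 and 20 rows of class 1
X_demo = [[i] for i in range(100)]
y_demo = [0] * 80 + [1] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Both parts keep the 80/20 class ratio of the full data
train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts, test_counts)
```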

In [None]:
# Define function to rescale training data using StandardScaler

def standard_scaler(X_train, X_test, numerical_cols):

  # Make copies of dataset
  X_train_std = X_train.copy()
  X_test_std = X_test.copy()

  # Apply standardization on numerical features only
  for i in numerical_cols:
    scl = StandardScaler().fit(X_train_std[[i]])     # fit on training data columns
    X_train_std[i] = scl.transform(X_train_std[[i]]) # transform the training data columns
    X_test_std[i] = scl.transform(X_test_std[[i]])   # transform the testing data columns

  return X_train_std,X_test_std
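
The key idea in `standard_scaler` is that the mean and standard deviation are learned from the training rows only and then reused for the test rows. A compact sketch of the same idea, fitting one `StandardScaler` on all numerical columns at once, is shown below. The column names `amount` and `bill` and all values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numerical features
train = pd.DataFrame({"amount": [10.0, 20.0, 30.0, 40.0], "bill": [1.0, 3.0, 5.0, 7.0]})
test = pd.DataFrame({"amount": [25.0, 50.0], "bill": [4.0, 9.0]})
cols = ["amount", "bill"]

# Statistics come from the training rows only
scaler = StandardScaler().fit(train[cols])

train_std = train.copy()
test_std = test.copy()
train_std[cols] = scaler.transform(train[cols])
test_std[cols] = scaler.transform(test[cols])  # test rows reuse the training mean and std

print(train_std)
print(test_std)
```

Because `StandardScaler` standardizes each column independently, fitting once on several columns should give the same values as fitting a separate scaler per column.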

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1 Logistic Regression

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

In [None]:
def run_logistic_regression():
  '''
  This function fits a Logistic Regression model with default parameters
  and returns its mean cross-validated ROC_AUC score on the training data.
  '''
  # Split data
  X_train, X_test, y_train, y_test = data_split(X, y)

  # Rescale data
  X_train_std, X_test_std = standard_scaler(X_train, X_test, numerical_cols = num_cols)

  # Instantiate model
  clf_lr = LogisticRegression(random_state=42)

  # Fit the model
  clf_lr.fit(X_train_std, y_train)

  # Use model's default parameters to get cross validation score
  scores = cross_val_score(clf_lr, X_train_std, y_train, scoring ="roc_auc", cv = 5)
  roc_auc_lr = np.mean(scores)

  return "Logistic Regression", roc_auc_lr

In [None]:
# Creating a list to store the "ROC_AUC Score" of each "Model"
model_result = []

# Appending the Logistic Regression result and displaying the list as a dataframe
model_result.append(run_logistic_regression())
pd.DataFrame(model_result, columns = ["Model", "ROC_AUC Score"])

Tune Parameters of Logistic Regression

In [None]:
# Split the SMOTE-balanced data
X_train, X_test, y_train, y_test = data_split(X, y)

# Rescale data
X_train_std, X_test_std = standard_scaler(X_train, X_test, numerical_cols = num_cols)

In [None]:
from sklearn.model_selection import GridSearchCV
# Parameter grid values for hyperparameter tuning
grid_values = {'C' : [0.001, 0.01, 0.1, 1, 10, 100, 1000],
               'penalty':['l2', 'l1']}

# Instantiate the model; 'liblinear' is used because the default 'lbfgs' solver does not support the 'l1' penalty
clf_lr = LogisticRegression(solver='liblinear', random_state=42)

# Instantiate grid search model
grid_search = GridSearchCV(estimator = clf_lr, param_grid = grid_values, cv = 3, verbose = 1)

# Fit grid search to the data
grid_search.fit(X_train_std, y_train)

In [None]:
# getting best parameters for model.
grid_search.best_params_
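
Besides `best_params_`, a fitted `GridSearchCV` also exposes `best_score_` and `best_estimator_`, which avoids copying the chosen values by hand. The sketch below runs on synthetic data from `make_classification` with a reduced grid; the `liblinear` solver and `scoring="roc_auc"` are assumptions made for this illustration, not settings confirmed by the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the credit card dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    LogisticRegression(solver="liblinear", random_state=42),  # liblinear supports l1 and l2
    param_grid={"C": [0.01, 1, 100], "penalty": ["l1", "l2"]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X_demo, y_demo)

best_params = search.best_params_    # dict with the selected C and penalty
best_model = search.best_estimator_  # model refit on all of X_demo with best_params
print(best_params, round(search.best_score_, 3))
```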

In [None]:
# Refit the model using the best parameters found by the grid search
lr_best = LogisticRegression(penalty = 'l2', C = 0.01)
lr_optimal = lr_best.fit(X_train_std, y_train)

In [None]:
# Get ROC_AUC score of tuned model on training data
scores_tuned = cross_val_score(lr_optimal, X_train_std, y_train, scoring = "roc_auc", cv = 5)
roc_auc_lr_best = np.mean(scores_tuned)

print(f'ROC_AUC score after tuning parameters:{roc_auc_lr_best}')

After tuning parameter C, the Logistic Regression model's cross-validated ROC_AUC score on the training data is 0.79413, slightly higher than the original score of 0.79144. A higher cross-validated score with stronger regularization (smaller C) could mean the tuned model overfits the training data less.
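
One caveat with the cross-validated scores above is that the data was standardized once on the whole training split, so each validation fold has already influenced the scaling statistics. A possible way to keep scaling inside each fold is to wrap both steps in a pipeline. The sketch below uses synthetic data from `make_classification` and scales every feature, which differs from the notebook's approach of scaling only the numerical columns.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the credit card dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

# The scaler is refit on the training part of every fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(C=0.01, random_state=42))

scores = cross_val_score(pipe, X_demo, y_demo, scoring="roc_auc", cv=5)
mean_auc = scores.mean()
print(f"Mean cross-validated ROC_AUC: {mean_auc:.3f}")
```

In practice the difference is often small for standardization, but the pipeline form makes the cross-validation estimate easier to trust.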

In [None]:
# Define a function to compute Accuracy, Precision, Recall and F1 score
def get_pre_rec_f1(model,X_test,y_test,title):
  '''
  This function calculates accuracy, precision, recall and F1 score using confusion matrix.
  '''
  y_pred = model.predict(X_test)
  tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

  accuracy = (tp + tn) / (tn + fp + tp + fn)
  precision = tp / (tp + fp)
  recall = tp / (tp + fn)
  F1 = 2 * (precision * recall) / (precision + recall)

  return title,accuracy, precision,recall,F1
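
As a worked example of the formulas in `get_pre_rec_f1`, the sketch below computes the same four metrics from hypothetical confusion-matrix counts. The helper name `metrics_from_counts` and the zero-division guards are additions for illustration and are not part of the notebook's function.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # guard: no predicted positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # guard: no actual positives
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1

# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives
acc, prec, rec, f1 = metrics_from_counts(tp=40, fp=10, fn=20, tn=30)
print(acc, prec, rec, f1)
```

Here accuracy is 70/100, precision is 40/50, recall is 40/60, and F1 is the harmonic mean of precision and recall, 80/110.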

In [None]:
from sklearn.metrics import confusion_matrix
# Creating a list to store "Accuracy", "Precision", "Recall", "F1 Score" of each "Model"
model_report = []

# Appending the test-set metrics of the tuned Logistic Regression model and displaying the list as a dataframe
model_report.append(get_pre_rec_f1(lr_optimal, X_test_std, y_test,"Logistic Regression"))
pd.DataFrame(model_report, columns = ["Model","Accuracy", "Precision", "Recall", "F1 Score"])

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2 Decision Tree Classifier

In [None]:
def run_decision_tree():
  '''
  This function fits a Decision Tree model with default parameters
  and returns its mean cross-validated ROC_AUC score on the training data.
  '''
  # Split data
  X_train, X_test, y_train, y_test = data_split(X, y)

  # Instantiate model
  clf_dt = DecisionTreeClassifier(random_state=42)  # fixed seed so the score is reproducible

  # Fit the model
  clf_dt.fit(X_train, y_train)

  # Use model's default parameters to get cross validation score
  scores = cross_val_score(clf_dt, X_train, y_train, scoring ="roc_auc", cv = 5)
  roc_auc_dt = np.mean(scores)

  return "Decision Tree", roc_auc_dt

In [None]:
from sklearn.tree import DecisionTreeClassifier
# Storing the value in dataframe for decision tree model.
model_result.append(run_decision_tree())
pd.DataFrame(model_result, columns = ["Model", "ROC_AUC Score"])

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
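
A minimal sketch of the save-and-reload sanity check described in the two steps above, using `pickle` on a small model trained on synthetic data. The file name `model_demo.pkl` and the synthetic dataset are hypothetical; in the notebook the saved object would be the chosen final model.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data and model (not the notebook's final model)
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)
model = LogisticRegression(random_state=42).fit(X_demo, y_demo)

# Save the fitted model to a temporary location
model_path = os.path.join(tempfile.mkdtemp(), "model_demo.pkl")  # hypothetical file name
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back and confirm the predictions are unchanged
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

same_predictions = bool((loaded_model.predict(X_demo) == model.predict(X_demo)).all())
print(same_predictions)
```

A pickled model should generally be loaded with the same scikit-learn version that saved it, and only from sources that are trusted.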

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

▶ Of the models evaluated, SVC (Support Vector Classifier) and KNN give the best accuracy.

▶ Decision Tree is the least accurate of the models evaluated.

▶ SVC has the best balance between precision and recall.

▶ Higher recall can be achieved if lower precision is acceptable.

▶ The model can be deployed to serve as an aid to human decision-making.

▶ Model can be improved with more data and computational resources.
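
The precision and recall trade-off mentioned in the conclusion can be sketched by moving the decision threshold applied to predicted default probabilities. The scores and labels below are hypothetical and only illustrate the mechanism; they are not outputs of any model in this notebook.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when predicting 1 for every score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical predicted probabilities of default and true labels
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

default_cut = precision_recall_at(0.5, scores, labels)
lower_cut = precision_recall_at(0.3, scores, labels)
print(default_cut, lower_cut)
```

Lowering the threshold from 0.5 to 0.3 in this toy example catches every defaulter (recall rises from 0.75 to 1.0) at the cost of more false alarms (precision falls from 0.75 to about 0.57).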

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***