<a href="https://colab.research.google.com/github/Pruthviraj3196/capstone-project---3--Credit-Card-Default-Prediction/blob/main/Copy_of_Sample_ML_Submission_Template_Credit_Card_Default_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Credit Card Default Prediction



##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Name  -** Pruthviraj Gopinath Barbole


# **Project Summary -**

The Credit Card Default Prediction Classification ML Project aims to develop a machine learning model that accurately predicts the likelihood of credit card holders defaulting on their payments. The project addresses the crucial need for financial institutions to identify high-risk customers and take proactive measures to mitigate potential losses.

The project utilizes historical credit card data, including various features such as demographics, credit limit, payment history, and transaction details, to train and evaluate the predictive model. By leveraging machine learning algorithms and techniques, the project team aims to build a robust and accurate classification model capable of distinguishing between customers who are likely to default and those who are not.

The project follows a structured approach, including the following key steps:

- Data Collection and Exploration: Gathering relevant credit card data from various sources and performing exploratory data analysis (EDA) to gain insights into the dataset.

- Data Preprocessing and Feature Engineering: Preparing the data for modeling by cleaning, transforming, and normalizing it. Feature engineering techniques are applied to derive new meaningful features that could enhance the predictive power of the model.

- Model Selection and Training: Selecting appropriate machine learning algorithms for classification, such as logistic regression, decision trees, random forests, or gradient boosting. Multiple models are explored and their performance is evaluated using suitable evaluation metrics like accuracy, precision, recall, and F1-score. The models are trained on the labeled dataset and fine-tuned using techniques like cross-validation or grid search.

- Model Evaluation and Validation: Assessing the performance of the trained models on a separate validation dataset to ensure generalizability, and selecting the model that gives the best results.
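
The model selection and tuning steps listed above can be sketched as follows. This is a minimal illustration on synthetic data from `make_classification`, not the notebook's actual pipeline; the parameter grid, the `f1` scoring choice, and all variable names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the credit card data (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.78, 0.22], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Hypothetical grid over the regularisation strength C
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, scoring="f1", cv=5
)
search.fit(X_tr, y_tr)

print("best C:", search.best_params_["C"])
print("cross-validated F1:", round(search.best_score_, 3))
# search.score applies the same scorer (F1) to the held-out split
print("held-out F1:", round(search.score(X_te, y_te), 3))
```

Scoring on F1 rather than accuracy is one reasonable choice when the positive class is rare; other metrics may suit the business goal better.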

The project aims to deliver a credit card default prediction model that provides financial institutions with a valuable tool to make informed decisions and allocate resources more efficiently. By identifying high-risk customers beforehand, banks can take preventive actions, such as offering credit counseling, adjusting credit limits, or initiating collection processes, ultimately reducing the overall credit risk and potential financial losses.

Overall, the Credit Card Default Prediction Classification ML Project aims to contribute to the advancement of the financial industry by harnessing the power of machine learning to improve credit risk assessment and decision-making processes.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Business Context**

This project aims to predict customer default payments in Taiwan. From a risk-management perspective, the predictive accuracy of the estimated probability of default is more valuable than the binary classification result (credible or not credible clients). We can use the K-S chart to evaluate which customers will default on their credit card payments.
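
As a rough illustration of the K-S idea mentioned above, the sketch below computes the Kolmogorov-Smirnov statistic between the predicted scores of defaulters and non-defaulters. The labels and scores are synthetic (drawn from two beta distributions purely for the example); in practice they would come from a fitted model's predicted probabilities.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical labels: 800 non-defaulters (0) and 200 defaulters (1)
y_true = np.array([0] * 800 + [1] * 200)

# Hypothetical predicted default probabilities: defaulters tend to score higher
scores = np.where(
    y_true == 1,
    rng.beta(4, 2, size=y_true.size),
    rng.beta(2, 4, size=y_true.size),
)

# K-S statistic: largest gap between the two empirical score distributions
ks_result = ks_2samp(scores[y_true == 1], scores[y_true == 0])
ks_stat = ks_result.statistic
print(f"K-S statistic: {ks_stat:.3f}")
```

A larger statistic means the scores separate the two groups more cleanly; a value near zero means the model barely distinguishes them.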

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that for each and every chart the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that for each and every algorithm the following format is answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


### Dataset Loading

In [None]:
# Load Dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

In [None]:
df_credit = pd.read_excel("/content/drive/MyDrive/data/default of credit card clients.xls", header = 1)

### Dataset First View

In [None]:
# Dataset First Look

In [None]:
df_credit.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

In [None]:
df_credit.shape

### Dataset Information

In [None]:
# Dataset Info

In [None]:
df_credit.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

In [None]:
df_credit.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

In [None]:
df_credit.isnull().sum()

In [None]:
# Visualizing the missing values

In [None]:
# Visualizing the missing values
sns.heatmap(df_credit.isna(), cbar=False)

### What did you know about your dataset?

- This dataset contains 30000 rows and 25 columns.
- `default payment next month` is our target variable, so this is the column we need to focus on.
- There are no missing values in the dataset.
- From the above we can see that there are no duplicated rows in the dataset.
- From the dataset info we learned that every column has an integer data type.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

In [None]:
df_credit.columns

In [None]:
# Dataset Describe

In [None]:
df_credit.describe().T

### Variables Description 

- ID: ID of each client
- LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
- SEX: Gender (1=male, 2=female)
- EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
- MARRIAGE: Marital status (1=married, 2=single, 3=others)
- AGE: Age in years
- PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
- PAY_2: Repayment status in August, 2005 (scale same as above)
- PAY_3: Repayment status in July, 2005 (scale same as above)
- PAY_4: Repayment status in June, 2005 (scale same as above)
- PAY_5: Repayment status in May, 2005 (scale same as above)
- PAY_6: Repayment status in April, 2005 (scale same as above)
- BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
- BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
- BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
- BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
- BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
- BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
- PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
- PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
- PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
- PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
- PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
- PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
- default payment next month: Default payment (1=yes, 0=no)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

In [None]:
df_credit.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

In [None]:
# renaming all columns

In [None]:
df_credit.rename(columns={'PAY_0':'PAY_SEPT','PAY_2':'PAY_AUG','PAY_3':'PAY_JUL','PAY_4':'PAY_JUN','PAY_5':'PAY_MAY','PAY_6':'PAY_APR'},inplace=True)
df_credit.rename(columns={'BILL_AMT1':'BILL_AMT_SEPT','BILL_AMT2':'BILL_AMT_AUG','BILL_AMT3':'BILL_AMT_JUL','BILL_AMT4':'BILL_AMT_JUN','BILL_AMT5':'BILL_AMT_MAY','BILL_AMT6':'BILL_AMT_APR'}, inplace = True)
df_credit.rename(columns={'PAY_AMT1':'PAY_AMT_SEPT','PAY_AMT2':'PAY_AMT_AUG','PAY_AMT3':'PAY_AMT_JUL','PAY_AMT4':'PAY_AMT_JUN','PAY_AMT5':'PAY_AMT_MAY','PAY_AMT6':'PAY_AMT_APR'},inplace=True)
df_credit.rename(columns={'default payment next month' : 'default_payment_next_month'}, inplace=True)

In [None]:
df_credit.info()

In [None]:
df_credit.head()

### What all manipulations have you done and insights you found?

- We renamed the dependent variable and some feature columns so that the feature names are easier to understand.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

# Dependent Variable Distribution

In [None]:
# Chart - 1 visualization code

In [None]:
df_credit['default_payment_next_month'].value_counts()

In [None]:
#plotting the count plot to visualize the data distribution
plt.figure(figsize=(10,5))
sns.countplot(x = 'default_payment_next_month', data = df_credit)

##### 1. Why did you pick the specific chart?

I picked countplot here as it's easy to compare between default and not default payments using them.

##### 2. What is/are the insight(s) found from the chart?

From the above chart we can observe that 23364 clients in the dataset are labelled as not defaulting on their payment, whereas 6636 clients are labelled as defaulting.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

According to the insights found, about 22% of clients default, so reducing this percentage is important for creating a positive business impact. This can be done by analysing the other features that affect default, as done in the further EDA.

While the insights themselves may not lead to negative growth, it's important to consider potential pitfalls that could arise if they are not properly interpreted or acted upon.
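
The 22% figure can be reproduced directly from the class counts reported above. The sketch below rebuilds a target column with those counts (the series itself is constructed for illustration; on the real data the same call would be made on `df_credit['default_payment_next_month']`).

```python
import pandas as pd

# Class counts reported above (0 = no default, 1 = default)
target = pd.Series([0] * 23364 + [1] * 6636, name="default_payment_next_month")

# Share of each class in percent
class_share = target.value_counts(normalize=True).mul(100).round(2)
print(class_share)
```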

## Categorical Features
We have a few categorical features in our dataset: \

-sex \
-education \
-marriage \
-age \

Categorical variables are qualitative data in which the values are assigned to a set of distinct groups or categories. These groups may consist of alphabetic (e.g., male, female) or numeric labels (e.g., male = 0, female = 1) that do not contain mathematical information beyond the frequency counts related to group membership.

Let's check how they are related to our target class.

# SEX



1 - Male

2 - Female

In [None]:
# Chart - 2 visualization code

In [None]:
#plotting the count plot to visualize the data distribution
plt.figure(figsize=(10,5))
sns.countplot(x = 'SEX', data = df_credit)

##### 1. Why did you pick the specific chart?

The countplot is used to represent the occurrence(counts) of the observation present in the categorical variable.

It uses the concept of a bar chart for the visual depiction

##### 2. What is/are the insight(s) found from the chart?

We can see that there are more female credit card holders than male ones.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Since there are more female credit card holders than male ones, the bank can give some offers to attract more male customers, and at the same time it should take care of its female customers to grow the business.

# Education


1 = graduate school; 2 = university; 3 = high school; 0 = others

In [None]:
# Chart - 3 visualization code

In [None]:
# count the values of the EDUCATION variable
df_credit['EDUCATION'].value_counts()

'EDUCATION' column: notice that 5 and 6 are both recorded as 'unknown', and there is a 0 which isn't explained in the dataset description. Since the counts are so small, let's combine 0, 4, 5 and 6 into 0, which means 'others'.



In [None]:
df_credit['EDUCATION'].value_counts()
# Change values 4, 5, 6 to 0 and define 0 as 'others'
df_credit["EDUCATION"] = df_credit["EDUCATION"].replace({4:0,5:0,6:0})
df_credit["EDUCATION"].value_counts()
#plotting the count plot to visualize the data distribution
plt.figure(figsize=(10,5))
sns.countplot(x = 'EDUCATION', data = df_credit)

##### 1. Why did you pick the specific chart?

The countplot is used to represent the occurrence(counts) of the observation present in the categorical variable.

It uses the concept of a bar chart for the visual depiction

##### 2. What is/are the insight(s) found from the chart?

Most credit card holders are from university, followed by graduate school.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

As we can see, most credit card holders are from university, followed by graduate school, so the bank can target these categories to grow its business. Since the sources of income for high school candidates are usually limited, there is less need to focus on this group.

# Marriage


1 = married; 2 = single; 3 = others

In [None]:
# Chart - 4 visualization code

In [None]:
# count the values of the MARRIAGE variable
df_credit['MARRIAGE'].value_counts()

In [None]:
# How many customers had "MARRIAGE" status as 0?
df_credit["MARRIAGE"].value_counts(normalize=True)

'MARRIAGE' column: what does 0 mean in 'MARRIAGE'? Since there are only 0.18% (or 54) observations of 0, we will combine 0 and 3 in one value as 'others'

In [None]:
# Combine 0 and 3 by changing the value 0 into others
df_credit['MARRIAGE'] = df_credit['MARRIAGE'].replace([0], 3)

In [None]:
#plotting the count plot to visualize the data distribution
plt.figure(figsize=(10,5))
sns.countplot(x = 'MARRIAGE', data = df_credit)

##### 1. Why did you pick the specific chart?

The countplot is used to represent the occurrence(counts) of the observation present in the categorical variable.

It uses the concept of a bar chart for the visual depiction

##### 2. What is/are the insight(s) found from the chart?

From the above data analysis we can say that

1 - married

2 - single

3 - others

More credit card holders are single than married.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Single people tend to spend more than married people, so they are likely to use their credit cards more. Hence targeting single people may increase the business.

# AGE

Plotting the age distribution of all credit card holders, irrespective of gender.



In [None]:
# Chart - 5 visualization code

In [None]:
# count the values of the AGE variable
df_credit['AGE'].value_counts()

In [None]:
#check the mean age for each class of default_payment_next_month
df_credit.groupby('default_payment_next_month')['AGE'].mean()

In [None]:
df = df_credit.astype('int')

In [None]:
#plotting the count plot to visualize the data distribution
plt.figure(figsize=(15,7))
plt.style.use('ggplot')
sns.countplot(x = 'AGE', data = df_credit)
plt.show()

From the above data analysis we can say that

Most credit card holders are between 26 and 30 years old. People above 60 years old rarely use a credit card.

In [None]:
#plotting the box plot to visualize the data distribution
plt.style.use('ggplot')
plt.figure(figsize=(10,10))
ax = sns.boxplot(x="default_payment_next_month", y="AGE", data=df_credit, palette=['c', 'm'])

##### 1. Why did you pick the specific chart?

The countplot is used to represent the occurrence(counts) of the observation present in the categorical variable.

It uses the concept of a bar chart for the visual depiction.

##### 2. What is/are the insight(s) found from the chart?

We can see that most of the credit card holders are between 26 and 30 years old.

There are very few credit card holders above the age of 60.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Younger people use credit cards more, so we will mainly focus on them to grow our business.



# Numerical features


Limit Balance

In [None]:
# Chart - 6 visualization code

In [None]:
# describe  the limit balance  data set
df_credit['LIMIT_BAL'].describe()

In [None]:
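#plotting a histogram with a KDE curve to visualize the data distribution
plt.figure(figsize=(15,10))
# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True is its replacement
sns.histplot(df_credit['LIMIT_BAL'], kde=True, stat='density', color='r')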
plt.show()

In [None]:
# Import necessary libraries
import matplotlib.pyplot as plt

# Create a figure with a size of 15x10 inches
plt.figure(figsize=(15, 10))

# Plot a histogram of the 'LIMIT_BAL' column from the df_credit dataframe
plt.hist(df_credit['LIMIT_BAL'], bins=20, color='r', alpha=0.5)

# Add a title and axis labels
plt.title('Distribution of Credit Limit')
plt.xlabel('Credit Limit')
plt.ylabel('Count')

# Show the resulting plot
plt.show()


In [None]:
#plotting the bar plot to visualize the data distribution
sns.barplot(x='default_payment_next_month', y='LIMIT_BAL', data=df_credit, palette=['r','m'])

In [None]:
#plotting the box plot to visualize the data distribution
plt.figure(figsize=(10,10))
ax = sns.boxplot(x="default_payment_next_month", y="LIMIT_BAL", data=df_credit, palette=['m','c'])

##### 1. Why did you pick the specific chart?

A Density Plot visualises the distribution of data over a continuous interval or time period. This chart is a variation of a Histogram that uses kernel smoothing to plot values, allowing for smoother distributions by smoothing out the noise.

Here I am using a histogram so that I can see how many clients fall into each range of credit limit and can visually compare the ranges.

Box plots are used to show distributions of numeric data values, especially when you want to compare them between multiple groups. They are built to provide high-level information at a glance, offering general information about a group of data’s symmetry, skew, variance, and outliers.

##### 2. What is/are the insight(s) found from the chart?

From the histogram we can see the counts for the various credit limit ranges, and when combining this with the distribution plot we can see that the data is positively skewed. From the boxplot we can infer that there are many outliers in credit limit.
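
The visual impression of positive skew and upper-tail outliers can also be checked numerically. The sketch below uses synthetic log-normal values as a stand-in for `LIMIT_BAL` (an assumption for illustration only); on the real data the same calls would be made on `df_credit['LIMIT_BAL']`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed "credit limit" values (assumption, not the real data)
limits = pd.Series(rng.lognormal(mean=11.5, sigma=0.8, size=5000), name="LIMIT_BAL")

# A positive skewness value indicates a long right tail
skewness = limits.skew()

# Upper whisker of a box plot: Q3 + 1.5 * IQR
q1, q3 = limits.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
n_high_outliers = int((limits > upper_fence).sum())

print(f"skewness: {skewness:.2f}")
print(f"values above the upper whisker: {n_high_outliers}")
```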

# Total Bill Amount

In [None]:
# Chart - 7 visualization code

In [None]:
#select the bill amount columns into a single dataframe
total_bill_amount = df[['BILL_AMT_SEPT', 'BILL_AMT_AUG', 'BILL_AMT_JUL', 'BILL_AMT_JUN', 'BILL_AMT_MAY', 'BILL_AMT_APR']]

In [None]:
#plotting the pair plot for bill amount 
sns.pairplot(data = total_bill_amount)

##### 1. Why did you pick the specific chart?

Pair plot is used to understand the best set of features to explain a relationship between two variables or to form the most separated clusters. Pairplot visualizes given data to find the relationship between them where the variables can be continuous or categorical.

So, I have used pair plot to represent the data in graphical form and to see any relationship in between the various features.

##### 2. What is/are the insight(s) found from the chart?

The distributions of the bill amounts and pay amounts are right-skewed. Bill amounts are approximately linearly related to each other.

# History payment status

In [None]:
# Chart - 8 visualization code

In [None]:
pay_col = ['PAY_SEPT', 'PAY_AUG', 'PAY_JUL', 'PAY_JUN', 'PAY_MAY', 'PAY_APR']
for col in pay_col:
  plt.figure(figsize=(10,5))
  sns.countplot(x = col, hue = 'default_payment_next_month', data = df_credit)

In [None]:
pay_col = ['PAY_SEPT', 'PAY_AUG', 'PAY_JUL', 'PAY_JUN', 'PAY_MAY', 'PAY_APR']

# Create a grid of subplots with 2 columns and 3 rows
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(15, 10))

# Plot a countplot for each payment column in a subplot
for i, ax in enumerate(axes.flatten()):
    if i < len(pay_col):
        sns.countplot(x=pay_col[i], hue='default_payment_next_month', data=df_credit, ax=ax, palette='Set2')

# Adjust the space between the subplots and show the plot
plt.tight_layout()
fig.suptitle('Count of Payment Status by Default Payment Next Month', y=1.05)
plt.show()


##### 1. Why did you pick the specific chart?

The above figure shows a bar plot of the payment status for each month, with the counts of defaulters and non-defaulters.

##### 2. What is/are the insight(s) found from the chart?

The number of defaulters decreases as we move along the x-axis, which is good for the company's business.

# Paid Amount

In [None]:
# Chart - 9 visualization code 

In [None]:
#select the paid amount columns and the target into a single dataframe
pay_amnt_df = df[['PAY_AMT_SEPT', 'PAY_AMT_AUG', 'PAY_AMT_JUL', 'PAY_AMT_JUN', 'PAY_AMT_MAY', 'PAY_AMT_APR', 'default_payment_next_month']]

In [None]:
sns.pairplot(data=pay_amnt_df, hue='default_payment_next_month', kind='reg')

##### 1. Why did you pick the specific chart?

Using a pair plot we aim to visualize the correlation of each feature pair in the dataset against the class distribution at a glance. Since a pair plot gives a visual idea of the correlation of each feature pair, it also helps us understand and quickly analyse the (Pearson) correlation matrix of the dataset.

##### 2. What is/are the insight(s) found from the chart?

From this we can see that the amount of previous payments by defaulters is far lower than that of non-defaulters.

# Bivariate Analysis

In [None]:
# Chart - 10 visualization code 

Sex Vs Default Payment Next Month

In [None]:
# Chart - 10 visualization code
#plotting the cat plot to visualize the data distribution related to the default_payment_next_month
x,y = 'SEX', 'default_payment_next_month'

(df_credit
.groupby(x)[y]
.value_counts(normalize=True)
.mul(100)
.rename('percent')
.reset_index()
.pipe((sns.catplot,'data'), x=x,y='percent',hue=y,kind='bar'))

Education and default_payment_next_month

In [None]:
#plotting the cat plot to visualize the data distribution related to the default_payment_next_month
x = 'EDUCATION' 
y = 'default_payment_next_month'
df_percent = (df_credit.groupby(x)[y].value_counts(normalize = True).mul(100).rename('percent').reset_index())
sns.catplot(data=df_percent, x=x, y='percent', hue=y, kind='bar', palette=['#191825', '#865DFF'])

Marriage Vs Default Payment Next Month

In [None]:
#plotting the cat plot to visualize the data distribution related to the default_payment_next_month
# group by MARRIAGE here (the cell previously reused EDUCATION by mistake)
x = 'MARRIAGE'
y = 'default_payment_next_month'
df_percent = (df_credit.groupby(x)[y].value_counts(normalize = True).mul(100).rename('percent').reset_index())
sns.catplot(data=df_percent, x=x, y='percent', hue=y, kind='bar', palette=['#191825', '#865DFF'])

Age Vs Default Payment Next Month


In [None]:
#plotting the bar plot to visualize the data distribution related to the default_payment_next_month
plt.figure(figsize=(19,7))
sns.barplot(x = 'AGE', y = 'default_payment_next_month', data = df_credit)

plt.show()

##### 1. Why did you pick the specific chart?

The countplot is used to represent the occurrence(counts) of the observation present in the categorical variable.

It uses the concept of a bar chart for the visual depiction.



##### 2. What is/are the insight(s) found from the chart?

Sex - 
The majority of credit card holders are female, and the above graph also makes clear that the largest number of defaulters are female.

In terms of default ratio (default/(default+not_default)), males have a higher default ratio than females.

Education - 
Most credit card holders are from university, and the above graph shows that most defaulters are also from university, but in terms of default ratio university people have a lower default ratio.

Marriage - 
From the above we can see that married people have a higher default ratio.

Age - 
From age 21 to the early 60s the default ratio varies non-linearly; after the 60s, however, the default ratio increases.
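
The distinction between the number of defaulters and the default ratio, default/(default+not_default), can be made concrete with a small sketch. The toy table below is invented for illustration (it is not the real data): group 2 has more defaulters in absolute terms but a lower default ratio, which mirrors the pattern described for SEX.

```python
import pandas as pd

# Toy data (assumption): group 1 has 4 clients, group 2 has 10 clients
toy = pd.DataFrame({
    "SEX": [1] * 4 + [2] * 10,
    "default_payment_next_month": [1, 1, 0, 0] + [1, 1, 1] + [0] * 7,
})

grouped = toy.groupby("SEX")["default_payment_next_month"]

# Absolute number of defaulters per group
default_count = grouped.sum()

# Default ratio per group: the mean of a 0/1 column equals default / (default + not_default)
default_ratio = grouped.mean()

print(default_count)
print(default_ratio)
```

Because the target is coded 0/1, the group mean is exactly the default ratio, so no separate counting of the two classes is needed.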


##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Sex -
 The bank should mainly focus on female customers to grow the business. However, some policies should also be offered to male customers to reduce their chance of defaulting.

Education - 
We can target university people more to grow our business, as most of the credit card holders are from this category.

Marriage - 
Married people have a higher default ratio than single people, and most of the non-defaulters (and most credit card holders overall) belong to the single category, so we can mainly focus on them to grow the business.

Age - 
From age 21 to the early 60s the proportion of defaults is almost constant, yet one insightful finding is that the risk is higher for people older than 60.

#### Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code 

In [None]:
# Correlation Heatmap visualization code
#plotting the heatmap 
plt.figure(figsize=(20,15))
sns.heatmap(df_credit.corr(),annot=True,cmap="coolwarm")

From the above graph it seems that some features, such as age, are negatively correlated, but we cannot blindly remove them because they could be important for prediction.\
ID is unimportant and has no role in prediction, so we will remove it.

In [None]:
# Draw box plot to see if there is any outliers in our dataset
plt.figure (figsize= (18,7))
df_credit.boxplot()
plt.xticks(rotation=90)
# rotating xticks to 90 degrees. this is done when we want our x-axis label annotators to be vertical 
# because there may not be enough space for us to visualize them. 

From the above boxplot, we can see that there are quite a few outliers present in our features. And most of these outliers are present in features containing pay-amount and Bill amount data.

In [None]:
# creating a list columns in which outliers are present.
outlier_columns = ['LIMIT_BAL', 'BILL_AMT_SEPT','BILL_AMT_AUG', 'BILL_AMT_JUL', 'BILL_AMT_JUN', 'BILL_AMT_MAY',
                 'BILL_AMT_APR', 'PAY_AMT_SEPT', 'PAY_AMT_AUG', 'PAY_AMT_JUL','PAY_AMT_JUN', 'PAY_AMT_MAY',
                 'PAY_AMT_APR']
# using IQR method for dropping outliers from above columns
Q1 = df[outlier_columns].quantile(0.25)
Q3 = df[outlier_columns].quantile(0.75)

IQR = Q3 - Q1                   # interquartile range

# using interquartile range to find and remove outliers from our dataframe.
df = df[~((df[outlier_columns] < (Q1 - 1.5 * IQR)) |(df[outlier_columns] > (Q3 + 1.5 * IQR))).any(axis=1)]

In [None]:
df.shape

In [None]:
# Dropping some of the unnecessary columns.
df.drop(['ID'], axis=1,inplace =True)

In [None]:
df.shape

##### 1. Why did you pick the specific chart?

Correlation heatmaps are a type of plot that visualize the strength of relationships between numerical variables.

Correlation plots are used to understand which variables are related to each other and the strength of this relationship.

A correlation plot typically contains a number of numerical variables, with each variable represented by a column. The rows represent the relationship between each pair of variables.

The values in the cells indicate the strength of the relationship, with positive values indicating a positive relationship and negative values indicating a negative relationship.
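
To make the meaning of the cell values concrete, here is a minimal sketch on a hand-made table (the column names and values are invented for illustration). `DataFrame.corr()` computes Pearson correlation by default, which is what the heatmap above displays.

```python
import pandas as pd

# Toy columns (assumption): one perfectly increasing with x, one perfectly
# decreasing, and one that only partly follows x
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
toy["up"] = 2 * toy["x"] + 1
toy["down"] = -toy["x"]
toy["mixed"] = [2.0, 1.0, 4.0, 3.0, 5.0]

corr = toy.corr()
print(corr.round(2))
```

Here `x` versus `up` gives 1, `x` versus `down` gives -1, and `x` versus `mixed` gives 0.8, so the sign shows the direction of the relationship and the magnitude shows its strength.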

##### 2. What is/are the insight(s) found from the chart?

From the above graph it seems that some features, such as age, are negatively correlated, but we cannot blindly remove them because they could be important for prediction.

ID is unimportant and it has no role in prediction so we will remove it.

## ***5. Feature Engineering & Data Pre-processing***

In [None]:
# Now checking for correlation among our independent variables (multicollinearity) using VIF analysis.
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(X):

    # Calculating VIF
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return(vif)

In [None]:
# performing VIF analysis on the features only (the target column is excluded)
calc_vif(df[[i for i in df.describe().columns if i not in ['default_payment_next_month']]])

As we can see from the above, some of our features have high multicollinearity, particularly the bill amount columns, so we need to do some feature engineering on them.
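
For intuition about what a high VIF means, the sketch below computes it by hand from its usual definition, VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing feature i on the remaining features. The helper `vif_manual` and the synthetic columns are hypothetical and only illustrate the idea; because this sketch adds an intercept to each regression, its numbers need not match the `calc_vif` output above exactly.

```python
import numpy as np

def vif_manual(X, i):
    """VIF of column i: regress it on the other columns (with intercept)."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)  # nearly a copy of a, like consecutive bill amounts
c = rng.normal(size=500)                 # unrelated feature
X = np.column_stack([a, b, c])

vifs = [vif_manual(X, i) for i in range(X.shape[1])]
print([round(v, 1) for v in vifs])
```

The two nearly identical columns get very large VIF values while the unrelated column stays close to 1, which is the pattern that motivates combining the bill amount columns.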

In [None]:
# Lets add up all bill amount features together in one.
df['TOTAL_BILL_PAY'] = df['BILL_AMT_SEPT'] + df['BILL_AMT_AUG'] + df['BILL_AMT_JUL'] + df['BILL_AMT_JUN'] +  df['BILL_AMT_MAY'] + df['BILL_AMT_APR'] 

In [None]:
# Let's check again, leaving out the target and the individual bill amount columns.
calc_vif(df[[i for i in df.describe().columns if i not in ['default_payment_next_month','BILL_AMT_SEPT','BILL_AMT_AUG','BILL_AMT_JUL','BILL_AMT_JUN','BILL_AMT_MAY','BILL_AMT_APR']]])

### Categorical Encoding

## One hot encoding 

In [None]:
#assigning readable labels to the different categories
df.replace({'SEX': {1 : 'MALE', 2 : 'FEMALE'}, 'EDUCATION' : {1 : 'graduate school', 2 : 'university', 3 : 'high school', 0 : 'others'}, 'MARRIAGE' : {1 : 'married', 2 : 'single', 3 : 'others'}}, inplace = True)


In [None]:
df.head()

In [None]:
#creating dummy variables
df = pd.get_dummies(df, columns = ['EDUCATION', 'MARRIAGE'])

In [None]:
df.shape

In [None]:
#creating dummy variables for the repayment status columns
df = pd.get_dummies(df, columns=['PAY_SEPT', 'PAY_AUG', 'PAY_JUL', 'PAY_JUN', 'PAY_MAY', 'PAY_APR'])

In [None]:
# LABEL ENCODING FOR Gender
encoders_nums = {"SEX":{"FEMALE": 0, "MALE": 1}}
df = df.replace(encoders_nums)

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.columns

In [None]:
# Creating dependent variable and independent variable
independent_variables = df.drop(['default_payment_next_month'],axis=1)
dependent_variable = df['default_payment_next_month']

### Data Scaling

In [None]:
# scaling the data using zscore.
from scipy.stats import zscore  
x = round(independent_variables.apply(zscore),3)
y = dependent_variable

## Data splitting

In [None]:
# train test split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 1)

## What all categorical encoding techniques have you used & why did you use those techniques?

One-hot encoding turns a categorical column into a binary vector representation: for each unique value in the column a new column is created, holding 1 where the row has that value and 0 otherwise. This matters because many machine learning implementations, such as the decision trees and support vector machines in scikit-learn, accept only numeric inputs.
I used pandas `get_dummies` to one-hot encode EDUCATION, MARRIAGE and the monthly repayment-status columns, and a simple 0/1 label encoding for SEX because it has only two categories.
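As a small illustration of what `get_dummies` produces, the sketch below encodes a hypothetical four-row MARRIAGE column. The toy frame is made up for the example; `drop_first=True` is shown only as an option and is not what the cells above use.

```python
import pandas as pd

# hypothetical toy column, not the project dataset
toy = pd.DataFrame({"MARRIAGE": ["married", "single", "others", "single"]})

# one new 0/1 column per category
full = pd.get_dummies(toy, columns=["MARRIAGE"])

# optional: drop the first level to avoid a perfectly redundant column
reduced = pd.get_dummies(toy, columns=["MARRIAGE"], drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Each row has exactly one active dummy in `full`, so one column is always implied by the others; dropping it can matter for linear models that are sensitive to multicollinearity.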

## Which method have you used to scale your data and why?

I have used z-score scaling (standardization) because a z-score expresses each value as the number of standard deviations it lies from the mean of its feature. This puts features measured on very different scales, such as credit limit and age, on a comparable scale, which matters for distance-based models like KNN and SVM. Z-scores also let us compare values from different distributions and can be converted into percentiles.
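The z-score formula is z = (x - mean) / std. The sketch below applies it to a hypothetical credit-limit column. One assumption differs from the scaling cell above: the mean and standard deviation are estimated on the training rows only and then reused for the test rows, which is a common way to keep test-set information out of preprocessing.

```python
import numpy as np

# hypothetical credit-limit values, not the project data
rng = np.random.default_rng(1)
limit_bal = rng.normal(loc=150_000, scale=50_000, size=1_000)
train, test = limit_bal[:800], limit_bal[800:]

# statistics come from the training rows only
mean, std = train.mean(), train.std()

train_z = (train - mean) / std
test_z = (test - mean) / std   # test rows reuse the training statistics

print(round(train_z.mean(), 6), round(train_z.std(), 6))
```

After scaling, the training column has mean 0 and standard deviation 1; the test column will be close to that but not exactly, because it was not used to compute the statistics.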

## What data splitting ratio have you used and why?

I have used an 80/20 split between the training and test sets, so that the model gets enough data to train on and can then be evaluated on unseen data.

## Handling Imbalanced Dataset

# APPLYING SMOTE (Synthetic Minority Oversampling Technique)

Since we have an imbalanced dataset, we need to apply some technique to remedy this. We will try an oversampling technique called SMOTE.

In [None]:
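# applying oversampling to overcome class imbalance
# SMOTE is fitted on the training split only, so no synthetic points are derived from the held-out test rows.
from imblearn.over_sampling import SMOTE
smote= SMOTE()
x_train_smote,y_train_smote = smote.fit_resample(x_train,y_train)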

from collections import Counter
print('Original dataset shape', Counter(y_train))
print('Resample dataset shape', Counter(y_train_smote))
Counter(y_train_smote)

##### Do you think the dataset is imbalanced? Explain Why.

Yes, this is an imbalanced dataset: the target variable has far more non-defaulters than defaulters. That is good for the business, but when training machine learning models we need to take care of the imbalance using techniques such as SMOTE, Tomek links, undersampling or oversampling. If we do not, overall accuracy will be high simply because most customers are non-defaulters (not because the model is any good), the model will not be able to distinguish defaulters from non-defaulters, and this will impact the business negatively.
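A quick way to see why accuracy is misleading here is to score a "model" that always predicts non-defaulter. The 78/22 class split below is a made-up illustration, not a figure taken from this dataset.

```python
import numpy as np

# hypothetical labels: 78 non-defaulters (0) and 22 defaulters (1)
y_true = np.array([0] * 78 + [1] * 22)

# a trivial baseline that never predicts a default
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()
recall = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()

print(accuracy, recall)
```

The baseline reaches 78% accuracy while catching none of the defaulters, which is exactly the failure mode described above.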

As we saw earlier, the dataset is imbalanced, so to remediate the imbalance we are using SMOTE (Synthetic Minority Oversampling Technique).

SMOTE is one of the most commonly used oversampling methods for the imbalance problem. It balances the class distribution by adding synthetic minority class examples: instead of replicating existing rows, it creates each new example by interpolating between a minority sample and one of its nearest minority-class neighbours.
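The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, not the `imblearn` implementation; the function name `smote_like_sample` and the toy points are hypothetical.

```python
import numpy as np

def smote_like_sample(minority, k=3, n_new=5, seed=0):
    """Create n_new synthetic points between minority samples and their neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))                 # pick a minority sample
        dists = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]         # k nearest, skipping the point itself
        j = rng.choice(neighbours)                      # pick one neighbour
        gap = rng.random()                              # position along the connecting segment
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

# hypothetical minority-class points in two dimensions
minority_points = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0], [1.5, 1.5]])
new_points = smote_like_sample(minority_points)
print(new_points.shape)
```

Every synthetic point lies on a line segment between two existing minority points, so the new rows stay inside the region the minority class already occupies.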

##  ML Model Implementation

In [None]:
# importing all the evaluation metrics that we will need for comparison.
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.metrics import roc_auc_score, confusion_matrix, roc_curve, auc, classification_report

## 1. Logistic Regression

In [None]:
# Importing Logistic Regression and GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

In [None]:
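# initiate the model; liblinear is used because the default lbfgs solver does not support the 'l1' penalty in the grid below.
logistic_model = LogisticRegression(class_weight='balanced', solver='liblinear')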

# define the parameter grid.
param_grid = {'penalty':['l1','l2'], 'C' : [0.0001,0.001,0.003,0.004,0.005, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 10, 20, 50, 100] }

# implementing the model.
logistic_model= GridSearchCV(logistic_model, param_grid, scoring = 'accuracy', n_jobs = -1, verbose = 3, cv = 3)
logistic_model.fit(x_train_smote, y_train_smote)

In [None]:
# getting the best estimator
logistic_model.best_estimator_

In [None]:
# getting the optimal parameters
logistic_model.best_params_ 

In [None]:
# getting the predicted probability of target variable.
y_train_preds_logistic = logistic_model.predict_proba(x_train_smote)[:,1]
y_test_preds_logistic = logistic_model.predict_proba(x_test)[:,1]

In [None]:
# getting the predicted class
y_train_class_preds_logistic = logistic_model.predict(x_train_smote)
y_test_class_preds_logistic = logistic_model.predict(x_test)

In [None]:
# checking the accuracy on training and unseen test data.
logistic_train_accuracy= accuracy_score(y_train_smote, y_train_class_preds_logistic)
logistic_test_accuracy= accuracy_score(y_test, y_test_class_preds_logistic)

print("The accuracy on train data is ", logistic_train_accuracy)
print("The accuracy on test data is ", logistic_test_accuracy)

In [None]:
# writing a function for evaluating various metrics
def evaluation_metrics(actual, predicted):

  """ Return the accuracy, precision, recall, F1 score, ROC AUC score and confusion matrix
      for the given actual and predicted class labels, as a list in that order. """
  accuracy = accuracy_score(actual,predicted)
  precision = precision_score(actual, predicted)
  recall = recall_score(actual, predicted)
  model_f1_score = f1_score(actual, predicted)
  # note: computed from hard class labels (not probabilities), so this AUC equals the balanced accuracy.
  auc_roc_score = roc_auc_score(actual , predicted)
  model_confusion_matrix = confusion_matrix(actual , predicted)

  metrics_list = [accuracy,precision,recall,model_f1_score,auc_roc_score, model_confusion_matrix]
  return metrics_list

In [None]:
evaluation_metrics(y_test, y_test_class_preds_logistic)

In [None]:
# Let's store these metrics in a dataframe. that way we can easily compare metrics of different models.
# first store this data in a dict.
metric_name_list = ['accuracy','precision','recall','f1_score','roc_auc_score','confusion_matrix']
metric_values = evaluation_metrics(y_test, y_test_class_preds_logistic)

# zipping together above lists to form a dictionary
metric_dict = dict(zip(metric_name_list,metric_values))

# creating a dataframe out of this. 
evaluation_metric_df = pd.DataFrame.from_dict(metric_dict, orient='index').reset_index()
evaluation_metric_df.columns = ['Evaluation Metric','Logistic Regression']

In [None]:
evaluation_metric_df

In [None]:
# Plotting the confusion matrix from test data

labels = ['Non Defaulter', 'Defaulter']
cm = confusion_matrix(y_test,y_test_class_preds_logistic)
ax= plt.subplot()
sns.heatmap(cm, annot=True, cmap='coolwarm', ax = ax, lw = 3) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('Actual labels')
ax.set_title('Confusion Matrix of Logistic Regression from testing data')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

# also printing confusion matrix values
print(cm)

In [None]:
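# Plotting Roc_auc_curve for test data
y_test_pred_logistic = logistic_model.predict_proba(x_test)[:,1]
fpr, tpr, _ = roc_curve(y_test,y_test_pred_logistic)
roc_auc_logistic = roc_auc_score(y_test, y_test_pred_logistic)
# the curve needs a label, otherwise plt.legend() has nothing to show
plt.plot(fpr,tpr,label="Logistic Regression, auc="+str(roc_auc_logistic))
plt.title("Roc_auc_curve on Test data")
plt.legend(loc=4)
plt.show()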

In [None]:
# printing the classification report.
print('classification_report is \n {}'.format(classification_report(y_test, y_test_class_preds_logistic)))

In [None]:
evaluation_metric_df

####  Explain the ML Model used and its performance using Evaluation metric Score Chart.

The ML model used in this case is Logistic Regression, which is a popular supervised learning algorithm used for binary classification tasks. It models the relationship between the input features and the probability of a binary outcome.

Logistic regression is an excellent tool for classification problems, which are problems where the output value that we wish to predict takes on only a small number of discrete values. Here we focus on the binary classification problem, where the output can take on only two distinct classes.
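The link between features and probability is the sigmoid function applied to a weighted sum of the inputs. The sketch below uses made-up coefficients and two made-up customers purely to show the mechanics; none of the numbers come from the fitted model.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical coefficients and intercept
weights = np.array([0.8, -0.5])
bias = -0.2

# two hypothetical (already scaled) customers
features = np.array([[1.5, 0.3],
                     [-1.0, 2.0]])

probabilities = sigmoid(features @ weights + bias)      # probability of class 1
predicted_class = (probabilities >= 0.5).astype(int)    # default 0.5 threshold

print(probabilities.round(3), predicted_class)
```

Lowering the 0.5 threshold would flag more customers as defaulters, trading precision for recall.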

The Logistic Regression model achieved an accuracy of 0.74943, indicating a reasonably good overall performance. However, the precision and recall scores are relatively low, suggesting that the model struggles to correctly identify positive instances. The F1-score, which combines precision and recall, also falls in the moderate range. The AUC-ROC score indicates a decent discrimination capability. The confusion matrix provides a detailed breakdown of the model's predictions, showing its strengths and weaknesses in differentiating between positive and negative instances.

## Which hyperparameter optimization technique have you used and why?

I have used Grid Search CV. In grid search, we try every combination of a preset list of hyperparameter values and choose the best combination based on the cross-validation score.
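As a compact illustration of "every combination", the sketch below runs `GridSearchCV` on a small synthetic dataset. The data from `make_classification` and the shortened grid are assumptions made for the example, and `solver="liblinear"` is chosen because it supports both the `l1` and `l2` penalties.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# hypothetical synthetic data standing in for the real training set
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# 2 penalties x 4 values of C = 8 candidate settings
grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(LogisticRegression(solver="liblinear"), grid,
                      scoring="accuracy", cv=3)
search.fit(X_demo, y_demo)

print(len(search.cv_results_["params"]))   # number of combinations tried
print(search.best_params_)
```

Each of the 8 candidates is fitted once per fold, so the cost grows with the product of the grid sizes and the number of folds. The `scoring` argument can be set to another metric, for example `"recall"`, if that is the metric the search should optimise.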



## 2. Random Forest Classifier



In [None]:
# Importing Random forest 
from sklearn.ensemble import RandomForestClassifier

In [None]:
model_rf= RandomForestClassifier()                                                              # initializing the model.

grid_values = {'n_estimators':[50,80,90,100], 'max_depth':[9,11,14]}              # initializing the parameter grid.
grid_rf = GridSearchCV(model_rf, param_grid = grid_values, scoring = 'accuracy', cv=3)

# Fitting the model.
grid_rf.fit(x_train_smote, y_train_smote)

In [None]:
# getting the best parameter
grid_rf.best_params_ 

In [None]:
# Getting the predicted classes
y_train_class_preds_rf = grid_rf.predict(x_train_smote)
y_test_class_preds_rf = grid_rf.predict(x_test)

In [None]:
# Getting the evaluation metrics using our function and adding it to evaluation dataframe to better read it.
evaluation_metric_df['Random Forest']=evaluation_metrics(y_test,y_test_class_preds_rf)
evaluation_metric_df

In [None]:
# Plotting the confusion matrix from test data

labels = ['Non Defaulter', 'Defaulter']
cm = confusion_matrix(y_test,y_test_class_preds_rf)
ax= plt.subplot()
sns.heatmap(cm, annot=True, cmap='coolwarm', ax = ax, lw = 3) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('Actual labels')
ax.set_title('Confusion Matrix of Random Forest from testing data')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

# also printing confusion matrix values
print(cm)

In [None]:
print('classification_report is \n {}'.format(classification_report(y_test, y_test_class_preds_rf)))

In [None]:
# Printing Roc_auc_curve from test data

y_test_preds_proba_rf = grid_rf.predict_proba(x_test)[::,1]
fpr, tpr, _ = roc_curve(y_test,  y_test_preds_proba_rf)
# stored under its own name so the auc() function imported from sklearn.metrics is not shadowed
roc_auc_rf = roc_auc_score(y_test,  y_test_preds_proba_rf)
plt.plot(fpr,tpr,label="Random Forest, auc="+str(roc_auc_rf))
plt.title("Roc_auc_curve on testing data")
plt.legend(loc=4)
plt.show()

Random Forest model has inbuilt support for showing the feature importances - i.e. which feature is more important in coming up with the predicted results. This helps us interpret and understand the model better.

In [None]:
# getting columns names from training data
features = x_train_smote.columns

# getting the feature importances
importances = grid_rf.best_estimator_.feature_importances_
indices = np.argsort(importances)

In [None]:
# plotting the feature importances using a horizontal bar graph.
plt.figure (figsize= (12,12))
plt.title('Relative Feature Importance', fontsize=14)
plt.barh(range(len(indices)), importances[indices], color='magenta', edgecolor='mediumblue', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.ylabel('Features', fontsize=14)
plt.show()

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

The ML model used in this case is Random Forest, which is an ensemble learning method that combines multiple decision trees to make predictions. Random Forest is a versatile and powerful algorithm that can be used for both classification and regression tasks.
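The "combining" step can be pictured as a vote across trees. The sketch below uses a hypothetical table of 0/1 predictions from five trees for four customers. It illustrates simple majority voting only; scikit-learn's `RandomForestClassifier` averages the trees' class probabilities instead.

```python
import numpy as np

# rows = trees, columns = customers; hypothetical 0/1 predictions
tree_votes = np.array([
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
])

# share of trees that predict "defaulter" for each customer
default_share = tree_votes.mean(axis=0)

# the forest predicts the class chosen by the majority of trees
forest_prediction = (default_share > 0.5).astype(int)

print(default_share, forest_prediction)
```

Because each tree is trained on a different bootstrap sample and feature subset, individual errors tend to differ, and the vote smooths them out.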

The Random Forest model achieved a high accuracy of 0.867748, indicating a strong overall performance. The precision and recall scores are also relatively high, indicating the model's ability to correctly identify positive instances while minimizing false positives. The F1-score provides a balanced assessment of precision and recall. The AUC-ROC score of 0.828852 suggests a good discrimination capability of the model. The confusion matrix provides a detailed breakdown of the model's predictions, showing its strengths and weaknesses in differentiating between positive and negative instances.

## Which hyperparameter optimization technique have you used and why?

I have used Grid Search CV. In grid search, we try every combination of a preset list of hyperparameter values and choose the best combination based on the cross-validation score.



## 3. K-Nearest Neighbour Classifier

In [None]:
# Import K Nearest Neighbour Classifier
from sklearn.neighbors import KNeighborsClassifier

In [None]:
# initializing the model
knn = KNeighborsClassifier()

# for KNN, the parameter to be tuned is n_neighbors
param_grid = {'n_neighbors':[4,5,6,7,8,10,12,14]}

# Fitting the model

knn_cv= GridSearchCV(knn,param_grid, scoring = 'accuracy',cv=3)
knn_cv.fit(x_train_smote,y_train_smote)

In [None]:
# find best score 
knn_cv.best_score_

In [None]:
# best parameters
knn_cv.best_params_

In [None]:
knn_cv.best_estimator_

In [None]:
# Get the predicted classes
y_train_class_preds_knn = knn_cv.predict(x_train_smote)
y_test_class_preds_knn = knn_cv.predict(x_test)

In [None]:
# getting the evaluation metrics and adding it to metric dataframe. 
evaluation_metric_df['KNeighborsClassifier'] = evaluation_metrics(y_test,y_test_class_preds_knn)
evaluation_metric_df

In [None]:
# Printing the classification report.
print('classification_report is \n {}'.format(classification_report(y_test, y_test_class_preds_knn)))

In [None]:
# Plotting the confusion matrix for testing data 
labels = ['Not Defaulter', 'Defaulter']
cm = confusion_matrix(y_test,y_test_class_preds_knn)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, linewidths=1, cmap='coolwarm',ax = ax) 

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix of KNN Classifier for testing data')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
# Printing Roc_auc_curve from test data

y_test_preds_proba_knn = knn_cv.predict_proba(x_test)[::,1]
fpr, tpr, _ = roc_curve(y_test,  y_test_preds_proba_knn)
# AUC is computed from the KNN probabilities (not the Random Forest ones)
roc_auc_knn = roc_auc_score(y_test,  y_test_preds_proba_knn)
plt.plot(fpr,tpr,label="KNN, auc="+str(roc_auc_knn))
plt.title("Roc_auc_curve on testing data")
plt.legend(loc=4)
plt.show()

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

The k-nearest neighbors algorithm, also known as KNN or k-NN, is a non-parametric, supervised learning classifier, which uses proximity to make classifications or predictions about the grouping of an individual data point. While it can be used for either regression or classification problems, it is typically used as a classification algorithm, working off the assumption that similar points can be found near one another.
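The proximity idea fits in a few lines. This is a minimal sketch with Euclidean distance and a plain majority vote on hypothetical two-dimensional points; `knn_predict` is an illustrative helper, not the scikit-learn estimator used above.

```python
import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    dists = np.linalg.norm(train_x - query, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                   # indices of the k closest points
    return Counter(train_y[nearest].tolist()).most_common(1)[0][0]

# hypothetical training points: class 0 near the origin, class 1 near (2, 2)
train_x = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                    [2.0, 2.0], [2.1, 1.9], [1.8, 2.2]])
train_y = np.array([0, 0, 0, 1, 1, 1])

label_near_origin = knn_predict(train_x, train_y, np.array([0.1, 0.1]))
label_near_two = knn_predict(train_x, train_y, np.array([1.9, 2.0]))
print(label_near_origin, label_near_two)
```

Since the prediction depends entirely on distances, features must be on comparable scales, which is one reason the data was standardized before modelling.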

The K-Nearest Neighbors (KNN) Classifier model achieved a good accuracy of 0.85964, indicating a relatively strong overall performance. The precision and recall scores are reasonably good, indicating the model's ability to correctly identify positive instances while minimizing false positives. The F1-score provides a balanced assessment of precision and recall. The AUC-ROC score of 0.854637 suggests a good discrimination capability of the model. The confusion matrix provides a detailed breakdown of the model's predictions, showing its strengths and weaknesses in differentiating between positive and negative instances.

## Which hyperparameter optimization technique have you used and why?

I have used Grid Search CV. In grid search, we try every combination of a preset list of hyperparameter values and choose the best combination based on the cross-validation score.



## 4. Support Vector Classifier

In [None]:
# Importing support vector machine algorithm from sklearn
from sklearn import svm
 
# initiate a svm Classifier
svm_model = svm.SVC(kernel = 'poly',gamma='scale', probability=True)

# fit the model using the training sets
svm_model.fit(x_train_smote, y_train_smote)

In [None]:
# Get the predicted classes
y_train_class_preds_svm = svm_model.predict(x_train_smote)
y_test_class_preds_svm = svm_model.predict(x_test)

In [None]:
evaluation_metric_df['Support Vector classifier'] = evaluation_metrics(y_test,y_test_class_preds_svm)
evaluation_metric_df

In [None]:
# Plotting the confusion matrix for testing data 
labels = ['Not Defaulter', 'Defaulter']
cm = confusion_matrix(y_test,y_test_class_preds_svm)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, linewidths=1, cmap='coolwarm',ax = ax) 

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix of SVM Classifier for testing data')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
# Printing the classification report for the SVM predictions.
print('classification_report is \n {}'.format(classification_report(y_test, y_test_class_preds_svm)))

In [None]:
# Roc_auc_curve on training data

y_train_preds_proba_svm = svm_model.predict_proba(x_train_smote)[::,1]
fpr, tpr, _ = roc_curve(y_train_smote,  y_train_preds_proba_svm )
auc = roc_auc_score(y_train_smote,  y_train_preds_proba_svm )
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.title("Roc_auc_curve on Training data")
plt.legend(loc=4)
plt.show()

In [None]:
# finally, we can compare our models on various evaluation metric values.
evaluation_metric_df

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Support Vector Machines (SVMs for short) are powerful machine learning algorithms used for classification, regression and outlier detection. An SVM classifier builds a model that assigns new data points to one of the given categories. Thus, it can be viewed as a non-probabilistic binary linear classifier.

In addition to performing linear classification, SVMs can efficiently perform non-linear classification using the kernel trick, which enables us to implicitly map the inputs into high-dimensional feature spaces.
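The kernel trick can be checked numerically on a small case. The sketch below uses a homogeneous polynomial kernel of degree 2 on two-dimensional inputs, chosen only because its explicit feature map is short to write down; it is not the exact kernel configuration of the `SVC` above. The vectors and function names are hypothetical.

```python
import numpy as np

def explicit_map(v):
    """Explicit degree-2 feature map for a 2-D vector."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(a, b, degree=2):
    """Homogeneous polynomial kernel: (a . b) ** degree."""
    return (a @ b) ** degree

# two hypothetical input vectors
a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])

kernel_value = poly_kernel(a, b)                     # computed in the original 2-D space
mapped_value = explicit_map(a) @ explicit_map(b)     # computed in the mapped 3-D space

print(kernel_value, mapped_value)
```

Both values agree, so the kernel gives the inner product in the higher-dimensional space without ever constructing the mapped vectors, which is what makes non-linear SVMs tractable.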

The Support Vector Classifier (SVC) model achieved a moderate accuracy of 0.777046, indicating a reasonable overall performance. The precision and recall scores are relatively low, suggesting that the model struggles to correctly identify positive instances. The F1-score provides a balanced assessment of precision and recall. The AUC-ROC score of 0.717235 suggests a moderate discrimination capability. The confusion matrix provides a detailed breakdown of the model's predictions, showing its strengths and weaknesses in differentiating between positive and negative instances.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

The most important metric for comparing the algorithms in this case is recall. The company cannot afford false negatives, i.e. predicting a defaulter as a non-defaulter: since the company is the one lending money to customers, extending credit to a customer who goes on to default increases the risk of not getting that money back. Hence keeping false negatives low, which is what recall measures, is what matters here.
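For reference, recall and precision follow directly from the confusion-matrix counts. The counts below are hypothetical and only show the arithmetic; they are not results from any of the models above.

```python
# hypothetical confusion-matrix counts
tn, fp, fn, tp = 50, 10, 8, 32

# recall: share of actual defaulters that the model catches
recall = tp / (tp + fn)

# precision: share of predicted defaulters that really default
precision = tp / (tp + fp)

print(round(recall, 3), round(precision, 3))
```

Every false negative (`fn`) lowers recall, so a model chosen for high recall is one that misses few defaulters, possibly at the cost of more false alarms.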

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I will choose K-Nearest Neighbors as my final prediction model because its recall is approximately 84%, which is higher than that of the other models.

# **Conclusion**

**After conducting this thorough exercise, we found that :**

- Most of the credit card users are female, and they have a higher number of defaults.

- Most of the credit card users are highly educated.

- Single users have a higher number of credit cards.

- The number of credit card users goes down with increase in age as old people have less consumption and may not be able to use credit cards and their purchases are usually made by younger family members.

- The Logistic Regression classifier achieves an accuracy of approximately 74% and a ROC_AUC score of 0.704.

- The Random Forest Classifier achieves an accuracy of around 86% and a ROC_AUC score of 0.82.

- The K-Neighbor Classifier achieves an accuracy of 85% and a ROC_AUC score of 0.854.

- The Support Vector Machine Classifier achieves an accuracy of 77% and a ROC_AUC score of around 0.712.

- Random Forest Classifier and K Neighbors classifier perform the best among all models.


Our best models are Random Forest and K-Neighbor Classifier, as they have the best Precision, Recall, ROC_AUC and F1 score values. Since this is an imbalanced dataset, Recall is the most important metric because we don't want to classify a defaulter as a non-defaulter, and that makes the K-Neighbor Classifier the more suitable model for the task.