# Credit Risk Analysis

## 1. Problem Statement

<p>Credit risk is the possibility that a client fails to meet contractual obligations such as mortgages, credit card debts, and other types of loans.</p>

## 2. Data Gathering

In [1]:
import numpy as np
import pandas as pd

In [3]:
df = pd.read_csv("../Data/loan.csv")
df

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,LP002978,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
610,LP002979,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y
611,LP002983,Male,Yes,1,Graduate,No,8072,240.0,253.0,360.0,1.0,Urban,Y
612,LP002984,Male,Yes,2,Graduate,No,7583,0.0,187.0,360.0,1.0,Urban,Y


In [5]:
# Find fully duplicated rows (duplicated() already returns a boolean mask)
df[df.duplicated()]

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
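
The check above only flags rows that are identical in every column. A minimal sketch, on a hypothetical toy frame rather than `loan.csv`, of also checking for repeated identifiers with the `subset` argument of `duplicated()`:

```python
import pandas as pd

# Hypothetical toy data: two rows share a Loan_ID but differ in LoanAmount.
toy = pd.DataFrame({
    "Loan_ID": ["LP1", "LP2", "LP2"],
    "LoanAmount": [100.0, 120.0, 130.0],
})

# Rows identical across all columns (none here).
full_row_dupes = toy[toy.duplicated()]

# Rows whose Loan_ID appears more than once; keep=False marks every occurrence.
id_dupes = toy[toy.duplicated(subset=["Loan_ID"], keep=False)]

print(len(full_row_dupes), len(id_dupes))
```

Whether a repeated `Loan_ID` is an error depends on how the data was collected, so this is a check to consider rather than a required step.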


## 3. EDA

<ol>
    <li>Missing value & outlier.</li>
    <li>Datatype.</li>
    <li>Distribution: balanced, imbalanced, skewed.</li>
    <li>Correlation.</li>
</ol>
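
The four checks in this list can be sketched in a few pandas calls. The frame below is a hypothetical stand-in (its values are illustrative and not taken from `loan.csv`), and the choice of `skew()` and `corr()` as the distribution and correlation measures is an assumption:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loan data (illustrative values only).
toy = pd.DataFrame({
    "ApplicantIncome": [2500, 3000, 3200, 4000, 81000],
    "LoanAmount": [100.0, np.nan, 120.0, 130.0, 700.0],
    "Loan_Status": ["Y", "Y", "Y", "N", "Y"],
})
numeric_cols = ["ApplicantIncome", "LoanAmount"]

# 1. Missing values per column.
missing = toy.isna().sum()

# 2. Data types.
dtypes = toy.dtypes

# 3. Distribution: class balance of the target and skewness of numeric features.
class_share = toy["Loan_Status"].value_counts(normalize=True)
skewness = toy[numeric_cols].skew()

# 4. Pairwise correlation between numeric features.
corr = toy[numeric_cols].corr()

print(missing, dtypes, class_share, skewness, corr, sep="\n\n")
```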

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


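Seven features contain missing values (Gender, Married, Dependents, Self_Employed, LoanAmount, Loan_Amount_Term, and Credit_History), and eight features have the `object` data type, so they must be converted to numeric values before modelling.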

In [7]:
# Total number of missing values present in each feature.
df.isna().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [8]:
# Percentage of missing values in each feature, relative to the total number of rows.
df.isna().sum() / len(df) * 100

Loan_ID              0.000000
Gender               2.117264
Married              0.488599
Dependents           2.442997
Education            0.000000
Self_Employed        5.211726
ApplicantIncome      0.000000
CoapplicantIncome    0.000000
LoanAmount           3.583062
Loan_Amount_Term     2.280130
Credit_History       8.143322
Property_Area        0.000000
Loan_Status          0.000000
dtype: float64

In [9]:
# Equivalent to the cell above: df.shape[0] is the number of rows, the same as len(df)
df.isna().sum() / df.shape[0] * 100

Loan_ID              0.000000
Gender               2.117264
Married              0.488599
Dependents           2.442997
Education            0.000000
Self_Employed        5.211726
ApplicantIncome      0.000000
CoapplicantIncome    0.000000
LoanAmount           3.583062
Loan_Amount_Term     2.280130
Credit_History       8.143322
Property_Area        0.000000
Loan_Status          0.000000
dtype: float64

In [18]:
null_percent_df = pd.DataFrame(df.isna().sum() / df.shape[0] * 100)
null_percent_df

Unnamed: 0,0
Loan_ID,0.0
Gender,2.117264
Married,0.488599
Dependents,2.442997
Education,0.0
Self_Employed,5.211726
ApplicantIncome,0.0
CoapplicantIncome,0.0
LoanAmount,3.583062
Loan_Amount_Term,2.28013
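
The counts and percentages computed in the cells above can also be kept in one labelled table. A minimal sketch on a hypothetical toy frame; the column names `missing_count` and `missing_percent` are illustrative choices, not names from the notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries.
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [128.0, np.nan, np.nan, 66.0],
    "Education": ["Graduate"] * 4,
})

# isna().mean() is the fraction of missing rows, i.e. isna().sum() / len(toy).
null_summary = pd.DataFrame({
    "missing_count": toy.isna().sum(),
    "missing_percent": toy.isna().mean() * 100,
}).sort_values("missing_percent", ascending=False)

print(null_summary)
```

Sorting by percentage puts the features that need the most attention at the top.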


### Imputation: Handling Missing Values

In [20]:
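# Assign the result back instead of using inplace=True on a column selection,
# which recent pandas versions warn about and which may not update df.
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])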

In [22]:
# By default, describe() summarizes only the numeric features
df.describe()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
count,614.0,614.0,592.0,600.0,564.0
mean,5403.459283,1621.245798,146.412162,342.0,0.842199
std,6109.041673,2926.248369,85.587325,65.12041,0.364878
min,150.0,0.0,9.0,12.0,0.0
25%,2877.5,0.0,100.0,360.0,1.0
50%,3812.5,1188.5,128.0,360.0,1.0
75%,5795.0,2297.25,168.0,360.0,1.0
max,81000.0,41667.0,700.0,480.0,1.0


In [24]:
# The mode of Gender is 'Male'; this is the value that replaced its missing entries
v = df['Gender'].mode()[0]
v

'Male'

In [21]:
# include='all' summarizes every feature, categorical as well as numeric
df.describe(include='all')

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
count,614,614,611,599.0,614,582,614.0,614.0,592.0,600.0,564.0,614,614
unique,614,2,2,4.0,2,2,,,,,,3,2
top,LP001002,Male,Yes,0.0,Graduate,No,,,,,,Semiurban,Y
freq,1,502,398,345.0,480,500,,,,,,233,422
mean,,,,,,,5403.459283,1621.245798,146.412162,342.0,0.842199,,
std,,,,,,,6109.041673,2926.248369,85.587325,65.12041,0.364878,,
min,,,,,,,150.0,0.0,9.0,12.0,0.0,,
25%,,,,,,,2877.5,0.0,100.0,360.0,1.0,,
50%,,,,,,,3812.5,1188.5,128.0,360.0,1.0,,
75%,,,,,,,5795.0,2297.25,168.0,360.0,1.0,,


In [25]:
df['Married'] = df['Married'].fillna(df['Married'].mode()[0])

In [26]:
df['Dependents'] = df['Dependents'].fillna(df['Dependents'].mode()[0])

In [27]:
df['Self_Employed'] = df['Self_Employed'].fillna(df['Self_Employed'].mode()[0])

In [28]:
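# LoanAmount is continuous, so it is imputed with the mean rather than the mode:
# the mode of a continuous feature is just its most frequent value and can be unrepresentative.
# Note: the mean (146.41) sits above the median (128) here, so the median is a more robust alternative.
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())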
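
The `describe()` output earlier shows a LoanAmount mean of 146.41 against a median of 128 and a maximum of 700, which suggests a right-skewed feature. A minimal sketch, using hypothetical values rather than the real column, of how the mean and median differ as fill values in that situation:

```python
import pandas as pd

# Hypothetical right-skewed loan amounts with one missing entry.
amounts = pd.Series([100.0, 120.0, 130.0, 700.0, None], name="LoanAmount")

mean_value = amounts.mean()      # pulled upward by the single large value
median_value = amounts.median()  # unaffected by how large the maximum is

filled_with_mean = amounts.fillna(mean_value)
filled_with_median = amounts.fillna(median_value)

print(mean_value, median_value)
```

Either choice removes the missing value; which is preferable depends on how much the large amounts should influence the filled entries.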

In [29]:
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mode()[0])

In [30]:
df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].mode()[0])
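
The imputation cells above follow one pattern: the mode for categorical or discrete features and the mean for the continuous LoanAmount. A minimal sketch of that pattern as a loop, on a hypothetical toy frame; the lists `mode_cols` and `mean_cols` are illustrative names:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing entry per column.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", None, "Female"],
    "Credit_History": [1.0, np.nan, 1.0, 0.0],
    "LoanAmount": [100.0, np.nan, 200.0, 150.0],
})

mode_cols = ["Gender", "Credit_History"]  # categorical / discrete features
mean_cols = ["LoanAmount"]                # continuous features

# Assign the filled column back rather than filling in place.
for col in mode_cols:
    toy[col] = toy[col].fillna(toy[col].mode()[0])
for col in mean_cols:
    toy[col] = toy[col].fillna(toy[col].mean())

print(toy)
```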

### Changing Data Type

In [31]:
df['Gender'] = df['Gender'].map({"Male" : 1, "Female" : 0})

In [32]:
df['Married'].unique()

array(['No', 'Yes'], dtype=object)

In [33]:
df['Married'] = df['Married'].map({"Yes" : 1, "No" : 0})

In [34]:
df['Dependents'].unique()

array(['0', '1', '2', '3+'], dtype=object)

'Dependents' has four categories, including the non-numeric value '3+', so it is label-encoded in the Label Encoding section rather than mapped by hand.

In [35]:
df['Education'].unique()

array(['Graduate', 'Not Graduate'], dtype=object)

In [36]:
df['Education'] = df['Education'].map({"Graduate" : 1, "Not Graduate" : 0})

In [37]:
df['Self_Employed'].unique()

array(['No', 'Yes'], dtype=object)

In [38]:
df['Self_Employed'] = df['Self_Employed'].map({"Yes" : 1, "No" : 0})

In [39]:
df['Property_Area'].unique()

array(['Urban', 'Rural', 'Semiurban'], dtype=object)

In [40]:
df['Loan_Status'] = df['Loan_Status'].map({"Y" : 1, "N" : 0})
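
The two-category conversions in this section can be driven from a single dictionary of mappings. A minimal sketch on a hypothetical toy frame; `binary_maps` is an illustrative name, and the check for unmapped values is an added safeguard that the notebook itself does not perform:

```python
import pandas as pd

# Hypothetical toy data with two binary text features.
toy = pd.DataFrame({
    "Married": ["Yes", "No", "Yes"],
    "Loan_Status": ["Y", "N", "Y"],
})

binary_maps = {
    "Married": {"Yes": 1, "No": 0},
    "Loan_Status": {"Y": 1, "N": 0},
}

for col, mapping in binary_maps.items():
    encoded = toy[col].map(mapping)
    # map() yields NaN for any value absent from the mapping, so fail loudly.
    if encoded.isna().any():
        raise ValueError(f"Unmapped values in {col!r}")
    toy[col] = encoded.astype("int64")

print(toy.dtypes)
```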

In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             614 non-null    int64  
 2   Married            614 non-null    int64  
 3   Dependents         614 non-null    object 
 4   Education          614 non-null    int64  
 5   Self_Employed      614 non-null    int64  
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         614 non-null    float64
 9   Loan_Amount_Term   614 non-null    float64
 10  Credit_History     614 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    int64  
dtypes: float64(4), int64(6), object(3)
memory usage: 62.5+ KB


In [42]:
df.isna().sum() / len(df) * 100

Loan_ID              0.0
Gender               0.0
Married              0.0
Dependents           0.0
Education            0.0
Self_Employed        0.0
ApplicantIncome      0.0
CoapplicantIncome    0.0
LoanAmount           0.0
Loan_Amount_Term     0.0
Credit_History       0.0
Property_Area        0.0
Loan_Status          0.0
dtype: float64

### Label Encoding

In [43]:
from sklearn.preprocessing import LabelEncoder

In [44]:
le = LabelEncoder()

In [45]:
df['Dependents'] = le.fit_transform(df['Dependents'])

In [46]:
df['Property_Area'] = le.fit_transform(df['Property_Area'])
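
`LabelEncoder` assigns integer codes in sorted order of the class labels, and refitting the same encoder on a second column overwrites the classes learned from the first. A minimal sketch on a hypothetical toy frame that keeps one encoder per column so the codes can be inspected or reversed later; the `encoders` dictionary is an illustrative addition:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data using the category values seen in the notebook.
toy = pd.DataFrame({
    "Dependents": ["0", "3+", "1", "2"],
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"],
})

encoders = {}
for col in ["Dependents", "Property_Area"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc  # keep the fitted encoder for later decoding

# classes_ holds the labels in code order: index 0 is code 0, and so on.
area_mapping = {
    str(label): code
    for code, label in enumerate(encoders["Property_Area"].classes_)
}
decoded_area = list(encoders["Property_Area"].inverse_transform(toy["Property_Area"]))

print(area_mapping)
```

Because the codes are ordered integers, a model may read `Property_Area` as if Rural < Semiurban < Urban. One-hot encoding is a common alternative for such unordered categories.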

All features except the identifier `Loan_ID` are now numeric.

In [47]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             614 non-null    int64  
 2   Married            614 non-null    int64  
 3   Dependents         614 non-null    int32  
 4   Education          614 non-null    int64  
 5   Self_Employed      614 non-null    int64  
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         614 non-null    float64
 9   Loan_Amount_Term   614 non-null    float64
 10  Credit_History     614 non-null    float64
 11  Property_Area      614 non-null    int32  
 12  Loan_Status        614 non-null    int64  
dtypes: float64(4), int32(2), int64(6), object(1)
memory usage: 57.7+ KB


In [48]:
# seaborn must be installed in the active environment first, e.g. with `pip install seaborn`
import seaborn as sns

ModuleNotFoundError: No module named 'seaborn'