# Case study on Credit Risk

# Context: 
Credit risk is the risk that a borrower defaults on the payments of a loan. In the banking sector, it is an important factor to
consider before approving an applicant's loan. Dream Housing Finance company deals in all kinds of home loans and has a presence
across urban, semi-urban and rural areas. A customer first applies for a home loan, after which the company validates the
customer's eligibility for the loan.

# Objective:
The company wants to automate the loan eligibility process (in real time) based on the customer details provided in the online
application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History
and others. To automate this process, the company has posed the problem of identifying the customer segments that are eligible for
a loan, so that it can specifically target these customers. Only a partial data set has been provided.

# Dataset:

| Variable | Description |
|---|---|
| Loan_ID | Unique loan ID |
| Gender | Male / Female |
| Married | Applicant married (Y/N) |
| Dependents | Number of dependents |
| Education | Applicant education (Graduate / Not Graduate) |
| Self_Employed | Self-employed (Y/N) |
| ApplicantIncome | Applicant income |
| CoapplicantIncome | Co-applicant income |
| LoanAmount | Loan amount in thousands |
| Loan_Amount_Term | Term of loan in months |
| Credit_History | Credit history meets guidelines |
| Property_Area | Urban / Semi Urban / Rural |
| Loan_Status | Loan approved (Y/N) |

In [None]:
# To enable plotting graphs in Jupyter notebook
%matplotlib inline

In [91]:
import pandas as pd
from sklearn.linear_model import LogisticRegression

# importing plotting libraries
import matplotlib.pyplot as plt

# importing seaborn for statistical plots
import seaborn as sns

# Split the X and y data into a training set and a test set using
# scikit-learn's train_test_split, which shuffles the rows pseudo-randomly

from sklearn.model_selection import train_test_split

import numpy as np
import os,sys
from scipy import stats

# calculate accuracy measures and confusion matrix
from sklearn import metrics

In [92]:
df = pd.read_csv('CreditRisk.csv')
df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [98]:
cr_df = df.drop('Loan_ID', axis=1)  # drop Loan_ID: it is unique per row, so it carries no predictive signal
cr_df.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [125]:
# Integer (label) encoding of the categorical variables; note that this is not one-hot encoding
cr_df['Property_Area'] = cr_df['Property_Area'].replace( {"Rural" : 1, 'Urban' :2, 'Semiurban' :3})
#cr_df['Self_Employed'] = cr_df['Self_Employed'].replace( {"Yes" : 0, 'No' :1})
#cr_df['Married'] = cr_df['Married'].replace( {"Yes" : 00, 'No' :11})
cr_df['Dependents'] = cr_df['Dependents'].replace( {"3+" : 3})
#cr_df['Education'] = cr_df['Education'].replace({"Graduate" : 8, "Not Graduate" : 9})
#cr_df['Gender'] = cr_df['Gender'].replace({"Male" : 4, "Female" : 5})
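
The cell above maps each category to an integer code, which implies an ordering (Rural < Urban < Semiurban) that the data may not have. For comparison, here is a minimal sketch of one-hot encoding with `pd.get_dummies`. The frame `toy` is a hypothetical stand-in, not the case-study data.

```python
import pandas as pd

# Hypothetical toy frame; the values mirror the Property_Area categories above.
toy = pd.DataFrame({"Property_Area": ["Urban", "Rural", "Semiurban", "Urban"]})

# One indicator column per category, so no artificial ordering is implied.
encoded = pd.get_dummies(toy, columns=["Property_Area"], dtype=int)
print(encoded)
```

Each row has exactly one indicator set to 1. For a linear model such as logistic regression, one of the indicator columns is often dropped (`drop_first=True`) to avoid redundant columns.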

In [126]:
cr_df.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,4,4,0,8,1,5849,0.0,0,360,1,2,Y
1,4,4,1,8,1,4583,1508.0,128,360,1,1,N
2,4,4,0,8,0,3000,0.0,66,360,1,2,Y
3,4,4,0,9,1,2583,2358.0,120,360,1,2,Y
4,4,4,0,8,1,6000,0.0,141,360,1,2,Y


In [128]:
# Replace missing values in every column with the string '0'; numeric columns that had gaps become object dtype (see cr_df.info() below)
cr_df = cr_df.fillna('0')
#cr_df = cr_df.replace({'NaN':df.median()})
cr_df

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,4,4,0,8,1,5849,0.0,0,360,1,2,Y
1,4,4,1,8,1,4583,1508.0,128,360,1,1,N
2,4,4,0,8,0,3000,0.0,66,360,1,2,Y
3,4,4,0,9,1,2583,2358.0,120,360,1,2,Y
4,4,4,0,8,1,6000,0.0,141,360,1,2,Y
5,4,4,2,8,0,5417,4196.0,267,360,1,2,Y
6,4,4,0,9,1,2333,1516.0,95,360,1,2,Y
7,4,4,3,8,1,3036,2504.0,158,360,0,3,N
8,4,4,2,8,1,4006,1526.0,168,360,1,2,Y
9,4,4,1,8,1,12841,10968.0,349,360,1,3,N


In [129]:
# Let's analyze the distribution of the various attributes
cr_df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Education,614.0,8.218241,0.413389,8.0,8.0,8.0,8.0,9.0
ApplicantIncome,614.0,5403.459283,6109.041673,150.0,2877.5,3812.5,5795.0,81000.0
CoapplicantIncome,614.0,1621.245798,2926.248369,0.0,0.0,1188.5,2297.25,41667.0
Property_Area,614.0,2.087948,0.815081,1.0,1.0,2.0,3.0,3.0


In [130]:
# Look at the target column, 'Loan_Status', to see how the rows are distributed across its values
cr_df.groupby(["Loan_Status"]).count()

Loan_Status,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
N,192,192,192,192,192,192,192,192,192,192,192
Y,422,422,422,422,422,422,422,422,422,422,422
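
The counts above show that the classes are imbalanced. A short sketch of what that means for reading an accuracy figure, using only the two counts from the output (the variable names are illustrative):

```python
import pandas as pd

# Class counts copied from the groupby output above.
counts = pd.Series({"N": 192, "Y": 422}, name="Loan_Status")

# Share of each class in the data set.
shares = counts / counts.sum()
print(shares.round(3))

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = shares.max()
print(round(baseline_accuracy, 3))
```

On the full data set, a model that always predicts the majority class would already be right about 69% of the time, so any model accuracy is best judged against that baseline rather than against 50%.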


In [131]:
cr_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 12 columns):
Gender               614 non-null object
Married              614 non-null object
Dependents           614 non-null object
Education            614 non-null int64
Self_Employed        614 non-null object
ApplicantIncome      614 non-null int64
CoapplicantIncome    614 non-null float64
LoanAmount           614 non-null object
Loan_Amount_Term     614 non-null object
Credit_History       614 non-null object
Property_Area        614 non-null int64
Loan_Status          614 non-null object
dtypes: float64(1), int64(3), object(8)
memory usage: 57.6+ KB


In [133]:
array = cr_df.values
X = array[:,0:10] # all rows, first 10 columns (Gender through Credit_History); Property_Area (index 10) is left out
Y = array[:,11]   # all rows, last column: the target Loan_Status ("Y"/"N")
test_size = 0.30 # 70:30 split between training and test sets
seed = 7  # random seed for repeatability of the split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
type(X_train)

numpy.ndarray
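
With imbalanced classes, a plain random split can give the test set a different class mix from the full data. The sketch below shows the optional `stratify` argument of `train_test_split`, which is not used in the cell above. `X_demo` and `y_demo` are hypothetical stand-ins that only reproduce the 192/422 class counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the same class counts as Loan_Status.
y_demo = np.array(["N"] * 192 + ["Y"] * 422)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the Y/N proportions nearly equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=7, stratify=y_demo
)

print(len(y_tr), len(y_te))
print((y_te == "Y").mean())
```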

In [134]:
# Fit the model on the 70% training split and evaluate it on the 30% test split
model = LogisticRegression()
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
model_score = model.score(X_test, y_test)
print(model_score)
print(metrics.confusion_matrix(y_test, y_predict))

0.7891891891891892
[[ 33  28]
 [ 11 113]]


Analyzing the confusion matrix

The rows of the matrix are the actual classes and the columns are the predicted classes, in the order N, Y. Treating an approved loan (Y) as the positive class:

True Positives (TP): we correctly predicted that the loan is approved: 113

True Negatives (TN): we correctly predicted that the loan is not approved: 33

False Positives (FP): we incorrectly predicted that the loan is approved (a "Type I error"): 28

False Negatives (FN): we incorrectly predicted that the loan is not approved (a "Type II error"): 11
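
The printed matrix can be turned into the usual summary metrics. The sketch below assumes scikit-learn's convention (rows are true labels, columns are predictions, labels in sorted order N, Y) and treats Y as the positive class; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix as printed by the model cell: rows = actual, columns = predicted.
cm = np.array([[33, 28],
               [11, 113]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # of predicted approvals, share truly approved
recall = tp / (tp + fn)           # of actual approvals, share found by the model
specificity = tn / (tn + fp)      # of actual rejections, share found by the model

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f}")
```

The accuracy computed this way matches the printed `model_score`. The low specificity shows that the model misses many of the rejected applications, which overall accuracy alone hides.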

In [None]:
Now, you can try dropping some of the less relevant categorical columns and check whether the model accuracy improves.
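
One way to run that experiment is a small helper that refits the model without the chosen columns and reports the test accuracy. This is a sketch under stated assumptions: `holdout_accuracy` is a hypothetical helper, and `demo` is synthetic data with one informative column and one noise column, not the case-study file.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def holdout_accuracy(frame, target, drop_cols=()):
    """Fit a logistic regression without drop_cols and return test accuracy."""
    X = frame.drop(columns=[target, *drop_cols])
    y = frame[target]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.30, random_state=7
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return model.score(X_te, y_te)


# Synthetic stand-in data: the target depends only on Credit_History.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Credit_History": rng.integers(0, 2, size=300),
    "Noise": rng.normal(size=300),
})
demo["Loan_Status"] = np.where(demo["Credit_History"] == 1, "Y", "N")

score_all = holdout_accuracy(demo, "Loan_Status")
score_dropped = holdout_accuracy(demo, "Loan_Status", drop_cols=["Noise"])
print(score_all, score_dropped)
```

Using the same `random_state` for every call keeps the train/test rows identical, so any change in accuracy comes from the dropped columns rather than from a different split.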