I have referred to https://www.kaggle.com/sazid28/home-loan-prediction/notebook for this notebook!

# 1.1 Introduction

* Dream Housing Finance deals in all kinds of home loans.
* It has a presence across urban, semi-urban and rural areas.
* A customer first applies for a home loan, after which the company validates the customer's eligibility for the loan.
* The company wants to automate the loan eligibility process in real time, based on the details customers provide when filling in the online application form.
* These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others.
* To automate this process, the company has posed the problem of identifying the customer segments that are eligible for a loan, so that it can target those customers specifically.

# 1.2 Problem Statement

* This is a standard supervised classification task.
* We have to predict whether a loan will be approved or not.
* In a classification problem, we predict discrete values from a given set of independent variables. The relevant terms are:

1. Supervised: The labels are included in the training data, and the goal is to train a model that learns to predict the labels from the features.

2. Binary classification: We have to predict one of two given classes. For example: predicting a result as win or loss, or, as in this problem, a loan as approved or not approved.

3. Multiclass classification: We have to classify the data into three or more classes. For example: classifying a movie's genre as comedy, action or romance, or classifying fruits as oranges, apples or pears.

# 1.3 About the dataset

Below are some of the factors that I think can affect loan approval (the dependent variable for this loan prediction problem):

* Salary: Applicants with a high income should have a better chance of loan approval.

* Previous history: Applicants who have repaid their previous debts should have a higher chance of loan approval.

* Loan amount: Loan approval should also depend on the loan amount. The smaller the loan amount, the higher the chance of approval should be.

* Loan term: Loans for a shorter period and a smaller amount should have a higher chance of approval.

* EMI (equated monthly instalment): The lower the monthly repayment, the higher the chance of loan approval.

* These are some of the factors that I think can affect the target variable; you can come up with many more.
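
The EMI idea above can be made concrete with a rough proxy: loan amount divided by loan term. The following is a minimal sketch on made-up rows. The values are invented for illustration, the column names mirror the dataset's `LoanAmount` and `Loan_Amount_Term`, and the units in the comments are assumptions. It ignores interest, so it is not a true EMI.

```python
import pandas as pd

# Toy rows, invented for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "LoanAmount": [120.0, 120.0, 300.0],        # assumed to be in thousands
    "Loan_Amount_Term": [360.0, 180.0, 360.0],  # assumed to be in months
})

# Interest-free proxy for the monthly instalment: amount / term.
toy["EMI_proxy"] = toy["LoanAmount"] / toy["Loan_Amount_Term"]
print(toy)
```

For the same amount, halving the term doubles the proxy, which is the kind of effect the EMI hypothesis is about.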

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')
%matplotlib inline

# 1.4 Loading the data

For this practice problem, we have been given two CSV files: train and test.

* The train file will be used for training the model, i.e. our model will learn from this file. It contains all the independent variables and the target variable.

* The test file contains all the independent variables, but not the target variable. We will apply the model to predict the target variable for the test data.
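
A quick way to confirm this relationship between the two files is to compare their column lists. The sketch below uses tiny stand-in frames (`toy_train` and `toy_test` are hypothetical names with invented values) rather than the real CSVs, so it runs on its own.

```python
import pandas as pd

# Tiny stand-ins for the two files; the real ones are loaded with pd.read_csv.
toy_train = pd.DataFrame({
    "Loan_ID": ["A1", "A2"],
    "ApplicantIncome": [5000, 3000],
    "Loan_Status": ["Y", "N"],
})
toy_test = pd.DataFrame({
    "Loan_ID": ["B1"],
    "ApplicantIncome": [4000],
})

# Columns present in train but absent from test: this should be the target only.
only_in_train = [c for c in toy_train.columns if c not in toy_test.columns]
print(only_in_train)
```

The same list comprehension can be applied to the real `train` and `test` frames once they are loaded.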

In [None]:
train = pd.read_csv('/kaggle/input/home-loan/train.csv')
test = pd.read_csv('/kaggle/input/home-loan/test.csv')

In [None]:
train_original = train.copy()
test_original = test.copy()

In [None]:
train.columns

In [None]:
test.columns

In [None]:
train.dtypes

In [None]:
test.dtypes

In [None]:
train.head()

In [None]:
print('Training data shape: ', train.shape)
print('Test data shape: ', test.shape)

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

In [None]:
train["Loan_Status"].value_counts()

In [None]:
# Setting normalize=True returns proportions instead of counts
train["Loan_Status"].value_counts(normalize=True)*100

In [None]:
train["Loan_Status"].value_counts(normalize=True).plot.bar(title = 'Loan_Status')

Loans were approved for 422 of the 614 applicants (around 69%).
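
This class balance also gives a useful reference point: a model that always predicts "approved" would already be right about 69% of the time on the training data. The sketch below rebuilds a label series from the two counts quoted above (it does not read the real file) and computes that majority-class baseline.

```python
import pandas as pd

# Rebuild the class counts quoted above: 422 approved out of 614.
status = pd.Series(["Y"] * 422 + ["N"] * (614 - 422))

shares = status.value_counts(normalize=True)
# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = shares.max()
print(round(baseline_accuracy * 100, 1))
```

Any model trained later should be judged against this baseline rather than against 50%.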

Now let's visualize each variable separately. The variables fall into three types: categorical, ordinal and numerical.

Categorical features: These features take a fixed set of categories (Gender, Married, Self_Employed, Credit_History, Loan_Status).

In [None]:
train["Gender"].value_counts()

In [None]:
train['Gender'].value_counts(normalize=True)*100

In [None]:
train["Gender"].value_counts(normalize=True).plot.bar(title = 'Gender')

In [None]:
train['Married'].value_counts(normalize=True)*100

In [None]:
train['Married'].value_counts(normalize=True).plot.bar(title= 'Married')

From the graph we see that:

Married applicants: about 65%

Unmarried applicants: about 35%

In [None]:
train["Self_Employed"].count()

In [None]:
train["Self_Employed"].value_counts()

In [None]:
train['Self_Employed'].value_counts(normalize=True)*100

In [None]:
train['Self_Employed'].value_counts(normalize=True).plot.bar(title='Self_Employed')

In [None]:
train['Dependents'].value_counts(normalize=True).plot.bar(title="Dependents")

In [None]:
plt.figure(1)
plt.subplot(121)
sns.distplot(train["ApplicantIncome"]);

plt.subplot(122)
train["ApplicantIncome"].plot.box(figsize=(16,5))
plt.show()

It can be inferred that most of the applicant income values are concentrated on the left, with a long tail to the right, which means the distribution is right-skewed rather than normal. We will try to make it closer to normal in later sections, as many algorithms tend to work better when the data is approximately normally distributed.
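
To illustrate why a log transform is a common choice for this kind of skew, here is a minimal sketch on synthetic data. The `income` values are drawn from a log-normal distribution with arbitrarily chosen parameters; they are not the real `ApplicantIncome` column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "income" values; NOT the real ApplicantIncome column.
income = pd.Series(rng.lognormal(mean=8.5, sigma=0.6, size=1000))

# Sample skewness before and after taking the natural log.
skew_raw = income.skew()
skew_log = np.log(income).skew()
print(f"skew before log: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

A skewness near zero after the transform indicates a roughly symmetric distribution. The log only works for strictly positive values, which is worth checking on real data first.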

The boxplot confirms the presence of many outliers (extreme values). This may be attributed to income disparity in society.
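
The points a box plot draws as outliers are, by default in matplotlib, those lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule by hand to a few made-up income values (invented for illustration, not taken from the dataset).

```python
import pandas as pd

# Made-up incomes with one extreme value, for illustration only.
incomes = pd.Series([2500, 3000, 3200, 3800, 4100, 4500, 5200, 81000])

q1, q3 = incomes.quantile(0.25), incomes.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the ones a default box plot marks as outliers.
outliers = incomes[(incomes < lower_fence) | (incomes > upper_fence)]
print(outliers.tolist())
```

The same few lines can be pointed at `train["ApplicantIncome"]` to count how many applicants fall outside the fences.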

Part of this may be driven by the fact that we are looking at people with different education levels. Let us segregate them by Education:

In [None]:
train.boxplot(column='ApplicantIncome',by="Education" )
plt.suptitle(" ")
plt.show()

In [None]:
plt.figure(1)
plt.subplot(121)
sns.distplot(train["CoapplicantIncome"]);

plt.subplot(122)
train["CoapplicantIncome"].plot.box(figsize=(16,5))
plt.show()

We see a distribution similar to that of applicant income. The majority of coapplicant incomes range from 0 to 5000. There are also many outliers in coapplicant income, and it is not normally distributed.

In [None]:
print(pd.crosstab(train["Gender"],train["Loan_Status"]))
Gender = pd.crosstab(train["Gender"],train["Loan_Status"])
Gender.div(Gender.sum(1).astype(float),axis=0).plot(kind="bar",stacked=True,figsize=(4,4))
plt.xlabel("Gender")
plt.ylabel("Percentage")
plt.show()

In [None]:
print(pd.crosstab(train["Married"],train["Loan_Status"]))
Married=pd.crosstab(train["Married"],train["Loan_Status"])
Married.div(Married.sum(1).astype(float),axis=0).plot(kind="bar",stacked=True,figsize=(4,4))
plt.xlabel("Married")
plt.ylabel("Percentage")
plt.show()

In [None]:
train["TotalIncome"]=train["ApplicantIncome"]+train["CoapplicantIncome"]

In [None]:
bins =[0,2500,4000,6000,81000]
group=['Low','Average','High','Very High']
train["TotalIncome_bin"]=pd.cut(train["TotalIncome"],bins,labels=group)

In [None]:
print(pd.crosstab(train["TotalIncome_bin"],train["Loan_Status"]))
TotalIncome = pd.crosstab(train["TotalIncome_bin"],train["Loan_Status"])
# DataFrame.plot creates its own figure, so no separate plt.figure call is needed
TotalIncome.div(TotalIncome.sum(1).astype(float),axis=0).plot(kind='bar',stacked=True,figsize=(4,4))
plt.xlabel("TotalIncome")
plt.ylabel("Percentage")
plt.show()

In [None]:
bins = [0,100,200,700]
group=['Low','Average','High']
train["LoanAmount_bin"]=pd.cut(train["LoanAmount"],bins,labels=group)

In [None]:
print(pd.crosstab(train["LoanAmount_bin"],train["Loan_Status"]))
LoanAmount=pd.crosstab(train["LoanAmount_bin"],train["Loan_Status"])
LoanAmount.div(LoanAmount.sum(1).astype(float),axis=0).plot(kind='bar',stacked=True,figsize=(4,4))
plt.xlabel("LoanAmount")
plt.ylabel("Percentage")
plt.show()

In [None]:
# Encode '3+' dependents as 3 and the target as 0 (N) / 1 (Y).
# Assigning the result back avoids relying on inplace edits of a column selection,
# which newer pandas versions may not propagate to the DataFrame.
train['Dependents'] = train['Dependents'].replace('3+', 3)
test['Dependents'] = test['Dependents'].replace('3+', 3)
train['Loan_Status'] = train['Loan_Status'].map({'N': 0, 'Y': 1})

In [None]:
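# Correlations are only defined for numeric columns, so select those explicitly
matrix = train.select_dtypes(include='number').corr()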
f, ax = plt.subplots(figsize=(10, 12))
sns.heatmap(matrix, vmax=.8, square=True, cmap="BuPu",annot=True);

In [None]:
# Fill missing categorical values with the most frequent value (mode).
# Assigning back avoids relying on inplace fillna on a column selection.
for col in ["Gender", "Married", "Dependents", "Self_Employed", "Credit_History"]:
    train[col] = train[col].fillna(train[col].mode()[0])

In [None]:
train["Loan_Amount_Term"].value_counts()

In [None]:
train["Loan_Amount_Term"] = train["Loan_Amount_Term"].fillna(train["Loan_Amount_Term"].mode()[0])

In [None]:
# LoanAmount is numeric with outliers, so the median is used instead of the mean
train["LoanAmount"] = train["LoanAmount"].fillna(train["LoanAmount"].median())

In [None]:
train.isnull().sum()

In [None]:
train.drop(['TotalIncome', 'TotalIncome_bin', 'LoanAmount_bin'], axis=1, inplace=True)

In [None]:
train.isnull().sum()

In [None]:
# Same imputation for the test set: mode for categorical-like columns, median for LoanAmount
for col in ["Gender", "Dependents", "Self_Employed", "Loan_Amount_Term", "Credit_History"]:
    test[col] = test[col].fillna(test[col].mode()[0])
test["LoanAmount"] = test["LoanAmount"].fillna(test["LoanAmount"].median())

In [None]:
test.isnull().sum()

In [None]:
train["TotalIncome"]=train["ApplicantIncome"]+train["CoapplicantIncome"]
train[['TotalIncome']].head()

In [None]:
test["TotalIncome"]=test["ApplicantIncome"]+test["CoapplicantIncome"]

In [None]:
sns.distplot(train["TotalIncome"])

In [None]:
train["TotalIncome_log"]=np.log(train["TotalIncome"])
sns.distplot(train["TotalIncome_log"])

In [None]:
sns.distplot(test["TotalIncome"])

In [None]:
# Use the test set's own TotalIncome here, not the training column
test["TotalIncome_log"] = np.log(test["TotalIncome"])
sns.distplot(test["TotalIncome_log"])

In [None]:
train["EMI"]=train["LoanAmount"]/train["Loan_Amount_Term"]
test["EMI"]=test["LoanAmount"]/test["Loan_Amount_Term"]

In [None]:
sns.distplot(train["EMI"])

In [None]:
# Multiply EMI by 1000 so that both terms are in the same units (applied to train and test alike)
train["Balance_Income"] = train["TotalIncome"]-train["EMI"]*1000
test["Balance_Income"] = test["TotalIncome"]-test["EMI"]*1000

In [None]:
train=train.drop(["ApplicantIncome","CoapplicantIncome","LoanAmount","Loan_Amount_Term"],axis=1)

In [None]:
test = test.drop(["ApplicantIncome","CoapplicantIncome","LoanAmount","Loan_Amount_Term"],axis=1)

In [None]:
train=train.drop("Loan_ID",axis=1)
test=test.drop("Loan_ID",axis=1)

In [None]:
X=train.drop("Loan_Status",axis=1)

In [None]:
# Use a 1-D Series for the target; scikit-learn warns when given a column DataFrame
y=train["Loan_Status"]

In [None]:
X = pd.get_dummies(X)

In [None]:
X.head(3)

In [None]:
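train=pd.get_dummies(train)
test=pd.get_dummies(test)
# Align test columns with the training features so predict() sees the same layout
test = test.reindex(columns=X.columns, fill_value=0)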

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train,x_cv,y_train,y_cv=train_test_split(X,y,test_size=0.3,random_state=1)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
logistic_model = LogisticRegression(random_state=1)

In [None]:
logistic_model.fit(x_train,y_train)

In [None]:
pred_cv_logistic=logistic_model.predict(x_cv)
# accuracy_score expects (y_true, y_pred)
score_logistic = accuracy_score(y_cv,pred_cv_logistic)*100

In [None]:
score_logistic

In [None]:
pred_test_logistic = logistic_model.predict(test)

In [None]:
from sklearn.tree import DecisionTreeClassifier
tree_model = DecisionTreeClassifier(random_state=1)

In [None]:
tree_model.fit(x_train,y_train)

In [None]:
pred_cv_tree=tree_model.predict(x_cv)

In [None]:
score_tree = accuracy_score(y_cv,pred_cv_tree)*100
score_tree

In [None]:
pred_test_tree = tree_model.predict(test)

In [None]:
from sklearn.ensemble import RandomForestClassifier
forest_model = RandomForestClassifier(random_state=1,max_depth=10,n_estimators=50)
forest_model.fit(x_train,y_train)

In [None]:
pred_cv_forest=forest_model.predict(x_cv)
score_forest = accuracy_score(y_cv,pred_cv_forest)*100
score_forest

In [None]:
pred_test_forest=forest_model.predict(test)

In [None]:
importances = pd.Series(forest_model.feature_importances_,index=X.columns)
importances.plot(kind='barh', figsize=(12,8))