# Loan Prediction Machine Learning Model

Here is a logistic regression machine learing model that I have put together using a labeled dataset from from Kaggle: https://www.kaggle.com/burak3ergun/loan-data-set


In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


df = pd.read_csv('loan prediction data.csv', skipinitialspace=True)
print(df.head)

<bound method NDFrame.head of       Loan_ID  Gender Married Dependents     Education Self_Employed  \
0    LP001002    Male      No          0      Graduate            No   
1    LP001003    Male     Yes          1      Graduate            No   
2    LP001005    Male     Yes          0      Graduate           Yes   
3    LP001006    Male     Yes          0  Not Graduate            No   
4    LP001008    Male      No          0      Graduate            No   
..        ...     ...     ...        ...           ...           ...   
609  LP002978  Female      No          0      Graduate            No   
610  LP002979    Male     Yes         3+      Graduate            No   
611  LP002983    Male     Yes          1      Graduate            No   
612  LP002984    Male     Yes          2      Graduate            No   
613  LP002990  Female      No          0      Graduate           Yes   

     ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0               5849            

On inspecting the data frame the data looks quite complete. We are missing a few values and there is not way we can estimate or fill the missing data with estimates or it is for things like sex and loan amount. It will be best to remove the columns from the dataframe.

In [2]:
df.dropna(inplace=True)

We now have a table with no missing values.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 480 entries, 1 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            480 non-null    object 
 1   Gender             480 non-null    object 
 2   Married            480 non-null    object 
 3   Dependents         480 non-null    object 
 4   Education          480 non-null    object 
 5   Self_Employed      480 non-null    object 
 6   ApplicantIncome    480 non-null    int64  
 7   CoapplicantIncome  480 non-null    float64
 8   LoanAmount         480 non-null    float64
 9   Loan_Amount_Term   480 non-null    float64
 10  Credit_History     480 non-null    float64
 11  Property_Area      480 non-null    object 
 12  Loan_Status        480 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 52.5+ KB


Before building the Logistic Regression model we must map the data in such a way that we can use it. 
That means mapping the gender, married, dependents, education, self employed and property area to numeric values.

In [4]:
df['Gender'] = df['Gender'].map({'Male':0, 'Female':1})
df['Married'] = df['Married'].map({'No':0, 'Yes':1})
df['Dependents'] = df['Dependents'].map({'3+':3, '2':2, '1':1, '0':0})
df['Education'] = df['Education'].map({'Graduate':1, 'Not Graduate':0})
df['Self_Employed'] = df['Self_Employed'].map({'No':0, 'Yes':1})
df['Property_Area'] = df['Property_Area'].map({'Urban':0, 'Semiurban':1, 'Rural':2})
df['Loan_Status'] = df['Loan_Status'].map({'N':0, 'Y':1})

In [5]:
df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
1,LP001003,0,1,1,1,0,4583,1508.0,128.0,360.0,1.0,2,0
2,LP001005,0,1,0,1,1,3000,0.0,66.0,360.0,1.0,0,1
3,LP001006,0,1,0,0,0,2583,2358.0,120.0,360.0,1.0,0,1
4,LP001008,0,0,0,1,0,6000,0.0,141.0,360.0,1.0,0,1
5,LP001011,0,1,2,1,1,5417,4196.0,267.0,360.0,1.0,0,1


Now with the data cleaned and in a numeric format that can be used we will create a features list to be used in the model

In [6]:
features = df[['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'Loan_Amount_Term', 'Credit_History', 'Property_Area']]
loan_approval = df.Loan_Status

Now we can split the data into training and testing sets to enable us to validate the accuracy of our model

In [7]:
features_train, features_test, labels_train, labels_test = train_test_split(features,loan_approval,test_size=0.2)

Scale the data so that it's weighted equally between the columns

In [8]:
scaler = StandardScaler()
features_train = scaler.fit_transform(features_train)
features_test = scaler.fit_transform(features_test)

Creating the logistic regression object and training the model

In [9]:
model = LogisticRegression()
model.fit(features_train, labels_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

We can score the model to to see how the prediction holds up against the test data

In [10]:
print(model.score(features_train, labels_train))
print(model.score(features_test, labels_test))

0.8125
0.8020833333333334


The model can correctly predict the approval of a loan roughly 80% of the time

By looking at the coefficents of the model, we can see if any of the features used were more or less important in predicting the outcome of the the loan application

In [11]:
print(list(zip(['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'Loan_Amount_Term', 'Credit_History', 'Property_Area'],model.coef_[0])))

[('Gender', -0.2583466817158926), ('Married', 0.1574860469968225), ('Dependents', -0.05835683481226577), ('Education', 0.12245106568111731), ('Self_Employed', -0.08419146493118554), ('ApplicantIncome', -0.03288610550310462), ('CoapplicantIncome', -0.2820556559161165), ('Loan_Amount_Term', -0.10643085736181893), ('Credit_History', 1.1941025930309999), ('Property_Area', -0.08906570548628245)]


One of the most important factors for getting a loan approved was having a clean credit history. The least important: gender, the number of dependants and if the applicant is self-employed.

To test the model I have used the following features:
Male, single, no dependancts, undergraduate, not self employed, income of 4000, no co-applicant income, loan term of 360, clean credit history and from a rural area.


In [12]:
test = np.array([0,0,0,0,0,4000,0,360,1.0,2])
test = test.reshape(1,-1)

scaled_test = scaler.transform(test)
print(scaled_test)

[[-0.48038446 -1.20894105 -0.66899361 -1.68522995 -0.32163376 -0.23340293
  -0.73601929  0.24589473  0.4472136   1.33095886]]


Using the data in my test array, the loan would not of been granted. As shown below.

In [13]:
print(model.predict(test))

[0]
