<h1><center>Loan Prediction Data Analysis</center></h1>

In [1]:
%%html
<script>
    var code_show=true; //true -> hide code at first
    
    function code_toggle() {
        $('div.prompt').hide(); // always hide prompt
        
        if (code_show){
            $('div.input').hide();
        } else {
            $('div.input').show();
        }
        code_show = !code_show
        
    }
    $( document ).ready(code_toggle);
</script>
<a href="javascript:code_toggle()"><center>[View Code]</center></a>

## Problem Statement:

Dream Housing Finance deals in all types of home loans and has a presence across urban, semi-urban, and rural areas. A customer first applies for a home loan, after which the company validates the customer's eligibility.

The company wants to automate the loan eligibility process in real time, based on the details customers provide when filling in the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To automate this process, the company has posed the problem of identifying the customer segments that are eligible for a loan, so that it can specifically target these customers.

## Hypotheses:

1. Married applicants are more likely to be approved than single applicants because their income tends to be more stable, which lowers default risk. (**Married**)
2. The more credit history an applicant has, the greater the chances of the loan being approved. (**Credit History**)
3. Higher loan amounts are less desirable to lenders and will therefore reduce the chances of approval. (**Loan Amount**)
4. A higher number of dependents will lower approval rates because of the greater demand on household income. (**Dependents**)
5. Applicants with higher incomes are more likely to be given a loan because their greater ability to repay means less risk. (**Income**)
6. Higher education levels should correlate with higher incomes, thus increasing the chances of approval. (**Education**)
7. Self-employed applicants are less likely to be approved because their income is less stable and consistent than that of an employed person. (**Employment Terms**)
8. Longer loan terms will lead to a greater likelihood of approval because the borrower's ability to repay is greater over the life of the loan. (**Loan Amount Term**)
9. A higher co-applicant income will increase the chances of approval, as co-applicants act as a form of security for the loan. (**Co-applicantIncome**)
10. Applicants in urban areas will have higher approval rates than those in other areas, as urban home prices are higher than rural ones. (**Property Area**)
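
Most of these hypotheses can be checked by comparing approval rates across the levels of a predictor. The following is a minimal sketch of that comparison, not the notebook's own analysis: the helper `approval_rate_by` and the toy records are assumptions made for illustration, and only the column names `Married` and `Loan_Status` come from the dataset.

```python
import pandas as pd

# Toy records for illustration only; these are NOT rows from the loan data.
toy = pd.DataFrame({
    "Married": ["Yes", "Yes", "No", "No", "Yes", "No"],
    "Loan_Status": ["Y", "Y", "N", "Y", "N", "N"],
})

def approval_rate_by(df, column, target="Loan_Status", positive="Y"):
    """Return the share of approved loans within each level of `column`.

    Hypothetical helper: the name and signature are not from the notebook.
    """
    approved = df[target].eq(positive)          # True where the loan was approved
    return approved.groupby(df[column]).mean()  # mean of booleans = approval rate

rates = approval_rate_by(toy, "Married")
print(rates)
```

On the real data the same call could be repeated for each categorical predictor, for example `approval_rate_by(loan, "Property_Area")`.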

In [2]:
#Libraries
import numpy as np
import pandas as pd
import sklearn as sk
import seaborn as sb
%matplotlib inline

## Data Acquisition

Load the data into a DataFrame from the local machine or from a URL.

In [3]:
# Importing data into workspace
location =r"C:\Users\Latoya Clarke\Desktop\Data for Analysis\loan_train.csv"
location2 =r"C:\Users\Latoya Clarke\Desktop\Data for Analysis\loan_test.csv"
loan = pd.read_csv(location)
test = pd.read_csv(location2)
loan.head()

# This output shows 13 data columns, with LoanAmount missing in the first row

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


##  Data Exploration

### Variable Identification

1. Target Variable: **Loan_Status**

2. Predictor Variables: **Married, Dependents, Education, Self_Employed, ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, Credit_History, Property_Area**

### Univariate Analysis

At this stage, we explore the variables one by one. The method used for univariate analysis depends on whether the variable is categorical or continuous.

Continuous variables: we need to understand the central tendency and spread of the variable. These are measured using various statistical metrics and visualization methods.

Categorical variables: we use a frequency table to understand the distribution of each category, which can also be read as the percentage of values under each category. It is measured using two metrics, Count and Count%, for each category. A bar chart can be used for visualization.
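
The two kinds of summary described above can be sketched as follows. The helper names `frequency_table` and `spread_summary` and the toy values are assumptions made for illustration; they are not part of the notebook.

```python
import pandas as pd

def frequency_table(series):
    """Count and Count% for each category, keeping missing values visible."""
    counts = series.value_counts(dropna=False)
    percents = (counts / counts.sum() * 100).round(2)
    return pd.DataFrame({"Count": counts, "Count%": percents})

def spread_summary(series):
    """Central tendency and spread for a continuous variable."""
    q1, q3 = series.quantile([0.25, 0.75])
    return pd.Series({
        "mean": series.mean(),
        "median": series.median(),
        "std": series.std(),
        "iqr": q3 - q1,
    })

# Toy values for illustration only; not taken from the loan data.
areas = pd.Series(["Urban", "Rural", "Urban", "Semiurban", "Urban"])
incomes = pd.Series([2500, 3000, 4000, 5000, 40000])

table = frequency_table(areas)
summary = spread_summary(incomes)
print(table)
print(summary)
```

Applied to the notebook's data, the calls would look like `frequency_table(loan["Property_Area"])` and `spread_summary(loan["ApplicantIncome"])`.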

In [4]:
#General information on dataset
loan.info()

# 7 of the 13 columns have missing values
# ApplicantIncome is an int datatype, which means applicants state whole-number values for their incomes,
# while all other monetary columns are floats.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
Loan_ID              614 non-null object
Gender               601 non-null object
Married              611 non-null object
Dependents           599 non-null object
Education            614 non-null object
Self_Employed        582 non-null object
ApplicantIncome      614 non-null int64
CoapplicantIncome    614 non-null float64
LoanAmount           592 non-null float64
Loan_Amount_Term     600 non-null float64
Credit_History       564 non-null float64
Property_Area        614 non-null object
Loan_Status          614 non-null object
dtypes: float64(4), int64(1), object(8)
memory usage: 62.4+ KB


In [5]:
loan.describe()

# Credit_History seems to be binary, signifying whether or not a credit history exists
# ApplicantIncome has a very large range
# Loan terms range from 12 months (1 year) to 480 months (40 years)
# Some co-applicants have 0 income

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
count,614.0,614.0,592.0,600.0,564.0
mean,5403.459283,1621.245798,146.412162,342.0,0.842199
std,6109.041673,2926.248369,85.587325,65.12041,0.364878
min,150.0,0.0,9.0,12.0,0.0
25%,2877.5,0.0,100.0,360.0,1.0
50%,3812.5,1188.5,128.0,360.0,1.0
75%,5795.0,2297.25,168.0,360.0,1.0
max,81000.0,41667.0,700.0,480.0,1.0


### Missing Value Treatment

In [6]:
# Categorical columns: fill missing values with a fixed default category
loan['Gender'] = loan['Gender'].fillna('Male')
loan['Married'] = loan['Married'].fillna('Yes')
# Dependents has object dtype (see loan.info()), so fill with the string '0' rather than the integer 0
loan['Dependents'] = loan['Dependents'].fillna('0')
loan['Self_Employed'] = loan['Self_Employed'].fillna('No')
# Numeric columns: fill missing values with the (rounded) column mean
loan['LoanAmount'] = loan['LoanAmount'].fillna(round(loan['LoanAmount'].mean(),1))
loan['Loan_Amount_Term'] = loan['Loan_Amount_Term'].fillna(round(loan['Loan_Amount_Term'].mean(),1))
# Rounding the mean to 0 decimals keeps Credit_History binary
loan['Credit_History'] = loan['Credit_History'].fillna(round(loan['Credit_History'].mean(),0))

# Apply the same treatment to the test set
test['Gender'] = test['Gender'].fillna('Male')
test['Married'] = test['Married'].fillna('Yes')
test['Dependents'] = test['Dependents'].fillna('0')
test['Self_Employed'] = test['Self_Employed'].fillna('No')
test['LoanAmount'] = test['LoanAmount'].fillna(round(test['LoanAmount'].mean(),1))
test['Loan_Amount_Term'] = test['Loan_Amount_Term'].fillna(round(test['Loan_Amount_Term'].mean(),1))
test['Credit_History'] = test['Credit_History'].fillna(round(test['Credit_History'].mean(),0))
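
An alternative to hard-coding each fill value is to learn the fill values once from the training data and apply them to both sets. The sketch below illustrates that idea under stated assumptions: the helper `learn_fill_values` is hypothetical, the frames are toy data, and the median is swapped in for the mean used above because it is less sensitive to extreme values.

```python
import pandas as pd

def learn_fill_values(train_df):
    """Mode for text columns, median for numeric columns, from training data only.

    Hypothetical helper; the notebook itself fills with fixed categories and means.
    """
    fills = {}
    for col in train_df.columns:
        if pd.api.types.is_numeric_dtype(train_df[col]):
            fills[col] = train_df[col].median()
        else:
            fills[col] = train_df[col].mode(dropna=True).iloc[0]
    return fills

# Toy frames for illustration only; not taken from the loan data.
train_toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", None],
    "LoanAmount": [100.0, 120.0, None, 200.0],
})
test_toy = pd.DataFrame({
    "Gender": [None, "Female"],
    "LoanAmount": [None, 150.0],
})

fills = learn_fill_values(train_toy)
train_filled = train_toy.fillna(fills)
test_filled = test_toy.fillna(fills)  # test set reuses the training fill values
print(fills)
```

Reusing the training fill values keeps the two sets consistent and avoids computing statistics from the data the model will later be scored on.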

### Outlier Detection and Treatment

Outliers can be of two types: univariate and multivariate. Univariate outliers can be found by looking at the distribution of a single variable, while multivariate outliers are outliers in an n-dimensional space.
Outliers can drastically change the results of data analysis and statistical modeling.
The most commonly used method for detecting outliers is visualization, using plots such as box plots, histograms, and scatter plots.
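
The rule that a box plot uses to draw its whiskers can also be applied numerically. The following is a minimal sketch of that 1.5 × IQR rule for a single variable; the helper `iqr_outliers` and the toy incomes are assumptions made for illustration, not the notebook's own treatment.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy values for illustration only; not taken from the loan data.
incomes = pd.Series([2500, 3000, 4000, 5000, 40000])
mask = iqr_outliers(incomes)
print(incomes[mask])  # only the 40000 entry falls outside the fences
```

On the notebook's data, `loan[iqr_outliers(loan["ApplicantIncome"])]` would list the applicants that a box plot would draw as individual points.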

## Feature Engineering

Feature engineering is the science (and art) of extracting more information from existing data. No new data is added; the data you already have is made more useful. Feature engineering can be divided into two steps:

1. Variable transformation
2. Variable / feature creation

### Variable Transformation

In data modelling, transformation refers to the replacement of a variable by a function of that variable, for example replacing x with its square root or logarithm. In other words, transformation is a process that changes the distribution of a variable or its relationship with others.
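
As an illustration of a transformation, the sketch below applies a log transform to a right-skewed income-like variable. The values are toy data chosen for illustration, and the choice of `np.log1p` (which also handles zero incomes) is an assumption rather than something the notebook applies.

```python
import numpy as np
import pandas as pd

# Toy, right-skewed values for illustration only.
incomes = pd.Series([150, 2900, 3800, 5800, 81000], dtype=float)

# log1p(x) = log(1 + x); it is defined at x = 0 and compresses large values
log_incomes = np.log1p(incomes)

print("skew before:", incomes.skew())
print("skew after: ", log_incomes.skew())

# The transform is invertible, so predictions can be mapped back if needed
recovered = np.expm1(log_incomes)
```

The transformed variable keeps the ordering of the original values while pulling in the long right tail.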

### Variable Creation

Feature / variable creation is the process of generating new variables / features from existing variable(s).
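
A minimal sketch of variable creation on this dataset's columns might look like the following. The new features `TotalIncome` and `LoanToIncome` and the helper `add_income_features` are hypothetical; the notebook does not define them. The two rows are copied from the `loan.head()` output above.

```python
import pandas as pd

def add_income_features(df):
    """Return a copy of df with two derived columns (hypothetical features)."""
    out = df.copy()
    # Combined household income of applicant and co-applicant
    out["TotalIncome"] = out["ApplicantIncome"] + out["CoapplicantIncome"]
    # Requested loan amount relative to combined income
    out["LoanToIncome"] = out["LoanAmount"] / out["TotalIncome"]
    return out

# Rows LP001003 and LP001005 from the loan.head() output
sample = pd.DataFrame({
    "ApplicantIncome": [4583, 3000],
    "CoapplicantIncome": [1508.0, 0.0],
    "LoanAmount": [128.0, 66.0],
})

features = add_income_features(sample)
print(features)
```

Whether such features help is an empirical question that would need to be checked against model accuracy.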

## Findings

## Model Building

In [43]:
y = np.array(loan.Loan_Status)

loan_selected = loan.drop(['Loan_ID','Loan_Status'], axis = 1)
X= loan_selected.to_dict(orient='records')

from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
X = vec.fit_transform(X).toarray()

In [89]:
#Splitting the data for training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,random_state =1234)

In [90]:
#Visualize train & test

In [91]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.linear_model import LogisticRegression as LR
from sklearn.ensemble import GradientBoostingClassifier as GBC
from sklearn.tree import DecisionTreeClassifier as DT
from sklearn.naive_bayes import GaussianNB as NB
from sklearn.metrics import accuracy_score

print("DECISION TREE: ")
model = DT()
model.fit(X_train, y_train)
predicted= model.predict(X_test)
print("Model Prediction Accuracy: ", round(100* accuracy_score(predicted, y_test),2),"%")

DECISION TREE: 
Model Prediction Accuracy:  66.67 %


In [92]:
print("KNN: ")
model = KNN() 
model.fit(X_train, y_train)
predicted= model.predict(X_test)
print("Model Prediction Accuracy: ", round(100* accuracy_score(predicted, y_test),2),"%")

KNN: 
Model Prediction Accuracy:  61.79 %


In [94]:
print("Naive Bayes: ")
model = NB() 
model.fit(X_train, y_train)
predicted= model.predict(X_test)
print("Model Prediction Accuracy: ", round(100* accuracy_score(predicted, y_test),2),"%")

Naive Bayes: 
Model Prediction Accuracy:  83.74 %


In [95]:
print("Gradient Boosting: ")
model = GBC() 
model.fit(X_train, y_train)
predicted= model.predict(X_test)
print("Model Prediction Accuracy: ", round(100* accuracy_score(predicted, y_test),2),"%")

Gradient Boosting: 
Model Prediction Accuracy:  78.86 %


In [96]:
print("Logistic Regression: ")
model = LR() 
model.fit(X_train, y_train)
predicted= model.predict(X_test)
print("Model Prediction Accuracy: ", round(100* accuracy_score(predicted, y_test),2),"%")

Logistic Regression: 
Model Prediction Accuracy:  86.18 %


In [97]:
print("Random Forest: ")
model = RFC() 
model.fit(X_train, y_train)
predicted= model.predict(X_test)
print("Model Prediction Accuracy: ", round(100* accuracy_score(predicted, y_test),2),"%")

Random Forest: 
Model Prediction Accuracy:  73.98 %


In [98]:
print("Support Vector Machine: ")
model = SVC() 
model.fit(X_train, y_train)
predicted= model.predict(X_test)
print("Model Prediction Accuracy: ", round(100* accuracy_score(predicted, y_test),2),"%")

Support Vector Machine: 
Model Prediction Accuracy:  68.29 %


In [107]:
from sklearn.linear_model import SGDClassifier as SGDC
print("SGDC: ")
model = SGDC() 
model.fit(X_train, y_train)
predicted= model.predict(X_test)
print("Model Prediction Accuracy: ", round(100* accuracy_score(predicted, y_test),2),"%")

SGDC: 
Model Prediction Accuracy:  68.29 %
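
The accuracies above come from a single train/test split, so some of the differences between models may be due to chance. One way to get a steadier comparison is k-fold cross-validation, sketched below. The data here is a synthetic stand-in generated with `make_classification`, not the loan data, and the subset of models and the dictionary names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (X, y); NOT the loan data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1234)

candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=1234),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

cv_scores = {}
for name, estimator in candidates.items():
    # 5-fold cross-validated accuracy: each fold serves once as the held-out set
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = (scores.mean(), scores.std())
    print(f"{name}: {100 * scores.mean():.2f}% (+/- {100 * scores.std():.2f})")
```

With the notebook's arrays, `X_demo` and `y_demo` would be replaced by `X` and `y`.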




## Model Selection and Tuning

In [105]:
#2042 to #792, 

print("Logistic Regression: ")
model = LR(random_state = 1234, solver='newton-cg') 
model.fit(X_train, y_train)
predicted= model.predict(X_test)
print("Model Prediction Accuracy: ", round(100* accuracy_score(predicted, y_test),2),"%")

Logistic Regression: 
Model Prediction Accuracy:  86.18 %
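
Tuning can also be done as a systematic search over hyperparameters. The sketch below searches over the regularization strength `C` of a logistic regression with `GridSearchCV`. The grid values and the synthetic data are assumptions made for illustration; only the `newton-cg` solver is taken from the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data; NOT the loan data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1234)

grid = {"C": [0.01, 0.1, 1.0, 10.0]}  # illustrative values, not tuned for this problem

search = GridSearchCV(
    LogisticRegression(solver="newton-cg", max_iter=1000),
    param_grid=grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 4))
```

After fitting, `search.best_estimator_` is a model refit on all the supplied data with the best parameters found.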




## Predictions

In [19]:
test_selected = test.drop(['Loan_ID'], axis = 1)
X_2= test_selected.to_dict(orient='records')

# Reuse the DictVectorizer fitted on the training data (vec) so that the test
# columns line up with the features the model was trained on
test_X = vec.transform(X_2).toarray()

In [20]:
pred_y = model.predict(test_X)

In [21]:
test['Loan_Status'] = pred_y
test[['Loan_ID','Loan_Status']].to_csv(r"C:\Users\Latoya Clarke\Desktop\Data for Analysis\loan_predictions.csv", index=False)
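
When encoding the test set, the feature columns must match those seen during training. The sketch below shows how a `DictVectorizer` fitted on training records keeps the column layout fixed when it transforms new records, whereas fitting a fresh vectorizer on the new records yields a different layout. The records are toy data and the variable names are assumptions made for illustration.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy records for illustration only.
train_records = [
    {"Property_Area": "Urban", "ApplicantIncome": 5849},
    {"Property_Area": "Rural", "ApplicantIncome": 4583},
    {"Property_Area": "Semiurban", "ApplicantIncome": 3000},
]
test_records = [
    {"Property_Area": "Urban", "ApplicantIncome": 2583},
]

vec_demo = DictVectorizer(sparse=False)
train_matrix = vec_demo.fit_transform(train_records)  # learns the column layout
test_matrix = vec_demo.transform(test_records)        # reuses the same layout

# Fitting a new vectorizer on the test records alone sees only one category
refit_matrix = DictVectorizer(sparse=False).fit_transform(test_records)

print(vec_demo.feature_names_)
print(train_matrix.shape, test_matrix.shape, refit_matrix.shape)
```

Categories that never appear in the new records still get a column (filled with zeros) when the fitted vectorizer is reused, which is what a trained model expects.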