Description of the loan dataset:

I chose to do my final project on the Loan Prediction dataset. The goal is to predict whether a loan will be approved, based on the listed variables describing the applicant.

Here is an example entry from the Loan Prediction dataset, with its variables:

In [1]:
%matplotlib inline

import pandas
import numpy
import matplotlib
#importing modules from sklearn for the analytic results
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics

data = pandas.read_csv("TrainingSet.csv")

data.head(1)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y


For readability, here are the variables with a brief description of each:

| Variable | Description |
| --- | --- |
| Loan_ID | Unique ID |
| Gender | Male/Female |
| Married | Applicant married (Y/N) |
| Dependents | Number of dependents |
| Education | Applicant education (Graduate/Under Graduate) |
| Self_Employed | Self employed (Y/N) |
| ApplicantIncome | Applicant income |
| CoapplicantIncome | Coapplicant income |
| LoanAmount | Loan amount in thousands |
| Loan_Amount_Term | Term of loan |
| Credit_History | Boolean value (1 = yes, 0 = no) |
| Property_Area | Urban/Semi Urban/Rural |
| Loan_Status | Loan approved (Y/N) |


Because I'm using the pandas library to work with this data, I can use its built-in functionality to get a good start on the problem. Now that we know the variables and their descriptions, the next step is to find the number of cases I am dealing with so I can start checking for missing values.

In [2]:
data.describe()



Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
count,614.0,614.0,592.0,600.0,564.0
mean,5403.459283,1621.245798,146.412162,342.0,0.842199
std,6109.041673,2926.248369,85.587325,65.12041,0.364878
min,150.0,0.0,9.0,12.0,0.0
25%,2877.5,0.0,,,
50%,3812.5,1188.5,,,
75%,5795.0,2297.25,,,
max,81000.0,41667.0,700.0,480.0,1.0


So now we know that there are 614 cases in this dataset (the count for ApplicantIncome and CoapplicantIncome), which gives me a basis for finding the missing values that I need to fill in so that the analysis is more accurate.

For example, right away I can see that LoanAmount is missing 22 values (614 - 592), Loan_Amount_Term is missing 14 (614 - 600), and Credit_History is missing 50 (614 - 564). I want to see how many missing values each column has:

In [3]:
def missingNum(x):
    # Count the null entries in one column
    return x.isnull().sum()

print("Missing values per column")
print(data.apply(missingNum, axis=0))

Missing values per column
Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64
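
The missing counts read off the describe() output earlier come from subtracting each column's non-null count from the number of rows. Here is a minimal sketch of that cross-check, assuming a small made-up frame (the `toy` values are illustrative and not taken from TrainingSet.csv):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data (values are made up for illustration)
toy = pd.DataFrame({
    "LoanAmount": [120.0, np.nan, 66.0, np.nan],
    "Credit_History": [1.0, 0.0, np.nan, 1.0],
    "Property_Area": ["Urban", "Rural", "Urban", "Semiurban"],
})

# count() reports non-null entries, so rows minus count gives the missing values
missing_from_count = len(toy) - toy.count()

# isnull().sum() gives the same numbers directly, for every column at once
missing_direct = toy.isnull().sum()

print(missing_direct)
```

Unlike describe(), which only summarizes the numeric columns by default, isnull().sum() also covers text columns such as Gender and Married.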


Most of these values can be filled by thinking about the data intuitively. I'm going to go through the columns and fill them accordingly so that no missing values remain when I build the predictive model.

The description of the data gave the mean of LoanAmount (about 146.4), so I can use the average loan amount for the 22 missing cases without throwing off the data too much.

In [4]:
data['LoanAmount'].fillna(data['LoanAmount'].mean(), inplace=True)
data['LoanAmount_log'] = numpy.log(data['LoanAmount'])
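
To see what mean imputation does to a column, here is a small sketch on made-up numbers (the `amounts` series is an assumption for illustration, not the real LoanAmount column). Filling with the mean leaves the mean unchanged but shrinks the spread, because every filled value sits exactly at the center:

```python
import numpy as np
import pandas as pd

# Made-up loan amounts (in thousands) with two gaps
amounts = pd.Series([100.0, 200.0, np.nan, 300.0, np.nan])

# mean() skips NaN by default, so this is the mean of the three known values
known_mean = amounts.mean()

# Assign the result back instead of relying on inplace=True
filled = amounts.fillna(known_mean)

# The log transform compresses large values and pulls in a long right tail
filled_log = np.log(filled)

print(known_mean, filled.mean())
print(amounts.std(), filled.std())
```

With a right-skewed column like LoanAmount (maximum of 700 against a mean of about 146), filling with the median is a common alternative, since the median is less affected by a few very large loans.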

Another variable that can reasonably be filled with its most likely value is Self_Employed:

In [5]:
data['Self_Employed'].value_counts()

No     500
Yes     82
Name: Self_Employed, dtype: int64

About 86% of the known values (500 of 582) are No, so it's probably safe to mark the 32 missing values as No:

In [6]:
data['Self_Employed'].fillna('No', inplace=True)
print(data.apply(missingNum, axis=0))

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed         0
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            0
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
LoanAmount_log        0
dtype: int64


The loan amount term doesn't vary much; the large majority of loans have a term of 360:

In [7]:
data['Loan_Amount_Term'].value_counts()

360.0    512
180.0     44
480.0     15
300.0     13
240.0      4
84.0       4
120.0      3
60.0       2
36.0       2
12.0       1
Name: Loan_Amount_Term, dtype: int64
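
To put a number on how dominant the 360 term is, the counts above can be turned into shares. This is a small sketch that rebuilds the counts by hand; the names `term_counts` and `term_share` are only illustrative:

```python
import pandas as pd

# Counts copied from the value_counts() output above
term_counts = pd.Series(
    [512, 44, 15, 13, 4, 4, 3, 2, 2, 1],
    index=[360.0, 180.0, 480.0, 300.0, 240.0, 84.0, 120.0, 60.0, 36.0, 12.0],
)

# Share of each term among the non-missing rows
term_share = term_counts / term_counts.sum()

most_common_term = term_counts.idxmax()
print(most_common_term, round(term_share.loc[most_common_term], 3))
```

On the real column, `data['Loan_Amount_Term'].value_counts(normalize=True)` should give the same shares directly.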

Since 512 of the 600 known values are 360, I think it's safe to fill in the missing Loan_Amount_Term values with 360:

In [8]:
data['Loan_Amount_Term'].fillna(360.0, inplace=True)
print(data.apply(missingNum, axis=0))

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed         0
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            0
Loan_Amount_Term      0
Credit_History       50
Property_Area         0
Loan_Status           0
LoanAmount_log        0
dtype: int64


Now to handle the missing cases for Gender, Married, Dependents, and Credit_History. For each one I look at the value counts and then fill the missing entries.

Gender:

In [9]:
data['Gender'].value_counts()

Male      489
Female    112
Name: Gender, dtype: int64

In [10]:
data['Gender'].fillna('Male', inplace=True)

In [11]:
data['Married'].value_counts()

Yes    398
No     213
Name: Married, dtype: int64

In [12]:
data['Married'].fillna('Yes', inplace=True)

In [13]:
data['Dependents'].value_counts()

0     345
1     102
2     101
3+     51
Name: Dependents, dtype: int64

In [14]:
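# Fill with the string '1' so the column keeps a single type (the existing values are strings such as '0' and '3+')
data['Dependents'].fillna('1', inplace=True)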

In [15]:
data['Credit_History'].value_counts()

1.0    475
0.0     89
Name: Credit_History, dtype: int64

In [16]:
data['Credit_History'].fillna(1.0, inplace=True)
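
The fills that use the most frequent value (Gender, Married, Credit_History) all follow the same pattern, so they could be collapsed into one loop. A minimal sketch on a made-up frame, assuming the most frequent value is the right fill for every listed column:

```python
import numpy as np
import pandas as pd

# Made-up categorical columns with gaps (not the real loan data)
toy = pd.DataFrame({
    "Gender": ["Male", "Male", np.nan, "Female"],
    "Married": ["Yes", np.nan, "Yes", "No"],
    "Credit_History": [1.0, 1.0, 0.0, np.nan],
})

for column in ["Gender", "Married", "Credit_History"]:
    # mode() returns a Series because ties are possible; take the first entry
    most_common = toy[column].mode()[0]
    toy[column] = toy[column].fillna(most_common)

print(toy)
```

Checking value_counts() first, as done above, is still worthwhile: when two values are nearly tied, the most frequent one is a much weaker guess.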

After going through and filling the missing values, no column has gaps any more, so all 614 cases can be used when building the predictive model.

ANALYTIC RESULTS:

Now I'm going to use the sklearn library to model this data set after all the data munging. sklearn requires all inputs to be numeric, so I first need to encode the remaining non-numeric values:

In [18]:
ModifiedVariables = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area', 'Loan_Status']
le = LabelEncoder()

for i in ModifiedVariables:
    data[i] = le.fit_transform(data[i])
    
data.dtypes

Loan_ID               object
Gender                 int64
Married                int64
Dependents             int64
Education              int64
Self_Employed          int64
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area          int64
Loan_Status            int64
LoanAmount_log       float64
dtype: object
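
To make the encoding step less of a black box, here is a small sketch of what LabelEncoder does to a single column. The `status` list is made up in the style of Loan_Status. The encoder assigns integers in sorted order of the labels:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels in the same style as the Loan_Status column
status = ["Y", "N", "Y", "Y", "N"]

encoder = LabelEncoder()
codes = encoder.fit_transform(status)

# classes_ is sorted, so "N" maps to 0 and "Y" maps to 1
print(list(encoder.classes_), list(codes))

# inverse_transform recovers the original labels from the integer codes
decoded = encoder.inverse_transform(codes)
```

Refitting one encoder inside a loop overwrites its classes_ each time, so if the codes need to be translated back later, it helps to keep a separate encoder per column.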

Now that all the values are encoded, I can get started on the classification function that reports the accuracy and the cross validation score of a model. I'm going to try two models: Logistic Regression and a Decision Tree.

Generic classification function:

In [19]:
def classification_model(model, data, predictors, outcome):
    #fitting the given model
    model.fit(data[predictors], data[outcome])
    
    #making predictions on the same data the model was fitted on
    predictions = model.predict(data[predictors])
    
    #Printing the training accuracy (true labels first, then predictions)
    accuracy = metrics.accuracy_score(data[outcome], predictions)
    print("Accuracy: %s" % "{0:.3%}".format(accuracy))
    
    #Performing kfold cross validation with 5 folds
    #(KFold from sklearn.cross_validation takes the number of rows and n_folds)
    kf = KFold(data.shape[0], n_folds=5)
    scores = []
    for train, test in kf:
        #filtering the training data set
        train_predictors = data[predictors].iloc[train,:]
        
        #training target used to train the algorithm
        train_target = data[outcome].iloc[train]
        
        #training the algorithm using the given
        #predictors and targets
        model.fit(train_predictors, train_target)
        
        #Recording the accuracy on the held-out fold of each cross validation run
        scores.append(model.score(data[predictors].iloc[test,:], data[outcome].iloc[test]))
        
    #Printing the mean cross validation score
    print("Cross Validation score: %s" % "{0:.3%}".format(numpy.mean(scores)))
    
    #Fitting the model again on all the data so it can be used outside of this function
    model.fit(data[predictors], data[outcome])
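
In newer versions of scikit-learn, KFold lives in sklearn.model_selection and takes n_splits instead of the row count and n_folds. The same two numbers can then be computed with cross_val_score. This is a sketch under that assumption, run on synthetic data from make_classification rather than the loan data, so the printed scores will not match the ones in this notebook:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the loan data: 600 rows, 4 numeric predictors
X, y = make_classification(n_samples=600, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)

model = LogisticRegression()

# Accuracy on the same rows the model was fitted on (optimistic by construction)
training_accuracy = model.fit(X, y).score(X, y)

# Five unshuffled folds, matching the fold count used in classification_model
folds = KFold(n_splits=5)
fold_scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")

print("Accuracy: {0:.3%}".format(training_accuracy))
print("Cross Validation score: {0:.3%}".format(np.mean(fold_scores)))
```

cross_val_score fits a fresh clone of the model on each training fold, so the loop and the manual bookkeeping are no longer needed.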

LOGISTIC REGRESSION MODEL

Before I run logistic regression on the data set, I need to choose the predictor variables that I think affect the outcome the most. Intuitively, someone with a good credit history (Credit_History = 1) has a higher chance of getting approved for a loan, so that definitely belongs in predictor_var. Other variables that seem important after looking at them again:

ApplicantIncome - fairly straightforward reasoning: more income generally means a better chance of paying the loan back

CoapplicantIncome - same reasoning as ApplicantIncome

Education - can indicate the financial stability of the applicant's income

Dependents - more people depending on the applicant's income leaves less available for repayment


In [20]:
outcome_var = 'Loan_Status'
Model = LogisticRegression()
predictor_var = ['Credit_History']
classification_model(Model, data, predictor_var, outcome_var)

Accuracy: 80.945%
Cross Validation score: 80.946%


It might help the accuracy if I add more variables to the predictor_var list:

In [21]:
predictor_val = ['Credit_History', 'ApplicantIncome', 'CoapplicantIncome' , 'Education', 'Married']
classification_model(Model, data, predictor_var, outcome_var)

Accuracy: 80.945%
Cross Validation score: 80.946%


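The scores are identical to the previous run, but that comes from a typo rather than from the data: the longer list was assigned to predictor_val, while the call still passes predictor_var, which holds only Credit_History. The model was therefore refit on Credit_History alone, and this run says nothing about the extra variables; the cell has to be rerun with the list assigned to predictor_var to see their effect. Adding variables tends to raise the training accuracy, but it does not necessarily raise the cross validation score. To see how much the other variables carry on their own, the next run leaves Credit_History out: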

In [22]:
predictor_var = ['Married', 'Gender', 'Dependents', 'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Property_Area']
classification_model(Model, data, predictor_var, outcome_var)

Accuracy: 69.055%
Cross Validation score: 68.402%


All of the other variables combined give a lower accuracy and cross validation score (69.055% and 68.402%) than Credit_History by itself (80.945% and 80.946%). This limits how much more logistic regression can tell me about this data set, so next I will try a different model, a decision tree, to see whether it scores higher on accuracy and cross validation.


DECISION TREE MODEL

In [23]:
Model = DecisionTreeClassifier()
predictor_var = ['Credit_History', 'Married', 'Education', 'ApplicantIncome']
classification_model(Model, data, predictor_var, outcome_var)

Accuracy: 98.208%
Cross Validation score: 72.964%


It could be useful to try different variables in predictor_var, like I did with logistic regression:

In [24]:
predictor_var = ['Credit_History','Loan_Amount_Term', 'LoanAmount_log', 'ApplicantIncome']
classification_model(Model, data, predictor_var, outcome_var)

Accuracy: 100.000%
Cross Validation score: 70.674%


The accuracy went up to 100% while the cross validation score dropped by about 2 percentage points (from 72.964% to 70.674%), which means the decision tree is most likely overfitting the data set: it memorizes the rows it was trained on but does not generalize to held-out rows. So although the decision tree scored much higher on accuracy, logistic regression is the more reliable model here, since its cross validation score (80.946%) was about 8 to 10 points higher than the decision tree's and almost identical to its own accuracy, which shows no sign of overfitting.
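
One common way to check for this kind of overfitting, and to reduce it, is to cap the depth of the tree and compare training accuracy against the cross validation score. The sketch below does this on synthetic data with some label noise, not on the loan data. The depth of 3 is an arbitrary choice for illustration, and the scores it prints say nothing about what the loan data would give:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y), standing in for the loan data
X, y = make_classification(n_samples=600, n_features=4, n_informative=2,
                           n_redundant=0, flip_y=0.2, random_state=0)

results = {}
for depth in [None, 3]:
    # max_depth=None lets the tree grow until every training row is classified
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    training_accuracy = tree.fit(X, y).score(X, y)
    # cv=5 uses five stratified folds for a classifier
    cv_accuracy = cross_val_score(tree, X, y, cv=5).mean()
    results[depth] = (training_accuracy, cv_accuracy)
    print(depth, round(training_accuracy, 3), round(cv_accuracy, 3))
```

A large gap between the two numbers for the unrestricted tree, and a smaller gap for the shallow one, is the pattern to look for. The same comparison could be run on the loan data by passing DecisionTreeClassifier(max_depth=3) to classification_model.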