Description of the loan dataset:

I chose to do my final project on the Loan Prediction dataset. The goal with this dataset is to predict whether a loan will be approved, based on the listed variables describing the person applying for the loan.

Here is an example entry from the Loan Prediction dataset, showing its variables:

In [3]:
%matplotlib inline

import pandas
import numpy
import matplotlib

data = pandas.read_csv("TrainingSet.csv")

data.head(1)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y


For easier reading, here are the variables with a brief description of each:


| Variable | Description |
|---|---|
| Loan_ID | Unique ID |
| Gender | Male/Female |
| Married | Applicant married (Y/N) |
| Dependents | Number of dependents |
| Education | Applicant education (Graduate / Under Graduate) |
| Self_Employed | Self employed (Y/N) |
| ApplicantIncome | Applicant income |
| CoapplicantIncome | Coapplicant income |
| LoanAmount | Loan amount in thousands |
| Loan_Amount_Term | Term of loan |
| Credit_History | Boolean value (1 = yes, 0 = no) |
| Property_Area | Urban / Semi Urban / Rural |
| Loan_Status | Loan approved (Y/N) |


Because I'm using the pandas library to work with this data, I get built-in functionality that gives me a good start on the problem. Now that we know the variables and their descriptions, the next step is to find out how many cases I am dealing with, so I can start figuring out whether there are missing values.

In [24]:
data.describe()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,LoanAmount_log
count,614.0,614.0,614.0,614.0,564.0,614.0
mean,5403.459283,1621.245798,146.412162,342.410423,0.842199,4.862066
std,6109.041673,2926.248369,84.037468,64.428629,0.364878,0.496575
min,150.0,0.0,9.0,12.0,0.0,2.197225
25%,2877.5,0.0,100.25,360.0,,4.607658
50%,3812.5,1188.5,129.0,360.0,,4.859812
75%,5795.0,2297.25,164.75,360.0,,5.104426
max,81000.0,41667.0,700.0,480.0,1.0,6.55108


So now we know that there are 614 cases in this dataset, which gives me a basis for figuring out whether there are missing values that I need to fill in so that the analysis is more accurate.

For example, right away I can see that there are 22 missing values in LoanAmount, 14 missing values in Loan_Amount_Term, and 50 missing values in Credit_History. I want to see how many missing values there are in every column:

In [5]:
def missingNum(x):
    # Count the missing (NaN) entries in one column
    return x.isnull().sum()

print("Missing values per column")
print(data.apply(missingNum, axis=0))

Missing values per column
Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64
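As a cross-check on the counts above, here is a minimal sketch of two equivalent ways to count missing values without a helper function. The `toy` frame and its values are made up for illustration and are not taken from TrainingSet.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the loan data
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [128.0, np.nan, 66.0, np.nan],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})

# Missing values per column, counted directly
missing_per_column = toy.isnull().sum()

# Same numbers derived from the non-null counts (the "count" row of describe())
missing_from_count = len(toy) - toy.count()

print(missing_per_column)
```

The second form is the same arithmetic as comparing the `count` row of `describe()` against the total number of rows.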


Most of these values can be filled in by thinking about the data intuitively. I'm going to go through them and fill them in accordingly, so I don't have any missing values left while building a predictive model.

The description of the data gave the mean of LoanAmount, so I can use the average loan amount for the missing cases without throwing off the data too much.

In [6]:
# Assign the result back; a chained inplace fillna may not update the frame in newer pandas
data['LoanAmount'] = data['LoanAmount'].fillna(data['LoanAmount'].mean())
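To see what filling with the mean does and does not preserve, here is a small sketch on made-up loan amounts (the numbers are hypothetical, not from the dataset). The column mean stays the same after the fill, but the spread shrinks a little, because every filled value sits exactly at the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical loan amounts (in thousands) with two gaps
loan_amount = pd.Series([100.0, 120.0, np.nan, 140.0, np.nan])

mean_before = loan_amount.mean()          # NaN entries are skipped by default
std_before = loan_amount.std()

filled = loan_amount.fillna(mean_before)  # returns a new Series

mean_after = filled.mean()
std_after = filled.std()

print(mean_before, mean_after)
print(std_before, std_after)
```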

Another variable that can probably be filled in with its most likely value is Self_Employed:

In [21]:
data['Self_Employed'].value_counts()

No     532
Yes     82
Name: Self_Employed, dtype: int64

Since the large majority of applicants are not self employed, it's probably safe to mark the 32 missing values as No:

In [19]:
data['Self_Employed'] = data['Self_Employed'].fillna('No')
print(data.apply(missingNum, axis=0))

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed         0
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            0
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
LoanAmount_log        0
dtype: int64
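One way to put a number on "probably safe" is to look at the share of each answer among the rows that are not missing before filling. This is only a sketch on a made-up column; the values are hypothetical.

```python
import pandas as pd

# Hypothetical Self_Employed column with two gaps
self_employed = pd.Series(["No", "No", "No", None, "Yes", "No", None])

# Share of each answer among the non-missing rows (NaN is excluded by default)
shares = self_employed.value_counts(normalize=True)

most_common = shares.idxmax()               # label with the largest share
filled = self_employed.fillna(most_common)  # fill every gap with that label

print(shares)
print(filled.value_counts())
```

The larger the share of the most common answer, the fewer filled rows are likely to be wrong.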


The loan amount term doesn't vary much; the large majority of loans have a 360 term:

In [23]:
data['Loan_Amount_Term'].value_counts()

360.0    526
180.0     44
480.0     15
300.0     13
240.0      4
84.0       4
120.0      3
60.0       2
36.0       2
12.0       1
Name: Loan_Amount_Term, dtype: int64

I think it's safe to fill in the missing Loan_Amount_Term values with a 360 term:

In [47]:
data['Loan_Amount_Term'] = data['Loan_Amount_Term'].fillna(360.0)
print(data.apply(missingNum, axis=0))

Loan_ID               0
Gender                0
Married               0
Dependents            0
Education             0
Self_Employed         0
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            0
Loan_Amount_Term      0
Credit_History       50
Property_Area         0
Loan_Status           0
LoanAmount_log        0
dtype: int64


Now to handle the missing cases for Gender, Married, Dependents, and Credit_History. For each one, I look at the value counts and then fill in the missing entries.

Gender:

In [28]:
data['Gender'].value_counts()

Male      502
Female    112
Name: Gender, dtype: int64

In [36]:
data['Gender'] = data['Gender'].fillna('Male')

In [37]:
data['Married'].value_counts()

Yes    398
No     213
Name: Married, dtype: int64

In [46]:
data['Married'] = data['Married'].fillna('Yes')

In [45]:
data['Dependents'].value_counts()

0     345
1     102
2     101
3+     51
0      15
Name: Dependents, dtype: int64

In [43]:
# Dependents holds strings ('0', '1', '2', '3+'); fill with the most common value, '0'
data['Dependents'] = data['Dependents'].fillna('0')
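The two separate `0` rows in the Dependents counts above are what one would expect when a column of strings is filled with an integer: the string `'0'` and the number `0` are counted as different values. The sketch below reproduces that effect on a made-up Series; it is an assumption that this is what happened in the notebook, and the data here is hypothetical.

```python
import pandas as pd

# Hypothetical Dependents column stored as plain Python strings, with one gap
dependents = pd.Series(["0", "1", None, "3+", "0"], dtype=object)

filled_int = dependents.fillna(0)    # integer fill value: creates a new category
filled_str = dependents.fillna("0")  # string fill value: merges with the existing '0'

n_categories_int = filled_int.nunique()
n_categories_str = filled_str.nunique()

print(n_categories_int, n_categories_str)
```

Filling with a value of the same type as the rest of the column keeps the categories from splitting.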

In [50]:
data['Credit_History'].value_counts()

1.0    525
0.0     89
Name: Credit_History, dtype: int64

In [49]:
data['Credit_History'] = data['Credit_History'].fillna(1.0)

After going through and filling in the missing values, the data is complete and should give a more reliable basis for building the predictive model.
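The fills for Gender, Married, and Credit_History all follow the same pattern, so they could be wrapped in one helper and followed by a check that nothing is left missing. This is only a sketch: `fill_with_mode` is a hypothetical helper name, not part of pandas, and the `toy` frame is made up.

```python
import numpy as np
import pandas as pd

def fill_with_mode(df, columns):
    """Return a copy of df with NaNs in `columns` replaced by each column's most common value."""
    out = df.copy()
    for col in columns:
        # mode() skips NaN and returns a Series; take its first entry
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Hypothetical toy frame with one gap in each column
toy = pd.DataFrame({
    "Married": ["Yes", None, "Yes", "No"],
    "Credit_History": [1.0, 1.0, np.nan, 0.0],
})

clean = fill_with_mode(toy, ["Married", "Credit_History"])

# Total number of missing values left across the whole frame
remaining = int(clean.isnull().sum().sum())
print(remaining)
```

Working on a copy leaves the original frame untouched, which makes it easy to compare the data before and after filling.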