#### Exercise 4: Data Analysis with Python

# Loan Prediction Practice Problem

## Our results:

![Our results](resources/results.png "Our results")

## The Data

Variable | Description
----------|--------------
Loan_ID | Unique Loan ID
Gender | Male/ Female
Married | Applicant married (Y/N)
Dependents | Number of dependents
Education | Applicant Education (Graduate/ Under Graduate)
Self_Employed | Self employed (Y/N)
ApplicantIncome | Applicant income
CoapplicantIncome | Coapplicant income
LoanAmount | Loan amount in thousands
Loan_Amount_Term | Term of loan in months
Credit_History | credit history meets guidelines
Property_Area | Urban/ Semi Urban/ Rural
Loan_Status | Loan approved (Y/N)


## Setup

To begin, start IPython or Jupyter from your terminal or Windows command prompt, then enable inline Pylab mode by running the following magic command in the first cell:

In [None]:
%pylab inline

We will use the following libraries during this task:
- numpy
- matplotlib
- pandas

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

After importing the libraries, read the dataset with the pandas function read_csv():

In [None]:
df = pd.read_csv("./data/train.csv") #Reading the dataset in a dataframe using Pandas

### Quick Data Exploration

Look at the first few rows using the function head():

In [None]:
df.head(10)

Next, we look at a summary of the numerical fields using the describe() function:

In [None]:
df.describe() # get the summary of numerical variables
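
As a rough illustration of how to read the describe() output: the `count` row reports non-missing values only, so a count below the number of rows flags missing data, and a mean well above the `50%` row hints at a right-skewed column. The toy frame below is made up for this sketch and is not the loan data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the loan dataset)
toy = pd.DataFrame({
    "income": [2500, 3000, 3200, 4000, 81000],        # one extreme value
    "loan":   [100.0, 120.0, np.nan, 150.0, 600.0],   # one missing value
})

summary = toy.describe()

# 'count' only counts non-missing entries
missing_per_column = len(toy) - summary.loc["count"]

# A mean far above the median (the 50% row) suggests right skew
right_skewed = summary.loc["mean"] > summary.loc["50%"]

print(missing_per_column)
print(right_skewed)
```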

## Distribution analysis

Start by plotting the histogram of ApplicantIncome using the following commands:

In [None]:
df['ApplicantIncome'].hist(bins=50)

Next, we look at box plots to understand the distributions. A box plot can be drawn with:

In [None]:
df.boxplot(column='ApplicantIncome')

This confirms the presence of many outliers and extreme values, which can be attributed to income disparity in society. Part of this may be driven by the fact that we are looking at people with different education levels. Let us segregate them by Education:

In [None]:
df.boxplot(column='ApplicantIncome', by = 'Education')

We can see that there is no substantial difference between the median income of graduates and non-graduates (the center line of a box plot marks the median, not the mean). However, there are more graduates with very high incomes, which appear as outliers.
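
The reading of the box plot can also be checked numerically. The sketch below uses made-up incomes (not the loan data) to compare group medians and to count points outside 1.5 × IQR from the quartiles, which is the default whisker rule in matplotlib box plots. The helper `count_outliers` is a hypothetical name introduced for this example.

```python
import pandas as pd

# Hypothetical toy data (not the loan dataset)
toy = pd.DataFrame({
    "Education": ["Graduate"] * 6 + ["Not Graduate"] * 6,
    "ApplicantIncome": [3000, 3500, 4000, 4200, 4500, 60000,
                        2800, 3200, 3600, 3900, 4100, 4400],
})

# Median and mean income per education level
stats = toy.groupby("Education")["ApplicantIncome"].agg(["median", "mean"])

def count_outliers(s):
    """Count points beyond 1.5 * IQR from the first and third quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())

outliers = toy.groupby("Education")["ApplicantIncome"].apply(count_outliers)

print(stats)
print(outliers)
```

In this toy example the medians are close while the means differ sharply, because a single very high income pulls the graduate mean upward.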

Plot the histogram of LoanAmount:

In [None]:
df['LoanAmount'].hist(bins=50)

In [None]:
df.boxplot(column='LoanAmount')

Again, there are some extreme values. Clearly, both ApplicantIncome and LoanAmount require some data munging. LoanAmount has missing as well as extreme values, while ApplicantIncome has a few extreme values that demand deeper understanding. We will take this up in the coming sections.

## Categorical variable analysis

Frequency Table for Credit History:

In [None]:
temp1 = df['Credit_History'].value_counts(ascending=True)
temp1

Probability of getting loan for each Credit History class:

In [None]:
temp2 = df.pivot_table(values='Loan_Status',index=['Credit_History'],aggfunc=lambda x: x.map({'Y':1,'N':0}).mean())
temp2
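
An equivalent way to get the same per-class approval rate, shown here as a sketch on made-up rows rather than the loan data, is to map the status to 0/1 once and then group by credit history:

```python
import pandas as pd

# Hypothetical toy data (not the loan dataset)
toy = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    "Loan_Status":    ["Y", "Y", "Y", "N", "N", "N", "N", "Y"],
})

# Encode approval as 1/0, then average within each credit history class
approved = toy["Loan_Status"].map({"Y": 1, "N": 0})
approval_rate = approved.groupby(toy["Credit_History"]).mean()

print(approval_rate)
```

The mean of a 0/1 column is the share of ones, which is why it can be read as an approval probability.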

We'll plot it as a bar chart:

In [None]:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8,4))
ax1 = fig.add_subplot(121)
temp1.plot(kind='bar', ax=ax1)  # pass ax explicitly so the bars land on the left subplot
ax1.set_xlabel('Credit_History')
ax1.set_ylabel('Count of Applicants')
ax1.set_title("Applicants by Credit_History")

ax2 = fig.add_subplot(122)
temp2.plot(kind='bar', ax=ax2, legend=False)  # pass ax explicitly so the bars land on the right subplot
ax2.set_xlabel('Credit_History')
ax2.set_ylabel('Probability of getting loan')
ax2.set_title("Probability of getting loan by credit history")

This shows that the chance of getting a loan is roughly eight times higher if the applicant has a valid credit history. You can plot similar graphs for Married, Self_Employed, Property_Area, etc.

Alternatively, these two plots can be combined in a single stacked chart:

In [None]:
temp3 = pd.crosstab(df['Credit_History'], df['Loan_Status'])
temp3.plot(kind='bar', stacked=True, color=['red','blue'], grid=False)
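
The cross-tabulation behind the stacked chart can also be inspected as numbers. As a sketch on made-up rows (not the loan data), `pd.crosstab` with `normalize="index"` turns the counts into row-wise shares, so each credit history class sums to 1:

```python
import pandas as pd

# Hypothetical toy data (not the loan dataset)
toy = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    "Loan_Status":    ["Y", "Y", "Y", "N", "N", "N", "N", "Y"],
})

# Raw counts per (Credit_History, Loan_Status) cell
counts = pd.crosstab(toy["Credit_History"], toy["Loan_Status"])

# Same table, but each row divided by its row total
shares = pd.crosstab(toy["Credit_History"], toy["Loan_Status"], normalize="index")

print(counts)
print(shares)
```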

## Check missing values in the dataset

Check the number of missing values in each column:

In [None]:
df.apply(lambda x: sum(x.isnull()),axis=0)

Although the number of missing values is not very high, many variables have them, and each should be estimated and filled in.

## Fill missing values

We will impute LoanAmount from the Self_Employed and Education groups, so first we have to ensure that neither of these variables has missing values.

Let’s look at the frequency table:

In [None]:
df['Self_Employed'].value_counts()

Since ~86% of the values are “No”, it is safe to impute the missing values as “No”, as there is a high probability of being correct. This can be done using the following code:

In [None]:
df['Self_Employed'] = df['Self_Employed'].fillna('No')  # assign back; inplace on a selected column is unreliable in recent pandas
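
The share of each category and the most frequent value can be computed rather than read off the table. The sketch below uses a made-up Series (not the loan data) and fills missing entries with the mode:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column (not the loan dataset)
s = pd.Series(["No"] * 6 + ["Yes"] + [np.nan], name="Self_Employed")

# Relative frequencies; missing values are excluded by default
shares = s.value_counts(normalize=True)

# Most frequent category, used as the fill value
most_common = s.mode()[0]
filled = s.fillna(most_common)

print(shares)
print(filled.isnull().sum())
```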

Now, we will create a pivot table that gives the median LoanAmount for every combination of Self_Employed and Education. Next, we define a function that returns the value of the matching cell and apply it to fill the missing loan amounts:

In [None]:
table = df.pivot_table(values='LoanAmount', index='Self_Employed', columns='Education', aggfunc='median')
table

Define a function to return the value from this pivot table:

In [None]:
def fage(x):
    # Look up the median LoanAmount for this row's (Self_Employed, Education) group
    return table.loc[x['Self_Employed'],x['Education']]

Replace missing values:

In [None]:
df['LoanAmount'] = df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1))

This should give you a good way to impute the missing loan amounts.
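
A more compact alternative to the pivot table and lookup function, sketched here on made-up rows and not taken from the original exercise, is to broadcast each group's median back to its rows with `groupby(...).transform('median')` and use that as the fill value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the loan dataset)
toy = pd.DataFrame({
    "Self_Employed": ["No", "No", "No", "No", "No", "Yes", "Yes", "Yes"],
    "Education": ["Graduate", "Graduate", "Graduate", "Not Graduate",
                  "Not Graduate", "Graduate", "Graduate", "Graduate"],
    "LoanAmount": [100.0, 120.0, np.nan, 90.0, np.nan, 200.0, 260.0, np.nan],
})

# Median LoanAmount of each (Self_Employed, Education) group, aligned to the rows
group_median = (
    toy.groupby(["Self_Employed", "Education"])["LoanAmount"].transform("median")
)

# Only the missing entries are replaced
toy["LoanAmount"] = toy["LoanAmount"].fillna(group_median)

print(toy)
```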

## How to treat extreme values in the distribution of LoanAmount and ApplicantIncome?

Let’s analyze LoanAmount first. The extreme values are practically possible, i.e. some people might apply for high-value loans due to specific needs. So instead of treating them as outliers, let’s try a log transformation to dampen their effect:

In [None]:
df['LoanAmount_log'] = np.log(df['LoanAmount'])
df['LoanAmount_log'].hist(bins=20)

Now the distribution looks much closer to normal, and the effect of extreme values has subsided significantly.

Now for ApplicantIncome. One intuition is that some applicants have a lower income but strong support from co-applicants. So it might be a good idea to combine both incomes into a total income and take a log transformation of it.
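
The effect of the log transformation can be quantified with the sample skewness. The amounts below are made up for this sketch (not the loan data). Note that `np.log` requires strictly positive inputs.

```python
import numpy as np
import pandas as pd

# Hypothetical loan amounts with one extreme value (not the loan dataset)
amounts = pd.Series([80, 100, 110, 120, 128, 135, 150, 180, 260, 700], dtype=float)

# Sample skewness before and after the log transformation
skew_raw = amounts.skew()
skew_log = np.log(amounts).skew()

print("skewness before log: {0:.2f}".format(skew_raw))
print("skewness after log : {0:.2f}".format(skew_log))
```

A skewness closer to zero indicates a more symmetric distribution; the extreme value still sits in the right tail after the transformation, but it dominates far less.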

In [None]:
df['TotalIncome'] = df['ApplicantIncome'] + df['CoapplicantIncome']
df['TotalIncome_log'] = np.log(df['TotalIncome'])
df['TotalIncome_log'].hist(bins=20) 

We see that the distribution is much better than before. 

Now, replace the missing values in Loan_Amount_Term, Credit_History, Dependents, Married and Gender:

In [None]:
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(360)
df['Credit_History'] = df['Credit_History'].fillna(1)
df['Dependents'] = df['Dependents'].fillna('0')  # fill with the string '0' so the column keeps a single type
df['Married'] = df['Married'].fillna('Yes')
df['Gender'] = df['Gender'].fillna('Male')

Check the result:

In [None]:
df.apply(lambda x: sum(x.isnull()),axis=0)

Seems fine. Next, we will look at making predictive models.

# Building a Predictive Model

Since sklearn requires all inputs to be numeric, we should convert all our categorical variables to numeric by encoding the categories:

In [None]:
df.dtypes

In [None]:
from sklearn.preprocessing import LabelEncoder
var_mod = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area','Loan_Status']
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i].astype(str))

In [None]:
df.dtypes 
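
One detail worth keeping in mind: `LabelEncoder` assigns integer codes according to the sorted set of values it was fitted on, so the same label can receive a different code if the encoder is refitted on a second dataset in which some category is absent. The sketch below, with made-up values, fits once and reuses the mapping through `transform`:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values (not the loan dataset)
train_area = ["Urban", "Rural", "Semiurban", "Urban"]
test_area = ["Urban", "Urban", "Semiurban"]   # 'Rural' does not occur here

le = LabelEncoder()
train_codes = le.fit_transform(train_area)   # learn the mapping on the first dataset
test_codes = le.transform(test_area)         # reuse the same mapping

# Refitting on the second dataset alone yields different codes for the same labels
refit_codes = LabelEncoder().fit_transform(test_area)

print(list(le.classes_))
print(list(test_codes), list(refit_codes))
```

If both datasets contain exactly the same categories in a column, refitting gives the same codes; the mismatch only appears when the sets of values differ.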

Next, we will import the required modules. Then we will define a generic classification function, which takes a model as input and determines the Accuracy and Cross-Validation scores. 

In [None]:
#Import models from scikit learn module:
from sklearn.model_selection import KFold   #For K-fold cross validation (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn import metrics
# Method 1:
from sklearn.ensemble import ExtraTreesClassifier

#Method 2:
from sklearn.ensemble import AdaBoostClassifier

#Generic function for making a classification model and assessing performance:
def classification_model(model, data, predictors, outcome):
  #Fit the model:
  model.fit(data[predictors],data[outcome])

  #Make predictions on training set:
  predictions = model.predict(data[predictors])

  #Print accuracy (true labels first, then predictions)
  accuracy = metrics.accuracy_score(data[outcome],predictions)
  print("Accuracy : %s" % "{0:.3%}".format(accuracy))

  #Perform k-fold cross-validation with 5 folds
  kf = KFold(n_splits=5)
  scores = []
  for train, test in kf.split(data[predictors]):
    # Filter training data
    train_predictors = (data[predictors].iloc[train,:])

    # The target we're using to train the algorithm.
    train_target = data[outcome].iloc[train]

    # Training the algorithm using the predictors and target.
    model.fit(train_predictors, train_target)

    #Record the accuracy from each cross-validation run
    scores.append(model.score(data[predictors].iloc[test,:], data[outcome].iloc[test]))

  print("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(scores)))

  #Fit the model again so that it can be referred to outside the function:
  model.fit(data[predictors],data[outcome])
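
The manual K-fold loop can also be expressed with `cross_val_score` from `sklearn.model_selection`, which fits a fresh copy of the model on each training fold and returns one score per fold. The sketch below runs on synthetic data from `make_classification` (an assumption for illustration, not the loan data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the predictors and the outcome
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

model = ExtraTreesClassifier(n_estimators=25, random_state=0)
cv = KFold(n_splits=5)

# One accuracy value per fold
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("Cross-Validation Score : {0:.3%}".format(np.mean(scores)))
```

Unlike the loop in `classification_model`, this leaves `model` itself unfitted, so a final `model.fit(X, y)` is still needed before predicting.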

# First Method: Extra Trees

The Extra-Trees method (short for extremely randomized trees) was proposed with the main objective of further randomizing tree building in the context of numerical input features, where the choice of the optimal cut-point is responsible for a large proportion of the variance of the induced tree.

The method drops the idea of using bootstrap copies of the learning sample, and instead of trying to find an optimal cut-point for each one of the K randomly chosen features at each node, it selects a cut-point at random.

This idea is rather productive in the context of many problems characterized by a large number of numerical features varying more or less continuously: it often leads to increased accuracy thanks to its smoothing effect, and at the same time it significantly reduces the computational burden linked to the determination of optimal cut-points in standard trees and in random forests.
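
To make the difference concrete, here is a minimal sketch for a single numerical feature, contrasting an exhaustive search for the best cut-point with a cut-point drawn uniformly between the feature's minimum and maximum. All function names and the toy arrays are hypothetical; this is not scikit-learn's implementation.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of 0/1 labels."""
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels)
    return 2.0 * p * (1.0 - p)

def split_impurity(x, y, threshold):
    """Weighted Gini impurity after splitting on x <= threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(y)

def best_cut_point(x, y):
    """Exhaustive search over midpoints, as a standard tree would do."""
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2.0
    return min(candidates, key=lambda t: split_impurity(x, y, t))

def random_cut_point(x, rng):
    """Extra-Trees style: draw the threshold uniformly between min and max."""
    return rng.uniform(x.min(), x.max())

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

t_best = best_cut_point(x, y)
t_rand = random_cut_point(x, rng)

print("optimal cut:", t_best, "impurity:", split_impurity(x, y, t_best))
print("random cut :", t_rand, "impurity:", split_impurity(x, y, t_rand))
```

The random draw skips the search over candidate thresholds entirely, which is where the computational saving comes from; a single random split is usually worse than the optimal one, and the ensemble of many such trees compensates for that.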

In [None]:
outcome_var = 'Loan_Status'
model = ExtraTreesClassifier(n_estimators=100)
predictor_var = ['Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'Loan_Amount_Term', 'Credit_History', 'Property_Area',
        'LoanAmount_log','TotalIncome_log']
classification_model(model, df, predictor_var, outcome_var)

Here we see that the accuracy is 100% for the training set. This is the ultimate case of overfitting and can be resolved in two ways:

- Reducing the number of predictors
- Tuning the model parameters

Let’s try both of these. First we see the feature importance matrix from which we’ll take the most important features.


In [None]:
featimp = pd.Series(model.feature_importances_, index=predictor_var).sort_values(ascending=False)
print(featimp)

Let’s use the top 5 variables for creating a model. Also, we will modify the parameters of the Extra Trees model a little bit:

In [None]:
model = ExtraTreesClassifier(n_estimators=25, min_samples_split=25, max_depth=7, max_features=1)
predictor_var = ['TotalIncome_log','LoanAmount_log','Credit_History','Dependents','Loan_Amount_Term']
classification_model(model, df, predictor_var, outcome_var)
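
The parameter values above were picked by hand. As a sketch of a more systematic alternative (not part of the original exercise), `GridSearchCV` evaluates every combination in a small grid with cross-validation and keeps the best one. The data here is synthetic and the grid values are arbitrary assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for five predictors and a binary outcome
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

# Candidate values to try (illustrative only)
param_grid = {
    "min_samples_split": [2, 25],
    "max_depth": [3, 7],
}

search = GridSearchCV(
    ExtraTreesClassifier(n_estimators=25, random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print("Best cross-validation accuracy: {0:.3%}".format(search.best_score_))
```

After fitting, `search.best_estimator_` holds a model refitted on all the data with the winning parameters.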

## Predict the results

Read the test data set:

In [None]:
df_test = pd.read_csv("./data/test.csv")
df_test.head(10)

Save the loan IDs in a separate variable:

In [None]:
loan_ids = df_test['Loan_ID']

### Prepare the input - repeat the steps

Check for missing values:

In [None]:
df_test.apply(lambda x: sum(x.isnull()),axis=0)

Replace missing values:

In [None]:
df_test['Self_Employed'] = df_test['Self_Employed'].fillna('No')  # fage needs a non-missing Self_Employed
df_test['LoanAmount'] = df_test['LoanAmount'].fillna(df_test[df_test['LoanAmount'].isnull()].apply(fage, axis=1))
df_test['Loan_Amount_Term'] = df_test['Loan_Amount_Term'].fillna(360)
df_test['Credit_History'] = df_test['Credit_History'].fillna(1)
df_test['Dependents'] = df_test['Dependents'].fillna('0')
df_test['Married'] = df_test['Married'].fillna('Yes')
df_test['Gender'] = df_test['Gender'].fillna('Male')
df_test.apply(lambda x: sum(x.isnull()),axis=0)

Handle types:

In [None]:
var_mod = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area']
le = LabelEncoder()
for i in var_mod:
    df_test[i] = le.fit_transform(df_test[i].astype(str))
df_test.dtypes

Treat extreme values:

In [None]:
df_test['LoanAmount_log'] = np.log(df_test['LoanAmount'])
df_test['TotalIncome'] = df_test['ApplicantIncome'] + df_test['CoapplicantIncome']
df_test['TotalIncome_log'] = np.log(df_test['TotalIncome'])

Adjust the data frame to our model:

In [None]:
df_test = df_test[['TotalIncome_log','LoanAmount_log','Credit_History','Dependents','Loan_Amount_Term']]
df_test.head(10)

Predict the loan status using the model:

In [None]:
prediction = model.predict(df_test)

Convert the result to fit the submission requirement template:

In [None]:
predict = ['Y' if p == 1 else 'N' for p in prediction]

Write the result to the submission file:

In [None]:
submission = {'Loan_ID':loan_ids, 'Loan_Status':predict}
submission = pd.DataFrame(submission)
submission.to_csv("./submission/model1.csv", index=False)  # index=False keeps only the Loan_ID and Loan_Status columns

## Extra Trees method results:

![Model 1 results](resources/method1.png "Model 1 results")

In [None]:
df.head(10)

# Second Method: AdaBoost

An AdaBoost classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, with the weights of incorrectly classified instances adjusted so that subsequent classifiers focus more on difficult cases.
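
The reweighting step can be sketched for a single boosting round. The code below follows the classic two-class (discrete) AdaBoost update with labels in {-1, +1}; it illustrates the idea only, uses made-up labels and predictions, and scikit-learn's `AdaBoostClassifier` uses a multi-class variant that differs in detail.

```python
import numpy as np

# Toy labels and weak-learner predictions in {-1, +1} (hypothetical values)
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])   # the second sample is misclassified

# Start with uniform sample weights
weights = np.full(len(y_true), 1.0 / len(y_true))

# Weighted error of the weak learner
miss = y_pred != y_true
err = np.sum(weights[miss]) / np.sum(weights)

# Weight of this learner in the final vote
alpha = 0.5 * np.log((1.0 - err) / err)

# Increase weights of misclassified samples, decrease the others, renormalize
new_weights = weights * np.exp(-alpha * y_true * y_pred)
new_weights /= new_weights.sum()

print("error:", err, "alpha:", alpha)
print("updated weights:", new_weights)
```

After the update the misclassified sample carries half of the total weight, so the next weak learner is pushed to classify it correctly.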

In [None]:
outcome_var = 'Loan_Status'
model = AdaBoostClassifier(n_estimators=100)
predictor_var = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Credit_History', 'Property_Area']
classification_model(model, df, predictor_var, outcome_var)

## Predict the results

Read the test data set:

In [None]:
df_test = pd.read_csv("./data/test.csv")
df_test.head(10)

Save the loan IDs in a separate variable:

In [None]:
loan_ids = df_test['Loan_ID']

### Prepare the input - repeat the steps

Check for missing values:

In [None]:
df_test.apply(lambda x: sum(x.isnull()),axis=0)

Replace missing values:

In [None]:
df_test['Self_Employed'] = df_test['Self_Employed'].fillna('No')  # fage needs a non-missing Self_Employed
df_test['LoanAmount'] = df_test['LoanAmount'].fillna(df_test[df_test['LoanAmount'].isnull()].apply(fage, axis=1))
df_test['Loan_Amount_Term'] = df_test['Loan_Amount_Term'].fillna(360)
df_test['Credit_History'] = df_test['Credit_History'].fillna(1)
df_test['Dependents'] = df_test['Dependents'].fillna('0')
df_test['Married'] = df_test['Married'].fillna('Yes')
df_test['Gender'] = df_test['Gender'].fillna('Male')
df_test.apply(lambda x: sum(x.isnull()),axis=0)

Handle types:

In [None]:
var_mod = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area']
le = LabelEncoder()
for i in var_mod:
    df_test[i] = le.fit_transform(df_test[i].astype(str))
df_test.dtypes

Adjust the data frame to our model:

In [None]:
df_test = df_test[['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Credit_History', 'Property_Area']]
df_test.head(10)

Predict the loan status using the model:

In [None]:
prediction = model.predict(df_test)

Convert the result to fit the submission requirement template:

In [None]:
predict = ['Y' if p == 1 else 'N' for p in prediction]

Write the result to the submission file:

In [None]:
submission = {'Loan_ID':loan_ids, 'Loan_Status':predict}
submission = pd.DataFrame(submission)
submission.to_csv("./submission/model2.csv", index=False)  # index=False keeps only the Loan_ID and Loan_Status columns

## AdaBoost method results:

![Model 2 results](resources/method2.png "Model 2 results")