# Loan Granting Prediction
## Adura ABIONA, PhD (UNSW)
### 12 February, 2017

## Introduction
This project is based on the Cortana Intelligence Gallery Competition. The competition was part of the requirements of the Data Science Professional Project for the **Microsoft Professional Program in Data Science (MPP-DS)**. It started on 12/12/2016 and ended on 1/29/2017. The evaluation criterion for a passing grade in the accompanying course is a **Private Score Highest** of at least **70%**.

### Short Description for the Required Task 
This competition concerns loan data. When a customer applies for a loan, banks and other credit providers use statistical models to determine whether or not to grant the loan based on the likelihood of the loan being repaid. The factors involved in determining this likelihood are complex, and extensive statistical analysis and modelling are required to predict the outcome for each individual case. The task is to implement a model that predicts loan repayment or default based on the data provided.

### Dataset Information
The dataset used in this competition consists of synthetic data generated specifically for this project and designed to exhibit characteristics similar to genuine loan data. It contains over 111,000 loan records, which are used to determine the best way to predict whether a loan applicant will fully repay or default on a loan.

More information on it can be found at [**Loan Granting Binary Classification December 2016**](https://gallery.cortanaintelligence.com/Competition/1ad7a6df99794816b9bc071e27d46b10).

In [None]:
import glob, os, string
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
%matplotlib inline 
print(pd.__version__)

import seaborn as sns  # Seaborn, useful for graphics
# JB's favorite Seaborn settings for notebooks
rc = {'lines.linewidth': 2, 
      'axes.labelsize': 18, 
      'axes.titlesize': 18, 
      'axes.facecolor': 'DFDFE5'}
sns.set_context('notebook', rc=rc)
sns.set_style('darkgrid', rc=rc)
# Custom colour palette (hex codes)
flatui = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71"]

In [None]:
dfLoan = pd.read_csv("LoansTrainingDataset.csv")
print(dfLoan.shape)
dfLoan.head(2)

In [None]:
# Count the missing values in each column
dfLoan.isnull().sum()

There are **111,107 rows and 19 columns** in this dataset. In **59,003 rows** (more than 50% of the records), the *Months since last delinquent* column alone is missing. With so many gaps, this column would contribute little information to the model, so I will drop it.

**21,338 rows** still have missing values in the ***Credit Score*** and ***Annual Income*** columns. This is a significant number, and the gaps could be treated by replacing them with the mean or median of each column, or by forward/backward filling. However, given how many rows are affected and how important annual income and credit score are to an applicant's ability to repay the loan, imputing these values would likely distort the modelling result. Therefore, I drop these rows instead.
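As a rough illustration of the trade-off described above, the sketch below uses a small made-up frame (the column names mirror the dataset, but the values are invented) to compare the share of missing values per column with the number of rows that survive after dropping.

```python
import numpy as np
import pandas as pd

# Toy data: the values are invented purely for illustration.
toy = pd.DataFrame({
    'Credit Score': [710.0, np.nan, 650.0, 720.0],
    'Annual Income': [50000.0, np.nan, 42000.0, 61000.0],
    'Months since last delinquent': [np.nan, np.nan, 12.0, np.nan],
})

# Fraction of missing values in each column.
missing_share = toy.isnull().mean()
print(missing_share)

# Drop the mostly-empty column first, then drop rows that still have gaps.
toy_clean = toy.drop(columns=['Months since last delinquent']).dropna()
print(toy_clean.shape)
```

Dropping the sparse column before dropping rows matters: calling `dropna()` on the full toy frame would keep only the single row that has no gaps at all.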

In [None]:
print(dfLoan.shape)
# Removes columns with missing values 
dfLoan1 = dfLoan.drop(labels=['Months since last delinquent'], axis=1)

# Removes rows with missing values    
dfLoan1 = dfLoan1[~pd.isnull(dfLoan1).any(axis=1)]

print(dfLoan1.shape)
dfLoan1.head(2)

In [None]:
## Checking feature types
def coltype(df):
        datatype1 = df.select_dtypes(exclude = ['object']).dtypes
        print('**** Number of numeric features = {} ****'.format(datatype1.count()))
        print(datatype1)
    
        datatype2 = df.select_dtypes(include = ['object']).dtypes
        print('\n**** Number of non-numeric features = {} ****'.format(datatype2.count()))
        print(datatype2)
        
coltype(dfLoan1)

In [None]:
# List unique elements of each feature and also detects datatype misclassification 
def uniqueCatList(df):
    print('Frame shape: {}'.format(df.shape))
    for col in df.select_dtypes(include = ['object']).columns.tolist():
        print('\nNumber of unique members[{}] = {}'.format(col,  len(df[col].unique().tolist())))
        print(df[col].unique())
#numeric_feats = train.dtypes[train.dtypes != "object"].index
uniqueCatList(dfLoan1)  

In [None]:
dfLoan1.groupby('Years in current job')['Loan ID'].nunique()

In [None]:
print(dfLoan1.shape)
dfLoan1 = dfLoan1[dfLoan1['Years in current job'] != 'n/a']
print(dfLoan1.shape)

With the `uniqueCatList` function above, we can detect that the ***Monthly Debt*** and ***Maximum Open Credit*** features are numeric but contain some elements that were wrongly entered as strings.

In [None]:
# List non-numeric elements in numeric features
def DetectNonNumeric(dfx, nnCol):
    for colx in nnCol:
        nonNumList = []
        for el in dfx[colx].unique().tolist():
            try:
                float(el)
            except (TypeError, ValueError):  # only catch conversion failures
                nonNumList.append(el)
        print('{}: {}'.format(colx, nonNumList))
        
DetectNonNumeric(dfLoan1, ['Monthly Debt','Maximum Open Credit'])  

The code block below munges the ***Monthly Debt*** and ***Maximum Open Credit*** columns into the correct format.

In [None]:
# Converting feature elements to numeric and forcing unparseable strings to NaN
monthlyDebt = dfLoan1['Monthly Debt'].astype(str).str.replace('$', '', regex=False)  # strip currency sign
dfLoan1['Monthly Debt'] = pd.to_numeric(monthlyDebt, errors='coerce')

dfLoan1['Maximum Open Credit'] = pd.to_numeric(dfLoan1['Maximum Open Credit'], errors='coerce')
# Fill missing values with the column mean (plain assignment instead of in-place fillna on a column)
dfLoan1['Maximum Open Credit'] = dfLoan1['Maximum Open Credit'].fillna(dfLoan1['Maximum Open Credit'].mean())

DetectNonNumeric(dfLoan1, ['Monthly Debt','Maximum Open Credit']) 
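To make the effect of `errors='coerce'` concrete, here is a small sketch on invented values. The entries below, including the `'#VALUE!'` placeholder, are assumptions for illustration and are not taken from the dataset.

```python
import pandas as pd

# Invented raw entries: a currency string, a plain number, a junk token, and a zero.
raw = pd.Series(['$1051.41', '892.09', '#VALUE!', '0'])

# Strip the literal '$' and convert; anything unparseable becomes NaN.
cleaned = pd.to_numeric(raw.str.replace('$', '', regex=False), errors='coerce')
print(cleaned.tolist())

# Number of entries that could not be converted.
n_bad = int(cleaned.isnull().sum())
print(n_bad)
```

Counting the coerced NaN values before filling or dropping them is a cheap sanity check on how much data the conversion silently discarded.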

In [None]:
# Removing duplicate rows
print(dfLoan1.shape)
dfLoan1.drop_duplicates(['Loan ID'], inplace = True) 
print(dfLoan1.shape)

In [None]:
dfLoan1.describe()

In [None]:
# Getting the numeric features
datacols = dfLoan1.select_dtypes(exclude = ['object']).columns.tolist()
print(datacols)
len(datacols)

In [None]:
datacol1 = ['Current Loan Amount', 'Credit Score', 'Annual Income', 'Monthly Debt', 
            'Years of Credit History', 'Number of Open Accounts']
sns.pairplot(dfLoan1, vars=datacol1, height=3, kind="reg") # `height` replaces the older `size` argument (seaborn >= 0.9)

In [None]:
datacol2 = ['Current Loan Amount', 'Number of Credit Problems', 'Current Credit Balance', 
            'Maximum Open Credit', 'Bankruptcies', 'Tax Liens']
sns.pairplot(dfLoan1, vars=datacol2, height=3, kind="reg") # `height` replaces the older `size` argument (seaborn >= 0.9)

In [None]:
# 'Current Loan Amount', 'Credit Score', 'Annual Income', 'Monthly Debt', 'Years of Credit History', 'Number of Open Accounts'
def plot_outlier(x,y):
    fig,axs=plt.subplots(1,2,figsize=(8,3))
    sns.boxplot(y=y, ax=axs[0])       # vertical box plot of the second variable
    sns.regplot(x=x, y=y, ax=axs[1])  # scatter plot with a fitted regression line
    plt.tight_layout()

In [None]:
dfx0 = dfLoan1[dfLoan1['Current Loan Amount'] <  0]
print(dfx0.shape)
dfx1 = dfLoan1[dfLoan1['Current Loan Amount'] <  0.5E8]
print(dfx1.shape)
dfx2 = dfLoan1[dfLoan1['Current Loan Amount'] >  0.5E8]
dfx2.shape

In [None]:
dfLoan1 = dfLoan1[dfLoan1['Annual Income'] <  0.2E7]
plot_outlier(dfLoan1['Current Loan Amount'], dfLoan1['Annual Income'])

In [None]:
dfLoan1 = dfLoan1[dfLoan1['Monthly Debt'] <  15000]

plot_outlier(dfLoan1['Current Loan Amount'], dfLoan1['Monthly Debt'])

In [None]:
dfLoan1 = dfLoan1[dfLoan1['Current Credit Balance'] <  1000000]

plot_outlier(dfLoan1['Current Loan Amount'], dfLoan1['Current Credit Balance'])

In [None]:
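# NOTE: 'Annual Income' was already restricted to < 0.2E7 above, so this filter removes nothing;
# a cap on 'Maximum Open Credit' may have been intended here.
dfLoan1 = dfLoan1[dfLoan1['Annual Income'] <  3.0E7]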

plot_outlier(dfLoan1['Current Loan Amount'], dfLoan1['Maximum Open Credit'])

In [None]:
dfLoan1.describe()

In [None]:
def plotcorrFloat(df):
    corr = df[['Current Loan Amount', 'Credit Score', 'Annual Income', 'Monthly Debt', 'Years of Credit History', 
               'Number of Open Accounts', 'Number of Credit Problems', 'Current Credit Balance', 
               'Maximum Open Credit', 'Bankruptcies', 'Tax Liens']].corr()
    colormap = plt.cm.viridis
    plt.figure(figsize=(12, 12))
    plt.title('Pearson Correlation of Features', y=1.05, size=15)
    return sns.heatmap(corr, vmax=1, square=True, linewidths=0.1, cmap=colormap, linecolor='white', annot=True)

# 'Current Loan Amount', 'Credit Score', 'Annual Income', 'Monthly Debt', 'Years of Credit History', 'Number of Open Accounts'
plotcorrFloat(dfLoan1)

## Model prediction
For the model prediction, I will be using a classifier of gradient boosted decision trees to predict whether a loan applicant will fully repay or default on a loan. In particular, I will use XGBoost. It is an algorithm that has recently been dominating applied machine learning and [**Kaggle competitions**](https://www.kaggle.com/competitions) for structured data.

[**XGBoost**](http://xgboost.readthedocs.io/en/latest/model.html) is an implementation of gradient boosted decision trees designed for speed and performance. [**Szilard Pafka's excellent benchmark**](https://github.com/szilard/benchm-ml) of a variety of machine learning libraries attests to XGBoost's fast computation speed.
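For readers new to boosting, the sketch below shows the core idea in its simplest form: each new tree (here a one-split "stump") is fitted to the residuals of the current ensemble and added with a small learning rate. This is a toy squared-loss regression written from scratch under my own assumptions, not XGBoost's actual algorithm, which additionally uses second-order gradient information and regularization. The helper names `fit_stump` and `predict_stump` are hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold split of a 1-D feature that minimizes squared error."""
    best = None
    for t in np.unique(x)[:-1]:  # exclude the maximum so the right side is never empty
        left, right = residual[x <= t], residual[x > t]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return t, left_value, right_value

def predict_stump(stump, x):
    t, left_value, right_value = stump
    return np.where(x <= t, left_value, right_value)

# Synthetic regression data (invented for illustration).
rng = np.random.RandomState(0)
x = np.linspace(0, 6, 80)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

learning_rate = 0.3
pred = np.full_like(y, y.mean())          # start from a constant prediction
mse_start = np.mean((y - pred) ** 2)
for _ in range(50):
    stump = fit_stump(x, y - pred)        # residuals = negative gradient of squared loss
    pred += learning_rate * predict_stump(stump, x)
mse_end = np.mean((y - pred) ** 2)
print(mse_start, mse_end)
```

The training error shrinks with every round; in practice the number of rounds, the tree depth, and the learning rate are tuned on held-out data to avoid overfitting.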

In [None]:
#import model libraries
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn.preprocessing import StandardScaler 
import scipy.stats as st
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
#Import evaluation metrics
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score,  roc_curve, auc, precision_score, recall_score
import itertools

In [None]:
# The confusion-matrix plotting helper, adapted from
# http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html,
# is defined in a later cell, just before the model report.
# Compute and plot the ROC curve and its area under the curve (AUC)
def plot_ROCcurve(Ytest, pred, title = "ROC Curve"):
    fpr, tpr, _ = roc_curve(Ytest, pred)
    roc_auc = auc(fpr, tpr)

    plt.figure()
    lw = 2
    plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(title)
    plt.legend(loc="lower right")
    return plt.show()

## Applying Ordinal Encoding to Categoricals
We need to convert the string-valued features into integer codes to make processing simpler. The columns ***Loan Status***, ***Term***, ***Years in current job***, ***Home Ownership*** and ***Purpose*** hold category labels stored as strings. They must be encoded as numbers because XGBoost, like most machine learning libraries in Python, requires every feature vector to contain only numeric values.
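The following minimal sketch shows what `pd.Categorical(...).codes` does on a few illustrative labels (the example values are assumptions chosen for the demonstration). Categories are sorted, so the codes follow alphabetical order, and missing values would receive the code `-1`.

```python
import pandas as pd

# Illustrative labels only.
term = pd.Series(['Short Term', 'Long Term', 'Short Term'])

cat = pd.Categorical(term)
print(list(cat.categories))   # sorted labels
print(cat.codes.tolist())     # integer code for each row

# Keep the code-to-label mapping so predictions can be decoded later.
mapping = dict(enumerate(cat.categories))
print(mapping)
```

Saving `mapping` is worthwhile because the integer codes carry no meaning on their own once the original strings are overwritten.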

In [None]:
# Encode the categorical features (and the target) as integer codes
CategLx = ['Loan Status', 'Term', 'Years in current job', 'Home Ownership', 'Purpose'] # Categorical features 
dfLoan2 = dfLoan1.copy()  # copy so that dfLoan1 keeps the original string labels
for fea in CategLx: # Loop through the categorical columns only
    dfLoan2[fea] = pd.Categorical(dfLoan2[fea]).codes # Replace labels with integer codes

In [None]:
#Separate target from other features: input (X) features, target (y) feature & label (Z) feature
#Z = dfLoan1['Loan ID']
Y = dfLoan2['Loan Status']
X = dfLoan2.drop(['Loan Status', 'Loan ID', 'Customer ID'], axis=1)

In [None]:
numCol = dfLoan2.select_dtypes(exclude = ['object']).columns.tolist()
print(numCol)
print(len(numCol))

catCol = dfLoan2.select_dtypes(include = ['object']).columns.tolist()
print(catCol)
len(catCol)

In [None]:
#Split data into train and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.45, random_state=123)

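#### The next step is to separate the features into numeric and categorical groups so that the numeric features can be standardized. Tree-based models such as XGBoost are largely insensitive to monotonic rescaling of features, so this step is optional here.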

In [None]:
CategLs = ['Term', 'Years in current job', 'Home Ownership', 'Purpose'] # Categorical features 
X_train_Cat = X_train[CategLs]
X_test_Cat = X_test[CategLs]

X_train_Num = X_train.drop(CategLs, axis=1)
X_test_Num = X_test.drop(CategLs, axis=1)

In [None]:
scaler = StandardScaler() # create scaler object
scaler.fit(X_train_Num) # fit with the training data ONLY
X_train_Num = scaler.transform(X_train_Num) 
X_test_Num = scaler.transform(X_test_Num) 
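The scaling step amounts to subtracting the training mean and dividing by the training standard deviation, then reusing those same statistics on the test set. A minimal NumPy sketch on made-up numbers:

```python
import numpy as np

# Invented numbers: two features on very different scales.
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[4.0, 400.0]])

# Statistics come from the training data only.
mu = train.mean(axis=0)
sigma = train.std(axis=0)   # population std (ddof=0), which StandardScaler also uses

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma   # test data reuses the training statistics
print(train_scaled)
print(test_scaled)
```

Computing `mu` and `sigma` on the training rows alone keeps information about the test set from leaking into the model.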

In [None]:
X_train.drop(CategLs, axis=1).head().columns

In [None]:
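# Rebuild DataFrames: scaled numeric columns followed by the encoded categorical columns
numCols = X_train.drop(CategLs, axis=1).columns.tolist()  # same column order that was passed to the scaler
X_train_tot = pd.concat([pd.DataFrame(X_train_Num, columns = numCols), X_train_Cat.reset_index(drop=True)], axis=1)
X_test_tot = pd.concat([pd.DataFrame(X_test_Num, columns = numCols), X_test_Cat.reset_index(drop=True)], axis=1)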
print(X_train_Num.shape)
print(X_train_tot.shape)
X_train_tot.head()
X_train_tot.tail()

In [None]:
one_to_left = st.beta(10, 1)  
from_zero_positive = st.expon(0, 50)

params = {  
    "n_estimators": st.randint(3, 40),
    "max_depth": st.randint(3, 40),
    "learning_rate": st.uniform(0.05, 0.4),
    "colsample_bytree": one_to_left,
    "subsample": one_to_left,
    "gamma": st.uniform(0, 10),
    'reg_alpha': from_zero_positive,
    "min_child_weight": from_zero_positive
}

xgbclf = XGBClassifier(nthread=-1)
rsCV = RandomizedSearchCV(xgbclf, params, n_jobs=1, random_state=123)  # fixed seed for reproducible sampling
rsCV.fit(X_train_tot, Y_train)
print(rsCV.best_params_, rsCV.best_score_)

clf = XGBClassifier(**rsCV.best_params_)
clf.fit(X_train_tot, Y_train)

predXtest = clf.predict(X_test_tot)
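To see what ranges the search distributions in the cell above actually cover, the sketch below draws samples from each of them. Note that `scipy.stats` parameterizes `uniform(loc, scale)` as the interval `[loc, loc + scale]` and that the upper bound of `randint` is exclusive. The seed and sample size are arbitrary choices of mine.

```python
import scipy.stats as st

seed = 0  # arbitrary seed, only to make the draws repeatable

lr = st.uniform(0.05, 0.4).rvs(size=1000, random_state=seed)    # uniform on [0.05, 0.45]
depth = st.randint(3, 40).rvs(size=1000, random_state=seed)     # integers from 3 to 39
frac = st.beta(10, 1).rvs(size=1000, random_state=seed)         # in (0, 1], concentrated near 1
weight = st.expon(0, 50).rvs(size=1000, random_state=seed)      # non-negative, mean 50

for name, sample in [('learning_rate', lr), ('max_depth', depth),
                     ('subsample', frac), ('min_child_weight', weight)]:
    print(name, sample.min(), sample.max())
```

Because `RandomizedSearchCV` tries only 10 parameter settings by default (`n_iter=10`), wide distributions like these are sampled quite sparsely.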

In [None]:
def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix. Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
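# Print model report
print ("\nModel Report")
print ("Accuracy : %.4g" % accuracy_score(Y_test, predXtest))
# AUC is evaluated on the test set, using the predicted probability of the positive class
print ("AUC Score (Test): %f" % roc_auc_score(Y_test, clf.predict_proba(X_test_tot)[:, 1]))

# Compute confusion matrix (scikit-learn expects the true labels first, then the predictions)
cnf_matrix = confusion_matrix(Y_test, predXtest)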
# Class names
class_names = ['Negative', 'Positive']
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix, without normalization')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True, title='Normalized confusion matrix')
plt.show()
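When reading the confusion matrix, it helps to remember the layout that scikit-learn uses: rows are true classes and columns are predicted classes. The sketch below builds such a matrix by hand on invented labels and derives precision and recall from it.

```python
import numpy as np

# Invented labels for illustration: 1 = positive class, 0 = negative class.
y_true = np.array([1, 1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 1])

# Rows index the true class, columns the predicted class.
cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1
print(cm)

tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
print(precision, recall)
```

Swapping the two arguments transposes the matrix, which exchanges false positives with false negatives and therefore precision with recall.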

In [None]:
#plot_confusion_matrix(confusion_matrix(Y_test, predXtest), 'Loan Confusion Matrix', savefilename='Loan CM.png')
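# Use the predicted probability of the positive class so the ROC curve has more than one operating point
plot_ROCcurve(Y_test, clf.predict_proba(X_test_tot)[:, 1], "Loan ROC Curve")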
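The ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties count as one half). The sketch below computes it directly from that definition on invented scores and shows how thresholding the scores into hard labels changes the result. The helper `pairwise_auc` is a hypothetical name, not a library function.

```python
import numpy as np

def pairwise_auc(y_true, score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    score = np.asarray(score, dtype=float)
    pos = score[y_true == 1]
    neg = score[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # every positive compared with every negative
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size)

# Invented example: true labels and predicted probabilities.
y_true = [0, 0, 1, 1, 1]
proba = [0.2, 0.6, 0.55, 0.7, 0.9]
labels = [int(p >= 0.5) for p in proba]   # hard labels at a 0.5 threshold

auc_proba = pairwise_auc(y_true, proba)
auc_labels = pairwise_auc(y_true, labels)
print(auc_proba, auc_labels)
```

Hard labels collapse many distinct scores into ties, so an AUC computed from them reflects a single threshold rather than the full ranking quality of the model.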

In [None]:
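# Plot feature importances (F score: number of times each feature is used in a split)
feat_imp = pd.Series(clf.get_booster().get_fscore()).sort_values(ascending=False)  # get_booster() replaces the older booster()
feat_imp.plot(kind='barh', title='Feature Importances')
plt.xlabel('Feature Importance Score')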