# Loan predictions - Notebook 3: Model & Pipeline

## Problem Statement

We want to automate the loan eligibility process based on the customer details provided when an online application form is filled in. You can find the dataset [here](https://drive.google.com/file/d/1h_jl9xqqqHflI5PsuiQd_soNYxzFfjKw/view?usp=sharing). These details include the customer's Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others.

|Variable|Description|
|:-------------|:-------------|
|Loan_ID|Unique loan ID|
|Gender|Male/Female|
|Married|Applicant married (Y/N)|
|Dependents|Number of dependents|
|Education|Applicant education (Graduate/Not Graduate)|
|Self_Employed|Self-employed (Y/N)|
|ApplicantIncome|Applicant income|
|CoapplicantIncome|Coapplicant income|
|LoanAmount|Loan amount in thousands|
|Loan_Amount_Term|Term of loan in months|
|Credit_History|Credit history meets guidelines|
|Property_Area|Urban/Semi Urban/Rural|
|Loan_Status|Loan approved (Y/N)|



### Explore the problem in the following stages:

1. Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome
2. Data Exploration – looking at categorical and continuous feature summaries and making inferences about the data.
3. Data Cleaning – imputing missing values in the data and checking for outliers
4. Feature Engineering – modifying existing variables and creating new ones for analysis
5. Model Building – making predictive models on the data

In [4]:
import pandas as pd
import numpy as np

df = pd.read_csv("../data/loan_pred.csv") 
df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [5]:
#first split X and y
X = df.drop(['Loan_Status', 'Loan_ID'], axis = 1)
y = df['Loan_Status']

In [6]:
#split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## 4. Building a Predictive Model

In [7]:
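# Feature lists used by the preprocessing steps below
# categorical features (string columns are stored with dtype 'object')
cat_feats = X_train.dtypes[X_train.dtypes == 'object'].index.tolist()
# numeric features: every remaining column
num_feats = [col for col in X_train.columns if col not in cat_feats]

# Own functions for use in a Pipeline (e.g. wrapped in FunctionTransformer):
# each returns only the columns of one feature type
def numFeat(data):
    return data[num_feats]

def catFeat(data):
    return data[cat_feats]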
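The column-selecting helpers above are the kind of function that `FunctionTransformer` can wrap. As a minimal sketch of that idea, the block below combines two such selectors with `FeatureUnion` on a small invented DataFrame. The toy data and the names `toy`, `select_num`, `select_cat`, and `features` are assumptions made for illustration and are not part of the notebook.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Toy frame invented for illustration; it is not the loan dataset.
toy = pd.DataFrame({
    "income": [5000.0, None, 3000.0],
    "area": ["Urban", "Rural", "Urban"],
})
toy_num_feats = ["income"]
toy_cat_feats = ["area"]


def select_num(data):
    """Return only the numeric columns."""
    return data[toy_num_feats]


def select_cat(data):
    """Return only the categorical columns."""
    return data[toy_cat_feats]


# Each branch first selects its columns, then transforms them;
# FeatureUnion stacks the branch outputs side by side.
features = FeatureUnion([
    ("numeric", Pipeline([
        ("select", FunctionTransformer(select_num)),
        ("impute_mean", SimpleImputer(strategy="mean")),
    ])),
    ("categorical", Pipeline([
        ("select", FunctionTransformer(select_cat)),
        ("one_hot", OneHotEncoder(handle_unknown="ignore")),
    ])),
])

transformed = features.fit_transform(toy)
print(transformed.shape)  # one numeric column plus one column per category
```

`ColumnTransformer` does the column selection itself, so with it the selector functions are optional; the `FeatureUnion` form is shown only because it is where custom selectors are actually needed.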

## 5. Using Pipeline
If you have not used pipelines before, move your data preparation, feature engineering, and modeling steps into a `Pipeline`. It will be helpful for deployment.

The goal here is to create a pipeline that takes one row of our dataset and predicts the probability of being granted a loan. Note that `predict` returns the class label (`Y` or `N`); the class probabilities come from `predict_proba`.

`pipeline.predict(x)`
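To illustrate the difference between the label and the probability for a single row, here is a small sketch on invented data. The values, the two columns used, and the names `toy_X`, `toy_y`, and `toy_pipeline` are assumptions for illustration only; the real pipeline is built in the cells that follow.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny invented training set; the values carry no meaning.
toy_X = pd.DataFrame({
    "ApplicantIncome": [6000, 1000, 5500, 7000, 1200, 900],
    "Property_Area": ["Urban", "Rural", "Urban", "Urban", "Rural", "Rural"],
})
toy_y = ["Y", "N", "Y", "Y", "N", "N"]

toy_pipeline = Pipeline([
    ("preprocessing", ColumnTransformer([
        ("numeric", StandardScaler(), ["ApplicantIncome"]),
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["Property_Area"]),
    ])),
    ("model", RandomForestClassifier(n_estimators=10, random_state=0)),
])
toy_pipeline.fit(toy_X, toy_y)

# Double brackets keep a one-row DataFrame (column names are preserved).
one_row = toy_X.iloc[[0]]

label = toy_pipeline.predict(one_row)[0]        # class label, "Y" or "N"
proba = toy_pipeline.predict_proba(one_row)[0]  # one probability per class

# Column order of predict_proba follows classes_, so look up "Y" explicitly.
approved_idx = list(toy_pipeline.classes_).index("Y")
print(label, proba[approved_idx])
```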

In [8]:
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

In [9]:
# Preprocess numeric and categorical columns separately with a ColumnTransformer
# (FunctionTransformer() without a function is an identity step)

numeric_transform = Pipeline([('FunctionTransformer', FunctionTransformer()),
                              ('impute_mean', SimpleImputer(strategy='mean')),
                              ('scaling', StandardScaler())])

# scikit-learn >= 1.2 names this argument sparse_output; older versions use sparse=False
categorical_transform = Pipeline([('FunctionTransformer', FunctionTransformer()),
                                  ('impute_mode', SimpleImputer(strategy='most_frequent')),
                                  ('one-hot-encode', OneHotEncoder(sparse_output=False))])

preprocessing = ColumnTransformer([('numeric', numeric_transform, num_feats),
                                   ('categorical', categorical_transform, cat_feats)])

pipeline = Pipeline([('preprocessing', preprocessing), ('model', RandomForestClassifier())])

pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)  # mean accuracy on the test set
print(f'Test set score: {score}')

Test set score: 0.7642276422764228


Try a parameter grid search to improve the results

In [10]:
rf = RandomForestClassifier()

In [11]:
from sklearn.model_selection import GridSearchCV

In [13]:
# set up our parameters grid
param_grid = {"model__n_estimators":[2, 5, 10], "model__max_depth":[2, 4, 6]}

# create a Grid Search object
grid_search = GridSearchCV(pipeline, param_grid, n_jobs = -1, verbose=10, refit=True)    

# fit the model and tune parameters
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


GridSearchCV(estimator=Pipeline(steps=[('preprocessing',
                                        ColumnTransformer(transformers=[('numeric',
                                                                         Pipeline(steps=[('FunctionTransformer',
                                                                                          FunctionTransformer()),
                                                                                         ('impute_mean',
                                                                                          SimpleImputer()),
                                                                                         ('scaling',
                                                                                          StandardScaler())]),
                                                                         ['ApplicantIncome',
                                                                          'CoapplicantIncome',
                   

In [14]:
print(grid_search.best_params_)

{'model__max_depth': 4, 'model__n_estimators': 10}
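The best parameters alone do not say whether the tuned pipeline generalizes. A minimal sketch of how the search result could be checked on held-out data is given below. It uses synthetic data from `make_classification` instead of the loan dataset, and every name in it (`demo_pipeline`, `demo_search`, and so on) is an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with binary 0/1 labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_pipeline = Pipeline([
    ("scaling", StandardScaler()),
    ("model", RandomForestClassifier(random_state=42)),
])
# The "model__" prefix routes each parameter to the step named "model".
demo_grid = {"model__n_estimators": [2, 5, 10], "model__max_depth": [2, 4, 6]}

demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=5, refit=True)
demo_search.fit(Xd_train, yd_train)

# Mean cross-validated accuracy of the best parameter combination.
cv_accuracy = demo_search.best_score_
# With refit=True the search object predicts with the refitted best pipeline.
test_accuracy = demo_search.score(Xd_test, yd_test)
test_f1 = f1_score(yd_test, demo_search.predict(Xd_test))

print(demo_search.best_params_)
print(f"CV accuracy: {cv_accuracy:.3f}, test accuracy: {test_accuracy:.3f}, test F1: {test_f1:.3f}")
```

With the loan data the labels are the strings `Y` and `N`, so `f1_score` would need `pos_label='Y'`.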


## 6. Deploy your model to the cloud and test it with Postman, Bash or Python

In [16]:
import pickle

In [18]:
# save the fitted grid search (refit=True, so it predicts with the best pipeline)
with open("../src/model.p", "wb") as f:
    pickle.dump(grid_search, f)
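Before copying the pickle to a server, it can help to confirm that it loads back and predicts the same way. The sketch below does such a round trip with a small stand-in model and a temporary directory; the model, the data, and the path are assumptions for illustration, not the notebook's `../src/model.p`.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model fitted on synthetic data.
X_demo, y_demo = make_classification(n_samples=50, n_features=4, random_state=0)
model = RandomForestClassifier(n_estimators=5, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.p")

    # "with" closes the file handle even if dumping fails.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print(same_predictions)
```

A pickle should be loaded with the same scikit-learn version that created it, which is one reason for the environment check on the cloud machine.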

In [19]:
# At this point, open Jupyter through tmux
# add the pickle and the app.py file to the Jupyter folder structure
# run the app.py file in the AWS tmux terminal

In [None]:
# cloud:
# - environment check
# - pickle
# - py file

In [None]:
# put my new app in the Jupyter folder, then run it in the AWS terminal