# Loan predictions
## Problem Statement
##### We want to automate the loan eligibility process using the customer details provided when an online application form is filled in. These details include the customer's Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. You can find the dataset here.

|Variable| Description|
|--- |---|
|Loan_ID| Unique Loan ID|
|Gender| Male/ Female|
|Married| Applicant married (Y/N)|
|Dependents| Number of dependents|
|Education| Applicant Education (Graduate/ Under Graduate)|
|Self_Employed| Self employed (Y/N)|
|ApplicantIncome| Applicant income|
|CoapplicantIncome| Coapplicant income|
|LoanAmount| Loan amount in thousands|
|Loan_Amount_Term| Term of loan in months|
|Credit_History| Credit history meets guidelines|
|Property_Area| Urban/ Semi Urban/ Rural|
|Loan_Status| Loan approved (Y/N)|

# Part 3

In [4]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.compose import make_column_transformer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split

### Separate target variable and split train test data

In [22]:
%store -r cleaned_df
%store -r df
%store -r df_cat
%store -r df_num

In [23]:
# Separate the target variable
y_df = cleaned_df['Loan_Status']

cleaned_df = cleaned_df.drop(['Loan_Status'], axis=1)

In [24]:
# Split the training and testing data from the cleaned dataframe

X_train, X_test, y_train, y_test = train_test_split(cleaned_df, y_df, test_size = 0.3, random_state = 42)
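
A plain random split does not guarantee that approved and rejected loans appear in the same proportion in the training and test sets. As a minimal sketch on toy data (the names `X_toy` and `y_toy` and their values are made up for illustration and are not taken from `cleaned_df`), passing `stratify` to `train_test_split` keeps the class ratio the same on both sides:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for cleaned_df / y_df (hypothetical values, not the real data)
X_toy = pd.DataFrame({"ApplicantIncome": range(100)})
y_toy = pd.Series(["Y"] * 70 + ["N"] * 30, name="Loan_Status")

# stratify=y_toy asks for the same Y/N proportion in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

train_share = (y_tr == "Y").mean()
test_share = (y_te == "Y").mean()
print(train_share, test_share)
```

Whether stratification changes the scores reported in this notebook is not known; it mainly makes the test score less dependent on the particular split.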

In [25]:
print(X_train.shape)
print(y_train.shape)

(410, 12)
(410,)


In [26]:
print(X_test.shape)
print(y_test.shape)

(176, 12)
(176,)


## 4. Building a Predictive Model

### One hot encode the categorical variables

In [27]:
# Return a dense array; note that scikit-learn 1.2+ renames `sparse` to `sparse_output`
ohe = OneHotEncoder(sparse=False)

In [28]:
cat = make_column_transformer(
    (ohe, ['Gender', 'Married',
           'Dependents', 'Education',
           'Self_Employed', 'Loan_Amount_Term',
           'Credit_History', 'Property_Area']),
    remainder='passthrough'
)

In [29]:
cat_ = cat.fit_transform(cleaned_df)

### Run PCA on Categorical one hot encoded variables

In [30]:
# Select number of principal components
pca = PCA(n_components=3)

# Fit the data
pca.fit(cat_)
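
Because `cat` was built with `remainder='passthrough'`, `cat_` holds the numeric columns alongside the one-hot columns. The sketch below uses synthetic data (the `one_hot_like` and `income_like` arrays are hypothetical, not the loan data) to illustrate why that can matter: PCA is variance-based, so a single unscaled, large-valued column may dominate the first component. `explained_variance_ratio_` is a quick way to check.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-ins: six binary indicator columns and one income-like column
one_hot_like = rng.integers(0, 2, size=(200, 6)).astype(float)
income_like = rng.normal(5000, 3000, size=(200, 1))
mixed = np.hstack([one_hot_like, income_like])

# Share of variance captured by each of the first three components
ratios_mixed = PCA(n_components=3).fit(mixed).explained_variance_ratio_
ratios_binary = PCA(n_components=3).fit(one_hot_like).explained_variance_ratio_

print("with unscaled numeric column:", ratios_mixed)
print("binary columns only:         ", ratios_binary)
```

In the pipeline defined later, the PCA step only receives the categorical columns, so this concern applies to the exploratory fit on `cat_` rather than to the final model.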

### Use selectKBest on numeric variables

In [31]:
# Instantiate SelectKBest Variable
selection = SelectKBest(k=3)
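
`SelectKBest` scores each feature against the target (with `f_classif` by default) and keeps the `k` highest-scoring ones. Since the numeric branch of the pipeline below receives three columns, `k=3` keeps all of them. A small sketch on synthetic data (the arrays `X_num` and `y_toy` are made up for illustration) shows the difference between `k=3` and a smaller `k`:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Three synthetic numeric features; only the first one determines the label
X_num = rng.normal(size=(60, 3))
y_toy = np.where(X_num[:, 0] > 0, "Y", "N")

keep_all = SelectKBest(score_func=f_classif, k=3).fit(X_num, y_toy)
keep_one = SelectKBest(score_func=f_classif, k=1).fit(X_num, y_toy)

print(keep_all.get_support())  # boolean mask of retained columns
print(keep_one.get_support())
```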

### Instantiate Standard Scaler

In [38]:
scaler=StandardScaler()

### Get the Classifiers

In [39]:
from sklearn.ensemble import GradientBoostingClassifier

Gb_clf = GradientBoostingClassifier()

In [40]:
# define individual transformers in a pipeline

categorical_preprocessing = Pipeline([('imputation', SimpleImputer(strategy='most_frequent')),
                                      ('ohe', OneHotEncoder(sparse=False)),
                                     ('PCA', pca)])

numerical_preprocessing = Pipeline([('imputation', SimpleImputer(strategy='mean')),
                                    ('scale', StandardScaler()),
                                   ('selectK_best', selection)])

In [41]:
# define which transformer applies to which columns
preprocess = ColumnTransformer([
    ('categorical_preprocessing', categorical_preprocessing, ['Gender', 'Married',
                                                              'Dependents', 'Education',
                                                              'Self_Employed', 'Loan_Amount_Term',
                                                              'Credit_History', 'Property_Area']),
    ('numerical_preprocessing', numerical_preprocessing, ['ApplicantIncome', 'CoapplicantIncome',
                                                          'LoanAmount'])
],remainder="passthrough")

preprocess

In [42]:
# create the final pipeline with preprocessing steps and 
# the final classifier step
pipeline = Pipeline([
    ('preprocess', preprocess),
    ('best_model', Gb_clf)
])

pipeline

In [43]:
X_train.head()

| |ApplicantIncome|CoapplicantIncome|LoanAmount|TotalIncome|Gender|Married|Dependents|Education|Self_Employed|Loan_Amount_Term|Credit_History|Property_Area|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|317|2058.0|2134.0|88.0|4192.0|Male|Yes|0|Graduate|No|30 year|Yes|Urban|
|385|3667.0|0.0|113.0|3667.0|Male|No|1|Graduate|No|15 year|Yes|Urban|
|309|7667.0|0.0|185.0|7667.0|Male|Yes|2|Not Graduate|No|30 year|Yes|Rural|
|399|1500.0|1800.0|103.0|3300.0|Female|No|0|Graduate|No|30 year|No|Semiurban|
|254|16250.0|0.0|192.0|16250.0|Male|No|0|Graduate|Yes|30 year|No|Urban|


In [44]:
y_train.head()

317    Y
385    Y
309    Y
399    N
254    N
Name: Loan_Status, dtype: object

In [45]:
# Fit the pipeline on the training data

pipeline.fit(X_train, y_train)

In [46]:
pipeline.score(X_test, y_test)

0.7386363636363636

In [47]:
pipeline.predict(X_test)

array(['Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y',
       'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N',
       'Y', 'Y', 'Y', 'N', 'N', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'N', 'Y',
       'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y',
       'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y',
       'N', 'N', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'Y',
       'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'Y',
       'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'Y',
       'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'Y', 'Y',
       'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'Y',
       'N', 'N', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'Y',
       'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y',
       'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y',
       'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y'], dtype=object)
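
Most of the predictions above are `'Y'`, so accuracy alone may say little about how well rejections are detected. One way to put the score in context is to compare it with a majority-class baseline and to look at a confusion matrix. The sketch below uses a made-up label vector (`y_true`) rather than the notebook's `y_test`, so the numbers are illustrative only:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 7 approved, 3 rejected
y_true = np.array(["Y"] * 7 + ["N"] * 3)
X_dummy = np.zeros((10, 1))  # features are ignored by the baseline

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
baseline_acc = baseline.score(X_dummy, y_true)

# Rows are true classes, columns are predicted classes, in the order N, Y
cm = confusion_matrix(y_true, baseline.predict(X_dummy), labels=["N", "Y"])
print(baseline_acc)
print(cm)
```

Applied to the real data, the same two calls on `X_test` and `y_test` would show whether the pipeline does better than always answering `'Y'`.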

### Use GridSearch to improve predictive scores

In [48]:
from sklearn.model_selection import GridSearchCV

In [49]:
param_grid = {'best_model__max_depth': [1,2,3],
              'best_model__n_estimators': [1,2,3],
              'preprocess__categorical_preprocessing__PCA__n_components': [1,2,3]
             }
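
The keys in `param_grid` are paths through the nested estimators, with step names joined by double underscores. A minimal sketch for checking such keys before running a search is to compare them against `get_params()`. The pipeline below is a cut-down, hypothetical copy that reuses the step names from this notebook:

```python
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline

# Cut-down pipeline with the same step names as above (for illustration only)
toy_pipeline = Pipeline([
    ("preprocess", ColumnTransformer([
        ("categorical_preprocessing",
         Pipeline([("PCA", PCA(n_components=3))]),
         ["Gender"]),
    ])),
    ("best_model", GradientBoostingClassifier()),
])

valid_keys = set(toy_pipeline.get_params().keys())

wanted = [
    "best_model__max_depth",
    "best_model__n_estimators",
    "preprocess__categorical_preprocessing__PCA__n_components",
]
# Map each grid key to whether the pipeline actually exposes it
key_check = {key: key in valid_keys for key in wanted}
print(key_check)
```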

In [None]:
# create a Grid Search object
grid = GridSearchCV(pipeline, param_grid, n_jobs=-1, verbose=10, refit=True)

grid.fit(X_train, y_train)

In [51]:
best_param_grid = grid.best_params_
best_param_grid

{'best_model__max_depth': 1,
 'best_model__n_estimators': 1,
 'preprocess__categorical_preprocessing__PCA__n_components': 1}

In [52]:
print('Test accuracy: ', grid.score(X_test, y_test))

Test accuracy:  0.6420454545454546
[CV 3/5; 1/27] START best_model__max_depth=1, best_model__n_estimators=1, preprocess__categorical_preprocessing__PCA__n_components=1
[CV 3/5; 1/27] END best_model__max_depth=1, best_model__n_estimators=1, preprocess__categorical_preprocessing__PCA__n_components=1;, score=0.707 total time=   0.2s
[CV 3/5; 3/27] START best_model__max_depth=1, best_model__n_estimators=1, preprocess__categorical_preprocessing__PCA__n_components=3
[CV 3/5; 3/27] END best_model__max_depth=1, best_model__n_estimators=1, preprocess__categorical_preprocessing__PCA__n_components=3;, score=0.707 total time=   0.1s
[CV 2/5; 4/27] START best_model__max_depth=1, best_model__n_estimators=2, preprocess__categorical_preprocessing__PCA__n_components=1
[CV 2/5; 4/27] END best_model__max_depth=1, best_model__n_estimators=2, preprocess__categorical_preprocessing__PCA__n_components=1;, score=nan total time=   0.1s
[CV 1/5; 5/27] START best_model__max_depth=1, best_model__n_estimators
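
A `score=nan` entry in the log means that fitting or scoring raised an error in that fold; by default `GridSearchCV` records `nan` instead of stopping, and passing `error_score='raise'` surfaces the underlying exception. The cause here is not shown in the log. One possible cause, offered as an assumption, is a category that appears in the validation fold but not in the training fold, which makes a default `OneHotEncoder` raise. The sketch below reproduces that situation on made-up values:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical fold contents: "10 year" never appears in the training fold
train_fold = np.array([["30 year"], ["15 year"]], dtype=object)
valid_fold = np.array([["10 year"]], dtype=object)

# Default behaviour: unknown categories raise at transform time
strict = OneHotEncoder().fit(train_fold)
try:
    strict.transform(valid_fold)
    strict_failed = False
except ValueError:
    strict_failed = True

# handle_unknown="ignore" encodes unseen categories as an all-zero row
lenient = OneHotEncoder(handle_unknown="ignore").fit(train_fold)
encoded = lenient.transform(valid_fold).toarray()

print(strict_failed)
print(encoded)
```

If this is the cause, setting `handle_unknown='ignore'` on the `'ohe'` step of `categorical_preprocessing` would be one way to avoid the failed folds.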

## Save Model with Pickle

In [53]:
import pickle

In [54]:
# Use pickle to store the fitted grid search object
with open("mini-project-IV.p", "wb") as f:
    pickle.dump(grid, f)
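
To confirm that a pickled model can be used elsewhere, it helps to load it back and call `predict`. The sketch below does a round trip with a small stand-in model and a temporary file (the `toy-model.p` name and the `DummyClassifier` are placeholders, not the notebook's `grid` object):

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.dummy import DummyClassifier

# Stand-in for the fitted grid search object
model = DummyClassifier(strategy="most_frequent").fit([[0], [1], [2]], ["Y", "Y", "N"])

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy-model.p"  # hypothetical file name

    # Write the model, closing the file handle when done
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Read it back, as a scoring service would at start-up
    with open(path, "rb") as f:
        restored = pickle.load(f)

prediction = restored.predict([[5]])
print(prediction)
```

A pickle should be loaded with the same scikit-learn version that produced it, and only from a trusted source.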

In [55]:
X_test.head(1)

| |ApplicantIncome|CoapplicantIncome|LoanAmount|TotalIncome|Gender|Married|Dependents|Education|Self_Employed|Loan_Amount_Term|Credit_History|Property_Area|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|542|3652.0|0.0|95.0|3652.0|Female|No|1|Graduate|No|30 year|Yes|Semiurban|


In [56]:
json_data = {'ApplicantIncome': 3652.0,
 'CoapplicantIncome': 0.0,
 'LoanAmount': 95.0,
 'TotalIncome': 3652.0,
 'Gender': 'Female',
 'Married': 'No',
 'Dependents': '1',
 'Education': 'Graduate',
 'Self_Employed': 'No',
 'Loan_Amount_Term': '30 year',
 'Credit_History': 'Yes',
 'Property_Area': 'Semiurban'}

In [57]:
import requests
URL = "http://ec2-52-55-157-34.compute-1.amazonaws.com:5300/scoring"
# send a POST request with the JSON payload and save the response object
r = requests.post(url = URL, json = json_data)
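
How the `/scoring` endpoint handles the payload is not shown here. Since the pipeline selects columns by name, the service presumably has to turn the JSON object into a one-row DataFrame before calling `predict`. A minimal sketch of that conversion, using an abbreviated, hypothetical payload:

```python
import pandas as pd

# Abbreviated payload for illustration; the real request carries all 12 fields
payload = {
    "ApplicantIncome": 3652.0,
    "Gender": "Female",
    "Property_Area": "Semiurban",
}

# Wrapping the dict in a list yields one row with the keys as column names
row = pd.DataFrame([payload])
print(row.shape)
print(list(row.columns))
```

With the full payload, a frame built this way has the same column names as `X_train` and could be passed to the unpickled model's `predict` method.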