# Loan predictions

## Problem Statement

We want to automate the loan eligibility process using the customer details provided when an online application form is filled in. You can find the dataset [here](https://drive.google.com/file/d/1h_jl9xqqqHflI5PsuiQd_soNYxzFfjKw/view?usp=sharing). These details include the customer's Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others.

|Variable| Description|
|:-------------|:-------------|
|Loan_ID| Unique Loan ID|
|Gender| Male/ Female|
|Married| Applicant married (Y/N)|
|Dependents| Number of dependents|
|Education| Applicant education (Graduate/ Not Graduate)|
|Self_Employed| Self employed (Y/N)|
|ApplicantIncome| Applicant income|
|CoapplicantIncome| Coapplicant income|
|LoanAmount| Loan amount in thousands|
|Loan_Amount_Term| Term of loan in months|
|Credit_History| Credit history meets guidelines|
|Property_Area| Urban/ Semi Urban/ Rural|
|Loan_Status| Loan approved (Y/N)|



### Explore the problem in the following stages:

1. Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome
2. Data Exploration – looking at categorical and continuous feature summaries and making inferences about the data.
3. Data Cleaning – imputing missing values in the data and checking for outliers
4. Feature Engineering – modifying existing variables and creating new ones for analysis
5. Model Building – making predictive models on the data

## 1. Hypothesis Generation

Generating hypotheses is a major step in analyzing data. It involves understanding the problem and formulating meaningful hypotheses about what could have a strong impact on the outcome. This is done BEFORE looking at the data, and it produces a list of the analyses we can perform if the data is available.

#### Possible hypotheses
Which applicants are more likely to get a loan?

1. Applicants having a credit history
2. Applicants with higher applicant and co-applicant incomes
3. Applicants with a higher education level
4. Properties in urban areas with high growth prospects

Do more brainstorming and create some hypotheses of your own. Remember that the data might not be sufficient to test all of these, but forming these enables a better understanding of the problem.

In [None]:
# See the hypothesis notebook in this directory for a walk-through of some of these

5. Credit rating or current credit debt
6. Co-applicant's history and credit rating
7. Previous default on a loan or previous bankruptcy
8. Not just whether there is a credit history, but how good the repayment history is
9. Amount requested relative to income (or to total household income if married)
10. Payment period vs. amount (monthly payments relative to income)
11. Any assets or loans already held

## 2. Data Exploration
Let's do some basic data exploration here and come up with some inferences about the data. Go ahead and try to figure out some irregularities and address them in the next section. 

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

df = pd.read_csv("data/data.csv") 
df.head()

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.info()

One of the key challenges in any data set is missing values. Let's start by checking which columns contain missing values.

In [None]:
# Missing values:
# Gender, Married, Dependents, Self_Employed, LoanAmount, Loan_Amount_Term, Credit_History
# Look at each variable below individually
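The check described above can be sketched as follows. This is a minimal illustration on a made-up toy frame (the values and the name `toy` are assumptions, not rows from the real dataset); on the notebook's data the same two calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data (values are made up for illustration)
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [128.0, np.nan, 66.0, np.nan],
    "Credit_History": [1.0, 1.0, np.nan, 0.0],
    "ApplicantIncome": [5849, 4583, 3000, 2583],
})

missing_count = toy.isna().sum()   # number of missing entries per column
missing_share = toy.isna().mean()  # fraction of rows that are missing per column

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
# Show only the columns that actually contain missing values
print(summary[summary["missing"] > 0].sort_values("missing", ascending=False))
```

Using `isna().mean()` gives the share of all rows that are missing, which is usually the number you want when deciding between dropping and imputing.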

Look at some basic statistics for numerical variables.

In [None]:
num_feats = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term']

In [None]:
df.describe()

In [None]:
# 'ApplicantIncome'
# Seems to be skewed by some high incomes, looking at the median and mean
print(f'median income: {df.ApplicantIncome.median()}')
print(f'mean income: {df.ApplicantIncome.mean()}')      
print(f'max income: {df.ApplicantIncome.max()}')   
print(f'min income: {df.ApplicantIncome.min()}')   

In [None]:
# CoapplicantIncome
# This makes sense as a number of applications don't have coapplicants 
print(f'median Coapplicant income: {df.CoapplicantIncome.median()}')
print(f'mean Coapplicant income: {df.CoapplicantIncome.mean()}')      
print(f'max Coapplicant income: {df.CoapplicantIncome.max()}')   
print(f'min Coapplicant income: {df.CoapplicantIncome.min()}')  

# zeros DO make sense here
# df[df.CoapplicantIncome ==0].count()

In [None]:
# LoanAmount - HAS MISSING VALUES
print(f'missing: {df.LoanAmount.isna().sum()}')
# df[df.LoanAmount.isna()].index.tolist()
# isna().mean() is the share of all rows that are missing (count() would exclude the NaNs from the denominator)
print(f'percentage missing: {df.LoanAmount.isna().mean():.2%}')
print(f'median loan amount: {df.LoanAmount.median()}')
print(f'mean loan amount: {df.LoanAmount.mean()}')      
print(f'max loan amount: {df.LoanAmount.max()}')   
print(f'min loan amount: {df.LoanAmount.min()}')  

In [None]:
# Loan_Amount_Term, - HAS MISSING VALUES
print(f'missing: {df.Loan_Amount_Term.isna().sum()}')
df.Loan_Amount_Term.isna().mean()  # fraction of all rows with a missing term

In [None]:
# Credit_History (categorical) - HAS MISSING VALUES (assign to 0)
df_with_hist = df[df.Credit_History ==1].count()
df_no_hist = df[df.Credit_History ==0].count()
print(f'with history: {df_with_hist["Credit_History"]}')
print(f'na history: {df.Credit_History.isna().sum()}')
print(f'no history: {df_no_hist["Credit_History"]}')

In [None]:
# Columns still to check: Married, Dependents, Self_Employed, LoanAmount, Loan_Amount_Term, Credit_History

In [None]:

# df.loc[df["Self_Employed"].isna() == True] # Set to mode (0)
# df.loc[df["LoanAmount"].isna() == True] # try avg and med??
# df.loc[df["Loan_Amount_Term"].isna() == True] # try avg and med??


### Categorical Features

1. How many applicants have a `Credit_History`? (`Credit_History` has value 1 for those who have a credit history and 0 otherwise)
2. Is the `ApplicantIncome` distribution in line with your expectation? Similarly, what about `CoapplicantIncome`?
3. Tip: Can you see a possible skewness in the data by comparing the mean to the median, i.e. the 50% figure of a feature?
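The mean-versus-median tip can be illustrated with a short sketch. The income values below are invented for the example (one very large value among otherwise similar ones); `incomes` is a hypothetical name, not a column from the dataset.

```python
import pandas as pd

# Made-up incomes: six similar values and one extreme value
incomes = pd.Series([2500, 3000, 3200, 3800, 4100, 4500, 81000], name="ApplicantIncome")

mean = incomes.mean()
median = incomes.median()  # the "50%" row of describe()
skew = incomes.skew()      # sample skewness; positive means a long right tail

print(f"mean={mean:.0f}, median={median:.0f}, skew={skew:.2f}")
# A mean far above the median is a quick sign of right skew
```

When the mean sits well above the median, as here, a few large values are pulling it up, which is the pattern to look for in `ApplicantIncome` and `CoapplicantIncome`.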



Let's discuss nominal (categorical) variables. Look at the number of unique values in each of them.

#### Gender

In [None]:
cat_feats = ['Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'Credit_History', 'Property_Area'] #'Loan_Amount_Term'?

y = ['Loan_Status']

In [None]:
df.loc[df["Gender"].isna()] # drop the row if Credit_History and Gender are both NaN; otherwise set to the mode

#### Married

Explore further using the frequency of different categories in each nominal variable. Exclude the ID for obvious reasons.

In [None]:
# null values
df.loc[df["Married"].isna() == True]

# DROP 435

#### Education

In [None]:
# Gender
df['Gender'].value_counts().plot(kind='bar');

#### Dependents

In [None]:
# Married
df['Married'].value_counts().plot(kind='bar');

In [None]:
# Null values
df.loc[df["Dependents"].isna() == True] # set to median (0)

#### Credit History

In [None]:
# 'Education' - no null values
df['Education'].value_counts().plot(kind='bar');

In [None]:
# Null Values
df.loc[df["Credit_History"].isna() == True].head() #???? #DROP IF SELF_EMPLOYED AND CREDIT HISTORY NAN, otherwise fill with mode or median

#### Self Employed

In [None]:
df['Credit_History'].value_counts().plot(kind='bar');

In [None]:
# Null Values
# print(f'median {df.Self_Employed.median()}') (after converted)
df.loc[df["Self_Employed"].isna() == True].count() # Set to most frequent (1)

# IF credit_history == NaN as well, drop

In [None]:
# 'Self_Employed'
df['Self_Employed'].value_counts().plot(kind='bar');

In [None]:
# 'Property_Area'
df['Property_Area'].value_counts().plot(kind='bar');

In [None]:
# 'Dependents' FOR NA - ASSIGN MODE
df['Dependents'].value_counts().plot(kind='bar');

In [None]:
df.Loan_Amount_Term.value_counts().plot(kind='bar');
print(f'median {df.Loan_Amount_Term.median()}')
df.loc[df["Loan_Amount_Term"].isna() == True].count()


### Distribution analysis

Study the distribution of the numeric variables. Plot the histogram of `ApplicantIncome` and try different numbers of bins.



In [None]:
# ApplicantIncome
x = df.ApplicantIncome
plt.hist(x, bins=1000) 
plt.title('Applicant Income')
plt.ylabel('# of Applications')
plt.xlabel('Applicant Income');

In [None]:
# 'CoapplicantIncome'
x = df.CoapplicantIncome
plt.hist(x, bins=100) 
plt.title('CoApplicant Income')
plt.ylabel('# of Applications')
plt.xlabel('Coapplicant Income');

In [None]:
# 'TotalIncome' *NEW FEATURE
x =  df.ApplicantIncome + df.CoapplicantIncome
plt.hist(x, bins=1000) 
plt.title('Applicant and CoApplicant Combined Income')
plt.ylabel('# of Applications')
plt.xlabel('Combined Income');

# still has skew

In [None]:
# 'LoanAmount'
x =  df.LoanAmount
plt.hist(x, bins=100) 
plt.title('Loan Amount')
plt.ylabel('# of Applications')
plt.xlabel('Loan Amount');

In [None]:
# 'Loan_Amount_Term'
x =  df.Loan_Amount_Term
plt.hist(x, bins=100) 
plt.title('Loan Term')
plt.ylabel('# of Applications')
plt.xlabel('Loan Amount Term');

# THIS IS CATEGORICAL

In [None]:
# Loan_Amt_Term_ratio *NEW FEATURE
x =  df.LoanAmount/df.Loan_Amount_Term
plt.hist(x, bins=100) 
plt.title('Loan Amount and Term Ratio')
plt.ylabel('# of Applications')
plt.xlabel('Loan Amount and Term Ratio');

# SEE IF THIS CORRELATES?


Look at box plots to understand the distributions. 

In [None]:
# Box plot numeric features to see spread - **NEED TO REMOVE OUTLIERS FOR INCOME AND LOAN AMOUNT... ALSO EVALUATE LOAN TERMS
df['LoanAmountx100'] = df['LoanAmount']*100
df['Loan_Amount_Termx100'] = df['Loan_Amount_Term']*100
df['Total_Income'] = df['ApplicantIncome'] +df['CoapplicantIncome']
df['Loan_Amount_Term_Ratiox10000'] = (df.LoanAmount*10000/df.Loan_Amount_Term)
df_plot = df[['ApplicantIncome', 'CoapplicantIncome', 'Total_Income', 'LoanAmountx100', 'Loan_Amount_Term_Ratiox10000']]

plt.figure(figsize=(16,8))
ax = sns.boxplot(data=df_plot, orient="h", palette="Set2")

# ax = sns.boxplot(data=box_data)
# sns.set_palette(palette="crest", n_colors=1)
sns.color_palette("crest", as_cmap=True)

ax.set(title='Loan Application Numeric Features', xlabel="amount");


In [None]:
# Log, then REmove outliers

# PLOT!!


Look at the distribution of income segregated by `Education`.

In [None]:
df_edu = df[['Education', 'ApplicantIncome']]
df_edu.Education.unique()

In [None]:
df_grad = df[df.Education == 'Graduate']
df_not_grad = df[df.Education == 'Not Graduate']

x =  df_not_grad.Total_Income
plt.hist(x, bins=100) 
plt.title('Total Income Not Graduate')
plt.ylabel('# of Applications')
plt.xlabel('Total Income');

print(f'Mean {df_not_grad.Total_Income.mean()}')
print(f'Median {df_not_grad.Total_Income.median()}')

In [None]:
x =  df_grad.Total_Income
plt.hist(x, bins=100) 
plt.title('Total Income Graduate')
plt.ylabel('# of Applications')
plt.xlabel('Total Income');
print(f'Mean {df_grad.Total_Income.mean()}')
print(f'Median {df_grad.Total_Income.median()}')

Look at the histogram and boxplot of LoanAmount

In [None]:
box_data = df[['LoanAmount']]
plt.figure(figsize=(16,8))
ax = sns.boxplot(data=box_data)
plt.ylabel('Loan amount')
ax.set(title='Loan Amount Distribution', xlabel="type");

In [None]:
df.columns

There might be some extreme values. Both `ApplicantIncome` and `LoanAmount` require some data munging. `LoanAmount` has missing as well as extreme values, while `ApplicantIncome` has a few extreme values that demand a closer look.

### Categorical variable analysis

Try to understand categorical variables in more detail using `pandas.DataFrame.pivot_table` and some visualizations.

In [None]:
# kids and married, correlation plot! 
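As a possible starting point for the pivot-table analysis mentioned above, here is a minimal sketch of an approval-rate table. The toy rows and the helper column `approved` are assumptions made for the example; the column names mirror the dataset's `Credit_History`, `Married` and `Loan_Status`.

```python
import pandas as pd

# Made-up rows using the dataset's column names
toy = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 0.0, 0.0, 1.0],
    "Married": ["Yes", "No", "Yes", "No", "Yes", "No"],
    "Loan_Status": ["Y", "Y", "N", "N", "N", "Y"],
})

# Encode the target as 0/1 so that its mean is the approval rate
toy["approved"] = (toy["Loan_Status"] == "Y").astype(int)

# Rows: credit history, columns: marital status, cells: share of approved loans
rate = toy.pivot_table(index="Credit_History", columns="Married",
                       values="approved", aggfunc="mean")
print(rate)
```

The same call with `index="Dependents"` and `columns="Married"` would give the "kids and married" view noted in the cell above.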

## 3. Data Cleaning

This step typically involves imputing missing values and treating outliers. 

### Imputing Missing Values

Missing values may not always be NaNs. For instance, the `Loan_Amount_Term` might be 0, which does not make sense.



Impute missing values for all columns. Use the values you find most meaningful (mean, mode, median, zero, or perhaps different mean values for different groups).

In [None]:
# remove any rows with more than 1 NaN (13 rows)
df = pd.read_csv("data/data.csv")
to_drop = df[df.isnull().sum(axis=1) >1].index.tolist()

df = df.drop(to_drop)
df.shape

In [None]:
df.loc[df["Married"].isna() == True]
# df


In [None]:
# Gender, Married, Dependents, Self_Employed, LoanAmount, Loan_Amount_Term, Credit_History
# remove any row that has more than 1 NAN

# Loan_Amount_Term - check for zeros
# LoanAmount - HAS MISSING VALUES and 0s
#  Credit_History (categorical) - HAS MISSING VALUES (assign to 0)
# Dependent - (impute mode - 0)
#  Married, 
# Self_Employed, (impute mode - 0)
# Gender (impute mode OR drop as this has high impact)

# CHECK NA AFTER DOING THIS **
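The imputation plan in the notes above could look roughly like this. It is only a sketch on made-up rows: filling `Self_Employed` with its mode follows the notes, while filling `LoanAmount` with the median of each `Education` group is one assumed choice among the options listed (mean, median, group-wise values).

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps in a categorical and a numeric column
toy = pd.DataFrame({
    "Education": ["Graduate", "Graduate", "Not Graduate", "Not Graduate", "Graduate"],
    "Self_Employed": ["No", None, "No", "Yes", "No"],
    "LoanAmount": [120.0, np.nan, 80.0, np.nan, 200.0],
})

filled = toy.copy()

# Categorical column: fill with the most frequent value
filled["Self_Employed"] = filled["Self_Employed"].fillna(filled["Self_Employed"].mode()[0])

# Numeric column: fill with the median of the row's own Education group
group_median = filled.groupby("Education")["LoanAmount"].transform("median")
filled["LoanAmount"] = filled["LoanAmount"].fillna(group_median)

# Check that nothing is left missing
print(filled.isna().sum())
```

Inside a scikit-learn pipeline the equivalent step is `SimpleImputer`, which learns the fill values on the training split only.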

### Remove Outliers 

### Scale Numeric

In [None]:
# try various options

### Column Transforming 

In [None]:
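# Dependents - convert '3+' to 3
# Assign the result back instead of calling replace(..., inplace=True) on an attribute access
df['Dependents'] = df['Dependents'].replace({'0':0, '1':1, '2':2, '3+':3})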

df.Dependents.unique()

### Extreme values
Try a log transformation to reduce the influence of the extreme values in `LoanAmount`. Plot the histogram before and after the transformation.

In [None]:
df['LoanAmount_Log'] = np.log(df.LoanAmount)

x =  df.LoanAmount
plt.hist(x, bins=100) 
plt.title('Loan Amount')
plt.ylabel('# of Applications')
plt.xlabel('Loan Amount');

In [None]:
x =  df.LoanAmount_Log
plt.hist(x, bins=100) 
plt.title('Loan Amount - Log')
plt.ylabel('# of Applications')
plt.xlabel('Loan Amount - Log');

Combine both incomes as total income and take a log transformation of the same.

In [None]:
df['Total_Income'] = df['ApplicantIncome'] +df['CoapplicantIncome']
df['Total_Income_log'] = np.log(df['Total_Income'])

In [None]:
x =  df.Total_Income_log
plt.hist(x, bins=100) 
plt.title('Total Income - log')
plt.ylabel('# of Applications')
plt.xlabel('Combined Income');

#### Add new features

In [None]:
df['LoanAmt_Term_Ratio_Log']=  np.log(df.LoanAmount/df.Loan_Amount_Term)
x =  df.LoanAmt_Term_Ratio_Log
plt.hist(x, bins=100) 
plt.title('Loan Amount / Term Ratio - Log')
plt.ylabel('# of Applications')
plt.xlabel('Loan Amount / Term Ratio - Log');

### Heatmap

In [None]:
df_corr = df.select_dtypes(include='number').corr()  # correlate the numeric columns of the loan data

# plot the correlations
sns.heatmap(df_corr)
plt.title('Correlation plot')
plt.show()

### Remove Outliers

In [None]:
# with the IQR rule (1.5 * IQR fences), not a z-score
df = pd.read_csv("data/data.csv") 
df['LoanAmt_Term_Ratio_Log']=  np.log(df.LoanAmount/df.Loan_Amount_Term)
df['Total_Income_log'] = np.log(df['ApplicantIncome'] +df['CoapplicantIncome'])
df['LoanAmount_Log'] = np.log(df.LoanAmount)

def remove_outliers(df):
    cols = ['Total_Income_log']#, 'LoanAmt_Term_Ratio_Log', 'LoanAmount_Log']
    Q1 = df[cols].quantile(0.25)
    Q3 = df[cols].quantile(0.75)
    IQR = Q3 - Q1
    df = df[~((df[cols] < (Q1 - 1.5 * IQR)) |(df[cols] > (Q3 + 1.5 * IQR))).any(axis=1)]
    return df

# q25, q75 = percentile(data, 25), percentile(data, 75)
# iqr = q75 - q25
# cut_off = iqr * 1.5
# lower, upper = q25 - cut_off, q75 + cut_off

# df.Total_Income_log
# LoanAmountLog, LoanAmt_Term_Ratio_Log, 'Total_Income_log'

df_or = remove_outliers(df)


In [None]:
df_or.shape

# START HERE

## 4. Building a Predictive Model

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion

from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay # want the ones with false neg not positives
from sklearn.metrics import recall_score, precision_score, roc_auc_score, roc_curve, auc, RocCurveDisplay  # plot_roc_curve was removed in scikit-learn 1.2

from matplotlib import pyplot as plt
# import seaborn as sns
from sklearn import set_config # for plotting pipeline


In [2]:
# Load data
df = pd.read_csv("data/data.csv") 

# # REMOVE any rows with more than one null for training 
to_drop = df[df.isnull().sum(axis=1) >1].index.tolist()
df = df.drop(to_drop)

# Note this removes 13 rows
df.shape

(601, 13)

In [3]:
# Remove outliers 
def remove_outliers(df, cols):
#     cols = ['Total_Income_log']#, 'LoanAmt_Term_Ratio_Log', 'LoanAmountLog'] 
    Q1 = df[cols].quantile(0.25)
    Q3 = df[cols].quantile(0.75)
    IQR = Q3 - Q1
#     df = df[~((df[cols] < (Q1 - 1.5 * IQR)) |(df[cols] > (Q3 + 1.5 * IQR))).any(axis=1)]
    df = df[~((df[cols] > (Q3 + 2 * IQR))).any(axis=1)]
    return df

df = remove_outliers(df, ['ApplicantIncome', 'LoanAmount'])
df.shape


(552, 13)

In [None]:
# # DEBUG IMPUTER 
df.dropna(inplace=True)

In [None]:
# Separate out target, and drop id column
X = df.drop(columns=['Loan_Status','Loan_ID'])
y = df['Loan_Status'].map({'Y':1, 'N':0})

# Test train split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=88)

In [None]:
# Split into cat_feats and num_feats
cat_feats = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Credit_History', 'Property_Area']
num_feats = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term']

def numFeat(data):
    return data[num_feats]

def catFeat(data):
    return data[cat_feats]

keep_num = FunctionTransformer(numFeat)
keep_cat = FunctionTransformer(catFeat)

# Note: Loan amount term is really more categorical, but leaving as numeric so can use in calculations - and will scale
# Credit history will need to be converted to categorical

In [None]:
# Add new columns
def replace_income_with_total_income_log(X):
    X = X.copy()  # work on a copy so the caller's frame is not modified
    X['Total_Income_Log'] = np.log(X['ApplicantIncome'] + X['CoapplicantIncome'])
    return X.drop(columns=['ApplicantIncome','CoapplicantIncome'])

def add_LoanAmt_Term_Ratio_Log(X):
    X = X.copy()
    X['LoanAmt_Term_Ratio_Log'] = np.log(X.LoanAmount/X.Loan_Amount_Term)
    return X

def replace_loanamount_with_loanamount_log(X):
    X = X.copy()
    X['LoanAmount_Log'] = np.log(X.LoanAmount)
    return X.drop(columns=['LoanAmount'])


add_total_income_log_object = FunctionTransformer(replace_income_with_total_income_log)
add_loanamt_term_ratio_log_object = FunctionTransformer(add_LoanAmt_Term_Ratio_Log)
add_loanamount_log_object = FunctionTransformer(replace_loanamount_with_loanamount_log)

In [None]:
# One hot encode # only a portion of the categorical
# (scikit-learn >= 1.2 renames `sparse` to `sparse_output`; `sparse` was removed in 1.4)
enc = OneHotEncoder(sparse=False)

In [None]:
# PCA - reduce dummy variables of categorical?? or all?
pca = PCA(n_components=3)

In [None]:
# kbest - right now just on numeric 
selection = SelectKBest(k=3)

Try a parameter grid search to improve the results.

In [None]:
# PCA/ kbest - do together after? or separately?

In [None]:
# test with p-value <0.05, sensitivity and specificity, AUC and ROC
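Sensitivity, specificity and AUC from the note above can be computed as in the sketch below. The labels and scores are made up for illustration, and the 0.5 threshold is an assumption; in the notebook they would come from `y_test` and a fitted model's `predict_proba`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Made-up true labels (1 = approved) and predicted probabilities
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1])
y_score = np.array([0.9, 0.8, 0.4, 0.7, 0.6, 0.2, 0.1, 0.95])

# Hard labels at an assumed 0.5 threshold
y_hat = (y_score >= 0.5).astype(int)

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

sensitivity = tp / (tp + fn)  # true positive rate (recall for the approved class)
specificity = tn / (tn + fp)  # true negative rate

# AUC uses the scores, not the thresholded labels
auc_value = roc_auc_score(y_true, y_score)

print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}, AUC={auc_value:.3f}")
```

Specificity matters here because a false positive means approving a loan that should have been declined.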

## 5. Using Pipeline
If you didn't use pipelines before, move your data preparation, feature engineering, and modeling steps into a `Pipeline`. It will be helpful for deployment.

The goal here is to create a pipeline that takes one row of our dataset and predicts whether the loan will be granted (or, with `predict_proba`, the probability that it will be).

`pipeline.predict(x)`
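To make the single-row goal concrete, here is a small self-contained sketch. It uses a `ColumnTransformer` on two made-up columns rather than the notebook's `FeatureUnion`, so the names `prep`, `toy_pipeline` and `one_row` and all values are assumptions for illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows with one numeric and one categorical feature
toy = pd.DataFrame({
    "ApplicantIncome": [2000, 2500, 3000, 6000, 7000, 8000],
    "Property_Area": ["Rural", "Urban", "Rural", "Urban", "Rural", "Urban"],
    "Loan_Status": [0, 0, 0, 1, 1, 1],
})
X_toy, y_toy = toy.drop(columns="Loan_Status"), toy["Loan_Status"]

# Scale the numeric column and one-hot encode the categorical one
prep = ColumnTransformer([
    ("num", StandardScaler(), ["ApplicantIncome"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Property_Area"]),
])
toy_pipeline = Pipeline([("prep", prep), ("model", LogisticRegression())])
toy_pipeline.fit(X_toy, y_toy)

# A single application must be a one-row DataFrame with the training column names
one_row = pd.DataFrame([{"ApplicantIncome": 7500, "Property_Area": "Urban"}])

label = toy_pipeline.predict(one_row)               # hard 0/1 decision
proba = toy_pipeline.predict_proba(one_row)[:, 1]   # probability of class 1 (approved)
print(label, proba)
```

Keeping the row as a DataFrame (not a bare list) lets the column-based preprocessing steps find their inputs by name.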

In [None]:
# # ('impute_mean', SimpleImputer(strategy='median'))
# df = pd.read_csv("data/data.csv") 
# df
# df.LoanAmount[0]

In [None]:
# from sklearn.impute import SimpleImputer

# imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
# imputer = imputer.fit(X_train[num_feats])

In [None]:
# newdf = imputer.transform(X_train[num_feats])
# np.isnan(newdf).sum()
                                  

In [None]:
# preprocess_pipeline = make_pipeline(   
#     FeatureUnion(transformer_list=[
#         ('Handle numeric columns', make_pipeline(
#             ColumnSelector(columns=['Amount']),
#             SimpleImputer(strategy='constant', fill_value=0),
#             StandardScaler()
#         )),
#         ('Handle categorical data', make_pipeline(
#             ColumnSelector(columns=['Type', 'Name', 'Changes']),
#             SimpleImputer(strategy='constant', missing_values=' ', fill_value='missing_value'),
#             OneHotEncoder(sparse=False)
#         ))
#     ])
# )

In [None]:
# num_pipeline = Pipeline([
#     ("num_feats", keep_num),
#     ("impute_num", null_replace_num),
#     ("kBest", k_best)
# ])
# cat_pipeline = Pipeline([
#     ("cat_feats", keep_cat),
#     ("impute_cat", null_replace_cat),
#     ("dummies", ohe),
#     ("to_dense", to_dense),
#     ("pca", pca)
# ])
# all_features = FeatureUnion([
#     ('numeric_features', num_pipeline),
#     ('categorical_features', cat_pipeline),
# ])
# main_pipeline = Pipeline([
#     ('all_features', all_features),
#     ('modeling', base_model)
# ])


In [None]:
# keep_num = FunctionTransformer(num_feats)
# keep_cat = FunctionTransformer(cat_feats)

numeric_pipeline = Pipeline([('num_feats', keep_num),
#                             ('impute_median', SimpleImputer(strategy='median')),
                            ('add_total_income', add_total_income_log_object),
                            ('add_loanamt_term_ratio_log', add_loanamt_term_ratio_log_object),
                            ('add_loanamount_log', add_loanamount_log_object),
                            ('scaling', StandardScaler()),
                            ("kbest", selection)]) 


categorical_pipeline = Pipeline([('cat_feats', keep_cat),
                                ('impute_mode', SimpleImputer(strategy='most_frequent')), 
                                  ('one-hot-encode', OneHotEncoder(sparse=False)),
                                 ("pca", pca)])

all_features = FeatureUnion([('numeric_features', numeric_pipeline),
                            ('categorical_features', categorical_pipeline)])

# preprocessing_loan_feats = ColumnTransformer([('numeric', numeric_transform, num_feats), 
#                                         ('categorical', categorical_transform, cat_feats)])

## Logistic Regression

In [None]:
pipeline = Pipeline(steps = [('all_features', all_features),
                     ("model", LogisticRegression())])

pipeline.fit(X_train, y_train)

X_test = X_test.dropna()
y_test = y_test.loc[X_test.index]  # keep labels aligned with the rows that remain

y_pred = pipeline.predict(X_test)

acc = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred, average='macro')
precision =  precision_score(y_test, y_pred, average='micro')

print(f'Test set accuracy: {acc}')
print(f'Test set recall: {recall}')
print(f'Precision: {precision}')


In [None]:
y_score = pipeline.predict_proba(X_test)[:, 1]  # scores, not hard labels, give a full ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)
display = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc,
                                   estimator_name='Logistic Regression')
display.plot()

plt.show()

In [None]:
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

In [None]:
# # to visualize pipeline
# set_config(display='diagram')
# pipeline

## Logistic Regression with gridsearch

In [None]:
numeric_pipeline = Pipeline([('num_feats', keep_num),
#                             ('impute_median', SimpleImputer(strategy='median')),
                            ('add_total_income', add_total_income_log_object),
                            ('add_loanamt_term_ratio_log', add_loanamt_term_ratio_log_object),
                            ('add_loanamount_log', add_loanamount_log_object),
                            ('scaling', StandardScaler()),
                            ("kbest", selection)]) 


categorical_pipeline = Pipeline([('cat_feats', keep_cat),
                                ('impute_mode', SimpleImputer(strategy='most_frequent')), 
                                  ('one-hot-encode', OneHotEncoder(sparse=False)),
                                 ("pca", pca)])

all_features = FeatureUnion([('numeric_pipeline', numeric_pipeline),
                            ('categorical_pipeline', categorical_pipeline)])

pipeline = Pipeline(steps = [('all_features', all_features),
                     ("model", LogisticRegression())])

param_grid = {'all_features__categorical_pipeline__pca__n_components':[3,5,7],
              'all_features__numeric_pipeline__kbest__k': [1,2,3,4]}
              

grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_
best_hyperparams = grid_search.best_params_
best_score = grid_search.best_score_

print(f'hyperparameters: {best_hyperparams}\n {best_score}')
# y_pred = grid_search.predict(X_test)



# pipeline = Pipeline(steps = [('all_features', all_features),
#                      ("model", LogisticRegression())])

# pipeline.fit(X_train, y_train)

# X_test = X_test.dropna()

# y_pred = pipeline.predict(X_test)

# acc = accuracy_score(y_test, y_pred)
# recall = recall_score(y_test, y_pred, average='macro')
# precision =  precision_score(y_test, y_pred, average='micro')

# print(f'Test set accuracy: {acc}')
# print(f'Test set recall: {recall}')
# print(f'Precision: {precision}')

In [None]:
# Or, save the HTML to a file
from sklearn.utils import estimator_html_repr

with open('images/model_pipeline.html', 'w') as f:  
    f.write(estimator_html_repr(pipeline))

In [None]:
pipeline = Pipeline(steps = [('all_features', all_features),
                     ("model", LogisticRegression())])

pipeline.fit(X_train, y_train)

X_test = X_test.dropna()
y_test = y_test.loc[X_test.index]  # keep labels aligned with the rows that remain

y_pred = pipeline.predict(X_test)

acc = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred, average='macro')
precision =  precision_score(y_test, y_pred, average='micro')

print(f'Test set accuracy: {acc}')
print(f'Test set recall: {recall}')
print(f'Precision: {precision}')

### Random Forest

In [None]:
param_grid = {#'preprocessing_loan_feats__categorical__pca__n_components':[3,5,7],
              #'preprocessing_loan_feats__numeric__kbest__k': [1,2,3,4],
            'model__n_estimators': [50, 100, 200],
              'model__max_depth': [3, 7, 10, 20]
             }

pipeline = Pipeline(steps = [('all_features', all_features),
                     ("model", RandomForestClassifier())])
              

grid = GridSearchCV(pipeline, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

best_model = grid.best_estimator_
best_hyperparams = grid.best_params_
best_score = grid.best_score_

print(f'hyperparameters: {best_hyperparams}\n {best_score}')
# y_pred = grid_ridge.predict(df_test)


# X_test = X_test.dropna()

# y_pred = grid.predict(X_test)


# # pipeline.fit(X_train, y_train)


# acc = accuracy_score(y_test, y_pred)
# recall = recall_score(y_test, y_pred, average='macro')
# precision =  precision_score(y_test, y_pred, average='micro') 

# print(f'Test set accuracy: {acc}')
# print(f'Test set recall: {recall}')
# print(f'Precision: {precision}')

# plot_roc_curve(pipeline, X_test, y_test)

In [None]:




# Template from a ridge-regression grid search; `pipeline_ridge` and `df_train` are not defined in this notebook.
# param_grid_ridge = {'preprocessing_sales__categorical__pca__n_components':[3,5,7],
#               'preprocessing_sales__numeric__kbest__k': [1,2,3,4]}

# grid_ridge = GridSearchCV(pipeline_ridge, param_grid=param_grid_ridge, cv=5)
# grid_ridge.fit(df_train, y_train)

# best_model = grid_ridge.best_estimator_
# best_hyperparams = grid_ridge.best_params_
# best_score = grid_ridge.best_score_

# print(f'hyperparameters: {best_hyperparams}\n {best_score}')
# y_pred = grid_ridge.predict(df_test)

In [None]:
# Confusion matrix for the best random forest found by the grid search
y_pred = grid.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

### Naive Bayes

In [None]:
pipeline = Pipeline(steps = [('all_features', all_features),
                     ("model",  GaussianNB())])

pipeline.fit(X_train, y_train)
X_test = X_test.dropna()
y_test = y_test.loc[X_test.index]  # keep labels aligned with the rows that remain
y_pred = pipeline.predict(X_test)

acc = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred, average='macro')
precision =  precision_score(y_test, y_pred, average='micro') 

print(f'Test set accuracy: {acc}')
print(f'Test set recall: {recall}')
print(f'Precision: {precision}')

# plot_roc_curve(pipeline, X_test, y_test)

In [None]:
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm) #,display_labels=clf.classes_)
disp.plot()

plt.show()

In [None]:
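# Coefficients of the tuned logistic regression. Indices refer to the transformed features
# (selected numeric columns followed by PCA components), not to the original columns.
lr = grid_search.best_estimator_.named_steps['model']
importance = lr.coef_[0]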

for i,v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i,v))

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
clf = grid_search.best_estimator_  # tuned logistic regression pipeline
y_score = clf.predict_proba(X_test)[:, 1]
logit_roc_auc = roc_auc_score(y_test, y_score)
fpr, tpr, thresholds = roc_curve(y_test, y_score)
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

## 6. Deploy your model to the cloud and test it with Postman, Bash or Python

In [None]:
# Use one and many samples
# Docker 
# curl 
# pickle file?
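One way the pickle idea in the notes above might work is sketched below. The tiny model, the single `ApplicantIncome` feature, and the handler name `predict_from_json` are all hypothetical; a real deployment would pickle the fitted notebook pipeline and wrap a function like this in a web framework.

```python
import json
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up training data: one numeric feature, binary target
X_toy = np.array([[2000.0], [2500.0], [3000.0], [6000.0], [7000.0], [8000.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
model = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())]).fit(X_toy, y_toy)

# Serialize the fitted pipeline to bytes and restore it, as a server would at start-up
blob = pickle.dumps(model)
restored = pickle.loads(blob)

def predict_from_json(body, fitted_model):
    """Hypothetical request handler: JSON list of records in, approval probabilities out."""
    rows = json.loads(body)
    X = np.array([[row["ApplicantIncome"]] for row in rows], dtype=float)
    return {"probability": fitted_model.predict_proba(X)[:, 1].round(4).tolist()}

# Works for one sample or many, since the payload is always a list
print(predict_from_json('[{"ApplicantIncome": 7500}, {"ApplicantIncome": 2200}]', restored))
```

Only unpickle files you created yourself, and load them with the same scikit-learn version that wrote them.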