# Predicting Loan Defaults
> Author: Alex Lau

## Table of Contents

https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

1. [Table of Contents](#1.-Table-of-Contents)
2. [Exploratory Data Analysis](#2.-Exploratory-Data-Analysis)
    <br>2.1 [Import Packages and Data](#2.1-Import-Packages-and-Data)
    <br>2.2 [High Level Checks](#2.2-High-Level-Checks)
    <br>2.3 [Investigating Target Variable](#2.3-Investigating-Target-Variable)
    <br>2.4 [Investigating Features](#2.4-Investigating-Features)
3. [Data Cleaning](#3.-Data-Cleaning)
4. [Feature Engineering](#4.-Feature-Engineering)
5. [Revisiting Exploratory Data Analysis: Correlations Deep Dive](#5.-Revisiting-Exploratory-Data-Analysis:-Correlations-Deep-Dive)
6. [Preprocessing](#6.-Preprocessing)
7. [Modeling](#7.-Modeling)
    <br>7.1 [Baseline Model](#7.1-Baseline-Model)
    <br>7.2 [Logistic Regression](#7.2-Logistic-Regression)
    <br>7.3 [KNeighbors Classifier](#7.3-KNeighbors-Classifier)
    <br>7.4 [Random Forest Classifier](#7.4-Random-Forest-Classifier)
    <br>7.5 [Extra Trees Classifier](#7.5-Extra-Trees-Classifier)
    <br>7.6 [AdaBoost Classifier](#7.6-AdaBoost-Classifier)
    <br>7.7 [Support Vector Machine](#7.7-Support-Vector-Machine)
    <br>7.8 [Gaussian Naive Bayes Classifier](#7.8-Gaussian-Naive-Bayes-Classifier)
    <br>7.9 [Gradient Boost Classifier](#7.9-Gradient-Boost-Classifier)
    <br>7.10 [Voting Classifier](#7.10-Voting-Classifier)
8. [Conclusions and Evaluation](#8.-Conclusions-and-Evaluation)

## 2. Exploratory Data Analysis

### 2.1 Import Packages and Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import confusion_matrix, precision_recall_curve, auc, roc_auc_score, roc_curve, recall_score, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier, ExtraTreesClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

Download the data from the Lending Club statistics page after creating an account:
https://www.lendingclub.com/info/statistics.action

Read in the main dataframe that we created in the "Downloading Data" notebook.

In [None]:
# Read in the main dataframe we have created after consolidating the downloaded files.
# this results in a mixed type warning, so we will check what these columns are in the next section
df = pd.read_csv('../Data/FullLoanStats.csv')

### 2.2 High Level Checks

In [None]:
# checking the first 5 rows of our dataframe
# we will remove the Unnamed columns, but first we will check the warning on mixed type columns since these are indexed
df.head()

The following columns will be removed for our model, since they are not relevant for predicting a loan default before approving the loan. Debt settlement refers to activity after a default.

In [None]:
# columns 51, 146, 147, 148 have mixed types

# create a list of these columns
mixed_type_columns = [51, 146, 147, 148]

# iterate through this list and print every column name
for column in mixed_type_columns:
    print(list(df.columns)[column])

In [None]:
# Removing unnamed columns
df.drop(columns = ['Unnamed: 0', 'Unnamed: 1'], inplace = True)
df.head()

In [None]:
# Checking number of rows and columns of dataframe
df.shape

In [None]:
# removing view limitation so we can see all columns
pd.set_option('display.max_columns', None)

# checking the top 5 rows
df.head()

In [None]:
# Checking statistics
df.describe()

In [None]:
# checking datatypes and memory usage
df.info()

In [None]:
# setting row view limitation, 150 for the number of columns
pd.set_option("display.max_rows", 150)
# viewing datatypes of each column
df.dtypes

Below are the counts of the null values in our dataset. We will remove all features where every value is the same. This includes features with 44171 null values, since that is the number of rows in our dataframe. We will also remove all data that is collected after the loan is approved, including the settlement columns mentioned earlier, since we are trying to predict loan defaults before the loan is approved. This activity will be reviewed in section 2.4.

In [None]:
# viewing all columns with null values sorted by count
df.isnull().sum().sort_values(ascending = False)
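
The rule described above (drop features that are entirely null or hold a single value) can also be checked programmatically. The following is a minimal sketch on a made-up toy frame; the helper name `find_uninformative_columns` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def find_uninformative_columns(frame):
    """Return columns that are entirely null or hold a single distinct value."""
    # nunique(dropna=False) counts NaN as a value, so an all-null column scores 1
    distinct = frame.nunique(dropna=False)
    return distinct[distinct <= 1].index.tolist()

# toy frame standing in for the loan data (values are illustrative only)
toy = pd.DataFrame({
    'policy_code': [1, 1, 1, 1],                      # constant column
    'member_id': [np.nan, np.nan, np.nan, np.nan],    # all-null column
    'loan_amnt': [5000, 12000, 8000, 5000],           # informative column
})

uninformative = find_uninformative_columns(toy)
print(uninformative)
```

On the real dataframe the same helper could be called as `find_uninformative_columns(df)` to cross-check the manual review in section 2.4.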

### 2.3 Investigating Target Variable

In [None]:
# getting distribution of target variable
df['loan_status'].value_counts()

In [None]:
# visualize the quantities
df['loan_status'].value_counts().plot.bar()
plt.title('Loan Status Counts', size = 20)
plt.xlabel('Categories', size = 15)
plt.ylabel('Quantity', size = 15)
plt.xticks(rotation = 0, size = 12);

In [None]:
# getting percentages of each value
df['loan_status'].value_counts(normalize = True)

Slightly over 12% of the loans in our dataset defaulted or were charged off, which is a heavy class imbalance and can cause poor model performance on the minority class. A low percentage is expected, since Lending Club would not approve loans if they thought the borrowers would default. They have also been improving the ratio of fully paid to charged off or defaulted loans over the years.
We will use SMOTE to help with our unbalanced classes during the preprocessing section.

"Charged off" means that a creditor, Lending Club in this case, gives up hope that you will repay the money after months of missed minimum payments and writes off the debt. Learn more about it in this article from MarketWatch.
https://www.marketwatch.com/story/everything-you-need-to-know-about-a-charged-off-debt-2019-08-15

We will convert Fully Paid values to 0 and defaulted/charged off to 1 for our model. 

In [None]:
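# investigating the mean values of the numerical features for each target variable group
# numeric_only = True skips the string columns, which newer pandas versions no longer drop silently
df.groupby(by = ['loan_status']).mean(numeric_only = True)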

### 2.4 Investigating Features

In [None]:
# function for returning the unique values and count, sorted by descending order, based on user specified column
def show_values(column):
    return df[column].value_counts().sort_values(ascending = False)
 
# function for counting null values in a specified column
def null_count(column):
    return df[column].isnull().value_counts().sort_values(ascending = False)

In [None]:
# checking that all id values are unique no repeats
show_values('id')

In [None]:
# check distribution of applications
show_values('application_type')

We can see that the average approved loan for joint applications is over 19,000 USD, whereas for individual applications it is 14,000 USD. Joint applications also have a higher debt-to-income ratio, at 18.5 vs 17.5 for individuals.

In [None]:
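# checking any visible differences in values of features for joint vs individual applications
# numeric_only = True skips the string columns, which newer pandas versions no longer drop silently
df.groupby(by = ['application_type']).mean(numeric_only = True)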

In [None]:
# reporting the percentages of loan statuses based on individual applications
df[df['application_type'] == 'Individual']['loan_status'].value_counts(normalize = True)

In [None]:
# On average joint applications have higher probability of defaulting or being charged off. 
df[df['application_type'] == 'Joint App']['loan_status'].value_counts(normalize = True)

We will need to treat individual and joint applications differently, since several features differ between the two, such as loan values, incomes, and loan repayment rates.

In [None]:
# creating separate dataframes for individual and joint apps
df_individual = df[df['application_type']=='Individual']
df_JointApp = df[df['application_type']=='Joint App']

# checking individual dataframes
df_individual.head()

In [None]:
# checking joint dataframe
df_JointApp.head()

In [None]:
# How many emp_titles are there? We will drop this since there are over 18,000 unique values
show_values('emp_title')

In [None]:
# checking values within payment plan, perhaps we can drop these since they are all the same
show_values('pymnt_plan')

In [None]:
# we confirm there are no null values in payment plan, so we are good to drop
null_count('pymnt_plan')

In [None]:
# checking values within feature purpose
show_values('purpose')

In [None]:
# the counts of these values are identical to feature 'purpose', so we will drop one.
# We will drop title because historical files have null values in this feature.
show_values('title')

In [None]:
# categories of emp length, we can get_dummies, but we'll first need to check null values in the next cell
show_values('emp_length')

In [None]:
# checking null counts, we will replace these values with 'Unknown'
null_count('emp_length')

In [None]:
# counts of values for public records, convert to int
show_values('pub_rec')

In [None]:
# counts of values for initial list status
show_values('initial_list_status')

In [None]:
# policy codes are all the same values, remove this column
show_values('policy_code')

In [None]:
# Remove hardship flag, they are all the same values, we will remove this feature since it won't help our model
show_values('hardship_flag')

In [None]:
# how are the df['loan_status'] values coded for these debt_settlement_flags?
show_values('debt_settlement_flag')

In [None]:
# revol_bal_joint
show_values('revol_bal_joint')

In [None]:
# what are all of the unique values in the grade feature? We will build a dictionary later to convert these to numbers
show_values('grade')

In [None]:
# Checking unique values for sub_grade, these will also be a dictionary
show_values('sub_grade')

In [None]:
# There are no null values in this column within the joint applications. 
# We are safe to impute 0 for the null values in this column for the larger dataframe for individual applications
df_JointApp['revol_bal_joint'].isnull().value_counts().sort_values(ascending = False)

In [None]:
# 38812 null values refer to the individual applications
null_count('revol_bal_joint')

In [None]:
# We can also impute 0 for individual accounts for these values as well.
df_JointApp['sec_app_fico_range_low'].isnull().value_counts().sort_values(ascending = False)

In [None]:
# viewing null values for this column on the larger dataframe
null_count('sec_app_fico_range_low')

In [None]:
# We can also impute 0 for individual accounts for these values as well.
df_JointApp['sec_app_earliest_cr_line'].isnull().value_counts().sort_values(ascending = False)

In [None]:
# viewing null values for this column on the larger dataframe
null_count('sec_app_earliest_cr_line')

In [None]:
# We can also impute 0 for individual accounts for these values as well.
df_JointApp['annual_inc_joint'].isnull().value_counts().sort_values(ascending = False)

In [None]:
# viewing null values for this column on the larger dataframe
null_count('annual_inc_joint')

In [None]:
# We can also impute 0 for individual accounts for these values as well.
df_JointApp['dti_joint'].isnull().value_counts().sort_values(ascending = False)

In [None]:
# viewing null values for this column on the larger dataframe
null_count('dti_joint')

In [None]:
# checking the values in verification_status_joint column
show_values('verification_status_joint')

In [None]:
# There are 444 null values for verification statuses on joint applications. We will impute "Unknown" for these.
df_JointApp['verification_status_joint'].isnull().value_counts().sort_values(ascending = False)

In [None]:
# We will impute "N/A" for these values in individual accounts
df_individual['verification_status_joint'].isnull().value_counts().sort_values(ascending = False)

In [None]:
# num_tl_120dpd_2m seems unhelpful
show_values('num_tl_120dpd_2m')

In [None]:
# null count for this feature
null_count('num_tl_120dpd_2m')

In [None]:
# checking data types for percentages, all values in interest rates are strings
df[df['int_rate'].map(type) != str].shape

In [None]:
# unfortunately not all values in revolving utilization are strings, they are a mix of strings and ints
df[df['revol_util'].map(type) != str].shape

In [None]:
# checking the values in question in revolving utilization column. We need to convert this column into the same type
df[df['revol_util'].map(type) != str]['revol_util']
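
To see which types are actually mixed in a column, it can help to count the Python type of every entry. This is a small sketch on a made-up series, not the Lending Club data; one common reason for non-string entries in a string column is that pandas stores missing values as float `NaN`.

```python
import numpy as np
import pandas as pd

# toy column mimicking a percentage feature with missing entries (values are made up)
toy_util = pd.Series(['45.2%', '10%', np.nan, '88.1%', np.nan],
                     name='revol_util', dtype=object)

# count how many entries of each Python type the column holds
type_counts = toy_util.map(lambda value: type(value).__name__).value_counts()
print(type_counts)
```

Applying the same `map` to `df['revol_util']` would show whether the non-string rows are missing values or genuine numbers.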

Get a correlation chart at some point, either before or after cleaning.

## 3. Data Cleaning

### Updating Target Variable

In [None]:
# dictionary for Y target variable
loan_status_dict = {'Fully Paid':0, 'Charged Off':1, 'Default':1}

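# assigning the result back instead of using inplace on a column selection,
# which newer pandas versions warn about and may silently ignore
df['loan_status'] = df['loan_status'].replace(loan_status_dict)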

### Removing unnecessary features for our model

In [None]:
# creating list of columns to remove
remove_cols = ['member_id', 'funded_amnt', 'emp_title', 'issue_d', 'pymnt_plan', 'url', 'desc', 'title', 
               'zip_code', 'earliest_cr_line', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 
               'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 
               'last_pymnt_d', 'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d', 'last_fico_range_high', 
               'last_fico_range_low', 'policy_code', 'sec_app_earliest_cr_line', 'num_tl_120dpd_2m', 
               'hardship_payoff_balance_amount', 'hardship_type', 'payment_plan_start_date', 'hardship_length', 
               'hardship_dpd', 'hardship_loan_status', 'orig_projected_additional_accrued_interest', 'hardship_amount',
               'hardship_last_payment_amount', 'deferral_term', 'hardship_start_date', 'hardship_reason', 
               'hardship_status', 'hardship_end_date', 'settlement_term', 'settlement_percentage', 'settlement_amount',
               'settlement_date', 'settlement_status', 'debt_settlement_flag_date', 'hardship_flag', 
               'debt_settlement_flag']

# removing the unnecessary columns
df.drop(columns = remove_cols, inplace = True)

# viewing the new shape
df.shape

### Create columns for conditional features where necessary

In [None]:
# list of columns we want to create conditional columns to represent whether a value is populated
# Later we will replace null values in the original columns, so having this extra conditional column will help the model
conditional_columns = ['sec_app_mths_since_last_major_derog', 'mths_since_recent_bc_dlq', 'mths_since_last_major_derog',
                       'mths_since_recent_revol_delinq', 'mths_since_last_delinq', 'il_util', 'mths_since_recent_inq',
                       'mo_sin_old_il_acct', 'bc_util', 'percent_bc_gt_75', 'bc_open_to_buy', 'mths_since_recent_bc',
                       'revol_util', 'all_util', 'avg_cur_bal', 'sec_app_revol_util'] 

# function to create new columns
def create_conditionals(df, columns):
    
    # iterate through the list of columns
    for column in columns:
        
        # creating a new column name
        new_name = column + '_conditional'
        
        # values are transformed into 0 if null, 1 if there is a value
        df[new_name] = df[column].isnull().map({False:1, True: 0})
        
    return df

# calling this function now
create_conditionals(df, conditional_columns)
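
The idea behind the conditional columns can be seen on a tiny example. This is an illustrative sketch with made-up values: the indicator records whether a value was present, so filling the original column with 0 afterwards does not erase the distinction between "missing" and "zero months".

```python
import numpy as np
import pandas as pd

# toy frame with one sparsely populated feature (values are made up)
toy = pd.DataFrame({'mths_since_last_delinq': [12.0, np.nan, 3.0, np.nan]})

# 1 where a value was recorded, 0 where it was missing
toy['mths_since_last_delinq_conditional'] = (
    toy['mths_since_last_delinq'].notnull().astype(int)
)

# filling the original with 0 afterwards keeps the "was missing" signal in the indicator
toy['mths_since_last_delinq'] = toy['mths_since_last_delinq'].fillna(0)
print(toy)
```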

In [None]:
# checking the new number of columns
df.shape

### Impute Null values

In [None]:
# lists of columns that we will fill in null values for 
individual_null_to_NA = ['verification_status_joint']
joint_null_to_unknown = ['verification_status_joint']

null_to_unknown = ['emp_length']
null_to_zero = ['sec_app_mths_since_last_major_derog', 'annual_inc_joint', 'revol_bal_joint', 
                     'sec_app_fico_range_low', 'sec_app_fico_range_high', 'sec_app_mort_acc', 'sec_app_inq_last_6mths',
                    'sec_app_open_acc', 'sec_app_open_act_il', 'sec_app_num_rev_accts', 'sec_app_chargeoff_within_12_mths',
                    'sec_app_collections_12_mths_ex_med', 'dti_joint', 'mths_since_last_record', 'mths_since_recent_bc_dlq',
                    'mths_since_last_major_derog', 'mths_since_recent_revol_delinq', 'mths_since_last_delinq', 'il_util', 
                    'mths_since_recent_inq', 'mo_sin_old_il_acct', 'mths_since_rcnt_il', 'bc_util', 'percent_bc_gt_75', 
                    'bc_open_to_buy', 'mths_since_recent_bc', 'revol_util', 'all_util', 'avg_cur_bal', 'sec_app_revol_util']
null_to_max = ['dti']

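# function to fill in the missing values
def replace_null(df, columns, value):
    for column in columns:
        # assign the filled column back; inplace fillna on a column selection
        # is deprecated in newer pandas versions and may not modify df
        df[column] = df[column].fillna(value)
    return df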

# calling replace null function on the list of columns we will convert to unknown
replace_null(df, null_to_unknown, 'Unknown')

In [None]:
# testing, this works! 
show_values('emp_length')

In [None]:
# replacing null values for columns in null_to_zero list
replace_null(df, null_to_zero, 0)

In [None]:
# replacing null values in column dti (debt to income) with max value, which we know is 999
replace_null(df, null_to_max, 999) # we already know the max value is 999 from the describe function during EDA

In [None]:
# There are 39256 null values for verification status joint
null_count('verification_status_joint')

In [None]:
# 38812 of these values are associated to individual applications
df.loc[df.application_type =='Individual','verification_status_joint'].isnull().sum()

In [None]:
# The remaining 444 null values are associated to Joint applications
df.loc[df.application_type =='Joint App','verification_status_joint'].isnull().sum()

In [None]:
# We will now update verification status joint features for individual applications from null to "Not Applicable"
df.loc[df.application_type=='Individual','verification_status_joint'] = df.loc[df.application_type =='Individual','verification_status_joint'].fillna('Not Applicable')

In [None]:
# We will now update verification status joint values for joint applications from null to "Unknown"
df.loc[df.application_type =='Joint App','verification_status_joint'] = df.loc[df.application_type =='Joint App','verification_status_joint'].fillna('Unknown')

In [None]:
# checking null count for this feature, there should no longer be null values in this column
null_count('verification_status_joint')

In [None]:
# Checking if we have any null values for all features, we do not
df.isnull().sum().sort_values(ascending = False)

### Converting Strings into Numbers

Ordinal values

In [None]:
# creating dictionaries for ordinal values
grade_dict = {'A': 1,'B': 2,'C': 3,'D': 4, 'E':5, 'F': 6, 'G':7}
sub_grade_dict = {'A1':1, 'A2':2, 'A3':3, 'A4':4, 'A5':5, 
                  'B1':6, 'B2':7, 'B3':8, 'B4':9, 'B5':10,
                  'C1':11, 'C2':12, 'C3':13, 'C4':14, 'C5':15,
                  'D1':16, 'D2':17, 'D3':18, 'D4':19, 'D5':20,
                  'E1':21, 'E2':22, 'E3':23, 'E4':24, 'E5':25, 
                  'F1':26, 'F2':27, 'F3':28, 'F4':29, 'F5':30, 
                  'G1':31, 'G2':32, 'G3':33, 'G4':34, 'G5':35}

In [None]:
# replacing the values from the dictionary
# assigning the result back instead of using inplace on a column selection
df['grade'] = df['grade'].replace(grade_dict)
df['sub_grade'] = df['sub_grade'].replace(sub_grade_dict)
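
As an aside, the sub grade ranking does not have to be typed out by hand. The sketch below builds an equivalent lookup programmatically, assuming the same seven grades with five levels each; the name `sub_grade_lookup` is hypothetical and only used here for illustration.

```python
from itertools import product

grades = 'ABCDEFG'      # grade letters, best to worst
levels = range(1, 6)    # sub grade levels 1 to 5

# A1 -> 1, A2 -> 2, ..., G5 -> 35, in the same order as the hand-written dictionary
sub_grade_lookup = {
    f'{grade}{level}': rank
    for rank, (grade, level) in enumerate(product(grades, levels), start=1)
}
print(sub_grade_lookup)
```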

In [None]:
# checking the changes
df.head()

### Converting Percentages to Numbers

In [None]:
# first we need to convert column 'revol_util' into strings since this had mixed data types
df['revol_util'] = df['revol_util'].astype(str)

In [None]:
# creating a list of columns that have %s
percent_list = ['int_rate', 'revol_util']

# creating function to strip the % from each row of the specified columns, and then convert the values to float type
def convert_percent_to_num(df, columns):
    for column in columns:
        df[column] = df[column].map(lambda x: x.rstrip('%'))
        # convert the string to float
        df[column] = df[column].astype('float64')
    return df

convert_percent_to_num(df, percent_list)
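
The same conversion can be written with vectorized string methods. This is a minimal sketch on a made-up series; the values, including the leading space and the bare `'0'` left by an earlier null fill, are assumptions chosen to show the edge cases.

```python
import pandas as pd

# toy percentage strings (values are made up)
toy_rates = pd.Series(['13.56%', ' 7.02%', '0'], name='int_rate')

# strip whitespace and a trailing percent sign, then parse as numbers
toy_rates_num = pd.to_numeric(toy_rates.str.strip().str.rstrip('%'))
print(toy_rates_num.tolist())
```

`pd.to_numeric` raises on anything it cannot parse, which makes unexpected values visible instead of passing them through.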

### Splitting out categorical values

In [None]:
# list of Categorical features
categories = ['term', 'emp_length', 'home_ownership', 'verification_status', 'purpose', 'addr_state', 
              'initial_list_status', 'application_type', 'verification_status_joint']

# separate columns for all categorical features
df = pd.get_dummies(data = df, columns = categories, drop_first = True)
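
For readers unfamiliar with `drop_first`, here is a small sketch of what the call above does, using a made-up frame. With two categories only one indicator column is kept, because the dropped category is implied when the indicator is 0. The `dtype = int` argument is an addition for illustration so the output is 0/1 rather than booleans.

```python
import pandas as pd

# toy frame with one categorical and one numerical feature (values are made up)
toy = pd.DataFrame({'term': ['36 months', '60 months', '36 months'],
                    'loan_amnt': [5000, 12000, 8000]})

# one indicator column per category, minus the first category
encoded = pd.get_dummies(data=toy, columns=['term'], drop_first=True, dtype=int)
print(encoded)
```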

In [None]:
# resetting index to id number
df.set_index('id', inplace = True)

In [None]:
# checking new shape of dataframe
df.shape

In [None]:
# displaying all rows so we can see every column
pd.set_option("display.max_rows", 198)
# checking all datatypes are numerical values, they are
df.dtypes

## 4. Feature Engineering

Consider reorganizing the imputing and cleaning sections

## 5. Revisiting Exploratory Data Analysis: Correlations Deep Dive

In [None]:
# Checking the correlation of all the features with the target
# loan_status is coded 1 for charged off/default, so positive values mean a higher chance of default
plt.figure(figsize = (6,100))
sns.heatmap(df.corr()[['loan_status']].sort_values(by = 'loan_status', ascending = False), annot = True, cmap = 'RdBu')
plt.title('Feature Correlation with Loan Default', fontsize = 15)
plt.xlabel('Correlation with Loan Default', fontsize = 15)
plt.ylabel('Features', fontsize = 15);
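
With close to 200 features the heatmap is long, so a ranked list can be easier to scan. The sketch below ranks features by the absolute value of their correlation with the target on synthetic data; the column names `signal_pos`, `signal_neg` and `noise` are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500

# synthetic binary target and three synthetic features
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    'loan_status': target,
    'signal_pos': target + rng.normal(0, 0.5, size=n),    # moves with the target
    'signal_neg': -target + rng.normal(0, 0.5, size=n),   # moves against the target
    'noise': rng.normal(0, 1, size=n),                    # unrelated to the target
})

# correlation of every feature with the target, strongest first regardless of sign
corr_with_target = toy.corr()['loan_status'].drop('loan_status')
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

Sorting by absolute value keeps strongly negative features near the top, which a plain descending sort would push to the bottom.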

## 6. Preprocessing

Before we begin modeling, we need to identify our X features and y target variable. We then split the data into training and testing sets; in this case we use train_test_split's default 75/25 split, stratified on the target. Scaling the data is necessary because we have a large number of features with a variety of ranges.

In [None]:
# create X and y
X = df.drop(columns = 'loan_status')
y = df['loan_status']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)

# Scaling data
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

Let's visualize how imbalanced our classes are

In [None]:
y_train_fully_paid = pd.Series(y_train).value_counts().sort_values(ascending = False)[0]
y_train_charged_off = pd.Series(y_train).value_counts().sort_values(ascending = False)[1]

# observe how imbalanced the classes are in the training data before balancing
pd.Series(y_train).value_counts().sort_values(ascending = False).plot.bar(color = ['steelblue', 'orange'])
plt.title('Training Data Values Pre Balancing', size = 20)
plt.ylabel('Counts', size = 15)
plt.xlabel('Loan Status', size = 15, rotation = 0)
plt.xticks(rotation = 0);

In [None]:
# instantiating SMOTE
sm = SMOTE(random_state = 42)
# resampling the scaled train data (fit_resample replaces fit_sample from older imbalanced-learn releases)
X_train_new, y_train_new = sm.fit_resample(X_train_sc, y_train)

# checking the new shapes
print("Shape of X_train after balancing: ", X_train_new.shape)
print("Shape of y_train after balancing: ", y_train_new.shape)

# observe that data has been balanced in training data
pd.Series(y_train_new).value_counts().sort_values(ascending = False).plot.bar(color = ['steelblue', 'orange'])
plt.title('Training Data Values Post Balancing', size = 20)
plt.ylabel('Counts', size = 15)
plt.xlabel('Loan Status', size = 15)
plt.xticks(rotation = 0);
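
To give some intuition for what SMOTE does, the sketch below creates synthetic minority points by interpolating between a minority sample and its nearest minority neighbour. It is a deliberately simplified illustration in plain NumPy: the real imbalanced-learn implementation picks at random among several nearest neighbours, and the helper `synthesize` and the toy points are assumptions, not library code.

```python
import numpy as np

rng = np.random.default_rng(42)

# toy minority-class points in two dimensions (made-up values)
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])

def synthesize(points, n_new, rng):
    """Create n_new synthetic points by interpolating toward the nearest neighbour."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(points))            # pick a minority point at random
        # distances from the chosen point to every minority point
        dists = np.linalg.norm(points - points[i], axis=1)
        dists[i] = np.inf                        # exclude the point itself
        j = np.argmin(dists)                     # nearest minority neighbour
        gap = rng.random()                       # position along the segment, in [0, 1)
        new_points.append(points[i] + gap * (points[j] - points[i]))
    return np.array(new_points)

synthetic = synthesize(minority, n_new=6, rng=rng)
print(synthetic)
```

Every synthetic point lies on a line segment between two existing minority points, which is why SMOTE is applied to the training data only and after scaling.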

## 7. Modeling

### 7.1 Baseline Model

The baseline model predicts the majority class for every case, i.e. that all borrowers fully pay back their loans, which results in a recall (sensitivity) score of 0. We will also aim to beat the baseline accuracy score of 87.55%.

In [None]:
# calculating the baseline model accuracy (share of the majority class)
1 - y_train.mean()
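
The baseline can also be expressed as an actual model, which makes the zero recall explicit. This is a minimal sketch using scikit-learn's `DummyClassifier` on made-up labels (7 fully paid, 1 charged off), not on the loan data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# toy labels: 7 fully paid (0) and 1 charged off (1)
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 1])
X_toy = np.zeros((len(y_toy), 1))   # features are ignored by this strategy

# always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
y_pred = baseline.predict(X_toy)

baseline_accuracy = accuracy_score(y_toy, y_pred)   # share of the majority class
baseline_recall = recall_score(y_toy, y_pred)       # no defaults are ever caught
print(baseline_accuracy, baseline_recall)
```

High accuracy with zero recall is exactly the failure mode the later models need to avoid.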

### 7.2 Logistic Regression

Logistic regression results in a test recall score of 61.24% but at the cost of our accuracy, which was 67.12%.

In [None]:
# Instantiate model for logistic regression, including max iterations
lr = LogisticRegression(random_state = 42, max_iter = 500)

# Fit model
lr.fit(X_train_new, y_train_new)

# checking accuracy scores
print(f'Logistic Regression Train Accuracy: {lr.score(X_train_new, y_train_new)}')
print(f'Logistic Regression Test Accuracy: {lr.score(X_test_sc, y_test)}')

# predicting values for logistic regression
y_hat_lr_train = lr.predict(X_train_new)
y_hat_lr_test = lr.predict(X_test_sc)

# checking recall scores
lr_recall_train = recall_score(y_train_new, y_hat_lr_train)
lr_recall_test = recall_score(y_test, y_hat_lr_test)

print(f'Logistic Regression Train Recall: {lr_recall_train}')
print(f'Logistic Regression Test Recall: {lr_recall_test}')

In [None]:
# creating a function to run models and print Train and Test accuracy and recall scores for all other models
def run_model(model_class):
    # instantiating the model with its default parameters
    model = model_class()
    # readable name for the printed output, e.g. 'RandomForestClassifier'
    name = model_class.__name__
    # Fit model
    model.fit(X_train_new, y_train_new)
    
    # checking accuracy scores
    print(f'{name} Train Accuracy: {model.score(X_train_new, y_train_new)}')
    print(f'{name} Test Accuracy: {model.score(X_test_sc, y_test)}')

    # predicting values for the fitted model
    y_hat_train = model.predict(X_train_new)
    y_hat_test = model.predict(X_test_sc)

    # checking recall scores
    recall_train = recall_score(y_train_new, y_hat_train)
    recall_test = recall_score(y_test, y_hat_test)

    print(f'{name} Train Recall: {recall_train}')
    print(f'{name} Test Recall: {recall_test}')

### 7.4 Random Forest Classifier

The random forest classifier results in a poor testing recall score. It looks as if the model mostly predicts the majority class, since its accuracy score is similar to the baseline.

In [None]:
run_model(RandomForestClassifier)
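
One way to check the suspicion that a model is mostly predicting the majority class is to look at the share of positive predictions and at the confusion matrix. The sketch below uses made-up labels and predictions rather than the random forest output.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# made-up true labels and predictions from a model that never predicts a default
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0])

# share of cases predicted as default (class 1)
predicted_positive_share = y_pred.mean()

# rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(predicted_positive_share, tn, fp, fn, tp)
```

In the notebook the same check could be run on `model.predict(X_test_sc)` against `y_test`, provided `run_model` is changed to return the fitted model.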

### 7.5 Extra Trees Classifier

Extra Trees also performed poorly on recall score.

In [None]:
run_model(ExtraTreesClassifier)

### 7.6 AdaBoost Classifier

The AdaBoost test recall score is slightly better than random forest and extra trees, but still very low at 7.27%.

In [None]:
run_model(AdaBoostClassifier)

### 7.7 Support Vector Machine

In [None]:
# Instantiate support vector machine.
svc = SVC(gamma="scale")

# Fit support vector machine to training data.
svc.fit(X_train_new, y_train_new)

# Score model
print(f'Accuracy Score on training set: {svc.score(X_train_new, y_train_new)}')
print(f'Accuracy Score on testing set: {svc.score(X_test_sc, y_test)}')

# predicting values for the support vector machine
y_hat_svc_train = svc.predict(X_train_new)
y_hat_svc_test = svc.predict(X_test_sc)

# checking recall scores
svc_recall_train = recall_score(y_train_new, y_hat_svc_train)
svc_recall_test = recall_score(y_test, y_hat_svc_test)

print(f'Support Vector Machine Train Recall: {svc_recall_train}')
print(f'Support Vector Machine Test Recall: {svc_recall_test}')

### 7.8 Gaussian Naive Bayes Classifier

### 7.9 Gradient Boost Classifier

In [None]:
# Instantiate model
gboost = GradientBoostingClassifier(random_state = 42)

# Set parameters for grid search
gboost_params = {
    'max_depth': [2, 3, 4],
    'n_estimators': [100, 125, 150],
    'learning_rate': [.08, .1, .12]
}

# Instantiate 
gb_gs = GridSearchCV(gboost, param_grid = gboost_params, cv = 5)

# Fit
gb_gs.fit(X_train_new, y_train_new)

# Determine best parameters
print(f'Best parameters: {gb_gs.best_params_}')
print('')

# Set best model to a new variable
gb_model = gb_gs.best_estimator_

# Score model
print(f'Accuracy Score on training set: {gb_model.score(X_train_new, y_train_new)}')
print(f'Accuracy Score on testing set: {gb_model.score(X_test_sc, y_test)}')

# predicting values for gradient boosting
y_hat_gb_train = gb_model.predict(X_train_new)
y_hat_gb_test = gb_model.predict(X_test_sc)

# checking recall scores
gb_recall_train = recall_score(y_train_new, y_hat_gb_train)
gb_recall_test = recall_score(y_test, y_hat_gb_test)

print(f'Gradient Boosting Train Recall: {gb_recall_train}')
print(f'Gradient Boosting Test Recall: {gb_recall_test}')
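
Since recall is the metric we care about, it may be worth noting that `GridSearchCV` selects parameters by the estimator's default score (accuracy for classifiers) unless `scoring` is set. The sketch below shows a recall-driven search on synthetic data from `make_classification`; the small grid and dataset are assumptions chosen to keep the example fast, not tuned settings for the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# synthetic imbalanced data: roughly 15% positives
X_toy, y_toy = make_classification(n_samples=300, n_features=8,
                                   weights=[0.85, 0.15], random_state=42)

# same estimator as above, but the search ranks parameter sets by recall
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid={'max_depth': [2, 3], 'n_estimators': [25, 50]},
    scoring='recall',
    cv=3,
)
search.fit(X_toy, y_toy)

best_params = search.best_params_
best_recall = search.best_score_   # mean cross-validated recall of the best parameters
print(best_params, best_recall)
```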

## 8. Conclusions and Evaluation