# Best Model Selection and Hyperparameter Tuning

# David Berberena

# 5/12/2024

# Program Start

## 1. Import the dataset and ensure that it loaded properly.

In [1]:
# Importing the dataset requires the use of Pandas, which will be imported here.

import pandas as pd

loan_data = 'https://raw.githubusercontent.com/SosukeAizen5/Portfolio/main/DSC%20550%20Data%20Mining/Loan_Train.csv'

loan = pd.read_csv(loan_data)

# The head() function is used to ensure that the dataset has been loaded in with no issues.

loan.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


## 2. Prepare the data for modeling by performing the following steps:
- Drop the column “Load_ID.”
- Drop any rows with missing data.
- Convert the categorical features into dummy variables.

In [2]:
# To drop the 'Loan_ID' column, we will use the drop() function specifying our column with the columns argument.

loan = loan.drop(columns = ['Loan_ID'])

# The head() function is used to verify that the transformation has been done correctly.

loan.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [3]:
# Dropping any columns in the dataset with missing values can be accomplished using dropna() with the axis argument set to 
# 1, indicating the columns with NaN values are to be dropped, and the inplace argument set to True.

loan.dropna(axis = 1, inplace = True)

# The head() function is used to verify that the transformation has been done correctly.

loan.head()

Unnamed: 0,Education,ApplicantIncome,CoapplicantIncome,Property_Area,Loan_Status
0,Graduate,5849,0.0,Urban,Y
1,Graduate,4583,1508.0,Rural,N
2,Graduate,3000,0.0,Urban,Y
3,Not Graduate,2583,2358.0,Urban,Y
4,Graduate,6000,0.0,Urban,Y


In [4]:
# Upon looking at the remaining columns in the dataset, we can easily see that 'Education', 'Property_Area', and 
# 'Loan_Status' are categorical variables. However, I need to verify that Python is also aware of this. I can check the 
# data type of each column in the dataset with the dtypes() keyword function and look for the 'object' denomination.

loan.dtypes

Education             object
ApplicantIncome        int64
CoapplicantIncome    float64
Property_Area         object
Loan_Status           object
dtype: object

In [5]:
# Now that I know that Python recognizes these columns as categorical columns, I will add them as arguments to the 
# pd.get_dummies() function, which will return dummy variable columns for each of these columns.

loan = pd.get_dummies(loan, columns = ['Education', 'Property_Area', 'Loan_Status'])

# The head() function is used to verify that the transformation has been done correctly.

loan.head()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,Education_Graduate,Education_Not Graduate,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Loan_Status_N,Loan_Status_Y
0,5849,0.0,True,False,False,False,True,False,True
1,4583,1508.0,True,False,True,False,False,True,False
2,3000,0.0,True,False,False,False,True,False,True
3,2583,2358.0,False,True,False,False,True,False,True
4,6000,0.0,True,False,False,False,True,False,True


## 3. Split the data into a training and test set, where the “Loan_Status” column is the target.

In [6]:
# With the dataset now ready to be used for model creation, I need to import Sci-kitlearn's train_test_split() function to 
# split the data into a ratio of 80% training set, 20% test set with the target variable being the 'Loan_Status' variable. 

from sklearn.model_selection import train_test_split

# Now with the current dataset having two variables depicting loan status (thanks to the dummy variable transformation), I 
# will drop them both for the predictors and use only the 'Loan_Status_Y' as the target variable.

loan_x = loan.drop(columns = ['Loan_Status_N', 'Loan_Status_Y'])
loan_y = loan['Loan_Status_Y']

loan_xtrain, loan_xtest, loan_ytrain, loan_ytest = train_test_split(loan_x, loan_y, test_size=0.2, random_state=123)

# I will now verify that the split has been made accurately by comparing the size of the dataset before and after the split.

print('The number of rows within the cleaned loan dataset are:', loan.shape[0])
print('The number of rows within the loan training dataset are:', loan_xtrain.shape[0])
print('The number of rows within the loan test dataset are:', loan_xtest.shape[0])

The number of rows within the cleaned loan dataset are: 614
The number of rows within the loan training dataset are: 491
The number of rows within the loan test dataset are: 123


## 4. Create a pipeline with a min-max scaler and a KNN classifier.

In [7]:
# In order to create a pipeline, I needed to import the Pipeline() function from Sci-kitlearn's pipeline module. We have 
# worked with the min-max scaler before, so I will simply import that again from Sci-kitlearn's preprocessing module. The 
# same holds true for the KNN classifier, yet that functionality is imported from the neighbors module.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

loan_pipe = Pipeline([('scaler', MinMaxScaler()), ('knn', KNeighborsClassifier())])

## 5. Fit a default KNN classifier to the data with this pipeline. Report the model accuracy on the test set.

In [8]:
# As this task asks us for an accuracy measure, we will import the accuracy_score() function from Sci-kitlearn's metrics.

from sklearn.metrics import accuracy_score

# As a pipeline model fits the data in the same way a regular model does, I have written the code for the pipeline model 
# almost exactly as I would have for a regular linear regression model.

loan_pipe.fit(loan_xtrain, loan_ytrain)

loan_predictions = loan_pipe.predict(loan_xtest)

loan_accuracy = accuracy_score(loan_ytest, loan_predictions)

# The accuracy statistic is printed here. 

print('The accuracy of the model on the test set data is:', loan_accuracy)

The accuracy of the model on the test set data is: 0.6178861788617886


## 6. Create a search space for your KNN classifier where your “n_neighbors” parameter varies from 1 to 10.

In [9]:
# Once again referring to section 15.3, the search space is a variable that I will create that will define the number of 
# nearest neighbors as a range of 1 to 10. I will write my search space variable in the same way as shown in the text.

loan_search = [{'knn__n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]

# Now this variable can be placed into the pipeline to have the dataset be ran through the pipeline with each new number of 
# nearest neighbors.

## 7. Fit a grid search with your pipeline, search space, and 5-fold cross-validation to find the best value for the “n_neighbors” parameter.

In [10]:
# To fit a grid search, I need to access the GridSearchCV() function from Scikitlearn's model_selection module.

from sklearn.model_selection import GridSearchCV

# I will craft the grid search object and fit it to the training set. The arguments for the grid search need to be the 
# pipeline, the search space, and the cross-validation of 5. The verbose argument being set to 0 is simply so that there is 
# no output once the grid search fits the data.

loan_grid = GridSearchCV(loan_pipe, loan_search, cv = 5, verbose = 0)

# The grid search is then fitted to the data like every other model I have run across so far, except it is now to be stored
# in a variable to call and access various features.

grid_search_model_loan = loan_grid.fit(loan_xtrain, loan_ytrain)

# To realize the best value for the 'n_neighbors' parameter, I can access new features within the grid search, such as the 
# best_estimator_, which tells me what the best model in the search is, and the get_params() functions, which outputs the 
# best value for the paramater we have set, which in our case is the 'n_neighbors' parameter we need.

# I will print the best value for the n_neighbors parameter.

print('The best value for the n_neighbors parameter is:', 
      grid_search_model_loan.best_estimator_.get_params()['knn__n_neighbors'])

The best value for the n_neighbors parameter is: 9


## 8. Find the accuracy of the grid search best model on the test set.

In [11]:
# We've actually already found the best model within the last task's process, so I will store that model within a variable.

loan_best_model = grid_search_model_loan.best_estimator_

# Now I can perform the process for the accuracy score just like any other model we've worked with by finding the test 
# predictions and using them to find the accuracy score.

best_loan_predictions = loan_best_model.predict(loan_xtest)

best_model_accuracy = accuracy_score(loan_ytest, best_loan_predictions)

# The accuracy statistic is printed here.

print('The grid search model accuracy on the test set is:', best_model_accuracy)

The grid search model accuracy on the test set is: 0.6747967479674797


## 9. Now, repeat steps 6 and 7 with the same pipeline, but expand your search space to include logistic regression and random forest models with the hyperparameter values in section 12.3 of the Machine Learning with Python Cookbook.

In [12]:
# This task involves the updating of the search space by adding in the new models and hyperparameter values needed from the 
# aforementioned section of the text. To access logistic regression and random forest models, I need to import their 
# respective functions from Sci-kitlearn.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# After seeing the hyperparameters, I amended the previous pipeline slightly to match the classifier key seen in the other 
# classifier information provided by the text.

new_loan_pipe = Pipeline([('scaler', MinMaxScaler()), ('classifier', KNeighborsClassifier())])

# The search space variable has been updated with these hyperparameters here.

involved_loan_search = [{'classifier': [KNeighborsClassifier()], 
                         'classifier__n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}, 
                        {'classifier': [LogisticRegression(max_iter=500, solver='liblinear')], 
                         'classifier__penalty': ['l1', 'l2'], 'classifier__C': np.logspace(0, 4, 10)}, 
                        {'classifier': [RandomForestClassifier()], 'classifier__n_estimators': [10, 100, 1000], 
                         'classifier__max_features': [1, 2, 3]}]

# The code is the same now from the previous steps six and seven, with a slight amendment to the print statement regarding 
# the access of the parameters for the best model parameters, as now that there are three classifier models in the grid 
# search, I can't be certain that the KNN classifier will still be considered the best model. I have chosen to print the 
# best model first to see whether the n_neighbors parameter still applies as the best parameter.

involved_loan_grid = GridSearchCV(new_loan_pipe, involved_loan_search, cv = 5, verbose = 0)

grid_search_model_involved_loan = involved_loan_grid.fit(loan_xtrain, loan_ytrain)

involved_loan_best_model = grid_search_model_involved_loan.best_estimator_

print(involved_loan_best_model)

Pipeline(steps=[('scaler', MinMaxScaler()),
                ('classifier',
                 LogisticRegression(C=7.742636826811269, max_iter=500,
                                    solver='liblinear'))])


## 10. What are the best model and hyperparameters found in the grid search? Find the accuracy of this model on the test set.

In [13]:
# Now that we see that our best model in this case is the logistic regression model, we can disregard the n_neighbors 
# parameter and focus on the logistic regression parameters. I will print them using the best_params_ function.

print(grid_search_model_involved_loan.best_params_)

{'classifier': LogisticRegression(C=7.742636826811269, max_iter=500, solver='liblinear'), 'classifier__C': 7.742636826811269, 'classifier__penalty': 'l2'}


In [14]:
# Now for the accuracy, the same code as before applies.

involved_loan_predictions = involved_loan_best_model.predict(loan_xtest)

involved_loan_accuracy = accuracy_score(loan_ytest, involved_loan_predictions)

print('The grid search model accuracy on the test set is:', involved_loan_accuracy)

The grid search model accuracy on the test set is: 0.6504065040650406


## 11. Summarize your results.

Looking at the exercises holistically, I was interested to see that with the inclusion of more classifiers with their corresponding parameters, the accuracy of the best model decreased. Also, the model changed from being a KNN classifier to a logistic regression classifier. In addition, the best parameter changed from n_neighbors to classifier__C. The search space initially helped to increase the KNN classifier's accuracy, yet after the introduction of the new classifiers, the search space was too large to optimize the hyperparameters that needed to be selected to benefit model accuracy. The default KNN started at 61% accuracy, then jumped to 67% accuracy with the inclusion of the search space for the KNN values, then fell to 65% with the addition of the two other classifiers with their own parameters. Granted, the whole basis of these models was a dataset that looked at the ability of loan candidates to be classified as being accepted or rejected. With the highest model accuracy being right only two-thirds of the time is rather low for a large financial investment. In my opinion, I think another model strategy would be a better option to improve accuracy so banks and customers alike can be certain that the income they have is sufficient for the loan not to default. 