# Week 6 - Classifier Evaluation Lab

* Copy and paste your model from Homework 5
* Add a grid search and retrain
* Compare the performance of the two models
* Which one is better? Explain.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import GridSearchCV

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/classification/loan_status_data/loan_status.csv')

In [3]:
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Loan_ID | LP001003 | LP001005 | LP001006 | LP001008 | LP001013 |
| Gender | Male | Male | Male | Male | Male |
| Married | Yes | Yes | Yes | No | Yes |
| Dependents | 1 | 0 | 0 | 0 | 0 |
| Education | Graduate | Graduate | Not Graduate | Graduate | Not Graduate |
| Self_Employed | No | Yes | No | No | No |
| ApplicantIncome | 4583 | 3000 | 2583 | 6000 | 2333 |
| CoapplicantIncome | 1508.0 | 0.0 | 2358.0 | 0.0 | 1516.0 |
| LoanAmount | 128.0 | 66.0 | 120.0 | 141.0 | 95.0 |
| Loan_Amount_Term | 360.0 | 360.0 | 360.0 | 360.0 | 360.0 |


In [4]:
df.describe()

| | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History |
|---|---|---|---|---|---|
| count | 381.0 | 381.0 | 381.0 | 370.0 | 351.0 |
| mean | 3579.845144 | 1277.275381 | 104.986877 | 340.864865 | 0.837607 |
| std | 1419.813818 | 2340.818114 | 28.358464 | 68.549257 | 0.369338 |
| min | 150.0 | 0.0 | 9.0 | 12.0 | 0.0 |
| 25% | 2600.0 | 0.0 | 90.0 | 360.0 | 1.0 |
| 50% | 3333.0 | 983.0 | 110.0 | 360.0 | 1.0 |
| 75% | 4288.0 | 2016.0 | 127.0 | 360.0 | 1.0 |
| max | 9703.0 | 33837.0 | 150.0 | 480.0 | 1.0 |


In [5]:
numerical_vars = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History']
categorical_vars = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area']


In [15]:
from sklearn.model_selection import train_test_split

X = df.drop('Loan_Status', axis=1)
y = df['Loan_Status']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=120)
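
As a quick check on how many rows land in each split, here is a minimal sketch using placeholder arrays. It assumes the data has 381 rows, which is the count `df.describe()` reports for the fully populated columns; `X_demo` and `y_demo` are stand-ins, not the loan data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 381 rows (assumed row count, taken from df.describe()).
X_demo = np.arange(381).reshape(-1, 1)
y_demo = np.zeros(381)

# Same split settings as the cell above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=120
)

n_train_rows, n_test_rows = len(X_tr), len(X_te)
print(n_train_rows, n_test_rows)
```

With these settings the test set should get 77 rows (0.2 * 381 rounded up) and the training set the remaining 304, so each test row is worth about 1.3 percentage points of test accuracy.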


In [16]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

In [17]:
# Numerical pipeline
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
numerical_pipeline

In [9]:
# Categorical pipeline
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])
categorical_pipeline

In [11]:
# Imports for the logistic regression classifier and the accuracy metric
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


In [12]:
# Combine the numerical and categorical pipelines.
# Columns not listed here (e.g. Loan_ID) are dropped by default.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_vars),
        ('cat', categorical_pipeline, categorical_vars)
    ])

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])
pipeline
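
To see what a preprocessor of this shape produces, here is a minimal sketch on a made-up two-column frame. The `income` and `area` columns and their values are hypothetical and only mirror the structure of the numerical and categorical pipelines above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    'income': [1000.0, 2000.0, np.nan, 3000.0],
    'area': pd.Series(['Urban', 'Rural', 'Urban', np.nan], dtype=object),
})

toy_pre = ColumnTransformer(transformers=[
    # Mean-impute, then standardize the numeric column.
    ('num', Pipeline([('imputer', SimpleImputer(strategy='mean')),
                      ('scaler', StandardScaler())]), ['income']),
    # Fill with the most frequent category, then one-hot encode.
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                      ('encoder', OneHotEncoder(handle_unknown='ignore'))]), ['area']),
])

toy_out = toy_pre.fit_transform(toy)
# The result can be sparse or dense depending on its density; normalize to dense.
if hasattr(toy_out, 'toarray'):
    toy_out = toy_out.toarray()

print(toy_out.shape)
print(toy_out)
```

The output has one scaled numeric column followed by one indicator column per category, so the toy frame goes from 2 columns to 3.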

In [13]:
# Baseline model: preprocessing plus logistic regression with default settings
# (LogisticRegression and accuracy_score were already imported above)
logreg_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                  ('classifier', LogisticRegression())])


In [18]:
logreg_pipeline.fit(X_train, y_train)


In [19]:
# Predictions on training and test sets
train_preds = logreg_pipeline.predict(X_train)
test_preds = logreg_pipeline.predict(X_test)


In [21]:
# Accuracy scores
train_accuracy = accuracy_score(y_train, train_preds)
test_accuracy = accuracy_score(y_test, test_preds)

print("Training Accuracy before GS:", train_accuracy)
print("Test Accuracy before GS:", test_accuracy)

Training Accuracy before GS: 0.8388157894736842
Test Accuracy before GS: 0.8831168831168831


In [22]:
# Define hyperparameters grid
param_grid = {
    'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100],
    'classifier__solver': ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga']
}
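
To get a feel for how much work this grid implies, here is a small sketch that counts the candidate combinations with scikit-learn's `ParameterGrid`. The fold count of 5 is taken from the `cv=5` setting used for the search; `n_candidates` and `n_fits` are just illustrative names.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100],
    'classifier__solver': ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga'],
}

# Every combination of C and solver is one candidate model.
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per cross-validation fold.
n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

This gives 30 candidates and 150 cross-validation fits. With the default `refit=True`, the best candidate is then fitted once more on the whole training set.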


In [23]:
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1)


In [24]:
# Fit the grid search to the data
grid_search.fit(X_train, y_train)

In [25]:
# Get the best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# Assign the best estimator from grid search
best_pipeline = grid_search.best_estimator_

Best Parameters: {'classifier__C': 0.01, 'classifier__solver': 'liblinear'}
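
Besides `best_params_`, a fitted `GridSearchCV` also exposes `best_score_` and `cv_results_`, which show how close the other candidates came. The sketch below uses synthetic data from `make_classification` and a much smaller grid, so `X_demo`, `y_demo`, and `demo_search` are hypothetical stand-ins rather than the loan model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data, for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {'C': [0.01, 1, 100]},
    cv=5,
    scoring='accuracy',
)
demo_search.fit(X_demo, y_demo)

# One row per candidate, with its mean and spread of validation accuracy.
results = pd.DataFrame(demo_search.cv_results_)[
    ['param_C', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(results.sort_values('rank_test_score'))
print("Best CV accuracy:", demo_search.best_score_)
```

If the top-ranked candidates differ by less than their `std_test_score`, the choice between them is probably not meaningful.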


In [26]:
# Evaluate the performance
train_accuracy = accuracy_score(y_train, best_pipeline.predict(X_train))
test_accuracy = accuracy_score(y_test, best_pipeline.predict(X_test))

print("Train Accuracy:", train_accuracy)
print("Test Accuracy:", test_accuracy)


Train Accuracy: 0.8421052631578947
Test Accuracy: 0.8701298701298701


We evaluate both models using accuracy, the proportion of predictions that are correct. We compute accuracy on both the training and test sets to check for overfitting.
Grid search tries each hyperparameter combination using 5-fold cross-validation on the training set and keeps the combination with the highest mean validation accuracy.
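
As a small illustration of the metric, the sketch below computes accuracy by hand and with `accuracy_score`. The labels are hypothetical and are not taken from the loan data.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels, for illustration only.
y_true_demo = ['Y', 'Y', 'N', 'Y', 'N']
y_pred_demo = ['Y', 'N', 'N', 'Y', 'Y']

# Accuracy = number of correct predictions / number of predictions.
n_correct = sum(t == p for t, p in zip(y_true_demo, y_pred_demo))
manual_accuracy = n_correct / len(y_true_demo)

sklearn_accuracy = accuracy_score(y_true_demo, y_pred_demo)
print(manual_accuracy, sklearn_accuracy)
```

Three of the five predictions match, so both values come out to 0.6.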

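To determine which model is better, we compare the accuracy scores of the model trained without grid search and the model tuned with grid search.

Model without grid search:
* Train accuracy: 0.8388157894736842
* Test accuracy: 0.8831168831168831

Model with grid search:
* Train accuracy: 0.8421052631578947
* Test accuracy: 0.8701298701298701

The model with the higher test accuracy is generally considered better, because test accuracy indicates how well the model performs on unseen data. By that criterion the model without grid search is marginally better (0.883 vs. 0.870). However, on a test set of 77 rows (20% of 381) this gap corresponds to a single prediction, so the two models perform essentially the same. We also need to check that the training accuracy is not significantly higher than the test accuracy, which would suggest overfitting. Neither model shows that pattern here, since both score slightly higher on the test set than on the training set.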
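
To put the accuracy gap in perspective, the reported scores can be converted back into counts of correct predictions. This is a rough sketch that assumes a split of 304 training rows and 77 test rows, which is what `test_size=0.2` should give for 381 rows. The names `reported` and `counts` are illustrative.

```python
# Assumed split sizes: 381 rows with test_size=0.2 -> 304 train / 77 test.
n_train, n_test = 304, 77

# (train accuracy, test accuracy) as reported above.
reported = {
    'without grid search': (0.8388157894736842, 0.8831168831168831),
    'with grid search': (0.8421052631578947, 0.8701298701298701),
}

# Convert each accuracy into a number of correctly predicted rows.
counts = {
    name: (round(train_acc * n_train), round(test_acc * n_test))
    for name, (train_acc, test_acc) in reported.items()
}

for name, (train_correct, test_correct) in counts.items():
    print(f"{name}: {train_correct}/{n_train} train, {test_correct}/{n_test} test")
```

Under that assumption the two models differ by one correct prediction on each set, which is too small a gap to call either model clearly better.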