### Hyperparameter Tuning - Grid Search

In this notebook, we revisit the process of building and evaluating a **Logistic Regression** model that predicts whether a patient has diabetes, using the same features as in the previous notebook. This time, our primary focus is **hyperparameter tuning**, a crucial step in optimizing the model's performance.

**Hyperparameter tuning** refers to the process of finding the optimal combination of hyperparameters that results in the best model performance. 
**Hyperparameters** are parameters that are set before training the model and cannot be learned directly from the data. These parameters, such as regularization strength, penalty type, and solver choice, can have a significant impact on the model's accuracy and generalization capabilities.

One of the most common methods for hyperparameter tuning is **Grid Search**. This approach systematically evaluates every combination of a predefined set of hyperparameter values by training and validating the model on each one. For example, we may want to experiment with different values of the inverse regularization strength (`C`) and different penalty types (`l1` vs. `l2`). Grid Search is exhaustive within the grid: no listed combination is skipped. Values outside the grid are never tried, however, and the number of candidates grows multiplicatively with each added hyperparameter.
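To make "all combinations" concrete, here is a minimal sketch that enumerates a grid by hand with `itertools.product`. The name `demo_grid` is illustrative only, and its values mirror the grid used later in this notebook; `GridSearchCV` does the equivalent enumeration internally.

```python
from itertools import product

# Illustrative grid (same values as the one tuned later in this notebook)
demo_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100],
    "penalty": ["l1", "l2"],
    "tol": [0.1, 0.001],
}

# Cartesian product of all value lists: one dict per candidate setting
names = list(demo_grid)
candidates = [dict(zip(names, values)) for values in product(*demo_grid.values())]

print(len(candidates))   # 6 * 2 * 2 = 24
print(candidates[0])     # {'C': 0.001, 'penalty': 'l1', 'tol': 0.1}
```

Adding one more hyperparameter with three values would triple the count to 72, which is why grids are usually kept coarse.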

To compare hyperparameter combinations fairly, **cross-validation** is typically used together with Grid Search. Each combination is scored on several train/validation splits of the training data, so the chosen model is robust rather than tuned to one particular split.
At the end of the tuning process, we select the combination of hyperparameters with the best average cross-validation score.
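The splitting idea behind k-fold cross-validation can be sketched in plain Python. The helper `k_fold_indices` below is a hypothetical illustration, not a scikit-learn function, and it uses simple contiguous folds. For a classifier with an integer `cv`, scikit-learn uses stratified folds by default, which additionally preserve class proportions.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) index lists for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Each sample is used for validation exactly once across the 5 folds
folds = list(k_fold_indices(n_samples=10, k=5))
for fold_no, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {fold_no}: validate on {val_idx}, train on {len(train_idx)} samples")
```

With 5 folds, every candidate setting is trained and scored 5 times, and the mean of those scores is what Grid Search compares.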

Through this process, we can fine-tune our model's performance, improving its ability to make accurate predictions on unseen data.

In this example, we will use **GridSearchCV** from scikit-learn to tune the regularization strength, the penalty type, and the tolerance threshold (used for the stopping criterion) of a logistic regression model.

In [1]:
import pandas as pd

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [2]:
# Load the diabetes dataset
diabetes = pd.read_csv('./datasets/diabetes.csv')

In [3]:
# Shuffle all samples to avoid ordering bias (no random_state is set, so the order differs between runs)
diabetes = diabetes.sample(frac=1).reset_index(drop=True)

In [4]:
# Select features and target variable
selected_features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
                     'BMI', 'DiabetesPedigreeFunction', 'Age']
X = diabetes[selected_features].values
y = diabetes['Outcome'].values

In [5]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [6]:
# Standardize the features using StandardScaler
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

In [7]:
# Create a logistic regression model.
# The 'liblinear' and 'saga' solvers both support 'l1' and 'l2' penalties.
logistic_model = LogisticRegression(random_state=42, solver='saga')

In [8]:
# Define the hyperparameter grid for grid search
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],  # Inverse of regularization strength (smaller C = stronger regularization)
    'penalty': ['l1', 'l2'],  # Regularization type
    'tol': [0.1, 0.001]  # Tolerance for the solver's stopping criterion (a larger value stops the optimization earlier)
}

# Create GridSearchCV object
grid_search = GridSearchCV(logistic_model, param_grid, cv=5, scoring='accuracy', verbose=True)

# Fit the model with grid search on the standardized training data
grid_search.fit(X_train_std, y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
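One caveat with the cell above: the scaler was fitted on the full training set before cross-validation, so each validation fold has already influenced the scaling statistics. A common way to avoid this is to put the scaler and the model in a `Pipeline`, so that scaling is re-fitted inside every fold. The sketch below is a hedged illustration, not this notebook's method: it uses synthetic data from `make_classification` instead of the diabetes dataset, and the names `pipe`, `pipe_grid`, `pipe_search` and the `max_iter=5000` setting are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Scaler and classifier are fitted together, separately within each CV fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="saga", max_iter=5000, random_state=0)),
])

# The "clf__" prefix routes each parameter to the pipeline step named "clf"
pipe_grid = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__penalty": ["l1", "l2"],
}

pipe_search = GridSearchCV(pipe, pipe_grid, cv=5, scoring="accuracy")
pipe_search.fit(X_tr, y_tr)  # raw features go in; scaling happens inside the pipeline

print(pipe_search.best_params_)
print(f"Test accuracy: {pipe_search.score(X_te, y_te):.2f}")
```

Because the fitted pipeline carries its own scaler, the test set is passed in unscaled and no separate `transform` call is needed.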


In [9]:
# Get the best parameters from the grid search
best_params = grid_search.best_params_
print(best_params)

{'C': 10, 'penalty': 'l1', 'tol': 0.1}


In [10]:
# Make predictions on the standardized test data using the best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test_std)

In [11]:
# Evaluate the performance of the best model
accuracy = accuracy_score(y_test, y_pred)
classification_report_str = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Print the results
print(f'Best Hyperparameters: {best_params}')
print(f'Accuracy with Best Model: {accuracy:.2f}')
print('Classification Report:\n', classification_report_str)
print('Confusion Matrix:\n', conf_matrix)

Best Hyperparameters: {'C': 10, 'penalty': 'l1', 'tol': 0.1}
Accuracy with Best Model: 0.79
Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.84      0.84       101
           1       0.70      0.70      0.70        53

    accuracy                           0.79       154
   macro avg       0.77      0.77      0.77       154
weighted avg       0.79      0.79      0.79       154

Confusion Matrix:
 [[85 16]
 [16 37]]
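The figures in the classification report can be re-derived by hand from the confusion matrix above. In scikit-learn's layout, rows are true classes and columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]` for the positive class `1`. The variable names below are illustrative.

```python
# Confusion matrix copied from the output above: rows = true, columns = predicted
conf = [[85, 16],
        [16, 37]]
(tn, fp), (fn, tp) = conf

total = tn + fp + fn + tp                 # 154 test samples
accuracy_check = (tn + tp) / total        # correct predictions / all predictions
precision_pos = tp / (tp + fp)            # of predicted diabetics, how many truly are
recall_pos = tp / (tp + fn)               # of true diabetics, how many were found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"accuracy:  {accuracy_check:.2f}")   # 0.79
print(f"precision: {precision_pos:.2f}")    # 0.70
print(f"recall:    {recall_pos:.2f}")       # 0.70
print(f"f1-score:  {f1_pos:.2f}")           # 0.70
```

Note that 16 of the 53 actual positive cases are missed, a detail the overall accuracy of 0.79 does not show on its own.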
