# Finding the Best Model: Hyperparameter Tuning with GridSearchCV ⚙️

**Hyperparameters** are the "settings" of a machine learning model that are set *before* the training process begins. They are not learned from the data like the model's weights or coefficients. For example, the `max_depth` of a Decision Tree or the `C` parameter of an SVM are hyperparameters.
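To make the distinction concrete, here is a minimal sketch that contrasts a value we choose up front with structure the model learns during training. The small dataset and the names `X_demo`, `y_demo`, and `tree` are assumptions made for this illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset, used only for this illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Hyperparameter: chosen by us, available before any training happens
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
hyperparams = tree.get_params()
print("max_depth (set by us):", hyperparams["max_depth"])  # prints 3

# Learned structure: only exists after fit() has seen the data
tree.fit(X_demo, y_demo)
print("depth actually grown:", tree.get_depth())
print("number of nodes learned:", tree.tree_.node_count)
```

The `max_depth` setting only caps how deep the tree may grow; the splits themselves, and therefore the final depth and node count, are determined by the data.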

The choice of hyperparameters can have a huge impact on a model's performance. The process of finding the optimal combination of these settings is called **hyperparameter tuning**.

While you can do this manually by trial and error, it's inefficient. A much better approach is **Grid Search Cross-Validation (`GridSearchCV`)**. This technique automates the process by performing an exhaustive search over a specified parameter grid, using cross-validation to evaluate each combination and identify the best one.


## 1. The Manual Approach to Hyperparameter Tuning

Let's start with a synthetic dataset and a baseline `DecisionTreeClassifier`.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn import svm

# Generate data and split it
X, y = make_classification(
    n_features=10, n_samples=1000, n_informative=8,
    n_redundant=2, n_repeated=0, n_classes=2, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

We could manually test different combinations of hyperparameters (like `criterion` and `max_depth`) using `cross_val_score`.


In [15]:
# Manual test for one combination
scores = cross_val_score(DecisionTreeClassifier(criterion='gini', max_depth=10), X_train, y_train, cv=5)
print(f"Scores for one combination: {scores}")
print(f"Average score: {np.average(scores):.4f}")

Scores for one combination: [0.78       0.82666667 0.81333333 0.84666667 0.79333333]
Average score: 0.8120


In [3]:
# Same check with a shallower tree (max_depth=5) for comparison;
# cross_val_score was already imported in the first cell
cross_val_score(DecisionTreeClassifier(criterion='gini', max_depth=5), X_train, y_train, cv=5)

array([0.81333333, 0.80666667, 0.76666667, 0.84      , 0.75333333])

To test multiple combinations, we could write a loop. This is essentially a manual grid search.


In [16]:
criterion = ['gini', 'entropy']
max_depth = [5, 10, 15]
avg_scores = {}

for c in criterion:
    for d in max_depth:
        clf = DecisionTreeClassifier(criterion=c, max_depth=d)
        scores_list = cross_val_score(clf, X_train, y_train, cv=5)
        avg_scores[c + '_' + str(d)] = np.average(scores_list)

print(avg_scores)

{'gini_5': 0.796, 'gini_10': 0.8093333333333333, 'gini_15': 0.8226666666666667, 'entropy_5': 0.7773333333333333, 'entropy_10': 0.8026666666666668, 'entropy_15': 0.808}


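This manual approach works for a few parameters but quickly becomes unmanageable as the number of hyperparameters grows, because every added hyperparameter multiplies the number of combinations to test. Note also that `gini_10` scores 0.8093 here but 0.8120 in the single run above: no `random_state` is set on the classifier, so scores can differ slightly between runs.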
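A quick way to see how fast the search space grows is to count the combinations with scikit-learn's `ParameterGrid`. In this sketch, `larger_grid` is a hypothetical extension of the grid above; the two extra hyperparameters and their candidate values are chosen purely for illustration.

```python
from sklearn.model_selection import ParameterGrid

# The grid used in the manual loop above
small_grid = {"criterion": ["gini", "entropy"], "max_depth": [5, 10, 15]}

# Hypothetical larger grid: the extra entries are illustrative assumptions
larger_grid = {
    **small_grid,
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

n_folds = 5
for name, grid in [("small", small_grid), ("larger", larger_grid)]:
    n_combos = len(ParameterGrid(grid))
    print(f"{name}: {n_combos} combinations, {n_combos * n_folds} fits with cv={n_folds}")
```

The small grid has 6 combinations (30 fits with 5-fold cross-validation), while the larger one has 54 combinations (270 fits).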


## 2. Automating the Search with `GridSearchCV`

`GridSearchCV` automates this entire process. You provide it with:
1.  An **estimator** (the model, e.g., `DecisionTreeClassifier()`).
2.  A **parameter grid** (a dictionary of hyperparameters and the values to test).
3.  A cross-validation strategy (e.g., `cv=5`).

It then evaluates every combination in the grid with cross-validation (here 2 × 3 = 6 combinations) and records the best one.

In [17]:
clf = GridSearchCV(
    DecisionTreeClassifier(),
    {
        'criterion': ['gini', 'entropy'],
        'max_depth': [5, 10, 15]
    },
    cv = 5,
    return_train_score=False
)

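# GridSearchCV handles the cross-validation internally.
# Note: this cell fits on the full dataset (X, y), not only the training split,
# so its scores are not directly comparable to the manual loop above.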
clf.fit(X, y)

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy'], 'max_depth': [5, 10, ...]})

Best estimator after refit: DecisionTreeClassifier(criterion='entropy', max_depth=15)
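As a sanity check that the grid search really does the same work as the manual loop, the sketch below compares the two on identical data. It assumes a fixed `random_state=0` on the classifier (not used in the cells above) so that both runs are deterministic, and it reads the per-combination means from the fitted search's `cv_results_` attribute. The names `X_demo`, `y_demo`, and `search` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Same synthetic data recipe as in the first cell
X_demo, y_demo = make_classification(
    n_features=10, n_samples=1000, n_informative=8,
    n_redundant=2, n_repeated=0, n_classes=2, random_state=42
)

grid = {"criterion": ["gini", "entropy"], "max_depth": [5, 10, 15]}

# Assumption: random_state is fixed so repeated fits give identical trees
search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5)
search.fit(X_demo, y_demo)

all_match = True
for params, grid_mean in zip(search.cv_results_["params"],
                             search.cv_results_["mean_test_score"]):
    # Manual equivalent: one cross_val_score call per combination
    manual_mean = cross_val_score(
        DecisionTreeClassifier(random_state=0, **params), X_demo, y_demo, cv=5
    ).mean()
    all_match = all_match and bool(np.isclose(manual_mean, grid_mean))
    print(params, round(manual_mean, 4), round(grid_mean, 4))

print("Manual loop and GridSearchCV agree:", all_match)
```

Both approaches use the same default 5-fold splitting for classifiers, which is why the per-combination means should line up once the randomness in the tree is pinned down.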


The results of the grid search are stored in the `cv_results_` attribute, which we can view as a DataFrame.


In [18]:
df_results = pd.DataFrame(clf.cv_results_)
df_results[['param_criterion', 'param_max_depth', 'mean_test_score']]

| | param_criterion | param_max_depth | mean_test_score |
|---|---|---|---|
| 0 | gini | 5 | 0.781 |
| 1 | gini | 10 | 0.784 |
| 2 | gini | 15 | 0.787 |
| 3 | entropy | 5 | 0.781 |
| 4 | entropy | 10 | 0.789 |
| 5 | entropy | 15 | 0.81 |
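Beyond the mean score, `cv_results_` also carries the spread across folds and a ranking. The following sketch sorts the combinations by `rank_test_score` and shows `std_test_score` next to the mean. It refits its own search under the assumption of a fixed `random_state=0`, so its numbers will not match the table above exactly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Same synthetic data recipe as in the first cell
X_demo, y_demo = make_classification(
    n_features=10, n_samples=1000, n_informative=8,
    n_redundant=2, n_repeated=0, n_classes=2, random_state=42
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),  # fixed seed is an assumption for reproducibility
    {"criterion": ["gini", "entropy"], "max_depth": [5, 10, 15]},
    cv=5,
)
search.fit(X_demo, y_demo)

# Keep the most informative columns and order rows from best to worst
columns = ["param_criterion", "param_max_depth",
           "mean_test_score", "std_test_score", "rank_test_score"]
ranked = (
    pd.DataFrame(search.cv_results_)[columns]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(ranked)
```

Looking at `std_test_score` helps judge whether the gap between two combinations is larger than the fold-to-fold variation.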


## 3. Identifying the Best Model and Parameters

`GridSearchCV` exposes the best-performing combination directly through the `best_params_` attribute.

In [19]:
clf.best_params_

{'criterion': 'entropy', 'max_depth': 15}

The average cross-validated score of the best model is in `best_score_`.

In [20]:
clf.best_score_

0.8099999999999999

The `best_estimator_` attribute holds a model that has already been refit, using these parameters, on all of the data passed to `fit` (here the full `X, y`). It is ready for prediction.

In [21]:
clf.best_estimator_

DecisionTreeClassifier(criterion='entropy', max_depth=15)
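Since `best_score_` is a cross-validation average, a common follow-up is to check the refit model on data the search never saw. The sketch below is one way to do that, assuming the search is fitted on a training split only. The names `X_tr`, `X_te`, `y_tr`, `y_te`, and `search` are illustrative, and the fixed `random_state=0` on the classifier is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same synthetic data recipe and split settings as in the first cell
X_demo, y_demo = make_classification(
    n_features=10, n_samples=1000, n_informative=8,
    n_redundant=2, n_repeated=0, n_classes=2, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"criterion": ["gini", "entropy"], "max_depth": [5, 10, 15]},
    cv=5,
)
# Fit on the training split only, so the test split stays unseen
search.fit(X_tr, y_tr)

# With refit=True (the default), score() and predict() delegate to best_estimator_
test_accuracy = search.score(X_te, y_te)
same_predictions = bool(
    (search.predict(X_te) == search.best_estimator_.predict(X_te)).all()
)

print("Best params:", search.best_params_)
print(f"Mean CV accuracy: {search.best_score_:.4f}")
print(f"Held-out test accuracy: {test_accuracy:.4f}")
```

Calling `search.predict` or `search.score` is equivalent to calling the same method on `search.best_estimator_`, so the fitted search object can be used directly as the final model.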


## 4. Comparing Multiple Models with Grid Search

A powerful workflow is to use `GridSearchCV` to find the best version of *several different types of models* and then compare their optimal scores. Here, we'll compare the best `DecisionTreeClassifier` against the best `SVC` (Support Vector Classifier).


In [23]:
# Define models and their parameter grids
model_params = {
    'decision_tree' : {
        'model' : DecisionTreeClassifier(),
        'params' : {
            'criterion' : ['gini', 'entropy'],
            'max_depth' : [5, 10, 15]
        }
    },
    'svm' : {
        'model' : svm.SVC(gamma='auto'),
        'params' : {
            'C' : [1, 10, 20],
            'kernel' : ['linear', 'rbf']
        }
    }
}

# Loop through the models, run GridSearchCV, and store the results
scores = []
for key, val in model_params.items():
    clf = GridSearchCV(val['model'], val['params'], cv=5, return_train_score=False)
    clf.fit(X_train, y_train)
    scores.append({
        'model' : key,
        'best_score' : clf.best_score_,
        'best_params' : clf.best_params_
    })

# Display the results in a DataFrame
pd.DataFrame(scores)

| | model | best_score | best_params |
|---|---|---|---|
| 0 | decision_tree | 0.829333 | {'criterion': 'gini', 'max_depth': 15} |
| 1 | svm | 0.916 | {'C': 1, 'kernel': 'rbf'} |


**Conclusion:** After hyperparameter tuning, the best **SVM** reaches a mean cross-validated accuracy of **91.6%**, well ahead of the best Decision Tree (82.9%), which makes it the stronger model on this dataset. Both figures are cross-validation scores computed on the training split; the held-out test set plays no part in them. This illustrates how `GridSearchCV` is useful both for tuning a single model and for comparing different types of models under the same evaluation procedure.