GridSearchCV & RandomizedSearchCV

Import Libraries & Load Data

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Load Titanic dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)


Data Cleaning & Feature Engineering

In [2]:
# Assign the filled column back instead of calling fillna(..., inplace=True)
# on df[col]; the chained inplace form is deprecated and does not modify df
# in pandas 3.0.
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Encode categorical columns as integers
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

# Passenger plus siblings/spouses and parents/children aboard
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1


Select Features & Target

In [3]:
X = df[['Pclass', 'Sex', 'Age', 'Fare', 'FamilySize']]
y = df['Survived']


Train-Test Split

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
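
The missing-value fill earlier in the notebook uses statistics from the full dataset, computed before this split, so the test rows contribute to the imputed mean. One possible way to avoid that is to put the imputation inside a scikit-learn Pipeline, so it is re-fitted on the training portion of each fold. The sketch below uses a small synthetic table as a stand-in for the Titanic features; the names X_demo, y_demo and pipe are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption: NOT the real Titanic dataset)
rng = np.random.default_rng(42)
n = 200
X_demo = pd.DataFrame({
    "Age": rng.normal(30, 10, n),
    "Fare": rng.exponential(30, n),
})
y_demo = (X_demo["Fare"] > 25).astype(int)

# Blank out 40 Age values to mimic missing data
missing_rows = rng.choice(n, size=40, replace=False)
X_demo.loc[missing_rows, "Age"] = np.nan

# The imputer is fitted only on the training part of each CV fold
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
])

scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="roc_auc")
print("Mean CV AUC:", scores.mean())
```

A pipeline like this can be passed to GridSearchCV in place of the bare classifier; parameter names then take a step prefix such as rf__max_depth.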


GridSearchCV (Exhaustive Search)

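Evaluates every combination in the grid with cross-validation. Thorough within the grid, but the number of fits multiplies with every added parameter value, and values not listed in the grid are never tried.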

In [5]:
rf = RandomForestClassifier(random_state=42)

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [5, 8, None],
    'min_samples_split': [2, 10],
    'min_samples_leaf': [1, 5]
}

grid_search = GridSearchCV(
    rf,
    param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
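
To see how much work the exhaustive search does, the grid can be expanded with ParameterGrid from scikit-learn. This is a small sketch that repeats the param_grid from the cell above; n_candidates and cv_folds are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the GridSearchCV cell
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [5, 8, None],
    'min_samples_split': [2, 10],
    'min_samples_leaf': [1, 5]
}

n_candidates = len(ParameterGrid(param_grid))  # 2 * 3 * 2 * 2
cv_folds = 5

print("Candidates:", n_candidates)                   # Candidates: 24
print("Cross-validation fits:", n_candidates * cv_folds)  # Cross-validation fits: 120
```

With the default refit=True, one more fit of the best candidate on the full training set follows the 120 cross-validation fits.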


Best Parameters from GridSearch

In [6]:
print("Best Parameters:", grid_search.best_params_)
print("Best CV AUC:", grid_search.best_score_)


Best Parameters: {'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
Best CV AUC: 0.8629070633549116
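
The best score alone does not show how close the other candidates were. A fitted search object also exposes cv_results_, which can be loaded into a DataFrame and ranked. The sketch below is self-contained and uses synthetic data with a deliberately tiny grid, so its numbers say nothing about the Titanic results; demo_search and results are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption: stand-in for X_train / y_train)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=42)

demo_search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=42),
    {"max_depth": [3, None], "min_samples_leaf": [1, 5]},
    cv=3,
    scoring="roc_auc",
)
demo_search.fit(X_demo, y_demo)

# One row per candidate, best-ranked first
results = (
    pd.DataFrame(demo_search.cv_results_)
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(results)
```

If several candidates score within one standard deviation of each other, the simpler one (for example a shallower max_depth) is often a reasonable choice.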


Evaluate Best Model on Test Data

In [7]:
best_rf = grid_search.best_estimator_

y_pred = best_rf.predict(X_test)
y_prob = best_rf.predict_proba(X_test)[:, 1]

print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("Test AUC:", roc_auc_score(y_test, y_prob))


Test Accuracy: 0.8044692737430168
Test AUC: 0.8906048906048907


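RandomizedSearchCV (Random Sampling)

Samples n_iter random combinations instead of trying them all. Here that is 30 of the 8 × 4 × 18 × 9 = 5,184 possible combinations, so the cost is fixed by n_iter, not by the size of the search space.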

In [8]:
param_dist = {
    'n_estimators': np.arange(100, 500, 50),
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': np.arange(2, 20),
    'min_samples_leaf': np.arange(1, 10)
}

random_search = RandomizedSearchCV(
    rf,
    param_distributions=param_dist,
    n_iter=30,
    cv=5,
    scoring='roc_auc',
    random_state=42,
    n_jobs=-1
)

random_search.fit(X_train, y_train)
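
The randomized search can be reported and evaluated the same way as the grid search, through best_params_, best_score_ and best_estimator_. The sketch below is self-contained: it uses synthetic data and a smaller search (n_iter=5, cv=3) so it runs quickly, and the names demo_random, X_tr, X_te, y_tr, y_te are illustrative. In the notebook itself the same attributes would be read from random_search.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data (assumption: stand-in for the Titanic features)
X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_random = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={
        "n_estimators": np.arange(20, 60, 10),
        "max_depth": [None, 5, 10],
        "min_samples_leaf": np.arange(1, 10),
    },
    n_iter=5,          # number of sampled combinations
    cv=3,
    scoring="roc_auc",
    random_state=42,   # makes the sampled combinations reproducible
)
demo_random.fit(X_tr, y_tr)

print("Best Parameters:", demo_random.best_params_)
print("Best CV AUC:", demo_random.best_score_)

# Evaluate the refitted best model on held-out data
y_prob = demo_random.best_estimator_.predict_proba(X_te)[:, 1]
test_auc = roc_auc_score(y_te, y_prob)
print("Test AUC:", test_auc)
```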


    GridSearch vs RandomSearch

    Feature     GridSearch     RandomSearch
    --------    -----------    ---------------
    Speed       Slow           Fast
    Coverage    Exhaustive     Random sampling
    Best for    Small grids    Large spaces

    In practice, it is common to run RandomSearch first.

Extra: Why tuning works

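Reduces overfitting by constraining model complexity (for example max_depth and min_samples_leaf)

Improves generalization, since settings are chosen by cross-validated score rather than by fit to the training data

Can uncover performance gains that the default settings leave unused

Standard practice in competitions and production work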