In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
import seaborn as sns


In [2]:
data = sns.load_dataset("iris")
data.head()

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


In [3]:
data.isnull().sum()

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

Grid Search CV

Grid Search CV is a hyperparameter tuning technique used in machine learning to find good hyperparameters for a model. It is an exhaustive search: it tries every combination of the hyperparameter values listed in a user-defined grid and evaluates each one using cross-validation.

How Grid Search CV Works:

1. Define Hyperparameter Grid: Define a grid of hyperparameters to search over.
2. Cross-Validation: For each combination, split the training data into k folds; each fold serves once as the validation set while the remaining folds are used for training.
3. Model Evaluation: Train the model on the training folds, score it on the held-out fold, and average the k scores to get one cross-validated score per combination.
4. Best Hyperparameters: Select the combination of hyperparameters that results in the best performance.
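
The four steps above can be sketched by hand with ParameterGrid and cross_val_score. This is only an illustration of what GridSearchCV automates, not its actual implementation. As assumptions for the sketch, the iris data is loaded from scikit-learn instead of seaborn, the grid is a small illustrative one, and random_state=0 is fixed on the tree for repeatability.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris is loaded from scikit-learn so the sketch needs no download.
X, y = load_iris(return_X_y=True)

# Step 1: define a (deliberately small) hyperparameter grid.
grid = {"max_depth": [2, 3, 4], "criterion": ["gini", "entropy"]}

best_score, best_params = -1.0, None
for candidate in ParameterGrid(grid):  # every combination in the grid
    model = DecisionTreeClassifier(random_state=0, **candidate)
    # Steps 2-3: 5-fold cross-validation, averaged into one score per candidate.
    score = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    # Step 4: keep the best-scoring combination seen so far.
    if score > best_score:
        best_score, best_params = score, candidate

print("Best parameters:", best_params)
print("Best mean CV accuracy:", round(best_score, 3))
```

GridSearchCV does the same bookkeeping, and by default also refits the best combination on the full training data so that best_estimator_ is ready to use.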

Advantages:

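1. Exhaustive Search: Grid Search CV evaluates every combination in the grid, so the best-scoring combination within that grid is always found. Values outside the grid are never considered.
2. Cross-Validation: Scoring each combination on several folds gives a more reliable performance estimate than a single train/validation split, which lowers the risk of choosing hyperparameters that overfit one particular split.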
3. Easy to Implement: Grid Search CV is a widely used technique, and many machine learning libraries (e.g., scikit-learn) provide built-in support.

Disadvantages:

1. Computationally Expensive: Grid Search CV can be computationally expensive, especially for large datasets or complex models.
2. Curse of Dimensionality: The number of combinations is the product of the number of values listed for each hyperparameter, so it grows exponentially as hyperparameters are added, making Grid Search CV less practical.
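
To make the cost concrete, the number of model fits can be counted before running anything. The sketch below uses the same grid as the next cell (5, 3 and 2 values) with 5-fold cross-validation; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

params = {
    "max_depth": [2, 3, 4, 5, 6],
    "min_samples_split": [2, 5, 10],
    "criterion": ["gini", "entropy"],
}

n_candidates = len(ParameterGrid(params))  # 5 * 3 * 2 combinations
n_folds = 5
n_fits = n_candidates * n_folds            # one fit per combination per fold

print(n_candidates, "candidates,", n_fits, "fits")
```

Adding one more hyperparameter with 4 values would multiply both numbers by 4.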

Alternatives:

1. Random Search: Randomly samples the hyperparameter space, rather than trying all possible combinations.
2. Bayesian Optimization: Fits a probabilistic model of the score as a function of the hyperparameters and uses it to choose the most promising combination to evaluate next.

Grid Search CV is a powerful technique for hyperparameter tuning, but it may not be the most efficient approach for large or complex problems.

In [4]:
X = data.drop(columns="species")
y = data["species"]

# Split (80% train, 20% test)
train_X, test_X, train_Y, test_Y = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Define parameter grid (dictionary keys must be strings!)
params = {
    "max_depth": [2, 3, 4, 5, 6],
    "min_samples_split": [2, 5, 10],
    "criterion": ["gini", "entropy"]
}

# Grid Search with Cross Validation
grid_search = GridSearchCV(
    estimator=DecisionTreeClassifier(),
    param_grid=params,
    cv=5,   # 5-fold cross-validation
    scoring="accuracy",
    n_jobs=-1
)

# Train
grid_search.fit(train_X, train_Y)

# Best parameters
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Accuracy:", grid_search.best_score_)

# Evaluate on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(test_X)

print("\nAccuracy Score:", accuracy_score(test_Y, y_pred))
print("\nClassification Report:\n", classification_report(test_Y, y_pred))

Best Parameters: {'criterion': 'gini', 'max_depth': 3, 'min_samples_split': 2}
Best Cross-Validation Accuracy: 0.9333333333333333

Accuracy Score: 0.9666666666666667

Classification Report:
               precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.90      0.95        10
   virginica       0.91      1.00      0.95        10

    accuracy                           0.97        30
   macro avg       0.97      0.97      0.97        30
weighted avg       0.97      0.97      0.97        30
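
Beyond the single best combination, it can be useful to see how close the runners-up were. The sketch below loads cv_results_ into a DataFrame and lists the top-ranked candidates. It rebuilds the search so that it runs on its own; as assumptions, iris comes from scikit-learn rather than seaborn and random_state=42 is set on the tree, so the scores may differ slightly from the output above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris from scikit-learn (column names differ from the seaborn copy).
X, y = load_iris(return_X_y=True, as_frame=True)
train_X, test_X, train_Y, test_Y = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),  # fixed seed is an assumption
    param_grid={
        "max_depth": [2, 3, 4, 5, 6],
        "min_samples_split": [2, 5, 10],
        "criterion": ["gini", "entropy"],
    },
    cv=5,
    scoring="accuracy",
)
search.fit(train_X, train_Y)

# One row per candidate; rank_test_score == 1 marks the best mean CV score.
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
top = results.sort_values("rank_test_score")[columns].head()
print(top)
```

Several candidates often tie or sit within one standard deviation of the best score, in which case the simpler tree is usually a reasonable choice.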



Random Search CV

Random Search CV is a hyperparameter tuning technique used in machine learning to find good hyperparameters for a model. Unlike Grid Search CV, which evaluates every combination in a predefined grid, Random Search CV evaluates a fixed number of combinations sampled at random from the hyperparameter space.

How Random Search CV Works:

1. Define Hyperparameter Distribution: Define a distribution for each hyperparameter (e.g., uniform, log-uniform).
2. Random Sampling: Randomly sample the hyperparameter space, generating a set of hyperparameters.
3. Model Evaluation: Evaluate the sampled hyperparameters with cross-validation on the training data, giving one averaged score per sample.
4. Iteration: Repeat steps 2-3 for a specified number of iterations or until a stopping criterion is met.
5. Best Hyperparameters: Select the combination of hyperparameters that results in the best performance.
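
Steps 1 and 2 can be looked at in isolation with ParameterSampler, which draws candidates without training anything. This is a sketch under assumptions: the distributions from scipy.stats and their ranges are illustrative choices, not taken from the cells in this notebook.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Step 1: a distribution (or a list of choices) per hyperparameter.
param_dist = {
    "max_depth": randint(2, 7),           # integers 2..6 (upper bound is exclusive)
    "min_samples_split": randint(2, 51),  # integers 2..50
    "criterion": ["gini", "entropy"],     # lists are sampled uniformly
}

# Step 2: draw a fixed number of random candidates.
samples = list(ParameterSampler(param_dist, n_iter=5, random_state=42))
for candidate in samples:
    print(candidate)
```

RandomizedSearchCV draws its candidates in the same way and then cross-validates each one.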

Advantages:

1. Efficient: Random Search CV is often more efficient than Grid Search CV, especially for large hyperparameter spaces.
2. Flexibility: Random Search CV can handle continuous and categorical hyperparameters.
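3. Fewer Candidates Evaluated: Because only a fixed number of combinations is scored, there are fewer chances to select a combination that looks good on the validation folds purely by chance. This effect is modest and does not replace a final check on a held-out test set.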
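
As an illustration of the flexibility point, the sketch below mixes an integer distribution, a continuous log-uniform distribution and a categorical list in one search. The choice of ccp_alpha, its range, n_iter=15 and the fixed seeds are assumptions made for the example, and iris is loaded from scikit-learn rather than seaborn.

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

param_dist = {
    "max_depth": randint(2, 7),           # integer-valued, 2..6
    "ccp_alpha": loguniform(1e-4, 1e-1),  # continuous, sampled on a log scale
    "criterion": ["gini", "entropy"],     # categorical
}

search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=15,        # the budget: exactly 15 candidates are evaluated
    cv=5,
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print("Best Parameters:", search.best_params_)
print("Best Cross-Validation Accuracy:", search.best_score_)
```

A grid would have to discretise ccp_alpha into a handful of fixed values; sampling from a distribution lets each candidate try a different one.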

Disadvantages:

1. No Guarantee of Optimal Solution: Random Search CV doesn't guarantee finding the optimal solution, as it relies on random sampling.
2. May Miss Important Regions: Random Search CV may miss important regions of the hyperparameter space.

When to Use:

1. Large Hyperparameter Space: Random Search CV is suitable for large hyperparameter spaces where Grid Search CV is impractical.
2. Limited Computational Resources: Random Search CV is a good option when computational resources are limited.

Random Search CV is a useful technique for hyperparameter tuning, offering a good trade-off between efficiency and effectiveness.

In [5]:
X = data.drop(columns="species")
y = data["species"]

# Split (80% train, 20% test)
train_X, test_X, train_Y, test_Y = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Define the search space: each list is sampled uniformly (5 * 5 * 2 = 50 possible combinations)
param_dist = {
    "max_depth": [2, 3, 4, 5, 6],
    "min_samples_split": [2, 5, 10, 20, 50],
    "criterion": ["gini", "entropy"]
}

# Randomized Search with Cross Validation
random_search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(),
    param_distributions=param_dist,
    n_iter=10,              # number of random combinations to try
    cv=5,                   # 5-fold CV
    scoring="accuracy",
    random_state=42,
    n_jobs=-1
)

# Train
random_search.fit(train_X, train_Y)

# Best parameters
print("Best Parameters:", random_search.best_params_)
print("Best Cross-Validation Accuracy:", random_search.best_score_)

# Evaluate on test set
best_model = random_search.best_estimator_
y_pred = best_model.predict(test_X)

print("\nAccuracy Score:", accuracy_score(test_Y, y_pred))
print("\nClassification Report:\n", classification_report(test_Y, y_pred))

Best Parameters: {'min_samples_split': 20, 'max_depth': 4, 'criterion': 'gini'}
Best Cross-Validation Accuracy: 0.9333333333333333

Accuracy Score: 0.9666666666666667

Classification Report:
               precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.90      0.95        10
   virginica       0.91      1.00      0.95        10

    accuracy                           0.97        30
   macro avg       0.97      0.97      0.97        30
weighted avg       0.97      0.97      0.97        30

