# Hyperparameter Tuning of ML Models

# 1. Understanding Hyperparameters 

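Hyperparameters are settings chosen before training begins that control how a machine learning model learns, such as the number of trees in a random forest or the maximum depth of each tree. Unlike model parameters, which are learned from the data during training, hyperparameters must be set in advance. Grid Search, Random Search, and Bayesian Optimization are common techniques for finding well-performing values: Grid Search evaluates every combination in a predefined grid, Random Search evaluates a fixed number of randomly sampled combinations, and Bayesian Optimization uses the results of earlier trials to decide which combination to try next.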
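To make the distinction concrete, here is a minimal sketch that is not part of the notebook's pipeline. The function `fit_slope` and its toy data are hypothetical: `learning_rate` and `n_epochs` are hyperparameters fixed before training, while the slope `w` is a model parameter learned from the data.

```python
def fit_slope(xs, ys, learning_rate=0.01, n_epochs=200):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0  # model parameter: starts arbitrary, learned from the data
    n = len(xs)
    for _ in range(n_epochs):
        # Gradient of mean((w*x - y)^2) with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w

# Toy data generated by y = 2x, so the ideal slope is 2.0
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Same data, different hyperparameters, different learned parameter
w_good = fit_slope(xs, ys, learning_rate=0.01, n_epochs=200)
w_short = fit_slope(xs, ys, learning_rate=0.01, n_epochs=5)

print(f"200 epochs: w = {w_good:.3f}")
print(f"  5 epochs: w = {w_short:.3f}")
```

With too few epochs the learned slope stays well short of 2.0, which is the kind of effect hyperparameter tuning is meant to detect and correct.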

### Importing Necessary Libraries

In [4]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
import joblib

# 2. Grid Search

In [5]:
df = pd.read_csv('spam_dataset.csv')

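X = df.drop(columns=['Prediction'])  # features: every column except the label
y = df['Prediction']  # target: the class label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # hold out 20% of the rows for testing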

rf_model = RandomForestClassifier(random_state=42)

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, scoring='accuracy', verbose=1)
grid_search.fit(X_train, y_train)

print("Grid Search Best parameters found:", grid_search.best_params_)
print("Grid Search Best cross-validation score:", grid_search.best_score_)

best_rf_model_grid = grid_search.best_estimator_  # refit on all of X_train with the best parameters
y_pred_grid = best_rf_model_grid.predict(X_test)

Fitting 5 folds for each of 81 candidates, totalling 405 fits
Grid Search Best parameters found: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 200}
Grid Search Best cross-validation score: 0.9719578944908843
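The "81 candidates, totalling 405 fits" line in the log follows directly from the size of the grid. As a rough sketch using only the standard library (not how `GridSearchCV` is implemented internally), the candidates can be enumerated with `itertools.product`:

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# One candidate per element of the Cartesian product of all value lists
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 5  # matches cv=5
print(len(candidates))            # 81  (3 * 3 * 3 * 3)
print(len(candidates) * n_folds)  # 405 cross-validation fits
print(candidates[0])              # first candidate in the grid
```

The count grows multiplicatively: adding a fifth hyperparameter with three values would triple the work to 243 candidates. With the default `refit=True`, `GridSearchCV` also performs one extra fit of the best candidate on the full training set, which the log does not count.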


# 3. Random Search

In [6]:
param_dist = {
    'n_estimators': np.arange(50, 201, 50),
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': np.arange(2, 11, 2),
    'min_samples_leaf': np.arange(1, 5)
}

random_search = RandomizedSearchCV(estimator=rf_model, param_distributions=param_dist, n_iter=100, cv=5, scoring='accuracy', random_state=42, verbose=1)
random_search.fit(X_train, y_train)

print("Random Search Best parameters found:", random_search.best_params_)
print("Random Search Best cross-validation score:", random_search.best_score_)

best_rf_model_random = random_search.best_estimator_  # refit on all of X_train with the best parameters
y_pred_random = best_rf_model_random.predict(X_test)

Fitting 5 folds for each of 100 candidates, totalling 500 fits
Random Search Best parameters found: {'n_estimators': 200, 'min_samples_split': 8, 'min_samples_leaf': 1, 'max_depth': 50}
Random Search Best cross-validation score: 0.9729249542902872
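Random Search covers a larger space than the grid above while keeping the number of fits bounded by `n_iter`. The sketch below illustrates the idea with the standard library only. It assumes sampling without replacement from a finite set of combinations, and it uses Python's `random` module, so it will not reproduce the exact candidates that `RandomizedSearchCV` drew with `random_state=42`.

```python
import random
from itertools import product

# Same values as param_dist above, written as plain lists
param_dist = {
    'n_estimators': list(range(50, 201, 50)),    # 50, 100, 150, 200
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': list(range(2, 11, 2)),  # 2, 4, 6, 8, 10
    'min_samples_leaf': list(range(1, 5)),       # 1, 2, 3, 4
}

# Every combination the search could draw from
all_combos = list(product(*param_dist.values()))

# Draw n_iter distinct combinations (illustrative seed, not sklearn's RNG)
n_iter = 100
rng = random.Random(42)
sampled = rng.sample(all_combos, k=n_iter)

coverage = len(sampled) / len(all_combos)
print(len(all_combos))     # 480 (4 * 6 * 5 * 4)
print(len(set(sampled)))   # 100 distinct candidates
print(f"{coverage:.1%}")   # 20.8% of the space evaluated
```

An exhaustive grid over the same space would need 480 × 5 = 2400 fits; sampling 100 candidates reduces that to the 500 fits reported in the log.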


# 4. Model Evaluation

In [7]:
accuracy_grid = accuracy_score(y_test, y_pred_grid)
print(f"Grid Search Accuracy on test set: {accuracy_grid:.2f}")
print(classification_report(y_test, y_pred_grid))

accuracy_random = accuracy_score(y_test, y_pred_random)
print(f"Random Search Accuracy on test set: {accuracy_random:.2f}")
print(classification_report(y_test, y_pred_random))

joblib.dump(best_rf_model_grid, 'best_rf_model_grid.pkl')  # persist both tuned models to disk
joblib.dump(best_rf_model_random, 'best_rf_model_random.pkl')

Grid Search Accuracy on test set: 0.98
              precision    recall  f1-score   support

           0       0.99      0.98      0.98       739
           1       0.95      0.97      0.96       296

    accuracy                           0.98      1035
   macro avg       0.97      0.97      0.97      1035
weighted avg       0.98      0.98      0.98      1035

Random Search Accuracy on test set: 0.97
              precision    recall  f1-score   support

           0       0.99      0.98      0.98       739
           1       0.95      0.96      0.96       296

    accuracy                           0.97      1035
   macro avg       0.97      0.97      0.97      1035
weighted avg       0.98      0.97      0.97      1035



['best_rf_model_random.pkl']
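The `macro avg` and `weighted avg` rows in the reports differ because the classes are imbalanced (739 versus 296 test samples). As a small worked example, the precision averages of the Grid Search report can be approximately recomputed from its per-class rows. The inputs here are the rounded values printed in the report, so this is a sanity check rather than an exact reproduction of what `classification_report` computes.

```python
# Per-class precision and support copied from the Grid Search report
precision = {0: 0.99, 1: 0.95}
support = {0: 739, 1: 296}

total = sum(support.values())

# Macro average: every class counts equally
macro_avg = sum(precision.values()) / len(precision)

# Weighted average: every class counts in proportion to its support
weighted_avg = sum(precision[c] * support[c] for c in precision) / total

print(total)                  # 1035
print(f"{macro_avg:.2f}")     # 0.97
print(f"{weighted_avg:.2f}")  # 0.98
```

The weighted average is pulled toward class 0 because it makes up roughly 71% of the test set, which is why it sits above the macro average.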