<a href="https://colab.research.google.com/github/OsirisEscaL/Machine_Learning/blob/main/Hyperparameter_Tuning_Model_Selection_and_Evaluation_with_Scikit_Learn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hyperparameter Tuning: Model Selection and Evaluation

Machine learning has progressed significantly in creating intelligent systems capable of prediction, pattern recognition, and task automation. However, the effectiveness of a machine learning model depends largely on the algorithm used and on how its hyperparameters are tuned. This project compares several machine learning algorithms on a single dataset to emphasize the importance of model selection and hyperparameter tuning. We explore two hyperparameter tuning techniques, grid search and random search, to identify the best-performing model, and provide code examples to guide you.

**The Importance of Model Selection**

Model selection is the process of choosing the most appropriate machine learning algorithm for a given problem. Each algorithm has its own strengths and weaknesses, so it is important to determine which one best suits your dataset and task. Commonly used algorithms include linear regression, decision trees, support vector machines, and neural networks.

Model selection involves evaluating and selecting the algorithm most likely to perform optimally on a given dataset. We consider the dataset's size, feature characteristics, problem type (classification or regression), and the algorithm's fundamental assumptions. Using a one-size-fits-all approach often produces suboptimal results.
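
As a minimal sketch of this idea, the snippet below compares two candidate algorithms by their mean cross-validated R-squared. The synthetic data from `make_regression`, the two candidate models, and the names `X_demo`, `y_demo`, `candidates`, and `cv_scores` are assumptions for illustration only; they are not part of this project's pipeline.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for illustration, not the wine dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Hypothetical set of candidate algorithms to compare
candidates = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=0),
}

# Mean 5-fold cross-validated R-squared for each candidate
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}

# Pick the candidate with the highest mean score
best_name = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print(f"Best candidate by cross-validated R-squared: {best_name}")
```

Comparing on cross-validated scores rather than a single split makes the ranking less sensitive to how the data happened to be divided.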

**The Role of Hyperparameters**

Once we select an algorithm, the next critical step is to tune its hyperparameters. Hyperparameters are configuration settings that govern the learning process. They are not learned from the data and must be specified before training. Examples include the learning rate, the maximum depth of a decision tree, and the number of hidden layers in a neural network.
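
The distinction can be seen directly in scikit-learn. In the sketch below, `max_depth` is set before training and is unchanged by `fit`, while the tree structure itself is learned from the data. The synthetic data and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)

# max_depth is a hyperparameter: chosen before training, not updated by fit()
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
hyperparams = tree.get_params()
print(hyperparams["max_depth"])

tree.fit(X_demo, y_demo)

# The fitted tree structure is learned from the data, within the limit set above
learned_depth = tree.get_depth()
n_leaves = tree.get_n_leaves()
print(learned_depth, n_leaves)
```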

Hyperparameter optimization involves finding the combination of hyperparameters that maximizes a model's performance on a given dataset. Although this process can be computationally intensive and time-consuming, it is often needed to get the best performance out of a model. We will examine two popular methods for hyperparameter optimization: grid search and random search.

**Dataset**

We will use the Wine Quality Dataset from [Kaggle](https://www.kaggle.com/datasets/yasserh/wine-quality-dataset) for this project. This dataset is related to red variants of the Portuguese "Vinho Verde" wine. It describes the amounts of various chemicals present in each wine and their effect on its quality.

**Step 1: Importing Essential Libraries**

We begin by importing the Python libraries used in this project:

In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor

**Step 2: Loading and Preprocessing the Dataset**

After downloading and extracting the dataset, we load and preprocess it.

In [9]:
# Load the dataset
data = pd.read_csv('WineQT.csv')
data.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | Id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | 1 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | 2 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | 3 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 4 |


In [10]:
# Drop the 'Id' column, which is a row identifier rather than a feature
data = data.drop('Id', axis=1)

# Split the data into features and labels
X = data.drop('quality', axis=1)
y = data['quality']

# Split the data into training and testing sets, stratifying on the 'quality' label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
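
The cell above fits the scaler once on the whole training split, which keeps the test set out of the scaling statistics. When cross-validation is applied to that training split afterwards, one optional variation (not what this notebook does) is to place the scaler and the model in a scikit-learn `Pipeline`, so the scaler is re-fit on the training portion of each fold. The sketch below assumes synthetic data and uses `SVR` only as an example of a scale-sensitive model; `X_demo`, `y_demo`, and `scaled_svr` are hypothetical names.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data (an assumption for illustration, not the wine dataset)
X_demo, y_demo = make_regression(n_samples=150, n_features=6, noise=10.0, random_state=0)

# Scaler and model combined: during cross-validation the scaler is
# fit only on the training portion of each fold
scaled_svr = make_pipeline(StandardScaler(), SVR())

fold_scores = cross_val_score(scaled_svr, X_demo, y_demo, cv=5, scoring="r2")
print(fold_scores.mean())
```

A pipeline like this can also be passed to `GridSearchCV` or `RandomizedSearchCV` in place of a bare estimator, with parameter names prefixed by the step name (for example `svr__C`).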

**Step 3: Model Selection**

We train baseline models with three algorithms (Random Forest, Support Vector Regression, and Gradient Boosting) using their default hyperparameter settings, and evaluate their performance on the test set with R-squared.

In [11]:
# Initialize the models
rf_model = RandomForestRegressor()
svm_model = SVR()
gb_model = GradientBoostingRegressor()

# Fit the models to the training data
rf_model.fit(X_train, y_train)
svm_model.fit(X_train, y_train)
gb_model.fit(X_train, y_train)

# Predict on the test data
y_pred_rf = rf_model.predict(X_test)
y_pred_svm = svm_model.predict(X_test)
y_pred_gb = gb_model.predict(X_test)

# Evaluate the models
r2_rf = r2_score(y_test, y_pred_rf)
r2_svm = r2_score(y_test, y_pred_svm)
r2_gb = r2_score(y_test, y_pred_gb)

print("R-squared score:")
print(f"Random Forest: {r2_rf}")
print(f"SVM: {r2_svm}")
print(f"Gradient Boosting: {r2_gb}")

R-squared score:
Random Forest: 0.4691225808358457
SVM: 0.44422885657278155
Gradient Boosting: 0.438763557426571
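
For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares, so a score of 1 is a perfect fit and a score of 0 is no better than always predicting the mean. The sketch below computes it by hand on a few made-up values and compares the result with `r2_score`; the arrays `y_true` and `y_hat` are hypothetical and are not taken from the wine data.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy target values and predictions (hypothetical, for illustration only)
y_true = np.array([5.0, 6.0, 5.0, 7.0, 4.0])
y_hat = np.array([5.2, 5.8, 5.5, 6.4, 4.6])

# Residual sum of squares: error left after the model's predictions
ss_res = np.sum((y_true - y_hat) ** 2)

# Total sum of squares: error of always predicting the mean of y_true
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_hat))
```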


**Step 4: Hyperparameter Tuning**

We select the algorithm with the best baseline performance (in this case, Random Forest) and tune its hyperparameters with both grid search and random search.

We begin with a grid search. Grid search systematically evaluates every combination of values in a predefined search space to identify the best hyperparameters for the model.

In [12]:
# Define a grid of hyperparameters to search
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create a GridSearchCV object
grid_search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5, scoring='neg_mean_squared_error')

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Get the best hyperparameters and model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Evaluate the best model
y_pred = best_model.predict(X_test)
r2 = r2_score(y_test, y_pred)

print(f'Best Model Hyperparameters: {best_params}')
print(f'Best Model R2: {r2:.2f}')

Best Model Hyperparameters: {'max_depth': 30, 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 200}
Best Model R2: 0.46
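
To get a sense of the cost of this search, the number of candidate combinations in the grid can be counted with scikit-learn's `ParameterGrid`. The sketch below repeats the grid from the cell above under the hypothetical name `param_grid_demo` so that it runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid searched above, repeated so this snippet is self-contained
param_grid_demo = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# Every combination of the listed values is one candidate: 3 * 4 * 3 * 3
n_candidates = len(ParameterGrid(param_grid_demo))

# Each candidate is trained once per cross-validation fold (cv=5)
n_fits = n_candidates * 5

print(n_candidates, n_fits)  # 108 540
```

Adding one more value to any hyperparameter multiplies the number of fits, which is why grid search becomes expensive quickly as the search space grows.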


Next, we run a random search. Whereas grid search evaluates every combination of the predefined values, random search samples a fixed number of configurations (`n_iter`) from predefined lists or distributions. Here it tries 15 of the 108 combinations in `param_grid`, so it costs far less than the full grid while still being able to find a comparable configuration.

In [13]:
# Create a RandomizedSearchCV object
random_search = RandomizedSearchCV(RandomForestRegressor(), param_grid, n_iter=15, cv=5, scoring='neg_mean_squared_error')

# Fit the random search to the data
random_search.fit(X_train, y_train)

# Get the best hyperparameters and model
best_params = random_search.best_params_
best_model = random_search.best_estimator_

# Evaluate the best model
y_pred = best_model.predict(X_test)
r2 = r2_score(y_test, y_pred)

print(f'Best Model Hyperparameters: {best_params}')
print(f'Best Model R2: {r2:.2f}')

Best Model Hyperparameters: {'n_estimators': 300, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_depth': 20}
Best Model R2: 0.46
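
The random search above draws from the same lists of values as the grid. `RandomizedSearchCV` also accepts distributions, which lets it try values that a fixed grid would never contain. The following is a minimal sketch of that variation, assuming synthetic data and arbitrarily chosen ranges; `param_distributions`, `search`, `X_demo`, and `y_demo` are hypothetical names, and the small `n_iter` and `cv` values are only there to keep the example fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data (an assumption for illustration, not the wine dataset)
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

# randint(low, high) samples integers from low up to high - 1;
# plain lists are sampled uniformly
param_distributions = {
    'n_estimators': randint(10, 51),
    'max_depth': [None, 5, 10],
    'min_samples_split': randint(2, 11),
    'min_samples_leaf': randint(1, 5),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=5,          # number of sampled configurations
    cv=3,
    scoring='neg_mean_squared_error',
    random_state=0,    # makes the sampled configurations reproducible
)
search.fit(X_demo, y_demo)

print(search.best_params_)
```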


**Conclusion**

Model selection and hyperparameter tuning are crucial steps in a machine learning pipeline. This project demonstrates how to systematically evaluate several machine learning algorithms, tune the hyperparameters of the best one, and compare the results. In this run, both grid search and random search reached a test R-squared of 0.46, close to the 0.47 of the default Random Forest, which shows that tuning does not always improve on a strong baseline. The code examples should be a helpful guide for implementing a similar project on your own datasets.