
Ensemble Methods and Hyperparameter Tuning (Wine Dataset)

Objective

In this exercise, we:

1.  Train Decision Tree and Random Forest classifiers
2.  Compare their performance using F1 Score
3.  Perform hyperparameter tuning using GridSearchCV
4.  Train Decision Tree and Random Forest regression models
5.  Perform hyperparameter tuning using RandomizedSearchCV

Step 1: Import Required Libraries

In [None]:
import numpy as np
import pandas as pd

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import f1_score, mean_squared_error

The imports are limited to NumPy, pandas, and the scikit-learn modules the worksheet requires. NumPy and pandas are not called directly in the steps below.

Step 2: Load and Split the Wine Dataset

In [None]:
data = load_wine()
X = data.data
y = data.target

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

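The dataset is split into 80% training and 20% testing, a common default. With the 178 samples in the Wine dataset, this gives 142 training samples and 36 test samples. The split is not stratified, so the class proportions in the test set may differ slightly from those of the full dataset.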
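The split sizes and per-class counts can be checked directly. The sketch below repeats the split above and, as an optional variant that the worksheet does not use, adds `stratify=y`. The names `Xs_train`, `Xs_test`, `ys_train`, and `ys_test` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Same split as in the worksheet: 80% train, 20% test, fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Optional variant (an assumption, not part of the worksheet): stratify on y
# so each class keeps roughly its overall share in both subsets.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("train/test sizes:", X_train.shape[0], X_test.shape[0])
print("test class counts (plain):     ", np.bincount(y_test, minlength=3))
print("test class counts (stratified):", np.bincount(ys_test, minlength=3))
```

Comparing the two count vectors shows how far the plain split drifts from the class balance of the full dataset.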

Step 3: Classification Using Decision Tree

In [None]:
dt_clf = DecisionTreeClassifier(random_state=42)
dt_clf.fit(X_train, y_train)

y_pred_dt = dt_clf.predict(X_test)
f1_dt = f1_score(y_test, y_pred_dt, average='weighted')

f1_dt

0.9439974457215836

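The weighted F1 score is used here. F1 is the harmonic mean of precision and recall, computed per class, and average='weighted' averages the per-class scores in proportion to the number of true samples in each class. This suits a multi-class problem with unequal class sizes.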
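To make the weighting concrete, the sketch below recomputes the weighted F1 score from the per-class scores. The labels in `y_true` and `y_pred` are made up for illustration and are not the worksheet's predictions.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels for illustration only.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 2])

# One F1 score per class, in label order 0, 1, 2.
per_class_f1 = f1_score(y_true, y_pred, average=None)

# Support = number of true samples per class.
support = np.bincount(y_true)

# Weighted F1 = support-weighted mean of the per-class scores.
manual_weighted = np.sum(per_class_f1 * support) / support.sum()
sklearn_weighted = f1_score(y_true, y_pred, average="weighted")

print("per-class F1:", per_class_f1)
print("manual weighted F1: ", manual_weighted)
print("sklearn weighted F1:", sklearn_weighted)
```

The two weighted values should agree, which confirms what `average='weighted'` computes.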

Step 4: Classification Using Random Forest

In [None]:
rf_clf = RandomForestClassifier(random_state=42)
rf_clf.fit(X_train, y_train)

y_pred_rf = rf_clf.predict(X_test)
f1_rf = f1_score(y_test, y_pred_rf, average='weighted')

f1_rf

1.0

Step 5: Classification Model Comparison

In [None]:
print("F1 Score Comparison")
print("Decision Tree Classifier:", f1_dt)
print("Random Forest Classifier:", f1_rf)

F1 Score Comparison
Decision Tree Classifier: 0.9439974457215836
Random Forest Classifier: 1.0


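On this test split, Random Forest scores higher than the single Decision Tree. This is the expected behavior, because a Random Forest:

1.   Trains many trees on bootstrap samples with random feature subsets
2.   Reduces the overfitting of individual trees by combining their predictions
3.   Generalizes better to unseen data as a result

The test set holds only 36 samples, so the perfect score of 1.0 describes this particular split rather than guaranteed performance.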
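A single small test split can be noisy, so one way to check the comparison is k-fold cross-validation on the training data. This is a minimal sketch, assuming 5 folds and the same weighted F1 metric. The names `models` and `cv_scores` are illustrative, and the scores it prints are not part of the worksheet's reported results.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

cv_scores = {}
for name, model in models.items():
    # One weighted F1 score per fold; the test set is left untouched.
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1_weighted")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

The standard deviation across folds gives a rough sense of how much of the gap between the two models could be due to the choice of split.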

Step 6: Hyperparameter Tuning (Random Forest Classifier – GridSearchCV)

Selected Hyperparameters

*   n_estimators
*   max_depth
*   min_samples_split

In [None]:
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5]
}


In [None]:
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring='f1_weighted',
    cv=5
)

grid_search.fit(X_train, y_train)

grid_search.best_params_

{'max_depth': None, 'min_samples_split': 2, 'n_estimators': 100}

Analysis

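*   GridSearchCV evaluates every combination in the grid: 2 × 3 × 2 = 12 combinations, each with 5-fold cross-validation (60 fits)
*   The best parameters are those with the highest mean cross-validated weighted F1 score
*   Here the best parameters match the RandomForestClassifier defaults in recent scikit-learn versions, so tuning did not change the model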
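The sketch below counts the grid with `ParameterGrid`, reruns the same search, and scores the refitted best model on the held-out test set. The test-set evaluation is an addition for illustration, not a step the worksheet performs, and `n_combinations` and `tuned_f1` are illustrative names.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, ParameterGrid, train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5],
}

# Number of candidate settings the exhaustive search will try.
n_combinations = len(ParameterGrid(param_grid))
print("combinations:", n_combinations)

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1_weighted",
    cv=5,
)
grid_search.fit(X_train, y_train)

# best_score_ is the mean cross-validated score of the best candidate.
print("best params:", grid_search.best_params_)
print("best CV weighted F1:", grid_search.best_score_)

# With the default refit=True, best_estimator_ is retrained on all of X_train.
tuned_f1 = f1_score(
    y_test, grid_search.best_estimator_.predict(X_test), average="weighted"
)
print("test weighted F1 of tuned model:", tuned_f1)
```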

Step 7: Regression Using Decision Tree Regressor

In [None]:
dt_reg = DecisionTreeRegressor(random_state=42)
dt_reg.fit(X_train, y_train)

y_pred_dt_reg = dt_reg.predict(X_test)
mse_dt = mean_squared_error(y_test, y_pred_dt_reg)

mse_dt

0.16666666666666666

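Mean Squared Error (MSE) is the average of the squared differences between predicted and true values; lower is better. Note that the Wine target is a class label (0, 1, or 2), so the regressors in this exercise treat the label as a numeric value.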
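As a small worked example, the MSE can be computed by hand and compared with `mean_squared_error`. The values in `y_true` and `y_hat` are made up for illustration and are not taken from the worksheet.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions for illustration only.
y_true = np.array([0.0, 1.0, 2.0, 1.0])
y_hat = np.array([0.0, 1.5, 2.0, 0.0])

# Squared errors are 0, 0.25, 0, 1; their mean is 1.25 / 4.
manual_mse = np.mean((y_true - y_hat) ** 2)
sklearn_mse = mean_squared_error(y_true, y_hat)

print("manual MSE: ", manual_mse)   # 0.3125
print("sklearn MSE:", sklearn_mse)  # 0.3125
```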

Step 8: Regression Using Random Forest Regressor

In [None]:
rf_reg = RandomForestRegressor(random_state=42)
rf_reg.fit(X_train, y_train)

y_pred_rf_reg = rf_reg.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf_reg)

mse_rf

0.06483333333333333

Step 9: Regression Hyperparameter Tuning (RandomizedSearchCV)

Selected Parameters

*   n_estimators
*   max_depth
*   min_samples_split

In [None]:
param_dist = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10]
}

In [None]:
random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_dist,
    n_iter=5,
    scoring='neg_mean_squared_error',
    cv=5,
    random_state=42
)

random_search.fit(X_train, y_train)

random_search.best_params_

{'n_estimators': 200, 'min_samples_split': 10, 'max_depth': None}

Discussion

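*   RandomizedSearchCV samples a fixed number of combinations (n_iter=5 of the 27 possible here), so it runs fewer fits than an exhaustive GridSearchCV
*   It is suitable when the parameter space is large, though it may miss the best combination
*   The selected parameters give the lowest cross-validated MSE among the sampled combinations, which does not by itself guarantee a lower test error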
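One way to check these points is to count the full grid, rerun the search, and compare the tuned regressor with an untuned one on the held-out test set. This is a sketch under the same settings as above; the baseline comparison is an addition for illustration, and `n_possible`, `n_sampled`, `baseline_mse`, and `tuned_mse` are illustrative names.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import (
    ParameterGrid,
    RandomizedSearchCV,
    train_test_split,
)

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_dist = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
}

# Size of the full grid that an exhaustive search would have to cover.
n_possible = len(ParameterGrid(param_dist))

random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_dist,
    n_iter=5,
    scoring="neg_mean_squared_error",
    cv=5,
    random_state=42,
)
random_search.fit(X_train, y_train)

n_sampled = len(random_search.cv_results_["params"])
print(f"sampled {n_sampled} of {n_possible} combinations")

# The scorer returns negative MSE, so flip the sign to report MSE.
best_cv_mse = -random_search.best_score_
print("best CV MSE:", best_cv_mse)

# Untuned baseline versus the refitted best model, both on the test set.
baseline = RandomForestRegressor(random_state=42).fit(X_train, y_train)
baseline_mse = mean_squared_error(y_test, baseline.predict(X_test))
tuned_mse = mean_squared_error(
    y_test, random_search.best_estimator_.predict(X_test)
)
print("baseline test MSE:", baseline_mse)
print("tuned test MSE:   ", tuned_mse)
```

If the tuned test MSE is not lower than the baseline, that is consistent with the search having sampled only a small part of the grid.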

Final Conclusion

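*   Random Forest outperforms Decision Tree on this test split for both classification (weighted F1 of 1.0 vs. 0.944) and regression (MSE of 0.065 vs. 0.167)
*   Hyperparameter tuning selects the configuration with the best cross-validated score, which can improve performance over untuned settings
*   Ensemble methods generally provide better stability and accuracy than a single tree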