## Machine Learning 37: Selecting the Best Machine Learning Model with the Best Hyperparameters


## 1. **Model Selection Basics**

### Why choosing the right model is important

Different ML models have different strengths.

* **Linear models (Logistic Regression, Linear Regression):** Simple, interpretable, fast, but may underfit complex problems.
* **Tree-based models (Random Forest, XGBoost):** Handle nonlinear relationships, robust, but may overfit if not tuned.
* **Neural networks:** Very powerful, but need large datasets and high compute.

Choosing the wrong model may lead to **underfitting** (too simple) or **overfitting** (too complex).

### Bias-Variance Tradeoff

* **Bias**: Error from oversimplification (e.g., using a linear model for non-linear data).
* **Variance**: Error from sensitivity to noise (e.g., an overfitted deep tree).
* The **goal** is to balance them: low enough bias + low enough variance.
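
For squared-error loss, the tradeoff can be written down explicitly. The decomposition below is the standard textbook form, assuming the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$. The symbols $f$, $\hat{f}$ and $\sigma^2$ are notation introduced here for illustration, and the expectation is taken over training sets and noise.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
$$

A more flexible model usually shrinks the first term and grows the second; the noise term cannot be reduced by any model.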


## 2. **Hyperparameters vs Parameters**

* **Parameters:** Learned from data during training.

  * Example: In Linear Regression, the coefficients (weights).
  * Example: In Neural Networks, the weights of connections.

* **Hyperparameters:** Set *before* training, not learned.

  * Example: Learning rate in gradient descent.
  * Example: Number of trees in Random Forest.
  * Example: Maximum depth of XGBoost tree.

- **Parameters** = learned values.
- **Hyperparameters** = tuning knobs chosen by us.
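
The distinction can be seen in a few lines of plain Python. This is a minimal sketch on a made-up four-point dataset, not code from the notebook: `learning_rate` and `n_epochs` are hyperparameters fixed before training, while the weight `w` is the parameter that gradient descent learns.

```python
# Fit y = w * x with plain gradient descent on a tiny, made-up dataset.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated with a true weight of 2.0

# Hyperparameters: chosen by us before training starts.
learning_rate = 0.01
n_epochs = 500

# Parameter: starts at an arbitrary value and is learned from the data.
w = 0.0
for _ in range(n_epochs):
    # Gradient of the mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad

print(f"learned weight: {w:.4f}")
```

Changing `learning_rate` or `n_epochs` changes how (and whether) training converges, but neither value is ever updated by the training loop itself.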


## 3. **Techniques for Model Selection**

* **Train-Test Split:**
  Split dataset (e.g., 80% train, 20% test). Train on train, evaluate on test.
  - Simple, but may give unstable results for small datasets.

* **Cross-Validation (CV):**

  * Split into *k folds* (e.g., k=5).
  * Train on k-1 folds (4 here), validate on the remaining fold. Repeat so that each fold serves as the validation set once.
  * Average the performance.
    - More reliable than train-test split.

* **Nested Cross-Validation:**

  * Outer loop → evaluate generalization.
  * Inner loop → tune hyperparameters.
    - Keeps the hyperparameter search from leaking into the performance estimate, which would otherwise be optimistically biased.
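
To show the mechanics of k-fold splitting, here is a small pure-Python sketch. The helper `k_fold_indices` is a hypothetical name introduced for illustration; in practice scikit-learn's `KFold` does this job. Nested cross-validation would run a loop like this twice: once on the full data (outer) and once more on each outer training set (inner).

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx


splits = list(k_fold_indices(n_samples=10, k=5))
for fold_number, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"fold {fold_number}: train on {len(train_idx)} samples, "
          f"validate on {sorted(val_idx)}")
```

Every sample lands in exactly one validation fold, which is why the averaged score uses all of the data for evaluation without ever scoring a model on samples it was trained on.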

## 4. **Hyperparameter Tuning Approaches**

* **Grid Search:** Try all combinations (exhaustive).
  ✅ Finds best within given grid.
  ❌ Expensive for large search spaces.

* **Random Search:** Randomly sample combinations.
  ✅ Often better than grid for large spaces.
  ❌ May miss best exact combination.

* **Bayesian Optimization:** Builds a surrogate model from past results and uses it to choose the next configuration to try.
  ✅ Efficient, fewer trials needed.
  ❌ More complex.

* **Genetic Algorithms (Evolutionary Search):** Mimics natural selection.
  ✅ Good for large non-smooth spaces.
  ❌ Slower.

* **Modern approaches (Hyperband, Optuna):**

  * **Hyperband:** Early-stop bad configurations.
  * **Optuna:** Combines Bayesian-style sampling (TPE by default) with pruning of unpromising trials.
    ✅ Very efficient for large hyperparameter spaces.
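
The difference between the first two approaches is easy to see in plain Python. This is a sketch under stated assumptions: `validation_score` is a hypothetical stand-in for a cross-validated score, and the search space is invented for illustration. With scikit-learn, `GridSearchCV` and `RandomizedSearchCV` play these roles.

```python
import itertools
import random


def validation_score(learning_rate, max_depth):
    """Hypothetical stand-in for a cross-validated score; peaks at (0.1, 6)."""
    return -((learning_rate - 0.1) ** 2) * 100 - ((max_depth - 6) ** 2) * 0.01


search_space = {
    "learning_rate": [0.01, 0.1, 0.3],
    "max_depth": [3, 6, 9],
}

# Grid search: evaluate every combination in the grid (3 x 3 = 9 trials).
names = list(search_space)
grid = [dict(zip(names, values))
        for values in itertools.product(*search_space.values())]
best_grid = max(grid, key=lambda params: validation_score(**params))

# Random search: evaluate a fixed budget of randomly drawn combinations.
rng = random.Random(0)
n_trials = 4
random_trials = [
    {name: rng.choice(options) for name, options in search_space.items()}
    for _ in range(n_trials)
]
best_random = max(random_trials, key=lambda params: validation_score(**params))

print("grid search tried", len(grid), "combinations, best:", best_grid)
print("random search tried", len(random_trials), "combinations, best:", best_random)
```

Grid search cost grows multiplicatively with every added hyperparameter, while random search cost is whatever budget you set in `n_trials`; the price is that the random draw may not include the best combination.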

## 5. **Evaluation Metrics**

Choosing the right metric depends on the problem:

* **Classification:**

  * Accuracy → Good for balanced classes.
  * Precision, Recall, F1 → Better for imbalanced classes.
  * ROC-AUC → Measures ranking ability.

* **Regression:**

  * MAE (Mean Absolute Error) → Average absolute error, in the same units as the target.
  * MSE (Mean Squared Error) → Penalizes large errors.
  * R² (coefficient of determination) → Proportion of variance explained.

Always match the metric to your business or real-world goal.
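
To make the definitions concrete, the sketch below computes several of these metrics from scratch on made-up numbers. The helper functions are written here for illustration; scikit-learn's `sklearn.metrics` module provides tested equivalents. The regression example has one large miss, and the classification example is a majority-class predictor on imbalanced labels.

```python
def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def precision_recall_f1(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Regression: the single large miss (last value) dominates MSE more than MAE.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [3.0, 5.0, 7.0, 13.0]
print("MAE:", mae(y_true, y_pred), "MSE:", mse(y_true, y_pred),
      "R2:", round(r2(y_true, y_pred), 3))

# Classification: always predicting the majority class on 9:1 imbalanced labels.
labels = [0] * 9 + [1]
preds = [0] * 10
accuracy = sum(t == p for t, p in zip(labels, preds)) / len(labels)
print("accuracy:", accuracy, "precision/recall/F1:", precision_recall_f1(labels, preds))
```

The classifier scores 90% accuracy while never finding a single positive case, which is why precision, recall and F1 are the safer choice when classes are imbalanced.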


## 6. **Model Comparison**

* Use **cross-validation scores** to compare.
* Apply **statistical tests** (e.g., paired t-test on CV folds) to ensure differences are significant.
* Consider **interpretability, training time, inference speed**, not just accuracy.
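
A paired t-test on fold scores can be sketched with the standard library alone. The per-fold accuracies below are hypothetical values chosen for illustration, not results from this notebook, and `paired_t_statistic` is a helper name introduced here. Because cross-validation folds share training data, the test's independence assumption holds only approximately, so treat the result as a rough guide.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t statistic for the mean per-fold difference between two models."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (std_diff / math.sqrt(len(diffs)))


# Hypothetical accuracies of two models evaluated on the same 5 folds.
model_a = [0.90, 0.92, 0.91, 0.93, 0.94]
model_b = [0.88, 0.91, 0.89, 0.90, 0.92]

t_stat = paired_t_statistic(model_a, model_b)
print(f"t = {t_stat:.2f} with {len(model_a) - 1} degrees of freedom")
```

With 4 degrees of freedom, the two-sided critical value at the 5% level is about 2.776, so a t statistic well above that suggests the gap between the two models is unlikely to be fold-to-fold noise.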


## 7. **Avoiding Pitfalls**

- **Data Leakage:** Letting future or test information into training (e.g., fitting a scaler on the full dataset before splitting, instead of fitting it on the training set only).
- **Overfitting during tuning:** If you tune on test set → overly optimistic performance.
- Always keep a separate **final test set**.
- **Ignoring class imbalance:** High accuracy may be misleading.
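
The data leakage pitfall can be made concrete with a small pure-Python sketch. The feature values are made up, and `standardize` is a helper written for illustration; it mirrors what `StandardScaler` does with `fit` on the training data followed by `transform` on the test data.

```python
import statistics


def standardize(values, mean, std):
    return [(v - mean) / std for v in values]


# Made-up feature values; the test set contains an unusually large value.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [6.0, 20.0]

# Correct: the statistics come from the training data only.
train_mean = statistics.mean(train)
train_std = statistics.pstdev(train)
test_scaled = standardize(test, train_mean, train_std)

# Leaky: statistics computed on train + test let test data shape preprocessing.
leaky_mean = statistics.mean(train + test)
leaky_std = statistics.pstdev(train + test)
test_scaled_leaky = standardize(test, leaky_mean, leaky_std)

print("train-only mean:", train_mean, "| leaky mean:", round(leaky_mean, 2))
print("test scaled correctly:", [round(v, 2) for v in test_scaled])
print("test scaled with leakage:", [round(v, 2) for v in test_scaled_leaky])
```

The leaky version makes the outlier in the test set look far less extreme than it would to a model in production, which is exactly the kind of optimism a held-out test set is meant to rule out.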


### What this code does:

1. Loads the **diamonds dataset** (regression) and the **Iris dataset** (multiclass classification).
2. Splits the diamonds data into train/test sets.
3. Compares several regressors on the test set and several classifiers with 5-fold cross-validation.
4. Uses **GridSearchCV with cross-validation** to find the best hyperparameters for each regressor.
5. Reports MSE, R², and MAE for each tuned model on the test set.


### This workflow ensures you:

1. Choose models wisely.
2. Tune hyperparameters effectively.
3. Evaluate with the right metrics.
4. Avoid common mistakes.


## Regression Tasks

In [1]:
import seaborn as sns
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Models
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
import xgboost as xgb

In [2]:
# 1. Load dataset
diamonds = sns.load_dataset("diamonds")

In [3]:
# 2. Preprocess dataset
# Convert categorical variables to numeric (Label Encoding)
label_cols = ["cut", "color", "clarity"]
diamonds[label_cols] = diamonds[label_cols].apply(LabelEncoder().fit_transform)

In [4]:
# Features and Target
X = diamonds.drop("price", axis=1)
y = diamonds["price"]

In [5]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [6]:
# Feature scaling (not mandatory for tree-based models but helps linear models)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [None]:
# 3. Define models
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Lasso Regression": Lasso(alpha=0.001),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=100, random_state=42),
    "XGBoost": xgb.XGBRegressor(n_estimators=100, random_state=42, verbosity=0)
}

In [13]:

# 4. Train & Evaluate
results = []
for name, model in models.items():
    if "Linear" in name or "Ridge" in name or "Lasso" in name:
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

    r2 = r2_score(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)

    results.append([name, r2, rmse, mae])



In [8]:
# 5. Show Results
results_df = pd.DataFrame(results, columns=["Model", "R2 Score", "RMSE", "MAE"])
print("\nModel Performance on Diamonds Dataset:\n")
print(results_df.sort_values(by="R2 Score", ascending=False))


Model Performance on Diamonds Dataset:

               Model  R2 Score         RMSE         MAE
4      Random Forest  0.981517   542.047468  267.975502
6            XGBoost  0.981276   545.574239  277.941284
5  Gradient Boosting  0.972955   655.694773  364.865262
3      Decision Tree  0.966566   729.033563  355.644234
1   Ridge Regression  0.885140  1351.262011  858.800082
0  Linear Regression  0.885140  1351.263480  858.708470
2   Lasso Regression  0.885140  1351.263537  858.709208


In [9]:
# 6. Best Models by Metric
best_r2 = results_df.loc[results_df["R2 Score"].idxmax()]
best_rmse = results_df.loc[results_df["RMSE"].idxmin()]
best_mae = results_df.loc[results_df["MAE"].idxmin()]

In [10]:
print("\nBest Models by Metrics:")
print(f"👉 Highest R2 Score: {best_r2['Model']} ({best_r2['R2 Score']:.4f})")
print(f"👉 Lowest RMSE: {best_rmse['Model']} ({best_rmse['RMSE']:.2f})")
print(f"👉 Lowest MAE: {best_mae['Model']} ({best_mae['MAE']:.2f})")


Best Models by Metrics:
👉 Highest R2 Score: Random Forest (0.9815)
👉 Lowest RMSE: Random Forest (542.05)
👉 Lowest MAE: Random Forest (267.98)


---

## Classifiers:

In [11]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Don't show warnings
import warnings
warnings.filterwarnings('ignore')

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Create a dictionary of classifiers to evaluate
classifiers = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier()
}

# Perform k-fold cross-validation and calculate the mean accuracy
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

for name, classifier in classifiers.items():
    scores = cross_val_score(classifier, X, y, cv=kfold)
    accuracy = np.mean(scores)
    print("Classifier:", name)
    print("Mean Accuracy:", accuracy)
    print()

Classifier: Logistic Regression
Mean Accuracy: 0.9733333333333334

Classifier: Decision Tree
Mean Accuracy: 0.9533333333333335

Classifier: Random Forest
Mean Accuracy: 0.9600000000000002

Classifier: SVM
Mean Accuracy: 0.9666666666666668

Classifier: KNN
Mean Accuracy: 0.9733333333333334



## Hyperparameter tuning:

In [None]:
%%time
# Import SVR, KNeighborsRegressor, and XGBRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV

# Map each model name to an (estimator, hyperparameter grid) pair
models = { 
          'LinearRegression' : (LinearRegression(), {}),
          'SVR' : (SVR(), {'kernel': ['rbf', 'poly', 'sigmoid']}),
          'DecisionTreeRegressor' : (DecisionTreeRegressor(), {'max_depth': [None, 5, 10]}),
          'RandomForestRegressor' : (RandomForestRegressor(), {'n_estimators': [10, 100]}),
          'KNeighborsRegressor' : (KNeighborsRegressor(), {'n_neighbors': np.arange(3, 100, 2)}),
          'GradientBoostingRegressor' : (GradientBoostingRegressor(), {'n_estimators': [10, 100]}),
          'XGBRegressor' : (XGBRegressor(), {'n_estimators': [10, 100]}),          
          }
# Tune each model with 5-fold grid search, then evaluate the best estimator on the test set

for name, (model, params) in models.items():
    # wrap the model in a grid search over its hyperparameter grid
    grid_search = GridSearchCV(model, params, cv=5)

    # fit the grid search (the best configuration is refit on all of X_train)
    grid_search.fit(X_train, y_train)

    # predict on the held-out test set with the best estimator
    y_pred = grid_search.predict(X_test)

    # print the evaluation metrics
    print(name, 'MSE: ', mean_squared_error(y_test, y_pred))
    print(name, 'R2: ', r2_score(y_test, y_pred))
    print(name, 'MAE: ', mean_absolute_error(y_test, y_pred))
    print('\n')
    print(name, 'Best Params: ', grid_search.best_params_)
    

LinearRegression MSE:  1825912.9915253515
LinearRegression R2:  0.8851397433679629
LinearRegression MAE:  858.7084697710105


LinearRegression Best Params:  {}


In [None]:
# Map each model name to an (estimator, hyperparameter grid) pair, now with larger grids
models = { 
          'LinearRegression' : (LinearRegression(), {}),
          'SVR' : (SVR(), {'kernel': ['rbf', 'poly', 'sigmoid'], 'C': [0.1, 1, 10], 'gamma': [1, 0.1, 0.01], 'epsilon': [0.1, 0.01, 0.001]}),
          'DecisionTreeRegressor' : (DecisionTreeRegressor(), {'max_depth': [None, 5, 10], 'splitter': ['best', 'random']}),
          'RandomForestRegressor' : (RandomForestRegressor(), {'n_estimators': [10, 100, 1000], 'max_depth': [None, 5, 10]}),
          'KNeighborsRegressor' : (KNeighborsRegressor(), {'n_neighbors': np.arange(3, 100, 2), 'weights': ['uniform', 'distance']}),
          'GradientBoostingRegressor' : (GradientBoostingRegressor(), {'loss': ['squared_error', 'absolute_error', 'huber', 'quantile'], 'n_estimators': [10, 100, 1000]}),
          'XGBRegressor' : (XGBRegressor(), {'n_estimators': [10, 100, 1000], 'learning_rate': [0.1, 0.01, 0.001]}),          
          }

# Tune each model with 5-fold grid search, then evaluate the best estimator on the test set

for name, (model, params) in models.items():
    # wrap the model in a grid search over its hyperparameter grid
    grid_search = GridSearchCV(model, params, cv=5)

    # fit the grid search (the best configuration is refit on all of X_train)
    grid_search.fit(X_train, y_train)

    # predict on the held-out test set with the best estimator
    y_pred = grid_search.predict(X_test)

    # print the evaluation metrics
    print(name, 'MSE: ', mean_squared_error(y_test, y_pred))
    print(name, 'R2: ', r2_score(y_test, y_pred))
    print(name, 'MAE: ', mean_absolute_error(y_test, y_pred))
    print('\n')

## Add the preprocessor inside the pipeline

In [None]:
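# Imports needed for the preprocessing pipeline
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Scale the numeric diamonds columns; the label-encoded columns pass through unchanged
preprocessor = ColumnTransformer(
    transformers=[('numeric_scaling', StandardScaler(), ['carat', 'depth', 'table', 'x', 'y', 'z'])], remainder='passthrough')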


# Map each model name to an (estimator, hyperparameter grid) pair
models = { 
          'LinearRegression' : (LinearRegression(), {}),
          'SVR' : (SVR(), {'kernel': ['rbf', 'poly', 'sigmoid'], 'C': [0.1, 1, 10], 'gamma': [1, 0.1, 0.01], 'epsilon': [0.1, 0.01, 0.001]}),
          'DecisionTreeRegressor' : (DecisionTreeRegressor(), {'max_depth': [None, 5, 10], 'splitter': ['best', 'random']}),
          'RandomForestRegressor' : (RandomForestRegressor(), {'n_estimators': [10, 100, 1000], 'max_depth': [None, 5, 10]}),
          'KNeighborsRegressor' : (KNeighborsRegressor(), {'n_neighbors': np.arange(3, 100, 2), 'weights': ['uniform', 'distance']}),
          'GradientBoostingRegressor' : (GradientBoostingRegressor(), {'loss': ['squared_error', 'absolute_error', 'huber', 'quantile'], 'n_estimators': [10, 100, 1000]}),
          'XGBRegressor' : (XGBRegressor(), {'n_estimators': [10, 100, 1000], 'learning_rate': [0.1, 0.01, 0.001]}),          
          }

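# Tune each pipeline with 5-fold grid search, then evaluate the best one on the test set

for name, (model, params) in models.items():
    # create a pipeline with the preprocessor followed by the model
    pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])

    # prefix each hyperparameter with the step name so GridSearchCV routes it to the 'model' step
    param_grid = {f'model__{key}': values for key, values in params.items()}

    # grid search over the whole pipeline, so the scaler is refit inside every CV fold
    grid_search = GridSearchCV(pipeline, param_grid, cv=5)

    # fit the grid search
    grid_search.fit(X_train, y_train)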
    
    # make prediction from each model
    y_pred = grid_search.predict(X_test)
    
      
    # print the performing metric
    print(name, 'MSE: ', mean_squared_error(y_test, y_pred))
    print(name, 'R2: ', r2_score(y_test, y_pred))
    print(name, 'MAE: ', mean_absolute_error(y_test, y_pred))
    print('\n')