This module develops tuned hyperparameters for each machine learning model to be benchmarked.

The first two code cells define and configure 16 ML algorithms; a later cell tunes their hyperparameters. The 16 machine learning algorithms are the following:

1.  Tree-based:     Decision Tree, Random Forest, Extra Trees, Gradient Boosting, AdaBoost, XGBoost
2.  Linear:         Logistic Regression, Ridge Classifier, Stochastic Gradient Descent (SGD) Classifier, Perceptron
3.  Kernel-based:   Support Vector Classifier
4.  Instance-based: KNeighbors Classifier
5.  Probabilistic:  Gaussian Naive Bayes
6.  Discriminant:   Linear Discriminant Analysis, Quadratic Discriminant Analysis
7.  Neural:         Multi-layer Perceptron Classifier.

The configurations are set up in the AVAILABLE_MODELS dictionary. One objective of these configurations is to standardize parameter grids and default parameters for fair comparisons among models and to increase the chance of reproducible results. Another objective is to load models at runtime rather than hard-coding all the imports, which should make scaling up to more ML models less painful.

Each model's configuration includes:

1.  Class and module:   For dynamic importing of its libraries
2.  Label requirements: Whether model needs numeric vs. text labels
3.  Default parameters: Base settings (random_state, n_jobs, etc.)
4.  Parameter grid:     Hyperparameters to tune and their value choices.
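
The "class and module" fields above can be turned into a live object with `importlib`. The following is a minimal sketch of that pattern, not the notebook's own code: `EXAMPLE_ENTRY`, `build_from_config`, and `is_valid_config` are hypothetical names, and a standard-library class stands in for an estimator so the sketch runs without scikit-learn installed.

```python
import importlib

# Hypothetical entry shaped like one in AVAILABLE_MODELS. A standard-library
# class stands in for an estimator so the sketch has no third-party imports.
EXAMPLE_ENTRY = {
    "class": "NormalDist",
    "module": "statistics",
    "default_params": {"mu": 0.0, "sigma": 2.0},
}


def build_from_config(entry):
    """Import entry['module'], look up entry['class'], and instantiate it."""
    module = importlib.import_module(entry["module"])
    cls = getattr(module, entry["class"])
    return cls(**entry["default_params"])


def is_valid_config(entry):
    """Return True if the class accepts every name in default_params."""
    try:
        build_from_config(entry)
    except TypeError:
        # Raised when a keyword in default_params is not a constructor argument.
        return False
    return True


obj = build_from_config(EXAMPLE_ENTRY)
print(type(obj).__name__, obj.mean, obj.stdev)
```

Because the class is resolved from strings at runtime, adding a model only requires a new dictionary entry; the trade-off is that a misspelled module or class name surfaces when the entry is loaded rather than at import time.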


In [1]:
# -------------------------------------------
# 1. ESTABLISH TAXONOMY OF ML MODELS
# -------------------------------------------

MODEL_FAMILIES = {
    "Tree-Based Models": [
        (1, "DecisionTree"),
        (2, "RandomForest"),
        (3, "ExtraTrees"),
        (4, "GradientBoosting"),
        (5, "AdaBoost"),
        (6, "XGBoost")
    ],
    "Linear Models": [
        (7, "LogisticRegression"),
        (8, "RidgeClassifier"),
        (9, "SGDClassifier"),
        (10, "Perceptron")
    ],
    "Kernel-Based Models": [
        (11, "SVC")
    ],
    "Instance-Based Models": [
        (12, "Kneighbors")
    ],
    "Probabilistic Models": [
        (13, "GaussianNB")
    ],
    "Discriminant Models": [
        (14, "LDA"),
        (15, "QDA")
    ],
    "Neural Models": [
        (16, "MLP")
    ]
}

In [None]:
# -------------------------------------------
# 2. DEFINE MODELS, DEFAULT HYPERPARAMETERS, AND TO-BE-TUNED HYPERPARAMETERS
# -------------------------------------------


# Import
import pandas as pd
import json
import importlib
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import accuracy_score
from pathlib import Path
import warnings

# Suppress warnings for cleaner output. Many arise during hyperparameter tuning.
warnings.filterwarnings("ignore")  

# Define this project's file locations.
# This notebook uses a centralized config.py file for all path management.

# Import config paths
import sys
sys.path.append('..')
from config import CURATED_DATA_DIR, TUNED_MODELS_DIR

# Define paths specific to this module
curated_path = CURATED_DATA_DIR / "DryBean_curated.parquet"
tuned_models_dir = TUNED_MODELS_DIR
tuned_models_dir.mkdir(parents=True, exist_ok=True)


# Load scaled and encoded dataset -- again.
# Loading from persistent storage (e.g., hard/SSD drive) is intended to ensure modularity and reliability.
df = pd.read_parquet(curated_path)
X = df.drop(columns=["label"])
y = df["label"]

# Define the 16 models to allow for dynamic importing of their libraries in the next code cell.
# Configure each model for hyperparameter tuning. This requires defining: 
#    1. Default parameters (e.g., random_state, n_jobs)
#    2. Search type ("grid" for GridSearchCV, the chosen method in next code cell)
#    3. Scoring metric to assess how well a hyperparameter combination performs ("accuracy")
#    4. The specific hyperparameters to tune in the next code block (in the "param_grid" dictionary).
# Each model is defined in the AVAILABLE_MODELS dictionary with its class, module, and hyperparameter grid.


AVAILABLE_MODELS = {
    "DecisionTreeClassifier": {
        "class": "DecisionTreeClassifier",
        "module": "sklearn.tree",
        "requires_numeric_labels": False,
        "search_type": "grid",
        "scoring": "accuracy",
        "default_params": {"random_state": 42},
        "param_grid": {
            "criterion": ["gini", "entropy"],
            "max_depth": [None, 5, 10, 20],
            "min_samples_split": [2, 5],
            "min_samples_leaf": [1, 3]
        }
    },
    "RandomForestClassifier": {
        "class": "RandomForestClassifier",
        "module": "sklearn.ensemble",
        "requires_numeric_labels": False,
        "search_type": "grid",
        "scoring": "accuracy",
        "default_params": {"random_state": 42, "n_jobs": -1},
        "param_grid": {
            "n_estimators": [50, 100],
            "max_depth": [None, 10, 20],
            "min_samples_split": [2, 5],
            "min_samples_leaf": [1, 3]
        }
    },
    "ExtraTreesClassifier": {
        "class": "ExtraTreesClassifier",
        "module": "sklearn.ensemble",
        "requires_numeric_labels": False,
        "search_type": "grid",
        "scoring": "accuracy",
        "default_params": {"random_state": 42, "n_jobs": -1},
        "param_grid": {
            "n_estimators": [50, 100],
            "max_depth": [None, 10, 20],
            "min_samples_split": [2, 5]
        }
    },
    "GradientBoostingClassifier": {
        "class": "GradientBoostingClassifier",
        "module": "sklearn.ensemble",
        "requires_numeric_labels": False,
        "search_type": "grid",
        "scoring": "accuracy",
        "default_params": {"random_state": 42, "subsample": 0.8,
            "max_depth": 5,
            "max_features": "sqrt"
        },
        "param_grid": {
            "n_estimators": [50, 100],
            "learning_rate": [0.05, 0.1],
            "max_depth": [3, 5]
        }
    },
    "AdaBoostClassifier": {
        "class": "AdaBoostClassifier",
        "module": "sklearn.ensemble",
        "requires_numeric_labels": False,
        "search_type": "grid",
        "scoring": "accuracy",
        "default_params": {"random_state": 42},
        "param_grid": {
            "n_estimators": [50, 100],
            "learning_rate": [0.5, 1.0]
        }
    },
    "XGBClassifier": {
        "class": "XGBClassifier",
        "module": "xgboost",
        "requires_numeric_labels": True,
        "search_type": "grid",
        "scoring": "accuracy",
        "default_params": {
            "random_state": 42,
            "use_label_encoder": False,
            "eval_metric": "mlogloss",
            "n_jobs": -1
        },
        "param_grid": {
            "n_estimators": [50, 100],
            "max_depth": [3, 6],
            "learning_rate": [0.1, 0.2],
            "subsample": [0.8, 1.0]
        }
    },
    "LogisticRegression": {
        "class": "LogisticRegression",
        "module": "sklearn.linear_model",
        "requires_numeric_labels": False,
        "search_type": "grid",
        "scoring": "accuracy",
        "default_params": {"random_state": 42, "n_jobs": -1, "max_iter": 1000},
        "param_grid": {
            "C": [0.1, 1, 10],
            "penalty": ["l2"],
            "solver": ["lbfgs", "liblinear"]
        }
    },
    "RidgeClassifier": {
        "class": "RidgeClassifier",
        "module": "sklearn.linear_model",
        "requires_numeric_labels": False,
        "search_type": "grid",
        "scoring": "accuracy",
        "default_params": {},
        "param_grid": {
            "alpha": [0.1, 1.0, 10.0],
            "solver": ["auto", "sparse_cg"]
        }
    },
    "SGDClassifier": {
        "class": "SGDClassifier",
        "module": "sklearn.linear_model",
        "requires_numeric_labels": False,
        "search_type": "grid",
        "scoring": "accuracy",
        "default_params": {"random_state": 42, "n_jobs": -1},
        "param_grid": {
            "loss": ["hinge", "log_loss"],
            "alpha": [0.0001, 0.001],
            "penalty": ["l2", "l1"]
        }
    },
    "Perceptron": {
        "class": "Perceptron",
        "module": "sklearn.linear_model",
        "requires_numeric_labels": False,
        "search_type": "grid",
        "scoring": "accuracy",
        "default_params": {"random_state": 42, "max_iter": 1000, "n_jobs": -1},
        "param_grid": {
            "penalty": ["l2", "elasticnet", None],
            "alpha": [0.0001, 0.001]
        }
    },
    "KNeighborsClassifier": {
        "class": "KNeighborsClassifier",
        "module": "sklearn.neighbors",
        "requires_numeric_labels": False,
        "search_type": "grid",
        "scoring": "accuracy",
        "default_params": {"n_jobs": -1},
        "param_grid": {
            "n_neighbors": [3, 5, 7],
            "weights": ["uniform", "distance"],
            "metric": ["euclidean", "manhattan"]
        }
    },
    "SVC": {
        "class": "SVC",
        "module": "sklearn.svm",
        "requires_numeric_labels": False,
        "search_type": "grid",
        "scoring": "accuracy",
        "default_params": {"random_state": 42, "probability": True},
        "param_grid": {
            "C": [0.1, 1, 10],
            "kernel": ["linear", "rbf"],
            "gamma": ["scale", "auto"]
        }
    },
    "GaussianNB": {
        "class": "GaussianNB",
        "module": "sklearn.naive_bayes",
        "requires_numeric_labels": False,
        "search_type": "grid",
        "scoring": "accuracy",
        "default_params": {},
        "param_grid": {}
    },
    "LinearDiscriminantAnalysis": {
        "class": "LinearDiscriminantAnalysis",
        "module": "sklearn.discriminant_analysis",
        "requires_numeric_labels": False,
        "search_type": "grid",
        "scoring": "accuracy",
        "default_params": {},
        # Note: solver="svd" does not support shrinkage, so the ("svd", "auto")
        # combination fails to fit and GridSearchCV scores it as NaN by default.
        "param_grid": {
            "solver": ["svd", "lsqr"],
            "shrinkage": [None, "auto"]
        }
    },
    "QuadraticDiscriminantAnalysis": {
        "class": "QuadraticDiscriminantAnalysis",
        "module": "sklearn.discriminant_analysis",
        "requires_numeric_labels": False,
        "search_type": "grid",
        "scoring": "accuracy",
        "default_params": {},
        "param_grid": {
            "reg_param": [0.0, 0.1, 0.5]
        }
    },
    "MLPClassifier": {
        "class": "MLPClassifier",
        "module": "sklearn.neural_network",
        "requires_numeric_labels": True,
        "search_type": "grid",
        "scoring": "accuracy",
        "default_params": {"random_state": 42, "max_iter": 1000},
        "param_grid": {
            "hidden_layer_sizes": [(50,), (100,), (50, 50)],
            "activation": ["relu", "tanh"],
            "solver": ["adam", "lbfgs"],
            "alpha": [0.0001, 0.001, 0.01]
        }
    }
}

print("Default parameters in AVAILABLE_MODELS (set in default_params for each model):\n")

# Show what default parameters will be used (informational only).
for name, config in AVAILABLE_MODELS.items():
    if "default_params" in config:
        print(f"{name}:")
        for k, v in config["default_params"].items():
            print(f"   {k} = {v}")
        print()

all_valid = True

# Instantiate each model with its default parameters.
# This confirms that every default parameter is valid for its model before tuning in a later cell.
# Catching parameter errors here avoids runtime failures during hyperparameter tuning, which takes considerable time.
for name, config in AVAILABLE_MODELS.items():
    module = importlib.import_module(config["module"])
    model_class = getattr(module, config["class"])
    try:
        model_class(**config["default_params"])
    except TypeError as e:
        all_valid = False
        print(f"❌ {name} failed: {e}")

if all_valid:
    print("\n✅ All default_params are valid across all AVAILABLE_MODELS.")



print("Default values for hyperparameters listed in param_grid:\n")

for model_name, config in AVAILABLE_MODELS.items():
    param_grid = config.get("param_grid", {})
    if not param_grid:
        continue
    module = importlib.import_module(config["module"])
    model_class = getattr(module, config["class"])
    default_instance = model_class(**config.get("default_params", {}))
    default_params = default_instance.get_params()
    print(f"{model_name}:")
    for param in param_grid:
        value = default_params.get(param, "(not set)")
        print(f"   {param}: {value}")
    print()
    

Default parameters in AVAILABLE_MODELS (set in default_params for each model):

DecisionTreeClassifier:
   random_state = 42

RandomForestClassifier:
   random_state = 42
   n_jobs = -1

ExtraTreesClassifier:
   random_state = 42
   n_jobs = -1

GradientBoostingClassifier:
   random_state = 42
   subsample = 0.8
   max_depth = 5
   max_features = sqrt

AdaBoostClassifier:
   random_state = 42

XGBClassifier:
   random_state = 42
   use_label_encoder = False
   eval_metric = mlogloss
   n_jobs = -1

LogisticRegression:
   random_state = 42
   n_jobs = -1
   max_iter = 1000

RidgeClassifier:

SGDClassifier:
   random_state = 42
   n_jobs = -1

Perceptron:
   random_state = 42
   max_iter = 1000
   n_jobs = -1

KNeighborsClassifier:
   n_jobs = -1

SVC:
   random_state = 42
   probability = True

GaussianNB:

LinearDiscriminantAnalysis:

QuadraticDiscriminantAnalysis:

MLPClassifier:
   random_state = 42
   max_iter = 1000


✅ All default_params are valid across all AVAILABLE_MODELS.
De

In [None]:
# -------------------------------------------
# 3. CREATE JSON FILE WITH MODEL CONFIGURATIONS AND ORDERING
# -------------------------------------------

# This code creates a .json file that contains the model families, their ordering,
# and the mapping from short names to classifier names. The JSON file is used in
# notebook 07 to display the models and their default parameters in a structured way.



print("Default parameters in AVAILABLE_MODELS (set in default_params for each model):\n")

# Map short names to full classifier names.
# Defined once, outside the loops, because it is also saved to JSON below.
model_mapping = {
    "RandomForest": "RandomForestClassifier",
    "ExtraTrees": "ExtraTreesClassifier",
    "GradientBoosting": "GradientBoostingClassifier",
    "AdaBoost": "AdaBoostClassifier",
    "DecisionTree": "DecisionTreeClassifier",
    "XGBoost": "XGBClassifier",
    "LogisticRegression": "LogisticRegression",
    "RidgeClassifier": "RidgeClassifier",
    "SGDClassifier": "SGDClassifier",
    "Perceptron": "Perceptron",
    "SVC": "SVC",
    "Kneighbors": "KNeighborsClassifier",
    "GaussianNB": "GaussianNB",
    "LDA": "LinearDiscriminantAnalysis",
    "QDA": "QuadraticDiscriminantAnalysis",
    "MLP": "MLPClassifier"
}

# Create ordered list based on MODEL_FAMILIES
ordered_models = []
for family_name, models in MODEL_FAMILIES.items():
    for order_num, model_short_name in models:
        full_name = model_mapping.get(model_short_name)
        if full_name and full_name in AVAILABLE_MODELS:
            ordered_models.append((order_num, full_name, family_name))

# Save model configuration (so these dictionaries can be accessed in notebook 07)
model_config_path = tuned_models_dir / "model_config.json"
model_config = {
    'MODEL_FAMILIES': MODEL_FAMILIES,
    'model_mapping': model_mapping
}
with open(model_config_path, "w") as f:
    json.dump(model_config, f, indent=4)

# Sort by order number and display
ordered_models.sort(key=lambda x: x[0])

current_family = None
for order_num, model_name, family_name in ordered_models:
    # Print family header when family changes
    if family_name != current_family:
        if current_family is not None:
            print()  # Extra line between families
        print(f"--- {family_name} ---")
        current_family = family_name
    
    config = AVAILABLE_MODELS[model_name]
    if "default_params" in config and config["default_params"]:
        print(f"{order_num:2d}. {model_name}:")
        for k, v in config["default_params"].items():
            print(f"      {k} = {v}")
    else:
        print(f"{order_num:2d}. {model_name}: (no default params)")
    print()

Default parameters in AVAILABLE_MODELS (set in default_params for each model):

--- Tree-Based Models ---
 1. DecisionTreeClassifier:
      random_state = 42

 2. RandomForestClassifier:
      random_state = 42
      n_jobs = -1

 3. ExtraTreesClassifier:
      random_state = 42
      n_jobs = -1

 4. GradientBoostingClassifier:
      random_state = 42
      subsample = 0.8
      max_depth = 5
      max_features = sqrt

 5. AdaBoostClassifier:
      random_state = 42

 6. XGBClassifier:
      random_state = 42
      use_label_encoder = False
      eval_metric = mlogloss
      n_jobs = -1


--- Linear Models ---
 7. LogisticRegression:
      random_state = 42
      n_jobs = -1
      max_iter = 1000

 8. RidgeClassifier: (no default params)

 9. SGDClassifier:
      random_state = 42
      n_jobs = -1

10. Perceptron:
      random_state = 42
      max_iter = 1000
      n_jobs = -1


--- Kernel-Based Models ---
11. SVC:
      random_state = 42
      probability = True


--- Instance-Based
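
One detail matters when the saved model_config.json is read back in another notebook: JSON has no tuple type, so the (order number, short name) pairs in MODEL_FAMILIES are written as two-element arrays and return as lists. The sketch below illustrates this with a small excerpt; the variable names `families`, `restored`, and `restored_tuples` are illustrative and do not appear in the notebook.

```python
import json

# Small excerpt shaped like MODEL_FAMILIES: (order number, short name) tuples.
families = {
    "Kernel-Based Models": [(11, "SVC")],
    "Discriminant Models": [(14, "LDA"), (15, "QDA")],
}

# Round-trip through JSON text, as writing and re-reading the file would do.
restored = json.loads(json.dumps(families))

# Each pair is now a list, so the restored dict is not equal to the original.
print(restored["Kernel-Based Models"])

# Convert the pairs back to tuples if the reader expects the original shape.
restored_tuples = {
    family: [tuple(pair) for pair in pairs]
    for family, pairs in restored.items()
}
print(restored_tuples == families)
```

Loops that unpack each pair, such as `for order_num, short_name in pairs`, work the same on lists and tuples, so the conversion is only needed where tuples are compared or used as dictionary keys.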

In [None]:
# -------------------------------------------
# 4. TUNE HYPERPARAMETERS AND SAVE RESULTS
# -------------------------------------------

# This code cell can take approximately 10 minutes to run.

from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    StratifiedKFold,
    cross_val_score,
)
from sklearn.preprocessing import LabelEncoder
from joblib import dump
import importlib
import json
import pandas as pd

# Encode labels once (for models that require numeric targets like XGBoost and MLP)
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Collect best parameters here
best_params_summary = {}

# Below is where the param_grid entry in each model's dictionary entry in the prior cell gets
# systematically explored. For example, the RandomForestClassifier grid defines
# 2 x 3 x 2 x 2 = 24 parameter combinations (n_estimators x max_depth x
# min_samples_split x min_samples_leaf). Each combination is evaluated with 5-fold
# stratified CV, meaning there are 120 cross-validation fits for the Random Forest
# algorithm, plus one final refit of the best combination on the full data.
# Results from each model are saved to an individual CSV file.
# A consolidated summary of all tuned parameters is saved in the best_params.json file.



for model_name, config in AVAILABLE_MODELS.items():
    print(f"\n Tuning {model_name}...")

    # Load the model class dynamically
    module = importlib.import_module(config["module"])
    model_class = getattr(module, config["class"])
    model = model_class(**config["default_params"])

    # Use numeric labels if model needs them
    y_target = y_encoded if config["requires_numeric_labels"] else y

    # Select cross-validation strategy
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    # Select search method. RandomizedSearchCV names its search-space
    # argument param_distributions rather than param_grid.
    if config["search_type"] == "grid":
        search_cls, space_arg = GridSearchCV, "param_grid"
    else:
        search_cls, space_arg = RandomizedSearchCV, "param_distributions"
    param_grid = config["param_grid"]

    # If no hyperparameters to tune, skip the search and score the default model
    # on the same CV splits so the reported score is comparable to the others.
    if not param_grid:
        best_score = cross_val_score(
            model, X, y_target, scoring=config["scoring"], cv=cv, n_jobs=-1
        ).mean()
        # Copy so the injected values below do not modify AVAILABLE_MODELS
        best_params = dict(config["default_params"])

        # Inject reproducibility params if supported
        if "random_state" in model.get_params():
            best_params["random_state"] = 42
        if "n_jobs" in model.get_params():
            best_params["n_jobs"] = -1

        print(f"⚠️  No hyperparameter grid, using default params. CV score: {best_score:.4f}")
    else:
        search = search_cls(
            estimator=model,
            scoring=config["scoring"],
            cv=cv,
            n_jobs=-1,
            return_train_score=True,
            **{space_arg: param_grid},
        )
        search.fit(X, y_target)
        best_model = search.best_estimator_
        best_score = search.best_score_
        best_params = search.best_params_    # This is where the combination of hyperparameters yielding the best score is stored.

        # Inject reproducibility params if supported to:
        #   1. Ensure consistent results across runs
        #   2. Allow multi-threading if the ML model supports it.
        if "random_state" in best_model.get_params():
            best_params["random_state"] = 42
        if "n_jobs" in best_model.get_params():
            best_params["n_jobs"] = -1

        # Save full CV search results
        results_df = pd.DataFrame(search.cv_results_)
        results_path = tuned_models_dir / f"cv_results_{model_name}.csv"
        results_df.to_csv(results_path, index=False)

    # Add to summary
    best_params_summary[model_name] = {
        "best_params": best_params,
        "best_score": round(best_score, 4),
        "search_type": config["search_type"],
        "scoring": config["scoring"]
    }

    print(f" {model_name} done - Best CV score: {best_score:.4f}")

# Save best parameter summary
summary_path = tuned_models_dir / "best_params.json"
with open(summary_path, "w") as f:
    json.dump(best_params_summary, f, indent=4)

print(f"\nBest parameters saved to: {summary_path}")




 Tuning DecisionTreeClassifier...
 DecisionTreeClassifier done - Best CV score: 0.9081

 Tuning RandomForestClassifier...
 RandomForestClassifier done - Best CV score: 0.9256

 Tuning ExtraTreesClassifier...
 ExtraTreesClassifier done - Best CV score: 0.9229

 Tuning GradientBoostingClassifier...
 GradientBoostingClassifier done - Best CV score: 0.9269

 Tuning AdaBoostClassifier...
 AdaBoostClassifier done - Best CV score: 0.8441

 Tuning XGBClassifier...
 XGBClassifier done - Best CV score: 0.9302

 Tuning LogisticRegression...
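
The fit count quoted in the tuning cell's comments can be checked by hand. The sketch below copies the RandomForestClassifier grid from AVAILABLE_MODELS and counts its combinations two ways, using only the standard library; the variable names are illustrative, and the enumeration only mimics what a grid search iterates over rather than calling scikit-learn.

```python
from itertools import product
from math import prod

# Copied from the RandomForestClassifier entry in AVAILABLE_MODELS.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 3],
}
n_splits = 5  # matches StratifiedKFold(n_splits=5) in the tuning cell

# The number of combinations is the product of the list lengths.
n_combinations = prod(len(values) for values in param_grid.values())

# Enumerate every combination explicitly as a dict of parameter values.
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

# One fit per combination per fold.
n_cv_fits = n_combinations * n_splits
print(n_combinations, len(combinations), n_cv_fits)
```

The same arithmetic gives a quick cost estimate before adding values to any grid: each extra value in one list multiplies the total by that list's new length over its old one.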

In [None]:
# -------------------------------------------
# 5. LOG WHICH MODELS ACCEPT FIXED PARAMS
# -------------------------------------------

# Some of the models can accept these two fixed parameter assignments: random_state=42 and/or n_jobs=-1
# These parameters were included in the best_params.json file, above.
# This code cell confirms which models can, and did, accept one or both of these fixed parameters.


# Optional: verify support for fixed reproducibility params
supported_random = []
supported_n_jobs = []

for model_name, meta in AVAILABLE_MODELS.items():
    module = importlib.import_module(meta["module"])
    model_class = getattr(module, meta["class"])
    params = model_class().get_params()

    if "random_state" in params:
        supported_random.append(model_name)
    if "n_jobs" in params:
        supported_n_jobs.append(model_name)

print("\n✅ Models that accept random_state:")
for m in supported_random:
    print(f"   • {m}")

print("\n✅ Models that accept n_jobs:")
for m in supported_n_jobs:
    print(f"   • {m}")


✅ Models that accept random_state:
   • DecisionTreeClassifier
   • RandomForestClassifier
   • ExtraTreesClassifier
   • GradientBoostingClassifier
   • AdaBoostClassifier
   • XGBClassifier
   • LogisticRegression
   • RidgeClassifier
   • SGDClassifier
   • Perceptron
   • SVC
   • MLPClassifier

✅ Models that accept n_jobs:
   • RandomForestClassifier
   • ExtraTreesClassifier
   • XGBClassifier
   • LogisticRegression
   • SGDClassifier
   • Perceptron
   • KNeighborsClassifier


In [None]:
# -------------------------------------------
# 6. CONFIRM TUNED AND REPRODUCIBILITY PARAMETERS USED IN EACH MODEL
# -------------------------------------------

# This code cell shows the actual tuned parameters that were determined during GridSearchCV.
# This code cell also confirms whether the default parameters that are defined for some models in the 
# AVAILABLE_MODELS dictionary (random_state=42, n_jobs=-1) were injected into the best_params list.
# These last two parameters support reproducibility and parallel processing, respectively.
# The display provides an audit trail of what parameter combinations actually were selected as "best"
# for each model, including either of the two injected reproducibility parameters.
# All other sklearn default parameters that each model used internally are not shown here.

print("FINAL TUNED PARAMETERS FOR EACH MODEL")
print("=" * 60)

for model_name, summary in best_params_summary.items():
    print(f"\n{model_name}:")
    print(f"   Best CV Score: {summary['best_score']}")
    print(f"   Search Method: {summary['search_type']}")
    print(f"   Scoring Metric: {summary['scoring']}")
    print("   All Parameters:")
    
    # Get the model configuration
    config = AVAILABLE_MODELS[model_name]
    best_params = summary['best_params']
    
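    # Create a model instance to get all parameters.
    # Start from the configured defaults so settings such as max_iter are shown
    # as actually used, then overlay the tuned and injected values.
    module = importlib.import_module(config["module"])
    model_class = getattr(module, config["class"])
    model_instance = model_class(**{**config.get("default_params", {}), **best_params})
    all_params = model_instance.get_params()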
    
    # Get lists of parameter types for labeling
    param_grid_keys = set(config.get("param_grid", {}).keys())
    reproducibility_params = {'random_state', 'n_jobs'}
    
    # Sort parameters alphabetically for consistent display
    for param_name in sorted(all_params.keys()):
        param_value = all_params[param_name]
        
        # Determine parameter type and add appropriate label
        labels = []
        if param_name in param_grid_keys:
            labels.append("🎯 TUNED")
        if param_name in reproducibility_params:
            labels.append("⚡ REPRO")
        if not labels:
            labels.append("📋 DEFAULT")
        
        label_str = " ".join(labels)
        print(f"      {param_name}: {param_value} ({label_str})")
    
    print("-" * 40)

print(f"\nSummary: Tuned parameters for {len(best_params_summary)} models")
print(f"Detailed results saved to: {tuned_models_dir / 'best_params.json'}")

# Optional: Show models that had no hyperparameters to tune
no_tuning = [name for name, summary in best_params_summary.items() 
             if not AVAILABLE_MODELS[name]['param_grid']]

if no_tuning:
    print("\n⚠️  Models with no hyperparameter grid (used defaults only):")
    for model in no_tuning:
        print(f"   • {model}")




FINAL TUNED PARAMETERS FOR EACH MODEL

🔧 DecisionTreeClassifier:
   Best CV Score: 0.9081
   Search Method: grid
   Scoring Metric: accuracy
   All Parameters:
      ccp_alpha: 0.0 (📋 DEFAULT)
      class_weight: None (📋 DEFAULT)
      criterion: gini (🎯 TUNED)
      max_depth: 10 (🎯 TUNED)
      max_features: None (📋 DEFAULT)
      max_leaf_nodes: None (📋 DEFAULT)
      min_impurity_decrease: 0.0 (📋 DEFAULT)
      min_samples_leaf: 1 (🎯 TUNED)
      min_samples_split: 5 (🎯 TUNED)
      min_weight_fraction_leaf: 0.0 (📋 DEFAULT)
      monotonic_cst: None (📋 DEFAULT)
      random_state: 42 (⚡ REPRO)
      splitter: best (📋 DEFAULT)
----------------------------------------

🔧 RandomForestClassifier:
   Best CV Score: 0.9256
   Search Method: grid
   Scoring Metric: accuracy
   All Parameters:
      bootstrap: True (📋 DEFAULT)
      ccp_alpha: 0.0 (📋 DEFAULT)
      class_weight: None (📋 DEFAULT)
      criterion: gini (📋 DEFAULT)
     

In [None]:
# -------------------------------------------
# 7. PRODUCE TABLE OF HYPERPARAMETER CHOICES AND FINAL TUNED VALUES
# -------------------------------------------

# The table produced here is used in this benchmarking project's final report to
# summarize hyperparameter tuning results and to support explanations of
# a model's accuracy and runtime results.

import pandas as pd
from IPython.display import display

rows = []

for model_name, summary in best_params_summary.items():
    config = AVAILABLE_MODELS[model_name]
    param_grid = config.get('param_grid', {})
    best_params = summary['best_params']
    
    if not param_grid:
        # No hyperparameters tuned for this model
        rows.append({
            'Model': model_name,
            'Hyperparameter (search space)': '(no hyperparameters tuned)',
            'Tuned Value': '(sklearn defaults)'
        })
        continue
    
    for param, choices in param_grid.items():
        tuned_value = best_params.get(param, '(not set)')
        # Format choices for display
        if isinstance(choices, list):
            choices_str = ', '.join([str(c) for c in choices])
        else:
            choices_str = str(choices)
        rows.append({
            'Model': model_name,
            'Hyperparameter (search space)': f"{param} (choices: {choices_str})",
            'Tuned Value': tuned_value
        })

# Create DataFrame for display
param_table = pd.DataFrame(rows)

# Optionally, sort by model name
param_table = param_table.sort_values(['Model', 'Hyperparameter (search space)'])

# Display the table
print("\nTable of Tuned Hyperparameters and Search Spaces:")
display(param_table)

# Optionally, save to Excel for inclusion in report
param_table_path = tuned_models_dir / "tuned_hyperparameters_table.xlsx"
param_table.to_excel(param_table_path, index=False)
print(f"\nTable saved to: {param_table_path}")



Table of Tuned Hyperparameters and Search Spaces:


    Model                   Hyperparameter (search space)         Tuned Value
15  AdaBoostClassifier      learning_rate (choices: 0.5, 1.0)     1.0
14  AdaBoostClassifier      n_estimators (choices: 50, 100)       100
 0  DecisionTreeClassifier  criterion (choices: gini, entropy)    gini
 1  DecisionTreeClassifier  max_depth (choices: None, 5, 10, 20)  10
 3  DecisionTreeClassifier  min_samples_leaf (choices: 1, 3)      1
 2  DecisionTreeClassifier  min_samples_split (choices: 2, 5)     5
 9  ExtraTreesClassifier    max_depth (choices: None, 10, 20)
10  ExtraTreesClassifier    min_samples_split (choices: 2, 5)     5
 8  ExtraTreesClassifier    n_estimators (choices: 50, 100)       100
36  GaussianNB              (no hyperparameters tuned)            (sklearn defaults)



Table saved to: C:\Misc\ml_benchmark\outputs\tuned_models\tuned_hyperparameters_table.xlsx
