# **AutoHyper**

AutoHyper is designed to facilitate hyperparameter optimization (HPO) for supervised learning models on tabular data.
It serves as a lightweight, modular, and fully customizable package, giving you fine-grained control over the entire tuning and validation process.

## **What is AutoHyper**

AutoHyper is designed to:

- Provide a clear and consistent interface for different HPO strategies: currently grid search, random search, and evolutionary algorithms.

- Leverage nested cross-validation to deliver robust and unbiased estimates of out-of-sample model performance.

- Incorporate a custom selection mechanism that combines performance and robustness using a weighted scoring function, ensuring the best configurations are both accurate and consistently effective across multiple resampling iterations.

- Return structured outputs, ideal for quantitative comparison and visual inspection of configurations.

- Offer detailed logging and configuration ranking based on a composite score of performance and frequency.

## **Key Components of Hyperparameter Optimization in AutoHyper**

A typical hyperparameter optimization (HPO) problem consists of **six essential components**. AutoHyper is designed to give users full control and flexibility over each of these:

1. **`Learner`**

The learner is the machine learning model to be tuned. In AutoHyper, any supervised model following the **scikit-learn API** is supported. This includes both classifiers and regressors such as RandomForestClassifier, XGBRegressor, LogisticRegression, and even custom models wrapped using **SciKeras** (for Keras) or **Skorch** (for PyTorch). The only requirement is that the model must implement `fit(X, y)`, `predict(X)`, and `set_params(**kwargs)`.
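As a rough illustration of that requirement, the sketch below defines a toy regressor exposing the three methods, plus a helper that checks for them. `MeanRegressor` and `has_learner_interface` are hypothetical names used only for illustration; they are not part of AutoHyper.

```python
class MeanRegressor:
    """Toy learner: predicts the (optionally shrunk) mean of the training targets."""

    def __init__(self, shrinkage=0.0):
        self.shrinkage = shrinkage  # hyperparameter: pulls predictions toward 0

    def set_params(self, **params):
        # Follow the scikit-learn convention: update attributes and return self.
        for name, value in params.items():
            if not hasattr(self, name):
                raise ValueError(f"Invalid parameter {name!r}")
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        values = list(y)
        self.mean_ = sum(values) / len(values)
        return self

    def predict(self, X):
        return [(1.0 - self.shrinkage) * self.mean_ for _ in X]


def has_learner_interface(model):
    """Return True if the object exposes fit, predict and set_params."""
    required = ("fit", "predict", "set_params")
    return all(callable(getattr(model, name, None)) for name in required)


model = MeanRegressor().set_params(shrinkage=0.5)
model.fit([[1], [2], [3]], [2.0, 4.0, 6.0])
print(has_learner_interface(model), model.predict([[10], [20]]))
```

In practice, subclassing `sklearn.base.BaseEstimator` provides `get_params` and `set_params` automatically, so a custom model usually only needs `fit` and `predict`.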

2. **`Hyperparameter Space`**

The search space defines the set of hyperparameters to explore. AutoHyper allows users to specify this space as a dictionary mapping parameter names to candidate values. The search space is parsed dynamically according to the selected optimization strategy. When needed, particularly in random search and the evolutionary algorithm, AutoHyper can **automatically infer and apply the most appropriate value scale** for each hyperparameter, such as linear sampling for float and integer parameters and categorical sampling for discrete options. This enables more efficient exploration, especially for large or non-uniform hyperparameter domains.
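One way such inference could work is sketched below. It classifies each parameter from its candidate values and then samples linearly between the smallest and largest candidate for numeric parameters. The helper names and the min/max range rule are assumptions made for illustration, not AutoHyper's actual implementation.

```python
import random


def infer_kind(values):
    """Classify candidate values as 'int', 'float' or 'categorical'."""
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "int"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "float"
    return "categorical"


def sample_value(values, rng):
    """Draw one value: linear sampling for numbers, uniform choice otherwise."""
    kind = infer_kind(values)
    if kind == "int":
        return rng.randint(min(values), max(values))
    if kind == "float":
        return rng.uniform(min(values), max(values))
    return rng.choice(values)


space = {
    "max_depth": [1, 3, 20],
    "max_features": [0.1, 0.5, 0.9],
    "criterion": ["squared_error", "absolute_error"],
}
rng = random.Random(0)
config = {name: sample_value(values, rng) for name, values in space.items()}
print({name: infer_kind(values) for name, values in space.items()})
print(config)
```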

3. **`Dataset`**

AutoHyper is specifically tailored for **tabular datasets**, where `X` is a `pandas.DataFrame` and `y` is a `pandas.Series` or `numpy.ndarray`. The dataset is passed directly to the HPO class and internally split according to the specified resampling strategy. The package assumes clean, preprocessed data, leaving feature engineering and preprocessing under the user's control for maximal transparency and modularity.

4. **`Resampling Strategy`**

To avoid overfitting and ensure unbiased evaluation, AutoHyper leverages **nested cross-validation**. The outer loop estimates generalization performance, while the inner loop performs hyperparameter tuning. Users can configure the number of outer and inner folds (`outer_k`, `inner_k`), making the resampling strategy fully customizable. Because the outer test folds never take part in tuning, this separation gives a rigorous assessment of how the chosen hyperparameters would perform on truly unseen data.
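The structure of nested cross-validation can be sketched with plain index lists. The helpers below are hypothetical and omit shuffling. They only show how the inner folds are carved out of each outer training part, so the outer test part never influences tuning.

```python
def k_fold_indices(indices, k):
    """Split a list of indices into k (train, test) pairs of near-equal size."""
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def nested_cv_plan(n_samples, outer_k, inner_k):
    """Return one (outer_test, inner_splits) pair per outer fold."""
    plan = []
    for outer_train, outer_test in k_fold_indices(list(range(n_samples)), outer_k):
        # Inner folds are built from the outer training part only.
        inner_splits = k_fold_indices(outer_train, inner_k)
        plan.append((outer_test, inner_splits))
    return plan


plan = nested_cv_plan(n_samples=20, outer_k=5, inner_k=3)
for fold, (outer_test, inner_splits) in enumerate(plan):
    print(f"outer fold {fold}: test={outer_test}, inner folds={len(inner_splits)}")
```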

5. **`Performance Measure`**

Evaluation is handled using common metrics such as `accuracy`, `f1`, `precision`, and `recall` for classification, and `r2`, `neg_mean_squared_error`, etc., for regression. The user specifies the metric via the `scoring` parameter. Internally, AutoHyper calculates average performance for each configuration across all outer folds and applies a **custom weighted scoring function** that balances average performance with robustness (measured as frequency of selection), ensuring the final recommendation is both strong and stable.

6. **`Optimization Strategy`**

AutoHyper supports multiple optimization strategies, including:
- **Grid Search**: Exhaustively explores all combinations of hyperparameters.
- **Random Search**: Samples a fixed number of random configurations from the hyperparameter space.
- **Evolutionary Algorithms**: Uses genetic algorithms to evolve hyperparameter configurations over generations, balancing exploration and exploitation.

## **Optimization Strategies - How To Use AutoHyper**

### **Step 1: Installation**

To install AutoHyper, run this command in your terminal:

```bash
pip install git+https://github.com/Christian-Braga/AutoHyper.git
```

### **Step 2: Importing AutoHyper and setup**

#### **Libraries**

```python
# Import the HPO class from the autohyper module
from autohyper import HPO

# Import necessary libraries for your project
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
```


#### **Dataset**

```python
# Load the California Housing Dataset and convert it into a pandas DataFrame
X, y = fetch_california_housing(return_X_y=True)

X = pd.DataFrame(
    X,
    columns=[
        "MedInc",
        "HouseAge",
        "AveRooms",
        "AveBedrms",
        "Population",
        "AveOccup",
        "Latitude",
        "Longitude",
    ],
)
y = pd.Series(y, name="target")

data_features = X
data_target = y
```

#### **Model, Hyperparameter space and Task**

```python
# Define the model you want to optimize
model = RandomForestRegressor()

# Define the Hyperparameter Space
# (keys must be valid parameters of the chosen model)
hp_values = {
    "max_depth": [1, 3, 20],
    "min_samples_split": [2, 5, 10],
    "n_estimators": [10, 30, 50],
}

# Specify the task type
task = "regression"
```
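scikit-learn estimators reject unknown names in `set_params`, so it can help to compare the search-space keys with `model.get_params()` before tuning. The sketch below does this with a hypothetical helper, `unknown_params`. `StubForest` is a stand-in exposing a few `RandomForestRegressor` parameter names so that the snippet runs without scikit-learn. With a real estimator you would pass the model itself.

```python
def unknown_params(model, hp_values):
    """Return the search-space keys that the model does not expose via get_params()."""
    valid = set(model.get_params())
    return sorted(name for name in hp_values if name not in valid)


class StubForest:
    """Stand-in exposing a few RandomForestRegressor parameter names."""

    def get_params(self):
        return {"n_estimators": 100, "max_depth": None, "min_samples_split": 2}


candidate_space = {
    "max_depth": [1, 3, 20],
    "learning_rate": [0.1, 0.5],  # not a random forest parameter
    "n_estimators": [10, 30],
}
print(unknown_params(StubForest(), candidate_space))  # ['learning_rate']
```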

#### **Create an instance of the HPO class**

**Required parameters:**

- **`model`**:

A scikit-learn compatible estimator implementing fit and predict.
Used as the base model for hyperparameter tuning.

- **`data_features`** (pd.DataFrame):

The input features (X) on which training and evaluation are performed.

- **`data_target`** (pd.Series or np.ndarray):

The target variable (y) to predict.

- **`hp_values`** (dict):

Dictionary defining the hyperparameter search space.
Keys are parameter names, values are lists of candidate values (for grid, random, EA).

- **`task`** (str):

Type of supervised learning task. Supported: "classification" or "regression".
Determines the scoring metric used in cross-validation.

```python
test_hpo = HPO(
    model=model,
    data_features=data_features,
    data_target=data_target,
    hp_values=hp_values,
    task=task,
)
```

### **Step 3: Perform Hyperparameter Optimization**

To start the tuning process and obtain the best hyperparameter configuration, call the `hp_tuning` method on the HPO instance.

#### **`hp_tuning`**

The `hp_tuning` method is the core interface for performing hyperparameter optimization with nested cross-validation. It supports multiple search strategies (**grid_search**, **random_search**, **evolutionary_algorithm**) and returns a structured summary of results, suitable for downstream analysis and visualization.

##### **Parameters**

| Name                          | Type                    | Description                                                                                                        |
| ----------------------------- | ----------------------- | ------------------------------------------------------------------------------------------------------------------ |
| `hpo_method`                  | `str`                   | Hyperparameter optimization method to use. Options: `"grid_search"`, `"random_search"`, `"evolutionary_algorithm"` |
| `outer_k`                     | `int`                   | Number of folds in the **outer cross-validation** loop (used to evaluate generalization performance)               |
| `inner_k`                     | `int`                   | Number of folds in the **inner cross-validation** loop (used for model selection)                                  |
| `n_trials`                    | `Optional[int]`         | (Only for `random_search`) Number of random configurations to sample                                               |
| `parents_selection_mechanism` | `Optional[str]`         | (Only for `evolutionary_algorithm`) Strategy for selecting parent individuals (e.g. `"tournament_selection"`)  |
| `parents_selection_ratio`     | `Optional[float]`       | (Only for `evolutionary_algorithm`) Fraction of population selected as parents                                     |
| `max_generations`             | `Optional[int]`         | (Only for `evolutionary_algorithm`) Number of generations to evolve                                                |
| `n_new_configs`               | `Optional[int]`         | (Only for `evolutionary_algorithm`) Number of new configurations generated per generation                          |
| `shuffle`                     | `bool` (default `True`) | Whether to shuffle data before splitting in CV                                                                     |

##### **Returns**
```python
Dict[str, Any]
```
A structured dictionary with the following keys:

- **`best_config`**: Best configuration by weighted score (70% performance, 30% frequency)

- **`most_frequent_config`**: Most frequently selected config across outer folds

- **`best_by_performance`**: Best config based on raw metric (e.g., R² or F1)

- **`overall_metrics`**: Contains the mean and standard deviation of the evaluation metrics computed on the outer test sets across all outer folds. This provides a global estimate of model generalization performance.

- **`configs_summary`**: A ranked summary of all unique configurations selected during outer folds, including their raw score, weighted score, selection frequency, and corresponding ranks.

- **`fold_metrics`**: Lists the evaluation results for each outer fold. For every fold, it includes: the fold index, the best configuration selected via inner CV, and the model’s performance on the corresponding outer test set.

- **`execution_time`**: Total time elapsed (seconds)

- **`weighting_factors`**: Current weights used to compute the final score
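To make the weighted score concrete, here is a minimal sketch of how a 70/30 combination of performance and selection frequency could rank configurations. It assumes scores already lie in a comparable 0 to 1 range and applies no further normalization. The function name, the input format, and the numbers are illustrative assumptions, not AutoHyper's internals.

```python
from collections import defaultdict


def rank_configs(fold_results, w_perf=0.7, w_freq=0.3):
    """Rank configs by w_perf * mean score + w_freq * selection frequency."""
    scores = defaultdict(list)
    for config, score in fold_results:
        scores[config].append(score)
    n_folds = len(fold_results)
    summary = []
    for config, values in scores.items():
        mean_score = sum(values) / len(values)
        frequency = len(values) / n_folds
        summary.append({
            "config": config,
            "mean_score": mean_score,
            "frequency": frequency,
            "weighted_score": w_perf * mean_score + w_freq * frequency,
        })
    return sorted(summary, key=lambda row: row["weighted_score"], reverse=True)


# One (winning config, outer-test score) pair per outer fold; toy values.
fold_results = [
    ("max_depth=3", 0.80), ("max_depth=3", 0.78), ("max_depth=3", 0.82),
    ("max_depth=20", 0.85), ("max_depth=3", 0.80),
]
ranking = rank_configs(fold_results)
for row in ranking:
    print(row)
```

In this toy example `max_depth=20` has the highest raw score, but `max_depth=3` ranks first because it was selected in four of the five folds.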


##### **Notes**:

- Supports both regression and classification tasks.

- Handles errors and logs metrics throughout execution.

- Designed for modular, extensible HPO pipelines with pluggable strategies.

#### **GRID SEARCH**

Grid Search is an exhaustive search strategy that systematically evaluates all possible combinations of hyperparameters defined in a user-specified grid. Each hyperparameter is associated with a discrete set of candidate values, and the Cartesian product of these sets is computed to form the full configuration space.

For example, given a grid like:

```python
{
    "lr": [0.01, 0.001],
    "batch_size": [32, 64],
    "optimizer": ["adam", "sgd"]
}
```
Grid Search will generate all 2 x 2 x 2 = 8 possible combinations, such as:

```python
{'lr': 0.01, 'batch_size': 32, 'optimizer': 'adam'}
{'lr': 0.01, 'batch_size': 32, 'optimizer': 'sgd'}
...
{'lr': 0.001, 'batch_size': 64, 'optimizer': 'sgd'}
```

This approach guarantees that the best-performing configuration within the provided grid will be found (assuming proper evaluation). Its cost, however, is the product of the number of candidate values per hyperparameter, so it grows exponentially with the number of hyperparameters.
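The Cartesian product itself is easy to reproduce with the standard library. The sketch below uses a hypothetical helper, `expand_grid`, and is not AutoHyper's own code.

```python
from itertools import product


def expand_grid(grid):
    """Return every combination of the candidate values as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


grid = {"lr": [0.01, 0.001], "batch_size": [32, 64], "optimizer": ["adam", "sgd"]}
configs = expand_grid(grid)
print(len(configs))  # 8
print(configs[0])    # {'lr': 0.01, 'batch_size': 32, 'optimizer': 'adam'}
```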

##### **Use this method when**:

- The search space is relatively small

- Full coverage is required

- Parallel evaluation is feasible

##### **How to use Grid Search in AutoHyper**

To perform hyperparameter optimization using Grid Search, simply call the **`hp_tuning`** method with **`hpo_method="grid_search"`** and specify the number of outer and inner cross-validation folds.

```python
grid_search = test_hpo.hp_tuning(hpo_method="grid_search", outer_k=5, inner_k=3)
```

#### **RANDOM SEARCH**

Random Search is a stochastic search strategy that samples a fixed number of random combinations from the hyperparameter grid.
Unlike Grid Search, it does not evaluate all possible configurations, but instead selects a random subset, making it more scalable for high-dimensional search spaces.

For example, given a grid like:

```python
{
    "lr": [0.01, 0.001],
    "batch_size": [32, 64],
    "optimizer": ["adam", "sgd"]
}
```
You can specify the number of configurations to evaluate with **`n_trials`**. For instance, with **`n_trials=4`**, Random Search might sample:

```python
{'lr': 0.001, 'batch_size': 64, 'optimizer': 'adam'}
{'lr': 0.01, 'batch_size': 32, 'optimizer': 'sgd'}
{'lr': 0.01, 'batch_size': 64, 'optimizer': 'adam'}
{'lr': 0.001, 'batch_size': 32, 'optimizer': 'sgd'}
```

These are randomly selected from the full Cartesian product of the parameter space.

This approach does not guarantee finding the globally best configuration, but it is often more efficient and surprisingly effective in practice, especially when only a few hyperparameters dominate model performance.
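A minimal sketch of this sampling step follows, assuming configurations are drawn without replacement and that `n_trials` is capped at the grid size. Both choices are assumptions for illustration, and AutoHyper may handle these cases differently. `sample_configs` is a hypothetical helper.

```python
import random
from itertools import product


def sample_configs(grid, n_trials, seed=None):
    """Draw up to n_trials distinct configurations from the full grid."""
    names = list(grid)
    all_configs = [dict(zip(names, combo)) for combo in product(*grid.values())]
    rng = random.Random(seed)
    # Sample without replacement; never request more than the grid contains.
    return rng.sample(all_configs, min(n_trials, len(all_configs)))


grid = {"lr": [0.01, 0.001], "batch_size": [32, 64], "optimizer": ["adam", "sgd"]}
sampled = sample_configs(grid, n_trials=4, seed=0)
for config in sampled:
    print(config)
```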

##### **Use this method when**:

- The search space is large or high-dimensional

- You want faster optimization with fewer evaluations

- Full coverage is unnecessary or too costly

- You can afford randomness in exchange for speed

##### **How to Use Random Search in AutoHyper**

To perform hyperparameter optimization using Random Search, call the **`hp_tuning`** method with **`hpo_method="random_search"`** and specify the number of configurations to sample via the **`n_trials`** parameter as well as the number of outer and inner cross-validation folds.

```python
random_search = test_hpo.hp_tuning(
    hpo_method="random_search", outer_k=5, inner_k=3, n_trials=35
)
```

#### **EVOLUTIONARY ALGORITHM**

The Evolutionary Algorithm is a population-based, bio-inspired optimization strategy that evolves hyperparameter configurations over multiple generations.
Each generation refines the search by selecting the best-performing configurations (parents), applying mutations to generate offspring, and retaining the fittest solutions.

This approach is especially useful for exploring large or complex search spaces, where exhaustive methods are impractical and random search may be inefficient.

##### **Workflow**

1. **Initialization**:

A population of candidate configurations is generated from the Cartesian product of the hyperparameter space and, if **`n_new_configs`** is specified, extended with additional randomly sampled configurations. Each configuration is validated for compatibility with the model.

2. **Evaluation**:

Each configuration is evaluated using K-fold cross-validation, and assigned a fitness score (mean accuracy for classification, or mean R² for regression).

3. **Parent Selection**:

A fraction of the evaluated configurations is selected as parents based on the specified mechanism:

- **`"neutral_selection"`**: random selection with no fitness pressure (computationally very cheap).

- **`"fitness_proportional_selection"`**: selection with a probability proportional to fitness, so higher fitness means a higher chance of selection (computationally very expensive).

- **`"tournament_selection"`**: selects the best configuration from randomly drawn groups (moderately computationally expensive).

4. **Mutation**:

Each parent undergoes mutation:

- Numerical parameters (int/float) are incremented or decremented, with a safety check that the resulting value is still valid.

- Categorical parameters are replaced with a different value.


5. **Survival (μ + λ)**:

μ = number of individuals in the current generation (parents)

λ = number of offspring generated in the current generation

The best μ configurations among both parents and offspring (μ + λ) are selected for the next generation.


6. **Tracking & Convergence**:

For each generation, the following statistics are collected: best and mean fitness, population diversity, and parameter evolution over time.

The process continues for **`max_generations`**, after which the best configuration is returned.
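The workflow above can be sketched as a compact (μ + λ) loop. This is an illustration under stated assumptions, not AutoHyper's implementation: `toy_fitness` stands in for the cross-validated score, numeric mutation moves by one unit within the candidate bounds, and all function names are hypothetical.

```python
import random

SPACE = {
    "max_depth": [1, 3, 20],
    "n_estimators": [10, 30, 50],
    "criterion": ["squared_error", "absolute_error"],
}


def toy_fitness(config):
    """Stand-in for a cross-validated score; best near max_depth=12, n_estimators=40."""
    penalty = abs(config["max_depth"] - 12) + abs(config["n_estimators"] - 40) / 10
    bonus = 1.0 if config["criterion"] == "absolute_error" else 0.0
    return bonus - penalty


def mutate(config, rng):
    """Change one parameter: step numbers up or down within bounds, swap categories."""
    child = dict(config)
    name = rng.choice(list(SPACE))
    values = SPACE[name]
    if isinstance(values[0], (int, float)):
        step = rng.choice([-1, 1])
        child[name] = min(max(child[name] + step, min(values)), max(values))
    else:
        child[name] = rng.choice([v for v in values if v != child[name]])
    return child


def tournament(population, n_parents, rng, size=3):
    """Pick each parent as the fittest member of a small random group."""
    group_size = min(size, len(population))
    return [max(rng.sample(population, group_size), key=toy_fitness)
            for _ in range(n_parents)]


def evolve(population, max_generations, ratio, rng):
    """Run the loop and return the best configuration plus the best-fitness history."""
    mu = len(population)
    history = []
    for _ in range(max_generations):
        parents = tournament(population, max(1, int(mu * ratio)), rng)
        offspring = [mutate(parent, rng) for parent in parents]
        # (mu + lambda) survival: keep the best mu of current population and offspring.
        population = sorted(population + offspring, key=toy_fitness, reverse=True)[:mu]
        history.append(toy_fitness(population[0]))
    return population[0], history


rng = random.Random(0)
population = [{name: rng.choice(values) for name, values in SPACE.items()} for _ in range(6)]
best, history = evolve(population, max_generations=10, ratio=0.5, rng=rng)
print(best, history)
```

Because survivors are chosen from parents and offspring together, the best fitness never decreases from one generation to the next.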

##### **How to Use Evolutionary Algorithm**

To perform hyperparameter optimization with the evolutionary algorithm, call the **`hp_tuning`** method with **`hpo_method="evolutionary_algorithm"`** and specify the evolution-specific parameters below:

##### **Evolution Specific Parameters**

| Parameter                     | Type                  | Required | Description                                                                                                                                                                                  |
| ----------------------------- | --------------------- | -------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `parents_selection_mechanism` | `str`                 | **Yes**  | Parent selection method:<br>• `"neutral_selection"` – random<br>• `"fitness_proportional_selection"` – probability proportional to fitness<br>• `"tournament_selection"` – best of random mini-tournaments |
| `parents_selection_ratio`     | `float`               | **Yes**  | Fraction of the current population chosen as parents (0 < ratio ≤ 1).                                                                                                                        |
| `max_generations`             | `int`                 | **Yes**  | Number of evolutionary iterations to run.                                                                                                                                                    |
| `n_new_configs`               | `Optional[int]`       | No       | Extra random configurations added to the initial population (default `None`).                                                                                                                |
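The three selection mechanisms named in the table can be sketched as follows. The function bodies are assumptions that follow the textbook definitions of each mechanism, such as shifting fitness values so that negative scores still yield valid weights. They are not AutoHyper's source code.

```python
import random


def neutral_selection(population, fitness, n_parents, rng):
    """Uniform random choice; fitness is ignored."""
    return rng.sample(population, n_parents)


def fitness_proportional_selection(population, fitness, n_parents, rng):
    """Roulette wheel: the chance of selection grows with fitness."""
    lowest = min(fitness)
    # Shift so every weight is positive (scores such as R^2 can be negative).
    weights = [value - lowest + 1e-9 for value in fitness]
    return rng.choices(population, weights=weights, k=n_parents)


def tournament_selection(population, fitness, n_parents, rng, size=2):
    """Each parent is the fittest member of a random group of `size` individuals."""
    parents = []
    for _ in range(n_parents):
        contenders = rng.sample(range(len(population)), size)
        parents.append(population[max(contenders, key=lambda i: fitness[i])])
    return parents


population = ["cfg_a", "cfg_b", "cfg_c", "cfg_d"]
fitness = [0.60, 0.75, 0.90, 0.40]
n_parents = int(len(population) * 0.5)  # parents_selection_ratio = 0.5
rng = random.Random(42)
for select in (neutral_selection, fitness_proportional_selection, tournament_selection):
    print(select.__name__, select(population, fitness, n_parents, rng))
```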


##### **When to choose EA**

- Very large or mixed-type hyperparameter spaces

- Desire for self-adapting exploration vs. exploitation

- Need for convergence diagnostics (fitness trends, diversity)

- Exhaustive grid is infeasible and random search under-explores

```python
evolutionary_algo = test_hpo.hp_tuning(
    hpo_method="evolutionary_algorithm",
    outer_k=5,
    inner_k=3,
    parents_selection_mechanism="tournament_selection",
    parents_selection_ratio=0.5,
    n_new_configs=5,
    max_generations=10,
)
```