# Chapter 53: Automated Machine Learning

## Learning Objectives

By the end of this chapter, you will be able to:

- Understand what Automated Machine Learning (AutoML) is and how it can accelerate model development
- Identify the components of an AutoML system: feature engineering, model selection, hyperparameter tuning, and architecture search
- Apply automated feature engineering techniques using libraries like `tsfresh` and `Featuretools` on the NEPSE dataset
- Use AutoML frameworks such as AutoGluon, H2O AutoML, and TPOT to build high‑performance models with minimal manual intervention
- Understand the principles behind hyperparameter optimization methods (grid search, random search, Bayesian optimization)
- Explore Neural Architecture Search (NAS) for finding optimal neural network architectures
- Recognize the limitations and potential pitfalls of AutoML, especially in time‑series forecasting
- Decide when to use AutoML versus manual modeling based on project constraints
- Customize AutoML frameworks to incorporate domain knowledge and business constraints
- Adopt best practices for integrating AutoML into a production MLOps pipeline

---

## Introduction

Building a high‑quality machine learning model for a task like NEPSE stock prediction involves many decisions: which features to create, which algorithm to use, how to set its hyperparameters, and sometimes even how to design a neural network architecture. Each of these decisions can significantly impact performance, and exploring all possibilities manually is time‑consuming and requires deep expertise.

**Automated Machine Learning (AutoML)** aims to automate the end‑to‑end process of applying machine learning to real‑world problems. An AutoML system takes a dataset and a task (e.g., classification or regression) and automatically produces a well‑performing model, often with little or no human intervention. This lowers the barrier to entry for non‑experts and lets experts focus on higher‑level problems.

In this chapter, we will explore the components of AutoML and apply several popular frameworks to the NEPSE stock prediction problem. We will also discuss when AutoML is appropriate and how to customize it for specific needs.

---

## 53.1 AutoML Overview

AutoML is not a single algorithm but a combination of techniques that automate the machine learning pipeline. A typical AutoML system performs the following steps:

1. **Data preprocessing**: Handling missing values, scaling, encoding categorical variables.
2. **Feature engineering**: Creating new features from raw data, selecting the most relevant ones.
3. **Model selection**: Choosing among a set of candidate algorithms (e.g., linear models, tree‑based models, neural networks).
4. **Hyperparameter optimization**: Tuning the hyperparameters of the chosen algorithm.
5. **Ensemble construction**: Often combining multiple models to improve performance.

The goal is to find the best possible model within given constraints (time, computational resources). AutoML systems are evaluated on the quality of the final model and the efficiency of the search.

For the NEPSE system, AutoML could automatically try different lag combinations, technical indicators, and model types to find the best predictor of next‑day price direction, replacing a substantial amount of manual experimentation.
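The search loop at the heart of these steps can be sketched in a few lines. This is a toy illustration, not how any particular framework is implemented: the `automl_search` helper, the momentum rule, and the synthetic price series are all hypothetical, and the budget is counted in evaluations rather than seconds so that the run is deterministic.

```python
import random

def automl_search(candidates, evaluate, max_evals):
    """Evaluate up to max_evals candidate configs and keep the best one."""
    best_config, best_score, history = None, float("-inf"), []
    for config in candidates[:max_evals]:
        score = evaluate(config)
        history.append((config, score))
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score, history

# Synthetic closing prices (a random walk), standing in for one NEPSE symbol.
rng = random.Random(42)
prices = [100.0]
for _ in range(299):
    prices.append(prices[-1] * (1 + rng.gauss(0, 0.01)))

def momentum_accuracy(config, start=200, end=299):
    """Validation accuracy of 'predict up if the price rose over the last `lag` days'."""
    lag, follow = config["lag"], config["follow_trend"]
    hits = 0
    for t in range(start, end):
        rose = prices[t] > prices[t - lag]
        predicted_up = rose if follow else not rose
        actual_up = prices[t + 1] > prices[t]
        hits += predicted_up == actual_up
    return hits / (end - start)

# Search space: which lag to use and whether to follow or fade the trend.
search_space = [{"lag": lag, "follow_trend": follow}
                for lag in (1, 2, 5, 10, 20) for follow in (True, False)]

best_config, best_score, history = automl_search(
    search_space, momentum_accuracy, max_evals=8)
print(best_config, round(best_score, 3))
```

Because the toy series is a random walk, any apparent edge of the winning configuration is noise. Real frameworks add smarter candidate generation and ensembling on top of this loop, but the need to validate the winner on untouched data is the same.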

---

## 53.2 Automated Feature Engineering

Feature engineering is often the most time‑consuming part of a machine learning project. AutoML frameworks can automate this by generating a large set of candidate features from the raw data and then selecting the most useful ones.

### 53.2.1 tsfresh for Time‑Series

`tsfresh` (Time Series Feature extraction based on scalable hypothesis tests) is a Python library that automatically extracts hundreds of features from time‑series data. It is particularly well‑suited for our NEPSE dataset, which consists of multiple time series (one per stock symbol).

**Example: Using tsfresh on NEPSE data**

```python
import pandas as pd
from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute
from tsfresh.feature_extraction import ComprehensiveFCParameters

# Load NEPSE data (assume we have columns: Date, Symbol, Close, Volume, ...)
df = pd.read_csv('nepse_daily.csv', parse_dates=['Date'])
df = df.sort_values(['Symbol', 'Date'])

# tsfresh requires a DataFrame with columns: id (symbol), time (date), and the value columns
df_tsfresh = df.melt(id_vars=['Symbol', 'Date'], value_vars=['Close', 'Volume'], 
                     var_name='kind', value_name='value')
df_tsfresh = df_tsfresh.rename(columns={'Symbol': 'id', 'Date': 'time'})

# Extract all possible features (this may take a while)
extraction_settings = ComprehensiveFCParameters()
X = extract_features(df_tsfresh, column_id='id', column_sort='time', 
                     column_kind='kind', column_value='value',
                     default_fc_parameters=extraction_settings,
                     impute_function=impute)

print(f"Extracted {X.shape[1]} features")
```

**Explanation:**  
`extract_features` generates features such as mean, variance, trend coefficients, Fourier transform coefficients, etc., for each combination of `id` (symbol) and `kind` (Close, Volume). The result is a DataFrame with one row per symbol and columns for each extracted feature. This can then be merged with target labels for model training.
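Because this call yields one row per symbol, it cannot be used directly for next‑day prediction, which needs one feature row per symbol and day. Below is a minimal sketch of that alignment, using plain pandas rolling windows as a stand‑in for tsfresh's rolling utilities. The window length, the toy prices, and the column names `roll_mean`, `roll_std`, and `target` are assumptions made for illustration.

```python
import pandas as pd

# Toy data standing in for nepse_daily.csv (two symbols, six days each).
toy = pd.DataFrame({
    "Symbol": ["NABIL"] * 6 + ["NTC"] * 6,
    "Date": list(pd.date_range("2024-01-01", periods=6, freq="D")) * 2,
    "Close": [500, 505, 503, 510, 512, 508, 900, 895, 898, 905, 903, 910],
}).sort_values(["Symbol", "Date"]).reset_index(drop=True)

window = 3  # hypothetical look-back length in days
grouped = toy.groupby("Symbol")["Close"]

# Features for day t use only closes from days t-window+1 .. t (no future data).
toy["roll_mean"] = grouped.transform(lambda s: s.rolling(window).mean())
toy["roll_std"] = grouped.transform(lambda s: s.rolling(window).std())

# Target: 1 if the next day's close is higher than today's, computed per symbol.
next_close = grouped.shift(-1)
toy["target"] = (next_close > toy["Close"]).astype(int)

# Drop rows without a full window or without a next day to predict.
samples = toy[toy["roll_mean"].notna() & next_close.notna()]
print(samples[["Symbol", "Date", "roll_mean", "target"]])
```

Each remaining row pairs features computed from the past with a label from the following day, which is the shape a classifier needs. Grouping by `Symbol` before rolling and shifting keeps one stock's prices from leaking into another's features or targets.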

After extraction, we can use `select_features` to filter out features that are not statistically relevant to the target.

```python
# y must be a pandas Series whose index matches the rows of X exactly.
# extract_features as called above returns one row per symbol, so a per-day target
# (e.g., next-day direction) needs features extracted over rolling windows instead
# (tsfresh provides roll_time_series for this), giving one row per (symbol, date).
# Assumption: X and y have already been aligned this way and contain training rows only.
X_selected = select_features(X, y, fdr_level=0.05)
print(f"Selected {X_selected.shape[1]} features")
```

**Explanation:**  
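`select_features` runs a hypothesis test for each feature and keeps only those that are significantly related to the target, with `fdr_level` setting the false discovery rate to respect. This reduces dimensionality and helps limit overfitting, provided the selection is fitted on training data only.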
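To make the role of a false discovery rate level concrete, here is a sketch of the Benjamini‑Hochberg step‑up procedure applied to hypothetical per‑feature p‑values. It illustrates false discovery rate control in general and is not a copy of tsfresh's internals, which may apply a more conservative variant. The feature names and p‑values are made up.

```python
def benjamini_hochberg(p_values, fdr_level=0.05):
    """Return the names of features whose p-values pass the BH step-up test."""
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ranked)
    cutoff_rank = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        # Keep the largest rank k for which p_(k) <= (k / m) * fdr_level.
        if p <= rank / m * fdr_level:
            cutoff_rank = rank
    return [name for name, _ in ranked[:cutoff_rank]]

# Hypothetical p-values from per-feature relevance tests against the target.
p_values = {
    "Close__mean": 0.001,
    "Close__variance": 0.012,
    "Volume__mean": 0.028,
    "Close__kurtosis": 0.200,
    "Volume__skewness": 0.740,
}

selected = benjamini_hochberg(p_values, fdr_level=0.05)
print(selected)
```

With five features the thresholds are 0.01, 0.02, 0.03, 0.04, and 0.05 for ranks one to five, so the first three features pass and the last two are dropped. A plain `p < 0.05` cut would keep the same three here, but with hundreds of extracted features it would admit many more false positives than the rank‑scaled thresholds do.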

### 53.2.2 Featuretools for Relational Feature Engineering

Featuretools performs **Deep Feature Synthesis (DFS)**, which can generate features from relational data by applying primitives (like mean, max, trend) across relationships. For NEPSE, we might have multiple tables: stock prices, sector information, macroeconomic indicators. Featuretools can automatically combine them.

**Example: Generating features with Featuretools**

```python
import featuretools as ft

# Create an entityset
es = ft.EntitySet(id="nepse")

# Add the main dataframe; make_index=True creates the new "index" column as a unique row id
es = es.add_dataframe(
    dataframe_name="prices",
    dataframe=df,
    index="index",
    make_index=True,
    time_index="Date"
)

# Add a second entity, e.g., sector information
sector_df = pd.DataFrame({'Symbol': ['NABIL', 'NTC'], 'Sector': ['Bank', 'Telecom']})
es = es.add_dataframe(
    dataframe_name="sectors",
    dataframe=sector_df,
    index="Symbol"
)

# Define relationship
es = es.add_relationship("sectors", "Symbol", "prices", "Symbol")

# Run deep feature synthesis
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="prices",
    agg_primitives=["mean", "max", "min", "std", "trend"],
    trans_primitives=["diff", "percent_change"],
    max_depth=2
)

print(f"Generated {len(feature_defs)} features")
```

**Explanation:**  
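DFS creates features with names along the lines of `sectors.MEAN(prices.Close)` (an aggregation over all price rows linked to the same parent row) and `DIFF(Close)` (a transformation of a single column). Note that the `sectors` dataframe above is keyed by `Symbol`, so aggregations are computed per symbol; capturing interactions between different stocks in the same sector would require an additional parent dataframe keyed by `Sector`. Aggregations computed without cutoff times also use every row, including future ones, so they need care before being used for forecasting.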

---

## 53.3 Automated Model Selection

AutoML frameworks typically evaluate multiple algorithms on the given dataset and select the best one. They often include:

- Linear models (Ridge, Lasso, ElasticNet)
- Tree‑based models (Random Forest, Gradient Boosting)
- Support Vector Machines
- Neural networks
- k‑Nearest Neighbors

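Candidates are ranked by validation performance, measured with k‑fold cross‑validation, a single holdout set, or, for temporally ordered data such as NEPSE prices, time‑series cross‑validation in which every validation fold lies after its training data.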
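The fold layout of time‑series cross‑validation can be sketched without any library. The helper below, `expanding_window_splits`, is a hypothetical illustration in the spirit of scikit‑learn's `TimeSeriesSplit`, not its actual implementation.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) with each test fold after its training data."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("Not enough samples for the requested number of splits")
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train_idx = list(range(0, test_start))          # everything before the fold
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx

folds = list(expanding_window_splits(n_samples=8, n_splits=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
# train: [0, 1] test: [2, 3]
# train: [0, 1, 2, 3] test: [4, 5]
# train: [0, 1, 2, 3, 4, 5] test: [6, 7]
```

The indices are row positions, so this only respects time if the rows are sorted chronologically. A frame sorted by `Symbol` first and `Date` second would be split by symbol rather than by time, so for multi‑symbol data the folds should be defined on dates instead.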

**Example: Using TPOT for automated model selection**

TPOT (Tree‑based Pipeline Optimization Tool) uses genetic programming to search over feature preprocessors, models, and hyperparameters.

```python
from tpot import TPOTClassifier
from sklearn.model_selection import TimeSeriesSplit

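# Prepare data (assumes X_train, X_test, y_train, y_test already hold engineered
# features and targets, with rows in chronological order).
# This example is written against the classic TPOT API; check the documentation
# if you use a newer release, as arguments may differ.
# Note: For time series, we must use time‑aware cross‑validation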
tpot = TPOTClassifier(
    generations=5,
    population_size=20,
    cv=TimeSeriesSplit(n_splits=3),
    verbosity=2,
    random_state=42,
    n_jobs=-1
)
tpot.fit(X_train, y_train)

# Evaluate on test set
print(f"Test accuracy: {tpot.score(X_test, y_test):.4f}")

# Export the best pipeline
tpot.export('tpot_nepse_pipeline.py')
```

**Explanation:**  
TPOT evolves pipelines over generations, selecting operators like `StandardScaler`, `RandomForestClassifier`, `XGBClassifier`, etc. It outputs a Python file with the best pipeline found. For time‑series, we pass a `TimeSeriesSplit` cross‑validator to avoid leakage.

---

## 53.4 Hyperparameter Optimization

Hyperparameter optimization (HPO) is a core component of AutoML. Methods range from simple grid search to sophisticated Bayesian optimization.

### 53.4.1 Grid Search and Random Search

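Grid search exhaustively tries all combinations of specified hyperparameter values, so its cost grows multiplicatively with each added hyperparameter. Random search samples combinations at random under a fixed budget and often finds good configurations faster, because typically only a few hyperparameters strongly affect performance.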
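The difference in cost is easy to see by counting candidates. The sketch below builds both candidate sets for a small, hypothetical search space; no model is trained.

```python
import itertools
import random

search_space = {
    "n_estimators": [50, 100, 200, 400],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
}

# Grid search: every combination, so the cost is the product of the list lengths.
names = list(search_space)
grid = [dict(zip(names, values))
        for values in itertools.product(*search_space.values())]

# Random search: a fixed budget of independent draws, regardless of grid size.
rng = random.Random(42)
budget = 10
random_candidates = [
    {name: rng.choice(options) for name, options in search_space.items()}
    for _ in range(budget)
]

print(len(grid), "grid candidates vs", len(random_candidates), "random candidates")
# 48 grid candidates vs 10 random candidates
```

Adding a fourth hyperparameter with five values would raise the grid to 240 candidates, while the random budget stays at whatever you choose.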

```python
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint

param_dist = {
    'n_estimators': randint(50, 500),
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': randint(2, 11),
    'min_samples_leaf': randint(1, 5)
}

rf = RandomForestClassifier(random_state=42)
random_search = RandomizedSearchCV(
    rf, param_dist, n_iter=50,
    cv=TimeSeriesSplit(n_splits=3),
    scoring='accuracy',
    random_state=42,
    n_jobs=-1
)
random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV accuracy: {random_search.best_score_:.4f}")
```

### 53.4.2 Bayesian Optimization

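Bayesian optimization builds a probabilistic surrogate model (e.g., a Gaussian process) of the objective function and uses it to choose the most promising hyperparameters to evaluate next, balancing exploration of uncertain regions against exploitation of known good ones. It typically needs fewer evaluations than random search, which matters when each evaluation means training a model.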
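The surrogate‑then‑select loop can be illustrated on a one‑dimensional toy problem. This sketch assumes a zero‑mean Gaussian process with a fixed RBF kernel and an upper confidence bound acquisition rule. Production libraries use more careful kernels, acquisition functions, and numerics, and the `objective` here is a made‑up stand‑in for a validation score.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_grid, jitter=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP on x_grid."""
    k_oo = rbf_kernel(x_obs, x_obs) + jitter * np.eye(len(x_obs))
    k_go = rbf_kernel(x_grid, x_obs)
    mean = k_go @ np.linalg.solve(k_oo, y_obs)
    var = 1.0 - np.sum(k_go * np.linalg.solve(k_oo, k_go.T).T, axis=1)
    return mean, np.sqrt(np.maximum(var, 0.0))

def objective(x):
    """Toy stand-in for a validation score; its maximum is at x = 2."""
    return -(x - 2.0) ** 2

x_grid = np.linspace(0.0, 5.0, 201)   # candidate hyperparameter values
x_obs = np.array([0.0, 5.0])          # two initial evaluations
y_obs = objective(x_obs)

for _ in range(10):
    mean, std = gp_posterior(x_obs, y_obs, x_grid)
    ucb = mean + 2.0 * std            # favor high predictions and high uncertainty
    x_next = x_grid[np.argmax(ucb)]   # most promising point under the surrogate
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))

best_x = x_obs[np.argmax(y_obs)]
print(f"best x = {best_x:.3f} after {len(x_obs)} evaluations")
```

Each iteration spends one evaluation where the surrogate is either optimistic or uncertain, which is how the method concentrates its budget near the optimum instead of sampling uniformly.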

**Example with scikit‑optimize**

```python
from skopt import BayesSearchCV
from skopt.space import Integer, Real

search_spaces = {
    'n_estimators': Integer(50, 500),
    'max_depth': Integer(5, 50),
    'min_samples_split': Integer(2, 10),
    'min_samples_leaf': Integer(1, 4),
    'max_features': Real(0.1, 1.0)
}

bayes_search = BayesSearchCV(
    RandomForestClassifier(random_state=42),
    search_spaces,
    n_iter=30,
    cv=TimeSeriesSplit(n_splits=3),
    scoring='accuracy',
    random_state=42
)
bayes_search.fit(X_train, y_train)
print(f"Best parameters: {bayes_search.best_params_}")
```

### 53.4.3 Hyperparameter Optimization for Gradient Boosting

Libraries like XGBoost, LightGBM, and CatBoost have many hyperparameters. Tools like `hyperopt` or `optuna` can optimize them.

**Example with Optuna**

```python
import optuna
import xgboost as xgb
from sklearn.metrics import accuracy_score

# Assumes X_train/y_train and a later-in-time validation set X_val/y_val already exist.
def objective(trial):
    params = {
        'objective': 'binary:logistic',  # predict probabilities for the up/down target
        'eval_metric': 'logloss',        # metric monitored by early stopping
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-3, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10.0, log=True)
    }
    # xgb.train ignores n_estimators; the number of trees is capped by
    # num_boost_round and chosen by early stopping on the validation set
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dvalid = xgb.DMatrix(X_val, label=y_val)
    model = xgb.train(params, dtrain, num_boost_round=1000,
                      evals=[(dvalid, 'valid')], early_stopping_rounds=50,
                      verbose_eval=False)
    # Predict with the trees up to the best iteration found by early stopping
    probs = model.predict(dvalid, iteration_range=(0, model.best_iteration + 1))
    preds = (probs > 0.5).astype(int)
    return accuracy_score(y_val, preds)

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(study.best_params)
```

---

## 53.5 Neural Architecture Search (NAS)

For deep learning models, AutoML can also search over network architectures. This is called Neural Architecture Search. NAS methods are computationally expensive but can find architectures that outperform manually designed ones.

**Example: Using AutoKeras for NAS**

AutoKeras is an AutoML library for deep learning that performs architecture search.

```python
import autokeras as ak

# Initialize a classifier for tabular (structured) data
clf = ak.StructuredDataClassifier(
    max_trials=10,  # number of different architectures to try
    overwrite=True,
    seed=42
)

# Fit with time‑series split (AutoKeras uses its own validation split)
# For time series, we should ensure the split does not shuffle randomly.
# AutoKeras may not handle time series natively; we can pass a validation set manually.
val_split_idx = int(0.8 * len(X_train))
clf.fit(
    X_train[:val_split_idx], y_train[:val_split_idx],
    validation_data=(X_train[val_split_idx:], y_train[val_split_idx:]),
    epochs=50,
    batch_size=32
)

# Evaluate: evaluate() returns the loss followed by the metrics (accuracy here)
loss, accuracy = clf.evaluate(X_test, y_test)
print(f"Test accuracy: {accuracy:.4f}")
```

**Explanation:**  
AutoKeras searches over different neural network architectures (e.g., number of layers, units, activation functions) with a configurable search strategy (its tuner), of which Bayesian optimization is one option. It also tunes preprocessing and training hyperparameters. For time‑series, we must be careful to maintain temporal order; AutoKeras's built‑in validation split might shuffle, so we provide a fixed validation set taken from the end of the training period.

**Limitations:** NAS is resource‑intensive. For a small dataset like NEPSE, simpler models may suffice.

---

## 53.6 AutoML Frameworks

Several mature AutoML frameworks are available. We will highlight a few and show how to apply them to NEPSE.

### 53.6.1 AutoGluon (from Amazon)

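AutoGluon's tabular module automatically trains multiple models on tabular data and stacks them. Time‑series forecasting is handled by a separate module, `autogluon.timeseries`; here we use the tabular module on our engineered features.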

```python
from autogluon.tabular import TabularDataset, TabularPredictor

# Prepare data (assumes y_train and y_test are pandas Series named 'Target')
train_data = TabularDataset(pd.concat([X_train, y_train], axis=1))
test_data = TabularDataset(pd.concat([X_test, y_test], axis=1))

# Train predictor
predictor = TabularPredictor(label='Target', eval_metric='accuracy')
predictor.fit(
    train_data,
    time_limit=3600,  # seconds
    presets='medium_quality'  # can be 'best_quality' for better but slower
)

# Evaluate
performance = predictor.evaluate(test_data)
print(performance)

# Get leaderboard of models
leaderboard = predictor.leaderboard(test_data)
print(leaderboard)
```

**Explanation:**  
AutoGluon automatically preprocesses the data, trains multiple models (Random Forest, XGBoost, LightGBM, CatBoost, neural networks), and combines them into an ensemble. `TabularPredictor` treats each row as an independent sample and does not enforce temporal validation on its own, so the train/test split must respect time order, and a later‑in‑time validation set can be supplied through the `tuning_data` argument of `fit`.
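A time‑ordered split of this kind can be sketched with pandas. The helper name `chronological_split`, the 70/15/15 proportions, and the toy frame are assumptions. The point is that the cut‑offs are dates shared by all symbols, so no symbol's training rows come after another symbol's validation rows.

```python
import pandas as pd

def chronological_split(frame, date_col="Date", train_frac=0.7, val_frac=0.15):
    """Split rows into train/validation/test by date cut-offs shared across symbols."""
    dates = frame[date_col].drop_duplicates().sort_values().reset_index(drop=True)
    train_end = dates.iloc[round(len(dates) * train_frac) - 1]
    val_end = dates.iloc[round(len(dates) * (train_frac + val_frac)) - 1]
    train = frame[frame[date_col] <= train_end]
    val = frame[(frame[date_col] > train_end) & (frame[date_col] <= val_end)]
    test = frame[frame[date_col] > val_end]
    return train, val, test

# Toy panel: 2 symbols x 20 days, with a placeholder target column.
days = pd.date_range("2024-01-01", periods=20, freq="D")
panel = pd.DataFrame({
    "Symbol": ["NABIL"] * 20 + ["NTC"] * 20,
    "Date": list(days) * 2,
    "Target": [0, 1] * 20,
})

train, val, test = chronological_split(panel)
print(len(train), len(val), len(test))
# 28 6 6
```

The `val` part is the kind of frame that could be passed as a validation set, while `test` should stay untouched until the final evaluation.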

### 53.6.2 H2O AutoML

H2O's AutoML is a popular platform that trains and tunes many algorithms, including stacked ensembles.

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Convert pandas to H2O frames
train_h2o = h2o.H2OFrame(pd.concat([X_train, y_train], axis=1))
test_h2o = h2o.H2OFrame(pd.concat([X_test, y_test], axis=1))

# Specify target and features
x = train_h2o.columns[:-1]
y = train_h2o.columns[-1]
train_h2o[y] = train_h2o[y].asfactor()  # for classification

# Run AutoML
aml = H2OAutoML(max_models=20, seed=42, max_runtime_secs=300)
aml.train(x=x, y=y, training_frame=train_h2o)

# View leaderboard
lb = aml.leaderboard
print(lb.head())

# Predict on test
preds = aml.leader.predict(test_h2o)
```

**Explanation:**  
H2O AutoML trains a fixed set of algorithm families (GLM, Random Forest, GBM, XGBoost, deep learning) and builds two stacked ensembles, one from all models and one from the best model of each family. By default it ranks models by k‑fold cross‑validation with randomly assigned folds. For time‑series, either disable cross‑validation with `nfolds=0` and pass a later‑in‑time `validation_frame` to `train`, or assign time‑based folds through `fold_column`.

### 53.6.3 TPOT

We already introduced TPOT for model selection. It can also handle feature preprocessing.

```python
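from tpot import TPOTClassifier
from sklearn.model_selection import TimeSeriesSplit

# Use a time-aware splitter; plain k-fold (cv=3) would train on rows that come after the validation rows
tpot = TPOTClassifier(generations=5, population_size=20, cv=TimeSeriesSplit(n_splits=3),
                      random_state=42, n_jobs=-1)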
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_pipeline.py')
```

### 53.6.4 Custom Solutions

Sometimes you need to build your own AutoML pipeline to incorporate domain constraints. For example, you might fix the set of features (e.g., only use lags and technical indicators) and only search over models and hyperparameters. You can use `GridSearchCV`, `RandomizedSearchCV`, or `Optuna` to automate the search.

**Example: Custom pipeline with scikit‑learn and Optuna**

```python
import numpy as np
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def objective(trial):
    # Choose classifier
    classifier_name = trial.suggest_categorical('classifier', ['rf', 'svm', 'lr'])
    if classifier_name == 'rf':
        params = {
            'n_estimators': trial.suggest_int('rf_n_estimators', 50, 500),
            'max_depth': trial.suggest_int('rf_max_depth', 3, 20),
            'min_samples_split': trial.suggest_int('rf_min_samples_split', 2, 10)
        }
        model = RandomForestClassifier(**params, random_state=42)
    elif classifier_name == 'svm':
        params = {
            'C': trial.suggest_float('svm_C', 1e-2, 10, log=True),
            'gamma': trial.suggest_float('svm_gamma', 1e-3, 1, log=True)
        }
        model = SVC(**params, probability=True, random_state=42)
    else:
        params = {
            'C': trial.suggest_float('lr_C', 1e-2, 10, log=True)
        }
        model = LogisticRegression(**params, random_state=42)

    # Pipeline with scaling
    pipeline = Pipeline([('scaler', StandardScaler()), ('clf', model)])

    # Cross‑validation (time‑series)
    tscv = TimeSeriesSplit(n_splits=3)
    scores = []
    for train_idx, val_idx in tscv.split(X_train):
        X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
        y_tr, y_val = y_train.iloc[train_idx], y_train.iloc[val_idx]
        pipeline.fit(X_tr, y_tr)
        scores.append(accuracy_score(y_val, pipeline.predict(X_val)))
    return np.mean(scores)

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(study.best_params)
```

**Explanation:**  
This custom search selects both a model type and its hyperparameters, using time‑series cross‑validation. It demonstrates how to build a flexible AutoML pipeline tailored to your problem.

---

## 53.7 Limitations of AutoML

While AutoML is powerful, it is not a silver bullet. Some limitations include:

- **Computational cost**: AutoML can be expensive, especially with large datasets and complex search spaces.
- **Overfitting risk**: Without careful validation, AutoML can overfit to the validation set, especially if the search is too extensive.
- **Lack of domain knowledge**: AutoML may generate features that are nonsensical or miss important domain‑specific ones (e.g., circuit breaker flags in NEPSE).
- **Time‑series challenges**: Most AutoML frameworks assume i.i.d. data. Applying them to time‑series requires manual intervention to ensure temporal ordering and avoid leakage.
- **Interpretability**: The final model may be a complex ensemble that is hard to explain.
- **Reproducibility**: AutoML results can be sensitive to random seeds and search parameters.
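The overfitting risk in particular is easy to underestimate, so here is a small simulation of it. The setup is artificial: labels are coin flips and every "candidate model" is a stream of random guesses, so no candidate has any real skill. Selecting the best of many on a small validation set still produces an impressive‑looking score.

```python
import random

rng = random.Random(0)
n_candidates, n_val, n_test = 200, 100, 1000

# Labels are coin flips, so no candidate can truly beat 50% accuracy.
y_val = [rng.randint(0, 1) for _ in range(n_val)]
y_test = [rng.randint(0, 1) for _ in range(n_test)]

def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Score every candidate on the validation set and on an untouched test set.
candidates = []
for _ in range(n_candidates):
    val_preds = [rng.randint(0, 1) for _ in range(n_val)]
    test_preds = [rng.randint(0, 1) for _ in range(n_test)]
    candidates.append((accuracy(val_preds, y_val), accuracy(test_preds, y_test)))

# Pick the winner by validation accuracy only, as a search procedure would.
best_val_acc, test_acc_of_best = max(candidates, key=lambda c: c[0])
print(f"validation accuracy of winner: {best_val_acc:.2f}")
print(f"test accuracy of the same candidate: {test_acc_of_best:.2f}")
```

The winner's validation accuracy sits well above 50% purely because it is the maximum of many noisy scores, while its test accuracy falls back to chance. The same effect applies, more subtly, to real models selected by an extensive AutoML search, which is why a final holdout set matters.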

---

## 53.8 When to Use AutoML

AutoML is beneficial when:

- You need a baseline model quickly.
- You lack deep expertise in the domain or in machine learning.
- You have a large search space and want to explore it efficiently.
- You are building many models and want to automate routine work.

For the NEPSE system, AutoML could be used to:

- Quickly benchmark different modeling approaches.
- Automate model retraining as new data arrives.
- Explore feature engineering combinations automatically (with caution).

However, you should still incorporate domain knowledge (e.g., by restricting the search space or by manually engineering certain features) and always validate results with time‑series‑appropriate methods.

---

## 53.9 Customizing AutoML

AutoML frameworks often allow customization. For example:

- In AutoGluon, you can specify `hyperparameters` to tune and `excluded_model_types`.
- In H2O, you can set `include_algos` to restrict the algorithm set.
- In TPOT, you can customize the operators in the genetic programming search.
- You can also provide a fixed preprocessing pipeline and let AutoML tune only certain parts.

**Example: Restricting AutoGluon to tree‑based models**

```python
predictor = TabularPredictor(label='Target').fit(
    train_data,
    hyperparameters={
        'GBM': {},
        'RF': {},
        'XGB': {}
    }
)
```

**Example: Adding a custom feature to the AutoML loop**

If using a custom pipeline with Optuna, you can add a step that tries different feature sets (e.g., with or without technical indicators).
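One way to picture this is to treat each feature group as an on/off switch and let the search decide. The sketch below enumerates the switches with plain Python as a stand‑in for Optuna's `trial.suggest_categorical`. The feature group names, column names, and the `toy_score` function are hypothetical placeholders for a real time‑series cross‑validation score.

```python
import itertools

feature_groups = {
    "lags": ["close_lag_1", "close_lag_5"],
    "technical": ["rsi_14", "macd"],
    "volume": ["volume_ratio"],
}

def toy_score(columns):
    """Hypothetical stand-in for the CV score of a model trained on `columns`."""
    useful = {"close_lag_1": 0.03, "rsi_14": 0.02}   # made-up contributions
    penalty = 0.005 * len(columns)                   # mild cost per extra column
    return 0.50 + sum(useful.get(c, 0.0) for c in columns) - penalty

results = []
for switches in itertools.product([False, True], repeat=len(feature_groups)):
    if not any(switches):
        continue                                     # need at least one group
    chosen = dict(zip(feature_groups, switches))
    columns = [c for group, on in chosen.items() if on
               for c in feature_groups[group]]
    results.append((toy_score(columns), chosen))

best_score, best_switches = max(results, key=lambda r: r[0])
print(best_switches, round(best_score, 3))
```

In an Optuna objective, each switch would become one `trial.suggest_categorical(name, [True, False])` call, and `toy_score` would be replaced by fitting the pipeline on the selected columns and returning its validation score.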

---

## 53.10 Best Practices

1. **Start with a simple baseline** – Before running AutoML, have a simple model (e.g., logistic regression) to compare against.
2. **Use proper validation** – For time‑series, always use time‑based splits or walk‑forward validation.
3. **Limit search space** – Restrict the search to reasonable values and models to avoid wasting resources.
4. **Incorporate domain knowledge** – Manually create important features and let AutoML handle the rest.
5. **Monitor for overfitting** – Use a final holdout test set that was never used in the AutoML process.
6. **Reproducibility** – Set random seeds and log all experiments.
7. **Interpretability** – If explainability is required, consider using simpler models or post‑hoc explanation methods.
8. **Resource management** – Set time limits and use early stopping to control costs.

---

## Chapter Summary

In this chapter, we explored Automated Machine Learning and its application to the NEPSE stock prediction system. We covered:

- The components of AutoML: automated feature engineering, model selection, hyperparameter optimization, and neural architecture search.
- How to use `tsfresh` and `Featuretools` to automatically generate features from time‑series data.
- AutoML frameworks like TPOT, AutoGluon, and H2O AutoML, with code examples showing how to apply them to NEPSE.
- Hyperparameter optimization techniques, including grid search, random search, Bayesian optimization, and tools like Optuna.
- Neural Architecture Search and its limitations.
- The limitations of AutoML and when it is appropriate to use.
- Customizing AutoML to incorporate domain knowledge and best practices for successful application.

AutoML is a powerful ally in the machine learning practitioner's toolkit, especially for quickly establishing baselines and automating routine tasks. For the NEPSE system, AutoML can help explore a wide range of models and features, but must be used with care to respect the temporal nature of financial data. In the next chapter, we will discuss **Reinforcement Learning for Time‑Series**, exploring how agents can learn trading strategies directly.

---

**End of Chapter 53**

