# Chapter 64: Experiment Tracking

## Learning Objectives

By the end of this chapter, you will be able to:

- Understand why experiment tracking is essential for reproducible machine learning
- Identify the key components to track: parameters, metrics, artifacts, and environment
- Compare popular experiment tracking tools (MLflow, Weights & Biases, Neptune, Comet)
- Set up an experiment tracking server and integrate it with your training scripts
- Log hyperparameters, metrics, and artifacts for each run
- Organise experiments using tags, names, and hierarchical structures
- Query and compare runs to select the best model
- Integrate experiment tracking with a model registry for production handoff
- Apply these practices to the NEPSE prediction system to manage model development

---

## Introduction

In the previous chapters, we trained numerous models for the NEPSE prediction system: different algorithms, feature sets, and hyperparameters. Without a systematic way to record these experiments, it quickly becomes impossible to remember which combination produced the best results, let alone reproduce them. **Experiment tracking** solves this problem by providing a central place to log all relevant information about each training run.

Experiment tracking is a cornerstone of MLOps. It enables data scientists to:

- Keep a historical record of all experiments.
- Compare runs to identify the best model.
- Share results with colleagues.
- Reproduce any past model exactly.
- Automate model selection and deployment pipelines.

In this chapter, we will explore experiment tracking in depth. We will set up **MLflow** (a popular open‑source tool) for the NEPSE project and demonstrate how to log parameters, metrics, and artifacts. We will also discuss other tools and best practices for managing experiments at scale.

---

## 64.1 Why Track Experiments?

Imagine you are developing a model for NEPSE. You try different window sizes for moving averages, different lags, different classifiers, and various hyperparameters. After a week, you have a dozen notebooks, each with slightly different code and results. You find a model that performs well, but you cannot remember exactly which features it used or which random seed. This is a small‑scale instance of what is often called the **reproducibility crisis** in machine learning.

Experiment tracking addresses this by capturing:

- **Code version**: Git commit hash or notebook snapshot.
- **Environment**: Python packages and versions.
- **Data**: Version or checksum of the dataset used.
- **Parameters**: All hyperparameters and configuration options.
- **Metrics**: Training and validation scores at each epoch or at the end.
- **Artifacts**: Model files, plots, feature importance, etc.

With this information, you can:

- Recreate any past result.
- Compare multiple runs side‑by‑side.
- Share a link to a specific experiment with a colleague.
- Automate the selection of the best model for deployment.

For the NEPSE system, experiment tracking replaces manual note‑taking and gives model selection a recorded, auditable basis.
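
Of the items listed above, the data version is often the easiest to overlook. The following is a minimal sketch of fingerprinting a dataset file with a SHA‑256 checksum, using only the standard library. The function name `file_sha256` and the chunk size are illustrative choices, not part of any tracking tool.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (end of file)
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting string could be recorded with a tracking call such as `mlflow.log_param("data_sha256", file_sha256("nepse_features.csv"))`. Two runs that report the same checksum read byte‑identical data.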

---

## 64.2 What to Track

A comprehensive experiment log should include:

### 64.2.1 Parameters

All inputs that affect the model's behavior:

- Model type (e.g., `random_forest`, `xgboost`).
- Hyperparameters (e.g., `n_estimators`, `max_depth`, `learning_rate`).
- Feature engineering choices (e.g., `window_sizes`, `include_rsi`).
- Data split (e.g., `train_start_date`, `test_end_date`).
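
Tracking tools generally store parameters as flat key‑value pairs, whereas project configuration is often nested. The sketch below flattens a nested dictionary into dotted keys so that every setting is logged from one source of truth. The helper `flatten_params` and the configuration values are hypothetical.

```python
def flatten_params(config, parent_key="", sep="."):
    """Flatten a nested dict into a single-level dict with dotted keys."""
    flat = {}
    for key, value in config.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            # Recurse into sub-sections such as "model" or "features"
            flat.update(flatten_params(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical configuration covering the four kinds of parameters listed above
config = {
    "model": {"type": "random_forest", "n_estimators": 100, "max_depth": 10},
    "features": {"window_sizes": [5, 20], "include_rsi": True},
    "split": {"train_start_date": "2019-01-01", "test_end_date": "2024-06-30"},
}

params = flatten_params(config)
print(params["model.n_estimators"])  # 100
```

With MLflow, the flat dictionary can be passed in a single call to `mlflow.log_params(params)`; values that are not strings are stored in their string form.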

### 64.2.2 Metrics

Quantitative measures of model performance:

- Accuracy, precision, recall, F1 (for classification).
- Mean Absolute Error, Root Mean Squared Error (for regression).
- Training time, inference latency.
- Custom metrics like Sharpe ratio or profit from a trading simulation.
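
Custom metrics are ordinary numbers, so they can be computed with plain Python and logged like any other metric. As a sketch, an annualised Sharpe ratio could be computed as follows. The default of 252 periods per year is an assumption that should be adjusted to the market's trading calendar, and returning `0.0` for a constant return series is a design choice made here, not a convention.

```python
import math
import statistics


def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualised Sharpe ratio of per-period simple returns.

    risk_free_rate is expressed per period; periods_per_year=252 is an
    assumed trading-day count, not a NEPSE-specific figure.
    """
    if len(returns) < 2:
        raise ValueError("need at least two returns")
    excess = [r - risk_free_rate for r in returns]
    spread = statistics.stdev(excess)  # sample standard deviation
    if spread == 0:
        return 0.0  # no variation: report 0.0 rather than divide by zero
    return statistics.mean(excess) / spread * math.sqrt(periods_per_year)
```

Given a list of simulated daily returns (say, a hypothetical `daily_returns`), the value can be logged with `mlflow.log_metric("sharpe_ratio", sharpe_ratio(daily_returns))`.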

### 64.2.3 Artifacts

Files generated during the run:

- Serialized model (e.g., `.pkl` file).
- Feature importance plots.
- Confusion matrix.
- Prediction vs. actual plots.
- Training and validation loss curves.

### 64.2.4 Metadata

Contextual information:

- Git commit hash.
- Experiment name and description.
- Tags (e.g., `"baseline"`, `"feature_set_v2"`).
- Start and end time.
- Hostname or environment (e.g., `"aws_g4dn.xlarge"`).
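
Most of this metadata can be gathered automatically at the start of a run. The following sketch collects it into a flat dictionary of strings using only the standard library. The function names and the `label.` key prefix are illustrative assumptions.

```python
import platform
import socket
import subprocess
from datetime import datetime, timezone


def current_git_commit():
    """Return the current Git commit hash, or None if it cannot be determined."""
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], stderr=subprocess.DEVNULL
        )
    except (subprocess.CalledProcessError, OSError):
        return None  # not a Git checkout, or git is not installed
    return out.decode("ascii").strip()


def collect_run_metadata(description, tags):
    """Gather contextual information as a flat dict of strings."""
    metadata = {
        "description": description,
        "git_commit": current_git_commit() or "unknown",
        "hostname": socket.gethostname(),
        "python_version": platform.python_version(),
        "started_at": datetime.now(timezone.utc).isoformat(),
    }
    # Plain labels such as "baseline" become boolean-style entries
    metadata.update({f"label.{tag}": "true" for tag in tags})
    return metadata
```

Inside an active MLflow run, the dictionary can be attached with `mlflow.set_tags(collect_run_metadata("first baseline", ["baseline"]))`. MLflow records the start and end time of each run on its own, so the explicit timestamp mainly helps when the metadata is also written elsewhere.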

---

## 64.3 Experiment Tracking Tools

Several tools are available, ranging from simple home‑grown solutions to enterprise platforms.

### 64.3.1 MLflow

**MLflow** is an open‑source platform originally developed at Databricks. It provides four main components:

- **Tracking**: Log parameters, metrics, and artifacts.
- **Projects**: Package code for reproducible runs.
- **Models**: A standard packaging format for saving, loading, and deploying models.
- **Model Registry**: Central model store with versioning and stage transitions.

MLflow is lightweight, integrates with many libraries, and can log either to local files or to a shared, self‑hosted tracking server.

### 64.3.2 Weights & Biases (W&B)

**W&B** is a commercial platform with a generous free tier. It offers rich visualizations, collaboration features, and tight integration with deep learning frameworks. W&B is particularly popular in the research community.

### 64.3.3 Neptune

**Neptune** is another commercial platform with a strong focus on experiment tracking and model registry. It provides a clean UI and supports many integrations.

### 64.3.4 Comet

**Comet** offers experiment tracking, model monitoring, and a model registry. It has a free tier for individuals and academic users.

### 64.3.5 Comparison

| Tool    | Open Source              | Self‑Hosted               | Free Tier     | UI Richness | Integrations |
|---------|--------------------------|---------------------------|---------------|-------------|--------------|
| MLflow  | Yes                      | Yes                       | Yes           | Basic       | Many         |
| W&B     | No (client library only) | Yes (self‑managed option) | Yes           | Excellent   | Excellent    |
| Neptune | No                       | Yes (enterprise)          | Yes (limited) | Good        | Good         |
| Comet   | No                       | Yes (enterprise)          | Yes (limited) | Good        | Good         |

For the NEPSE project, we will use MLflow because it is open source, easy to set up, and covers our needs. You can later migrate to a commercial tool if required.

---

## 64.4 Setting Up MLflow

MLflow can be used locally or with a tracking server. For a single user, local file storage is sufficient. For a team, you should set up a tracking server with a database backend.

### 64.4.1 Installation

```bash
pip install mlflow
```

### 64.4.2 Local Usage

By default, MLflow logs to a local `mlruns` directory. You can start the UI with:

```bash
mlflow ui
```

Then open `http://localhost:5000` to view experiments.

### 64.4.3 Tracking Server (Optional)

For team use, set up a tracking server with a PostgreSQL backend and an artifact store (e.g., S3). This is more involved; refer to the MLflow documentation.
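
Once a server is running, each training script has to be told where it is. A minimal sketch of the client side follows, assuming the server is reachable over HTTP. MLflow reads the `MLFLOW_TRACKING_URI` environment variable and exposes `mlflow.set_tracking_uri`; the helper names, the default address, and the scheme check are assumptions made for illustration.

```python
import os
from urllib.parse import urlparse

DEFAULT_URI = "http://localhost:5000"  # assumed address of a locally started server


def resolve_tracking_uri(env=os.environ, default=DEFAULT_URI):
    """Pick the tracking URI from MLFLOW_TRACKING_URI, else fall back to a default."""
    uri = env.get("MLFLOW_TRACKING_URI", default)
    scheme = urlparse(uri).scheme
    if scheme not in {"http", "https"}:
        # Fail early instead of silently logging to a local directory
        raise ValueError(f"expected an http(s) tracking server URI, got {uri!r}")
    return uri


def connect(experiment_name):
    """Point the MLflow client at the server and select an experiment."""
    import mlflow  # imported here so the helper above works without MLflow installed

    mlflow.set_tracking_uri(resolve_tracking_uri())
    mlflow.set_experiment(experiment_name)
```

Calling `connect("NEPSE_RandomForest")` at the top of a training script keeps the server address out of the code, so the same script can log to a laptop, a staging server, or a shared team server depending on the environment.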

---

## 64.5 Integrating MLflow with NEPSE Training

Let's modify our training script to log everything to MLflow.

### 64.5.1 Basic Logging

```python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

# Load data (as before)
df = pd.read_csv('nepse_features.csv')
X = df.drop(columns=['target', 'date', 'symbol'])
y = df['target']

# Split (simplified, use time‑based split in practice)
split_idx = int(0.8 * len(X))
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]

# Start an MLflow run
with mlflow.start_run(run_name="random_forest_baseline"):
    # Log parameters
    mlflow.log_param("model_type", "RandomForest")
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    mlflow.log_param("train_size", len(X_train))
    mlflow.log_param("test_size", len(X_test))
    
    # Train model
    model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
    model.fit(X_train, y_train)
    
    # Evaluate
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)
    train_acc = accuracy_score(y_train, train_pred)
    test_acc = accuracy_score(y_test, test_pred)
    
    # Log metrics
    mlflow.log_metric("train_accuracy", train_acc)
    mlflow.log_metric("test_accuracy", test_acc)
    
    # Log model
    mlflow.sklearn.log_model(model, "model")
    
    # (Optional) Log feature importance plot
    import matplotlib.pyplot as plt
    importances = model.feature_importances_
    plt.figure(figsize=(10,6))
    plt.barh(X.columns, importances)
    plt.xlabel("Importance")
    plt.title("Feature Importance")
    plt.savefig("feature_importance.png")
    mlflow.log_artifact("feature_importance.png")
```

**Explanation:**  
We wrap the training code inside an MLflow run. `log_param` records a key‑value pair. `log_metric` records a numeric value; logging the same key more than once builds a history for that metric. `log_model` saves the model in MLflow's format, which stores a description of the model's dependencies alongside the serialized file. `log_artifact` copies any local file to the artifact store.

### 64.5.2 Logging Hyperparameters Dynamically

Often we want to try many hyperparameter combinations. We can use a loop or integrate with a search tool.

```python
import itertools

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, None]
}

for n_est, depth in itertools.product(param_grid['n_estimators'], param_grid['max_depth']):
    with mlflow.start_run(run_name=f"rf_n{n_est}_d{depth}"):
        mlflow.log_param("n_estimators", n_est)
        mlflow.log_param("max_depth", depth)
        # ... train and evaluate ...
```
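
The two‑parameter loop above can be generalised to a grid of any size. The sketch below expands a grid into one dictionary per combination and derives a run name from it. Both helpers, `expand_grid` and `run_name_for`, are hypothetical names introduced for this example.

```python
import itertools


def expand_grid(param_grid):
    """Yield one dict per combination of values in param_grid."""
    keys = list(param_grid)
    for values in itertools.product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, values))


def run_name_for(prefix, params):
    """Build a readable run name from a prefix and the parameter values."""
    parts = [f"{key}={value}" for key, value in params.items()]
    return "_".join([prefix, *parts])


param_grid = {"n_estimators": [50, 100, 200], "max_depth": [5, 10, None]}
combinations = list(expand_grid(param_grid))
print(len(combinations))                    # 9
print(run_name_for("rf", combinations[0]))  # rf_n_estimators=50_max_depth=5
```

Each dictionary can then be used both to start the run, `mlflow.start_run(run_name=run_name_for("rf", params))`, and to log it with `mlflow.log_params(params)`, so the logged values cannot drift from the values actually used for training.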

### 64.5.3 Logging Metrics Per Epoch

For iterative models (like neural networks), you can log metrics at each epoch.

```python
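# `epochs`, `train_one_epoch()` and `validate()` are placeholders for your own training loop
with mlflow.start_run():
    for epoch in range(epochs):
        loss = train_one_epoch()
        val_acc = validate()
        # `step` places each value on the x-axis of the metric chart
        mlflow.log_metric("loss", loss, step=epoch)
        mlflow.log_metric("val_accuracy", val_acc, step=epoch)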

This creates a time series of metrics that can be plotted in the MLflow UI.
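
To make the pattern concrete without a deep learning framework, here is a self‑contained sketch that fits a one‑parameter line by gradient descent and reports the loss after every epoch. The logger is passed in as an argument and follows the `(key, value, step=...)` calling convention of `mlflow.log_metric`. The toy model, the function `fit_line`, and the in‑memory `record` logger are all illustrative stand‑ins.

```python
def fit_line(xs, ys, epochs, learning_rate, log_metric):
    """Fit y = w * x by gradient descent, reporting the loss after every epoch.

    log_metric is any callable accepting (key, value, step=...).
    """
    w = 0.0
    n = len(xs)
    for epoch in range(epochs):
        # Gradient of the mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
        loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
        log_metric("loss", loss, step=epoch)
    return w


# Stand-in logger that keeps the values in memory instead of sending them to MLflow
history = []


def record(key, value, step=None):
    history.append((step, key, value))


w = fit_line([1.0, 2.0, 3.0], [2.0, 4.0, 6.0],
             epochs=20, learning_rate=0.05, log_metric=record)
```

Inside a `with mlflow.start_run():` block, passing `log_metric=mlflow.log_metric` instead of `record` sends the same per‑epoch values to the tracking store.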

---

## 64.6 Organising Experiments

As the number of runs grows, organisation becomes crucial. MLflow provides several mechanisms:

### 64.6.1 Experiment Names

You can create separate experiments for different projects or phases.

```python
mlflow.set_experiment("NEPSE_RandomForest")
```

`set_experiment` creates the experiment if it does not exist yet and makes it the active one. All runs started afterwards are grouped under it in the UI.

### 64.6.2 Tags

Add tags to runs for filtering and searching.

```python
mlflow.set_tag("data_version", "v2024_01")
mlflow.set_tag("feature_set", "basic_plus_technical")
mlflow.set_tag("model_family", "xgboost")
```

### 64.6.3 Nested Runs

For complex pipelines (e.g., hyperparameter tuning), you can create nested runs. The parent run represents the overall search, and children represent individual trials.

```python
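import mlflow

# Illustrative trials; in practice these come from your search strategy
param_combinations = [{"n_estimators": 50, "max_depth": 5},
                      {"n_estimators": 100, "max_depth": 10}]

with mlflow.start_run(run_name="HPO_RandomForest") as parent_run:
    for i, params in enumerate(param_combinations):
        with mlflow.start_run(run_name=f"trial_{i}", nested=True):
            mlflow.log_params(params)
            # ... train ...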

---

## 64.7 Model Registry Integration

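MLflow includes a **Model Registry** to manage model versions and stages (Staging, Production, Archived). Newer releases (2.9 onwards) deprecate stages in favour of aliases and tags, so the stage calls shown in this section may emit deprecation warnings. After a successful run, you can register the model it logged.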

```python
# `run` is the object returned by `with mlflow.start_run() as run:`
mlflow.register_model(f"runs:/{run.info.run_id}/model", "NEPSE_Predictor")
```

This adds the model to the registry. You can then transition it to "Staging" or "Production" via the UI or API.

**Example: Promoting a model programmatically**

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
    name="NEPSE_Predictor",
    version=3,
    stage="Production"
)
```

Now your deployment pipeline can fetch whichever version is in the "Production" stage, for example through the model URI `models:/NEPSE_Predictor/Production`.
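
On MLflow releases that support model aliases (2.3 and later, to the best of our knowledge), the same promotion can be expressed by moving an alias instead of changing a stage. The sketch below assumes a registered model named `NEPSE_Predictor` as above. The alias name `champion` and the helper functions are illustrative choices.

```python
def alias_uri(model_name, alias):
    """Build the registry URI that resolves a model alias."""
    return f"models:/{model_name}@{alias}"


def promote(model_name, version, alias="champion"):
    """Point an alias at a specific registered version and return its URI."""
    from mlflow.tracking import MlflowClient  # imported lazily; needs alias support

    client = MlflowClient()
    client.set_registered_model_alias(model_name, alias, str(version))
    return alias_uri(model_name, alias)


def load_promoted(model_name, alias="champion"):
    """Load whichever version the alias currently points to."""
    import mlflow.pyfunc

    return mlflow.pyfunc.load_model(alias_uri(model_name, alias))


print(alias_uri("NEPSE_Predictor", "champion"))  # models:/NEPSE_Predictor@champion
```

A deployment job that calls `load_promoted("NEPSE_Predictor")` does not need to know version numbers: re‑running `promote` with a different version changes what it loads the next time it starts.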

---

## 64.8 Querying and Comparing Runs

MLflow provides a Python API to query runs and compare them.

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()
experiment = client.get_experiment_by_name("NEPSE_RandomForest")
runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.test_accuracy DESC"],
)

best_run = runs[0]
print(f"Best run ID: {best_run.info.run_id}")
print(f"Test accuracy: {best_run.data.metrics['test_accuracy']}")
```

You can also load the model from the best run:

```python
model_uri = f"runs:/{best_run.info.run_id}/model"
model = mlflow.sklearn.load_model(model_uri)
```

This is useful for automated model selection pipelines.
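
Searches can also be narrowed with a filter expression over metrics and tags. The following sketch composes such an expression and passes it to `mlflow.search_runs`, which returns a pandas DataFrame. The helper names and the threshold are hypothetical, and the quoting is deliberately simple: tag values that contain single quotes are not handled.

```python
def build_filter(min_test_accuracy, tags):
    """Compose an MLflow search filter from a metric threshold and exact-match tags."""
    clauses = [f"metrics.test_accuracy >= {min_test_accuracy}"]
    for key, value in sorted(tags.items()):
        clauses.append(f"tags.{key} = '{value}'")
    return " and ".join(clauses)


def top_runs(experiment_name, filter_string, limit=5):
    """Return the best matching runs as a DataFrame, highest test accuracy first."""
    import mlflow  # imported lazily so build_filter can be used on its own

    return mlflow.search_runs(
        experiment_names=[experiment_name],
        filter_string=filter_string,
        order_by=["metrics.test_accuracy DESC"],
        max_results=limit,
    )


query = build_filter(0.55, {"feature_set": "basic_plus_technical",
                            "data_version": "v2024_01"})
print(query)
```

For the tags set in Section 64.6.2, `top_runs("NEPSE_RandomForest", query)` would list at most five runs that used that feature set and data version and reached the accuracy threshold.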

---

## 64.9 Best Practices for Experiment Tracking

1. **Log everything**: Even small changes can affect results. Log all parameters, including data paths and random seeds.
2. **Use consistent naming**: Adopt a convention for run names (e.g., `model_featureSet_timestamp`).
3. **Tag runs with meaningful labels**: `baseline`, `feature_set_v2`, `production_candidate`.
4. **Log code version**: Automatically capture the Git commit hash.
   ```python
   import subprocess
   commit_hash = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode("ascii").strip()
   mlflow.log_param("git_commit", commit_hash)
   ```
5. **Log environment**: Use `mlflow.log_artifact("requirements.txt")` or log the output of `pip freeze`.
6. **Log artifacts liberally**: Save plots, confusion matrices, and sample predictions for later inspection.
7. **Use nested runs for structured workflows**: HPO, cross‑validation.
8. **Regularly clean up old runs**: Archive or delete runs you no longer need. In MLflow, deleting a run only marks it as deleted; the `mlflow gc` command removes such runs permanently.
9. **Integrate with your CI/CD**: Automatically log runs from your training pipelines.
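
For practice 5, the environment can be captured from inside the running interpreter instead of relying on a `requirements.txt` that may be out of date. The sketch below lists installed distributions with the standard library's `importlib.metadata`; the function names and the output layout are illustrative assumptions.

```python
import sys
from importlib import metadata


def environment_snapshot():
    """Return 'name==version' lines for installed distributions, sorted by name."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


def snapshot_text():
    """Format the interpreter version and package list as one text document."""
    header = f"# python {sys.version.split()[0]}"
    return "\n".join([header, *environment_snapshot()]) + "\n"
```

Inside an active run, `mlflow.log_text(snapshot_text(), "environment.txt")` stores the snapshot as an artifact next to the model.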

---

## 64.10 Complete Example: NEPSE with MLflow

Let's put it all together in a script that trains an XGBoost model with different hyperparameters and logs everything.

```python
# train_nepse_with_mlflow.py
import mlflow
import mlflow.xgboost
import pandas as pd
from sklearn.metrics import accuracy_score
import xgboost as xgb
import argparse
import subprocess

def load_data():
    df = pd.read_csv('nepse_features.csv')
    # Assume we have a 'target' column (0/1) and a 'date' column for splitting
    df['date'] = pd.to_datetime(df['date'])
    df = df.sort_values('date')
    split_date = '2024-01-01'
    train = df[df['date'] < split_date]
    test = df[df['date'] >= split_date]
    
    feature_cols = [c for c in df.columns if c not in ['target', 'date', 'symbol']]
    X_train = train[feature_cols]
    y_train = train['target']
    X_test = test[feature_cols]
    y_test = test['target']
    return X_train, y_train, X_test, y_test, feature_cols

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--learning_rate', type=float, default=0.1)
    parser.add_argument('--max_depth', type=int, default=6)
    parser.add_argument('--n_estimators', type=int, default=100)
    parser.add_argument('--subsample', type=float, default=0.8)
    args = parser.parse_args()
    
    # Start MLflow run
    mlflow.set_experiment("NEPSE_XGBoost")
    with mlflow.start_run():
        # Log parameters
        mlflow.log_params(vars(args))
        
        # Log git commit
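        try:
            commit = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
            mlflow.log_param("git_commit", commit)
        except (subprocess.CalledProcessError, FileNotFoundError):
            # Not a Git checkout, or git is not installed: continue without the hash
            pass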
        
        # Load data
        X_train, y_train, X_test, y_test, feature_cols = load_data()
        mlflow.log_param("n_features", len(feature_cols))
        mlflow.log_param("train_size", len(X_train))
        mlflow.log_param("test_size", len(X_test))
        
        # Train model
        model = xgb.XGBClassifier(
            learning_rate=args.learning_rate,
            max_depth=args.max_depth,
            n_estimators=args.n_estimators,
            subsample=args.subsample,
            random_state=42,
            eval_metric='logloss'
        )
        model.fit(X_train, y_train)
        
        # Evaluate
        train_pred = model.predict(X_train)
        test_pred = model.predict(X_test)
        train_acc = accuracy_score(y_train, train_pred)
        test_acc = accuracy_score(y_test, test_pred)
        
        mlflow.log_metric("train_accuracy", train_acc)
        mlflow.log_metric("test_accuracy", test_acc)
        
        # Log feature importance
        importance = model.feature_importances_
        importance_dict = dict(zip(feature_cols, importance.tolist()))
        mlflow.log_dict(importance_dict, "feature_importance.json")
        
        # Log model
        mlflow.xgboost.log_model(model, "model")
        
        # (Optional) Log a feature importance plot
        import matplotlib.pyplot as plt
        plt.figure(figsize=(10,6))
        plt.barh(feature_cols, importance)
        plt.xlabel("Importance")
        plt.title("Feature Importance")
        plt.tight_layout()
        plt.savefig("importance.png")
        plt.close()  # release the figure once it has been written to disk
        mlflow.log_artifact("importance.png")
        
        print(f"Run ID: {mlflow.active_run().info.run_id}")
        print(f"Test accuracy: {test_acc:.4f}")

if __name__ == "__main__":
    main()
```

**Explanation:**  
This script accepts hyperparameters as command‑line arguments, logs them, loads the data, trains an XGBoost model, and logs metrics and artifacts. The `log_dict` method saves a JSON file with feature importance. The script can be run multiple times with different arguments, and all runs will be recorded in MLflow.

To run a set of experiments, we can use a shell loop:

```bash
for lr in 0.01 0.05 0.1; do
    for depth in 4 6 8; do
        python train_nepse_with_mlflow.py --learning_rate $lr --max_depth $depth
    done
done
```

Afterwards, we open the MLflow UI to compare the runs and select the best model.

---

## Chapter Summary

In this chapter, we explored the critical practice of experiment tracking. We covered:

- The reasons for tracking experiments: reproducibility, comparison, and collaboration.
- The key elements to log: parameters, metrics, artifacts, and metadata.
- Popular experiment tracking tools, with a focus on MLflow.
- How to integrate MLflow into the NEPSE training pipeline.
- Organising experiments with experiments, tags, and nested runs.
- Using the MLflow Model Registry to manage model versions.
- Querying and comparing runs programmatically.
- Best practices to ensure effective experiment tracking.

By adopting experiment tracking, the NEPSE prediction system gains a solid foundation for model development. Every experiment is recorded, every result is reproducible, and the path to production is clear. In the next chapter, we will discuss **Data and Model Lineage**, which extends tracking to capture the relationships between data, features, models, and predictions.

---

**End of Chapter 64**

