# Tutorial: Working with DataHandler and Aggregator

This notebook demonstrates how to use the refactored **DataHandler** and **Aggregator** classes.

## Key Steps
1. **Set Up**: Import the necessary modules and classes.
2. **Initialize a DataHandler**: Choose a root directory and create the folder structure.
3. **Create Iterations**: For each iteration, register simulation outputs (identifiers) and save data.
4. **Aggregation**: Use the `Aggregator` (invoked via `DataHandler.aggregate_and_save()`) to combine results.

---
**Important**: Make sure `data_handler.py`, `aggregator.py`, `metadata.py`, and their dependencies are importable from your environment. The import below assumes `data_handler.py` is available as `pysubmit.workflow.data_handler`.

## 1. Imports and Setup

In [3]:
from pathlib import Path

import pandas as pd

from pysubmit.workflow.data_handler import DataHandler  # <-- Adjust if needed

# For demonstration, use a root directory relative to the current working directory.
root_dir = Path("./demo_data_handler")

## 2. Create and Configure `DataHandler`
We'll instantiate the `DataHandler` with our chosen root directory and some grouping configuration for aggregation later.

In [4]:
# Define a grouping configuration.
# For demonstration, suppose we have two categories of results:
# 1) 'metrics' that merges output from 'accuracy' and 'loss'.
# 2) 'hyperparams' that merges data from 'params'.

group_config = {
    "metrics": ["accuracy", "loss"],
    "hyperparams": ["params"]
}

# Initialize the DataHandler with the root directory and grouping config.
data_handler = DataHandler(
    root_directory=root_dir,
    grouping_config=group_config
)

# Create the folder structure
data_handler.create_folders()

print(f"DataHandler initialized.\nRoot: {data_handler.root_directory}\n")

DataHandler initialized.
Root: demo_data_handler



## 3. Create Iterations and Register Simulation Outputs
Each iteration folder (e.g., `iteration_0`, `iteration_1`) contains a `metadata.json` that tracks which outputs have been created and saved.

In [5]:
# 3.1: Create a new iteration
iteration_folder_1 = data_handler.create_iteration()
print("Created iteration folder:", iteration_folder_1)

# 3.2: Register identifiers for this iteration
id_accuracy = data_handler.register_identifier("accuracy")
id_loss = data_handler.register_identifier("loss")
id_params = data_handler.register_identifier("params")

print(f"Registered 'accuracy' --> {id_accuracy}")
print(f"Registered 'loss'     --> {id_loss}")
print(f"Registered 'params'   --> {id_params}")

Created iteration folder: demo_data_handler\results\iterations\iteration_0
Registered 'accuracy' --> accuracy.json
Registered 'loss'     --> loss.json
Registered 'params'   --> params.json
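
To see what an iteration folder holds at any point, the folder can be inspected with the standard library alone. The helper below is a hypothetical sketch: `describe_iteration` is not part of `DataHandler`, and it assumes only that the folder contains plain files, possibly including the `metadata.json` mentioned above. It makes no assumption about that file's schema.

```python
import json
from pathlib import Path


def describe_iteration(iteration_folder) -> dict:
    """Return the file names in an iteration folder and its parsed metadata, if any.

    Hypothetical helper for illustration; not part of the DataHandler API.
    """
    folder = Path(iteration_folder)
    # List regular files only, sorted for a stable result.
    files = sorted(p.name for p in folder.iterdir() if p.is_file())

    # metadata.json is loaded as-is; its structure is not assumed here.
    metadata_path = folder / "metadata.json"
    metadata = None
    if metadata_path.exists():
        with metadata_path.open(encoding="utf-8") as fh:
            metadata = json.load(fh)

    return {"files": files, "metadata": metadata}
```

Calling `describe_iteration(iteration_folder_1)` after the cell above would list whatever files the handler has written so far.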


## 4. Add JSON Data to Iteration
Once an identifier is registered, we can add (save) the corresponding data to the iteration folder.

In [6]:
# 4.1: Prepare some dummy JSON-serializable data
accuracy_data = {"epoch": 1, "accuracy": 0.85}
loss_data = {"epoch": 1, "loss": 0.45}
params_data = {"learning_rate": 0.001, "batch_size": 32}

# 4.2: Add them to the iteration
data_handler.add_data_to_iteration("accuracy", accuracy_data)
data_handler.add_data_to_iteration("loss", loss_data)
data_handler.add_data_to_iteration("params", params_data)

print("Data saved and metadata updated for iteration_0.")

Data saved and metadata updated for iteration_0.
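
Because the data is stored as JSON, it can help to confirm that a payload is serializable before handing it over. The check below is a minimal sketch using only the standard `json` module; `is_json_serializable` is a hypothetical helper, and how `DataHandler` itself reacts to unserializable data is not covered here.

```python
import json
from pathlib import Path
from typing import Any


def is_json_serializable(payload: Any) -> bool:
    """Return True if json.dumps can encode the payload with default settings."""
    try:
        json.dumps(payload)
    except (TypeError, ValueError):
        # TypeError: unsupported type; ValueError: e.g. circular references.
        return False
    return True


print(is_json_serializable({"epoch": 1, "accuracy": 0.85}))  # True
print(is_json_serializable({"path": Path("model.pt")}))      # False
```

Values such as `Path` objects or sets need to be converted (for example to strings or lists) before they can be saved this way.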


### (Optional) Create Another Iteration
For demonstration, let's add a second iteration with slightly different data.

In [None]:
# Create a second iteration
iteration_folder_2 = data_handler.create_iteration()
print("Created iteration folder:", iteration_folder_2)

# Register the same identifiers
data_handler.register_identifier("accuracy")
data_handler.register_identifier("loss")
data_handler.register_identifier("params")

# Save some different data.
data_handler.add_data_to_iteration("accuracy", {"epoch": 2, "accuracy": 0.90})
data_handler.add_data_to_iteration("loss", {"epoch": 2, "loss": 0.40})
data_handler.add_data_to_iteration("params", {"learning_rate": 0.0005, "batch_size": 64})

print("Data saved and metadata updated for iteration_1.")


## 5. Aggregate and Save
Finally, we call `data_handler.aggregate_and_save()`. This method uses the `Aggregator` internally to scan all iterations, flatten the JSON outputs, and save the results as CSV files under the `aggregations` folder.

In [None]:
# By default, this uses the 'analysis' adapter
# If you want a custom adapter, provide a TypeAdapter instance.

data_handler.aggregate_and_save()
print("Aggregation completed. CSV files are in:", data_handler.aggregations_directory)
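
To picture what "flattening" a JSON output means, the sketch below uses `pandas.json_normalize` on a nested record. This is only an illustration of the idea: the source does not state how the `Aggregator` flattens data internally, and the nested `optimizer` record is made up for the example.

```python
import pandas as pd

# Hypothetical nested output; not produced by the cells above.
nested_record = {
    "epoch": 1,
    "optimizer": {"name": "adam", "learning_rate": 0.001},
}

# json_normalize turns nested keys into dotted column names,
# yielding a one-row DataFrame for a single record.
flat = pd.json_normalize(nested_record, sep=".")

print(sorted(flat.columns))
print(flat.loc[0, "optimizer.name"])
```

Each nested key becomes its own column (here `optimizer.name` and `optimizer.learning_rate`), which is the shape a CSV file needs.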

### Check the Aggregation Results
After aggregation, the `aggregations` folder should contain one CSV file per group defined in `group_config`.

- `metrics.csv` should contain merged data from `accuracy` and `loss`.
- `hyperparams.csv` should contain data from `params`.
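
One way to picture the grouping is to merge, for each iteration, the outputs of all identifiers in a group into a single row. The sketch below does this in plain pandas with the data from the earlier cells. It is an assumption about the general idea, not the `Aggregator`'s actual implementation: `merge_group` and the `iteration` column are hypothetical, and the real CSV layout may differ.

```python
import pandas as pd

group_config = {
    "metrics": ["accuracy", "loss"],
    "hyperparams": ["params"],
}

# Outputs per iteration, keyed by identifier (values from the cells above).
iterations = [
    {
        "accuracy": {"epoch": 1, "accuracy": 0.85},
        "loss": {"epoch": 1, "loss": 0.45},
        "params": {"learning_rate": 0.001, "batch_size": 32},
    },
    {
        "accuracy": {"epoch": 2, "accuracy": 0.90},
        "loss": {"epoch": 2, "loss": 0.40},
        "params": {"learning_rate": 0.0005, "batch_size": 64},
    },
]


def merge_group(iterations, identifiers):
    """Build one row per iteration from the outputs of the given identifiers."""
    rows = []
    for index, outputs in enumerate(iterations):
        row = {"iteration": index}  # hypothetical bookkeeping column
        for identifier in identifiers:
            # Later identifiers overwrite shared keys such as "epoch".
            row.update(outputs.get(identifier, {}))
        rows.append(row)
    return pd.DataFrame(rows)


tables = {group: merge_group(iterations, ids) for group, ids in group_config.items()}
print(tables["metrics"])
```

Note that shared keys (here `epoch`) collapse into one column in this sketch; a real aggregator might instead prefix columns with the identifier name to avoid collisions.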

In [None]:
# Let's look at the metrics.csv
metrics_csv = data_handler.aggregations_directory / "metrics.csv"
metrics_df = pd.read_csv(metrics_csv)
metrics_df

In [None]:
# And the hyperparams.csv
hyperparams_csv = data_handler.aggregations_directory / "hyperparams.csv"
hyperparams_df = pd.read_csv(hyperparams_csv)
hyperparams_df

## 6. Conclusion
We have:
1. Created a **DataHandler** pointing to our `demo_data_handler` root.
2. Generated two iterations of data, each with three identifiers (`accuracy`, `loss`, `params`).
3. Saved the data in JSON form under each iteration folder, updating `metadata.json`.
4. Aggregated the results according to `group_config`, producing CSV files for each group.

Feel free to explore the folders in your file browser to see how everything is structured.
You can also experiment with custom TypeAdapters, advanced flattening, or multi-layer dictionaries.

Thank you for using the **DataHandler** and **Aggregator** tutorial!