# Data and Model Example
This notebook explains, with an example, how to define the data that goes into your model for training, evaluation, and so on, and then how to define the model itself.

Because this is a notebook, in VS Code or another IDE you can hover over any class instance or reference to see more details, or jump to its code definition for further exploration.

## Defining Model Inputs and Outputs
We'll start with `ModelDataSpec`, the top-level object that specifies your model's inputs, outputs, and related data.

A `ModelDataSpec` is defined by three dictionaries: features, targets, and extras. Each maps strings to `DataSpec` instances.
More details on `DataSpec` are provided in `docs/spec_system_explainer.md`. The features are the inputs to the model and the targets are its outputs. Extras are optional and are not used during model training at all; they can be used for evaluation and for pairing features and targets with other data not related to the model itself. For example, consider adding latitude and longitude specifications to evaluate model performance at specific locations or to filter out specific locations.
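
To make the role of extras concrete, here is a minimal, library-independent sketch of evaluating only the samples that fall inside a latitude/longitude box. It does not use `hympi_ml`: the `Sample` class, the `in_box` helper, and every value below are hypothetical and only illustrate why it helps to carry location data alongside features and targets.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical evaluation record: a prediction plus location extras."""

    predicted: float  # model output, e.g. temperature in K
    actual: float  # target value
    latitude: float  # extra: never seen by the model
    longitude: float  # extra: never seen by the model


def in_box(
    s: Sample,
    lat_range: tuple[float, float],
    lon_range: tuple[float, float],
) -> bool:
    """Return True if the sample lies inside the latitude/longitude box."""
    return (
        lat_range[0] <= s.latitude <= lat_range[1]
        and lon_range[0] <= s.longitude <= lon_range[1]
    )


def mean_abs_error(samples: list[Sample]) -> float:
    """Mean absolute error over a non-empty list of samples."""
    return sum(abs(s.predicted - s.actual) for s in samples) / len(samples)


# Made-up values, for illustration only.
samples = [
    Sample(predicted=280.0, actual=282.0, latitude=10.0, longitude=-60.0),
    Sample(predicted=250.0, actual=251.0, latitude=70.0, longitude=20.0),
    Sample(predicted=290.0, actual=287.0, latitude=-5.0, longitude=-55.0),
]

# Keep only samples between 23.5 degrees south and 23.5 degrees north.
tropics = [s for s in samples if in_box(s, (-23.5, 23.5), (-180.0, 180.0))]
tropics_mae = mean_abs_error(tropics)
```

Because the location values ride along with each sample without being fed to the model, the same filter can be applied after training without touching the model inputs.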

Below is an example of a `ModelDataSpec` for a model trained on CoSMIR-H data to predict temperature:

In [None]:
from hympi_ml.data import ModelDataSpec, cosmirh, CosmirhSpec, NRSpec
from hympi_ml.data.scale import MinMaxScaler

spec = ModelDataSpec(
    features={
        "CH": CosmirhSpec(
            frequencies=[
                cosmirh.C50_BAND,
                cosmirh.C183_BAND,
            ],
            ignore_frequencies=[  # problematic CRTM frequencies
                56.96679675,
                57.60742175,
                57.611328,
                57.61523425,
            ],
        ),
    },
    targets={
        "TEMPERATURE": NRSpec(
            dataset="TEMPERATURE",
            scaler=MinMaxScaler(minimum=175.0, maximum=325.0),
        ),
    },
)

As you can see, the above code defines a `ModelDataSpec` whose features (inputs) dictionary contains a single entry named "CH" that holds a `CosmirhSpec` instance. The inner workings of that spec don't matter much for this notebook, but feel free to explore the `CosmirhSpec` code for more details.

The targets dictionary likewise has a single "TEMPERATURE" entry, a nature run specification (`NRSpec`) that pulls "TEMPERATURE" data. It also scales the data with a min-max scaler using a minimum of 175 K and a maximum of 325 K. For more details on `NRSpec`, please refer to its code definition.
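
As a rough illustration of what min-max scaling over 175 K to 325 K means, the sketch below applies the conventional formula `(value - minimum) / (maximum - minimum)`. This is an assumption about the general technique, not a copy of `hympi_ml.data.scale.MinMaxScaler`, whose exact behavior (for example, clipping or output range) should be checked in its code. The function names are hypothetical.

```python
def min_max_scale(value: float, minimum: float, maximum: float) -> float:
    """Map a value from [minimum, maximum] onto [0, 1] (no clipping)."""
    return (value - minimum) / (maximum - minimum)


def min_max_unscale(scaled: float, minimum: float, maximum: float) -> float:
    """Invert min_max_scale to recover the original units."""
    return scaled * (maximum - minimum) + minimum


# Bounds taken from the spec above; 250 K is just an illustrative input.
scaled = min_max_scale(250.0, minimum=175.0, maximum=325.0)
restored = min_max_unscale(scaled, minimum=175.0, maximum=325.0)
```

Under this convention, values outside the bounds map outside `[0, 1]`, so the bounds should cover the range of temperatures you expect to see.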

With the data specification in place, we must define where the data comes from. This is where `RawDataModule` comes in!
It lets us use the spec above with multiple `DataSource` definitions for our train, validation, and test datasets.

Note: `RawDataModule` inherits from PyTorch Lightning's `LightningDataModule`, which informs its functionality. See the docs [here](https://lightning.ai/docs/pytorch/stable/data/datamodule.html) for reference.

Below is an example instance:

In [None]:
from hympi_ml.data import RawDataModule
from hympi_ml.data.ch06 import Ch06Source

datamodule = RawDataModule(
    spec=spec,
    train_source=Ch06Source(
        days=[
            "20060115",
            "20060215",
            "20060415",
            "20060515",
            "20060615",
            "20060715",
            "20060815",
            "20061015",
        ]
    ),
    val_source=Ch06Source(days=["20061115"]),
    test_source=Ch06Source(days=["20061215"]),
    batch_size=8192,
    num_workers=20,
)

In the above code, we define a `RawDataModule` that references the `ModelDataSpec` we defined earlier along with three data sources for our train, validation, and test datasets. Each source uses a different set of days because `Ch06Source` follows our common practice of splitting data by day. `RawDataModule` also requires a batch size for data loading and the number of workers (processes) used during loading (20 is a good number for ADAPT, but your mileage may vary).
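
Two quick sanity checks follow from this setup: the day lists should not overlap between splits, and the batch size determines how many batches make up one pass over the data. The sketch below is plain Python and does not use `hympi_ml`. The helper names and the sample count are hypothetical, and the batch count assumes the last partial batch is kept.

```python
import math

train_days = [
    "20060115", "20060215", "20060415", "20060515",
    "20060615", "20060715", "20060815", "20061015",
]
val_days = ["20061115"]
test_days = ["20061215"]


def check_disjoint(**splits: list[str]) -> None:
    """Raise ValueError if any day appears in more than one split."""
    seen: dict[str, str] = {}
    for split_name, days in splits.items():
        for day in days:
            if day in seen:
                raise ValueError(
                    f"{day} appears in both {seen[day]!r} and {split_name!r}"
                )
            seen[day] = split_name


def batches_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of batches per pass, assuming a final partial batch is kept."""
    return math.ceil(num_samples / batch_size)


check_disjoint(train=train_days, val=val_days, test=test_days)

# 100_000 is a made-up sample count, used only to show the arithmetic.
n_batches = batches_per_epoch(100_000, batch_size=8192)
```

Keeping validation and test days out of the training list is what makes the reported metrics a fair measure of generalization.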

## Defining the Model Itself

Now that we have our data, we need a model to work with!

But before we define our model, we'll need some metrics that our model will use for training and for our further analysis.

For our case, a set of metrics is defined as a dictionary keyed by name (here, the target name "TEMPERATURE") whose values are `MetricCollection` instances; a `MetricCollection` is itself a named group of metrics. All of our metric functions and classes come from the `torchmetrics` package (more details [here](https://lightning.ai/docs/torchmetrics/stable/)).
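
For readers new to these metrics, the plain-Python sketch below shows what mean absolute error, mean squared error, and root mean squared error compute. It is independent of `torchmetrics`. The normalized RMSE shown here divides by the target range, which is only one common convention and is an assumption; check the `torchmetrics` documentation for the normalization its class actually uses. The input values are made up.

```python
import math


def mae(preds: list[float], targets: list[float]) -> float:
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)


def mse(preds: list[float], targets: list[float]) -> float:
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


def rmse(preds: list[float], targets: list[float]) -> float:
    """Root mean squared error, in the same units as the targets."""
    return math.sqrt(mse(preds, targets))


def nrmse_by_range(preds: list[float], targets: list[float]) -> float:
    """RMSE divided by the target range (one of several conventions)."""
    return rmse(preds, targets) / (max(targets) - min(targets))


# Made-up temperatures in K, for illustration only.
preds = [280.0, 250.0, 290.0]
targets = [282.0, 251.0, 287.0]

mae_value = mae(preds, targets)
mse_value = mse(preds, targets)
rmse_value = rmse(preds, targets)
nrmse_value = nrmse_by_range(preds, targets)
```

MAE and RMSE are in the units of the target, while a normalized RMSE is unitless, which makes it easier to compare across targets with different scales.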

Below we'll define our metrics for each of our train, validation, and test datasets:

In [None]:
from torchmetrics import MetricCollection
import torchmetrics.regression as re

train_metrics = {
    "TEMPERATURE": MetricCollection(
        {
            "mae": re.MeanAbsoluteError(),
        },
    ),
}

val_metrics = {
    "TEMPERATURE": MetricCollection(
        {
            "mae": re.MeanAbsoluteError(),
            "mse": re.MeanSquaredError(),
            # Note: this is the *normalized* RMSE, despite the "rmse" key.
            "rmse": re.NormalizedRootMeanSquaredError(),
        },
    ),
}

test_metrics = {
    "TEMPERATURE": MetricCollection(
        {
            "mae": re.MeanAbsoluteError(),
            "mse": re.MeanSquaredError(),
            # Note: this is the *normalized* RMSE, despite the "rmse" key.
            "rmse": re.NormalizedRootMeanSquaredError(),
        },
    ),
}

Now that we've defined our metrics, we'll finally get to create our model!

It all begins with `SpecModel`, which you should think of as the base class for defining any kind of model. It inherits from PyTorch Lightning's `LightningModule` (more details on this module [here](https://lightning.ai/docs/pytorch/stable/model/train_model_basic.html)).

For this example we'll use `MLPModel` (MLP stands for Multi-Layer Perceptron), one of the many options for creating a model. More details about MLPs can be found in the introduction section (before any code) [here](https://www.geeksforgeeks.org/deep-learning/multi-layer-perceptron-learning-in-tensorflow/#).

Below, we'll define our model:

In [None]:
import torch.nn as nn
from hympi_ml.model import MLPModel

model = MLPModel(
    spec=spec,
    train_metrics=train_metrics,
    val_metrics=val_metrics,
    test_metrics=test_metrics,
    feature_paths=nn.ModuleDict(
        {
            "CH": nn.Sequential(
                nn.LazyLinear(1024),
                nn.GELU(),
                nn.LazyLinear(256),
                nn.GELU(),
                nn.LazyLinear(128),
                nn.GELU(),
            ),
        }
    ),
    output_path=nn.Sequential(
        nn.LazyLinear(128),
        nn.GELU(),
    ),
)

As you can see, the model references the specification we defined earlier as well as all of our metrics. Beyond that, the MLP-specific details are the "feature paths", an `nn.ModuleDict` with one entry for each feature defined in the spec, and the "output path", another PyTorch module consisting of a single path of layers that is applied after the outputs of all feature paths are concatenated together. An example of this kind of architecture is described in my poster, found [here](https://ntrs.nasa.gov/citations/20250003177).
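
The data flow described above (one path per feature, concatenate, then a shared output path) can be sketched in plain Python. This is not `MLPModel`'s actual code: the `forward` function, the feature names "A" and "B", and the toy paths are hypothetical stand-ins, with simple list functions in place of PyTorch layers. In PyTorch, the concatenation step would typically be a `torch.cat` along the feature dimension.

```python
from typing import Callable

Vector = list[float]
Path = Callable[[Vector], Vector]


def forward(
    features: dict[str, Vector],
    feature_paths: dict[str, Path],
    output_path: Path,
) -> Vector:
    """Run each feature through its own path, concatenate, then finish."""
    encoded: Vector = []
    for name, path in feature_paths.items():
        # Each feature is transformed independently by its own path...
        encoded.extend(path(features[name]))
    # ...and the concatenated result goes through the shared output path.
    return output_path(encoded)


# Toy stand-ins for stacks of layers (hypothetical).
feature_paths: dict[str, Path] = {
    "A": lambda x: [2.0 * v for v in x],
    "B": lambda x: [v + 1.0 for v in x],
}


def output_path(x: Vector) -> Vector:
    return [sum(x)]


result = forward({"A": [1.0, 2.0], "B": [3.0]}, feature_paths, output_path)
```

With a single feature such as "CH", the concatenation is trivial, but the same structure extends to several inputs without changing the output path.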

After your data, metrics, and model have all been defined, you can train! This can be done in many ways; please see the ends of the files in the `runs` directory, which have good examples of what to do once everything above has been set up.