3_model Scope and Structure #14
-
We could break this work up into five PRs that follow the five tasks I outlined:
The last item - log training metrics - might be so small that it makes sense to include it in the train model task.
-
In the order of my reading: Tasks 1, 2, 4, and 5 are good/given. Task 3 can be logged upon initialization of the pipeline, especially if those hyperparameters would go on to impact processing. Not a big deal either way. This information could also go into a metadata table with broader run information (e.g., lakes that you exclude during processing, train/test dates, git commit tag, container version, etc.). I don't have direct experience with all of this, but this seems to be the broader direction of other projects.

**"Form training data"**

> "Static lake attributes ... don't change over time, so the same value gets repeated day after day in the sequences."

In river-dl and the reservoir work, we use

Partitioning and normalizing the data may be more of a processing task. This may just be semantics; whichever way, as long as it's modular from the training code I think that's fine (i.e., if you change NN hyperparameters, the pipeline/function calls should be set up so the data isn't always resplit/renormalized).

PyTorch dataset objects aren't required, but they do scale really nicely (especially if you use parallel GPUs). They were a little awkward for me to figure out at first though; lmk if you want help.

**"Create model" and "Structure"**

All sounds good.

**Your subsequent comment**

I would assume those tasks are a little more connected (e.g., Task 1 = 1, Task 2 = 2+4+5, Task 3 = 3+5) because:
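Since PyTorch dataset objects came up: here's a minimal sketch of what one could look like for the per-lake sequence arrays. Only `__len__` and `__getitem__` are required, shown here with plain `numpy` to keep it self-contained; in real code the class would subclass `torch.utils.data.Dataset` so a `DataLoader` can batch/shuffle it (and shard it across parallel GPUs). All names and arguments are illustrative, not the repo's actual API.

```python
import numpy as np

class LakeSequenceDataset:
    """Map-style dataset over per-lake sequence arrays (illustrative sketch)."""

    def __init__(self, sequence_arrays, n_outputs):
        # Each array is (num_sequences, num_days, num_features + n_outputs);
        # stack all lakes along the sequence axis.
        self.sequences = np.concatenate(sequence_arrays, axis=0)
        self.n_outputs = n_outputs

    def __len__(self):
        return self.sequences.shape[0]

    def __getitem__(self, idx):
        seq = self.sequences[idx]
        x = seq[:, :-self.n_outputs]  # inputs at every timestep
        y = seq[:, -self.n_outputs:]  # temperatures at the chosen depths
        return x, y
```

With `torch.utils.data.Dataset` as the base class, this same object drops straight into a `DataLoader`.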
Overall, I think most of my comments were about organization or efficiency rather than the ML or goals. I hope that helps; feel free to ask any follow-ups, request future reviews, or ask if you want any help with the training code.
-
Just a couple of additional thoughts!
-
There's a separate issue for each PR listed above:
-
Here's a draft of the structure of the model-training phase for us to discuss. Feedback very welcome!
I've envisioned five phases to this repository:
1_fetch
2_process
3_model
4_evaluate
5_viz
The primary job of phase 3 is to train the LSTM. I've been planning to name it `3_model`, though I'm open to other names like `3_train`.

### Tasks
There are several tasks to complete to train the model:
**Form training data**

Phase `2_process` provides a `.npy` file for every lake. Each `.npy` file contains all the sequences of daily inputs and outputs for the lake, including static lake attributes (e.g., latitude, longitude, elevation, etc.), which don't change over time, so the same value gets repeated day after day in the sequences. Every sequence is the same number of days long. So, every `.npy` file contains a three-dimensional `numpy` array with different sequences along the first dimension, days along the second dimension, and inputs and outputs along the last dimension. If you were to print `lake_sequences.shape`, you'd get `(num_sequences, num_days_per_sequence, num_features + num_pre_specified_temperature_depths)`. The sequences will be used to form the training and test data sets.

To form the training data, there are a few things to do.
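As an illustration (all sizes made up), the array layout just described looks like this, and the one-file-per-lake structure makes a by-lake train/test split trivial:

```python
import numpy as np

num_days_per_sequence = 400   # illustrative
num_features = 10             # drivers + repeated static attributes
num_depths = 5                # pre-specified temperature depths (outputs)

# Two toy "lakes", as if loaded from their .npy files
lake_a = np.zeros((8, num_days_per_sequence, num_features + num_depths))
lake_b = np.zeros((6, num_days_per_sequence, num_features + num_depths))
print(lake_a.shape)  # (8, 400, 15)

# Splitting by lake is easy: each lake's whole array goes to one partition
train_sequences = np.concatenate([lake_a], axis=0)
test_sequences = lake_b
```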
A benefit of getting one `.npy` file per lake from `2_process` is that splitting them into training and test sets by lake is easy.

**Create model**
Classes for the LSTM and the EA-LSTM can be found in the code accompanying Kratzert et al., 2019 and Willard et al., 2022. Conveniently, there's a generic Model class that makes it easy to toggle between a vanilla LSTM and an EA-LSTM. We can modify those classes as needed - in particular, allowing multiple outputs at every timestep and making sure to provide output at all timesteps (I've already made those modifications). Then, we instantiate a model object from the Model class.
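As a sketch of the toggling idea (the class names, constructor arguments, and config keys below are hypothetical stand-ins, not the actual API of those papers' code): a single config entry can pick the architecture through a small factory.

```python
class LSTM:
    """Stand-in for a vanilla LSTM (would subclass torch.nn.Module)."""
    def __init__(self, n_dynamic, n_static, hidden_size):
        # Vanilla LSTM: statics are concatenated onto every timestep's input
        self.n_inputs = n_dynamic + n_static
        self.hidden_size = hidden_size

class EALSTM:
    """Stand-in for an EA-LSTM: static attributes drive only the input gate."""
    def __init__(self, n_dynamic, n_static, hidden_size):
        self.n_dynamic = n_dynamic
        self.n_static = n_static
        self.hidden_size = hidden_size

def create_model(config):
    # Toggle between architectures with a single config entry
    model_cls = {"lstm": LSTM, "ealstm": EALSTM}[config["architecture"]]
    return model_cls(config["n_dynamic"], config["n_static"], config["hidden_size"])
```

The nice property of this pattern is that switching architectures becomes a one-line config change rather than a code change.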
**Log settings and hyperparameters**
We want to save as much information as possible about the hyperparameters, settings, preprocessing steps, and training. We want this information to be saved alongside the models. Quests across PUMP are developing solutions for this need, so we can make use of their progress. The details are TBD.
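Even before that shared tooling lands, one simple stopgap (purely illustrative; file names and keys are made up) is to dump the settings to a JSON file alongside the model artifacts:

```python
import json
import tempfile
from pathlib import Path

# Hyperparameters and run metadata we'd want to keep (values illustrative)
run_info = {
    "architecture": "lstm",
    "hidden_size": 64,
    "learning_rate": 1e-3,
    "sequence_length": 400,
}

out_dir = Path(tempfile.mkdtemp())          # in practice, the model's output dir
config_path = out_dir / "run_config.json"
config_path.write_text(json.dumps(run_info, indent=2))

# Later (e.g., in 4_evaluate) the exact settings can be reloaded
reloaded = json.loads(config_path.read_text())
```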
**Train model**
The code for the training loop of the model should be fairly short and straightforward. We'll track loss and other metrics of interest over epochs. After training, we'll save the parameters of the trained model and those metrics.
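The shape of that loop, sketched with a trivial least-squares stand-in model in `numpy` so it stays self-contained (in the real phase this would be the (EA-)LSTM, a PyTorch optimizer, and batches from a data loader; all names here are illustrative):

```python
import numpy as np

# Toy data standing in for the training sequences
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)               # stand-in "model parameters"
metrics = {"train_rmse": []}  # metrics of interest, tracked over epochs
n_epochs = 200

for epoch in range(n_epochs):
    pred = X @ w
    metrics["train_rmse"].append(float(np.sqrt(np.mean((pred - y) ** 2))))
    grad = 2.0 * X.T @ (pred - y) / len(y)
    w -= 0.1 * grad           # the optimizer step

# After training: save the trained parameters and the metrics together,
# e.g. np.save(...) for the weights and json.dump(...) for the metrics.
```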
### Structure

Generally, I'm planning to structure the code in a similar way as `2_process`:

- Two of the tasks above could easily be Snakemake rules: forming the training data and training the model.
- The `3_model` phase can have a config file separate from `2_process`, or we can have one config file for the entire pipeline.
- The rule to form the training data would pass the relevant parts of the Snakemake config using `params`.
- The rule for training the model would also call a Python script. This script would create the model, train it, and log all the settings.
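A hypothetical sketch of those two rules (all rule names, paths, and config keys are illustrative, nothing here is decided):

```
rule form_training_data:
    input:
        lake_files=expand("2_process/out/{lake_id}.npy", lake_id=config["lake_ids"])
    output:
        "3_model/out/train_data.npz"
    params:
        test_lakes=config["test_lakes"],
        sequence_length=config["sequence_length"]
    script:
        "3_model/src/form_training_data.py"

rule train_model:
    input:
        "3_model/out/train_data.npz"
    output:
        weights="3_model/out/model_weights.pt",
        metrics="3_model/out/training_metrics.json"
    params:
        hyperparameters=config["hyperparameters"]
    script:
        "3_model/src/train_model.py"
```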