3_model Scope and Structure #14
-
We could break this work up into five PRs that follow the five tasks I outlined:
The last item - log training metrics - might be so small that it makes sense to include it in the train model task.
-
In the order of my reading: Tasks 1, 2, 4, and 5 are good/given. Task 3 can be logged upon initialization of the pipeline, especially if those hyperparameters would go on to impact processing. Not a big deal either way. This information could also go into a metadata table with broader run information (e.g., lakes that you exclude during processing, train/test dates, git commit tag, container version, etc.). I don't have direct experience with all of this, but this seems to be the broader direction of other projects.

**"Form training data"**

> "Static lake attributes ... don't change over time, so the same value gets repeated day after day in the sequences."

In river-dl and the reservoir work, we use

Partitioning and normalizing the data may be more of a processing task. This may just be semantics; whichever way, as long as it's modular from the training code I think that's fine (i.e., if you change NN hyperparameters, the pipeline/function calls should be set up so the data isn't always resplit/renormalized).

PyTorch dataset objects aren't required, but they do scale really nicely (especially if you use parallel GPUs). They were a little awkward for me to figure out at first though; lmk if you want help.

**"Create model" and "Structure"**

All sounds good.

**Your subsequent comment**

I would assume those tasks are a little more connected (e.g., Task 1 = 1, Task 2 = 2+4+5, Task 3 = 3+5) because:
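Since PyTorch dataset objects came up: here's a minimal sketch of what one could look like for the per-lake sequence arrays. Only `__len__` and `__getitem__` are required, shown here with plain `numpy` to keep it self-contained; in real code the class would subclass `torch.utils.data.Dataset` so a `DataLoader` can batch/shuffle it (and shard it across parallel GPUs). All names and arguments are illustrative, not the repo's actual API.

```python
import numpy as np

class LakeSequenceDataset:
    """Map-style dataset over per-lake sequence arrays (illustrative sketch)."""

    def __init__(self, sequence_arrays, n_outputs):
        # Each array is (num_sequences, num_days, num_features + n_outputs);
        # stack all lakes along the sequence axis.
        self.sequences = np.concatenate(sequence_arrays, axis=0)
        self.n_outputs = n_outputs

    def __len__(self):
        return self.sequences.shape[0]

    def __getitem__(self, idx):
        seq = self.sequences[idx]
        x = seq[:, :-self.n_outputs]  # inputs at every timestep
        y = seq[:, -self.n_outputs:]  # temperatures at the chosen depths
        return x, y
```

With `torch.utils.data.Dataset` as the base class, this same object drops straight into a `DataLoader`.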
Overall, I think most of my comments were about organization or efficiency rather than the ML or goals. I hope that helps; feel free to ask any follow-ups, request future reviews, or ask if you want any help with the training code.
-
Just a couple of additional thoughts!
-
There's a separate issue for each PR listed above:
-
Here's a draft of the structure of the model-training phase for us to discuss. Feedback very welcome!
I've envisioned five phases to this repository:
1_fetch
2_process
3_model
4_evaluate
5_viz
The primary job of phase 3 is to train the LSTM. I've been planning to name it `3_model`, though I'm open to other names like `3_train`.

### Tasks
There are several tasks to complete to train the model:
**Form training data**

Phase `2_process` provides a `.npy` file for every lake. Each `.npy` file contains all the sequences of daily inputs and outputs for the lake, including static lake attributes (e.g., latitude, longitude, elevation, etc.), which don't change over time, so the same value gets repeated day after day in the sequences. Every sequence is the same number of days long. So, every `.npy` file contains a three-dimensional `numpy` array with different sequences along the first dimension, days along the second dimension, and inputs and outputs along the last dimension. If you were to print `lake_sequences.shape`, you'd get `(num_sequences, num_days_per_sequence, num_features + num_pre_specified_temperature_depths)`. The sequences will be used to form the training and test data sets.

To form the training data, there are a few things to do.
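As an illustration (all sizes made up), the array layout just described looks like this, and the one-file-per-lake structure makes a by-lake train/test split trivial:

```python
import numpy as np

num_days_per_sequence = 400   # illustrative
num_features = 10             # drivers + repeated static attributes
num_depths = 5                # pre-specified temperature depths (outputs)

# Two toy "lakes", as if loaded from their .npy files
lake_a = np.zeros((8, num_days_per_sequence, num_features + num_depths))
lake_b = np.zeros((6, num_days_per_sequence, num_features + num_depths))
print(lake_a.shape)  # (8, 400, 15)

# Splitting by lake is easy: each lake's whole array goes to one partition
train_sequences = np.concatenate([lake_a], axis=0)
test_sequences = lake_b
```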
A benefit of getting one `.npy` file per lake from `2_process` is that splitting them into training and test sets by lake is easy.

**Create model**
Classes for the LSTM and the EA-LSTM can be found in the code accompanying Kratzert et al., 2019 and Willard et al., 2022. Conveniently, there's a generic Model class that makes it easy to toggle between a vanilla LSTM and an EA-LSTM. We can modify those classes as needed - in particular, allowing multiple outputs at every timestep and making sure to provide output at all timesteps (I've already made those modifications). Then, we instantiate a model object from the Model class.
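As a sketch of the toggling idea (the class names, constructor arguments, and config keys below are hypothetical stand-ins, not the actual API of those papers' code): a single config entry can pick the architecture through a small factory.

```python
class LSTM:
    """Stand-in for a vanilla LSTM (would subclass torch.nn.Module)."""
    def __init__(self, n_dynamic, n_static, hidden_size):
        # Vanilla LSTM: statics are concatenated onto every timestep's input
        self.n_inputs = n_dynamic + n_static
        self.hidden_size = hidden_size

class EALSTM:
    """Stand-in for an EA-LSTM: static attributes drive only the input gate."""
    def __init__(self, n_dynamic, n_static, hidden_size):
        self.n_dynamic = n_dynamic
        self.n_static = n_static
        self.hidden_size = hidden_size

def create_model(config):
    # Toggle between architectures with a single config entry
    model_cls = {"lstm": LSTM, "ealstm": EALSTM}[config["architecture"]]
    return model_cls(config["n_dynamic"], config["n_static"], config["hidden_size"])
```

The nice property of this pattern is that switching architectures becomes a one-line config change rather than a code change.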
**Log settings and hyperparameters**
We want to save as much information as possible about the hyperparameters, settings, preprocessing steps, and training. We want this information to be saved alongside the models. Quests across PUMP are developing solutions for this need, so we can make use of their progress. The details are TBD.
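Even before that shared tooling lands, one simple stopgap (purely illustrative; file names and keys are made up) is to dump the settings to a JSON file alongside the model artifacts:

```python
import json
import tempfile
from pathlib import Path

# Hyperparameters and run metadata we'd want to keep (values illustrative)
run_info = {
    "architecture": "lstm",
    "hidden_size": 64,
    "learning_rate": 1e-3,
    "sequence_length": 400,
}

out_dir = Path(tempfile.mkdtemp())          # in practice, the model's output dir
config_path = out_dir / "run_config.json"
config_path.write_text(json.dumps(run_info, indent=2))

# Later (e.g., in 4_evaluate) the exact settings can be reloaded
reloaded = json.loads(config_path.read_text())
```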
**Train model**
The code for the training loop of the model should be fairly short and straightforward. We'll track loss and other metrics of interest over epochs. After training, we'll save the parameters of the trained model and those metrics.
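The shape of that loop, sketched with a trivial least-squares stand-in model in `numpy` so it stays self-contained (in the real phase this would be the (EA-)LSTM, a PyTorch optimizer, and batches from a data loader; all names here are illustrative):

```python
import numpy as np

# Toy data standing in for the training sequences
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)               # stand-in "model parameters"
metrics = {"train_rmse": []}  # metrics of interest, tracked over epochs
n_epochs = 200

for epoch in range(n_epochs):
    pred = X @ w
    metrics["train_rmse"].append(float(np.sqrt(np.mean((pred - y) ** 2))))
    grad = 2.0 * X.T @ (pred - y) / len(y)
    w -= 0.1 * grad           # the optimizer step

# After training: save the trained parameters and the metrics together,
# e.g. np.save(...) for the weights and json.dump(...) for the metrics.
```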
### Structure

Generally, I'm planning to structure the code in a similar way as `2_process`:

- Two of the tasks above could easily be Snakemake rules: forming the training data and training the model.
- The `3_model` phase can have a config file separate from `2_process`, or we can have one config file for the entire pipeline.
- The rule to form the training data would pass the relevant parts of the Snakemake config using `params`.
- The rule for training the model would also call a Python script. This script would create the model, train it, and log all the settings.
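A hypothetical sketch of those two rules (all rule names, paths, and config keys are illustrative, nothing here is decided):

```
rule form_training_data:
    input:
        lake_files=expand("2_process/out/{lake_id}.npy", lake_id=config["lake_ids"])
    output:
        "3_model/out/train_data.npz"
    params:
        test_lakes=config["test_lakes"],
        sequence_length=config["sequence_length"]
    script:
        "3_model/src/form_training_data.py"

rule train_model:
    input:
        "3_model/out/train_data.npz"
    output:
        weights="3_model/out/model_weights.pt",
        metrics="3_model/out/training_metrics.json"
    params:
        hyperparameters=config["hyperparameters"]
    script:
        "3_model/src/train_model.py"
```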