
Pre-initialize output xarray time coordinates and variables #68

Closed
aufdenkampe opened this issue Feb 26, 2024 · 11 comments · Fixed by #77
Labels
enhancement New feature or request · refactor-core For core functionality, not model specific

Comments

@aufdenkampe
Member

Pre-initializing an array at the start of a computation is a well-known performance optimization: updating values in place is far more CPU- and memory-efficient than repeatedly resizing or appending to the array.
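For intuition, here is a minimal NumPy sketch (illustrative only, not the model code) contrasting the two patterns:

```python
import numpy as np

n_steps = 10_000

# Appending: np.append allocates a new array and copies all prior values
# on every iteration, so total work grows quadratically with n_steps.
out = np.empty(0)
for i in range(n_steps):
    out = np.append(out, i * 0.1)

# Pre-initializing: allocate once, then write each value in place.
out = np.empty(n_steps)
for i in range(n_steps):
    out[i] = i * 0.1
```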

This issue splits out a sub-thread started in #57:

In that comment, @xaviernogueira describes the results of a simple test in the compute_engine_testing.ipynb notebook (currently in the 57-improve-timestep-performance-further branch) showing that pre-initialization improved runtime by 4x-5x. See #57 for a discussion of these results.

NOTE that we expect all users to know exactly how many time steps they want to run before starting a simulation, so we do not need to preserve the current flexibility of running an open-ended number of time steps.

@sjordan29
Contributor

@aufdenkampe, I started working on this late last week by creating a profiling script and notebook as a benchmark. My plan for this week was to pre-initialize the arrays and run the same suite of tests.

These could also serve as a point of comparison for any testing you do with xarray-simlab. I can work on making the profiling script more generalized and user-friendly.
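For reference, a hypothetical sketch of the shape such a profiling harness might take (the actual script lives in the branch; `run_model`, `grid_sizes`, and `iteration_counts` here are illustrative names):

```python
import itertools
import time

def profile(run_model, grid_sizes, iteration_counts, repeats=3):
    """Time run_model(n_cells, n_steps) across a sweep of grid sizes and iteration counts."""
    results = {}
    for n_cells, n_steps in itertools.product(grid_sizes, iteration_counts):
        times = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_model(n_cells, n_steps)
            times.append(time.perf_counter() - start)
        # Record the best time per iteration across repeats.
        results[(n_cells, n_steps)] = min(times) / n_steps
    return results
```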

@aufdenkampe
Member Author

@sjordan29, having a common set of profiling scripts that we use for our various test branches sounds very useful! Thank you!

I just created a new Milestone for tracking these collective efforts: Performance enhancements for NSM demo, and I want to create a number of specific issues for us to each test and characterize with our different approaches.

@aufdenkampe
Member Author

aufdenkampe commented Feb 26, 2024

Pasting #57 (comment) (from @xaviernogueira on Dec. 14) here, to keep discussion on this specific issue in one place:

Strategy of writing to a pre-initialized set of xarray time slices

  • This was proposed by @aufdenkampe earlier as a potential avenue for performance increases.
  • While I have not had time to implement a full version in the existing code for a true apples-to-apples comparison (more on that later), I did make simplified xarray workflows in the compute_engine_testing.ipynb notebook to gauge whether the approach is worth developing.
  • The TLDR: computation becomes 4-5x faster; however, implementation may involve some tradeoffs, especially when the user hasn't pre-decided how many timesteps will be executed.

Performance Comparison

  • I used the tutorial air_temperature Dataset from xarray, which is a (lat: 25, lon: 53, time: 50) grid.
  • The test added 50 extra time steps where a mock water_temp_c (renamed air temperature) is set to 10 more than the previous timestep.

Test 1:

I implemented the current algorithm from clearwater_modules: copy the last timestep (making a Dataset with a time length of 1), update its values, then concatenate it to the end of the existing Dataset.
[screenshot: Test 1 code and timing]

Time: 364ms
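A simplified sketch of this concat-per-step pattern (assuming the tutorial dataset above and a 6-hour step, that dataset's native interval; not the exact notebook code):

```python
import numpy as np
import xarray as xr

ds = xr.tutorial.load_dataset("air_temperature").isel(time=slice(0, 50))
ds = ds.rename({"air": "water_temp_c"})

for _ in range(50):
    # Copy the last time slice and bump its values and time coordinate...
    last = ds.isel(time=[-1]).copy()
    last["water_temp_c"] = last["water_temp_c"] + 10
    last["time"] = last["time"] + np.timedelta64(6, "h")
    # ...then concatenate, which reallocates the whole dataset every step.
    ds = xr.concat([ds, last], dim="time")
```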

Test 2:

Here we make another (lat: 25, lon: 53, time: 50) dataset, where time runs from one step past the last time of the existing dataset out to 50 steps past it. All values are initialized to np.nan. We then concatenate it to the existing dataset and iterate over the new indices, writing to each time slice in place.

Creating the new Dataset and concatenating it takes about 9 ms (a one-time O(1) cost, since it runs only once). The test itself:
[screenshot: Test 2 code and timing]
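A simplified sketch of this pre-initialized pattern (same assumptions as the Test 1 sketch above):

```python
import numpy as np
import pandas as pd
import xarray as xr

ds = xr.tutorial.load_dataset("air_temperature").isel(time=slice(0, 50))
ds = ds.rename({"air": "water_temp_c"})

# Build 50 NaN-filled time slices starting one step after the last time,
# then concatenate once (the ~9 ms one-time cost).
new_times = pd.date_range(
    ds.time.values[-1] + np.timedelta64(6, "h"), periods=50, freq="6h"
)
empty = xr.full_like(ds.isel(time=0, drop=True), np.nan).expand_dims(time=new_times)
ds = xr.concat([ds, empty], dim="time")

# Iterate over the new indices, writing each slice in place
# (time is the leading dimension, so .values[i] is one time slice).
vals = ds["water_temp_c"].values
for i in range(50, 100):
    vals[i] = vals[i - 1] + 10
```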

Considerations:

  • An array the size of your output needs to be initialized upon startup, which can potentially hog memory for a very long model run, or make workflows where the data is saved to a file at regular intervals more difficult to implement.
  • Not a massive issue, but if the user doesn't know a priori how many timesteps they will run, the code will have to make a "guess" and add that many time slices filled with np.nan. This can make the dataset look odd between timesteps and can make debugging more difficult. Additionally, if one runs more timesteps than the guess, there needs to be logic to add another "buffer" of empty time slices to write to (see the sketch after this list).
  • One option around the above issue is to simply require the user to set the number of model iterations up front, but is this desired?
  • That said, 4-5x faster is hard to argue against. Implementation shouldn't be too bad.
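A hypothetical sketch of the "buffer" logic described above (the function name and signature are illustrative, not part of the codebase):

```python
import numpy as np
import xarray as xr

def extend_time_buffer(ds: xr.Dataset, n_slices: int, dt: np.timedelta64) -> xr.Dataset:
    """Append n_slices NaN-filled time slices so iteration can continue."""
    new_times = ds.time.values[-1] + dt * np.arange(1, n_slices + 1)
    empty = xr.full_like(ds.isel(time=0, drop=True), np.nan).expand_dims(time=new_times)
    return xr.concat([ds, empty], dim="time")
```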

If the desired behavior is well constrained, I am happy to implement it in one shot, potentially beyond my office time. This can be discussed in our meeting this Friday (12/15/2023).

aufdenkampe added the enhancement and refactor-core labels Feb 26, 2024
sjordan29 added a commit that referenced this issue Feb 26, 2024
  #68 TO DO: hot start dataset; increment_timestep
sjordan29 added a commit that referenced this issue Feb 27, 2024
  #68 todo: deal with timesteps and potentially change typehints; update tests
sjordan29 added a commit that referenced this issue Feb 28, 2024
@sjordan29
Contributor

@aufdenkampe, some findings (also outlined in performance_profiling_tsm.ipynb)

I worked to pre-initialize the model xarray and compared it against the baseline model. When we track dynamic variables, we see some interesting behavior. On the graphs, each panel plots time per iteration on the y-axis against grid size on the x-axis, with each panel representing a different number of iterations (e.g., the top panel is for 1 iteration, the second for 10 iterations, etc.).

For a low number of iterations and 1-1,000 cells, the baseline is faster, but as we get to large numbers of cells and timesteps, the baseline computation time takes off while the updated computation time remains steady. I think the pre-initialized model is slower up front because of the additional memory overhead of storing an empty array over both space and time for all dynamic variables.
[figure: time per iteration vs. grid size, one panel per iteration count, with dynamic-variable tracking enabled]

When we stop tracking dynamic variables and compare the two models, we see better outcomes for the pre-initialized model. For single-cell models, it is slightly slower (by <0.002 seconds), but once we get to more than one cell, the pre-initialized model is faster. At high numbers of cells and timesteps, the pre-initialized model is more than 10x faster than the baseline model.
[figure: the same comparison with dynamic-variable tracking disabled]

I imagine we can get the 1-cell model faster by:

  1. Reducing dimensionality of state / dynamic variables if they aren't variable over the model mesh
  2. Addressing #62 (Reduce memory usage: zero-copy)
  3. Additional memory management via #70 (Add memory profiling)

I'd also propose we set the default to not track dynamic variables, so that new users don't incur this computational expense by default (currently the default is to track them).

@aufdenkampe
Member Author

aufdenkampe commented Mar 12, 2024

@sjordan29, those are really interesting results! Thanks for that impressive analysis. Your suggestions for next steps all sound spot-on.

The performance advantages are quite impressive for >10,000 time steps and grids >1,000 cells, which I think is where we need them the most.

It might be interesting to try a 10-cell or 100-cell model, just to get an idea of some intermediate differences.

For the slowdowns with a larger number of time steps, I'm thinking that the answer might be:

@sjordan29
Contributor

Great -- I'll run those additional tests to fill in the gaps. Then I think I can clean this branch up and merge to main, and we can pivot to some of the memory management issues.

@sjordan29
Contributor

Ran the tests for 10 and 100 timesteps as well. We start seeing the benefits at 100 iterations. The 1- and 10-timestep runs seem more affected by natural variability, since we're not averaging over a large number of timesteps.

[figure: results including the 10- and 100-timestep runs]

sjordan29 added a commit that referenced this issue Mar 13, 2024
  #68 to do: hot start dataset test
@sjordan29
Contributor

@aufdenkampe, a couple questions/notes while I get the tests updated.

  1. I've made time_steps a required input. We could optionally default to a certain number of timesteps to run, but this seems like a waste of memory and could promote inefficient use of the model.
  2. I'll need to adjust how we handle a hotstart dataset. My sense is that we should grab the last timestep from a hotstart dataset and drop the rest of it to save on memory; users could merge the two datasets on their own if desired (see the sketch below).

Any thoughts/preferences?
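A minimal sketch of the hotstart approach in item 2, assuming the hotstart data is an xarray Dataset with a time dimension (the file name is hypothetical):

```python
import xarray as xr

hotstart = xr.open_dataset("hotstart_output.nc")  # hypothetical file

# Keep only the final timestep as the initial condition and let the rest
# be released, so prior timesteps don't stay in memory during the run.
initial_state = hotstart.isel(time=[-1]).load()
hotstart.close()
```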

@aufdenkampe
Member Author

I like both of your suggestions.

  1. Require number of time_steps (or a time series / coordinate) as input.
  2. For a hotstart dataset, grab the last timestep by default. There's no need to hold previous timesteps in memory while computing.

sjordan29 added a commit that referenced this issue Mar 14, 2024
@sjordan29
Contributor

Sounds good. I am leaving time_steps as an integer in the original implementation to prevent this branch from getting too big, but I opened #75 to address this in the future. Hotstart datasets have been addressed as described above, and now all tests are passing.

I will probably also need to update example notebooks and documentation now that I am thinking of it.

Do you want me to work on a PR to main, or should we keep these speed/memory enhancements on a dev branch for now?

@aufdenkampe
Member Author

@sjordan29, that all sounds awesome. I'm a fan of keeping feature branches short-lived, so merging early and often. That said, if the example notebooks won't run, then it's probably a good idea to fix those first. More extensive docs could come in a second PR.
