Fixes MOC memory error #164
Conversation
This PR is to resolve #161 and is pending further testing.

@pwolfram, I'm happy to test this on all the runs that support the MOC (i.e. where we have appropriate mask files). Is it at the point where I should do that, or are you still verifying that the end result isn't changed from before?

Thanks @xylar, I have a long test running and am working on getting a fast test completed to do the verification. Can you please take a quick look at the code to make sure you are ok with the changes so far? One thing I did not do here was to remove the reading and writing of the cached files via

@pwolfram, we definitely want to keep the reading and writing of the cache files. As you see, the computation takes a long time, so we definitely only want to do it once per result, even if the analysis gets run multiple times as the job progresses. If you want to convert the code to use xarray to do that I/O, that's fine. But please don't get rid of it entirely. And please do make sure that it can successfully pick up where it left off by, say, requesting a time series of 2 years and then one of 3 years. I'll take a look at the code and see whether I have any specific comments.

Sorry-- I meant switch out
    transportZ = (refLayerThickness*horizontalVel.where(transect)
                  *transectEdgeMaskSigns.where(transect)
                  *dvEdge.where(transect)
                  ).sum('nEdges')
@pwolfram, very nice! I had seen that this could be vectorized and I appreciate you doing it.
Thanks @xylar!
Very nice @pwolfram! I spent some time trying to get this right without the loop, alas unsuccessfully. Thanks for taking care of it.
I haven't taken a look at the code in a great deal of detail, but what I see looks great!
    mocTop = xr.DataArray(np.zeros([len(nz), len(latBins)]),
                          coords=[nz, latBins])
    mocTop[1:, 0] = transportZ.cumsum(axis=0)
    for iLat in np.arange(1, len(latBins)):
We should be able to remove this loop once pydata/xarray#924 is realized.
    cellbins = np.logical_and(latCell >= latBins[iLat-1],
                              latCell < latBins[iLat])
    mocTop[:, iLat] = mocTop[:, iLat-1] + velArea.where(cellbins).sum('nCells')

is just clunky and probably slow.
Yeah, I saw your comment. But I don't think that loop is a problem. There's a lot of computation going on in there so I can't imagine vectorizing it will make a huge difference.
I doubt the line you pointed out is that slow. I'm sure it's negligible in comparison to the sum below.
Things aren't always intuitive when it comes to dask because some optimizations can be performed; e.g., groupby_bins was faster than a by-hand equivalent. But you are probably right, unless there is some fancy graph-based optimization that could be used in the generalized, multi-index groupby_bins approach.
Just be careful about not over-optimizing code that isn't actually a bottleneck...
Agreed
@pwolfram, in order to use groupby_bins here, couldn't you just mask out the full dataset much earlier on using regionCellMask (probably via isel), and then apply groupby_bins with latitude bins? That way, you don't need two arguments to groupby.
The problem here is that mocTop needs to end up as a 2D array. groupby_bins will collapse that down to a single binned variable. In the example at http://xarray.pydata.org/en/stable/groupby.html, lat gets reduced, and we need it to stay unreduced.
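For illustration, here is a minimal, self-contained sketch of the binning-plus-accumulation idea being discussed, using synthetic stand-in arrays with the names from the snippets above. It shows how groupby_bins collapses nCells into a binned dimension and how a cumulative sum along the bins mirrors the accumulation in the iLat loop; it omits the transportZ boundary column and is an illustration of the idea, not a verified drop-in replacement.

```python
import numpy as np
import xarray as xr

# Synthetic stand-ins for velArea (nVertLevels x nCells) and latCell (nCells);
# in the real code these come from the MPAS output dataset.
nCells, nVertLevels = 10, 3
latCell = xr.DataArray(np.linspace(-80., 80., nCells), dims='nCells',
                       name='latCell')
velArea = xr.DataArray(np.random.rand(nVertLevels, nCells),
                       dims=('nVertLevels', 'nCells'))
latBins = np.linspace(-90., 90., 7)

# groupby_bins collapses 'nCells' into a binned 'latCell_bins' dimension ...
binnedSum = velArea.groupby_bins(latCell, bins=latBins).sum('nCells')

# ... and a cumulative sum along the bins then accumulates the per-bin
# contributions, analogous to what the explicit iLat loop does one bin at a time.
mocTopSketch = binnedSum.cumsum('latCell_bins')
print(mocTopSketch.dims, mocTopSketch.shape)
```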
@pwolfram, I couldn't resist testing. On the

@xylar, is the

@pwolfram, yep, just look in the

If the

Thanks @xylar!

Sorry, let me amend that to say, look in

Gotcha, just reproduced the error.
Force-pushed from 3586d8a to d64f127.
So my tests suggest that the climatology calculation is, indeed, using reasonable amounts of memory. It's super slow so far on the EC60to30 grid, but maybe you can't get the one without the other?

Rerunning the QU240 case with your latest push.

I'm not sure I understand what you mean-- are you saying that now that the memory is under control, the runtime is slow?

Yep, seems that way.

QU240 is still crashing. Let me know if you're not seeing that on your side, and I'll post details.

I've got it too-- debugging now.
Force-pushed from d64f127 to 1e4f925.
@xylar, the code completed and the output is visually identical. They aren't strictly BFB, but are practically:

I'll also run a quick performance comparison between the two, although to be fair this isn't as representative as a full production run would be.
    load = partial(preload, maxgb=config.getfloat('input',
                                                  'reasonableArraySizeGB'))
    horizontalVel = load(ds.avgNormalVelocity.isel(Time=tIndex))
    velArea = load(ds.avgVertVelocityTop.isel(Time=tIndex) * areaCell)
Wouldn't it actually be clearer to read if you just cached the value of reasonableArraySizeGB and had 2 calls to preload without the need to use partial? If we want future developers to use preload, I think we want it to be clear how to use it.
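For concreteness, a sketch of that suggested alternative, reusing the names from the snippet above (preload, config, ds, tIndex, and areaCell are assumed to be defined in the surrounding code, and preload's maxgb keyword is taken from the partial call; maxArraySizeGB is just a local name chosen here).

```python
# Read the size threshold once from the config file ...
maxArraySizeGB = config.getfloat('input', 'reasonableArraySizeGB')

# ... then call preload directly for each variable, so a reader can see at a
# glance what arguments preload takes.
horizontalVel = preload(ds.avgNormalVelocity.isel(Time=tIndex),
                        maxgb=maxArraySizeGB)
velArea = preload(ds.avgVertVelocityTop.isel(Time=tIndex) * areaCell,
                  maxgb=maxArraySizeGB)
```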
The reason I did it this way is to parallel the xarray load method. Also, it seems like reasonableArraySizeGB isn't really something we need to change per call to preload. But partial is confusing for first-time users. I also thought about making reasonableArraySizeGB a global but decided against it for obvious reasons.
Do you think this needs to be changed?
It's kind of a style choice but I'd prefer having it be as straightforward as possible for someone who might use this code as a starting point to figure out how to do something similar in another analysis.
@pwolfram, I noticed an irritating bug where the figure title and caption are using

@pwolfram, so it's still quite slow in my testing. Suppose we want to do a 50-year time series; this is going to take hours and hours. What kind of performance were you getting?

@xylar, can you please get @vanroekel set up with the files needed to test the 18to6? We should be set up from #164 for this to work, and if it doesn't, we should know sooner rather than later. Someone will eventually want to run the MOC on this case, and we need to make sure that it works before someone else finds a bug before we do. I'm very grateful you found #161 before it caused a bigger problem. Thank you!

@xylar, is
Force-pushed from 3e9797b to c2ddc7c.
Sorry @xylar, thanks for catching that!

I'm working on it now. I'm skeptical that the performance will be good enough to be practical. The RRS18to6 mesh has about 24 times as many edges and cells as the EC60to30, so each computation will take at least that factor longer. With each month of the MOC currently taking several minutes, I don't think the current performance is even practical for the EC60to30 runs, let alone for the bigger ones.
@vanroekel, if you look on LANL turquoise for the file, this is the RRS18to6 version of the regional mask file for the Atlantic. This should hopefully be all you need to run the streamfunctionMOC analysis. Let me know if you have trouble.

@xylar thanks! I'm trying now. I'm also testing this branch on the core2 G-case (6030 out 110 years now).

So to give you guys a sense of the performance I'm seeing on Edison:

Multiply that by 24, and I would expect it to take at least a full day to compute a 10-year MOC climatology on the 18to6 and about an hour to compute a single Atlantic MOC time point on the RRS18to6 grid. Like I said, I don't really think this postprocessing of the MOC is going to be practical for the big grids.

Update on my testing. I've been running the 18to6 for almost 9 hours and it has not completed. This is on 1 year of data. I'm also running the 100 years of CORE2 data on blues; it has not completed either (nearly 9 hours as well). I see this output thus far:

Is there a way to check whether this is working? Am I missing a necessary setting? My config file is

I was able to run 70 years through the MOC in 2.5 hours before without issue, so I feel like I'm not setting this up right.
@vanroekel, you can take a look at the cache files in

As I have commented above, this move to purely xarray seems to have slowed things down a lot (maybe by a factor of 10) from what is now in

I don't think I have access to the machine where you're running to look at your scripts. But (other than the weird "invalid value encountered" warning) everything sounds like it's behaving as expected, and you didn't configure anything wrong.

@xylar, I agree. @vanroekel, can you confirm whether
Force-pushed from 287891e to 0f85336.
This prevents memory errors and allows for dask-enabled computational threading.
Force-pushed from 0f85336 to 43f3712.
Just talked with @vanroekel-- even

Do we want / need to address this issue now, @xylar? We can do so with xarray + dask.distributed, using multiple nodes to load in data in parallel. But this is a level of complexity that it sounds like we currently don't want or need to support, especially since the MOC should be done online. In my estimation, we aren't at that point right now.

I'm going to close this PR now because it sounds like it is out of scope for our needs, and we solved issue #161 via #164, which supersedes this PR. Please reopen if you disagree.
Agreed. I hope this work can be used later on, but for now I do think we need to explore other approaches. I don't think we necessarily want to support offline computation of the MOC for grids bigger than EC60to30. I would think that's what analysis mode in MPAS itself could be used for. |
This removes a memory error resulting from the conversion of an xarray dataset to numpy,
e.g., attempting to load a dataset larger than on-node memory. Use of xarray and
dask prevents this problem via memory chunking.
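As a rough, hypothetical illustration of the chunking idea described in this commit message (the file pattern and chunk size are assumptions; the variable and dimension names follow the snippets above):

```python
import xarray as xr

# Open the monthly output files lazily, one time slice per dask chunk, so no
# single array larger than a chunk needs to be resident in memory at once.
# The file pattern here is hypothetical.
ds = xr.open_mfdataset('timeSeriesStats.*.nc', chunks={'Time': 1})

# Operations remain lazy; dask evaluates them chunk by chunk on compute(),
# instead of converting the whole dataset to an in-memory numpy array.
meanVel = ds.avgNormalVelocity.mean(dim='Time').compute()
```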