Fixes MOC memory error #164
Conversation
This PR is to resolve #161 and is pending further testing.

@pwolfram, I'm happy to test this on all the runs that support the MOC (i.e. where we have appropriate mask files). Is it at the point where I should do that, or are you still verifying that the end result isn't changed from before?

Thanks @xylar, I have a long test running and am working on getting a fast test completed to do the verification. Can you please take a quick look at the code to make sure you are ok with the changes so far? One thing I did not do here was to remove the reading and writing of the cached files via

@pwolfram, we definitely want to keep the reading and writing of the cache files. As you see, the computation takes a long time, so we definitely only want to do it once per result, even if the analysis gets run multiple times as the job progresses. If you want to convert the code to use xarray to do that I/O, that's fine. But please don't get rid of it entirely. And please do make sure that it can successfully pick up where it left off by, say, requesting a time series of 2 years and then one of 3 years. I'll take a look at the code and see whether I have any specific comments.

Sorry-- I meant switch out
    transportZ = (refLayerThickness*horizontalVel.where(transect)
                  *transectEdgeMaskSigns.where(transect)
                  *dvEdge.where(transect)
                  ).sum('nEdges')
@pwolfram, very nice! I had seen that this could be vectorized and I appreciate you doing it.
Thanks @xylar!
Very nice @pwolfram! I spent some time trying to get this right without the loop, alas unsuccessfully. Thanks for taking care of it.
I haven't taken a look at the code in a great deal of detail, but what I see looks great!
    mocTop = xr.DataArray(np.zeros([len(nz), len(latBins)]),
                          coords=[nz, latBins])
    mocTop[1:, 0] = transportZ.cumsum(axis=0)
    for iLat in np.arange(1, len(latBins)):
We should be able to remove this loop once pydata/xarray#924 is realized.
    cellbins = np.logical_and(latCell >= latBins[iLat-1],
                              latCell < latBins[iLat])
    mocTop[:, iLat] = mocTop[:, iLat-1] + velArea.where(cellbins).sum('nCells')

is just clunky and probably slow.
Yeah, I saw your comment. But I don't think that loop is a problem. There's a lot of computation going on in there so I can't imagine vectorizing it will make a huge difference.
I doubt the line you pointed out is that slow. I'm sure it's negligible in comparison to the sum below.
Things aren't always intuitive when it comes to dask because some optimizations can be performed; e.g., groupby_bins was faster than a by-hand equivalent. But you are probably right, unless there is some fancy graph-based optimization that could be used in the generalized, multi-index groupby_bins approach.
Just be careful about not over-optimizing code that isn't actually a bottleneck...
Agreed
@pwolfram, in order to use groupby_bins here, couldn't you just mask out the full dataset much earlier on using regionCellMask (probably via isel), and then apply groupby_bins with latitude bins? That way, you don't need two arguments to groupby.
The problem here is that mocTop needs to end up as a 2D array. groupby_bins will collapse that down to a single binned variable. In the example at http://xarray.pydata.org/en/stable/groupby.html, lat gets reduced, and we need it to stay unreduced.
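For illustration, here is a minimal, self-contained sketch of the binning-plus-accumulation idea being discussed, using synthetic stand-in arrays with the names from the snippets above. It shows how groupby_bins collapses nCells into a binned dimension and how a cumulative sum along the bins mirrors the accumulation in the iLat loop; it omits the transportZ boundary column and is an illustration of the idea, not a verified drop-in replacement.

```python
import numpy as np
import xarray as xr

# Synthetic stand-ins for velArea (nVertLevels x nCells) and latCell (nCells);
# in the real code these come from the MPAS output dataset.
nCells, nVertLevels = 10, 3
latCell = xr.DataArray(np.linspace(-80., 80., nCells), dims='nCells',
                       name='latCell')
velArea = xr.DataArray(np.random.rand(nVertLevels, nCells),
                       dims=('nVertLevels', 'nCells'))
latBins = np.linspace(-90., 90., 7)

# groupby_bins collapses 'nCells' into a binned 'latCell_bins' dimension ...
binnedSum = velArea.groupby_bins(latCell, bins=latBins).sum('nCells')

# ... and a cumulative sum along the bins then accumulates the per-bin
# contributions, analogous to what the explicit iLat loop does one bin at a time.
mocTopSketch = binnedSum.cumsum('latCell_bins')
print(mocTopSketch.dims, mocTopSketch.shape)
```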
@pwolfram, I couldn't resist testing. On the

@xylar, is the

@pwolfram, yep, just look in the

If the

Thanks @xylar!

Sorry, let me amend that to say, look in

Gotcha, just reproduced the error.
Force-pushed from 3586d8a to d64f127.
So my tests suggest that the climatology calculation is, indeed, using reasonable amounts of memory. It's super slow so far on the EC60to30 grid, but maybe you can't get the one without the other?

Rerunning the QU240 case with your latest push.

I'm not sure I understand what you mean-- are you saying that now that the memory is under control, the runtime is slow?

Yep, seems that way.

QU240 is still crashing. Let me know if you're not seeing that on your side, and I'll post details.

I've got it too-- debugging now.
Force-pushed from d64f127 to 1e4f925.
@xylar, the code completed and the output is visually identical. They aren't strictly BFB, but are practically:

I'll also run a quick performance comparison between the two, although to be fair this isn't as representative as a full production run would be.
    load = partial(preload, maxgb=config.getfloat('input',
                                                  'reasonableArraySizeGB'))
    horizontalVel = load(ds.avgNormalVelocity.isel(Time=tIndex))
    velArea = load(ds.avgVertVelocityTop.isel(Time=tIndex) * areaCell)
Wouldn't it actually be clearer to read if you just cached the value of reasonableArraySizeGB and had 2 calls to preload without the need to use partial? If we want future developers to use preload, I think we want it to be clear how to use it.
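For concreteness, a sketch of that suggested alternative, reusing the names from the snippet above (preload, config, ds, tIndex, and areaCell are assumed to be defined in the surrounding code, and preload's maxgb keyword is taken from the partial call; maxArraySizeGB is just a local name chosen here).

```python
# Read the size threshold once from the config file ...
maxArraySizeGB = config.getfloat('input', 'reasonableArraySizeGB')

# ... then call preload directly for each variable, so a reader can see at a
# glance what arguments preload takes.
horizontalVel = preload(ds.avgNormalVelocity.isel(Time=tIndex),
                        maxgb=maxArraySizeGB)
velArea = preload(ds.avgVertVelocityTop.isel(Time=tIndex) * areaCell,
                  maxgb=maxArraySizeGB)
```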
The reason I did it this way is to parallel the xarray load method. Also, it seems like reasonableArraySizeGB isn't really something we need to change per call to preload. But partial is confusing for first-time users. I also thought about making reasonableArraySizeGB a global but decided against it for obvious reasons.
Do you think this needs to be changed?
It's kind of a style choice but I'd prefer having it be as straightforward as possible for someone who might use this code as a starting point to figure out how to do something similar in another analysis.
@pwolfram, I noticed an irritating bug where the figure title and caption are using

@pwolfram, so it's still quite slow in my testing. Suppose we want to do a 50-year time series; this is going to take hours and hours. What kind of performance were you getting?

@xylar, can you please get @vanroekel set up with the files needed to test the 18to6? We should be set up from #164 for this to work, and if it doesn't, we should know sooner rather than later. Someone will eventually want to run the MOC on this case, and we need to make sure that it works before someone else finds a bug before we do. I'm very grateful you found #161 before it caused a bigger problem. Thank you!

@xylar, is
Force-pushed from 3e9797b to c2ddc7c.
Sorry @xylar, thanks for catching that!

I'm working on it now. I'm skeptical that the performance will be good enough to be practical. The RRS18to6 mesh has about 24 times as many edges and cells as the EC60to30, so each computation will take at least that factor longer. With each month of the MOC currently taking several minutes, I don't think the current performance is even practical for the EC60to30 runs, let alone for the bigger ones.
@vanroekel, if you look on LANL turquoise for the file, this is the RRS18to6 version of the regional mask file for the Atlantic. This should hopefully be all you need to run the streamfunctionMOC analysis. Let me know if you have trouble.

@xylar thanks! I'm trying now. I'm also testing this branch on the core2 G-case (6030 out 110 years now).

So to give you guys a sense of the performance I'm seeing on Edison:

Multiply that by 24, and I would expect it to take at least a full day to compute a 10-year MOC climatology on the 18to6 and about an hour to compute a single Atlantic MOC time point on the RRS18to6 grid. Like I said, I don't really think this postprocessing of the MOC is going to be practical for the big grids.

Update on my testing. I've been running the 18to6 for almost 9 hours and it has not completed. This is on 1 year of data. I'm also running the 100 years of CORE2 data on blues; it has not completed either (nearly 9 hours as well). I see this output thus far:

Is there a way to check whether this is working? Am I missing a necessary setting? My config file is

I was able to run 70 years through the MOC in 2.5 hours before without issue, so I feel like I'm not setting this up right.
@vanroekel, you can take a look at the cache files in

As I have commented above, this move to purely xarray seems to have slowed things down a lot (maybe by a factor of 10) from what is now in

I don't think I have access to the machine where you're running to look at your scripts. But (other than the weird "invalid value encountered" warning) everything sounds like it's behaving as expected, and you didn't configure anything wrong.

@xylar, I agree. @vanroekel, can you confirm whether
Force-pushed from 287891e to 0f85336.
This prevents memory errors and allows for dask-enabled computational threading.
Force-pushed from 0f85336 to 43f3712.
Just talked with @vanroekel-- even

Do we want / need to address this issue now, @xylar? We can do so with xarray + dask.distributed, using multiple nodes to load in data in parallel. But this is a level of complexity that it sounds like we currently don't want or need to support, especially since the MOC should be done online. In my estimation, we aren't at that point right now.

I'm going to close this PR now because it sounds like it is out of scope for our needs, and we solved issue #161 via #164, which supersedes this PR. Please reopen if you disagree.
Agreed. I hope this work can be used later on, but for now I do think we need to explore other approaches. I don't think we necessarily want to support offline computation of the MOC for grids bigger than EC60to30. I would think that's what analysis mode in MPAS itself could be used for. |
This removes a memory error resulting from the conversion of an xarray dataset to numpy,
e.g., attempting to load a dataset larger than on-node memory. Use of xarray and
dask prevents this problem via memory chunking.
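As a rough, hypothetical illustration of the chunking idea described in this commit message (the file pattern and chunk size are assumptions; the variable and dimension names follow the snippets above):

```python
import xarray as xr

# Open the monthly output files lazily, one time slice per dask chunk, so no
# single array larger than a chunk needs to be resident in memory at once.
# The file pattern here is hypothetical.
ds = xr.open_mfdataset('timeSeriesStats.*.nc', chunks={'Time': 1})

# Operations remain lazy; dask evaluates them chunk by chunk on compute(),
# instead of converting the whole dataset to an in-memory numpy array.
meanVel = ds.avgNormalVelocity.mean(dim='Time').compute()
```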