Weight problems in cdutil.ANNUALCYCLE.climatology (daily data) #1664
Thanks! I will take a look next week.
I don't know if the following should be a separate issue. @dnadeau4, if you look at the code, can you see if there is some room for improving the speed of cdutil.ANNUALCYCLE.climatology and similar functions? As you can see in the output above, my hard-coded climatology is 58 times faster than cdutil.ANNUALCYCLE.climatology, and 64 times faster on the full dataset (20452 time steps, instead of the 3653 sample steps I supplied). And that's with print statements left in and without trying to reduce RAM usage.

I agree that it's much easier to get something fast when you know the calendar, the frequency of the samples (daily values here), etc., and can hard-code the processing, but it would be useful to have something faster that we can sell to the users. Otherwise I have to tell the users: if you want something faster, just use `cdo ymonmean data.nc climato.nc`.

It would also be useful to be able to pass a file variable as a parameter to cdutil.ANNUALCYCLE.climatology (and similar functions), and let the function itself take care of looping over small chunks of time steps (rather than loading the whole variable), in order to reduce the memory footprint of the operation and to be able to process arbitrarily long datasets.
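For reference, a minimal sketch of the kind of hand-coded computation I mean, assuming daily data on a Gregorian calendar (the function and argument names are illustrative, not cdutil API):

```python
import numpy as np

def daily_to_monthly_climatology(data, dates):
    # data: (ntime, ...) numpy array of daily values
    # dates: matching sequence of datetime.date objects (Gregorian calendar)
    months = np.array([d.month for d in dates])
    clim = np.empty((12,) + data.shape[1:], dtype=np.float64)
    for m in range(1, 13):
        # averaging over every day of month m across all years gives each
        # day equal weight, so a leap-year February contributes 29 days
        clim[m - 1] = data[months == m].mean(axis=0)
    return clim
```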
I will create a new API with a fast ANNUALCYCLE to satisfy some users. This will be for advanced users, and I will warn them to check results against the long-haul "averager" at least once. When calendars, units, or other metadata mismatch, no check will be done and the data will be "embarrassingly" averaged. 😏 Even with the warnings, I am afraid erroneous computations will be published.
@doutriaux1 the time axis of the sample data I posted is in days. I had some doubts about the standard calendar, but the fact that the last time step of the dataset is recognized as December 31st (using asComponentTime), as we expected, is a good sign that cdms2 used a Gregorian calendar.
@dnadeau4 A fast interface would be nice, but I'm not sure that allowing potential errors is a good idea. :/ I'm rather thinking (though this may be naive, and maybe this part of cdutil already works that way) that there could be a two-step process:
@jypeter that is what it's doing, and that's why it's slow 😜 except for the optimized second step. In general I think we should NEVER think we are smarter than the user and
OK, so I guess it would be nice to have a documented fast ANNUALCYCLE, so that we don't have to reinvent the wheel when we want something faster. And it's a good idea to have it print some warning messages, unless the user explicitly passes an option to disable them.

If I want ultimate speed I will use a python/shell script to wrap cdo commands, but I like not having to call external programs when running a python script, and I don't know how much I can trust cdo to check whether time axes are correct.

By the way, is there already some kind of support function that a user could call on a variable to check whether the time axis (metadata + values + bounds) is approximately OK, and that would list potential errors (and suggest fixes) otherwise?

cdutil and genutil (among other things) are why I keep using and recommending UV-CDAT rather than Canopy or Anaconda, and they should be advertised more.
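As a rough idea of what such a checker could look like, here is a hypothetical sketch (not an existing cdutil/cdms2 function; `check_time_axis` is a made-up name, and it assumes a cdms2 variable with a CF-style time axis):

```python
import numpy as np

def check_time_axis(var):
    """Return a list of suspected problems with var's time axis."""
    problems = []
    t = var.getTime()
    if t is None:
        return ["no time axis found"]
    vals = np.asarray(t[:])
    if not np.all(np.diff(vals) > 0):
        problems.append("time values are not strictly increasing")
    if "since" not in getattr(t, "units", ""):
        problems.append("time units missing or not of the form '<unit> since <date>'")
    bounds = t.getBounds()
    if bounds is None:
        problems.append("no time bounds set (weights may have to be guessed)")
    elif not np.allclose(bounds[1:, 0], bounds[:-1, 1]):
        problems.append("time bounds are not contiguous")
    return problems
```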
About speed, I intend to make some changes in UV-CDAT after the next release which will improve performance somewhat. For the ACME project we had to compute climatologies on a poorly configured machine where starting up I/O was enormously expensive. There was plenty of room to write something faster (by a factor of a thousand!) while still writing it in UV-CDAT-based Python. Most changes were designed to minimize I/O; there is a lot of it buried down deep. But doing this involved multiple assumptions about the data, including calendar and time frequency. On the other hand, my feeling is that some of those assumptions could be relaxed, and the others could at least be checked. Aside from all that, I'm not sure how much payoff there would be on normal platforms.
I think it's good to allow I/O and the use of file variables, so that you can process a long variable without having to first load all of its time steps into memory.
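A minimal sketch of that chunked approach, assuming daily data and a cdms2 file variable (e.g. `fvar = cdms2.open("data.nc")["tas"]`; the function name and `chunk` size are illustrative):

```python
import numpy as np

def chunked_monthly_climatology(fvar, chunk=365):
    t = fvar.getTime()
    months = np.array([c.month for c in t.asComponentTime()])
    ntime = len(months)
    sums = counts = None
    for i0 in range(0, ntime, chunk):
        block = np.asarray(fvar[i0:i0 + chunk])  # reads only this slab from disk
        if sums is None:
            sums = np.zeros((12,) + block.shape[1:])
            counts = np.zeros(12)
        for m in range(12):
            sel = months[i0:i0 + chunk] == m + 1
            if sel.any():
                sums[m] += block[sel].sum(axis=0)
                counts[m] += sel.sum()
    # day-weighted monthly means; assumes at least one day of every month
    return sums / counts.reshape((12,) + (1,) * (sums.ndim - 1))
```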
I have found out that when using cdutil.ANNUALCYCLE.climatology on daily data, the months of different years seem to be equally weighted.
This results in (slight but visible) errors:
Test script: https://files.lsce.ipsl.fr/public.php?service=files&t=115de1c1cb57217b85776fd08fe21b23
Test data: https://files.lsce.ipsl.fr/public.php?service=files&t=c7258c48f12d3451bfa67e5585bbedef
Running the test script above produces the following output, where you can see that the difference between ANNUALCYCLE.climatology and the computation by hand is ~1e-6 for all months, including an unweighted February. But there is a bigger difference when I compute a correctly weighted February mean.
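A toy illustration of the discrepancy (made-up numbers): the mean of per-year February means vs. a mean that weights each day equally across leap (29-day) and non-leap (28-day) years.

```python
import numpy as np

feb_means = np.array([10.0, 12.0])  # hypothetical per-year February means
feb_days = np.array([28, 29])       # the second year is a leap year

unweighted = feb_means.mean()                             # 11.0
weighted = (feb_means * feb_days).sum() / feb_days.sum()  # ~11.0175

print(weighted - unweighted)  # ~0.0175: the kind of gap reported above
```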