New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Weight problems in cdutil.ANNUALCYCLE.climatology (daily data) #1664

Open
jypeter opened this Issue Nov 6, 2015 · 9 comments

Comments

Projects
None yet
4 participants
@jypeter

jypeter commented Nov 6, 2015

I have found out that when using cdutil.ANNUALCYCLE.climatology on daily data, the months of different years seem to be equally weighted.

This results in (slight but visible) errors:

  • for February: eg for years, when averaging the February values for 1959-1960-1961, the weights used seem to be [1, 1, 1], when they should be the actual month length: [28, 29, 28]
  • for months that have missing days: e.g. if the first month of the data file starts on January 2nd, but all the other January month in the time series have 31 days, the weights used seem to be [1, 1, 1, ...] when they should be [30, 31, 31, ..., 31]

Test script: https://files.lsce.ipsl.fr/public.php?service=files&t=115de1c1cb57217b85776fd08fe21b23
Test data: https://files.lsce.ipsl.fr/public.php?service=files&t=c7258c48f12d3451bfa67e5585bbedef

Running the test script above will get the following output, and you can see that the difference between ANNUALCYCLEclimatology and computation by hand for all months, including unweighted February is ~1e6. But there is a bigger difference when I compute a correctly weighted February mean

python -i bug_ANNUALCYCLEclimatology.py
Variable shape = (3653, 9, 11)
Time axes go from 1959-1-1 12:0:0.0 to 1968-12-31 12:0:0.0
Latitude values = [-66.7283256  -67.84978441 -68.97123954 -70.09269035 -71.21413608
 -72.33557575 -73.45700815 -74.57843166 -75.69984422]
Longitude values = [ 120.375  121.5    122.625  123.75   124.875  126.     127.125  128.25
  129.375  130.5    131.625]
Fri Nov  6 17:35:08 2015  - Computing the climatology with cdutil.ANNUALCYCLE.climatology,
                please wait...
Fri Nov  6 17:35:26 2015  - Done!
time.clock diff = 17.36

Fri Nov  6 17:35:26 2015  - Computing the climatology by hand...
year, month, nb of selected days = 1959  2 28
year, month, nb of selected days = 1960  2 29
year, month, nb of selected days = 1961  2 28
year, month, nb of selected days = 1962  2 28
year, month, nb of selected days = 1963  2 28
year, month, nb of selected days = 1964  2 29
year, month, nb of selected days = 1965  2 28
year, month, nb of selected days = 1966  2 28
year, month, nb of selected days = 1967  2 28
year, month, nb of selected days = 1968  2 29
Difference range for month 1 = (-4.903731820604662e-06, 6.318861416332311e-06)
Difference range for February (unweighted) = (-5.2588326724389844e-06, 5.322724128120626e-06)
Difference range for February (WEIGHTED) = (-0.014045400572531008, -0.0006070483053850495)
Difference range for month 3 = (-7.192550192769431e-06, 6.8172331566529465e-06)
Difference range for month 4 = (-8.519490563685395e-06, 6.39597574547679e-06)
Difference range for month 5 = (-7.727838379878449e-06, 7.407895935784836e-06)
Difference range for month 6 = (-8.519490563685395e-06, 6.81559244952723e-06)
Difference range for month 7 = (-7.284841260002395e-06, 9.992045690410123e-06)
Difference range for month 8 = (-8.761498278886393e-06, 7.900114923131696e-06)
Difference range for month 9 = (-8.850097671597723e-06, 7.451375310552066e-06)
Difference range for month 10 = (-7.30945221505408e-06, 6.18965391652182e-06)
Difference range for month 11 = (-6.2370300426550784e-06, 6.243387851156967e-06)
Difference range for month 12 = (-6.288097772255696e-06, 6.534207223296562e-06)
Fri Nov  6 17:35:26 2015  - Finished computing the climatology by hand!
time.clock diff = 0.3
Acceleration = 57.8666666667

@dnadeau4 dnadeau4 self-assigned this Nov 6, 2015

@dnadeau4 dnadeau4 added this to the 3.0 milestone Nov 6, 2015

@dnadeau4

This comment has been minimized.

Contributor

dnadeau4 commented Nov 6, 2015

Thanks! I will take a look next week.

@jypeter

This comment has been minimized.

jypeter commented Nov 9, 2015

I don't know if the following should be a separate issue

@dnadeau4 if you look at the code, can you see if there is some room for improving the speed of cdutil.ANNUALCYCLE.climatology, and similar functions?

As you can see in the output above, my hard-coded climatology is 58 times faster than cdutil.ANNUALCYCLE.climatology. And it's 64 times faster for the full dataset (20452 time steps, instead of the 3653 sample steps I supplied). And that's with print statement and without trying to reduce ram usage

I agree it's much easier to get something faster when you know what the calendar is, the frequency of the samples (daily values here), ..., and you can hard code the processing, but it would be useful to get something faster that we can sell to the users. Otherwise, I have to tell the users: if you want something faster, just use cdo ymonmean data.nc climato.nc

What would also be useful, would be to be able to pass a file variable as a parameter to cdutil.ANNUALCYCLE.climatology (and similar functions), and let the function take care itself of looping on small chuncks of time steps (rather than loading the whole variable) in order to reduce the memory footprint of the operation, and be able to process arbitrarily long data sets

#huge_var = f('my_long_variable')
file_var = f['my_long_variable']
my_climato = cdutil.ANNUALCYCLE.climatology(file_var)
@doutriaux1

This comment has been minimized.

Member

doutriaux1 commented Nov 9, 2015

@jypeter yes it is slow, but as you mentioned you can have it right or you can have it fast 😉

I will take a look, also remember that the weights depend on the time units, if your data are in "months since" then each month would be weighted the same. Anyhow will take a look, probably with @dnadeau4

@dnadeau4

This comment has been minimized.

Contributor

dnadeau4 commented Nov 10, 2015

I will create a new API with a fast ANNUALCYCLE to satisfy some users. This will be for advanced users and I will warn them to check results with the long haul "averager" at least once. When calendars, units are other other metadata mismatch, no check will be done and data will be "embarrassingly" averaged. 😏

Even with the warnings, I am afraid erroneous computations will be published.
Scary!!

@jypeter

This comment has been minimized.

jypeter commented Nov 10, 2015

@doutriaux1 the time axis of the sample data I have posted is in days. I had some doubts about the standard calendar, but the fact that the last time step of the dataset is recognized as December 31st (using asComponentTime), which we expected, is a good sign that cdms2 used a Gregorian Calendar

        double time(time) ;
                time:standard_name = "time" ;
                time:long_name = "time" ;
                time:units = "days since 1958-01-01 12:00:00" ;
                time:calendar = "standard" ;

@dnadeau4 A fast interface would be nice, but I'm not sure that allowing potential errors is a good idea. :/ I'm rather thinking, but this may be naive, and maybe this part of cdutil already works that way, that there could be a 2 steps process:

  • Get the values of the time axis and the metadata. This can't be a lot of data, so it can probably be processed/analyzed quickly. Do all the quality control and stuff (is the calendar recognized correctly? are the values of the time axis strictly increasing? Is the difference between 2 consecutive time steps always the same or not? etc..) on this time data
  • Based on the knowledge gained during the first step, use some optimized array syntax (and the fact that we have a file variable, when that's the case) to process the data. We may have a lot of data here, but appropriate array syntax will make it possible to process the data quickly. You could automatically call the fast API here, if the required conditions are met
@doutriaux1

This comment has been minimized.

Member

doutriaux1 commented Nov 10, 2015

@jypeter that is what it's doing and that's why it's slow 😜 except for the optimized second setp.

In general I think we should NEVER think we are smarter than the user and assume things they want.

@jypeter

This comment has been minimized.

jypeter commented Nov 18, 2015

OK, so I guess it would be nice to have a documented fast ANNUALCYCLE so that we don't have to reinvent the wheel if we want to get something faster. And it's a good idea to have this print some warning messages, unless the user explicitly pass an option to disable the messages (...,I_Hopefully_Think_I_Know_What_Im_Doing=True))

If I want ultimate speed I will use a python/shell script to wrap cdo commands, but I like not having to call external programs when running a python script, and I don't know how much I can trust cdo for checking if time axes are correct

By the way, is there already some kind of support function that a user could call on a variable to check if the time axis (metadata + values + bounds) is approximately OK and that would list potential errors (and suggest fixes) otherwise?

cdutil and genutil (among other things) are why I keep using and recommending UV-CDAT rather than Canopy and Anaconda, and should be more advertised

@painter1

This comment has been minimized.

Contributor

painter1 commented Nov 18, 2015

About speed, I intend to make some changes in UV-CDAT after the next release, which will improve performance somewhat.

For the ACME project we had to compute climatologies on a poorly configured machine where starting up I/O was enormously expensive. There was plenty of room to write something faster (by a factor of a thousand!) while still writing it in UV-CDAT-based Python. Most changes were designed to minimize I/O; there is a lot of it buried down deep. But doing this involved multiple assumptions about the data including calendar and time frequency. On the other hand, my feeling is that it some of those assumptions could be relaxed, and the others could at least be checked. Aside from all that, I'm not sure how much payoff there would be on normal platforms.

@jypeter

This comment has been minimized.

jypeter commented Nov 19, 2015

I think it's good to allow I/O and the use of file variables, if you have to process a long variable without having to first load all the time steps in memory

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment