{{ message }}

# Weight problems in cdutil.ANNUALCYCLE.climatology (daily data)#4

Closed
opened this issue Nov 23, 2016 · 7 comments
Closed

# Weight problems in cdutil.ANNUALCYCLE.climatology (daily data)#4

opened this issue Nov 23, 2016 · 7 comments
Labels
Milestone

### chaosphere2112 commented Nov 23, 2016

 I have found out that when using cdutil.ANNUALCYCLE.climatology on daily data, the months of different years seem to be equally weighted. This results in (slight but visible) errors: for February: eg for years, when averaging the February values for 1959-1960-1961, the weights used seem to be [1, 1, 1], when they should be the actual month length: [28, 29, 28] for months that have missing days: e.g. if the first month of the data file starts on January 2nd, but all the other January month in the time series have 31 days, the weights used seem to be [1, 1, 1, ...] when they should be [30, 31, 31, ..., 31] Running the test script above will get the following output, and you can see that the difference between ANNUALCYCLEclimatology and computation by hand for all months, including unweighted February is ~1e6. But there is a bigger difference when I compute a correctly weighted February mean python -i bug_ANNUALCYCLEclimatology.py Variable shape = (3653, 9, 11) Time axes go from 1959-1-1 12:0:0.0 to 1968-12-31 12:0:0.0 Latitude values = [-66.7283256 -67.84978441 -68.97123954 -70.09269035 -71.21413608 -72.33557575 -73.45700815 -74.57843166 -75.69984422] Longitude values = [ 120.375 121.5 122.625 123.75 124.875 126. 127.125 128.25 129.375 130.5 131.625] Fri Nov 6 17:35:08 2015 - Computing the climatology with cdutil.ANNUALCYCLE.climatology, please wait... Fri Nov 6 17:35:26 2015 - Done! time.clock diff = 17.36 Fri Nov 6 17:35:26 2015 - Computing the climatology by hand... year, month, nb of selected days = 1959 2 28 year, month, nb of selected days = 1960 2 29 year, month, nb of selected days = 1961 2 28 year, month, nb of selected days = 1962 2 28 year, month, nb of selected days = 1963 2 28 year, month, nb of selected days = 1964 2 29 year, month, nb of selected days = 1965 2 28 year, month, nb of selected days = 1966 2 28 year, month, nb of selected days = 1967 2 28 year, month, nb of selected days = 1968 2 29 Difference range for month 1 = (-4.903731820604662e-06, 6.318861416332311e-06) Difference range for February (unweighted) = (-5.2588326724389844e-06, 5.322724128120626e-06) Difference range for February (WEIGHTED) = (-0.014045400572531008, -0.0006070483053850495) Difference range for month 3 = (-7.192550192769431e-06, 6.8172331566529465e-06) Difference range for month 4 = (-8.519490563685395e-06, 6.39597574547679e-06) Difference range for month 5 = (-7.727838379878449e-06, 7.407895935784836e-06) Difference range for month 6 = (-8.519490563685395e-06, 6.81559244952723e-06) Difference range for month 7 = (-7.284841260002395e-06, 9.992045690410123e-06) Difference range for month 8 = (-8.761498278886393e-06, 7.900114923131696e-06) Difference range for month 9 = (-8.850097671597723e-06, 7.451375310552066e-06) Difference range for month 10 = (-7.30945221505408e-06, 6.18965391652182e-06) Difference range for month 11 = (-6.2370300426550784e-06, 6.243387851156967e-06) Difference range for month 12 = (-6.288097772255696e-06, 6.534207223296562e-06) Fri Nov 6 17:35:26 2015 - Finished computing the climatology by hand! time.clock diff = 0.3 Acceleration = 57.8666666667 Migrated from: CDAT/cdat#1664 The text was updated successfully, but these errors were encountered:

### lee1043 commented Feb 27, 2017

 Same issue found when using cdutil.ANNUALCYCLE.departures(). Wondering if there was any progress with this issue...? @chaosphere2112 @doutriaux1

### durack1 commented Feb 27, 2017

 @jypeter it seems you were lost in the original issue migration across repos, so pinging you here for your input

### golaz commented Mar 3, 2017

 Thanks to @jypeter for reporting this bug. Let's hope it gets fixed soon. Adding my 2 cents here regarding speed. I came up with an alternate direct computation of the climatologies using only 4 lines of code and got an even bigger speed-up. I'm seeing a 30x speed-up over cdutil with @jypeter's implementation and 1500x speed-up with my alternate implementation. Here are the key lines: # Compute climatologies climato_values2 = np.zeros([12]+list(np.shape(var))[1:]) month = np.array( [ timax_absolute[i].month for i in range(len(timax_absolute)) ], dtype=np.int) for i in range(1,13): climato_values2[i-1] = np.average(var[np.argwhere(month==i)],axis=0) Attached is the script and below the full output. Just trying to raise the bar for @dnadeau4 and @doutriaux1 😉 Variable shape = (3653, 9, 11) Time axes go from 1959-1-1 12:0:0.0 to 1968-12-31 12:0:0.0 Latitude values = [-66.7283256 -67.84978441 -68.97123954 -70.09269035 -71.21413608 -72.33557575 -73.45700815 -74.57843166 -75.69984422] Longitude values = [ 120.375 121.5 122.625 123.75 124.875 126. 127.125 128.25 129.375 130.5 131.625] Thu Mar 2 17:47:15 2017 - Computing the climatology with cdutil.ANNUALCYCLE.climatology, please wait... Thu Mar 2 17:47:20 2017 - Done! time.clock diff = 4.515801 Thu Mar 2 17:47:20 2017 - Computing the climatology by hand... year, month, nb of selected days = 1959 2 28 year, month, nb of selected days = 1960 2 29 year, month, nb of selected days = 1961 2 28 year, month, nb of selected days = 1962 2 28 year, month, nb of selected days = 1963 2 28 year, month, nb of selected days = 1964 2 29 year, month, nb of selected days = 1965 2 28 year, month, nb of selected days = 1966 2 28 year, month, nb of selected days = 1967 2 28 year, month, nb of selected days = 1968 2 29 Difference range for month 1 = (-4.903731820604662e-06, 6.318861416332311e-06) Difference range for February (unweighted) = (-5.2588326724389844e-06, 5.322724128120626e-06) Difference range for February (WEIGHTED) = (-0.014047166386458088, -0.0006071392919806406) Difference range for month 3 = (-7.192550192769431e-06, 6.8172331566529465e-06) Difference range for month 4 = (-8.519490563685395e-06, 6.39597574547679e-06) Difference range for month 5 = (-7.727838379878449e-06, 7.407895935784836e-06) Difference range for month 6 = (-8.519490563685395e-06, 6.81559244952723e-06) Difference range for month 7 = (-7.284841260002395e-06, 9.992045690410123e-06) Difference range for month 8 = (-8.761498278886393e-06, 7.900114923131696e-06) Difference range for month 9 = (-8.850097671597723e-06, 7.451375310552066e-06) Difference range for month 10 = (-7.30945221505408e-06, 6.18965391652182e-06) Difference range for month 11 = (-6.2370300426550784e-06, 6.243387851156967e-06) Difference range for month 12 = (-6.288097772255696e-06, 6.534207223296562e-06) Thu Mar 2 17:47:20 2017 - Finished computing the climatology by hand! time.clock diff = 0.155475 Acceleration = 29.0451905451 Thu Mar 2 17:47:20 2017 - Computing the climatology by hand, second algorithm Difference range for month 1 = (-6.318861437648593e-06, 4.903731806393807e-06) Difference range for month 2 = (-2.3858707294266424e-06, 2.938530052176702e-06) Difference range for month 3 = (-6.8172331779692286e-06, 7.192550164347722e-06) Difference range for month 4 = (-6.39597574547679e-06, 8.519490556579967e-06) Difference range for month 5 = (-7.4078959784174e-06, 7.727838337245885e-06) Difference range for month 6 = (-6.81559244952723e-06, 8.519490556579967e-06) Difference range for month 7 = (-9.992045747253542e-06, 7.284841231580685e-06) Difference range for month 8 = (-7.900114979975115e-06, 8.761498236253829e-06) Difference range for month 9 = (-7.451375324762921e-06, 8.850097657386868e-06) Difference range for month 10 = (-6.189653952048957e-06, 7.309452179526943e-06) Difference range for month 11 = (-6.243387858262395e-06, 6.237030028444224e-06) Difference range for month 12 = (-6.5342072517182714e-06, 6.288097750939414e-06) Thu Mar 2 17:47:20 2017 - Finished computing the climatology by hand, second algorithm! time.clock diff = 0.002744 Acceleration = 1645.70007289 bug_ANNUALCYCLEclimatology2.py

### doutriaux1 commented Mar 6, 2017

 @golaz this a totally different issues your script will barf on any general dataset. This is the difference between one off user geenrated script that works for his data only and general application software that works on any dataset. cdutil will work on 6hourly data for example. It will work on data starting in February. It will work on data with missing months, etc...

### golaz commented Mar 6, 2017

 @doutriaux1, sure, everyone understands that there will be a performance penalty for generality, and generality is what makes cdutil in general and ANNUALCYCLE in particular so useful. Still, a factor 1500x in speed would suggest that there is substantial room for performance improvement without loss of generality. Any progress with the fix?

### doutriaux1 commented Mar 6, 2017

 @golaz I believe fix is in place. @dnadeau4 ?

added the bug label May 9, 2017
added this to the 2.10 milestone May 9, 2017
closed this May 9, 2017