loadCMIP5 profiling #118

bpbond · 2015-04-06T14:33:03Z

I have started some performance tests and immediately see that loadCMIP5 has a big problem when it converts the loaded array to a data frame. It uses the slowest possible method to do so:

> x <- array(1, dim=c(400,400,400))
> system.time(y<-as.data.frame(x))
   user  system elapsed 
  2.550   0.389   2.945 
> system.time(y<-as.numeric(x))
   user  system elapsed 
  0.124   0.158   0.281 
> system.time(y<-reshape2::melt(x))  # this is the method it uses
   user  system elapsed 
  3.213   1.500  52.892
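
For comparison, a long-format data frame can be built directly from the flattened values plus index columns generated with `rep()`. This is only a minimal base-R sketch, not necessarily the approach taken in RCMIP5; the helper name `array_to_long_df` and its column names are made up for illustration, and it assumes a plain 3-D array without dimnames.

```r
# Hypothetical helper (not part of RCMIP5): flatten a 3-D array into a
# long data frame with one row per cell.
array_to_long_df <- function(x) {
  d <- dim(x)
  stopifnot(length(d) == 3)
  # R stores arrays in column-major order, so the first index varies fastest.
  data.frame(
    Var1  = rep(seq_len(d[1]), times = d[2] * d[3]),
    Var2  = rep(rep(seq_len(d[2]), each = d[1]), times = d[3]),
    Var3  = rep(seq_len(d[3]), each = d[1] * d[2]),
    value = as.vector(x)
  )
}

x <- array(seq_len(24), dim = c(2, 3, 4))
df <- array_to_long_df(x)
```

Each row of `df` can be checked against the source array with `x[cbind(df$Var1, df$Var2, df$Var3)]`.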

ktoddbrown · 2015-04-08T22:59:24Z

| method | loadTime | mergeTime | annualTime | globalTime | fileSize [Mb] | annualSize [Mb] | globalSize [Mb] | loadSize [Mb] | fileReadSize [Mb] | inMemSize [Mb] |
|---|---|---|---|---|---|---|---|---|---|---|
| RCMIP5v1.1 | 7.331 | 7.588 | 466.092 | NA | 33.43258 | 24.069296 | NA | 300.68803 | NA | NA |
| ncdf-array | 0.408 | 0.869 | 9.994 | 0.041 | 33.43258 | 5.345488 | 0.000232 | 66.81897 | NA | NA |
| ncdf-dataframe | 1.478 | 747.486 | 386.428 | 8.963 | 33.43258 | 17.373160 | 0.001192 | 267.26512 | NA | NA |
| raster | 0.145 | 10.457 | 4.222 | 0.097 | 33.43258 | 5.804424 | 0.001856 | NA | 3.758504 | 66.87124 |

Here is my quick and dirty profile evaluation (running on a 2011 MacBook Air that was multi-tasking). Clearly I did NOT quite reach the level of optimization in my data frame functions that is currently in RCMIP5, but it gives a pretty good overview of the three candidate flavors we are looking at. Array seems to be both fast and small. From a memory perspective, the raster package is good if we manage everything from file instead of in memory, but it is relatively slow. We could consider rolling our own 'raster' package where we read in and process 'chunks' of the netcdf file. But, from this run, I'm inclined to fall back on array. Thoughts?
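
To make the size gap concrete, here is a rough base-R sketch comparing a 3-D array of doubles with the same values stored as a long data frame. The grid dimensions, coordinate values, and column names are arbitrary assumptions, not anything from RCMIP5. With three double coordinate columns plus the value column, the long form should come out at roughly four times the array size, which is in line with the loadSize column above (about 267 vs 67 Mb).

```r
# Arbitrary stand-in for a lon x lat x time array (not real CMIP5 data)
lon_vals  <- seq(0, 342, by = 18)    # 20 longitudes
lat_vals  <- seq(-81, 81, by = 18)   # 10 latitudes
time_vals <- seq(0.5, 11.5, by = 1)  # 12 time steps
n_lon <- length(lon_vals)
n_lat <- length(lat_vals)
n_time <- length(time_vals)

x <- array(rnorm(n_lon * n_lat * n_time), dim = c(n_lon, n_lat, n_time))

# Long format: every cell carries its own lon, lat and time coordinates
long <- data.frame(
  lon   = rep(lon_vals, times = n_lat * n_time),
  lat   = rep(rep(lat_vals, each = n_lon), times = n_time),
  time  = rep(time_vals, each = n_lon * n_lat),
  value = as.vector(x)
)

size_array <- as.numeric(object.size(x))
size_long  <- as.numeric(object.size(long))
ratio <- size_long / size_array
```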

bpbond · 2015-04-09T14:22:21Z

Thanks Kathe! I have a bunch of thoughts, as I've also spent the last few days doing quite a few profiling and performance tests.

The data.frame approach (current master branch) is very fast when using dplyr - that's the only reason we changed to it - for timing see here. (I think the numbers in your post above, @ktoddbrown, were produced with the much slower ddply.)
~~There is another advantage to the data frame implementation: it allows us to handle irregular grids, which some CMIP5 models use.~~ By the way @cahartin I now have a branch that correctly loads your MPI models!
Finally, the data frame code is much simpler: easier to reason about, debug, etc.
The catch is that the data frame approach takes much more memory (currently 5x!!!, although this could be improved a bit).
The array approach has two big advantages, no question: memory efficiency and instant access to arbitrary elements.
Side note: a caveat to the last point is that dplyr is much more careful in its memory use than plyr, so that if you can load the memory-intensive data frame, you're good to go, whereas the plyr operations can consume lots of memory during their operation, sometimes killing the machine.
Putting these together, I'd say that if you have enough memory, the data frame approach is the clear winner, but there are definitely cases when arrays make it possible to process files that can't be loaded into a memory-intensive data frame.
In summary, I'm reluctantly coming around to Kathe's earlier suggestion: we should support both approaches--that is, loadCMIP5 should have a flag specifying whether to return a data frame or array, and all further operations support either implementation. The testing code will need to test both, of course.
Finally, we should also work to reduce the memory footprint of cmip5data objects as much as possible: letting loadCMIP5 only load certain lon/lat/Z values (similar to how it can be told to only load certain dates); make the data frame more memory efficient by removing unused columns; etc.
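
As a rough illustration of what "all further operations support either implementation" could look like, here is a base-R sketch of a single operation (mean over time) that accepts both representations. The function name `time_mean`, the column names, and the lon x lat x time dimension order are assumptions made for this example, not the RCMIP5 API; it uses `aggregate()` rather than dplyr only to stay dependency-free.

```r
# Hypothetical helper: mean over time for either representation.
time_mean <- function(val) {
  if (is.data.frame(val)) {
    # Long format with assumed columns lon, lat, time, value
    aggregate(value ~ lon + lat, data = val, FUN = mean)
  } else if (is.array(val) && length(dim(val)) == 3) {
    # Array format with assumed dimension order lon x lat x time
    apply(val, c(1, 2), mean)
  } else {
    stop("val must be a 3-D array or a long-format data frame")
  }
}

arr <- array(as.numeric(1:24), dim = c(2, 3, 4))
long <- data.frame(
  lon   = rep(1:2, times = 12),
  lat   = rep(rep(1:3, each = 2), times = 4),
  time  = rep(1:4, each = 6),
  value = as.vector(arr)
)

m_arr <- time_mean(arr)   # 2 x 3 matrix
m_df  <- time_mean(long)  # data frame with lon, lat, value

# The kind of check a test suite could run against both implementations
agree <- all.equal(m_arr[cbind(m_df$lon, m_df$lat)], m_df$value)
```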

ktoddbrown · 2015-04-09T17:49:45Z

Clearly I'm still behind the curve with this whole plyr thing. I have to admit, I still think that the array operations are simpler.

I don't think that the array will have difficulties with the non-uniform grids as long as the lat doesn't change in a lon band and vice versa. However, raster will error out on them, which sadly invalidates that option.
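
One way to test that condition, assuming the grid coordinates are available as two matrices of the same shape (longitude index along rows, latitude index along columns), is sketched below. The function name `is_rectilinear` and the matrix layout are assumptions for illustration, and the exact equality comparison would likely need a tolerance for real model output.

```r
# Hypothetical check: TRUE if lon depends only on the row index and
# lat depends only on the column index.
is_rectilinear <- function(lon, lat) {
  stopifnot(is.matrix(lon), is.matrix(lat), all(dim(lon) == dim(lat)))
  # Every column of lon must equal the first column
  lon_ok <- all(lon == lon[, 1])
  # Every row of lat must equal the first row
  lat_ok <- all(lat == rep(lat[1, ], each = nrow(lat)))
  lon_ok && lat_ok
}

# Regular 4 x 3 grid
lon <- matrix(rep(c(0, 90, 180, 270), times = 3), nrow = 4)
lat <- matrix(rep(c(-45, 0, 45), each = 4), nrow = 4)
regular_ok <- is_rectilinear(lon, lat)

# Same grid with one latitude shifted inside a band
lat_curvi <- lat
lat_curvi[2, 3] <- 50
curvi_ok <- is_rectilinear(lon, lat_curvi)
```

Here `regular_ok` is `TRUE` and `curvi_ok` is `FALSE`.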

Another option, just to give even more choices, would be to use chunked reads and writes similar to the raster package. Doable, but more work than just providing the data.frame vs array option.
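
A minimal sketch of the chunked idea, assuming a reader callback that returns one block of time steps at a time: only a single chunk plus a running sum is held in memory. The names `chunked_time_mean` and `read_chunk` are hypothetical. In the demo the reader just slices an in-memory array; for a real netCDF file it could presumably wrap `ncdf4::ncvar_get()` with its `start` and `count` arguments instead.

```r
# Hypothetical driver: time mean computed one chunk at a time.
# read_chunk(start, count) must return a lon x lat x count array.
chunked_time_mean <- function(read_chunk, n_time, chunk_size) {
  total <- NULL
  for (start in seq(1, n_time, by = chunk_size)) {
    count <- min(chunk_size, n_time - start + 1)
    chunk <- read_chunk(start, count)
    # Sum over the third (time) dimension, keeping lon x lat
    part <- rowSums(chunk, dims = 2)
    total <- if (is.null(total)) part else total + part
  }
  total / n_time
}

# Stand-in for a file: a small in-memory array sliced by the reader
full <- array(as.numeric(1:60), dim = c(2, 3, 10))
reader <- function(start, count) {
  full[, , start:(start + count - 1), drop = FALSE]
}

res <- chunked_time_mean(reader, n_time = 10, chunk_size = 4)
same <- all.equal(res, apply(full, c(1, 2), mean))
```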

bpbond · 2015-04-10T10:41:52Z

On further reflection, I think you're right that arrays can handle irregular grids too…though not our current implementation, as lat does change in a lon band, etc., for the MPI-class models.

I guess I'd say that arrays are memory efficient and conceptually simple, but a bit complicated in code; data frames are simple in code and much faster, but take much more memory. Are we willing to support both?

ktoddbrown · 2015-04-10T16:31:21Z

I'm going to say yes.

bpbond added the enhancement label Apr 6, 2015

bpbond self-assigned this Apr 6, 2015

bpbond added a commit that referenced this issue Apr 6, 2015

Much faster conversion of array to data frame

93a0dfa

See issue #118.

bpbond closed this as completed May 21, 2015
