loadCMIP5 profiling #118

Closed
bpbond opened this issue Apr 6, 2015 · 5 comments
@bpbond
Member

bpbond commented Apr 6, 2015

I have started some performance tests and immediately see that loadCMIP5 has a big problem when it converts the loaded array to a data frame. It uses the slowest possible method to do so:

> x <- array(1, dim=c(400,400,400))
> system.time(y<-as.data.frame(x))
   user  system elapsed 
  2.550   0.389   2.945 
> system.time(y<-as.numeric(x))
   user  system elapsed 
  0.124   0.158   0.281 
> system.time(y<-reshape2::melt(x))  # this is the method it uses
   user  system elapsed 
  3.213   1.500  52.892 
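
A minimal sketch of one cheaper conversion, assuming a 3-D lon x lat x time array: since R stores arrays in column-major order, the index columns can be built with `rep()` and the values taken with `as.vector()`, with no call to `melt`. The helper name `array_to_long` and the column names are hypothetical and not part of RCMIP5.

```r
# Hypothetical helper (not part of RCMIP5): flatten a 3-D array into a
# long data frame without reshape2::melt.
array_to_long <- function(x) {
  d <- dim(x)
  stopifnot(length(d) == 3)
  data.frame(
    # R arrays are column-major: the first index varies fastest
    lon   = rep(seq_len(d[1]), times = d[2] * d[3]),
    lat   = rep(rep(seq_len(d[2]), each = d[1]), times = d[3]),
    time  = rep(seq_len(d[3]), each = d[1] * d[2]),
    value = as.vector(x)
  )
}

x <- array(seq_len(24), dim = c(2, 3, 4))
long <- array_to_long(x)
```

Each row of `long` holds the indices of one cell and its value, in the same order that `as.vector(x)` walks the array.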
@bpbond bpbond self-assigned this Apr 6, 2015
bpbond added a commit that referenced this issue Apr 6, 2015
@ktoddbrown
Contributor

| method | loadTime | mergeTime | annualTime | globalTime | fileSize [Mb] | annualSize [Mb] | globalSize [Mb] | loadSize [Mb] | fileReadSize [Mb] | inMemSize [Mb] |
|---|---|---|---|---|---|---|---|---|---|---|
| RCMIP5v1.1 | 7.331 | 7.588 | 466.092 | NA | 33.43258 | 24.069296 | NA | 300.68803 | NA | NA |
| ncdf-array | 0.408 | 0.869 | 9.994 | 0.041 | 33.43258 | 5.345488 | 0.000232 | 66.81897 | NA | NA |
| ncdf-dataframe | 1.478 | 747.486 | 386.428 | 8.963 | 33.43258 | 17.373160 | 0.001192 | 267.26512 | NA | NA |
| raster | 0.145 | 10.457 | 4.222 | 0.097 | 33.43258 | 5.804424 | 0.001856 | NA | 3.758504 | 66.87124 |

Here is my quick and dirty profile evaluation (running on a 2011 MacBook Air that was multi-tasking). Clearly I did NOT quite reach the level of optimization in my data frame functions that is currently in RCMIP5, but it gives a pretty good overview of the three candidate flavors we are looking at. Array seems to be both fast and small. From a memory perspective, the raster package is good if we manage everything from file instead of in memory, but it is relatively slow. We could consider rolling our own 'raster' package where we read in and process 'chunks' of the netcdf file. But, from this run, I'm inclined to fall back on array. Thoughts?

@bpbond
Member Author

bpbond commented Apr 9, 2015

Thanks Kathe! I have a bunch of thoughts, as I've also spent the last few days doing quite a few profiling and performance tests.

  1. The data.frame approach (current master branch) is very fast when using dplyr, which is the only reason we changed to it; for timing see here. (I think the numbers in your post above, @ktoddbrown, were done using the much slower ddply.)

  2. There is another advantage to the data frame implementation: it allows us to handle irregular grids, which some CMIP5 models use. By the way @cahartin I now have a branch that correctly loads your MPI models!

  3. Finally, the data frame code is much simpler: easier to reason about, debug, etc.

  4. The catch is that the data frame approach takes much more memory (currently 5x!!! although this could be improved a bit).

  5. The array approach has two big advantages, no question: memory efficiency and instant access to arbitrary elements.

  6. Side note: a caveat to the last point is that dplyr is much more careful in its memory use than plyr, so that if you can load the memory-intensive data frame, you're good to go, whereas the plyr operations can consume lots of memory during their operation, sometimes killing the machine.

  7. Putting these together, I'd say that if you have enough memory, the data frame approach is the clear winner, but there are definitely cases when arrays make it possible to process files that can't be loaded into a memory-intensive data frame.

  8. In summary, I'm reluctantly coming around to Kathe's earlier suggestion: we should support both approaches--that is, loadCMIP5 should have a flag specifying whether to return a data frame or array, and all further operations support either implementation. The testing code will need to test both, of course.

  9. Finally, we should also work to reduce the memory footprint of cmip5data objects as much as possible: let loadCMIP5 load only certain lon/lat/Z values (similar to how it can be told to load only certain dates); make the data frame more memory efficient by removing unused columns; etc.
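
A rough sketch of what points 4, 5 and 8 could look like together, assuming a 3-D lon x lat x time array of values. The names `load_values`, `format`, and `global_mean` are hypothetical illustrations, not the RCMIP5 API; the real loadCMIP5 reads from NetCDF files and carries more metadata.

```r
# Hypothetical sketch: a loader flag that picks the return type, plus one
# downstream operation that accepts either representation.
load_values <- function(arr, format = c("data.frame", "array")) {
  format <- match.arg(format)
  if (format == "array") return(arr)
  d <- dim(arr)
  data.frame(
    lon   = rep(seq_len(d[1]), times = d[2] * d[3]),
    lat   = rep(rep(seq_len(d[2]), each = d[1]), times = d[3]),
    time  = rep(seq_len(d[3]), each = d[1] * d[2]),
    value = as.vector(arr)
  )
}

# Works on either representation (unweighted mean, for illustration only)
global_mean <- function(x) {
  if (is.data.frame(x)) mean(x$value) else mean(x)
}

arr <- array(runif(20 * 10 * 12), dim = c(20, 10, 12))
as_array <- load_values(arr, "array")
as_df    <- load_values(arr, "data.frame")

# The data frame stores index columns alongside every value, so it is larger
size_ratio <- as.numeric(object.size(as_df)) / as.numeric(object.size(as_array))
```

The exact `size_ratio` depends on how many index columns are kept and on their types, which is why dropping unused columns helps.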

@ktoddbrown
Contributor

Clearly I'm still behind the curve with this whole plyr thing. I have to admit, I still think that the array operations are simpler.

I don't think that the array will have difficulties with the non-uniform grids as long as the lat doesn't change in a lon band and vice versa. However, raster will error out on them, which sadly invalidates that option.

Another option, just to give even more choices, would be to use chunked reads and writes similar to the raster package. Doable, but more work than just providing the data.frame vs array option.
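
To make the chunking idea concrete, here is a minimal sketch that computes a per-cell time mean while holding only a few time steps in memory at once. The function `chunked_time_mean` and the `read_chunk(start, count)` callback are hypothetical; here the callback slices an in-memory array, whereas with a real NetCDF file that role could presumably be filled by the `start` and `count` arguments of `ncdf4::ncvar_get()`.

```r
# Hypothetical helper: per-cell mean over time, processed in chunks of
# `chunk_size` time steps so the full array is never needed at once.
chunked_time_mean <- function(read_chunk, n_time, chunk_size) {
  total <- NULL
  for (start in seq(1, n_time, by = chunk_size)) {
    count <- min(chunk_size, n_time - start + 1)   # last chunk may be short
    chunk <- read_chunk(start, count)              # lon x lat x count array
    chunk_sum <- rowSums(chunk, dims = 2)          # sum over the time dimension
    total <- if (is.null(total)) chunk_sum else total + chunk_sum
  }
  total / n_time
}

# Stand-in for a file: an in-memory array sliced along the time dimension
full <- array(rnorm(4 * 3 * 10), dim = c(4, 3, 10))
read_chunk <- function(start, count) {
  full[, , start:(start + count - 1), drop = FALSE]
}

result <- chunked_time_mean(read_chunk, n_time = 10, chunk_size = 4)
```

The trade-off is the one noted above: memory stays bounded by the chunk size, but every operation has to be written as an accumulation over chunks.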

@bpbond
Member Author

bpbond commented Apr 10, 2015

On further reflection, I think you're right that arrays can handle irregular grids too…though not our current implementation, as lat does change in a lon band, etc., for the MPI-class models.

I guess I'd say that arrays are memory efficient and conceptually simple, but a bit complicated in code; data frames are simple in code and much faster, but take much more memory. Are we willing to support both?

@ktoddbrown
Contributor

I'm going to say yes.

@bpbond bpbond closed this as completed May 21, 2015