Memory management #8

grssnbchr opened this Issue Jun 8, 2017 · 4 comments



grssnbchr commented Jun 8, 2017

Hey Klaus

I am working on spatial interpolation of German-speaking dialects, similar to what Josh Katz did in his research:

What I need to do is interpolate a rectangular grid (a raster, basically) with q cells from n points, each of which contains the spoken dialect d at a lat/lon location. d is nominal. The covariates are lat and lon, as simple as that. In theory everything works fine; I end up with maps like the following (plotted with ggplot2).


A big problem is memory management, though.

  • q is in the order of millions (for a pixel resolution of 1000x1160, for example)
  • n is ideally in the order of several hundred thousand - the sample is really big, and the more points I include in the interpolation, the more detailed/beautiful the maps get (even with a high k). Also, the higher n, the higher I need to set k to get the desired aggregation effect.

The above map has n = 50000, k = 2500 and q = 104400 (pixel width = 300, so a very "low-res" example), yet the computation

dialects.kknn <- kknn(dialect ~ .,
                      train = dialects_train,  # the n training points (name illustrative)
                      test = dialects_test,    # the q grid cells
                      kernel = "gaussian",
                      k = 2500)

crashes with a message along the lines of "cannot allocate vector of size 2.x GB". After upgrading my 8 GB of RAM to 16 GB, it works (and takes around 9 minutes to compute), but the same run with q ≈ 1 million already fails with the message "cannot allocate vector of size 10.x GB".
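For context, the reported allocation sizes are in the same ballpark as a single q × k matrix. A quick back-of-envelope check (assuming 8-byte doubles and decimal GB; this is only an estimate, not an inspection of kknn's internals):

```r
# Approximate size in GB of one q x k matrix of 8-byte doubles
mem_gb <- function(q, k, bytes = 8) q * k * bytes / 1e9

mem_gb(104400, 2500)  # ~2.1 GB  -- the "low-res" grid above
mem_gb(1e6, 2500)     # 20 GB    -- the ~1 million cell grid
```

So the memory demand grows linearly in both q and k, which explains why the higher-resolution run fails even with 16 GB of RAM.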

My question is simple: Do you know of any mitigation strategies? Do I have to set a different parameter, or change the kernel? Could this be a memory leak? I am using kknn 1.3.1.

One idea I came up with is to interpolate to a low-res grid in a first run and then use another method to resample the raster to a higher resolution, for example with raster::resample. I don't know whether nominal values are suited for that, though.
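For the record, a minimal sketch of that resampling idea (all data here is illustrative; `method = "ngb"` is nearest-neighbour resampling, which copies class codes rather than averaging them, so it should be the right choice for nominal values):

```r
library(raster)

## Low-res categorical raster with integer class codes (illustrative)
r_low <- raster(nrows = 100, ncols = 116,
                xmn = 0, xmx = 116, ymn = 0, ymx = 100)
values(r_low) <- sample(1:5, ncell(r_low), replace = TRUE)

## High-res template with the same extent
r_high <- raster(nrows = 500, ncols = 580,
                 xmn = 0, xmx = 116, ymn = 0, ymx = 100)

## "ngb" = nearest neighbour: classes are copied, never interpolated
r_resampled <- resample(r_low, r_high, method = "ngb")
```

The result keeps only the original class codes, just on a finer grid.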


KlausVigo commented Jun 10, 2017

Hello Timo @grssnbchr,

kknn internally constructs two distance matrices of dimension q × k (times 4 or 8 bytes per entry), which will be the bottleneck.
You should be able to pass subsets of dialects_test to kknn and just combine the results at the end; you can even compute these on different machines.
Also, kernel = "rectangular" may give very similar results for a smaller choice of k, as the contribution of far-away (distance-wise) speakers to the estimate will be low.
If you can send me some sample data, I may be able to find some more parts to improve.
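That tiling approach can be sketched as follows (a sketch only: the `dialect`/`lat`/`lon` column names and the chunk size are illustrative; the key point is that the full training set is passed to every call, so the tiles fit together):

```r
library(kknn)

## Predict a large test grid in chunks against the full training set,
## then stitch the chunk predictions back together in row order.
predict_in_chunks <- function(train, test, chunk_size = 10000, ...) {
  chunks <- split(seq_len(nrow(test)),
                  ceiling(seq_len(nrow(test)) / chunk_size))
  preds <- lapply(chunks, function(i) {
    fit <- kknn(dialect ~ ., train = train,
                test = test[i, , drop = FALSE], ...)
    as.character(fitted(fit))   # predicted class per grid cell
  })
  unlist(preds, use.names = FALSE)
}

## e.g. predict_in_chunks(dialects_train, dialects_test,
##                        kernel = "gaussian", k = 2500)
```

Each call now only allocates chunk_size × k matrices instead of q × k, so peak memory is bounded by the chunk size.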

Have a nice weekend


grssnbchr commented Jun 11, 2017

Thanks a lot for your fast reply. I had briefly considered splitting the raster, but rejected the idea because I assumed the results wouldn't match up. But of course, if the same training data set is always used (i.e. only the test grid is split, not the training set), the tiles fit together nicely. Done that way, the memory problems vanish. Somehow, the whole computation is also faster by about 10-20%. And what's even better: I can now use the foreach package for parallel processing and gain another 30-40% of computation time. So thanks a lot. Once I have this all together, I will write a blog post and gladly point out your package and your help!
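A self-contained sketch of the foreach version (everything here is synthetic toy data and toy parameters, not the real n = 50000 / k = 2500 run; the structure is what matters):

```r
library(foreach)
library(doParallel)
library(kknn)

## Illustrative stand-ins for the real data
set.seed(1)
dialects_train <- data.frame(
  dialect = factor(rep(c("a", "b"), each = 50)),
  lat = c(rnorm(50, 0), rnorm(50, 10)),
  lon = c(rnorm(50, 0), rnorm(50, 10))
)
grid_df <- expand.grid(lat = seq(0, 10, length.out = 20),
                       lon = seq(0, 10, length.out = 20))
grid_df$dialect <- factor("a", levels = c("a", "b"))  # dummy response column

cl <- makeCluster(2)          # number of workers; adjust to your machine
registerDoParallel(cl)

## Split the grid rows into chunks and predict each chunk on a worker,
## always against the full training set
chunks <- split(seq_len(nrow(grid_df)),
                ceiling(seq_len(nrow(grid_df)) / 100))

preds <- foreach(i = chunks, .combine = c, .packages = "kknn") %dopar% {
  fit <- kknn(dialect ~ ., train = dialects_train,
              test = grid_df[i, , drop = FALSE],
              kernel = "rectangular", k = 10)
  as.character(fitted(fit))   # one predicted class per grid cell
}

stopCluster(cl)
```

Since `.combine = c` concatenates the chunk results in chunk order, `preds` lines up with the rows of the grid and can be attached back to it directly.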


grssnbchr commented Mar 17, 2018


KlausVigo commented Mar 19, 2018

@grssnbchr Looks amazing! Glad I could help

KlausVigo closed this Mar 19, 2018
