zoo::na.locf equivalent #489

Closed
BenoitLondon opened this issue Nov 7, 2023 · 9 comments

Comments

@BenoitLondon

Is there an equivalent of zoo::na.locf in collapse? If not, it would be cool to implement a fast version.

Thanks for that great package

@SebKrantz
Owner

Nope, please use data.table::nafill() or data.table::setnafill() with type = "locf".
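
For readers following along, here is a minimal sketch of the two data.table calls mentioned above. The vector and table are made-up illustration data, not from this thread.

```r
library(data.table)

# Hypothetical example data
x <- c(NA, 1, NA, NA, 4, NA)

# nafill() returns a filled copy; a leading NA has no earlier value to carry forward
filled <- nafill(x, type = "locf")

# setnafill() fills the selected columns of a data.table by reference (no copy)
dt <- data.table(id = 1:6, x = x)
setnafill(dt, type = "locf", cols = "x")
```

Note that zoo::na.locf() drops leading NAs by default (na.rm = TRUE), whereas nafill() keeps them, so the two results can differ in length.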

@SebKrantz
Owner

In collapse you can also use ffirst and flast with TRA = "replace_NA", but this is something else (replacing all missing values with the first or last value), and you might need to group your data appropriately for it to do what you intend.
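
A small sketch of what this looks like in practice, using made-up data. The TRA = "replace_NA" spelling is taken from the comment above; this assumes the installed collapse version accepts it (newer releases may prefer a different spelling, so check ?TRA).

```r
library(collapse)

# Hypothetical example data: two groups
x <- c(1, NA, 3, NA, 5, NA)
g <- c(1, 1, 2, 2, 2, 2)

# Every NA becomes the group's first non-missing value
first_filled <- ffirst(x, g, TRA = "replace_NA")

# Every NA becomes the group's last non-missing value
last_filled <- flast(x, g, TRA = "replace_NA")
```

Neither result is a rolling fill: in group 2, a true LOCF would give 3, 3, 5, 5, while these calls put the same group-level value into every gap.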

@BenoitLondon
Author

I didn't know nafill()/setnafill() had a locf mode.
Thanks!

@NicChr

NicChr commented Nov 12, 2023

Hello, sorry to reopen the thread, but I too would have loved to see a collapse-style fast grouped LOCF NA fill, so in case it's of any interest, I recently wrote a method in the development version of timeplyr (on GitHub) to do just that. Feel free to check it out if it's useful.

@SebKrantz Thanks again for the amazing package, I find myself using it so often it's become a daily part of my workflow.
I fully respect that this might be out of scope for collapse but if you find the method useful, I would be more than happy to share the code/method.
Many thanks!

@SebKrantz
Owner

Thanks @NicChr. I can think about it, but for that it would be good to know which functionality you are lacking from the data.table implementation, and how you would like to see it added to collapse. Currently, I don't really plan on adding new functions to collapse, so it would have to be an argument to replace_NA(), and an internal method for grouped data.

SebKrantz reopened this Nov 12, 2023

@NicChr

NicChr commented Nov 12, 2023

It could be that I'm not using the correct code or something, but it seems that the data.table version isn't very fast when applied to many groups. Consider the below example:

I ran each expression twice to get a more accurate memory allocation.

library(timeplyr)
library(data.table)
library(bench)

x <- sample.int(10^2, 10^5, TRUE)
x[sample.int(10^5, round(10^5/3))] <- NA
groups <- sample.int(10^4, 10^5, TRUE)

dt <- data.table(x, groups)

## No groups

mark(timeplyr = dt[, filled1 := .roll_na_fill(x)][]$filled1,
     timeplyr2 = dt[, filled2 := .roll_na_fill(x)][]$filled2,
     data.table = dt[, filled3 := data.table::nafill(x, type = "locf")][]$filled3,
     data.table2 = dt[, filled4 := data.table::nafill(x, type = "locf")][]$filled4)
#> # A tibble: 4 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 timeplyr       694µs    865µs     1079.    2.23MB     10.7
#> 2 timeplyr2      695µs    860µs     1089.  423.19KB     10.9
#> 3 data.table     671µs    945µs      998.   844.3KB     20.4
#> 4 data.table2    669µs    969µs      990.  829.95KB     22.9

## With groups

mark(timeplyr = dt[, filled1 := roll_na_fill(x, g = groups)][]$filled1,
     timeplyr2 = dt[, filled2 := roll_na_fill(x, g = groups)][]$filled2,
     data.table = dt[, filled3 := data.table::nafill(x, type = "locf"),
                     by = groups][]$filled3,
     data.table2 = dt[, filled4 := data.table::nafill(x, type = "locf"),
                     by = groups][]$filled4)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 timeplyr      1.71ms   2.08ms    423.      6.61MB     12.0
#> 2 timeplyr2     1.66ms   2.08ms    425.    852.97KB     16.0
#> 3 data.table  297.79ms 358.07ms      2.79  157.73MB     12.6
#> 4 data.table2 308.45ms 309.95ms      3.23   157.7MB     14.5

Created on 2023-11-12 with reprex v2.0.2

My method essentially uses the order of the groups and the sorted group sizes to perform a fast locf na fill.
In more detail: radixorderv(g) is called when g isn't a GRP object, or when g is a GRP that has no order vector attached and whose group IDs aren't already sorted. The order vector and group sizes are then passed to a C/C++ function that loops through x in the order given by the order vector, so no physical sorting takes place. The order vector lets us traverse x as if it were sorted by g, and the group sizes tell us when we enter a new group. Alternatively, you could pass the sorted group starts.
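
The idea described above can be sketched in plain base R. This is only an illustration of the technique, not timeplyr's actual code: grouped_locf() is a hypothetical helper, base order() stands in for radixorderv(), the loop runs in R rather than C/C++, and g is assumed to contain no NAs.

```r
# Hypothetical helper illustrating an order-vector-based grouped LOCF
grouped_locf <- function(x, g) {
  # Stable radix order: rows of the same group keep their original relative order
  o <- order(g, method = "radix")
  # Sizes of the groups in sorted order (run lengths of the sorted group IDs)
  sizes <- rle(g[o])$lengths

  out <- x
  pos <- 0L
  for (n in sizes) {
    last <- NA                      # reset at the start of every group
    for (k in seq_len(n)) {
      i <- o[pos + k]               # row of x visited next; x itself is never sorted
      if (is.na(out[i])) {
        out[i] <- last              # carry the last seen value forward
      } else {
        last <- out[i]              # remember the most recent non-missing value
      }
    }
    pos <- pos + n                  # advance to the first position of the next group
  }
  out
}

# Hypothetical example data: groups are interleaved
x <- c(1, NA, 5, NA, NA, 2)
g <- c(1, 2, 1, 2, 1, 2)
res <- grouped_locf(x, g)
```

Here group 1 occupies rows 1, 3 and 5, so row 5 is filled with 5, while the two leading NAs of group 2 (rows 2 and 4) stay missing because nothing precedes them within their group.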

In any case I think the benchmark demonstrates that there is a potential for a significant improvement in performance, at least when there are large numbers of groups.

@SebKrantz
Owner

Thanks again, I will implement an option value = "locf" to replace_na and replace_inf in the future, but this is a bit complex because with indexed data which has both a group and time dimension, the last value might not be in appearance order of the data. So this requires a bit more thought on my side to thoroughly implement also for grouped and indexed data.
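
To illustrate the ordering concern with plain base R (a sketch with made-up data and a hypothetical locf() helper, not collapse code): when rows are not stored in time order, filling in row order and filling in time order give different answers.

```r
# Hypothetical helper: LOCF over a vector, in the order the values are given
locf <- function(v) {
  # Index of the most recent non-missing element at each position (0 if none yet)
  idx <- cummax(seq_along(v) * !is.na(v))
  c(NA, v)[idx + 1L]
}

# Hypothetical single group whose rows are not stored in time order
time <- c(3, 1, 2)
x    <- c(NA, 10, 20)

# Filling in row order: the NA comes first, so nothing is carried forward
by_row <- locf(x)

# Filling in time order: sort by time, fill, then write back to the original rows
o <- order(time)
by_time <- x
by_time[o] <- locf(x[o])
```

In row order the first element stays NA, whereas in time order it is the latest observation and receives 20, the value observed at time 2.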

@SebKrantz
Owner

I added a basic "locf" and "focb" functionality to replace_na(), but the grouped implementation or a tailored implementation for matrices is a bit beyond the scope currently. The issue can stay open, but in general it will require some extra thought on implementing this, and perhaps a deeper integration of these functions with collapse's vectorized computing infrastructure.

SebKrantz added a commit that referenced this issue Jan 11, 2024
@SebKrantz
Owner

Small update: in addition to the above, I have made available baseline C implementations, na_locf() and na_focb(), that operate on a single vector. This facilitates nearly zero-overhead repetition, e.g. dapply(X, na_locf) for xts or airquality |> fgroup_by(Month) |> fmutate(Ozone = na_locf(Ozone)). I don't think that a full column-level and grouped logic will be implemented here at this point. This would require a lot of programming to support matrices, group-level vectorization, and indexed time series, and I'm currently not up to doing that, particularly because this feature is quite particular and not often required. I'm happy to see a grouped implementation in timeplyr though. I'd thus close the issue if there are no other substantial remarks. collapse 2.0.9 will be submitted to CRAN today.
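
A short sketch of the two functions named above, assuming collapse 2.0.9 or later is installed and R is at least 4.1 (for the native pipe). The vector is made-up illustration data; the grouped pipeline is the one quoted in the comment, with fungroup() added to drop the grouping afterwards.

```r
library(collapse)

# Hypothetical example vector
x <- c(NA, 1, NA, 3, NA)

fwd <- na_locf(x)  # last observation carried forward; the leading NA stays
bwd <- na_focb(x)  # first observation carried back; the trailing NA stays

# Grouped use: fill Ozone within each Month of the built-in airquality data
filled <- airquality |>
  fgroup_by(Month) |>
  fmutate(Ozone = na_locf(Ozone)) |>
  fungroup()
```

Because the fill restarts in every group, a Month whose first Ozone values are missing keeps those leading NAs.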


3 participants