zoo::na.locf equivalent #489

BenoitLondon · 2023-11-07T09:57:31Z

Is there an equivalent of zoo::na.locf in collapse? If not, it would be cool to implement a fast version.

Thanks for that great package

SebKrantz · 2023-11-07T10:17:56Z

Nope, please use data.table::nafill() or data.table::setnafill() with type = "locf".
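For reference, a minimal sketch of the two data.table calls mentioned above. The vector `x` and the table `dt` are made up for illustration.

```r
library(data.table)

x <- c(NA, 1, NA, NA, 4, NA)

# nafill() returns a new vector. The leading NA stays NA because there is
# no earlier observation to carry forward.
filled <- nafill(x, type = "locf")
filled
#> [1] NA  1  1  1  4  4

# setnafill() fills the selected columns of a data.table by reference.
dt <- data.table(id = 1:6, x = x)
setnafill(dt, type = "locf", cols = "x")
dt$x
#> [1] NA  1  1  1  4  4
```

Use `nafill()` when you want a filled copy and `setnafill()` when modifying the table in place is acceptable.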

SebKrantz · 2023-11-07T10:22:36Z

In collapse you can also use ffirst and flast with TRA = "replace_NA", but this is something else (replacing all missing values with the first or last value), and you might need to group your data appropriately for it to do what you intend.
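A small sketch of why this differs from LOCF, using made-up data: every missing value receives the last non-missing value of its group, regardless of where that value sits within the group.

```r
library(collapse)

x <- c(1, NA, 3, NA, NA, 6)
g <- c(1, 1, 1, 2, 2, 2)

# Group 1: the NA in position 2 becomes 3 (the group's last value),
# whereas LOCF would give 1.
# Group 2: both leading NAs become 6, whereas a grouped LOCF would
# leave them missing.
res <- flast(x, g, TRA = "replace_NA")
res
#> [1] 1 3 3 6 6 6
```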

BenoitLondon · 2023-11-07T14:55:22Z

I didn't know nafill()/setnafill() had a locf mode.
Thanks!

NicChr · 2023-11-12T10:55:20Z

Hello, sorry to reopen the thread, but I too would have loved to see a collapse-style fast grouped LOCF NA fill. In case it is of interest, I recently wrote a method in the development version of timeplyr (on GitHub) that does just that. Feel free to check it out if it's useful.

@SebKrantz Thanks again for the amazing package, I find myself using it so often it's become a daily part of my workflow.
I fully respect that this might be out of scope for collapse but if you find the method useful, I would be more than happy to share the code/method.
Many thanks!

SebKrantz · 2023-11-12T19:04:32Z

Thanks @NicChr. I can think about it, but for that it would be good to know which functionality you are lacking from the data.table implementation, and how you would like to see it added to collapse. Currently, I don't really plan on adding new functions to collapse, so it would have to be an argument to replace_NA(), and an internal method for grouped data.

NicChr · 2023-11-12T20:08:08Z

It could be that I'm not using the correct code, but the data.table version does not seem to be very fast when applied to many groups. Consider the example below.

I ran each expression twice to get a more accurate memory allocation.

library(timeplyr)
library(data.table)
library(bench)

x <- sample.int(10^2, 10^5, TRUE)
x[sample.int(10^5, round(10^5/3))] <- NA
groups <- sample.int(10^4, 10^5, TRUE)

dt <- data.table(x, groups)

## No groups

mark(timeplyr = dt[, filled1 := .roll_na_fill(x)][]$filled1,
     timeplyr2 = dt[, filled2 := .roll_na_fill(x)][]$filled2,
     data.table = dt[, filled3 := data.table::nafill(x, type = "locf")][]$filled3,
     data.table2 = dt[, filled4 := data.table::nafill(x, type = "locf")][]$filled4)
#> # A tibble: 4 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 timeplyr       694µs    865µs     1079.    2.23MB     10.7
#> 2 timeplyr2      695µs    860µs     1089.  423.19KB     10.9
#> 3 data.table     671µs    945µs      998.   844.3KB     20.4
#> 4 data.table2    669µs    969µs      990.  829.95KB     22.9

## With groups

mark(timeplyr = dt[, filled1 := roll_na_fill(x, g = groups)][]$filled1,
     timeplyr2 = dt[, filled2 := roll_na_fill(x, g = groups)][]$filled2,
     data.table = dt[, filled3 := data.table::nafill(x, type = "locf"),
                     by = groups][]$filled3,
     data.table2 = dt[, filled4 := data.table::nafill(x, type = "locf"),
                     by = groups][]$filled4)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 timeplyr      1.71ms   2.08ms    423.      6.61MB     12.0
#> 2 timeplyr2     1.66ms   2.08ms    425.    852.97KB     16.0
#> 3 data.table  297.79ms 358.07ms      2.79  157.73MB     12.6
#> 4 data.table2 308.45ms 309.95ms      3.23   157.7MB     14.5

Created on 2023-11-12 with reprex v2.0.2

My method essentially uses the order of the groups and the sorted group sizes to perform a fast LOCF NA fill.
In a bit more detail: radixorderv(g) is called if g isn't a GRP, or if g is a GRP that has no order vector attached and whose group IDs aren't sorted. The order vector and group sizes are then passed to a C/C++ function that loops through x in the order given by the order vector, so no physical sorting takes place. The order vector lets us traverse x as if it were sorted by g, and the group sizes tell us when we enter a new group. Alternatively, you could also pass the sorted group starts.

In any case I think the benchmark demonstrates that there is a potential for a significant improvement in performance, at least when there are large numbers of groups.
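To make the idea concrete, here is a rough base R sketch of the approach described above. It is not timeplyr's actual code: `grouped_locf()` is a hypothetical helper, `order()` stands in for `radixorderv()`, and the plain R loop only illustrates the logic that would run in C/C++. It assumes `g` contains no missing values.

```r
# Hypothetical helper: grouped LOCF driven by an order vector and the
# sorted group sizes, without physically sorting x.
grouped_locf <- function(x, g) {
  o <- order(g, method = "radix")   # stable: keeps row order within groups
  sizes <- rle(g[o])$lengths        # group sizes in sorted-group order
  na_val <- x[NA_integer_]          # NA of the same type as x
  out <- x
  k <- 0L                           # position within the order vector
  for (size in sizes) {
    last <- na_val                  # reset when a new group starts
    for (j in seq_len(size)) {
      k <- k + 1L
      i <- o[k]                     # row index in the original data
      if (is.na(out[i])) out[i] <- last else last <- out[i]
    }
  }
  out
}

x <- c(1, NA, 5, NA, NA, 2, NA)
g <- c(2, 1, 1, 2, 1, 2, 1)
grouped_locf(x, g)
#> [1]  1 NA  5  1  5  2  5
```

The second element stays NA because it is the first observation of group 1 and has nothing to carry forward.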

SebKrantz · 2023-12-07T11:11:25Z

Thanks again. I will implement an option value = "locf" for replace_na and replace_inf in the future, but this is a bit complex: with indexed data that has both a group and a time dimension, the last value in time order is not necessarily the last one in order of appearance in the data. Implementing this thoroughly for grouped and indexed data as well therefore requires a bit more thought on my side.

SebKrantz · 2024-01-11T03:44:38Z

I added a basic "locf" and "focb" functionality to replace_na(), but the grouped implementation or a tailored implementation for matrices is a bit beyond the scope currently. The issue can stay open, but in general it will require some extra thought on implementing this, and perhaps a deeper integration of these functions with collapse's vectorized computing infrastructure.
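A minimal usage sketch, assuming the option is exposed through a `type` argument of `replace_na()` as in the released package (please check `?replace_na` for your installed version). The data frame is made up for illustration.

```r
library(collapse)

df <- data.frame(a = c(NA, 1, NA, 3), b = c(10, NA, NA, 40))

# Last observation carried forward, applied to each column.
locf <- replace_na(df, type = "locf")

# First observation carried backward.
focb <- replace_na(df, type = "focb")
```

As with `data.table::nafill()`, a leading missing value has nothing to carry forward under "locf", and a trailing one has nothing to carry back under "focb".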

SebKrantz · 2024-01-11T11:27:46Z

Small update: in addition to the above, I have made available baseline C implementations, na_locf() and na_focb(), that operate on a single vector. This allows repeated application with nearly zero overhead, e.g. dapply(X, na_locf) for xts, or airquality |> fgroup_by(Month) |> fmutate(Ozone = na_locf(Ozone)). At this point, I don't think a full column-level and grouped logic will be implemented here. This would require a lot of programming to support matrices, group-level vectorization, and indexed time series, and I'm currently not up to doing that, particularly because this feature is quite specific and not often required. I'm happy to see a grouped implementation in timeplyr though. I'd thus close the issue if there are no other substantial remarks. collapse 2.0.9 will be submitted to CRAN today.
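A short sketch of the two functions on a made-up vector, followed by the grouped call from the comment above. The trailing `fungroup()` is an addition for illustration and only drops the grouping.

```r
library(collapse)

x <- c(NA, 2, NA, NA, 5, NA)

na_locf(x)   # last observation carried forward
#> [1] NA  2  2  2  5  5
na_focb(x)   # first observation carried backward
#> [1]  2  2  5  5  5 NA

# Grouped fill: na_locf() is evaluated once per Month, so values are not
# carried across group boundaries.
filled <- airquality |>
  fgroup_by(Month) |>
  fmutate(Ozone = na_locf(Ozone)) |>
  fungroup()
```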

SebKrantz closed this as completed Nov 7, 2023

SebKrantz reopened this Nov 12, 2023

SebKrantz added a commit that referenced this issue Jan 11, 2024

Add basic "locf" and "focb" options to replace_na() (#489).

87f4228

SebKrantz added a commit that referenced this issue Jan 11, 2024

Better solution to (#489): making available na_locf() and na_focb() barebones functions.

a55172f

SebKrantz closed this as completed Jan 12, 2024
