


README.md

slider


slider provides a family of general purpose “sliding window” functions. The API is purposefully very similar to purrr. The goal of these functions is usually to compute rolling averages, cumulative sums, rolling regressions, or other “window” based computations.

There are 3 core functions in slider:

  • slide() iterates over your data like purrr::map(), but uses a sliding window to do so. It is type-stable, and always returns a result with the same size as its input.

  • slide_index() computes a rolling calculation relative to an index. If you have ever wanted to compute something like a “3 month rolling average” where the number of days in each month is irregular, you might like this function.

  • slide_period() is similar to slide_index() in that it slides relative to an index, but it first breaks the index up into “time blocks”, like 2 month blocks of time, and then it slides over .x using indices defined by those blocks.

Each of these core functions has the same variants as purrr::map(). For example, slide() has slide_dbl(), slide2(), and pslide(), along with the other combinations of these variants that you might expect from having previously used purrr.
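As a small sketch of how the variants combine, the example below slides over two and then three inputs in parallel. The input vectors and the helper name `add_sums` are made up for illustration.

```r
library(slider)

# slide2_dbl(): slide over two inputs in parallel and return a double vector.
# Each window is "the current element + 1 element before".
two_inputs <- slide2_dbl(1:4, 5:8, ~ sum(.x) + sum(.y), .before = 1)
two_inputs
#> [1]  6 14 18 22

# pslide_dbl(): slide over any number of inputs, supplied as a list.
# `add_sums` is a hypothetical helper, not part of slider.
add_sums <- function(a, b, c) sum(a) + sum(b) + sum(c)
three_inputs <- pslide_dbl(list(1:4, 5:8, 9:12), add_sums, .before = 1)
three_inputs
#> [1] 15 33 39 45
```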

To learn more about these three functions, read the introduction vignette.

There is also a set of extremely fast specialized variants of slide_dbl() for the most common use cases. These include slide_sum() for rolling sums and slide_mean() for rolling averages. There are index variants of each of these as well, like slide_index_sum().
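A minimal sketch of how the specialized variants relate to slide_dbl(). It assumes the summary functions take `before` and `after` without the leading dot, as in the released API; check the help page for slide_sum() if your installed version differs.

```r
library(slider)

x <- c(1, 2, 3, 4, 5)

# Rolling sum over "the current element + 2 elements before"
fast_sum <- slide_sum(x, before = 2)
general_sum <- slide_dbl(x, sum, .before = 2)

# Rolling mean over the same window
fast_mean <- slide_mean(x, before = 2)
general_mean <- slide_dbl(x, mean, .before = 2)

# Both approaches should agree up to floating point tolerance
all.equal(fast_sum, general_sum)
all.equal(fast_mean, general_mean)
```

The specialized functions skip the repeated calls to an R function, which is where the speed comes from, but they only cover a fixed set of summaries.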

Installation

Install the released version from CRAN with:

install.packages("slider")

Install the development version from GitHub with:

remotes::install_github("DavisVaughan/slider")

Examples

The help page for slide() has many examples, but here are a few:

library(slider)

The classic example would be to do a moving average. slide() handles this with a combination of the .before and .after arguments, which control the width of the window and the alignment.

# Moving average (Aligned right)
# "The current element + 2 elements before"
slide_dbl(1:5, ~mean(.x), .before = 2)
#> [1] 1.0 1.5 2.0 3.0 4.0

# Align left
# "The current element + 2 elements after"
slide_dbl(1:5, ~mean(.x), .after = 2)
#> [1] 2.0 3.0 4.0 4.5 5.0

# Center aligned
# "The current element + 1 element before + 1 element after"
slide_dbl(1:5, ~mean(.x), .before = 1, .after = 1)
#> [1] 1.5 2.0 3.0 4.0 4.5

With Inf, you can do a “cumulative slide” to compute cumulative expressions. I think of this as saying “give me everything before the current element.”

slide(1:4, ~.x, .before = Inf)
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] 1 2
#> 
#> [[3]]
#> [1] 1 2 3
#> 
#> [[4]]
#> [1] 1 2 3 4
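To make the "cumulative" idea concrete, here is a short sketch that sums each of those growing windows and compares the result with base R's cumsum().

```r
library(slider)

x <- 1:4

# Each window holds the current element plus everything before it,
# so summing each window reproduces a cumulative sum
cumulative <- slide_dbl(x, sum, .before = Inf)
cumulative
#> [1]  1  3  6 10

identical(cumulative, as.double(cumsum(x)))
#> [1] TRUE
```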

With .complete, you can decide whether or not .f should be evaluated on incomplete windows. In the following example, the requested window size is 3, but the first two results are computed on windows of size 1 and 2 because partial results are allowed by default. When .complete is set to TRUE, the first two results are not computed.

slide(1:4, ~.x, .before = 2)
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] 1 2
#> 
#> [[3]]
#> [1] 1 2 3
#> 
#> [[4]]
#> [1] 2 3 4

slide(1:4, ~.x, .before = 2, .complete = TRUE)
#> [[1]]
#> NULL
#> 
#> [[2]]
#> NULL
#> 
#> [[3]]
#> [1] 1 2 3
#> 
#> [[4]]
#> [1] 2 3 4
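The same comparison can be sketched with a typed variant. With slide_dbl(), the incomplete windows cannot be NULL, so they show up as missing values instead.

```r
library(slider)

# Partial windows are allowed by default
partial <- slide_dbl(1:5, mean, .before = 2)
partial
#> [1] 1.0 1.5 2.0 3.0 4.0

# With .complete = TRUE, the first two windows are incomplete,
# so .f is not evaluated there and the result is NA
complete <- slide_dbl(1:5, mean, .before = 2, .complete = TRUE)
complete
#> [1] NA NA  2  3  4
```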

Data frames

Unlike purrr::map(), slide() iterates over data frames in a row-wise fashion. Interestingly, this means that slide() with its default arguments becomes a generic row-wise iterator, with nice syntax for accessing data frame columns.

There is a vignette specifically about this.

mini_cars <- cars[1:4,]

slide(mini_cars, ~.x)
#> [[1]]
#>   speed dist
#> 1     4    2
#> 
#> [[2]]
#>   speed dist
#> 1     4   10
#> 
#> [[3]]
#>   speed dist
#> 1     7    4
#> 
#> [[4]]
#>   speed dist
#> 1     7   22

slide_dbl(mini_cars, ~.x$speed + .x$dist)
#> [1]  6 14 11 29

This makes rolling regressions trivial!

library(tibble)
set.seed(123)

df <- tibble(
  y = rnorm(100),
  x = rnorm(100)
)

# Window size of 20 rows
# The current row + 19 before
# (see slide_index() for how to do this relative to a date vector!)
df$regressions <- slide(df, ~lm(y ~ x, data = .x), .before = 19, .complete = TRUE)

df[15:25,]
#> # A tibble: 11 x 3
#>         y      x regressions
#>     <dbl>  <dbl> <list>     
#>  1 -0.556  0.519 <NULL>     
#>  2  1.79   0.301 <NULL>     
#>  3  0.498  0.106 <NULL>     
#>  4 -1.97  -0.641 <NULL>     
#>  5  0.701 -0.850 <NULL>     
#>  6 -0.473 -1.02  <lm>       
#>  7 -1.07   0.118 <lm>       
#>  8 -0.218 -0.947 <lm>       
#>  9 -1.03  -0.491 <lm>       
#> 10 -0.729 -0.256 <lm>       
#> 11 -0.625  1.84  <lm>
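If only one number per window is needed, a typed variant can pull it out directly. The sketch below uses a plain data frame with made-up data (named `sim` here to keep it separate from the example above) and extracts the rolling slope.

```r
library(slider)
set.seed(123)

# Made-up data for illustration
sim <- data.frame(y = rnorm(100), x = rnorm(100))

# Rolling slope of y ~ x over the current row + 19 rows before.
# Incomplete windows are skipped, so the first 19 results are NA.
slopes <- slide_dbl(
  sim,
  ~ coef(lm(y ~ x, data = .x))[["x"]],
  .before = 19,
  .complete = TRUE
)

# The first complete window covers rows 1 to 20
first_slope <- coef(lm(y ~ x, data = sim[1:20, ]))[["x"]]
all.equal(slopes[20], first_slope)
```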

Index sliding

In many business settings, the value you want to compute is tied to some index, like a date vector. In these cases, you’ll probably want to compute sliding windows relative to the index, and not using the fixed window that slide() provides. You can use slide_index() to pass in both .x and an index, .i, and the window will be calculated relative to that index.

Here, when computing a “2 day window”, you probably don’t want "2019-08-16" and "2019-08-18" to be grouped together. slide() has no concept of an index, so when you specify a window size of 2, it will group these two together. slide_index(), on the other hand, will do the right thing.

x <- 1:3
i <- as.Date(c("2019-08-15", "2019-08-16", "2019-08-18"))

# slide() has no concept of an "index"
slide(x, ~.x, .before = 1)
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] 1 2
#> 
#> [[3]]
#> [1] 2 3

# "index aware"
slide_index(x, i, ~.x, .before = 1)
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] 1 2
#> 
#> [[3]]
#> [1] 3

Essentially what happens is that when we get to "2019-08-18", it “looks backwards” 1 day to set a window boundary at "2019-08-17". Since the date at position 2, "2019-08-16", is before "2019-08-17", it is not included.
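The boundary logic described above can be sketched in base R. This is only an illustration of the idea, not how slider implements it, and `index_window()` is a hypothetical helper.

```r
x <- 1:3
i <- as.Date(c("2019-08-15", "2019-08-16", "2019-08-18"))
before <- 1 # days to look backwards

# Hypothetical helper: for position k, keep every element of x whose
# index falls between (i[k] - before) and i[k], inclusive
index_window <- function(k) {
  lower <- i[k] - before
  upper <- i[k]
  x[i >= lower & i <= upper]
}

windows <- lapply(seq_along(i), index_window)
windows
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 1 2
#>
#> [[3]]
#> [1] 3
```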

Powerfully, you can pass any object to .before for which .i - .before can be computed. This means that you could also have used a lubridate period object (which gets even more interesting when you use weeks() or months()):

slide_index(x, i, ~.x, .before = lubridate::days(1))
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] 1 2
#> 
#> [[3]]
#> [1] 3

Period sliding

slide_period() is different from slide_index() in that it first breaks the index into “time blocks” and then slides over .x relative to those blocks. For example, in the monthly period slide below, i is broken up into 4 monthly time blocks, and each window is “the current block of monthly data, plus one block before this one”. The locations of those blocks are then used to slice .x.

i <- as.Date(c(
  "2019-01-29", 
  "2019-01-30", 
  "2019-02-05", 
  "2019-04-01", 
  "2019-05-10"
))

slide_period(i, i, "month", ~.x, .before = 1)
#> [[1]]
#> [1] "2019-01-29" "2019-01-30"
#> 
#> [[2]]
#> [1] "2019-01-29" "2019-01-30" "2019-02-05"
#> 
#> [[3]]
#> [1] "2019-04-01"
#> 
#> [[4]]
#> [1] "2019-04-01" "2019-05-10"

One neat thing to notice is that slide_period() is aware of the distance between elements of .i in the period you specify. The practical implication of this is that in the above example, group 3 with 2019-04-01 did not include 2019-02-05 in it, because it is more than 1 month group away.
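The block logic can be sketched in base R by giving each date a running month number. This is an illustration under that assumption, not slider's actual implementation, and `month_id` is a name made up here.

```r
i <- as.Date(c(
  "2019-01-29",
  "2019-01-30",
  "2019-02-05",
  "2019-04-01",
  "2019-05-10"
))

# Running month number, so that consecutive months differ by exactly 1
lt <- as.POSIXlt(i)
month_id <- (lt$year + 1900) * 12 + lt$mon

# One window per distinct month block: the current block + 1 block before
blocks <- unique(month_id)
before <- 1
windows <- lapply(blocks, function(b) {
  i[month_id >= b - before & month_id <= b]
})

# April is 2 month blocks after February, so the April window
# contains only "2019-04-01"
lengths(windows)
#> [1] 2 3 1 2
```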

Inspiration

This package is inspired heavily by SQL’s window functions. The API is similar, but more general because you can iterate over any kind of R object.

There have been multiple attempts at creating sliding window functions (I personally created rollify(), and worked a little bit on tsibble::slide() with Earo Wang).

  • zoo::rollapply()
  • tibbletime::rollify()
  • tsibble::slide()

I believe that slider is the next iteration of these. There are a few reasons for this:

  • To me, the API is more intuitive, and is more flexible because .before and .after let you completely control the entry point (as opposed to fixed entry points like "center", "left", etc.).

  • It is objectively faster because it is written purely in C.

  • With slide_vec() you can return any kind of object, and are not limited to the suffixed versions: _dbl, _int, etc.

  • It iterates row-wise over data frames, consistent with the vctrs framework.

  • I believe it is overall more consistent, backed by a theory that can always justify the sliding window generated by any combination of the parameters.
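As a sketch of the slide_vec() point above, the example below returns a Date vector, a type that has no suffixed variant. The dates are made up for illustration.

```r
library(slider)

dates <- as.Date(c("2019-08-15", "2019-08-16", "2019-08-18"))

# Latest date in "the current element + 1 element before".
# Each call returns a single Date, and slide_vec() combines them
# into one Date vector rather than a list.
latest <- slide_vec(dates, max, .before = 1)

class(latest)
#> [1] "Date"
```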

Earo and I have spoken, and we have mutually agreed that it would be best to deprecate tsibble::slide() in favor of slider::slide().

Additionally, data.table’s non-equi joins have been pretty much the only solution to the problem that slide_index() tries to solve. Their solution is robust and quite fast, and has been a nice benchmark for slider. slider is trying to solve a much narrower problem, so the API here is more focused.

Performance

In terms of performance, be aware that any specialized package that shifts the function calls to C is going to be faster than slider. For example, RcppRoll::roll_mean() computes the rolling mean at the C level, which is bound to be faster. The purpose of slider is to be general purpose, while still being as fast as possible. This means that it can be used for more abstract things, like rolling regressions, or any other custom function that you want to use in a rolling fashion.

Otherwise, like purrr::map(), slide() is optimized in C to be as fast as possible, getting out of the way as quickly as it can so that the main overhead is the .f calls.

