Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lead/lag with respect to a time variable #37

Open
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

matthieugomez
Copy link

Solve #13

@matthieugomez
Copy link
Author

Any thought?

@piever
Copy link
Collaborator

piever commented May 4, 2020

Thanks for taking a stab at this!

My main concern is that ideally this would have a lazy implementation (just like lag, lead normally do), so that this line is executed on getindex.

I'm not sure whether it should be done by simply allowing ShiftedArray to have two arrays (keys and indices), or whether one could somehow get it to work using as parent array some custom array type which also stores its indices.

cc: @nalimilan (he is using this package in StatsKit, so he may have feedback on what is the best way to add time series support in a way that is helpful to the statistical ecosystem).

@matthieugomez
Copy link
Author

matthieugomez commented May 4, 2020 via email

@nalimilan
Copy link
Contributor

To me it sounds like this should be a specialization of lead and lag for TimeArrays instead. TimeArrays can be constructed by passing timestamps and values, so lag(TimeArray(times, values), Day(1)) could give you the same result as what this function does.

@matthieugomez
Copy link
Author

matthieugomez commented May 4, 2020 via email

@nalimilan
Copy link
Contributor

Well GroupedDataFrame gives SubArray views of the columns for rows in the group, so you'd get a SubArray of a TimeArray and lag would be implemented by that type I guess.

@matthieugomez
Copy link
Author

matthieugomez commented May 4, 2020 via email

@nalimilan
Copy link
Contributor

Ah you mean that you want duplicate times in the data frame that would be unique only within each group?

@matthieugomez
Copy link
Author

Yes, exactly.

@nalimilan
Copy link
Contributor

OK. Then the only solution would be to call lag(TimeArray(times, values), Day(1)) for each group inside combine/select. But that wouldn't have any advantage over adding methods like in this PR, so...

It's still annoying that lead and lag are currently lazy in ShiftedArrays. This makes sense for the current methods since it's very efficient, but for methods added by this PR the overhead is much higher. Being lazy can be useful even there if you go over the data only once, but otherwise collecting it will be faster.

This choice between lazy and eager is a broader issue that doesn't have a very good solution currently. Julia has filter and Iterators.filter, DataFrames has stack and stackview (unexported)... all of that is quite inconsistent.

By the way, if methods from this PR requires times to be sorted (at least by default), wouldn't they be able to work much more efficiently, just by checking whether the previous timestamp is equal to the current one minus period? Then it could make sense to be lazy. It may even be faster to sort the data (notably using radix sort) than to use a Dict.

@piever
Copy link
Collaborator

piever commented May 4, 2020

Concerning lazy versus eager: I agree that it only makes sense if the timestamps are sorted (or if one stores the sortperm in the lazy object). I'm actually not 100% sure that the lazy object should be an AbstractArray, or just an iterator. It feels like it is simpler to define iteration efficiently rather than getindex.

As a side comment, if this becomes a special method for TimeArray (which IMO makes a lot of sense), I think it should live in TimeSeries (ShiftedArrays is much more lightweight, so it would make more sense for TimeSeries to depend on it than viceversa).

@matthieugomez
Copy link
Author

matthieugomez commented May 5, 2020 via email

@matthieugomez
Copy link
Author

matthieugomez commented May 5, 2020 via email

@nalimilan
Copy link
Contributor

I don’t think it would work since you may want to lag by more than one
period.

Is that really a problem? You can just check go back by one timestamp until you find one which is either equal or older than the current timestamp minus period. That should be efficient unless you want to lag by a large number of periods compared to the average resolution of the timestamps.

That said, I'm no longer sure that using TimeArray is the best approach. Indeed the proposed methods work for any type (notably integers), and not just dates/times.

Can you point at the corresponding APIs in other languages? That may give us some ideas.

@iblislin
Copy link

iblislin commented May 5, 2020

Hi @matthieugomez

I have a proposal for these functions:

  • Let's implement them in TimeSeries.jl for both AbstractArray and TimeArray.
  • Furthermore, we can add a new type like SimpleTimeArray to accept dates in Int.

Since the project goal of TimeSeries.jl is a versatile time series toolbox.
I want to make some common functions available for AbstractArray, but I don't do it yet. :p

Oh, another issue: Is there a interface package for lag, lead, head, tail?
I always got this:

WARNING: both TimeSeries and DataFrames export "tail"; uses of it in module Main must be qualified

And the same story will happen on TimeSeries and ShiftedArray.

@matthieugomez
Copy link
Author

matthieugomez commented May 5, 2020

@nalimilan

  • Stata has a concept of timed DataFrame and timed grouped DataFrame. Once a time variable is defined, doing L.x creates the lagged variable as in this pull request (per group if dataframe is grouped)
  • dplyr includes a simple function defined on vectors lag(values, n = 1, orderby = times). It does something a bit different: it returns the value for the previous time (whether or not the previous time is equal to the current time minus 1). So in effect, n corresponds to a number of rows to lag by, rather than a period. I think this is to maintain compatibility with SQL. Now there is new user defined package that defines timed tibble and timed group tibble. It includes a function keyed_lag doing the same thing as this pull request, but defined only within a mutate call lag(), lead() and difference() for regular series with missingness tidyverts/tsibble#55
  • I don’t know Panda that well but it seems that the shift function corresponds to the lag function in dplyr (ie shifts by a number of rows, not a period). https://stackoverflow.com/questions/33875746/make-a-shift-by-index-with-a-pandas-dataframe

Defining a new timed data frame type is certainly an option, although it would be much more complicated than what this pull request does. Another downside, AFIAK, is that it would not extend to other type of Tables.

As for your idea about sorted, yes that sounds like a great way to get laziness. Although every indexing would have to check that the whole vector is still sorted right? Stuff like checking missing obs would take a lot of times etc...

@nalimilan
Copy link
Contributor

Thanks for the references. It would be useful if you could file an issue in DataFrames listing the operations that should be supported. That way we will be able to discuss what's the best solution to implement a time-aware type in/based on DataFrames.

As for your idea about sorted, yes that sounds like a great way to get laziness. Although every indexing would have to check that the whole vector is still sorted right? Stuff like checking missing obs would take a lot of times etc...

I guess you could assume that the vector of timestamps isn't mutated. That's a natural expectation for a lazy view.

Oh, another issue: Is there a interface package for lag, lead, head, tail?
I always got this:

WARNING: both TimeSeries and DataFrames export "tail"; uses of it in module Main must be qualified

And the same story will happen on TimeSeries and ShiftedArray.

@iblis17 head and tail are deprecated in DataFrames, so they will be removed at some point. FWIW, we moved to using first and last so maybe that would be an option for TimesSeries too?

For lag and lead, we could define them in DataAPI, but it's difficult to coordinate multiple packagesif they want to defined methods that operate on types they don't own, which is quite typical for theese functions (like ShiftedArrays does). That would require defining very clearly which package is supposed to provide which method.

@iblislin
Copy link

iblislin commented May 6, 2020

@nalimilan

head and tail are deprecated in DataFrames, so they will be removed at some point. FWIW, we moved to using first and last so maybe that would be an option for TimesSeries too?

👍 I will check them and try to move to first and last.

For lag and lead, we could define them in DataAPI, but it's difficult to coordinate multiple packagesif they want to defined methods that operate on types they don't own, which is quite typical for theese functions (like ShiftedArrays does). That would require defining very clearly which package is supposed to provide which method.

Oh, okay. I just read the doc of ShiftedArrays.
Seem the "just-let-them-conflict" policy is suitable for the types not owned by the package.

@matthieugomez
Copy link
Author

matthieugomez commented May 6, 2020

@nalimilan that seems way too error prone too me. I think the best option would be to rename the lag function in this package to shift, which is more consistent with the name of this package anyway. lag would be reserved for a time non lazy lag.

@FuZhiyu
Copy link

FuZhiyu commented Oct 11, 2020

Thanks @matthieugomez for working on this! This functionality has been hugely (also surprisingly) missing outside of Stata-verse. Here are my two cents on it:

I don’t know Panda that well but it seems that the shift function corresponds to the lag function in dplyr

Pandas now support shift by frequency. Not sure about the underlying implementation but it's pretty slow at least on quarterly series. Also, it's not compatible with MultiIndex in pandas which is also unfriendly to panel data. This leaves me wondering after so many years how people deal with panel data outside of Stata.

I like the current implementation to take another argument of another time index. Since there potentially be multiple columns on the same timestamps, it'll be redundant to attach every column with time information as in TimeArray. Another approach would be to define a TimeDataFrame but I also agree with @matthieugomez on the downside of that path. The current implementation is more flexible and simple.

Concerning sorting and laziness: I find it totally reasonable to assume the time index has been sorted. In most of the applications we have sorted time indexes anyways. Operating on a sorted index is much more efficient as you only need to check n-steps back, so I tend to think it should be the default algorithm while the dictionary-based approach can be a fallback. In Stata/Pandas the lag/lead operators also require the data to be sorted by time indexes (the time variable in their applications are literally the indexes of the data so it's admittedly less error-prone).

Also consistent with the current method of this package, the lag&lead are relative to the index of the arrays. This should give the user a hint that any other indexes they are passing to the time-aware method better to be sorted.

@eloualiche
Copy link

Bump! It would be great to be able to use this package for (unbalanced) panel data.

I don't think everybody is only focused on time series ;)

@matthieugomez
Copy link
Author

matthieugomez commented Feb 17, 2022

The sorted-lazy approach won’t work for panel data — how would one check that the vector is sorted within each group? Running issorted at each getindex call would be too costly.

@eloualiche
Copy link

The sorted-lazy approach won’t work for panel data — how would one check that the vector is sorted within each group? Running issorted at each getindex call would be too costly.

Agreed! I did not think of that. This is a great insight!

@nalimilan
Copy link
Contributor

nalimilan commented Mar 23, 2022

The sorted-lazy approach won’t work for panel data — how would one check that the vector is sorted within each group? Running issorted at each getindex call would be too costly.

@matthieugomez I don't understand this point. I guess you would do something like combine(groupby(df, :key), [:x, :times] => lag => :x_lagged) right? Then lag just has to check that the SubArray it gets passed for times (corresponding to a particular group) is sorted (this is super cheap).

EDIT: Incidentally, this is how PanelShift.jl implements this, even if it's not lazy, so that's probably an efficient strategy in general. Cc: @FuZhiyu

@FuZhiyu
Copy link

FuZhiyu commented Mar 27, 2022

Laziness vs being sorted seems two orthogonal issues. We need :times to be sorted, while we probably want :x_lagged to be lazy. As long as times is not lazy, checking whether it's sorted seems pretty cheap.

As for PanelShift.jl, I didn't implement the lazy approach because 1) I'm not familiar with the machinery of laziness 2) Usually we don't have a need for a lazy lagged array in panel data.

But if the maintainer of ShiftedArrays.jl thinks it's a good idea to incorporate PanelShifts into this package I'm also happy to dig into the lazy wood to make it lazy.

@matthieugomez
Copy link
Author

Is it really cheap to check that times is sorted every time getindex(:x_lagged, i) is called?

@FuZhiyu
Copy link

FuZhiyu commented Mar 28, 2022

ok I see the issue. If for laziness we are not allowed to store the new index for x_lagged, then indeed we need to check whether times is sorted every time we call getindex.

In the case where there are gaps in time arrays, we could either store the whole new index in the ShiftedArray, or store multiple shifts and use them depending on the queried index. But both solutions require different designs of ShiftedArray and seem a little bit awkward.

@nalimilan
Copy link
Contributor

My point is that we could check whether times are sorted only when creating the object. Then getindex would simply assume that the underlying times are not mutated. Am I missing something?

@matthieugomez
Copy link
Author

I guess, pragmatically, I am not sure what would be returned by this command:

df = DataFrame(times = [0, 1], x = ["V1", "V2"])
df.y = lag(df.x, df.times)
sort!(df, :times, rev = true)
df.y

@matthieugomez
Copy link
Author

matthieugomez commented Mar 28, 2022

@nalimilan The other thing is that, assuming that the underlying times are not mutated seems very error prone to me. Safety seems more important than memory efficiency.

@FuZhiyu
Copy link

FuZhiyu commented Mar 29, 2022

I now feel lazy vectors may not be a good idea for time vectors with gaps. Let me clarify.

The current type ShiftedArray stores shifts, so when getindex(:x_lagged, i) is called, it returns x_lagged[i-shift]. However, when there are gaps in time, shifts are no longer constant so the current design cannot be applied directly. Say we have x_lagged=lag(time, x, 1), where x is a length-N vector We can do:

  1. Store a length-N vector of shifts in x_lagged, so when we call getindex(:x_lagged, t), it returns x_lagged[t-shift[t]]. But since we already saved another shift vector, what's the benefit of being lazy?
  2. A true lazy approach, so every call of getindex(x_lagged, t), as @matthieugomez suggests, checks whether time is sorted and searches for x at t-1. Super safe, super inefficient.

Both approaches are safe for post-creation mutation to the time vector though. but neither seem to have an advantage of a non-lazy method.

Or is there any other implementation for lazy shiftedarrays that I haven't thought of?

@piever
Copy link
Collaborator

piever commented Mar 30, 2022

I think it may be important to give some perspective from the point of view of the ShiftedArrays library on this type of functionality.

In my understanding, the main goal of this package is to provide a lazy, shifted array abstraction (ShiftedArray and CircShiftedArray) and, whenever possible, add lazy, memory-efficient versions of established functions (Base.circshift, AbstractFFTs.fftshift).

lag and lead are exceptions to this, and at the moment they are mostly helpers for the ShiftedArray function, i.e., lag(v, n) = ShiftedArray(v, n), lead(v, n) = ShiftedArray(v, -n). They are not crucial, in that the constructor already does most of the job.

I don't think adding additional eager lag and lead methods (based on a Dict) here is consistent with the rest of the library. It would also increase the maintenance burden for this package, which at the moment is quite low, as it is a low-level package with no dependencies. I'm afraid that would change if ShiftedArrays became more of a "toolkit to work with sequence data".

IMO, possible ways forward would be the following.

  1. Unexport lag and lead from ShiftedArrays. A separate package (StatsBase.jl, or a hypothetical WindowsFunctions.jl) would define eager versions, with all the required features. ShiftedArrays.lag would be a lazy version of WindowsFunctions.lag whenever possible. Same could be done for the diff proposal in Add ShiftedArrays.diff for differences between elements in a vector #51 (WindowsFunctions.diff is eager, ShiftedArrays.diff is lazy).
  2. Same as 1., but lag and lead are removed altogether from here, the ShiftedArray constructor is probably enough (see Lead/lag with respect to a time variable #37 (comment)).
  3. Do not change anything in ShiftedArrays, but use instead a custom vector type and specialize lag(CustomVector(times, values), deltat). I think CustomVector should be some sort of AxisArray, with a list of sorted indices and a list of values (see Lead/lag with respect to a time variable #37 (comment)).

@nalimilan
Copy link
Contributor

Thanks for commenting. I agree it's reasonable to keep ShiftedArrays oriented towards lazy operations only, and put eager methods in another package.

I'd tend to prefer your solutions 1 or 2. I actually think it would make sense to define empty lead and lag functions in StatsAPI so that other packages can provide appropriate methods for custom array types (as long as they are consistent with the general definition). Fallback definitions for any AbstractVectors, which would be eager, could live in StatsBase (if others agree).

@matthieugomez
Copy link
Author

I would love Solution 1 or 2

@mkitti
Copy link

mkitti commented Sep 2, 2022

Option 1 is the best choice. You can always remove them later as in Option 2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants