Add a keyword argument to diff which preserves length #42509

Open
pdeffebach opened this issue Oct 5, 2021 · 15 comments · May be fixed by #50945
Labels
kind:feature Indicates new feature / enhancement requests

Comments

@pdeffebach
Contributor

pdeffebach commented Oct 5, 2021

Currently Base.diff(x) produces a new vector with length length(x) - 1.

This is often annoying when working with tabular data, since you cannot do

df.x_diff = diff(df.x)
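To make the length mismatch concrete, here is a minimal sketch that uses a plain vector as a stand-in for the column `df.x` (no DataFrames.jl involved; the variable names are just for illustration). The `[missing; d]` line is the manual workaround available today, not the proposed API.

```julia
# Stand-in for a table column.
x = [1, 4, 9, 16, 25]

d = diff(x)                  # first differences: x[i+1] - x[i]

length(x)                    # 5
length(d)                    # 4, so d cannot be stored next to x as a column

# Manual length-preserving workaround (allocates a second array):
padded = [missing; d]
length(padded) == length(x)  # true
```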

@nalimilan recommended a keyword argument to Base.diff, which allows for pre-pending a default value such that length, and shape more generally in the case of matrices and other arrays, is preserved.

Given the discussion in the issue below, I propose keyword arguments fillfirst and filllast, which give the value placed in the first or last position of the result (along dims), respectively.

julia> begin
       function newdiff(a::AbstractArray{T,N}; dims::Integer=1,
                        fillfirst=nothing,
                        filllast=nothing) where {T,N}
           Base.require_one_based_indexing(a)
           1 <= dims <= N || throw(ArgumentError("dimension $dims out of range (1:$N)"))

           r = axes(a)
           # r0 drops the last slice along dims, r1 drops the first one
           r0 = ntuple(i -> i == dims ? UnitRange(1, last(r[i]) - 1) : UnitRange(r[i]), N)
           r1 = ntuple(i -> i == dims ? UnitRange(2, last(r[i])) : UnitRange(r[i]), N)
           # if both keywords are given, fillfirst takes precedence
           if fillfirst !== nothing
               # same shape as a; the first slice along dims keeps the fill value
               out = similar(a, Union{eltype(a), typeof(fillfirst)})
               out .= fillfirst
               out[r1...] .= view(a, r1...) .- view(a, r0...)
               return out
           elseif filllast !== nothing
               # same shape as a; the last slice along dims keeps the fill value
               out = similar(a, Union{eltype(a), typeof(filllast)})
               out .= filllast
               out[r0...] .= view(a, r1...) .- view(a, r0...)
               return out
           else
               # no fill requested: same result as the current diff
               return view(a, r1...) .- view(a, r0...)
           end
       end
       end
newdiff (generic function with 1 method)

julia> x = collect(1:5); # separate method for ranges

julia> newdiff(x)
4-element Vector{Int64}:
 1
 1
 1
 1

julia> newdiff(x; fillfirst=0)
5-element Vector{Int64}:
 0
 1
 1
 1
 1

julia> newdiff(x; filllast=0)
5-element Vector{Int64}:
 1
 1
 1
 1
 0
@nalimilan
Member

I agree this is really needed. Another possibility for the API would be to have an argument called e.g. default or fill which you would set to missing.

@timholy
Sponsor Member

timholy commented Oct 5, 2021

The sad part about this is it's not obvious whether missing should go at the beginning or the end, and without reading the docs either guess would be reasonable. When you reduce by 1 there is no uncertainty about alignment, except for the fact that in a way we'd like to define the axes of the resulting array as 1.5:1:n-0.5.

@pdeffebach
Contributor Author

That's a reasonable point about ambiguity. But I'm not sure I've seen a context other than d_n = x_{n} - x_{n-1}, where we return missing if we don't know what x_{n-1} is, meaning d_1 is missing.

@timholy
Sponsor Member

timholy commented Oct 6, 2021

Well, given that arrays in Julia start by default with 1, I'd say the obvious formula is d[n] = x[n+1] - x[n]. That's consistent with the fact that d is shorter by 1 than x, and the fact that the first diff value you can compute is first(d) = x[begin+1] - x[begin]; there isn't an earlier one you can compute. It's thinking of the array as a one-dimensional iterable collection of values and not a function in 1d---the index is almost meaningless. But if you collect from the iterable, you get d[1] = x[2] - x[1]; voila, there is nothing inevitable about the fact that the missing goes in the first slot.

Thinking of it as a function in 1d is why I suggested that the natural axes for d are on the half-integers, i.e., d[n+0.5] = x[n+1] - x[n]. But we don't support non-integer indexing so we can't really do this.
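The two alignments under discussion can be put side by side. This is only an illustrative sketch: the names `forward` and `backward` are made up here, and `missing` is used as the placeholder purely for the sake of the example.

```julia
x = [1, 4, 9, 16]
d = diff(x)                  # [3, 5, 7]; d[n] == x[n+1] - x[n]

# Forward convention: slot n holds x[n+1] - x[n], so the gap is at the end.
forward = [d; missing]

# Backward convention: slot n holds x[n] - x[n-1], so the gap is at the start.
backward = [missing; d]

# Each convention is consistent wherever it is defined;
# they differ only by a one-slot shift.
all(forward[n] == x[n+1] - x[n] for n in 1:length(x)-1)   # true
all(backward[n] == x[n] - x[n-1] for n in 2:length(x))    # true
```

Neither placement falls out of the length-(n-1) result by itself, which is the ambiguity being pointed out.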

@KristofferC
Sponsor Member

I don't think we should introduce Missings into arrays like this. Missings have quite unintuitive behavior for people that are not frequent users of the data science stack.

> Another possibility for the API would be to have an argument called e.g. default or fill which you would set to missing.

Something like this would be better imho.

@nalimilan
Member

We could have fillfirst and filllast arguments (or just first and last) that one would set to missing or NaN or anything depending on the use case. Passing both would be an error.

@petvana
Member

petvana commented Oct 6, 2021

What about NumPy syntax (prepend, append)?

 numpy.diff(a, n=1, axis=-1, prepend=<no value>, append=<no value>)

@nickrobinson251
Contributor

What would be the benefit of something like diff(x; prepend=missing) over [missing; diff(x)] / vcat(missing, diff(x))? That it would maintain the container type?

@nalimilan
Member

That, and it would avoid making two allocations.
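A rough sketch of the two points (container type and allocation count), for vectors only. `diff_prepend` is a hypothetical helper written for this comparison, not an existing function, and it assumes that subtracting two elements gives a value of the element type.

```julia
x = [1.0, 2.5, 4.5, 8.0]

# Two allocations: diff(x) allocates, then vcat allocates the final result.
two_step = vcat(missing, diff(x))

# One allocation: reserve the full-length output up front and fill it in place.
# `similar` asks the input for an array of the same kind, which is how the
# container type would carry over for non-Vector inputs.
function diff_prepend(x::AbstractVector, fill)   # hypothetical name
    out = similar(x, Union{eltype(x), typeof(fill)})
    isempty(x) && return out
    out[begin] = fill
    for i in firstindex(x)+1:lastindex(x)
        out[i] = x[i] - x[i-1]
    end
    return out
end

one_step = diff_prepend(x, missing)
isequal(one_step, two_step)   # true
```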

@piever
Contributor

piever commented Oct 6, 2021

I also think this would be useful as a so-called "window function" that preserves input length. The interface I had in mind was along the lines of

diff(v, n; default)

where n denotes the shift to perform before subtracting, and default denotes the value to use when going outside the range of one of the two arrays. In practice, for a positive n, diff(v, n; default=missing) would have n missings at the beginning, and diff(v, -n; default=missing) would have n missings at the end.

There is a friction point in that it is not super clear to me whether the default should be used before subtracting (e.g., subtract from v a shifted version of v padded on one side with default) or after (compute diff and pad the result with default). The former is easy to implement lazily with ShiftedArrays, see JuliaArrays/ShiftedArrays.jl#51 (comment), but not the latter. The two things are equivalent for default=missing, but different in general.
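To see where the two readings diverge, here is a small eager sketch with a shift of 1 and a non-missing default. It uses plain arrays rather than ShiftedArrays, and the variable names are only for illustration.

```julia
v = [10, 12, 15, 19]
default = 0

# Option A: pad the shifted input with `default`, then subtract.
shifted = [default; v[1:end-1]]     # v shifted by one, padded at the start
pad_then_diff = v .- shifted        # [10, 2, 3, 4]

# Option B: compute the differences, then pad the result with `default`.
diff_then_pad = [default; diff(v)]  # [0, 2, 3, 4]

# With `missing` the two options coincide, because v[1] - missing is missing.
a = v .- [missing; v[1:end-1]]
b = [missing; diff(v)]
isequal(a, b)                       # true
```

With Option A the first slot holds v[1] - default, so the choice only stops mattering when the default absorbs subtraction, as missing does.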

@pdeffebach
Contributor Author

Just to emphasize, about missings in response to @KristofferC: we don't need to have missings here. Ideally an API would allow for any value the user wishes for the unknown differences.

The main goal of this issue (as indicated by the title) is an operation which preserves shape, not necessarily introducing missings.

I think @petvana's idea about prepend and append is very good and would get rid of the ambiguity mentioned by Tim.

@pdeffebach
Contributor Author

All, I have updated the initial post in this issue with a proposal of fillfirst and filllast which makes no assumption about missings as a default value.

@StefanKarpinski
Sponsor Member

What about this instead: introduce a wrap::Bool=false keyword which changes diff to produce a vector of the same size where the extra element (at the end) is the difference between the last and first element. Then if someone wants to replace that element with missing, they can just do an assignment afterwards. That doesn't cover the "prepend" case, but that seems less likely to be what someone wants as it changes the index that every difference ends up at, whereas putting a[end]-a[1] at the end leaves all the other differences where they would otherwise be.

Which does bring me to this option: d = diff(v); push!(d, missing). Same effect, probably doesn't actually do any additional allocation.
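A sketch of what the wrap idea would produce for a vector. `wrapdiff` is a hypothetical stand-in for `diff(v; wrap=true)`, and the extra element follows the a[end]-a[1] form literally as written in the comment; the sign convention is an assumption on my part.

```julia
# Hypothetical helper standing in for a `wrap=true` keyword.
wrapdiff(v::AbstractVector) = [diff(v); v[end] - v[1]]

v = [1, 4, 9, 16]
d = wrapdiff(v)              # [3, 5, 7, 15]

# Appending keeps every ordinary difference at its usual index...
d[1:end-1] == diff(v)        # true

# ...whereas prepending a value shifts each of them by one slot.
prepended = [0; diff(v)]
prepended[2:end] == diff(v)  # true
```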

@nalimilan
Member

The original motivation for this issue was to prepend missing. That's actually quite useful in modelling to create a variable giving the increase compared with the previous value: appending missing would be problematic for a causal interpretation as a future increase would be assigned to the current observation.

FWIW prepending missing is what R does with just diff(v) so there's clearly a use case for it (R doesn't even support appending missing). The NumPy method also accepts prepend and append arguments. Wrapping (a.k.a. circular shift) is yet another possibility, but I'm not sure it's the most common -- and anyway we can support all three behaviors.

Also note that d = diff(v); push!(d, missing) doesn't work as the eltype of d doesn't support missing in general.
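The eltype problem is easy to reproduce. The sketch below catches the failure and then shows one way around it, widening the element type first (which does copy); the variable names are illustrative.

```julia
d = diff([1, 2, 3])          # Vector{Int64}

# push! has to convert `missing` to Int64, which is not possible, so it throws.
failed = try
    push!(d, missing)
    false
catch
    true
end

# Widening the element type first makes the append legal, at the cost of a copy.
wide = convert(Vector{Union{Missing, Int}}, d)
push!(wide, missing)
```

So the push! route only works without an extra copy when the differenced vector already has an element type that admits the fill value.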

@JeffBezanson
Sponsor Member

We should also avoid push! to better prepare for the immutable future...

@petvana petvana linked a pull request Aug 16, 2023 that will close this issue
5 tasks
@brenhinkeller brenhinkeller added the kind:feature Indicates new feature / enhancement requests label Aug 17, 2023
10 participants