Add a keyword argument to `diff` which preserves length #42509

pdeffebach · 2021-10-05T19:31:48Z

Currently Base.diff(x) produces a new vector with length length(x) - 1.

This is often annoying when working with tabular data, since you cannot do

df.x_diff = diff(df.x)

@nalimilan recommended a keyword argument to Base.diff, which allows for pre-pending a default value such that length, and shape more generally in the case of matrices and other arrays, is preserved.

Given the discussion below, I propose keyword arguments fillfirst and filllast, which give the value used to pad the start or the end of the result, respectively.

julia> begin 
       function newdiff(a::AbstractArray{T,N}; dims::Integer=1, 
                        fillfirst=nothing, 
                        filllast=nothing) where {T,N}
           Base.require_one_based_indexing(a)
           1 <= dims <= N || throw(ArgumentError("dimension $dims out of range (1:$N)"))
       
           r = axes(a)
           r0 = ntuple(i -> i == dims ? UnitRange(1, last(r[i]) - 1) : UnitRange(r[i]), N)
           r1 = ntuple(i -> i == dims ? UnitRange(2, last(r[i])) : UnitRange(r[i]), N)
           if fillfirst !== nothing  
               out = similar(a, Union{eltype(a), typeof(fillfirst)})
               out .= fillfirst
               out[r1...] .= view(a, r1...) .- view(a, r0...)
               return out
           elseif filllast !== nothing  
               out = similar(a, Union{eltype(a), typeof(filllast)})
               out .= filllast
               out[r0...] .= view(a, r1...) .- view(a, r0...)
               return out
           else
               return view(a, r1...) .- view(a, r0...)
           end
       end
       end
newdiff (generic function with 1 method)

julia> x = collect(1:5); # ranges have a separate method

julia> newdiff(x)
4-element Vector{Int64}:
 1
 1
 1
 1

julia> newdiff(x; fillfirst=0)
5-element Vector{Int64}:
 0
 1
 1
 1
 1

julia> newdiff(x; filllast=0)
5-element Vector{Int64}:
 1
 1
 1
 1
 0


nalimilan · 2021-10-05T20:18:57Z

I agree this is really needed. Another possibility for the API would be to have an argument called e.g. default or fill which you would set to missing.

timholy · 2021-10-05T22:39:41Z

The sad part about this is it's not obvious whether missing should go at the beginning or the end, and without reading the docs either guess would be reasonable. When you reduce by 1 there is no uncertainty about alignment, except for the fact that in a way we'd like to define the axes of the resulting array as 1.5:1:n-0.5.

pdeffebach · 2021-10-05T23:28:09Z

That's a reasonable point about ambiguity. But I'm not sure I've seen a context other than d_n = x_{n} - x_{n-1}, where we return missing if we don't know what x_{n-1} is, meaning d_1 is missing.

timholy · 2021-10-06T04:12:55Z

Well, given that arrays in Julia start by default with 1, I'd say the obvious formula is d[n] = x[n+1] - x[n]. That's consistent with the fact that d is shorter by 1 than x, and the fact that the first diff value you can compute is first(d) = x[begin+1] - x[begin]; there isn't an earlier one you can compute. It's thinking of the array as a one-dimensional iterable collection of values and not a function in 1d---the index is almost meaningless. But if you collect from the iterable, you get d[1] = x[2] - x[1]; voila, there is nothing inevitable about the fact that the missing goes in the first slot.

Thinking of it as a function in 1d is why I suggested that the natural axes for d are on the half-integers, i.e., d[n+0.5] = x[n+1] - x[n]. But we don't support non-integer indexing so we can't really do this.
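The two alignments discussed here can be compared side by side. This is only a sketch: `backward` and `forward` are illustrative names, not proposed API.

```julia
x = [1, 4, 9, 16]
d = diff(x)              # d[n] == x[n+1] - x[n], one element shorter than x

# Backward-difference alignment: out[n] = x[n] - x[n-1],
# so the slot with no known value is the first one.
backward = [missing; d]

# Forward-difference alignment: out[n] = x[n+1] - x[n],
# so the slot with no known value is the last one.
forward = [d; missing]

# The same difference x[2] - x[1] lands at index 2 in one case, index 1 in the other.
backward[2] == x[2] - x[1]
forward[1] == x[2] - x[1]
```

Both results have the same length as `x`; they differ only in which index each difference is attached to.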

KristofferC · 2021-10-06T07:22:09Z

I don't think we should introduce Missings into arrays like this. Missings have quite unintuitive behavior for people that are not frequent users of the data science stack.

> Another possibility for the API would be to have an argument called e.g. default or fill which you would set to missing.

Something like this would be better imho.

nalimilan · 2021-10-06T08:22:53Z

We could have fillfirst and filllast arguments (or just first and last) that one would set to missing or NaN or anything depending on the use case. Passing both would be an error.

petvana · 2021-10-06T09:45:50Z

What about NumPy syntax (prepend, append)?

 numpy.diff(a, n=1, axis=-1, prepend=<no value>, append=<no value>)

nickrobinson251 · 2021-10-06T10:08:44Z

what would be the benefit of something like diff(x; prepend=missing) over [missing; diff(x)] / vcat(missing, diff(x))? That it would maintain the container type?

nalimilan · 2021-10-06T10:11:51Z

That, and it would avoid making two allocations.
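A rough sketch of the difference, assuming a one-based `Vector`. The second version is only one way a keyword-based method might avoid the temporary array; it is not the implementation of any existing `diff` method.

```julia
x = [1.0, 2.5, 4.5]

# Two allocations: diff(x) builds a temporary, then vcat copies it into a new array.
two_step = vcat(missing, diff(x))

# One allocation: ask the input for an array of its own kind with a widened
# element type, then fill it in place. The broadcast writes straight into
# `one_step` without creating an intermediate array.
one_step = similar(x, Union{Missing, eltype(x)})
one_step[1] = missing
one_step[2:end] .= view(x, 2:length(x)) .- view(x, 1:length(x)-1)

isequal(two_step, one_step)
```

Because `similar` is dispatched on the input, a custom array type can return its own container here, whereas `vcat` with a scalar falls back to whatever concatenation produces.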

piever · 2021-10-06T11:11:13Z

I also think this would be useful as a so-called "window function" that preserves input length. The interface I had in mind was along the lines of

diff(v, n; default)

where n denotes the shift to perform before subtracting, and default denotes the value to use when going outside the range of one of the two arrays. In practice, for a positive n, diff(v, n; default=missing) would have n missings at the beginning, and diff(v, -n; default=missing) would have n missings at the end.

There is a friction point in that it is not super clear to me whether the default should be used before subtracting (e.g., subtract from v a shifted version of v padded on one side with default) or whether to compute diff and pad the result with default. The former is easy to implement lazily with ShiftedArrays, see JuliaArrays/ShiftedArrays.jl#51 (comment), but not the latter. The two things are equivalent for default=missing, but different in general.
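To make the two readings concrete, here is an eager sketch for positive n on a one-based vector. The helper names `pad_then_subtract` and `subtract_then_pad` are hypothetical and exist only for this illustration; this is not how ShiftedArrays implements shifting.

```julia
# Reading 1: pad the shifted copy of v with `default`, then subtract.
function pad_then_subtract(v, n; default)
    # shifted[i] is v[i-n] where that index exists, otherwise `default`
    shifted = [i - n >= firstindex(v) ? v[i - n] : default for i in eachindex(v)]
    return v .- shifted
end

# Reading 2: compute the differences, then put `default` where none exists.
function subtract_then_pad(v, n; default)
    return [i - n >= firstindex(v) ? v[i] - v[i - n] : default for i in eachindex(v)]
end

v = [10, 12, 15, 19]

pad_then_subtract(v, 1; default=0)   # [10, 2, 3, 4]
subtract_then_pad(v, 1; default=0)   # [0, 2, 3, 4]

# With missing the two readings agree, since 10 - missing is missing.
isequal(pad_then_subtract(v, 1; default=missing),
        subtract_then_pad(v, 1; default=missing))
```

With `default=0` the first element is `v[1] - 0` under the first reading but `0` under the second, which is the discrepancy described above.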

pdeffebach · 2021-10-06T13:42:04Z

Just to emphasize, about missings in response to @KristofferC , we don't need to have missings here. Ideally an API would allow for any value the user wishes for the unknown differences.

The main goal of this issue (as indicated by the title) is an operation which preserves shape, not necessarily introducing missings.

I think @petvana 's idea about prepend and append is very good and would get rid of the ambiguity mentioned by Tim.

pdeffebach · 2021-10-06T14:33:13Z

All, I have updated the initial post in this issue with a proposal of fillfirst and filllast which makes no assumption about missings as a default value.

StefanKarpinski · 2021-10-06T16:41:35Z

What about this instead: introduce a wrap::Bool=false keyword which changes diff to produce a vector of the same size where the extra element (at the end) is the difference between the last and first element. Then if someone wants to replace that element with missing, they can just do an assignment afterwards. That doesn't cover the "prepend" case, but that seems less likely to be what someone wants as it changes the index that every difference ends up at, whereas putting a[end]-a[1] at the end leaves all the other differences where they would otherwise be.

Which does bring me to this option: d = diff(v); push!(d, missing). Same effect, probably doesn't actually do any additional allocation.
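A sketch of what the wrap behavior might look like, taking the a[end]-a[1] wording literally. `wrapdiff` is a hypothetical name for illustration; Base.diff has no `wrap` keyword.

```julia
# Hypothetical helper: the ordinary differences, followed by last - first.
wrapdiff(v::AbstractVector) = vcat(diff(v), last(v) - first(v))

v = [1, 4, 9, 16]
w = wrapdiff(v)           # [3, 5, 7, 15]

# The ordinary differences stay at the indices diff(v) gives them.
w[1:end-1] == diff(v)
```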

nalimilan · 2021-10-06T19:34:56Z

The original motivation for this issue was to prepend missing. That's actually quite useful in modelling to create a variable giving the increase compared with the previous value: appending missing would be problematic for a causal interpretation as a future increase would be assigned to the current observation.

FWIW prepending missing is what R does with just diff(v) so there's clearly a use case for it (R doesn't even support appending missing). The NumPy method also accepts prepend and append arguments. Wrapping (a.k.a. circular shift) is yet another possibility, but I'm not sure it's the most common -- and anyway we can support all three behaviors.

Also note that d = diff(v); push!(d, missing) doesn't work as the eltype of d doesn't support missing in general.
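A short sketch of that failure and one possible workaround (converting to a wider element type, which does allocate a new array):

```julia
v = [1, 4, 9]
d = diff(v)                 # Vector{Int64}

# push! has to convert `missing` to Int64 first, and no such conversion exists.
err = try
    push!(d, missing)
catch e
    e
end
err isa MethodError

# Widening the element type makes room for `missing`.
dm = convert(Vector{Union{Missing, Int}}, d)
push!(dm, missing)
```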

JeffBezanson · 2021-10-08T21:34:16Z

We should also avoid push! to better prepare for the immutable future...

pdeffebach mentioned this issue Oct 5, 2021

Add ShiftedArrays.diff for differences between elements in a vector JuliaArrays/ShiftedArrays.jl#51

Open

petvana linked a pull request Aug 16, 2023 that will close this issue

Introduce optional prepend/append argument for diff #50945

Draft

5 tasks

brenhinkeller added the kind:feature Indicates new feature / enhancement requests label Aug 17, 2023
