UnitWeights #358

Nosferican · 2018-03-14T16:34:19Z

I would like to discuss adding a new type of weights for consistency and efficiency purposes. The proposed type is basically UnitWeights.

nalimilan · 2018-03-14T17:25:40Z

I'm in favor in principle, but why would we need to actually store a vector of weights if we know they are unit weights? I would have thought we should use an implementation similar to #135.

Nosferican · 2018-03-14T18:13:53Z

I would be fine with

struct UnitWeights <: AbstractWeights end

Then just specialize the methods for efficiency.

nalimilan · 2018-03-14T18:30:26Z

If we add this, it should also implement getindex and sum so that it behaves like other weights types (and like vectors).

Nosferican · 2018-03-15T19:45:46Z

Something like?

struct UnitWeights <: AbstractWeights
    sum::Int64
    UnitWeights(obj::AbstractVector) = new(length(obj))
end
getindex(obj::UnitWeights, i::Integer) = one(Float64)
sum(obj::UnitWeights) = getfield(obj, :sum)

nalimilan · 2018-03-15T20:47:26Z

Mostly, though that's a bit more complex, see #135.

Nosferican · 2018-11-05T00:32:53Z

FillArrays provides the Ones which are optimal for this.

lbittarello · 2019-08-21T21:59:52Z

I have redesigned UnitWeights in Microeconometrics.jl (cf. here). It does not store a vector of ones any longer. We could perhaps port it into StatsBase if there is any interest and the implementation is fine. Let me know.

nalimilan · 2019-08-22T09:58:38Z

Sure, please file a PR! Though looking at your implementation it appears to allow any value, while "unit" evokes weights equal to 1 to me, doesn't it?

lbittarello · 2019-08-22T10:46:16Z

True. As I understand it, the current inner constructor should prevent the user from creating weights with a different value, but the user could later modify it. The following alternative implementation is safer, albeit marginally less efficient:

struct UnitWeights{S<:Real, T<:Real} <: AbstractWeights{S, T, V where V<:Vector{T}}
    sum::S
end

UnitWeights(::Type{T}, s::S) where {S, T} = UnitWeights{S, T}(s)
UnitWeights(r::T, s::S)      where {S, T} = UnitWeights{S, T}(s)
UnitWeights{T}(s::S)         where {S, T} = UnitWeights{S, T}(s)

We then need to replace references to wv.el with one(T). For instance,

Base.values(wv::UnitWeights{S, T})  where {S, T} = fill(wv.el, length(wv))

becomes

Base.values(wv::UnitWeights{S, T})  where {S, T} = fill(one(T), length(wv))

Does it seem better?

Nosferican · 2019-08-22T10:48:30Z

I would have to test it, but values tend to be iterators, so maybe it could yield one(T) length(wv) times rather than rely on the fill call, which is invoked every time values is called.

lbittarello · 2019-08-22T10:51:24Z

values returns a vector under the current implementation of the other weight types. Shouldn't it also return a vector for UnitWeights for consistency's sake?

Nosferican · 2019-08-22T11:44:52Z

obj = 1:3
wt = FrequencyWeights(obj)
isa(values(wt), UnitRange)

I believe it conforms to AbstractVector, but doesn't require it to be a Vector (as opposed to FillArrays which requires it to be a Vector).

Maybe something like

using StatsBase
struct UnitIterator{T<:Real} <: AbstractVector{T}
    l::Int
    UnitIterator(x::Real) = new{eltype(x)}(x)
end
Base.length(obj::UnitIterator) = obj.l
Base.size(obj::UnitIterator) = (length(obj),)
function Base.getindex(obj::UnitIterator{T}, i::Integer) where {T<:Real}
    ((i > 0) && (i ≤ obj.l)) ? one(T) : Base.throw_boundserror(obj, i)
end
function Base.getindex(obj::UnitIterator{T}, is::AbstractRange) where {T<:Real}
    is₀, is₁ = extrema(is)
    ((is₀ ≥ 1) && (is₁ ≤ length(obj))) ? UnitIterator(length(is)) : Base.throw_boundserror(obj, is)
end
struct UnitWeight{S<:Int,T<:Real,V<:UnitIterator{T}} <: AbstractWeights{S,T,V}
    values::UnitIterator{T}
    sum::Int
    function UnitWeight(obj::UnitIterator{T}) where {T<:Real}
        new{Int,T,UnitIterator{T}}(obj, obj.l)
    end
    UnitWeight(obj::Real) = UnitWeight(UnitIterator(obj))
end
x = UnitWeight(5)

lbittarello · 2019-08-22T12:01:10Z

My bad. I didn't know that ranges were a subtype of arrays.

Wouldn't it be simpler to keep the current definition and have values return Iterators.repeated(one(T), length(wv))?
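
For reference, a minimal sketch of what Iterators.repeated would give here, using only Base Julia. The length 3 and the element type Float64 are arbitrary choices for illustration, not anything taken from StatsBase.

```julia
# A lazy iterator that yields one(Float64) three times without allocating a vector.
ones_iter = Iterators.repeated(one(Float64), 3)

# It has a known length, so reductions work on it directly.
n = length(ones_iter)
total = sum(ones_iter)

# It is not an AbstractVector, so it would not fit a parameter constrained to one.
is_vector = ones_iter isa AbstractVector

# Materialize only when a real Vector is needed.
materialized = collect(ones_iter)
```

The catch, as far as this sketch goes, is the AbstractVector check: an iterator of this kind does not support indexing.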

Nosferican · 2019-08-22T12:08:25Z

Probably the best solution, but it might require breaking AbstractWeights{S,T,V}, since the rest of the <:AbstractWeights types hold an AbstractVector for values... We could consider relaxing that to allow some iterator while keeping the shape information in the overall struct. That may be breaking, but it would allow for greater flexibility and efficiency.

Disclosure: I did add an opinionated getindex(obj, ::AbstractVector) that returns a new instance of weights rather than just the values... I find that particularly useful for my work, but it isn't the current behavior AFAIK. That would be another potential breaking change we could include if it gets triaged and we decide to have a minor release.

nalimilan · 2019-08-22T12:14:07Z

What's the purpose of values actually? Depending on what it's useful for, we could have values be the identity for UnitWeights, or use fill. Defining a custom vector type doesn't seem worth it, given that UnitWeights already satisfies the AbstractArray interface.

Nosferican · 2019-08-22T12:18:09Z

I would favor relaxing the AbstractWeights{S,T,V<:AbstractVector} such that the values could be an iterator while keeping the shape information in the weight struct. That way we don't have to materialize the vector, not even by repeating a reference with fill. Maybe fill is efficient enough and we could use it without issues.

Nosferican · 2019-08-22T12:21:25Z

It might also be better to have a ConstantWeights (and handle UnitWeights as sub/special case). Are there any use cases for constant weights that might use it?

nalimilan · 2019-08-22T12:21:43Z

I would favor relaxing the AbstractWeights{S,T,V<:AbstractVector} such that the values could be an iterator while keeping the shape information in the weight struct.

What would be the advantage over returning the UnitWeights object itself?

Nosferican · 2019-08-22T12:27:35Z

The constructor for UnitWeight could be something like UnitWeight(5000) meaning that all the information is provided and can be handled with an iterator without having to hold a fill(one(Int), 5000) in the values. That might be okay for setting the weights once, but I usually end up doing subsetting which requires me to set new instances of Weights many times and not having to materialize it each time would be nice. I could use type dispatch to handle those instances, but not having to materialize it would be nicer in general.

lbittarello · 2019-08-22T12:39:22Z

I will make a pull request with the current implementation. We can then review it as necessary. I think however that broader changes to the weights infrastructure belong to a separate pull request.

nalimilan · 2019-08-22T12:42:12Z

The constructor for UnitWeight could be something like UnitWeight(5000) meaning that all the information is provided and can be handled with an iterator without having to hold a fill(one(Int), 5000) in the values. That might be okay for setting the weights once, but I usually end up doing subsetting which requires me to set new instances of Weights many times and not having to materialize it each time would be nice. I could use type dispatch to handle those instances, but not having to materialize it would be nicer in general.

What I propose is that fill is never called at all. No allocation would be needed then.

Nosferican · 2019-08-22T12:44:50Z

If you want to open the PR for now:

1. Adjust the fields to conform to the current AbstractWeights{S,T,V}. That means holding the fill(one(T), x) as the values (two fields, values and sum).
2. Relax Vector to AbstractVector and Int to Integer.

lbittarello · 2019-08-22T12:47:56Z

I thought we wanted to avoid holding the weight vector. To quote @nalimilan above,

I'm in favor in principle, but why would we need to actually store a vector of weights if we know they are unit weights?

Nosferican · 2019-08-22T12:48:13Z

For making it a subtype of AbstractWeights.

Relaxing that requirement was the discussion I mentioned above, but since we can discuss that at a later stage for including UnitWeights as things are, it needs to hold it.

You could still need to materialize the input for wrapping it in a UnitWeight. Since the values are ignored, maybe an AbstractRange could work nicely.

UnitWeight(1:1000)

That would work after Vector is relaxed to AbstractVector.

The current API requires an AbstractVector as a field, currently we need to put something in there. I think, @nalimilan was referring to returning the identity rather than calling fill for Base.values.

lbittarello · 2019-08-22T13:03:44Z

Sorry, I don't follow.

The current API requires an AbstractVector as a field, currently we need to put something in there

Why?

The current implementation doesn't hold the weight vector in memory, but it works fine: you can index it, you can iterate on it, you have an appropriate AbstractVector as a parameter... Given appropriate method definitions, why do we need to hold the weight vector in memory?

You could still need to materialize the input for wrapping it in a UnitWeight

Why would you need to materialize the input?

nalimilan · 2019-08-22T13:08:28Z

The API doesn't require fields, it requires methods. Also, UnitWeight(1:1000) makes no sense since weights are not equal in that vector...

Nosferican · 2019-08-22T14:16:41Z

What I am trying to say is that all other weights have a constructor T(::AbstractVector).
For UnitWeight it should be T(n::Integer) rather than T(::AbstractVector). Otherwise you would have to materialize something like UnitWeight(ones(n)) and then test all(isone, x). Rather than doing that, we could omit the check and pass an efficient placeholder (e.g., 1:n). If we pass just the length of the vector, we don't have an object to return as the identity, and we would end up materializing something like 1:n and defining getindex as in the draft.

lbittarello · 2019-08-22T14:21:21Z

Good point. But the proposed implementation already takes a number instead of an array.

Nosferican · 2019-08-22T14:45:33Z

Aye. That is why it currently calls fill every time values is invoked. To avoid that, nalimilan suggested just passing the identity, but since this isn't a wrapper it would have to materialize it and hold it in the struct for that. Hence, why I suggested just holding an iterator à la FillArrays.Ones. That way we never have to materialize anything (just a lazy iterator).

nalimilan · 2019-08-22T15:30:42Z

No, I suggested values to return the object itself, so nothing needs to be materialized.
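
A minimal sketch of that idea, assuming a type that stores only its length. SketchUnitWeights and weightvalues are hypothetical names used for illustration; this is not the StatsBase implementation, and it subtypes AbstractVector directly so that it runs without the package.

```julia
# Hypothetical type for illustration only; not the StatsBase implementation.
struct SketchUnitWeights{T<:Real} <: AbstractVector{T}
    len::Int   # number of observations; no vector of ones is stored
end

# Minimal AbstractVector interface: size and scalar indexing.
Base.size(w::SketchUnitWeights) = (w.len,)
function Base.getindex(w::SketchUnitWeights{T}, i::Int) where {T}
    @boundscheck checkbounds(w, i)
    return one(T)
end

# The sum is known from the length alone.
Base.sum(w::SketchUnitWeights{T}) where {T} = convert(T, w.len)

# Stand-in for `values`: return the object itself, so nothing is materialized.
weightvalues(w::SketchUnitWeights) = w

w = SketchUnitWeights{Float64}(5000)
```

With these definitions, weightvalues(w) is w itself, and w[1], length(w) and sum(w) all work without a 5000-element array ever being created.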

Nosferican · 2019-08-22T15:51:14Z

Base.values(obj::UnitWeight) = obj still has the issue of what the V is in UnitWeight{S<:Integer,T<:Real} <: AbstractWeights{S,T,V}, no?

nalimilan · 2019-08-22T16:08:01Z

I think we can just remove the V parameter from the definition of AbstractWeights: that's an implementation detail that isn't useful in the interface.

Nosferican · 2019-08-22T16:09:44Z

Aye. That was my suggestion earlier in the thread.

lbittarello · 2019-08-22T16:12:24Z

I agree, but I think it should be a separate pull request, since it's a bigger change

Nosferican · 2019-08-22T16:14:42Z

Aye. That's why for the PR to get it in, it should hold some <:AbstractVector V in the form of range or something. We can clean it up after we drop the {S,T,V} for {S,T}.

nalimilan · 2019-08-22T16:29:41Z

I think that change literally only requires touching two lines (or almost).

lbittarello · 2019-08-22T16:32:09Z

Since we're discussing AbstractWeights: why do we allow the sum of weights to have a different type from the weight elements? Shouldn't they logically be the same?

Nosferican · 2019-08-22T16:34:18Z

I believe it was to prevent overflow: the sum of valid Float64 numbers may require BigFloat... In practice, probably not a big issue... As for the type of the sum for UnitWeight, maybe we want that to be Int or S<:Integer.

nalimilan · 2019-08-22T16:41:42Z

Yeah, for some types sum returns a different type (Int for Int8).
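
A quick illustration of that point in plain Base Julia; the values are arbitrary.

```julia
# Two Int8 values whose sum (200) does not fit in Int8 (maximum 127).
small = Int8[100, 100]

# Base widens small integer types when summing, so the result is an Int.
total = sum(small)
total_type = typeof(total)
```

Here the element type is Int8 while the sum is an Int, which is the kind of case where separate type parameters for the elements and the sum make sense.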

nalimilan · 2019-08-23T09:01:14Z

To fix the values problem, I think we should just deprecate that function in favor of convert(Vector, wv). There's no need to define functions for things that can be expressed in more standard ways. Then we can just define convert(::Type{Vector}, wv::UnitWeights) = ones(eltype(wv), length(wv)).
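
A sketch of how that could look, written against a hypothetical stand-in type (DemoUnitWeights is an assumed name, not the StatsBase type) so that it runs with Base alone.

```julia
# Hypothetical stand-in for UnitWeights: stores only the length.
struct DemoUnitWeights{T<:Real} <: AbstractVector{T}
    len::Int
end
Base.size(w::DemoUnitWeights) = (w.len,)
Base.getindex(w::DemoUnitWeights{T}, i::Int) where {T} = (checkbounds(w, i); one(T))

# The definition proposed above, written against the stand-in type.
Base.convert(::Type{Vector}, w::DemoUnitWeights) = ones(eltype(w), length(w))

w = DemoUnitWeights{Float64}(3)
v = convert(Vector, w)   # allocation happens only here, on explicit request
```

The weights object itself stays allocation-free; a dense Vector is built only when a caller explicitly asks for one.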

lbittarello · 2019-08-23T18:34:53Z

Another incidental question: do we need to parametrize the weights by the type of sum? I think that we only use it in wsumtype and wmeantype, which we could easily redefine / work around.

nalimilan · 2019-10-19T14:30:39Z

This is fully implemented AFAICT.

lbittarello mentioned this issue Aug 22, 2019

Add unit weights #515

Merged

lbittarello mentioned this issue Sep 29, 2019

Simplify weights #526

Merged

nalimilan closed this as completed Oct 19, 2019
