Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UnitWeights #358

Closed
Nosferican opened this issue Mar 14, 2018 · 42 comments
Closed

UnitWeights #358

Nosferican opened this issue Mar 14, 2018 · 42 comments

Comments

@Nosferican
Copy link
Contributor

I would like to discuss adding a new type of weights for consistency and efficiency purposes. The proposed type is basically UnitWeights.

@nalimilan
Copy link
Member

I'm in favor in principle, but why would we need to actually store a vector of weights if we know they are unit weights? I would have thought we should use an implementation similar to #135.

@Nosferican
Copy link
Contributor Author

I would be fine with

struct UnitWeights <: AbstractWeights end

Then just specialize the methods for efficiency.

@nalimilan
Copy link
Member

If we add this, it should also implement getindex and sum so that it behaves like other weights types (and like vectors).

@Nosferican
Copy link
Contributor Author

Something like?

struct UnitWeights <: AbstractWeights
    sum::Int64
    UnitWeights(obj::AbstractVector) = new(length(obj))
end
getindex(obj::UnitWeights) = one(Float64)
sum(obj::UnitWeights) = getfield(obj, :sum)

@nalimilan
Copy link
Member

Mostly, though that's a bit more complex, see #135.

@Nosferican
Copy link
Contributor Author

FillArrays provides the Ones which are optimal for this.

@lbittarello
Copy link
Contributor

I have redesigned UnitWeights in Microeconometrics.jl (cf. here). It does not store a vector of ones any longer. We could perhaps port it into StatsBase if there is any interest and the implementation is fine. Let me know.

@nalimilan
Copy link
Member

Sure, please file a PR! Though looking at your implementation it appears to allow any value, while "unit" evokes weights equal to 1 to me, doesn't it?

@lbittarello
Copy link
Contributor

True. As I understand it, the current inner constructor should prevent the user from creating weights with a different value, but the user could later modify it. The following alternative implementation is safer, albeit marginally less efficient:

struct UnitWeights{S<:Real, T<:Real} <: AbstractWeights{S, T, V where V<:Vector{T}}
    sum::S
end

UnitWeights(::Type{T}, s::S) where {S, T} = UnitWeights{S, T}(s)
UnitWeights(r::T, s::S)      where {S, T} = UnitWeights{S, T}(s)
UnitWeights{T}(s::S)         where {S, T} = UnitWeights{S, T}(s)

We need then change references to wv.el for one(T) – for instance,

Base.values(wv::UnitWeights{S, T})  where {S, T} = fill(wv.el, length(wv))

becomes

Base.values(wv::UnitWeights{S, T})  where {S, T} = fill(one(T), length(wv))

Does it seem better?

@Nosferican
Copy link
Contributor Author

I would have to test it, but values tend be be iterators so maybe it could yield one(T) wv times rather than the fill call which is invoked every time values is called.

@lbittarello
Copy link
Contributor

values returns a vector under the current implementation of the other weight types. Shouldn't it also return a vector for UnitWeights for consistency's sake?

@Nosferican
Copy link
Contributor Author

Nosferican commented Aug 22, 2019

obj = 1:3
wt = FrequencyWeights(obj)
isa(values(wt), UnitRange)

I believe it conforms to AbstractVector, but doesn't require it to be a Vector (as opposed to FillArrays which requires it to be a Vector).

Maybe something like

using StatsBase
struct UnitIterator{T<:Real} <: AbstractVector{T}
    l::Int
    UnitIterator(x::Real) = new{eltype(x)}(x)
end
Base.length(obj::UnitIterator) = obj.l
Base.size(obj::UnitIterator) = (length(obj),)
function Base.getindex(obj::UnitIterator{T}, i::Integer) where {T<:Real}
    ((i > 0) && (i ≤ obj.l)) ? one(T) : Base.throw_boundserror(obj, i)
end
using StatsBase
struct UnitIterator{T<:Real} <: AbstractVector{T}
    l::Int
    UnitIterator(x::Real) = new{eltype(x)}(x)
end
Base.length(obj::UnitIterator) = obj.l
Base.size(obj::UnitIterator) = (length(obj),)
function Base.getindex(obj::UnitIterator{T}, i::Integer) where {T<:Real}
    ((i > 0) && (i ≤ obj.l)) ? one(T) : Base.throw_boundserror(obj, i)
end
function Base.getindex(obj::UnitIterator{T}, is::AbstractRange) where {T<:Real}
    is₀, is₁ = extrema(is)
    ((is₀ ≥ 1) && (is₁ ≤ length(obj))) ? UnitIterator(length(is)) : Base.throw_boundserror(obj, is)
end
struct UnitWeight{S<:Int,T<:Real,V<:UnitIterator{T}} <: AbstractWeights{S,T,V}
    values::UnitIterator{T}
    sum::Int
    function UnitWeight(obj::UnitIterator{T}) where {T<:Real}
        new{Int,T,UnitIterator{T}}(obj, obj.l)
    end
    UnitWeight(obj::Real) = UnitWeight(UnitIterator(obj))
end
x = UnitWeight(5)

@lbittarello
Copy link
Contributor

My bad. I didn't know that ranges were a subtype of arrays.

Wouldn't it be simpler to keep the current definition and have values return Iterators.repeated(one(T), length(wv))?

@Nosferican
Copy link
Contributor Author

Probably the best solution, but might require to break the AbstractWeights{S,T,V} since the rest of the <:AbstractWeights hold an AbstractVector for values... We could consider relaxing that for allowing some iterator and keeping the shape information in the overall struct. That maybe breaking, but would allow for greater flexibility and efficiency.

Disclosure, I did add an opinionated getindex(obj, ::AbstractVector) to return a new instance of weights rather than just the values... I find that particularly useful for my work, but it isn't the current behavior AFAIK. That would be another potential breaking change we could include if it gets triaged and we decide to have an up minor release.

@nalimilan
Copy link
Member

What's the purpose of values actually? Depending on what it's useful for, we could have values be the identity for UnitWeights, or use fill. Defining a custom vector type doesn't seem worth it, given that UnitWeights already satisfies the AbstractArray interface.

@Nosferican
Copy link
Contributor Author

I would favor relaxing the AbstractWeights{S,T,V<:AbstractVector} such that the values could be an interator while keeping the shape information in the weight struct. That way we don't have to materialize the vector not even by repeating reference with fill. Maybe fill is efficient enough and we could use it without issues.

@Nosferican
Copy link
Contributor Author

It might also be better to have a ConstantWeights (and handle UnitWeights as sub/special case). Are there any use cases for constant weights that might use it?

@nalimilan
Copy link
Member

I would favor relaxing the AbstractWeights{S,T,V<:AbstractVector} such that the values could be an interator while keeping the shape information in the weight struct.

What would be the advantage over returning the UnitWeights object itself?

@Nosferican
Copy link
Contributor Author

The constructor for UnitWeight could be something like UnitWeight(5000) meaning that all the information is provided and can be handled with an iterator without having to hold a fill(one(Int), 5000) in the values. That might be okay for setting the weights once, but I usually end up doing subsetting which requires me to set new instances of Weights many times and not having to materialize it each time would be nice. I could use type dispatch to handle those instances, but not having to materialize it would be nicer in general.

@lbittarello
Copy link
Contributor

I will make a pull request with the current implementation. We can then review it as necessary. I think however that broader changes to the weights infrastructure belong to a separate pull request.

@nalimilan
Copy link
Member

The constructor for UnitWeight could be something like UnitWeight(5000) meaning that all the information is provided and can be handled with an iterator without having to hold a fill(one(Int), 5000) in the values. That might be okay for setting the weights once, but I usually end up doing subsetting which requires me to set new instances of Weights many times and not having to materialize it each time would be nice. I could use type dispatch to handle those instances, but not having to materialize it would be nicer in general.

What I propose is that fill is never called at all. No allocation would be needed then.

@Nosferican
Copy link
Contributor Author

If you want open the PR for now,

  • Adjust the fields to conform to the current AbstractWeights{S,T,V}, that means holding the fill(one(T), x) as the values (two fields, values and sum)
  • Relax Vector for AbstractVector and Int for Integer

@lbittarello
Copy link
Contributor

I thought we wanted to avoid holding the weight vector. To quote @nalimilan above,

I'm in favor in principle, but why would we need to actually store a vector of weights if we know they are unit weights?

@Nosferican
Copy link
Contributor Author

Nosferican commented Aug 22, 2019

For making it a subtype of AbstractWeights.

Relaxing that requirement was the discussion I mentioned above, but since we can discuss that at a later stage for including UnitWeights as things are, it needs to hold it.

You could still need to materialize the input for wrapping it in a UnitWeight, since the values are ignored maybe AbstractRange could work nicely.

UnitWeight(1:1000)

That would work after Vector is relaxed to AbstractVector.

The current API requires an AbstractVector as a field, currently we need to put something in there. I think, @nalimilan was referring to returning the identity rather than calling fill for Base.values.

@lbittarello
Copy link
Contributor

Sorry, I don't follow.

The current API requires an AbstractVector as a field, currently we need to put something in there

Why?

The current implementation doesn't hold the weight vector in memory, but it works fine: you can index it, you can iterate on it, you have an appropriate AbstractVector as a parameter... Given appropriate method definitions, why do we need to hold the weight vector in memory?

You could still need to materialize the input for wrapping it in a UnitWeight

Why would you need to materialize the input?

@nalimilan
Copy link
Member

The API doesn't require fields, it requires methods. Also, UnitWeight(1:1000) makes no sense since weights are not equal in that vector...

@Nosferican
Copy link
Contributor Author

What I am trying to say is that all other weights have a constructor T(::AbstractVector).
For UnitWeight it should be T(n::Integer) rather than T(::AbstractVector). Otherwise you would have to materialize the something like UnitWeight(ones(n)) and then test all(isone, x) rather than doing that we could omit the check and pass an efficient placeholder (e.g., 1:n). If we pass just the length of the vector we don't have a reference to pass the identity and would have to end up materializing something like 1:n and define the getindex as in the draft.

@lbittarello
Copy link
Contributor

Good point. But the proposed implementation already takes a number instead of an array.

@Nosferican
Copy link
Contributor Author

Aye. That is why it currently calls fill every time values is invoked. To avoid that, nalimilan suggested just passing the identity, but since this isn't a wrapper it would have to materialize it and hold it in the struct for that. Hence, why I suggested just holding an iterator à la FillArrays.Ones. That way we never have to materialize anything (just a lazy iterator).

@nalimilan
Copy link
Member

No, I suggested values to return the object itself, so nothing needs to be materialized.

@Nosferican
Copy link
Contributor Author

Base.values(obj::UnitWeight) = obj still has the issue of what is the
V in UnitWeight{S<:Integer,T<:Real} <: AbstractVector{S,T,V}, no?

@nalimilan
Copy link
Member

I think we can just remove the V parameter from the definition of AbstractWeights: that's an implementation detail that isn't useful in the interface.

@Nosferican
Copy link
Contributor Author

Aye. That was my suggestion earlier in the thread.

@lbittarello
Copy link
Contributor

I agree, but I think it should be a separate pull request, since it's a bigger change

@Nosferican
Copy link
Contributor Author

Aye. That's why for the PR to get it in, it should hold some <:AbstractVector V in the form of range or something. We can clean it up after we drop the {S,T,V} for {S,T}.

@nalimilan
Copy link
Member

I think that change literally only requires touching two lines (or almost).

@lbittarello
Copy link
Contributor

Since we're discussing AbstractWeights: why do we allow the sum of weights to have a different type from the weight elements? Shouldn't they logically be the same?

@Nosferican
Copy link
Contributor Author

I believe it was to prevent overflow sum of Float64 valid numbers may require BigFloat... In practice, probably not a big issue... As for the type of the sum for UnitWeight, maybe we want that to be Int or S<:Integer.

@nalimilan
Copy link
Member

Yeah, for some types sum returns a different type (Int for Int8).

@nalimilan
Copy link
Member

To fix the values problem, I think we should just deprecate that function in favor of convert(Vector, wv). There's no need to define functions for things that can be expressed in more standard ways. Then we can just define convert(::Type{Vector}, wv::UnitWeights) = ones(eltype(wv), length(wv)).

@lbittarello
Copy link
Contributor

Another incidental question: do we need to parametrize the weights by the type of sum? I think that we only use it in wsumtype and wmeantype, which we could easily redefine / work around.

@nalimilan
Copy link
Member

This is fully implemented AFAICT.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants