Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add type coercion for vectors and columns #114

Merged
merged 7 commits into from
May 2, 2019

Conversation

giordano
Copy link
Member

@giordano giordano commented Apr 15, 2019

Address first part of #109. I've added basic tests, but they can probably be improved.

@giordano giordano marked this pull request as ready for review April 15, 2019 20:45
@giordano giordano force-pushed the type-coercion branch 2 times, most recently from 39bb694 to 19ea457 Compare April 16, 2019 10:08
@ablaom
Copy link
Member

ablaom commented Apr 22, 2019

Okay, great start, thanks.

Can we please:

  • suppress conversion if the type is already of required form. So, as you point out in Add type coercion to task constructors #109, Measurement objects are already AbstractFloat - and so already Continuous - meaning conversion to Float64 is lossy. See also note (1) below.

  • coerce(types::Dict{Symbol, Type}, X) currently returns a column table (named tuple of vectors). Generally in MLJ, we try to return the same data container that was given as input. So, instead return MLJBase.table(coltable, prototype=X), where coltable is the current return value.

  • please add a test for conversion to FiniteOrderedFactor.

Notes:

(1) This raises a new issue. The user may want conversion to a custom type T which has a supported scientific type instead of using the fallbacks suggested in #109. Let's leave this to a later PR; will add to #109.

@giordano
Copy link
Member Author

Ok, I think I addressed all comments. Thanks for pointing out tests for FiniteOrderedFactor, they totally slipped my mind!

Regarding conversion to Continuous, I think that float should do the trick: it's no-op if the eltype of the vector is already an AbstractFloat, otherwise it converts the vector to the most appropriate floating type. Also coerce(Count, y) now is no-op if y is an AbstractVector{Integer}

@ablaom
Copy link
Member

ablaom commented Apr 25, 2019

Yes, that's would seem to work for floats. (And I realise I have made a mistake in declaring all Real objects Continuous, it should be just AbstractFloat. Thanks!)

However the discrete case is different, for consider this:

julia> using CategoricalArrays
julia> v = [x for x in categorical([:x,:y,:z,:w])[1:2]]
2-element Array{CategoricalValue{Symbol,UInt32},1}:
 :x
 :y

The vector v is not a CategoricalVector but still has the scitype Multiclass (because scitype is based on the union of scitypes of elements, not on whether the vector itself is categorical):

julia> scitype_union(v)
Multiclass{4}

I believe that if you apply your coecion to it you will loose some of the levels (there are four of them), for consider this:

julia> scitype_union(categorical(v))
Multiclass{2}

So, I think you need to explicitly check (using scitype_union method) if a vector already has the requested scitype, and suppress coercion in that case. (Note performance is not an issue here.)

Incidentally, some of the scitype methods changed name in MLJBase 0.1.1. In particular scitype_union was union_scitypes. See below for draft issue explaining the change. The revised methods have doc strings you can query:


Some of the nomenclature surrounding scitypes is hard to keep
straight. I am making the executive decision to make the following
breaking changes, which will require refactoring MLJBase, MLJModels
and MLJ, but will probably effect almost no-one at this point. There
will be just three methods for scitypes:

  • scitype keeps existing meaning but with the following addition:
    the scitype of any tuple is (as a fallback) the tuple type
    constructed out of the corresponding scitypes of elements. So, for
    example, scitype((4.5, 7)) = Tuple{Float64,Int64}.
  • scitype_union(v) supposes that v is iterable and returns the
    union of scitypes of elements . For example, scitype_union([1, 2.3, missing]) = Union{Count,Continuous,Missing}
  • scitypes(X) supposes X is a table and returns a named tuple
    keyed on the column names of X with values the corresponding
    scitype unions:
julia> X = (height = Float64[183, 145, 160, 78, 182, 76],
                   gender = categorical([:m, :f, :f, :f, :m, :m]),
                   weight = Float64[92, 67, 62, 25, 80, 31],
                   age = Float64[53, 12, 60, 5, 31, 7],
julia> scitypes(X)
(height = Continuous,
 gender = Multiclass{2},
 weight = Continuous,
 age = Continuous,
 overall_health = FiniteOrderedFactor{3},)

The method column_scitype_as_tuple will be dropped, and we:

  • rename the trait input_scitype_union to input_scitype_union
  • rename the trait target_scitype_union to target_scitype_union
  • make similar renamings of the corresponding task fields

@ablaom ablaom closed this Apr 25, 2019
@ablaom ablaom reopened this Apr 25, 2019
@ablaom
Copy link
Member

ablaom commented Apr 25, 2019

Oops, closed by mistake.

In addition:
* fix output type of `coerce(types, X)`
* do not convert `AbstractVector{<:Integer}` to `Vector{Int}`
* add more tests
* check if a vector already has the requested scitype in the discrete case
@giordano
Copy link
Member Author

Discrete case fixed as suggested, thanks!

@ablaom
Copy link
Member

ablaom commented Apr 28, 2019

Great. One last request. If a vector has missing elements then coercion to the four concrete scitypes presently fails in all cases.

I think the preferred behaviour is for the missing values to be ignored with a warning thrown. So, for example, coerce(Continuous, [4, missing, 7]) should return [4.0, missing, 7.0] and the warning is "Missing values encountered. Coerced to Union{Missing,Continuous} instead of Continuous. "

Start by defining new methods coerce(T::Union{Continous,Missing}, v) and then modify the existing methods to give the above behaviour by calling the new methods.

Sound reasonable to you?

@giordano
Copy link
Member Author

float already gives the desired result:

julia> MLJ.coerce(Continuous, [4, missing, 7])
3-element Array{Union{Missing, Float64},1}:
 4.0     
  missing
 7.0

In order to show the warning, for the continuous case we can just catch arrays with missing values with:

function coerce(T::Type{Continuous}, y::AbstractVector{Union{S,Missing}}) where S
    @warn "Missing values encountered. Coerced to Union{Missing,Continuous} instead of Continuous."
    return float(y)
end
coerce(T::Type{Continuous}, y) = float(y)

[...]

julia> MLJ.coerce(Continuous, [4, missing, 7])
┌ Warning: Missing values encountered. Coerced to Union{Missing,Continuous} instead of Continuous.
└ @ MLJ ~/.julia/dev/MLJ/src/tasks.jl:36
3-element Array{Union{Missing, Float64},1}:
 4.0     
  missing
 7.0

Caveat: this is the same test for missing values used by float, but would fail for heterogeneous vectors with missing values too, because in that case the eltype would be just Any. I'm not sure if this is a concern.

We can do something similar for the other cases. This isn't exactly what you suggested but gives pretty much the same result (provided that the caveat above is acceptable).

@ablaom
Copy link
Member

ablaom commented Apr 29, 2019

You raise a good point. I want coerce(Continuous, Any[4, 7]) to return [4.0, 7.0] and coerce(Continuous, Any[4, missing, 7]) to return [4.0, missing, 7.0] plus the warning. So you're proposal isn't really satisfactory.

@giordano
Copy link
Member Author

Ok, here is how I went on to deal with missing values for the Continuous case:
https://github.com/alan-turing-institute/MLJ.jl/blob/cd143a22ca5aa844309e3097dc6382ef6e117e32/src/tasks.jl#L38-L51

Note: actually, only the last method could be kept, and all tests would pass (apart those that check that coerce is no-op, but one just needs to replace === with == there). The first two methods are only for performance, as we don't need to iterate over all elements to check for missing values when we are sure they are there or they are not (and in many cases the first method is completely no-op).

@ablaom
Copy link
Member

ablaom commented Apr 29, 2019

Nice.

@ablaom
Copy link
Member

ablaom commented Apr 30, 2019

Happy for the same approach to the Discrete scitypes.

@ablaom
Copy link
Member

ablaom commented Apr 30, 2019

There is a bug here I think (from MLJ/tasks.jl):

function coerce(types::Dict{Symbol, Type}, X)
    names = schema(X).names
    coltable = NamedTuple{names}(coerce(types[name], selectcols(X, name))
                                 for name in names)
    return MLJBase.table(coltable, prototype=X)
end

types[name] is only defined if name is a key of types. Your test misses this because the keys of the supplied dictionary types includes all the column names (only two).

@giordano
Copy link
Member Author

giordano commented Apr 30, 2019

So there should be a default type to coerce to? Like scitype_union()? If so, what to do with Unknown?

@ablaom
Copy link
Member

ablaom commented Apr 30, 2019

No. You don't touch columns that are not keys of the dictionary. That is, if :foo is a column of X but not a key of types then X.foo == X_coerced.foo, where X_coerced = coerce(types, X).

Just pass a column as is when there is no coercion type in the provided
dictionary.
@giordano
Copy link
Member Author

Ok, I see what you mean, thanks. There are quite some tests, I hope I covered all relevant cases.

BTW, it'd be useful to enable test coverage, I recommend https://coveralls.io for which it's possible to disable annoying PR comments.

@ablaom
Copy link
Member

ablaom commented Apr 30, 2019

Very happy with the fix.

A few fussy points:

  • Can we make relax the dictionary type in coerce signature from Dict{Symbol, Type} to plain Dict? (Some user is bound to pass a Dict{Any,Any}.)
  • To future proof, in tests, instead of testing the output has a certain type, just test the the output has the required scitype (using scitype_union). Eg, instead of
@test X_coerced.x isa AbstractVector{Float64}

the following would be sufficient:

@test scitype(X_coerced.x) <: Continuous

And let's add export coerce to MLJ/src/MLJ.jl

@giordano
Copy link
Member Author

giordano commented May 1, 2019

Can we make relax the dictionary type in coerce signature from Dict{Symbol, Type} to plain Dict?

That's already the case:
https://github.com/alan-turing-institute/MLJ.jl/blob/30c29806f53bf24d78d47aafc1263c6d1a53f58a/src/tasks.jl#L102
I changed this yesterday.

To future proof, in tests, instead of testing the output has a certain type, just test the the output has the required scitype (using scitype_union)

Yes, good point, I just did it

@ablaom
Copy link
Member

ablaom commented May 1, 2019

Thanks. Regarding coerce to Muliticlass/FiniteOrderedFactor, your methods already allow for missing values. All that's needed, I believe, is to add the warning.

@giordano
Copy link
Member Author

giordano commented May 1, 2019

your methods already allow for missing values.

Well, I was just delegating the work to categorical, which correctly deals with missing values.

All that's needed, I believe, is to add the warning.

Done!

@ablaom ablaom merged commit 852767d into JuliaAI:master May 2, 2019
@ablaom
Copy link
Member

ablaom commented May 2, 2019

Thanks!!

@giordano giordano deleted the type-coercion branch May 2, 2019 08:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants