Add type coercion to task constructors #109

Closed
ablaom opened this issue Mar 27, 2019 · 3 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments


ablaom commented Mar 27, 2019

Branching off from #68:

In the current design one of the functions of task constructors is to establish the scientific type of the data. This is currently inferred from the data passed to the constructors, according to the convention that object x has scientific type scitype(x); see the docs for details.

I suggest the following straightforward enhancement: the user passes, as an optional kwarg, a dictionary specifying how they would like to override this inference. The dictionary is keyed on the names of feature and target columns; the values are the intended scitypes. So, a call might look like this:

types = Dict(:overall_quality => FiniteOrderedFactor, :is_freehold => Binary, :Price => Continuous)
task = SupervisedTask(data=house_prices, is_probabilistic=true, target=:Price, scitypes=types)

After coercion, the constructor would print out the final post-coercion types and scitypes of all variables.

I can provide more detail on the existing design to potential implementers.

Any objections, other ideas?

@ablaom ablaom added the enhancement New feature or request label Mar 27, 2019

ablaom commented Mar 28, 2019

Okay, no objections. Any volunteers? Some familiarity with the Tables.jl interface and CategoricalArrays is required.

Let's break this up into two separate PRs with two items each:

  • Define four methods MLJ.coerce(T::Continuous, y), MLJ.coerce(T::Multiclass, y), MLJ.coerce(T::FiniteOrderedFactor, y) and MLJ.coerce(T::Count, y), to coerce a vector or categorical vector y into, respectively, an object of type Vector{Float64}, CategoricalVector (with pool.ordered=false), CategoricalVector (with pool.ordered=true), and Vector{Int}.

  • Define a new method MLJ.coerce(types::Dict{Symbol,<:DataType}, X) to coerce the columns of a Tables.jl compatible table X into
    columns with the scitypes specified by the dictionary types.

  • Define a new constructor MLJ.supervised(; data=nothing, types=nothing, kwargs...) that coerces data into
    newdata according to types and returns MLJBase.SupervisedTask(data=newdata, kwargs...).

  • Define a new constructor MLJ.unsupervised(; data=nothing, types=nothing, kwargs...) that coerces data
    into newdata according to types and returns MLJBase.UnsupervisedTask(data=newdata, kwargs...).
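
As a rough illustration of the first item, here is a minimal sketch of the two numeric coercions using only Base Julia. The types Continuous and Count below are local stand-ins defined for this example, not MLJ's own definitions, and the methods dispatch on Type{...} on the assumption that callers pass the scitype itself (as in the Dict example earlier in the thread) rather than an instance. The two categorical cases are omitted so that the sketch has no package dependencies.

```julia
# Local stand-ins for the scitypes named in the thread (assumption:
# these are NOT MLJ's definitions, just placeholders for the sketch).
struct Continuous end
struct Count end

# Dispatch on the type object, since the caller passes `Continuous`
# itself rather than an instance of it.
coerce(::Type{Continuous}, y::AbstractVector) = Vector{Float64}(y)
coerce(::Type{Count}, y::AbstractVector) = Vector{Int}(y)

ints = coerce(Count, [1.0, 2.0, 3.0])      # exact floats convert cleanly
floats = coerce(Continuous, [1, 2, 3])

# Impossible conversions raise an error instead of returning garbage.
threw = try
    coerce(Continuous, ["a", "b"])
    false
catch
    true
end
```

Constructing with Vector{Float64}(y) always returns a fresh vector, and non-integer floats passed to the Count method raise an InexactError, so no silent truncation occurs.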

Some technical details:

  • the first two methods need not be exported

  • current task constructors live in MLJBase/src/tasks.jl with detailed doc strings

  • new code is to go in MLJ/src/tasks.jl (please don't touch MLJBase).

  • use CategoricalArrays.categorical(v) or CategoricalArrays.categorical(v, ordered=true) to construct the categorical vectors in the first item.

  • the table at the bottom of
    here says
    what types the specified new columns should have, given the scitype
    declared by the user-supplied dictionary.

  • Assume data is a table. There are edge cases when data is not
    a table but a vector or categorical vector, but let's worry about
    that later.

  • Performance is not an issue as these coercions are not called
    often. I don't think Tables.jl supports in-place mutation of
    columns, so we just create new tables and that's fine. For the
    second method above, I suggest we create a new table Xnew as a
    named tuple of vectors (one key per column) and return
    MLJBase.table(Xnew, prototype=X), which converts the "columns
    table" to a table of original
    type (e.g., if X is DataFrame, then a DataFrame is returned).

  • For some convenience methods for manipulating tables see
    here
    under "Convenience Methods". In particular, MLJBase.selectcols and
    MLJBase.schema may be useful. You can also use the Tables.jl
    methods but, as this backend might get replaced later, it's better
    to use these, I reckon.

  • Obviously errors need to be thrown for impossible conversions (e.g.,
    string vector -> Continuous vector)
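
The table-level method described in the notes above might be sketched as follows, under some loud assumptions: the input is already a named tuple of vectors (a "columns table"), the scitypes Continuous and Count are local stand-ins rather than MLJ's definitions, and the final MLJBase.table(Xnew, prototype=X) step is left out so that the example runs with Base Julia alone. Columns not mentioned in the dictionary are passed through unchanged, which is a choice made for this sketch and not something the thread specifies.

```julia
# Local stand-ins for the scitypes (assumption: not MLJ's definitions).
struct Continuous end
struct Count end

coerce(::Type{Continuous}, y::AbstractVector) = Vector{Float64}(y)
coerce(::Type{Count}, y::AbstractVector) = Vector{Int}(y)

# Build a new named tuple of vectors, one key per column, coercing only
# the columns named in `types`. No in-place mutation is attempted.
function coerce(types::Dict{Symbol,<:DataType}, X::NamedTuple)
    colnames = keys(X)
    cols = map(colnames) do name
        haskey(types, name) ? coerce(types[name], X[name]) : X[name]
    end
    return NamedTuple{colnames}(cols)
end

X = (area = [50, 72], rooms = [2.0, 3.0], town = ["A", "B"])
Xnew = coerce(Dict(:area => Continuous, :rooms => Count), X)
```

An impossible request, such as Dict(:town => Continuous) here, propagates the error thrown by the vector method.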

@giordano

In #114 I implemented the coercion methods as suggested. However, the method coerce(T::Continuous, y) assumes that any numerical type can be converted to Float64, which is not true for arbitrary custom types. For example, the Measurement type from Measurements.jl is a subtype of AbstractFloat but deliberately has no conversion rule to Float64, because the conversion would be lossy.
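
To make the concern concrete without depending on Measurements.jl, the sketch below defines a hypothetical Uncertain type (a stand-in invented for this example, not the real Measurement type) that subtypes AbstractFloat but defines no conversion to Float64. The Continuous type and the coerce method are likewise local placeholders.

```julia
# Hypothetical stand-in for a type like Measurement: a value with an
# uncertainty, subtyping AbstractFloat but with no Float64 conversion.
struct Uncertain <: AbstractFloat
    val::Float64
    err::Float64
end

# Local placeholder scitype and coercion (not MLJ's definitions).
struct Continuous end
coerce(::Type{Continuous}, y::AbstractVector) = Vector{Float64}(y)

v = [Uncertain(1.0, 0.1), Uncertain(2.0, 0.2)]

# The element type is numeric, yet the coercion cannot succeed because
# no conversion rule to Float64 was defined for it.
failed = try
    coerce(Continuous, v)
    false
catch
    true
end
```

So a coercion that accepts any numeric element type should expect a conversion error for such types, rather than assume every subtype of AbstractFloat converts to Float64.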

ablaom added a commit that referenced this issue May 2, 2019

ablaom commented May 2, 2019

Done.

@ablaom ablaom closed this as completed May 2, 2019