From f3bfa4d0e859ade230a468859b2636ba075f9bbd Mon Sep 17 00:00:00 2001
From: "Anthony D. Blaom"
Date: Fri, 5 Jun 2020 18:03:14 +1200
Subject: [PATCH] big update of docs, including using discrete data WIP

---
 docs/src/adding_models_for_general_use.md | 107 +++++++------
 docs/src/model_search.md                  |  20 ++-
 docs/src/working_with_categorical_data.md | 184 ++++++++++++++++++++++
 3 files changed, 260 insertions(+), 51 deletions(-)
 create mode 100644 docs/src/working_with_categorical_data.md

diff --git a/docs/src/adding_models_for_general_use.md b/docs/src/adding_models_for_general_use.md
index bdac87346..e428ef2fa 100755
--- a/docs/src/adding_models_for_general_use.md
+++ b/docs/src/adding_models_for_general_use.md
@@ -1,6 +1,13 @@
-
 # Adding Models for General Use
 
+!!! warning
+
+    Models implementing the MLJ model interface according to the instructions
+    given here should import MLJModelInterface version 0.3 or higher. This is
+    enforced with a statement such as `MLJModelInterface = "^0.3"` under
+    `[compat]` in the Project.toml file of the package containing the
+    implementation.
+
 This guide outlines the specification of the MLJ model interface and
 provides detailed guidelines for implementing the interface for
 models intended for general use. See also the more condensed
@@ -408,7 +415,7 @@ ordering of these integers being consistent with that of the pool),
 integers back into `CategoricalValue`/`CategoricalString` objects),
 and `classes`, for extracting all the `CategoricalValue` or
 `CategoricalString` objects sharing the pool of a particular
-value/string. Refer to [Convenience methods](@ref) below for important
+value. Refer to [Convenience methods](@ref) below for important
 details.
 
 Note that a decoder created during `fit` may need to be bundled with
@@ -456,68 +463,77 @@ must be an `AbstractVector` whose elements are distributions (one
 distribution per row of `Xnew`).
 
Presently, a *distribution* is any object `d` for which
-`MMI.isdistribution(::d) = true`, which is currently restricted to
-objects subtyping `Distributions.Sampleable` from the package
-Distributions.jl.
+`MMI.isdistribution(::d) = true`, which is the case for objects of
+type `Distributions.Sampleable`.
+
+Use the distribution `MMI.UnivariateFinite` for `Probabilistic` models
+predicting a target with `Finite` scitype (classifiers). In this case
+the elements of the training target `y` will be `CategoricalValue`s.
+
+For efficiency, one should not construct `UnivariateFinite` instances
+one at a time. Rather, once a probability vector or matrix is known,
+construct an instance of `UnivariateFiniteVector <:
+AbstractArray{<:UnivariateFinite,1}` to return. Both `UnivariateFinite`
+and `UnivariateFiniteVector` objects are constructed using the single
+`UnivariateFinite` function.
 
-Use the distribution `MMI.UnivariateFinite` for `Probabilistic`
-models predicting a target with `Finite` scitype (classifiers). In
-this case each element of the training target `y` is a
-`CategoricalValue` or `CategoricalString`, as in this contrived example:
+For example, suppose the target `y` arrives as a subsample of some
+`ybig` and is missing some classes:
 
 ```julia
-using CategoricalArrays
-y = Any[categorical([:yes, :no, :no, :maybe, :maybe])...]
+ybig = categorical([:a, :b, :a, :a, :b, :a, :rare, :a, :b])
+y = ybig[1:6]
 ```
 
-Note that, as in this case, we cannot assume `y` is a
-`CategoricalVector`, and we rely on elements for pool information (if
-we need it); this is accessible using the convenience method
-`MLJ.classes`:
+Suppose your `fit` method has bundled the first element of `y` with
+the `fitresult`, to make it available to `predict` for the purpose of
+tracking the complete pool of classes; call this element
+`an_element = y[1]`.
Then, supposing the corresponding probabilities of the observed
+classes `[:a, :b]` are in an `n x 2` matrix `probs` (where `n` is the
+number of rows of `Xnew`), then you return
 
 ```julia
-julia> yes = y[1]
-julia> levels = MMI.classes(yes)
-3-element Array{CategoricalValue{Symbol,UInt32},1}:
- :maybe
- :no
- :yes
+yhat = UnivariateFinite([:a, :b], probs, pool=an_element)
 ```
 
-Now supposing that, for some new input pattern, the elements `yes =
-y[1]` and `no = y[2]` are to be assigned respective probabilities of
-0.2 and 0.8. Then the corresponding distribution `d` is constructed as
-follows:
+This object automatically assigns zero-probability to the unseen class
+`:rare` (i.e., `pdf.(yhat, :rare)` works and returns a zero
+vector). If you would like to assign `:rare` non-zero probabilities,
+simply add it to the first vector (the *support*) and supply a larger
+`probs` matrix.
 
-```julia
-julia> d = MMI.UnivariateFinite([yes, no], [0.2, 0.8])
-UnivariateFinite(:yes=>0.2, :maybe=>0.0, :no=>0.8)
+If instead of raw labels `[:a, :b]` you have the corresponding
+`CategoricalValue`s (from, e.g., `filter(cv->cv in unique(y),
+classes(y))`) then you can use these instead and drop the `pool`
+specifier.
 
-julia> pdf(d, yes)
-0.2
+In a binary classification problem it suffices to specify a single
+vector of probabilities, provided you specify `augment=true`, as in
+the following example, *and note carefully that these probabilities are
+associated with the* **last** *(second) class you specify in the
+constructor:*
 
-julia> maybe = y[4]; pdf(d, maybe)
-0.0
+```julia
+y = categorical([:TRUE, :FALSE, :FALSE, :TRUE, :TRUE])
+an_element = y[1]
+probs = rand(10)
+yhat = UnivariateFinite([:FALSE, :TRUE], probs, augment=true, pool=an_element)
 ```
 
-Alternatively, a dictionary can be passed to the constructor.
+The constructor has a lot of options, including passing a dictionary
+instead of vectors. See [`UnivariateFinite`](@ref) for details.
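+
+As a sketch of the dictionary form, a *single* distribution (rather
+than a vector of them) might be constructed as below. This is a hedged
+illustration only; check the [`UnivariateFinite`](@ref) docstring for
+the exact signature before relying on it:
+
+```julia
+# probabilities keyed on the raw labels, with the pool inferred, as
+# before, from `an_element`:
+d = UnivariateFinite(Dict(:a => 0.2, :b => 0.8), pool=an_element)
+pdf(d, :rare) # zero, as :rare is in the pool but not the support
+```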
See [LinearBinaryClassifier](https://github.com/alan-turing-institute/MLJModels.jl/blob/master/src/GLM.jl) for an example of a Probabilistic classifier implementation. - -```@docs -UnivariateFinite -``` - *Important note on binary classifiers.* There is no "Binary" scitype distinct from `Multiclass{2}` or `OrderedFactor{2}`; `Binary` is just an alias for `Union{Multiclass{2},OrderedFactor{2}}`. The `target_scitype` of a binary classifier will generally be `AbstractVector{<:Binary}` and according to the *mlj* scitype -convention, elements of `y` have type `CategoricalValue` or -`CategoricalString`, and *not* `Bool`. See +convention, elements of `y` have type `CategoricalValue`, and *not* +`Bool`. See [BinaryClassifier](https://github.com/alan-turing-institute/MLJModels.jl/blob/master/src/GLM.jl) for an example. @@ -558,8 +574,7 @@ MMI.input_scitype(::Type{<:DecisionTreeClassifier}) = Table(Union{Continuous,Mis ``` Similarly, to ensure the target is an AbstractVector whose elements -have `Finite` scitype (and hence `CategoricalValue` or -`CategoricalString` machine type) we declare +have `Finite` scitype (and hence `CategoricalValue` machine type) we declare ```julia MMI.target_scitype(::Type{<:DecisionTreeClassifier}) = AbstractVector{<:Finite} @@ -584,8 +599,7 @@ restricts to tables with continuous or binary (ordered or unordered) columns. 
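+
+For instance, a concrete table can be checked against such a
+restriction as in the following sketch, which assumes the declaration
+in question is `Table(Continuous, Finite{2})` and uses invented column
+values for illustration:
+
+```julia
+X = (height = [152.0, 148.0, 163.0],
+     smokes = coerce([true, false, true], OrderedFactor))
+scitype(X) <: Table(Continuous, Finite{2}) # should hold
+```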
For predicting variable length sequences of, say, binary values
-(`CategoricalValue`s or `CategoricalString`s with some common size-two
-pool) we declare
+(`CategoricalValue`s with some common size-two pool) we declare
 
 ```julia
 target_scitype(SomeSupervisedModel) = AbstractVector{<:NTuple{<:Finite{2}}}
 ```
 
@@ -875,6 +889,11 @@ MLJModelInterface.selectrows
 MLJModelInterface.selectcols
 ```
 
+```@docs
+UnivariateFinite
+```
+
+
 ### Where to place code implementing new models
 
diff --git a/docs/src/model_search.md b/docs/src/model_search.md
index d4372aebe..94e2d339e 100644
--- a/docs/src/model_search.md
+++ b/docs/src/model_search.md
@@ -9,10 +9,10 @@ methods, as detailed below.
 
 ## Model metadata
 
-*Terminology.* In this section the word "model" refers to the metadata
-entry in the registry of an actual model `struct`, as appearing
-elsewhere in the manual. One can obtain such an entry with the `info`
-command:
+*Terminology.* In this section the word "model" refers to a metadata
+entry in the model registry, as opposed to an actual model `struct`
+that such an entry represents.
One can obtain such an entry with the +`info` command: ```@setup tokai using MLJ @@ -38,14 +38,20 @@ localmodels() localmodels()[2] ``` -If `models` is passed any `Bool`-valued function `test`, it returns every `model` for which `test(model)` is true, as in +One can search for models containing specified strings or regular expressions in their `docstring` attributes, as in + +```@repl tokai +models("forest") +``` + +or by specifying a filter (`Bool`-valued function): ```@repl tokai -test(model) = model.is_supervised && +filter(model) = model.is_supervised && model.input_scitype >: MLJ.Table(Continuous) && model.target_scitype >: AbstractVector{<:Multiclass{3}} && model.prediction_type == :deterministic -models(test) +models(filter) ``` Multiple test arguments may be passed to `models`, which are applied diff --git a/docs/src/working_with_categorical_data.md b/docs/src/working_with_categorical_data.md new file mode 100644 index 000000000..2dfb4b2e7 --- /dev/null +++ b/docs/src/working_with_categorical_data.md @@ -0,0 +1,184 @@ +# Working with Categorical Data + +## Scientific types for discrete data + +Recall that models articulate their data requirements using scientific +types (see [Getting Started](@ref) or the MLJScientificTypes.jl +[documentation](https://alan-turing-institute.github.io/MLJScientificTypes.jl/dev/)). There +are three scientific types discrete data can have: `Count`, +`OrderedFactor` and `Multiclass`. + + +### Count data + +In MLJ you cannot use integers to represent (finite) categorical +data. Integers are reserved for discrete data you want interpreted as +`Count <: Infinite`: + +```@example hut +using MLJ # hide +scitype([1, 4, 5, 6]) +``` + +The `Count` scientific type includes things like the number of phone +calls, or city populations, and other "frequency" data of a generally +unbounded nature. 
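+
+For comparison, here is the scientific type of the same data
+represented using floats, which is instead interpreted as
+`Continuous`:
+
+```@example hut
+scitype([1.0, 4.0, 5.0, 6.0])
+```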
+
+That said, you may have data that is theoretically `Count`, but which
+you coerce to `OrderedFactor` to enable the use of more models,
+relying on your knowledge of how those models work to inform an
+appropriate interpretation.
+
+
+### OrderedFactor and Multiclass data
+
+Other integer data, such as the number of an animal's legs, or number
+of rooms of homes, are generally coerced to `OrderedFactor <:
+Finite`. The other categorical scientific type is `Multiclass <:
+Finite`, which is for *unordered* categorical data. Coercing data to
+one of these two forms is discussed under [Detecting and coercing
+improperly represented categorical data](@ref) below.
+
+
+### Binary data
+
+There is no separate scientific type for binary data. Binary data has
+type `OrderedFactor{2}` if ordered, and `Multiclass{2}` otherwise.
+Data with type `OrderedFactor{2}` is considered to have an intrinsic
+"positive" class, such as the outcome of a medical test or the
+"pass/fail" outcome of an exam. MLJ measures, such as `true_positive`,
+assume the *second* class in the ordering is the "positive"
+class. Inspecting and changing order is discussed below.
+
+If data has type `Bool` it is considered `Count` data (as `Bool <:
+Integer`), and generally users will want to coerce it to a binary
+scientific type.
+
+
+## Detecting and coercing improperly represented categorical data
+
+One inspects the scientific type of data using `scitype` as shown
+above. To inspect all column scientific types in a table
+simultaneously, use `schema`. (Tables also have a `scitype`, in which
+this information appears in a condensed form more appropriate for type
+dispatch.)
+
+```@example hut
+using DataFrames
+X = DataFrame(
+    name       = ["Siri", "Robo", "Alexa", "Cortana"],
+    gender     = ["male", "male", "Female", "female"],
+    likes_soup = [true, false, false, true],
+    height     = [152, missing, 148, 163],
+    rating     = [2, 5, 2, 1],
+    outcome    = ["rejected", "accepted", "accepted", "rejected"])
+schema(X)
+```
+
+Coercing a single column:
+
+```@example hut
+X.outcome = coerce(X.outcome, OrderedFactor)
+```
+
+Inspecting the order of the levels:
+
+```@example hut
+levels(X.outcome)
+```
+
+Since we wish to regard "accepted" as the positive class, it should
+appear second, which we correct with the `levels!` function:
+
+```@example hut
+levels!(X.outcome, ["rejected", "accepted"]);
+```
+
+Coercing all remaining types simultaneously:
+
+```@example hut
+Xnew = coerce(X, :gender => Multiclass,
+              :likes_soup => OrderedFactor,
+              :height => Continuous,
+              :rating => OrderedFactor)
+schema(Xnew)
+```
+
+(For `DataFrame`s there is also in-place coercion using `coerce!`.)
+
+
+## Tracking all levels
+
+The key property of vectors of scientific type `OrderedFactor` and
+`Multiclass` is that the pool of all levels is not lost when
+separating out one or more elements:
+
+```@example hut
+v = Xnew.rating
+```
+
+```@example hut
+levels(v)
+```
+
+```@example hut
+levels(v[1:2])
+```
+
+```@example hut
+levels(v[2])
+```
+
+By tracking all classes in this way, MLJ avoids common pain points
+around categorical data, such as evaluating a model on a test set
+containing classes that were not seen during training.
+
+
+## Under the hood: CategoricalValue and CategoricalArray
+
+In MLJ the atomic objects with `OrderedFactor` or `Multiclass`
+scientific type are `CategoricalValue`s, from the
+[CategoricalArrays.jl](https://juliadata.github.io/CategoricalArrays.jl/stable/)
+package. In some sense `CategoricalValue`s are an implementation
+detail users can ignore for the most part, as shown above.
However, some users may
+want a basic understanding of these types, and those implementing
+MLJ's model interface for new algorithms will need to understand
+them, which we now do informally. For the complete API, see the
+CategoricalArrays.jl
+[documentation](https://juliadata.github.io/CategoricalArrays.jl/stable/).
+
+To construct an `OrderedFactor` or `Multiclass` vector from raw
+labels, one uses `categorical`:
+
+```@example hut
+using CategoricalArrays # hide
+v = categorical([:A, :B, :A, :A, :C])
+typeof(v)
+```
+
+```@example hut
+scitype(v)
+```
+
+```@example hut
+v = categorical([:A, :B, :A, :A, :C], ordered=true)
+scitype(v)
+```
+
+When you index a `CategoricalVector` you don't get a raw label, but
+instead an instance of `CategoricalValue`. As explained above, this
+value knows the complete pool of levels of the vector from which it
+came. Use `get(val)` to extract the raw label from a value `val`.
+
+Despite the distinction that exists between a value (element) and a
+label, the two compare as equal, from the point of view of `==` and
+`in`:
+
+```julia
+v[1] == :A # true
+:A in v # true
+```
+
+
+## Probabilistic predictions of categorical data
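+
+A sketch of what such predictions look like, using the
+`UnivariateFinite` distribution described in [Adding Models for
+General Use](@ref) (the particular labels and probabilities here are
+invented for illustration):
+
+```@example hut
+w = categorical([:yes, :no, :maybe], ordered=true)
+d = UnivariateFinite([:yes, :no], [0.8, 0.2], pool=w[1])
+pdf(d, :yes)
+```
+
+Note that `pdf(d, :maybe)` returns zero: `:maybe` is in the pool
+tracked by `d` but has been given no probability mass.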