From f3bfa4d0e859ade230a468859b2636ba075f9bbd Mon Sep 17 00:00:00 2001
From: "Anthony D. Blaom"
Date: Fri, 5 Jun 2020 18:03:14 +1200
Subject: [PATCH] big update of docs, including using discrete data WIP

---
 docs/src/adding_models_for_general_use.md | 107 +++++++------
 docs/src/model_search.md                  |  20 ++-
 docs/src/working_with_categorical_data.md | 184 ++++++++++++++++++++++
 3 files changed, 260 insertions(+), 51 deletions(-)
 create mode 100644 docs/src/working_with_categorical_data.md

diff --git a/docs/src/adding_models_for_general_use.md b/docs/src/adding_models_for_general_use.md
index bdac87346..e428ef2fa 100755
--- a/docs/src/adding_models_for_general_use.md
+++ b/docs/src/adding_models_for_general_use.md
@@ -1,6 +1,13 @@
-
 # Adding Models for General Use
 
+!!! warning
+
+    Models implementing the MLJ model interface according to the instructions
+    given here should import MLJModelInterface version 0.3 or higher. This is
+    enforced with a statement such as `MLJModelInterface = "^0.3"` under
+    `[compat]` in the Project.toml file of the package containing the
+    implementation.
+
 This guide outlines the specification of the MLJ model interface and
 provides detailed guidelines for implementing the interface for
 models intended for general use. See also the more condensed
@@ -408,7 +415,7 @@ ordering of these integers being consistent with that of the pool),
 integers back into `CategoricalValue`/`CategoricalString` objects),
 and `classes`, for extracting all the `CategoricalValue` or
 `CategoricalString` objects sharing the pool of a particular
-value/string. Refer to [Convenience methods](@ref) below for important
+value. Refer to [Convenience methods](@ref) below for important
 details.
 
 Note that a decoder created during `fit` may need to be bundled with
@@ -456,68 +463,77 @@ must be an `AbstractVector` whose elements are distributions (one
 distribution per row of `Xnew`).
 
Presently, a *distribution* is any object `d` for which
-`MMI.isdistribution(::d) = true`, which is currently restricted to
-objects subtyping `Distributions.Sampleable` from the package
-Distributions.jl.
+`MMI.isdistribution(::d) = true`, which is the case for objects of
+type `Distributions.Sampleable`.
+
+Use the distribution `MMI.UnivariateFinite` for `Probabilistic` models
+predicting a target with `Finite` scitype (classifiers). In this case
+the elements of the training target `y` will be `CategoricalValue`s.
+
+For efficiency, one should not construct `UnivariateFinite` instances
+one at a time. Rather, once a probability vector or matrix is known,
+construct an instance of `UnivariateFiniteVector <:
+AbstractArray{<:UnivariateFinite,1}` to return. Both `UnivariateFinite`
+and `UnivariateFiniteVector` objects are constructed using the single
+`UnivariateFinite` function.
 
-Use the distribution `MMI.UnivariateFinite` for `Probabilistic`
-models predicting a target with `Finite` scitype (classifiers). In
-this case each element of the training target `y` is a
-`CategoricalValue` or `CategoricalString`, as in this contrived example:
+For example, suppose the target `y` arrives as a subsample of some
+`ybig` and is missing some classes:
 
 ```julia
-using CategoricalArrays
-y = Any[categorical([:yes, :no, :no, :maybe, :maybe])...]
+ybig = categorical([:a, :b, :a, :a, :b, :a, :rare, :a, :b])
+y = ybig[1:6]
 ```
 
-Note that, as in this case, we cannot assume `y` is a
-`CategoricalVector`, and we rely on elements for pool information (if
-we need it); this is accessible using the convenience method
-`MLJ.classes`:
+Suppose your `fit` method has bundled the first element of `y` with
+the `fitresult`, to make it available to `predict` for the purpose of
+tracking the complete pool of classes; call this element
+`an_element = y[1]`.
Then, supposing the corresponding probabilities of the observed
+classes `[:a, :b]` are in an `n x 2` matrix `probs` (where `n` is the
+number of rows of `Xnew`), then you return
 
 ```julia
-julia> yes = y[1]
-julia> levels = MMI.classes(yes)
-3-element Array{CategoricalValue{Symbol,UInt32},1}:
- :maybe
- :no
- :yes
+yhat = UnivariateFinite([:a, :b], probs, pool=an_element)
 ```
 
-Now supposing that, for some new input pattern, the elements `yes =
-y[1]` and `no = y[2]` are to be assigned respective probabilities of
-0.2 and 0.8. Then the corresponding distribution `d` is constructed as
-follows:
+This object automatically assigns zero-probability to the unseen class
+`:rare` (i.e., `pdf.(yhat, :rare)` works and returns a zero
+vector). If you would like to assign `:rare` non-zero probabilities,
+simply add it to the first vector (the *support*) and supply a larger
+`probs` matrix.
 
-```julia
-julia> d = MMI.UnivariateFinite([yes, no], [0.2, 0.8])
-UnivariateFinite(:yes=>0.2, :maybe=>0.0, :no=>0.8)
+If instead of raw labels `[:a, :b]` you have the corresponding
+`CategoricalValue`s (from, e.g., `filter(cv->cv in unique(y),
+classes(y))`) then you can use these instead and drop the `pool`
+specifier.
 
-julia> pdf(d, yes)
-0.2
+In a binary classification problem it suffices to specify a single
+vector of probabilities, provided you specify `augment=true`, as in
+the following example, *and note carefully that these probabilities are
+associated with the* **last** *(second) class you specify in the
+constructor:*
 
-julia> maybe = y[4]; pdf(d, maybe)
-0.0
+```julia
+y = categorical([:TRUE, :FALSE, :FALSE, :TRUE, :TRUE])
+an_element = y[1]
+probs = rand(10)
+yhat = UnivariateFinite([:FALSE, :TRUE], probs, augment=true, pool=an_element)
 ```
 
-Alternatively, a dictionary can be passed to the constructor.
+The constructor has a lot of options, including passing a dictionary
+instead of vectors. See [`UnivariateFinite`](@ref) for details.
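+
+As a sketch of the dictionary form, a *single* distribution (rather
+than a vector of them) might be constructed as below. This is a hedged
+illustration only; check the [`UnivariateFinite`](@ref) docstring for
+the exact signature before relying on it:
+
+```julia
+# probabilities keyed on the raw labels, with the pool inferred, as
+# before, from `an_element`:
+d = UnivariateFinite(Dict(:a => 0.2, :b => 0.8), pool=an_element)
+pdf(d, :rare) # zero, as :rare is in the pool but not the support
+```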
See [LinearBinaryClassifier](https://github.com/alan-turing-institute/MLJModels.jl/blob/master/src/GLM.jl) for an example of a Probabilistic classifier implementation. - -```@docs -UnivariateFinite -``` - *Important note on binary classifiers.* There is no "Binary" scitype distinct from `Multiclass{2}` or `OrderedFactor{2}`; `Binary` is just an alias for `Union{Multiclass{2},OrderedFactor{2}}`. The `target_scitype` of a binary classifier will generally be `AbstractVector{<:Binary}` and according to the *mlj* scitype -convention, elements of `y` have type `CategoricalValue` or -`CategoricalString`, and *not* `Bool`. See +convention, elements of `y` have type `CategoricalValue`, and *not* +`Bool`. See [BinaryClassifier](https://github.com/alan-turing-institute/MLJModels.jl/blob/master/src/GLM.jl) for an example. @@ -558,8 +574,7 @@ MMI.input_scitype(::Type{<:DecisionTreeClassifier}) = Table(Union{Continuous,Mis ``` Similarly, to ensure the target is an AbstractVector whose elements -have `Finite` scitype (and hence `CategoricalValue` or -`CategoricalString` machine type) we declare +have `Finite` scitype (and hence `CategoricalValue` machine type) we declare ```julia MMI.target_scitype(::Type{<:DecisionTreeClassifier}) = AbstractVector{<:Finite} @@ -584,8 +599,7 @@ restricts to tables with continuous or binary (ordered or unordered) columns. 
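+
+For instance, a concrete table can be checked against such a
+restriction as in the following sketch, which assumes the declaration
+in question is `Table(Continuous, Finite{2})` and uses invented column
+values for illustration:
+
+```julia
+X = (height = [152.0, 148.0, 163.0],
+     smokes = coerce([true, false, true], OrderedFactor))
+scitype(X) <: Table(Continuous, Finite{2}) # should hold
+```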
For predicting variable length sequences of, say, binary values
-(`CategoricalValue`s or `CategoricalString`s with some common size-two
-pool) we declare
+(`CategoricalValue`s with some common size-two pool) we declare
 
 ```julia
 target_scitype(SomeSupervisedModel) = AbstractVector{<:NTuple{<:Finite{2}}}
 ```
 
@@ -875,6 +889,11 @@ MLJModelInterface.selectrows
 MLJModelInterface.selectcols
 ```
 
+```@docs
+UnivariateFinite
+```
+
+
 ### Where to place code implementing new models
 
diff --git a/docs/src/model_search.md b/docs/src/model_search.md
index d4372aebe..94e2d339e 100644
--- a/docs/src/model_search.md
+++ b/docs/src/model_search.md
@@ -9,10 +9,10 @@ methods, as detailed below.
 
 ## Model metadata
 
-*Terminology.* In this section the word "model" refers to the metadata
-entry in the registry of an actual model `struct`, as appearing
-elsewhere in the manual. One can obtain such an entry with the `info`
-command:
+*Terminology.* In this section the word "model" refers to a metadata
+entry in the model registry, as opposed to an actual model `struct`
+that such an entry represents.
One can obtain such an entry with the +`info` command: ```@setup tokai using MLJ @@ -38,14 +38,20 @@ localmodels() localmodels()[2] ``` -If `models` is passed any `Bool`-valued function `test`, it returns every `model` for which `test(model)` is true, as in +One can search for models containing specified strings or regular expressions in their `docstring` attributes, as in + +```@repl tokai +models("forest") +``` + +or by specifying a filter (`Bool`-valued function): ```@repl tokai -test(model) = model.is_supervised && +filter(model) = model.is_supervised && model.input_scitype >: MLJ.Table(Continuous) && model.target_scitype >: AbstractVector{<:Multiclass{3}} && model.prediction_type == :deterministic -models(test) +models(filter) ``` Multiple test arguments may be passed to `models`, which are applied diff --git a/docs/src/working_with_categorical_data.md b/docs/src/working_with_categorical_data.md new file mode 100644 index 000000000..2dfb4b2e7 --- /dev/null +++ b/docs/src/working_with_categorical_data.md @@ -0,0 +1,184 @@ +# Working with Categorical Data + +## Scientific types for discrete data + +Recall that models articulate their data requirements using scientific +types (see [Getting Started](@ref) or the MLJScientificTypes.jl +[documentation](https://alan-turing-institute.github.io/MLJScientificTypes.jl/dev/)). There +are three scientific types discrete data can have: `Count`, +`OrderedFactor` and `Multiclass`. + + +### Count data + +In MLJ you cannot use integers to represent (finite) categorical +data. Integers are reserved for discrete data you want interpreted as +`Count <: Infinite`: + +```@example hut +using MLJ # hide +scitype([1, 4, 5, 6]) +``` + +The `Count` scientific type includes things like the number of phone +calls, or city populations, and other "frequency" data of a generally +unbounded nature. 
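+
+For comparison, here is the scientific type of the same data
+represented using floats, which is instead interpreted as
+`Continuous`:
+
+```@example hut
+scitype([1.0, 4.0, 5.0, 6.0])
+```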
+
+That said, you may have data that is theoretically `Count`, but which
+you coerce to `OrderedFactor` to enable the use of more models,
+relying on your knowledge of how those models work to inform an
+appropriate interpretation.
+
+
+### OrderedFactor and Multiclass data
+
+Other integer data, such as the number of an animal's legs, or number
+of rooms of homes, are generally coerced to `OrderedFactor <:
+Finite`. The other categorical scientific type is `Multiclass <:
+Finite`, which is for *unordered* categorical data. Coercing data to
+one of these two forms is discussed under [Detecting and coercing
+improperly represented categorical data](@ref) below.
+
+
+### Binary data
+
+There is no separate scientific type for binary data. Binary data has
+type `OrderedFactor{2}` if ordered, and `Multiclass{2}` otherwise.
+Data with type `OrderedFactor{2}` is considered to have an intrinsic
+"positive" class, such as the outcome of a medical test or the
+"pass/fail" outcome of an exam. MLJ measures, such as `true_positive`,
+assume the *second* class in the ordering is the "positive"
+class. Inspecting and changing order is discussed below.
+
+If data has type `Bool` it is considered `Count` data (as `Bool <:
+Integer`), and generally users will want to coerce it to a binary
+scientific type.
+
+
+## Detecting and coercing improperly represented categorical data
+
+One inspects the scientific type of data using `scitype` as shown
+above. To inspect all column scientific types in a table
+simultaneously, use `schema`. (Tables also have a `scitype`, in which
+this information appears in a condensed form more appropriate for type
+dispatch.)
+
+```@example hut
+using DataFrames
+X = DataFrame(
+    name       = ["Siri", "Robo", "Alexa", "Cortana"],
+    gender     = ["male", "male", "Female", "female"],
+    likes_soup = [true, false, false, true],
+    height     = [152, missing, 148, 163],
+    rating     = [2, 5, 2, 1],
+    outcome    = ["rejected", "accepted", "accepted", "rejected"])
+schema(X)
+```
+
+Coercing a single column:
+
+```@example hut
+X.outcome = coerce(X.outcome, OrderedFactor)
+```
+
+Inspecting the order of the levels:
+
+```@example hut
+levels(X.outcome)
+```
+
+Since we wish to regard "accepted" as the positive class, it should
+appear second, which we correct with the `levels!` function:
+
+```@example hut
+levels!(X.outcome, ["rejected", "accepted"]);
+```
+
+Coercing all remaining types simultaneously:
+
+```@example hut
+Xnew = coerce(X, :gender => Multiclass,
+              :likes_soup => OrderedFactor,
+              :height => Continuous,
+              :rating => OrderedFactor)
+schema(Xnew)
+```
+
+(For `DataFrame`s there is also in-place coercion using `coerce!`.)
+
+
+## Tracking all levels
+
+The key property of vectors of scientific type `OrderedFactor` and
+`Multiclass` is that the pool of all levels is not lost when
+separating out one or more elements:
+
+```@example hut
+v = Xnew.rating
+```
+
+```@example hut
+levels(v)
+```
+
+```@example hut
+levels(v[1:2])
+```
+
+```@example hut
+levels(v[2])
+```
+
+By tracking all classes in this way, MLJ avoids common pain points
+around categorical data, such as evaluating a model on a test set
+containing classes that were not seen during training.
+
+
+## Under the hood: CategoricalValue and CategoricalArray
+
+In MLJ the atomic objects with `OrderedFactor` or `Multiclass`
+scientific type are `CategoricalValue`s, from the
+[CategoricalArrays.jl](https://juliadata.github.io/CategoricalArrays.jl/stable/)
+package. In some sense `CategoricalValue`s are an implementation
+detail users can ignore for the most part, as shown above.
However, some users may
+want a basic understanding of these types, and those implementing
+MLJ's model interface for new algorithms will need to understand
+them, which we now do informally. For the complete API, see the
+CategoricalArrays.jl
+[documentation](https://juliadata.github.io/CategoricalArrays.jl/stable/).
+
+To construct an `OrderedFactor` or `Multiclass` vector from raw
+labels, one uses `categorical`:
+
+```@example hut
+using CategoricalArrays # hide
+v = categorical([:A, :B, :A, :A, :C])
+typeof(v)
+```
+
+```@example hut
+scitype(v)
+```
+
+```@example hut
+v = categorical([:A, :B, :A, :A, :C], ordered=true)
+scitype(v)
+```
+
+When you index a `CategoricalVector` you don't get a raw label, but
+instead an instance of `CategoricalValue`. As explained above, this
+value knows the complete pool of levels of the vector from which it
+came. Use `get(val)` to extract the raw label from a value `val`.
+
+Despite the distinction that exists between a value (element) and a
+label, the two compare as equal, from the point of view of `==` and
+`in`:
+
+```julia
+v[1] == :A # true
+:A in v # true
+```
+
+
+## Probabilistic predictions of categorical data
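+
+A sketch of what such predictions look like, using the
+`UnivariateFinite` distribution described in [Adding Models for
+General Use](@ref) (the particular labels and probabilities here are
+invented for illustration):
+
+```@example hut
+w = categorical([:yes, :no, :maybe], ordered=true)
+d = UnivariateFinite([:yes, :no], [0.8, 0.2], pool=w[1])
+pdf(d, :yes)
+```
+
+Note that `pdf(d, :maybe)` returns zero: `:maybe` is in the pool
+tracked by `d` but has been given no probability mass.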