MLJ has two tools for splitting data. To split data vertically (that
is, to split by observations) use partition
. This is commonly applied to a
vector of observation indices, but can also be applied to datasets
themselves, provided they are vectors, matrices or tables.
To split tabular data horizontally (i.e., break up a table based on
feature names) use unpack
.
MLJBase.partition
MLJBase.unpack
As outlined in Getting Started, it is important that the
scientific type of
data matches the requirements of the model of interest. For example,
while the majority of supervised learning models require input
features to be Continuous
, newcomers to MLJ are sometimes
surprised at the disappointing results of [model queries](@ref
model_search) such as this one:
using MLJ
X = (height = [185, 153, 163, 114, 180],
time = [2.3, 4.5, 4.2, 1.8, 7.1],
mark = ["D", "A", "C", "B", "A"],
admitted = ["yes", "no", missing, "yes"]);
y = [12.4, 12.5, 12.0, 31.9, 43.0]
models(matching(X, y))
Or are unsure about the source of the following warning:
Tree = @load DecisionTreeRegressor pkg=DecisionTree verbosity=0
tree = Tree();
julia> machine(tree, X, y)
julia> machine(tree, X, y)
┌ Warning: The scitype of `X`, in `machine(model, X, ...)` is incompatible with `model=DecisionTreeRegressor @378`:
│ scitype(X) = Table{Union{AbstractVector{Continuous}, AbstractVector{Count}, AbstractVector{Textual}, AbstractVector{Union{Missing, Textual}}}}
│ input_scitype(model) = Table{var"#s46"} where var"#s46"<:Union{AbstractVector{var"#s9"} where var"#s9"<:Continuous, AbstractVector{var"#s9"} where var"#s9"<:Count, AbstractVector{var"#s9"} where var"#s9"<:OrderedFactor}.
└ @ MLJBase ~/Dropbox/Julia7/MLJ/MLJBase/src/machines.jl:103
Machine{DecisionTreeRegressor,…} @198 trained 0 times; caches data
args:
1: Source @628 ⏎ `Table{Union{AbstractVector{Continuous}, AbstractVector{Count}, AbstractVector{Textual}, AbstractVector{Union{Missing, Textual}}}}`
2: Source @544 ⏎ `AbstractVector{Continuous}`
The meaning of the warning is:
- The input
X
is a table with column scitypesContinuous
,Count
, andTextual
andUnion{Missing, Textual}
, which can also see by inspecting the schema:
schema(X)
- The model requires a table whose column element scitypes subtype
Continuous
, an incompatibility.
There are two tools for addressing data-model type mismatches like the above, with links to further documentation given below:
Scientific type coercion: We coerce machine types to obtain the
intended scientific interpretation. If height
in the above example
is intended to be Continuous
, mark
is supposed to be
OrderedFactor
, and admitted
a (binary) Multiclass
, then we can
do
X_coerced = coerce(X, :height=>Continuous, :mark=>OrderedFactor, :admitted=>Multiclass);
schema(X_coerced)
Data transformations: We carry out conventional data transformations, such as missing value imputation and feature encoding:
imputer = FillImputer()
mach = machine(imputer, X_coerced) |> fit!
X_imputed = transform(mach, X_coerced);
schema(X_imputed)
encoder = ContinuousEncoder()
mach = machine(encoder, X_imputed) |> fit!
X_encoded = transform(mach, X_imputed)
schema(X_encoded)
Such transformations can also be combined in a pipeline; see Linear Pipelines.
Scientific type coercion is documented in detail at ScientificTypesBase.jl. See also the tutorial at the this MLJ Workshop (specifically, here) and this Data Science in Julia tutorial.
Also relevant is the section, Working with Categorical Data.
MLJ's Built-in transformers are documented at Transformers and Other Unsupervised Models. The most relevant in the present context
are: ContinuousEncoder
, OneHotEncoder
,
FeatureSelector
and FillImputer
. A Gaussian
mixture models imputer is provided by BetaML, which can be loaded
with
MissingImputator = @load MissingImputator pkg=BetaML
This MLJ Workshop, and the "End-to-end examples" in Data Science in Julia tutorials give further illustrations of data preprocessing in MLJ.