Improvement in the Preparing Data part #964

lucasmsoares96 · 2022-09-14T01:58:35Z

Hello guys. First, I want to thank you for the work you are doing here. I'd like to suggest an improvement to the Preparing Data part of the MLJ documentation: I am missing some functionality that scikit-learn covers.

Topics covered in the MLJ documentation are:

Common data preprocessing workflows
- Scientific type coercion
- Data transformations
Scientific type coercion
Data transformation

In scikit-learn they are:

Standardization, or mean removal and variance scaling
- Scaling features to a range
- Scaling sparse data
- Scaling data with outliers
- Centering kernel matrices
Non-linear transformation
- Mapping to a Uniform distribution
- Mapping to a Gaussian distribution
Normalization
Encoding categorical features
- Infrequent categories
Discretization
- K-bins discretization
- Feature binarization
Imputation of missing values
Generating polynomial features
- Polynomial features
- Spline transformer
Custom transformers

ablaom · 2022-09-14T22:56:35Z

Thanks for the positive feedback.

Most of these are actually implemented and documented here:

https://github.com/alan-turing-institute/MLJ.jl/blob/master/docs/src/transformers.md

There is an active PR to generate polynomial features (JuliaAI/MLJModels.jl#478).

For an up-to-date list of built-in preprocessing transformers, follow this workflow:

julia> using MLJModels
julia> models() do m
       !m.is_supervised && m.package_name=="MLJModels"
       end
10-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :deep_properties, :docstring, :fit_data_scitype, :human_name, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :reporting_operations, :reports_feature_importances, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :transform_scitype, :input_scitype, :target_scitype, :output_scitype)}}:
 (name = ContinuousEncoder, package_name = MLJModels, ... )
 (name = FeatureSelector, package_name = MLJModels, ... )
 (name = FillImputer, package_name = MLJModels, ... )
 (name = OneHotEncoder, package_name = MLJModels, ... )
 (name = Standardizer, package_name = MLJModels, ... )
 (name = UnivariateBoxCoxTransformer, package_name = MLJModels, ... )
 (name = UnivariateDiscretizer, package_name = MLJModels, ... )
 (name = UnivariateFillImputer, package_name = MLJModels, ... )
 (name = UnivariateStandardizer, package_name = MLJModels, ... )
 (name = UnivariateTimeTypeToContinuous, package_name = MLJModels, ... )

julia> doc("OneHotEncoder") # to get the detailed docstring
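
As a rough illustration of how one of the listed transformers is typically applied, here is a minimal sketch using `Standardizer`, the closest counterpart to scikit-learn's mean removal and variance scaling. It assumes MLJ is installed and that `Standardizer` is available after `using MLJ`; the table `X` and its column names are hypothetical toy data, not taken from the documentation.

```julia
using MLJ  # assumption: MLJ re-exports the built-in transformers from MLJModels

# Hypothetical toy table (a named tuple of columns), purely for illustration
X = (height = [1.85, 1.67, 1.50, 1.77],
     weight = [80.0, 60.0, 55.0, 72.0])

stand = Standardizer()            # default hyperparameters
mach = machine(stand, X)          # bind the unsupervised model to the data
fit!(mach, verbosity=0)           # learn each column's mean and standard deviation
W = transform(mach, X)            # rescaled copy of X; columns are centered and scaled
```

The other transformers in the list (for example `OneHotEncoder` or `FillImputer`) should follow the same `machine` / `fit!` / `transform` pattern, with hyperparameters described in their docstrings.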

Feel free to open separate request issues for missing items.

ablaom closed this as completed Sep 14, 2022
