Improvement in the Preparing Data part #964

Closed
lucasmsoares96 opened this issue Sep 14, 2022 · 1 comment


@lucasmsoares96

Hello guys. First, I want to thank you for the work you are doing here. I'd like to suggest an improvement to the Preparing Data part of the MLJ documentation: compared with scikit-learn's preprocessing documentation, some functionality seems to be missing.

Topics covered in the MLJ documentation are:

  • Common data preprocessing workflows
    • Scientific type coercion
    • Data transformations
  • Scientific type coercion
  • Data transformation

In scikit-learn they are:

  • Standardization, or mean removal and variance scaling
    • Scaling features to a range
    • Scaling sparse data
    • Scaling data with outliers
    • Centering kernel matrices
  • Non-linear transformation
    • Mapping to a Uniform distribution
    • Mapping to a Gaussian distribution
  • Normalization
  • Encoding categorical features
    • Infrequent categories
  • Discretization
    • K-bin discretization
    • Feature binarization
  • Imputation of missing values
  • Generating polynomial features
    • Polynomial features
    • Spline transformer
  • Custom transformers
@ablaom
Member

ablaom commented Sep 14, 2022

Thanks for the positive feedback.

Most of these are actually implemented and documented here:

https://github.com/alan-turing-institute/MLJ.jl/blob/master/docs/src/transformers.md

There is an active PR to generate polynomial features (JuliaAI/MLJModels.jl#478).

For an up-to-date list of built-in preprocessing transformers, follow this workflow:

julia> using MLJModels
julia> models() do m
       !m.is_supervised && m.package_name=="MLJModels"
       end
10-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :deep_properties, :docstring, :fit_data_scitype, :human_name, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :reporting_operations, :reports_feature_importances, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :transform_scitype, :input_scitype, :target_scitype, :output_scitype)}}:
 (name = ContinuousEncoder, package_name = MLJModels, ... )
 (name = FeatureSelector, package_name = MLJModels, ... )
 (name = FillImputer, package_name = MLJModels, ... )
 (name = OneHotEncoder, package_name = MLJModels, ... )
 (name = Standardizer, package_name = MLJModels, ... )
 (name = UnivariateBoxCoxTransformer, package_name = MLJModels, ... )
 (name = UnivariateDiscretizer, package_name = MLJModels, ... )
 (name = UnivariateFillImputer, package_name = MLJModels, ... )
 (name = UnivariateStandardizer, package_name = MLJModels, ... )
 (name = UnivariateTimeTypeToContinuous, package_name = MLJModels, ... )

julia> doc("OneHotEncoder") # to get a detailed document string
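To illustrate how one of the listed transformers is typically applied, here is a minimal sketch using `Standardizer`, which roughly corresponds to the "mean removal and variance scaling" item on the scikit-learn side. It assumes the full MLJ package is installed (it re-exports the built-in transformers), and the toy table `X` is hypothetical, made up only for this example.

```julia
using MLJ          # assumed installed; re-exports Standardizer, machine, fit!, transform
using Statistics   # for mean and std

# Hypothetical toy table (a named tuple of columns); both columns are Continuous.
X = (height = [1.85, 1.67, 1.50, 1.72],
     weight = [80.0, 65.0, 50.0, 72.5])

stand = Standardizer()         # one of the built-in transformers listed above
mach = machine(stand, X)       # bind the transformer to the data
fit!(mach, verbosity=0)        # learn each column's mean and standard deviation
W = transform(mach, X)         # table with the standardized columns

# After standardization each column should have mean near 0 and std near 1.
println((mean(W.height), std(W.height)))
println((mean(W.weight), std(W.weight)))
```

The other unsupervised transformers in the list follow the same `machine` / `fit!` / `transform` pattern; their hyperparameters are described in the respective document strings.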

Feel free to open separate request issues for missing items.

ablaom closed this as completed Sep 14, 2022