[FR] ability to drop columns that get used for split, but not for training #11942
Open
Labels
area/recipes: MLflow Recipes, Recipes APIs, Recipes configs, Recipe Templates
enhancement: New feature or request
Willingness to contribute
Yes. I would be willing to contribute this feature with guidance from the MLflow community.
Proposal Summary
Add the ability to split the ingested data by a grouping column that is not included in the training set.
For example, provide an option to use scikit-learn's `GroupShuffleSplit` within the `split` step of the recipe, without using the `split_by_feature` column as a feature in the training set.
Motivation
Modeling using stratified sampling for the training and test sets.
Built-in stratified sampling would avoid the workarounds currently needed to use this method within the `split` step of recipes.
All of our models require stratified sampling in order to work as intended.
As the code is now, any feature that is ingested and used in a grouped split is passed to the next step for transformations. Since transformations are registered with the model, the unused feature stays with the registered model.
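To make the request concrete, here is a minimal sketch of the desired behavior written outside of MLflow Recipes, using only pandas and scikit-learn. It is not the Recipes implementation or an existing Recipes option; the helper name `grouped_split_drop_column`, the `patient_id` column, and the toy data are hypothetical and chosen for illustration only. The grouping column decides which rows go to which set and is then dropped, so it never reaches the transform or train steps.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def grouped_split_drop_column(df, group_col, test_size=0.25, random_state=0):
    """Hypothetical helper: split so that no group spans both sets,
    then drop the grouping column from both outputs."""
    splitter = GroupShuffleSplit(
        n_splits=1, test_size=test_size, random_state=random_state
    )
    # split() yields positional indices; groups keeps rows of one group together.
    train_idx, test_idx = next(splitter.split(df, groups=df[group_col]))
    train = df.iloc[train_idx].drop(columns=[group_col])
    test = df.iloc[test_idx].drop(columns=[group_col])
    return train, test


# Toy data (assumption): patient_id is used only for splitting, not as a feature.
df = pd.DataFrame(
    {
        "patient_id": [1, 1, 2, 2, 3, 3, 4, 4],
        "x": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
        "y": [0, 1, 0, 1, 0, 1, 0, 1],
    }
)

train, test = grouped_split_drop_column(df, "patient_id")
print(list(train.columns))  # the grouping column is no longer present
```

Because the column is removed before any transformer is fitted, a model registered from `train` would not expect `patient_id` at inference time, which is the outcome this request asks the `split` step to support.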
Details
No response
What component(s) does this bug affect?
area/artifacts: Artifact stores and artifact logging
area/build: Build and test infrastructure for MLflow
area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
area/docs: MLflow documentation pages
area/examples: Example code
area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
area/models: MLmodel format, model serialization/deserialization, flavors
area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
area/projects: MLproject format, project running backends
area/scoring: MLflow Model server, model deployment tools, Spark UDFs
area/server-infra: MLflow Tracking server backend
area/tracking: Tracking Service, tracking client APIs, autologging
What interface(s) does this bug affect?
area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
area/windows: Windows support
What language(s) does this bug affect?
language/r: R APIs and clients
language/java: Java APIs and clients
language/new: Proposals for new client languages
What integration(s) does this bug affect?
integrations/azure: Azure and Azure ML integrations
integrations/sagemaker: SageMaker integrations
integrations/databricks: Databricks integrations