[FR] ability to drop columns that get used for split, but not for training #11942

hopemiranda · 2024-05-08T16:32:32Z

Willingness to contribute

Yes. I would be willing to contribute this feature with guidance from the MLflow community.

Proposal Summary

Add the ability to split the ingested data by a grouping column that is not itself included as a feature in the training set.

For example, provide the option to use sklearn's GroupShuffleSplit within the split step of the recipe, without using the split_by_feature column as a feature in the training set:

from sklearn.model_selection import GroupShuffleSplit

GroupShuffleSplit(test_size=0.2, n_splits=2, random_state=2).split(
    data, groups=data[split_by_feature]
)
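To make the request concrete, here is a minimal sketch of the behavior being asked for: split by group, then drop the grouping column before anything reaches training. It uses only pandas and scikit-learn, not the MLflow Recipes API. The helper name `group_split_and_drop` and the example column `patient_id` are hypothetical and only serve as illustration.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def group_split_and_drop(data, split_by_feature, test_size=0.2, random_state=2):
    """Split `data` so that no group spans both parts, then drop the group column.

    Hypothetical helper for illustration; not part of MLflow or scikit-learn.
    """
    splitter = GroupShuffleSplit(
        n_splits=1, test_size=test_size, random_state=random_state
    )
    # split() yields (train_idx, test_idx) arrays of positional row indices.
    train_idx, test_idx = next(
        splitter.split(data, groups=data[split_by_feature])
    )
    # The grouping column is needed only for the split, so remove it afterwards.
    train = data.iloc[train_idx].drop(columns=[split_by_feature])
    test = data.iloc[test_idx].drop(columns=[split_by_feature])
    return train, test


# Toy data: five groups with two rows each.
data = pd.DataFrame(
    {
        "patient_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
        "x": range(10),
        "y": [0, 1] * 5,
    }
)
train, test = group_split_and_drop(data, "patient_id")
```

Both returned frames keep the original row index, so the group membership of each row can still be looked up in `data` for auditing, even though `patient_id` is no longer a column that later steps would see.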

Motivation

What is the use case for this feature?

Modeling using stratified sampling for the training and test sets

Why is this use case valuable to support for MLflow users in general?

Built-in stratified sampling would remove the need for workarounds to use this method within the split step of recipes.

Why is this use case valuable to support for your project(s) or organization?

All of our models require stratified sampling in order to work as intended.

Why is it currently difficult to achieve this use case?

As the code is now, any feature that is ingested and used in a grouped split is passed on to the next step for transformations. Since the transformations are registered with the model, the unused feature remains part of the model's inputs.

Details

No response

github-actions · 2024-05-16T00:12:58Z

@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.

hopemiranda added the enhancement New feature or request label May 8, 2024

github-actions bot added the area/recipes MLflow Recipes, Recipes APIs, Recipes configs, Recipe Templates label May 8, 2024
