Fix ColumnTransformer in parallel with joblib's auto memmapping #28822

Merged
ogrisel merged 5 commits into scikit-learn:main on Apr 22, 2024

Conversation

jeremiedbb
Member

Fixes #28781

When running in parallel, ColumnTransformer will crash if joblib's auto memmap triggers and copies are not made in time.

Currently we index X when declaring the jobs. This means the order is copy first, then read-only memmap. The transformer then fails when it attempts an in-place transformation, or fails even earlier in the case of a dataframe (see #28781 (comment)).

The fix proposed here indexes X within each job instead. The order becomes read-only memmap first, then copy, so the transformer can do its in-place transformation.
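To make the two orderings concrete, here is a minimal NumPy-only sketch. It does not use joblib or ColumnTransformer: `as_read_only` is a hypothetical stand-in for the read-only array that auto memmapping hands to a worker, and `scale_inplace` is a hypothetical transformer step that writes to its input. It only illustrates why the order matters, not the actual change in this PR.

```python
import numpy as np


def as_read_only(arr):
    """Hypothetical stand-in for joblib's auto memmapping: a read-only view."""
    view = arr.view()
    view.setflags(write=False)
    return view


def scale_inplace(block):
    """Hypothetical transformer step that modifies its input in place."""
    block *= 2.0
    return block


X = np.arange(12, dtype=np.float64).reshape(4, 3)
cols = [0, 2]

# Old order: index when declaring the job (copy), then the worker
# receives that copy as a read-only array and cannot write to it.
job_input = as_read_only(X[:, cols])
try:
    scale_inplace(job_input)
    old_order_ok = True
except ValueError:
    old_order_ok = False

# New order: the worker receives the read-only X and indexes it itself.
# Indexing with a list of columns returns a fresh, writable copy.
X_ro = as_read_only(X)
new_result = scale_inplace(X_ro[:, cols])
```

In this sketch the old order raises on the in-place write, while the new order succeeds and leaves `X` untouched.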

Disclaimer: this doesn't solve the underlying problem completely. Selecting columns by slice still fails, because a slice creates a view rather than a copy. I'm starting to think the issue runs deeper and lies in how the copy parameter interacts with check_array, for all estimators. I think check_array should always make a copy when the array is read-only, even if copy=False, because an estimator exposes a copy parameter precisely when it wants to do in-place modifications.
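The slice caveat can be reproduced with plain NumPy. The sketch below marks an array read-only as a stand-in for a read-only memmap, then compares column selection by slice and by list. `ensure_writeable` is a hypothetical helper illustrating the "copy if read-only" idea; it is not check_array's current behavior.

```python
import numpy as np

X = np.arange(12, dtype=np.float64).reshape(4, 3)
X.setflags(write=False)  # stand-in for a read-only memmap

by_slice = X[:, 0:2]    # basic indexing: a view that stays read-only
by_list = X[:, [0, 1]]  # advanced indexing: a new, writable array


def ensure_writeable(arr):
    """Hypothetical helper: copy only when the array cannot be written to."""
    return arr if arr.flags.writeable else arr.copy()


safe = ensure_writeable(by_slice)
safe += 1.0  # works on the copy; X itself is not modified
```

The slice shares memory with `X` and inherits its read-only flag, which is why slice-based selection still fails after this PR.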

github-actions bot commented Apr 12, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: f558534. Link to the linter CI: here

@jeremiedbb
Member Author

I opened #28824 to discuss the read-only situation more globally.

Comment on lines 2465 to 2467
with parallel_backend("loky", max_nbytes=1):
    Xt = transformer.fit_transform(X)

Member Author

CI is failing because this is only possible with joblib>=1.3, and our minimum is 1.2.
I can use a bigger array for now.

Member

We could skip the test on that joblib version, though.
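A version-gated skip could be sketched as follows. This uses only the standard library, and both helpers are hypothetical illustrations rather than scikit-learn utilities. The comparison ignores pre-release ordering, which is a simplifying assumption.

```python
import re


def version_tuple(version):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at suffixes such as "dev0"
        parts.append(int(match.group()))
    return tuple(parts)


def should_skip(installed, required):
    """True when the installed version is older than the required one."""
    a, b = version_tuple(installed), version_tuple(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))  # pad so "1.3" compares equal to "1.3.0"
    b += (0,) * (width - len(b))
    return a < b
```

In a test module, the result could feed `pytest.mark.skipif(should_skip(joblib.__version__, required), reason=...)`, with `required` set to the first joblib release that accepts `max_nbytes` in the context manager. A real project would more likely rely on a full version parser such as `packaging.version.parse`.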

Member

@adrinjalali adrinjalali left a comment

I agree this might be a bigger issue, but this is a minimal change that fixes a few cases. So LGTM.

Member

@ogrisel ogrisel left a comment

LGTM as well.

@ogrisel ogrisel merged commit 51fca39 into scikit-learn:main Apr 22, 2024
30 checks passed