
[bugfix, enhancement] Address affinity bug by using threadpoolctl/joblib for n_jobs dispatching #2364


Draft
wants to merge 51 commits into main

Conversation

@icfaust icfaust (Contributor) commented Mar 17, 2025

Description

This addresses an issue related to Linux thread control via CPU affinity. New tests were added which validate n_jobs when a reduced number of threads is available, as set by the affinity mask. The test requires that 4 threads/CPU cores be available to the main Python pytest process in order to function properly, meaning it does not run on Azure Pipelines runners, only on Intel CI and GitHub Actions.

This change was needed in order for scikit-learn-intelex to work properly on thread-limited Kubernetes pods.
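
For context, here is a minimal illustration of the underlying mismatch (assuming joblib is installed; exact values depend on the host and its affinity mask): on a thread-limited pod, os.cpu_count() still reports every logical CPU on the machine, while the process affinity mask, and joblib's cpu_count(), reflect what the process may actually use.

```python
import os
from joblib import cpu_count

# os.cpu_count() ignores the affinity mask and reports all logical CPUs.
print("os.cpu_count():      ", os.cpu_count())

# On Linux, the affinity mask shows how many CPUs this process may actually use.
print("sched_getaffinity(0):", len(os.sched_getaffinity(0)))

# joblib's cpu_count() takes the affinity mask into account.
print("joblib.cpu_count():  ", cpu_count())
```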


The PR should start as a draft, then move to the ready-for-review state after CI has passed and all applicable checkboxes are checked.
This approach ensures that reviewers don't spend extra time asking for regular requirements.

You can remove a checkbox as not applicable only if it doesn't relate to this PR in any way.
For example, a PR with a docs update doesn't require the performance checkboxes, while a PR with any change to actual code should have them and justify how the change is expected to affect performance (or the justification should be self-evident).

Checklist to comply with before moving PR from draft:

PR completeness and readability

  • I have reviewed my changes thoroughly before submitting this pull request.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes or created a separate PR with the update and provided its number in the description, if necessary.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have added the respective label(s) to the PR if I have permission to do so.
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.
  • I have extended the testing suite if new functionality was introduced in this PR.

Performance

  • I have measured performance for affected algorithms using scikit-learn_bench and provided at least a summary table with measured data, if a performance change is expected.
  • I have provided justification why performance has changed or why changes are not expected.
  • I have provided justification why quality metrics have changed or why changes are not expected.
  • I have extended benchmarking suite and provided corresponding scikit-learn_bench PR if new measurable functionality was introduced in this PR.

@icfaust icfaust changed the title [WIP, enhancement] Address affinity bug by using threadpoolctl/joblib for n_jobs dispatching [bugfix, enhancement] Address affinity bug by using threadpoolctl/joblib for n_jobs dispatching Mar 17, 2025
@icfaust icfaust changed the title [bugfix, enhancement] Address affinity bug by using threadpoolctl/joblib for n_jobs dispatching [WIP, bugfix, enhancement] Address affinity bug by using threadpoolctl/joblib for n_jobs dispatching Mar 17, 2025

try:
    if not self.n_jobs:
        n_jobs = cpu_count()
Contributor

Would this later on get limited to the number of physical cores from the oneDAL side?

Contributor Author

I'll be honest, I'm not 100% sure yet. The default in threading.h in daal will set it to the number of CPUs, but I haven't spent the time to fully track down what the default ends up being when affinity is involved.

Contributor

Maybe @Alexsandruss could comment here on whether it'd end up limited to the number of physical cores somewhere else?

Contributor

Looks like setting the number of threads like this would not result in that number later on getting limited to the number of physical cores. How about passing the argument only_physical_cores=True here?
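
A minimal sketch of what that suggestion could look like (the helper name is hypothetical, and cpu_count is assumed to be joblib's, which accepts the only_physical_cores flag):

```python
from joblib import cpu_count

def _default_n_jobs(n_jobs):
    # Hypothetical helper: when n_jobs is unset, fall back to the number
    # of physical cores rather than the number of logical CPUs.
    if not n_jobs:
        return cpu_count(only_physical_cores=True)
    return n_jobs
```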

@david-cortes-intel (Contributor) commented Mar 24, 2025

Tried adding a line to print this value here: https://github.com/uxlfoundation/oneDAL/blob/31cafec9950f1db352b639dafad5875971ca00fe/cpp/daal/src/threading/threading.cpp#L267
... and from what I see, it is indeed set to the result of cpu_count(only_physical_cores=False).

Contributor

Although from some further testing, this behavior also appears to be the same in the current main branch.


codecov bot commented Mar 18, 2025

Codecov Report

Attention: Patch coverage is 60.00000% with 2 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
onedal/datatypes/numpy/data_conversion.cpp | 66.66% | 0 Missing and 1 partial ⚠️
onedal/datatypes/table.cpp | 50.00% | 0 Missing and 1 partial ⚠️

Flag | Coverage Δ
azure | 79.70% <ø> (-0.06%) ⬇️
github | 73.58% <60.00%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines | Coverage Δ
sklearnex/linear_model/incremental_linear.py | 83.45% <ø> (ø)
sklearnex/linear_model/incremental_ridge.py | 86.66% <ø> (ø)
onedal/datatypes/numpy/data_conversion.cpp | 53.02% <66.66%> (+1.86%) ⬆️
onedal/datatypes/table.cpp | 51.92% <50.00%> (ø)

@icfaust (Contributor Author) commented Mar 18, 2025

/intelci: run

@icfaust (Contributor Author) commented Mar 19, 2025

/azp run CI

@icfaust (Contributor Author) commented Mar 23, 2025

/intelci: run

@icfaust (Contributor Author) commented Mar 24, 2025

/intelci: run

def get_num_threads(self):
    return num_threads()

def set_num_threads(self, nthreads):
@david-cortes-intel (Contributor) commented Mar 24, 2025

I understand this setting would apply globally, which could lead to race conditions if users call this in parallel, for example through some framework that would parallelize estimator calls.

Could it somehow get a mutex (or use atomic ops) either here or on the oneDAL side?

Also, it would be better to add a warning that the setting is changed at a global level, so that a user would not try to call these inside multi-threaded code.
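
A minimal Python-side sketch of the mutex idea, purely for illustration (the lock and wrapper names are hypothetical, and backend.set_num_threads stands in for the binding shown in the snippet above; the follow-up below notes daal already holds a mutex internally):

```python
import threading

# Hypothetical module-level lock serializing changes to the global setting.
_num_threads_lock = threading.Lock()

def set_num_threads_locked(backend, nthreads):
    # Guard the global thread-count setter so that concurrent callers
    # (e.g. parallelized estimator calls) do not race on the setting.
    with _num_threads_lock:
        backend.set_num_threads(nthreads)
```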

Contributor

Actually, on a further look, it does already have a mutex on the daal side. Still, it would be better to document that this behavior is global.

Contributor Author

Sounds good, will do!

@@ -55,9 +55,6 @@
sklearn_clone_dict,
)

# to reproduce errors even in CI
d4p.daalinit(nthreads=100)
Contributor Author

Removing this causes all sorts of memory-leak-check test failures, not just on Windows and not just with pandas.

@icfaust (Contributor Author) commented Jun 13, 2025

This last run is very useful/interesting. It shows that the failures occur:

  1. on Linux and Windows,
  2. on both GitHub Actions and Azure Pipelines,
  3. for daal4py (logistic regression) and onedal,
  4. across many different estimators,
  5. with different input datatypes.

I'm very worried about our malloc routines in daal. Thoughts, @david-cortes-intel @Vika-F?

@david-cortes-intel (Contributor) replied:


Perhaps there could be memory leaks, but I'm not sure that we can conclude anything from these tests alone.

I see that some are for cases where the input comes from DPCTL. Does it set flags like SYCL_PI_LEVEL_ZERO_DISABLE_USM_ALLOCATOR, for example? It might also be a better idea to set PYTHONMALLOC=malloc, and maybe LD_PRELOAD=jemalloc.so on Linux, given the way the test is structured.

But what would be even better is to run them through valgrind and asan to see whether there are any reports of leaks coming specifically from oneDAL or sklearnex. We unfortunately get a lot of false positives from numpy, scipy, pybind11, and others, which makes the logs hard to browse; but true leaks should definitely show up there.
