
[bugfix, enhancement] Address affinity bug by using threadpoolctl/joblib for n_jobs dispatching #2364


Draft
wants to merge 51 commits into main

Conversation

@icfaust icfaust (Contributor) commented Mar 17, 2025

Description

This addresses an issue related to Linux thread control via CPU affinity. New tests were added which validate n_jobs when a reduced number of threads is available, as set by the affinity mask. The test requires that 4 threads/CPU cores be available to the main Python pytest process in order to function properly, meaning it does not run on Azure Pipelines runners, only on Intel CI and GitHub Actions.

This change was needed in order for scikit-learn-intelex to work properly on thread-limited Kubernetes pods.
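
For context, here is a minimal illustration of the underlying mismatch (assuming joblib is installed; exact values depend on the host and its affinity mask): on a thread-limited pod, os.cpu_count() still reports every logical CPU on the machine, while the process affinity mask, and joblib's cpu_count(), reflect what the process may actually use.

```python
import os
from joblib import cpu_count

# os.cpu_count() ignores the affinity mask and reports all logical CPUs.
print("os.cpu_count():      ", os.cpu_count())

# On Linux, the affinity mask shows how many CPUs this process may actually use.
print("sched_getaffinity(0):", len(os.sched_getaffinity(0)))

# joblib's cpu_count() takes the affinity mask into account.
print("joblib.cpu_count():  ", cpu_count())
```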


The PR should start as a draft, then move to the ready-for-review state after CI has passed and all applicable checkboxes are checked.
This approach ensures that reviewers don't spend extra time asking for regular requirements.

You can remove a checkbox as not applicable only if it doesn't relate to this PR in any way.
For example, a PR with a docs update doesn't require the performance checkboxes, while a PR with any change to actual code should have them and justify how the change is expected to affect performance (or the justification should be self-evident).

Checklist to comply with before moving PR from draft:

PR completeness and readability

  • I have reviewed my changes thoroughly before submitting this pull request.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes or created a separate PR with the update and provided its number in the description, if necessary.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have added the respective label(s) to the PR if I have permission to do so.
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.
  • I have extended the testing suite if new functionality was introduced in this PR.

Performance

  • I have measured performance for affected algorithms using scikit-learn_bench and provided at least a summary table with measured data, if a performance change is expected.
  • I have provided justification why performance has changed or why changes are not expected.
  • I have provided justification why quality metrics have changed or why changes are not expected.
  • I have extended benchmarking suite and provided corresponding scikit-learn_bench PR if new measurable functionality was introduced in this PR.

@icfaust icfaust changed the title [WIP, enhancement] Address affinity bug by using threadpoolctl/joblib for n_jobs dispatching [bugfix, enhancement] Address affinity bug by using threadpoolctl/joblib for n_jobs dispatching Mar 17, 2025
@icfaust icfaust changed the title [bugfix, enhancement] Address affinity bug by using threadpoolctl/joblib for n_jobs dispatching [WIP, bugfix, enhancement] Address affinity bug by using threadpoolctl/joblib for n_jobs dispatching Mar 17, 2025

try:
    if not self.n_jobs:
        n_jobs = cpu_count()
Contributor

Would this later on get limited to the number of physical cores from the oneDAL side?

Contributor Author

I'll be honest, I'm not 100% sure yet. The default in threading.h in daal will set it to the number of CPUs, but I haven't spent the time to fully track down what the default ends up being when affinity is involved.

Contributor

Maybe @Alexsandruss could comment here on whether it'd end up limited to the number of physical cores somewhere else?

Contributor

Looks like setting the number of threads like this would not result in that number later on getting limited to the number of physical cores. How about passing the argument only_physical_cores=True here?
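
A minimal sketch of what that suggestion could look like (the helper name is hypothetical, and cpu_count is assumed to be joblib's, which accepts the only_physical_cores flag):

```python
from joblib import cpu_count

def _default_n_jobs(n_jobs):
    # Hypothetical helper: when n_jobs is unset, fall back to the number
    # of physical cores rather than the number of logical CPUs.
    if not n_jobs:
        return cpu_count(only_physical_cores=True)
    return n_jobs
```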

@david-cortes-intel (Contributor) commented Mar 24, 2025

Tried adding a line to print this value here: https://github.com/uxlfoundation/oneDAL/blob/31cafec9950f1db352b639dafad5875971ca00fe/cpp/daal/src/threading/threading.cpp#L267
... and from what I see, it is indeed set to the result of cpu_count(only_physical_cores=False).

Contributor

Although from some further testing, this behavior also appears to be the same in the current main branch.


codecov bot commented Mar 18, 2025

Codecov Report

Attention: Patch coverage is 60.00000% with 2 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
onedal/datatypes/numpy/data_conversion.cpp | 66.66% | 0 Missing and 1 partial ⚠️
onedal/datatypes/table.cpp | 50.00% | 0 Missing and 1 partial ⚠️

Flag | Coverage Δ
azure | 79.70% <ø> (-0.06%) ⬇️
github | 73.58% <60.00%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines | Coverage Δ
sklearnex/linear_model/incremental_linear.py | 83.45% <ø> (ø)
sklearnex/linear_model/incremental_ridge.py | 86.66% <ø> (ø)
onedal/datatypes/numpy/data_conversion.cpp | 53.02% <66.66%> (+1.86%) ⬆️
onedal/datatypes/table.cpp | 51.92% <50.00%> (ø)

@icfaust (Contributor Author) commented Mar 18, 2025

/intelci: run

@icfaust (Contributor Author) commented Mar 19, 2025

/azp run CI

@icfaust (Contributor Author) commented Mar 23, 2025

/intelci: run

@icfaust (Contributor Author) commented Mar 24, 2025

/intelci: run

def get_num_threads(self):
    return num_threads()

def set_num_threads(self, nthreads):
@david-cortes-intel (Contributor) commented Mar 24, 2025

I understand this setting would apply globally, which could lead to race conditions if users call this in parallel, for example through some framework that would parallelize estimator calls.

Could it somehow get a mutex (or use atomic ops) either here or on the oneDAL side?

Also, it would be better to add a warning that the setting is changed at a global level, so that a user would not try to call these inside multi-threaded code.
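
A minimal Python-side sketch of the mutex idea, purely for illustration (the lock and wrapper names are hypothetical, and backend.set_num_threads stands in for the binding shown in the snippet above; the follow-up below notes daal already holds a mutex internally):

```python
import threading

# Hypothetical module-level lock serializing changes to the global setting.
_num_threads_lock = threading.Lock()

def set_num_threads_locked(backend, nthreads):
    # Guard the global thread-count setter so that concurrent callers
    # (e.g. parallelized estimator calls) do not race on the setting.
    with _num_threads_lock:
        backend.set_num_threads(nthreads)
```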

Contributor

Actually, on a further look, it does already have a mutex on the daal side. Still, it would be better to document that this behavior is global.

Contributor Author

Sounds good, will do!

@@ -55,9 +55,6 @@
sklearn_clone_dict,
)

# to reproduce errors even in CI
d4p.daalinit(nthreads=100)
Contributor Author

Removing this causes all sorts of memory-leak-check test failures, not just on Windows and not just with pandas.

@icfaust (Contributor Author) commented Jun 13, 2025

This last run is very useful/interesting. It shows that the failures occur:

  1. on Linux and Windows,
  2. on both GitHub Actions and Azure Pipelines,
  3. for daal4py (logistic regression) and onedal,
  4. across many different estimators,
  5. with different input datatypes.

I'm very worried about our malloc routines in daal. Thoughts, @david-cortes-intel @Vika-F?

@david-cortes-intel (Contributor) replied:


Perhaps there could be memory leaks, but I'm not sure that we can conclude anything from these tests alone.

I see that some are for cases where the input comes from DPCTL. Does it set flags like SYCL_PI_LEVEL_ZERO_DISABLE_USM_ALLOCATOR, for example? It might also be a better idea to set PYTHONMALLOC=malloc, and maybe LD_PRELOAD=jemalloc.so on Linux, given the way the test is structured.

But what would be even better is to run them through valgrind and asan to see whether there are any reports of leaks coming specifically from oneDAL or sklearnex. We unfortunately get a lot of false positives from numpy, scipy, pybind11, and others, which makes the logs hard to browse; but true leaks should definitely show up there.
