Huge overhead on devcloud linked to dpctl calls #945

Closed
fcharras opened this issue Mar 2, 2023 · 19 comments · Fixed by #946
Labels
user User submitted issue

Comments

@fcharras

fcharras commented Mar 2, 2023

Version: numba_dpex 0.20.0dev3 and main

The following three dpctl calls (1, 2, 3) have a huge wall time on the edge devcloud (py-spy measures each call at 10 to 30 ms; see the speedscope report): report download link

On the devcloud this adds about 80 seconds to the k-means benchmark (for an expected 10 seconds).

I didn't see the issue on a local machine, but maybe the remaining small overhead that we reported comes from there.

@oleksandr-pavlyk not sure whether this should be considered an unreasonable usage pattern in numba_dpex (are those calls expected to be this slow, and should they be cached?) or a bug in dpctl.

I've experimented with caching the values and can confirm that caching those 3 calls completely removes the overhead.

Regarding the scope of the cache, I'll check whether a hotfix that consists in storing those values in a WeakKeyDictionary keyed by val and usm_mem, and wrapping the SyclDevice(device) call in an lru_cache, is enough. (If so, I will monkey-patch it in sklearn_numba_dpex in the meantime.)
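
A rough sketch of what I have in mind (hypothetical helper names, not the actual numba_dpex internals; assumes the array objects support weak references):

```python
# Hypothetical caching helpers (names are illustrative, not numba_dpex API).
import functools
import weakref

import dpctl


@functools.lru_cache(maxsize=None)
def _cached_sycl_device(filter_string):
    # SyclDevice construction is one of the expensive calls; do it at most
    # once per filter string.
    return dpctl.SyclDevice(filter_string)


# Per-array cache; entries disappear when the array is garbage collected
# (assumes the array object supports weak references).
_device_info_cache = weakref.WeakKeyDictionary()


def _get_device_info(usm_array):
    # Compute (device, filter_string) once per array, then serve it from the
    # cache on subsequent typing calls.
    try:
        return _device_info_cache[usm_array]
    except KeyError:
        filter_string = usm_array.sycl_device.filter_string
        info = (_cached_sycl_device(filter_string), filter_string)
        _device_info_cache[usm_array] = info
        return info
```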

@ogrisel

ogrisel commented Mar 2, 2023

To avoid confusion, the py-spy report file should have a .json extension rather than .svg. The .svg extension only makes sense when py-spy is used to generate a flamegraph report as an SVG file instead of a JSON speedscope trace.

@ogrisel

ogrisel commented Mar 2, 2023

Zooming into the report, the overhead seems to come from the repeated calls to typeof_usm_ndarray:

image

but I cannot see the calls to dpctl.SyclDevice(device).

@fcharras
Author

fcharras commented Mar 2, 2023

The relevant calls to investigate here are the cells closer to the bottom: when a cell is as wide as its parent cells, it is the bottleneck.

By hovering over those cells you can see the filename and the line number; you should be able to trace them back to the 3 lines I've linked in the OP.

@fcharras
Author

fcharras commented Mar 2, 2023

@AlexanderKalistratov
Contributor

@ogrisel @fcharras could you please verify if #946 fixes the issue?

@fcharras
Author

fcharras commented Mar 3, 2023

Unfortunately, it doesn't fix it. Looking at the PR, it doesn't seem to change the instructions that lead to the time-consuming steps in the OP (namely dpctl.SyclDevice(device) and *.sycl_device.filter_string).

@fcharras
Author

fcharras commented Mar 3, 2023

The workaround I posted yesterday doesn't work either. (Currently fixing it.)

@oleksandr-pavlyk
Contributor

dpctl.SyclDevice calls the sycl::device constructor, which scores each available device and selects the one with the highest score. SyclDevice.filter_string calls sycl::get_devices() and searches for the given device in that list.

Construction of SYCL devices may thus be expensive, as the RT must talk to the hardware. numba-dpex should not be constructing the device, but rather should capture it from the instance of usm_ndarray whose type it is inferring. This is forthcoming, but I do not know the ETA.

This suggests that using SYCL_DEVICE_FILTER to limit the number of devices discoverable by RT should improve the timing.
Use sycl-ls to determine the appropriate value to set the environment variable to. For example: with SYCL_DEVICE_FILTER=level_zero:gpu:0 the runtime would only discover one level-zero GPU device.
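
For reference, a minimal Python sketch of that setup (it assumes a level-zero GPU is present, matching the example filter value; the variable must be set before dpctl initializes the runtime):

```python
# Minimal sketch: restrict device discovery before the SYCL runtime starts,
# i.e. before dpctl is imported.
import os

os.environ["SYCL_DEVICE_FILTER"] = "level_zero:gpu:0"

import dpctl

# With the filter in place the runtime only discovers the selected device,
# so device enumeration and scoring stay cheap.
print([d.filter_string for d in dpctl.get_devices()])
print(dpctl.SyclDevice().filter_string)  # default-selected device
```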

@AlexanderKalistratov
Contributor

AlexanderKalistratov commented Mar 3, 2023

@oleksandr-pavlyk

You are correct. We should extract the device from the usm_ndarray instead of creating a new one from the filter_string.
But we still have to get the filter_string since it is part of the type signature, and getting the filter_string is slow.
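
To illustrate the distinction (a minimal sketch, not numba_dpex code):

```python
import dpctl
import dpctl.tensor as dpt

x = dpt.empty(1024, dtype="float32")

# Cheap: the device is already attached to the array.
dev = x.sycl_device

# Expensive on device-rich systems: re-resolves the device through the SYCL
# runtime from its textual representation.
same_dev = dpctl.SyclDevice(x.sycl_device.filter_string)

assert dev.filter_string == same_dev.filter_string
```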

@oleksandr-pavlyk
Contributor

I would argue that the need to store filter_string as part of type signature would be rendered unnecessary once boxing/unboxing of dpctl.SyclQueue is implemented.

@mingjie-intel mingjie-intel added 1 - In Progress user User submitted issue and removed 1 - In Progress labels Mar 3, 2023
@AlexanderKalistratov
Contributor

@oleksandr-pavlyk
It doesn't matter whether it is part of the array type signature or the queue type signature.
The device must be part of the signature. We don't just need to get the queue; we need to know which device we are compiling/calling the function for. The most human-friendly way of adding it to the type signature is to use the filter_string. Alternatives are: device name, Python object id, something else?

@fcharras
Author

fcharras commented Mar 3, 2023

I've fixed the monkey-patching workaround given in a previous comment. This should work: https://github.com/soda-inria/sklearn-numba-dpex/blob/e040e78d2a5492d7b7b0ec79c2576f0df15cb9db/sklearn_numba_dpex/patches/load_numba_dpex.py#L44

(Edit: it seems to work. I'd argue that the draft caching mechanism outlined in this hack might have some value for numba_dpex if dpctl does not fix this on its side.)

@fcharras
Author

fcharras commented Mar 3, 2023

This also (almost?) entirely fixes the remaining small overhead that we noticed even on laptop iGPUs after the caching overhaul (pointed out in #886 (comment)).

So this issue is exacerbated on the Intel edge devcloud, but it is also noticeable on more ordinary hardware.

@diptorupd
Collaborator

I would argue that the need to store filter_string as part of type signature would be rendered unnecessary once boxing/unboxing of dpctl.SyclQueue is implemented.

Absolutely. #930

Using the filter string for compute-follows-data and having it as part of any type signature (DpnpNdArray or SyclQueue) is a no-go. I only did it as a stopgap under time pressure.

We need to know for which device we are compiling/calling function.

Sure, but that has nothing to do with adding it to any type signature. Moreover, it is conceivable that advanced programmers will target sub-devices and want much finer-grained control. For such cases, a filter string is not supported by SYCL.

The most human friendly form of adding it to type signature is to use filter_string

I agree, but given the performance overhead of generating a filter string it is not possible. We can perhaps add the backend and device type as string attributes for ease of reading typemaps and such. It is the generation of the device number that kills performance.

@AlexanderKalistratov
Contributor

AlexanderKalistratov commented Mar 5, 2023

@diptorupd

Sure, but that has nothing to do with adding it to any type signature.

It has. Numba caches compiled functions based on input types. Types are described by signatures, and types with equal signatures are considered equal. Not having the device in the type signature means Numba wouldn't know for which device the function should be compiled.

I agree, but given the performance overhead of generating a filter string it is not possible.

I really don't see any problem with caching the filter string for the device. You only need to generate it once for a created device. In Python (not sure about Cython) it is a single-line fix.
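
Something along these lines in pure Python (a sketch of the idea only, not the dpctl implementation, which lives in Cython; it assumes SyclDevice objects are hashable):

```python
import functools

import dpctl


@functools.lru_cache(maxsize=None)
def cached_filter_string(device: dpctl.SyclDevice) -> str:
    # Computed once per distinct device object, then served from the cache.
    return device.filter_string
```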

For such cases, a filter string is not supported by SYCL.

OK. That means we would need another human-friendly text representation of SYCL devices/sub-devices. But I really don't think that numba-dpex should be responsible for this.

@oleksandr-pavlyk
Contributor

FYI, I have added caching for filter_string property in IntelPython/dpctl#1127

@fcharras
Author

fcharras commented Apr 4, 2023

@oleksandr-pavlyk I think this is half of the fix for this issue? The remaining issue is that since the cache key is a device instance, the cache is not shared between distinct arrays or queues. Would it be possible for all arrays to share the same device instance (i.e. having id(array.sycl_device) == id(dpctl.SyclDevice(array.device.filter_string)) for all arrays) without adding any overhead to the array.sycl_device call? I was trying to monkey-patch my way to that from what is exposed to the Python interpreter, but I'm not sure it's possible now.
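
Roughly what I mean (hypothetical helpers, not existing dpctl or numba_dpex API):

```python
# Intern SyclDevice instances by filter string so every array resolves to the
# same Python object, and device-keyed caches are therefore shared.
import functools

import dpctl


@functools.lru_cache(maxsize=None)
def canonical_device(filter_string):
    return dpctl.SyclDevice(filter_string)


def device_of(usm_array):
    # Two arrays allocated on the same device get back the identical
    # (same id()) device object.
    return canonical_device(usm_array.sycl_device.filter_string)
```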

@AlexanderKalistratov
Contributor

@fcharras Could you please try #946 again? I've updated it according to your comment, and I think that with IntelPython/dpctl#1127 it should solve the problem. I'm not sure whether IntelPython/dpctl#1127 is already on the dppy/label/dev channel or not.

@fcharras
Author

fcharras commented Apr 5, 2023

I'll look more into that today and report back.
