Huge overhead on devcloud linked to dpctl calls #945

Closed
fcharras opened this issue Mar 2, 2023 · 19 comments · Fixed by #946
Labels
user User submitted issue

Comments

@fcharras

fcharras commented Mar 2, 2023

Version: numba_dpex 0.20.0dev3 and main

The following three dpctl calls (1, 2, 3) have a huge wall time on the edge devcloud (py-spy measures each call at 10 to 30 ms; see the speedscope report): report download link

On the devcloud this adds about 80 seconds to the k-means benchmark (for an expected 10 seconds).

I didn't see the issue on a local machine, but maybe the remaining small overhead that we reported comes from there.

@oleksandr-pavlyk not sure whether this should be considered an unreasonable usage pattern in numba_dpex (are those calls expected to be this slow, and should they be cached?) or a bug in dpctl.

I've experimented with caching the values and can confirm that caching those 3 calls completely removes the overhead.

Regarding the scope of the cache, I'll check whether a hotfix that consists in storing those values in a WeakKeyDictionary keyed by val and usm_mem, and wrapping the SyclDevice(device) call in an lru_cache, is enough. (If so, I will monkey-patch it in sklearn_numba_dpex in the meantime.)
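
A rough sketch of what I have in mind (hypothetical helper names, not the actual numba_dpex internals; assumes the array objects support weak references):

```python
# Hypothetical caching helpers (names are illustrative, not numba_dpex API).
import functools
import weakref

import dpctl


@functools.lru_cache(maxsize=None)
def _cached_sycl_device(filter_string):
    # SyclDevice construction is one of the expensive calls; do it at most
    # once per filter string.
    return dpctl.SyclDevice(filter_string)


# Per-array cache; entries disappear when the array is garbage collected
# (assumes the array object supports weak references).
_device_info_cache = weakref.WeakKeyDictionary()


def _get_device_info(usm_array):
    # Compute (device, filter_string) once per array, then serve it from the
    # cache on subsequent typing calls.
    try:
        return _device_info_cache[usm_array]
    except KeyError:
        filter_string = usm_array.sycl_device.filter_string
        info = (_cached_sycl_device(filter_string), filter_string)
        _device_info_cache[usm_array] = info
        return info
```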

@ogrisel

ogrisel commented Mar 2, 2023

To avoid confusion, the py-spy report file should have a .json extension rather than .svg. The .svg extension only makes sense when py-spy is used to generate a flamegraph report as an SVG file instead of a JSON speedscope trace.

@ogrisel

ogrisel commented Mar 2, 2023

Zooming into the report, the overhead seems to come from the repeated calls to typeof_usm_ndarray:

image

but I cannot see the calls to dpctl.SyclDevice(device).

@fcharras
Author

fcharras commented Mar 2, 2023

The relevant calls to investigate here are the cells closer to the bottom: when a cell is as wide as its parent cells, it is the bottleneck.

By hovering over those cells you can see the filename and the line number; you should be able to trace them back to the 3 lines I've linked in the OP.

@fcharras
Author

fcharras commented Mar 2, 2023

@AlexanderKalistratov
Contributor

@ogrisel @fcharras could you please verify if #946 fixes the issue?

@fcharras
Author

fcharras commented Mar 3, 2023

Unfortunately, it doesn't fix it. Looking at the PR, it doesn't seem to change the instructions that lead to the time-consuming steps in the OP (namely dpctl.SyclDevice(device) and *.sycl_device.filter_string).

@fcharras
Author

fcharras commented Mar 3, 2023

The workaround I posted yesterday doesn't work either. (Currently fixing it.)

@oleksandr-pavlyk
Contributor

dpctl.SyclDevice calls the sycl::device constructor, which scores each available device and selects the one with the highest score. SyclDevice.filter_string calls sycl::get_devices() and searches for the given device in that list.

Construction of SYCL devices may thus be expensive, as the RT must talk to the hardware. numba-dpex should not be constructing the device, but rather should capture it from the instance of usm_ndarray whose type it is inferring. This is forthcoming, but I do not know the ETA.

This suggests that using SYCL_DEVICE_FILTER to limit the number of devices discoverable by RT should improve the timing.
Use sycl-ls to determine the appropriate value to set the environment variable to. For example: with SYCL_DEVICE_FILTER=level_zero:gpu:0 the runtime would only discover one level-zero GPU device.
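
For reference, a minimal Python sketch of that setup (it assumes a level-zero GPU is present, matching the example filter value; the variable must be set before dpctl initializes the runtime):

```python
# Minimal sketch: restrict device discovery before the SYCL runtime starts,
# i.e. before dpctl is imported.
import os

os.environ["SYCL_DEVICE_FILTER"] = "level_zero:gpu:0"

import dpctl

# With the filter in place the runtime only discovers the selected device,
# so device enumeration and scoring stay cheap.
print([d.filter_string for d in dpctl.get_devices()])
print(dpctl.SyclDevice().filter_string)  # default-selected device
```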

@AlexanderKalistratov
Contributor

AlexanderKalistratov commented Mar 3, 2023

@oleksandr-pavlyk

You are correct. We should extract the device from the usm_ndarray instead of creating a new one from the filter_string.
But we still have to get the filter_string since it is part of the type signature, and getting the filter_string is slow.
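
To illustrate the distinction (a minimal sketch, not numba_dpex code):

```python
import dpctl
import dpctl.tensor as dpt

x = dpt.empty(1024, dtype="float32")

# Cheap: the device is already attached to the array.
dev = x.sycl_device

# Expensive on device-rich systems: re-resolves the device through the SYCL
# runtime from its textual representation.
same_dev = dpctl.SyclDevice(x.sycl_device.filter_string)

assert dev.filter_string == same_dev.filter_string
```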

@oleksandr-pavlyk
Contributor

I would argue that the need to store filter_string as part of type signature would be rendered unnecessary once boxing/unboxing of dpctl.SyclQueue is implemented.

@mingjie-intel mingjie-intel added 1 - In Progress user User submitted issue and removed 1 - In Progress labels Mar 3, 2023
@AlexanderKalistratov
Contributor

@oleksandr-pavlyk
It doesn't matter whether it is part of the array type signature or the queue type signature.
The device must be part of the signature. We don't just need to get the queue; we need to know which device we are compiling/calling the function for. The most human-friendly way of adding it to the type signature is to use the filter_string. Alternatives are: device name, Python object id, something else?

@fcharras
Author

fcharras commented Mar 3, 2023

I've fixed the monkey-patching workaround given in a previous comment. This should work: https://github.com/soda-inria/sklearn-numba-dpex/blob/e040e78d2a5492d7b7b0ec79c2576f0df15cb9db/sklearn_numba_dpex/patches/load_numba_dpex.py#L44

(Edit: it seems to work. I'd argue that the draft caching mechanism outlined in this hack might have some value for numba_dpex if dpctl does not fix this on its side.)

@fcharras
Author

fcharras commented Mar 3, 2023

This also (almost?) entirely fixes the remaining small overhead that we noticed even on laptop iGPUs after the caching overhaul (pointed out in #886 (comment)).

So this issue is exacerbated on the Intel edge devcloud, but it is also noticeable on more ordinary hardware.

@diptorupd
Collaborator

I would argue that the need to store filter_string as part of type signature would be rendered unnecessary once boxing/unboxing of dpctl.SyclQueue is implemented.

Absolutely. #930

Using the filter string for compute-follows-data and having it as part of any type signature (DpnpNdArray or SyclQueue) is a no-go. I only did it as a stopgap under time pressure.

We need to know for which device we are compiling/calling function.

Sure, but that has nothing to do with adding it to any type signature. Moreover, it is conceivable that advanced programmers will target sub-devices and want much finer-grained control. For such cases, a filter string is not supported by SYCL.

The most human friendly form of adding it to type signature is to use filter_string

I agree, but given the performance overhead of generating a filter string it is not possible. We can perhaps add the backend and device type as string attributes for ease of reading typemaps and such. It is the generation of the device number that kills performance.

@AlexanderKalistratov
Contributor

AlexanderKalistratov commented Mar 5, 2023

@diptorupd

Sure, but that has nothing to do with adding it to any type signature.

It has. Numba caches compiled functions based on input types. Types are described by signatures, and types with equal signatures are considered equal. Not having the device in the type signature means Numba wouldn't know for which device the function should be compiled.

I agree, but given the performance overhead of generating a filter string it is not possible.

I really don't see any problem with caching the filter string for the device. You only need to generate it once for a created device. In Python (not sure about Cython) it is a single-line fix.
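
Something along these lines in pure Python (a sketch of the idea only, not the dpctl implementation, which lives in Cython; it assumes SyclDevice objects are hashable):

```python
import functools

import dpctl


@functools.lru_cache(maxsize=None)
def cached_filter_string(device: dpctl.SyclDevice) -> str:
    # Computed once per distinct device object, then served from the cache.
    return device.filter_string
```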

For such cases, a filter string is not supported by SYCL.

OK. That means we would need another human-friendly text representation of SYCL devices/sub-devices. But I really don't think that numba-dpex should be responsible for this.

@oleksandr-pavlyk
Contributor

FYI, I have added caching for filter_string property in IntelPython/dpctl#1127

@fcharras
Author

fcharras commented Apr 4, 2023

@oleksandr-pavlyk I think this is half of the fix for this issue? The remaining issue is that since the cache key is a device instance, the cache is not shared between distinct arrays or queues. Would it be possible for all arrays to share the same device instance (i.e. having id(array.sycl_device) == id(dpctl.SyclDevice(array.device.filter_string)) for all arrays) without adding any overhead to the array.sycl_device call? I was trying to monkey-patch my way to that from what is exposed to the Python interpreter, but I'm not sure it's possible now.
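
Roughly what I mean (hypothetical helpers, not existing dpctl or numba_dpex API):

```python
# Intern SyclDevice instances by filter string so every array resolves to the
# same Python object, and device-keyed caches are therefore shared.
import functools

import dpctl


@functools.lru_cache(maxsize=None)
def canonical_device(filter_string):
    return dpctl.SyclDevice(filter_string)


def device_of(usm_array):
    # Two arrays allocated on the same device get back the identical
    # (same id()) device object.
    return canonical_device(usm_array.sycl_device.filter_string)
```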

@AlexanderKalistratov
Contributor

@fcharras Could you please try #946 again? I've updated it according to your comment, and I think that with IntelPython/dpctl#1127 it should solve the problem. I'm not sure whether IntelPython/dpctl#1127 is already on the dppy/label/dev channel or not.

@fcharras
Author

fcharras commented Apr 5, 2023

I'll look more into that today and report back.
