Skip to content

Conversation

@maikel
Copy link
Collaborator

@maikel maikel commented Jun 16, 2023

Using the stopping facade writes currently the wrong pointer into the IORING_OP_ASYNC_CANCEL operation. This PR fixes this by passing the pointer to __task which is used for the original submission.

Its still a draft because I stlil want to think about a cleaner solution

@rapids-bot
Copy link

rapids-bot bot commented Jun 16, 2023

Pull requests from external contributors require approval from a NVIDIA organization member with write permissions or greater before CI can begin.

@ericniebler
Copy link
Collaborator

/ok to test

})));
auto end = std::chrono::steady_clock::now();
auto diff = end - start;
CHECK(diff < 5ms);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The failure of this test perhaps indicates that work scheduled to run in the future on the io_uring scheduler doesn't complete promptly when it has been cancelled.

Copy link
Collaborator Author

@maikel maikel Jun 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, it might be as you say. I will look into it asap. The failing test uses kernel 5.4 and the implementation is not touched by this pr (yet) because I only changed the IORING_OP_ASYNC_CANCEL code path, which requires kernel 5.5 and newer.

@ericniebler
Copy link
Collaborator

/ok to test

@ericniebler
Copy link
Collaborator

It doesn't seem to be a timing issue.

@ericniebler
Copy link
Collaborator

/ok to test

@ericniebler
Copy link
Collaborator

/ok to test

@maikel
Copy link
Collaborator Author

maikel commented Jun 25, 2023

I could finally reproduce the issue. It is because the container image is based on Ubuntu-2204 but the GPU host runner is probably on an older system with a Kernel 5.4 (I assume ubuntu-2004).

I could reproduce the error in a Ubuntu-2004 VM where I compiled and ran the tests in an Ubuntu-2204 container.

We have implemented two strategies for cancelation based on the kernel version. But since this is a header-only solution the detection of the kernel version works via defines in /usr/include/linux/version.h which reports a newer kernel version than what is actually being used.

The question is how to solve this now. I know of following solutions

  1. Generate a cmake configure file with the actual detected kernel version of the host system and use this detection
  2. Make the cancelation strategy a run-time decision instead of a build-time decision by detecting the factual kernel-version in run-time
  3. Improve the CI by matching Host kernel and kernel headers within the image.

any thoughts?

@trxcllnt @ericniebler

@ericniebler
Copy link
Collaborator

  1. Improve the CI by matching Host kernel and kernel headers within the image.

This would be my preferred solution. I don't think we're under any obligation to support Franken-distros. @trxcllnt, any clue what could have led to this situation?

@maikel
Copy link
Collaborator Author

maikel commented Jun 25, 2023

  1. Improve the CI by matching Host kernel and kernel headers within the image.

This would be my preferred solution. I don't think we're under any obligation to support Franken-distros. @trxcllnt, any clue what could have led to this situation?

I want to side note that up until recently the default github action runner were based on ubuntu-20.04 too.

If upgrading the runners host system is too complicated because of other dependencies the test should also be OK with a container image based on 20.04. Or at least faking the Linux header files

@maikel maikel marked this pull request as ready for review June 26, 2023 05:24
@ericniebler
Copy link
Collaborator

Paul makes the valid point that if code is compiled on one kernel version and run on another, it should still work. That argues for runtime detection.

@maikel
Copy link
Collaborator Author

maikel commented Jun 27, 2023

Paul makes the valid point that if code is compiled on one kernel version and run on another, it should still work. That argues for runtime detection.

We have compatibilty for newer kernel versions. i.e. if we compile an application with kernel version 5.4 it will also run correctly on kernel 5.15. If I build my application against kernel version 5.15 and use features of this newer kernel I don't really expect my application to work correctly on an older kernel.

I will add the runtime check nonetheless. But it will pessimize the size of the timed schedule operations, because the solution that uses timerfd has more state.

io_uring itself provides a query for supported op codes but this is officially available since kernel version 5.6. Since Ubuntu-20.04 ships with kernel version 5.4 the only way to test for cancelation support is to make a dummy cancel request and check the error code for -EINVAL. I will do so in the constructor of io_uring_context.

@maikel
Copy link
Collaborator Author

maikel commented Jun 27, 2023

How about this: We require a kernel version of 5.5 (which introduces IORING_OP_ASYNC_CANCEL) or higher for io_uring support and if we detect an older version within CMake we shouldn't build the io_uring tests? I could completely remove the timerfd workaround.

@ericniebler
Copy link
Collaborator

/ok to test

@ericniebler
Copy link
Collaborator

/ok to test

@maikel
Copy link
Collaborator Author

maikel commented Jun 28, 2023

There was something weird going on for nvc++ for the run-time detection. I've reverted those changes and disabled the tests for the GPU CI. We just need to remember to re-enable the tests whenever the host machine of the GPU runner gets a newer kernel version.

@ericniebler
Copy link
Collaborator

/ok to test

@ericniebler
Copy link
Collaborator

/ok to test

@ericniebler ericniebler merged commit 4667eb2 into NVIDIA:main Jun 28, 2023
@maikel
Copy link
Collaborator Author

maikel commented Jun 28, 2023

Thank you!

@maikel maikel deleted the io-uring-fix-stop-facade branch June 28, 2023 17:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants