Skip to content
Permalink
Browse files
RFC: drm/amdgpu: Implement a proper implicit fencing uapi
WARNING: Absolutely untested beyond "gcc isn't dying in agony".

Implicit fencing done properly needs to treat the implicit fencing
slots like a funny kind of IPC mailbox. In other words it needs to be
explicitly. This is the only way it will mesh well with explicit
fencing userspace like vk, and it's also the bare minimum required to
be able to manage anything else that wants to use the same buffer on
multiple engines in parallel, and still be able to share it through
implicit sync.

amdgpu completely lacks such an uapi. Fix this.

Luckily the concept of ignoring implicit fences exists already, and
takes care of all the complexities of making sure that non-optional
fences (like bo moves) are not ignored. This support was added in

commit 177ae09
Author: Andres Rodriguez <andresx7@gmail.com>
Date:   Fri Sep 15 20:44:06 2017 -0400

    drm/amdgpu: introduce AMDGPU_GEM_CREATE_EXPLICIT_SYNC v2

Unfortuantely it's the wrong semantics, because it's a bo flag and
disables implicit sync on an allocated buffer completely.

We _do_ want implicit sync, but control it explicitly. For this we
need a flag on the drm_file, so that a given userspace (like vulkan)
can manage the implicit sync slots explicitly. The other side of the
pipeline (compositor, other process or just different stage in a media
pipeline in the same process) can then either do the same, or fully
participate in the implicit sync as implemented by the kernel by
default.

By building on the existing flag for buffers we avoid any issues with
opening up additional security concerns - anything this new flag here
allows is already.

All drivers which supports this concept of a userspace-specific
opt-out of implicit sync have a flag in their CS ioctl, but in reality
that turned out to be a bit too inflexible. See the discussion below,
let's try to do a bit better for amdgpu.

This alone only allows us to completely avoid any stalls due to
implicit sync, it does not yet allow us to use implicit sync as a
strange form of IPC for sync_file.

For that we need two more pieces:

- a way to get the current implicit sync fences out of a buffer. Could
  be done in a driver ioctl, but everyone needs this, and generally a
  dma-buf is involved anyway to establish the sharing. So an ioctl on
  the dma-buf makes a ton more sense:

  https://lore.kernel.org/dri-devel/20210520190007.534046-4-jason@jlekstrand.net/

  Current drivers in upstream solves this by having the opt-out flag
  on their CS ioctl. This has the downside that very often the CS
  which must actually stall for the implicit fence is run a while
  after the implicit fence point was logically sampled per the api
  spec (vk passes an explicit syncobj around for that afaiui), and so
  results in oversync. Converting the implicit sync fences into a
  snap-shot sync_file is actually accurate.

- Simillar we need to be able to set the exclusive implicit fence.
  Current drivers again do this with a CS ioctl flag, with again the
  same problems that the time the CS happens additional dependencies
  have been added. An explicit ioctl to only insert a sync_file (while
  respecting the rules for how exclusive and shared fence slots must
  be update in struct dma_resv) is much better. This is proposed here:

  https://lore.kernel.org/dri-devel/20210520190007.534046-5-jason@jlekstrand.net/

These three pieces together allow userspace to fully control implicit
fencing and remove all unecessary stall points due to them.

Well, as much as the implicit fencing model fundamentally allows:
There is only one set of fences, you can only choose to sync against
only writers (exclusive slot), or everyone. Hence suballocating
multiple buffers or anything else like this is fundamentally not
possible, and can only be fixed by a proper explicit fencing model.

Aside from that caveat this model gets implicit fencing as closely to
explicit fencing semantics as possible:

On the actual implementation I opted for a simple setparam ioctl, no
locking (just atomic reads/writes) for simplicity. There is a nice
flag parameter in the VM ioctl which we could use, except:
- it's not checked, so userspace likely passes garbage
- there's already a comment that userspace _does_ pass garbage in the
  priority field
So yeah unfortunately this flag parameter for setting vm flags is
useless, and we need to hack up a new one.

v2: Explain why a new SETPARAM (Jason)

v3: Bas noticed I forgot to hook up the dependency-side shortcut. We
need both, or this doesn't do much.

v4: Rebase over the amdgpu patch to always set the implicit sync
fences.

Cc: mesa-dev@lists.freedesktop.org
Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Rob Clark <robdclark@chromium.org>
Cc: Kristian H. Kristensen <hoegsberg@google.com>
Cc: Michel Dänzer <michel@daenzer.net>
Cc: Daniel Stone <daniels@collabora.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Deepak R Varma <mh12gx2825@gmail.com>
Cc: Chen Li <chenli@uniontech.com>
Cc: Kevin Wang <kevin1.wang@amd.com>
Cc: Dennis Li <Dennis.Li@amd.com>
Cc: Luben Tuikov <luben.tuikov@amd.com>
Cc: linaro-mm-sig@lists.linaro.org
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
  • Loading branch information
danvet authored and intel-lab-lkp committed Jun 22, 2021
1 parent 9fe6cb9 commit 42de2bd7635cf7c6d79494a3a35512c53196524f
Show file tree
Hide file tree
Showing 4 changed files with 42 additions and 2 deletions.
@@ -498,6 +498,7 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
struct amdgpu_bo *gds;
struct amdgpu_bo *gws;
struct amdgpu_bo *oa;
bool no_implicit_sync = READ_ONCE(fpriv->vm.no_implicit_sync);
int r;

INIT_LIST_HEAD(&p->validated);
@@ -577,7 +578,8 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,

e->bo_va = amdgpu_vm_bo_find(vm, bo);

if (bo->tbo.base.dma_buf && !amdgpu_bo_explicit_sync(bo)) {
if (bo->tbo.base.dma_buf &&
!(no_implicit_sync || amdgpu_bo_explicit_sync(bo))) {
e->chain = dma_fence_chain_alloc();
if (!e->chain) {
r = -ENOMEM;
@@ -649,14 +651,15 @@ static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
{
struct amdgpu_fpriv *fpriv = p->filp->driver_priv;
struct amdgpu_bo_list_entry *e;
bool no_implicit_sync = READ_ONCE(fpriv->vm.no_implicit_sync);
int r;

list_for_each_entry(e, &p->validated, tv.head) {
struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);
struct dma_resv *resv = bo->tbo.base.resv;
enum amdgpu_sync_mode sync_mode;

sync_mode = amdgpu_bo_explicit_sync(bo) ?
sync_mode = no_implicit_sync || amdgpu_bo_explicit_sync(bo) ?
AMDGPU_SYNC_EXPLICIT : AMDGPU_SYNC_NE_OWNER;
r = amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode,
&fpriv->vm);
@@ -1724,6 +1724,26 @@ int amdgpu_file_to_fpriv(struct file *filp, struct amdgpu_fpriv **fpriv)
return 0;
}

int amdgpu_setparam_ioctl(struct drm_device *dev, void *data,
struct drm_file *filp)
{
struct drm_amdgpu_setparam *setparam = data;
struct amdgpu_fpriv *fpriv = filp->driver_priv;

switch (setparam->param) {
case AMDGPU_SETPARAM_NO_IMPLICIT_SYNC:
if (setparam->value)
WRITE_ONCE(fpriv->vm.no_implicit_sync, true);
else
WRITE_ONCE(fpriv->vm.no_implicit_sync, false);
break;
default:
return -EINVAL;
}

return 0;
}

const struct drm_ioctl_desc amdgpu_ioctls_kms[] = {
DRM_IOCTL_DEF_DRV(AMDGPU_GEM_CREATE, amdgpu_gem_create_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
DRM_IOCTL_DEF_DRV(AMDGPU_CTX, amdgpu_ctx_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
@@ -1742,6 +1762,7 @@ const struct drm_ioctl_desc amdgpu_ioctls_kms[] = {
DRM_IOCTL_DEF_DRV(AMDGPU_GEM_VA, amdgpu_gem_va_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
DRM_IOCTL_DEF_DRV(AMDGPU_GEM_OP, amdgpu_gem_op_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
DRM_IOCTL_DEF_DRV(AMDGPU_GEM_USERPTR, amdgpu_gem_userptr_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
DRM_IOCTL_DEF_DRV(AMDGPU_SETPARAM, amdgpu_setparam_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
};

static const struct drm_driver amdgpu_kms_driver = {
@@ -321,6 +321,12 @@ struct amdgpu_vm {
bool bulk_moveable;
/* Flag to indicate if VM is used for compute */
bool is_compute_context;
/*
* Flag to indicate whether implicit sync should always be skipped on
* this context. We do not care about races at all, userspace is allowed
* to shoot itself with implicit sync to its fullest liking.
*/
bool no_implicit_sync;
};

struct amdgpu_vm_manager {
@@ -54,6 +54,7 @@ extern "C" {
#define DRM_AMDGPU_VM 0x13
#define DRM_AMDGPU_FENCE_TO_HANDLE 0x14
#define DRM_AMDGPU_SCHED 0x15
#define DRM_AMDGPU_SETPARAM 0x16

#define DRM_IOCTL_AMDGPU_GEM_CREATE DRM_IOWR(DRM_COMMAND_BASE + DRM_AMDGPU_GEM_CREATE, union drm_amdgpu_gem_create)
#define DRM_IOCTL_AMDGPU_GEM_MMAP DRM_IOWR(DRM_COMMAND_BASE + DRM_AMDGPU_GEM_MMAP, union drm_amdgpu_gem_mmap)
@@ -71,6 +72,7 @@ extern "C" {
#define DRM_IOCTL_AMDGPU_VM DRM_IOWR(DRM_COMMAND_BASE + DRM_AMDGPU_VM, union drm_amdgpu_vm)
#define DRM_IOCTL_AMDGPU_FENCE_TO_HANDLE DRM_IOWR(DRM_COMMAND_BASE + DRM_AMDGPU_FENCE_TO_HANDLE, union drm_amdgpu_fence_to_handle)
#define DRM_IOCTL_AMDGPU_SCHED DRM_IOW(DRM_COMMAND_BASE + DRM_AMDGPU_SCHED, union drm_amdgpu_sched)
#define DRM_IOCTL_AMDGPU_SETPARAM DRM_IOW(DRM_COMMAND_BASE + DRM_AMDGPU_SETPARAM, struct drm_amdgpu_setparam)

/**
* DOC: memory domains
@@ -306,6 +308,14 @@ union drm_amdgpu_sched {
struct drm_amdgpu_sched_in in;
};

#define AMDGPU_SETPARAM_NO_IMPLICIT_SYNC 1

struct drm_amdgpu_setparam {
/* AMDGPU_SETPARAM_* */
__u32 param;
__u32 value;
};

/*
* This is not a reliable API and you should expect it to fail for any
* number of reasons and have fallback path that do not use userptr to

0 comments on commit 42de2bd

Please sign in to comment.