Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory barriers on TOP_OF_PIPE #128

Closed
philiptaylor opened this Issue Mar 7, 2016 · 30 comments

Comments

Projects
None yet
8 participants
@philiptaylor
Copy link
Contributor

philiptaylor commented Mar 7, 2016

Imagination's VkBonjour sample has this code in create_and_load_texture:

// [...write to buffer on host...]

//Transition the buffer from host write to transfer read
bufferMemoryBarrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
bufferMemoryBarrier.pNext = NULL;
bufferMemoryBarrier.srcAccessMask = VK_ACCESS_HOST_WRITE_BIT;
bufferMemoryBarrier.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;
bufferMemoryBarrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
bufferMemoryBarrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
bufferMemoryBarrier.buffer = srcBuffer;
bufferMemoryBarrier.offset = 0;
bufferMemoryBarrier.size = memoryRequirements.size;

// [...]

vkCmdPipelineBarrier(..., VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, VK_PIPELINE_STAGE_TRANSFER_BIT, ..., 1, &bufferMemoryBarrier, ...);

// [...]

vkCmdCopyBufferToImage(...);

// [...submit command buffer...]

with the intention of making the host write available for the vkCmdCopyBufferToImage command.

(Ignore that there's an implicit host-write memory barrier that occurs on command buffer submission - the interesting question is whether this explicit barrier by itself is sufficient.)

Per discussion on IRC, mcw thought this was correct ("Our understanding of that is TOP_OF_PIPE means flush/invalidate all caches"), but I think it's wrong.

My reasoning is:

  • I think memory barriers essentially mean "make a write of a given type from a given stage available to subsequent commands" (i.e. flush that stage's cache), and "make a write of a given type from previous commands visible to a given stage" (edit: correction) "make any available writes from previous commands visible to accesses of a given type in a given stage" (i.e. invalidate that stage's cache) (where "subsequent/previous commands" are defined by execution dependency chains). In particularly they only apply to the given stage, not any earlier or later stages (for whatever definition of 'earlier' or 'later' you want).
  • STAGE_TOP_OF_PIPE does not perform any reads or writes by itself, and does not have any caches. A memory barrier on that stage is therefore entirely meaningless.
  • The spec has a note for STAGE_BOTTOM_OF_PIPE saying "when defining a memory dependency, [...] using only VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT would not make any accesses available and/or visible because this stage doesn’t access memory". That seems consistent with my understanding, and the same should apply to STAGE_TOP_OF_PIPE because it's a symmetrical case.
  • The spec also says "For host writes to be seen by subsequent command buffer operations, a pipeline barrier from a source of VK_ACCESS_HOST_WRITE_BIT and VK_PIPELINE_STAGE_HOST_BIT to a destination of the relevant device pipeline stages and access types must be performed", which is a little imprecise (it doesn't even mention "memory barrier") but seems to indicate that you really do need to create a memory barrier on that specific stage.

Given the disagreement, it would be helpful for the spec to clarify this.

@NicolBolas

This comment has been minimized.

Copy link

NicolBolas commented Mar 7, 2016

Given the disagreement, it would be helpful for the spec to clarify this.

I disagree.

The specification is very clear that TOP_OF_PIPE is not intended to be equivalent to "everything". That's why we have "everything" bitflags. Chapter 6.4 makes it clear that memory dependencies make the particular writing stage operations visible to the particular reading stages.

It seems very clear to me from the spec that, since the source of the operation is a host write, the source stage ought to be VK_PIPELINE_STAGE_HOST_BIT. TOP_OF_PIPE did not issue a host write, and cannot issue a host write. So the dependency as written makes no sense.

The fact that there is disagreement doesn't mean that the spec is unclear. It means that someone didn't actually read it. Perhaps they read an earlier draft and got confused. But the spec language of chapter 6.4 couldn't be more clear about how things work.

@ghost

This comment has been minimized.

Copy link

ghost commented Mar 8, 2016

@NicolBolas I think there's some ambiguity in the spec. It's not necessarily that the information isn't there, it's more that it's a little scattered and hard to read. Whilst the definition of TOP_OF_PIPE does seem clear, there's other language in various places that's quite difficult to grok. For instance, trying to figure out this bug, I tripped up on the section about execution dependency chains.

IMO there are just too many words to describe what shouldn't be hugely complicated things. It's worth at least reviewing the chapter to see what can be changed/trimmed to make it clearer.

Having said all that, you're right, we should definitely be using VK_PIPELINE_STAGE_HOST_BIT instead of TOP_OF_PIPE in VkBonjour. I'll raise a bug internally to take a look at the spec and see if we can do something better.

@ghost

This comment has been minimized.

Copy link

ghost commented Mar 8, 2016

NB: Note that many of us are going to be at GDC next week, so this might take a little while to make any progress on this - ditto on many other bugs unfortunately!

@jeffbolznv

This comment has been minimized.

Copy link
Contributor

jeffbolznv commented Mar 8, 2016

IMO this issue mostly amounts to an application bug (PIPELINE_STAGE_HOST_BIT is indeed the correct stage to use for host reads/writes, not TOP_OF_PIPE_BIT). I'm not seeing any clear actions in the bug.

IMO there are just too many words to describe what shouldn't be hugely complicated things.

The only example you give is for execution dependency chains, and in that case the large number of words is the price of having a precise definition.

@ghost

This comment has been minimized.

Copy link

ghost commented Mar 8, 2016

@jeffbolznv

IMO this issue mostly amounts to an application bug (PIPELINE_STAGE_HOST_BIT is indeed the correct stage to use for host reads/writes, not TOP_OF_PIPE_BIT). I'm not seeing any clear actions in the bug.

I was going on this bit of the original issue:

Given the disagreement, it would be helpful for the spec to clarify this.

There's also a lot of offline discussion on IRC which isn't captured here, but apparently nobody was 100% sure of what the right answer was offline - hence coming here.

The only example you give is for execution dependency chain

I've only mentioned one or two bits here, but I kept going down into the rabbit hole whilst I was looking into this. See issue #131 for more of this, and I've spotted other bits. I agree that we need to be precise, but IMO it's more complex than it needs to be - otherwise we wouldn't be having these discussions. I'm not guaranteeing anything here - just that I would like to see if there's any way we can make things clearer.

in that case the large number of words is the price of having a precise definition.

To be clear, I'm not specifically targeting this section as a place to remove words - actually for most of that I'm looking at some of the non-normative language, like the note at the end of section 6.2. There's also some unclear (or outright incorrect) example code in the WSI sections for this that needs looking at.

@philiptaylor

This comment has been minimized.

Copy link
Contributor Author

philiptaylor commented Mar 10, 2016

Incidentally, I think https://github.com/KhronosGroup/Vulkan-LoaderAndValidationLayers/blob/master/demos/tri.c has the same problem:

    VkPipelineStageFlags src_stages = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;
    VkPipelineStageFlags dest_stages = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;

    vkCmdPipelineBarrier(demo->setup_cmd, src_stages, dest_stages, 0, 0, NULL,
                         0, NULL, 1, pmemory_barrier);

and

    vkCmdPipelineBarrier(demo->draw_cmd, VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
                         VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, 0, 0, NULL, 0,
                         NULL, 1, pmemory_barrier);

Those are both trying to place memory barriers on the TOP or BOTTOM stages, which I believe will not have the desired effect (i.e. won't have any effect).

Same in https://github.com/nvpro-samples/gl_vk_chopper/blob/master/VkeCreateUtils.cpp#L677 too.

They're application bugs - but as apparently nobody can even write a correct hello-world application, and presumably the authors read the specification and are competent, it seems the spec is failing to communicate its intentions adequately.

@philiptaylor

This comment has been minimized.

Copy link
Contributor Author

philiptaylor commented Mar 16, 2016

It's not really necessary to point at more examples, but it is still fun, so I'll add that the Mali Vulkan SDK gets it wrong too. E.g. hellotriangle says:

    // Prepare image for presentation.
    VkImageMemoryBarrier preparePresentation = { VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER };
    // Flush caches associated with color attachments.
    preparePresentation.srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
    // And invalidate caches for general memory reads. Presentation engine uses this access mask.
    preparePresentation.dstAccessMask = VK_ACCESS_MEMORY_READ_BIT;
    ...
    // Complete all graphics stages before making the memory barrier,
    // but don't block any other shaders from running, so leave dstStageMask as VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT.
    vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_ALL_GRAPHICS_BIT, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, false, 0,
                         nullptr, 0, nullptr, 1, &preparePresentation);

(memory barrier on BOTTOM_OF_PIPE won't work), and spinning_cube says

    void imageMemoryBarrier(VkCommandBuffer cmd, VkImage image, VkAccessFlags srcAccessMask,
                            VkAccessFlags dstAccessMask, VkPipelineStageFlags srcStageMask,
                            VkPipelineStageFlags dstStageMask, VkImageLayout oldLayout, VkImageLayout newLayout,
                            VkImageAspectFlags aspect);
...
    imageMemoryBarrier(cmd, image, 0, VK_ACCESS_TRANSFER_WRITE_BIT, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
                       VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, VK_IMAGE_LAYOUT_UNDEFINED,
                       VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, VK_IMAGE_ASPECT_COLOR_BIT);
...
    imageMemoryBarrier(cmd, image, VK_ACCESS_TRANSFER_WRITE_BIT, VK_ACCESS_SHADER_READ_BIT,
                       VK_PIPELINE_STAGE_TRANSFER_BIT, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
                       VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
                       VK_IMAGE_ASPECT_COLOR_BIT);

(memory barriers on TOP_OF_PIPE won't work).

@NicolBolas

This comment has been minimized.

Copy link

NicolBolas commented Mar 16, 2016

Since top/bottom of the pipe make no sense for memory barriers, maybe using them should just be flat-out invalid. And thus useful only for execution barriers.

@philiptaylor

This comment has been minimized.

Copy link
Contributor Author

philiptaylor commented Mar 16, 2016

(I looked at the samples in the Adreno Vulkan SDK too. To my enormous lack of surprise, they consistently got this wrong and put memory barriers on TOP_OF_PIPE and BOTTOM_OF_PIPE, just like everybody else.)

maybe using them should just be flat-out invalid.

Yeah, or it might be just as good if a validation layer emitted a warning if StageMask = (TOP|BOTTOM)_OF_PIPE and AccessMask != 0, because it's always redundant and almost certainly a mistake even if it's not actually invalid. (Then ideally the validation layer would be smart enough to emit a warning about the memory dependency that you really needed but omitted.) (But then there's the problem that I think it's impossible to understand the spec well enough to implement a correct validation layer for its rules, per #132.)

@HansKristian-Work

This comment has been minimized.

Copy link

HansKristian-Work commented Mar 18, 2016

(I'm the author of the Mali SDK sample in question)
I'm a bit surprised about this. We clearly need an ImageMemoryBarrier to be able to perform the layout transition, but how can we do an image transition without knowing srcAccess?

Is dstAccessMask supposed to be 0 here (but srcAccessMask COLOR_ATTACHMENT)?
Signalling the WSI semaphore will implicitly make all memory visible, so technically it would make sense, but the spec itself does very clearly say:

Before an application can present an image, the image’s layout must be transitioned to the VK_IMAGE_LAYOUT_
PRESENT_SRC_KHR layout. The vkCmdWaitEvents or vkCmdPipelineBarrier that perform the transition
must have srcStageMask and srcAccessMask parameters set based on the preceding use of the image. The
dstAccessMask must include VK_ACCESS_MEMORY_READ_BIT indicating all prior accesses indicated in
srcAccessMask from stages in srcStageMask are to be made available to reads by the presentation engine. Any value
of dstStageMask is valid, but should be set to VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT to avoid delaying
subsequent commands that don’t access the image.

So it seems like it's this apparent misunderstanding that caused all these "hellotriangle" samples to get it "wrong". I know I read this very paragraph and tried to do exactly what the spec said.

I don't think the spec communicates clearly enough that it synchronizes memory access only for the exact pipeline stages mentioned with memory access in only the very particular subsequent stages.

@HansKristian-Work

This comment has been minimized.

Copy link

HansKristian-Work commented Mar 19, 2016

Ok, I found in the spec that it does specify this:

The first half of the dependency makes memory accesses using the set of access
types in srcAccessMask performed in pipeline stages in srcStageMask by the first set of commands complete and
writes be available for subsequent commands. The second half of the dependency makes any available writes from
previous commands visible to pipeline stages in dstStageMask using the set of access types in dstAccessMask for the
second set of commands, if those writes have been made available with the first half of the same or a previous
dependency.

It seems like the segment I mentioned earlier should be fixed to not make misleading memory barriers.

I'm not sure then how to express the write-after-read hazard that happens in WSI acquire. Spec again suggests this:

srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
• srcAccessMask = VK_ACCESS_MEMORY_READ_BIT
• dstStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
• dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_READ_BIT | VK_ACCESS_COLOR_ATTACH
MENT_WRITE_BIT.
• oldLayout = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR
• newLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL

With new eyes, the MEMORY_READ_BIT seems very odd here since COLOR_ATTACHMENT_OUTPUT is certainly not doing any MEMORY_READ (the presentation engine does, and there's no "WSI pipeline stage").

Another point here is that it probably doesn't make sense to have src/dstAccessMask be 0 while dst/srcAccessMask is != 0. It doesn't seem logical to try to synchronize some memory access against nothing, which is what the solution to the BOTTOM_OF_PIPE would be in this case.

@philiptaylor

This comment has been minimized.

Copy link
Contributor Author

philiptaylor commented Mar 19, 2016

We clearly need an ImageMemoryBarrier to be able to perform the layout transition, but how can we do an image transition without knowing srcAccess?

If you know that some stage will have written to the image and not flushed its caches yet, but you have no idea which stage, I think you have to do a memory barrier with srcStageMask = VK_PIPELINE_STAGE_ALL_COMMANDS_BIT and maybe srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT (except I think the spec is very unclear about whether MEMORY_WRITE also includes stages that have more specific types; if not then you might need srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT|VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT|VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT|VK_ACCESS_TRANSFER_WRITE_BIT|VK_ACCESS_MEMORY_WRITE_BIT). (And if the last write might have been from the host and not made available yet, you'd need to add that in too.)

(That seems really ugly, it's probably much nicer if you can make the application know what stage last wrote to the memory so it knows exactly what needs to be flushed. I don't really know anything about engine design, but if the parts of the engine are too decoupled for that, maybe it's okay to e.g. have a convention that rendering command buffers are allowed to leave framebuffers dirty in STAGE_COLOR_ATTACHMENT_OUTPUT+ACCESS_COLOR_ATTACHMENT_WRITE, so the presentation command buffer only ever needs to flush that stage+access; and if a rendering command buffer wants to do something more complicated with the framebuffer (like run a compute shader over it), it's responsible for flushing any other stages and access types itself before the presentation command buffer will see that framebuffer.)

Signalling the WSI semaphore will implicitly make all memory visible

The spec says "The side effects of these commands are available to any commands or sparse binding operations (on any queue) that follow a semaphore wait". I don't think that's quite the right terminology (writes are made available from specific commands (/specific stages of specific commands), and made visible to specific commands, but they're not made available to specific commands/stages), and presentation is not a command or sparse binding operation anyway, so the effect of the semaphore on memory seems unclear here.

I think the only thing that makes sense with the current API is to assume the presentation engine is essentially uncached, so you never need to explicitly make writes visible to it - you simply need to make the writes available from whatever stage was doing the writing, then have an execution dependency (e.g. provided by the semaphore) to make sure the presentation engine waits until the making-available pipeline barrier has completed.

I'm not sure then how to express the write-after-read hazard that happens in WSI acquire

Write-after-read shouldn't need memory dependencies. The stage that's doing the writing might still have parts of the image in its cache, but that's okay since nothing else has written to that image to make those cache lines stale, so you don't need to invalidate or flush anything. The spec says "Write-after-read and read-after-read hazards only require execution dependencies to synchronize", so the execution dependency provided by the semaphore should be sufficient.

it probably doesn't make sense to have src/dstAccessMask be 0 while dst/srcAccessMask is != 0

I believe that's still a useful thing to do - srcAccessMask is flushing caches and dstAccessMask is invalidating caches, which are logically separate operations, and sometimes you may want to flush a cache in one pipeline barrier and then invalidate a different cache in a much later pipeline barrier (with an execution dependency between them) rather than doing it all at once.

I know I read this very paragraph and tried to do exactly what the spec said.

I'm not nearly that trusting ;-) . With my current interpretation of the spec's intended behaviour (which I'd be happy to have corrected if I'm mistaken!), I think most of the examples in the WSI spec are just wrong and based on a misunderstanding of memory barriers.

@HansKristian-Work

This comment has been minimized.

Copy link

HansKristian-Work commented Mar 19, 2016

I'm not nearly that trusting ;-) . With my current interpretation of the spec's intended behaviour (which I'd be happy to have corrected if I'm mistaken!), I think most of the examples in the WSI spec are just wrong and based on a misunderstanding of memory barriers.

True, but having misleading samples in a spec document is worse than having no sample I think :p

If you know that some stage will have written to the image and not flushed its caches yet, but you have no idea which stage, I think you have to do a memory barrier with srcStageMask = VK_PIPELINE_STAGE_ALL_COMMANDS_BIT and maybe srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT (except I think the spec is very unclear about whether MEMORY_WRITE also includes stages that have more specific types; if not then you might need srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT|VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT|VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT|VK_ACCESS_TRANSFER_WRITE_BIT|VK_ACCESS_MEMORY_WRITE_BIT). (And if the last write might have been from the host and not made available yet, you'd need to add that in too.)

I think I didn't formulate that statement well, I mean I need to perform a transition, so if I understand this discussion so far, what we should be doing is

ALL_GRAPHICS_BIT -> BOTTOM_OF_PIPE
COLOR_ATTACHMENT_OPT -> PRESENT_SRC (layout change will happen after memory is "made available")
COLOR_ATTACHMENT_WRITE (this one is still important) -> 0 (semaphore will take care of the reader side)

For acquire, we should just remove the MEMORY_READ_BIT in srcAccessMask and go with
COLOR_ATTACHMENT_OUTPUT -> COLOR_ATTACHMENT_OUTPUT
0 -> 0 (memory barriers probably don't make sense at all here then since it's write-after-read ...)
PRESENT/UNDEF -> COLOR_ATTACHMENT

@philiptaylor

This comment has been minimized.

Copy link
Contributor Author

philiptaylor commented Mar 19, 2016

Ah, I forgot about the layout transitions...

Before presenting: If you know the initial layout is LAYOUT_COLOR_ATTACHMENT_OPTIMAL, it can only have been written to by colour attachment writes, so I think you want:

  • Flush any writes to the image (which must have been as a colour attachment, given the layout):
    • srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
    • srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT
  • Layout transition:
    • oldLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
    • newLayout = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR
  • Don't need to invalidate any caches, and don't need to delay execution of any stages (we already know the command buffer won't signal its semaphore until this pipeline barrier has completed execution), but we need to set some dstStageMask != 0 here, and BOTTOM_OF_PIPE seems reasonable:
    • dstStageMask = VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT
    • dstAccessMask = 0

After presenting: The layout transition is doing a write, so the draw calls will be doing a write-after-write and you do need memory dependencies to invalidate the caches used by those draw calls.

  • When waiting for the vkAcquireNextImage semaphore, the only stages that really need to wait are those that are touching the framebuffer image, i.e. COLOR_ATTACHMENT_OUTPUT. We could wait earlier but it's best to wait late to maximise the amount of code that can run before the signalling:
    • VkSubmitInfo::pWaitDstStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
  • Don't need to flush any caches, and this transition can happen any time after the semaphore is signalled on the COLOR_ATTACHMENT_OUTPUT stage:
    • srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
    • srcAccessMask = 0
  • Layout transition:
    • oldLayout = VK_IMAGE_LAYOUT_UNDEFINED
      • (You almost certainly don't want VK_IMAGE_LAYOUT_PRESENT_SRC_KHR here because that's invalid the first time an image is acquired (since the real layout is UNDEFINED in that case, and oldLayout must be either UNDEFINED or the same as the real layout). If you do want to preserve the framebuffer contents between frames you'd have to keep track of whether this is the first time it's acquired, and only use oldLayout = LAYOUT_PRESENT_SRC if it's not.)
    • newLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
  • Need to invalidate caches used by draw calls that will touch this image, which definitely includes writes and probably includes reads (e.g. blending sounds like it needs to do a read):
    • dstStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
    • dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_READ_BIT|VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT

This is largely based on guesswork though, I don't think the spec makes any of this clear enough. (I'm more confident at pointing out that things are wrong than at trying to come up with something that's actually correct.)

gardell pushed a commit to google/vulkan-cpp-library that referenced this issue May 4, 2016

Johan Gardell
Fixed issue when window resize called before callback set.
Fixed pipeline_barriers for presenting as discussed in
KhronosGroup/Vulkan-Docs#128

TODO: Introduce a better task runner.
@ghost

This comment has been minimized.

Copy link

ghost commented Jun 27, 2016

So this bug, I'm trying to re-read it and I think I can see where we've ended up, but I'd like to clarify that the following is where we got to:

Once the vkAcquireNextImage semaphore signals, any access to the image (which is read-only) by the presentation engine has completed execution. As the access is read-only, only an execution dependency is required - which is satisfied by that semaphore completing. Thus, no pipeline barrier is required.

Right?

If so, the spec language fixes are fairly straightforward based on my "queue operations" change - though the broken examples in the spec are going to need some serious cleanup...

@ratchetfreak

This comment has been minimized.

Copy link

ratchetfreak commented Jun 27, 2016

@TobiasHector you still need the layout transition from present/undefined which requires a barrier. Even the implicit transition at the start of a renderpass needs to wait on the semaphore.

@ghost

This comment has been minimized.

Copy link

ghost commented Jun 27, 2016

Oh yea, sorry I forgot about the transition. But that transition can assume execution and access has completed - so srcStageMask of TOP_OF_PIPE and srcAccessMask of 0 is fine.

@ratchetfreak

This comment has been minimized.

Copy link

ratchetfreak commented Jun 27, 2016

@TobiasHector only if the semaphore's waitStageMask is top_of_pipe or all_commands (and you probably don't want that if you want to do transfers and vertex processing before that semaphore signals)

@ghost

This comment has been minimized.

Copy link

ghost commented Jun 27, 2016

So many interactions - but yes you're correct. This is going to be fun to explain. Any other caveats I've glossed over :)?

@philiptaylor

This comment has been minimized.

Copy link
Contributor Author

philiptaylor commented Jun 27, 2016

My understanding is that if you're drawing directly onto the swapchain image, you want to call vkQueueSubmit with the semaphore from vkAcquireNextImageKHR and with

pWaitDstStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
  // (other stages could be used but this should be the most efficient)

Then some time before the first draw call, use a vkCmdPipelineBarrier with:

srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
dstStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
VkImageMemoryBarrier {
  srcAccessMask = 0
  dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_READ_BIT | VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT
    // (don't know if READ_BIT is really needed here since this stuff is all hopelessly underdefined; does a (non-blending) draw call read from the attachment? in practice I expect it sort of does because of cache line granularity, but does it matter if a partial cache line write uses stale data given that it's initially UNDEFINED anyway? Might as well include the read bit for now, it won't hurt much...)
  oldLayout = VK_IMAGE_LAYOUT_UNDEFINED
    // (assuming you don't want to preserve the previous image contents - that sounds like a rare use case, and is harder to handle correctly (you must still use UNDEFINED for the first usage of each swapchain image, and PRESENT_SRC only for subsequent usages), so examples should probably stick to the simpler case)
  newLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
}

and some time after the last draw call, use:

srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
dstStageMask = VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT
  // (other stages could be used but BOTTOM_OF_PIPE should be the most efficient since it will avoid delaying any subsequent commands)
VkImageMemoryBarrier {
  srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT
  dstAccessMask = 0 
  oldLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
  newLayout = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR
}

and call vkQueuePresentKHR with a semaphore that was signaled by the vkQueueSubmit.

I assume (though I suspect it's not defined anywhere) that the presentation engine will automatically have visibility of any writes that were made available before the semaphore was signaled, so it's okay that we're only explicitly defining half a memory dependency here.

If you're rendering to the swapchain image with vkCmdResolveImage (when you want MSAA but couldn't do the resolve directly in the subpass output), vkCmdBlitImage, etc, then I think you need basically the same with TRANSFER instead of COLOR_ATTACHMENT_OUTPUT everywhere (plus another barrier to flush and wait for the transfer source).

@ratchetfreak

This comment has been minimized.

Copy link

ratchetfreak commented Jun 27, 2016

@philiptaylor you can also let the renderpass do the transition.

to do that you create the renderpass with the color attachment defined as

initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
finalLayout = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR;

and add a subpass dependency with

srcSubpass = VK_SUBPASS_EXTERNAL;
dstSubpass = 0; //or the first subpass that touches the image
srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
dstStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
srcAccessMask = 0;
dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_READ_BIT | VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT;
@philiptaylor

This comment has been minimized.

Copy link
Contributor Author

philiptaylor commented Jun 27, 2016

(@ratchetfreak: Ah, that's a good point - I mostly try to avoid thinking about subpass transitions since it's already complicated enough without them...)

By the way, one thing that's (intentionally) missing from the above suggestions is any use of VK_ACCESS_MEMORY_READ_BIT. Currently the spec says:

Before an application can present an image, the image’s layout must be transitioned to the VK_IMAGE_LAYOUT_PRESENT_SRC_KHR layout. The vkCmdWaitEvents or vkCmdPipelineBarrier that perform the transition must have srcStageMask and srcAccessMask parameters set based on the preceding use of the image. The dstAccessMask must include VK_ACCESS_MEMORY_READ_BIT indicating all prior accesses indicated in srcAccessMask from stages in srcStageMask are to be made available to reads by the presentation engine. Any value of dstStageMask is valid, but should be set to VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT to avoid delaying subsequent commands that don’t access the image.

but I think that's nonsense because (per discussions in issue 132) memory dependencies probably only apply to the specific combination of dstAccessMask+dstStageMask in that barrier/wait/etc command, and no memory accesses ever happen in BOTTOM_OF_PIPE, and there's no stage bit whose definition includes the presentation engine, so it's not possible to use dstAccessMask+dstStageMask to make anything visible to the presentation engine, and you have to just assume the presentation engine is going to manage its own memory visibility automatically.

@ratchetfreak

This comment has been minimized.

Copy link

ratchetfreak commented Jun 27, 2016

Or it memory_read is special (which it kinda is) where as a dst it will always make it available to outside reads (like the presentation engine).

@philiptaylor

This comment has been minimized.

Copy link
Contributor Author

philiptaylor commented Jun 27, 2016

It's certainly special in that it's even more poorly specified than all the other access types :-) . I think the closest thing to a definition is:

VK_ACCESS_MEMORY_READ_BIT ... When included in dstAccessMask, all writes using access types in srcAccessMask performed by pipeline stages in srcStageMask must be visible in memory.

but I can't even tell whether that's meant to be a requirement on implementations or a requirement on applications, and I don't understand how it would work either way, which makes it a pretty useless definition.

@Themaister

This comment has been minimized.

Copy link

Themaister commented Jun 27, 2016

MEMORY_READ_BIT for the presentation engine is kinda moot anyways since signalling the semaphore will make all device accesses visible before the semaphore is signalled anyways.

BOTTOM_OF_PIPE stage + 0 dstAccess and signalling semaphore should be enough.

@ratchetfreak

This comment has been minimized.

Copy link

ratchetfreak commented Jun 27, 2016

But Memory_Read in general is kinda useless as long as semaphores are used to coordinate the execution dependencies between the external reads and the queue operation. As they will take care of the memory dependencies as well.

@Themaister

This comment has been minimized.

Copy link

Themaister commented Jun 27, 2016

I agree with that. It seems like an escape hatch for future extensions to me, really.

@szdarkhack

This comment has been minimized.

Copy link

szdarkhack commented Sep 26, 2016

Apologies if I'm misunderstanding something, but consider the wording in the following part of the Semaphores section in the spec:

This semaphore signal operation defines the first half of a memory dependency, guaranteeing that all memory accesses defined by the submitted queue operations in the batch are made available, and that those queue operations have completed execution.

Semaphore signal operations for vkQueueSubmit additionally include all queue operations previously submitted via vkQueueSubmit in their half of a memory dependency, and all batches that are stored at a lower index in the same pSubmits array.

Signaling of semaphores can be waited on by similarly including them in a batch, defining a queue operation to wait for a signal. A semaphore wait operation defines the second half of a memory dependency for the semaphores being waited on. This half of the memory dependency guarantees that the first half has completed execution, and also guarantees that all available memory accesses are made visible to the queue operations in the batch.

Semaphore wait operations for vkQueueSubmit additionally include all queue operations subsequently submitted via vkQueueSubmit in their half of a memory dependency, and all batches that are stored at a higher index in the same pSubmits array.

According to this, it seems to me that semaphores perform the equivalent of a (really heavyweight) memory barrier from ALL_COMMANDS to ALL_COMMANDS. If that is indeed the case, I see no point in using any access masks in either barrier transitioning swapchain images (at the start and end of the frame), since the caches have already been both flushed and invalidated for everything due to the semaphore. All they should do is the layout transition.

Is that true? If it is, I believe that, for a start, the swapchain examples ought to be updated, and furthermore it should be made more explicit that semaphores do so much memory sync work, which, by the way, I find very weird. Why not have semaphores only provide the execution dependency and leave any required cache manipulation to the application (via appropriate fine-grained barriers)?

@ghost

This comment has been minimized.

Copy link

ghost commented Nov 15, 2016

@szdarkhack Apologies for not responding to this sooner, I wasn't sure how to answer your question. We've been in the midst of a substantial number of sweeping edits to the sync chapter, and it wasn't clear what the answer was until we'd finished those.

According to this, it seems to me that semaphores perform the equivalent of a (really heavyweight) memory barrier from ALL_COMMANDS to ALL_COMMANDS. If that is indeed the case, I see no point in using any access masks in either barrier transitioning swapchain images (at the start and end of the frame), since the caches have already been both flushed and invalidated for everything due to the semaphore. All they should do is the layout transition.

Is that true? If it is, I believe that, for a start, the swapchain examples ought to be updated, and furthermore it should be made more explicit that semaphores do so much memory sync work, which, by the way, I find very weird. Why not have semaphores only provide the execution dependency and leave any required cache manipulation to the application (via appropriate fine-grained barriers)?

You're pretty much correct. The only synchronization the barriers with transitions would require is that the first one should block other uses of that image in the same submission, and the one at the end should wait until after such accesses. No synchronization with stuff happening before or after the submission is necessary if there's a semaphore involved.

This should be clear in the rewrite which should hopefully become public in a week or two.

@oddhack

This comment has been minimized.

Copy link
Contributor

oddhack commented Nov 26, 2016

This should be fixed in the 1.0.35 spec update, as part of the synchronization language rewrite.

@oddhack oddhack closed this Nov 26, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.