Synchronization is impossible to understand #132

Closed · philiptaylor opened this issue Mar 10, 2016 · 46 comments

Comments

@philiptaylor (Contributor)

Maybe I'm just being dumb but, as far as I can tell, it's simply impossible to understand execution dependencies and memory dependencies from the spec.

I think dependencies are an extremely important thing for an application developer to understand and get right - otherwise their code may work fine when they test it, but suffer from bizarre bugs when a new GPU or a new driver version is released that tries to extract more parallelism or has more aggressive caching. That'll be a huge pain for application developers and end users, and/or will prevent driver developers from adding valid optimisations that will break buggy applications (i.e. all applications).

Maybe developers will learn from books and tutorials and best practices documents rather than from the spec, but the spec needs to provide a clearly-defined unambiguous foundation that other people can build a more intuitive understanding on top of. I believe it currently fails at that.

I've tried to understand it myself based on what I think the spec is saying or trying to say - I'm uncertain about a lot of things around the edges, and obviously I could still be getting it fundamentally wrong, but I haven't seen anything that convincingly contradicts my basic understanding yet, so for now I'll assume that I basically get it. In that case, it seems like everybody gets dependencies wrong in practice, including all sample programs (see #128) and the WSI spec (see #131), so the misunderstanding is a widespread problem.

(I should note that I'm not trying to be mean about the spec authors, it's clearly a non-trivial task especially when there's deadlines and constantly-evolving requirements and so on, so you have my sympathies! But it still needs fixing.)


I don't want to just list individual problems with the spec, because I don't think lots of small incremental changes will lead to a satisfactory result. But I see some general themes:

  • Some terms are used without defining them, so you have to guess from the context. (E.g. "framebuffer-space stages", "side effects".)
  • Some terms are used inconsistently, so it's not even possible to clearly define them. (E.g. "region" is described as a single pixel "(x,y,layer)", but is also described as having a size.)
  • There are many references to "before", "after", "preceding", "subsequent", "between", etc, which have very different meanings in different contexts, and none of those meanings are defined.
  • Closely-related requirements about execution/memory dependencies are spread across many different sections of the spec, so it's hard to get a coherent view of the whole thing.
  • Sometimes a single requirement is stated twice in two different places in slightly different ways, and you can't tell which is more correct.
  • Sometimes the intention behind a part of the spec seems obvious, but a literal reading gives a completely different interpretation that just seems wrong. (E.g. I think "SetEvent(); WaitEvents(); ResetEvent(); SetEvent(); WaitEvents()" - sketched in code after this list - is specified to create circular dependencies, so you can only safely use an event once per queue for its whole lifetime, making reset useless.)
  • The underlying conceptual model is never properly explained and has to be reverse-engineered from the text, so it's always hard to tell if you've got the right conceptual model yet.
  • Because the text is convoluted and open to interpretation, it's hard to report bugs in applications/drivers/spec so that everyone agrees there's even a problem instead of getting bogged down in debating quotes from the spec.
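
For concreteness, here is that event-reuse pattern as a hypothetical sketch (the command buffer, event, and stage masks are invented for illustration; the point is only where the signals and waits fall):

/* First use of the event. */
vkCmdSetEvent(cmdBuf, event, VK_PIPELINE_STAGE_TRANSFER_BIT);
vkCmdWaitEvents(cmdBuf, 1, &event,
                VK_PIPELINE_STAGE_TRANSFER_BIT,         /* srcStageMask */
                VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,  /* dstStageMask */
                0, NULL, 0, NULL, 0, NULL);

/* Reset and second use of the same event. Under a literal reading of the
 * spec, this second wait is not clearly scoped to the second signal only,
 * which is how the circular-dependency reading arises. */
vkCmdResetEvent(cmdBuf, event, VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT);
vkCmdSetEvent(cmdBuf, event, VK_PIPELINE_STAGE_TRANSFER_BIT);
vkCmdWaitEvents(cmdBuf, 1, &event,
                VK_PIPELINE_STAGE_TRANSFER_BIT,
                VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
                0, NULL, 0, NULL, 0, NULL);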

While I was trying to understand this, I had a go at writing (part of) the rules in a very different form, at https://github.com/philiptaylor/vulkan-sync/blob/master/README.md . I think the existing spec is essentially just defining a DAG of execution dependencies, so I made that much more explicit, and tried to be careful about defining any potentially-unclear terms. It's not necessarily easy to read but it should be possible to convey most of the meaning visually through diagrams (since DAGs are easy to visualize), and a reader can do a 1:1 translation from the diagrams back into specific rules to confirm their interpretation. I don't know if that's a successful approach or worth continuing, and it's still incomplete and buggy and would need a lot more work, but I think at least it's a lot less ambiguous and might suggest a useful direction.

ghost self-assigned this Mar 10, 2016
@ghost commented Mar 10, 2016

I'll take this since I've already committed to looking at the synchronization stuff in #128. Thanks for the detailed list and the readme should be a useful reference. I'll tie these three issues together internally because IMO they have the same solution. Whilst most individual descriptions in the spec are quite clear, I think it's largely the inconsistencies that are the problem. The end result of a spec with many months of iterative evolution I suppose! This isn't going to be a quick fix but I'll try my best to get this fixed - I'm hoping I can rope in a bunch of people to help - I'm sure any fixes will get reviewed very carefully at the very least!

@philiptaylor (Contributor, Author)

I tried updating https://github.com/philiptaylor/vulkan-sync/blob/master/README.md with an initial attempt to describe memory dependencies in the same model.

The description there isn't entirely compatible with the current spec's requirements, because the spec seems to do some weird things and I can't tell whether that's intentional - some of the cases look like bugs to me, but I don't know if I'm misunderstanding them or if there's some subtle reason for them to be as they are. Instead I tried to make up something that sounded very nearly the same as the spec, but more sensible and understandable to me from an application developer perspective: sequences of reads, writes, flushes ('make available'), and invalidates ('make visible'), which must be correctly ordered by execution dependencies to satisfy the standard requirements of caches. (Of course this is all with the usual disclaimer that I'm totally not an expert on any of this.)

For example, I think you can write something a bit like:

vkCmdClearColorImage();

vkCmdPipelineBarrier(...,
  srcStageMask = TRANSFER,
  dstStageMask = FRAGMENT_SHADER | TRANSFER,
  dependencyFlags = 0,
  pImageMemoryBarriers = {}, // empty - no memory barriers of any kind in this pipeline barrier
...);

vkCmdBeginRenderPass(...,
  // (strictly, pDependencies belongs to the VkRenderPassCreateInfo passed
  // to vkCreateRenderPass; it is shown inline here for readability)
  pDependencies = {
    (VkSubpassDependency){
      srcSubpass = 0,
      dstSubpass = 1,
      srcStageMask = FRAGMENT_SHADER | TRANSFER,
      dstStageMask = FRAGMENT_SHADER,
      srcAccessMask = TRANSFER_WRITE,
      dstAccessMask = SHADER_READ,
      dependencyFlags = 0,
    }
  }, ...
);

vkCmdDraw(); // draw 1

vkCmdNextSubpass();

vkCmdDraw(); // draw 2

vkCmdEndRenderPass();

where there is an execution dependency chain from the clear's TRANSFER stage to draw 2's FRAGMENT_SHADER stage, going via the subpass dependency which does a 'make available' of TRANSFER_WRITE in TRANSFER and a 'make visible' of SHADER_READ in FRAGMENT_SHADER.

But I believe the current spec indicates that the subpass dependency's 'make available' only applies to writes from draw 1's TRANSFER stage, not to writes from the clear's TRANSFER stage, even though there's an execution dependency from clear's TRANSFER to draw 1's TRANSFER. ("The first half of the dependency makes memory accesses using the set of access types in srcAccessMask performed in pipeline stages in srcStageMask by the first set of commands complete and writes be available for subsequent commands", where the "first set of commands" for a subpass dependency is just the commands in srcSubpass). Therefore there is no guarantee that the clear's TRANSFER writes have been made available for draw 2's FRAGMENT_SHADER.

That seems weird to me - I'd have hoped the application could rely on dependencies to be transitive, rather than nearly-but-not-quite-transitive. My attempt at defining this stuff relies more heavily on transitivity, rather than notions like "the first set of commands", so I think it does provide that guarantee in this example.

@pimanttr

The dependencies are transitive, provided the stages overlap, creating a dependency chain. That is, in your example, the pipeline barrier makes the clear 'available' and the subpass dependency makes it 'visible' to draw 2, because the barrier's dstStageMask and the subpass dependency's srcStageMask have at least 1 bit in common (in this case, both).

@philiptaylor (Contributor, Author)

Where does the spec say that?

The way I read it, a write occurs in a specific stage in a specific command with a specific access type, and a memory dependency will 'make available' writes that occurred in a specific set of stages in a specific set of commands with a specific set of access types:

Memory dependencies synchronize accesses to memory between two sets of commands. They operate according to two “halves” of a dependency to synchronize two sets of commands, the commands that execute first vs the commands that execute second, as described above. The first half of the dependency makes memory accesses using the set of access types in srcAccessMask performed in pipeline stages in srcStageMask by the first set of commands complete and writes be available for subsequent commands.

and

Each subpass dependency defines an execution and memory dependency between two sets of commands, with the second set depending on the first set. When srcSubpass does not equal dstSubpass then the first set of commands is:

  • All commands in the subpass indicated by srcSubpass, if srcSubpass is not VK_SUBPASS_EXTERNAL.

In my example, the "first set of commands" for the subpass dependency's memory dependency is defined to be all the commands in subpass 0 (and no others). The pipeline barrier doesn't create any memory dependencies because it doesn't include any Vk{Image,Buffer,}MemoryBarrier objects.

The common bits in the dstStageMask/srcStageMask create an execution dependency chain through the pipeline barrier and subpass dependency, but don't create any new memory dependencies.

So as far as I can tell, nothing specifies that the writes from the clear command will ever be made available.

@pimanttr

Just a bit later : "The two halves of a memory dependency can either be expressed as part of a single command, or can be part of separate barriers as long as there is an execution dependency chain between them"

@pimanttr

Oh, maybe I missed it in your example - if you have no memory barrier in your pipeline barrier (I understood it as being implied), then I agree the spec is not very clear about that. That said, the barrier implied by the subpass dependency is only guaranteed to apply to the renderArea.
Maybe if the renderArea covers the entire attachment we could add the guarantee that the first half of the barrier implied by the subpass dependency also applies to commands "before" the RenderPass (i.e. assuming a proper execution dependency).

@philiptaylor (Contributor, Author)

Ah, sorry for misleading pseudocode - I meant that the vkCmdPipelineBarrier had an empty list of image memory barriers (and no other memory barriers either).

That said, the barrier implied by the subpass dependency is only guaranteed to apply to the renderArea.

Hmm, I don't see anything in the spec talking about dependencies being affected by renderArea, so I'm not sure what you mean by that? It's not a by-region dependency either, so it should apply globally across all invocations. And in this example the image might be some arbitrary texture, not a framebuffer attachment.

Maybe if the renderArea covers the entire attachment we could add the guarantee ...

My current opinion is that the spec just seems fundamentally broken, and won't be fixed by patching individual issues as they're noticed - it needs a more principled approach that can eliminate all the tricky edge cases like these.

@philiptaylor (Contributor, Author)

I attempted to understand a bit more of the spec, but got stuck with subpass automatic image layout transitions, which seem particularly confusing. E.g. the spec says:

If two subpasses are connected by a dependency and those two subpasses use the same attachment in a different layout, then the layout transition will occur [... after the dependency's src and before its dst ...]

What if there are two subpass dependencies between the same subpasses? e.g. one from subpass 0's STAGE_VERTEX_SHADER to subpass 1's STAGE_VERTEX_SHADER, and one from subpass 0's STAGE_FRAGMENT_SHADER to subpass 1's STAGE_FRAGMENT_SHADER. Does the transition happen inside just one of those dependencies? (that sounds bad since it might not pick the one the application needs). Does it do two separate transitions, one for each dependency? (that won't work because the second transition will see the wrong source layout). Does it wait for all the dependencies' srcs, then transition once, then proceed with all the dependencies' dsts? (maybe that'll work, it's just a bit weird that it'll create false dependencies from subpass 0's STAGE_FRAGMENT_SHADER to subpass 1's STAGE_VERTEX_SHADER etc, which wouldn't happen if there was no transition). Or is it invalid to have two subpass dependencies between the same subpasses? (that doesn't seem to be a requirement anywhere).

If the attachment does not include the VK_ATTACHMENT_DESCRIPTION_MAY_ALIAS_BIT bit and there is no subpass dependency from VK_SUBPASS_EXTERNAL to the first subpass that uses the attachment, then it is as if there were such a dependency with srcStageMask = srcAccessMask = 0

A dependency with srcStageMask = 0 is usually invalid, and doesn't seem very sensible - there would be almost no limit on how early it can run, so the automatic transition could happen in the middle of the previous frame while it's still using the same attachment. (I think it might actually be limited by earlier execution dependencies with dstStageMask = TOP_OF_PIPE or ALL_COMMANDS, but there might be no such barriers and there's no indication that they're expected, and that seems like an accident of the definition of execution dependency chains. Incidentally I think the execution dependency chain concept of "execution dependencies submitted to a queue in order" doesn't make sense for subpass dependencies anyway, since they're all defined in vkCreateRenderPass and not submitted in any meaningful order.)

If the attachment includes the VK_ATTACHMENT_DESCRIPTION_MAY_ALIAS_BIT bit, then the transition executes according to all the subpass dependencies with dstSubpass equal to the first subpass index that the attachment is used in. That is, it occurs after all memory accesses in the source stages and masks from all the source subpasses have completed and are available, and before the union of all the destination stages begin, and the new layout is visible to the union of all the destination access types.

If one of those subpass dependencies is a self-dependency, I think that definition will be wrong. Also the merging of source/destination sets seems confusing if one of the (non-self) subpass dependencies is BY_REGION and one is not.

The automatic image layout transitions from initialLayout to the first used layout (if it is different)

I think even the basic idea of "first used layout" doesn't make sense. There might be two subpasses that both read an attachment with the same layout (different to initialLayout), with no subpass dependency between them (so neither is "first" in terms of execution order), and the behaviour of the transition seems unspecified in that case.

Anyway, everything seems very unclear and it's all quite frustrating (which I guess is why I feel the need to continue complaining).

@jeffbolznv (Contributor)

Just to touch on a couple of the easier issues:

A dependency with srcStageMask = 0 is usually invalid, and doesn't seem very sensible - there would be almost no limit on how early it can run, so the automatic transition could happen in the middle of the previous frame while it's still using the same attachment.

There was a point in time where this was legal, and I guess we missed this. The intent was that it could happen earlier, but not before preceding dependencies (which we need to define more precisely).

I think even the basic idea of "first used layout" doesn't make sense.

"First use" is defined at the start of the chapter:

A subpass uses an attachment if the attachment is a color, depth/stencil,
resolve, or input attachment for that subpass. A subpass does not use an
attachment if that attachment is preserved by the subpass. The first use of
an attachment is in the lowest numbered subpass that uses that attachment.
Similarly, the last use of an attachment is in the highest numbered subpass
that uses that attachment.

@philiptaylor (Contributor, Author)

"First use" is defined at the start of the chapter

Ah, thanks, I'd missed that definition. I think it still doesn't particularly make sense though: if subpass 0 and 1 both read the same attachment with the same layout (!= initialLayout), and there's no subpass dependency between them (which seems valid since the spec only says subpass dependencies are needed if one of the subpasses does a write), then there will be an execution dependency from the automatic transition to the commands in subpass 0, but not to the commands in subpass 1, so the commands in subpass 1 might execute and read the attachment before the transition has completed.

In that situation, if you really want those two subpasses to run concurrently, I guess you'd have to create a new empty subpass which 'uses' the attachment first (triggering the automatic transition), with subpass dependencies from that empty subpass to the two that are really using the attachment.

(...though then I think you hit the problem that subpass dependencies are defined in terms of execution dependencies on commands inside subpasses - so if a subpass is empty and has no commands in it, there will be no execution dependencies in it, and no execution dependency chains passing through it, which breaks the idea of subpass dependency chains.)
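
To make that workaround concrete, here is a hypothetical fragment of the render pass setup (fields abbreviated, names invented; subpass 0 is the empty transition-triggering subpass, subpasses 1 and 2 are the real readers):

VkAttachmentReference inputRef = {
    .attachment = 0,
    .layout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
};

/* Subpass 0 "uses" the attachment (so it is its first use and triggers the
 * automatic transition) but records no commands; 1 and 2 do the real reads. */
VkSubpassDescription subpasses[3] = {
    { .pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS,
      .inputAttachmentCount = 1, .pInputAttachments = &inputRef },
    { .pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS,
      .inputAttachmentCount = 1, .pInputAttachments = &inputRef },
    { .pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS,
      .inputAttachmentCount = 1, .pInputAttachments = &inputRef },
};

/* Dependencies from the empty subpass to each reader, with no dependency
 * between the readers themselves, so they may run concurrently. */
VkSubpassDependency deps[2] = {
    { .srcSubpass = 0, .dstSubpass = 1,
      .srcStageMask = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
      .dstStageMask = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
      .srcAccessMask = 0,
      .dstAccessMask = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT },
    { .srcSubpass = 0, .dstSubpass = 2,
      .srcStageMask = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
      .dstStageMask = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
      .srcAccessMask = 0,
      .dstAccessMask = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT },
};

(As the parenthetical above notes, it's unclear whether dependencies out of a subpass containing no commands actually order anything, which is exactly the hole being pointed at.)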

@oddhack (Contributor) commented Apr 22, 2016

There are some improvements in the 1.0.11 spec update, although @TobiasHector is still working on a broader rewrite of the synchronization chapter.

oddhack closed this as completed Apr 22, 2016
oddhack reopened this Apr 22, 2016
@ghost commented Jun 24, 2016

Ok, an update to this long-running issue at last; I've made substantial changes to the description of semaphore and fence operations, which are now in the latest drop of the spec (1.0.18). It by no means completely fixes this bug, but it's the first of a few large changes which hopefully will make synchronization a lot clearer and eventually resolve this issue.

@philiptaylor hopefully the changes make sense - any feedback welcome!

@ratchetfreak

There is one situation I'd like clarification on:

Given 3 execution dependencies D1, D2 and D3, where D1 has stagemasks S1 -> S2, D2 has S2 -> S3, and D3 has S3 -> S4, submitted in that order to a queue (making an execution dependency chain).

D2 additionally defines a memory barrier with accessmasks AM1 -> AM2 of memory M.

Does this guarantee that the writes to M in the commands before D1 with stagemask S1 and accessmask AM1 are visible to the commands after D3 with stagemask S4 and accessmask AM2?
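
(For concreteness, a hypothetical encoding of that chain as commands in one command buffer - S1..S4, AM1, AM2, and cmdBuf are placeholders:)

/* D1: execution-only dependency, S1 -> S2. */
vkCmdPipelineBarrier(cmdBuf, S1, S2, 0, 0, NULL, 0, NULL, 0, NULL);

/* D2: execution dependency S2 -> S3 carrying the memory barrier. */
VkMemoryBarrier mb = {
    .sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
    .srcAccessMask = AM1,
    .dstAccessMask = AM2,
};
vkCmdPipelineBarrier(cmdBuf, S2, S3, 0, 1, &mb, 0, NULL, 0, NULL);

/* D3: execution-only dependency, S3 -> S4. */
vkCmdPipelineBarrier(cmdBuf, S3, S4, 0, 0, NULL, 0, NULL, 0, NULL);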

@ghost commented Jun 24, 2016

@ratchetfreak Assuming I'm understanding your question correctly - yes, that is guaranteed. The way I think about it is that memory barriers are a task to execute - essentially flushing the source access mask's caches, and invalidating the destination access mask's caches. That task is executed as part of an execution dependency chain. So if there's an execution dependency chain between two points, and anywhere in the middle of that chain there is a memory barrier relevant to their commands, then there is a memory dependency between them.

Execution dependencies/dependency chains/memory dependencies are the next thing on my 'hit list'. The description of them in the spec doesn't even mention semaphores or fences at the moment, which is just plain wrong, and they're inexplicably defined in the precise manner they work for barriers/events. They should be defined in general terms, then details specified for events and barriers. Hopefully that next change will mean someone in future doesn't have to ask this question...

@ratchetfreak

@TobiasHector To clarify, I'm asking whether the stagemasks of D2 have any bearing on the memory barrier declared in it.

I'm of the belief that the chain has the exact same effect as a single pipelinebarrier with stagemasks (S1|S2) -> (S3|S4) and accessmasks AM1->AM2. @philiptaylor disagrees with me on that.

@philiptaylor (Contributor, Author)

I believe ratchetfreak's position is that something like:

vkCmdDraw(...);

vkCmdPipelineBarrier(srcStageMask=VERTEX_SHADER, dstStageMask=TRANSFER);

vkCmdPipelineBarrier(srcStageMask=TRANSFER, dstStageMask=TRANSFER,
    pMemoryBarriers={srcAccessMask=SHADER_WRITE, dstAccessMask=SHADER_READ});

vkCmdPipelineBarrier(srcStageMask=TRANSFER, dstStageMask=FRAGMENT_SHADER);

vkCmdDraw(...);

// (ignore boring details about subpass dependencies etc here; maybe pretend the draws are some other command that happens outside a render pass)

will let the second draw's fragment shader read data written by the first draw's vertex shader. There is an execution dependency chain between those stages, and there is a memory dependency in the middle of that chain with access masks corresponding to the shader writes and reads, so it should work correctly.

My belief is that's not correct: the memory dependency only applies to accesses that (1) are performed by commands that are ordered by the execution dependency chain, (2) are of a type that matches the access mask, and (3) (the important point) are performed in stages that match the stage mask of the specific command that is creating the memory dependency. In this case that command is the second pipeline barrier, and its stage masks do not include the vertex/fragment stages, so the shader reads/writes in the vertex/fragment stages will not be made available/visible.

@ghost commented Jun 24, 2016

@ratchetfreak Ah I see - no, not specifically. The memory barrier is a sort of independent task that just happens to be attached to the execution dependency for various reasons (during spec development, various parties wanted them in one function call). If the pipeline barrier includes source/destination stage masks that match the source/destination access masks, then those stages are included in the memory dependency for that pipeline barrier. If they don't, then the memory barrier affects all previously completed stages for that access mask.

@ghost commented Jun 24, 2016

Oh good - replying at the same time! Yay. I'll read your comments @philiptaylor...

@tomaka (Contributor) commented Jun 24, 2016

By the way, this is pretty minor but I think you should change

(section 6.5)

A pipeline barrier inserts an execution dependency and a set of memory dependencies between a set of commands earlier in the command buffer and a set of commands later in the command buffer.

To something like "between a set of commands earlier in the queue and a set of commands later in the queue" (assuming I understood it correctly).

@ratchetfreak

If the pipeline barrier includes source/destination stage masks that match the source/destination access masks, then those stages are included in the memory dependency for that pipeline barrier. If they don't, then the memory barrier affects all previously completed stages for that access mask.

What defines whether stage and access masks "match"?

Also such behavior makes it pretty difficult to ensure that a memory barrier hits all pipeline stages without stalling the pipeline (assuming there is already an execution dependency).

@ghost commented Jun 24, 2016

@philiptaylor So memory barriers execute between pipeline stages specified in the stage masks - i.e. after waiting for srcStageMask to execute, and before dstStageMask executes. I think you agree with that?

But that is the ONLY effect the stage masks have on the memory barrier. They do not interact directly with the things that are flushed. If you say "I want all shader writes to be available", it means "all shader writes performed by tasks that have completed execution at the point the memory barrier executes". Similar semantics apply to visibility. So if the source stage masks include stages that do shader writes, those writes will be made available. If they don't include stages that do shader writes, they won't. All prior shader writes that have completed execution at the point the memory barrier executes will be made visible, regardless of the stage mask values.

@ghost commented Jun 24, 2016

@tomaka Yes, I've spotted that, it's on my todo list :)

@ghost commented Jun 24, 2016

@ratchetfreak I was being fairly loose when I said "match" - I really meant "a type of memory access that is performed by a stage" - so SHADER_WRITE applies to all shader stages that write to memory via image or buffer stores.

Also such behavior makes it pretty difficult to ensure that a memory barrier hits all pipeline stages without stalling the pipeline (assuming there is already an execution dependency).

It doesn't make it that difficult, but that is sort of necessary anyway. This is why it's important to carefully specify your stage masks - and also one of the reasons AMD touted Render Passes as being awesome: http://gpuopen.com/vulkan-renderpasses/

In fact you can kind of think of a memory barrier in a similar way to image transitions, though there's some complexity for image memory barriers in that case so.. don't think too hard about that?

Edit: That article on gpuopen actually doesn't go into as much detail as Graham Sellers' GDC presentation. See here: http://gpuopen.com/wp-content/uploads/2016/03/VulkanFastPaths.pdf (particularly slides 35-36, where he adds the flushes and invalidates that memory barriers would perform)

@philiptaylor (Contributor, Author) commented Jun 24, 2016

So memory barriers execute between pipeline stages specified in the stage masks - i.e. after waiting for srcStageMask to execute, and before dstStageMask executes. I think you agree with that?

Yep, I'm happy with the order of execution.

If you say "I want all shader writes to be available", it means "all shader writes performed by tasks that have completed execution at the point the memory barrier executes".

Hmm, I think that really doesn't match my reading of the spec...

The definition of memory dependencies says: "Memory dependencies are coupled to execution dependencies ... The first half of the dependency makes memory accesses using the set of access types in srcAccessMask performed in pipeline stages in srcStageMask by the first set of commands complete and writes be available for subsequent commands."

The definition of VkMemoryBarrier says: "Memory accesses using the set of access types in srcAccessMask performed in pipeline stages in srcStageMask by the first set of commands must complete and be available to later commands."

Elsewhere it says things like "When an image memory barrier includes a layout transition, the barrier first makes writes via srcStageMask and srcAccessMask available".

Those all sound to me like they're clearly referring to the srcAccessMask and srcStageMask together, and that srcStageMask comes from the single execution dependency that the barrier is associated with. They're not defined in terms of execution dependency chains (except for where the two halves of a memory barrier can be split across a chain, but that doesn't affect the definition of each half).

I don't see anything in the spec to say that accesses matching srcAccessMask performed in any pipeline stage that precedes the barrier in execution dependency chain order will be made available, but it sounds like that's what you're saying.

(Tying srcAccessMask to srcStageMask seems theoretically useful to me if e.g. in some implementation, vertex shaders and fragment shaders use different caches, and you only want to flush vertex shader writes and not fragment shader writes. Using the combination of flags seems like a trick to avoid needing the API to define hundreds of flags for ACCESS_VERTEX_SHADER_WRITE, ACCESS_FRAGMENT_SHADER_WRITE, ACCESS_GEOMETRY_SHADER_WRITE, etc, while still allowing that fine-grained control over every stage's caches separately. But your comment seems to indicate we can't flush them separately.)

@ghost commented Jun 24, 2016

@philiptaylor I'm going to call (slightly) broken spec. It's not a card I like to call often, but I'm calling it. I'll respond to your specific examples:

The definition of memory dependencies says: "Memory dependencies are coupled to execution dependencies ... The first half of the dependency makes memory accesses using the set of access types in srcAccessMask performed in pipeline stages in srcStageMask by the first set of commands complete and writes be available for subsequent commands."

This paragraph doesn't actually specify what the first or second set of commands are, nor where "srcStageMask" comes from. srcStageMask is actually sourced from the entire execution dependency CHAIN, not the single execution dependency that triggers the memory barrier.

For your latter two quotes:

Elsewhere it says things like "When an image memory barrier includes a layout transition, the barrier first makes writes via srcStageMask and srcAccessMask available".

and

The definition of VkMemoryBarrier says: "Memory accesses using the set of access types in srcAccessMask performed in pipeline stages in srcStageMask by the first set of commands must complete and be available to later commands."

These both still hold in the model I've suggested, because as I said - if the access type is performed by the source stage, that source stage is part of the memory dependency. There's no change here.

I don't see anything in the spec to say that accesses matching srcAccessMask performed in any pipeline stage that precedes the barrier in execution dependency chain order will be made available, but it sounds like that's what you're saying.

Right, because it's not any stage, it's the specific source stages in the execution dependency chain.

(Tying srcAccessMask and srcStageMask seems theoretically useful to me if e.g. in some implementation, vertex shaders and fragment shaders use different caches, and you only want to flush vertex shader writes and not fragment shader writes. Using the combination of flags seems like a trick to avoid needing hundreds of flags for ACCESS_VERTEX_SHADER_WRITE, ACCESS_FRAGMENT_SHADER_WRITE, ACCESS_GEOMETRY_SHADER_WRITE, etc, while still allowing that fine-grained control over every stage's caches separately. But your comment seems to indicate we can't flush them separately.)

Basically no member in the group needed that granularity (AFAIK everyone has universal shader cores) - which is precisely why we DON'T have all those access masks. If any member IHV needed them (and said so), we'd have added it.

@ghost commented Jun 24, 2016

Elsewhere it says things like "When an image memory barrier includes a layout transition, the barrier first makes writes via srcStageMask and srcAccessMask available".

NB: For this specific case, this should be the logical chain of executed operations (a code sketch follows the list):

  1. Wait for srcStageMask to complete (before dstStageMask stages)
  2. Make memory writes AVAILABLE from srcAccessMask accesses (for all source stages in execution dependency chain)
  3. Make available memory writes VISIBLE to the transition
  4. Transition image from state A to B
  5. Make memory writes AVAILABLE from the transition
  6. Make available memory writes VISIBLE for dstAccessMask accesses (for all destination stages in execution dependency chain)
  7. Continue execution of dstStageMask stages
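
As a hedged sketch only, here is where those numbered steps attach to a single vkCmdPipelineBarrier with a layout transition (the image, masks, and layouts are made up for illustration):

VkImageMemoryBarrier imb = {
    .sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,
    .srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT,          /* step 2 */
    .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,             /* step 6 */
    .oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL,      /* steps 3-5: */
    .newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,  /* transition A to B */
    .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
    .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
    .image = image,
    .subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 },
};
vkCmdPipelineBarrier(cmdBuf,
    VK_PIPELINE_STAGE_TRANSFER_BIT,         /* step 1: wait for srcStageMask */
    VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,  /* step 7: dstStageMask resumes */
    0, 0, NULL, 0, NULL, 1, &imb);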

@ghost commented Jun 24, 2016

Oh! I forgot a small part of the explanation, which I think I need your example to illustrate:

vkCmdDraw(...);

vkCmdPipelineBarrier(srcStageMask=VERTEX_SHADER, dstStageMask=TRANSFER);

vkCmdPipelineBarrier(srcStageMask=TRANSFER, dstStageMask=TRANSFER,
    pMemoryBarriers={srcAccessMask=SHADER_WRITE, dstAccessMask=SHADER_READ});

vkCmdPipelineBarrier(srcStageMask=TRANSFER, dstStageMask=FRAGMENT_SHADER);

vkCmdDraw(...);

// (ignore boring details about subpass dependencies etc here; maybe pretend the draws are some other command that happens outside a render pass)

In this case, the results of the first draw's vertex shader's writes are visible to the second draw's fragment shader.

In this second case, that is not true:

vkCmdPipelineBarrier(srcStageMask=VERTEX_SHADER, dstStageMask=TRANSFER);

vkCmdDraw(...);  // Draw call is now after the first pipeline barrier.

vkCmdPipelineBarrier(srcStageMask=TRANSFER, dstStageMask=TRANSFER,
    pMemoryBarriers={srcAccessMask=SHADER_WRITE, dstAccessMask=SHADER_READ});
// This is still part of an execution dependency chain, but that chain no longer includes the draw.

vkCmdPipelineBarrier(srcStageMask=TRANSFER, dstStageMask=FRAGMENT_SHADER);

vkCmdDraw(...);

// (ignore boring details about subpass dependencies etc here; maybe pretend the draws are some other command that happens outside a render pass)

...because there's no longer an execution dependency on the vertex shader in the first draw call completing before the memory barrier executes.

@philiptaylor (Contributor, Author)

I'm going to call (slightly) broken spec

Okay - I think the model you describe sounds like a reasonable one (if there's no desire to control separate caches per stage), I just still don't see a reasonable way to read the spec that is consistent with that model. If there's consensus that that really is the correct model, and the spec is changed to match it clearly, then I'll be happy enough.

@ghost commented Jun 24, 2016

@philiptaylor Honestly part of the motivation for me trying to fix this spec is that I'm not convinced that everyone completely agrees on how it works. It's more than possible that someone (maybe @jeffbolznv!?!) will completely disagree with me and agree with you. But I suspect I can find yet others who agree with me, and others still that have a completely different idea. I'm really quite uncomfortable with the state of the synchronization chapter :)

@ratchetfreak

  1. Make memory writes AVAILABLE from srcAccessMask

where the writes can be from any command that has finished executing.

Perhaps when describing memory dependencies use a term other than "srcStageMask" to define which stages' writes are made available. Likewise with "dstStageMask".

Then you can define those terms such that "source stages" are all stages from which there is an execution dependency to this memory barrier, and "destination stages" are all stages to which there is an execution dependency from this memory barrier.

@ghost commented Jun 24, 2016

@ratchetfreak I think I was going to define it as something like "all source stage masks that have completed execution" and "all stages that have yet to execute", and then just make the execution dependencies obvious.

@ghost commented Jun 24, 2016

NB: I'm actually wanting @jeffbolznv (or some other Khronos member) to weigh in on this before we consider this line of explanation "closed". Whilst I've built up a significant amount of knowledge about synchronization during the past few months, there are still fuzzy bits and I'm certainly not infallible.

@philiptaylor (Contributor, Author)

Another case that might be worth considering:

vkCmdDraw(...);

vkCmdPipelineBarrier(srcStageMask=BOTTOM_OF_PIPE, dstStageMask=TOP_OF_PIPE,
    pMemoryBarriers={srcAccessMask=SHADER_WRITE, dstAccessMask=SHADER_READ});

vkCmdDraw(...);

There's an execution dependency chain from the first draw's vertex stage to the second draw's fragment stage, via the pipeline barrier with the memory dependency in the middle, due to how BOTTOM_OF_PIPE and TOP_OF_PIPE are defined in execution dependency chains. VERTEX_SHADER is not mentioned anywhere in this code. Does the memory dependency still apply to the vertex shader stage's writes?

If it does apply, then I think your statement "srcStageMask is actually sourced from the entire execution dependency CHAIN" is not the right terminology since it's actually not directly related to the source stage masks at all, it's determined by the stages involved in execution dependency chains which are derived in a complex way from sequences of srcStageMask and dstStageMask and other commands and subpass dependencies etc.

(That also seems to contradict the spec's note that "when defining a memory dependency, if the stage mask(s) refer to all stages, then the indicated access types from all stages will be made available and/or visible, but using only VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT would not make any accesses available and/or visible because this stage doesn’t access memory." - here we're using BOTTOM_OF_PIPE and it does make some accesses available, and it's actually going to act exactly the same as ALL_COMMANDS.)

@ratchetfreak commented Jun 24, 2016

That note should definitely be rewritten to focus on the execution dependency. Using BOTTOM_OF_PIPE means no subsequent commands except other dependencies with BOTTOM_OF_PIPE or ALL_COMMANDS in the srcStageMask have an execution dependency on it. As such there cannot be a memory dependency.

It also means that the HOST stage is mostly useless (read: can usually be replaced by top and bottom of pipe) because of the implicit execution dependencies with vkQueueSubmit and vkWaitForFences. The only use that still makes sense for it is in the srcStageMask in vkCmdWaitEvents to signal that one of the events is host-signaled. I made another issue about that a few days ago: #261.

@ghost commented Jun 24, 2016

@philiptaylor Just to be clear, I think the only thing we disagree on is whether the memory barrier affects all memory writes that have completed execution for a particular access type, or whether it only affects the stages in the stage masks they were submitted with. Right?

If that's the case, I might be coming around to your way of thinking. I'm not 100% convinced that you're right yet, but I'm not convinced that I am either. I'm also still not sure the spec is entirely right here!

@philiptaylor (Contributor, Author)

Yes, I believe that's the point of disagreement.

(From an application developer perspective, that leads to questions like: If I want a write in a vertex shader to be available to later commands, do I need to insert a vkCmdPipelineBarrier with srcAccessMask=SHADER_WRITE somewhere in an execution dependency chain that starts at the vertex shader, or do I need to insert a vkCmdPipelineBarrier with srcAccessMask=SHADER_WRITE and srcStageMask=VERTEX_SHADER (in the same command) somewhere in that execution dependency chain?)

("right" is a tricky term - I'm basing it off what I infer was the intention behind the text currently in the spec, under the assumption that there is a sensible coherent intention that has been mostly but perhaps imperfectly expressed in the text. That might not be the same as what the people writing the text actually intended, but hopefully the gap can be closed one way or the other :-) )

@ghost commented Jun 27, 2016

Ok I think that some of my confusion has come from how semaphores interact with pipeline barriers (e.g. issue #128). I'm now 99% certain that you're right @philiptaylor - so that clears that up. I'll make sure that goes into the spec properly.

@ghost commented Jun 27, 2016

Which means, @ratchetfreak - if @philiptaylor's interpretation of your example is correct - then no, you wouldn't have sufficient dependencies in that scenario.

@pimanttr

@TobiasHector I'm pretty sure our intent when writing this was to follow @philiptaylor's interpretation, in that the memory accesses made available are the srcAccessMask/srcStageMask for the barrier (not the whole dependency chain). The dependency chain is what connects which "available" accesses can be made "visible" to another barrier, but I don't think we intended that it "collects" extra stages/accesses along the way that would apply to other barriers in the dependency chains.
Indeed I think it would be overly conservative, because due to fences and semaphores, basically after the first few frames every barrier is part of a dependency chain that includes everything that was done in those first few frames, so virtually every barrier would probably need to consider all stages/accesses used by the application.
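
(A minimal sketch of that interpretation, with invented masks: to make a vertex shader's writes available, the barrier carrying the memory barrier must itself name the writing stage in its srcStageMask; an earlier link in the chain naming it is not enough.)

VkMemoryBarrier mb = {
    .sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
    .srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT,
    .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,
};
vkCmdDraw(cmdBuf, vertexCount, 1, 0, 0);
vkCmdPipelineBarrier(cmdBuf,
    VK_PIPELINE_STAGE_VERTEX_SHADER_BIT,    /* scopes srcAccessMask: must include the writing stage */
    VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,  /* scopes dstAccessMask: must include the reading stage */
    0, 1, &mb, 0, NULL, 0, NULL);
vkCmdDraw(cmdBuf, vertexCount, 1, 0, 0);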

@ghost commented Jun 28, 2016

@pimanttr Ohhhh that's where I got confused. It's the act of making "visible"; it makes anything that has been previously (in the chain) made available, visible - irrespective of the source stage masks. The source stage mask just informs the things made available by that particular barrier.

@ghost commented Jun 28, 2016

Wait but doesn't that mean we always have to flush to RAM when something is made available? ...and always invalidate caches when making things visible?

I thought we wanted to allow implementations to not flush if the src and dst stage/access combination didn't require it? A driver could potentially still detect this in an execution dependency chain, but it would mean tracking what's available/visible and what's not, or a command buffer/render pass compilation step doing the same. That seems like a huge burden, given that this is potentially per-subresource.

I suppose the alternative would be to have the application do tracking... but wouldn't that be better (an app can just know this without actually having to do tracking)? I'm now worried we discussed this internally and I just wasn't up to speed on synchronisation enough at the time to understand the ramifications of this :/

I'm kind of hoping I'm still missing part of the picture here...

@philiptaylor (Contributor, Author)

I thought we wanted to allow implementations to not flush if the src and dst stage/access combination didn't require it?

With the current spec, regardless of how you interpret stage masks, if you do something like

vkCmdDraw(...);
vkCmdPipelineBarrier(
  srcStageMask = VERTEX_SHADER,
  dstStageMask = FRAGMENT_SHADER,
  pMemoryBarriers = { srcAccessMask = SHADER_WRITE, dstAccessMask = SHADER_READ }
);
vkCmdDraw(...);
vkCmdPipelineBarrier(
  srcStageMask = BOTTOM_OF_PIPE,
  dstStageMask = HOST,
  pMemoryBarriers = { srcAccessMask = 0, dstAccessMask = HOST_READ }
);
vkCmdSetEvent(stageMask = HOST);
...

then I expect an implementation with unified shader caches would want to omit the vertex-to-fragment cache flush/invalidate since that's a noop in their hardware - but I believe there is still a memory dependency whose first half is VERTEX_SHADER/SHADER_WRITE and whose second half is HOST/HOST_READ, so the device still needs to flush the shader cache at some point before the vkCmdSetEvent.

(I'm assuming a vkCmdPipelineBarrier's half-memory-dependencies can participate in multiple memory dependencies across multiple execution dependency chains, not just one - I think too much stuff would break if that wasn't true.)

I think that means the driver either must flush the shader cache on the first pipeline barrier; or it must do enough tracking to let it delay the flush until the second pipeline barrier makes the memory visible to a stageMask/accessMask that's not coherent with the shader cache.

You could avoid that problem if memory dependencies were indivisible and couldn't be split across an execution dependency chain - that should make it trivial to examine a single barrier and see if it can be optimised away because the caches on both sides are in the same coherency domain. But the spec does allow them to be split.

(an app can just know this without actually having to do tracking)

(A general concern here: Based on current evidence (i.e. everybody is writing buggy applications, and only approximately two people are complaining about the spec being highly unclear and/or broken), I have zero confidence in application developers caring enough about synchronisation correctness to put enough effort into getting it right. If the validation layers don't detect bugs, developers won't fix them. If this tracking is infeasible to handle automatically in the drivers, and if the application developers are going to be given more responsibility, I believe it really needs to be designed to be fully automatically trackable by a validation layer so it can tell application developers whenever they do it wrong.)

@jeffbolznv (Contributor)

I see that Tobias has directed a few comments at me, but I haven't been able to catch up on the bug. If there are any remaining questions for me (after this post), please summarize them and I'll try to respond.

then I expect an implementation with unified shader caches would want to omit the vertex-to-fragment cache flush/invalidate since that's a noop in their hardware - but I believe there is still a memory dependency whose first half is VERTEX_SHADER/SHADER_WRITE and whose second half is HOST/HOST_READ, so the device still needs to flush the shader cache at some point before the vkCmdSetEvent.

During spec development, we had a long discussion about whether dependencies are "transitive" (this was in internal bugzilla 15158). We eventually resolved that they are transitive, and this led to the current definition of memory dependencies being formed by a "first half" making writes available, an execution dependency chain, and then a "second half" makes available writes visible. One consequence of this is that an implementation can't completely omit a barrier just because the stages/caches involved are coherent with each other.

@ghost commented Jun 28, 2016

@jeffbolznv I think we figured everything out. I just wanted you to look over it tbh. But yeah, I remember the transitive discussion and I remember having no idea what was going on at the time. Now that I actually understand the problem, I think I would have participated more in that discussion - but never mind.

Either way, I at least have more clarity on how this is all supposed to fit together now.

@ghost commented Nov 15, 2016

The broad scope of the rewrite is now done - it'll be a week or two before the final review is complete, and then it should land in 1.0 and that should finally close this issue :)

@oddhack (Contributor) commented Nov 26, 2016

This should be fixed in the 1.0.35 spec update, as part of the synchronization language rewrite.

oddhack closed this as completed Nov 26, 2016