Cleanup DrawBatch.Append/JIT Logic #395

ilexp · 2016-09-16T15:25:46Z

Summary

The current Append logic for draw batches is a bit of a mess, partially for compatibility reasons, partially because there hasn't been a lot of engineering in there for a while. Clean it up, refactor and improve performance where possible.

Analysis

Make sure appending two batches does not put the resulting one into the Large Object Heap.
Also, consider introducing a proper size limit for batches, so we don't get the overhead of copying large arrays.
Check if an array resize needs to occur for merging two batches. If not, the merge is essentially free and can be done regardless of array size considerations.
Is there a reasonable way to streamline append logic for better performance, especially AppendJIT?

ilexp · 2016-09-16T15:40:30Z

Might also do an overall refactoring of the IDrawBatch / DrawBatch<T> pair with an eye on potential performance improvements.

Keep the interface for extensibility reasons, but avoid virtual calls and the like as much as possible while dealing with only the internal batch type.
Consider removing Append logic out of the interface and only supporting it for the internal, dynamic draw batch.
Idea: Have a rendering pipeline that is fully capable to deal with any IDrawBatch, but provide optimized, faster code paths when dealing with the internal DrawBatch<T>.
Is there a way to make the standard DrawBatch a struct and avoid boxing / allocation for them entirely?

ilexp · 2017-01-24T17:03:13Z

This might be easier to do after issue #487 is finished, since that will likely involve moving batching logic into its own class.

ilexp · 2017-08-30T21:07:59Z

After findings of heavy memory allocation in issue #552 mainly due to dynamic batching, reducing or preventing allocations as part of gathering vertex data / batches should be a priority.

One way to get around this would be to transform AddVertices calls into a form that "rents" a segment from a pre-existing internal array of the appropriate type, rather than registering and merging one that the renderer allocated. Ownership of dynamic vertex data would then be in IDrawDevice, as opposed to ICmpRenderer.

It will be very important to have fast retrieval of a segment of the appropriate type. Potentially have a compile-time mapping from T to an index for a lookup or similar.
Ideally, we'd use Span<T>, but this will have to wait until Duality v4.0, as this is way too experimental right now to make it into the stable main release soon. Come up with something custom or use ArraySegment<T>, to be considered.
This will probably make CanvasBuffer obsolete.
Make sure to compare benchmark and profiler results before and after.

ilexp · 2017-08-31T09:15:18Z

Rough draft on how this could be approached, probably subject to change:

Create a DynamicVertexStore class that encapsulates one array per vertex type and provides "renting" and retrieval functionality. Could probably use some tests for this.
Transform DrawBatch<T> to no longer store an array locally, but instead store the rented part of the DynamicVertexStore.
Transform the rendering system so that all contents of the dynamic vertex store are uploaded at once, prior to rendering, one internal vertex buffer per vertex type. DrawBatch<T> upload functionality will be empty. Rendering will be vertex setup and GL.DrawArrays, but double-check whether the vertex setup is really necessary for rendering, or whether that's somehow stored in the VBO.
Consider removing upload functionality from IDrawBatch altogether, remove then-obsolete IVertexUploader as well.

Up to this point, most of the public drawing API can still stay the same, with AddVertices still receiving an array and just copying it over to a rented region of the DynamicVertexStore. Going forward, an update to the draw device API to leverage the new data flow would be required, introducing changes all over the codebase.

Transform IDrawDevice into a form where AddVertices doesn't accept vertex data anymore, but instead "rents" a segment of some internal array.
Update all the points in code where this becomes relevant.
Remove the now-obsolete CanvasBuffer.

After this change, there should be next to no allocations anymore in the rendering cycle and way less copying due to dynamic batching.

One thing that still needs to be considered is how to handle very large, mostly static vertex arrays where a re-generate and re-upload every frame would be wasteful and unnecessary. This use case is not currently implemented and can be considered a new feature, but it would be neat if that option existed. Probably go into more detail as part of issue #487.

ilexp · 2017-08-31T17:29:42Z

Progress

Created the develop-3.0-vertexgather branch to work on this.
Started experimenting and implementing parts of this change. Previous benchmark sample report, for later reference: BenchmarkTestReport.txt
Implemented a VertexSegment<T> struct that will be used for "renting" parts of a shared vertex array. Investigated how to achieve maximum performance when accessing it and optimized for inline friendliness, low (JIT-)compiled instruction counts and low complexity at the cost of safety. Documented this.
Found that type erasure will be an implementation issue to be solved, since the DynamicVertexStore would end up with an object[], each being a different generic RawList<T> with no way to get back to the actual types. This is not a problem when gathering vertices, as each renderer call specifies the type as a generic parameter, but it will be a problem when uploading vertices later. There will likely be the need for a generic-class-and-non-generic-interface combo.
To simplify things a bit, transformed IVertexUploader API from a generic one requiring the vertex type into a non-generic one that accepts an IntPtr, as type information beyond what VertexDeclaration offers isn't actually needed for this.

For an unknown reason, this led to a notable speedup in rendering:
BenchmarkTestReport.txt. So far, everything seems to work as usual, not sure why the IntPtr vs. T[] change would make a difference like that.

Edit: Can't reproduce it in a clean OpenTK sample at this commit. It's probably something elsewhere in Duality that caused this time drop.

Edit: Can't reproduce the slower version anymore even when going back to even before the last benchmark report. No idea why it was slower before - maybe interference from running applications, GPU, something else.

Immediate ToDo

Create a DynamicVertexStore class that encapsulates one array per vertex type and provides "renting" and retrieval functionality. Could probably use some tests for this.
Transform DrawBatch<T> to no longer store an array locally, but instead store the rented part of the DynamicVertexStore.
Transform the rendering system so that all contents of the dynamic vertex store are uploaded at once, prior to rendering, one internal vertex buffer per vertex type. DrawBatch<T> upload functionality will be empty. Rendering will be vertex setup and GL.DrawArrays, but double-check whether the vertex setup is really necessary for rendering, or whether that's somehow stored in the VBO.
Consider removing upload functionality from IDrawBatch altogether, remove then-obsolete IVertexUploader as well.
Transform IDrawDevice into a form where AddVertices doesn't accept vertex data anymore, but instead "rents" a segment of some internal array.
Update all the points in code where this becomes relevant.
Remove the now-obsolete CanvasBuffer.

ilexp · 2017-09-01T17:36:52Z

Progress

Introduced VertexBatchStore, VertexBatch and IVertexBatch types, as well as a first implementation.
Wrote tests for the above types basic functionality.
Introduced PinnedArrayHandle struct as a general helper for safely pinning and unpinning managed arrays for direct IntPtr data access.
Renamed VertexSegment to VertexSlice.
Added debugger display rules for VertexSlice.
Wrote XML docs for VertexDeclaration, VertexElement, VertexBatchStore, VertexBatch, IVertexBatch and PinnedArrayHandle.

Immediate ToDo

Transform DrawBatch<T> and DrawDevice so that incoming vertices are stored in a VertexBatchStore in the device, and each batch only has a token referring to vertex declaration, index and count.
Upload all vertices in the VertexBatchStore at once, prior to rendering.
Removing upload functionality from IDrawBatch altogether.
Remove IVertexUploader and replace it with regular graphics backend API.
Transform IDrawDevice into a form where AddVertices doesn't accept vertex data anymore, but instead "rents" a segment of some internal array.
Update all the points in code where this becomes relevant.
Remove the now-obsolete CanvasBuffer.

ilexp · 2017-09-02T19:27:14Z

Progress

Transformed DrawBatch<T> and DrawDevice so that incoming vertices are stored in a VertexBatchStore in the device, and each batch only has a token referring to vertex declaration, index and count.
Now uploading all vertices at once, rather then per-batch.
Removed upload functionality from IDrawBatch altogether.
Removed IVertexUploader, as the graphics backend can now do this itself using the VertexBatchStore it receives in BeginRendering.
Prototyped a transformation of the IDrawDevice API so that AddVertices provides a vertex slice instead of accepting a vertex array, but found it to be unsatisfying overall.
- The API felt a lot harder to use, more confusing. AddVertices calls required more parameters and line length.
- Code defining vertices was a lot less readable. No direct array index access with a slice, can't group lines by vertex attribute anymore.
- With the new API, there is no way for someone giving out a vertex slice to perform an operation on the vertices since they're not defined yet. This is especially painful for picking overrides in DrawDevice.
- Performance gains were negligible (< 0.2 ms)
Reverted the draw device API prototype, a patch file containing its previous WiP state is available here, ideally apply at this commit.
New benchmark results are available here: BenchmarkTestReport.txt
- Performance is about the same as before.
- Allocations and GC pressure were greatly reduced.

Immediate ToDo

Investigate the glitch / bug in rendering hover highlights in the Scene Editor and potentially other shapes.
Investigate a potential bug reported by DeathKiller on the chat, repro case is running the SmoothAnimation sample:

Is it my fault or it renders VertexC1P3T4A1 (SmoothAnim) vertices always as (glitchy) fullscreen quads with latest "develop-3.0-vertexgather" commit? and the commit before (c94c7cb) doesn't render them at all 😄 thanks anyway, it fixed my previous issue 😃

Review the changes of the develop-3.0-vertexgather-wip branch and do some polishing when and where necessary.
Do some more profiling to see where the new allocation and performance hotpaths are.
Merge develop-3.0-vertexgather-wip back to develop-3.0

ilexp · 2017-09-02T20:18:47Z

Progress

Fixed the bug that was reported by DeathKiller, which turned out to be the same bug as the one that produced glitchy hover highlights in the editor.
Checked DualStickSpaceShooter sample and found another rendering glitch.

Immediate ToDo

Investigate the DualStickSpaceShooter glitch where the controller input info sprite seems to get mixed up with some background sprite while moving.
- Both probably have the same vertex type, but different materials. Could be a merge bug?
- It still happens when removing the entire level except two Background "Backdrop4" objects, one Geometry "Backdrop4" objects and the two ControlInfo sprites.
- The two Background "Backdrop4" sprites are actually missing on their actual position and also missing their vertex color, which means their batches are probably using the wrong vertices.
Check the major rendering-related sample projects to make sure they're still working as expected.
Review the changes of the develop-3.0-vertexgather-wip branch and do some polishing when and where necessary.
Do some more profiling to see where the new allocation and performance hotpaths are.
Merge develop-3.0-vertexgather-wip back to develop-3.0

ilexp · 2017-09-02T20:55:05Z

Progress

Fixed the bug that was found in the DualStickSpaceShooter sample. Turned out to be a missing check that sometimes led to JIT batching even though the vertices were disjoint in the internal vertex store.
Checked all rendering-related samples. Everything seems to work as expected.

Immediate ToDo

Do some more profiling to see where the new allocation and performance hotpaths are.
Investigate whether batching can be streamlined a bit more. Feels like there's still a lot of overhead right now.
Review the changes of the develop-3.0-vertexgather-wip branch and do some polishing when and where necessary.
Merge develop-3.0-vertexgather-wip back to develop-3.0

ilexp · 2017-09-03T15:46:27Z

Progress

Removed IDrawBatch abstraction, as there's nothing useful to abstract right now. The graphics backend now accepts a list of DrawBatch instances directly.
A DrawBatch can now express rendering multiple disjoint slices of a vertex buffer as long as there is no state change.
Rewrote DrawDevice vertex gathering code to only generate DrawBatch instances on the very last step and use private arrays-of-structs before that for reduced allocation count and improved efficiency.
Simplified IDrawDevice API and moved the previous overloads to extension methods.
New benchmark report is available here: BenchmarkTestReport.txt
- Slight performance improvements.
- Cut worst-case allocations / GCs in half again.

Immediate ToDo

Deal with the allocation hotspot when aggregating the internal raw data into batch instances that are passed to the graphics backend.
- In the multi-material sprite benchmark, 80% of all allocated memory s are VertexDrawRange[].
- Could solve both this and DrawBatch allocation by introducing a pool class (rent / return) and pooling DrawBatch instances locally in every DrawDevice instance.
- Might need a specialized pool, as batches with a matching vertex drawing range length are to be preferred.
- Take a look at CanvasBuffer for reference.
Consider introducing IDrawDevice API that provides access to the rented vertex slice directly and avoid the extra copy, rather than accepting a T[] - only this time as something optional, not "the" way to go.
Conclude the issue by doing some more testing on samples.
Briefly review the changes of the develop-3.0-vertexgather-wip branch and do some polishing when and where necessary.
Merge develop-3.0-vertexgather-wip back to develop-3.0

ilexp · 2017-09-04T09:42:22Z

Progress

Introduced some additional ReflectionHelper methods to allow quickly determining whether a type can contain any references.
Optimized RawList<T> clear and remove behavior to not actually clear the internal arrays now off-limits sections unless it's a reference type.
Introduced a simple RawListPool<T> class for local, single-owner pooling.
Re-implemented CanvasBuffer as a thin wrapper around RawListPool<VertexC1P3T2>.
DrawDevice now pools the vertex range lists that are used in batch generation internally to not allocate all that memory every frame.
New benchmark results are available here: BenchmarkTestReport.txt
- About the same performance as before. Seems slightly faster, but not significantly.
- Reduced allocations / GC pressure in worst-case scenario to about 25%
- Very first pre-v3.0-optimization report for reference: BenchmarkTestReportPreV3.txt

Immediate ToDo

Consider introducing IDrawDevice API that provides access to the rented vertex slice directly and avoid the extra copy, rather than accepting a T[] - only this time as something optional, not "the" way to go.
- Check potential perf improvements in the multi-sprite renderer benchmark.
- If they're significant enough, apply them to the tilemap renderer as well.
Conclude the issue by doing some more testing on samples.
Briefly review the changes of the develop-3.0-vertexgather-wip branch and do some polishing when and where necessary.
Merge develop-3.0-vertexgather-wip back to develop-3.0

ilexp · 2017-09-17T11:59:02Z

Progress

Reviewed potential new slice-based IDrawDevice API and profiled the multi-sprite renderer case with no significant performance improvements. Discarding the idea for now.
Merged develop-3.0-vertexgather-wip back to develop-3.0

ilexp added Task ToDo that's neither a Bug, nor a Feature Performance Related to runtime or editor performance Breaking Change Breaks binary or source compatibility Cleanup Improving form, keeping function Rendering Related to rendering / graphics labels Sep 16, 2016

ilexp added this to the v3.0 milestone Sep 16, 2016

ilexp mentioned this issue Jan 24, 2017

Introduce Explicit Vertex Buffers to the Graphics Backend Interface #487

Closed

ilexp mentioned this issue Aug 30, 2017

Investigate Memory Pressure in Rendering Benchmarks #552

Closed

ilexp self-assigned this Aug 31, 2017

ilexp closed this as completed Sep 17, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cleanup DrawBatch.Append/JIT Logic #395

Cleanup DrawBatch.Append/JIT Logic #395

ilexp commented Sep 16, 2016 •

edited

Loading

ilexp commented Sep 16, 2016 •

edited

Loading

ilexp commented Jan 24, 2017

ilexp commented Aug 30, 2017 •

edited

Loading

ilexp commented Aug 31, 2017

ilexp commented Aug 31, 2017 •

edited

Loading

ilexp commented Sep 1, 2017

ilexp commented Sep 2, 2017 •

edited

Loading

ilexp commented Sep 2, 2017 •

edited

Loading

ilexp commented Sep 2, 2017 •

edited

Loading

ilexp commented Sep 3, 2017 •

edited

Loading

ilexp commented Sep 4, 2017 •

edited

Loading

ilexp commented Sep 17, 2017

Cleanup DrawBatch.Append/JIT Logic #395

Cleanup DrawBatch.Append/JIT Logic #395

Comments

ilexp commented Sep 16, 2016 • edited Loading

Summary

Analysis

ilexp commented Sep 16, 2016 • edited Loading

ilexp commented Jan 24, 2017

ilexp commented Aug 30, 2017 • edited Loading

ilexp commented Aug 31, 2017

ilexp commented Aug 31, 2017 • edited Loading

Progress

Immediate ToDo

ilexp commented Sep 1, 2017

Progress

Immediate ToDo

ilexp commented Sep 2, 2017 • edited Loading

Progress

Immediate ToDo

ilexp commented Sep 2, 2017 • edited Loading

Progress

Immediate ToDo

ilexp commented Sep 2, 2017 • edited Loading

Progress

Immediate ToDo

ilexp commented Sep 3, 2017 • edited Loading

Progress

Immediate ToDo

ilexp commented Sep 4, 2017 • edited Loading

Progress

Immediate ToDo

ilexp commented Sep 17, 2017

Progress

ilexp commented Sep 16, 2016 •

edited

Loading

ilexp commented Sep 16, 2016 •

edited

Loading

ilexp commented Aug 30, 2017 •

edited

Loading

ilexp commented Aug 31, 2017 •

edited

Loading

ilexp commented Sep 2, 2017 •

edited

Loading

ilexp commented Sep 2, 2017 •

edited

Loading

ilexp commented Sep 2, 2017 •

edited

Loading

ilexp commented Sep 3, 2017 •

edited

Loading

ilexp commented Sep 4, 2017 •

edited

Loading