1. Binary Search on GPU
Let t[i], 0 <= i <= n, be an array of "times" with n = 2^k for
some k and 0.0 = t[0] <= t[1] <= t[2] <= ... <= t[n] = 1.0.
Given a value s with 0.0 <= s < 1.0, find i so that
t[i] <= s < t[i + 1] using the GPU.
int
find_index(float s)
{
    int begin = 0;
    int end = n;
    /* each pass halves [begin, end]; after k passes end == begin + 1 */
    for (int i = 0; i < k; ++i)
    {
        int current = (begin + end) / 2;
        if (s >= t[current])
        {
            begin = current;
        }
        else
        {
            end = current;
        }
    }
    return begin;
}
2. Clipping by convex polygons and rects. Without anti-aliasing,
clipOutConvexPolygon can be done by drawing an occluder. Without
anti-aliasing, clipInRect and clipInConvexPolygon can also
be done by drawing an occluder. The performance issue is that the
occluder might be HUGE. A simple example is drawing a table of
cells where each cell is clipped. In this example, each
cell's clip is an occluder over its complement, which is quite
large.
3. Rework offscreen region allocator. The main issue is that it is
not very fast and the way it allocates space does not allow
using a smaller offscreen target. The idea is to use a shelf
algorithm together with the ability to swap x/y coordinates so
that the allocator can always assume that width >= height.
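A minimal sketch of the shelf idea, not the allocator itself. The names ShelfAllocator and Placement are hypothetical, and the first-fit shelf choice is an assumption; the notes only say "shelf algorithm" plus the x/y swap.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Where a rectangle landed in the target. If `swapped` is true, the caller
// stores the rectangle transposed (its x and y axes exchanged).
struct Placement
{
    int x, y;      // min-corner inside the offscreen target
    int w, h;      // size as stored, always w >= h
    bool swapped;  // true if the request was transposed to get w >= h
};

// Hypothetical shelf allocator: shelves are horizontal rows stacked in
// increasing y, each one filled from left to right.
class ShelfAllocator
{
public:
    ShelfAllocator(int width, int height) : m_width(width), m_height(height) {}

    // Returns false if the rectangle does not fit.
    bool allocate(int w, int h, Placement *out)
    {
        assert(w > 0 && h > 0 && out != nullptr);
        bool swapped = w < h;
        if (swapped)
            std::swap(w, h);
        if (w > m_width)
            return false;

        // First fit: reuse the first shelf that is tall enough and has room.
        for (std::size_t i = 0; i < m_shelves.size(); ++i)
        {
            Shelf &s = m_shelves[i];
            if (h <= s.height && s.used + w <= m_width)
            {
                Placement p = {s.used, s.y, w, h, swapped};
                s.used += w;
                *out = p;
                return true;
            }
        }

        // Otherwise open a new shelf after the existing ones.
        if (m_next_y + h > m_height)
            return false;
        Shelf s = {m_next_y, h, w};
        m_shelves.push_back(s);
        Placement p = {0, m_next_y, w, h, swapped};
        m_next_y += h;
        *out = p;
        return true;
    }

    // Height of the target consumed so far; a smaller offscreen target
    // only needs to be this tall.
    int used_height() const { return m_next_y; }

private:
    struct Shelf { int y, height, used; };
    int m_width, m_height;
    int m_next_y = 0;
    std::vector<Shelf> m_shelves;
};
```

Because every request is normalized to width >= height, tall thin rects end up lying flat on a shelf instead of forcing a tall, mostly empty shelf.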
4. Sparse offscreen stroking. Currently when rendering strokes
we generate multiple VirtualBuffer objects and render what
stroking primitives hit it. Instead, we should have a single
VirtualBuffer, but return the unused tiles (unioned up in
the same fashion that we currently union up what is hit).
A more aggressive return strategy would be to give TileAllocator
a public interface to return unused regions of allocated
tiles, which it would internally union up. Lastly, to
make sure that there is no rendering leakage, we would
draw the masks AFTER drawing the color buffers and at
the end of drawing the color buffers, we would cap all
color buffers with a depth-value of occlude always.
5. Sparse offscreen filling. Following the same idea of having
a single VirtualBuffer but returning the regions that
are not touched by partial tiles. However, there are more
details:
a. For contours that are small in one of their dimensions
(i.e. no more than 3 tiles), instead of mapping and clipping
them, just add their STC data to the VirtualBuffer and all of
the tiles that are hit by it are fully backed. The "is hit"
decision would be by using astral::TransformedBoundingBox,
which would be given a method to return the length of its
edges.
b. For "biggish contours", we run the clipping to see what
tiles are hit. However, because we no longer have that
each tile has padding, we only need to clip against the
lines H_n = { (x, 60n) | x real } and V_n = { (60n, y) | y real }.
We can get efficient clipping by first breaking each curve
C of the contour into pieces where each piece is on a
specific side of each H_n and V_n, i.e. curves get clipped
once. From there, we can quickly compute what tiles are
partial tiles. The actual clipped contour generation would
simply walk the clipped curves instead of the original
curve list to generate the contour to run on a tile.
Tiles that have no original curves walking through them
would, as before, have an internal winding number on
them incremented. At the end of data generation, if the
fill rule is odd-even, we would add N % 2 rects to draw where
N is the winding number, but for non-zero, we would add
N rects.
c. We could, in theory, avoid streaming the anti-aliasing
data by drawing the original anti-alias fuzz to the large
render target. To make sure of no leakage, the rendering
needs to be protected always by depth buffer capping.
A simple way to do this is to first render the color virtual
buffers and then cap them. This will prevent any drawing
to them period by the masks. To reduce the area covered,
we should also tessellate the curves for filling also
according to area.
d. It is tempting to generate the STC data this way as well,
but that would eat oodles of fill-rate. It is
likely best to continue to clip the contour to produce
the STC passes.
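The "clip each curve once" step of 5.b can be sketched for line segments only; quadratic curves would need a root solve per grid line, which is not shown. The function names are hypothetical, and the tile size of 60 is taken from the H_n and V_n lines above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct Point { float x, y; };

// Tile size taken from the lines H_n and V_n in 5.b.
const float kTileSize = 60.0f;

// Collect the parameters t in (0, 1) at which the coordinate
// a + t * (b - a) crosses a multiple of kTileSize.
static void add_crossings(float a, float b, std::vector<float> *ts)
{
    assert(ts != nullptr);
    if (a == b)
        return;
    float lo = std::min(a, b), hi = std::max(a, b);
    for (float n = std::floor(lo / kTileSize) + 1.0f; n * kTileSize < hi; n += 1.0f)
    {
        ts->push_back((n * kTileSize - a) / (b - a));
    }
}

// Split the segment [p, q] at every H_n and V_n it crosses. Returns the
// polyline p = r[0], r[1], ..., r[m] = q; each piece [r[i], r[i + 1]]
// lies on one side of every grid line, i.e. within a single tile.
std::vector<Point> split_at_tile_lines(Point p, Point q)
{
    std::vector<float> ts;
    ts.push_back(0.0f);
    add_crossings(p.x, q.x, &ts);
    add_crossings(p.y, q.y, &ts);
    ts.push_back(1.0f);
    std::sort(ts.begin(), ts.end());
    // A crossing through a grid corner yields the same t twice.
    ts.erase(std::unique(ts.begin(), ts.end()), ts.end());

    std::vector<Point> r;
    for (float t : ts)
    {
        Point pt = {p.x + t * (q.x - p.x), p.y + t * (q.y - p.y)};
        r.push_back(pt);
    }
    return r;
}

// Tile holding the piece [a, b], found from the piece's midpoint.
void tile_of_piece(Point a, Point b, int *tx, int *ty)
{
    assert(tx != nullptr && ty != nullptr);
    *tx = static_cast<int>(std::floor(0.5f * (a.x + b.x) / kTileSize));
    *ty = static_cast<int>(std::floor(0.5f * (a.y + b.y) / kTileSize));
}
```

Walking the pieces and recording tile_of_piece() for each gives the set of partial tiles without clipping the contour once per tile.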
6. Likely not a good idea to reset the UBO buffer pool on each render
target, as reusing a BO would force a pipeline stall to
wait until the GPU is done with it. Though, given
that at the end of each render target we need to blit the renders
to the ImageAtlas anyways, the GPU pipeline will need to stall
regardless.
7. IDEA: Path union, intersection in STC.
A. When filling, a VirtualBuffer will have "boundaries" in its
STC data indicating the start of a new combined path, several
types of markers:
1. End: this means run the cover pass (but not anti-aliasing)
to not only set .r channel to indicate covered but to also
clear the stencil
2. ComplementEnd: run the cover pass as for End and then also
draw another rect that complements the .r channel using
a fragment shader that emits .r = 1 with GL_FUNC_SUBTRACT
and coefficients (GL_ONE, GL_ONE), because then the blender
does D <-- GL_ONE * F - GL_ONE * D and F is 1, thus it does
D <-- 1 - D, which is the inversion we want.
B. P1 union P2 would just have an End marker between them.
C. P1 intersect P2 would be fill P1 with the complement of its
fill rule, then an End between P1 and P2, fill P2 with its
complement fill rule as well and then P2 ends with a ComplementEnd
D. We can then support any sequence of such path building, but
an arbitrary set expression is likely not feasible.
E. Modify the distance field data: kill one of the "raw coverage" values,
or ideally both, and in the post-process step have a channel as
0 or 1 indicating it is a boundary texel.
F. Drawing anti-alias fuzz is delayed completely until after STC.
G. Stroking idea that may or may not work: take the mask generated
from (E). In the fragment shader when generating the mask for
stroking, go to the point along the path that generated the fragment
(this is easy to compute if the fragment shader knows enough of
the geometry for the primitive). Sample the path-intersection mask
and if it says it is on the boundary, then emit as usual. If not
on the boundary, emit covered as 0 and distance as maximum.
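A CPU model of the marker semantics in A through C, as a sanity check that the complement trick gives an intersection (De Morgan). It models only the .r channel per pixel; the stencil, the anti-aliasing and all function names are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<float> Channel;  // the .r channel, one value per pixel
typedef std::vector<bool> Coverage;  // what the stencil reports as covered

// End marker: the cover pass writes 1 to .r wherever the path just
// stencilled is covered (the stencil clear is not modelled here).
void end_marker(Channel &r, const Coverage &covered)
{
    assert(r.size() == covered.size());
    for (std::size_t i = 0; i < r.size(); ++i)
    {
        if (covered[i])
            r[i] = 1.0f;
    }
}

// ComplementEnd marker: cover pass as for End, then a rect with F = 1
// blended with GL_FUNC_SUBTRACT and (GL_ONE, GL_ONE), which gives
// D <-- 1 * F - 1 * D = 1 - D.
void complement_end_marker(Channel &r, const Coverage &covered)
{
    end_marker(r, covered);
    for (std::size_t i = 0; i < r.size(); ++i)
    {
        r[i] = 1.0f * 1.0f - 1.0f * r[i];
    }
}

// Filling with the complement of the fill rule covers the other pixels.
Coverage complement(const Coverage &c)
{
    Coverage out(c.size());
    for (std::size_t i = 0; i < c.size(); ++i)
        out[i] = !c[i];
    return out;
}

// B: P1 union P2 is P1, End, P2, End.
Channel path_union(const Coverage &p1, const Coverage &p2)
{
    assert(p1.size() == p2.size());
    Channel r(p1.size(), 0.0f);
    end_marker(r, p1);
    end_marker(r, p2);
    return r;
}

// C: P1 intersect P2 is complement(P1), End, complement(P2), ComplementEnd.
Channel path_intersect(const Coverage &p1, const Coverage &p2)
{
    assert(p1.size() == p2.size());
    Channel r(p1.size(), 0.0f);
    end_marker(r, complement(p1));
    complement_end_marker(r, complement(p2));
    return r;
}
```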
8. IDEA: support 3D scenes. There would be a special "3DSCENE"
group that would specify an ENTIRE 3D scene. Within such a
group, there would be stuff to render, in 3D. Items within
a group would have different vertex and fragment shader entry
functions: the vertex shader would emit the analogue of
gl_Position and the fragment shader would only emit a color
value. When a 3D scene is requested, we allocate a range of
Z-values that it will consume to do depth testing; since we
are not rendering games or sophisticated things, the range
would be around 16-bits wide. This would allow us to still
use the depth-buffer occlusion in rendering and have many
3D scenes (though too many scenes would sink this still).
B. The caller of the vertex shader would MODIFY the output
clip-vertex value as follows:
1. Let [minX, maxX]x[minY, maxY] be the viewport in -normalized-
coordinates of the current virtual buffer. Let (A, B) and
(C, D) be the numbers so that
x --> Ax + B maps [-1, 1] to [minX, maxX]
y --> Cy + D maps [-1, 1] to [minY, maxY]
and we do
gl_Position.x = A * vertex_shader.x + B * vertex_shader.w
gl_Position.y = C * vertex_shader.y + D * vertex_shader.w
2. Let [minZ, maxZ] be the depth-buffer range allocated to the
3D scene and let (E, F) be the numbers so that
z --> Ez + F maps [-1, 1] to [minZ, maxZ]
and we do
gl_Position.z = E * vertex_shader.z + F * vertex_shader.w
3. We add the clipping planes:
i. minX * gl_Position.w <= gl_Position.x <= maxX * gl_Position.w
ii. minY * gl_Position.w <= gl_Position.y <= maxY * gl_Position.w
iii. minZ * gl_Position.w <= gl_Position.z <= maxZ * gl_Position.w
C. As an alternative to doing B., we could instead call
glViewport, glScissor and glDepthRange. This avoids the silliness
of adding more hardware clip planes and allows for better GPU optimizations,
but some GPUs lose performance when glViewport and glDepthRange
are called. Lastly, this makes it impossible to walk across different
virtual buffers to draw scenes to reduce shader changes.
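The remapping of B can be written out on the CPU to check the coefficients. For a map v --> s * v + o taking [-1, 1] to [lo, hi], s = (hi - lo) / 2 and o = (hi + lo) / 2. The struct and function names below are hypothetical; only the formulas come from the notes.

```cpp
#include <cassert>

struct Vec4 { float x, y, z, w; };

// Scale and offset so that v --> scale * v + offset maps [-1, 1] to [lo, hi].
struct Affine { float scale, offset; };

Affine map_unit_range(float lo, float hi)
{
    assert(lo <= hi);
    Affine m = {0.5f * (hi - lo), 0.5f * (hi + lo)};
    return m;
}

// Hypothetical description of where a 3D scene lands.
struct SceneWindow
{
    float minX, maxX, minY, maxY;  // viewport in normalized coordinates
    float minZ, maxZ;              // depth range allocated to the scene
};

// Steps B.1 and B.2 applied to the clip-space position emitted by the
// scene's vertex shader. The offsets are multiplied by w so that they
// become plain offsets after the perspective divide.
Vec4 remap_clip_position(const Vec4 &v, const SceneWindow &win)
{
    Affine X = map_unit_range(win.minX, win.maxX);  // (A, B)
    Affine Y = map_unit_range(win.minY, win.maxY);  // (C, D)
    Affine Z = map_unit_range(win.minZ, win.maxZ);  // (E, F)
    Vec4 out = {X.scale * v.x + X.offset * v.w,
                Y.scale * v.y + Y.offset * v.w,
                Z.scale * v.z + Z.offset * v.w,
                v.w};
    return out;
}

// Step B.3: the extra clipping planes, evaluated for a single position.
bool inside_window(const Vec4 &p, const SceneWindow &win)
{
    return win.minX * p.w <= p.x && p.x <= win.maxX * p.w
        && win.minY * p.w <= p.y && p.y <= win.maxY * p.w
        && win.minZ * p.w <= p.z && p.z <= win.maxZ * p.w;
}
```

The extra planes of B.3 are needed because the fixed-function clip against [-w, w] only bounds the result to the whole render target, not to the scene's sub-window.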
9. sRGB support via HW support and GL extensions. Currently, we have that
a render emits sRGB or linear values and that an astral::Image is encoded
to store sRGB or linear values. Steps:
a. Make the format of the image atlas as GL_SRGB8_ALPHA8
b. Bind the image atlas to two different image units where
- decoded: no sampler, this will convert the sRGB stored
values into linear
- raw: with a sampler using GL_EXT_texture_sRGB_decode
to set the sampler with GL_TEXTURE_SRGB_DECODE_EXT
having GL_SKIP_DECODE_EXT
c. when rendering virtual buffers, note that we have the following:
- mask rendering
i. means fragment shader emits linear values
ii. want to store the exact value that shader emitted
- srgb rendering
i. means fragment shader emits srgb values
ii. want to store the exact value that shader emitted
- linear rendering
i. means fragment shader emits linear values
ii. want the HW to convert the linear to sRGB and store the sRGB value
d. when rendering, render the linear color buffers with
glEnable(GL_FRAMEBUFFER_SRGB) and render the mask and
sRGB buffers with glDisable(GL_FRAMEBUFFER_SRGB). The offscreen
render target should be a GL_SRGB8_ALPHA8 texture.
e. Sampling
- masks: sample from raw always
- color, get srgb value: sample from raw
- color, get linear value: sample from decoded
f. The main difference is that Image::set_pixels() should always
be passed sRGB values for color images and linear values
for mask values. In addition, Image::colorspace() and
ImageSampler::colorspace() have no role anymore.
g. The colorspace argument to astral_sample_image() will still hold
and chances are we still want the item and material shaders to
know the colorspace they are to work in.
h. when blitting the results from the render target to the atlas,
sRGB rendering should be disabled and the sampling from the render
target should have GL_TEXTURE_SRGB_DECODE_EXT set to
GL_SKIP_DECODE_EXT so that the blits are bit-copies. In addition,
the render target should be backed by an sRGB texture.
i. Comment: if GL_EXT_texture_sRGB_decode is not available, one
can use GL_ARB_texture_view instead to alias a GL_SRGB8_ALPHA8
texture as a GL_RGBA8 texture.
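A CPU model of what (c), (d) and (e) ask the hardware to do, for one color channel (alpha is never converted). The transfer functions are the standard sRGB ones; the enum and function names are hypothetical, and the rounding to 8 bits is a simplification of what a GPU does.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Standard sRGB decode: stored sRGB value in [0, 1] to linear.
float srgb_to_linear(float s)
{
    assert(s >= 0.0f);
    return (s <= 0.04045f) ? s / 12.92f
                           : std::pow((s + 0.055f) / 1.055f, 2.4f);
}

// Standard sRGB encode: linear value in [0, 1] to sRGB.
float linear_to_srgb(float l)
{
    assert(l >= 0.0f);
    return (l <= 0.0031308f) ? 12.92f * l
                             : 1.055f * std::pow(l, 1.0f / 2.4f) - 0.055f;
}

enum RenderKind { mask_rendering, srgb_rendering, linear_rendering };
enum SampleKind { sample_raw, sample_decoded };

// Value written to the GL_SRGB8_ALPHA8 texel: only linear rendering runs
// with GL_FRAMEBUFFER_SRGB enabled, so only it gets encoded on write.
unsigned char stored_value(RenderKind kind, float shader_output)
{
    float v = std::min(1.0f, std::max(0.0f, shader_output));
    if (kind == linear_rendering)
        v = std::min(1.0f, linear_to_srgb(v));
    return static_cast<unsigned char>(v * 255.0f + 0.5f);
}

// Value returned to the shader: "decoded" converts sRGB to linear,
// "raw" (GL_SKIP_DECODE_EXT) returns the stored value unchanged.
float sampled_value(SampleKind kind, unsigned char texel)
{
    float v = static_cast<float>(texel) / 255.0f;
    return (kind == sample_decoded) ? srgb_to_linear(v) : v;
}
```

In this model a mask or sRGB render followed by a raw sample returns what the shader emitted (up to 8-bit rounding), and a linear render followed by a decoded sample does the same while storing sRGB in between.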
10. Tighter clipping via geometry when HW clip planes are not available.
i. Option 1: Have something in the vertex shader that can optionally emit a screen-aligned
bounding box in the same coordinate system as the clip window. If that
box is completely clipped by the clip window, then the main vertex shader
will make the w = -1 to clip it.
ii. Option 2: Allow the vertex shader to read the clip window values and,
if the clip window would completely clip a triangle, to just emit vertices
that don't generate any geometry; alternatively, give the vertex shader
the option to emit "completely clipped" and the main() will handle
it.
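Option 1 amounts to a box-versus-box rejection test followed by forcing the vertex outside the clip volume. A small CPU sketch, with hypothetical names; zeroing x, y and z alongside w = -1 is an extra assumption, since any position with negative w already fails -w <= x <= w.

```cpp
#include <cassert>

struct Box { float minX, minY, maxX, maxY; };
struct Vec4 { float x, y, z, w; };

// True if the item's screen-aligned bounding box has no overlap with the
// clip window, i.e. the item is completely clipped.
bool completely_clipped(const Box &item, const Box &clip_window)
{
    assert(item.minX <= item.maxX && item.minY <= item.maxY);
    assert(clip_window.minX <= clip_window.maxX
           && clip_window.minY <= clip_window.maxY);
    return item.maxX <= clip_window.minX || clip_window.maxX <= item.minX
        || item.maxY <= clip_window.minY || clip_window.maxY <= item.minY;
}

// What the main vertex shader would do with the box emitted by the item's
// vertex shader: pass the position through, or replace it with one that
// the fixed-function clipper always rejects.
Vec4 cull_if_clipped(const Vec4 &position, const Box &item, const Box &clip_window)
{
    if (completely_clipped(item, clip_window))
    {
        Vec4 culled = {0.0f, 0.0f, 0.0f, -1.0f};
        return culled;
    }
    return position;
}
```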
11. A different way to do filling. This idea builds strongly off of how
Pathfinder3 performs filling: https://nical.github.io/posts/a-look-at-pathfinder.html.
Steps:
a. Pick a tile size; PathFinder has great success with 16x16, so
perhaps we start with that.
b. Do NOT use line segments only, but continue to use quadratic bezier
curves and line segments.
c. In the "binning" pass (on CPU) do the same as PathFinder3 and
compute what curves and segments hit what tiles and build a list
of curves and segments that hit each tile. Use the algorithm in
Random Access Rendering of General Vector Graphics, on the web
at https://hhoppe.com/ravg.pdf, to figure out the modification to
the winding offset value.
d. This step is where everything is different. Instead of rendering
to an fp16 buffer with additive blending, just draw a single quad
covering the tile. The fragment shader then walks the list of
curves and segments that hit it to compute the winding number.
Add the winding offset as well. In addition, instead of performing
this on a scratch buffer, we can do this directly onto the atlas
as well.
e. If we add to the curve list those curves that are within a pixel of
the tile, we can then also compute a semi-reliable distance value as we do
for glyph rendering. Another approach: since it is the
memory reads that are most expensive, we could instead track the
winding value for several sample points in the fragment shader (as
we do for the lighting shader) and use that to compute an anti-alias
value that is robust against false edges.
The main fly in the ointment is the interaction with the tile size
in the image atlas. The simplest way out is to simply draw several
times to the atlas for the filling tile size and the image atlas tile
size to match up. The other issue is to parallelize the step in (c) so
that Renderer can know the full and empty tiles immediately; though
given that it parallelizes on the level of curves, the thread join
may not be that bad really.
Another issue is that to do combining we cannot render directly to the
atlas because WebGL2 does not permit one to read from a texture and
at the same time write to it. Thus the step in (d) on WebGL2 cannot
be done directly to the atlas and must be done on the offscreen scratch
area.
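The per-fragment loop of step (d) can be sketched on the CPU for line segments only; quadratic curves would need a per-curve root solve that is not shown. The sign convention (an upward crossing to the right of the sample counts +1, with y pointing up) and all names are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Segment { float x0, y0, x1, y1; };

// Winding number at (px, py) from the segments binned to a tile plus the
// tile's winding offset, using a horizontal ray towards +x.
int winding_number(const std::vector<Segment> &segs, int winding_offset,
                   float px, float py)
{
    int w = winding_offset;
    for (std::size_t i = 0; i < segs.size(); ++i)
    {
        const Segment &s = segs[i];
        // Half-open tests in y: a vertex shared by two segments is counted
        // once, and horizontal segments are never counted.
        bool up = s.y0 <= py && py < s.y1;
        bool down = s.y1 <= py && py < s.y0;
        if (!up && !down)
            continue;
        assert(s.y0 != s.y1);
        float t = (py - s.y0) / (s.y1 - s.y0);
        float x = s.x0 + t * (s.x1 - s.x0);
        if (x > px)
            w += up ? 1 : -1;
    }
    return w;
}

// Apply the fill rule to a winding number.
bool is_covered(int winding, bool odd_even)
{
    return odd_even ? (winding & 1) != 0 : winding != 0;
}
```

Evaluating winding_number() at several sample points per fragment, as suggested in (e), reuses the same segment reads for every sample.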
12. Use/abuse transform feedback to do the binning for path filling, building
off of 11.
a. Pick a tile size; PathFinder has great success with 16x16, so
perhaps we start with that.
b. Do NOT use line segments only, but continue to use quadratic bezier
curves and line segments.
c. When a tessellation LOD is chosen, just like stroking, the error
also includes a value for the longest length of any of the curves.
In addition, the length of the longest curve is known as well.
d. Knowing the longest curve, we then know the maximum number of tiles
any curve may occupy. With that in mind, we can then use transform
feedback to do the binning. We can probably identify a single
curve of a contour with a 16-bit integer, thus a single uvec4
can record 8 curves; it looks like GL requires that transform feedback
must support at least 64 scalars, so our transform feedback shader,
when processing a single curve, could emit up to 128 separately
hit tiles. However, one scalar will be used to store the index
into the shader buffer of step f.i.
e. The big deal about (d) is that now we are using the GPU to do
the binning, not the CPU. The binning itself is what we see
in Random Access Rendering of General Vector Graphics, on the web
at https://hhoppe.com/ravg.pdf.
f. The tile incrementing is done in two steps. Firstly we have
two small one channel FP16 buffers; one for storing the raw
delta values from the binning and another for storing the sums.
Their size is one pixel per tile for a path fill. Thus their
sizes can be computed on CPU and we can reserve room that way.
i. computing the delta-winding is done by using the buffer produced
in (e) as a vertex buffer and jazzing the vertex shader so
that if a curve goes through the bottom of a tile, it
emits a 1-pixel square and the fragment shader emits -1 or
+1 and use additive blending. Instancing should be used
so that we have a single draw and the instance ID decides
which of the 64 values to check for the emit. In addition,
we should use glVertexAttribDivisor() so that every full
possible rect advances the attribute. The next question to
answer is how do we draw MANY of these together? The answer
is that a single uint attribute will be reserved for indexing
into a buffer that says where the tiles are located and what
transformation jazz to apply. Using a UBO to hold the headers
means that every several thousand paths requires another glDraw
to upload a different set of headers to the UBO.
ii. to sum up the delta winding value we then have ANOTHER
FP16 buffer where we draw to it a single rect that
adds up the buffer values in (i) of elements strictly to
the right. This is done with blending OFF.
g. Create surface buffers storing if a tile is partially covered,
fully covered or empty.
i. When doing f.i, we do multiple render targets where the other
buffer is an R8 which is initialized as zero and when a curve
intersects a tile we can have the fragment shader emit 1 for
that buffer and this works with the additive blending.
ii. During f.ii, we do MRT again and can decide if a tile is
completely covered, partially covered or not at all and
write that to an R8 buffer to indicate its status.
h. The evil part: we need to read back what tiles are partially covered,
i.e. glReadPixels on that R8 buffer of g.ii. Since we are doing
many paths, and it is just one pixel per tile, it is a small read and
it is highly amortized, but it is still a read from GPU to CPU.
i. With (h), the tiles are then known on CPU which are partially
covered and which are fully covered.
j. We can then proceed to draw to the partially covered tiles the
partial coverage reusing the vertex buffers of (d), but with
an additional UBO to read where a partially covered tile lives.
The drawing of the partially covered rect content will be directed
to another surface (and we gamble that we can fit all the tiles
always into that surface). This surface is reset on every frame.
k. The draw at the level of Renderer does not know what tiles are partial
and what tiles are full or what are empty, but it will emit an opaque
draw and a non-opaque draw and the backend will use its data to
do the right thing.
The biggest stink, besides the immense complexity, is the glReadPixels() in
part h. On native, this is not that bad because it likely happens only once
per frame and all of the render target changes induce pipeline flushes anyways.
However, on WebGL2, this forces a synchronization point between the process
of the tab running and the GPU process which may harm performance a great deal.
Another issue is the combining with previous clipping. WebGL2 does not permit
one to bind a texture for reading and also write to it, even if the
regions read and written do not intersect. The upshot is that we would have to rely on analytic intersection
of clipping on CPU and feed those to Astral. Also not good.
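For reference, the arithmetic of steps f.ii and g.ii of idea 12 can be modelled on the CPU for one row of tiles. On the GPU each fragment would do its own reads; the running sum below is just the CPU equivalent. The names, and the choice to treat any tile hit by a curve as partial, are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// f.ii: sums[i] is the sum of deltas[j] over all j strictly to the right
// of i, where deltas holds the per-tile winding deltas produced in f.i.
std::vector<int> sum_strictly_right(const std::vector<int> &deltas)
{
    std::vector<int> sums(deltas.size(), 0);
    int running = 0;
    for (std::size_t i = deltas.size(); i-- > 0;)
    {
        sums[i] = running;
        running += deltas[i];
    }
    return sums;
}

enum TileStatus { tile_empty, tile_full, tile_partial };

// g.ii: a tile hit by any curve is partial; otherwise the summed winding
// and the fill rule decide between full and empty.
std::vector<TileStatus> classify_row(const std::vector<int> &deltas,
                                     const std::vector<bool> &hit,
                                     bool odd_even)
{
    assert(deltas.size() == hit.size());
    std::vector<int> winding = sum_strictly_right(deltas);
    std::vector<TileStatus> status(deltas.size(), tile_empty);
    for (std::size_t i = 0; i < deltas.size(); ++i)
    {
        if (hit[i])
        {
            status[i] = tile_partial;
            continue;
        }
        bool inside = odd_even ? (winding[i] & 1) != 0 : winding[i] != 0;
        status[i] = inside ? tile_full : tile_empty;
    }
    return status;
}
```

The status vector is what the R8 buffer read back in step h would contain, one entry per tile.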