IDEAS.txt

1. Binary Search on GPU
     Let t[i], 0 <= i <= n be a an array of "times" and n = 2^k for
     some k with 0.0 = t[0] <= t[1] <= t[2] <= ... <= t[n] = 1.0.
     Given a value t, find i so that t[i] <= t < t[i + 1] using the
     GPU.

     int
     find_index(float t)
     {
       begin = 0;
       end = n;
       for (int i = 0; i < k - 1; ++i)
         {
           current = (begin + end) / 2;
           if (t > t[current])
             {
               begin = current;
             }
           else
             {
               end = current;
             }
         }

       return begin;
     }

2. Clipping by convex polygon and rects. For clipOutConvexPolygon and
   without anti-aliasing can be done by drawing an occluder. For
   clipInRect and clipInCovexPolyon without anti-aliasing can also
   be done by drawing an occluder. The performance issue is that the
   occluder might be HUGE. A simple example is drawing a table of
   cells where eac cell is clipped. In this example, then each
   cell's clip is an occluder over its complement which is quite
   large.

3. Rework offscreen region allocator. The main issue is that it is
   not very fast and the way it allocates spaces does not allow
   to use a smaller offscreen target. The idea is to use a shelf
   algorithm together with the ablity to swap x/y coordinate so
   that the allocator can always assume that width >= hieght.

4. Sparse offscreen stroking. Currently when rendering strokes
   we generate multiple VirtualBuffer objects and render what
   stroking primitives hit it. Instead, we should have a single
   VirtualBuffer, but return the unused tiles (unioned up in
   the same fashion that we currently union up what is hit).
   A more agressive return strategy would be to have TileAllocator
   a public interface to return unused regions of allocated
   tiles and it would internally union it up. Lastly, to
   make sure that there is no rendering leackage, we would
   draw the masks AFTER drawing the color buffers and at
   the end of drawing the color buffers, we would cap all
   color buffers with a depth-value of occlude always.

5. Sparse offscreen filling. Following the same idea of having
   a single VirtualBuffer but marking returning what region
   are not touched by partial tiles. However, there are more
   details:
  a. For contours that are small in one of their dimension
     (i.e. no more than 3 tiles), instead of mapping and clipping
     them, just add their STC data to the VirtualBuffer and all of
     the tiles that are hit by it are fully backed. The "is hit"
     decision would be by using astral::TransformedBoundingBox,
     which would be given a method to return the length of its
     edges.
  b. For "biggish contours", we run the clipping to see what
     tiles are hit. However, because we no longer have that
     each tile has padding, we only need to clip against the
     lines H_n = { (x, 60n) | x real } and V_n = {60n, y) | y real }.
     We can get efficient clipping by first breaking each curve
     C of the contour into pieces where each piece is on a
     specific side of each H_n and V_n, i.e. curves get clipped
     once. From there, we can quickly compute what tiles are
     partial tiles. The actual clipped contour generation would
     simply walk the clipped curve's instead of the original
     curve list to generate the contour to run on a tile.
     Tiles that have no original curves walking though them
     would, as before, have an internal winding number on
     them incremented. At the end of data generation, if the
     rull is odd-even, we would add N % 1 rects to draw where
     N is the winding number, but for non-zero, we would add
     N rects.
  c. We could, in theory avoid streaming the anti-aliasing
     data by drawing the original anti-alias fuzz to the large
     render target. To make sure of no leakage, the rendering
     needs to be protected always by depth buffer capping;
     A simple way to do this is to first render the color virtual
     buffers and then cap them. This will prevent any drawing
     to them period by the masks. To reduce the area covered,
     we should also tessellate the curves for filling also
     according to area.
  d. The generation of the STC data is tempting to also
     do this way, but would eat oddles of fill-rate. It is
     likely best to continue to clip the contour to produce
     the STC passes.

6. Likely not a good idea to reset UBO buffer pool on each render
   target as that would cause a GPU to issue a pipeline stall to
   wait until the GPU is done if a BO gets reused.  Though, given
   that at the end of each render target we need to blit the renders
   to the ImageAtlas anyways, the GPU pipeline will need to stall
   anyways.

7. IDEA: Path union, intersection in STC.
   A. When filling, a VirtualBuffer will have "boundaries" in its
      STC data indicating the start of a new combined path, several
      types of markers:
     1. End: this means run the cover pass (but not anti-aliasing)
        to not only set .r channel to indicate covered but to also
        clear the stencil
     2. ComplementEnd: run the cover pass as for End and then also
        run another rect that complement the .r channel using
        a fragment shader that emits .r = 1 with GL_FUNC_SUBTRACT
        and coefficient (GL_ONE, GL_ONE) because then the blender
        does D <-- GL_ONE * F - GL_ONE * D and F is 1 thus it does
        D <-- 1 - D which is what we want to do invert.
   B. P1 union P2 would just have an End marker between them.
   C. P1 intersect P2 would be fill P1 with the complement of its
      fill rule, then an End between P1 and P2, fill P2 with its
      complement fill rule as well and then P2 ends with a ComplementEnd
   D. We can then support any sequence of such path building, but
      an arbitary set expression is likely not feasible.
   E. Modify the distace field data, kill one of the "raw coverage" value,
      or ideally both and in the post-process step have a channel as
      0 or 1 indicating it is a boundary texel.
   F. Drawing anti-alias fuzz is delayed compeletely after STC.
   G. Stroking idea may or may not work. Taking the mask generated
      from (E). In the fragment shader when generating the mask for
      stroking, go to the point along the path that generated the fragment
      (this is easy to compute if the fragment shader knows enough of
      the geometry for the primitive). Sample the path-intersection mask
      and if it says it is on the boundary, then emit as usual. If not
      on the boundary, emit covered as 0 and distance as maximum.

8. IDEA: support 3D scenes. There would be a special "3DSCENE"
    group that would specify an ENTIRE 3D scene. Within such a
    group, there would be stuff to render, in 3D. Items within
    a group would have different vertex and fragment shader entry
    functions: the vertex shader would emit the analogue of
    gl_Position and the fragment shader would only emit a color
    value. When a 3D scene is requested, we allocate a range of
    Z-values that it will consume to do depth testing; since we
    are not rendering games or sophisticated things, the range
    would be around 16-bits wide. This would allow us to still
    use the dept-buffer occlusion in rendering and have many
    3D scenes (though too many scenes would sink this still).
     B. The caller of the vertex shader would MODIFY the output
        clip-vertex value as follows:
      1. Let [minX, maxX]x[minY, maxY] be the viewport in -normalized-
         coordinates of the current virtual buffer. Let (A, B) and
         (C, D) be the numbers so that
           x --> Ax + B maps [-1, 1] to [minX, maxX]
           y --> Cx + D maps [-1, 1] to [minY, maxY]
         and we do
           gl_Position.x = A * vertex_shader.x + B * vertex_shader.w
           gl_Position.y = C * vertex_shader.y + D * vertex_shader.w
      2. Let [minZ, maxZ] be the depth-buffer range allocated to the
         3D scene and let (E, F) be the numbers so that
           z --> Ez + F maps [-1, 1] to [minZ, maxZ]
         and we do
           gl_Position.z = A * vertex_shader.z + B * vertex_shader.w
      3. We add the clipping planes:
         i.   minX * gl_Position.w <= gl_Position.x <= maxX * gl_Position.w
         ii.  minY * gl_Position.w <= gl_Position.y <= maxY * gl_Position.w
         iii. minZ * gl_Position.w <= gl_Position.z <= maxZ * gl_Position.w
     C. As an alternative do doing B., we could instead call
        glViewport, glScissor and glDepthRange. This avoids the silliness
        of adding more hardware clip planes and allows for better GPU optimizaions,
        but some GPU's lose performance when glViewport and glDepthRange
        are called. Lastly, this makes it impossible to walk across different
        virtual buffers to draw scenes to reduce shader changes.

9. sRGB support via HW support and GL extensions. Currently, we have that
   a render emits sRGB or linear values and that an astral::Image is encoded
   to store sRGB or linear values. Steps:
     a. Make the format of the image atlas as GL_SRGB8_ALPHA8
     b. Bind the image atlas to two different image units where
         - decoded: no sampler, this will convert the sRGB stored
                    values into linear
         - raw:  with a sampler using GL_EXT_texture_sRGB_decode
                 to set the sampler with GL_TEXTURE_SRGB_DECODE_EXT
                 having GL_SKIP_DECODE_EXT
     c. when rendering virtual buffers, note that we have the following:
         - mask rendering
             i. means fragment shader emits linear values
             ii. want to store the exact value that shader emitted
         - srgb rendering
             i. means fragment shader emits srgb values
             ii. want to store the exact value that shader emitted
         - linear rendering
             i. means fragment shader emits linear values
             ii. want the HW to covert the linear to sRGB and store the sRGB value
     d. when rendering render the linear color buffers with
        glEnable(GL_FRAMEBUFFER_SRGB) and when rendering the mask and
        sRGB buffers render with glDisable(GL_FRAMEBUFFER_SRGB). The offscreen
        render target should be an GL_SRGB8_ALPHA8 texture.
     e. Sampling
          - masks: sample from raw always
          - color, get srgb value: sample from raw
          - color, get linear value: sample from decoded
     f. The main difference is that Image::set_pixels() should always
        be passed sRGB values for color images and linear values
        for mask values. In addition, Image::colorspace() and
        ImageSampler::colorspace() have no role anymore.
     g. The colorspace argument to astral_sample_image() will still hold
        and chances are we still want the item and material shaders to
        know the colorspace they are to work in.
     h. when blitting the results from the render target to the atlas,
        sRGB rendering should be disabled and the sampling from the render
        target should be that GL_TEXTURE_SRGB_DECODE_EXT is the value
        GL_SKIP_DECODE_EXT so that the blits are bit-copies. In addition,
        the render target should be backed by an sRGB texture.
     i. Comment: if GL_EXT_texture_sRGB_decode is not available, one
        can use GL_ARB_texture_view instead to alias a GL_SRGB8_ALPHA8
        texture as an GL_RGB8 texture.

10. Tighter clipping via geometry when HW clip planes are not available.
     i. Option 1: Have something in vertex shader that can optionally emit a screen-aligned
                  bounding box in the same coordinate system as the the clip window. If that
                  box is compleely clipped by the clip window, then the main vertex shader
		  will make the w = -1 to clip it.
     ii. Option 2: Allow for the vertex shader to read the clip window values and from that
                   if the box would completely clip a triangle, then to just emit a vertices
		   that don't generate any geoemtry; alternatively have the vertex shader
		   have the option to emit "completely clipped" and the main() will handle
		   it.

11. A different way to do filling. This idea builds strongly off of how
    Pathfinder3 performs filling: https://nical.github.io/posts/a-look-at-pathfinder.html.
    Steps:
      a. Pick a tile size; PathFinder has great success with 16x16, so
         perhaps we start with that.
      b. Do NOT use line segments only, but continus to use quadratic bezier
         curves and line segments.
      c. In the "binning" pass (on CPU) do the same as PathFinder3 and
         compute what curves and segments hit what tiles and build a list
         of curves and segments that hit each tile. Use the algorithm in
         Random Access Rendering of Generaly Vector Graphics, on the web
         at https://hhoppe.com/ravg.pdf, to figure out the modification to
         the winding offset value.
      d. This step is where everything is different. Instead of rendering
         to an fp16 buffer with additive blending, just draw a single quad
         covering the tile. The fragment shader then walks the list of
         curves and segments that hit it to compute the winding number.
         Add the winding offset as well. In addition, instead of performing
         this on a scratch buffer, we can do this directly onto the atlas
         as well.
      e. If we add to the curve list those curves that are within a pixel of
         tile, we can then also compute a semi-reliable distance value as we do
         for glyph rendering. Another approach is to observe that it is the
         memory reads that are most expensive, we could instead track the
         winding value for several sample points in the fragment shader (as
         we do for the lighting shader) and use that to compute an anti-alias
         value that is robust against false edges.

     The main fly in the ointment is the interaction with the tile size
     in the image atlas. The simplest way out is to simply draw several
     times to the atlas for the filling tile size and the image atlas tile
     size to match up. The other issue is to parallize the step in (c) so
     that Renderer can know the full and empty tiles immeidately; though
     given that it parallizes on the level of curves, the thread join
     may not be that bad really.

     Another issue is that to do combining we cannot render directly to the
     atlas because WebGL2 does not one permit to read from a texture and
     at the same time write to it. Thus the step in (d) on WebGL2 cannot
     be done directly to the atlas and must be done on the offscreen scratch
     area.

12. Use/abuse transform feedback to do the binning for path filling, building
    off of 11.
      a. Pick a tile size; PathFinder has great success with 16x16, so
         perhaps we start with that.
      b. Do NOT use line segments only, but continus to use quadratic bezier
         curves and line segments.
      c. When a tessellation LOD is chosen, just like stroking, the error
         also includes a value for the longest length of any of the curves.
	 In addition, the length of the longest curve is known as well.
      d. Knowing the longest curve, we then know the maximum number of tiles
         any curve may occupy. With that in mind, we can then use transform
	 feedback to do the binning. We can probably identify a single
	 curve of a contour with a 16-bit integer, thus a single uvec4
	 can record 8 curves; it looks like GL requires that feedback
	 must support atleast 64 scalars, so our transform feedback shader
	 when processing a single curve, could emit up to 128 seperately
	 hit tiles. However, one scalar will be used to store the index
	 into the shader buffer of step f.i.
      e. This big deal about (d) is that now we are using the GPU to do
         the binning, not the CPU. The binning itself is what we see
	 in Random Access Rendering of Generaly Vector Graphics, on the web
         at https://hhoppe.com/ravg.pdf.
      f. The tile incrementing is done in two steps. Firstly we have
         two small one channel FP16 buffers; one for storing the raw
	 delta values from the binning and another for storing the sums.
	 There size is one pixel per tile for a path fill. Thus their
	 sizes can be computed on CPU and we can reserve room that way.
          i. compute the delta-winding is by using the buffer produced
	     in (e) as a vertex buffer to then jazz the vertex shader so
	     that if a curve goes through a tile at the bottom to then
	     emit a 1-pixel square and the fragment shader emits -1 or
	     +1 and use additive blending. Instancing should be used
	     so that we have a single draw and the instance ID decides
             which of the 64 values to check for the emit. In addition,
	     we should use glVertexAttribDivisor() so that every full
	     possible rect advances the attribute. The next question to
	     answer is how do we draw MANY of these together? The answer
	     is a single uint attribute will be reserved for reading
	     into a buffer where are the tiles located and what is the
	     transformation jazz to apply. Using a UBO to hold the header
	     means that every several thousand paths means another glDraw
	     to upload a different set of headers to the UBO.
	  ii. to sum up the delta winding value we then have ANOTHER
	      FP16 buffer where we draw to it a single rect that
	      adds up the buffer values in (i) of elements strictly to
	      the right. This is done with blending OFF.
      g. Create a surface buffers storing if a tile is partially covered,
         fully covered or empty.
	  i. When doing f.i, we do multiple render targets where the other
	     buffer in an R8 which is initialized as zero and when a curve
	     intersects a tile we can have the fragment shader emit 1 for
	     that buffer and this works with the addive blending.
	  ii. During f.ii, we do MRT again and can decide if a tile is
	      completely covered, paritally covered or not at all and
	      write that to an R8 buffer to indicate its status.
      h. The evil part: we need to read back what tiles are parially covered,
         i.e. glReadPixels on that R8 buffer of f.ii. Since we are doing
         many paths, and it is just pixel per tile, it is a small read and
	 it is highly amoritized, but it is still a read from GPU to CPU.
      i. With (h), the tiles are then known on CPU which are partially
         covered and which are fully covered.
      j. We can then proceed to draw to the partially covered tiles the
         partial coverage reusing the vertex buffers of (d), but with
	 an additional UBO to read where a partially covered tile lives.
	 The drawing of the partially covered rect content will be directed
	 to another surface (and we gamble that we can fit all the tiles
	 always into that surface). This surface is reset on every frame.
      k. The draw at the level of Renderer does not know what tiles are partial
         and what tiles are full or what are empty, but it will emit an opaque
	 draw and a non-opaque draw and the backend will use its data to
	 do the right thing.

    The biggest stink, besides the immense complexity, is the glReadPixel() in
    part h. On native, this is not that bad because it happens likely only once
    per frame and all of the render target changes induce pipeline flushes anyways.
    However, on WebGL2, this foces a synchronization point between the process
    of the tab running and the GPU process which may harm performance a great deal.

    Another issue is the combining with previous clipping. WebGL2 does not permit
    for one to bind a texture for reading and to also to write to it, even if there
    is no intersect. The upshot is that we would have to rely on analytic interection
    of clipping on CPU and feed those to Astral. Also not good.