
Webrender Overview

Lars Bergstrom edited this page Jun 18, 2016 · 1 revision

webrender overview

  • nical: Goals: what have you built, what have you gone through, and what have you learned?
  • gw: WR1 is the obvious way to do things. Treat everything as quads (rectangles, text, glyphs, etc.). Build them into dynamic vertex buffers. Use instanced rendering where available to reduce vertex buffer traffic. Complex stuff (border corners, etc.) was done in a pre-pass and cached into a texture atlas; sampling from that atlas is how you get it onto the quad. So the rendering pass was just one shader: draw a quad (possibly with some clips). That made batching very easy; it was common for one frame to render in 2-3 draw calls, limited by index buffer sizes. Problems! First, going through the single primary shader made it hard to add features (e.g., arbitrary clip paths) without affecting the performance of pages that don't use them. Second, though nearly all pages ran at 60fps, some would not quite hold it; the common theme was overdraw: lots of opacity, alpha blending, and pixels drawn multiple times. The third issue was the question from gfx & browser people of how you do subpixel text AA. Some said it's not necessary in a HiDPI world, but we wanted to get an answer for that.
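The WR1 batching model described above can be sketched roughly like this (a hypothetical simplification; all type and field names are made up, not WebRender's actual API):

```rust
// Sketch of WR1-style quad batching (all names hypothetical).
// Every primitive -- rect, glyph, image -- is reduced to one quad
// instance; one shader draws the whole batch via instanced rendering.

#[derive(Clone, Copy)]
struct QuadInstance {
    rect: [f32; 4],    // x, y, width, height in device pixels
    uv_rect: [f32; 4], // region of the cache atlas to sample
    color: [f32; 4],   // premultiplied RGBA tint
}

struct Batch {
    instances: Vec<QuadInstance>,
    max_instances: usize, // limited by index/instance buffer sizes
}

impl Batch {
    fn new(max_instances: usize) -> Batch {
        Batch { instances: Vec::new(), max_instances }
    }

    // Returns false when the batch is full and a new draw call is needed.
    fn push(&mut self, quad: QuadInstance) -> bool {
        if self.instances.len() == self.max_instances {
            return false;
        }
        self.instances.push(quad);
        true
    }
}
```

With only one shader, everything on the page funnels into a handful of these batches, which is why a whole frame could render in 2-3 draw calls.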

webrender 2 overview

The idea is to split the screen, at the device pixel level, into a small number of non-overlapping tiles, and then, for each tile, determine a quick way to render it. We throw out any primitives or stacking contexts that don't contribute (transparent, offscreen, etc.). We start with 512x512 pixel tiles and assign primitives to them; they're not required to be uniform sizes. Then we run a procedure on each tile, building a BSP tree whose split axes lie on the edges of primitives, to minimize the number of overlapping primitives per subtile. So, choose a split axis wherever there's a primitive edge. For each subtile, you then have the list of primitives affecting it. E.g., there might be a text run, a white background section, and a brown element background; you can remove the white background since it's covered by the opaque brown element. Each tile then has the layers it needs to render, along with their transparency, and it draws them. The pass that does this layering & compositing uses very simple shaders; it can't do borders, text, or corners. So, before rendering the tile, we run the shaders that render those things to an offscreen render target. The final stage samples from the render target but can handle blending, stacking contexts, etc. The goal is that the final composite for the tile should have all the colors for all of the fragments, so that we should be able to do subpixel AA. This doesn't use stencil or framebuffer alpha blending; all blending is done in the shaders. Because border corners are written by complex shaders beforehand and separately, they have no impact on the final rendering if they are not used. The main screen pass, rendered into tiles, and ???, are guaranteed to be non-overlapping, so you don't have to worry about sequencing.
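The per-tile occlusion cull described above (dropping the white background under the opaque brown element) can be sketched as follows; this is a hypothetical CPU-side simplification, not WebRender's actual code:

```rust
// Sketch of the per-tile occlusion cull (names hypothetical): walk
// primitives front-to-back and drop any that are completely covered
// by an opaque primitive already seen.

#[derive(Clone, Copy, PartialEq, Debug)]
struct Rect { x: f32, y: f32, w: f32, h: f32 }

impl Rect {
    fn contains(&self, other: &Rect) -> bool {
        self.x <= other.x && self.y <= other.y
            && self.x + self.w >= other.x + other.w
            && self.y + self.h >= other.y + other.h
    }
}

#[derive(Clone, Copy, PartialEq, Debug)]
struct Prim { rect: Rect, opaque: bool }

// `prims` is in front-to-back order; keep a primitive only if no
// opaque primitive in front of it fully covers it.
fn cull(prims: &[Prim]) -> Vec<Prim> {
    let mut visible = Vec::new();
    let mut occluders: Vec<Rect> = Vec::new();
    for p in prims {
        if !occluders.iter().any(|o| o.contains(&p.rect)) {
            visible.push(*p);
            if p.opaque {
                occluders.push(p.rect);
            }
        }
    }
    visible
}
```

Because each subtile's primitive list is short after the BSP split, a simple containment test like this stays cheap.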

  • nical: how do you do animations?

Right now, we build things every frame. It takes ~1ms to build the entire frame with the unoptimized code. Each of the 512x512 tiles is independent for 90% of the algorithm, so we could get more parallelism if we need it. The CPU side takes 1.4 ms for a basic page (HN); that covers full generation of the tiles, the BSP, occlusion culling, the initial traversal of the stacking tree, etc. Even for complex websites, it's < 2-3ms.

  • A: 2-3 box shadows that overlap?

Rasterized to that render target & composited.

  • nical: WR1 handled drop shadows directly? It would render them to a cache, but then do three draws. Batched into a single call, but you'd get overdraw & alpha blend. In WR2, it's one pass in the shader.

  • A: Why is drawing 3 times in the shader faster than overdraw? You can fetch from the texture faster. Modern GPUs in particular are really good at hiding the latency of texture fetches by running other work. If you have alpha blending enabled, the framerate is halved, even with an alpha of 1.

  • A: Do you run out of texture units? I haven't yet. The minimum guarantee is 16 samplers on the hardware we target.

  • benwa: Could we expect 4 or 8 on some mobile units? Right now, we're ES3-only, which has pretty generous minimums for samplers. Also, the vast majority of stuff comes from one texture. The render target is 2k x 2k, so we don't run out of texture units. There's the render target cache, and then the second texture is the images & rasterized glyphs.

  • benwa: Benchmarked on mobile & desktop? Expecting WR2 to be much better than WR1 with fewer passes. Seemed to be getting gaming framerates. This is based on gaming on Android, where we usually do a z-order prepass.

  • A: How fast is it to switch shaders? On Intel it's quite expensive, but they're batched together. I have six (box-shadow, border, gradient, etc.), and generally only ~5 draw calls to each of them because it batches each kind at once. It can hurt in a really crazy edge case where a tile fills up the render target cache; then it might blow the cache away.

  • nical: Outside of overfill/overdraw, what else did you run into? It was getting hard to add new features, because the strategy was to fit everything into one or two kinds of shaders. With the traditional batching architecture, if you only have one shader, generating batches is fairly trivial because it doesn't matter if stuff overlaps. Once you have 3 or 4, you have to start considering whether primitives that come later overlap earlier ones; if so, you can't add them to an earlier batch or the paint order might be incorrect. Determining overlap without a system like WR2's can be quite expensive. So, it was hard to add features given the limited number of shaders.
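The multi-shader batching hazard described here can be sketched like this (a hypothetical simplification; names are invented):

```rust
// Sketch of the multi-shader batching constraint (names hypothetical):
// a primitive may only be merged into an earlier batch if it does not
// overlap anything already placed in a *later* batch, or the paint
// order would be violated.

#[derive(Clone, Copy)]
struct Rect { x: f32, y: f32, w: f32, h: f32 }

impl Rect {
    fn overlaps(&self, o: &Rect) -> bool {
        self.x < o.x + o.w && o.x < self.x + self.w
            && self.y < o.y + o.h && o.y < self.y + self.h
    }
}

struct Batch { shader: u32, rects: Vec<Rect> }

// Scan batches back-to-front looking for one with a matching shader;
// scanning past a non-matching batch is only safe when the new rect
// overlaps nothing in it.
fn find_batch(batches: &[Batch], shader: u32, rect: Rect) -> Option<usize> {
    for i in (0..batches.len()).rev() {
        if batches[i].shader == shader {
            return Some(i);
        }
        if batches[i].rects.iter().any(|r| r.overlaps(&rect)) {
            return None; // must start a new batch after this one
        }
    }
    None
}
```

The overlap test is what WR2's non-overlapping tiles make unnecessary: within a tile's final pass, nothing overlaps, so batching is trivial again.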

  • benwa: We have a layering system and a sublayering system that caches to a texture. Do you do that? Redraw from primitives every frame? Isn't that wasteful? Yes, but we also do no dynamic texture allocation, so we remove all the bugs associated with that or extra texture memory. The architecture supports doing some caching if we wanted, at the tile level. I've also been thinking about caching part of the render target for box shadows (our most expensive shader) so that we don't have to do it each frame.

  • benwa: Seems like there might be some pathological cases where you'd do worse here.

Yes. In WR1, we have an elaborate system that caches vertex buffers per screen tile when you scroll. This made scrolling 0.1 ms on the CPU, which was great especially for power optimization. We're trying to get to the point where people can animate the full scene at 60fps. Obviously, if you can draw a new frame at 60fps regardless of the animation, you should also be able to do that for scrolling. If we can get there for web developers so that they can skip causing layers to be created, animate top, etc. it would be compelling for them. But, definitely, for the simple cases, we should do the caching to minimize power consumption.

  • benwa: Makes sense that you can measure performance without the caching layer, but need to do the caching to ship.

Yes. Caching intermediate render target results should give us pretty decent results. For just scrolling, we can cache the BSP trees for those coarse tiles.

  • nical: What GPU features do you rely on?

Uniform buffer objects (GL3) and instancing (GL3). Neither is required, though: it can run on ES2 with standard uniforms and generated vertex buffers. It's just to make development easy because they're well-typed in Rust. WR2 can easily run on ES2 or GL2. The blur shader and box shadow shader do not currently run on ES2 because they use dynamic for loops; I believe we can do them as a multi-pass effect on ES2 if we want to.

  • nical: All abstracted away? Could add a backend if needed?

The core tile rendering loop is ~3 pages of code. The rendering cache is a few more pages. In total it's about 400-500 lines of code. There's a bunch of setup, but the core of the render loop is small and should be adaptable.

  • nical: How would it perform with a software implementation? Seems like fallback to LLVMPIPE would work well b/c overdraw kills software too.

Yes. Everything goes from the primitives through the render cache and is compiled to a tile. We could put in fast paths for tiles with only rectangles or gradients and render those directly to the screen.

  • benwa: If you turn on tiling, it doesn't redraw the gradients. So it SHOULD work to save work.

  • A: How do you handle nested filters without an intermediate surface?

Up to eight transparent layers right now. The shader has to accumulate the colors it needs during the traversal of the stacking tree. For complex cases, we may have to drop to multipass; there are limits to how much data can pass through the interpolators.
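What the tile shader does with its stack of transparent layers amounts to standard "over" compositing done entirely in the shader, so no framebuffer alpha blending is needed. A hypothetical CPU sketch (not WebRender's actual code):

```rust
// Composite up to eight non-premultiplied RGBA layers back-to-front
// with the "over" operator, the way the tile shader accumulates its
// transparent layers internally (hypothetical sketch).

const MAX_LAYERS: usize = 8;

// `layers[0]` is the bottom-most layer; each color is [r, g, b, a].
fn composite(layers: &[[f32; 4]]) -> [f32; 4] {
    assert!(layers.len() <= MAX_LAYERS);
    let mut out = [0.0f32; 4];
    for l in layers {
        let a = l[3];
        for c in 0..3 {
            out[c] = l[c] * a + out[c] * (1.0 - a);
        }
        out[3] = a + out[3] * (1.0 - a);
    }
    out
}
```

Keeping all layer colors available in one pass is also what makes subpixel AA feasible in the final composite.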

  • A: Blur shader for multiple pixels.

Yes! It'll be a render target effect. That would only work for stuff that's 1:1 pixel.

  • nical: SVG? Or path masks?

Two options, but we haven't done it yet. One is rendering in the primitive cache: because those can be complex shaders, they could have the knowledge to render the path mask. More likely, we'll render that mask to the primitive cache target and sample from it. The nice thing there is that we can store four clip masks in the texture, one per channel, so it might be more performant for multiple masks. Again, hypothetical right now.
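The idea of packing four clip masks into one RGBA texture, one mask per channel, can be sketched like this (hypothetical names; gw describes the scheme itself as hypothetical too):

```rust
// Hypothetical sketch: four clip masks share one RGBA mask texture,
// each owning a single channel; a clipped primitive samples only the
// channel it was assigned.

struct MaskTexture {
    width: usize,
    texels: Vec<[f32; 4]>, // RGBA coverage, one mask per channel
}

impl MaskTexture {
    fn new(width: usize, height: usize) -> MaskTexture {
        MaskTexture { width, texels: vec![[0.0; 4]; width * height] }
    }

    // Write coverage for mask `channel` (0..4) at a texel.
    fn write(&mut self, x: usize, y: usize, channel: usize, coverage: f32) {
        self.texels[y * self.width + x][channel] = coverage;
    }

    // A clipped primitive samples just its own channel.
    fn sample(&self, x: usize, y: usize, channel: usize) -> f32 {
        self.texels[y * self.width + x][channel]
    }
}
```

The win is density: four masks per allocation in the cache texture, at the cost of a per-primitive channel index.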

  • A: Once you write a blur shader with a rendered intermediate target, can you also render other things into that target? Does Servo support SVG?

Nope. We might start next quarter. But just the initial parsing of the data structures.

benchmarks

  • bz: Can we create some demos that push the worst-case on WR2?

  • pcwalton: We haven't yet.

  • jeff: In transparent-rects, what's taking the time?

  • gw: border corners. Rasterizing those to the render target is most painful, and we don't cache between frames.

resource usage

  • bz: How big is the render target?

  • gw: 2k * 2k pixels. On a normal page, we can get by with a 1k * 1k cache.

  • bz: GPU memory?

  • gw: Yes.

  • bz: What's GPU memory look like on mobile?

  • jeff: integrated.

  • gw: Except for places with crazy blur filters or > 16 layers, we don't allocate another one dynamically.

  • bz: Separate per iframe?

  • pcwalton: No. We flatten everything into one scene. Scroll layers?

  • gw: We don't bother doing any caching/rebuilding right now; it's just a transform. We plan to add it for power optimization, but don't need it for perf. Initial visibility is brute-force, which is needed for the HTML spec, but we haven't reproduced the AABB tree from WR1.

  • pcwalton: Maybe do the optimization at the display list level instead of in webrender? That's what Gecko does.

  • gw: Trivial to handle the culling.

cache

  • jeff: what's happening with border corners in transparent-rects and the render cache?

  • gw: It's a 2d intersection between screen space vertices & the tile. In the shader, we have info about stacking contexts. For each vertex of the tile, do a raycast to the stacking context. Totally overkill for 2d, but it makes 3d work, to determine the z coordinate of the vertex. Then you can back-transform that screen space 3d position through the stacking context to get it in local space. Then the fragment shader can just check: if the distance in local space is > some threshold, discard. 3d almost works; it just needs a per-pixel discard for interpolation.

  • jeff: So you need these corners. They're all drawn at once?

  • gw: Yes. The primitive cache has all the corners, etc. Batches into groups of 512 items.

  • pcwalton: WR1 did the same.

  • gw: Since there's no overlap, it's trivial to get them together.
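The per-fragment corner test gw describes (back-transform into local space, then distance check and discard) can be sketched as a CPU-side version; names and parameters here are hypothetical:

```rust
// Hypothetical CPU version of the shader's corner test: once the
// fragment position is back-transformed into the border's local
// space, keep it only if its distance from the corner circle's
// center is within the outer border radius.
// Equivalent of the shader's `if (dist > radius) discard;`.
fn keep_fragment(pos: (f32, f32), center: (f32, f32), radius: f32) -> bool {
    let dx = pos.0 - center.0;
    let dy = pos.1 - center.1;
    dx * dx + dy * dy <= radius * radius
}
```

Comparing squared distances avoids a square root per fragment, which is the usual way to write this in a shader as well.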

Display lists vs. frame trees

  • pcwalton: Servo's display lists are a tree of stacking contexts with a flat list of display items. They are serializable so that they can go over IPC. So Webrender is in a separate process isolated from content. They have no references to the DOM; just an identifier to map back to the DOM. Hit testing is done on display lists.

  • pcwalton: We don't rasterize outside the viewport in the WR model, so we have a much larger display port than the view port. This is so that we don't generate all the display list items for the entire HTML5 spec. Most scrolls are handled without regenerating display lists. It can be performed off main thread.

  • pcwalton: Items are pretty small.

  • gw: Text, box shadow, solid color, gradient, border, image. Missing some of the properties Gecko supports: radial gradients, background images with background-repeat. Not stuff that's difficult, just stuff that's not there.

  • pcwalton: Repeating backgrounds are a flag on images in webrender. There are lots of attributes on stacking contexts: filters, transforms, perspective, etc. So it mirrors the CSS spec there. Generation of display lists is similar to Gecko: we emit them in whatever order we encounter them, and another sorting pass then reorders them into the correct order per CSS. It should not be hard to convert Gecko display lists into Servo display lists. I should mention that Servo & Webrender display lists are not identical, but the conversion is trivial. So there's at least one point where that's possible.

  • vlad: Since the Gecko display items have more stuff in them (e.g., tables), we might want to start a step back from the frame tree and instead build WR2 display items (or some other set of display items) directly.

  • pcwalton: The display list builder is one file, but not as big as nsDisplayListItem.

  • gw: Pretty bare-bones. There is a standalone sample of webrender usage outside of Servo.

  • pcwalton: Also, WR1 and WR2 have the identical display list API.

Status of WR2?

  • vlad: What's the status?
  • gw: Lots of failing edge cases. Great perf. But 2-3 more weeks of development to reach parity with WR1.
  • vlad: Hope is to get Gecko+Servo side involved in trying out WR. Maybe with a separate path for now. Need to figure out minimum hardware requirements, specs, etc. WR2 will definitely need to have multiple backends. DX11, LLVMPIPE or software implementation.
  • nical: The WR2 stuff is really close to pixman, so might be able to just use it. Software is where overdraw kills us the most, so it might help.
  • vlad: But, people using the software backend are not ones we think we can do this work for. WinXP is going to be hard to get a good experience on.
  • gw: 19ms per frame for LLVMPIPE for a gigantic HIDPI screen. But, caching is really cheap.

Backends

  • vlad: How easy is it to add DX11/12 backend?
  • gw: It was designed to be easily replaced. 90% of the OpenGL code is in one file, 10% in the other. Apart from the boilerplate of compiling shaders, render targets, FBOs, etc., the core loop is only ~500 lines of code.
  • pcwalton: Writing the shaders will be the tricky part. They're basically just an unrolled loop.
  • gw: About 15 shaders. 8 for blending; 7 for render target stuff. They're fast enough, but not fully optimized.

MSAA

  • pcwalton: Software?
  • gw: Should be pretty easy.

Space

  • vlad: Rendered in true 3d space?
  • gw: Yes. It's not working in WR2 currently, but it does in WR1 and just hasn't been added to WR2 yet.

Printing

  • pcwalton: Do we have to use a vector graphics backend?
  • vlad: Yes, text has to be sent as text to printers.
  • pcwalton: May need a special path for that. If you don't care about how fast it is (which we don't), you can strip out a lot of code.
  • vlad: Should be possible to print from the display items.
  • pcwalton: That's what we do from Servo.
  • vlad: Might just generate PDF & send that to {something that knows how to print}
  • jeff: No real advantage. Your PDF generation API would look like Cairo.
  • larsberg: No Cairo dependency.
  • vlad: Moz2d. Or skia?
  • jeff: No skia

Stuff nical is doing around SVG path rendering

  • nical: Different ways to render complex paths on the GPU. One is masks on the CPU. Another is stencil & cover, where you pick one point, draw triangles from it to all the points on the path, then use the even/odd rule on the stencil buffer to know whether you're inside or outside the shape.
  • gw: NVPATH?
  • nical: Yes, with a fast path in their driver. Used by nanoVG. Very easy to implement, but gets really nasty when you have a lot of paths. The memory bandwidth grows. Also, switching between stencil & cover, so makes some drivers not perform well outside of nvidia. So not cross-platform.
  • nical: What I've been looking into is a complete tessellation of the path: take the SVG path and output a series of triangles that represent just that mesh. What's a bit sad is that you end up with a huge vertex buffer. But there are lots of options for AA (CPU or multisampling); vertex AA is the simplest approach, and I'm not really fond of multisampling. The idea is to tessellate the path on the CPU.
  • nical: Been working on this for a year or two on the side. Basically: remove self-intersections, make the pieces monotone, and then you can tessellate. All of the existing algorithms sort the points and go through them in order, which means you consume & produce in mostly the same order, so I worked on an algorithm that does everything at once, and it's a lot faster. The per-pixel algorithms lose some of the intermediate data, so the bulk tessellator works well.
  • nical: Written in Rust. When I ported the C path flattening code from Gecko, I found & fixed two bugs in it.
  • nical: Just need to fix some small cases & vertex AA.
  • nical: Part of the goal was to get better at SVG performance.
  • benwa: See it every election season with the election maps, because they're SVG.
  • nical: Would like to let people do maps in SVG instead of WebGL. We have our own moz2d SVG renderer, but it doesn't know anything about GPUs or scene graphs or anything.
  • A: Rasterize to a CPU buffer using Cairo or Moz2d and then composite.
  • nical: Is servo interested in this?
  • larsberg: I'm almost 100% sure we (pcwalton, etc.) could help integrate this and finish it off.
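The even/odd rule nical mentions can be sketched with the classic ray-crossing test; this is a generic illustration of the rule (polygon version, names hypothetical), not his tessellator:

```rust
// Even/odd (ray-crossing) point-in-shape test: count how many edges
// a horizontal ray from the point crosses; an odd count means the
// point is inside. Stencil & cover does the same parity counting in
// the stencil buffer.

// `poly` is a closed polygon given as its vertices in order.
fn inside_even_odd(point: (f32, f32), poly: &[(f32, f32)]) -> bool {
    let (px, py) = point;
    let mut crossings = 0;
    for i in 0..poly.len() {
        let (x1, y1) = poly[i];
        let (x2, y2) = poly[(i + 1) % poly.len()];
        // Does this edge cross the horizontal line through `py`,
        // on the +x side of the point?
        if (y1 > py) != (y2 > py) {
            let x_at = x1 + (py - y1) / (y2 - y1) * (x2 - x1);
            if x_at > px {
                crossings += 1;
            }
        }
    }
    crossings % 2 == 1
}
```

A curved path would first be flattened into line segments (the flattening code nical ported from Gecko), after which the same parity test applies.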

War stories

  • nical: We've been getting killed by e10s bugs & stability.
  • nical: Now moving the GPU to a separate process. The new bad drivers are coming more quickly than we can handle them. Cross-process texture sharing on Windows is a terrible thing and results in many bad problems.
  • gw: Crashes?
  • nical: Drivers don't behave well. e.g., the locking APIs sometimes don't work. So the synchronization is ignored. The text rendering, etc. is all fine, but timing changes and stuff fails. Even non-buggy drivers get buggy. Tons of crashes.
  • A: black windows, white windows, corrupted texture memory, etc.
  • nical: Compositor thread to compositor process. Then do video decoding there, too. The surge in driver bugs came when HTML5 video moved to hardware decoding, which meant we had to handle the decoding ourselves; it was no longer out of process in Flash. Everything is out of process now except possibly WebGL.
  • gw: b/c serialization is faster in-process?
  • nical: Partially. Also a desire not to migrate the code.

Vulkan

  • gw: Interest in Vulkan in Firefox?
  • benwa: Explicit hazards would be nice.
  • nical: Too few users are getting devices with Vulkan drivers, but we'll continue looking into it.

Adapting WR for Gecko

  • benwa: Be great to know what's required to adapt WR to work with the frame tree, etc.
  • gw: Have some code to generate display lists that might be similar.
  • benwa: Maybe a prototype with no way of shipping?
  • gw: There's http://github.com/glennw/wr-sample, which is boilerplate with a minimal case. The WR repo has no dependencies on Servo; it has some of the Servo concepts in its public API but is very standalone.

Branch location

https://github.com/glennw/webrender/tree/wr-next-next

Good first features

  • gw: Easy to start by opening up the shaders and seeing how they fit together. You can tweak the page and see how things work.
  • pcwalton: apitrace is a record & replay debugger for OpenGL. It'll show you all the paints, draw by draw. Irritating to build on mac (due to Qt dependency).
  • gw: GPU debugging tools for Linux & Windows are much better.
  • pcwalton: With apitrace, it's better.
  • vlad: Can use LLVMPIPE in the headless configuration.

Martin in H2

  • gw: In WR2, how do we do AA nicely on primitives? Lots of researchy ideas here. How do we build it into the shaders?