Shader and In-Loop Deblocking

Richard Geldreich edited this page Jul 3, 2026 · 19 revisions

Copyright (C) 2025-2026 Binomial LLC. All rights reserved except as granted under the Apache 2.0 LICENSE. Also see our NOTICE file. If you modify the Basis Universal source code, specifications, or wiki documents and redistribute the files, you must cause any modified files to carry prominent notices stating that you changed the files (see Apache 2.0 §4(b)).

Intro

Block boundaries are predictable.

Large ASTC/XUASTC block sizes (8x8 through 12x12) reach very low bitrates — down to ~0.89 bpp — but block-boundary seams become increasingly visible beyond roughly 6x6. Basis Universal addresses this with a small, standardized deblocking reconstruction operator that smooths block-boundary seams at sample time, making the largest block sizes (and therefore the lowest bitrates and lowest VRAM consumption) practical.
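The bitrate figure above follows from ASTC's fixed 128-bit block: bits per pixel in GPU memory is 128 divided by the number of texels the block covers. A minimal Python sketch of that arithmetic (the helper name astc_bpp is ours for illustration, not part of Basis Universal):

```python
# Every ASTC block occupies 128 bits regardless of its footprint,
# so bits per pixel = 128 / (block_width * block_height).
ASTC_BLOCK_BITS = 128


def astc_bpp(block_w: int, block_h: int) -> float:
    """Bits per pixel for an ASTC block footprint (hypothetical helper)."""
    return ASTC_BLOCK_BITS / (block_w * block_h)


# Print the bitrate for a few footprints, smallest to largest.
for w, h in [(4, 4), (6, 6), (8, 8), (10, 8), (12, 12)]:
    print(f"{w}x{h}: {astc_bpp(w, h):.2f} bpp")
```

The 12x12 footprint works out to roughly 0.89 bpp, while 8x8 is exactly 2 bpp, which is why the largest block sizes are worth the extra effort.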

This page explains what in-loop deblocking is, walks through the reference GLSL pixel shader section by section, and covers the integration rules and gotchas. It's written for graphics developers who know shaders, basic filtering theory, and roughly how JPEG works — no codec expertise assumed.

Reference implementations of the shader, all using the same operator:

Sample | API | Notes
shader_deblocking_glfw | C++ / OpenGL 3.3 | The native reference; bin/shader.glsl is the shader walked through below
shader_deblocking_d3d11 | C++ / Direct3D 11 | bin/deblock.hlsl has the operator as a reusable DeblockSample() HLSL function
webgl/shader_deblocking | WebGL 2 | Live demo here
shader_deblocking (Python) | Python + OpenGL | The original testbed; also supports raw PNG input for experimentation

What Is In-Loop Deblocking?

A quick tour of reconstruction operators, for context. Every lossy block-based codec defines decoding as a fixed reconstruction procedure: bits in, pixels out. The interesting design question is what else runs as part of that procedure.

  • JPEG has no deblocking in its decode contract. Any deblocking you've seen applied to JPEG is a post-process — decoder-side guesswork, bolted on after the fact. The encoder knows nothing about it, so it can't spend bits assuming the filter will clean anything up, and the filter can't distinguish a coding seam from real image content that happens to sit on an 8x8 boundary.
  • Modern video codecs (H.264, HEVC, AV1) moved the deblocking filter inside the coding loop. H.263 Annex J offered an optional in-loop deblocking mode back in 1998, and H.264/AVC made the loop filter a mandatory part of the baseline decode (see also Wikipedia). The filter is part of the standardized decode and, crucially, the encoder reconstructs frames through the filter when making its rate/distortion decisions. The encoder and decoder agree on the exact reconstruction operator, so the encoder can deliberately spend fewer bits on block-boundary fidelity, knowing the filter will smooth the seams. That agreement is what "in-loop" means, and it's a large part of why these codecs look so much better than their still-image ancestors at the same bitrate.

Basis Universal applies the same idea to GPU texture sampling. When deblocking is enabled at compression time, the encoder globally optimizes the output blocks (using Stochastic Coordinate Descent) with this exact deblocking operator in the loop — it evaluates candidate encodings by what they look like after the filter runs. The filter is therefore the decoder half of the codec, not an optional beautification pass. Content compressed with deblocking awareness should be sampled with the deblocking filter applied, and the KTX2 file records this expectation (see DeblockFilterID below).

(An aside: we are frankly surprised that the IHVs who have shipped ASTC hardware haven't added deblocking support to their texture samplers. In-loop deblocking has been standard practice in the video and image compression worlds for decades — the largest ASTC block sizes would be far more usable if the texture sampler applied even a simple hardware deblocking filter at sample time.)

The texture twist that makes this practical: unlike a video decoder, a GPU samples textures with fixed-function hardware, so the filter can't run "in the decoder." Instead it runs in one of two places — on the CPU while transcoding, or (preferably) in the pixel shader at sample time. And unlike video, the block lattice of a texture is entirely predictable from the sample position: a shader always knows exactly how far the current sample is from a block boundary, with no signaling.

The Reconstruction Operator

Per sample, the operator:

  1. Computes the sample's offset within its block. Interior texels get weight 0 and pass through completely untouched — bit-exact.
  2. Computes an edge-proximity weight that ramps from 0 to 1 near each block boundary, independently for horizontal and vertical seams.
  3. Samples the 4 axis-neighbors (5 taps total, including the center), runs a 3-tap box filter across each axis, and blends from the original sample toward the filtered result by the edge weight, with a normalization step where the horizontal and vertical filters overlap at block corners.

Why a 5-tap kernel?

We could have chosen a wider kernel (video codecs use longer, adaptive filters), but the 1+4 tap cross is the simplest kernel that proved robust in both 3D and 2D: it behaves well under bilinear/trilinear filtering, mipmapping, minification, and oblique viewing angles in 3D scenes, and equally well in flat 2D viewers with no mipmapping at all. It is also extremely cheap: five texture fetches per sample, issued unconditionally so there is no data-dependent branching around the fetches (only samples near a seam actually use the four neighbor taps), plus a handful of ALU ops. Wider kernels bought little visible quality on GPU texture content while costing taps, complexity, and temporal-stability risk. Since the encoder optimizes against whatever operator is standardized, a simple operator the encoder fully models beats a fancy one it can't.

The filter's falloff radius (how many texels from a boundary the blending ramp extends) is a tolerance, not a codec constant: 1.0 texel suits flat 2D viewers; 1.5 texels is more stable under minification and oblique angles, and is what the 3D samples use.

The Reference GLSL Shader

The complete fragment shader is shader_deblocking_glfw/bin/shader.glsl: about 95 lines including comments, and designed to be lifted into an engine nearly verbatim (the D3D11 sample's deblock.hlsl packages the same operator as a reusable HLSL function). Besides the interpolated texture coordinate vUV, it takes the texture sampler plus three uniforms:

uniform sampler2D tex;
uniform vec4  texSize;  // xy = base mip (mip 0) dimensions, zw = SOURCE block size in texels
uniform float maxLod;   // number_of_mip_levels - 1
uniform vec4  const0;   // x = deblock on/off, y = edge-weight visualization

Note texSize.zw is the block size of the encoded source (from the KTX2 header), never the GPU storage format's block size — more on this in Gotchas.

The shader has five stages. Let's walk through each.

Stage 1: Recover the effective mip level

vec2 du = dFdx(vUV);
vec2 dv = dFdy(vUV);
float rho = max(length(du * texSize.xy), length(dv * texSize.xy));
float lod = clamp(log2(max(rho, 1e-8)), 0.0, maxLod);
float mipScale = exp2(floor(lod + 0.5)); // snap to dominant mip: 1=mip0, 2=mip1, 4=mip2...

This is the heart of what makes the filter mipmap-compatible. The fixed-function texture hardware picks a mip level from the screen-space UV derivatives; the shader reconstructs that same decision using the derivative instructions. rho is the texel-to-pixel density (how many base-mip texels one screen pixel spans), log2(rho) is the LOD the hardware would compute, and exp2(floor(lod + 0.5)) snaps it to the dominant mip — the level actually contributing most of the sampled color.

Everything downstream is then evaluated in that mip's texel space, the effective mip space. This matters because each mip level is encoded with the same block dimensions measured in its own texels (Stage 2 applies the block size directly to the mip-space coordinate): a 12x12 block at mip 1 covers 24x24 base-mip texels, and at mip 2 it covers 48x48. The seam lattice therefore sits at different base-texture positions on every level, and a filter that only reasoned in base-texture space would smooth the wrong region as soon as the texture is minified. By tracking the dominant mip, the block lattice rescales with whatever the hardware is really sampling. (You could use textureQueryLod() on desktop GL, but it's unavailable in WebGL; the derivative math is the portable reference path, and CalculateLevelOfDetail() is the equivalent option in HLSL SM 4.1+.)
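To experiment with the Stage 1 math off the GPU, it can be mirrored on the CPU. The following is a Python sketch, not code from the samples; the function name and argument layout are ours, and du/dv stand in for the values dFdx(vUV) and dFdy(vUV) would return:

```python
import math


def dominant_mip_scale(du, dv, tex_size, max_lod):
    """CPU mirror of Stage 1 (illustrative sketch).

    du, dv   -- UV change per screen pixel in x and y (dFdx / dFdy of the UV)
    tex_size -- (width, height) of mip 0
    max_lod  -- number of mip levels minus one
    Returns (lod, mip_scale); mip_scale is 1, 2, 4, ... for mip 0, 1, 2, ...
    """
    # Length of each derivative vector measured in base-mip texels.
    rho_x = math.hypot(du[0] * tex_size[0], du[1] * tex_size[1])
    rho_y = math.hypot(dv[0] * tex_size[0], dv[1] * tex_size[1])
    rho = max(rho_x, rho_y)
    # Same clamp as the shader: guard log2(0), then clamp to the mip chain.
    lod = min(max(math.log2(max(rho, 1e-8)), 0.0), float(max_lod))
    # Round to the nearest mip and convert it to a texel scale factor.
    mip_scale = 2.0 ** math.floor(lod + 0.5)
    return lod, mip_scale


# One screen pixel spanning two base texels in each direction selects mip 1.
print(dominant_mip_scale((2 / 1024, 0.0), (0.0, 2 / 1024), (1024, 1024), 10))
```

Magnification (rho below 1) clamps to LOD 0, so the lattice math falls back to base-mip texel space, and max_lod caps the scale at the coarsest uploaded level.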

Stage 2: Locate the sample in the block lattice

vec2 texDim    = texSize.xy;                 // base mip (mip 0) dimensions
vec2 blockSize = texSize.zw;                 // SOURCE block size in texels
vec2 texelStep = mipScale / texDim;          // one texel step, in effective mip space
vec2 texelPos  = (vUV * texDim) / mipScale;  // the physical texel coordinate in effective mip space
vec2 blockPos  = mod(texelPos, blockSize);   // offset within the current block

texelPos is the sample's coordinate in the dominant mip's texel grid, and blockPos is where it sits inside its block. Because block boundaries are perfectly periodic, one mod() gives the full answer — this is the "entirely predictable" property that makes shader deblocking so much simpler than video deblocking, which must signal filter decisions per edge.
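A quick way to sanity-check the lattice math is to evaluate it by hand for a known UV. This Python sketch mirrors the Stage 2 lines; the function name is hypothetical, and Python's float % is a floored modulo, which matches GLSL mod() for the positive block sizes used here:

```python
def locate_in_block(uv, tex_dim, block_size, mip_scale):
    """CPU mirror of Stage 2 (illustrative sketch).

    uv         -- (u, v) texture coordinate in [0, 1]
    tex_dim    -- (width, height) of mip 0
    block_size -- (width, height) of the SOURCE block footprint
    mip_scale  -- 1, 2, 4, ... as produced by Stage 1
    Returns (texel_step, texel_pos, block_pos).
    """
    # UV distance covered by one texel of the dominant mip.
    texel_step = (mip_scale / tex_dim[0], mip_scale / tex_dim[1])
    # Sample coordinate in the dominant mip's texel grid.
    texel_pos = (uv[0] * tex_dim[0] / mip_scale, uv[1] * tex_dim[1] / mip_scale)
    # Offset inside the current block; floored modulo like GLSL mod().
    block_pos = (texel_pos[0] % block_size[0], texel_pos[1] % block_size[1])
    return texel_step, texel_pos, block_pos


step, pos, block_pos = locate_in_block((0.5, 0.25), (1024, 1024), (12, 12), 1.0)
print(pos, block_pos)
```

At mip 0 the sample at UV (0.5, 0.25) of a 1024x1024 texture sits at texel (512, 256), which is offset (8, 4) inside its 12x12 block. Passing mip_scale = 2.0 moves the same UV to texel (256, 128) and offset (4, 8).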

Stage 3: The five taps

vec3 c0 = texture(tex, vUV).rgb;  // center tap
vec3 color = c0;                  // default output: the untouched center sample
// Keep these fetches outside non-uniform control flow: texture() uses implicit
// derivatives for LOD, which are undefined when neighboring fragments take
// different branches.
vec3 l1 = texture(tex, vUV - vec2(texelStep.x, 0.0)).rgb;
vec3 r1 = texture(tex, vUV + vec2(texelStep.x, 0.0)).rgb;
vec3 u1 = texture(tex, vUV - vec2(0.0, texelStep.y)).rgb;
vec3 d1 = texture(tex, vUV + vec2(0.0, texelStep.y)).rgb;

The key detail here is texelStep. Because mipScale doubles with each dominant mip level (1, 2, 4, ...), the UV step between taps is scaled up by the same factor — the kernel's up/down/left/right samples are deliberately "pushed apart" to compensate for mipmapping. A step of mipScale / texDim is exactly one texel in the mip actually being sampled: at mip 0 the taps land one base-mip texel apart, at mip 1 two base-mip texels apart, at mip 2 four, and so on. Without this scaling, at coarser mips all five taps would land inside the same coarse texel and the filter would degenerate into sampling the same value five times — no smoothing at all. Note the fetches happen unconditionally, before any branching on the edge weight: texture() derives its LOD from implicit derivatives, which are undefined inside divergent control flow. Hoisting the taps costs a few redundant fetches on interior pixels and buys correctness on every driver — keep this property when porting.
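The effect of scaling the tap offsets can be checked numerically. The Python sketch below (helper names are ours) computes the five tap UVs and reports which texel of the dominant mip each one falls into, with and without the mipScale factor:

```python
import math


def tap_uvs(uv, tex_dim, mip_scale):
    """UVs of the center, left, right, up and down taps (mirror of Stage 3)."""
    sx = mip_scale / tex_dim[0]
    sy = mip_scale / tex_dim[1]
    u, v = uv
    return [(u, v), (u - sx, v), (u + sx, v), (u, v - sy), (u, v + sy)]


def mip_texel(uv, tex_dim, mip_scale):
    """Integer texel index a UV falls into at the dominant mip."""
    return (math.floor(uv[0] * tex_dim[0] / mip_scale),
            math.floor(uv[1] * tex_dim[1] / mip_scale))


tex_dim = (1024, 1024)
mip_scale = 4  # dominant mip is mip 2
# A UV at the center of mip-2 texel (128, 64).
uv = (128.5 * mip_scale / 1024, 64.5 * mip_scale / 1024)

# Offsets scaled by mipScale: each neighbor tap lands in an adjacent texel.
scaled = [mip_texel(t, tex_dim, mip_scale) for t in tap_uvs(uv, tex_dim, mip_scale)]
# Offsets left at one base-mip texel: every tap stays inside the same texel.
unscaled = [mip_texel(t, tex_dim, mip_scale) for t in tap_uvs(uv, tex_dim, 1)]
print(scaled)
print(unscaled)
```

With the scaling, the taps land in the four axis-neighbors of texel (128, 64); without it, all five resolve to that one texel, which is the degenerate case described above.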

Stage 4: Edge proximity weights

const float falloff = 1.5;   // 3D tolerance; 2D viewers use 1.0
float leftProx   = 1.0 - clamp(blockPos.x / falloff, 0.0, 1.0);
float rightProx  = 1.0 - clamp((blockSize.x - blockPos.x) / falloff, 0.0, 1.0);
float topProx    = 1.0 - clamp(blockPos.y / falloff, 0.0, 1.0);
float bottomProx = 1.0 - clamp((blockSize.y - blockPos.y) / falloff, 0.0, 1.0);
float horizWeight = max(leftProx, rightProx);
float vertWeight  = max(topProx, bottomProx);
float edgeWeight  = max(horizWeight, vertWeight);

Each weight ramps linearly from 1 exactly on a boundary down to 0 at falloff texels away. Horizontal and vertical proximity are kept separate, because a sample near a vertical seam should only be filtered horizontally (across the seam) and vice versa. The smooth ramp is what makes the filter temporally stable: as a sample position slides toward a seam under motion or magnification, the filtering fades in continuously instead of popping. Note also that the filter is NOT gated or driven by the content itself — the weights depend purely on the sample's position within the block lattice, never on the sampled colors. A content-adaptive filter (like the edge-strength decisions in video deblocking) can flicker as content changes from frame to frame; this one cannot, which makes it even more temporally stable.
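The ramp is easy to tabulate on the CPU. This Python sketch mirrors the Stage 4 lines (the function names are ours), with the falloff exposed as a parameter so the 1.0 and 1.5 settings can be compared:

```python
def clamp01(x):
    """Clamp to [0, 1], like GLSL clamp(x, 0.0, 1.0)."""
    return min(max(x, 0.0), 1.0)


def edge_weights(block_pos, block_size, falloff=1.5):
    """CPU mirror of Stage 4 (illustrative sketch).

    block_pos  -- (x, y) offset of the sample inside its block, in texels
    block_size -- (width, height) of the source block
    Returns (horiz_weight, vert_weight, edge_weight).
    """
    left = 1.0 - clamp01(block_pos[0] / falloff)
    right = 1.0 - clamp01((block_size[0] - block_pos[0]) / falloff)
    top = 1.0 - clamp01(block_pos[1] / falloff)
    bottom = 1.0 - clamp01((block_size[1] - block_pos[1]) / falloff)
    horiz = max(left, right)   # proximity to a vertical seam
    vert = max(top, bottom)    # proximity to a horizontal seam
    return horiz, vert, max(horiz, vert)


# Walk toward the left seam of a 12x12 block at both falloff settings.
for x in (3.0, 1.25, 0.75, 0.0):
    print(x, edge_weights((x, 6.0), (12, 12), 1.5)[0],
          edge_weights((x, 6.0), (12, 12), 1.0)[0])
```

A sample 0.75 texels from a seam gets weight 0.5 with the 1.5 falloff and 0.25 with the 1.0 falloff; anything farther than the falloff distance from every seam gets weight 0 and is left untouched.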

Stage 5: Filter and blend

vec3 filteredH = (l1 + c0 + r1) * (1.0 / 3.0);
vec3 filteredV = (u1 + c0 + d1) * (1.0 / 3.0);
vec3 horizColor = mix(c0, filteredH, horizWeight);
vec3 vertColor  = mix(c0, filteredV, vertWeight);
float totalW = horizWeight + vertWeight;
if (totalW > 0.0)
    color = (horizColor * horizWeight + vertColor * vertWeight) / totalW;

Two independent 3-tap box filters — one per axis — each blended against the original sample by its own edge weight, then combined with a weighted average normalized by totalW. The normalization handles block corners, where both weights are nonzero: without it, corner texels would be over-filtered by the sum of both axes. Interior pixels (edgeWeight == 0) skip all of this and keep the untouched center tap.
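The blend can be traced with concrete numbers. The Python sketch below mirrors the Stage 5 lines using plain tuples for RGB values (mix and deblock_blend are our names for this illustration):

```python
def mix(a, b, t):
    """Component-wise linear blend, like GLSL mix(a, b, t)."""
    return tuple(x * (1.0 - t) + y * t for x, y in zip(a, b))


def deblock_blend(c0, l1, r1, u1, d1, horiz_w, vert_w):
    """CPU mirror of Stage 5 (illustrative sketch); colors are RGB tuples."""
    # 3-tap box filter across each axis.
    filtered_h = tuple((l + c + r) / 3.0 for l, c, r in zip(l1, c0, r1))
    filtered_v = tuple((u + c + d) / 3.0 for u, c, d in zip(u1, c0, d1))
    # Blend each filtered result against the center tap by its own weight.
    horiz_color = mix(c0, filtered_h, horiz_w)
    vert_color = mix(c0, filtered_v, vert_w)
    total_w = horiz_w + vert_w
    if total_w > 0.0:
        # Weighted average so corners are not filtered by the sum of both axes.
        return tuple((h * horiz_w + v * vert_w) / total_w
                     for h, v in zip(horiz_color, vert_color))
    return c0  # interior sample: pass the center tap through unchanged


gray = (0.6, 0.6, 0.6)
dark = (0.0, 0.0, 0.0)
# Sample exactly on a vertical seam with a dark texel to its left.
print(deblock_blend(gray, dark, gray, gray, gray, 1.0, 0.0))
# Same neighborhood at a block corner (both weights at 1).
print(deblock_blend(gray, dark, gray, gray, gray, 1.0, 1.0))
```

On the seam the result is the horizontal box average (about 0.4 per channel); at the corner the normalization averages the two axis results (about 0.5) instead of stacking them.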

There's also a debug visualization path (const0.y): output edgeWeight as grayscale, which draws the block lattice on screen. In the samples this is the 2 key — zoom out and watch the lattice rescale as coarser mips take over; it's the fastest way to convince yourself the effective-mip-space math is working.

The .KTX2/.DDS Studio Variant (2D, CPU-Exact)

The .KTX2/.DDS Studio WebGL tool also supports shader deblocking — see its renderer.js — but it is the 2D viewer variant, with two deliberate differences from the 3D sample shader walked through above:

  • No mipmaps. The Studio views a single level at a time, so its shader skips the effective-mip-level recovery entirely (no derivative/LOD math) and evaluates the block lattice directly in the viewed level's texel space.
  • 1.0 texel falloff. It uses the tighter 2D falloff radius, and as a result it 100% accurately implements what the CPU deblocking filter does during transcoding — the Studio's on-screen result is the transcoder's output, texel for texel.

The 3D sample shader on this page purposely differs on both counts: it uses a slightly wider falloff (1.5 vs. 1.0 texels), which reduces the edge seams more strongly than 1.0 — under the distortions of 3D texture mapping (minification, oblique viewing angles, bilinear/trilinear blending) the extra smoothing width is what keeps the seams suppressed — and it takes mipmapping into account (the effective-mip-space math of Stage 1). Use the Studio/CPU variant for flat, single-level 2D viewing where matching the CPU filter exactly matters; use the 3D variant for anything rendered in a scene.

Two Ways to Deblock

On the GPU, at sample time (preferred). The shader above. No transcode-time cost, no extra memory, and it runs in exactly the space the hardware samples. This works regardless of the GPU storage format — BC7, ETC, or even uncompressed RGBA — because the block artifacts are baked into the content by the original encode.

The runtime cost is lower than it looks. The 4 extra filter taps are only one texel away from the center sample, so they are very likely to hit the GPU's texture cache, and the handful of ALU ops is cheap next to the memory bandwidth that large-block ASTC textures save relative to smaller block sizes. If the alternative to deblocking is shipping the texture at 2x-4x (or more) the memory bandwidth, deblocking wins.

On the CPU, during transcoding. When transcoding ASTC/XUASTC LDR to other LDR formats (BC7, ETC1, raw pixels, etc.), the transcoder can apply the same operator on the CPU: each mip level is decompressed to memory, deblocked, then packed to the output format with the real-time encoders (bc7f, etc1f, ...). By default, when the caller passes neither deblocking decode flag, the KTX2 transcoder consults the file's DeblockFilterID field to decide whether to deblock; passing cDecodeFlagsNoDeblockFiltering or cDecodeFlagsForceDeblockFiltering overrides the file's field.

At encode time, the encoder uses SCD (Stochastic Coordinate Descent) to optimize the output blocks for the deblocking filter when deblocking is enabled. By default that means 10x8 or larger ASTC block sizes, i.e. a block area of 80 texels or more: 10x8, 10x10, 12x10, and 12x12 (BASISU_DEBLOCKING_BLOCK_SIZE_THRESHOLD in basisu_transcoder.h). The user can override this behavior, either disabling it completely or turning it on for all block sizes.

Note that CPU deblocking temporarily decompresses a whole mip level (the non-deblocking pipeline only ever decodes a few block rows at a time), so disable it if that transient memory matters.
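The default block-size rule described above can be checked against the full list of 2D ASTC footprints. This Python sketch assumes the rule is a simple area comparison, as the text describes; the constant below is our stand-in, and how BASISU_DEBLOCKING_BLOCK_SIZE_THRESHOLD is actually defined in basisu_transcoder.h is not verified here:

```python
# The 14 standard 2D ASTC block footprints.
ASTC_2D_BLOCK_SIZES = [
    (4, 4), (5, 4), (5, 5), (6, 5), (6, 6), (8, 5), (8, 6),
    (8, 8), (10, 5), (10, 6), (10, 8), (10, 10), (12, 10), (12, 12),
]

# Assumption: deblocking awareness defaults on at a block area of 80 texels.
ASSUMED_AREA_THRESHOLD = 80


def deblocks_by_default(block_w: int, block_h: int) -> bool:
    """True if this footprint meets the assumed default area threshold."""
    return block_w * block_h >= ASSUMED_AREA_THRESHOLD


default_on = [bs for bs in ASTC_2D_BLOCK_SIZES if deblocks_by_default(*bs)]
print(default_on)
```

Under that assumption the filter selects exactly the four footprints named above; 8x8 (64 texels) and 10x6 (60 texels) fall just short and need the explicit override.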

Gotchas

  • Never deblock twice. Exactly one of the two modes should run. If your pixel shader performs the deblocking, you MUST disable the transcoder's CPU deblocking by passing cDecodeFlagsNoDeblockFiltering at transcode time — otherwise the image is filtered twice (once on the CPU into the GPU texture, then again in the shader), visibly over-blurring every seam. All four samples do this. (Note CPU deblocking can only occur when transcoding to a non-ASTC format — an ASTC-to-ASTC transcode passes the compressed blocks through untouched, so there is nothing for the CPU to filter; "deblocking" ASTC into ASTC would make no sense. When the GPU texture is ASTC, the pixel shader is the only place deblocking can happen.)
  • The block size is a property of the encoded SOURCE format, not the GPU format the source is transcoded to. Transcoding changes the container, not the artifact lattice: an 8x8 XUASTC source transcoded to BC7 (4x4 storage) still carries seams on the 8x8 source grid, and the shader must filter at 8x8. Always feed texSize.zw from the KTX2 header (ktx2_transcoder::get_block_width()/get_block_height()), never from the transcode target. Get this wrong and the filtering won't line up with the actual artifacts.
  • Keep the taps outside divergent control flow (Stage 3 above) — implicit-derivative fetches inside a non-uniform branch are undefined behavior that happens to work on some drivers and not others.
  • Pick the falloff for your use case: 1.0 for flat 2D UI/viewers, 1.5 for 3D. It's a tolerance; the codec doesn't require an exact value.
  • maxLod must match the mip chain you actually uploaded (levels − 1). If you upload only the base level, maxLod = 0 correctly collapses the lattice math to mip 0.
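Two of the rules above concern what the host feeds the shader. As a minimal sketch, assuming the host already knows the base dimensions, the source block size read from the KTX2 header, and the number of mip levels it uploaded, the uniform values could be assembled like this (the Python function and dictionary layout are hypothetical; only the uniform names come from the shader):

```python
def deblock_uniforms(base_w, base_h, src_block_w, src_block_h, uploaded_levels):
    """Build values for the shader's texSize and maxLod uniforms (sketch).

    src_block_w/h must be the block size of the encoded SOURCE, never the
    block size of the GPU format the texture was transcoded to.
    """
    if uploaded_levels < 1:
        raise ValueError("at least one mip level must be uploaded")
    return {
        # xy = mip 0 dimensions, zw = source block size in texels
        "texSize": (float(base_w), float(base_h),
                    float(src_block_w), float(src_block_h)),
        # number of uploaded levels minus one
        "maxLod": float(uploaded_levels - 1),
    }


# An 8x8 XUASTC source transcoded to BC7: pass 8x8, not BC7's 4x4.
print(deblock_uniforms(1024, 512, 8, 8, uploaded_levels=11))
# Base level only: maxLod is 0, so the lattice math stays in mip 0 space.
print(deblock_uniforms(1024, 512, 8, 8, uploaded_levels=1))
```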

The DeblockFilterID KTX2 Key

Basis Universal writes a DeblockFilterID key/value into .KTX2 files, readable via ktx2_transcoder::get_deblocking_filter_index(). The value is ASCII "0" or "1":

  • "0" — the content was not compressed with deblocking awareness. (In practice the encoder simply omits the key in this case; a missing key reads back as 0.)
  • "1" — the content was compressed in-loop against the 5-tap reconstruction operator described on this page, and viewers/engines should apply that filter by default when sampling (users can still toggle it).
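For tools that read KTX2 key/value data themselves rather than going through ktx2_transcoder, the interpretation above can be sketched in a few lines of Python. This assumes the key/value pairs have already been parsed into a dictionary of raw byte strings; the tolerance for a trailing NUL terminator and the rejection of unknown values are our choices for the sketch, not documented behavior:

```python
def deblock_filter_id(key_values: dict) -> int:
    """Interpret the DeblockFilterID key from parsed KTX2 key/value data.

    key_values maps key names to raw value bytes (hypothetical layout).
    A missing key reads back as 0, as described above.
    """
    raw = key_values.get("DeblockFilterID")
    if raw is None:
        return 0
    # Assumption: tolerate a NUL terminator on the stored string.
    text = raw.rstrip(b"\x00").decode("ascii")
    if text not in ("0", "1"):
        # Unknown IDs are rejected here rather than guessed at.
        raise ValueError(f"unrecognized DeblockFilterID value: {text!r}")
    return int(text)


print(deblock_filter_id({}))
print(deblock_filter_id({"DeblockFilterID": b"1"}))
```

A viewer would then enable its deblocking shader by default when the function returns 1.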

All four samples honor this contract: they auto-enable the GPU shader when the ID is 1, capture that as the reset default, and pass cDecodeFlagsNoDeblockFiltering at transcode time.

Note: this key will likely be changing names as this deblocking approach is standardized at Khronos.

Controlling Deblocking

Transcoder decode flags

  • cDecodeFlagsNoDeblockFiltering — disable all adaptive CPU deblocking (faster, less temporary memory; required when a GPU shader deblocks instead).
  • cDecodeFlagsForceDeblockFiltering — always deblock, even on block sizes smaller than 10x8.

Passing either flag causes the transcoder to ignore the KTX2 file's DeblockFilterID field; when neither is passed, the KTX2 transcoder uses that field (written by the encoder) to decide whether CPU deblocking occurs.
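The precedence between the two flags and the file's field can be summarized as a small decision function. This is a Python sketch of the rules as stated on this page, not the transcoder's actual code; the parameter names are ours, and because the page does not say what happens when both flags are passed together, the sketch rejects that combination rather than guessing:

```python
def cpu_deblock_enabled(file_filter_id: int,
                        no_deblock: bool = False,
                        force_deblock: bool = False,
                        target_is_astc: bool = False) -> bool:
    """Decide whether CPU deblocking runs during a transcode (sketch).

    file_filter_id -- the file's DeblockFilterID (0 or 1)
    no_deblock     -- models cDecodeFlagsNoDeblockFiltering
    force_deblock  -- models cDecodeFlagsForceDeblockFiltering
    target_is_astc -- True when the output format is ASTC
    """
    if no_deblock and force_deblock:
        # Assumption: the combined behavior is unspecified here, so refuse it.
        raise ValueError("pass at most one deblocking decode flag")
    if target_is_astc:
        # ASTC blocks pass through untouched; only a shader can deblock them.
        return False
    if no_deblock:
        return False
    if force_deblock:
        return True
    # Neither flag passed: follow the field written by the encoder.
    return file_filter_id == 1


# Shader-deblocked rendering: the CPU filter must stay off.
print(cpu_deblock_enabled(1, no_deblock=True))
# Plain transcode of deblocking-aware content to BC7: the CPU filter runs.
print(cpu_deblock_enabled(1))
```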

Command line (basisu tool)

Transcode-time (unpack/validate): -transcode_no_deblocking (alias -transcode_disable_deblocking) and -transcode_force_deblocking.

Encode-time, XUASTC LDR/ASTC LDR only — controls deblocking awareness (SCD) and/or the filter applied during compression (default -xuastc_deblocking_largest; each flag also has a -xuastc_ldr_* alias):

  • -xuastc_no_deblocking — disable both SCD and filtering.
  • -xuastc_deblocking_largest — SCD + filtering only on the largest block sizes (default).
  • -xuastc_deblocking_all — SCD + filtering on all block sizes.
  • -xuastc_deblocking_scd_no_filtering — SCD only, no filtering.
  • -xuastc_deblocking_no_scd_filtering_largest / _all — filtering without SCD.
  • -xuastc_deblocking_num_passes X — number of deblocking SCD filter passes (by default the number of passes is derived from the effort setting).

A useful recipe

For XUASTC LDR targeting a non-ASTC format such as BC7, very low Weight Grid DCT quality factors (1-15) become usable if you force adaptive deblocking on all block sizes (cDecodeFlagsForceDeblockFiltering). Deblocking is what permits the very lowest bitrates — or the largest block sizes — to become practical.

Future Directions

Two likely evolutions of this approach:

  • Wider kernels. The 5-tap cross was chosen as the simplest operator that is robust in both 2D and 3D, but nothing prevents standardizing wider or shaped kernels for content that benefits from stronger seam suppression. The DeblockFilterID mechanism already identifies which reconstruction operator the content was encoded against (ID 1 = the 5-tap kernel on this page), so new kernels can be added without breaking existing content — each ID pins down exactly one operator for the encoder to optimize against.
  • Encoder-supplied edge metadata. The current filter is deliberately content-blind (see Stage 4), which buys temporal stability — but it also means a real image edge that happens to land on a block boundary gets smoothed like a coding seam. The encoder, unlike the shader, can see the original image and knows which block edges carry genuine content. A future variant could have the encoder supply compact per-edge metadata that disables (or attenuates) filtering on those edges, preventing over-blurring of real edges while keeping the filter position-driven and flicker-free at runtime — the decisions are made once, offline, by the encoder, not per frame by inspecting sampled colors.
