
Shader and In-Loop Deblocking

Richard Geldreich edited this page Jul 3, 2026 · 19 revisions

Copyright (C) 2025-2026 Binomial LLC. All rights reserved except as granted under the Apache 2.0 LICENSE. Also see our NOTICE file. If you modify the Basis Universal source code, specifications, or wiki documents and redistribute the files, you must cause any modified files to carry prominent notices stating that you changed the files (see Apache 2.0 §4(b)).

Intro

Block boundaries are predictable.

Large ASTC/XUASTC block sizes (8x8 through 12x12) reach very low bitrates — down to ~0.89 bpp — but block-boundary seams become increasingly visible beyond roughly 6x6. Basis Universal addresses this with a small, standardized deblocking reconstruction operator that smooths block-boundary seams at sample time, making the largest block sizes (and therefore the lowest bitrates and lowest VRAM consumption) practical.
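The ~0.89 bpp figure can be checked with a little arithmetic. The sketch below assumes the figure refers to ASTC's fixed 128 bits per block on the GPU; the function name astcBitsPerPixel is made up for this illustration and is not part of Basis Universal.

```cpp
#include <cassert>

// ASTC stores every block in 128 bits, whatever the block footprint is.
constexpr double kAstcBitsPerBlock = 128.0;

// Bits per pixel for a given block footprint (hypothetical helper).
double astcBitsPerPixel(int blockWidth, int blockHeight)
{
    assert(blockWidth > 0 && blockHeight > 0);
    return kAstcBitsPerBlock / (blockWidth * blockHeight);
}
```

Under that assumption 4x4 costs 8 bpp, 8x8 costs 2 bpp, and 12x12 costs 128/144, or about 0.89 bpp.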

This page explains what in-loop deblocking is, walks through the reference GLSL pixel shader section by section, and covers the integration rules and gotchas. It's written for graphics developers who know shaders, basic filtering theory, and roughly how JPEG works — no codec expertise assumed.

Reference implementations of the shader, all using the same operator:

| Sample | API | Notes |
| --- | --- | --- |
| shader_deblocking_glfw | C++ / OpenGL 3.3 | The native reference; bin/shader.glsl is the shader walked through below |
| shader_deblocking_d3d11 | C++ / Direct3D 11 | bin/deblock.hlsl has the operator as a reusable DeblockSample() HLSL function |
| webgl/shader_deblocking | WebGL 2 | Live demo available |
| shader_deblocking (Python) | Python + OpenGL | The original testbed; also supports raw PNG input for experimentation |

What Is In-Loop Deblocking?

A quick tour of reconstruction operators, for context. Every lossy block-based codec defines decoding as a fixed reconstruction procedure: bits in, pixels out. The interesting design question is what else runs as part of that procedure.

  • JPEG has no deblocking in its decode contract. Any deblocking you've seen applied to JPEG is a post-process — decoder-side guesswork, bolted on after the fact. The encoder knows nothing about it, so it can't spend bits assuming the filter will clean anything up, and the filter can't distinguish a coding seam from real image content that happens to sit on an 8x8 boundary.
  • Modern video codecs (H.264, HEVC, AV1) moved the deblocking filter inside the coding loop. H.263 Annex J offered an optional in-loop deblocking mode back in 1998, and H.264/AVC made the loop filter a mandatory part of the baseline decode (see also Wikipedia). The filter is part of the standardized decode and, crucially, the encoder reconstructs frames through the filter when making its rate/distortion decisions. The encoder and decoder agree on the exact reconstruction operator, so the encoder can deliberately spend fewer bits on block-boundary fidelity, knowing the filter will smooth the seams. That agreement is what "in-loop" means, and it's a large part of why these codecs look so much better than their still-image ancestors at the same bitrate.

Basis Universal applies the same idea to GPU texture sampling. When deblocking is enabled at compression time, the encoder globally optimizes the output blocks (using Stochastic Coordinate Descent) with this exact deblocking operator in the loop — it evaluates candidate encodings by what they look like after the filter runs. The filter is therefore the decoder half of the codec, not an optional beautification pass. Content compressed with deblocking awareness should be sampled with the deblocking filter applied, and the KTX2 file records this expectation (see DeblockFilterID below).

The texture twist that makes this practical: unlike a video decoder, a GPU samples textures with fixed-function hardware, so the filter can't run "in the decoder." Instead it runs in one of two places — on the CPU while transcoding, or (preferably) in the pixel shader at sample time. And unlike video, the block lattice of a texture is entirely predictable from the sample position: a shader always knows exactly how far the current sample is from a block boundary, with no signaling.

The Reconstruction Operator

Per sample, the operator:

  1. Computes the sample's offset within its block. Interior texels get weight 0 and pass through completely untouched — bit-exact.
  2. Computes an edge-proximity weight that ramps from 0 to 1 near each block boundary, independently for horizontal and vertical seams.
  3. Samples the 4 axis-neighbors (5 taps total, including the center), runs a 3-tap box filter across each axis, and blends from the original sample toward the filtered result by the edge weight, with a normalization step where the horizontal and vertical filters overlap at block corners.

Why a 5-tap kernel?

We could have chosen a wider kernel (video codecs use longer, adaptive filters), but the 1+4 tap cross is the simplest kernel that proved robust in both 3D and 2D: it behaves well under bilinear/trilinear filtering, mipmapping, minification, and oblique viewing angles in 3D scenes, and equally well in flat 2D viewers with no mipmapping at all. It is also extremely cheap: five texture fetches per sample, issued unconditionally so there is no data-dependent branching around them, plus a handful of ALU ops. For interior texels only the center tap contributes to the result. Wider kernels bought little visible quality on GPU texture content while costing taps, complexity, and temporal-stability risk. Since the encoder optimizes against whatever operator is standardized, a simple operator the encoder fully models beats a fancy one it can't.

The filter's falloff radius (how many texels from a boundary the blending ramp extends) is a tolerance, not a codec constant: 1.0 texel suits flat 2D viewers; 1.5 texels is more stable under minification and oblique angles, and is what the 3D samples use.

The Reference GLSL Shader

The complete fragment shader is shader_deblocking_glfw/bin/shader.glsl. It is about 95 lines including comments and is designed to be lifted into an engine nearly verbatim (the D3D11 sample's deblock.hlsl packages the same operator as a reusable HLSL function). It declares four uniforms, the texture itself plus three parameters:

uniform sampler2D tex;
uniform vec4  texSize;  // xy = base mip (mip 0) dimensions, zw = SOURCE block size in texels
uniform float maxLod;   // number_of_mip_levels - 1
uniform vec4  const0;   // x = deblock on/off, y = edge-weight visualization

Note texSize.zw is the block size of the encoded source (from the KTX2 header), never the GPU storage format's block size — more on this in Gotchas.
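On the host side, filling these uniforms might look like the sketch below. The DeblockUniforms struct and makeDeblockUniforms function are hypothetical names for this illustration; only the field meanings come from the shader comments above. The block size passed in is assumed to be the source block size read from the KTX2 header, and the mip count is assumed to be the number of levels actually uploaded.

```cpp
#include <cassert>

// CPU-side mirror of the shader's three parameter uniforms (hypothetical struct).
struct DeblockUniforms {
    float texSize[4]; // xy = mip 0 dimensions, zw = SOURCE block size in texels
    float maxLod;     // number of uploaded mip levels - 1
    float const0[4];  // x = deblock on/off, y = edge-weight visualization
};

DeblockUniforms makeDeblockUniforms(int baseWidth, int baseHeight,
                                    int srcBlockWidth, int srcBlockHeight,
                                    int uploadedMipLevels,
                                    bool deblockEnabled, bool showEdgeWeights)
{
    assert(baseWidth > 0 && baseHeight > 0);
    assert(srcBlockWidth > 0 && srcBlockHeight > 0);
    assert(uploadedMipLevels >= 1);

    DeblockUniforms u = {};
    u.texSize[0] = static_cast<float>(baseWidth);
    u.texSize[1] = static_cast<float>(baseHeight);
    u.texSize[2] = static_cast<float>(srcBlockWidth);  // from the encoded source,
    u.texSize[3] = static_cast<float>(srcBlockHeight); // not the GPU format
    u.maxLod     = static_cast<float>(uploadedMipLevels - 1);
    u.const0[0]  = deblockEnabled ? 1.0f : 0.0f;
    u.const0[1]  = showEdgeWeights ? 1.0f : 0.0f;
    return u;
}
```

The resulting values would then be uploaded with whatever uniform mechanism the engine already uses.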

The shader has five stages. Let's walk through each.

Stage 1: Recover the effective mip level

vec2 du = dFdx(vUV);
vec2 dv = dFdy(vUV);
float rho = max(length(du * texSize.xy), length(dv * texSize.xy));
float lod = clamp(log2(max(rho, 1e-8)), 0.0, maxLod);
float mipScale = exp2(floor(lod + 0.5)); // snap to dominant mip: 1=mip0, 2=mip1, 4=mip2...

This is the heart of what makes the filter mipmap-compatible. The fixed-function texture hardware picks a mip level from the screen-space UV derivatives; the shader reconstructs that same decision using the derivative instructions. rho is the texel-to-pixel density (how many base-mip texels one screen pixel spans), log2(rho) is the LOD the hardware would compute, and exp2(floor(lod + 0.5)) snaps it to the dominant mip — the level actually contributing most of the sampled color.

Everything downstream is then evaluated in that mip's texel space — the effective mip space. This matters because a block that is 12x12 texels at mip 0 is 6x6 texels at mip 1 and 3x3 at mip 2: the seams shrink with the mip chain, and a filter that only reasoned in base-texture space would smooth the wrong (too-wide) region as soon as the texture is minified. By tracking the dominant mip, the block lattice — and the filter's footprint — rescale with whatever the hardware is really sampling. (You could use textureQueryLod() on desktop GL, but it's unavailable in WebGL; the derivative math is the portable reference path, and CalculateLevelOfDetail() is the equivalent option in HLSL SM 4.1+.)
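To get a feel for the snapping behavior, the Stage 1 math can be emulated on the CPU. This is a sketch that mirrors the GLSL above with scalar inputs; the function names texelFootprint and dominantMipScale are made up for this example, and the derivatives are passed in explicitly because there is no dFdx/dFdy outside a shader.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// rho: how many base-mip texels one screen pixel spans, from the UV derivatives.
float texelFootprint(float dudx, float dvdx, float dudy, float dvdy,
                     float texWidth, float texHeight)
{
    float lenX = std::hypot(dudx * texWidth, dvdx * texHeight);
    float lenY = std::hypot(dudy * texWidth, dvdy * texHeight);
    return std::max(lenX, lenY);
}

// Snap the LOD implied by rho to the nearest mip and return 2^mip.
float dominantMipScale(float rho, float maxLod)
{
    assert(maxLod >= 0.0f);
    float lod = std::log2(std::max(rho, 1e-8f));
    lod = std::min(std::max(lod, 0.0f), maxLod); // clamp(lod, 0, maxLod)
    return std::exp2(std::floor(lod + 0.5f));
}
```

With a ten-level clamp, rho = 1.4 still snaps to mip 0 (scale 1), rho = 2 snaps to mip 1 (scale 2), and rho = 3 snaps to mip 2 (scale 4). Magnification (rho below 1) clamps to mip 0.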

Stage 2: Locate the sample in the block lattice

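vec2 texDim    = texSize.xy;                 // base mip (mip 0) dimensions in texels
vec2 blockSize = texSize.zw;                 // SOURCE block size in texels
vec2 texelStep = mipScale / texDim;          // one effective-mip texel, expressed in UV units
vec2 texelPos  = (vUV * texDim) / mipScale;  // the physical texel coordinate in effective mip space
vec2 blockPos  = mod(texelPos, blockSize);   // offset within the current block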

texelPos is the sample's coordinate in the dominant mip's texel grid, and blockPos is where it sits inside its block. Because block boundaries are perfectly periodic, one mod() gives the full answer — this is the "entirely predictable" property that makes shader deblocking so much simpler than video deblocking, which must signal filter decisions per edge.
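A one-axis CPU emulation of this stage makes the lattice arithmetic easy to check by hand. The names glslMod, BlockLocation, and locateInBlock are hypothetical, and the sketch follows the shader snippet above literally: the block size is applied in the effective mip's texel grid.

```cpp
#include <cassert>
#include <cmath>

// GLSL mod(x, y) is x - y * floor(x / y); unlike std::fmod, the result
// is never negative for positive y.
float glslMod(float x, float y)
{
    return x - y * std::floor(x / y);
}

struct BlockLocation {
    float texelPos; // coordinate in the effective mip's texel grid
    float blockPos; // offset from the start of the containing block
};

// One axis of Stage 2 (the shader does the same thing per component).
BlockLocation locateInBlock(float u, float texDim, float mipScale, float blockSize)
{
    assert(texDim > 0.0f && mipScale >= 1.0f && blockSize > 0.0f);
    float texelPos = (u * texDim) / mipScale;
    return { texelPos, glslMod(texelPos, blockSize) };
}
```

For a 1024-texel axis with 12-texel blocks, u = 0.5 lands on texel 512 at mip 0, which is 8 texels into its block (512 = 42 * 12 + 8). At mip 1 the same u lands on texel 256, which is 4 texels into its block.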

Stage 3: The five taps

vec3 c0 = texture(tex, vUV).rgb; // center tap
vec3 color = c0;                 // default output: the untouched center tap
// Keep these fetches outside non-uniform control flow: texture() uses implicit
// derivatives for LOD, which are undefined when neighboring fragments take
// different branches.
vec3 l1 = texture(tex, vUV - vec2(texelStep.x, 0.0)).rgb;
vec3 r1 = texture(tex, vUV + vec2(texelStep.x, 0.0)).rgb;
vec3 u1 = texture(tex, vUV - vec2(0.0, texelStep.y)).rgb;
vec3 d1 = texture(tex, vUV + vec2(0.0, texelStep.y)).rgb;

The four neighbors are fetched one effective-mip texel away, so the kernel footprint automatically widens (in UV space) at coarser mips. Note the fetches happen unconditionally, before any branching on the edge weight: texture() derives its LOD from implicit derivatives, which are undefined inside divergent control flow. Hoisting the taps costs a few redundant fetches on interior pixels and buys correctness on every driver — keep this property when porting.
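The widening footprint is easy to see numerically. This one-axis sketch computes the UV coordinates of the center tap and its two neighbors; Taps1D and tapCoords are names invented for the example, not part of the samples.

```cpp
#include <cassert>

struct Taps1D {
    float left;
    float center;
    float right;
};

// UV coordinates of the three taps along one axis, one effective-mip texel apart.
Taps1D tapCoords(float u, float texDim, float mipScale)
{
    assert(texDim > 0.0f && mipScale >= 1.0f);
    float texelStep = mipScale / texDim; // one effective-mip texel, in UV units
    return { u - texelStep, u, u + texelStep };
}
```

On a 1024-texel axis the neighbors sit 1/1024 of the texture away at mip 0 and 4/1024 away at mip 2, so the kernel covers the same number of texels in whichever mip is dominant.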

Stage 4: Edge proximity weights

const float falloff = 1.5;   // 3D tolerance; 2D viewers use 1.0
float leftProx   = 1.0 - clamp(blockPos.x / falloff, 0.0, 1.0);
float rightProx  = 1.0 - clamp((blockSize.x - blockPos.x) / falloff, 0.0, 1.0);
float topProx    = 1.0 - clamp(blockPos.y / falloff, 0.0, 1.0);
float bottomProx = 1.0 - clamp((blockSize.y - blockPos.y) / falloff, 0.0, 1.0);
float horizWeight = max(leftProx, rightProx);
float vertWeight  = max(topProx, bottomProx);
float edgeWeight  = max(horizWeight, vertWeight);

Each weight ramps linearly from 1 exactly on a boundary down to 0 at falloff texels away. Horizontal and vertical proximity are kept separate, because a sample near a vertical seam should only be filtered horizontally (across the seam) and vice versa. The smooth ramp is what makes the filter temporally stable: as a sample position slides toward a seam under motion or magnification, the filtering fades in continuously instead of popping.
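The ramp can be tabulated with a one-axis CPU version of the same expressions. This is a sketch; clamp01 and axisEdgeWeight are hypothetical helper names, and the caller supplies blockPos as produced by Stage 2.

```cpp
#include <algorithm>
#include <cassert>

float clamp01(float x)
{
    return std::min(std::max(x, 0.0f), 1.0f);
}

// Edge weight along one axis: 1 exactly on either block boundary,
// falling linearly to 0 at `falloff` texels away from it.
float axisEdgeWeight(float blockPos, float blockSize, float falloff)
{
    assert(blockSize > 0.0f && falloff > 0.0f);
    float nearProx = 1.0f - clamp01(blockPos / falloff);
    float farProx  = 1.0f - clamp01((blockSize - blockPos) / falloff);
    return std::max(nearProx, farProx);
}
```

With a 12-texel block and falloff 1.5, the weight is 1 at offset 0, 0.5 at offset 0.75, and 0 from offset 1.5 up to 10.5, then ramps back up toward the far boundary. With falloff 1.0 the same offset of 0.75 gives 0.25, which is the narrower ramp intended for flat 2D viewers.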

Stage 5: Filter and blend

vec3 filteredH = (l1 + c0 + r1) * (1.0 / 3.0);
vec3 filteredV = (u1 + c0 + d1) * (1.0 / 3.0);
vec3 horizColor = mix(c0, filteredH, horizWeight);
vec3 vertColor  = mix(c0, filteredV, vertWeight);
float totalW = horizWeight + vertWeight;
if (totalW > 0.0)
    color = (horizColor * horizWeight + vertColor * vertWeight) / totalW;

Two independent 3-tap box filters — one per axis — each blended against the original sample by its own edge weight, then combined with a weighted average normalized by totalW. The normalization handles block corners, where both weights are nonzero: without it, corner texels would be over-filtered by the sum of both axes. Interior pixels (edgeWeight == 0) skip all of this and keep the untouched center tap.
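The blend is simple enough to verify with scalars standing in for the RGB vectors. The sketch below mirrors the GLSL above for a single channel; mixf and deblockBlend are names chosen for this example.

```cpp
#include <cassert>

// GLSL mix(a, b, t) for scalars.
float mixf(float a, float b, float t)
{
    return a + (b - a) * t;
}

// One channel of Stage 5: per-axis 3-tap box filters, each blended by its
// own weight, then combined with a weighted average normalized by totalW.
float deblockBlend(float c0, float l1, float r1, float u1, float d1,
                   float horizWeight, float vertWeight)
{
    assert(horizWeight >= 0.0f && horizWeight <= 1.0f);
    assert(vertWeight >= 0.0f && vertWeight <= 1.0f);

    float filteredH  = (l1 + c0 + r1) * (1.0f / 3.0f);
    float filteredV  = (u1 + c0 + d1) * (1.0f / 3.0f);
    float horizColor = mixf(c0, filteredH, horizWeight);
    float vertColor  = mixf(c0, filteredV, vertWeight);

    float totalW = horizWeight + vertWeight;
    if (totalW <= 0.0f)
        return c0; // interior texel: untouched center tap
    return (horizColor * horizWeight + vertColor * vertWeight) / totalW;
}
```

Three cases are worth checking: with both weights at 0 the center tap comes back bit-exact; with only the horizontal weight at 1 the result is the horizontal box filter; at a corner with both weights at 1 the result is the average of the two box filters rather than their sum.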

There's also a debug visualization path (const0.y): output edgeWeight as grayscale, which draws the block lattice on screen. In the samples this is the 2 key — zoom out and watch the lattice rescale as coarser mips take over; it's the fastest way to convince yourself the effective-mip-space math is working.

Two Ways to Deblock

On the GPU, at sample time (preferred). The shader above. No transcode-time cost, no extra memory, and it runs in exactly the space the hardware samples. This works regardless of the GPU storage format — BC7, ETC, or even uncompressed RGBA — because the block artifacts are baked into the content by the original encode.

On the CPU, during transcoding. When transcoding ASTC/XUASTC LDR to other LDR formats (BC7, ETC1, raw pixels, etc.), the transcoder can apply the same operator on the CPU: each mip level is decompressed to memory, deblocked, then packed to the output format with the real-time encoders (bc7f, etc1f, ...). By default, when the caller passes neither deblocking decode flag, the KTX2 transcoder consults the file's DeblockFilterID field to decide whether to deblock; passing cDecodeFlagsNoDeblockFiltering or cDecodeFlagsForceDeblockFiltering overrides the file's field. (At encode time, deblocking awareness defaults to the largest block sizes only: those with a block area of 80 texels or more, i.e. 10x8, 10x10, 12x10, and 12x12; see BASISU_DEBLOCKING_BLOCK_SIZE_THRESHOLD in basisu_transcoder.h. It can be enabled on all block sizes via compressor parameters.) Note that CPU deblocking temporarily decompresses a whole mip level, whereas the non-deblocking pipeline only ever decodes a few block rows at a time, so disable it if that transient memory matters.
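The encode-time default described above amounts to a block-area test. The sketch below restates it with a local constant; kAreaThreshold and deblockingAwareByDefault are hypothetical names, and the value 80 is taken from the text rather than read from basisu_transcoder.h.

```cpp
#include <cassert>

// Assumed threshold: block area in texels at or above which deblocking
// awareness is on by default (per the description above).
constexpr int kAreaThreshold = 80;

bool deblockingAwareByDefault(int blockWidth, int blockHeight)
{
    assert(blockWidth > 0 && blockHeight > 0);
    return blockWidth * blockHeight >= kAreaThreshold;
}
```

Under this rule 10x8 (80 texels) is the smallest footprint that qualifies, while 8x8 (64 texels) and 10x6 (60 texels) do not.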

Gotchas

  • Never deblock twice. Exactly one of the two modes should run. If your pixel shader performs the deblocking, you MUST disable the transcoder's CPU deblocking by passing cDecodeFlagsNoDeblockFiltering at transcode time — otherwise the image is filtered twice (once on the CPU into the GPU texture, then again in the shader), visibly over-blurring every seam. All four samples do this.
  • The block size is a property of the encoded SOURCE, not the GPU format. Transcoding changes the container, not the artifact lattice: an 8x8 XUASTC source transcoded to BC7 (4x4 storage) still carries seams on the 8x8 source grid, and the shader must filter at 8x8. Always feed texSize.zw from the KTX2 header (ktx2_transcoder::get_block_width()/get_block_height()), never from the transcode target. Get this wrong and the filtering won't line up with the actual artifacts.
  • Keep the taps outside divergent control flow (Stage 3 above) — implicit-derivative fetches inside a non-uniform branch are undefined behavior that happens to work on some drivers and not others.
  • Pick the falloff for your use case: 1.0 for flat 2D UI/viewers, 1.5 for 3D. It's a tolerance; the codec doesn't require an exact value.
  • maxLod must match the mip chain you actually uploaded (levels − 1). If you upload only the base level, maxLod = 0 correctly collapses the lattice math to mip 0.

The DeblockFilterID KTX2 Key

Basis Universal writes a DeblockFilterID key/value into .KTX2 files, readable via ktx2_transcoder::get_deblocking_filter_index(). The value is ASCII "0" or "1":

  • "0" — the content was not compressed with deblocking awareness. (In practice the encoder simply omits the key in this case; a missing key reads back as 0.)
  • "1" — the content was compressed in-loop against the 5-tap reconstruction operator described on this page, and viewers/engines should apply that filter by default when sampling (users can still toggle it).

All four samples honor this contract: they auto-enable the GPU shader when the ID is 1, capture that as the reset default, and pass cDecodeFlagsNoDeblockFiltering at transcode time.
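As a rough illustration of that contract, the sketch below interprets a raw key value and derives the viewer's defaults. It does not call the real API (ktx2_transcoder::get_deblocking_filter_index() already returns the index); parseDeblockFilterId, ViewerDeblockPlan, and planForShaderDeblocking are hypothetical names, and treating any value other than "1" as 0 is an assumption of this example.

```cpp
#include <cassert>
#include <cstring>

// Interpret the ASCII value of the DeblockFilterID key.
// Pass nullptr when the key is absent; a missing key reads back as 0.
int parseDeblockFilterId(const char* value)
{
    if (value == nullptr)
        return 0;
    return std::strcmp(value, "1") == 0 ? 1 : 0; // assumption: anything else is 0
}

struct ViewerDeblockPlan {
    bool shaderDeblockDefault;  // initial (and reset) state of the GPU filter
    bool passNoDeblockFlag;     // pass cDecodeFlagsNoDeblockFiltering when transcoding
};

// Defaults for a viewer that deblocks in the pixel shader.
ViewerDeblockPlan planForShaderDeblocking(int deblockFilterId)
{
    ViewerDeblockPlan plan;
    plan.shaderDeblockDefault = (deblockFilterId == 1);
    plan.passNoDeblockFlag = true; // the shader owns deblocking; never filter twice
    return plan;
}
```

The CPU filter is disabled unconditionally here because the user can still toggle the shader on for content whose ID is 0, and in that case the texture must not have been filtered already.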

Note: this key will likely be renamed as this deblocking approach is standardized at Khronos.

Controlling Deblocking

Transcoder decode flags

  • cDecodeFlagsNoDeblockFiltering — disable all adaptive CPU deblocking (faster, less temporary memory; required when a GPU shader deblocks instead).
  • cDecodeFlagsForceDeblockFiltering — always deblock, even on block sizes smaller than 10x8.

Passing either flag causes the transcoder to ignore the KTX2 file's DeblockFilterID field; when neither is passed, the KTX2 transcoder uses that field (written by the encoder) to decide whether CPU deblocking occurs.
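The precedence described above can be written as a small decision function. This is a sketch of the documented behavior, not the transcoder's code: the flag constants are stand-ins with made-up values (the real ones are cDecodeFlagsNoDeblockFiltering and cDecodeFlagsForceDeblockFiltering), the function ignores the transcode target format, and letting the "no deblock" flag win when both are set is an assumption of this example.

```cpp
#include <cassert>

// Stand-in flag bits for illustration only; the values are hypothetical.
enum HypotheticalDecodeFlags : unsigned {
    kHypotheticalNoDeblock    = 1u << 0,
    kHypotheticalForceDeblock = 1u << 1
};

// Would CPU deblocking run for this combination of flags and file metadata?
bool cpuDeblockingRuns(unsigned flags, int deblockFilterId)
{
    if (flags & kHypotheticalNoDeblock)
        return false;               // assumption: wins if both flags are set
    if (flags & kHypotheticalForceDeblock)
        return true;                // ignore the file's field, always filter
    return deblockFilterId == 1;    // neither flag: follow DeblockFilterID
}
```

A shader-deblocking viewer therefore always lands on the first branch, and a tool that forces the filter for low-quality content lands on the second.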

Command line (basisu tool)

Transcode-time (unpack/validate): -transcode_no_deblocking (alias -transcode_disable_deblocking) and -transcode_force_deblocking.

Encode-time, XUASTC LDR only — controls deblocking awareness (SCD) and/or the filter applied during compression (default -xuastc_deblocking_largest; each flag also has a -xuastc_ldr_* alias):

  • -xuastc_no_deblocking — disable both SCD and filtering.
  • -xuastc_deblocking_largest — SCD + filtering only on the largest block sizes (default).
  • -xuastc_deblocking_all — SCD + filtering on all block sizes.
  • -xuastc_deblocking_scd_no_filtering — SCD only, no filtering.
  • -xuastc_deblocking_no_scd_filtering_largest / _all — filtering without SCD.
  • -xuastc_deblocking_num_passes X — number of deblocking filter passes.

A useful recipe

For XUASTC LDR targeting a non-ASTC format such as BC7, very low Weight Grid DCT quality factors (1-15) become usable if you force adaptive deblocking on all block sizes (cDecodeFlagsForceDeblockFiltering). Deblocking is what permits the very lowest bitrates — or the largest block sizes — to become practical.
