
Performance Issue. 28.1% less inference time on demo case with a simple change. #1521


Open
David-Dingle opened this issue May 5, 2025 · 1 comment

Comments

@David-Dingle

David-Dingle commented May 5, 2025

Bug

This is a report of an observed performance issue. With it fixed, the demo case from the home page:

from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"  # document per local path or URL
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())  # output: "## Docling Technical Report[...]"

runs with 28.1% less total kernel execution time on an Nvidia RTX 3090.
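One way to collect total kernel execution time like this (a sketch, not necessarily the setup behind the 28.1% figure) is torch.profiler:

from torch.profiler import ProfilerActivity, profile

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    converter.convert("https://arxiv.org/pdf/2408.09869")
# total CUDA kernel time is summed in the cuda_time_total column
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))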

Steps to reproduce

I installed Docling on Ubuntu 22.04 in a conda virtual environment, following the official guide. The case given above passes inference data through class TMTransformerDecoderLayer(nn.TransformerDecoderLayer).
Code location on my PC: /envs/docling/lib/python3.11/site-packages/docling_ibm_models/tableformer/models/table04_rs/transformer_rs.py
If you go down to lines 96 and 107:

        # From PyTorch but modified to only use the last tag
        tgt_last_tok = tgt[-1:, :, :]
        tmp_tgt = self.self_attn(
            tgt_last_tok,
            tgt,
            tgt,
            attn_mask=None,  # None, because we only care about the last tag
            key_padding_mask=tgt_key_padding_mask,
            is_docling=True,
        )[0]  # line 97
        tgt_last_tok = tgt_last_tok + self.dropout1(tmp_tgt)
        tgt_last_tok = self.norm1(tgt_last_tok)
        if memory is not None:
            with proton.scope("transformer_rs"):
                tmp_tgt = self.multihead_attn(
                    tgt_last_tok,
                    memory,
                    memory,
                    attn_mask=memory_mask,
                    key_padding_mask=memory_key_padding_mask,
                    is_docling=True,
                )[0]  # line 107
            tgt_last_tok = tgt_last_tok + self.dropout2(tmp_tgt)
            tgt_last_tok = self.norm2(tgt_last_tok)

At both call sites, only the first of the three returned tensors is used, which means 4 out of the 6 projection computations are wasted.
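To see why these call sites hit the encoder-decoder branch of the packed in-projection, note that the query is only the last tag while key and value are the same full tensor. A minimal sketch with hypothetical shapes:

import torch

tgt = torch.randn(12, 1, 512)   # (seq, batch, embed); hypothetical sizes
tgt_last_tok = tgt[-1:, :, :]   # query: the last tag only
q, k, v = tgt_last_tok, tgt, tgt

# _in_projection_packed dispatches on object identity:
assert k is v       # key and value are the same tensor object...
assert q is not k   # ...but the query is a slice, so the branch that
                    # computes q_proj and kv_proj separately is taken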

As for the root cause: the code above invokes PyTorch's torch.nn.functional._in_projection_packed, but only uses q_proj and discards kv_proj[0] and kv_proj[1].

To speed this up, you can create a function that copies the PyTorch code and comments out lines 5726 to 5735. Doing so saves one nn.Linear call and one contiguous() copy operation.

from typing import Optional

from torch import Tensor
from torch.nn.functional import linear


def docling_in_projection_packed(
    q: Tensor,
    k: Tensor,
    v: Tensor,
    w: Tensor,
    b: Optional[Tensor] = None,
) -> tuple[Tensor, Optional[Tensor], Optional[Tensor]]:
    E = q.size(-1)
    if k is v:
        if q is k:
            # self-attention
            proj = linear(q, w, b)
            # reshape to 3, E and not E, 3 is deliberate for better memory
            # coalescing and keeping same order as chunk()
            proj = (
                proj.unflatten(-1, (3, E))
                .unsqueeze(0)
                .transpose(0, -2)
                .squeeze(-2)
                .contiguous()
            )
            return proj[0], proj[1], proj[2]
        else:
            # encoder-decoder attention
            w_q, w_kv = w.split([E, E * 2])
            if b is None:
                b_q = b_kv = None
            else:
                b_q, b_kv = b.split([E, E * 2])
            q_proj = linear(q, w_q, b_q)
            # PyTorch lines 5726-5735 commented out: the kv projection
            # (one extra nn.Linear plus a contiguous() copy) is skipped.
            # kv_proj = linear(k, w_kv, b_kv)
            # kv_proj = (
            #     kv_proj.unflatten(-1, (2, E))
            #     .unsqueeze(0)
            #     .transpose(0, -2)
            #     .squeeze(-2)
            #     .contiguous()
            # )
            # return (q_proj, kv_proj[0], kv_proj[1])
            return (q_proj, None, None)
    else:
        w_q, w_k, w_v = w.chunk(3)
        if b is None:
            b_q = b_k = b_v = None
        else:
            b_q, b_k, b_v = b.chunk(3)
        return linear(q, w_q, b_q), linear(k, w_k, b_k), linear(v, w_v, b_v)
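A quick way to sanity-check where the saving comes from is a micro-benchmark of the full packed projection against the q-only variant. This is a sketch of mine, not part of the measurements above; the shapes and the helper names full_path / q_only_path are hypothetical:

import time

import torch
from torch.nn.functional import linear

E, S, B = 512, 256, 1          # embed dim, source length, batch (hypothetical)
q = torch.randn(1, B, E)       # last-token query
k = torch.randn(S, B, E)       # full key/value sequence
w = torch.randn(3 * E, E)      # packed in-projection weight
w_q, w_kv = w.split([E, 2 * E])

def full_path():
    # original behavior: project q, then k/v together, plus a contiguous() copy
    q_proj = linear(q, w_q)
    kv = (
        linear(k, w_kv)
        .unflatten(-1, (2, E))
        .unsqueeze(0)
        .transpose(0, -2)
        .squeeze(-2)
        .contiguous()
    )
    return q_proj, kv[0], kv[1]

def q_only_path():
    # modified behavior: skip the kv projection entirely
    return linear(q, w_q), None, None

for fn in (full_path, q_only_path):
    t0 = time.perf_counter()  # CPU timing; on GPU, synchronize before reading
    for _ in range(1000):
        fn()
    print(fn.__name__, time.perf_counter() - t0)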

Docling version

commit: bfcab3d

Python version

python=3.11.9

@David-Dingle David-Dingle added the bug Something isn't working label May 5, 2025
@cau-git cau-git added performance and removed bug Something isn't working labels May 21, 2025
@cau-git
Contributor

cau-git commented May 21, 2025

@David-Dingle Very interesting findings! We will have a look into it as soon as we can.
