
Performance Issue. 28.1% less inference time on demo case with a simple change. #1521


Open
David-Dingle opened this issue May 5, 2025 · 1 comment

Comments

@David-Dingle

David-Dingle commented May 5, 2025

Bug

This is a report of an observed performance issue. With it fixed, the demo case from the home page:

from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"  # document per local path or URL
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())  # output: "## Docling Technical Report[...]"

runs with 28.1% less total kernel execution time on an Nvidia RTX 3090.
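One way to collect total kernel execution time like this (a sketch, not necessarily the setup behind the 28.1% figure) is torch.profiler:

from torch.profiler import ProfilerActivity, profile

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    converter.convert("https://arxiv.org/pdf/2408.09869")
# total CUDA kernel time is summed in the cuda_time_total column
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))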

Steps to reproduce

I installed Docling on Ubuntu 22.04 in a conda virtual environment, following the official guide. The case given above passes inference data through class TMTransformerDecoderLayer(nn.TransformerDecoderLayer).
Code location on my PC: /envs/docling/lib/python3.11/site-packages/docling_ibm_models/tableformer/models/table04_rs/transformer_rs.py
If you go down to lines 96 and 107:

        # From PyTorch but modified to only use the last tag
        tgt_last_tok = tgt[-1:, :, :]
        tmp_tgt = self.self_attn(
            tgt_last_tok,
            tgt,
            tgt,
            attn_mask=None,  # None, because we only care about the last tag
            key_padding_mask=tgt_key_padding_mask,
            is_docling=True,
        )[0]  # line 97
        tgt_last_tok = tgt_last_tok + self.dropout1(tmp_tgt)
        tgt_last_tok = self.norm1(tgt_last_tok)
        if memory is not None:
            with proton.scope("transformer_rs"):
                tmp_tgt = self.multihead_attn(
                    tgt_last_tok,
                    memory,
                    memory,
                    attn_mask=memory_mask,
                    key_padding_mask=memory_key_padding_mask,
                    is_docling=True,
                )[0]  # line 107
            tgt_last_tok = tgt_last_tok + self.dropout2(tmp_tgt)
            tgt_last_tok = self.norm2(tgt_last_tok)

At both call sites, only the first of the three returned tensors is used, which means 4 out of the 6 projection computations are wasted.
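To see why these call sites hit the encoder-decoder branch of the packed in-projection, note that the query is only the last tag while key and value are the same full tensor. A minimal sketch with hypothetical shapes:

import torch

tgt = torch.randn(12, 1, 512)   # (seq, batch, embed); hypothetical sizes
tgt_last_tok = tgt[-1:, :, :]   # query: the last tag only
q, k, v = tgt_last_tok, tgt, tgt

# _in_projection_packed dispatches on object identity:
assert k is v       # key and value are the same tensor object...
assert q is not k   # ...but the query is a slice, so the branch that
                    # computes q_proj and kv_proj separately is taken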

As for the root cause: the code above invokes PyTorch's torch.nn.functional._in_projection_packed, but only uses q_proj and discards kv_proj[0] and kv_proj[1].

To speed this up, you can create a function that copies the PyTorch code and comments out lines 5726 to 5735. Doing so saves one nn.Linear call and one contiguous() copy operation.

from typing import Optional

from torch import Tensor
from torch.nn.functional import linear


def docling_in_projection_packed(
    q: Tensor,
    k: Tensor,
    v: Tensor,
    w: Tensor,
    b: Optional[Tensor] = None,
) -> tuple[Tensor, Optional[Tensor], Optional[Tensor]]:
    E = q.size(-1)
    if k is v:
        if q is k:
            # self-attention
            proj = linear(q, w, b)
            # reshape to 3, E and not E, 3 is deliberate for better memory
            # coalescing and keeping same order as chunk()
            proj = (
                proj.unflatten(-1, (3, E))
                .unsqueeze(0)
                .transpose(0, -2)
                .squeeze(-2)
                .contiguous()
            )
            return proj[0], proj[1], proj[2]
        else:
            # encoder-decoder attention
            w_q, w_kv = w.split([E, E * 2])
            if b is None:
                b_q = b_kv = None
            else:
                b_q, b_kv = b.split([E, E * 2])
            q_proj = linear(q, w_q, b_q)
            # PyTorch lines 5726-5735 commented out: the kv projection
            # (one extra nn.Linear plus a contiguous() copy) is skipped.
            # kv_proj = linear(k, w_kv, b_kv)
            # kv_proj = (
            #     kv_proj.unflatten(-1, (2, E))
            #     .unsqueeze(0)
            #     .transpose(0, -2)
            #     .squeeze(-2)
            #     .contiguous()
            # )
            # return (q_proj, kv_proj[0], kv_proj[1])
            return (q_proj, None, None)
    else:
        w_q, w_k, w_v = w.chunk(3)
        if b is None:
            b_q = b_k = b_v = None
        else:
            b_q, b_k, b_v = b.chunk(3)
        return linear(q, w_q, b_q), linear(k, w_k, b_k), linear(v, w_v, b_v)
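A quick way to sanity-check where the saving comes from is a micro-benchmark of the full packed projection against the q-only variant. This is a sketch of mine, not part of the measurements above; the shapes and the helper names full_path / q_only_path are hypothetical:

import time

import torch
from torch.nn.functional import linear

E, S, B = 512, 256, 1          # embed dim, source length, batch (hypothetical)
q = torch.randn(1, B, E)       # last-token query
k = torch.randn(S, B, E)       # full key/value sequence
w = torch.randn(3 * E, E)      # packed in-projection weight
w_q, w_kv = w.split([E, 2 * E])

def full_path():
    # original behavior: project q, then k/v together, plus a contiguous() copy
    q_proj = linear(q, w_q)
    kv = (
        linear(k, w_kv)
        .unflatten(-1, (2, E))
        .unsqueeze(0)
        .transpose(0, -2)
        .squeeze(-2)
        .contiguous()
    )
    return q_proj, kv[0], kv[1]

def q_only_path():
    # modified behavior: skip the kv projection entirely
    return linear(q, w_q), None, None

for fn in (full_path, q_only_path):
    t0 = time.perf_counter()  # CPU timing; on GPU, synchronize before reading
    for _ in range(1000):
        fn()
    print(fn.__name__, time.perf_counter() - t0)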

Docling version

commit: bfcab3d

Python version

python=3.11.9

@David-Dingle David-Dingle added the bug Something isn't working label May 5, 2025
@cau-git cau-git added performance and removed bug Something isn't working labels May 21, 2025
@cau-git
Contributor

cau-git commented May 21, 2025

@David-Dingle Very interesting findings! We will have a look into it as soon as we can.
