This is a report of an observed performance issue. With it fixed, the demo case from the home page:
from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"  # document per local path or URL
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())  # output: "## Docling Technical Report[...]"
can enjoy 28.1% less total kernel execution time on an NVIDIA RTX 3090.
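The number above comes from proton instrumentation (visible in the snippet further down). For a quick cross-check without proton, a sketch like the following can eyeball total CUDA kernel time for the demo; it is not the exact harness I used:

import torch
from torch.profiler import profile, ProfilerActivity

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
# Profile one end-to-end conversion and rank ops by total CUDA time
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    converter.convert("https://arxiv.org/pdf/2408.09869")
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))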
Steps to reproduce
...
I installed Docling in a conda virtual env on Ubuntu 22.04 following the official guide. The case given above passes inference data through class TMTransformerDecoderLayer(nn.TransformerDecoderLayer):
Path on my machine: /envs/docling/lib/python3.11/site-packages/docling_ibm_models/tableformer/models/table04_rs/transformer_rs.py
Go down to lines 96 and 107:
# From PyTorch but modified to only use the last tag
tgt_last_tok = tgt[-1:, :, :]
tmp_tgt = self.self_attn(
    tgt_last_tok,
    tgt,
    tgt,
    attn_mask=None,  # None, because we only care about the last tag
    key_padding_mask=tgt_key_padding_mask,
    is_docling=True,  # flag from my patched PyTorch; not a stock argument
)[0]  # line 97
tgt_last_tok = tgt_last_tok + self.dropout1(tmp_tgt)
tgt_last_tok = self.norm1(tgt_last_tok)
if memory is not None:
    # proton is Triton's profiler; this scope is my instrumentation, not upstream Docling code
    with proton.scope("transformer_rs"):
        tmp_tgt = self.multihead_attn(
            tgt_last_tok,
            memory,
            memory,
            attn_mask=memory_mask,
            key_padding_mask=memory_key_padding_mask,
            is_docling=True,  # flag from my patched PyTorch; not a stock argument
        )[0]  # line 107
    tgt_last_tok = tgt_last_tok + self.dropout2(tmp_tgt)
    tgt_last_tok = self.norm2(tgt_last_tok)
In both calls the query (tgt_last_tok) differs from the key/value tensor while the key is the value, so only the first of the three projections computed underneath does fresh work each step; across the two calls, 4 out of 6 projection computations are wasted. Talking about the root cause: the above code ends up invoking the encoder-decoder branch of torch.nn.functional._in_projection_packed, but only q_proj actually needs to be recomputed every step; kv_proj[0] and kv_proj[1] are derived from inputs that are either fixed for the whole decode (memory) or grow by a single tag per step (tgt), so recomputing them in full each step is wasted work.
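To make the waste concrete, here is a toy check (the shapes and sizes are made up for illustration): with a fixed memory tensor, the packed key/value projection produces identical output on every decoding step, so recomputing it per step buys nothing.

import torch
from torch.nn.functional import linear

E = 512                                 # embedding dim (illustrative)
w = torch.randn(3 * E, E)               # packed in-projection weight
w_q, w_kv = w.split([E, E * 2])         # same split as _in_projection_packed
memory = torch.randn(280, 1, E)         # encoder output, fixed during decoding

kv_step1 = linear(memory, w_kv)         # computed on decoding step 1
kv_stepN = linear(memory, w_kv)         # recomputed on every later step
assert torch.equal(kv_step1, kv_stepN)  # identical: the recomputation is pure waste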
To speed this up, you can create a function that is basically a copy of the PyTorch code with lines 5726 to 5735 commented out. By doing so, you save the cost of 1 nn.Linear and 1 contiguous() copy per call. (Returning None for the key/value projections of course means the caller has to obtain them from somewhere else, e.g. a cache filled on the first decoding step; that is what the is_docling flag in my patch selects.)
from typing import Optional

from torch import Tensor
from torch.nn.functional import linear


def docling_in_projection_packed(
    q: Tensor,
    k: Tensor,
    v: Tensor,
    w: Tensor,
    b: Optional[Tensor] = None,
) -> list[Tensor]:
    E = q.size(-1)
    if k is v:
        if q is k:
            # self-attention
            proj = linear(q, w, b)
            # reshape to 3, E and not E, 3 is deliberate for better memory
            # coalescing and keeping same order as chunk()
            proj = (
                proj.unflatten(-1, (3, E))
                .unsqueeze(0)
                .transpose(0, -2)
                .squeeze(-2)
                .contiguous()
            )
            return proj[0], proj[1], proj[2]
        else:
            # encoder-decoder attention
            w_q, w_kv = w.split([E, E * 2])
            if b is None:
                b_q = b_kv = None
            else:
                b_q, b_kv = b.split([E, E * 2])
            q_proj = linear(q, w_q, b_q)
            # CODE REMOVAL: the kv_proj linear and its contiguous() reshape
            # (functional.py lines 5726 to 5735) are deleted here
            return (q_proj, None, None)
    else:
        w_q, w_k, w_v = w.chunk(3)
        if b is None:
            b_q = b_k = b_v = None
        else:
            b_q, b_k, b_v = b.chunk(3)
        return linear(q, w_q, b_q), linear(k, w_k, b_k), linear(v, w_v, b_v)
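To sanity-check the saving in isolation, a micro-benchmark along these lines compares the stock projection with the trimmed one on the encoder-decoder branch. It assumes the function above is in scope; F._in_projection_packed is a private PyTorch API, so the import may differ between versions, and the shapes are illustrative:

import torch
import torch.nn.functional as F

E, S = 512, 280
dev = "cuda"  # requires a CUDA device
q = torch.randn(1, 1, E, device=dev)   # last tag only
kv = torch.randn(S, 1, E, device=dev)  # memory (k is v)
w = torch.randn(3 * E, E, device=dev)

def bench(fn, iters=1000):
    for _ in range(10):  # warm-up
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

stock = bench(lambda: F._in_projection_packed(q, kv, kv, w))
trimmed = bench(lambda: docling_in_projection_packed(q, kv, kv, w))
print(f"stock: {stock:.4f} ms, trimmed: {trimmed:.4f} ms")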
Docling version
...
commit: bfcab3d
Python version
...
python=3.11.9