Accelerate PDF type detection #7307

mmeeks · 2023-09-26T11:24:08Z

When profiling COOL usage - we see a lot of cost on the server associated with PDF thumbnailing it seems - @caolanm you're prolly interested.

I expect that this is all thumbnailing PDF files; we should prolly optimize these things in order:

40bn cycles in pdfi::PDFDetector::detect - that is gratuitous; and the boost::spirit on the stack doesn't encourage me - it seems to take longer to 'detect' that a file is PDF than it does to load it ;-) - we should fix that: if extension is PDF and magic is correct - do we need to take longer ?
27bn cycles FPDF_LoadPage - I -hope- that if we are thumbnailing, we only load the 1st page - and then on-demand render the PDF as needed - perhaps worth checking if whatever shortcut is used here is annotating the code-paths so we do as little work as possible.
4bn+ cycles in SdrPageWindow::RedrawAll - hopefully this draws only a single page - and not all of them (?)

Anyhow - the 1st one is (I hope) somewhat low hanging fruit =)

caolanm · 2023-09-26T15:02:05Z

I think this pdfi::getAdditionalStreams is to detect if this is a hybrid pdf document, and if so, get its mimetype and the contents of the embedded odf document to stash for the final import. So there is a parse of the pdf structure to get those.

I suspect the stashed embedded contents are not used by the final parse and we end up reparsing if it is hybrid (grep for "comment extracting hybrid substream twice - once during detection, second time here")

Because this flamegraph has a prominent SdPdfFilter::Import it looks like the pdfs involved are not hybrid ones in any case so the expensive parse in the detection presumably is that of a non-hybrid pdf to find out that is is non-hybrid and has no embedded document or mimetype

It might be that it might be safe to assume that the absence of a "AdditionalStreams" keyword in the last few lines or so of a pdf means that is isn't hybrid and so skip the parse during detect in such a case.

example end of a hybrid pdf document:

trailer
<</Size 43/Root 41 0 R
/Info 42 0 R
/ID [
]
/DocChecksum /9D2161D19E360801716BA882E229DC2D
/AdditionalStreams [/application#2Fvnd#2Eoasis#2Eopendocument#2Epresentation 19 0 R
]

startxref
28402
%%EOF

probably good enough to take the last few k of the document during detect to see if its a hybrid and if not elide this expensive detect-time parse: -> https://gerrit.libreoffice.org/c/core/+/157227 for that thought

mmeeks · 2024-01-12T19:13:54Z

I think this was fixed with your merge Caolan =) at least not seen it in the profile recently.

mmeeks added enhancement New feature or request performance Improving COOL performance labels Sep 26, 2023

github-project-automation bot added this to Collabora Online Sep 26, 2023

mmeeks mentioned this issue Sep 27, 2023

Calc rendering acceleration #7318

Open

mmeeks closed this as completed Jan 12, 2024

github-project-automation bot moved this to Done in Collabora Online Jan 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Accelerate PDF type detection #7307

Accelerate PDF type detection #7307

mmeeks commented Sep 26, 2023

caolanm commented Sep 26, 2023

mmeeks commented Jan 12, 2024

Accelerate PDF type detection #7307

Accelerate PDF type detection #7307

Comments

mmeeks commented Sep 26, 2023

caolanm commented Sep 26, 2023

mmeeks commented Jan 12, 2024