Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accelerate PDF type detection #7307

Closed
3 tasks
mmeeks opened this issue Sep 26, 2023 · 2 comments
Closed
3 tasks

Accelerate PDF type detection #7307

mmeeks opened this issue Sep 26, 2023 · 2 comments
Labels
enhancement New feature or request performance Improving COOL performance

Comments

@mmeeks
Copy link
Contributor

mmeeks commented Sep 26, 2023

When profiling COOL usage - we see a lot of cost on the server associated with PDF thumbnailing it seems - @caolanm you're prolly interested.

pdf-thumbnailing

I expect that this is all thumbnailing PDF files; we should prolly optimize these things in order:

  • 40bn cycles in pdfi::PDFDetector::detect - that is gratuitous; and the boost::spirit on the stack doesn't encourage me - it seems to take longer to 'detect' that a file is PDF than it does to load it ;-) - we should fix that: if extension is PDF and magic is correct - do we need to take longer ?
  • 27bn cycles FPDF_LoadPage - I -hope- that if we are thumbnailing, we only load the 1st page - and then on-demand render the PDF as needed - perhaps worth checking if whatever shortcut is used here is annotating the code-paths so we do as little work as possible.
  • 4bn+ cycles in SdrPageWindow::RedrawAll - hopefully this draws only a single page - and not all of them (?)

Anyhow - the 1st one is (I hope) somewhat low hanging fruit =)

@mmeeks mmeeks added enhancement New feature or request performance Improving COOL performance labels Sep 26, 2023
@caolanm
Copy link
Contributor

caolanm commented Sep 26, 2023

I think this pdfi::getAdditionalStreams is to detect if this is a hybrid pdf document, and if so, get its mimetype and the contents of the embedded odf document to stash for the final import. So there is a parse of the pdf structure to get those.

I suspect the stashed embedded contents are not used by the final parse and we end up reparsing if it is hybrid (grep for "comment extracting hybrid substream twice - once during detection, second time here")

Because this flamegraph has a prominent SdPdfFilter::Import it looks like the pdfs involved are not hybrid ones in any case so the expensive parse in the detection presumably is that of a non-hybrid pdf to find out that is is non-hybrid and has no embedded document or mimetype

It might be that it might be safe to assume that the absence of a "AdditionalStreams" keyword in the last few lines or so of a pdf means that is isn't hybrid and so skip the parse during detect in such a case.

example end of a hybrid pdf document:

trailer
<</Size 43/Root 41 0 R
/Info 42 0 R
/ID [
]
/DocChecksum /9D2161D19E360801716BA882E229DC2D
/AdditionalStreams [/application#2Fvnd#2Eoasis#2Eopendocument#2Epresentation 19 0 R
]

startxref
28402
%%EOF

probably good enough to take the last few k of the document during detect to see if its a hybrid and if not elide this expensive detect-time parse: -> https://gerrit.libreoffice.org/c/core/+/157227 for that thought

@mmeeks
Copy link
Contributor Author

mmeeks commented Jan 12, 2024

I think this was fixed with your merge Caolan =) at least not seen it in the profile recently.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request performance Improving COOL performance
Projects
Archived in project
Development

No branches or pull requests

2 participants