Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up SplitEpub.get_split_files() by using existing SplitEpub.split_lines data #9

Merged
merged 2 commits into from Mar 26, 2024

Conversation

haowjy
Copy link
Contributor

@haowjy haowjy commented Mar 24, 2024

Previously, SplitEpub.get_split_files() calls lines = self.get_split_lines() which requires parsing through the DOM multiple times. This change will use the existing SplitEpub.split_lines and include the lines as needed by the linenums, overwriting the SplitEpub.split_lines for each time SplitEpub.get_split_files() is called.

@JimmXinu
Copy link
Owner

What did you use to test?

Because this is buggy for me--it's including all ToC entries in each split book, not just the selected ones. Selected files are correctly included.

Plus in my quick test (book of 108 files, select 103, do New Book for N Sections with N=5 and Orphans=5 generating 20 split books), the speed up on calling splitepub.write_split_epub() is from 75ms on average to 68ms on average, 10% improvement.

Unless you can show more significant speed up, I don't think it's worth trying to debug it.

@JimmXinu JimmXinu merged commit f350569 into JimmXinu:main Mar 26, 2024
@JimmXinu
Copy link
Owner

Alright. It would still be nice to know what kind of speed up you see with what kind of input.

@haowjy
Copy link
Contributor Author

haowjy commented Mar 27, 2024

I was testing with https://github.com/illinois-cs241/coursebook/blob/epub_deploy/main.epub and https://www.gutenberg.org/ebooks/84 (epub3) with Python 3.10.13 in the CLI (not with Calibre plugin).

After I pushed the new changes, I tried it out in Calibre and got an extra TOC for each chapter... not realizing there was a configuration option to add that cover to the start of each chapter by default, so I thought it was another bug in the code, but it ended up being fine after I unchecked the configurations.

time python epubsplit.py main.epub --split-by-section --output-dir=cs241
real 1m22.511s
user 1m21.460s
sys 0m0.137s

real 1m23.816s
user 1m22.726s
sys 0m0.169s

real 1m24.822s
user 1m23.706s
sys 0m0.207s

time python epubsplit.py main.epub --split-by-section --output-dir=cs241-ori

KeyboardInterrupt
real 22m42.219s
user 22m41.450s
sys 0m0.684s

(22 minutes at 0041-split.epub out of 404 splits, so it probably would've taken a few hours - the reason I decided to look back at the code)

time python epubsplit.py pg84-images-3.epub --split-by-section --output-dir=frank
real 0m0.771s
user 0m0.738s
sys 0m0.033s

real 0m0.773s
user 0m0.744s
sys 0m0.029s

real 0m0.771s
user 0m0.730s
sys 0m0.041s

time python epubsplit-ori.py pg84-images-3.epub --split-by-section --output-dir=frank-ori
real 0m5.370s
user 0m5.326s
sys 0m0.037s

real 0m5.423s
user 0m5.350s
sys 0m0.053s

real 0m5.218s
user 0m5.164s
sys 0m0.040s

@JimmXinu
Copy link
Owner

18 chapters, but +400 nav points in the text book is not a common use case, but valid and well worth addressing.

Thanks for the numbers, and the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants