Speed up `SplitEpub.get_split_files()` by using existing `SplitEpub.split_lines` data #9

haowjy · 2024-03-24T23:15:03Z

Previously, SplitEpub.get_split_files() calls lines = self.get_split_lines() which requires parsing through the DOM multiple times. This change will use the existing SplitEpub.split_lines and include the lines as needed by the linenums, overwriting the SplitEpub.split_lines for each time SplitEpub.get_split_files() is called.

JimmXinu · 2024-03-25T01:08:27Z

What did you use to test?

Because this is buggy for me--it's including all ToC entries in each split book, not just the selected ones. Selected files are correctly included.

Plus in my quick test (book of 108 files, select 103, do New Book for N Sections with N=5 and Orphans=5 generating 20 split books), the speed up on calling splitepub.write_split_epub() is from 75ms on average to 68ms on average, 10% improvement.

Unless you can show more significant speed up, I don't think it's worth trying to debug it.

JimmXinu · 2024-03-26T14:15:10Z

Alright. It would still be nice to know what kind of speed up you see with what kind of input.

haowjy · 2024-03-27T02:01:20Z

I was testing with https://github.com/illinois-cs241/coursebook/blob/epub_deploy/main.epub and https://www.gutenberg.org/ebooks/84 (epub3) with Python 3.10.13 in the CLI (not with Calibre plugin).

After I pushed the new changes, I tried it out in Calibre and got an extra TOC for each chapter... not realizing there was a configuration option to add that cover to the start of each chapter by default, so I thought it was another bug in the code, but it ended up being fine after I unchecked the configurations.

time python epubsplit.py main.epub --split-by-section --output-dir=cs241
real 1m22.511s
user 1m21.460s
sys 0m0.137s

real 1m23.816s
user 1m22.726s
sys 0m0.169s

real 1m24.822s
user 1m23.706s
sys 0m0.207s

time python epubsplit.py main.epub --split-by-section --output-dir=cs241-ori

KeyboardInterrupt
real 22m42.219s
user 22m41.450s
sys 0m0.684s

(22 minutes at 0041-split.epub out of 404 splits, so it probably would've taken a few hours - the reason I decided to look back at the code)

time python epubsplit.py pg84-images-3.epub --split-by-section --output-dir=frank
real 0m0.771s
user 0m0.738s
sys 0m0.033s

real 0m0.773s
user 0m0.744s
sys 0m0.029s

real 0m0.771s
user 0m0.730s
sys 0m0.041s

time python epubsplit-ori.py pg84-images-3.epub --split-by-section --output-dir=frank-ori
real 0m5.370s
user 0m5.326s
sys 0m0.037s

real 0m5.423s
user 0m5.350s
sys 0m0.053s

real 0m5.218s
user 0m5.164s
sys 0m0.040s

JimmXinu · 2024-03-27T02:13:02Z

18 chapters, but +400 nav points in the text book is not a common use case, but valid and well worth addressing.

Thanks for the numbers, and the PR.

Speed up get_split_files by using existing split_lines data

056c3f2

fix TOC bug and string cli input bug

f350569

JimmXinu merged commit f350569 into JimmXinu:main Mar 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up `SplitEpub.get_split_files()` by using existing `SplitEpub.split_lines` data #9

Speed up `SplitEpub.get_split_files()` by using existing `SplitEpub.split_lines` data #9

haowjy commented Mar 24, 2024

JimmXinu commented Mar 25, 2024

JimmXinu commented Mar 26, 2024

haowjy commented Mar 27, 2024

JimmXinu commented Mar 27, 2024

Speed up SplitEpub.get_split_files() by using existing SplitEpub.split_lines data #9

Speed up SplitEpub.get_split_files() by using existing SplitEpub.split_lines data #9

Conversation

haowjy commented Mar 24, 2024

JimmXinu commented Mar 25, 2024

JimmXinu commented Mar 26, 2024

haowjy commented Mar 27, 2024

JimmXinu commented Mar 27, 2024

Speed up `SplitEpub.get_split_files()` by using existing `SplitEpub.split_lines` data #9

Speed up `SplitEpub.get_split_files()` by using existing `SplitEpub.split_lines` data #9