
Advanced Manual Pagination

Thertzlor edited this page May 12, 2024 · 3 revisions

This page documents the advanced paging settings and how to use them to arrive at more accurate page counts.

Paging Modes

Using the -p or --pagingmode argument you can choose how the script will go about actually defining page breaks.

  • "chars": This is the default mode and also the simplest of all. All it does is divide the number of characters in the book's text by the number of pages we want, arriving at a fixed character count per page. This generally works well, but is best used for dense books with very long paragraphs; think of the styles of Proust or Saramago as examples.
  • "lines": In this mode the script divides the text up by line breaks and then calculates a fixed number of lines per page. The more predefined line breaks a book contains the better this mode works, so books of poetry are a good fit, as are books with lots of terse dialogue.
  • "words": In this mode the text is split into individual words (defined as any sequence of non-whitespace characters, i.e. the output of Python's str.split()), and the script then calculates the average number of words per page based on the total number of words in the text.
  • number: The final and most advanced paging mode is activated by passing a number as the argument. It works by using the lines mode and applying the provided number as a maximum character count per line. Shorter lines are left as-is, longer lines are split up. This can give you very accurate results, especially if you use the line length of the print edition as a reference (It's still not perfect of course, unless the book is typeset in a monospace font).
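
The four modes can be illustrated with a minimal Python sketch. This is not the script's actual code: the function names split_long_lines and units_per_page are hypothetical, and cutting long lines at a hard character limit (ignoring word boundaries) is a simplifying assumption.

```python
import math

def split_long_lines(text: str, max_chars: int) -> list[str]:
    """Split text at line breaks, then cut any line longer than max_chars.

    Assumption: lines are cut at a hard character limit, not at word boundaries.
    """
    result = []
    for line in text.splitlines():
        if len(line) <= max_chars:
            result.append(line)  # shorter lines are left as-is
        else:
            result.extend(line[i:i + max_chars]
                          for i in range(0, len(line), max_chars))
    return result

def units_per_page(text: str, pages: int, mode) -> int:
    """Return how many units (chars, lines or words) go on each page."""
    if mode == "chars":
        total = len(text)
    elif mode == "lines":
        total = len(text.splitlines())
    elif mode == "words":
        total = len(text.split())  # any run of non-whitespace counts as a word
    else:
        # numeric mode: lines mode with a maximum line length
        total = len(split_long_lines(text, int(mode)))
    return max(1, math.ceil(total / pages))
```

In this sketch every mode reduces to the same division, and only the unit being counted changes. The numeric mode differs from "lines" only in that long lines are counted several times.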

ToC Pages

Back in the days of paper many physical books had a table of contents which included page numbers because paper didn't support links. The -t or --tocpages argument lets us use this ancient knowledge to guarantee that our generated page numbers don't get too inaccurate.
The only requirement is that our ebook has a functioning table of contents as well.
The argument accepts a list of page numbers, each corresponding to a chapter, or a list of index:value pairs.

We'll cover the simple list first:

In this example we have a book with 100 pages and 5 chapters and we know that the chapters begin on page 5, 20, 50, 70 and 90 respectively:

py .\page_approximator.py ".\book.epub" 100 --tocpages 5 20 50 70 90

Now that we can map those pages to specific parts of the book using the toc.ncx/nav.xhtml of the file the script only needs to interpolate page numbers in the ranges between those known pages, increasing the accuracy immensely.
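
The interpolation idea can be sketched as follows. This is an illustration under stated assumptions, not the script's implementation: the function name interpolate_page_starts is hypothetical, positions are measured as character offsets, and pages between two known positions are assumed to be spaced evenly.

```python
def interpolate_page_starts(anchors, total_pages, text_length):
    """Map every page number to the character offset where it starts.

    anchors: (char_offset, page_number) pairs for the known chapter starts.
    Assumption: pages between two anchors are spread out linearly.
    """
    points = sorted(anchors)
    # Add the start of the book as page 1 if the first anchor is later.
    if points[0][1] > 1 and points[0][0] > 0:
        points.insert(0, (0, 1))
    # Add the end of the book as the start of a virtual page after the last.
    points.append((text_length, total_pages + 1))

    starts = {}
    for (o1, p1), (o2, p2) in zip(points, points[1:]):
        span = p2 - p1
        for page in range(p1, p2):
            starts[page] = round(o1 + (o2 - o1) * (page - p1) / span)
    return starts
```

For example, with chapters known to start at offset 1000 (page 5) and offset 4000 (page 20), pages 6 to 19 are placed at equal steps between those two offsets instead of being derived from a book-wide average.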

When using a basic list, the number of entries in the --tocpages list must match the number of content markers in the ebook. If they do not match, the script will abort and show a message listing all entries in the table of contents for reference.
Chapters can be skipped by putting a 0 in their position. Should we only want to map every second chapter we can modify our previous example to this:

py .\page_approximator.py ".\book.epub" 100 --tocpages 5 0 50 0 90

If we only know a tiny number of page positions (such as when a book's table of contents only lists the pages for sections but not individual chapters), creating a chapter list of mostly 0 can be annoying. This is where the secondary mode of index:value pairs comes in handy.
Instead of listing all page positions in order of chapter markers we list a chapter index and a page number separated by :. All other values are automatically assumed to be 0.
Converted into index:value pairs, the previous example looks like this:

py .\page_approximator.py ".\book.epub" 100 --tocpages '1:5' '3:50' '5:90'

As we can see the 0 entries are no longer necessary and we can also provide entries in any order.
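
The relationship between the two list formats can be sketched in Python. The function name parse_tocpages is hypothetical and this is not the script's own parser; it only handles plain numbers, and it assumes 1-based chapter indices, as in the example above.

```python
def parse_tocpages(values, marker_count):
    """Turn --tocpages values into one page number per ToC marker (0 = unknown).

    Assumptions: indices in index:value pairs are 1-based, and all values
    are plain integers (Roman numerals are not handled in this sketch).
    """
    if any(":" in value for value in values):
        pages = [0] * marker_count  # every unlisted chapter defaults to 0
        for value in values:
            index, page = value.split(":")
            pages[int(index) - 1] = int(page)
        return pages
    if len(values) != marker_count:
        raise ValueError(
            f"expected {marker_count} entries, got {len(values)}")
    return [int(value) for value in values]
```

Both `parse_tocpages(["5", "0", "50", "0", "90"], 5)` and `parse_tocpages(["5:90", "1:5", "3:50"], 5)` produce the same list in this sketch, which is why the order of the pairs does not matter.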

If one of the entries in the --tocpages list specifies page 1, and the location of that file is not the absolute beginning of the book, the script will automatically add a "0th" page onto which all content before the actual first page will be pushed.
According to "proper" publishing convention the front matter before page 1 should be numbered in Roman numerals, which is the subject of the next section.


Front Matter with Roman numbering

The --romanfrontmatter option, shortened to -r, lets us define a front section of the book paginated with Roman numerals before resetting the pagination for the main content.

In this example we keep things simple, using only the page count without a ToC map:

py .\page_approximator.py ".\book.epub" 100 --romanfrontmatter 5

This tells the script that the first 5 pages are front matter. Note that in this mode the page number argument refers to the total number of pages in the book including front matter, so since the numbering restarts after the 5 Roman numerals we only have 95 normal pages.

To get back up to 100 normal pages we have to add the 5 pages to the total once more:

py .\page_approximator.py ".\book.epub" 105 --romanfrontmatter 5
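
The arithmetic behind these two commands can be checked with a small sketch. The helper names to_roman and page_labels are hypothetical and only model the numbering scheme described above, not the script's internals.

```python
def to_roman(number: int) -> str:
    """Convert a positive integer to a lowercase Roman numeral."""
    numerals = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
                (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
                (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]
    result = ""
    for value, symbol in numerals:
        while number >= value:
            result += symbol
            number -= value
    return result

def page_labels(total_pages: int, front_pages: int) -> list[str]:
    """Label the first front_pages pages i, ii, ... and restart at 1 after."""
    roman = [to_roman(n) for n in range(1, front_pages + 1)]
    arabic = [str(n) for n in range(1, total_pages - front_pages + 1)]
    return roman + arabic
```

With a total of 100 and 5 front matter pages the last label is "95". With a total of 105 it is "100".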

This complication does not apply if we are using a ToC map, because in this case the script knows exactly where the first real page of the book starts and calculates the provided number of pages only within the "main text" of the book.

In the following example we assume a book that has a single section before the main content starts at page 1. Using the --romanfrontmatter option we tell the script that this first section consists of 5 pages with Roman numbering. Note that it is required to specify a location for the first page in the ToC map when this option is set.

py .\page_approximator.py ".\book.epub" 100 --tocpages 0 1 5 20 50 70 90 --romanfrontmatter 5

Alternatively it's also possible to pass --romanfrontmatter 0 which tells the script to decide for itself how many front matter pages to insert, based on the average page size of the main text.
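
One plausible way to derive such an estimate is sketched below. The function name estimate_front_pages is hypothetical, and measuring size in characters and rounding up are assumptions; the script's exact calculation may differ.

```python
import math

def estimate_front_pages(front_chars: int, main_chars: int, main_pages: int) -> int:
    """Estimate the front matter page count from the main text's page size.

    Assumptions: size is measured in characters, partial pages are rounded
    up, and at least one front matter page is always produced.
    """
    avg_page_size = main_chars / main_pages
    return max(1, math.ceil(front_chars / avg_page_size))
```

For a main text of 100,000 characters on 100 pages, 2,500 characters of front matter would come out as 3 Roman-numbered pages in this sketch.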

The ToC map itself can also be used to indicate and further control pages with Roman numerals:

py .\page_approximator.py ".\book.epub" 100 --tocpages i iv 1 5 20 50 70 90

In this example it is not necessary to set the --romanfrontmatter option since we have the Roman numbering already in the page map.
In this book we have two sections before the first content page, defined with their Roman numbers. How many front matter pages will be inserted between iv and 1 is again calculated via the average page size, although it is possible to pass the --romanfrontmatter option to indicate the total number of front matter pages explicitly. As in the normal section of the page map, it is also possible to "skip" sections by filling in a 0 in their spot.


A Complete Demonstration

This section will showcase the full process of paginating a book as precisely as possible using all techniques previously discussed. As our example book we will use The Sirens of Titan by Kurt Vonnegut (named sirens.epub to keep things concise), with the data about the print version taken from Google Books (the first time this service has been useful to me).

Our first data point is the fact that the number of pages is listed at 336, so we start with the simplest approach:

py .\page_approximator.py ".\sirens.epub" 336

The result looks good, but comparing it to the Table of Contents in the Google Books preview shows that while the total page count is the same, the divergence increases over the course of the book. The first few chapters are pretty close: chapter 3 should start on page 62, but in our calculation it starts on page 65. Chapter 4, which should start on page 95, is already off by 5 at page 100. At the end of the book we're off by 10 pages, with the epilogue starting on page 318 instead of 308.

We can do better, so let's refine our approach. Looking again at the preview of the print edition, we see that the lines in the printed text are pretty short, only about 56 characters per line, so we factor this in using the lines+maximum paging mode:

py .\page_approximator.py ".\sirens.epub" 336 --pagingmode 56

This helps, but only a tiny bit, because the text density is quite constant. Still, on average we're off by about one page less than with our previous result.

To get truly accurate, we'll have to use the table of contents directly, especially because at least some of the inaccuracy can be traced back to the print version not counting the pages before the start of the first chapter.

The twelve chapters and the epilogue listed in the print edition's Table of Contents start on the following pages: 1 41 62 95 105 143 167 187 199 218 256 270 308. But we need to modify this list before using it, since the digital edition of the book includes five more content markers before the first chapter: "Praise", "Title Page", "Copyright Page", "Table of Contents" and "Epigraph". These need to be included in our page mapping, but as far as the print table of contents is concerned, the first page is the start of the first chapter, so we represent that by setting those first 5 entries in our page list to 0:

py .\page_approximator.py ".\sirens.epub" 336 -p 56 -t 0 0 0 0 0 1 41 62 95 105 143 167 187 199 218 256 270 308
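
Because the list has to match the content markers one to one, it can be easier to assemble it programmatically than to count entries by hand. The following sketch only builds the argument string; the variable names are hypothetical and nothing here calls the script itself.

```python
# Content markers in the ebook that precede the first chapter (no known page).
front_markers = ["Praise", "Title Page", "Copyright Page",
                 "Table of Contents", "Epigraph"]

# Start pages of the twelve chapters and the epilogue in the print edition.
print_pages = [1, 41, 62, 95, 105, 143, 167, 187, 199, 218, 256, 270, 308]

# One 0 per unnumbered marker, followed by the known pages.
toc_pages = [0] * len(front_markers) + print_pages
toc_argument = " ".join(str(page) for page in toc_pages)

print(len(toc_pages))   # number of entries the ebook's ToC must have
print(toc_argument)     # value to pass after -t
```

The list ends up with 18 entries, so the command only works if the ebook's table of contents also has exactly 18 markers.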

But to have the book paginated "properly", the section that is now page "0" should instead be paginated with Roman numerals. The table of contents does not tell us how many of those pages there are, so we set the -r option to 0 and let the script decide how many to generate:

py .\page_approximator.py ".\sirens.epub" 336 -p 56 -t 0 0 0 0 0 1 41 62 95 105 143 167 187 199 218 256 270 308 -r 0

Now the final output is as good as it gets, with all chapters on the right pages, Roman numerals in the front matter and, thanks to the number of sample points, the pages in between accurate to within a few lines.