
Page Lists from Existing Tags

Thertzlor edited this page Apr 2, 2023 · 2 revisions

Introduction

Sometimes you might come across a book that has some sort of non-standard page markers that are not recognized by ebook readers as proper print reference pages.
It would be a shame not to make use of them, so Print Page Approximator includes functionality to detect these page markers based on a CSS-like selector and "upgrade" them to properly working reference pages.

Telling the script to work with existing tags is as easy as swapping the first argument, which normally designates the page number, for a tag selector as described below.

Selectors

The selector that can be passed as the pages argument works like a simplified CSS selector with the following pattern:
tag.class[attribute=value]#id
It's not necessary to use all parts. For example, if your book already has "valid" page break tags but just doesn't list them in its nav file, it's sufficient to define the selection like this:

py .\page_approximator.py .\example_book.epub "[epub:type=pagebreak]"

But if for the sake of demonstration we wanted to use all parts, it would look like this:

py .\page_approximator.py .\example_book.epub "span.pg[epub:type=pagebreak]#pg_*"

Here we would select all span tags with the class "pg" and the "epub:type" attribute with the value "pagebreak" and an ID that starts with "pg_".

This is a more detailed breakdown of the selector's capabilities:

  • The class selector matches the class of the element, even if other classes are applied on the same node.
  • The id selector supports wildcards with * to match whole ID patterns. Multiple * wildcards may be used.
  • The =value part of the [attribute=value] selector is optional. A selector with [epub:type] will select any tag in which an epub:type attribute of any value is defined.
  • Any part of the selector may be omitted. For example a#pg_*, a.pg and [epub:type=pagebreak]#pg_* are all valid selectors.
  • The order of the selectors needs to be kept the same, even with omissions. A selector like #pg_*[epub:type=pagebreak] is not valid.
  • Selecting more than one tag, class or attribute/value pair is not supported.
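
To make the rules above concrete, here is a minimal sketch of how such a selector could be parsed and matched. It is not taken from the script's source: the names `Selector`, `parse_selector`, `id_matches` and `element_matches` are hypothetical, and the set of characters allowed in each part is an assumption.

```python
import re
from typing import Dict, NamedTuple, Optional


class Selector(NamedTuple):
    """Hypothetical container for the parts of tag.class[attribute=value]#id."""
    tag: Optional[str]
    cls: Optional[str]
    attribute: Optional[str]
    value: Optional[str]
    id_pattern: Optional[str]


# Fixed order: tag, .class, [attribute=value], #id. Every part is optional.
# The permitted characters in each part are an assumption of this sketch.
SELECTOR_RE = re.compile(
    r"^(?P<tag>[\w-]+)?"
    r"(?:\.(?P<cls>[\w-]+))?"
    r"(?:\[(?P<attribute>[\w:-]+)(?:=(?P<value>[^\]]*))?\])?"
    r"(?:#(?P<id>[\w*:.-]+))?$"
)


def parse_selector(text: str) -> Selector:
    """Split a selector into its parts; reject empty or out-of-order input."""
    match = SELECTOR_RE.match(text)
    if not text or match is None:
        raise ValueError(f"not a valid selector: {text!r}")
    return Selector(match["tag"], match["cls"], match["attribute"],
                    match["value"], match["id"])


def id_matches(pattern: str, element_id: str) -> bool:
    """Match an ID against a pattern in which each * stands for any text."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, element_id) is not None


def element_matches(sel: Selector, tag: str, attrs: Dict[str, str]) -> bool:
    """Check one element (tag name plus attribute dict) against a selector."""
    if sel.tag is not None and sel.tag != tag:
        return False
    # The class only has to be one of the classes on the element.
    if sel.cls is not None and sel.cls not in attrs.get("class", "").split():
        return False
    if sel.attribute is not None:
        if sel.attribute not in attrs:
            return False
        # Without "=value", any value of the attribute is accepted.
        if sel.value is not None and attrs[sel.attribute] != sel.value:
            return False
    if sel.id_pattern is not None:
        return id_matches(sel.id_pattern, attrs.get("id", ""))
    return True


selector = parse_selector("span.pg[epub:type=pagebreak]#pg_*")
element = {"class": "pg small", "epub:type": "pagebreak", "id": "pg_12"}
print(element_matches(selector, "span", element))
```

Anchoring the pattern at both ends is what enforces the fixed order in this sketch: a selector such as `#pg_*[epub:type=pagebreak]` leaves unmatched text after the ID part, so `parse_selector` raises a `ValueError`.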

If the ebook is in the EPUB

Deriving Page Numbers

By default the logic that the script uses to determine which page the found page break tag actually represents works as follows:

  • If the tag contains any text, that text will be used as the name of the page.
  • If the text inside the tag contains numbers, the last sequence of numbers within that text will be used as the page number, but if no numbers are found the text will be retained as is.
  • If there is no text inside the tag, the script will attempt to derive the page number based on the element ID, using the same logic of finding the last string of digits in the ID.
  • If there are no digits in the ID, the script will assume the number based on the last successfully identified page number.
  • If no number can be identified at all, the page number will simply be assigned based on the element count.
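
The fallback chain above can be sketched roughly as follows. This is an illustration, not the script's actual code: `derive_page_label` and `last_number` are hypothetical names, and treating "based on the last successfully identified page number" as "previous page plus one" is an assumption.

```python
import re
from typing import Optional


def last_number(text: str) -> Optional[str]:
    """Return the last run of digits in text, or None if there is none."""
    runs = re.findall(r"\d+", text)
    return runs[-1] if runs else None


def derive_page_label(text: str, element_id: str,
                      last_page: Optional[int], element_count: int) -> str:
    """Pick a page label for one page break tag (hypothetical helper).

    text          -- text content of the tag, may be empty
    element_id    -- the tag's id attribute, may be empty
    last_page     -- last page number identified so far, if any
    element_count -- 1-based position of this tag among all matched tags
    """
    text = text.strip()
    if text:
        # Prefer the last digit run; otherwise keep the text as it is.
        return last_number(text) or text
    from_id = last_number(element_id)
    if from_id is not None:
        return from_id
    if last_page is not None:
        # Assumption: continue counting from the previous known page.
        return str(last_page + 1)
    return str(element_count)


print(derive_page_label("Page 12", "", None, 1))
print(derive_page_label("", "pg_45", None, 2))
print(derive_page_label("", "marker", 45, 3))
```

Labels are kept as strings here so that non-numeric text such as roman numerals can pass through unchanged.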

Fetching Values from Other Attributes

If the non-standard page break tag specifies the page number in some other arbitrary attribute, you can override the default logic by specifying the name of the attribute with the --attribute or -a parameter:

py .\page_approximator.py .\example_book.epub "span.pg" --attribute "value"

Now only the content of the value attribute on span tags with the "pg" class will be used to derive the page number, ignoring the text content and id. If no content is found in the value attribute, numbering will fall back to the element count.

You can force the script to number exclusively by element count by passing an empty string as the --attribute parameter:

py .\page_approximator.py .\example_book.epub "span.pg" --attribute ""

Since there are no attributes without a name, there can be no attribute content. The text and ID methods are skipped as well, so the element count is the only remaining numbering method.
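
As a rough sketch of the override behaviour described in this section, the function below reads only the named attribute and falls back to the element count. The name `label_from_attribute` is hypothetical, and using the attribute content unchanged (rather than extracting digits from it) is an assumption, since the page does not say how the content is processed.

```python
from typing import Dict


def label_from_attribute(attrs: Dict[str, str], attribute: str,
                         element_count: int) -> str:
    """Hypothetical helper: label a page from one attribute only.

    attrs         -- attributes of the matched tag
    attribute     -- name passed via --attribute / -a, may be empty
    element_count -- 1-based position of this tag among all matched tags
    """
    # An empty attribute name can never match a real attribute,
    # so the lookup yields nothing and the count is used instead.
    content = attrs.get(attribute, "").strip() if attribute else ""
    # Text content and ID are deliberately not consulted here.
    return content or str(element_count)


tag_attrs = {"class": "pg", "value": "17", "id": "pg_99"}
print(label_from_attribute(tag_attrs, "value", 5))
print(label_from_attribute(tag_attrs, "", 5))
```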