New parser architecture #2

vbessonov · 2020-10-19T18:57:25Z

This PR changes the architecture of the library and decouples logic (parsing, validation, serialization) from POCO classes representing AST nodes such as Link, Collection, Manifest, etc.
It splits parsing into two separate phases:

syntax analysis, where SyntaxAnalyzer is responsible for parsing raw JSON into an AST,
and semantic analysis, where SemanticAnalyzer performs semantic checks on the AST.
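
To make the two phases concrete, here is a minimal sketch of how they might fit together. Only the names SyntaxAnalyzer, SemanticAnalyzer, Link and Manifest come from the description above; the method names (analyze, accept, visit_*) and the specific checks are assumptions made for illustration, not the library's actual API.

```python
import json


class Node:
    """Plain data holder for an AST node; carries no parsing or validation logic."""

    def accept(self, visitor):
        # Dispatch on the class name: Link -> visitor.visit_link(node).
        return getattr(visitor, "visit_" + type(self).__name__.lower())(self)


class Link(Node):
    def __init__(self, href=None, rels=None):
        self.href = href
        self.rels = rels or []


class Manifest(Node):
    def __init__(self, links):
        self.links = links


class SyntaxAnalyzer:
    """Phase 1: turn raw JSON text into AST nodes without judging their meaning."""

    def analyze(self, raw_json):
        data = json.loads(raw_json)
        links = [
            Link(item.get("href"), item.get("rel", []))
            for item in data.get("links", [])
        ]
        return Manifest(links)


class SemanticAnalyzer:
    """Phase 2: a visitor that walks the AST and collects rule violations."""

    def __init__(self):
        self.errors = []

    def visit_manifest(self, manifest):
        for link in manifest.links:
            link.accept(self)
        if not any("self" in link.rels for link in manifest.links):
            self.errors.append("manifest has no link with rel 'self'")

    def visit_link(self, link):
        if not link.href:
            self.errors.append("link is missing href")


raw = (
    '{"links": ['
    '{"href": "https://example.org/manifest.json", "rel": ["self"]},'
    '{"rel": ["cover"]}]}'
)
manifest = SyntaxAnalyzer().analyze(raw)
checker = SemanticAnalyzer()
manifest.accept(checker)
print(checker.errors)
```

Because the node classes hold only data, a different visitor (a serializer, for example) could walk the same tree without touching them.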

mickael-menu

I did a quick review of the public AST to see how it aligned with the mobile toolkit.

Using a visitor pattern was a good idea, I think. I'll see whether it can be of interest for Swift/Kotlin as well, to make customizing the AST after parsing easier. 👍

Extensibility is missing in some areas. Here's a list of where it is expected:

In LinkProperties, any unknown JSON property is added to an otherProperties map.
In Manifest, a subcollections property holds any unknown core collection.
In Metadata, any unknown JSON property is added to an otherMetadata map. When parsing an EPUB, this is also filled with unknown <dc:*> or <meta> tags from the OPF.
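
As an illustration of the first and third points, the following sketch separates known keys from unknown ones instead of discarding the latter. The function name, the return shape and the (deliberately incomplete) set of known keys are assumptions for the example only.

```python
# Deliberately incomplete; a real parser would derive this from its property definitions.
KNOWN_LINK_PROPERTIES = {"orientation", "page"}


def split_link_properties(raw_properties):
    """Separate known properties from unknown ones instead of dropping the latter."""
    known = {}
    other_properties = {}
    for key, value in raw_properties.items():
        target = known if key in KNOWN_LINK_PROPERTIES else other_properties
        target[key] = value
    return known, other_properties


known, other = split_link_properties(
    {"orientation": "landscape", "https://example.org/price": 4.99}
)
print(known)
print(other)
```

The same pattern would apply to Metadata with an other_metadata map.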

mickael-menu · 2020-10-20T07:08:06Z

src/webpub_manifest_parser/core/ast.py

+    templated = BooleanProperty("templated", required=False)
+    type = StringProperty("type", required=False)
+    title = StringProperty("title", required=False)
+    rel = ArrayOfStringsProperty("rel", required=False)


For rel, using a set would be more suitable instead of an array.

Also we tend to pluralize these array properties compared to the raw RWPM, so rels, alternates and languages.

Good catch, thank you! ArrayOfStringsProperty has an optional parameter, unique_items, which can be used to enforce uniqueness of items, but I think I'd better create a new property, SetOfStringsProperty.

mickael-menu · 2020-10-20T07:10:06Z

src/webpub_manifest_parser/core/ast.py

+        )
+
+
+class LinkList(Node, list):


In this document you will find a number of "helper" functions that could be useful for LinkList as well. Note that this is a general guideline, and the naming should follow the conventions of the target language. So get_by_rel() is fine, for example.
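
A minimal sketch of what such helpers could look like on a list subclass. Only get_by_rel() is named in the comment above; first_by_rel() and get_by_href() are hypothetical additions, and the Node base class from the diff is left out to keep the example self-contained.

```python
class Link:
    def __init__(self, href, rels=(), type=None):
        self.href = href
        self.rels = set(rels)
        self.type = type


class LinkList(list):
    def get_by_rel(self, rel):
        """Return every link whose relations include `rel`."""
        return [link for link in self if rel in link.rels]

    def first_by_rel(self, rel):
        """Return the first link with `rel`, or None if there is none."""
        return next(iter(self.get_by_rel(rel)), None)

    def get_by_href(self, href):
        """Return the first link with the given href, or None."""
        return next((link for link in self if link.href == href), None)


links = LinkList(
    [
        Link("manifest.json", rels=["self"]),
        Link("cover.jpg", rels=["cover"], type="image/jpeg"),
    ]
)
cover = links.first_by_rel("cover")
```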

mickael-menu · 2020-10-20T07:12:45Z

src/webpub_manifest_parser/core/ast.py

+class Metadata(Node):
+    """Dictionary containing manifest's metadata."""
+
+    identifier = URIProperty("identifier", required=False)


Although the spec suggests that identifier is a URL, in practice that might not always be the case, depending on the format.

I was referring to the spec where it's defined as uri:

"identifier": { "type": "string", "format": "uri" },

The JSON schema is used for "real" RWPM and OPDS 2, and in that case it's true that identifier is a URL. But the R2 shared models can be used to parse other formats, so they are only loosely based on the JSON schema. For example, EPUB and PDF can have identifiers that are not URLs.

Another example is that RWPM requires one link with the rel self, but when parsing a local package (EPUB, PDF), we don't have any self link to parse so this requirement is lifted for the shared models.

I see, in this case we could override this behaviour in child classes. For example:

class EPUBMetadata(Metadata):
    identifier = StringProperty("identifier", required=False)

Metadata.extensions = (EPUBMetadata, )

This will force the parser to instantiate EPUBMetadata instead of Metadata, which means it will expect identifier to be a string, not a URI. Does that make sense?

I think so, on mobile we have only a single Metadata class so we're not in this situation, but maybe it's a proper workaround.
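
For readers unfamiliar with the mechanism, here is one way an extensions tuple could redirect instantiation. This is a sketch under stated assumptions, not the library's implementation: the property classes are simplified stand-ins, and node_class_for() and parse_metadata() are hypothetical helpers.

```python
from urllib.parse import urlparse


class StringProperty:
    def __init__(self, key, required=False):
        self.key = key
        self.required = required

    def parse(self, value):
        if not isinstance(value, str):
            raise ValueError("%s must be a string" % self.key)
        return value


class URIProperty(StringProperty):
    def parse(self, value):
        value = super().parse(value)
        # Simplified check: treat anything with a scheme as a URI.
        if not urlparse(value).scheme:
            raise ValueError("%s must be a URI" % self.key)
        return value


class Metadata:
    extensions = ()
    identifier = URIProperty("identifier")


class EPUBMetadata(Metadata):
    # Relaxed override: any string is accepted.
    identifier = StringProperty("identifier")


def node_class_for(base_class):
    """Pick the last registered extension, falling back to the base class."""
    return base_class.extensions[-1] if base_class.extensions else base_class


def parse_metadata(raw):
    cls = node_class_for(Metadata)
    node = cls()
    value = raw.get("identifier")
    # The property definition lives on the class; the parsed value on the instance.
    node.identifier = cls.identifier.parse(value) if value is not None else None
    return node


Metadata.extensions = (EPUBMetadata,)
node = parse_metadata({"identifier": "9780000000001"})
print(type(node).__name__, node.identifier)
```

Without the registration line, the same input would be rejected by the URI check.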

mickael-menu · 2020-10-20T07:16:17Z

src/webpub_manifest_parser/core/ast.py

+    publisher = ContributorProperty("publisher", required=False)
+    imprint = ContributorProperty("imprint", required=False)
+    subject = SubjectProperty("subject", required=False)
+    reading_progression = EnumProperty(


The reading_progression is a hint but is not very useful for the navigator side, so we added an effectiveReadingProgression helper to calculate the actual reading progression:

In this Kotlin commit you can find an example implementation with a set of test cases.

That said, this is only useful if the AST is meant to be used to render a publication.
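
A rough Python sketch of such a helper, assuming the fallback is driven by the publication's single declared language. The function signature and the exact list of right-to-left languages are assumptions and should be checked against the Kotlin commit and its test cases.

```python
def effective_reading_progression(reading_progression, languages):
    """Resolve "auto" to a concrete direction using the declared languages."""
    if reading_progression != "auto":
        return reading_progression
    # With zero or several languages there is nothing reliable to infer from.
    if len(languages) != 1:
        return "ltr"
    language = languages[0].lower()
    if language in ("zh-hant", "zh-tw"):
        return "rtl"
    # The region subtag is ignored for the remaining languages.
    primary = language.split("-", 1)[0]
    return "rtl" if primary in ("ar", "fa", "he") else "ltr"


print(effective_reading_progression("auto", ["ar-SA"]))
```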

mickael-menu · 2020-10-20T07:22:39Z

src/webpub_manifest_parser/core/ast.py

+    spread = EnumProperty("spread", False, ["auto", "both", "none", "landscape"])
+
+
+class CompactCollection(Node):


On mobile we don't distinguish between compact and full RWPM collections: everything is normalized into a single full Collection type, with empty metadata where none is given. I'm not sure this is necessarily better, I just wanted to point out the difference.

Also the role is not part of the collection itself but of its parent, which contains a subcollections property which is a Map of List<Collection>:

https://github.com/readium/r2-shared-kotlin/blob/alpha/r2-shared/src/main/java/org/readium/r2/shared/publication/PublicationCollection.kt#L33

https://github.com/readium/r2-shared-kotlin/blob/34c27504e93714d4c3f6ecfbc60b71913a01d659/r2-shared/src/main/java/org/readium/r2/shared/publication/Publication.kt#L86

Leonard also used the same approach of having a single Collection class. With my approach we can rely on the type, but in a dynamic language like Python that doesn't make much sense.
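
To illustrate the single-type approach described above, here is a sketch that normalizes both the compact form (a bare array of links) and the full form (an object with metadata, links and nested roles) into one Collection class, with the role kept as a key on the parent. The function names and the use of plain dicts for links are assumptions made for the example.

```python
class Collection:
    def __init__(self, metadata=None, links=None, subcollections=None):
        self.metadata = metadata or {}
        self.links = links or []
        # role -> list of Collection; the role belongs to the parent, not the child.
        self.subcollections = subcollections or {}


def collection_from_dict(raw):
    subcollections = {
        role: collections_from_json(value)
        for role, value in raw.items()
        if role not in ("metadata", "links")
    }
    return Collection(raw.get("metadata"), raw.get("links"), subcollections)


def collections_from_json(raw):
    """Return the list of Collection objects for one role's raw JSON value."""
    if isinstance(raw, dict):
        return [collection_from_dict(raw)]
    if isinstance(raw, list):
        if all(isinstance(item, dict) and "href" in item for item in raw):
            # Compact form: a bare array of links becomes one collection
            # with empty metadata.
            return [Collection(links=raw)]
        return [collection_from_dict(item) for item in raw if isinstance(item, dict)]
    return []


compact = collections_from_json([{"href": "a.html"}])
full = collections_from_json(
    {
        "metadata": {"title": "Part 1"},
        "links": [{"href": "b.html"}],
        "chapters": [{"href": "c.html"}],
    }
)
```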

mickael-menu · 2020-10-20T07:24:15Z

src/webpub_manifest_parser/core/ast.py

+    )
+
+    @property
+    def sub_collections(self):


"subcollection" is idiomatic English so you can drop the _ I think.

https://www.merriam-webster.com/dictionary/subcollection

mickael-menu · 2020-10-20T07:24:55Z

src/webpub_manifest_parser/core/ast.py

+    @property
+    def compact(self):
+        """Return a boolean value indicating if this collection is compact.
+
+        :return: Boolean value indicating if this collection is compact
+        :rtype: bool
+        """
+        return self.metadata is None and len(self._sub_collections) == 0
+
+    @property
+    def full(self):
+        """Return a boolean value indicating if this collection is full.
+
+        :return: Boolean value indicating if this collection is full
+        :rtype: bool
+        """
+        return self.metadata is not None and len(self._sub_collections) > 0


In practice I don't know any case where knowing if a collection is compact vs full is important. Did you need to use it?

I'm the original author of this and I think I only used it in enforcing the rules laid down by the spec (e.g. links must be a compact collection).

Ha yes in that case that makes sense. We used simple Link arrays for readingOrder and such on mobile.

mickael-menu · 2020-10-20T07:48:35Z

src/webpub_manifest_parser/core/registry.py

+    """Registry item representing a specific media type."""
+
+
+class LinkRelation(RegistryItem):


I recently added a LinkRelation struct in Swift to hold the various known relations and make rels type safe. Since it needs to be open to extension, it is not an enum.

I see. I tried to split the different types of link relations into separate registries. For example, in the case of RWPM there is a separate RWPMLinkRelationsRegistry:

class RWPMLinkRelationsRegistry(Registry):
    """Registry containing link relations mentioned in the RWPM spec."""

    ALTERNATE = LinkRelation(key="alternate")
    CONTENTS = LinkRelation(key="contents")
    COVER = LinkRelation(key="cover")
    MANIFEST = LinkRelation(key="manifest")
    SEARCH = LinkRelation(key="search")
    SELF = LinkRelation(key="self")

    CORE_LINK_RELATIONS = [ALTERNATE, CONTENTS, COVER, MANIFEST, SEARCH, SELF]

    def __init__(self):
        """Initialize a new instance of RWPMLinkRelationsRegistry class."""
        super(RWPMLinkRelationsRegistry, self).__init__(self.CORE_LINK_RELATIONS)

For OPDS 2.0 there is OPDS2LinkRelationsRegistry:

class OPDS2LinkRelationsRegistry(RWPMLinkRelationsRegistry):
    """Registry containing OPDS 2.0 link relations."""

    ACQUISITION = LinkRelation(key="http://opds-spec.org/acquisition")
    OPEN_ACCESS = LinkRelation(key="http://opds-spec.org/acquisition/open-access")
    BORROW = LinkRelation(key="http://opds-spec.org/acquisition/borrow")
    BUY = LinkRelation(key="http://opds-spec.org/acquisition/buy")
    SAMPLE = LinkRelation(key="http://opds-spec.org/acquisition/sample")
    PREVIEW = LinkRelation(key="preview")
    SUBSCRIBE = LinkRelation(key="http://opds-spec.org/acquisition/subscribe")

    CORE_LINK_RELATIONS = [
        ACQUISITION,
        OPEN_ACCESS,
        BORROW,
        BUY,
        SAMPLE,
        PREVIEW,
        SUBSCRIBE,
    ]

    def __init__(self):
        """Initialize a new instance of OPDS2LinkRelationsRegistry class."""
        super(OPDS2LinkRelationsRegistry, self).__init__()
        self._add_items(self.CORE_LINK_RELATIONS)

It looks like it fits a similar need indeed.
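
The registries quoted above rely on a Registry base class that is not shown in this thread. The sketch below uses a hypothetical stand-in that simply indexes items by key, with shortened class names and one relation per registry. Under that assumption, one detail is worth checking: when a subclass overrides CORE_LINK_RELATIONS, a lookup through self in the parent initializer resolves to the subclass's list, so the sketch names the class explicitly. Whether this matters in the real library depends on the actual Registry implementation.

```python
class LinkRelation:
    def __init__(self, key):
        self.key = key


class Registry:
    """Hypothetical stand-in for the library's base class: indexes items by key."""

    def __init__(self, items=()):
        self._items = {}
        self._add_items(items)

    def _add_items(self, items):
        for item in items:
            self._items[item.key] = item

    def __contains__(self, key):
        return key in self._items


class RWPMRelations(Registry):
    SELF = LinkRelation(key="self")
    CORE_LINK_RELATIONS = [SELF]

    def __init__(self):
        # Naming the class explicitly keeps a subclass that overrides
        # CORE_LINK_RELATIONS from replacing this list.
        super().__init__(RWPMRelations.CORE_LINK_RELATIONS)


class OPDS2Relations(RWPMRelations):
    BORROW = LinkRelation(key="http://opds-spec.org/acquisition/borrow")
    CORE_LINK_RELATIONS = [BORROW]

    def __init__(self):
        super().__init__()
        self._add_items(OPDS2Relations.CORE_LINK_RELATIONS)


registry = OPDS2Relations()
print("self" in registry)
```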

mickael-menu · 2020-10-20T07:50:11Z

src/webpub_manifest_parser/epub/ast.py

+    )
+
+
+class EPUBEncryptionSettings(Node):


This will soonish be extracted from the EPUB extension into a DRM module instead. This was agreed on but not yet updated in the spec. So you can probably rename it to DRMEncryptionSettings or EncryptionSettings (on mobile we called it Encryption).

Since there are other classes in the epub module, do you think it makes sense to wait until you update the spec and then update the library accordingly? I think it would be better to extract it into a separate module.

The encryption settings are needed for RWPM profiles protected with LCP as well, which are not necessarily EPUBs. So as long as the EncryptionSettings object is accessible outside of the EPUB scope (EPUBManifest?), this issue is not pressing.

Here's an example for audiobooks: https://readium.org/lcp-specs/notes/lcp-for-audiobooks.html

leonardr · 2020-10-20T19:01:11Z

src/webpub_manifest_parser/core/parsers.py

+class ArrayParser(ValueParser):
+    """Array parser."""
+
+    def __init__(self, item_parser, unique_items=False):


Mickaël mentioned the idea of treating some arrays as sets. "Array with a uniqueness constraint" is slightly different from "set", but they're pretty similar and I don't think the distinction matters in an OPDS 2 context. So this might be a good way to implement sets -- or this might be code that should be moved over to a new SetParser class.

OK, I see from your comments that you had a similar idea.
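
The distinction drawn above can be shown in a few lines. This is a sketch with hypothetical function names, assuming the uniqueness constraint is meant to reject duplicates rather than drop them: the constrained array keeps order and fails on a repeated item, while the set collapses duplicates silently and has no order.

```python
def parse_unique_array(values):
    """Keep order and reject duplicates, in the spirit of JSON Schema's uniqueItems."""
    seen = set()
    for value in values:
        if value in seen:
            raise ValueError("duplicate item: %r" % (value,))
        seen.add(value)
    return list(values)


def parse_set(values):
    """Collapse duplicates silently; order is not preserved."""
    return set(values)


rels = parse_unique_array(["self", "alternate"])
deduplicated = parse_set(["self", "self", "alternate"])
```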

I'm just wondering whether we really need SetParser, because my original intention was to name parsers after the types and formats in the spec. So there is an array type and there is an ArrayParser.

Now that you mention this, uniqueness is not specified in the JSON schema. But I don't think it makes sense to have duplicate relations anyway.

@mickael-menu-mantano, actually the JSON schema does contain uniqueItems:

"links": {
  "description": "Feed-level links such as search or pagination",
  "type": "array",
  "items": {
    "$ref": "https://readium.org/webpub-manifest/schema/link.schema.json"
  },
  "uniqueItems": true

For links yeah but not for rel.

leonardr

I'm fine with merging this if @mickael-menu is. Even if there are problems, it's better to merge this and use it as a starting point than to keep a branch open.

mickael-menu

There are still a few open comments, but nothing that can't be fixed in a future PR.

vbessonov · 2021-05-03T12:50:33Z

@leonardr, could you merge this PR?

Initial commit

79e3a97

vbessonov mentioned this pull request Oct 19, 2020

New architecture #1

Closed

mickael-menu reviewed Oct 20, 2020

leonardr reviewed Oct 20, 2020

vbessonov added 9 commits October 21, 2020 17:06

Pluralize names of properties that can contain multiple values (rel -> rels, author -> authors, etc.)

e74481f

Update the Apache license classifier

07f3cd4

Add information about README.md, repository and homepage to the project metadata

10f7b45

Add parse_stream to DocumentParser

f02264f

Bump the patch version to be able to upload webpub-manifest-parser to PyPi

0ba5f41

Make the parser to parse contributors into Contributor objects

4dce6ca

Bump version: 0.0.2 → 0.0.3

f256029

Add DocumentParser.parse_json method

d52ef4e

Bump version: 0.0.3 → 0.0.4

0ada9a7

leonardr approved these changes Nov 6, 2020

mickael-menu approved these changes Nov 9, 2020

leonardr merged commit bb8a403 into NYPL-Simplified:master May 3, 2021

jonathangreen mentioned this pull request May 17, 2021

Update package name for webpub-manifest-parser ThePalaceProject/circulation-core#4

Closed
