Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New parser architecture #2

Merged
merged 10 commits into from May 3, 2021

Conversation

vbessonov
Copy link
Contributor

@vbessonov vbessonov commented Oct 19, 2020

This PR changes the architecture of the library and decouples logic (parsing, validation, serialization) from POCO classes representing AST nodes such as Link, Collection, Manifest, etc.
It splits parsing into two separate phases:

  • syntax analysis where SyntaxAnalyzer is responsible for parsing raw JSON into AST
  • and semantic analysis where SemanticAnalyzer conducts semantic checking of the AST tree.

@vbessonov vbessonov mentioned this pull request Oct 19, 2020
Copy link

@mickael-menu mickael-menu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a quick review of the public AST to see how it aligned with the mobile toolkit.

Using a visitor pattern was a good idea I think, I'll see if it can be of interest for Swift/Kotlin as well, to allow customizing the AST after parsing in an easier way. 👍

The extensibility is missing in some areas, here's a list of where it is expected:

  • In LinkProperties, any unknown JSON property is added to an otherProperties map.
  • In Manifest, a subcollections property holds any unknown core collection.
  • In Metadata, any unknown JSON property is added to an otherMetadata map. When parsing an EPUB, this is also filled with unknown <dc:*> or <meta> tags from the OPF.

templated = BooleanProperty("templated", required=False)
type = StringProperty("type", required=False)
title = StringProperty("title", required=False)
rel = ArrayOfStringsProperty("rel", required=False)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For rel, using a set would be more suitable instead of an array.

Also we tend to pluralize these array properties compared to the raw RWPM, so rels, alternates and languages.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thank you! ArrayOfStringsProperty has optional parameter unique_items which can be used to enforce uniqueness of items but I think I'll better create a new property SetOfStringsProperty.

)


class LinkList(Node, list):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this document you will find a number of "helpers" functions that could be useful as well for LinkList. Note that this is a general guideline, but the naming should follow the convention of the target language. So get_by_rel() is fine for example.

class Metadata(Node):
"""Dictionary containing manifest's metadata."""

identifier = URIProperty("identifier", required=False)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although the spec suggests that identifier is an URL, in practice it might not always be the case depending on the format.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was referring to the spec where it's defined as uri:

"identifier": {
      "type": "string",
      "format": "uri"
    },

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JSON schema is used for "real" RWPM and OPDS 2, in this case it's true that identifier is an URL. But R2 shared models can be used to parse other formats so they are only loosely based on the JSON schema. For example EPUB and PDF can have identifiers that are not URL.

Another example is that RWPM requires one link with the rel self, but when parsing a local package (EPUB, PDF), we don't have any self link to parse so this requirement is lifted for the shared models.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, in this case we could override this behaviour in child classes. For example:

class EPUBMetadata(Metadata):
    identifier = StringProperty("identifier", required=False)

Metadata.extensions = (EPUBMetadata, )

It will force the parser to instantiate EPUBMetadata instead of Metadata which means that it will be expecting an identifier to be a string, not URI. Does it make sense?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, on mobile we have only a single Metadata class so we're not in this situation, but maybe it's a proper workaround.

src/webpub_manifest_parser/core/ast.py Outdated Show resolved Hide resolved
publisher = ContributorProperty("publisher", required=False)
imprint = ContributorProperty("imprint", required=False)
subject = SubjectProperty("subject", required=False)
reading_progression = EnumProperty(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reading_progression is a hint but is not very useful for the navigator side, so we added an effectiveReadingProgression helper to calculate the actual reading progression:

In this Kotlin commit you can find an example implementation with a set of test cases.

Although this is only useful if this AST is meant to be used to render publication.

spread = EnumProperty("spread", False, ["auto", "both", "none", "landscape"])


class CompactCollection(Node):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On mobile we didn't make a difference between compact or full RWPM collections, everything is made canonical in a single full Collection type, with empty metadata. Now I'm not sure this is necessarily better, I just wanted to point out the difference.

Also the role is not part of the collection itself but of its parent, which contains a subcollections property which is a Map of List<Collection>:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leonard also used the same approach with having a single Collection class. In my approach we can rely on type but in a dynamic language like Python it doesn't make much sense

)

@property
def sub_collections(self):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"subcollection" is idiomatic English so you can drop the _ I think.

https://www.merriam-webster.com/dictionary/subcollection

Comment on lines +692 to +708
@property
def compact(self):
"""Return a boolean value indicating if this collection is compact.

:return: Boolean value indicating if this collection is compact
:rtype: bool
"""
return self.metadata is None and len(self._sub_collections) == 0

@property
def full(self):
"""Return a boolean value indicating if this collection is full.

:return: Boolean value indicating if this collection is full
:rtype: bool
"""
return self.metadata is not None and len(self._sub_collections) > 0

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice I don't know any case where knowing if a collection is compact vs full is important. Did you need to use it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm the original author of this and I think I only used it in enforcing the rules laid down by the spec (e.g. links must be a compact collection).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha yes in that case that makes sense. We used simple Link arrays for readingOrder and such on mobile.

"""Registry item representing a specific media type."""


class LinkRelation(RegistryItem):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recently added in Swift a LinkRelation struct to hold the various known relations and have rels be type safe. Since it needs to be opened to extensions, this is not an enum.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I tried to split all different types or link relations into separate repositories. For example, in the case of RWPM there is a separate RWPMLinkRelationsRegistry:

class RWPMLinkRelationsRegistry(Registry):
    """Registry containing link relations mentioned in the RWPM spec."""

    ALTERNATE = LinkRelation(key="alternate")
    CONTENTS = LinkRelation(key="contents")
    COVER = LinkRelation(key="cover")
    MANIFEST = LinkRelation(key="manifest")
    SEARCH = LinkRelation(key="search")
    SELF = LinkRelation(key="self")

    CORE_LINK_RELATIONS = [ALTERNATE, CONTENTS, COVER, MANIFEST, SEARCH, SELF]

    def __init__(self):
        """Initialize a new instance of RWPMLinkRelationsRegistry class."""
        super(RWPMLinkRelationsRegistry, self).__init__(self.CORE_LINK_RELATIONS)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For OPDS 2.0 there is OPDS2LinkRelationsRegistry:

class OPDS2LinkRelationsRegistry(RWPMLinkRelationsRegistry):
    """Registry containing OPDS 2.0 link relations."""

    ACQUISITION = LinkRelation(key="http://opds-spec.org/acquisition")
    OPEN_ACCESS = LinkRelation(key="http://opds-spec.org/acquisition/open-access")
    BORROW = LinkRelation(key="http://opds-spec.org/acquisition/borrow")
    BUY = LinkRelation(key="http://opds-spec.org/acquisition/buy")
    SAMPLE = LinkRelation(key="http://opds-spec.org/acquisition/sample")
    PREVIEW = LinkRelation(key="preview")
    SUBSCRIBE = LinkRelation(key="http://opds-spec.org/acquisition/subscribe")

    CORE_LINK_RELATIONS = [
        ACQUISITION,
        OPEN_ACCESS,
        BORROW,
        BUY,
        SAMPLE,
        PREVIEW,
        SUBSCRIBE,
    ]

    def __init__(self):
        """Initialize a new instance of OPDS2LinkRelationsRegistry class."""
        super(OPDS2LinkRelationsRegistry, self).__init__()

        self._add_items(self.CORE_LINK_RELATIONS)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like it fits a similar need indeed.

)


class EPUBEncryptionSettings(Node):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will soonish be extracted from the EPUB extension into a DRM module instead. This was agreed on but not yet updated in the spec. So you can probably rename it to DRMEncrytionSettings or EncryptionSettings (on mobile we called it Encryption).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there other classes in epub module, do you think it makes sense to wait until you update the spec and then update the library accordingly? Because I think it would be better to extract it to a separate module

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The encryption settings are needed for RWPM profiles protected with LCP as well, which are not necessary EPUBs. So as long as the EncryptionSettings object is accessible outside of the EPUB scope (EPUBManifest?), this issue is not pressing.

Here's an example for audiobooks: https://readium.org/lcp-specs/notes/lcp-for-audiobooks.html

class ArrayParser(ValueParser):
"""Array parser."""

def __init__(self, item_parser, unique_items=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mickaël mentioned the idea of treating some arrays as sets. "Array with a uniqueness constraint" is slightly different from "set", but they're pretty similar and I don't think the distinction matters in an OPDS 2 context. So this might be a good way to implement sets -- or this might be code that should be moved over to a new SetParser class.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I see from your comments that you had a similar idea.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just thinking whether we really need SetParser because my first intention was to name parsers after the types and formats in the spec. So there is array type and there is ArrayParser

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that you mention this, uniqueness is not specified in the JSON schema. But I don't think it makes sense to have duplicate relations anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mickael-menu-mantano, actually JSON schema contains uniqueItems:

"links": {
      "description": "Feed-level links such as search or pagination",
      "type": "array",
      "items": {
        "$ref": "https://readium.org/webpub-manifest/schema/link.schema.json"
      },
      "uniqueItems": true

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For links yeah but not for rel.

Copy link
Contributor

@leonardr leonardr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with merging this if @mickael-menu is. Even if there are problems, it's better to merge this and use it as a starting point than to keep a branch open.

Copy link

@mickael-menu mickael-menu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are still a few opened comments, but nothing that can't be fixed in a future PR.

@vbessonov
Copy link
Contributor Author

@leonardr, could you merge this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants