New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New parser architecture #2
New parser architecture #2
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a quick review of the public AST to see how it aligned with the mobile toolkit.
Using a visitor pattern was a good idea I think, I'll see if it can be of interest for Swift/Kotlin as well, to allow customizing the AST after parsing in an easier way. 👍
The extensibility is missing in some areas, here's a list of where it is expected:
- In
LinkProperties
, any unknown JSON property is added to anotherProperties
map. - In
Manifest
, asubcollections
property holds any unknown core collection. - In
Metadata
, any unknown JSON property is added to anotherMetadata
map. When parsing an EPUB, this is also filled with unknown<dc:*>
or<meta>
tags from the OPF.
templated = BooleanProperty("templated", required=False) | ||
type = StringProperty("type", required=False) | ||
title = StringProperty("title", required=False) | ||
rel = ArrayOfStringsProperty("rel", required=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For rel
, using a set would be more suitable instead of an array.
Also we tend to pluralize these array properties compared to the raw RWPM, so rels
, alternates
and languages
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, thank you! ArrayOfStringsProperty
has optional parameter unique_items
which can be used to enforce uniqueness of items but I think I'll better create a new property SetOfStringsProperty
.
) | ||
|
||
|
||
class LinkList(Node, list): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this document you will find a number of "helpers" functions that could be useful as well for LinkList
. Note that this is a general guideline, but the naming should follow the convention of the target language. So get_by_rel()
is fine for example.
class Metadata(Node): | ||
"""Dictionary containing manifest's metadata.""" | ||
|
||
identifier = URIProperty("identifier", required=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although the spec suggests that identifier
is an URL, in practice it might not always be the case depending on the format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was referring to the spec where it's defined as uri
:
"identifier": {
"type": "string",
"format": "uri"
},
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The JSON schema is used for "real" RWPM and OPDS 2, in this case it's true that identifier
is an URL. But R2 shared models can be used to parse other formats so they are only loosely based on the JSON schema. For example EPUB and PDF can have identifiers that are not URL.
Another example is that RWPM requires one link with the rel self
, but when parsing a local package (EPUB, PDF), we don't have any self
link to parse so this requirement is lifted for the shared models.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, in this case we could override this behaviour in child classes. For example:
class EPUBMetadata(Metadata):
identifier = StringProperty("identifier", required=False)
Metadata.extensions = (EPUBMetadata, )
It will force the parser to instantiate EPUBMetadata
instead of Metadata
which means that it will be expecting an identifier to be a string, not URI. Does it make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so, on mobile we have only a single Metadata
class so we're not in this situation, but maybe it's a proper workaround.
publisher = ContributorProperty("publisher", required=False) | ||
imprint = ContributorProperty("imprint", required=False) | ||
subject = SubjectProperty("subject", required=False) | ||
reading_progression = EnumProperty( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reading_progression
is a hint but is not very useful for the navigator side, so we added an effectiveReadingProgression
helper to calculate the actual reading progression:
In this Kotlin commit you can find an example implementation with a set of test cases.
Although this is only useful if this AST is meant to be used to render publication.
spread = EnumProperty("spread", False, ["auto", "both", "none", "landscape"]) | ||
|
||
|
||
class CompactCollection(Node): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On mobile we didn't make a difference between compact or full RWPM collections, everything is made canonical in a single full Collection
type, with empty metadata. Now I'm not sure this is necessarily better, I just wanted to point out the difference.
Also the role is not part of the collection itself but of its parent, which contains a subcollections
property which is a Map
of List<Collection>
:
- https://github.com/readium/r2-shared-kotlin/blob/alpha/r2-shared/src/main/java/org/readium/r2/shared/publication/PublicationCollection.kt#L33
- https://github.com/readium/r2-shared-kotlin/blob/34c27504e93714d4c3f6ecfbc60b71913a01d659/r2-shared/src/main/java/org/readium/r2/shared/publication/Publication.kt#L86
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leonard also used the same approach with having a single Collection
class. In my approach we can rely on type but in a dynamic language like Python it doesn't make much sense
) | ||
|
||
@property | ||
def sub_collections(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"subcollection" is idiomatic English so you can drop the _
I think.
@property | ||
def compact(self): | ||
"""Return a boolean value indicating if this collection is compact. | ||
|
||
:return: Boolean value indicating if this collection is compact | ||
:rtype: bool | ||
""" | ||
return self.metadata is None and len(self._sub_collections) == 0 | ||
|
||
@property | ||
def full(self): | ||
"""Return a boolean value indicating if this collection is full. | ||
|
||
:return: Boolean value indicating if this collection is full | ||
:rtype: bool | ||
""" | ||
return self.metadata is not None and len(self._sub_collections) > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In practice I don't know any case where knowing if a collection is compact vs full is important. Did you need to use it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm the original author of this and I think I only used it in enforcing the rules laid down by the spec (e.g. links
must be a compact collection).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha yes in that case that makes sense. We used simple Link
arrays for readingOrder
and such on mobile.
"""Registry item representing a specific media type.""" | ||
|
||
|
||
class LinkRelation(RegistryItem): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recently added in Swift a LinkRelation
struct to hold the various known relations and have rels
be type safe. Since it needs to be opened to extensions, this is not an enum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, I tried to split all different types or link relations into separate repositories. For example, in the case of RWPM there is a separate RWPMLinkRelationsRegistry:
class RWPMLinkRelationsRegistry(Registry):
"""Registry containing link relations mentioned in the RWPM spec."""
ALTERNATE = LinkRelation(key="alternate")
CONTENTS = LinkRelation(key="contents")
COVER = LinkRelation(key="cover")
MANIFEST = LinkRelation(key="manifest")
SEARCH = LinkRelation(key="search")
SELF = LinkRelation(key="self")
CORE_LINK_RELATIONS = [ALTERNATE, CONTENTS, COVER, MANIFEST, SEARCH, SELF]
def __init__(self):
"""Initialize a new instance of RWPMLinkRelationsRegistry class."""
super(RWPMLinkRelationsRegistry, self).__init__(self.CORE_LINK_RELATIONS)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For OPDS 2.0 there is OPDS2LinkRelationsRegistry:
class OPDS2LinkRelationsRegistry(RWPMLinkRelationsRegistry):
"""Registry containing OPDS 2.0 link relations."""
ACQUISITION = LinkRelation(key="http://opds-spec.org/acquisition")
OPEN_ACCESS = LinkRelation(key="http://opds-spec.org/acquisition/open-access")
BORROW = LinkRelation(key="http://opds-spec.org/acquisition/borrow")
BUY = LinkRelation(key="http://opds-spec.org/acquisition/buy")
SAMPLE = LinkRelation(key="http://opds-spec.org/acquisition/sample")
PREVIEW = LinkRelation(key="preview")
SUBSCRIBE = LinkRelation(key="http://opds-spec.org/acquisition/subscribe")
CORE_LINK_RELATIONS = [
ACQUISITION,
OPEN_ACCESS,
BORROW,
BUY,
SAMPLE,
PREVIEW,
SUBSCRIBE,
]
def __init__(self):
"""Initialize a new instance of OPDS2LinkRelationsRegistry class."""
super(OPDS2LinkRelationsRegistry, self).__init__()
self._add_items(self.CORE_LINK_RELATIONS)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like it fits a similar need indeed.
) | ||
|
||
|
||
class EPUBEncryptionSettings(Node): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will soonish be extracted from the EPUB extension into a DRM module instead. This was agreed on but not yet updated in the spec. So you can probably rename it to DRMEncrytionSettings
or EncryptionSettings
(on mobile we called it Encryption
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since there other classes in epub
module, do you think it makes sense to wait until you update the spec and then update the library accordingly? Because I think it would be better to extract it to a separate module
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The encryption settings are needed for RWPM profiles protected with LCP as well, which are not necessary EPUBs. So as long as the EncryptionSettings
object is accessible outside of the EPUB scope (EPUBManifest
?), this issue is not pressing.
Here's an example for audiobooks: https://readium.org/lcp-specs/notes/lcp-for-audiobooks.html
class ArrayParser(ValueParser): | ||
"""Array parser.""" | ||
|
||
def __init__(self, item_parser, unique_items=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mickaël mentioned the idea of treating some arrays as sets. "Array with a uniqueness constraint" is slightly different from "set", but they're pretty similar and I don't think the distinction matters in an OPDS 2 context. So this might be a good way to implement sets -- or this might be code that should be moved over to a new SetParser
class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I see from your comments that you had a similar idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just thinking whether we really need SetParser
because my first intention was to name parsers after the types and formats in the spec. So there is array
type and there is ArrayParser
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that you mention this, uniqueness is not specified in the JSON schema. But I don't think it makes sense to have duplicate relations anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mickael-menu-mantano, actually JSON schema contains uniqueItems
:
"links": {
"description": "Feed-level links such as search or pagination",
"type": "array",
"items": {
"$ref": "https://readium.org/webpub-manifest/schema/link.schema.json"
},
"uniqueItems": true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For links
yeah but not for rel
.
…> rels, author -> authors, etc.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with merging this if @mickael-menu is. Even if there are problems, it's better to merge this and use it as a starting point than to keep a branch open.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are still a few opened comments, but nothing that can't be fixed in a future PR.
@leonardr, could you merge this PR? |
This PR changes the architecture of the library and decouples logic (parsing, validation, serialization) from POCO classes representing AST nodes such as Link, Collection, Manifest, etc.
It splits parsing into two separate phases: