Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New architecture #1

Closed

Conversation

vbessonov
Copy link
Contributor

This PR changes the architecture of the library and decouples logic (parsing, validation, serialization) from POCO classes representing AST nodes such as Link, Collection, Manifest, etc:

  • it splits parsing into two separate phases: syntax analyses where SyntaxAnalyzer is responsible for parsing raw JSON into AST and SemanticAnalyzer conducts semantic checking of ready AST tree
  • it adds a separate class for serializing AST into JSON (to be developed yet)

.python-version Outdated
@@ -0,0 +1 @@
2.7.17
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be in the repo? I'm not an expert at pyenv but it seems like an artifact of your installation. I want this code base to work on both Python 2 and Python 3.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is actually quite often included into .gitignore but I thought it might be a good idea to let know what exact Python versions I used for testing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a similar question on Stack Overflow saying that .python-version can be kept in the repository for information purpose.


if sub_collection_keys:
raise ParsingError(
'The following sub-collection keys are not registered: {0}'.format(sub_collection_keys))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should be a parsing error, since RWPM allows extension roles (though it says "Extensions that are not registered in the registry must use a URI for their role.") I suggest treating these leftover keys as CollectionRoles with compact=False and require=False.

Then, when validating the collection, issue a warning if a collection has a role that is neither in (our model of) the registry nor a URI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the parser raises an exception and halts on the first failure. Would you like to have it to return a list of errors, warnings instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was saying that this shouldn't even be considered an error. I think you're saying that it should be considered an error, but that an error doesn't necessarily need to stop the parsing process.

So, two questions to resolve: is this actually an error? And should a problem like this necessarily result in an exception?

First, I don't think we can say for sure that this particular thing is an error, because we only know the contents of the RWPM role registry at the time of the release of the software. I don't want to have to put out an immediate upgrade if a role is added to the registry so we can parse the new role without an error.

That said, if an "error" in this case just means something is added to a list, then it's liveable, and I'd be okay with returning a 2-tuple of (errors, warnings).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this code and the parser doesn't raise an exception in the case of unknown collections

@leonardr
Copy link
Contributor

The overall architecture looks great. I'm comfortable merging this in and using it as a basis for further work if that's what you'd like.

@vbessonov
Copy link
Contributor Author

@leonardr, would you like to keep the old name python-opds-parser? I was thinking about something more generic like python-rwpm-parser. Do you think it makes sense?

@leonardr
Copy link
Contributor

leonardr commented Sep 25, 2020

I'd be OK with python-webpub-manifest-parser, since webpub-manifest is the name of the Github project that contains the spec. We'll mention OPDS 2 in the description so this package shows up in searches. Let me know if you like this name and I'll rename the repo.

@vbessonov
Copy link
Contributor Author

@leonardr, python-webpub-manifest-parser sounds great!

@vbessonov
Copy link
Contributor Author

@leonardr, just one more question regarding the naming: would you like to drop python- in the case of the Python package?
So the name of the repository would be python-webpub-manifest-parser and the name of the Python package would be webpub-manifest-parser

@leonardr
Copy link
Contributor

leonardr commented Sep 28, 2020

Yes, that's exactly what I was thinking. python- helps web searches but is not necessary for PyPI searches.

@vbessonov
Copy link
Contributor Author

@leonardr, I finished, the parser right now should be compliant with RWPM and OPDS 2.0 specs. However, it doesn't allow to extend the specs and add custom sub-collections yet - it won't throw any error but those custom sub-collections won't be available as part of the AST and won't be validated.

@vbessonov vbessonov changed the title [WIP] New architecture New architecture Oct 9, 2020
@vbessonov
Copy link
Contributor Author

@leonardr, I have some integration tests locally which use two feeds from NYU, Pret Numerique and The University of Michigan Press. I find them extremely helpful and would like to include them into the repository but I'm not sure whether I'm allowed to make them public.

@leonardr
Copy link
Contributor

leonardr commented Oct 9, 2020

You should be able to include feeds if a) the feeds are public in the first place or b) you obscure the metadata -- title, author, identifiers, and links. Show me the feeds privately and I'll advise.

@vbessonov
Copy link
Contributor Author

@leonardr, I created a new PR because I messed up this branch and couldn't rebase it anymore.

@vbessonov vbessonov closed this Oct 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants