Make parsers CharacterStream aware #1145

Merged
merged 1 commit into from Aug 23, 2020

@ashleysommer ashleysommer commented Aug 19, 2020

I set out to make a simple fix for #1144 and ended up with quite a big (and IMHO important) set of changes.
There are two N-Triples parsers in RDFLib: NTParser in /plugins/parsers/nt.py and NTriplesParser in /plugins/parsers/ntriples.py.
The latter is the original reference implementation of the N-Triples W3C standard, as provided by the W3C.
It is a legacy-style parser that takes an open file pointer and, when run, emits triples into a Sink.
NTParser in nt.py is a wrapper around the legacy parser: it adds rdflib compatibility, takes an rdflib.InputSource as input, and emits triples to an rdflib.Graph.
This PR puts both in the same file and renames the legacy NTriplesParser to W3CNTriplesParser to avoid confusion.
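
For readers unfamiliar with the sink pattern described above, here is a minimal sketch of the two layers. Every class name in it (`ListSink`, `ToyLegacyParser`, `ToyWrapper`) is hypothetical and written for illustration only. It is not the rdflib code, and the tokenisation is deliberately naive.

```python
from io import StringIO


class ListSink:
    """Hypothetical sink: collects every triple it is handed."""

    def __init__(self):
        self.triples = []

    def triple(self, s, p, o):
        self.triples.append((s, p, o))


class ToyLegacyParser:
    """Hypothetical legacy-style parser: reads an open file, pushes triples to a sink."""

    def __init__(self, sink):
        self.sink = sink

    def parse(self, f):
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            # Toy tokenisation: only handles three terms without inner whitespace.
            s, p, o = line.rstrip(" .").split()
            self.sink.triple(s, p, o)
        return self.sink


class ToyWrapper:
    """Hypothetical wrapper: feeds text to the legacy parser and fills a set-like store."""

    def parse(self, text, store):
        class StoreSink:
            def triple(self, s, p, o):
                store.add((s, p, o))

        ToyLegacyParser(StoreSink()).parse(StringIO(text))
        return store
```

The wrapper's only job is adaptation: it turns its own input type into the file-like object the legacy parser expects, and redirects the sink callbacks into the target store.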

The most important change here is adding CharacterStream support to the rdflib InputSource. This allows parsers to read Unicode text directly from the input source, rather than reading from inputsource.ByteStream and converting to str with data.decode(). Often the input was already a string to begin with. This PR changes some parsers to prefer inputsource.CharacterStream over the ByteStream when it is available, which removes many useless string->bytes->bytestream->textstream->string conversions that were happening in the parser pipelines.
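
As a rough illustration of the "prefer the character stream" idea, the sketch below uses only the standard library's `xml.sax.xmlreader.InputSource`. The helper name `open_text_stream` is hypothetical, and assuming UTF-8 for the byte fallback is a choice made for this example. The actual parser changes in this PR may be structured differently.

```python
import codecs
from io import BytesIO, StringIO
from xml.sax.xmlreader import InputSource


def open_text_stream(source):
    """Return a text stream for `source`, preferring its character stream.

    Hypothetical helper: falls back to decoding the byte stream as UTF-8
    only when no character stream has been set.
    """
    char_stream = source.getCharacterStream()
    if char_stream is not None:
        return char_stream  # already text, no decode step needed
    byte_stream = source.getByteStream()
    if byte_stream is None:
        raise ValueError("input source has neither a character nor a byte stream")
    return codecs.getreader("utf-8")(byte_stream)


triple = '<http://example.org/s> <http://example.org/p> "caf\u00e9" .\n'

# Source backed by text: the character stream is used directly.
text_source = InputSource()
text_source.setCharacterStream(StringIO(triple))

# Source backed by bytes: the helper wraps it in a decoding reader.
byte_source = InputSource()
byte_source.setByteStream(BytesIO(triple.encode("utf-8")))

from_text = open_text_stream(text_source).read()
from_bytes = open_text_stream(byte_source).read()
```

Both calls yield the same string, but the text-backed source skips the encode and decode round trip entirely.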

Rename NTriplesParser to W3CNTriplesParser, since it is the legacy parser
Populate the CharacterStream attr on several types of rdflib InputSource, to provide a unicode text stream in addition to the ByteStream
Add support to the N3, Trig, NTriples, and NQuads parsers for using the CharacterStream instead of the ByteStream where possible
Remove many useless string->bytes->string conversions in parsers
@coveralls

Coverage Status

Coverage decreased (-0.1%) to 75.65% when pulling ceab6b2 on ashleysommer:fix_1144 into 89cb369 on RDFLib:master.

@coveralls

coveralls commented Aug 19, 2020

Coverage Status

Coverage decreased (-0.2%) to 75.627% when pulling ceab6b2 on ashleysommer:fix_1144 into 89cb369 on RDFLib:master.

@nicholascar nicholascar left a comment

Good tidy-up and it's closing an issue while passing all tests so an easy approve!

Successfully merging this pull request may close these issues.

N-triples parser: reading a file fails without binary mode on Python 3.6