Make parsers CharacterStream aware #1145
Merged
Went to do a simple fix for #1144 and ended up creating quite a big (and IMHO important) set of changes.
There are two NTriples parsers in RDFLib. One in /plugins/parsers/nt.py called `NTParser` and another in /plugins/parsers/ntriples.py called `NTriplesParser`. The latter is the original reference implementation of the NTriples W3C standard as provided by W3C.
It is a legacy-style parser that takes an open file pointer and, when run, emits triples into a Sink.
The other `NTParser` in `nt.py` is a wrapper around the legacy parser: it adds rdflib compatibility, takes an rdflib.InputSource as input, and emits triples to an rdflib.Graph.
This PR puts both in the same file, and renames the legacy NTriplesParser to `W3CNTriplesParser` to avoid confusion.
The most important change in here is adding CharacterStream support to the rdflib InputSource. This allows parsers to read unicode streams directly from the input source, as opposed to reading from the inputsource.ByteStream and then converting to `str` with `data.decode()`. Often the InputSource was already a string to begin with. This PR changes some parsers to prefer reading from the inputsource.CharacterStream if available instead of the ByteStream, which removes many useless string->bytes->bytestream->textstream->string conversions that were happening in the Parser pipelines.
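
To illustrate the "prefer the CharacterStream, fall back to the ByteStream" idea, here is a minimal sketch. It is not the code from this PR: it uses the standard library's `xml.sax.xmlreader.InputSource` (which rdflib's InputSource subclasses) so that it runs without rdflib installed. The helper name `open_text_stream` is hypothetical, and UTF-8 is assumed for the byte fallback.

```python
import codecs
from io import BytesIO, StringIO
from xml.sax.xmlreader import InputSource


def open_text_stream(source):
    """Return a text stream for `source` (hypothetical helper).

    Prefers the character stream so that input which is already text
    is never encoded to bytes and decoded again.
    """
    char_stream = source.getCharacterStream()
    if char_stream is not None:
        return char_stream
    byte_stream = source.getByteStream()
    if byte_stream is None:
        raise ValueError("input source has neither a character nor a byte stream")
    # Assumption: the bytes are UTF-8. Wrap the stream in a decoding reader
    # instead of reading everything and calling data.decode() in one go.
    return codecs.getreader("utf-8")(byte_stream)


line = "<http://example.org/a> <http://example.org/b> <http://example.org/c> .\n"

# Source that already holds text: handed back as-is, no conversions.
text_source = InputSource()
text_source.setCharacterStream(StringIO(line))

# Source that only holds bytes: decoded lazily through the fallback path.
byte_source = InputSource()
byte_source.setByteStream(BytesIO(line.encode("utf-8")))

text_result = open_text_stream(text_source).read()
bytes_result = open_text_stream(byte_source).read()
```

Both calls yield the same `str`, but only the second one pays for a decode step; a parser written against a helper like this works with either kind of source.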