Error reading JSON-LD from text IO stream #1484

Closed
gklyne opened this issue Nov 30, 2021 · 8 comments

Comments

@gklyne

gklyne commented Nov 30, 2021

I logged an issue against rdflib-jsonld a little over 3 years ago (with a suggested fix):

RDFLib/rdflib-jsonld#55

There was also what appears to be a related issue:

RDFLib/rdflib-jsonld#91

Back then, this problem was blocking my migration to Python 3. I'm now revisiting that project and no longer have the option to stick with Python 2, and I'm seeing the same problem with rdflib==6.0.1, which of course now incorporates rdflib-jsonld. Are there any plans to incorporate a fix for this issue?

(I'm not reading from a file, but a stream generated by another software component, so I don't have the option to just open the file in binary mode.)

The error I'm seeing is this:

Traceback (most recent call last):
  File "/Users/graham/workspace/github/gklyne/annalist/src/annalist_root/annalist/tests/test_jsonld_context.py", line 489, in test_jsonld_view_user
    result = g.parse(source=s, publicID=b+"/", format="json-ld")
  File "/Users/graham/workspace/github/gklyne/annalist/anenv3/lib/python3.7/site-packages/rdflib-6.0.1-py3.7.egg/rdflib/graph.py", line 1253, in parse
    parser.parse(source, self, **args)
  File "/Users/graham/workspace/github/gklyne/annalist/anenv3/lib/python3.7/site-packages/rdflib-6.0.1-py3.7.egg/rdflib/plugins/parsers/jsonld.py", line 114, in parse
    data = source_to_json(source)
  File "/Users/graham/workspace/github/gklyne/annalist/anenv3/lib/python3.7/site-packages/rdflib-6.0.1-py3.7.egg/rdflib/plugins/shared/jsonld/util.py", line 27, in source_to_json
    return json.load(StringIO(stream.read().decode("utf-8")))
AttributeError: 'str' object has no attribute 'decode'

(Apologies if this is already recorded as an issue: I did look at existing issues, but couldn't find a match.)
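
The last frame of the traceback calls `.decode("utf-8")` on the result of `stream.read()`. For a text stream such as `io.StringIO`, `read()` already returns `str`, which has no `decode` method in Python 3. Below is a minimal stdlib-only sketch of that failure, together with a caller-side workaround that re-encodes the text into a byte stream. The helper name `as_byte_stream` is hypothetical, and whether the resulting stream parses cleanly on a given rdflib release is an assumption that has not been checked here.

```python
import io
import json


def as_byte_stream(text_stream, encoding="utf-8"):
    """Hypothetical helper: copy a text stream into an in-memory byte stream."""
    return io.BytesIO(text_stream.read().encode(encoding))


# Reproduce the failing pattern from the traceback with a text stream.
text = io.StringIO('{"@id": "http://example.org/s"}')
try:
    text.read().decode("utf-8")  # str has no .decode() in Python 3
    failed = False
except AttributeError:
    failed = True

# Workaround sketch: hand the consumer bytes instead of text.
text.seek(0)
byte_stream = as_byte_stream(text)
data = json.load(io.StringIO(byte_stream.read().decode("utf-8")))
```

The cost of this workaround is that the whole payload is held in memory twice, which is usually acceptable for JSON-LD documents of modest size.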

@gklyne
Author

gklyne commented Dec 3, 2021

I've created a test case for this:

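import io
import json
import unittest

from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS


class TestIssue1484_json(unittest.TestCase):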
    def test_issue_1484_json(self):
        """
        Test JSON-LD parsing of result from json.dump
        """
        n = Namespace("http://example.org/")
        jsondata = {
          "@id": n.s,
          "@type": [ n.t ],
          n.p: { "@id": n.o }
        }

        s = io.StringIO()
        json.dump(jsondata, s, indent=2, separators=(',', ': '))
        s.seek(0)

        DEBUG = False
        if DEBUG:
            print("S: ", s.read())
            s.seek(0)

        b = n.base
        g = Graph()
        g.bind("rdf", RDF)
        g.bind("rdfs", RDFS)
        g.parse(source=s, publicID=b, format="json-ld")

        assert((n.s, RDF.type, n.t) in g)
        assert((n.s, n.p, n.o) in g)

And also a fix that seems to work (but see below):

In file: rdflib/plugins/shared/jsonld/util.py

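import json
from io import TextIOBase, TextIOWrapper

from rdflib.parser import PythonInputSource, StringInputSource, create_input_source


def source_to_json(source):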

    if isinstance(source, PythonInputSource):
        return source.data

    if isinstance(source, StringInputSource):
        return json.load(source.getCharacterStream())

    # TODO: conneg for JSON (fix support in rdflib's URLInputSource!)
    source = create_input_source(source, format="json-ld")

    stream = source.getByteStream()
    try:
        # Use character stream as-is, or interpret byte stream as UTF-8
        if isinstance(stream, TextIOBase):
            # use_stream = stream
            return json.load(stream)
        else:
            use_stream = TextIOWrapper(stream, encoding='utf-8')
        return json.load(use_stream)
    finally:
        stream.close()

I did not have a clean run of all tests passing before I applied my changes: in particular, the SPARQL service tests were hanging, and a number of other unrelated tests appear to be failing. My changes and tests are based on the master branch. Is there a clean branch for this kind of activity?

Anyway, as far as I can tell, the remaining failing tests are not related to my changes.

@ghost

ghost commented Dec 3, 2021

I'm lucky enough to be able to run RDFLib master tests locally and get an all-clear. I've transcribed your fix and test into a separate branch, validated that the tests all pass and, pro tem pushed it up to a spare org where GitHub Actions can pick up Iwan Aucamp's extremely useful “validate.yaml” workflow that does the biz, running the tests over a matrix of platforms/Python versions.

Confirmatory results are here

Submitted PR is here

Thanks for the fix.

@gklyne
Author

gklyne commented Dec 4, 2021

Thanks! In case it helps:

  1. I'm running the tests on macOS, on an M1 Max processor, using "Python 3.9.7 (v3.9.7:1016ef3790, Aug 30 2021, 16:39:15)", as far as I can tell under Rosetta translation.

  2. There's another test I created, to catch a regression from an earlier fix I tried:

import unittest

from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS


class TestIssue1484_str(unittest.TestCase):
    def test_issue_1484_str(self):
        """
        Test JSON-LD parsing of result from string (used by round tripping tests)

        (Previously passed, but was broken by an earlier fix for the above.)
        """
        n = Namespace("http://example.org/")
        jsonstr = """
            {
              "@id": "http://example.org/s",
              "@type": [
                "http://example.org/t"
              ],
              "http://example.org/p": {
                "@id": "http://example.org/o"
              }
            }
        """

        b = n.base
        g = Graph()
        g.bind("rdf", RDF)
        g.bind("rdfs", RDFS)
        g.parse(data=jsonstr, publicID=b, format="json-ld")

        assert((n.s, RDF.type, n.t) in g)
        assert((n.s, n.p, n.o) in g)

(This was catching an error that was otherwise showing up in the "roundtrip" tests.)

Further comment: I think the logic around handling different source data has become a bit diffuse. Some of it seems to be in the XML reader classes imported from SAX, and some in rdflib.parser.create_input_source and associated components.

@gklyne
Author

gklyne commented Dec 4, 2021

I just noticed my fix had some code inconsistency left over from my experiments. The following cleanup still passes my new tests locally (changes to the lines following if isinstance(stream, TextIOBase):)

def source_to_json(source):

    if isinstance(source, PythonInputSource):
        return source.data

    if isinstance(source, StringInputSource):
        return json.load(source.getCharacterStream())

    # TODO: conneg for JSON (fix support in rdflib's URLInputSource!)
    source = create_input_source(source, format="json-ld")

    stream = source.getByteStream()
    try:
        # Use character stream as-is, or interpret byte stream as UTF-8
        if isinstance(stream, TextIOBase):
            use_stream = stream
        else:
            use_stream = TextIOWrapper(stream, encoding='utf-8')
        return json.load(use_stream)
    finally:
        stream.close()
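
For readers who want to try the stream-handling idea outside rdflib, here is a stdlib-only sketch of the same branch: use a text stream as-is, and wrap a byte stream in `TextIOWrapper` so it is decoded as UTF-8. The function name `load_json_from_stream` is hypothetical and the rdflib input-source handling is deliberately left out, so this illustrates the technique rather than the code that was merged.

```python
import io
import json
from io import TextIOBase, TextIOWrapper


def load_json_from_stream(stream):
    """Load JSON from either a text stream or a UTF-8 encoded byte stream."""
    try:
        if isinstance(stream, TextIOBase):
            # Already yields str, so json.load can consume it directly.
            use_stream = stream
        else:
            # Decode bytes as UTF-8 instead of calling .decode() on read().
            use_stream = TextIOWrapper(stream, encoding="utf-8")
        return json.load(use_stream)
    finally:
        stream.close()


from_text = load_json_from_stream(io.StringIO('{"a": 1}'))
from_bytes = load_json_from_stream(io.BytesIO('{"b": "é"}'.encode("utf-8")))
```

Checking against `TextIOBase` rather than a concrete class means that `io.StringIO` and text-mode file objects both take the first branch.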

@vemonet

vemonet commented Dec 6, 2021

Hi, we are getting an error similar to the one @gklyne reported.

We have been trying to parse simple JSON-LD with http://schema.org as context (the most widely used context, found all over the web...), but it has been quite challenging due to encoding errors that seem to belong to the '90s...

Here is the basic JSON-LD we want to load in RDFLib:

{
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "ECJ case law text similarity analysis",
    "description": "results from a study to analyse how closely the textual similarity of ECJ cases resembles the citation network of the cases.",
    "version": "v2.0",
    "url": "https://doi.org/10.5281/zenodo.4228652",
    "license": "https://www.gnu.org/licenses/agpl-3.0.txt"
}

Here is the error we got:

  File "/usr/local/lib/python3.8/site-packages/rdflib/plugins/shared/jsonld/context.py", line 377, in _prep_sources
    new_ctx = self._fetch_context(
  File "/usr/local/lib/python3.8/site-packages/rdflib/plugins/shared/jsonld/context.py", line 409, in _fetch_context
    source = source_to_json(source_url)
  File "/usr/local/lib/python3.8/site-packages/rdflib/plugins/shared/jsonld/util.py", line 35, in source_to_json
    return json.load(StringIO(stream.read().decode("utf-8")))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Here is the fix that we came up with (nanopub_rdf is the JSON-LD object as a regular python Dict):

from pyld import jsonld
import json

# TODO: quick fix to remove; should be fixed in rdflib releases after 6.0.2
context = str(nanopub_rdf.get('@context', ''))
if context.startswith(('http://schema.org', 'https://schema.org')):
    # Regular content negotiation doesn't work with schema.org: https://github.com/schemaorg/schemaorg/issues/2578
    nanopub_rdf['@context'] = 'https://schema.org/docs/jsonldcontext.json'
# RDFLib JSON-LD has an issue with encoding: https://github.com/RDFLib/rdflib/issues/1416
nanopub_rdf = jsonld.expand(nanopub_rdf)
nanopub_rdf = json.dumps(nanopub_rdf, ensure_ascii=False)

It seems to be due to RDFLib not being able to read the Schema.org JSON-LD context available at https://schema.org/docs/jsonldcontext.json
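
One possible reading of the traceback, not confirmed anywhere in this thread: a gzip stream starts with the bytes 0x1f 0x8b, and 0x8b at position 1 is exactly where a UTF-8 decoder would give up on such a payload. If the fetched context were arriving gzip-compressed, a check like the following would handle it. The helper name `decode_json_payload` is hypothetical, and the sketch uses only the standard library rather than any rdflib API.

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_json_payload(raw: bytes):
    """Hypothetical helper: gunzip the payload if it looks compressed, then parse JSON."""
    if raw.startswith(GZIP_MAGIC):
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))


context = {"@context": {"schema": "http://schema.org/"}}
plain = json.dumps(context).encode("utf-8")
compressed = gzip.compress(plain)

# Decoding the compressed bytes directly fails at the second byte (0x8b).
try:
    compressed.decode("utf-8")
    error_position = None
except UnicodeDecodeError as exc:
    error_position = exc.start
```

Inspecting the first two bytes of the response body would be a quick way to confirm or rule out this explanation.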

I also tried with another, more complex @context, and that one actually works without the need for pyld expansion! Ironically, this one also has schema.org in the context, but it is not causing a problem (I guess the JSON-LD parser has been written with a lot of personal, arbitrary choices in terms of how to handle the context).

{
  "@context": {
    "adms": "http://www.w3.org/ns/adms#",
    "dcat": "http://www.w3.org/ns/dcat#",
    "dcterms": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "schema": "http://schema.org/",
    "vcard": "http://www.w3.org/2006/vcard/ns#",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
  },
  "@id": "http://nobelprize.org/datasets/dcat#ds1",
  "@type": "dcat:Dataset",
  "adms:contactPoint": {
    "@id": "http://nobelprize.org/contacts/n1",
    "@type": "foaf:Agent",
    "foaf:name": "Vincent Emonet"
  }
}

It would be really useful if some tests were added to check for common real-world uses of JSON-LD (e.g. using a single "http://schema.org" string as context...). Everyone is using this approach for JSON-LD to publish schema.org-related metadata all over the web...

Also, it would be really easy to do this: when a single context is provided and you can't retrieve the @context at the given URL, just concatenate the namespace provided in the @context with the property/@type given (exactly as you seem to do when multiple contexts are provided).
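
A rough sketch of what that fallback could look like, using plain dictionaries. This illustrates the suggestion only: it is not how rdflib behaves, the function name `expand_with_namespace_fallback` is hypothetical, and the approach is lossy because it ignores everything a real context defines (type coercion, aliases, nested contexts), so the result can differ from a true JSON-LD expansion.

```python
def expand_with_namespace_fallback(doc):
    """Hypothetical fallback: treat a single string @context as a namespace prefix."""
    context = doc.get("@context")
    if not isinstance(context, str):
        raise ValueError("fallback only applies to a single string @context")
    # Make sure the namespace ends with a separator before concatenating.
    base = context if context.endswith(("/", "#")) else context + "/"

    expanded = {}
    for key, value in doc.items():
        if key == "@context":
            continue
        if key == "@type":
            types = value if isinstance(value, list) else [value]
            expanded["@type"] = [base + t for t in types]
        elif key.startswith("@"):
            # Other keywords such as @id are passed through unchanged.
            expanded[key] = value
        else:
            expanded[base + key] = value
    return expanded


doc = {"@context": "https://schema.org", "@type": "Dataset", "name": "ECJ case law"}
result = expand_with_namespace_fallback(doc)
```

Only top-level keys are rewritten here; nested objects would need the same treatment applied recursively.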

I tried with 6.0.2 and with the current master commit. I am still getting the same encoding error on master, so I guess reading strings is still not supported by RDFLib, and the RDFLib community doesn't really care about it: #1416

@nicholascar
Member

@gklyne a bunch of improvements to RDFLib's tests have been merged recently, along with a tiny improvement to URLInputSource (see #1643), so do you want to just check the status of this issue again now with the master branch?

@ghost

ghost commented Feb 5, 2022

do you want to just check the status of this Issue again now with master branch?

I ran gklyne's tests against master, they now pass.

ghost closed this as completed on Feb 5, 2022

@gklyne
Author

gklyne commented Feb 18, 2022

Testing my application with rdflib 6.1.1 -- my previously-failing tests are now all passing. Thanks all!

This issue was closed.