Failed to parse graph (json-ld, JSONDecodeError) #1423

Closed
lambdamusic opened this issue Sep 27, 2021 · 9 comments · Fixed by #1436

Comments

@lambdamusic

I've got a little json-ld snippet that works fine with https://www.easyrdf.org/converter but it can't be loaded by rdflib (6.0.1, py 3.9). What is going wrong?

from rdflib import Graph
g = Graph()
g.parse("periodical.jsonld", format="json-ld")

##  => JSONDecodeError: Expecting value: line 2 column 1 (char 1)

This is periodical.jsonld:

{
  "@context": "http://schema.org/",
  "@graph": [
    {
        "@id": "#issue",
        "@type": "PublicationIssue",
        "issueNumber": "5",
        "datePublished": "2012",
        "isPartOf": {
            "@id": "#periodical",
            "@type": [
                "PublicationVolume",
                "Periodical"
            ],
            "name": "Cataloging & Classification Quarterly",
            "issn": [
                "1544-4554",
                "0163-9374"
            ],
            "volumeNumber": "50",
            "publisher": "Taylor & Francis Group"
        }
    },
    {
        "@type": "ScholarlyArticle",
        "isPartOf": "#issue",
        "description": "The library catalog as a catalog of works was an infectious idea, which together with research led to reconceptualization in the form of the FRBR conceptual model. Two categories of lacunae emerge--the expression entity, and gaps in the model such as aggregates and dynamic documents. Evidence needed to extend the FRBR model is available in contemporary research on instantiation. The challenge for the bibliographic community is to begin to think of FRBR as a form of knowledge organization system, adding a final dimension to classification. The articles in the present special issue offer a compendium of the promise of the FRBR model.",
        "sameAs": "https://doi.org/10.1080/01639374.2012.682254",
        "about": [
            "Works",
            "Catalog"
        ],
        "pageEnd": "368",
        "pageStart": "360",
        "name": "Be Careful What You Wish For: FRBR, Some Lacunae, A Review",
        "author": "Smiraglia, Richard P."
    }
  ]
}

And the full traceback:

JSONDecodeError                           Traceback (most recent call last)
<ipython-input-6-374cd859cd90> in <module>
----> 1 g.parse("periodical.jsonld")

~/Envs/ontospy2/lib/python3.9/site-packages/rdflib/graph.py in parse(self, source, publicID, format, location, file, data, **args)
   1251         parser = plugin.get(format, Parser)()
   1252         try:
-> 1253             parser.parse(source, self, **args)
   1254         except SyntaxError as se:
   1255             if could_not_guess_format:

~/Envs/ontospy2/lib/python3.9/site-packages/rdflib/plugins/parsers/jsonld.py in parse(self, source, sink, **kwargs)
    122             conj_sink = sink
    123
--> 124         to_rdf(data, conj_sink, base, context_data, version, generalized_rdf)
    125
    126

~/Envs/ontospy2/lib/python3.9/site-packages/rdflib/plugins/parsers/jsonld.py in to_rdf(data, dataset, base, context_data, version, generalized_rdf, allow_lists_of_lists)
    141         generalized_rdf=generalized_rdf, allow_lists_of_lists=allow_lists_of_lists
    142     )
--> 143     return parser.parse(data, context, dataset)
    144
    145

~/Envs/ontospy2/lib/python3.9/site-packages/rdflib/plugins/parsers/jsonld.py in parse(self, data, context, dataset)
    161             local_context = data.get(CONTEXT)
    162             if local_context:
--> 163                 context.load(local_context, context.base)
    164                 topcontext = True
    165             resources = data

~/Envs/ontospy2/lib/python3.9/site-packages/rdflib/plugins/shared/jsonld/context.py in load(self, source, base, referenced_contexts)
    351         source = source if isinstance(source, list) else [source]
    352         referenced_contexts = referenced_contexts or set()
--> 353         self._prep_sources(base, source, sources, referenced_contexts)
    354         for source_url, source in sources:
    355             if source is None:

~/Envs/ontospy2/lib/python3.9/site-packages/rdflib/plugins/shared/jsonld/context.py in _prep_sources(self, base, inputs, sources, referenced_contexts, in_source_url)
    375                 source_url = source
    376                 source_doc_base = base or self.doc_base
--> 377                 new_ctx = self._fetch_context(
    378                     source, source_doc_base, referenced_contexts
    379                 )

~/Envs/ontospy2/lib/python3.9/site-packages/rdflib/plugins/shared/jsonld/context.py in _fetch_context(self, source, base, referenced_contexts)
    407             return self._context_cache[source_url]
    408
--> 409         source = source_to_json(source_url)
    410         if source and CONTEXT not in source:
    411             raise INVALID_REMOTE_CONTEXT

~/Envs/ontospy2/lib/python3.9/site-packages/rdflib/plugins/shared/jsonld/util.py in source_to_json(source)
     25     stream = source.getByteStream()
     26     try:
---> 27         return json.load(StringIO(stream.read().decode("utf-8")))
     28     finally:
     29         stream.close()

/usr/local/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/json/__init__.py in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    291     kwarg; otherwise ``JSONDecoder`` is used.
    292     """
--> 293     return loads(fp.read(),
    294         cls=cls, object_hook=object_hook,
    295         parse_float=parse_float, parse_int=parse_int,

/usr/local/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/json/__init__.py in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    344             parse_int is None and parse_float is None and
    345             parse_constant is None and object_pairs_hook is None and not kw):
--> 346         return _default_decoder.decode(s)
    347     if cls is None:
    348         cls = JSONDecoder

/usr/local/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/json/decoder.py in decode(self, s, _w)
    335
    336         """
--> 337         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    338         end = _w(s, end).end()
    339         if end != len(s):

/usr/local/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/json/decoder.py in raw_decode(self, s, idx)
    353             obj, end = self.scan_once(s, idx)
    354         except StopIteration as err:
--> 355             raise JSONDecodeError("Expecting value", s, err.value) from None
    356         return obj, end

JSONDecodeError: Expecting value: line 2 column 1 (char 1)
@vemonet

vemonet commented Oct 6, 2021

Same here. RDFLib finally announces that JSON-LD (arguably the most used form of RDF) is supported without the need for a weird plugin!

But apparently the changes have been pushed to production without being tested for common use cases.

It seems to be broken when the @context is https://schema.org (but I think I have seen it break with all the JSON-LD I tried).

I installed rdflib = "^6.0.1" with Python 3.7 and ran it this way:

from rdflib import Graph
g = Graph()
g.parse(data=rdf_data, format='json-ld')

I tried with rdf_data as a dict and a str (using json.dumps(rdf_data)).

So I guess rdflib 6.0.1 is not ready for use (which is confusing because it has been released), and anyone who wants to parse the most common RDF format needs to downgrade to v5 and use the deprecated rdflib-jsonld plugin (https://github.com/RDFLib/rdflib-jsonld)

Note that PyLD seems to also have issues with loading JSON-LD files: digitalbazaar/pyld#154

Not sure if this is the same issue here, but for some reason Google did not implement content negotiation on https://schema.org (yes, the most commonly used schema is not proper linked data).

So you can't just ask for application/ld+json and get the Schema @context:

curl -H "Accept: application/ld+json" https://schema.org

Ironic no?
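
For anyone who wants to run the same check from Python rather than curl, here is a minimal sketch using only the standard library. The helper names `conneg_request` and `served_as_jsonld` are made up for this example, and what a given server answers can change over time, so no expected result is shown.

```python
from urllib.request import Request, urlopen

JSONLD = "application/ld+json"


def conneg_request(url: str) -> Request:
    """Build a GET request that asks for JSON-LD via the Accept header."""
    return Request(url, headers={"Accept": JSONLD})


def served_as_jsonld(url: str) -> bool:
    """Return True if the server answers with a JSON-LD media type."""
    with urlopen(conneg_request(url), timeout=10) as response:
        # get_content_type() drops parameters such as charset.
        return response.headers.get_content_type() == JSONLD
```

Calling `served_as_jsonld("https://schema.org")` performs a real HTTP request, so it needs network access.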

@vemonet

vemonet commented Oct 11, 2021

@lambdamusic indeed it is due to the fact that http://schema.org/ does not work with content-negotiation
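
To illustrate why a context URL that answers with HTML surfaces as this exact error, here is a minimal sketch. It assumes the response body is an HTML page that starts with a newline. The `body` string is made up for illustration and is not the actual schema.org response.

```python
import json

# Hypothetical response body: an HTML page with a leading newline,
# standing in for what a server returns when it ignores the Accept header.
body = "\n<!DOCTYPE html>\n<html><head><title>Schema.org</title></head></html>"

try:
    json.loads(body)
    error_message = None
except json.JSONDecodeError as err:
    # The decoder skips the leading newline, then fails on "<" at index 1.
    error_message = str(err)

print(error_message)  # Expecting value: line 2 column 1 (char 1)
```

The message matches the one in the traceback: the parser never sees JSON at all, only the start of an HTML document.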

A quick workaround would be to edit the @context value before running the RDFLib conversion to use https://schema.org/docs/jsonldcontext.jsonld

I was also facing an encoding issue, so I needed to install and use the PyLD package... cf. issue #1416 (comment)

And you also need to convert your dict to a str (no idea why rdflib would absolutely need to get the JSON as a string, this all seems really inefficient)

It feels a bit ridiculous to add 2 more libs to parse basic RDF, but a working process to parse JSON-LD will look like this:

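import json

from pyld import jsonld
from rdflib import Graph

# rdf_data is the JSON-LD document as a dict, loaded elsewhere.
# schema.org does not serve its context via content negotiation,
# so point at the context file directly.
if rdf_data['@context'].startswith('http://schema.org'):
    rdf_data['@context'] = 'https://schema.org/docs/jsonldcontext.jsonld'
rdf_data = jsonld.expand(rdf_data)  # resolve the context with PyLD
rdf_data = json.dumps(rdf_data)  # rdflib wants a str here, not a dict
g = Graph()
g.parse(data=rdf_data, format='json-ld')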

You can read more about schema.org not supporting content negotiation here: schemaorg/schemaorg#2578

To properly resolve schema.org context run this:

curl -I https://schema.org

You will see:

link: </docs/jsonldcontext.jsonld>; rel="alternate"; type="application/ld+json"

The rule is stated here: https://www.w3.org/TR/json-ld11/#interpreting-json-as-json-ld

In order to use an external context with an ordinary JSON document, when retrieving an ordinary JSON document via HTTP, processors MUST attempt to retrieve any JSON-LD document referenced by a Link Header with:
rel="http://www.w3.org/ns/json-ld#context", and type="application/ld+json".

Since this fails with RDFLib, I guess RDFLib has not implemented this process, hence the JSON-LD specification is still not supported by RDFLib (it would be nice to make that clear, because it can be confusing for people who think the JSON-LD standard is supported)
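
The Link header lookup described above can be sketched roughly as follows. This is a simplified parser that assumes well-formed headers without commas or semicolons inside quoted values. The function name `find_linked_context` is hypothetical and this is not RDFLib's implementation.

```python
import re
from typing import Optional
from urllib.parse import urljoin

# One link-value: <target> followed by zero or more ;name=value parameters.
_LINK_VALUE = re.compile(
    r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^;,\s]+))*)'
)
_PARAM = re.compile(r'([\w-]+)\s*=\s*(?:"([^"]*)"|([^;,\s]+))')


def find_linked_context(link_header: str, base_url: str, rel: str) -> Optional[str]:
    """Return the absolute URL of the first application/ld+json link with this rel."""
    for target, raw_params in _LINK_VALUE.findall(link_header):
        params = {}
        for name, quoted, bare in _PARAM.findall(raw_params):
            params[name.lower()] = quoted or bare
        # A rel parameter may hold several space-separated relation types.
        rels = params.get("rel", "").split()
        if rel in rels and params.get("type") == "application/ld+json":
            # Link targets may be relative to the URL that was requested.
            return urljoin(base_url, target)
    return None


header = '</docs/jsonldcontext.jsonld>; rel="alternate"; type="application/ld+json"'
context_url = find_linked_context(header, "https://schema.org", "alternate")
```

With the header shown by `curl -I`, looking for rel="alternate" yields the context file URL, while looking for rel="http://www.w3.org/ns/json-ld#context" finds nothing.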

@vemonet

vemonet commented Oct 11, 2021

Update on resolving this issue: there is something implemented here that should extract the links (not sure if it is properly triggered for http://schema.org). I will look into it when I have the time to set up a clean environment to test RDFLib:

@ashleysommer
Contributor

@vemonet
I've got a chance now to make a fix for this. I've discovered the issue. It appears the context_from_urlinputsource method is not working for schema.org for three reasons:

  1. It only operates on the initial data passed in, i.e., it looks at "periodical.jsonld", not the "schema.org" import.
  2. The response.getallmatchingheaders method of the stdlib's HTTPResponse is broken and should not be used (it hasn't matched on header names since Python v3.0).
  3. That method does content negotiation by context (rel="http://www.w3.org/ns/json-ld#context"), but schema.org does content negotiation by rel="alternate". I think these are for two distinct purposes.
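
On point 2, here is a minimal sketch of reading repeated headers with the standard library's `get_all()` instead. It assumes the headers object is an `http.client.HTTPMessage`, which is what `HTTPResponse.headers` exposes. The `link_headers` helper and the sample header values are made up for illustration, and this is not the code from the RDFLib fix.

```python
from http.client import HTTPMessage
from typing import List


def link_headers(headers: HTTPMessage) -> List[str]:
    """Return every Link header value, or an empty list if there is none."""
    # get_all() is inherited from email.message.Message and matches
    # header names case-insensitively.
    return headers.get_all("Link") or []


# Hypothetical response headers; assigning the same name twice appends
# a second header rather than overwriting the first.
headers = HTTPMessage()
headers["Content-Type"] = "text/html"
headers["Link"] = '</docs/jsonldcontext.jsonld>; rel="alternate"; type="application/ld+json"'
headers["Link"] = '<https://example.org/other>; rel="help"'

links = link_headers(headers)
```

Each returned value can then be inspected for the rel and type parameters of interest.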

There is a comment in the jsonld function 'source_to_json':
# TODO: conneg for JSON (fix support in rdflib's URLInputSource!)

That sounds like a sensible solution, now that jsonld is part of RDFLib. I'm adding content negotiation by Link header (using rel="alternate") into the URLInputSource code in rdflib parser.py.

@ashleysommer
Contributor

@vemonet I've created PR #1436 with a fix for this

@vemonet

vemonet commented Oct 11, 2021

@ashleysommer thanks a lot!

I was actually just looking into where the issue could come from, and was around there too: # TODO: conneg for JSON (fix support in rdflib's URLInputSource!)

I'll take a look at your changes to better understand the problem! And try your pull request tomorrow

btw it seems like issue #1416 is related to the same source_to_json function:

return json.load(StringIO(stream.read().decode("utf-8")))

@ashleysommer
Contributor

ashleysommer commented Oct 11, 2021

btw it seems like issue #1416 is related to the same source_to_json function

I've made some changes to that line in my new PR, to try to avoid yet another copy of the data in the input chain. That might've incidentally fixed #1416 in some cases (e.g., if the input string was already decoded at some point), but probably not in all cases. I'll test it.

@vemonet

vemonet commented Oct 14, 2021

Thanks a lot @ashleysommer! I tested it with my RDF snippets and it works, also fixing #1416.

I can also add a few tests once your PR is merged, if you want. Would I just need to add some here: https://github.com/RDFLib/rdflib/tree/master/test/jsonld/1.1/toRdf ?

@lambdamusic
Author

Thanks @vemonet @ashleysommer for looking into this!
It sounds like a fix will be released soon. I'll be keeping an eye out for it.
