Decisions

Ken McLeod edited this page Nov 27, 2019 · 9 revisions

Which rdfa parser to use?

  • Last commit 2016
  • Last release 2012 (on maven central)
  • Needs Jena
  • Needs HTML to be valid?
  • Last commit 2016
  • Last release 2016
  • Can’t use it with Apache Clerezza because classes are missing from the JAR:
    import org.apache.clerezza.rdf.core.MGraph;
    import org.apache.clerezza.rdf.core.UriRef;
    import org.apache.clerezza.rdf.core.access.TcManager;
    ...
    TcManager manager = TcManager.getInstance();
    UriRef graphUri = new UriRef(HTTP_EXAMPLE_COM); // fails: there is no UriRef class in the JAR
    if (manager.listMGraphs().contains(graphUri)) {
        manager.deleteTripleCollection(graphUri);
    }
    MGraph mgraph = manager.createMGraph(graphUri);
  • Can’t use it with Jena because the entire JAR file is missing. It also uses a very old version of Jena (2.11.1) from the HP days.

RDF4J

  • Uses Semargl underneath; see the stack trace:
at org.semarglproject.rdf4j.rdf.rdfa.RDF4JRDFaParser.parse(RDF4JRDFaParser.java:111)
  • Cannot handle invalid HTML, e.g.:
[Fatal Error] :68:479: The element type "link" must be terminated by the matching end-tag "</link>".
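
The message quoted above has the shape of an XML well-formedness error, which suggests (an assumption, not confirmed here) that the page is being read by a strict XML parser. The sketch below uses only the JDK's built-in XML parser, not RDF4J or Semargl, to show how an HTML5-style unclosed link element is rejected while the XHTML-style self-closed form is accepted. The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical demo class; it is not part of RDF4J, Semargl or this project.
class StrictXmlCheck {

    // Returns null when the markup is well-formed XML,
    // otherwise the message reported by the parser.
    static String wellFormednessError(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return null;
        } catch (SAXParseException e) {
            // The JDK's default error handler typically also prints a
            // "[Fatal Error] ..." line to stderr before this is thrown.
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Valid HTML5, but not well-formed XML: <link> is never closed.
        String html5Style =
            "<html><head><link rel=\"stylesheet\" href=\"a.css\"></head><body></body></html>";
        // Same document with the link element self-closed, as XHTML requires.
        String xhtmlStyle =
            "<html><head><link rel=\"stylesheet\" href=\"a.css\"/></head><body></body></html>";

        System.out.println("HTML5-style link: " + wellFormednessError(html5Style));
        System.out.println("XHTML-style link: " + wellFormednessError(xhtmlStyle));
    }
}
```

If this assumption holds, pages would need to be tidied into well-formed XHTML before being handed to such a parser, or parsed with a more lenient tool.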

Apache Any23

By default, Any23 uses Semargl with standard RDFa 1.1.

  • I can get this to work and it seems to tolerate invalid HTML (though results are questionable).
  • Any23 does both RDFa and JSON-LD, so it is a single solution that mostly works.

What to scrape?

  1. additionalProperty is ignored because it is difficult to interpret what the properties relate to. Also, it is not a very schema.org way of modelling data.

  2. nofollow is used by uniprot:

<a href="/uniprot/?query=author:%22Chen+X.%22&amp;sort=score" rel="nofollow">Chen X.</a>

This produces (via Any23):

http://purl.uniprot.org/citations/16243425  http://schema.org/nofollow https://www.uniprot.org/uniprot/?query=author:%22Chen+S.%22&sort=score

There is no http://schema.org/nofollow property, so these triples are ignored.
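
The two rules in this section amount to dropping triples by predicate. A minimal sketch of such a filter is shown below, assuming the parser output has already been flattened into plain subject/predicate/object strings. The `ScrapeFilter` and `Triple` names are hypothetical and do not come from the project or from Any23, and the `name` triple in `main` is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; the names here are illustrative, not the project's code.
class ScrapeFilter {

    // Simple value holder for one extracted statement.
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    // Predicates that are skipped when scraping.
    static final Set<String> IGNORED_PREDICATES = new HashSet<>(Arrays.asList(
        "http://schema.org/additionalProperty",
        "http://schema.org/nofollow"));

    // Returns a new list containing only the triples whose predicate is not ignored.
    static List<Triple> keep(List<Triple> triples) {
        List<Triple> kept = new ArrayList<>();
        for (Triple t : triples) {
            if (!IGNORED_PREDICATES.contains(t.predicate)) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Triple> extracted = Arrays.asList(
            // The nofollow triple reported above.
            new Triple("http://purl.uniprot.org/citations/16243425",
                       "http://schema.org/nofollow",
                       "https://www.uniprot.org/uniprot/?query=author:%22Chen+S.%22&sort=score"),
            // Made-up triple, for illustration only.
            new Triple("http://purl.uniprot.org/citations/16243425",
                       "http://schema.org/name",
                       "Example title"));

        for (Triple t : keep(extracted)) {
            System.out.println(t.subject + " " + t.predicate + " " + t.object);
        }
    }
}
```

Matching on the full predicate IRI, not on the local name alone, keeps the filter from accidentally dropping a same-named property from another vocabulary.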