harvard-library/schematronium

Schematronium

Schematronium is a gem providing:

  1. A single-object, single-function API for compiling a schematron script and running it over an XML file, returning a Nokogiri NodeSet of the resulting failed-asserts and successful-reports.
  2. A script (schematronium) to run a schematron over one or many XML files and return aggregate data in a TBD format. Mostly meant as an example of something you could do with it. It also shows how to turn off some parser features to prevent XXE vulnerabilities, which is VERY IMPORTANT if you are parsing XML you do not personally 100% control. Schematronium does NOT do this by default.

The goals of Schematronium are very similar to those of schematron-wrapper. The primary difference is that, where schematron-wrapper runs the saxon jar via backticks once per file, Schematronium uses the jRuby-only saxon-xslt library to compile and run the schematron in-process. This avoids paying the cost of JDK initialization for every file, which tends to be a substantial saving even over a small number of files.
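The compile-once, check-many pattern this enables can be sketched roughly as follows. `count_findings` is a hypothetical helper, not part of Schematronium; it only assumes that `checker` responds to `check(path)` and returns a collection of findings, as described in the API section of this README.

```ruby
# Hypothetical helper, not part of Schematronium.
# `checker` is assumed to respond to `check(path)` and return a collection
# of findings (for Schematronium, a Nokogiri NodeSet).
def count_findings(checker, paths)
  paths.each_with_object({}) do |path, counts|
    # The same compiled checker is reused for every file.
    counts[path] = checker.check(path).size
  end
end
```

For example, `count_findings(Schematronium.new("rules.sch"), Dir.glob("xml/*.xml"))` would compile the (hypothetical) `rules.sch` once and reuse it for every file under `xml/`.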

Requirements

  • jRuby - Schematronium is tested with jRuby 9000, but may be suitable for use with earlier jRubies.
  • JDK requirement is essentially whatever your chosen jRuby demands.
  • saxon-xslt
  • Nokogiri

API

API docs are hosted here.

The API for Schematronium is very, very minimal.

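require 'schematronium'

# Compile the schematron once; the resulting checker can be reused across files.
checker = Schematronium.new("schematron_filename.sch")

# Returns a Nokogiri NodeSet of failed-asserts and successful-reports.
failed_assert_nodeset = checker.check(filename_or_IO_object_supporting_read)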

Processing the NodeSet into the report or output you desire is left as an exercise to the consumer.
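As one possible starting point, here is a minimal sketch of turning the returned nodes into plain-text lines. Both helpers are hypothetical and not part of Schematronium. They assume each node is an SVRL `failed-assert` or `successful-report` element carrying the `location` and `test` attributes and a text message, as SVRL output normally does.

```ruby
# Hypothetical helpers, not part of Schematronium.
# Assumes each node responds to `name`, `[]` (attribute lookup) and `text`,
# as Nokogiri element nodes do.
def summarize_findings(nodes)
  nodes.map do |node|
    {
      kind:     node.name,                        # element name of the finding
      location: node['location'],                 # where in the checked document
      test:     node['test'],                     # the schematron test expression
      message:  node.text.strip.gsub(/\s+/, ' ')  # message, whitespace collapsed
    }
  end
end

# Render one line per finding.
def format_findings(nodes)
  summarize_findings(nodes).map do |finding|
    "[#{finding[:kind]}] #{finding[:location]}: #{finding[:message]}"
  end.join("\n")
end
```

With a `checker` built as above, `puts format_findings(checker.check("some_file.xml"))` would print one line per failed assert or successful report (`some_file.xml` being a placeholder name).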

Known issues

Redundant parsing

Right now, the Saxon::XML::Document object returned by saxon-xslt is pretty opaque. In order to get a reasonable API on the returned results, Schematronium is just rendering the returned doc to a string, then re-parsing with Nokogiri and using its API to pull out the failed-asserts and successful-reports.

It's possible that someone who knew more about XDMDocument (and Java XML-handling in general) than your humble author might be able to dispense with the use of Nokogiri, and thus reduce dependencies and (probably) memory/execution time.