## XSLT transformation from ISO XML to GeoDCAT RDF/XML

This notebook provides some convenience steps to perform the ISO XML to GeoDCAT transformation.

---

Because the XSLT is written in version 2.0, 'standard' Python libraries such as lxml/etree cannot be used: they only support XSLT 1.0.

So the initial steps call the Saxon-HE 12.5 Java processor.

Subsequent steps use rdflib and pyshacl to work with the rdf data.
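Whether a stylesheet needs Saxon can be checked up front by reading the `version` attribute on its root element. The following is a minimal sketch using only the standard library; the helper names `stylesheet_version` and `needs_saxon` are hypothetical and not part of this notebook.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"


def stylesheet_version(xslt_source: str) -> str:
    """Return the version attribute of the xsl:stylesheet or xsl:transform root."""
    root = ET.fromstring(xslt_source)
    if root.tag not in (f"{{{XSL_NS}}}stylesheet", f"{{{XSL_NS}}}transform"):
        raise ValueError("not an XSLT stylesheet")
    return root.attrib.get("version", "")


def needs_saxon(xslt_source: str) -> bool:
    # lxml wraps libxslt, which implements XSLT 1.0 only
    return stylesheet_version(xslt_source) != "1.0"


# Tiny inline stylesheet used only for illustration
sample = '<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"/>'
print(stylesheet_version(sample), needs_saxon(sample))
```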




1) install java
2) download saxon-he (community version)
3) configure a 'task' in VSCode to run the transformation using the DeltaXML XSLT/XPath extension.

Or use the command line like this:
```
    java.exe 
    -cp <path>\saxon-he-10.8.jar net.sf.saxon.Transform 
    -xsl:\xslt\iso-19139-to-dcat-ap.xsl 
    -s:\test_input\input.xml 
    -o:\test_output\result.xml 
    --allowSyntaxExtensions:off
```
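The same command can also be assembled and launched from Python. This is a sketch only: the helper names `build_saxon_command` and `run_saxon` are hypothetical, the jar name and relative paths in the example are placeholders, and it assumes a `java` executable on the PATH.

```python
import subprocess


def build_saxon_command(jar, xsl, source, output, java="java"):
    """Build the argument list for the Saxon command-line transform shown above."""
    return [
        java,
        "-cp", str(jar),
        "net.sf.saxon.Transform",
        f"-xsl:{xsl}",
        f"-s:{source}",
        f"-o:{output}",
        "--allowSyntaxExtensions:off",
    ]


def run_saxon(jar, xsl, source, output):
    """Run the transform; raises CalledProcessError if Saxon exits with an error."""
    cmd = build_saxon_command(jar, xsl, source, output)
    return subprocess.run(cmd, check=True, capture_output=True, text=True)


# Placeholder paths, for illustration only
cmd = build_saxon_command(
    "saxon-he.jar",
    "xslt/iso-19139-to-dcat-ap.xsl",
    "test_input/input.xml",
    "test_output/result.xml",
)
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.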

### XSLT transformation with the SaxonC Python API

SaxonC has a Python API, available from PyPI as `saxonche`:

```
pip install saxonche
```
see: https://pypi.org/project/saxonche/
and: https://www.saxonica.com/saxon-c/documentation12/index.html#!samples/samples_python



In [65]:
import os
from pathlib import Path

from dotenv import load_dotenv
from saxonche import PySaxonApiError, PySaxonProcessor


def saxonTransform(input_xml, xslt_file, output_file):
    """Run xslt_file over input_xml with SaxonC-HE and write the result to output_file."""
    with PySaxonProcessor(license=False) as proc:

        # print(proc.version)
        try:
            xsltproc = proc.new_xslt30_processor()
            builder = proc.new_document_builder()
            # The input is passed as a file URI, so input_xml must be an absolute path
            document = builder.parse_xml(xml_file_name=Path(input_xml).as_uri())
            executable = xsltproc.compile_stylesheet(stylesheet_file=xslt_file)

            output = executable.transform_to_string(xdm_node=document)
            # Write as UTF-8 explicitly instead of relying on the platform default encoding
            with open(output_file, "w", encoding="utf-8") as f:
                f.write(output)

        except PySaxonApiError as err:
            print('Error during function call', err)


In [66]:

# builder.parse_xml(xml_file_name=Path(input_xml).as_uri()) expects a fully qualified pathname.
# So put the base path in an .env variable (to keep the code stable across environments)
load_dotenv()  # take environment variables from .env.
base_path = os.environ['BASE_PATH']

xml_dir = Path(base_path + 'test_input\\')
xml_out_dir = base_path + 'test_output_xml\\'
xslt_file = base_path + 'xslt\\original\\iso-19139-to-dcat-ap.xsl'
# Transform both test files with the original XSL and with the adapted XSL to analyze any changes

for xml_file in xml_dir.glob('*.xml'):
    # print(xml_file)
    # Swap only the file extension; str.replace would also rewrite 'xml' elsewhere in the name
    rdf_out = xml_out_dir + xml_file.with_suffix('.rdf').name
    # print(rdf_out)
    saxonTransform(xml_file, xslt_file, rdf_out)
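Output file names are derived from the input names, and directories are joined onto `BASE_PATH`. A sketch of doing both with `pathlib`, so that only the final extension changes and the result does not depend on `BASE_PATH` ending in a separator. The helper `output_path` and the example paths are hypothetical; `PureWindowsPath` is used here only so the example behaves the same on any operating system, whereas in the notebook a plain `Path` would do.

```python
from pathlib import PureWindowsPath


def output_path(base_path, out_dir, input_file, new_suffix):
    """Return base_path/out_dir/<input name with its last suffix replaced>."""
    name = PureWindowsPath(input_file).with_suffix(new_suffix).name
    return PureWindowsPath(base_path) / out_dir / name


# Hypothetical paths; note the extra 'xml' inside the file name is left alone
example = output_path(
    "C:\\data\\project",
    "test_output_xml",
    "C:\\data\\project\\test_input\\my.xml.sample.xml",
    ".rdf",
)
print(example)
```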


There are some issues with the XSLT. They are addressed in an adapted version of the XSLT, whose output is written to the adapted_xsl folder.

The changes in the XSLT are annotated with the `<!-- @NH -->` remark.

- Spaces in the namespace declarations (lines 55-86) seem to cause trouble with Python, so they were removed.
- Distributions and data services are created as blank nodes. This potentially causes issues when multiple RDF DCAT files are combined, because the relation between dataset and distribution gets lost. The provisional fix is to generate an ID for the distribution (lines 1411-1466).
- dcat:theme values are created in such a way that only a URI is produced for a theme, rather than a SKOS concept (line 3113).
- Every distribution gets a format/media type assigned, even when the stated format is not applicable to that distribution (line 3410).
- The XSLT currently transforms geometries for the bounding box and geometry properties in three different ways. The profile (and the SHACL files) specifies a maximum of one value for these properties, but does not specify which serialization (lines 2586, 2587).
- A Dataset is also typed as dcat:Resource to fulfill the SHACL validation requirement foaf:primaryTopic -> dcat:Resource (line 1133).
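The idea behind replacing blank nodes with generated identifiers can be illustrated in Python. This is not how the adapted XSLT does it; it is a sketch under the assumption that a distribution can be identified by its dataset IRI plus its access URL. The function name `distribution_iri`, the IRI pattern, and the example.org values are all hypothetical.

```python
import hashlib


def distribution_iri(dataset_iri: str, access_url: str) -> str:
    """Derive a stable IRI for a distribution from its dataset and access URL."""
    # Hashing the access URL gives the same identifier on every run,
    # so merged graphs keep the dataset-to-distribution link.
    digest = hashlib.sha1(access_url.encode("utf-8")).hexdigest()[:12]
    return f"{dataset_iri.rstrip('/')}/distribution/{digest}"


first = distribution_iri("https://example.org/dataset/1", "https://example.org/wms")
second = distribution_iri("https://example.org/dataset/1", "https://example.org/wfs")
print(first)
print(second)
```

Unlike a blank node, such an IRI stays the same across separate transformation runs, which is what allows graphs from different files to be combined safely.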



In [67]:
# builder.parse_xml(xml_file_name=Path(input_xml).as_uri()) expects a fully qualified pathname.
# So put the base path in an .env variable (to keep the code stable across environments)
load_dotenv()  # take environment variables from .env.
base_path = os.environ['BASE_PATH']

xml_dir = Path(base_path + 'test_input\\')
xml_out_dir = base_path + 'test_output_xml\\adapted_xsl\\'
xslt_file = base_path + 'xslt\\iso-19139-to-dcat-ap.xsl'
# Transform both test files with the original XSL and with the adapted XSL to analyze any changes

for xml_file in xml_dir.glob('*.xml'):
    # print(xml_file)
    # Swap only the file extension; str.replace would also rewrite 'xml' elsewhere in the name
    rdf_out = xml_out_dir + xml_file.with_suffix('.rdf').name
    # print(rdf_out)
    saxonTransform(xml_file, xslt_file, rdf_out)

### Transform from RDF/XML into Turtle

... because I like reading Turtle more than reading XML ...

It also checks whether the RDF/XML is actually valid RDF that rdflib can parse ...

In [68]:
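from xml.sax import SAXParseException

from rdflib import Graph
from rdflib.exceptions import Error

def rdfxml_2_ttl(rdf_file, ttl_file):
    """Parse an RDF/XML file and re-serialize it as Turtle."""
    try:
        graph = Graph()
        # State the input format explicitly rather than relying on the file extension
        graph.parse(rdf_file, format='xml')

        graph.serialize(ttl_file, format='turtle')
    except (Error, SAXParseException) as err:
        # Malformed XML surfaces as SAXParseException, RDF-level problems as rdflib Error
        print('Error during rdf to ttl transformation', err)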

In [69]:
rdf_folder = Path(base_path + 'test_output_xml\\')
ttl_folder = base_path + 'test_output_ttl\\'

for rdf_file in rdf_folder.glob('*.rdf'):
    # Swap only the file extension; str.replace would also rewrite 'rdf' elsewhere in the name
    ttl_out = ttl_folder + rdf_file.with_suffix('.ttl').name
    rdfxml_2_ttl(rdf_file, ttl_out)

In [70]:
rdf_folder = Path(base_path + 'test_output_xml\\adapted_xsl\\')
ttl_folder = base_path + 'test_output_ttl\\adapted_xsl\\'

for rdf_file in rdf_folder.glob('*.rdf'):
    # Swap only the file extension; str.replace would also rewrite 'rdf' elsewhere in the name
    ttl_out = ttl_folder + rdf_file.with_suffix('.ttl').name
    rdfxml_2_ttl(rdf_file, ttl_out)

### SHACL Validate

In [71]:
from pyshacl import validate
def do_validate(data_graph, sg):
    r = validate(data_graph,
      shacl_graph=sg,
      ont_graph=None,
      inference='rdfs',
      abort_on_first=False,
      allow_infos=False,
      allow_warnings=False,
      meta_shacl=False,
      advanced=False,
      js=False,
      debug=False)
    conforms, results_graph, results_text = r
    return conforms, results_graph, results_text

In [72]:
from pyshacl.errors import ReportableRuntimeError

def shacl_validation(ttl_folder, shacl_file, validation_folder):
    for ttl_file in ttl_folder.glob('*.ttl'):
        try:
            data_graph = Graph()
            data_graph.parse(ttl_file, format='turtle')
            sg = Graph()
            sg.parse(shacl_file,  format='turtle')

            sh_out = validation_folder + os.path.basename(ttl_file).replace('.ttl','.shacl.ttl')
            txt_out = validation_folder + os.path.basename(ttl_file).replace('.ttl','.txt')

            conforms, results_graph, results_text = do_validate(data_graph, sg)
            results_graph.serialize(sh_out)
            # Append mode: rerunning the cell adds to an existing report file
            with open(txt_out, "a", encoding="utf-8") as f:
                f.write(results_text)
            print(sh_out + ' Validation result: ' + str(conforms))
        except ReportableRuntimeError as err:
            print('Error during shacl validation', err)

In [73]:

ttl_folder = Path(base_path + 'test_output_ttl\\')
shacl_file = base_path + 'shacl\\original\\geodcat-ap-SHACL.ttl'
validation_folder = base_path + 'shacl_results\\'

shacl_validation(ttl_folder, shacl_file, validation_folder)

C:\\niels\\projecten\\ISO-2-DCAT\\geodcat_ap_3_xslt\\shacl_results\4e2ef670-cddd-11dd-ad8b-0800200c9a66.shacl.ttl Validation result: False
C:\\niels\\projecten\\ISO-2-DCAT\\geodcat_ap_3_xslt\\shacl_results\aa3b5e6e-7baa-40c0-8972-3353e927ec2f.shacl.ttl Validation result: False


There is a namespace error in the SHACL file (see https://github.com/SEMICeu/GeoDCAT-AP/issues/145), so the SHACL validation is also run against a fixed SHACL file.


In [74]:
ttl_folder = Path(base_path + 'test_output_ttl\\adapted_xsl')
shacl_file = base_path + 'shacl\\geodcat-ap-SHACL.ttl'
validation_folder = base_path + 'shacl_results\\adapted_xsl_shacl\\'

shacl_validation(ttl_folder, shacl_file, validation_folder)

C:\\niels\\projecten\\ISO-2-DCAT\\geodcat_ap_3_xslt\\shacl_results\adapted_xsl_shacl\4e2ef670-cddd-11dd-ad8b-0800200c9a66.shacl.ttl Validation result: False
C:\\niels\\projecten\\ISO-2-DCAT\\geodcat_ap_3_xslt\\shacl_results\adapted_xsl_shacl\aa3b5e6e-7baa-40c0-8972-3353e927ec2f.shacl.ttl Validation result: False
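Both runs report `Validation result: False`, which by itself does not show what fails. A quick way to get an overview is to count the violations per constraint type in the saved report text. The sketch below assumes that each result in the report starts with a line of the form `Constraint Violation in <Component> (...)`; the helper name `summarize_report` and the sample text are hypothetical and not taken from the actual reports.

```python
import re
from collections import Counter


def summarize_report(results_text: str) -> Counter:
    """Count violations per SHACL constraint component in a text report."""
    pattern = re.compile(r"^Constraint Violation in (\w+)", re.MULTILINE)
    return Counter(pattern.findall(results_text))


# Made-up report fragment in the assumed format, for illustration only
sample_report = """Validation Report
Conforms: False
Results (3):
Constraint Violation in MinCountConstraintComponent (http://www.w3.org/ns/shacl#MinCountConstraintComponent):
\tSeverity: sh:Violation
Constraint Violation in ClassConstraintComponent (http://www.w3.org/ns/shacl#ClassConstraintComponent):
\tSeverity: sh:Violation
Constraint Violation in MinCountConstraintComponent (http://www.w3.org/ns/shacl#MinCountConstraintComponent):
\tSeverity: sh:Violation
"""

for component, count in summarize_report(sample_report).most_common():
    print(component, count)
```

Comparing these counts between the original and the adapted output gives a rough indication of which XSLT changes affect validation.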
