## XSLT transformation from ISO XML to GeoDCAT RDF/XML

This notebook collects some convenience steps for transforming ISO XML metadata into GeoDCAT RDF/XML.

---

Because the stylesheet is written in XSLT 2.0, 'standard' Python libraries like lxml/etree cannot be used: they only support XSLT 1.0.

So the initial steps call the Saxon-HE 12.5 Java processor.

Subsequent steps use rdflib and pyshacl to work with the rdf data.
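The version constraint can be checked up front by reading the `version` attribute on the stylesheet root. This is only a minimal sketch using the standard library; the helper name `stylesheet_version` and the inline sample stylesheet are made up for illustration.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"


def stylesheet_version(xslt_source):
    """Return the version attribute of the xsl:stylesheet (or xsl:transform) root."""
    root = ET.fromstring(xslt_source)
    if root.tag not in (f"{{{XSL_NS}}}stylesheet", f"{{{XSL_NS}}}transform"):
        raise ValueError("not an XSLT stylesheet")
    return root.get("version")


# Hypothetical one-line stylesheet, standing in for the real .xsl file.
sample = '<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"/>'
print(stylesheet_version(sample))
```

For a file on disk, passing `Path(xslt_file).read_bytes()` should work as well, since `ET.fromstring` accepts bytes. Anything other than `1.0` suggests that an XSLT 1.0 processor such as lxml is not enough.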




1) Install Java.
2) Download Saxon-HE (community version).
3) Configure a 'task' in VSCode to run the transformation using the DeltaXML XSLT/XPath extension.

Or run it from the command line (shown on several lines for readability; enter it as one command):
```
    java.exe 
    -cp <path>\saxon-he-10.8.jar net.sf.saxon.Transform 
    -xsl:\xslt\iso-19139-to-dcat-ap.xsl 
    -s:\test_input\input.xml 
    -o:\test_output\result.xml 
    --allowSyntaxExtensions:off
```
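The same command line could also be driven from Python with `subprocess`. The sketch below only mirrors the options shown above; the function names, the `java` executable on the PATH and the jar location passed in are assumptions, not part of the original setup.

```python
import subprocess


def build_saxon_command(saxon_jar, xslt_file, input_xml, output_file, java="java"):
    """Return the argument list for one Saxon-HE command-line transformation."""
    return [
        java,
        "-cp", str(saxon_jar),
        "net.sf.saxon.Transform",
        f"-xsl:{xslt_file}",
        f"-s:{input_xml}",
        f"-o:{output_file}",
        "--allowSyntaxExtensions:off",
    ]


def run_saxon(saxon_jar, xslt_file, input_xml, output_file):
    """Run the transformation; raises CalledProcessError if Saxon exits non-zero."""
    command = build_saxon_command(saxon_jar, xslt_file, input_xml, output_file)
    subprocess.run(command, check=True)


# Hypothetical paths, only to show the resulting argument list.
print(build_saxon_command("saxon-he.jar", "xslt/iso-19139-to-dcat-ap.xsl",
                          "test_input/input.xml", "test_output/result.xml"))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.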

#### XSLT transformation with SaxonC python API

There is a Python API for SaxonC, available from PyPI as `saxonche`:

```
pip install saxonche
```
See https://pypi.org/project/saxonche/
and https://www.saxonica.com/saxon-c/documentation12/index.html#!samples/samples_python



In [4]:
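import os
from pathlib import Path

from dotenv import load_dotenv
# Import only the names that are used instead of 'from saxonche import *'
from saxonche import PySaxonApiError, PySaxonProcessor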

load_dotenv()  # take environment variables from .env.

def transform(input_xml, xslt_file, output_file):
    with PySaxonProcessor(license=False) as proc:

        # print(proc.version)
        try:
            xsltproc = proc.new_xslt30_processor()
            builder = proc.new_document_builder()
            document = builder.parse_xml(xml_file_name=Path(input_xml).as_uri())
            executable = xsltproc.compile_stylesheet(stylesheet_file=xslt_file)

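            output = executable.transform_to_string(xdm_node=document)
            # The 'with' block closes the file even if write() fails.
            # utf-8 assumes the stylesheet serializes as UTF-8 (Saxon's default).
            with open(output_file, "w", encoding="utf-8") as f:
                f.write(output)

        except PySaxonApiError as err:
            print('Error during function call', err)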

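# builder.parse_xml(xml_file_name=Path(input_xml).as_uri()) expects a fully qualified pathname
# (Path.as_uri() raises a ValueError for relative paths).
# So put that in an .env variable (to keep the code stable across environments).
# INPUT_PATH must end with a path separator, because the file names are appended to it directly.
input_path = os.environ['INPUT_PATH']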

# Transform both test files with the original XSL and with the adapted XSL to analyze any changes

# rce file transform with original xsl
transform(input_xml=input_path+'4e2ef670-cddd-11dd-ad8b-0800200c9a66.xml',xslt_file='xslt/original/iso-19139-to-dcat-ap.xsl',output_file="./test_output_xml/4e2ef670-rce-orig.xml")

# rce file transform with adapted xsl
transform(input_xml=input_path+'4e2ef670-cddd-11dd-ad8b-0800200c9a66.xml',xslt_file='xslt/iso-19139-to-dcat-ap.xsl',output_file="./test_output_xml/4e2ef670-rce-new.xml")

# bag file transform with original xsl
transform(input_xml=input_path+'aa3b5e6e-7baa-40c0-8972-3353e927ec2f.xml',xslt_file='xslt/original/iso-19139-to-dcat-ap.xsl',output_file="./test_output_xml/aa3b5e6e-bag-orig.xml")

# bag file transform with adapted xsl
transform(input_xml=input_path+'aa3b5e6e-7baa-40c0-8972-3353e927ec2f.xml',xslt_file='xslt/iso-19139-to-dcat-ap.xsl',output_file="./test_output_xml/aa3b5e6e-bag-new.xml")
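The four calls above follow one pattern: every test record is transformed with every stylesheet. A possible way to generate the argument tuples is sketched below; the names `RECORDS`, `STYLESHEETS` and `build_jobs` are invented for this example, and the paths simply mirror the ones used in the cell above.

```python
from pathlib import Path

# Labels and paths copied from the transform() calls in this notebook.
STYLESHEETS = {
    "orig": "xslt/original/iso-19139-to-dcat-ap.xsl",
    "new": "xslt/iso-19139-to-dcat-ap.xsl",
}
RECORDS = {
    "4e2ef670-rce": "4e2ef670-cddd-11dd-ad8b-0800200c9a66.xml",
    "aa3b5e6e-bag": "aa3b5e6e-7baa-40c0-8972-3353e927ec2f.xml",
}


def build_jobs(input_dir, output_dir="test_output_xml"):
    """Return one (input_xml, xslt_file, output_file) tuple per record and stylesheet."""
    jobs = []
    for label, file_name in RECORDS.items():
        for variant, xslt_file in STYLESHEETS.items():
            jobs.append((
                Path(input_dir) / file_name,  # joins correctly with or without a trailing separator
                xslt_file,
                Path(output_dir) / f"{label}-{variant}.xml",
            ))
    return jobs


# Intended use with the transform() function defined earlier (not executed here):
# for input_xml, xslt_file, output_file in build_jobs(os.environ['INPUT_PATH']):
#     transform(input_xml, xslt_file, output_file)
```

Adding a third test record or stylesheet variant then only needs one extra dictionary entry.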


### Transform from RDF/XML into Turtle (ttl)

... because I like reading Turtle more than reading XML ...

It also checks whether the RDF/XML is actually valid RDF that rdflib can parse ...

In [5]:
from rdflib import Graph

graph = Graph()
graph.parse(r'test_output_xml/4e2ef670-rce-orig.xml')
graph.serialize('test_output_ttl/4e2ef670-rce-orig.ttl', format='turtle')

graph2 = Graph()
graph2.parse(r'test_output_xml/4e2ef670-rce-new.xml')
graph2.serialize('test_output_ttl/4e2ef670-rce-new.ttl', format='turtle')

graph3 = Graph()
graph3.parse(r'test_output_xml/aa3b5e6e-bag-orig.xml')
graph3.serialize('test_output_ttl/aa3b5e6e-bag-orig.ttl', format='turtle')

graph4 = Graph()
graph4.parse(r'test_output_xml/aa3b5e6e-bag-new.xml')
graph4.serialize('test_output_ttl/aa3b5e6e-bag-new.ttl', format='turtle')

<Graph identifier=Nb374afbf294d476c8de42665439729d8 (<class 'rdflib.graph.Graph'>)>

### SHACL Validate

In [24]:
from pyshacl import validate

def do_validate(data_graph, sg):
    """Validate data_graph against the SHACL shapes graph sg, using RDFS inference."""
    # validate() returns a (conforms, results_graph, results_text) tuple
    return validate(data_graph,
      shacl_graph=sg,
      ont_graph=None,
      inference='rdfs',
      abort_on_first=False,
      allow_infos=False,
      allow_warnings=False,
      meta_shacl=False,
      advanced=False,
      js=False,
      debug=False)

In [25]:
from rdflib import Graph

data_graph = Graph()
data_graph.parse(r'test_output_ttl/4e2ef670-rce-orig.ttl', format='turtle')
# data_graph.parse(r'xsl-out/result-gn-template.ttl', format='turtle')
sg = Graph()
sg.parse(r'shacl/geodcat-ap-SHACL.ttl',  format='turtle')

conforms1, results_graph1, results_text1 = do_validate(data_graph, sg)
results_graph1.serialize('shacl_results/result-4e2ef670-rce-orig.ttl')
# mode "a" appends, so the reports of earlier runs stay in the file
with open("shacl_results/results_rce_orig.txt", "a") as f:
    f.write(results_text1)

In [26]:
data_graph2 = Graph()
data_graph2.parse(r'test_output_ttl/4e2ef670-rce-new.ttl', format='turtle')
# data_graph.parse(r'xsl-out/result-gn-template.ttl', format='turtle')
sg2 = Graph()
sg2.parse(r'shacl/geodcat-ap-SHACL.ttl',  format='turtle')

conforms2, results_graph2, results_text2 = do_validate(data_graph2, sg2)
results_graph2.serialize('shacl_results/result-4e2ef670-rce-new.ttl')
with open("shacl_results/results_rce_new.txt", "a") as f:
    f.write(results_text2)

In [27]:
data_graph3 = Graph()
data_graph3.parse(r'test_output_ttl/aa3b5e6e-bag-orig.ttl', format='turtle')
# data_graph.parse(r'xsl-out/result-gn-template.ttl', format='turtle')
sg3 = Graph()
sg3.parse(r'shacl/geodcat-ap-SHACL.ttl',  format='turtle')

conforms3, results_graph3, results_text3 = do_validate(data_graph3, sg3)
results_graph3.serialize('shacl_results/result-aa3b5e6e-bag-orig.ttl')
with open("shacl_results/results_bag_orig.txt", "a") as f:
    f.write(results_text3)

In [28]:
data_graph4 = Graph()
data_graph4.parse(r'test_output_ttl/aa3b5e6e-bag-new.ttl', format='turtle')
# data_graph.parse(r'xsl-out/result-gn-template.ttl', format='turtle')
sg4 = Graph()
sg4.parse(r'shacl/geodcat-ap-SHACL.ttl',  format='turtle')

conforms4, results_graph4, results_text4 = do_validate(data_graph4, sg4)
results_graph4.serialize('shacl_results/result-aa3b5e6e-bag-new.ttl')
with open("shacl_results/results_bag_new.txt", "a") as f:
    f.write(results_text4)

In [None]:
print(results_text2)