# Preliminaries
- Install the Saxon parser and API library using pip.

In [1]:
!pip install saxonche






- Import the Python library

In [2]:
from saxonche import *

- Test the Python library and API. The result should be the API version.

In [3]:
from saxonche import PySaxonProcessor

with PySaxonProcessor(license=False) as proc:
	print(proc.version)

SaxonC-HE 12.5 from Saxonica


- Transform the postcard.xml file using the test_one.xsl transformation stylesheet.

In [4]:
from saxonche import *

with PySaxonProcessor(license=False) as proc:
    xsltproc = proc.new_xslt30_processor()
    # Read the source XML; the with block closes the file automatically.
    with open('postcard.xml', encoding="utf-8") as xmlfile:
        document = proc.parse_xml(xml_text=xmlfile.read())
    # Compile the stylesheet, apply it to the parsed document and print the result.
    executable = xsltproc.compile_stylesheet(stylesheet_file="test_one.xsl")
    output = executable.transform_to_string(xdm_node=document)
    print(output)

<!DOCTYPE HTML>
<html>
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
   </head>
   <body>
      
      <hr>
      <h1>Postcard</h1>
      
      
      <div class="content">
         <h2>Contents</h2>I love marmalade!</div>
      
      
      <div class="address">
         <h2>Name and Address</h2>
         <ul>
            
            <li><b>name</b>&nbsp;Paddington Bear</li>
            
            <li><b>co</b>&nbsp;Mrs Brown</li>
            
            <li><b>street</b>&nbsp;32 Windsor Gardens</li>
            
            <li><b>town</b>&nbsp;Notting Hill</li>
            
            <li><b>county</b>&nbsp;London</li>
            
            <li><b>country</b>&nbsp;United Kingdom</li>
            
            <li><b>postcode</b>&nbsp;W9 3RG</li>
            </ul>
      </div>
      
      <hr>
      </body>
</html>


## Exercise 1 (10 minutes approx)

For the first exercise, we need to extend the data to reflect the postcard (postcard.xml) more accurately.
- First, think about ways to enhance the postcard contents.
- In the copy of postcard.xml (postcard-exercise.xml), add new elements to the XML data to reflect the enhancements.

Test your changes with the code block below. If the XML isn't well-formed, the parser will fail to process it and you will see errors. If it parses successfully, the file contents are printed.

Remember that the tags must fully enclose the content you want to capture, i.e. `<example>text</example>`.
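To get a feel for what "well-formed" means before editing the file, here is a minimal sketch. It assumes Python's built-in `xml.etree.ElementTree` parser rather than Saxon, so that it runs without any of the course files. The helper name `is_well_formed` and both sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses, otherwise (False, error message)."""
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as err:
        return False, str(err)


# Every opening tag has a matching closing tag.
good = "<root><postcard><content>I love marmalade!</content></postcard></root>"
# Hypothetical mistake: <content> is opened but never closed.
bad = "<root><postcard><content>I love marmalade!</postcard></root>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

The second call reports a parse error because the parser reaches `</postcard>` while `<content>` is still open. Saxon rejects the same kind of mistake, although its error messages are worded differently.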

In [5]:
from saxonche import *

with PySaxonProcessor(license=False) as proc:
    # Parsing fails with an error if the XML is not well-formed.
    with open('postcard-exercise.xml', encoding="utf-8") as xmlfile:
        document = proc.parse_xml(xml_text=xmlfile.read())
    print(document)

<!--- root element **REQUIRED** --><root>
   <postcard>
        <!-- postcard text -->
      <content>I love marmalade!</content>
      <!-- address details -->
      <address>
         <name>Paddington Bear</name>
         <co>Mrs Brown</co>
         <street>32 Windsor Gardens</street>
         <town>Notting Hill</town>
         <county>London</county>
         <country>United Kingdom</country>
         <postcode>W9 3RG</postcode>
      </address>
   </postcard>
</root>


## Exercise 2 (5 minutes approx)

Now try using attributes to enhance the elements you have added to postcard-exercise.xml. If you need some assistance, take a look at the postcard-enhanced.xml file for some hints about how to use attributes.

Once you have added the new attributes to postcard-exercise.xml, run the code snippet below to check for errors.
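As a reminder of how attributes sit on an element, here is a small sketch. It assumes Python's built-in `xml.etree.ElementTree` rather than Saxon, and the sample element is modelled on the `stamp` element from postcard-enhanced.xml.

```python
import xml.etree.ElementTree as ET

# Attributes live inside the opening tag as name="value" pairs.
stamp = ET.fromstring(
    '<stamp image="firstclass.jpg" origin="UK">first class stamp - United Kingdom</stamp>'
)

print(stamp.tag)            # the element name
print(stamp.attrib)         # all attributes as a dictionary
print(stamp.get("origin"))  # a single attribute value
print(stamp.text)           # the element content, separate from its attributes
```

The point to notice is that attributes describe the element, while the text between the tags remains its content.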


In [None]:
from saxonche import *

with PySaxonProcessor(license=False) as proc:
    # Parsing fails with an error if an attribute is malformed (e.g. a missing quote).
    with open('postcard-exercise.xml', encoding="utf-8") as xmlfile:
        document = proc.parse_xml(xml_text=xmlfile.read())
    print(document)

## Sample XSLT

Here's a quick sample of some XSLT in action processing postcard-enhanced.xml into HTML. Apologies, but the images won't render in the HTML (that's a Jupyter issue).

In [12]:
from saxonche import *

with PySaxonProcessor(license=False) as proc:
    xsltproc = proc.new_xslt30_processor()
    # Read the source XML; the with block closes the file automatically.
    with open('postcard-enhanced.xml', encoding="utf-8") as xmlfile:
        document = proc.parse_xml(xml_text=xmlfile.read())
    # Compile the stylesheet, apply it to the parsed document and print the result.
    executable = xsltproc.compile_stylesheet(stylesheet_file="test-enhanced.xsl")
    output = executable.transform_to_string(xdm_node=document)
    print(output)

<!DOCTYPE HTML>
<html>
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
   </head>
   <body>
      
      <hr>
      <h1>Postcard</h1>
      <img class="front_plate" src="paddinton.jpg">
      <img class="stamp" src="firstclass.jpg">
      
      
      <div class="content">
         <h2>Contents</h2>I love marmalade!</div>
      
      
      <div class="address">
         <h2>Name and Address</h2>
         <ul>
            
            <li><b>name</b>&nbsp;Paddington Bear</li>
            
            <li><b>co</b>&nbsp;Mrs Brown</li>
            
            <li><b>street</b>&nbsp;32 Windsor Gardens</li>
            
            <li><b>town</b>&nbsp;Notting Hill</li>
            
            <li><b>county</b>&nbsp;London</li>
            
            <li><b>country</b>&nbsp;United Kingdom</li>
            
            <li><b>postcode</b>&nbsp;W9 3RG</li>
            </ul>
      </div>
      
      <hr>
      </body>
</html>


## Exercise 3 (10 minutes approx - optional)

Data definitions are the mechanism for controlling the use of elements and attributes within an XML document. Ideally, this prevents the document contents from becoming inconsistent. It's also the mechanism used to create standards such as MODS/METS, HTML and EPUB, amongst others.

There is no ideal way to create a data definition within the notebook environment, and hand-coding one is a tricky business. However, if you are feeling bold, you can upload your postcard-exercise.xml file to this site, [xsd generator](https://www.freeformatter.com/xsd-generator.html), and create a schema file for it.

To apply the data definition, create an XSD file (example.xsd, for example) in your working directory and paste the generated schema into it. Then amend the root element to reference the schema:

`<root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="example.xsd">`

Once amended, check that the schema is functioning by running postcard-exercise.xml through the parser using the snippet below. You could also try adding some unconstrained elements to see the result.

In [None]:
from saxonche import *

with PySaxonProcessor(license=False) as proc:
    # The with block closes the file automatically once it has been read.
    with open('postcard-exercise.xml', encoding="utf-8") as xmlfile:
        document = proc.parse_xml(xml_text=xmlfile.read())
    print(document)

## Exercise 3 (alternative)

As an alternative, I've provided a sample xHTML file which has an existing data definition. Try adding some unconstrained elements to xhtml-sample.xml and the parser will complain.

In [16]:
from saxonche import *

with PySaxonProcessor(license=False) as proc:
    # The with block closes the file automatically once it has been read.
    with open('xhtml-sample.xml', encoding="utf-8") as xmlfile:
        document = proc.parse_xml(xml_text=xmlfile.read())
    print(document)

<!-- 
    Using the older DTD definition to provide entities for the document.
    In this case the nbsp entity (non-breaking space).
--><!-- declarative markup, in this case using the root element of xhtml --><html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.w3.org/1999/xhtml https://www.w3.org/MarkUp/SCHEMA/xhtml11.xsd">
   <head>
      <title>Example of xHTML using the xHTML schema</title>
   </head>
   <body>
      <h1>Example of Declarative Markup</h1>
      <p>
            This is an example of using a controlled vocabulary to create a datafile of a particular <b>
            type</b>.
        </p>
      <p>
            In this case the schema used is the version 1.1 xHTML schema. There are some notable differences 
            between xHTML and HTML. And although HTML (a form of SGML) is still by far more widespread, xHTML
            is still used quite widely. The <b>EPUB standard</b>, fo

# XPath and XSLT

Welcome to the section on XPath and XSLT. I've attempted to structure the course around both at the same time, as an understanding of XPath is central to XSLT.

To start with, however, there are some sample XPath expressions below. I would advise ignoring most of the Python code.

## Example XPath

Get the addressee of the postcard from postcard.xml. The XPath is in the code block, but for clarity it is `/root/postcard/address/name`.

In [19]:
from saxonche import *

with PySaxonProcessor(license=False) as proc:
    with open('postcard.xml', encoding="utf-8") as xmlfile:
        document = proc.parse_xml(xml_text=xmlfile.read())
    xpathproc = proc.new_xpath_processor()
    # Evaluate the expression against the parsed document.
    xpathproc.set_context(xdm_item=document)
    # here's the xpath call
    result = xpathproc.evaluate(xpath_str='/root/postcard/address/name')
    print(result)

<name>Paddington Bear</name>


In the example above the XPath expression is `/root/postcard/address/name`. This tells the processor to locate the name element by following the full path from the top of the document, using the element names to identify each step of the route. This is an example of an *absolute path*.

There are a few valid alternatives. For example:

In [20]:
with PySaxonProcessor(license=False) as proc:
    with open('postcard.xml', encoding="utf-8") as xmlfile:
        document = proc.parse_xml(xml_text=xmlfile.read())
    xpathproc = proc.new_xpath_processor()
    # Evaluate the expression against the parsed document.
    xpathproc.set_context(xdm_item=document)
    # here's the xpath call
    result = xpathproc.evaluate(xpath_str='//name')
    print(result)

<name>Paddington Bear</name>


XPath has a variety of ways to select individual data elements. In the example above, `//name` tells the XPath processor to find every element called 'name', wherever it appears in the tree. These navigational elements are known as **'Selectors'**.
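The difference between a step-by-step path and a "search anywhere" selector can also be sketched without the course files. This example assumes Python's built-in `xml.etree.ElementTree`, which supports only a small subset of XPath and always starts from the element you call it on, so the expressions differ slightly from the Saxon ones. The inline document is a cut-down stand-in for postcard.xml.

```python
import xml.etree.ElementTree as ET

POSTCARD = """<root>
   <postcard>
      <content>I love marmalade!</content>
      <address>
         <name>Paddington Bear</name>
         <co>Mrs Brown</co>
      </address>
   </postcard>
</root>"""

root = ET.fromstring(POSTCARD)  # this is the <root> element itself

# Step-by-step path. ElementTree starts at <root>, so the Saxon expression
# /root/postcard/address/name becomes postcard/address/name here.
by_path = root.find("postcard/address/name")

# Search anywhere below <root>: the counterpart of //name.
anywhere = root.findall(".//name")

print(by_path.text)
print([element.text for element in anywhere])
```

Both approaches reach the same element in this document. The full path is stricter: it only matches a name element in exactly that position, whereas the search form would also pick up a name element added elsewhere in the tree.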


In the example below, using postcard-enhanced.xml, I'm going to extract the 'image' attribute from the front plate element.

In [31]:
with PySaxonProcessor(license=False) as proc:
    with open('postcard-enhanced.xml', encoding="utf-8") as xmlfile:
        document = proc.parse_xml(xml_text=xmlfile.read())
    xpathproc = proc.new_xpath_processor()
    # Evaluate the expression against the parsed document.
    xpathproc.set_context(xdm_item=document)
    # here's the xpath call
    result = xpathproc.evaluate(xpath_str='//front/@image')
    print(result)

image="paddington.jpg"


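The '@' character lets the expression find and process the attributes of elements. It is shorthand for the attribute axis. An **'axis'** (plural **'axes'**) is the direction in which the processor moves from the current node.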

There are a variety of different axes available to the XPath processor. These include:

- ancestor
- ancestor-or-self
- attribute (@)
- child
- descendant
- descendant-or-self
- following
- following-sibling
- parent
- preceding
- preceding-sibling
- self

However, for now I would recommend using the basic selectors.
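For the curious, the basic selectors already use some of these axes in abbreviated form: a plain step such as `address/name` moves along the child axis, `..` is short for the parent axis, and `//` relies on descendant-or-self. The sketch below assumes Python's built-in `xml.etree.ElementTree`, which understands these abbreviations but not the full axis names, and uses a cut-down stand-in for postcard.xml.

```python
import xml.etree.ElementTree as ET

POSTCARD = (
    "<root><postcard><address>"
    "<name>Paddington Bear</name><co>Mrs Brown</co>"
    "</address></postcard></root>"
)
root = ET.fromstring(POSTCARD)

# child axis: the elements directly inside <address>
children = [element.tag for element in root.find("postcard/address")]

# parent axis: '..' steps from <name> up to the element that contains it
parent = root.find(".//name/..")

# descendant-or-self: <root> itself plus everything beneath it
descendants = [element.tag for element in root.iter()]

print(children)
print(parent.tag)
print(descendants)
```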

### Other selectors

There are other selector types. For example a wildcard can be used in an XPath expression.

In [32]:
with PySaxonProcessor(license=False) as proc:
    with open('postcard-enhanced.xml', encoding="utf-8") as xmlfile:
        document = proc.parse_xml(xml_text=xmlfile.read())
    xpathproc = proc.new_xpath_processor()
    # Evaluate the expression against the parsed document.
    xpathproc.set_context(xdm_item=document)
    # here's the xpath call
    result = xpathproc.evaluate(xpath_str='//*/@image')
    print(result)

image="paddington.jpg"
image="firstclass.jpg"


The example above uses the wildcard '*' to find both image attributes in the document. It's also possible to find particular content using XPath, as in the example below:

In [34]:
with PySaxonProcessor(license=False) as proc:
    with open('postcard-enhanced.xml', encoding="utf-8") as xmlfile:
        document = proc.parse_xml(xml_text=xmlfile.read())
    xpathproc = proc.new_xpath_processor()
    # Evaluate the expression against the parsed document.
    xpathproc.set_context(xdm_item=document)
    # here's the xpath call
    result = xpathproc.evaluate(xpath_str='//*[@image="firstclass.jpg"]')
    print(result)

<stamp image="firstclass.jpg" origin="UK">first class stamp - United Kingdom</stamp>


In this particular example the XPath selects the element that carries the matching attribute, not the attribute itself. That's the role of the enclosing square brackets '[]', known as a predicate. They tell the processor to test each candidate element against the condition inside the brackets and to return only the elements that satisfy it.
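The wildcard and the square-bracket test can be combined, as this small sketch shows. It assumes Python's built-in `xml.etree.ElementTree` rather than Saxon, and the inline document is a hypothetical stand-in for postcard-enhanced.xml: only the `image` and `origin` values are taken from the outputs above, the rest of the structure is a guess.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for postcard-enhanced.xml.
ENHANCED = """<root>
   <postcard>
      <front image="paddington.jpg"/>
      <stamp image="firstclass.jpg" origin="UK">first class stamp - United Kingdom</stamp>
      <content>I love marmalade!</content>
   </postcard>
</root>"""

root = ET.fromstring(ENHANCED)

# Any element, whatever its name, that has an image attribute.
with_image = root.findall(".//*[@image]")
images = [element.get("image") for element in with_image]

# Any element whose image attribute has one particular value.
first_class = root.findall(".//*[@image='firstclass.jpg']")

print(images)
print([element.tag for element in first_class])
```

In both cases the result is a list of elements, not attributes, which is why the attribute value has to be read from each element afterwards.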

## Introduction to XSLT

XSLT is another W3C standard, this one for transforming XML. The output is typically another markup format, but XSLT can potentially be used to generate arbitrary output. It is not uncommon, for example, to process XML into a PDF or another binary format. It's also worth noting that XSLT is itself a form of XML.

The underlying principle of XSLT is that it aligns with the function of the parser (I alluded to it earlier). The parser will process the entire document. XSLT templates instruct the processor to perform certain actions when it reaches an element that matches a template instruction. The matching value is naturally provided to the processor using an XPath expression.

There's a simple XSL file provided that processes 'postcard.xml' into HTML (as per the preamble). Open it and take a look at the contents.

In [35]:
with PySaxonProcessor(license=False) as proc:
    # A stylesheet is itself XML, so it can be parsed and printed like any other document.
    with open('test_one.xsl', encoding="utf-8") as xmlfile:
        document = proc.parse_xml(xml_text=xmlfile.read())
    print(document)

<xsl:stylesheet xmlns:math="http://www.w3.org/2005/xpath-functions/math"
                xmlns:xd="http://www.oxygenxml.com/ns/doc/xsl"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                exclude-result-prefixes="xs math xd"
                version="3.0">
   <xsl:output encoding="utf-8" method="html" omit-xml-declaration="yes"/>
   <xsl:template match="/">
      <xsl:apply-templates/>
   </xsl:template>
   <xsl:template match="root">
      <html>
         <head>
                <!-- place head info here -->
            </head>
         <body>
            <xsl:apply-templates/>
         </body>
      </html>
   </xsl:template>
   <xsl:template match="postcard">
      <hr/>
      <h1>Postcard</h1>
      <xsl:apply-templates/>
      <hr/>
   </xsl:template>
   <xsl:template match="content">
      <div class="content">
         <h2>Contents</h2>
         <xsl:apply-templates/>
      </div>
   </xsl:tem

The key elements of the XSL instructions are the `<xsl:template match="...">` tags and `<xsl:apply-templates/>`.

The former tells the processor to apply the template instruction when the matching tag is reached. The 'apply-templates' instruction is used to ensure that if an element contains sub-elements with their own template, they will also be processed.

These are arguably the two main instructions to understand when considering an XSL transformation.
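If it helps to see that idea in plain Python, here is a toy analogy. It is not XSLT and not how Saxon is implemented: it simply assumes a dictionary of functions standing in for templates and a helper named `apply_templates` (both made up for illustration), mirroring the root, postcard and content templates in test_one.xsl.

```python
import xml.etree.ElementTree as ET

POSTCARD = "<root><postcard><content>I love marmalade!</content></postcard></root>"


def apply_templates(element, templates):
    """Process the children of an element in order, like <xsl:apply-templates/>."""
    parts = [element.text or ""]
    for child in element:
        parts.append(process(child, templates))
        parts.append(child.tail or "")
    return "".join(parts)


def process(element, templates):
    """Run the template registered for this tag, or fall through to the children."""
    template = templates.get(element.tag)
    if template is None:
        return apply_templates(element, templates)
    return template(element, templates)


# Each entry plays the role of one <xsl:template match="...">.
TEMPLATES = {
    "root": lambda el, t: "<html><body>" + apply_templates(el, t) + "</body></html>",
    "postcard": lambda el, t: "<hr/><h1>Postcard</h1>" + apply_templates(el, t) + "<hr/>",
    "content": lambda el, t: '<div class="content"><h2>Contents</h2>'
    + apply_templates(el, t)
    + "</div>",
}

html = process(ET.fromstring(POSTCARD), TEMPLATES)
print(html)
```

Each template wraps its own output around whatever its children produce, which is why leaving out the call to `apply_templates` in one of the functions would silently drop everything nested inside that element.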

The next instruction to consider is `<xsl:value-of select="..."/>`. This outputs the value of whatever the 'select' attribute identifies, which, unsurprisingly, is typically an XPath expression. In some of the examples you may see a '.' character in the select attribute. This is shorthand for the current element.

    <xsl:template match="//name">
       <p><xsl:value-of select="."/></p>
    </xsl:template>

So in the example above, the current node is the name element, and '.' evaluates to the contents of that element.

## Exercise 4 (5 minutes approx)

The snippet below executes 'postcard-exercise.xsl', which outputs only the name in the address block from postcard.xml.

In [37]:
with PySaxonProcessor(license=False) as proc:
    xsltproc = proc.new_xslt30_processor()
    # Read the source XML; the with block closes the file automatically.
    with open('postcard.xml', encoding="utf-8") as xmlfile:
        document = proc.parse_xml(xml_text=xmlfile.read())
    # Compile the stylesheet, apply it to the parsed document and print the result.
    executable = xsltproc.compile_stylesheet(stylesheet_file="postcard-exercise4.xsl")
    output = executable.transform_to_string(xdm_node=document)
    print(output)

PySaxonApiError: Unable to resolve <postcard-exercise.xsl> into a Source. Line number: -1