# Preliminaries
- Install the Saxon parser and API library using pip.

In [None]:
!pip install saxonche

- Test the Python library and API. The output should be the API version.

In [None]:
from saxonche import PySaxonProcessor

with PySaxonProcessor(license=False) as proc:
	print(proc.version)

- Transform the test.xml file using the test_one.xsl transformation stylesheet.

In [4]:
from saxonche import PySaxonProcessor

with PySaxonProcessor(license=False) as proc:
    xsltproc = proc.new_xslt30_processor()
    # Read the source document; the with block closes the file automatically
    with open('test.xml', encoding="utf-8") as xmlfile:
        document = proc.parse_xml(xml_text=xmlfile.read())
    # Compile the stylesheet, apply it to the parsed document and print the result
    executable = xsltproc.compile_stylesheet(stylesheet_file="test_one.xsl")
    output = executable.transform_to_string(xdm_node=document)
    print(output)

FileNotFoundError: [Errno 2] No such file or directory: 'cdcs-xml-course/test.xml'

# XML Structures

## The Parser

The parser is a key element in utilising XML documents. Parsers are quite common in programming and have uses in a variety of situations.

A parser, in essence, iterates through a file of symbols or characters, typically with the intention of segmenting them into forms of data.
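To make that concrete, here is a minimal sketch of a parser reporting what it recognises as it moves through a document. It uses Python's built-in `xml.etree.ElementTree.iterparse` rather than Saxon, purely because it needs no installation, and the small `postcard` document is a made-up sample, not the course's `test.xml`.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document (an assumption, not the course's test.xml)
sample = b"<postcard><name>Paddington Bear</name><town>Notting Hill</town></postcard>"

events = []
# iterparse reads the byte stream and reports each element boundary it finds
for event, element in ET.iterparse(io.BytesIO(sample), events=("start", "end")):
    events.append((event, element.tag))
    print(event, element.tag)
```

Each `start` and `end` event marks the point where the parser has recognised an element boundary; the characters in between become that element's data.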

In this case the __[Saxon](https://www.saxonica.com/welcome/welcome.xml)__ parser is being used. Saxon is a powerful commercial parser for processing XML documents. Python has its own libraries for processing XML, for example the standard library's `xml.etree` or the third-party `lxml`, but these typically don't provide all the native XML processing features or, if they do, they're not up to date.

As you will see when using XSLT, understanding parser mechanics is crucial.

To parse an XML document, the minimum requirement is that the document is *well-formed*. Optionally, it may also be required to be *valid* (something that will be examined later). There is a third form of validation, addressed by a standard known as __[Schematron](https://schematron.com/)__. It won't be covered in this course, but it's worth being aware of.
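As a rough illustration of what *well-formed* means in practice, the sketch below asks a parser to read two small documents, one of which has an unclosed element. It uses Python's standard `xml.etree.ElementTree` rather than Saxon, and both the helper name `is_well_formed` and the sample documents are invented for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, otherwise False."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError as error:
        # The parser stops at the first well-formedness error it meets
        print("Not well-formed:", error)
        return False

good = "<postcard><name>Paddington Bear</name></postcard>"
bad = "<postcard><name>Paddington Bear</postcard>"  # <name> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```

Note that this only checks well-formedness. `ElementTree` does not validate a document against a schema or DTD, so validity needs a separate step.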

## The Prologue

The prologue is an essential component of an XML document.

Through the XML declaration, it provides some key information for a processor: that this file is in fact an XML file, what the character encoding of the document is, and what version of XML is in use.

`<?xml version="1.0" encoding="UTF-8"?>
...`

Ostensibly there are two versions of XML: version 1.0 and version 1.1. In practice you will almost never see version 1.1 in the wild.
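To see how a processor picks up the declaration, here is a small sketch using Python's built-in `xml.parsers.expat`, which calls a handler when it reads the XML declaration. The function name `read_xml_declaration` and the one-element sample document are assumptions made for illustration; Saxon reads the same information internally.

```python
import xml.parsers.expat

def read_xml_declaration(xml_bytes):
    """Return the version and encoding stated in the XML declaration."""
    found = {}

    def on_declaration(version, encoding, standalone):
        # expat calls this once, when it parses the <?xml ...?> declaration
        found["version"] = version
        found["encoding"] = encoding

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(xml_bytes, True)
    return found

# Hypothetical sample document
declaration = read_xml_declaration(b'<?xml version="1.0" encoding="UTF-8"?><postcard/>')
print(declaration)
```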


### Encoding

Essentially each character in a document has an underlying numeric code.

| character | Unicode value |
| --- | --- |
| A | U+0041 |
| a | U+0061 |
| B | U+0042 |
| b | U+0062 |

...

I won't go into detail on encoding, but it is an important topic. The encoding attribute in the XML declaration sets the expected encoding of characters within the document. The value 'utf-8' is an abbreviation of 'Unicode Transformation Format, 8-bit'. UTF-8 can represent every Unicode character, so it is suitable for practically all modern text, not only languages written in Latin-derived scripts.
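The relationship between characters, code points and UTF-8 bytes can be explored directly in Python with the built-in `ord` function and `str.encode`. This is only a quick sketch; the sample strings are chosen for illustration.

```python
# Code point of each character, shown in the U+XXXX notation
for character in "AaBb":
    print(character, f"U+{ord(character):04X}")

# The same idea applied to a whole string, as decimal code points
code_points = [ord(character) for character in "Paddington Bear"]
print(code_points)

# UTF-8 uses one byte for ASCII characters and more for others
ascii_bytes = "A".encode("utf-8")
accented_bytes = "\u00e9".encode("utf-8")  # "\u00e9" is e with an acute accent
print(len(ascii_bytes), len(accented_bytes))
```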

The __[Unicode standard](https://home.unicode.org/)__ is currently the main standard in character encoding. But it's worth bearing in mind that Unicode only became a standard in 1991. Full acceptance and support of Unicode by software providers such as Microsoft was patchy until around 2000.

If you're dealing with non-extant languages, non-pervasive languages and dialects, or other specialisms, encoding is something to think carefully about. Another area to consider is digital sources that predate Unicode.
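As a hedged example of the kind of problem older sources can cause, the sketch below takes bytes written in Latin-1 (ISO-8859-1), a common pre-Unicode encoding, and shows that they cannot be read as UTF-8 until they are decoded with the right encoding. The sample bytes are invented; with real sources, the original encoding has to be established first.

```python
# The byte 0xE9 is an accented "e" in Latin-1, but is not valid UTF-8 on its own
legacy_bytes = b"caf\xe9"

print(legacy_bytes.decode("latin-1"))

try:
    legacy_bytes.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError as error:
    decoded_as_utf8 = False
    print("Not valid UTF-8:", error)

# Once decoded with the correct encoding, the text can be re-encoded as UTF-8
utf8_bytes = legacy_bytes.decode("latin-1").encode("utf-8")
print(utf8_bytes)
```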

In [3]:
# character to integer transformation
from saxonche import PySaxonProcessor

with PySaxonProcessor(license=False) as proc:
    xsltproc = proc.new_xslt30_processor()
    # Read the source document; the with block closes the file automatically
    with open('test.xml', encoding="utf-8") as xmlfile:
        document = proc.parse_xml(xml_text=xmlfile.read())
    # Apply the stylesheet that converts each character to its numeric code
    executable = xsltproc.compile_stylesheet(stylesheet_file="tounicode.xsl")
    output = executable.transform_to_string(xdm_node=document)
    print(output)

<!DOCTYPE HTML>
<html>
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
   </head>
   <body>
      
      <hr>
      <h1>Postcard</h1>
      
      
      <div class="content">
         <h2>Contents</h2>I love marmalade!</div>
      
      
      <div class="address">
         <h2>Name and Address</h2>
         <ul>
            
            <li><b>name</b>&nbsp;80,97,100,100,105,110,103,116,111,110,32,66,101,97,114</li>
            
            <li><b>co</b>&nbsp;77,114,115,32,66,114,111,119,110</li>
            
            <li><b>street</b>&nbsp;51,50,32,87,105,110,100,115,111,114,32,71,97,114,100,101,110,115</li>
            
            <li><b>town</b>&nbsp;78,111,116,116,105,110,103,32,72,105,108,108</li>
            
            <li><b>county</b>&nbsp;76,111,110,100,111,110</li>
            
            <li><b>country</b>&nbsp;85,110,105,116,101,100,32,75,105,110,103,100,111,109</li>
            
            <li><b>postcode</b>&nbsp;87,57,32,51,

## Other Prologue Elements

Other elements of the prologue are typically known as *declarative markup*. They are used for a variety of purposes, but the usual purpose is to provide a *schema* or *data definition* for the XML document.

These elements, whilst optional, are intended to provide a controlled vocabulary for constructing and validating XML documents.
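One common piece of declarative markup is the document type declaration (`<!DOCTYPE ...>`), which names the definition a document claims to follow. The sketch below uses Python's built-in `xml.parsers.expat` to report what a DOCTYPE declares. The XHTML document is a made-up sample, and the handler name `on_doctype` is an assumption for this example.

```python
import xml.parsers.expat

# Hypothetical XHTML sample; the DOCTYPE lines are the declarative markup
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Postcard</title></head>
  <body><p>I love marmalade!</p></body>
</html>"""

doctype = {}

def on_doctype(name, system_id, public_id, has_internal_subset):
    # Called by the parser when it reaches the DOCTYPE declaration
    doctype.update(name=name, system_id=system_id, public_id=public_id)

parser = xml.parsers.expat.ParserCreate()
parser.StartDoctypeDeclHandler = on_doctype
parser.Parse(sample, True)
print(doctype)
```

The parser here only reads the declaration. It is non-validating, so it does not check the document against the DTD it names.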

In [None]:
# Sample declarative markup using XHTML, an XML-derived variant of HTML.

## XML Document Body

An XML document is made up of distinct data types.