# Document structure

People keep inventing new ways to structure information. One widely used approach is XML (eXtensible Markup Language), a markup format in which a document is a tree of nested, named elements. Companion languages (document type definitions and schemas) let us specify which elements a document may contain and how those elements relate to one another.

If at any point you feel like you need details on XML that go beyond what is provided on this Colab sheet, [there is a neat w3schools tutorial on XML](https://www.w3schools.com/xml/xml_whatis.asp) you can consult.

In [4]:
!pip install xmltodict

Collecting xmltodict
  Downloading xmltodict-0.13.0-py2.py3-none-any.whl (10.0 kB)
Installing collected packages: xmltodict
Successfully installed xmltodict-0.13.0


In [29]:
xmlurl = 'https://raw.githubusercontent.com/SCS-Technology-and-Innovation/DSLP/main/data/doc.xml'

import urllib.request # an old friend from M1

# file contents (downloaded as bytes, decoded only for printing)
inputfile = urllib.request.urlopen(xmlurl).read()
print(inputfile.decode())

<?xml version="1.0" encoding="UTF-8"?>
<document>
  <author>
    <name>Elisa</name>
    <org>McGill</org>
  </author>
  <title>Example XML file</title>
  <date>
    <created>
      <month>May</month>
      <day>7</day>
      <year>2024</year>
    </created>
    <updated>
      <month>June</month>
      <day>15</day>
      <year>2024</year>
    </updated>
  </date>
</document>

In [8]:
from xmltodict import parse # one of many libraries to parse an XML document

In [11]:
doc = parse(inputfile) # inputfile must already hold the downloaded XML
doc

{'document': {'author': {'name': 'Elisa', 'org': 'McGill'},
  'title': 'Example XML file',
  'date': {'created': {'month': 'May', 'day': '7', 'year': '2024'},
   'updated': {'month': 'June', 'day': '15', 'year': '2024'}}}}



Having *parsed* the file, we can access its elements by name, like the entries of a nested dictionary:

In [16]:
doc['document']['title']

'Example XML file'

In [17]:
doc['document']['date']['created']['year']

'2024'
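To make the nested-dictionary idea concrete, here is a minimal sketch that builds a similar structure with the standard library's `xml.etree.ElementTree` instead of `xmltodict`. The helper `to_dict` is a hypothetical name introduced for illustration, and it assumes that no element has two children with the same tag (true for this document, not for XML in general). It works on an inline copy of the document rather than the downloaded file.

```python
import xml.etree.ElementTree as ET

# Inline copy of the example document, so no download is needed
XML_TEXT = """<document>
  <author><name>Elisa</name><org>McGill</org></author>
  <title>Example XML file</title>
  <date>
    <created><month>May</month><day>7</day><year>2024</year></created>
    <updated><month>June</month><day>15</day><year>2024</year></updated>
  </date>
</document>"""

def to_dict(element):
    """Hypothetical helper: leaves become strings, other elements become dicts."""
    children = list(element)
    if not children:
        return element.text
    # Assumption: sibling tags are unique, otherwise later ones would overwrite earlier ones
    return {child.tag: to_dict(child) for child in children}

root = ET.fromstring(XML_TEXT)
doc = {root.tag: to_dict(root)}

created = doc['document']['date']['created']
year = int(created['year'])  # every value is a string until we convert it
print(created['month'], created['day'], year)
```

Note that the parser has no way of knowing that `2024` is meant as a number, which is why the conversion with `int` is explicit.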

How can we establish (optional) rules on what the document should contain? There are two competing options: document type definitions (DTD) and XML schemas (XSD). Let's examine them in that order.

In [22]:
dtdurl = 'https://raw.githubusercontent.com/SCS-Technology-and-Innovation/DSLP/main/data/doc.dtd'
dtdcontent = urllib.request.urlopen(dtdurl).read().decode() # keep inputfile (the XML) untouched
print(dtdcontent)

<!DOCTYPE document
[
<!ELEMENT document (author,title,date,)>
<!ELEMENT author (name,org)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT date (created,updated)>
<!ELEMENT created (month,day,year)>
<!ELEMENT updated (month,day,year)>
<!ELEMENT month (#PCDATA)>
<!ELEMENT day (#PCDATA)>
<!ELEMENT year (#PCDATA)>
]>



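For the syntax of a DTD file, see for example [this geeksforgeeks tutorial](https://www.geeksforgeeks.org/dtd-syntax/). Two details in the listing above would trip up a validating parser: the content model of `document` ends with a stray comma (it should read `(author,title,date)`), and `name` and `org` are used inside `author` without being declared.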
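To see what the sequence rules in a DTD actually demand, the following sketch spells them out by hand with the standard library. It is not a DTD parser: `RULES` and `conforms` are hypothetical names for illustration, only fixed sequences such as `(month,day,year)` are handled, and the entries for `name` and `org` are an assumption about what the DTD intends.

```python
import xml.etree.ElementTree as ET

# Expected child sequence for each element, mirroring the DTD's content models.
# An empty tuple stands in for (#PCDATA): text only, no child elements.
RULES = {
    'document': ('author', 'title', 'date'),
    'author': ('name', 'org'),
    'date': ('created', 'updated'),
    'created': ('month', 'day', 'year'),
    'updated': ('month', 'day', 'year'),
    'name': (), 'org': (),  # assumed: text-only elements
    'title': (), 'month': (), 'day': (), 'year': (),
}

def conforms(element):
    """Check that the children appear exactly in the declared order, recursively."""
    expected = RULES.get(element.tag)
    if expected is None:
        return False  # element was never declared
    actual = tuple(child.tag for child in element)
    return actual == expected and all(conforms(child) for child in element)

AUTHOR = "<author><name>Elisa</name><org>McGill</org></author>"
TITLE = "<title>Example XML file</title>"
DATE = ("<date><created><month>May</month><day>7</day><year>2024</year></created>"
        "<updated><month>June</month><day>15</day><year>2024</year></updated></date>")

good = ET.fromstring("<document>" + AUTHOR + TITLE + DATE + "</document>")
swapped = ET.fromstring("<document>" + TITLE + AUTHOR + DATE + "</document>")

good_ok = conforms(good)        # children in the declared order
swapped_ok = conforms(swapped)  # title before author breaks the sequence
print(good_ok, swapped_ok)
```

The second document contains exactly the same elements, yet it fails: a comma-separated content model fixes the order of the children, not just their presence.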

In [28]:
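from lxml import etree # a library that can both parse and validate XML

docxml = etree.XML('https://raw.githubusercontent.com/SCS-Technology-and-Innovation/DSLP/main/data/docwithdtd.xml')
dtd = etree.DTD('https://raw.githubusercontent.com/SCS-Technology-and-Innovation/DSLP/main/data/doc.dtd')


dtd.validate(docxml) # NOT WORKING YET

XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1 (<string>, line 1)

The error comes from the first call, not from the validation itself: `etree.XML` parses a string that already contains the XML, so a URL handed to it is read as (invalid) markup. The file contents have to be downloaded first, for example with `urllib.request` as before, and then passed in.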
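For comparison, here is a minimal sketch of DTD validation with `lxml` that does run. It uses inline strings rather than the remote files, whose exact contents are not assumed here, and a trimmed-down DTD written for this example (`DTD_TEXT`, `GOOD` and `BAD` are illustrative names). Note that a standalone DTD holds only the declarations, without a `<!DOCTYPE ...>` wrapper.

```python
from io import StringIO
from lxml import etree

# A trimmed DTD for illustration: declarations only, every element declared
DTD_TEXT = """<!ELEMENT document (author,title)>
<!ELEMENT author (name,org)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT org (#PCDATA)>
<!ELEMENT title (#PCDATA)>
"""

GOOD = ("<document><author><name>Elisa</name><org>McGill</org></author>"
        "<title>Example XML file</title></document>")
BAD = "<document><title>Example XML file</title></document>"  # author is missing

# etree.DTD reads from a file-like object; etree.XML parses the XML text itself
dtd = etree.DTD(StringIO(DTD_TEXT))
good_ok = dtd.validate(etree.XML(GOOD))
bad_ok = dtd.validate(etree.XML(BAD))
print(good_ok, bad_ok)
```

`validate` returns a boolean rather than raising, so a failed check has to be inspected explicitly; `dtd.error_log` holds the reasons for the most recent failure.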

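The alternative is a (more powerful) XML Schema, or XSD. A schema is itself written in XML and can also constrain data types, for example requiring that `year` be an integer, which a DTD cannot express.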

In [41]:
import xmlschema
schema = xmlschema.XMLSchema('https://raw.githubusercontent.com/SCS-Technology-and-Innovation/DSLP/main/data/doc.xsd')

ParseError: no element found: line 42, column 0 (<string>)

In [38]:
schema.validate(inputfile)

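NameError: name 'schema' is not defined

The two errors are most likely related. `no element found` is what Python's XML parser reports when the input ends before a complete document has been read, which suggests that the schema file at that URL is incomplete or not well-formed. Because the `XMLSchema` constructor never completed successfully, the name `schema` was never assigned, hence the `NameError`.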
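As a working counterpart, here is a minimal sketch of schema validation on inline strings. It swaps in `lxml.etree.XMLSchema` for the `xmlschema` package so that the example needs only a library already used above, and the tiny schema in `XSD_TEXT` is written for this illustration; it is not the content of `doc.xsd`.

```python
from lxml import etree

# A small illustrative schema: a document holds a title (string) and a year (integer)
XSD_TEXT = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="document">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="year" type="xs:integer"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
""".strip()

GOOD = "<document><title>Example XML file</title><year>2024</year></document>"
BAD = "<document><title>Example XML file</title><year>soon</year></document>"

# The schema is parsed as ordinary XML first, then compiled into a validator
schema = etree.XMLSchema(etree.XML(XSD_TEXT))
good_ok = schema.validate(etree.XML(GOOD))
bad_ok = schema.validate(etree.XML(BAD))  # 'soon' is not an integer
print(good_ok, bad_ok)
```

Both documents have the same structure, so a DTD with `(title,year)` would accept either one. The type check on `year` is what the schema adds.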