<a href="https://colab.research.google.com/github/SCS-Technology-and-Innovation/DSLP/blob/main/DSLP_M02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Document structure

People will never give up on inventing new ways to structure information. A relatively modern take is XML (eXtensible Markup Language), which represents a document as a tree of nested, named elements. Optional rules, written separately as a DTD or a schema, specify which elements a document may contain and how those elements relate to one another.

If at any point you need more detail on XML than this Colab notebook provides, you can consult [the w3schools tutorial on XML](https://www.w3schools.com/xml/xml_whatis.asp).

In [1]:
content = '''<?xml version="1.0" ?>
<author>
  <name>Elisa</name>
  <org>McGill</org>
</author>'''
print(content)

<?xml version="1.0" ?>
<author>
  <name>Elisa</name>
  <org>McGill</org>
</author>


In [2]:
!pip install xmltodict



In [3]:
from xmltodict import parse # one of many libraries to parse an XML document

info = parse(content)
print(info)

{'author': {'name': 'Elisa', 'org': 'McGill'}}


In [4]:
info['author']['org']

'McGill'
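
As the comment on the import says, xmltodict is only one of many parsers. For comparison, here is a minimal sketch of the same lookup with Python's built-in `xml.etree.ElementTree` module, which needs no installation. The variable names `root` and `org` are ours and do not appear elsewhere in this notebook.

```python
import xml.etree.ElementTree as ET

content = '''<?xml version="1.0" ?>
<author>
  <name>Elisa</name>
  <org>McGill</org>
</author>'''

root = ET.fromstring(content)   # root is the <author> element
org = root.findtext('org')      # text of the first <org> child
print(root.tag, org)
```

Instead of nested dictionaries, this gives a tree of element objects that you query by tag name.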

We could of course also store our XML document in a (plain-text) file; here we fetch one from a URL.

In [5]:
import urllib.request # an old friend from M1

xmlurl = 'https://raw.githubusercontent.com/SCS-Technology-and-Innovation/DSLP/main/data/doc.xml'

# read() returns the file contents as bytes; decode() turns them into text for printing
inputfile = urllib.request.urlopen(xmlurl).read()
print(inputfile.decode())

<?xml version="1.0" encoding="UTF-8"?>
<document>
  <author>
    <name>Elisa</name>
    <org>McGill</org>
  </author>
  <title>Example XML file</title>
  <date>
    <created>
      <month>May</month>
      <day>7</day>
      <year>2024</year>
    </created>
    <updated>
      <month>June</month>
      <day>15</day>
      <year>2024</year>
    </updated>
  </date>
</document>



In [6]:
doc = parse(inputfile)
doc

{'document': {'author': {'name': 'Elisa', 'org': 'McGill'},
  'title': 'Example XML file',
  'date': {'created': {'month': 'May', 'day': '7', 'year': '2024'},
   'updated': {'month': 'June', 'day': '15', 'year': '2024'}}}}

Having *parsed* the file, we can reach any element by following its path of keys:

In [7]:
doc['document']['title']

'Example XML file'

In [8]:
doc['document']['date']['created']['year']

'2024'
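
Note that every value comes back as a string, so `'2024'` here is text, not a number. The sketch below shows one way to turn such fields into a `datetime.date`. It copies the `created` part of the parsed dictionary as a literal so that it runs without a download. The helper name `to_date` is hypothetical, and the month parsing assumes English month names.

```python
from datetime import datetime

# literal copy of doc['document']['date']['created'] from the output above
created = {'month': 'May', 'day': '7', 'year': '2024'}

def to_date(fields):
    """Combine string fields such as 'May', '7', '2024' into a date object."""
    text = f"{fields['month']} {fields['day']} {fields['year']}"
    # %B is the full month name (English in the default locale)
    return datetime.strptime(text, '%B %d %Y').date()

created_on = to_date(created)
print(created_on.isoformat())
```

Once converted, the fields behave like numbers: `created_on.year + 1` is an integer, whereas `'2024' + 1` would raise a `TypeError`.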

How can we establish (optional) rules on what the document should contain? There are two competing options: document type definitions (DTDs) and XML schemas (XSD). Let's examine them in that order.

In [9]:
from io import StringIO
from lxml import etree

# DTD: an author has a name, optionally (?) followed by an org; both contain plain text
minidtd = StringIO("<!ELEMENT author (name,org?)>\n<!ELEMENT name (#PCDATA)>\n<!ELEMENT org (#PCDATA)>")
minixml = etree.XML(content) # parse the XML string from the first cell
output = etree.tostring(minixml, pretty_print = True, xml_declaration = True, encoding = 'UTF-8')
output

b"<?xml version='1.0' encoding='UTF-8'?>\n<author>\n  <name>Elisa</name>\n  <org>McGill</org>\n</author>\n"

In [10]:
dtdv = etree.DTD(minidtd)
dtdv.validate(minixml)

True
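
A content model such as `(name,org?)` reads much like a regular expression over the sequence of child elements: `,` means "followed by" and `?` means "optional". As an illustration only, the sketch below encodes that one rule as a Python regex and checks the child tags against it. This is a hand-written check for a single rule, not how lxml validates and not a general DTD validator; the names `AUTHOR_MODEL` and `follows_model` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written regex equivalent of the content model (name,org?):
# exactly one <name>, then at most one <org>.
AUTHOR_MODEL = re.compile(r"name(,org)?")

def follows_model(xml_text):
    """Return True if the root's child tags match the (name,org?) pattern."""
    root = ET.fromstring(xml_text)
    child_tags = ",".join(child.tag for child in root)
    return AUTHOR_MODEL.fullmatch(child_tags) is not None

with_org = "<author><name>Elisa</name><org>McGill</org></author>"
without_org = "<author><name>Elisa</name></author>"
wrong_tag = "<author><name>Elisa</name><company>McGill</company></author>"

for doc_text in (with_org, without_org, wrong_tag):
    print(follows_model(doc_text))
```

The first two documents satisfy the rule because `org` is optional; the third does not because `company` is not part of the pattern.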

Let's break the DTD and then the XML, one at a time, to see how validation fails.

In [11]:
badminidtd = StringIO("<!ELEMENT author (name,org?)>\n<!ELEMENT org (#PCDATA)>")
dtdv2 = etree.DTD(badminidtd)
dtdv2.validate(minixml) # same XML, incomplete DTD
dtdv2.error_log.filter_from_errors()[0]

<string>:3:0:ERROR:VALID:DTD_UNKNOWN_ELEM: No declaration for element name

In [12]:
badcontent = '''<?xml version="1.0" ?>
<author>
  <name>Elisa</name>
  <company>McGill</company>
</author>'''
badminixml = etree.XML(badcontent) # field names do not match the DTD
dtdv.validate(badminixml)
dtdv.error_log.filter_from_errors()[0]

<string>:2:0:ERROR:VALID:DTD_CONTENT_MODEL: Element author content does not follow the DTD, expecting (name , org?), got (name company )
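
Indexing with `[0]` shows only the first problem. A minimal sketch of listing every entry in the validator's error log follows. It rebuilds the DTD and the mismatching document inline so that it runs on its own, and the variable names are ours. The `line`, `type_name` and `message` attributes are those of lxml's log entries.

```python
from io import StringIO
from lxml import etree

dtd = etree.DTD(StringIO(
    "<!ELEMENT author (name,org?)>\n"
    "<!ELEMENT name (#PCDATA)>\n"
    "<!ELEMENT org (#PCDATA)>"
))
bad = etree.XML("<author><name>Elisa</name><company>McGill</company></author>")

is_valid = dtd.validate(bad)  # False: <company> is not allowed by the DTD

# collect (line, error type, message) for every logged problem
problems = [(entry.line, entry.type_name, entry.message) for entry in dtd.error_log]
for line, type_name, message in problems:
    print(line, type_name, message)
```

Looping over the whole log is useful when a document has several independent problems and you want to fix them in one pass.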

Let's try a more complex DTD from a file.

In [30]:
dtdurl = 'https://raw.githubusercontent.com/SCS-Technology-and-Innovation/DSLP/main/data/doc.dtd'
dtdcontent = urllib.request.urlopen(dtdurl).read().decode()
print(dtdcontent)

<!ELEMENT document (author+,title,date)>
<!ELEMENT author (name,org?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT org (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT date (created,updated?)>
<!ELEMENT created (month,day,year)>
<!ELEMENT updated (month,day,year)>
<!ELEMENT month (#PCDATA)>
<!ELEMENT day (#PCDATA)>
<!ELEMENT year (#PCDATA)>
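
The `+` in `author+` means "one or more", so a valid document may list several authors. This matters when reading xmltodict output: a repeated element is returned as a list, while a single one is returned as a plain dictionary, as in the parsed output earlier. The sketch below shows a small helper that hides the difference. `as_list` is a hypothetical name, and the two dictionaries are hand-written stand-ins for parsed documents rather than real parser output.

```python
def as_list(value):
    """Wrap a lone item in a list so callers can always loop."""
    return value if isinstance(value, list) else [value]

# hand-written stand-ins for parsed documents with one and two authors
one_author = {'document': {'author': {'name': 'Elisa', 'org': 'McGill'}}}
two_authors = {'document': {'author': [{'name': 'Elisa', 'org': 'McGill'},
                                       {'name': 'Sam'}]}}  # 'Sam' is invented

for parsed in (one_author, two_authors):
    names = [author['name'] for author in as_list(parsed['document']['author'])]
    print(names)
```

xmltodict's `force_list` argument to `parse` is another way to get a list in both cases.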



In [41]:
xmlurl = 'https://raw.githubusercontent.com/SCS-Technology-and-Innovation/DSLP/main/data/docwithdtd.xml'
xmlcontent = urllib.request.urlopen(xmlurl).read()
print(xmlcontent.decode())

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "doc.dtd">
<document>
  <author>
    <name>Elisa</name>
    <org>McGill</org>
  </author>
  <title>Example XML file</title>
  <date>
    <created>
      <month>May</month>
      <day>7</day>
      <year>2024</year>
    </created>
    <updated>
      <month>June</month>
      <day>15</day>
      <year>2024</year>
    </updated>
  </date>
</document>


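For the syntax of a DTD file, see for example [this GeeksforGeeks tutorial](https://www.geeksforgeeks.org/dtd-syntax/). The `<!DOCTYPE document SYSTEM "doc.dtd">` line in the XML names the DTD file it is meant to follow, but lxml's default parser neither loads nor applies it, so we build the validator from the DTD text ourselves.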

In [35]:
dtd = etree.DTD(StringIO(dtdcontent))

In [43]:
docxml = etree.XML(xmlcontent)

In [44]:
dtd.validate(docxml)

True

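The alternative is an XML schema (XSD), which is more expressive than a DTD: it is itself written in XML and can constrain data types, not just the element structure.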

In [67]:
schurl = 'https://raw.githubusercontent.com/SCS-Technology-and-Innovation/DSLP/main/data/doc.xsd'
schcontent = urllib.request.urlopen(schurl) # a file-like response object
schxml = etree.parse(schcontent) # etree.parse reads directly from file-like objects

In [68]:
schema = etree.XMLSchema(schxml)

In [73]:
xmlurl2 = 'https://raw.githubusercontent.com/SCS-Technology-and-Innovation/DSLP/main/data/doc.xml'
xmlcontent2 = urllib.request.urlopen(xmlurl2).read()
docxml2 = etree.XML(xmlcontent2)

schema.validate(docxml2)

True
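
The file `doc.xsd` is not printed above, so to give a feel for what a schema can add over a DTD, here is a small sketch with a made-up schema (not the contents of `doc.xsd`). It requires `<day>` and `<year>` to be integers, a constraint a DTD cannot state because `#PCDATA` accepts any text. The names `mini_schema`, `good` and `bad` are ours.

```python
from lxml import etree

# Hypothetical mini schema, written for this example only
xsd = etree.XML(b"""
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="created">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="month" type="xs:string"/>
        <xs:element name="day" type="xs:positiveInteger"/>
        <xs:element name="year" type="xs:integer"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
""")
mini_schema = etree.XMLSchema(xsd)

good = etree.XML("<created><month>May</month><day>7</day><year>2024</year></created>")
bad = etree.XML("<created><month>May</month><day>seventh</day><year>2024</year></created>")

good_ok = mini_schema.validate(good)  # structure and types both match
bad_ok = mini_schema.validate(bad)    # 'seventh' is not an integer
print(good_ok, bad_ok)

# the log explains why the second document was rejected
for entry in mini_schema.error_log:
    print(entry.message)
```

Both documents would pass a DTD that only declares `day` as `#PCDATA`; the schema rejects the second one on the type of `<day>` alone.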