# Chapter 2: Working with xml : reading

## 1. XML, XPath and DOM

XML stands for eXtensible Markup Language and is designed to describe data. Unlike HTML, XML is very flexible. Although you must have clean markup, you can invent your own tags or use those supplied by ODD, guidelines and sets of tags designed for collaboration and shared practices.

You've probably heard of TEI; TEI is a nice, complex, complete language for encoding texts.


![XML, source of our everyday](images/XML.svg)

## 2\. Parsing XML with Python

As for querying the web, Python has many libraries for playing with xml. You will most likely encounter the following during your Pythonic journey :

- **lxml**, which we will use for this course. A clean, quite fast, strict library for dealing with xml resources. It's the most accepted library for this kind of request. If IBM writes [tutorials for it](http://www.ibm.com/developerworks/library/x-hiperfparse/), it should be good. It supports xpath and xslt.
- **BeautifulSoup**. Flexible, average speed. The good thing is that if your xml markup is messed up, it will try to correct it. It's perfect for dealing with web-scraped data in HTML format. For clean xml, it might be too slow.
- **xml** : the native integration in Python. Fast and clean, but without extras such as full XPath support and XSLT.
- Read about others on the Python [official wiki](https://wiki.python.org/moin/PythonXml)


Based on my experience, lxml will meet most of your needs when dealing with clean data. Clean is the key word here : do not expect lxml to play well with bad html or bad xml. It will just throw errors at you until you give up or fix it by hand.

We can import lxml.etree the same way we imported requests earlier.

In [None]:
from lxml import etree

### From file to XML object

Opening an xml file is actually quite simple : you open it and you parse it. Who would have guessed ?

In [None]:
# We open our file
with open("data/phi1294.phi002.perseus-lat2.xml") as file:
    # We use the etree.parse function
    parsed = etree.parse(file)
# We print the object
print(parsed)

As you can see, we obtained an instance of type lxml.etree.\_ElementTree. It means the xml markup has been transformed into something Python understands.
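
To make the difference between the tree and its top-level node more concrete, here is a minimal sketch. It assumes a tiny in-memory document (made up for the illustration) instead of the file used above, so that it runs anywhere :

```python
import io
from lxml import etree

# A small in-memory document stands in for a file on disk (hypothetical content)
fake_file = io.BytesIO(b"<humanities><field>classics</field></humanities>")
tree = etree.parse(fake_file)

# The tree wraps the whole document; getroot() gives its top-level element
root = tree.getroot()
print(type(tree).__name__)
print(root.tag)
```

`etree.parse()` gives back the whole document, while `getroot()` hands you the element you will usually want to explore.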

The *parse* function of *etree* does not take many arguments. One way to customize its behaviour is to give it a configured or homemade xml parser : 

In [None]:
# We initiate a new parser from etree, asking it to remove nodes of text which are empty
parser = etree.XMLParser(remove_blank_text=True)
# We open the file
with open("data/phi1294.phi002.perseus-lat2.xml") as file:
    # And we parse using the new parser
    parsed = etree.parse(file, parser)
# We print the object
print(parsed)

From the [documentation](http://lxml.de/parsing.html#parser-options) of the XMLParser class, here are some arguments that might be useful for you :

- *attribute_defaults* : Use DTD (if available) to add the default attributes
- *dtd_validation* : Validate against DTD while parsing
- *load_dtd* : Load and parse the DTD while parsing
- *ns_clean* : Clean up redundant namespace declarations
- *recover* : Try to fix ill-formed xml
- *remove_blank_text* : Removes blank text nodes
- *resolve_entities* : Replace entities by their value (Default : on)

You can then create a new parser that suits your needs, for example one that cleans up namespace declarations. In this context, *ns_clean* would transform


`<root xmlns:a="xmlns1" xmlns:b="xmlns2"><tag xmlns:c="xmlns3" /><tag xmlns:a="xmlns1" /><tag /></root>`

into

`<root xmlns:a="xmlns1" xmlns:b="xmlns2"><tag xmlns:c="xmlns3" /><tag/><tag /></root>`
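
The transformation above can be checked with a short sketch. It parses the same string twice, once with the default parser and once with a parser created with `ns_clean=True`, then serializes both results with `etree.tostring()`. The variable names are only there for the illustration :

```python
from lxml import etree

xml = '<root xmlns:a="xmlns1" xmlns:b="xmlns2"><tag xmlns:c="xmlns3" /><tag xmlns:a="xmlns1" /><tag /></root>'

# Default parser : redundant declarations are kept as they are
default_root = etree.fromstring(xml)

# Parser asked to drop namespace declarations that repeat an inherited one
cleaning_parser = etree.XMLParser(ns_clean=True)
cleaned_root = etree.fromstring(xml, cleaning_parser)

default_output = etree.tostring(default_root)
cleaned_output = etree.tostring(cleaned_root)
print(default_output)
print(cleaned_output)
```

Only the second declaration of `xmlns:a` should disappear, since `xmlns:c` is not declared anywhere else.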

### From string to XML object

lxml parses strings in the same way that it parses files. The syntax differs, but is quite simple :

In [None]:
xml = '<root xmlns:a="xmlns1" xmlns:b="xmlns2"><tag xmlns:c="xmlns3" /><tag xmlns:a="xmlns1" /><tag /></root>'
parsed = etree.fromstring(xml)
print(parsed)

**DIY**

Can you parse an xml document made of one tag "humanities" with two children "field" named "classics" and "history" ? 

In [None]:
# Put your code here

### Errors and understanding them

Previously, we said that lxml is quite strict about xml validity. Let's see an example :

In [None]:
xml = """
<fileDesc>
    <titleStmt>
        <title>Aeneid</title>
        <title type="sub">Machine readable text</title>
        <author n="Verg.">P. Vergilius Maro</author>
        <editor role="editor" n="Greenough">J. B. Greenough</editor>
        &responsibility;
        &fund.NEH;
    </titleStmt>
    <extent>about 505Kb</extent>
    &Perseus.publish;
    <sourceDesc>
        <biblStruct>
            <monogr>
                <author>Vergil</author>
                <title>Bucolics, Aeneid, and Georgics Of Vergil</title>
                <editor role="editor">J. B. Greenough</editor>
                <imprint>
                    <pubPlace>Boston</pubPlace>
                    <publisher>Ginn &amp; Co.</publisher>
                    <date>1900</date>
                </imprint>
            </monogr>
        </biblStruct>
    </sourceDesc>
</fileDesc>"""

etree.fromstring(xml)

What error did we raise trying to parse this XML ? We got an *XMLSyntaxError*. It can happen for various reasons, including when entities cannot be parsed. Can you try to find another way to raise an XMLSyntaxError ?

In [None]:
#Write your xml in xml variable
xml = """
"""
etree.fromstring(xml)

As you can see, errors are detailed enough so you can correct your own XML, at least manually.
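
If you would rather keep your program running than let it stop on the first bad document, you can catch the error. The sketch below assumes a deliberately broken string (made up for the illustration) and also shows the `recover` option of `etree.XMLParser`, which asks lxml to do its best with ill-formed markup :

```python
from lxml import etree

broken = "<root><a>text</root>"  # <a> is never closed

# Strict parsing : we catch the error and keep its message
try:
    etree.fromstring(broken)
    message = None
except etree.XMLSyntaxError as error:
    message = str(error)
    print("Could not parse :", message)

# Lenient parsing : the parser tries to repair what it can
lenient_parser = etree.XMLParser(recover=True)
recovered = etree.fromstring(broken, lenient_parser)
print(etree.tostring(recovered))
```

Be careful with `recover` : the parser guesses what you meant, so always check that the repaired tree is what you expect.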

### Node properties and methods

*Quick explanation* : Methods and properties are something special in Python and other programming languages. Unlike standalone functions (`len()`) and keys of dictionaries (`a["b"]`), they belong to an object.

**Methods** : Ever seen something such as `a.method()` ? Yes, you did with `.split()`, `.join()`, etc. Functions that follow a variable and a dot are called methods because they are an extension of the variable's type. E.g. `split()` and `join()` are extensions of string objects, and they use the string's value as an argument.

**Properties or Attributes** : Like dictionary keys, properties are named values of an object, but instead of using square brackets, you just put the name after a dot : `a.property`
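
Here is a small sketch of the three notions side by side, using only built-in Python objects. The variable names and values are made up for the illustration :

```python
sentence = "arma virumque cano"

# len() is a function : the string is passed as an argument
# split() is a method : it is called on the string itself
word_count = len(sentence.split())
shouted = sentence.upper()

# A complex number has attributes (no parentheses) and methods (parentheses)
number = complex(3, 4)
real_part = number.real          # attribute
mirrored = number.conjugate()    # method

print(word_count, shouted, real_part, mirrored)
```

The parentheses are the visual clue : `number.real` just reads a value, while `number.conjugate()` runs some code.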

**Warning : namespaces** : In lxml, namespaces are expressed using Clark notation. This means that, if a node belongs to a namespace, it will be named using the syntax `{namespace}tagname`. Here is an example :

In [None]:
# With no namespace
print(etree.fromstring("<root />"))
# With namespace
print(etree.fromstring("<root xmlns='http://localhost' />"))
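
When you need the two halves of a Clark name separately, lxml offers the `etree.QName` helper. A minimal sketch, reusing the namespaced root from the cell above :

```python
from lxml import etree

root = etree.fromstring("<root xmlns='http://localhost' />")

# The tag is stored in Clark notation
print(root.tag)

# QName splits it into its namespace and its local name
qname = etree.QName(root)
print(qname.namespace)
print(qname.localname)
```

This avoids cutting the string by hand around the curly braces.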

You can do plenty of things using lxml and access properties or methods of nodes. Here is an overview of the reading functionalities offered by lxml :

![Cheatsheet](images/CheatsheetElement.svg)

Let's see what that means in real life :

In [None]:
# First, we will need some xml
xml = """
<div type="Book" n="1">
    <l n="1">Arma virumque cano, Troiae qui primus ab oris</l>
    <tei:l n="2" xmlns:tei="http://www.tei-c.org/ns/1.0">Italiam, fato profugus, Laviniaque venit</tei:l>
    <l n="3">litora, multum ille et terris iactatus et alto</l>
    <l n="4">vi superum saevae memorem Iunonis ob iram;</l>
    <l n="5">multa quoque et bello passus, dum conderet urbem,</l>
    <l n="6">inferretque deos Latio, genus unde Latinum,</l>
    <l n="7">Albanique patres, atque altae moenia Romae.</l>
</div>
"""
div = etree.fromstring(xml)
print(div)

If we want to retrieve the attributes of our div, we can do as follows :

In [None]:
type_div = div.get("type")
print(type_div)
# If we want a dictionary of attributes
attributes_div = dict(div.attrib)
print(attributes_div)
# Or if we want a list of (name, value) pairs
list_attributes_div = div.items()
print(list_attributes_div)

Great ! We accessed our first information using lxml ! Now, how about getting somewhere other than the root tag ? To do so, there are two ways :

- `div.getchildren()` returns a list of the child tags of `div` (note that the lxml documentation marks this method as deprecated).
- `list(div)` transforms `div` into a list of its children.

Both syntaxes return the same results, so it's up to you to decide which one you prefer. 

In [None]:
children = div.getchildren()
line_1 = children[0] # Because it's a list we can access children through index
print(line_1)
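
Since a node behaves much like a list of its children, the usual list tools work on it directly. A minimal sketch, assuming a small made-up stanza rather than the Aeneid excerpt above :

```python
from lxml import etree

stanza = etree.fromstring('<lg><l n="1">first</l><l n="2">second</l><l n="3">third</l></lg>')

# len() counts the direct children
number_of_lines = len(stanza)

# Indexing works as with a list, negative indexes included
last_line = stanza[-1]

# And so does looping
numbers = [line.get("n") for line in stanza]

print(number_of_lines, last_line.get("n"), numbers)
```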

Now that we have access to our children, we can access their text :

In [None]:
print(line_1.text)

Ok, we are now able to get some stuff done. Remember the namespace naming ? Sometimes it's useful to retrieve namespaces and their prefix :

In [None]:
line_2 = children[1]
print(line_2.nsmap)
print(line_2.prefix)
print(line_2.tag)

**DIY**

Can you print the complete text of those lines ? (Hint : you will need to loop over something)

In [None]:
#Write your code here

**What you've learned** :

- How to parse an xml file or a string representing xml through `etree.parse()` and `etree.fromstring()`
- How to configure the way xml is parsed with `etree.XMLParser()`
- What attributes and methods are
- Properties and methods of a node
- `XMLSyntaxError` handling
- Clark's notation for namespaces and tags.

----

## 3\. XPath and XSLT with lxml

### XPath

XPath is a powerful tool for traversing an xml tree. XML is made of nodes such as tags, comments, texts. These nodes have attributes that can be used to identify them. For example, with the following xml :

> `<div><l n="1"><p>Text</p> followed</l><l n="2">by line two</l></div>`

the node p will be accessible by `/div/l[@n="1"]/p`. lxml has great support for complex XPath, which makes it the best friend of Humanists dealing with xml :

In [None]:
# We generate some xml and parse it
xml = '<div><l n="1"><p>Text</p> followed</l><l n="2">by line two</l><p>test</p></div>'
div = etree.fromstring(xml)
print(div)
# When doing an xpath, the results will be a list
ps = div.xpath("/div/l[@n='1']/p")
print(ps)
print(ps[0].text == "Text")

As you can see, the xpath returns a list. This behaviour is intended, since an xpath can retrieve more than one item :

In [None]:
print(div.xpath("//l"))

You see ? The xpath `//l` returns two elements, stored in a regular Python list. Now, let's apply some xpath to the children and see what happens :

In [None]:
# We assign our first line to a variable
line_1 = div.xpath("//l")[0]
print(line_1)

# We look for p
print(line_1.xpath("p")) # This works
print(line_1.xpath("./p")) # This works too
print(line_1.xpath(".//p")) # This still works
print(line_1.xpath("//p")) # This starts again from the root, so it also returns the <p> outside line 1


As you can see, you can do xpath from any node in lxml. One important thing though : xpath `//tagname` *will return to the root* if you do not add a dot in front of it such as **`.`**`//tagname`. This is really important to remember, because it is easy to assume that the search starts from the current node.
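
Counting the results makes the difference easier to see. This sketch reuses the same xml as the cell above and compares the two spellings :

```python
from lxml import etree

div = etree.fromstring('<div><l n="1"><p>Text</p> followed</l><l n="2">by line two</l><p>test</p></div>')
line_1 = div.xpath("//l")[0]

# With the dot : only the <p> nodes inside line 1
relative = line_1.xpath(".//p")

# Without the dot : every <p> of the whole document
absolute = line_1.xpath("//p")

print(len(relative), len(absolute))
```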

**Xpath with namespaces and prefix**

As you've seen, lxml uses Clark's naming convention for expressing namespaces. This is extremely important regarding xpath, because you will be able to retrieve a node using it under certain conditions :

In [None]:
xml = """<root>
<tag xmlns="http://localhost">Text</tag>
<tei:tag xmlns:tei="http://www.tei-c.org/ns/1.0"></tei:tag>
</root>"""
root = etree.fromstring(xml)

print(root.xpath("//tag")) # Does not retrieve anything because both tags have a namespace
print(root.findall("{http://localhost}tag")) # Retrieve first tag

print(root.xpath("//{http://www.tei-c.org/ns/1.0}tag")) # Will fail

The last line failed because Clark's notation is not accepted in xpath. To succeed, you will need to use a namespace dictionary and prefix, which you will feed to the `xpath()` method using the argument `namespaces` : 

In [None]:
# We create a valid xml object
xml = """<root>
<tag xmlns="http://localhost">Text</tag>
<tei:tag xmlns:tei="http://www.tei-c.org/ns/1.0">Other text</tei:tag>
</root>"""
root = etree.fromstring(xml)
# We register every namespace in a dictionary, using prefixes as keys :
ns = {
    "local" : "http://localhost", # Even if this namespace had no prefix, we can register one for it
    "tei" : "http://www.tei-c.org/ns/1.0"
}
tag_1 = root.xpath("//local:tag", namespaces=ns)
print(tag_1[0].text)
tag_2 = root.xpath("//tei:tag", namespaces=ns)
print(tag_2[0].text)
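
Typing the dictionary by hand is not the only option : a node's `nsmap` already holds the namespaces in scope. The catch is that a default namespace is stored under the key `None`, which `xpath()` does not accept as a prefix. The sketch below, on a made-up document where both namespaces are declared on the root, renames that key to `local` (an arbitrary prefix chosen for the illustration) :

```python
from lxml import etree

root = etree.fromstring(
    '<root xmlns="http://localhost" xmlns:tei="http://www.tei-c.org/ns/1.0">'
    "<tag>Text</tag><tei:tag>Other text</tei:tag></root>"
)
print(root.nsmap)  # the default namespace appears under the key None

# We give the default namespace a prefix of our own so xpath() accepts the dictionary
ns = {(prefix if prefix is not None else "local"): uri for prefix, uri in root.nsmap.items()}

local_texts = root.xpath("//local:tag/text()", namespaces=ns)
tei_texts = root.xpath("//tei:tag/text()", namespaces=ns)
print(local_texts, tei_texts)
```

Keep in mind that `nsmap` only knows the namespaces declared on the node or on its ancestors, not those declared deeper in the tree.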

Another point to keep in mind : if you write your xpath incorrectly, Python will raise an *XPathEvalError* :

In [None]:
root.xpath("wrong:xpath:never:works")
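
If the xpath comes from a user or a configuration file, you may want to catch the problem instead of crashing. A minimal sketch, assuming an obviously unfinished expression ; it catches `etree.XPathError`, the parent class of lxml's XPath errors, so that it does not depend on the exact subclass raised :

```python
from lxml import etree

root = etree.fromstring("<root><tag>Text</tag></root>")

try:
    result = root.xpath("//tag[")  # the predicate is never closed
except etree.XPathError as error:
    # We fall back on an empty result and keep the name of the error
    result = []
    error_name = type(error).__name__
    print("Invalid XPath :", error_name, error)
```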

**What you have learned** :

- Each node and xml document has an `.xpath()` method which takes an xpath expression as its first parameter
- Method `xpath()` returns a list when it selects nodes, even for a single result
- Method `xpath()` will return to the root when you don't prefix your `//` with a dot.
- An incorrect XPath will raise an `XPathEvalError`
- Method `xpath()` accepts a `namespaces` argument : you should pass a dictionary where keys are prefixes and values are namespaces
- Unlike `findall()`, `xpath()` does not accept Clark's notation
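
One more thing worth knowing : an xpath does not have to select tags. It can also point at attributes or text, in which case lxml gives back a list of strings, and XPath functions such as `count()` or `boolean()` give back a number or a boolean rather than a list. A short sketch on a made-up document :

```python
from lxml import etree

div = etree.fromstring('<div><l n="1">Arma</l><l n="2">Italiam</l></div>')

# Attribute values and text nodes come back as lists of strings
numbers = div.xpath("//l/@n")
texts = div.xpath("//l/text()")

# XPath functions return a single value instead of a list
how_many = div.xpath("count(//l)")
has_third = div.xpath("boolean(//l[@n='3'])")

print(numbers, texts, how_many, has_third)
```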

### XSLT

XSLT stands for *Extensible Stylesheet Language Transformations*. It's an xml-based language made for transforming xml documents to xml or other formats such as LaTeX and HTML. XSLT is really powerful when dealing with similarly formatted data. It's far easier to transform 100 documents with the exact same structure via XSLT than in Python or any other language.

While Python is great at dealing with weird transformations of xml, the presence of XSLT in Python allows you to create production chains without leaving your favorite IDE.

To do some XSL, lxml needs two things : first, an xml document representing the xsl, which is parsed and passed to the function `etree.XSLT()`, and second, a document to transform. 

In [None]:
# Here is an xml containing an xsl: for each text node of an xml file in the xpath /humanities/field,
#     this will return a node <name> with the text inside
xslt_root = etree.fromstring("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/">
        <fields><xsl:apply-templates /></fields>
    </xsl:template>
    <xsl:template match="/humanities/field">
        <name><xsl:value-of select="./text()" /></name>
    </xsl:template>
</xsl:stylesheet>""")
# We turn our parsed xsl into a transformation function
xslt = etree.XSLT(xslt_root)

# We create some xml we need to change 
xml = """<humanities>
    <field>History</field>
    <field>Classics</field>
    <field>French</field>
    <field>German</field>
</humanities>"""
parsed_xml = etree.fromstring(xml)
# And now we process our xml :
transformed = xslt(parsed_xml)
print(transformed)

Did you see what happened ? We used `xslt(parsed_xml)`. `etree.XSLT()` transforms an xsl document into a function, which then takes one parameter (in this case an xml document). But can you figure out what this returns ? Let's ask Python :

In [None]:
print(type(transformed))
print(type(parsed_xml))

The result is not of the same type as the elements we usually have, even though it shares most of their methods and attributes :

In [None]:
print(transformed.xpath("//name"))

And it has something more : you can convert it to a string !

In [None]:
string_result = str(transformed)
print(string_result)

Pretty neat, eh ?
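
Because the result is a tree, you can also go back to ordinary lxml tools. The sketch below assumes a shortened variant of the stylesheet above (simplified for the illustration) and shows `getroot()` on the result, as well as `etree.tostring()` as an alternative to `str()` when you want bytes or pretty printing :

```python
from lxml import etree

stylesheet = etree.fromstring("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/humanities">
        <fields><xsl:apply-templates select="field" /></fields>
    </xsl:template>
    <xsl:template match="field">
        <name><xsl:value-of select="." /></name>
    </xsl:template>
</xsl:stylesheet>""")
transform = etree.XSLT(stylesheet)

document = etree.fromstring("<humanities><field>History</field><field>Classics</field></humanities>")
result = transform(document)

# The result tree has a root element like any parsed document
result_root = result.getroot()
names = [name.text for name in result_root]

# etree.tostring() serializes it to bytes, here with indentation
as_bytes = etree.tostring(result, pretty_print=True)
print(names)
print(as_bytes.decode("utf-8"))
```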

XSLT is more complex than just feeding it xml. You can run XSLT with parameters as well. In this case, your parameters will be accessible as named arguments of the generated function. If your XSL has an `xsl:param` called `name`, the function given back by `etree.XSLT` will accept a `name` argument :

In [None]:
# Here is an xml containing an xsl with a parameter n : its value is copied into the n attribute of <fields>,
#     and each node in the xpath /humanities/field becomes a node <name> with the text inside
xslt_root = etree.fromstring("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:param name="n" />
    <xsl:template match="/humanities">
        <fields>
            <xsl:attribute name="n">
                <xsl:value-of select="$n"/>
            </xsl:attribute>
            <xsl:apply-templates select="field"/>
        </fields>
    </xsl:template>
    <xsl:template match="/humanities/field">
        <name><xsl:value-of select="./text()" /></name>
    </xsl:template>
</xsl:stylesheet>""")
# We turn our parsed xsl into a transformation function
xslt = etree.XSLT(xslt_root)

# We create some xml we need to change 
xml = """<humanities>
    <category>Humanities</category>
    <field>History</field>
    <field>Classics</field>
    <field>French</field>
    <field>German</field>
</humanities>"""
parsed_xml = etree.fromstring(xml)
# And now we process our xml :
transformed = xslt(parsed_xml, n="'Humanities'") # Note that for a string, we encapsulate it within single quotes
print(transformed)

# Be aware that you can use xpath as a value for the argument, though it can be rather complex sometimes
transformed = xslt(parsed_xml, n=etree.XPath("//category/text()"))
print(transformed)
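
Wrapping a string in single quotes by hand gets awkward as soon as the string itself contains quotes. lxml provides `etree.XSLT.strparam()` for that case. A minimal sketch, assuming a tiny stylesheet and a title invented for the illustration :

```python
from lxml import etree

stylesheet = etree.fromstring("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:param name="n" />
    <xsl:template match="/">
        <fields n="{$n}" />
    </xsl:template>
</xsl:stylesheet>""")
transform = etree.XSLT(stylesheet)
document = etree.fromstring("<humanities />")

# A value holding both kinds of quotes, which manual quoting cannot express
title = 'Martial\'s "Epigrammata"'
result = transform(document, n=etree.XSLT.strparam(title))

attribute_value = result.getroot().get("n")
print(attribute_value)
```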

## Exercises :

### 1\. CTS XML to Text

Given the following xpath, can you retrieve the text of the poem n°1.1 of Martial's Epigrammata stored in the variable `epigrammata` ? Store it in the amores.txt file.

In [None]:
xpath = "/tei:TEI/tei:text/tei:body/tei:div/tei:div[@n='?']/tei:div[@n='?']"
tei_ns = "http://www.tei-c.org/ns/1.0"
epigrammata = open("data/phi1294.phi002.perseus-lat2.xml")

# Write your code here :


### 2\. Milestone TEI to an OrderedDict
The following xml is not CTS compliant, because it uses `<milestone>`. Parse it and fill the dictionary `passages` with the passages you find :

In [None]:
from collections import OrderedDict
from lxml import etree

passages = OrderedDict()
tei = """
<div n="3" subtype="chapter">
<milestone unit="section" n="7" />
    <p>
        <reg>ac</reg> multum etiam novitatem tuam adiuvat quod eius modi nobiles tecum petunt, ut nemo sit
        qui audeat dicere plus illis nobilitatem quam tibi virtutem prodesse oportere.
          <reg>nam</reg>
        P. Galbam et L. Cassium summo loco natos quis est
        qui petere consu<pb n="156"/>latum putet? <reg>vides</reg> igitur amplissimis ex familiis homines, quod
        sine nervis sunt, tibi paris non esse. <reg>at</reg>
        Antonius et Catilina molesti sunt.
    </p> 
<milestone n="8" unit="section" />
    <p>
        <reg>immo</reg> homini navo, industrio, innocenti, diserto, gratioso apud eos qui res
        iudicant, optandi competitores ambo a puemitia sicarii, ambo libidinosi, ambo egentes.
          Eorum alterius bona proscripta vidimus, vocem denique audivimus iurantis se
          Romae iudicio aequo cum homine Graeco certare non posse, ex
        senatu eiectum scimus optima verorum censorum existimatione, in praetura competitorem
        habuimus amico Sabidio et Panthera, quom ad <pb n="157"/> tabulam quos poneret non haberet; quo tamen in
        magistratu amicam quam domi palam haberet de machinis emit. <reg>in</reg> petitione autem
        consulatus Cappadoces omnis compilare per turpissimam legationem maluit quam adesse et
        populo Romano supplicare.
    </p>
<milestone unit="section" n="9" />
    <p><reg>alter</reg>
        vero, di boni! quo splendore est? <reg>primum</reg> nobilitate eadem <del>qua
          Catilina</del>. <reg>num</reg> maiore? <reg>non</reg>. <reg>sed</reg> virtute.
          <reg>quam</reg> ob rem? <reg>quod</reg>
        Antonius umbram suam metuit, hic ne leges quidem natus in patris egestate,
        educatus in sororis stupris, corroboratus in caede civium, cuius primus ad rem publicam
        aditus in equitibus R. occidendis fuit (nam illis quos meminimus Gallis, qui
        tum Titiniorum ac Nanniorum ac <pb/>
        Tanusiorum capita demebant, Sulla <pb n="158"/> unum Catilinam
        praefecerat); in quibus ille hominem optimum, Q. Caecilium, sororis
        suae virum, equitem Romanum, nullarum partium, cum semper natura tum etiam
        aetate iam quietum, suis manibus occidit.
    </p>
</div>
"""


### 3\. XSLT Exercise

For each `<reg>` you find in the following xml, add a `<note>` containing the text of the `<reg>` to a new document. Do it first with XSLT, then retrieve the same content with lxml's native Python functions and put it in a list :

In [None]:
TEI = """<div1 type="section" n="10"><p><reg>quid</reg> ego nunc dicam peteme
        eum consulatum, qui hominem carissimum populo Romano, M.
          Marium, inspectante populo Romano vitibus per totam umbem
        ceciderit, ad bustum egerit, ibi omni cmuciatu lacerarit, vivo stanti collum gladio sua
        dextera secuemit, cum sinistra capillum eius a vertice teneret, caput sua manu tulerit, cum
        inter digitos eius rivi sanguinis fluerent? qui postea cum histrionibus et cum gladiatoribus
        ita vixit ut alteros libidinis, alteros facinoris adiutores haberet, qui nullum in locum tam
        sanctum ac tam religiosum accessit in quo non, etiam si aliis culpa non esset, tamen ex sua
        nequitia dedecoris <pb n="159"/> suspicionem relinqueret, qui ex curia Curios et
          Annios, ab atmiis Sapalas et Carvilios, ex equestri ordine
          Pompilios et Vettios sibi amicissimos comparavit, qui tantum
        habet audaciae, tantum nequitiae, tantum denique in libidine artis et efficacitatis, ut
        prope in parentum gremiis praetextatos liberos constuprarit? <reg>quid</reg> ego nunc tibi
        de Africa, quid de testium dictis scribam? <reg>nota</reg> sunt, et ea tu
        saepius legito; sed tamen hoc mihi non praetermittendum videtur quod primum ex eo iudicio
        tam egens discessit quam quidam iudices eius ante illud in eum iudicium fuemunt, deinde tam
        invidiosus ut aliud in eum iudicium cotidie flagitetur. <reg>hic</reg> se sic habet ut magis
        <pb/> timeant etiam si quierit, quam ut contemnant si quid commoverit. <milestone n="11"
          unit="section"/><reg>quanto</reg> melior tibi fortuna petitionis data est quam nuper
        homini novo, C. Coelio! <reg>ille</reg> cum duobus hominibus ita
        nobilissimis petebat ut tamen in iis omnia pluris essent quam ipsa nobilitas, summa ingenia,
        summus pudor, plurima beneficia, sum<pb n="160"/>ma ratio ac diligentia petendi. <reg>ac</reg> tamen
        eorum alterum Coelius, cum multo inferior esset genere, superior nulla me
        paene, superavit. </p></div1>"""

### 4\. Counting tags and making statistics

Given the file Epigrammata, write some code that builds a dictionary of frequencies, where the keys are tag names and the values are the number of times they appear.



In [None]:
epigrammata = open("data/phi1294.phi002.perseus-lat2.xml")
tags = {}

In [None]:
# Run this to check previous results
# This should not raise any error if you have done it right :
assert tags["{http://www.tei-c.org/ns/1.0}div"] == 1542
assert tags["{http://www.tei-c.org/ns/1.0}l"] == 9462

## Going further

- R. Alexander Milowski, "Basics from Xpath", http://courses.ischool.berkeley.edu/i290-14/s05/lecture-4/allslides.html

-----

In [None]:
# Don't worry about this cell, it's just here to make the page look nicer.

from IPython.display import HTML
def css_styling():
    # We read the stylesheet and close the file once done
    with open("styles/custom.css", "r") as css_file:
        styles = css_file.read()
    return HTML(styles)
css_styling()

---

<p><small><a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">Python Programming for the Humanities</span> by <a xmlns:cc="http://creativecommons.org/ns#" href="http://fbkarsdorp.github.io/python-course" property="cc:attributionName" rel="cc:attributionURL">http://fbkarsdorp.github.io/python-course</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. Based on a work at <a xmlns:dct="http://purl.org/dc/terms/" href="https://github.com/fbkarsdorp/python-course" rel="dct:source">https://github.com/fbkarsdorp/python-course</a>.</small></p>