# Chapter 2: Working with xml : reading

## 1. XML, XPath and DOM

XML, standing for eXtensible Markup Language, is a markup language designed for the description of data. Unlike HTML, XML does not impose a fixed set of tags : the only requirement is that the markup be clean, that is, well-formed. You can invent your own tags or use tags supplied by an ODD, that is, guidelines and a set of tags designed for collaboration and shared practices.

You will have heard of TEI (the Text Encoding Initiative). TEI is a nice, complex, complete language for encoding texts.

**Write more here**

![XML, source of our everyday](images/XML.svg)

## 2. Parsing XML with Python

As for querying the web, Python has many libraries for playing with xml. You will most likely encounter the following during your pythonic journey :

- **lxml**, which we will use for this course. A clean, quite fast, strict library for dealing with xml resources. It's the most accepted library for this kind of request. I mean, if IBM writes [tutorials for it](http://www.ibm.com/developerworks/library/x-hiperfparse/), it should be good. It supports xpath and xslt.
- **BeautifulSoup**. Flexible, average speed. The good thing with it ? If your xml markup is messed up, it will try to correct it. It's perfect for dealing with web-scraped data in HTML format. For clean xml, it might be too slow.
- **xml** : the package from Python's standard library. Fast and clean, but without extras such as full xpath support or xslt.
- See some others on Python [official wiki](https://wiki.python.org/moin/PythonXml)
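
For comparison, here is a minimal sketch of the standard library route, using `xml.etree.ElementTree`. The sample document is made up for illustration. The standard library understands a small subset of XPath through `find()` and `findall()`, but it offers no XSLT.

```python
import xml.etree.ElementTree as ET

# A made-up sample document, only for illustration
sample = "<texts><text lang='lat'>Aeneid</text><text lang='grc'>Iliad</text></texts>"
root = ET.fromstring(sample)

# findall() accepts a limited XPath subset, including attribute predicates
latin_texts = root.findall("./text[@lang='lat']")
latin_titles = [node.text for node in latin_texts]
print(latin_titles)
```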


From a pure experience point of view, lxml appears to be the one that will fit most of your needs when dealing with clean data. Clean is the key word here : do not expect lxml to play well with bad html or bad xml. It will just throw errors at you until you give up or fix it by hand.

The same way we have imported requests, we will now import lxml.etree.

In [2]:
from lxml import etree

### From file to XML object

Opening an xml file is actually quite simple : you open it and you parse it. Who would have guessed ?

In [None]:
# We open our file
with open("data/phi1294.phi002.perseus-lat2.xml") as file:
    # We use the etree.parse function
    parsed = etree.parse(file)
# We print the object
print(parsed)

As you can see, we obtained an instance of type lxml.etree.\_ElementTree. It means the xml markup has been transformed into something Python understands.
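
The `_ElementTree` object stands for the whole document. Most of the work happens on its root element, which `getroot()` returns. The sketch below assumes an in-memory file built with `io.BytesIO` and a made-up document, so that it runs without the course data file.

```python
import io
from lxml import etree

# A made-up document held in memory, standing in for a file on disk
fake_file = io.BytesIO(b"<text><body><l n='1'>Arma virumque cano</l></body></text>")

tree = etree.parse(fake_file)   # an _ElementTree : the whole document
root = tree.getroot()           # an _Element : the top-level tag

print(type(tree).__name__)
print(root.tag)
```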

The *parse* function of *etree* does not take many arguments. One way to customize its behaviour is to give it a custom-configured xml parser :

In [None]:
# We initiate a new parser from etree, asking it to remove nodes of text which are empty
parser = etree.XMLParser(remove_blank_text=True)
# We open the file
with open("data/phi1294.phi002.perseus-lat2.xml") as file:
    # And we parse using the new parser
    parsed = etree.parse(file, parser)
# We print the object
print(parsed)

From the [documentation](http://lxml.de/parsing.html#parser-options) of the XMLParser class, a few arguments might be useful to you :

- *attribute_defaults* : Use DTD (if available) to add the default attributes
- *dtd_validation* : Validate against DTD while parsing
- *load_dtd* : Load and parse the DTD while parsing
- *ns_clean* : Clean up redundant namespace declarations
- *recover* : Try to fix ill-formed xml
- *remove_blank_text* : Removes blank text nodes
- *resolve_entities* : Replace entities by their value (Default : on)
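
To make two of these options concrete, here is a small sketch comparing the default parser with parsers using `remove_blank_text` and `recover`. The sample strings are invented for the example, and the exact shape of a recovered tree may depend on your lxml version.

```python
from lxml import etree

indented = "<root>\n    <a>text</a>\n    <b/>\n</root>"

# Default parser : the whitespace used for indentation is kept as text
default_root = etree.fromstring(indented)
print(repr(default_root.text))

# remove_blank_text drops text nodes that contain only whitespace
compact_parser = etree.XMLParser(remove_blank_text=True)
compact_root = etree.fromstring(indented, compact_parser)
print(etree.tostring(compact_root))

# recover asks the parser to do its best with ill-formed markup (<a> is never closed)
recovering_parser = etree.XMLParser(recover=True)
recovered_root = etree.fromstring("<root><a>text</root>", recovering_parser)
print(etree.tostring(recovered_root))
```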

You can then create a new parser suited to your needs, for example one that cleans up redundant namespace declarations. In this context, *ns_clean* would transform


`<root xmlns:a="xmlns1" xmlns:b="xmlns2"><tag xmlns:c="xmlns3" /><tag xmlns:a="xmlns1" /><tag /></root>`

into

`<root xmlns:a="xmlns1" xmlns:b="xmlns2"><tag xmlns:c="xmlns3" /><tag/><tag /></root>`
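
If you want to check this on your own machine, the following sketch parses the same string twice, once with the default parser and once with `ns_clean=True`, and prints both serialisations so you can compare them. The variable names are only illustrative.

```python
from lxml import etree

xml = ('<root xmlns:a="xmlns1" xmlns:b="xmlns2">'
       '<tag xmlns:c="xmlns3" /><tag xmlns:a="xmlns1" /><tag /></root>')

# Parse once with the default parser and once with ns_clean switched on
default_root = etree.fromstring(xml)
cleaned_root = etree.fromstring(xml, etree.XMLParser(ns_clean=True))

# Serialise both trees to compare the namespace declarations
default_output = etree.tostring(default_root)
cleaned_output = etree.tostring(cleaned_root)
print(default_output)
print(cleaned_output)
```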

### From string to XML object

In the same way that lxml parses files, it also accepts strings. The syntax differs, but is quite simple :

In [3]:
xml = '<root xmlns:a="xmlns1" xmlns:b="xmlns2"><tag xmlns:c="xmlns3" /><tag xmlns:a="xmlns1" /><tag /></root>'
parsed = etree.fromstring(xml)
print(parsed)

<Element root at 0x7f2aa813ed48>
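
One point worth knowing when parsing strings : if the document starts with an XML declaration that names an encoding, lxml expects bytes rather than a Python string and raises a `ValueError` otherwise. A minimal sketch, with a made-up document :

```python
from lxml import etree

declared = '<?xml version="1.0" encoding="UTF-8"?><root><tag>Vergilius</tag></root>'

# A str carrying an encoding declaration is refused by lxml
try:
    etree.fromstring(declared)
    refused = False
except ValueError as error:
    refused = True
    print(error)

# Encoding the string to bytes first avoids the problem
parsed_bytes = etree.fromstring(declared.encode("utf-8"))
print(parsed_bytes[0].text)
```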


**DIY**

Can you parse an xml document made of one tag "humanities" with two children "field" containing "classics" and "history" as text ?

In [4]:
# Put your code here

### Errors and understanding them

Previously, we said that lxml is quite strict about xml being well-formed. Let's see an example :

In [5]:
xml = """
<fileDesc>
    <titleStmt>
        <title>Aeneid</title>
        <title type="sub">Machine readable text</title>
        <author n="Verg.">P. Vergilius Maro</author>
        <editor role="editor" n="Greenough">J. B. Greenough</editor>
        &responsibility;
        &fund.NEH;
    </titleStmt>
    <extent>about 505Kb</extent>
    &Perseus.publish;
    <sourceDesc>
        <biblStruct>
            <monogr>
                <author>Vergil</author>
                <title>Bucolics, Aeneid, and Georgics Of Vergil</title>
                <editor role="editor">J. B. Greenough</editor>
                <imprint>
                    <pubPlace>Boston</pubPlace>
                    <publisher>Ginn &amp; Co.</publisher>
                    <date>1900</date>
                </imprint>
            </monogr>
        </biblStruct>
    </sourceDesc>
</fileDesc>"""

etree.fromstring(xml)

XMLSyntaxError: Entity 'responsibility' not defined, line 8, column 25 (<string>)

What error did we raise trying to parse this XML ? We got an *XMLSyntaxError*. It happens for different reasons, including when entities cannot be resolved, as is the case here with the undeclared `&responsibility;`. Can you find another mistake that raises an XMLSyntaxError ?

In [None]:
#Write your xml in xml variable
xml = """
"""
etree.fromstring(xml)

As you can see, errors are detailed enough so you can correct your own XML, at least manually.
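
When you parse many files, you may prefer to catch the error instead of letting the program stop. The sketch below reuses the undefined entity from the example above in a shortened, made-up document and catches `etree.XMLSyntaxError`.

```python
from lxml import etree

# The entity &responsibility; is never declared, so parsing fails
broken = "<titleStmt><title>Aeneid</title>&responsibility;</titleStmt>"

try:
    etree.fromstring(broken)
    message = None
except etree.XMLSyntaxError as error:
    # The message states what went wrong and where
    message = str(error)
    print(message)
```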

### Node properties and methods

*Quick explanation* : Methods and properties are something special in Python and other programming languages. Unlike traditional functions (`len()`) and keys of dictionaries (`a["b"]`), they are part of something bigger.

**Methods** : Ever seen something such as `a.method()` ? Yes, you have, with `.split()`, `.join()`, etc. Functions that follow a variable and a dot are called methods because they are an extension of the variable's type. For instance, `split()` and `join()` are extensions of string objects, and they use the string's value as an argument.

**Properties** : Like dictionary keys, properties are named values of an object, but instead of using the square-bracket syntax, you just put the name after a dot : `a.property`
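
A short illustration with objects you already know, strings and dates. The date is only a stand-in : the year comes from the edition cited earlier, while the day and month are placeholders.

```python
import datetime

verse = "Arma virumque cano"

# Methods : functions attached to the string, called with a dot and parentheses
words = verse.split(" ")
rejoined = "-".join(words)
print(words)
print(rejoined)

# Properties : values attached to an object, read with a dot and no parentheses
publication = datetime.date(1900, 1, 1)
print(publication.year)
```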

**Warning : namespaces** : In lxml, namespaces are expressed using the Clark notation. This means that, if a node belongs to a namespace, it will be named using the following syntax : `{namespace}tagname`. Here is an example :

In [10]:
# With no namespace
print(etree.fromstring("<root />"))
# With namespace
print(etree.fromstring("<root xmlns='http://localhost' />"))

<Element root at 0x7f2aa80d4108>
<Element {http://localhost}root at 0x7f2aa80d4188>
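
If you need the two parts of such a name separately, lxml provides the `etree.QName` helper. A minimal sketch, reusing the namespaced root from the cell above :

```python
from lxml import etree

root = etree.fromstring("<root xmlns='http://localhost' />")

# QName splits a Clark-notation tag into its two parts
qualified = etree.QName(root)
print(qualified.namespace)
print(qualified.localname)
```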


You can do plenty of things using lxml and access properties or methods of nodes. Here is a summary of the reading features offered by lxml :

![Cheatsheet](images/CheatsheetElement.svg)

Let's see what it means in real life :

In [32]:
# First, we will need some xml
xml = """
<div type="Book" n="1">
    <l n="1">Arma virumque cano, Troiae qui primus ab oris</l>
    <tei:l n="2" xmlns:tei="http://www.tei-c.org/ns/1.0">Italiam, fato profugus, Laviniaque venit</tei:l>
    <l n="3">litora, multum ille et terris iactatus et alto</l>
    <l n="4">vi superum saevae memorem Iunonis ob iram;</l>
    <l n="5">multa quoque et bello passus, dum conderet urbem,</l>
    <l n="6">inferretque deos Latio, genus unde Latinum,</l>
    <l n="7">Albanique patres, atque altae moenia Romae.</l>
</div>
"""
div = etree.fromstring(xml)
print(div)

<Element div at 0x7f2aa80cfb08>


If we want to retrieve the attributes of our div, we can do as follows :

In [33]:
type_div = div.get("type")
print(type_div)
# If we want a dictionary of attributes
attributes_div = dict(div.attrib)
print(attributes_div)
# Or if we want a list of (name, value) pairs
list_attributes_div = div.items()
print(list_attributes_div)

Book
{'n': '1', 'type': 'Book'}
[('type', 'Book'), ('n', '1')]
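
What happens when an attribute does not exist ? The sketch below, on a made-up empty div, shows that `get()` returns `None` or a fallback value of your choice, and that `attrib` can be tested like a dictionary. The `subtype` attribute name is only an illustration.

```python
from lxml import etree

div = etree.fromstring('<div type="Book" n="1" />')

# get() returns None when the attribute is absent...
missing = div.get("subtype")
print(missing)
# ...or a fallback value of your choice
fallback = div.get("subtype", "unknown")
print(fallback)
# attrib supports the usual dictionary tests
print("n" in div.attrib)
```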


Great ! We accessed our first information using lxml ! Now, how about getting somewhere other than the root tag ? To do so, there are two ways :

- `div.getchildren()` will return a list of child tags. Be aware that lxml documents this method as deprecated.
- `list(div)` will transform div into a list of its children.

Both syntaxes give the same result, so it's up to you to decide which one you prefer, although `list(div)` is the one lxml recommends for new code.

In [34]:
children = div.getchildren()
line_1 = children[0] # Because it's a list we can access children through index
print(line_1)

<Element l at 0x7f2aa80db688>
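
An element also behaves much like a list by itself, without calling `getchildren()` first. A minimal sketch on a shortened, made-up div :

```python
from lxml import etree

div = etree.fromstring('<div><l n="1">Arma</l><l n="2">Italiam</l><l n="3">litora</l></div>')

print(len(div))          # how many children the element has
last_line = div[-1]      # negative indexes work as with lists
print(last_line.get("n"))
first_two = div[:2]      # a slice gives back a plain list of children
print([child.get("n") for child in first_two])
```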


Now that we have access to our children, we can have access to their text :

In [35]:
print(line_1.text)

Arma virumque cano, Troiae qui primus ab oris
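
Be careful when a tag contains both text and other tags : `.text` only holds the text found before the first child. The text that follows a child is stored on that child, in `.tail`. The sketch below uses a made-up line where one word is wrapped in a `<w>` tag, and `itertext()` to collect every piece.

```python
from lxml import etree

line = etree.fromstring("<l>Arma <w>virumque</w> cano</l>")
word = line[0]

print(repr(line.text))   # text before the first child
print(repr(word.text))   # text inside <w>
print(repr(word.tail))   # text after </w>, stored on the child

# itertext() walks through all the text pieces in order
full_text = "".join(line.itertext())
print(full_text)
```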


Ok, we are now able to get some stuff done. Remember the namespace naming ? Sometimes, it will be helpful to retrieve namespaces and their prefixes :

In [38]:
line_2 = children[1]
print(line_2.nsmap)
print(line_2.prefix)
print(line_2.tag)

{'tei': 'http://www.tei-c.org/ns/1.0'}
tei
{http://www.tei-c.org/ns/1.0}l
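
The Clark notation is also what you use to look for a namespaced child by name. A minimal sketch with `find()`, on a shortened copy of the div above :

```python
from lxml import etree

xml = """
<div type="Book" n="1">
    <l n="1">Arma virumque cano, Troiae qui primus ab oris</l>
    <tei:l n="2" xmlns:tei="http://www.tei-c.org/ns/1.0">Italiam, fato profugus, Laviniaque venit</tei:l>
</div>
"""
div = etree.fromstring(xml)

# Clark notation : the namespace URI in braces, then the tag name
tei_line = div.find("{http://www.tei-c.org/ns/1.0}l")
print(tei_line.get("n"))

# Without the namespace, only the plain <l> matches
plain_line = div.find("l")
print(plain_line.get("n"))
```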


**DIY**

Can you print the complete text of those lines ? (Hint : you will need to loop over something)

In [39]:
#Write your code here

## 3. XPath and XSLT with lxml

### XPath

XPath is a powerful tool for traversing an xml tree. It gives you the ability to avoid

### XSLT

## 4. Use case : how to cut a text locally into chunks

## Exercises :

## Going further



---

<p><small><a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">Python Programming for the Humanities</span> by <a xmlns:cc="http://creativecommons.org/ns#" href="http://fbkarsdorp.github.io/python-course" property="cc:attributionName" rel="cc:attributionURL">http://fbkarsdorp.github.io/python-course</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. Based on a work at <a xmlns:dct="http://purl.org/dc/terms/" href="https://github.com/fbkarsdorp/python-course" rel="dct:source">https://github.com/fbkarsdorp/python-course</a>.</small></p>