# Chapter 2: Working with XML: reading

## 1. XML, XPath and DOM

XML, which stands for eXtensible Markup Language, is a markup language designed for describing data. Unlike HTML, XML does not impose a fixed vocabulary of tags: its only requirement is that the markup be well-formed. You can invent your own tags or use tags defined by an ODD, a set of guidelines and tags designed for collaboration and shared practices.

You may have heard of TEI (the Text Encoding Initiative). TEI is a rich, complex and comprehensive XML vocabulary for encoding texts.

**Write more here**

![XML, source of our everyday](images/XML.svg)

## 2\. Parsing XML with Python

As with querying the web, Python has many libraries for working with XML. You will most likely encounter the following during your Pythonic journey:

- **lxml**, which we will use for this course. A clean, fairly fast and strict library for dealing with XML resources. It is the most widely adopted library for this kind of task: if IBM writes [tutorials for it](http://www.ibm.com/developerworks/library/x-hiperfparse/), it should be good. It supports XPath and XSLT.
- **BeautifulSoup**. Flexible, average speed. The good thing about it? If your markup is messed up, it will try to correct it. It's perfect for dealing with web-scraped data in HTML formats. For clean XML, it might be too slow.
- **xml**: the package shipped with Python's standard library. Fast and clean, but without extras such as full XPath support or XSLT.
- See some others on the [official Python wiki](https://wiki.python.org/moin/PythonXml)


From experience, lxml appears to be the one that will fit most of your needs when dealing with clean data. Clean is the key word here: do not expect lxml to play well with bad HTML or bad XML. By default, it will just throw errors at you until you give up or fix the markup by hand.
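
The strictness described above can be seen in a small sketch. The broken snippet below is a made-up example (it is not taken from the course data): its `<line>` element is never closed, so parsing with the default settings raises `etree.XMLSyntaxError`.

```python
from lxml import etree

# Hypothetical snippet: the <line> element is never closed
broken = b"<poem><line>Arma virumque cano</poem>"

try:
    etree.fromstring(broken)
    parsed_ok = True
except etree.XMLSyntaxError as error:
    parsed_ok = False
    # The message describes the problem and where it was found
    print("lxml refused the document:", error)
```

Catching the exception like this is mostly useful for reporting which file is faulty. The markup itself still has to be fixed.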

The same way we have imported requests, we will now import lxml.etree.

In [7]:
from lxml import etree

### From file to XML object

Opening an XML file is actually quite simple: you open it and you parse it. Who would have guessed?

In [None]:
# We open our file in binary mode, so lxml can read the encoding from the XML declaration itself
with open("data/phi1294.phi002.perseus-lat2.xml", "rb") as file:
    # We use the etree.parse function
    parsed = etree.parse(file)
# We print the object
print(parsed)

As you can see, we obtained an instance of type lxml.etree.\_ElementTree. It means the xml markup has been transformed into something Python understands.
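
To give an idea of what can be done with such an object, here is a minimal sketch. It assumes a tiny in-memory document (made up for the illustration) instead of the course file, and uses `getroot()` to reach the top-level element.

```python
import io
from lxml import etree

# Hypothetical in-memory document standing in for a file on disk
fake_file = io.BytesIO(b"<text><body><p>Lorem ipsum</p></body></text>")
tree = etree.parse(fake_file)

root = tree.getroot()   # the top-level element of the tree
print(root.tag)         # name of the root element
print(len(root))        # number of direct children
# Serialize the element back to a string of markup
print(etree.tostring(root, encoding="unicode"))
```

The tree object represents the whole document, while the element returned by `getroot()` is where navigation usually starts.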

The *parse* function of *etree* does not take many arguments. One way to customize its behaviour is to pass it a parser that you have configured yourself:

In [None]:
# We create a new parser from etree, asking it to remove text nodes that contain only whitespace
parser = etree.XMLParser(remove_blank_text=True)
# We open the file
with open("data/phi1294.phi002.perseus-lat2.xml") as file:
    # And we parse using the new parser
    parsed = etree.parse(file, parser)
# We print the object
print(parsed)
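
Printing the tree object does not show what the custom parser changed. The sketch below, which assumes a small made-up document rather than the course file, compares the serialized output with and without `remove_blank_text`.

```python
from lxml import etree

# Hypothetical document, indented with whitespace-only text nodes
source = b"<root>\n  <item>one</item>\n  <item>two</item>\n</root>"

# Default parser: the indentation is kept as text
default_root = etree.fromstring(source)
# Custom parser: whitespace-only text nodes are dropped
compact_root = etree.fromstring(source, etree.XMLParser(remove_blank_text=True))

default_xml = etree.tostring(default_root)
compact_xml = etree.tostring(compact_root)
print(default_xml)
print(compact_xml)
```

The second output has no line breaks or indentation left between the elements, which is handy if you later want lxml to re-indent the document itself.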

From the [documentation](http://lxml.de/parsing.html#parser-options) of the XMLParser class, a few arguments might be useful to you:

- *attribute_defaults* : Use DTD (if available) to add the default attributes
- *dtd_validation* : Validate against DTD while parsing
- *load_dtd* : Load and parse the DTD while parsing
- *ns_clean* : Clean up redundant namespace declarations
- *recover* : Try to parse ill-formed XML anyway
- *remove_blank_text* : Removes blank text nodes
- *resolve_entities* : Replace entities by their value (Default : on)
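
As an illustration of one of these options, here is a minimal sketch of *recover*. The broken input is a made-up example, and what the parser manages to rebuild depends on the kind of damage, so treat the result as a best effort rather than a guarantee.

```python
from lxml import etree

# Hypothetical ill-formed input: <a> is never closed
broken = b"<root><a>text</root>"

lenient = etree.XMLParser(recover=True)
root = etree.fromstring(broken, lenient)

# Show what the parser managed to rebuild
print(etree.tostring(root))

# The parser keeps a log of the problems it met during the last run
for entry in lenient.error_log:
    print(entry.line, entry.message)
```

Reading the error log is worthwhile even when parsing succeeds: a recovered tree may silently differ from what the author of the file intended.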

You can therefore build a parser tailored to your needs, for example one that validates against a DTD or one that cleans up namespace declarations. In the latter case, *ns_clean* would transform


`<root xmlns:a="xmlns1" xmlns:b="xmlns2"><tag xmlns:c="xmlns3" /><tag xmlns:a="xmlns1" /><tag /></root>`

into

`<root xmlns:a="xmlns1" xmlns:b="xmlns2"><tag xmlns:c="xmlns3" /><tag/><tag /></root>`
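
The transformation above can be reproduced with a short sketch. It feeds the same markup to a default parser and to one created with `ns_clean=True`, then compares the serialized results. The variable names are only illustrative.

```python
from lxml import etree

source = (
    b'<root xmlns:a="xmlns1" xmlns:b="xmlns2">'
    b'<tag xmlns:c="xmlns3" /><tag xmlns:a="xmlns1" /><tag />'
    b'</root>'
)

# Default parser: the redundant declaration on the second <tag> is kept
plain = etree.tostring(etree.fromstring(source))
# ns_clean parser: declarations already in scope are dropped
cleaned = etree.tostring(etree.fromstring(source, etree.XMLParser(ns_clean=True)))

print(plain)
print(cleaned)
```

Only the repeated `xmlns:a` declaration disappears. `xmlns:c` stays, because it is not declared anywhere higher up in the tree.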

### From string to XML object
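
When the XML is already in memory, for instance as the body of a web response, a minimal sketch could rely on `etree.fromstring`. The markup below is a made-up example. Note that `fromstring` returns the root element directly rather than a tree object, and that passing bytes avoids problems with documents that carry an encoding declaration.

```python
from lxml import etree

# Hypothetical document held in memory as bytes
xml_bytes = b'<?xml version="1.0" encoding="UTF-8"?><greeting lang="la">Salve</greeting>'

# fromstring returns the root element of the parsed document
element = etree.fromstring(xml_bytes)
print(element.tag, element.get("lang"), element.text)

# The enclosing tree object is still reachable if needed
tree = element.getroottree()
print(tree)
```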

### Errors and understanding them

### Namespaces

## 3\. XPath with lxml

### XPath

### Nodes attributes

### Nodes methods

## 4\. XSLT with lxml

## 5\. Use case: how to cut a text into chunks locally

## Exercises

## Going further


-----

In [19]:
# You can ignore this cell: it is only here to make the page nicer.

from IPython.display import HTML
def css_styling():
    # Read the stylesheet and close the file properly afterwards
    with open("styles/custom.css", "r") as css_file:
        styles = css_file.read()
    return HTML(styles)
css_styling()

---

<p><small><a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">Python Programming for the Humanities</span> by <a xmlns:cc="http://creativecommons.org/ns#" href="http://fbkarsdorp.github.io/python-course" property="cc:attributionName" rel="cc:attributionURL">http://fbkarsdorp.github.io/python-course</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. Based on a work at <a xmlns:dct="http://purl.org/dc/terms/" href="https://github.com/fbkarsdorp/python-course" rel="dct:source">https://github.com/fbkarsdorp/python-course</a>.</small></p>