# Session 5: XML

In this session we give a quick introduction to **XML**. We then walk through common uses of `beautifulsoup4` to read, manipulate and write XML data with Python.

## XML and HTML

Both **XML** and **HTML** are markup languages. A markup language is a system for annotating a document in such a way that the annotation is syntactically distinguishable from the content. Why does that matter? We normally want to keep the text separate from metatextual information such as metadata, linguistic annotation, formatting or content description.

**XML** and **HTML** are two well-known markup languages, and they are very similar. Both descend from SGML, and documents in both can be represented with the DOM. However, **HTML** has a predefined, closed set of tags, and its specification tells web browsers how to present web content, whereas **XML** is not restricted to a particular set of elements or a particular purpose: users define the structure of the document, its elements, its attributes, and so on.

Most of what we will learn for XML also applies to HTML (loosely speaking, HTML can be regarded as a fixed vocabulary in the same family as the more general XML, and XHTML is literally HTML written as XML), and there are plenty of resources on the web for learning HTML, so we will focus on XML.

### Documents as trees

DOM stands for *Document Object Model*. It specifies how **HTML** and **XML** documents are represented as a structure of objects, and how that structure is accessed to create, edit or remove content.

We can think of DOM as a tree structure:

In [None]:
<?xml version="1.0" encoding="UTF-8"?>
<TextCorpus lang="de">
    <text>Karin fliegt nach New York. Sie will dort Urlaub machen.</text>
    <tokens>
        <token ID="t_0">Karin</token>
        <token ID="t_1">fliegt</token>
        <token ID="t_2">nach</token>
        <token ID="t_3">New</token>
        <token ID="t_4">York</token>
        <token ID="t_5">.</token>
        <token ID="t_6">Sie</token>
        <token ID="t_7">will</token>
        <token ID="t_8">dort</token>
        <token ID="t_9">Urlaub</token>
        <token ID="t_10">machen</token>
        <token ID="t_11">.</token>
    </tokens>
</TextCorpus>
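
To make the tree reading concrete, here is a minimal sketch that parses a shortened version of the document above with the standard library's `xml.etree.ElementTree` and lists every element indented by its depth. The helper name `tree_lines` is only illustrative, and the shortened document is an assumption made to keep the output small.

```python
import xml.etree.ElementTree as ET

# Shortened version of the TextCorpus document shown above
doc = """<TextCorpus lang="de">
    <text>Karin fliegt nach New York.</text>
    <tokens>
        <token ID="t_0">Karin</token>
        <token ID="t_1">fliegt</token>
    </tokens>
</TextCorpus>"""


def tree_lines(element, depth=0):
    """Return one line per element, indented by its depth in the tree."""
    lines = ["    " * depth + element.tag]
    for child in element:  # iterating over an element yields its children
        lines.extend(tree_lines(child, depth + 1))
    return lines


root = ET.fromstring(doc)  # the root element is the top of the tree
outline = tree_lines(root)
print("\n".join(outline))
```

The root `TextCorpus` has two children (`text` and `tokens`), and `tokens` in turn is the parent of every `token`.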

### XML

**XML** stands for E**X**tensible **M**arkup **L**anguage. It was designed to store and transport data, and to be both human- and machine-readable. Unlike in HTML, the structure of the document, the elements, their attributes and the content are not predefined, which makes XML a very flexible framework.

> XML is a generalized way of describing hierarchical structured data. An xml document contains one or more **elements**, which are delimited by **start** and **end** **tags**.

In [None]:
<s>This is a sentence.</s>

> Elements can be nested to any depth. An element inside another one is said to be a subelement or **child**. The first element in every xml document is called the **root** element. An xml document can only have one root element.

In [None]:
<s>
    <token>This</token>
    <token>is</token>
    <token>a</token>
    <token>sentence</token>
    <token>.</token>
</s>

> Elements can have **attributes**, which are name-value pairs. Attributes are listed within the start tag of an element and separated by whitespace. Attribute names can not be repeated within an element. Attribute values must be quoted. You may use either single or double quotes.

In [None]:
<s id="s_0">
    <token pos1="DT" pos2="DET">This</token>
    <token pos1="VBZ" pos2="VERB">is</token>
    <token pos1="DT" pos2="DET">a</token>
    <token pos1="NN" pos2="NOUN">sentence</token>
    <token pos1="." pos2="PUNCT">.</token>
</s>

> If an element has more than one attribute, the ordering of the attributes is not significant. An element’s attributes form an unordered set of keys and values, like a Python dictionary. There is no limit to the number of attributes you can define on each element.

In [None]:
<s id="s_0">
    <token pos1="DT" pos2="DET">This</token>
    <token pos2="VERB" pos1="VBZ">is</token>
    <token pos1="DT" pos2="DET">a</token>
    <token pos2="NOUN" pos1="NN">sentence</token>
    <token pos1="." pos2="PUNCT">.</token>
</s>
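
The comparison with a Python dictionary can be checked directly. The sketch below uses the standard library's `xml.etree.ElementTree`, which exposes an element's attributes as a plain `dict` through `.attrib`; the two `token` strings are taken from the example above.

```python
import xml.etree.ElementTree as ET

# The same token written with its attributes in two different orders
first = ET.fromstring('<token pos1="VBZ" pos2="VERB">is</token>')
second = ET.fromstring('<token pos2="VERB" pos1="VBZ">is</token>')

# .attrib is a dict, and dict equality ignores the order of the keys
same_attributes = first.attrib == second.attrib
print(first.attrib["pos1"], second.attrib["pos1"], same_attributes)
```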

> Elements can have **text content**. Elements that contain no text and no children are **empty**. Elements that contain text and children elements are said to contain **mixed content**.

In [None]:
<s>This is a sentence.</s>

In [None]:
<comment type="gesture"/>

In [None]:
<s>This is a sentence with <italics>mixed</italics> content.</s>
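
Mixed content is where parsers tend to surprise people, so here is a small sketch of how the standard library's `xml.etree.ElementTree` splits the sentence above. In that API the text before the first child is stored in the parent's `.text`, and the text that follows a child's end tag is stored in the child's `.tail`.

```python
import xml.etree.ElementTree as ET

s = ET.fromstring("<s>This is a sentence with <italics>mixed</italics> content.</s>")
italics = s.find("italics")

before = s.text        # text of <s> up to its first child
inside = italics.text  # text content of <italics>
after = italics.tail   # text that follows </italics> inside <s>

# itertext() walks all text pieces in document order
full_text = "".join(s.itertext())
print(repr(before), repr(inside), repr(after))
print(full_text)
```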

> Finally, xml documents can contain character encoding information on the first line, before the root element.

In [None]:
<?xml version="1.0" encoding="UTF-8"?>
<s>
    <token>This</token>
    <token>is</token>
    <token>a</token>
    <token>sentence</token>
    <token>.</token>
</s>

(Mark Pilgrim. *Dive Into Python 3*. <http://www.diveintopython3.net/xml.html>)

### Well-formed and valid

Web browsers are quite lenient with **HTML** that is not well-formed or not valid: they will try to figure out how to render a page even if it contains errors. **XML** is different. A conforming **XML** parser must stop with an error when a document is not well-formed, so a single error is enough to break your **XML** application.

Therefore, whenever you work with markup languages, check that your material is error-free. Follow this piece of advice and you will avoid a lot of headaches in the future.

#### Well-formed documents

A document is well-formed if it is compliant with some minimal requirements:

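- the document may start with an XML declaration (`<?xml version="1.0" encoding="UTF-8"?>`); it is optional in XML 1.0 but recommended, and a document type declaration is not required either
- a single element, known as the root element, contains all the other elements in the document
- all elements are well formed, that is, they are:
    - opened and subsequently closed, or
    - if empty, properly terminated, and
    - properly nested so that they do not overlap
- `<` and `&` are only used as markup (to start a tag or an entity reference). To use them as characters in the document, write the entities `&lt;` and `&amp;` instead. Entities also exist for `>`, `"` and `'` (`&gt;`, `&quot;`, `&apos;`); a quote character has to be escaped when it appears inside an attribute value delimited by the same kind of quote.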
- there are rules about the characters that can be used in element names and elsewhere
- tags are case-sensitive
- attribute values have to be quoted
- it contains only properly encoded legal Unicode characters
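
A quick way to test several of these requirements is to hand a string to a parser and see whether it complains. The sketch below is a rough check, not a full conformance test: it assumes that "accepted by `xml.etree.ElementTree`" is a good enough stand-in for "well-formed", and the helper name `is_well_formed` is hypothetical. It also uses `xml.sax.saxutils.escape` to replace `&`, `<` and `>` with their entities.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(xml_string):
    """Return True if the parser accepts the string, False otherwise."""
    try:
        ET.fromstring(xml_string)
    except ET.ParseError:
        return False
    return True


good = is_well_formed("<s><token>This</token></s>")
overlapping = is_well_formed("<s><b><i>text</b></i></s>")  # elements overlap
case_mismatch = is_well_formed("<s>text</S>")              # tags are case-sensitive
bare_ampersand = is_well_formed("<s>Tom & Jerry</s>")      # & used as a character
escaped = is_well_formed("<s>" + escape("Tom & Jerry") + "</s>")

print(good, overlapping, case_mismatch, bare_ampersand, escaped)
```

Only the first and the last string are accepted; the other three each break one of the rules in the list.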

#### Valid documents

**HTML** documents have to conform to a particular specification in which only a closed set of elements and attributes, with particular contents and data types, is allowed. Use anything else and a validator will report an error, even if browsers still render the page.

However, the structure and contents of **XML** documents can and have to be defined. The rules describing those aspects are defined in a DTD (Document Type Definition) or **XML** schema. A document is valid if:

- it is well-formed, and
- it observes the rules dictated by its DTD or **XML** schema.

If used properly, **XML** schemas can help you to detect annotation inconsistencies and errors (especially helpful if you are working with data created manually by humans).

There are different ways to define documents out there. My favorite schema language is **Relax NG compact**: it is quite easy to understand, write, and read. It is much more powerful than DTDs, but at the same time easier than other **XML** schema languages.

#### Validating XML

Some **XML** editors allow the validation of **XML** files using a DTD or **XML** schema.

<!-- provide the a XML file with its rnc -->

<!-- just work with the TCF Validator, to illustrate the point will be enough https://weblicht.sfs.uni-tuebingen.de/tcf-validator/ -->

If you have many files to validate, you probably want to use a command line tool like `xmllint`. It is included in [`libxml2`](<http://www.xmlsoft.org/downloads.html>), a library that you need anyway if you want to work with the `lxml` package locally.

If you work with Relax NG Schema compact, you can use [`jing`](<https://github.com/relaxng/jing-trang>). There is also a python wrapper for jing-trang tools <https://pypi.python.org/pypi/jingtrang>.
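
Validation can also be done from Python. The following is a minimal sketch using `lxml` (a third-party package, so it has to be installed) and a small DTD. The DTD itself is hypothetical and written only for this illustration: a sentence `s` holds one or more `token` elements, and every `token` needs a `pos` attribute.

```python
from io import StringIO

from lxml import etree

# Hypothetical DTD, invented for this example
dtd = etree.DTD(StringIO("""
<!ELEMENT s (token+)>
<!ELEMENT token (#PCDATA)>
<!ATTLIST token pos CDATA #REQUIRED>
"""))

# Both documents are well-formed, but only the first follows the DTD
with_pos = etree.fromstring('<s><token pos="DT">This</token></s>')
without_pos = etree.fromstring("<s><token>This</token></s>")

first_is_valid = dtd.validate(with_pos)
second_is_valid = dtd.validate(without_pos)
print(first_is_valid, second_is_valid)
```

The second document parses without problems, which shows the difference between the two notions: it is well-formed, yet not valid against this DTD.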

## Python and XML/HTML

<!-- illustrate why trying to 'parse' literally or with regexp does not work: indentation, context, attribute order, attribute ambiguity... -->

### Packages

Python includes different markup processing tools in its standard library.

<https://docs.python.org/3.5/library/markup.html>

- `html`, which provides `html.parser`, a simple **HTML** and **XHTML** parser.
- `xml.etree.ElementTree`, a simple and lightweight **XML** parser. It works well, it is fast, and it has a Pythonic API.
- `xml.dom`, a [DOM](https://en.wikipedia.org/wiki/Document_Object_Model) **XML** parser. The DOM operates on the documents as a whole.
- `xml.sax`, a [SAX](https://en.wikipedia.org/wiki/Simple_API_for_XML) **XML** parser. The SAX parser operates on each piece of the **XML** document sequentially.
- `xml.parsers.expat`, the Expat parser binding.
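
The difference between the DOM and the SAX approach is easier to see side by side. The sketch below reads the same small document twice with standard library modules: `xml.dom.minidom` builds the whole tree first and then queries it, while `xml.sax` calls a handler for each start tag as the document streams by. The handler class `TokenCounter` is a hypothetical name chosen for this example.

```python
import xml.sax
from xml.dom import minidom

doc = "<s><token>This</token><token>is</token><token>short</token></s>"

# DOM: the whole document is loaded into memory, then queried
dom = minidom.parseString(doc)
dom_tokens = [node.firstChild.data for node in dom.getElementsByTagName("token")]


class TokenCounter(xml.sax.ContentHandler):
    """Count <token> start tags while the document is read sequentially."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "token":
            self.count += 1


# SAX: no tree is built, the handler only sees one event at a time
counter = TokenCounter()
xml.sax.parseString(doc.encode("utf-8"), counter)

print(dom_tokens, counter.count)
```

The DOM style is more convenient when you need to move around the document; the SAX style keeps memory use low for very large files.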

There are also a few packages not included in the standard library which are very useful:

- [`lxml`](http://lxml.de), which uses `libxml2` and `libxslt` libraries. It parses **XML** and **HTML** and it is very fast. It follows the ElementTree API. Moreover, it adds interesting features like `XPath`, `XSLT` and much more.
- [`html5lib`](https://github.com/html5lib/html5lib-python), an **HTML** parser that creates valid **HTML5**, and parses pages like a browser does (extremely lenient).
- [`beautifulsoup4`](http://www.crummy.com/software/BeautifulSoup), a Python library for pulling data out of **HTML** and **XML** files. It provides idiomatic ways of navigating, searching and modifying the parse tree. You can use different parsers under the hood (like the excellent `lxml` and `html5lib`). You just learn one API and you use it for all the parsers.

### Setting up the environment

We are going to use `lxml` and `beautifulsoup4`. These packages are not part of Python's standard library, so we need to install them with `pip`, the Python package manager.

Go to the Shell Terminals in the cloud.

Install `lxml`:

In [None]:
sudo rpm --rebuilddb && sudo yum install -y libxml2-devel libxslt-devel && pip install lxml

Then, install `beautifulsoup4`:

In [None]:
pip install beautifulsoup4

> If you are working locally, the procedure to install `lxml` will be different. Check [Installing lxml](http://lxml.de/installation.html) for more information.

## Working with XML

### Read XML from a string
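
As a starting point, here is a minimal sketch of how a string can be turned into a soup. It assumes that `beautifulsoup4` and `lxml` are installed as described above; the string is a shortened version of the `TextCorpus` example from the beginning of the session, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

xml_string = """<TextCorpus lang="de">
    <text>Karin fliegt nach New York.</text>
    <tokens>
        <token ID="t_0">Karin</token>
        <token ID="t_1">fliegt</token>
    </tokens>
</TextCorpus>"""

# The second argument selects the parser; "xml" uses lxml's XML parser
soup = BeautifulSoup(xml_string, "xml")

# prettify() returns the parsed document as an indented string
print(soup.prettify())
```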

Cool! That's nice. Let's get familiar with our soup.

### Common operations
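
The sketch below collects a few operations that come up constantly when working with a soup: finding elements, filtering by attribute, reading attribute values and text, and moving to neighbours. It assumes `beautifulsoup4` with the `lxml` parser; the small sentence is made up for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<s id="s_0">'
    '<token pos="DT">This</token>'
    '<token pos="VBZ">is</token>'
    '<token pos="DT">a</token>'
    '<token pos="NN">sentence</token>'
    '</s>',
    "xml",
)

sentence = soup.find("s")                                  # first matching element
tokens = soup.find_all("token")                            # list of all matches
determiners = soup.find_all("token", attrs={"pos": "DT"})  # filter by attribute value

sentence_id = sentence["id"]                  # attribute access works like a dict
missing = sentence.get("lang")                # .get() returns None if absent
words = [token.get_text() for token in tokens]

last = tokens[-1]
before_last = last.find_previous_sibling("token")  # neighbour on the same level
parent_name = last.parent.name                     # name of the enclosing element

print(sentence_id, words, before_last.get_text(), parent_name)
```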



### Read an XML file

TCF format <http://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/The_TCF_Format>
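
Reading from a file works the same way as reading from a string, because `BeautifulSoup` also accepts an open file handle. The sketch below first writes a tiny stand-in file to a temporary location so that it runs anywhere; in practice you would set `path` to your own file. Opening the file in binary mode is a choice made here so that the parser can use the encoding declared inside the file.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Stand-in file for the example; replace `path` with your own XML file
content = '<?xml version="1.0" encoding="UTF-8"?>\n<s><token>Grüße</token></s>\n'
with tempfile.NamedTemporaryFile(
    "w", suffix=".xml", encoding="utf-8", delete=False
) as tmp:
    tmp.write(content)
    path = tmp.name

# The with block makes sure the file is closed after parsing
with open(path, "rb") as handle:
    soup = BeautifulSoup(handle, "xml")

os.remove(path)  # clean up the stand-in file

first_token = soup.find("token").get_text()
print(first_token)
```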

### Extracting Info

### Challenges

- reconstruct the text from `tokens`, so `text == tokens`
- find all nouns, print the POS tag
- find all nouns, retrieve word forms, print the text of the token
- find all nouns, check condition, retrieve POS tag of previous token


### Manipulating XML
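
A soup can be edited in place. The following sketch shows four typical edits with `beautifulsoup4` (again assuming the `lxml` parser): replacing the text of an element, adding an attribute, removing an element together with its content, and stripping a tag while keeping its text. The markup and the attribute name `lang` are invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<s><token ID="t_0">This</token><note>draft</note>'
    '<token ID="t_1">is <b>bold</b></token></s>',
    "xml",
)

first, second = soup.find_all("token")

first.string = first.string.upper()  # replace the text content
first["lang"] = "en"                 # add (or overwrite) an attribute

soup.find("note").decompose()        # remove the element and everything in it
soup.find("b").unwrap()              # remove the tag but keep its content

print(soup.s)
```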

### Challenge

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('tcf04-karin-wl.xml', 'r', encoding='utf-8'), 'xml')

- lower case all tokens
- change the `ID` value of each `tag` element by replacing its prefix: instead of `pt` it should now be `postag`.
- add `length` in characters to `token`
- add `length` in tokens to `sentence`
- for each `token` add an attribute called `pos` which should be its value from `tag`.
- renumber tokens, instead of starting at `0` they should start at `1`
- remove elements
- strip tags

### Creating an XML file
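
New documents can be built with the same API. The sketch below is one possible approach, not the only one: it starts from a soup that contains just an empty root element, creates children with `soup.new_tag()`, and writes the result to a file. The element names `corpus` and `s`, the sentences and the output location are all made up for this illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Start from a document that only contains the (empty) root element
soup = BeautifulSoup('<corpus lang="en"/>', "xml")
corpus = soup.find("corpus")

sentences = ["First sentence.", "Second sentence."]
for number, text in enumerate(sentences, start=1):
    sentence = soup.new_tag("s", attrs={"id": "s_{}".format(number)})
    sentence.string = text   # set the text content of the new element
    corpus.append(sentence)  # attach it as the last child of the root

xml_output = str(soup)

# Write the document to a file (a temporary directory in this example)
path = os.path.join(tempfile.mkdtemp(), "corpus.xml")
with open(path, "w", encoding="utf-8") as out:
    out.write(xml_output)

print(soup.prettify())
```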

### Challenge

- convert a list of tokens into XML (tokens with attributes)