# XML Processing

Up until now, we either saved our data into regular text files or into professional databases. Sometimes, however, our script is quite small and doesn't need a full database, but we still want to structure our data in files. For this, we can use XML.

XML stands for <b>*Extensible Markup Language*</b> and is a language that allows us to structure our data hierarchically in files. It is **platform-independent** and also **application-independent**: XML files that you create with a Python script can be read and processed by a C++ or Java application.

## XML Parser

In this chapter, we look at two of the XML APIs that ship with Python's standard library: **SAX** and **DOM**.

## Simple API for XML (SAX)

SAX stands for <b>*Simple API for XML*</b> and is better suited for large XML files or for situations where we have very little RAM. This is because in this mode we never load the full file into memory. We read the file as a stream and only keep the small part that we need at the moment. A side effect of this is that we can only read from the file; we cannot manipulate it or change values.
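To make the streaming idea concrete, here is a minimal sketch that counts tags while the parser walks through the document. The class name `TagCounter` and the inline sample data are assumptions made for illustration; with a real large file we would pass a file name to the parser instead.

```python
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Counts opening tags without keeping any element in memory."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called once per opening tag, in document order.
        self.counts[name] = self.counts.get(name, 0) + 1


# Hypothetical sample data, kept tiny so the sketch is self-contained.
SAMPLE = b"<group><person id='1'/><person id='2'/></group>"

counter = TagCounter()
xml.sax.parseString(SAMPLE, counter)
print(counter.counts)  # {'group': 1, 'person': 2}
```

The handler only ever holds the running counts, so its memory use does not grow with the size of the document.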

## Document Object Model (DOM)

DOM stands for <b>*Document Object Model*</b> and is the generally recommended option. It is a **language-independent API** for working with XML. Here we always load the full XML file into our RAM and keep it there as a hierarchical tree. Because of that, we can use all of the features and also manipulate the file.
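Because the whole tree is in memory, we can jump around in it in any direction. The following sketch assumes a small inline sample document (not the file used later) and walks from a nested element up to its parent and across to a sibling.

```python
import xml.dom.minidom

# Hypothetical sample data without whitespace between the tags.
SAMPLE = (
    "<group>"
    "<person id='1'><name>Max</name></person>"
    "<person id='2'><name>Sepp</name></person>"
    "</group>"
)

tree = xml.dom.minidom.parseString(SAMPLE)

# Start at the second <name> element, then move up and sideways.
second_name = tree.getElementsByTagName("name")[1]
owner = second_name.parentNode        # the enclosing <person>
first_person = owner.previousSibling  # the <person> before it

print(owner.getAttribute("id"), second_name.firstChild.data)  # 2 Sepp
print(first_person.getAttribute("id"))                        # 1
```

This kind of backward navigation is not possible with SAX, which only sees the document once, from top to bottom.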

### What to use?

Once the document is loaded, DOM gives us fast access to any element because the whole tree sits in RAM, whereas SAX has to read through the file again for every pass. However, if we have a very large XML file, we might not be able to load it into our RAM. In this case, we have to use SAX.

So, there is no reason not to use both options in the same project. We can choose depending on the situation.

## XML Structure

Let's take a look at this:

```xml
<group>
    <person id="1">
        <name>Max</name>
        <age>17</age>
        <weight>80</weight>
        <height>180</height>
    </person>
    <person id="2">
        <name>Sepp</name>
        <age>18</age>
        <weight>70</weight>
        <height>175</height>
    </person>
    <person id="3">
        <name>Nina</name>
        <age>16</age>
        <weight>55</weight>
        <height>165</height>
    </person>
</group>
```

This is a very simple XML file. We have a group of people and each person has a name, age, weight and height. The person also has an id. This is an attribute. We can use attributes to give additional information about an element. In this case, we use the id to identify the person.

I have already saved this as <code>file/group.xml</code>, so we can use it later.

## XML with SAX

In order to work with SAX, we first need to import the module:

In [2]:
import xml.sax

Now, we can set up SAX parsing with a <code>ContentHandler</code> and a <code>parser</code>:

In [3]:
handler = xml.sax.ContentHandler()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
parser.parse("file/group.xml")

First we create an instance of the <code>ContentHandler</code> class. Then we use the function <i>make_parser</i> to create a parser object.

After that we set our *handler* as the content handler of our parser. Finally, we parse the file by using the method <i>parse</i> of our parser object. Because the base <code>ContentHandler</code> ignores every event, this cell does not print anything yet.
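As a side note, the module also offers the convenience function `xml.sax.parse`, which creates the parser, attaches the handler and parses in one call. The sketch below uses an in-memory stream and a hypothetical `RootRecorder` handler so that it runs without any file on disk.

```python
import io
import xml.sax


class RootRecorder(xml.sax.ContentHandler):
    """Remembers the name of the first element it sees."""

    def __init__(self):
        super().__init__()
        self.root = None

    def startElement(self, name, attrs):
        if self.root is None:
            self.root = name


# Hypothetical sample data; a file name would work here as well.
source = io.BytesIO(b"<group><person id='1'/></group>")

handler = RootRecorder()
xml.sax.parse(source, handler)  # make_parser + setContentHandler + parse
print(handler.root)  # group
```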

## Content Handler Class

In [4]:
class GroupHandler(xml.sax.ContentHandler):
    def startElement(self, name, attrs):
        print(name)
        
handler = GroupHandler()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
parser.parse("file/group.xml")

group
person
name
age
weight
height
person
name
age
weight
height
person
name
age
weight
height


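OK, so we have created a very simple handler here. The parser calls <code>startElement</code> for every opening tag and passes in the tag name (<code>name</code>) and all of its attributes (<code>attrs</code>). For now, we only print the tag name.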
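The `attrs` argument can be read much like a dictionary. Here is a small sketch, assuming an inline sample and a hypothetical `AttributeCollector` class, that records the attributes of every element next to its tag name.

```python
import xml.sax


class AttributeCollector(xml.sax.ContentHandler):
    """Stores (tag name, attributes) pairs in document order."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attrs):
        # Copy the attributes into a plain dict so they can be kept around.
        self.seen.append((name, dict(attrs.items())))


# Hypothetical sample data.
SAMPLE = b"<group><person id='1'><name>Max</name></person></group>"

collector = AttributeCollector()
xml.sax.parseString(SAMPLE, collector)
for name, attributes in collector.seen:
    print(name, attributes)
# group {}
# person {'id': '1'}
# name {}
```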

## Processing XML Data

We will now edit the code above, make it a bit more complex and add two more methods.

In [5]:
class GroupHandler(xml.sax.ContentHandler):
    def startElement(self, name, attrs):
        self.current = name
        if self.current == "person":
            print("--- Person ---")
            person_id = attrs["id"]  # avoid shadowing the built-in id()
            print("ID:%s" % person_id)
    
    def endElement(self, name):
        if self.current == "name":
            print("Name:%s" % self.name)
        elif self.current == "age":
            print("Age:%s" % self.age)
        elif self.current == "weight":
            print("Weight:%s" % self.weight)
        elif self.current == "height":
            print("Height:%s" % self.height)
        self.current = ""
    
    def characters(self, content):
        if self.current == "name":
            self.name = content
        elif self.current == "age":
            self.age = content
        elif self.current == "weight":
            self.weight = content
        elif self.current == "height":
            self.height = content

handler = GroupHandler()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
parser.parse("file/group.xml")

--- Person ---
ID:1
Name:Max
Age:17
Weight:80
Height:180
--- Person ---
ID:2
Name:Sepp
Age:18
Weight:70
Height:175
--- Person ---
ID:3
Name:Nina
Age:16
Weight:55
Height:165


The first thing you will notice here is that we have three methods instead of one. When the parser starts processing an element, <code>startElement</code> gets called. Then <code>characters</code> receives the text between the tags, which here is the value of *name*, *age*, *weight* or *height*. When the parser reaches the closing tag, <code>endElement</code> gets called.

In this example, we first check whether the element is a *person*. If so, we print the *id* just for information. The <code>characters</code> method then checks which tag we are currently inside and saves the text in the matching attribute of the handler. In <code>endElement</code>, we print the value that was just saved.
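One detail worth knowing is that a SAX parser is allowed to deliver the text of a single element in several `characters` calls. A more defensive variant, sketched here with a hypothetical `TextBufferHandler` class and inline sample data, collects the pieces in a buffer and joins them in `endElement`. It assumes that the field elements only appear inside a `person`.

```python
import xml.sax


class TextBufferHandler(xml.sax.ContentHandler):
    """Collects each person into a dict, joining chunked text."""

    FIELDS = {"name", "age", "weight", "height"}

    def __init__(self):
        super().__init__()
        self.people = []
        self._buffer = []

    def startElement(self, name, attrs):
        self._buffer = []  # start collecting text for this element
        if name == "person":
            self.people.append({"id": attrs.get("id")})

    def characters(self, content):
        # May be called more than once per element, so only append here.
        self._buffer.append(content)

    def endElement(self, name):
        if name in self.FIELDS:
            self.people[-1][name] = "".join(self._buffer).strip()
        self._buffer = []


# Hypothetical sample data.
SAMPLE = b"<group><person id='1'><name>Max</name><age>17</age></person></group>"

handler = TextBufferHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.people)  # [{'id': '1', 'name': 'Max', 'age': '17'}]
```

Storing the results in a list instead of printing them right away also makes the handler easier to reuse.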

## XML with DOM

Like with SAX, we first need to import the module:

In [6]:
import xml.dom.minidom

When working with DOM, we need to create a so-called *DOM tree* and view all elements as collections or sequences.

In [7]:
domtree = xml.dom.minidom.parse("file/group.xml")

group = domtree.documentElement

We parse the XML file by using the function *parse*. This returns a DOM tree, which we save into a variable. Then we get the <code>documentElement</code> of our tree, which in our case is **group**, and save it into a variable as well.

In [8]:
persons = group.getElementsByTagName("person")
for person in persons:
    print("--- Person ---")
    if person.hasAttribute("id"):
        print("ID:%s" % person.getAttribute("id"))
    name = person.getElementsByTagName('name')[0]
    print("Name:%s" % name.childNodes[0].data)
    age = person.getElementsByTagName('age')[0]
    print("Age:%s" % age.childNodes[0].data)
    weight = person.getElementsByTagName('weight')[0]
    print("Weight:%s" % weight.childNodes[0].data)
    height = person.getElementsByTagName('height')[0]
    print("Height:%s" % height.childNodes[0].data)

--- Person ---
ID:1
Name:Max
Age:17
Weight:80
Height:180
--- Person ---
ID:2
Name:Sepp
Age:18
Weight:70
Height:175
--- Person ---
ID:3
Name:Nina
Age:16
Weight:55
Height:165


We get all the individual elements by using the method <code>getElementsByTagName</code>. This returns a list of all the elements with the given tag name. In our case, this is **person**.

By using the functions <code>hasAttribute</code> and <code>getAttribute</code>, we can check if an element has a certain attribute and get the value of it.

Finally, we print the values with <code>.childNodes[0].data</code>. This is a bit more complicated. For each person, <code>getElementsByTagName</code> gives us the **name**, **age**, **weight** and **height** elements. The child nodes of such an element are whatever sits between its tags, which here is a single text node. We take the first entry of that list and read its <code>data</code>, which is the text value of the element.

When we do all that and execute our script, we get the exact same result as with *SAX*.
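Indexing with `[0]` fails when a tag is missing or empty, and the indentation between tags also shows up as text nodes. The sketch below illustrates both points with an inline sample; the helper `child_text` is a hypothetical name, not part of `minidom`.

```python
import xml.dom.minidom

# Hypothetical sample data with the usual indentation.
SAMPLE = """<person id="1">
    <name>Max</name>
    <age>17</age>
</person>"""

person = xml.dom.minidom.parseString(SAMPLE).documentElement

# Whitespace between the tags is kept as text nodes:
# text, <name>, text, <age>, text
print(len(person.childNodes))  # 5


def child_text(parent, tag):
    """Return the text of the first <tag> below parent, or None."""
    matches = parent.getElementsByTagName(tag)
    if not matches or matches[0].firstChild is None:
        return None
    return matches[0].firstChild.data


print(child_text(person, "name"))    # Max
print(child_text(person, "height"))  # None
```

This is also why we look elements up by tag name instead of indexing into `person.childNodes` directly.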

## Manipulating XML Files 

Since we're now working with DOM, let's manipulate our XML file and change some values.

In [9]:
persons = group.getElementsByTagName("person")
persons[0].getElementsByTagName('name')[0].childNodes[0].data = "Tom"

We use the same methods as before to access our elements. Here we address the *name* element of the first person. We access the first person by using the index 0, then its first *name* element, and then the first child node of that element, which is the text node. Finally, we change the data of this text node to *Tom*.

Then we write our change back to the file:

In [10]:
with open('file/group.xml', 'w', encoding='utf-8') as f:  # closes the file for us
    domtree.writexml(f, encoding='utf-8')

This is what the beginning of the file looks like now:

```xml
<?xml version="1.0" encoding="utf-8"?><group>
    <person id="1">
        <name>Tom</name>
        <age>17</age>
        <weight>80</weight>
        <height>180</height>
    </person>
```
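To check that a written file is still valid XML, we can simply parse it again. The following sketch does such a round trip in a temporary directory; the sample document and the file name are assumptions, so the real `file/group.xml` is not touched.

```python
import os
import tempfile
import xml.dom.minidom

# Hypothetical sample data.
tree = xml.dom.minidom.parseString("<group><person id='1'/></group>")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "group.xml")

    # The with-block closes the file even if writexml raises an error.
    with open(path, "w", encoding="utf-8") as f:
        tree.writexml(f, encoding="utf-8")

    # Read the file back in to confirm that it is well-formed.
    reloaded = xml.dom.minidom.parse(path)
    person_count = len(reloaded.getElementsByTagName("person"))

print(person_count)  # 1
```

Passing the same encoding to `open` and to `writexml` keeps the XML declaration consistent with the bytes that actually end up in the file.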

We can also change attributes by using the method <code>setAttribute</code>:

In [11]:
persons[0].setAttribute("id", "1")
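Both kinds of changes can be tried out in memory before anything is written to disk. This is a minimal sketch with an inline sample document; the new id value `10` is chosen only for illustration.

```python
import xml.dom.minidom

# Hypothetical sample data.
SAMPLE = "<group><person id='1'><name>Max</name></person></group>"

tree = xml.dom.minidom.parseString(SAMPLE)
person = tree.getElementsByTagName("person")[0]

# Change the text of <name> and the value of the id attribute.
person.getElementsByTagName("name")[0].firstChild.data = "Tom"
person.setAttribute("id", "10")

result = tree.documentElement.toxml()
print(result)
# <group><person id="10"><name>Tom</name></person></group>
```

The method `toxml` returns the serialized element as a string, which is handy for inspecting changes.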

## Creating New Elements

The last thing that we are going to look at in this chapter is creating new XML elements by using DOM. First, we create a new *person* element and give it an id:

In [15]:
newperson = domtree.createElement("person")
newperson.setAttribute("id", "4")

After that, we create all the elements that we need for the person and assign values to them.

In [16]:
name = domtree.createElement("name")
name.appendChild(domtree.createTextNode("Paul Smith"))
age = domtree.createElement("age")
age.appendChild(domtree.createTextNode("45"))
weight = domtree.createElement("weight")
weight.appendChild(domtree.createTextNode("78"))
height = domtree.createElement("height")
height.appendChild(domtree.createTextNode("178"))

<DOM Text node "'178'">

First, we create a new element for each property of the person. Then we use the method <code>.appendChild</code> to put something in between the tags of our element. In this case, we pass it a text node created with <code>.createTextNode</code>, which is basically just text.

Last but not least, we again need to use the method <code>.appendChild</code> in order to define the hierarchical structure. The property elements are children of the person element, which is itself a child of the group element.

In [18]:
newperson.appendChild(name)
newperson.appendChild(age)
newperson.appendChild(weight)
newperson.appendChild(height)
group.appendChild(newperson)
with open("file/group.xml", "w") as f:  # closes the file for us
    domtree.writexml(f)
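Since the same three steps repeat for every child element, they can be wrapped in a small helper. The function `append_person` below is a hypothetical sketch, not part of `minidom`, and it works on an empty in-memory document instead of our file.

```python
import xml.dom.minidom


def append_person(tree, person_id, **fields):
    """Create a <person> with one child element per keyword argument."""
    person = tree.createElement("person")
    person.setAttribute("id", str(person_id))
    for tag, value in fields.items():
        child = tree.createElement(tag)
        child.appendChild(tree.createTextNode(str(value)))
        person.appendChild(child)
    tree.documentElement.appendChild(person)
    return person


# Hypothetical starting point: an empty group.
tree = xml.dom.minidom.parseString("<group/>")
append_person(tree, 4, name="Paul Smith", age=45)

result = tree.documentElement.toxml()
print(result)
# <group><person id="4"><name>Paul Smith</name><age>45</age></person></group>
```

The child elements appear in the order in which the keyword arguments are passed.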