## Easy XML in Python

**Creating an XML document**

In [10]:
from xml.etree import ElementTree
from xml.etree.ElementTree import Element
from xml.etree.ElementTree import SubElement

In [11]:
# <membership/>
membership = Element('membership')

In [12]:
# <membership><users/>
users = SubElement(membership, 'users')

In [13]:
# <membership><users><user/>
SubElement(users, 'user', name='john')
SubElement(users, 'user', name='charles')
SubElement(users, 'user', name='peter')

<Element 'user' at 0x3dccd30>

In [14]:
# <membership><groups/>
groups = SubElement(membership, 'groups')

In [15]:
# <membership><groups><group/>
group = SubElement(groups, 'group', name='users')

In [16]:
# <membership><groups><group><user/>
SubElement(group, 'user', name='john')
SubElement(group, 'user', name='charles')

<Element 'user' at 0x3dcca58>

In [17]:
# <membership><groups><group/>
group = SubElement(groups, 'group', name='administrators')

In [18]:
# <membership><groups><group><user/>
SubElement(group, 'user', name='peter')

<Element 'user' at 0x3e4eda0>

If Python let you indent freely, the syntax would have been even closer to what one would write directly in XML. In any event, because of how closely it resembles the target format, ElementTree can be considered to be a small domain-specific language. Writing this to a file can be done like this:

In [19]:
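# The raw string keeps the backslashes in the Windows path literal;
# the with block closes the file even if a write fails.
with open( r'C:\Users\Wei\Desktop\membership.xml', 'w' ) as output_file:
    output_file.write( '<?xml version="1.0"?>' )
    output_file.write( ElementTree.tostring( membership ) )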
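As an alternative sketch for Python 3 (an assumption; the cells above are Python 2), `ElementTree.write()` can emit the XML declaration itself, so there is no need to write it by hand. The `ET` alias and the temporary output path below are illustrative choices, not part of the original notebook.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Rebuild a small slice of the membership tree from the cells above.
membership = ET.Element('membership')
users = ET.SubElement(membership, 'users')
ET.SubElement(users, 'user', name='john')

# Hypothetical output location: a temp directory instead of a Desktop path.
path = os.path.join(tempfile.mkdtemp(), 'membership.xml')

# Wrap the root in an ElementTree and let write() add the declaration.
ET.ElementTree(membership).write(path, encoding='utf-8', xml_declaration=True)

# Read the file back to inspect what was written.
with open(path, encoding='utf-8') as f:
    written = f.read()
```

Passing an explicit `encoding` keeps the declaration and the bytes on disk in agreement.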

**Reading the XML file**

In [20]:
from xml.etree import ElementTree
document = ElementTree.parse(r'C:\Users\Wei\Desktop\membership.xml')

document now holds an ElementTree object. It is not itself a node in the XML structure, but it provides a handful of functions for consuming the element hierarchy parsed from the file. Which one you choose is largely a matter of taste and of the task at hand. For example:

In [21]:
users = document.find('users')

is equivalent to:

In [22]:
membership = document.getroot()
users = membership.find('users')
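The equivalence can be checked directly. The following is a minimal Python 3 sketch (an assumption; the notebook itself is Python 2) that builds the tree from an inline string instead of the file on disk, so it runs anywhere. The `xml_text` sample is illustrative.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for membership.xml.
xml_text = '<membership><users><user name="john"/></users><groups/></membership>'

# Wrap the parsed root so we have the same kind of object that parse() returns.
document = ET.ElementTree(ET.fromstring(xml_text))

via_tree = document.find('users')            # search starting from the tree
via_root = document.getroot().find('users')  # search starting from the root element

# Both lookups return the very same Element object.
same = via_tree is via_root
```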

**Finding specific elements**

XML is a hierarchical structure. Depending on what you do, you may want to enforce a certain hierarchy of elements when consuming the contents of the file. For example, we know that the membership.xml file expects users to be defined as membership -> users -> user. You can quickly get all the user nodes by doing this:

In [23]:
for user in document.findall('users/user'):
    print user.attrib['name']

john
charles
peter


Likewise, you can quickly get all the groups by doing this:

In [None]:
for group in document.findall('groups/group'):
    print group.attrib['name']
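To make the effect of the path concrete, here is a small Python 3 sketch (an assumption; the cells above use Python 2 `print` statements). It parses an inline copy of the membership document and shows that `users/user` only matches the users defined under `users`, while `.//user` matches every `user` element at any depth.

```python
import xml.etree.ElementTree as ET

# Inline copy of the document built earlier in the notebook.
xml_text = """
<membership>
  <users>
    <user name="john"/><user name="charles"/><user name="peter"/>
  </users>
  <groups>
    <group name="users"><user name="john"/><user name="charles"/></group>
    <group name="administrators"><user name="peter"/></group>
  </groups>
</membership>
"""
root = ET.fromstring(xml_text)

# Path relative to the root: only membership -> users -> user.
user_names = [u.get('name') for u in root.findall('users/user')]

# Same idea for membership -> groups -> group.
group_names = [g.get('name') for g in root.findall('groups/group')]

# './/user' ignores the hierarchy and matches users nested inside groups too.
all_user_elements = root.findall('.//user')
```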

**Iterating elements**

Even after finding specific elements or entry points in the hierarchy, you will normally need to iterate the children of a given node. This can be done like this:

In [25]:
for group in document.findall('groups/group'):
    print 'Group:', group.attrib['name']
    print 'Users:'
    for node in group.getchildren():
        if node.tag == 'user':
            print '-', node.attrib['name']
    

Group: users
Users:
- john
- charles
Group: administrators
Users:
- peter
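On Python 3.9 and later, `getchildren()` is no longer available, and an element can simply be iterated to visit its direct children. The sketch below assumes Python 3 and an inline copy of the `groups` part of the document; it collects the result into a dictionary (named `members` here purely for illustration) instead of printing.

```python
import xml.etree.ElementTree as ET

xml_text = """
<membership>
  <groups>
    <group name="users"><user name="john"/><user name="charles"/></group>
    <group name="administrators"><user name="peter"/></group>
  </groups>
</membership>
"""
root = ET.fromstring(xml_text)

# Map each group name to the names of the users directly inside it.
members = {}
for group in root.findall('groups/group'):
    # Iterating an Element yields its direct children, in document order.
    members[group.get('name')] = [
        child.get('name') for child in group if child.tag == 'user'
    ]
```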


Other times, you may need to visit every single element in the hierarchy from a given starting point. There are two ways of doing it: one includes the starting element in the iteration, the other visits only its children. The difference is subtle but important:

1. Iterate nodes including starting point:

In [27]:
users = document.find( 'users' )
for node in users.getiterator():
    print node.tag, node.attrib, node.text, node.tail
# Produces this output:

# users {} None None
# user {'name': 'john'} None None
# user {'name': 'charles'} None None
# user {'name': 'peter'} None None

users {} None None
user {'name': 'john'} None None
user {'name': 'charles'} None None
user {'name': 'peter'} None None


2. Iterate only the children:

In [29]:
users = document.find( 'users' )
for node in users.getchildren():
    print node.tag, node.attrib, node.text, node.tail
# Produces this output:

# user {'name': 'john'} None None
# user {'name': 'charles'} None None
# user {'name': 'peter'} None None

user {'name': 'john'} None None
user {'name': 'charles'} None None
user {'name': 'peter'} None None
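Both `getiterator()` and `getchildren()` were removed in Python 3.9. A rough Python 3 equivalent of the two loops above, assuming an inline copy of the `users` element, uses `iter()` for the first case and plain iteration for the second:

```python
import xml.etree.ElementTree as ET

xml_text = """
<membership>
  <users>
    <user name="john"/><user name="charles"/><user name="peter"/>
  </users>
</membership>
"""
users = ET.fromstring(xml_text).find('users')

# 1. iter() walks the whole subtree and starts with the element itself.
with_start = [node.tag for node in users.iter()]

# 2. Iterating the element directly yields only its direct children.
children_only = [node.tag for node in users]
```

Note that `iter()` descends through all levels below the starting element, whereas direct iteration stops at the first level.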




**Handling namespaces**

Some XML files make use of namespaces to disambiguate element tags. For example, XHTML uses http://www.w3.org/1999/xhtml as its namespace, so the main element in the XML file reads like this:

`<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">`

When parsing this file with ElementTree, the following instruction would return None:

In [32]:
body = document.find( 'body' )
print type( body )
# prints:

# <type 'NoneType'>

<type 'NoneType'>


which is not what was expected. The reason is that, because of the use of the xmlns attribute in the <html/> element, all the tag names in all the elements look like:

{http://www.w3.org/1999/xhtml}body

not simply:

body

The best way to handle this case is by using the QName class instead of a str when searching for tags based on name, e.g.:

In [34]:
from xml.etree.ElementTree import QName

namespace = 'http://www.w3.org/1999/xhtml'
body_tag = str( QName( namespace, 'body' ) )
body = document.find( body_tag )
print type( body )
# prints, as expected:

# <type 'instance'>

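<type 'NoneType'>

Note that the output shown above is still NoneType because, in this notebook, document holds membership.xml, which has no body element. With a parsed XHTML file, the lookup returns the body element instead.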


Note that keeping the namespace in its own variable makes it easier to construct any other element tag names you may need to search for, e.g.:

In [35]:
div_tag = str( QName( namespace, 'div' ) )
div_tag

'{http://www.w3.org/1999/xhtml}div'
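The namespace behaviour can be reproduced without an XHTML file on disk. The following Python 3 sketch (an assumption; the cells above are Python 2) parses a tiny inline XHTML string, whose contents are made up for illustration. It shows the unqualified lookup failing, the `QName` lookup succeeding, and, as a further option, the prefix mapping that `find()` accepts as its second argument. The prefix `x` is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

# Made-up minimal XHTML document for illustration.
xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">'
    '<head><title>t</title></head>'
    '<body><div>hi</div></body>'
    '</html>'
)
root = ET.fromstring(xhtml)
namespace = 'http://www.w3.org/1999/xhtml'

# An unqualified name does not match the namespaced tag.
plain = root.find('body')

# str(QName(...)) builds the '{namespace}tag' form that ElementTree uses.
qualified = root.find(str(ET.QName(namespace, 'body')))

# Alternative: map a prefix of our choosing to the namespace URI.
ns = {'x': namespace}
prefixed = root.find('x:body', ns)
div = root.find('x:body/x:div', ns)
```

The prefix mapping tends to read better for longer paths, since each step only needs the short prefix rather than the full URI.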

xml.etree.ElementTree is a nice and intuitive way of dealing with XML content.