Python and XML
==========

We have looked at XML, what it's used for, and some of the XML technologies for handling it. But it's also perfectly possible to handle XML from within Python! Here we will do some of the same things as we did in the last lesson, but using a language that you're already gaining some familiarity with.

The most well-supported library for handling XML in Python is known as [lxml](http://lxml.de/tutorial.html). We will use it here.

In [1]:
from lxml import etree

You start by creating an "element tree", which is what lxml calls a parsed XML document. The `parse()` function accepts a filename, a URL, or even a filehandle that you have already opened.

In [2]:
sent = etree.parse("sentence_example.xml")
sent.getroot()

<Element s at 0x105477808>
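
To illustrate the filehandle case, here is a minimal sketch that parses a small in-memory document through `io.BytesIO`, which behaves like a file opened in binary mode. The sample XML is an assumption: a two-word stand-in shaped like `sentence_example.xml`, not the actual file, and the names `sample_xml`, `sample_tree` and `sample_root` are made up for this example.

```python
import io
from lxml import etree

# Hypothetical stand-in for sentence_example.xml (two words only).
sample_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<s xml:id="example.s.1">
    <w xml:id="example.w.1"><t>He</t><pos class="PRP"/></w>
    <w xml:id="example.w.2"><t>is</t><pos class="VBZ"/></w>
</s>"""

# BytesIO acts like an already opened file, so parse() accepts it.
sample_tree = etree.parse(io.BytesIO(sample_xml))
sample_root = sample_tree.getroot()

print(sample_root.tag)    # the root element's tag name
print(len(sample_root))   # number of direct children
```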

An element tree, like an XML document, has a single root element. You can treat that element as a list, iterating over it in a 'for' loop to get its child elements. In fact you can do this in two ways, depending on whether you want just the direct children of the root, or whether you want everything that the root contains.

In [3]:
print("Here are the child elements")
for el in sent.getroot():
    print(el.tag)

print("And here are all the descendants")
for el in sent.getroot().iter():
    print(el.tag)

Here are the child elements
w
w
w
w
w
And here are all the descendants
s
w
t
pos
w
t
pos
w
t
pos
w
t
pos
w
t
pos


In fact, you can treat any element as a list, in the same way!

In [4]:
thirdword = sent.getroot()[2]

print("Here are the child elements of the third word")
for el in thirdword:
    print(el.tag)

print("And here are all its descendants")
for el in thirdword.iter():
    print(el.tag)

Here are the child elements of the third word
t
pos
And here are all its descendants
w
t
pos
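
The list-like behaviour goes beyond `for` loops. As a rough sketch (on a made-up three-word sample, not the lesson's file), `len()`, indexing, negative indexing and slicing all work on elements the way they do on lists:

```python
from lxml import etree

# Hypothetical sample shaped like the sentence file used above.
demo_root = etree.fromstring(
    "<s>"
    "<w><t>He</t><pos class='PRP'/></w>"
    "<w><t>is</t><pos class='VBZ'/></w>"
    "<w><t>Mr. </t><pos class='NNP'/></w>"
    "</s>"
)

print(len(demo_root))                      # number of direct children
print(demo_root[0][0].text)                # first <w>, then its first child <t>
print(demo_root[-1][0].text)               # negative indexes count from the end
print([w[0].text for w in demo_root[:2]])  # slicing gives a list of elements

# iter() walks the element itself plus every descendant, in document order.
print([el.tag for el in demo_root[2].iter()])
```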


You can also treat any element as a dictionary to find its attributes.

In [5]:
print("The third word has an attribute")
for k in thirdword.keys():
    print(k)

print("...and we can get the value of an attribute of its child")
for el in thirdword:
    if el.tag == 'pos':
        print(el.get('class'))

The third word has an attribute
{http://www.w3.org/XML/1998/namespace}id
...and we can get the value of an attribute of its child
NNP
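
The odd-looking key printed above is the `xml:id` attribute with its namespace spelled out. The sketch below, using a made-up one-word sample, shows how such a namespaced attribute can be looked up, along with a few other dictionary-style operations. The `lemma` attribute is deliberately absent and is only there to show the default value.

```python
from lxml import etree

# Namespace that the built-in xml: prefix stands for.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical one-word sample.
word = etree.fromstring('<w xml:id="w3"><t>Mr. </t><pos class="NNP"/></w>')
pos = word[1]

print(word.get("{%s}id" % XML_NS))   # namespaced attribute: use the {namespace}name form
print(word.get("id"))                # None: there is no un-namespaced 'id' attribute
print(pos.get("class"))              # plain attribute on the <pos> child
print(pos.get("lemma", "unknown"))   # get() takes a default, like dict.get()
print(dict(pos.attrib))              # .attrib is a dictionary-like view of all attributes
```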


Naturally, we can also get the text content of an element. Along the way, we also see how passing a tag name to the `iter()` method (which we saw above) restricts the search to descendant elements with that tag.

In [6]:
for el in sent.iter('t'):
    print(el.text)

He
is
Mr. 
Smith
.
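
Putting the list, dictionary and text views together, one possible way to pair each word with its part-of-speech tag is sketched below. The sample data and its tags are assumptions, and `find()` (which returns the first matching child or `None`) is used here for brevity even though the lesson has not introduced it.

```python
from lxml import etree

# Hypothetical sample shaped like the sentence file used above.
sentence = etree.fromstring(
    "<s>"
    "<w><t>He</t><pos class='PRP'/></w>"
    "<w><t>is</t><pos class='VBZ'/></w>"
    "<w><t>Mr. </t><pos class='NNP'/></w>"
    "</s>"
)

tagged = []
for w in sentence.iter("w"):
    token = w.find("t")       # first <t> child, or None if there is none
    pos_el = w.find("pos")    # first <pos> child, or None if there is none
    if token is not None and pos_el is not None:
        tagged.append((token.text, pos_el.get("class")))

print(tagged)
```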


We might also want to know how to print the XML back out again, either for the entire document or for a particular element. This is a little tricky: the `etree.tostring()` function gives us **bytes** rather than **characters** by default. To print them correctly, we have to "decode" the bytes into characters, and in order to do that we have to know what character set was used. We can get this information from the document's `docinfo`.

In [8]:
# Find out what character set was being used - this is in the docinfo
# of the tree we first parsed.
character_set = sent.docinfo.encoding

# Turn the element into bytes, written in that same character set
xml_word_bytes = etree.tostring(thirdword, encoding=character_set)

# Now turn those bytes into characters
print(xml_word_bytes.decode(character_set))

<w xml:id="example.p.1.s.2.w.3">
        <t>Mr. </t>
        <pos class="NNP"/>
    </w>
    


Handling XML namespaces
------------

We've seen already that in "real" XML documents such as TEI files, we have to work within namespaces. The lxml library handles namespaces very directly - let's try parsing the Bertrand Russell document to see how it looks.

In [9]:
russell = etree.parse("russell.xml")
russell.getroot().tag

'{http://www.tei-c.org/ns/1.0}TEI'

Okay, that's ugly, but it's also pretty direct. The namespace URL goes in curly braces `{}`, and the actual tag name comes right after. We need to use this format whenever we look for tags in the document. Let's say we want to look at all the page numbers, for example:

In [10]:
for pb in russell.getroot().iter("pb"):
    print("Found page %s" % pb.get("n"))

That didn't work: it didn't find any page breaks, because we didn't specify a namespace! Instead, this is how we need to look.

In [13]:
for pb in russell.getroot().iter("{http://www.tei-c.org/ns/1.0}pb"):
    print("Found page %s" % pb.get("n"))

Found page 198
Found page 199
Found page 200
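
Going the other way, from a `{namespace}tag` string back to its two parts, can be done with lxml's `etree.QName` helper. This is a small sketch on a made-up TEI-like fragment (the page numbers are borrowed from the output above, the rest is invented):

```python
from lxml import etree

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical TEI-like fragment with two page breaks.
tei = etree.fromstring(
    '<TEI xmlns="%s"><text><pb n="198"/><p>Some text</p><pb n="199"/></text></TEI>'
    % TEI_NS
)

pages = []
for el in tei.iter("{%s}pb" % TEI_NS):
    qname = etree.QName(el)   # splits '{namespace}name' into its two parts
    print(qname.namespace, qname.localname, el.get("n"))
    pages.append(el.get("n"))
```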


Obviously it is boring and annoying to type out a namespace every time we want to look for a tag, so to make our lives easier we can define a function in our code that will prepend the namespace to whatever tag name we give it.

In [14]:
def nstag(tag):
    return "{http://www.tei-c.org/ns/1.0}%s" % tag

for pb in russell.getroot().iter(nstag("pb")):
    print("Found page %s" % pb.get("n"))

Found page 198
Found page 199
Found page 200


Handling mixed text content
----------

TEI is a somewhat complicated form of XML, because usually TEI documents have what we call *mixed content*. This means that a given element may contain a mixture of text and other elements. We see that in a paragraph like this:

    <p>If you say the same thing to a Frenchman with a slight knowledge 
    of English he will go through some inner speech which may be
    represented by <q xml:lang="fr-FR">"Que dit-il ? Ah, oui, une
    automobile !"</q> After this, the rest follows as with the Englishman. 
    Watson would contend that the inner speech must be incipiently
    pronounced; we should argue that it might be merely imaged. But this
    point is not important in the present connection. </p>
    
This paragraph has a text node (from 'If you say...' to '...represented by'), followed by a `<q>` element with its own text, followed by another text node that begins with 'After this...'. So how would you retrieve the text of this paragraph?

In [15]:
# A variable to hold the paragraph we want
pgraph = None

# Find the right paragraph
for pg in russell.getroot().iter(nstag("p")):
    # pg.text is None when a paragraph begins with a child element
    if pg.text and pg.text.startswith("If you say"):
        pgraph = pg
        break
        
# Now let's see what the text of this paragraph is.
pgraph.text

'If you say the same thing to a Frenchman with a slight knowledge of English he will go through some inner speech which may be represented by '

Huh, that's incomplete. What is going on here?

Let's look at what lxml thinks the contents of the paragraph should be.

In [16]:
for el in pgraph:
    print(el)
    print(el.text)

<Element {http://www.tei-c.org/ns/1.0}q at 0x105496748>
"Que dit-il ? Ah, oui, une automobile !"


Hmm. So from the perspective of lxml, the paragraph has some text, which is the first bit of what we think its text is, and then it has an element `<q>`, which has its own text. So where is the rest of the text hiding? The answer is in the *tail*.

The idea here is that an element can have text inside it, and it can also have text after it. So when using lxml, if we want all the text inside an element, we have to look both inside *and behind* its child elements!

In [17]:
print(pgraph.text)
for el in pgraph:
    print(el.text)
    print(el.tail)

If you say the same thing to a Frenchman with a slight knowledge of English he will go through some inner speech which may be represented by 
"Que dit-il ? Ah, oui, une automobile !"
 After this, the rest follows as with the Englishman. Watson would contend that the inner speech must be incipiently pronounced; we should argue that it might be merely imaged. But this point is not important in the present connection. 
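
The same text-and-tail walk can be used to rebuild the paragraph as a single string. Here is a minimal sketch on a made-up paragraph; note that it only looks at direct children, so it is a partial solution for flat mixed content rather than a general one.

```python
from lxml import etree

# Hypothetical paragraph with one inline child element.
para = etree.fromstring("<p>He said <q>Bonjour !</q> and then left.</p>")

pieces = [para.text or ""]           # text before the first child
for child in para:
    pieces.append(child.text or "")  # text inside the child
    pieces.append(child.tail or "")  # text after the child's closing tag

full_text = "".join(pieces)
print(full_text)
```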


Be careful though - if you want *all* the text inside an element, you might need to use `.iter()` to get not just the children, but all descendants!

In [19]:
list_pgraph = None
for pg in russell.getroot().iter(nstag("p")):
    # pg.text is None when a paragraph begins with a child element
    if pg.text and pg.text.startswith("So far we have found"):
        list_pgraph = pg
        break
        
print("This will be incomplete...")
print(list_pgraph.text)
for el in list_pgraph:
    print(el.text)
    print(el.tail)
    
print("\n...but this should get everything!")
for el in list_pgraph.iter():
    if el.text is not None and el.text.rstrip() != '':
        print(el.text)
    if el.tail is not None and el.tail.rstrip() != '':
        print(el.tail)

This will be incomplete...
So far we have found four ways of understanding words: 
               

                  
 

...but this should get everything!
So far we have found four ways of understanding words: 
               
(1)
On suitable occasions you use the word properly.
(2)
When you hear it you act appropriately 
(3)
You associate the word with another word (say in a different language) which has the appropriate effect on behaviour.
(4)
When the word is being first learnt, you may associate it with an object, which is what it " means" or a representative of various objects that it "means."
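
Two caveats about the `.iter()` loop are worth a sketch. First, `.iter()` also visits the element it was called on, so that element's own tail (text that sits *after* the paragraph) gets picked up. Second, printing each element's text and then its tail does not always follow document order once elements are nested. lxml's `itertext()` method sidesteps both problems. The fragment below is made up for illustration.

```python
from lxml import etree

# Hypothetical fragment: a paragraph with a nested list, followed by stray text.
div = etree.fromstring(
    "<div><p>Two ways: "
    "<list><item><label>(1)</label> First.</item>"
    "<item><label>(2)</label> Second.</item></list>"
    "</p>NOT PART OF THE PARAGRAPH</div>"
)
para = div[0]

# itertext() yields every text and tail inside the element, in document order,
# but leaves out the tail of the element it was called on.
clean = "".join(para.itertext())
print(clean)

# The manual iter() loop visits the paragraph itself, so its own tail sneaks in.
manual = []
for el in para.iter():
    manual.append(el.text or "")
    manual.append(el.tail or "")
print("".join(manual))
```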


Python and XPath
--------

The skills we learned in the last class didn't go entirely to waste; it's perfectly possible, and also pretty useful, to use XPath expressions from within Python! 

In order to do that, though, we want to handle the namespaces a little more nicely than lxml tends to. Full namespace URLs will have slash characters (`/`) in them, and these characters are also important in XPath, so mixing the two is never nice. None of you want to have to write an XPath expression that looks like

    //http:\/\/www.tei-c.org\/ns\/1.0:text/http:\/\/www.tei-c.org\/ns\/1.0:body/[...]
    
and so on. Fortunately, as we will see, you can define a shorthand prefix for any namespace (just as we did while using XQuery) and pass that shorthand to the XPath evaluator along with your expression. This is what it looks like.

In [20]:
# Read in a Shakespeare play
play = etree.parse("merchant_venice.xml")

# Make a dictionary to define our namespace shortcuts
ns = {'tei':'http://www.tei-c.org/ns/1.0'}

# Run an XPath search within the play, but give it the dictionary to
# tell it what namespaces we are using!
play.xpath('//tei:fileDesc/tei:titleStmt/tei:title/text()', namespaces=ns)

['The Merchant of Venice.']

An important thing to remember is that XPath results will usually come in the form of a list! It's possible to write an expression that gives you something else, e.g. if you want to count elements.

In [21]:
play.xpath('count(//tei:sp)', namespaces=ns)

637.0

But if your expression is meant to return the result of a search for an element or for text nodes, then you'll get a list even if it contains only one element.

In [22]:
play.xpath('//tei:teiHeader', namespaces=ns)

[<Element {http://www.tei-c.org/ns/1.0}teiHeader at 0x105496dc8>]
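
To make the different result types concrete, here is a small sketch run against a made-up miniature "play" (two speeches, invented for this example). It also shows that a search with no hits returns an empty list rather than `None`, which matters if you plan to index into the result.

```python
from lxml import etree

ns = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical miniature play with two speeches.
mini = etree.fromstring(
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text>'
    "<sp><speaker>Anth.</speaker><l>First speech.</l></sp>"
    "<sp><speaker>Sal.</speaker><l>Second speech.</l></sp>"
    "</text></TEI>"
)

speeches = mini.xpath("//tei:sp", namespaces=ns)            # list of elements
names = mini.xpath("//tei:speaker/text()", namespaces=ns)   # list of strings
total = mini.xpath("count(//tei:sp)", namespaces=ns)        # a float, not a list
missing = mini.xpath("//tei:stage", namespaces=ns)          # empty list, not None

print(len(speeches), names, int(total), missing)

# Guard before indexing, since a search with no hits returns [].
first_stage = missing[0] if missing else None
```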

In XPath, context is everything. So far we have run all of our XPath expressions against the root of the document, so that the only thing under the initial `/` is the root element, which is to say, the `<TEI>` element. But in fact you can run an XPath expression against any element in the document! The element against which the XPath expression is run is called the *context*, and it works much like the current working directory on the command line: a relative path means something different depending on where you are standing.


In [24]:
# Let's get a list of speaking parts from the play
speaking_parts = play.xpath('//tei:text//tei:sp', namespaces=ns)

# Here is an XPath expression that will retrieve the speaker for a given speaking part
xpath_expression = './tei:speaker/text()'

# Now we can run the same expression against different <sp> elements and get different results!
print(speaking_parts[1].xpath(xpath_expression, namespaces=ns))
print(speaking_parts[2].xpath(xpath_expression, namespaces=ns))
print(speaking_parts[3].xpath(xpath_expression, namespaces=ns))
print(speaking_parts[4].xpath(xpath_expression, namespaces=ns))

['Sal.']
['Salar.']
['Sal.']
['Anth.']
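
One pitfall with context, sketched here on a made-up two-speech document: only *relative* expressions (those starting with `.`) stay inside the context element. An expression starting with `//` still searches from the top of the whole document, no matter which element you call it on.

```python
from lxml import etree

ns = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical miniature play with two speeches.
mini = etree.fromstring(
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text>'
    "<sp><speaker>Anth.</speaker><l>First speech.</l></sp>"
    "<sp><speaker>Sal.</speaker><l>Second speech.</l></sp>"
    "</text></TEI>"
)

second_sp = mini.xpath("//tei:sp", namespaces=ns)[1]

# A leading '.' keeps the search inside the context element...
relative = second_sp.xpath(".//tei:speaker/text()", namespaces=ns)
print(relative)

# ...while a leading '//' still starts from the top of the whole document.
absolute = second_sp.xpath("//tei:speaker/text()", namespaces=ns)
print(absolute)
```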


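One last note concerning the text of the document: be careful not to confuse the lxml way with the XPath way! lxml splits a paragraph's text across the `.text` and `.tail` properties of its elements, whereas XPath simply sees a sequence of text nodes in document order. Let's see what happens when we ask for the text nodes of the Russell paragraph we were looking at before.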

In [26]:
pgraph.xpath('descendant::text()', namespaces=ns)

['If you say the same thing to a Frenchman with a slight knowledge of English he will go through some inner speech which may be represented by ',
 '"Que dit-il ? Ah, oui, une automobile !"',
 ' After this, the rest follows as with the Englishman. Watson would contend that the inner speech must be incipiently pronounced; we should argue that it might be merely imaged. But this point is not important in the present connection. ']
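
The two views can still be connected. The strings that lxml returns from XPath remember where they came from: each one offers a `getparent()` method and the boolean properties `is_text` and `is_tail`. The following sketch, on a made-up paragraph, reports the origin of every text node.

```python
from lxml import etree

# Hypothetical paragraph with one inline child element.
para = etree.fromstring("<p>He said <q>Bonjour !</q> and then left.</p>")

nodes = para.xpath("descendant::text()")
for node in nodes:
    parent = node.getparent()   # the element that holds this string as .text or .tail
    where = "text" if node.is_text else "tail"
    print("%r is the %s of <%s>" % (str(node), where, parent.tag))
```

Here the third string is reported as the tail of `<q>`, which is exactly where the lxml loop found it earlier.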