# Chapter 3: Working with xml : writing

In [None]:
# Run this before going further :
from lxml import etree

## 1. From xml to string and files

`lxml` offers many ways to interact with xml in Python, including converting xml objects to strings. The function `etree.tostring()` is quite convenient to retrieve, for example, part of an XML object.

In [None]:
with open("data/phi1294.phi002.perseus-lat2.xml") as f:
    xml = etree.parse(f)

#Let's get the third div
div = xml.xpath("//tei:div/tei:div", namespaces = { "tei" : "http://www.tei-c.org/ns/1.0" })[2]

#And now we export it to string
print(etree.tostring(div))

As you can see, the result is simple, but it's not pretty. Most of all, the result is a `bytes` object, which is a distinct type from `str` in Python3. Thankfully, `etree.tostring` offers several options :

- `encoding` : accepts either the `str` type itself or the name of an encoding such as "utf-8" or "iso-xxx"
- `xml_declaration` : if set to `True`, will add the famous `<?xml version='1.0' encoding='***'?>` to the beginning
- `with_comments` : if set to `False`, will remove the comments (this option only takes effect with C14N output, `method="c14n"`)

See more details in the [official documentation](https://lxml.de/api/lxml.etree-module.html#tostring).

In [None]:
# Let's see how encoding works :
print(etree.tostring(div, encoding=str))
# And with utf-8 ? 
print(etree.tostring(div, encoding="utf-8"))

**Be careful** : with `encoding="utf-8"`, the result is still bytes. `str` in Python3 is actually unicode text. So if you need a string, you should use `encoding=str` !
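
The difference between the return types can be checked directly. The following is a minimal sketch that uses a small inline node (an assumption made for illustration, not the Martial file used in this chapter) :

```python
from lxml import etree

# A small stand-in node; the chapter's real data comes from a TEI file.
node = etree.fromstring("<div><p>Salve</p></div>")

as_bytes = etree.tostring(node)                # default: bytes
as_text = etree.tostring(node, encoding=str)   # Python 3 text (str)
as_utf8 = etree.tostring(node, encoding="utf-8", xml_declaration=True)

# Only encoding=str gives back a str; the two other calls return bytes.
print(type(as_bytes).__name__, type(as_text).__name__, type(as_utf8).__name__)

# Bytes become text once they are decoded with the matching encoding.
print(as_utf8.decode("utf-8"))
```

If you already hold bytes, `.decode()` with the encoding you asked for gives you the same text back.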

**Files**

There are two options for writing a file. One is simply exporting to a string with `tostring` and then writing that string to a file. The other is using `etree.xmlfile`. To do that properly, you will need to use the `with` statement, which you have seen before. `with FileFunction() as var:` allows you to properly process your code by opening the file at the beginning and closing it at the end :

In [None]:

with open("data/phi1294.phi002.perseus-lat2.xml") as f:
    xml = etree.parse(f)
    print(f) # f exists
print(f.read()) # Raises a ValueError because the file is closed.

`etree.xmlfile()` works the same way : it opens the file at the start of the `with` block and closes it at the end. The variable it gives you (`xf` below) supports several methods :

- `write(xml)` writes some xml to the document
- `write_doctype(doctype)` adds a doctype verbatim, such as `write_doctype('<!DOCTYPE root SYSTEM "some.dtd">')`
- `write_declaration(standalone=None)` writes the xml declaration ; if `standalone` is `True`, the document is declared standalone
- `element(tag, attrib={}, nsmap={})`, used in a `with` statement, opens a contextual node with the given tag, attributes and namespace map, and closes it at the end of the block

In [None]:
with etree.xmlfile("somefile.xml", encoding='utf-8') as xf:
    xf.write(div)
    
with open("somefile.xml") as f:
    print(f.read())
    
with etree.xmlfile("somefile.xml", encoding='utf-8') as xf:
    with xf.element("tei:TEI", nsmap={"tei":"{http://www.tei-c.org/ns/1.0}"}): # No variable stored here
        xf.write(div) # Note that we still use "xf" as the main variable.
    
with open("somefile.xml") as f:
    print(f.read())
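
The methods of `etree.xmlfile()` can also be combined in a single pass. Here is a minimal sketch, assuming a throwaway file in a temporary directory and a tiny hand-made `<l>` node (both are illustrative choices, not part of the course data) :

```python
import os
import tempfile
from lxml import etree

# Hypothetical output location: a temporary directory, to avoid touching real files.
path = os.path.join(tempfile.mkdtemp(), "sketch.xml")
line = etree.fromstring('<l n="1">Hello</l>')

with etree.xmlfile(path, encoding="utf-8") as xf:
    xf.write_declaration()                       # must come before any content
    with xf.element("div", attrib={"type": "textpart"}):
        xf.write(line)                           # written inside the open <div>

with open(path, encoding="utf-8") as f:
    content = f.read()
print(content)
```

The `<div>` is closed automatically when its `with` block ends, and the file is closed when the outer block ends.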

**DIY**

Using the parsing system we talked about, can you clean up the namespace created by `tei:` ?

In [None]:
# Write your code here

**What you have learned**

- `etree.tostring()`
- `etree.xmlfile()`
- `with` statement

## 2\. Tree and Element


On top of parsing and traversing xml, `lxml` provides a lot of functionality for creating XML inside Python. There are numerous uses for it in the Humanities : correcting xml, generating XML from raw text after using regular expressions, creating xml in web applications, etc.

Most humanities projects run on XML with TEI for a sustainable text format. OpenEdition, Perseus, Perseids, etc. are all running on XML resources. Being able to master generation of XML is undoubtedly a skill you will require throughout your career.

Let's start with the basics :

**Element**

`etree.Element()` takes one required argument, the name of your tag : `etree.Element("div")` will result in `<div/>`. That's great but quite limited. You can use Clark's notation, `{namespace-uri}tagname`, to add a namespace :

In [None]:
e = etree.Element("{http://localhost}div")
print(etree.tostring(e, encoding=str))

As you can see, lxml uses ns{number} by default for prefixing xml. You can choose the prefix yourself by registering it with the function `etree.register_namespace(prefix, uri)` :

In [None]:
etree.register_namespace("local", "http://localhost")
e = etree.Element("{http://localhost}div")
print(etree.tostring(e, encoding=str))

You can also register a namespace through the `nsmap` named argument of `etree.Element` :

In [None]:
e = etree.Element("{http://localhost}div", nsmap={None:"http://localhost"})
print(etree.tostring(e, encoding=str))

While `etree.register_namespace` does not accept `None` as a prefix, the keys of the `nsmap` argument can be either strings or `None`. Using `None`, you declare a default namespace and ensure that no ns0 or any other useless prefix is used.

Now that we can use namespaces, let's add a nice set of attributes. Attributes are easy to define through a `dict` and the parameter `attrib` of `etree.Element`

In [None]:
e = etree.Element(
    "{http://localhost}div",
    attrib={"n" : "1", "type": "textpart", "subtype" : "section"},
    nsmap={None:"http://localhost"}
)
print(etree.tostring(e, encoding=str))

Great, now we have a very complex tag. But what if you need to add the language via xml:lang ?

In [None]:
e = etree.Element(
    "{http://localhost}div",
    attrib={"n" : "1", "type": "textpart", "subtype" : "section", "{http://www.w3.org/XML/1998/namespace}lang" : "fre"},
    nsmap={None:"http://localhost"}
)
print(etree.tostring(e, encoding=str))

As you can see, you can use a default namespace and Clark's notation without specifying the prefix for "xml:". It's also true for xhtml. Another thing : if you need some Dublin Core attributes, you can write them like this :

In [None]:
e = etree.Element(
    "{http://localhost}div",
    attrib={
        "n" : "1",
        "type": "textpart",
        "subtype" : "section",
        "{http://www.w3.org/XML/1998/namespace}lang" : "fre",
        "{http://purl.org/dc/elements/1.1/}title" : "Nice div"
    },
    nsmap={None:"http://localhost", "dc" : "http://purl.org/dc/elements/1.1/"}
)
print(etree.tostring(e, encoding=str))

That's about the most complex tag you can create, except for now we haven't included any text. Inserting text is quite easy, however. You just set the attribute `text` of your node  :

In [None]:
e.text = "Hello there"
print(etree.tostring(e, encoding=str))

Now that we can create elements, we should construct a Tree out of them in order to open up many other options. To do so, you will need to use the function `etree.ElementTree`, which takes a single argument : a node.

In [None]:
doc = etree.ElementTree(e)
print(etree.tostring(doc, encoding=str))

Now you are able to perform validation, add DTD and any other XML processes you know.
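
A tree can also write itself to a file through its `write()` method. The sketch below assumes a small stand-in element and a temporary file path (both chosen only for illustration) :

```python
import os
import tempfile
from lxml import etree

# A stand-in element; any node built earlier would do.
root = etree.Element("div", attrib={"n": "1"})
root.text = "Hello there"
tree = etree.ElementTree(root)

# Hypothetical output location in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "tree.xml")
tree.write(path, encoding="utf-8", xml_declaration=True, pretty_print=True)

with open(path, encoding="utf-8") as f:
    written = f.read()
print(written)

# The root node stays reachable from the tree.
print(tree.getroot().tag)
```

`getroot()` gives you back the node the tree was built from, so you can keep editing it after wrapping it.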

**Modifying attributes**

Be aware that nodes accept `set(attribute, value)` as a method, which modifies an attribute.

In [None]:
e.set("foo", "bar")

print(etree.tostring(doc, encoding=str))
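
Attributes can be read and removed as well as set. This is a minimal sketch on a fresh node (the names `node`, `type` and `missing` are only illustrative) :

```python
from lxml import etree

node = etree.Element("div", attrib={"n": "1"})
node.set("type", "textpart")            # add a new attribute
node.set("n", "2")                      # overwrite an existing one

print(node.get("n"))                    # read a single value
print(node.get("missing", "default"))   # fallback when the attribute is absent
print(dict(node.attrib))                # .attrib behaves like a dictionary

del node.attrib["type"]                 # remove an attribute
print(etree.tostring(node, encoding=str))
```

Since `.attrib` behaves like a dictionary, the usual idioms such as `in`, `del` and iteration apply to it.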

**DIY**

Create an XML node with the following output :

    <div xmlns="http://www.tei-c.org/ns/1.0" type="textpart" subtype="section" xml:lang="fro">I am a div</div>

In [None]:
#Write your code here

## 3\. SubElements

`etree.SubElement()` allows you to add a child to an element. It takes the parent node as its first argument and a string representing the tag as its second. Use Clark's notation for the tag to keep or alter the namespace. The child normally inherits the namespace prefixes declared on its parent elements.

In [None]:
e = etree.Element(
    "{http://localhost}div",
    attrib={
        "n" : "1",
        "type": "textpart",
        "subtype" : "section",
        "{http://www.w3.org/XML/1998/namespace}lang" : "fre",
        "{http://purl.org/dc/elements/1.1/}title" : "Nice div"
    },
    nsmap={None:"http://localhost", "dc" : "http://purl.org/dc/elements/1.1/"}
)
e.text = "Hello world !"
s_e = etree.SubElement(e, "{http://purl.org/dc/elements/1.1/}a")
s_e.text = "I am an anchor"
print(etree.tostring(e, encoding=str))

Attributes can be added to this SubElement through `attrib`, as with Element, or through named arguments :

In [None]:
# With attrib :
s_e_2 = etree.SubElement(e, "{http://purl.org/dc/elements/1.1/}author", attrib = { "foo" : "bar"})
# With named argument :
s_e_3 = etree.SubElement(e, "{http://purl.org/dc/elements/1.1/}meta", name = "foo", bar="foo2")
print(etree.tostring(e, encoding=str))

You are now equipped for long days of document creation. There's another way to deal with children, though: the `append()` method. Here, for example, is a more complex tree:

In [None]:
# We define a namespace map that we will reuse :
ns = { None : "http://www.tei-c.org/ns/1.0"}

# We define a list of lines text (urn:cts:latinLit:phi1294.phi002:1.2):
lines = ["Qui tecum cupis esse meos ubicumque libellos",
        "Et comites longae quaeris habere viae,",
        "Hos eme, quos artat brevibus membrana tabellis:",
        "Scrinia da magnis, me manus una capit.",
        "Ne tamen ignores ubi sim venalis, et erres",
        "Urbe vagus tota, me duce certus eris:",
        "Libertum docti Lucensis quaere Secundum",
        "Limina post Pacis Palladiumque forum. "]

# We create a div
div = etree.Element(
    "{http://www.tei-c.org/ns/1.0}div",
    attrib= { "type" : "textpart", "subtype" : "poem", "n" : "2"},
    nsmap = ns
)

i = 1
while len(lines) > 0:
    # We create an element
    l = etree.Element(
        "{http://www.tei-c.org/ns/1.0}l",
        attrib= {"n" : str(i)}
    )
    l.text = lines.pop(0)  # We remove the first element. See .pop(i) documentation
    div.append(l)
    
    i += 1 # We increment our index

print(etree.tostring(div, encoding=str, pretty_print=True))

`lxml` also allows insertion at a given index : if you forgot something, or have to process some information after having added your content, you can insert an element at a chosen index with `.insert(index, element)`. This will prove useful if you have to request data asynchronously and have some processing to do before.

In [None]:
l = etree.Element(
    "{http://www.tei-c.org/ns/1.0}head"
)
l.text = "Epigram 2"
div.insert(0, l)
print(etree.tostring(div, encoding=str, pretty_print=True))

Note that the last part could have been done using a list of elements : `etree.Element()` instances support the `.extend()` method that allows you to append a list of elements as children to the node.
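
As a minimal sketch of the `.extend()` approach, the lines can be built first and attached in one call. The variable names and the two-line excerpt are chosen for illustration only :

```python
from lxml import etree

TEI_NS = "http://www.tei-c.org/ns/1.0"
verses = ["Hos eme, quos artat brevibus membrana tabellis:",
          "Scrinia da magnis, me manus una capit."]

stanza = etree.Element("{%s}div" % TEI_NS, nsmap={None: TEI_NS})

# Build every <l> first, keeping them in a plain Python list.
children = []
for number, verse in enumerate(verses, start=1):
    line = etree.Element("{%s}l" % TEI_NS, attrib={"n": str(number)})
    line.text = verse
    children.append(line)

# Attach the whole list at once, in order.
stanza.extend(children)
print(etree.tostring(stanza, encoding=str, pretty_print=True))
```

Unlike the `while` loop with `.pop(0)`, this version leaves the original list of verses untouched.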

**DIY**

Below is the sixth epigram of Martial. Can you rewrite it to conform to the xml architecture discussed above ? Add an `xml:lang` attribute with the value `lat`. Add the `head` at the beginning and a `note` as third child, containing the following text : "I am a note"

In [None]:
poem = """Aetherias aquila puerum portante per auras
Inlaesum timidis unguibus haesit onus:
Nunc sua Caesareos exorat praeda leones,
Tutus et ingenti ludit in ore lepus.
Quae maiora putas miracula? summus utrisque
Auctor adest: haec sunt Caesaris, illa Iovis."""

## 4\. Removing elements

Removing an element is also simple using lxml. It's done using the parent element's method `remove(node)`. This method needs an element as a parameter, which means you need to have access to it in one way or another. If you did something wrong, you can just remove the node using the variable you previously assigned it to.

In [None]:
a = etree.Element("tag")
b = etree.Element("subtag")
a.append(b)

print(etree.tostring(a, encoding=str))

a.remove(b)
print(etree.tostring(a, encoding=str))

You may never need to remove a node you just added. Most of the time you'll have xml from external sources and want to remove something. To do so, we'll identify the node using the `xpath` or `findall` methods, assign it to a variable, and remove it from its parent :

In [None]:
xml = """<ahab:result xmlns:ahab="http://github.com/Capitains/ahab">
<ahab:urn>urn:cts:latinLit:phi1020.phi001.perseus-lat2</ahab:urn>
<ahab:passageUrn>urn:cts:latinLit:phi1020.phi001.perseus-lat2:5.612</ahab:passageUrn>
<ahab:text>
    <p><span class="previous"/><span class="hi">lascivum</span><span class="following">  et prono vexantem gramina cursu?  </span></p>
</ahab:text>
</ahab:result>"""
parsed = etree.fromstring(xml)
# We don't care about the urn, so let's find it first
urn = parsed.xpath("//ahab:urn", namespaces = { "ahab" : "http://github.com/Capitains/ahab" })
# We assign it to a variable 
urn_0 = urn[0]
# And we remove it
parsed.remove(urn_0)
# Let's check
print(etree.tostring(parsed, encoding=str, pretty_print=True))
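
`remove()` only works on direct children. When the node you found is nested deeper, one option is to go through its parent with `getparent()`. Here is a minimal sketch on a made-up document (the tag names are only illustrative) :

```python
from lxml import etree

doc = etree.fromstring("<text><p><note>marginal remark</note></p></text>")
note = doc.xpath("//note")[0]

refused = False
try:
    doc.remove(note)             # <note> is a grandchild of <text>, not a child
except ValueError as error:
    refused = True
    print("Refused:", error)

note.getparent().remove(note)    # ask the direct parent instead
print(etree.tostring(doc, encoding=str))
```

This pattern is handy after an `xpath` search, where you rarely know in advance how deep the matching node sits.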

Another destructive method of `etree.Element` is `.clear()`, which removes everything inside a node, including attributes, text and children :

In [None]:
# We reuse xml string from last passage
parsed = etree.fromstring(xml)
# This time we look for the highlighted span
passage = parsed.xpath("//span[@class='hi']")
# We assign it to a variable 
passage_0 = passage[0]
# And we empty it
passage_0.clear()
# Let's check
print(etree.tostring(parsed, encoding=str, pretty_print=True))

As you can see, everything has been removed, and only the tag remains.

**DIY**

Print a string representing the xml from variable `tei` without any milestones :

In [None]:
tei = """
<div n="3" subtype="chapter">
<milestone unit="section" n="7" />
    <p>
        <reg>ac</reg> multum etiam novitatem tuam adiuvat quod eius modi nobiles tecum petunt, ut nemo sit
        qui audeat dicere plus illis nobilitatem quam tibi virtutem prodesse oportere.
          <reg>nam</reg>
        P. Galbam et L. Cassium summo loco natos quis est
        qui petere consu<pb n="156"/>latum putet? <reg>vides</reg> igitur amplissimis ex familiis homines, quod
        sine nervis sunt, tibi paris non esse. <reg>at</reg>
        Antonius et Catilina molesti sunt.
    </p> 
<milestone n="8" unit="section" />
    <p>
        <reg>immo</reg> homini navo, industrio, innocenti, diserto, gratioso apud eos qui res
        iudicant, optandi competitores ambo a puemitia sicarii, ambo libidinosi, ambo egentes.
          Eorum alterius bona proscripta vidimus, vocem denique audivimus iurantis se
          Romae iudicio aequo cum homine Graeco certare non posse, ex
        senatu eiectum scimus optima verorum censorum existimatione, in praetura competitorem
        habuimus amico Sabidio et Panthera, quom ad <pb n="157"/> tabulam quos poneret non haberet; quo tamen in
        magistratu amicam quam domi palam haberet de machinis emit. <reg>in</reg> petitione autem
        consulatus Cappadoces omnis compilare per turpissimam legationem maluit quam adesse et
        populo Romano supplicare.
    </p>
<milestone unit="section" n="9" />
    <p><reg>alter</reg>
        vero, di boni! quo splendore est? <reg>primum</reg> nobilitate eadem <del>qua
          Catilina</del>. <reg>num</reg> maiore? <reg>non</reg>. <reg>sed</reg> virtute.
          <reg>quam</reg> ob rem? <reg>quod</reg>
        Antonius umbram suam metuit, hic ne leges quidem natus in patris egestate,
        educatus in sororis stupris, corroboratus in caede civium, cuius primus ad rem publicam
        aditus in equitibus R. occidendis fuit (nam illis quos meminimus Gallis, qui
        tum Titiniorum ac Nanniorum ac <pb/>
        Tanusiorum capita demebant, Sulla <pb n="158"/> unum Catilinam
        praefecerat); in quibus ille hominem optimum, Q. Caecilium, sororis
        suae virum, equitem Romanum, nullarum partium, cum semper natura tum etiam
        aetate iam quietum, suis manibus occidit.
    </p>
</div>
"""

## 5\. Element maker

ElementMaker is a powerful feature specific to lxml. It allows one to easily create a tree without storing many variables. It's part of another `lxml` module and should be imported as follows :

In [None]:
from lxml.builder import E

`E` is an instance of the `ElementMaker` class. It allows one to create nodes quite simply : the syntax is `E.TAGNAME()`, *e.g.* `E.div()` produces `<div/>`. To pass information to it, you just add arguments. `E.something()` accepts, in whatever order you need, attributes as a dictionary, strings, and other `E.something()` calls :

In [None]:
xml = E.div(
    { 
        "n" : "1",
        "type": "textpart",
        "subtype" : "section"
    },
    "Head",
    E.l( { "n" : "1" }, "Qui tecum cupis esse meos ubicumque libellos"),
    E.l( { "n" : "2" }, "Et comites longae quaeris habere viae,"),
    E.l( { "n" : "3" }, "Hos eme, quos artat brevibus membrana tabellis:"),
    E.l( { "n" : "4" }, "Scrinia da magnis, me manus una capit."),
    E.l( { "n" : "5" }, "Ne tamen ignores ubi sim venalis, et erres"),
    E.l( { "n" : "6" }, "Urbe vagus tota, me duce certus eris:"),
    E.l( { "n" : "7" }, "Libertum docti Lucensis quaere Secundum"),
    E.l( { "n" : "8" }, "Limina post Pacis Palladiumque forum. "),
    "Tail"
)

# But we still need lxml to print
from lxml import etree
print(etree.tostring(xml, encoding=str))
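
The position of a string among the arguments decides where it lands in the tree : before the first child it becomes the node's `text`, after a child it becomes that child's `tail`. Here is a minimal sketch with a shortened example (the content is illustrative) :

```python
from lxml import etree
from lxml.builder import E

node = E.div("Head", E.l({"n": "1"}, "first line"), "Tail")

print(node.text)        # text placed before the first child
print(node[0].text)     # text inside <l>
print(node[0].tail)     # text that follows </l>
print(etree.tostring(node, encoding=str))
```

Keeping this in mind helps when you later traverse the tree, since "Tail" belongs to the last `<l>` rather than to the `<div>` itself.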

As we said, `E` is an instance of the class `lxml.builder.ElementMaker`. This means it's already the result of another call. `ElementMaker` takes a `namespace` argument, giving you the ability to set a default namespace for your whole tree ! One of the downsides, however, is that it once again adds the `ns0` prefix :

In [None]:
from lxml.builder import ElementMaker
TEI = ElementMaker(namespace="http://www.tei-c.org/ns/1.0")
xml = TEI.div(
    { 
        "n" : "1",
        "type": "textpart",
        "subtype" : "section"
    },
    "Head",
    TEI.l( { "n" : "1" }, "Qui tecum cupis esse meos ubicumque libellos"),
    TEI.l( { "n" : "2" }, "Et comites longae quaeris habere viae,"),
    TEI.l( { "n" : "3" }, "Hos eme, quos artat brevibus membrana tabellis:"),
    TEI.l( { "n" : "4" }, "Scrinia da magnis, me manus una capit."),
    TEI.l( { "n" : "5" }, "Ne tamen ignores ubi sim venalis, et erres"),
    TEI.l( { "n" : "6" }, "Urbe vagus tota, me duce certus eris:"),
    TEI.l( { "n" : "7" }, "Libertum docti Lucensis quaere Secundum"),
    TEI.l( { "n" : "8" }, "Limina post Pacis Palladiumque forum. "),
    "Tail"
)

# But we still need lxml to print
from lxml import etree
print(etree.tostring(xml, encoding=str))
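
One way around the `ns0` prefix is the `nsmap` argument of `ElementMaker`, used here the same way as with `etree.Element`. This is a minimal sketch with a shortened poem (the name `TEI_NS` is only an illustrative constant) :

```python
from lxml import etree
from lxml.builder import ElementMaker

TEI_NS = "http://www.tei-c.org/ns/1.0"

# namespace puts every tag in the TEI namespace;
# nsmap maps that namespace to the default (None) prefix.
TEI = ElementMaker(namespace=TEI_NS, nsmap={None: TEI_NS})

node = TEI.div(
    {"n": "1"},
    TEI.l({"n": "1"}, "Qui tecum cupis esse meos ubicumque libellos"),
)
serialized = etree.tostring(node, encoding=str)
print(serialized)
```

The elements are still fully namespaced (`node.tag` uses Clark's notation), but the output carries a plain `xmlns` declaration instead of a generated prefix.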

## Going further

- Python XML processing with lxml, John W. Shipman, http://infohost.nmt.edu/tcc/help/pubs/pylxml/web/module-etree.html

## Exercises

### 1\. XML to String

Using the first poem identified by 1.1 in Martial's Epigrammata, replace all lines with paragraphs. You should request it from the CTS service using the URN `urn:cts:latinLit:phi1294.phi002.perseus-lat2:1.1` and the inventory `nemo` on `http://services2.perseids.org/exist/restxq/cts`

### 2\. XML from XPath

Create a function which, given an xpath, creates and returns the whole tree.

Expected outcome :

    xml = fromXpath("/TEI/body/text/div")
    print(etree.tostring(xml))
> `<TEI><body><text><div></div></text></body></TEI>`

In [None]:
# Write your code here

### 3\. XML from XPath with attributes

Taking a single node xpath as a parameter, write a function to create a node with its attributes.

Expected outcome :

    xml = fromXpath("div[@n='1' and @subtype='chapter']")
    print(etree.tostring(xml))
> `<div n="1" subtype="chapter"/>`

### 4\. Full XML from complex XPath

Using the outcome of the previous two exercises, write a function to create a complex tree using xpath :

    xml = fromXpath("/TEI/body/text/div[@type='edition' and @n='0']/div[@n='1' and @subtype='chapter']")
    print(etree.tostring(xml))
> `<TEI><body><text><div n="0" type="edition"><div n="1" subtype="chapter" /></div></text></body></TEI>`

---

<p><small><a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">Python Programming for the Humanities</span> by <a xmlns:cc="http://creativecommons.org/ns#" href="http://fbkarsdorp.github.io/python-course" property="cc:attributionName" rel="cc:attributionURL">http://fbkarsdorp.github.io/python-course</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. Based on a work at <a xmlns:dct="http://purl.org/dc/terms/" href="https://github.com/fbkarsdorp/python-course" rel="dct:source">https://github.com/fbkarsdorp/python-course</a>.</small></p>