# An introduction to Text Encoding and XML

# 1. Introduction

Goals of this lecture:
- **What** is XML and why is it **useful** for text-based DH?
- How to **read and navigate** XML documents in **Python**
- How to **search** in and **extract** information from XML documents.

Lecture and notebook based on [A Very Gentle Introduction to XML](https://www.tei-c.org/Vault/P4/doc/html/SG.html), part of the [Text Encoding Initiative](https://tei-c.org/) (or TEI) guidelines.

List of [projects](https://tei-c.org/Activities/Projects/) using TEI.


**Question to the sceptical hamster: Is XML scary?**

Siberian sapphire hamster


![](https://debeste.de/upload/51795a52ac20c074a8de79e35c14a90a1159.jpg)

![](https://media.giphy.com/media/ej1127nuAfGxoZnWTb/giphy.gif)

## 1.1 Toy Example(s)

To understand XML it is worthwhile to spend some time simply looking at examples.

XML is an information **tree**.

```xml
<anthology>
    <poem id='1' 
      author='Blake, William.'
      date='1794'>
      <heading id='1.1'>The SICK ROSE</heading>
      <stanza id='1.2'>
           <line id='1.2.1'>O Rose thou art sick.</line>
           <line id='1.2.2'>The invisible worm,</line>
           <line id='1.2.3'>That flies in the night</line>
           <line id='1.2.4'>In the howling storm:</line>
      </stanza>
      <stanza id='1.3'>
           <line id='1.3.1'>Has found out thy bed</line>
           <line id='1.3.2'>Of crimson joy:</line>
           <line id='1.3.3'>And his dark secret love</line>
           <line id='1.3.4'>Does thy life destroy.</line>
      </stanza>
    </poem>
    <poem id='2'
     author='Shakespeare, William.' >
        <heading id='2.1'>Sonnet 106</heading>
      <stanza id='2.2'>
        <line id='2.2.1'>When in the chronicle of wasted time</line>
        <line id='2.2.2'>I see descriptions of the fairest wights,</line>
        <line id='2.2.3'>And beauty making beautiful old rhyme</line>
        <line id='2.2.4'>In praise of ladies dead, and lovely knights,</line>
     </stanza>
  </poem> 
</anthology>
```

![tree](https://tei-c.org/release/doc/tei-p5-doc/en/html/Images/xmlFlowChart.png)
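The tree shape can also be made visible in code. The sketch below is only an illustration: it uses `lxml` (introduced later in this notebook), a shortened copy of the anthology, and a hypothetical helper called `outline` that prints one indented line per element.

```python
from lxml import etree

# Shortened copy of the anthology example (one poem, two lines).
toy_anthology = """<anthology>
  <poem id="1">
    <heading>The SICK ROSE</heading>
    <stanza>
      <line>O Rose thou art sick.</line>
      <line>The invisible worm,</line>
    </stanza>
  </poem>
</anthology>"""


def outline(element, depth=0):
    """Return one indented tag name per element, visiting the tree depth-first."""
    lines = ["  " * depth + element.tag]
    for child in element:  # iterating over an element yields its children
        lines.extend(outline(child, depth + 1))
    return lines


tree_outline = outline(etree.fromstring(toy_anthology))
print("\n".join(tree_outline))
```

Each level of indentation corresponds to one level of nesting in the XML.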

Another example, showing metadata for historical newspaper articles:

```xml
<?xml version="1.0"?>
<lwm>
  <process>
    <lwm_tool>
      <name>extract_text</name>
      <version>0.3.0</version>
      <source>https://github.com/alan-turing-institute/Living-with-Machines-code</source>
    </lwm_tool>
    <source_type>newspaper</source_type>
    <xml_flavour>alto</xml_flavour>
    <software>CCS docWorks/METAe Version 7.1-0</software>
    <input_sub_path>0002247/1863/1010</input_sub_path>
    <input_filename>0002247_18631010_mets.xml</input_filename>
    <mets_namespace>http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/version18/mets.xsd info:lc/xmlns/premis-v2 http://www.loc.gov/standards/premis/v2/premis-v2-0.xsd http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-2.xsd</mets_namespace>
    <alto_namespace>http://schema.ccs-gmbh.com/docworks/alto-1-4.xsd</alto_namespace>
  </process>
  <publication id="0002247">
    <source>British Library Heritage Made Digital Newspapers</source>
    <title>The Bee-Hive :</title>
    <location>London, England</location>
    <issue id="1863-10-10">
      <date>1863-10-10</date>
      <item id="art0101">
        <plain_text_file>0002247_18631010_art0101.txt</plain_text_file>
        <title>gtt ginthon Marlats.</title>
        <item_type>ARTICLE</item_type>
        <word_count>125</word_count>
        <ocr_quality_mean>0.9591</ocr_quality_mean>
        <ocr_quality_sd>0.1138</ocr_quality_sd>
      </item>
    </issue>
  </publication>
</lwm>

```

### ✏️ Exercise

Go to this online [XML viewer](https://jsonformatter.org/xml-viewer) and copy paste the above example and press "Validate". Explore the XML tree.



# 3. Why XML?

## 3.1 Markup

**XML stands for "eXtensible Markup Language"**

Why add markup to a text document?
- XML categorizes chunks of text into different (semantic) units.
- With XML we can articulate an **interpretation** of text through **encoding**
- Importantly, XML doesn't **do** anything (that's up to Python); it only marks things up.
    
From the TEI tutorial:
> "it is a process of making explicit what is conjectural or implicit, a process of directing the user as to how the content of the text should be (or has been) interpreted."

Historically, XML and text encoding were central to research in the digital humanities. They are less prominent today, but you are still likely to encounter them at some point (how often depends on where you are situated in the DH universe).

For these reasons, we still teach the basics, even though we prefer to spend most of our time on other topics.

What does it actually mean to **"articulate an interpretation of text through encoding"**?

To give a concrete example, let's work with a snippet of parliamentary speech.

Mr. ATTLEE: I would like at the outset to associate myself with the sympathy expressed by the hon. Member who moved this Address with regard to the cause of the absence of His Majesty from our proceedings to-day. The PRIME MINISTER (Mr. Baldwin): Reference has been made by the Mover of the Address and by the right hon. Gentleman the Leader of the Opposition to the grievous loss which has befallen His Majesty.

By reading this snippet as just a sequence of words we'd lose a lot of valuable **context information**.
- Instead of a long, uninterrupted text, this snippet captures a specific type of discourse event, a debate between individual politicians
- Not all words are spoken text, some refer to names of politicians.
- It is an instance of parliamentary speech.


We can make this interpretation explicit through encoding.

```xml
<proceeding>
    <speaker>Mr. ATTLEE:</speaker>
    <speech>
        <p>I would like at the outset to associate myself with the sympathy expressed by the hon. Member who moved this Address with regard to the cause of the absence of His Majesty from our proceedings to-day.
        </p>
    </speech>
    <speaker>The PRIME MINISTER (Mr. Baldwin):</speaker>
    <speech>
        <p>Reference has been made by the Mover of the Address and by the right hon. Gentleman the Leader of the Opposition to the grievous loss which has befallen His Majesty.
        </p>
    </speech>
</proceeding>
```
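Once the interpretation is encoded, a program can use it. The following is a minimal sketch, assuming `lxml` (introduced later in this notebook) and an abridged version of the two speeches; the variable names are illustrative only.

```python
from lxml import etree

# Abridged version of the encoded snippet (speech text shortened).
toy_proceeding = """<proceeding>
    <speaker>Mr. ATTLEE:</speaker>
    <speech><p>I would like at the outset to associate myself with the sympathy expressed.</p></speech>
    <speaker>The PRIME MINISTER (Mr. Baldwin):</speaker>
    <speech><p>Reference has been made by the Mover of the Address.</p></speech>
</proceeding>"""

toy_root = etree.fromstring(toy_proceeding)

# The tags tell us which text is a name and which text is spoken.
speakers = [el.text for el in toy_root.findall("speaker")]
speeches = ["".join(el.itertext()).strip() for el in toy_root.findall("speech")]

# Pair each speaker with the speech that follows.
turns = list(zip(speakers, speeches))
for speaker, speech in turns:
    print(speaker, "->", len(speech.split()), "words")
```

Without the markup, separating names from spoken words would require guessing from capitalisation and punctuation.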

Why is XML important for DH research?

## 3.2 Extensible
- XML defines the **rules/conventions** for encoding a text 
    - which tags are allowed (speaker, speech)
    - articulate relations between tags (paragraph embedded in speech tag)
- XML is **extensible** because it allows you to define a new (or refine an existing) set of tags (it is not a set of predefined tags like HTML).
- XML is a **"metalanguage"** (a language for **describing** markup languages).

Summary so far:
- XML is **human** and **machine** readable (it aligns humanistic interpretation of document with machine processing)
- **Encoding** text means adding a layer of annotation (enclosing text segments with tags)
- TEI aims to encode the **meaning of data**, not just its **formal** (page breaks) or **procedural** (font type) features.

**It captures and communicates what is actually going on in a piece of text.**

In this session:

- Introduce the building blocks of XML.
- Focus on **working** with XML documents (XML for the **consumer** side)
- We explore scenarios where you encounter **text encoded in XML**, and how to make use of the **structure** provided by XML.

# 4. XML building blocks

## 4.1 Elements

**XML elements** are the principal building blocks of a document (i.e. the units that structure it): `<speaker>Mr. ATTLEE:</speaker>`

- Everything from the start tag to the end tag, inclusive, is called an element.
- Each element must be **explicitly tagged**: every start tag **must** have a matching end tag (an empty element may use the self-closing form, e.g. `<pb/>`).
- Elements can contain other elements, text or a combination of the two.


> Mr. ATTLEE: I would like at the outset to associate myself with the sympathy expressed by the hon. Member who moved this Address with regard to the cause of the absence of His Majesty from our proceedings to-day.

`Mr. ATTLEE: <p>I would like at the outset to associate myself with the sympathy expressed by the hon. Member who moved this Address with regard to the cause of the absence of His Majesty from our proceedings to-day.</p>`

- Elements can be **contained/nested** within each other.

```xml
<proceeding>
    <speaker>Mr. ATTLEE:</speaker>
    <speech>
        <p>I would like at the outset to associate myself with the sympathy expressed by the hon. Member who moved this Address with regard to the cause of the absence of His Majesty from our proceedings to-day.
        </p>
    </speech>
    <speaker>The PRIME MINISTER (Mr. Baldwin):</speaker>
    <speech>
        <p>Reference has been made by the Mover of the Address and by the right hon. Gentleman the Leader of the Opposition to the grievous loss which has befallen His Majesty.
        </p>
    </speech>
</proceeding>
```

- Elements stand in tree relations to each other: parent, child, sibling.

```xml
<root> <!-- parent -->
    <child>Mr. ATTLEE:</child> <!-- child -->
    <child-sibling> <!-- sibling -->
        <subchild>I would like at the outset to associate myself with the sympathy expressed by the hon. Member who moved this Address with regard to the cause of the absence of His Majesty from our proceedings to-day.
        </subchild>
    </child-sibling>
</root>
```



To be **syntactically** correct (or "well-formed"):
- A single root element must contain all other elements.
- Elements must not partially overlap.
- Tags mark the start and end of each element, and every tag must be closed.
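A parser enforces these rules for you. The sketch below assumes `lxml` (introduced later in this notebook); `is_well_formed` is a hypothetical helper written for this illustration, not part of the library.

```python
from lxml import etree


def is_well_formed(xml_string):
    """Return True if lxml can parse the string, False on a syntax error."""
    try:
        etree.fromstring(xml_string)
        return True
    except etree.XMLSyntaxError:
        return False


examples = {
    "nested": "<speech><p>Properly nested.</p></speech>",
    "overlapping": "<speech><p>Overlapping tags.</speech></p>",
    "unclosed": "<speech><p>No closing tags.",
    "two_roots": "<speech/><speech/>",
}

results = {name: is_well_formed(xml) for name, xml in examples.items()}
print(results)
```

Only the properly nested example should parse; the other three each break one of the rules listed above.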

## 4.2 Attributes

- **Elements** make explicit the structure that is **implicit** in a text; **attributes** often enrich this structure by adding information that is **absent** from it.
- Attributes are the XML syntax for encoding information that is not part of the text itself.

An example:

```xml
<speech pm:speaker="Mr. ATTLEE"
        pm:party="Labour"
        pm:role="mp"
        pm:party-ref="uk.p.Lab"
        pm:member-ref="uk.m.16522"
        pm:id="uk.proc.d.1935-12-03.3.10.6"
        dp:time="15:38:00">
    <p pm:id="uk.proc.d.1935-12-03.3.10.6.1">I would like at the outset to associate myself with the sympathy expressed by the hon. Member who moved this Address with regard to the cause of the absence of His Majesty from our proceedings to-day.
    </p>
</speech>
```


- Generally, attributes capture information that is "descriptive of an element but not part of its content" (e.g. identifiers.)
- Attribute values must be **quoted**
- Attributes are embedded in the **start tag** (no attributes in the end-tag.)
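As a minimal sketch of how attributes look from the Python side, the example below parses a simplified speech element with `lxml` (introduced later in this notebook). The namespace prefixes of the original are left out here as a simplifying assumption, and the speech text is abridged.

```python
from lxml import etree

# Simplified speech element: prefixes such as pm: are omitted on purpose.
toy_speech = etree.fromstring(
    '<speech speaker="Mr. ATTLEE" party="Labour" time="15:38:00">'
    "<p>I would like at the outset to associate myself with the sympathy expressed.</p>"
    "</speech>"
)

# Attributes live on the element, separate from its text content.
attributes = dict(toy_speech.attrib)
party = toy_speech.get("party")
role = toy_speech.get("role", "unknown")  # fallback for a missing attribute

print(attributes)
print(party, role)
```

The text inside `<p>` is untouched by the attributes: they describe the speech without being part of what was said.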

## 4.3 Advanced: Schemas

- Documents are considered as belonging to a specific type ("academic paper", "novel", "parliamentary proceedings", etc).
- Validity is assessed in relation to a schema (for example a **Document Type Definition**, or DTD).
- A schema defines the constituent parts and their relations (in what arrangements elements can appear).
- Example of Political Mashup's schema for [parliamentary proceedings](https://web.archive.org/web/20170504021953/http://schema.politicalmashup.nl/schemas.html).


**Anthology**  an example taken from [TEI](https://tei-c.org/release/doc/tei-p5-doc/en/html/SG.html#:~:text=XML%20is%20widely%20used%20for,on%20the%20World%20Wide%20Web.)

```xml
<anthology>
 <poem>
  <heading>The SICK ROSE</heading>
  <stanza>
   <line>O Rose thou art sick.</line>
   <line>The invisible worm,</line>
   <line>That flies in the night</line>
   <line>In the howling storm:</line>
  </stanza>
  <stanza>
   <line>Has found out thy bed</line>
   <line>Of crimson joy:</line>
   <line>And his dark secret love</line>
   <line>Does thy life destroy.</line>
  </stanza>
 </poem>
<!-- more poems go here -->
</anthology>
```
              


In short, schemas:
- **are a formal representation of a document's structure**
- govern which **elements** are **allowed** and how they can be **arranged** (relations between elements, i.e. a speech can contain paragraphs etc.)
- are an **interpretation** of the structure of a source type.
- **balance** demands for clarity, uniformity and complexity. Schemas are a (necessary) simplification. It is hard to agree on a standard/schema, and there will always be some reduction that harms the uniqueness of a document.
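To make validation concrete, here is a small sketch using `lxml`'s DTD support. The DTD below is a hypothetical one written for this illustration of the anthology (it is not the TEI schema): an anthology holds one or more poems, and a poem holds an optional heading followed by one or more stanzas.

```python
from io import StringIO
from lxml import etree

# Hypothetical DTD for the anthology example, written for illustration only.
anthology_dtd = etree.DTD(StringIO(
    "<!ELEMENT anthology (poem+)>\n"
    "<!ELEMENT poem (heading?, stanza+)>\n"
    "<!ELEMENT heading (#PCDATA)>\n"
    "<!ELEMENT stanza (line+)>\n"
    "<!ELEMENT line (#PCDATA)>\n"
))

# Follows the rules: poem > heading, stanza > line.
good_doc = etree.fromstring(
    "<anthology><poem><heading>The SICK ROSE</heading>"
    "<stanza><line>O Rose thou art sick.</line></stanza></poem></anthology>"
)

# Well-formed XML, but the line is not wrapped in a stanza.
bad_doc = etree.fromstring(
    "<anthology><poem><line>O Rose thou art sick.</line></poem></anthology>"
)

good_is_valid = anthology_dtd.validate(good_doc)
bad_is_valid = anthology_dtd.validate(bad_doc)
print(good_is_valid, bad_is_valid)
```

Note the difference between the two notions: both documents are well-formed, but only the first is valid with respect to this particular schema.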



## 4.4 Advanced: Namespaces
- Namespaces indicate **provenance** of the element and attributes. 
- Namespaces avoid **name conflicts**.

Example of Parliamentary Proceedings:

`<root xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsd="http://www.w3.org/2001/XMLSchema-datatypes" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:pmd="http://www.politicalmashup.nl/docinfo" xmlns:pm="http://www.politicalmashup.nl" xmlns:dp="http://dilipad.history.ac.uk" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:html="http://www.w3.org/1999/xhtml" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">`

`<proceedings xmlns="http://www.politicalmashup.nl" pm:id="uk.proc.d.1935-12-03">`

According to the ["Gentle Introduction"](https://tei-c.org/release/doc/tei-p5-doc/en/html/SG.html):
> If the markup of an XML document is thought of as an expression in some language, then a namespace may be thought of as analogous to the lexicon of that language. 


- Namespaces are the **lexicon** of a markup language (`xmlns:pm="`[http://www.politicalmashup.nl](https://web.archive.org/web/20190505013425/http://www.politicalmashup.nl)`"`)
- Prefix signals **association** with a namespace. (`pm:id="uk.proc.d.1935-12-03"` or `pm:party="Labour"`)
- XML documents allow you to **combine** elements from different namespaces (i.e. you can create your own schema and use it to extend existing representations.)
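The sketch below illustrates how namespaces avoid a name conflict, using `lxml`. The document is invented for this example: it assumes two different `title` elements, one from Dublin Core (`dc`) and one placed in the Political Mashup namespace (`pm`) purely for illustration.

```python
from lxml import etree

# Invented record with two different "title" elements.
toy_record = """<record xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:pm="http://www.politicalmashup.nl">
    <dc:title>House of Commons debates</dc:title>
    <pm:title>Orders of the Day</pm:title>
</record>"""

toy_root = etree.fromstring(toy_record)

# The prefix-to-URI mapping declared on the root element.
toy_nsmap = toy_root.nsmap

# The same local name, told apart by its namespace.
dc_title = toy_root.find("dc:title", namespaces=toy_nsmap).text
pm_title = toy_root.find("pm:title", namespaces=toy_nsmap).text

# Internally lxml stores the full namespace URI in curly brackets.
tags = [child.tag for child in toy_root]
print(dc_title, "|", pm_title)
print(tags)
```

The prefixes are only shorthand: what identifies an element is the combination of namespace URI and local name.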


Now, let's turn from theory to practice...

# 5. Working with XML in Python

- Goal of this tutorial: How to **open, process and extract relevant content from XML documents**
- You can also **create** and **validate** XML in Python, but we will cover this in less detail (as said, emphasis on the consumer side)

Python offers [multiple libraries](https://realpython.com/python-xml-parser/) for working with XML but we will use [lxml](https://lxml.de/parsing.html). 

![lxml](https://lxml.de/python-xml-title.png)

Just as pandas provides tools for tabular data, lxml provides many tools for parsing and traversing tree-structured XML documents.

In [None]:
from lxml import etree
print(etree.__doc__)

## 5.1 Toy Example (again)


In [None]:
xml_string = """
<anthology>
    <poem id='1' 
      author='Blake, William.'
      date='1794'>
      <heading id='1.1'>The SICK ROSE</heading>
      <stanza id='1.2'>
           <line id='1.2.1'>O Rose thou art sick.</line>
           <line id='1.2.2'>The invisible worm,</line>
           <line id='1.2.3'>That flies in the night</line>
           <line id='1.2.4'>In the howling storm:</line>
      </stanza>
      <stanza id='1.3'>
           <line id='1.3.1'>Has found out thy bed</line>
           <line id='1.3.2'>Of crimson joy:</line>
           <line id='1.3.3'>And his dark secret love</line>
           <line id='1.3.4'>Does thy life destroy.</line>
      </stanza>
    </poem>
    <poem id='2'
     author='Shakespeare, William.' >
        <heading id='2.1'>Sonnet 106</heading>
      <stanza id='2.2'>
        <line id='2.2.1'>When in the chronicle of wasted time</line>
        <line id='2.2.2'>I see descriptions of the fairest wights,</line>
        <line id='2.2.3'>And beauty making beautiful old rhyme</line>
        <line id='2.2.4'>In praise of ladies dead, and lovely knights,</line>
     </stanza>
  </poem> 
</anthology>
"""
              

In [None]:
# etree converts a string into a tree; this is called parsing
from io import StringIO
tree = etree.parse(StringIO(xml_string)); tree

In [None]:
# what goodies are attached to an lxml etree object?
help(tree)

In [None]:
# iterate over all elements (iter() replaces the deprecated getiterator())
for element in tree.iter():
    print(element)

In [None]:
# explore tree structure
root = tree.getroot()
len(root) # we only have two poems in anthology

In [None]:
# first poem
root[0]

In [None]:
# second poem
root[1]

In [None]:
# second child of the second poem
# (index 0 is the heading, so this is the poem's only stanza)
root[1][1]

In [None]:
# each element has different Python attributes
root[1][1].tag

In [None]:
root[1][1][2].tag

In [None]:
# why is this empty?
root[1][1].text

In [None]:
# text attribute only gets the text directly enclosed by the tag
root[1][1][2].text

In [None]:
# print the text of the first stanza of the first poem
# (index 0 is the heading, so the first stanza sits at index 1)
print(etree.tostring(root[0][1],method='text',encoding='unicode'))
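
The difference between `.text` and the full text content of an element can be seen in a smaller, self-contained sketch. The names `toy_stanza`, `direct_text`, `line_texts` and `full_text` are illustrative only; the stanza is written without whitespace between tags so that the result is easy to read.

```python
from lxml import etree

toy_stanza = etree.fromstring(
    "<stanza><line>O Rose thou art sick.</line>"
    "<line>The invisible worm,</line></stanza>"
)

# .text only covers text that comes before the first child element.
direct_text = toy_stanza.text

# Each child has its own .text.
line_texts = [line.text for line in toy_stanza]

# itertext() walks the whole subtree and yields every piece of text.
full_text = " ".join(toy_stanza.itertext())

print(repr(direct_text))
print(line_texts)
print(full_text)
```

Using `itertext()` gives you a list of strings to work with, whereas `etree.tostring(..., method='text')` returns a single string.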

In [None]:
#attributes are stored as a dictionary
root[0].attrib

In [None]:
root[0].attrib['author']

In [28]:
root[0].attrib.get('author',None)

### ✏️ 1. Exercise.

- Convert the small dialogue below to an XML document, using the tags: 
    - `dialogue` (root tag)
    - `speech_turn` (nested within `dialogue`)
    - `line` (nested within `speech_turn`)
- add the name of the speaker as a **name** attribute to each `speech_turn`, as well as a **unique identifier** for each actor
- add **identifiers** to each element, similar to the anthology example
- store the XML document as a **string** in a variable called `xml_dialogue`
- **parse** the string with `etree.parse` and save the element tree in a new variable
- print the text of the second `speech_turn`
- (**Difficult**): can you print all lines spoken by George?

```
George: “You’re gonna over-dry your laundry.”
Jerry: “You can’t over-dry.”
George: “Why not?”
Jerry: “Same reason you can’t over-wet.”
```
![seinfeld](https://media.giphy.com/media/14dXclYKbx2ONW/giphy.gif)

In [None]:
# add your answer here

## 5.2 A Realistic Example (with namespace, ugh)
### UK Parliamentary Debates (29th June 1967)

Political Mashup's schema for [parliamentary proceedings](https://web.archive.org/web/20170504021953/http://schema.politicalmashup.nl/schemas.html).

In [14]:
# open an XML file
with open('xml_examples/uk.proc.d.1967-06-29.xml','rb') as in_xml:
    tree = etree.parse(in_xml) # parse the content of the xml file

In [15]:
# get the root element
root = tree.getroot()

In [16]:
len(root), type(root)

(3, lxml.etree._Element)

In [17]:
# iterate over all the elements directly under the root element
# each element behaves like a list of its child elements
for element in root:
    print(element.tag)

{http://www.politicalmashup.nl/docinfo}docinfo
meta
{http://www.politicalmashup.nl}proceedings


In these examples elements belong to different namespaces, for example

`{http://www.politicalmashup.nl}proceedings`

The namespace appears as a prefix and is surrounded by curly brackets.
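
If you need the parts of such a tag separately, `lxml` offers the `etree.QName` helper. A minimal sketch (the variable names are illustrative):

```python
from lxml import etree

tag = "{http://www.politicalmashup.nl}proceedings"

# Split a namespaced tag into its two parts.
qname = etree.QName(tag)
namespace = qname.namespace
local_name = qname.localname

# Go the other way: build a namespaced tag from a URI and a local name.
speech_tag = etree.QName("http://www.politicalmashup.nl", "speech").text

print(namespace)
print(local_name)
print(speech_tag)
```

This is handy when printing tags for inspection, since the local name alone is usually easier to read.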

In [18]:
# the proceedings, the actual content
# is the third element in the list
proceedings = root[2]

In [19]:
# third element of the proceedings element
# i.e. the third topic
proceedings[2]

<Element {http://www.politicalmashup.nl}topic at 0x10acab840>

In [20]:
# first scene in third topic
proceedings[2][0]

<Element {http://www.politicalmashup.nl}scene at 0x10acb61c0>

In [21]:
# you can iterate over these elements as a list to get the children
for i,element in enumerate(proceedings[2]):
    print(i,element.tag,len(element))

0 {http://www.politicalmashup.nl}scene 4
1 {http://www.politicalmashup.nl}scene 6
2 {http://www.politicalmashup.nl}scene 5
3 {http://www.politicalmashup.nl}scene 5
4 {http://www.politicalmashup.nl}scene 6
5 {http://www.politicalmashup.nl}scene 9
6 {http://www.politicalmashup.nl}scene 22
7 {http://www.politicalmashup.nl}scene 4
8 {http://www.politicalmashup.nl}scene 4
9 {http://www.politicalmashup.nl}scene 4
10 {http://www.politicalmashup.nl}scene 18
11 {http://www.politicalmashup.nl}scene 9
12 {http://www.politicalmashup.nl}scene 4
13 {http://www.politicalmashup.nl}scene 4
14 {http://www.politicalmashup.nl}scene 4
15 {http://www.politicalmashup.nl}scene 4
16 {http://www.politicalmashup.nl}scene 6
17 {http://www.politicalmashup.nl}scene 6
18 {http://www.politicalmashup.nl}scene 4
19 {http://www.politicalmashup.nl}scene 10
20 {http://www.politicalmashup.nl}scene 4


### ✏️ 2. Exercise.
- How many paragraphs (`<p>` tags) does the 1st speech in the 7th scene of the 3rd topic contain?
- How many words does the first paragraph contain? (You can just use `split()` to divide the string into tokens).

In [22]:
# write answer here

In [None]:
# you can iterate over a specific element
# notice that we need to define the namespace
# for both the element and attribute
for element in tree.iter("{http://www.politicalmashup.nl}speech"):
    print(element.attrib['{http://www.politicalmashup.nl}id'])

In [None]:
# get the namespace mapping
nsmap = root.nsmap; nsmap

### ✏️ 3. Exercise.
- (Difficult) What is the longest speech in the fourth topic element?

In [None]:
# type answer here

## Elements carry attributes as dictionaries

In [None]:
proceedings[2][6][15].tag,proceedings[2][6][15].attrib

In [None]:
# use string formatting for namespaces
speech = proceedings[2][6][15]
speech.attrib["{%s}speaker" % root.nsmap['pm']]

In [None]:
# it is safer to use .get() method
speech.attrib.get("{%s}speaker" % root.nsmap['pm'], '')

### ✏️ Exercise.

- Count the number of speeches made by Conservative members during this sitting of Parliament.

In [None]:
# type answer here

### ✏️ Exercise.

- How often does the Speaker intervene?

In [None]:
# type answer here

# Find methods

In [None]:
help(root.find)

In [None]:
# note the namespace prefix
# .// means: starting from the current location (i.e. the proceedings element),
# search for descendant speech elements regardless of their
# position in the subtree
# find returns only the first matching element
nsmap = root.nsmap
first_speech = root[2].find('.//pm:speech',namespaces=nsmap); first_speech

In [None]:
# findall returns all matching elements
speeches = root[2].findall('.//pm:speech',namespaces=nsmap); len(speeches)
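
The behaviour of the find methods can be checked on a toy document whose content is known in advance. The sketch below is self-contained; the document, the `pm:vote` tag and the names `NS` and `toy_root` are invented for this illustration.

```python
from lxml import etree

NS = {"pm": "http://www.politicalmashup.nl"}

toy_doc = """<pm:proceedings xmlns:pm="http://www.politicalmashup.nl">
  <pm:topic>
    <pm:scene>
      <pm:speech pm:speaker="A"><pm:p>First.</pm:p></pm:speech>
      <pm:speech pm:speaker="B"><pm:p>Second.</pm:p></pm:speech>
    </pm:scene>
  </pm:topic>
</pm:proceedings>"""

toy_root = etree.fromstring(toy_doc)

# find: the first match only, or None when nothing matches.
first_match = toy_root.find(".//pm:speech", namespaces=NS)
no_match = toy_root.find(".//pm:vote", namespaces=NS)

# findall: a list with every match (empty when nothing matches).
all_matches = toy_root.findall(".//pm:speech", namespaces=NS)

# findtext: the text of the first match.
first_paragraph = toy_root.findtext(".//pm:p", namespaces=NS)

print(first_match.get("{http://www.politicalmashup.nl}speaker"))
print(no_match, len(all_matches), first_paragraph)
```

Because `find` returns `None` when nothing matches, it is worth checking the result before reading `.text` or `.attrib` from it.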

# Fin.

# Advanced: Finding Needles and Mining Haystacks with XPath

- XPath is a rich language for **querying** and **navigating** elements and attributes in an XML document.
    - It lets you define precisely the regions of interest in your XML documents.
- With XPath you can select elements, attributes, or text nodes.
- XPath uses **path expressions** to query an XML document.
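
Before turning to the parliamentary data, here is a small self-contained sketch of the three kinds of selection on a shortened copy of the anthology (no namespaces involved). The variable names are illustrative only.

```python
from lxml import etree

toy_anthology = etree.fromstring("""<anthology>
  <poem id="1" author="Blake, William.">
    <stanza><line>O Rose thou art sick.</line><line>The invisible worm,</line></stanza>
  </poem>
  <poem id="2" author="Shakespeare, William.">
    <stanza><line>When in the chronicle of wasted time</line></stanza>
  </poem>
</anthology>""")

# Select attribute values.
authors = toy_anthology.xpath("./poem/@author")

# Select text nodes: the first line of every stanza.
first_lines = toy_anthology.xpath(".//stanza/line[1]/text()")

# Select elements, filtered by an attribute value.
blake_lines = toy_anthology.xpath("./poem[@author='Blake, William.']//line")

# XPath functions return plain values (count() gives a float).
line_count = toy_anthology.xpath("count(.//line)")

print(authors, first_lines, len(blake_lines), line_count)
```

Depending on the expression, `xpath()` returns a list of elements, a list of strings, or a single number, so it pays to check what you got back.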

In [None]:
# first scene of fifth topic
# notice that XPath positions start at 1, not 0!
# xpath() returns a list, here with a single scene element
scenes = root.xpath('./pm:proceedings/pm:topic[5]/pm:scene[1]',namespaces=nsmap); scenes

In [None]:
# first select the third topic
# // then matches any speech element contained by that topic, at any depth
speeches = root.xpath('./pm:proceedings/pm:topic[3]//pm:speech',namespaces=nsmap); len(speeches)

In [None]:
# .// from root find all speeches regardless of where they are situated
speeches = root.xpath('.//pm:speech',namespaces=nsmap); len(speeches)

In [None]:
# return speeches that contain a particular attribute
# in this case speeches with a role attribute
speeches = root.xpath('.//pm:speech[@pm:role]',namespaces=nsmap); len(speeches)

In [None]:
# return the attribute values themselves, not the elements
roles = root.xpath('.//pm:speech/@pm:role',namespaces=nsmap); 
roles[:10]

In [None]:
# find speeches where attribute matches a particular value
speeches = root[2].xpath('.//pm:speech[@pm:role="chair"]',namespaces=nsmap); len(speeches)

More information [here](https://devhints.io/xpath)

### ✏️ 5. Exercises.

- Which parties are participating in this sitting?
- How many Conservative/Labour speeches does the document contain? Which party intervenes more?

In [None]:
# write answer here

## Combining structure and text

In [None]:
# return paragraphs that mention a certain token, in this case 'women'
mentions = root.xpath(".//pm:speech/pm:p[contains(text(), 'women')]",namespaces=nsmap)

In [None]:
# get the fifth mention (index 4)
# move up one level in the tree to get the speech element 
# and then the speaker name in the attribute
mentions[4].getparent().attrib['{http://www.politicalmashup.nl}speaker']
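
One caveat with `contains(text(), ...)`: in XPath 1.0, which is what lxml supports, `text()` inside `contains` is reduced to the first text node of the paragraph. If a paragraph contains inline markup, later text is not searched. The sketch below uses an invented speech with a hypothetical `<i>` element to show the difference; whether the parliamentary files contain such inline elements is not something this notebook establishes.

```python
from lxml import etree

# Invented speech; the second paragraph has inline markup.
toy_speech = etree.fromstring(
    '<speech speaker="Hypothetical Member">'
    "<p>The position of women is clear.</p>"
    "<p>I refer, <i>of course</i>, to women in work.</p>"
    "<p>Nothing relevant here.</p>"
    "</speech>"
)

# text() inside contains() only looks at the first text node of each <p>.
first_text_only = toy_speech.xpath("./p[contains(text(), 'women')]")

# "." stands for the full text content of the <p>, inline elements included.
whole_paragraph = toy_speech.xpath("./p[contains(., 'women')]")

# From a matching paragraph, move up to the speech to read the speaker.
speaker = whole_paragraph[1].getparent().get("speaker")

print(len(first_text_only), len(whole_paragraph), speaker)
```

When in doubt, `contains(., 'women')` is the safer choice for searching the complete text of an element.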

### ✏️ 6. Exercises.

- Get the title attributes of each topic element.
- Formulate an XPath expression that finds the debate (i.e. topic element) on the medical termination of pregnancy bill.
- Formulate an XPath expression that finds speeches made by Labour MPs during the debate on the medical termination of pregnancy bill.
- Formulate an XPath expression that finds paragraphs in speeches made by Conservative MPs during the debate on the medical termination of pregnancy bill that mention the word "women". Print the content of these paragraphs.
- Formulate a query that returns the speeches in which the word "woman" is mentioned. (Difficult, requires some Googling)

In [None]:
# write answer here

### ✏️ 7. Exercise
- (**Difficult**) Make a small program that computes word frequencies by party (only Labour and Conservative is fine).
- (**Difficult**) Can you calculate if the probability of mentioning "woman" or "women" differs by party? The probability in this case is just the frequency of the word "woman" divided by the total number of words spoken by a certain party.

## Solutions

### ✏️ 1. Exercise.

- Convert the small dialogue below to an XML document, using the tags: 
    - `dialogue` (root tag)
    - `speech_turn` (nested within `dialogue`)
    - `line` (nested within `speech_turn`)
- add the name of the speaker as a **name** attribute to each `speech_turn`, as well as a **unique identifier** for each actor
- add **identifiers** to each element, similar to the anthology example
- store the XML document as a **string** in a variable called `xml_dialogue`
- **parse** the string with `etree.parse` and save the element tree in a new variable
- print the text of the second `speech_turn`
- (**Difficult**): can you print all lines spoken by George?

In [23]:

xml_dialogue = """<dialogue>
    <speech_turn name="George" id="1">
        <line id="1.1">You’re gonna over-dry your laundry.</line>
    </speech_turn>
    <speech_turn name="Jerry" id="2">
        <line id="2.1">You can’t over-dry.</line>
    </speech_turn>
    <speech_turn name="George" id="3">
        <line id="3.1">Why not?</line>
    </speech_turn>
    <speech_turn name="Jerry" id="4">
        <line id="4.1">Same reason you can’t over-wet.</line>
    </speech_turn>
</dialogue>
"""

In [24]:
from lxml import etree
from io import StringIO

tree = etree.parse(StringIO(xml_dialogue))
root = tree.getroot()
root[1][0].text

'You can’t over-dry.'

In [25]:
# iter() replaces the deprecated getiterator()
for e in tree.iter():
    if e.attrib.get('name') == 'George':
        print(etree.tostring(e,method='text',encoding='unicode'))


        You’re gonna over-dry your laundry.
    
    

        Why not?
    
    


### ✏️ 2. Exercise.
- How many paragraphs (`<p>` tags) does the 1st speech in the 7th scene of the 3rd topic contain?
- How many words does the first paragraph contain? (You can just use `split()` to divide the string into tokens).