# Text Encoding and XML

## Introduction

Goals of this lecture:
- What is XML and why is it useful for text-based DH?
- How to read and navigate XML documents?

This lecture and notebook are based on [A Very Gentle Introduction to XML](https://www.tei-c.org/Vault/P4/doc/html/SG.html), part of the [Text Encoding Initiative](https://tei-c.org/) (TEI).

List of [projects](https://tei-c.org/Activities/Projects/) using TEI.

### XML stands for "eXtensible Markup Language" 

### Markup and Text Encoding
- **markup** is used for presenting and enriching documents with tag annotations
- **tags** can be used for presentational purposes

`Some tags serve as <b>markup for bold text</b> others for <i>italic</i> words.`

Is interpreted as:

Some tags serve as <b>markup for bold text</b> others for 
<i>italic</i> words.

Why add markup to a text document?
- XML categorizes chunks of text into different (semantic) units.
- With XML we can articulate an **interpretation** of text through **encoding**
    
From the TEI tutorial:
> "it is a process of making explicit what is conjectural or implicit, a process of directing the user as to how the content of the text should be (or has been) interpreted."

Mr. ATTLEE: I would like at the outset to associate myself with the sympathy expressed by the hon. Member who moved this Address with regard to the cause of the absence of His Majesty from our proceedings to-day. The PRIME MINISTER (Mr. Baldwin): Reference has been made by the Mover of the Address and by the right hon. Gentleman the Leader of the Opposition to the grievous loss which has befallen His Majesty.

<proceeding>
    <speaker>Mr. ATTLEE:</speaker>
        <speech>
            <p>I would like at the outset to associate myself with the sympathy expressed by the hon. Member who moved this Address with regard to the cause of the absence of His Majesty from our proceedings to-day.
            </p>
        </speech>
    <speaker>The PRIME MINISTER (Mr. Baldwin):</speaker>
        <speech>
            <p>Reference has been made by the Mover of the Address and by the right hon. Gentleman the Leader of the Opposition to the grievous loss which has befallen His Majesty. 
            </p>
        </speech>
</proceeding>

- XML is **human** and **machine** readable (it aligns interpretation with machine processing)
- **Encoding** text means adding a layer of annotation (enclosing text segments with tags)
- TEI aims to encode the **meaning of data**, not just its **formal** (page breaks) or **procedural** (font type) aspects. It communicates to a computer what is going on in a piece of text.
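To make the "machine readable" point concrete, here is a minimal sketch that parses a shortened version of the encoded proceeding above and pairs each speaker with the speech that follows. It uses `lxml`, the library introduced later in this notebook; the variable names (`proceeding_xml`, `turns`) are illustrative assumptions, and the paragraphs are abbreviated.

```python
from lxml import etree

# Shortened version of the encoded proceeding shown above
proceeding_xml = """
<proceeding>
    <speaker>Mr. ATTLEE:</speaker>
    <speech>
        <p>I would like at the outset to associate myself with the sympathy expressed.</p>
    </speech>
    <speaker>The PRIME MINISTER (Mr. Baldwin):</speaker>
    <speech>
        <p>Reference has been made by the Mover of the Address.</p>
    </speech>
</proceeding>
"""

proceeding = etree.fromstring(proceeding_xml)

# Because the units are tagged, we can address them by name
speakers = [el.text for el in proceeding.findall("speaker")]
speeches = [
    " ".join(p.text.strip() for p in el.findall("p"))
    for el in proceeding.findall("speech")
]

# Pair every speaker with the speech that follows
turns = list(zip(speakers, speeches))
for speaker, text in turns:
    print(speaker, "->", text)
```

With the plain-text version we would have to guess where a speaker's name ends and the speech begins; with the encoded version that boundary is explicit.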

### Extensible
- XML lets you define the **rules/conventions** for encoding a text:
    - which tags are allowed (speaker, speech)
    - how tags relate to each other (e.g. a paragraph embedded in a speech tag)
- XML is **extensible** because it allows you to define a new (or refine an existing) set of rules, i.e. a markup language.
- XML is a "metalanguage" (a language for describing markup languages).

In this session:

- Focus on working with XML documents (XML for the consumer side)
- We explore scenarios where you encounter text embedded in XML files, and how to make use of the structure provided by XML.

# Elements

In the examples encountered earlier, **elements** emerge as the principal building blocks of a document (i.e. the units that structure a document).
- Each element must be **explicitly tagged** (opened with start-tag and closed with end-tag).
- Everything between these tags is the **content** of that element.

> Mr. ATTLEE: I would like at the outset to associate myself with the sympathy expressed by the hon. Member who moved this Address with regard to the cause of the absence of His Majesty from our proceedings to-day.

`Mr. ATTLEE: <p>I would like at the outset to associate myself with the sympathy expressed by the hon. Member who moved this Address with regard to the cause of the absence of His Majesty from our proceedings to-day.</p>`

- Elements can be **contained/nested** within each other.
- The ordering of elements is determined by the document type definition.

<proceeding>
    <speaker>Mr. ATTLEE:</speaker>
        <speech>
            <p>I would like at the outset to associate myself with the sympathy expressed by the hon. Member who moved this Address with regard to the cause of the absence of His Majesty from our proceedings to-day.
            </p>
        </speech>
    <speaker>The PRIME MINISTER (Mr. Baldwin):</speaker>
        <speech>
            <p>Reference has been made by the Mover of the Address and by the right hon. Gentleman the Leader of the Opposition to the grievous loss which has befallen His Majesty. 
            </p>
        </speech>
</proceeding>

The example does not follow the schema (it is not compliant with the Political Mashup (PM) schema), but it is (syntactically) well-formed.



To be **syntactically** correct (well-formed):
- each element must be contained by the root element
- elements may not partially overlap
- tags must mark the start and end of each element
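These three rules can be checked by a parser. The sketch below assumes `lxml` (introduced later in this notebook), which raises `etree.XMLSyntaxError` when a string is not well-formed. The helper `is_well_formed` and the `<hi>` tag are hypothetical names used only for this illustration.

```python
from lxml import etree

def is_well_formed(xml_text):
    """Hypothetical helper: True if the parser accepts the string."""
    try:
        etree.fromstring(xml_text)
        return True
    except etree.XMLSyntaxError:
        return False

# Properly nested: <hi> opens and closes inside <p>
nested = "<speech><p>I would like <hi>at the outset</hi> to speak.</p></speech>"
# Partial overlap: <p> is closed while <hi> is still open
overlapping = "<speech><p>I would like <hi>at the outset</p> to speak.</hi></speech>"
# Two top-level elements: the second one is not contained by a root element
two_roots = "<speech>one</speech><speech>two</speech>"

for name, text in [("nested", nested), ("overlapping", overlapping), ("two_roots", two_roots)]:
    print(name, is_well_formed(text))
```

Note that this only tests well-formedness; whether a document follows a particular schema is a separate question.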

# Attributes

<speech pm:speaker="Mr. ATTLEE" 
        pm:party="Labour" 
        pm:role="mp" 
        pm:party-ref="uk.p.Lab" 
        pm:member-ref="uk.m.16522" 
        pm:id="uk.proc.d.1935-12-03.3.10.6" 
        dp:time="15:38:00">
    <p pm:id="uk.proc.d.1935-12-03.3.10.6.1">I would like at the outset to associate myself with the sympathy expressed by the hon. Member who moved this Address with regard to the cause of the absence of His Majesty from our proceedings to-day.
    </p>
</speech>


- Attributes capture information that is "descriptive of an element but not part of its content" (e.g. identifiers).
- **Elements** structure what is **implicit** in the text; **attributes** often enrich this structure by adding information that is **absent** from it.
- Attributes are contained in the **start-tag** (there are no attributes in the end-tag).
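As a small illustration of the difference between content and attributes, the sketch below parses a simplified version of the speech above. It is an assumption-laden simplification: the `pm:` and `dp:` prefixes are dropped (namespaces are discussed further down), the paragraph is abbreviated, and `lxml` is used as in the rest of this notebook.

```python
from lxml import etree

# Simplified speech: same information as above, without namespace prefixes
speech = etree.fromstring(
    '<speech speaker="Mr. ATTLEE" party="Labour" role="mp">'
    '<p id="p1">I would like at the outset to associate myself with the sympathy expressed.</p>'
    '</speech>'
)

# Attributes live on the element and are read by name
print(speech.get("speaker"))
print(dict(speech.attrib))

# The content of the element does not include the attribute values
content = etree.tostring(speech, method="text", encoding="unicode")
print(content)
```

The party affiliation is never stated in the transcribed text itself; it is information the encoder added through an attribute.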

# Schemas

- Texts are considered as belonging to a type ("academic paper", "novel", "parliamentary proceedings", etc.).
- Validity is assessed in relation to a schema (e.g. a **Document Type Definition**).
- Schema defines the constituent parts and their relations (in what arrangements they can appear).
- Example of Political Mashup's schema for [parliamentary proceedings](https://web.archive.org/web/20170504021953/http://schema.politicalmashup.nl/schemas.html).


**Anthology**: an example taken from [TEI](https://tei-c.org/release/doc/tei-p5-doc/en/html/SG.html#:~:text=XML%20is%20widely%20used%20for,on%20the%20World%20Wide%20Web.)

<anthology>
 <poem>
  <heading>The SICK ROSE</heading>
  <stanza>
   <line>O Rose thou art sick.</line>
   <line>The invisible worm,</line>
   <line>That flies in the night</line>
   <line>In the howling storm:</line>
  </stanza>
  <stanza>
   <line>Has found out thy bed</line>
   <line>Of crimson joy:</line>
   <line>And his dark secret love</line>
   <line>Does thy life destroy.</line>
  </stanza>
 </poem>
<!-- more poems go here -->
</anthology>
              


In short, schemas:
- govern which **elements** are **allowed** and how they can be **arranged** (relations between elements, e.g. a speech can contain paragraphs).
- are an **interpretation** of the structure of a type of source.
- **balance** demands for clarity, uniformity and complexity. Schemas are a (necessary) simplification: it is hard to agree on a standard, and any schema reduces some of the uniqueness of a document.
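To see what "allowed" and "arranged" mean in practice, here is a minimal sketch that validates two small documents against a Document Type Definition with `lxml`. The DTD is an assumption written for this illustration: the element names come from the anthology example above, while the content rules (for instance that a poem has an optional heading followed by one or more stanzas) are one possible interpretation.

```python
from io import StringIO
from lxml import etree

# Assumed DTD for the anthology example (content rules are illustrative)
dtd = etree.DTD(StringIO("""<!ELEMENT anthology (poem+)>
<!ELEMENT poem (heading?, stanza+)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT stanza (line+)>
<!ELEMENT line (#PCDATA)>
"""))

# Follows the rules: lines sit inside a stanza
valid_doc = etree.fromstring(
    "<anthology><poem><heading>The SICK ROSE</heading>"
    "<stanza><line>O Rose thou art sick.</line></stanza></poem></anthology>"
)
# Well-formed, but a line appears directly inside a poem
invalid_doc = etree.fromstring(
    "<anthology><poem><line>O Rose thou art sick.</line></poem></anthology>"
)

valid_result = dtd.validate(valid_doc)
invalid_result = dtd.validate(invalid_doc)
print(valid_result, invalid_result)
```

Both documents are well-formed; only the first is valid with respect to this particular schema.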



# Namespaces
Namespaces indicate the **provenance** of elements and attributes.

Example of Parliamentary Proceedings:

`<root xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsd="http://www.w3.org/2001/XMLSchema-datatypes" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:pmd="http://www.politicalmashup.nl/docinfo" xmlns:pm="http://www.politicalmashup.nl" xmlns:dp="http://dilipad.history.ac.uk" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:html="http://www.w3.org/1999/xhtml" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">`

`<proceedings xmlns="http://www.politicalmashup.nl" pm:id="uk.proc.d.1935-12-03">`

According to the ["Gentle Introduction"](https://tei-c.org/release/doc/tei-p5-doc/en/html/SG.html):
> If the markup of an XML document is thought of as an expression in some language, then a namespace may be thought of as analogous to the lexicon of that language. 


- Namespaces are the **lexicon** of a markup language (`xmlns:pmd="`[http://www.politicalmashup.nl/docinfo](https://web.archive.org/web/20190505013425/http://schema.politicalmashup.nl/include/docinfo.html)`"`)
- Prefix signals association with a namespace. (`pm:id="uk.proc.d.1935-12-03"`)
- XML documents allow you to combine elements from different namespaces (i.e. you can create your own schema and use it to extend existing representations).
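The sketch below shows how a parser resolves prefixes, assuming `lxml` (introduced in the next part of this notebook). The two namespace URIs are taken from the example above; the tiny document itself is made up for illustration. In `lxml`, a prefixed name such as `pm:speech` is stored as `{namespace-uri}speech`.

```python
from lxml import etree

# Toy document combining two of the namespaces declared above
doc = etree.fromstring(
    '<root xmlns:pm="http://www.politicalmashup.nl" xmlns:dp="http://dilipad.history.ac.uk">'
    '<pm:speech pm:speaker="Mr. ATTLEE" dp:time="15:38:00"/>'
    '</root>'
)
speech = doc[0]

# The prefix is expanded to the full namespace URI
print(speech.tag)

# QName splits a qualified name into namespace and local name
qname = etree.QName(speech)
print(qname.namespace, qname.localname)

# Attributes from different namespaces can sit on the same element
print(speech.get("{http://dilipad.history.ac.uk}time"))

# nsmap lists the prefixes that are in scope for this element
print(speech.nsmap)
```

The prefix is only a shorthand inside the file; what identifies an element or attribute is the combination of namespace URI and local name.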

## Recap: Why use XML?

XML encodes the **structure** of a text and makes **meaningful units explicit** to humans and computers. Such digital representations of a document are useful for digital humanists: we can distinguish elements and treat them differently in our programs. This allows for a more accurate, refined analysis of texts.

Now, let's turn from theory to practice...

# Working with XML in Python

- Goal of this tutorial: How to **open, process and extract relevant content from XML documents**
- You can also **create** and **validate** XML in Python, but we won't cover this (as said, emphasis on the consumer side)

Python offers [multiple libraries](https://realpython.com/python-xml-parser/) for working with XML but we will use [lxml](https://lxml.de/parsing.html). 

![lxml](https://lxml.de/python-xml-title.png)

In [293]:
from lxml import etree
print(etree.__doc__)


The ``lxml.etree`` module implements the extended ElementTree API for XML.



## Toy Example
### The Extended Anthology

In [294]:
xml_string = """
<anthology>
 <poem id='1' 
      author='Blake, William.'
      date='1794'>
  <heading id='1.1'>The SICK ROSE</heading>
  <stanza id='1.2'>
   <line id='1.2.1'>O Rose thou art sick.</line>
   <line id='1.2.2'>The invisible worm,</line>
   <line id='1.2.3'>That flies in the night</line>
   <line id='1.2.4'>In the howling storm:</line>
  </stanza>
  <stanza id='1.3'>
   <line id='1.3.1'>Has found out thy bed</line>
   <line id='1.3.2'>Of crimson joy:</line>
   <line id='1.3.3'>And his dark secret love</line>
   <line id='1.3.4'>Does thy life destroy.</line>
  </stanza>
 </poem>
 <poem id='2'
     author='Shakespeare, William.' >
    <heading id='2.1'>Sonnet 106</heading>
    <stanza id='2.2'>
        <line id='2.2.1'>When in the chronicle of wasted time</line>
        <line id='2.2.2'>I see descriptions of the fairest wights,</line>
        <line id='2.2.3'>And beauty making beautiful old rhyme</line>
        <line id='2.2.4'>In praise of ladies dead, and lovely knights,</line>
     </stanza>
  </poem> 
</anthology>
"""
              

In [296]:
# etree converts a string into a tree; this is called parsing
from io import StringIO
tree = etree.parse(StringIO(xml_string)) 
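As a side note, `lxml` offers two common entry points for parsing, sketched below on a tiny made-up string: `etree.parse` reads from a file or file-like object and returns a tree object, while `etree.fromstring` takes the string directly and returns the root element.

```python
from io import StringIO
from lxml import etree

xml_text = "<anthology><poem id='1'/></anthology>"

# parse() expects a file or file-like object and returns an ElementTree
tree_obj = etree.parse(StringIO(xml_text))
root_from_tree = tree_obj.getroot()

# fromstring() takes the string itself and returns the root Element
root_direct = etree.fromstring(xml_text)

print(type(tree_obj).__name__, type(root_direct).__name__)
print(root_from_tree.tag, root_direct.tag)
```

Either route ends at the same kind of root element; `parse` is convenient when the XML lives in a file, as in the realistic example later on.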

![tree](https://tei-c.org/release/doc/tei-p5-doc/en/html/Images/xmlFlowChart.png)

In [297]:
# what goodies are attached to a tree object?
help(tree)

Help on _ElementTree object:

class _ElementTree(builtins.object)
 |  Methods defined here:
 |  
 |  __copy__(self)
 |  
 |  __deepcopy__(self, memo)
 |  
 |  find(self, path, namespaces)
 |      find(self, path, namespaces=None)
 |      
 |      Finds the first toplevel element with given tag.  Same as
 |      ``tree.getroot().find(path)``.
 |      
 |      The optional ``namespaces`` argument accepts a
 |      prefix-to-namespace mapping that allows the usage of XPath
 |      prefixes in the path expression.
 |  
 |  findall(self, path, namespaces)
 |      findall(self, path, namespaces=None)
 |      
 |      Finds all elements matching the ElementPath expression.  Same as
 |      getroot().findall(path).
 |      
 |      The optional ``namespaces`` argument accepts a
 |      prefix-to-namespace mapping that allows the usage of XPath
 |      prefixes in the path expression.
 |  
 |  findtext(self, path, default, namespaces)
 |      findtext(self, path, default=None, namespaces=None)


In [298]:
# iterate over all elements
for i in tree.iter(): print(i)

<Element anthology at 0x7fa6a6e17ac0>
<Element poem at 0x7fa6a6e20d40>
<Element heading at 0x7fa6a6e20fc0>
<Element stanza at 0x7fa6a6e20f40>
<Element line at 0x7fa6a6e20d40>
<Element line at 0x7fa6a6e20fc0>
<Element line at 0x7fa6a6e20f40>
<Element line at 0x7fa6a6e20d40>
<Element stanza at 0x7fa6a6e20fc0>
<Element line at 0x7fa6a6e20f40>
<Element line at 0x7fa6a6e20d40>
<Element line at 0x7fa6a6e20fc0>
<Element line at 0x7fa6a6e20f40>
<Element poem at 0x7fa6a6e20d40>
<Element heading at 0x7fa6a6e20fc0>
<Element stanza at 0x7fa6a6e20f40>
<Element line at 0x7fa6a6e20d40>
<Element line at 0x7fa6a6e20fc0>
<Element line at 0x7fa6a6e20f40>
<Element line at 0x7fa6a6e20d40>


In [299]:
# explore tree structure
root = tree.getroot()
len(root) # we only have two poems in anthology

2

In [307]:
# first poem
root[0]

<Element poem at 0x7fa6a6c65180>

In [308]:
# second poem
root[1]

<Element poem at 0x7fa6a5371cc0>

In [309]:
# the only stanza of the second poem (its second child, after the heading)
# stanza is a child of poem
root[1][1]

<Element stanza at 0x7fa6a6be0d00>

In [302]:
# each element has different __Python__ attributes
root[1][1].tag

'stanza'

In [303]:
# why is this empty?
root[1][1].text

'\n        '

In [52]:
# .text only returns the text directly enclosed by the tag
root[1][1][2].text

'And beauty making beautiful old rhyme'
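The two cells above can be summarised in a short sketch on a made-up two-line stanza: `.text` holds only the text between an element's start-tag and its first child, which for a stanza is just whitespace, whereas `itertext()` walks through all text contained in the element and its descendants.

```python
from lxml import etree

stanza = etree.fromstring(
    "<stanza>\n  <line>Has found out thy bed</line>\n  <line>Of crimson joy:</line>\n</stanza>"
)

# Only the whitespace before the first <line> element
print(repr(stanza.text))

# The text of each child element
lines = [line.text for line in stanza]
print(lines)

# All text inside the stanza, with the layout whitespace removed
full_text = " ".join(t.strip() for t in stanza.itertext() if t.strip())
print(full_text)
```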

In [310]:
# print the first stanza of the first poem (its second child, after the heading)
print(etree.tostring(root[0][1],method='text',encoding='unicode'))


   O Rose thou art sick.
   The invisible worm,
   That flies in the night
   In the howling storm:
  
  


In [311]:
# attributes are stored in a dictionary-like object
root[0].attrib

{'id': '1', 'author': 'Blake, William.', 'date': '1794'}

In [312]:
root[0].attrib['author']

'Blake, William.'

## ✏️ Exercise

- Convert the small dialogue below to an XML document, using the tags: `dialogue`, `speech_turn`, `line`
- add a name attribute to each speech_turn as well as a unique identifier for each actor
- add identifiers to each element similarly to the anthology example
- save the XML document as a string in a variable called `xml_dialogue`
- parse the string with `etree` and save the element tree in a new variable
- print the text of the second line

```
George: “You’re gonna over-dry your laundry.”
Jerry: “You can’t over-dry.”
George: “Why not?”
Jerry: “Same reason you can’t over-wet.”
```
![seinfeld](https://media.giphy.com/media/14dXclYKbx2ONW/giphy.gif)

In [313]:
# add your answer here

## Realistic Example
### UK Parliamentary Debates on Abortion (29th June 1967)

Political Mashup's schema for [parliamentary proceedings](https://web.archive.org/web/20170504021953/http://schema.politicalmashup.nl/schemas.html).

In [324]:
with open('xml_examples/uk.proc.d.1967-06-29.xml','rb') as in_xml:
    tree = etree.parse(in_xml)

In [325]:
root = tree.getroot()

In [70]:
len(root), type(root)

(3, lxml.etree._Element)

## Elements are lists

In [67]:
for element in root:
    print(element.tag)

{http://www.politicalmashup.nl/docinfo}docinfo
meta
{http://www.politicalmashup.nl}proceedings


In [73]:
proceedings = root[2]

In [89]:
proceedings[2]

<Element {http://www.politicalmashup.nl}topic at 0x7fa6a68ad7c0>

In [316]:
proceedings[2][0]

<Element {http://www.politicalmashup.nl}scene at 0x7fa6a6891640>

In [317]:
# you can iterate over these elements as a list to get the children
for i,element in enumerate(proceedings[2]):
    print(i,element.tag,len(element))

0 {http://www.politicalmashup.nl}scene 4
1 {http://www.politicalmashup.nl}scene 6
2 {http://www.politicalmashup.nl}scene 5
3 {http://www.politicalmashup.nl}scene 5
4 {http://www.politicalmashup.nl}scene 6
5 {http://www.politicalmashup.nl}scene 9
6 {http://www.politicalmashup.nl}scene 22
7 {http://www.politicalmashup.nl}scene 4
8 {http://www.politicalmashup.nl}scene 4
9 {http://www.politicalmashup.nl}scene 4
10 {http://www.politicalmashup.nl}scene 18
11 {http://www.politicalmashup.nl}scene 9
12 {http://www.politicalmashup.nl}scene 4
13 {http://www.politicalmashup.nl}scene 4
14 {http://www.politicalmashup.nl}scene 4
15 {http://www.politicalmashup.nl}scene 4
16 {http://www.politicalmashup.nl}scene 6
17 {http://www.politicalmashup.nl}scene 6
18 {http://www.politicalmashup.nl}scene 4
19 {http://www.politicalmashup.nl}scene 10
20 {http://www.politicalmashup.nl}scene 4


### ✏️ Exercise
- How many paragraphs does the 1st speech in the 7th scene of the 3rd topic contain?
- How many words does the first paragraph contain? (You can just use `split()` to divide the string into tokens).

In [314]:
# type answer here

In [114]:
proceedings[2][6][1]

<Element {http://www.politicalmashup.nl}speech at 0x7fa6a68a3440>

In [115]:
len(proceedings[2][6][1][0].text.split())

42

In [319]:
# you can iterate over a specific element
# notice that we need to define the namespace
# for both the element and attribute
for element in tree.iter("{http://www.politicalmashup.nl}speech"):
    print(element.attrib['{http://www.politicalmashup.nl}id'])

In [342]:
nsmap = root.nsmap; nsmap

{'dc': 'http://purl.org/dc/elements/1.1/',
 'xsd': 'http://www.w3.org/2001/XMLSchema-datatypes',
 'owl': 'http://www.w3.org/2002/07/owl#',
 'pmd': 'http://www.politicalmashup.nl/docinfo',
 'pm': 'http://www.politicalmashup.nl',
 'dp': 'http://dilipad.history.ac.uk',
 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
 'html': 'http://www.w3.org/1999/xhtml',
 'dcterms': 'http://purl.org/dc/terms/',
 'xsi': 'http://www.w3.org/2001/XMLSchema-instance'}

### ✏️ Exercise
- (Difficult) What is the longest speech in the fourth topic element?

In [None]:
# type answer here

In [136]:
longest_speech, max_length = None, 0
for el in proceedings[3].iter("{http://www.politicalmashup.nl}speech"):
    # count the words in all paragraphs of this speech
    len_speech = 0
    for p in el:
        # p.text is None for empty elements, so fall back to an empty string
        len_speech += len((p.text or '').split())
    if len_speech > max_length:
        max_length = len_speech
        longest_speech = el
max_length, etree.tostring(longest_speech, method='text', encoding='unicode')

(84,
 "\n          No, Sir. A separate inquiry, directed particularly at the effects of increased food prices on these areas would not be fruitful since it would not be possible to disentangle the effects from the overall consequences of Great Britain's entry into the European Economic Community In any case, the economic circumstances of the development areas should improve substantially during the next few years as a result of the Government's measures and particularly the regional employment premium announced by my right hon. Friend on 5th June.\n        \n        ")

## Elements carry attributes as dictionaries

In [320]:
proceedings[2][6][15].tag,proceedings[2][6][15].attrib

('{http://www.politicalmashup.nl}speech',
 {'{http://www.politicalmashup.nl}speaker': 'Mr. Goodhart', '{http://www.politicalmashup.nl}party': 'Conservative', '{http://www.politicalmashup.nl}role': 'mp', '{http://www.politicalmashup.nl}party-ref': 'uk.p.Con', '{http://www.politicalmashup.nl}member-ref': 'uk.m.17551', '{http://www.politicalmashup.nl}id': 'uk.proc.d.1967-06-29.3.7.16'})

In [326]:
root.nsmap

{'dc': 'http://purl.org/dc/elements/1.1/',
 'xsd': 'http://www.w3.org/2001/XMLSchema-datatypes',
 'owl': 'http://www.w3.org/2002/07/owl#',
 'pmd': 'http://www.politicalmashup.nl/docinfo',
 'pm': 'http://www.politicalmashup.nl',
 'dp': 'http://dilipad.history.ac.uk',
 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
 'html': 'http://www.w3.org/1999/xhtml',
 'dcterms': 'http://purl.org/dc/terms/',
 'xsi': 'http://www.w3.org/2001/XMLSchema-instance'}

In [335]:
# use string formatting for namespaces
speech = proceedings[2][6][15]
speech.attrib["{%s}speaker" % root.nsmap['pm']]

'Mr. Goodhart'

In [336]:
# it is safer to use .get() method
speech.attrib.get("{%s}speaker" % root.nsmap['pm'], '')

'Mr. Goodhart'
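The difference between the two access styles only shows up when an attribute is missing. The sketch below uses a made-up two-speech document in the Political Mashup namespace; the helper `pm` is a hypothetical convenience function for building `{namespace}name` strings, not part of `lxml`.

```python
from lxml import etree

PM = "http://www.politicalmashup.nl"

# Toy document: the second speech has no party attribute
doc = etree.fromstring(
    '<pm:proceedings xmlns:pm="%s">'
    '<pm:speech pm:speaker="Mr. Goodhart" pm:party="Conservative"/>'
    '<pm:speech pm:speaker="Mr. SPEAKER"/>'
    '</pm:proceedings>' % PM
)

def pm(name):
    """Hypothetical helper: expand a local name to '{namespace}name'."""
    return "{%s}%s" % (PM, name)

# .get() returns the default when the attribute is absent
parties = [s.get(pm("party"), "") for s in doc.iter(pm("speech"))]
print(parties)

# Dictionary-style access raises a KeyError instead
try:
    doc[1].attrib[pm("party")]
    raised = False
except KeyError:
    raised = True
print(raised)
```

In a loop over many speeches, a single missing attribute would otherwise stop the whole computation.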

### ✏️ Exercise

- Count the number of speeches made by Conservative members during this sitting of Parliament.

In [338]:
# type answer here

In [149]:
con_speeches = 0
for element in tree.iter("{http://www.politicalmashup.nl}speech"):
    if element.attrib.get('{http://www.politicalmashup.nl}party-ref','') == 'uk.p.Con':
        con_speeches+=1
con_speeches

400

### ✏️ Exercise:

- How often does the Speaker intervene?

In [337]:
# type answer here

# Find methods

In [340]:
help(root.find)

Help on method find in module lxml.etree:

find(path, namespaces) method of lxml.etree._Element instance
    find(self, path, namespaces=None)
    
    Finds the first matching subelement, by tag name or path.
    
    The optional ``namespaces`` argument accepts a
    prefix-to-namespace mapping that allows the usage of XPath
    prefixes in the path expression.



In [343]:
# note the namespace prefix
# .// means from current location (i.e proceedings element)
# search for descendant speech element regardless of their 
# position in the subtree
# find only returns the first element
nsmap = root.nsmap
speeches = root[2].find('.//pm:speech',namespaces=nsmap); speeches

<Element {http://www.politicalmashup.nl}speech at 0x7fa6a6e31bc0>

In [169]:
# findall returns all matching elements
speeches = root[2].findall('.//pm:speech',namespaces=nsmap); len(speeches)

1101
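The return types of the find methods are easy to mix up, so here is a minimal sketch on a made-up document: `find` returns a single element or `None`, `findall` always returns a list, and `findtext` returns the text of the first match. The `pm:vote` tag is a hypothetical name chosen because it does not occur in the toy document.

```python
from lxml import etree

ns = {"pm": "http://www.politicalmashup.nl"}
doc = etree.fromstring(
    '<pm:proceedings xmlns:pm="http://www.politicalmashup.nl">'
    '<pm:topic><pm:speech><pm:p>First.</pm:p></pm:speech>'
    '<pm:speech><pm:p>Second.</pm:p></pm:speech></pm:topic>'
    '</pm:proceedings>'
)

first = doc.find(".//pm:speech", namespaces=ns)          # one element
all_speeches = doc.findall(".//pm:speech", namespaces=ns)  # list of elements
missing = doc.find(".//pm:vote", namespaces=ns)          # None: no such element
first_text = doc.findtext(".//pm:p", namespaces=ns)      # text of the first match

print(first.tag, len(all_speeches), missing, first_text)
```

Because `find` can return `None`, it is worth checking the result before reading attributes or text from it.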

# Finding Needles and Mining Haystacks with XPath

- XPath is a rich language for **querying** and **navigating** elements and attributes in an XML document.
    - It offers an elaborate syntax for defining regions of interest in your XML documents.
- It can select elements, attributes, or text nodes.
- XPath uses **path expressions** for querying an XML document.
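Before turning to the parliamentary file, the sketch below tries a few path expressions on a tiny made-up document that mimics its structure. The topic titles and speaker labels are invented for illustration; the `xpath` method and the `namespaces` argument are part of `lxml`.

```python
from lxml import etree

ns = {"pm": "http://www.politicalmashup.nl"}

# Toy document (invented content) in the Political Mashup namespace
doc = etree.fromstring("""
<pm:proceedings xmlns:pm="http://www.politicalmashup.nl">
  <pm:topic pm:title="TOPIC ONE">
    <pm:speech pm:role="chair" pm:speaker="Speaker A"><pm:p>Order.</pm:p></pm:speech>
  </pm:topic>
  <pm:topic pm:title="TOPIC TWO">
    <pm:speech pm:role="mp" pm:speaker="Speaker B"><pm:p>I object.</pm:p></pm:speech>
    <pm:speech pm:role="government" pm:speaker="Speaker C"><pm:p>I agree.</pm:p></pm:speech>
  </pm:topic>
</pm:proceedings>
""")

# Select attribute values of child elements
titles = doc.xpath("./pm:topic/@pm:title", namespaces=ns)

# Positions in XPath start at 1, so [2] is the second topic
second_topic_speakers = doc.xpath("./pm:topic[2]/pm:speech/@pm:speaker", namespaces=ns)

# Filter elements on an attribute value
chair = doc.xpath(".//pm:speech[@pm:role='chair']", namespaces=ns)

# XPath functions such as count() return a number instead of a list
n_speeches = doc.xpath("count(.//pm:speech)", namespaces=ns)

print(titles, second_topic_speakers, len(chair), n_speeches)
```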

In [347]:
# first scene of fifth topic
# notice that index notation does not start at zero!
topic = root.xpath('./pm:proceedings/pm:topic[5]/pm:scene[1]',namespaces=nsmap); topic

[<Element {http://www.politicalmashup.nl}scene at 0x7fa6a6c87f80>]

In [349]:
# first select the third topic
# // then matches any speech element contained in that topic, at any depth
speeches = root.xpath('./pm:proceedings/pm:topic[3]//pm:speech',namespaces=nsmap); len(speeches)

141

In [351]:
# .// from root find all speeches regardless of where they are situated
speeches = root.xpath('.//pm:speech',namespaces=nsmap); len(speeches)

1101

In [353]:
# return speeches that contain a particular attribute
# in this case speeches with a role attribute
speeches = root.xpath('.//pm:speech[@pm:role]',namespaces=nsmap); len(speeches)

1101

In [354]:
# return the attributes themselves not the element
speeches = root.xpath('.//pm:speech/@pm:role',namespaces=nsmap); 
speeches[:10]

['mp',
 'government',
 'mp',
 'government',
 'mp',
 'government',
 'mp',
 'government',
 'mp',
 'government']

In [231]:
# find speeches where attribute matches a particular value
speeches = root[2].xpath('.//pm:speech[@pm:role="chair"]',namespaces=nsmap); len(speeches)

178

### ✏️ Exercise

- Which parties are participating in this sitting?
- How many Conservative/Labour speeches does the document contain?

In [355]:
# type answer here

In [240]:
print(set(root.xpath('.//pm:speech/@pm:party-ref',namespaces=nsmap)))

{'uk.p.Con', 'uk.p.Lab', 'uk.p.UUP', 'uk.p.Lib', 'uk.p.LDem'}


In [238]:
print(len(root.xpath('.//pm:speech[@pm:party-ref="uk.p.Con"]',namespaces=nsmap)))
print(len(root.xpath('.//pm:speech[@pm:party-ref="uk.p.Lab"]',namespaces=nsmap)))

400
595


## Combining structure and text

In [358]:
# return paragraphs that mention a certain token, in this case 'women'
mentions = root.xpath(".//pm:speech/pm:p[contains(text(), 'women')]",namespaces=nsmap)

In [359]:
# get the fifth mention (index 4)
# move up one level in the tree to get the speech element 
# and then the speaker name in the attribute
mentions[4].getparent().attrib['{http://www.politicalmashup.nl}speaker']

'Mrs. Knight'
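One subtlety of `contains(text(), ...)` is worth a small sketch. In XPath 1.0, which `lxml` implements, `text()` inside `contains` is reduced to the first text node of the paragraph, so a word that appears after a nested element is missed; `contains(., ...)` tests the full text of the paragraph instead. The document below is made up for illustration (no namespaces, and `<hi>` is a hypothetical inline tag).

```python
from lxml import etree

speech = etree.fromstring(
    "<speech speaker='Speaker A'>"
    "<p>Many <hi>people</hi> and women wrote to me.</p>"
    "<p>I agree.</p>"
    "</speech>"
)

# Only the first text node ("Many ") is tested, so nothing matches
first_text_only = speech.xpath("./p[contains(text(), 'women')]")

# "." stands for the whole paragraph, including text after <hi>
whole_paragraph = speech.xpath("./p[contains(., 'women')]")

# From a matching paragraph, move up to the speech to read the speaker
speaker = whole_paragraph[0].getparent().get("speaker")

print(len(first_text_only), len(whole_paragraph), speaker)
```

Whether this matters for the parliamentary file depends on whether its paragraphs contain nested elements, which is worth checking before relying on the counts.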

### ✏️ Exercises:

- Get the title attributes of the topic elements
- Formulate an XPath expression that finds the debate (i.e. topic element) on the medical termination of pregnancy bill.
- Formulate an XPath expression that finds speeches made by Labour MPs during the debate on the medical termination of pregnancy bill.
- Formulate an XPath expression that finds paragraphs in speeches made by Conservative MPs during the debate on the medical termination of pregnancy bill that mention the word "women". Print the content of these paragraphs.
- Formulate a query that returns the speeches in which the word "woman" is mentioned. (Difficult, requires some Googling)

In [None]:
# type answer here

In [282]:
root.xpath('./pm:proceedings//pm:topic/@pm:title',namespaces=nsmap)

['Preamble',
 'PRAYERS',
 'Oral Answers to Questions — OVERSEAS DEVELOPMENT',
 'Oral Answers to Questions — ECONOMIC AFFAIRS',
 'Oral Answers to Questions — POST OFFICE',
 'Oral Answers to Questions — TELEPHONE SERVICE',
 'Oral Answers to Questions — WIRELESS AND TELEVISION',
 'Oral Answers to Questions — HOME DEPARTMENT',
 'QUESTIONS TO MINISTERS',
 'NOTICES OF MOTION',
 'BUSINESS OF THE HOUSE',
 'PETROL PRICES (SURCHARGE)',
 'BILLS PRESENTED',
 'Orders of the Day — SUPPLY',
 'Orders of the Day — NATIONAL AIRPORT POLICY',
 'Orders of the Day — BUSINESS OF THE HOUSE',
 'Orders of the Day — MEDICAL TERMINATION OF PREGNANCY BILL',
 'Orders of the Day — ADJOURNMENT']

In [283]:
abortion_debate = root.xpath('./pm:proceedings/pm:topic[@pm:title="Orders of the Day — MEDICAL TERMINATION OF PREGNANCY BILL"]',namespaces=nsmap)


In [284]:
lab_speeches = root.xpath('./pm:proceedings/pm:topic[@pm:title="Orders of the Day — MEDICAL TERMINATION OF PREGNANCY BILL"]//pm:speech[@pm:party-ref="uk.p.Lab"]',namespaces=nsmap)
lab_speeches

[<Element {http://www.politicalmashup.nl}speech at 0x7fa6a6b25f40>,
 <Element {http://www.politicalmashup.nl}speech at 0x7fa6a6ce4340>,
 <Element {http://www.politicalmashup.nl}speech at 0x7fa6a6c2fbc0>,
 <Element {http://www.politicalmashup.nl}speech at 0x7fa6a6cb7fc0>,
 <Element {http://www.politicalmashup.nl}speech at 0x7fa6a6c2f240>,
 <Element {http://www.politicalmashup.nl}speech at 0x7fa6a6cb7a40>,
 <Element {http://www.politicalmashup.nl}speech at 0x7fa6a6b27840>,
 <Element {http://www.politicalmashup.nl}speech at 0x7fa6a6b27c00>,
 <Element {http://www.politicalmashup.nl}speech at 0x7fa6a6cb7d00>,
 <Element {http://www.politicalmashup.nl}speech at 0x7fa6a6b0b240>,
 <Element {http://www.politicalmashup.nl}speech at 0x7fa6a6cb7b40>,
 <Element {http://www.politicalmashup.nl}speech at 0x7fa6a6cb7400>,
 <Element {http://www.politicalmashup.nl}speech at 0x7fa6a6b0bfc0>,
 <Element {http://www.politicalmashup.nl}speech at 0x7fa6a6b0bb00>,
 <Element {http://www.politicalmashup.nl}speech 

In [285]:
con_paras = root.xpath('./pm:proceedings/pm:topic[@pm:title="Orders of the Day — MEDICAL TERMINATION OF PREGNANCY BILL"]//pm:speech[@pm:party-ref="uk.p.Con"]/pm:p[contains(text(), "women")]',namespaces=nsmap)
for s in con_paras:
    print(etree.tostring(s,method='text',encoding='unicode'))

The effect of the Bill's handling on doctors is shown by a letter I received a few days ago from two eminent Yorkshire consultants, Mr. G. A. Craig, consultant obstetrician and gynaecologist for the Bradford A group of hospitals, and Mr. John Mander, consultant obstetrician and gynaecologist at York, which reads: The following statement has been submitted to the 29 Consultant Gynaecologists employed by the Leeds Regional Hospital Board (as per attached list). There follows a list of distinguished medical men and women.
          
I will not anticipate arguments which I shall address to the House later, but it will emerge in the course of debate during the night that this is not a Bill which will permit a large increase in therapeutic abortion in the National Health Service hospitals. The Minister knows that well. There are about 10,000 gynaecological beds in the country and there are 60,000 women on the gynaecological waiting list, awaiting appointments. The chances of the National Hea

### ✏️ Exercise
- (Difficult) Make a small program that computes word frequencies by party (only Labour and Conservative is fine).
- (Difficult) Can you calculate if the probability of mentioning "woman" or "women" differs by party? The probability in this case is just the frequency of the word "woman" divided by the total number of words spoken by a certain party.
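As a hint for the probability part, the sketch below applies the definition (frequency of the target words divided by the total number of words) to invented token lists. The tokens are toy data, not taken from the debate, and extracting the real tokens per party from the XML is left as the exercise.

```python
from collections import Counter

# Invented toy tokens per party (not taken from the debate)
tokens_by_party = {
    "uk.p.Lab": "the women and the doctors".split(),
    "uk.p.Con": "the bill and the house and the woman".split(),
}
targets = {"woman", "women"}

probabilities = {}
for party, tokens in tokens_by_party.items():
    # word frequencies for this party
    counts = Counter(token.lower() for token in tokens)
    # occurrences of any target word, divided by all words spoken
    hits = sum(counts[t] for t in targets)
    probabilities[party] = hits / len(tokens)

print(probabilities)
```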

# Fin.