# Text Encoding and XML

## Introduction

Goals of this lecture:
- What XML is and why it is useful for text-based DH
- How to read and interrogate XML documents

Lecture and notebook based on [A Very Gentle Introduction to XML](https://www.tei-c.org/Vault/P4/doc/html/SG.html), part of the [Text Encoding Initiative](https://tei-c.org/) (TEI).

List of [projects](https://tei-c.org/Activities/Projects/) using TEI.

XML stands for "eXtensible Markup Language" 
- markup used for presenting and enriching documents
- tags can be used for presentational purposes

Some tags serve as <b>markup for bold text</b>, others for 
<i>italic</i> words.

Why add markup to a text document?
- XML categorizes chunks of text into different (semantic) units.
- With XML we can make the **interpretation** of a text explicit through **encoding**
    
 
> "it is a process of making explicit what is conjectural or implicit, a process of directing the user as to how the content of the text should be (or has been) interpreted."

Mr. ATTLEE: I would like at the outset to associate myself with the sympathy expressed by the hon. Member who moved this Address with regard to the cause of the absence of His Majesty from our proceedings to-day. The PRIME MINISTER (Mr. Baldwin): Reference has been made by the Mover of the Address and by the right hon. Gentleman the Leader of the Opposition to the grievous loss which has befallen His Majesty.

<proceeding>
    <speaker>Mr. ATTLEE:</speaker>
        <speech>
            <p>I would like at the outset to associate myself with the sympathy expressed by the hon. Member who moved this Address with regard to the cause of the absence of His Majesty from our proceedings to-day.
            </p>
        </speech>
    <speaker>The PRIME MINISTER (Mr. Baldwin):</speaker>
        <speech>
            <p>Reference has been made by the Mover of the Address and by the right hon. Gentleman the Leader of the Opposition to the grievous loss which has befallen His Majesty. 
            </p>
        </speech>
</proceeding>

- XML defines the rules/conventions for encoding a text
    - which tags are allowed (speaker, speech)
    - how tags relate to each other (e.g. a paragraph embedded in a speech tag)

- XML is **human** and **machine** readable (aligns interpretation with machine processing)
- **Encoding** text means adding a layer of annotation (wrapping text segments with tags)
- TEI aims to encode the **meaning of data**, not just formal (page breaks) or representational (font type) features: what is going on in this piece of text?



- XML is **extensible** because it allows you to define a new (or refine an existing) set of rules, i.e. a markup language
- XML is a "metalanguage" (a language for describing markup languages)

In this session:
- Focus on working with XML documents (XML for consumer side)
- We won't go into too much detail of XML itself (for example how to create or validate XML files)
- Scenario: you encounter text embedded in XML files and want to make use of the structure provided by XML.

# Document Types and Schemas

- XML is descriptive (instead of procedural, e.g. `<b>` tags)
- Validity is defined with respect to a **document type definition** (DTD)
- Documents are considered as belonging to a type ("academic paper", "novel", "parliamentary proceedings", etc.)
- A DTD/schema defines the constituent parts and how they can be arranged
- Example of Political Mashup's schema for [parliamentary proceedings](https://web.archive.org/web/20170504021953/http://schema.politicalmashup.nl/schemas.html)


**Anthology**: an example taken from [TEI](https://tei-c.org/release/doc/tei-p5-doc/en/html/SG.html#:~:text=XML%20is%20widely%20used%20for,on%20the%20World%20Wide%20Web.)

<anthology>
 <poem>
  <heading>The SICK ROSE</heading>
  <stanza>
   <line>O Rose thou art sick.</line>
   <line>The invisible worm,</line>
   <line>That flies in the night</line>
   <line>In the howling storm:</line>
  </stanza>
  <stanza>
   <line>Has found out thy bed</line>
   <line>Of crimson joy:</line>
   <line>And his dark secret love</line>
   <line>Does thy life destroy.</line>
  </stanza>
 </poem>
<!-- more poems go here -->
</anthology>
              

In [2]:
RELAX NG schema for the anthology (compact syntax)

anthology_p = element anthology { poem_p+ }
poem_p = element poem { heading_p?, stanza_p+ }
stanza_p = element stanza {line_p+}
heading_p = element heading { text }
line_p = element line { text }
start = anthology_p



In short, schemas:
- govern which **elements** are **allowed** and how they can be **arranged** (relations between elements, e.g. a speech can contain paragraphs).
- are an **interpretation** of the structure of a specific source type.
- **balance** demands for clarity, uniformity and complexity. (Schemas are a necessary simplification: it is hard to agree on a standard/schema, and there will always be some reduction that harms the uniqueness of a document.)
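To make the idea of "allowed elements and arrangements" concrete, here is a minimal sketch that checks two toy documents against the anthology schema with lxml. It rests on an assumption: the schema has been translated by hand from the compact syntax shown above into RELAX NG's XML syntax, which is the form lxml's `etree.RelaxNG` class reads. The variable names (`RNG`, `valid_doc`, `invalid_doc`) are only for illustration.

```python
from lxml import etree

# The anthology schema, rewritten by hand in RELAX NG XML syntax (assumption:
# this mirrors the compact-syntax schema shown in the lecture).
RNG = b"""
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start><ref name="anthology_p"/></start>
  <define name="anthology_p">
    <element name="anthology"><oneOrMore><ref name="poem_p"/></oneOrMore></element>
  </define>
  <define name="poem_p">
    <element name="poem">
      <optional><ref name="heading_p"/></optional>
      <oneOrMore><ref name="stanza_p"/></oneOrMore>
    </element>
  </define>
  <define name="stanza_p">
    <element name="stanza"><oneOrMore><ref name="line_p"/></oneOrMore></element>
  </define>
  <define name="heading_p"><element name="heading"><text/></element></define>
  <define name="line_p"><element name="line"><text/></element></define>
</grammar>
"""
schema = etree.RelaxNG(etree.fromstring(RNG))

# Follows the schema: anthology > poem > (heading, stanza > line)
valid_doc = etree.fromstring(
    "<anthology><poem><heading>The SICK ROSE</heading>"
    "<stanza><line>O Rose thou art sick.</line></stanza></poem></anthology>"
)
# Well-formed, but a line sits directly inside a poem (no stanza)
invalid_doc = etree.fromstring(
    "<anthology><poem><line>O Rose thou art sick.</line></poem></anthology>"
)

is_valid = schema.validate(valid_doc)
is_invalid = schema.validate(invalid_doc)
print(is_valid, is_invalid)
```

Both documents parse without problems; only the schema tells them apart, which is the difference between being well-formed and being valid.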



# Elements

In the schemas seen above, **elements** emerge as the principal building blocks of a document (i.e. the units that structure a document).
- Each element must be **explicitly tagged** (opened with start-tag and closed with end-tag).
- Everything between these tags is the **content** of that element.

> Mr. ATTLEE: I would like at the outset to associate myself with the sympathy expressed by the hon. Member who moved this Address with regard to the cause of the absence of His Majesty from our proceedings to-day.

`Mr. ATTLEE: <p>I would like at the outset to associate myself with the sympathy expressed by the hon. Member who moved this Address with regard to the cause of the absence of His Majesty from our proceedings to-day.</p>`

- Elements can be **contained/embedded** within each other. A speech is more than a loose sequence of paragraphs: it is an element that contains them.

The ordering of elements is determined by the document type definition.

`<proceeding><speaker>Mr. ATTLEE:</speaker><speech><p>I would like at the outset to associate myself with the sympathy expressed by the hon. Member who moved this Address with regard to the cause of the absence of His Majesty from our proceedings to-day.</p></speech>
<speaker>The PRIME MINISTER (Mr. Baldwin):</speaker><speech><p>Reference has been made by the Mover of the Address and by the right hon. Gentleman the Leader of the Opposition to the grievous loss which has befallen His Majesty. </p></speech></proceeding>`

Please note that the example does not follow the Political Mashup (PM) schema, so it is not valid with respect to that schema, but it is (syntactically) well-formed.



To be syntactically correct:
- one element encloses the whole document
- every other element is contained by the root element, with no partial overlaps between elements
- tags mark start and end of each element
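These rules can be tried out directly, because a parser refuses documents that break them. The sketch below uses a small helper, `is_well_formed` (a hypothetical name, not part of lxml), that reports whether `etree.fromstring` accepts a string. The example strings are invented for illustration.

```python
from lxml import etree

def is_well_formed(xml_text):
    """Return True if lxml can parse xml_text, False on a syntax error."""
    try:
        etree.fromstring(xml_text)
        return True
    except etree.XMLSyntaxError:
        return False

# Properly nested: <i> opens and closes inside <p>
nested = "<speech><p>I would like <i>at the outset</i> to begin.</p></speech>"
# Partial overlap: <i> opens inside <p> but closes outside it
overlapping = "<speech><p>I would like <i>at the outset</p> to begin.</i></speech>"
# No single root element enclosing the whole document
two_roots = "<speech>first</speech><speech>second</speech>"
# Start-tag without a matching end-tag
unclosed = "<speech><p>no end-tag</speech>"

for name, doc in [("nested", nested), ("overlapping", overlapping),
                  ("two_roots", two_roots), ("unclosed", unclosed)]:
    print(name, is_well_formed(doc))
```

Only the first string is accepted. Note that this checks well-formedness only; it says nothing about whether a document follows a particular schema.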

Why? XML encodes the structure of a text and makes meaningful units explicit to both humans and computers. Such a digital representation of a document's structure is useful for research: we can distinguish elements and treat them differently in our programs, which allows for a more accurate, refined analysis of texts.

# Attributes

<speech pm:speaker="Mr. ATTLEE" 
        pm:party="Labour" 
        pm:role="mp" 
        pm:party-ref="uk.p.Lab" 
        pm:member-ref="uk.m.16522" 
        pm:id="uk.proc.d.1935-12-03.3.10.6" 
        dp:time="15:38:00">
    <p pm:id="uk.proc.d.1935-12-03.3.10.6.1">I would like at the outset to associate myself with the sympathy expressed by the hon. Member who moved this Address with regard to the cause of the absence of His Majesty from our proceedings to-day.
    </p>
</speech>


- Attributes capture information that is "descriptive of an element but not part of its content" (e.g. identifiers).
- Elements structure what is implicit in the text; attributes often enrich the document by adding information that is absent from it.
- Attributes are contained in the start-tag (there are no attributes in the end-tag).

# Namespaces
Namespaces indicate the provenance of elements and attributes.


`<root xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsd="http://www.w3.org/2001/XMLSchema-datatypes" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:pmd="http://www.politicalmashup.nl/docinfo" xmlns:pm="http://www.politicalmashup.nl" xmlns:dp="http://dilipad.history.ac.uk" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:html="http://www.w3.org/1999/xhtml" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">`

`<proceedings xmlns="http://www.politicalmashup.nl" pm:id="uk.proc.d.1935-12-03">`

According to the ["Gentle Introduction"](https://tei-c.org/release/doc/tei-p5-doc/en/html/SG.html):
> If the markup of an XML document is thought of as an expression in some language, then a namespace may be thought of as analogous to the lexicon of that language. 


- Namespaces are the **lexicon** of a markup language (`xmlns:pmd="`[http://www.politicalmashup.nl/docinfo](https://web.archive.org/web/20190505013425/http://schema.politicalmashup.nl/include/docinfo.html)`"`)
- Prefix signals association with a namespace. (`pm:id="uk.proc.d.1935-12-03"`)
- XML documents allow you to combine elements from different namespaces (i.e. you can create your own schema and use it to extend existing representations.)
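A short sketch of what a prefix means to a parser may help here. The snippet below is a toy element (not taken from the real file) that declares the `pm` prefix itself; lxml then expands every prefixed name into the form `{namespace-uri}localname`.

```python
from lxml import etree

# Toy element for illustration; the namespace URI is the Political Mashup one
snippet = """
<pm:speech xmlns:pm="http://www.politicalmashup.nl"
           pm:speaker="Mr. ATTLEE" pm:party="Labour">
  <pm:p>I would like at the outset to associate myself with the sympathy expressed.</pm:p>
</pm:speech>
"""
speech = etree.fromstring(snippet)

# The tag is stored with the full namespace URI, not with the prefix
print(speech.tag)

# QName splits a qualified name into its namespace and local part
qname = etree.QName(speech)
print(qname.namespace, qname.localname)

# The prefix and the prefix-to-URI mapping remain available
print(speech.prefix, speech.nsmap)

# Attribute names are expanded in the same way
pm = speech.nsmap["pm"]
print(speech.get("{%s}speaker" % pm))
```

The prefix is only a shorthand inside the document; what identifies the lexicon is the URI it stands for.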

# Recap

- Why XML/Text Encoding is useful for DH
- Basic terminology (schemas, elements, attributes and namespaces)

# Working with XML in Python

- Goal of this tutorial: how to open and process an XML document and extract relevant text from it
- You can also create and validate XML in Python, but we won't cover this (as said, emphasis on the consumer side)

In [8]:
!ls xml_examples/

uk.proc.d.1935-12-03.xml


Python offers [multiple libraries](https://realpython.com/python-xml-parser/) for working with XML but we will use [lxml](https://lxml.de/parsing.html). 

![lxml](https://lxml.de/python-xml-title.png)

## Toy Example
### The Anthology

In [157]:
from lxml import etree

In [158]:
xml_string = """
<anthology>
 <poem id='1' 
      author='Blake, William.'
      date='1794'>
  <heading id='1.1'>The SICK ROSE</heading>
  <stanza id='1.2'>
   <line id='1.2.1'>O Rose thou art sick.</line>
   <line id='1.2.2'>The invisible worm,</line>
   <line id='1.2.3'>That flies in the night</line>
   <line id='1.2.4'>In the howling storm:</line>
  </stanza>
  <stanza id='1.3'>
   <line id='1.3.1'>Has found out thy bed</line>
   <line id='1.3.2'>Of crimson joy:</line>
   <line id='1.3.3'>And his dark secret love</line>
   <line id='1.3.4'>Does thy life destroy.</line>
  </stanza>
 </poem>
 <poem id='2'
     author='Shakespeare, William.' >
    <heading id='2.1'>Sonnet 106</heading>
    <stanza id='2.2'>
        <line id='2.2.1'>When in the chronicle of wasted time</line>
        <line id='2.2.2'>I see descriptions of the fairest wights,</line>
        <line id='2.2.3'>And beauty making beautiful old rhyme</line>
        <line id='2.2.4'>In praise of ladies dead, and lovely knights,</line>
     </stanza>
  </poem> 
</anthology>
"""
              

In [159]:
from io import StringIO
tree = etree.parse(StringIO(xml_string)) # "parsing" the string, i.e. convert it to an XML tree

![tree](https://tei-c.org/release/doc/tei-p5-doc/en/html/Images/xmlFlowChart.png)

In [32]:
help(tree)

Help on _ElementTree object:

class _ElementTree(builtins.object)
 |  Methods defined here:
 |  
 |  __copy__(self)
 |  
 |  __deepcopy__(self, memo)
 |  
 |  find(self, path, namespaces)
 |      find(self, path, namespaces=None)
 |      
 |      Finds the first toplevel element with given tag.  Same as
 |      ``tree.getroot().find(path)``.
 |      
 |      The optional ``namespaces`` argument accepts a
 |      prefix-to-namespace mapping that allows the usage of XPath
 |      prefixes in the path expression.
 |  
 |  findall(self, path, namespaces)
 |      findall(self, path, namespaces=None)
 |      
 |      Finds all elements matching the ElementPath expression.  Same as
 |      getroot().findall(path).
 |      
 |      The optional ``namespaces`` argument accepts a
 |      prefix-to-namespace mapping that allows the usage of XPath
 |      prefixes in the path expression.
 |  
 |  findtext(self, path, default, namespaces)
 |      findtext(self, path, default=None, namespaces=None)


In [34]:
# iterate over all elements (.iter() replaces the deprecated .getiterator())
for element in tree.iter():
    print(element)

<Element anthology at 0x7fa6a5476480>
<Element poem at 0x7fa6a54769c0>
<Element heading at 0x7fa6a6869b40>
<Element stanza at 0x7fa6a6869cc0>
<Element line at 0x7fa6a54769c0>
<Element line at 0x7fa6a6869b40>
<Element line at 0x7fa6a6869cc0>
<Element line at 0x7fa6a54769c0>
<Element stanza at 0x7fa6a6869b40>
<Element line at 0x7fa6a6869cc0>
<Element line at 0x7fa6a54769c0>
<Element line at 0x7fa6a6869b40>
<Element line at 0x7fa6a6869cc0>


In [44]:
# explore tree structure
root = tree.getroot()
len(root) # we only have two poems in anthology

2

In [46]:
# second poem (Python indices start at 0)
root[1]

<Element poem at 0x7fa6a6886040>

In [48]:
# first stanza of the second poem (index 0 is the heading)
# stanza is a child of poem
root[1][1]

<Element stanza at 0x7fa6a67e1880>

In [50]:
# each element has different __Python__ attributes
root[1][1].tag

'stanza'

In [51]:
# why does this contain only whitespace?
root[1][1].text

'\n        '

In [52]:
# .text only returns the text directly enclosed by the tag
root[1][1][2].text

'And beauty making beautiful old rhyme'
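The whitespace result is worth a closer look: `.text` holds only the text between an element's start-tag and its first child, so for a stanza it is just the line break and indentation before the first `<line>`. A minimal sketch on a toy stanza (invented for illustration) shows the difference between `.text` and `itertext()`, which walks through all text in the subtree.

```python
from lxml import etree

stanza = etree.fromstring(
    "<stanza>\n  <line>O Rose thou art sick.</line>\n"
    "  <line>The invisible worm,</line>\n</stanza>"
)

# Text directly inside <stanza>, before the first child: whitespace only
direct_text = stanza.text
print(repr(direct_text))

# Each child holds its own text
line_texts = [line.text for line in stanza]
print(line_texts)

# itertext() yields every text fragment in the subtree, in document order;
# here we drop the whitespace-only fragments and join the rest
full_text = " ".join(t.strip() for t in stanza.itertext() if t.strip())
print(full_text)
```

`etree.tostring(element, method='text', encoding='unicode')` gives a similar result but keeps the original whitespace.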

In [57]:
# print the text of the first stanza of the second poem
print(etree.tostring(root[1][1],method='text',encoding='unicode'))


        When in the chronicle of wasted time
        I see descriptions of the fairest wights,
        And beauty making beautiful old rhyme
        In praise of ladies dead, and lovely knights,
     
  


In [60]:
#attributes are stored as a dictionary
root[0].attrib

{'id': '1', 'author': 'Blake, William.', 'date': '1794'}

In [61]:
root[0].attrib['author']

'Blake, William.'

## ✏️ Exercise

- Convert the small dialogue below to an XML document, using the tags: `dialogue`, `speech_turn`, `line`
- add the speaker's name to each `speech_turn`, as well as a unique identifier for each speaker
- add identifiers to each element, similar to the anthology example
- save the XML document as a string in a variable called `xml_dialogue`
- parse the string with lxml and save the Element Tree in a new variable
- print the text of the second line

```
George: “You’re gonna over-dry your laundry.”
Jerry: “You can’t over-dry.”
George: “Why not?”
Jerry: “Same reason you can’t over-wet.”
```
![seinfeld](https://media.giphy.com/media/14dXclYKbx2ONW/giphy.gif)

In [None]:
# add your answer here

## Realistic Example
### UK Parliamentary Debates on Abortion (29th June 1967)

In [63]:
with open('xml_examples/uk.proc.d.1967-06-29.xml','rb') as in_xml:
    tree = etree.parse(in_xml)

In [69]:
root = tree.getroot()

In [70]:
len(root), type(root)

(3, lxml.etree._Element)

#### Elements are lists

In [67]:
for element in root:
    print(element.tag)

{http://www.politicalmashup.nl/docinfo}docinfo
meta
{http://www.politicalmashup.nl}proceedings


In [73]:
proceedings = root[2]

In [89]:
proceedings[2]

<Element {http://www.politicalmashup.nl}topic at 0x7fa6a68ad7c0>

In [90]:
proceedings[2][0]

<Element {http://www.politicalmashup.nl}scene at 0x7fa6a6891640>

In [91]:
for i,element in enumerate(proceedings[2]):
    print(i,element.tag,len(element))

0 {http://www.politicalmashup.nl}scene 4
1 {http://www.politicalmashup.nl}scene 6
2 {http://www.politicalmashup.nl}scene 5
3 {http://www.politicalmashup.nl}scene 5
4 {http://www.politicalmashup.nl}scene 6
5 {http://www.politicalmashup.nl}scene 9
6 {http://www.politicalmashup.nl}scene 22
7 {http://www.politicalmashup.nl}scene 4
8 {http://www.politicalmashup.nl}scene 4
9 {http://www.politicalmashup.nl}scene 4
10 {http://www.politicalmashup.nl}scene 18
11 {http://www.politicalmashup.nl}scene 9
12 {http://www.politicalmashup.nl}scene 4
13 {http://www.politicalmashup.nl}scene 4
14 {http://www.politicalmashup.nl}scene 4
15 {http://www.politicalmashup.nl}scene 4
16 {http://www.politicalmashup.nl}scene 6
17 {http://www.politicalmashup.nl}scene 6
18 {http://www.politicalmashup.nl}scene 4
19 {http://www.politicalmashup.nl}scene 10
20 {http://www.politicalmashup.nl}scene 4


### ✏️ Exercise
- How many paragraphs does the 1st speech in the 7th scene of 3rd topic contain?
- How many words does the first paragraph contain? (You can just use `split()` to divide the string into tokens).

In [114]:
proceedings[2][6][1]

<Element {http://www.politicalmashup.nl}speech at 0x7fa6a68a3440>

In [115]:
len(proceedings[2][6][1][0].text.split())

42

In [132]:
# you can iterate over a specific element
# notice that we need to define the namespace
# for both the element and attribute
for element in tree.iter("{http://www.politicalmashup.nl}speech"):
    print(element.attrib['{http://www.politicalmashup.nl}id'])

uk.proc.d.1967-06-29.3.1.1
uk.proc.d.1967-06-29.3.1.2
uk.proc.d.1967-06-29.3.1.3
uk.proc.d.1967-06-29.3.1.4
uk.proc.d.1967-06-29.3.2.1
uk.proc.d.1967-06-29.3.2.2
uk.proc.d.1967-06-29.3.2.3
uk.proc.d.1967-06-29.3.2.4
uk.proc.d.1967-06-29.3.2.5
uk.proc.d.1967-06-29.3.2.6
uk.proc.d.1967-06-29.3.3.1
uk.proc.d.1967-06-29.3.3.2
uk.proc.d.1967-06-29.3.3.3
uk.proc.d.1967-06-29.3.3.4
uk.proc.d.1967-06-29.3.3.5
uk.proc.d.1967-06-29.3.4.1
uk.proc.d.1967-06-29.3.4.2
uk.proc.d.1967-06-29.3.4.3
uk.proc.d.1967-06-29.3.4.4
uk.proc.d.1967-06-29.3.4.5
uk.proc.d.1967-06-29.3.5.1
uk.proc.d.1967-06-29.3.5.2
uk.proc.d.1967-06-29.3.5.3
uk.proc.d.1967-06-29.3.5.4
uk.proc.d.1967-06-29.3.5.5
uk.proc.d.1967-06-29.3.5.6
uk.proc.d.1967-06-29.3.6.1
uk.proc.d.1967-06-29.3.6.2
uk.proc.d.1967-06-29.3.6.3
uk.proc.d.1967-06-29.3.6.4
uk.proc.d.1967-06-29.3.6.5
uk.proc.d.1967-06-29.3.6.6
uk.proc.d.1967-06-29.3.6.7
uk.proc.d.1967-06-29.3.6.8
uk.proc.d.1967-06-29.3.6.9
uk.proc.d.1967-06-29.3.7.1
uk.proc.d.1967-06-29.3.7.2
u

### ✏️ Exercise
- (Difficult) What is the longest speech in the fourth topic element?

In [136]:
longest_speech, max_length = None, 0
# .iter() replaces the deprecated .getiterator()
for el in proceedings[3].iter("{http://www.politicalmashup.nl}speech"):
    len_speech = 0
    for p in el:
        # p.text is None when a paragraph has no leading text
        len_speech += len((p.text or '').split())
    if len_speech > max_length:
        max_length = len_speech
        longest_speech = el
max_length, etree.tostring(longest_speech, method='text', encoding='unicode')

(84,
 "\n          No, Sir. A separate inquiry, directed particularly at the effects of increased food prices on these areas would not be fruitful since it would not be possible to disentangle the effects from the overall consequences of Great Britain's entry into the European Economic Community In any case, the economic circumstances of the development areas should improve substantially during the next few years as a result of the Government's measures and particularly the regional employment premium announced by my right hon. Friend on 5th June.\n        \n        ")

## Elements carry attributes as dictionaries

In [154]:
proceedings[2][6][15].tag,proceedings[2][6][15].attrib

('{http://www.politicalmashup.nl}speech',
 {'{http://www.politicalmashup.nl}speaker': 'Mr. Goodhart', '{http://www.politicalmashup.nl}party': 'Conservative', '{http://www.politicalmashup.nl}role': 'mp', '{http://www.politicalmashup.nl}party-ref': 'uk.p.Con', '{http://www.politicalmashup.nl}member-ref': 'uk.m.17551', '{http://www.politicalmashup.nl}id': 'uk.proc.d.1967-06-29.3.7.16'})

In [155]:
root.nsmap

{'dc': 'http://purl.org/dc/elements/1.1/',
 'xsd': 'http://www.w3.org/2001/XMLSchema-datatypes',
 'owl': 'http://www.w3.org/2002/07/owl#',
 'pmd': 'http://www.politicalmashup.nl/docinfo',
 'pm': 'http://www.politicalmashup.nl',
 'dp': 'http://dilipad.history.ac.uk',
 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
 'html': 'http://www.w3.org/1999/xhtml',
 'dcterms': 'http://purl.org/dc/terms/',
 'xsi': 'http://www.w3.org/2001/XMLSchema-instance'}

In [156]:
speech = proceedings[2][6][15]
speech.attrib["{%s}speaker" % root.nsmap['pm']]

'Mr. Goodhart'

In [153]:
# it is safer to use the .get() method: it returns a default instead of raising a KeyError
speech.attrib.get("{%s}speaker" % root.nsmap['pm'], '')

'Mr. Goodhart'
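To see why `.get()` is the safer choice, consider a speech that lacks one of the attributes. The sketch below uses a toy element without a `pm:party` attribute and a small helper, `pm_attr` (a hypothetical name introduced here), that builds the namespaced attribute name with `etree.QName`.

```python
from lxml import etree

PM = "http://www.politicalmashup.nl"

# Toy speech for illustration: it has a speaker and a role, but no party
speech = etree.fromstring(
    '<pm:speech xmlns:pm="%s" pm:speaker="Mr. Goodhart" pm:role="mp"/>' % PM
)

def pm_attr(element, name, default=""):
    """Look up an attribute in the pm namespace, with a fallback value."""
    key = etree.QName(PM, name).text  # '{http://www.politicalmashup.nl}name'
    return element.attrib.get(key, default)

speaker = pm_attr(speech, "speaker")
party = pm_attr(speech, "party", default="unknown")
print(speaker, party)

# Direct indexing raises a KeyError for the missing attribute
try:
    speech.attrib[etree.QName(PM, "party").text]
    raised = False
except KeyError:
    raised = True
print(raised)
```

In a loop over many speeches, a single missing attribute would otherwise stop the whole computation.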

### ✏️ Exercise

- Count the number of speeches made by Conservatives during this sitting of Parliament.

In [149]:
con_speeches = 0
for element in tree.iter("{http://www.politicalmashup.nl}speech"):
    if element.attrib.get('{http://www.politicalmashup.nl}party-ref','') == 'uk.p.Con':
        con_speeches+=1
con_speeches

400

### ✏️ Exercise

- How often does the Speaker intervene?

# Find methods

In [161]:
nsmap = root.nsmap; nsmap

{'dc': 'http://purl.org/dc/elements/1.1/',
 'xsd': 'http://www.w3.org/2001/XMLSchema-datatypes',
 'owl': 'http://www.w3.org/2002/07/owl#',
 'pmd': 'http://www.politicalmashup.nl/docinfo',
 'pm': 'http://www.politicalmashup.nl',
 'dp': 'http://dilipad.history.ac.uk',
 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
 'html': 'http://www.w3.org/1999/xhtml',
 'dcterms': 'http://purl.org/dc/terms/',
 'xsi': 'http://www.w3.org/2001/XMLSchema-instance'}

In [165]:
help(root.find)

Help on method find in module lxml.etree:

find(path, namespaces) method of lxml.etree._Element instance
    find(self, path, namespaces=None)
    
    Finds the first matching subelement, by tag name or path.
    
    The optional ``namespaces`` argument accepts a
    prefix-to-namespace mapping that allows the usage of XPath
    prefixes in the path expression.



In [168]:
# note the namespace prefix
speeches = root[2].find('.//pm:speech',namespaces=nsmap); speeches

<Element {http://www.politicalmashup.nl}speech at 0x7fa6a6c94b00>

In [169]:
speeches = root[2].findall('.//pm:speech',namespaces=nsmap); len(speeches)

1101

# Finding Needles and Mining Haystacks with XPath

- XPath is a language for querying and navigating elements and attributes in an XML document.
- With XPath you can define regions of your XML documents and select element, attribute or text nodes
- XPath uses path expressions for querying an XML document
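Before turning to the parliamentary data, the main ingredients of a path expression can be tried on a reduced version of the anthology. This is a sketch on toy data without namespaces; the variable names are only illustrative.

```python
from lxml import etree

anthology = etree.fromstring("""
<anthology>
 <poem id="1" author="Blake, William.">
  <heading>The SICK ROSE</heading>
  <stanza><line>O Rose thou art sick.</line><line>The invisible worm,</line></stanza>
 </poem>
 <poem id="2" author="Shakespeare, William.">
  <heading>Sonnet 106</heading>
  <stanza><line>When in the chronicle of wasted time</line></stanza>
 </poem>
</anthology>
""")

# "/" steps from parent to child; text() selects text nodes
headings = anthology.xpath("./poem/heading/text()")

# "//" matches at any depth below the current node
all_lines = anthology.xpath(".//line")

# [n] selects by position; XPath counts from 1, unlike Python
second_poem = anthology.xpath("./poem[2]")[0]

# [@name="value"] filters on an attribute value
blake_lines = anthology.xpath('./poem[@author="Blake, William."]//line/text()')

# "@name" selects the attribute values themselves
authors = anthology.xpath("./poem/@author")

print(headings, len(all_lines), second_poem.get("id"), blake_lines, authors)
```

`xpath()` always returns a list, even when only one node matches, which is why `second_poem` is taken with `[0]`.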



In [204]:
# first scene of fifth topic
topic = root.xpath('./pm:proceedings/pm:topic[5]/pm:scene[1]',namespaces=nsmap); topic

[<Element {http://www.politicalmashup.nl}scene at 0x7fa6a6d52fc0>]

In [222]:
# all speeches in the third topic
## // means any speech element nested in the third topic, at any depth
speeches = root.xpath('./pm:proceedings/pm:topic[3]//pm:speech',namespaces=nsmap); len(speeches)

141

In [224]:
# .// from root find all speeches regardless where they appear
speeches = root.xpath('.//pm:speech',namespaces=nsmap); len(speeches)

1101

In [227]:
# find speeches that contain a particular attribute
speeches = root.xpath('.//pm:speech[@pm:role]',namespaces=nsmap); len(speeches)

1101

In [230]:
# find attributes
speeches = root.xpath('.//pm:speech/@pm:role',namespaces=nsmap); 
speeches[:10]

['mp',
 'government',
 'mp',
 'government',
 'mp',
 'government',
 'mp',
 'government',
 'mp',
 'government']

In [231]:
# find speeches where attribute matches a particular value
speeches = root[2].xpath('.//pm:speech[@pm:role="chair"]',namespaces=nsmap); len(speeches)

178

### ✏️ Exercise

- Which parties are participating in this sitting?
- How many Conservative/Labour speeches does the document contain?

In [240]:
print(set(root.xpath('.//pm:speech/@pm:party-ref',namespaces=nsmap)))

{'uk.p.Con', 'uk.p.Lab', 'uk.p.UUP', 'uk.p.Lib', 'uk.p.LDem'}


In [238]:
print(len(root.xpath('.//pm:speech[@pm:party-ref="uk.p.Con"]',namespaces=nsmap)))
print(len(root.xpath('.//pm:speech[@pm:party-ref="uk.p.Lab"]',namespaces=nsmap)))

400
595


In [277]:
# find paragraphs that mention a certain word
mentions = root.xpath(".//pm:speech/pm:p[contains(text(), 'women')]",namespaces=nsmap)

In [281]:
# moving up the tree
mentions[4].getparent().attrib['{http://www.politicalmashup.nl}speaker']

'Mrs. Knight'
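One caveat about `contains(text(), ...)`: in XPath 1.0 (the version lxml implements), a function that expects a string looks only at the first node of a node set, so only the first text node of each paragraph is tested. When a paragraph contains child elements, `contains(., ...)` tests the full text instead. A sketch on a toy speech (invented for illustration):

```python
from lxml import etree

speech = etree.fromstring(
    "<speech speaker='A'>"
    "<p>There are 60,000 women on the waiting list.</p>"
    "<p>The list of <i>distinguished</i> men and women follows.</p>"
    "</speech>"
)

# text() yields several text nodes for the second <p>; only the first one,
# "The list of ", is tested, so that paragraph is missed
first_node_only = speech.xpath("./p[contains(text(), 'women')]")

# "." is the string value of the whole paragraph, including nested elements
whole_string = speech.xpath("./p[contains(., 'women')]")

print(len(first_node_only), len(whole_string))

# getparent() moves from a matching paragraph up to its speech
print(whole_string[1].getparent().get("speaker"))
```

Note also that `contains` matches substrings, so it would match "women" inside a longer word as well.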

### ✏️ Exercise

- Get the title attributes of the topic elements
- Formulate an XPath expression that finds the debate (i.e. topic element) on the medical termination of pregnancy bill.
- Formulate an XPath expression that finds speeches made by Labour MPs during the debate on the medical termination of pregnancy bill.
- Formulate an XPath expression that finds paragraphs in speeches made by Conservative MPs during the debate on the medical termination of pregnancy bill that mention the word "women". Print the content of these paragraphs.


In [282]:
root.xpath('./pm:proceedings//pm:topic/@pm:title',namespaces=nsmap)

['Preamble',
 'PRAYERS',
 'Oral Answers to Questions — OVERSEAS DEVELOPMENT',
 'Oral Answers to Questions — ECONOMIC AFFAIRS',
 'Oral Answers to Questions — POST OFFICE',
 'Oral Answers to Questions — TELEPHONE SERVICE',
 'Oral Answers to Questions — WIRELESS AND TELEVISION',
 'Oral Answers to Questions — HOME DEPARTMENT',
 'QUESTIONS TO MINISTERS',
 'NOTICES OF MOTION',
 'BUSINESS OF THE HOUSE',
 'PETROL PRICES (SURCHARGE)',
 'BILLS PRESENTED',
 'Orders of the Day — SUPPLY',
 'Orders of the Day — NATIONAL AIRPORT POLICY',
 'Orders of the Day — BUSINESS OF THE HOUSE',
 'Orders of the Day — MEDICAL TERMINATION OF PREGNANCY BILL',
 'Orders of the Day — ADJOURNMENT']

In [283]:
abortion_debate = root.xpath('./pm:proceedings/pm:topic[@pm:title="Orders of the Day — MEDICAL TERMINATION OF PREGNANCY BILL"]',namespaces=nsmap)


In [284]:
lab_speeches = root.xpath('./pm:proceedings/pm:topic[@pm:title="Orders of the Day — MEDICAL TERMINATION OF PREGNANCY BILL"]//pm:speech[@pm:party-ref="uk.p.Lab"]',namespaces=nsmap)
lab_speeches

[<Element {http://www.politicalmashup.nl}speech at 0x7fa6a6b25f40>,
 <Element {http://www.politicalmashup.nl}speech at 0x7fa6a6ce4340>,
 <Element {http://www.politicalmashup.nl}speech at 0x7fa6a6c2fbc0>,
 <Element {http://www.politicalmashup.nl}speech at 0x7fa6a6cb7fc0>,
 <Element {http://www.politicalmashup.nl}speech at 0x7fa6a6c2f240>,
 <Element {http://www.politicalmashup.nl}speech at 0x7fa6a6cb7a40>,
 <Element {http://www.politicalmashup.nl}speech at 0x7fa6a6b27840>,
 <Element {http://www.politicalmashup.nl}speech at 0x7fa6a6b27c00>,
 <Element {http://www.politicalmashup.nl}speech at 0x7fa6a6cb7d00>,
 <Element {http://www.politicalmashup.nl}speech at 0x7fa6a6b0b240>,
 <Element {http://www.politicalmashup.nl}speech at 0x7fa6a6cb7b40>,
 <Element {http://www.politicalmashup.nl}speech at 0x7fa6a6cb7400>,
 <Element {http://www.politicalmashup.nl}speech at 0x7fa6a6b0bfc0>,
 <Element {http://www.politicalmashup.nl}speech at 0x7fa6a6b0bb00>,
 <Element {http://www.politicalmashup.nl}speech 

In [285]:
con_paras = root.xpath('./pm:proceedings/pm:topic[@pm:title="Orders of the Day — MEDICAL TERMINATION OF PREGNANCY BILL"]//pm:speech[@pm:party-ref="uk.p.Con"]/pm:p[contains(text(), "women")]',namespaces=nsmap)
for s in con_paras:
    print(etree.tostring(s,method='text',encoding='unicode'))

The effect of the Bill's handling on doctors is shown by a letter I received a few days ago from two eminent Yorkshire consultants, Mr. G. A. Craig, consultant obstetrician and gynaecologist for the Bradford A group of hospitals, and Mr. John Mander, consultant obstetrician and gynaecologist at York, which reads: The following statement has been submitted to the 29 Consultant Gynaecologists employed by the Leeds Regional Hospital Board (as per attached list). There follows a list of distinguished medical men and women.
          
I will not anticipate arguments which I shall address to the House later, but it will emerge in the course of debate during the night that this is not a Bill which will permit a large increase in therapeutic abortion in the National Health Service hospitals. The Minister knows that well. There are about 10,000 gynaecological beds in the country and there are 60,000 women on the gynaecological waiting list, awaiting appointments. The chances of the National Hea

### ✏️ Exercise
- (Difficult) Make a small program that computes word frequencies for speeches made by Labour and Conservative party members.
- (Difficult) Does the probability of mentioning "woman" or "women" differ by party? The probability in this case is just the frequency of "woman"/"women" divided by the total number of words.
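As a starting point, here is one possible sketch of the counting logic, run on a toy document rather than the real proceedings. The helper names `word_counts` and `mention_probability` are hypothetical, and the tokenisation (lowercasing, splitting on whitespace, stripping punctuation) is a deliberately crude assumption.

```python
from collections import Counter
from lxml import etree

NS = {"pm": "http://www.politicalmashup.nl"}

# Toy data for illustration only
toy = etree.fromstring("""
<pm:proceedings xmlns:pm="http://www.politicalmashup.nl">
  <pm:speech pm:party-ref="uk.p.Lab"><pm:p>Women wait. The women wait.</pm:p></pm:speech>
  <pm:speech pm:party-ref="uk.p.Con"><pm:p>The Bill concerns every woman.</pm:p></pm:speech>
</pm:proceedings>
""")

def word_counts(root, party_ref):
    """Count lowercased words in all paragraphs of speeches by one party."""
    query = './/pm:speech[@pm:party-ref="%s"]/pm:p' % party_ref
    counts = Counter()
    for p in root.xpath(query, namespaces=NS):
        text = "".join(p.itertext()).lower()
        words = (w.strip('.,;:!?"()') for w in text.split())
        counts.update(w for w in words if w)
    return counts

def mention_probability(counts, targets=("woman", "women")):
    """Frequency of the target words divided by the total number of words."""
    total = sum(counts.values())
    return sum(counts[t] for t in targets) / total if total else 0.0

lab = word_counts(toy, "uk.p.Lab")
con = word_counts(toy, "uk.p.Con")
print(lab.most_common(3), mention_probability(lab))
print(con.most_common(3), mention_probability(con))
```

To apply this to the sitting, pass `root` instead of `toy`; the numbers then depend on the tokenisation choices made above.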

## Additional examples and exercises: working with Old Bailey

In [286]:
!wget -O oldbailey.zip https://figshare.shef.ac.uk/ndownloader/files/7979134

--2022-07-06 18:25:08--  https://figshare.shef.ac.uk/ndownloader/files/7979134
Resolving figshare.shef.ac.uk (figshare.shef.ac.uk)... 52.213.244.144, 108.128.50.10
Connecting to figshare.shef.ac.uk (figshare.shef.ac.uk)|52.213.244.144|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://s3-eu-west-1.amazonaws.com/pstorage-sheffield-5641355/7979134/OBO_XML_72.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIXQ5KBYLXRRFOHUQ/20220706/eu-west-1/s3/aws4_request&X-Amz-Date=20220706T172509Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=f5842e316631e87d4f5712e26ba99150533f87261d6fae08a3505fa68378bd9f [following]
--2022-07-06 18:25:08--  https://s3-eu-west-1.amazonaws.com/pstorage-sheffield-5641355/7979134/OBO_XML_72.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIXQ5KBYLXRRFOHUQ/20220706/eu-west-1/s3/aws4_request&X-Amz-Date=20220706T172509Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=f5842e316631e87d4f5712e26ba991

In [257]:
!mkdir oldbailey_data

mkdir: oldbailey_data: File exists


In [259]:
!unzip oldbailey.zip -d oldbailey_data/ 

Archive:  oldbailey.zip
  inflating: oldbailey_data/licence.txt  
  inflating: oldbailey_data/listOA.txt  
  inflating: oldbailey_data/listOBP.txt  
  inflating: oldbailey_data/OB_xml_notes.docx  
   creating: oldbailey_data/ordinarysAccounts/
  inflating: oldbailey_data/ordinarysAccounts/.DS_Store  
  inflating: oldbailey_data/ordinarysAccounts/OA16760517.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA16760705.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA16760830.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA16761025.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA16770316.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA16770504.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA16771017.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA16771219.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA16780123.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA16780306.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA16780417.xml  
  inf

   creating: oldbailey_data/sessionsPapers/
  inflating: oldbailey_data/sessionsPapers/.DS_Store  
  inflating: oldbailey_data/sessionsPapers/16740429.xml  
  inflating: oldbailey_data/sessionsPapers/16740717.xml  
  inflating: oldbailey_data/sessionsPapers/16740909.xml  
  inflating: oldbailey_data/sessionsPapers/16741014.xml  
  inflating: oldbailey_data/sessionsPapers/16741212.xml  
  inflating: oldbailey_data/sessionsPapers/16750115.xml  
  inflating: oldbailey_data/sessionsPapers/16750219.xml  
  inflating: oldbailey_data/sessionsPapers/16750414.xml  
  inflating: oldbailey_data/sessionsPapers/16750707.xml  
  inflating: oldbailey_data/sessionsPapers/16750909.xml  
  infl

  inflating: oldbailey_data/sessionsPapers/16921207.xml  
  inflating: oldbailey_data/sessionsPapers/16930116.xml  
  inflating: oldbailey_data/sessionsPapers/16930426.xml  
  inflating: oldbailey_data/sessionsPapers/16930531.xml  
  inflating: oldbailey_data/sessionsPapers/16930713.xml  
  inflating: oldbailey_data/sessionsPapers/16930906.xml  
  inflating: oldbailey_data/sessionsPapers/16931012.xml  
  inflating: oldbailey_data/sessionsPapers/16931206.xml  
  inflating: oldbailey_data/sessionsPapers/16940221.xml  
  inflating: oldbailey_data/sessionsPapers/16940418.xml  
  inflating: oldbailey_data/sessionsPapers/16940524.xml  
  inflating: oldbailey_data/sessionsPapers/16940711.xml  
  inflating: oldbailey_data/sessionsPapers/16940830.xml  
  inflating: oldbailey_data/sessionsPapers/16941010.xml  
  inflating: oldbailey_data/sessionsPapers/16950220.xml  
  inflating: oldbailey_data/sessionsPapers/16950403.xml  
  inflating: oldbailey_data/sessionsPapers/16950508.xml  
  inflating: o

  inflating: oldbailey_data/sessionsPapers/17241014.xml  
  inflating: oldbailey_data/sessionsPapers/17241204.xml  
  inflating: oldbailey_data/sessionsPapers/17250115.xml  
  inflating: oldbailey_data/sessionsPapers/17250224.xml  
  inflating: oldbailey_data/sessionsPapers/17250407.xml  
  inflating: oldbailey_data/sessionsPapers/17250513.xml  
  inflating: oldbailey_data/sessionsPapers/17250630.xml  
  inflating: oldbailey_data/sessionsPapers/17250827.xml  
  inflating: oldbailey_data/sessionsPapers/17251013.xml  
  inflating: oldbailey_data/sessionsPapers/17251208.xml  
  inflating: oldbailey_data/sessionsPapers/17260114.xml  
  inflating: oldbailey_data/sessionsPapers/17260302.xml  
  inflating: oldbailey_data/sessionsPapers/17260420.xml  
  inflating: oldbailey_data/sessionsPapers/17260425.xml  
  inflating: oldbailey_data/sessionsPapers/17260711.xml  
  inflating: oldbailey_data/sessionsPapers/17260831.xml  
  inflating: oldbailey_data/sessionsPapers/17261012.xml  
  inflating: o

  inflating: oldbailey_data/sessionsPapers/17420714.xml  
  inflating: oldbailey_data/sessionsPapers/17420715.xml  
  inflating: oldbailey_data/sessionsPapers/17420909.xml  
  inflating: oldbailey_data/sessionsPapers/17421013.xml  
  inflating: oldbailey_data/sessionsPapers/17421208.xml  
  inflating: oldbailey_data/sessionsPapers/17430114.xml  
  inflating: oldbailey_data/sessionsPapers/17430223.xml  
  inflating: oldbailey_data/sessionsPapers/17430413.xml  
  inflating: oldbailey_data/sessionsPapers/17430519.xml  
  inflating: oldbailey_data/sessionsPapers/17430629.xml  
  inflating: oldbailey_data/sessionsPapers/17430907.xml  
  inflating: oldbailey_data/sessionsPapers/17431012.xml  
  inflating: oldbailey_data/sessionsPapers/17431207.xml  
  inflating: oldbailey_data/sessionsPapers/17440113.xml  
  inflating: oldbailey_data/sessionsPapers/17440223.xml  
  inflating: oldbailey_data/sessionsPapers/17440404.xml  
  inflating: oldbailey_data/sessionsPapers/17440510.xml  
  inflating: o

  inflating: oldbailey_data/sessionsPapers/17610225.xml  
  inflating: oldbailey_data/sessionsPapers/17610401.xml  
  inflating: oldbailey_data/sessionsPapers/17610506.xml  
  inflating: oldbailey_data/sessionsPapers/17610625.xml  
  inflating: oldbailey_data/sessionsPapers/17610916.xml  
  inflating: oldbailey_data/sessionsPapers/17611021.xml  
  inflating: oldbailey_data/sessionsPapers/17611209.xml  
  inflating: oldbailey_data/sessionsPapers/17620114.xml  
  inflating: oldbailey_data/sessionsPapers/17620224.xml  
  inflating: oldbailey_data/sessionsPapers/17620421.xml  
  inflating: oldbailey_data/sessionsPapers/17620526.xml  
  inflating: oldbailey_data/sessionsPapers/17620714.xml  
  inflating: oldbailey_data/sessionsPapers/17620917.xml  
  inflating: oldbailey_data/sessionsPapers/17621020.xml  
  inflating: oldbailey_data/sessionsPapers/17621208.xml  
  inflating: oldbailey_data/sessionsPapers/17630114.xml  
  inflating: oldbailey_data/sessionsPapers/17630223.xml  
  inflating: o

  inflating: oldbailey_data/sessionsPapers/17791020.xml  
  inflating: oldbailey_data/sessionsPapers/17791208.xml  
  inflating: oldbailey_data/sessionsPapers/17800112.xml  
  inflating: oldbailey_data/sessionsPapers/17800223.xml  
  inflating: oldbailey_data/sessionsPapers/17800405.xml  
  inflating: oldbailey_data/sessionsPapers/17800510.xml  
  inflating: oldbailey_data/sessionsPapers/17800628.xml  
  inflating: oldbailey_data/sessionsPapers/17800913.xml  
  inflating: oldbailey_data/sessionsPapers/17801018.xml  
  inflating: oldbailey_data/sessionsPapers/17801206.xml  
  inflating: oldbailey_data/sessionsPapers/17810110.xml  
  inflating: oldbailey_data/sessionsPapers/17810222.xml  
  inflating: oldbailey_data/sessionsPapers/17810425.xml  
  inflating: oldbailey_data/sessionsPapers/17810530.xml  
  inflating: oldbailey_data/sessionsPapers/17810711.xml  
  inflating: oldbailey_data/sessionsPapers/17810912.xml  
  inflating: oldbailey_data/sessionsPapers/17811017.xml  
  inflating: o

  inflating: oldbailey_data/sessionsPapers/18130714.xml  
  inflating: oldbailey_data/sessionsPapers/18130915.xml  
  inflating: oldbailey_data/sessionsPapers/18131027.xml  
  inflating: oldbailey_data/sessionsPapers/18131201.xml  
  inflating: oldbailey_data/sessionsPapers/18140112.xml  
  inflating: oldbailey_data/sessionsPapers/18140216.xml  
  inflating: oldbailey_data/sessionsPapers/18140420.xml  
  inflating: oldbailey_data/sessionsPapers/18140525.xml  
  inflating: oldbailey_data/sessionsPapers/18140706.xml  
  inflating: oldbailey_data/sessionsPapers/18140914.xml  
  inflating: oldbailey_data/sessionsPapers/18141026.xml  
  inflating: oldbailey_data/sessionsPapers/18141130.xml  
  inflating: oldbailey_data/sessionsPapers/18150111.xml  
  inflating: oldbailey_data/sessionsPapers/18150215.xml  
  inflating: oldbailey_data/sessionsPapers/18150405.xml  
  inflating: oldbailey_data/sessionsPapers/18150510.xml  
  inflating: oldbailey_data/sessionsPapers/18150621.xml  
  inflating: o

  inflating: oldbailey_data/sessionsPapers/18310512.xml  
  inflating: oldbailey_data/sessionsPapers/18310630.xml  
  inflating: oldbailey_data/sessionsPapers/18310908.xml  
  inflating: oldbailey_data/sessionsPapers/18311020.xml  
  inflating: oldbailey_data/sessionsPapers/18311201.xml  
  inflating: oldbailey_data/sessionsPapers/18320105.xml  
  inflating: oldbailey_data/sessionsPapers/18320216.xml  
  inflating: oldbailey_data/sessionsPapers/18320405.xml  
  inflating: oldbailey_data/sessionsPapers/18320517.xml  
  inflating: oldbailey_data/sessionsPapers/18320705.xml  
  inflating: oldbailey_data/sessionsPapers/18320906.xml  
  inflating: oldbailey_data/sessionsPapers/18321018.xml  
  inflating: oldbailey_data/sessionsPapers/18321129.xml  
  inflating: oldbailey_data/sessionsPapers/18330103.xml  
  inflating: oldbailey_data/sessionsPapers/18330214.xml  
  inflating: oldbailey_data/sessionsPapers/18330411.xml  
  inflating: oldbailey_data/sessionsPapers/18330516.xml  
  inflating: o

  inflating: oldbailey_data/sessionsPapers/18440506.xml  
  inflating: oldbailey_data/sessionsPapers/18440610.xml  
  inflating: oldbailey_data/sessionsPapers/18440701.xml  
  inflating: oldbailey_data/sessionsPapers/18440819.xml  
  inflating: oldbailey_data/sessionsPapers/18440916.xml  
  inflating: oldbailey_data/sessionsPapers/18441021.xml  
  inflating: oldbailey_data/sessionsPapers/18441125.xml  
  inflating: oldbailey_data/sessionsPapers/18441216.xml  
  inflating: oldbailey_data/sessionsPapers/18450106.xml  
  inflating: oldbailey_data/sessionsPapers/18450203.xml  
  inflating: oldbailey_data/sessionsPapers/18450303.xml  
  inflating: oldbailey_data/sessionsPapers/18450407.xml  
  inflating: oldbailey_data/sessionsPapers/18450512.xml  
  inflating: oldbailey_data/sessionsPapers/18450616.xml  
  inflating: oldbailey_data/sessionsPapers/18450707.xml  
  inflating: oldbailey_data/sessionsPapers/18450818.xml  
  inflating: oldbailey_data/sessionsPapers/18450915.xml  
  inflating: o

  inflating: oldbailey_data/sessionsPapers/18560514.xml  
  inflating: oldbailey_data/sessionsPapers/18560616.xml  
  inflating: oldbailey_data/sessionsPapers/18560707.xml  
  inflating: oldbailey_data/sessionsPapers/18560818.xml  
  inflating: oldbailey_data/sessionsPapers/18560915.xml  
  inflating: oldbailey_data/sessionsPapers/18561027.xml  
  inflating: oldbailey_data/sessionsPapers/18561124.xml  
  inflating: oldbailey_data/sessionsPapers/18561215.xml  
  inflating: oldbailey_data/sessionsPapers/18570105.xml  
  inflating: oldbailey_data/sessionsPapers/18570202.xml  
  inflating: oldbailey_data/sessionsPapers/18570302.xml  
  inflating: oldbailey_data/sessionsPapers/18570406.xml  
  inflating: oldbailey_data/sessionsPapers/18570511.xml  
  inflating: oldbailey_data/sessionsPapers/18570615.xml  
  inflating: oldbailey_data/sessionsPapers/18570706.xml  
  inflating: oldbailey_data/sessionsPapers/18570817.xml  
  inflating: oldbailey_data/sessionsPapers/18570914.xml  
  inflating: o

  inflating: oldbailey_data/sessionsPapers/18680706.xml  
  inflating: oldbailey_data/sessionsPapers/18680817.xml  
  inflating: oldbailey_data/sessionsPapers/18680921.xml  
  inflating: oldbailey_data/sessionsPapers/18681026.xml  
  inflating: oldbailey_data/sessionsPapers/18681123.xml  
  inflating: oldbailey_data/sessionsPapers/18681214.xml  
  inflating: oldbailey_data/sessionsPapers/18690111.xml  
  inflating: oldbailey_data/sessionsPapers/18690201.xml  
  inflating: oldbailey_data/sessionsPapers/18690301.xml  
  inflating: oldbailey_data/sessionsPapers/18690405.xml  
  inflating: oldbailey_data/sessionsPapers/18690503.xml  
  inflating: oldbailey_data/sessionsPapers/18690607.xml  
  inflating: oldbailey_data/sessionsPapers/18690712.xml  
  inflating: oldbailey_data/sessionsPapers/18690816.xml  
  inflating: oldbailey_data/sessionsPapers/18690920.xml  
  inflating: oldbailey_data/sessionsPapers/18691025.xml  
  inflating: oldbailey_data/sessionsPapers/18691122.xml  
  inflating: o

  inflating: oldbailey_data/sessionsPapers/18800426.xml  
  inflating: oldbailey_data/sessionsPapers/18800524.xml  
  inflating: oldbailey_data/sessionsPapers/18800628.xml  
  inflating: oldbailey_data/sessionsPapers/18800803.xml  
  inflating: oldbailey_data/sessionsPapers/18800913.xml  
  inflating: oldbailey_data/sessionsPapers/18801018.xml  
  inflating: oldbailey_data/sessionsPapers/18801123.xml  
  inflating: oldbailey_data/sessionsPapers/18801213.xml  
  inflating: oldbailey_data/sessionsPapers/18810110.xml  
  inflating: oldbailey_data/sessionsPapers/18810131.xml  
  inflating: oldbailey_data/sessionsPapers/18810228.xml  
  inflating: oldbailey_data/sessionsPapers/18810328.xml  
  inflating: oldbailey_data/sessionsPapers/18810502.xml  
  inflating: oldbailey_data/sessionsPapers/18810523.xml  
  inflating: oldbailey_data/sessionsPapers/18810627.xml  
  inflating: oldbailey_data/sessionsPapers/18810802.xml  
  inflating: oldbailey_data/sessionsPapers/18810912.xml  
  inflating: o

  inflating: oldbailey_data/sessionsPapers/18920502.xml  
  inflating: oldbailey_data/sessionsPapers/18920523.xml  
  inflating: oldbailey_data/sessionsPapers/18920627.xml  
  inflating: oldbailey_data/sessionsPapers/18920725.xml  
  inflating: oldbailey_data/sessionsPapers/18920912.xml  
  inflating: oldbailey_data/sessionsPapers/18921017.xml  
  inflating: oldbailey_data/sessionsPapers/18921114.xml  
  inflating: oldbailey_data/sessionsPapers/18921212.xml  
  inflating: oldbailey_data/sessionsPapers/18930109.xml  
  inflating: oldbailey_data/sessionsPapers/18930206.xml  
  inflating: oldbailey_data/sessionsPapers/18930306.xml  
  inflating: oldbailey_data/sessionsPapers/18930410.xml  
  inflating: oldbailey_data/sessionsPapers/18930501.xml  
  inflating: oldbailey_data/sessionsPapers/18930529.xml  
  inflating: oldbailey_data/sessionsPapers/18930626.xml  
  inflating: oldbailey_data/sessionsPapers/18930724.xml  
  inflating: oldbailey_data/sessionsPapers/18930911.xml  
  inflating: o

  inflating: oldbailey_data/sessionsPapers/19040913.xml  
  inflating: oldbailey_data/sessionsPapers/19041017.xml  
  inflating: oldbailey_data/sessionsPapers/19041114.xml  
  inflating: oldbailey_data/sessionsPapers/19041212.xml  
  inflating: oldbailey_data/sessionsPapers/19050109.xml  
  inflating: oldbailey_data/sessionsPapers/19050206.xml  
  inflating: oldbailey_data/sessionsPapers/19050306.xml  
  inflating: oldbailey_data/sessionsPapers/19050403.xml  
  inflating: oldbailey_data/sessionsPapers/19050502.xml  
  inflating: oldbailey_data/sessionsPapers/19050529.xml  
  inflating: oldbailey_data/sessionsPapers/19050626.xml  
  inflating: oldbailey_data/sessionsPapers/19050724.xml  
  inflating: oldbailey_data/sessionsPapers/19050912.xml  
  inflating: oldbailey_data/sessionsPapers/19051016.xml  
  inflating: oldbailey_data/sessionsPapers/19051113.xml  
  inflating: oldbailey_data/sessionsPapers/19051211.xml  
  inflating: oldbailey_data/sessionsPapers/19060108.xml  
  inflating: o

In [262]:
!ls oldbailey_data/ordinarysAccounts

OA16760517.xml  OA16971222.xml  OA17150803.xml  OA17290822.xml  OA17480318.xml
OA16760705.xml  OA16980126.xml  OA17150921.xml  OA17291121.xml  OA17480511.xml
OA16760830.xml  OA16980309.xml  OA17151028.xml  OA17291222.xml  OA17480622.xml
OA16761025.xml  OA16980622.xml  OA17151102.xml  OA17300417.xml  OA17481028.xml
OA16770316.xml  OA16980803.xml  OA17151223.xml  OA17300512.xml  OA17490220.xml
OA16770504.xml  OA16990802.xml  OA17160127.xml  OA17300601.xml  OA17490317.xml
OA16771017.xml  OA17000301.xml  OA17160608.xml  OA17301007.xml  OA17490426.xml
OA16771219.xml  OA17000419.xml  OA17160713.xml  OA17301116.xml  OA17490703.xml
OA16780123.xml  OA17000524.xml  OA17160919.xml  OA17301223.xml  OA17490804.xml
OA16780306.xml  OA17000719.xml  OA17160921.xml  OA17310308.xml  OA17491018.xml
OA16780417.xml  OA17000720.xml  OA17161219.xml  OA17310514.xml  OA17500207.xml
OA16780522.xml  OA17000906.xml  OA17170201.xml  OA17310616.xml  OA17500326.xml
OA16780906.xml  OA17001106.xml  OA171703

In [263]:
# macOS only: opens the folder in Finder
!open oldbailey_data/ordinarysAccounts/

In [265]:
# Parse one Ordinary's Account into an element tree
with open('oldbailey_data/ordinarysAccounts/OA17670914.xml', 'rb') as in_xml:
    tree = etree.parse(in_xml)

In [266]:
root = tree.getroot()

In [269]:
root.nsmap

{}
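
An empty `nsmap` suggests that the root element declares no namespaces, which is why unprefixed names such as `persName` can be used directly in the XPath queries in this notebook. As a contrast, here is a minimal sketch of what changes when a document does declare a default namespace. The XML snippet, the variable names and the `tei` prefix are invented for illustration and do not come from the Old Bailey files.

```python
from lxml import etree

# Hypothetical snippet (not Old Bailey data): the root declares a default namespace.
xml = (b'<TEI xmlns="http://www.tei-c.org/ns/1.0">'
       b'<text><persName>Mary Clifford</persName></text></TEI>')
ns_root = etree.fromstring(xml)

# The default namespace is stored under the key None.
print(ns_root.nsmap)

# An unprefixed name in XPath only matches elements in no namespace,
# so this query finds nothing here.
plain = ns_root.xpath('//persName')

# XPath needs an explicit prefix; "tei" is an arbitrary label chosen for this sketch.
ns = {'tei': 'http://www.tei-c.org/ns/1.0'}
prefixed = ns_root.xpath('//tei:persName', namespaces=ns)

print(len(plain), len(prefixed))
```

If `nsmap` is not empty for a file, queries like the ones in this notebook would probably need the `namespaces=` argument in the same way.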

In [None]:
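# XPath references:
# Matching on two attributes: https://stackoverflow.com/questions/2009268/how-to-write-an-xpath-query-to-match-two-attributes
# Selecting an element by the value of a child's child: https://stackoverflow.com/questions/9683054/xpath-to-select-element-based-on-childs-child-value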

In [291]:
# All <persName> elements anywhere in the document
els = tree.xpath('.//persName')

In [298]:
# Only <persName> elements with an <interp> child marking the gender as female
els = tree.xpath('//persName[interp[@type="gender" and @value="female"]]')
print(len(els))

23
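
The nested predicate in the query above requires that a single `<interp>` child carries both attributes. A small sketch of why that matters follows. The three-person document, the names and the `id` values are made up for this example; only the `persName`/`interp` layout follows the notebook.

```python
from lxml import etree

# Hypothetical miniature document that mimics the persName/interp layout.
sample = etree.fromstring(b"""
<div>
  <persName id="p1"><interp type="gender" value="female"/>Jane Doe</persName>
  <persName id="p2"><interp type="gender" value="male"/>John Doe</persName>
  <persName id="p3"><interp type="gender" value="male"/><interp type="note" value="female"/>Richard Roe</persName>
</div>
""")

# Both conditions must hold on the same <interp> child (the form used in the notebook).
same = sample.xpath('//persName[interp[@type="gender" and @value="female"]]')

# Looser variant: the two conditions may be satisfied by different <interp> children.
loose = sample.xpath('//persName[interp[@type="gender"] and interp[@value="female"]]')

print([e.get('id') for e in same])
print([e.get('id') for e in loose])
```

The looser query also returns `p3`, whose gender is recorded as male, so the nested form is the safer choice when several `<interp>` children sit under one `<persName>`.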


In [303]:
etree.tostring(els[0])

b'<persName id="OA17670914n20-1">\n                  <interp inst="OA17670914n20-1" type="gender" value="female"/>\n                  <interp inst="OA17670914n20-1" type="surname" value="BROWNRIGG"/>\n                  <interp inst="OA17670914n20-1" type="given" value="ELIZABETH"/>OF ELIZABETH BROWNRIGG</persName>, Who was executed at '

In [299]:
els[0].attrib

{'id': 'OA17670914n20-1'}
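
`attrib` only exposes the attributes of `<persName>` itself, here just the `id`. The gender, surname and given name are stored on the `<interp>` children. One possible way to collect them is sketched below. `interp_values` is a hypothetical helper name. It relies only on `findall` and `get`, which lxml and the standard library's `xml.etree.ElementTree` both provide, so the demo parses with the standard library simply to stay dependency-free.

```python
import xml.etree.ElementTree as ET

def interp_values(pers_name):
    """Collect the type/value pairs of the <interp> children into a dict."""
    return {i.get('type'): i.get('value') for i in pers_name.findall('interp')}

# Demo element copied from the notebook output above (whitespace trimmed).
demo = ET.fromstring(
    '<persName id="OA17670914n20-1">'
    '<interp inst="OA17670914n20-1" type="gender" value="female"/>'
    '<interp inst="OA17670914n20-1" type="surname" value="BROWNRIGG"/>'
    '<interp inst="OA17670914n20-1" type="given" value="ELIZABETH"/>'
    'OF ELIZABETH BROWNRIGG</persName>'
)

info = interp_values(demo)
print(info)
```

In the notebook the same helper should work on the lxml elements directly, for example `interp_values(els[0])`.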

In [305]:
# Text of each match as bytes; with_tail=False drops the text that follows the closing tag
[etree.tostring(e, method='text', encoding='ISO-8859-1', with_tail=False) for e in els]

[b'\n                  \n                  \n                  OF ELIZABETH BROWNRIGG',
 b'\n                  \n                  \n                  Mary Clifford',
 b'\n                  \n                  \n                  Mary Clifford',
 b'\n                  \n                  \n                  Elizabeth Brownrigg',
 b'\n                  \n                  \n                  Mary Clifford',
 b'\n                  \n                  \n                  Elizabeth',
 b'\n                  \n                  \n                  Elizabeth',
 b'\n                  \n                  \n                  Mary Clifford',
 b'\n                  \n                  \n                  Elizabeth',
 b'\n                  \n                  \n                  Mary',
 b'\n                  \n                  \n                  Elizabeth',
 b'\n                  \n                  \n                  Mary',
 b'\n                  \n                  \n                  Mary Mit
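
The byte strings above still carry the line breaks and indentation left behind by the `<interp>` children. A minimal sketch of one way to tidy such names and count them follows. `clean_text` is a hypothetical helper, and the short `<p>` fragment is invented for the demo (only the two names appear in the output above). The helper uses `itertext`, which is available in both lxml and the standard library, and the demo uses the standard library parser to stay dependency-free.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def clean_text(el):
    """Join all text inside the element and collapse runs of whitespace."""
    return ' '.join(''.join(el.itertext()).split())

# Invented fragment that reproduces the whitespace pattern seen in the output.
demo = ET.fromstring("""<p>
<persName><interp type="gender" value="female"/>
    Mary Clifford</persName> and
<persName><interp type="gender" value="female"/>
    Elizabeth Brownrigg</persName> and
<persName><interp type="gender" value="female"/>
    Mary   Clifford</persName>
</p>""")

names = [clean_text(e) for e in demo.iter('persName')]
counts = Counter(names)
print(counts.most_common())
```

Applied to the notebook's results, something like `Counter(clean_text(e) for e in els)` would give a frequency table of the names as they appear in the text.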

# Fin.