# Text Encoding and XML

## Introduction

metalanguage: language for describing languages
markup/encoding:

- explicating the interpretation of text
- punctuation as markup?

> "it is a process of making explicit what is conjectural or implicit, a process of directing the user as to how the content of the text should be (or has been) interpreted."
    
markup language: rules/conventions for encoding a text
XML markup is not just a formal representation; it aims to encode the meaning of data (at some level).

The focus here is on working with XML documents, not on the details of XML itself (for example, how to create or validate XML).

Scenario: you encounter text embedded in XML files and want to make use of the structure that XML provides.

### XML as descriptive markup

XML is descriptive markup: it categorizes chunks of text into different (semantic) units.

Document Type (Definition)

Documents belong to a type, similar to a dictionary: the type defines the constituent parts and the ways in which they can be arranged. A document has to conform to this type to be valid.

Documents can be well-formed (syntactically correct) and valid (conforming to their type).

Why encoding? Texts are made up of different units, and parts can be categorized as having different functions.

Example: an academic article has a title, abstract, headings, and paragraphs.

This structure is often crucial for meaningful engagement with a text. We will have a closer look at parliamentary data.

https://web.archive.org/web/20170504021953/http://schema.politicalmashup.nl/schemas.html


## Adding structure
### Elements

The main building blocks (textual units that structure a document) of a text are called **elements**. The schema defines how different elements relate to each other. 

Each element must be explicitly tagged with a start-tag and an end-tag.

Everything between these tags is the content of the element.

> Mr. ATTLEE: I would like at the outset to associate myself with the sympathy expressed by the hon. Member who moved this Address with regard to the cause of the absence of His Majesty from our proceedings to-day.

`Mr. ATTLEE: <p>I would like at the outset to associate myself with the sympathy expressed by the hon. Member who moved this Address with regard to the cause of the absence of His Majesty from our proceedings to-day.</p>`

Elements can be contained (embedded) within each other: a speech is not just a sequence of paragraphs.

Again, how elements can be ordered is determined by the schema.

`<proceeding><speaker>Mr. ATTLEE:</speaker><speech><p>I would like at the outset to associate myself with the sympathy expressed by the hon. Member who moved this Address with regard to the cause of the absence of His Majesty from our proceedings to-day.</p></speech>
<speaker>The PRIME MINISTER (Mr. Baldwin):</speaker><speech><p>Reference has been made by the Mover of the Address and by the right hon. Gentleman the Leader of the Opposition to the grievous loss which has befallen His Majesty. </p></speech></proceeding>`

This example does not follow the PM schema, so it is not valid against that schema, but it is well-formed:
- one element (the root) encloses the whole document
- each element is fully contained by the root element, with no partial overlaps between elements
- tags mark the start and end of an element
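
The well-formedness rules above can be checked by simply trying to parse a string. The sketch below is an illustration, not part of the original notebook: the two sample strings and the helper `is_well_formed` are made up for this purpose, and the import falls back to the standard library's `xml.etree.ElementTree` (which offers a similar API) when lxml is not installed.

```python
try:
    from lxml import etree  # the library used later in these notes
except ImportError:  # assumption: fall back to the standard library
    import xml.etree.ElementTree as etree


def is_well_formed(xml_string):
    """Return True if the string parses as XML (hypothetical helper)."""
    try:
        etree.fromstring(xml_string)
        return True
    except etree.ParseError:  # lxml's XMLSyntaxError is a subclass of ParseError
        return False


# Properly nested: <p> closes before <speech> does.
nested = "<speech><p>I would like at the outset ...</p></speech>"
# Partial overlap: <speech> closes while <p> is still open.
overlapping = "<speech><p>I would like at the outset ...</speech></p>"

print(is_well_formed(nested))
print(is_well_formed(overlapping))
```

Note that this only tests well-formedness. Whether a document is valid against a particular schema is a separate question.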

Another example, an anthology (taken from https://tei-c.org/release/doc/tei-p5-doc/en/html/SG.html#:~:text=XML%20is%20widely%20used%20for,on%20the%20World%20Wide%20Web.):

In [None]:
<anthology>
 <poem>
  <heading>The SICK ROSE</heading>
  <stanza>
   <line>O Rose thou art sick.</line>
   <line>The invisible worm,</line>
   <line>That flies in the night</line>
   <line>In the howling storm:</line>
  </stanza>
  <stanza>
   <line>Has found out thy bed</line>
   <line>Of crimson joy:</line>
   <line>And his dark secret love</line>
   <line>Does thy life destroy.</line>
  </stanza>
 </poem>
<!-- more poems go here -->
</anthology>
              

XML encodes the structure of a text and makes meaningful units explicit. This is useful for research: we can distinguish elements and treat them differently, which allows a more accurate, refined analysis of texts.
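
As a small illustration of treating elements differently, the sketch below parses the anthology example and handles headings, stanzas, and lines separately. It is a minimal sketch under assumptions: the variable names are invented here, and the import falls back to the standard library when lxml is not available.

```python
try:
    from lxml import etree
except ImportError:  # assumption: the standard library API is close enough here
    import xml.etree.ElementTree as etree

anthology = etree.fromstring("""
<anthology>
 <poem>
  <heading>The SICK ROSE</heading>
  <stanza>
   <line>O Rose thou art sick.</line>
   <line>The invisible worm,</line>
   <line>That flies in the night</line>
   <line>In the howling storm:</line>
  </stanza>
  <stanza>
   <line>Has found out thy bed</line>
   <line>Of crimson joy:</line>
   <line>And his dark secret love</line>
   <line>Does thy life destroy.</line>
  </stanza>
 </poem>
</anthology>
""".strip())

for poem in anthology.findall("poem"):
    title = poem.findtext("heading")                 # text of the heading element
    stanzas = poem.findall("stanza")                 # direct stanza children only
    lines_per_stanza = [len(s.findall("line")) for s in stanzas]
    first_lines = [s.find("line").text for s in stanzas]
    print(title, lines_per_stanza, first_lines)
```

With plain text, finding "the first line of each stanza" would require guessing from blank lines. Here the markup states it explicitly.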

## Schema

A schema governs which elements are allowed and how they can be arranged.
It is an interpretation of the structure of a source and has to balance clarity with uniformity and complexity. Schemas are a necessary simplification, and it is hard to agree on a standard schema.

A schema always involves some reduction, at a cost to the uniqueness of documents, though perhaps less so for parliamentary proceedings.

RELAX NG schema (compact syntax) for the anthology:

In [None]:
anthology_p = element anthology { poem_p+ }
poem_p = element poem { heading_p?, stanza_p+ }
stanza_p = element stanza {line_p+}
heading_p = element heading { text }
line_p = element line { text }
start = anthology_p
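
To see the schema at work, the sketch below validates two small documents with lxml's `etree.RelaxNG`. Two assumptions to flag: lxml reads RELAX NG in its XML syntax, so the compact schema above has been rewritten by hand into that form (the translation is mine, not part of the source), and both test documents are invented for illustration.

```python
import io

from lxml import etree

# Hand-written XML-syntax equivalent of the compact schema above (assumption).
RNG = """
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start><ref name="anthology_p"/></start>
  <define name="anthology_p">
    <element name="anthology"><oneOrMore><ref name="poem_p"/></oneOrMore></element>
  </define>
  <define name="poem_p">
    <element name="poem">
      <optional><ref name="heading_p"/></optional>
      <oneOrMore><ref name="stanza_p"/></oneOrMore>
    </element>
  </define>
  <define name="stanza_p">
    <element name="stanza"><oneOrMore><ref name="line_p"/></oneOrMore></element>
  </define>
  <define name="heading_p"><element name="heading"><text/></element></define>
  <define name="line_p"><element name="line"><text/></element></define>
</grammar>
""".strip()

relaxng = etree.RelaxNG(etree.parse(io.StringIO(RNG)))

# Follows the schema: optional heading, then at least one stanza with lines.
valid_doc = etree.fromstring(
    "<anthology><poem><heading>The SICK ROSE</heading>"
    "<stanza><line>O Rose thou art sick.</line></stanza></poem></anthology>"
)
# Well-formed, but the poem has no stanza, which the schema requires.
invalid_doc = etree.fromstring(
    "<anthology><poem><heading>No stanzas here</heading></poem></anthology>"
)

print(relaxng.validate(valid_doc))
print(relaxng.validate(invalid_doc))
```

Both documents are well-formed, but only the first conforms to the document type.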

https://web.archive.org/web/20190505013417/http://schema.politicalmashup.nl/proceedings.html#ProceedingsContent

## Attributes

Attributes capture information that is "descriptive of an element but not part of its content". Elements structure what is implicit in the text; attributes often enrich the document by adding information that is absent from it.

Attributes are contained in the start-tag; the end-tag carries no attributes.

<speech pm:speaker="Mr. ATTLEE" pm:party="Labour" pm:role="mp" pm:party-ref="uk.p.Lab" pm:member-ref="uk.m.16522" pm:id="uk.proc.d.1935-12-03.3.10.6" dp:time="15:38:00">
          <p pm:id="uk.proc.d.1935-12-03.3.10.6.1">I would like at the outset to associate myself with the sympathy expressed by the hon. Member who moved this Address with regard to the cause of the absence of His Majesty from our proceedings to-day. I am sure that he voiced the feeling of the whole House. I understand that we are to have an opportunity of expressing it at a later stage, and I will, therefore, say nothing more about it at the moment.</p></speech>

Identifiers

## Namespaces
Namespaces record the provenance of elements and attributes: they say which vocabulary a name comes from.


`<root xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsd="http://www.w3.org/2001/XMLSchema-datatypes" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:pmd="http://www.politicalmashup.nl/docinfo" xmlns:pm="http://www.politicalmashup.nl" xmlns:dp="http://dilipad.history.ac.uk" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:html="http://www.w3.org/1999/xhtml" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">`



`<proceedings xmlns="http://www.politicalmashup.nl" pm:id="uk.proc.d.1935-12-03">`


A namespace works like a lexicon: it refers to a set of element (and attribute) names.
A prefix signals association with a namespace.
XML documents allow you to combine elements from different namespaces.

You could create your own schema and use it to extend the other representations.
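
When such a document is parsed, the prefix is replaced by the full namespace URI, written as `{uri}name`. The sketch below shows this on a small invented fragment that reuses the `pm` and `dp` namespaces from above; the fragment and variable names are assumptions, and the import falls back to the standard library when lxml is missing.

```python
try:
    from lxml import etree
except ImportError:  # assumption: the standard library behaves the same here
    import xml.etree.ElementTree as etree

PM = "http://www.politicalmashup.nl"
DP = "http://dilipad.history.ac.uk"

doc = etree.fromstring(
    f'<root xmlns:pm="{PM}" xmlns:dp="{DP}">'
    '<pm:speech pm:speaker="Mr. ATTLEE" dp:time="15:38:00"/>'
    "</root>"
)

speech = doc[0]
print(speech.tag)                      # prefix pm: expanded to {uri}speech
print(sorted(speech.attrib))           # attribute names are expanded too
print(speech.get(f"{{{PM}}}speaker"))  # look up an attribute by its full name
```

The prefix itself (`pm`, `dp`) is only a local abbreviation; what identifies a name is the URI.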

## Working with XML in Python

This part covers the "consumption" side of XML: reading and querying existing documents.

In [8]:
!ls xml_examples/

uk.proc.d.1935-12-03.xml


Different libraries exist for this; we will use [lxml](https://lxml.de/parsing.html).

In [11]:
from lxml import etree

In [22]:
with open('xml_examples/uk.proc.d.1935-12-03.xml','rb') as in_xml:
    tree = etree.parse(in_xml)

In [23]:
#tree = etree.parse(open('xml_examples/uk.proc.d.1935-12-03.xml','rb'))

In [24]:
help(tree)

Help on _ElementTree object:

class _ElementTree(builtins.object)
 |  Methods defined here:
 |  
 |  __copy__(self)
 |  
 |  __deepcopy__(self, memo)
 |  
 |  find(self, path, namespaces)
 |      find(self, path, namespaces=None)
 |      
 |      Finds the first toplevel element with given tag.  Same as
 |      ``tree.getroot().find(path)``.
 |      
 |      The optional ``namespaces`` argument accepts a
 |      prefix-to-namespace mapping that allows the usage of XPath
 |      prefixes in the path expression.
 |  
 |  findall(self, path, namespaces)
 |      findall(self, path, namespaces=None)
 |      
 |      Finds all elements matching the ElementPath expression.  Same as
 |      getroot().findall(path).
 |      
 |      The optional ``namespaces`` argument accepts a
 |      prefix-to-namespace mapping that allows the usage of XPath
 |      prefixes in the path expression.
 |  
 |  findtext(self, path, default, namespaces)
 |      findtext(self, path, default=None, namespaces=None)


In [26]:
print(tree)

<lxml.etree._ElementTree object at 0x7faf0c49dc40>


### Access by root element

An XML document is a tree of elements.

In [44]:
root = tree.getroot()

In [46]:
len(root)

3

In [155]:
type(root)

lxml.etree._Element

In [156]:
help(root)

Help on _Element object:

class _Element(builtins.object)
 |  Element class.
 |  
 |  References a document object and a libxml node.
 |  
 |  By pointing to a Document instance, a reference is kept to
 |  _Document as long as there is some pointer to a node in it.
 |  
 |  Methods defined here:
 |  
 |  __bool__(self, /)
 |      True if self else False
 |  
 |  __contains__(...)
 |      __contains__(self, element)
 |  
 |  __copy__(self)
 |      __copy__(self)
 |  
 |  __deepcopy__(self, memo)
 |      __deepcopy__(self, memo)
 |  
 |  __delitem__(...)
 |      __delitem__(self, x)
 |      
 |      Deletes the given subelement or a slice.
 |  
 |  __getitem__(...)
 |      Returns the subelement at the given position or the requested
 |      slice.
 |  
 |  __iter__(...)
 |      __iter__(self)
 |  
 |  __len__(...)
 |      __len__(self)
 |      
 |      Returns the number of subelements.
 |  
 |  __repr__(...)
 |      __repr__(self)
 |  
 |  __reversed__(self)
 |      __reversed__(self)


### Elements behave like lists

In [138]:
for element in root:
    print(element.tag)

{http://www.politicalmashup.nl/docinfo}docinfo
meta
{http://www.politicalmashup.nl}proceedings


In [143]:
for scene in root[2][2]:
    print(len(scene))

3
2
1
1
1
1
3
1
25
74
2
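
The same list operations can be tried on a tiny tree where the structure is easy to see. This is a sketch on an invented fragment (element names borrowed from the proceedings, content made up), with a fallback import in case lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # assumption: list-style access works the same way
    import xml.etree.ElementTree as etree

doc = etree.fromstring(
    "<proceedings>"
    "<scene><speech/><speech/></scene>"
    "<scene><speech/></scene>"
    "</proceedings>"
)

n_scenes = len(doc)                                  # number of direct children
speeches_per_scene = [len(scene) for scene in doc]   # iterate like a list
last_scene = doc[-1]                                 # negative indices work
first_two = doc[0][:2]                               # slicing returns a list

print(n_scenes, speeches_per_scene, last_scene.tag, len(first_two))
```

Index chains such as `root[2][2][9][15]` are quick for exploring, but they break as soon as the document layout changes; searching by tag name is more robust.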


In [150]:
root[2][2][9][15].tag

'{http://www.politicalmashup.nl}speech'

In [145]:
root[2][2][9][15][0].text

'I do not believe that to be true for one moment, and I do not think that it is in his Department that expansion is contemplated. It is the heads of the Admiralty and of the Air service that we want to have on the Floor of the House. We shall indeed have to make an effective contribution to collective security commensurate with our resources and responsibilities on the sea and in the air, but none of us contemplates for one moment that the Government will ask this House to expand the Army again into a force which we could send abroad to fight in foreign countries. Therefore, I hope that the First Lord of the Admiralty and the Secretary of State for Air will be here to explain the proposals which the Government contemplate putting before the House.'
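
Note that `.text` only returns the text that sits directly inside an element, before its first child. A minimal sketch of the difference, on an invented two-paragraph speech (names and content are assumptions; the import falls back to the standard library if lxml is missing):

```python
try:
    from lxml import etree
except ImportError:  # assumption: same behaviour for these calls
    import xml.etree.ElementTree as etree

speech = etree.fromstring(
    "<speech>"
    "<p>I do not believe that to be true.</p>"
    "<p>It is very difficult indeed.</p>"
    "</speech>"
)

own_text = speech.text               # None: the speech has no text of its own
first_paragraph = speech[0].text     # text of the first <p> only
all_text = "".join(speech.itertext())                          # every text node
serialized = etree.tostring(speech, method="text", encoding="unicode")

print(own_text)
print(first_paragraph)
print(all_text == serialized)
```

So to get the full text of a speech, either join the paragraphs yourself or serialize the element as text.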

In [147]:
etree.tostring(root[2][2][9][15],method='text',encoding='unicode')

"\n          I do not believe that to be true for one moment, and I do not think that it is in his Department that expansion is contemplated. It is the heads of the Admiralty and of the Air service that we want to have on the Floor of the House. We shall indeed have to make an effective contribution to collective security commensurate with our resources and responsibilities on the sea and in the air, but none of us contemplates for one moment that the Government will ask this House to expand the Army again into a force which we could send abroad to fight in foreign countries. Therefore, I hope that the First Lord of the Admiralty and the Secretary of State for Air will be here to explain the proposals which the Government contemplate putting before the House.\n          It is very difficult indeed for us to tell on what scale the Government are intending to ask us to re-arm, and in the whole of the Government's electoral strategy there was nothing more remarkable than the reliance upon

### Iterate over the tree

In [134]:
for element in tree.iter():
    print(element)

<Element root at 0x7faf0c470080>
<Element {http://www.politicalmashup.nl/docinfo}docinfo at 0x7faf0c5e4ec0>
<Element {http://www.politicalmashup.nl/docinfo}transformer at 0x7faf0c5e4f00>
<Element {http://www.politicalmashup.nl/docinfo}comment at 0x7faf0c5afcc0>
<Element {http://www.politicalmashup.nl/docinfo}validation at 0x7faf0c5af980>
<Element meta at 0x7faf0c5afa40>
<Element {http://purl.org/dc/elements/1.1/}identifier at 0x7faf0c5af780>
<Element {http://purl.org/dc/elements/1.1/}format at 0x7faf0c5af8c0>
<Element {http://purl.org/dc/elements/1.1/}type at 0x7faf0c5af740>
<Element {http://purl.org/dc/elements/1.1/}contributor at 0x7faf0c5afbc0>
<Element {http://purl.org/dc/elements/1.1/}coverage at 0x7faf0c5aff80>
<Element country at 0x7faf0c5af6c0>
<Element {http://purl.org/dc/elements/1.1/}creator at 0x7faf0c5af700>
<Element {http://purl.org/dc/elements/1.1/}language at 0x7faf0c5af640>
<Element {http://www.politicalmashup.nl}language at 0x7faf0c5af5c0>
<Element {http://purl.org/dc

In [148]:
elements = [e for e in tree.iter()]

In [149]:
elements[101]

<Element {http://www.politicalmashup.nl}p at 0x7faf0c57b9c0>

In [165]:
for element in tree.iter("{http://www.politicalmashup.nl}speech"):
    print(element)

<Element {http://www.politicalmashup.nl}speech at 0x7faf0c57b800>
<Element {http://www.politicalmashup.nl}speech at 0x7faf0c5b0c00>
<Element {http://www.politicalmashup.nl}speech at 0x7faf0c5b08c0>
<Element {http://www.politicalmashup.nl}speech at 0x7faf0c5b01c0>
<Element {http://www.politicalmashup.nl}speech at 0x7faf0c5817c0>
<Element {http://www.politicalmashup.nl}speech at 0x7faf0c581e80>
<Element {http://www.politicalmashup.nl}speech at 0x7faf0c5ab940>
<Element {http://www.politicalmashup.nl}speech at 0x7faf0c5abc80>
<Element {http://www.politicalmashup.nl}speech at 0x7faf0c5ab980>
<Element {http://www.politicalmashup.nl}speech at 0x7faf0c5a4780>
<Element {http://www.politicalmashup.nl}speech at 0x7faf0c5a45c0>
<Element {http://www.politicalmashup.nl}speech at 0x7faf0c5a4300>
<Element {http://www.politicalmashup.nl}speech at 0x7faf0c5a4180>
<Element {http://www.politicalmashup.nl}speech at 0x7faf0c5f9400>
<Element {http://www.politicalmashup.nl}speech at 0x7faf0c5f9780>
<Element {

### Elements carry attributes as dictionaries

In [153]:
# index from root so the cell does not depend on a leftover loop variable
root[2][2][9][15].attrib

{'{http://www.politicalmashup.nl}speaker': 'Sir A. SINCLAIR', '{http://www.politicalmashup.nl}party': 'Liberal', '{http://www.politicalmashup.nl}role': 'mp', '{http://www.politicalmashup.nl}member-ref': 'uk.m.21027', '{http://www.politicalmashup.nl}id': 'uk.proc.d.1935-12-03.3.10.16', '{http://www.politicalmashup.nl}party-ref': 'uk.p.Lib'}

In [161]:
speech = root[2][2][9][15]

In [164]:
root.nsmap

{'dc': 'http://purl.org/dc/elements/1.1/',
 'xsd': 'http://www.w3.org/2001/XMLSchema-datatypes',
 'owl': 'http://www.w3.org/2002/07/owl#',
 'pmd': 'http://www.politicalmashup.nl/docinfo',
 'pm': 'http://www.politicalmashup.nl',
 'dp': 'http://dilipad.history.ac.uk',
 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
 'html': 'http://www.w3.org/1999/xhtml',
 'dcterms': 'http://purl.org/dc/terms/',
 'xsi': 'http://www.w3.org/2001/XMLSchema-instance'}

In [163]:
speech.attrib["{%s}speaker" % root.nsmap['pm']]

'Sir A. SINCLAIR'
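
Because attributes behave like a dictionary, they combine well with iteration, for example to count speeches per party. The sketch below uses an invented mini-document; the helper `pm()` and the sample speakers are assumptions for illustration, and the import falls back to the standard library when lxml is not installed.

```python
from collections import Counter

try:
    from lxml import etree
except ImportError:  # assumption: attrib/get/iter behave the same
    import xml.etree.ElementTree as etree

PM = "http://www.politicalmashup.nl"

doc = etree.fromstring(
    f'<pm:proceedings xmlns:pm="{PM}">'
    '<pm:speech pm:speaker="Mr. ATTLEE" pm:party="Labour"/>'
    '<pm:speech pm:speaker="Sir A. SINCLAIR" pm:party="Liberal"/>'
    '<pm:speech pm:speaker="Mr. ATTLEE" pm:party="Labour"/>'
    '<pm:speech pm:speaker="Mr. SPEAKER"/>'
    "</pm:proceedings>"
)


def pm(name):
    """Build the full {namespace}name form (hypothetical helper)."""
    return f"{{{PM}}}{name}"


# .get() takes a default, so a missing attribute does not raise KeyError.
party_counts = Counter(
    speech.get(pm("party"), "unknown") for speech in doc.iter(pm("speech"))
)
print(party_counts)
```

Using `.get()` with a default is safer than `attrib[...]` here, since not every speech needs to carry every attribute.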

## Finding Needles and Mining Haystacks with XPath

In [170]:
nsmap = root.nsmap; nsmap

{'dc': 'http://purl.org/dc/elements/1.1/',
 'xsd': 'http://www.w3.org/2001/XMLSchema-datatypes',
 'owl': 'http://www.w3.org/2002/07/owl#',
 'pmd': 'http://www.politicalmashup.nl/docinfo',
 'pm': 'http://www.politicalmashup.nl',
 'dp': 'http://dilipad.history.ac.uk',
 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
 'html': 'http://www.w3.org/1999/xhtml',
 'dcterms': 'http://purl.org/dc/terms/',
 'xsi': 'http://www.w3.org/2001/XMLSchema-instance'}

In [179]:
help(root.find)

Help on method find in module lxml.etree:

find(path, namespaces) method of lxml.etree._Element instance
    find(self, path, namespaces=None)
    
    Finds the first matching subelement, by tag name or path.
    
    The optional ``namespaces`` argument accepts a
    prefix-to-namespace mapping that allows the usage of XPath
    prefixes in the path expression.



In [193]:
%time speeches = root[2].find('.//pm:speech',namespaces=nsmap); speeches

CPU times: user 64 µs, sys: 1 µs, total: 65 µs
Wall time: 68.2 µs


<Element {http://www.politicalmashup.nl}speech at 0x7faf0c57b800>

In [198]:
%time speeches = root[2].findall('.//pm:speech',namespaces=nsmap); len(speeches)

CPU times: user 89 µs, sys: 0 ns, total: 89 µs
Wall time: 92 µs


57
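
A short sketch of how `find`, `findall`, and `findtext` differ in what they return, again on an invented fragment. The element `pm:vote` is made up to show a search without matches, and the import falls back to the standard library if lxml is unavailable.

```python
try:
    from lxml import etree
except ImportError:  # assumption: same find/findall/findtext signatures
    import xml.etree.ElementTree as etree

PM = "http://www.politicalmashup.nl"
ns = {"pm": PM}  # prefix-to-namespace mapping used in the path expressions

doc = etree.fromstring(
    f'<pm:proceedings xmlns:pm="{PM}"><pm:scene>'
    "<pm:speech><pm:p>First speech.</pm:p></pm:speech>"
    "<pm:speech><pm:p>Second speech.</pm:p></pm:speech>"
    "</pm:scene></pm:proceedings>"
)

first = doc.find(".//pm:speech", namespaces=ns)          # first match only
all_speeches = doc.findall(".//pm:speech", namespaces=ns)  # list of all matches
missing = doc.find(".//pm:vote", namespaces=ns)          # no match: None
first_text = doc.findtext(".//pm:speech/pm:p", namespaces=ns)

print(first.tag, len(all_speeches), missing, first_text)
```

Since `find` returns `None` when nothing matches, it is worth checking the result before accessing `.text` or `.attrib` on it.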

In [199]:
%time speeches = root[2].xpath('.//pm:speech',namespaces=nsmap); len(speeches)

CPU times: user 545 µs, sys: 297 µs, total: 842 µs
Wall time: 628 µs


57

In [201]:
%time speeches = root[2].xpath('.//pm:speech[@pm:role]',namespaces=nsmap); len(speeches)

CPU times: user 557 µs, sys: 122 µs, total: 679 µs
Wall time: 521 µs


57

In [203]:
%time speeches = root[2].xpath('.//pm:speech[@pm:role="chair"]',namespaces=nsmap); len(speeches)

CPU times: user 911 µs, sys: 818 µs, total: 1.73 ms
Wall time: 853 µs


2

In [205]:
%time speeches = root[2].xpath('.//pm:speech[@pm:party-ref="uk.p.Con"]',namespaces=nsmap); len(speeches)

CPU times: user 557 µs, sys: 160 µs, total: 717 µs
Wall time: 438 µs


21

In [224]:
%time scenes = root.xpath('.//pm:scene[@pm:title="DEBATE ON THE ADDRESS."]',namespaces=nsmap); len(scenes)

CPU times: user 346 µs, sys: 100 µs, total: 446 µs
Wall time: 353 µs


1

In [239]:
%time speeches = root.xpath('.//pm:scene[@pm:title="DEBATE ON THE ADDRESS."]//pm:speech[@pm:party-ref="uk.p.Lib"]',namespaces=nsmap); len(speeches)

CPU times: user 600 µs, sys: 119 µs, total: 719 µs
Wall time: 458 µs


5

In [240]:
len(speeches[0])

9

In [251]:
mentions = root.xpath(".//pm:speech/pm:p[contains(text(), 'law')]",namespaces=nsmap)
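
One caveat with `contains(text(), ...)`: in XPath 1.0 a node-set passed to a string function is reduced to its first node, so only the first text node of each paragraph is tested. The sketch below contrasts this with `contains(., ...)`, which tests the paragraph's full text. The sample speech is invented, and the example needs lxml because the standard library does not offer `xpath()`.

```python
from lxml import etree

PM = "http://www.politicalmashup.nl"

speech = etree.fromstring(
    f'<speech xmlns="{PM}">'
    "<p>The law is clear.</p>"
    "<p>The <i>rule</i> of law matters.</p>"
    "<p>Nothing relevant here.</p>"
    "</speech>"
)

# XPath needs a prefix for every namespace, so the default namespace
# (stored under the key None in nsmap) is given the prefix "pm" here.
ns = {"pm": speech.nsmap[None]}

# Tests only the first text node of each <p> ("The " in the second one).
first_text_node = speech.xpath(".//pm:p[contains(text(), 'law')]", namespaces=ns)
# Tests the whole text content of each <p>, including nested elements.
whole_paragraph = speech.xpath(".//pm:p[contains(., 'law')]", namespaces=ns)

print(len(first_text_node), len(whole_paragraph))
```

Also keep in mind that `contains` is a plain substring test: it would match "lawyer" or "unlawful" as well.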

In [253]:
mentions[0].getparent()

<Element {http://www.politicalmashup.nl}speech at 0x7faf0c5b0c00>

## Exercise: working with Old Bailey

In [255]:
!wget -O oldbailey.zip https://figshare.shef.ac.uk/ndownloader/files/7979134

--2022-06-28 20:28:47--  https://figshare.shef.ac.uk/ndownloader/files/7979134
Resolving figshare.shef.ac.uk (figshare.shef.ac.uk)... 2a05:d018:1f4:d003:1aba:1dd8:169d:eaeb, 2a05:d018:1f4:d000:bceb:51bb:4bb:1d0c, 52.213.244.144, ...
Connecting to figshare.shef.ac.uk (figshare.shef.ac.uk)|2a05:d018:1f4:d003:1aba:1dd8:169d:eaeb|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://s3-eu-west-1.amazonaws.com/pstorage-sheffield-5641355/7979134/OBO_XML_72.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIXQ5KBYLXRRFOHUQ/20220628/eu-west-1/s3/aws4_request&X-Amz-Date=20220628T192847Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=b4958c9b6961b219d09294f469310bfa768ee1b3ba04c16813cf2d45d9bd70fe [following]
--2022-06-28 20:28:47--  https://s3-eu-west-1.amazonaws.com/pstorage-sheffield-5641355/7979134/OBO_XML_72.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIXQ5KBYLXRRFOHUQ/20220628/eu-west-1/s3/aws4_request&X-Amz-Date=20220628T192

In [257]:
!mkdir oldbailey_data

mkdir: oldbailey_data: File exists


In [259]:
!unzip oldbailey.zip -d oldbailey_data/ 

Archive:  oldbailey.zip
  inflating: oldbailey_data/licence.txt  
  inflating: oldbailey_data/listOA.txt  
  inflating: oldbailey_data/listOBP.txt  
  inflating: oldbailey_data/OB_xml_notes.docx  
   creating: oldbailey_data/ordinarysAccounts/
  inflating: oldbailey_data/ordinarysAccounts/.DS_Store  
  inflating: oldbailey_data/ordinarysAccounts/OA16760517.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA16760705.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA16760830.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA16761025.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA16770316.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA16770504.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA16771017.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA16771219.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA16780123.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA16780306.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA16780417.xml  
  inf

  inflating: oldbailey_data/ordinarysAccounts/OA17131024.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17131223.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17140129.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17140310.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17140421.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17140528.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17140716.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17140922.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17141027.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17141222.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17150202.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17150511.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17150520.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17150622.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17150803.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA1715092

  inflating: oldbailey_data/ordinarysAccounts/OA17390803.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17391221.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17400213.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17400507.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17400806.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17401124.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17410318.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17410504.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17410612.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17410731.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17410914.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17410916.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17420113.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17420407.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17420507.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA1742071

  inflating: oldbailey_data/ordinarysAccounts/OA17700419.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17700604.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17720527.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17720708.xml  
  inflating: oldbailey_data/ordinarysAccounts/OA17721014.xml  
   creating: oldbailey_data/sessionsPapers/
  inflating: oldbailey_data/sessionsPapers/.DS_Store  
  inflating: oldbailey_data/sessionsPapers/16740429.xml  
  inflating: oldbailey_data/sessionsPapers/16740717.xml  
  inflating: oldbailey_data/sessionsPapers/16740909.xml  
  inflating: oldbailey_data/sessionsPapers/16741014.xml  
  inflating: oldbailey_data/sessionsPapers/16741212.xml  
  inflating: oldbailey_data/sessionsPapers/16750115.xml  
  inflating: oldbailey_data/sessionsPapers/16750219.xml  
  inflating: oldbailey_data/sessionsPapers/16750414.xml  
  inflating: oldbailey_data/sessionsPapers/16750707.xml  
  inflating: oldbailey_data/sessionsPapers/16750909.xml  
  infl

  inflating: oldbailey_data/sessionsPapers/16921207.xml  
  inflating: oldbailey_data/sessionsPapers/16930116.xml  
  inflating: oldbailey_data/sessionsPapers/16930426.xml  
  inflating: oldbailey_data/sessionsPapers/16930531.xml  
  inflating: oldbailey_data/sessionsPapers/16930713.xml  
  inflating: oldbailey_data/sessionsPapers/16930906.xml  
  inflating: oldbailey_data/sessionsPapers/16931012.xml  
  inflating: oldbailey_data/sessionsPapers/16931206.xml  
  inflating: oldbailey_data/sessionsPapers/16940221.xml  
  inflating: oldbailey_data/sessionsPapers/16940418.xml  
  inflating: oldbailey_data/sessionsPapers/16940524.xml  
  inflating: oldbailey_data/sessionsPapers/16940711.xml  
  inflating: oldbailey_data/sessionsPapers/16940830.xml  
  inflating: oldbailey_data/sessionsPapers/16941010.xml  
  inflating: oldbailey_data/sessionsPapers/16950220.xml  
  inflating: oldbailey_data/sessionsPapers/16950403.xml  
  inflating: oldbailey_data/sessionsPapers/16950508.xml  
  inflating: o

  inflating: oldbailey_data/sessionsPapers/17241014.xml  
  inflating: oldbailey_data/sessionsPapers/17241204.xml  
  inflating: oldbailey_data/sessionsPapers/17250115.xml  
  inflating: oldbailey_data/sessionsPapers/17250224.xml  
  inflating: oldbailey_data/sessionsPapers/17250407.xml  
  inflating: oldbailey_data/sessionsPapers/17250513.xml  
  inflating: oldbailey_data/sessionsPapers/17250630.xml  
  inflating: oldbailey_data/sessionsPapers/17250827.xml  
  inflating: oldbailey_data/sessionsPapers/17251013.xml  
  inflating: oldbailey_data/sessionsPapers/17251208.xml  
  inflating: oldbailey_data/sessionsPapers/17260114.xml  
  inflating: oldbailey_data/sessionsPapers/17260302.xml  
  inflating: oldbailey_data/sessionsPapers/17260420.xml  
  inflating: oldbailey_data/sessionsPapers/17260425.xml  
  inflating: oldbailey_data/sessionsPapers/17260711.xml  
  inflating: oldbailey_data/sessionsPapers/17260831.xml  
  inflating: oldbailey_data/sessionsPapers/17261012.xml  
  inflating: o

  inflating: oldbailey_data/sessionsPapers/17420714.xml  
  inflating: oldbailey_data/sessionsPapers/17420715.xml  
  inflating: oldbailey_data/sessionsPapers/17420909.xml  
  inflating: oldbailey_data/sessionsPapers/17421013.xml  
  inflating: oldbailey_data/sessionsPapers/17421208.xml  
  inflating: oldbailey_data/sessionsPapers/17430114.xml  
  inflating: oldbailey_data/sessionsPapers/17430223.xml  
  inflating: oldbailey_data/sessionsPapers/17430413.xml  
  inflating: oldbailey_data/sessionsPapers/17430519.xml  
  inflating: oldbailey_data/sessionsPapers/17430629.xml  
  inflating: oldbailey_data/sessionsPapers/17430907.xml  
  inflating: oldbailey_data/sessionsPapers/17431012.xml  
  inflating: oldbailey_data/sessionsPapers/17431207.xml  
  inflating: oldbailey_data/sessionsPapers/17440113.xml  
  inflating: oldbailey_data/sessionsPapers/17440223.xml  
  inflating: oldbailey_data/sessionsPapers/17440404.xml  
  inflating: oldbailey_data/sessionsPapers/17440510.xml  
  inflating: o

  inflating: oldbailey_data/sessionsPapers/17610225.xml  
  inflating: oldbailey_data/sessionsPapers/17610401.xml  
  inflating: oldbailey_data/sessionsPapers/17610506.xml  
  inflating: oldbailey_data/sessionsPapers/17610625.xml  
  inflating: oldbailey_data/sessionsPapers/17610916.xml  
  inflating: oldbailey_data/sessionsPapers/17611021.xml  
  inflating: oldbailey_data/sessionsPapers/17611209.xml  
  inflating: oldbailey_data/sessionsPapers/17620114.xml  
  inflating: oldbailey_data/sessionsPapers/17620224.xml  
  inflating: oldbailey_data/sessionsPapers/17620421.xml  
  inflating: oldbailey_data/sessionsPapers/17620526.xml  
  inflating: oldbailey_data/sessionsPapers/17620714.xml  
  inflating: oldbailey_data/sessionsPapers/17620917.xml  
  inflating: oldbailey_data/sessionsPapers/17621020.xml  
  inflating: oldbailey_data/sessionsPapers/17621208.xml  
  inflating: oldbailey_data/sessionsPapers/17630114.xml  
  inflating: oldbailey_data/sessionsPapers/17630223.xml  
  inflating: o

  inflating: oldbailey_data/sessionsPapers/17791020.xml  
  inflating: oldbailey_data/sessionsPapers/17791208.xml  
  inflating: oldbailey_data/sessionsPapers/17800112.xml  
  inflating: oldbailey_data/sessionsPapers/17800223.xml  
  inflating: oldbailey_data/sessionsPapers/17800405.xml  
  inflating: oldbailey_data/sessionsPapers/17800510.xml  
  inflating: oldbailey_data/sessionsPapers/17800628.xml  
  inflating: oldbailey_data/sessionsPapers/17800913.xml  
  inflating: oldbailey_data/sessionsPapers/17801018.xml  
  inflating: oldbailey_data/sessionsPapers/17801206.xml  
  inflating: oldbailey_data/sessionsPapers/17810110.xml  
  inflating: oldbailey_data/sessionsPapers/17810222.xml  
  inflating: oldbailey_data/sessionsPapers/17810425.xml  
  inflating: oldbailey_data/sessionsPapers/17810530.xml  
  inflating: oldbailey_data/sessionsPapers/17810711.xml  
  inflating: oldbailey_data/sessionsPapers/17810912.xml  
  inflating: oldbailey_data/sessionsPapers/17811017.xml  
  inflating: o

  inflating: oldbailey_data/sessionsPapers/17951028.xml  
  inflating: oldbailey_data/sessionsPapers/17951202.xml  
  inflating: oldbailey_data/sessionsPapers/17960113.xml  
  inflating: oldbailey_data/sessionsPapers/17960217.xml  
  inflating: oldbailey_data/sessionsPapers/17960406.xml  
  inflating: oldbailey_data/sessionsPapers/17960511.xml  
  inflating: oldbailey_data/sessionsPapers/17960622.xml  
  inflating: oldbailey_data/sessionsPapers/17960914.xml  
  inflating: oldbailey_data/sessionsPapers/17961026.xml  
  inflating: oldbailey_data/sessionsPapers/17961130.xml  
  inflating: oldbailey_data/sessionsPapers/17970111.xml  
  inflating: oldbailey_data/sessionsPapers/17970215.xml  
  inflating: oldbailey_data/sessionsPapers/17970426.xml  
  inflating: oldbailey_data/sessionsPapers/17970531.xml  
  inflating: oldbailey_data/sessionsPapers/17970712.xml  
  inflating: oldbailey_data/sessionsPapers/17970920.xml  
  inflating: oldbailey_data/sessionsPapers/17971025.xml  
  inflating: o

  inflating: oldbailey_data/sessionsPapers/18130714.xml  
  inflating: oldbailey_data/sessionsPapers/18130915.xml  
  inflating: oldbailey_data/sessionsPapers/18131027.xml  
  inflating: oldbailey_data/sessionsPapers/18131201.xml  
  inflating: oldbailey_data/sessionsPapers/18140112.xml  
  inflating: oldbailey_data/sessionsPapers/18140216.xml  
  inflating: oldbailey_data/sessionsPapers/18140420.xml  
  inflating: oldbailey_data/sessionsPapers/18140525.xml  
  inflating: oldbailey_data/sessionsPapers/18140706.xml  
  inflating: oldbailey_data/sessionsPapers/18140914.xml  
  inflating: oldbailey_data/sessionsPapers/18141026.xml  
  inflating: oldbailey_data/sessionsPapers/18141130.xml  
  inflating: oldbailey_data/sessionsPapers/18150111.xml  
  inflating: oldbailey_data/sessionsPapers/18150215.xml  
  inflating: oldbailey_data/sessionsPapers/18150405.xml  
  inflating: oldbailey_data/sessionsPapers/18150510.xml  
  inflating: oldbailey_data/sessionsPapers/18150621.xml  
  inflating: o

  inflating: oldbailey_data/sessionsPapers/18310512.xml  
  inflating: oldbailey_data/sessionsPapers/18310630.xml  
  inflating: oldbailey_data/sessionsPapers/18310908.xml  
  inflating: oldbailey_data/sessionsPapers/18311020.xml  
  inflating: oldbailey_data/sessionsPapers/18311201.xml  
  inflating: oldbailey_data/sessionsPapers/18320105.xml  
  inflating: oldbailey_data/sessionsPapers/18320216.xml  
  inflating: oldbailey_data/sessionsPapers/18320405.xml  
  inflating: oldbailey_data/sessionsPapers/18320517.xml  
  inflating: oldbailey_data/sessionsPapers/18320705.xml  
  inflating: oldbailey_data/sessionsPapers/18320906.xml  
  inflating: oldbailey_data/sessionsPapers/18321018.xml  
  inflating: oldbailey_data/sessionsPapers/18321129.xml  
  inflating: oldbailey_data/sessionsPapers/18330103.xml  
  inflating: oldbailey_data/sessionsPapers/18330214.xml  
  inflating: oldbailey_data/sessionsPapers/18330411.xml  
  inflating: oldbailey_data/sessionsPapers/18330516.xml  
  inflating: o

  inflating: oldbailey_data/sessionsPapers/18440506.xml  
  inflating: oldbailey_data/sessionsPapers/18440610.xml  
  inflating: oldbailey_data/sessionsPapers/18440701.xml  
  inflating: oldbailey_data/sessionsPapers/18440819.xml  
  inflating: oldbailey_data/sessionsPapers/18440916.xml  
  inflating: oldbailey_data/sessionsPapers/18441021.xml  
  inflating: oldbailey_data/sessionsPapers/18441125.xml  
  inflating: oldbailey_data/sessionsPapers/18441216.xml  
  inflating: oldbailey_data/sessionsPapers/18450106.xml  
  inflating: oldbailey_data/sessionsPapers/18450203.xml  
  inflating: oldbailey_data/sessionsPapers/18450303.xml  
  inflating: oldbailey_data/sessionsPapers/18450407.xml  
  inflating: oldbailey_data/sessionsPapers/18450512.xml  
  inflating: oldbailey_data/sessionsPapers/18450616.xml  
  inflating: oldbailey_data/sessionsPapers/18450707.xml  
  inflating: oldbailey_data/sessionsPapers/18450818.xml  
  inflating: oldbailey_data/sessionsPapers/18450915.xml  
  inflating: o

  inflating: oldbailey_data/sessionsPapers/18560514.xml  
  inflating: oldbailey_data/sessionsPapers/18560616.xml  
  inflating: oldbailey_data/sessionsPapers/18560707.xml  
  inflating: oldbailey_data/sessionsPapers/18560818.xml  
  inflating: oldbailey_data/sessionsPapers/18560915.xml  
  inflating: oldbailey_data/sessionsPapers/18561027.xml  
  inflating: oldbailey_data/sessionsPapers/18561124.xml  
  inflating: oldbailey_data/sessionsPapers/18561215.xml  
  inflating: oldbailey_data/sessionsPapers/18570105.xml  
  inflating: oldbailey_data/sessionsPapers/18570202.xml  
  inflating: oldbailey_data/sessionsPapers/18570302.xml  
  inflating: oldbailey_data/sessionsPapers/18570406.xml  
  inflating: oldbailey_data/sessionsPapers/18570511.xml  
  inflating: oldbailey_data/sessionsPapers/18570615.xml  
  inflating: oldbailey_data/sessionsPapers/18570706.xml  
  inflating: oldbailey_data/sessionsPapers/18570817.xml  
  inflating: oldbailey_data/sessionsPapers/18570914.xml  
  inflating: o

  inflating: oldbailey_data/sessionsPapers/18680706.xml  
  inflating: oldbailey_data/sessionsPapers/18680817.xml  
  inflating: oldbailey_data/sessionsPapers/18680921.xml  
  inflating: oldbailey_data/sessionsPapers/18681026.xml  
  inflating: oldbailey_data/sessionsPapers/18681123.xml  
  inflating: oldbailey_data/sessionsPapers/18681214.xml  
  inflating: oldbailey_data/sessionsPapers/18690111.xml  
  inflating: oldbailey_data/sessionsPapers/18690201.xml  
  inflating: oldbailey_data/sessionsPapers/18690301.xml  
  inflating: oldbailey_data/sessionsPapers/18690405.xml  
  inflating: oldbailey_data/sessionsPapers/18690503.xml  
  inflating: oldbailey_data/sessionsPapers/18690607.xml  
  inflating: oldbailey_data/sessionsPapers/18690712.xml  
  inflating: oldbailey_data/sessionsPapers/18690816.xml  
  inflating: oldbailey_data/sessionsPapers/18690920.xml  
  inflating: oldbailey_data/sessionsPapers/18691025.xml  
  inflating: oldbailey_data/sessionsPapers/18691122.xml  
  inflating: o

  inflating: oldbailey_data/sessionsPapers/18800426.xml  
  inflating: oldbailey_data/sessionsPapers/18800524.xml  
  inflating: oldbailey_data/sessionsPapers/18800628.xml  
  inflating: oldbailey_data/sessionsPapers/18800803.xml  
  inflating: oldbailey_data/sessionsPapers/18800913.xml  
  inflating: oldbailey_data/sessionsPapers/18801018.xml  
  inflating: oldbailey_data/sessionsPapers/18801123.xml  
  inflating: oldbailey_data/sessionsPapers/18801213.xml  
  inflating: oldbailey_data/sessionsPapers/18810110.xml  
  inflating: oldbailey_data/sessionsPapers/18810131.xml  
  inflating: oldbailey_data/sessionsPapers/18810228.xml  
  inflating: oldbailey_data/sessionsPapers/18810328.xml  
  inflating: oldbailey_data/sessionsPapers/18810502.xml  
  inflating: oldbailey_data/sessionsPapers/18810523.xml  
  inflating: oldbailey_data/sessionsPapers/18810627.xml  
  inflating: oldbailey_data/sessionsPapers/18810802.xml  
  inflating: oldbailey_data/sessionsPapers/18810912.xml  
  inflating: o

  inflating: oldbailey_data/sessionsPapers/18920502.xml  
  inflating: oldbailey_data/sessionsPapers/18920523.xml  
  inflating: oldbailey_data/sessionsPapers/18920627.xml  
  inflating: oldbailey_data/sessionsPapers/18920725.xml  
  inflating: oldbailey_data/sessionsPapers/18920912.xml  
  inflating: oldbailey_data/sessionsPapers/18921017.xml  
  inflating: oldbailey_data/sessionsPapers/18921114.xml  
  inflating: oldbailey_data/sessionsPapers/18921212.xml  
  inflating: oldbailey_data/sessionsPapers/18930109.xml  
  inflating: oldbailey_data/sessionsPapers/18930206.xml  
  inflating: oldbailey_data/sessionsPapers/18930306.xml  
  inflating: oldbailey_data/sessionsPapers/18930410.xml  
  inflating: oldbailey_data/sessionsPapers/18930501.xml  
  inflating: oldbailey_data/sessionsPapers/18930529.xml  
  inflating: oldbailey_data/sessionsPapers/18930626.xml  
  inflating: oldbailey_data/sessionsPapers/18930724.xml  
  inflating: oldbailey_data/sessionsPapers/18930911.xml  
  inflating: o

  inflating: oldbailey_data/sessionsPapers/19040913.xml  
  inflating: oldbailey_data/sessionsPapers/19041017.xml  
  inflating: oldbailey_data/sessionsPapers/19041114.xml  
  inflating: oldbailey_data/sessionsPapers/19041212.xml  
  inflating: oldbailey_data/sessionsPapers/19050109.xml  
  inflating: oldbailey_data/sessionsPapers/19050206.xml  
  inflating: oldbailey_data/sessionsPapers/19050306.xml  
  inflating: oldbailey_data/sessionsPapers/19050403.xml  
  inflating: oldbailey_data/sessionsPapers/19050502.xml  
  inflating: oldbailey_data/sessionsPapers/19050529.xml  
  inflating: oldbailey_data/sessionsPapers/19050626.xml  
  inflating: oldbailey_data/sessionsPapers/19050724.xml  
  inflating: oldbailey_data/sessionsPapers/19050912.xml  
  inflating: oldbailey_data/sessionsPapers/19051016.xml  
  inflating: oldbailey_data/sessionsPapers/19051113.xml  
  inflating: oldbailey_data/sessionsPapers/19051211.xml  
  inflating: oldbailey_data/sessionsPapers/19060108.xml  
  inflating: o

In [262]:
!ls oldbailey_data/ordinarysAccounts

OA16760517.xml  OA16971222.xml  OA17150803.xml  OA17290822.xml  OA17480318.xml
OA16760705.xml  OA16980126.xml  OA17150921.xml  OA17291121.xml  OA17480511.xml
OA16760830.xml  OA16980309.xml  OA17151028.xml  OA17291222.xml  OA17480622.xml
OA16761025.xml  OA16980622.xml  OA17151102.xml  OA17300417.xml  OA17481028.xml
OA16770316.xml  OA16980803.xml  OA17151223.xml  OA17300512.xml  OA17490220.xml
OA16770504.xml  OA16990802.xml  OA17160127.xml  OA17300601.xml  OA17490317.xml
OA16771017.xml  OA17000301.xml  OA17160608.xml  OA17301007.xml  OA17490426.xml
OA16771219.xml  OA17000419.xml  OA17160713.xml  OA17301116.xml  OA17490703.xml
OA16780123.xml  OA17000524.xml  OA17160919.xml  OA17301223.xml  OA17490804.xml
OA16780306.xml  OA17000719.xml  OA17160921.xml  OA17310308.xml  OA17491018.xml
OA16780417.xml  OA17000720.xml  OA17161219.xml  OA17310514.xml  OA17500207.xml
OA16780522.xml  OA17000906.xml  OA17170201.xml  OA17310616.xml  OA17500326.xml
OA16780906.xml  OA17001106.xml  OA171703

In [263]:
# macOS only: opens the folder in Finder
!open oldbailey_data/ordinarysAccounts/

In [265]:
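from lxml import etree  # skip if already imported earlier in the notebook

# Parse a single Ordinary's Account into an element tree
with open('oldbailey_data/ordinarysAccounts/OA17670914.xml', 'rb') as in_xml:
    tree = etree.parse(in_xml)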

In [266]:
root = tree.getroot()

In [269]:
root.nsmap

{}
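
An empty `nsmap` means the file declares no XML namespaces, which is why the unprefixed XPath queries further down work as written. A minimal sketch of the difference, assuming `lxml`; the two tiny documents and the namespace URI are hypothetical and only serve the comparison:

```python
from lxml import etree

NS = "http://example.org/ns"  # hypothetical namespace URI, not from the Old Bailey data

# Same content, once without and once with a default namespace
plain = etree.fromstring('<doc><persName>Mary Clifford</persName></doc>')
spaced = etree.fromstring(
    f'<doc xmlns="{NS}"><persName>Mary Clifford</persName></doc>'
)

print(plain.nsmap)   # {}
print(spaced.nsmap)  # {None: 'http://example.org/ns'}

# Without a namespace, the bare element name matches
unprefixed_plain = plain.xpath('//persName')

# With a default namespace, the bare name no longer matches anything;
# XPath needs a prefix that is bound to the namespace URI
unprefixed_spaced = spaced.xpath('//persName')
prefixed_spaced = spaced.xpath('//t:persName', namespaces={'t': NS})

print(len(unprefixed_plain), len(unprefixed_spaced), len(prefixed_spaced))
```

If a query unexpectedly returns an empty list on another corpus, checking `root.nsmap` is a reasonable first step.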

In [None]:
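# XPath references:
# matching two attributes:
# https://stackoverflow.com/questions/2009268/how-to-write-an-xpath-query-to-match-two-attributes
# selecting an element based on a child's value:
# https://stackoverflow.com/questions/9683054/xpath-to-select-element-based-on-childs-child-value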

In [291]:
# Every <persName> element in the document
els = tree.xpath('.//persName')

In [298]:
# Only the <persName> elements with an <interp> child that marks the gender as female
els = tree.xpath('//persName[interp[@type="gender" and @value="female"]]')
print(len(els))

23
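
The nested predicate in the query above requires both attributes to sit on the same `<interp>` child. A small sketch of why that matters, assuming `lxml`; the sample document is deliberately contrived (only the element and attribute names come from the Old Bailey file), and `strict` and `loose` are illustrative names:

```python
from lxml import etree

# Hypothetical miniature document. Person p2 is male but carries an
# unrelated <interp> whose value happens to be "female".
SAMPLE = """<div>
  <persName id="p1"><interp type="gender" value="female"/>Mary Clifford</persName>
  <persName id="p2"><interp type="gender" value="male"/><interp type="note" value="female"/>John Smith</persName>
  <persName id="p3"><interp type="gender" value="male"/>James Brownrigg</persName>
</div>"""
root = etree.fromstring(SAMPLE)

# Both conditions are tested on one and the same <interp> element
strict = root.xpath('//persName[interp[@type="gender" and @value="female"]]')

# Each condition may be satisfied by a different <interp> child
loose = root.xpath('//persName[interp/@type="gender" and interp/@value="female"]')

print([e.get('id') for e in strict])  # ['p1']
print([e.get('id') for e in loose])   # ['p1', 'p2']
```

The looser form would inflate the count whenever a person has several `<interp>` children, so the nested predicate is the safer choice here.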


In [303]:
etree.tostring(els[0])

b'<persName id="OA17670914n20-1">\n                  <interp inst="OA17670914n20-1" type="gender" value="female"/>\n                  <interp inst="OA17670914n20-1" type="surname" value="BROWNRIGG"/>\n                  <interp inst="OA17670914n20-1" type="given" value="ELIZABETH"/>OF ELIZABETH BROWNRIGG</persName>, Who was executed at '

In [299]:
els[0].attrib

{'id': 'OA17670914n20-1'}
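
The `attrib` of a `<persName>` holds only its `id`; gender, surname and given name live on the `<interp>` children. One possible way to collect them into a dictionary is sketched below. The sample string is the element printed above with its whitespace compacted, and `interp_values` is a hypothetical helper name:

```python
try:
    from lxml import etree  # what the notebook uses
except ImportError:
    # The stdlib module offers the same calls used in this sketch
    import xml.etree.ElementTree as etree

SAMPLE = (
    '<persName id="OA17670914n20-1">'
    '<interp inst="OA17670914n20-1" type="gender" value="female"/>'
    '<interp inst="OA17670914n20-1" type="surname" value="BROWNRIGG"/>'
    '<interp inst="OA17670914n20-1" type="given" value="ELIZABETH"/>'
    'OF ELIZABETH BROWNRIGG</persName>'
)


def interp_values(pers_name):
    """Map each <interp> type to its value for one <persName> element."""
    return {i.get('type'): i.get('value') for i in pers_name.findall('interp')}


el = etree.fromstring(SAMPLE)
info = interp_values(el)
print(info)  # {'gender': 'female', 'surname': 'BROWNRIGG', 'given': 'ELIZABETH'}
```

Applied to the query result, this would be `[interp_values(e) for e in els]`.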

In [305]:
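# Text content of each match, as bytes; with_tail=False drops the text that follows the closing tag
[etree.tostring(e, method='text', encoding='ISO-8859-1', with_tail=False) for e in els]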

[b'\n                  \n                  \n                  OF ELIZABETH BROWNRIGG',
 b'\n                  \n                  \n                  Mary Clifford',
 b'\n                  \n                  \n                  Mary Clifford',
 b'\n                  \n                  \n                  Elizabeth Brownrigg',
 b'\n                  \n                  \n                  Mary Clifford',
 b'\n                  \n                  \n                  Elizabeth',
 b'\n                  \n                  \n                  Elizabeth',
 b'\n                  \n                  \n                  Mary Clifford',
 b'\n                  \n                  \n                  Elizabeth',
 b'\n                  \n                  \n                  Mary',
 b'\n                  \n                  \n                  Elizabeth',
 b'\n                  \n                  \n                  Mary',
 b'\n                  \n                  \n                  Mary Mit
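
The byte strings above still carry the line breaks and indentation that surround the `<interp>` children. A minimal sketch of one way to get clean names as ordinary strings, using `itertext()` rather than `tostring()`; the sample is an abridged copy of the first match wrapped in a hypothetical `<p>`, and `clean_text` is an illustrative helper name:

```python
try:
    from lxml import etree  # what the notebook uses
except ImportError:
    # The stdlib module offers the same calls used in this sketch
    import xml.etree.ElementTree as etree

SAMPLE = """<p><persName id="OA17670914n20-1">
    <interp inst="OA17670914n20-1" type="gender" value="female"/>
    <interp inst="OA17670914n20-1" type="given" value="ELIZABETH"/>OF ELIZABETH BROWNRIGG</persName>, Who was executed at</p>"""


def clean_text(element):
    """Join all text inside the element and collapse runs of whitespace."""
    # itertext() yields the text within the element but not its own tail,
    # so the words after </persName> are left out.
    return ' '.join(''.join(element.itertext()).split())


root = etree.fromstring(SAMPLE)
names = [clean_text(e) for e in root.iter('persName')]
print(names)  # ['OF ELIZABETH BROWNRIGG']
```

In the notebook the same helper could be applied as `[clean_text(e) for e in els]`, which also avoids having to pick an output encoding.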

# Conclusion



XML and HTML