In [1]:
%load_ext lab_black

# Explore `rawdata` 🎭

The raw data, as provided by [Folger Digital Texts](https://shakespeare.folger.edu/download-the-folger-shakespeare-complete-set/), comes in TEI Simple-encoded XML files. 

These files can be parsed using the XML parser provided by BeautifulSoup:

In [2]:
from glob import glob
import regex as re

from bs4 import BeautifulSoup
import pandas as pd

from statics import RAWDATA_PATH

In [3]:
rawdata_files = sorted(glob(f"{RAWDATA_PATH}/*.xml"))

In [4]:
play_raw = rawdata_files[0]

In [5]:
with open(play_raw) as f:
    soup = BeautifulSoup(f, "lxml-xml")

## Data Format

The parsed data looks like this:

In [6]:
soup.find("stage")  # stage direction

<stage n="SD 1.1.0" type="entrance" who="#Theseus_MND #Hippolyta_MND #Philostrate_MND #ATTENDANTS_MND" xml:id="stg-0000">
<w n="SD 1.1.0" xml:id="fs-mnd-0000070">Enter</w>
<c> </c>
<w n="SD 1.1.0" xml:id="fs-mnd-0000090">Theseus</w>
<pc n="SD 1.1.0" xml:id="fs-mnd-0000100">,</pc>
<c> </c>
<w n="SD 1.1.0" xml:id="fs-mnd-0000120">Hippolyta</w>
<pc n="SD 1.1.0" xml:id="fs-mnd-0000130">,</pc>
<c> </c>
<w n="SD 1.1.0" xml:id="fs-mnd-0000150">and</w>
<c> </c>
<w n="SD 1.1.0" xml:id="fs-mnd-0000170">Philostrate</w>
<pc n="SD 1.1.0" xml:id="fs-mnd-0000180">,</pc>
<c> </c>
<w n="SD 1.1.0" xml:id="fs-mnd-0000200">with</w>
<c> </c>
<w n="SD 1.1.0" xml:id="fs-mnd-0000220">others</w>
<pc n="SD 1.1.0" xml:id="fs-mnd-0000230">.</pc>
</stage>

In [7]:
soup.find("sp")  # speech

<sp who="#Theseus_MND" xml:id="sp-0001">
<speaker xml:id="spk-0001">
<w xml:id="fs-mnd-0000240">THESEUS</w>
</speaker>
<l n="1.1.1" xml:id="ftln-0001">
<w ana="#av" lemma="now" n="1.1.1" xml:id="fs-mnd-0000250">Now</w>
<pc n="1.1.1" xml:id="fs-mnd-0000260">,</pc>
<c> </c>
<w ana="#j" lemma="fair" n="1.1.1" xml:id="fs-mnd-0000280">fair</w>
<c> </c>
<w ana="#n1-nn" lemma="Hippolyta" n="1.1.1" xml:id="fs-mnd-0000300">Hippolyta</w>
<pc n="1.1.1" xml:id="fs-mnd-0000310">,</pc>
<c> </c>
<w ana="#po" lemma="our" n="1.1.1" xml:id="fs-mnd-0000330">our</w>
<c> </c>
<w ana="#j" lemma="nuptial" n="1.1.1" xml:id="fs-mnd-0000350">nuptial</w>
<c> </c>
<w ana="#n1" lemma="hour" n="1.1.1" xml:id="fs-mnd-0000370">hour</w>
</l>
<l n="1.1.2" xml:id="ftln-0002">
<w ana="#vvz" lemma="draw" n="1.1.2" xml:id="fs-mnd-0000380">Draws</w>
<c> </c>
<w ana="#acp-p" lemma="on" n="1.1.2" xml:id="fs-mnd-0000400">on</w>
<c> </c>
<w ana="#av" lemma="apace" n="1.1.2" xml:id="fs-mnd-0000420">apace</w>
<pc n="1.1.2" xml:id

Note how the `who` attribute identifies who enters (in the stage direction) or who speaks (in the speech), and how the speaker information is effectively encoded twice: once in the `who` attribute and once in the `<speaker>` tag.
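
Since a `who` value is a whitespace-separated list of references, each prefixed with `#`, it can be split into character IDs with plain string handling. The helper below is a minimal sketch; the name `parse_who` is hypothetical and not taken from our code base.

```python
def parse_who(who):
    """Split a `who` attribute value into a list of character IDs."""
    # Each reference is a whitespace-separated token with a leading "#".
    # A missing attribute (None) is treated as "nobody referenced".
    return [ref.lstrip("#") for ref in (who or "").split()]


print(parse_who("#Theseus_MND #Hippolyta_MND #Philostrate_MND #ATTENDANTS_MND"))
print(parse_who(None))
```

In the notebook, it could be applied as `parse_who(soup.find("sp").get("who"))`.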

## Some Explorations

The `soup` we produced above can be navigated just like any other `BeautifulSoup` object. 
<br/>
Below, we highlight some notable features of the raw data and try to convey the difficulties we encounter when preprocessing it into graph representations.

### Characters

The data in `data/*.cast.csv` is derived from the `castItem` elements:

In [8]:
soup.find_all("castItem")

[<castItem xml:id="Hermia_MND">
 <role>
 <name>Hermia</name>
 </role>
 </castItem>,
 <castItem xml:id="Lysander_MND">
 <role>
 <name>Lysander</name>
 </role>
 </castItem>,
 <castItem xml:id="Helena_MND">
 <role>
 <name>Helena</name>
 </role>
 </castItem>,
 <castItem xml:id="Demetrius_MND">
 <role>
 <name>Demetrius</name>
 </role>
 </castItem>,
 <castItem xml:id="Theseus_MND">
 <role>
 <name>Theseus</name>
 </role>
 <roleDesc>duke of Athens</roleDesc>
 </castItem>,
 <castItem xml:id="Hippolyta_MND">
 <role>
 <name>Hippolyta</name>
 </role>
 <roleDesc>queen of the Amazons</roleDesc>
 </castItem>,
 <castItem xml:id="Egeus_MND">
 <role>
 <name>Egeus</name>
 </role>
 <roleDesc>father to Hermia</roleDesc>
 </castItem>,
 <castItem xml:id="Philostrate_MND">
 <role>
 <name>Philostrate</name>
 </role>
 <roleDesc>master of the revels to Theseus</roleDesc>
 </castItem>,
 <castItem xml:id="Bottom_MND">
 <role>
 <name>Nick Bottom</name>
 </role>
 <roleDesc>weaver</roleDesc>
 </castItem>,
 <castItem 
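
One plausible way to flatten `castItem` elements into tabular records is sketched below. It is an illustration only and not necessarily how the `data/*.cast.csv` files are produced. To stay runnable on its own, it uses the standard library's `xml.etree.ElementTree` on a small inline snippet instead of the `soup` object; the `<castList>` wrapper, the function name `cast_records`, and the choice of columns are assumptions.

```python
import xml.etree.ElementTree as ET

# The `xml:` prefix is bound to this namespace by the XML specification.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Two cast items copied from the output above, wrapped in a single root element.
CAST_SNIPPET = """
<castList>
<castItem xml:id="Hermia_MND">
<role>
<name>Hermia</name>
</role>
</castItem>
<castItem xml:id="Theseus_MND">
<role>
<name>Theseus</name>
</role>
<roleDesc>duke of Athens</roleDesc>
</castItem>
</castList>
"""


def cast_records(root):
    """Return one dict per <castItem>; missing descriptions become None."""
    return [
        {
            "xml:id": item.get(XML_ID),
            "name": item.findtext("role/name"),
            "roleDesc": item.findtext("roleDesc"),
        }
        for item in root.iter("castItem")
    ]


cast = cast_records(ET.fromstring(CAST_SNIPPET.strip()))
for record in cast:
    print(record)
```

Not every `castItem` carries a `roleDesc`, which is why the sketch falls back to `None`. If the real files declare a default XML namespace, the tag names passed to `ElementTree` would have to be qualified accordingly.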

### Acts and Scenes

Acts and scenes are contained in `<div>` tags:

In [9]:
[x.attrs for x in soup.find_all("div")]

[{'type': 'act', 'n': '1'},
 {'type': 'scene', 'n': '1'},
 {'type': 'scene', 'n': '2'},
 {'type': 'act', 'n': '2'},
 {'type': 'scene', 'n': '1'},
 {'type': 'scene', 'n': '2'},
 {'type': 'act', 'n': '3'},
 {'type': 'scene', 'n': '1'},
 {'type': 'scene', 'n': '2'},
 {'type': 'act', 'n': '4'},
 {'type': 'scene', 'n': '1'},
 {'type': 'scene', 'n': '2'},
 {'type': 'act', 'n': '5'},
 {'type': 'scene', 'n': '1'}]
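
The flat list above loses the nesting of scenes within acts, but because the elements come back in document order, each scene can be paired with the most recent act. The following is a minimal sketch; `act_scene_pairs` is a hypothetical helper name, and the numbers are kept as strings exactly as they appear in the attributes.

```python
def act_scene_pairs(div_attrs):
    """Pair every scene with the act that most recently preceded it."""
    pairs = []
    current_act = None
    for attrs in div_attrs:
        if attrs.get("type") == "act":
            current_act = attrs["n"]
        elif attrs.get("type") == "scene":
            pairs.append((current_act, attrs["n"]))
    return pairs


# The first five entries of the output above.
divs = [
    {"type": "act", "n": "1"},
    {"type": "scene", "n": "1"},
    {"type": "scene", "n": "2"},
    {"type": "act", "n": "2"},
    {"type": "scene", "n": "1"},
]
print(act_scene_pairs(divs))  # [('1', '1'), ('1', '2'), ('2', '1')]
```

In the notebook, the input would be `[x.attrs for x in soup.find_all("div")]`.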

### Stage Directions

Stage directions are contained in `<stage>` tags.
<br/>
Here are the different types of stage directions present in _A Midsummer Night's Dream_:

In [10]:
set(t.get("type") for t in soup.find_all("stage"))

{'business', 'delivery', 'entrance', 'exit', 'mixed', 'modifier', 'sound'}

Our graph representations currently reflect entrances and exits only, as we use stage directions of these types to determine who is on stage when a particular word is spoken. 
<br/>
This involves many complications, because the stage directions in the Folger Shakespeare do not make all character exits explicit. 
We have documented our judgment calls to handle these complications in our code base.
<br/>
Handling other types of stage directions relevant to the information flow on stage is on our list for future improvements.

To illustrate why, let's take a closer look at the stage directions in _A Midsummer Night's Dream_:

In [11]:
stage_df = pd.DataFrame.from_records(
    [
        {**x.attrs, "text": re.sub(r"\s+", " ", x.get_text("").strip())}
        for x in soup.find_all("stage")
    ]
)
stage_df.head()

|   | xml:id | n | type | who | text | rendition | ana |
|---|---|---|---|---|---|---|---|
| 0 | stg-0000 | SD 1.1.0 | entrance | #Theseus_MND #Hippolyta_MND #Philostrate_MND #... | Enter Theseus , Hippolyta , and Philostrate , ... | | |
| 1 | stg-0016.1 | SD 1.1.16.1 | exit | #Philostrate_MND | Philostrate exits . | | |
| 2 | stg-0020.1 | SD 1.1.20.1 | entrance | #Egeus_MND #Hermia_MND #Lysander_MND #Demetriu... | Enter Egeus and his daughter Hermia , and Lysa... | | |
| 3 | stg-0101.0 | SD 1.1.101.0 | delivery | | , to Theseus | | |
| 4 | stg-0129.1 | SD 1.1.129.1 | exit | #Theseus_MND #Hippolyta_MND #Egeus_MND #Demetr... | All but Hermia and Lysander exit . | | |


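The excess whitespace in the `text` column (for example, the space before each comma) is an artifact of the detailed encoding of the underlying XML source: every word, punctuation mark, and space is its own element, and the line breaks between these elements end up in the extracted text.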
<br/>
We don't need to worry about that for now.
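
Should a cleaner rendering be needed later, one possible approach is to drop the whitespace that precedes punctuation. The sketch below uses the standard library's `re` module (the call works the same way with the `regex` package imported above); the helper name `tidy_spacing` and the set of punctuation marks are assumptions.

```python
import re


def tidy_spacing(text):
    """Remove whitespace that directly precedes common punctuation marks."""
    return re.sub(r"\s+([,.;:!?])", r"\1", text)


print(tidy_spacing("Enter Theseus , Hippolyta , and Philostrate , with others ."))
# Enter Theseus, Hippolyta, and Philostrate, with others.
```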

But notice, for example, `stg-0101.0`: 
<br/>
The delivery of the speech following the stage direction is modified such that it is heard only by _Theseus_.
<br/>
Hence, if a graph representation pretends that everyone currently on stage hears that speech, it is misrepresenting the information flow expressed in the raw data.
<br/>
_All_ Hyperbard representations currently do this, and the example above shows why: 
<br/>
While entrances and exits are systematically annotated with `who` attributes, this does not hold for the other types of stage directions.
<br/>
Hence, to handle stage directions of other types correctly, we need to parse their text semantically.
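
The entrance and exit bookkeeping described in this section can be sketched as follows. This is a deliberately simplified illustration, not the logic in our code base: it ignores implicit exits and scene boundaries, and it skips every direction that is not an entrance or an exit, which is exactly the limitation discussed above. The function name `track_presence` is hypothetical.

```python
def track_presence(directions):
    """Return (direction id, characters on stage afterwards) per direction."""
    on_stage = set()
    snapshots = []
    for direction in directions:
        # `who` holds "#"-prefixed references; it may be absent altogether.
        who = {ref.lstrip("#") for ref in (direction.get("who") or "").split()}
        if direction.get("type") == "entrance":
            on_stage |= who
        elif direction.get("type") == "exit":
            on_stage -= who
        # Other types (e.g. "delivery") leave the on-stage set unchanged.
        snapshots.append((direction.get("xml:id"), sorted(on_stage)))
    return snapshots


# Attributes of three stage directions shown earlier in this notebook.
directions = [
    {
        "xml:id": "stg-0000",
        "type": "entrance",
        "who": "#Theseus_MND #Hippolyta_MND #Philostrate_MND #ATTENDANTS_MND",
    },
    {"xml:id": "stg-0016.1", "type": "exit", "who": "#Philostrate_MND"},
    {"xml:id": "stg-0101.0", "type": "delivery"},
]
for snapshot in track_presence(directions):
    print(snapshot)
```

In the notebook, the input could be `[x.attrs for x in soup.find_all("stage")]`. Note that the `delivery` direction has no effect on the result, so a speech addressed to Theseus alone would still be attributed to everyone on stage.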

### Speech

As all spoken words are descendants of `<sp>` tags, we can extract them and annotate them with their speaker(s) via some XML munging. 
<br/>
We still need to be careful, though, as a closer look at a `<sp>` tag reveals:

In [12]:
soup.find("sp")

<sp who="#Theseus_MND" xml:id="sp-0001">
<speaker xml:id="spk-0001">
<w xml:id="fs-mnd-0000240">THESEUS</w>
</speaker>
<l n="1.1.1" xml:id="ftln-0001">
<w ana="#av" lemma="now" n="1.1.1" xml:id="fs-mnd-0000250">Now</w>
<pc n="1.1.1" xml:id="fs-mnd-0000260">,</pc>
<c> </c>
<w ana="#j" lemma="fair" n="1.1.1" xml:id="fs-mnd-0000280">fair</w>
<c> </c>
<w ana="#n1-nn" lemma="Hippolyta" n="1.1.1" xml:id="fs-mnd-0000300">Hippolyta</w>
<pc n="1.1.1" xml:id="fs-mnd-0000310">,</pc>
<c> </c>
<w ana="#po" lemma="our" n="1.1.1" xml:id="fs-mnd-0000330">our</w>
<c> </c>
<w ana="#j" lemma="nuptial" n="1.1.1" xml:id="fs-mnd-0000350">nuptial</w>
<c> </c>
<w ana="#n1" lemma="hour" n="1.1.1" xml:id="fs-mnd-0000370">hour</w>
</l>
<l n="1.1.2" xml:id="ftln-0002">
<w ana="#vvz" lemma="draw" n="1.1.2" xml:id="fs-mnd-0000380">Draws</w>
<c> </c>
<w ana="#acp-p" lemma="on" n="1.1.2" xml:id="fs-mnd-0000400">on</w>
<c> </c>
<w ana="#av" lemma="apace" n="1.1.2" xml:id="fs-mnd-0000420">apace</w>
<pc n="1.1.2" xml:id

Not all `<w>` tags that are descendants of `<sp>` tags are spoken words, and the `<speaker>` tag holds redundant information that is also present, in a more structured form, in the `who` attribute of the `<sp>` tag. 
<br/>
In our data preprocessing, we need to account for lots of little things like this.
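
As an illustration of one such detail, the sketch below collects the words of a speech while skipping the `<w>` tags inside `<speaker>`, and takes the speaker IDs from the `who` attribute. To stay runnable on its own, it uses the standard library's `xml.etree.ElementTree` on an abridged copy of the speech shown above rather than the `soup` object. The function name `speech_record` is hypothetical, and other non-spoken descendants, if present, would need similar treatment.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the first speech (attributes such as `ana` and `lemma` omitted).
SPEECH_SNIPPET = """
<sp who="#Theseus_MND" xml:id="sp-0001">
<speaker xml:id="spk-0001">
<w xml:id="fs-mnd-0000240">THESEUS</w>
</speaker>
<l n="1.1.1" xml:id="ftln-0001">
<w n="1.1.1" xml:id="fs-mnd-0000250">Now</w>
<pc n="1.1.1" xml:id="fs-mnd-0000260">,</pc>
<c> </c>
<w n="1.1.1" xml:id="fs-mnd-0000280">fair</w>
</l>
</sp>
"""


def speech_record(sp):
    """Return (speaker IDs, spoken words) for a single <sp> element."""
    speakers = [ref.lstrip("#") for ref in sp.get("who", "").split()]
    # Words inside <speaker> are the speaker label, not spoken text.
    label_words = {id(w) for spk in sp.iter("speaker") for w in spk.iter("w")}
    words = [w.text for w in sp.iter("w") if id(w) not in label_words]
    return speakers, words


speakers, words = speech_record(ET.fromstring(SPEECH_SNIPPET.strip()))
print(speakers, words)  # ['Theseus_MND'] ['Now', 'fair']
```

With the notebook's `soup`, a comparable filter could keep only those `<w>` tags for which `find_parent("speaker")` returns `None`.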