## Progress report 2 notebook

This notebook loads and parses a sample of the XML data used in the project. More information about the project is available in the [project report](progress_report.md).

This notebook continues [progress_report_1.ipynb](progress_report_1.ipynb), integrating improvements, corrections, and new functionality. This is the *existing* option for report 2, renamed for easy access to and comparison of the two stages.

## Sample data

The sample dataset consists of multiple stanzas from a long poem. An artificial one-stanza extract looks like the following:

```xml
<poem>
    <stanza stanzaNo="001">
        <line lineNo="001">"Мой дядя самых честных пр<stress>а</stress>вил,</line>
        <line lineNo="002">Когда не в шутку занем<stress>о</stress>г,</line>
        <line lineNo="003">Он уважать себя заст<stress>а</stress>вил</line>
        <line lineNo="004">И лучше выдумать не м<stress>о</stress>г.</line>
        <line lineNo="005">Его пример другим на<stress>у</stress>ка;</line>
        <line lineNo="006">Но, боже мой, какая ск<stress>у</stress>ка</line>
        <line lineNo="007">С больным сидеть и день и н<stress>о</stress>чь,</line>
        <line lineNo="008">Не отходя ни шагу пр<stress>о</stress>чь!</line>
        <line lineNo="009">Какое низкое ков<stress>а</stress>рство</line>
        <line lineNo="010">Полу-живого забавл<stress>я</stress>ть,</line>
        <line lineNo="011">Ему подушки поправл<stress>я</stress>ть,</line>
        <line lineNo="012">Печально подносить лек<stress>а</stress>рство,</line>
        <line lineNo="013">Вздыхать и думать про себ<stress>я</stress>:</line>
        <line lineNo="014">Когда же чорт возьмет теб<stress>я</stress>!"</line>
    </stanza>
</poem>   
```

## Reload libraries each time, since we’re tinkering with them

In [1]:
%load_ext autoreload
%autoreload 2

## Load libraries

In [2]:
from xml.dom import pulldom  # parse input XML
from xml.dom.minidom import Document  # construct output XML
import numpy as np
import pandas as pd
import regex as re
from cyr2phon import cyr2phon, utility  # custom package

## Class and variables for parsing input XML

In [3]:
class Stack(list):  # keep track of open nodes while constructing XML output
    def push(self, item):
        self.append(item)

    def peek(self):  
        return self[-1]


open_elements = Stack()
WS_RE = re.compile(r'\s+')  # normalize white space in output
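
As a quick illustration of how the stack tracks open nodes, the following sketch repeats the `Stack` class so that it runs on its own, and pushes plain strings in place of DOM nodes. The strings and the names `demo`, `innermost`, and `current` are placeholders for illustration only.

```python
class Stack(list):  # same definition as in the notebook cell above
    def push(self, item):
        self.append(item)

    def peek(self):
        return self[-1]


demo = Stack()
demo.push("document")    # stands in for the output document node
demo.push("line")        # a <line> element opens
demo.push("stress")      # a <stress> element opens inside the line
innermost = demo.peek()  # new text would be appended to this node
demo.pop()               # </stress> closes
current = demo.peek()    # text is appended to the line again
```

After the final step, `innermost` is `"stress"` and `current` is `"line"`, which mirrors how `process()` below always appends to whatever element is on top of the stack.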

## Function to parse the XML

Returns a list of lists, with stanza number, line number, and `<line>` element for each line. We use the light-weight *xml.dom.pulldom* library to parse the input XML and *xml.dom.minidom* to construct the lines as simplified XML, removing elements we don’t care about, such as `<latin>` and `<italic>`, before serializing them to the output. (We actually do care about `<latin>`, but we are ignoring it temporarily, and we’ll return to it at a later stage in the project.)

In [4]:
def process(input_xml):
    stanzaNo = 0
    lineNo = 0
    inline = 0  # flag to control behavior inside and outside lines
    result = []  # array of arrays, one per line, with stanzaNo, lineNo, and serialized XML
    doc = pulldom.parse(input_xml)
    for event, node in doc:
        if event == pulldom.START_ELEMENT and node.localName == 'stanza':
            stanzaNo = node.getAttribute("stanzaNo")
        elif event == pulldom.START_ELEMENT and node.localName == 'line':
            d = Document()  # each line is an output XML document
            open_elements.push(d)  # document node
            lineNo = node.getAttribute("lineNo")
            inline = 1  # we’re inside a line
            open_elements.peek().appendChild(node)  # add as child of current node in output tree
            open_elements.push(node)  # keep track of open elements
        elif event == pulldom.END_ELEMENT and node.localName == 'line':
            inline = 0  # when we finish our work here, we’ll no longer be inside a line
            open_elements.pop()  # line is finished
            # serialize XML, strip declaration, rewrite &quot; entity as character
            serialized = open_elements.pop().toxml()
            serialized = serialized.replace('<?xml version="1.0" ?>', '').replace('&quot;', '"')
            result.append([int(stanzaNo), int(lineNo), WS_RE.sub(" ", serialized)])
        elif event == pulldom.START_ELEMENT and node.localName == 'stress':
            open_elements.peek().appendChild(node)  # add as child of current node in output tree
            open_elements.push(node)  # keep track of open elements
        elif event == pulldom.END_ELEMENT and node.localName == 'stress':
            open_elements.pop()  # stress element is finished
        elif event == pulldom.CHARACTERS and inline:  # keep text only inside lines
            t = d.createTextNode(node.data)
            open_elements.peek().appendChild(t)
    return result
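
For comparison, *pulldom* can also build the complete subtree of an element on request with `expandNode()`. The sketch below is an alternative technique, not the function above: it keeps every child element of `<line>`, so it would not drop `<latin>` or `<italic>`, and it does not normalize white space. The sample string and the name `lines_via_expand` are made up for illustration.

```python
from xml.dom import pulldom

# Hypothetical one-line sample, shortened from the extract shown earlier
SAMPLE = ('<poem><stanza stanzaNo="001">'
          '<line lineNo="001">пр<stress>а</stress>вил,</line>'
          '</stanza></poem>')


def lines_via_expand(xml_text):
    """Return [stanzaNo, lineNo, serialized <line>] for each line."""
    rows = []
    stanza_no = 0
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "stanza":
            stanza_no = int(node.getAttribute("stanzaNo"))
        elif event == pulldom.START_ELEMENT and node.tagName == "line":
            events.expandNode(node)  # pull in all descendants of this <line>
            rows.append([stanza_no, int(node.getAttribute("lineNo")), node.toxml()])
    return rows


rows = lines_via_expand(SAMPLE)
```

The stack-based approach in `process()` is more verbose, but it gives explicit control over which child elements are copied to the output.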

## Parse the XML into an array of arrays

In [5]:
with open("data_samples/eo1.xml", "rb") as f:  # binary mode lets the XML parser detect the encoding
    all_lines = process(f)
all_lines[:5]  # take a look

[[1,
  1,
  '<line lineNo="001">"Мой дядя самых честных пр<stress>а</stress>вил,</line>'],
 [1,
  2,
  '<line lineNo="002">Когда не в шутку занем<stress>о</stress>г,</line>'],
 [1, 3, '<line lineNo="003">Он уважать себя заст<stress>а</stress>вил</line>'],
 [1, 4, '<line lineNo="004">И лучше выдумать не м<stress>о</stress>г.</line>'],
 [1, 5, '<line lineNo="005">Его пример другим на<stress>у</stress>ка;</line>']]

## General descriptive information

Use `//` for integer division to return *54* instead of *54.0*.

In [6]:
line_count = len(all_lines)
print(f'There are {line_count // 14} 14-line stanzas in this sample, with a total of '
      f'{line_count} lines. Since we know that the poem is fully rhymed, there are '
      f'{line_count // 2} rhyme pairs in the sample.')

There are 54 14-line stanzas in this sample, with a total of 756 lines. Since we know that the poem is fully rhymed, there are 378 rhyme pairs in the sample.


## Write the data into a dataframe

In [7]:
df = pd.DataFrame(all_lines, columns=["StanzaNo", "LineNo", "Text"])
df.head(5)

Unnamed: 0,StanzaNo,LineNo,Text
0,1,1,"<line lineNo=""001"">""Мой дядя самых честных пр<..."
1,1,2,"<line lineNo=""002"">Когда не в шутку занем<stre..."
2,1,3,"<line lineNo=""003"">Он уважать себя заст<stress..."
3,1,4,"<line lineNo=""004"">И лучше выдумать не м<stres..."
4,1,5,"<line lineNo=""005"">Его пример другим на<stress..."


## Transliterate all lines and save in new column

### Notes

1. Because only the last stress in the line is marked, the phonetic representation of all words except the last is incorrect. That doesn’t matter for the analysis of end rhyme.
1. Words in foreign languages are not being treated specially, and are therefore usually phonetically incorrect. That *does* matter for the analysis of end rhyme. Deal with it later, first by excluding those lines (revise the XML parsing to record that information), and eventually by phoneticizing them correctly.
1. The `transliterate()` function is part of the [custom *cyr2phon* package](cyr2phon/cyr2phon.py).

In [8]:
trans_vec = np.vectorize(cyr2phon.transliterate)
df["Phonetic"] = trans_vec(df["Text"])
df.head()  # take a look

Unnamed: 0,StanzaNo,LineNo,Text,Phonetic
0,1,1,"<line lineNo=""001"">""Мой дядя самых честных пр<...",maJ DiDi samix Čistnix prAVil
1,1,2,"<line lineNo=""002"">Когда не в шутку занем<stre...",kagda Nifšutku zaNimOk
2,1,3,"<line lineNo=""003"">Он уважать себя заст<stress...",an uvažaT SiBi zastAVil
3,1,4,"<line lineNo=""004"">И лучше выдумать не м<stres...",iluČši vidumaT NimOk
4,1,5,"<line lineNo=""005"">Его пример другим на<stress...",Jiva pRiMir druGim naUka


## Write the rhyme word into a new column

In [9]:
df["RhymeWord"] = df["Phonetic"].str.split().str[-1]
df.head()  # take a look

Unnamed: 0,StanzaNo,LineNo,Text,Phonetic,RhymeWord
0,1,1,"<line lineNo=""001"">""Мой дядя самых честных пр<...",maJ DiDi samix Čistnix prAVil,prAVil
1,1,2,"<line lineNo=""002"">Когда не в шутку занем<stre...",kagda Nifšutku zaNimOk,zaNimOk
2,1,3,"<line lineNo=""003"">Он уважать себя заст<stress...",an uvažaT SiBi zastAVil,zastAVil
3,1,4,"<line lineNo=""004"">И лучше выдумать не м<stres...",iluČši vidumaT NimOk,NimOk
4,1,5,"<line lineNo=""005"">Его пример другим на<stress...",Jiva pRiMir druGim naUka,naUka


## Syllabify rhyme word and write into new column

The `syllabify()` function is part of the [*utility* module](cyr2phon/utility.py) of the custom *cyr2phon* package.

In [10]:
df["Syllabified"] = [utility.syllabify(word) for word in df["RhymeWord"]]
df.head()
# writing a list into a cell isn’t good practice; is it okay as a stepping stone?

Unnamed: 0,StanzaNo,LineNo,Text,Phonetic,RhymeWord,Syllabified
0,1,1,"<line lineNo=""001"">""Мой дядя самых честных пр<...",maJ DiDi samix Čistnix prAVil,prAVil,"[prA, Vil]"
1,1,2,"<line lineNo=""002"">Когда не в шутку занем<stre...",kagda Nifšutku zaNimOk,zaNimOk,"[za, Ni, mOk]"
2,1,3,"<line lineNo=""003"">Он уважать себя заст<stress...",an uvažaT SiBi zastAVil,zastAVil,"[za, stA, Vil]"
3,1,4,"<line lineNo=""004"">И лучше выдумать не м<stres...",iluČši vidumaT NimOk,NimOk,"[Ni, mOk]"
4,1,5,"<line lineNo=""005"">Его пример другим на<stress...",Jiva pRiMir druGim naUka,naUka,"[na, U, ka]"


## Identify rhyme zone and write into new column

The *rhyme zone* is the portion of the line that participates in line end-rhyme. According to Russian rhyming conventions, the rhyme zone typically begins with the last stressed vowel of the line and continues until the end of the line. The one exception is that Russian open masculine rhyme (that is, rhyme involving stress on a final syllable that ends in a vowel) also requires a *supporting consonant*: the consonants *before the stressed vowels* (not otherwise considered part of the rhyme zone) must also agree. For example:

* *see* and *tree* would not rhyme under Russian conventions because this open (ends in a vowel sound) masculine (stress on the final syllable) rhyme does not have a supporting consonant (the consonants before the stressed vowels do not agree).
* *seat* and *treat* would rhyme under Russian conventions because closed (ends in a consonant sound) masculine (stress on the final syllable) rhyme does not require a supporting consonant, so the lack of phonetic correspondence between the consonants before the stressed vowels does not matter.
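
The distinctions in these examples can be sketched in code. The function below is a hypothetical helper, not part of *cyr2phon*. It assumes the transcription convention used in this notebook, in which the stressed vowel is an uppercase `A`, `E`, `I`, `O`, or `U`, and it takes a syllabified rhyme word as input. It uses the standard-library `re` module rather than `regex` so that it runs without extra dependencies.

```python
import re

STRESSED_VOWEL = re.compile(r"[AEIOU]")  # assumption: uppercase marks stress


def classify_rhyme(syllables):
    """Classify a syllabified rhyme word by the position of its stress."""
    stressed = [i for i, s in enumerate(syllables) if STRESSED_VOWEL.search(s)]
    if not stressed:
        return "unknown"  # no stress marked
    posttonic = len(syllables) - 1 - stressed[-1]  # syllables after the stress
    if posttonic == 0:
        # stress on the final syllable: open if the word ends in the stressed vowel
        ends_in_vowel = STRESSED_VOWEL.fullmatch(syllables[-1][-1]) is not None
        return "open masculine" if ends_in_vowel else "closed masculine"
    if posttonic == 1:
        return "feminine"
    return "other"


kinds = [classify_rhyme(w) for w in (["prA", "Vil"], ["za", "Ni", "mOk"], ["pra", "Si", "BA"])]
```

Only the words classified as open masculine need a supporting consonant; for the other types the pretonic onset falls outside the rhyme zone.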

Russian rhyme may also be *enriched* by phonetic agreement or similarity outside the rhyme zone. For example, *stop* and *strop* constitute a perfect rhyme because the *op* sounds match. Nonetheless, the match of *st* before the rhyme zone enhances, or enriches, the rhyme. The present study ignores enrichment and concentrates only on the core rhyme components, but enrichment will be incorporated into the analysis at a later stage.

With that said, this first pass at identifying the rhyme zone removes the pretonic syllables, but not the pretonic onset where a supporting consonant is not needed. More cleaning to follow!

In [11]:
def remove_pretonic_syllables(syllables: list) -> list:  # returns a new list; the input is not modified
    for position, syllable in enumerate(syllables):
        if re.search(r'[AEIOU]', syllable):  # rhyme zone begins here
            return syllables[position:]
    return syllables[:]  # no stressed vowel found; return a copy instead of None


df["RhymeZone"] = df["Syllabified"].apply(remove_pretonic_syllables)
df.head(10)

Unnamed: 0,StanzaNo,LineNo,Text,Phonetic,RhymeWord,Syllabified,RhymeZone
0,1,1,"<line lineNo=""001"">""Мой дядя самых честных пр<...",maJ DiDi samix Čistnix prAVil,prAVil,"[prA, Vil]","[prA, Vil]"
1,1,2,"<line lineNo=""002"">Когда не в шутку занем<stre...",kagda Nifšutku zaNimOk,zaNimOk,"[za, Ni, mOk]",[mOk]
2,1,3,"<line lineNo=""003"">Он уважать себя заст<stress...",an uvažaT SiBi zastAVil,zastAVil,"[za, stA, Vil]","[stA, Vil]"
3,1,4,"<line lineNo=""004"">И лучше выдумать не м<stres...",iluČši vidumaT NimOk,NimOk,"[Ni, mOk]",[mOk]
4,1,5,"<line lineNo=""005"">Его пример другим на<stress...",Jiva pRiMir druGim naUka,naUka,"[na, U, ka]","[U, ka]"
5,1,6,"<line lineNo=""006"">Но, боже мой, какая ск<stre...",nabaži maJ kakaJi skUka,skUka,"[skU, ka]","[skU, ka]"
6,1,7,"<line lineNo=""007"">С больным сидеть и день и н...",zbaLnim SiDiT iDiN inOČ,inOČ,"[i, nOČ]",[nOČ]
7,1,8,"<line lineNo=""008"">Не отходя ни шагу пр<stress...",NiatxaDi Nišagu prOČ,prOČ,[prOČ],[prOČ]
8,1,9,"<line lineNo=""009"">Какое низкое ков<stress>а</...",kakaJi NiskaJi kavArstva,kavArstva,"[ka, vAr, stva]","[vAr, stva]"
9,1,10,"<line lineNo=""010"">Полу-живого забавл<stress>я...",palu-živava zabavLAT,zabavLAT,"[za, ba, vLAT]",[vLAT]


Note to self: Mutable objects inside a DataFrame require extra care.

## Remove pretonic onsets from syllables that do not require supporting consonant

We may want to treat this differently down the road, perhaps retaining pretonic onsets and letting the data tell us that they have to match only with open masculine rhymes. For now, though, we’ll follow the handbook definition of canonical Russian rhyme.

In [15]:
df["StrippedRhymeZone"] = df["RhymeZone"].apply(utility.strip_onset)
df.head(14) # first stanza
# rows 0, 1, and 12 illustrate feminine, closed masculine, and open masculine rhyme, respectively

Unnamed: 0,StanzaNo,LineNo,Text,Phonetic,RhymeWord,Syllabified,RhymeZone,StrippedRhymeZone
0,1,1,"<line lineNo=""001"">""Мой дядя самых честных пр<...",maJ DiDi samix Čistnix prAVil,prAVil,"[prA, Vil]","[prA, Vil]","[A, Vil]"
1,1,2,"<line lineNo=""002"">Когда не в шутку занем<stre...",kagda Nifšutku zaNimOk,zaNimOk,"[za, Ni, mOk]",[mOk],[Ok]
2,1,3,"<line lineNo=""003"">Он уважать себя заст<stress...",an uvažaT SiBi zastAVil,zastAVil,"[za, stA, Vil]","[stA, Vil]","[A, Vil]"
3,1,4,"<line lineNo=""004"">И лучше выдумать не м<stres...",iluČši vidumaT NimOk,NimOk,"[Ni, mOk]",[mOk],[Ok]
4,1,5,"<line lineNo=""005"">Его пример другим на<stress...",Jiva pRiMir druGim naUka,naUka,"[na, U, ka]","[U, ka]","[U, ka]"
5,1,6,"<line lineNo=""006"">Но, боже мой, какая ск<stre...",nabaži maJ kakaJi skUka,skUka,"[skU, ka]","[skU, ka]","[U, ka]"
6,1,7,"<line lineNo=""007"">С больным сидеть и день и н...",zbaLnim SiDiT iDiN inOČ,inOČ,"[i, nOČ]",[nOČ],[OČ]
7,1,8,"<line lineNo=""008"">Не отходя ни шагу пр<stress...",NiatxaDi Nišagu prOČ,prOČ,[prOČ],[prOČ],[OČ]
8,1,9,"<line lineNo=""009"">Какое низкое ков<stress>а</...",kakaJi NiskaJi kavArstva,kavArstva,"[ka, vAr, stva]","[vAr, stva]","[Ar, stva]"
9,1,10,"<line lineNo=""010"">Полу-живого забавл<stress>я...",palu-živava zabavLAT,zabavLAT,"[za, ba, vLAT]",[vLAT],[AT]


In [18]:
df.loc[12:13]

Unnamed: 0,StanzaNo,LineNo,Text,Phonetic,RhymeWord,Syllabified,RhymeZone,StrippedRhymeZone
12,1,13,"<line lineNo=""013"">Вздыхать и думать про себ<s...",vzdixaT idumaT praSiBA,praSiBA,"[pra, Si, BA]",[BA],[BA]
13,1,14,"<line lineNo=""014"">Когда же чорт возьмет теб<s...",kagdaži Čirt vaZMit TiBA,TiBA,"[Ti, BA]",[BA],[BA]
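
The source of `utility.strip_onset` is not shown in this notebook. A minimal sketch of a function that reproduces the output above might look like the following. It assumes that uppercase `A`, `E`, `I`, `O`, and `U` mark the stressed vowel and that open masculine rhyme keeps its entire pretonic onset as the supporting consonant. The name `strip_onset_sketch` is hypothetical, and the real function may differ.

```python
import re

STRESSED_VOWEL = re.compile(r"[AEIOU]")  # assumption: uppercase marks stress


def strip_onset_sketch(rhyme_zone):
    """Drop the pretonic onset unless the rhyme is open masculine."""
    if not rhyme_zone:
        return []
    first = rhyme_zone[0]
    match = STRESSED_VOWEL.search(first)
    if match is None:
        return list(rhyme_zone)  # no stress marked; leave unchanged
    # open masculine: one syllable that ends in the stressed vowel
    open_masculine = len(rhyme_zone) == 1 and match.end() == len(first)
    if open_masculine:
        return list(rhyme_zone)  # keep the supporting consonant
    return [first[match.start():]] + list(rhyme_zone[1:])


stripped = [strip_onset_sketch(z) for z in (["prA", "Vil"], ["mOk"], ["BA"])]
```

The function returns a new list in every branch, so the lists stored in the `RhymeZone` column are left untouched.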


## (Resume here)

Todo:

1. remove pretonic onsets from syllables that do not require supporting consonant
1. decompose syllables into segments
1. decompose segments into phonetic features
1. **[learn how to apply ML to rhyme identification]**
1. build table of rhymes
1. identify imperfect rhymes and describe and analyze at segment and feature level

Each fourteen-line stanza in this poem has the same regular rhyme scheme: **aBaBccDDeFFeGG**. We can use this regularity to find lines that rhyme by matching the line numbers, creating a gold standard that we can use (through cross-validation) to test our (eventual) analytic identification of rhyme.
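
One way to derive the gold-standard pairs from the scheme is sketched below. The names `SCHEME`, `rhyme_pairs`, `PAIRS`, and `PARTNER` are hypothetical, and the sketch assumes that each letter occurs exactly twice in the scheme, as it does here.

```python
SCHEME = "aBaBccDDeFFeGG"  # rhyme scheme as given in the text


def rhyme_pairs(scheme):
    """Group 1-based line numbers that share a letter in the scheme."""
    groups = {}
    for line_no, letter in enumerate(scheme, start=1):
        groups.setdefault(letter, []).append(line_no)
    return sorted(tuple(lines) for lines in groups.values())


PAIRS = rhyme_pairs(SCHEME)
# Look up the rhyme partner of any line number within a stanza
PARTNER = {a: b for pair in PAIRS for a, b in (pair, pair[::-1])}
```

Here `PAIRS` is `[(1, 3), (2, 4), (5, 6), (7, 8), (9, 12), (10, 11), (13, 14)]`. Combining `PARTNER` with the `StanzaNo` and `LineNo` columns would identify the expected rhyme partner of every line in the dataframe.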