# Progress report 3 notebook

## Overview
This notebook loads and parses a sample of the XML data used in the project. A general report that includes more information about the project is available in the [project report](../docs/progress_report.md).

This notebook continues [progress_report_1.ipynb](progress_report_1.ipynb) and [progress_report_2.ipynb](progress_report_2.ipynb), integrating improvements, corrections, and new functionality. This is the *existing* option for report 3, renamed for easy access to and comparison of the three stages.

We have changed the method in the following two ways:

### 1. Replace syllabification with C(C)/V decomposition

Instead of decomposing the rhyme zone into syllables, we decompose it into sequences of vowels and consonant clusters (which may be single consonants). This modification was adopted because segments that should be regarded as matching for rhyme analysis purposes may belong to different syllables, which means that syllabification could compromise their identification. For example:

<table>
    <tr><th>Word</th><th>Phonetic</th><th>|</th><th colspan="5">Syllables</th><th>|</th><th colspan="11">C(C)/V decomposition</th></tr>
    <tr>
        <td>вы́бора</td>
        <td>vIbara</td>
        <td>|</td>
        <td>vI</td>
        <td>-</td>
        <td>ba</td>
        <td>-</td>
        <td>ra</td>
        <td>|</td>
        <td>v</td>
        <td>-</td>
        <td>I</td>
        <td>-</td>
        <td>b</td>
        <td>-</td>
        <td>a</td>
        <td>-</td>
        <td>r</td>
        <td>-</td>
        <td>a</td>
    </tr>
    <tr>
        <td>вы́борка</td>
        <td>vIbarka</td>
        <td>|</td>
        <td>vI</td>
        <td>-</td>
        <td>bar</td>
        <td>-</td>
        <td>ka</td>
        <td>|</td>
        <td>v</td>
        <td>-</td>
        <td>I</td>
        <td>-</td>
        <td>b</td>
        <td>-</td>
        <td>a</td>
        <td>-</td>
        <td>rk</td>
        <td>-</td>
        <td>a</td>
    </tr>
</table>
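
The decomposition in the table can be sketched with a single regular expression. This is a minimal illustration, not the notebook's own tokenizer: it assumes the transcription uses only `a e i o u` (unstressed) and `A E I O U` (stressed) for vowels, and the names `TOKEN_RE` and `cv_decompose` are hypothetical.

```python
import re

# Assumption: these ten characters are the only vowel symbols in the transcription.
VOWELS = "aeiouAEIOU"

# A token is either one vowel or a maximal run of non-vowels (a consonant cluster).
TOKEN_RE = re.compile(rf"[{VOWELS}]|[^{VOWELS}]+")


def cv_decompose(phonetic):
    """Split a phonetic string into single vowels and consonant clusters."""
    return TOKEN_RE.findall(phonetic)


print(cv_decompose("vIbara"))   # ['v', 'I', 'b', 'a', 'r', 'a']
print(cv_decompose("vIbarka"))  # ['v', 'I', 'b', 'a', 'rk', 'a']
```

Both words yield six tokens, so they can be compared position by position, which the syllable-based split (`ba-ra` vs. `bar-ka`) does not allow.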
    

### 2. Identify rhymes by segments, rather than distinctive features

Our original assumption was that after decomposing syllables into onset, nucleus, and coda we could decompose those parts into segments (the nucleus is always present and monosegmental; the onset and coda are optional and potentially polysegmental). This turns out not to be helpful in situations where a monosegmental component might have to be compared to a polysegmental one, since it isn’t clear where the single segment should be aligned. We may reevaluate this decision later (perhaps feature-level comparison will prove useful in cases of isosegmental columns; perhaps we will introduce alignment logic to assign the segments in anisosegmental columns), and we will nonetheless continue to use phonetic distinctive features in evaluating and analyzing rhyming lines, even if not for identifying them.

## Sample data

The sample dataset consists of multiple stanzas from a long poem. An artificial one-stanza extract looks like the following:

```xml
<poem>
    <stanza stanzaNo="001">
        <line lineNo="001">"Мой дядя самых честных пр<stress>а</stress>вил,</line>
        <line lineNo="002">Когда не в шутку занем<stress>о</stress>г,</line>
        <line lineNo="003">Он уважать себя заст<stress>а</stress>вил</line>
        <line lineNo="004">И лучше выдумать не м<stress>о</stress>г.</line>
        <line lineNo="005">Его пример другим на<stress>у</stress>ка;</line>
        <line lineNo="006">Но, боже мой, какая ск<stress>у</stress>ка</line>
        <line lineNo="007">С больным сидеть и день и н<stress>о</stress>чь,</line>
        <line lineNo="008">Не отходя ни шагу пр<stress>о</stress>чь!</line>
        <line lineNo="009">Какое низкое ков<stress>а</stress>рство</line>
        <line lineNo="010">Полу-живого забавл<stress>я</stress>ть,</line>
        <line lineNo="011">Ему подушки поправл<stress>я</stress>ть,</line>
        <line lineNo="012">Печально подносить лек<stress>а</stress>рство,</line>
        <line lineNo="013">Вздыхать и думать про себ<stress>я</stress>:</line>
        <line lineNo="014">Когда же чорт возьмет теб<stress>я</stress>!"</line>
    </stanza>
</poem>   
```

## Reload libraries each time, since we’re tinkering with them

In [1]:
%load_ext autoreload
%autoreload 2

## Load libraries

In [2]:
from xml.dom import pulldom  # parse input XML
from xml.dom.minidom import Document  # construct output XML
import numpy as np
import pandas as pd
from scipy import stats
import regex as re
from cyr2phon import cyr2phon  # custom package

## Class and variables for parsing input XML

In [3]:
class Stack(list):  # keep track of open nodes while constructing XML output
    def push(self, item):
        self.append(item)

    def peek(self):  
        return self[-1]


open_elements = Stack()
WS_RE = re.compile(r'\s+')  # normalize white space in output

## Function to parse the XML

Returns a list of lists, with stanza number, line number, and `<line>` element for each line. We use the light-weight *xml.dom.pulldom* library to parse the input XML and *xml.dom.minidom* to construct the lines as simplified XML, removing elements we don’t care about, such as `<latin>` and `<italic>`, before serializing them to the output. (We actually do care about `<latin>`, but we are ignoring it temporarily, and we’ll return to it at a later stage in the project.)

In [4]:
def process(input_xml):
    stanzaNo = 0
    lineNo = 0
    inline = 0  # flag to control behavior inside and outside lines
    result = []  # array of arrays, one per line, with stanzaNo, lineNo, and serialized XML
    doc = pulldom.parse(input_xml)
    for event, node in doc:
        if event == pulldom.START_ELEMENT and node.localName == 'stanza':
            stanzaNo = node.getAttribute("stanzaNo")
        elif event == pulldom.START_ELEMENT and node.localName == 'line':
            d = Document()  # each line is an output XML document
            open_elements.push(d)  # document node
            lineNo = node.getAttribute("lineNo")
            inline = 1  # we’re inside a line
            open_elements.peek().appendChild(node)  # add as child of current node in output tree
            open_elements.push(node)  # keep track of open elements
        elif event == pulldom.END_ELEMENT and node.localName == 'line':
            inline = 0  # when we finish our work here, we’ll no longer be inside a line
            open_elements.pop()  # line is finished
            # serialize XML, strip declaration, rewrite &quot; entity as character
            result.append([int(stanzaNo), int(lineNo),
                WS_RE.sub(" " ,
                open_elements.pop().toxml().replace('<?xml version="1.0" ?>', '').replace('&quot;', '"'))])
        elif event == pulldom.START_ELEMENT and node.localName == 'stress':
            open_elements.peek().appendChild(node)  # add as child of current node in output tree
            open_elements.push(node)  # keep track of open elements
        elif event == pulldom.END_ELEMENT and node.localName == 'stress':
            open_elements.pop()  # stress element is finished
        elif event == pulldom.CHARACTERS and inline:  # keep text only inside lines
            t = d.createTextNode(node.data)
            open_elements.peek().appendChild(t)
    return result
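
For comparison, the same three pieces of information (stanza number, line number, line content) can be pulled out of a small in-memory sample with the standard library's `xml.etree.ElementTree`. This is only a sketch of an alternative technique, not the method used in this notebook: it loads the whole document at once instead of streaming it, and it returns plain text rather than serialized XML. The names `SAMPLE` and `lines_with_stress` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Two lines excerpted from the sample stanza shown earlier.
SAMPLE = """<poem>
    <stanza stanzaNo="001">
        <line lineNo="001">Мой дядя самых честных пр<stress>а</stress>вил,</line>
        <line lineNo="002">Когда не в шутку занем<stress>о</stress>г,</line>
    </stanza>
</poem>"""


def lines_with_stress(xml_text):
    """Return (stanzaNo, lineNo, plain text, stressed vowels) for each line."""
    root = ET.fromstring(xml_text)
    rows = []
    for stanza in root.iter("stanza"):
        stanza_no = int(stanza.get("stanzaNo"))
        for line in stanza.iter("line"):
            text = "".join(line.itertext())  # text with markup removed
            stressed = [s.text for s in line.iter("stress")]
            rows.append((stanza_no, int(line.get("lineNo")), text, stressed))
    return rows


for row in lines_with_stress(SAMPLE):
    print(row)
```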

## Parse the XML into an array of arrays

In [5]:
# data = "data_samples/eo1.xml"
data = "../data/eo-all.xml"
with open(data, encoding="utf-8") as f:  # be explicit about the encoding of the Cyrillic text
    all_lines = process(f)
all_lines[:5]  # take a look

[[1,
  1,
  '<line lineNo="001">"Мой дядя самых честных пр<stress>а</stress>вил,</line>'],
 [1,
  2,
  '<line lineNo="002">Когда не в шутку занем<stress>о</stress>г,</line>'],
 [1, 3, '<line lineNo="003">Он уважать себя заст<stress>а</stress>вил</line>'],
 [1, 4, '<line lineNo="004">И лучше выдумать не м<stress>о</stress>г.</line>'],
 [1, 5, '<line lineNo="005">Его пример другим на<stress>у</stress>ка;</line>']]

## General descriptive information

Use `//` for integer division so that the counts print as whole numbers rather than floats.

In [6]:
line_count = len(all_lines)
print(f'There are {line_count // 14} 14-line stanzas in this sample, with a total of '
      f'{line_count} lines.\nSince we know that the poem is fully rhymed, there are '
      f'{line_count // 2} rhyme pairs in the sample.')

There are 364 14-line stanzas in this sample, with a total of 5101 lines.
Since we know that the poem is fully rhymed, there are 2550 rhyme pairs in the sample.
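
Because `//` discards remainders, a quick check with `divmod` shows what the two counts leave out. The sketch below uses only the totals printed above; the constant names are hypothetical.

```python
TOTAL_LINES = 5101   # total reported in the output above
STANZA_LENGTH = 14

full_stanzas, leftover_lines = divmod(TOTAL_LINES, STANZA_LENGTH)
rhyme_pairs, unpaired_lines = divmod(TOTAL_LINES, 2)

print(full_stanzas, leftover_lines)  # 364 5
print(rhyme_pairs, unpaired_lines)   # 2550 1
```

The nonzero remainders mean that the total is not an exact multiple of 14 and is odd, so at least one stanza in the file does not have exactly 14 lines and the simple division by two cannot pair every line.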


## Write the data into a dataframe

In [7]:
df = pd.DataFrame(all_lines, columns=["StanzaNo", "LineNo", "Text"])
df.head(5)

Unnamed: 0,StanzaNo,LineNo,Text
0,1,1,"<line lineNo=""001"">""Мой дядя самых честных пр<..."
1,1,2,"<line lineNo=""002"">Когда не в шутку занем<stre..."
2,1,3,"<line lineNo=""003"">Он уважать себя заст<stress..."
3,1,4,"<line lineNo=""004"">И лучше выдумать не м<stres..."
4,1,5,"<line lineNo=""005"">Его пример другим на<stress..."


## Transliterate all lines and save in new column

1. Because only the last stress in the line is marked, the phonetic representation of all words except the last is incorrect. That doesn’t matter for the analysis of end rhyme.
1. Words in foreign languages are not being treated specially, and are therefore usually phonetically incorrect. That *does* matter for the analysis of end rhyme. Deal with it later, first by excluding those lines (revise the XML parsing to record that information), and eventually by phoneticizing them correctly.
1. The `transliterate()` function is part of the [custom *cyr2phon* package](cyr2phon/cyr2phon.py).

In [8]:
trans_vec = np.vectorize(cyr2phon.transliterate)
df["Phonetic"] = trans_vec(df["Text"])
df.head()  # take a look

Unnamed: 0,StanzaNo,LineNo,Text,Phonetic
0,1,1,"<line lineNo=""001"">""Мой дядя самых честных пр<...",maJ DiDi samix Čistnix prAVil
1,1,2,"<line lineNo=""002"">Когда не в шутку занем<stre...",kagda Nifšutku zaNimOk
2,1,3,"<line lineNo=""003"">Он уважать себя заст<stress...",an uvažaT SiBi zastAVil
3,1,4,"<line lineNo=""004"">И лучше выдумать не м<stres...",iluČši vidumaT NimOk
4,1,5,"<line lineNo=""005"">Его пример другим на<stress...",Jiva pRiMir druGim naUka


## Write the rhyme word into a new column

In [9]:
df["RhymeWord"] = df["Phonetic"].str.split().str[-1] # clitics have already been joined
df.head()  # take a look

Unnamed: 0,StanzaNo,LineNo,Text,Phonetic,RhymeWord
0,1,1,"<line lineNo=""001"">""Мой дядя самых честных пр<...",maJ DiDi samix Čistnix prAVil,prAVil
1,1,2,"<line lineNo=""002"">Когда не в шутку занем<stre...",kagda Nifšutku zaNimOk,zaNimOk
2,1,3,"<line lineNo=""003"">Он уважать себя заст<stress...",an uvažaT SiBi zastAVil,zastAVil
3,1,4,"<line lineNo=""004"">И лучше выдумать не м<stres...",iluČši vidumaT NimOk,NimOk
4,1,5,"<line lineNo=""005"">Его пример другим на<stress...",Jiva pRiMir druGim naUka,naUka


## Identify rhyme zone and write into new column

The *rhyme zone* is the portion of the line that participates in line end-rhyme. According to Russian rhyming conventions, the rhyme zone typically begins with the last stressed vowel of the line and continues until the end of the line. The one exception is that open masculine rhyme (that is, rhyme involving stress on a final syllable that ends in a vowel, e.g., **себя́** *[SiBA]*) also requires a *supporting consonant*, that is, it also requires that the consonants *before the stressed vowels* (not otherwise considered part of the rhyme zone) also agree. For example:

* _see_ and _tree_ do not rhyme in Russian because this open (ends in a vowel sound) masculine (stress on the final syllable) rhyme does not have a supporting consonant (consonants before the stressed vowels do not agree).
* *seat* and *treat* do rhyme in Russian because closed (ends in a consonant sound) masculine (stress on the final syllable) rhyme does not require a supporting consonant, so the lack of phonetic correspondence between the consonants before the stressed vowels does not matter.

Russian rhyme may also be *enriched* by phonetic agreement or similarity outside the rhyme zone. For example, *stop* and *strop* constitute a perfect rhyme because the *op* sounds match. Nonetheless, the match of *st* before the rhyme zone enhances, or enriches, the rhyme. The present study ignores enrichment and concentrates only on the core rhyme components, but enrichment will be incorporated into the analysis at a later stage.

With that said, this first pass at identifying the rhyme zone removes the pretonic segments, but not the final consonant of a pretonic onset where a supporting consonant is not needed. More cleaning to follow!

In [10]:
rhymezonepat = re.compile(r'(.?[AEIOU]$)|([AEIOU].*$)')
def remove_pretonic_segments(s: str) -> str: # returns the rhyme zone as a new string, or None if there is no stressed vowel
    try:
        return rhymezonepat.search(s).group(0)
    except AttributeError: # no match; modify this to raise a real error, instead of just reporting
        print(s)
        return None
df["RhymeZone"] = df["RhymeWord"].apply(remove_pretonic_segments)
df.head(14)

iaDivalSi
têti-à-tÊti


Unnamed: 0,StanzaNo,LineNo,Text,Phonetic,RhymeWord,RhymeZone
0,1,1,"<line lineNo=""001"">""Мой дядя самых честных пр<...",maJ DiDi samix Čistnix prAVil,prAVil,AVil
1,1,2,"<line lineNo=""002"">Когда не в шутку занем<stre...",kagda Nifšutku zaNimOk,zaNimOk,Ok
2,1,3,"<line lineNo=""003"">Он уважать себя заст<stress...",an uvažaT SiBi zastAVil,zastAVil,AVil
3,1,4,"<line lineNo=""004"">И лучше выдумать не м<stres...",iluČši vidumaT NimOk,NimOk,Ok
4,1,5,"<line lineNo=""005"">Его пример другим на<stress...",Jiva pRiMir druGim naUka,naUka,Uka
5,1,6,"<line lineNo=""006"">Но, боже мой, какая ск<stre...",nabaži maJ kakaJi skUka,skUka,Uka
6,1,7,"<line lineNo=""007"">С больным сидеть и день и н...",zbaLnim SiDiT iDiN inOČ,inOČ,OČ
7,1,8,"<line lineNo=""008"">Не отходя ни шагу пр<stress...",NiatxaDi Nišagu prOČ,prOČ,OČ
8,1,9,"<line lineNo=""009"">Какое низкое ков<stress>а</...",kakaJi NiskaJi kavArstva,kavArstva,Arstva
9,1,10,"<line lineNo=""010"">Полу-живого забавл<stress>я...",palu-živava zabavLAT,zabavLAT,AT
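
The behavior of the rhyme zone pattern can be checked on a few rhyme words taken from the output above. This is a self-contained sketch that repeats the same pattern with the standard `re` module; the helper name `rhyme_zone` is hypothetical.

```python
import re

# Same pattern as in the cell above: either (optional one segment + stressed
# vowel at the very end) or (stressed vowel + everything after it).
RHYME_ZONE_RE = re.compile(r'(.?[AEIOU]$)|([AEIOU].*$)')


def rhyme_zone(word):
    """Return the rhyme zone of a transcribed word, or None if nothing is stressed."""
    match = RHYME_ZONE_RE.search(word)
    return match.group(0) if match else None


print(rhyme_zone("prAVil"))     # AVil  (closed: starts at the stressed vowel)
print(rhyme_zone("naUka"))      # Uka
print(rhyme_zone("SiBA"))       # BA    (open masculine: keeps the supporting consonant)
print(rhyme_zone("iaDivalSi"))  # None  (no stressed vowel)
```

Only the first alternative can capture a segment before the stressed vowel, and it applies only when that vowel is the last character of the word, which is the open masculine case.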


## Tokenize rhyme zone into C(C) and V

We may have some null values in the rhyme zone for incomplete lines that have no final stressed syllable (*EO* 37:13) and foreign words that we may not be handling properly (*EO* 23:2). Check for those:

In [11]:
df[df["RhymeZone"].isnull()]

Unnamed: 0,StanzaNo,LineNo,Text,Phonetic,RhymeWord,RhymeZone
2302,37,13,"<line lineNo=""013"">И одевался...</line>",iaDivalSi,iaDivalSi,
4702,23,2,"<line lineNo=""002"">Сей неприятный tête-à-t<str...",SiJ NipRiJitniJ têti-à-tÊti,têti-à-tÊti,


Filter them out provisionally by writing in a placeholder value:

In [12]:
df.loc[df["RhymeZone"].isnull(), "RhymeZone"] = "Abcde"  # placeholder rhyme zone
df[df["RhymeZone"].isnull()]  # confirm that no null values remain

Unnamed: 0,StanzaNo,LineNo,Text,Phonetic,RhymeWord,RhymeZone


In [13]:
df["tokenized"] = [x[0] for x in df["RhymeZone"].str.
                   findall(r"(.?)([AEIOU])([^aeiou]*)([aeiou]?)([^aeiou]*)([aeiou]?)([^aeiou]*)([aeiou]?)([^aeiou]*)([aeiou]?)([^aeiou]*)([aeiou]?)([^aeiou]*)([aeiou]?)([^aeiou]*)([aeiou]?)([^aeiou]*)([aeiou]?)")]
i = 0
while np.count_nonzero([item[i] for item in df["tokenized"]]) > 0:  # pd.np is no longer available in current pandas
    # print([item[i] for item in df["tokenized"]]) # diagnostic
    df["token" + str(i)] = [item[i] for item in df["tokenized"]]
    i += 1
tokenheaders = list([item for item in df.columns if re.match(r'token\d', item)])
df[tokenheaders] = df[tokenheaders].replace(r'^$', "missing", regex=True) # replace empty strings with specific value; inplace doesn't work (?)
df.filter(regex=r"StanzaNo|LineNo|RhymeWord|^token\d") # columns we care about

Unnamed: 0,StanzaNo,LineNo,RhymeWord,token0,token1,token2,token3,token4,token5
0,1,1,prAVil,missing,A,V,i,l,missing
1,1,2,zaNimOk,missing,O,k,missing,missing,missing
2,1,3,zastAVil,missing,A,V,i,l,missing
3,1,4,NimOk,missing,O,k,missing,missing,missing
4,1,5,naUka,missing,U,k,a,missing,missing
5,1,6,skUka,missing,U,k,a,missing,missing
6,1,7,inOČ,missing,O,Č,missing,missing,missing
7,1,8,prOČ,missing,O,Č,missing,missing,missing
8,1,9,kavArstva,missing,A,rstv,a,missing,missing
9,1,10,zabavLAT,missing,A,T,missing,missing,missing
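
The long tokenizing pattern is one fixed prefix followed by the same pair of groups repeated eight times, so it can also be assembled programmatically. The sketch below is a hedged illustration with hypothetical names (`TOKEN_PATTERN`, `tokenize_rhyme_zone`), using the standard `re` module rather than pandas.

```python
import re

# Optional supporting consonant + stressed vowel, then eight repetitions of
# (consonant cluster, unstressed vowel); 18 capture groups in total.
TOKEN_PATTERN = re.compile(r"(.?)([AEIOU])" + r"([^aeiou]*)([aeiou]?)" * 8)


def tokenize_rhyme_zone(zone):
    """Return the non-trailing-empty tokens of a rhyme zone as a list."""
    groups = list(TOKEN_PATTERN.findall(zone)[0])
    while groups and groups[-1] == "":
        groups.pop()  # drop empty slots after the last real token
    return groups


print(tokenize_rhyme_zone("Arstva"))  # ['', 'A', 'rstv', 'a']
print(tokenize_rhyme_zone("AVil"))    # ['', 'A', 'V', 'i', 'l']
print(tokenize_rhyme_zone("BA"))      # ['B', 'A']
```

The empty first slot corresponds to the `missing` value in `token0` above: it is filled only when the rhyme zone begins with a supporting consonant.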


## A bit of exploration

### What values appear in which columns

In [14]:
for i in range(int(df.columns[-1][-1]) + 1): # number value of last token column
    print(df.groupby("token" + str(i)).size().nlargest(1000))

token0
missing    4207
n           204
J            88
v            70
N            62
l            52
r            50
t            39
s            33
d            33
m            32
k            30
K            28
L            23
T            20
š            19
B            19
R            16
g            12
Č            10
ž            10
S             9
V             8
X             6
M             5
D             4
Q             4
x             4
b             2
z             2
dtype: int64
token1
A    1482
E    1265
O    1184
I     834
U     336
dtype: int64
token2
missing    900
J          618
t          431
l          359
n          314
m          226
N          178
NJ         154
T          154
k          150
L          111
r          108
f          106
d           98
v           90
x           90
s           55
ts          54
D           46
g           44
ž           42
S           40
M           40
žn          38
R           38
rn          30
Č           30
dn          30
V  

### Is this what we expect?

We can’t compare sound frequencies directly to letter frequencies, and because we don’t have fully stressed texts (either our own poetry or a standard reference corpus), we can’t derive the pronunciation from the orthography. But we can check whether the frequency of stressed (or unstressed) *[U]* in our poetry corpus matches the frequency of the letters **у** and __ю__ in the general reference corpus, since those vowel letters, uniquely, do not participate as either sources or targets of vowel reduction. Let’s check stressed *[U]*.

#### Frequency of *[u]* in Russian in general

In [15]:
# from http://practicalcryptography.com/cryptanalysis/letter-frequencies-various-languages/russian-letter-frequencies/
frequencies = pd.Series({
'А' :  8.04,
'О' : 10.61,
'Е' :  8.21,
'Ё' :  0.22,
'Ы' :  1.91,
'И' :  7.98,
'Э' :  0.31,
'У' :  2.28,
'Ю' :  0.63,
'Я' :  2.00
})
all_total = frequencies.sum()
all_u = frequencies[['У', 'Ю']].sum()
all_u_freq = all_u / all_total
all_u_freq

0.06897369044797344

#### Frequency in our corpus

In [16]:
corpus_total = len(df) # all lines have stressed vowels
corpus_u = len(df[df["token1"].isin(["U"])])
corpus_u_freq = corpus_u / corpus_total
corpus_u_freq

0.06586943736522251

#### Is it significant?

Use a binomial test (<https://en.wikipedia.org/wiki/Binomial_test>, <https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom_test.html>):

In [17]:
stats.binomtest(corpus_u, corpus_total, all_u_freq).pvalue # successes, trials, probability of success; binomtest replaces binom_test, which was removed in SciPy 1.12

0.39184733383988657

The difference is not statistically significant (*p* ≈ 0.39), so the distribution in our corpus is not inconsistent with what we’d expect from a Russian text.
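
As a rough sanity check on the exact test, the same comparison can be made with a normal approximation using only the standard library. This is a sketch of a different (approximate) technique, not the binomial test itself; the inputs are the counts and letter frequencies reported above, and the function name is hypothetical.

```python
import math


def normal_approx_two_sided(successes, trials, p0):
    """Two-sided p-value for a proportion, using the normal approximation."""
    p_hat = successes / trials
    standard_error = math.sqrt(p0 * (1 - p0) / trials)
    z = (p_hat - p0) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# 336 stressed [U] among 5101 lines; expected share (2.28 + 0.63) / 42.19
z, p_value = normal_approx_two_sided(336, 5101, 2.91 / 42.19)
print(round(z, 2), round(p_value, 2))
```

The *z* score is negative because stressed *[U]* is slightly rarer in the corpus than the letter frequencies predict, and the approximate p-value falls in the same range as the exact one, far above conventional significance thresholds.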

We could similarly test the few other places where letters have consistent pronunciation: nasals and liquids. Oral obstruents are all subject to voicing adjustments, and we can’t do anything useful with vowels other than *[u]* because we don’t have stress information for a general corpus and vowels other than *[u]* may be either sources or targets of vowel reduction.