# Experiment with decomposing rhyme scheme

## Overview

### General

For every procedure below, find the maximum size of the domain (number of syllables, segments, features, etc.) and write empty strings (not `None` or `NaN`) wherever a value is missing.
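The padding rule can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's own code: `pad_to_max` is a hypothetical helper, and the three sample rows are taken from the sample data used later.

```python
def pad_to_max(rows, fill=""):
    """Pad every list in `rows` with `fill` so that all share the longest length."""
    width = max((len(row) for row in rows), default=0)  # the maximum domain
    return [list(row) + [fill] * (width - len(row)) for row in rows]


# Rhyme zones of one, two, and three syllables
syllables = [["BA"], ["U", "ka"], ["I", "Vi", "STi"]]
padded = pad_to_max(syllables)
for row in padded:
    print(row)
```

Using empty strings rather than missing-value markers keeps every cell a string, so string operations such as `len` work uniformly on all cells.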

### Syllable-based procedure

1. Create column for each syllable
1. Split syllables into onset, nucleus, coda
1. Split onset and coda into segments
1. Decompose segments into features (not implemented for syllable-based procedure; see below)

Experimentation with sample data revealed that syllable decomposition is problematic for potential rhymes such as **вы́бора** *[vI-ba-ra]* ~ **вы́борка** *[vI-bar-ka]*: the *[r]* is the onset of the second post-tonic syllable in one word and the coda of the first post-tonic syllable in the other, so it would not align naturally in the DataFrame. This prompted a mid-stream reorientation toward dividing the strings into alternating sequences of vowels and consonants (single consonants or consonant clusters), described below.
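The misalignment can be reproduced with a small sketch. The helpers `split_syllable`, `syllable_slots`, and `cv_units` are hypothetical names introduced for illustration, and the vowel inventory is assumed to be the five letters used in the sample transcriptions.

```python
import re

# One syllable = optional onset consonants, one vowel, optional coda consonants
SYLLABLE = re.compile(r"^([^aeiouAEIOU]*)([aeiouAEIOU])([^aeiouAEIOU]*)$")


def split_syllable(syll):
    """Return (onset, nucleus, coda) for one syllable, or three empty strings."""
    m = SYLLABLE.match(syll)
    return m.groups() if m else ("", "", "")


def syllable_slots(sylls):
    """Flatten a list of syllables into onset/nucleus/coda slots, in order."""
    return [part for s in sylls for part in split_syllable(s)]


def cv_units(rz):
    """Split a rhyme zone into alternating vowels and consonant clusters."""
    return re.findall(r"[aeiouAEIOU]|[^aeiouAEIOU]+", rz)


slots_a = syllable_slots(["I", "ba", "ra"])   # vybora
slots_b = syllable_slots(["I", "bar", "ka"])  # vyborka
print(slots_a.index("r"), slots_b.index("r"))  # r lands in different slots

units_a = cv_units("Ibara")
units_b = cv_units("Ibarka")
print(units_a, units_b)  # r is in the same unit position in both words
```

In the syllable-based layout the *[r]* occupies slot 6 (onset of `t2`) in one word and slot 5 (coda of `t1`) in the other, while in the C/V layout it sits in the unit at position 3 in both.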

### C/V-based procedure

1. Divide the rhyme zone into alternating units of vowels and consonant clusters. A pretonic consonant is included only for open masculine rhymes, and then only a single consonant. Column labels have the form `tokenx`, where `token` is a literal string and `x` is an integer counting up from 0; `token0` is the pretonic slot and is empty when there is no pretonic consonant. Consonant and vowel columns are not distinguished by label.
1. Rhyme identification operates at the segment (or consonant-cluster) level rather than by feature, pending a decision about how to align features when some columns contain one segment and others contain more than one.
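Step 1 might be sketched as a standalone function. This is an illustration under stated assumptions, not the notebook's implementation: `tokenize_rz` is a hypothetical helper, it treats any of `aeiouAEIOU` as a vowel, and it assumes the rhyme zone begins either with the stressed vowel or with exactly one pretonic consonant.

```python
import re

VOWELS = "aeiouAEIOU"  # assumed vowel inventory of the transcription


def tokenize_rz(rz):
    """Return {'token0': ..., 'token1': ...} for one rhyme-zone string.

    token0 is the pretonic consonant (empty if absent); the remaining
    tokens alternate between single vowels and consonant clusters.
    """
    m = re.match(rf"([^{VOWELS}]?)(.*)$", rz)
    pretonic, rest = m.group(1), m.group(2)
    units = re.findall(rf"[{VOWELS}]|[^{VOWELS}]+", rest)
    return {f"token{i}": unit for i, unit in enumerate([pretonic] + units)}


for rz in ["BA", "OrdaST", "IViSTi"]:
    print(rz, tokenize_rz(rz))
```

Unlike a pattern with a fixed number of capture groups, this version handles rhyme zones of any length, at the cost of producing rows of unequal width that still need padding with empty strings.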

## Shared initialization

### Load libraries

In [2]:
import pandas as pd
import regex as re

### Create sample data and write into df

In [20]:
words = [
    ['BA'], # sebja (open masculine)
    ['Ok'], # mog (closed masculine)
    ['AST'], # strast' (closed masculine with coda cluster)
    ['Instv'], # menšynstv (closed masculine with coda cluster)
    ['U', 'ka'], # nauka (open feminine)
    ['A', 'Vil'], # pravil (closed feminine)
    ['I', 'graT'], # vygrat' (closed feminine with post-tonic onset cluster) 
    ['Or', 'daST'], # gordost' (closed feminine with post-tonic coda cluster)
    ['U', 'pnaST'], # sovokupnost' (closed feminine with post-tonic onset and coda clusters)
    ['I', 'Vi', 'STi'], # vyvesti (dactyl)
    ['E', 'tska', 'va'], # sovetskogo (dactyl)
    ['I', 'ba', 'ra'], # vybora
    ['I', 'bar', 'ka'] # vyborka
]
df = pd.DataFrame()
df["rz"] = ["".join(item) for item in words]
df["rzs"] = [item for item in words] # rhyme zone syllables
df

| | rz | rzs |
|---|---|---|
| 0 | BA | [BA] |
| 1 | Ok | [Ok] |
| 2 | AST | [AST] |
| 3 | Instv | [Instv] |
| 4 | Uka | [U, ka] |
| 5 | AVil | [A, Vil] |
| 6 | IgraT | [I, graT] |
| 7 | OrdaST | [Or, daST] |
| 8 | UpnaST | [U, pnaST] |
| 9 | IViSTi | [I, Vi, STi] |


## Syllable-based procedure (code)

### Create column for each syllable

`t0` = tonic, `t1` = first post-tonic, etc.

In [21]:
df["syllcounts"] = df["rzs"].apply(len) 
m = df["syllcounts"].max() # longest word in syllable count; hold on to this for processing later
for i in range(m): # Use max syllable count in rzs to create tonic, posttonic, etc. columns
    df['t' + str(i)] = [x[i] if len(x) > i else '' for x in df["rzs"] ]
df

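| | rz | rzs | syllcounts | t0 | t1 | t2 |
|---|---|---|---|---|---|---|
| 0 | BA | [BA] | 1 | BA | | |
| 1 | Ok | [Ok] | 1 | Ok | | |
| 2 | AST | [AST] | 1 | AST | | |
| 3 | Instv | [Instv] | 1 | Instv | | |
| 4 | Uka | [U, ka] | 2 | U | ka | |
| 5 | AVil | [A, Vil] | 2 | A | Vil | |
| 6 | IgraT | [I, graT] | 2 | I | graT | |
| 7 | OrdaST | [Or, daST] | 2 | Or | daST | |
| 8 | UpnaST | [U, pnaST] | 2 | U | pnaST | |
| 9 | IViSTi | [I, Vi, STi] | 3 | I | Vi | STi |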


### Split syllables into onset, nucleus, coda

In [22]:
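syllcols = df.filter(regex=r"^t\d+$") # raw string, so \d is not treated as a string escape
for syllcol in syllcols:
    headers = [syllcol + i for i in ['o', 'n', 'c']]
    # onset = everything before the vowel, nucleus = the vowel, coda = everything after it
    df[headers] = df[syllcol].str.extract(r'^(.*)([aeiouAEIOU])(.*)$')
df.fillna(value='', inplace=True) # replace missing values (None and NaN) with empty string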
df

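| | rz | rzs | syllcounts | t0 | t1 | t2 | t0o | t0n | t0c | t1o | t1n | t1c | t2o | t2n | t2c |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BA | [BA] | 1 | BA | | | B | A | | | | | | | |
| 1 | Ok | [Ok] | 1 | Ok | | | | O | k | | | | | | |
| 2 | AST | [AST] | 1 | AST | | | | A | ST | | | | | | |
| 3 | Instv | [Instv] | 1 | Instv | | | | I | nstv | | | | | | |
| 4 | Uka | [U, ka] | 2 | U | ka | | | U | | k | a | | | | |
| 5 | AVil | [A, Vil] | 2 | A | Vil | | | A | | V | i | l | | | |
| 6 | IgraT | [I, graT] | 2 | I | graT | | | I | | gr | a | T | | | |
| 7 | OrdaST | [Or, daST] | 2 | Or | daST | | | O | r | d | a | ST | | | |
| 8 | UpnaST | [U, pnaST] | 2 | U | pnaST | | | U | | pn | a | ST | | | |
| 9 | IViSTi | [I, Vi, STi] | 3 | I | Vi | STi | | I | | V | i | | ST | i | |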


### Split onset, nucleus, and coda into segments

(Deferred pending decision about how to write consonant clusters into columns.)

In [23]:
# Columns of interest match ^t\d[onc]$
syllpartcols = df.filter(regex=r"^t\d[onc]$") # raw string, so \d is not treated as a string escape
# for col in syllpartcols:
#     m = syllpartcols[col].apply(len).max()
#     for i in range(m):
#         print(col + '-' + str(i+1))
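Since this step is deferred, the following is only a sketch of one possible layout, not a decision: each cluster is split into single characters, left-aligned, and padded with empty strings up to the longest cluster in the column. The helper `split_cluster_column` is hypothetical, the `name-1`, `name-2` labels follow the commented-out `print` above, and the code assumes one character per segment, as in the sample transcription where uppercase letters stand for palatalized consonants.

```python
def split_cluster_column(name, values):
    """Split each cluster in `values` into per-segment columns.

    Returns a dict mapping 'name-1', 'name-2', ... to lists of single
    segments, left-aligned and padded with empty strings.
    """
    width = max((len(v) for v in values), default=0)
    return {
        f"{name}-{i + 1}": [v[i] if len(v) > i else "" for v in values]
        for i in range(width)
    }


# The t1o (first post-tonic onset) column from the table above
t1o = ["", "", "", "", "k", "V", "gr", "d", "pn", "V"]
columns = split_cluster_column("t1o", t1o)
for label, cells in columns.items():
    print(label, cells)
```

Left alignment is itself an assumption; right alignment (or alignment by sonority) might suit codas better, which is part of the open question.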

## C/V-based procedure

### Tokenize rhyme zone into C(C) and V

In [39]:
df["tokenized"] = [x[0] for x in df["rz"].str.findall(r"(.?)([AEIOU])([^aeiou]*)([aeiou]?)([^aeiou]*)([aeiou]?)([^aeiou]*)([aeiou]?)")]
i = 0
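# Add one column per token position; stop at the first position that is empty for every word.
# any() replaces pd.np.count_nonzero, because the pd.np alias is gone from recent pandas releases.
while i < 8 and any(item[i] for item in df["tokenized"]): # the pattern has 8 capture groups
    # print([item[i] for item in df["tokenized"]]) # diagnostic
    df["token" + str(i)] = [item[i] for item in df["tokenized"]]
    i += 1
tokencols = df.filter(regex=r"^token\d$")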
tokencols

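| | token0 | token1 | token2 | token3 | token4 | token5 |
|---|---|---|---|---|---|---|
| 0 | B | A | | | | |
| 1 | | O | k | | | |
| 2 | | A | ST | | | |
| 3 | | I | nstv | | | |
| 4 | | U | k | a | | |
| 5 | | A | V | i | l | |
| 6 | | I | gr | a | T | |
| 7 | | O | rd | a | ST | |
| 8 | | U | pn | a | ST | |
| 9 | | I | V | i | ST | i |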
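With the tokens aligned by column, segment-level rhyme identification reduces to comparing cells position by position. A minimal sketch, using rows 7 and 8 of the table above; `matching_columns` is a hypothetical helper, and exact string equality is assumed as the matching criterion.

```python
def matching_columns(tokens_a, tokens_b):
    """Compare two token rows position by position; True where the units are identical."""
    return [a == b for a, b in zip(tokens_a, tokens_b)]


# Rows 7 (OrdaST) and 8 (UpnaST), token0 through token4
ordast = ["", "O", "rd", "a", "ST"]
upnast = ["", "U", "pn", "a", "ST"]

matches = matching_columns(ordast, upnast)
print(matches)  # the empty pretonic slots compare equal, as do 'a' and 'ST'
```

Note that two empty cells count as a match here; whether empty positions should contribute to a rhyme score is a separate design decision.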


## Feature decomposition

### Prepare feature dictionary

In [10]:
# https://www.kaggle.com/jboysen/quick-tutorial-flatten-nested-json-in-pandas
import json
from pandas import json_normalize # top-level location since pandas 1.0; the pandas.io.json path is deprecated
with open('features.json') as f:
    d = json.load(f)
d["segments"][:2]

[{'p': [{'Syllabic': '0'},
   {'Sonorant': '0'},
   {'Anterior': '1'},
   {'Coronal': '0'},
   {'Palatalized': '0'},
   {'Nasal': '0'},
   {'Voiced': '0'},
   {'Continuant': '0'},
   {'Lateral': '0'},
   {'Delayedrelease': '0'}]},
 {'P': [{'Syllabic': '0'},
   {'Sonorant': '0'},
   {'Anterior': '1'},
   {'Coronal': '0'},
   {'Palatalized': '1'},
   {'Nasal': '0'},
   {'Voiced': '0'},
   {'Continuant': '0'},
   {'Lateral': '0'},
   {'Delayedrelease': '0'}]}]
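In this nested form, each segment maps to a list of one-item dictionaries, so reading a single feature value means scanning that list. The sketch below illustrates the lookup on the two entries printed above; `lookup` is a hypothetical helper, and the data is copied from that output rather than read from `features.json`.

```python
# The two entries printed above, copied verbatim
segments = [
    {"p": [{"Syllabic": "0"}, {"Sonorant": "0"}, {"Anterior": "1"},
           {"Coronal": "0"}, {"Palatalized": "0"}, {"Nasal": "0"},
           {"Voiced": "0"}, {"Continuant": "0"}, {"Lateral": "0"},
           {"Delayedrelease": "0"}]},
    {"P": [{"Syllabic": "0"}, {"Sonorant": "0"}, {"Anterior": "1"},
           {"Coronal": "0"}, {"Palatalized": "1"}, {"Nasal": "0"},
           {"Voiced": "0"}, {"Continuant": "0"}, {"Lateral": "0"},
           {"Delayedrelease": "0"}]},
]


def lookup(segments, segment, feature):
    """Return the value of `feature` for `segment` in the nested structure."""
    for entry in segments:
        if segment in entry:
            for pair in entry[segment]:  # each pair is a one-item dict
                if feature in pair:
                    return pair[feature]
    raise KeyError((segment, feature))


print(lookup(segments, "p", "Palatalized"), lookup(segments, "P", "Palatalized"))
```

The double scan is awkward for repeated lookups, which is why a flat `segment -> {feature: value}` mapping is more convenient to work with.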

In [11]:
flattened = {}
for item in d["segments"]:
    (key, value), = item.items() # key is the phoneme; value is its list of one-item feature dictionaries
    flattened[key] = {k: v for pair in value for k, v in pair.items()} # flatten list of one-item dictionaries to key:value pairs
print(flattened)

{'p': {'Syllabic': '0', 'Sonorant': '0', 'Anterior': '1', 'Coronal': '0', 'Palatalized': '0', 'Nasal': '0', 'Voiced': '0', 'Continuant': '0', 'Lateral': '0', 'Delayedrelease': '0'}, 'P': {'Syllabic': '0', 'Sonorant': '0', 'Anterior': '1', 'Coronal': '0', 'Palatalized': '1', 'Nasal': '0', 'Voiced': '0', 'Continuant': '0', 'Lateral': '0', 'Delayedrelease': '0'}, 'b': {'Syllabic': '0', 'Sonorant': '0', 'Anterior': '1', 'Coronal': '0', 'Palatalized': '0', 'Nasal': '0', 'Voiced': '1', 'Continuant': '0', 'Lateral': '0', 'Delayedrelease': '0'}, 'B': {'Syllabic': '0', 'Sonorant': '0', 'Anterior': '1', 'Coronal': '0', 'Palatalized': '1', 'Nasal': '0', 'Voiced': '1', 'Continuant': '0', 'Lateral': '0', 'Delayedrelease': '0'}, 't': {'Syllabic': '0', 'Sonorant': '0', 'Anterior': '1', 'Coronal': '1', 'Palatalized': '0', 'Nasal': '0', 'Voiced': '0', 'Continuant': '0', 'Lateral': '0', 'Delayedrelease': '0'}, 'T': {'Syllabic': '0', 'Sonorant': '0', 'Anterior': '1', 'Coronal': '1', 'Palatalized': '1', '