# Data Parsing

My goal with the data parsing is to fit each "branch" and sound change rule in the data into a predefined data format.

How should I go about doing this? There are a few options. I want my end result to allow me to have a kind of connected graph between individual sounds, and I want to be able to see what rules and language branches the connections come from.

Because of this, I think I want to have two different "collections" of objects:

1. Rules
2. Branches

I'm going to opt to output these as JSON files, since I much prefer that format over XML for readability. I may also pickle them so that I can import them more easily when doing visualizations.

Each object type will be its own class, so I can have a set of predefined fields on each.
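As a rough sketch of what these two classes and their JSON output could look like: `from_sound`, `intermediate_steps`, and `to_sound` match the attributes used in the cells further down, but every other field name here (`environment`, `original_text`, `branch_id`, and all of `Branch`) is an assumption for illustration, not the actual script.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Rule:
    # from_sound / intermediate_steps / to_sound mirror attributes used later
    # in these notes; the remaining fields are hypothetical.
    from_sound: str
    to_sound: str
    intermediate_steps: list[str] = field(default_factory=list)
    environment: str = ""
    original_text: str = ""
    branch_id: str = ""


@dataclass
class Branch:
    # All fields on Branch are hypothetical.
    branch_id: str
    name: str
    rule_count: int = 0


rule = Rule(
    from_sound="z",
    to_sound="j",
    original_text="z zː → j dʒː",
    branch_id="example-branch",
)
branch = Branch(branch_id="example-branch", name="Example branch", rule_count=1)

# ensure_ascii=False keeps the IPA characters readable in the JSON output.
rules_json = json.dumps([asdict(rule)], ensure_ascii=False, indent=2)
branches_json = json.dumps([asdict(branch)], ensure_ascii=False, indent=2)
print(rules_json)
```

Keeping a `branch_id` on each rule (rather than nesting rules inside branches) is one way to keep the two collections separate while still being able to trace a connection back to its branch.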

## Catching issues

My first version of the data parsing script (2/21/23) worked somewhat well, but some rules were parsed incorrectly. Here's a rundown of what was still not working:

1. Sound changes with multiple correlated rules -- e.g. `z zː → j dʒː` -- were being parsed as single rules.
2. Sounds grouped together -- e.g. in `{s3,ʒ} → ʃ / #_` -- were being parsed as one sound.
3. Sounds like `S[+ voiced]` were being parsed as two separate sounds -- `S[+` and `voiced]` -- because I was naively splitting at all spaces.
4. Some extra text was not being filtered out correctly, such as the quotation in `ɨ u → e {i,e} “(all */u/ affected, but conditions for when it became /i/ or /e/ are not known)”`.
5. Rules like `rdʒ → {rdʒ,rdz(→ rz)}` that had optional steps were not parsed correctly at all.
6. Optional modifiers to sounds, like `ts(ʼ)`, were parsed as one rule rather than two (`ts` and `tsʼ`).

There were almost certainly other issues, but these were the ones I noticed. Some of these were going to be harder to fix than others.
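To make issues 2 and 3 from the list above concrete, here is a minimal sketch of splitting on spaces without breaking `[...]` or `{...}` apart, and of expanding a `{a,b}` group into separate sounds. The function names and the regex are my own illustration and are not taken from `data_parsing_script`.

```python
import re

# A token is a run of non-space characters in which [...] feature bundles and
# {...} groups count as single units, so spaces inside them do not split it.
TOKEN = re.compile(r"(?:\[[^\]]*\]|\{[^}]*\}|[^\s\[\]{}])+")


def split_sounds(side: str) -> list[str]:
    """Split one side of a rule on spaces, keeping bracketed parts whole."""
    return TOKEN.findall(side)


def expand_group(token: str) -> list[str]:
    """Expand every {a,b} group in a token into one sound per option."""
    match = re.search(r"\{([^}]*)\}", token)
    if match is None:
        return [token]
    prefix, suffix = token[:match.start()], token[match.end():]
    expanded = []
    for option in match.group(1).split(","):
        # Recurse in case the token contains another group further along.
        expanded.extend(expand_group(prefix + option.strip() + suffix))
    return expanded


print(split_sounds("S[+ voiced] z"))  # ['S[+ voiced]', 'z']
print(expand_group("{s3,ʒ}"))         # ['s3', 'ʒ']
print(expand_group("w{ɪ,i}"))         # ['wɪ', 'wi']
```

This sketch does not handle nested or unbalanced brackets; those would still need manual edits.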

As I fix some of these cases (+ others as I find them), I'll include them below.

In some cases, making the parser handle a situation was not worth it because only a negligible number of rules would benefit, so I manually modified those rules so they would parse correctly. I also did this for rules containing plain text like "occasionally". I saved the edits in a new file, `sid-tidy-with-edits.html`, so the unedited version would still be accessible.

There are also some rules where multiple end results are listed due to the source's author "hedging", so I edited those manually to treat each end result as its own rule.

I also decided to make it so that anything within backticks is ignored when parsing but included in the 'original text' field of the rule. This lets me manually handle rules like `r → *L (some sort of lateral?) / occasionally` with extra stuff that I want to ignore, without losing that information.
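A minimal sketch of how the backtick handling could work, assuming the original-text field keeps the comment but drops the backtick characters themselves (that detail, and the name `strip_ignored`, are assumptions):

```python
import re

# Anything between a pair of backticks is hidden from the parser.
BACKTICK_SPAN = re.compile(r"`[^`]*`")


def strip_ignored(text: str) -> tuple[str, str]:
    """Return (text_to_parse, original_text) for one rule string."""
    # The original text keeps the comment, minus the backtick markers.
    original = text.replace("`", "")
    # The text handed to the parser drops the whole backticked span, then
    # collapses the leftover double spaces.
    to_parse = " ".join(BACKTICK_SPAN.sub("", text).split())
    return to_parse, original


to_parse, original = strip_ignored("r → *L `(some sort of lateral?)` / occasionally")
print(to_parse)   # r → *L / occasionally
print(original)   # r → *L (some sort of lateral?) / occasionally
```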

Another oddity is this fun one: `s → c& _ (the paper doesn’t explain what this represents)`. How am I supposed to parse this??? Who knows. I'm going to just manually add an environment separator and assume it's supposed to mean "in any environment."

I also realized there were simply typos in the data in some cases, such as `ŋ → {∅,n} #_ else`, which is missing the environment separator before `#_`. I manually fixed those.
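One way to hunt for typos like this is to flag any rule that contains a position marker `_` but no environment separator. This is only a heuristic sketch: it assumes the separator is always a `/` surrounded by whitespace, and `missing_separator` is a hypothetical helper, not part of the script.

```python
import re

# Assumption: the environment separator is a "/" with whitespace on both sides.
SEPARATOR = re.compile(r"\s/\s")


def missing_separator(rule_text: str) -> bool:
    """Flag rules that mention a position marker `_` but have no separator."""
    return "_" in rule_text and SEPARATOR.search(rule_text) is None


candidates = [
    "ŋ → {∅,n} #_ else",
    "{s3,ʒ} → ʃ / #_",
    "z zː → j dʒː",
]
for text in candidates:
    if missing_separator(text):
        print("check manually:", text)
```

Anything this flags still needs a human look, since it cannot tell a typo from a rule that is just written oddly.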

Fortljus Ryggrad's *[Corrections, Clarifications, and Uncertainties of Index Diachronica](https://drive.google.com/file/d/1veWbeZhXUZjUtGZZezyvCvF6BA103wS8/view?usp=sharing)* also helped me resolve some of the oddities I ran into when parsing.

In [25]:
import importlib
import data_parsing_script as dps
importlib.reload(dps)

dps.parse_rule_steps('z zː → j dʒː')

[('z', [], 'j'), ('zː', [], 'dʒː')]

In [24]:
importlib.reload(dps)

dps.parse_rule_steps('S → [+ voice]')

[('S', [], '[+ voice]')]

In [23]:
importlib.reload(dps)

dps.parse_rule_steps('— j w → i u')

[('j', [], 'i'), ('w', [], 'u')]

In [22]:
importlib.reload(dps)

dps.parse_rule_steps('V[- high - long] → ∅')

[('V[- high - long]', [], '∅')]

In [86]:
importlib.reload(dps)

[(rule.from_sound, rule.intermediate_steps, rule.to_sound) for rule in dps.parse_sound_change('ew (→ øj) (→ øj) → yj', '', '')]

[('ew', ['øj'], 'yj'), ('ew', [], 'yj')]

In [96]:
importlib.reload(dps)

dps.parse_rule_steps('{s3,ʒ} → ʃ')

bracketed sound {s3,ʒ}


[('s3', [], 'ʃ'), ('ʒ', [], 'ʃ')]

In [101]:
importlib.reload(dps)

[(rule.from_sound, rule.intermediate_steps, rule.to_sound) for rule in dps.parse_sound_change('tʃʷ(ʼ) tɕ(ʼ) dʒʷ dʑ → f(ʼ) ts(ʼ) v dz', '', '')]

[('tʃʷʼ', [], 'fʼ'),
 ('tɕʼ', [], 'tsʼ'),
 ('dʒʷ', [], 'v'),
 ('dʑ', [], 'dz'),
 ('tʃʷ', [], 'f'),
 ('tɕ', [], 'ts')]

In [103]:
importlib.reload(dps)

[(rule.from_sound, rule.intermediate_steps, rule.to_sound) for rule in dps.parse_sound_change('rdʒ → {rdʒ,rdz(→ rz)}', '', '')]

[('rdʒ', ['{rdʒ,rdz'], 'rz}'), ('rdʒ', [], '{rdʒ,rdz}')]

In [105]:
importlib.reload(dps)

[(rule.from_sound, rule.intermediate_steps, rule.to_sound) for rule in
  dps.parse_sound_change('d ɡ → t k (may have been part of a more sweeping merger; Firespeaker calls it “lenis-fortis”)', '', '')]

[('d', [], 't'), ('ɡ', [], 'k')]

In [108]:
importlib.reload(dps)

[(rule.from_sound, rule.intermediate_steps, rule.to_sound) for rule in
  dps.parse_sound_change('r → *L `(some sort of lateral?)` / occasionally', '', '')]

[('r', [], '*L', '')]

In [118]:
importlib.reload(dps)

dps.parse_rule_steps('{æ,e}i ai au w{ɪ,i} wy wV iu Vː → eː aː oː weː woː wVː eː oː V')

[('æi', [], 'eː'),
 ('ei', [], 'eː'),
 ('ai', [], 'aː'),
 ('au', [], 'oː'),
 ('wɪ', [], 'weː'),
 ('wi', [], 'weː'),
 ('wy', [], 'woː'),
 ('wV', [], 'wVː'),
 ('iu', [], '{eː,oː}'),
 ('Vː', [], 'V')]

...AND IT ALL PARSES! Only took like 4 hours of running it over and over and fixing any errors that were hit!!

However, there is still some data parsing work to take care of:

1. Rules with several identical optional elements (e.g. `(C)(C)`) are only parsed as all-or-nothing, so the variant with just one `C` is never generated. After fixing this, rules where `(ː)` on both sides of the rule means "long becomes long, short becomes short" still need to be handled correctly (e.g. `N(ː) k k(ː) N(ː)ɡ ɡ(ː) ɣ → ɲc(ː) c(ː) ɲɟ(ː) ɟ(ː) ʝ / _{i,j}`).
2. Go through everything and look for any more obvious errors. (Maybe sort individual sounds by length to find outliers?)
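For the first item, one possible approach is to treat every `(x)` as an independent keep-or-drop choice and enumerate the combinations with `itertools.product`. The sketch below is an assumption about how this could be done, not what the parser currently does; `expand_optionals` and `expand_pair` are hypothetical names, and nested parentheses are not handled.

```python
import itertools
import re

OPTIONAL = re.compile(r"\(([^()]*)\)")


def _build(parts: list[str], choice: tuple[bool, ...]) -> str:
    """Rebuild a token from re.split parts, keeping only the chosen optionals."""
    # With one capture group, re.split alternates literal text (even indices)
    # with the contents of each optional (odd indices).
    pieces = [parts[0]]
    for keep, optional, literal in zip(choice, parts[1::2], parts[2::2]):
        pieces.append(optional if keep else "")
        pieces.append(literal)
    return "".join(pieces)


def expand_optionals(token: str) -> list[str]:
    """Every distinct variant with each (x) kept or dropped independently."""
    parts = OPTIONAL.split(token)
    variants = []
    for choice in itertools.product([False, True], repeat=len(parts) // 2):
        variant = _build(parts, choice)
        if variant not in variants:
            variants.append(variant)
    return variants


def expand_pair(source: str, target: str) -> list[tuple[str, str]]:
    """Apply the same keep/drop choices to both sides of a rule."""
    s_parts, t_parts = OPTIONAL.split(source), OPTIONAL.split(target)
    if len(s_parts) != len(t_parts):
        raise ValueError("sides have different numbers of optionals")
    pairs = []
    for choice in itertools.product([False, True], repeat=len(s_parts) // 2):
        pair = (_build(s_parts, choice), _build(t_parts, choice))
        if pair not in pairs:
            pairs.append(pair)
    return pairs


print(expand_optionals("(C)(C)V"))   # ['V', 'CV', 'CCV']
print(expand_pair("k(ː)", "c(ː)"))   # [('k', 'c'), ('kː', 'cː')]
```

Using the same `choice` tuple for both sides in `expand_pair` is what gives the "long becomes long, short becomes short" reading; it only works when both sides have the same number of optionals.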

So here's what the parser is doing, by the way: