# Data Parsing

**[View the data parsing script!](data_parsing_script.py)**

My goal with the data parsing is to fit each "branch" and sound change rule in the data into a predefined data format.

How should I go about doing this? There are a few options. I want my end result to allow me to have a kind of connected graph between individual sounds, and I want to be able to see what rules and language branches the connections come from.

Because of this, I think I want to have two different "collections" of objects:

1. Rules
2. Branches

I'm going to opt to output these as JSON files, since I much prefer that format over XML for readability. I may also pickle them so that I can import them more easily when doing visualizations.

Each object type will be its own class, so I can have a set of predefined fields on each.
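
A minimal sketch of what those two classes might look like, assuming plain dataclasses. The field names are taken from the JSON output shown at the end of this page; the defaults, field order, and docstrings are assumptions, and the real definitions in `data_parsing_script.py` may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """One single-sound change, e.g. dʒ → ʒ."""
    id: str
    original_text: str   # the rule exactly as written in the source
    from_sound: str
    to_sound: str
    environment: str = ''
    intermediate_steps: list[str] = field(default_factory=list)


@dataclass
class Branch:
    """One parent-to-daughter language branch."""
    id: str
    index: str    # section number in the source, e.g. '10.3.5.8.2'
    name: str
    source: str


rule = Rule(
    id='Eastern-Libyan-Arabic-dˤ-dʒ-q',
    original_text='dˤ dʒ q → ðˤ ʒ ɡ',
    from_sound='dʒ',
    to_sound='ʒ',
)
```

Using `field(default_factory=list)` gives every rule its own empty list of intermediate steps rather than one shared list.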

## Catching issues

My first version of the data parsing script (2/21/23) worked somewhat well, but some rules were parsed incorrectly. Here's a rundown of what was still not working:

1. Sound changes with multiple correlated rules -- e.g. `z zː → j dʒː` -- were being parsed as single rules.
2. Sounds grouped together -- e.g. in `{s3,ʒ} → ʃ / #_` -- were being parsed as one sound.
3. Sounds like `S[+ voiced]` were being parsed as two separate sounds -- `S[+` and `voiced]` -- because I was naively splitting at all spaces.
4. Some extra text was not being filtered out correctly, such as the quotation in `ɨ u → e {i,e} “(all */u/ affected, but conditions for when it became /i/ or /e/ are not known)”`.
5. Rules like `rdʒ → {rdʒ,rdz(→ rz)}` that had optional steps were not parsed correctly at all.
6. Optional modifiers to sounds, like `ts(ʼ)`, were parsed as one rule rather than two (`ts` and `tsʼ`).

**Have no idea what this means?** I don't blame you. I go into more detail in the ["Parsing sound changes" section](#parsing-sound-changes).

There were almost certainly other issues, but these were the ones I noticed. Some of these were going to be harder to fix than others.

In some cases, making the parser handle a particular pattern was not worth it because only a negligible number of rules used it, so I manually modified those rules so they would parse correctly. I also did this for rules containing plain text like "occasionally". I saved the edits in a new file, `sid-tidy-with-edits.html`, so the unedited version would still be accessible.

There are also some rules where multiple end results are listed due to the source's author "hedging", so I edited those manually to consider both end results as their own rules.

I also decided to make it so that anything within backticks is ignored when parsing but included in the 'original text' field of the rule. This lets me manually handle rules like `r → *L (some sort of lateral?) / occasionally` with extra stuff that I want to ignore, without losing that information.
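
A rough sketch of that backtick handling, assuming a simple regex substitution. `strip_ignored` is a hypothetical helper name, not necessarily what the script uses.

```python
import re


def strip_ignored(rule_text: str) -> str:
    """Return rule_text with backtick-quoted spans removed (hypothetical helper)."""
    without_ticks = re.sub(r'`[^`]*`', '', rule_text)
    # Collapse the double space left behind where a span was removed
    return re.sub(r'\s+', ' ', without_ticks).strip()


original_text = 'r → *L `(some sort of lateral?)` / occasionally'
parsed_text = strip_ignored(original_text)
print(parsed_text)  # r → *L / occasionally
```

The untouched `original_text` would go into the rule's 'original text' field, while only `parsed_text` is handed to the parser.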

Another oddity is this fun one: `s → c& _ (the paper doesn’t explain what this represents)`. How am I supposed to parse this??? Who knows. I'm going to just manually add an environment separator and assume it's supposed to mean "in any environment."

I also realized there were some cases where there were simply typos in the data, such as `ŋ → {∅,n} #_ else` (which is missing the environment separator before `#_`.) I manually fixed those.

Fortljus Ryggrad's *[Corrections, Clarifications, and Uncertainties of Index Diachronica](https://drive.google.com/file/d/1veWbeZhXUZjUtGZZezyvCvF6BA103wS8/view?usp=sharing)* helped me resolve some of the oddities I ran into when parsing as well.

Here are some of the broken cases I fixed (plus others I found along the way):

In [2]:
import importlib
import data_parsing_script as dps
importlib.reload(dps)

dps.parse_rule_steps('z zː → j dʒː')

[('z', [], 'j'), ('zː', [], 'dʒː')]

In [3]:
importlib.reload(dps)

dps.parse_rule_steps('S → [+ voice]')

[('S', [], '[+ voice]')]

In [4]:
importlib.reload(dps)

dps.parse_rule_steps('— j w → i u')

[('j', [], 'i'), ('w', [], 'u')]

In [5]:
importlib.reload(dps)

dps.parse_rule_steps('V[- high - long] → ∅')

[('V[- high - long]', [], '∅')]

In [6]:
importlib.reload(dps)

[(rule.from_sound, rule.intermediate_steps, rule.to_sound) for rule in dps.parse_sound_change('ew (→ øj) → yj', '', '')]

[('ew', ['øj'], 'yj'), ('ew', [], 'yj')]

In [7]:
importlib.reload(dps)

dps.parse_rule_steps('{s3,ʒ} → ʃ')

[('s3', [], 'ʃ'), ('ʒ', [], 'ʃ')]

In [8]:
importlib.reload(dps)

[(rule.from_sound, rule.intermediate_steps, rule.to_sound) for rule in dps.parse_sound_change('tʃʷ(ʼ) tɕ(ʼ) dʒʷ dʑ → f(ʼ) ts(ʼ) v dz', '', '')]

[('tʃʷʼ', [], 'fʼ'),
 ('tɕʼ', [], 'tsʼ'),
 ('dʒʷ', [], 'v'),
 ('dʑ', [], 'dz'),
 ('tʃʷ', [], 'f'),
 ('tɕ', [], 'ts')]

In [9]:
importlib.reload(dps)

[(rule.from_sound, rule.intermediate_steps, rule.to_sound) for rule in dps.parse_sound_change('rdʒ → {rdʒ,rdz}', '', '')]

[('rdʒ', [], 'rdʒ'), ('rdʒ', [], 'rdz')]

In [10]:
importlib.reload(dps)

[(rule.from_sound, rule.intermediate_steps, rule.to_sound) for rule in
  dps.parse_sound_change('d ɡ → t k (may have been part of a more sweeping merger; Firespeaker calls it “lenis-fortis”)', '', '')]

[('d', [], 't'), ('ɡ', [], 'k')]

In [11]:
importlib.reload(dps)

[(rule.from_sound, rule.intermediate_steps, rule.to_sound) for rule in
  dps.parse_sound_change('r → *L `(some sort of lateral?)` / occasionally', '', '')]

[('r', [], '*L')]

In [12]:
importlib.reload(dps)

dps.parse_rule_steps('{æ,e}i ai au w{ɪ,i} wu wV i u V → eː aː oː weː woː wVː eː oː Vː')

[('æi', [], 'eː'),
 ('ei', [], 'eː'),
 ('ai', [], 'aː'),
 ('au', [], 'oː'),
 ('wɪ', [], 'weː'),
 ('wi', [], 'weː'),
 ('wu', [], 'woː'),
 ('wV', [], 'wVː'),
 ('i', [], 'eː'),
 ('u', [], 'oː'),
 ('V', [], 'Vː')]

In [13]:
importlib.reload(dps)

[(rule.from_sound, rule.intermediate_steps, rule.to_sound) for rule in
  dps.parse_sound_change('l(ː)ʀ n(ː)ʀ → lː nː / ”Vː_ (or all V_ ?)', '', '')]

[('lːʀ', [], 'lː'), ('nːʀ', [], 'nː'), ('lʀ', [], 'lː'), ('nʀ', [], 'nː')]

...AND IT ALL PARSES! Only took like 4 hours of running it over and over and fixing any errors that were hit!!

However, as of now, there is still some data parsing work to take care of:

1. Rules with multiple identical optionals (e.g. `(C)(C)`) are only parsed as all-or-nothing (a single `(C)` is never produced). After fixing this, rules where `(ː)` appears on both sides and means "long becomes long, short becomes short" still need to be handled correctly (e.g. `N(ː) k k(ː) N(ː)ɡ ɡ(ː) ɣ → ɲc(ː) c(ː) ɲɟ(ː) ɟ(ː) ʝ / _{i,j}`).
2. Nested curly brackets are not handled correctly. (`{e,w{æ,i}}` is parsed as `['{e,wæ', '{e,wi}']`)
3. I need to go through everything and look for any more obvious errors. (Maybe sort individual sounds by length to find outliers?)
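
For the first item, one possible approach is to treat each optional by its position in the string rather than by its text, so that two identical `(C)`s can be kept or dropped independently. This is only a sketch: `expand_optionals` is a hypothetical name, and it does not attempt the "long becomes long, short becomes short" correlation across both sides of a rule.

```python
import itertools
import re


def expand_optionals(rule_text: str) -> list[str]:
    """Expand every '(x)' independently, by position rather than by text."""
    # re.split with a capturing group alternates plain text and optional contents
    parts = re.split(r'\((.*?)\)', rule_text)
    plain, optional = parts[0::2], parts[1::2]
    results = []
    for keep in itertools.product([True, False], repeat=len(optional)):
        pieces = [plain[0]]
        for kept, opt, text in zip(keep, optional, plain[1:]):
            pieces.append((opt if kept else '') + text)
        expanded = ''.join(pieces)
        if expanded not in results:  # '(C)(C)' yields 'CV' twice; keep one
            results.append(expanded)
    return results


print(expand_optionals('(C)(C)V'))  # ['CCV', 'CV', 'V']
```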

I'm happy with the current state of things for the first progress report, though.

## What the parser is doing

1. It loops through each `<section>` tag on the page, each of which is a branch from a parent language to a daughter language (e.g. Proto-Germanic to Proto-Norse), and parses information about the branch. This is pretty simple.
2. It then loops through each sound change listed in the section and parses THAT. This is where it gets messy.

### Parsing sound changes

Sound changes in the *Index* are in a (relatively...) standard format. The 'from' and 'to' sounds are separated by an arrow (with some intermediate steps sometimes), while the environment of the sound change is specified after a forward slash. The specifics of how environments are written aren't necessary to know for this, since I mostly just care about the sounds themselves.

When parsing a sound change, first, any text contained in backticks is removed from the sound change text (as noted above), as is any `(?)`. Then, the environment is split off from the rest of the rule. (Often, rules are followed by some text in parentheses or double-quotes, so I include that in the environment.)

In [14]:
import re

rule_string = 'd j → r ɭ (sporadic)'

env_split = rule_string.split(" / ", 1)

environment = ''

if len(env_split) > 1:
    environment = env_split[1]
else:
    # If no environment, but rule ends with some text in parentheses or quotes, consider that the environment
    parens_match = re.search(r'(.+) (\(.+\)|“.+”)$', rule_string)
    if parens_match:
        env_split[0] = parens_match.group(1)
        environment = parens_match.group(2)

environment

'(sporadic)'

Next, the steps are split up and every sound is separated out. This is because I want single sound changes, like `a → o`, and a sound change like `a e → o i` is equivalent to `a → o` and `e → i` separately.
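
The pairing idea can be sketched in a few lines, assuming sounds are separated by spaces and steps by `→`. `pair_sounds` is a hypothetical, heavily simplified stand-in for what `parse_rule_steps` does. Splitting on every space would break sounds like `S[+ voiced]`, which is exactly issue 3 from the list near the top, so the real function has to be more careful.

```python
def pair_sounds(rule_text: str) -> list[tuple[str, list[str], str]]:
    """Split 'a e → o i' into (from, intermediate steps, to) triples."""
    # Each step is a whitespace-separated run of sounds
    steps = [step.split() for step in rule_text.split('→')]
    if len({len(step) for step in steps}) != 1:
        raise ValueError(f'steps have different numbers of sounds: {rule_text!r}')
    # zip(*steps) lines up the nth sound of every step
    return [(sounds[0], list(sounds[1:-1]), sounds[-1]) for sounds in zip(*steps)]


print(pair_sounds('z zː → j dʒː'))
# [('z', [], 'j'), ('zː', [], 'dʒː')]
```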

This is the first place where things get pretty complex. First, I need to handle any 'optionals', the name I've given to stuff in parentheses in rules like `a(i) → ey`, which is equivalent to `a → ey` and `ai → ey` separately. Because there can be multiple in a single rule, I need to have every possible combination accounted for. I used `itertools`' `combinations()` function to accomplish this.

In [15]:
import itertools
rule_string = 'a(i) {e,w{æ,i}} {we,ei} (w)ɪ → ey ø y ʏ'

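# Find every parenthesized "optional", e.g. '(i)' and '(w)'
optionals = re.findall(r'(\(.*?\))', rule_string)
if optionals:
    # Every subset of the optionals (the power set); each subset is the set of optionals to drop
    combinations = list(itertools.chain.from_iterable(itertools.combinations(optionals, size) for size in range(len(optionals) + 1)))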
    for combo in combinations:
        combo_string = rule_string
        for to_replace in combo:
            combo_string = combo_string.replace(to_replace, '')
        combo_string = combo_string.replace('(','').replace(')','')
        print(combo_string)

ai {e,w{æ,i}} {we,ei} wɪ → ey ø y ʏ
a {e,w{æ,i}} {we,ei} wɪ → ey ø y ʏ
ai {e,w{æ,i}} {we,ei} ɪ → ey ø y ʏ
a {e,w{æ,i}} {we,ei} ɪ → ey ø y ʏ


Then, each of those is run through a `parse_rule_steps` function, which does a lot, including:

1. Removing extraneous symbols
2. Splitting the rules into steps
3. Splitting the steps into each sound
4. Making sure the number of sounds at each step is the same (so they can be correlated correctly)
5. Splitting bracketed sounds into individual sounds, and generating rules for each combination of those individual sounds:

In [16]:
def handle_brackets(sound: str) -> list[str]:
    """Handles bracketed sounds"""
    sounds: list[str] = []
    if bracket_matches := re.match(r'^(\D+)?\{(.*)\}(\D+)?$', sound):
        # print(f'bracketed sound {sound}')
        prefix = str(bracket_matches.group(1) or '')
        split = bracket_matches.group(2).split(',')
        suffix = str(bracket_matches.group(3) or '')
        for option in split:
            sounds.append(prefix + option + suffix)
    else:
        sounds.append(sound)
    return sounds

handle_brackets('C{a,e,i}Ns')

['CaNs', 'CeNs', 'CiNs']
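
`handle_brackets` covers a single flat group. For the nested curly brackets on the to-do list above, one option is a recursive expansion that tracks brace depth. This is a sketch of that idea, not what the script currently does, and `expand_braces` is a hypothetical name.

```python
def expand_braces(sound: str) -> list[str]:
    """Recursively expand {a,b} groups, including nested ones."""
    start = sound.find('{')
    if start == -1:
        return [sound]
    # Walk forward to the brace that closes the one at `start`,
    # recording commas that sit at the top level of this group
    depth, cuts = 0, [start]
    for i in range(start, len(sound)):
        char = sound[i]
        if char == '{':
            depth += 1
        elif char == '}':
            depth -= 1
            if depth == 0:
                cuts.append(i)
                break
        elif char == ',' and depth == 1:
            cuts.append(i)
    else:
        raise ValueError(f'unbalanced braces in {sound!r}')
    prefix, suffix = sound[:start], sound[cuts[-1] + 1:]
    options = [sound[a + 1:b] for a, b in zip(cuts, cuts[1:])]
    results = []
    for option in options:
        # Options and the suffix may contain further groups
        results.extend(expand_braces(prefix + option + suffix))
    return results


print(expand_braces('{e,w{æ,i}}'))  # ['e', 'wæ', 'wi']
```

It gives the same result as `handle_brackets` for the flat case: `expand_braces('C{a,e,i}Ns')` returns `['CaNs', 'CeNs', 'CiNs']`.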

Finally, everything is output to JSON and pickle files!

In [21]:
import pickle, jsons
from data_parsing_script import Rule, Branch
with open('./data/rules.pkl', 'rb') as rules_file:
    rules = pickle.load(rules_file)
with open('./data/branches.pkl', 'rb') as branches_file:
    branches = pickle.load(branches_file)

print(f'{len(rules)} rules in {len(branches)} branches')
print(jsons.dumps(rules[500], { 'indent': 4, 'ensure_ascii': False }))
print(jsons.dumps(rules[4242], { 'indent': 4, 'ensure_ascii': False }))
print(jsons.dumps(branches[123], { 'indent': 4, 'ensure_ascii': False }))
print(jsons.dumps(branches[456], { 'indent': 4, 'ensure_ascii': False }))

15865 rules in 702 branches
{
    "environment": "",
    "from_sound": "dʒ",
    "id": "Eastern-Libyan-Arabic-dˤ-dʒ-q",
    "intermediate_steps": [],
    "original_text": "dˤ dʒ q → ðˤ ʒ ɡ",
    "to_sound": "ʒ"
}
{
    "environment": "",
    "from_sound": "dɮ",
    "id": "Proto-Circassian-ɬː-tɬː-tɬʼ-dɮ",
    "intermediate_steps": [],
    "original_text": "ɬ(ː) tɬ(ː) tɬʼ dɮ → ɕ(ː) tɕ(ː) tɕʼ tħ",
    "to_sound": "tħ"
}
{
    "id": "Yunaga-2",
    "index": "10.3.5.8.2",
    "name": "Proto-Yunaga to Yunaga 2",
    "source": "<i>thetha</i>, from Ozanne-Rivierre, Françoise (1992), “The Proto-Oceanic Consonantal System and the Languages of New Caledonia”. <i>Oceanic Linguistics</i> 31(2):191 – 207; and Ozanne-Rivierre, Françoise (1995), “Structural Changes in the Languages of Northern New Caledonia”. <i>Oceanic Linguistics</i> 34(1):44 – 72"
}
{
    "id": "Sekani",
    "index": "29.1.1.1.18",
    "name": "Proto-Athabaskan to Sekani",
    "source": "<i>Whimemsz</i>, from Krauss, Michael and Vi