# Analysis

*New, continuing* -- part 2; progress report 2

My most basic goals of analysis are the following:

- What are the most common sound changes?

- Given a specific starting sound, what sound changes are most likely? Least likely?

- What are some interesting connections that can be made by connecting strings of sound changes?

In [1]:
import pandas as pd

In [8]:
from data_parsing_script import Branch, Rule

branches: list[Branch] = pd.read_pickle('./data/branches.pkl')
rules: list[Rule] = pd.read_pickle('./data/rules.pkl')

Analysis will be much easier if my data is in dataframe form, so I'll make that happen by using `from_records()` and `vars()` (which returns an object's instance attributes as a dict mapping each field name to its value).

In [9]:
rules_df = pd.DataFrame.from_records([vars(rule) for rule in rules])
rules_df.describe()

| | id | branch_id | branch_index | original_text | environment | from_sound | intermediate_steps | to_sound |
|---|---|---|---|---|---|---|---|---|
| count | 16496 | 16496 | 16496 | 16496 | 16496 | 16496 | 16496 | 16496 |
| unique | 8954 | 702 | 702 | 7042 | 2724 | 1965 | 141 | 1487 |
| top | Shetland-Norn-u,oː-a-aː-ɒ,œ,y-e-iː | Old-Norse | 17.7.3.1 | {s,z}(ʔ) {ʃ,ʒ}(ʔ) {ɬ,ɮ}(ʔ) → s ʃ ɬ / _# | | e | [] | ∅ |
| freq | 40 | 371 | 371 | 48 | 7577 | 389 | 16205 | 1737 |
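
To make the `vars()` plus `from_records()` step concrete, here is a minimal sketch on a stand-in class. `ToyRule` is hypothetical and carries only a subset of the columns above; the real `Rule` class lives in `data_parsing_script` and isn't shown here. The only assumption is that `Rule` stores its fields as ordinary instance attributes, which is what `vars()` reads.

```python
from dataclasses import dataclass, field

import pandas as pd


@dataclass
class ToyRule:
    # Hypothetical stand-in for Rule, with a subset of its fields.
    from_sound: str
    to_sound: str
    environment: str = ''
    intermediate_steps: list = field(default_factory=list)


toy_rules = [
    ToyRule('h', '∅'),
    ToyRule('ai', 'əi', 'when stem-final', ['ɛi']),
]

# vars(obj) returns the instance's attribute dict, e.g.
# {'from_sound': 'h', 'to_sound': '∅', 'environment': '', 'intermediate_steps': []}
records = [vars(rule) for rule in toy_rules]

# Each dict becomes one row; the dict keys become the column names.
toy_df = pd.DataFrame.from_records(records)
print(toy_df)
```

Note that `intermediate_steps` stays a Python list inside each cell, which matters later when deduplicating rows.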


## Most common sound changes

Before finding what sound changes are "most common", I need to decide what makes two sound changes "the same." If the starting sound and ending sound are the same, but there are differing in-between sounds or differing environments, should they be considered to be the same? I think it'd be interesting to see if the results differ depending on which approach you take, so I'll be looking at both.

However, the `describe()` call above made me remember a potential confounding issue: there are a *lot* of sound changes that are essentially copied between different daughter languages in a single branch. If one sound change is shared between 15 daughter languages, should that really be considered "more common" than one that exists in 15 separate branches?

Because of this, I went back and modified my data parsing script to add the branch index. This will let me discount copies of the same sound change shared between sister branches on the same level.

As a starting point, let's see what the most common `from_sound`s and `to_sound`s are.

In [11]:
rules_df['from_sound'].value_counts()[:10]

e    389
k    384
t    362
a    339
s    326
p    310
u    288
i    252
j    250
o    244
Name: from_sound, dtype: int64

It makes sense that these are single IPA symbols, since they're the least complex possible sounds. I expect the 'to' sounds to be largely similar, with the addition of the null symbol.

In [12]:
rules_df['to_sound'].value_counts()[:10]

∅    1737
s     504
i     441
e     388
o     368
h     364
k     350
a     333
t     318
u     304
Name: to_sound, dtype: int64

I was correct, although which sounds are present, and in which order, does change a bit. /s/ is the most common result of a sound change other than the sound being removed (504 rules, ahead of /i/ at 441).

Now, let's just do a very basic version of the "most common sound change" idea -- what is the most common pairing of `from_sound` and `to_sound`?

In [13]:
rules_df[['from_sound', 'to_sound']].value_counts()[:10]

from_sound  to_sound
h           ∅           164
e           i           100
w           ∅            93
j           ∅            92
k           ∅            87
ʔ           ∅            87
V           ∅            74
o           u            73
n           ∅            72
ts          s            71
dtype: int64

/h/ being removed is the most common change, which makes a lot of sense to me -- h-dropping even happens in many dialects of British English (e.g. "history" pronounced without its initial /h/). /ts/ → /s/ is probably the most interesting inclusion on the list, but it's also not a very surprising change. Let's see what the most common are when excluding removed sounds:

In [14]:
rules_df[['from_sound', 'to_sound']][rules_df['to_sound'] != '∅'].value_counts()[:10]

from_sound  to_sound
e           i           100
o           u            73
ts          s            71
u           o            67
i           e            61
e           a            60
k           ɡ            60
a           e            58
s           ʃ            57
            h            56
dtype: int64
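
A note on the blank `from_sound` cell in the last row of that output. This is a minimal sketch, using just the two `s` rows from above, of how pandas prints a repeated outer index label and how the `display.multi_sparse` option changes that.

```python
import pandas as pd

# Two rows taken from the output above; both have 's' as from_sound.
index = pd.MultiIndex.from_tuples(
    [('s', 'ʃ'), ('s', 'h')], names=['from_sound', 'to_sound']
)
counts = pd.Series([57, 56], index=index)

# Default display: the repeated outer label 's' is left blank on row two.
sparse_text = counts.to_string()
print(sparse_text)

# With multi_sparse off, every row shows its full index.
with pd.option_context('display.multi_sparse', False):
    full_text = counts.to_string()
print(full_text)
```

The underlying index is the same either way; only the printed form differs.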

Nothing too surprising here. The blank `from_sound` cell in the last row is actually s → h: when pandas prints a MultiIndex, it leaves out an outer-level label that repeats the one on the row above (here, the `s` from s → ʃ). Now what if we include environment -- does that change things much?

In [21]:
rules_df[['from_sound', 'to_sound', 'environment']].value_counts()[:10]

from_sound  to_sound  environment
ts          s                        55
h           ∅                        44
ɡ           k                        34
TŠ          TS                       28
ʃ           s                        26
b           p                        25
iː          i                        25
tʃ          ts                       25
ʔ           ∅                        24
q           ∅                        24
dtype: int64

Things do change a bit. Every row in this top 10 has an empty `environment` (the most frequent value, per `describe()`), which is why that column looks blank. The weirdest thing here is definitely TŠ → TS, but looking at the data, this is due to the issue I mentioned earlier -- sounds copied between sister branches. The Athabaskan sound changes mostly involve *series* of consonants; "TŠ" represents palatals, while "TS" represents dental affricates & fricatives.

Let's see the same stats, but excluding sister branches. First, I'll add a column for "parent branch index", so we can find the number of unique parent branches that have a sound change, rather than a raw count of the sound change.

In [22]:
def get_parent_branch_index(branch_index: str) -> str:
  # Drop the last dotted component; a top-level index like '46' yields ''.
  return '.'.join(branch_index.split('.')[:-1])

get_parent_branch_index('6.5.4.3.2')

'6.5.4.3'

In [29]:
import numpy as np
# Series.map applies the function to each value; no np.vectorize wrapper is needed.
rules_df['parent_index'] = rules_df['branch_index'].map(get_parent_branch_index)
rules_df.groupby(['branch_index', 'parent_index']).size().head(10) # verify it worked

branch_index  parent_index
10.1          10               4
10.1.1        10.1            21
10.1.1.1      10.1.1           8
10.1.1.2      10.1.1          12
10.1.1.3      10.1.1          15
10.1.1.5      10.1.1           8
10.1.2        10.1            10
10.1.2.1      10.1.2          11
10.1.2.10     10.1.2          15
10.1.2.2      10.1.2          13
dtype: int64

Looks good! Let's get those stats again now.

In [32]:
rules_df[['from_sound', 'to_sound', 'parent_index']].drop_duplicates()[['from_sound', 'to_sound']].value_counts()[:10]

from_sound  to_sound
h           ∅           46
w           ∅           44
ʔ           ∅           40
k           ∅           37
j           ∅           34
e           i           34
a           e           32
i           e           30
u           o           29
o           u           29
dtype: int64

These numbers are much lower than before, but the results are mostly similar: V → ∅, n → ∅, and ts → s fall out of the top 10, replaced by a → e, i → e, and u → o. Let's do the same with environment taken into account:

In [33]:
rules_df[['from_sound', 'to_sound', 'environment', 'parent_index']].drop_duplicates()[['from_sound', 'to_sound', 'environment']].value_counts()[:10]

from_sound  to_sound  environment
ts          s                        19
tʃ          ts                       15
ʔ           ∅                        15
h           ∅                        14
w           v                        14
ʃ           s                        14
ɡ           k                        12
dz          z                        12
b           p                        11
o           u                        11
dtype: int64

This definitely *feels* more representative now, and the commonness of all of these sound changes makes sense to me.
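
To see what the `drop_duplicates()` then `value_counts()` chain is counting, here's a minimal sketch on made-up rows (the branch indices are illustrative, not taken from the dataset). A change copied across three sister branches collapses to one row per parent, so it counts once; the same change under a different parent counts again.

```python
import pandas as pd

toy = pd.DataFrame({
    'from_sound':   ['h', 'h', 'h', 'h', 'e'],
    'to_sound':     ['∅', '∅', '∅', '∅', 'i'],
    # Made-up indices: h → ∅ in three sisters under 1.1 and once under 2.3.
    'branch_index': ['1.1.1', '1.1.2', '1.1.3', '2.3.1', '1.1.1'],
})
toy['parent_index'] = toy['branch_index'].map(
    lambda idx: '.'.join(idx.split('.')[:-1])
)

# Raw count: every copy of the rule is counted.
raw = toy[['from_sound', 'to_sound']].value_counts()

# Per-parent count: one row per (change, parent) pair, then count changes.
per_parent = (
    toy[['from_sound', 'to_sound', 'parent_index']]
    .drop_duplicates()[['from_sound', 'to_sound']]
    .value_counts()
)

print(raw[('h', '∅')], per_parent[('h', '∅')])  # 4 2
```

So the deduplicated figure is the number of distinct parent branches under which a change appears, not the number of rules.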

Finally, let's take intermediate steps into account, with and without environment. (I have to do `.astype(str)` here because `drop_duplicates()` uses hashing, and the `intermediate_steps` values are lists, which can't be hashed.)

In [36]:
rules_df[['from_sound', 'intermediate_steps', 'to_sound', 'parent_index']]\
  .astype(str).drop_duplicates()[['from_sound', 'intermediate_steps', 'to_sound']].value_counts()[:10]

from_sound  intermediate_steps  to_sound
h           []                  ∅           46
w           []                  ∅           43
ʔ           []                  ∅           40
k           []                  ∅           37
j           []                  ∅           34
e           []                  i           34
a           []                  e           32
i           []                  e           30
o           []                  u           29
u           []                  o           28
dtype: int64
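
The `.astype(str)` workaround can be checked in isolation. This is a minimal sketch on made-up rows: `drop_duplicates()` raises `TypeError` on a column of lists, while the stringified frame deduplicates fine. Converting the lists to tuples is another option (not used in this notebook) that keeps the values as sequences rather than strings.

```python
import pandas as pd

toy = pd.DataFrame({
    'from_sound': ['ai', 'ai', 'e'],
    'intermediate_steps': [['ɛi'], ['ɛi'], []],
    'to_sound': ['əi', 'əi', 'i'],
})

try:
    toy.drop_duplicates()
    raised = False
except TypeError:  # lists are unhashable
    raised = True

# Option 1 (used above): compare the string form of every cell.
as_str = toy.astype(str).drop_duplicates()

# Option 2: turn each list into a hashable tuple.
as_tuple = toy.assign(
    intermediate_steps=toy['intermediate_steps'].map(tuple)
).drop_duplicates()

print(raised, len(as_str), len(as_tuple))  # True 2 2
```

With option 1 the lists become strings such as `"['ɛi']"`, which is why the index in the output above shows `[]` rather than an empty cell.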

No intermediate steps is, predictably, most common. What if we specify only rules *with* intermediate steps? Are there *any* that occur more than once?

In [43]:
counts = rules_df[rules_df['intermediate_steps'].apply(lambda x: len(x) > 0)][['from_sound', 'intermediate_steps', 'to_sound', 'parent_index']]\
  .astype(str).drop_duplicates()[['from_sound', 'intermediate_steps', 'to_sound']].value_counts()

counts[counts > 1]

from_sound  intermediate_steps  to_sound
ai          ['ɛi']              əi          2
ɛː          ['ɛi']              əi?         2
dtype: int64

It seems like there are two that occur more than once, but on further inspection, maybe not...

In [46]:
rules_df[(rules_df['from_sound'] == 'ai') & (rules_df['to_sound'] == 'əi')]

| | id | branch_id | branch_index | original_text | environment | from_sound | intermediate_steps | to_sound | parent_index |
|---|---|---|---|---|---|---|---|---|---|
| 6634 | Scots-—-ai | Scots | 17.7.2.1.9 | — ai → ɛi → əi / when stem-final | when stem-final | ai | [ɛi] | əi | 17.7.2.1 |
| 16259 | Scots-Vowel-Shifts-ai | Scots-Vowel-Shifts | 46.5 | ai → ɛi → əi / when stem-final | when stem-final | ai | [ɛi] | əi | 46 |


In [44]:
rules_df[rules_df['to_sound'] == 'əi?']

| | id | branch_id | branch_index | original_text | environment | from_sound | intermediate_steps | to_sound | parent_index |
|---|---|---|---|---|---|---|---|---|---|
| 6658 | Scots-—-ɛː | Scots | 17.7.2.1.9 | — ɛː → ɛi (→ əi?) / in some northern varieties | in some northern varieties | ɛː | [ɛi] | əi? | 17.7.2.1 |
| 16283 | Scots-Vowel-Shifts-ɛː | Scots-Vowel-Shifts | 46.5 | ɛː → ɛi (→ əi?) / in some northern varieties | in some northern varieties | ɛː | [ɛi] | əi? | 46 |


...both of these are just duplicated between the Scots section and the Vowel Shifts section. Oh well.