# Overview (WIP)

The goal of this Jupyter notebook is to answer the following question: **What percentage of allophones of a given language are in the minimal feature representation of the language's phonemic inventory (after dimentionality reduction)**?

The sections we have (or will have are):
- Loading the dataframe and compiling preliminary statistics on the data available in Phoible
    - Including removing inventories with no allophones
- Adding feature representations to allophones without features (not in this notebook, but the allophones without feature reps will be found/mentioned here)
    - MAYBE also write a script to see which allophones shouldn't be included (e.g. 'ja')
- Iterate over the inventories and apply dimensionality reduction to find the smallest number of features which are able to describe all the contrasts that exist in the language. Then, see which allophones are able to be accurately (contrastively? faithfully?) represented by this reduced subset of features 

# Todo
- Go back and change the loops to df.apply for safer iteration

# Load the Dataframe

Much code used from https://github.com/zyocum/phoible-notebook/blob/main/notebook.ipynb 

In [1]:
# Import pandas

from collections import defaultdict
import pandas as pd

In [3]:
# Download the dataframe

df = pd.read_csv('https://raw.githubusercontent.com/phoible/dev/master/data/phoible.csv', low_memory=False)

In [5]:
df

Unnamed: 0,InventoryID,Glottocode,ISO6393,LanguageName,SpecificDialect,GlyphID,Phoneme,Allophones,Marginal,SegmentClass,...,advancedTongueRoot,periodicGlottalSource,epilaryngealSource,spreadGlottis,constrictedGlottis,fortis,lenis,raisedLarynxEjective,loweredLarynxImplosive,click
0,1,kore1280,kor,Korean,,0068,h,ç h ɦ,,consonant,...,-,-,-,+,-,-,-,-,-,-
1,1,kore1280,kor,Korean,,006A,j,j,,consonant,...,-,+,-,-,-,-,-,-,-,-
2,1,kore1280,kor,Korean,,006B,k,k̚ ɡ k,,consonant,...,-,-,-,-,-,-,-,-,-,-
3,1,kore1280,kor,Korean,,006B+02B0,kʰ,kʰ,,consonant,...,-,-,-,+,-,-,-,-,-,-
4,1,kore1280,kor,Korean,,006B+02C0,kˀ,kˀ,,consonant,...,-,-,-,-,+,-,-,-,-,-
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105479,3020,lamu1254,lby,Tableland Lamalama,,0294,ʔ,,False,consonant,...,-,-,-,-,+,-,-,-,-,-
105480,3020,lamu1254,lby,Tableland Lamalama,,03B8,θ,,False,consonant,...,-,-,-,-,-,-,-,-,-,-
105481,3020,lamu1254,lby,Tableland Lamalama,,0061,a,,False,vowel,...,-,+,-,-,-,0,0,-,-,0
105482,3020,lamu1254,lby,Tableland Lamalama,,0069,i,,False,vowel,...,-,+,-,-,-,0,0,-,-,0


In [92]:
df.iloc[0]

InventoryID                       1
Glottocode                 kore1280
ISO6393                         kor
LanguageName                 Korean
SpecificDialect                 NaN
GlyphID                        0068
Phoneme                           h
Allophones                    ç h ɦ
Marginal                        NaN
SegmentClass              consonant
Source                          spa
tone                              0
stress                            -
syllabic                          -
short                             -
long                              -
consonantal                       -
sonorant                          -
continuant                        +
delayedRelease                    +
approximant                       -
tap                               -
trill                             -
nasal                             -
lateral                           -
labial                            -
round                             0
labiodental                 

# Set up the analysis

In [6]:
# Map the index of the dataframe to the language name

index_to_language = dict(df[['InventoryID', 'LanguageName']].values)

In [7]:
print(index_to_language[3020])

len(index_to_language)

Tableland Lamalama


3020

In [8]:
# Find all inventories with at least one NaNs in the Allophones column

nan_inventories = set(df[df['Allophones'].isna()]['InventoryID'].values)
print(len(nan_inventories))

1688


In [9]:
# Find inventories with only non-NaNs

defined_inventories = set(df.groupby('InventoryID').filter(lambda x: x['Allophones'].isna().sum() != len(x))['InventoryID'].values)
print(len(defined_inventories))

1332


In [10]:
# Find inventories where each phoneme in an inventory maps to exactly one allophone 387
meaningful_allophone_inventories = set()
for inventory, phonemes in df.groupby('InventoryID'):
    if inventory in defined_inventories: # Only go through if it's defined
        for index, row in phonemes.iterrows():
            if row['Phoneme'] != row['Allophones']:
                meaningful_allophone_inventories.add(inventory)
defined_no_allophone_inventories = {ID for ID in defined_inventories if ID not in meaningful_allophone_inventories}
print(f'There are {len(defined_no_allophone_inventories)} inventories with no meaningful allophones')
print(f'These are inventories {sorted(defined_no_allophone_inventories)}')

There are 387 inventories with no meaningful allophones
These are inventories [3, 26, 41, 173, 649, 650, 651, 655, 656, 659, 660, 663, 664, 666, 667, 668, 669, 670, 671, 672, 675, 676, 677, 679, 680, 681, 682, 683, 684, 686, 687, 688, 692, 693, 695, 696, 697, 698, 699, 700, 701, 702, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714, 715, 716, 718, 719, 720, 724, 725, 727, 729, 730, 732, 733, 735, 737, 738, 739, 740, 741, 743, 744, 746, 748, 749, 751, 752, 753, 754, 756, 759, 762, 763, 764, 765, 766, 768, 769, 770, 771, 772, 776, 777, 779, 782, 783, 784, 787, 788, 789, 792, 793, 794, 795, 796, 797, 800, 801, 805, 806, 807, 809, 810, 811, 812, 813, 814, 816, 818, 819, 820, 822, 826, 827, 829, 830, 831, 832, 833, 834, 835, 836, 837, 842, 843, 844, 846, 847, 848, 849, 850, 852, 853, 859, 860, 863, 864, 868, 869, 870, 872, 873, 874, 877, 878, 879, 902, 903, 912, 916, 924, 926, 931, 933, 936, 942, 946, 951, 952, 960, 963, 964, 986, 989, 991, 993, 999, 1023, 1033, 1041, 1047, 1049, 1063,

The total number of inventories is equal to the number of NaN-only columns and mixed columns. This means that every inventory either is all-NaNs or all non-nans

In [11]:
# Define a function which returns true if a phoneme is in the list of phonemes

def is_defined(phoneme):
    return phoneme in df['Phoneme'].values

In [12]:
# Check to see which phonemes don't have allophones

allophone_counts = defaultdict(int)
undefined_allophones = set()

# For each inventory with meaningful allophones
for inventory in sorted(meaningful_allophone_inventories):
    # For each phoneme in that inventory
    for _, phoneme in df[df['InventoryID'] == inventory].iterrows():
        # For each allophone of that phoneme
        for allophone in phoneme['Allophones'].split():
            allophone_counts[allophone] += 1
            if not is_defined(allophone):
                undefined_allophones.add(allophone)
    if inventory % 5 == 0:
        print(inventory)


all_allophones = allophone_counts.keys()

5
10
15
20
25
30
35
40
45
50
55
60
65
70
75
80
85
90
95
100
105
110
115
120
125
130
135
140
145
150
155
160
165
170
175
180
185
190
195
665
685
690
745
750
755
760
775
780
785
790
815
825
840
845
855
865
875
880
885
890
895
900
905
910
915
920
925
930
935
940
945
950
955
965
970
975
980
985
990
995
1000
1005
1010
1015
1020
1025
1030
1035
1040
1045
1050
1055
1060
1065
1070
1075
1080
1085
1090
1095
1100
1105
1110
1115
1120
1125
1130
1135
1140
1145
1155
1160
1165
1170
1175
1185
1190
1210
1215
1220
1225
1230
1255
1260
1265
1285
1290
1295
1300
1315
1320
1325
1330
1340
1350
1355
1365
1375
1380
1385
1390
1395
1400
1405
1410
1415
1435
1440
1445
1455
1470
1475
1480
1500
1510
1520
1535
1550
1565
1570
1575
1585
1590
1600
1605
1610
1625
1630
1635
1650
1655
1660
1665
1670
1675
1680
1685
1690
2160
2165
2170
2175
2180
2185
2190
2195
2200
2205
2210
2215
2220
2225
2230
2235


# Get data on the inventories

In [13]:
# Of the inventories with allophones, how many phonemes and allophones do they have?
# Only considers inventories with defined allophones
id_to_phoneme_count = {} # Maps an ID number to the number of phonemes
id_to_allophone_count = {} # Maps an ID number to the number of phonemes
id_to_allophone_phoneme_ratio = {} # Maps an ID number to the ratio of allophones to phonemes
for inventory, phonemes in df.groupby('InventoryID'):
    if inventory in defined_inventories: # Only go through if it's defined
        num_allophones = 0
        for allophones in phonemes:
            num_allophones += len(allophones.split() )
        id_to_phoneme_count[inventory] = len(phonemes)
        id_to_allophone_count[inventory] = num_allophones
        id_to_allophone_phoneme_ratio[inventory] = num_allophones / len(phonemes)

In [14]:
average_allophone_phoneme_ratio = sum(id_to_allophone_phoneme_ratio.values() ) / len(id_to_allophone_phoneme_ratio)
print(f'The average ratio of allophones to phonemes is {average_allophone_phoneme_ratio}')
print(f'There are {average_allophone_phoneme_ratio - 1} more allophones than phonemes')

The average ratio of allophones to phonemes is 1.3943044299461032
There are 0.39430442994610315 more allophones than phonemes


In [15]:
# How many of the allophones are already in the inventory?

# Get all the phonemes in an inventory
inventory_phonemes = defaultdict(set)
# For each inventory with meaningful allophones
for inventory in sorted(meaningful_allophone_inventories):
    # For each phoneme in that inventory
    for _, phoneme in df[df['InventoryID'] == inventory].iterrows():
        inventory_phonemes[inventory].add(phoneme['Phoneme'])

# Get all the allophones in an inventory
inventory_allophones = defaultdict(set)
# For each inventory with meaningful allophones
for inventory in sorted(meaningful_allophone_inventories):
    # For each phoneme in that inventory
    for _, phoneme in df[df['InventoryID'] == inventory].iterrows():
        # For each allophone of that phoneme
        for allophone in phoneme['Allophones'].split():
            inventory_allophones[inventory].add(allophone)

In [16]:
# Check to see how many of the allophones already exist within the phonemic inventory
allophone_percent_outside_inventory = {}
# For each inventory with meaningful allophones
for inventory in sorted(meaningful_allophone_inventories):
    allophone_percent_outside_inventory[inventory] = len(inventory_allophones[inventory]) / len(inventory_phonemes[inventory]) - 1

In [17]:
average_allophone_percent_outside_inventory = sum(allophone_percent_outside_inventory.values() ) / len(allophone_percent_outside_inventory)
print(f'The average percent of allophones outside the phonemic inventory is {average_allophone_percent_outside_inventory}')
print(f'This means that {average_allophone_percent_outside_inventory / (average_allophone_phoneme_ratio - 1)}% of allophones lie outside the inventory')

The average percent of allophones outside the phonemic inventory is 0.33712105332857684
This means that 0.8549765808480959% of allophones lie outside the inventory


# Get data on the allophones

In [18]:
# Return the number of allophones and unique allophones
print(f'There are {len(allophone_counts)} allophones')
print(f'{len(undefined_allophones)} don\'t have feature representations')

There are 3034 allophones
1016 don't have feature representations


In [19]:
# Account for marginality
marginal_defined_allophones = set()
nonmarginal_undefined_allophones = set()
for undefined_allophone in undefined_allophones:
    # If it's marginal and exists nonmarginally
    if (undefined_allophone.startswith("<") and is_defined(undefined_allophone[1:-1]) ) or (is_defined("<" + undefined_allophone + ">") ):
        marginal_defined_allophones.add(undefined_allophone)
    else:
        nonmarginal_undefined_allophones.add(undefined_allophone)


In [20]:
# Return info on the marginality of some allophones
print(f'Of the {len(undefined_allophones)} without feature representations:')
print(f'{len(marginal_defined_allophones)} have feature representations when accounting for marginality')
print(f'These are {marginal_defined_allophones}')
print(f'{len(nonmarginal_undefined_allophones)} still don\'t feature representations when accounting for marginality')

Of the 1016 without feature representations:
7 have feature representations when accounting for marginality
These are {'<o>', '<ɲ̟>', '<œ>', '<ɾ>', '<y>', '<ɯ>', '<q>'}
1009 still don't feature representations when accounting for marginality


In [21]:
# Filter by the number of times they appear
nonunique_allophones = {key for key, value in allophone_counts.items() if value > 1}
nonunique_undefined_allophones = {key for key, value in allophone_counts.items() if value > 1 and key in nonmarginal_undefined_allophones}
print(f'There are {len(nonunique_allophones)} allophones that appear more than once.')
print(f'Of those, {len(nonunique_undefined_allophones)} don\'t have feature representations.')
print(f'These are {nonunique_undefined_allophones}')

There are 1294 allophones that appear more than once.
Of those, 174 don't have feature representations.
These are {'lʷˠ', 't̪̚', 'dɪ̯', 'sʷˠ', 'tˡ', 'ʃ̩', 'ɤ̞̥', 'p̬', 'h̃ʷ', 'ʀ̥ʁ̥', 'k͈ʷː', 'nʷˠ', 'ʊ˞', 'tːʃ', 'uʔ', 'k̟ʲ', 'k̃', 'ʈʲʰ', 'n̠d̠ʒʲ', 'dˡ', 'ø̥', 'β̞̃', 'œə', 'ʊ̥̆', 'r̞', 'ɖʲ', 'k̚ʷ', 'ʔb', 'k̠ʰ', 't̪ⁿ', 'd̃', 'tʰɪ̯', 'ə̠', 'ĭ̥', 'ɐ˞', 'nn̥', 'l̪̥ˠ', 't̻̚', 'aj', 'ɡ̠', '↘', 'kʔ', 'ɡ̥', 'k̚', 'l̻̥', 'tʔ', 'β̞̜', 'ŋw̃', 'ɨ̟', 'ə̯iː', 'm̥ˠ', 'pʔ', 'ʃˤː', 'ɐ̥', 'ɡ͉', 'ɑ̥', 'ɲʲ', 'ʌ̆', 'ă̟', 'ʔm', 'β̥', 's̃', 'ɟɲ', 'ɹ̃', 'o̥', 'b̃', 'pˡ', 'ɽʲ', 't̠ʃʲʼ', 'k̠', 'ɡ̥̊', 'd̠̈', 'p̚', 'ɣ̟', 'ɗ̚', 'kp̚', 'ʒ̃', 'ɪ̥̆', 'f̃', 'nʒ', 'pⁿ', 'ʔp̚', 'ɛ˞', 'ŋ̚', 't̠ʃ̚', 'ɐ̯', 's̬', 'ja', 'ʔt', 'k̬', 'c̚', 'ɦ̃', 'ɯ̥', 'tʷˠ', 'ɣ̞', 'ɱvʷ', 'ʲt̚', 'ɣ̃', 'p̚ˀ', 'ɟ̥', 'ɰʷ', 'o̞ə̯', 'wʊ', 'hw', 'd̪ʉ', 'ʔ̚', 'ʈ̚', 'ʁ̥', 'ɪ̥', 'ɛ̥', 'm̚', 'ñ', 'd̠ʒ̥', 'ʔk', 'ɖ̚', 'k͉̚', 'ʛ', 'ə̥̆', 'ʒ̥', 'kⁿ', 'ʊ̥', 'ʀ̥', 'ĭ̃', 'ʔn', 'x̟', 't̚', 'ɟ̚', 'j͉', 'ɡ̃', 'n̚', 'ʲh', 'ɻ̥', 'ɔ̥', 'z͇̥', 'd̠̥z̠̥ʲ', 'k̟̚', 'h̃ʲ

# Add automatic feature representations where possible

Everything below here is messy and will be cleaned up soon. I needed to write out a lot of stuff to pass the dataframe through add-features.R and will clean it up after add-features.R is updated.

In [22]:
df.to_csv('original_df.csv', encoding='utf-8')

In [23]:
df

Unnamed: 0,InventoryID,Glottocode,ISO6393,LanguageName,SpecificDialect,GlyphID,Phoneme,Allophones,Marginal,SegmentClass,...,advancedTongueRoot,periodicGlottalSource,epilaryngealSource,spreadGlottis,constrictedGlottis,fortis,lenis,raisedLarynxEjective,loweredLarynxImplosive,click
0,1,kore1280,kor,Korean,,0068,h,ç h ɦ,,consonant,...,-,-,-,+,-,-,-,-,-,-
1,1,kore1280,kor,Korean,,006A,j,j,,consonant,...,-,+,-,-,-,-,-,-,-,-
2,1,kore1280,kor,Korean,,006B,k,k̚ ɡ k,,consonant,...,-,-,-,-,-,-,-,-,-,-
3,1,kore1280,kor,Korean,,006B+02B0,kʰ,kʰ,,consonant,...,-,-,-,+,-,-,-,-,-,-
4,1,kore1280,kor,Korean,,006B+02C0,kˀ,kˀ,,consonant,...,-,-,-,-,+,-,-,-,-,-
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105479,3020,lamu1254,lby,Tableland Lamalama,,0294,ʔ,,False,consonant,...,-,-,-,-,+,-,-,-,-,-
105480,3020,lamu1254,lby,Tableland Lamalama,,03B8,θ,,False,consonant,...,-,-,-,-,-,-,-,-,-,-
105481,3020,lamu1254,lby,Tableland Lamalama,,0061,a,,False,vowel,...,-,+,-,-,-,0,0,-,-,0
105482,3020,lamu1254,lby,Tableland Lamalama,,0069,i,,False,vowel,...,-,+,-,-,-,0,0,-,-,0


In [22]:
# Convert the allophones into different rows

allophoned_df = pd.DataFrame(columns=df.columns.tolist()[:11] )
allophoned_df = allophoned_df.rename(columns={'Phoneme': 'OriginalPhoneme', 'Allophones': 'Phoneme'})

In [23]:
# Add each row

for index, phoneme in df.copy().rename(columns={'Phoneme': 'OriginalPhoneme', 'Allophones': 'Phoneme'}).iterrows():
    if not pd.isna(phoneme['Phoneme']):
        for allophone in phoneme['Phoneme'].split():
            new_allophone = phoneme.copy()
            new_allophone['Phoneme'] = allophone
            allophoned_df.loc[len(allophoned_df.index)] = new_allophone
    if index % 1000 == 0:
        print(f'Progress is {index / len(df.index)}')

Progress is 0.0
Progress is 0.0094801107276933
Progress is 0.0189602214553866
Progress is 0.028440332183079897
Progress is 0.0379204429107732
Progress is 0.0474005536384665
Progress is 0.056880664366159794
Progress is 0.0663607750938531
Progress is 0.0758408858215464
Progress is 0.08532099654923969
Progress is 0.094801107276933
Progress is 0.1042812180046263
Progress is 0.11376132873231959
Progress is 0.1232414394600129
Progress is 0.1327215501877062
Progress is 0.14220166091539949
Progress is 0.1516817716430928
Progress is 0.1611618823707861
Progress is 0.17064199309847938
Progress is 0.1801221038261727
Progress is 0.189602214553866
Progress is 0.19908232528155928
Progress is 0.2085624360092526
Progress is 0.2180425467369459
Progress is 0.22752265746463918
Progress is 0.23700276819233249
Progress is 0.2464828789200258
Progress is 0.2559629896477191
Progress is 0.2654431003754124
Progress is 0.2749232111031057
Progress is 0.28440332183079897
Progress is 0.2938834325584923
Progress is 0

In [73]:
allophoned_df

Unnamed: 0,InventoryID,Glottocode,ISO6393,LanguageName,SpecificDialect,GlyphID,OriginalPhoneme,Phoneme,Marginal,SegmentClass,Source
0,1,kore1280,kor,Korean,,0068,h,ç,,consonant,spa
1,1,kore1280,kor,Korean,,0068,h,h,,consonant,spa
2,1,kore1280,kor,Korean,,0068,h,ɦ,,consonant,spa
3,1,kore1280,kor,Korean,,006A,j,j,,consonant,spa
4,1,kore1280,kor,Korean,,006B,k,k̚,,consonant,spa
...,...,...,...,...,...,...,...,...,...,...,...
65464,2238,even1259,evn,Evenki,,006F+02D0,oː,oː,False,vowel,ph
65465,2238,even1259,evn,Evenki,,0075,u,u,False,vowel,ph
65466,2238,even1259,evn,Evenki,,0075+02D0,uː,uː,False,vowel,ph
65467,2238,even1259,evn,Evenki,,0259,ə,ə,False,vowel,ph


In [6]:
def unicode_hex(phoneme):
    unicodes = []
    for character in phoneme:
        code_point = ord(character)
        unicodes.append('{:04X}'.format(code_point) )
    return '+'.join(unicode for unicode in unicodes)

In [24]:
# Update each row's GlyphID

for index, phoneme in allophoned_df.iterrows():
    phoneme['GlyphID'] = unicode_hex(phoneme['Phoneme'])
    if index % 10000 == 0:
        print(f'Progress is {index / len(df.index)}')

Progress is 0.0
Progress is 0.094801107276933
Progress is 0.189602214553866
Progress is 0.28440332183079897
Progress is 0.379204429107732
Progress is 0.47400553638466497
Progress is 0.5688066436615979


In [25]:
allophoned_df

Unnamed: 0,InventoryID,Glottocode,ISO6393,LanguageName,SpecificDialect,GlyphID,OriginalPhoneme,Phoneme,Marginal,SegmentClass,Source
0,1,kore1280,kor,Korean,,00E7,h,ç,,consonant,spa
1,1,kore1280,kor,Korean,,0068,h,h,,consonant,spa
2,1,kore1280,kor,Korean,,0266,h,ɦ,,consonant,spa
3,1,kore1280,kor,Korean,,006A,j,j,,consonant,spa
4,1,kore1280,kor,Korean,,006B+031A,k,k̚,,consonant,spa
...,...,...,...,...,...,...,...,...,...,...,...
65464,2238,even1259,evn,Evenki,,006F+02D0,oː,oː,False,vowel,ph
65465,2238,even1259,evn,Evenki,,0075,u,u,False,vowel,ph
65466,2238,even1259,evn,Evenki,,0075+02D0,uː,uː,False,vowel,ph
65467,2238,even1259,evn,Evenki,,0259,ə,ə,False,vowel,ph


In [26]:
# These are the cases that throw errors
unhandled_diacritic_base = {
    '031F': ['ɜ', 'w', 'æ', 'n'], # advanced
    '031D': ['ə', 'ɐ'], # uptack
    '0308': ['ə', 'ʉ', 'œ', 'ɯ'], # centralized
    '031E': ['ɣ', 'r', 'ɹ', 'b', 'ɖ', 'ɡ'], # downtack
    '031C': ['β', 'θ', 'ɬ'], # less-round
    '033D': ['e', 'o', 'ø', 'a'], # mid-centralized
    '0353': ['d'], # frictionalized
    '0339': ['w'] # more-round
}
# These are also unhandled
unhandled_characters = ['0334', '2198', '1D4A']
unhandled_special_feats = [
    'tsx', 'tsç', 'sts', 'stʃ', 'ŋɡb', 'ŋkp', 'dzj', 'ʔtʃ', 'ʔdʒ', 'tʃɥ', 'bvf', 'dzs', 'dʒʃ', 'hts', 'çts', 'htʃ', 'çtʃ'
]


In [37]:
allophoned_df['Unhandled'] = False
for index, phone in allophoned_df.iterrows():
    # Mark unhandled characters
    for unhandled_character in unhandled_characters:
        if unhandled_character in phone['GlyphID']:
            allophoned_df.at[index, 'Unhandled'] = True
    # Mark unhandled diacritic base combinations
    for unhandled_diacritic, unhandled_bases in unhandled_diacritic_base.items():
        if unhandled_diacritic in phone['GlyphID']:
            for unhandled_base in unhandled_bases:
                if unhandled_base in phone['Phoneme']:
                    allophoned_df.at[index, 'Unhandled'] = True
    # Mark unhandled_special_feats
    for unhandled_special_feat in unhandled_special_feats:
        if unhandled_special_feat in phone['Phoneme']:
            allophoned_df.at[index, 'Unhandled'] = True
    # Show progress
    if index % 2000 == 0:
        print(f'Progress is {index / len(allophoned_df.index)}')

Progress is 0.0
Progress is 0.030548809360155187
Progress is 0.06109761872031037
Progress is 0.09164642808046557
Progress is 0.12219523744062075
Progress is 0.15274404680077594
Progress is 0.18329285616093113
Progress is 0.21384166552108633
Progress is 0.2443904748812415
Progress is 0.2749392842413967
Progress is 0.3054880936015519
Progress is 0.3360369029617071
Progress is 0.36658571232186227
Progress is 0.39713452168201746
Progress is 0.42768333104217265
Progress is 0.4582321404023278
Progress is 0.488780949762483
Progress is 0.5193297591226382
Progress is 0.5498785684827934
Progress is 0.5804273778429486
Progress is 0.6109761872031038
Progress is 0.6415249965632589
Progress is 0.6720738059234141
Progress is 0.7026226152835693
Progress is 0.7331714246437245
Progress is 0.7637202340038797
Progress is 0.7942690433640349
Progress is 0.8248178527241901
Progress is 0.8553666620843453
Progress is 0.8859154714445004
Progress is 0.9164642808046556
Progress is 0.9470130901648108
Progress is 0

In [38]:
allophoned_df

Unnamed: 0,InventoryID,Glottocode,ISO6393,LanguageName,SpecificDialect,GlyphID,OriginalPhoneme,Phoneme,Marginal,SegmentClass,Source,Unhandled
0,1,kore1280,kor,Korean,,00E7,h,ç,,consonant,spa,False
1,1,kore1280,kor,Korean,,0068,h,h,,consonant,spa,False
2,1,kore1280,kor,Korean,,0266,h,ɦ,,consonant,spa,False
3,1,kore1280,kor,Korean,,006A,j,j,,consonant,spa,False
4,1,kore1280,kor,Korean,,006B+031A,k,k̚,,consonant,spa,False
...,...,...,...,...,...,...,...,...,...,...,...,...
65464,2238,even1259,evn,Evenki,,006F+02D0,oː,oː,False,vowel,ph,False
65465,2238,even1259,evn,Evenki,,0075,u,u,False,vowel,ph,False
65466,2238,even1259,evn,Evenki,,0075+02D0,uː,uː,False,vowel,ph,False
65467,2238,even1259,evn,Evenki,,0259,ə,ə,False,vowel,ph,False


In [48]:
# Print all the ones that are marked
unhandled_allophones = allophoned_df[allophoned_df['Unhandled'] == True]
unhandled_allophones.to_csv('unhandled_allophones.csv', encoding='utf-8')

In [49]:
unhandled_allophones

Unnamed: 0,InventoryID,Glottocode,ISO6393,LanguageName,SpecificDialect,GlyphID,OriginalPhoneme,Phoneme,Marginal,SegmentClass,Source,Unhandled
3451,62,kala1399,kal,Inuit,,0251+0334,a,ɑ̴,,vowel,spa,True
3453,62,kala1399,kal,Inuit,,0251+0334+02D0,aː,ɑ̴ː,,vowel,spa,True
3456,62,kala1399,kal,Inuit,,0259+0334,i,ə̴,,vowel,spa,True
3461,62,kala1399,kal,Inuit,,0254+0334,u,ɔ̴,,vowel,spa,True
3464,62,kala1399,kal,Inuit,,0259+0334+02D0,ɪː,ə̴ː,,vowel,spa,True
...,...,...,...,...,...,...,...,...,...,...,...,...
63708,2210,stan1288,spa,Spanish,Castilian Spanish,006E+031F,n,n̟,False,consonant,ph,True
63728,2210,stan1288,spa,Spanish,Castilian Spanish,0263+031E,ɡ,ɣ̞,False,consonant,ph,True
63731,2210,stan1288,spa,Spanish,Castilian Spanish,006E+031F,ɲ,n̟,False,consonant,ph,True
65196,2234,pite1240,sje,Pite Saami,,0068+0074+0073,ʰts,hts,False,consonant,ph,True


In [42]:
# Remove marginality
for index, phoneme in unhandled_allophones.iterrows():
    unhandled_allophones.at[index, 'Phoneme'] = phoneme['Phoneme'].replace('<', '').replace('>', '')
    unhandled_allophones.at[index, 'GlyphID'] = phoneme['GlyphID'].replace('003C+', '').replace('+003E', '')
# Remove unhandled phonemes
allophoned_df_without_unhandled = unhandled_allophones[unhandled_allophones['Unhandled'] == False]

# NONHANDLED PHONEMES

1. s with downtick [0073+031E+0329], s̞̩, 
2. Combining tilde overlay, to find exact value I'll check back the referneces
3. Multiple phonemes as an allophone, including tsx, and many more (Check comment)
4. 031F (combining plus sign below) with ɜ 
5. 1D4A (small schwa), 2198 (bottom right arrow)
6. Check in with marginality and < and >

In [45]:
allophoned_df_without_unhandled.to_csv('phoible-nofeats.csv', encoding='utf-8')

Above here, process the new chart, then import it again!

# Look into the new processed dataframe

In [13]:
special_feats = pd.read_csv('special-feature-table.csv', low_memory=False)

In [16]:
special_feats['segment'].tolist()

['pɸ',
 'pf',
 'tθ',
 'ts',
 'tʃ',
 'tʆ',
 'ʈʂ',
 'cç',
 'kx',
 'qχ',
 'bβ',
 'bv',
 'dð',
 'dz',
 'dʒ',
 'dʓ',
 'ɖʐ',
 'ɟʝ',
 'ɡɣ',
 'ɢʁ',
 'kp',
 'ɡb',
 'mbβ',
 'mpf',
 'mbv',
 'ndð',
 'nts',
 'ndz',
 'ntʃ',
 'ndʒ',
 'ɳʈʂ',
 'ɳɖʐ',
 'ɲɟʝ',
 'ŋkx',
 'ŋɡɣ',
 'ɴɢʁ',
 'ŋmkp',
 'ŋmɡb',
 'dkx',
 'ɡkx',
 'ɡǀkx',
 'ɡǁkx',
 'ɡǂkx',
 'ɡǃkx',
 'ɡʘkx',
 'kǀkx',
 'kǁkx',
 'kǂkx',
 'kǃkx',
 'kʘkx',
 'pkx',
 'tkx',
 'tʃx',
 'tsɦ',
 'dʒx',
 'kpɾ',
 'ɡbɾ',
 'kpr',
 'ɡbr',
 'tʃɾ',
 'dʒɾ',
 'ntʃɾ',
 'ŋmkpɾ']