# Alphabet

We start by developing sets and regular expressions that validate canonical letters in the NENA text corpus. These patterns will be used to validate text input.

In [2]:
import re
import json
import collections
import unicodedata
import pandas as pd
from pprint import pprint
pd.options.display.max_rows = 200
from tf.app import use
nena = use('nena:clone', hoist=globals(), checkout='clone', version='0.02')

Using TF-app in /Users/cody/github/annotation/app-nena/code:
	repo clone offline under ~/github (local github)
Using data in /Users/cody/github/CambridgeSemiticsLab/nena_tf/tf/0.02:
	repo clone offline under ~/github (local github)
   |     0.00s No structure info in otext, the structure part of the T-API cannot be used


## Testing using the corpus

We use the NENA Text-Fabric corpus to find out which characters are actually attested. We also check that base characters and accents are combined consistently.

We then make corrections to the counted set to derive the set of acceptable characters that should go into the standard.

In [3]:
examples = collections.defaultdict(list)
counts = collections.Counter()

# should fix these letters to be indicated as foreign
ignore = [
    re.compile(let) for let in {'ɑ','ŕ', 'ã̀'}
]

def ignore_letter(let):
    """Check whether letter is known foreign"""
    for i_letter in ignore:
        if i_letter.findall(let):
            return True
    return False

for letter in F.otype.s('letter'):
    word = L.u(letter,'word')[0]
    letter_text = F.text.v(letter).lower()
    # skip foreign words
    if F.lang.v(word) or F.foreign.v(word) or ignore_letter(letter_text):
        continue
        
    # add replacements that will be handled later
    # in the architecture / conversion process
    replacements = [
        ('⁺', ''), # should go in "begin"
        ('ɉ', 'ɟ'), # should replace everywhere 
        ('ʸ', 'y'), # should replace everywhere
        ('ĉ', 'č'), # should fix error in text
        ('p̂', 'p̭'), # should put in acceptable substitutions
    ]
    
    for find, replace in replacements:
        letter_text = letter_text.replace(find, replace)
    
    examples[letter_text].append([letter, word])
    counts[letter_text] += 1
    
letter_counts = pd.DataFrame.from_dict(counts, orient='index').sort_values(by=0, ascending=False)

display(letter_counts.head())
print(len(letter_counts))

       0
a  63304
l  38896
ə  29280
á  26857
n  26509


89


In [4]:
letter_counts

       0
a  63304
l  38896
ə  29280
á  26857
n  26509
m  25744
t  23795
r  23351
x  20583
ʾ  19306


### Check for ordering issues

Are all letters with the same set of characters also ordered in the same way?

We can check for such bad cases by doing the following:

1. decompose all letters
2. find pairs of letters that consist of the same set of characters
3. check whether the two letters in such a pair are unequal (`!=`); if so, their characters are ordered differently, and we store the pair.

In [4]:
bad_order_issues = []

for ia, letter_a in enumerate(letter_counts.index):
    
    a_charset = set(unicodedata.normalize('NFD', letter_a))
    
    for ib, letter_b in enumerate(letter_counts.index):
        
        # skip identical indices
        if ia == ib:
            continue
        
        b_charset = set(unicodedata.normalize('NFD', letter_b))

        # same characters but unequal strings: the order differs
        if a_charset == b_charset and letter_a != letter_b:
            bad_order_issues.append((letter_a, letter_b))
            
print(f'{len(bad_order_issues)} bad order issues found')

0 bad order issues found


No issues found. Is it because `unicodedata.normalize` fixes these kinds of problems?

In [5]:
example = unicodedata.normalize('NFD', 'č̭')

for char in example:
    print(ord(char))

99
813
780


In [6]:
# above we have the natural order of the accents
# below we change that order to see what happens

bad_order = chr(99) + chr(780) + chr(813)

print('before:', example)
print('after:', bad_order)

before: č̭
after: č̭


In [7]:
print('good order:')
for char in example:
    print(' '+char, ord(char))

print()

print('bad order:')
for char in bad_order:
    print(' '+char, ord(char))

good order:
 c 99
 ̭ 813
 ̌ 780

bad order:
 c 99
 ̌ 780
 ̭ 813


Now we try to normalize `bad_order` and see if the order issue is solved.

In [8]:
for char in unicodedata.normalize('NFD', bad_order):
    print(' '+char, ord(char))

 c 99
 ̭ 813
 ̌ 780


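**`unicodedata.normalize` does indeed repair the order of these accents.** It sorts combining marks by their canonical combining class, which is why the mark below the letter (813) ends up before the mark above it (780).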
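
The reordering can be traced to Unicode canonical combining classes, which `unicodedata.combining` exposes. The following is a minimal sketch (the variable names are illustrative and not part of the notebook) showing that the two marks on `č̭` belong to different classes, and that marks sharing a class are left in the order they were given. Normalization alone therefore does not make every accent sequence comparable.

```python
import unicodedata

caron = '\u030c'             # combining caron, placed above the base
circumflex_below = '\u032d'  # combining circumflex accent below

# NFD sorts adjacent combining marks by canonical combining class
print(unicodedata.combining(circumflex_below), unicodedata.combining(caron))

swapped = 'c' + caron + circumflex_below
reordered = unicodedata.normalize('NFD', swapped)
print([ord(ch) for ch in reordered])

# acute and macron share a class, so NFD keeps whichever order it is given
acute, macron = '\u0301', '\u0304'
same_class_a = unicodedata.normalize('NFD', 'a' + acute + macron)
same_class_b = unicodedata.normalize('NFD', 'a' + macron + acute)
print(same_class_a == same_class_b)
```

This is why the pairwise check on the corpus letters is still worth running: two vowels carrying the same above-the-letter accents in different orders would survive normalization as distinct strings.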

## Accent range

The full range of combining accents in Unicode (the Combining Diacritical Marks block) is `\u0300-\u036F`. We only want to use a subset of these. Let's identify the relevant range now.

In [9]:
any_accent = re.compile('[\u0300-\u036F]')

attested_accents = set()

for letter in letter_counts.index:
    decomposed = unicodedata.normalize("NFD", letter)
    for c in decomposed:
        if any_accent.match(c):
            attested_accents.add(c)
            
len(attested_accents)

8

In [10]:
for c in attested_accents:
    print(hex(ord(c)))

0x323
0x306
0x32d
0x301
0x307
0x30c
0x300
0x304


Below we write a pattern and test it.

In [11]:
accent_pattern = re.compile('[\u0300-\u033d]')

for c in attested_accents:
    if not accent_pattern.match(c):
        print(hex(ord(c)), 'not found')

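Nothing is printed, so every attested accent falls within `\u0300-\u033d`. The attested accents are scattered across that span, so we cannot define a precise contiguous range. But we can define a smaller range that eliminates some spurious possibilities.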
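
Since the attested accents do not form a contiguous range, one alternative is to build the character class from the attested code points themselves. This is only a sketch: the code points are copied from the printed output above, and the names `attested_codes` and `exact_accent_pattern` are hypothetical.

```python
import re

# code points copied from the printed output of the attested accents
attested_codes = [0x300, 0x301, 0x304, 0x306, 0x307, 0x30c, 0x323, 0x32d]

# a character class listing exactly the attested accents, nothing in between
exact_accent_pattern = re.compile('[' + ''.join(chr(c) for c in attested_codes) + ']')

# every attested accent matches
print(all(exact_accent_pattern.fullmatch(chr(c)) for c in attested_codes))

# an unattested accent inside the coarse range, e.g. the combining tilde, does not
print(exact_accent_pattern.fullmatch('\u0303') is None)
```

The trade-off is that an explicit class must be updated whenever a new accent is admitted to the standard, whereas a range silently accepts it.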

## Towards a Standard

Now that we know `unicodedata.normalize` repairs badly ordered accents, and that our corpus contains no badly ordered accents, we can move towards defining standards for canonical letters.

We will store each letter as a dictionary entry with metadata such as acceptable combining accents, as well as class data about the letter.

To get started, and to generate this dictionary efficiently, we'll begin with a few sets. 

**Note that for all cases we work with lowercase letters**.

In [5]:
# build up letter data into a dictionary
# make vowels set
vowels = list('aeɛəiou')
letter_data = []
vowel_data = []

# categories identified with findall
category2re = {
    'point': [
        ('[pbfvmw]', 'labial'),
        ('[tdθðnlr]|[sz](?!\u030c)', 'dental-alveolar'),
        ('[j]|[csz].?\u030c', 'palatal-alveolar'),
        ('[ɟy]|c(?![\u032d\u0323]?\u030c)', 'palatal'),
        ('[kgx]', 'velar'),
        ('q', 'uvular'),
        ('h\u0323|ʿ', 'pharyngeal'),
        ('h(?!\u0323)|ʾ', 'laryngeal')
    ],    
    'manner': [
        ('[pbtdcjɟkqʾ]|g(?!\u0307)', 'affricative'),
        ('[fvθðxhʿ]|g\u0307', 'fricative'),
        ('[sz]', 'sibilant'),
        ('[mn]', 'nasal'),
        ('l', 'lateral'),
        ('[wry]','other'),
    ],
    'phonation': [
        ('[ptck](?![\u032d\u0323])|ʾ', 'unvoiced_aspirated'),
        ('[ptck]\u032d', 'unvoiced_unaspirated'),
        ('[bdjɟgvðz]', 'voiced'),
        ('[fθxh]|s(?!\u0323)', 'unvoiced'),
        ('[ptckðdszmlr]\u0323|ʿ', 'emphatic'),
        ('[mnlwry](?!\u0323)', 'plain')
    ],
}

category2comp = {}
for cat, patterns in category2re.items():
    category2comp[cat] = [(re.compile(pat), val) for pat,val in patterns]

# for every letter, assign data and store it in the letter list
for letter in sorted(letter_counts.index):
    
    # decomposed data
    decomposed = unicodedata.normalize('NFD', letter)
    decomposed_upper = decomposed[0].upper() + decomposed[1:]
    decomposed_codes = tuple(ord(c) for c in decomposed)
    decomposed_upper_codes = tuple(ord(c) for c in decomposed_upper)
    decomposed_regex = f'{decomposed}(?![\u0300-\u036F])|{decomposed_upper}(?![\u0300-\u036F])'
    
    # composed data
    composed = unicodedata.normalize('NFC', letter)
    composed_upper = unicodedata.normalize('NFC', decomposed_upper)
    composed_codes = tuple(ord(c) for c in composed)
    composed_upper_codes = tuple(ord(c) for c in composed_upper)
    
    # base data
    base = decomposed[0]
    base_upper = decomposed[0].upper()
    base_code = ord(base)
    base_upper_code = ord(base_upper)
    
    # accents data
    accents = decomposed[1:]
    accent_codes = tuple(ord(a) for a in accents)
    letter_class = 'vowel' if base in vowels else 'consonant'
    
    # collect all metadata for this letter
    ldat = {
            'decomposed_regex': decomposed_regex,
            
            'decomposed_string': decomposed,
            'decomposed_upper_string': decomposed_upper,
            'decomposed_codepoints': decomposed_codes,
            'decomposed_upper_codepoints': decomposed_upper_codes,
            
            'composed_string': composed,
            'composed_upper_string': composed_upper,
            'composed_codepoints': composed_codes,
            'composed_upper_codepoints': composed_upper_codes,
            
            'base_string': base,
            'base_upper_string': base_upper,
            'base_code': base_code,
            'base_upper_code': base_upper_code,

            'decomposed_accent_string': accents,
            'decomposed_accent_codes': accent_codes,
            'class': letter_class,
        }
    
    # apply categorizations
    for category, patterns in category2comp.items():
        for patt, value in patterns:
            if patt.findall(decomposed):
                ldat[category] = value
    
    if letter_class == 'consonant':
        letter_data.append(ldat)
    else:
        vowel_data.append(ldat)

def sort_letters(letter_list):
    """Sort letter list"""
    return sorted(
        letter_list, 
        key=lambda data: (data['base_code'], len(data['decomposed_accent_codes']), ''.join(chr(c) for c in data['decomposed_codepoints']))
    )
        
letter_data = sort_letters(letter_data)
vowel_data = sort_letters(vowel_data)
letter_data.extend(vowel_data)

In [6]:
len(letter_data)

89

In [7]:
# for ld in letter_data:
#     pprint(ld, sort_dicts=False, indent=4)
#     print()

In [8]:
with open('../alphabet.json', 'w') as outfile:
    json.dump(letter_data, outfile, indent=4, ensure_ascii=False)

### Are composed/decomposed preserved with json?

In [63]:
with open('../alphabet.json', 'r') as infile:
    alphabet = json.load(infile)

In [64]:
len(alphabet[64]['composed_string'])

1

In [65]:
len(alphabet[64]['decomposed_string'])

2

**Yes. JSON loading preserves the decomposed string.**
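
The same point can be checked without the file on disk. The sketch below uses an illustrative in-memory entry rather than the real `alphabet.json`, and round-trips both forms through `json.dumps` and `json.loads`.

```python
import json
import unicodedata

decomposed = 'c\u032d\u030c'  # c + circumflex below + caron
entry = {
    'decomposed_string': decomposed,
    'composed_string': unicodedata.normalize('NFC', decomposed),
}

# ensure_ascii=False writes the characters themselves, as in the dump above
restored = json.loads(json.dumps(entry, ensure_ascii=False))

print(len(restored['decomposed_string']), len(restored['composed_string']))
print(restored == entry)
```

JSON treats strings as sequences of code points and applies no normalization of its own, so the two forms stay distinct after loading.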

## Testing Alphabet Regex Codes

In [20]:
def normalize_string(string):
    return unicodedata.normalize('NFD', string).lower()

def tokenize_string(string):
    norm_string = normalize_string(string)
    return re.findall('.[\u0300-\u036F]*', norm_string)
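
As a quick illustration of what `tokenize_string` yields, the sketch below repeats the two helpers so that it runs on its own and applies them to a sample word. The sample is written with escapes so that the accents are unambiguous. Note that `.` also matches spaces and hyphens, so these come out as tokens of their own.

```python
import re
import unicodedata

def normalize_string(string):
    return unicodedata.normalize('NFD', string).lower()

def tokenize_string(string):
    norm_string = normalize_string(string)
    return re.findall('.[\u0300-\u036F]*', norm_string)

# 'Qabū̀l-ila': u with macron (precomposed) followed by a combining grave
tokens = tokenize_string('Qab\u016b\u0300l-ila')
print(tokens)
print([len(t) for t in tokens])
```

Each token is one base character followed by all of its combining accents, which is the unit the alphabet patterns are written against.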

In [21]:
type(alphabet)

dict

In [22]:
test_sample = "č̭ c̭"

test_sample

'č̭ c̭'

In [23]:
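# alphabet may be a list of entries (as written to alphabet.json above) or a dict keyed by letter
entries = alphabet.values() if isinstance(alphabet, dict) else alphabet
re2data = {re.compile(rd['decomposed_regex']): rd for rd in entries}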

In [24]:
unmatched = set()

for c in tokenize_string(test_sample):
    match = False
    for patt in re2data:
        if patt.match(c):
            print(c, 'matched', patt, re2data[patt]['decomposed_codepoints'])
            match = True
    if not match:
        unmatched.add(c)

č̭ matched re.compile('č̭(?![̀-ͯ])') [99, 813, 780]
c̭ matched re.compile('c̭(?![̀-ͯ])') [99, 813]


**It works**

## One Regex

We need a single regular expression that validates well-formed NENA text. We build that pattern here using the standardized alphabet.

To do this, we need to cluster base characters on attested accentuations.
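
One way to picture this clustering is sketched below on a handful of letters. The sample letters and the helper name `cluster_by_accents` are illustrative assumptions, not the notebook's data or its final approach: bases that take the same accent sequence are collected together, and each cluster becomes one alternative of the form `[bases]accents`.

```python
import collections
import re
import unicodedata

def cluster_by_accents(letters):
    """Map each accent sequence to the sorted bases that carry it."""
    clusters = collections.defaultdict(set)
    for letter in letters:
        decomposed = unicodedata.normalize('NFD', letter)
        clusters[decomposed[1:]].add(decomposed[0])
    return {accents: sorted(bases) for accents, bases in clusters.items()}

# a few sample letters: s/z with caron, t/s with dot below
sample = ['s\u030c', 'z\u030c', 't\u0323', 's\u0323']
clusters = cluster_by_accents(sample)

# one regex alternative per accent sequence
alternatives = [
    '[' + ''.join(bases) + ']' + accents
    for accents, bases in sorted(clusters.items())
]
pattern = re.compile('|'.join(alternatives))

print(alternatives)
print(bool(pattern.fullmatch('z\u030c')), bool(pattern.fullmatch('z\u0323')))
```

Clustering this way keeps the final pattern short, because each attested accent sequence appears once, next to the full set of bases that may carry it.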

### Experiments 

In [25]:
# test = unicodedata.normalize('NFD', 'hám-ʾən ʾàsqətˈ ʾap-ʾáyya qabū̀l-ila')

# re.findall('.[\u0300-\u036F]*', test)

In [26]:
test_letter = 'ð'

def make_unistring(char):
    """Make a zero-padded \\uXXXX escape string from a character"""
    return f'\\u{ord(char):04x}'

for c in unicodedata.normalize('NFD', test_letter):
    print(' '+c, ord(c), make_unistring(c))

 ð 240 \u00f0


In [27]:
test = unicodedata.normalize('NFD', test_letter)

re.findall('[a-zðɟəɛʾʿθ](?![\u0300-\u036F])', test)

['ð']

### Cluster on overlapping features for regex patterns

In [None]:
accents2bases = collections.defaultdict(list)
for ldat in letter_data:
    uni_string = ''.join(make_unistring(chr(c)) for c in ldat['decomposed_accent_codes'])
    
    if '036f' in uni_string:
        print(ldat['decomposed_string'], uni_string)
    
    accents2bases[uni_string].append(ldat['base_string'])

In [None]:
len(accents2bases)

In [None]:
print(accents2bases)

**cluster a second time**

In [None]:
bases2accents = collections.defaultdict(set)

# group accent strings that apply to exactly the same list of bases
for accents, lset in accents2bases.items():
    if not accents:
        continue
    bases2accents[tuple(lset)].add(accents)
            
# for accents, lset in accents2bases.items():
#     print(accents)
#     print(lset)
#     print()

In [None]:
pprint(bases2accents)

In [None]:
accented_sorted = sorted(bases2accents.items(), key=lambda k: len(k[0]), reverse=True)

for bases, accents in accented_sorted:
    print('[' + ''.join(bases) + ']', ' | '.join(accents))
    print()

### Regexes

In [None]:
def normalize_char(char):
    """Normalize character for regex testing.
    
    Characters normalized by:
        * decomposing/sorting chars with unicodedata.normalize
        * converted to lowercase
    Decomposition allows for one-to-one matching with accents.
    NB: that normalize_char removes capitalization and
    thus should not replace text inputs.
    
    Arguments:
        char: str of single character
    
    Returns:
        str normalized
    """
    return unicodedata.normalize('NFD', char).lower()

base_chars = '[a-zðɟəɛʾʿθ]'
unaccented = base_chars + '(?![\u0300-\u036F])'
cons_accented = '[dhlmprstzð]\u0323|[ckpt]\u032d|[csz]\u030c|c[\u0323\u032d]\u030c|g\u0307'
vowel_accented = '[aeiouəɛ][\u0300\u0301]|[aeiouɛ]\u0304|[aeiou]\u0304[\u0300\u0301]|[au]\u0306[\u0300\u0301]?'
one_regex = '|'.join([unaccented, cons_accented, vowel_accented])
test_regex = re.compile(one_regex)

In [None]:
# test: every attested letter should be matched by the pattern
for letter in letter_counts.index:
    letter_text = unicodedata.normalize('NFD', letter)
    if not test_regex.findall(letter_text):
        print(letter, 'not found')

Every letter is thus captured. Let's test the new system against bogus letters currently residing in the corpus (some of these are valid for foreign strings).

In [None]:
bad_letters = set()

for letter in F.otype.s('letter'):
    letter_text = normalize_char(F.text.v(letter)).replace('⁺', '')
    if not test_regex.findall(letter_text):
        bad_letters.add(letter_text)
        
bad_letters
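
Note that `findall` succeeds as soon as any part of a letter matches, so a letter with an extra trailing accent can still pass. A stricter check, sketched here as an assumption about how the pattern might be applied rather than as part of the standard, is to require `fullmatch` on each normalized letter. The pattern strings are copied from the regex cell above; `strict_regex` and `is_valid_letter` are hypothetical names.

```python
import re
import unicodedata

# pattern strings copied from the regex cell above
base_chars = '[a-zðɟəɛʾʿθ]'
unaccented = base_chars + '(?![\u0300-\u036F])'
cons_accented = '[dhlmprstzð]\u0323|[ckpt]\u032d|[csz]\u030c|c[\u0323\u032d]\u030c|g\u0307'
vowel_accented = '[aeiouəɛ][\u0300\u0301]|[aeiouɛ]\u0304|[aeiou]\u0304[\u0300\u0301]|[au]\u0306[\u0300\u0301]?'
strict_regex = re.compile('|'.join([unaccented, cons_accented, vowel_accented]))

def is_valid_letter(letter):
    """Return True only if the whole normalized letter matches (hypothetical helper)."""
    normalized = unicodedata.normalize('NFD', letter).lower()
    return strict_regex.fullmatch(normalized) is not None

stray_accent = 'c\u032d\u0301'  # c + circumflex below + an unattested acute
print(bool(strict_regex.findall(stray_accent)), is_valid_letter(stray_accent))
print(is_valid_letter('c\u032d\u030c'), is_valid_letter('ð'))
```

Whether every attested letter passes the stricter test would need to be checked against the corpus before adopting it.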