# Minimal pairs

## Task

We produce all pairs of Greek/Hebrew vocalized lexemes/occurrences that satisfy these conditions:

1. both members of the pair are equally long
2. the edit distance between the members of the pair is exactly 1
3. exactly one of `(p1, p2)` and `(p2, p1)` is in the result, never both

These are the so-called minimal pairs.
They can be used to find phonological rules and to train language learners.
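
Because both members are equally long, an edit distance of exactly 1 can only be a single substitution: an insertion or a deletion would change the length. The following minimal sketch checks conditions 1 and 2 for one pair without any external module. The name `isMinimalPair` is ours and does not occur in the notebook; positions are counted in Unicode code points.

```python
def isMinimalPair(a, b):
    """True if a and b are equally long and differ in exactly one position."""
    if len(a) != len(b):
        return False
    # count the positions where the two strings disagree
    mismatches = sum(1 for (x, y) in zip(a, b) if x != y)
    return mismatches == 1


print(isMinimalPair("bah", "bal"))   # differ only in the last character
print(isMinimalPair("bah", "bah"))   # identical, so not a pair
print(isMinimalPair("bah", "bald"))  # unequal length, so not a pair
```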

See http://www.ibiblio.org/bgreek/forum/viewtopic.php?f=17&t=4308&p=29117#p29117.

Thanks to Jonathan Robie for pointing me to this question.

So, we produce 4 sets of minimal pairs: for Greek lexemes, Greek occurrences, Hebrew lexemes, Hebrew occurrences.

## Method

We use the Levenshtein edit distance to check condition 2.
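
For readers who want to see what is being computed, here is a plain dynamic-programming version of the edit distance. It is only a sketch: it is much slower than the compiled module and is not used in this notebook, and `editDistance` is a name chosen here for illustration.

```python
def editDistance(a, b):
    """Levenshtein distance: the minimal number of single-character
    insertions, deletions and substitutions that turn a into b."""
    # previous[j] is the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for (i, ca) in enumerate(a, start=1):
        current = [i]
        for (j, cb) in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(editDistance("bah", "bal"))
print(editDistance("kitten", "sitting"))
```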

Install Levenshtein module:

```
sudo -H pip3 install python-Levenshtein
```

Documentation of [Levenshtein module](http://www.coli.uni-saarland.de/courses/LT1/2011/slides/Python-Levenshtein.html)

## Data

We pick text-fabric data for Greek and Hebrew.
TF files start with a few metadata lines, each beginning with `@`, followed by an empty line.
The subsequent lines contain data items, possibly preceded by a node number and a tab, which we ignore:

### Greek

```
@valueType=str
@writtenBy=Text-Fabric
@dateWritten=2017-03-21T14:55:58Z

βίβλος
γένεσις
Ἰησοῦς
Χριστός
υἱός
```

### Hebrew

```
@node
@author=Eep Talstra Centre for Bible and Computer
@dataset=BHSA
@datasetName=Biblia Hebraica Stuttgartensia Amstelodamensis
@email=shebanq@ancient-data.org
@encoders=Constantijn Sikkel (QDF), and Dirk Roorda (TF)
@valueType=str
@version=2017
@website=https://shebanq.ancient-data.org
@writtenBy=Text-Fabric
@dateWritten=2017-10-10T10:55:32Z

1437403	בְּ
רֵאשִׁית
ברא
אֱלֹהִים
אֵת
הַ
```
...
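
The format described above can be read with a few lines of Python. This is a sketch that works on a list of lines instead of a file; `parseTfLines` is a hypothetical name, and the sample is a shortened copy of the Hebrew example.

```python
def parseTfLines(lines):
    """Return the data items of a TF feature file given as an iterable of lines."""
    items = []
    inMeta = True
    for line in lines:
        line = line.rstrip('\n')
        if inMeta:
            # metadata ends at the first line that does not start with '@';
            # that line is the empty separator, so we skip it as well
            if not line.startswith('@'):
                inMeta = False
            continue
        # drop an optional leading number, separated from the value by a tab
        items.append(line.split('\t', 1)[-1])
    return items


sample = [
    '@valueType=str\n',
    '@writtenBy=Text-Fabric\n',
    '\n',
    '1437403\tבְּ\n',
    'רֵאשִׁית\n',
]
print(parseTfLines(sample))
```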

Note that we do not need the package Text-Fabric to handle this data.
However, by using TF it would have been easy to pick up the *node* numbers of the lexemes/occurrences, and from there
to find all occurrences of the lexemes.
That way, we could enrich each minimal pair with a set of concrete examples.

### Greek
We grab raw text-fabric data from [SBLGNT](https://github.com/Dans-labs/text-fabric-data/tree/master/greek/sblgnt),
in particular the data that corresponds to the
[UnicodeLemma and Unicode](https://dans-labs.github.io/text-fabric-data/features/greek/sblgnt/0_overview.html)
features.

These files give the lemma and the surface form for each of the 137000+ word occurrences of the Greek New Testament.
After reading each file, we weed out duplicates before computing the minimal pairs.

### Hebrew
For lexemes we grab raw text-fabric data from
[BHSA](https://github.com/ETCBC/bhsa) (this repo), in particular the data that corresponds to the 
[voc_lex_utf8](https://etcbc.github.io/bhsa/features/hebrew/2017/voc_lex_utf8) feature.

This file contains all 8000+ distinct lexemes of the Hebrew Bible.

For occurrences we use the "phonetic" representation found in the `phono` feature in the
[phono](https://github.com/ETCBC/phono/blob/master/tf/2017/phono.tf) repo.

This file gives the 425,000+ word occurrences of the Hebrew Bible.

## Result

### Greek

#### Lexemes
The result based on lexemes is in [greek-lex.tsv](greek-lex.tsv) having 467 minimal pairs:

```
ἄγε	ἄγω
ἄγω	ἄνω
ἄδηλος	ἄδολος
ἄλαλος	ἄναλος
ἄλλος	ἄλλως
ἄμωμον	ἄμωμος
ἄνεμος	ἄνομος
ἄνεσις	ἄφεσις
ἄνομος	ἄτομος
ἄπειμι	ἔπειμι
```
...

#### Occurrences
The result based on occurrences is in [greek-occ.tsv](greek-occ.tsv) having 6061 minimal pairs:

```
αἰγιαλόν	αἰγιαλὸν
αἰγύπτιοι	αἰγύπτιον
αἰγύπτιοι	αἰγύπτιος
αἰγύπτιον	αἰγύπτιος
αἰνεῖν	αἰτεῖν
αἰνεῖτε	αἰτεῖτε
αἰνὼν	αἰνῶν
αἰνῶν	αἰτῶν
αἰσχρόν	αἰσχρὸν
αἰσχύνας	αἰσχύνης
```
...

### Hebrew

#### Lexemes
The result based on lexemes is in [hebrew-lex.tsv](hebrew-lex.tsv) having 12715 minimal pairs:

```
אֱהִי	אֱוִי
אֱהִי	אֱלִי
אֱוִי	אֱלִי
אֱוִיל	אֱלִיל
אֱלִי	עֱלִי
אֱלִיפַז	אֱלִיפַל
אֱלִישָׁה	אֱלִישָׁע
אֱסוּר	אֵסוּר
אֲבִי	אֲחִי
אֲבִי	אֲנִי
```
...

#### Occurrences
The result based on occurrences is in [hebrew-occ.tsv](hebrew-occ.tsv) having 40138 minimal pairs:

```
baddê	daddê
baddê	vaddê
baddîm	vaddîm
baddîm	ḵaddîm
baddāʸw	maddāʸw
baddāʸw	vaddāʸw
bah	bal
bah	bar
bah	baz
bah	baḏ
```

In [1]:
import os
import time
from unicodedata import normalize, category
from Levenshtein import distance

# Time management
We need to perform over 1 billion comparisons, so we keep track of how much time they take.
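
The number of comparisons grows quadratically: `n` distinct items give `n * (n - 1) / 2` unordered pairs. A quick check, with `nComparisons` as a name of our own choosing:

```python
def nComparisons(nItems):
    """Number of unordered pairs of distinct items."""
    return nItems * (nItems - 1) // 2


print(nComparisons(5451))
```

For the 5451 Greek lexemes this gives 14853975, the number of cases reported in the run output further down.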

In [2]:
timestamp = None

def startTime():
    global timestamp
    timestamp = time.time()

def elapsed():
    interval = time.time() - timestamp
    if interval < 10: return "{: 2.2f}s".format(interval)
    interval = int(round(interval))
    if interval < 60: return "{:>2d}s".format(interval)
    if interval < 3600: return "{:>2d}m {:>02d}s".format(interval // 60, interval % 60)
    return "{:>2d}h {:>02d}m {:>02d}s".format(interval // 3600, (interval % 3600) // 60, interval % 60)

In [3]:
uLetter = {'Ll', 'Lu', 'Lo'}
def stripPunct(x): return ''.join(c for c in x if category(c) in uLetter)
def skipStress(x): return x.replace("ˈ", '').replace("ˌ", '')
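
A small illustration of what these two helpers do, as far as can be read from their definitions. `stripPunct` keeps only characters in the Unicode categories `Ll`, `Lu` and `Lo`, so it also removes combining marks (category `Mn`); that is presumably why it is applied to NFC-normalized text, where accented Greek letters are single precomposed characters. The sample strings are made up for the illustration.

```python
from unicodedata import normalize, category

uLetter = {'Ll', 'Lu', 'Lo'}
def stripPunct(x): return ''.join(c for c in x if category(c) in uLetter)
def skipStress(x): return x.replace("ˈ", '').replace("ˌ", '')

word = 'λόγος,'
# composed form: the accented omicron is one letter, only the comma goes
print(stripPunct(normalize('NFC', word)))
# decomposed form: the accent is a separate combining mark and is dropped too
print(stripPunct(normalize('NFD', word)))
# stress marks are removed from a phonetic string, other characters stay
print(skipStress('baˈddê'))
```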

In [4]:
D = 'NFD'
C = 'NFC'

VERSION = '2017'
DATA = {
    'hebrew-lex': '~/github/etcbc/bhsa/tf/{}/voc_lex_utf8.tf'.format(VERSION),
    'hebrew-occ': '~/github/etcbc/phono/tf/{}/phono.tf'.format(VERSION),
    'greek-lex': '~/github/Dans-labs/text-fabric-data/greek/sblgnt/UnicodeLemma.tf',
    'greek-occ': '~/github/Dans-labs/text-fabric-data/greek/sblgnt/Unicode.tf',
}

for (data, file) in DATA.items(): DATA[data] = os.path.expanduser(file)

OPTIONS = {
    'hebrew-lex': dict(norm=D),
    'hebrew-occ': dict(norm=C, trans=skipStress),
    'greek-lex': dict(norm=D, lower=True),
    'greek-occ': dict(norm=C, lower=True, trans=stripPunct),
}
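
The `norm` option matters for conditions 1 and 2, because length and edit distance are counted in Unicode code points. The sketch below uses two made-up sample words to show that an accented and an unaccented form are equally long under NFC but not under NFD, so they can only form a minimal pair in the composed form.

```python
from unicodedata import normalize

accented = '\u03ac\u03b3\u03c9'  # άγω, with a precomposed accented alpha
plain = '\u03b1\u03b3\u03c9'     # αγω, without accent

for form in ('NFC', 'NFD'):
    a = normalize(form, accented)
    b = normalize(form, plain)
    # print the form, both lengths, and whether condition 1 can hold
    print(form, len(a), len(b), len(a) == len(b))
```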

In [5]:
def readTfData(key):
    dataFile = DATA[key]
    options = OPTIONS[key]
    norm = options.get('norm', None)
    trans = options.get('trans', None)
    lower = options.get('lower', None)
    
    items = []
    with open(dataFile, encoding='utf8') as d:
        inMeta = True
        for line in d:
            if inMeta:
                if not line.startswith('@'):
                    inMeta = False
                continue
            comps = line.rstrip('\n').split('\t', 1)
            item = comps[1] if len(comps) == 2 else comps[0]
            item = normalize(norm, item) if norm else item
            item = item.lower() if lower else item
            item = trans(item) if trans else item
            items.append(item)
    # weed out duplicates
    items = sorted(set(items))
    return items

In [6]:
def getMinimalPairs(items):
    pairs = []
    nItems = len(items)
    
    # we compute the number of cases in order to give progress messages
    # only if there are "many" cases
    # note the number of cases is quadratic in the number of items!
    # If you have fewer than 8000 items, it takes seconds
    # If you have more than 50000 items, it takes minutes
    nCases = nItems * (nItems - 1) // 2
    print('\t{:>3}  {:>11} cases'.format(' ', nCases))
    interval = round(nCases / 100) if nCases > 1000000 else 0

    cases = 0
    c = 0
    perc = 0
    for i in range(nItems):
        # start at i + 1: an item is never paired with itself, which matches nCases
        for j in range(i + 1, nItems):
            cases += 1
            c += 1
            if interval and c == interval:
                c = 0
                perc += 1
                print('\t{:>3}% ({:>11} cases)'.format(perc, cases))
            itemI = items[i]
            itemJ = items[j]
            if len(itemI) != len(itemJ): continue
            d = distance(itemI, itemJ)
            if d == 1:
                pairs.append((itemI, itemJ))
    return (nCases, pairs)
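
The double loop above compares every item with every other item. As a possible alternative, not the method used in this notebook, one can group items by the pattern that remains when one position is blanked out: two distinct items form a minimal pair exactly when they share such a pattern. The sketch below assumes the items are plain strings; `minimalPairsByPattern` is a hypothetical name.

```python
from collections import defaultdict
from itertools import combinations


def minimalPairsByPattern(items):
    """Return sorted (a, b) pairs, a < b, of equally long items
    that differ in exactly one position."""
    buckets = defaultdict(set)
    for item in set(items):
        for k in range(len(item)):
            # the pattern is the item with position k left out
            buckets[(k, item[:k], item[k + 1:])].add(item)
    pairs = set()
    for group in buckets.values():
        # all distinct items in one bucket differ only at position k
        for (a, b) in combinations(sorted(group), 2):
            pairs.add((a, b))
    return sorted(pairs)


print(minimalPairsByPattern(['bah', 'bal', 'bar', 'dal', 'ba']))
```

When few items share a pattern, this avoids the quadratic number of comparisons at the cost of extra memory for the patterns. It has not been timed on the Greek or Hebrew data.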

In [7]:
def writePairs(pairs, pFile):
    with open(pFile, 'w') as f:
        for (item1, item2) in sorted(pairs):
            f.write('{}\t{}\n'.format(item1, item2))

# Generate pairs

By commenting out a language and/or a kind, you can selectively generate a set of minimal pairs
for a set of language items.

In [8]:
selectLang = {
    'greek',
    'hebrew',
}
selectKind = {
    'lex',
    'occ',
}

totalItems = 0
totalCases = 0
totalPairs = 0

startTime()

for lang in sorted(selectLang):
    for kind in sorted(selectKind):
        key = '{}-{}'.format(lang, kind)
        items = readTfData(key)
        nItems = len(items)
        totalItems += nItems
        print('{:<10}\n\t{:>11} items'.format(key, nItems))

        # this may take long
        (nCases, pairs) = getMinimalPairs(items)
        totalCases += nCases
        nPairs = len(pairs)
        totalPairs += nPairs

        pairFile = '{}.tsv'.format(key)
        writePairs(pairs, pairFile)
        print('{:>10} {:>11} pairs'.format(elapsed(), nPairs))
print('''
Computed:
    {:>11} comparisons of
    {:>11} items resulting in
    {:>11} minimal pairs during
    {:>11}
'''.format(
    totalCases,
    totalItems,
    totalPairs,
    elapsed(),
))

greek-lex 
	       5451 items
	        14853975 cases
	  1% (     148540 cases)
	  2% (     297080 cases)
	  3% (     445620 cases)
	  4% (     594160 cases)
	  5% (     742700 cases)
	  6% (     891240 cases)
	  7% (    1039780 cases)
	  8% (    1188320 cases)
	  9% (    1336860 cases)
	 10% (    1485400 cases)
	 11% (    1633940 cases)
	 12% (    1782480 cases)
	 13% (    1931020 cases)
	 14% (    2079560 cases)
	 15% (    2228100 cases)
	 16% (    2376640 cases)
	 17% (    2525180 cases)
	 18% (    2673720 cases)
	 19% (    2822260 cases)
	 20% (    2970800 cases)
	 21% (    3119340 cases)
	 22% (    3267880 cases)
	 23% (    3416420 cases)
	 24% (    3564960 cases)
	 25% (    3713500 cases)
	 26% (    3862040 cases)
	 27% (    4010580 cases)
	 28% (    4159120 cases)
	 29% (    4307660 cases)
	 30% (    4456200 cases)
	 31% (    4604740 cases)
	 32% (    4753280 cases)
	 33% (    4901820 cases)
	 34% (    5050360 cases)
	 35% (    5198900 cases)
	 36% (    5347440 cases)
	...

	  5% (   45634570 cases)
	  6% (   54761484 cases)
	  7% (   63888398 cases)
	  8% (   73015312 cases)
	  9% (   82142226 cases)
	 10% (   91269140 cases)
	 11% (  100396054 cases)
	 12% (  109522968 cases)
	 13% (  118649882 cases)
	 14% (  127776796 cases)
	 15% (  136903710 cases)
	 16% (  146030624 cases)
	 17% (  155157538 cases)
	 18% (  164284452 cases)
	 19% (  173411366 cases)
	 20% (  182538280 cases)
	 21% (  191665194 cases)
	 22% (  200792108 cases)
	 23% (  209919022 cases)
	 24% (  219045936 cases)
	 25% (  228172850 cases)
	 26% (  237299764 cases)
	 27% (  246426678 cases)
	 28% (  255553592 cases)
	 29% (  264680506 cases)
	 30% (  273807420 cases)
	 31% (  282934334 cases)
	 32% (  292061248 cases)
	 33% (  301188162 cases)
	 34% (  310315076 cases)
	 35% (  319441990 cases)
	 36% (  328568904 cases)
	 37% (  337695818 cases)
	 38% (  346822732 cases)
	 39% (  355949646 cases)
	 40% (  365076560 cases)
	 41% (  374203474 cases)
	 42% (  383330388 cases)
	 43% (  392