# Minimal pairs

## Task

We produce all pairs of Hebrew vocalized lexemes that satisfy these conditions:

1. both members of the pair are equally long
2. the edit distance between the members of the pair is exactly 1
3. each pair is listed in one order only: either `(p1, p2)` or `(p2, p1)` is in the result, but not both

These are the so-called minimal pairs.
They can be used to find phonological rules and to train language learners.
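
As a minimal sketch of conditions 1 and 2 taken together: two strings of equal length at edit distance 1 can only differ by a single substitution, so the check reduces to counting differing positions. The function name `is_minimal_pair` is hypothetical and not part of the code further down. "Character" here means Unicode code point, so a vowel point counts as a position of its own.

```python
def is_minimal_pair(a, b):
    """True if a and b are equally long and differ in exactly one position."""
    if len(a) != len(b):
        return False
    # count the positions where the two strings disagree
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return differences == 1


print(is_minimal_pair('ברא', 'קרא'))  # first consonant differs → True
print(is_minimal_pair('ברא', 'ברא'))  # identical → False
```

This is only an illustration of the definition; the notebook itself uses the Levenshtein module for the distance check.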

See http://www.ibiblio.org/bgreek/forum/viewtopic.php?f=17&t=4308&p=29117#p29117.

Thanks to Jonathan Robie for pointing me to this question.

## Method

We use the Levenshtein edit distance to check condition 2.

Install the Levenshtein module:

```
sudo -H pip3 install python-Levenshtein
```

Documentation of the [Levenshtein module](http://www.coli.uni-saarland.de/courses/LT1/2011/slides/Python-Levenshtein.html).
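
For readers who want to see what the edit distance measures, here is a plain-Python sketch of the classic dynamic-programming formulation. It is an illustration only, not the implementation inside the Levenshtein module, and the function name `levenshtein` is hypothetical.

```python
def levenshtein(a, b):
    """Minimal number of insertions, deletions and substitutions turning a into b."""
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein('kitten', 'sitting'))  # 3
```

Because an insertion or a deletion changes the length, requiring equal length on top of distance 1 leaves only single substitutions.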

## Data

We grab raw Text-Fabric data from this repo, in particular the data that corresponds to the
[voc_lex_utf8](https://etcbc.github.io/bhsa/features/hebrew/2017/voc_lex_utf8) feature.

This file contains all 9000+ lexemes of the Hebrew Bible.
The file starts with a few metadata lines, each starting with `@`, followed by an empty line.
Each subsequent line contains a lexeme, possibly preceded by a node number and a tab, which we ignore:

```
@node
@author=Eep Talstra Centre for Bible and Computer
@dataset=BHSA
@datasetName=Biblia Hebraica Stuttgartensia Amstelodamensis
@email=shebanq@ancient-data.org
@encoders=Constantijn Sikkel (QDF), and Dirk Roorda (TF)
@valueType=str
@version=2017
@website=https://shebanq.ancient-data.org
@writtenBy=Text-Fabric
@dateWritten=2017-10-10T10:55:32Z

1437403	בְּ
רֵאשִׁית
ברא
אֱלֹהִים
אֵת
הַ
```
...

Note that we do not need the package Text-Fabric to handle this data.
However, by using TF it would have been easy to pick up the *node* numbers of the lexemes, and from there
to find all occurrences of the lexemes. That way, we could enrich each minimal pair with a set of concrete examples.
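
The node numbers can also be recovered from the raw lines, at least roughly. The sketch below assumes that the metadata lines and the empty line have already been stripped, that an explicit node specification is a single integer before the tab, and that a line without a number belongs to the node following the previous one. These are assumptions about the TF file format, not something the code in this notebook relies on, and the function name `read_with_nodes` is hypothetical.

```python
def read_with_nodes(lines):
    """Return (node, lexeme) tuples from the data lines of a TF feature file."""
    result = []
    node = 0
    for line in lines:
        line = line.rstrip('\n')
        if '\t' in line:
            # explicit node number, then a tab, then the value
            node_str, value = line.split('\t', 1)
            node = int(node_str)
        else:
            # assumption: an omitted number means "previous node + 1"
            node += 1
            value = line
        result.append((node, value))
    return result


sample = ['1437403\tבְּ', 'רֵאשִׁית', 'ברא']
for node, lexeme in read_with_nodes(sample):
    print(node, lexeme)
```

Finding the occurrences of each lexeme from its node would still need Text-Fabric or the other feature files.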

## Result

The result is a tab-delimited file of 22000+ minimal pairs:

```
בְּ	כְּ
בְּ	כְּ
ברא	קרא
ברא	ברך
ברא	בוא
ברא	ירא
ברא	קרא
ברא	ברח
ברא	ברך
ברא	בטא
ברא	ברה
ברא	ברה
ברא	ברר
ברא	בדא
ברא	בזא
ברא	ברד
ברא	פרא
ברא	מרא
ברא	ברק
ברא	מרא
ברא	ברך
ברא	קרא
ברא	ברך
אֵת	אֵד
אֵת	אֵם
אֵת	אֵי
אֵת	עֵת
אֵת	חֵת
```
...
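
The sample above contains repeated lines, for example `ברא	קרא` occurs three times. If that is unwanted, the pairs can be filtered afterwards. The sketch below is one possible post-processing step, not part of the notebook: the hypothetical helper `unique_unordered_pairs` keeps the first occurrence of each pair and treats `(p1, p2)` and `(p2, p1)` as the same pair.

```python
def unique_unordered_pairs(pairs):
    """Drop repeated pairs, ignoring the order of the two members."""
    seen = set()
    result = []
    for a, b in pairs:
        key = frozenset((a, b))  # same key for (a, b) and (b, a)
        if key not in seen:
            seen.add(key)
            result.append((a, b))
    return result


pairs = [('ברא', 'קרא'), ('ברא', 'ברך'), ('ברא', 'קרא'), ('קרא', 'ברא')]
print(len(unique_unordered_pairs(pairs)))  # 2
```

Whether repeated lines should be dropped depends on the use: they may reflect distinct lexemes that happen to share a form.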

In [33]:
import os
from unicodedata import normalize
from Levenshtein import distance

In [34]:
NFD = 'NFD'
REPO = os.path.expanduser('~/github/etcbc/bhsa')
TEMP = '{}/_temp'.format(REPO)
VERSION = '2017'
TFDIR = '{}/tf/{}'.format(REPO, VERSION)
lexemeData = '{}/voc_lex_utf8.tf'.format(TFDIR)
pairFile = '{}/minimalPairs.tsv'.format(TEMP)

In [35]:
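def readTfData(dataFile):
    # read the lexemes from a TF feature file, skipping the metadata header
    items = []
    with open(dataFile, encoding='utf8') as d:
        inMeta = True
        for line in d:
            if inMeta:
                # the first line that does not start with '@' is the empty
                # separator line: it ends the metadata and is skipped itself
                if not line.startswith('@'):
                    inMeta = False
                continue
            # a data line is either "node<TAB>lexeme" or just "lexeme"
            comps = line.rstrip('\n').split('\t', 1)
            item = comps[1] if len(comps) == 2 else comps[0]
            # store the NFD-normalized form, so that combining marks are ordered consistently
            items.append(normalize(NFD, item))
    return items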
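
The reader above refers to NFD normalization. A small sketch of why that can matter for pointed Hebrew: the same letter with the same two points can be stored with the points in either order, and the two strings only compare equal after normalization. The variable names are illustrative, and whether the BHSA file actually contains such variation is not checked here.

```python
from unicodedata import normalize

BET, SHEVA, DAGESH = '\u05d1', '\u05b0', '\u05bc'

# the same pointed letter, with the two combining marks in different orders
typed_one_way = BET + DAGESH + SHEVA
typed_other_way = BET + SHEVA + DAGESH

same_raw = typed_one_way == typed_other_way
same_nfd = normalize('NFD', typed_one_way) == normalize('NFD', typed_other_way)
print(same_raw, same_nfd)  # False True
```

Since the distance is computed per code point, two unnormalized spellings of one form could otherwise show up as differing strings.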

In [36]:
vlexs = readTfData(lexemeData)

In [22]:
len(vlexs)

9233

In [37]:
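def getMinimalPairs(items):
    # compare every item with every later item, so each pair is tested once
    pairs = []
    nItems = len(items)
    for i in range(nItems):
        itemI = items[i]
        for j in range(i + 1, nItems):
            itemJ = items[j]
            # condition 1: both members are equally long
            if len(itemI) != len(itemJ):
                continue
            # condition 2: the edit distance is exactly 1
            if distance(itemI, itemJ) == 1:
                pairs.append((itemI, itemJ))
    return pairs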
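
The double loop compares every item with every other item. As a hedged alternative, not the method used in this notebook, the same pairs can be found by bucketing: each item is filed once per position under the key "item with that position left out", and two different items in the same bucket differ in exactly that one position. The function name `minimal_pairs_by_bucket` is hypothetical. The pairs come out in a different order than from the double loop.

```python
from collections import defaultdict
from itertools import combinations


def minimal_pairs_by_bucket(items):
    """Find all pairs of equally long items that differ in exactly one position."""
    buckets = defaultdict(list)
    for idx, item in enumerate(items):
        for pos in range(len(item)):
            # key: the item with the character at pos removed, plus pos itself
            key = (pos, item[:pos], item[pos + 1:])
            buckets[key].append(idx)
    pairs = []
    for indices in buckets.values():
        for i, j in combinations(indices, 2):
            # identical items share every bucket; they are not minimal pairs
            if items[i] != items[j]:
                pairs.append((items[i], items[j]))
    return pairs


print(sorted(minimal_pairs_by_bucket(['bar', 'car', 'bat', 'cat', 'ab'])))
```

The work grows with the total number of characters plus the bucket sizes, rather than with the square of the number of lexemes.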

In [38]:
# this takes 10 seconds or so
minimalPairs = getMinimalPairs(vlexs)

In [39]:
len(minimalPairs)

22197

In [31]:
def writePairs(pairs, pFile):
    # write one pair per line, members separated by a tab
    with open(pFile, 'w', encoding='utf8') as f:
        for (item1, item2) in pairs:
            f.write('{}\t{}\n'.format(item1, item2))

In [32]:
writePairs(minimalPairs, pairFile)