<p>
<img align="left" width="200" src="images/peshitta_small.png"/>
<img align="right" width="200" src="images/tf-small.png"/>
</p>

# LinkSyr workshop

## Amsterdam 2018-03-12

### Dirk Roorda

#### dirk.roorda@dans.knaw.nl

<p>
<img align="left" width="200" src="images/etcbc.png"/>
<img align="right" width="200" src="images/dans.png"/>
</p>

![easy](images/easy.png)

# Link Syriaca data

This notebook links Syriaca data to the Syriac New Testament.

We add the links as features to the
[Text-Fabric](https://github.com/Dans-labs/text-fabric) 
representation of SyrNT.

In [None]:
import sys, os, collections
from IPython.display import display, Markdown, HTML
from tf.fabric import Fabric

In [None]:
REPO = '~/github/etcbc/linksyr'
SOURCE = 'syrnt'
CORPUS = f'{REPO}/data/tf/{SOURCE}'

In [None]:
TF = Fabric(locations=[CORPUS], modules=[''], silent=False)

# Load Features
We load all available features of the SyrNT data.

In [None]:
api = TF.load('', silent=True)
allFeatures = TF.explore(silent=True, show=True)
loadableFeatures = allFeatures['nodes'] + allFeatures['edges']
TF.load(loadableFeatures, add=True, silent=True)
api.makeAvailableIn(globals())

In [None]:
print('\n'.join(allFeatures['nodes']))

## Syriaca data

We load the index of people and places.

In [None]:
SYRIACA = os.path.expanduser(f'{REPO}/data/syriaca')
SC_PEOPLE = f'{SYRIACA}/index_of_persons.csv'
SC_PLACES = f'{SYRIACA}/index_of_places.csv'

SC_URL = 'http://syriaca.org'
SC_PLACE = 'place'
SC_PERSON = 'person'

SC_CONFIG = (
    (SC_PERSON, SC_URL, SC_PEOPLE),
    (SC_PLACE, SC_URL, SC_PLACES),    
)

SC_TYPES = tuple(x[0] for x in SC_CONFIG)

SC_FIELDS = ('trans', 'syriac', 'id')

NA_SYRIAC = {
    '[Syriac Not Available]', 
    '[Syriac Not', 
    '[Syriac',
}

In [None]:
tables = {}
irregular = {}

(transF, syriacF, idF) = SC_FIELDS

for (dataType, baseUrl, dataFile) in SC_CONFIG:
    tables[dataType] = {field: {} for field in SC_FIELDS}
    irregular[dataType] = set()
    dest = tables[dataType]
    irreg = irregular[dataType]
    table = dest[idF]
    indexTrans = dest[transF]
    indexSyriac = dest[syriacF]
    # the index files contain Syriac script, so read them explicitly as UTF-8
    with open(dataFile, encoding='utf8') as fh:
        for (i, line) in enumerate(fh):
            (transV, syriacV, idV) = line.rstrip('\n').split('\t')
            prefix = f'{baseUrl}/{dataType}/'
            if idV.startswith(prefix):
                idV = idV.replace(prefix, '', 1)
            else:
                irreg.add(idV)
            table[idV] = (transV, syriacV)
            indexTrans.setdefault(transV, set()).add(idV)
            if syriacV not in NA_SYRIAC:
                if '[' in syriacV:
                    print(f'WARNING {dataType} line {i+1}: syriac value "{syriacV}"')
                indexSyriac.setdefault(syriacV, set()).add(idV)

In [None]:
for (dataType, data) in tables.items():
    table = data[idF]
    irreg = irregular[dataType]
    print(f'''
{dataType:>12}s: {len(table):>5} (irregular: {len(irreg):>4})
{"by syriac":>12} : {len(data[syriacF]):>5}
{"by trans":>12} : {len(data[transF]):>5}
''')
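
To make the shape of these tables concrete, here is a minimal sketch of the same indexing logic applied to a few in-memory rows. The sample rows, their ids, and the helper name `build_index` are invented for illustration and are not taken from the Syriaca files.

```python
SC_URL = 'http://syriaca.org'
NA_SYRIAC = {'[Syriac Not Available]', '[Syriac Not', '[Syriac'}


def build_index(lines, data_type, base_url=SC_URL):
    """Index tab-separated (trans, syriac, url) rows by id, trans and syriac."""
    prefix = f'{base_url}/{data_type}/'
    by_id, by_trans, by_syriac, irregular = {}, {}, {}, set()
    for line in lines:
        (trans, syriac, ident) = line.rstrip('\n').split('\t')
        if ident.startswith(prefix):
            # keep only the part after the common url prefix
            ident = ident[len(prefix):]
        else:
            # ids that do not follow the expected url pattern are kept as-is
            irregular.add(ident)
        by_id[ident] = (trans, syriac)
        by_trans.setdefault(trans, set()).add(ident)
        if syriac not in NA_SYRIAC:
            by_syriac.setdefault(syriac, set()).add(ident)
    return by_id, by_trans, by_syriac, irregular


# Hypothetical sample rows (not real Syriaca data)
sample = [
    'Paul\tܦܘܠܘܣ\thttp://syriaca.org/person/1\n',
    'Paul of X\tܦܘܠܘܣ\thttp://syriaca.org/person/2\n',
    'Nobody\t[Syriac Not Available]\thttp://example.org/odd/3\n',
]
by_id, by_trans, by_syriac, irregular = build_index(sample, 'person')
```

One Syriac spelling can map to several ids, which is why the indexes store sets rather than single values.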

## Link to SyrNT

We can only hope to find connections based on the Syriac.
Let's see if there are words in the SyrNT text that show up in the persons and places lists.

We work with lexemes.

In [None]:
hits = {dataType: {} for dataType in SC_TYPES}

for lx in F.otype.s('lexeme'):
    lex = F.lexeme.v(lx)
    for dataType in SC_TYPES:
        # the index maps a Syriac string to a *set* of ids
        ids = tables[dataType][syriacF].get(lex)
        if ids is not None:
            hits[dataType][lx] = ids

In [None]:
for (dataType, theseHits) in hits.items():
    print(f'{dataType:>12}s: {len(theseHits):>5} hits')
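
The matching step is a plain dictionary lookup per lexeme. The sketch below shows it without Text-Fabric, assuming a small mapping from lexeme node to lexeme string as a stand-in for `F.otype.s('lexeme')` and `F.lexeme.v`. All node numbers, ids and index contents are hypothetical.

```python
# Hypothetical lexeme nodes with their lexeme strings
lexeme_of = {101: 'ܦܘܠܘܣ', 102: 'ܐܘܪܫܠܡ', 103: 'ܟܬܒܐ'}

# Hypothetical Syriac indexes per data type: Syriac string -> set of ids
syriac_index = {
    'person': {'ܦܘܠܘܣ': {'1', '2'}},
    'place': {'ܐܘܪܫܠܡ': {'104'}},
}

hits = {data_type: {} for data_type in syriac_index}
for (node, lex) in lexeme_of.items():
    for (data_type, index) in syriac_index.items():
        ids = index.get(lex)
        if ids is not None:
            # a hit: this lexeme is spelled exactly like an index entry
            hits[data_type][node] = ids
```

Because the lookup is exact, any difference in spelling or vocalization between the two sources prevents a hit.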

We show the hits by picking the first occurrence of each lexeme and showing it in context.

In [None]:
for (dataType, theseHits) in hits.items():
    markdown = f'''### {dataType}s
lexeme | linked | n-occs | passage | verse text
--- | --- | --- | --- | ---
'''
    for (lx, linked) in sorted(
        theseHits.items(),
        key=lambda x: F.lexeme.v(x[0]),
    ):
        lex = F.lexeme.v(lx)
        ids = ' '.join(sorted(linked))
        occs = L.d(lx, otype='word')
        passage = '{} {}:{}'.format(*T.sectionFromNode(occs[0]))
        verse = L.u(occs[0], otype='verse')[0]
        text = T.text(L.d(verse, otype='word'))
        markdown += (
            f'<span class="syc">{lex}</span> | {ids} | {len(occs)} | {passage} |'
            f' <span class="syc">{text}</span>\n'
        )
    display(Markdown(markdown))

In [None]:
HTML('''
<style>
.syc {
    font-family: Estrangelo Edessa;
    font-size: 14pt;
}
</style>
''')

We want to disambiguate the entity references.
First we need to see all possible references per name, in order to weed out the ones that
are definitely not applicable.

We generate a list and display it here. We also write the list to a file, which is tab-separated despite its `.csv` extension.

In [None]:
syriacaResolve = os.path.expanduser(f'{REPO}/data/user/syriacaSyrNT.csv')

In [None]:
fieldNames = ('lexeme', 'trans', 'url', 'applicable')

with open(syriacaResolve, 'w', encoding='utf8') as fh:
    # write the header once, not once per data type
    fh.write('\t'.join(fieldNames) + '\n')
    for (dataType, theseHits) in hits.items():
        tsv = ''
        markdown = f'''### {dataType}s
{" | ".join(fieldNames)}
--- | --- | --- | ---
'''
        data = tables[dataType][idF]
        for (lx, linked) in sorted(
            theseHits.items(),
            key=lambda x: F.lexeme.v(x[0]),
        ):
            lex = F.lexeme.v(lx)
            # sort the ids so that the output is reproducible
            for lid in sorted(linked):
                trans = data[lid][0]
                url = f'{SC_URL}/{dataType}/{lid}'
                markdown += (
                    f'<span class="syc">{lex}</span> | {trans} | {url} | no\n'
                )
                # every candidate starts as not applicable
                tsv += f'{lex}\t{trans}\t{url}\tno\n'

        fh.write(tsv)
        display(Markdown(markdown))
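
Every candidate is written with `applicable` set to `no`, so a person can go through the file and mark the right ones. How the edited file is consumed afterwards is not part of this notebook. The following is only a sketch of one possible way to read it back, assuming that applicable rows are marked with `yes`. The file content below is hypothetical.

```python
import csv
import io

# Hypothetical file content after a human has edited the "applicable" column
edited = (
    'lexeme\ttrans\turl\tapplicable\n'
    'ܦܘܠܘܣ\tPaul\thttp://syriaca.org/person/1\tyes\n'
    'ܦܘܠܘܣ\tPaul of X\thttp://syriaca.org/person/2\tno\n'
)

resolved = {}
reader = csv.DictReader(io.StringIO(edited), delimiter='\t')
for row in reader:
    # keep only the links that were marked as applicable (assumed marker: "yes")
    if row['applicable'].strip().lower() == 'yes':
        resolved.setdefault(row['lexeme'], set()).add(row['url'])
```

The result maps each lexeme to the set of urls that survived the manual check.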

# Thanks

## Dirk Roorda

### dirk.roorda@dans.knaw.nl

<img align="right" width="400" src="images/dans.png"/>