<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Data" data-toc-modified-id="Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data</a></span></li><li><span><a href="#Reference-docs" data-toc-modified-id="Reference-docs-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Reference docs</a></span></li><li><span><a href="#Start-up" data-toc-modified-id="Start-up-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Start up</a></span></li><li><span><a href="#Sign-representations" data-toc-modified-id="Sign-representations-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Sign representations</a></span></li><li><span><a href="#Pairs-per-object" data-toc-modified-id="Pairs-per-object-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Pairs per object</a></span><ul class="toc-item"><li><span><a href="#Simple-closeness" data-toc-modified-id="Simple-closeness-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Simple closeness</a></span></li><li><span><a href="#Refined-closeness" data-toc-modified-id="Refined-closeness-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Refined closeness</a></span></li></ul></li></ul></div>

<img align="left" src="images/P005381-obverse-photo.png" width="15%"/>
<img align="left" src="images/P005381-obverse-lineart-annot.png" width="15%"/>
<img align="right" src="images/P005381-reverse-photo.png" width="15%"/>
<img align="right" src="images/P005381-reverse-lineart.png" width="15%"/>

<p>
```
&P005381 = MSVO 3, 70
```
</p>
<p>
<img src="images/P005381-obverse-atf.png" width="40%"/>
<img src="images/P005381-reverse-atf.png" width="40%"/>
</p>

<img align="right" src="images/tf-small.png"/>


# Collation

We want to gain insight into the co-occurrences of signs on tablets in the 
[Uruk III/IV](http://cdli.ox.ac.uk/wiki/doku.php?id=proto-cuneiform)
corpus (4000-3100 BC).
These tablets have a poor archival context: they come from rubbish pits, and may have been
transported from places other than where they were excavated.

To learn more about their chronology and context, we need to study the evolution of
the signs on the tablets. Collation is one of the prerequisites for doing so.

The [start](start.ipynb) tutorial ended with a first exercise in collation, where we collated pairs of signs
that co-occur on tablets, using an unsophisticated distance measure.

We repeat that exercise, and proceed to refine the collation method step by step.

## Data

We have downloaded the transcriptions from the 
**Cuneiform Digital Library Initiative**
[CDLI](https://cdli.ucla.edu),
and converted them to
[Text-Fabric](https://github.com/Dans-labs/text-fabric).
Read more about the details of the conversion in the
[checks](checks.ipynb) notebook.
For an introduction to Text-Fabric, follow the
[start](start.ipynb) tutorial.

## Reference docs
The functions used by this notebook are documented in the following places:

[Feature docs](https://github.com/Dans-labs/Nino-cunei/blob/master/docs/transcription.md)

[Cunei API](https://github.com/Dans-labs/Nino-cunei/blob/master/docs/cunei.md)

[Utils API](https://github.com/Dans-labs/Nino-cunei/blob/master/docs/utils.md)

[Text-Fabric API](https://github.com/Dans-labs/text-fabric)


# Authors

J. Cale Johnson and Dirk Roorda (see the 
[README](https://github.com/Dans-labs/Nino-cunei)
of this repository).

## Start up

We import the Python modules we need.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys, os, collections
from tf.fabric import Fabric

We set up our working locations on the file system.

In [3]:
REPO = '~/github/Dans-labs/Nino-cunei'
SOURCE = 'uruk'
VERSION = '0.1'
CORPUS = f'{REPO}/tf/{SOURCE}/{VERSION}'
SOURCE_DIR = os.path.expanduser(f'{REPO}/sources/cdli')
PROGRAM_DIR = os.path.expanduser(f'{REPO}/programs')
TEMP_DIR = os.path.expanduser(f'{REPO}/_temp')
RESULT_DIR = f'{TEMP_DIR}/collation'
REPORT_DIR = os.path.expanduser(f'{REPO}/reports')

We make the programs directory importable, and we create the temporary, result, and report directories if they do not exist already.

In [4]:
sys.path.append(PROGRAM_DIR)
from cunei import Cunei
from utils import Compare

In [5]:
for cdir in (TEMP_DIR, REPORT_DIR, RESULT_DIR):
    os.makedirs(cdir, exist_ok=True)

In [6]:
TF = Fabric(locations=[CORPUS], modules=[''], silent=False)

This is Text-Fabric 3.2.2
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

32 features found and 0 ignored


In [7]:
api = TF.load('''
    grapheme prime repeat
    variant variantOuter
    modifier modifierInner modifierFirst
    damage uncertain remarkable written
    period name type identifier catalogId
    number fullNumber origNumber badNumbering
    crossref text
    srcLn srcLnNum
    op sub comments''')
api.makeAvailableIn(globals())
CUNEI = Cunei(api)
COMP = Compare(api, SOURCE_DIR, TEMP_DIR)

  0.00s loading features ...
   |     0.00s B catalogId            from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.02s B fullNumber           from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.02s B number               from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.05s B grapheme             from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.05s B srcLn                from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.02s B srcLnNum             from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.00s B prime                from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.01s B repeat               from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.01s B variant              from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.00s B variantOuter         from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.00s B modi

We pick up where we left off in the [start](start.ipynb) tutorial: computing co-occurrences
by tablet. This time we put our recipes into functions, which we will reuse and refine later on.

## Sign representations

We pre-compute the sign representations for each node.
We also make an index of occurrences for each sign representation.

In [8]:
NA = {'', '…', 'X'}

def getSigns():
    signFromNode = dict()
    nodeFromSign = collections.defaultdict(list)

    for tablet in F.otype.s('tablet'):
        for s in L.d(tablet, otype='sign'):
            if F.grapheme.v(s) in NA:
                continue
            signRep = CUNEI.atfFromSign(s)
            signFromNode[s] = signRep
            nodeFromSign[signRep].append(s)
    print(f'computed {len(nodeFromSign)} distinct sign representations from {len(signFromNode)} nodes')
    return (signFromNode, nodeFromSign)

(signFromNode, nodeFromSign) = getSigns()

computed 1526 distinct sign representations from 93253 nodes


## Pairs per object

In the [start](start.ipynb) tutorial we collected pairs per tablet, and we calculated
a coarse distance between pairs, based on the distance of signs in the linear sequence
by which signs have been transcribed.

We are going to write that process as a function, where we abstract from the level at
which the pairs must co-occur. We also abstract from how we measure the distance.

We write a function `getPairs(perType, measureName)` that computes co-occurring pairs on objects
of type `perType`. Here `measureName` is the name of a function that, given two sign nodes, computes
a measure of closeness between those nodes.

We also show the top pairs, and save all pairs to disk in a TSV file.

In [14]:
def simpleRelativeCloseness(i, j, signLength=None, quadInfo=None):
    return (signLength - abs(j - i)) / signLength

def refinedRelativeCloseness(i, j, signLength=None, quadInfo=None):
    (quadLength, outerQuadFromSign) = quadInfo
    return (quadLength - abs(outerQuadFromSign[i] - outerQuadFromSign[j])) / quadLength
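
To make the two measures concrete, here is a small worked example. It repeats the two functions so that it runs on its own; the node numbers, `signLength`, and the quad positions are made-up values for illustration, not corpus data.

```python
def simpleRelativeCloseness(i, j, signLength=None, quadInfo=None):
    return (signLength - abs(j - i)) / signLength

def refinedRelativeCloseness(i, j, signLength=None, quadInfo=None):
    (quadLength, outerQuadFromSign) = quadInfo
    return (quadLength - abs(outerQuadFromSign[i] - outerQuadFromSign[j])) / quadLength

# Simple measure: hypothetical object whose signs span 10 positions.
adjacent = simpleRelativeCloseness(3, 4, signLength=10)   # (10 - 1) / 10
extremes = simpleRelativeCloseness(0, 10, signLength=10)  # (10 - 10) / 10

# Refined measure: hypothetical object with 4 outer quads.
# Signs 1 and 2 sit in the same outer quad (position 0), sign 3 in the last one.
quadInfo = (4, {1: 0, 2: 0, 3: 3})
sameQuad = refinedRelativeCloseness(1, 2, quadInfo=quadInfo)  # (4 - 0) / 4
farQuads = refinedRelativeCloseness(1, 3, quadInfo=quadInfo)  # (4 - 3) / 4

print(adjacent, extremes, sameQuad, farQuads)
```

Both measures yield 1 for signs that coincide in position and decrease linearly as the signs move apart, relative to the size of the enclosing object.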

In [29]:
SHOWPAIRS = 10

def getPairs(perType, measureName):
    pairs = collections.Counter()
    measure = globals()[measureName]

    for obj in F.otype.s(perType):
        signs = L.d(obj, otype='sign')
        signLength = signs[-1] - signs[0]
        
        outerQuads = CUNEI.getOuterQuads(obj)
        quadLength = len(outerQuads)
        outerQuadFromSign = {}
        for (i, outerQuad) in enumerate(outerQuads):
            if F.otype.v(outerQuad) == 'sign':
                outerQuadFromSign[outerQuad] = i
            else:
                for s in L.d(outerQuad, otype='sign'):
                    outerQuadFromSign[s] = i
        if not(set(signs) <= set(outerQuadFromSign)):
            print('\n'.join(COMP.getSource(obj)))
            print(outerQuadFromSign)
        quadInfo = (quadLength, outerQuadFromSign)

        thesePairs = {}
        for i in range(len(signs)):
            if i not in signFromNode:
                continue
            signI = signFromNode[i]
            for j in range(i + 1, len(signs)):
                if j not in signFromNode:
                    continue
                signJ = signFromNode[j]
                if signJ == signI:
                    continue
                pair = (signI, signJ) if signI < signJ else (signJ, signI)
                closeness = measure(i, j, signLength=signLength, quadInfo=quadInfo)
                oldCloseness = thesePairs.get(pair, None)
                if oldCloseness is None or oldCloseness < closeness:
                    thesePairs[pair] = closeness
        for (pair, closeness) in thesePairs.items():
            pairs[pair] += closeness
    showPairs(pairs, perType, measureName, limit=SHOWPAIRS)
    return pairs

def showPairs(pairs, perType, measureName, limit=None):
    print(f'{len(pairs)} co-occurrences in {perType}s with measure {measureName}')
    sortedPairs = sorted(
        pairs.items(), 
        key=lambda x: (-x[1], x[0]),
    )
    pairFile = f'per-{perType}-{measureName}.tsv'
    pairPath = f'{RESULT_DIR}/{pairFile}'
    with open(pairPath, 'w') as fh:
        fh.write(f'signI\tsignJ\t{measureName}\n')
        for ((signI, signJ), closeness) in sortedPairs:
            fh.write(f'{signI}\t{signJ}\t{closeness}\n')
    print(f'Written {len(pairs)} pairs to {pairFile}')
    
    shownPairs = sortedPairs if limit is None else sortedPairs[0:limit]
    for ((signI, signJ), closeness) in shownPairs:
        print(f'{signI:>10} <=~ {closeness:>7.2f} ~=> {signJ:<10}')
    if limit is not None and limit < len(sortedPairs):
        print(f'...and {len(sortedPairs) - limit} more')
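
The aggregation inside `getPairs` can be hard to read between the bookkeeping: within one object, each unordered pair keeps only its highest closeness, and these per-object maxima are then summed over all objects. The sketch below isolates that step. The `observations` dictionary and its values are hypothetical stand-ins for what the nested loops would produce.

```python
import collections

# Hypothetical closeness observations per object: (signI, signJ, closeness).
observations = {
    'tabletA': [('SUHUR', 'DUG~b', 0.5), ('DUG~b', 'SUHUR', 0.75)],
    'tabletB': [('SUHUR', 'DUG~b', 0.25)],
}

pairs = collections.Counter()
for (obj, triples) in observations.items():
    best = {}
    for (signI, signJ, closeness) in triples:
        # Order the pair so that (a, b) and (b, a) count as the same pair.
        pair = (signI, signJ) if signI < signJ else (signJ, signI)
        # Keep only the highest closeness seen for this pair in this object.
        if closeness > best.get(pair, float('-inf')):
            best[pair] = closeness
    # Add the per-object maxima to the corpus-wide totals.
    for (pair, closeness) in best.items():
        pairs[pair] += closeness

print(pairs)
```

With these values, `tabletA` contributes 0.75 and `tabletB` contributes 0.25, so the single pair ends up with a total of 1.0.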

### Simple closeness

Let's put this function to work: we collect pairs per tablet, with a closeness
function that is relative to the length of the tablet in signs.

In [16]:
pairsTablet = getPairs('tablet', 'simpleRelativeCloseness')

6903 co-occurrences in tablets with measure simpleRelativeCloseness
Written 6903 pairs to per-tablet-simpleRelativeCloseness.tsv
    3(N14) <=~ 3946.86 ~=> SANGA~a   
    1(N14) <=~ 1691.75 ~=> 3(N14)    
    1(N14) <=~ 1658.29 ~=> SUHUR     
    3(N14) <=~ 1574.87 ~=> SUHUR     
    1(N01) <=~ 1371.65 ~=> DUG~b     
    1(N01) <=~ 1368.98 ~=> SUHUR     
    1(N01) <=~ 1336.39 ~=> 1(N14)    
    1(N57) <=~ 1309.85 ~=> DUG~b     
    1(N01) <=~ 1281.73 ~=> 1(N57)    
     DUG~b <=~ 1280.59 ~=> SUHUR     
...and 6893 more


And now the beauty: we can do the same for co-occurrences on faces, columns, lines.

In [17]:
pairsFace = getPairs('face', 'simpleRelativeCloseness')

6903 co-occurrences in faces with measure simpleRelativeCloseness
Written 6903 pairs to per-face-simpleRelativeCloseness.tsv
    3(N14) <=~ 3859.17 ~=> SANGA~a   
    1(N14) <=~ 1360.27 ~=> 3(N14)    
    1(N14) <=~ 1319.55 ~=> SUHUR     
    3(N14) <=~ 1250.66 ~=> SUHUR     
    1(N01) <=~ 1065.79 ~=> SUHUR     
    1(N01) <=~ 1060.74 ~=> DUG~b     
    1(N01) <=~ 1039.43 ~=> 1(N14)    
    1(N57) <=~ 1002.81 ~=> DUG~b     
     DUG~b <=~  987.96 ~=> SUHUR     
    1(N01) <=~  986.59 ~=> 3(N14)    
...and 6893 more


The same number of co-occurrences on tablets as on faces. Are they the same pairs?
Let's check. We strip the closeness numbers from the pairs, reduce them to sets, and test
whether they are equal.

In [18]:
set(pairsTablet) == set(pairsFace)

True

In [19]:
pairsColumn = getPairs('column', 'simpleRelativeCloseness')

3321 co-occurrences in columns with measure simpleRelativeCloseness
Written 3321 pairs to per-column-simpleRelativeCloseness.tsv
    3(N14) <=~ 4640.35 ~=> SANGA~a   
    1(N14) <=~  734.40 ~=> 3(N14)    
    1(N14) <=~  691.12 ~=> SUHUR     
    3(N14) <=~  649.35 ~=> SUHUR     
    1(N01) <=~  500.79 ~=> SUHUR     
    1(N01) <=~  492.95 ~=> DUG~b     
    1(N01) <=~  486.40 ~=> 1(N14)    
    1(N01) <=~  457.58 ~=> 3(N14)    
    1(N57) <=~  455.30 ~=> DUG~b     
     DUG~b <=~  453.81 ~=> SUHUR     
...and 3311 more


In [20]:
pairsLine = getPairs('line', 'simpleRelativeCloseness')

190 co-occurrences in lines with measure simpleRelativeCloseness
Written 190 pairs to per-line-simpleRelativeCloseness.tsv
    3(N14) <=~  875.31 ~=> SANGA~a   
    1(N14) <=~   34.61 ~=> 3(N14)    
    1(N14) <=~   29.10 ~=> SUHUR     
    3(N14) <=~   27.29 ~=> SUHUR     
    1(N01) <=~   22.36 ~=> DUG~b     
    1(N01) <=~   21.09 ~=> SUHUR     
    1(N01) <=~   20.45 ~=> 1(N14)    
     DUG~b <=~   20.45 ~=> SUHUR     
    1(N14) <=~   19.81 ~=> DUG~b     
    1(N57) <=~   19.47 ~=> DUG~b     
...and 180 more


### Refined closeness

We want to employ a more refined notion of closeness, one that does justice to the
geometry of a tablet.

Instead of basing closeness on sign distance, we want to base it on quad distance.

In [30]:
pairsTablet = getPairs('tablet', 'refinedRelativeCloseness')

KeyError: 6
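
A plausible reading of this error, offered as an assumption rather than a confirmed diagnosis: inside `getPairs`, `i` and `j` run over positions in `signs` (`range(len(signs))`), whereas `outerQuadFromSign` is keyed by sign node. A position such as `6` is then looked up where a node is expected. The sketch below reproduces that mix-up with made-up node numbers and shows one possible way around it, namely iterating over the nodes themselves.

```python
def refinedRelativeCloseness(i, j, signLength=None, quadInfo=None):
    (quadLength, outerQuadFromSign) = quadInfo
    return (quadLength - abs(outerQuadFromSign[i] - outerQuadFromSign[j])) / quadLength

# Hypothetical object: three sign nodes spread over two outer quads.
signs = [101, 102, 103]
quadInfo = (2, {101: 0, 102: 0, 103: 1})

# Passing positions (0, 1) instead of nodes fails, as in the notebook.
try:
    refinedRelativeCloseness(0, 1, quadInfo=quadInfo)
    failed = False
except KeyError:
    failed = True

# Passing the nodes themselves works.
closeness = {}
for (x, nodeI) in enumerate(signs):
    for nodeJ in signs[x + 1:]:
        closeness[(nodeI, nodeJ)] = refinedRelativeCloseness(nodeI, nodeJ, quadInfo=quadInfo)

print(failed, closeness)
```

If this reading is right, the lookups in `signFromNode` use positions as well, so the results of the simple measure above would be worth re-checking once the loops iterate over nodes.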