<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Data" data-toc-modified-id="Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data</a></span></li><li><span><a href="#Reference-docs" data-toc-modified-id="Reference-docs-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Reference docs</a></span></li><li><span><a href="#Start-up" data-toc-modified-id="Start-up-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Start up</a></span></li><li><span><a href="#Sign-representations" data-toc-modified-id="Sign-representations-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Sign representations</a></span></li><li><span><a href="#Pairs-per-object" data-toc-modified-id="Pairs-per-object-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Pairs per object</a></span><ul class="toc-item"><li><span><a href="#A-battery-of-collocations" data-toc-modified-id="A-battery-of-collocations-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>A battery of collocations</a></span></li><li><span><a href="#Simple-versus-refined" data-toc-modified-id="Simple-versus-refined-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Simple versus refined</a></span></li></ul></li><li><span><a href="#Results" data-toc-modified-id="Results-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Results</a></span></li><li><span><a href="#Alternatives" data-toc-modified-id="Alternatives-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Alternatives</a></span></li></ul></div>

<img align="left" src="images/P005381-obverse-photo.png" width="15%"/>
<img align="left" src="images/P005381-obverse-lineart-annot.png" width="15%"/>
<img align="right" src="images/P005381-reverse-photo.png" width="15%"/>
<img align="right" src="images/P005381-reverse-lineart.png" width="15%"/>

<p>
```
&P005381 = MSVO 3, 70
```
</p>
<p>
<img src="images/P005381-obverse-atf.png" width="40%"/>
<img src="images/P005381-reverse-atf.png" width="40%"/>
</p>

<img align="right" src="images/tf-small.png"/>


# Collocation

We want to gain insight into the co-occurrences of signs on tablets in the 
[Uruk III/IV](http://cdli.ox.ac.uk/wiki/doku.php?id=proto-cuneiform)
corpus (4000-3100 BC).
These tablets have a poor archival context: they come from rubbish pits, and may have been transported
there from places other than where they were excavated.

In order to get more information about their chronology and context, we need to study the evolution of
the signs on the tablets. Collocation is one of the pre-requisites to do so.

The [start](start.ipynb) tutorial ended with a first exercise in collocation, where we collected pairs of signs
that co-occur on tablets and used an unsophisticated distance measure.

We repeat that exercise, and proceed to refine the collocation method step by step.

## Data

We have downloaded the transcriptions from the 
**Cuneiform Digital Library Initiative**
[CDLI](https://cdli.ucla.edu),
and converted them to
[Text-Fabric](https://github.com/Dans-labs/text-fabric).
Read more about the details of the conversion in the
[checks](checks.ipynb) notebook.
For an introduction to Text-Fabric, follow the
[start](start.ipynb) tutorial.

## Reference docs
The functions used by this notebook are documented in the following places:

[Feature docs](https://github.com/Dans-labs/Nino-cunei/blob/master/docs/transcription.md)

[Cunei API](https://github.com/Dans-labs/Nino-cunei/blob/master/docs/cunei.md)

[Utils API](https://github.com/Dans-labs/Nino-cunei/blob/master/docs/utils.md)

[Text-Fabric API](https://github.com/Dans-labs/text-fabric)


# Authors

J. Cale Johnson and Dirk Roorda (see the 
[README](https://github.com/Dans-labs/Nino-cunei)
of this repository).

## Start up

We import the Python modules we need.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys, os, collections
from IPython.display import Markdown, display
from tf.fabric import Fabric

We set up our working locations on the file system.

In [3]:
GITHUB = 'https://github.com'
REPO_REL = 'Dans-labs/Nino-cunei'
REPO = f'~/github/{REPO_REL}'
SOURCE = 'uruk'
VERSION = '0.1'
CORPUS = f'{REPO}/tf/{SOURCE}/{VERSION}'
SOURCE_DIR = os.path.expanduser(f'{REPO}/sources/cdli')
PROGRAM_DIR = os.path.expanduser(f'{REPO}/programs')
TEMP_DIR = os.path.expanduser(f'{REPO}/_temp')
REPORT_DIR = os.path.expanduser(f'{REPO}/reports')
RESULT_DIR = f'{REPORT_DIR}/collocation'
RESULT_GH = f'{GITHUB}/{REPO_REL}/blob/master/reports/collocation'
TEMP_RESULT_DIR = f'{TEMP_DIR}/collocation'

We make our own program modules importable, and we create the temporary and report directories, if they do not exist already.

In [4]:
sys.path.append(PROGRAM_DIR)
from cunei import Cunei
from utils import Compare

In [5]:
for cdir in (TEMP_DIR, REPORT_DIR, RESULT_DIR, TEMP_RESULT_DIR):
    os.makedirs(cdir, exist_ok=True)

In [6]:
TF = Fabric(locations=[CORPUS], modules=[''], silent=False )

This is Text-Fabric 3.2.2
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

33 features found and 0 ignored


In [7]:
api = TF.load('''
    grapheme prime repeat
    variant variantOuter
    modifier modifierInner modifierFirst
    damage uncertain remarkable written
    period name type identifier catalogId
    number fullNumber origNumber badNumbering
    crossref text
    srcLn srcLnNum
    op sub comments''')
api.makeAvailableIn(globals())
CUNEI = Cunei(api)
COMP = Compare(api, SOURCE_DIR, TEMP_DIR)

  0.00s loading features ...
   |     0.00s B catalogId            from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.02s B fullNumber           from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.02s B number               from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.05s B grapheme             from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.04s B srcLn                from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.02s B srcLnNum             from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.00s B prime                from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.01s B repeat               from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.01s B variant              from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.00s B variantOuter         from /Users/dirk/github/Dans-labs/Nino-cunei/tf/uruk/0.1
   |     0.00s B modi

We pick up where we left off in the [start](start.ipynb) tutorial: computing co-occurrences
by tablet. This time we put our recipes into functions, so that we can re-use and refine them later on.

## Sign representations

We pre-compute the sign representations for each node.
We also make an index of occurrences for each sign representation.

In [8]:
NA = {'', '…', 'X'}

def getSigns():
    signFromNode = dict()
    nodeFromSign = collections.defaultdict(list)

    for tablet in F.otype.s('tablet'):
        for s in L.d(tablet, otype='sign'):
            if F.grapheme.v(s) in NA:
                continue
            signRep = CUNEI.atfFromSign(s)
            signFromNode[s] = signRep
            nodeFromSign[signRep].append(s)
    print(f'computed {len(nodeFromSign)} distinct sign representations from {len(signFromNode)} nodes')
    return (signFromNode, nodeFromSign)

(signFromNode, nodeFromSign) = getSigns()

computed 1481 distinct sign representations from 91362 nodes
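
The two data structures built above are a forward mapping (node to sign representation) and its inverse (sign representation to list of nodes). A minimal sketch of the same idea on made-up data, without Text-Fabric: the node numbers, the graphemes and the helper `indexSigns` are hypothetical and serve only as illustration.

```python
import collections

# Hypothetical toy data: node number -> representation (not taken from the corpus)
toyGraphemes = {1: 'EN~a', 2: '', 3: 'AN', 4: 'EN~a', 5: 'X'}
NA_TOY = {'', '…', 'X'}  # same "not available" values as NA above


def indexSigns(graphemes, na):
    """Build a forward index and an inverse index, skipping values in na."""
    signFromNode = {}
    nodeFromSign = collections.defaultdict(list)
    for (node, rep) in sorted(graphemes.items()):
        if rep in na:
            continue  # empty, broken or unidentified: not indexed
        signFromNode[node] = rep
        nodeFromSign[rep].append(node)
    return (signFromNode, nodeFromSign)


(toySignFromNode, toyNodeFromSign) = indexSigns(toyGraphemes, NA_TOY)
print(toySignFromNode)
print(dict(toyNodeFromSign))
```

The inverse index lets us look up all occurrences of a sign directly, instead of scanning all nodes again.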


## Pairs per object

In the [start](start.ipynb) tutorial we collected pairs per tablet, and we calculated
a coarse distance between pairs, based on the distance of signs in the linear sequence
by which signs have been transcribed.

We are going to write that process as a function, where we abstract from the level at
which the pairs must co-occur. We also abstract from how we measure the distance.

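In addition to our coarse definition based on relative closeness in the linear sequence of signs, we also
want a refined notion of closeness, one that does justice to the
geometry of a tablet.

For the refined measure we base closeness on quad distance instead of sign distance:
the difference in position between the outermost quads that contain the two signs.
Signs within the same outer quad then count as maximally close.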

In [9]:
def coarseRelativeCloseness(i, j, signLength=None, quadInfo=None):
    return (signLength - abs(j - i)) / signLength

def refinedRelativeCloseness(i, j, signLength=None, quadInfo=None):
    (quadLength, outerQuadFromSign) = quadInfo
    return (quadLength - abs(outerQuadFromSign[i] - outerQuadFromSign[j])) / quadLength
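
To make the two formulas concrete, here is a small worked example on a hypothetical object of five signs grouped into three outer quads. The node numbers and the quad grouping are made up for illustration; `coarse` and `refined` are stand-alone copies of the formulas above with explicit parameters, not part of the notebook's code.

```python
def coarse(i, j, signLength):
    """Closeness from the distance between two sign nodes in the linear sequence."""
    return (signLength - abs(j - i)) / signLength


def refined(i, j, quadLength, outerQuadFromSign):
    """Closeness from the distance between the outer quads containing the signs."""
    return (quadLength - abs(outerQuadFromSign[i] - outerQuadFromSign[j])) / quadLength


# Hypothetical object: sign nodes 100..104, so the span is last - first = 4
signLength = 104 - 100
# Assumed grouping: signs 100 and 101 in quad 0, sign 102 in quad 1, signs 103 and 104 in quad 2
toyQuadFromSign = {100: 0, 101: 0, 102: 1, 103: 2, 104: 2}
quadLength = 3  # number of outer quads

coarseAdjacent = coarse(100, 101, signLength)                      # (4 - 1) / 4
refinedAdjacent = refined(100, 101, quadLength, toyQuadFromSign)   # (3 - 0) / 3
coarseFar = coarse(100, 104, signLength)                           # (4 - 4) / 4
refinedFar = refined(100, 104, quadLength, toyQuadFromSign)        # (3 - 2) / 3

print(coarseAdjacent, refinedAdjacent, coarseFar, refinedFar)
```

Note a consequence of the definitions: the coarse measure divides by the span between the first and last sign, so it can reach 0, whereas the refined measure divides by the number of quads, so its minimum is `1 / quadLength`.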

We write a function `getPairs(perType, measureName)` that computes co-occurring pairs of signs on objects
of type `perType`. Here `measureName` is the name of a function that, given two sign nodes, computes
a measure of closeness between those nodes.
When a pair occurs more than once in an object, we keep its highest closeness; we then sum these values over all objects.

We also show the top pairs, and save all pairs to disk in a tsv file in `TEMP_RESULT_DIR`.
A shorter version, with only the top `RESULTPAIRS` pairs, goes to `RESULT_DIR`.
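
The accumulation logic can be sketched on toy data, independent of Text-Fabric. This is a simplified illustration under stated assumptions: objects are plain lists of `(position, sign)` tuples, and `accumulate` and `toyCloseness` are hypothetical names, not part of the notebook's code.

```python
import collections


def accumulate(objects, closeness):
    """Per object keep the best closeness of each unordered pair, then sum over objects."""
    pairs = collections.Counter()
    for signs in objects:
        best = {}
        for a in range(len(signs)):
            (posA, signA) = signs[a]
            for b in range(a + 1, len(signs)):
                (posB, signB) = signs[b]
                if signA == signB:
                    continue  # a sign does not collocate with itself
                # normalize the pair so that (A, B) and (B, A) are counted together
                pair = (signA, signB) if signA < signB else (signB, signA)
                value = closeness(posA, posB)
                if pair not in best or best[pair] < value:
                    best[pair] = value
        for (pair, value) in best.items():
            pairs[pair] += value
    return pairs


def toyCloseness(i, j):
    """Hypothetical measure: inverse of the distance between positions."""
    return 1 / abs(j - i)


toyObjects = [
    [(0, 'A'), (1, 'B'), (2, 'A')],  # A-B occurs twice, both at distance 1
    [(0, 'B'), (2, 'A')],            # A-B occurs once, at distance 2
]
toyPairs = accumulate(toyObjects, toyCloseness)
print(toyPairs)
```

Taking the maximum within an object prevents one object with many repetitions of the same signs from dominating the total; summing over objects rewards pairs that are close together in many objects.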

In [10]:
SHOWPAIRS = 10
RESULTPAIRS = 1000

def getPairs(perType, measureName):
    pairs = collections.Counter()
    measure = globals()[measureName]

    for obj in F.otype.s(perType):
        signs = L.d(obj, otype='sign')
        signLength = signs[-1] - signs[0]
        
        outerQuads = CUNEI.getOuterQuads(obj)
        quadLength = len(outerQuads)
        outerQuadFromSign = {}
        for (i, outerQuad) in enumerate(outerQuads):
            if F.otype.v(outerQuad) == 'sign':
                outerQuadFromSign[outerQuad] = i
            else:
                for s in L.d(outerQuad, otype='sign'):
                    outerQuadFromSign[s] = i
        quadInfo = (quadLength, outerQuadFromSign)

        thesePairs = {}
        for i in range(len(signs)):
            nodeI = signs[i]
            if nodeI not in signFromNode:
                continue
            signI = signFromNode[nodeI]
            for j in range(i + 1, len(signs)):
                nodeJ = signs[j]
                if nodeJ not in signFromNode:
                    continue
                signJ = signFromNode[nodeJ]
                if signJ == signI:
                    continue
                pair = (signI, signJ) if signI < signJ else (signJ, signI)
                closeness = measure(nodeI, nodeJ, signLength=signLength, quadInfo=quadInfo)
                oldCloseness = thesePairs.get(pair, None)
                if oldCloseness is None or oldCloseness < closeness:
                    thesePairs[pair] = closeness
        for (pair, closeness) in thesePairs.items():
            pairs[pair] += closeness
    showPairs(pairs, perType, measureName)
    return pairs

def sortPairs(pairs):
    return sorted(
        pairs.items(), 
        key=lambda x: (-x[1], x[0]),
    )

def sortPairsBare(pairs):
    return sorted(
        pairs, 
        key=lambda x: (-pairs[x], x),
    )   

def showPairs(pairs, perType, measureName):
    print(f'{len(pairs)} co-occurrences in {perType}s with measure {measureName}')
    sortedPairs = sortPairs(pairs)
    pairFile = f'per-{perType}-{measureName}.tsv'
    pairTempPath = f'{TEMP_RESULT_DIR}/{pairFile}'
    with open(pairTempPath, 'w') as fh:
        fh.write(f'signI\tsignJ\t{measureName}\n')
        for ((signI, signJ), closeness) in sortedPairs:
            fh.write(f'{signI}\t{signJ}\t{closeness}\n')
    print(f'Written {len(pairs)} pairs to {pairFile} in _temp')
    
    pairPath = f'{RESULT_DIR}/{pairFile}'
    with open(pairPath, 'w') as fh:
        fh.write(f'signI\tsignJ\t{measureName}\n')
        for ((signI, signJ), closeness) in sortedPairs[0:RESULTPAIRS]:
            fh.write(f'{signI}\t{signJ}\t{closeness}\n')
    print(f'Written {len(pairs)} pairs to {pairFile} in report')

    # use a distinct local name, so that we do not shadow the function showPairs
    topPairs = sortedPairs[0:SHOWPAIRS]
    for ((signI, signJ), closeness) in topPairs:
        print(f'{signI:>10} <=~ {closeness:>7.2f} ~=> {signJ:<10}')
    if SHOWPAIRS < len(sortedPairs):
        print(f'...and {len(sortedPairs) - SHOWPAIRS} more')
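
The tsv files have a header line followed by one pair per line, so they are easy to read back for further analysis. A minimal sketch using only the standard library: the helper `readPairs` is hypothetical, and the demo writes a made-up file to a temporary directory rather than reading a real result file.

```python
import csv
import os
import tempfile


def readPairs(path):
    """Read a pairs file (header + signI, signJ, closeness) into a dict."""
    pairs = {}
    with open(path, newline='') as fh:
        reader = csv.reader(fh, delimiter='\t', quoting=csv.QUOTE_NONE)
        next(reader)  # skip the header line
        for (signI, signJ, closeness) in reader:
            pairs[(signI, signJ)] = float(closeness)
    return pairs


# Demo on a made-up file with the same layout as the files written above
with tempfile.TemporaryDirectory() as tmpDir:
    toyPath = os.path.join(tmpDir, 'per-toy-toyMeasure.tsv')
    with open(toyPath, 'w') as fh:
        fh.write('signI\tsignJ\ttoyMeasure\n')
        fh.write('A\tB\t1.5\n')
        fh.write('A\tC\t0.25\n')
    toyRead = readPairs(toyPath)

print(toyRead)
```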

### A battery of collocations

Let's put this function to work: we collect the pairs per tablet, face, column and line,
each time with both the coarse and the refined closeness measure.

In [11]:
collocationPairs = {}

collocationObjects = '''
    tablet
    face
    column
    line
'''.strip().split()

closenessMethods = '''
    coarseRelativeCloseness
    refinedRelativeCloseness
'''.strip().split()

In [12]:
for obj in collocationObjects:
    for method in closenessMethods:
        collocationPairs.setdefault(obj, {})[method] = getPairs(obj, method)

117472 co-occurrences in tablets with measure coarseRelativeCloseness
Written 117472 pairs to per-tablet-coarseRelativeCloseness.tsv in _temp
Written 117472 pairs to per-tablet-coarseRelativeCloseness.tsv in report
    1(N01) <=~  794.77 ~=> 2(N01)    
    1(N01) <=~  641.24 ~=> 1(N14)    
    1(N01) <=~  573.55 ~=> EN~a      
    1(N14) <=~  572.39 ~=> 2(N01)    
    1(N01) <=~  507.99 ~=> 3(N01)    
    2(N01) <=~  439.41 ~=> 3(N01)    
    1(N01) <=~  434.06 ~=> N         
    1(N14) <=~  413.32 ~=> 3(N01)    
    1(N01) <=~  402.56 ~=> AN        
    1(N01) <=~  387.34 ~=> GAL~a     
...and 117462 more
117472 co-occurrences in tablets with measure refinedRelativeCloseness
Written 117472 pairs to per-tablet-refinedRelativeCloseness.tsv in _temp
Written 117472 pairs to per-tablet-refinedRelativeCloseness.tsv in report
    1(N01) <=~  801.53 ~=> 2(N01)    
    1(N01) <=~  647.10 ~=> 1(N14)    
    1(N01) <=~  578.82 ~=> EN~a      
    1(N14) <=~  578.19 ~=> 2(N01)    
    1(N01) <=~  

### Simple versus refined

Does the refined measure give different results from the coarse one?

In [13]:
def comparePairs(obj, method1, method2):
    pairs1 = sortPairsBare(collocationPairs[obj][method1])
    pairs2 = sortPairsBare(collocationPairs[obj][method2])
    if len(pairs1) != len(pairs2):
        print(f'{obj:<6}: !!! {method1:>24} => {len(pairs1):>6} =/= {len(pairs2):>6} <= {method2:<24}')
    else:
        print(f'{obj:<6}:     {method1:>24} => {len(pairs1):>6} === {len(pairs2):>6} <= {method2:<24}')
    firstDiff = -1
    for (i, (pair1, pair2)) in enumerate(zip(pairs1, pairs2)):
        if pair1 != pair2:
            firstDiff = i
            break
    if firstDiff < 0:
        if len(pairs1) == len(pairs2):
            print('\tIDENTICAL')
        else:
            methodSmall = method1 if len(pairs1) < len(pairs2) else method2
            methodBig = method2 if len(pairs1) < len(pairs2) else method1
            print(f'\tPREFIX: pairs by {methodSmall} are a prefix of pairs by {methodBig}')
    else:
        print(f'\tFIRST DIFFERENCE at position {firstDiff}')
    
    topPairs1 = pairs1[0:RESULTPAIRS]
    topPairs2 = pairs2[0:RESULTPAIRS]
    setTop1 = set(topPairs1)
    setTop2 = set(topPairs2)

    if setTop1 == setTop2:
        print(f'\tEQUAL as set of the top-{RESULTPAIRS} pairs')
    else:
        common = setTop1 & setTop2
        print(f'\tSHARE {len(common)} pairs in their top-{RESULTPAIRS}')
    print('')

In [14]:
for obj in collocationObjects:
    comparePairs(obj, *closenessMethods)

tablet:      coarseRelativeCloseness => 117472 === 117472 <= refinedRelativeCloseness
	FIRST DIFFERENCE at position 27
	SHARE 997 pairs in their top-1000

face  :      coarseRelativeCloseness => 104628 === 104628 <= refinedRelativeCloseness
	FIRST DIFFERENCE at position 25
	SHARE 996 pairs in their top-1000

column:      coarseRelativeCloseness =>  72618 ===  72618 <= refinedRelativeCloseness
	FIRST DIFFERENCE at position 28
	SHARE 989 pairs in their top-1000

line  :      coarseRelativeCloseness =>  36507 ===  36507 <= refinedRelativeCloseness
	FIRST DIFFERENCE at position 4
	SHARE 934 pairs in their top-1000



**Observation**

The more fine-grained the object of collocation, the bigger the differences between
the refined and the coarse closeness measures: per tablet the two top-1000 lists share 997 pairs,
per line only 934.

But all in all, the difference remains small.
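
The two comparison criteria used above, the position of the first difference and the overlap between the top-k lists, can be illustrated on two made-up rankings. The helper names `firstDifference` and `topOverlap` and the rankings themselves are hypothetical.

```python
def firstDifference(ranking1, ranking2):
    """Index of the first position where the rankings differ, or -1 if none."""
    for (i, (item1, item2)) in enumerate(zip(ranking1, ranking2)):
        if item1 != item2:
            return i
    return -1


def topOverlap(ranking1, ranking2, k):
    """Number of items that both rankings have in their top k, regardless of order."""
    return len(set(ranking1[:k]) & set(ranking2[:k]))


# Made-up rankings of pairs under two measures
toyCoarse = ['p1', 'p2', 'p3', 'p4', 'p5']
toyRefined = ['p1', 'p2', 'p4', 'p3', 'p6']

toyFirstDiff = firstDifference(toyCoarse, toyRefined)
toyShared3 = topOverlap(toyCoarse, toyRefined, 3)
toyShared4 = topOverlap(toyCoarse, toyRefined, 4)
print(toyFirstDiff, toyShared3, toyShared4)
```

The first difference is sensitive to any swap in order, while the top-k overlap only changes when a pair enters or leaves the top k. That is why two rankings can differ early on and still share nearly all of their top pairs.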

## Results

The top 1000 pairs (`RESULTPAIRS`) for each object type and closeness measure can be viewed online.

In [15]:
resultLinks = []
for obj in collocationObjects:
    for method in closenessMethods:
        resultLinks.append(f'[{obj}-{method}]({RESULT_GH}/per-{obj}-{method}.tsv)\n\n')

RESULT_LINKS = ''.join(resultLinks)
display(Markdown(RESULT_LINKS))

[tablet-coarseRelativeCloseness](https://github.com/Dans-labs/Nino-cunei/blob/master/reports/collocation/per-tablet-coarseRelativeCloseness.tsv)

[tablet-refinedRelativeCloseness](https://github.com/Dans-labs/Nino-cunei/blob/master/reports/collocation/per-tablet-refinedRelativeCloseness.tsv)

[face-coarseRelativeCloseness](https://github.com/Dans-labs/Nino-cunei/blob/master/reports/collocation/per-face-coarseRelativeCloseness.tsv)

[face-refinedRelativeCloseness](https://github.com/Dans-labs/Nino-cunei/blob/master/reports/collocation/per-face-refinedRelativeCloseness.tsv)

[column-coarseRelativeCloseness](https://github.com/Dans-labs/Nino-cunei/blob/master/reports/collocation/per-column-coarseRelativeCloseness.tsv)

[column-refinedRelativeCloseness](https://github.com/Dans-labs/Nino-cunei/blob/master/reports/collocation/per-column-refinedRelativeCloseness.tsv)

[line-coarseRelativeCloseness](https://github.com/Dans-labs/Nino-cunei/blob/master/reports/collocation/per-line-coarseRelativeCloseness.tsv)

[line-refinedRelativeCloseness](https://github.com/Dans-labs/Nino-cunei/blob/master/reports/collocation/per-line-refinedRelativeCloseness.tsv)



## Alternatives

In [clustering](clustering.ipynb) we explore how we can cluster signs.