
# Checks
Various checks on the correctness of the transformation from ASCII transcriptions to a Text-Fabric data set.

The [diagnostics]()
of the transformation list issues that may be used to correct mistakes in the sources.
Equally likely, they reflect misunderstandings on my (Dirk's) part of the model
that underlies the transcriptions.

We will perform *grep* commands on the source files, and we will traverse nodes in Text-Fabric and collect information.

Then we compare the two sets of information.
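The comparison itself is done by `Compare.checkSanity` from our `utils` module, which is not shown in this notebook. As a rough illustration of the idea only, and not the actual implementation, the sketch below walks through two ordered lists in parallel and reports the first position where they differ. The name `first_difference` and the sample tuples are made up for this example.

```python
from itertools import zip_longest


def first_difference(grep_items, tf_items):
    """Return the index of the first differing position, or None if the lists are equal."""
    for i, (g, t) in enumerate(zip_longest(grep_items, tf_items)):
        if g != t:
            return i
    return None


# Invented sample data in the shape (period, tablet, source line number, value).
grep_items = [
    ("uruk-iii", "P000001", 1, "@obverse"),
    ("uruk-iii", "P000001", 5, "@reverse"),
]
tf_items = [
    ("uruk-iii", "P000001", 1, "@obverse"),
    ("uruk-iii", "P000001", 6, "@reverse"),
]

print(first_difference(grep_items, tf_items))
print(first_difference(grep_items, grep_items))
```

Because the lists are compared position by position, a difference in order counts as a difference, which is what we want for the checks below.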

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys, os, collections, re
from glob import glob
from tf.fabric import Fabric
from utils import Compare

In [3]:
REPO = '~/github/Dans-labs/nino-cunei'
SOURCE = 'uruk'
VERSION = '0.1'
CORPUS = f'{REPO}/tf/{SOURCE}/{VERSION}'
SOURCE_DIR = os.path.expanduser(f'{REPO}/sources/cdli')
TEMP_DIR = os.path.expanduser(f'{REPO}/_temp')

In [65]:
TF = Fabric(locations=[CORPUS], modules=[''], silent=False )

This is Text-Fabric 3.2.0
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

26 features found and 0 ignored


In [66]:
api = TF.load('''
    grapheme prime variant modifier repeat
    damage uncertain remarkable written
    name number badNumbering catalogId period
    type identifier
    srcLn srcLnNum
    op comments
''')
api.makeAvailableIn(globals())
COMP = Compare(api, SOURCE_DIR, TEMP_DIR)

  0.00s loading features ...
   |     0.00s B catalogId            from /Users/dirk/github/Dans-labs/nino-cunei/tf/uruk/0.1
   |     0.02s B number               from /Users/dirk/github/Dans-labs/nino-cunei/tf/uruk/0.1
   |     0.06s B grapheme             from /Users/dirk/github/Dans-labs/nino-cunei/tf/uruk/0.1
   |     0.05s B srcLn                from /Users/dirk/github/Dans-labs/nino-cunei/tf/uruk/0.1
   |     0.03s B srcLnNum             from /Users/dirk/github/Dans-labs/nino-cunei/tf/uruk/0.1
   |     0.00s B prime                from /Users/dirk/github/Dans-labs/nino-cunei/tf/uruk/0.1
   |     0.01s B variant              from /Users/dirk/github/Dans-labs/nino-cunei/tf/uruk/0.1
   |     0.00s B modifier             from /Users/dirk/github/Dans-labs/nino-cunei/tf/uruk/0.1
   |     0.01s B repeat               from /Users/dirk/github/Dans-labs/nino-cunei/tf/uruk/0.1
   |     0.01s B damage               from /Users/dirk/github/Dans-labs/nino-cunei/tf/uruk/0.1
   |     0.00s B unce

## Tablets
We check whether we have the same sequence of tablet numbers.
In TF, the tablet number is stored in the feature `catalogId`.

Note that we also check on the order of the tablets.

In [67]:
def tfTablets():
    tablets = []
    for t in F.otype.s('tablet'):
        (tablet, column, line) = T.sectionFromNode(t)
        tablets.append((F.period.v(t), tablet, F.srcLnNum.v(t), F.catalogId.v(t)))
    return tablets

def grepTablets(gen):
    tablets = []
    prevTablet = None
    for (period, tablet, ln, line, skip) in gen:
        if skip:
            print(f'GREP: skipping duplicate tablet "{tablet}"')
            continue
        if tablet != prevTablet:
            tablets.append((period, tablet, ln, tablet))
        prevTablet = tablet
    return tablets

In [68]:
COMP.checkSanity(
    grepTablets,
    tfTablets,
)

GREP: skipping duplicate tablet "P002176"
GREP: skipping duplicate tablet "P252175"
Number of results: TF 6396; GREP 6396
IDENTICAL: all 6396 items
=    : uruk-iii ◆ P006427 ◆ 1 ◆ P006427
=    : uruk-iii ◆ P006428 ◆ 11 ◆ P006428
=    : uruk-iii ◆ P448701 ◆ 36 ◆ P448701
=    : uruk-iii ◆ P448702 ◆ 50 ◆ P448702
=    : uruk-iii ◆ P448703 ◆ 71 ◆ P448703
=    : uruk-iii ◆ P471695 ◆ 87 ◆ P471695
=    : uruk-iii ◆ P482082 ◆ 114 ◆ P482082
=    : uruk-iii ◆ P482083 ◆ 127 ◆ P482083
=    : uruk-iii ◆ P499393 ◆ 147 ◆ P499393
=    : uruk-iii ◆ P504412 ◆ 166 ◆ P504412
=    : uruk-iii ◆ P504413 ◆ 189 ◆ P504413
=    : uruk-iii ◆ P006438 ◆ 199 ◆ P006438
=    : uruk-iii ◆ P000014 ◆ 220 ◆ P000014
=    : uruk-iii ◆ P000456 ◆ 297 ◆ P000456
=    : uruk-iii ◆ P002718 ◆ 326 ◆ P002718
=    : uruk-iii ◆ P000021 ◆ 341 ◆ P000021
=    : uruk-iii ◆ P000023 ◆ 374 ◆ P000023
=    : uruk-iii ◆ P000025 ◆ 403 ◆ P000025
=    : uruk-iii ◆ P000167 ◆ 500 ◆ P000167
=    : uruk-iii ◆ P000453 ◆ 531 ◆ P000453
=     and 6376 more

## Faces

We check whether we see the same faces with GREP and TF.

Note that in TF we have inserted missing faces as `@noface`.
We leave them out again in the comparison.

In [69]:
FACES = set(
    '''
    obverse
    reverse
    top
    bottom
    left
    seal
    surface
    edge
'''.strip().split()
)

NOFACE = 'noface'

facePat = re.compile('^@([a-z]+)')
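To see what the combination of `facePat` and `FACES` selects, here is a small self-contained sketch that applies the same pattern to a few made-up source lines. The helper `face_of` and the sample lines are assumptions for illustration; only the pattern and the set of face names are taken from the cell above.

```python
import re

# Same face names and pattern as above, under different names
# so that the notebook variables are not overwritten.
demo_faces = {"obverse", "reverse", "top", "bottom", "left", "seal", "surface", "edge"}
demo_face_pat = re.compile(r"^@([a-z]+)")


def face_of(line):
    """Return the face name if the line is a face specifier, else None."""
    match = demo_face_pat.match(line)
    if match and match.group(1) in demo_faces:
        return match.group(1)
    return None


# Invented sample lines.
for line in ["@obverse", "@seal 1", "@column 1", "1. 2(N01) , X"]:
    print(f"{line!r:20} => {face_of(line)!r}")
```

The pattern alone also matches `@column`; it is the membership test against the set of face names that keeps column specifiers out of the face list.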

In [70]:
def tfFaces():
    faces = []
    for tablet in F.otype.s('tablet'):
        tabletName = F.catalogId.v(tablet)
        period = F.period.v(tablet)
        for face in L.d(tablet, otype='face'):
            tp = F.type.v(face)
            it = F.identifier.v(face) or None
            ln = F.srcLnNum.v(face)
            itStr = '' if it is None else f' {it}'
            if tp != 'noface':
                faces.append((period, tabletName, ln, f'@{tp}{itStr}'))
    return faces

In [71]:
def grepFaces(gen):
    faces = []
    for (period, tablet, ln, line, skip) in gen:
        if skip:
            continue
        match = facePat.match(line)
        if match:
            face = match.group(1)
            if face in FACES:
                faces.append((period, tablet, ln, line.strip()))
    return faces

In [72]:
COMP.checkSanity(
    grepFaces,
    tfFaces,
)

Number of results: TF 9441; GREP 9441
IDENTICAL: all 9441 items
=    : uruk-iii ◆ P006427 ◆ 4 ◆ @obverse
=    : uruk-iii ◆ P006428 ◆ 14 ◆ @obverse
=    : uruk-iii ◆ P448701 ◆ 39 ◆ @obverse
=    : uruk-iii ◆ P448701 ◆ 46 ◆ @reverse
=    : uruk-iii ◆ P448702 ◆ 53 ◆ @obverse
=    : uruk-iii ◆ P448702 ◆ 67 ◆ @reverse
=    : uruk-iii ◆ P448703 ◆ 74 ◆ @obverse
=    : uruk-iii ◆ P448703 ◆ 83 ◆ @reverse
=    : uruk-iii ◆ P471695 ◆ 90 ◆ @obverse
=    : uruk-iii ◆ P471695 ◆ 109 ◆ @reverse
=    : uruk-iii ◆ P482082 ◆ 117 ◆ @obverse
=    : uruk-iii ◆ P482082 ◆ 123 ◆ @reverse
=    : uruk-iii ◆ P482083 ◆ 130 ◆ @obverse
=    : uruk-iii ◆ P482083 ◆ 143 ◆ @reverse
=    : uruk-iii ◆ P499393 ◆ 150 ◆ @obverse
=    : uruk-iii ◆ P499393 ◆ 162 ◆ @reverse
=    : uruk-iii ◆ P504412 ◆ 169 ◆ @obverse
=    : uruk-iii ◆ P504412 ◆ 185 ◆ @reverse
=    : uruk-iii ◆ P504413 ◆ 192 ◆ @obverse
=    : uruk-iii ◆ P504413 ◆ 195 ◆ @reverse
=     and 9421 more


## Columns

We check whether we see the same columns with GREP and TF.

Note that in TF we have inserted missing columns as `@column 0`.
We leave them out again in the comparison.

In [73]:
def tfColumns():
    columns = []
    for tablet in F.otype.s('tablet'):
        tabletName = F.catalogId.v(tablet)
        period = F.period.v(tablet)
        for face in L.d(tablet, otype='face'):
            tp = F.type.v(face)
            for column in L.d(face, otype='column'):
                number = F.number.v(column)
                prime = F.prime.v(column)
                ln = F.srcLnNum.v(column)
                primeStr = "'" if prime else ''
                if number != '0':
                    columns.append((period, tabletName, ln, tp, f'@column {number}{primeStr}'))
    return columns

In [82]:
def grepColumns(gen):
    columns = []
    columnPat = re.compile(r'^@col')
    # raw string, so that \s and \S reach the regex engine without escape warnings
    correctPat = re.compile(r'^@([a-z]+)(\s*)(\S*)')
    curFace = NOFACE
    prevTablet = None
    for (period, tablet, ln, line, skip) in gen:
        if skip:
            continue
        if tablet != prevTablet:
            curFace = NOFACE
        prevTablet = tablet

        match = facePat.match(line)
        if match:
            face = match.group(1)
            if face in FACES:
                curFace = face

        if columnPat.match(line):
            if not line.startswith('@column '):
                match = correctPat.match(line)
                if match:
                    colSpec = match.group(1)
                    sep = match.group(2)
                    colNum = match.group(3)
                    line = f'@column {colNum}'
                    print(f'GREP: corrected "{colSpec}{sep}{colNum}" => "{line}"')
                else:
                    print(f'GREP: found "{line}"')
                
            columns.append((period, tablet, ln, curFace, line.strip()))
    return columns

In [83]:
COMP.checkSanity(
    grepColumns,
    tfColumns,
)

GREP: corrected "columm 4" => "@column 4"
GREP: corrected "column3" => "@column 3"
Number of results: TF 13123; GREP 13123
IDENTICAL: all 13123 items
=    : uruk-iii ◆ P006427 ◆ 5 ◆ obverse ◆ @column 1
=    : uruk-iii ◆ P006427 ◆ 7 ◆ obverse ◆ @column 2
=    : uruk-iii ◆ P006428 ◆ 15 ◆ obverse ◆ @column 1
=    : uruk-iii ◆ P006428 ◆ 18 ◆ obverse ◆ @column 2
=    : uruk-iii ◆ P006428 ◆ 21 ◆ obverse ◆ @column 3
=    : uruk-iii ◆ P006428 ◆ 29 ◆ obverse ◆ @column 4
=    : uruk-iii ◆ P006428 ◆ 32 ◆ obverse ◆ @column 5
=    : uruk-iii ◆ P448701 ◆ 40 ◆ obverse ◆ @column 1
=    : uruk-iii ◆ P448701 ◆ 43 ◆ obverse ◆ @column 2
=    : uruk-iii ◆ P448702 ◆ 54 ◆ obverse ◆ @column 1
=    : uruk-iii ◆ P448702 ◆ 60 ◆ obverse ◆ @column 2
=    : uruk-iii ◆ P448702 ◆ 64 ◆ obverse ◆ @column 3
=    : uruk-iii ◆ P448703 ◆ 75 ◆ obverse ◆ @column 1
=    : uruk-iii ◆ P448703 ◆ 81 ◆ obverse ◆ @column 2
=    : uruk-iii ◆ P471695 ◆ 91 ◆ obverse ◆ @column 1
=    : uruk-iii ◆ P471695 ◆ 104 ◆ obverse ◆ @column 2
=  
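The two corrections reported by GREP above come from the step in `grepColumns` that rewrites malformed column specifiers. The sketch below isolates that step so it can be tried on single lines. The function `normalize_column` is hypothetical; the patterns are the ones from `grepColumns`, and the sample lines are modelled on the messages printed above.

```python
import re

column_pat = re.compile(r"^@col")
correct_pat = re.compile(r"^@([a-z]+)(\s*)(\S*)")


def normalize_column(line):
    """Return a normalized column specifier, or None if the line is not a column line."""
    if not column_pat.match(line):
        return None
    if line.startswith("@column "):
        # Already well-formed: keep it, minus surrounding whitespace.
        return line.strip()
    # Misspelled keyword or missing space: keep only the column number.
    match = correct_pat.match(line)
    return f"@column {match.group(3)}" if match else line.strip()


for line in ["@columm 4", "@column3", "@column 2'", "@obverse"]:
    print(f"{line!r:14} => {normalize_column(line)!r}")
```

Well-formed specifiers, including primed ones such as `@column 2'`, pass through unchanged; only lines that start with `@col` but not with `@column ` are rewritten.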

## Lines

We check whether we see the same line numbers with GREP and TF.

During the conversion to TF we have detected bad numberings in some columns.
We flag them here.

See also

## Graphemes

Note that we have defined a function to produce a string value for a full grapheme, including 
repeats, primes, variants and modifiers.
See [utils](utils.py).

A complication is that line numbers are missing in a few cases,
so the usual grep pattern does not pick those lines up.

There are lines that start with `[` or with `|`, so we have to take care to pick them up as well.

There are also line numbers with a hyphen in them, such as `6-7`.

In [9]:
lineNumPat = r"^(?:(?:[a-zA-Z0-9.'-]+\s+)|(?=[|\[]))"
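As a quick check of this pattern, the sketch below applies it to a few made-up lines covering the cases just mentioned: an ordinary hierarchical number, a hyphenated number, and lines without a number that start with `[` or `|`. The helper `material` and the sample lines are assumptions for illustration only.

```python
import re

# Same pattern as lineNumPat, written as a raw string.
line_num_pat = r"^(?:(?:[a-zA-Z0-9.'-]+\s+)|(?=[|\[]))"
material_pat = re.compile(line_num_pat + r"(.*)")


def material(line):
    """Return the text after the line number, or None if the pattern does not match."""
    match = material_pat.match(line)
    return match.group(1) if match else None


# Invented sample lines.
samples = [
    "1.a 2(N01) , X",
    "6-7. 1(N14) , SZE~a",
    "[...] X",
    "|A.B| X",
    "@obverse",
]
for line in samples:
    print(f"{line!r:24} => {material(line)!r}")
```

The lookahead `(?=[|\[])` consumes nothing, so for lines without a number the whole line counts as material.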

### Primes

First an overview of the occurrence of primes.

**N.B.:** This is about primes on *signs*, not on *column* numbers.

In [10]:
for (value, frequency) in F.prime.freqList():
    print(f'{frequency:>5} x {value}')

    9 x 1


Now let us check the primes with grep, directly in the source files.
We look at lines that start with a (hierarchical) number, followed by a space,
and that contain, further on, a single or double prime, but not one inside a grapheme, such as `GA'AR`.

In [11]:
def tfPrimes():
    primes = []
    for n in F.prime.s(1):
        (tablet, column, line) = T.sectionFromNode(n)
        t = L.u(n, otype='tablet')[0]
        case = L.u(n, otype='case')[0]
        
        primes.append((F.period.v(t), tablet, F.srcLnNum.v(case), f"{COMP.strFromSign(n)}"))
    return primes

In [13]:
def grepPrimes(gen):
    primes = []
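    # a line number, then material with a prime that is not followed by `A`
    primePat = re.compile(lineNumPat + r'''(.*['"][^A].*)''')
    # a numeral such as 1(N24'), or an ordinary grapheme ending in a prime
    graphemePat = re.compile(r'''(?:[0-9]+\([^)]+['"]\))|(?:[A-Z0-9~@a-wyz'-]+')''')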
    for (period, tablet, ln, line, skip) in gen:
        if skip:
            continue
        match = primePat.match(line)
        if match:
            material = match.group(1)
            graphemes = graphemePat.findall(material)
            for grapheme in graphemes:
                primes.append((period, tablet, ln, grapheme))
    return primes
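The two patterns in `grepPrimes` can be tried in isolation. The sketch below uses the same regular expressions on made-up lines: one with a single prime inside a numeral, one with a double prime, and one where the only prime sits inside a grapheme like `GA'AR`. The helper `primes_in` and the sample lines are assumptions for illustration.

```python
import re

line_num_pat = r"^(?:(?:[a-zA-Z0-9.'-]+\s+)|(?=[|\[]))"
# Material after the line number that has a prime not followed by `A`.
prime_pat = re.compile(line_num_pat + r"""(.*['"][^A].*)""")
# A primed numeral such as 1(N24'), or a grapheme that ends in a prime.
grapheme_pat = re.compile(r"""(?:[0-9]+\([^)]+['"]\))|(?:[A-Z0-9~@a-wyz'-]+')""")


def primes_in(line):
    """Return the primed graphemes found in one source line."""
    match = prime_pat.match(line)
    return grapheme_pat.findall(match.group(1)) if match else []


# Invented sample lines.
for line in ["1. 1(N24') , X", '2. 1(N24") , X', "3. , GA'AR"]:
    print(f"{line!r:20} => {primes_in(line)}")
```

Note that the GREP side keeps a double prime as written in the source, so any difference with the TF side in how primes are rendered shows up in the comparison.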

In [14]:
COMP.checkSanity(
    grepPrimes,
    tfPrimes,
)

Number of results: TF 9; GREP 9
DIFFERENT: first different item is at position 1 in the list
TF   : uruk-iii ◆ P411604 ◆ 48967 ◆ 1(N24')
GREP : uruk-iii ◆ P411604 ◆ 48967 ◆ 1(N24")
remaining items (TF: 8); GREP: 8
=    : uruk-iii ◆ P411610 ◆ 49069 ◆ 1(N24')
=    : uruk-iii ◆ P411610 ◆ 49071 ◆ 1(N24')
=    : uruk-iii ◆ P411610 ◆ 49073 ◆ 1(N24')
=    : uruk-iii ◆ P411610 ◆ 49075 ◆ 1(N24')
=    : uruk-iii ◆ P411539 ◆ 49391 ◆ 1(N24')
=    : uruk-iii ◆ P006437 ◆ 54446 ◆ 1(N30c')
=    : uruk-iii ◆ P464140 ◆ 55938 ◆ 1(N24')
=    : uruk-iii ◆ P464140 ◆ 55939 ◆ 1(N24')
      no more items


This makes it clear: in the transcription of tablet P411604 there is a strange double prime in `1(N24")`, where TF has a single prime.

### Variants

Overview of variants:

In [15]:
for (value, frequency) in F.variant.freqList():
    print(f'{frequency:>5} x {value}')

23843 x a
 4214 x b
 1534 x c
 1356 x a1
  703 x b1
  194 x a2
  191 x d
  127 x b2
   85 x f
   73 x a3
   40 x e
   29 x c2
   22 x c1
   22 x c3
   14 x c5
   13 x b3
   12 x a0
   12 x d1
   12 x v
   11 x c4
    6 x a4
    6 x g
    5 x d2
    4 x d4
    4 x h
    2 x 3a
    2 x d3
    1 x h2


So there are many variants.

Again, we look for variants in the TF resource and by GREPping them from the sources.

In [16]:
def tfVariants():
    variants = []
    for n in F.otype.s('sign'):
        variant = F.variant.v(n)
        if variant is None:
            continue
        (tablet, column, line) = T.sectionFromNode(n)
        t = L.u(n, otype='tablet')[0]
        case = L.u(n, otype='case')[0]
        
        position = (F.period.v(t), tablet, F.srcLnNum.v(case))
        variants.append((*position, f"{COMP.strFromSign(n)}"))

        written = F.written.v(n)
        if written is not None:
            if '~' in written:
                variants.append((*position, written))

    return variants

There are complications.

#### Order of modifiers and variants
When we extract variants by GREP, we face the problem that the order between modifiers and variants is not consistent.
We see cases of variant and then modifier:

```
3. 1(N14) 8(N01) , RAD~a@g ERIM~a SZU2 A?
```

and cases with modifier and then variant:

```
4. 2(N01) , URUDU@g~b SZU2#
```

both from the same tablet P003407.

We will normalize the order of variant and modifier.

In [19]:
def swap(match):
    return f'{match.group(2)}{match.group(1)}'

def lower(match):
    return f'~{match.group(1).lower()}'

def grepVariants(gen):
    variants = []
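    # a line number, then material containing at least one variant marker `~`
    variantPat = re.compile(f'{lineNumPat}(.*[~].*)')
    # upper case variant letters, which we lowercase
    varPat = re.compile('[~]([A-Z])')
    # a numeral with a variant inside the brackets, or an ordinary grapheme with a variant
    graphemePat = re.compile(r"(?:[0-9]+\([^)]+[~][^)]+\))|(?:[A-Z0-9@a-wyz'-]+[~][0-9@a-wyz]+)")
    # modifier followed by variant: the order that we swap
    combiPat = re.compile('(@[a-ywz0-9]+)(~[a-ywz0-9]+)')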
    for (period, tablet, ln, line, skip) in gen:
        if skip:
            continue
        match = variantPat.match(line)
        if match:
            material = match.group(1)
            material = varPat.sub(lower, material)
            graphemes = graphemePat.findall(material)
            for grapheme in graphemes:
                grapheme = combiPat.sub(swap, grapheme)
                variants.append((period, tablet, ln, grapheme))
    return variants
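To see the two normalization steps of `grepVariants` on their own, the sketch below applies them to single graphemes. The function `normalize` is hypothetical and combines both substitutions in one place; the patterns are those of `varPat` and `combiPat`. The first two samples are the graphemes quoted from P003407 above, the third is made up to show the lowercasing.

```python
import re

var_pat = re.compile(r"~([A-Z])")
combi_pat = re.compile(r"(@[a-ywz0-9]+)(~[a-ywz0-9]+)")


def normalize(grapheme):
    """Lowercase an upper case variant letter and put the variant before the modifier."""
    grapheme = var_pat.sub(lambda m: f"~{m.group(1).lower()}", grapheme)
    return combi_pat.sub(lambda m: f"{m.group(2)}{m.group(1)}", grapheme)


for grapheme in ["URUDU@g~b", "RAD~a@g", "SANGA~A"]:
    print(f"{grapheme:12} => {normalize(grapheme)}")
```

After this step both orders end up as variant followed by modifier, so the GREP results can be compared with the strings that TF produces.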

In [20]:
COMP.checkSanity(
    grepVariants,
    tfVariants,
)

Number of results: TF 32452; GREP 32452
IDENTICAL: all 32452 items
=    : uruk-iii ◆ P006427 ◆ 8 ◆ SANGA~a
=    : uruk-iii ◆ P006428 ◆ 26 ◆ DUG~b
=    : uruk-iii ◆ P448701 ◆ 42 ◆ AB~a
=    : uruk-iii ◆ P448701 ◆ 42 ◆ APIN~a
=    : uruk-iii ◆ P448701 ◆ 42 ◆ NUN~a
=    : uruk-iii ◆ P448701 ◆ 45 ◆ SZE~a
=    : uruk-iii ◆ P448701 ◆ 45 ◆ NUN~a
=    : uruk-iii ◆ P448702 ◆ 56 ◆ KA~a
=    : uruk-iii ◆ P448702 ◆ 57 ◆ KASZ~b
=    : uruk-iii ◆ P448702 ◆ 57 ◆ NUN~a
=    : uruk-iii ◆ P448702 ◆ 58 ◆ KASZ~a
=    : uruk-iii ◆ P448702 ◆ 59 ◆ 2(N39~a)
=    : uruk-iii ◆ P471695 ◆ 92 ◆ APIN~a
=    : uruk-iii ◆ P471695 ◆ 92 ◆ UR4~a
=    : uruk-iii ◆ P471695 ◆ 93 ◆ EN~a
=    : uruk-iii ◆ P471695 ◆ 94 ◆ BAN~b
=    : uruk-iii ◆ P471695 ◆ 94 ◆ KASZ~c
=    : uruk-iii ◆ P471695 ◆ 97 ◆ PAP~a
=    : uruk-iii ◆ P471695 ◆ 100 ◆ EN~a
=    : uruk-iii ◆ P471695 ◆ 100 ◆ EZINU~d
=     and 32432 more
