
# Checks
Various checks on the correctness of the transformation from ASCII transcriptions to a Text-Fabric data set.

We will perform *grep* commands on the source files, and we will traverse nodes in Text-Fabric and collect information.

Then we compare these sets of information.
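
As a rough illustration of what such a comparison involves, the sketch below finds the first position where two lists of `(period, line number, value)` tuples disagree. The function name `firstDifference` and the sample data are hypothetical; the actual comparison in this notebook is done by `Compare.checkSanity` from `utils.py`, whose implementation is not shown here and may differ.

```python
def firstDifference(tfItems, grepItems):
    """Return the zero-based index of the first mismatch, or None if the lists are identical."""
    for (i, (tfItem, grepItem)) in enumerate(zip(tfItems, grepItems)):
        if tfItem != grepItem:
            return i
    # no mismatch in the common part: one list may still be longer than the other
    if len(tfItems) != len(grepItems):
        return min(len(tfItems), len(grepItems))
    return None


# made-up sample data, shaped like the tuples collected in this notebook
tfSample = [('uruk-iii', 1, 'P006427'), ('uruk-iii', 11, 'P006428')]
grepSample = [('uruk-iii', 1, 'P006427'), ('uruk-iii', 12, 'P006428')]
print(firstDifference(tfSample, grepSample))  # → 1
```

Comparing ordered lists rather than sets means that a difference in order is reported as well as a difference in content.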

In [16]:
import sys, os, collections, re
from glob import glob
from tf.fabric import Fabric
from utils import Compare

In [17]:
REPO = '~/github/Dans-labs/nino-cunei'
SOURCE = 'uruk'
VERSION = '0.1'
CORPUS = f'{REPO}/tf/{SOURCE}/{VERSION}'
SOURCE_DIR = os.path.expanduser(f'{REPO}/sources/cdli')
TEMP_DIR = os.path.expanduser(f'{REPO}/_temp')

In [18]:
TF = Fabric(locations=[CORPUS], modules=[''], silent=False )

This is Text-Fabric 3.2.0
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

25 features found and 0 ignored


In [19]:
api = TF.load('''
    grapheme prime variant modifier repeat
    damage uncertain remarkable written
    name number catalogId period
    srcLn srcLnNum
    op comments
''')
api.makeAvailableIn(globals())
COMP = Compare(api, SOURCE_DIR, TEMP_DIR)

  0.00s loading features ...
   |     0.00s B catalogId            from /Users/dirk/github/Dans-labs/nino-cunei/tf/uruk/0.1
   |     0.02s B number               from /Users/dirk/github/Dans-labs/nino-cunei/tf/uruk/0.1
   |     0.06s B grapheme             from /Users/dirk/github/Dans-labs/nino-cunei/tf/uruk/0.1
   |     0.05s B srcLn                from /Users/dirk/github/Dans-labs/nino-cunei/tf/uruk/0.1
   |     0.03s B srcLnNum             from /Users/dirk/github/Dans-labs/nino-cunei/tf/uruk/0.1
   |     0.00s B prime                from /Users/dirk/github/Dans-labs/nino-cunei/tf/uruk/0.1
   |     0.01s B variant              from /Users/dirk/github/Dans-labs/nino-cunei/tf/uruk/0.1
   |     0.00s B modifier             from /Users/dirk/github/Dans-labs/nino-cunei/tf/uruk/0.1
   |     0.01s B repeat               from /Users/dirk/github/Dans-labs/nino-cunei/tf/uruk/0.1
   |     0.01s B damage               from /Users/dirk/github/Dans-labs/nino-cunei/tf/uruk/0.1
   |     0.00s B unce

## Tablets
We check whether we have the same sequence of tablet numbers.
In TF, the tablet number is stored in the feature `catalogId`.

Note that we also check the order of the tablets.

In [20]:
def tfTablets():
    tablets = []
    for t in F.otype.s('tablet'):
        # period, source line number and catalog id identify a tablet occurrence
        tablets.append((F.period.v(t), F.srcLnNum.v(t), F.catalogId.v(t)))
    return tablets

def grepTablets(gen):
    tablets = set()
    tabletOccs = []
    tabletPat = re.compile('^&([^ =]+)')
    for (period, ln, line) in gen:
        match = tabletPat.match(line)
        if match:
            tablet = match.group(1)
            if tablet in tablets:
                print(f'WARNING: duplicate tablet {tablet}: {period}:{ln} {line}')
            else:
                tablets.add(tablet)
                tabletOccs.append((period, ln, tablet))
    return tabletOccs

In [21]:
COMP.checkSanity(
    grepTablets,
    tfTablets,
)

Number of results: TF 6396; GREP 6396
IDENTICAL: all 6396 items
=    : uruk-iii ◆ 1 ◆ P006427
=    : uruk-iii ◆ 11 ◆ P006428
=    : uruk-iii ◆ 36 ◆ P448701
=    : uruk-iii ◆ 50 ◆ P448702
=    : uruk-iii ◆ 71 ◆ P448703
=    : uruk-iii ◆ 87 ◆ P471695
=    : uruk-iii ◆ 114 ◆ P482082
=    : uruk-iii ◆ 127 ◆ P482083
=    : uruk-iii ◆ 147 ◆ P499393
=    : uruk-iii ◆ 166 ◆ P504412
=    : uruk-iii ◆ 189 ◆ P504413
=    : uruk-iii ◆ 199 ◆ P006438
=    : uruk-iii ◆ 220 ◆ P000014
=    : uruk-iii ◆ 297 ◆ P000456
=    : uruk-iii ◆ 326 ◆ P002718
=    : uruk-iii ◆ 341 ◆ P000021
=    : uruk-iii ◆ 374 ◆ P000023
=    : uruk-iii ◆ 403 ◆ P000025
=    : uruk-iii ◆ 500 ◆ P000167
=    : uruk-iii ◆ 531 ◆ P000453
=     and 6376 more


## Graphemes

Note that we have defined a function to produce a string value for a full grapheme, including 
repeats, primes, variants and modifiers.
See [utils](utils.py).

A complication is that there are missing line numbers in a few cases, 
so the usual grep pattern does not pick them up.

There are lines that start with `[` or with `|`, so we have to take care that we get them.

There are also line numbers with a hyphen in them, such as `6-7`.

In [22]:
lineNumPat = r'^(?:(?:[a-zA-Z0-9.\'-]+\s+)|(?=[|\[]))'
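
To get a feel for what this pattern accepts, it can be tried on a few lines in isolation. This is only a sketch: apart from the first sample, which is quoted in the Variants section of this notebook, the sample lines are made up for illustration and do not come from the sources.

```python
import re

# same pattern as in the notebook, written as a raw string
lineNumPat = r"^(?:(?:[a-zA-Z0-9.'-]+\s+)|(?=[|\[]))"
linePat = re.compile(lineNumPat + r'(.*)')

samples = [
    "3. 1(N14) 8(N01) , RAD~a@g ERIM~a SZU2 A?",  # quoted in this notebook
    "6-7. 1(N01) , EN~a",     # made-up: hyphenated line number
    "[...] 1(N01)",           # made-up: no line number, starts with [
    "|GAxAN| 1(N01)",         # made-up: no line number, starts with |
    "&P000001 = ATU 0, 000",  # made-up: tablet header, should not match
]
for sample in samples:
    match = linePat.match(sample)
    # show the material after the line number, or None if the line is skipped
    print(repr(sample), '->', repr(match.group(1)) if match else None)
```

The lookahead alternative consumes nothing, so for lines that start with `[` or `|` the captured material is the whole line.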

### Primes

First an overview of the occurrence of primes.

**N.B.:** This is about primes on *signs*, not on *column* numbers.

In [23]:
for (value, frequency) in F.prime.freqList():
    print(f'{frequency:>5} x {value}')

    9 x 1


Now let us check the primes with grep, directly in the source files.
We look into lines starting with a (hierarchical) number, followed by a space,
and then later a single or double prime, but not one within a grapheme, such as `GA'AR`.

In [24]:
def tfPrimes():
    primes = []
    for n in F.prime.s(1):
        # the enclosing tablet gives the period, the enclosing case the source line number
        t = L.u(n, otype='tablet')[0]
        case = L.u(n, otype='case')[0]
        primes.append((F.period.v(t), F.srcLnNum.v(case), f"{COMP.strFromSign(n)}"))
    return primes

In [25]:
def grepPrimes(gen):
    primes = []
    primePat = re.compile(f'{lineNumPat}(.*[\'"][^A].*)')
    graphemePat = re.compile(r'(?:[0-9]+\([^)]+[\'"]\))|(?:[A-Z0-9~@a-wyz\']+\')')
    for (period, ln, line) in gen:
        match = primePat.match(line)
        if match:
            material = match.group(1)
            graphemes = graphemePat.findall(material)
            for grapheme in graphemes:
                primes.append((period, ln, grapheme))
    return primes
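
The two patterns can be tried on single lines to see how they cooperate: the line-level pattern decides whether a line contains a prime that is not followed by `A`, and the grapheme pattern then extracts the primed graphemes. The helper name `primesInLine` and the three sample lines are made up for illustration; only the shapes of the graphemes are taken from this notebook.

```python
import re

lineNumPat = r"^(?:(?:[a-zA-Z0-9.'-]+\s+)|(?=[|\[]))"
# a line qualifies if it has a single or double prime not followed by A
primePat = re.compile(lineNumPat + r'(.*[\'"][^A].*)')
# a primed numeral such as 1(N24'), or a sign name ending in a prime
graphemePat = re.compile(r'(?:[0-9]+\([^)]+[\'"]\))|(?:[A-Z0-9~@a-wyz\']+\')')


def primesInLine(line):
    """Return the primed graphemes of one source line, or [] if the line does not qualify."""
    match = primePat.match(line)
    return graphemePat.findall(match.group(1)) if match else []


print(primesInLine("1. 1(N24') , EN~a"))  # single prime on a numeral
print(primesInLine('2. 1(N24") , EN~a'))  # double prime on a numeral
print(primesInLine("3. 1(N01) , GA'AR"))  # prime inside a grapheme: line is skipped
```

Note that the exclusion of `GA'AR` happens at the line level: a line that has both a genuine prime and `GA'AR` would still pass the first pattern.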

In [26]:
COMP.checkSanity(
    grepPrimes,
    tfPrimes,
)

Number of results: TF 9; GREP 9
DIFFERENT: first different item is at 1
TF   : uruk-iii ◆ 48967 ◆ 1(N24')
GREP : uruk-iii ◆ 48967 ◆ 1(N24")
remaining items (TF: 8); GREP: 8
=    : uruk-iii ◆ 49069 ◆ 1(N24')
=    : uruk-iii ◆ 49071 ◆ 1(N24')
=    : uruk-iii ◆ 49073 ◆ 1(N24')
=    : uruk-iii ◆ 49075 ◆ 1(N24')
=    : uruk-iii ◆ 49391 ◆ 1(N24')
=    : uruk-iii ◆ 54446 ◆ 1(N30c')
=    : uruk-iii ◆ 55938 ◆ 1(N24')
=    : uruk-iii ◆ 55939 ◆ 1(N24')
      no more items


This makes it clear: the transcription has a strange double prime in `1(N24")` on source line 48967, where TF records a single prime.

### Variants

Overview of variants:

In [27]:
for (value, frequency) in F.variant.freqList():
    print(f'{frequency:>5} x {value}')

23839 x a
 4213 x b
 1534 x c
 1356 x a1
  703 x b1
  194 x a2
  191 x d
  127 x b2
   85 x f
   73 x a3
   40 x e
   29 x c2
   22 x c1
   22 x c3
   14 x c5
   13 x b3
   12 x a0
   12 x d1
   12 x v
   11 x c4
    6 x a4
    6 x g
    5 x d2
    4 x d4
    4 x h
    2 x 3a
    2 x d3
    1 x h2


So there are many variants.

Again, we look for variants in the TF resource and by GREPping them from the sources.

In [28]:
def tfVariants():
    variants = []
    for n in F.otype.s('sign'):
        variant = F.variant.v(n)
        if variant is None:
            continue
        # the enclosing tablet gives the period, the enclosing case the source line number
        t = L.u(n, otype='tablet')[0]
        case = L.u(n, otype='case')[0]
        variants.append((F.period.v(t), F.srcLnNum.v(case), f"{COMP.strFromSign(n)}"))
        written = F.written.v(n)
        if written is not None:
            if '~' in written:
                variants.append((F.period.v(t), F.srcLnNum.v(case), written))

    return variants

There are complications.

#### Order of modifiers and variants
When we extract variants by GREP, we face the problem that the order between modifiers and variants is not consistent.
We see cases of variant and then modifier:

```
3. 1(N14) 8(N01) , RAD~a@g ERIM~a SZU2 A?
```

and cases with modifier and then variant:

```
4. 2(N01) , URUDU@g~b SZU2#
```

both from the same tablet P003407.

We will normalize the order of variant and modifier.
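
As a minimal sketch of this normalization, the snippet below rewrites a modifier followed by a variant so that the variant comes first, and applies it to the two graphemes from the lines quoted above. The function name `normalize` is hypothetical; the pattern mirrors the `combiPat` used in the cell that follows.

```python
import re

# a modifier (@...) directly followed by a variant (~...)
combiPat = re.compile(r'(@[a-wyz0-9]+)(~[a-wyz0-9]+)')


def normalize(grapheme):
    """Put the variant before the modifier: X@g~b becomes X~b@g."""
    return combiPat.sub(lambda m: m.group(2) + m.group(1), grapheme)


print(normalize('URUDU@g~b'))  # → URUDU~b@g  (modifier first: swapped)
print(normalize('RAD~a@g'))    # → RAD~a@g    (variant first: left alone)
```

After this step both spellings have the form variant-then-modifier, which is also the order in which TF renders them in the comparison output.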

In [29]:
def swap(match):
    return f'{match.group(2)}{match.group(1)}'

def lower(match):
    return f'~{match.group(1).lower()}'

def grepVariants(gen):
    variants = []
    variantPat = re.compile(f'{lineNumPat}(.*[~].*)')
    varPat = re.compile('[~]([A-Z])')
    graphemePat = re.compile(r'(?:[0-9]+\([^)]+[~][^)]+\))|(?:[A-Z0-9@a-wyz\']+[~][0-9@a-wyz]+)')
    combiPat = re.compile('(@[a-wyz0-9]+)(~[a-wyz0-9]+)')
    for (period, ln, line) in gen:
        match = variantPat.match(line)
        if match:
            material = match.group(1)
            material = varPat.sub(lower, material)
            graphemes = graphemePat.findall(material)
            for grapheme in graphemes:
                grapheme = combiPat.sub(swap, grapheme)
                variants.append((period, ln, grapheme))
    return variants

In [30]:
COMP.checkSanity(
    grepVariants,
    tfVariants,
)

Number of results: TF 32447; GREP 32455
DIFFERENT: first different item is at 14156
=     start with 14135 items
=    : uruk-iii ◆ 53752 ◆ 2(N42~a)
=    : uruk-iii ◆ 53752 ◆ HI~a@g
=    : uruk-iii ◆ 53753 ◆ 1(N29~a)
=    : uruk-iii ◆ 53753 ◆ SZE~a
=    : uruk-iii ◆ 53767 ◆ DA~a
=    : uruk-iii ◆ 53767 ◆ SZE~a
=    : uruk-iii ◆ 53768 ◆ KU6~a
=    : uruk-iii ◆ 53768 ◆ KISAL~b1
=    : uruk-iii ◆ 53768 ◆ SZE~a
=    : uruk-iii ◆ 53768 ◆ UR5~a
=    : uruk-iii ◆ 53769 ◆ NUNUZ~a1
=    : uruk-iii ◆ 53770 ◆ ZI~a
=    : uruk-iii ◆ 53771 ◆ GIR~b@g
=    : uruk-iii ◆ 53771 ◆ NIM~b1
=    : uruk-iii ◆ 53771 ◆ UNUG~a
=    : uruk-iii ◆ 53782 ◆ EN~a
=    : uruk-iii ◆ 53782 ◆ NUN~b
=    : uruk-iii ◆ 53783 ◆ DU6~b
=    : uruk-iii ◆ 53785 ◆ DU8~c
=    : uruk-iii ◆ 53791 ◆ DU8~c
TF   : uruk-iii ◆ 53812 ◆ E2~a
GREP : uruk-iii ◆ 53811 ◆ DU8~c
remaining items (TF: 18291); GREP: 18299
TF   : uruk-iii ◆ 53811 ◆ DU8~c
GREP : uruk-iii ◆ 53812 ◆ E2~a
=    : uruk-iii ◆ 53817 ◆ DU8~c
=    : uruk-iii ◆ 53827 ◆ 4(N39~a)

In [29]:
tId = 'P325218'
line = T.nodeFromSection((tId, '1', '1'))

In [30]:
line

239908

In [31]:
for s in L.d(line, otype='sign'):
    print(f'{s} {COMP.strFromSign(s)}')

53327 6(N34)
53328 4(N14)
53329 2(N01)
53330 LAGAB~b
53331 DARA4~c2
53332 UDU~a
53333 5(N14)
53334 6(N01)
53335 UDU~a
53336 IB~c
53337 5(N34)
53338 4(N14)
53339 6(N01)
53340 BA
53341 DARA4~c2
53342 …
