# Mapping readings to Unicode

## Task

We want to map *readings* and *graphemes* in cuneiform corpora to cuneiform Unicode characters,
based on extant mapping tables.

We generate a plain mapping that programs can readily use to convert from ATF to TF or to some other format.

## Problem

There are multiple mapping tables, and there are several ways to transliterate readings.

## Sources

We take the ATF transliterations from CDLI, for tablets of various corpora.

We take the file
[GeneratedSignList.json](https://github.com/Nino-cunei/oldbabylonian/blob/master/sources/writing/GeneratedSignList.json)
with mappings like

```json
        "BANIA": {
            "signName": "BANIA",
            "signNumber": 551,
            "signCunei": "𒑔",
            "codePoint": "",
            "values": [
                "BANIA", "AŠ2.UoverU", "5SŪTU"
            ]
        },
        "MA": {
            "signName": "MA",
            "signNumber": 552,
            "signCunei": "𒈠",
            "codePoint": "",
            "values": [
                "MA", "PEŠ3", "PEŠŠE", "WA6"
            ]
        },
```

See [transcription](https://github.com/Nino-cunei/oldbabylonian/blob/master/docs/transcription.md)
about the provenance of this file.

# Status

This is work in progress. 
The mapping is needed in the conversion from ATF to TF in the program
[tfFromATF.py](tfFromATF.py).

# Authors

Cale Johnson, Martijn Kokken, Dirk Roorda

# Acknowledgements

We are indebted to **Auday Hussein** for kindly sending the *GeneratedSignList.json* file to us,
and to **Alba de Ridder** for hints and comments.

In [1]:
import os
import collections
import re
import json
from unicodedata import name as uname

# Local topography

In [2]:
BASE = os.path.expanduser('~/github')
ORG = 'Nino-cunei'
REPO = 'signs'

REPO_DIR = f'{BASE}/{ORG}/{REPO}'

WRITING_DIR = f'{REPO_DIR}/writing'

SIGN_FILE = 'GeneratedSignList.json'
SIGN_PATH = f'{WRITING_DIR}/{SIGN_FILE}'

MAPPING_FILE = f'{os.path.abspath("..")}/characters/mapping.tsv'
AMBI_FILE = f'{os.path.abspath("..")}/characters/ambiguous.tsv'

# Reading collection

We read the character files (`corpus.tsv`) of various corpora and collect the readings that occur in them.

In [3]:
CORPORA = '''
  oldbabylonian
  oldassyrian
'''.strip().split()

charFiles = {corp: f'{BASE}/{ORG}/{corp}/characters/corpus.tsv' for corp in CORPORA}

In [4]:
tokens = set()

for (corp, charFile) in charFiles.items():
  with open(charFile) as f:
    theseTokens = set()
    for line in f:
      (repeat, sign) = line.rstrip('\n').split('\t')
      theseTokens.add((repeat, sign) if repeat else sign)
  print(f'{corp:<15}: {len(theseTokens)} readings')
  tokens |= theseTokens
print(f'{"TOTAL":<15}: {len(tokens)} readings')

oldbabylonian  : 976 readings
oldassyrian    : 639 readings
TOTAL          : 1145 readings
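
The loop above implies that each line of `corpus.tsv` has two tab-separated fields: a repeat, which may be empty, and a sign. A minimal sketch of that parsing on a few made-up lines (the sample content and the names `sample` and `sample_tokens` are assumptions, not taken from the real files):

```python
import io

# Hypothetical sample in the two-column layout the loop above expects:
# <repeat or empty> TAB <sign>
sample = io.StringIO('\tdingir\n1\tasz\n1/3\tdisz\n\tasz\n')

sample_tokens = set()
for line in sample:
    (repeat, sign) = line.rstrip('\n').split('\t')
    # a non-empty repeat yields a (repeat, sign) tuple, otherwise the bare sign
    sample_tokens.add((repeat, sign) if repeat else sign)

for token in sorted(sample_tokens, key=str):
    print(repr(token))
```

Note that the repeat stays a string here, so the token is `('1', 'asz')` and not `(1, 'asz')`.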


# Unicode style versus ATF style

We use a mapping from Unicode-style transliterations to ATF-style (ASCII) transliterations.

In [5]:
transAscii = {
    'š': 'sz',
    'ṣ': 's,',
    'ś': "s'",
    'ṭ': 't,',
    'ḫ': 'h,',
}

transAscii.update({k.upper(): v.upper() for (k, v) in transAscii.items()})

def makeAscii(r):
  for (rin, rout) in transAscii.items():
    r = r.replace(rin, rout)
  return r

In [6]:
transAscii

{'š': 'sz',
 'ṣ': 's,',
 'ś': "s'",
 'ṭ': 't,',
 'ḫ': 'h,',
 'Š': 'SZ',
 'Ṣ': 'S,',
 'Ś': "S'",
 'Ṭ': 'T,',
 'Ḫ': 'H,'}
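
As a quick check of the table above, here is a self-contained sketch that repeats the replacement table and applies it to two values from the sign list excerpt. The names `TRANS_ASCII` and `make_ascii` are stand-ins for `transAscii` and `makeAscii`.

```python
# Same replacement table as above, with the upper-case variants derived from it
TRANS_ASCII = {
    'š': 'sz',
    'ṣ': 's,',
    'ś': "s'",
    'ṭ': 't,',
    'ḫ': 'h,',
}
TRANS_ASCII.update({k.upper(): v.upper() for (k, v) in TRANS_ASCII.items()})


def make_ascii(r):
    # every key is a single character, so the order of replacement is irrelevant
    for (rin, rout) in TRANS_ASCII.items():
        r = r.replace(rin, rout)
    return r


print(make_ascii('PEŠ3'))
print(make_ascii('AŠ2.UoverU'))
```

Characters without an entry in the table, such as the `Ū` in `5SŪTU`, pass through unchanged.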

In [7]:
REPEAT_INV = dict(
  one=1,
  two=2,
  three=3,
  four=4,
  five=5,
  six=6,
  seven=7,
  eight=8,
  nine=9,
)

REPEAT = {v: k for (k, v) in REPEAT_INV.items()}

In [8]:
FRACTION = {
  '1/2': 'one half',
  '1/3': 'one third',
  '2/3': 'two thirds',
  '1/4': 'one quarter',
  '1/6': 'one sixth',
  '5/6': 'five sixths',
  '1/8': 'one eighth',
}

# Read the sign list

We read the JSON file with the generated signs.

For each sign, we find a list of *values*.

These values correspond to possible readings or graphemes, in short, *tokens*.
They are in Unicode transliteration style.

In the mapping we create, we convert them to plain ATF,
which makes it easier to look them up from our Old Babylonian corpus.

In [9]:
with open(SIGN_PATH, encoding='utf8') as fh:
  signs = json.load(fh)['signs']

print(f'{len(signs)} signs in the json file')

mapping = collections.defaultdict(set)

for (sign, signData) in signs.items():
  uniStr = signData['signCunei']
  values = signData['values']
  for value in values:
    valueAscii = makeAscii(value)
    mapping[valueAscii].add(uniStr)

print(f'{len(mapping)} distinct values in table')

1768 signs in the json file
8765 distinct values in table
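
To make the shape of the data concrete, the following sketch runs the same inversion on a one-sign stand-in for the JSON file. The wrapping `"signs"` key follows from the cell above; the sample itself, the variable names and the simplified replacement of `Š` only are assumptions for illustration.

```python
import collections
import json

# Minimal stand-in for GeneratedSignList.json: one sign under the top-level
# "signs" key. JSON \u escapes keep the sample ASCII-only;
# \ud808\ude20 is the surrogate pair for U+12220.
sample_json = r'''
{"signs": {"MA": {"signName": "MA", "signNumber": 552,
                  "signCunei": "\ud808\ude20", "codePoint": "",
                  "values": ["MA", "PE\u01603", "PE\u0160\u0160E", "WA6"]}}}
'''

sample_signs = json.loads(sample_json)['signs']
sample_mapping = collections.defaultdict(set)

for (sign, sign_data) in sample_signs.items():
    for value in sign_data['values']:
        # simplified: only the S-with-caron needs converting in this sample
        value_ascii = value.replace('\u0160', 'SZ')
        sample_mapping[value_ascii].add(sign_data['signCunei'])

print(sorted(sample_mapping))
```

Every value of the sign becomes a key, and all four keys point to the same single-character set.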


# Token lookup

We look up each corpus token in the mapping just constructed.

Depending on whether we find zero, one, or several cuneiform strings for a token,
we store it in the set `unmapped`, the dictionary `unique`, or the dictionary `multiple`.

In [10]:
MAPPING_FIXES = {
    'd': 'dingir',
}

unmapped = set()
unique = {}
multiple = {}

for t in tokens:
  if type(t) is tuple:
    unmapped.add(t)
    continue
  tLookup = MAPPING_FIXES.get(t, t)
  tU = tLookup.upper()
  if tU not in mapping:
    unmapped.add(t)
    continue
  targets = mapping[tU]
  if len(targets) == 1:
    unique[t] = list(targets)[0]
  else:
    multiple[t] = targets
    
print(f'{len(unmapped):>3} unmapped tokens')
print(f'{len(multiple):>3} ambiguously mapped tokens')
print(f'{len(unique):>3} uniquely mapped tokens')

182 unmapped tokens
 58 ambiguously mapped tokens
905 uniquely mapped tokens


# Unmapped tokens

In [12]:
unkey = lambda x: (x[1].lower(), str(x[0])) if type(x) is tuple else (x.lower(), '')

print(f'{len(unmapped):>3} unmapped tokens')
sorted(unmapped, key=unkey)

182 unmapped tokens


['&i2',
 "'i",
 '...',
 '2(disz@t)',
 "a'",
 'ah',
 'AH',
 'alamusz',
 'asal2',
 ('1', 'asz'),
 ('1/2', 'asz'),
 ('1/3', 'asz'),
 ('2', 'asz'),
 ('3', 'asz'),
 ('4', 'asz'),
 ('5', 'asz'),
 ('6', 'asz'),
 ('7', 'asz'),
 ('8', 'asz'),
 ('9', 'asz'),
 'babila',
 'babila2',
 ('1', 'ban2'),
 ('2', 'ban2'),
 ('3', 'ban2'),
 ('4', 'ban2'),
 ('5', 'ban2'),
 'barig',
 ('1', 'barig'),
 ('2', 'barig'),
 ('3', 'barig'),
 ('4', 'barig'),
 ('5', 'barig'),
 "bur'u",
 ('1', "bur'u"),
 ('2', "bur'u"),
 ('3', "bur'u"),
 ('4', "bur'u"),
 ('5', "bur'u"),
 ('1', 'bur3'),
 ('2', 'bur3'),
 ('3', 'bur3'),
 ('4', 'bur3'),
 ('5', 'bur3'),
 ('6', 'bur3'),
 ('8', 'bur3'),
 ('9', 'bur3'),
 'dah',
 ('1', 'disz'),
 ('1/2', 'disz'),
 ('1/3', 'disz'),
 ('1/4', 'disz'),
 ('1/6', 'disz'),
 ('2', 'disz'),
 ('2/2', 'disz'),
 ('2/3', 'disz'),
 ('3', 'disz'),
 ('3/4', 'disz'),
 ('4', 'disz'),
 ('5', 'disz'),
 ('5/6', 'disz'),
 ('6', 'disz'),
 ('7', 'disz'),
 ('8', 'disz'),
 ('9', 'disz'),
 'duh',
 'eh',
 'EH',
 'eri11',
 (

# Fix the unmapped tokens

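We look up the unmapped tokens in the Unicode character table:
for each token we construct the character name it would have and check whether a character with that name exists.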

In [13]:
cuneiBlocks = {
  'Cuneiform': ('12000', '123FF'),
  'Cuneiform Numbers and Punctuation': ('12400', '1247F'),
  'Early Dynastic Cuneiform': ('12480', '1254F'),
}

In [14]:
cunicode = {}

for (block, (start, end)) in cuneiBlocks.items():
  for u in range(int(start, 16), int(end, 16) + 1):
    c = chr(u)
    name = uname(c, None)
    if name is None:
      continue
    cunicode[name] = c

In [15]:
mapAddition = {}
notFixed = set()

def getLookup(r):
  return (
    r.
    replace("'", '').
    upper().
    replace("SZ", 'SH').
    replace('.', ' TIMES ')
  )
  
  
for t in sorted(unmapped, key=unkey):
  if type(t) is tuple:
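    # NB: tokens from corpus.tsv carry the repeat as a string ('1', '1/3', ...),
    # so this test is False for them and they take the fraction branch below
    if type(t[0]) is int: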
      (repeat, r) = t
      tRepeat = REPEAT.get(repeat, None)
      if tRepeat is None:
        notFixed.add(t)
        continue
      tLookup =  getLookup(r)
      name = f'CUNEIFORM NUMERIC SIGN {tRepeat.upper()} {tLookup}'
      c = cunicode.get(name, None)
      if c is not None:
        mapAddition[t] = c
        continue
      name = f'CUNEIFORM SIGN {tLookup}'
    else:
      (fraction, r) = t
      tFraction = FRACTION.get(fraction, None)
      if tFraction is None:
        notFixed.add(t)
        continue
      tLookup =  getLookup(r)
      name = f'CUNEIFORM NUMERIC SIGN {tFraction.upper()} {tLookup}'
  else:
    tLookup =  getLookup(t)
    name = f'CUNEIFORM SIGN {tLookup}'
  c = cunicode.get(name, None)
  if c is None:
    notFixed.add(t)
  else:
    mapAddition[t] = c

print(f'fixed {len(mapAddition)} out of {len(unmapped)}')

if mapAddition:
  print('FIXED')
  for (t, c) in sorted(mapAddition.items(), key=unkey):
    print(f'\t{str(t):<15} => {c}')
else:
  print('NOTHING FIXED')
  
if notFixed:
  print('UNFIXED')
  for t in sorted(notFixed, key=unkey):
    print(f'\t{str(t):<15} => ?')
else:
  print('ALL FIXED')

fixed 18 out of 182
FIXED
	a'              => 𒀀
	asal2           => 𒀷
	duh             => 𒂃
	HA              => 𒄩
	ha              => 𒄩
	hal             => 𒄬
	HI              => 𒄭
	hi              => 𒄭
	HU              => 𒄷
	hu              => 𒄷
	hub2            => 𒄸
	'i              => 𒄿
	luh             => 𒈛
	mah             => 𒈤
	pesz2           => 𒉾
	('1/3', 'disz') => 𒑚
	('2/3', 'disz') => 𒑛
	('5/6', 'disz') => 𒑜
UNFIXED
	&i2             => ?
	...             => ?
	2(disz@t)       => ?
	ah              => ?
	AH              => ?
	alamusz         => ?
	('1', 'asz')    => ?
	('1/2', 'asz')  => ?
	('1/3', 'asz')  => ?
	('2', 'asz')    => ?
	('3', 'asz')    => ?
	('4', 'asz')    => ?
	('5', 'asz')    => ?
	('6', 'asz')    => ?
	('7', 'asz')    => ?
	('8', 'asz')    => ?
	('9', 'asz')    => ?
	babila          => ?
	babila2         => ?
	('1', 'ban2')   => ?
	('2', 'ban2')   => ?
	('3', 'ban2')   => ?
	('4', 'ban2')   => ?
	('5', 'ba
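
The name construction used above can be checked in isolation with `unicodedata.lookup`, the inverse of `unicodedata.name`. This sketch restates `getLookup` under the hypothetical name `get_lookup` and reproduces two of the fixes listed above. It assumes that the Unicode database of the Python build includes the cuneiform blocks.

```python
from unicodedata import lookup


def get_lookup(r):
    # same normalisation as getLookup: drop apostrophes, upper-case,
    # SZ -> SH, and '.' -> ' TIMES ' for compound signs
    return r.replace("'", '').upper().replace('SZ', 'SH').replace('.', ' TIMES ')


FRACTION_NAMES = {'1/3': 'one third'}   # excerpt of the FRACTION table

# plain reading: pesz2 -> CUNEIFORM SIGN PESH2
plain_name = f'CUNEIFORM SIGN {get_lookup("pesz2")}'
plain = lookup(plain_name)

# fraction token: ('1/3', 'disz') -> CUNEIFORM NUMERIC SIGN ONE THIRD DISH
(fraction, r) = ('1/3', 'disz')
frac_name = f'CUNEIFORM NUMERIC SIGN {FRACTION_NAMES[fraction].upper()} {get_lookup(r)}'
frac = lookup(frac_name)

print(f'{plain_name} = U+{ord(plain):05X}')
print(f'{frac_name} = U+{ord(frac):05X}')
```

Unlike the dictionary `cunicode`, `lookup` raises a `KeyError` for a name that does not exist, so a caller that wants the "not fixed" behaviour has to catch that exception.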

# Solutions

Most of the remaining problems above were solved with a
[table provided by Martijn Kokken](https://github.com/Nino-cunei/oldbabylonian/blob/master/sources/writing/MartijnKokken.txt).

In [17]:
MAPPING_SOLUTIONS = dict(
  ah=('HIxNUN', 'U12134'),
  AH=('HIxNUN', 'U12134'),
  alamusz=('TAxHI', 'U122ED'),
  babila2=('KA2.AN.RA', 'U1218D U1202D U1228F'),
  dah=('MU/MU', 'U1222D'),
  eh=('HIxNUN', 'U12134'),
  EH=('HIxNUN', 'U12134'),
  eri11=('AB gunû', 'U12015'),
  geszimmar=('ŠA6', 'U122B7'),
  gudu4=('HIxNUN.ME', 'U12134 U12228'),
  had2=('UD', 'U12313'),
  har=('HIxAŠ2', 'U1212F'),
  HAR=('HIxAŠ2', 'U1212F'),
  he=('HI', 'U1212D'),
  he2=('GAN', 'U120F6'),
  hun=('EŠ2', 'U120A0'),
  hur=('HIxAŠ2', 'U1212F'),
  huz=('LUM', 'U1221D'),
  ih=('HIxNUN', 'U12134'),
  IH=('HIxNUN', 'U12134'),
  itu=('UDxU.U.U', 'U12317'),
  KA=('KA TA', 'U12157 U122EB'),
  kislah=('KI.UD', 'U121A0 U12313'),
  lah=('UD', 'U12313'),
  lah4=('DU / DU', 'U1207B'),
  lah5=('DU.DU', 'U1207A U1207A'),
  lah6=('DU', 'U1207A'),
  lal3=('TAxHI', 'U122ED'),
  muhaldim=('MU', 'U1222C'),
  nigar=('U.UD.KID', 'U1230B U12313 U121A4'),
  nirah=('MUŠ', 'U12232'),
  sa10=('NINDA2xŠE', 'U1225A'),
  sahar=('IŠ', 'U12156'),
  siskur2=('AMARxŠE.AMARxŠE', 'U1202C U1202C'),
  szagina=('GIR3.ARAD', 'U1210A U12034'),
  szah=('ŠUBUR', 'U122DA'),
  szah2=('DUN', 'U12084'),
  szandana=('GAL.NI', 'U120F2 U1224C'),
  tah=('MU/MU', 'U1222D'),
  tap=('TAB', 'U122F0'),
  udru=('AŠ2', 'U1203E'),
  UH=('HIxNUN', 'U12134'),
  uh=('HIxNUN', 'U12134'),
  UH2=('UD.KUŠU2', 'U12313 U121B5'),
  uh2=('UD.KUŠU2', 'U12313 U121B5'),
  uh3=('KUŠU2', 'U121B5'),
  UH3=('KUŠU2', 'U121B5'),
  ukken=('URUxBAR', 'U1233A'),
  unu=('AB gunû', 'U12015'),
)
MAPPING_SOLUTIONS.update({
  '1(asz)': ('', 'U12038'),
  '2(asz)': ('', 'U12400'),
  '3(asz)': ('', 'U12401'),
  '4(asz)': ('', 'U12402'),
  '5(asz)': ('', 'U12403'),
  '6(asz)': ('', 'U12404'),
  '7(asz)': ('', 'U12405'),
  '8(asz)': ('', 'U12406'),
  '9(asz)': ('', 'U12407'),
  '1/2(asz)': ('', 'U12226'),
  '1/3(asz)': ('', 'U1245A'),
  '1/4(asz)': ('', 'U12460'),
  '1/8(asz)': ('', 'U1245F'),
  'babila': ('', 'U1218D U1202D'),
  '1(ban2)': ('', 'U1244F'),
  '2(ban2)': ('', 'U12450'),
  '3(ban2)': ('', 'U12451'),
  '4(ban2)': ('', 'U12452'),
  '5(ban2)': ('', 'U12454'),
  'barig': ('', 'U12079'),
  '1(barig)': ('', 'U12079'),
  '2(barig)': ('', 'U12079 U12079'),
  '3(barig)': ('', 'U12079 U12079 U12079'),
  '4(barig)': ('', 'U1235D'),
  '5(barig)': ('', 'U12125'),
  'bur3': ('', 'U1230B'),
  "bur'u": ('', 'U12434'),
  '1(bur3)': ('', 'U1230B'),
  '2(bur3)': ('', 'U1230B U1230B'),
  '3(bur3)': ('', 'U1230B U1230B U1230B'),
  '4(bur3)': ('', 'U1240F'),
  '5(bur3)': ('', 'U12410'),
  '6(bur3)': ('', 'U12411'),
  '7(bur3)': ('', 'U12412'),
  '8(bur3)': ('', 'U12413'),
  '9(bur3)': ('', 'U12414'),
  '1(disz)': ('', 'U12079'),
  '2(disz)': ('', 'U1222B'),
  '3(disz)': ('', 'U12408'),
  '4(disz)': ('', 'U12409'),
  '5(disz)': ('', 'U1240A'),
  '6(disz)': ('', 'U1240B'),
  '7(disz)': ('', 'U1240C'),
  '8(disz)': ('', 'U1240D'),
  '9(disz)': ('', 'U1240E'),
  '1/2(disz)': ('', 'U12226'),
  '1/3(disz)': ('', 'U1245A'),
  '1/4(disz)': ('', 'U12462'),
  '1/6(disz)': ('', 'U12461'),
  '2/3(disz)': ('', 'U1245B'),
  '5/6(disz)': ('', 'U1245C'),
  '13(disz)': ('', 'U12399 U12408'),
  'disz@t': ('', 'U120F5'),
  'hi@v': ('', 'U1212D'),
  '1(iku)': ('', 'U12038'),
  '2(iku)': ('', 'U12400'),
  '3(iku)': ('', 'U12401'),
  '4(iku)': ('', 'U12402'),
  '5(iku)': ('', 'U12403'),
  '6(iku)': ('', 'U12404'),
  '7(iku)': ('', 'U12405'),
  '8(iku)': ('', 'U12406'),
  '9(iku)': ('', 'U12407'),
  '1(esze3)': ('', 'U12458'),
  '2(esze3)': ('', 'U12459'),
  '3(esze3)': ('', 'U12038 U1230B'),
  'gesz2': ('', 'U12415'),
  '1(gesz2)': ('', 'U12415'),
  "gesz'u": ('', 'U1241E'),
  'la5': ('', 'U121F3'),
  'szar2': ('', 'U122B9'),
  '1(u)': ('', 'U1230B'),
  '2(u)': ('', 'U12399'),
  '3(u)': ('', 'U1230D'),
  '4(u)': ('', 'U1240F'),
  '5(u)': ('', 'U12410'),
  '6(u)': ('', 'U12411'),
  '7(u)': ('', 'U12412'),
  '8(u)': ('', 'U12413'),
  '9(u)': ('', 'U12414'),
})

In [17]:
MAPPING_SOLUTIONSX = {}

for (token, (grapheme, uniChars)) in MAPPING_SOLUTIONS.items():
  uniStr = ''.join(chr(int(uc[1:], 16)) for uc in uniChars.split())
  MAPPING_SOLUTIONSX[token] = uniStr
MAPPING_SOLUTIONSX

{'ah': '𒄴',
 'AH': '𒄴',
 'alamusz': '𒋭',
 'babila2': '𒆍𒀭𒊏',
 'dah': '𒈭',
 'eh': '𒄴',
 'EH': '𒄴',
 'eri11': '𒀕',
 'geszimmar': '𒊷',
 'gudu4': '𒄴𒈨',
 'had2': '𒌓',
 'har': '𒄯',
 'HAR': '𒄯',
 'he': '𒄭',
 'he2': '𒃶',
 'hun': '𒂠',
 'hur': '𒄯',
 'huz': '𒈝',
 'ih': '𒄴',
 'IH': '𒄴',
 'itu': '𒌗',
 'KA': '𒅗𒋫',
 'kislah': '𒆠𒌓',
 'lah': '𒌓',
 'lah4': '𒁻',
 'lah5': '𒁺𒁺',
 'lah6': '𒁺',
 'lal3': '𒋭',
 'muhaldim': '𒈬',
 'nigar': '𒌋𒌓𒆤',
 'nirah': '𒈲',
 'sa10': '𒉚',
 'sahar': '𒅖',
 'siskur2': '𒀬𒀬',
 'szagina': '𒄊𒀴',
 'szah': '𒋚',
 'szah2': '𒂄',
 'szandana': '𒃲𒉌',
 'tah': '𒈭',
 'tap': '𒋰',
 'udru': '𒀾',
 'UH': '𒄴',
 'uh': '𒄴',
 'UH2': '𒌓𒆵',
 'uh2': '𒌓𒆵',
 'uh3': '𒆵',
 'UH3': '𒆵',
 'ukken': '𒌺',
 'unu': '𒀕',
 '1(asz)': '𒀸',
 '2(asz)': '𒐀',
 '3(asz)': '𒐁',
 '4(asz)': '𒐂',
 '5(asz)': '𒐃',
 '6(asz)': '𒐄',
 '7
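
The conversion from a string of `U`-prefixed code points to actual characters, and back, can be sketched as a pair of helper functions. Both function names are hypothetical; the input is the code point string of the `gudu4` entry in `MAPPING_SOLUTIONS`.

```python
def to_unistr(uni_chars):
    """Turn a string like 'U12134 U12228' into the corresponding characters."""
    return ''.join(chr(int(uc[1:], 16)) for uc in uni_chars.split())


def to_codes(uni_str):
    """Inverse of to_unistr: characters back to 'Uxxxxx' notation."""
    return ' '.join(f'U{ord(c):05X}' for c in uni_str)


gudu4 = to_unistr('U12134 U12228')
print(len(gudu4), to_codes(gudu4))
```

The round trip is a cheap way to verify a table entry when the glyphs themselves are hard to tell apart in a terminal or editor.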

# Ambiguously mapped readings

In [18]:
print(f'{len(multiple):>3} ambiguously mapped readings')
for r in sorted(multiple):
  unis = multiple[r]
  uniStr = ' - '.join(sorted(unis))
  print(f'{r} => ({len(unis)}) => {uniStr}')

 58 ambiguously mapped readings
2 => (3) => 𒈫 - 𒈫𒌍 - 𒐀
IA => (2) => 𒅀 - 𒉿
IL => (2) => 𒀧 - 𒅋
IRI => (2) => 𒅕 - 𒌷
KAM => (2) => 𒄭𒁁 - 𒄰
LUM => (2) => 𒈝 - 𒋞
USZ => (2) => 𒍑 - 𒍖
UZ => (2) => 𒊻 - 𒍖
WA => (2) => 𒁀 - 𒉿
ba4 => (3) => 𒀀𒀭𒂷 - 𒂷 - 𒍝𒂷𒂷
ba6 => (2) => 𒁀𒌑 - 𒌑
bara2 => (2) => 𒁁 - 𒁈
bum => (2) => 𒅤 - 𒆃
buru14 => (2) => 𒂘 - 𒂙
da2 => (2) => 𒋫 - 𒋬
dabin => (2) => 𒂠𒊺 - 𒍥𒊺
dilmun => (3) => 𒉌𒌇 - 𒊩𒄸 - 𒊩𒌇
eri => (2) => 𒅕 - 𒌷
erisz => (2) => 𒊩𒈠 - 𒊩𒌆
gala => (3) => 𒃲 - 𒍑𒆪 - 𒍓
gin7 => (2) => 𒁶 - 𒄀
gurusz => (2) => 𒄨 - 𒆗
ia => (2) => 𒅀 - 𒉿
idim => (2) => 𒁁 - 𒅂
ii => (2) => 𒅀 - 𒉿
il => (2) => 𒀧 - 𒅋
iri => (2) => 𒅕 - 𒌷
isz8 => (2) => 𒀹 - 𒌋
iu => (2) => 𒅀 - 𒉿
kam => (2) => 𒄭𒁁 - 𒄰
kesz2 => (2) => 𒂡 - 𒆟
kesz3 => (2) => 𒋙𒀭𒄲 - 𒋙𒀭𒄲𒆠
limmu2 => (

Make an Excel sheet of all ambiguously mapped readings.
For each such reading, collect 4 examples on 4 distinct lines (if possible).

The sheet has columns for:

* reading
* unicode1: name and glyph
* unicode2: name and glyph
* example word
* example line

In [19]:
def unameStr(glyphstr):
  return ' + '.join(uname(g).replace('CUNEIFORM', '').replace('SIGN', '').strip() for g in glyphstr)
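
For instance, applied to the two-sign candidate for `kam` (U+1212D followed by U+12041), `unameStr` yields the sign names joined by ` + `. The sketch repeats the function so that it runs on its own; the variable `kam` is introduced here for illustration.

```python
from unicodedata import name as uname


def unameStr(glyphstr):
    return ' + '.join(
        uname(g).replace('CUNEIFORM', '').replace('SIGN', '').strip()
        for g in glyphstr
    )


kam = chr(0x1212D) + chr(0x12041)
print(unameStr(kam))
```

Because `uname` is called without a default, a character that has no Unicode name raises a `ValueError`.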

# Uniquely mapped readings

In [22]:
print(f'{len(unique):>3} uniquely mapped readings')
for r in sorted(unique):
  print(f'{r:>10} => {unique[r]}')

905 uniquely mapped readings
         A => 𒀀
        A2 => 𒀉
        AB => 𒀊
        AD => 𒀜
        AG => 𒀝
        AK => 𒀝
        AL => 𒀠
        AM => 𒄠
        AN => 𒀭
        AR => 𒅈
      ARAD => 𒀴
     ARAD2 => 𒀵
       AS, => 𒊍
       AS2 => 𒀾
       ASZ => 𒀸
        AZ => 𒊍
        BA => 𒁀
       BAD => 𒁁
       BAR => 𒁇
        BE => 𒁁
        BI => 𒁉
        BU => 𒁍
       BUR => 𒁓
        DA => 𒁕
       DAM => 𒁮
        DI => 𒁲
       DIM => 𒁴
       DIN => 𒁷
      DISZ => 𒁹
        DU => 𒁺
      DU10 => 𒄭
      DUL3 => 𒊨
         E => 𒂊
      EDIN => 𒂔
        EK => 𒅅
        EL => 𒂖
        ER => 𒅕
        GA => 𒂵
       GAG => 𒆕
       GAL => 𒃲
      GAN2 => 𒃷
       GAR => 𒃻
       GAZ => 𒄤
      GESZ => 𒄑
        GI => 𒄀
       GIR => 𒄫
      GIR2 => 𒄈
        GU => 𒄖
         I => 𒄿
        IB => 𒅁
        ID => 𒀉
  

# Write the mapping file

In [26]:
pairs = {}
for (k, vs) in multiple.items():
  pairs[k] = sorted(vs)[0]
for (t, v) in mapAddition.items():
  k = f'{t[0]}({t[1]})' if type(t) is tuple else t
  pairs[k] = v
for (k, v) in MAPPING_SOLUTIONSX.items():
  pairs[k] = v
for (k, v) in unique.items():
  pairs[k] = v
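
The order of the four loops matters, because a later assignment overwrites an earlier one for the same key. A toy illustration with made-up placeholder values (strings such as `'from-solutions'` stand in for cuneiform characters, and all `toy_` names are hypothetical):

```python
toy_multiple = {'ia': {'glyph-b', 'glyph-a'}}
toy_additions = {('1/3', 'disz'): 'from-unicode-names', 'ha': 'from-unicode-names'}
toy_solutions = {'1/3(disz)': 'from-solutions'}
toy_unique = {'an': 'from-signlist'}

toy_pairs = {}
for (k, vs) in toy_multiple.items():
    toy_pairs[k] = sorted(vs)[0]          # first candidate in sort order
for (t, v) in toy_additions.items():
    k = f'{t[0]}({t[1]})' if type(t) is tuple else t
    toy_pairs[k] = v
for (k, v) in toy_solutions.items():
    toy_pairs[k] = v                      # overrides the name-based addition
for (k, v) in toy_unique.items():
    toy_pairs[k] = v

for (k, v) in sorted(toy_pairs.items()):
    print(f'{k}\t{v}')
```

For ambiguous readings the choice of `sorted(vs)[0]` is purely mechanical: it picks the candidate that sorts first as a string, not the one that is most likely in context.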

We compare our mapping with the solutions already found for the Old Babylonian corpus (OBB).

In [28]:
obbMappingFile = os.path.expanduser('~/github/Nino-cunei/oldbabylonian/characters/mapping.tsv')
obbMapping = {}
with open(obbMappingFile) as fh:
    for line in fh:
        (name, uni) = line.rstrip().split('\t')
        obbMapping[name] = uni

In [30]:
ok = set()
diff = set()
new = set()

for (name, uni) in obbMapping.items():
    if name in pairs:
        if uni == pairs[name]:
            ok.add(name)
        else:
            diff.add(name)
    else:
        new.add(name)
        
print(f"{len(new)} new")
print(f"{len(diff)} diff")
print(f"{len(ok)} ok")

print("NEW")
for name in sorted(new):
    print(f"\t{name:<20} => {obbMapping[name]}")
    
print("DIFF")
for name in sorted(diff):
    print(f"\t{name:<20} => OBB: {obbMapping[name]}, CANONICAL: {pairs[name]}")

20 new
4 diff
940 ok
NEW
	1(bur'u)             => 𒐴
	1(gesz'u)            => 𒐞
	1(szar2)             => 𒊹
	2(bur'u)             => 𒐵
	2(gesz'u)            => 𒐟
	2(gesz2)             => 𒐖
	2(gisz)              => 𒄑
	2(szar2)             => 𒐣
	3(bur'u)             => 𒐶
	3(gesz'u)            => 𒐠
	3(gesz2)             => 𒐗
	4(bur'u)             => 𒐸
	4(gesz'u)            => 𒐡
	4(gesz2)             => 𒐘
	5(bur'u)             => 𒐹
	5(gesz2)             => 𒐙
	6(gesz2)             => 𒐚
	7(gesz2)             => 𒐛
	8(gesz2)             => 𒐜
	9(gesz2)             => 𒐝
DIFF
	2(disz)              => OBB: 𒁹, CANONICAL: 𒈫
	2(u)                 => OBB: 𒌋, CANONICAL: 𒎙
	3(u)                 => OBB: 𒌋, CANONICAL: 𒌍
	gesz2                => OBB: 𒐕, CANONICAL: 𒁹


We add the new ones to our mapping.

In the cases where the previous OBB mapping was different, the new mapping is the right one.

So:

In [31]:
for (name, uni) in obbMapping.items():
    if name in pairs:
        continue
    pairs[name] = uni

In [32]:
with open(MAPPING_FILE, 'w', encoding='utf8') as mf:
  for (k,v) in sorted(pairs.items()):
    mf.write(f'{k}\t{v}\n')
print(f'{len(pairs)} entries written to {MAPPING_FILE}')

1123 entries written to /Users/dirk/github/Nino-cunei/signs/characters/mapping.tsv
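
A consumer of `mapping.tsv` might load it and convert a sequence of tokens as follows. This is only a sketch: the function names are hypothetical, the file is replaced by an in-memory sample, and showing unknown tokens in angle brackets is an arbitrary choice.

```python
import io


def load_mapping(fh):
    """Read token<TAB>unicode lines into a dict."""
    result = {}
    for line in fh:
        (token, uni) = line.rstrip('\n').split('\t')
        result[token] = uni
    return result


def to_cuneiform(tokens, table):
    """Map each token; keep unknown ones visible as <token>."""
    return ''.join(table.get(t, f'<{t}>') for t in tokens)


# in-memory stand-in for mapping.tsv, with A (U+12000) and AN (U+1202D)
sample = io.StringIO(f'a\t{chr(0x12000)}\nan\t{chr(0x1202D)}\n')
table = load_mapping(sample)
print(to_cuneiform(['an', 'a', 'xyz'], table))
```

With the real file, the handle would come from `open(MAPPING_FILE, encoding='utf8')` instead of `io.StringIO`.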
