<img align="right" src="images/tf.png" width="128"/>
<img align="right" src="images/ninologo.png" width="128"/>
<img align="right" src="images/dans.png" width="128"/>

---

To get started: consult [start](start.ipynb)

---

# Similar lines

We spot the many similarities between lines in the corpus.

There are ca. 100,000 lines in the corpus. Comparing every line with every other line takes about 5 billion comparisons.
That is a costly operation.
[On this laptop it took 96 whole minutes](https://nbviewer.jupyter.org/github/nino-cunei/oldassyrian/blob/master/programs/parallels.ipynb).
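
The figure of 5 billion is the number of unordered pairs among $n$ lines, $n(n-1)/2$. A quick check, assuming the rounded count of 100000 lines from the text rather than the exact corpus size:

```python
# Number of unordered pairs among n lines: n * (n - 1) / 2.
# n_lines is the rounded figure from the text, not the exact corpus count.
n_lines = 100_000
n_pairs = n_lines * (n_lines - 1) // 2
print(f"{n_pairs:,} comparisons")
```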

The good news is that we have stored the outcome in an extra feature.

This feature is packaged in a TF data module that we load below by passing the parameter `mod` to `use()`.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import collections

from tf.app import use


In [3]:
A = use(
    "oldassyrian:clone",
    checkout="clone",
    mod="Nino-cunei/oldassyrian/parallels/tf:clone",
    hoist=globals(),
)
# A = use('oldassyrian', mod='Nino-cunei/oldassyrian/parallels/tf', hoist=globals())

The new feature is **sim** and it is an edge feature.
It annotates pairs of lines $(l, m)$ where $l$ and $m$ have similar content.
The degree of similarity is a percentage (between 90 and 100), and this value
is annotated onto the edges.

Here is an example:

In [5]:
exampleLine = F.otype.s("line")[1000]
sisters = E.sim.b(exampleLine)
print(f"{len(sisters)} similar lines")
print("\n".join(f"{s[0]} with similarity {s[1]}" for s in sisters[0:10]))
A.table(tuple((s[0],) for s in sisters), end=10)

36 similar lines
866817 with similarity 100
874947 with similarity 100
879396 with similarity 100
894252 with similarity 100
904016 with similarity 100
904905 with similarity 100
907962 with similarity 100
910310 with similarity 100
914251 with similarity 100
914820 with similarity 100


n,p,line
1,P360493 reverse:3,ma-al-u2-tim
2,P390646 envelope - reverse:6,[ma]-al-u2-tim
3,P358334 obverse:12,ma-al-u2-tim
4,P358498 reverse:2,ma-al-u2-tim
5,P358744 obverse:13,ma-al-u2-tim
6,P358783 reverse:4,ma-al-u2-tim
7,P358910 obverse:11,ma-al-u2-tim
8,P273595 reverse:8,ma-al-u2-tim
9,P359405 reverse:2,ma-al-u2-tim
10,P359420 envelope - reverse:10,ma-al-u2-tim
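
The cell above treats each result of `E.sim.b()` as a pair of a node and a similarity value. That shape can be mimicked with plain dictionaries. The sketch below is a toy model, not Text-Fabric's implementation: the names `toy_sim`, `edges_from`, `edges_to` and `sim_both` are hypothetical, the data is made up, and it assumes that `.b()` collects edges in both directions.

```python
from collections import defaultdict

# Toy model of a valued edge feature: (from_node, to_node) -> similarity.
# The node numbers and values are made up for illustration.
toy_sim = {
    (1, 2): 100,
    (1, 3): 95,
    (4, 1): 90,
}

edges_from = defaultdict(dict)  # node -> {target: value}
edges_to = defaultdict(dict)  # node -> {source: value}
for (src, dst), value in toy_sim.items():
    edges_from[src][dst] = value
    edges_to[dst][src] = value


def sim_both(node):
    """Return (other_node, value) pairs for edges leaving or entering node."""
    merged = {**edges_to[node], **edges_from[node]}
    return tuple(sorted(merged.items()))


print(sim_both(1))
```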


# All similarities

Let's first find out the range of similarities:

In [6]:
minSim = None
maxSim = None

for ln in F.otype.s("line"):
    sisters = E.sim.f(ln)
    if not sisters:
        continue
    thisMin = min(s[1] for s in sisters)
    thisMax = max(s[1] for s in sisters)
    if minSim is None or thisMin < minSim:
        minSim = thisMin
    if maxSim is None or thisMax > maxSim:
        maxSim = thisMax

print(f"minimum similarity is {minSim:>3}")
print(f"maximum similarity is {maxSim:>3}")

minimum similarity is  90
maximum similarity is 100
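
Beyond the minimum and maximum, one may want to know how often each value occurs. The sketch below tallies similarity values on made-up data. The helper `similarity_histogram` and the mapping `toy_edges` are hypothetical; the mapping only imitates the (line, value) pairs that the loop above reads from `E.sim.f(ln)`.

```python
import collections


def similarity_histogram(edges_by_line):
    """Count how many edges carry each similarity value.

    edges_by_line maps a line node to its (other_line, value) pairs.
    """
    histogram = collections.Counter()
    for pairs in edges_by_line.values():
        # lines without similar lines contribute nothing
        histogram.update(value for (_other, value) in pairs or ())
    return histogram


# Made-up data standing in for the real edges of the corpus.
toy_edges = {
    10: ((11, 100), (12, 90)),
    11: ((12, 95),),
    12: (),
}
hist = similarity_histogram(toy_edges)
for value in sorted(hist):
    print(f"similarity {value:>3}: {hist[value]:>3} edges")
```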


# The bottom lines

We give a few examples of the least similar lines.

**N.B.** When lines are less than 90% similar, they have not made it into the `sim` feature!

We can use a search template to get the 90% lines.

In [7]:
query = """
line
-sim=90> line
"""

In words: find a line connected via a sim-edge with value 90 to another line.
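
What the template asks for can also be spelled out by hand. The sketch below does so on made-up data; it illustrates the meaning of the query, not how TF search evaluates it. The names `pairs_with_similarity` and `toy_edges` are hypothetical.

```python
def pairs_with_similarity(edges_by_line, wanted):
    """Yield (line, other_line) for every edge whose value equals wanted."""
    for line, pairs in edges_by_line.items():
        for other, value in pairs or ():
            if value == wanted:
                yield (line, other)


# Made-up edges: line node -> ((other_line, similarity), ...)
toy_edges = {
    10: ((11, 100), (12, 90)),
    11: ((12, 95), (13, 90)),
}
print(list(pairs_with_similarity(toy_edges, 90)))
```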

In [8]:
results = A.search(query)

  0.35s 3784 results


Not very many indeed. It seems that lines are either very similar or not similar at all.

In [9]:
A.table(results, start=1, end=10)

n,p,line,line.1
1,P361250 obverse:3,i-a-ti2 a-na a-szur-i-mi3-ti2 / qi2-bi-ma / um-ma,a-na a-szur-i-[mi3-ti2 qi2-bi-ma]
2,P361250 obverse:3,i-a-ti2 a-na a-szur-i-mi3-ti2 / qi2-bi-ma / um-ma,qi2-bi-ma um-ma a-szur-i-mi3-ti2
3,P361250 obverse:3,i-a-ti2 a-na a-szur-i-mi3-ti2 / qi2-bi-ma / um-ma,qi2-bi-ma um-ma a-szur-i-mi3-ti2-ma
4,P361250 obverse:3,i-a-ti2 a-na a-szur-i-mi3-ti2 / qi2-bi-ma / um-ma,qi2-bi-ma um-ma a-szur-i-mi3-ti2-ma#
5,P360467 envelope - obverse:4,[_kiszib3_ i]-ri#-szi2-im _dumu_ a-mur-{d}utu,i-ri-szi2-im _dumu_ a-mur-{d}utu
6,P360467 envelope - obverse:5,"[8(disz) ma]-na _ku3-babbar_ s,a-ru-pa2-am","1/3(disz) _ma-na_ 3(disz) _gin2 ku3-babbar_ s,a-ru-pa2-am"
7,P360467 envelope - obverse:5,"[8(disz) ma]-na _ku3-babbar_ s,a-ru-pa2-am","4(u) 4(disz) _ma-na ku3-babbar_ s,a-ru-pa2-am"
8,P360467 envelope - obverse:5,"[8(disz) ma]-na _ku3-babbar_ s,a-ru-pa2-am","1(u) 5(disz) _ma-na ku3-babbar_ s,a-ru-pa2-am"
9,P360467 envelope - obverse:5,"[8(disz) ma]-na _ku3-babbar_ s,a-ru-pa2-am","2/3(disz) _ma-na_ 5(disz) _gin2 ku3-babbar_ s,a-ru-pa2-am"
10,P360467 envelope - obverse:5,"[8(disz) ma]-na _ku3-babbar_ s,a-ru-pa2-am","4(u) 4(disz) _ma-na ku3-babbar_ s,a-ru-pa2-am"


In case the ATF flags and clusters are a bit heavy on the eye, you can switch to a more pleasing rich text layout:

In [10]:
A.table(results, start=1, end=10, fmt="layout-orig-rich")

n,p,line,line.1
1,P361250 obverse:3,i-a-ti₂ a-na a-šur-i-mi₃-ti₂ / qi₂-bi-ma / um-ma,a-na a-šur-i-mi₃-ti₂ qi₂-bi-ma
2,P361250 obverse:3,i-a-ti₂ a-na a-šur-i-mi₃-ti₂ / qi₂-bi-ma / um-ma,qi₂-bi-ma um-ma a-šur-i-mi₃-ti₂
3,P361250 obverse:3,i-a-ti₂ a-na a-šur-i-mi₃-ti₂ / qi₂-bi-ma / um-ma,qi₂-bi-ma um-ma a-šur-i-mi₃-ti₂-ma
4,P361250 obverse:3,i-a-ti₂ a-na a-šur-i-mi₃-ti₂ / qi₂-bi-ma / um-ma,qi₂-bi-ma um-ma a-šur-i-mi₃-ti₂-ma
5,P360467 envelope - obverse:4,kišib₃ i-ri-ši₂-im dumu a-mur-dutu,i-ri-ši₂-im dumu a-mur-dutu
6,P360467 envelope - obverse:5,8⌈diš⌉ ma-na ku₃-babbar ṣa-ru-pa₂-am,1/3⌈diš⌉ ma-na 3⌈diš⌉ gin₂ ku₃-babbar ṣa-ru-pa₂-am
7,P360467 envelope - obverse:5,8⌈diš⌉ ma-na ku₃-babbar ṣa-ru-pa₂-am,4⌈u⌉ 4⌈diš⌉ ma-na ku₃-babbar ṣa-ru-pa₂-am
8,P360467 envelope - obverse:5,8⌈diš⌉ ma-na ku₃-babbar ṣa-ru-pa₂-am,1⌈u⌉ 5⌈diš⌉ ma-na ku₃-babbar ṣa-ru-pa₂-am
9,P360467 envelope - obverse:5,8⌈diš⌉ ma-na ku₃-babbar ṣa-ru-pa₂-am,2/3⌈diš⌉ ma-na 5⌈diš⌉ gin₂ ku₃-babbar ṣa-ru-pa₂-am
10,P360467 envelope - obverse:5,8⌈diš⌉ ma-na ku₃-babbar ṣa-ru-pa₂-am,4⌈u⌉ 4⌈diš⌉ ma-na ku₃-babbar ṣa-ru-pa₂-am


Or even in cuneiform unicode:

In [11]:
A.table(results, start=1, end=10, fmt="layout-orig-unicode")

n,p,line,line.1
1,P361250 obverse:3,𒄿𒀀𒊹 𒀀𒈾 𒀀𒋩𒄿𒈨𒊹 𒁹 𒆠𒁉𒈠 𒁹 𒌝𒈠,𒀀𒈾 𒀀𒋩𒄿𒈨𒊹 𒆠𒁉𒈠
2,P361250 obverse:3,𒄿𒀀𒊹 𒀀𒈾 𒀀𒋩𒄿𒈨𒊹 𒁹 𒆠𒁉𒈠 𒁹 𒌝𒈠,𒆠𒁉𒈠 𒌝𒈠 𒀀𒋩𒄿𒈨𒊹
3,P361250 obverse:3,𒄿𒀀𒊹 𒀀𒈾 𒀀𒋩𒄿𒈨𒊹 𒁹 𒆠𒁉𒈠 𒁹 𒌝𒈠,𒆠𒁉𒈠 𒌝𒈠 𒀀𒋩𒄿𒈨𒊹𒈠
4,P361250 obverse:3,𒄿𒀀𒊹 𒀀𒈾 𒀀𒋩𒄿𒈨𒊹 𒁹 𒆠𒁉𒈠 𒁹 𒌝𒈠,𒆠𒁉𒈠 𒌝𒈠 𒀀𒋩𒄿𒈨𒊹𒈠
5,P360467 envelope - obverse:4,𒁾 𒄿𒊑𒋛𒅎 𒌉 𒀀𒄯𒀭𒌓,𒄿𒊑𒋛𒅎 𒌉 𒀀𒄯𒀭𒌓
6,P360467 envelope - obverse:5,𒐍 𒈠𒈾 𒆬𒌓 𒍝𒊒𒁀𒄠,𒑚 𒈠𒈾 𒐈 𒂅 𒆬𒌓 𒍝𒊒𒁀𒄠
7,P360467 envelope - obverse:5,𒐍 𒈠𒈾 𒆬𒌓 𒍝𒊒𒁀𒄠,𒐏 𒐉 𒈠𒈾 𒆬𒌓 𒍝𒊒𒁀𒄠
8,P360467 envelope - obverse:5,𒐍 𒈠𒈾 𒆬𒌓 𒍝𒊒𒁀𒄠,𒌋 𒐊 𒈠𒈾 𒆬𒌓 𒍝𒊒𒁀𒄠
9,P360467 envelope - obverse:5,𒐍 𒈠𒈾 𒆬𒌓 𒍝𒊒𒁀𒄠,𒑛 𒈠𒈾 𒐊 𒂅 𒆬𒌓 𒍝𒊒𒁀𒄠
10,P360467 envelope - obverse:5,𒐍 𒈠𒈾 𒆬𒌓 𒍝𒊒𒁀𒄠,𒐏 𒐉 𒈠𒈾 𒆬𒌓 𒍝𒊒𒁀𒄠


From now on we ignore the exact level of similarity and only ask whether two lines are "similar",
meaning that they are connected by a `sim` edge and hence are at least 90% similar.

# Cluster the lines

Before we look for interesting groups, let's see if we can group the lines into clusters of similar lines.

In [12]:
CLUSTER_THRESHOLD = 0.5


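def makeClusters():
    """Greedily cluster lines: each line joins the first cluster
    in which more than CLUSTER_THRESHOLD of the members are similar to it."""
    A.indent(reset=True)
    chunkSize = 1000  # report progress every chunkSize lines
    b = 0  # lines seen since the last progress message
    j = 0  # lines seen in total
    clusters = []
    for ln in F.otype.s("line"):
        j += 1
        b += 1
        if b == chunkSize:
            b = 0
            A.info(f"{j:>5} lines and {len(clusters):>5} clusters")
        # all lines that are similar to this line
        lSisters = {x[0] for x in E.sim.b(ln)}
        lAdded = False
        for cl in clusters:
            # enough members of this cluster are similar to the line
            if len(cl & lSisters) > CLUSTER_THRESHOLD * len(cl):
                cl.add(ln)
                lAdded = True
                break
        if not lAdded:
            # no cluster fits: the line starts a cluster of its own
            clusters.append({ln})
    A.info(f"{j} lines and {len(clusters)} clusters")
    return clusters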

In [13]:
clusters = makeClusters()

  0.11s  1000 lines and   957 clusters
  0.40s  2000 lines and  1808 clusters
  0.82s  3000 lines and  2570 clusters
  1.43s  4000 lines and  3489 clusters
  2.14s  5000 lines and  4276 clusters
  3.05s  6000 lines and  5137 clusters
  4.17s  7000 lines and  6065 clusters
  5.46s  8000 lines and  6968 clusters
  6.88s  9000 lines and  7830 clusters
  8.43s 10000 lines and  8666 clusters
    10s 11000 lines and  9459 clusters
    12s 12000 lines and 10314 clusters
    14s 13000 lines and 11162 clusters
    16s 14000 lines and 12040 clusters
    19s 15000 lines and 12832 clusters
    21s 16000 lines and 13731 clusters
    24s 17000 lines and 14554 clusters
    27s 18000 lines and 15392 clusters
    30s 19000 lines and 16281 clusters
    34s 20000 lines and 17143 clusters
    37s 21000 lines and 17849 clusters
    40s 22000 lines and 18589 clusters
    44s 23000 lines and 19322 clusters
    48s 24000 lines and 20176 clusters
    52s 25000 lines and 21040 clusters
    56s 26000 lines and 2
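
The greedy rule inside `makeClusters()` can be tried out on a handful of made-up lines. The sketch below is a standalone restatement without the progress messages; `make_clusters` and `toy_sisters` are hypothetical names, and the similarity relation is invented for illustration.

```python
CLUSTER_THRESHOLD = 0.5


def make_clusters(lines, sisters_of, threshold=CLUSTER_THRESHOLD):
    """Greedy clustering: put each line in the first cluster in which
    more than threshold * (cluster size) members are similar to it."""
    clusters = []
    for ln in lines:
        sisters = sisters_of.get(ln, set())
        for cl in clusters:
            if len(cl & sisters) > threshold * len(cl):
                cl.add(ln)
                break
        else:
            # no existing cluster was similar enough: start a new one
            clusters.append({ln})
    return clusters


# Made-up similarity relation: 1, 2, 3 resemble each other, so do 4 and 5.
toy_sisters = {
    1: {2, 3},
    2: {1, 3},
    3: {1, 2},
    4: {5},
    5: {4},
    6: set(),
}
print(make_clusters(range(1, 7), toy_sisters))
```

Because every line joins the first cluster that fits, the outcome can depend on the order in which the lines are visited.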

What is the distribution of the clusters, in terms of how many similar lines they contain?
We count them.

In [14]:
clusterSizes = collections.Counter()

for cl in clusters:
    clusterSizes[len(cl)] += 1

for (size, amount) in sorted(
    clusterSizes.items(),
    key=lambda x: (-x[0], x[1]),
):
    print(f"clusters of size {size:>4}: {amount:>5}")

clusters of size  455:     1
clusters of size  352:     1
clusters of size  267:     1
clusters of size  199:     1
clusters of size  146:     1
clusters of size  137:     1
clusters of size  131:     1
clusters of size  129:     1
clusters of size  123:     1
clusters of size  115:     1
clusters of size  112:     1
clusters of size  109:     1
clusters of size  106:     1
clusters of size  100:     1
clusters of size   99:     2
clusters of size   97:     1
clusters of size   94:     2
clusters of size   91:     1
clusters of size   89:     1
clusters of size   86:     1
clusters of size   85:     1
clusters of size   75:     1
clusters of size   73:     1
clusters of size   72:     1
clusters of size   71:     2
clusters of size   70:     1
clusters of size   69:     1
clusters of size   68:     2
clusters of size   65:     1
clusters of size   63:     1
clusters of size   62:     1
clusters of size   61:     2
clusters of size   60:     2
clusters of size   59:     1
clusters of si

# Interesting groups

Let's investigate some interesting groups that lie in a few sweet spots.

* the biggest clusters: more than 31 members
* the medium clusters: between 12 and 30 members
* the small clusters: between 2 and 11 members
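
The three groups above can be counted with a small helper. This is a sketch on made-up clusters; `GROUPS`, `group_of` and `count_groups` are hypothetical names. Read literally, sizes 1 and 31 fall outside all three ranges, so the sketch leaves such clusters uncounted.

```python
import collections

# Size ranges as listed above; bounds are inclusive, None means no upper bound.
GROUPS = {
    "biggest": (32, None),  # more than 31 members
    "medium": (12, 30),
    "small": (2, 11),
}


def group_of(size):
    """Return the name of the group a cluster size falls in, or None."""
    for name, (low, high) in GROUPS.items():
        if size >= low and (high is None or size <= high):
            return name
    return None


def count_groups(clusters):
    """Tally clusters per group; clusters outside every range are skipped."""
    tally = collections.Counter()
    for cl in clusters:
        name = group_of(len(cl))
        if name is not None:
            tally[name] += 1
    return tally


# Made-up clusters of sizes 40, 15, 3 and 1.
toy_clusters = [set(range(40)), set(range(15)), {1, 2, 3}, {7}]
print(count_groups(toy_clusters))
```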

---

All chapters:

* **[start](start.ipynb)** get started with the corpus
* **[display](display.ipynb)** become an expert in creating pretty displays of your text structures
* **[search](search.ipynb)** turbo charge your hand-coding with search templates
* **[exportExcel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results
* **[share](share.ipynb)** draw in other people's data and let them use yours
* **similarLines** spot the similarities between lines

---

See the [cookbook](cookbook) for recipes for small, concrete tasks.

CC-BY Dirk Roorda