<img align="right" src="images/tf.png" width="128"/>
<img align="right" src="images/logo.png" width="128"/>
<img align="right" src="images/dans.png" width="128"/>

---

To get started: consult [start](start.ipynb)

---

# Similar lines

We spot the many similarities between lines in the corpus.

There are about 3000 lines in this corpus (3082, to be exact). Comparing every line with every other line takes about 4.7 million comparisons.
That is a costly operation.
[On this laptop it took 6 whole minutes](https://nbviewer.jupyter.org/github/nino-cunei/oldbabylonian/blob/master/programs/parallels.ipynb).
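
The cost comes from counting unordered pairs: $n$ lines give $n(n-1)/2$ comparisons, so the work grows quadratically with the size of the corpus. A minimal sketch of that arithmetic follows. The name `pairCount` is introduced here for illustration only; 3082 is the line count reported by the clustering run later in this notebook, and 25000 is shown for scale.

```python
def pairCount(n):
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


print(pairCount(3082))   # → 4747821
print(pairCount(25000))  # → 312487500
```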

The good news is that we have stored the outcome in an extra feature.

This feature is packaged in a TF data module, which we load below by passing the parameter `mod` to the `use()` statement.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import collections

from tf.app import use

In [3]:
A = use("Nino-cunei/ninmed:clone", mod="Nino-cunei/ninmed/parallels/tf:clone", checkout="clone", hoist=globals())
# A = use("Nino-cunei/ninmed", mod="Nino-cunei/ninmed/parallels/tf", hoist=globals())

This is Text-Fabric 9.2.5
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

49 features found and 0 ignored
   |     0.01s T sim                  from ~/github/Nino-cunei/ninmed/parallels/tf/0.3


The new feature is **sim** and it is an edge feature.
It annotates pairs of lines $(l, m)$ where $l$ and $m$ have similar content.
The degree of similarity is a percentage (between 80 and 100), and this value
is annotated onto the edges.
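
To make the idea of an edge feature with values concrete, here is a toy stand-in written in plain Python. It is not how Text-Fabric stores its data. The class `ToyEdgeFeature` and the node numbers are hypothetical; the method names `f`, `t` and `b` are chosen to mirror the "from", "to" and "both directions" lookups that TF edge features offer, as far as we understand them.

```python
class ToyEdgeFeature:
    """A tiny model of an edge feature whose edges carry a value."""

    def __init__(self, edges):
        # edges: iterable of (source, target, value) triples
        self._from = {}
        self._to = {}
        for (source, target, value) in edges:
            self._from.setdefault(source, {})[target] = value
            self._to.setdefault(target, {})[source] = value

    def f(self, node):
        """Edges leaving node, as (otherNode, value) pairs."""
        return tuple(sorted(self._from.get(node, {}).items()))

    def t(self, node):
        """Edges arriving at node, as (otherNode, value) pairs."""
        return tuple(sorted(self._to.get(node, {}).items()))

    def b(self, node):
        """Edges in either direction, as (otherNode, value) pairs."""
        merged = {**self._to.get(node, {}), **self._from.get(node, {})}
        return tuple(sorted(merged.items()))


# hypothetical node numbers: line 1 points to line 2, line 3 points to line 1
sim = ToyEdgeFeature([(1, 2, 100), (3, 1, 92)])

print(sim.f(1))  # → ((2, 100),)
print(sim.t(1))  # → ((3, 92),)
print(sim.b(1))  # → ((2, 100), (3, 92))
```

Each result is a tuple of pairs, which is why the cells in this notebook read the neighbouring line as `s[0]` and the similarity as `s[1]`.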

Here is an example:

In [4]:
exampleLine = F.otype.s("line")[1]
sisters = E.sim.b(exampleLine)
print(f"{len(sisters)} similar lines")
print("\n".join(f"{s[0]} with similarity {s[1]}" for s in sisters[0:10]))
A.table(tuple((s[0],) for s in sisters), end=10)

3 similar lines
59324 with similarity 100
59787 with similarity 100
61647 with similarity 100


n,p,line
1,P285136 obverse:11',2 KA.INIM.MA [x x x x x x]
2,P393782 reverse:1:15,KA.INIM#.[MA ...]
3,P393735 obverse:2:28,[KA.INIM.MA?] [...]


# All similarities

Let's first find out the range of similarities:

In [5]:
minSim = None
maxSim = None

for ln in F.otype.s("line"):
    sisters = E.sim.f(ln)
    if not sisters:
        continue
    thisMin = min(s[1] for s in sisters)
    thisMax = max(s[1] for s in sisters)
    if minSim is None or thisMin < minSim:
        minSim = thisMin
    if maxSim is None or thisMax > maxSim:
        maxSim = thisMax

print(f"minimum similarity is {minSim:>3}")
print(f"maximum similarity is {maxSim:>3}")

minimum similarity is  80
maximum similarity is 100
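
The minimum and maximum do not show how the values are spread between them. A small sketch of a histogram over the edge values could look like this. The function `similarityHistogram` and the toy data are hypothetical; the function takes the lookup as an argument so that it runs without a loaded corpus.

```python
import collections


def similarityHistogram(lines, edgesFrom):
    """Count how often each similarity value occurs on the edges."""
    counts = collections.Counter()
    for ln in lines:
        # edgesFrom(ln) yields (otherLine, value) pairs; tolerate None
        for (_other, value) in edgesFrom(ln) or ():
            counts[value] += 1
    return counts


# toy data: line number -> outgoing (otherLine, value) pairs
toyEdges = {1: ((2, 100), (3, 80)), 2: ((3, 100),), 3: ()}

hist = similarityHistogram(toyEdges, toyEdges.get)
print(sorted(hist.items()))  # → [(80, 1), (100, 2)]
```

In this notebook the same function could presumably be called as `similarityHistogram(F.otype.s("line"), E.sim.f)`.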


# The bottom lines

We give a few examples of the least similar lines.

**N.B.** When lines are less than 80% similar, they have not made it into the `sim` feature!

We can use a search template to get the 80% lines.

In [6]:
query = """
line
-sim=80> line
"""

In words: find a line connected via a sim-edge with value 80 to another line.
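
For comparison, the same question can be hand-coded as a loop over lines and their outgoing edges. This is only a sketch of what the template expresses, not of how the search engine works. The function `pairsWithValue` and the toy data are hypothetical.

```python
def pairsWithValue(lines, edgesFrom, value):
    """All (line, otherLine) pairs joined by an edge with exactly this value."""
    return [
        (ln, other)
        for ln in lines
        for (other, sim) in (edgesFrom(ln) or ())
        if sim == value
    ]


# toy data: line number -> outgoing (otherLine, value) pairs
toyEdges = {1: ((2, 100), (3, 80)), 2: ((3, 100),), 3: ()}

print(pairsWithValue(toyEdges, toyEdges.get, 80))  # → [(1, 3)]
```

In this notebook the call would presumably be `pairsWithValue(F.otype.s("line"), E.sim.f, 80)`. The search template says the same thing in three lines, which is the point of using templates.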

In [7]:
results = A.search(query)

  0.01s 25 results


Not very many indeed. It seems that lines are either very similar, or not similar at all.

In [8]:
A.table(results, start=1, end=10)

n,p,line,line.1
1,P285136 obverse:6',EN₂ ma-mit GIM šar-ra-<qi₂> ina KA₂ pil-ši un gi ha ba# [x x x x x x x x x x x x],[EN₂ ma]-mit GIM šar-ra-qi₂ ina KA₂ pil-ši# [un? ...]
2,P399223 obverse:1:5,[x x x x x x x x x x x TI-šu₂] I₃.GIŠ {šim}GUR₂.GUR₂ ŠEŠ₂-su,[x x x x x x x x x x ana TI]-šu₂ I₃.GIŠ {šim}GUR₂.GUR₂ I₃.GIŠ {šim}LI ŠEŠ₂-[su]
3,P394104 obverse:1:28,[ina TUG₂] te-sip 7 UD-me ina KUN₄ te-te-[mer E₁₁-ma ...],7 UD-me ina [KUN₄ te-te-mer E₁₁-ma ...] x [...]
4,P394104 obverse:2:17,[...] x-ta ana ŠA₃ IGI.MIN-šu₂ tu-na-tak,[... SUD₂?] ana ŠA₃ IGI.MIN-šu₂ [tu-na-tak₂]
5,P394104 obverse:2:55,x x [... SUD₂] IGI.MIN-šu₂ MAR,[... IGI.MIN-šu₂] MAR?
6,P394104 reverse:4:47',[E₂.GAL {m}aš-šur-DU₃.A LUGAL ŠU₂ 20 KUR AN.ŠAR₂{ki} ša {d}AG u {d}taš-me-tu₄ GEŠTU.MIN ra-pa-aš₂-tu₄ iš-ru-ku-šu₂],E₂.GAL {diš}AN.ŠAR₂-DU₃.A 20 ŠU₂# 20# KUR AN.ŠAR₂{ki} ša {d}AG# [u] {d#}taš-me-tu₄ GEŠTU.MIN ra-pa-aš₂-tu₄ iš-ru-ku-uš
7,P393782 reverse:1:12,ŠA₃-[bi AN u] KI# lip-pa-šir [ŠA₃-bi x x x x x x],ŠA₃-[bi AN u KI lip-pa-šir₃] ki-ma ŠA₃-bi AN u KI [...] x x x
8,P404547 obverse:5',[DUH.ŠE].GIŠ.I₃# ša₂-bu-lu-te GAZ# [SIM x x x x x x x x x],DUH.ŠE.GIŠ.I₃ ša₂-bu-[lu-te x x x x x x x x x x x (x)]
9,P365742 obverse:1:23,1 GIN₂ U₅ ARGAB{mušen} 1/2 [GIN₂ ...],1 GIN₂ U₅-ARGAB{mušen} [SA₉ ...]
10,P365742 obverse:1:41,[... SILA₁₁-aš] SAR-ab LAL-ma KI [MIN],[... ina] GA# [SILA₁₁-aš SAR]-ab LAL-ma KI MIN


In case the ATF flags and clusters are a bit heavy on the eye, you can switch to a more pleasing rich text layout:

In [9]:
A.table(results, start=1, end=10, fmt="layout-orig-plain")

n,p,line,line.1
1,P285136 obverse:6',EN₂ ma-mit GIM šar-ra-qi ina KA₂ pil-ši un gi ha ba x x x x x x x x x x x x,EN₂ ma-mit GIM šar-ra-qi ina KA₂ pil-ši un ...
2,P399223 obverse:1:5,x x x x x x x x x x x TI-šu I₃.GIŠ šimGUR₂.GUR₂ ŠEŠ₂-su,x x x x x x x x x x ana TI-šu I₃.GIŠ šimGUR₂.GUR₂ I₃.GIŠ šimLI ŠEŠ₂-su
3,P394104 obverse:1:28,ina TUG₂ te-sip 7 UD-me ina KUN₄ te-te-mer E₁₁-ma ...,7 UD-me ina KUN₄ te-te-mer E₁₁-ma ... x ...
4,P394104 obverse:2:17,... x-ta ana ŠA₃ IGI.MIN-šu tu-na-tak,... SUD₂ ana ŠA₃ IGI.MIN-šu tu-na-tak
5,P394104 obverse:2:55,x x ... SUD₂ IGI.MIN-šu MAR,... IGI.MIN-šu MAR
6,P394104 reverse:4:47',E₂.GAL maš-šur-DU₃.A LUGAL ŠU₂ 20 KUR AN.ŠAR₂ki ša dAG u dtaš-me-tu GEŠTU.MIN ra-pa-aš-tu iš-ru-ku-šu,E₂.GAL dišAN.ŠAR₂-DU₃.A 20 ŠU₂ 20 KUR AN.ŠAR₂ki ša dAG u dtaš-me-tu GEŠTU.MIN ra-pa-aš-tu iš-ru-ku-uš
7,P393782 reverse:1:12,ŠA₃-bi AN u KI lip-pa-šir ŠA₃-bi x x x x x x,ŠA₃-bi AN u KI lip-pa-šir ki-ma ŠA₃-bi AN u KI ... x x x
8,P404547 obverse:5',DUH.ŠE.GIŠ.I₃ ša-bu-lu-te GAZ SIM x x x x x x x x x,DUH.ŠE.GIŠ.I₃ ša-bu-lu-te x x x x x x x x x x x x
9,P365742 obverse:1:23,1 GIN₂ U₅ ARGABmušen 12 GIN₂ ...,1 GIN₂ U₅-ARGABmušen SA₉ ...
10,P365742 obverse:1:41,... SILA₁₁-aš SAR-ab LAL-ma KI MIN,... ina GA SILA₁₁-aš SAR-ab LAL-ma KI MIN


From now on we forget about the level of similarity, and focus on whether two lines are just "similar", meaning that they are
connected by a `sim` edge, whatever its value.

# Cluster the lines

Before we look for interesting cases, let's see if we can group the lines into clusters of similar lines.

In [10]:
CLUSTER_THRESHOLD = 0.5


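def makeClusters():
    """Greedily group lines: a line joins the first cluster in which
    more than CLUSTER_THRESHOLD of the members are similar to it."""
    A.indent(reset=True)
    chunkSize = 1000  # report progress every chunkSize lines
    b = 0  # lines seen since the last progress message
    j = 0  # lines seen in total
    clusters = []
    for ln in F.otype.s("line"):
        j += 1
        b += 1
        if b == chunkSize:
            b = 0
            A.info(f"{j:>5} lines and {len(clusters):>5} clusters")
        # all lines connected to ln by a sim edge, ignoring the edge value
        lSisters = {x[0] for x in E.sim.b(ln)}
        lAdded = False
        for cl in clusters:
            if len(cl & lSisters) > CLUSTER_THRESHOLD * len(cl):
                cl.add(ln)
                lAdded = True
                break
        if not lAdded:
            # no existing cluster is similar enough: start a new one
            clusters.append({ln})
    A.info(f"{j} lines and {len(clusters)} clusters")
    return clusters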

In [11]:
clusters = makeClusters()

  0.09s  1000 lines and   944 clusters
  0.31s  2000 lines and  1787 clusters
  0.65s  3000 lines and  2702 clusters
  0.69s 3082 lines and 2776 clusters
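
The clustering rule is easier to follow on a handful of items. Below is a self-contained restatement of the same greedy rule, run on toy data. The function `greedyClusters` and the dictionary `similar` are hypothetical names introduced for this illustration. An item joins the first cluster in which more than the threshold fraction of members are its sisters; otherwise it starts a cluster of its own.

```python
def greedyClusters(items, sistersOf, threshold=0.5):
    """Assign each item to the first cluster that is similar enough to it."""
    clusters = []
    for item in items:
        sisters = set(sistersOf(item))
        for cl in clusters:
            # join if more than `threshold` of the cluster consists of sisters
            if len(cl & sisters) > threshold * len(cl):
                cl.add(item)
                break
        else:
            # the loop did not break: no cluster fits, start a new one
            clusters.append({item})
    return clusters


# toy similarity relation: item -> set of similar items
similar = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: set(), 5: {6}, 6: {5}}

toyClusters = greedyClusters(range(1, 7), similar.__getitem__)
print(toyClusters)  # → [{1, 2, 3}, {4}, {5, 6}]
```

Because the rule is greedy, the outcome can depend on the order in which the items are visited: an item is never moved once it has been placed.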


What is the distribution of the clusters, in terms of how many similar lines they contain?
We count them.

In [12]:
clusterSizes = collections.Counter()

for cl in clusters:
    clusterSizes[len(cl)] += 1

for (size, amount) in sorted(
    clusterSizes.items(),
    key=lambda x: (-x[0], x[1]),
):
    print(f"clusters of size {size:>4}: {amount:>5}")

clusters of size   11:     1
clusters of size   10:     2
clusters of size    9:     1
clusters of size    8:     1
clusters of size    7:     1
clusters of size    6:     8
clusters of size    5:     3
clusters of size    4:     7
clusters of size    3:    16
clusters of size    2:   152
clusters of size    1:  2584


# Interesting groups

Let's investigate some interesting groups that lie in a few sweet spots.

* the biggest clusters: more than 6 members
* the medium clusters: between 3 and 6 members
* the small clusters: between 1 and 2 members
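
One way to select these groups is to sort the clusters into buckets by size. The sketch below uses the size boundaries from the list above. The function `bucketClusters`, the bucket names and the toy clusters are hypothetical.

```python
def bucketClusters(clusters):
    """Split clusters into biggest (> 6), medium (3 to 6) and small (1 to 2)."""
    buckets = {"biggest": [], "medium": [], "small": []}
    for cl in clusters:
        size = len(cl)
        if size > 6:
            buckets["biggest"].append(cl)
        elif size >= 3:
            buckets["medium"].append(cl)
        else:
            buckets["small"].append(cl)
    return buckets


# toy clusters of sizes 7, 4, 2 and 1
toy = [set(range(7)), {10, 11, 12, 13}, {20, 21}, {30}]

buckets = bucketClusters(toy)
print({name: len(cls) for (name, cls) in buckets.items()})
# → {'biggest': 1, 'medium': 1, 'small': 2}
```

In this notebook the input would be the `clusters` list produced by `makeClusters()`.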

---

All chapters:

* **[start](start.ipynb)** get started with the corpus
* **[display](display.ipynb)** become an expert in creating pretty displays of your text structures
* **[search](search.ipynb)** turbo charge your hand-coding with search templates
* **[exportExcel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results
* **similarLines** spot the similarities between lines

---

See the [cookbook](cookbook) for recipes for small, concrete tasks.

CC-BY Dirk Roorda