For the main tutorial go to [start](../start.ipynb)

---

# Word searches

Searches for particular morphology inside words can become complicated. Here are some ways to achieve results.

In particular, we show how to use *regular expressions* inside search templates.

Regular expressions are search patterns as used in tools like *grep*, *awk*, *vim*, and in many programming languages,
including Python.

TF search patterns can tap the full power of Python
[regular expressions](https://docs.python.org/3/library/re.html#module-re).

In [1]:
from tf.app import use

In [2]:
A = use("oldbabylonian:clone", checkout="clone", hoist=globals())
# A = use('oldbabylonian', hoist=globals())

# i-na + ...-?im

We look for word pairs, of which the first is `i-na` and the second ends in a sign whose reading ends in `im`.

In [3]:
query = """
line
  word
    =: sign reading=i
    <: sign reading=na
    :=
  <: word
    := sign reading~im$
"""

Explanation of the expression in the last line `reading~im$`.

We do not say:
> `reading` equals `im`

but

> `reading` **matches** `im$`

Matching means that the reading is matched against a pattern, also known as a *regular expression*.

This regular expression means: the reading must end in the substring `im`. The `$` matches the end of the string.

You can use any legal regular expression that Python recognizes.

For a reference, consult the
[Python documentation](https://docs.python.org/3/library/re.html#module-re)
of regular expressions.
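
The effect of `im$` can be tried out in plain Python, outside any search template. This is a minimal sketch that assumes the pattern is applied with *search* semantics (a match anywhere in the value), which is what the explanation above implies; the list of readings is made up for illustration.

```python
import re

# Made-up sample of sign readings, for illustration only.
readings = ["im", "nim", "tim", "lim", "ma", "imma"]

# `im$`: the substring "im" must sit at the very end of the value.
ends_in_im = re.compile(r"im$")

matching = [r for r in readings if ends_in_im.search(r)]
print(matching)  # ['im', 'nim', 'tim', 'lim']
```

Note that `imma` contains `im` but does not end in it, so it is rejected.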

In [4]:
results = A.search(query)

  0.72s 307 results


In [5]:
A.table(results, end=10)

n,p,line,word,sign,sign.1,word.1,sign.2
1,P509375 reverse:9,i-na la-hi-a-nim,i-na,i-,na,la-hi-a-nim,nim
2,P510527 obverse:6,{disz}ip-qu2-i3-li2-szu _di-ku5_ i-na pu-uh2-ri-im,i-na,i-,na,pu-uh2-ri-im,im
3,P510527 obverse:15,i-na pu-uh2-ri-im i-na da-ba-bi-im,i-na,i-,na,pu-uh2-ri-im,im
4,P510527 obverse:15,i-na pu-uh2-ri-im i-na da-ba-bi-im,i-na,i-,na,da-ba-bi-im,im
5,P510538 obverse:10,i-na tam-li-tim,i-na,i-,na,tam-li-tim,tim
6,P510562 obverse:7,i-na# pa-ni-tim a-na a-<ma?>-az{ki} ta-al-li-ik-ma#,i-na#,i-,na#,pa-ni-tim,tim
7,P510567 reverse:7,[i-na] e-bu-ri-im,[i-na],[i-,na],e-bu-ri-im,im
8,P510571 reverse:13,i-na an-ni-tim at-hu-<ut>-ka#,i-na,i-,na,an-ni-tim,tim
9,P510574 obverse:8,tup-pi2 i-na a-ma-ri-im,i-na,i-,na,a-ma-ri-im,im
10,P510575 obverse:11,[i]-na# qa-tim ta-ki-il-tim,[i]-na#,[i]-,na#,qa-tim,tim


Let's vary this theme a bit. Suppose we want to tighten the criterion that the last sign of the second word
ends in `im`: we now want its reading to be exactly `tim` or `nim`. We can express that as follows:

In [6]:
query = """
line
  word
    =: sign reading=i
    <: sign reading=na
    :=
  <: word
    := sign reading~^[nt]im$
"""

Explanation: `^` matches the start of the reading. So the pattern `[nt]im` must cover the whole reading.
`[nt]` means: either `n` or `t`. In general, `[` *characters* `]` is a choice between the *characters*.
You can also say things like `[A-Z0-9]`, which matches any upper case Latin letter or a digit.
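
A small sketch in plain Python shows both kinds of character class at work. The sample values are invented, and the second pattern `^[A-Z0-9]+$` is only an illustration of a range class, not something used in the queries here.

```python
import re

# Invented sample values, for illustration only.
readings = ["im", "nim", "tim", "lim", "anim", "A1", "a1"]

# `^[nt]im$`: exactly one of n/t, then "im", and nothing else.
n_or_t = [r for r in readings if re.search(r"^[nt]im$", r)]
print(n_or_t)  # ['nim', 'tim']

# A range class: only upper-case Latin letters or digits, from start to end.
upper_or_digit = [r for r in readings if re.search(r"^[A-Z0-9]+$", r)]
print(upper_or_digit)  # ['A1']
```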

In [7]:
results = A.search(query)

  0.75s 120 results


In [8]:
A.table(results, end=10)

n,p,line,word,sign,sign.1,word.1,sign.2
1,P509375 reverse:9,i-na la-hi-a-nim,i-na,i-,na,la-hi-a-nim,nim
2,P510538 obverse:10,i-na tam-li-tim,i-na,i-,na,tam-li-tim,tim
3,P510562 obverse:7,i-na# pa-ni-tim a-na a-<ma?>-az{ki} ta-al-li-ik-ma#,i-na#,i-,na#,pa-ni-tim,tim
4,P510571 reverse:13,i-na an-ni-tim at-hu-<ut>-ka#,i-na,i-,na,an-ni-tim,tim
5,P510575 obverse:11,[i]-na# qa-tim ta-ki-il-tim,[i]-na#,[i]-,na#,qa-tim,tim
6,P510593 obverse:8,i-na pa-ni-tim i-nu-ma a-na tam-li-tim a-na e2-duru5-bi2-sa3{ki#},i-na,i-,na,pa-ni-tim,tim
7,P510643 reverse:6,i-na an-ni-tim ta-ka-li ta-ma-ar,i-na,i-,na,an-ni-tim,tim
8,P510659 reverse:10',i-na an-ni-tim at#-[hu-ut-ka],i-na,i-,na,an-ni-tim,tim
9,P510698 obverse:11,szum-ma i-na ki-tim a-bi,i-na,i-,na,ki-tim,tim
10,P510698 obverse:13,"i-na an-ni-tim et,-ra-an-ni-i-ma",i-na,i-,na,an-ni-tim,tim


What if we wanted a reading that is `tim`, `nim` or `im`? We can say that as follows:

In [9]:
query = """
line
  word
    =: sign reading=i
    <: sign reading=na
    :=
  <: word
    := sign reading~^[nt]?im$
"""

Explanation: the `?` makes the preceding thing *optional*. The preceding thing here is `[nt]`.
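
The scope of `?` is easy to get wrong, so here is a hedged sketch in plain Python with invented readings. It compares the pattern from the query with a variant, `^nt?im$`, in which the `?` applies to the `t` alone; that variant is not used in this tutorial and serves only as a contrast.

```python
import re

# Invented sample values, for illustration only.
readings = ["im", "nim", "tim", "ntim", "lim"]

# `?` applies to the whole class `[nt]`: zero or one of n/t before "im".
optional_class = [r for r in readings if re.search(r"^[nt]?im$", r)]
print(optional_class)  # ['im', 'nim', 'tim']

# Without the brackets, `?` applies to `t` only: "n", optional "t", "im".
optional_t_only = [r for r in readings if re.search(r"^nt?im$", r)]
print(optional_t_only)  # ['nim', 'ntim']
```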

In [10]:
results = A.search(query)

  0.74s 301 results


In [11]:
A.table(results, end=10)

n,p,line,word,sign,sign.1,word.1,sign.2
1,P509375 reverse:9,i-na la-hi-a-nim,i-na,i-,na,la-hi-a-nim,nim
2,P510527 obverse:6,{disz}ip-qu2-i3-li2-szu _di-ku5_ i-na pu-uh2-ri-im,i-na,i-,na,pu-uh2-ri-im,im
3,P510527 obverse:15,i-na pu-uh2-ri-im i-na da-ba-bi-im,i-na,i-,na,pu-uh2-ri-im,im
4,P510527 obverse:15,i-na pu-uh2-ri-im i-na da-ba-bi-im,i-na,i-,na,da-ba-bi-im,im
5,P510538 obverse:10,i-na tam-li-tim,i-na,i-,na,tam-li-tim,tim
6,P510562 obverse:7,i-na# pa-ni-tim a-na a-<ma?>-az{ki} ta-al-li-ik-ma#,i-na#,i-,na#,pa-ni-tim,tim
7,P510567 reverse:7,[i-na] e-bu-ri-im,[i-na],[i-,na],e-bu-ri-im,im
8,P510571 reverse:13,i-na an-ni-tim at-hu-<ut>-ka#,i-na,i-,na,an-ni-tim,tim
9,P510574 obverse:8,tup-pi2 i-na a-ma-ri-im,i-na,i-,na,a-ma-ri-im,im
10,P510575 obverse:11,[i]-na# qa-tim ta-ki-il-tim,[i]-na#,[i]-,na#,qa-tim,tim


If you have a few discrete options, you can also list the options and separate them with `|`.

Let's obtain the same results with a different expression:

In [12]:
query = """
line
  word
    =: sign reading=i
    <: sign reading=na
    :=
  <: word
    := sign reading~^(tim|nim|im)$
"""

Caution: mind the parentheses. We do not want

> `^tim` or `nim` or `im$`

but

> `^`, then `tim` or `nim` or `im`, then `$`
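
The difference between the grouped and the ungrouped form can be checked in plain Python. This is a sketch on invented values, again assuming search semantics: without parentheses, `^` binds only to `tim` and `$` only to `im`, while `nim` may occur anywhere.

```python
import re

# Invented sample values, for illustration only.
readings = ["im", "nim", "tim", "lim", "timma", "anim", "imma"]

grouped = re.compile(r"^(tim|nim|im)$")   # anchors apply to every option
ungrouped = re.compile(r"^tim|nim|im$")   # anchors apply to one option each

grouped_hits = [r for r in readings if grouped.search(r)]
ungrouped_hits = [r for r in readings if ungrouped.search(r)]

print(grouped_hits)    # ['im', 'nim', 'tim']
print(ungrouped_hits)  # ['im', 'nim', 'tim', 'lim', 'timma', 'anim']
```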

In [13]:
results = A.search(query)

  0.75s 301 results


In [14]:
A.table(results, end=10)

n,p,line,word,sign,sign.1,word.1,sign.2
1,P509375 reverse:9,i-na la-hi-a-nim,i-na,i-,na,la-hi-a-nim,nim
2,P510527 obverse:6,{disz}ip-qu2-i3-li2-szu _di-ku5_ i-na pu-uh2-ri-im,i-na,i-,na,pu-uh2-ri-im,im
3,P510527 obverse:15,i-na pu-uh2-ri-im i-na da-ba-bi-im,i-na,i-,na,pu-uh2-ri-im,im
4,P510527 obverse:15,i-na pu-uh2-ri-im i-na da-ba-bi-im,i-na,i-,na,da-ba-bi-im,im
5,P510538 obverse:10,i-na tam-li-tim,i-na,i-,na,tam-li-tim,tim
6,P510562 obverse:7,i-na# pa-ni-tim a-na a-<ma?>-az{ki} ta-al-li-ik-ma#,i-na#,i-,na#,pa-ni-tim,tim
7,P510567 reverse:7,[i-na] e-bu-ri-im,[i-na],[i-,na],e-bu-ri-im,im
8,P510571 reverse:13,i-na an-ni-tim at-hu-<ut>-ka#,i-na,i-,na,an-ni-tim,tim
9,P510574 obverse:8,tup-pi2 i-na a-ma-ri-im,i-na,i-,na,a-ma-ri-im,im
10,P510575 obverse:11,[i]-na# qa-tim ta-ki-il-tim,[i]-na#,[i]-,na#,qa-tim,tim


We have 6 results fewer than with our original query (301 versus 307).

Can we find a template that searches exactly for the missing ones?

In [15]:
query = """
line
  word
    =: sign reading=i
    <: sign reading=na
    :=
  <: word
    := sign reading~^[^nt]im$
"""

Explanation: the `^` inside the square brackets negates the characters listed.
So here we say: we want any single character **but** an `n` or a `t`.
Note that a character is still required, so a bare `im` will not match.

The first query gave 307 results and the `nim`, `tim`, `im` query gave 301, so we expect to find the 6 remaining cases here, provided each of them has exactly one character before `im`.
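
A quick check of the negated class in plain Python, on invented readings. It also shows a limitation worth keeping in mind: the pattern requires exactly one character before `im`, so a reading such as `szim` is not covered.

```python
import re

# Invented sample values, for illustration only.
readings = ["im", "nim", "tim", "lim", "dim", "szim"]

# `[^nt]`: exactly one character that is neither n nor t.
other_letter = [r for r in readings if re.search(r"^[^nt]im$", r)]
print(other_letter)  # ['lim', 'dim']
```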

In [16]:
results = A.search(query)

  0.71s 6 results


In [17]:
A.table(results, end=10)

n,p,line,word,sign,sign.1,word.1,sign.2
1,P510596 obverse:11,ki#-ma ti-du-u2 i-na a-lim ma-ah-ri-ka,i-na,i-,na,a-lim,lim
2,P510608 obverse:10,u2-lu i-na a-lim e-ma i-ba-asz-szu#-u2,i-na,i-,na,a-lim,lim
3,P510784 reverse:3,ki-ma i-na a-lim te-<esz>-te-ne2-em-mu,i-na,i-,na,a-lim,lim
4,P510837 obverse:8,{disz}{d}na-bi-um-ma-lik i-na# _a-sza3_-lim,i-na#,i-,na#,_a-sza3_-lim,lim
5,P313311 reverse:10,um-mi i-na a-lim is-su2-ha,i-na,i-,na,a-lim,lim
6,P275147 obverse:6,i-[na e-mu-ut]-ba-lim ka-li-a,i-[na,i-,[na,e-mu-ut]-ba-lim,lim


There is an alternative way of matching words: not sign by sign, but through the feature `sym` on words.

In [18]:
query = """
line
  word sym=i-na
  <: word sym~im$
"""

In [19]:
results = A.search(query)

  0.19s 306 results


In [20]:
A.table(results, end=10)

n,p,line,word,word.1
1,P509375 reverse:9,i-na la-hi-a-nim,i-na,la-hi-a-nim
2,P510527 obverse:6,{disz}ip-qu2-i3-li2-szu _di-ku5_ i-na pu-uh2-ri-im,i-na,pu-uh2-ri-im
3,P510527 obverse:15,i-na pu-uh2-ri-im i-na da-ba-bi-im,i-na,pu-uh2-ri-im
4,P510527 obverse:15,i-na pu-uh2-ri-im i-na da-ba-bi-im,i-na,da-ba-bi-im
5,P510538 obverse:10,i-na tam-li-tim,i-na,tam-li-tim
6,P510562 obverse:7,i-na# pa-ni-tim a-na a-<ma?>-az{ki} ta-al-li-ik-ma#,i-na#,pa-ni-tim
7,P510567 reverse:7,[i-na] e-bu-ri-im,[i-na],e-bu-ri-im
8,P510571 reverse:13,i-na an-ni-tim at-hu-<ut>-ka#,i-na,an-ni-tim
9,P510574 obverse:8,tup-pi2 i-na a-ma-ri-im,i-na,a-ma-ri-im
10,P510575 obverse:11,[i]-na# qa-tim ta-ki-il-tim,[i]-na#,qa-tim


This way we seem to miss one result (306 versus 307). Let's find out which:

In [21]:
query = """
line
  word
    =: sign reading=i
    <: sign reading=na
    :=
  <: word sym~(?<!im)$
    := sign reading~im$
"""

Explanation: the piece `(?<!im)` is a negative look-behind assertion. It matches at a position that is not immediately preceded by `im`.
Combined with `$`, it selects words whose `sym` feature does not end in `im`.
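
The look-behind can be tried out in plain Python. This sketch uses a few word forms that occur in the tables above as plain strings; it assumes search semantics and says nothing about how the corpus stores its `sym` values.

```python
import re

# Word forms copied from the result tables, used here as plain strings.
words = ["pu-uh2-ri-im", "tam-li-tim", "pi2-ha-at", "re-esz"]

# `(?<!im)$`: the end of the string, but only if "im" does not precede it.
not_ending_in_im = re.compile(r"(?<!im)$")

kept = [w for w in words if not_ending_in_im.search(w)]
print(kept)  # ['pi2-ha-at', 're-esz']
```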

In [22]:
results = A.search(query)

  0.80s 1 result


That must be the culprit!

In [23]:
A.show(results)

Ah: the second word does end in `im` reading-wise, but not sym-wise, because the `sym` feature has `im!SZI`.

## Solution

Let's quickly inspect all readings ending in `im`:

In [24]:
sorted(
    {F.reading.v(s) for s in F.otype.s("sign") if (F.reading.v(s) or "").endswith("im")}
)

['dim',
 'erim',
 'gim',
 'idim',
 'im',
 'inim',
 'lim',
 'maszkim',
 'muhaldim',
 'nim',
 'silim',
 'sim',
 'szim',
 'szitim',
 'tim',
 'zadim']

We do not want to consider readings like `maszkim` and `muhaldim`, just the ones with a single letter in front of the `im`.
Alas, the digraph `sz` stands for a single letter, although it is written with two characters.

Let's turn to `symr` and `readingr` instead of `sym` and `reading`.

In [25]:
sorted(
    {
        F.readingr.v(s)
        for s in F.otype.s("sign")
        if (F.readingr.v(s) or "").endswith("im")
    }
)

['dim',
 'erim',
 'gim',
 'idim',
 'im',
 'inim',
 'lim',
 'maškim',
 'muhaldim',
 'nim',
 'silim',
 'sim',
 'tim',
 'zadim',
 'šim',
 'šitim']
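
Why the transcription matters for a pattern can be seen in a small plain-Python sketch. The pattern `^.?im$` ("at most one character before `im`") is an illustration only; the two lists mimic the ASCII and the Unicode transcription of the same three readings.

```python
import re

# "At most one character, then im, and nothing else."
at_most_one_char = re.compile(r"^.?im$")

ascii_forms = ["im", "tim", "szim"]         # digraph sz: two characters
unicode_forms = ["im", "tim", "\u0161im"]   # \u0161 is the single letter š

ascii_hits = [r for r in ascii_forms if at_most_one_char.search(r)]
unicode_hits = [r for r in unicode_forms if at_most_one_char.search(r)]

print(ascii_hits)    # ['im', 'tim']
print(unicode_hits)  # ['im', 'tim', 'šim']
```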

Now we can state the condition: words whose `symr` feature ends in a sign that is either `im` or a single letter followed by `im`.

In [26]:
query = """
line
  word sym=i-na
  <: word symr~-.?im$
"""

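Explanation: the dot `.` stands for an arbitrary single character. Because of the `?` behind it, that character is optional.
The leading `-` is the separator in front of the last sign, so the pattern cannot reach back into the middle of a longer reading.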
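
Here is a hedged sketch of this pattern in plain Python. The strings are illustrative: three word forms from the tables above, plus two made-up values (`x-maškim` and a single-sign `lim`) that are not taken from the corpus.

```python
import re

# `-.?im$`: a hyphen, at most one character, then "im" at the very end.
last_sign = re.compile(r"-.?im$")

# The last two values are made up to show what the pattern rejects.
words = ["pu-uh2-ri-im", "tam-li-tim", "a-lim", "x-ma\u0161kim", "lim"]

hits = [w for w in words if last_sign.search(w)]
print(hits)  # ['pu-uh2-ri-im', 'tam-li-tim', 'a-lim']
```

The made-up single-sign value `lim` is rejected because it contains no hyphen, which is something to keep in mind when applying this pattern to words of one sign.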

In [27]:
results = A.search(query)

  0.20s 306 results


In [28]:
A.table(results, end=10)

n,p,line,word,word.1
1,P509375 reverse:9,i-na la-hi-a-nim,i-na,la-hi-a-nim
2,P510527 obverse:6,{disz}ip-qu2-i3-li2-szu _di-ku5_ i-na pu-uh2-ri-im,i-na,pu-uh2-ri-im
3,P510527 obverse:15,i-na pu-uh2-ri-im i-na da-ba-bi-im,i-na,pu-uh2-ri-im
4,P510527 obverse:15,i-na pu-uh2-ri-im i-na da-ba-bi-im,i-na,da-ba-bi-im
5,P510538 obverse:10,i-na tam-li-tim,i-na,tam-li-tim
6,P510562 obverse:7,i-na# pa-ni-tim a-na a-<ma?>-az{ki} ta-al-li-ik-ma#,i-na#,pa-ni-tim
7,P510567 reverse:7,[i-na] e-bu-ri-im,[i-na],e-bu-ri-im
8,P510571 reverse:13,i-na an-ni-tim at-hu-<ut>-ka#,i-na,an-ni-tim
9,P510574 obverse:8,tup-pi2 i-na a-ma-ri-im,i-na,a-ma-ri-im
10,P510575 obverse:11,[i]-na# qa-tim ta-ki-il-tim,[i]-na#,qa-tim


# i-na + ...!im + ...-?im

We look for word triples, of which the first is `i-na`, the second does not end in `im`, and the third ends in `im`.
In the third word, there may be a single letter before the `im` of the last sign.

In [29]:
query = """
line
  word sym=i-na
  <: word symr~(?<!im)$
  <: word symr~-.?im$
"""

In [30]:
results = A.search(query)

  0.34s 67 results


In [31]:
A.table(results, end=10)

n,p,line,word,word.1,word.2
1,P509373 reverse:11',a-na ki-ma i-[na] _dub e2-gal_-lim,i-[na],_dub,e2-gal_-lim
2,P510573 reverse:1,i-na <<an-na>> an-ni-tim a-hu!-ut-ka,i-na,<<an-na>>,an-ni-tim
3,P510594 obverse:5',szum-ma i-na _{gesz}ban2_ {d}ki-it-tim,i-na,_{gesz}ban2_,{d}ki-it-tim
4,P510594 obverse:7',i-na _{gesz}ban2_ {d}ki-it#-tim#,i-na,_{gesz}ban2_,{d}ki-it#-tim#
5,P510594 reverse:3,szum-ma i-na _{gesz}ban2_ {d}ki-it-tim,i-na,_{gesz}ban2_,{d}ki-it-tim
6,P510607 obverse:10,i-na pi2-ha-at a-lim,i-na,pi2-ha-at,a-lim
7,P510677 obverse:2',[i-na _e2_ a-ki-tim isz]-sza#-ak#-[ka-an],[i-na,_e2_,a-ki-tim
8,P510688 obverse:10,<<i>> el-qe2 i-na _a-sza3_ [x]-x-lim,i-na,_a-sza3_,[x]-x-lim
9,P510712 reverse:17',[i]-na# re-esz ma-ak-ku-ri-im#,[i]-na#,re-esz,ma-ak-ku-ri-im#
10,P510722 reverse:8,i-na pi2-sza-an-ni ku-nu-ka-tim,i-na,pi2-sza-an-ni,ku-nu-ka-tim


One question remains: this query skips cases where the second word ends in a longer reading such as `-maškim`. Is that bad?
Let's look for such cases:

In [32]:
query = """
line
  word sym=i-na
  <: word symr~[^-][^-]im$
  <: word symr~-.?im$
"""

So we actively look for cases where the second word ends in a reading that ends in `im`, preceded by at least two characters
that are not a hyphen.
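
What `[^-][^-]im$` accepts can be verified in plain Python. In this sketch the first three strings are word forms from the tables above; `x-maškim` and `x-erim` are made-up values that combine a placeholder sign with readings from the list of readings ending in `im`.

```python
import re

# `[^-][^-]im$`: two non-hyphen characters directly before a final "im".
long_reading = re.compile(r"[^-][^-]im$")

# The last two values are made up for illustration.
words = ["a-lim", "ki-it-tim", "pi2-ha-at", "x-ma\u0161kim", "x-erim"]

hits = [w for w in words if long_reading.search(w)]
print(hits)  # ['x-maškim', 'x-erim']
```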

In [33]:
results = A.search(query)

  0.29s 0 results


We do not find any, so we can stick to our initial query for word triples.

Just in case you like the highlighting of signs, we rewrite this query in the more elaborate, sign-based form:

In [34]:
query = """
line
  word
    =: sign reading=i
    <: sign reading=na
    :=
  <: word
    := sign reading~(?<!im)$
  <: word
    := sign readingr~.?im$
"""

Note that in the last line we use `readingr` instead of `reading`, because in `readingr` digraphs such as `sz` appear as
a single letter.
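
A side note on the last pattern, as a plain-Python sketch on sample readings. Under search semantics, `.?im$` without a leading `^` accepts any value that ends in `im`, because `.?` may match the character before `im` or nothing at all. The anchored variant `^.?im$` is not part of the query above; it is shown only as an assumption about how the "at most one letter" condition could be enforced strictly on a single reading.

```python
import re

# Sample readings; \u0161 is the single letter š.
readings = ["im", "tim", "\u0161im", "ma\u0161kim", "at"]

unanchored = [r for r in readings if re.search(r".?im$", r)]
anchored = [r for r in readings if re.search(r"^.?im$", r)]

print(unanchored)  # ['im', 'tim', 'šim', 'maškim']
print(anchored)    # ['im', 'tim', 'šim']
```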

In [35]:
results = A.search(query)

  1.13s 67 results


In [36]:
A.table(results, end=10)

n,p,line,word,sign,sign.1,word.1,sign.2,word.2,sign.3
1,P509373 reverse:11',a-na ki-ma i-[na] _dub e2-gal_-lim,i-[na],i-,[na],_dub,_dub,e2-gal_-lim,lim
2,P510573 reverse:1,i-na <<an-na>> an-ni-tim a-hu!-ut-ka,i-na,i-,na,<<an-na>>,na>>,an-ni-tim,tim
3,P510594 obverse:5',szum-ma i-na _{gesz}ban2_ {d}ki-it-tim,i-na,i-,na,_{gesz}ban2_,ban2_,{d}ki-it-tim,tim
4,P510594 obverse:7',i-na _{gesz}ban2_ {d}ki-it#-tim#,i-na,i-,na,_{gesz}ban2_,ban2_,{d}ki-it#-tim#,tim#
5,P510594 reverse:3,szum-ma i-na _{gesz}ban2_ {d}ki-it-tim,i-na,i-,na,_{gesz}ban2_,ban2_,{d}ki-it-tim,tim
6,P510607 obverse:10,i-na pi2-ha-at a-lim,i-na,i-,na,pi2-ha-at,at,a-lim,lim
7,P510677 obverse:2',[i-na _e2_ a-ki-tim isz]-sza#-ak#-[ka-an],[i-na,[i-,na,_e2_,_e2_,a-ki-tim,tim
8,P510688 obverse:10,<<i>> el-qe2 i-na _a-sza3_ [x]-x-lim,i-na,i-,na,_a-sza3_,sza3_,[x]-x-lim,lim
9,P510712 reverse:17',[i]-na# re-esz ma-ak-ku-ri-im#,[i]-na#,[i]-,na#,re-esz,esz,ma-ak-ku-ri-im#,im#
10,P510722 reverse:8,i-na pi2-sza-an-ni ku-nu-ka-tim,i-na,i-,na,pi2-sza-an-ni,ni,ku-nu-ka-tim,tim
