# Use Text-Fabric search to find word patterns

Prompted by [Paul Noorlander](https://cambridge.academia.edu/PaulNoorlander), answered by Dirk Roorda

## Getting data and using the TF browser

It is convenient to have the Text-Fabric browser on the side to make quick excursions through the data.

So, open a terminal and run the command

```text-fabric peshitta:latest --checkout=latest```

This fetches the latest version of the Peshitta app and data.

After that, you can just say

```text-fabric peshitta```

until you hear that a new version of the app or the data has become available.

In [1]:
from tf.app import use

In [2]:
A = use("peshitta", hoist=globals())

## string `JBW L` in the text

Assuming `JBW` is a single word and `L` is a single word:

In [3]:
query = """
word word_etcbc=JBW
<: word word_etcbc=L
"""

In [4]:
results = A.search(query)

  0.75s 0 results


That does not help. At least one of the assumptions leads nowhere.
At this point it helps to use the TF browser to run some experiments on the side.

Running `word word_etcbc=L|JBW` shows that there are no words whose full text is either `L` or `JBW`.

But there are plenty of words starting with an `L`.

```
text-fabric peshitta
```

You get something like this:

![tfb](images/start-l.png)


See
[search with regular expressions](https://annotation.github.io/text-fabric/tf/about/searchusage.html#feature-specifications)
for how you can use search patterns to look within feature values.

The query is `word word_etcbc~^L`, the anatomy of which reads:

* look for a word with a constraint on its feature `word_etcbc`.
* the constraint is that it should match the regular expression `^L`:
* the `^` matches the beginning of the string, the `L` matches just an `L`,
* resulting in the condition: `word_etcbc` starts with an `L`.

Likewise, `$` matches the end of the string, so `word word_etcbc~JBW$` matches each word whose ETCBC transcription ends in `JBW`.
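
The two anchors can be tried out in plain Python. This is a minimal sketch, assuming that the `~` operator behaves like Python's `re.search`; the word list is a small hand-picked sample of transcriptions that occur in this notebook, not a query against the corpus.

```python
import re

# Hand-picked sample of ETCBC transcriptions (illustrative, not the full corpus).
words = ["VJBW", "LKWN", "LJ", "W>TJBW", "LMLK>", "HWJ>"]

# ^L : the value starts with L
starts_with_l = [w for w in words if re.search(r"^L", w)]
# JBW$ : the value ends in JBW
ends_in_jbw = [w for w in words if re.search(r"JBW$", w)]

print(starts_with_l)  # ['LKWN', 'LJ', 'LMLK>']
print(ends_in_jbw)    # ['VJBW', 'W>TJBW']
```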

If you try that, you get 58 results:

![tfb](images/end-jbw.png)

Good. Let's see whether there are combined results.

We do that here, in the notebook.

In [5]:
query = """
word word_etcbc~JBW$
<: word word_etcbc~^L
"""

In [6]:
results = A.search(query)

  0.91s 17 results
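
What the two-line template asks for can be mimicked on a toy word list. This is a minimal sketch, assuming that `<:` relates a word to the word that immediately follows it and that `~` behaves like `re.search`; the `verse` list (with the filler `XXX`) and the `adjacent_pairs` helper are hypothetical and only serve the illustration.

```python
import re


def adjacent_pairs(words, first_pattern, second_pattern):
    """Return (w1, w2) pairs of neighbouring words that match both patterns."""
    return [
        (w1, w2)
        for w1, w2 in zip(words, words[1:])
        if re.search(first_pattern, w1) and re.search(second_pattern, w2)
    ]


# Hypothetical word sequence; XXX is a filler, not a real transcription.
verse = ["LKWN", "XXX", "VJBW", "LKWN", "VJBW", "XXX"]

print(adjacent_pairs(verse, r"JBW$", r"^L"))  # [('VJBW', 'LKWN')]
```

Only the pair in which the `JBW` word comes directly before the `L` word is reported; the same two words in the opposite order, or separated by another word, do not count.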


Lo and behold:

In [7]:
A.table(results, fmt="text-trans-full")

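| n | p | word | word |
|---|---|---|---|
| 1 | Numbers 35:11 | `VJBW` | `LKWN` |
| 2 | Job 17:1 | `VJBW` | `LJ` |
| 3 | Joshua 1:11 | `VJBW` | `LKWN` |
| 4 | Joshua 11:5 | `W>TVJBW` | `LM<BD` |
| 5 | Kings_1 8:32 | `LMXJBW` | `LXJB>` |
| 6 | Kings_2 22:20 | `W>TJBW` | `LMLK>` |
| 7 | Jeremiah 51:20 | `VJBW` | `LJ` |
| 8 | Sirach 12:1 | `VJBW` | `LVJBWTK` |
| 9 | Chronicles_2 6:23 | `LMXJBW` | `LXJB>` |
| 10 | Chronicles_2 35:14 | `VJBW` | `LHWN` |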


We get the transcription by asking for text format `text-trans-full`.

The available text formats can be found in the TF browser, under options (see screenshot above).

## word starting with `<:L` followed by another word starting with `<:L`

In [8]:
query = """
word word_etcbc~^<:L
<: word word_etcbc~<:L
"""

In [9]:
results = A.search(query)

  0.90s 0 results


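I suspect the `:` spoils things. How many words contain `:`?

Run `word word_etcbc~:` and you get only 6 results.

Let's leave out the `:`. Note that the second line has no `^`, so it matches any following word that contains `<L`, not only words that start with it: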

In [10]:
query = """
word word_etcbc~^<L
<: word word_etcbc~<L
"""

In [11]:
results = A.search(query)

  0.95s 158 results


In [12]:
A.table(results, fmt="text-trans-full", start=0, end=20)

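| n | p | word | word |
|---|---|---|---|
| 0 | Esdras_3 2:15 | `<LJHWN` | `<L` |
| 1 | Genesis 20:9 | `<LJ` | `W<L` |
| 2 | Genesis 34:27 | `<LW` | `<L` |
| 3 | Genesis 39:14 | `<LJN=.` | `<L` |
| 4 | Genesis 39:14 | `<L` | `<LJ` |
| 5 | Genesis 39:17 | `<L` | `<LJ` |
| 6 | Genesis 43:7 | `<LJN` | `W<L` |
| 7 | Exodus 8:5 | `<LJK^=.` | `W<L` |
| 8 | Exodus 8:17 | `<LJK` | `W<L` |
| 9 | Exodus 20:24 | `<LWHJ` | `<LW"TK` |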
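
The difference between the failed and the successful template can be checked in plain Python. This is a minimal sketch, again assuming that `~` behaves like `re.search`; the sample consists of a few transcriptions from the result table above.

```python
import re

# Sample transcriptions taken from the result table above.
sample = ["<LJ", "W<L", "<L", "<LWHJ"]

with_colon = [w for w in sample if re.search(r"<:L", w)]  # pattern with ':'
anchored = [w for w in sample if re.search(r"^<L", w)]    # must start with <L
unanchored = [w for w in sample if re.search(r"<L", w)]   # <L anywhere

print(with_colon)  # []
print(anchored)    # ['<LJ', '<L', '<LWHJ']
print(unanchored)  # ['<LJ', 'W<L', '<L', '<LWHJ']
```

The unanchored pattern is the one that lets `W<L` through in the second column.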


## word containing root `HWY`

The current Peshitta data set is not up to this question, because lemmas and roots are not marked.
The best we can do is to try a set of surface patterns.

Playing around (in the TF browser) yields this:

* `word word_etcbc~HWY` 0 results! Could it be that you meant `HWJ`?
* `word word_etcbc~HWJ` 673 results! Now we are in business.

In [13]:
query = """
word word_etcbc~HWJ
"""

In [14]:
results = A.search(query)

  0.44s 673 results


In [15]:
A.table(results, fmt="text-trans-full", start=0, end=20)

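| n | p | word |
|---|---|---|
| 0 | Esdras_3 8:69 | `HWJT` |
| 1 | Genesis 3:5 | `WHWJTWN` |
| 2 | Genesis 11:3 | `WHWJ>` |
| 3 | Genesis 12:2 | `WHWJ` |
| 4 | Genesis 17:1 | `WHWJ` |
| 5 | Genesis 18:12 | `HWJ>` |
| 6 | Genesis 19:37 | `HWJW` |
| 7 | Genesis 19:38 | `HWJW` |
| 8 | Genesis 24:41 | `HWJT` |
| 9 | Genesis 24:60 | `HWJ` |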


We can look for the *bare* occurrences of `HWJ`:

* `word word_etcbc~^HWJ$` 42 occurrences, or simpler:
* `word word_etcbc=HWJ` the same 42 occurrences
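
That a fully anchored pattern and an exact value select the same words can be seen in a small Python sketch. It assumes that `~` behaves like `re.search` and `=` like string equality; the sample words come from the result table above.

```python
import re

# Sample transcriptions taken from the result table above.
sample = ["HWJT", "WHWJ", "HWJ>", "HWJW", "HWJ"]

containing = [w for w in sample if re.search(r"HWJ", w)]  # HWJ anywhere
anchored = [w for w in sample if re.search(r"^HWJ$", w)]  # nothing but HWJ
exact = [w for w in sample if w == "HWJ"]                 # equality test

print(len(containing))    # 5
print(anchored == exact)  # True
```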

## word containing one of several roots

You can look for several roots at the same time, e.g. `HWJ` and `RHV`:

* `word word_etcbc=HWJ|RHV` 54 occurrences

If you want the non-bare occurrences also, we are helped by the fact that you can use `|` inside regular expressions as well:

* `word word_etcbc~HWJ|RHV` 791 results
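
The two uses of `|` can be sketched in Python as well. This assumes that `=HWJ|RHV` means "the value is one of the listed alternatives" and that `~HWJ|RHV` is an ordinary regular-expression alternation evaluated like `re.search`; the form `RHVW` is made up for the illustration.

```python
import re

# HWJ, WHWJ and LKWN occur in the tables above; RHV is the second pattern
# from the text; RHVW is a made-up form.
sample = ["HWJ", "WHWJ", "RHV", "RHVW", "LKWN"]

bare = [w for w in sample if w in {"HWJ", "RHV"}]            # like =HWJ|RHV
containing = [w for w in sample if re.search(r"HWJ|RHV", w)]  # like ~HWJ|RHV

print(bare)        # ['HWJ', 'RHV']
print(containing)  # ['HWJ', 'WHWJ', 'RHV', 'RHVW']
```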

We show results 4 and 5 here, not as a table but in pretty display, by using the function `show()` instead of
`table()`:

In [16]:
query = """
word word_etcbc~HWJ|RHV
"""

In [17]:
results = A.search(query)

  0.47s 791 results


In [18]:
A.show(results, fmt="text-trans-full", start=4, end=5)

If you want all results in an Excel table, do this:

In [19]:
A.export(results)

Find it in your Downloads folder:

![tsv](images/results.png)

You can open it directly in Excel:


![xls](images/resultsx.png)

See also the
[documentation of export()](https://annotation.github.io/text-fabric/tf/advanced/display.html#tf.advanced.display.export)
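
The exported file can also be read back in Python rather than Excel. This is a minimal sketch with the standard `csv` module, assuming the file is tab-separated with a header row; the `read_tsv` helper, the column names in the demo, and the path and UTF-16 encoding in the closing comment are assumptions to check against the `export()` documentation.

```python
import csv
import io


def read_tsv(stream):
    """Read tab-separated rows from a text stream into a list of dicts."""
    # QUOTE_NONE keeps a literal " inside a transcription such as <LW"TK.
    return list(csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE))


# Stand-in for an exported file; the real column names may differ.
demo = io.StringIO("n\tword\n1\tVJBW\n2\t<LW\"TK\n")
rows = read_tsv(demo)
print(rows[0]["word"])  # VJBW
print(rows[1]["word"])  # <LW"TK

# With a real export, something like (path and encoding are assumptions):
# with open("results.tsv", encoding="utf_16", newline="") as fh:
#     rows = read_tsv(fh)
```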

You can also make these exports directly from the TF browser:

![export](images/export.png)

Look for a file with a name like `peshitta-default.zip` in your Downloads folder.
In it is a file `resultsx.tsv` with the same content.