<img align="right" src="images/tf.png" width="200"/>
<img align="right" src="images/huc.png" width="200"/>
<img align="right" src="images/logo.png" width="200"/>

---

To get started: consult [start](start.ipynb)

---

# Search Introduction

*Search* in Text-Fabric is a template-based way of looking for structural patterns in your dataset.

Text-Fabric lets you combine the ease of formulating search templates for
complicated syntactical patterns with the power of processing the results programmatically.

This notebook will show you how to get up and running.

## Easy command

In its simplest form, a search looks like this (here `template` stands for any search template string):

```python
results = A.search(template)
A.show(results)
```

See all ins and outs in the
[search template docs](https://annotation.github.io/text-fabric/tf/about/searchusage.html).

# Incantation

The ins and outs of installing Text-Fabric, getting the corpus, and initializing a notebook are
explained in the [start tutorial](start.ipynb).

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from tf.app import use

In [3]:
A = use(
    "CLARIAH/descartes-tf:clone",
    checkout="clone",
    hoist=globals(),
)

This is Text-Fabric 11.0.7
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

28 features found and 0 ignored
  0.09s Dataset without structure sections in otext:no structure functions in the T-API
  0.35s All features loaded/computed - for details use TF.isLoaded()
  0.01s All additional features loaded - for details use TF.isLoaded()


| Name | # of nodes | # slots/node | % coverage |
|---|---|---|---|
| volume | 8 | 85241.88 | 100.0 |
| letter | 725 | 940.6 | 100.0 |
| page | 2884 | 236.45 | 100.0 |
| postscriptum | 56 | 46.79 | 0.0 |
| opener | 545 | 1.97 | 0.0 |
| closer | 541 | 13.1 | 1.0 |
| address | 86 | 15.22 | 0.0 |
| head | 725 | 23.37 | 2.0 |
| p | 8438 | 80.82 | 100.0 |


# Basic search command

We start with the simplest form of issuing a query.
Let's look for the 16th sentence of every paragraph that has at least that many sentences.

Note that sentences are numbered within paragraphs and that the sentence number is stored in the feature `n`.

In [4]:
template = """
sentence n=16
"""

results = A.search(template)

  0.01s 13 results


We see the number of results, but how do we get at the results themselves?

In [5]:
results

[(709073,),
 (709215,),
 (710310,),
 (712826,),
 (714388,),
 (714650,),
 (717861,),
 (717913,),
 (718273,),
 (718766,),
 (720933,),
 (721641,),
 (722538,)]

Nice try. These are indeed the results, but they are bare node numbers, which mean little to a human reader.

We get more flesh and blood by displaying the results.

In [6]:
A.table(results)

| n | p | sentence |
|---|---|---|
| 1 | 1 1027:3 | Ce n'est pas que je ne l'aime et que je ne le tienne pour un homme tout plein d'honneur et de bonté; mais parce que je ne connais que deux personnes, avec qui il ait jamais eu quelque chose à démêler, qui sont M. Mydorge et M. Morin, et qu'il se plaint de tous les deux, je ne saurais que je ne juge qu'il tient quelque chose de cette humeur, où il faut dire qu'il est bien malheureux. |
| 2 | 1 1032:4 | Quâ tamen in re non judico te satis prudenter cavere tuis rebus: quid enim si de istius manuscripti fide dubitatur? nunquid tutius esset testes adhibere vel tabulis publicis confirmare? Sed profecto, ut verum loquar, istae divitiae, quae fures timent et tantâ cum sollicitudine debent asservari, miserum te reddunt potius quàm beatum; nec, si mihi credis, te pigebit illa amittere simul cum morbo. |
| 3 | 1 1116:3 | Opinor autem quod, sicuti apud Poetam consessus Didonianus, conticebunt omnes intentique ora tenebunt. Precor autem te et obtestor ut eodem tenore caetera quae in manibus habes prosequaris et aliquando proferas, meque subinde epistolio tuo bees. |
| 4 | 3 3174:23 | Je m'étonne aussi de ce que, nonobstant que j'aie clairement démontré tout ce que j'ai dit devoir être corrigé en sa règle, et qu'il n'aît donné aucune raison à l'encontre, il ne laisse pas de dire que j'y ai mal réussi, au lieu de quoi je me persuade qu'il m'en devrait remercier; et même il ajoute que j'ai failli pour avoir dit qu'il fallait donner deux noms à la ligne qu'il nomme B etc., ce qui ne réussit, dit-il, qu'aux questions qui sont aisées, au lieu qu'il devrait dire que c'est donc lui-même qui avait failli, à cause que j'ai suivi en cela son texte de mot à mot, ainsi que j'ai dû faire pour le corriger. |
| 5 | 3 3220:3 | Je suis, Monsieur, Votre très obéissant et très obligé serviteur, DESCARTES. De Leyde le lundi au soir [12 décembre 1639]. Monsieur, |
| 6 | 4 4230:3 | (baillet, II, 21-22.) |
| 7 | 6 6391:3 | Maer niet-te-min dewijl al de Werelt oordeelt, dat hy de voornaemste autheur is vande lasteringhen die in het ghemelt fameux boeck teghens my worden ghevonden, versoeck ick U. Ed. |
| 8 | 6 6396:3 | Je vous assure qu'elles ne me touchent guère, et ne m'ont point emmaigri, comme Voetius, à qui on dit qu'elles ont ôté treize livres de chair, mais non pas de graisse, à cause qu'il n'en eut jamais tant. |
| 9 | 6 6425:9 | Quin etiam nullum ea de re scriptum peculiare composui, sed obiter tantum in epistola in qua de Patre quodam Societatis conquerebar, et quam tunc commodam sub praelo habebam, paucas de illo paginas inserui. |
| 10 | 6 6470:4 | C'est un avantage qu'a eu aussi dans la suite la version française des Principes de M. Descartes, faite par l'Abbé Picot. |

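The table is one way to look behind the node numbers. Another is to ask the API directly. The following is a minimal sketch, not part of the original notebook: `describe_results` is a hypothetical helper, and it assumes that `F` and `T` are the feature and text APIs that `use(..., hoist=globals())` places in the global namespace, with `F.otype.v(node)` giving the node type and `T.text(node)` giving the text of a node.

```python
def describe_results(results, F, T, limit=3):
    """Return (node, node type, text) for the nodes in the first few results.

    Hypothetical helper: F and T are assumed to be the Text-Fabric
    feature and text APIs hoisted by use(..., hoist=globals()).
    """
    rows = []
    for result in results[:limit]:
        # Each result is a tuple of nodes, one for each node type
        # named in the template.
        for node in result:
            rows.append((node, F.otype.v(node), T.text(node)))
    return rows
```

In the notebook, `describe_results(results, F, T)` would list the first three sentence nodes together with their type and text.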

## Figures

Let's look for all paragraphs with an illustration in them.

In [8]:
query = """
p
  figure
"""

In [9]:
results = A.search(query)

  0.00s 319 results

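Because this template names two node types, each result is a pair: a `p` node and a `figure` node inside it. A paragraph with several figures therefore occurs in several results. A minimal sketch of regrouping such pairs per paragraph follows; the node numbers are made up for illustration and do not come from the corpus.

```python
from collections import defaultdict

# Made-up (p, figure) node pairs, standing in for real search results.
example_results = [(1001, 9001), (1002, 9002), (1002, 9003)]

# Collect the figure nodes of each paragraph node.
figures_by_p = defaultdict(list)
for p_node, figure_node in example_results:
    figures_by_p[p_node].append(figure_node)

for p_node, figure_nodes in figures_by_p.items():
    print(p_node, len(figure_nodes), figure_nodes)
```

With real results, the same loop would show how many illustrations each paragraph contains.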

In [10]:
A.table(results, end=3)

| n | p | p.1 | figure |
|---|---|---|---|
| 1 | 1 1001:4 | | |
| 2 | 1 1002:3 | | |
| 3 | 1 1002:3 | | |


The results are shown inside the sentences that they occur in.
`p`s are too big to fit into sentences, so the `p`s are left out and only the images show up.

We can make the display richer: instead of a plain table, we can unfold the sentences in the results:

In [11]:
A.show(results, end=3)

The results are collected and shown in their surrounding sentence. 

Note that we see only the sentences that contain an image.

But we can see more if we tell Text-Fabric to condense the results not into sentences, but into `p`s:

In [12]:
A.show(results, end=3, condenseType="p")

# Formulas

Now let's look for formulas that have a square root in them.

Note that in TeX a square root is written as `\sqrt`.

The TeX source of a formula is contained in the `tex` feature of a formula node, provided
the formula is written in TeX. Not all formulas are written in TeX.

In [13]:
query = """
formula tex~sqrt
"""

In [14]:
results = A.search(query)

  0.00s 54 results

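The `~` in `tex~sqrt` asks for a regular-expression match rather than an exact value. As an illustration of the idea only, and assuming the pattern may match anywhere in the value (as with Python's `re.search`), the filtering behaves like the sketch below. The TeX strings are made up and do not come from the corpus.

```python
import re

# Made-up TeX sources keyed by made-up formula names.
sample_tex = {
    "f1": r"\sqrt{a^2 + b^2}",
    "f2": r"x = \frac{1}{2}",
    "f3": r"y - \sqrt{3}",
}

pattern = re.compile("sqrt")

# Keep the formulas whose TeX source contains a match for the pattern.
matching = [name for name, tex in sample_tex.items() if pattern.search(tex)]
print(matching)
```

Because the pattern is a regular expression, alternatives such as `sqrt|frac` can be expressed in the same way.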

In [15]:
A.show(results, end=3, condensed=True)

We can get rid of the TeX codes.

We see them because our query mentioned the feature `tex`, but we can turn that off with `queryFeatures=False` (showing the 3rd result only):

In [16]:
A.show(results, start=3, end=3, condensed=True, queryFeatures=False)

## Formulas without TeX

We gather the formulas not written in TeX:

In [17]:
query = """
formula notation#TeX
"""

results = A.search(query)

  0.01s 5981 results


The majority of the formulas are not written in TeX. Let's sample a few:

In [18]:
from random import seed, sample

seed(42)

selected = sample(results, 20)

A.table(selected)

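| n | p | formula |
|---|---|---|
| 1 | 7 7538:6 | IN |
| 2 | 2 2122:12 | G |
| 3 | 1 1020:7 | ZZ |
| 4 | 2 2164:25 | GR |
| 5 | 2 2159:5 | g |
| 6 | 2 2156:9 | C in A + C in E − Aq − A in E bis − Eq |
| 7 | 2 2126:16 | AO |
| 8 | 1 1f1b:3 | DE |
| 9 | 7 7547:12 | S |
| 10 | 4 4303:13 | BG |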


These formulas are all so simple that TeX was not needed to display them.

Let's see the first 2 of them in context:

In [19]:
A.show(selected[0:2], condensed=True)

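Seeding the random generator, as in the sampling cell above, makes the sample reproducible from run to run. A minimal sketch of the same idea follows. It uses a private `random.Random` instance instead of the module-level `seed` and `sample`, so that the global random state stays untouched. `reproducible_sample` is a hypothetical helper and the input is dummy data, not real search results.

```python
from random import Random


def reproducible_sample(items, k, seed_value=42):
    """Draw up to k items; the same seed always yields the same selection."""
    pool = list(items)
    rng = Random(seed_value)  # private generator, independent of random.seed()
    return rng.sample(pool, min(k, len(pool)))


# Dummy one-node result tuples standing in for real search results.
dummy_results = [(node,) for node in range(1000, 1100)]
first = reproducible_sample(dummy_results, 20)
second = reproducible_sample(dummy_results, 20)
print(first == second)
```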
---

# Contents

* **[start](start.ipynb)** intro and highlights
* **search** turbo charge your hand-coding with search templates
* **[compute](compute.ipynb)** sink down a level and compute it yourself
* **[exportExcel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results

Advanced

* **[similar sentences](similar.ipynb)** find similar sentences

CC-BY Dirk Roorda