<img align="right" src="images/etcbc.png" width="150"/>
<img align="right" src="images/tf.png" width="150"/>
<img align="right" src="images/emdros.png" width="250"/>

# MQL versus TF-Query

See [tfVersusMql](tfVersusMql.ipynb) for an introduction.

# Loading

We load the Text-Fabric program and the BHSA data.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from tf.app import use
from tf.core.helpers import project

from util import getTfVerses, getShebanqData, compareResults, MQL_RESULTS

In [3]:
VERSION = '2017'
# A = use('bhsa', hoist=globals(), version=VERSION)
A = use('bhsa:clone', checkout="clone", hoist=globals(), version=VERSION)

# Example 10b

[Bas Meeuse: Example 10: NTN + object + recipient](https://shebanq.ancient-data.org/hebrew/query?version=2017&id=4437)

```
[clause focus
  [UNORDEREDGROUP
    [phrase typ = VP AND NOT function IN (PreO, PtcO)
      [word vs = qal AND lex = "NTN["]
    ]
    [phrase function = Objc]
    [phrase function = Cmpl
      [word first lex = "L"]
    ]
  ]
]
OR
[clause focus
  [UNORDEREDGROUP
    [phrase function IN (PreO, PtcO)
      [word vs = qal AND lex = "NTN["]
    ]
    [phrase function = Cmpl
      [word first lex = "L"]
    ]
  ]
]
```

In [4]:
(verses, words) = getShebanqData(A, MQL_RESULTS, "10b")

2087 results in 675 verses with 5573 words


In [5]:
query1 = """
clause
  phrase typ=VP function#PreO|PtcO
    word vs=qal lex=NTN[
  phrase function=Objc
  phrase function=Cmpl
    =: word lex=L
"""

query2 = """
clause
  phrase function=PreO|PtcO
    word vs=qal lex=NTN[
  phrase function=Cmpl
    =: word lex=L
"""  

In [6]:
results1 = A.search(query1)
results2 = A.search(query2)

  1.76s 650 results
  1.27s 112 results


The two counts add up to 762 (650 + 112), far short of the 2087 results that SHEBANQ reports.

We'll come back to that later.

Let's see how we fare when we compare the verses in which results occur, and the words that
occur in the results.

In [7]:
(tfVerses1, tfWords1) = getTfVerses(A, results1, (0,))
(tfVerses2, tfWords2) = getTfVerses(A, results2, (0,))

580 verses
4896 words
103 verses
677 words
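
The helper `getTfVerses` comes from the local `util` module, whose source is not shown here. As a rough sketch of what such a helper could do, assume it looks at the result positions given in its last argument and collects the verses and words belonging to the nodes at those positions. The names `collect_verses_and_words`, `verse_of` and `words_of` are hypothetical, and the toy lookup tables stand in for the Text-Fabric calls `L.u(node, otype='verse')` and `L.d(node, otype='word')` that one would probably use on real data.

```python
def collect_verses_and_words(results, positions, verse_of, words_of):
    """Collect verses and words for the nodes at `positions` in each result tuple."""
    verses = set()
    words = set()
    for result in results:
        for pos in positions:
            node = result[pos]
            verses.add(verse_of(node))
            words.update(words_of(node))
    return sorted(verses), sorted(words)


# Toy data: the node numbers are made up, they are not real BHSA nodes.
TOY_VERSE = {100: "Genesis 1:1", 101: "Genesis 1:1", 102: "Genesis 1:2"}
TOY_WORDS = {100: [1, 2, 3], 101: [4, 5], 102: [6, 7, 8]}

# Each toy result is (clause, phrase, word); position 0 is the clause.
toy_results = [(100, 900, 1), (101, 901, 4), (102, 902, 6)]
toy_verses, toy_words = collect_verses_and_words(
    toy_results, (0,), TOY_VERSE.get, TOY_WORDS.get
)
print(len(toy_verses), "verses")
print(len(toy_words), "words")
```

Passing `(0,)` mirrors the `focus` on the clause in the MQL query: only the words of the clause node count, not those of the nested phrases separately.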


We combine the verses and words and test for equality.

In [8]:
tfVerses = sorted(set(tfVerses1) | set(tfVerses2))
tfWords = sorted(set(tfWords1) | set(tfWords2))

In [9]:
compareResults(A, verses, words, tfVerses, tfWords)

VERSES EQUAL
WORDS EQUAL


Wonderful!

In order to show results in the natural order, we have to merge them.

In [10]:
results = sorted(results1 + results2)
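
Concatenating and sorting works here because each result is a tuple of node numbers, and Python compares tuples element by element, even when their lengths differ. Since every tuple starts with its clause node, the merged list is ordered by clause, which matches the text order as long as clause node numbers increase with the text. A minimal illustration with made-up node numbers (not real BHSA nodes):

```python
# Made-up results: query1 yields 6-tuples, query2 yields 5-tuples.
toy_results1 = [(10, 20, 1, 21, 22, 5), (30, 40, 9, 41, 42, 12)]
toy_results2 = [(25, 26, 7, 27, 8)]

# Tuples are compared lexicographically, so the first member (the clause) decides.
merged = sorted(toy_results1 + toy_results2)
for r in merged:
    print(r)
```

The 5-tuple ends up between the two 6-tuples, because its first member lies between theirs.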

Let's find the first result of query2 in the merged sequence:

In [11]:
for (i, r) in enumerate(results):
    if len(r) == 5:
        break
i + 1

5

In [12]:
A.show(results, condenseType="clause", start=3, end=7)

## Counting results

Why is there such a big discrepancy between the number of results in SHEBANQ and in TF?

Let's work towards a simpler case that shows the same problem.
First we restrict ourselves to the second `OR` part.

[Dirk Roorda: Example 10a: number of results](https://shebanq.ancient-data.org/hebrew/query?version=2017&id=4469)

```
[clause focus
  [UNORDEREDGROUP
    [phrase function IN (PreO, PtcO)
      [word vs = qal AND lex = "NTN["]
    ]
    [phrase function = Cmpl
      [word first lex = "L"]
    ]
  ]
]
```

In [13]:
(verses, words) = getShebanqData(A, MQL_RESULTS, "10b1")

217 results in 103 verses with 677 words


In [14]:
query = """
clause
  phrase function=PreO|PtcO
    word vs=qal lex=NTN[
  phrase function=Cmpl
    =: word lex=L
"""  

In [15]:
results = A.search(query)

  1.25s 112 results


In [16]:
(tfVerses, tfWords) = getTfVerses(A, results, (0,))

103 verses
677 words


In [17]:
compareResults(A, verses, words, tfVerses, tfWords)

VERSES EQUAL
WORDS EQUAL


Let's make it even simpler by removing the unordered group:

[Dirk Roorda: Example10b: continued](https://shebanq.ancient-data.org/hebrew/query?version=2017&id=4470)

```
[clause focus
    [phrase function IN (PreO, PtcO)
      [word vs = qal AND lex = "NTN["]
    ]
    [phrase function = Cmpl
      [word first lex = "L"]
    ]
]
```

In [18]:
(verses, words) = getShebanqData(A, MQL_RESULTS, "10b2")

63 results in 63 verses with 406 words


In [19]:
query = """
clause
  phrase function=PreO|PtcO
    word vs=qal lex=NTN[
  <: phrase function=Cmpl
    =: word lex=L
"""  

In [20]:
results = A.search(query)

  1.22s 63 results
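
The `<:` in front of the second phrase is what accounts for dropping `UNORDEREDGROUP`: in MQL, consecutive blocks have to follow each other directly, and in TF-Query `<:` requires the preceding sibling to end immediately before this node starts. A small sketch of that relation on made-up slot ranges (the function name and the data are hypothetical, they are not part of Text-Fabric):

```python
def adjacent_before(left_slots, right_slots):
    """True if `left` ends on the slot right before `right` starts (like `<:`)."""
    return max(left_slots) + 1 == min(right_slots)


# Made-up phrases, each given as the set of word slots it occupies.
verb_phrase = {11, 12}
complement_next = {13, 14}
complement_later = {16, 17}

print(adjacent_before(verb_phrase, complement_next))   # True
print(adjacent_before(verb_phrase, complement_later))  # False
print(adjacent_before(complement_next, verb_phrase))   # False
```

The unordered variant accepts all three configurations, which is why it finds more results than the ordered, adjacent one.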


In [21]:
(tfVerses, tfWords) = getTfVerses(A, results, (0,))

 63 verses
406 words


In [22]:
compareResults(A, verses, words, tfVerses, tfWords)

VERSES EQUAL
WORDS EQUAL


The difference occurs when counting the results of a query with `UNORDEREDGROUP` in it.

Let's try to make a minimal example.

[Dirk Roorda: Example 10b: minimal](https://shebanq.ancient-data.org/hebrew/query?version=2017&id=4471)

```
[clause
  [UNORDEREDGROUP
    [phrase function=Pred
      [word focus lex="NTN["]
    ]
    [phrase function=Subj
      [word focus lex="HJ>"]
    ]
  ]
]
```

In [23]:
(verses, words) = getShebanqData(A, MQL_RESULTS, "10bm")

8 results in 4 verses with 8 words


In [24]:
query = """
clause
  phrase function=Pred
    word lex=NTN[
  phrase function=Subj
    =: word lex=HJ>
"""  

In [25]:
results = A.search(query)

  1.30s 4 results


In [26]:
A.table(results)

n,p,clause,phrase,word,phrase.1,word.1
1,Genesis 3:12,הִ֛וא נָֽתְנָה־לִּ֥י מִן־הָעֵ֖ץ,נָֽתְנָה־,נָֽתְנָה־,הִ֛וא,הִ֛וא
2,Genesis 38:14,וְהִ֕וא לֹֽא־נִתְּנָ֥ה לֹ֖ו לְאִשָּֽׁה׃,נִתְּנָ֥ה,נִתְּנָ֥ה,הִ֕וא,הִ֕וא
3,1_Samuel 18:19,וְהִ֧יא נִתְּנָ֛ה לְעַדְרִיאֵ֥ל הַמְּחֹלָתִ֖י לְאִשָּֽׁה׃,נִתְּנָ֛ה,נִתְּנָ֛ה,הִ֧יא,הִ֧יא
4,Daniel 11:6,וְתִנָּתֵ֨ן הִ֤יא וּמְבִיאֶ֨יהָ֙ וְהַיֹּ֣לְדָ֔הּ וּמַחֲזִקָ֖הּ בָּעִתִּֽים׃,תִנָּתֵ֨ן,תִנָּתֵ֨ן,הִ֤יא וּמְבִיאֶ֨יהָ֙ וְהַיֹּ֣לְדָ֔הּ וּמַחֲזִקָ֖הּ,הִ֤יא


In [27]:
(tfVerses, tfWords) = getTfVerses(A, results, (2, 4))

  4 verses
  8 words


In [28]:
compareResults(A, verses, words, tfVerses, tfWords)

VERSES EQUAL
WORDS EQUAL


When we look at a screenshot of the SHEBANQ results, we see that there are really only 4 results, although 8 are advertised.

![minimal](images/minimal.png)

**There is a bug in how SHEBANQ counts the results of queries, manifested when the query
in question makes use of the UNORDEREDGROUP construct.**

In SHEBANQ the results of a query are not actually retrieved, and they are not even counted.
Emdros, the engine that delivers the results of the MQL query to SHEBANQ, does not deliver actual results,
but a compact construct, called a *sheaf*, from which the results could be generated.

However, in SHEBANQ we have chosen not to display the results as such, but the set of words that occur in the results.
Emdros provides a convenient and fast function to do so.

The only thing we want to know about the results is how many there are.
I (Dirk Roorda) programmed a simple function to compute the number of results by inspecting the sheaf, without actually extracting
all those results.
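
To illustrate the idea of counting without enumerating, here is a minimal sketch on a toy model of a sheaf. It is a simplification under stated assumptions, not the function used in SHEBANQ and not the Emdros API: a sheaf is modelled as a list of straws, a straw as a list of matched objects, and each matched object as a `(label, inner_sheaf)` pair whose inner sheaf is empty when nothing is nested. Under that model the count is a sum over straws of a product over matched objects.

```python
from math import prod


def count_object(matched_object):
    """A matched object counts as 1, or as the count of its inner sheaf."""
    _label, inner = matched_object
    return count_results(inner) if inner else 1


def count_results(sheaf):
    """Sum over straws of the product over the matched objects in each straw."""
    return sum(prod(count_object(mo) for mo in straw) for straw in sheaf)


# Toy sheaf (made-up structure, not real query output).
word = ("word", [])
phrase_a = ("phrase", [[word], [word]])          # 2 ways to match inside
phrase_b = ("phrase", [[word], [word], [word]])  # 3 ways to match inside
clause_1 = ("clause", [[phrase_a, phrase_b]])    # 2 * 3 = 6 combinations
clause_2 = ("clause", [])                        # nothing nested: 1 result

toy_sheaf = [[clause_1], [clause_2]]
print(count_results(toy_sheaf))  # 7
```

The six combinations inside `clause_1` are never built. Only their number is computed, which is what keeps such a count cheap even when the real number of results is huge.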

It turns out that my recipe does not work correctly if the sheaf represents the results of a query with `UNORDEREDGROUP` in it.
This construct is a relatively late addition to MQL, and I need advice from the maker of Emdros,
[Ulrik Sandborg-Petersen](https://github.com/emg).

There is an underlying reason for this approach: it is easy to fire a badly designed query at SHEBANQ that has billions of results.
In Emdros, counting the results means generating them first. That takes prohibitively long for a website, and the query will
probably be discarded.
So I needed a quick way to compute the number of results, in order to avoid generating them when there are too many.

**To be continued**