In [None]:
from tf.app import use

In [1]:
# A = use('dss', hoist=globals())
# A = use('dss:hot', checkout='hot', hoist=globals())
A = use("dss:clone", checkout="clone", hoist=globals())

# Lexemes and occurrences

How do you find the word occurrences that are associated with a lexeme?

## Example

We find an example of a lexeme with only 5 occurrences.

In [2]:
rareLexemes = [lx for (lx, nOccs) in F.lex.freqList(nodeTypes={"word"}) if nOccs == 5]
len(rareLexemes)

399

We pick one of these as an example and get the nodes that have that lexeme, among which is the lexeme node itself.

In [3]:
exampleLx = rareLexemes[100]
print(f'Lexeme "{exampleLx}"')
nodes = F.lex.s(exampleLx)
for n in nodes:
    print(f"{F.otype.v(n)} {n} {F.fullo.v(n)} {F.full.v(n)}")

Lexeme "גֹּוג"
lex 1544074 None None
word 1626559 gwg גוג
word 1832664 [gwg  ]גוג
word 1974200 gwg גוג
word 2106500 gwg גוג
word 2106521 gwg גוג


We identify the lexeme node as `lx` and keep the word occurrence nodes as `givenOccs`.

In [4]:
lx = nodes[0]
givenOccs = nodes[1:]
print(f"lexeme {lx} is {F.lexo.v(lx)} = {F.lex.v(lx)}")

lexeme 1544074 is g…øwg = גֹּוג


Now we can try out ways to get from `lx` to `givenOccs` and check whether we do it right.

# Via `L.d()`

If you are used to working with the [BHSA](https://github.com/etcbc/bhsa),
you know that you can just use the locality-down operator to get from a lexeme node to the word nodes it contains.

So let's try that here.

In [5]:
lwords = L.d(lx, otype="word")
lwords

(1626559, 1832664, 1974200, 2106500, 2106521)

That seems to work, but let's make it crystal clear:

In [6]:
lwords == givenOccs

True
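
The `==` check above only succeeds when both sides are sequences of the same type, holding the same nodes in the same order. Below is a minimal sketch of an order-insensitive comparison. The helper `same_nodes` is hypothetical and not part of Text-Fabric; the node numbers are copied from the output above.

```python
def same_nodes(left, right):
    """Return True if both iterables hold exactly the same node numbers."""
    return sorted(left) == sorted(right)


# Node numbers copied from the example lexeme's occurrences
given = (1626559, 1832664, 1974200, 2106500, 2106521)
# The same nodes, but as a list and in a different order
found = [2106521, 1626559, 1832664, 1974200, 2106500]

print(given == found)            # False: a tuple never equals a list
print(same_nodes(given, found))  # True: same nodes, order ignored
```

In the notebook this would be called as `same_nodes(lwords, givenOccs)`.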

We can also go upwards:

In [7]:
for w in lwords:
    llxs = L.u(w, otype="lex")
    for llx in llxs:
        print(f"{w} => {llx} {llx == lx}")

1626559 => 1544074 True
1832664 => 1544074 True
1974200 => 1544074 True
2106500 => 1544074 True
2106521 => 1544074 True


# Via `E.occ`

There is an edge feature that links each lexeme node to its word occurrence nodes: `occ`. It is directed from the lexeme to its occurrences.

In [8]:
ewords = E.occ.f(lx)
ewords

(1626559, 1832664, 1974200, 2106500, 2106521)

In [9]:
ewords == givenOccs

True

We can also go back:

In [10]:
for w in ewords:
    elxs = E.occ.t(w)
    for elx in elxs:
        print(f"{w} => {elx} {elx == lx}")

1626559 => 1544074 True
1832664 => 1544074 True
1974200 => 1544074 True
2106500 => 1544074 True
2106521 => 1544074 True


# Queries

When you hand-code the lookup, both methods give the same outcome and
probably perform equally well.

Let's see how we can work with lexemes in queries.

## From lexeme to occurrence

We find the occurrences for a lexeme.

In [11]:
lresults = A.search(
    """
lex lexo=g…øwg
  word
"""
)
lresults

  0.13s 5 results


[(1544074, 1626559),
 (1544074, 1832664),
 (1544074, 1974200),
 (1544074, 2106500),
 (1544074, 2106521)]

In [12]:
eresults = A.search(
    """
lex lexo=g…øwg
-occ> word
"""
)
eresults

  0.11s 5 results


[(1544074, 1626559),
 (1544074, 1832664),
 (1544074, 1974200),
 (1544074, 2106500),
 (1544074, 2106521)]

Again, the results are equivalent, but the query using the edge `occ` seems slightly faster (0.11s against 0.13s).

Let's check that by querying *all* lexemes and *all* occurrences!

In [13]:
alresults = A.search(
    """
lex
  word
"""
)
len(alresults)

  1.44s 470845 results


470845

In [14]:
aeresults = A.search(
    """
lex
-occ> word
"""
)
len(aeresults)

  1.63s 470845 results


470845

Well, there seems to be no significant difference: 1.44s against 1.63s, this time with the containment query ahead.
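
Single-run timings like the ones above are noisy. Here is a minimal sketch of a fairer comparison that repeats a call and keeps the best time. The helper `best_of` is hypothetical, and the workload is a stand-in so that the sketch runs without a corpus.

```python
import time


def best_of(func, repeats=3):
    """Call func() `repeats` times and return the shortest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload; in the notebook you could pass
# lambda: A.search(query) for each of the two queries instead.
elapsed = best_of(lambda: sum(range(100_000)))
print(f"{elapsed:.4f}s")
```

Taking the minimum rather than the mean reduces the influence of unrelated activity on the machine.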

### Occurrences with an uncertain sign

Now we make the query a bit more complex: we want the lexemes of occurrences with an uncertain sign.

In [15]:
lresults = A.search(
    """
lex
  word
    sign unc
"""
)
len(lresults)

  1.96s 92247 results


92247

In [16]:
eresults = A.search(
    """
lex
-occ> word
    sign unc
"""
)
len(eresults)

  1.57s 92247 results


92247

**Warning** The data in version 0.4 leads to different results, because it contains generated empty words
between brackets that have no other material.
Later versions of the data no longer contain such words.
You can use the incantation with `checkout="hot"` to get the latest data.
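
When two queries that should agree return different results, it helps to see which result tuples occur on one side only. A minimal sketch, with a hypothetical helper `diff_results` and made-up node numbers:

```python
def diff_results(first, second):
    """Return (only_in_first, only_in_second) as sorted lists of result tuples."""
    first_set, second_set = set(first), set(second)
    return sorted(first_set - second_set), sorted(second_set - first_set)


# Toy result lists with made-up node numbers:
# the containment query finds one word more than the edge query.
containment = [(10, 101), (10, 102), (10, 103)]
edge = [(10, 101), (10, 102)]

only_containment, only_edge = diff_results(containment, edge)
print(only_containment)  # [(10, 103)]
print(only_edge)         # []
```

In the notebook this would be called as `diff_results(lresults, eresults)`.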

## From occurrence to lexeme

Suppose we want to find the lexeme nodes of all words in a line that has an uncertain sign in it.

This is more difficult, because the extra condition is not on the word but on the line, and lexemes and lines
do not embed each other in general.

We have to find the lines first, then the words in them, and then the lexemes associated with those words.
In other words, we have to go from word to lexeme.

In [17]:
lquery = """
line
  sign unc
  w:word

lex
  w
"""

Explanation:

We look for lines that contain an uncertain sign, together with an arbitrary word in such a line and the lexeme that contains that word.

In [18]:
lresults = A.search(lquery)

  7.91s 1096153 results


In [19]:
equery = """
line
  sign unc
  w:word

lex
-occ> w
"""

In [20]:
eresults = A.search(equery)

  7.94s 1096153 results
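
The result count is high because every result combines a line, one of its uncertain signs, a word and a lexeme, so a line with several uncertain signs repeats each of its words. Below is a minimal sketch that collapses such results to distinct (word, lexeme) pairs. It assumes the result tuples follow the order of the node lines in the template, that is `(line, sign, word, lex)`. The helper `word_lexeme_pairs` is hypothetical and the node numbers are made up.

```python
def word_lexeme_pairs(results, word_pos=2, lex_pos=3):
    """Collapse search results to the distinct (word, lexeme) pairs they contain."""
    return sorted({(r[word_pos], r[lex_pos]) for r in results})


# Toy results shaped like (line, sign, word, lex), an assumed tuple layout.
# The line has two uncertain signs (11 and 12), so each word shows up twice.
toy = [
    (1, 11, 21, 31),
    (1, 12, 21, 31),
    (1, 11, 22, 32),
    (1, 12, 22, 32),
]

pairs = word_lexeme_pairs(toy)
print(pairs)  # [(21, 31), (22, 32)]
```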


# Advice

The use of `E.occ` is preferred over `L.d`, because the `occ` edge is defined to be the relationship between lexemes and their
occurrences.

In contrast, `L.d` is defined to be the relationship between a node and all nodes that are contained in it, slot-wise.
That makes `L.d` a bit brittle as a proxy for the lexeme-occurrence relationship.

Version 0.4 made that clear: we smuggled in some empty words for reasons not related to lexemes, and they somehow
ended up in the scope of a lexeme.
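
The difference between the two relationships can be illustrated with a toy model. This sketch only mimics the idea: it does not use Text-Fabric, all node and slot numbers are made up, and the function `down` is a hypothetical stand-in for a containment lookup.

```python
# Toy model: each node is linked to the set of slots it occupies.
lexeme_slots = {900: {1, 3}}   # lexeme node 900 covers slots 1 and 3
word_slots = {
    501: {1},                  # genuine occurrence of the lexeme
    502: {3},                  # genuine occurrence of the lexeme
    503: {3},                  # stray word that happens to sit on slot 3
}
# Explicit lexeme -> occurrence edges, playing the role of `occ`
occ = {900: (501, 502)}


def down(container_slots, candidates):
    """Containment lookup: all candidates whose slots lie inside container_slots."""
    return tuple(sorted(n for n, s in candidates.items() if s <= container_slots))


by_containment = down(lexeme_slots[900], word_slots)
by_edge = occ[900]

print(by_containment)  # (501, 502, 503): the stray word is included
print(by_edge)         # (501, 502): only the declared occurrences
```

The containment lookup picks up any word that falls within the lexeme's slots, whereas the edge returns only what was declared as an occurrence.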