# Lexeme features

This notebook looks up all features defined on lexeme nodes and spreads their values over all occurrences of each lexeme, if that has not been done already.

Then we save the spread values, together with the original values on the lexeme nodes, to new feature files with the same names.

Dirk Roorda, computing in the library of St John's College, Cambridge

and

Cody Kingham, computing in the library of St John's College, Cambridge

on 2019-01-31

In [41]:
import os
import collections

from tf.fabric import Fabric
from tf.app import use


Local topography.

In [42]:
BASE = os.path.expanduser("~/github")
ORG = "etcbc"
REPO = "bhsa"
VERSION = "c"

REPO_PATH = f"{BASE}/{ORG}/{REPO}"
TF_IN = f"{REPO_PATH}/tf/{VERSION}"
TF_OUT = f"{REPO_PATH}/_temp/lex/{VERSION}"

The culprits: the lexeme features whose values we are going to spread.

In [7]:
lexFeatures = """
  gloss
  nametype
  voc_lex
  voc_lex_utf8
""".strip().split()

Same metadata as before.

In [16]:
generic = dict(
    author="Eep Talstra Centre for Bible and Computer",
    dataset="BHSA",
    datasetName="Biblia Hebraica Stuttgartensia Amstelodamensis",
    email="shebanq@ancient-data.org",
    encoders="Constantijn Sikkel (QDF), and Dirk Roorda (TF)",
    version="c",
    website="https://shebanq.ancient-data.org",
)

In [17]:
featureMeta = {feat: dict(valueType="str") for feat in lexFeatures}

In [18]:
metaData = {"": generic}
metaData.update(featureMeta)
metaData

{'': {'author': 'Eep Talstra Centre for Bible and Computer',
  'dataset': 'BHSA',
  'datasetName': 'Biblia Hebraica Stuttgartensia Amstelodamensis',
  'email': 'shebanq@ancient-data.org',
  'encoders': 'Constantijn Sikkel (QDF), and Dirk Roorda (TF)',
  'version': 'c',
  'website': 'https://shebanq.ancient-data.org'},
 'gloss': {'valueType': 'str'},
 'nametype': {'valueType': 'str'},
 'voc_lex': {'valueType': 'str'},
 'voc_lex_utf8': {'valueType': 'str'}}
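A small sketch of how this dictionary is meant to be read. It assumes, as I understand Text-Fabric's `save()`, that the entry under the empty key applies to every feature and that the per-feature entries are added on top of it. The helper `effective_meta` is hypothetical and not part of Text-Fabric; the data is a shortened stand-in for the real metadata.

```python
# Shortened stand-in for the real metadata (assumption: values are illustrative only).
generic = {"dataset": "BHSA", "version": "c"}
lex_features = ["gloss", "nametype"]

meta = {"": generic}
meta.update({feat: {"valueType": "str"} for feat in lex_features})


def effective_meta(meta, feat):
    """Hypothetical helper: generic metadata overlaid with the feature's own metadata."""
    return {**meta.get("", {}), **meta.get(feat, {})}


gloss_meta = effective_meta(meta, "gloss")
print(gloss_meta)
```

The feature-specific keys win when a key occurs in both places, which is the behaviour one would want if a single feature ever needed a different `valueType`.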

Load the original features.

In [19]:
TFin = Fabric(locations=TF_IN)

This is Text-Fabric 7.4.4
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

114 features found and 0 ignored


In [20]:
api = TFin.load(lexFeatures)
api.makeAvailableIn(globals())

  0.00s loading features ...
   |     0.01s B voc_lex_utf8         from /Users/dirk/github/etcbc/bhsa/tf/c
   |     0.01s B gloss                from /Users/dirk/github/etcbc/bhsa/tf/c
   |     0.00s B nametype             from /Users/dirk/github/etcbc/bhsa/tf/c
   |     0.01s B voc_lex              from /Users/dirk/github/etcbc/bhsa/tf/c
  3.58s All features loaded/computed - for details use loadLog()


[('Computed',
  'computed-data',
  ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),
 ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),
 ('Fabric', 'loading', ('ensureLoaded', 'TF', 'ignored', 'loadLog')),
 ('Locality', 'locality', ('L Locality',)),
 ('Misc', 'messaging', ('cache', 'error', 'indent', 'info', 'reset')),
 ('Nodes',
  'navigating-nodes',
  ('N Nodes', 'sortKey', 'sortKeyTuple', 'otypeRank', 'sortNodes')),
 ('Features',
  'node-features',
  ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),
 ('Search', 'search', ('S Search',)),
 ('Text', 'text', ('T Text',))]

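Hard labour: for each feature, copy the value on each lexeme node to all word occurrences of that lexeme, and keep the value on the lexeme node itself.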

In [24]:
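# feature name -> {node: value}, for lexeme nodes and their occurrences alike
nodeFeatures = collections.defaultdict(dict)

for feat in lexFeatures:
    print(f"{feat} ...")
    for lx in F.otype.s("lex"):
        value = Fs(feat).v(lx)
        if value is not None:
            # copy the lexeme value to every word that is an occurrence of lx
            for w in L.d(lx, otype="word"):
                nodeFeatures[feat][w] = value
            # keep the original value on the lexeme node itself
            nodeFeatures[feat][lx] = value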

gloss ...
nametype ...
voc_lex ...
voc_lex_utf8 ...
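The spreading step can be tried out without loading the BHSA. The sketch below uses plain dictionaries in place of the TF API: `occurrences` stands in for `L.d(lx, otype="word")` and `lex_values` for the feature values on lexeme nodes. All node numbers and values are made up for illustration.

```python
import collections

# Hypothetical toy data: lexeme node -> word nodes that are its occurrences.
occurrences = {
    101: [1, 4],
    102: [2, 3, 5],
}

# Hypothetical feature values, defined on lexeme nodes only.
lex_values = {
    "gloss": {101: "king", 102: "earth"},
    "nametype": {102: "topo"},
}


def spread(lex_values, occurrences):
    """Copy each lexeme value to all occurrences, keeping it on the lexeme too."""
    node_features = collections.defaultdict(dict)
    for feat, values in lex_values.items():
        for lx, value in values.items():
            for w in occurrences.get(lx, ()):
                node_features[feat][w] = value
            node_features[feat][lx] = value
    return node_features


spread_values = spread(lex_values, occurrences)
print(dict(spread_values["gloss"]))
print(dict(spread_values["nametype"]))
```

Lexemes without a value for a feature (here lexeme 101 for `nametype`) leave their occurrences without a value as well, just like the `if value is not None` test in the real loop.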


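Fire up a new TF API for writing. The output directory does not contain any features yet, so the warnings about missing warp features are expected.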

In [25]:
TFout = Fabric(locations=TF_OUT)

This is Text-Fabric 7.4.4
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

0 features found and 0 ignored


  0.00s Warp feature "otype" not found in
/Users/dirk/github/etcbc/bhsa/_temp/lex/c/
  0.00s Warp feature "oslots" not found in
/Users/dirk/github/etcbc/bhsa/_temp/lex/c/


  0.00s Warp feature "otext" not found. Working without Text-API



Save the features.

In [27]:
TFout.save(nodeFeatures=nodeFeatures, edgeFeatures={}, metaData=metaData)

  0.00s Exporting 4 node and 0 edge and 0 config features to /Users/dirk/github/etcbc/bhsa/_temp/lex/c:
   |     0.56s T gloss                to /Users/dirk/github/etcbc/bhsa/_temp/lex/c
   |     0.05s T nametype             to /Users/dirk/github/etcbc/bhsa/_temp/lex/c
   |     0.51s T voc_lex              to /Users/dirk/github/etcbc/bhsa/_temp/lex/c
   |     0.56s T voc_lex_utf8         to /Users/dirk/github/etcbc/bhsa/_temp/lex/c
  1.69s Exported 4 node features and 0 edge features and 0 config features to /Users/dirk/github/etcbc/bhsa/_temp/lex/c


True
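A quick way to confirm that the export left one file per feature on disk. This is a sketch, assuming that each feature ends up in a file named after the feature with the extension `.tf`; the helper `missing_feature_files` is hypothetical, and the demonstration runs against a temporary directory rather than the real `TF_OUT`.

```python
import os
import tempfile


def missing_feature_files(directory, features):
    """Hypothetical helper: features that have no `<feature>.tf` file in `directory`."""
    return [
        feat
        for feat in features
        if not os.path.isfile(os.path.join(directory, f"{feat}.tf"))
    ]


# Demonstration on a throw-away directory with two dummy feature files.
with tempfile.TemporaryDirectory() as tmp:
    for feat in ("gloss", "nametype"):
        with open(os.path.join(tmp, f"{feat}.tf"), "w") as fh:
            fh.write("@node\n")
    missing = missing_feature_files(tmp, ["gloss", "nametype", "voc_lex"])

print(missing)
```

In this notebook the call would be `missing_feature_files(TF_OUT, lexFeatures)`, which should give an empty list after a successful save.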

# Test the new features

We load the TF API the modern way and let it download the newest release of the data.

In [38]:
A = use("bhsa", hoist=globals(), check=True)

TF app is up-to-date.
Using annotation/app-bhsa commit 43c1c5e88b371f575cdbbf57e38167deb8725f7f (=latest)
  in /Users/dirk/text-fabric-data/__apps__/bhsa.
	downloading etcbc/bhsa - c rv1.6
	from https://github.com/etcbc/bhsa/releases/download/v1.6/tf-c.zip ...
	unzipping ...
	saving etcbc/bhsa - c rv1.6
	saved etcbc/bhsa - c rv1.6
Using etcbc/bhsa/tf - c rv1.6 (=latest) in /Users/dirk/text-fabric-data
No new data release available online.
Using etcbc/phono/tf - c r1.2 (=latest) in /Users/dirk/text-fabric-data.
No new data release available online.
Using etcbc/parallels/tf - c r1.2 (=latest) in /Users/dirk/text-fabric-data.
   |      |     0.67s C __levels__           from otype, oslots, otext
   |      |       12s C __order__            from otype, oslots, __levels__
   |      |     0.69s C __rank__             from otype, __order__
   |      |       11s C __levUp__            from otype, oslots, __levels__, __rank__
   |      |     8.57s C __levDown__          from otype, __levUp__, _

**Documentation:** [BHSA](https://etcbc.github.io/bhsa) [Character table](https://annotation.github.io/text-fabric/writing/hebrew.html) [Feature docs](https://etcbc.github.io/bhsa/features/hebrew/c/0_home.html) [bhsa API](https://github.com/annotation/app-bhsa) [Text-Fabric API 7.4.4](https://annotation.github.io/text-fabric/Api/Fabric/) [Search Reference](https://annotation.github.io/text-fabric/Use/Search/)

A few test queries by Cody Kingham, who urged me to copy the lexeme values to the occurrences.

In [39]:
query = r"""
word g_word~^.@(?!\d) voc_lex~^.E.E.$
"""

query2 = r"""
word g_word~^.@(?!\d)
"""

query3 = r"""
word voc_lex~^.E.E.$
"""

In [40]:
results = A.search(query)

  0.57s 23 results


In [33]:
results = A.search(query2)

  0.57s 40237 results


In [35]:
results = A.search(query3)

  0.35s 0 results
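The two regular expressions in the test queries can be tried out in isolation with Python's `re` module. This is a sketch only: it assumes that a `feature~regex` condition in a search template behaves like `re.search` on the feature value, and the sample strings are invented to show the patterns at work; they are not taken from the data.

```python
import re

# Second character is "@", and that "@" is not followed by a digit.
G_WORD_RE = re.compile(r"^.@(?!\d)")

# Exactly five characters, with "E" in second and fourth position.
VOC_LEX_RE = re.compile(r"^.E.E.$")


def matches(pattern, value):
    """True if the pattern is found in the value (re.search semantics)."""
    return pattern.search(value) is not None


# Hypothetical sample values, for illustration only.
g_word_samples = ["D@B@R", "D@71B@R", "BD@R"]
voc_lex_samples = ["MELEK", "MELEKH", "MALEK"]

g_word_hits = [v for v in g_word_samples if matches(G_WORD_RE, v)]
voc_lex_hits = [v for v in voc_lex_samples if matches(VOC_LEX_RE, v)]

print(g_word_hits)
print(voc_lex_hits)
```

The first query combines both conditions on the same word node, so it can only give results once `voc_lex` has values on word nodes and not just on lexeme nodes.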
