<img align="right" src="images/tf.png" width="128"/>
<img align="right" src="images/logo.png" width="128"/>
<img align="right" src="images/etcbc.png" width="128"/>
<img align="right" src="images/dans.png" width="128"/>

---

To get started: consult [start](start.ipynb)

---

# Sharing data features

## Explore additional data

When you analyse a corpus, you are likely to produce data that others can reuse.
Maybe you have defined a set of proper name occurrences, or special numerals, or you have computed part-of-speech assignments.

It is possible to turn these insights into *new features*, i.e. new `.tf` files with values assigned to specific nodes.

## Make your own data

New data is a product of your own methods and computations in the first place.
But how do you turn that data into new TF features?
It turns out that the last step is not that difficult.

If you can shape your data as a mapping (dictionary) from node numbers (integers) to values
(strings or integers), then TF can turn that data into a feature file for you with one command.
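
As a minimal sketch of the shape that is meant here: the node numbers and values below are made up, and `is_feature_mapping` is a hypothetical helper, not part of TF.

```python
# Toy data: node numbers and values are made up for illustration.
proper_names = {
    1001: "person",
    1002: "place",
    1017: "person",
}


def is_feature_mapping(data):
    """Return True if `data` maps integer nodes to string or integer values."""
    return all(
        isinstance(node, int) and isinstance(value, (str, int))
        for node, value in data.items()
    )


print(is_feature_mapping(proper_names))
```

A dictionary that passes this check has the shape described above; a float value or a string key would not.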

## Share your new data

You can then easily share your new features on GitHub, so that your colleagues everywhere
can try them out for themselves.

You can add such data on the fly, by passing a `mod={org}/{repo}/{path}` parameter to `use()`,
or several of them separated by commas.

If the data is there, it will be auto-downloaded and stored on your machine.
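
A small sketch of how such a `mod` value can be put together. The helper `mod_spec` is hypothetical, and the second module in the example is made up purely to show the comma-separated form.

```python
def mod_spec(*modules):
    """Join (org, repo, path) triples into one comma-separated `mod` value."""
    return ",".join(f"{org}/{repo}/{path}" for org, repo, path in modules)


# The first triple is the module built later in this chapter;
# the second one is made up to show how several modules are combined.
spec = mod_spec(
    ("etcbc", "dss", "exercises/tf"),
    ("someorg", "somerepo", "somepath/tf"),
)
print(spec)
```

The resulting string is what would be passed as the `mod` argument.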

Let's do it.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
from tf.app import use

In [3]:
A = use("etcbc/dss", hoist=globals())

This is Text-Fabric 9.2.2
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

67 features found and 1 ignored


# Making data

We illustrate the data creation part by creating a new feature, `cert`.
The idea is that we mark every consonant sign for certainty.

A certain consonant gets `cert=100`.

If the consonant has the uncertain feature `unc`, then 10 times its value is subtracted from 100.

If the consonant has the feature `rec`, it loses 45 points.

Ancient removal `rem2` leads to minus 20, modern removal `rem` to minus 40.

Ancient correction `cor2` leads to minus 12, modern correction `cor` to minus 18.

Alternate marking `alt` leads to minus 25.

The minimum is `1`.

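We extend the `cert` measure to words, lines, fragments and scrolls.
Each of them gets the rounded average of the `cert` values of the units directly below it:
words average over their consonant signs, lines over their words, fragments over their lines, and scrolls over their fragments.
Units without any measured content get no `cert` value.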

In [4]:
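def measure(s):
    # start from full certainty and subtract a penalty per feature
    c = 100
    # uncertainty: 10 points per degree of `unc`
    d = F.unc.v(s)
    if d:
        c -= 10 * d
    # `rec`: flat penalty of 45
    d = F.rec.v(s)
    if d:
        c -= 45
    # removal: value 1 (modern) costs 40, value 2 (ancient) costs 20
    d = F.rem.v(s)
    if d == 1:
        c -= 40
    elif d == 2:
        c -= 20
    # correction: values 2 (ancient) and 3 cost 12, value 1 (modern) costs 18
    d = F.cor.v(s)
    if d == 2 or d == 3:
        c -= 12
    elif d == 1:
        c -= 18
    # alternate marking: flat penalty of 25
    d = F.alt.v(s)
    if d:
        c -= 25
    # never go below the minimum of 1
    if c < 1:
        c = 1
    return c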
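
To check the arithmetic without loading a corpus, here is a sketch of the same rules as a plain function. The name `score` and its keyword arguments are hypothetical; it assumes that feature values arrive as plain integers, with `0` meaning that the feature is absent.

```python
def score(unc=0, rec=0, rem=0, cor=0, alt=0):
    """Certainty of one consonant sign, given plain integer feature values.

    Assumption: 0 stands for "feature absent" (TF itself would give None).
    """
    c = 100
    c -= 10 * unc                           # 10 points per degree of `unc`
    if rec:
        c -= 45                             # flat penalty for `rec`
    c -= {1: 40, 2: 20}.get(rem, 0)         # removal: 1 costs 40, 2 costs 20
    c -= {1: 18, 2: 12, 3: 12}.get(cor, 0)  # correction: 1 costs 18, 2 and 3 cost 12
    if alt:
        c -= 25                             # flat penalty for `alt`
    return max(c, 1)                        # the minimum is 1


print(score())                      # no penalties
print(score(rec=1))                 # 100 - 45
print(score(unc=3, rec=1, rem=1))   # 100 - 30 - 45 - 40, clipped to the minimum
```

The three calls give 100, 55 and 1 respectively.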

In [5]:
CONS = "cons"
cert = {}

A.indent(reset=True)

for sc in F.otype.s("scroll"):
    fN = 0
    fSum = 0
    for f in L.d(sc, otype="fragment"):
        lN = 0
        lSum = 0
        for ln in L.d(f, otype="line"):
            wN = 0
            wSum = 0
            for w in L.d(ln, otype="word"):
                sN = 0
                sSum = 0
                for s in L.d(w, otype="sign"):
                    if F.type.v(s) != CONS:
                        continue
                    sCert = measure(s)
                    cert[s] = sCert
                    sN += 1
                    sSum += sCert
                if sN:
                    wCert = int(round(sSum / sN))
                    cert[w] = wCert
                    wN += 1
                    wSum += wCert
            if wN:
                lCert = int(round(wSum / wN))
                cert[ln] = lCert
                lN += 1
                lSum += lCert
        if lN:
            fCert = int(round(lSum / lN))
            cert[f] = fCert
            fN += 1
            fSum += fCert
    if fN:
        scCert = int(round(fSum / fN))
        cert[sc] = scCert

A.info(f"{len(cert)} certainties determined")

  3.85s 1625373 certainties determined
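
The loop averages level by level, which is not the same as averaging over all signs at once. A sketch with made-up sign certainties for one line of two words (the helper `mean_cert` is hypothetical):

```python
def mean_cert(values):
    """Rounded average, or None when there is nothing to average."""
    return round(sum(values) / len(values)) if values else None


# Made-up sign certainties: one line consisting of two words.
line = [[100, 100, 55], [55, 55]]

word_certs = [mean_cert(word) for word in line]             # per word
line_cert = mean_cert(word_certs)                           # average of word averages
flat_cert = mean_cert([c for word in line for c in word])   # average over all signs

print(word_certs, line_cert, flat_cert)
```

Here the words score 85 and 55, so the line gets 70, whereas a flat average over the five signs would give 73. Short words weigh as much as long ones in the level-by-level approach.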


# Saving data

The [documentation](https://annotation.github.io/text-fabric/tf/core/fabric.html#tf.core.fabric.FabricCore.save) explains how to save this data into a text-fabric
data file.

We choose a location where to save it: the `exercises` folder in the `dss` repository in the `etcbc` organization.

We first define the parts of the output location; further down they are passed to `TF.save()` in its `location` and `module` parameters.

In [6]:
GITHUB = os.path.expanduser("~/github")
ORG = "etcbc"
REPO = "dss"
PATH = "exercises"
VERSION = A.version

Note the version: we have built the feature against a specific version of the data:

In [7]:
A.version

'0.9'

Later on, we pass this version on, so that users of our data will get the shared data in exactly the same version as their core data.

We have to specify a bit of metadata for this feature:

In [8]:
metaData = {
    "cert": dict(
        valueType="int",
        description="measure of certainty of material, between 1 and 100 (most certain)",
        creator="Dirk Roorda",
    ),
}
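
The metadata promises integers between 1 and 100. Before saving, it can be worth checking that the data keeps that promise. A minimal sketch with a hypothetical helper `check_cert` and made-up sample data:

```python
def check_cert(data, low=1, high=100):
    """Return the nodes whose value is not an int within [low, high]."""
    return [
        node
        for node, value in data.items()
        if not isinstance(value, int) or not low <= value <= high
    ]


# Made-up sample: node 13 is out of range, node 14 has a string value.
sample = {11: 100, 12: 55, 13: 0, 14: "80"}
print(check_cert(sample))
```

An empty list means that every value fits the declared `valueType` and range.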

Now we can give the save command:

In [9]:
TF.save(
    nodeFeatures=dict(cert=cert),
    metaData=metaData,
    location=f"{GITHUB}/{ORG}/{REPO}/{PATH}/tf",
    module=VERSION,
)

  0.00s Exporting 1 node and 0 edge and 0 config features to ~/github/etcbc/dss/exercises/tf/0.9:
   |     1.31s T cert                 to ~/github/etcbc/dss/exercises/tf/0.9
  1.31s Exported 1 node features and 0 edge features and 0 config features to ~/github/etcbc/dss/exercises/tf/0.9


True

# Sharing data

How to share your own data is explained in the
[documentation](https://annotation.github.io/text-fabric/tf/about/datasharing.html).

Here we show it step by step for the `cert` feature.

If you commit your changes to the repo and run `git push origin master`,
you have already shared your data!

If you want to make a stable release, so that you can keep developing, while your users fall back
on the stable data, you can make a new release.

Go to the GitHub website for that, go to your repo, and click *Releases* and follow the nudges.

If you want to make it even smoother for your users, you can zip the data and attach it as a binary to the release just created.

We need to zip the data in exactly the right directory structure. Text-Fabric can do that for us:

In [10]:
%%sh

text-fabric-zip etcbc/dss/exercises/tf

This is a TF dataset
Create release data for etcbc/dss/exercises/tf
Found 5 versions
zip files end up in ~/Downloads/etcbc-release/dss
zipping etcbc/dss                  0.3 with   1 features ==> exercises-tf-0.3.zip
zipping etcbc/dss                  0.4 with   1 features ==> exercises-tf-0.4.zip
zipping etcbc/dss                  0.5 with   1 features ==> exercises-tf-0.5.zip
zipping etcbc/dss                  0.6 with   1 features ==> exercises-tf-0.6.zip
zipping etcbc/dss                  0.9 with   1 features ==> exercises-tf-0.9.zip


All versions have been zipped, but it works OK if you only attach the newest version to the newest release.

If a user asks for an older version in this release, the system can still find it.

# Use the data

We can use the data by calling it up when we say `use("etcbc/dss", ...)`.

Here is how:

(drop the `:clone` suffix if the data is really published,
keep it if you want to test your local copy of the feature).
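
The two variants differ only in the `:clone` suffix of the module specifier. A sketch with a hypothetical helper, `exercise_module`, that produces either form:

```python
def exercise_module(local=False):
    """Module specifier for the `cert` feature.

    local=True  -> test the copy in your local clone
    local=False -> use the published data
    """
    base = "etcbc/dss/exercises/tf"
    return f"{base}:clone" if local else base


print(exercise_module(local=True))
print(exercise_module())

# Either value can then be passed on, for example:
# A = use("etcbc/dss", hoist=globals(), mod=exercise_module(local=True))
```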

In [11]:
A = use(
    "etcbc/dss", hoist=globals(), mod="etcbc/dss/exercises/tf:clone"
)

This is Text-Fabric 9.2.2
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

68 features found and 1 ignored
   |     6.84s T cert                 from ~/github/etcbc/dss/exercises/tf/0.9


Above you see a new section in the feature list: **etcbc/dss/exercises/tf** with our foreign feature in it: `cert`.

Now, suppose we did not know much about this feature, then we would like to do a few basic checks:

In [12]:
F.cert.freqList()

((100, 730507),
 (55, 667515),
 (60, 50110),
 (80, 38107),
 (70, 20927),
 (90, 18767),
 (88, 6423),
 (85, 4794),
 (78, 4499),
 (87, 4226),
 (95, 4070),
 (93, 3997),
 (92, 3529),
 (58, 3017),
 (82, 2839),
 (98, 2833),
 (72, 2655),
 (68, 2628),
 (89, 2562),
 (56, 2510),
 (65, 2496),
 (96, 2436),
 (57, 2358),
 (97, 2275),
 (62, 2117),
 (63, 2089),
 (66, 2076),
 (75, 2011),
 (73, 1943),
 (94, 1904),
 (83, 1818),
 (84, 1796),
 (91, 1766),
 (79, 1762),
 (61, 1759),
 (64, 1754),
 (86, 1694),
 (74, 1549),
 (59, 1479),
 (77, 1402),
 (43, 1401),
 (67, 1284),
 (81, 1272),
 (76, 1234),
 (69, 1193),
 (71, 1118),
 (99, 1068),
 (30, 566),
 (35, 352),
 (48, 230),
 (40, 133),
 (50, 106),
 (15, 100),
 (37, 67),
 (45, 49),
 (51, 29),
 (53, 23),
 (49, 22),
 (47, 18),
 (25, 17),
 (54, 14),
 (52, 12),
 (28, 11),
 (42, 11),
 (44, 9),
 (38, 8),
 (46, 7),
 (20, 6),
 (23, 3),
 (36, 3),
 (1, 1),
 (18, 1),
 (21, 1),
 (22, 1),
 (29, 1),
 (31, 1),
 (34, 1),
 (39, 1))
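
A frequency list of this shape can be sketched in plain Python. This is not TF's implementation; the tie-breaking by value is an assumption read off from the output above, and the sample values are made up.

```python
from collections import Counter


def freq_list(values):
    """Pairs (value, count), most frequent first, ties ordered by value."""
    counts = Counter(values)
    return tuple(sorted(counts.items(), key=lambda item: (-item[1], item[0])))


# Made-up certainty values for a handful of nodes.
sample = [100, 55, 100, 60, 55, 100, 80]
print(freq_list(sample))
```

For the sample this gives `((100, 3), (55, 2), (60, 1), (80, 1))`.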

Which nodes have the lowest certainty?

In [13]:
{F.otype.v(n) for n in N.walk() if F.cert.v(n) and F.cert.v(n) < 10}

{'sign'}

Only signs are this uncertain.

Let's look for pretty uncertain fragments:

In [14]:
results = A.search(
    """
fragment cert<50
"""
)

  0.01s 0 results


In [15]:
results = A.search(
    """
fragment cert<60
"""
)

  0.01s 380 results
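
The template asks for nodes of type `fragment` whose `cert` value is below 60. As a sketch of the same selection done by hand, with two made-up dictionaries standing in for the `otype` and `cert` features:

```python
# Made-up stand-ins for the `otype` and `cert` features.
otype = {1: "fragment", 2: "fragment", 3: "scroll", 4: "fragment"}
cert = {1: 58, 2: 73, 3: 55, 4: 60}


def select(node_type, below):
    """Nodes of the given type that have a cert value strictly below `below`."""
    return [
        n
        for n in sorted(otype)
        if otype[n] == node_type and n in cert and cert[n] < below
    ]


print(select("fragment", 60))
```

Only node 1 qualifies: node 4 sits exactly on the threshold and node 3 is a scroll.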


In [16]:
A.table(results, start=1, end=20)

| n | p | fragment |
|---|---|---|
| 1 | 1QSb f12 | 1QSb f12 |
| 2 | 1Q21 f3 | 1Q21 f3 |
| 3 | 2Q18 f1 | 2Q18 f1 |
| 4 | 2Q18 f2 | 2Q18 f2 |
| 5 | 2Q29 f1 | 2Q29 f1 |
| 6 | 4Q163 f26 | 4Q163 f26 |
| 7 | 4Q201 f1vi | 4Q201 f1vi |
| 8 | 4Q202 f1vi | 4Q202 f1vi |
| 9 | 4Q204 f1vii | 4Q204 f1vii |
| 10 | 4Q204 f5i | 4Q204 f5i |


Same for scrolls:

In [17]:
results = A.search(
    """
scroll cert<50
"""
)

  0.00s 0 results


In [18]:
results = A.search(
    """
scroll cert<60
"""
)

  0.00s 27 results


In [19]:
A.show(results)

Lines with a certainty below 57:

In [20]:
results = A.search(
    """
line cert<57
"""
)

  0.04s 2061 results


In [21]:
A.show(results, start=100, end=102)

With highlights and drilled down to sign level:

In [22]:
highlights = {}

for s in F.otype.s("sign"):
    if not F.cert.v(s):
        continue
    color = "lightsalmon" if F.cert.v(s) < 56 else "mediumaquamarine"
    highlights[s] = color

In [23]:
A.show(
    results,
    start=100,
    end=102,
    withNodes=True,
    condensed=True,
    highlights=highlights,
    baseTypes="sign",
)

# All together!

If more researchers have shared data modules, you can draw them all in.

Then you can design queries that use features from all these different sources.

In that way, you build your own research on top of the work of others.

Hover over the features to see where they come from, and you'll see they come from your local GitHub clone.

---

All chapters:

* **[start](start.ipynb)** get started with the corpus
* **[display](display.ipynb)** become an expert in creating pretty displays of your text structures
* **[search](search.ipynb)** turbo charge your hand-coding with search templates
* **[exportExcel](exportExcel.ipynb)** make tailor-made spreadsheets out of your results
* **share** draw in other people's data and let them use yours
* **[similarLines](similarLines.ipynb)** spot the similarities between lines

---

See the [cookbook](cookbook) for recipes for small, concrete tasks.

CC-BY Dirk Roorda