In [1]:
%load_ext autoreload
%autoreload 2

# Modify the punc features

2021-06-17

Sophie Arnoult has provided a file listing the slots where the `punc` feature needs an extra space.
We are going to make a new version of the data in which these spaces have been added.

First we sanitize the raw data file from Sophie. It contains all the relevant slots, but with repetitions.

In [2]:
import os

In [3]:
TF_VERSION_OLD = "0.8"
TF_VERSION_NEW = "0.8.1"
BASE = os.path.expanduser("~/github/Dans-labs/clariah-gm")
SPACE_IN = f"{BASE}/_local/Sophie Arnoult/{TF_VERSION_OLD}/punct_topad.pos"
SPACE_OUT = f"{BASE}/corrections/{TF_VERSION_OLD}/punct_topad.pos"

TF_DIR_OLD = f"{BASE}/tf/{TF_VERSION_OLD}"
TF_DIR_NEW = f"{BASE}/tf/{TF_VERSION_NEW}"

In [4]:
topad = set()

with open(SPACE_IN) as fh:
    for line in fh:
        topad.add(int(line.strip()))
        
print(f"{len(topad)} slots")
                  
with open(SPACE_OUT, "w") as fh:
    for n in sorted(topad):
        fh.write(f"{n}\n")

410307 slots


From now on we can use this much smaller data file.
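
As a small illustration of the sanitizing step, the sketch below deduplicates slot numbers from an in-memory list of lines instead of the real file. The function name `dedupeSlots` and the sample lines are made up for this example, and skipping blank lines is an extra safeguard that the cell above does not have.

```python
def dedupeSlots(lines):
    """Return the distinct slot numbers found in lines, sorted ascending."""
    slots = set()
    for line in lines:
        line = line.strip()
        if line:  # assumption: tolerate blank lines, e.g. a trailing newline
            slots.add(int(line))
    return sorted(slots)


# Hypothetical sample input with repetitions and one blank line
sample = ["12\n", "7\n", "12\n", "\n", "7\n", "30\n"]
print(dedupeSlots(sample))
```

Collecting into a set first and sorting only on output keeps the result independent of the order and multiplicity of the lines in the raw file.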

In [5]:
TOPAD = f"{BASE}/corrections/{TF_VERSION_OLD}/punct_topad.pos"

topad = set()

with open(TOPAD) as fh:
    for line in fh:
        topad.add(int(line.strip()))
        
print(f"{len(topad)} slots")

410307 slots


Now we create dicts for the modified features `punc`, `puncn`, `punco`, and `puncr`.

In [6]:
from tf.fabric import Fabric

In [7]:
featureList = """
punc
puncn
punco
puncr
""".strip().split()

In [8]:
TF = Fabric(locations=TF_DIR_OLD)
api = TF.load(featureList)

This is Text-Fabric 8.5.12
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

40 features found and 0 ignored
  0.00s loading features ...
    11s All features loaded/computed - for details use loadLog()


We now append a space to the values of these features, but only for the slots that are in 
the `topad` set, and only if the original value is present and not `None`.
We produce some statistics per feature.
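
The rule just described can be tried out on toy data first. This is a minimal sketch: the helper `paddedOrNone` and the values in `toyValues` and `toyTopad` are invented for illustration and are not taken from the corpus.

```python
def paddedOrNone(val, addSpace):
    """Return val with one space appended, or None if nothing should change."""
    if not addSpace or val is None or val.endswith(" "):
        return None
    return f"{val} "


# Toy data (hypothetical): slot -> feature value
toyValues = {1: ".", 2: None, 3: ", ", 4: "", 5: ";"}
# Toy set of slots that need a trailing space
toyTopad = {1, 2, 3, 4}

toyModified = {}
for (slot, val) in toyValues.items():
    newVal = paddedOrNone(val, slot in toyTopad)
    if newVal is not None:
        toyModified[slot] = newVal

print(toyModified)
```

Only slots 1 and 4 end up in `toyModified`: slot 2 has no value, slot 3 already ends in a space, and slot 5 is not in the set. Note that an empty string is a present value, so it becomes a single space.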

In [9]:
stats = {feat: dict(defined=0, added=0, skipped=0, already=0) for feat in featureList}
modified = {feat: {} for feat in featureList}

F = api.F
Fs = api.Fs

TF.indent(reset=True)
TF.info("Computing new feature data")

for w in F.otype.s("word"):
    addSpace = w in topad

    for feat in featureList:
        val = Fs(feat).v(w)
        info = stats[feat]
        if val is None:
            if addSpace:
                info["skipped"] += 1
        else:
            info["defined"] += 1
            if addSpace:
                if val.endswith(" "):
                    info['already'] += 1
                else:
                    info["added"] += 1
                    modified[feat][w] = f"{val} "

TF.info("Done")

for (feat, info) in sorted(stats.items()):
    print(
        f"{feat:<8}: {info['defined']:>7} defined values;"
        f" {info['added']:>7} spaces added;"
        f" {info['already']:>7} spaces already present;"
        f" {info['skipped']:>7} skipped"
    )

  0.00s Computing new feature data
    20s Done
punc    : 5316429 defined values;  410307 spaces added;       0 spaces already present;       0 skipped
puncn   :  208706 defined values;   22103 spaces added;       0 spaces already present;  388204 skipped
punco   : 3260840 defined values;  261314 spaces added;       0 spaces already present;  148993 skipped
puncr   : 1738050 defined values;  127262 spaces added;       0 spaces already present;  283045 skipped


Let's inspect a few new values of `puncn`:

In [10]:
for (w, val) in sorted(modified["puncn"].items())[0:10]:
    print(f"{w:>7} has puncn `{val}`")

     30 has puncn ` `
     60 has puncn ` `
     61 has puncn `. `
    179 has puncn `. `
    181 has puncn `, `
    230 has puncn `. `
    275 has puncn `. `
    529 has puncn ` `
    533 has puncn `. `
    549 has puncn ` `


Looks good.
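
The statistics printed above allow one more sanity check. For each feature, every slot in `topad` lands in exactly one of the categories added, already present, or skipped, so these three counts should sum to the 410307 slots. The sketch below redoes that arithmetic with the numbers copied from the output; the variable names `reported` and `nTopad` are chosen for this example only.

```python
# Number of slots in topad, as printed above
nTopad = 410307

# Counts copied from the statistics printed above
reported = {
    "punc": dict(added=410307, already=0, skipped=0),
    "puncn": dict(added=22103, already=0, skipped=388204),
    "punco": dict(added=261314, already=0, skipped=148993),
    "puncr": dict(added=127262, already=0, skipped=283045),
}

for (feat, info) in reported.items():
    accounted = info["added"] + info["already"] + info["skipped"]
    print(f"{feat:<8}: {accounted:>7} accounted for; complete: {accounted == nTopad}")
```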

Now we can compose the new dataset, by using the function
[modify](https://annotation.github.io/text-fabric/tf/compose/modify.html)
which turns one dataset into another one.

Note that the `addFeatures` data that we pass contains only the modified values of the relevant features.
When `modify` encounters existing features with the same name, it will use the new data as overrides.
The same holds for the metadata: the new metadata is merged with the old metadata.
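
As a mental model only, the override behaviour resembles merging two dicts where the second one wins on shared keys. The sketch below uses plain Python dicts with made-up values; it does not call Text-Fabric and is not a description of how `modify` is implemented.

```python
# Toy feature data (hypothetical): node -> value
oldPunc = {1: ".", 2: ",", 3: ";"}
overrides = {2: ", "}  # only the modified values are passed

# Values in overrides replace the old ones; all other nodes keep their value
newPunc = {**oldPunc, **overrides}
print(newPunc)

# Toy metadata (hypothetical keys and values)
oldMeta = dict(valueType="str", version="0.8")
extraMeta = dict(version="0.8.1", changelog="spaces added")

# New metadata keys are added, shared keys are overridden
newMeta = {**oldMeta, **extraMeta}
print(newMeta)
```

This is why passing a sparse dict of changes is enough: untouched nodes and untouched metadata keys carry over from the old version.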

In [11]:
from tf.compose import modify

In [12]:
metaPunc = dict(version=TF_VERSION_NEW, changelog="spaces added by Sophie Arnoult")

TF.indent(reset=True)
TF.info("Computing new feature data")

modify(
    TF_DIR_OLD,
    TF_DIR_NEW,
    addFeatures=dict(nodeFeatures=modified),
    featureMeta={p: metaPunc for p in modified},
)

TF.info("done")

  0.00s Computing new feature data
  0.00s preparing and checking ...
This is Text-Fabric 8.5.12
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

40 features found and 0 ignored
  0.00s loading features ...
    12s All features loaded/computed - for details use loadLog()
  0.00s loading features ...
  0.38s All features loaded/computed - for details use loadLog()
  0.00s loading features ...
  0.84s All additional features loaded - for details use loadLog()
   |       13s done
    13s add features ...
   |     0.05s done (added 4 node + 0 edge features)
   |      |     0.00s edge features: 
   |      |     0.00s node features: punc, puncn, punco, puncr
    14s applying updates ...
    18s write TF data ...
This is Text-Fabric 8.5.12
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

0 features found and 0 ignored
  0.00s Warp feature "otype" not found in
~/github/Dans-labs/clariah-gm/tf/0.8.1/
  0.00s Warp feature "oslots" not found

Let's do a check: we compare the `punc` feature in versions 0.8 and 0.8.1 and see what the differences are.

Note that on the first load of the new version, TF has to precompute its auxiliary data, which takes a few minutes (3m 39s in the run below).

In [13]:
TF81 = Fabric(locations=TF_DIR_NEW)
api81 = TF81.load(featureList)

This is Text-Fabric 8.5.12
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

40 features found and 0 ignored
  0.00s loading features ...
   |     1.89s T otype                from ~/github/Dans-labs/clariah-gm/tf/0.8.1
   |       17s T oslots               from ~/github/Dans-labs/clariah-gm/tf/0.8.1
   |     0.82s T n                    from ~/github/Dans-labs/clariah-gm/tf/0.8.1
   |     0.51s T puncn                from ~/github/Dans-labs/clariah-gm/tf/0.8.1
   |       14s T trans                from ~/github/Dans-labs/clariah-gm/tf/0.8.1
   |     0.60s T transn               from ~/github/Dans-labs/clariah-gm/tf/0.8.1
   |     7.04s T punco                from ~/github/Dans-labs/clariah-gm/tf/0.8.1
   |     8.85s T transo               from ~/github/Dans-labs/clariah-gm/tf/0.8.1
   |     4.73s T transr               from ~/github/Dans-labs/clariah-gm/tf/0.8.1
   |     3.89s T puncr                from ~/github/Dans-labs/clariah-gm/tf/0.8.1
   |       11s 



   |      |     1.84s C __sections__         from otype, oslots, otext, __levUp__, __levels__, n, n, n
   |      |     0.35s C __structure__        from otype, oslots, otext, __rank__, __levUp__, n, title, n
 3m 39s All features loaded/computed - for details use loadLog()


In [14]:
TF.indent(reset=True)
TF.info("Comparing modified feature data")

F81 = api81.F

total = 0
ok = 0
missed = 0
extra = 0
wrong = 0
nonex = 0

for w in F.otype.s("word"):
    total += 1
    
    val = F.punc.v(w)
    val81 = F81.punc.v(w)
    
    if val is None:
        if val81 is None:
            ok += 1
        else:
            nonex += 1
    else:
        if val81 is None:
            nonex += 1
        else:
            if w in topad:
                if f"{val} " == val81:
                    ok += 1
                elif val == val81:
                    missed += 1
                else:
                    wrong += 1
            else:
                if val == val81:
                    ok += 1
                elif f"{val} " == val81:
                    extra += 1
                else:
                    wrong += 1
                    
TF.info("Done")

print(f"""
Total:                  {total:>7}
OK:                     {ok:>7}
all well:               {total == ok}
lost modifications:     {missed:>7}
surplus modifications:  {extra:>7}
wrong modifications:    {wrong:>7}
definedness mismatches: {nonex:>7}
""")

  0.00s Comparing modified feature data
  5.67s Done

Total:                  5316429
OK:                     5316429
all well:               True
lost modifications:           0
surplus modifications:        0
wrong modifications:          0
definedness mismatches:       0



Very satisfactory!
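
The decision tree used in the comparison can also be written as a small function, which makes it easy to test on toy cases and, if desired, to reuse for the other three features. This is a sketch under assumptions: the function `classify` and the toy cases are made up here, and the category names follow the counters in the cell above.

```python
from collections import Counter


def classify(val, valNew, inTopad):
    """Classify one old/new value pair with the categories used above."""
    if (val is None) != (valNew is None):
        return "nonex"  # defined in one version but not in the other
    if val is None:
        return "ok"  # undefined in both versions
    if inTopad:
        if valNew == f"{val} ":
            return "ok"
        return "missed" if valNew == val else "wrong"
    if valNew == val:
        return "ok"
    return "extra" if valNew == f"{val} " else "wrong"


# Toy cases (hypothetical): (old value, new value, slot is in topad)
cases = [
    (".", ". ", True),
    (".", ".", True),
    (".", ".", False),
    (".", ". ", False),
    (".", ";", False),
    (None, None, False),
    (None, " ", True),
]

counts = Counter(classify(*case) for case in cases)
print(dict(counts))
```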

A final check that the new data version (0.8.1) loads when calling it with `use()`:

In [15]:
from tf.app import use

In [16]:
A = use("missieven:clone", checkout="clone")

   |     0.00s T author               from ~/github/Dans-labs/clariah-gm/tf/0.8.1
   |     0.01s T authorFull           from ~/github/Dans-labs/clariah-gm/tf/0.8.1
   |     0.04s T col                  from ~/github/Dans-labs/clariah-gm/tf/0.8.1
   |     0.00s T day                  from ~/github/Dans-labs/clariah-gm/tf/0.8.1
   |     0.03s T facs                 from ~/github/Dans-labs/clariah-gm/tf/0.8.1
   |     0.02s T isemph               from ~/github/Dans-labs/clariah-gm/tf/0.8.1
   |     0.02s T isfolio              from ~/github/Dans-labs/clariah-gm/tf/0.8.1
   |     0.36s T isnote               from ~/github/Dans-labs/clariah-gm/tf/0.8.1
   |     5.14s T isorig               from ~/github/Dans-labs/clariah-gm/tf/0.8.1
   |     0.03s T isref                from ~/github/Dans-labs/clariah-gm/tf/0.8.1
   |     2.73s T isremark             from ~/github/Dans-labs/clariah-gm/tf/0.8.1
   |     0.00s T isspecial            from ~/github/Dans-labs/clariah-gm/tf/0.8.1
   |     0.00s T

Indeed.

In the meantime, the new data version has been released on GitHub.
You can fetch it with a one-time call

```
A = use("missieven:latest", checkout="latest", hoist=globals())
```

after which the plain

```
A = use("missieven", hoist=globals())
```

is sufficient.