# Export plain text with recorded node positions

We generate a plain text file together with a mapping file that links each text position to the nodes it came from.
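The idea behind such a mapping can be sketched with a toy recorder. This is an illustration only, not the Text-Fabric implementation: `ToyRecorder` and `rec_toy` are hypothetical names, and the sketch assumes that every character added while a node is open is attributed to that node.

```python
class ToyRecorder:
    """Toy stand-in for a recorder (hypothetical, not the Text-Fabric class)."""

    def __init__(self):
        self.chunks = []     # pieces of text added so far
        self.active = set()  # nodes that are currently open
        self.pos = []        # one frozenset of nodes per character of text

    def start(self, node):
        self.active.add(node)

    def end(self, node):
        self.active.discard(node)

    def add(self, string):
        self.chunks.append(string)
        # Every character of the string belongs to all currently open nodes
        self.pos.extend(frozenset(self.active) for _ in string)

    def text(self):
        return "".join(self.chunks)

    def positions(self):
        return self.pos


rec_toy = ToyRecorder()
for node, ch in [(1, "A"), (2, "b")]:
    rec_toy.start(node)
    rec_toy.add(ch)
    rec_toy.end(node)
rec_toy.add("\n")  # added while no node is open, so it maps to no node
```

Text added outside any node, such as separators and headings, still occupies positions in the text but is linked to no node.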

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from tf.app import use
from tf.convert.recorder import Recorder

In [13]:
A = use("annotation/mobydick:clone", checkout="clone", hoist=globals())

**Locating corpus resources ...**

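| Name | # of nodes | # slots/node | % coverage |
|---|---|---|---|
| text | 1 | 1199306.0 | 100 |
| body | 1 | 1196049.0 | 100 |
| div | 138 | 8689.22 | 100 |
| chapter | 141 | 8510.52 | 100 |
| front | 1 | 3257.0 | 0 |
| note | 19 | 710.84 | 1 |
| fileDesc | 1 | 659.0 | 0 |
| p | 2421 | 492.42 | 99 |
| chunk | 2606 | 459.32 | 100 |
| publicationStmt | 2 | 211.0 | 0 |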


In [14]:
c1 = F.otype.s("chunk")[1]
c1

1200288

# spaCy

We use the
[recorder](https://annotation.github.io/text-fabric/tf/convert/recorder.html)
to produce a plain text.

The resulting plain text is then fed into spaCy for linguistic annotation.

After that, we capture the annotations and save them as features in the TF dataset.
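Capturing the annotations comes down to translating character spans back into nodes. The sketch below shows that step on made-up data, assuming the positions are available as one collection of nodes per character. The helper `nodes_in_span`, the node numbers, and the token spans are all hypothetical; in a real run the spans would come from the annotation tool (spaCy tokens, for instance, expose their character offset as `token.idx`).

```python
def nodes_in_span(positions, start, end):
    """Collect the nodes recorded for characters start..end-1, in text order."""
    seen = []
    for char_nodes in positions[start:end]:
        for node in sorted(char_nodes):
            if node not in seen:
                seen.append(node)
    return seen


# Hypothetical data: the text "Call me" with one slot node per character
# and a space that was added outside any node.
toy_text = "Call me"
toy_positions = [{101}, {102}, {103}, {104}, set(), {105}, {106}]

# Hypothetical token spans (start, end, label), shaped like tokenizer output
toy_tokens = [(0, 4, "VERB"), (5, 7, "PRON")]

# Map each token's text to its label and the nodes it covers
token_nodes = {
    toy_text[b:e]: (label, nodes_in_span(toy_positions, b, e))
    for (b, e, label) in toy_tokens
}
```

Once each token is tied to a list of nodes, its label can be stored as a feature value on those nodes or on a new node built from them.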

In [34]:
rec = Recorder(A.api)

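# Record the main text: every slot that is not metadata, not part of a note, and not empty
for s in F.otype.s(F.otype.slotType):
    if F.is_meta.v(s) or F.is_note.v(s) or F.empty.v(s):
        continue
    rec.start(s)
    rec.add(F.ch.v(s))
    rec.end(s)

# Separator heading before the notes; it is added outside any node
rec.add("\n\nNotes\n\n")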

# Append the notes after the main text, each followed by a blank line
for n in F.otype.s("note"):
    for s in L.d(n, otype="char"):
        if F.empty.v(s):
            continue
        rec.start(s)
        rec.add(F.ch.v(s))
        rec.end(s)
    rec.add("\n\n")

In [35]:
repoDir = A.repoLocation
TEXT_DIR = f"{repoDir}/txt"
TEXT_PATH = f"{TEXT_DIR}/plain.txt"
POS_PATH = f"{TEXT_DIR}/plain.txt.pos"

In [36]:
rec.write(TEXT_PATH)

In [37]:
!ls -l {TEXT_DIR}

total 19152
-rw-r--r--@ 1 me  staff  1221733 Apr 13 13:15 plain.txt
-rw-r--r--  1 me  staff  8577206 Apr 13 13:15 plain.txt.pos


In [38]:
!head -n 10 {POS_PATH}

819
820
821
822
823
824
825
826
827
828
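The listing shows one node number per line. A minimal sketch for reading such a file back into Python follows. It assumes that line *i* holds the nodes for character position *i*, that multiple nodes on one line are tab-separated, and that an empty line means no node. That layout is an assumption based on the output above and should be checked against the recorder documentation. `read_positions` is a hypothetical helper, and the sample data (including node 900) is made up.

```python
import io


def read_positions(lines):
    """Parse position lines into a list with one tuple of nodes per character."""
    positions = []
    for line in lines:
        line = line.rstrip("\n")
        # Assumption: nodes on one line are tab-separated; empty line = no node
        nodes = tuple(int(n) for n in line.split("\t")) if line else ()
        positions.append(nodes)
    return positions


# Made-up sample in the assumed layout; the third character maps to no node
sample = io.StringIO("819\n820\n\n821\t900\n")
sample_positions = read_positions(sample)
```

With the real file, the same function could be applied to an open file handle, for example `with open(POS_PATH) as fh: read_positions(fh)`.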


In [39]:
!head -n 10 {TEXT_PATH}

Born in New York City, the son of New England merchant. He worked at odd jobs (clerk, farmhand, teacher) before sailing to the South Seas on the whaler Acushnet. He deserted his ship, lived among cannibals, mutinied on an Australian boat, then spent two years on an American boat returning to the U.S. He successfully romanticized these adventures, publishing seven novels in six years, including Moby Dick (1851), one of the masterworks of American fiction. His popularity waned, and by the time he died he was virtually forgotten. Billy Budd was his last great novel. As his writing declined, Melville sailed again, around Cape Horn to San Francisco on a clipper ship commanded by his brother. 



Preliminary Matter.

This text of Melville's Moby-Dick is based on the Hendricks House edition. It was prepared by Professor Eugene F. Irey at the University of Colorado. Any subsequent copies of this data must include this notice and any publications resulting from analysis of this data must includ

In [41]:
!tail -n 10 {TEXT_PATH}

Though all comparison in the way of general bulk between the whale and the elephant is preposterous, inasmuch as in that particular the elephant stands in much the same respect to the whale that a dog does to the elephant; nevertheless, there are not wanting some points of curious similitude; among these is the spout. It is well known that the elephant will often draw up water or dust in his trunk, and then elevating it, jet it forth in a stream.

To gally, or gallow, is to frighten excessively —to confound with fright. It is an old Saxon word. It occurs once in Shakespeare: — The wrathful skies Gallow the very wanderers of the dark And make them keep their caves. To common language, the word is now completely obsolete. When the polite landsman first hears it from the gaunt Nantucketer, he is apt to set it down as one of the whaleman's self-derived savageries. Much the same is it with many other sinewy Saxonisms of this sort, which emigrated to New-England rocks with the noble brawn of

In [18]:
rec2 = Recorder(A.api)
rec2.read(TEXT_PATH)

In [19]:
rec.text() == rec2.text()

True

In [20]:
rec.positions() == rec2.positions()

True