# Tesserae v5 Demo

This demo covers the basics of Tesserae v5 as developed through February 5, 2019.

In [1]:
!apt update
!apt install -y mongodb
!mkdir -p mongodb-data

[33m0% [Working][0m            Get:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/ InRelease [3,626 B]
[33m0% [Connecting to archive.ubuntu.com] [Waiting for headers] [1 InRelease 3,626 [0m[33m0% [Connecting to archive.ubuntu.com] [Waiting for headers] [Waiting for header[0m                                                                               Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
[33m0% [Connecting to archive.ubuntu.com] [Waiting for headers] [Waiting for header[0m[33m0% [1 InRelease gpgv 3,626 B] [Connecting to archive.ubuntu.com] [Waiting for h[0m                                                                               Ign:3 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
[33m0% [1 InRelease gpgv 3,626 B] [Connecting to archive.ubuntu.com] [Waiting for h[0m                                                                          

In [2]:
!mongod --dbpath ./mongodb-data --fork --logpath ./mongod.log

about to fork child process, waiting until server is ready for connections.
forked process: 1417
child process started successfully, parent exiting


In [3]:
!pip install git+https://github.com/ElectronicBabylonianLiterature/tesserae-v5

Collecting git+https://github.com/ElectronicBabylonianLiterature/tesserae-v5
  Cloning https://github.com/ElectronicBabylonianLiterature/tesserae-v5 to /tmp/pip-req-build-cz7z5k4k
  Running command git clone -q https://github.com/ElectronicBabylonianLiterature/tesserae-v5 /tmp/pip-req-build-cz7z5k4k
Building wheels for collected packages: tesserae
  Building wheel for tesserae (setup.py) ... done
  Created wheel for tesserae: filename=tesserae-0.1a1-cp36-none-any.whl size=61754 sha256=e47c24682583861ba345b1f11b47210156524ccc5669c5b482c271ed4fb2ccb7
  Stored in directory: /tmp/pip-ephem-wheel-cache-648mindw/wheels/15/12/9b/1a9461e40ceddee12498dee0f289daa405301c5f616adf3a1c
Successfully built tesserae


In [4]:
import json

from tesserae.db import TessMongoConnection
from tesserae.db.entities import Match, Text, Token, Unit
from tesserae.utils import TessFile
from tesserae.tokenizers import GreekTokenizer, LatinTokenizer
from tesserae.unitizer import Unitizer
from tesserae.matchers import DefaultMatcher
from tesserae.matchers.sparse_encoding import SparseMatrixSearch

# Connect to the 'tesstest' database on the local MongoDB server
connection = TessMongoConnection('127.0.0.1', 27017, None, None, 'tesstest')

# Clean up the previous demo
connection.connection['feature_sets'].delete_many({})
connection.connection['features'].delete_many({})
connection.connection['frequencies'].delete_many({})
connection.connection['matches'].delete_many({})
connection.connection['match_sets'].delete_many({})
connection.connection['texts'].delete_many({})
connection.connection['tokens'].delete_many({})
connection.connection['units'].delete_many({})

<pymongo.results.DeleteResult at 0x7f446552cac8>

## Loading and Storing New Texts

The Tesserae database catalogs metadata, including the title, author, and year published, as well as integrity information like filepath, MD5 hash, and CTS URN.
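As a rough illustration of the integrity information, a file's MD5 hash can be computed with the standard library. The helper `file_md5` and the `hash` key in the record are assumptions made for this sketch; they are not necessarily the names Tesserae uses internally.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # Read fixed-size chunks so large texts need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstrate on a throwaway file so the sketch runs anywhere.
with tempfile.NamedTemporaryFile(delete=False, suffix='.tess') as tmp:
    tmp.write(b'hello')

# Hypothetical metadata record; 'hash' is an assumed field name.
record = {'title': 'example', 'path': tmp.name, 'hash': file_md5(tmp.name)}
os.remove(tmp.name)
print(record['hash'])
```

Comparing a stored digest against a freshly computed one is a cheap way to detect that a file on disk has changed since it was catalogued.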

We start by loading in some metadata from `text_metadata.json`.

In [5]:
with open('text_metadata.json', 'r') as f:
    text_meta = json.load(f)

print('{}{}{}{}'.format('Title'.ljust(15), 'Author'.ljust(15), 'Language'.ljust(15), 'Year'))
print('{}{}{}{}'.format('-----'.ljust(15), '------'.ljust(15), '--------'.ljust(15), '----'))
for t in text_meta:
    print('{}{}{}{}'.format(t['title'].ljust(15), t['author'].ljust(15), t['language'].ljust(15), str(t['year']).ljust(15)))

FileNotFoundError: ignored

Then insert the new texts with `TessMongoConnection.insert` after converting the raw JSON to Tesserae `Text` entities.

In [None]:
texts = []
for t in text_meta:
    texts.append(Text.json_decode(t))
result = connection.insert(texts)
print('Inserted {} texts.'.format(len(result.inserted_ids)))
print(result.inserted_ids)

Inserted 4 texts.
[ObjectId('5cd5d8a34273852ca631fafe'), ObjectId('5cd5d8a34273852ca631faff'), ObjectId('5cd5d8a34273852ca631fb00'), ObjectId('5cd5d8a34273852ca631fb01')]


We can retrieve the inserted texts with `TessMongoConnection.find`. These texts will be converted to objects representing the database entries. The returned text list can be filtered by any valid field in the text database.
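One plausible way for keyword filters such as `_id=result.inserted_ids` to map onto a MongoDB query document is sketched below: scalar values become equality tests and lists become `$in` clauses. `build_query` is a hypothetical helper, not part of the Tesserae API, and the library's actual translation may differ.

```python
def build_query(**filters):
    """Translate keyword filters into a MongoDB-style query document."""
    query = {}
    for field, value in filters.items():
        if isinstance(value, (list, tuple, set)):
            # Several acceptable values: match any of them.
            query[field] = {'$in': list(value)}
        else:
            # Single value: plain equality match.
            query[field] = value
    return query


print(build_query(language='latin', year=[19, 38]))
```

Under this assumption, passing a list of ids selects exactly the documents that were just inserted, while passing no filters yields an empty query that matches everything.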

In [None]:
texts = connection.find('texts', _id=result.inserted_ids)

print('{}{}{}{}'.format('Title'.ljust(15), 'Author'.ljust(15), 'Language'.ljust(15), 'Year'))
for t in texts:
    print('{}{}{}{}'.format(t.title.ljust(15), t.author.ljust(15), t.language.ljust(15), t.year))

Title          Author         Language       Year
aeneid         vergil         latin          19
de oratore     cicero         latin          38
heracles       euripides      greek          -416
epistles       plato          greek          -280


## Loading .tess Files

Text metadata includes the path to the .tess file on the local filesystem. Using a Text retrieved from the database, the file can be loaded for further processing.

In [None]:
tessfile = TessFile(texts[0].path, metadata=texts[0])

print(tessfile.path)
print(len(tessfile))
print(tessfile[270])

la/vergil.aeneid.tess
9908
<verg. aen. 1.271>	transferet, et longam multa vi muniet Albam.



We can iterate through the file line-by-line.

In [None]:
lines = tessfile.readlines()
for i in range(10):
    print(next(lines))

<verg. aen. 1.1>	Arma virumque cano, Troiae qui primus ab oris

<verg. aen. 1.2>	Italiam, fato profugus, Laviniaque venit

<verg. aen. 1.3>	litora, multum ille et terris iactatus et alto

<verg. aen. 1.4>	vi superum saevae memorem Iunonis ob iram;

<verg. aen. 1.5>	multa quoque et bello passus, dum conderet urbem,

<verg. aen. 1.6>	inferretque deos Latio, genus unde Latinum,

<verg. aen. 1.7>	Albanique patres, atque altae moenia Romae.

<verg. aen. 1.8>	Musa, mihi causas memora, quo numine laeso,

<verg. aen. 1.9>	quidve dolens, regina deum tot volvere casus

<verg. aen. 1.10>	insignem pietate virum, tot adire labores
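Each line printed above pairs a locus tag in angle brackets with the text of the line. The sketch below splits one such line into its two parts; `parse_tess_line` is a hypothetical helper written for illustration, not a `TessFile` method.

```python
import re

# A tag in angle brackets, optional whitespace, then the line's text.
TESS_LINE = re.compile(r'^<(?P<tag>[^>]+)>\s*(?P<text>.*)$')


def parse_tess_line(line):
    """Return (tag, text) for a .tess line, or None if it has no tag."""
    match = TESS_LINE.match(line.strip())
    if match is None:
        return None
    return match.group('tag'), match.group('text')


print(parse_tess_line('<verg. aen. 1.1>\tArma virumque cano, Troiae qui primus ab oris\n'))
```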



We can also iterate token-by-token.

In [None]:
tokens = tessfile.read_tokens()
for i in range(10):
    print(next(tokens))

Arma
virumque
cano,
Troiae
qui
primus
ab
oris
Italiam,
fato


## Tokenizing a Text

Texts can be tokenized with `tesserae.tokenizers` objects. These objects are designed to normalize and compute features for tokens of a specific language.
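To give a feel for what normalization involves, here is a minimal sketch for Latin. It assumes lowercasing, folding `j` to `i` and `v` to `u` (a common convention in Latin text processing), and stripping non-letters. The function `normalize_latin` is hypothetical, and `LatinTokenizer` may apply different or additional rules.

```python
import re
from collections import Counter


def normalize_latin(raw):
    """Lowercase, fold j/v into i/u, and drop anything that is not a-z."""
    form = raw.lower().replace('j', 'i').replace('v', 'u')
    return re.sub(r'[^a-z]', '', form)


raw_tokens = 'Arma virumque cano, Troiae qui primus ab oris'.split()
forms = [normalize_latin(t) for t in raw_tokens]

# Raw counts of each normalized form, the basis for frequency features.
frequencies = Counter(forms)
print(forms)
```

Normalizing before counting means that `cano,` and `cano` are treated as the same form.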

In [None]:
tokenizer = GreekTokenizer(connection) if tessfile.metadata.language == 'greek' else LatinTokenizer(connection)

tokens, tags, features = tokenizer.tokenize2(tessfile.read(), text=tessfile.metadata)

print(len(tokens), len(tags), len(features))

# Header column widths match the data rows below
print('{}{}{}{}'.format('Raw'.ljust(15), 'Normalized'.ljust(20), 'Lemmata'.ljust(20), 'Frequency'))
print('{}{}{}{}'.format('---'.ljust(15), '----------'.ljust(20), '-------'.ljust(20), '---------'))
for i in range(20):
    # Skip tokens that have no features attached
    if len(tokens[i].features):
        print('{}{}{}{}'.format(tokens[i].display.ljust(15),
                                str(tokens[i].features['form'].token).ljust(20),
                                str(tokens[i].features['lemmata'][0].token).ljust(20),
                                list(tokens[i].features['form'].frequencies.values())[0]))

MemoryError: 

Processed tokens can then be stored in and retrieved from the database, similar to text metadata.

In [None]:
result = connection.insert(features)
print('Inserted {} feature entities out of {}'.format(len(result.inserted_ids), len(features)))

## Unitizing a Text

Texts can be unitized into lines and phrases. Intertext matches are then found between these units of text.
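The difference between the two unit types can be sketched on plain strings. This example assumes that a phrase ends at one of `.`, `;`, `:`, `?`, or `!` and may run across line boundaries. `unitize_sketch` is a hypothetical stand-in, and the delimiters used by `Unitizer` may differ.

```python
import re

# Break after phrase-final punctuation followed by whitespace.
PHRASE_BREAK = re.compile(r'(?<=[.;:?!])\s+')


def unitize_sketch(raw_lines):
    """Return (lines, phrases) for a list of text lines."""
    lines = [line.strip() for line in raw_lines if line.strip()]
    # Phrases ignore line breaks, so join the lines before splitting.
    phrases = [p for p in PHRASE_BREAK.split(' '.join(lines)) if p]
    return lines, phrases


raw = [
    'vi superum saevae memorem Iunonis ob iram;',
    'multa quoque et bello passus, dum conderet urbem,',
    'inferretque deos Latio, genus unde Latinum,',
    'Albanique patres, atque altae moenia Romae.',
]
demo_lines, demo_phrases = unitize_sketch(raw)
print(len(demo_lines), len(demo_phrases))
```

Here the four lines of verse yield only two phrases, because the second phrase spans three lines.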


In [None]:
# Split the tokens into line units and phrase units
unitizer = Unitizer()
lines, phrases = unitizer.unitize(tokens, tags, tessfile.metadata)

print('Lines\n-----')
for line in lines[:20]:
    print(''.join([str(line.tags), ': '] + [t.display for t in line.tokens]))

print('\n\nPhrases\n-------')
for phrase in phrases[:20]:
    print(''.join([str(phrase.tags), ': '] + [t.display for t in phrase.tokens]))

In [None]:
# Store the line and phrase units, then the tokens, in the database
result = connection.insert(lines + phrases)
print('Inserted {} units out of {}.'.format(len(result.inserted_ids), len(lines + phrases)))

result = connection.insert(tokens)
print('Inserted {} tokens out of {}.'.format(len(result.inserted_ids), len(tokens)))

In [None]:
# Repeat the tokenize, unitize, and store steps for the remaining texts
for text in texts[1:]:
    tessfile = TessFile(text.path, metadata=text)
    tokenizer = GreekTokenizer(connection) if tessfile.metadata.language == 'greek' else LatinTokenizer(connection)

    tokens, tags, frequencies, feature_sets = tokenizer.tokenize(tessfile.read(), text=tessfile.metadata)

    tokens = tokenizer.tokens
    result = connection.insert(feature_sets)
    result = connection.insert(frequencies)

    unitizer = Unitizer()
    lines, phrases = unitizer.unitize(tokens, tags, tessfile.metadata)
    result = connection.insert(lines + phrases)

    result = connection.insert(tokens)

## Matching

Once the Texts, Tokens, and Units are in the database, we can then find intertext matches.
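The sketch below illustrates the general idea behind matching with a stopword list, a span distance, and a maximum distance. It uses a scoring rule in the style of earlier Tesserae versions: the natural log of the summed inverse frequencies of the shared forms, divided by the combined span. Every name here is hypothetical, stopwords are passed as an explicit set rather than a count, and `DefaultMatcher` may compute distance and score differently.

```python
import math


def span(tokens, shared):
    """Distance between the first and last shared form in a unit."""
    positions = [i for i, t in enumerate(tokens) if t in shared]
    return positions[-1] - positions[0]


def score_pair(source, target, source_freq, target_freq,
               stopwords=frozenset(), max_distance=10):
    """Score two units, or return None if they do not qualify as a match."""
    shared = (set(source) & set(target)) - set(stopwords)
    if len(shared) < 2:
        # Assumed rule: a match needs at least two shared forms.
        return None
    distance = span(source, shared) + span(target, shared)
    if distance > max_distance:
        return None
    # Rare shared forms contribute more than common ones.
    rarity = (sum(1 / source_freq[t] for t in shared)
              + sum(1 / target_freq[t] for t in shared))
    return math.log(rarity / distance)


source = ['arma', 'uirum', 'cano']
target = ['cano', 'et', 'arma']
freq = {'arma': 0.5, 'uirum': 0.25, 'cano': 0.5, 'et': 0.9}
score = score_pair(source, target, freq, freq)
print(score)
```

Filtering out stopwords before counting shared forms keeps very common words from producing spurious matches; in this example, treating `arma` as a stopword leaves only one shared form, so no match is reported.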

In [None]:
import time
matcher = DefaultMatcher(connection)
match_texts = [t for t in texts if t.language == 'greek']

start = time.time()
matches, match_set = matcher.match(match_texts, 'phrase', 'form', distance_metric='span', stopwords=20, max_distance=10)
print("Completed matching in {0:.2f}s".format(time.time() - start))

matches.sort(key=lambda x: x.score, reverse=True)

# result = connection.insert(match_set)
# print('Inserted {} match set entities out of {}'.format(len(result.inserted_ids), 1))
result = connection.insert(matches)
print('Inserted {} match entities out of {}'.format(len(result.inserted_ids), len(matches)))

In [None]:
matches = connection.aggregate('matches', [
    {'$match': {'match_set': match_set.id}},
    {'$sort': {'score': -1}},
    {'$limit': 20},
    {'$lookup': {
        'from': 'units',
        'let': {'m_units': '$units'},
        'pipeline': [
            {'$match': {'$expr': {'$in': ['$_id', '$$m_units']}}},
            {'$lookup': {
                'from': 'tokens',
                'localField': '_id',
                'foreignField': 'phrase',
                'as': 'tokens'
            }},
            {'$sort': {'index': 1}}
        ],
        'as': 'units'
    }},
    {'$lookup': {
        'from': 'tokens',
        'localField': 'tokens',
        'foreignField': '_id',
        'as': 'tokens'
    }},
    {'$project': {
        'units': True,
        'score': True,
        'tokens': '$tokens.feature_set'
    }},
    {'$lookup': {
        'from': 'feature_sets',
        'localField': 'tokens',
        'foreignField': '_id',
        'as': 'tokens'
    }}
])

print('\n')
print('{}{}'.format('Score'.ljust(15), 'Match Tokens'.ljust(15)))
print('{}{}'.format('-----'.ljust(15), '------------'.ljust(15)))
for m in matches:
    print('{}{}'.format(('%.3f'%(m.score)).ljust(15), ', '.join(list(set([t['form'] for t in m.tokens])))))
    print('{} {} {}: {}'.format(match_texts[0].author, match_texts[0].title, m.units[0]['tags'], ''.join([t['display'] for t in m.units[0]['tokens']])))
    print('{} {} {}: {}'.format(match_texts[1].author, match_texts[1].title, m.units[1]['tags'], ''.join([t['display'] for t in m.units[1]['tokens']])))
    print('\n')