In [1]:
import os
import operator
import re
from collections import Counter

from cltk.lemmatize.latin.backoff import BackoffLatinLemmatizer

In [2]:
path_to_texts = '/Users/tyler/cltk_data/latin/text/latin_text_latin_library/'

Determine which texts to analyze. For this experiment, I look at Cicero's Catilinarian 1 and Pro Caelio, Vergil's Aeneid 1, 4, and 6, Caesar's Gallic War 1, Ovid's Metamorphoses 1, Catullus, Tacitus' Annales 1, and Livy 1.

In [3]:
texts_of_interest = [
    'cicero/cat1.txt',
    'cicero/cael.txt',
    'vergil/aen1.txt',
    'vergil/aen4.txt',
    'vergil/aen6.txt',
    'caesar/gall1.txt',
    'ovid/ovid.met1.txt',
    'catullus.txt',
    'tacitus/tac.ann1.txt',
    'livy/liv.1.txt'
]

For each text, we need to remove extraneous characters and split it into tokens.

In [4]:
tokens = []
# Anything that is not an ASCII letter (digits, punctuation, whitespace) becomes a space.
pattern = re.compile('[^a-zA-Z]')
for text_path in texts_of_interest:
    text_path = os.path.join(path_to_texts, text_path)
    with open(text_path) as file:
        text = file.read()
    clean_text = pattern.sub(' ', text).lower()
    # split() with no argument splits on runs of whitespace and drops empty strings.
    tokens += clean_text.split()
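
To make the cleaning step concrete, here is a small self-contained sketch that applies the same pattern to a single made-up line. The sample string is an assumption for illustration and is not read from the corpus files.

```python
import re

# Same pattern as in the cell above: every character that is not an
# ASCII letter is replaced by a space.
pattern = re.compile('[^a-zA-Z]')

# Hypothetical sample input (not taken from the corpus files).
sample = "Quo usque tandem abutere, Catilina,\n  patientia nostra? [1]"

clean = pattern.sub(' ', sample).lower().strip()
sample_tokens = [t for t in clean.split(' ') if t != '']
print(sample_tokens)
# ['quo', 'usque', 'tandem', 'abutere', 'catilina', 'patientia', 'nostra']
```

Because the pattern keeps only a-z and A-Z, any other letter (for example a vowel written with a macron) is also treated as a separator and would split a word in two.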

We then lemmatize the tokens to normalize the vocabulary counts.

In [5]:
lemmatizer = BackoffLatinLemmatizer()
vocab = [l[1] for l in lemmatizer.lemmatize(tokens)]

In [6]:
frequencies = Counter(vocab)

In [7]:
sorted_frequencies = sorted(frequencies.items(), key=operator.itemgetter(1), reverse=True)
sorted_frequencies[:10]

[('sum', 1992),
 ('et', 1840),
 ('qui', 1833),
 ('in', 1370),
 ('is', 904),
 ('hic', 902),
 ('ad', 749),
 ('non', 741),
 ('tu', 665),
 ('cum2', 618)]
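
As a side note, `Counter` already provides this ranking through `most_common()`. The sketch below uses a toy lemma list (an assumption, standing in for `vocab`) to show that both approaches give the same result.

```python
import operator
from collections import Counter

# Toy lemma list standing in for `vocab` (hypothetical data).
toy_vocab = ['sum', 'et', 'sum', 'qui', 'et', 'sum', 'in']
toy_frequencies = Counter(toy_vocab)

# The approach used in the cell above.
by_sorted = sorted(toy_frequencies.items(), key=operator.itemgetter(1), reverse=True)

# The built-in equivalent; pass an integer to keep only the top n.
by_most_common = toy_frequencies.most_common()

print(by_most_common)
# [('sum', 3), ('et', 2), ('qui', 1), ('in', 1)]
print(by_sorted == by_most_common)
# True
```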

Now we set a threshold for how much of this vocabulary the reader is assumed to know. For this experiment, we say the reader knows the most frequent 70% of the distinct lemmata in the corpus.

In [8]:
perc_of_vocab = 0.70
known_words = [w[0] for w in sorted_frequencies][:int(perc_of_vocab*len(sorted_frequencies))]
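
Note that this threshold is taken over distinct lemmata, not over running tokens. The following sketch, on toy counts, shows how the two measures can differ. The helper names `known_vocabulary` and `token_coverage` and all the counts are hypothetical, introduced only for illustration. The sketch also returns a set rather than a list, which makes later membership tests faster.

```python
from collections import Counter

def known_vocabulary(frequencies, fraction):
    """Return the most frequent `fraction` of distinct lemmata as a set."""
    ranked = [lemma for lemma, _ in frequencies.most_common()]
    return set(ranked[:int(fraction * len(ranked))])

def token_coverage(frequencies, known):
    """Return the share of running tokens whose lemma is in `known`."""
    total = sum(frequencies.values())
    covered = sum(count for lemma, count in frequencies.items() if lemma in known)
    return covered / total

# Toy counts (hypothetical): 10 distinct lemmata, 100 tokens in total.
toy = Counter({'sum': 40, 'et': 30, 'qui': 10, 'in': 8, 'is': 5,
               'hic': 3, 'ad': 1, 'non': 1, 'tu': 1, 'fastus': 1})

known = known_vocabulary(toy, 0.70)
coverage = token_coverage(toy, known)
print(len(known), coverage)
# 7 0.97
```

In this toy example, knowing 70% of the distinct lemmata covers 97% of the running tokens, because the most frequent lemmata account for most of the text.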

Next we take a fairly difficult text and see how many words fall outside of our known vocabulary.

In [9]:
propertius_1_1 = """
Cynthia prima suis miserum me cepit ocellis,
    contactum nullis ante cupidinibus.
tum mihi constantis deiecit lumina fastus
    et caput impositis pressit Amor pedibus,
donec me docuit castas odisse puellas                 5
    improbus, et nullo vivere consilio.
ei mihi, iam toto furor hic non deficit anno,
    cum tamen adversos cogor habere deos.
Milanion nullos fugiendo, Tulle, labores
    saevitiam durae contudit Iasidos.                 10
nam modo Partheniis amens errabat in antris,
    rursus in hirsutas ibat et ille feras;
ille etiam Hylaei percussus vulnere rami
    saucius Arcadiis rupibus ingemuit.
ergo velocem potuit domuisse puellam:                 15
    tantum in amore fides et benefacta valent.
in me tardus Amor non ullas cogitat artes,
    nec meminit notas, ut prius, ire vias.
at vos, deductae quibus est pellacia lunae
    et labor in magicis sacra piare focis,                 20
en agedum dominae mentem convertite nostrae,
    et facite illa meo palleat ore magis!
tunc ego crediderim Manes et sidera vobis
    posse Cytinaeis ducere carminibus.
aut vos, qui sero lapsum revocatis, amici,                 25
    quaerite non sani pectoris auxilia.
fortiter et ferrum saevos patiemur et ignes,
    sit modo libertas quae velit ira loqui.
ferte per extremas gentes et ferte per undas,
    qua non ulla meum femina norit iter.                 30
vos remanete, quibus facili deus annuit aure,
    sitis et in tuto semper amore pares.
nam me nostra Venus noctes exercet amaras,
    et nullo vacuus tempore defit Amor.
hoc, moneo, vitate malum: sua quemque moretur                 35
    cura, neque assueto mutet amore torum.
quod si quis monitis tardas adverterit aures,
    heu referet quanto verba dolore mea!
"""

In [10]:
propertius_1_1 = pattern.sub(' ', propertius_1_1).strip()
pro_1_1_lemmata = lemmatizer.lemmatize(propertius_1_1.split(' '))
pro_1_1_lemmata = [l for l in pro_1_1_lemmata if l[0] != '']

Below are the words in the poem whose lemmata are not in the known vocabulary. Of the 19, 11 are proper nouns and 8 are not.

In [11]:
set([l for l in pro_1_1_lemmata if l[1] not in known_words])

{('Amor', 'Amor'),
 ('Arcadiis', 'Arcadius'),
 ('Cynthia', 'Cynthius'),
 ('Cytinaeis', 'Cytinaeis'),
 ('Hylaei', 'Hylaeus'),
 ('Iasidos', 'Iasidos'),
 ('Manes', 'Manes'),
 ('Milanion', 'Milanion'),
 ('Partheniis', 'Parthenia'),
 ('Tulle', 'Tullus'),
 ('Venus', 'Venus1'),
 ('benefacta', 'benefacio'),
 ('castas', 'castus'),
 ('fastus', 'fastus'),
 ('focis', 'focus'),
 ('hirsutas', 'hirsutus'),
 ('magicis', 'magicus'),
 ('pellacia', 'pellacia'),
 ('vitate', 'vitas')}

In [12]:
8 / len(set(pro_1_1_lemmata))

0.03902439024390244
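
The figure 8 in the cell above was counted by hand. A minimal sketch of how it could be derived instead, assuming that a capitalized token marks a proper noun, is given below. The pairs are copied from the output shown earlier; the capitalization heuristic is an assumption, not part of the lemmatizer.

```python
# (token, lemma) pairs copied from the output of the earlier cell.
unknown = {
    ('Amor', 'Amor'), ('Arcadiis', 'Arcadius'), ('Cynthia', 'Cynthius'),
    ('Cytinaeis', 'Cytinaeis'), ('Hylaei', 'Hylaeus'), ('Iasidos', 'Iasidos'),
    ('Manes', 'Manes'), ('Milanion', 'Milanion'), ('Partheniis', 'Parthenia'),
    ('Tulle', 'Tullus'), ('Venus', 'Venus1'), ('benefacta', 'benefacio'),
    ('castas', 'castus'), ('fastus', 'fastus'), ('focis', 'focus'),
    ('hirsutas', 'hirsutus'), ('magicis', 'magicus'), ('pellacia', 'pellacia'),
    ('vitate', 'vitas'),
}

# Heuristic (an assumption): a token starting with a capital is a proper noun.
proper = {pair for pair in unknown if pair[0][0].isupper()}
common = unknown - proper

print(len(proper), len(common))
# 11 8
print(sorted(token for token, _ in common))
# ['benefacta', 'castas', 'fastus', 'focis', 'hirsutas', 'magicis', 'pellacia', 'vitate']
```

The heuristic works here because this text of the poem capitalizes only names. It would misclassify words in a text that also capitalizes the first word of each sentence or verse.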

Proper nouns aside, a reader who knew every word in the known vocabulary would have trouble with only about 4% of the distinct words in this poem.