# Classics — Grammar

To help guarantee correctness, where the grammar is regular we can automate the production of noun declensions and verb conjugations.

This is useful in production because it reduces opportunities for error. Where the means of production are shared with learners, the ability to check declensions and conjugations for arbitrary words also provides an opportunity to support curiosity-driven, self-directed learning.

## Inflection Patterns 

Find the inflection (declension / conjugation) of a given word / lemma.

*(Lemma - "the canonical form of an inflected word".)*

In [3]:
from cltk.data.fetch import FetchCorpus
corpus_downloader = FetchCorpus('lat')
path = '/Users/tonyhirst/cltk_data/lat/text/lat_text_latin_library'

corpus_downloader.import_corpus('lat_models_cltk')

The morphological character of a word is encoded as a nine-character code string, one character per feature (`-` is used as the null character):

 	1: 	part of speech
 		n	noun
 		v	verb
 		t	participle
 		a	adjective
 		d	adverb
 		c	conjunction
 		r	preposition
 		p	pronoun
 		m	numeral
 		i	interjection
 		e	exclamation
 		u	punctuation
 	2: 	person
 		1	first person
 		2	second person
 		3	third person
 	3: 	number
 		s	singular
 		p	plural
 	4: 	tense
 		p	present
 		i	imperfect
 		r	perfect
 		l	pluperfect
 		t	future perfect
 		f	future
 	5: 	mood
 		i	indicative
 		s	subjunctive
 		n	infinitive
 		m	imperative
 		p	participle
 		d	gerund
 		g	gerundive
 		u	supine
 	6: 	voice
 		a	active
 		p	passive
 	7:	gender
 		m	masculine
 		f	feminine
 		n	neuter
 	8: 	case
 		n	nominative
 		g	genitive
 		d	dative
 		a	accusative
 		b	ablative
 		v	vocative
 		l	locative
 	9: 	degree
 		c	comparative
 		s	superlative
 
Via: https://github.com/cltk/latin_treebank_perseus#readme

In [6]:
from cltk.morphology.lat import CollatinusDecliner
decliner = CollatinusDecliner()

decliner.decline("amo")[:20]

[('amo', 'v1spia---'),
 ('amas', 'v2spia---'),
 ('amat', 'v3spia---'),
 ('amamus', 'v1ppia---'),
 ('amatis', 'v2ppia---'),
 ('amant', 'v3ppia---'),
 ('amabam', 'v1siia---'),
 ('amabas', 'v2siia---'),
 ('amabat', 'v3siia---'),
 ('amabamus', 'v1piia---'),
 ('amabatis', 'v2piia---'),
 ('amabant', 'v3piia---'),
 ('amabo', 'v1sfia---'),
 ('amabis', 'v2sfia---'),
 ('amabit', 'v3sfia---'),
 ('amabimus', 'v1pfia---'),
 ('amabitis', 'v2pfia---'),
 ('amabunt', 'v3pfia---'),
 ('amavi', 'v1sria---'),
 ('amavisti', 'v2sria---')]

In [75]:
decliner.decline("canis")

[('canis', '--s----n-'),
 ('canis', '--s----v-'),
 ('canem', '--s----a-'),
 ('canis', '--s----g-'),
 ('cani', '--s----d-'),
 ('cane', '--s----b-'),
 ('canes', '--p----n-'),
 ('canes', '--p----v-'),
 ('canes', '--p----a-'),
 ('canum', '--p----g-'),
 ('canibus', '--p----d-'),
 ('canibus', '--p----b-')]
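
The flat list returned for a noun can be hard to scan. The following is a minimal sketch, assuming the output above has been copied into a list (here called `canis_forms`, a hypothetical name), that regroups the forms into a case-by-number table using positions 3 (number) and 8 (case) of each code. The `paradigm` helper is illustrative and not part of CLTK.

```python
# Forms copied from the decliner output above; `canis_forms` is an illustrative name.
canis_forms = [
    ('canis', '--s----n-'), ('canis', '--s----v-'), ('canem', '--s----a-'),
    ('canis', '--s----g-'), ('cani', '--s----d-'), ('cane', '--s----b-'),
    ('canes', '--p----n-'), ('canes', '--p----v-'), ('canes', '--p----a-'),
    ('canum', '--p----g-'), ('canibus', '--p----d-'), ('canibus', '--p----b-'),
]

CASE_NAMES = {"n": "nominative", "v": "vocative", "a": "accusative",
              "g": "genitive", "d": "dative", "b": "ablative", "l": "locative"}

def paradigm(forms):
    """Group (form, tag) pairs into {case name: {"s": form, "p": form}}."""
    table = {}
    for form, tag in forms:
        number, case = tag[2], tag[7]  # positions 3 and 8 of the nine-character code
        if number == "-" or case == "-":
            continue  # skip entries that carry no number or no case
        table.setdefault(CASE_NAMES[case], {})[number] = form
    return table

canis_table = paradigm(canis_forms)
for case, row in canis_table.items():
    # One row per case: singular form, then plural form
    print(f"{case:<11} {row.get('s', ''):<8} {row.get('p', '')}")
```

Because Python dictionaries preserve insertion order, the rows appear in the order in which the decliner first lists each case.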

We can decode the strings to more easily describe the morphological character of a word.

In [8]:
#Taken from https://github.com/alpheios-project/pyperseus-treebank/blob/master/pyperseus_treebank/latin.py#L44#
#Maybe use https://github.com/jazzband/inflect for natural language code2text description?
import re

# Conversion table for CONLL
# Thanks to @epageperron
#??Some divergence from README?
_CONLL_LA_CONV_DICT = { "a": "adjective", "c": "conjunction",
                        "d": "adverb", "e": "exclamation", "g": "PART",
                        "i": "interjection", "l": "DET",
                        "m": "numeral", "n": "noun","p": "pronoun",
                        "r": "preposition", "t": "VERB", "u": "punctuation",
                        "v": "verb", "x": "X" }

_NUMBER = {"s": "singular", "p": "plural"}
_TENSE = {"p": "present", "f": "future", "r": "perfect", "l": "pluperfect",
          "i": "imperfect", "t": "future perfect"}
_MOOD = {"i": "indicative", "s": "subjunctive", "m": "imperative", 'd':'gerund',
         "g": "gerundive", "p": "participle", "u": "supine", "n": "infinitive"}
_VOICE = {"a": "active", "p": "passive", "d": "Dep"}
_GENDER = {"f": "feminine", "m": "masculine", "n": "neuter", "c": "Com"}
_CASE = {"g": "genitive", "d": "dative", "a": "accusative", "v": "vocative",
         "n": "nominative", "b": "ablative", "i": "Ins", "l": "locative"}
_DEGREE = {"p": "Pos", "c": "comparative", "s": "superlative"}

_PERSON = {"1":'first person', "2":'second person', "3":'third person'}

NOTWORD = re.compile(r"^\W+$")  # raw string, so \W is not read as a string escape

_NULL_CHAR="-"

def parse_features(features):
    """ Parse features from the POSTAG of Perseus Latin XML
    .. example :: self.parse_features("n-p---na-")
    :param features: A string containing morphological information
    :type features: str
    :return: Parsed features
    :rtype: dict
    """

    if features is None or features.lower()=='unk':
        return {}
    
    features = features.lower()
    
    feats = {}

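    # Part of speech (the decliner leaves this null for nouns, e.g. '--s----n-')
    if features[0] != _NULL_CHAR:
        feats['POS'] = _CONLL_LA_CONV_DICT[features[0]]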

    # Person handling : 3 possibilities
    if features[1] != _NULL_CHAR:
        feats["Person"] = _PERSON[features[1]]

    # Number handling : two possibilities
    if features[2] != _NULL_CHAR:
        feats["Number"] = _NUMBER[features[2]]

    # Tense
    if features[3] != _NULL_CHAR:
        feats["Tense"] = _TENSE[features[3]]

    # Mood
    if features[4] != _NULL_CHAR:
        feats["Mood"] = _MOOD[features[4]]

    # Voice
    if features[5] != _NULL_CHAR:
        feats["Voice"] = _VOICE[features[5]]

    # Gender
    if features[6] != _NULL_CHAR:
        feats["Gender"] = _GENDER[features[6]]

    # Case
    if features[7] != _NULL_CHAR:
        feats["Case"] = _CASE[features[7]]

    # Degree
    if features[8] != _NULL_CHAR:
        feats["Degree"] = _DEGREE[features[8]]

    return feats

We can then decode a morphological feature string:

In [9]:
#Example
parse_features('v3plia---')

{'POS': 'verb',
 'Person': 'third person',
 'Number': 'plural',
 'Tense': 'pluperfect',
 'Mood': 'indicative',
 'Voice': 'active'}
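
A decoded dictionary can be turned into a short readable description, which may be easier for learners to take in than key/value pairs. The following is a minimal sketch: `describe_features` and `_DESCRIPTION_ORDER` are hypothetical names, and the ordering (part of speech last) is just one reasonable choice.

```python
# Order in which the features are read out; the part of speech goes last.
_DESCRIPTION_ORDER = ["Person", "Number", "Tense", "Mood", "Voice",
                      "Gender", "Case", "Degree", "POS"]

def describe_features(feats):
    """Join the decoded feature values into a single readable phrase."""
    return " ".join(feats[key] for key in _DESCRIPTION_ORDER if key in feats)

# Dictionary copied from the parse_features('v3plia---') output above
example_feats = {'POS': 'verb', 'Person': 'third person', 'Number': 'plural',
                 'Tense': 'pluperfect', 'Mood': 'indicative', 'Voice': 'active'}

print(describe_features(example_feats))
# third person plural pluperfect indicative active verb
```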

Looking up words in the decliner provides a way of getting the morphological data for a word. For example, we could look up amabitis and get back something like `('amo', 'v2pfia---')`:

In [10]:
#hacky way that assumes you know the root
def lookupInflection(word, lemma):
    ''' Find the inflection of a given word, given its lemma. '''
    result=[]
    if lemma is None:
        return result
    
    lemma = [lemma] if isinstance(lemma,str) else lemma
    for l in lemma:
        try:
            words = decliner.decline(l)
            result.append([(w,d) for w,d in words if w==word])
        except Exception:
            # Record lemmas that could not be declined
            result.append((l, None))
    return result

If we know the root, we can look up the inflection:

In [None]:
lookupInflection('amabitis','amo')
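
`lookupInflection()` declines the lemma again on every call. If the same lemmas are queried repeatedly, one option is to decline each lemma once and index the results by surface form. The following is a minimal sketch of that idea: a few `(form, code)` pairs copied from the outputs above stand in for calls to `decliner.decline()`, and `declined` and `build_form_index` are hypothetical names.

```python
from collections import defaultdict

# Stand-in for {lemma: decliner.decline(lemma)}; pairs copied from the outputs above.
declined = {
    "amo": [('amo', 'v1spia---'), ('amas', 'v2spia---'), ('amabitis', 'v2pfia---')],
    "canis": [('canis', '--s----n-'), ('canis', '--s----v-'), ('canis', '--s----g-')],
}

def build_form_index(declined_by_lemma):
    """Map each surface form to every (lemma, code) pair that produces it."""
    index = defaultdict(list)
    for lemma, forms in declined_by_lemma.items():
        for form, tag in forms:
            index[form].append((lemma, tag))
    return dict(index)

form_index = build_form_index(declined)
print(form_index["amabitis"])  # [('amo', 'v2pfia---')]
print(form_index["canis"])     # three readings of the same surface form
```

An index like this also makes ambiguity visible: *canis* maps to the nominative, vocative and genitive singular at once.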

Let's see if we can find the root of a word with a simple lemmatizer:

In [11]:
#Lemmatizer - find root of a word
from cltk.stem.lemma import LemmaReplacer

from cltk.stem.latin.j_v import JVReplacer

#Lemmatizer requires the following
CorpusImporter('latin').import_corpus('latin_pos_lemmata_cltk')
CorpusImporter('latin').import_corpus('latin_models_cltk')


sentence = 'Progeniem sed enim Troiano a sanguine duci audierat'

sentence = sentence.lower()

lemmatizer = LemmaReplacer('latin')

lemmatizer.lemmatize(sentence)

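ModuleNotFoundError: No module named 'cltk.stem.lemma'

The `cltk.stem.lemma` module cannot be imported in the CLTK release installed here, so we try the lemmatizers in `cltk.lemmatize.lat` instead: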

In [18]:
from cltk.lemmatize.lat import LatinBackoffLemmatizer

sentence = 'Progeniem sed enim Troiano a sanguine duci audierat'

sentence = sentence.lower()

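# lemmatize() is an instance method that expects a list of tokens.
# Calling it on the class with a raw string raises:
# TypeError: lemmatize() missing 1 required positional argument: 'tokens'
lemmatizer = LatinBackoffLemmatizer()

lemmatizer.lemmatize(sentence.split())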

In [16]:
from cltk.lemmatize.lat import RomanNumeralLemmatizer

#Lemmatizer for identifying roman numerals in Latin text based on regex.
lemmatizer = RomanNumeralLemmatizer()

lemmatizer.lemmatize("i ii iii iv v vi vii vii ix x xx xxx xl l lx c cc".split())

[('i', 'NUM'),
 ('ii', 'NUM'),
 ('iii', 'NUM'),
 ('iv', 'NUM'),
 ('v', 'NUM'),
 ('vi', 'NUM'),
 ('vii', 'NUM'),
 ('vii', 'NUM'),
 ('ix', 'NUM'),
 ('x', 'NUM'),
 ('xx', 'NUM'),
 ('xxx', 'NUM'),
 ('xl', 'NUM'),
 ('l', 'NUM'),
 ('lx', 'NUM'),
 ('c', 'NUM'),
 ('cc', 'NUM')]
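
The idea of a regex-based numeral check can be illustrated without CLTK. The pattern below is a commonly used pattern for Roman numerals up to 3999; it is an assumption for illustration, not necessarily the pattern CLTK uses, and `looks_like_roman_numeral` is a hypothetical helper.

```python
import re

# Thousands, hundreds, tens, units; each group may be empty.
_ROMAN = re.compile(r"m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})")

def looks_like_roman_numeral(token):
    """Return True if the token is a well-formed, non-empty Roman numeral."""
    token = token.lower()
    return bool(token) and _ROMAN.fullmatch(token) is not None

for token in ["xl", "cc", "iiii", "sed"]:
    print(token, looks_like_roman_numeral(token))
```

A purely form-based check cannot distinguish a numeral from an ordinary word with the same spelling, such as *vi* ("by force") in *vi superum*, so results on running text need to be read with care.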

## Syllables

One way of helping students read a text is to split the syllables out.

In [23]:
with open(f'{path}/vergil/aen1.txt') as f:
    aeneid_1 = f.read()

In [26]:
#Here's a manual way of doing a concordance, though we need to clean it for the tokeniser?
from cltk.alphabet.text_normalization import remove_non_ascii
from cltk.alphabet.text_normalization import remove_non_latin

aen1_clean = remove_non_ascii(aeneid_1)
aen1_clean = remove_non_latin(aen1_clean)
print(aen1_clean[:1000])

Vergil Aeneid I        P VERGILI MARONIS AENEIDOS LIBER PRIMVS  Arma virumque cano Troiae qui primus ab oris Italiam fato profugus Laviniaque venit litora multum ille et terris iactatus et alto vi superum saevae memorem Iunonis ob iram multa quoque et bello passus dum conderet urbem     inferretque deos Latio genus unde Latinum Albanique patres atque altae moenia Romae  Musa mihi causas memora quo numine laeso quidve dolens regina deum tot volvere casus insignem pietate virum tot adire labores     impulerit Tantaene animis caelestibus irae  Urbs antiqua fuit Tyrii tenuere coloni Karthago Italiam contra Tiberinaque longe ostia dives opum studiisque asperrima belli quam Iuno fertur terris magis omnibus unam     posthabita coluisse Samo hic illius arma hic currus fuit hoc regnum dea gentibus esse si qua fata sinant iam tum tenditque fovetque Progeniem sed enim Troiano a sanguine duci audierat Tyrias olim quae verteret arces     hinc populum late regem belloque superbum venturum excidio Li

In [35]:
from cltk.tokenizers.lat.lat import LatinWordTokenizer
from nltk.text import Text

latin_word_tokenizer = LatinWordTokenizer()

tokens = latin_word_tokenizer.tokenize(aen1_clean)
textList = Text(tokens)
textList.concordance('Libyae')

Displaying 7 of 7 matches:
ello -que superbum venturum excidio Libyae sic volvere Parcas Id metuens veter
a litora cursu contendunt petere et Libyae vertuntur ad oras Est in secessu lo
ulos sic vertice caeli constitit et Libyae defixit lumina regnis Atque illum t
e per aera magnum remigio alarum ac Libyae citus adstitit oris Et iam iussa fa
o -que supersunt Ipse ignotus egens Libyae deserta peragro Europa atque Asia p
e pater optime Teucrum pontus habet Libyae nec spes iam restat Iuli at freta S
uidem per litora certos dimittam et Libyae lustrare extrema iubebo si quibus e
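
The concordance view is a keyword-in-context listing: each match is shown with a window of the tokens around it. The following is a minimal sketch of the same idea in plain Python, counting context in tokens rather than characters; `kwic` is a hypothetical helper and is not how NLTK implements `concordance()`.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each match of keyword."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits

# Tokens taken from the first concordance line above
sample_tokens = "superbum venturum excidio Libyae sic volvere Parcas".split()
kwic_hits = kwic(sample_tokens, "Libyae", width=2)
print(kwic_hits)  # [('venturum excidio', 'Libyae', 'sic volvere')]
```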


In [30]:
from cltk.languages.example_texts import get_example_text

example_lat = get_example_text('lat')
example_lat

'Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur. Hi omnes lingua, institutis, legibus inter se differunt. Gallos ab Aquitanis Garumna flumen, a Belgis Matrona et Sequana dividit. Horum omnium fortissimi sunt Belgae, propterea quod a cultu atque humanitate provinciae longissime absunt, minimeque ad eos mercatores saepe commeant atque ea quae ad effeminandos animos pertinent important, proximique sunt Germanis, qui trans Rhenum incolunt, quibuscum continenter bellum gerunt. Qua de causa Helvetii quoque reliquos Gallos virtute praecedunt, quod fere cotidianis proeliis cum Germanis contendunt, cum aut suis finibus eos prohibent aut ipsi in eorum finibus bellum gerunt. Eorum una, pars, quam Gallos obtinere dictum est, initium capit a flumine Rhodano, continetur Garumna flumine, Oceano, finibus Belgarum, attingit etiam ab Sequanis et Helvetiis flumen Rhenum, vergit ad septentriones. Belgae ab ex

In [36]:
from cltk.sentence.lat import LatinPunktSentenceTokenizer

latin_splitter = LatinPunktSentenceTokenizer()

sentences = latin_splitter.tokenize(example_lat)

sentences

['Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.',
 'Hi omnes lingua, institutis, legibus inter se differunt.',
 'Gallos ab Aquitanis Garumna flumen, a Belgis Matrona et Sequana dividit.',
 'Horum omnium fortissimi sunt Belgae, propterea quod a cultu atque humanitate provinciae longissime absunt, minimeque ad eos mercatores saepe commeant atque ea quae ad effeminandos animos pertinent important, proximique sunt Germanis, qui trans Rhenum incolunt, quibuscum continenter bellum gerunt.',
 'Qua de causa Helvetii quoque reliquos Gallos virtute praecedunt, quod fere cotidianis proeliis cum Germanis contendunt, cum aut suis finibus eos prohibent aut ipsi in eorum finibus bellum gerunt.',
 'Eorum una, pars, quam Gallos obtinere dictum est, initium capit a flumine Rhodano, continetur Garumna flumine, Oceano, finibus Belgarum, attingit etiam ab Sequanis et Helvetiis flumen Rhenum, vergit ad septen

In [41]:
from cltk.prosody.lat.syllabifier import Syllabifier

syllabifier = Syllabifier()

clean_sentence = remove_non_ascii(remove_non_latin(sentences[0])).lower()

#Extract syllables for each word
for word in latin_word_tokenizer.tokenize(clean_sentence):
    syllables = syllabifier.syllabify(word)
    print(word, syllables)

gallia ['gal', 'li', 'a']
est ['est']
omnis ['om', 'nis']
divisa ['di', 'vi', 'sa']
in ['in']
partes ['par', 'tes']
tres ['tres']
quarum ['qua', 'rum']
unam ['u', 'nam']
incolunt ['in', 'co', 'lunt']
belgae ['bel', 'gae']
aliam ['a', 'li', 'am']
aquitani ['a', 'qui', 'ta', 'ni']
tertiam ['ter', 'ti', 'am']
qui ['qui']
ipsorum ['ip', 'so', 'rum']
lingua ['lin', 'gua']
celtae ['cel', 'tae']
nostra ['nos', 'tra']
galli ['gal', 'li']
appellantur ['ap', 'pel', 'lan', 'tur']
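
For display to students, the syllable lists can be joined back into a hyphenated reading text. The following is a minimal sketch using a few `(word, syllables)` pairs copied from the output above; `syllabified` and `hyphenate` are hypothetical names.

```python
# (word, syllables) pairs copied from the syllabifier output above
syllabified = [
    ("gallia", ['gal', 'li', 'a']),
    ("est", ['est']),
    ("omnis", ['om', 'nis']),
    ("divisa", ['di', 'vi', 'sa']),
]

def hyphenate(pairs, separator="-"):
    """Join each word's syllables with a separator and rebuild the phrase."""
    return " ".join(separator.join(syllables) for _, syllables in pairs)

print(hyphenate(syllabified))  # gal-li-a est om-nis di-vi-sa
```

Passing a different `separator`, such as a middle dot, changes the look without changing the data.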


In [45]:
from cltk import NLP
cltk_nlp = NLP(language="lat")

# First run, prompts to allow download of Stanza NLP library models
# to ~/stanza_resources/la/ (~250MB)
# Also word embedding models from the Fasttext project to
# ~/cltk_data/lat/embeddings/fasttext 365MB
# Also Lewis's *An Elementary Latin Dictionary* (1890)
cltk_doc = cltk_nlp.analyze(text=example_lat)

‎𐤀 CLTK version '1.0.14'.
Pipeline for language 'Latin' (ISO: 'lat'): `LatinNormalizeProcess`, `LatinStanzaProcess`, `LatinEmbeddingsProcess`, `StopsProcess`, `LatinNERProcess`, `LatinLexiconProcess`.
From `from_ud():` Number[psor]: Unrecognized UD feature name


In [46]:
dir(cltk_doc)

['__annotations__',
 '__class__',
 '__dataclass_fields__',
 '__dataclass_params__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slotnames__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_get_words_attribute',
 'embeddings',
 'embeddings_model',
 'language',
 'lemmata',
 'morphosyntactic_features',
 'normalized_text',
 'pipeline',
 'pos',
 'raw',
 'sentences',
 'sentences_strings',
 'sentences_tokens',
 'stanza_doc',
 'stems',
 'tokens',
 'tokens_stops_filtered',
 'words']

In [66]:
cltk_doc.stanza_doc

[
  [
    {
      "id": 1,
      "text": "Gallia",
      "lemma": "iallius",
      "upos": "NOUN",
      "xpos": "A1|grn1|casA|gen2",
      "feats": "Case=Nom|Gender=Fem|Number=Sing",
      "head": 0,
      "deprel": "root",
      "misc": "start_char=0|end_char=6"
    },
    {
      "id": 2,
      "text": "est",
      "lemma": "sum",
      "upos": "AUX",
      "xpos": "N3|modA|tem1|gen6",
      "feats": "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act",
      "head": 1,
      "deprel": "cop",
      "misc": "start_char=7|end_char=10"
    },
    {
      "id": 3,
      "text": "omnis",
      "lemma": "omnis",
      "upos": "DET",
      "xpos": "C1|grn1|casA|gen2",
      "feats": "Case=Nom|Gender=Fem|Number=Sing|PronType=Ind",
      "head": 1,
      "deprel": "det",
      "misc": "start_char=11|end_char=16"
    },
    {
      "id": 4,
      "text": "divisa",
      "lemma": "digis",
      "upos": "ADJ",
      "xpos": "A1|grn1|casA|gen2",
      "feats": "Case=Nom|Degree=Pos|G

In [57]:
# Also: cltk_doc.raw
cltk_doc.normalized_text

'Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur. Hi omnes lingua, institutis, legibus inter se differunt. Gallos ab Aquitanis Garumna flumen, a Belgis Matrona et Sequana dividit. Horum omnium fortissimi sunt Belgae, propterea quod a cultu atque humanitate provinciae longissime absunt, minimeque ad eos mercatores saepe commeant atque ea quae ad effeminandos animos pertinent important, proximique sunt Germanis, qui trans Rhenum incolunt, quibuscum continenter bellum gerunt. Qua de causa Helvetii quoque reliquos Gallos virtute praecedunt, quod fere cotidianis proeliis cum Germanis contendunt, cum aut suis finibus eos prohibent aut ipsi in eorum finibus bellum gerunt. Eorum una, pars, quam Gallos obtinere dictum est, initium capit a flumine Rhodano, continetur Garumna flumine, Oceano, finibus Belgarum, attingit etiam ab Sequanis et Helvetiis flumen Rhenum, vergit ad septentriones. Belgae ab ex

In [56]:
cltk_doc.sentences_strings

['Gallia est omnis divisa in partes tres , quarum unam incolunt Belgae , aliam Aquitani , tertiam qui ipsorum lingua Celtae , nostra Galli appellantur .',
 'Hi omnes lingua , institutis , legibus inter se differunt .',
 'Gallos ab Aquitanis Garumna flumen , a Belgis Matrona et Sequana dividit .',
 'Horum omnium fortissimi sunt Belgae , propterea quod a cultu atque humanitate provinciae longissime absunt , minimeque ad eos mercatores saepe commeant atque ea quae ad effeminandos animos pertinent important , proximique sunt Germanis , qui trans Rhenum incolunt , quibuscum continenter bellum gerunt .',
 'Qua de causa Helvetii quoque reliquos Gallos virtute praecedunt , quod fere cotidianis proeliis cum Germanis contendunt , cum aut suis finibus eos prohibent aut ipsi in eorum finibus bellum gerunt .',
 'Eorum una , pars , quam Gallos obtinere dictum est , initium capit a flumine Rhodano , continetur Garumna flumine , Oceano , finibus Belgarum , attingit etiam ab Sequanis et Helvetiis flume

In [60]:
cltk_doc.sentences_tokens

[['Gallia',
  'est',
  'omnis',
  'divisa',
  'in',
  'partes',
  'tres',
  ',',
  'quarum',
  'unam',
  'incolunt',
  'Belgae',
  ',',
  'aliam',
  'Aquitani',
  ',',
  'tertiam',
  'qui',
  'ipsorum',
  'lingua',
  'Celtae',
  ',',
  'nostra',
  'Galli',
  'appellantur',
  '.'],
 ['Hi',
  'omnes',
  'lingua',
  ',',
  'institutis',
  ',',
  'legibus',
  'inter',
  'se',
  'differunt',
  '.'],
 ['Gallos',
  'ab',
  'Aquitanis',
  'Garumna',
  'flumen',
  ',',
  'a',
  'Belgis',
  'Matrona',
  'et',
  'Sequana',
  'dividit',
  '.'],
 ['Horum',
  'omnium',
  'fortissimi',
  'sunt',
  'Belgae',
  ',',
  'propterea',
  'quod',
  'a',
  'cultu',
  'atque',
  'humanitate',
  'provinciae',
  'longissime',
  'absunt',
  ',',
  'minimeque',
  'ad',
  'eos',
  'mercatores',
  'saepe',
  'commeant',
  'atque',
  'ea',
  'quae',
  'ad',
  'effeminandos',
  'animos',
  'pertinent',
  'important',
  ',',
  'proximique',
  'sunt',
  'Germanis',
  ',',
  'qui',
  'trans',
  'Rhenum',
  'incolunt',
  

In [74]:
[ p for p in zip(cltk_doc.tokens, cltk_doc.lemmata,
                 cltk_doc.pos, cltk_doc.morphosyntactic_features)]

[('Gallia',
  'iallius',
  'NOUN',
  {Case: [nominative], Gender: [feminine], Number: [singular]}),
 ('est',
  'sum',
  'AUX',
  {Mood: [indicative], Number: [singular], Person: [third], Tense: [present], VerbForm: [finite], Voice: [active]}),
 ('omnis',
  'omnis',
  'DET',
  {Case: [nominative], Gender: [feminine], Number: [singular], PrononimalType: [indefinite]}),
 ('divisa',
  'digis',
  'ADJ',
  {Case: [nominative], Degree: [positive], Gender: [feminine], Number: [singular]}),
 ('in', 'in', 'ADP', {AdpositionalType: [preposition]}),
 ('partes',
  'pars',
  'NOUN',
  {Case: [accusative], Gender: [feminine], Number: [plural]}),
 ('tres',
  'tres',
  'NUM',
  {Case: [accusative], Gender: [feminine], Numeral: [cardinal], Number: [plural]}),
 (',', ',', 'PUNCT', {}),
 ('quarum',
  'qui',
  'PRON',
  {Case: [genitive], Gender: [feminine], Number: [plural], PrononimalType: [relative]}),
 ('unam',
  'unus',
  'NUM',
  {Case: [accusative], Gender: [feminine], Numeral: [cardinal], Number: [