# Part of Speech tagging and lemmatisation with 🐍

See information on the [Sunoikisis Wiki](https://github.com/SunoikisisDC/SunoikisisDC-2017-2018/wiki/Python-2:-Part-of-Speech-tagging-and-lemmatisation).

## Goals

* present tools/libraries that can be used for PoS tagging and lemmatisation
* raise awareness of what goes on behind the scenes, i.e. the steps that libraries hide from the user
* show that there is a growing amount of Python code/libraries for linguistic annotation of Ancient Greek/Latin
* ... but at the same time it takes a bit of bricolage to get things to work

## Imports

In [1]:
import os
from tqdm import tqdm
import sys
import cltk

In [2]:
cltk.__version__

'0.1.83'

## Download corpora

### Greek

In [17]:
from cltk.corpus.utils.importer import CorpusImporter

In [11]:
grk_corpus_importer = CorpusImporter('greek')

In [12]:
grk_corpus_importer.list_corpora

['greek_software_tlgu',
 'greek_text_perseus',
 'phi7',
 'tlg',
 'greek_proper_names_cltk',
 'greek_models_cltk',
 'greek_treebank_perseus',
 'greek_lexica_perseus',
 'greek_training_set_sentence_cltk',
 'greek_word2vec_cltk',
 'greek_text_lacus_curtius',
 'greek_text_first1kgreek']

In [41]:
grk_corpus_importer.import_corpus('greek_text_perseus')

Downloaded 100% 143.79 MiB | 4.02 MiB/s

### Latin

In [57]:
from cltk.corpus.latin import latinlibrary

In [18]:
la_corpus_importer = CorpusImporter('latin')

In [19]:
la_corpus_importer.list_corpora

['latin_text_perseus',
 'latin_treebank_perseus',
 'latin_text_latin_library',
 'phi5',
 'phi7',
 'latin_proper_names_cltk',
 'latin_models_cltk',
 'latin_pos_lemmata_cltk',
 'latin_treebank_index_thomisticus',
 'latin_lexica_perseus',
 'latin_training_set_sentence_cltk',
 'latin_word2vec_cltk',
 'latin_text_antique_digiliblt',
 'latin_text_corpus_grammaticorum_latinorum',
 'latin_text_poeti_ditalia']

In [20]:
la_corpus_importer.import_corpus('latin_training_set_sentence_cltk')

In [43]:
la_corpus_importer.import_corpus('latin_text_latin_library')

Downloaded 100% 35.50 MiB | 5.91 MiB/s 

In [3]:
from cltk.corpus.latin import latinlibrary

**Note**: `cltk.corpus.latin.latinlibrary` is a shortcut for several things, and there is nothing comparable (yet) for Greek (see [source code](https://github.com/cltk/cltk/blob/master/cltk/corpus/latin/__init__.py)).

In [58]:
amicitia_words = latinlibrary.words('cicero/amic.txt')

In [59]:
len(amicitia_words)

11618

We can get any range of tokens from this text by using *slice notation*:

In [17]:
# the first ten tokens
amicitia_words[:10]

['Cicero',
 ':',
 'de',
 'Amicitia',
 'M.',
 'TVLLI',
 'CICERONIS',
 'LAELIVS',
 'DE',
 'AMICITIA']

In [18]:
# or the last token
amicitia_words[-1]

'Page'

We can also count occurrences by using the `count()` method and passing as parameter the token we want to inspect:

In [19]:
amicitia_words.count('et')

236

In [23]:
amicitia_words.count('amicitia')

67
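Each call to `count()` scans the whole sequence again. When several frequencies are needed, one pass with `collections.Counter` from the standard library is a common alternative. The sketch below is only an illustration: it uses a short hand-made list (the first ten tokens shown earlier) instead of the corpus view, so it runs without CLTK, and the lowercasing step is an assumption about what one might want, not something `count()` does.

```python
from collections import Counter

# the first ten tokens of the text, copied from the output above
sample_tokens = ['Cicero', ':', 'de', 'Amicitia', 'M.', 'TVLLI',
                 'CICERONIS', 'LAELIVS', 'DE', 'AMICITIA']

# count every token in a single pass; lowercasing merges 'de' and 'DE'
frequencies = Counter(token.lower() for token in sample_tokens)

print(frequencies['de'])           # occurrences of 'de', ignoring case
print(frequencies.most_common(3))  # the three most frequent tokens
```

A `Counter` returns 0 for tokens it has never seen, so lookups never raise `KeyError`.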

Let's have a closer look at the `type` of the variable `amicitia_words`, where we loaded the content of Cicero's *De Amicitia*:

In [11]:
type(amicitia_words)

nltk.corpus.reader.util.StreamBackedCorpusView

In [7]:
help(amicitia_words)

Help on StreamBackedCorpusView in module nltk.corpus.reader.util object:

class StreamBackedCorpusView(nltk.collections.AbstractLazySequence)
 |  A 'view' of a corpus file, which acts like a sequence of tokens:
 |  it can be accessed by index, iterated over, etc.  However, the
 |  tokens are only constructed as-needed -- the entire corpus is
 |  never stored in memory at once.
 |  
 |  The constructor to ``StreamBackedCorpusView`` takes two arguments:
 |  a corpus fileid (specified as a string or as a ``PathPointer``);
 |  and a block reader.  A "block reader" is a function that reads
 |  zero or more tokens from a stream, and returns them as a list.  A
 |  very simple example of a block reader is:
 |  
 |      >>> def simple_block_reader(stream):
 |      ...     return stream.readline().split()
 |  
 |  This simple block reader reads a single line at a time, and
 |  returns a single token (consisting of a string) for each
 |  whitespace-separated substring on the line.
 |  
 |  When d
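The key idea in the help text is that tokens are produced on demand rather than loaded all at once. The sketch below illustrates that idea with a plain generator over an in-memory stream. It reuses the `simple_block_reader` from the help text, but `lazy_tokens` is a hypothetical helper written for this example and is not how NLTK implements `StreamBackedCorpusView`.

```python
import io
from itertools import islice

def simple_block_reader(stream):
    # same idea as in the help text: read one line, return its tokens
    return stream.readline().split()

def lazy_tokens(stream):
    """Yield tokens block by block instead of reading the whole stream up front."""
    while True:
        position = stream.tell()
        block = simple_block_reader(stream)
        if not block and stream.tell() == position:
            break  # nothing was consumed: end of stream
        yield from block

stream = io.StringIO("Gallia est omnis\ndivisa in partes tres\n")

# only as many lines as needed are read to produce the first four tokens
first_four = list(islice(lazy_tokens(stream), 4))
print(first_four)
```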

## Part of Speech Tagging

### Latin

#### CLTK taggers

In [112]:
from cltk.tag.pos import POSTag
tagger = POSTag('latin')
tagger.tag_ngram_123_backoff('Gallia est omnis divisa in partes tres')

[('Gallia', None),
 ('est', 'V3SPIA---'),
 ('omnis', 'A-S---MN-'),
 ('divisa', 'T-PRPPNN-'),
 ('in', 'R--------'),
 ('partes', 'N-P---FA-'),
 ('tres', 'M--------')]

**Why is it failing?**

CLTK relies on some components (e.g. trained models) that are read from disk instead of being generated on the fly.

This is very common, especially in those cases where the time needed to generate certain objects is not negligible.

*Serialization* is the process of converting an object into a form that can be written to disk, and `pickle` is the Python standard-library module that does this.
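The round trip described above can be sketched in a few lines. The dictionary and the file name here are made up for illustration; the real CLTK models are larger objects stored under `~/cltk_data`.

```python
import os
import pickle
import tempfile

# stand-in for an object that is expensive to build, such as a trained model
model = {'est': 'V3SPIA---', 'in': 'R--------'}

path = os.path.join(tempfile.mkdtemp(), 'toy_model.pickle')

# serialization: write the object to disk
with open(path, 'wb') as f:
    pickle.dump(model, f)

# deserialization: read it back instead of rebuilding it
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == model)
```

Note that unpickling can execute arbitrary code, so pickle files should only be loaded from sources you trust.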

In [48]:
la_corpus_importer.list_corpora

['latin_text_perseus',
 'latin_treebank_perseus',
 'latin_text_latin_library',
 'phi5',
 'phi7',
 'latin_proper_names_cltk',
 'latin_models_cltk',
 'latin_pos_lemmata_cltk',
 'latin_treebank_index_thomisticus',
 'latin_lexica_perseus',
 'latin_training_set_sentence_cltk',
 'latin_word2vec_cltk',
 'latin_text_antique_digiliblt',
 'latin_text_corpus_grammaticorum_latinorum',
 'latin_text_poeti_ditalia']

In [49]:
la_corpus_importer.import_corpus('latin_models_cltk')

In [113]:
tagger = POSTag('latin')

In [114]:
tagger

<cltk.tag.pos.POSTag at 0x10f5bd7f0>

In [46]:
# build the input string once, then compare the three taggers token by token
text = " ".join(str(w) for w in amicitia_words[100:150])
list(zip(
    tagger.tag_tnt(text),
    tagger.tag_ngram_123_backoff(text),
    tagger.tag_crf(text)
))

[(('91', 'Unk'), ('91', None), ('91', 'A-P---FN-')),
 (('92', 'Unk'), ('92', None), ('92', 'N-P---FN-')),
 (('93', 'Unk'), ('93', None), ('93', 'A-P---FN-')),
 (('94', 'Unk'), ('94', None), ('94', 'N-P---FN-')),
 (('95', 'Unk'), ('95', None), ('95', 'A-P---FN-')),
 (('96', 'Unk'), ('96', None), ('96', 'N-P---FN-')),
 (('97', 'Unk'), ('97', None), ('97', 'A-P---FN-')),
 (('98', 'Unk'), ('98', None), ('98', 'N-P---FN-')),
 (('99', 'Unk'), ('99', None), ('99', 'A-P---FN-')),
 (('100', 'Unk'), ('100', None), ('100', 'N-P---FN-')),
 (('101', 'Unk'), ('101', None), ('101', 'A-P---FN-')),
 (('102', 'Unk'), ('102', None), ('102', 'N-P---FN-')),
 (('103', 'Unk'), ('103', None), ('103', 'A-P---FN-')),
 (('104', 'Unk'), ('104', None), ('104', 'N-P---FN-')),
 (('[', 'U--------'), ('[', 'U--------'), ('[', 'U--------')),
 (('1', 'Unk'), ('1', None), ('1', 'N-S---MV-')),
 ((']', 'U--------'), (']', 'U--------'), (']', 'U--------')),
 (('Q', 'Unk'), ('Q', None), ('Q', 'N-S---MV-')),
 (('.', 'U-------

What if we want to take sentences instead of ranges of tokens?

The class `PlaintextCorpusReader` has a nice method – `sents()` – that does this.

Let's see how it works...

In [94]:
de_amicitia_sentences = latinlibrary.sents('cicero/amic.txt')

In [95]:
de_amicitia_sentences[10]

['[',
 '4',
 ']',
 'Cum',
 'enim',
 'saepe',
 'cum',
 'me',
 'ageres',
 'ut',
 'de',
 'amicitia',
 'scriberem',
 'aliquid',
 ',',
 'digna',
 'mihi',
 'res',
 'cum',
 'omnium',
 'cognitione',
 'tum',
 'nostra',
 'familiaritate',
 'visa',
 'est',
 '.']

#### Making sense of PoS tags

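The positional tags returned by the taggers can be expanded into readable labels with the `Morph` class from [gAGDT](https://github.com/francescomambrini/gAGDT), which is loaded here from a local clone of the repository.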

In [81]:
sys.path.append(os.path.expanduser('~/Documents/gAGDT/'))
from IPython.display import HTML

In [74]:
from treebanks import Morph

In [96]:
my_pos_tag = 'T-SRPPMN-'

In [109]:
Morph(my_pos_tag).full

KeyError: 'T'

In [97]:
Morph(my_pos_tag.lower()).full

{'case': 'nominative',
 'degree': '-',
 'gender': 'masculine',
 'mood': 'participle',
 'number': 'singular',
 'person': '-',
 'pos': 'verb',
 'tense': 'perfect',
 'voice': 'passive'}

In [115]:
tnt_output = tagger.tag_tnt(" ".join([str(w) for w in amicitia_words[100:150]]))

In [124]:
tnt_output

[('91', 'Unk'),
 ('92', 'Unk'),
 ('93', 'Unk'),
 ('94', 'Unk'),
 ('95', 'Unk'),
 ('96', 'Unk'),
 ('97', 'Unk'),
 ('98', 'Unk'),
 ('99', 'Unk'),
 ('100', 'Unk'),
 ('101', 'Unk'),
 ('102', 'Unk'),
 ('103', 'Unk'),
 ('104', 'Unk'),
 ('[', 'U--------'),
 ('1', 'Unk'),
 (']', 'U--------'),
 ('Q', 'Unk'),
 ('.', 'U--------'),
 ('Mucius', 'Unk'),
 ('augur', 'Unk'),
 ('multa', 'A-P---NA-'),
 ('narrare', 'V--PNA---'),
 ('de', 'R--------'),
 ('C', 'Unk'),
 ('.', 'U--------'),
 ('Laelio', 'Unk'),
 ('socero', 'Unk'),
 ('suo', 'A-S---NB-'),
 ('memoriter', 'Unk'),
 ('et', 'C--------'),
 ('iucunde', 'Unk'),
 ('solebat', 'V3SIIA---'),
 ('nec', 'C--------'),
 ('dubitare', 'Unk'),
 ('illum', 'P-S---MA-'),
 ('in', 'R--------'),
 ('omni', 'A-S---MB-'),
 ('sermone', 'N-S---MB-'),
 ('appellare', 'V--PNA---'),
 ('sapientem', 'Unk'),
 (';', 'Unk'),
 ('ego', 'P-S---MN-'),
 ('autem', 'C--------'),
 ('a', 'R--------'),
 ('patre', 'N-S---MB-'),
 ('ita', 'D--------'),
 ('eram', 'V1SIIA---'),
 ('deductus', 'T-SRPPM

In [128]:
for token, pos_tag in tnt_output:
    print(pos_tag)
    if pos_tag != "Unk":
        print(Morph(pos_tag.lower()).full)

Unk
Unk
Unk
Unk
Unk
Unk
Unk
Unk
Unk
Unk
Unk
Unk
Unk
Unk
U--------
{'pos': 'punctuation', 'person': '-', 'number': '-', 'tense': '-', 'mood': '-', 'voice': '-', 'gender': '-', 'case': '-', 'degree': '-'}
Unk
U--------
{'pos': 'punctuation', 'person': '-', 'number': '-', 'tense': '-', 'mood': '-', 'voice': '-', 'gender': '-', 'case': '-', 'degree': '-'}
Unk
U--------
{'pos': 'punctuation', 'person': '-', 'number': '-', 'tense': '-', 'mood': '-', 'voice': '-', 'gender': '-', 'case': '-', 'degree': '-'}
Unk
Unk
A-P---NA-
{'pos': 'adjective', 'person': '-', 'number': 'plural', 'tense': '-', 'mood': '-', 'voice': '-', 'gender': 'neuter', 'case': 'accusative', 'degree': '-'}
V--PNA---
{'pos': 'verb', 'person': '-', 'number': '-', 'tense': 'present', 'mood': 'infinitive', 'voice': 'active', 'gender': '-', 'case': '-', 'degree': '-'}
R--------
{'pos': 'preposition', 'person': '-', 'number': '-', 'tense': '-', 'mood': '-', 'voice': '-', 'gender': '-', 'case': '-', 'degree': '-'}
Unk
U--------
{'

KeyError: 'b'

In [136]:
for token, pos_tag in tnt_output:
    if pos_tag != "Unk":
        try:
            pos_info = Morph(pos_tag.lower()).full
            print("{} \t {}\n".format(token, pos_info["pos"]))
        except Exception as e:
            print("Expanded form for {} not available (error: {})".format(pos_tag, e))

[ 	 punctuation

] 	 punctuation

. 	 punctuation

multa 	 adjective

narrare 	 verb

de 	 preposition

. 	 punctuation

Expanded form for A-S---NB- not available (error: 'b')
et 	 conjunction

solebat 	 verb

nec 	 conjunction

illum 	 pron

in 	 preposition

Expanded form for A-S---MB- not available (error: 'b')
Expanded form for N-S---MB- not available (error: 'b')
appellare 	 verb

ego 	 pron

autem 	 conjunction

a 	 preposition

Expanded form for N-S---MB- not available (error: 'b')
ita 	 adverb

eram 	 verb

deductus 	 verb

ad 	 preposition
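The tags are positional: nine characters, one per feature, in the order shown in the expanded output above (pos, person, number, tense, mood, voice, gender, case, degree). The `KeyError: 'b'` suggests that the lookup table used by `Morph` has no entry for the case code `b`. The sketch below is a hand-written decoder, not the gAGDT implementation: `decode_tag` and its tables are hypothetical, the tables are partial and based on the values seen in this notebook, and reading `b` as the Latin ablative is an assumption about the tagset.

```python
FIELDS = ['pos', 'person', 'number', 'tense', 'mood',
          'voice', 'gender', 'case', 'degree']

# partial lookup tables, one per position (illustrative, not exhaustive)
CODES = {
    'pos': {'n': 'noun', 'v': 'verb', 't': 'verb', 'a': 'adjective',
            'r': 'preposition', 'c': 'conjunction', 'd': 'adverb',
            'u': 'punctuation'},
    'number': {'s': 'singular', 'p': 'plural'},
    'tense': {'p': 'present', 'i': 'imperfect', 'r': 'perfect'},
    'mood': {'i': 'indicative', 'n': 'infinitive', 'p': 'participle'},
    'voice': {'a': 'active', 'p': 'passive'},
    'gender': {'m': 'masculine', 'f': 'feminine', 'n': 'neuter'},
    # 'b' is assumed to be the ablative of the Latin tagset
    'case': {'n': 'nominative', 'g': 'genitive', 'd': 'dative',
             'a': 'accusative', 'b': 'ablative'},
}

def decode_tag(tag):
    """Expand a 9-character positional tag into a dict of features."""
    if len(tag) != len(FIELDS):
        raise ValueError('expected a 9-character tag, got {!r}'.format(tag))
    decoded = {}
    for field, code in zip(FIELDS, tag.lower()):
        # codes missing from the tables are kept as they are, no KeyError
        decoded[field] = CODES.get(field, {}).get(code, code)
    return decoded

print(decode_tag('T-SRPPMN-'))
print(decode_tag('N-S---MB-')['case'])
```

Falling back to the raw code keeps the loop over a tagger's output running even when a tag contains a value the tables do not cover.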



#### TreeTagger

`TreeTagger` is available as a command line tool, but there are ways of calling it from within Python code.

The code below uses a *Python wrapper*, namely a couple of Python classes/methods that expose `TreeTagger`'s functionality via Python objects/methods.

In [101]:
from treetagger import TreeTagger

In [102]:
os.environ["TREETAGGER_HOME"] = os.path.expanduser('~/tree-tagger/cmd/')

In [103]:
tt = TreeTagger(language="latin")

In [106]:
tt.tag("Cogito ergo sum")

[['Cogito', 'V:IMP', 'cogo'],
 ['ergo', 'ADV', 'ergo'],
 ['sum', 'ESSE:IND', 'sum']]
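To make the idea of a wrapper more concrete, here is a minimal sketch of the general pattern: run an external command, feed it text, and parse its tab-separated output into Python lists. This is not the code of the `treetagger` package. `CommandLineTagger` is a hypothetical class, and a tiny Python script stands in for the real tagger so that the example runs without TreeTagger installed.

```python
import subprocess
import sys

class CommandLineTagger:
    """Hypothetical wrapper: sends text to a command, parses tab-separated output."""

    def __init__(self, command):
        self.command = command  # program and arguments, as a list

    def tag(self, text):
        completed = subprocess.run(
            self.command, input=text, stdout=subprocess.PIPE,
            universal_newlines=True, check=True
        )
        # expected output format: one token per line, token<TAB>tag<TAB>lemma
        return [line.split('\t') for line in completed.stdout.splitlines() if line]

# stand-in for a real tagger: tags every token 'X', lemma = lowercased token
fake_tagger = [
    sys.executable, '-c',
    'import sys\n'
    'for tok in sys.stdin.read().split():\n'
    '    print(tok, "X", tok.lower(), sep="\\t")'
]

tt_like = CommandLineTagger(fake_tagger)
print(tt_like.tag('Cogito ergo sum'))
```

The caller only sees Python lists, while the actual work happens in a separate process.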

In [104]:
tt.tag(amicitia_words[100:150])

[['91', 'ADJ:NUM', '@card@'],
 ['92', 'ADJ:NUM', '@card@'],
 ['93', 'ADJ:NUM', '@card@'],
 ['94', 'ADJ:NUM', '@card@'],
 ['95', 'ADJ:NUM', '@card@'],
 ['96', 'ADJ:NUM', '@card@'],
 ['97', 'ADJ:NUM', '@card@'],
 ['98', 'ADJ:NUM', '@card@'],
 ['99', 'ADJ:NUM', '@card@'],
 ['100', 'ADJ:NUM', '@card@'],
 ['101', 'ADJ:NUM', '@card@'],
 ['102', 'ADJ:NUM', '@card@'],
 ['103', 'ADJ:NUM', '@card@'],
 ['104', 'ADJ:NUM', '@card@'],
 ['[', 'PUN', '['],
 ['1', 'ADJ:NUM', '@card@'],
 [']', 'PUN', ']'],
 ['Q.', 'ABBR', 'Q.'],
 ['Mucius', 'ADJ', '<unknown>'],
 ['augur', 'N:nom', 'augur'],
 ['multa', 'ADJ', 'multus'],
 ['narrare', 'V:INF', 'narro'],
 ['de', 'PREP', 'de'],
 ['C.', 'ABBR', 'C.'],
 ['Laelio', 'N:abl', '<unknown>'],
 ['socero', 'N:abl', 'socer'],
 ['suo', 'POSS', 'suus'],
 ['memoriter', 'ADV', 'memoriter'],
 ['et', 'CC', 'et'],
 ['iucunde', 'ADJ', '<unknown>'],
 ['solebat', 'V:IND', 'soleo'],
 ['nec', 'CC', 'nec'],
 ['dubitare', 'V:INF', 'dubito'],
 ['illum', 'DIMOS', 'ille'],
 ['in', 'PRE

### Greek

#### CLTK taggers

In [13]:
grk_corpus_importer.import_corpus("greek_models_cltk")

In [37]:
from cltk.tag.pos import POSTag
tagger = POSTag('greek')

In [39]:
tagger.tag_ngram_123_backoff('θεοὺς μὲν αἰτῶ τῶνδ᾽ ἀπαλλαγὴν πόνων φρουρᾶς ἐτείας μῆκος')

[('θεοὺς', 'N-P---MA-'),
 ('μὲν', 'G--------'),
 ('αἰτῶ', 'V1SPIA---'),
 ('τῶνδ', 'P-P---MG-'),
 ('᾽', None),
 ('ἀπαλλαγὴν', 'N-S---FA-'),
 ('πόνων', 'N-P---MG-'),
 ('φρουρᾶς', 'N-S---FG-'),
 ('ἐτείας', 'A-S---FG-'),
 ('μῆκος', 'N-S---NA-')]

In [40]:
tagger.tag_tnt('θεοὺς μὲν αἰτῶ τῶνδ᾽ ἀπαλλαγὴν πόνων φρουρᾶς ἐτείας μῆκος')

[('θεοὺς', 'N-P---MA-'),
 ('μὲν', 'G--------'),
 ('αἰτῶ', 'V1SPIA---'),
 ('τῶνδ', 'P-P---NG-'),
 ('᾽', 'Unk'),
 ('ἀπαλλαγὴν', 'N-S---FA-'),
 ('πόνων', 'N-P---MG-'),
 ('φρουρᾶς', 'N-S---FG-'),
 ('ἐτείας', 'A-S---FG-'),
 ('μῆκος', 'N-S---NA-')]
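The two outputs above are almost identical, and the differences are easier to spot programmatically. The sketch below copies the first five tagged tokens from each output and lists the positions where the taggers disagree. `disagreements` is a helper written for this example, not part of CLTK.

```python
# first five tokens of each output, copied from the cells above
grk_backoff = [('θεοὺς', 'N-P---MA-'), ('μὲν', 'G--------'),
               ('αἰτῶ', 'V1SPIA---'), ('τῶνδ', 'P-P---MG-'), ('᾽', None)]
grk_tnt = [('θεοὺς', 'N-P---MA-'), ('μὲν', 'G--------'),
           ('αἰτῶ', 'V1SPIA---'), ('τῶνδ', 'P-P---NG-'), ('᾽', 'Unk')]

def disagreements(tagged_a, tagged_b):
    """Return (token, tag_a, tag_b) wherever the same token gets different tags."""
    return [
        (tok_a, tag_a, tag_b)
        for (tok_a, tag_a), (tok_b, tag_b) in zip(tagged_a, tagged_b)
        if tok_a == tok_b and tag_a != tag_b
    ]

for token, tag_a, tag_b in disagreements(grk_backoff, grk_tnt):
    print(token, tag_a, tag_b)
```

In this sample the taggers differ on the gender of `τῶνδ`, and they mark the unanalysed apostrophe differently (`None` versus `'Unk'`).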

#### Small detour: importing a Greek corpus

**Q**: Is there a handy way to import an Ancient Greek corpus as we did for Latin?

**A**: not quite (yet)



First step: convert the Perseus data (TEI/XML) into plain text:

In [5]:
# my modified version of https://github.com/cltk/greek_text_perseus/blob/master/perseus_compiler.py
# (`tqdm` was imported at the top of the notebook)

import os
import re
import bleach
from cltk.corpus.greek.beta_to_unicode import Replacer

cltk_path = os.path.join(os.path.expanduser('~'), 'cltk_data')
perseus_root = os.path.join(cltk_path, 'greek', 'text', 'greek_text_perseus')
# the converted plain-text files are written here, one directory per author
unicode_root = os.path.join(cltk_path, 'greek', 'text', 'perseus_unicode')

# entries of the corpus directory that are not author directories
ignore = [
    '.git',
    'LICENSE.md',
    'README.md',
    'cltk_json',
    'json',
    'perseus_compiler.py'
]
authors = [d for d in os.listdir(perseus_root) if d not in ignore]

# one converter can be reused for every file
a_replacer = Replacer()

for author in tqdm(authors):
    source_dir = os.path.join(perseus_root, author, 'opensource')
    for text in os.listdir(source_dir):
        # keep only the Greek files, named `*_gk.xml`
        if not re.match(r'.*_gk\.xml$', text):
            continue
        with open(os.path.join(source_dir, text), encoding='utf-8') as gk:
            xml = gk.read()
        # strip the markup, then convert Beta Code to Unicode Greek
        beta_code = bleach.clean(xml, strip=True).upper()
        unicode_converted = a_replacer.beta_code(beta_code)
        # create the output directories if they do not exist yet
        author_path = os.path.join(unicode_root, author)
        os.makedirs(author_path, exist_ok=True)
        txt_name = os.path.splitext(text)[0] + '.txt'
        with open(os.path.join(author_path, txt_name), 'w', encoding='utf-8') as out:
            out.write(unicode_converted)

100%|██████████| 61/61 [03:44<00:00,  3.68s/it]


In [137]:
import os.path
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
from cltk.tokenize.sentence import TokenizeSentence
from cltk.tokenize.word import WordTokenizer

In [138]:
agrk_word_tokenizer = WordTokenizer('greek')
agrk_sentence_tokenizer = TokenizeSentence("greek")

In [139]:
cltk_path = os.path.expanduser('~/cltk_data')
try:
    perseusgreek = PlaintextCorpusReader(
        cltk_path + '/greek/text/perseus_unicode/',
        r'.*\.txt',
        word_tokenizer=agrk_word_tokenizer,
        sent_tokenizer=agrk_sentence_tokenizer,
        encoding='utf-8'
    )
except IOError as e:
    # report the problem instead of failing silently
    print("Corpus not found. Please check that the converted Perseus texts are in CLTK_DATA.")

In [140]:
perseusgreek

<PlaintextCorpusReader in '/Users/rromanello/cltk_data/greek/text/perseus_unicode'>

In [92]:
birds = perseusgreek.words('Aristophanes/aristoph.birds_gk.txt')

In [93]:
print(list(birds[1000:1100]))

['ἐπέγειρον', 'αὐτόν', '.', 'Θεράπων', 'Ἔποποσ', 'οἶδα', 'μὲν', 'σαφῶσ', 'ὅτι', 'ἀχθέσεται', ',', 'σφῷν', 'δ’', 'αὐτὸν', 'οὕνεκ’', 'ἐπεγερῶ', '.', 'Πισθέταιροσ', 'κακῶς', 'σύ', 'γ’', 'ἀπόλοῐ', ',', 'ὥς', 'μ’', 'ἀπέκτεινας', 'δέει', '.', 'Ἐυελπίδησ', 'οἴμοι', 'κακοδαίμων', 'χὠ', 'κολοιός', 'μοἴχεται', 'ὑπὸ', 'τοῦ', 'δέους\\', '.', 'Πισθέταιροσ', 'ὦ', 'δειλότατον', 'σὺ', 'θηρίον', ',', 'δείσας', 'ἀφῆκας', 'τὸν', 'κολοιόν', ';', 'Ἐυελπίδησ', 'εἰπέ', 'μοι', ',', 'σὺ', 'δὲ', 'τὴν', 'κορώνην', 'οὐκ', 'ἀφῆκας', 'καταπεσών', ';', 'Πισθέταιροσ', 'μὰ', 'Δί’', 'οὐκ', 'ἔγωγε', '.', 'Ἐυελπίδησ', 'ποῦ', 'γάρ', 'ἐστ’', ';', 'Πισθέταιροσ', 'ἀπέπτετο', '.', 'Ἐυελπίδησ', 'οὐκ', 'ἆρ’', 'ἀφῆκας', ';', 'ὦγάθ’', 'ὡς', 'ἀνδρεῖος', 'εἶ', '.', 'Ἔποψ', 'ἄνοιγε', 'τὴν', 'ὕλην', ',', 'ἵν’', 'ἐξέλθω', 'ποτέ', '.', 'Ἐυελπίδησ', 'ὦ', 'Ἡράκλεις', 'τουτὶ', 'τί', 'ποτ’']


In [85]:
tagger.tag_tnt(" ".join(birds[1000:1100]))

[('ἐπέγειρον', 'Unk'),
 ('αὐτόν', 'A-S---MA-'),
 ('.', 'U--------'),
 ('Θεράπων', 'Unk'),
 ('Ἔποποσ', 'Unk'),
 ('οἶδα', 'V1SRIA---'),
 ('μὲν', 'G--------'),
 ('σαφῶσ', 'Unk'),
 ('ὅτι', 'C--------'),
 ('ἀχθέσεται', 'Unk'),
 (',', 'U--------'),
 ('σφῷν', 'P-D---MG-'),
 ('δ', 'G--------'),
 ('’', 'Unk'),
 ('αὐτὸν', 'A-S---MA-'),
 ('οὕνεκ', 'C--------'),
 ('’', 'Unk'),
 ('ἐπεγερῶ', 'Unk'),
 ('.', 'U--------'),
 ('Πισθέταιροσ', 'Unk'),
 ('κακῶς', 'D--------'),
 ('σύ', 'P-S----N-'),
 ('γ', 'G--------'),
 ('’', 'Unk'),
 ('ἀπόλοῐ', 'Unk'),
 (',', 'U--------'),
 ('ὥς', 'C--------'),
 ('μ', 'P-S---MA-'),
 ('’', 'Unk'),
 ('ἀπέκτεινας', 'Unk'),
 ('δέει', 'N-S---ND-'),
 ('.', 'U--------'),
 ('Ἐυελπίδησ', 'Unk'),
 ('οἴμοι', 'E--------'),
 ('κακοδαίμων', 'Unk'),
 ('χὠ', 'L-S---MN-'),
 ('κολοιός', 'Unk'),
 ('μοἴχεται', 'Unk'),
 ('ὑπὸ', 'R--------'),
 ('τοῦ', 'L-S---NG-'),
 ('δέους', 'N-S---NG-'),
 ('\\', 'Unk'),
 ('.', 'U--------'),
 ('Πισθέταιροσ', 'Unk'),
 ('ὦ', 'E--------'),
 ('δειλότατον', 'Unk'),

#### TreeTagger

## Lemmatization

This section presents two examples: one suitable for automatic lemmatization, the other intended more as a support for the reader/annotator.

The main difference between them is how they handle words that may correspond to multiple lemmata.

### Latin

#### CLTK

In [68]:
from cltk.utils.file_operations import open_pickle

In [69]:
# Locate the directory that holds the pickled training sentences
rel_path = '~/cltk_data/latin/model/latin_models_cltk/lemmata/backoff'
path = os.path.expanduser(rel_path)

# Check for presence of latin_pos_lemmatized_sents
file = 'latin_pos_lemmatized_sents.pickle' 
latin_pos_lemmatized_sents_path = os.path.join(path, file)
if os.path.isfile(latin_pos_lemmatized_sents_path):
    latin_pos_lemmatized_sents = open_pickle(latin_pos_lemmatized_sents_path)
else:
    latin_pos_lemmatized_sents = []
    print('The file %s is not available in cltk_data' % file)

In [61]:
amicitia_sents = latinlibrary.sents('cicero/amic.txt')

In [63]:
amicitia_sents[10]

['[',
 '4',
 ']',
 'Cum',
 'enim',
 'saepe',
 'cum',
 'me',
 'ageres',
 'ut',
 'de',
 'amicitia',
 'scriberem',
 'aliquid',
 ',',
 'digna',
 'mihi',
 'res',
 'cum',
 'omnium',
 'cognitione',
 'tum',
 'nostra',
 'familiaritate',
 'visa',
 'est',
 '.']

In [41]:
from cltk.lemmatize.latin.backoff import BackoffLatinLemmatizer

In [70]:
backoff_lemmatizer = BackoffLatinLemmatizer(train=latin_pos_lemmatized_sents)

In [71]:
backoff_lemmatizer.lemmatize(amicitia_sents[10])

[('[', 'punc'),
 ('4', '4'),
 (']', 'punc'),
 ('Cum', 'Cos2'),
 ('enim', 'enim'),
 ('saepe', 'saepe'),
 ('cum', 'cum2'),
 ('me', 'ego'),
 ('ageres', 'ago'),
 ('ut', 'ut'),
 ('de', 'de'),
 ('amicitia', 'amicitia'),
 ('scriberem', 'scribo'),
 ('aliquid', 'aliquis'),
 (',', 'punc'),
 ('digna', 'dignus'),
 ('mihi', 'ego'),
 ('res', 'res'),
 ('cum', 'cum2'),
 ('omnium', 'omnis'),
 ('cognitione', 'cognitio'),
 ('tum', 'tum'),
 ('nostra', 'noster'),
 ('familiaritate', 'familiaritas'),
 ('visa', 'video'),
 ('est', 'sum'),
 ('.', 'punc')]

But behind the scenes, this is what is going on:

```python

    def _define_lemmatizer(self):
        # Suggested backoff chain--should be tested for optimal order
        backoff0 = None
        backoff1 = IdentityLemmatizer()
        backoff2 = TrainLemmatizer(model=self.LATIN_OLD_MODEL, backoff=backoff1)
        backoff3 = PPLemmatizer(regexps=self.latin_verb_patterns, pps=self.latin_pps, backoff=backoff2)                 
        backoff4 = RegexpLemmatizer(self.latin_sub_patterns, backoff=backoff3)
        backoff5 = UnigramLemmatizer(self.train_sents, backoff=backoff4)
        backoff6 = TrainLemmatizer(model=self.LATIN_MODEL, backoff=backoff5)      
        #backoff7 = BigramPOSLemmatizer(self.pos_train_sents, include=['cum'], backoff=backoff6)
        #lemmatizer = backoff7
        lemmatizer = backoff6
        return lemmatizer

```
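The pattern in the snippet above is a *backoff chain*: each lemmatizer tries to produce a lemma and, if it cannot, hands the token to the next one in line. The sketch below illustrates that pattern in plain Python. It is not CLTK's implementation: the `Toy...` classes, the two-entry dictionary and the single suffix rule are all made up for this example.

```python
import re

class ToyLemmatizer:
    """Base class: try to lemmatize, otherwise hand the token to the backoff."""

    def __init__(self, backoff=None):
        self.backoff = backoff

    def choose_lemma(self, token):
        return None  # overridden by subclasses

    def lemmatize_token(self, token):
        lemma = self.choose_lemma(token)
        if lemma is None and self.backoff is not None:
            return self.backoff.lemmatize_token(token)
        return lemma

    def lemmatize(self, tokens):
        return [(token, self.lemmatize_token(token)) for token in tokens]

class ToyDictLemmatizer(ToyLemmatizer):
    """Looks the token up in a form -> lemma dictionary."""
    def __init__(self, model, backoff=None):
        super().__init__(backoff)
        self.model = model

    def choose_lemma(self, token):
        return self.model.get(token)

class ToySuffixLemmatizer(ToyLemmatizer):
    """Rewrites the ending of the token with the first matching rule."""
    def __init__(self, patterns, backoff=None):
        super().__init__(backoff)
        self.patterns = [(re.compile(p), r) for p, r in patterns]

    def choose_lemma(self, token):
        for pattern, replacement in self.patterns:
            if pattern.search(token):
                return pattern.sub(replacement, token)
        return None

class ToyIdentityLemmatizer(ToyLemmatizer):
    """Last resort: the token is its own lemma."""
    def choose_lemma(self, token):
        return token

# the chain is built from the last resort up to the first choice
last_resort = ToyIdentityLemmatizer()
by_suffix = ToySuffixLemmatizer([(r'itate$', 'itas')], backoff=last_resort)
toy_chain = ToyDictLemmatizer({'est': 'sum', 'me': 'ego'}, backoff=by_suffix)

print(toy_chain.lemmatize(['me', 'familiaritate', 'est', 'enim']))
```

As the comment in the CLTK snippet points out, the order of the chain matters: the most reliable component should be asked first, and the identity lemmatizer goes last because it always answers.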

Further reading:
* https://github.com/cltk/cltk/blob/master/cltk/lemmatize/latin/backoff.py
* https://disiectamembra.wordpress.com/2016/08/23/wrapping-up-google-summer-of-code/


#### PyCollatinus

* Python port of the [Collatinus lemmatizer](https://github.com/biblissima/collatinus)
* good if you can read some French (or at least practice it) 😉
* the PoS tags used by Collatinus are explained [here](https://github.com/biblissima/collatinus/blob/master/NOTES_Tagger.md)
* morphological analysis not readily machine readable

We import and instantiate the PyCollatinus lemmatizer (`Lemmatiseur`) – and ignore the long list of warnings. 

In [25]:
from pycollatinus import Lemmatiseur
analyzer = Lemmatiseur()

/Users/mat/.local/share/virtualenvs/sunoikisis_dc-I3iKJ3Z3/lib/python3.6/site-packages/pycollatinus/parser.py:335: MissingRadical: honor has no radical 1
/Users/mat/.local/share/virtualenvs/sunoikisis_dc-I3iKJ3Z3/lib/python3.6/site-packages/pycollatinus/parser.py:335: MissingRadical: aer has no radical 1
/Users/mat/.local/share/virtualenvs/sunoikisis_dc-I3iKJ3Z3/lib/python3.6/site-packages/pycollatinus/parser.py:335: MissingRadical: tethys has no radical 1
/Users/mat/.local/share/virtualenvs/sunoikisis_dc-I3iKJ3Z3/lib/python3.6/site-packages/pycollatinus/parser.py:335: MissingRadical: opes has no radical 1
/Users/mat/.local/share/virtualenvs/sunoikisis_dc-I3iKJ3Z3/lib/python3.6/site-packages/pycollatinus/parser.py:335: MissingRadical: dos has no radical 1
/Users/mat/.local/share/virtualenvs/sunoikisis_dc-I3iKJ3Z3/lib/python3.6/site-packages/pycollatinus/parser.py:335: MissingRadical: corpus has no radical 1
/Users/mat/.local/share/virtualenvs/sunoikisis_dc-I3iKJ3Z3/lib/python3.6/site-p

The lemmatiser can take as input a **single word**

In [25]:
list(analyzer.lemmatise("Cogito"))

[{'desinence': 'ito',
  'form': 'cogito',
  'lemma': 'cogo',
  'morph': '2ème singulier impératif futur actif',
  'radical': 'cog'},
 {'desinence': 'ito',
  'form': 'cogito',
  'lemma': 'cogo',
  'morph': '3ème singulier impératif futur actif',
  'radical': 'cog'},
 {'desinence': 'o',
  'form': 'cogito',
  'lemma': 'cogito',
  'morph': '1ère singulier indicatif présent actif',
  'radical': 'cogit'},
 {'desinence': 'o',
  'form': 'cogito',
  'lemma': 'cogito',
  'morph': '1ère singulier indicatif présent actif',
  'radical': 'cogit'}]

or an **entire sentence**

In [26]:
list(analyzer.lemmatise_multiple("Cogito ergo sum"))

[[{'desinence': 'ito',
   'form': 'cogito',
   'lemma': 'cogo',
   'morph': '2ème singulier impératif futur actif',
   'radical': 'cog'},
  {'desinence': 'ito',
   'form': 'cogito',
   'lemma': 'cogo',
   'morph': '3ème singulier impératif futur actif',
   'radical': 'cog'},
  {'desinence': 'o',
   'form': 'cogito',
   'lemma': 'cogito',
   'morph': '1ère singulier indicatif présent actif',
   'radical': 'cogit'},
  {'desinence': 'o',
   'form': 'cogito',
   'lemma': 'cogito',
   'morph': '1ère singulier indicatif présent actif',
   'radical': 'cogit'}],
 [{'desinence': 'o',
   'form': 'ergo',
   'lemma': 'ergo',
   'morph': '1ère singulier indicatif présent actif',
   'radical': 'erg'},
  {'desinence': '',
   'form': 'ergo',
   'lemma': 'ergo',
   'morph': '-',
   'radical': 'ergo'},
  {'desinence': '',
   'form': 'ergo',
   'lemma': 'ergo',
   'morph': 'positif',
   'radical': 'ergo'}],
 [{'desinence': 'um',
   'form': 'sum',
   'lemma': 'sum',
   'morph': '1ère singulier indicatif p

Let's try to output the lemmatisation in a more intelligible way...

In [35]:
# the analyzer output is essentially a list of lists
# for each analyzed token it returns a list of possible lemmata
# here we iterate through both lists and display the analysis as we go along

for n, result in enumerate(analyzer.lemmatise_multiple("Cogito ergo sum")):
    for i, lemma in enumerate(result):
        print(
            "{}.{}\t{}\t\t{} {}".format(
                n + 1,
                i + 1,
                lemma["form"],
                lemma["lemma"],
                lemma["morph"]
            )
        )

1.1	cogito		cogo 2ème singulier impératif futur actif
1.2	cogito		cogo 3ème singulier impératif futur actif
1.3	cogito		cogito 1ère singulier indicatif présent actif
1.4	cogito		cogito 1ère singulier indicatif présent actif
2.1	ergo		ergo 1ère singulier indicatif présent actif
2.2	ergo		ergo -
2.3	ergo		ergo positif
3.1	sum		sum 1ère singulier indicatif présent actif


The same, but also with the PoS tag:

In [30]:
# the analyzer output is essentially a list of lists
# for each analyzed token it returns a list of possible lemmata
# here we iterate through both lists and display the analysis as we go along

for n, result in enumerate(analyzer.lemmatise_multiple("Cogito ergo sum", pos=True)):
    for i, lemma in enumerate(result):
        print(
            "{}.{}\t{}\t{}\t{} {}".format(
                n + 1,
                i + 1,
                lemma["form"],
                lemma["pos"],
                lemma["lemma"],
                lemma["morph"]
            )
        )

1.1	cogito	v	cogo 2ème singulier impératif futur actif
1.2	cogito	v	cogo 3ème singulier impératif futur actif
1.3	cogito	v	cogito 1ère singulier indicatif présent actif
1.4	cogito	v	cogito 1ère singulier indicatif présent actif
2.1	ergo	v	ergo 1ère singulier indicatif présent actif
2.2	ergo	c	ergo -
2.3	ergo	d	ergo positif
3.1	sum	v	sum 1ère singulier indicatif présent actif
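Because PyCollatinus returns every possible analysis, choosing among them is left to the reader/annotator. A first step is to see which tokens are ambiguous at the level of the lemma. The sketch below works on a hand-copied, trimmed version of the output above (only the `form` and `lemma` keys), so it runs without PyCollatinus. `candidate_lemmata` is a helper written for this example.

```python
from collections import OrderedDict

# analyses copied from the output above, trimmed to two keys per analysis
analyses = [
    [{'form': 'cogito', 'lemma': 'cogo'}, {'form': 'cogito', 'lemma': 'cogo'},
     {'form': 'cogito', 'lemma': 'cogito'}, {'form': 'cogito', 'lemma': 'cogito'}],
    [{'form': 'ergo', 'lemma': 'ergo'}, {'form': 'ergo', 'lemma': 'ergo'},
     {'form': 'ergo', 'lemma': 'ergo'}],
    [{'form': 'sum', 'lemma': 'sum'}],
]

def candidate_lemmata(token_analyses):
    """Distinct lemmata for one token, in order of first appearance."""
    return list(OrderedDict.fromkeys(a['lemma'] for a in token_analyses))

for token_analyses in analyses:
    form = token_analyses[0]['form']
    lemmata = candidate_lemmata(token_analyses)
    status = 'ambiguous' if len(lemmata) > 1 else 'unambiguous'
    print(form, lemmata, status)
```

Here only `cogito` has more than one candidate lemma. `ergo` has several analyses but a single lemma, so it is ambiguous only at the level of morphology and PoS.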
