# Part of Speech tagging and lemmatisation with 🐍

See information on the [Sunoikisis Wiki](https://github.com/SunoikisisDC/SunoikisisDC-2017-2018/wiki/Python-2:-Part-of-Speech-tagging-and-lemmatisation).

## Goals

* present tools/libraries that can be used for PoS tagging and lemmatisation
    * [CLTK](http://cltk.org/)
    * [TreeTagger](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/)
    * [Collatinus](https://github.com/biblissima/collatinus) / [pyCollatinus](https://github.com/PonteIneptique/collatinus-python)
* have a peek at what goes on behind the scenes, which libraries usually hide from the user
* show that there is a growing amount of Python code/libraries for linguistic annotation of Ancient Greek and Latin
* ... but at the same time it takes a bit of bricolage to get things to work

## Imports

In [11]:
import os
from tqdm import tqdm
import sys
import cltk

In [12]:
cltk.__version__

'0.1.83'

## Download corpora

### Greek

In [13]:
from cltk.corpus.utils.importer import CorpusImporter

In [14]:
grk_corpus_importer = CorpusImporter('greek')

In [15]:
grk_corpus_importer.list_corpora

['greek_software_tlgu',
 'greek_text_perseus',
 'phi7',
 'tlg',
 'greek_proper_names_cltk',
 'greek_models_cltk',
 'greek_treebank_perseus',
 'greek_lexica_perseus',
 'greek_training_set_sentence_cltk',
 'greek_word2vec_cltk',
 'greek_text_lacus_curtius',
 'greek_text_first1kgreek']

In [None]:
grk_corpus_importer.import_corpus('greek_text_perseus')

### Latin

In [16]:
from cltk.corpus.latin import latinlibrary

In [17]:
la_corpus_importer = CorpusImporter('latin')

In [18]:
la_corpus_importer.list_corpora

['latin_text_perseus',
 'latin_treebank_perseus',
 'latin_text_latin_library',
 'phi5',
 'phi7',
 'latin_proper_names_cltk',
 'latin_models_cltk',
 'latin_pos_lemmata_cltk',
 'latin_treebank_index_thomisticus',
 'latin_lexica_perseus',
 'latin_training_set_sentence_cltk',
 'latin_word2vec_cltk',
 'latin_text_antique_digiliblt',
 'latin_text_corpus_grammaticorum_latinorum',
 'latin_text_poeti_ditalia']

In [None]:
la_corpus_importer.import_corpus('latin_training_set_sentence_cltk')

In [None]:
la_corpus_importer.import_corpus('latin_text_latin_library')

In [19]:
from cltk.corpus.latin import latinlibrary

In [21]:
type(latinlibrary)

nltk.corpus.reader.plaintext.PlaintextCorpusReader

In [22]:
amicitia_words = latinlibrary.words('cicero/amic.txt')

In [23]:
len(amicitia_words)

11618

In [25]:
latinlibrary.fileids()

['12tables.txt',
 '1644.txt',
 'abbofloracensis.txt',
 'abelard/dialogus.txt',
 'abelard/epistola.txt',
 'abelard/historia.txt',
 'addison/barometri.txt',
 'addison/burnett.txt',
 'addison/hannes.txt',
 'addison/machinae.txt',
 'addison/pax.txt',
 'addison/praelium.txt',
 'addison/preface.txt',
 'addison/resurr.txt',
 'addison/sphaer.txt',
 'adso.txt',
 'aelredus.txt',
 'agnes.txt',
 'alanus/alanus1.txt',
 'alanus/alanus2.txt',
 'albertanus/albertanus.arsloquendi.txt',
 'albertanus/albertanus.liberconsol.txt',
 'albertanus/albertanus.sermo.txt',
 'albertanus/albertanus.sermo1.txt',
 'albertanus/albertanus.sermo2.txt',
 'albertanus/albertanus.sermo3.txt',
 'albertanus/albertanus.sermo4.txt',
 'albertanus/albertanus1.txt',
 'albertanus/albertanus2.txt',
 'albertanus/albertanus3.txt',
 'albertanus/albertanus4.txt',
 'albertofaix/hist1.txt',
 'albertofaix/hist10.txt',
 'albertofaix/hist11.txt',
 'albertofaix/hist12.txt',
 'albertofaix/hist2.txt',
 'albertofaix/hist3.txt',
 'albertofaix/his

We can get the first `n` tokens (or any other range) of this text by using *slice notation*:

In [26]:
# the first ten tokens
amicitia_words[:10]

['Cicero',
 ':',
 'de',
 'Amicitia',
 'M.',
 'TVLLI',
 'CICERONIS',
 'LAELIVS',
 'DE',
 'AMICITIA']

In [27]:
# or the last token
amicitia_words[-1]

'Page'

We can also count occurrences with the `count()` method, passing the token we want to inspect as its argument:

In [29]:
amicitia_words.count('et')

236

In [30]:
amicitia_words.count('amicitia')

67
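
Each call to `count()` looks at one token only. As a small sketch, assuming we are happy to work on a plain Python list, `collections.Counter` from the standard library tallies every token in a single pass. The `sample_tokens` list is made up for illustration and is not taken from the corpus.

```python
from collections import Counter

# Hypothetical sample; in the notebook one could pass list(amicitia_words) instead.
sample_tokens = ['de', 'amicitia', 'et', 'de', 'senectute', 'et', 'de', 'officiis']

frequencies = Counter(sample_tokens)

# Look up a single token (tokens that never occur count as 0) ...
print(frequencies['et'])
# ... or list the most frequent tokens together with their counts.
print(frequencies.most_common(2))
```

Note that the counts are case-sensitive, exactly like `count()`: `'Amicitia'` and `'amicitia'` are different tokens.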

Let's have a closer look at the `type` of the variable `amicitia_words`, into which we loaded the content of Cicero's *De Amicitia*:

In [31]:
type(amicitia_words)

nltk.corpus.reader.util.StreamBackedCorpusView

In [33]:
amicitia_words?

**Caveat**: `cltk.corpus.latin.latinlibrary` is a shortcut for several things, and there is nothing comparable (yet) for Greek (see [source code](https://github.com/cltk/cltk/blob/master/cltk/corpus/latin/__init__.py)).

## Part of Speech Tagging

### Latin

#### CLTK taggers

The CLTK documentation has runnable examples [here](http://docs.cltk.org/en/latest/latin.html#pos-tagging).

In [34]:
from cltk.tag.pos import POSTag
tagger = POSTag('latin')
tagger.tag_ngram_123_backoff('Gallia est omnis divisa in partes tres')

[('Gallia', None),
 ('est', 'V3SPIA---'),
 ('omnis', 'A-S---MN-'),
 ('divisa', 'T-PRPPNN-'),
 ('in', 'R--------'),
 ('partes', 'N-P---FA-'),
 ('tres', 'M--------')]

**Why is it failing?**

CLTK relies on some components (e.g. trained models) that are read from disk instead of being generated on the fly. If those files have not been downloaded yet, the tagger has nothing to load.

Reading precomputed objects from disk is very common, especially when the time needed to generate them is not negligible.

*Serialization* is the process of writing an object to disk so that it can be loaded again later; `pickle` is the Python standard-library module that does this.
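
To make the idea concrete, here is a minimal sketch of a `pickle` round trip. The `toy_model` dictionary is only a stand-in for a trained model; this is not how CLTK itself stores or loads its files.

```python
import os
import pickle
import tempfile

# A stand-in for an object that is expensive to build, such as a trained model.
toy_model = {'est': 'V3SPIA---', 'in': 'R--------'}

path = os.path.join(tempfile.mkdtemp(), 'toy_model.pickle')

# Serialization: write the object to disk.
with open(path, 'wb') as f:
    pickle.dump(toy_model, f)

# Deserialization: read it back, without having to rebuild it.
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == toy_model)
```

One caveat worth keeping in mind: unpickling can execute arbitrary code, so only load pickle files from sources you trust.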

In [None]:
la_corpus_importer.list_corpora

In [None]:
la_corpus_importer.import_corpus('latin_models_cltk')

In [35]:
tagger = POSTag('latin')

In [37]:
tagger.tag_tnt?

In [38]:
list(zip(
    tagger.tag_tnt(" ".join([str(w) for w in amicitia_words[100:150]])),
    tagger.tag_ngram_123_backoff(" ".join([str(w) for w in amicitia_words[100:150]])),
    tagger.tag_crf(" ".join([str(w) for w in amicitia_words[100:150]]))
))

[(('91', 'Unk'), ('91', None), ('91', 'A-P---FN-')),
 (('92', 'Unk'), ('92', None), ('92', 'N-P---FN-')),
 (('93', 'Unk'), ('93', None), ('93', 'A-P---FN-')),
 (('94', 'Unk'), ('94', None), ('94', 'N-P---FN-')),
 (('95', 'Unk'), ('95', None), ('95', 'A-P---FN-')),
 (('96', 'Unk'), ('96', None), ('96', 'N-P---FN-')),
 (('97', 'Unk'), ('97', None), ('97', 'A-P---FN-')),
 (('98', 'Unk'), ('98', None), ('98', 'N-P---FN-')),
 (('99', 'Unk'), ('99', None), ('99', 'A-P---FN-')),
 (('100', 'Unk'), ('100', None), ('100', 'N-P---FN-')),
 (('101', 'Unk'), ('101', None), ('101', 'A-P---FN-')),
 (('102', 'Unk'), ('102', None), ('102', 'N-P---FN-')),
 (('103', 'Unk'), ('103', None), ('103', 'A-P---FN-')),
 (('104', 'Unk'), ('104', None), ('104', 'N-P---FN-')),
 (('[', 'U--------'), ('[', 'U--------'), ('[', 'U--------')),
 (('1', 'Unk'), ('1', None), ('1', 'N-S---MV-')),
 ((']', 'U--------'), (']', 'U--------'), (']', 'U--------')),
 (('Q', 'Unk'), ('Q', None), ('Q', 'N-S---MV-')),
 (('.', 'U-------
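
The cell above builds the same string three times. As a sketch, a small helper can do the joining once and align the outputs. The name `compare_taggers` and the two toy taggers are hypothetical; they only stand in for the `tag_*` methods so that the example runs without the CLTK models.

```python
PUNCTUATION = set('[].,;:')


def compare_taggers(tokens, taggers):
    """Tag the same tokens with several taggers and align the results.

    `taggers` is a list of callables that take a string and return a list
    of (token, tag) pairs, like the `tag_*` methods used above.
    """
    text = " ".join(str(token) for token in tokens)
    outputs = [tag(text) for tag in taggers]
    # zip(*outputs) puts the i-th pair of every tagger side by side
    return list(zip(*outputs))


# Two toy taggers standing in for tagger.tag_tnt, tagger.tag_crf, etc.
def always_unknown(text):
    return [(token, 'Unk') for token in text.split()]


def punctuation_only(text):
    return [(token, 'U--------' if token in PUNCTUATION else None)
            for token in text.split()]


rows = compare_taggers(['[', '1', ']', 'Q'], [always_unknown, punctuation_only])
for row in rows:
    print(row)
```

In the notebook this could be called as `compare_taggers(amicitia_words[100:150], [tagger.tag_tnt, tagger.tag_ngram_123_backoff, tagger.tag_crf])`. The alignment with `zip` assumes that every tagger splits the text into the same tokens; if they tokenize differently, the rows will drift apart.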

What if we want to take sentences instead of ranges of tokens?

The class `PlaintextCorpusReader` has a nice method – `sents()` – that does this.

Let's see how it works...

In [39]:
de_amicitia_sentences = latinlibrary.sents('cicero/amic.txt')

In [41]:
len(de_amicitia_sentences)

414

In [42]:
de_amicitia_sentences[:10]

[['Cicero', ':', 'de', 'Amicitia'],
 ['M.', 'TVLLI', 'CICERONIS', 'LAELIVS', 'DE', 'AMICITIA'],
 ['1',
  '2',
  '3',
  '4',
  '5',
  '6',
  '7',
  '8',
  '9',
  '10',
  '11',
  '12',
  '13',
  '14',
  '15',
  '16',
  '17',
  '18',
  '19',
  '20',
  '21',
  '22',
  '23',
  '24',
  '25',
  '26',
  '27',
  '28',
  '29',
  '30',
  '31',
  '32',
  '33',
  '34',
  '35',
  '36',
  '37',
  '38',
  '39',
  '40',
  '41',
  '42',
  '43',
  '44',
  '45',
  '46',
  '47',
  '48',
  '49',
  '50',
  '51',
  '52',
  '53',
  '54',
  '55',
  '56',
  '57',
  '58',
  '59',
  '60',
  '61',
  '62',
  '63',
  '64',
  '65',
  '66',
  '67',
  '68',
  '69',
  '70',
  '71',
  '72',
  '73',
  '74',
  '75',
  '76',
  '77',
  '78',
  '79',
  '80',
  '81',
  '82',
  '83',
  '84',
  '85',
  '86',
  '87',
  '88',
  '89',
  '90',
  '91',
  '92',
  '93',
  '94',
  '95',
  '96',
  '97',
  '98',
  '99',
  '100',
  '101',
  '102',
  '103',
  '104'],
 ['[',
  '1',
  ']',
  'Q.',
  'Mucius',
  'augur',
  'multa',
  'narrare',
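
The third "sentence" in the output is just a run of numbers, apparently residue from the source page rather than Cicero's text. A minimal sketch of how such sentences could be filtered out before tagging; `is_only_numbers` is a hypothetical helper and the sample is a shortened version of the output above.

```python
def is_only_numbers(sentence):
    """True if the sentence is non-empty and every token consists of digits."""
    return len(sentence) > 0 and all(token.isdigit() for token in sentence)


# Shortened sample modelled on the output above.
sample_sentences = [
    ['Cicero', ':', 'de', 'Amicitia'],
    ['1', '2', '3', '4', '5'],
    ['[', '1', ']', 'Q.', 'Mucius', 'augur', 'multa', 'narrare'],
]

clean_sentences = [s for s in sample_sentences if not is_only_numbers(s)]
print(len(clean_sentences))
```

Sentences that merely contain a section number, such as `['[', '1', ']', 'Q.', ...]`, are kept, because not all of their tokens are digits.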

#### Making sense of PoS tags

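To expand the positional tags into readable descriptions we use the `Morph` class from [gAGDT](https://github.com/francescomambrini/gAGDT), here cloned into `~/Documents/gAGDT/`: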

In [45]:
sys.path.append(os.path.expanduser('~/Documents/gAGDT/'))
from IPython.display import HTML

In [46]:
from treebanks import Morph

In [47]:
Morph

treebanks.Morph

In [48]:
my_pos_tag = 'T-SRPPMN-'

In [49]:
Morph(my_pos_tag).full

KeyError: 'T'

In [50]:
Morph(my_pos_tag.lower()).full

{'case': 'nominative',
 'degree': '-',
 'gender': 'masculine',
 'mood': 'participle',
 'number': 'singular',
 'person': '-',
 'pos': 'verb',
 'tense': 'perfect',
 'voice': 'passive'}
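
The tags are positional: nine characters, one per morphological feature (part of speech, person, number, tense, mood, voice, gender, case, degree). The sketch below only splits a tag into named fields and does not expand the values; `split_tag` is a hypothetical helper, not part of gAGDT or CLTK.

```python
# Feature order of the nine-character positional tags.
FEATURES = ['pos', 'person', 'number', 'tense', 'mood',
            'voice', 'gender', 'case', 'degree']


def split_tag(tag):
    """Map each feature name to the (lower-cased) character at its position."""
    if len(tag) != len(FEATURES):
        raise ValueError('expected a 9-character tag, got {!r}'.format(tag))
    return dict(zip(FEATURES, tag.lower()))


fields = split_tag('T-SRPPMN-')
print(fields['number'], fields['case'])
```

The helper lower-cases the tag before splitting, for the same reason that `Morph(my_pos_tag)` failed above while `Morph(my_pos_tag.lower())` worked: the CLTK taggers return upper-case tags, whereas the lookup in `Morph` appears to expect lower-case characters.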

In [51]:
tnt_output = tagger.tag_tnt(" ".join([str(w) for w in amicitia_words[100:150]]))

In [53]:
tnt_output[:10]

[('91', 'Unk'),
 ('92', 'Unk'),
 ('93', 'Unk'),
 ('94', 'Unk'),
 ('95', 'Unk'),
 ('96', 'Unk'),
 ('97', 'Unk'),
 ('98', 'Unk'),
 ('99', 'Unk'),
 ('100', 'Unk')]

In [57]:
for token, pos_tag in tnt_output:
    if pos_tag != "Unk":
        print(token, pos_tag, Morph(pos_tag.lower()).full)

[ U-------- {'pos': 'punctuation', 'person': '-', 'number': '-', 'tense': '-', 'mood': '-', 'voice': '-', 'gender': '-', 'case': '-', 'degree': '-'}
] U-------- {'pos': 'punctuation', 'person': '-', 'number': '-', 'tense': '-', 'mood': '-', 'voice': '-', 'gender': '-', 'case': '-', 'degree': '-'}
. U-------- {'pos': 'punctuation', 'person': '-', 'number': '-', 'tense': '-', 'mood': '-', 'voice': '-', 'gender': '-', 'case': '-', 'degree': '-'}
multa A-P---NA- {'pos': 'adjective', 'person': '-', 'number': 'plural', 'tense': '-', 'mood': '-', 'voice': '-', 'gender': 'neuter', 'case': 'accusative', 'degree': '-'}
narrare V--PNA--- {'pos': 'verb', 'person': '-', 'number': '-', 'tense': 'present', 'mood': 'infinitive', 'voice': 'active', 'gender': '-', 'case': '-', 'degree': '-'}
de R-------- {'pos': 'preposition', 'person': '-', 'number': '-', 'tense': '-', 'mood': '-', 'voice': '-', 'gender': '-', 'case': '-', 'degree': '-'}
. U-------- {'pos': 'punctuation', 'person': '-', 'number': '-', 

KeyError: 'b'

In [59]:
for token, pos_tag in tnt_output:
    if pos_tag != "Unk":
        try:
            pos_info = Morph(pos_tag.lower()).full
            print("{} \t {}\n".format(token, pos_info["pos"]))
        except Exception as e:
            print("Expanded form for {} not available (error: {})\n".format(pos_tag, e))

[ 	 punctuation

] 	 punctuation

. 	 punctuation

multa 	 adjective

narrare 	 verb

de 	 preposition

. 	 punctuation

Expanded form for A-S---NB- not available (error: 'b')

et 	 conjunction

solebat 	 verb

nec 	 conjunction

illum 	 pron

in 	 preposition

Expanded form for A-S---MB- not available (error: 'b')

Expanded form for N-S---MB- not available (error: 'b')

appellare 	 verb

ego 	 pron

autem 	 conjunction

a 	 preposition

Expanded form for N-S---MB- not available (error: 'b')

ita 	 adverb

eram 	 verb

deductus 	 verb

ad 	 preposition



#### TreeTagger

`TreeTagger` is available as a command-line tool, but there are ways of calling it from within Python code.

The code below uses a *Python wrapper*, namely a couple of Python classes/methods that expose `TreeTagger`'s functionality as Python objects/methods.

In [60]:
from treetagger import TreeTagger

In [61]:
os.environ["TREETAGGER_HOME"]

KeyError: 'TREETAGGER_HOME'

In [62]:
os.environ["TREETAGGER_HOME"] = os.path.expanduser('~/tree-tagger/cmd/')

In [63]:
os.environ["TREETAGGER_HOME"]

'/Users/rromanello/tree-tagger/cmd/'
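
Indexing `os.environ` directly raises a `KeyError` when the variable is missing, as seen above. A small sketch of a more forgiving lookup with `dict.get`; the function name `treetagger_home` is made up for this example.

```python
import os


def treetagger_home(environ=None, default=None):
    """Return TREETAGGER_HOME, or `default` when the variable is not set.

    `environ` defaults to the real process environment; passing a plain
    dict makes the function easy to try out.
    """
    if environ is None:
        environ = os.environ
    return environ.get('TREETAGGER_HOME', default)


location = treetagger_home()
if location is None:
    print('TREETAGGER_HOME is not set')
else:
    print('TreeTagger expected in', location)
```

Keep in mind that assigning to `os.environ` inside a notebook only affects the current Python process; the variable is not set permanently in your shell.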

In [64]:
tt = TreeTagger(language="latin")

In [65]:
tt

<treetagger.TreeTagger at 0x10f2bb518>

In [66]:
tt.tag("Cogito ergo sum")

[['Cogito', 'V:IMP', 'cogo'],
 ['ergo', 'ADV', 'ergo'],
 ['sum', 'ESSE:IND', 'sum']]

In [67]:
tt.tag(amicitia_words[100:150])

[['91', 'ADJ:NUM', '@card@'],
 ['92', 'ADJ:NUM', '@card@'],
 ['93', 'ADJ:NUM', '@card@'],
 ['94', 'ADJ:NUM', '@card@'],
 ['95', 'ADJ:NUM', '@card@'],
 ['96', 'ADJ:NUM', '@card@'],
 ['97', 'ADJ:NUM', '@card@'],
 ['98', 'ADJ:NUM', '@card@'],
 ['99', 'ADJ:NUM', '@card@'],
 ['100', 'ADJ:NUM', '@card@'],
 ['101', 'ADJ:NUM', '@card@'],
 ['102', 'ADJ:NUM', '@card@'],
 ['103', 'ADJ:NUM', '@card@'],
 ['104', 'ADJ:NUM', '@card@'],
 ['[', 'PUN', '['],
 ['1', 'ADJ:NUM', '@card@'],
 [']', 'PUN', ']'],
 ['Q.', 'ABBR', 'Q.'],
 ['Mucius', 'ADJ', '<unknown>'],
 ['augur', 'N:nom', 'augur'],
 ['multa', 'ADJ', 'multus'],
 ['narrare', 'V:INF', 'narro'],
 ['de', 'PREP', 'de'],
 ['C.', 'ABBR', 'C.'],
 ['Laelio', 'N:abl', '<unknown>'],
 ['socero', 'N:abl', 'socer'],
 ['suo', 'POSS', 'suus'],
 ['memoriter', 'ADV', 'memoriter'],
 ['et', 'CC', 'et'],
 ['iucunde', 'ADJ', '<unknown>'],
 ['solebat', 'V:IND', 'soleo'],
 ['nec', 'CC', 'nec'],
 ['dubitare', 'V:INF', 'dubito'],
 ['illum', 'DIMOS', 'ille'],
 ['in', 'PRE
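
In this output each row is a `[token, PoS tag, lemma]` list, and tokens without a known lemma carry the placeholder `<unknown>`. As a sketch, such tokens can be collected for manual checking; the rows below are copied from the output above.

```python
# A few rows copied from the output above: [token, PoS tag, lemma].
tagged = [
    ['Mucius', 'ADJ', '<unknown>'],
    ['augur', 'N:nom', 'augur'],
    ['multa', 'ADJ', 'multus'],
    ['Laelio', 'N:abl', '<unknown>'],
    ['iucunde', 'ADJ', '<unknown>'],
]

# Tokens for which no lemma was found.
unknown = [token for token, pos, lemma in tagged if lemma == '<unknown>']
print(unknown)
```

These are good candidates for a closer look, since the PoS tag of an unknown word is a guess too (the proper name `Mucius` and the adverb `iucunde` are both tagged `ADJ` here).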

### Greek

#### CLTK taggers

In [68]:
grk_corpus_importer.import_corpus("greek_models_cltk")

In [69]:
from cltk.tag.pos import POSTag
tagger = POSTag('greek')

In [70]:
tagger.tag_ngram_123_backoff('θεοὺς μὲν αἰτῶ τῶνδ᾽ ἀπαλλαγὴν πόνων φρουρᾶς ἐτείας μῆκος')

[('θεοὺς', 'N-P---MA-'),
 ('μὲν', 'G--------'),
 ('αἰτῶ', 'V1SPIA---'),
 ('τῶνδ', 'P-P---MG-'),
 ('᾽', None),
 ('ἀπαλλαγὴν', 'N-S---FA-'),
 ('πόνων', 'N-P---MG-'),
 ('φρουρᾶς', 'N-S---FG-'),
 ('ἐτείας', 'A-S---FG-'),
 ('μῆκος', 'N-S---NA-')]

In [71]:
tagger.tag_tnt('θεοὺς μὲν αἰτῶ τῶνδ᾽ ἀπαλλαγὴν πόνων φρουρᾶς ἐτείας μῆκος')

[('θεοὺς', 'N-P---MA-'),
 ('μὲν', 'G--------'),
 ('αἰτῶ', 'V1SPIA---'),
 ('τῶνδ', 'P-P---NG-'),
 ('᾽', 'Unk'),
 ('ἀπαλλαγὴν', 'N-S---FA-'),
 ('πόνων', 'N-P---MG-'),
 ('φρουρᾶς', 'N-S---FG-'),
 ('ἐτείας', 'A-S---FG-'),
 ('μῆκος', 'N-S---NA-')]
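
The two taggers agree on most tokens but not on all of them. A minimal sketch for listing the positions where they differ; the two lists are shortened copies of the outputs above, and `disagreements` is a hypothetical helper.

```python
# Shortened copies of the two outputs above.
ngram_output = [('θεοὺς', 'N-P---MA-'), ('μὲν', 'G--------'),
                ('τῶνδ', 'P-P---MG-'), ('᾽', None)]
tnt_output_grk = [('θεοὺς', 'N-P---MA-'), ('μὲν', 'G--------'),
                  ('τῶνδ', 'P-P---NG-'), ('᾽', 'Unk')]


def disagreements(first, second):
    """Return (token, tag_a, tag_b) for every position where the tags differ."""
    return [
        (token_a, tag_a, tag_b)
        for (token_a, tag_a), (token_b, tag_b) in zip(first, second)
        if tag_a != tag_b
    ]


for token, tag_a, tag_b in disagreements(ngram_output, tnt_output_grk):
    print(token, tag_a, tag_b)
```

Here the taggers disagree on the gender of `τῶνδ` (masculine vs. neuter), and they mark the unknown apostrophe differently (`None` vs. `'Unk'`), which is worth normalising before comparing outputs at scale.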

#### Small detour: importing a Greek corpus

**Q**: Is there a handy way to import an Ancient Greek corpus as we did for Latin?

**A**: not quite (yet)



First step: convert the Perseus data (TEI/XML) into plain text:

In [72]:
# my modified version of https://github.com/cltk/greek_text_perseus/blob/master/perseus_compiler.py

import os
import re
import bleach
from tqdm import tqdm
from cltk.corpus.greek.beta_to_unicode import Replacer

cltk_path = os.path.expanduser('~/cltk_data')
perseus_root = os.path.join(cltk_path, 'greek', 'text', 'greek_text_perseus')
unicode_root = os.path.join(cltk_path, 'greek', 'text', 'perseus_unicode')

# entries in the repository root that are not author directories
ignore = [
    '.git',
    'LICENSE.md',
    'README.md',
    'cltk_json',
    'json',
    'perseus_compiler.py'
]
authors = [d for d in os.listdir(perseus_root) if d not in ignore]
replacer = Replacer()

for author in tqdm(authors):
    source_dir = os.path.join(perseus_root, author, 'opensource')
    for text in os.listdir(source_dir):
        # keep only the Greek originals, i.e. files named *_gk.xml
        if not re.match(r'.*_gk\.xml$', text):
            continue
        with open(os.path.join(source_dir, text)) as gk:
            xml = gk.read()
        # strip the markup, then convert the (upper-cased) Beta Code to Unicode
        beta_code = bleach.clean(xml, strip=True).upper()
        unicode_converted = replacer.beta_code(beta_code)
        # create the output directory for this author if it does not exist yet
        author_path = os.path.join(unicode_root, author)
        os.makedirs(author_path, exist_ok=True)
        txt_name = os.path.splitext(text)[0] + '.txt'
        # write UTF-8 explicitly, since the corpus reader below expects it
        with open(os.path.join(author_path, txt_name), 'w', encoding='utf-8') as out:
            out.write(unicode_converted)

 66%|██████▌   | 40/61 [02:26<01:16,  3.66s/it]

KeyboardInterrupt: 

In [5]:
import os.path
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
from cltk.tokenize.sentence import TokenizeSentence
from cltk.tokenize.word import WordTokenizer

In [73]:
agrk_word_tokenizer = WordTokenizer('greek')
agrk_sentence_tokenizer = TokenizeSentence("greek")

In [74]:
cltk_path = os.path.expanduser('~/cltk_data')
try:
    perseusgreek = PlaintextCorpusReader(
        cltk_path + '/greek/text/perseus_unicode/',
        r'.*\.txt',
        word_tokenizer=agrk_word_tokenizer,
        sent_tokenizer=agrk_sentence_tokenizer,
        encoding='utf-8'
    )
except IOError as e:
    print("Corpus not found. Please check that the converted Perseus texts are in CLTK_DATA.")

In [76]:
perseusgreek.fileids()

['Aeschines/aeschin_gk.txt',
 'Aeschylus/aesch.ag_gk.txt',
 'Aeschylus/aesch.eum_gk.txt',
 'Aeschylus/aesch.lib_gk.txt',
 'Aeschylus/aesch.pb_gk.txt',
 'Aeschylus/aesch.pers_gk.txt',
 'Aeschylus/aesch.seven_gk.txt',
 'Aeschylus/aesch.supp_gk.txt',
 'Andocides/andoc_gk.txt',
 'Anth/01_gk.txt',
 'Anth/02_gk.txt',
 'Anth/03_gk.txt',
 'Anth/04_gk.txt',
 'Anth/05_gk.txt',
 'Apollodorus/apollod_gk.txt',
 'Apollonius/argo_gk.txt',
 'Appian/appian.cw_gk.txt',
 'Appian/appian.fw_gk.txt',
 'Aretaeus/aret_gk.txt',
 'Aristides/aristid.orat_gk.txt',
 'Aristides/aristid.rhet_gk.txt',
 'Aristophanes/aristoph.ach_gk.txt',
 'Aristophanes/aristoph.birds_gk.txt',
 'Aristophanes/aristoph.cl_gk.txt',
 'Aristophanes/aristoph.eccl_gk.txt',
 'Aristophanes/aristoph.frogs_gk.txt',
 'Aristophanes/aristoph.kn_gk.txt',
 'Aristophanes/aristoph.lys_gk.txt',
 'Aristophanes/aristoph.peace_gk.txt',
 'Aristophanes/aristoph.pl_gk.txt',
 'Aristophanes/aristoph.thes_gk.txt',
 'Aristophanes/aristoph.wasps_gk.txt',
 'Aristot

In [77]:
birds = perseusgreek.words('Aristophanes/aristoph.birds_gk.txt')

In [79]:
birds_sentences = perseusgreek.sents('Aristophanes/aristoph.birds_gk.txt')

In [80]:
birds_sentences[:2]

[['%περσδραμα', ';', ']', '&', 'γτ', ';'],
 ['βιρδσ', 'μαξηινε', 'ρεαδαβλε', 'τεχτ', 'αριστοπηανεσ', 'φ.ω', '.']]

In [78]:
birds[1000:1050]

['ἐπέγειρον',
 'αὐτόν',
 '.',
 'Θεράπων',
 'Ἔποποσ',
 'οἶδα',
 'μὲν',
 'σαφῶσ',
 'ὅτι',
 'ἀχθέσεται',
 ',',
 'σφῷν',
 'δ’',
 'αὐτὸν',
 'οὕνεκ’',
 'ἐπεγερῶ',
 '.',
 'Πισθέταιροσ',
 'κακῶς',
 'σύ',
 'γ’',
 'ἀπόλοῐ',
 ',',
 'ὥς',
 'μ’',
 'ἀπέκτεινας',
 'δέει',
 '.',
 'Ἐυελπίδησ',
 'οἴμοι',
 'κακοδαίμων',
 'χὠ',
 'κολοιός',
 'μοἴχεται',
 'ὑπὸ',
 'τοῦ',
 'δέους\\',
 '.',
 'Πισθέταιροσ',
 'ὦ',
 'δειλότατον',
 'σὺ',
 'θηρίον',
 ',',
 'δείσας',
 'ἀφῆκας',
 'τὸν',
 'κολοιόν',
 ';',
 'Ἐυελπίδησ']
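
Several tokens in this output end in a medial sigma (e.g. `Ἔποποσ`, `σαφῶσ`) where a final sigma would be expected. This looks like a side effect of the conversion step, which is an assumption on my part, and it is likely to hurt tagger and lemmatizer lookups. A minimal sketch of a token-level fix; `fix_final_sigma` is a hypothetical helper, not a CLTK function.

```python
def fix_final_sigma(token):
    """Replace a word-final medial sigma (σ) with the final form (ς)."""
    if token.endswith('σ'):
        return token[:-1] + 'ς'
    return token


# Three tokens taken from the output above.
sample = ['σαφῶσ', 'κακῶς', 'σύ']
print([fix_final_sigma(t) for t in sample])
```

The check is deliberately simple: it only looks at the last character of a token, so forms followed by an apostrophe or other trailing characters are left untouched.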

In [None]:
print(list(birds[1000:1100]))

In [None]:
tagger.tag_tnt(" ".join(birds[1000:1100]))

#### TreeTagger

- TreeTagger support for Ancient Greek is very recent: it became available just before this session
- documentation: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tagsetdocs.txt
- the Ancient Greek model was trained by Alessandro Vatri and Barbara McGillivray

In [81]:
from treetagger import TreeTagger

In [82]:
tt = TreeTagger(language="ancient-greek-utf8")

In [83]:
tt.tag(" ".join(birds[1000:1100]))

[['ἐπέγειρον', 'verb', '-'],
 ['αὐτόν', 'pronoun', '-'],
 ['.', 'SENT', '-'],
 ['Θεράπων', 'adjective', '<unknown>'],
 ['Ἔποποσ', 'noun', '<unknown>'],
 ['οἶδα', 'verb', '-'],
 ['μὲν', 'adverb', '-'],
 ['σαφῶσ', 'verb', '<unknown>'],
 ['ὅτι', 'conjunction', '-'],
 ['ἀχθέσεται', 'verb', '-'],
 [',', 'SENT', '-'],
 ['σφῷν', 'pronoun', '-'],
 ['δ', 'unknown', '-'],
 ['’', 'unknown', '-'],
 ['αὐτὸν', 'pronoun', '-'],
 ['οὕνεκ', 'verb', '<unknown>'],
 ['’', 'unknown', '-'],
 ['ἐπεγερῶ', 'verb', '-'],
 ['.', 'SENT', '-'],
 ['Πισθέταιροσ', 'adjective', '<unknown>'],
 ['κακῶς', 'adverb', '-'],
 ['σύ', 'pronoun', '-'],
 ['γ', 'particle', '-'],
 ['’', 'unknown', '-'],
 ['ἀπόλοῐ', 'verb', '<unknown>'],
 [',', 'SENT', '-'],
 ['ὥς', 'conjunction', '-'],
 ['μ', 'verb', '<unknown>'],
 ['’', 'SENT', '-'],
 ['ἀπέκτεινας', 'verb', '-'],
 ['δέει', 'noun', '-'],
 ['.', 'SENT', '-'],
 ['Ἐυελπίδησ', 'adjective', '<unknown>'],
 ['οἴμοι', 'interjection', '-'],
 ['κακοδαίμων', 'adjective', '-'],
 ['χὠ', 'artic

## Lemmatization

We present two lemmatizers: one suitable for automatic lemmatization, the other meant more as a support for the reader/annotator.

The main difference between them is how they handle words that may correspond to multiple lemmata.
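
The difference can be sketched with a toy lookup table. Everything below is hypothetical and only illustrates the two behaviours; the candidates for `cogito` mirror the PyCollatinus output shown later in this notebook.

```python
# Toy lookup table: one form can map to several lemmata.
CANDIDATES = {
    'cogito': ['cogo', 'cogito'],
    'sum': ['sum'],
}


def lemmatize_automatic(form):
    """Always commit to one lemma (here, naively, the first candidate)."""
    return CANDIDATES.get(form, [form])[0]


def lemmatize_for_reader(form):
    """Return every candidate and leave the choice to the reader."""
    return CANDIDATES.get(form, [form])


print(lemmatize_automatic('cogito'))
print(lemmatize_for_reader('cogito'))
```

The first style gives exactly one lemma per token, which is convenient for counting and searching but can be silently wrong; the second keeps the ambiguity visible at the cost of needing a human (or another tool) to decide.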

### Latin

#### CLTK

In [84]:
from cltk.utils.file_operations import open_pickle

In [85]:
# Set up training sentences
rel_path = '~/cltk_data/latin/model/latin_models_cltk/lemmata/backoff'
path = os.path.expanduser(rel_path)

# Check for presence of latin_pos_lemmatized_sents
file = 'latin_pos_lemmatized_sents.pickle' 
latin_pos_lemmatized_sents_path = os.path.join(path, file)
if os.path.isfile(latin_pos_lemmatized_sents_path):
    latin_pos_lemmatized_sents = open_pickle(latin_pos_lemmatized_sents_path)
else:
    latin_pos_lemmatized_sents = []
    print('The file %s is not available in cltk_data' % file)

In [86]:
latin_pos_lemmatized_sents[:10]

[[('cum', 'cum2', 'c'),
  ('esset', 'sum', 'v'),
  ('caesar', 'caesar', 'n'),
  ('in', 'in', 'r'),
  ('citeriore', 'citer', 'a'),
  ('gallia', 'gallia', 'n'),
  ('in', 'in', 'r'),
  ('hibernis', 'hibernus', 'a'),
  (',', 'punc', 'u'),
  ('ita', 'ita', 'd'),
  ('uti', 'ut', 'c'),
  ('supra', 'supra', 'd'),
  ('demonstrauimus', 'demonstro', 'v'),
  (',', 'punc', 'u'),
  ('crebri', 'creber', 'a'),
  ('ad', 'ad', 'r'),
  ('eum', 'is', 'p'),
  ('rumores', 'rumor', 'n'),
  ('adferebantur', 'affero', 'v'),
  ('litteris', 'littera', 'n'),
  ('-que', '-que', 'c'),
  ('item', 'item', 'd'),
  ('labieni', 'labienus', 'n'),
  ('certior', 'certus', 'a'),
  ('fiebat', 'fio', 'v'),
  ('omnes', 'omnis', 'a'),
  ('belgas', 'belgae', 'n'),
  (',', 'punc', 'u'),
  ('quam', 'qui', 'p'),
  ('tertiam', 'tertius', 'm'),
  ('esse', 'sum', 'v'),
  ('galliae', 'gallia', 'n'),
  ('partem', 'pars', 'n'),
  ('dixeramus', 'dico', 'v'),
  (',', 'punc', 'u'),
  ('contra', 'contra', 'r'),
  ('populum', 'populus', 'n'),

In [90]:
amicitia_sents = latinlibrary.sents('cicero/amic.txt')

In [91]:
amicitia_sents[10]

['[',
 '4',
 ']',
 'Cum',
 'enim',
 'saepe',
 'cum',
 'me',
 'ageres',
 'ut',
 'de',
 'amicitia',
 'scriberem',
 'aliquid',
 ',',
 'digna',
 'mihi',
 'res',
 'cum',
 'omnium',
 'cognitione',
 'tum',
 'nostra',
 'familiaritate',
 'visa',
 'est',
 '.']

In [87]:
from cltk.lemmatize.latin.backoff import BackoffLatinLemmatizer

In [88]:
backoff_lemmatizer = BackoffLatinLemmatizer(train=latin_pos_lemmatized_sents)

In [92]:
backoff_lemmatizer.lemmatize(amicitia_sents[10])

[('[', 'punc'),
 ('4', '4'),
 (']', 'punc'),
 ('Cum', 'Cos2'),
 ('enim', 'enim'),
 ('saepe', 'saepe'),
 ('cum', 'cum2'),
 ('me', 'ego'),
 ('ageres', 'ago'),
 ('ut', 'ut'),
 ('de', 'de'),
 ('amicitia', 'amicitia'),
 ('scriberem', 'scribo'),
 ('aliquid', 'aliquis'),
 (',', 'punc'),
 ('digna', 'dignus'),
 ('mihi', 'ego'),
 ('res', 'res'),
 ('cum', 'cum2'),
 ('omnium', 'omnis'),
 ('cognitione', 'cognitio'),
 ('tum', 'tum'),
 ('nostra', 'noster'),
 ('familiaritate', 'familiaritas'),
 ('visa', 'video'),
 ('est', 'sum'),
 ('.', 'punc')]

But behind the scenes, this is what is going on:

```python

    def _define_lemmatizer(self):
        # Suggested backoff chain--should be tested for optimal order
        backoff0 = None
        backoff1 = IdentityLemmatizer()
        backoff2 = TrainLemmatizer(model=self.LATIN_OLD_MODEL, backoff=backoff1)
        backoff3 = PPLemmatizer(regexps=self.latin_verb_patterns, pps=self.latin_pps, backoff=backoff2)                 
        backoff4 = RegexpLemmatizer(self.latin_sub_patterns, backoff=backoff3)
        backoff5 = UnigramLemmatizer(self.train_sents, backoff=backoff4)
        backoff6 = TrainLemmatizer(model=self.LATIN_MODEL, backoff=backoff5)      
        #backoff7 = BigramPOSLemmatizer(self.pos_train_sents, include=['cum'], backoff=backoff6)
        #lemmatizer = backoff7
        lemmatizer = backoff6
        return lemmatizer

```
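
The chain above is consulted from the most specific lemmatizer (`backoff6`) down to the catch-all `IdentityLemmatizer`: each link either returns a lemma or hands the token to its `backoff`. The toy class below sketches that mechanism in plain Python. It is not CLTK's implementation, and the dictionary and the suffix rule are made up.

```python
import re


class ToyLemmatizer:
    """One link in a backoff chain: try `lookup`, else defer to `backoff`."""

    def __init__(self, lookup, backoff=None):
        self.lookup = lookup      # callable: token -> lemma or None
        self.backoff = backoff    # next lemmatizer to try, if any

    def lemmatize_token(self, token):
        lemma = self.lookup(token)
        if lemma is None and self.backoff is not None:
            return self.backoff.lemmatize_token(token)
        return lemma


# Last resort: return the token itself (cf. IdentityLemmatizer above).
identity = ToyLemmatizer(lambda token: token)


# Middle link: a single made-up suffix rule, -arum -> -a.
def strip_arum(token):
    return re.sub(r'arum$', 'a', token) if token.endswith('arum') else None


rules = ToyLemmatizer(strip_arum, backoff=identity)

# First link: a tiny dictionary of known forms.
known = {'est': 'sum', 'me': 'ego'}
dictionary = ToyLemmatizer(known.get, backoff=rules)

print([dictionary.lemmatize_token(t) for t in ['est', 'puellarum', 'enim']])
```

The order of the links matters: a token is only passed down the chain when the more specific lemmatizers have nothing to say about it, which is why the comment in the CLTK code notes that the order should be tested.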

Further reading:
* https://github.com/cltk/cltk/blob/master/cltk/lemmatize/latin/backoff.py
* https://disiectamembra.wordpress.com/2016/08/23/wrapping-up-google-summer-of-code/


#### PyCollatinus

* Python port of the [Collatinus lemmatizer](https://github.com/biblissima/collatinus)
* good if you can read some French (or at least practice it) 😉
* the PoS tags used by Collatinus are explained [here](https://github.com/biblissima/collatinus/blob/master/NOTES_Tagger.md)
* morphological analysis not readily machine readable

We import and instantiate the PyCollatinus lemmatizer (`Lemmatiseur`) – and ignore the long list of warnings. 

In [93]:
from pycollatinus import Lemmatiseur
analyzer = Lemmatiseur()

/Users/rromanello/.local/share/virtualenvs/sunoikisis_dc-Xcx-IOWS/lib/python3.6/site-packages/pycollatinus/parser.py:335: MissingRadical: honor has no radical 1
/Users/rromanello/.local/share/virtualenvs/sunoikisis_dc-Xcx-IOWS/lib/python3.6/site-packages/pycollatinus/parser.py:335: MissingRadical: aer has no radical 1
/Users/rromanello/.local/share/virtualenvs/sunoikisis_dc-Xcx-IOWS/lib/python3.6/site-packages/pycollatinus/parser.py:335: MissingRadical: tethys has no radical 1
/Users/rromanello/.local/share/virtualenvs/sunoikisis_dc-Xcx-IOWS/lib/python3.6/site-packages/pycollatinus/parser.py:335: MissingRadical: opes has no radical 1
/Users/rromanello/.local/share/virtualenvs/sunoikisis_dc-Xcx-IOWS/lib/python3.6/site-packages/pycollatinus/parser.py:335: MissingRadical: dos has no radical 1
/Users/rromanello/.local/share/virtualenvs/sunoikisis_dc-Xcx-IOWS/lib/python3.6/site-packages/pycollatinus/parser.py:335: MissingRadical: corpus has no radical 1
/Users/rromanello/.local/share/virtua

In [94]:
analyzer

<pycollatinus.lemmatiseur.Lemmatiseur at 0x10f2250f0>

The lemmatiser can take as input a **single word**

In [95]:
list(analyzer.lemmatise("Cogito"))

[{'desinence': 'ito',
  'form': 'cogito',
  'lemma': 'cogo',
  'morph': '2ème singulier impératif futur actif',
  'radical': 'cog'},
 {'desinence': 'ito',
  'form': 'cogito',
  'lemma': 'cogo',
  'morph': '3ème singulier impératif futur actif',
  'radical': 'cog'},
 {'desinence': 'o',
  'form': 'cogito',
  'lemma': 'cogito',
  'morph': '1ère singulier indicatif présent actif',
  'radical': 'cogit'},
 {'desinence': 'o',
  'form': 'cogito',
  'lemma': 'cogito',
  'morph': '1ère singulier indicatif présent actif',
  'radical': 'cogit'}]

or an **entire sentence**

In [96]:
list(analyzer.lemmatise_multiple("Cogito ergo sum"))

[[{'desinence': 'ito',
   'form': 'cogito',
   'lemma': 'cogo',
   'morph': '2ème singulier impératif futur actif',
   'radical': 'cog'},
  {'desinence': 'ito',
   'form': 'cogito',
   'lemma': 'cogo',
   'morph': '3ème singulier impératif futur actif',
   'radical': 'cog'},
  {'desinence': 'o',
   'form': 'cogito',
   'lemma': 'cogito',
   'morph': '1ère singulier indicatif présent actif',
   'radical': 'cogit'},
  {'desinence': 'o',
   'form': 'cogito',
   'lemma': 'cogito',
   'morph': '1ère singulier indicatif présent actif',
   'radical': 'cogit'}],
 [{'desinence': 'o',
   'form': 'ergo',
   'lemma': 'ergo',
   'morph': '1ère singulier indicatif présent actif',
   'radical': 'erg'},
  {'desinence': '',
   'form': 'ergo',
   'lemma': 'ergo',
   'morph': '-',
   'radical': 'ergo'},
  {'desinence': '',
   'form': 'ergo',
   'lemma': 'ergo',
   'morph': 'positif',
   'radical': 'ergo'}],
 [{'desinence': 'um',
   'form': 'sum',
   'lemma': 'sum',
   'morph': '1ère singulier indicatif p

Let's try to output the lemmatisation in a more intelligible way...

In [97]:
# the analyzer output is essentially a list of lists
# for each analyzed token it returns a list of possible lemmata
# here we iterate through both lists and display the analysis as we go along

for n, result in enumerate(analyzer.lemmatise_multiple("Cogito ergo sum")):
    for i, lemma in enumerate(result):
        print(
            "{}.{}\t{}\t\t{} {}".format(
                n + 1,
                i + 1,
                lemma["form"],
                lemma["lemma"],
                lemma["morph"]
            )
        )

1.1	cogito		cogo 2ème singulier impératif futur actif
1.2	cogito		cogo 3ème singulier impératif futur actif
1.3	cogito		cogito 1ère singulier indicatif présent actif
1.4	cogito		cogito 1ère singulier indicatif présent actif
2.1	ergo		ergo 1ère singulier indicatif présent actif
2.2	ergo		ergo -
2.3	ergo		ergo positif
3.1	sum		sum 1ère singulier indicatif présent actif


The same, but also with the PoS tag:

In [98]:
# the analyzer output is essentially a list of lists
# for each analyzed token it returns a list of possible lemmata
# here we iterate through both lists and display the analysis as we go along

for n, result in enumerate(analyzer.lemmatise_multiple("Cogito ergo sum", pos=True)):
    for i, lemma in enumerate(result):
        print(
            "{}.{}\t{}\t{}\t{} {}".format(
                n + 1,
                i + 1,
                lemma["form"],
                lemma["pos"],
                lemma["lemma"],
                lemma["morph"]
            )
        )

1.1	cogito	v	cogo 2ème singulier impératif futur actif
1.2	cogito	v	cogo 3ème singulier impératif futur actif
1.3	cogito	v	cogito 1ère singulier indicatif présent actif
1.4	cogito	v	cogito 1ère singulier indicatif présent actif
2.1	ergo	v	ergo 1ère singulier indicatif présent actif
2.2	ergo	c	ergo -
2.3	ergo	d	ergo positif
3.1	sum	v	sum 1ère singulier indicatif présent actif
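
Once the analyses carry a PoS tag, they can be narrowed down with information from elsewhere, for instance the output of a PoS tagger. A minimal sketch, using the three analyses of `ergo` copied from the output above as plain dictionaries; `keep_pos` is a hypothetical helper and not part of PyCollatinus.

```python
# Analyses for "ergo", copied from the output above (pos=True).
analyses = [
    {'form': 'ergo', 'pos': 'v', 'lemma': 'ergo',
     'morph': '1ère singulier indicatif présent actif'},
    {'form': 'ergo', 'pos': 'c', 'lemma': 'ergo', 'morph': '-'},
    {'form': 'ergo', 'pos': 'd', 'lemma': 'ergo', 'morph': 'positif'},
]


def keep_pos(candidates, allowed):
    """Keep only the analyses whose PoS is in `allowed`."""
    return [c for c in candidates if c['pos'] in allowed]


# Suppose another tool told us that "ergo" is not a verb in this sentence.
for analysis in keep_pos(analyses, {'c', 'd'}):
    print(analysis['pos'], analysis['lemma'], analysis['morph'])
```

This does not resolve every ambiguity (two candidates remain here), but it shows how a tagger and a lemmatizer that lists all candidates can complement each other.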


# Exercise

- install Jupyter and all necessary libraries using the instructions on the Wiki
- prepare a corpus of Ancient Greek (by re-running code above)
    - use Perseus data
    - convert the data (TEI/XML + Betacode => Plain Text + UTF-8)
    - create a `PlaintextCorpusReader`
- pick a specific work in the corpus (e.g. one you are particularly familiar with)
- lemmatize using CLTK (see documentation for [Greek lemmatization](http://docs.cltk.org/en/latest/greek.html#lemmatization))
- and then... spot the errors! 