# NOTEBOOK IS IN DEVELOPMENT

# Table of contents
1. [Introduction](#introduction)
1. [Install CLTK](#install)
1. [Load data](#load)
1. [Run NLP pipeline with `NLP()`](#run-nlp)
1. [Inspect CLTK `Doc`](#inspect-doc)
1. [Inspect CLTK `Word`](#inspect-word)
1. [Modeling morphology with `MorphosyntacticFeature` and `MorphosyntacticFeatureBundle`](#morph)
1. [Modeling syntax with `Form` and `DependencyTree`](#syntax)
1. [Feature extraction](#features)
1. [Brief demonstration of `NLP()` for Ancient Greek](#greek-nlp)

# Introduction <a name="introduction"></a>

This notebook is based on [a notebook](https://github.com/cltk/cltk/blob/master/notebooks/CLTK%20Demonstration.ipynb) from the [Classical Language Toolkit project](http://cltk.org). We will annotate the text of the Vulgate of Genesis, using the pipeline that is provided for the Latin language. We use the text of the Latin Genesis as it is provided [here](https://github.com/cltk/lat_text_tesserae).

This notebook demonstrates how to use `NLP()`, the CLTK's primary interface, in Latin and Ancient Greek. Pipelines are available for 17 languages (see [Languages](https://docs.cltk.org/en/latest/languages.html) in the docs).

Full documentation is available at <https://docs.cltk.org/en/latest/cltk.html#cltk.nlp.NLP>.



# Install CLTK <a name="install"></a>

In [None]:
## Requires Python 3.7, 3.8, or 3.9

# !pip install cltk

# Load data <a name="load"></a>

Open the text of Genesis, read it line by line, parse the label (book, chapter, verse), and store the data in the dictionary `vulgate_genesis`. In this dict, each key is a tuple of book, chapter, and verse (all strings), and each value is a string containing the text of that verse.

In [None]:
vulgate_genesis = {}

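with open("jerome.vulgate.part.1.genesis.tess", encoding="utf-8") as gen:
    for line in gen:
        # Separate the label from the verse text
        label, text = line.split('> ')
        # The second part of the label holds book.chapter.verse
        _, bo_ch_ve = label.split()
        bo, ch, ve = bo_ch_ve.split('.')
        vulgate_genesis[(bo, ch, ve)] = text.strip()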
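The parsing above assumes that every line has the shape `<prefix Book.chapter.verse> text`. The following is a minimal sketch of the same logic as a reusable function that also skips blank lines. The function name `parse_tess_line`, the `vulgate` prefix in the labels, and the sample lines themselves are illustrative assumptions; they are not copied from the Tesserae file.

```python
from typing import Dict, Optional, Tuple

VerseKey = Tuple[str, str, str]


def parse_tess_line(line: str) -> Optional[Tuple[VerseKey, str]]:
    """Split one `<prefix Book.chapter.verse> text` line into (key, text).

    Returns None for blank lines so that callers can skip them.
    """
    line = line.strip()
    if not line:
        return None
    # Split only on the first "> " so that a "> " inside the verse survives.
    label, text = line.split("> ", 1)
    # The last whitespace-separated part of the label holds Book.chapter.verse.
    book, chapter, verse = label.split()[-1].split(".")
    return (book, chapter, verse), text.strip()


# Hypothetical sample lines in the assumed format.
sample_lines = [
    "<vulgate Genesis.1.1> In principio creavit Deus caelum et terram.\n",
    "\n",
    "<vulgate Genesis.1.3> Dixitque Deus: Fiat lux. Et facta est lux.\n",
]

verses: Dict[VerseKey, str] = {}
for raw_line in sample_lines:
    parsed = parse_tess_line(raw_line)
    if parsed is not None:
        key, verse_text = parsed
        verses[key] = verse_text

print(verses[("Genesis", "1", "1")])
```

Limiting the split to the first `"> "` keeps the function from failing if that character sequence ever occurs inside a verse.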


How many verses are there in Genesis?

In [None]:
len(vulgate_genesis)

What is the text of Genesis 10:10?

In [None]:
vulgate_genesis[('Genesis', '10', '10')]
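Because the keys are tuples of strings, the dictionary can be summarized with ordinary Python tools. The sketch below counts verses per chapter and sorts the keys numerically. It runs on a small hypothetical stand-in (`sample_verses`, with placeholder texts) rather than on the real `vulgate_genesis`, so the numbers it prints say nothing about the actual book.

```python
from collections import Counter

# Hypothetical stand-in for `vulgate_genesis`: same key shape, placeholder texts.
sample_verses = {
    ("Genesis", "1", "1"): "(text of Genesis 1:1)",
    ("Genesis", "1", "2"): "(text of Genesis 1:2)",
    ("Genesis", "10", "10"): "(text of Genesis 10:10)",
    ("Genesis", "2", "1"): "(text of Genesis 2:1)",
}

# Count the verses in each chapter; the chapter is the second element of the key.
verses_per_chapter = Counter(chapter for _, chapter, _ in sample_verses)

# Chapter and verse numbers are strings, so convert them to int for numeric
# order (as strings, "10" would sort before "2").
ordered_keys = sorted(sample_verses, key=lambda k: (int(k[1]), int(k[2])))

print(verses_per_chapter)
print(ordered_keys)
```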

# Run NLP pipeline with `NLP()` <a name="run-nlp"></a>

In [None]:
from cltk import NLP

In [None]:
cltk_nlp = NLP(language="lat")

In [None]:
# Removing ``LatinLexiconProcess`` for this demo b/c it is slow (adds ~9 mins total)
cltk_nlp.pipeline.processes.pop(-1)
print(cltk_nlp.pipeline.processes)

Let's see what the pipeline does with Genesis 1:1.

In [None]:
cltk_doc = cltk_nlp.analyze(text=vulgate_genesis[('Genesis', '1', '1')])

# Inspect CLTK `Doc` <a name="inspect-doc"></a>

The pipeline has created a `Doc` object from our string:

In [None]:
print(type(cltk_doc))

Which attributes and methods does this `Doc` object offer?

In [None]:
print([x for x in dir(cltk_doc) if not x.startswith("__")])

Show the tokens. Note that the punctuation mark at the end of the verse is parsed as a separate token.

In [None]:
print(cltk_doc.tokens)

Lemmata.

In [None]:
print(cltk_doc.lemmata)

Parts of speech.

In [None]:
print(cltk_doc.pos)

`sentences_tokens` is a list of lists: each inner list holds the tokens of one sentence in the string under consideration.

In [None]:
print(cltk_doc.sentences_tokens)

# Inspect CLTK `Word` <a name="inspect-word"></a>

Most powerful, though, is the ``Doc.words`` accessor, which is a list of ``Word`` objects. These ``Word`` objects contain all the information that was generated by the NLP pipeline.

In [None]:
# One Word object for each token
print(len(cltk_doc.words))

Users can go token-by-token via ``Doc.words`` or via the intermediary step of looping through sentences.

In [None]:
print(cltk_doc.sentences_strings[0])
sentence_gen_1_1 = cltk_doc.sentences[0]  # type: List[Word]

In [None]:
for word in sentence_gen_1_1:
    print(word)
    print('')

In each `Word`, you can see information for lexicography (`.lemma`), semantics (`.embedding`), morphology (`.pos`, `.features`), syntax (`.governor`, `.dependency_relation`), plus other information most users would find helpful (`.stop`, `.named_entity`).

# Modeling morphology with `MorphosyntacticFeature` and `MorphosyntacticFeatureBundle` <a name="morph"></a>

When a language's `Pipeline` builds each `Word` object, morphological information is stored at several accessors. Those of interest to most users are `.pos` and `.features`.

Let's have a look at the verb in the sentence, *creavit*. It is the third word, so it has index 2.

In [None]:
creav = sentence_gen_1_1[2]
print('Word.string:', creav.string)
print("")

print('Word.pos:', creav.pos)

The CLTK contains a specific class for each of [the annotation types defined by v2 of the Universal Dependencies project](https://universaldependencies.org/u/feat/all.html). In the CLTK's codebase, these are located at [cltk/morphology/universal_dependencies_features.py](https://github.com/cltk/cltk/blob/dev/src/cltk/morphology/universal_dependencies_features.py).

For instance, a Latin verb requires a label for its [mood](https://universaldependencies.org/u/feat/all.html#al-u-feat/Mood) (e.g., indicative), which the UD project defines as "a feature that expresses modality and subclassifies finite verb forms".

Though morphological taggers may annotate a verb's mood variously ("ind.", "indicative", "Indic", etc.), the CLTK maps the term into the following standardized `Mood`.

``` python
class Mood(MorphosyntacticFeature):
    """The mood of a verb.
    see https://universaldependencies.org/u/feat/Mood.html
    """

    admirative = auto()
    conditional = auto()
    desiderative = auto()
    imperative = auto()
    indicative = auto()
    jussive = auto()
    necessitative = auto()
    optative = auto()
    potential = auto()
    purposive = auto()
    quotative = auto()
    subjunctive = auto()
```
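To illustrate what mapping varied tagger labels onto one standardized enum can look like, here is a minimal sketch. It is not the CLTK's own mapping code: the three-member `Mood` is a reduced stand-in, and the lookup table `TAG_TO_MOOD` and the function `normalize_mood` are hypothetical names introduced only for this example.

```python
from enum import IntEnum, auto


class Mood(IntEnum):
    """Stand-in for the CLTK's `Mood`, reduced to three members."""

    imperative = auto()
    indicative = auto()
    subjunctive = auto()


# Hypothetical tagger labels mapped onto the standardized members.
TAG_TO_MOOD = {
    "ind.": Mood.indicative,
    "indicative": Mood.indicative,
    "indic": Mood.indicative,
    "imp": Mood.imperative,
    "sub": Mood.subjunctive,
}


def normalize_mood(tag: str) -> Mood:
    """Map a tagger-specific label to a `Mood`, ignoring case and whitespace."""
    try:
        return TAG_TO_MOOD[tag.strip().lower()]
    except KeyError:
        raise ValueError(f"Unknown mood label: {tag!r}") from None


for raw_tag in ("ind.", "indicative", "Indic"):
    print(raw_tag, "->", normalize_mood(raw_tag).name)
```

Raising an error for unknown labels, instead of guessing, keeps unmapped annotations visible.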

Turning back to the above example word, we can see such features at `.features`.

In [None]:
# type
print("type(`Word.features`):", type(creav.features))
print("")
# str repr of `MorphosyntacticFeatureBundle`
print("`Word.features`:", creav.features)

A user may inspect a `MorphosyntacticFeatureBundle` in a manner similar to a `dict`:

In [None]:
print("Mood:", creav.features["Mood"], creav.features["Mood"][0].name)  # type: List[Mood]
print("Number:", creav.features["Number"])  # type: List[Number]
print("Person:", creav.features["Person"])  # type: List[Person]
print("Tense:", creav.features["Tense"])  # type: List[Tense]
print("VerbForm:", creav.features["VerbForm"])  # type: List[VerbForm]
print("Voice:", creav.features["Voice"])  # type: List[Voice]

# Note: The values returned here are lists, though normally only one
# morphological form will be available
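The dict-like lookup by feature name, with a list as the value, can be pictured with a toy container. The class `ToyFeatureBundle` below is a hypothetical illustration written for this notebook; it is not how `MorphosyntacticFeatureBundle` is implemented, and the two small enums are stand-ins for the CLTK's classes.

```python
from enum import IntEnum, auto
from typing import Dict, List


class Mood(IntEnum):
    indicative = auto()
    subjunctive = auto()


class Number(IntEnum):
    singular = auto()
    plural = auto()


class ToyFeatureBundle:
    """Toy container: feature values grouped by the name of their class."""

    def __init__(self, *features: IntEnum) -> None:
        self._features: Dict[str, List[IntEnum]] = {}
        for feature in features:
            self._features.setdefault(type(feature).__name__, []).append(feature)

    def __getitem__(self, feature_name: str) -> List[IntEnum]:
        return self._features[feature_name]

    def __contains__(self, feature_name: str) -> bool:
        return feature_name in self._features


bundle = ToyFeatureBundle(Mood.indicative, Number.singular)
print(bundle["Mood"])          # a list, even with a single value
print(bundle["Mood"][0].name)  # the name of the first (here: only) value
print("Tense" in bundle)       # membership test by feature name
```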

Looking a bit closer at `MorphosyntacticFeature`, we can see that its data type inherits from [IntEnum](https://docs.python.org/3/library/enum.html#enum.IntEnum), which is part of Python's standard library.

In [None]:
a_mood_obj = creav.features["Mood"][0]
# See type
print("type(a_mood_obj):", type(a_mood_obj))
print("")
# See inheritance
from enum import IntEnum
print("Is `IntEnum`?", isinstance(a_mood_obj, IntEnum))
print("")
# Check the CLTK base class
from cltk.morphology.morphosyntax import MorphosyntacticFeature
print("`Mood` inherits from `MorphosyntacticFeature`?", isinstance(a_mood_obj, MorphosyntacticFeature))

In [None]:
# You can manipulate this object as any IntEnum plus a few extras

print("`MorphosyntacticFeature` accessors:", [x for x in dir(a_mood_obj) if not x.startswith("__")])
print("")
print("MorphosyntacticFeature.name:", a_mood_obj.name)  # type: str
# A stable int value is available, too, associated with this name
print("MorphosyntacticFeature.value:", a_mood_obj.value)  # type: int
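The `name` and `value` behavior shown above comes from Python's `IntEnum`. The sketch below demonstrates it on a three-member stand-in, an assumption made to keep the example self-contained; the CLTK's own `Mood` has more members, so its integer values differ from the ones here.

```python
from enum import IntEnum, auto


class Mood(IntEnum):
    """Stand-in enum; the CLTK's own `Mood` has more members."""

    imperative = auto()
    indicative = auto()
    subjunctive = auto()


mood = Mood.indicative

print(mood.name)                   # member name as a str
print(mood.value)                  # auto() numbers members from 1 in definition order
print(isinstance(mood, int))       # IntEnum members are also ints
print(Mood["indicative"] is mood)  # look up a member by name
print(Mood(2) is mood)             # look up a member by value
print(sorted([Mood.subjunctive, Mood.imperative]))  # members sort by value
```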

Users can create their own `MorphosyntacticFeature` and `MorphosyntacticFeatureBundle`:

In [None]:
from cltk.morphology.morphosyntax import MorphosyntacticFeatureBundle
from cltk.morphology.universal_dependencies_features import Mood, Number, Person, VerbForm, Voice

latin_word_sim = "sim"

mood = Mood.subjunctive
voice = Voice.active
person = Person.first
number = Number.singular
verb_form = VerbForm.finite

latin_word_sim_bundle = MorphosyntacticFeatureBundle(mood, voice, person, number, verb_form)
print(latin_word_sim_bundle)

Finally, we may even construct a `Word` with this information:

In [None]:
from cltk.core.data_types import Word

print(Word(string="sim", features=latin_word_sim_bundle))

In [None]:
# For more on this or any other CLTK class, use `help()`
# help(a_mood_obj)
# help(MorphosyntacticFeatureBundle)

In [None]:
# Note: Extra morphological info may be written as `str`
# to the values at `.upos` and `.xpos` for languages that use
# the Stanza project

# Note: The particular annotations at these are often inconsistent across
# languages or even treebanks within a single language; hence the benefit
# of the CLTK's modeling at `.pos`.
print("`Word.upos`:", creav.upos)
print("`Word.xpos`:", creav.xpos)

## Analyze the whole book of Genesis

In [None]:
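all_features = {}

for ve in vulgate_genesis:
    gen_doc = cltk_nlp.analyze(text=vulgate_genesis[ve])
    verse_words = []
    for wo in gen_doc.words:
        word_dict = {}
        word_dict['text'] = wo.string
        word_dict['lemma'] = wo.lemma
        word_dict['pos'] = wo.pos
        word_dict['features'] = wo.features
        verse_words.append(word_dict)
        print(word_dict)
    # Keep the analysis of each verse under its (book, chapter, verse) key
    all_features[ve] = verse_words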
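Once every word is available as a dictionary, simple corpus statistics follow from the standard library. The sketch below counts part-of-speech tags and groups lemmata by tag. The records in `sample_records` are hypothetical and the tag strings are placeholders, not real pipeline output; in the notebook, `wo.pos` is an object, so one would probably convert it with `str()` before counting.

```python
from collections import Counter

# Hypothetical records in the shape of `word_dict`; the tag strings are
# placeholders, not real pipeline output.
sample_records = [
    {"text": "In", "lemma": "in", "pos": "adposition"},
    {"text": "principio", "lemma": "principium", "pos": "noun"},
    {"text": "creavit", "lemma": "creo", "pos": "verb"},
    {"text": "Deus", "lemma": "deus", "pos": "noun"},
]

# Frequency of each part-of-speech tag.
pos_counts = Counter(record["pos"] for record in sample_records)

# Set of distinct lemmata seen for each tag.
lemmata_by_pos = {}
for record in sample_records:
    lemmata_by_pos.setdefault(record["pos"], set()).add(record["lemma"])

print(pos_counts.most_common(1))
print(sorted(lemmata_by_pos["noun"]))
```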
            

# Modeling syntax with `Form` and `DependencyTree`  <a name="syntax"></a>

The CLTK uses Python's standard-library `xml` package to build trees for modeling dependency parses. A `Word` is mapped into a `Form`, then `ElementTree` is used to organize these `Form`s into a `DependencyTree`. With a tree, certain measurements (depth, breadth, counts of edge types) are more efficient to compute.

In [None]:
from cltk.dependency.tree import DependencyTree

In [None]:
# Let's look at the sentence from Genesis 1:1 again
print(cltk_doc.sentences_strings[0])  # text form of `sentence_gen_1_1`

In [None]:
a_tree = DependencyTree.to_tree(sentence_gen_1_1)

In [None]:
from pprint import pprint

pprint(a_tree.get_dependencies())

In [None]:
a_tree.print_tree()
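To make the idea of an `ElementTree`-backed dependency tree more concrete, here is a minimal sketch that uses only the standard library. It does not reproduce the CLTK's `Form` or `DependencyTree` classes. The four-token parse in `toy_parse`, including its relation labels, is a hypothetical example and not output of the pipeline.

```python
import xml.etree.ElementTree as ET

# Hypothetical parse of "In principio creavit Deus":
# (token index, token, index of its governor or -1 for the root, relation).
toy_parse = [
    (0, "In", 1, "case"),
    (1, "principio", 2, "obl"),
    (2, "creavit", -1, "root"),
    (3, "Deus", 2, "nsubj"),
]

# One XML element per token, with the relation stored as an attribute.
elements = {
    index: ET.Element(token, attrib={"relation": relation})
    for index, token, _, relation in toy_parse
}

# Attach every token to its governor; the token without one is the root.
root = None
for index, _, governor, _ in toy_parse:
    if governor == -1:
        root = elements[index]
    else:
        elements[governor].append(elements[index])


def depth(element: ET.Element) -> int:
    """Number of nodes on the longest path from `element` down to a leaf."""
    return 1 + max((depth(child) for child in element), default=0)


print(ET.tostring(root, encoding="unicode"))
print("Depth:", depth(root))
print("Relations:", sorted(el.get("relation") for el in root.iter()))
```

Once the tokens are nested in this way, measurements such as depth or the set of edge types reduce to short traversals of the tree.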

# Feature extraction <a name="features"></a>

The CLTK offers the function `cltk_doc_to_features_table()`, which assists users in preparing a `Doc` as training data for machine learning. It converts the list of `Word` objects at `Doc.words` into a tabular list of lists.

In [None]:
from cltk.utils.feature_extraction import cltk_doc_to_features_table

In [None]:
feature_names, list_of_list_features = cltk_doc_to_features_table(cltk_doc=cltk_doc)

In [None]:
# See here the names of the features extracted
print(feature_names)

In [None]:
# Number of "inner lists" matches number of tokens
print("Number tokens:", len(cltk_doc.words))
print("len() of feature instances (one for each token):", len(list_of_list_features))

In [None]:
# Look at one row of data `(variable name, variable value)`; index 2 is `creavit`
pprint(list(zip(feature_names, list_of_list_features[2])))
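A header list plus a list of rows is easy to hand on to other tools. The sketch below turns such a table into one dict per token and into CSV text. The names `toy_feature_names` and `toy_feature_rows` and their contents are hypothetical stand-ins; the real table returned by `cltk_doc_to_features_table()` has different columns.

```python
import csv
import io

# Hypothetical stand-ins for `feature_names` and `list_of_list_features`.
toy_feature_names = ["string", "lemma", "pos"]
toy_feature_rows = [
    ["In", "in", "adposition"],
    ["principio", "principium", "noun"],
    ["creavit", "creo", "verb"],
]

# One dict per token makes a single row easier to read.
row_dicts = [dict(zip(toy_feature_names, row)) for row in toy_feature_rows]
print(row_dicts[2])

# Write the table as CSV into an in-memory buffer; a file opened with
# newline="" could be used in the same way.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(toy_feature_names)
writer.writerows(toy_feature_rows)
csv_text = buffer.getvalue()
print(csv_text)
```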

# Brief demonstration of `NLP()` for Ancient Greek <a name="greek-nlp"></a>

The API for Greek is the same as for Latin.

In [None]:
# Read the Ancient Greek file
with open("grc-thucydides.txt", encoding="utf-8") as fo:
    thucydides_full = fo.read()

In [None]:
print("Text snippet:", thucydides_full[0:200])
print("Character count:", len(thucydides_full))
print("Approximate token count:", len(thucydides_full.split()))

In [None]:
len(thucydides_full) // 7

In [None]:
# Cut this down to roughly 10k tokens for this demonstration's purposes
thucydides = thucydides_full[:len(thucydides_full) // 7]
print("Approximate token count:", len(thucydides.split()))
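Slicing by character count, as above, can end in the middle of a word. Where that matters, one alternative is to cut on whitespace-separated tokens instead. The helper `first_tokens` below is a hypothetical name introduced for this sketch, and the sample string is only the opening of the text, used to keep the example short.

```python
def first_tokens(text: str, max_tokens: int) -> str:
    """Return the first `max_tokens` whitespace-separated tokens of `text`."""
    return " ".join(text.split()[:max_tokens])


# Opening words of Thucydides, used here only as a short sample string.
sample_text = "Θουκυδίδης Ἀθηναῖος ξυνέγραψε τὸν πόλεμον τῶν Πελοποννησίων καὶ Ἀθηναίων"

by_characters = sample_text[: len(sample_text) // 2]  # may end inside a word
by_tokens = first_tokens(sample_text, 4)              # always ends on a whole word

print(repr(by_characters))
print(repr(by_tokens))
```

Note that joining with single spaces collapses line breaks and repeated whitespace, which may or may not be acceptable depending on how sentence boundaries are detected later.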

In [None]:
thucydides[:200]

In [None]:
cltk_nlp_grc = NLP(language="grc")

In [None]:
# Execution time is 50 sec on a 2015 Macbook Pro
%time cltk_doc_grc = cltk_nlp_grc.analyze(text=thucydides)

# You will be asked to download some models (from CLTK, fastText, and Stanza)

In [None]:
print("`Doc.tokens`:", cltk_doc_grc.tokens[:20])

In [None]:
print(cltk_doc_grc.words[4])  # πόλεμον ('war')

In [None]:
a_tree_grc = DependencyTree.to_tree(cltk_doc_grc.sentences[0])

In [None]:
pprint(a_tree_grc.get_dependencies())

In [None]:
print(cltk_doc_grc.sentences_strings[0])
print("")
print("Translation:", "Thucydides, an Athenian, wrote the history of the war between the Peloponnesians and the Athenians, beginning at the moment that it broke out, and believing that it would be a great war, and more worthy of relation than any that had preceded it. This belief was not without its grounds. The preparations of both the combatants were in every department in the last state of perfection; and he could see the rest of the Hellenic race taking sides in the quarrel; those who delayed doing so at once having it in contemplation.")
print("")
a_tree_grc.print_tree()

In [None]:
feature_names_grc, list_of_list_features_grc = cltk_doc_to_features_table(cltk_doc=cltk_doc_grc)

In [None]:
print(feature_names_grc)

In [None]:
print("len() of feature instances (one for each token):", len(list_of_list_features_grc))
print("")
print("Example of one instance row:", list_of_list_features_grc[4])

In [None]:
# Putting these together for easier reading
pprint(list(zip(feature_names_grc, list_of_list_features_grc[4])))