# Table of contents
1. [Introduction](#introduction)
1. [Install CLTK](#install)
1. [Read data](#load)
1. [Run NLP pipeline with `NLP()`](#run-nlp)
1. [Inspect CLTK `Doc`](#inspect-doc)
1. [Inspect CLTK `Word`](#inspect-word)
1. [Analyze the whole book of Genesis](#analyze-book)

# Introduction <a name="introduction"></a>

This notebook is based on [a notebook](https://github.com/cltk/cltk/blob/master/notebooks/CLTK%20Demonstration.ipynb) from the [Classical Language Toolkit project](http://cltk.org). We will annotate the text of the Vulgate of Genesis, using the pipeline that is provided for the Latin language. We use the text of the Latin Genesis as it is provided [here](https://github.com/cltk/lat_text_tesserae).

This notebook demonstrates how to use `NLP()`, the CLTK's primary interface, on a Latin text. Pipelines are available for 17 languages (see [Languages](https://docs.cltk.org/en/latest/languages.html) in the docs).

Full documentation is available at <https://docs.cltk.org/en/latest/cltk.html#cltk.nlp.NLP>.



In [1]:
import pandas as pd

# Install CLTK <a name="install"></a>

In [None]:
## Requires Python 3.7, 3.8, or 3.9

# !pip install cltk

# Read data <a name="load"></a>

Open the text of Genesis, read it line by line, parse the label (book, chapter, verse), and store the data in the dictionary `vulgate_genesis`. In this dict, each key is a tuple of book, chapter and verse, and each value is a string containing the text of that verse.

In [6]:
a_string = 'this is sentence'
a_string.split('nt')

['this is se', 'ence']

In [9]:
vulgate_genesis = {}

with open("jerome.vulgate.part.1.genesis.tess") as gen:
    for line in gen:
        # The label and the verse text are separated by '> '; split only once
        # so that the verse text always stays in one piece.
        label, text = line.split('> ', 1)
        # The second field of the label holds "book.chapter.verse".
        _, bo_ch_ve = label.split()
        bo, ch, ve = bo_ch_ve.split('.')
        vulgate_genesis[(bo, ch, ve)] = text.strip()
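
The parsing step in the loop above can be tried on a single line without opening the file. The following is a minimal sketch: `parse_tess_line` is a hypothetical helper name, and the sample line only imitates the layout that the loop implies (a label ending in `>`, a space, then the verse text). The exact label prefix in the real file may differ.

```python
def parse_tess_line(line):
    """Split one line into a (book, chapter, verse) key and the verse text."""
    # Split only at the first '> ' so that the verse text stays in one piece.
    label, text = line.split('> ', 1)
    # The last whitespace-separated field of the label is "book.chapter.verse".
    bo, ch, ve = label.split()[-1].split('.')
    return (bo, ch, ve), text.strip()


# Hypothetical sample line; the label prefix "<vulgate" is an assumption.
sample_line = "<vulgate Genesis.1.1> in principio creavit Deus caelum et terram;\n"
sample_key, sample_text = parse_tess_line(sample_line)
print(sample_key, sample_text)
```

Chapter and verse stay strings here, just as in the keys of `vulgate_genesis`.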


In [10]:
vulgate_genesis

{('Genesis', '1', '1'): 'in principio creavit Deus caelum et terram;',
 ('Genesis',
  '1',
  '2'): 'terra autem erat inanis et vacua et tenebrae super faciem abyssi et spiritus Dei ferebatur super aquas;',
 ('Genesis', '1', '3'): 'dixitque Deus fiat lux et facta est lux;',
 ('Genesis',
  '1',
  '4'): 'et vidit Deus lucem quod esset bona et divisit lucem ac tenebras;',
 ('Genesis',
  '1',
  '5'): 'appellavitque lucem diem et tenebras noctem factumque est vespere et mane dies unus;',
 ('Genesis',
  '1',
  '6'): 'dixit quoque Deus fiat firmamentum in medio aquarum et dividat aquas ab aquis;',
 ('Genesis',
  '1',
  '7'): 'et fecit Deus firmamentum divisitque aquas quae erant sub firmamento ab his quae erant super firmamentum et factum est ita;',
 ('Genesis',
  '1',
  '8'): 'vocavitque Deus firmamentum caelum et factum est vespere et mane dies secundus;',
 ('Genesis',
  '1',
  '9'): 'dixit vero Deus congregentur aquae quae sub caelo sunt in locum unum et appareat arida factumque est ita;',


How many verses are there in Genesis?

In [None]:
len(vulgate_genesis)

What is the text of Genesis 10:10?

In [None]:
vulgate_genesis[('Genesis', '10', '10')]
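
Because the keys are tuples, per-chapter counts follow from the keys alone. Below is a minimal sketch with `collections.Counter`. It uses a small stand-in dictionary (the verse texts are placeholders, not the Vulgate) so that it runs without the data file.

```python
from collections import Counter

# Stand-in for vulgate_genesis: same key layout, placeholder texts.
sample_verses = {
    ('Genesis', '1', '1'): 'text a',
    ('Genesis', '1', '2'): 'text b',
    ('Genesis', '2', '1'): 'text c',
}

# Iterating over a dict yields its keys; count how many share each chapter label.
verses_per_chapter = Counter(ch for _, ch, _ in sample_verses)
print(verses_per_chapter)
```

The chapter labels are strings, so convert them with `int()` before sorting them numerically.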

# Run NLP pipeline with `NLP()` <a name="run-nlp"></a>

In [2]:
from cltk import NLP
from cltk.morphology.utils import get_features

In [None]:
cltk_nlp = NLP(language="lat")

In [None]:
# Remove ``LatinLexiconProcess`` (the last process) for this demo because it is slow (adds ~9 mins total)
cltk_nlp.pipeline.processes.pop(-1)
print(cltk_nlp.pipeline.processes)
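
`pop(-1)` relies on `LatinLexiconProcess` being the last entry in the list. A name-based alternative might look like the sketch below. It assumes that the entries of `pipeline.processes` are classes whose `__name__` identifies them. `drop_process` is a hypothetical helper, and the two classes are empty stand-ins so that the example runs without CLTK.

```python
def drop_process(processes, name):
    """Return a new list without the process class called `name`."""
    return [proc for proc in processes if proc.__name__ != name]


# Empty stand-in classes; in the notebook the list would be
# cltk_nlp.pipeline.processes instead.
class SomeOtherProcess:
    pass


class LatinLexiconProcess:
    pass


sample_processes = [SomeOtherProcess, LatinLexiconProcess]
sample_processes = drop_process(sample_processes, "LatinLexiconProcess")
print([proc.__name__ for proc in sample_processes])
```

Unlike `pop(-1)`, this leaves the list unchanged when the named process is absent, so running the cell twice does not remove a second process.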

Let's see what the pipeline does with Genesis 1:1.

In [None]:
cltk_doc = cltk_nlp.analyze(text=vulgate_genesis[('Genesis', '1', '1')])

# Inspect CLTK `Doc` <a name="inspect-doc"></a>

The pipeline has created a `Doc` object from our string:

In [None]:
print(type(cltk_doc))

Which attributes and methods does this `Doc` object offer?

In [None]:
dir(cltk_doc)

Show the tokens! Note that the semicolon at the end is parsed as a separate token.

In [None]:
print(cltk_doc.tokens)

Lemmata.

In [None]:
print(cltk_doc.lemmata)

Parts of speech.

In [None]:
print(cltk_doc.pos)

`sentences_tokens` is a list of lists: each inner list holds the tokens of one sentence in the analyzed string.

In [None]:
print(cltk_doc.sentences_tokens)
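
As a sketch of this nested structure, the stand-in below imitates what `sentences_tokens` might look like for Genesis 1:1, assuming the verse is treated as a single sentence. The value is written by hand, not produced by CLTK. Flattening the inner lists gives back one flat token list.

```python
from itertools import chain

# Hand-written stand-in: one inner list per sentence.
sample_sentences_tokens = [
    ['in', 'principio', 'creavit', 'Deus', 'caelum', 'et', 'terram', ';'],
]

# Report the number of tokens in each sentence.
for number, sentence in enumerate(sample_sentences_tokens, start=1):
    print(number, len(sentence), sentence)

# Join all sentences into a single flat list of tokens.
flat_tokens = list(chain.from_iterable(sample_sentences_tokens))
```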

# Inspect CLTK `Word` <a name="inspect-word"></a>

Most powerful, though, is the ``Doc.words`` accessor, which is a list of ``Word`` objects. These ``Word`` objects contain all the information that was generated by the NLP pipeline.

In [None]:
# One Word object for each token
print(len(cltk_doc.words))

In [None]:
for wo in cltk_doc.words:
    print(wo.lemma, wo.pos)

We select the verb of the sentence (the third token, at index 2) so that we can inspect its features.

In [None]:
cltk_doc.words[2]

In [None]:
creavit = cltk_doc.words[2]

In [None]:
creavit.string

You can get the other word features with the function `get_features()`.

In [None]:
get_features(creavit)

This function returns a tuple of two lists: the first contains the feature names, the second contains the corresponding values. We can unpack the tuple by assigning each list to a variable name.

In [None]:
feature_names, feature_values = get_features(creavit)

In [None]:
feature_names
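
Since the two lists are aligned by position, `zip()` can pair each name with its value. The sketch below keeps only the features that are actually set. The lists are short hand-written stand-ins: real output has more entries, and the exact value types are an assumption.

```python
# Hand-written stand-ins for the two lists returned by get_features().
sample_names = ['case', 'gender', 'number', 'tense', 'person']
sample_values = [None, None, 'singular', 'past', 'third']

# Pair names with values and drop the features that have no value.
set_features = {
    name: value for name, value in zip(sample_names, sample_values) if value
}
print(set_features)
```

A dictionary keeps one entry per name, so this form only suits feature lists in which every name occurs once.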

# Analyze the whole book of Genesis <a name="analyze-book"></a>

In [None]:
all_features = set()
all_words_genesis = {}

for verse in vulgate_genesis:
    bo, ch, ve = verse
    gen_doc = cltk_nlp.analyze(text=vulgate_genesis[verse])
    # Iterate over the Word objects of the verse via Doc.words.
    for idx, wo in enumerate(gen_doc.words):
        feature_names, feature_values = get_features(wo)
        all_features.add(tuple(feature_names))

        # One row per word: location, surface form, lemma, POS, then the features.
        feature_list = [bo, ch, ve, wo.string, wo.lemma, wo.pos]
        # Replace missing feature values with '-'.
        feature_list.extend(feature if feature else '-' for feature in feature_values)
        all_words_genesis[(idx, bo, ch, ve)] = feature_list

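
The row-building part of the loop can be isolated in a small function, which makes it easy to check on one word. This is a sketch: `build_row` is a hypothetical helper, and the arguments are hand-written stand-ins for what the loop takes from a `Word` object and from `get_features()`.

```python
def build_row(bo, ch, ve, string, lemma, pos, feature_values, missing='-'):
    """Build one flat row: location, word data, then the feature values."""
    row = [bo, ch, ve, string, lemma, pos]
    # Replace empty feature values with the `missing` marker.
    row.extend(value if value else missing for value in feature_values)
    return row


# Stand-in values; lemma, POS and features are illustrative, not CLTK output.
sample_row = build_row(
    'Genesis', '1', '1', 'creavit', 'creo', 'verb', [None, 'singular', None]
)
print(sample_row)
```

Every row gets the same length as long as `get_features()` returns the same number of values for each word, which is what allows the rows to be stacked into a table.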

In [None]:
genesis = pd.DataFrame(all_words_genesis).T
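
An alternative to building the frame and transposing it is `DataFrame.from_dict` with `orient='index'`, which treats each dictionary key as a row label and can set the column names in the same call. Below is a minimal sketch on a stand-in dictionary with shortened rows; the four column names are chosen for illustration.

```python
import pandas as pd

# Stand-in for all_words_genesis: same key layout, shortened rows.
sample_words = {
    (0, 'Genesis', '1', '1'): ['Genesis', '1', '1', 'in'],
    (1, 'Genesis', '1', '1'): ['Genesis', '1', '1', 'principio'],
}

# Each key becomes a row label, each list becomes one row.
sample_frame = pd.DataFrame.from_dict(
    sample_words, orient='index', columns=['book', 'chapter', 'verse', 'text']
)
print(sample_frame.shape)
```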

In [None]:
all_features

In [None]:
genesis_colnames = ['book', 
                    'chapter',
                    'verse', 
                    'text', 
                    'lemma', 
                    'pos', 
                    'case',
                    'gender',
                    'animacy',
                    'number',
                    'definiteness',
                    'degree',
                    'strength',
                    'verbform',
                    'tense',
                    'mood',
                    'aspect',
                    'voice',
                    'person',
                    'polarity',
                    'politeness',
                    'clusivity',
                    'evidentiality',
                    'strength']

genesis.columns = genesis_colnames

In [None]:
genesis.shape

In [None]:
genesis.head(10)

Save the result as a tab-separated (TSV) file.

In [None]:
genesis.to_csv('genesis.tsv', sep='\t', index=False)