# NLP Pipeline for Greek

Under each level-2 or lower-level heading, choose one or more cells to run. Each cell is a single step that can be added to the pipeline. To skip a step, convert its cell to `raw` type with the `Esc` + `R` keys. By default, all steps are initially skipped.

## Frequently used Python modules

In [1]:
import re, os, cltk, requests, spacy, textblob
from pprint import pprint
from termcolor import colored, cprint

In [2]:
# print coloring options
line_color = "blue"
text_color = "magenta"
reference_color = "white"

## Import Data

Import Greek text.

### *From TXT File*

In [3]:
def load_txt(filename: str):
    """
    Extract text from a .txt file.

    Parameters:
        filename (str): Path of file to load.

    Returns:
        str: Text loaded from file.
    """
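    if not os.path.exists(path=filename):
        raise FileNotFoundError(f"The path {filename} does not exist.")

    # Assumes the file is UTF-8 encoded; the context manager closes the handle.
    with open(file=filename, mode="r", encoding="utf-8") as file:
        text = file.read()
    return text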


filename = "/mnt/d/share/Using-AI-to-Trace-the-History-of-Race-and-Inequality/src/sample_text/latin/urn_cts_greekLit_stoa0146d.stoa001.opp-lat11.txt"
text = load_txt(filename=filename)
cprint(text="-" * 100, color=line_color)
cprint(text="Loading from TXT file:", color=text_color)
cprint(text=filename, color=reference_color)
cprint(text="-" * 100, color=line_color)
cprint(text=text, color=text_color)

----------------------------------------------------------------------------------------------------
Loading from TXT file:
/mnt/d/share/Using-AI-to-Trace-the-History-of-Race-and-Inequality/src/sample_text/latin/urn_cts_greekLit_stoa0146d.stoa001.opp-lat11.txt
----------------------------------------------------------------------------------------------------
ACTA ARCHELAI. Thesaurus verus sive disputatio habita in Carcharis eiTitate Mesopotamiae
Archelai episeopi adversus Manen, Judicantibus Manippo et
Aegialeo et Clandio et Cleobolo. In qua urbe erat quidani vir Marcellus 
 nomine qui vita et studiis et genere, prudentia quoque et honestate
valde clanis habebatnr; faeultatibus etiam copiosus et quoii
omnium maximiim est, religiosiasime deum timens, et his quae de
Christo dicebantur semper cum timore auscultans, nec quicquam omnino
honi erat quod illi viro deesset; unde et honore plurimo ab nniversa 
 eivitate colebatnr plnrimisque ipse civitat

### *From URI*

In [4]:
def load_uri(uri: str):
    """
    Load text from URI.

    Parameters:
        uri (str): URI link to text online.

    Returns:
        str: Text loaded from URI.
    """
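    # Fail on network stalls or HTTP errors instead of returning an error page as text.
    req = requests.get(url=uri, timeout=30)
    req.raise_for_status()
    return req.text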


uri = "https://scaife.perseus.org/library/passage/urn:cts:greekLit:stoa0146d.stoa001.opp-lat1:1/text/"
text = load_uri(uri=uri)
cprint(text="-" * 100, color=line_color)
cprint(text="Loading from URI:", color=text_color)
cprint(text=uri, color=reference_color)
cprint(text="-" * 100, color=line_color)
cprint(text=text, color=text_color)

----------------------------------------------------------------------------------------------------
Loading from URI:
https://scaife.perseus.org/library/passage/urn:cts:greekLit:stoa0146d.stoa001.opp-lat1:1/text/
----------------------------------------------------------------------------------------------------
ACTA ARCHELAI. Thesaurus verus sive disputatio habita in Carcharis eiTitate Mesopotamiae
Archelai episeopi adversus Manen, Judicantibus Manippo et
Aegialeo et Clandio et Cleobolo. In qua urbe erat quidani vir Marcellus 
 nomine qui vita et studiis et genere, prudentia quoque et honestate
valde clanis habebatnr; faeultatibus etiam copiosus et quoii
omnium maximiim est, religiosiasime deum timens, et his quae de
Christo dicebantur semper cum timore auscultans, nec quicquam omnino
honi erat quod illi viro deesset; unde et honore plurimo ab nniversa 
 eivitate colebatnr plnrimisque ipse civitatem suam freqiienter largitionibus
remunerabatur

## Paragraph Edit

Edit the text at the paragraph level.

### *Delete Footnotes*

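Delete all footnotes that were extracted with the main text. The cell below only locates the first candidate footnote (text starting at a digit) and prints the match; it does not modify `text`.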
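
The deletion itself might be sketched as follows. This is a minimal sketch, not the notebook's method: it assumes each footnote sits on its own line and begins with a number followed by a space, and `strip_footnote_lines` is a hypothetical helper name.

```python
import re

# Hypothetical pattern: a whole line that starts with a number and a space.
# This layout is an assumption about the extracted text, not a known fact.
FOOTNOTE_LINE = re.compile(r"^[0-9]+ .*(?:\n|$)", flags=re.MULTILINE)


def strip_footnote_lines(text: str) -> str:
    """Remove every line that looks like a numbered footnote."""
    return FOOTNOTE_LINE.sub("", text)


sample = "honore plurimo colebatur\n1 Dispatatio archelay et manychei\nremunerabatur"
print(strip_footnote_lines(sample))
```

Digits inside a running line are left alone, because the pattern is anchored to the start of a line.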

In [5]:
# A digit followed by the rest of its line: a rough footnote candidate.
pattern = r"([0-9].*)"
match_obj = re.search(pattern=pattern, string=text)
cprint(text="-" * 100, color=line_color)
cprint(text="match:", color=text_color)
cprint(text="-" * 100, color=line_color)
cprint(text=match_obj, color=text_color)

----------------------------------------------------------------------------------------------------
match:
----------------------------------------------------------------------------------------------------
<re.Match object; span=(1327, 1407), match='1 Dispatatio archelay et manychei (rot) vel manee>


## Punctuation

Remove, replace, or alter punctuation marks in the text.

### *Swallow All Brackets*

Delete both angle `<>` and square `[]` brackets, including the text within them.

In [6]:
text = cltk.alphabet.lat.swallow_angle_brackets(text=text)
text = cltk.alphabet.lat.swallow_square_brackets(text=text)
cprint(text="-" * 100, color=line_color)
cprint(text="Swallowing all brackets and their text:", color=text_color)
cprint(text="-" * 100, color=line_color)
cprint(text=text, color=text_color)

----------------------------------------------------------------------------------------------------
Swallowing all brackets and their text:
----------------------------------------------------------------------------------------------------
ACTA ARCHELAI. Thesaurus verus sive disputatio habita in Carcharis eiTitate Mesopotamiae
Archelai episeopi adversus Manen, Judicantibus Manippo et
Aegialeo et Clandio et Cleobolo. In qua urbe erat quidani vir Marcellus 
 nomine qui vita et studiis et genere, prudentia quoque et honestate
valde clanis habebatnr; faeultatibus etiam copiosus et quoii
omnium maximiim est, religiosiasime deum timens, et his quae de
Christo dicebantur semper cum timore auscultans, nec quicquam omnino
honi erat quod illi viro deesset; unde et honore plurimo ab nniversa 
 eivitate colebatnr plnrimisque ipse civitatem suam freqiienter largitionibus
remunerabatur, paiiperibus tribuens, adfticts relevans, tribulatis
auxilium ferens, Sed ne infi
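
To make the effect of "swallowing" concrete, here is a minimal regex sketch of the same idea. It is an illustration only and does not reproduce CLTK's implementation; `swallow_brackets_sketch` is a hypothetical name, and nested brackets are not handled.

```python
import re


def swallow_brackets_sketch(text: str) -> str:
    """Delete <...> and [...] spans together with the text inside them."""
    text = re.sub(r"<[^<>]*>", "", text)      # angle brackets and contents
    text = re.sub(r"\[[^\[\]]*\]", "", text)  # square brackets and contents
    return text


print(swallow_brackets_sketch("disputatio <habita> in [sic] Carcharis"))
```

The surrounding whitespace is left untouched in this sketch, so removed spans can leave double spaces behind.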

### *Swallow Editorial*

Delete common editorial marks.

In [7]:
text = cltk.alphabet.lat.swallow_editorial(text=text)
cprint(text="-" * 100, color=line_color)
cprint(text="Deleting common editorial marks:", color=text_color)
cprint(text="-" * 100, color=line_color)
cprint(text=text, color=text_color)

----------------------------------------------------------------------------------------------------
Deleting common editorial marks:
----------------------------------------------------------------------------------------------------
ACTA ARCHELAI. Thesaurus verus sive disputatio habita in Carcharis eiTitate Mesopotamiae
Archelai episeopi adversus Manen, Judicantibus Manippo et
Aegialeo et Clandio et Cleobolo. In qua urbe erat quidani vir Marcellus 
 nomine qui vita et studiis et genere, prudentia quoque et honestate
valde clanis habebatnr; faeultatibus etiam copiosus et quoii
omnium maximiim est, religiosiasime deum timens, et his quae de
Christo dicebantur semper cum timore auscultans, nec quicquam omnino
honi erat quod illi viro deesset; unde et honore plurimo ab nniversa 
 eivitate colebatnr plnrimisque ipse civitatem suam freqiienter largitionibus
remunerabatur, paiiperibus tribuens, adfticts relevans, tribulatis
auxilium ferens, Sed ne infirmitate

### *Dehyphenate*

Remove hyphens, which is especially useful for text whose words were split with hyphens at line breaks.

In [8]:
text = cltk.alphabet.lat.dehyphenate(text=text)
cprint(text="-" * 100, color=line_color)
cprint(text="Removing hyphens from text:", color=text_color)
cprint(text="-" * 100, color=line_color)
cprint(text=text, color=text_color)

----------------------------------------------------------------------------------------------------
Removing hyphens from text:
----------------------------------------------------------------------------------------------------
ACTA ARCHELAI. Thesaurus verus sive disputatio habita in Carcharis eiTitate Mesopotamiae
Archelai episeopi adversus Manen, Judicantibus Manippo et
Aegialeo et Clandio et Cleobolo. In qua urbe erat quidani vir Marcellus 
 nomine qui vita et studiis et genere, prudentia quoque et honestate
valde clanis habebatnr; faeultatibus etiam copiosus et quoii
omnium maximiim est, religiosiasime deum timens, et his quae de
Christo dicebantur semper cum timore auscultans, nec quicquam omnino
honi erat quod illi viro deesset; unde et honore plurimo ab nniversa 
 eivitate colebatnr plnrimisque ipse civitatem suam freqiienter largitionibus
remunerabatur, paiiperibus tribuens, adfticts relevans, tribulatis
auxilium ferens, Sed ne infirmitate verb
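
The line-wrap case can be illustrated with a small regex sketch. It is an assumption-laden illustration rather than CLTK's own logic: `join_hyphenated_breaks` is a hypothetical helper that only rejoins a word split by a hyphen at a line break and leaves all other hyphens in place.

```python
import re


def join_hyphenated_breaks(text: str) -> str:
    """Rejoin words that were split with a hyphen at the end of a line."""
    # A hyphen directly after a word character, then a line break
    # (with optional spaces or tabs around it), then another word character.
    return re.sub(r"(?<=\w)-[ \t]*\n[ \t]*(?=\w)", "", text)


print(join_hyphenated_breaks("ipse civi-\ntatem suam"))
```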

### *Drop Latin Punctuation*

Drop all Latin punctuation except hyphens and obelization markers, replacing each punctuation mark with a space. Hyphens (-) and obeli (†) must be removed before this step if they are meant to be removed.

In [9]:
text = cltk.alphabet.lat.drop_latin_punctuation(text=text)
cprint(text="-" * 100, color=line_color)
cprint(text="Dropping Latin punctuation:", color=text_color)
cprint(text="-" * 100, color=line_color)
cprint(text=text, color=text_color)

----------------------------------------------------------------------------------------------------
Dropping Latin punctuation:
----------------------------------------------------------------------------------------------------
ACTA ARCHELAI  Thesaurus verus sive disputatio habita in Carcharis eiTitate Mesopotamiae
Archelai episeopi adversus Manen  Judicantibus Manippo et
Aegialeo et Clandio et Cleobolo  In qua urbe erat quidani vir Marcellus 
 nomine qui vita et studiis et genere  prudentia quoque et honestate
valde clanis habebatnr  faeultatibus etiam copiosus et quoii
omnium maximiim est  religiosiasime deum timens  et his quae de
Christo dicebantur semper cum timore auscultans  nec quicquam omnino
honi erat quod illi viro deesset  unde et honore plurimo ab nniversa 
 eivitate colebatnr plnrimisque ipse civitatem suam freqiienter largitionibus
remunerabatur  paiiperibus tribuens  adfticts relevans  tribulatis
auxilium ferens  Sed ne infirmitate verb
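
The replace-with-a-space behaviour described above can be sketched with `str.translate`. The punctuation set here is an assumption chosen for illustration, not the set CLTK uses, and `drop_punctuation_sketch` is a hypothetical name.

```python
# Assumed punctuation set; the hyphen (-) and obelus (†) are deliberately absent.
ASSUMED_PUNCTUATION = ".,;:!?()\"'"
PUNCTUATION_TO_SPACE = str.maketrans({mark: " " for mark in ASSUMED_PUNCTUATION})


def drop_punctuation_sketch(text: str) -> str:
    """Replace each assumed punctuation mark with a single space."""
    return text.translate(PUNCTUATION_TO_SPACE)


print(drop_punctuation_sketch("adversus Manen, Judicantibus Manippo"))
```

Because each mark becomes a space rather than being deleted, a comma followed by a space turns into two spaces, as in the output above.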

### *Ligature Replacement*

Replace the ligatures ‘œ’ and ‘æ’ with ‘oe’ and ‘ae’, and ‘Œ’ and ‘Æ’ with ‘OE’ and ‘AE’.

In [10]:
ligature_replacer = cltk.alphabet.lat.LigatureReplacer()
# Pass the variable (not the literal string "text") and keep the result.
text = ligature_replacer.replace(text=text)
cprint(text="-" * 100, color=line_color)
cprint(text="Replacing ligatures (œ, æ) from text:", color=text_color)
cprint(text="-" * 100, color=line_color)
cprint(text=text, color=text_color)

----------------------------------------------------------------------------------------------------
Replacing ligatures (œ, æ) from text:
----------------------------------------------------------------------------------------------------
ACTA ARCHELAI  Thesaurus verus sive disputatio habita in Carcharis eiTitate Mesopotamiae
Archelai episeopi adversus Manen  Judicantibus Manippo et
Aegialeo et Clandio et Cleobolo  In qua urbe erat quidani vir Marcellus 
 nomine qui vita et studiis et genere  prudentia quoque et honestate
valde clanis habebatnr  faeultatibus etiam copiosus et quoii
omnium maximiim est  religiosiasime deum timens  et his quae de
Christo dicebantur semper cum timore auscultans  nec quicquam omnino
honi erat quod illi viro deesset  unde et honore plurimo ab nniversa 
 eivitate colebatnr plnrimisque ipse civitatem suam freqiienter largitionibus
remunerabatur  paiiperibus tribuens  adfticts relevans  tribulatis
auxilium ferens  Sed ne infirm
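
Since the sample text contains no ligatures, the step has no visible effect here. The following standalone sketch shows the intended mapping on a made-up input; `replace_ligatures_sketch` is a hypothetical helper and not CLTK's class.

```python
# Each ligature expands to its two-letter spelling, preserving case.
LIGATURES = str.maketrans({"œ": "oe", "æ": "ae", "Œ": "OE", "Æ": "AE"})


def replace_ligatures_sketch(text: str) -> str:
    """Expand Latin ligatures into plain letter pairs."""
    return text.translate(LIGATURES)


print(replace_ligatures_sketch("Cæsar et pœna"))
```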

### *Drop Accents*

Remove accents. Note that ligature (æ) replacement and macron removal are separate steps and should be run on their own, if desired.

In [11]:
text = cltk.alphabet.lat.remove_accents(text=text)
cprint(text="-" * 100, color=line_color)
cprint(text="Dropping accents:", color=text_color)
cprint(text="-" * 100, color=line_color)
cprint(text=text, color=text_color)

----------------------------------------------------------------------------------------------------
Dropping accents:
----------------------------------------------------------------------------------------------------
ACTA ARCHELAI  Thesaurus verus sive disputatio habita in Carcharis eiTitate Mesopotamiae
Archelai episeopi adversus Manen  Judicantibus Manippo et
Aegialeo et Clandio et Cleobolo  In qua urbe erat quidani vir Marcellus 
 nomine qui vita et studiis et genere  prudentia quoque et honestate
valde clanis habebatnr  faeultatibus etiam copiosus et quoii
omnium maximiim est  religiosiasime deum timens  et his quae de
Christo dicebantur semper cum timore auscultans  nec quicquam omnino
honi erat quod illi viro deesset  unde et honore plurimo ab nniversa 
 eivitate colebatnr plnrimisque ipse civitatem suam freqiienter largitionibus
remunerabatur  paiiperibus tribuens  adfticts relevans  tribulatis
auxilium ferens  Sed ne infirmitate verborum virtu
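
One way to see why accents and macrons can be handled separately is a Unicode-decomposition sketch. This is an illustration under assumptions, not CLTK's implementation: the set of marks counted as "accents" (grave, acute, circumflex, diaeresis) is chosen here for the example, and `strip_accents_sketch` is a hypothetical name.

```python
import unicodedata

# Combining marks treated as accents in this sketch (an assumption):
# grave, acute, circumflex, diaeresis. The macron (U+0304) is not included.
ACCENT_MARKS = {"\u0300", "\u0301", "\u0302", "\u0308"}


def strip_accents_sketch(text: str) -> str:
    """Drop accent marks while leaving macrons and ligatures untouched."""
    decomposed = unicodedata.normalize("NFD", text)  # split base letter + marks
    kept = "".join(ch for ch in decomposed if ch not in ACCENT_MARKS)
    return unicodedata.normalize("NFC", kept)        # recompose what remains


print(strip_accents_sketch("qu\u00e0m su\u00e2 ros\u0101"))
```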

### *Drop Macrons*

Remove macrons, the bars above vowels that indicate long pronunciation.

In [12]:
text = cltk.alphabet.lat.remove_macrons(text=text)
cprint(text="-" * 100, color=line_color)
cprint(text="Dropping macrons:", color=text_color)
cprint(text="-" * 100, color=line_color)
cprint(text=text, color=text_color)

----------------------------------------------------------------------------------------------------
Dropping macrons:
----------------------------------------------------------------------------------------------------
ACTA ARCHELAI  Thesaurus verus sive disputatio habita in Carcharis eiTitate Mesopotamiae
Archelai episeopi adversus Manen  Judicantibus Manippo et
Aegialeo et Clandio et Cleobolo  In qua urbe erat quidani vir Marcellus 
 nomine qui vita et studiis et genere  prudentia quoque et honestate
valde clanis habebatnr  faeultatibus etiam copiosus et quoii
omnium maximiim est  religiosiasime deum timens  et his quae de
Christo dicebantur semper cum timore auscultans  nec quicquam omnino
honi erat quod illi viro deesset  unde et honore plurimo ab nniversa 
 eivitate colebatnr plnrimisque ipse civitatem suam freqiienter largitionibus
remunerabatur  paiiperibus tribuens  adfticts relevans  tribulatis
auxilium ferens  Sed ne infirmitate verborum virtu
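
The sample text carries no macrons, so the output is unchanged. As a minimal standalone sketch of the operation (not CLTK's code; `strip_macrons_sketch` is a hypothetical name), the combining macron can be removed after Unicode decomposition:

```python
import unicodedata

MACRON = "\u0304"  # combining macron


def strip_macrons_sketch(text: str) -> str:
    """Remove macrons from vowels, leaving other diacritics in place."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(MACRON, ""))


print(strip_macrons_sketch("am\u014d r\u014dmam"))
```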

## Spelling and Capitalization

### *Spell Checker*

Correct any spelling errors and wrong case endings.

### *Truecase*

Correct capitalization mistakes using a Truecase dictionary, which is a frequency counter of all distinct capitalizations of the same word in a given text. Usually, the most frequent capitalization is taken as the default and applied to any word whose capitalization needs correcting.
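
The frequency-counter idea can be sketched in a few lines. This is a minimal sketch under assumptions, not a specific library's Truecase implementation: `build_truecase_counts` and `truecase_token` are hypothetical helpers, tokens are assumed to be whitespace-separated, and ties between equally frequent forms are not treated specially.

```python
from collections import Counter, defaultdict


def build_truecase_counts(tokens):
    """Count every capitalization variant seen for each lowercased word."""
    counts = defaultdict(Counter)
    for token in tokens:
        counts[token.lower()][token] += 1
    return counts


def truecase_token(token, counts):
    """Return the most frequent capitalization, or the token if unseen."""
    forms = counts.get(token.lower())
    if not forms:
        return token
    return forms.most_common(1)[0][0]


corpus = "Marcellus vir erat et marcellus et Marcellus deum timens Deum deum".split()
counts = build_truecase_counts(corpus)
print([truecase_token(word, counts) for word in ["MARCELLUS", "Deum", "urbe"]])
```

Words that never occur in the reference text are returned unchanged, since the counter has no evidence for any capitalization.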

## Term Extraction

Detect and extract potentially important terms from text.