# Preprocessing Pipeline for Latin

Under each level-2 or lower heading, choose one or more cells to run. Each cell is a single step that can be added to the pipeline. To skip a step, convert its cell to `raw` type with the `esc` + `r` keys. By default, all steps are skipped.
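
Because every step takes a string and returns a string, the enabled steps can be thought of as a list of functions applied in order. The sketch below only illustrates that idea: `run_pipeline` and the two toy steps are hypothetical names, not part of this notebook.

```python
from typing import Callable, Iterable

# A step is any function that maps text to text.
Step = Callable[[str], str]


def run_pipeline(text: str, steps: Iterable[Step]) -> str:
    """Apply each enabled step to the text, in order."""
    for step in steps:
        text = step(text)
    return text


# Toy steps for illustration only (hypothetical, not from the notebook).
def collapse_spaces(text: str) -> str:
    return " ".join(text.split())


def lowercase(text: str) -> str:
    return text.lower()


# Skipping a step corresponds to leaving it out of the list.
result = run_pipeline("Gallia  est omnis   divisa", [collapse_spaces, lowercase])
print(result)
```

Leaving a function out of the list plays the same role as converting its cell to `raw`.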

## Initial Setup

Frequently used Python modules.

In [7]:
import re, os, requests, spacy, textblob
import cltk.alphabet.lat  # explicitly load the submodule used in the steps below
from pprint import pprint
from termcolor import colored, cprint

Configurations.

In [None]:
# print coloring options
line_color = "blue"
text_color = "magenta"
reference_color = "white"

## Import Data

Import Latin text.

### *From TXT File*

In [3]:
def load_txt(filename: str) -> str:
    """
    Extract text from a .txt file.

    Parameters:
        filename (str): Path of file to load.

    Returns:
        str: Text loaded from file.
    """
    if not os.path.exists(filename):
        raise FileNotFoundError(f"The path {filename} does not exist.")

    # the context manager closes the file; UTF-8 is assumed for the Latin sources
    with open(file=filename, mode="r", encoding="utf-8") as file:
        return file.read()

In [6]:
# Example
filename = "/mnt/c/share/AI-in-Classics/src/sample_text/latin/urn_cts_greekLit_stoa0146d.stoa001.opp-lat11.txt"
text = load_txt(filename=filename)
cprint(text="-" * 100, color=line_color)
cprint(text="Loading from TXT file:", color=text_color)
cprint(text=filename, color=reference_color)
cprint(text="-" * 100, color=line_color)
cprint(text=text, color=text_color)

### *From URI*

In [None]:
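def load_uri(uri: str) -> str:
    """
    Load text from URI.

    Parameters:
        uri (str): URI link to text online.

    Returns:
        str: Text loaded from URI.
    """
    req = requests.get(url=uri, timeout=30)
    req.raise_for_status()  # fail loudly on HTTP errors instead of returning an error page
    return req.text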

In [None]:
# Example
uri = "https://scaife.perseus.org/library/passage/urn:cts:greekLit:stoa0146d.stoa001.opp-lat1:1/text/"
text = load_uri(uri=uri)
cprint(text="-" * 100, color=line_color)
cprint(text="Loading from URI:", color=text_color)
cprint(text=uri, color=reference_color)
cprint(text="-" * 100, color=line_color)
cprint(text=text, color=text_color)

## Paragraph Edit

Alter the text at the paragraph level.

### *Delete Footnotes*

Delete all footnotes that were extracted with the main text.

In [None]:
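def remove_footnotes(text: str) -> str:
    """
    Remove footnotes, where each footnote starts a line with an integer
    and runs until the next blank line.

    Parameters:
        text (str): Text to remove footnotes from.

    Returns:
        str: Text after removing footnotes.
    """
    # anchor at line start so digits inside the main text are left alone
    pattern = r"^[0-9]+.*(?:\n.+)*\n?"
    return re.sub(pattern=pattern, repl="", string=text, flags=re.MULTILINE)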


text = remove_footnotes(text=text)
cprint(text="-" * 100, color=line_color)
cprint(text="Removing footnotes:", color=text_color)
cprint(text="-" * 100, color=line_color)
cprint(text=text, color=text_color)
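
To see what a footnote pattern of this kind removes, it helps to run it on a tiny sample. The sketch below assumes that a footnote begins at the start of a line with an integer and continues until the next blank line; the sample text and the `FOOTNOTE` name are made up for illustration.

```python
import re

# Assumption: a footnote begins at the start of a line with an integer and
# continues until the next blank line.
FOOTNOTE = re.compile(r"^[0-9]+.*(?:\n.+)*\n?", flags=re.MULTILINE)

sample = (
    "Arma virumque cano\n"
    "\n"
    "1 placeholder footnote text\n"
    "continued note text\n"
    "\n"
    "Troiae qui primus ab oris\n"
)

# Both footnote lines disappear; the main text lines are untouched.
cleaned = FOOTNOTE.sub("", sample)
print(cleaned)

# A number in the middle of a line is not treated as a footnote.
print(FOOTNOTE.sub("", "anno 44 ante\n"))
```

Texts whose section numbers also start a line would need a stricter pattern, since those lines would be matched as well.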

## Punctuation

Remove, replace, or alter punctuation marks in the text.

### *Swallow All Brackets*

Delete both angle `<>` and square `[]` brackets, including the text within them.

In [None]:
text = cltk.alphabet.lat.swallow_angle_brackets(text=text)
text = cltk.alphabet.lat.swallow_square_brackets(text=text)
cprint(text="-" * 100, color=line_color)
cprint(text="Swallowing all brackets and their text:", color=text_color)
cprint(text="-" * 100, color=line_color)
cprint(text=text, color=text_color)
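
For readers without `cltk` installed, the effect of this step can be approximated with plain regular expressions. This is a rough stand-in under that assumption, not the library's implementation; `swallow_brackets_sketch` is a hypothetical name.

```python
import re


def swallow_brackets_sketch(text: str) -> str:
    """Delete <...> and [...] spans together with their contents."""
    text = re.sub(r"<[^<>]*>", "", text)
    text = re.sub(r"\[[^\[\]]*\]", "", text)
    # Collapse the double spaces left behind by a removed span.
    return re.sub(r" {2,}", " ", text).strip()


print(swallow_brackets_sketch("arma <virum> cano [sic] Troiae"))
```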

### *Swallow Editorial*

Delete common editorial marks.

In [None]:
text = cltk.alphabet.lat.swallow_editorial(text=text)
cprint(text="-" * 100, color=line_color)
cprint(text="Deleting common editorial marks:", color=text_color)
cprint(text="-" * 100, color=line_color)
cprint(text=text, color=text_color)

### *Dehyphenate*

Remove hyphens, which is especially useful for text that was hyphenated where lines wrap.

In [None]:
text = cltk.alphabet.lat.dehyphenate(text=text)
cprint(text="-" * 100, color=line_color)
cprint(text="Removing hyphens from text:", color=text_color)
cprint(text="-" * 100, color=line_color)
cprint(text=text, color=text_color)
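
The line-wrap case can be illustrated without `cltk`. The sketch below is a narrower stand-in that only rejoins words split by a hyphen at a line break; `join_wrapped_words` is a hypothetical helper and may behave differently from the library function.

```python
import re


def join_wrapped_words(text: str) -> str:
    """Rejoin words that were split with a hyphen at a line break."""
    # letter, hyphen, optional spaces, newline, optional spaces, letter
    return re.sub(r"(?<=[A-Za-z])-[ \t]*\n[ \t]*(?=[A-Za-z])", "", text)


wrapped = "Gal-\nlia est omnis divisa"
print(join_wrapped_words(wrapped))
```

Hyphens that do not sit at a line break are left alone by this sketch.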

### *Drop Latin Punctuation*

Drop all Latin punctuation except the hyphen and obelization markers, replacing each mark with a space. If hyphens (-) and obeli (†) are to be removed, do so before this step.

In [None]:
text = cltk.alphabet.lat.drop_latin_punctuation(text=text)
cprint(text="-" * 100, color=line_color)
cprint(text="Dropping Latin punctuation:", color=text_color)
cprint(text="-" * 100, color=line_color)
cprint(text=text, color=text_color)
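
The small sketch below shows why the order of steps matters: hyphens and obeli survive a punctuation drop. It assumes an illustrative punctuation set, which is not the list `cltk` uses; `drop_punctuation_sketch` is a hypothetical name.

```python
import re

# Assumption: an illustrative punctuation set, not the list cltk uses.
PUNCTUATION = r"[.,;:!?\"'()]"


def drop_punctuation_sketch(text: str) -> str:
    """Replace each punctuation mark with a space; keep '-' and '†'."""
    return re.sub(PUNCTUATION, " ", text)


out = drop_punctuation_sketch("Veni, vidi; †vici† re-tundo.")
print(repr(out))
```

Each mark becomes a space, so runs of spaces may need collapsing afterwards.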

### *Ligature Replacement*

Replace the ligatures œ and æ with oe and ae, and their capitals Œ and Æ with OE and AE.

In [None]:
ligature_replacer = cltk.alphabet.lat.LigatureReplacer()
text = ligature_replacer.replace(text=text)
cprint(text="-" * 100, color=line_color)
cprint(text="Replacing ligatures (œ, æ) in text:", color=text_color)
cprint(text="-" * 100, color=line_color)
cprint(text=text, color=text_color)
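
The replacement itself is a fixed character mapping, which the following sketch spells out with the standard library only. It is a stand-in for illustration; `LIGATURES` and `replace_ligatures_sketch` are hypothetical names.

```python
# Mapping from each ligature to its two-letter spelling.
LIGATURES = {"æ": "ae", "œ": "oe", "Æ": "AE", "Œ": "OE"}


def replace_ligatures_sketch(text: str) -> str:
    """Expand Latin ligatures into plain letter pairs."""
    return text.translate(str.maketrans(LIGATURES))


print(replace_ligatures_sketch("Cæsar et pœna"))
```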

### *Drop Accents*

Remove accents. Note that ligature (AE) replacement and macron removal are separate steps and should be run on their own if desired.

In [None]:
text = cltk.alphabet.lat.remove_accents(text=text)
cprint(text="-" * 100, color=line_color)
cprint(text="Dropping accents:", color=text_color)
cprint(text="-" * 100, color=line_color)
cprint(text=text, color=text_color)

### *Drop Macrons*

Remove macrons, the marks above vowels that indicate long pronunciation.

In [None]:
text = cltk.alphabet.lat.remove_macrons(text=text)
cprint(text="-" * 100, color=line_color)
cprint(text="Dropping macrons:", color=text_color)
cprint(text="-" * 100, color=line_color)
cprint(text=text, color=text_color)
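
One way to picture macron removal is through Unicode decomposition: a long vowel such as ō splits into a base letter plus a combining macron (U+0304), which can then be dropped. The sketch below uses only `unicodedata` and is an illustration, not the `cltk` implementation; `strip_macrons_sketch` is a hypothetical name.

```python
import unicodedata


def strip_macrons_sketch(text: str) -> str:
    """Remove combining macrons (U+0304) while keeping other diacritics."""
    decomposed = unicodedata.normalize("NFD", text)
    without = decomposed.replace("\u0304", "")
    return unicodedata.normalize("NFC", without)


print(strip_macrons_sketch("R\u014dma aeterna"))
```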

## Spelling and Capitalization

### *Spell Checker*

Correct spelling mistakes and wrong case endings.

### *Truecase*

Correct capitalization mistakes using a Truecase dictionary, a frequency counter of all the distinct capitalizations of each word in a given text. The most frequent capitalization of a word is usually taken as its default and applied wherever that word needs correcting.
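
The frequency-counter idea can be sketched with `collections.Counter`. This is a minimal illustration under simplifying assumptions (whitespace tokenization, no punctuation handling); `build_case_counter` and `truecase_sketch` are hypothetical names.

```python
from collections import Counter


def build_case_counter(text: str) -> Counter:
    """Count every distinct capitalization of every token."""
    return Counter(text.split())


def truecase_sketch(word: str, case_counter: Counter) -> str:
    """Return the most frequent capitalization of `word` seen in the counter."""
    variants = {
        form: count
        for form, count in case_counter.items()
        if form.lower() == word.lower()
    }
    if not variants:
        return word  # unseen word: leave unchanged
    return max(variants, key=variants.get)


counts = build_case_counter("Caesar venit Caesar vidit caesar Caesar vicit")
print(truecase_sketch("CAESAR", counts))
```

Words that never occur in the counted text are returned as they are.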

## Translation

Translate to English using Google Translate.

In [None]:
from googletrans import Translator

translator = Translator()
# dest defaults to English; the source language is auto-detected
translation = translator.translate(text=text)

# print each field of the result under its own banner
fields = {
    "text": translation.text,
    "origin": translation.origin,
    "source": translation.src,
    "destination": translation.dest,
    "pronunciation": translation.pronunciation,
}
for label, value in fields.items():
    cprint(text="-" * 100, color=line_color)
    cprint(text=label, color=text_color)
    cprint(text="-" * 100, color=line_color)
    cprint(text=value, color=text_color)

## SSH

SSH directly into a remote server, e.g. HiperGator.

In [None]:
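from dotenv import load_dotenv
from paramiko import SSHClient

load_dotenv()

# credentials are read from a .env file
username = os.getenv(key="username")
password = os.getenv(key="password")

client = SSHClient()
client.load_system_host_keys()
# connect() takes the host and the credentials as separate arguments
client.connect(hostname="hpg.rc.ufl.edu", username=username, password=password)
stdin, stdout, stderr = client.exec_command("ls -l")
print(stdout.read().decode())
client.close()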