Caesar's De Bello Gallico: Bigram Analysis

A Python program written with NLTK to find and sort bigrams in Caesar's famous De Bello Gallico (The Gallic War).
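
As a rough illustration of what "finding and sorting bigrams" means, here is a minimal sketch. It uses only the standard library as a stand-in for NLTK (the project itself uses NLTK, whose `nltk.bigrams` and `nltk.FreqDist` serve a similar purpose). The function name `count_bigrams` and the sample sentence are assumptions made for this example and are not taken from the repository.

```python
import re
from collections import Counter


def count_bigrams(text):
    """Return a Counter of adjacent word pairs (bigrams) in text."""
    # Lowercase and keep only alphabetic runs: a crude stand-in for a real tokenizer.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Pair each token with its successor to form the bigrams.
    return Counter(zip(tokens, tokens[1:]))


# Illustrative input: the opening words of the text, not the cleaned corpus file.
sample = "Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae"
bigrams = count_bigrams(sample)

# Sort by descending count, then alphabetically so ties come out in a stable order.
ranked = sorted(bigrams.items(), key=lambda item: (-item[1], item[0]))
for pair, count in ranked[:3]:
    print(pair, count)
```

On the full text, the same sort would put the most frequent word pairs at the top of the list.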

PROCESS:

I wrote and ran regex_cleanup.py to clean up the data, using regular expressions to remove the section numbers and footnote numbers with their brackets.
I then combined all of the cleaned text into one file, de_bello_gallico_clean_combined.txt.
Next, I tokenized both the English and (cleaned) Latin texts. Forming frequency distributions for the Latin text presented an added challenge because Latin is a highly inflected language: nouns take inflectional endings that mark their case (their declensions), so a single word appears in several surface forms that would otherwise be counted separately. I worked around this by using more regular expressions to approximate stripping the inflectional endings.
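
The cleanup and ending-stripping steps above might look something like the sketch below. It is not the code in regex_cleanup.py or latin_tokenize.py. The bracketed-number pattern (`[1]`), the short list of endings, the three-letter minimum stem, and the names `clean` and `rough_stem` are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Assumed markup: section and footnote numbers appear as digits in square brackets.
BRACKETED_NUMBER = re.compile(r"\[\d+\]")
# A small, illustrative set of noun endings; the `$` anchor restricts the match
# to the end of the word. Real Latin morphology needs far more than this.
ENDINGS = re.compile(r"(ibus|orum|arum|ae|am|as|is|os|um|us|a|e|i|o)$")


def clean(text):
    """Remove bracketed numbers such as [1] and collapse leftover whitespace."""
    without_numbers = BRACKETED_NUMBER.sub("", text)
    return re.sub(r"\s+", " ", without_numbers).strip()


def rough_stem(word):
    """Strip one assumed ending, but keep the word if the stem would be too short."""
    stem = ENDINGS.sub("", word)
    return stem if len(stem) >= 3 else word


# Cleanup on a line with assumed bracketed numbering.
raw = "[1] Gallia est omnis divisa in partes tres[2], quarum unam incolunt Belgae"
cleaned = clean(raw)

# Different case forms of the same noun collapse onto one approximate stem.
forms = ["gallia", "galliae", "galliam", "belgae", "belgas", "belgis"]
stem_counts = Counter(rough_stem(form) for form in forms)
print(cleaned)
print(stem_counts.most_common())
```

This approach is only an approximation: it will also trim words that merely happen to end in one of these letters, which is the trade-off of using regular expressions instead of a morphological analyzer.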

OVERALL ORDER FOR RUNNING ON THE COMMAND LINE:

