A Python program written with NLTK to find and sort bigrams in Caesar's famous De Bello Gallico (The Gallic War)

DillonPlummer/De-Bello-Gallico-Bigram-Analysis

Caesar's De Bello Gallico: Bigram Analysis

A Python program written with NLTK to find and sort bigrams in Caesar's famous De Bello Gallico (The Gallic War).

PROCESS:

  1. I wrote and ran regex_cleanup.py to clean up the data, using regular expressions to remove the section numbers and footnote numbers with their brackets.

  2. I then combined all of the cleaned text into one file, de_bello_gallico_clean_combined.txt.

  3. Next, I tokenized both the English and the (cleaned) Latin texts. Forming frequency distributions was harder for the Latin text because Latin is a highly inflected language: nouns take inflectional endings that mark their case and number (declensions), so the same word appears in many surface forms. I worked around this by using more regular expressions to approximate stripping the inflectional endings.
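
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the bracket pattern, the list of endings, the helper names and the two-sentence sample are all assumptions made for the example. It uses only the standard library so it runs without NLTK; in an NLTK version, `nltk.bigrams` and `nltk.FreqDist` would play the roles of `zip` and `Counter`.

```python
import re
from collections import Counter

# Hypothetical pattern: the real section/footnote markers may look different.
BRACKETED_NUMBER = re.compile(r"\[\d+\]")

# Assumed, deliberately incomplete list of noun endings.
ENDING = re.compile(r"(ibus|orum|arum|um|us|is|os|as|ae|am|em|es|i|o|a|e)$")


def clean(text):
    """Remove bracketed numbers such as [12] and collapse whitespace."""
    return " ".join(BRACKETED_NUMBER.sub(" ", text).split())


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters only."""
    return re.findall(r"[a-z]+", text.lower())


def rough_stem(token):
    """Strip one ending, but only if at least three letters remain."""
    stem = ENDING.sub("", token)
    return stem if len(stem) >= 3 else token


def bigram_counts(tokens):
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))


# Toy sample for illustration only, not a quotation from the text.
sample = "Caesar Gallos vicit. [12] Caesarem Galli timent."
stems = [rough_stem(t) for t in tokenize(clean(sample))]
for pair, count in bigram_counts(stems).most_common(3):
    print(count, pair)
```

Stripping endings this way is only an approximation: it merges `Gallos` and `Galli` into one stem, but it can also cut letters that belong to a word's stem, and it leaves verb forms untouched.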

OVERALL ORDER FOR RUNNING ON THE COMMAND LINE:

  1. regex_cleanup.py files
  2. latin_tokenize.py (this prints the most common bigrams)
  3. latin_tokenize.py | grep -i caesar*
  4. latin_tokenize.py | grep -i propter*
  5. latin_tokenize.py | grep Galli*
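
The grep steps filter the printed bigrams down to those mentioning a given word. The same filtering could be done inside Python, as in the sketch below. The function name `bigrams_with_prefix` and the sample counts are hypothetical and exist only for this example. It matches on a word prefix, which is probably what the patterns above are meant to express; in grep's own syntax, `caesar*` means "caesa" followed by zero or more "r" characters.

```python
from collections import Counter


def bigrams_with_prefix(counts, prefix, ignore_case=True):
    """Return (bigram, count) pairs in which either word starts with
    the prefix, most frequent first."""
    if ignore_case:
        prefix = prefix.lower()

    def hit(word):
        return (word.lower() if ignore_case else word).startswith(prefix)

    matches = [(bg, n) for bg, n in counts.items() if any(hit(w) for w in bg)]
    return sorted(matches, key=lambda item: item[1], reverse=True)


# Made-up counts for illustration only.
sample_counts = Counter({
    ("Caesar", "Gallos"): 3,
    ("propter", "frigora"): 1,
    ("Galli", "Caesarem"): 2,
})

for bigram, count in bigrams_with_prefix(sample_counts, "caesar"):
    print(count, bigram)
```

Passing `ignore_case=False` mirrors the last command, which omits `-i` and so only matches the capitalized form.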
