# Corpus Statistics - Lab

## Introduction

In this lab, we'll learn how to use various NLP techniques to generate descriptive statistics to explore a text corpus!

## Objectives

You will be able to:

- Generate common corpus statistics using NLTK 
- Use a count vectorization strategy to create a bag of words 
- Compare two different text corpora using corpus statistics generated by NLTK 


## Getting Started

In this lab, we'll load two different text corpora from NLTK's library of various texts, and then explore and compare each corpus using some basic statistical measures and techniques common in NLP. Let's get started!

In the cell below:

* Import `nltk`
* Download `gutenberg` and `stopwords` from `nltk`
* Import `gutenberg` and `stopwords` from `nltk.corpus`
* Import everything (`*`) from `nltk.collocations`
* Import `FreqDist` and `word_tokenize` from `nltk`
* Import the `string` and `re` libraries 

In [21]:
import nltk
nltk.download('gutenberg')
nltk.download('stopwords')
from nltk.corpus import gutenberg,stopwords
from nltk.collocations import *
from nltk import word_tokenize,FreqDist
import re
import string

[nltk_data] Downloading package gutenberg to C:\Users\FLEX
[nltk_data]     5\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\FLEX
[nltk_data]     5\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Now, let's take a look at the corpora available to us. There are many, many corpora available inside NLTK's `corpus` module. For this lab, we'll make use of the texts contained in `corpus.gutenberg` -- 18 different (complete) texts that can be found on the [Project Gutenberg](https://www.gutenberg.org/) website.

To see the file ids for each of the corpora inside of `gutenberg`, we can call the `.fileids()` method. Do this now in the cell below.

In [22]:
file_ids = gutenberg.fileids()
file_ids

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

Great! For the first part of this lab, we'll be working with Shakespeare's *Macbeth*, a tragedy about a pair of ambitious social climbers. 

To load the actual corpus, we need to pass in the file id for macbeth into `gutenberg.raw()`. 

Do this now in the cell below.  Then, print the first 1000 characters of the text to ensure it loaded correctly, and get a feel for what our text data looks like.

In [4]:
macbeth_text = gutenberg.raw('shakespeare-macbeth.txt')
macbeth_text[:1000]

"[The Tragedie of Macbeth by William Shakespeare 1603]\n\n\nActus Primus. Scoena Prima.\n\nThunder and Lightning. Enter three Witches.\n\n  1. When shall we three meet againe?\nIn Thunder, Lightning, or in Raine?\n  2. When the Hurley-burley's done,\nWhen the Battaile's lost, and wonne\n\n   3. That will be ere the set of Sunne\n\n   1. Where the place?\n  2. Vpon the Heath\n\n   3. There to meet with Macbeth\n\n   1. I come, Gray-Malkin\n\n   All. Padock calls anon: faire is foule, and foule is faire,\nHouer through the fogge and filthie ayre.\n\nExeunt.\n\n\nScena Secunda.\n\nAlarum within. Enter King Malcome, Donalbaine, Lenox, with\nattendants,\nmeeting a bleeding Captaine.\n\n  King. What bloody man is that? he can report,\nAs seemeth by his plight, of the Reuolt\nThe newest state\n\n   Mal. This is the Serieant,\nWho like a good and hardie Souldier fought\n'Gainst my Captiuitie: Haile braue friend;\nSay to the King, the knowledge of the Broyle,\nAs thou didst leaue it\n\n   Cap. 

**_Question:_**  Look at the text snippet above. What do you notice about it? Are there any issues you see that we'll need to deal with during the preprocessing steps?

Write your answer below this line:
_______________________________________________________________________________

Yes, there are. Some of the words are hyphenated. If we just use basic tokenization, then it will split hyphenated words into individual tokens. There are also numbers that act as metadata about which witch is speaking -- we'll need to remove these. 

### Preprocessing the Data

Looking at the text output above shows us a few things that we'll need to deal with during the preprocessing and tokenization steps -- specifically:

* Capitalization -- we'll need to lowercase all words. 
* Apostrophes -- we'll need to write some basic regex in order to capture words that contain apostrophes as a single token. In the interest of time, a pattern has been provided for you. Use the following pattern:  `"([a-zA-Z]+(?:'[a-z]+)?)"`
* Numbers -- We'll want to remove these, as they generally appear as stage direction to tell us which witch is speaking. 

In the cell below:

* Store the pattern shown above in the appropriate variable  
* Use `nltk.regexp_tokenize()` and pass in our text and the `pattern` 

In [5]:
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
macbeth_tokens_raw = nltk.regexp_tokenize(macbeth_text,pattern)
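
To see what this pattern captures, here is a minimal sketch that applies it to one short line. The `sample` string is made up for illustration, and using the standard library's `re.findall` in place of `nltk.regexp_tokenize()` is an assumption made only to keep the example self-contained.

```python
import re

# Same pattern as above: a run of letters, optionally followed by an
# apostrophe and more lowercase letters (e.g. "burley's").
pattern = r"([a-zA-Z]+(?:'[a-z]+)?)"

# Illustrative sample in the style of the opening scene, not the full corpus.
sample = "2. When the Hurley-burley's done, Vpon the Heath"

tokens = re.findall(pattern, sample)
print(tokens)
```

Digits and punctuation never match, so the speaker number drops out, while the hyphenated word is split into two tokens and the apostrophe stays attached to `burley's`.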

Great! Now that we have our tokens, we need to lowercase them. In the cell below, use a list comprehension and the `.lower()` method on every word token in `macbeth_tokens_raw`. Store this inside `macbeth_tokens`.

In [6]:
macbeth_tokens = [word.lower() for word in macbeth_tokens_raw]
macbeth_tokens

['the',
 'tragedie',
 'of',
 'macbeth',
 'by',
 'william',
 'shakespeare',
 'actus',
 'primus',
 'scoena',
 'prima',
 'thunder',
 'and',
 'lightning',
 'enter',
 'three',
 'witches',
 'when',
 'shall',
 'we',
 'three',
 'meet',
 'againe',
 'in',
 'thunder',
 'lightning',
 'or',
 'in',
 'raine',
 'when',
 'the',
 'hurley',
 "burley's",
 'done',
 'when',
 'the',
 "battaile's",
 'lost',
 'and',
 'wonne',
 'that',
 'will',
 'be',
 'ere',
 'the',
 'set',
 'of',
 'sunne',
 'where',
 'the',
 'place',
 'vpon',
 'the',
 'heath',
 'there',
 'to',
 'meet',
 'with',
 'macbeth',
 'i',
 'come',
 'gray',
 'malkin',
 'all',
 'padock',
 'calls',
 'anon',
 'faire',
 'is',
 'foule',
 'and',
 'foule',
 'is',
 'faire',
 'houer',
 'through',
 'the',
 'fogge',
 'and',
 'filthie',
 'ayre',
 'exeunt',
 'scena',
 'secunda',
 'alarum',
 'within',
 'enter',
 'king',
 'malcome',
 'donalbaine',
 'lenox',
 'with',
 'attendants',
 'meeting',
 'a',
 'bleeding',
 'captaine',
 'king',
 'what',
 'bloody',
 'man',
 'is',


## Frequency Distributions

Now that we've done some basic cleaning and tokenization, let's go ahead and create a **_Frequency Distribution_** to see the number of times each word is used in this play. This frequency distribution is an example of a **_Bag of Words_**, which you've worked with in previous labs. 

In the cell below:

* Use `FreqDist()` and pass in `macbeth_tokens` as the input 
* Display the frequency distribution to see what it looks like  

In [7]:
macbeth_freqdist = FreqDist(macbeth_tokens)
macbeth_freqdist.most_common(50)

[('the', 649),
 ('and', 545),
 ('to', 383),
 ('of', 338),
 ('i', 331),
 ('a', 241),
 ('that', 227),
 ('my', 203),
 ('you', 203),
 ('in', 199),
 ('is', 180),
 ('not', 165),
 ('it', 161),
 ('with', 153),
 ('his', 146),
 ('be', 137),
 ('macb', 137),
 ('your', 126),
 ('our', 123),
 ('haue', 122),
 ('but', 120),
 ('me', 113),
 ('he', 110),
 ('for', 109),
 ('what', 106),
 ('this', 104),
 ('all', 99),
 ('so', 96),
 ('him', 90),
 ('as', 89),
 ('thou', 87),
 ('we', 83),
 ('enter', 81),
 ('which', 80),
 ('are', 73),
 ('will', 72),
 ('they', 70),
 ('shall', 68),
 ('no', 67),
 ('then', 63),
 ('macbeth', 62),
 ('their', 62),
 ('thee', 61),
 ('vpon', 58),
 ('on', 58),
 ('macd', 58),
 ('from', 57),
 ('yet', 57),
 ('thy', 56),
 ('vs', 55)]

Well, that doesn't tell us very much! The top 10 most used words in *Macbeth* are all **_Stop Words_**. They don't contain any interesting information, and essentially just act as the "connective tissue" between the words that really matter in any text. Let's try removing the stopwords and punctuation, and then creating another frequency distribution that contains only the important words. 
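
For intuition about what a frequency distribution holds, the sketch below builds the same kind of word-to-count mapping with `collections.Counter`, which NLTK's `FreqDist` subclasses. The token list is a toy example borrowed from the witches' lines, not the real corpus.

```python
from collections import Counter

# Toy token list for illustration only.
tokens = ["faire", "is", "foule", "and", "foule", "is", "faire"]

# A bag of words keeps the counts and discards the word order.
bag_of_words = Counter(tokens)

print(bag_of_words["foule"])
print(bag_of_words["and"])
```

Because order is discarded, "faire is foule" and "foule is faire" contribute identical counts.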

## Removing Stop Words and Punctuation

We've already imported the `stopwords` corpus. We can access all of the stopwords using the `stopwords.words()` method -- however, we don't want to use the whole thing, as this contains the stopwords for every language supported by NLTK. We don't need to check for and remove any Finnish or Turkish stop words, as this text is in English. To avoid unnecessarily long runtimes, we'll just use the English subset of stopwords by passing the argument `"english"` into `stopwords.words()`.

In the cell below:

* Get all the `'english'` stopwords from `stopwords.words()` and store them in the appropriate variable below. They will be stored as a list, by default  
* We'll also want to remove all punctuation. Create a list version of `string.punctuation` and add it to our stopwords list. (The regex tokenizer above already keeps only alphabetic tokens, so this step and the next act as a safeguard here)  
* Finally, we'll also remove numbers. Create a list that contains numbers 0-9 (as strings!), and add this to the stopwords list as well  
* Use another list comprehension to get words out of `macbeth_tokens` as long as they are not in `stopwords_list` 
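
The filtering step can be sketched on a toy example. The miniature `stop_set` below is hypothetical and stands in for the list returned by `stopwords.words('english')`; storing it as a `set` rather than a list is a design choice made here because membership tests on a set do not scan every entry.

```python
# Hypothetical miniature stop list, standing in for the English stopwords.
stop_set = {"the", "and", "of", "in", "is"}

# Toy tokens, already lowercased.
tokens = ["the", "tragedie", "of", "macbeth", "thunder", "and", "lightning"]

# Keep only the tokens that are not stop words.
kept = [word for word in tokens if word not in stop_set]
print(kept)
```

On a corpus of this size either container works; the set mostly pays off on larger texts or longer stop lists.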

In [8]:
stopwords_list = stopwords.words('english')
stopwords_list += list(string.punctuation)
stopwords_list += ['0','1','2','3','4','5','6','7','8','9']

macbeth_words_stopped = [word  for word in macbeth_tokens if word not in stopwords_list]
macbeth_words_stopped

['tragedie',
 'macbeth',
 'william',
 'shakespeare',
 'actus',
 'primus',
 'scoena',
 'prima',
 'thunder',
 'lightning',
 'enter',
 'three',
 'witches',
 'shall',
 'three',
 'meet',
 'againe',
 'thunder',
 'lightning',
 'raine',
 'hurley',
 "burley's",
 'done',
 "battaile's",
 'lost',
 'wonne',
 'ere',
 'set',
 'sunne',
 'place',
 'vpon',
 'heath',
 'meet',
 'macbeth',
 'come',
 'gray',
 'malkin',
 'padock',
 'calls',
 'anon',
 'faire',
 'foule',
 'foule',
 'faire',
 'houer',
 'fogge',
 'filthie',
 'ayre',
 'exeunt',
 'scena',
 'secunda',
 'alarum',
 'within',
 'enter',
 'king',
 'malcome',
 'donalbaine',
 'lenox',
 'attendants',
 'meeting',
 'bleeding',
 'captaine',
 'king',
 'bloody',
 'man',
 'report',
 'seemeth',
 'plight',
 'reuolt',
 'newest',
 'state',
 'mal',
 'serieant',
 'like',
 'good',
 'hardie',
 'souldier',
 'fought',
 'gainst',
 'captiuitie',
 'haile',
 'braue',
 'friend',
 'say',
 'king',
 'knowledge',
 'broyle',
 'thou',
 'didst',
 'leaue',
 'cap',
 'doubtfull',
 'stoo

Great! Now, let's create another frequency distribution using `macbeth_words_stopped`, and then inspect the top 50 most common words, to see if removing stopwords and punctuation has helped. 

Do this now in the cell below.

In [9]:
macbeth_stopped_freqdist = FreqDist(macbeth_words_stopped)
macbeth_stopped_freqdist.most_common(50)

[('macb', 137),
 ('haue', 122),
 ('thou', 87),
 ('enter', 81),
 ('shall', 68),
 ('macbeth', 62),
 ('thee', 61),
 ('vpon', 58),
 ('macd', 58),
 ('yet', 57),
 ('thy', 56),
 ('vs', 55),
 ('come', 54),
 ('king', 54),
 ('hath', 52),
 ('good', 49),
 ('rosse', 49),
 ('lady', 48),
 ('would', 47),
 ('time', 46),
 ('like', 43),
 ('say', 39),
 ('doe', 38),
 ('lord', 38),
 ('make', 38),
 ('tis', 37),
 ('must', 36),
 ('done', 35),
 ('selfe', 35),
 ('ile', 35),
 ('feare', 35),
 ('let', 35),
 ('man', 34),
 ('wife', 34),
 ('night', 34),
 ('banquo', 34),
 ('well', 33),
 ('know', 33),
 ('one', 32),
 ('great', 31),
 ('see', 31),
 ('may', 31),
 ('exeunt', 30),
 ('speake', 29),
 ('sir', 29),
 ('lenox', 28),
 ('mine', 26),
 ('vp', 26),
 ('th', 26),
 ('mal', 25)]

This is definitely an improvement! You may be wondering why `'macb'` shows up as the most used token. If you inspect [Macbeth](http://www.gutenberg.org/cache/epub/1795/pg1795-images.html) on Project Gutenberg and search for `Macb`, you'll soon discover that the source text uses `Macb` as the speaker label for any line spoken by Macbeth's character. The token is therefore markup rather than dialogue, and under normal circumstances we would need to ask ourselves whether to remove it or keep it. In the interest of time for this lab, we'll leave it be. 
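
If you did decide to drop such labels, one possible approach is sketched below. The counts are copied from the output above, but the `speaker_labels` set is an assumption: deciding which abbreviations are labels (here `macb`, `macd`, and `mal`) would require checking the source text.

```python
from collections import Counter

# A few counts copied from the frequency distribution above.
counts = Counter({"macb": 137, "haue": 122, "thou": 87, "macd": 58, "mal": 25})

# Assumed set of abbreviated speaker labels; verify against the text before use.
speaker_labels = {"macb", "macd", "mal"}

# Rebuild the distribution without the label tokens.
without_labels = Counter(
    {word: n for word, n in counts.items() if word not in speaker_labels}
)
print(without_labels.most_common())
```

Filtering the counts after the fact leaves the token list untouched, so bigrams would still contain the labels; removing them from the token list instead would affect every later statistic.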

## Answering Questions about our Corpus

Now that we have a frequency distribution, we can easily answer some basic questions about *Macbeth* before we move on to creating bigrams. 

### Vocabulary Size

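How many word tokens remain in *Macbeth* once all stopwords have been removed? This total counts repeated words each time they occur, and it is reused below as the denominator for the normalized frequencies.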

Compute this in the cell below. 

In [10]:
size = len(macbeth_words_stopped)
size

10115
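
Note that `len()` on a token list counts every occurrence. The number of distinct words is a separate quantity, and the toy sketch below (with made-up tokens) shows the difference.

```python
# Toy tokens for illustration only.
tokens = ["faire", "foule", "foule", "faire", "fogge", "filthie", "ayre"]

token_count = len(tokens)           # every occurrence, repeats included
vocabulary_size = len(set(tokens))  # distinct words only

print(token_count, vocabulary_size)
```

For the lab's data, the distinct-word count could be obtained the same way, for example with `len(set(macbeth_words_stopped))`.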

### Normalized Word Frequency

Knowing the frequency with which each word is used is somewhat informative, but without the context of how many words are used in total, it doesn't tell us much. One way we can adjust for this is to use **_Normalized Word Frequency_**, which we can compute by dividing each word frequency by the total number of words. 

Compute this now in the cell below, and display the normalized word frequency for the top 50 words. 

In [11]:
# Total number of tokens left after stop word removal
total_word_count = size
macbeth_top_50 = macbeth_stopped_freqdist.most_common(50)
print(f'{"Word":10} Normalized Frequency')
for word,freq in macbeth_top_50:
    normalized_frequency = freq/total_word_count
    print(f'{word:10} {normalized_frequency:^20.4}')

Word       Normalized Frequency
macb             0.01354       
haue             0.01206       
thou             0.008601      
enter            0.008008      
shall            0.006723      
macbeth          0.00613       
thee             0.006031      
vpon             0.005734      
macd             0.005734      
yet              0.005635      
thy              0.005536      
vs               0.005437      
come             0.005339      
king             0.005339      
hath             0.005141      
good             0.004844      
rosse            0.004844      
lady             0.004745      
would            0.004647      
time             0.004548      
like             0.004251      
say              0.003856      
doe              0.003757      
lord             0.003757      
make             0.003757      
tis              0.003658      
must             0.003559      
done             0.00346       
selfe            0.00346       
ile              0.00346       
feare   
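
As a quick check of the arithmetic, the sketch below recomputes the first three rows from the raw counts and the token total shown earlier (137, 122, and 87 occurrences out of 10115 tokens).

```python
# Counts and total copied from the outputs above.
total_tokens = 10115
top_counts = {"macb": 137, "haue": 122, "thou": 87}

# Normalized frequency = occurrences of the word / total number of tokens.
normalized = {word: n / total_tokens for word, n in top_counts.items()}

for word, value in normalized.items():
    print(f"{word:10} {value:.4}")
```

A value such as 0.01354 means that roughly 1.35% of the remaining tokens are `macb`.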

## Creating Bigrams

Knowing individual word frequencies is somewhat informative, but in practice, some of these tokens are actually parts of larger phrases that should be treated as a single unit. Let's create some bigrams, and see which combinations of words are most telling. 

In the cell below:

* We'll begin by creating a measures object under a short name to make it easier to use. Store `nltk.collocations.BigramAssocMeasures()` inside of the variable `bigram_measures` 
* Next, we'll need to create a **_finder_**. Pass `macbeth_words_stopped` into `BigramCollocationFinder.from_words()` and assign the result to `macbeth_finder` 
* Once we have a finder, we can use it to compute bigram scores, so we can see the combinations that occur most frequently. Call the `macbeth_finder` object's `score_ngrams()` method and pass in `bigram_measures.raw_freq` as the input  
* Display the first 50 elements in the `macbeth_scored` list to see the 50 most common bigrams in *Macbeth* 

In [12]:
bigram_measures = nltk.collocations.BigramAssocMeasures()

In [13]:
macbeth_finder = BigramCollocationFinder.from_words(macbeth_words_stopped)

In [14]:
macbeth_scored = macbeth_finder.score_ngrams(bigram_measures.raw_freq)

In [15]:
# Display the first 50 elements of macbeth_scored

macbeth_scored[:50]

[(('enter', 'macbeth'), 0.0015818091942659417),
 (('exeunt', 'scena'), 0.0014829461196243204),
 (('thane', 'cawdor'), 0.0012852199703410777),
 (('knock', 'knock'), 0.0009886307464162135),
 (('lord', 'macb'), 0.0008897676717745922),
 (('thou', 'art'), 0.0008897676717745922),
 (('good', 'lord'), 0.0007909045971329708),
 (('haue', 'done'), 0.0007909045971329708),
 (('macb', 'haue'), 0.0007909045971329708),
 (('enter', 'lady'), 0.0006920415224913495),
 (('let', 'vs'), 0.0006920415224913495),
 (('macbeth', 'macb'), 0.0005931784478497281),
 (('enter', 'malcolme'), 0.0004943153732081067),
 (('enter', 'three'), 0.0004943153732081067),
 (('euery', 'one'), 0.0004943153732081067),
 (('macb', 'ile'), 0.0004943153732081067),
 (('macb', 'thou'), 0.0004943153732081067),
 (('make', 'vs'), 0.0004943153732081067),
 (('mine', 'eyes'), 0.0004943153732081067),
 (('mine', 'owne'), 0.0004943153732081067),
 (('scena', 'secunda'), 0.0004943153732081067),
 (('three', 'witches'), 0.0004943153732081067),
 (('thy'

These look a bit more interesting. We can see here that some of the most common ones are stage directions, such as 'Enter Macbeth' and 'Exeunt Scena', while others seem to be common phrases used in the play. 
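
To make the scores concrete, here is a plain-Python sketch of a raw bigram frequency on a toy token list. Dividing by the number of tokens (rather than the number of bigrams) is an assumption inferred from the output above, where the `('enter', 'macbeth')` score multiplied by the 10115 tokens comes out to 16 occurrences.

```python
from collections import Counter

# Toy tokens for illustration only.
tokens = ["knock", "knock", "knock", "enter", "macbeth", "enter", "macbeth"]

# Pair each token with its successor to form bigrams.
bigrams = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigrams)

# Raw frequency: bigram count divided by the total number of tokens.
raw_freq = {bigram: n / len(tokens) for bigram, n in bigram_counts.items()}

print(bigram_counts.most_common(2))
print(raw_freq[("enter", "macbeth")])
```

Because the stop words were removed before pairing, some of these bigrams join words that were not adjacent in the original text.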

To wrap up our initial examination of *Macbeth*, let's end by calculating **_Mutual Information Scores_**.

## Using Mutual Information Scores

To calculate mutual information scores, we'll need to first create a frequency filter, so that we only examine bigrams that occur at least a set number of times -- for our purposes, we'll set this minimum to 5. 

In NLTK, mutual information is often referred to as `pmi`, for **_Pointwise Mutual Information_**. Calculating PMI scores works much the same way that we created bigrams, with a few notable differences.

In the cell below:

* We'll start by creating another finder for pmi. Pass `macbeth_words_stopped` as the input to `BigramCollocationFinder.from_words()`. Store this in the variable `macbeth_pmi_finder` 
* Once we have our finder, we'll need to apply our frequency filter. Call `macbeth_pmi_finder`'s `apply_freq_filter` and pass in the number `5` as the input 
* Now, we can use the finder to calculate pmi scores. Use the pmi finder's `.score_ngrams()` method, and pass in `bigram_measures.pmi` as the argument. Store this in `macbeth_pmi_scored` 
* Examine the first 50 elements in `macbeth_pmi_scored` 

In [16]:
macbeth_pmi_finder = BigramCollocationFinder.from_words(macbeth_words_stopped)

In [17]:
macbeth_pmi_finder.apply_freq_filter(5)

In [18]:
macbeth_pmi_scored = macbeth_pmi_finder.score_ngrams( bigram_measures.pmi)

In [19]:
macbeth_pmi_scored[:50]

[(('three', 'witches'), 8.925697076191915),
 (('scena', 'secunda'), 8.844777080808347),
 (('knock', 'knock'), 8.626136794333007),
 (('thane', 'cawdor'), 7.968474805033251),
 (('exeunt', 'scena'), 7.844777080808349),
 (('mine', 'eyes'), 7.466265457554618),
 (('worthy', 'thane'), 6.982280604558282),
 (('mine', 'owne'), 6.8382342349415755),
 (('euery', 'one'), 6.626136794333007),
 (('thou', 'art'), 5.861265203596917),
 (('enter', 'malcolme'), 5.58584707330729),
 (('enter', 'three'), 5.58584707330729),
 (('good', 'lord'), 5.441571341886851),
 (('let', 'vs'), 5.2009208910336255),
 (('enter', 'macbeth'), 5.010162386174146),
 (('thy', 'selfe'), 4.689498855330436),
 (('make', 'vs'), 4.596849567364762),
 (('haue', 'done'), 4.244188344937793),
 (('enter', 'lady'), 4.186751117897471),
 (('lord', 'macb'), 4.128174104483845),
 (('macb', 'ile'), 3.3988216944275145),
 (('would', 'haue'), 3.140810605092483),
 (('macbeth', 'macb'), 2.8369428068193994),
 (('macb', 'haue'), 2.2754392789222333),
 (('macb'
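
A PMI score compares how often two words occur together with how often they would co-occur if they were independent: log2 of count(x, y) × N divided by count(x) × count(y), where N is the total number of tokens. The sketch below reproduces the `('good', 'lord')` score from numbers in the outputs above. The word counts 49 and 38 and the total 10115 are printed directly; the bigram count of 8 is inferred from its raw-frequency score, so treat it as an assumption.

```python
import math


def pmi(bigram_count, first_count, second_count, total_tokens):
    """Pointwise mutual information (base 2) computed from raw counts."""
    return math.log2(bigram_count * total_tokens / (first_count * second_count))


# ('good', 'lord'): good = 49, lord = 38, bigram = 8 (inferred), N = 10115.
score = pmi(8, 49, 38, 10115)
print(round(score, 4))
```

High PMI favors pairs whose words rarely appear apart, which is why `('three', 'witches')` outranks the more frequent `('enter', 'macbeth')`.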

## On Your Own: Comparative Corpus Statistics

Now that we've worked through generating some baseline corpus statistics for one corpus, it's up to you to select a second corpus and generate your own corpus statistics, and then compare and contrast the two. For simplicity's sake, we recommend you stick to a corpus from `nltk.corpus.gutenberg` -- although comparing the diction found in a classic work of fiction to something like a presidential State of the Union address could be interesting, it's not really an apples-to-apples comparison, and those corpora could also require additional preprocessing steps that are outside the scope of this lab. 

In the cells below:

1. Select another corpus from `gutenberg.fileids()`  
2. Clean, preprocess, tokenize, and generate corpus statistics for this new corpus   
3. Perform a comparative analysis using the Macbeth statistics we generated above and your new corpus statistics. How are they similar? How are they different? Was there anything interesting or surprising that you found in your comparison? Create at least one meaningful visualization comparing the two corpora 
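
One way to keep the two analyses comparable is to run both texts through the same function. The sketch below is only a starting point under stated assumptions: `corpus_stats` and its return keys are hypothetical names, it uses the standard library's `re` and `Counter` instead of the NLTK calls from earlier, and the sample sentence and one-word stop list are made up.

```python
import re
from collections import Counter

# Same token pattern as used for Macbeth.
PATTERN = r"[a-zA-Z]+(?:'[a-z]+)?"


def corpus_stats(raw_text, stop_words, top_n=5):
    """Tokenize, lowercase, drop stop words, and summarize one text."""
    tokens = [token.lower() for token in re.findall(PATTERN, raw_text)]
    kept = [token for token in tokens if token not in stop_words]
    counts = Counter(kept)
    total = len(kept)
    top = counts.most_common(top_n) if total else []
    return {
        "token_count": total,
        "vocabulary_size": len(counts),
        "top_normalized": [(word, n / total) for word, n in top],
    }


# Made-up sample and stop list for illustration.
sample = "Hence: home you idle Creatures, get you home"
stats = corpus_stats(sample, stop_words={"you"})
print(stats)
```

Calling the same function on each raw text ensures that any differences in the statistics come from the texts and not from differences in preprocessing.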

In [20]:
file_ids

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [24]:
caesar_txt = gutenberg.raw('shakespeare-caesar.txt')
caesar_txt[:501]

'[The Tragedie of Julius Caesar by William Shakespeare 1599]\n\n\nActus Primus. Scoena Prima.\n\nEnter Flauius, Murellus, and certaine Commoners ouer the Stage.\n\n  Flauius. Hence: home you idle Creatures, get you home:\nIs this a Holiday? What, know you not\n(Being Mechanicall) you ought not walke\nVpon a labouring day, without the signe\nOf your Profession? Speake, what Trade art thou?\n  Car. Why Sir, a Carpenter\n\n   Mur. Where is thy Leather Apron, and thy Rule?\nWhat dost thou with thy best Apparrell on?'

In [27]:
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
caeasar_tokenize_raw = nltk.regexp_tokenize(caesar_txt,pattern)
caeasar_tokenize_raw


['The',
 'Tragedie',
 'of',
 'Julius',
 'Caesar',
 'by',
 'William',
 'Shakespeare',
 'Actus',
 'Primus',
 'Scoena',
 'Prima',
 'Enter',
 'Flauius',
 'Murellus',
 'and',
 'certaine',
 'Commoners',
 'ouer',
 'the',
 'Stage',
 'Flauius',
 'Hence',
 'home',
 'you',
 'idle',
 'Creatures',
 'get',
 'you',
 'home',
 'Is',
 'this',
 'a',
 'Holiday',
 'What',
 'know',
 'you',
 'not',
 'Being',
 'Mechanicall',
 'you',
 'ought',
 'not',
 'walke',
 'Vpon',
 'a',
 'labouring',
 'day',
 'without',
 'the',
 'signe',
 'Of',
 'your',
 'Profession',
 'Speake',
 'what',
 'Trade',
 'art',
 'thou',
 'Car',
 'Why',
 'Sir',
 'a',
 'Carpenter',
 'Mur',
 'Where',
 'is',
 'thy',
 'Leather',
 'Apron',
 'and',
 'thy',
 'Rule',
 'What',
 'dost',
 'thou',
 'with',
 'thy',
 'best',
 'Apparrell',
 'on',
 'You',
 'sir',
 'what',
 'Trade',
 'are',
 'you',
 'Cobl',
 'Truely',
 'Sir',
 'in',
 'respect',
 'of',
 'a',
 'fine',
 'Workman',
 'I',
 'am',
 'but',
 'as',
 'you',
 'would',
 'say',
 'a',
 'Cobler',
 'Mur',
 'But

In [28]:
caesar_lower = [word.lower() for word in caeasar_tokenize_raw]
caesar_lower

['the',
 'tragedie',
 'of',
 'julius',
 'caesar',
 'by',
 'william',
 'shakespeare',
 'actus',
 'primus',
 'scoena',
 'prima',
 'enter',
 'flauius',
 'murellus',
 'and',
 'certaine',
 'commoners',
 'ouer',
 'the',
 'stage',
 'flauius',
 'hence',
 'home',
 'you',
 'idle',
 'creatures',
 'get',
 'you',
 'home',
 'is',
 'this',
 'a',
 'holiday',
 'what',
 'know',
 'you',
 'not',
 'being',
 'mechanicall',
 'you',
 'ought',
 'not',
 'walke',
 'vpon',
 'a',
 'labouring',
 'day',
 'without',
 'the',
 'signe',
 'of',
 'your',
 'profession',
 'speake',
 'what',
 'trade',
 'art',
 'thou',
 'car',
 'why',
 'sir',
 'a',
 'carpenter',
 'mur',
 'where',
 'is',
 'thy',
 'leather',
 'apron',
 'and',
 'thy',
 'rule',
 'what',
 'dost',
 'thou',
 'with',
 'thy',
 'best',
 'apparrell',
 'on',
 'you',
 'sir',
 'what',
 'trade',
 'are',
 'you',
 'cobl',
 'truely',
 'sir',
 'in',
 'respect',
 'of',
 'a',
 'fine',
 'workman',
 'i',
 'am',
 'but',
 'as',
 'you',
 'would',
 'say',
 'a',
 'cobler',
 'mur',
 'but

In [29]:
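# Note: this counts the raw, case-sensitive tokens, so 'and' and 'And' are
# tallied separately; pass caesar_lower instead to mirror the Macbeth analysis
most_frequent_ceasar = FreqDist(caeasar_tokenize_raw)
most_frequent_ceasar.most_common(50)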

[('I', 530),
 ('the', 502),
 ('and', 409),
 ('to', 370),
 ('you', 341),
 ('of', 336),
 ('not', 248),
 ('a', 240),
 ('is', 228),
 ('And', 218),
 ('in', 204),
 ('that', 199),
 ('Caesar', 189),
 ('my', 188),
 ('me', 186),
 ('it', 166),
 ('him', 165),
 ('Brutus', 161),
 ('Bru', 153),
 ('his', 150),
 ('this', 141),
 ('your', 137),
 ('be', 132),
 ('with', 131),
 ('will', 129),
 ('he', 128),
 ('haue', 127),
 ('for', 118),
 ('so', 109),
 ('do', 107),
 ('shall', 107),
 ('Cassi', 107),
 ('thou', 100),
 ('as', 100),
 ('are', 96),
 ('all', 90),
 ('That', 86),
 ('Cassius', 85),
 ('by', 82),
 ('we', 82),
 ('then', 79),
 ('our', 79),
 ('on', 77),
 ('The', 76),
 ('To', 76),
 ('Antony', 75),
 ('But', 73),
 ('O', 69),
 ('but', 68),
 ('no', 68)]

In [31]:
stopwords_list = stopwords.words('english')
stopwords_list += list(string.punctuation)
stopwords_list += ['0','1','2','3','4','5','6','7','8','9']
stopped_caesar = [word for word in caesar_lower if word not in stopwords_list]
stopped_caesar


['tragedie',
 'julius',
 'caesar',
 'william',
 'shakespeare',
 'actus',
 'primus',
 'scoena',
 'prima',
 'enter',
 'flauius',
 'murellus',
 'certaine',
 'commoners',
 'ouer',
 'stage',
 'flauius',
 'hence',
 'home',
 'idle',
 'creatures',
 'get',
 'home',
 'holiday',
 'know',
 'mechanicall',
 'ought',
 'walke',
 'vpon',
 'labouring',
 'day',
 'without',
 'signe',
 'profession',
 'speake',
 'trade',
 'art',
 'thou',
 'car',
 'sir',
 'carpenter',
 'mur',
 'thy',
 'leather',
 'apron',
 'thy',
 'rule',
 'dost',
 'thou',
 'thy',
 'best',
 'apparrell',
 'sir',
 'trade',
 'cobl',
 'truely',
 'sir',
 'respect',
 'fine',
 'workman',
 'would',
 'say',
 'cobler',
 'mur',
 'trade',
 'art',
 'thou',
 'answer',
 'directly',
 'cob',
 'trade',
 'sir',
 'hope',
 'may',
 'vse',
 'safe',
 'conscience',
 'indeed',
 'sir',
 'mender',
 'bad',
 'soules',
 'fla',
 'trade',
 'thou',
 'knaue',
 'thou',
 'naughty',
 'knaue',
 'trade',
 'cobl',
 'nay',
 'beseech',
 'sir',
 'yet',
 'sir',
 'mend',
 'mur',
 "mean'

In [33]:
size = len(stopped_caesar)
size 

11040

In [35]:
total_word_count = size
stopped_caesar_freqdist = FreqDist(stopped_caesar)
caesar_top_50 = stopped_caesar_freqdist.most_common(50)
print(f'{"Word":10} Normalized Frequency')
for word,freq in caesar_top_50:
    normalized_frequency = freq/total_word_count
    print(f'{word:10} {normalized_frequency:^20.4}')

Word       Normalized Frequency
caesar           0.01721       
brutus           0.01458       
bru              0.01386       
haue             0.01332       
shall            0.01132       
thou             0.01042       
cassi            0.009692      
cassius          0.007699      
antony           0.006793      
come             0.006703      
good             0.006431      
men              0.00625       
know             0.006159      
enter            0.005797      
let              0.005797      
vs               0.005616      
man              0.005344      
thy              0.005072      
heere            0.005072      
thee             0.004982      
ant              0.004348      
well             0.004348      
vpon             0.004257      
day              0.004167      
would            0.003986      
lord             0.003986      
yet              0.003804      
night            0.003804      
go               0.003714      
selfe            0.003623      
caes    

In [42]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
caesar_finder = BigramCollocationFinder.from_words(stopped_caesar)
caesar_scored = caesar_finder.score_ngrams(bigram_measures.raw_freq)
caesar_scored[:50]

[(('let', 'vs'), 0.0014492753623188406),
 (('mark', 'antony'), 0.001177536231884058),
 (('marke', 'antony'), 0.0010869565217391304),
 (('lord', 'bru'), 0.0009963768115942029),
 (('thou', 'art'), 0.0009963768115942029),
 (('would', 'haue'), 0.0009057971014492754),
 (('art', 'thou'), 0.0008152173913043478),
 (('brutus', 'cassius'), 0.0008152173913043478),
 (('caesar', 'caes'), 0.0008152173913043478),
 (('caesar', 'shall'), 0.0008152173913043478),
 (('enter', 'brutus'), 0.0008152173913043478),
 (('good', 'night'), 0.0008152173913043478),
 (('noble', 'brutus'), 0.0008152173913043478),
 (('thou', 'hast'), 0.0008152173913043478),
 (('good', 'morrow'), 0.0007246376811594203),
 (('haue', 'done'), 0.0007246376811594203),
 (('antony', 'ant'), 0.0006340579710144927),
 (('bru', 'good'), 0.0006340579710144927),
 (('enter', 'lucius'), 0.0006340579710144927),
 (('shall', 'finde'), 0.0006340579710144927),
 (('cassi', 'brutus'), 0.0005434782608695652),
 (('come', 'downe'), 0.0005434782608695652),
 (('e

In [43]:
caesar_pmi_finder = BigramCollocationFinder.from_words(stopped_caesar)
caesar_pmi_finder.apply_freq_filter(5)
caesar_pmi_scored = caesar_pmi_finder.score_ngrams(bigram_measures.pmi)
caesar_pmi_scored[:50]

[(('ides', 'march'), 9.845490050944374),
 (('market', 'place'), 9.26052755022322),
 (('caius', 'ligarius'), 8.882015926969489),
 (('metellus', 'cymber'), 8.882015926969489),
 (('mine', 'owne'), 7.9006316051368355),
 (('fell', 'downe'), 7.643856189774725),
 (('mark', 'antony'), 7.20163386116965),
 (('messala', 'messa'), 6.860596943334583),
 (('marke', 'antony'), 6.464668267003443),
 (('luc', 'sir'), 6.421463768438276),
 (('good', 'morrow'), 6.373814836552331),
 (('thou', 'hast'), 5.9475325801058645),
 (("did'st", 'thou'), 5.906890595608518),
 (('honourable', 'men'), 5.83650126771712),
 (('haue', 'seene'), 5.8157427075503225),
 (('haue', 'beene'), 5.7453533796589245),
 (('caius', 'cassius'), 5.6425499922741),
 (('enter', 'lucius'), 5.537367755582045),
 (('luc', 'lord'), 5.485594105857992),
 (('let', 'vs'), 5.476256241278657),
 (('thou', 'art'), 5.400537929583729),
 (('haue', 'heard'), 5.382783300274216),
 (('euery', 'man'), 5.32541708096724),
 (('exeunt', 'enter'), 5.292949027915597),
 (
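
As a first step toward the comparison, the sketch below lines up words that appear in both top-50 lists and ranks them by how much more frequent they are in *Julius Caesar* than in *Macbeth*. The values are copied from the two normalized-frequency tables above; using a simple ratio as the comparison measure is a choice made here for illustration.

```python
# Normalized frequencies copied from the two tables above.
macbeth_norm = {"haue": 0.01206, "thou": 0.008601, "shall": 0.006723,
                "come": 0.005339, "good": 0.004844}
caesar_norm = {"haue": 0.01332, "thou": 0.01042, "shall": 0.01132,
               "come": 0.006703, "good": 0.006431}

# Words present in both tables.
shared = macbeth_norm.keys() & caesar_norm.keys()

# Ratio above 1 means the word is relatively more frequent in Julius Caesar.
ratios = {word: caesar_norm[word] / macbeth_norm[word] for word in shared}

for word, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{word:10} {ratio:.2f}")
```

Of these five words, `shall` shows the largest relative difference between the two plays, which could be one thread to follow in a visualization.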

## Summary

In this lab, we used our newfound NLP skills to generate some statistics specific to text data, and used them to compare two different works! 