# Text Processing: Analyzing The Works of Shakespeare

## This is an HTML version of the Jupyter Notebook. Download the zip file with all of the necessary files [here](shakespeare.zip). 

### In this notebook, we will analyze word frequency of all of the works of Shakespeare!

#### We will:
#### - Read all of Shakespeare's works from a text file
#### - Do a word frequency analysis: total word counts and total unique words
#### - Compute the 30 most frequently used words in Shakespeare's works, not including glue words or stop words such as "the" and "and"



To read a file, we open it inside a `with` statement. The file object returned by `open` acts as a context manager, so the file is closed automatically when the indented block finishes. Use the following code to open the attached file "shakespeare.txt":

```python
with open(filename, "r") as file:
    text = file.read()
```
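### The above code reads in the entire works of Shakespeare! The same code can read a text file of any size, as long as its contents fit in your computer's RAM, because `read()` loads the whole file into memory as a single string. Given enough memory, even a very large file such as a full Wikipedia text dump could be read this way.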
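When a file is too large to hold in memory, one common alternative is to iterate over the file object, which yields one line at a time. The sketch below counts words this way. The function name `count_words_streaming` and the small sample file are made up for illustration and are not part of the notebook.

```python
import os
import tempfile

def count_words_streaming(path):
    """Count words line by line instead of loading the whole file at once."""
    total = 0
    with open(path, "r", encoding="utf-8") as file:
        for line in file:              # the file object yields one line at a time
            total += len(line.split())
    return total

# Hypothetical two-line sample file, created only for this demonstration.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("To be or not\nto be\n")
    sample_total = count_words_streaming(sample_path)

print(sample_total)  # 6
```

For a file the size of shakespeare.txt, reading everything with `read()` is simpler and works fine.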

In [1]:
with open("shakespeare.txt", "r") as file:
    text = file.read()

### Use the `type()` function to determine the type of text. 

In [2]:
type(text)

str

### Use slicing to **OUTPUT** the first 500 characters. (From the play Coriolanus)

### Notice the newline \n character. Now print the string. 

In [4]:
print(text[:500])

First Citizen:
Before we proceed any further, hear me speak.
BNSD
All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted 


### Find the text of a famous Shakespeare quote: "To be, or not to be, that is the question:". Do this by calling the `find` method, which returns the index where the quote starts, or -1 if the text is not found.

In [5]:
text.find("To be or not to be")

-1

### You may run into some difficulties with the previous problem. This is because the text you're looking for needs to match Shakespeare's text exactly, including punctuation and capitalization. Let's normalize the text by lowercasing it, removing all punctuation, and replacing newline characters with spaces.
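A small sketch of how exact `find` matching is, using a short sample string rather than the full text. The variable names are chosen for this example only.

```python
sample = "To be, or not to be, that is the question:"

# Exact match, including the comma: found at the start of the string.
exact_index = sample.find("To be, or not to be")

# Same words without the comma: no match, so find returns -1.
missing_comma_index = sample.find("To be or not to be")

# Same words in lower case: capitalization differs, so again -1.
lowercase_index = sample.find("to be, or not to be")

print(exact_index, missing_comma_index, lowercase_index)  # 0 -1 -1
```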

### Do this by defining a function `remove_punctuation` which accepts a string of text and returns it lowercased, with all punctuation removed and newline characters replaced by spaces.

In [8]:
def remove_punctuation(txt):
    """Convert text to lower case. Use the method replace(old, new) to
    replace newlines with spaces and remove punctuation from text:
    punc = [".",":",",",";","'",'"', "!","?"]
    Then return text."""
    
    # lower text by calling the method lower() on the string
    txt = txt.lower()
    
    # replace \n with a SPACE
    txt = txt.replace("\n", " ")
    
    # loop through punc and replace each punctuation with empty string ""
    punc = [".",":",",",";","'",'"', "!","?"]
    for p in punc:
        txt = txt.replace(p, "")
    
    # remember to return the text!
    return txt

    
    

In [9]:
print(remove_punctuation("hello! my name is Mike, I am in college. Thanks!"))

hello my name is mike i am in college thanks
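The loop of `replace` calls scans the text once per punctuation mark. As an alternative sketch, `str.translate` can apply all of the substitutions in a single pass. The names `PUNC`, `TABLE`, and `remove_punctuation_translate` are made up for this example, and the punctuation list is the same one used above.

```python
PUNC = [".", ":", ",", ";", "'", '"', "!", "?"]

# Build a translation table: newline becomes a space,
# and each punctuation mark maps to None (meaning "delete it").
TABLE = str.maketrans({"\n": " ", **{p: None for p in PUNC}})

def remove_punctuation_translate(txt):
    """Lowercase txt, delete punctuation, and turn newlines into spaces."""
    return txt.lower().translate(TABLE)

cleaned = remove_punctuation_translate("hello! my name is Mike, I am in college. Thanks!")
print(cleaned)  # hello my name is mike i am in college thanks
```

Both versions leave characters outside the list, such as hyphens, untouched.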


### The cell above tests remove_punctuation on a small text sample. Now apply it to the full Shakespeare text and print the first 500 characters.

In [11]:
text = remove_punctuation(text)
print(text[:500])

first citizen before we proceed any further hear me speak bnsd all speak speak  first citizen you are all resolved rather to die than to famish  all resolved resolved  first citizen first you know caius marcius is chief enemy to the people  all we knowt we knowt  first citizen let us kill him and well have corn at our own price ist a verdict  all no more talking ont let it be done away away  second citizen one word good citizens  first citizen we are accounted poor citizens the patricians good w


### Now find your famous Shakespeare quote in the normalized text. Print out approximately 300 characters starting at the quote.

In [13]:
index = text.find("to be or not to be")
print(index)


3219707


In [14]:
print(text[index:index+300])

to be or not to be that is the question whether tis nobler in the mind to suffer the slings and arrows of outrageous fortune or to take arms against a sea of troubles and by opposing end them to die to sleep no more and by a sleep to say we end the heart-ache and the thousand natural shocks that fle


### Compute the total number of characters of all of the works of Shakespeare.

In [15]:
print(len(text))

4356615


### Approximately how many words are there in the entire works of Shakespeare? Use split(). Recall that split() returns a list of all words from a string; by default it splits on whitespace.

In [17]:
list_words = text.split()
print(len(list_words))

832287
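The normalized text contains runs of two spaces wherever a blank line used to be, so it matters that `split()` is called without an argument. A small sketch of the difference, using a made-up sample string:

```python
line = "speak  speak\nfirst citizen "

# With no argument, split() treats any run of whitespace
# (spaces, tabs, newlines) as a single separator.
words_default = line.split()

# With an explicit " " separator, every single space splits,
# which produces empty strings and leaves the newline in place.
words_space = line.split(" ")

print(len(words_default))  # 4
print(len(words_space))    # 5
```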


### Write function frequency below. The function takes an input string text and returns a dictionary whose keys are the words and values are the word counts.

### For example, frequency("baa baa black sheep") returns the dictionary {"baa": 2, "black": 1, "sheep": 1}.

In [24]:
def frequency(text):
    """Returns dictionary of word:counts key-value pairs."""
    word_counts = {}
    list_words = text.split()
    for word in list_words:
        if word not in word_counts:
            word_counts[word] = 0
        word_counts[word] += 1
    return word_counts

### Test your frequency function on a small text sample.

In [25]:
print(frequency("baa baa black sheep"))
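The "check for the key, then increment" pattern in `frequency` can also be written with `dict.get`, which returns a default value when the key is missing. This is just an equivalent sketch; `frequency_get` is a name made up for this example.

```python
def frequency_get(text):
    """Return a dictionary of word:count pairs, using dict.get for the default."""
    word_counts = {}
    for word in text.split():
        # get(word, 0) returns the current count, or 0 for a word not seen yet
        word_counts[word] = word_counts.get(word, 0) + 1
    return word_counts

sample_counts = frequency_get("baa baa black sheep")
print(sample_counts)  # {'baa': 2, 'black': 1, 'sheep': 1}
```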



### Write the function text_stats which accepts a string input and returns a tuple of (total words, total unique words). Hint: Call the remove_punctuation and frequency functions above. 

In [26]:
def text_stats(text):
    """Given a text, returns a TUPLE of total words and total unique words.
    Remember to call remove_punctuation and frequency above."""
    text = remove_punctuation(text)
    word_counts = frequency(text)
    total_words = len(text.split())
    total_unique = len(word_counts)
    return (total_words, total_unique)


### Test your text_stats function on a small text sample.

In [27]:
text_stats("hello hello, how are you you you")

(7, 4)

### Now call text_stats on Shakespeare's text. How many unique words are there in all of his works?

In [28]:
text_stats(text)

(832287, 28154)
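If only the number of unique words is needed, and not the individual counts, a `set` is enough. A minimal sketch, assuming the input has already been normalized; `text_stats_set` is a name made up for this example.

```python
def text_stats_set(normalized_text):
    """Return (total words, total unique words) for already-normalized text."""
    words = normalized_text.split()
    # A set keeps one copy of each distinct word, so its size is the unique count.
    return (len(words), len(set(words)))

stats = text_stats_set("hello hello how are you you you")
print(stats)  # (7, 4)
```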

### The collections module already provides a class called Counter, a dictionary subclass that produces the same word counts as the `frequency` function implemented above.

### Here's how to create a Counter object. 

```python
from collections import Counter

text = remove_punctuation(text)
counter = Counter(text.split())
```


### Use == to check that the Counter object is equivalent to the dictionary object returned by the frequency function above.


In [44]:
from collections import Counter

# The frequency function we implemented above is already available
# as the Counter class. Counter also has a handy method, most_common,
# that returns the most frequently occurring words in a text.
text = remove_punctuation(text)
counter = Counter(text.split())
print(counter == frequency(text))


True
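The same comparison on a small sample, as a sketch. It also shows one way a Counter differs from a plain dictionary: looking up a missing key returns 0 instead of raising a KeyError. The sample words are made up for this example.

```python
from collections import Counter

sample_counter = Counter("baa baa black sheep".split())

# A Counter compares equal to a plain dict holding the same key/count pairs.
same_as_dict = sample_counter == {"baa": 2, "black": 1, "sheep": 1}

# A word that never appeared has a count of 0 rather than raising KeyError.
missing_count = sample_counter["wolf"]

print(same_as_dict, missing_count)  # True 0
```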



### Counter has a nice method called `most_common(n)` which accepts an integer n and returns the n most frequently occurring words along with their counts. Call most_common to see the 20 most frequently occurring words in all of Shakespeare's works.

In [47]:
counter.most_common(20)

[('the', 26197),
 ('and', 23480),
 ('i', 19994),
 ('to', 18066),
 ('of', 16345),
 ('you', 13490),
 ('a', 13433),
 ('my', 12009),
 ('that', 10438),
 ('in', 10270),
 ('is', 8882),
 ('not', 8258),
 ('me', 7529),
 ('it', 7495),
 ('for', 7309),
 ('with', 6961),
 ('be', 6649),
 ('your', 6594),
 ('this', 6420),
 ('his', 6363)]
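A small sketch of what `most_common` returns, using a made-up sentence: a list of (word, count) tuples sorted from the highest count down. Calling it without an argument returns every word.

```python
from collections import Counter

sample_counter = Counter("you you you hello hello how are".split())

# The two most frequent words, highest count first.
top_two = sample_counter.most_common(2)

# With no argument, most_common returns all words in descending order of count.
everything = sample_counter.most_common()

print(top_two)          # [('you', 3), ('hello', 2)]
print(len(everything))  # 4
```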

### The above words are high-frequency words, but many do not contribute to the content of the text. These are called glue words or [stop words](https://en.wikipedia.org/wiki/Stop_words). Search engines, for example, have often ignored stop words when matching articles or sites to a query, since these words say little about the content of those articles or sites.

### Follow the comments below to read in a tab-separated file of stop words and add those words to a list.

In [37]:
with open("stopwords.txt", 'r') as file:
    # create an empty list
    stop_words = []

    # loop through each line of the file (each line is a str of words separated by tabs, '\t')
    for line in file:
        words_on_line = line.split("\t")
        # loop through each word on the line
        for word in words_on_line:
            # append the word to the list; strip() removes the newline character
            # at the end of each line as well as any leading/trailing spaces
            stop_words.append(word.strip())


In [38]:
print(stop_words)

['a', 'able', 'about', 'above', 'according', 'accordingly', 'across', 'actually', 'after', 'afterwards', 'again', 'against', "ain't", 'all', 'allow', 'allows', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'an', 'and', 'another', 'any', 'anybody', 'anyhow', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'apart', 'appear', 'appreciate', 'appropriate', 'are', "aren't", 'around', 'as', 'aside', 'ask', 'asking', 'associated', 'at', 'available', 'away', 'awfully', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'believe', 'below', 'beside', 'besides', 'best', 'better', 'between', 'beyond', 'both', 'brief', 'but', 'by', "c'mon", "c's", 'came', 'can', "can't", 'cannot', 'cant', 'cause', 'causes', 'certain', 'certainly', 'changes', 'clearly', 'co', 'com', 'come', 'comes', 'concerning', 'consequently', 'consider', 'considering', 'contain', 'containing', 'contains', 'correspond
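Because the stop words are only used for membership checks, they could also be collected into a `set`, which makes `in` lookups fast. The sketch below uses `io.StringIO` as a stand-in for the real file. The sample contents are made up and are not the actual stopwords.txt.

```python
import io

# Stand-in for open("stopwords.txt"): two tab-separated lines of made-up sample data.
sample_file = io.StringIO("a\table\tabout\nabove\taccording\n")

# Flatten "for each line, for each word" into one set comprehension.
# The trailing condition skips empty strings left by blank lines or stray tabs.
stop_word_set = {
    word.strip()
    for line in sample_file
    for word in line.split("\t")
    if word.strip()
}

print(sorted(stop_word_set))  # ['a', 'able', 'about', 'above', 'according']
```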

### Finally, loop through each of the stop words in the list created above and remove them from the counter. The Counter object (a subclass of dict) has a pop(key) method which removes an entry given its key.

### After this step, the counter object contains the most frequently used words in Shakespeare's works NOT including the stop words.

In [40]:
for word in stop_words:
    if word in counter:
        counter.pop(word)
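Popping entries changes the counter in place, so the original counts are gone afterwards. As an alternative sketch, a new Counter can be built that keeps only the non-stop words and leaves the original untouched. The sample sentence and the names `sample_stop_words` and `filtered` are made up for this example.

```python
from collections import Counter

sample_counter = Counter("the king and the duke and the lord".split())
sample_stop_words = {"the", "and"}

# Build a new Counter from the entries whose word is not a stop word.
filtered = Counter({
    word: count
    for word, count in sample_counter.items()
    if word not in sample_stop_words
})

print(filtered.most_common())  # [('king', 1), ('duke', 1), ('lord', 1)]
print(sample_counter["the"])   # 3, the original counter is unchanged
```

When removing in place, `counter.pop(word, None)` is another option: the second argument is a default that avoids a KeyError for missing keys, so the `if word in counter` check is not needed.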
        
        

### Now print out the most frequently used words in all of Shakespeare's works. The cell below shows the top 20; call most_common(30) to see the top 30.

In [42]:
counter.most_common(20)

[('i', 19994),
 ('thou', 5251),
 ('thy', 3688),
 ('lord', 3121),
 ('thee', 3019),
 ('sir', 2849),
 ('good', 2793),
 ('king', 2718),
 ('o', 2476),
 ('ill', 1936),
 ('hath', 1876),
 ('love', 1837),
 ('man', 1777),
 ('make', 1575),
 ('tis', 1352),
 ('give', 1306),
 ('speak', 1143),
 ('mine', 1093),
 ('duke', 1010),
 ('henry', 1002)]