# Project: Analyzing Macbeth

## Introduction
For our first day and first data science project, we're going to do some rudimentary analysis of Shakespeare's classic play: Macbeth! You will get practice working with lists, conditionals, and dictionaries, visualizing data, and thinking analytically about data.

## Objectives
You will be able to:
* Show mastery of the content covered in this section

### Getting the Data
Here we start by importing a Python package (`requests`) and using it to pull the transcript of Macbeth from the Project Gutenberg website. We also preview a few details about what is now stored in the variable `macbeth`; it's a string with 119,846 characters, the first 500 of which are printed below.

In [None]:
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

macbeth = requests.get('http://www.gutenberg.org/cache/epub/2264/pg2264.txt').text

print(type(macbeth))
print(len(macbeth))
print(macbeth[:500])

### Your Task

Your task is to create a bar graph of the 25 most common words in Shakespeare's Macbeth.


A common Python programming pattern for counting objects, producing histograms, or updating statistics is to make calls to a dictionary as you iterate through a list. For example, given a list of words, you can create a dictionary to store counts and then iterate through the list of words, looking up how many times each word has appeared so far and updating that count now that you've seen the word again. The `dict.get()` method is very useful for this: it returns the value stored for a key, or a default you supply if the key is missing. Read the docstring for `dict.get()` and use it along with the pattern described above to create a bar graph of the 25 most common words from the transcript of Macbeth, which has been loaded into the variable `macbeth`. Be sure to include a title and appropriate labels for your graph.
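
The counting pattern described above can be sketched on a tiny, made-up word list before applying it to the full transcript. The names `sample_words` and `sample_counts` are illustrative assumptions and are not part of the project data.

```python
# A minimal sketch of the dictionary-counting pattern on a made-up word list
sample_words = ['fair', 'is', 'foul', 'and', 'foul', 'is', 'fair']

sample_counts = {}
for word in sample_words:
    # .get(word, 0) returns the count so far, or 0 if the word has not been seen yet
    sample_counts[word] = sample_counts.get(word, 0) + 1

print(sample_counts)
```

The same loop should work unchanged on the list produced by splitting the full transcript.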

In [None]:
# Split the transcript into words
macbeth_split = macbeth.split()

# Create a dictionary
common_words_dict = {}

# Iterate through the words of Macbeth, updating each word's count
for word in macbeth_split:
    # .get(word, 0) returns the count so far, or 0 the first time a word is seen
    common_words_dict[word] = common_words_dict.get(word, 0) + 1
common_words_dict
# The dictionary keys are naturally the unique words in the text

In [None]:
# Put the word counts from the dict into a pandas DataFrame, then sort from highest to lowest count
counts = pd.DataFrame.from_dict(common_words_dict, orient='index')  # orient='index' makes each word (dict key) a row label
counts = counts.sort_values(by=counts.columns[0], ascending=False)  # ascending=False puts the most common words first

# Create a horizontal bar graph of the 25 most common words
counts.head(25).plot(kind='barh')

# Include a descriptive title and labels (with 'barh', counts are on the x-axis and words on the y-axis)
plt.title("25 Most Common Words in Shakespeare's Macbeth")
plt.xlabel('Occurrence Count')
plt.ylabel('Words')
plt.show()

In [None]:
# You can get the same result with lists
word_counts = list(common_words_dict.items())  # .items() returns a view of (key, value) pairs; list() turns it into a list of tuples
# sorted() accepts a 'key' function that determines what to sort by. Here the key is
# the 2nd element of each (word, count) pair (index 1), so we sort by number of occurrences.
# The lambda defines that small function in-line, without the usual 'def' routine.
# reverse=True sorts from highest to lowest, and [:25] keeps only the first 25 items.
top_25 = sorted(word_counts, key=lambda x: x[1], reverse=True)[:25]
y = [item[1] for item in top_25]  # item[1] grabs the count from each pair, e.g. ('the', 25)
X = np.arange(len(y))  # np.arange (NOT arrange) returns evenly spaced integers: 0, 1, ..., len(y) - 1
plt.figure(figsize=(12,12))
plt.bar(X, y)
plt.xticks(X, [item[0] for item in top_25])  # item[0] grabs the words (the dictionary keys)
plt.ylabel('Number of Occurrences')
plt.xlabel('Word')
plt.title('Top 25 Words in Macbeth')
plt.show()
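
As a possible alternative to sorting the pairs by hand, the standard library's `collections.Counter` has a `most_common()` method that returns the highest counts directly. The sketch below uses a short made-up string (`sample_text`, an assumption for illustration) rather than the full transcript.

```python
from collections import Counter

# Made-up sample text standing in for the transcript
sample_text = 'double double toil and trouble fire burn and cauldron bubble'

# Counter tallies each word; most_common(2) returns the two highest (word, count) pairs
sample_counter = Counter(sample_text.split())
top_2 = sample_counter.most_common(2)
print(top_2)
```

Whether to use `Counter` or the dictionary pattern is a matter of preference; the manual version makes the counting logic explicit, which is the point of this exercise.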

### Level Up (Optional)
This project should take you about an hour and a half to complete. If you're done much more quickly than that and are not behind in the course, feel free to deepen your knowledge by completing any or all of the following tasks until you run out of time:
* Create a list of top characters by mentions of their names 
* Split the text by which character is talking
* Create sub graphs of the most common words by character
* Reduce the string to the text of the play itself. (Remove any initial notes, foreword, introduction, appendix, etc.)
* Come up with some other fun analyses of the text!
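
As one hedged starting point for the first task in the list, character mentions can be counted with `str.count()`. The character list and the sample text below are illustrative assumptions. Note that `str.count()` matches substrings, so "Lady Macbeth" would also count toward "Macbeth", and the real transcript may abbreviate speaker names.

```python
# Hypothetical list of character names to look for
characters = ['Macbeth', 'Banquo', 'Duncan', 'Macduff']

# Made-up sample text standing in for the transcript
sample = 'Macbeth greets Banquo. Banquo warns Macbeth. Duncan praises Macbeth.'

# Count substring occurrences of each name, then rank from most to least mentioned
mentions = {name: sample.count(name) for name in characters}
ranked = sorted(mentions.items(), key=lambda pair: pair[1], reverse=True)
print(ranked)
```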

## Summary
Congratulations! You've gotten some extra practice combining various data types into useful programming patterns and done an initial analysis of a classic text!