# Errors — Workbook

*Note: You can explore this [workbook](https://mybinder.org/v2/gh/INFO1350/Intro-CA-SP21/master?urlpath=lab/tree/book/COURSE-Final-Project/Workbooks/04-Errors-WORKBOOK.ipynb) in the cloud via Binder.*

In this workbook, we're going to practice troubleshooting and debugging errors. There are four pesky code cells (🚨) in this notebook that you need to fix. Can you fix them all?

This notebook also demonstrates how you might format your own Jupyter notebook for the final project. Notice how we describe the way we pre-process and prepare the data, how we examine patterns and outliers, and how we zoom in on specific trends.

## Cleaning, Preparing, Pre-Processing Data

Here we import necessary packages and set default display settings for Pandas.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
pd.options.display.max_rows = 200
pd.options.display.max_columns = 200
from pathlib import Path  

import matplotlib.pyplot as plt

Here we make a list of file paths for all the U.S. Inaugural Addresses, along with a list of titles, one for each address.

*🚨 **1.** This code cell will throw one error and lead to another problem. Can you identify these issues and debug them?🚨*

In [2]:
directory_path = "../../history/US_Inaugural_Addresses/"
text_files = glob.glob(f"{directory_path}/*.txt")
text_titles = [Path(text).stem for text in text_files]

NameError: name 'glob' is not defined

Check to make sure you have a list of filenames. Are there filenames in this list?

In [7]:
text_files

[]
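An empty list like the one above does not raise an error on its own, so it is easy to miss. The sketch below shows one way to make that situation fail loudly. It is only an illustration: the helper name `list_text_files` is hypothetical, it uses `pathlib` rather than the approach in the cell above, and it runs against a temporary directory instead of the real addresses.

```python
from pathlib import Path
import tempfile

def list_text_files(directory):
    """Return sorted .txt paths in `directory`, failing loudly if none are found.

    Hypothetical helper for illustration; not part of the original workbook.
    """
    directory = Path(directory)
    if not directory.is_dir():
        raise FileNotFoundError(f"Directory does not exist: {directory.resolve()}")
    files = sorted(directory.glob("*.txt"))
    if not files:
        raise FileNotFoundError(f"No .txt files found in: {directory.resolve()}")
    return files

# Demonstrate on a throwaway directory so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "01_example_1789.txt").write_text("Fellow citizens")
    found = list_text_files(tmp)
    titles = [path.stem for path in found]
    print(titles)
```

Printing the resolved path in the error message helps when a relative path such as `../../` does not point where you expect.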

Next, we will use `CountVectorizer` to count all the words in the addresses and make them into a DataFrame. But first, we define a custom list of stopwords for `CountVectorizer` to remove. The list deliberately leaves out pronouns such as "her," "his," and "theirs," because we're interested in them and want them to be counted.

In [11]:
custom_stopwords = ['what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 've', 'll', 'amp']

In [12]:
# Initialize CountVectorizer: read each input as a filename and drop our custom stopwords
count_vectorizer = CountVectorizer(input='filename',
                                   stop_words=custom_stopwords)

# Plug "text_files" into the initialized count_vectorizer
word_count_vector = count_vectorizer.fit_transform(text_files)
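To get a feel for what the counting step does, here is a rough pure-Python sketch of the same idea: lowercase the text, keep tokens of two or more word characters (which mirrors `CountVectorizer`'s default token pattern), drop the stopwords, and count what remains. The names `count_words`, `sample_stopwords`, and `sample_text` are made up for this illustration, and the stopword list is shortened.

```python
import re
from collections import Counter

# Shortened, hypothetical stopword list for illustration only
sample_stopwords = {"the", "of", "and", "to", "a"}

def count_words(text, stopwords):
    """Lowercase, keep tokens of 2+ word characters, drop stopwords, count the rest."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(token for token in tokens if token not in stopwords)

# Made-up sentence, not taken from any address
sample_text = "The duty of the President is to serve her country and his people."
counts = count_words(sample_text, sample_stopwords)
print(counts.most_common(3))
```

Because "her" and "his" are not in the stopword list, they survive the filtering and show up in the counts.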

## Calculate Word Frequency

Next, we make a DataFrame out of the word frequencies and sort by address title.

In [None]:
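# Make a DataFrame out of the word count vector and sort by title
# Note: get_feature_names_out() requires scikit-learn 1.0+; older versions use get_feature_names()
word_count_df = pd.DataFrame(word_count_vector.toarray(), index=text_titles, columns=count_vectorizer.get_feature_names_out())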
word_count_df = word_count_df.sort_index()
word_count_df

To understand patterns and potential outliers in these addresses, we calculate how many words are in each address:

*🚨 **2.** This code cell will throw an error. Can you identify the issue and debug it?🚨*

In [18]:
word_count_df.sum(axis=1).sort()

AttributeError: 'Series' object has no attribute 'sort'

It looks like some texts, such as George Washington's second address, are significantly shorter than others. We examine this address in more detail:

*🚨 **3.** This code cell will throw an error and lead to a problem. Can you identify the issues and debug them?🚨*

In [20]:
text = open("../../texts/history/US_Inaugural_Addresses/02_washington_1793.txt").read
print(address)

NameError: name 'address' is not defined

## Zoom In

Because we're interested in how the Presidents discussed gender, we zoom in and analyze the frequency of specific words related to gender as well as other descriptions of people.

*🚨 **4.** This code cell will throw an error. Can you identify the issue and debug it?🚨*

In [21]:
word_count_df['men', 'women']

KeyError: ('men', 'women')

🥳 *If you made it this far, you've resolved all the errors! Good work! You should be able to run the rest of the cells in this notebook.* 🥳

In [None]:
word_count_df[['his', 'her']]

President Polk used the word "her" a surprising number of times, so we close read his address to find out how he was using it.

In [None]:
text = open("../../texts/history/US_Inaugural_Addresses/15_polk_1845.txt").read()
print(text)

It turns out that Polk used "her" numerous times in reference to the state of Texas:
> None can fail to see the danger to our safety and future peace if Texas remains an independent state or becomes an ally or dependency of some foreign nation more powerful than **herself**...Is there one who would not prefer free intercourse with **her** to high duties on all our products and manufactures which enter her ports or cross her frontiers? Is there one who would not prefer an unrestricted communication with **her** citizens to the frontier obstructions which must occur if she remains out of the Union?
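Reading a whole address to find every "her" can be slow. As a complement to close reading, the sketch below pulls out each whole-word match with a little surrounding context. The function name `keyword_in_context` and the `width` parameter are hypothetical, and the excerpt is copied from the passage quoted above rather than read from the file.

```python
import re

def keyword_in_context(text, keyword, width=30):
    """Return each whole-word match of `keyword` with up to `width` characters of context per side."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", flags=re.IGNORECASE)
    snippets = []
    for match in pattern.finditer(text):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        # Flatten line breaks so each snippet prints on one line
        snippets.append(text[start:end].replace("\n", " "))
    return snippets

# Excerpt taken from the passage quoted above
excerpt = ("Is there one who would not prefer free intercourse with her "
           "to high duties on all our products and manufactures which "
           "enter her ports or cross her frontiers?")

for snippet in keyword_in_context(excerpt, "her"):
    print(snippet)
```

The word boundaries (`\b`) keep the search from matching "her" inside longer words such as "there."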

## Create Line Plot

Here we make a line plot of gendered language in each of the Inaugural Addresses.

In [None]:
# Make a line plot
ax = word_count_df.plot(y=['women', 'men'],
                        kind='line',
                        linewidth=5,
                        figsize=(15, 10))

# Label the axes and create a title
plt.xlabel('U.S. Inaugural Addresses', fontsize=15)
plt.ylabel('Mentions of Word', fontsize=15)
plt.title('Gendered Language in Presidential Inaugural Addresses',
          fontsize=25)

#plt.tight_layout()
#ax.figure.savefig('Gender-Inaugural-Addresses.png')

In [None]:
# Make a line plot
ax = word_count_df.plot(y=['her', 'his'],
                        kind='line',
                        linewidth=5,
                        figsize=(15, 10))

# Label the axes and create a title
plt.xlabel('U.S. Inaugural Addresses', fontsize=15)
plt.ylabel('Mentions of Word', fontsize=15)
plt.title('Gendered Language in Presidential Inaugural Addresses',
          fontsize=25)

#plt.tight_layout()
#ax.figure.savefig('America-Inaugural-Addresses-Updated.png')