# Making and Labeling Figures — Workbook

In this workbook, we're going to demonstrate how to plot word counts across a group of documents with scikit-learn's `CountVectorizer`.

*Note: You can explore this [workbook](https://mybinder.org/v2/gh/INFO1350/Intro-CA-SP21/master?urlpath=lab/tree/book/COURSE-Final-Project/Workbooks/01-Figures-WORKBOOK.ipynb) in the cloud via Binder.*

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

import pandas as pd
pd.options.display.max_rows = 200
pd.options.display.max_columns = 200

from pathlib import Path  
import glob

Below, we're setting the path to the directory that contains all the text files we want to analyze.

In [None]:
directory_path = "../texts/history/US_Inaugural_Addresses/"

Then we're going to use `glob` and `Path` to make a list of all the filepaths in that directory and a list of all the Inaugural Address titles (the file names without the `.txt` extension).

In [None]:
text_files = glob.glob(f"{directory_path}/*.txt")

In [None]:
text_titles = [Path(text).stem for text in text_files]
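
To see what these two steps produce, here is a minimal sketch that runs them on a throwaway directory. The directory and the file names (`01_first_1789.txt` and so on) are made up for illustration and are not part of the course data. It also wraps `glob.glob` in `sorted`, because `glob` does not promise any particular order; the workbook handles ordering later by sorting the DataFrame index instead.

```python
import glob
import tempfile
from pathlib import Path

# Build a temporary directory with a few hypothetical files
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["02_second_1793.txt", "01_first_1789.txt", "notes.md"]:
        Path(demo_dir, name).write_text("placeholder text")

    # Keep only the .txt files; sorted() makes the order predictable
    demo_files = sorted(glob.glob(f"{demo_dir}/*.txt"))

    # .stem drops the directory and the .txt extension
    demo_titles = [Path(file).stem for file in demo_files]

print(demo_titles)
```

The `notes.md` file is skipped because it does not match the `*.txt` pattern.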

To count all the words in these Inaugural Addresses, we're going to use scikit-learn's `CountVectorizer`.

In [None]:
# Initialize CountVectorizer: read each document from its filename and drop common English stop words
count_vectorizer = CountVectorizer(input='filename', stop_words='english')

# Plug in "text_files," which contains all the Inaugural Address filepaths, to the initialized count_vectorizer
word_count_vector = count_vectorizer.fit_transform(text_files)

In [None]:
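# Make a DataFrame out of the word count vector and sort by title
# Note: get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0 onward
word_count_df = pd.DataFrame(word_count_vector.toarray(), index=text_titles, columns=count_vectorizer.get_feature_names_out())
word_count_df = word_count_df.sort_index()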

In [None]:
word_count_df[['america', 'women', 'men', 'war', 'economy']]

In [None]:
word_count_df['america']

In [None]:
word_count_df.plot(y='america', figsize=(15,10), kind='bar')

## Your Turn!

Provide a title and labels for this plot. Then describe the plot as you might describe it in a paper or blog post.

In [None]:
import matplotlib.pyplot as plt

ax = word_count_df.plot(y='america', figsize=(15,10), kind='bar')

plt.xlabel('X LABEL HERE', fontsize=15)
plt.ylabel('Y LABEL HERE', fontsize=15)
plt.title('TITLE HERE', fontsize=25)

plt.tight_layout()
ax.figure.savefig('America-Inaugural-Addresses.png')

## Describe This Figure  

The figure below shows...

What this pattern suggests is that...

## Appendix: Examine the Documents

If you want to read some of the Inaugural Addresses to get a better sense of how "America" is being used or not used, you can print them out below. 

In [None]:
print(open("../texts/history/US_Inaugural_Addresses/59_biden_2021.txt").read())

In [None]:
print(open("../texts/history/US_Inaugural_Addresses/51_bush_george_h_w_1989.txt").read())

In [None]:
print(open("../texts/history/US_Inaugural_Addresses/58_trump_2017.txt").read())

In [None]:
print(open("../texts/history/US_Inaugural_Addresses/01_washington_1789.txt").read())
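
Reading a whole address to find a handful of mentions can be slow. As an optional alternative, here is a minimal sketch that prints only the text surrounding each occurrence of a word. The helper `keyword_in_context` is hypothetical (it is not part of any library used in this workbook), and the sample sentence is made up.

```python
import re

def keyword_in_context(text, keyword, width=40):
    """Return each occurrence of keyword with `width` characters on either side."""
    snippets = []
    # \b keeps the match to whole words; IGNORECASE mirrors CountVectorizer's lowercasing
    pattern = rf"\b{re.escape(keyword)}\b"
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        # Replace line breaks so each snippet prints on one line
        snippets.append(text[start:end].replace("\n", " "))
    return snippets

# Made-up sample text, for illustration only
sample_text = "We will make America strong again. Together, america will prosper."
sample_snippets = keyword_in_context(sample_text, "america", width=10)

for snippet in sample_snippets:
    print(snippet)
```

To try it on a real address, read one of the files above into a string first, for example with `Path(...).read_text()`, and pass that string in place of `sample_text`.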