# Visualizing the data
This notebook contains the code needed to visualize the processed transcription data. It is divided into two parts:

1. Create simple bar graphs,
2. Create grouped bar graph.

## Instructions for running the code
1. Run the cells in order.
2. If you only want to create the grouped bar graph, you must still run cell 1.1 (Import modules) first.

---

## 1. Create simple bar graphs
This code creates one bar graph per dataset, showing the five most frequent words in that dataset. The four graphs are displayed in a 2x2 grid.

### 1.1 Import modules

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter

### 1.2 Import `transcriptions`, which contains a list of lemmas for all datasets

In [None]:
%store -r transcriptions

### 1.3 Plot data in bar graphs

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(10, 8))

# Loop over the lemmas from each dataset.
for i, transcription in enumerate(transcriptions):
    title = transcription['title']
    text = transcription['text']
    word_counts = Counter(text)
    # Identify the top 5 words for each dataset.
    top_words = word_counts.most_common(5)
    words, counts = zip(*top_words)

    # Select this dataset's position in the 2x2 grid and draw its bar graph.
    ax = axs[i // 2][i % 2]
    ax.bar(words, counts)
    ax.set_title(title)
    ax.set_xlabel('Words')
    ax.set_ylabel('Counts')

# Apply the padding after plotting so that titles and axis labels are taken into account.
fig.tight_layout(pad=5.0)
plt.show()
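
To make the counting step concrete, here is a minimal sketch of how `Counter.most_common` and `zip(*...)` behave on a small, made-up list of lemmas. The `sample_lemmas` list is hypothetical and only stands in for one `transcription['text']`; it is not taken from the datasets.

```python
from collections import Counter

# Hypothetical lemma list standing in for one transcription['text'].
sample_lemmas = ["vote", "woman", "vote", "right", "woman", "vote"]

word_counts = Counter(sample_lemmas)

# most_common(2) returns (word, count) pairs, highest count first.
top_words = word_counts.most_common(2)

# zip(*pairs) splits the pairs into a tuple of words and a tuple of counts.
words, counts = zip(*top_words)

print(top_words)  # [('vote', 3), ('woman', 2)]
print(words)      # ('vote', 'woman')
print(counts)     # (3, 2)
```

Note that if a dataset contained no lemmas at all, `most_common` would return an empty list and the unpacking into `words, counts` would raise a `ValueError`.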

---

## 2. Create grouped bar graph
This code creates a grouped bar graph showing how often each of the five most frequent words in Susan B. Anthony's speeches is used in each year.

### 2.1 Import `speech_list` from Anthony speeches

In [None]:
# Load the speech data from the processing notebook
%store -r speech_list

### 2.2 Group speeches by year

In [None]:
# Some years have multiple speeches, so collect the speeches for each year in a list.
year_speeches = {}
for speech in speech_list:
    year = speech["year"]
    if year not in year_speeches:
        year_speeches[year] = []
    year_speeches[year].append(speech)
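
The same grouping can also be written with `collections.defaultdict`, which removes the explicit membership check. The sketch below is an alternative, not the method used in this notebook, and `sample_speeches` is a hypothetical stand-in that only assumes each record has `"year"` and `"text"` keys like `speech_list`.

```python
from collections import defaultdict

# Hypothetical records with the same "year" and "text" keys as speech_list.
sample_speeches = [
    {"year": 1873, "text": ["vote", "woman"]},
    {"year": 1873, "text": ["right"]},
    {"year": 1876, "text": ["citizen"]},
]

# defaultdict(list) creates the empty list the first time a year is seen.
speeches_by_year = defaultdict(list)
for speech in sample_speeches:
    speeches_by_year[speech["year"]].append(speech)

print({year: len(group) for year, group in speeches_by_year.items()})
# {1873: 2, 1876: 1}
```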

### 2.3 Count word occurrences for each year, excluding "nan" values

In [None]:
# "nan" ("not a number") marks a missing value. Filtering it out excludes cases where there is no speech text for a given year.
year_word_counts = {}
for year, speeches in year_speeches.items():
    word_counts = Counter()
    for speech in speeches:
        words = [word for word in speech["text"] if word != "nan"]
        word_counts.update(words)
    year_word_counts[year] = word_counts
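
As a small illustration of this step, the sketch below counts the lemmas of two made-up speeches from one year while skipping the `"nan"` placeholder. The `speech_texts` data is hypothetical. The comparison matches only the string `"nan"`; a floating-point NaN value, if one were present, would not be filtered by it.

```python
from collections import Counter

# Hypothetical lemma lists for two speeches from the same year.
speech_texts = [
    ["woman", "nan", "vote"],
    ["vote", "nan", "right"],
]

year_counts = Counter()
for text in speech_texts:
    # Drop the placeholder string "nan" before counting.
    year_counts.update(word for word in text if word != "nan")

print(year_counts["vote"])   # 2
print("nan" in year_counts)  # False
```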

### 2.4 Sum word occurrences across all years

In [None]:
word_counts = Counter()
for year_counts in year_word_counts.values():
    word_counts += year_counts
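
Adding `Counter` objects merges their counts key by key. The sketch below shows an equivalent one-line form using `sum` with an empty `Counter` as the start value. The `sample_year_counts` dictionary is hypothetical and only mirrors the shape of `year_word_counts`.

```python
from collections import Counter

# Hypothetical per-year counts, shaped like year_word_counts.
sample_year_counts = {
    1873: Counter({"vote": 3, "woman": 1}),
    1876: Counter({"vote": 2, "right": 4}),
}

# sum() with an empty Counter as the start value adds the counters one by one.
total_counts = sum(sample_year_counts.values(), Counter())

print(total_counts.most_common(2))  # [('vote', 5), ('right', 4)]
```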

### 2.5 Get and print the five most frequent words

In [None]:
top_words = [word for word, count in word_counts.most_common(5)]
print(top_words)

### 2.6 Create grouped bar graph

In [None]:
# Build one list of per-year counts for each top word.
# The loop variable is named `counts` so it does not overwrite `word_counts` from 2.4.
data = []
for word in top_words:
    word_data = []
    for counts in year_word_counts.values():
        word_data.append(counts.get(word, 0))
    data.append(word_data)

bar_width = 0.15
year_labels = list(year_word_counts.keys())
x = np.arange(len(year_labels))
fig, ax = plt.subplots()
colors = ['tab:green', 'tab:orange', 'tab:blue', 'tab:red', 'tab:purple']
# Offset each word's bars so that the five bars are centered on the year's tick.
for i, word_data in enumerate(data):
    ax.bar(x - (2 - i) * bar_width, word_data, bar_width, label=top_words[i], color=colors[i])

# Set the x-axis tick locations and labels
ax.set_xticks(x)
ax.set_xticklabels([int(year) for year in year_labels])

ax.legend()
ax.set_xlabel('Year')
ax.set_ylabel('Frequency')
ax.set_title('Speeches by year and top words')

plt.show()
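
The expression `x - (2 - i) * bar_width` in the plotting loop centers five bars on each year's tick. As a sketch of the general idea, the hypothetical helper `bar_offsets` below (not part of this notebook or of matplotlib) computes the same offsets for any number of bars.

```python
def bar_offsets(n_bars, bar_width):
    """Return the x offset of each bar so the group is centered on its tick."""
    middle = (n_bars - 1) / 2
    return [(i - middle) * bar_width for i in range(n_bars)]


offsets = bar_offsets(5, 0.15)
print([round(offset, 2) for offset in offsets])
# [-0.3, -0.15, 0.0, 0.15, 0.3]
```

With five bars of width 0.15, each group is 0.75 wide, which is less than the spacing of 1 between ticks, so neighboring year groups do not overlap.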