In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("HW01.ipynb")

# Homework 1 - Presidential Inaugural Readability

In this assignment, we will be analyzing the reading level of Presidential Inaugural Speeches. 
By the end of this homework you will be able to answer questions such as:
- How has the reading level of Presidential Inaugural Speeches changed over time?
- Which President gave the longest Inaugural Speech?
- Which political party's Presidents' Inaugural Speeches had the highest and lowest reading levels?


This assignment is inspired by the following related work:
- [The Atlantic](https://www.theatlantic.com/politics/archive/2014/10/have-presidential-speeches-gotten-less-sophisticated-over-time/381410/)
- [Huffington Post](https://www.huffpost.com/entry/trump-speeches-reading-level_n_56e9899fe4b0b25c9184183f?mkv86w29=) (Work by Elliot Schumacher - my grad school labmate)
- [The state of our union is … dumber (The Guardian)](https://www.theguardian.com/world/interactive/2013/feb/12/state-of-the-union-reading-level)

There are Python libraries, e.g. [textstat](https://github.com/shivam5992/textstat), that compute some of the metrics we will be using here. However, in this assignment we will implement different readability metrics from scratch. 


For all problems that you must write explanations and sentences for, you **must** provide your answer in the designated space. Moreover, throughout this homework and all future ones, please be sure to not re-assign variables throughout the notebook! For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously!

**Deadline:**

This assignment is due Monday, May 10th at 11:45 A.M. Late work will be accepted, up to one day per assignment, as per the [policies](http://coms2710.barnard.edu/policies.html) page.

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. Refer to the policies page to learn more about how to learn cooperatively.

You should start early so that you have time to get help if you're stuck. Office hours are held Monday-Friday. The schedule appears on the [Office Hours](http://coms2710.barnard.edu/office-hours.html) page.

Let's begin by running the next cell that will import some python packages relevant for this assignment.

In [None]:
import os
import pandas as pd
import re
import numpy as np
import math

## 1. Explore & Load Data

When working with and analyzing data, it is always a good idea to try to get a sense of what the data looks like.

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q1.1
manual: True
points: 2
-->

**Question 1.1:** Write a bash command in the next cell that will print out the first ten lines of George Washington's Inaugural Speech in 1789. 

*Hint 1: You might want to look at the [explanation of common bash commands on the course webpage](http://coms2710.barnard.edu/bash-commands.html)*

*Hint 2: Remember that to run a bash command in a Jupyter Notebook, we start the command with "!". For example, `!cat data/speeches/2005-George_W._Bush.txt` will print out all of George W. Bush's 2005 Inaugural Speech.*

_Type your answer here, replacing this text._

<!-- END QUESTION -->



By looking at the first ten lines of George Washington's 1789 Inaugural Speech, you should notice that each line represents a new sentence in a speech. This information will be helpful for the next question.

<!--
BEGIN QUESTION
name: q1.2
manual: false
points: 
    - 0.15
    - 0.10
    - 0.10
    - 0.25
    - 0.25
    - 0.25
    - 0.25
    - 4
-->

**Question 1.2:** Fill in the missing code in the next cell to store the contents of each speech in a dataframe called `speeches_df`. The dataframe should contain the following columns: `Year`, `President`, `Speech`. `os.listdir(SPEECH_PATH)` results in a list of the names of the files in `SPEECH_PATH`.
- For the `President` column, make sure to use a white space to separate First and Last names for Presidents. Also, make sure to use *title* case ("Adam Z Poliak" is an example of title case while "Adam z poliak" is not).
- Each item in the `Speech` column should be a list of strings. The *i-th* string in the list should correspond to the *i-th* sentence in the speech. Also, make sure to remove trailing white spaces.
- The `Year` column should be a column of integers, not strings.

*Hint: For title case, python strings have a useful function*
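
To make the idea of title case concrete, here is a minimal sketch on made-up strings (the variable names are illustrative, not part of the starter code). It also shows one caveat worth knowing: `str.title()` treats an apostrophe as a word boundary.

```python
# str.title() capitalizes the first letter of each word and lowercases the rest.
titled = "adam z poliak".title()
print(titled)  # Adam Z Poliak

# Caveat: the letter after an apostrophe is capitalized too.
contraction = "they're here".title()
print(contraction)  # They'Re Here
```

None of the President names in this assignment are assumed to contain apostrophes, so the caveat is only something to keep in mind for other datasets.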

In [None]:
SPEECH_PATH="data/speeches/"

for file in os.listdir(SPEECH_PATH):
    """
    Fill in the rest of this cell to iterate through the speeches in SPEECH_PATH. 
    For each speech, make sure to keep track
    of the year, president, and contents of each speech.
    """;
    ...

    
speeches_df = ...
speeches_df.head(5)

In [None]:
grader.check("q1.2")

#### Sorting the dataframe
Run the next line to look at the first 7 rows in the dataframe.

In [None]:
speeches_df.head(7)

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q1.3
manual: True
points: 1
-->

Above, there most likely is not a clear order of the rows. 

**Question 1.3:** Why are the rows ordered in the way they are?
    
*Hint:* Remember the starter code in Question 1.2. This [documentation](https://docs.python.org/3/library/os.html#os.listdir) might be helpful

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!--
BEGIN QUESTION
name: q1.4
manual: false
points: 
    - 0.25
    - 0.25
    - 0.25
    - 1
-->

**Question 1.4:** Sort the dataframe chronologically in ascending order so that the first row in the dataframe corresponds to the first Presidential Inaugural address. Make sure to reindex the dataframe.
The first row should correspond to George Washington's Inaugural address in 1789.

In [None]:
speeches_df = ...
speeches_df.head(5), speeches_df.tail(5)

In [None]:
grader.check("q1.4")

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q1.5
manual: True
points: 2
-->
**Question 1.5:** Briefly (in at most 2 sentences), what do you think makes a speech or text more readable? List at least two aspects or characteristics that might make a speech or text hard to read and understand.

_Type your answer here, replacing this text._

<!-- END QUESTION -->



## 2. Speech Statistics

Before we implement different functions to determine the reading level of a speech, we will first compute the following statistics for each speech:

- Number of words
- Number of sentences
- Number of syllables per word
- Number of polysyllabic words
- Number of characters

These statistics will be used for different readability metrics in Section 3.

#### Number of sentences

Let's begin by determining the number of sentences in each speech.
<!--
BEGIN QUESTION
name: q2.1
manual: false
points: 
    - 0.25
    - 0.75
    - 0.75
-->

**Question: 2.1** Compute the number of sentences in each speech and store the result in a new column called `Sentence_Count` in the dataframe called `speeches_df` *(even though my solution was one line, it is okay if you have one line of code for each step in your solution)*.

*Hint:* Remember [`map`](https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html) from pandas and lambda functions

In [None]:
speeches_df = ...
speeches_df.head(5)

In [None]:
grader.check("q2.1")

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2.2.1
manual: True
points: 
    - 1
-->
**Question: 2.2.1** Remember from above that each line in the original file contained just one sentence. Briefly describe what bash commands we could use to find the number of sentences in any of the speeches.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2.2.2
manual: True
points: 
    - 1
-->
**Question: 2.2.2** In the next cell, write and run that bash command to determine the number of lines that are in the file that contains George W. Bush's 2005 speech. You will need to change the next cell from a markdown cell to a code cell.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2.2.3
manual: True
points: 
    - 1
-->
**Question: 2.2.3** In the next cell, write and run that bash command to determine the number of lines that are in the file that contains Thomas Jefferson's second inaugural speech. You will need to change the next cell from a markdown cell to a code cell.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!--
BEGIN QUESTION
name: q2.2.4
manual: False
points: 
    - 0
    - 0.1
    - 1
    - 1
-->
**Question: 2.2.4** In the next cell, write code to extract from `speeches_df` the number of sentences in Bush's 2005 speech and Jefferson's second speech. 

In [None]:
jefferson_num_sents = ...
bush_num_sents = ...

f"Bush's 2005 speech had {bush_num_sents} sentences and Jefferson's second speech had {jefferson_num_sents} sentences"

In [None]:
grader.check("q2.2.4")

What we just did is an example of sanity checking. Ideally, the results should be the same. However, you should notice that there is a difference of 1 between the results querying the dataframe compared to using the command line function that counts the number of lines.

For this specific example, that difference of 1 is ok for now. We will not discuss why this is the case here, but if you are interested, consider why the bash command reported one fewer line than the number of sentences stored in the dataframe. 
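
If you do want to investigate, one thing worth checking is how line counters treat the final line of a file. Line-counting tools commonly count newline characters rather than lines of text, so a file whose last line has no trailing newline can come up one short. The sketch below uses a made-up three-sentence string; whether the speech files actually end without a newline is an assumption you would need to verify yourself.

```python
# Hypothetical three-sentence "file" whose last line has no trailing newline.
text = "First sentence.\nSecond sentence.\nThird sentence."

newline_count = text.count("\n")     # what a newline-based counter sees
line_count = len(text.splitlines())  # what reading the text line by line sees

print(newline_count, line_count)  # 2 3
```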

#### Number of words

The next statistic we will compute is the number of words in a speech.

<!--
BEGIN QUESTION
name: q2.3
manual: false
points: 
    - 0.1
    - 0.1
    - 0.40
    - 0.40
-->
**Question: 2.3** Implement the function `get_total_word_count` based on the docstring below.
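
As a small illustration of the "run of characters between whitespace" definition, the sketch below splits a single made-up sentence. Calling `str.split()` with no argument splits on any run of whitespace, so repeated spaces and tabs do not create empty words, and punctuation stays attached to the word it touches.

```python
sentence = "one  two\tthree four."  # note the double space and the tab

words = sentence.split()  # no argument: split on any run of whitespace
print(words)       # ['one', 'two', 'three', 'four.']
print(len(words))  # 4
```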

In [None]:
def get_total_word_count(speech):
    """
    Given a speech (a list of strings), return the number of words in the speech.
    Here, we will define words as a run of characters between whitespace.
    """
    ...
    
test = ["one", 
        "one two", 
        "one two three", 
        "one two three four", 
        "one two three four five",
        "one two three four five six."]
get_total_word_count(test)

In [None]:
grader.check("q2.3")

<!--
BEGIN QUESTION
name: q2.4
manual: false
points: 
    - 0.1
    - 0.9
-->
**Question: 2.4** Use `get_total_word_count` to compute the total number of words in each speech and store the result in a new column called `Total_Word_Count`.

In [None]:
speeches_df = ...
speeches_df.head(5)

In [None]:
grader.check("q2.4")

<!--
BEGIN QUESTION
name: q2.5
manual: false
points: 
    - 0.10
    - 0.9
-->

**Question 2.5:** Compute the average number of words per sentence in each speech and store the result in a new column called `Avg_Word_Count`. Round the results to the nearest hundredth (2 decimals) using `np.round`

*Hint:* You should use the two columns created in the last few questions

In [None]:
speeches_df = ...
speeches_df.head(5)

In [None]:
grader.check("q2.5")

<!--
BEGIN QUESTION
name: q2.6
manual: false
points: 
    - 0.1
    - 0.1
    - 0.5
    - 0.5
    
-->

**Question 2.6:** Which President and in what year had the longest sentences on average and what was the average number of words per sentence in the speech? Make sure to store the corresponding results in `president_longest_sentences`, `year_longest_sentences`, `avg_longest_sentences_count`.

*You do not need to complete the first two lines, they are there as a hint for the way I answered this question*

In [None]:
longest_avg_sent = ...
longest_row = ...

president_longest_sentences = ...
year_longest_sentences = ...
avg_longest_sentences_count = ...

f"{president_longest_sentences} averaged {avg_longest_sentences_count} words per sentence in {year_longest_sentences}"

In [None]:
grader.check("q2.6")

<!-- BEGIN QUESTION -->

####  Syllables
So far the statistics we've computed are at the word level. However, words with many syllables are often associated with more complex reading levels. In the next few questions we will focus on getting statistics based on the number of syllables in a word.

<!--
BEGIN QUESTION
name: q2.7.1
manual: True
points: 2  
-->

**Question 2.7.1:** In the next cell, briefly describe how you would go about counting the number of syllables in a word programmatically.

_Type your answer here, replacing this text._

<!-- END QUESTION -->



When programming, we do not want to reinvent the wheel, and a specific problem we are facing, e.g. counting the number of syllables in a word, is often not unique to our specific use case. In these cases, it is useful to leverage existing Python libraries/packages that implement the solution we are looking for. 

#### PIP
pip (which stands for "pip installs packages") is a command-line system to install and manage Python packages. Many Python packages can be installed via pip. You can search for any package on the Python Package Index (PyPI) [webpage](https://pypi.org/). [W3Schools](https://www.w3schools.com/python/python_pip.asp) gives a good overview of pip. 

Counting the number of syllables in a word is not that uncommon, especially when analyzing the readability of text. Instead of implementing an algorithm to compute the number of syllables in a word ourselves, we will use a publicly available Python package called [`python-syllables`](https://github.com/prosegrinder/python-syllables). 

Run the next cell to install the python package using pip

In [None]:
!pip install syllables

To access the syllables package now, we need to import the package in python. Run the next code cell to import the `syllables` python package. The line of code will tell us the version number of the package

In [None]:
import syllables
syllables.__version__

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2.7.2
manual: True
points: 
    - 0.1
    - 0.75
    - 0.75
    
-->

Now we have access to the python library called `syllables`. 

**Question: 2.7.2:** Based on the package's [documentation](https://github.com/prosegrinder/python-syllables), use the package to determine how many syllables are in the words `hello`, `supercalifragilisticexpialidocious`? 
Assign the result of the expression to the corresponding variables below.

In [None]:
hello_syl_count = ...
super_syl_count = ...

f"There are {hello_syl_count} syllables in hello and {super_syl_count} syllables in supercalifragilisticexpialidocious"

In [None]:
grader.check("q2.7.2")

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2.7.3
manual: True
points: 2
    
-->

Often times we will use existing packages as black boxes. However, it is sometimes useful to understand what is going on under the hood of a function in a package. 
The package's `estimate()` function is implemented in these [13 lines of code](https://github.com/prosegrinder/python-syllables/blob/630c08786c72459e5a961503fefa5acb967f8d30/syllables/__init__.py#L176-L199)

**Question 2.7.3:** Briefly (in at most 3 sentences) describe in English how the `estimate()` function works.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!--
BEGIN QUESTION
name: q2.7.4
manual: False
points: 
    - 0.1
    - 0.9
    
-->

**Question: 2.7.4** Implement the function `get_total_syllable_count` based on the docstring below.

In [None]:
def get_total_syllable_count(speech):
    """
    Given a speech (a list of strings), return the number of syllables in the speech.
    """
    ...
    
test = ["welcome to the jungle", "we got fun"]
get_total_syllable_count(["hello"]), get_total_syllable_count(test)

In [None]:
grader.check("q2.7.4")

<!--
BEGIN QUESTION
name: q2.7.5
manual: False
points: 
    - 0.1
    - 0.9
  
  
-->
**Question: 2.7.5** Use `get_total_syllable_count` to compute the total number of syllables in each speech and store the result in a new column called `Total_Syllable_Count`.

In [None]:
speeches_df = ...
speeches_df.head(5)

In [None]:
grader.check("q2.7.5")

####  Polysyllabic Words

<!--
BEGIN QUESTION
name: q2.7.6
manual: False
points: 
    - 0.1
    - 0.9
    
-->

**Question: 2.7.6** Implement the function `get_polysyllable_word_count` based on the docstring below.

In [None]:
def get_polysyllable_word_count(speech):
    """
    Given a speech (a list of strings), return the number of words that contain 3 or more syllables in the speech.
    """
    ...
    
test = ["The SMOG grade is a readability metric",
        "It estimates the years of education needed to understand a piece of writing."]
get_polysyllable_word_count(["hello"]), get_polysyllable_word_count(test)

In [None]:
grader.check("q2.7.6")

<!--
BEGIN QUESTION
name: q2.7.7
manual: False
points: 
    - 0.1
    - 0.9
  
  
-->
**Question: 2.7.7** Use `get_polysyllable_word_count` to compute the total number of polysyllabic words in each speech and store the result in a new column called `Total_Polysyllable_Count`.

In [None]:
speeches_df = ...
speeches_df.head(5)

In [None]:
grader.check("q2.7.7")

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2.7.8
manual: True
points: 2  
-->

We often have multiple options of which python packages to use. For example, `python-syllables` is not the only publicly available python package that we can use to count the number of syllables in a word. [SyllaPy](https://github.com/mholtzscher/syllapy) is one as well. 
 
**Question 2.7.8:** Give two possible reasons why we might choose one python package over another?

_Type your answer here, replacing this text._

<!-- END QUESTION -->



### Characters

The next statistic we will compute is the number of characters in a speech. Remember that characters here are not entities in a story. ["Characters are anything you type in a keyboard"](https://www.pythonforbeginners.com/basics/string-manipulation-in-python).
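
To see what "anything you type" means in practice, the sketch below counts the characters of one made-up string in two ways. `len()` counts every character, including spaces, while summing over whitespace-separated words leaves the spaces out. Which convention the autograder expects is not stated here, so treat this only as an illustration of the difference.

```python
sentence = "one two"

total_chars = len(sentence)                              # includes the space
non_space_chars = sum(len(w) for w in sentence.split())  # excludes whitespace

print(total_chars, non_space_chars)  # 7 6
```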

<!--
BEGIN QUESTION
name: q2.8
manual: false
points: 
    - 0.5
    - 0.5
-->
**Question: 2.8** Implement the function `get_total_char_count` based on the docstring below.

In [None]:
def get_total_char_count(speech):
    """
    Given a speech (a list of strings), return the number of characters in the speech.
    Here, we will define words as a run of characters between whitespace, including punctuation.
    """
    ...
    
test = ["one", 
        "one two"]
get_total_char_count(test)

In [None]:
grader.check("q2.8")

<!--
BEGIN QUESTION
name: q2.9
manual: false
points: 
    - 0.1
    - 0.9
-->
**Question: 2.9** Use `get_total_char_count` to compute the total number of characters in each speech and store the result in a new column called `Total_Char_Count`.

In [None]:
speeches_df = ...
speeches_df.head(5)

In [None]:
grader.check("q2.9")

## 3. Readability Metrics

We are now ready to determine the reading level of different speeches. We are going to implement the following four readability tests for English texts:
- Flesch Reading Ease
- Flesch Kincaid Grade Level
- Simple Measure of Gobbledygook (SMOG)
- Automated Readability Index (ARI)


<!--
BEGIN QUESTION
name: q3.1.1
manual: False
points: 0
-->
### Flesch reading ease (FRES)

The first readability metric we will compute is the Flesch reading ease score (FRES). 
The equation for FRES is:
$$206.835 - 1.015 \left( \frac{\text{words}}{\text{sentences}} \right) - 84.6 \left( \frac{\text{syllables}}{\text{words}} \right)$$

Where *words* is the number of words, *sentences* is the number of sentences, and *syllables* is the number of syllables. 
See [Wikipedia](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch_reading_ease) for more information about this metric.
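
As a quick check of the arithmetic, here is a minimal sketch that evaluates the formula directly from raw counts. The helper name `flesch_reading_ease` is hypothetical and takes plain numbers, unlike the row-based `fres()` used in this notebook. The counts (6 words, 1 sentence, 6 syllables) are for "The cat sat on the mat", where every word has one syllable.

```python
def flesch_reading_ease(words, sentences, syllables):
    """FRES from raw counts (hypothetical helper, not the notebook's fres(row))."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

# "The cat sat on the mat": 6 words, 1 sentence, 6 one-syllable words.
cat_score = flesch_reading_ease(words=6, sentences=1, syllables=6)
print(round(cat_score, 3))  # 116.145
```

Higher scores mean easier text, so a sentence made only of short, one-syllable words lands above 100.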


**Question 3.1.1**
Complete the function `fres()` to compute the Flesch Reading Ease score for an entire speech.

In [None]:
def fres(row):
    """
    Given a row (speech) in the dataframe, return the Flesch reading ease score of the speech in the row.
    """

<!--
BEGIN QUESTION
name: q3.1.2
manual: False
points: 
    - 0.2
    - 0.8
    
-->

**Question 3.1.2:** As mentioned on Wikipedia, the Flesch Reading Ease Score for the sentence "*The cat sat on the mat*" should be 116. Fill in the following dataframe `cat_mat_df` to test that the `fres` function works correctly. 

In [None]:
cat_mat_df = pd.DataFrame().assign(Speech = [["The cat sat on the mat."]])
...
...
...
...

cat_fres_score = ...

f"The fres score for The cat sat on the mat is {cat_fres_score}"

In [None]:
grader.check("q3.1.2")

<!--
BEGIN QUESTION
name: q3.1.3
manual: False
points: 
    - 0.2
    - 0.8
    
-->

**Question 3.1.3:** In the next cell, use the DataFrame [apply](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) function to apply the `fres()` function to each speech and store the resulting pandas Series in the variable named `fres_scores`.
Then, add the Series to the dataframe in a column called `Fres_Score`. 

In [None]:
fres_scores = ...
speeches_df = ...
speeches_df.head(5)

In [None]:
grader.check("q3.1.3")

<!--
BEGIN QUESTION
name: q3.1.4
manual: False
points: 
    - 0.1
    - 0.1
    - 0.9
    - 0.9
    
-->

**Question 3.1.4:** Based on the FRES metric, compute which President's speech and in which year was the easiest to read. Assign the value of the expressions to `easiest_fres_pres`, `easiest_fres_year`.

In [None]:
easiest_fres_pres = ...
easiest_fres_year = ...

f"{easiest_fres_pres} speech in {easiest_fres_year} was the easiest Inaugural Address to read"

In [None]:
grader.check("q3.1.4")

<!--
BEGIN QUESTION
name: q3.1.5
manual: False
points: 
    - 0.1
    - 0.1
    - 0.9
    - 0.9
    
-->

**Question 3.1.5:** Based on the FRES metric, compute which President's speech and in which year was the hardest to read. Assign the value of the expressions to `hardest_fres_pres`, `hardest_fres_year`.

In [None]:
hardest_fres_pres = ...
hardest_fres_year = ...

f"{hardest_fres_pres} speech in {hardest_fres_year} was the hardest Inaugural Address to read"

In [None]:
grader.check("q3.1.5")

<!--
BEGIN QUESTION
name: q3.2.1
manual: False
points: 0
-->

### Flesch–Kincaid grade level

The second metric we will compute is the Flesch–Kincaid grade level metric. 
The Flesch–Kincaid Grade Level Formula is:
$$0.39 \left ( \frac{\mbox{words}}{\mbox{sentences}} \right ) + 11.8 \left ( \frac{\mbox{syllables}}{\mbox{words}} \right ) - 15.59$$ 


Where *words* is the number of words, *sentences* is the number of sentences, and *syllables* is the number of syllables. 
See [Wikipedia](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch%E2%80%93Kincaid_grade_level) for more information about this metric.
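
The formula can be checked by hand with a small sketch that works from raw counts. The helper name `flesch_kincaid_grade` is hypothetical and separate from the row-based `fk_grade_level()` in this notebook. The counts used (13 words, 24 syllables, 1 sentence) are the ones Wikipedia reports for its platypus example sentence.

```python
def flesch_kincaid_grade(words, sentences, syllables):
    """FK grade level from raw counts (hypothetical helper, not fk_grade_level(row))."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Counts as reported on Wikipedia for the platypus sentence.
platypus_grade = flesch_kincaid_grade(words=13, sentences=1, syllables=24)
print(round(platypus_grade, 1))  # 11.3
```

Unlike FRES, a higher value here means harder text, since the result is meant to approximate a U.S. school grade.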


**Question 3.2.1**
Complete the function `fk_grade_level()` to compute the Flesch-Kincaid grade level score for an entire speech.

In [None]:
def fk_grade_level(row):
    
    """
    Given a row (speech) in the dataframe, return the Flesch-Kincaid grade level score of the speech in the row.
    """

<!--
BEGIN QUESTION
name: q3.2.2
manual: False
points: 
    - 0.5
    
-->

**Question 3.2.2:** As mentioned on Wikipedia, the Flesch–Kincaid grade level for the sentence "*The Australian platypus is seemingly a hybrid of a mammal and reptilian creature*" is 11.3. 

Fill in the following dataframe `aus_df` to test that the `fk_grade_level` function works correctly. If the FK grade level is about 12.17, that is close enough here. 


In [None]:
aus_df = pd.DataFrame().assign(Speech = [["The Australian platypus is seemingly a hybrid of a mammal and reptilian creature"]])
...
...
...
...

aus_fk_grade = ...

f"The FK grade level for the Australian platypus sentence is {aus_fk_grade}"

In [None]:
grader.check("q3.2.2")

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q3.2.3
manual: True
points: 1
    
-->

**Question 3.2.3:** Why in your function does the FK grade level for "*The Australian platypus is seemingly a hybrid of a mammal and reptilian creature*" come out to about 12 rather than 11.3?

*Hint:* Look at the values in `aus_df` and compare them with the description in Wikipedia

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!--
BEGIN QUESTION
name: q3.2.4
manual: False
points: 
    - 0.25
    - 0.25
    
-->

**Question 3.2.4:** In the next cell, use the DataFrame [apply](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) function to apply the `fk_grade_level` function to each speech and store the resulting pandas Series in the variable named `fk_grade_scores`.
Then, add the Series to the dataframe in a column called `FK_Grade_Score`. 

In [None]:
fk_grade_scores = ...
speeches_df = ...
speeches_df.head(5)

In [None]:
grader.check("q3.2.4")

<!--
BEGIN QUESTION
name: q3.2.5
manual: False
points: 
    - 0.1
    - 0.1
    - 0.9
    - 0.9
    
-->

**Question 3.2.5:** Based on the FK Grade metric, compute which President's speech and in which year was the easiest to read. Assign the value of the expressions to `easiest_fk_pres`, `easiest_fk_year`.

In [None]:
easiest_fk_pres = ...
easiest_fk_year = ...

f"{easiest_fk_pres} speech in {easiest_fk_year} was the easiest Inaugural Address to read"

In [None]:
grader.check("q3.2.5")

<!--
BEGIN QUESTION
name: q3.2.6
manual: False
points: 
    - 0.1
    - 0.1
    - 0.9
    - 0.9
    
-->

**Question 3.2.6:** Based on the FK Grade metric, compute which President's speech and in which year was the hardest to read. Assign the value of the expressions to `hardest_fk_pres`, `hardest_fk_year`.

In [None]:
hardest_fk_pres = ...
hardest_fk_year = ...

f"{hardest_fk_pres} speech in {hardest_fk_year} was the hardest Inaugural Address to read"

In [None]:
grader.check("q3.2.6")

<!--
BEGIN QUESTION
name: q3.3.1
manual: False
points: 0
-->

### Simple Measure of Gobbledygook (SMOG)

The third metric we will compute is SMOG. 
As described on [Wikipedia](https://en.wikipedia.org/wiki/SMOG), the formula for computing SMOG is:
$$1.0430 \sqrt{\mbox{polysyllables}\times{30 \over \mbox{sentences}} } + 3.1291$$  

Where *polysyllables* is the number of polysyllables in a speech and *sentences* is the number of sentences in a speech.

**Question 3.3.1:** Complete the function `SMOG()` to compute the SMOG metric for an entire speech.
If the speech has fewer than 30 sentences, return -1.
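
To get a feel for the numbers, the sketch below evaluates the formula on made-up counts. The helper name `smog_grade` and the counts are hypothetical, and it takes plain numbers rather than a dataframe row. It also mirrors this question's convention of returning -1 for texts with fewer than 30 sentences.

```python
import math

def smog_grade(polysyllables, sentences):
    """SMOG from raw counts (hypothetical helper, not the notebook's SMOG(row))."""
    if sentences < 30:
        return -1  # this assignment's convention for short texts
    return 1.0430 * math.sqrt(polysyllables * 30 / sentences) + 3.1291

long_text_score = smog_grade(polysyllables=30, sentences=30)
short_text_score = smog_grade(polysyllables=5, sentences=10)

print(round(long_text_score, 2))  # 8.84
print(short_text_score)           # -1
```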

In [None]:
def SMOG(row):
    
    """
    Given a row (speech) in the dataframe, return the SMOG score of the speech in the row.
    """
    

<!--
BEGIN QUESTION
name: q3.3.2
manual: False
points: 
    - 0.25
    - 0.25    
-->

**Question 3.3.2:** In the next cell, use the DataFrame [apply](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) function to apply the `SMOG` function to each speech and store the resulting pandas Series in the variable named `smog_scores`.
Then, add the Series to the dataframe in a column called `SMOG_Score`. 

In [None]:
smog_scores = ...
speeches_df = ...
speeches_df.head(5)

In [None]:
grader.check("q3.3.2")

<!--
BEGIN QUESTION
name: q3.3.3
manual: False
points: 
    - 0.1
    - 0.1
    - 0.9
    - 0.9
    
-->

**Question 3.3.3:** Based on the SMOG score metric, compute which President's speech and in which year was the easiest to read. Assign the value of the expressions to `easiest_smog_pres`, `easiest_smog_year`.

In [None]:

easiest_smog_pres = ...
easiest_smog_year = ...

f"{easiest_smog_pres} speech in {easiest_smog_year} was the easiest Inaugural Address to read"

In [None]:
grader.check("q3.3.3")

<!--
BEGIN QUESTION
name: q3.3.4
manual: False
points: 
    - 0.1
    - 0.1
    - 0.9
    - 0.9
    
-->

**Question 3.3.4:** Based on the SMOG metric, compute which President's speech and in which year was the hardest to read. Assign the value of the expressions to `hardest_smog_pres`, `hardest_smog_year`.

In [None]:

hardest_smog_pres = ...
hardest_smog_year = ...

f"{hardest_smog_pres} speech in {hardest_smog_year} was the hardest Inaugural Address to read"

In [None]:
grader.check("q3.3.4")

### Automated readability index (ARI)

The fourth (and last) metric we will use is the Automated Readability Index (ARI). As described on [Wikipedia](https://en.wikipedia.org/wiki/Automated_readability_index), the formula for computing ARI is:
$$4.71 \left (\frac{\mbox{characters}}{\mbox{words}} \right) + 0.5 \left (\frac{\mbox{words}}{\mbox{sentences}} \right)  - 21.43$$  

where *characters*, *words*, *sentences* are the number of characters, words and sentences in a speech.
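
Here is a minimal sketch of the ARI arithmetic on a single short sentence. The helper name `automated_readability_index` is hypothetical, and the character count below deliberately excludes whitespace, which is an assumption made for this example rather than a rule stated in this notebook.

```python
def automated_readability_index(characters, words, sentences):
    """ARI from raw counts (hypothetical helper, not the notebook's ARI(row))."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43

sentence = "The cat sat on the mat"
tokens = sentence.split()
characters = sum(len(t) for t in tokens)  # 17 letters; whitespace excluded (assumption)

ari = automated_readability_index(characters, len(tokens), sentences=1)
print(round(ari, 3))  # -5.085
```

Very short, simple sentences can produce negative values, which is a reminder that the index is only a rough estimate of grade level.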

<!--
BEGIN QUESTION
name: q3.4.1
manual: False
points: 0
-->

**Question 3.4.1:** Complete the function `ARI()` to compute the ARI metric for an entire speech.

In [None]:
def ARI(row):
        
    """
    Given a row (speech) in the dataframe, return the ARI score of the speech in the row.
    """
    

<!--
BEGIN QUESTION
name: q3.4.2
manual: False
points: 
    - 0.25
    - 0.25    
-->

**Question 3.4.2:** In the next cell, use the DataFrame [apply](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) function to apply the `ARI` function to each speech and store the resulting pandas Series in the variable named `ari_scores`.
Then, add the Series to the dataframe in a column called `ARI_Score`. 

In [None]:
ari_scores = ...
speeches_df = ...
speeches_df.tail(5)

In [None]:
grader.check("q3.4.2")

<!--
BEGIN QUESTION
name: q3.4.3
manual: False
points: 
    - 0.1
    - 0.1
    - 0.9
    - 0.9
    
-->

**Question 3.4.3:** Based on the ARI metric, compute which President's speech and in which year was the 2nd easiest to read. Assign the value of the expressions to `easiest_ari_pres`, `easiest_ari_year`.

In [None]:


easiest_ari_pres = ...
easiest_ari_year = ...

f"{easiest_ari_pres}'s speech in {easiest_ari_year} was the second easiest Inaugural Address to read"

In [None]:
grader.check("q3.4.3")

<!--
BEGIN QUESTION
name: q3.4.4
manual: False
points: 
    - 0.1
    - 0.1
    - 0.9
    - 0.9
    
-->

**Question 3.4.4:** Based on the ARI metric, compute which President's speech and in which year was the second hardest to read. Assign the value of the expressions to `hardest_ari_pres`, `hardest_ari_year`.

In [None]:

hardest_ari_pres = ...
hardest_ari_year = ...


f"{hardest_ari_pres}'s speech in {hardest_ari_year} was the second hardest Inaugural Address to read"

In [None]:
grader.check("q3.4.4")

## 5. Visualization

There are other metrics for quantifying readability but for this assignment we will stick with these 4. 

<!--
BEGIN QUESTION
name: q3.5.1
manual: False
points: 
    - 0.1
    
-->
**Question 3.5.1** Create a dataframe called `metrics_df` that contains the columns `Year`, `President`, and the metrics that were just computed. 

In [None]:
metrics_df = ...
metrics_df.head(5)

In [None]:
grader.check("q3.5.1")

<!--
BEGIN QUESTION
name: q3.5.2
manual: False
points: 
    - 0
    - 1
    
-->

**Question 3.5.2** We would like to determine whether there is a consistent change in the readability of Inaugural addresses over time. What visualization or type of figure would be best to determine whether there is a trend in readability over time?

- a. bar
- b. scatter
- c. line
- d. histogram

Assign the correct answer in the form `a`, `b`, `c`, or `d` to the variable `correct_visualization` in the next cell.

In [None]:
correct_visualization = ...

In [None]:
grader.check("q3.5.2")

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q3.5.3
manual: True
points: 5
    
-->

**Question 3.5.3** In the next cell, generate the type of graph answered in the last question. Make sure to make one plot for each of the metrics.

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q3.5.4
manual: True
points: 2   
-->

**Question 3.5.4** The Guardian article referenced at the top of this assignment analyzes Presidential State of the Union Addresses. How do your findings compare to the findings from the Guardian article? 

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q3.5.5
manual: True
points: 2   
-->

**Question 3.5.5** In the next cell, 1) describe another research question you could ask about the Inaugural addresses based on these metrics, and 2) sketch out how you would go about answering this new question.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

## 6. Additional Applications of Readability

There are other readability metrics that have been developed. This [survey](https://community.ksde.org/LinkClick.aspx?fileticket=t8hDydbT-jo%3D&tabid=5575&mid=13625) contains an in-depth discussion of some of these other metrics if you are interested.

Some of the readability metrics you implemented have been used to study language in other settings. For example, when exploring whether advertising on chips targeted toward consumers of high socioeconomic status uses different language than that on chips designed to appeal to lower status consumers, Freedman and Jurafsky discovered that *expensive chips use more complex language than inexpensive chips* ([link](https://web.stanford.edu/~jurafsky/freedmanjurafsky2011.pdf)).

<!--
BEGIN QUESTION
name: q4
manual: True
points: 1   
-->

**Question 4:** With an eye towards your project, what other use cases can you think about using readability for?

_Type your answer here, replacing this text._

<!-- END QUESTION -->



## 7. Feedback


<!--
BEGIN QUESTION
name: q5.1
manual: false
points: 1   
-->

**Question:** Roughly how many hours did you spend on this assignment? Assign the total number of hours to the variable `time_spent`.

In [None]:
time_spent = ...
time_spent

In [None]:
grader.check("q5.1")

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q5.2
manual: True
points: 1   
-->

**Optional:** Provide any comments or feedback below

_Type your answer here, replacing this text._

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export()