# HW 2: Candidates vs. Presidents

**Content Warning:** This assignment discusses United States politics.

In this homework assignment, you will study the statistical characteristics of speeches made by United States presidents and presidential candidates. In HW 1, you implemented code for conducting keyword analysis on a tokenized corpus; you will have an opportunity to use your code from HW 1 for this assignment. In this assignment, you will expand upon HW 1 by implementing the following functionalities:
- extracting text from webpages
- repairing NLTK corpora with encoding errors
- conducting _key feature analysis_ [(Egbert and Biber, 2023)](https://www.euppublishing.com/doi/10.3366/cor.2023.0275), where keyness is measured for high-level syntactic–pragmatic features instead of just token types
- comparing sets of corpora with one another, instead of just comparing a single target corpus to a single reference corpus.

## Problem 0: Background (No Submission, 0 Points)

In this problem, you will learn relevant background information about the United States presidency.

This problem was written primarily for the benefit of international students and/or students not familiar with US politics. If you are familiar with US presidential inaugurations and nominating conventions, you may skip this problem.

### Problem 0a: Understand US Presidential Elections (No Submission, 0 Points)

The United States president is chosen every four years via a _presidential election_. Most US citizens can participate in presidential elections by _voting_ in one of the 50 states or the District of Columbia. Exceptions include citizens under 18 years of age, citizens convicted of certain crimes, citizens not registered to vote, and citizens residing in [US territories](https://en.wikipedia.org/wiki/Territories_of_the_United_States).

Most US politicians, as well as many US voters, belong to one of two _political parties_: the [Democratic Party](https://democrats.org/) and the [Republican Party](https://www.gop.com/). The current president, Donald Trump, belongs to the Republican Party, while the previous president, Joe Biden, belongs to the Democratic Party. Presidential candidates may also represent smaller parties (like the [Libertarian Party](https://lp.org/) or the [Green Party](https://www.gp.org/)) or run without a party affiliation, but such candidates are usually unsuccessful. The most recent president who belonged to neither the Democratic nor the Republican Party was [Millard Fillmore](https://en.wikipedia.org/wiki/Millard_Fillmore) from the [Whig Party](https://en.wikipedia.org/wiki/Whig_Party_(United_States)), who served from 1850 to 1853.

Before each presidential election, each party wishing to participate in the election holds a _nominating convention_ where the official candidate for that party is chosen. The nominating conventions for the most recent presidential election (2024) are shown below.

| Convention | Abbreviation | Location | Dates | Presidential Nominee |
|---|---|---|---|---|
| Democratic National Convention | (DNC) | Chicago, IL | Aug. 19–22 | Kamala Harris | 
| Republican National Convention | (RNC) | Milwaukee, WI | Jul. 15–18 | Donald Trump | 
| Libertarian National Convention | (LNC) | Washington, DC | May 24–27 | Chase Oliver | 
| Green National Convention | (GNC) | Online | Aug. 15–18 | Jill Stein | 

After each presidential election, the winner of the election officially becomes president during the _inauguration ceremony_, held in January the following year.

### Problem 0b: Understand US Presidential Speeches (No Submission, 0 Points)

In this assignment, we will be studying two kinds of speeches:
- _nomination acceptance speeches_ given at a party convention, where a new presidential candidate explains why they should be president
- _inaugural addresses_ given at an inauguration ceremony, where a new president explains their goals for the next four years.

Here is an example of each kind of speech.

In [1]:
# Code for embedding YouTube videos in a Jupyter Notebook
from IPython.display import display, HTML, IFrame

display(HTML("<h4>Candidate Kamala Harris's 2024 Nomination Acceptance Speech</h4>"))
display(IFrame("https://www.youtube.com/embed/9hy2cZaGbOM?si=v8_rHO7-AyVXXsam", width=560, height=315))

display(HTML("<h4>President Donald Trump's 2025 Inaugural Address</h4>"))
display(IFrame("https://www.youtube.com/embed/WQuk73KIqZ8?si=W_QYPMZtV3PHuuNG", width=560, height=315))

## Problem 1: Programming Exercises (5 Points in Total + 1 Point Extra Credit)

In these exercises, you will learn and practice computer programming concepts needed for Problems 2 and 3.

**Note:** In many parts of Problems 2 and 3, you will benefit from referring back to this problem.

### Problem 1a: Understand UTF-8 Morphosyntax (Written, 1 Point)

Recall that UTF-8 encodes characters using **1 to 4 bytes**—that is, each character is represented as a sequence of 8, 16, 24, or 32 **bits** (1s and 0s).
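As a quick check of these byte counts, the following sketch encodes a few characters with Python's built-in `str.encode` and reports how many bytes each one takes. The sample characters are picked from the table below purely for illustration.

```python
# Count how many bytes UTF-8 uses for a few sample characters.
samples = ["x", "¥", "嗨", "😅"]

byte_counts = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_counts.items():
    print(f"{ch!r} takes {n} byte(s), i.e. {8 * n} bits")
```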

Please look at the following characters and their binary encodings in UTF-8.

|   | UTF-8 Binary |   | UTF-8 Binary |
|---|--------------|---|--------------|
| `\n` | **0**0001010 | `ਊ`  | **1110**0000 **10**101000 **10**001010  |
| `8`  | **0**0111000 | `Ꮚ` | **1110**0001 **10**001111 **10**001010  |
| `L`  | **0**1001100 | `嗨` | **1110**0101 **10**010111 **10**101000 |
| `x`  | **0**1111000 | `짷` | **1110**1100 **10**100111 **10**110111 |
| `¥` | **110**00010 **10**100101 | `𒀂` | **11110**000 **10**010010 **10**000000 **10**000010 |
| `ř` | **110**00101 **10**011001 | `𓅈` | **11110**000 **10**010011 **10**000101 **10**001000 |
| `Ψ` | **110**01110 **10**101000 | `𝄞` | **11110**000 **10**011101 **10**000100 **10**011110 |
| `Թ` | **110**10100 **10**111001 | `😅` | **11110**000 **10**011111 **10**011000 **10**000101 |

Notice that the bytes (sequences of 8 bits) in these encodings seem to have a systematic "morphology." What do the prefixes 0-, 10-, 110-, 1110-, and 11110- mean? 

**Hint:** The prefixes are highlighted in bold.

### Problem 1b: Explain UTF-8 Morphosyntax (Written, 1 Point Extra Credit)

Why do you think UTF-8 has this "morphosyntax"?

**Hint:** Look at the following UTF-8 binary code:
```
01011001 11000011 10111010 01101100 11000011 10111001
```
Without looking it up, can you tell how many characters are in the above byte sequence? Can you tell where the character boundaries are?

### Problem 1c: Understand Python Errors (No Submission, 0 Points)

In class, we have seen many instances where problematic Python code can result in an error. There are two kinds of errors in Python: _syntax errors_ and _runtime errors_. Runtime errors are also known as _exceptions_.

Syntax errors occur when you try to run Python code that is "ungrammatical"; i.e., its syntax does not comply with the Python grammar. For example:

In [2]:
# Code that triggers a syntax error
1 +

SyntaxError: invalid syntax (918557178.py, line 2)

Runtime errors occur when you try to run Python code that is "semantically infelicitous." Such code is syntactically valid (and therefore you can run it), but while running the code, you end up trying to perform a computation that is meaningless or logically incoherent. For example, the following code triggers a `ZeroDivisionError`, a built-in exception (specifically, a subclass of `ArithmeticError`):

In [3]:
# Code that triggers a runtime error
5 / 0

ZeroDivisionError: division by zero

When writing Python code, you can manually trigger a runtime error by using the `raise` keyword, like so.

In [4]:
raise RuntimeError

RuntimeError: 
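Most of the time, you will want to attach a message to the error and raise it only under some condition. The sketch below does this in a hypothetical helper, `relative_frequency`, which is not part of the assignment code; the function name and the choice of `ValueError` are assumptions made for illustration.

```python
def relative_frequency(count: int, total: int) -> float:
    """Return count / total, refusing totals that make no sense."""
    if total <= 0:
        # Raise manually, with a message that explains what went wrong
        raise ValueError(f"total must be positive, but got {total}")
    return count / total


print(relative_frequency(3, 12))  # a valid call; no error is raised
```

Calling `relative_frequency(3, 0)` instead would stop the program with `ValueError: total must be positive, but got 0`.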

Some "real-life" examples of errors occur in the `.py` files provided with this assignment. Each of the functions you are supposed to implement starts out with the following code:
```python
raise NotImplementedError("Please replace this line with your code.")
```
If you try to run one of these functions from this notebook, you will get a runtime error, more specifically a `NotImplementedError`. If someone else were using your code to perform keyword analysis, the `NotImplementedError` would tell them that some of your functions are not working because you have not implemented them yet.

In [5]:
# Code that triggers a NotImplementedError
import hw2

hw2.function_without_implementation()

[nltk_data] Downloading package punkt_tab to [path redacted]...
[nltk_data]   Package punkt_tab is already up-to-date!


NotImplementedError: This function has no implementation.

### Problem 1d: Understand Exception Handling (Written, 1 Point)

A very useful Python construct for dealing with runtime errors is the _`try`—`except`_ block. This construct does something called _exception handling_. Here's an example of how it works.

In [6]:
try:
    print(5 / 0)
except ZeroDivisionError:
    print("I tried to divide by zero!")

I tried to divide by zero!


And here's a somewhat more "realistic" example.

In [7]:
for i in range(-5, 5):
    try:
        print(f"1/{i} = {1 / i}")
    except ZeroDivisionError:
        pass  # "pass" means "do nothing"

1/-5 = -0.2
1/-4 = -0.25
1/-3 = -0.3333333333333333
1/-2 = -0.5
1/-1 = -1.0
1/1 = 1.0
1/2 = 0.5
1/3 = 0.3333333333333333
1/4 = 0.25


What does the `try`—`except` block do? When does the code under the `try` block run, and when does the code under the `except` block run?

### Problem 1e: Understand String Joining (Written, 1 Point)

Please look at the following code snippet.

In [8]:
", ".join(["a", "b", "c", "d", "e"])

'a, b, c, d, e'

Let's say `a` is a `str`, and `b` is a `list[str]`. What does `a.join(b)` do? Does it still work if `b` is a `set`? What if `b = range(10)`?

### Problem 1f: Understand Loop Breaking (Written, 1 Point)

Please look at the following code snippets.

In [9]:
i = 0
while True:
    i += 1
    print(i)
    if i >= 5:
        break

1
2
3
4
5


In [10]:
for i in range(1, 100):
    if i % 3 == 0 and i % 5 == 0:  
        # i is divisible by 3 and 5
        print(i)
        break

15


What does `break` do?

### Problem 1g: Understand the NLTK Tokenizer (Written, 1 Point)

NLTK has a function called `word_tokenize`.

In [11]:
import nltk

nltk.word_tokenize("The quick brown fox jumps over the lazy dog.")

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']

What does this function do? (In your answer, do not use the word "tokenize," except in the name of the `word_tokenize` function.)

## Problem 2: Analysis of Inaugural Addresses (7 Points in Total)

In this problem, you will use the code from HW 1, provided with this assignment, to conduct keyword analysis of US inaugural addresses. You will conduct the analysis using the Inaugural Addresses corpus from NLTK.

In [12]:
# Download the Inaugural Addresses corpus if you haven't already
nltk.download("inaugural")

# Load the Inaugural Addresses corpus
from nltk.corpus import inaugural

[nltk_data] Downloading package inaugural to [path redacted]...
[nltk_data]   Package inaugural is already up-to-date!


The HW 1 code provided with this assignment is slightly different from the HW 1 solution. **Please make sure you use the `hw1.py` file provided with this assignment, and not a different `hw1.py` file.**

In [13]:
# Load other code that we will use in this problem
from nltk.probability import FreqDist
import hw1

### Problem 2a: Attempt to Analyze an Inaugural Address (Written, 1 Point)

Using the entire Inaugural Addresses corpus as a reference, and using a smoothing constant of $k = 1$, what are the top five keywords of Donald Trump's 2025 inaugural address?

**Hint:** Use the code from HW 1, and consult [Chapter 2, Section 1](https://www.nltk.org/book/ch02.html#corpora_index_term) of the textbook.

In [None]:
# Feel free to write code to help you solve this problem.

### Problem 2b: Understand Encoding Issues (Written, 1 Point)

One of the keywords for Donald Trump's 2025 inauguration speech is `'\x80\x99'`. Let's look at the places in this speech where this token shows up.

In [14]:
from nltk.text import Text

Text(inaugural.words("2025-Trump.txt")).concordance("\x80\x99")

Displaying 25 of 25 matches:
thout even a token of defense . Theyâ  re raging through the houses and comm
re sitting here right now . They donâ  t have a home any longer . Thatâ  s
â  t have a home any longer . Thatâ  s interesting . But we canâ  t let 
 Thatâ  s interesting . But we canâ  t let this happen . Everyone is unabl
nable to do anything about it . Thatâ  s going to change . We have a public 
edom . From this moment on , Americaâ  s decline is over . Our liberties and
 over . Our liberties and our nationâ  s glorious destiny will no longer be 
 competency , and loyalty of Americaâ  s government . Over the past eight ye
nt in our 250 - year history , and Iâ  ve learned a lot along the way . The 
ful Pennsylvania field , an assassinâ  s bullet ripped through my ear . But 
cords , and I will not forget it . Iâ  ve heard your voices in the campaign 
and we will not forget our God . Canâ  t do that . Today , I will sign a ser
 the revolution of comm

What do you think `'\x80\x99'` means? Why is it in the corpus, and why is it always preceded by a word ending in `â`?

**Hint:** Let's check what encoding is used in the Inaugural Addresses corpus.

In [15]:
inaugural.encoding("2025-Trump.txt")

'latin1'

Now, try looking up how `â` is represented in this encoding (you may need to convert it from hexadecimal to binary representation), and revisit Problem 1a.
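If you would like Python to do the conversion for you, the sketch below shows one way to view a character's byte in hexadecimal and in binary. It uses the letter `A` as a stand-in, so you will need to adapt it to the character you are investigating.

```python
# Encode a single character in Latin-1 and inspect the resulting byte.
char = "A"
encoded = char.encode("latin1")    # a bytes object of length 1

hex_digits = encoded.hex()         # two hexadecimal digits
bits = format(encoded[0], "08b")   # the same byte as eight binary digits

print(f"{char!r} -> hex {hex_digits} -> binary {bits}")
```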

### Problem 2c: Fix Encoding Issues (Code, 1 Point)

Problem 2b shows that the Inaugural Addresses corpus contains [_mojibake_](https://en.wikipedia.org/wiki/Mojibake)—weird text that results from files being read using the wrong encoding. In this problem, you will remove all mojibake from the Inaugural Addresses corpus by setting each file in the corpus to its correct encoding.
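To see how mojibake arises, here is a minimal sketch that writes a string as UTF-8 bytes and then reads those bytes back as Latin-1. The word `café` is an arbitrary example, not taken from the corpus.

```python
original = "café"

# Store the text as UTF-8 bytes, then read the bytes with the wrong encoding.
utf8_bytes = original.encode("utf-8")
garbled = utf8_bytes.decode("latin1")
print(garbled)

# Reversing the mistake recovers the text: back to bytes, then decode as UTF-8.
repaired = garbled.encode("latin1").decode("utf-8")
print(repaired)
```

The two-byte UTF-8 encoding of `é` is read as two separate Latin-1 characters, which is why the garbled string is one character longer than the original.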

NLTK corpora are represented by the [`CorpusReader` class](https://www.nltk.org/api/nltk.corpus.reader.api.html#nltk.corpus.reader.api.CorpusReader). `CorpusReader`s contain a property called `._encoding`, which contains the encoding used for the corpus. Let's check the value of this property for the Inaugural Addresses corpus.

In [16]:
inaugural._encoding

'latin1'

Notice that the name of the `CorpusReader._encoding` property begins with an underscore (`_`). When a property begins with an underscore, it usually means that that property is for "internal use"; the authors of the class don't want you to use or change the value of that property from outside the class definition. But this is merely a suggestion, not a requirement. So let's try to fix the Inaugural Addresses corpus by changing its encoding to UTF-8, and see what happens.
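Before trying this on the real corpus, the toy class below illustrates the underscore convention. `TinyReader` is made up for this example and is unrelated to NLTK's actual implementation.

```python
class TinyReader:
    """A toy stand-in for a corpus reader (hypothetical, not NLTK code)."""

    def __init__(self, encoding: str):
        # The leading underscore signals "internal use" to human readers.
        self._encoding = encoding

    def encoding(self) -> str:
        # The public way to look up the encoding.
        return self._encoding


reader = TinyReader("latin1")
reader._encoding = "utf8"   # Python does not stop us from doing this
print(reader.encoding())
```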

In [17]:
# Try setting the encoding to UTF-8
inaugural._encoding = "utf8"

# Will this mess up any of the files in the corpus?
msg = "No encoding errors detected!"
for f in inaugural.fileids():
    try:
        _ = list(inaugural.words(f))
    except UnicodeDecodeError:
        msg = f"{f} is incompatible with UTF-8!"
        break
print(msg)

2005-Bush.txt is incompatible with UTF-8!


It seems, based on the code above, that not all files in the corpus are compatible with the UTF-8 encoding. Therefore, we will need to individually specify the encoding of each file, using a dict of the form:

```python
inaugural._encoding = {"1789-Washington.txt": "latin1",
                       "1793-Washington.txt": "latin1",
                       ...,
                       "2025-Trump.txt": "utf8"}
```

Please implement the function `hw2.get_encodings`, which takes an NLTK corpus and returns the encoding for each file in the corpus, in the form of a `dict`. Please assume that a file's encoding should be `'utf8'` if the file contains at least one token ending in `â` when loaded in `latin1` encoding, and that it should be `'latin1'` otherwise. When implemented correctly, the output of your function should be as follows.

In [18]:
inaugural._encoding = "latin1"
hw2.get_encodings(inaugural)

{'1789-Washington.txt': 'latin1',
 '1793-Washington.txt': 'latin1',
 '1797-Adams.txt': 'latin1',
 '1801-Jefferson.txt': 'latin1',
 '1805-Jefferson.txt': 'latin1',
 '1809-Madison.txt': 'latin1',
 '1813-Madison.txt': 'latin1',
 '1817-Monroe.txt': 'latin1',
 '1821-Monroe.txt': 'latin1',
 '1825-Adams.txt': 'latin1',
 '1829-Jackson.txt': 'latin1',
 '1833-Jackson.txt': 'latin1',
 '1837-VanBuren.txt': 'latin1',
 '1841-Harrison.txt': 'latin1',
 '1845-Polk.txt': 'latin1',
 '1849-Taylor.txt': 'latin1',
 '1853-Pierce.txt': 'latin1',
 '1857-Buchanan.txt': 'latin1',
 '1861-Lincoln.txt': 'latin1',
 '1865-Lincoln.txt': 'latin1',
 '1869-Grant.txt': 'latin1',
 '1873-Grant.txt': 'latin1',
 '1877-Hayes.txt': 'latin1',
 '1881-Garfield.txt': 'latin1',
 '1885-Cleveland.txt': 'latin1',
 '1889-Harrison.txt': 'latin1',
 '1893-Cleveland.txt': 'latin1',
 '1897-McKinley.txt': 'latin1',
 '1901-McKinley.txt': 'latin1',
 '1905-Roosevelt.txt': 'latin1',
 '1909-Taft.txt': 'latin1',
 '1913-Wilson.txt': 'latin1',
 '1917

Now, let's use your code to set all files in the corpus to the correct encoding.

In [19]:
# Set all the files to the correct encoding
inaugural._encoding = hw2.get_encodings(inaugural)

# Will this mess up any of the files in the corpus?
msg = "No encoding errors detected!"
for f in inaugural.fileids():
    try:
        _ = list(inaugural.words(f))
    except UnicodeDecodeError:
        msg = f"{f} is incompatible with UTF-8!"
        break
print(msg)

# Do we still get mojibake?
print("Looking up '\\x80\\x99'...") 
Text(inaugural.words()).concordance("\x80\x99")

No encoding errors detected!
Looking up '\x80\x99'...
no matches


### Problem 2d: Compare Inaugural Address Keywords (Written, 1 Point)

Using the entire Inaugural Addresses corpus as a reference, and using a smoothing constant of $k = 1$, look at the top 100 keywords of Donald Trump's 2025 inaugural address and Joe Biden's 2021 inaugural address. What are some interesting differences between the keywords for these two addresses?

In your submission, you do not need to write down all 200 keywords.

### Problem 2e: Understand Feature Tagging (Written, 1 Point)

[Egbert and Biber (2023)](https://www.euppublishing.com/doi/10.3366/cor.2023.0275) propose a corpus analysis method called _key feature analysis_. In keyword analysis, we are looking for token types that characterize one corpus relative to another. In key feature analysis, we are looking for higher-level grammatical features that characterize one corpus relative to another.

To help us perform key feature analysis, we will use the [Biberplus feature tagger](https://github.com/davidjurgens/biberplus) by [Alkiek et al. (2025, Subsection 3.1)](https://arxiv.org/abs/2502.18590v1), a piece of Python code that will annotate each token in a string with its grammatical features. Let's load the Biberplus package…

In [20]:
from biberplus.tagger import tag_text

…and let's try tagging some text.

In [21]:
tag_text("Alice saw Bob in the park.")

[{'text': 'Alice',
  'upos': 'PROPN',
  'xpos': 'NNP',
  'feats': Number=Sing,
  'tags': ['CAP', 'NNP']},
 {'text': 'saw',
  'upos': 'VERB',
  'xpos': 'VBD',
  'feats': Tense=Past|VerbForm=Fin,
  'tags': ['VBD', 'PRIV']},
 {'text': 'Bob',
  'upos': 'PROPN',
  'xpos': 'NNP',
  'feats': Number=Sing,
  'tags': ['CAP', 'NNP']},
 {'text': 'in',
  'upos': 'ADP',
  'xpos': 'IN',
  'feats': '',
  'tags': ['in', 'PIN', 'CONJ', 'PREP']},
 {'text': 'the',
  'upos': 'DET',
  'xpos': 'DT',
  'feats': Definite=Def|PronType=Art,
  'tags': ['the', 'ART', 'DET']},
 {'text': 'park',
  'upos': 'NOUN',
  'xpos': 'NN',
  'feats': Number=Sing,
  'tags': ['NN']},
 {'text': '.',
  'upos': 'PUNCT',
  'xpos': '.',
  'feats': PunctType=Peri,
  'tags': []}]

The output of `tag_text` is a `list[dict[str, Any]]`, where each `dict` contains morphosyntactic information about a particular token in the sentence. The features we care about are the ones associated with the `tags` key in each of the `dict`s. A description of these tags is given in Appendix A of [Alkiek et al. (2025)](https://arxiv.org/abs/2502.18590v1).
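As a minimal sketch of how to navigate this structure, the code below loops over a hand-copied excerpt of the output above (keeping only the `text` and `tags` keys) and pairs each token with its tags. In practice, the list would come from `tag_text` rather than being typed out.

```python
# Hand-copied excerpt of the tag_text output shown above (other keys omitted)
tagged = [
    {"text": "Alice", "tags": ["CAP", "NNP"]},
    {"text": "saw", "tags": ["VBD", "PRIV"]},
    {"text": ".", "tags": []},
]

# Pair each token with its list of feature tags
token_tags = [(token["text"], token["tags"]) for token in tagged]

for text, tags in token_tags:
    label = ", ".join(tags) if tags else "(no tags)"
    print(f"{text!r}: {label}")
```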

In the output above, the token `'Alice'` is associated with the tags `['CAP', 'NNP']`, while the token `'saw'` is associated with the tags `['VBD', 'PRIV']`. What does each of these tags mean (`'CAP'`, `'NNP'`, `'VBD'`, and `'PRIV'`)?

### Problem 2f: Implement Feature Tagging (Code, 1 Point)

Please implement the function `hw2.feature_dist`, which takes an untokenized text and returns the counts of each possible feature tag as a `FreqDist`. For example, the feature tag `'CAP'` occurs twice in the above output, once with `'Alice'` and once with `'Bob'`; as shown below, the count of this tag within the sentence `'Alice saw Bob in the park.'` is therefore 2.

In [22]:
fd = hw2.feature_dist("Alice saw Bob in the park.")
fd

FreqDist({'CAP': 2, 'NNP': 2, 'VBD': 1, 'PRIV': 1, 'in': 1, 'PIN': 1, 'CONJ': 1, 'PREP': 1, 'the': 1, 'ART': 1, ...})

In [23]:
# Visualize the entire FreqDist
for tag, count in fd.items():
    print(f"{tag} occurs {count} time(s).")

CAP occurs 2 time(s).
NNP occurs 2 time(s).
VBD occurs 1 time(s).
PRIV occurs 1 time(s).
in occurs 1 time(s).
PIN occurs 1 time(s).
CONJ occurs 1 time(s).
PREP occurs 1 time(s).
the occurs 1 time(s).
ART occurs 1 time(s).
DET occurs 1 time(s).
NN occurs 1 time(s).


### Problem 2g: Find Key Features of Inaugural Addresses (Written, 1 Point)

Once feature tagging has been implemented, we can use it to analyze the key features of inaugural addresses. For example, here are the top 5 key features of Donald Trump's 2025 inaugural address.

In [24]:
trump_features = hw2.feature_dist(inaugural.raw("2025-Trump.txt"))
ref_features = hw2.feature_dist(inaugural.raw())
hw1.freq_ratio(trump_features, ref_features).most_common(5)

[('nobody', 32.93226707286113),
 ('everybody', 24.949200304645846),
 ('SPIN', 12.307933485656257),
 ('something', 6.7790897392126475),
 ('UH', 6.294666744780987)]

Take a look at the top 25 key features of Donald Trump's 2025 inaugural address and Joe Biden's 2021 inaugural address. Use the entire Inaugural Addresses corpus as a reference, and use a smoothing constant of $k = 1$. Do you notice any interesting patterns?

You do not need to write down all 50 key features in your answer.

## Problem 3: Analysis of Nomination Acceptance Speeches (8 Points in Total)

Now, you will apply keyword analysis and key feature analysis to nomination acceptance speeches given by US presidential candidates at the Democratic and Republican National Conventions. Unfortunately, NLTK does not provide these speeches as a corpus. Instead, you will use texts from [UC Santa Barbara's American Presidency Project](https://www.presidency.ucsb.edu/documents/app-categories/elections-and-transitions/convention-speeches/presidential-nomination-0), which have been downloaded for you and provided with this assignment.

### Problem 3a: Extract Text From Speeches (Code, 1 Point)

The `convention_speeches` folder contains the HTML file for each nomination acceptance speech available at the American Presidency Project. For example, the file [2024-Harris-DNC.html](convention_speeches/2024-Harris-DNC.html) in this folder was downloaded from the following URL: [https://www.presidency.ucsb.edu/documents/address-accepting-the-democratic-presidential-nomination-chicago-illinois](https://www.presidency.ucsb.edu/documents/address-accepting-the-democratic-presidential-nomination-chicago-illinois)

Please implement the function `hw2.get_raw_text_from_html`, which extracts the text of a nomination acceptance speech from its corresponding HTML file. The content of each `p` element in the HTML file (i.e., the text appearing between the tags `<p>...</p>`) should be joined with `'\n'`.

In [25]:
hw2.get_raw_text_from_html("convention_speeches/2024-Harris-DNC.html")[:1000]

"The Vice President: Good evening! [Laughs.] [Applause.]\nAudience: Kamala! Kamala! Kamala!\nThe Vice President: California. [Laughs.] [Applause.]\nGood evening, everyone. Good evening. [Laughs.] [Applause.] Good evening. [Laughs.] [Applause.]\nOh, my goodness. [Applause.]\nGood evening, everyone. Good evening. Go- — [laughs]. [Applause.] Good evening. Thank you. [Applause.]\nThank you. Thank you. [Applause.]\nAudience: Kamala! Kamala! Kamala!\nThe Vice President: Good evening. [Applause.]\nThank you. Thank you. Thank — thank you. [Applause.] Thank you. Thank you, everyone. Thank you. [Applause.] Thank you. Thank you. [Applause.]\nAudience: USA! USA! USA!\nThe Vice President: Thank you all.\nAudience: USA! USA! USA!\nThe Vice President: Thank you all. [Applause.]\nOkay, we've got to get to some business. We've got to get to some business.\nOkay. Thank you all. [Applause.] Okay. [Laughs.] Thank you, thank you, thank you, thank you, thank you. [Applause.] Thank you, thank you. Please. Th

### Problem 3b: Understand Preprocessing (No Submission, 0 Points)

Notice that the text for Kamala Harris's 2024 nomination acceptance speech contains non-linguistic elements like `'[Laughs.]'` and `'[Applause.]'`. The function `hw2.preprocess` removes these elements.

In [26]:
hw2.preprocess(hw2.get_raw_text_from_html("convention_speeches/2024-Harris-DNC.html"))[:1000]

"The Vice President: Good evening! \nAudience: Kamala! Kamala! Kamala!\nThe Vice President: California. \nGood evening, everyone. Good evening. \nOh, my goodness. \nGood evening, everyone. Good evening. Go- — \nThank you. Thank you. \nAudience: Kamala! Kamala! Kamala!\nThe Vice President: Good evening. \nThank you. Thank you. Thank — thank you. \nAudience: USA! USA! USA!\nThe Vice President: Thank you all.\nAudience: USA! USA! USA!\nThe Vice President: Thank you all. \nOkay, we've got to get to some business. We've got to get to some business.\nOkay. Thank you all. \nPlease. Thank you so very much. Thank you, everyone. Thank you, everyone. Thank you. \nOkay, let's get to business. Let's get to business. All right. \nSo, let me start by thanking my most incredible husband, Doug —  I love you so very much.\nTo our president, Joe Biden — \nAnd to Coach Tim Walz — \nAnd to the delegates and everyone who has put your faith in our campaign, your support is humbling.\nSo, America, the path th

### Problem 3c: Implement ConventionSpeechCorpus Constructor (Code, 1 Point)

You will organize the nomination acceptance speeches by loading them into a `ConventionSpeechCorpus` object. The `ConventionSpeechCorpus` class, defined in `hw2`, behaves similarly to an NLTK corpus: it has methods like `.fileids`, `.raw`, and `.words`, whose NLTK counterparts you used extensively in Problem 2.

Please inspect the constructor (`.__init__` method) of `ConventionSpeechCorpus`, and complete its implementation. As shown below, `ConventionSpeechCorpus` objects can be constructed from the name of a single HTML file, or from the name of a folder containing a collection of HTML files.

In [27]:
# Creating a ConventionSpeechCorpus for a single HTML file
harris_corpus = hw2.ConventionSpeechCorpus("convention_speeches/2024-Harris-DNC.html")
harris_corpus.fileids()

['convention_speeches/2024-Harris-DNC.html']

In [28]:
# Creating a ConventionSpeechCorpus for a folder of HTML files
nomination_acceptance = hw2.ConventionSpeechCorpus("convention_speeches")
nomination_acceptance.fileids()

['convention_speeches/1932-Roosevelt-DNC.html',
 'convention_speeches/1936-Roosevelt-DNC.html',
 'convention_speeches/1940-Roosevelt-DNC.html',
 'convention_speeches/1944-Dewey-RNC.html',
 'convention_speeches/1944-Roosevelt-DNC.html',
 'convention_speeches/1948-Dewey-RNC.html',
 'convention_speeches/1948-Truman-DNC.html',
 'convention_speeches/1952-Eisenhower-RNC.html',
 'convention_speeches/1952-Stevenson-DNC.html',
 'convention_speeches/1956-Eisenhower-RNC.html',
 'convention_speeches/1956-Stevenson-DNC.html',
 'convention_speeches/1960-Kennedy-DNC.html',
 'convention_speeches/1960-Nixon-RNC.html',
 'convention_speeches/1964-Goldwater-RNC.html',
 'convention_speeches/1964-Johnson-DNC.html',
 'convention_speeches/1968-Humphrey-DNC.html',
 'convention_speeches/1968-Nixon-RNC.html',
 'convention_speeches/1972-McGovern-DNC.html',
 'convention_speeches/1972-Nixon-RNC.html',
 'convention_speeches/1976-Carter-DNC.html',
 'convention_speeches/1976-Ford-RNC.html',
 'convention_speeches/1976-

Currently, the `ConventionSpeechCorpus` constructor only supports constructing a `ConventionSpeechCorpus` object from a single HTML file. You are responsible for adding support for constructing a `ConventionSpeechCorpus` object from a folder of HTML files.

**Notes:**
- If the folder contains things other than HTML files (i.e., sub-directories or files not ending in `.html`), those things should not be included in the `ConventionSpeechCorpus` object.
- If the folder does not contain any HTML files, then the constructed `ConventionSpeechCorpus` object should be empty (i.e., it should contain no files).

**Hint:** Please use [the `os.listdir` function](https://www.w3schools.com/python/ref_os_listdir.asp).
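Here is a minimal sketch of how `os.listdir` behaves, using a temporary folder whose file names are invented for the example. It is only meant to show what the function returns, not how the constructor should be written.

```python
import os
import tempfile

# Build a throwaway folder with two files and one sub-directory.
with tempfile.TemporaryDirectory() as folder:
    for name in ["b.md", "a.txt"]:
        with open(os.path.join(folder, name), "w") as f:
            f.write("placeholder")
    os.mkdir(os.path.join(folder, "sub"))

    # os.listdir returns bare names (not full paths) in arbitrary order,
    # and it lists sub-directories alongside files.
    names = sorted(os.listdir(folder))
    is_file = {n: os.path.isfile(os.path.join(folder, n)) for n in names}

print(names)
print(is_file)
```

Because the order of the returned names is not guaranteed, the sketch sorts them before use.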

### Problem 3d: Implement ConventionSpeechCorpus Methods (Code, 2 Points)

Please implement the `.raw` and `.words` methods of `ConventionSpeechCorpus`. These methods should behave similarly to their counterparts in NLTK corpora, like `inaugural.raw` and `inaugural.words`.

The `.raw` method should return the raw text for one or more files in a `ConventionSpeechCorpus`.

In [29]:
nomination_acceptance.raw("convention_speeches/2024-Harris-DNC.html")[:1000]

"The Vice President: Good evening! \nAudience: Kamala! Kamala! Kamala!\nThe Vice President: California. \nGood evening, everyone. Good evening. \nOh, my goodness. \nGood evening, everyone. Good evening. Go- — \nThank you. Thank you. \nAudience: Kamala! Kamala! Kamala!\nThe Vice President: Good evening. \nThank you. Thank you. Thank — thank you. \nAudience: USA! USA! USA!\nThe Vice President: Thank you all.\nAudience: USA! USA! USA!\nThe Vice President: Thank you all. \nOkay, we've got to get to some business. We've got to get to some business.\nOkay. Thank you all. \nPlease. Thank you so very much. Thank you, everyone. Thank you, everyone. Thank you. \nOkay, let's get to business. Let's get to business. All right. \nSo, let me start by thanking my most incredible husband, Doug —  I love you so very much.\nTo our president, Joe Biden — \nAnd to Coach Tim Walz — \nAnd to the delegates and everyone who has put your faith in our campaign, your support is humbling.\nSo, America, the path th

The `.words` method should return tokenized text for one or more files in a `ConventionSpeechCorpus`.

In [30]:
nomination_acceptance.words("convention_speeches/2024-Harris-DNC.html")[:10]

['The',
 'Vice',
 'President',
 ':',
 'Good',
 'evening',
 '!',
 'Audience',
 ':',
 'Kamala']

Both methods should support loading multiple files, or all files in a `ConventionSpeechCorpus`.

In [31]:
# Load two files
_ = nomination_acceptance.raw(["convention_speeches/2024-Harris-DNC.html",
                               "convention_speeches/2024-Trump-RNC.html"])
_ = nomination_acceptance.words(["convention_speeches/2024-Harris-DNC.html",
                                "convention_speeches/2024-Trump-RNC.html"])

# Load all files
_ = nomination_acceptance.raw()
_ = nomination_acceptance.words()

If you try loading a file that doesn't exist within the `ConventionSpeechCorpus`, you should get a `ValueError`.

In [32]:
# Load a non-existent file
nomination_acceptance.raw("fake_file.html")

ValueError: There is no file called fake_file.html in this corpus!

In [33]:
# Load multiple files, including a non-existent file
nomination_acceptance.words(["convention_speeches/2024-Harris-DNC.html",
                             "fake_file.html",
                             "convention_speeches/2024-Trump-RNC.html"])

ValueError: There is no file called fake_file.html in this corpus!

**Notes:**
- Please read the docstrings for these methods very carefully!
- For grading purposes, it doesn't matter what error message you put into the `ValueError`.

### Problem 3e: Nomination Acceptance Speeches vs. Inaugural Addresses (Written, 1 Point)

What are the top five keywords and key features of:
- the nomination acceptance speeches, all concatenated into a single corpus?
- the inaugural addresses, all concatenated into a single corpus?

Please use a smoothing constant of $k = 1$. For your reference corpus, please use the concatenation of the nomination acceptance speeches and the inaugural addresses corpus.

**Hint:** Remember that you can add `FreqDist`s.
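As a reminder of what adding frequency distributions does, here is a small sketch. It uses `collections.Counter` from the standard library so that it runs without NLTK; NLTK's `FreqDist` is a subclass of `Counter`, and the sketch assumes that addition behaves the same way for it. The token lists are invented.

```python
from collections import Counter

counts_a = Counter(["we", "the", "people", "the"])
counts_b = Counter(["the", "union", "we"])

# Adding two counters sums the counts of each item.
combined = counts_a + counts_b
print(combined)
```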

In [None]:
# Feel free to write code to help you solve this problem.

### Problem 3f: Democratic Speeches vs. Republican Speeches (Written, 1 Point)

What are the top five keywords and key features of:
- the nomination acceptance speeches given at the Democratic National Convention (DNC), all concatenated into a single corpus?
- the nomination acceptance speeches given at the Republican National Convention (RNC), all concatenated into a single corpus?

Please use a smoothing constant of $k = 1$. For your reference corpus, please use the concatenation of all nomination acceptance speeches, regardless of party.

**Hints:** 
- Remember that you can add `FreqDist`s.
- Look at the filename for each speech.
- What is the value of the Python expression `'bc' in 'abcde'`?

In [None]:
# Feel free to write code to help you solve this problem.

### Problem 3g: Deep Dive (Written, 2 Points)

Using keyword analysis and/or key feature analysis, what can you say about the differences between:
- inaugural addresses and nomination acceptance speeches, and
- Democratic vs. Republican nomination acceptance speeches?

Please base your answers on at least 25 keywords and/or key features from both sets of speeches in each comparison. That is, you should examine at least 25 keywords and/or key features from each of the inaugural addresses, nomination acceptance speeches, DNC speeches, and RNC speeches, each with an appropriate reference corpus. You do not need to write down all 25 keywords and/or key features for each corpus in your answer; just describe any interesting observations you make from this analysis.

In [None]:
# Feel free to write code to help you solve this problem.

## Problem 4: Optional Further Reading (No Submission, 0 Points)

Egbert and Biber (2020) conduct a detailed key feature analysis of Donald Trump's presidential debates. This paper is available on Blackboard under Assignments; please read it at your leisure.