# SI 618 Homework 5 - Natural Language Processing

### The total score for this assignment will be 100 points, consisting of:
- 10 pt: Overall quality of spelling, grammar, punctuation, etc. of written sentences.
- 10 pt: Code is written in [PEP 8](https://www.python.org/dev/peps/pep-0008/) style.
- 80 pt: Homework questions. Questions 1 through 6 are worth 10 points each; Question 7 is worth 20 points.

Version 2024.02.20.CT

## Background

On September 10, 2024, former President Donald J. Trump and current Vice President Kamala Harris met in their first U.S. presidential debate.

The debate was hosted by ABC News and moderated by David Muir and Linsey Davis. It was held in Philadelphia, Pennsylvania, in a closed-room setting and aired on national television. With the presidential election just around the corner and in the domestic and international spotlight, this homework focuses on how to analyze text data from recent news sources.

Two sources of the debate transcript are provided for this homework. The primary source is The American Presidency Project at the University of California, Santa Barbara. The secondary source is the ABC News website.


- Primary source (The American Presidency Project): https://www.presidency.ucsb.edu/documents/presidential-debate-philadelphia-pennsylvania
- Secondary source (ABC News): https://abcnews.go.com/Politics/harris-trump-presidential-debate-transcript/story?id=113560542

## Learning Objectives

In this homework, you will practice the following skills:

- Getting comfortable working with text data
- Preprocessing techniques for text data
- Using regular expressions to extract information from text data
- Tokenizing text data
- Word filtering (stop words, punctuation, etc.)
- Understanding the basics of Natural Language Processing (NLP)
- Introduction to embeddings
- Using embeddings to analyze text data
    - Similarity
    - Distance


## 📂 Data

The teaching team has provided two text files for this homework. We will only be using one of them for this assignment. 

- The file from the **American Presidency Project** is the primary source of data and is named **"Full_Transcript.txt"**.
- The other file, from the **ABC News** website, is a partial transcript called **"Partial_Debate_Transcript.txt"**.

### Homework Focus:
In this homework, you will work on skills that provide the groundwork for understanding and analyzing text data.

### Extra Resource:
The partial debate transcript is provided as an extra resource. **It is not required for this homework**, but feel free to explore it after completing the assignment.

### Question for Reflection:
Consider the following question:  
*"What is the impact of cherry-picked data? Do you think that a partial transcript can provide the whole truth of a debate, even if selectively chosen?"*

Please fill in your uniqname in the next code cell:

In [1]:
MY_UNIQNAME = ""

Answer each of the questions below.  You are encouraged to use as many code and markdown cells as you need for each question.

We **strongly** suggest running all cells from top to bottom before you submit your notebook.

In [2]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.tag import pos_tag
# nltk.download('wordnet')
# nltk.download('vader_lexicon')
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from langchain_huggingface import HuggingFaceEmbeddings
import matplotlib.pyplot as plt
from collections import Counter


## **Q0** - Setting Up Data Structure for Text Processing

For this question, the desired output is to create a **2-column DataFrame** with the following columns:
- **Column 1**: Speaker
- **Column 2**: Text

### Task:
Please create a 2-column DataFrame using the data from the **"Full_Transcript.txt"** file.
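
As a starting point, here is a minimal sketch of one way to split transcript lines into speaker and text. It assumes that each speaking segment begins with an upper-case speaker label followed by a colon, which is an assumption about the file layout that you should verify against `Full_Transcript.txt`. The sample lines, the `SPEAKER_LINE` pattern, and the `parse_transcript` helper are all hypothetical.

```python
import re

# Hypothetical sample lines; the real file layout may differ.
sample_lines = [
    "DAVID MUIR: Good evening, and welcome.",
    "VICE PRESIDENT KAMALA HARRIS: Thank you.",
    "",
    "FORMER PRESIDENT DONALD TRUMP: Nice to see you.",
]

# Assumed pattern: an upper-case speaker label, a colon, then the text.
SPEAKER_LINE = re.compile(r"^([A-Z][A-Z .'-]+):\s*(.*)$")


def parse_transcript(lines):
    """Return a list of {'Speaker': ..., 'Text': ...} dictionaries."""
    rows = []
    for line in lines:
        match = SPEAKER_LINE.match(line.strip())
        if match:
            rows.append({"Speaker": match.group(1), "Text": match.group(2)})
    return rows


rows = parse_transcript(sample_lines)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)` to obtain the two columns.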

In [None]:
# your code here...

## **Q1.**

### Tokenization of the Text
- The teaching team would like to emphasize the skill of **tokenization** of the text data.
- **Tokenization** is the process of breaking down text data into smaller units, such as **words** or **phrases**.
- We will use the `split()` method to tokenize the text data.
- Using the `split()` method, please tokenize the text data for each **candidate** and **moderator**.
- Store each in separate data structures for each candidate and moderator, along with one for the **full transcript**.
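
To illustrate what `split()` does, here is a small sketch on a made-up sentence (not taken from the transcript).

```python
# Hypothetical sentence, not taken from the transcript.
text = "We will  win, and we will lead."

# With no arguments, split() breaks on any run of whitespace.
tokens = text.split()
print(tokens)
```

Note that punctuation stays attached to the tokens (`"win,"`, `"lead."`) and that `"We"` and `"we"` are treated as different tokens.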

In [None]:
# your code here...

You may notice some irregularities in the tokenized output. Over the next several steps, we will clean the text data to make it more readable and easier to analyze.

### Lowercase all of the text data
The next step in the NLP pipeline is to convert all of the text data to lowercase. This is important because it standardizes the text so that, for example, "The" and "the" are counted as the same word. In this step, we would like you to define a function that can be used to lowercase all of the text data.
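
A minimal sketch of such a function, assuming the input is a list of string tokens; the name `lowercase_tokens` is hypothetical.

```python
def lowercase_tokens(tokens):
    """Return a new list with every token converted to lowercase."""
    return [token.lower() for token in tokens]


lowered = lowercase_tokens(["We", "WILL", "win"])
```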

In [None]:
# your code here...

### Remove Punctuation from the Text
The next step in the NLP pipeline is to remove punctuation from the text data. This is another important step as it moves towards more standardization of the text data. In this step, we would like you to define a function that can be used to remove punctuation from the text data.
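
One possible approach, sketched under the assumption that the input is a list of tokens, uses `str.translate` with `string.punctuation`. The helper name `remove_punctuation` is hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(tokens):
    """Strip ASCII punctuation and drop tokens that become empty."""
    stripped = (token.translate(PUNCT_TABLE) for token in tokens)
    return [token for token in stripped if token]


cleaned = remove_punctuation(["win,", "lead.", "--", "don't"])
```

Keep in mind that `string.punctuation` only covers ASCII characters, so typographic quotes or dashes in the transcript would need separate handling, for example with a regular expression.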

In [None]:
# your code here...

### Remove stopwords from the text
- The next step in the NLP pipeline is to remove stopwords from the text data. Stopwords are common words (such as "the" or "and") that do not provide much information about the text. In this step, we would like you to define a function that can be used to remove stopwords from the text data.
- The teaching team has provided a list of stopwords that you can use for this step. The stopwords can be found in the `stopwords.txt` file.
- To summarize, please load the `stopwords.txt` file and use it, inside a function, to remove stopwords from the text data.
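
The sketch below shows one way this could look. It assumes `stopwords.txt` holds one stopword per line, which you should confirm by opening the file. Both helper names are hypothetical, and the demo uses a small in-memory set rather than the real file.

```python
def load_stopwords(path):
    """Read one stopword per line into a set (assumed file layout)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopwords]


# Hypothetical stopword set used here instead of reading the file.
demo_stopwords = {"we", "will", "and"}
demo_tokens = ["we", "will", "win", "and", "lead"]
filtered = remove_stopwords(demo_tokens, demo_stopwords)
```

Storing the stopwords in a set keeps each membership check fast, which matters when filtering a full transcript.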

In [None]:
# your code here...

### 🏷️ POS Tagging the Text

The next step in our NLP pipeline is **Part-of-Speech (POS) tagging**.

### What is POS Tagging?
- **POS tagging** is the process of marking each word in a text with its corresponding **part of speech** based on its **definition** and **context**.
  
- Examples:  
  - **Noun**: *"dog"*, *"city"*  
  - **Verb**: *"run"*, *"write"*  
  - **Adjective**: *"happy"*, *"blue"*

### Task:
Use the **NLTK library** to perform **POS tagging** on the text data.

In [None]:
# your code here...

### Lemmatization of the Text

The final step in our NLP pipeline is **lemmatization**.

### What is Lemmatization?
- **Lemmatization** groups together the **inflected forms** of a word, so they can be analyzed as a **single item**.  
- Example:  
  - *"running"* → **"run"**

### Task:
Define a function using the **NLTK library** to lemmatize the text data.

🔗 **Hint**: Use NLTK’s `WordNetLemmatizer` for this process.
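
`WordNetLemmatizer.lemmatize` accepts an optional `pos` argument and treats every word as a noun by default, so a verb form such as "running" is only reduced to "run" when it is lemmatized as a verb. The sketch below shows one way to translate Penn Treebank tags (the tag set returned by `pos_tag`) into the single-letter codes WordNet uses. The helper name `penn_to_wordnet` and the hand-written tagged pairs are hypothetical.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech code."""
    if tag.startswith("J"):
        return "a"  # adjective, same value as nltk.corpus.wordnet.ADJ
    if tag.startswith("V"):
        return "v"  # verb, wordnet.VERB
    if tag.startswith("RB"):
        return "r"  # adverb, wordnet.ADV
    return "n"  # noun (wordnet.NOUN), also used as the fallback


# Hand-written (word, tag) pairs in the shape produced by pos_tag.
tagged = [("running", "VBG"), ("happy", "JJ"), ("cities", "NNS")]
wordnet_codes = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
```

Each resulting pair can then be passed to the lemmatizer as `lemmatizer.lemmatize(word, pos=code)`.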

In [None]:
# your code here...

## **Q2.** Word Usage Analysis by Candidate and Moderator

### Key Questions:
1. **How many words** did each **candidate** use in the debate?  
   - *(Excluding stopwords)*

2. **How many unique words** did each **moderator** use in the debate?  
   - *(Excluding stopwords)*

### Analysis Focus:
- **Candidates**: Calculate **total word count**.
- **Moderators**: Calculate **unique word count**.

In [None]:
# your code here...

## **Q3.** Counting Speaking Turns for Each Candidate and Moderator

### What is a "Turn"?
- A **turn** is defined as an **uninterrupted period of speech**.  
  For example:
  
  > Chris: Big data is really interesting.  
  > Colleague: Actually, it's a bit boring.  
  > Chris: Really? Why would you say that?  
  > Colleague: Your choice of tools is really limited.  
  > Colleague: I mean, you're basically stuck with Spark, right?  
  > Chris: Yeah, but Spark provides most of the functionality you need to do really cool data science work.

In this example, **Chris** had 3 turns, while his **Colleague** had 2.
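
The counting rule in this example can be sketched with `itertools.groupby`, which merges consecutive identical items. The list below simply transcribes the speaker order from the example; how you obtain that sequence from your own DataFrame is up to you.

```python
from collections import Counter
from itertools import groupby

# Speaker sequence from the example above, one entry per line.
speakers = ["Chris", "Colleague", "Chris", "Colleague", "Colleague", "Chris"]

# groupby() merges consecutive identical speakers into one group, so
# each group corresponds to one uninterrupted turn.
turns = Counter(speaker for speaker, _ in groupby(speakers))
```

Here `turns` maps Chris to 3 and Colleague to 2, matching the counts above.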

### Desired Output:
Format your results using a **DataFrame** with the following columns:

- **Speaker**  
- **Uninterrupted_Turns**

In [None]:
# your code here...

## **Q4.** Analyzing Noun Usage by Each Speaker

### Task:
Calculate the **number of different nouns** used by each speaker in the debates.
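
As a hedged illustration of the counting step only, the sketch below works on hand-written `(word, tag)` pairs in the shape that `pos_tag` returns. The pairs are hypothetical and do not come from the transcript.

```python
# Hypothetical (word, tag) pairs in the shape returned by nltk.pos_tag.
tagged = [
    ("economy", "NN"), ("jobs", "NNS"), ("grow", "VB"),
    ("economy", "NN"), ("america", "NNP"), ("strong", "JJ"),
]

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
distinct_nouns = {word for word, tag in tagged if tag.startswith("NN")}
noun_count = len(distinct_nouns)
```

Using a set means a noun that appears several times, such as "economy" here, is counted once.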

### Format:
Present your answer using a **dictionary** in the following format:

```python
{
    'DAVID MUIR': 0, 
    'LINSEY DAVIS': 0, 
    'FORMER PRESIDENT DONALD TRUMP': 0, 
    'VICE PRESIDENT KAMALA HARRIS': 0
}
```

In [None]:
# your code here...

## **Q5.** Analyzing Unique Words Used by Each Speaker

### Task:
- **Calculate the Number of Unique Words** for each candidate and moderator.
- Answer the following question:  
  - How many **unique words** did each person use in the debate?
  - What might this indicate about the **breadth of their vocabulary**?

### Considerations:
- Think about the **role** of the speaker:
  - Are they a **candidate** trying to present their ideas?
  - Or a **moderator** guiding the discussion?

### Visualization: Zipf's Law
- Create a **visualization** to depict **Zipf's Law** for each speaker, showing the relationship between word frequency and rank.
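
Zipf's Law says that a word's frequency is roughly inversely proportional to its rank, so the rank-frequency curve looks approximately like a straight line on log-log axes. The sketch below builds the rank and frequency lists from a made-up token list; the tokens are hypothetical stand-ins for one speaker's cleaned words.

```python
from collections import Counter

# Hypothetical tokens standing in for one speaker's cleaned words.
tokens = ["tax", "tax", "tax", "plan", "plan", "border"]

# Sort counts from most to least frequent and pair each with its rank.
frequencies = sorted(Counter(tokens).values(), reverse=True)
ranks = list(range(1, len(frequencies) + 1))
```

The two lists can then be drawn with, for example, `plt.loglog(ranks, frequencies)`, once per speaker.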

In [None]:
# your code here...

## **Q6.** Calculate Bigrams by Candidate and Moderator

Use **bigrams** (pairs of consecutive words) to analyze speech patterns for each candidate and each moderator.

### Task:
- Calculate **bigrams** separately for each candidate and each moderator.
- Present the **top 5 bigrams** for each speaker.
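
A minimal sketch of bigram counting with the standard library is shown below. The token list is made up for illustration; in the homework it would be one speaker's cleaned tokens.

```python
from collections import Counter

# Hypothetical cleaned tokens for a single speaker.
tokens = ["middle", "class", "tax", "cut", "middle", "class"]

# Pair each token with the one that follows it.
bigrams = list(zip(tokens, tokens[1:]))

# most_common(5) returns up to five (bigram, count) pairs.
top_bigrams = Counter(bigrams).most_common(5)
```

If all of a speaker's tokens are joined into one long list, this pairing also creates bigrams that span two separate speaking segments. Computing bigrams per segment and then summing the counts is one way to avoid that.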

### Analysis Questions:
- Do you notice any **interesting patterns** or unique combinations of words?
- Are there specific phrases that are frequently used by a particular candidate?

In [None]:
# your code here...

## **Q7.** Word Frequency Histogram

Create a **word frequency histogram** for each candidate and each moderator.

### Analysis Questions:
- Do you observe any **interesting patterns**?
- Choose an **arbitrary word count threshold** and **remove** all words with a frequency **lower** than that threshold.
  
### Updated Histogram:
- Create a **new histogram** using the updated data.
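
The thresholding step can be sketched as a simple filter over a `Counter`. The words and the threshold value below are arbitrary and purely illustrative.

```python
from collections import Counter

# Hypothetical word counts for one speaker.
counts = Counter(["tax", "tax", "tax", "plan", "plan", "border"])
threshold = 2  # arbitrary cutoff chosen for this toy example

# Keep only the words that occur at least `threshold` times.
frequent = {word: n for word, n in counts.items() if n >= threshold}
```

The filtered dictionary can then be plotted, for instance with `plt.bar(list(frequent), list(frequent.values()))`.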

### Additional Insight:
- Since this is a **political debate**, identify any **rare words** that might be of interest.
- Do any moderators or candidates use **complex vocabulary**? If so, which ones?


In [3]:
# your code here...

## **Q8.** Sentiment Analysis of Trump vs. Harris Debate

In this task, we will use the **VADER sentiment analysis tool** to evaluate the sentiment of the debate.

### Steps to Follow:
1. **Filter the Text**:  
   - Include only **candidate responses** by removing any **moderator text** from the analysis.

2. **Analyze Sentiment**:  
   - Use VADER to calculate the **sentiment score** for each text blurb belonging to the candidates.

3. **Create a New Column**:  
   - Add a new column in the **full transcript dataframe** to store the **sentiment score** for each blurb.

4. **Save the New Dataframe**:  
   - Store this modified dataframe as a **separate data structure**.

5. **Visualize Sentiment**:  
   - Create a **visualization** showing the **sentiment score** of each blurb for each candidate.

In [None]:
# your code here...

## **Q9.** Embeddings
Use the **embedding model** provided through LangChain's `HuggingFaceEmbeddings()` to encode the **untokenized text** for each candidate from the data frames.

### Steps:
1. **Embed Each Candidate’s Speech**:
   - For every row in the candidate data frames, embed the untokenized text using the embedding model.
   - Return the embeddings in a **list** format.

2. **Create a Query**:
   - Use the text:  
     > *"I am the Best president in the history of the United States."*

3. **Embed the Query**:
   - Embed the query text using the same model.

4. **Find the Most Relevant Document**:
   - Use **cosine similarity** to compare the **query embedding** with each candidate’s **speech embeddings**.
   - Determine which speech is **most similar** to the query.
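
To make the comparison step concrete, here is a small sketch of cosine similarity written out by hand on made-up 3-dimensional vectors. In the notebook, the vectors would instead come from the embedding model (LangChain embedding classes expose `embed_documents` for a list of texts and `embed_query` for a single text), and the imported `cosine_similarity` from scikit-learn, which expects 2-D inputs, could replace the hypothetical `cosine` helper.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional vectors; real embeddings are much longer.
query_vec = [1.0, 0.0, 1.0]
speech_vecs = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]

# Score every speech vector against the query and pick the highest.
scores = [cosine(query_vec, vec) for vec in speech_vecs]
best_index = max(range(len(scores)), key=scores.__getitem__)
```

In this toy example the second vector points in the same direction as the query, so it receives the highest score. Because cosine similarity ignores vector length, `[2.0, 0.0, 2.0]` matches `[1.0, 0.0, 1.0]` as closely as an identical vector would.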

In [None]:
# your code here...

### 🟢 IMPORTANT: Final Checks

Before you submit your homework, please ensure that:
- Your complete homework **runs without errors** from top to bottom.  
  💡 **Tip**: Use the **Run All** feature to quickly check this.
  
---

## 📤 Submission Instructions

Submit your completed assignment in both of the following formats:
1. **.IPYNB** (Jupyter Notebook format)
2. **.HTML** (Webpage format)

Upload your files to **Canvas** before the deadline.