In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw03.ipynb")

# Homework 3: Text Analysis Using Twitter

## Cleaning and Exploring Twitter Data using REGEX

## Due Date: Monday, July 3, 11:59 PM
You must submit this assignment to Gradescope by the on-time deadline, Monday, July 3 at 11:59 PM. Please read the syllabus for the grace period policy. No late submissions beyond the grace period will be accepted. **We strongly encourage you to plan to submit your work to Gradescope several hours before the stated deadline.** This way, you will have ample time to reach out to staff for support if you encounter difficulties with submission. While course staff is happy to help guide you with submitting your assignment ahead of the deadline, we will not respond to last-minute requests for assistance (TAs need to sleep, after all!).

Please read the instructions carefully to submit your work to both the coding and written portals of Gradescope.

## Collaboration Policy

Data science is a collaborative activity. While you may talk with others about
the homework, we ask that you **write your solutions individually**. If you do
discuss the assignments with others please **include their names** at the top
of your notebook.

**Collaborators**: *list collaborators here*

## This Assignment

Welcome to the third homework assignment of Data 100! In this assignment, we will be exploring tweets from several high profile Twitter users.  

In this assignment you will gain practice with:
* Conducting Data Cleaning and EDA on a text-based dataset.
* Manipulating data in pandas with the datetime and string accessors.
* Writing regular expressions and using pandas regex methods.
* Performing sentiment analysis on social media using VADER.


In [None]:
# Run this cell to set up your notebook.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

from ds100_utils import *

# Ensure that Pandas shows at least 280 characters in columns, so we can see full tweets.
pd.set_option('max_colwidth', 280)
plt.style.use('fivethirtyeight')
sns.set()
sns.set_context("talk")

def horiz_concat_df(dict_of_df, head=None):
    """
    Horizontally concatenate multiple DataFrames for easier visualization.
    Each DataFrame must have the same columns.
    """
    df = pd.concat([df.reset_index(drop=True) for df in dict_of_df.values()], axis=1, keys=dict_of_df.keys())
    if head is None:
        return df
    return df.head(head)

### Score Breakdown

Question | Manual| Points
--- |---| ---
1a |No| 1
1b |No| 1
1c |No| 3
1d |Yes| 1
2a |No| 2
2b |No| 2
2c |No| 2
2d |No| 2
2e |Yes| 2
2f |Yes| 1
3a |No| 1
3b |Yes| 1
3c |No| 1
4a |Yes| 1
4b |No| 1
4ci |No| 1
4cii |No| 1
4d |No| 1
4e |No| 2
4f |No| 2
4g |Yes| 2
5a |Yes| 2
5b |Yes| 2
**Total** || **35**

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Question 1: Importing the Data


The data for this assignment was obtained using the [Twitter APIs](https://developer.twitter.com/en/docs/twitter-api).  To ensure that everyone has the same data and to eliminate the need for every student to apply for a Twitter developer account, we have collected a sample of tweets from several high-profile public figures.  The data is stored in the folder `data`.  Run the following cell to list the contents of the directory:

In [None]:
# Run this cell to list the content, no further action is needed.
from os import listdir
for f in listdir("data"):
    print(f)

<br><br>

--- 
### Question 1a

Let's examine the contents of one of these files.  Using the [`open` function](https://docs.python.org/3/library/functions.html#open) and [`read` operation](https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects) on a Python file object, read the first 1000 **characters** in `data/BernieSanders_recent_tweets.txt` and store your result in the variable `q1a`.  Then display the result so you can read it.

**CAUTION: Viewing the contents of large files in a Jupyter notebook could crash your browser! Be careful not to print the entire contents of the file.**

**Hint1:** You might want to try to use `with`:

```python
with open("filename", "r") as f:
    f.read(2)
```
**Hint2:** Since your data is stored in the `data` directory, your datapath should start with `data/...`. Absolute paths (i.e., paths that start from the root directory) will not be accepted.
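
To see how `read` behaves with a character count, here is a minimal sketch on a throwaway file. The file name `sample.txt` and its contents are hypothetical stand-ins created just for this illustration; they are not part of the assignment data.

```python
import os
import tempfile

# Hypothetical stand-in file; the assignment's real data lives under data/.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("abcdefghij" * 500)  # 5000 characters in total

    # In text mode, read(n) returns at most n characters, not the whole file.
    with open(sample_path, "r", encoding="utf-8") as f:
        preview = f.read(20)

print(len(preview))  # 20
print(preview)       # abcdefghijabcdefghij
```

Reading a bounded preview like this keeps the notebook responsive even when the underlying file is large.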


In [None]:
...
    q1a = ...
print(q1a)

In [None]:
grader.check("q1a")

<br><br>

--- 
### Question 1b

Based on the printed output you got from `q1a`, what format is the data in? Answer this question by entering the letter corresponding to the right format in the variable `q1b` below.

**CAUTION: As a reminder, viewing the contents of large files in a Jupyter notebook could crash your browser.  Be careful not to print the entire contents of the file, and do not use the file explorer to open data files directly.**

  **A.** CSV<br/>
  **B.** HTML<br/>
  **C.** JavaScript Object Notation (JSON)<br/>
  **D.** Excel XML

Answer in the following cell. Your answer should be a string, either `"A"`, `"B"`, `"C"`, or `"D"`.


In [None]:
q1b = ...

In [None]:
grader.check("q1b")

<br><br>

--- 

### Question 1c

Pandas has built-in readers for many different file formats including the file format used here to store tweets.  To learn more about these, check out the documentation for `pd.read_csv` [(docs)](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), `pd.read_html`[(docs)](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html), `pd.read_json`[(docs)](https://pandas.pydata.org/docs/reference/api/pandas.read_json.html), and `pd.read_excel`[(docs)](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html).  

1. Use one of these functions to populate the `tweets` dictionary with the tweets for: `AOC`, `Cristiano`, and `elonmusk`. The keys of `tweets` should be the handles of the users (i.e. username), which we have provided in the cell below, and the values should be the DataFrames.
2. Set the index of each DataFrame to correspond to the `id` of each tweet.  


**Hint1:** You might want to first try loading one of the DataFrames before trying to complete the entire question.

**Hint2:** This is one of the rare instances in which a `for` loop may come in handy (although it's not required to answer this question)!

**Hint3:** If your code is taking more than a few seconds to run, you should review your answers to `q1a` and `q1b`; you may have used the incorrect data loading function for the file types in this assignment.

**Hint4:** Pay attention to the arguments that the built-in reader you choose takes in, specifically how the `orient` argument affects the output.
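
As a rough illustration of how `orient` changes what `pd.read_json` expects, the sketch below parses two tiny JSON strings. Both strings are made-up samples, not the layout of the assignment files, so you still need to inspect your own data to decide which orientation applies.

```python
import io

import pandas as pd

# Hypothetical two-row samples; the real files contain many more fields.
records_json = '[{"id": 11, "text": "hello"}, {"id": 12, "text": "world"}]'
index_json = '{"11": {"text": "hello"}, "12": {"text": "world"}}'

# orient="records": the JSON is a list with one object per row.
by_records = pd.read_json(io.StringIO(records_json), orient="records")

# orient="index": the outer keys become the row labels.
by_index = pd.read_json(io.StringIO(index_json), orient="index")

print(by_records.shape)  # (2, 2): columns "id" and "text"
print(by_index.shape)    # (2, 1): only "text" is a column
```

Choosing an orientation that does not match the file usually produces a transposed or oddly shaped DataFrame, which is a useful symptom to look for.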

In [None]:
tweets = ...

In [None]:
grader.check("q1c")

If you did everything correctly, the following cells will show you the first 5 tweets for Elon Musk (and a lot of information about those tweets).

In [None]:
# Run this cell to show the first 5 tweets for Elon Musk, no further action is needed.
tweets["elonmusk"].head()

<!-- BEGIN QUESTION -->

<br><br>

---
### Question 1d
There are many ways we could choose to read tweets. Why might someone be interested in doing data analysis on tweets? Name a kind of person or institution which might be interested in this kind of analysis. Then, give two reasons why a data analysis of tweets might be interesting or useful for them. Answer in 2-3 sentences.


_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br/><br/><br/>

<hr style="border: 1px solid #fdb515;" />


## Question 2:  Source Analysis


In some cases, the Twitter feed of a public figure may be partially managed by a public relations firm. In these cases, the device used to post the tweet may help reveal whether it was the individual (e.g., from an iPhone) or a public relations firm (e.g., TweetDeck).  The tweets we have collected contain the source information but it is formatted strangely :(

In [None]:
# Run this cell to see the source column of Cristiano's tweet dataframe, no further action is needed.
tweets["Cristiano"][["source"]]

In this question we will use a regular expression to convert this messy HTML snippet into something more readable.  For example: `<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>` should be `Twitter for iPhone`. 


<br><br>

--- 
### Question 2a

We will first use the Python `re` library to clean up the above test string.  In the cell below, write a regular expression that will match the **HTML tag** and assign it to the variable `q2a_pattern`. We then use the `re.sub` function to substitute anything that matches the pattern with an empty string `""`.

An HTML tag is defined as a `<` character followed by zero or more non-`>` characters, followed by a `>` character. That is, `<a>` and `</a>` are both considered _separate_ HTML tags.
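
The same "opening delimiter, any characters except the closing delimiter, closing delimiter" idea can be tried on an analogous toy problem. The sketch below strips square-bracket tags rather than HTML tags; the pattern name `bracket_tag` and the sample string are invented for illustration.

```python
import re

# Analogous toy problem: remove [square-bracket] tags instead of HTML tags.
# \[      a literal opening bracket
# [^\]]*  zero or more characters that are not a closing bracket
# \]      a literal closing bracket
bracket_tag = r"\[[^\]]*\]"

sample = "[b]bold[/b] and [i]italic[/i]"
cleaned = re.sub(bracket_tag, "", sample)
print(cleaned)  # bold and italic
```

The negated character class keeps each match from running past the first closing delimiter, so opening and closing tags are matched separately.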

**Note:** In this question and in all subsequent questions in this assignment that involve regex, the public test cases will run your regex pattern on several test strings. If you are failing a public test case, this is likely because your regex pattern does not account for a possible way in which text might be structured. See the description of the test case to help you debug your regex. Resources like [Regex101](https://regex101.com/) might be helpful!


In [None]:
q2a_pattern = ...
test_str = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
re.sub(q2a_pattern, "", test_str)

In [None]:
grader.check("q2a")

<br><br>

--- 
### Question 2b

Rather than writing a regular expression to detect and remove the HTML tags, we could instead write a regular expression to **capture** the device name between the angle brackets.  To simplify the problem, you may assume the device name is between a right angle bracket (`>`) and a left angle bracket (`<`). Here we will use [**capturing groups**](https://docs.python.org/3/howto/regex.html#grouping) by placing parentheses around the part of the regular expression we want to return.


**Hint:** The output of the following cell should be `['Twitter for iPhone']`.
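
To illustrate how a capturing group changes what `re.findall` returns, here is a small sketch on an unrelated string. The sample text and variable names are made up for this example.

```python
import re

sample = "call f(x) then g(y)"

# With a capturing group, findall returns only the captured part.
with_group = re.findall(r"\((\w+)\)", sample)

# Without a group, findall returns the entire match.
without_group = re.findall(r"\(\w+\)", sample)

print(with_group)     # ['x', 'y']
print(without_group)  # ['(x)', '(y)']
```

The surrounding delimiters still have to match, but only the text inside the group's parentheses ends up in the result.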


In [None]:
q2b_pattern = ...
test_str = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
re.findall(q2b_pattern, test_str)

In [None]:
grader.check("q2b")

<br><br>

---
### Question 2c

Using either of the two regular expressions you just created and `Series.str.replace`[(docs)](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html) or `Series.str.extract`[(docs)](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html), add a new column called `"device"` to **all** of the DataFrames in `tweets` containing just the text describing the device (without the HTML tags). You may use one or multiple lines.

**Note:** If you choose to use a loop to go through `tweets`, you may also find `DataFrame.assign` helpful, see the documentation [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html), but this is not necessarily required.
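
The following sketch shows how `Series.str.extract`, `Series.str.replace`, and `DataFrame.assign` behave on a toy table. The `orders` DataFrame and its column names are hypothetical and unrelated to the tweet data.

```python
import pandas as pd

# Hypothetical toy table, not the tweets data.
orders = pd.DataFrame({"label": ["item-12", "item-7", "misc"]})

# str.extract returns the first capture group; rows with no match become NaN.
extracted = orders["label"].str.extract(r"-(\d+)", expand=False)

# str.replace with regex=True substitutes every match of the pattern.
stripped = orders["label"].str.replace(r"\d", "", regex=True)

# assign returns a new DataFrame with the extra columns added.
orders = orders.assign(number=extracted, prefix=stripped)
print(orders)
```

Note the difference in failure modes: `str.extract` yields a missing value when the pattern does not match, while `str.replace` simply leaves the string unchanged.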

In [None]:
grader.check("q2c")

<br><br>

---
### Question 2d

To examine the most frequently used devices by each individual, implement the `most_freq` function that takes in a `Series` and an integer `k`, and returns a new `Series` containing the `k` most commonly occurring entries in the first series, where the values are the counts of the entries and the indices are the entries themselves.

For example: 
```python
most_freq(pd.Series(["A", "B", "A", "C", "B", "A"]), k=2)
```
would return:
```
A    3
B    2
dtype: int64
```




**Hint**: You may consider using `value_counts`, `sort_values`, `head` or some other method.
 Think of what might be the most efficient implementation.

In [None]:
def most_freq(series, k = 5):
    ...

most_freq(tweets["Cristiano"]['device'])

In [None]:
grader.check("q2d")

Run the following two cells to compute a table and plot describing the top 5 most commonly used devices for each user.

In [None]:
# Run this cell to compute a table, no further action needed.
device_counts = pd.DataFrame(
    [most_freq(tweets[name]['device']).rename(name)
     for name in tweets]
).fillna(0)
device_counts

In [None]:
# Run this cell to generate the plot, no further action needed.
make_bar_plot(device_counts.T, title="Count of Tweets by Source",
               xlabel="Source", ylabel="Count")
plt.xticks(rotation=45)
plt.legend(title="Handle");

<!-- BEGIN QUESTION -->

<br><br>

---
### Question 2e

Given the plot above, what might we want to investigate during EDA? Name some possible questions you may have about the dataset in light of the information shown in the plot.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<br><br>

---
### Question 2f

We just looked at the top 5 most commonly used devices for each user. However, we used the number of tweets as a measure, when it might be better to compare these distributions by comparing _proportions_ of tweets (i.e. what percentage of all tweets for a user were published from each device). Why might proportions of tweets be better measures than numbers of tweets?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br/><br/><br/>

<hr style="border: 1px solid #fdb515;" />


## Question 3: When?

Now that we've explored the sources of each of the tweets, we will perform some time series analysis. A look into the temporal aspect of the data could reveal insights about how a user spends their day, when they eat and sleep, etc. In this question, we will focus on the time at which each tweet was posted.


<br><br>

---
### Question 3a

Complete the following function `add_hour` that takes in a tweets dataframe `df`, and two variables `time_col` and `result_col` representing the string names of the associated columns in `df`. Your function should use the timestamps in the `time_col` column to compute the hour of the day as a floating point number and store it in a new column `result_col`, according to the formula:

$$
\text{hour} + \frac{\text{minute}}{60} + \frac{\text{second}}{60^{2}}
$$
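
As a worked instance of the formula above, the sketch below evaluates it for a single, arbitrarily chosen timestamp using Python's standard `datetime` module. It only checks the arithmetic; the assignment itself asks for a vectorized pandas version.

```python
from datetime import datetime

# An arbitrary example timestamp: 13:30:36.
stamp = datetime(2021, 2, 6, 13, 30, 36)

# hour + minute / 60 + second / 60^2
fractional_hour = stamp.hour + stamp.minute / 60 + stamp.second / 60**2
print(round(fractional_hour, 2))  # 13.51
```

Here 30 minutes contribute 0.5 hours and 36 seconds contribute 0.01 hours, giving 13.51.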

**Note:** The below code calls your `add_hour` function and updates each tweets dataframe by using the `created_at` timestamp column to calculate and store the `hour` column.

**Hint:** See the following link for an example of working with timestamps using the [`dt` accessors](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#dt-accessor). 


In [None]:
def add_hour(df, time_col, result_col):
    ...
    return df

# Do not modify the below code.
tweets = {handle: add_hour(df, "created_at", "hour") for handle, df in tweets.items()}
tweets["AOC"]["hour"].head()

In [None]:
grader.check("q3a")

With our new `hour` column, let's take a look at the distribution of tweets for each user by time of day. The following cell helps create a density plot on the number of tweets based on the hour they are posted. 

The function `bin_df` takes in a dataframe, an array of bins, and a column name; it bins the values in the specified column, returning a dataframe with the bin lower bound and the number of elements in the bin. This function uses [`pd.cut`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html), a pandas utility for binning numerical values that you may find helpful in the distant future.
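
To get a feel for what `pd.cut` does before running `bin_df`, here is a minimal sketch on made-up numbers. The values and bin edges are chosen only for illustration.

```python
import numpy as np
import pandas as pd

values = pd.Series([0.2, 0.7, 1.4, 1.6, 1.9])
bins = np.arange(0, 3, 1)  # bin edges 0, 1, 2

# pd.cut labels each value with the interval it falls into: (0, 1] or (1, 2].
# value_counts tallies each interval, and sort_index orders them by position.
binned = pd.cut(values, bins).value_counts().sort_index()
print(binned)
```

By default the intervals are open on the left and closed on the right, so a value exactly equal to the lowest edge would not be placed in any bin.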

Run the cell and answer the following question about the plot.

In [None]:
# Run this cell to generate the plot, no further action is needed.
def bin_df(df, bins, colname):
    binned = pd.cut(df[colname], bins).value_counts().sort_index()
    return pd.DataFrame({"counts": binned, "bin": bins[:-1]})

hour_bins = np.arange(0, 24.5, .5)
binned_hours = {handle: bin_df(df, hour_bins, "hour") for handle, df in tweets.items()}

make_line_plot(binned_hours, "bin", "counts", title="Distribution of Tweets by Time of Day",
               xlabel="Hour", ylabel="Number of Tweets")

<!-- BEGIN QUESTION -->

<br><br>

---
### Question 3b
Compare Cristiano's distribution with those of AOC and Elon Musk. In particular, compare the distributions before and after Hour 6. What differences did you notice? What might be a possible cause of that? Do the data plotted above seem reasonable?

**Hint:** if you are not familiar with who Cristiano, AOC, and Elon Musk are, it may be helpful to Google information about these people, their occupations, and where they live.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br><br>

---
### Question 3c

To account for different locations of each user in our analysis, we will next adjust the `"created_at"` timestamp for each tweet to the respective timezone of each user. Complete the following function `convert_timezone` that takes in a tweets dataframe `df` and a timezone `new_tz` and adds a new column `"converted_time"` that has the adjusted `"created_at"` timestamp for each tweet. The provided code at the bottom of the cell will run your `convert_timezone` function to convert the timezones of each user to the appropriate location.

**Hint:** Again, please see the following link for an example of working with [`dt` accessors](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#dt-accessor).
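
The idea behind a timezone conversion can be sketched with the standard library alone: the instant stays the same and only the wall-clock reading changes. The fixed offset of UTC-5 below is a hypothetical example; named zones such as `America/Los_Angeles` additionally account for daylight saving time, and pandas offers a vectorized equivalent through the `dt` accessor.

```python
from datetime import datetime, timedelta, timezone

# A timezone-aware timestamp expressed in UTC.
utc_time = datetime(2021, 2, 6, 23, 30, tzinfo=timezone.utc)

# Hypothetical fixed offset of UTC-5 (no daylight saving rules).
minus_five = timezone(timedelta(hours=-5))

# astimezone keeps the instant and changes the displayed local time.
local_time = utc_time.astimezone(minus_five)
print(local_time.hour)         # 18
print(local_time == utc_time)  # True: both describe the same instant
```

Because the instant is unchanged, only quantities derived from the local clock, such as the hour of the day, differ after conversion.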


In [None]:
def convert_timezone(df, new_tz):
    ...
    return df

timezones = {"AOC": "EST", "Cristiano": "Europe/Lisbon", "elonmusk": "America/Los_Angeles"}

tweets = {handle: convert_timezone(tweets[handle], timezones[handle]) for handle in tweets.keys()}

In [None]:
grader.check("q3c")

With our adjusted timestamps for each user based on their timezone, let's take a look again at the distribution of tweets by time of day.

In [None]:
# Run this cell to generate the plot, no further action is needed.
tweets = {handle: add_hour(df, "converted_time", "converted_hour") for handle, df in tweets.items()}
binned_hours = {handle: bin_df(df, hour_bins, "converted_hour") for handle, df in tweets.items()}

make_line_plot(binned_hours, "bin", "counts", title="Distribution of Tweets by Time of Day (timezone-corrected)",
               xlabel="Hour", ylabel="Number of Tweets")

<br/><br/><br/>

<hr style="border: 1px solid #fdb515;" />


## Question 4: Sentiment Analysis


In the past few questions, we have explored the sources of the tweets and when they are posted. Although on their own, they might not seem particularly intricate, combined with the power of regular expressions, they could actually help us infer a lot about the users. In this section, we will continue building on our past analysis and specifically look at the **sentiment of each tweet** -- this would lead us to a much more direct and detailed understanding of how users view certain subjects and people. **Sentiment analysis** is generally the computational task of classifying the emotions in a body of text as positively or negatively charged.

<br/>
How do we actually measure the sentiment of each tweet? In our case, we can use the words in the text of a tweet for our calculation! For example, the word "love" within the sentence "I love America!" has a positive sentiment, whereas the word "hate" within the sentence "I hate taxes!" has a negative sentiment. In addition, some words have stronger positive / negative sentiment than others: "I love America." is more positive than "I like America."

We will use the VADER (Valence Aware Dictionary and sEntiment Reasoner) lexicon ([github](https://github.com/cjhutto/vaderSentiment), [original paper](https://doi.org/10.1609/icwsm.v8i1.14550)) to analyze the sentiment of AOC's tweets. VADER is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media which is great for our usage.

The VADER lexicon gives the sentiment of individual words. Run the following cell to show the first few rows of the lexicon:

In [None]:
# Run this cell to print the first 10 rows, no further action needed.
print(''.join(open("vader_lexicon.txt").readlines()[:10]))

As you can see, the lexicon contains emojis too! Each row contains a word ("token") and various measures of the **polarity** of that word, measuring how positive or negative the word is, on a scale of -4 (extremely negative) to +4 (extremely positive). We explain more below.

### VADER Sentiment Analysis

VADER ([github](https://github.com/cjhutto/vaderSentiment), [original paper](https://doi.org/10.1609/icwsm.v8i1.14550)) is a tool that can quantitatively describe the polarity or "sentiment" of a word.

VADER doesn't "read" sentences, but works by parsing sentences into words, assigning a preset generalized score from their testing sets to each word separately. 

VADER relies on humans to stabilize its scoring. The creators use Amazon Mechanical Turk, a crowdsourcing survey platform, to train its model. Its training data consists of a small corpus of tweets, New York Times editorials and news articles, Rotten Tomatoes reviews, and Amazon product reviews, tokenized using the natural language toolkit (NLTK). Each word in each dataset was reviewed and rated on a scale of -4 (extremely negative) to 4 (extremely positive) by at least 10 trained individuals who had signed up to work on these tasks through Mechanical Turk. 


<!-- BEGIN QUESTION -->

<br><br>

---
### Question 4a
Please score the sentiment of one of the following words, using your own personal interpretation. No code is required for this question!

- police
- order
- Democrat
- Republican
- gun
- dog
- technology
- TikTok
- security
- face-mask
- science
- climate change
- vaccine

What score did you give it and why? Can you think of a situation in which this word would carry the opposite sentiment to the one you’ve just assigned?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Optional (ungraded):** Are there circumstances (e.g. certain kinds of language or data) when you might not want to use VADER? What features of human speech might VADER misrepresent or fail to capture?

<br><br>

---
### Question 4b

Let's first load in the data containing all the sentiments. 
In `vader_lexicon.txt`, each row contains the word (token), average polarity, standard deviation of polarity, and the "raw polarity ratings" of each of the 10 human raters. See the description of `vader_lexicon.txt` given in the documentation [here](https://github.com/cjhutto/vaderSentiment#resources-and-dataset-descriptions) for more information about each of these measures.

Read `vader_lexicon.txt` into a new dataframe called `sent`. The index of the dataframe should be the words in the lexicon and should be named `token`. `sent` should have one column named `polarity`, storing the average polarity of each word. We will not incorporate the polarity standard deviation and raw ratings for this exercise. The first five entries of `sent` should look like this:

token | polarity
--- |---
**$:** | -1.5
**%)** | -0.4
**%-)**| -1.5
**&-:**| -0.4
**&:** | -0.7

**Note:** If you're confused as to why the first few entries of `sent` don't seem to be words, don't worry: it's because VADER also includes the polarities of emoticons as well!

**Hint1:** The `pd.read_csv` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) function may help here. 

**Hint2:** Since the file is tab-separated, be sure to read the documentation on how to set the separator with `pd.read_csv`'s parameter `sep`. To check your work, the first token should be `$:`.

**Hint3:** Is there a header (that is, data that can be used as column names) in the csv file and how can you account for this?
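
The sketch below shows how `pd.read_csv` can be pointed at tab-separated text that has no header row. The two-line sample and the column names `fruit`, `score`, and `spread` are hypothetical; they only demonstrate the `sep`, `header`, `names`, and `index_col` parameters.

```python
import io

import pandas as pd

# Hypothetical tab-separated text with no header row.
raw = "apple\t1.5\t0.2\nkiwi\t-0.5\t0.9\n"

table = pd.read_csv(
    io.StringIO(raw),
    sep="\t",                            # fields are separated by tabs
    header=None,                         # the first line is data, not column names
    names=["fruit", "score", "spread"],  # supply column names ourselves
    index_col="fruit",                   # use one of the named columns as the index
)
print(table)
```

Without `header=None`, the first data row would be consumed as column names and silently dropped from the table.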


In [None]:
sent = ...
sent.head()

In [None]:
grader.check("q4b")

<br><br>

---
### Question 4c

Before further analysis, we will need some more tools that can help us extract the necessary information and clean our data.

Complete the following regular expressions that will help us match part of a tweet that we either (i) want to remove or (ii) are interested in learning more about.

#### **Question 4c Part i**
Assign a regular expression to a new variable `punct_re` that captures all of the punctuation within a tweet. We consider punctuation to be any non-word, non-whitespace character.

**Note**: A word character is any character that is alphanumeric or an underscore. A whitespace character is any character that is a space, a tab, a new line, or a carriage return.
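
To make the two character classes concrete, this short sketch applies `\w` and `\s` to a made-up string. It shows what each class matches; combining them into a punctuation pattern is left to you.

```python
import re

sample = "snake_case\tis 1 word!"

# \w+ matches runs of letters, digits, and underscores.
words = re.findall(r"\w+", sample)

# \s matches a single whitespace character such as a space or a tab.
spaces = re.findall(r"\s", sample)

print(words)   # ['snake_case', 'is', '1', 'word']
print(spaces)  # ['\t', ' ', ' ']
```

Notice that the exclamation mark appears in neither list, since it is neither a word character nor whitespace.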


In [None]:
punct_re = ...

re.sub(punct_re, " ", tweets["AOC"].iloc[0]["full_text"])

In [None]:
grader.check("q4ci")

#### **Question 4c Part ii**
Assign a regular expression to a new variable `mentions_re` that matches any mention in a tweet. Your regular expression should use a capturing group to extract the user's username in a mention. The `@` sign preceding the username should not be extracted from the tweet.

**Hint**: A user mention within a tweet always starts with the `@` symbol and is followed by a series of word characters (with no space in between). For more explanation of what a word character is, check out the **Note** section in Part i.


In [None]:
mentions_re = ...

re.findall(mentions_re, tweets["AOC"].iloc[0]["full_text"])

In [None]:
grader.check("q4cii")

<br/>

### Tweet Sentiments and User Mentions

As you have seen in the previous part of this question, there are actually a lot of interesting components that we can extract out of a tweet for further analysis! For the rest of this question though, we will focus on one particular case: the sentiment of each tweet in relation to the users mentioned within it. 

To calculate the sentiments for a sentence, we will follow this procedure:

1. Remove the punctuation from each tweet so we can analyze the words.
2. For each tweet, find the sentiment of each word.
3. Calculate the sentiment of each tweet by taking the sum of the sentiments of its words.
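
The three steps above can be sketched in plain Python on a couple of sentences. The `toy_lexicon` scores are invented for illustration and are not taken from VADER, and replacing only ASCII punctuation via `string.punctuation` is a simplification of the punctuation definition used in this assignment.

```python
import string

# Hypothetical mini-lexicon; these scores are made up, not VADER's.
toy_lexicon = {"love": 3.0, "like": 1.5, "hate": -2.5}

# Simplification: map each ASCII punctuation character to a space.
to_spaces = str.maketrans(string.punctuation, " " * len(string.punctuation))


def toy_sentiment(text):
    # Step 1: lower-case the text and replace punctuation with spaces.
    words = text.lower().translate(to_spaces).split()
    # Steps 2 and 3: look up each word (unknown words score 0) and sum.
    return sum(toy_lexicon.get(word, 0.0) for word in words)


print(toy_sentiment("I love America!"))  # 3.0
print(toy_sentiment("I hate taxes!"))    # -2.5
```

Words missing from the lexicon contribute nothing to the total, which is the same choice the notebook makes later when it fills missing polarities with 0.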

<br><br>

---
### Question 4d

Let's use our `punct_re` regular expression from the previous part to clean up the text a bit more! The goal here is to remove all of the punctuation to ensure words can be properly matched with those from VADER to actually calculate the full sentiment score.

Complete the following function `sanitize_texts` that takes in a table `df` and adds a new column `clean_text` by converting all characters in its original `"full_text"` column **to lower case** and **replacing all instances of punctuation with a space character**.


In [None]:
def sanitize_texts(df):
    ...
    return df

tweets = {handle: sanitize_texts(df) for handle, df in tweets.items()}
tweets["AOC"]["clean_text"].head()

In [None]:
grader.check("q4d")

<br><br>

---
### Question 4e
With the texts sanitized, we can now extract all the user mentions from tweets. 

Complete the following function `extract_mentions` that takes in a Series containing the values stored in the `"full_text"` column from a tweets DataFrame and uses `mentions_re` to extract all the mentions in a dataframe. The returned dataframe is:
* single-indexed by the IDs of the tweets
* has one row for each mention
* has one column named `twitter_handle`, which contains each mention in all lower-cased characters

After filling out the `extract_mentions` function, running the cell below should output three **DataFrames** (not Series), each belonging to AOC, Elon Musk, and Cristiano Ronaldo, respectively. As a sanity check, the first DataFrame (belonging to AOC) should look like this:

id| twitter_handle
--- |---
**1358149122264563712** | repescobar
**1358147616400408576** | rokhanna
**1358130063963811840**| jaketapper
**1358130063963811840**| repnancymace
**1358130063963811840** | aoc

**Note:** While the staff solution contains a single line, you may find it helpful to break the problem into multiple subparts and complete this in multiple lines.

**Hints**: There are several ways to approach this problem.
* Here is a list of documentation links for potentially useful functions. You are not expected to use all of them, and you may also need additional functions that are not listed below.
    * Extracting valid mentions: `str.extractall` ([link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extractall.html?highlight=extractall)), `str.findall` ([link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.findall.html)), `dropna` ([link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dropna.html))
    * Reformatting data: `.reset_index` ([link](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html)), `.explode` ([link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.explode.html)), `.to_frame`([link](https://pandas.pydata.org/docs/reference/api/pandas.Series.to_frame.html)).
    * You can find an example of how to chain `.explode` and `.to_frame` in the *Tidying Up the Data* section following this question.
* The staff solution uses a single line of chained methods, but you are encouraged to break the problem into subparts with multiple lines.
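
To show the shape of what `str.extractall` returns, here is a sketch on a toy Series of codes. The data, the index values, and the group name `digit` are all hypothetical; the point is the extra `match` index level and how `reset_index` can drop it.

```python
import pandas as pd

# Hypothetical toy Series indexed by an "id" label.
codes = pd.Series(
    ["a1b2", "c3", "none"],
    index=pd.Index([10, 20, 30], name="id"),
)

# extractall returns one row per match, with a MultiIndex of (id, match).
# Rows with no match (id 30 here) do not appear at all.
matches = codes.str.extractall(r"(?P<digit>\d)")

# Dropping the "match" level leaves a single index with repeated ids.
flat = matches.reset_index(level="match", drop=True)
print(flat)
```

Naming the capture group gives the resulting column a readable name, which avoids a separate renaming step.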

In [None]:
def extract_mentions(full_texts):
    mentions = ...
    return mentions[["twitter_handle"]]

# Uncomment this line to help you debug.
# display(extract_mentions(tweets["AOC"]["full_text"]).head())

# Do not modify the below code.
mentions = {handle: extract_mentions(df["full_text"]) for handle, df in tweets.items()} # for autograder
display(extract_mentions(tweets["AOC"]["full_text"]).head())
display(extract_mentions(tweets["elonmusk"]["full_text"]).head())
display(extract_mentions(tweets["Cristiano"]["full_text"]).head())

In [None]:
grader.check("q4e")

<br/>

### Tidying Up the Data

Now, let's convert the tweets into what's called a [*tidy format*](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) to make the sentiments easier to calculate. The `to_tidy_format` function implemented for you uses the `clean_text` column of each tweets dataframe to create a tidy table, which is:

* single-indexed by the IDs of the tweets, for every word in the tweet.
* has one column named `word`, which contains the individual words of each tweet.

Run the following cell to convert the table into the tidy format. Take a look at the first 5 rows from the "tidied" tweets dataframe for AOC and see if you can find out how the structure has changed.

**Note**: Although there is no work needed on your part, we have referenced a few more advanced pandas methods you might have not seen before -- you should definitely look them up in the documentation when you have a chance, as they are quite powerful in restructuring a dataframe into a useful intermediate state!

In [None]:
# Run this cell to convert table into tidy format, no further action is needed.
def to_tidy_format(df):
    tidy = (
        df["clean_text"]
        .str.split()
        .explode()
        .to_frame()
        .rename(columns={"clean_text": "word"})
    )
    return tidy

tidy_tweets = {handle: to_tidy_format(df) for handle, df in tweets.items()}
tidy_tweets["AOC"].head()

### Adding in the Polarity Score

Now that we have this table in the tidy format, it becomes much easier to find the sentiment of each tweet: we can join the table with the lexicon table. 

The following `add_polarity` function adds a new `polarity` column to the `df` table. The `polarity` column contains the sum of the sentiment polarity of each word in the text of the tweet.

**Note**: Again, though there is no work needed on your part, it is important for you to go through how we set up this method and actually understand what each method is doing. In particular, see how we deal with missing data.
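
As a small-scale illustration of the missing-data handling mentioned above, the sketch below left-joins a toy word table against a toy score table. Both tables and their values are invented; they mimic the shape of the tidy tweets and the lexicon, not their contents.

```python
import pandas as pd

# Hypothetical tidy table: one row per word, indexed by a repeated "id".
words = pd.DataFrame(
    {"word": ["good", "xyz", "bad", "good"]},
    index=pd.Index([1, 1, 2, 2], name="id"),
)

# Hypothetical lexicon: polarity indexed by token.
scores = pd.DataFrame(
    {"polarity": [2.0, -2.0]},
    index=pd.Index(["good", "bad"], name="token"),
)

# A left join keeps every word; words absent from the lexicon get NaN.
joined = words.merge(scores, how="left", left_on="word", right_index=True)

# Treat unknown words as neutral (0), then sum per id.
totals = joined["polarity"].fillna(0).groupby(level="id").sum()
print(totals)
```

Using an inner join instead would silently drop unknown words, and any item made up entirely of unknown words would disappear from the result rather than receive a total of 0.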

In [None]:
# Just run this cell to add the "polarity" column.
# No further code is needed, but verify your understanding of each chained method.
def add_polarity(df, tidy_df):
    df["polarity"] = (
        tidy_df
        .merge(sent, how='left', left_on='word', right_index=True)
        .reset_index()
        .loc[:, ['id', 'polarity']]
        .fillna(0)
        .groupby('id')
        .sum()
    )
    return df

tweets = {handle: add_polarity(df, tidy_df)
          for (handle, df), tidy_df in zip(tweets.items(), tidy_tweets.values())}
tweets["AOC"][["clean_text", "polarity"]].head()

Comment: In the demo cell above, `add_polarity()` is a very straightforward approach to sentiment analysis: define a tweet's sentiment as the **sum** of each word's sentiment as determined by the VADER lexicon. The VADER lexicon itself relies on crowdsourced human ratings to stabilize its scoring. Sentence structure and word phrasing heavily impact sentiment, but our current approach ignores this context, instead opting for approximate, naive sentiments to perform initial EDA.
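One way to see this limitation is a tiny sketch with a made-up lexicon. The scores below are hypothetical and are not taken from VADER; `word_sum_polarity` is an illustrative helper, not a function from this notebook.

```python
# Hypothetical lexicon values chosen for illustration; they are NOT real VADER scores.
toy_lexicon = {"good": 2.0, "bad": -2.0}

def word_sum_polarity(text):
    """Sum per-word scores, treating words outside the lexicon as neutral (0)."""
    return sum(toy_lexicon.get(word, 0.0) for word in text.lower().split())

plain = word_sum_polarity("this is good")
negated = word_sum_polarity("this is not good")
print(plain, negated)
```

Under this toy lexicon, both sentences receive the same score, because a pure word-sum has no way to account for the negation.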

If we were to further explore this direction of the data, we would consider approaches for computing tweet sentiment that are modern, nuanced, and more accurate (for some definition of "accurate"). Such approaches often adapt deep natural language processing models to sentiment analysis tasks, meaning they directly address the sentiment of a body of text rather than scoring individual words as our VADER-based approach does. However, these models still depend on a robust "training dataset" of tweet sentiments, which is often still generated through crowdsourced human work. If you're curious about this, explore Data C104: Human Contexts and Ethics and CS 288: Natural Language Processing!

<br><br>

---
### Question 4f
With our polarity column in place, we can finally explore how the sentiment of each tweet relates to the user(s) mentioned in it.

Complete the following function `mention_polarity` that takes in the original tweets dataframe `df` and a mentions dataframe `mention_df`, and returns a series where the mentioned users are the index and the corresponding mean sentiment scores of the tweets mentioning them are the values.

**Hint**: You should consider joining tables together in this question.


In [None]:
def mention_polarity(df, mention_df):
    ...

aoc_mention_polarity = mention_polarity(tweets["AOC"],mentions["AOC"]).sort_values(ascending=False)
aoc_mention_polarity

In [None]:
grader.check("q4f")

<!-- BEGIN QUESTION -->

<br><br>

---
### Question 4g

In `q4f` above, we aggregated the polarity of the tweets by computing the mean sentiment score of tweets mentioning each user. What are some drawbacks of the decision to use the mean as an aggregation function? What other aggregation function(s) might be more appropriate than the mean?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br/><br/><br/>

<hr style="border: 1px solid #fdb515;" />


## Question 5: You Do EDA!

Congratulations! You have finished all of the preliminary analysis on recent tweets from AOC, Cristiano, and Elon Musk.

As you might have recognized, there is still far more to explore within the data, building upon what we have uncovered so far. In this open-ended question, we want you to come up with a new perspective that can expand upon our analysis of the sentiment of each tweet.

For this question, you will perform some text analysis on our `tweets` dataset. Your analysis should have two parts:

1. a piece of code that manipulates `tweets` in some way and produces informative output (e.g. a dataframe, series, or plot)
2. a short (4-5 sentence) description of the findings of your analysis: what were you looking for? What did you find? How did you go about answering your question?

Your work should involve text analysis in some way, whether that's using regular expressions or some other form.

To aid you in creating plots, we provide the plotting helper functions in the table below. These are the same helpers we have used throughout this notebook, and all accept dictionaries with a similar structure to `tweets`. That being said, if you'd like to experiment with using Matplotlib and Seaborn to generate plots on your own, please do so!

| Helper | Description |
|--------|-------------|
| `make_bar_plot` | Plot side-by-side bar plots of data like [`plt.bar`](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.bar.html) |
| `make_histogram` | Plot overlaid histograms of data like [`plt.hist`](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html) |
| `make_line_plot` | Plot overlaid line plots of data like [`plt.plot`](https://matplotlib.org/3.3.3/api/_as_gen/matplotlib.pyplot.plot.html) |
| `make_scatter_plot` | Plot overlaid scatter plots of data like [`plt.scatter`](https://matplotlib.org/3.3.3/api/_as_gen/matplotlib.pyplot.scatter.html) |

Each of the provided helpers is in `ds100_utils.py` and has a comprehensive docstring. You can read the docstring by calling `help` on the plotting function:

In [None]:
help(make_line_plot)

To assist you in getting started, here are a few ideas you can analyze for this question:

- dig deeper into when devices were used
- how sentiment varies with time of tweet
- expand on regexes from 4b to perform additional analysis (e.g. hashtags)
- examine sentiment of tweets over time

In general, try to combine the analyses from earlier questions or create new analysis based on the scaffolding we have provided.

This question is worth 4 points and will be graded based on this rubric:

| | 2 points | 1 point | 0 points |
|-----|-----|-----|-----|
| **Code** | Produces a mostly informative plot or pandas output that addresses the question posed in the student's description and uses at least one of the following pandas DataFrame/Series methods: `groupby`, `agg`, `merge`, `pivot_table`, `str`, `apply` | Attempts to produce a plot or manipulate data but the output is unrelated to the proposed question, or doesn't utilize at least one of the listed methods | No attempt at writing code |
| **Description** | Describes the analysis question and procedure comprehensively and summarizes results correctly | Attempts to describe analysis and results but description of results is incorrect or analysis of results is disconnected from the student’s original question | No attempt at writing a description |

<!-- BEGIN QUESTION -->

<br><br>

---
### Question 5a

Use this space to put your EDA code.


In [None]:
# perform your text analysis here

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<br><br>

---
### Question 5b

Use this space to put your EDA description.


_Write your description here._

<!-- END QUESTION -->

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Congratulations! You have finished Homework 3!
Below, you will see two cells. Running the first cell will automatically generate a PDF of all questions that need to be manually graded, and running the second cell will automatically generate a zip with your autograded answers. **You are responsible for submitting both the coding portion (the zip from Homework 3) and the written portion (the PDF from Homework 3) to their respective Gradescope portals.** The coding portion should be submitted to Homework 3 Coding as a single zip file, and the written portion should be submitted to Homework 3 Written as a single PDF file. When submitting the written portion, please ensure you select pages appropriately.

If there are issues with automatically generating the PDF in the first cell, you can try downloading the notebook as a PDF by clicking on `File -> Save and Export Notebook As... -> PDF`. If that doesn't work either, you can manually take screenshots of your answers to the manually graded questions and submit those. Either way, **you are responsible for ensuring your submission follows our requirements. We will NOT be granting regrade requests for submissions that don't follow instructions.**

In [None]:
from otter.export import export_notebook
from os import path
from IPython.display import display, HTML
export_notebook("hw03.ipynb", filtering=True, pagebreaks=True)
if path.exists('hw03.pdf'):
    display(HTML("Download your PDF <a href='hw03.pdf' download>here</a>."))
else:
    print("\n PDF generation failed, please try the other methods described above")

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)