# Text Analysis Using Twitter

In this assignment, I will be exploring tweets from several high profile Twitter users.  

In this assignment I will gain practice with:
* Conducting Data Cleaning and EDA on a text-based dataset.
* Manipulating data in pandas with the datetime and string accessors.
* Writing regular expressions and using pandas regex methods.
* Performing sentiment analysis on social media using VADER.


In [15]:
# Import dependencies
# Run this cell to set up your notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

from ds100_utils import *

# Ensure that Pandas shows at least 280 characters in columns, so we can see full tweets
pd.set_option('max_colwidth', 280)
plt.style.use('fivethirtyeight')
sns.set()
sns.set_context("talk")

def horiz_concat_df(dict_of_df, head=None):
    """
    Horizontally concatenate multiple DataFrames for easier visualization.
    Each DataFrame must have the same columns.
    """
    df = pd.concat([df.reset_index(drop=True) for df in dict_of_df.values()], axis=1, keys=dict_of_df.keys())
    if head is None:
        return df
    return df.head(head)

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Importing the Data


The data for this assignment was obtained using the [Twitter APIs](https://developer.twitter.com/en/docs/twitter-api).  

In [16]:
# just run this cell
from os import listdir
for f in listdir("data"):
    print(f)

AOC_recent_tweets.txt
.DS_Store
EmmanuelMacron_recent_tweets.txt
Cristiano_recent_tweets.txt
elonmusk_recent_tweets.txt
BernieSanders_recent_tweets.txt
BillGates_recent_tweets.txt


Let's examine the contents of one of these files. I'll read the first 1000 **characters** of `data/BernieSanders_recent_tweets.txt`, store them in the variable `q1a`, and print the result so it can be read.

In [17]:
# Open the data and read first 1000 characters
with open("data/BernieSanders_recent_tweets.txt","r") as data:
    q1a = data.read(1000)
    print(q1a)

[{"created_at": "Sat Feb 06 22:43:03 +0000 2021", "id": 1358184460794163202, "id_str": "1358184460794163202", "full_text": "Why would we want to impeach and convict Donald Trump \u2013 a president who is now out of office? Because it must be made clear that no president, now or in the future, can lead an insurrection against the government he or she is sworn to protect.", "truncated": false, "display_text_range": [0, 243], "entities": {"hashtags": [], "symbols": [], "user_mentions": [], "urls": []}, "source": "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>", "in_reply_to_status_id": null, "in_reply_to_status_id_str": null, "in_reply_to_user_id": null, "in_reply_to_user_id_str": null, "in_reply_to_screen_name": null, "user": {"id": 216776631, "id_str": "216776631", "name": "Bernie Sanders", "screen_name": "BernieSanders", "location": "Vermont", "description": "U.S. Senator for Vermont. Not me, us.", "url": "https://t.co/jpg8Sp1GhR", "entities": {"

Next, I create a dictionary `tweets` that maps each handle to a DataFrame of that user's tweets. The index of each DataFrame is set to the `id` of the tweets.

In [18]:
def load(data):
    df = pd.read_json(data)
    df = df.set_index('id')
    return df

tweets = {
    "AOC": load("data/AOC_recent_tweets.txt"),
    "Cristiano": load("data/Cristiano_recent_tweets.txt"),
    "elonmusk": load("data/elonmusk_recent_tweets.txt")
}

In [19]:
# View the first tweets from Elon Musk
tweets["elonmusk"].head()

Unnamed: 0_level_0,created_at,id_str,full_text,truncated,display_text_range,entities,extended_entities,source,in_reply_to_status_id,in_reply_to_status_id_str,...,favorite_count,favorited,retweeted,possibly_sensitive,lang,retweeted_status,quoted_status_id,quoted_status_id_str,quoted_status_permalink,quoted_status
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1357991946082418690,2021-02-06 09:58:04+00:00,1357991946082418688,The Second Last Kingdom https://t.co/Je4EI88HmV,False,"[0, 23]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 1357991942471094275, 'id_str': '1357991942471094275', 'indices': [24, 47], 'media_url': 'http://pbs.twimg.com/media/EtiOegrVEAMCgZE.jpg', 'media_url_https': 'https://pbs.twimg.com/media/EtiOegrV...","{'media': [{'id': 1357991942471094275, 'id_str': '1357991942471094275', 'indices': [24, 47], 'media_url': 'http://pbs.twimg.com/media/EtiOegrVEAMCgZE.jpg', 'media_url_https': 'https://pbs.twimg.com/media/EtiOegrVEAMCgZE.jpg', 'url': 'https://t.co/Je4EI88HmV', 'display_url': '...","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",,,...,352096,False,False,0.0,en,,,,,
1357973565413367808,2021-02-06 08:45:02+00:00,1357973565413367808,@DumDin7 @Grimezsz Haven’t heard that name in years …,False,"[19, 53]","{'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'DumDin7', 'name': 'Dum Din', 'id': 1279896279733145601, 'id_str': '1279896279733145601', 'indices': [0, 8]}, {'screen_name': 'Grimezsz', 'name': '𝑪𝒍𝒂𝒊𝒓𝒆 𝒅𝒆 𝑳𝒖𝒏𝒆࿎', 'id': 276540738, 'id_str': '276540738', 'indi...",,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",1.357973e+18,1.357973e+18,...,2155,False,False,,en,,,,,
1357972904663687173,2021-02-06 08:42:25+00:00,1357972904663687168,@Grimezsz Dogecake,False,"[10, 18]","{'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'Grimezsz', 'name': '𝑪𝒍𝒂𝒊𝒓𝒆 𝒅𝒆 𝑳𝒖𝒏𝒆࿎', 'id': 276540738, 'id_str': '276540738', 'indices': [0, 9]}], 'urls': []}",,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",1.357835e+18,1.357835e+18,...,5373,False,False,,en,,,,,
1357970517165182979,2021-02-06 08:32:55+00:00,1357970517165182976,YOLT\n\nhttps://t.co/cnOf9yjpF1,False,"[0, 29]","{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/cnOf9yjpF1', 'expanded_url': 'https://m.youtube.com/watch?v=05QJlF06F4s', 'display_url': 'm.youtube.com/watch?v=05QJlF…', 'indices': [6, 29]}]}",,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",,,...,62717,False,False,0.0,en,,,,,
1357964347813687296,2021-02-06 08:08:24+00:00,1357964347813687296,@Kristennetten That’s Damian,False,"[15, 28]","{'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'Kristennetten', 'name': 'K10✨', 'id': 985686123123949568, 'id_str': '985686123123949568', 'indices': [0, 14]}], 'urls': []}",,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",1.357964e+18,1.357964e+18,...,5726,False,False,,en,,,,,


## The Why

This analysis is relevant because an advertising company might want to analyze text to find words related to products that a consumer may want to buy. A government might similarly scan text for keywords related to terrorism ("bombing", "killing", etc.) as a means of preventing an attack.

<!-- END QUESTION -->



<br/><br/><br/>
<br/><br/><br/>
<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />


## Source Analysis


In some cases, the Twitter feed of a public figure may be partially managed by a public relations firm. In these cases, the device used to post the tweet may help reveal whether it was the individual (e.g., from an iPhone) or a public relations firm (e.g., TweetDeck).  The tweets we have collected contain the source information but it is formatted strangely :(

In [20]:
# just run this cell
tweets["Cristiano"][["source"]]

Unnamed: 0_level_0,source
id,Unnamed: 1_level_1
1358137564587319299,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>"
1357379984399212545,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>"
1356733030962987008,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>"
1355924395064233986,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>"
1355599316300292097,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>"
...,...
32514882561638401,"<a href=""http://www.whosay.com"" rel=""nofollow"">WhoSay</a>"
32513604662071296,"<a href=""http://www.whosay.com"" rel=""nofollow"">WhoSay</a>"
32511823722840064,"<a href=""http://www.whosay.com"" rel=""nofollow"">WhoSay</a>"
32510294081146881,"<a href=""http://www.whosay.com"" rel=""nofollow"">WhoSay</a>"


In the next section, I will use regex to convert these messy HTML snippets into something more readable. 

--- 
### Regex

I will first use the Python `re` library to clean up the test string below. In the cell below, I'll write a regular expression that matches any **HTML tag**, then use the `re.sub` function to replace anything that matches the pattern with an empty string `""`.

An HTML tag is defined as a `<` character followed by zero or more non-`>` characters, followed by a `>` character. That is `<a>` and `</a>` are both considered _separate_ HTML tags.

<!--
BEGIN QUESTION
name: q2a
points: 2
-->

In [21]:
# Match any HTML tag: '<', zero or more non-'>' characters, then '>'
q2a_pattern = r'<[^>]*>'
test_str = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
re.sub(q2a_pattern, "", test_str)

'Twitter for iPhone'
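
To see why the exact wording of the tag definition matters, here is a small sketch comparing three candidate patterns on the same test string. The variable names `greedy`, `lazy`, and `negated` are mine, chosen for illustration only.

```python
import re

test_str = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'

# Greedy: .+ runs to the LAST '>' in the string, so the whole string is one match.
greedy = re.sub(r"<.+>", "", test_str)

# Lazy: .+? stops at the first '>' that follows each '<'.
lazy = re.sub(r"<.+?>", "", test_str)

# Negated class: literally "zero or more non-'>' characters" between '<' and '>'.
negated = re.sub(r"<[^>]*>", "", test_str)

print(repr(greedy))   # ''
print(repr(lazy))     # 'Twitter for iPhone'
print(repr(negated))  # 'Twitter for iPhone'
```

The lazy and negated-class patterns agree on this input; the negated class is the more direct translation of the definition because it also matches an empty tag such as `<>`.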

--- 
### Alternative way

Rather than writing a regular expression to detect and remove the HTML tags, we could instead write a regular expression to **capture** the device name between the opening and closing tags. Here we will use [**capturing groups**](https://docs.python.org/3/howto/regex.html#grouping) by placing parentheses around the part of the regular expression we want to return. For example, to capture the `21` in the string `08/21/83` we could use the pattern `r"08/(..)/83"`.


In [22]:
q2b_pattern = r">(.+)<"
test_str = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
re.findall(q2b_pattern, test_str)

['Twitter for iPhone']
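
As a quick check of the date example, the sketch below runs that pattern and also shows how several groups can be captured at once. The three-group pattern is my own addition for illustration.

```python
import re

date_str = "08/21/83"

# One capturing group: findall returns just the captured part.
day_only = re.findall(r"08/(..)/83", date_str)

# Three capturing groups: groups() returns them in order of their opening parentheses.
match = re.search(r"(\d{2})/(\d{2})/(\d{2})", date_str)
month, day, year = match.groups()

print(day_only)          # ['21']
print(month, day, year)  # 08 21 83
```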

---


Using one of the two regular expressions I just created and [`Series.str.replace`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html) or [`Series.str.extract`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html), I will add a new column called `"device"` to **all** of the DataFrames in `tweets` containing just the text describing the device (without the HTML tags).

<!--
BEGIN QUESTION
name: q2c
points: 2
-->

In [23]:
for name in tweets:
    # Capture the text between the opening and closing tags into a new "device" column
    tweets[name]['device'] = tweets[name]['source'].str.extract(r">(.+)<", expand=False)
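
Both approaches can be tried on a tiny stand-in for the `source` column. This is only a sketch: the two-row Series below is copied from the outputs shown earlier, and the names `via_extract` and `via_replace` are hypothetical.

```python
import pandas as pd

source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://www.whosay.com" rel="nofollow">WhoSay</a>',
])

# Option 1: capture the text between the tags (expand=False returns a Series).
via_extract = source.str.extract(r">(.+)<", expand=False)

# Option 2: delete every tag and keep what is left.
via_replace = source.str.replace(r"<[^>]*>", "", regex=True)

print(via_extract.tolist())  # ['Twitter for iPhone', 'WhoSay']
print(via_replace.tolist())  # ['Twitter for iPhone', 'WhoSay']
```

Passing `regex=True` explicitly avoids depending on the default of `Series.str.replace`, which has differed between pandas versions.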

---

To examine the devices each individual uses most frequently, I'll implement the `most_freq` function, which takes in a `Series` and returns a new `Series` containing the `k` most commonly occurring entries in the first series. The values of the result are the counts of the entries and the index holds the entries themselves.

For example: 
```python
most_freq(pd.Series(["A", "B", "A", "C", "B", "A"]), k=2)
```
would return:
```
A    3
B    2
dtype: int64
```

In [24]:
def most_freq(series, k=5):
    # value_counts sorts by count in descending order, so the first k rows are the top k
    return series.value_counts().head(k)
most_freq(tweets["Cristiano"]['device'])

Twitter for iPhone     1183
Twitter Web Client      959
WhoSay                  453
MobioINsider.com        144
Twitter for Android     108
Name: device, dtype: int64

Run the following two cells to compute a table and plot describing the top 5 most commonly used devices for each user.

In [25]:
device_counts = pd.DataFrame(
    [most_freq(tweets[name]['device'], 5).rename(name)
     for name in tweets]
).fillna(0)
device_counts

Unnamed: 0,Twitter for iPhone,Twitter Media Studio,Twitter Web Client,WhoSay,MobioINsider.com,Twitter for Android,Twitter Web App
AOC,3245.0,2.0,0.0,0.0,0.0,0.0,0.0
Cristiano,1183.0,0.0,959.0,453.0,144.0,108.0,0.0
elonmusk,3202.0,0.0,0.0,0.0,0.0,0.0,37.0


In [28]:
# just run this cell to plot it
make_bar_plot(device_counts.T, title="Count of Tweets by Source",
               xlabel="Source", ylabel="Count")
plt.xticks(rotation=45)
plt.legend(title="Handle");

NameError: name 'make_bar_plot' is not defined

## Further Investigation

I would like to further investigate the number of likes that individual tweets from AOC and Elon Musk receive. I am curious to see who gets more positive feedback and who gets more negative feedback. I would also like to know how popular Cristiano's tweets are relative to AOC's and Elon's by looking at the proportion of likes.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

---

We just looked at the top 5 most commonly used devices for each user. However, we used the number of tweets as a measure, when it might be better to compare these distributions by comparing _proportions_ of tweets. Why might proportions of tweets be better measures than numbers of tweets?

<!--
BEGIN QUESTION
name: q2f
points: 1
manual: true

-->

Proportions would be a better measure because the number of tweets sent from a specific device depends on how often a person tweets overall. If 50% of my tweets are from my iPhone but I have only tweeted 4 times, then 2 tweets come from the iPhone. Someone who sends only 20% of their tweets from an iPhone but has tweeted 100 times will have 20 tweets from the iPhone. By count, the other person looks like the heavier iPhone user, but proportionally I rely on my iPhone more, while the other person mostly uses other devices.
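
The same arithmetic can be checked in pandas. This is a sketch with made-up users (`light_user`, `heavy_user`) that mirror the numbers in the paragraph above; `value_counts(normalize=True)` returns proportions instead of counts.

```python
import pandas as pd

# Made-up device logs: 4 tweets (50% iPhone) versus 100 tweets (20% iPhone).
light_user = pd.Series(["iPhone"] * 2 + ["Web"] * 2)
heavy_user = pd.Series(["iPhone"] * 20 + ["Web"] * 80)

counts = pd.DataFrame({
    "light": light_user.value_counts(),
    "heavy": heavy_user.value_counts(),
})
proportions = pd.DataFrame({
    "light": light_user.value_counts(normalize=True),
    "heavy": heavy_user.value_counts(normalize=True),
})

print(counts.loc["iPhone"])       # light 2, heavy 20
print(proportions.loc["iPhone"])  # light 0.5, heavy 0.2
```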

<!-- END QUESTION -->



<br/><br/><br/>
<br/><br/><br/>
<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />


## When?

Now that we've explored the sources of each of the tweets, we will perform some time series analysis. A look into the temporal aspect of the data could reveal insights about how a user spends their day, when they eat and sleep, etc. In this question, we will focus on the time at which each tweet was posted.


---

I will create a function `add_hour` that takes in a tweets dataframe `df` and two column names, `time_col` and `result_col`. The function uses the timestamps in the `time_col` column to compute the hour of the day as a floating-point number and stores it in a new column `result_col`, according to the formula:

$$
\text{hour} + \frac{\text{minute}}{60} + \frac{\text{second}}{60^{2}}
$$


In [13]:
def add_hour(df, time_col, result_col):
    df[result_col] = (df[time_col].dt.hour) + ((df[time_col].dt.minute)/60) + ((df[time_col].dt.second)/3600)
    
    return df

# # do not modify the below code
tweets = {handle: add_hour(df, "created_at", "hour") for handle, df in tweets.items()}
tweets["AOC"]["hour"].head()

id
1358149122264563712    20.377222
1358147616400408576    20.277500
1358145332316667909    20.126389
1358145218407759875    20.118611
1358144207333036040    20.051667
Name: hour, dtype: float64
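
As a worked check of the formula, the sketch below applies it to two timestamps I made up for illustration. For 20:22:38 the result is 20 + 22/60 + 38/3600 ≈ 20.377222.

```python
import pandas as pd

# Two made-up timestamps, used only to check the arithmetic.
ts = pd.Series(pd.to_datetime(["2021-02-06 20:22:38", "2021-02-06 09:30:00"]))

# hour + minute/60 + second/60^2, computed with the .dt accessor
hour = ts.dt.hour + ts.dt.minute / 60 + ts.dt.second / 3600

print(hour.round(6).tolist())  # [20.377222, 9.5]
```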

With the new `hour` column, let's take a look at the distribution of tweets for each user by time of day. The following cell helps create a density plot on the number of tweets based on the hour they are posted. 

The function `bin_df` takes in a dataframe, an array of bins, and a column name; it bins the values in the specified column and returns a dataframe with the lower bound of each bin and the number of elements in that bin. This function uses [`pd.cut`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html), a pandas utility for binning numerical values that may come in handy later.


In [15]:
# just run this cell
def bin_df(df, bins, colname):
    binned = pd.cut(df[colname], bins).value_counts().sort_index()
    return pd.DataFrame({"counts": binned, "bin": bins[:-1]})

hour_bins = np.arange(0, 24.5, .5)
binned_hours = {handle: bin_df(df, hour_bins, "hour") for handle, df in tweets.items()}

make_line_plot(binned_hours, "bin", "counts", title="Distribution of Tweets by Time of Day",
               xlabel="Hour", ylabel="Number of Tweets")

NameError: name 'make_line_plot' is not defined
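
To make the behaviour of `pd.cut` concrete, here is a small sketch on made-up hour values. One detail worth knowing: by default the intervals are open on the left and closed on the right, so a value exactly equal to the first bin edge (here `0.0`) is not assigned to any bin unless `include_lowest=True` is passed.

```python
import numpy as np
import pandas as pd

# Made-up hours, including one exactly on the lowest edge.
hours = pd.Series([0.0, 0.2, 0.4, 0.7, 1.1, 1.3, 1.4])
bins = np.arange(0, 2.5, 0.5)  # edges: 0, 0.5, 1.0, 1.5, 2.0

binned = pd.cut(hours, bins)   # intervals (0, 0.5], (0.5, 1.0], ...
counts = binned.value_counts().sort_index()

print(counts.tolist())      # [2, 1, 3, 0]
print(binned.isna().sum())  # 1, because 0.0 lies outside (0, 0.5]
```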

### Analysis of Distributions

It seems like Cristiano rarely tweets until around 6 am; he is probably asleep before then. However, I doubt that AOC and Elon Musk are really up and actively tweeting between hours 0 and 5. The `created_at` timestamps are in UTC (+00:00), so these hours are not their local times. Their tweets also drop off after hour 5, which suggests that they are in a timezone where they are going to bed around then. After hour 10, everyone is more active, probably because this period falls during the day rather than during sleeping hours in all of the users' timezones.

---

To account for different locations of each user in our analysis, I will next adjust the `created_at` timestamp for each tweet to the respective timezone of each user. I created the following function `convert_timezone` that takes in a tweets dataframe `df` and a timezone `new_tz` and adds a new column `converted_time` that has the adjusted `created_at` timestamp for each tweet. The timezone for each user is provided in `timezones`.


In [16]:
def convert_timezone(df, new_tz):
    df['converted_time'] = df['created_at'].dt.tz_convert(new_tz)
    return df

timezones = {"AOC": "EST", "Cristiano": "Europe/Lisbon", "elonmusk": "America/Los_Angeles"}

# Look up each timezone by handle rather than relying on the two dicts sharing the same order
tweets = {handle: convert_timezone(df, timezones[handle]) for handle, df in tweets.items()}
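
Here is a small sketch of what `tz_convert` does, using two made-up UTC timestamps. It also illustrates one caveat about the `"EST"` zone name: it is a fixed UTC-5 offset with no daylight saving, whereas a regional name such as `"America/New_York"` (my substitution for comparison, not part of the provided `timezones`) shifts to UTC-4 in summer.

```python
import pandas as pd

# One winter and one summer timestamp, both in UTC (made up for illustration).
utc = pd.Series(pd.to_datetime(["2021-02-06 22:43:03", "2021-07-06 22:43:03"], utc=True))

fixed_offset = utc.dt.tz_convert("EST")            # always UTC-5
regional = utc.dt.tz_convert("America/New_York")   # UTC-5 in winter, UTC-4 in summer

print(fixed_offset.dt.hour.tolist())  # [17, 17]
print(regional.dt.hour.tolist())      # [17, 18]
```

For tweets posted during daylight saving time, the two choices differ by one hour, which is worth keeping in mind when reading the timezone-corrected plot.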

With our adjusted timestamps for each user based on their timezone, let's take a look again at the distribution of tweets by time of day.

In [17]:
# just run this cell
tweets = {handle: add_hour(df, "converted_time", "converted_hour") for handle, df in tweets.items()}
binned_hours = {handle: bin_df(df, hour_bins, "converted_hour") for handle, df in tweets.items()}

make_line_plot(binned_hours, "bin", "counts", title="Distribution of Tweets by Time of Day (timezone-corrected)",
               xlabel="Hour", ylabel="Number of Tweets")

NameError: name 'make_line_plot' is not defined

## Sentiment Analysis

In the past few questions, I have explored the sources of the tweets and when they are posted. On their own these might not seem particularly intricate, but combined with the power of regular expressions they can help me infer a lot about the users. In this section, I will continue building on the past analysis and look specifically at the sentiment of each tweet, which should lead to a more direct and detailed understanding of how the users view certain subjects and people.

<br/>
How do we actually measure the sentiment of each tweet? In our case, we can use the words in the text of a tweet for our calculation! For example, the word "love" within the sentence "I love America!" has a positive sentiment, whereas the word "hate" within the sentence "I hate taxes!" has a negative sentiment. In addition, some words have stronger positive / negative sentiment than others: "I love America." is more positive than "I like America."

I will use the [VADER (Valence Aware Dictionary and sEntiment Reasoner)](https://github.com/cjhutto/vaderSentiment) lexicon to analyze the sentiment of AOC's tweets. VADER is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media which is great for our usage.

The VADER lexicon gives the sentiment of individual words. Here is an example:

In [18]:
# just run this cell
print(''.join(open("vader_lexicon.txt").readlines()[:10]))

FileNotFoundError: [Errno 2] No such file or directory: 'vader_lexicon.txt'

### VADER Sentiment Analysis

The creators of [VADER](https://github.com/cjhutto/vaderSentiment#introduction) describe the tool’s assessment of polarity, or “compound score,” in the following way:

“The compound score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence. Calling it a 'normalized, weighted composite score' is accurate.”

As you can see, VADER doesn't "read" sentences, but works by parsing sentences into words, assigning a preset generalized score from their testing sets to each word separately. 

VADER relies on humans to stabilize its scoring. The creators use Amazon Mechanical Turk, a crowdsourcing survey platform, to train its model. Its training data consists of a small corpus of tweets, New York Times editorials and news articles, Rotten Tomatoes reviews, and Amazon product reviews, tokenized using the natural language toolkit (NLTK). Each word in each dataset was reviewed and rated by at least 20 trained individuals who had signed up to work on these tasks through Mechanical Turk. 

---

Let's first load in the data containing all the sentiments. I'll read `vader_lexicon.txt` into a dataframe called `sent`. The index of the dataframe should be the words in the lexicon and should be named `token`. `sent` should have one column named `polarity`, storing the polarity of each word.

In [19]:
sent = pd.read_csv("vader_lexicon.txt", sep='\t', header=None, names=['token', 'polarity', 'sum','ind_values'])
sent = sent.set_index('token')
sent = sent.drop(['sum', 'ind_values'], axis=1)
sent.head()

FileNotFoundError: [Errno 2] No such file or directory: 'vader_lexicon.txt'
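
Since the lexicon file is not available in this run, here is a sketch of the same loading step on an in-memory stand-in. The two rows below are made up by me to mimic the four tab-separated fields that the `read_csv` call above expects (token, polarity, and two further columns that get dropped); the real file's values may differ.

```python
import io
import pandas as pd

# Made-up rows in a four-column, tab-separated layout.
sample = "love\t3.2\t0.4\t[3, 3, 4]\nhate\t-2.7\t1.0\t[-3, -2, -3]\n"

sent_demo = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["token", "polarity", "sum", "ind_values"],
    usecols=["token", "polarity"],  # skip the two unused columns while reading
    index_col="token",
)

print(sent_demo)
```

Using `usecols` and `index_col` folds the `set_index` and `drop` steps into the read itself; the result has the same shape as `sent`.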

---
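Before further analysis, I will need some more tools to extract the necessary information and clean the data.
I will assign a regular expression to a new variable `punct_re` that captures all of the punctuation within a tweet. I consider punctuation to be any non-word, non-whitespace character. Note that the pattern I use, `\W`, is slightly broader: it matches every non-word character, whitespace included. Since each match is replaced with a space and the cleaned text is later split on whitespace, this does not change the resulting words.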

In [20]:
punct_re = r'\W'
re.sub(punct_re, " ", tweets["AOC"].iloc[0]["full_text"])

'RT  RepEscobar  Our country has the moral obligation and responsibility to reunite every single family separated at the southern border   T '

Next, I assign a regular expression to a new variable `mentions_re` that matches any mention in a tweet. The regular expression uses a capturing group to extract the user's username in a mention.

In [21]:
tweets["AOC"].iloc[0]["full_text"]

'RT @RepEscobar: Our country has the moral obligation and responsibility to reunite every single family separated at the southern border.\n\nT…'

In [22]:
# \w already includes digits, so a separate \d is not needed
mentions_re = r'@(\w+)'

re.findall(mentions_re, tweets["AOC"].iloc[0]["full_text"])

['RepEscobar']

<br/>

### Tweet Sentiments and User Mentions

As we have seen in the previous part of this question, there are actually a lot of interesting components that we can extract out of a tweet for further analysis! Now I will focus on the sentiment of each tweet in relation to the users mentioned within it. 

To calculate the sentiments for a sentence, I will follow this procedure:

1. Remove the punctuation from each tweet so we can analyze the words.
2. For each tweet, find the sentiment of each word.
3. Calculate the sentiment of each tweet by taking the sum of the sentiments of its words.
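
The three steps above can be sketched in plain Python before moving to pandas. The lexicon `toy_lexicon` and the function `toy_sentiment` are hypothetical, with made-up scores; they are not VADER's values.

```python
import re

# Made-up word scores, for illustration only.
toy_lexicon = {"love": 3.0, "like": 1.5, "hate": -3.0}

def toy_sentiment(text):
    # Step 1: lower-case and replace punctuation with spaces.
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    # Steps 2 and 3: look up each word (0 if unknown) and sum the scores.
    return sum(toy_lexicon.get(word, 0.0) for word in cleaned.split())

print(toy_sentiment("I love America!"))  # 3.0
print(toy_sentiment("I like America."))  # 1.5
print(toy_sentiment("I hate taxes!"))    # -3.0
```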

---

I'll use the `punct_re` regular expression from the previous part to clean up the text a bit more! The goal here is to remove all of the punctuations to ensure words can be properly matched with those from VADER to actually calculate the full sentiment score.

In [23]:
def sanitize_texts(df):
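    # regex=True is required so that punct_re is treated as a pattern, not a literal string
    df["clean_text"] = df['full_text'].str.lower().str.replace(punct_re, " ", regex=True)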
    return df

tweets = {handle: sanitize_texts(df) for handle, df in tweets.items()}
tweets["AOC"]["clean_text"].head()

id
1358149122264563712                                                                                     rt  repescobar  our country has the moral obligation and responsibility to reunite every single family separated at the southern border   t 
1358147616400408576                                                                                     rt  rokhanna  what happens when we guarantee  15 hour     31  of black workers and 26  of latinx workers get raises    a majority of essent 
1358145332316667909                                                                                                                                                                                                 source  https   t co 3o5jer6zpd 
1358145218407759875                                             joe cunningham pledged to never take corporate pac money  and he never did  mace said she ll cash every check she gets  yet another way this is a downgrade  https   t co dytsqxkxgu
13581442073330360

---
With the texts sanitized, we can now extract all the user mentions from tweets. 

I created the following function `extract_mentions` that takes in the **`full_text`** (not `clean_text`!) column from a tweets dataframe and uses `mentions_re` to extract all the mentions it contains. The returned dataframe:
* is single-indexed by the IDs of the tweets
* has one row for each mention
* has one column named `mentions`, which contains each mention in all lower-cased characters


In [24]:
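def extract_mentions(full_text):
    # extractall returns one row per match, indexed by (tweet id, match number)
    mentions = full_text.str.extractall(mentions_re)
    # Drop the match-number level so the result is indexed by tweet id only
    mentions = mentions.reset_index(level='match', drop=True)
    # The single capture group lands in column 0; rename it and lower-case the handles
    mentions = mentions.rename(columns={0: 'mentions'})
    mentions['mentions'] = mentions['mentions'].str.lower()
    return mentions[["mentions"]]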

# uncomment this line to help you debug
display(extract_mentions(tweets["AOC"]["full_text"]).head())

# do not modify the below code
mentions = {handle: extract_mentions(df["full_text"]) for handle, df in tweets.items()}
horiz_concat_df(mentions).head()

Unnamed: 0_level_0,mentions
id,Unnamed: 1_level_1
1358149122264563712,repescobar
1358147616400408576,rokhanna
1358130063963811840,jaketapper
1358130063963811840,repnancymace
1358130063963811840,aoc


Unnamed: 0_level_0,AOC,Cristiano,elonmusk
Unnamed: 0_level_1,mentions,mentions,mentions
0,repescobar,sixpadhomegym,dumdin7
1,rokhanna,globe_soccer,grimezsz
2,jaketapper,pestanacr7,grimezsz
3,repnancymace,goldenfootofficial,kristennetten
4,aoc,herbalife,kristennetten
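
To show what `str.extractall` returns before it is flattened, here is a sketch on three made-up tweets (the ids and texts are invented). A tweet with two mentions produces two rows, and a tweet with none produces no rows.

```python
import pandas as pd

texts = pd.Series(
    {101: "RT @RepEscobar: hello", 102: "@A_b and @c2 agree", 103: "no mentions here"},
    name="full_text",
)
texts.index.name = "id"

# One row per match, with a MultiIndex of (id, match number); the group is column 0.
found = texts.str.extractall(r"@(\w+)")

# Flatten to a single index and tidy up the column.
flat = found.reset_index(level="match", drop=True).rename(columns={0: "mentions"})
flat["mentions"] = flat["mentions"].str.lower()

print(found.index.names)         # ['id', 'match']
print(flat.index.tolist())       # [101, 102, 102]
print(flat["mentions"].tolist()) # ['repescobar', 'a_b', 'c2']
```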


<br/>

### Tidying Up the Data

Now, I'll convert the tweets into what's called a [*tidy format*](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) to make the sentiments easier to calculate. The `to_tidy_format` function below uses the `clean_text` column of each tweets dataframe to create a tidy table that:

* is single-indexed by the IDs of the tweets, with one row for every word in the tweet.
* has one column named `word`, which contains the individual words of each tweet.

In [25]:
# just run this cell
def to_tidy_format(df):
    tidy = (
        df["clean_text"]
        .str.split()
        .explode()
        .to_frame()
        .rename(columns={"clean_text": "word"})
    )
    return tidy

tidy_tweets = {handle: to_tidy_format(df) for handle, df in tweets.items()}
tidy_tweets["AOC"].head()



Unnamed: 0_level_0,word
id,Unnamed: 1_level_1
1358149122264563712,rt
1358149122264563712,repescobar
1358149122264563712,our
1358149122264563712,country
1358149122264563712,has
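
The split-then-explode step can be seen in isolation on two made-up rows of cleaned text (a sketch; the ids are invented). `str.split()` with no argument splits on any run of whitespace, so repeated spaces left behind by the punctuation cleanup do not create empty words.

```python
import pandas as pd

clean = pd.Series({1: "our country has", 2: "rt  rokhanna"}, name="clean_text")
clean.index.name = "id"

tidy_demo = (
    clean
    .str.split()   # each cell becomes a list of words
    .explode()     # one row per list element, repeating the index
    .to_frame()
    .rename(columns={"clean_text": "word"})
)

print(tidy_demo.index.tolist())    # [1, 1, 1, 2, 2]
print(tidy_demo["word"].tolist())  # ['our', 'country', 'has', 'rt', 'rokhanna']
```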


### Adding in the Polarity Score

Now that we have this table in the tidy format, it becomes much easier to find the sentiment of each tweet: we can join the table with the lexicon table. 

The following `add_polarity` function adds a new `polarity` column to the `df` table. The `polarity` column contains the sum of the sentiment polarity of each word in the text of the tweet.

In [26]:
# just run this cell
def add_polarity(df, tidy_df):
    df["polarity"] = (
        tidy_df
        .merge(sent, how='left', left_on='word', right_index=True)
        .reset_index()
        .loc[:, ['id', 'polarity']]
        .fillna(0)
        .groupby('id')
        .sum()
    )
    return df

tweets = {handle: add_polarity(df, tidy_df) for (handle, df), tidy_df in \
          zip(tweets.items(), tidy_tweets.values())}
tweets["AOC"][["clean_text", "polarity"]].head()

NameError: name 'sent' is not defined
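
Because `sent` could not be loaded in this run, here is a sketch of the same join-and-sum idea on toy data. The tidy table and the two-word lexicon `lex` are made up; the structure (left join on the word, then a per-tweet sum) follows `add_polarity` above.

```python
import pandas as pd

tidy_demo = pd.DataFrame(
    {"word": ["i", "love", "taxes", "i", "hate", "taxes", "hello"]},
    index=pd.Index([1, 1, 1, 2, 2, 2, 3], name="id"),
)
# Made-up lexicon, indexed by token like `sent`.
lex = pd.DataFrame({"polarity": [3.0, -3.0]}, index=pd.Index(["love", "hate"], name="token"))

# Left join keeps every word; words missing from the lexicon get NaN polarity.
scored = tidy_demo.reset_index().merge(lex, how="left", left_on="word", right_index=True)

# sum() skips NaN, so a tweet with no scored words totals 0.
per_tweet = scored.groupby("id")["polarity"].sum()

print(per_tweet.to_dict())  # {1: 3.0, 2: -3.0, 3: 0.0}
```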

---

With our polarity column in place, we can finally explore how the sentiment of each tweet relates to the user(s) mentioned in it.

The function `mention_polarity` takes in the original tweets dataframe `df` and a mentions dataframe `mention_df`, and returns a series where the mentioned users are the index and the values are the mean sentiment scores of the tweets mentioning them.

In [27]:
def mention_polarity(df, mention_df):
    new_df = pd.merge(mention_df, df['polarity'],left_index=True, right_index=True).dropna().groupby('mentions')['polarity'].mean()
    return new_df

aoc_mention_polarity = mention_polarity(tweets["AOC"],mentions["AOC"]).sort_values(ascending=False)
aoc_mention_polarity

KeyError: 'polarity'
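
The intended result of `mention_polarity` can be illustrated on toy inputs, since the real `polarity` column is missing in this run. The frames below are made up, and I use `DataFrame.join` here as an equivalent index-on-index merge.

```python
import pandas as pd

# Made-up tweets with a polarity score, indexed by tweet id.
tweets_demo = pd.DataFrame({"polarity": [2.0, -1.0, 4.0]}, index=pd.Index([1, 2, 3], name="id"))
# Made-up mentions, one row per mention, indexed by the same tweet ids.
mentions_demo = pd.DataFrame(
    {"mentions": ["alice", "bob", "alice"]},
    index=pd.Index([1, 2, 3], name="id"),
)

# Attach each tweet's polarity to its mentions, then average per mentioned user.
joined = mentions_demo.join(tweets_demo["polarity"])
mean_by_user = joined.groupby("mentions")["polarity"].mean()

print(mean_by_user.to_dict())  # {'alice': 3.0, 'bob': -1.0}
```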

## Extra EDA and Visualizations

In [28]:
df1 = tweets['AOC']['converted_hour']
df2=tweets['AOC']['polarity']
df3 = pd.merge(df1, df2,left_index=True, right_index=True).groupby('converted_hour').agg('mean').reset_index()
plt.scatter(df3['converted_hour'], df3['polarity'])
plt.xlabel('converted_hour')
plt.ylabel('polarity')
plt.title('Change in Polarity throughout the day of AOC Tweets')

KeyError: 'polarity'

In [29]:
df1 = tweets['elonmusk']['converted_hour']
df2=tweets['elonmusk']['polarity']
df3 = pd.merge(df1, df2,left_index=True, right_index=True).groupby('converted_hour').agg('mean').reset_index()
plt.scatter(df3['converted_hour'], df3['polarity'])
plt.xlabel('converted_hour')
plt.ylabel('polarity')
plt.title('Change in Polarity throughout the day of Elon Musk Tweets')

KeyError: 'polarity'

For my exploratory analysis, I wanted to see whether the sentiment of tweets changes throughout the day for AOC and Elon Musk. After spending a lot of time trying to make descriptive graphs, I realized that this analysis would not be very telling: the scatter plots showed no general trend, and the outliers were obvious. I think this data would be better expressed as a line graph or a bar chart with polarities grouped into bins (I couldn't get these working in the time I had).
Ultimately, I learned from the graphs that there is likely no pattern or change in polarity throughout a single day. Negative tweets would more likely cluster around particular days or even weeks, depending on what is going on in the world. Through experimentation, I also learned that a scatter plot is not ideal for this kind of data.
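
One way the binned version might look is sketched below. The data is randomly generated (it is not the tweets), the column names simply mirror the ones used above, and one-hour bins are an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in data, NOT real tweets.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "converted_hour": rng.uniform(0, 24, 500),
    "polarity": rng.normal(0, 2, 500),
})

# One-hour bins [0, 1), [1, 2), ..., [23, 24); right=False keeps hour 0.0 in the first bin.
hour_bin = pd.cut(demo["converted_hour"], bins=np.arange(0, 25, 1), right=False)

# Mean polarity per bin; observed=False keeps any empty bins in the result.
mean_polarity = demo.groupby(hour_bin, observed=False)["polarity"].mean()

print(mean_polarity.head())
# mean_polarity.plot.bar() would then draw one bar per hour.
```

Averaging within bins smooths out individual tweets, so any hourly trend, or the lack of one, is easier to see than in a scatter plot of every distinct timestamp.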