In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("tutorial2_2.ipynb")

# Tutorial 2.2: Dictionaries

Welcome to Tutorial 2.2! In Thursday's class we discussed some dictionary-based methods.

In this tutorial, we will learn how to use two popular dictionaries: `WordNet` and `VADER`. 
We will use `WordNet` to explore the relationships between words and cognitive concepts, and we will use `VADER` to re-implement this [NPR story](https://www.npr.org/2017/04/30/526106612/what-we-learned-about-the-mood-of-trumps-tweets), which shows that the first 100 days of Trump's presidency were a roller coaster of emotion.

First, set up the tests and imports by running the cell below.

In [1]:
# Run this cell, but please don't change it.

# These lines load the tests.
import otter
grader = otter.Notebook()

import nltk
import spacy

import pandas as pd
import matplotlib.pyplot as plt
#%matplotlib notebook
import numpy as np

In [2]:
%matplotlib inline

## 1. WordNet

WordNet is a lexical database in which nouns, verbs, adjectives, and adverbs are grouped into distinct cognitive concepts. 
The concepts are called synsets. According to the [NLTK textbook](http://www.nltk.org/book/ch02.html#sec-wordnet):

> WordNet is a semantically-oriented dictionary of English, similar to a traditional thesaurus but with a richer structure.

In WordNet, synsets are linked via semantic relations.

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q1_1
points: 6
manual: true
-->

Begin by reading this short [paper](https://dl.acm.org/doi/pdf/10.1145/219717.219748) that provides an overview of WordNet.

**Question 1.1:** What are the 6 types of semantic relations in WordNet? Briefly explain each semantic relation in your own words and provide an example for each. Your examples should not be ones from the paper.

_Type your answer here, replacing this text._

<!-- END QUESTION -->



### 1.1 Using WordNet in NLTK 

The next line will import `WordNet` from `nltk`. The `as` keyword in the import statement renames the `wordnet` module to `wn`.

In [3]:
from nltk.corpus import wordnet as wn
wn

We can look up the synsets of a word by using `wn.synsets()`.

In [4]:
syn_motorcar = wn.synsets('motorcars')
syn_motorcar

<!--
BEGIN QUESTION
name: q1_2
points: 1
-->

**Question 1.2:** When we pass in a word as an argument to `wn.synsets()`, what data type is returned?
Assign the type to the variable `sysnsets_func_return_type`.

In [5]:
sysnsets_func_return_type = ...
sysnsets_func_return_type

In [None]:
grader.check("q1_2")

We can see that WordNet has only one synset for the word `motorcars`. Run the next cell to see how many synsets WordNet has for the word `car`.

In [8]:
wn.synsets('car')

<!--
BEGIN QUESTION
name: q1_3
points: 1
-->

**Question 1.3:** How many synsets are there for the word `car`? Assign the value to the variable named `num_car_syns`.

In [9]:
num_car_syns = ...
num_car_syns

In [None]:
grader.check("q1_3")

#### 1.1.1 WordNet Online Interface

WordNet has an [online interface](http://wordnetweb.princeton.edu/perl/webwn) where you can interactively search WordNet via your web browser.

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q1_4
points: 2.5
manual: true
-->

**Question 1.4:** Use the online interface to search for `car`. According to your search, what is the gloss (definition) for each of these synsets?

_Type your answer here, replacing this text._

<!-- END QUESTION -->



#### 1.1.2 Synset Object

NLTK has an object type called `Synset` that it uses to represent WordNet synsets.

The next cell extracts the first synset for the word `car`.

In [12]:
cars_syns = wn.synsets('car')
first_car_syn = cars_syns[0]
first_car_syn

The next line prints out all of the methods and attributes that each `Synset` object has.

In [13]:
" ".join(dir(first_car_syn))

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q1_5
points: 2
manual: true
-->

**Question 1.5:** Complete the missing lines in the next cell to print out the names and definitions of each synset in `cars_syns`.

*Hint:* The necessary functions or attributes are listed in the output of the previous cell
    
*Hint:* The definitions should match (or be close to) what you found online.

In [14]:
for car_syn in cars_syns:
    syn_name = ...
    syn_explanation = ...
    print(syn_name, syn_explanation)

<!-- END QUESTION -->



A synset object lists the lemmas that are associated with the synset.

In [15]:
first_car_syn.lemma_names(), first_car_syn.lemmas()

The above are the lemmas that are associated with the synset `car.n.01`.

From the [documentation](https://www.nltk.org/api/nltk.corpus.reader.html#nltk.corpus.reader.wordnet.Lemma), each lemma object in WordNet has the following attributes:

> - name: The canonical name of this lemma.
> - synset: The synset that this lemma belongs to.
> - syntactic_marker: For adjectives, the WordNet string identifying the syntactic position relative modified noun. See: https://wordnet.princeton.edu/documentation/wninput5wn. For all other parts of speech, this attribute is None.
> - count: The frequency of this lemma in wordnet.

<!--
BEGIN QUESTION
name: q1_6
points: 
    - 0.1
    - 0.9
-->

**Question 1.6:** Based on the documentation, which of `first_car_syn`'s lemmas appears the most in WordNet? Assign the name of the lemma to the variable `most_freq_car_n_01_lemma`.

In [16]:
most_freq_car_n_01_lemma = ...
f"The most frequent lemma is {most_freq_car_n_01_lemma}"

In [None]:
grader.check("q1_6")

### 1.2 Tree Structure

`WordNet` is structured as a tree in which synsets are arranged in a hierarchy based on the relationships described above.

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q1_7
points: 1
manual: true
-->

**Question 1.7:** Run and then briefly describe the next line in light of the structure of WordNet.

In [19]:
first_car_syn.hypernym_paths()[0]

_Type your answer here, replacing this text._

<!-- END QUESTION -->



In the next line we can see how a synset can have multiple paths to the root of the tree.

In [20]:
first_car_syn.hypernym_paths()
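To make the idea of multiple paths to the root concrete, here is a minimal sketch on a toy taxonomy. The `PARENTS` dictionary and the `toy_hypernym_paths` helper are hypothetical and are not part of NLTK or WordNet; they only mimic the root-first ordering of the lists that `hypernym_paths()` returns.

```python
# Toy taxonomy (hypothetical, not real WordNet data): each node maps to its parents.
PARENTS = {
    "entity": [],
    "animal": ["entity"],
    "pet": ["entity"],
    "dog": ["animal", "pet"],  # two parents, so two routes back to the root
}


def toy_hypernym_paths(node):
    """Return every path from the root down to `node`, root first."""
    parents = PARENTS[node]
    if not parents:
        # A node without parents is the root, so the path is just the node itself.
        return [[node]]
    paths = []
    for parent in parents:
        for path in toy_hypernym_paths(parent):
            paths.append(path + [node])
    return paths


for path in toy_hypernym_paths("dog"):
    print(" -> ".join(path))
```

Because `dog` has two parents in this toy data, the helper returns two paths that share the same root and the same final node but differ in the middle.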

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q1_8
points: 1
manual: true
-->

**Question 1.8:** These two paths look similar but there is a difference. What is the difference?

_Type your answer here, replacing this text._

<!-- END QUESTION -->



We can use this tree structure to determine which more general synset a given synset is a type of, as well as which more specific synsets are examples of it.

<!--
BEGIN QUESTION
name: q1_9
points: 
    - 0.1
    - 0.9
-->

**Question 1.9:** According to WordNet, `first_car_syn` is a type of what synset? Use the function that represents the correct relationship you described at the beginning of this assignment and assign the name of that synset to the variable `first_car_syn_type_of`.

In [21]:
first_car_syn_type_of = ...
f"The first car synset is a type of {first_car_syn_type_of}"

In [None]:
grader.check("q1_9")

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q1_10
points: 1
manual: true
-->

**Question 1.10:** What relationship will give us examples of this car synset?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!--
BEGIN QUESTION
name: q1_11
points: 
    - 0.1
    - 0.9
-->

**Question 1.11:** Use the corresponding function to extract a list of synsets that are a type of `first_car_syn`. Assign the list to the variable named `type_of_first_car_syns`.

*Hint:* The name of the function should match the name of the relationship you gave in the last question.

In [24]:
type_of_first_car_syns = ...
type_of_first_car_syns

In [None]:
grader.check("q1_11")

<!--
BEGIN QUESTION
name: q1_12
points: 
    - 0.1
    - 0.9
    - 1
-->

**Question 1.12:** Loop through the lemmas of the synsets in `type_of_first_car_syns` and determine which lemma appears the most in WordNet. Assign the name of the lemma to the variable `most_common_car_lemma`.

*(Two lemmas are tied; you can choose either of them.)*

In [27]:
...
most_common_car_lemma = ...

f"The most common car lemma is {most_common_car_lemma}"

In [None]:
grader.check("q1_12")

### 1.3 Similarity

We can use the paths in WordNet to find similarities between different synsets.
The next line shows synsets that we think should be similar.


In [31]:
wn.synsets("college")[1].definition(), wn.synsets("high_school")[0].definition()

We can compute how similar two synsets are using the function `path_similarity()`

In [32]:
wn.synsets("college")[1].path_similarity(wn.synsets("high_school")[0])
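As a rough illustration of what a path-based score measures, the sketch below computes `1 / (shortest path length + 1)` on a toy graph, which is, to my understanding, how NLTK's `path_similarity()` turns a distance into a score between 0 and 1. The `EDGES` graph and both helper functions are hypothetical and do not use real WordNet data.

```python
from collections import deque

# Hypothetical is-a links, stored in both directions (not real WordNet data).
EDGES = {
    "animal": ["mammal"],
    "mammal": ["animal", "cat", "dog"],
    "cat": ["mammal"],
    "dog": ["mammal"],
}


def shortest_path_length(start, goal):
    """Breadth-first search for the number of edges between two nodes."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in EDGES[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None  # no connecting path


def toy_path_similarity(a, b):
    """Score in (0, 1]: identical nodes score 1, and longer paths score lower."""
    dist = shortest_path_length(a, b)
    return None if dist is None else 1 / (dist + 1)


print(toy_path_similarity("cat", "cat"))     # distance 0
print(toy_path_similarity("cat", "mammal"))  # distance 1
print(toy_path_similarity("cat", "dog"))     # distance 2
```

The fewer edges that separate two nodes, the closer the score is to 1.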

Let's look at another synset of "college".

In [33]:
wn.synsets("college")[0].definition(), wn.synsets("high_school")[0].definition()

The next line will determine the similarity between this new synset of college and the synset of high school. 

In [34]:
wn.synsets("college")[0].path_similarity(wn.synsets("high_school")[0])

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q1_13
points: 1
manual: true
-->

**Question 1.13:** Looking at the two similarity scores computed above, what do they tell us, and do you agree with them?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!--
BEGIN QUESTION
name: q1_14
points: 
    - 0.1
    - 0.4
    - 0.5
    - 0.5
    - 2
-->

**Question 1.14:** Complete the next cell to sort the pairs of words in `word_pairs` based on the similarity of each pair, in decreasing order. Store the sorted pairs as a list in the variable named `sorted_pairs`. Each item in the list should be a tuple where the first item is a word pair represented as a tuple (like in the code below) and the second item is the similarity score.

*Note:* The similarity of a pair should be represented by the similarity of the most similar pair of synsets they have.
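The "most similar pair of synsets" rule in the note can be sketched on made-up data. Everything below is hypothetical: the `SENSES` and `SIMILARITY` tables stand in for WordNet, and `sense_similarity` stands in for a real similarity function, which may return `None` when two senses cannot be compared.

```python
from itertools import product

# Hypothetical senses and scores, for illustration only.
SENSES = {
    "bank": ["bank.river", "bank.money"],
    "shore": ["shore.land"],
    "cash": ["cash.money"],
}
SIMILARITY = {
    ("bank.river", "shore.land"): 0.5,
    ("bank.money", "shore.land"): 0.1,
    ("bank.money", "cash.money"): 0.33,
}


def sense_similarity(a, b):
    """Look up a toy score; returns None when the two senses are not comparable."""
    return SIMILARITY.get((a, b))


def best_similarity(word_a, word_b):
    """Score a word pair by its most similar pair of senses."""
    scores = [sense_similarity(a, b) for a, b in product(SENSES[word_a], SENSES[word_b])]
    scores = [s for s in scores if s is not None]
    return max(scores) if scores else None


pairs = [("bank", "cash"), ("bank", "shore")]
scored = [(pair, best_similarity(*pair)) for pair in pairs]
# Treat a missing score as lower than any real score so that sorting never compares None.
scored.sort(key=lambda item: -1 if item[1] is None else item[1], reverse=True)
print(scored)
```

Taking the maximum over all sense combinations means that one unrelated sense of an ambiguous word does not drag down the score of the pair.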

In [35]:
word_pairs = [('car', 'automobile'), ('gem', 'jewel'), ('journey', 'voyage'), ('boy', 'lad'), ('coast', 'shore'), 
              ('asylum', 'madhouse'), ('magician', 'wizard'), ('midday', 'noon'), ('furnace', 'stove'), 
              ('food', 'fruit'), ('bird', 'cock'), ('bird', 'crane'),('tool', 'implement'), ('brother', 'monk'),
              ('lad', 'brother'), ('crane', 'implement'), ('journey', 'car'), ('monk', 'oracle'),
              ('cemetery', 'woodland'), ('food', 'rooster'), ('coast', 'hill'), ('forest', 'graveyard'), 
              ('shore', 'woodland'), ('monk', 'slave'), ('coast', 'forest'), ('lad', 'wizard'), ('chord', 'smile'),
              ('glass', 'magician'), ('rooster', 'voyage'), ('noon', 'string')]

sorted_pairs = ...
...
sorted_pairs

In [None]:
grader.check("q1_14")

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q1_15
points: 1
manual: true
-->

**Question 1.15:** How could we use WordNet to reduce variation when building a Document-Term matrix?

_Type your answer here, replacing this text._

<!-- END QUESTION -->



## 2. VADER (Valence Aware Dictionary and sEntiment Reasoner)

Now we will look at how to use a dictionary-based method to categorize text and convert word counts into a quantifiable attribute that we want to measure.

VADER is *a simple rule-based model for general sentiment analysis* introduced in this [paper](http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf) from the 2014 International Conference on Weblogs and Social Media (ICWSM).
VADER is a popular method and is available in Python through packages such as `nltk` and the standalone `vaderSentiment` package.

First, let's download the VADER lexicon.

In [41]:
nltk.download('vader_lexicon')

### 2.1 Exploring VADER

The VADER lexicon can be found online on GitHub: https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt.

If we look at the [4140-th line in the file](https://github.com/cjhutto/vaderSentiment/blob/d8da3e21374a57201b557a4c91ac4dc411a08fed/vaderSentiment/vader_lexicon.txt#L4140), we see the following:

> irate	-2.9	0.53852	[-3, -3, -3, -2, -3, -4, -3, -3, -2, -3]

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_1
points: 2
manual: true
-->

**Question 2.1:** Based on the documentation found [here](https://github.com/cjhutto/vaderSentiment/#resources-and-dataset-descriptions), what does this line mean? 


_Type your answer here, replacing this text._

<!-- END QUESTION -->



In the next cell, we import the `vader` module from NLTK, create a new `SentimentIntensityAnalyzer` object, and assign it to the variable named `sentiment_analyzer`.

In [42]:
from nltk.sentiment import vader  

sentiment_analyzer = vader.SentimentIntensityAnalyzer()

The lexicon we just looked at (or a similar one) is stored in the `sentiment_analyzer` variable.
We can access the sentiment intensity for each word via a dictionary lookup, as shown in the next line. 

In [43]:
sentiment_analyzer.lexicon['irate'], sentiment_analyzer.lexicon['ecstatic'], sentiment_analyzer.lexicon['mad']

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_2
points: 2
manual: true
-->

**Question 2.2:** Compare the values just printed above with the values in the [lexicon on GitHub](https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt). Do the values match up, and if not why do you think that is the case?

_Type your answer here, replacing this text._

<!-- END QUESTION -->



The lexicon in VADER contains scores for individual words. However, the sentiment of some phrases can be very different from the sentiment of each of their words. The next line prints out sentiment intensity scores for special idioms.

In [44]:
vader.VaderConstants.SPECIAL_CASE_IDIOMS

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_3
points: 1
manual: true
-->

**Question 2.3:** In the next cell, choose one of the idioms and print out the score in the lexicon for the individual words to see how the scores for the individual words differ from the idiom.

*Note:* If there is a word in the idiom that is not in the lexicon, choose a different idiom.

In [45]:
...

<!-- END QUESTION -->



In this tutorial we won't go into the details of exactly how VADER works. We encourage you to read the paper, which is included as one of the options for Week 2's Readings. 
<br>
As a high-level overview, to determine the sentiment of a text, VADER uses a set of rules to combine the polarity of the words in a given sentence. 


The function `polarity_scores` leverages these rules to compute a score for how *negative*, *neutral*, and *positive* a text is. The next line demonstrates how to use this function and shows the scores that it returns.


In [46]:
sentiment_analyzer.polarity_scores("As a high level overview, to determine the sentiment of a text, VADER uses a bunch of rules to combine the polarity of words in a given sentence.")

`polarity_scores` also computes a compound score. According to the documentation on GitHub:
    
> the compound score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence. Calling it a 'normalized, weighted composite score' is accurate.
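The normalization step mentioned in the quote can be sketched as follows. This assumes the formula `x / sqrt(x^2 + alpha)` with `alpha = 15`, which is my reading of the reference implementation; treat both the formula and the constant as assumptions and check the source if the exact values matter. The function name `normalize_valence` is made up for this sketch.

```python
import math


def normalize_valence(score, alpha=15):
    """Squash an unbounded summed valence score into the open interval (-1, 1).

    Assumed formula: score / sqrt(score^2 + alpha).
    """
    return score / math.sqrt(score * score + alpha)


print(normalize_valence(0.0))   # a neutral sum stays at 0
print(normalize_valence(1.0))   # 1 / sqrt(1 + 15) = 0.25
print(normalize_valence(-2.9))  # a negative sum maps to a negative compound score
print(normalize_valence(50.0))  # very large sums approach, but never reach, 1
```

The practical consequence is that the compound score is not a plain average of word scores: it saturates, so adding more strongly positive words moves the score toward 1 by smaller and smaller steps.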

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_4
points: 2
manual: true
-->

**Question 2.4:** Based on the documentation, what are the typical threshold values used to determine if a text is positive, neutral, or negative?

_Type your answer here, replacing this text._

<!-- END QUESTION -->



#### 2.1.1 VADER Rules

The rules in VADER were developed to work well on text from social media, like Twitter and Reddit. Here we will look at aspects of two rules.

Some of the rules are based on *booster* words: if a sentence contains any of the words below, its polarity will be boosted.

In [47]:
vader.VaderConstants.BOOSTER_DICT

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_5
points: 1
manual: true
-->

**Question 2.5:** In the next cell, add one of these booster terms to a sentence and see how the `polarity_scores` results change depending on whether the booster word is included.

In [48]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_6
points: 1
manual: true
-->

**Question 2.6:** Another rule in VADER is based on capitalization. In the next cell, demonstrate how the values computed by the `polarity_scores` function change when some words in a text are capitalized compared with when the text is all lowercase.

In [49]:
...

<!-- END QUESTION -->



### 2.2 Analyzing Trump's First 100 Days in Office via Sentiment

We are now going to leverage VADER to explore the sentiment of Trump's Tweets during his first 100 days in office. This is based on an [NPR Story](https://www.npr.org/2017/04/30/526106612/what-we-learned-about-the-mood-of-trumps-tweets).

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_7
points: 2
manual: true
-->

**Question 2.7:** Summarize and describe the findings from the NPR story.

_Type your answer here, replacing this text._

<!-- END QUESTION -->



#### 2.2.1 - Collecting Data and Exploration

Later in the course we will learn how to extract Tweets directly from Twitter, but for now we will use a dataset that has already been collected for us.

Melanie Walsh has collected and cleaned Donald Trump's Tweets from the [Trump Twitter Archive](http://www.trumptwitterarchive.com/). We can download the Tweets from https://melaniewalsh.github.io/Intro-Cultural-Analytics/_downloads/c3e837cce30a959abc84cbc8914dc7a2/Trump-Tweets.csv.

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_8
points: 1
manual: true
-->

**Question 2.8:** In the next cell, use one bash command to download the `csv` file of tweets. Then use another bash command to move the file to the `data/` directory.

In [50]:
...

<!-- END QUESTION -->



Run the next cell to read in the csv file of Trump's Tweets. You can use it to test whether the data was downloaded and moved correctly. There should be about 29K rows.

In [51]:
trump_tweet_df = pd.read_csv("data/Trump-Tweets.csv")
trump_tweet_df

<!--
BEGIN QUESTION
name: q2_9
points: 
    - 0.2
    - 0.8
-->

**Question 2.9:** Extract the names of the columns and store them in the variable called `column_names`.

In [52]:
column_names = ...
column_names

In [None]:
grader.check("q2_9")

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_10
points: 2
manual: true
-->

**Question 2.10:** Briefly explain what each column is and the type of variable that is stored in the column.

_Type your answer here, replacing this text._

<!-- END QUESTION -->



During Week 4, when we focus on Data Collection, we will go over the different properties of Tweets that we can get when we collect Tweets from Twitter.

#### 2.2.2 Data Filtering

Whenever we have a dataset, it is important to filter out data that is not relevant to the research question we are interested in exploring.

##### 2.2.2.1 Retweets

We want to see the sentiment of Trump's own Tweets, not of those that he has retweeted. The `is_retweet` column indicates whether a Tweet is a retweet.

<!--
BEGIN QUESTION
name: q2_11
points: 
    - 0.5
    - 0.5
-->

**Question 2.11:** In the next cell, remove all Tweets that are retweets. You can overwrite the dataframe. 

In [55]:
trump_tweet_df = ...
trump_tweet_df['is_retweet'].value_counts()

In [None]:
grader.check("q2_11")

<!--
BEGIN QUESTION
name: q2_12
points: 
    - 1
-->

**Question 2.12:** Since we now know that all of these tweets are Trump's tweets and not retweets, go ahead and remove the column `is_retweet` from the dataframe.

In [58]:
trump_tweet_df = ...
trump_tweet_df

In [None]:
grader.check("q2_12")

##### 2.2.2.2 Dates

The next cell will tell us the type of the values stored in the `created_at` column.

In [60]:
trump_tweet_df['created_at'].dtype

Running the next line shows that `created_at` in fact represents a date: it tells us when the Tweet was posted.

In [61]:
trump_tweet_df['created_at']

[datetime](https://www.w3schools.com/python/python_datetime.asp) is a python module that allows us to easily work with dates. The next line will import the datetime module.

In [62]:
import datetime

<br>Pandas has a nifty [`to_datetime()` function](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) that converts its argument to a `Timestamp` object. 

The next line shows how to create a `Timestamp` object from a string.

In [63]:
timestamp = pd.to_datetime("2020-05-11 01:01:23")
timestamp

<!--
BEGIN QUESTION
name: q2_13
points: 
    - 1
-->

**Question 2.13:** Apply the `pd.to_datetime()` function to replace the values in `created_at` with `Timestamp` objects. 

*You should notice the dtype of the column will change*

In [64]:
trump_tweet_df['created_at'] = trump_tweet_df['created_at'].apply(pd.to_datetime)
trump_tweet_df['created_at']

In [None]:
grader.check("q2_13")

We can easily extract time-related information from a `Timestamp` object, as shown in the next cell.

In [66]:
timestamp.day_name(), timestamp.day_of_week, timestamp.week, timestamp.month_name(), timestamp.month, timestamp.quarter, timestamp.now()

We can also compare times, as seen in the next line. 

In [67]:
timestamp.now() > timestamp

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_14
points: 1
manual: true
-->

**Question 2.14:** Briefly explain what the previous python cell is checking and what the resulting value indicates.

_Type your answer here, replacing this text._

<!-- END QUESTION -->



We can use `datetime.timedelta` to add time to a timestamp object. The next cell adds one day to the value stored in `timestamp`.

In [68]:
timestamp, datetime.timedelta(days=1) + timestamp

Trump's first day in office was January 20, 2017. 

<!--
BEGIN QUESTION
name: q2_15
points: 
    - 0.1
    - 0.1
    - 0.1
    - 0.7
-->

**Question 2.15:** In the next cell, use the `pd.to_datetime()` function to create a timestamp of Trump's first day in office. Even though the President is sworn in at noon, let's ignore the time for now and just focus on the day.

In [69]:
trump_first_day = ...
trump_first_day

In [None]:
grader.check("q2_15")

<!--
BEGIN QUESTION
name: q2_16
points: 
    - 0.3
    - 0.7
-->

**Question 2.16:** Use a `Timestamp` method to determine the day of the week on which Trump's inauguration took place. Assign the answer to the variable named `trump_first_day_of_the_week`.

In [74]:
trump_first_day_of_the_week = ...
trump_first_day_of_the_week

In [None]:
grader.check("q2_16")

<!--
BEGIN QUESTION
name: q2_17
points: 1
-->

**Question 2.17:** We want to measure the sentiment of Trump's Tweets during his first 100 days in office. Create a new datetime object that is 101 days after `trump_first_day` and assign the value to the variable named `trump_101_day`.

In [77]:
trump_101_day = ...
trump_101_day

In [None]:
grader.check("q2_17")

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_18
points: 1
manual: true
-->

**Question 2.18:** Why are we adding 101 days rather than 100 days?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

Since we only want to look at the Tweets from Trump's first 100 days in office, let's filter out all tweets that are not within the first 100 days.

<!--
BEGIN QUESTION
name: q2_19
points: 
    - 0.1
    - 0.2
    - 0.6
    - 0.5
    - 0.5
-->

**Question 2.19:** In the next cell, create a new dataframe called `trump_100_days_tweet_df` that contains the Tweets between `trump_first_day` and `trump_101_day`. 

*Hint:* Make sure to reset the index

In [79]:
...
trump_100_days_tweet_df

In [None]:
grader.check("q2_19")

#### 2.2.3 Applying VADER to Tweets

Now we are ready to apply VADER to Trump's tweets. 

<!--
BEGIN QUESTION
name: q2_20
points: 
    - 0.5
    - 0.5
    - 0.5
    - 0.5
-->

**Question 2.20:** In the next cell, apply the `sentiment_analyzer.polarity_scores` function to the tweets. Store the resulting dictionaries in a column called `polarity_scores` and the compound polarity in a new column called `compound_polarity`. 

In [85]:
trump_100_days_tweet_df['polarity_scores'] = trump_100_days_tweet_df['text'].apply(sentiment_analyzer.polarity_scores)
trump_100_days_tweet_df['compound_polarity'] = trump_100_days_tweet_df['polarity_scores'].map(lambda x: x['compound'])
trump_100_days_tweet_df

In [None]:
grader.check("q2_20")

##### 2.2.3.1 Data Validation

<!-- BEGIN QUESTION -->

It is always important to sample some examples after applying a method to classify text or add attributes to text.
*(In my work I'll often sample about 50 random examples but here we will just look at two specific examples.)*

<!--
BEGIN QUESTION
name: q2_21
points: 1
manual: true
-->

**Question 2.21:** The next line uses `argmax` to find the index of the Tweet with the highest compound polarity score.
Use the index to determine the highest `compound_polarity` value, the corresponding `text`, and when that tweet was written. Assign the values to the variables respectively named `max_compound_polarity`, `max_compound_polarity_tweet`, and `max_compound_polarity_time`.

In [90]:
max_compound_polarity_index = np.argmax(trump_100_days_tweet_df['compound_polarity'])
max_compound_polarity = ...
max_compound_polarity_tweet = ...
max_compound_polarity_time = ...

f"The Tweet from {max_compound_polarity_time}: \"{max_compound_polarity_tweet}\" had the highest compound polarity of {max_compound_polarity}"

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_22
points: 1
manual: true
-->

**Question 2.22:** Find the index of the Tweet with the lowest `compound_polarity` value, the value itself, the corresponding `text`, and when that tweet was written. Assign the values to the variables respectively named `min_compound_polarity_index`, `min_compound_polarity`, `min_compound_polarity_tweet`, and `min_compound_polarity_time`.

In [91]:
min_compound_polarity_index = ...
min_compound_polarity = ...
min_compound_polarity_tweet = ...
min_compound_polarity_time = ...

f"The Tweet from {min_compound_polarity_time}: \"{min_compound_polarity_tweet}\" had the lowest compound polarity of {min_compound_polarity}"

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_23
points: 1
manual: true
-->

**Question 2.23:** Looking at those two examples, do you agree that these could be the Tweets with the highest and lowest sentiment scores?

_Type your answer here, replacing this text._

<!-- END QUESTION -->



The next cell will compute the correlation between different columns in our dataframe. 

In [92]:
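# Restrict to numeric columns; newer pandas versions may raise an error if text columns reach corr().
trump_100_days_tweet_df.select_dtypes(include="number").corr()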

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_24
points: 1
manual: true
-->

**Question 2.24:** Which two columns are the most correlated with each other, and is this surprising?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_25
points: 1
manual: true
-->

**Question 2.25:** Is the correlation between how many days Trump was in office and people's engagement with his tweets negative, positive, or neutral? Briefly describe what this means.


*Note:* We can measure engagement by retweets and favorites. Replies are another way to measure engagement, but this dataset does not include replies.

_Type your answer here, replacing this text._

<!-- END QUESTION -->



#### 2.2.4 - Visualization

In the next cell, we will generate a line plot that shows the compound polarity of each tweet (y-axis) against its timestamp (x-axis).

In [93]:
trump_100_days_tweet_df.plot(kind='line', y='compound_polarity', x='created_at')
plt.axhline(y=0, color='black', linestyle='-')

The plot you just generated is likely hard to understand. Trump tweeted a lot each day, and his mood could change throughout the day. Here we plotted the score of every tweet, since each tweet has its own timestamp (that includes hours, minutes, and seconds).


Let's look at a larger unit of analysis: average sentiment per day. 
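The move from one score per tweet to one score per day can be sketched in plain Python before doing it with pandas. The timestamps and scores below are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical (timestamp, compound score) pairs, not taken from the dataset.
scored_tweets = [
    (datetime(2020, 5, 11, 8, 15), 0.6),
    (datetime(2020, 5, 11, 21, 40), -0.2),
    (datetime(2020, 5, 12, 9, 5), 0.4),
]

# Drop the time of day so that tweets from the same calendar day share a key.
scores_by_day = defaultdict(list)
for posted_at, compound in scored_tweets:
    scores_by_day[posted_at.date()].append(compound)

# One averaged score per day, in chronological order.
daily_average = {day: mean(scores) for day, scores in sorted(scores_by_day.items())}
for day, average in daily_average.items():
    print(day, round(average, 3))
```

Averaging within a day smooths out swings between individual tweets, which is why the daily curve is easier to read than the per-tweet one.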

<!--
BEGIN QUESTION
name: q2_26
points:
    - 0.5
    - 0.5
-->

**Question 2.26:** In the next cell, add a column named `days_in_office` that represents how many days Trump had been in office when each Tweet was posted.

In [94]:
...
trump_100_days_tweet_df

In [None]:
grader.check("q2_26")

<!--
BEGIN QUESTION
name: q2_27
points:
    - 1
    - 1
-->

**Question 2.27:** Now that each row has a corresponding value indicating how many days Trump has been in office, group the sentiment of Trump's tweets on a daily level and determine the average polarity of the tweets posted each day. Assign the resulting dataframe to the variable named `trump_daily_avg_sentiment`. Make sure to reset the index of the dataframe.

In [97]:
trump_daily_avg_sentiment = ...
trump_daily_avg_sentiment

In [None]:
grader.check("q2_27")

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_28
points: 1
manual: true
-->

**Question 2.28:** In the next cell, use a line plot to show the average compound polarity of the tweets (y-axis) for each day (x-axis). We have provided a line of code that draws a black horizontal line at y = 0.

In [260]:
...
plt.axhline(y=0, color='black', linestyle='-')

<!-- END QUESTION -->



We can begin to see some trends of how the sentiment of Trump's tweets changed over time.

Let's look at an even larger unit of analysis: average sentiment per week.

<!--
BEGIN QUESTION
name: q2_29
points:
    - 1
    - 1
-->

**Question 2.29:** In the next cell, add a column named `weeks_in_office` that represents how many weeks Trump has been in office based on the timestamp in a given Tweet.

*Hint:* We can use `.week` to find which week of the year a date falls in, and we can then subtract one week number from another.
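The week-number subtraction from the hint can be tried out with the standard library, whose `isocalendar()` reports ISO week numbers; to my knowledge this is the same numbering that `.week` on a pandas `Timestamp` uses. The two dates below are arbitrary examples.

```python
from datetime import date

first = date(2020, 5, 11)
later = date(2020, 6, 1)

# isocalendar() returns (ISO year, ISO week number, ISO weekday).
first_week = first.isocalendar()[1]
later_week = later.isocalendar()[1]

# Subtracting week numbers only makes sense when both dates fall in the same year,
# because the numbering restarts every year.
weeks_apart = later_week - first_week
print(first_week, later_week, weeks_apart)

# An alternative that also works across a year boundary: count elapsed whole weeks.
elapsed_weeks = (later - first).days // 7
print(elapsed_weeks)
```

Note that the two approaches answer slightly different questions: the first counts calendar-week boundaries crossed, while the second counts complete 7-day spans since the starting date.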

In [261]:
...
trump_100_days_tweet_df

In [None]:
grader.check("q2_29")

<!--
BEGIN QUESTION
name: q2_30
points:
    - 1
    - 1
-->

**Question 2.30:** Now that each row has a corresponding value indicating how many weeks Trump has been in office, group the sentiment of Trump's tweets on a weekly level and determine the average polarity of the tweets posted each week. Assign the resulting dataframe to the variable named `trump_weekly_avg_sentiment`. Make sure to reset the index of the dataframe.

In [264]:
trump_weekly_avg_sentiment = ...
trump_weekly_avg_sentiment

In [None]:
grader.check("q2_30")

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_31
points: 2
manual: true
-->

**Question 2.31:** In the next cell, use a line plot to show the average compound polarity of the tweets (y-axis) for each week (x-axis).

In [266]:
...

plt.axhline(y=0, color='black', linestyle='-')
plt.grid(True)

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_32
points: 2
manual: true
-->


**Question 2.32:** How similar are your results to the findings and figure in the NPR story? If there are differences, do you think these differences are substantial?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_33
points: 1
manual: true
-->


**Question 2.33:** At a minimum, it is likely that your figure does not *exactly* match the one from the NPR story, which is OK. Why do you think this might be the case?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_34
points: 1
manual: true
-->


**Question 2.34:** Looking towards your final project, how could you use VADER in it?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export()