Adapted from [Fall 2019 Data 100 HW 4: Trump, Twitter, and Text](http://www.ds100.org/fa19/syllabus/)

In [None]:
import numpy as np
from datascience import *

# Table.interactive()

import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# Project 2: Trump's Tweets

## Table of Contents
<a href='#section 0'>Background Knowledge: Twitter & the President </a>

1. <a href='#section 1'> The Data Science Life Cycle</a>

    a. <a href='#subsection 1a'>Formulating a question or problem</a> 

    b. <a href='#subsection 1b'>Acquiring and cleaning data</a>

    c. <a href='#subsection 1c'>Conducting exploratory data analysis</a>

    d. <a href='#subsection 1d'>Using prediction and inference to draw conclusions</a>
<br><br>

### Background Knowledge: Twitter & the President <a id='section 0'></a>


<img src="twitter_trump.png" width = 1000/>
[Source](https://www.politico.com/magazine/story/2018/01/26/donald-trump-twitter-addiction-216530)

President Donald Trump's Twitter activity began well before he was elected president and has grown over time. The image above shows part of how he used Twitter while he was running for office.

# The Data Science Life Cycle <a id='section 1'></a>

## Formulating a question or problem <a id='subsection 1a'></a>
It is important to ask questions that will be informative and that will avoid misleading results. There are many different questions we could ask about Trump's tweets. For example, many people are interested in how he uses Twitter to connect with his supporters.

<div class="alert alert-warning">
<b>Question:</b> Recall the questions you developed with your group on Tuesday. Write down that question below, and try to add on to it with the context from the articles from Wednesday. Think about what data you would need to answer your question. You can review the articles on the bCourses page under Module 4.3.
   </div>
   

Original Question(s): *here*


Updated Question(s): *here*



Data you would need: *here*



## Acquiring and cleaning data <a id='subsection 1b'></a>
The following table, `trump`, contains tweets from President Donald Trump's personal Twitter account from January 2016 through February 2019. Here is information about the columns of the dataset.

| Column | Description |
| --- | --- |
| time |Coordinated Universal Time of Day that the Tweet was published|
| source | Source of the Tweet (Android, iPhone, Web Browser, etc.)|
| text| Original Text of the Tweet (includes all punctuation)|
|retweet_count| Number of Times Original Tweet was Shared|
|year| Year the Tweet was released|
|est_time| Eastern Standard Time of the Day that the Tweet was published|
|hour| Hour of the Day the Tweet was Published|
|no_punc| Text from Tweet without any punctuation|
|Polarity| Score measuring the sentiment of the Tweet|

In [None]:
trump = Table.read_table('trump_tweets.csv')
trump

<div class="alert alert-warning">
<b>Question:</b> It's important to evaluate our data source. What do you know about the source (Trump's Twitter Account)? What motivations might he have for posting? What data might be missing? How might deleted tweets be dealt with?
   </div>

*Insert answer*

<div class="alert alert-warning">
<b>Question:</b> We want to learn more about the dataset. First, how many total rows are in this table? What does each row represent?
    
   </div>

In [None]:
total_rows = ...

*Description of a row here*

## Conducting exploratory data analysis <a id='subsection 1c'></a>

We will explore how Trump's tweets vary by sentiment and extend that analysis in the context of how many retweets he gets and patterns over time. In the end, we will try to answer **"How do Trump's tweets influence the interpretation of big events based on the sentiment of his tweets, reception of his tweets (retweets), and how his tweets of the event progress over time?"**

### Part 1: Polarity & Sentiment

It turns out that we can use the words in Trump's tweets to calculate a measure of the sentiment of the tweet. For example, the sentence "I love America!" has positive sentiment, whereas the sentence "I hate taxes!" has a negative sentiment. In addition, some words have stronger positive / negative sentiment than others: "I love America." is more positive than "I like America."

We will use the [VADER (Valence Aware Dictionary and sEntiment Reasoner)](https://github.com/cjhutto/vaderSentiment) lexicon to analyze the sentiment of Trump's tweets. VADER is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, which makes it a good fit for tweets.

The VADER lexicon gives the sentiment of individual words. Run the following cell to show a few rows of the lexicon:



In [None]:
with open("vader_lexicon.txt") as lexicon_file:  # close the file when done
    print(''.join(lexicon_file.readlines()[300:310]))

We used the VADER Lexicon to calculate the polarity for each tweet. This is in the "polarity" column of the `trump` table. We can use this to find the most positive and negative tweets.
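
To make the idea of a lexicon-based score concrete, here is a minimal sketch that adds up the ratings of the words in a sentence. It assumes the tab-separated layout of `vader_lexicon.txt` (token, mean rating, standard deviation, raw ratings). The three lexicon lines and their numbers are made up, and the helpers `parse_lexicon` and `toy_polarity` are hypothetical. This is not necessarily how the `polarity` column in `trump` was computed, and the full VADER tool also applies rules (for example, for negation and emphasis) that are left out here.

```python
# Toy lexicon lines in the tab-separated layout of vader_lexicon.txt:
# token, mean rating, standard deviation, raw ratings.
# The numbers here are made up for illustration.
toy_lexicon_lines = [
    "love\t3.0\t0.5\t[3, 3, 3]",
    "like\t1.5\t0.5\t[1, 2, 2]",
    "hate\t-2.5\t0.5\t[-3, -2, -2]",
]

def parse_lexicon(lines):
    """Map each token to its mean sentiment rating (the second field)."""
    lexicon = {}
    for line in lines:
        token, mean_rating = line.split("\t")[:2]
        lexicon[token] = float(mean_rating)
    return lexicon

def toy_polarity(no_punc_text, lexicon):
    """Sum the ratings of words found in the lexicon; other words count as 0."""
    return sum(lexicon.get(word, 0.0) for word in no_punc_text.lower().split())

lexicon = parse_lexicon(toy_lexicon_lines)
for text in ["I love America", "I like America", "I hate taxes"]:
    print(text, "->", toy_polarity(text, lexicon))
```

With these made-up ratings, "I love America" scores higher than "I like America", and "I hate taxes" scores below zero, which mirrors the examples given earlier in this part.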

<div class="alert alert-warning">
<b>Question:</b> Find the 5 most negative tweets in the dataset. (Hint: first, sort the data.)
   </div>
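
The hint (sort first, then keep the first few rows) can be sketched in plain Python on made-up data. This is only an illustration of the idea, not the `Table` code the cell below asks for; `scored_tweets` and `most_extreme` are hypothetical names.

```python
# Made-up (text, polarity) pairs.
scored_tweets = [("a", 1.5), ("b", -3.0), ("c", 0.0), ("d", -0.5)]

def most_extreme(tweets, n, most_positive=False):
    """Sort by polarity and keep the first n rows."""
    ordered = sorted(tweets, key=lambda row: row[1], reverse=most_positive)
    return ordered[:n]

print(most_extreme(scored_tweets, 2))                      # two most negative
print(most_extreme(scored_tweets, 2, most_positive=True))  # two most positive
```

The only difference between "most negative" and "most positive" is the sort direction.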

In [None]:
most_negative = trump...(...)...(np.arange(5))
most_negative

In [None]:
## Just Run this cell to view the whole text of the tweets in a nicer format
print('Most negative tweets:')
for t in most_negative.column('text'):
    print('\n  ', t)

<div class="alert alert-warning">
<b>Question:</b> What patterns do you notice in the most negative tweets?
   </div>

*Answer here*

<div class="alert alert-warning">
<b>Question:</b> Find the 5 most positive tweets in the dataset. (Hint: first, sort the data.)
   </div>

In [None]:
most_positive = trump.sort('...', descending=True)...(np.arange(5))
most_positive

In [None]:
## Just Run this cell to view the whole text of the tweets in a nicer format
print('Most positive tweets:')
for t in most_positive.column('text'):
    print('\n  ', t)

<div class="alert alert-warning">
<b>Question:</b> What patterns do you notice in the most positive tweets?
   </div>

*Answer here*

**Specific Words:** These more extreme tweets show some trends. Based on what we know from these tweets and the news, let's investigate specific words that Trump uses in his tweets. In what context does he use these words?

<div class="alert alert-warning">
<b>Question:</b> Choose 6 different keywords. Then, calculate the average polarity for tweets that contain those keywords. Use the `avg_pol` function. Make sure to run the cell that defines the function. We have provided the word "immigr" as an example of the format; feel free to change it.
   </div>
   
Note: Some words are used more often than others, but there is usually a stem or root part of a word that appears more often. For example, if you are interested in immigration, consider using "immigr" so that you find tweets that contain immigration, immigrant, etc.
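
A small sketch of why a stem finds more tweets than a full word, using plain substring matching on made-up rows. The rows stand in for the `no_punc` and `polarity` columns, and `average_polarity` is a hypothetical helper, not the `avg_pol` function defined below. Note that Python's `in` test is case-sensitive.

```python
# Made-up rows standing in for the `no_punc` and `polarity` columns.
toy_tweets = [
    ("we need immigration reform now", -0.5),
    ("immigrants built this country", 1.0),
    ("great rally tonight", 2.5),
]

def average_polarity(stem, tweets):
    """Average the polarity of tweets whose text contains `stem`."""
    matches = [polarity for text, polarity in tweets if stem in text]
    # A stem that matches nothing has no meaningful average.
    return sum(matches) / len(matches) if matches else float("nan")

print(average_polarity("immigr", toy_tweets))       # matches the first two rows
print(average_polarity("immigration", toy_tweets))  # matches only the first row
```

The shorter stem averages over more tweets, so the two calls can give different results for what looks like the same topic.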

In [None]:
## RUN THIS CELL!!
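def avg_pol(keyword_array):
    """Return an array with the average polarity of tweets containing each keyword."""
    pol_arr = make_array()
    for keyword in keyword_array:
        # Keep only the tweets whose punctuation-free text contains the keyword
        tbl = trump.where("no_punc", are.containing(keyword))
        avg = np.average(tbl.column("polarity"))
        pol_arr = np.append(pol_arr, avg)
    return pol_arr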

In [None]:
words = make_array("immigr",  ..., ..., ..., ..., ...)
polarity_score = avg_pol(words)
polarity_score

We have compiled the keywords we are interested in and their average polarities. In order to compare the numbers in the array, it would be easier if they were in a table, so let's create one.
<div class="alert alert-warning">
<b>Question:</b> Create a table called `words_polarity` that has two columns. The first is called `Word`, and the second is called `Average Polarity` and contains the `polarity_score` array we made above. Then, sort the table by the `Average Polarity` column in ascending order.
   </div>

In [None]:
words_polarity = Table().with_columns("...", words, "Average Polarity", ...).sort("Average Polarity")
words_polarity

<div class="alert alert-warning">
<b>Question:</b> Using the words_polarity table, we can make a bar chart. Fill in the code below.
   </div>

In [None]:
words_polarity.barh(...)

<div class="alert alert-warning">
<b>Question:</b> What are some possible reasons for the disparities between the bars? 
   </div>

*Insert answer here.*

### Part 2: Polarity in relation to Retweets & Time

In Part 1, we learned about polarity and the sentiment of some of Trump's tweets, but how does this relate to other parts of the data? Two other interesting components are the number of retweets each post gets and the differences in polarity over time. How are these variables related to sentiment? Let's start with retweets.

**Retweets:** As on other social media platforms, retweeting allows people to share content that others post. The higher the number of retweets, the more popular the post.

<div class="alert alert-warning">
<b>Question:</b> Find the 5 most retweeted posts in the dataset. (Hint: first, sort the data.)
   </div>

In [None]:
most_retweeted = trump...("retweet_count", ...).take(...)
most_retweeted

In [None]:
## Just Run this cell to view the whole text of the tweets in a nicer format
print('Most retweeted posts:')
for t in most_retweeted.column('text'):
    print('\n  ', t)

<div class="alert alert-warning">
<b>Question:</b> What patterns do you notice in the most retweeted tweets?
   </div>

*Answer here*


<div class="alert alert-warning">
<b>Question:</b> How do retweets relate to polarity? Make a scatterplot that compares retweets and polarity.
   </div>

In [None]:
trump.scatter("polarity", ...)

**Polarity Over Time:** We learned a little about retweeting patterns, but how do these patterns vary over time? Let's focus on years, so we can see the broad pattern over time.
<div class="alert alert-warning">
<b>Question:</b> Group the data by year, so that each row represents a unique year. Take the average of every other column. If a column contains strings, make sure to drop it. Call this table `year_group`.
   </div>
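
What "group by year and average" does can be sketched in plain Python on a few made-up rows. This illustrates the idea only; it is not the `Table.group` call that the cell below asks for, and `toy_rows` and `average_by_year` are hypothetical names.

```python
from collections import defaultdict

# Made-up (year, retweet_count, polarity) rows.
toy_rows = [
    (2016, 100, 1.0),
    (2016, 300, -1.0),
    (2017, 500, 2.0),
]

def average_by_year(rows):
    """Return {year: (average retweet_count, average polarity)}."""
    groups = defaultdict(list)
    for year, retweets, polarity in rows:
        groups[year].append((retweets, polarity))
    return {
        year: (
            sum(r for r, _ in values) / len(values),
            sum(p for _, p in values) / len(values),
        )
        for year, values in groups.items()
    }

print(average_by_year(toy_rows))
```

Each year ends up as a single row of averages, which is why columns of strings (where an average is not meaningful) need to be dropped.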

In [None]:
year_group = trump.group("year", ...).drop(...)
year_group 

<div class="alert alert-warning">
<b>Question:</b> Using the grouped table, create a plot comparing `year` by `retweet_count average`.
   </div>

In [None]:
year_group.plot("year", "...")

<div class="alert alert-warning">
<b>Question:</b> What do you notice from the plot? What trend exists over time (if any)?
   </div>

*Answer here*

<div class="alert alert-warning">
<b>Question:</b> Let's do the same for polarity over time. Using the grouped table, create a plot comparing `year` by `polarity average`.
   </div>

In [None]:
year_group.plot("...", "...")

<div class="alert alert-warning">
<b>Question:</b> What do you notice from the plot? What trend exists over time (if any)?
   </div>

*Answer here*

<div class="alert alert-warning">
<b>Question:</b> Given the changes in polarity and retweet counts over time, what might we expect to see from Trump's 2020 Twitter data?
   </div>

*Answer here.*

## Using prediction and inference to draw conclusions <a id='subsection 1d'></a>

Now that we have some context for the data, let's think back to major events that have happened during Trump's time as president. Consider his fight against Hillary & Bernie, his inauguration, the witch hunt period, fake news, the Russia scandal, and Charlottesville, to name a few. These are all major events that happened in the past few years. **How do these events appear in Trump's tweets?**

From the previous sections, we have looked at the polarity of certain words, and we can do something similar to explore these events. As a group, choose an event you would like to explore more in depth.

<div class="alert alert-warning">
<b>Question:</b> What event are you interested in exploring? Determine a keyword you can use to find all related tweets to the event.
   </div>

*Answer here*

<div class="alert alert-warning">
<b>Question:</b> Use your keyword to find all the tweets where your keyword is contained in the post.
   </div>

In [None]:
event = ...("no_punc", are.containing("..."))
event

<div class="alert alert-warning">
<b>Question:</b> What is the time range of tweets related to your event? How does this compare to what you know of the event? I recommend searching a bit about the event you are exploring.
   </div>
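
One way to think about a time range is as the earliest and latest timestamp among the matching tweets. The sketch below does this in plain Python on made-up timestamps. The `YYYY-MM-DD HH:MM:SS` layout is an assumption about how the `time` column is stored, and `time_range` is a hypothetical helper.

```python
from datetime import datetime

# Made-up timestamps, assuming a 'YYYY-MM-DD HH:MM:SS' layout.
toy_times = ["2017-08-12 18:19:00", "2017-08-15 21:05:00", "2017-08-13 01:02:00"]

def time_range(time_strings):
    """Return the earliest and latest timestamps as datetime objects."""
    parsed = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in time_strings]
    return min(parsed), max(parsed)

first, last = time_range(toy_times)
print(first, "to", last, "spanning", (last - first).days, "days")
```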

*Answer here*

<div class="alert alert-warning">
<b>Question:</b> Plot the change over time in retweets for your event. Comment on what patterns you noticed.
   </div>

In [None]:
event.plot("time", ...)

*Comment here*

<div class="alert alert-warning">
<b>Question:</b> Plot the change over time in polarity for your event. Comment on what patterns you noticed.
   </div>

In [None]:
event.plot(..., ...)

*Comment here*

<div class="alert alert-warning">
<b>Question:</b> Based on these two measures, how do Trump's tweets frame the event? How does that differ from your interpretation of the event? How is it similar?
   </div>

*Answer here*

<div class="alert alert-warning">
<b>Question:</b> What impact might President Trump's tweets have on these major events? How does his use of Twitter influence individuals who agree and disagree with his beliefs?
   </div>

*Answer here*

<div class="alert alert-warning">
<b>Question:</b> What is something interesting you learned from the project?
   </div>

*Answer here*

Source: Adapted from [Fall 2019 Data 100 HW 4: Trump, Twitter, and Text](http://www.ds100.org/fa19/syllabus/)
Notebook Authors: Alleanna Clark, Ashley Quiterio, Karla Palos Castellanos