# _What are you tweeting about?_

When I first heard of Twitter way back when, I thought it was stupid. I mean, who would really want to put their thoughts (in 140 or fewer characters) out for the world to see? 

Apparently, a lot of people...

It has only been in the past six months or so that I've come to use Twitter regularly. However, I don't tweet all that much, so what's the point? Information curation. 

Twitter has given me virtual access to experts in the fields of data science, machine learning and deep learning. The list of names is exceptionally long, but all these individuals combine to help show the way, so to speak. They highlight the signal within all the noise around the above subjects. Additionally, **I** have the choice in whose information I allow to populate the screen of my smartphone or laptop. In summary, much like an art connoisseur curates artworks for a showing, I can curate what information populates my feed. 

Now there are issues with this, the most obvious of which is that one can begin to silo themselves into a bubble. Since we have the power to control our feed, and for the most part, many of us want to read about what we're already interested in, why would we be compelled to follow anybody with completely opposite views? 

However, I don't want to get too far into the weeds because this notebook has a different and very specific purpose. Continuing with the #trend, we're going to look into gathering tweets from Twitter and analyzing them using Python. More specifically, we're going to be focusing on the tweets from members of the United States federal government - particularly those in the Legislative and Executive Branches - where we'll use a range of NLP (natural language processing) techniques to analyze the sentiment associated with each tweet. 

### Purpose

Firstly, I want to get a better understanding of Twitter's [API](https://developer.twitter.com/en/docs). Ever since I went through Dataquest's [tutorial](https://www.dataquest.io/blog/streaming-data-python/) on the Twitter API, I've made it a focus to get more comfortable with it. 

As an extension, I want to expand my understanding of the wide range of tools available in Python for NLP (natural-language processing). When I first started my journey into data science, I was hugely intimidated by working with text. After all, how do you analyze anything but numbers? Luckily, I've been exposed to some really insightful resources (which I'll list below) that have really helped me to grasp the many approaches available to us with NLP. 

I feel like I'm starting to ramble, so without further ado, let's dive in! Our first step will be making sure `tweepy` is installed in our environment.




In [None]:
# install tweepy
!pip install tweepy

### _What is `tweepy`?_

_"An easy-to-use Python library for accessing the Twitter API."_

Couldn't put it more succinctly than that. Like `sklearn` does for machine learning, or `pandas` does for data wrangling, `tweepy` gives us tools to work directly with and access information on Twitter. 

Before we can analyze tweets, though, we need to set up an app via Twitter's developer site. While I won't be diving further into this initial app set-up, Dataquest's [tutorial](https://www.dataquest.io/blog/streaming-data-python/) gives an excellent high-level overview of how to do this. 

So after setting up our app, we have access to an assortment of "Keys and Access Tokens". Much like the press need badges to get into White House press meetings, these keys/tokens are the credentials that allow us to access Twitter. I want to highlight here the importance of keeping these _private_. If someone else had these keys/tokens, they could access Twitter and potentially do some nefarious acts in your name, which we don't want! 

To keep these keys/tokens private, they'll be stored as variables in a private Python script. We can then refer to each key/token by its variable name instead of typing the actual values into the notebook. For clarification, below is some background on the different keys/tokens and the variables they'll be stored in.

- Consumer Key (i.e., API Key) - will be stored in a variable called `TWITTER_API_KEY`
- Consumer Secret (i.e., API Secret Key) - will be stored in a variable called `TWITTER_API_SECRET`
- Access Token - will be stored in a variable called `TWITTER_TOKEN`
- Access Token Secret - will be stored in a variable called `TWITTER_TOKEN_SECRET`
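
To make the idea concrete, here is a minimal sketch of what such a private script could look like. This is not the actual script used in this notebook: reading the values from environment variables is my assumption for the example, and `load_twitter_credentials` is a hypothetical helper name. Only the four variable names come from the list above.

```python
import os

# The four names match the list above; pulling them from environment
# variables (rather than hard-coding them) is an assumption of this sketch.
CREDENTIAL_NAMES = (
    "TWITTER_API_KEY",
    "TWITTER_API_SECRET",
    "TWITTER_TOKEN",
    "TWITTER_TOKEN_SECRET",
)


def load_twitter_credentials(environ=os.environ):
    """Return a dict of the four credentials, failing loudly if any are missing."""
    missing = [name for name in CREDENTIAL_NAMES if not environ.get(name)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {name: environ[name] for name in CREDENTIAL_NAMES}
```

The point of a helper like this is that the real values never appear in the notebook itself; only the variable names do.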

### _Setting Up the Environment_

With all this talk about `tweepy`, I nearly forgot about one crucial step: setting up my working environment! Since I only had access to an older MacBook for this project, I'll be using Google Colab today. That means that to access any of the scripts I mentioned previously, I'll have to "mount" my Google Drive in this Colab notebook. After that, I'll update the working directory to the correct location; this will serve as our "home base," where any data we access (or decide to download) will be stored. It takes just a few steps, so let's do that now.


In [5]:
# mount your Google Drive, giving us access to any files we have stored there
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


Ok we've mounted our Google Drive, now we need to see what our current working directory is. This is important to address up-front because it'll help us stay better organized for the rest of the project.

In [6]:
# import os library to help us navigate directories (which in this case live on our Google Drive)
import os

# what is our current working directory?
print("Our current working directory is '{}'.".format(os.getcwd()))

Our current working directory is '/content'.


We'll need to update this to the correct location. We can do this through the `os` library as well, more specifically by calling `chdir()` and inputting the path name in the parentheses. 

In [7]:
# change our directory to the one associated with this project
os.chdir('/content/gdrive/My Drive/projects/trump-tweets')

# let's confirm that it updated the directory accordingly
os.getcwd()

'/content/gdrive/My Drive/projects/trump-tweets'

In [8]:
# import Path to make working with directory even more manageable
from pathlib import Path

# store our main directory path in a variable in case we need to access/download information in that location directly
PATH = Path(os.getcwd())
print(PATH)

/content/gdrive/My Drive/projects/trump-tweets
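
As a quick aside, here is a small sketch of why keeping the directory in a path object is handy. I'm using `PurePosixPath` here (my choice for this sketch, since it only manipulates the path text and never touches Google Drive), and the CSV file name is just an example.

```python
from pathlib import PurePosixPath

# Same project directory as above, held as a pure (text-only) path object.
project_dir = PurePosixPath('/content/gdrive/My Drive/projects/trump-tweets')

# The "/" operator joins path components, so no manual string concatenation.
csv_path = project_dir / 'realDonaldTrump_tweets.csv'

print(csv_path)         # full path to the example file
print(csv_path.name)    # just the file name
print(csv_path.suffix)  # just the extension
```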


Awesome! Our working directory is set, and we have our path stored safely, which may come in handy as a data lighthouse of sorts in case we ever get lost with the directories. Our next step is to import `joetools`, our very own custom library, whose purpose is to store the Python scripts we need, including the one with our Twitter keys and tokens. 

We'll import it and then be able to easily access those while keeping them private at the same time!

In [None]:
# now we have access to the keys and tokens!
from joetools import get_tweets, secrets

### _Authentication_

Environment and main directory set up? Check.

Access to Twitter keys and tokens? Check.

Now comes the next part: authentication. It is relatively easy to do this with just a few lines of code thanks to `tweepy`! We'll first authenticate our request to access Twitter using `TWITTER_API_KEY` and `TWITTER_API_SECRET`. Then we'll set our access tokens - using `TWITTER_TOKEN` and `TWITTER_TOKEN_SECRET`. 

We are now authenticated, and the next step is to simply pass this authentication in an API object, which will then allow us to pull data from Twitter. 

In [None]:
# be sure to import tweepy!
import tweepy

# access tweepy authenticator, passing in our two API keys, then set access tokens
auth = tweepy.OAuthHandler(secrets.TWITTER_API_KEY, secrets.TWITTER_API_SECRET)
auth.set_access_token(secrets.TWITTER_TOKEN, secrets.TWITTER_TOKEN_SECRET)

# pass in above authentication to API
api = tweepy.API(auth)

In three lines of code (four if you want to count importing the library...), we're able to get access to the entire Twitter [RESTful API](https://tweepy.readthedocs.io/en/latest/api.html#api-reference) and its methods! Since I am not too technically-inclined (yet), I try to keep my code as simple as possible. This does not mean that things can't get complex and multi-tiered; things can blow up and blow up quickly to be sure. 

However, I do my best to follow the principle of [_Occam's razor_](https://en.wikipedia.org/wiki/Occam%27s_razor), which states: 

_"Entities should not be multiplied without necessity."_

Basically, when two competing theories make exactly the same predictions, the simpler one is the better one [(Source)](http://math.ucr.edu/home/baez/physics/General/occam.html). 

We can loosely translate that to the following: keep it simple, stupid!

As a quick and simple example of the power of the API, let's print out some tweets from my very own Twitter feed using `home_timeline()` in conjunction with our `api`. 

In [None]:
# access home_timeline()
public_tweets = api.home_timeline()

In [12]:
# access some recent tweets/retweets
for tweet in public_tweets[:10]:
    print(tweet.text)

RT @phil_tinline: Defining features of “Soviet style purges” included prolonged torture, false confessions at kangaroo show trials, sometim…
What we learned from the NFL’s upset-heavy week 6 --&gt; https://t.co/EpgmJtTpIG
RT @Liquidata1: @fchollet We don't think APIs are the answer. For many applications, you need to see the entire dataset. We have a really s…
RT @juliaioffe: American hero https://t.co/ylrb9IXxL1
The right’s anti-Trump bloc: https://t.co/wImHYHRtwv
“Whatever can be understood, can be understood through explanatory knowledge. And more, any physical process can be… https://t.co/No4gFaq5je
Google Home PM: Alright, so only voice commands?

Google Home product engineers: Sounds good.

[everyone gets up an… https://t.co/QOXR1I2yW9
Where we discuss self-scandalisation as a propaganda tactic and the reason politicians can’t be held accountable wi… https://t.co/jew0hDMBoH
Convo/thread https://t.co/cQ9OY3ZW3S
Not all #Pokemon are created equal. #Pikachu may have his own #Detecti

We can also gather tweets from other users. To keep things simple let's get access to user `earny_joe`'s tweets (which is me!).

In [13]:
# use get_user method to access data from a specific user
# (passing screen_name as a keyword also works on newer tweepy releases)
user = api.get_user(screen_name='earny_joe')

print('What timezone is user {} in? --> {}'.format(user.screen_name, user.time_zone))
print('How many followers does {} have? --> {}'.format(user.screen_name, user.followers_count))

What timezone is user earny_joe in? --> None
How many followers does earny_joe have? --> 40


### _Next Steps: Create A Function to Get Tweets_

You may have noticed earlier, when we imported the scripts from `joetools`, that in addition to our `secrets` script (which has the keys/tokens stored in it) we imported another one called `get_tweets`.

What is the point of this script? The `get_tweets` script contains the function that we can use to get specified users' tweets from Twitter! Luckily this step was pretty quick; after a quick Google search, I came across David Yanofsky's Python script ([link](https://gist.github.com/yanofsky/5436496)) that did most of the legwork. I made a few slight tweaks to it for this project, but the bulk of the credit goes to David, so thank you! 

I'll be calling it via the command `get_tweets.get_tweets()`, which takes two inputs:
1. `username`: The individual's Twitter username, or 'handle'
2. `num_tweets`: the number of tweets to process; as a reminder, there is a limit of the `3200` most recent tweets

With these two inputs, the function gets to work. But first, let's take a slight detour back to the keys and tokens. After all, we have to get the keys authenticated and set the access tokens to use the Twitter API! 

This was one of the small tweaks I made; I added a call to the `secrets` script, which contains all the secret info! We can then safely access our secret keys/tokens and store them in variables for use later. 

Ok, now back to the function! Below is a quick description of the steps that the function takes to gather, process, and then store the information:

1. Authenticates with API keys and sets access tokens so that we can access Twitter via `tweepy`.
2. Uses a [_cursor_](http://docs.tweepy.org/en/v3.5.0/cursor_tutorial.html) object to access our inputted user's items, things like the username, retweet count, favorite count, and the tweet text.
3. Generates a CSV file that we can then put into a `pandas` data frame.
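
The three steps above can be sketched roughly as follows. To be clear, this is my own simplified illustration and not the code in `get_tweets` (nor David's original gist): the names `tweet_to_row`, `write_rows`, and `get_tweets_sketch` are hypothetical, authentication is assumed to have happened already (an authenticated `tweepy.API` object is passed in as `api`), and the demo at the bottom uses a made-up object instead of a real tweet so it runs without touching Twitter.

```python
import csv
import io
from types import SimpleNamespace


def tweet_to_row(username, tweet):
    """Pick out the fields that end up as one CSV row."""
    return [
        username,
        tweet.id_str,
        tweet.created_at,
        tweet.source,
        tweet.retweet_count,
        tweet.favorite_count,
        tweet.text,
    ]


def write_rows(rows, handle):
    """Write the rows to an open file handle, without a header line."""
    csv.writer(handle).writerows(rows)


def get_tweets_sketch(api, username, num_tweets):
    """Hypothetical stand-in for get_tweets(); `api` is an authenticated tweepy.API."""
    import tweepy  # imported here so the helpers above work without tweepy

    # The cursor walks through the user's timeline page by page for us.
    cursor = tweepy.Cursor(api.user_timeline, screen_name=username, count=200)
    rows = [tweet_to_row(username, tweet) for tweet in cursor.items(num_tweets)]

    filename = '{}_tweets.csv'.format(username)
    with open(filename, 'w', newline='', encoding='utf-8') as handle:
        write_rows(rows, handle)
    return filename


# Offline demo: a made-up object standing in for a real tweet.
fake_tweet = SimpleNamespace(
    id_str='1', created_at='2019-10-15 00:51:26', source='Web',
    retweet_count=2, favorite_count=3, text='hello, world',
)
buffer = io.StringIO()
write_rows([tweet_to_row('some_user', fake_tweet)], buffer)
print(buffer.getvalue())
```

Splitting the row-building from the network call is purely a design choice for this sketch; it lets the CSV part be checked without any API keys.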

Since we now have a better understanding of what is going on behind the scenes, let's dive in and get to work, pulling some tweets!

In [15]:
# get 3199 most recent tweets from Donald Trump
get_tweets.get_tweets(username='realDonaldTrump', num_tweets=3199)

writing to realDonaldTrump_tweets.csv


Ok, so why did I set `num_tweets` to 3,199? For some reason, when I set the number to 3,200 (the limit specified by Twitter), it sometimes returns only a few hundred tweets. I haven't figured out why; I make fewer than five requests on any given day, so I don't think it is due to making too many calls in a short period of time. I'll have to dig deeper into the [documentation](https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline) to figure out what is going on.

In [None]:
# see what the CSV looks like
import pandas as pd
# optional - changes the default setting so that all columns are output for a pandas Df
pd.set_option('display.max_columns', None)

# create pandas dataframe
trump = pd.read_csv('realDonaldTrump_tweets.csv')

In [17]:
# check out the first five rows
trump.head()

Unnamed: 0,realDonaldTrump,1183908206088728576,2019-10-15 00:51:26,Twitter for iPhone,3183,7994,b'\xe2\x80\x9cProject Veritas-Obtained Undercover Videos Highlight Jeff Zucker\xe2\x80\x99s (@CNN) Campaign To Destroy Trump. Videos Reveal\xe2\x80\xa6 https://t.co/yJSfxoGt7e'
0,realDonaldTrump,1183900672892309505,2019-10-15 00:21:30,Twitter for iPhone,8409,24211,b'A big scandal at @ABC News. They got caught ...
1,realDonaldTrump,1183899559124189184,2019-10-15 00:17:05,Twitter for iPhone,6571,21131,b'Shifty Schiff now seems to think they don\xe...
2,realDonaldTrump,1183873633057476609,2019-10-14 22:34:04,Twitter Media Studio,11785,29106,"b'""The House gone rogue! I want to remind you ..."
3,realDonaldTrump,1183869954640228352,2019-10-14 22:19:27,Twitter Media Studio,10572,27167,"b'""It doesn\'t speak for the FULL HOUSE becaus..."
4,realDonaldTrump,1183869189049737217,2019-10-14 22:16:24,Twitter Media Studio,9258,24080,"b'""The Democrat Party has hijacked the House o..."


### _Observations_
For the most part, this looks like we're off to a great start. However, it appears that the header row is actually a tweet! This is a relatively easy fix; we'll call the `pd.read_csv()` function again, but this time include an additional input called `names` that'll contain a list of column names to assign to the DataFrame. Let us do this now!

In [None]:
# list containing the column names we want to assign the dataframe
column_names = ['username', 'id', 'created_at', 'source', 'retweet_count', 'favorite_count', 'tweet']

In [19]:
# create pandas dataframe
trump = pd.read_csv('realDonaldTrump_tweets.csv', names=column_names)

# recheck first few rows
trump.head()

Unnamed: 0,username,id,created_at,source,retweet_count,favorite_count,tweet
0,realDonaldTrump,1183908206088728576,2019-10-15 00:51:26,Twitter for iPhone,3183,7994,b'\xe2\x80\x9cProject Veritas-Obtained Underco...
1,realDonaldTrump,1183900672892309505,2019-10-15 00:21:30,Twitter for iPhone,8409,24211,b'A big scandal at @ABC News. They got caught ...
2,realDonaldTrump,1183899559124189184,2019-10-15 00:17:05,Twitter for iPhone,6571,21131,b'Shifty Schiff now seems to think they don\xe...
3,realDonaldTrump,1183873633057476609,2019-10-14 22:34:04,Twitter Media Studio,11785,29106,"b'""The House gone rogue! I want to remind you ..."
4,realDonaldTrump,1183869954640228352,2019-10-14 22:19:27,Twitter Media Studio,10572,27167,"b'""It doesn\'t speak for the FULL HOUSE becaus..."
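
To see the effect of `names` in isolation, here is a tiny sketch using a made-up, three-line CSV held in memory (the rows are toy data, not real tweets). Without `names`, `pandas` uses the first line as the header; with `names`, every line is treated as data.

```python
import io

import pandas as pd

# Toy stand-in for a headerless CSV; these rows are invented for illustration.
raw = (
    "user_a,101,2019-10-15 00:51:26,Web,1,2,first\n"
    "user_a,102,2019-10-15 00:21:30,Web,3,4,second\n"
    "user_a,103,2019-10-15 00:17:05,Web,5,6,third\n"
)
column_names = ['username', 'id', 'created_at', 'source',
                'retweet_count', 'favorite_count', 'tweet']

# Default behaviour: the first line is consumed as the header row.
without_names = pd.read_csv(io.StringIO(raw))

# With names=..., all three lines are kept as data rows.
with_names = pd.read_csv(io.StringIO(raw), names=column_names)

print(len(without_names), len(with_names))
print(list(with_names.columns))
```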


In [20]:
# what is the shape of our dataframe?
print('There are {} rows and {} columns in our Donald Trump tweet dataset.'.format(trump.shape[0], trump.shape[1]))

There are 3199 rows and 7 columns in our Donald Trump tweet dataset.


As we can see above, we're working with 3,199 tweets! Now it's time to explore the data a little bit and see how we might be able to clean it up into a more advantageous format.

In [21]:
# what data types are in each of the columns?
trump.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3199 entries, 0 to 3198
Data columns (total 7 columns):
username          3199 non-null object
id                3199 non-null int64
created_at        3199 non-null object
source            3199 non-null object
retweet_count     3199 non-null int64
favorite_count    3199 non-null int64
tweet             3199 non-null object
dtypes: int64(3), object(4)
memory usage: 175.0+ KB


We can see that by calling `info()`, we're able to get a good high-level overview of what is in each column. It looks like we're working with `int` and `object` data types. (FYI - here `object` means the column holds strings).

Now might be an excellent time to take a step back and assess what happened when we accessed the tweets with our `get_tweets` function and what was returned from the API.

To access Donald Trump's tweets, we use what is called a `Cursor` object. Now, what is it? Well, when we use the Twitter API, we inevitably run into something called [pagination](http://docs.tweepy.org/en/v3.4.0/cursor_tutorial.html). In other words, results come back one page at a time, so we often have to iterate over pages to get the information we want. For example, the `get_tweets` function iterates through `realDonaldTrump`'s timeline until it has gathered the number of tweets specified. 

What the `Cursor` does is simplify this iteration. Instead of a combination of `if-else` statements and `for` loops, we simply pass in the username! 
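
To give a feel for the bookkeeping a cursor hides, here is a toy sketch of page-by-page iteration. Nothing here calls Twitter or `tweepy`: `fetch_page` is a made-up function that pretends to be a paged API over the IDs 10 down to 1, and `iterate_items` is a hypothetical helper that keeps requesting pages until it has yielded the requested number of items.

```python
def fetch_page(max_id=None, count=3):
    """Fake API call (not a Twitter endpoint): newest-first IDs at or below max_id."""
    all_ids = list(range(10, 0, -1))  # pretend timeline: 10, 9, ..., 1
    if max_id is not None:
        all_ids = [i for i in all_ids if i <= max_id]
    return all_ids[:count]


def iterate_items(limit, page_size=3):
    """Yield up to `limit` items, requesting further pages as needed."""
    yielded = 0
    max_id = None
    while yielded < limit:
        page = fetch_page(max_id=max_id, count=page_size)
        if not page:
            return  # nothing left in the timeline
        for item in page:
            if yielded == limit:
                return
            yield item
            yielded += 1
        # The next page should start just below the oldest item seen so far.
        max_id = page[-1] - 1


collected = list(iterate_items(limit=7))
print(collected)
```

With a cursor, all of the `max_id` tracking and stop conditions above collapse into a single loop over `.items(n)`.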

Ok, that's cool and all, but how do we go about collecting the actual tweets?

The data returned from the API is called a Tweet object (really original...). Within this object, there is a long list of ‘root-level’ attributes, including fundamental attributes such as `id`, `created_at`, and `text`. ([source](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object))

For reference, I've decided to describe what each column represents below. I'll present the column name, then its subsequent description, data type and the Tweet object `Attribute` name (all of which can also be found at this [link](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object)):

- `username`: Twitter username/handle of individual of interest; in our function we technically used the username that was input in the `get_tweets` function; this could be accessed however by calling `user`, which returns a [User object](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/user-object) that is also a data dictionary containing assorted data about the user 
- `id`: the unique identifier of the tweet, the root-level attribute name is `id_str`, which returns a string; there is technically another attribute (`id`) that returns the same thing as an `int`, but Twitter advises using the string version to be on the safe side
- `created_at`: UTC time when Tweet was created, the root-level attribute name is also `created_at` which returns a string type
- `source`: utility used to post the tweet, i.e. 'did you use your smartphone or computer/laptop to post it?', the root-level attribute name is also `source` which returns a string type
- `retweet_count`: number of times tweet has been retweeted, the root-level attribute name is also `retweet_count` which returns an integer type
- `favorite_count`: approximately how many times the tweet's been liked by other users, the root-level attribute name is also `favorite_count` which returns an integer type
- `tweet`: the actual text of the tweet, the root-level attribute name is `text` which returns a string type

As we can see though, there is a mismatch between the columns and their types, the most obvious being the `id` column. Despite us retrieving the `id_str` attribute from the Tweet object, `pandas` saw the numbers and inferred that it should be an `int` type instead.
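
As far as I understand it, the reason the string version is the safe choice is that tweet IDs are large enough that they no longer fit exactly into a 64-bit float. `int64` itself copes fine, but if an ID ever passes through a float (or through a language that stores all numbers that way), digits can silently change. Here is a small sketch of that effect, using one of the IDs from the table above:

```python
tweet_id_str = '1183900672892309505'  # an ID taken from the table above

exact = int(tweet_id_str)               # Python ints keep every digit
round_trip = int(float(tweet_id_str))   # detour through a 64-bit float

# The detour through float does not give back the same ID.
print(exact)
print(round_trip)
print(exact == round_trip)
```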

Luckily we can fix this as the `read_csv()` function has a way for us to define data types for each column by inputting a dictionary into the `dtype` parameter. Let's go ahead and create this dictionary and then pass it into the CSV function again! After all, third time is a charm!

In [22]:
# print list of column names
print(list(trump.columns))

['username', 'id', 'created_at', 'source', 'retweet_count', 'favorite_count', 'tweet']


In [None]:
# create dictionary with column name as keys and data types as the values
dtypes = {'id': str}

Another thing we can do is attempt to address the dates up-front in the `created_at` column. At this point in time, `pandas` returns the values in that column as a string; yet, there are two parameters - `parse_dates` & `infer_datetime_format` - we can use to potentially address this datetime issue upfront. 

So, here are the next two things we'll be doing: 
1. Making sure the `id` column is returned as a string type
2. Passing `['created_at']` to `parse_dates` and setting `infer_datetime_format` to `True`, so that pandas converts the `created_at` column from a string type to a (hopefully) correctly formatted datetime type.

In [24]:
# create pandas dataframe
trump = pd.read_csv('realDonaldTrump_tweets.csv', names=column_names, dtype=dtypes, parse_dates=['created_at'], infer_datetime_format=True)

trump.head()

Unnamed: 0,username,id,created_at,source,retweet_count,favorite_count,tweet
0,realDonaldTrump,1183908206088728576,2019-10-15 00:51:26,Twitter for iPhone,3183,7994,b'\xe2\x80\x9cProject Veritas-Obtained Underco...
1,realDonaldTrump,1183900672892309505,2019-10-15 00:21:30,Twitter for iPhone,8409,24211,b'A big scandal at @ABC News. They got caught ...
2,realDonaldTrump,1183899559124189184,2019-10-15 00:17:05,Twitter for iPhone,6571,21131,b'Shifty Schiff now seems to think they don\xe...
3,realDonaldTrump,1183873633057476609,2019-10-14 22:34:04,Twitter Media Studio,11785,29106,"b'""The House gone rogue! I want to remind you ..."
4,realDonaldTrump,1183869954640228352,2019-10-14 22:19:27,Twitter Media Studio,10572,27167,"b'""It doesn\'t speak for the FULL HOUSE becaus..."


In [25]:
trump.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3199 entries, 0 to 3198
Data columns (total 7 columns):
username          3199 non-null object
id                3199 non-null object
created_at        3199 non-null datetime64[ns]
source            3199 non-null object
retweet_count     3199 non-null int64
favorite_count    3199 non-null int64
tweet             3199 non-null object
dtypes: datetime64[ns](1), int64(2), object(4)
memory usage: 175.0+ KB


Yes! Looks like we're good to go in terms of the correct data types for each column! 
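
As a small, self-contained check of the same idea, here is a sketch that reads a single made-up row from memory with `dtype` and `parse_dates`. I've left out `infer_datetime_format` in this sketch because newer `pandas` releases (2.0 and later) deprecate it; everything else mirrors the call above, and the row itself is toy data apart from the ID.

```python
import io

import pandas as pd

# One invented row in the same column layout as our CSV.
raw = "user_a,1183900672892309505,2019-10-15 00:21:30,Web,3,4,toy text\n"
column_names = ['username', 'id', 'created_at', 'source',
                'retweet_count', 'favorite_count', 'tweet']

df = pd.read_csv(
    io.StringIO(raw),
    names=column_names,
    dtype={'id': str},            # keep the ID as text, digit for digit
    parse_dates=['created_at'],   # convert this column to datetimes
)

print(type(df.loc[0, 'id']).__name__)
print(df['created_at'].dt.year.iloc[0])
```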

Unfortunately, I think this is a great point to call it a day. Tomorrow I'll start getting into the fun stuff: analyzing and visualizing the data we've gathered, with a particular focus on the text of the tweets!

Until then, auf wiedersehen! 