# Data Acquisition: Getting Tweet Sentiments And SPY Data

![SPY](https://static.seekingalpha.com/uploads/2017/8/12/34092875-1502537153708901.png "SPY")

## Introduction

The two datasets I have decided to deal with are the following:

* __Famous Political Leaders' Tweets__: Tweets from late 2016 through April 2018, from the following political leaders:
    1. Donald Trump
    2. Mike Pence
    3. Paul Ryan
  
    
* __Historical SPY Data__: The Price Range and other details of the SPY within the same time period are acquired using the Pandas Data Reader. The Price Range will be calculated by subtracting the __low__ price from the __high__ for a particular day.

Once we have all this data independently, we will group each dataset by __day__. If a political leader has tweeted multiple times in a day, the __average sentiment__ of that day's tweets is used. We will then convert both day-by-day datasets into Pandas DataFrames, join the SPY data with the tweet sentiment data, and save the result as a csv.

## Common Libraries

Let's start off by loading all the libraries used for the data acquisition.

In [1]:
# For Twitter 
import re
import tweepy
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# For SPY Data
from datetime import datetime
from pandas import DataFrame, to_numeric, to_datetime, Grouper, merge 
import pandas_datareader.data as dr

## Famous Political Leaders' Tweets
 
Our process to obtain the Political Leaders' Tweets is as follows:

* __Authentication__: We will first use the Tweepy library to authenticate ourselves to be able to use the API.  


* __Getting the Tweets__: We will be going as far back as early 2017 to grab our Tweets for each of the Famous Political Leaders we have decided to get data from.  
 

* __Sentiment Analysis of the Tweets__: We will use NLTK's Vader Sentiment Analysis model to compute a sentiment score for every tweet. Before scoring, each tweet is cleaned by removing any embedded hyperlinks and any RT prefixes that carry a username.  
 

* __Grouping Sentiment Score by Date__: Once the data is available and cleaned, we will group it by date, which produces an independent dataframe with the Date as the index and the average sentiment score as the only column.

### Authentication

Getting authenticated is fairly straightforward. We pass the Consumer key and Secret along with the Access token and Secret to Tweepy, which gives us back an OAuth handler that is later used to create the API object.

In [2]:
# Variables that contains the user credentials to access Twitter API 
# The keys were removed; the placeholders are kept for completeness' sake.
ACCESS_TOKEN        = 'Nothing'
ACCESS_TOKEN_SECRET = 'to'
CONSUMER_KEY        = 'see'
CONSUMER_SECRET     = 'here.'

def auth(): 
    '''Function responsible for sending the correct Authorization for Twitter.
    '''
    oauth = tweepy.OAuthHandler( CONSUMER_KEY, CONSUMER_SECRET )
    oauth.set_access_token( ACCESS_TOKEN, ACCESS_TOKEN_SECRET ) 
    return oauth
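
Hard-coded keys in a notebook are easy to leak, so one option is to read them from environment variables instead. The following is a minimal sketch and not part of the original pipeline: the `TWITTER_*` variable names and the `load_twitter_credentials` helper are assumptions made up for illustration.

```python
import os

# Hypothetical names: each key is looked up as TWITTER_<NAME> in the environment.
CREDENTIAL_NAMES = ( 'CONSUMER_KEY', 'CONSUMER_SECRET',
                     'ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET' )

def load_twitter_credentials( env = os.environ ):
    '''Return the four Twitter credentials from a mapping, failing loudly if any is missing.'''
    missing = [ n for n in CREDENTIAL_NAMES if not env.get( f'TWITTER_{n}' ) ]
    if missing:
        raise RuntimeError( f'Missing Twitter credentials: {missing}' )
    return { n: env[ f'TWITTER_{n}' ] for n in CREDENTIAL_NAMES }

# Demonstration with a stand-in mapping instead of the real environment.
demo_env = { f'TWITTER_{n}': 'placeholder' for n in CREDENTIAL_NAMES }
credentials = load_twitter_credentials( demo_env )
print( sorted( credentials ) )
```

The returned values could then be handed to `tweepy.OAuthHandler` and `set_access_token` in the same way `auth()` uses the module-level constants.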

### Getting the Tweets

The tweet acquisition step fetches tweets from the API in batches of 200, using the id of the oldest tweet seen so far to keep track of __where__ the next batch should start. Our function takes a screen name and gets all the tweets from that user's timeline, including Retweets. The API returns quite a loaded object for each tweet, from which we only need the text and the date for the next step in the pipeline.

In [3]:
def get_all_tweets( screen_name ):
    '''Function gets all the associated tweets linked with a Screen Name'''
    api = tweepy.API( auth() )
    all_tweets = []
    print( f"Getting Tweets for: {screen_name}") 

    # Get the 200 Most Recent Tweets.
    new_tweets = api.user_timeline( screen_name = screen_name, 
                                    count       = 200 )
    all_tweets.extend( new_tweets )
    
    # Save the id of the oldest tweet less one
    oldest_tweet_id = all_tweets[ -1 ].id - 1

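    # Keep paging back until the API returns no more tweets. The timeline
    # endpoint only serves roughly the 3,200 most recent tweets of a user.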
    while len( new_tweets ) > 0:
        
        # All Subsequent requests use the max_id param to prevent duplicates
        new_tweets = api.user_timeline( screen_name = screen_name, 
                                        count       = 200,
                                        max_id      = oldest_tweet_id, 
                                        tweet_mode  = 'extended' ) 
        
        all_tweets.extend( new_tweets )
        
        # Update the id of the oldest tweet less one
        oldest_tweet_id = all_tweets[ -1 ].id - 1
        
        print( f'{ len(all_tweets)} tweets downloaded so far.') 
        
    return all_tweets


#### Testing the Function with Donald Trump's Tweets

In [4]:
donald_tweets = get_all_tweets( 'realDonaldTrump' )

Getting Tweets for: realDonaldTrump
400 tweets downloaded so far.
600 tweets downloaded so far.
800 tweets downloaded so far.
997 tweets downloaded so far.
1194 tweets downloaded so far.
1394 tweets downloaded so far.
1593 tweets downloaded so far.
1793 tweets downloaded so far.
1993 tweets downloaded so far.
2193 tweets downloaded so far.
2393 tweets downloaded so far.
2592 tweets downloaded so far.
2792 tweets downloaded so far.
2992 tweets downloaded so far.
3192 tweets downloaded so far.
3207 tweets downloaded so far.
3207 tweets downloaded so far.
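
To make the `max_id` paging logic easier to see in isolation, here is a small sketch that runs without network access. The `fake_user_timeline` function and `FAKE_TIMELINE` list are hypothetical stand-ins for `api.user_timeline`, and `fetch_all` is an illustrative variant of the loop above, not the function used in this notebook. Unlike `get_all_tweets`, it checks for an empty page before reading the last element, so an empty timeline does not raise an `IndexError`.

```python
# Hypothetical timeline of 450 tweets, newest (highest id) first.
FAKE_TIMELINE = [ { 'id': i } for i in range( 450, 0, -1 ) ]

def fake_user_timeline( count, max_id = None ):
    '''Stand-in for api.user_timeline: newest tweets with id <= max_id.'''
    eligible = [ t for t in FAKE_TIMELINE
                 if max_id is None or t[ 'id' ] <= max_id ]
    return eligible[ :count ]

def fetch_all( fetch_page, page_size = 200 ):
    '''Walk backwards through a timeline, one page at a time.'''
    collected = []
    max_id    = None
    while True:
        page = fetch_page( count = page_size, max_id = max_id )
        if not page:
            break
        collected.extend( page )
        # Only ask for tweets strictly older than the oldest one seen so far.
        max_id = page[ -1 ][ 'id' ] - 1
    return collected

tweets = fetch_all( fake_user_timeline )
print( len( tweets ) )  # 450
```

Subtracting one from the oldest id is what prevents the boundary tweet from being returned twice, since `max_id` is inclusive.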


### Sentiment Analysis of the Tweets

This process starts by cleaning the text of each tweet: URLs, non-alphanumeric characters and RT metadata embedded in the tweet are stripped off. The Vader SentimentIntensityAnalyzer is then run on the cleaned text and returns a dictionary of polarity scores. From this dictionary we only use the __compound__ score, which combines the positive, neutral and negative scores into a single number ranging from -1 to 1.  

Our final output is a dataframe consisting of 2 columns: __date__ i.e. date of the tweet and __Sentiment Score__.   

__Helper Functions__ 

In [5]:
sid = SentimentIntensityAnalyzer()

def analyze_sentiment( input_tweet ):
    '''Function returns the polarity scores for the input cleaned tweet.'''
    return sid.polarity_scores( input_tweet )

def clean_tweet( input_tweet ):
    '''Function cleans the tweet by removing non-alphanumeric characters, hyperlinks, 
       RT metadata etc. 
    '''
    # Raw string so that \w and \S reach the regex engine without escape warnings.
    return ' '.join(re.sub( r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|RT @[\w_]+:",
                            " ", 
                            input_tweet ).split() )

def get_tweet_text( input_tweet ):
    '''Function gets the clean version of the text associated with a tweet. 
    '''
    text = ""

    if hasattr( input_tweet, 'full_text' ):
        return clean_tweet( input_tweet.full_text )

    elif hasattr( input_tweet, 'fulltext' ):
        return clean_tweet( input_tweet.fulltext )

    elif hasattr( input_tweet, 'text' ):
        return clean_tweet( input_tweet.text )

    else:
        return None
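
As a quick check of what the cleaning step does, the sketch below applies the same regular expression to a made-up tweet. The sample text is invented for illustration, and the pattern is precompiled here only for brevity.

```python
import re

# Same pattern as clean_tweet: mentions, non-alphanumerics, URLs and RT prefixes.
TWEET_NOISE = re.compile( r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|RT @[\w_]+:" )

def clean_tweet_demo( input_tweet ):
    '''Replace every noisy match with a space, then collapse the whitespace.'''
    return ' '.join( TWEET_NOISE.sub( ' ', input_tweet ).split() )

sample  = 'RT @WhiteHouse: Great jobs report! https://t.co/abc123 #MAGA'
cleaned = clean_tweet_demo( sample )
print( cleaned )  # Great jobs report MAGA
```

Note that the hashtag symbol is removed while the hashtag's word is kept, so it still contributes to the sentiment score.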

__Cleaning and Transforming Tweets into a Dataframe__

In [6]:
def transform_tweets( input_tweets, sentiment_score_col_name ):
    '''Function transforms the tweet text into a sentiment score and saves 
       it in a dataframe
    '''
    output_tweets = [] 
        
    for t in input_tweets:
        compound_value = 0
        text           = get_tweet_text( t )
        
        # Ignore cases where the text is not found.
        if text is None:
            print( f'No text found for: User: {t.user.name} Tweet @ {t.created_at}')
            continue
        else:
            compound_value = analyze_sentiment( text )[ 'compound' ] 
            
        output_tweets.append( [ t.created_at.date(), compound_value ])
        
    df = DataFrame( data    = output_tweets, 
                    columns = [ "date",  
                                sentiment_score_col_name ] )
    df[ 'date' ] = to_datetime( df[ 'date' ])
    
    return df

#### Testing the Transformation Functionality

In [7]:
donald_tweets_df = transform_tweets( donald_tweets, 
                                     "Sentiment Score" ) 
donald_tweets_df.head()

| | date | Sentiment Score |
|---|---|---|
| 0 | 2018-04-29 | 0.6444 |
| 1 | 2018-04-29 | 0.5719 |
| 2 | 2018-04-29 | 0.9042 |
| 3 | 2018-04-28 | 0.296 |
| 4 | 2018-04-28 | -0.0083 |


### Sentiment Score Grouped by Date

Once we have a score for every tweet, we need to squash the multiple sentiment scores that fall on the same day into a single value, so that each row can be keyed off the date only. We do this by taking the mean of the compound scores of all the tweets posted on a given day.

In [8]:
def create_df_grouped_by_date( tweets_df ):
    '''Function creates a new data frame with date as the index and the sentiment score 
       as a column.
    '''
    return tweets_df.groupby( Grouper( 'date' )).mean()

#### Testing the Grouping Logic

In [9]:
donald_sentiment_score_df = create_df_grouped_by_date( donald_tweets_df )
donald_sentiment_score_df.head()

| date | Sentiment Score |
|---|---|
| 2018-04-29 | 0.706833 |
| 2018-04-28 | 0.24195 |
| 2018-04-27 | 0.322336 |
| 2018-04-26 | 0.5372 |
| 2018-04-25 | 0.3394 |
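
The daily mean can be verified by hand with the five rows shown earlier by `donald_tweets_df.head()`. This is a small sketch that uses a plain `groupby( 'date' )` instead of `Grouper` for brevity. Only the value for 2018-04-29 is expected to match the table above, because `head()` shows just part of the tweets from 2018-04-28.

```python
import pandas as pd

# The five rows printed by donald_tweets_df.head() earlier in this notebook.
sample_df = pd.DataFrame( {
    'date': pd.to_datetime( [ '2018-04-29', '2018-04-29', '2018-04-29',
                              '2018-04-28', '2018-04-28' ] ),
    'Sentiment Score': [ 0.6444, 0.5719, 0.9042, 0.296, -0.0083 ],
} )

# One row per day, holding the mean of that day's compound scores.
daily_df = sample_df.groupby( 'date' ).mean()

apr_29 = float( daily_df.loc[ pd.Timestamp( '2018-04-29' ), 'Sentiment Score' ] )
print( round( apr_29, 6 ) )  # 0.706833
```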


We have successfully forged the first few functions needed to easily get us the Twitter sentiment data. Our next step is to acquire data of the SPY and then merge the two dataframes together.

## Historical SPY Data

Getting the SPY data is fairly easy: we will use the Pandas Data Reader to download prices from the end of 2016 to the present day.

The process will consist of:
* Acquiring the Historical SPY Data
* Adding the Price Range to the Data frame

### Acquiring the Historical SPY Data

In [10]:
def download_historical_prices_for_instrument( ticker ):
    '''Function gets the historical data of a specified ticker.'''
    try:
        now_time         = datetime.now()
        print(f"Getting historical data for: {ticker}")
        start_time       = datetime(2016, 12 , 30)
        stock_df         = dr.DataReader( ticker,'iex', start_time, now_time)
        stock_df['Name'] = ticker
        return stock_df
    
    except Exception as e:
        print(f'Unable to get data for: {ticker} because of Error: {e} ')
        
spy_df = download_historical_prices_for_instrument( 'SPY' )
spy_df.head()

Getting historical data for: SPY
2y


| date | open | high | low | close | volume | Name |
|---|---|---|---|---|---|---|
| 2016-12-30 | 219.5765 | 219.6742 | 217.6223 | 218.404 | 108998328 | SPY |
| 2017-01-03 | 219.8794 | 220.6512 | 218.7496 | 220.0748 | 91366522 | SPY |
| 2017-01-04 | 220.4461 | 221.5501 | 220.4363 | 221.384 | 78744433 | SPY |
| 2017-01-05 | 221.0812 | 221.384 | 220.3093 | 221.2082 | 78379012 | SPY |
| 2017-01-06 | 221.3352 | 222.5272 | 220.7196 | 221.9996 | 71559922 | SPY |


### Adding the Price Range in the DataFrame

The price range is simply the high price of the day minus the low price and is added as a new column to the Dataframe. 

In [11]:
spy_df['range'] = ( to_numeric( spy_df['high'] ) - 
                    to_numeric( spy_df['low'] ))
spy_df.sort_index( ascending=False, inplace= True ) 
spy_df.index = to_datetime( spy_df.index )
spy_df.head()

| date | open | high | low | close | volume | Name | range |
|---|---|---|---|---|---|---|---|
| 2018-04-27 | 267.0 | 267.34 | 265.5 | 266.56 | 57053647 | SPY | 1.84 |
| 2018-04-26 | 264.79 | 267.2452 | 264.29 | 266.31 | 67731942 | SPY | 2.9552 |
| 2018-04-25 | 262.91 | 264.13 | 260.85 | 263.63 | 103756753 | SPY | 3.28 |
| 2018-04-24 | 267.73 | 267.9762 | 261.28 | 262.98 | 112885452 | SPY | 6.6962 |
| 2018-04-23 | 267.26 | 267.89 | 265.35 | 266.57 | 65557954 | SPY | 2.54 |


##  Combining the Two DataFrames

Now that we have the two data sets we wanted, our next step is to merge them into one data frame. We will use the __merge__ function from Pandas to "inner join" our Instrument dataframe with the Sentiment Score dataframe on the Date index. 

In [12]:
def combine_instrument_and_sentiment_dfs( instrument_df, 
                                          sentiment_score_df ):
    '''Function returns a merged dataframe of the historical instrument information
       and the sentiment score data frame.
    '''
    return merge( instrument_df, sentiment_score_df, 
                  how='inner', 
                  left_index=True, 
                  right_index=True )

#### Testing the Combination of the Dataframe

In [13]:
donald_combined_df = combine_instrument_and_sentiment_dfs( spy_df,
                                                           donald_sentiment_score_df )
donald_combined_df.head()

| date | open | high | low | close | volume | Name | range | Sentiment Score |
|---|---|---|---|---|---|---|---|---|
| 2018-04-27 | 267.0 | 267.34 | 265.5 | 266.56 | 57053647 | SPY | 1.84 | 0.322336 |
| 2018-04-26 | 264.79 | 267.2452 | 264.29 | 266.31 | 67731942 | SPY | 2.9552 | 0.5372 |
| 2018-04-25 | 262.91 | 264.13 | 260.85 | 263.63 | 103756753 | SPY | 3.28 | 0.3394 |
| 2018-04-24 | 267.73 | 267.9762 | 261.28 | 262.98 | 112885452 | SPY | 6.6962 | 0.305412 |
| 2018-04-23 | 267.26 | 267.89 | 265.35 | 266.57 | 65557954 | SPY | 2.54 | -0.2659 |
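
One consequence of the inner join is worth spelling out: days that appear in only one of the two dataframes are dropped. The sketch below rebuilds a tiny version of both frames from values shown earlier in this notebook (only the `range` column is kept for SPY) and shows that the sentiment rows for 2018-04-28 and 2018-04-29, a Saturday and a Sunday with no SPY prices, do not survive the merge.

```python
import pandas as pd

# Two trading days of SPY ranges, taken from the table above.
spy_sample = pd.DataFrame(
    { 'range': [ 1.84, 2.9552 ] },
    index = pd.to_datetime( [ '2018-04-27', '2018-04-26' ] ) )

# Four days of averaged sentiment, including a weekend.
sentiment_sample = pd.DataFrame(
    { 'Sentiment Score': [ 0.706833, 0.24195, 0.322336, 0.5372 ] },
    index = pd.to_datetime( [ '2018-04-29', '2018-04-28',
                              '2018-04-27', '2018-04-26' ] ) )

combined_sample = pd.merge( spy_sample, sentiment_sample,
                            how = 'inner',
                            left_index = True,
                            right_index = True )

# Dates that had a sentiment score but no matching SPY row.
dropped_dates = sentiment_sample.index.difference( combined_sample.index )
print( len( combined_sample ) )            # 2
print( list( dropped_dates.day_name() ) )  # ['Saturday', 'Sunday']
```

If weekend sentiment should be kept, for example to study its effect on the next trading day, a different join strategy would be needed; that is outside the scope of this notebook.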


### Saving Combined Dataframe

In [14]:
def save_combined_df( df_to_save, file_name ):
    '''Function saves the dataframe as a csv to the specified filepath
    '''
    df_to_save.to_csv( f'./data/{file_name}.csv' )

#### Testing the Saving of the Combined Dataframe

In [15]:
save_combined_df( donald_combined_df, 'realDonaldTrump' )
!ls ./data 

realDonaldTrump.csv
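
`save_combined_df` assumes that the `./data` directory already exists. A small sketch of one way to create it on demand is shown below; the `csv_path_for` helper is hypothetical and not part of the original notebook, and the demonstration writes into a temporary directory so that it leaves nothing behind.

```python
import tempfile
from pathlib import Path

def csv_path_for( file_name, data_dir = './data' ):
    '''Create data_dir if needed and return the csv path inside it.'''
    directory = Path( data_dir )
    directory.mkdir( parents = True, exist_ok = True )
    return directory / f'{file_name}.csv'

# Demonstration inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    target      = csv_path_for( 'realDonaldTrump', Path( tmp ) / 'data' )
    dir_created = target.parent.is_dir()

print( target.name, dir_created )  # realDonaldTrump.csv True
```

The returned path could be passed straight to `df_to_save.to_csv( ... )`.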


## Generalizing The Pipeline

Now that we have all the independent and modular functions in place, we are going to run this pipeline for all 3 of our chosen political leaders, which gives us a separate csv for each of them. 

In [16]:
political_twitter_handles = [ 'realDonaldTrump', 'SpeakerRyan', 'VP' ]

def create_combined_df( screen_name, should_persist = False ):
    '''Function creates a combined data frame of the screen name and SPY Historical Data
    '''
    try:
        all_tweets         = get_all_tweets( screen_name )
        transformed_tweets = transform_tweets( all_tweets, 
                                               f"{screen_name} Vader Sentiment Score" )
        sentiment_score_df = create_df_grouped_by_date( transformed_tweets )
        combined_df        = combine_instrument_and_sentiment_dfs( spy_df, 
                                                                   sentiment_score_df )
        if ( should_persist ):
            save_combined_df( combined_df, screen_name )
        return combined_df
    
    except Exception as e :
        print ( f'Error while getting the sentiment score for: {screen_name}: {e}')

def get_and_save_sentiment_score_for_all():
    return {  p : create_combined_df( p, True ) for p in political_twitter_handles }
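
Because `create_combined_df` catches its exceptions and then falls through without a `return`, a failed handle ends up with `None` as its value in the resulting dictionary. The sketch below shows one possible way to separate the successes from the failures; `split_pipeline_results` is a hypothetical helper, and the strings stand in for real dataframes.

```python
def split_pipeline_results( results ):
    '''Split a {handle: dataframe-or-None} dict into successes and failed handles.'''
    succeeded = { name: df for name, df in results.items() if df is not None }
    failed    = [ name for name, df in results.items() if df is None ]
    return succeeded, failed

# Placeholder values instead of real combined dataframes.
example_results = { 'realDonaldTrump': 'combined dataframe placeholder',
                    'SpeakerRyan'    : None,
                    'VP'             : 'combined dataframe placeholder' }

succeeded, failed = split_pipeline_results( example_results )
print( sorted( succeeded ) )  # ['VP', 'realDonaldTrump']
print( failed )               # ['SpeakerRyan']
```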

### Getting and Saving All Tweets 

Let's run our entire pipeline for all the decided political leaders. 

In [17]:
all_sentiments = get_and_save_sentiment_score_for_all()

Getting Tweets for: realDonaldTrump
400 tweets downloaded so far.
600 tweets downloaded so far.
800 tweets downloaded so far.
997 tweets downloaded so far.
1194 tweets downloaded so far.
1394 tweets downloaded so far.
1593 tweets downloaded so far.
1793 tweets downloaded so far.
1993 tweets downloaded so far.
2193 tweets downloaded so far.
2393 tweets downloaded so far.
2592 tweets downloaded so far.
2792 tweets downloaded so far.
2992 tweets downloaded so far.
3192 tweets downloaded so far.
3207 tweets downloaded so far.
3207 tweets downloaded so far.
Getting Tweets for: SpeakerRyan
400 tweets downloaded so far.
600 tweets downloaded so far.
800 tweets downloaded so far.
1000 tweets downloaded so far.
1199 tweets downloaded so far.
1399 tweets downloaded so far.
1599 tweets downloaded so far.
1799 tweets downloaded so far.
1999 tweets downloaded so far.
2199 tweets downloaded so far.
2399 tweets downloaded so far.
2598 tweets downloaded so far.
2798 tweets downloaded so far.
2998 twee