# Data Acquisition

## Introduction

We will work with the following two datasets:

1. __Famous Political Leaders' Tweets__: Tweets posted from late 2016 through April 2018, each scored for sentiment with VADER.
2. __SPY's Historical Price Ranges__: The daily price range (high minus low) of SPY over the same period.

## Common Libraries

In [1]:
# For Twitter 
import re
import tweepy
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# For SPY Data
from datetime import datetime
from pandas import DataFrame, to_numeric, to_datetime, Grouper, merge 
import pandas_datareader.data as dr

## 1. Famous Political Leaders' Tweets

### User Credentials from Twitter

In [19]:
# Variables that contain the user credentials for accessing the Twitter API
ACCESS_TOKEN        = 'Nothing'
ACCESS_TOKEN_SECRET = 'to'
CONSUMER_KEY        = 'see'
CONSUMER_SECRET     = 'here.'
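
Hard-coded credentials are easy to leak when a notebook is shared. A minimal sketch of one alternative, assuming the four values are exported as environment variables; the variable names and the `load_twitter_credentials` helper are hypothetical and not part of the original notebook:

```python
import os

# Hypothetical environment variable names; adjust them to your own setup.
CREDENTIAL_KEYS = ( 'TWITTER_ACCESS_TOKEN',
                    'TWITTER_ACCESS_TOKEN_SECRET',
                    'TWITTER_CONSUMER_KEY',
                    'TWITTER_CONSUMER_SECRET' )

def load_twitter_credentials( environ = None ):
    """Return a dict with all four credentials, raising if any is missing."""
    environ = os.environ if environ is None else environ
    missing = [ k for k in CREDENTIAL_KEYS if not environ.get( k ) ]
    if missing:
        raise KeyError( f"Missing environment variables: {', '.join( missing )}" )
    return { k : environ[ k ] for k in CREDENTIAL_KEYS }
```

Passing a plain dict as `environ` makes the helper easy to exercise without touching the real environment.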

### Authentication 

In [2]:
def auth(): 
    oauth = tweepy.OAuthHandler( CONSUMER_KEY, CONSUMER_SECRET )
    oauth.set_access_token( ACCESS_TOKEN, ACCESS_TOKEN_SECRET ) 
    return oauth

### Getting Tweets

In [3]:
def get_all_tweets( screen_name ):
    api = tweepy.API( auth() )
    all_tweets = []
    print( f"Getting Tweets for: {screen_name}") 

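    # Get the 200 most recent tweets (the maximum per request).
    # tweet_mode = 'extended' keeps the untruncated text in `full_text`.
    new_tweets = api.user_timeline( screen_name = screen_name, 
                                    count       = 200,
                                    tweet_mode  = 'extended' )
    all_tweets.extend( new_tweets )

    # Nothing to page through if the account returned no tweets.
    if not all_tweets:
        return all_tweets

    # Save the id of the oldest tweet less one
    oldest_tweet_id = all_tweets[ -1 ].id - 1

    # Keep requesting older tweets until a request comes back empty. The
    # timeline endpoint exposes only about 3,200 of a user's latest tweets.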
    while len( new_tweets ) > 0:
        
        # All Subsequent requests use the max_id param to prevent duplicates
        new_tweets = api.user_timeline( screen_name = screen_name, 
                                        count       = 200,
                                        max_id      = oldest_tweet_id, 
                                        tweet_mode  = 'extended' ) 
        
        all_tweets.extend( new_tweets )
        
        # Update the id of the oldest tweet less one
        oldest_tweet_id = all_tweets[ -1 ].id - 1
        
        print( f'{ len(all_tweets)} tweets downloaded so far.') 
        
    return all_tweets
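
The `max_id` paging used above can be tried without API access. The sketch below replaces the Twitter call with a made-up `fake_user_timeline` function that serves integer ids 450 down to 1; both helper names are hypothetical and only illustrate the loop, not tweepy's behavior:

```python
def fake_user_timeline( count, max_id = None ):
    """Stand-in for a timeline endpoint: ids 1..450, returned newest first."""
    newest = 450 if max_id is None else min( max_id, 450 )
    return list( range( newest, max( newest - count, 0 ), -1 ) )

def fetch_all_ids( fetch_page, count = 200 ):
    all_ids = []
    max_id  = None
    while True:
        page = fetch_page( count = count, max_id = max_id )
        if not page:
            break
        all_ids.extend( page )
        # Ask only for ids strictly older than the oldest one seen so far.
        max_id = page[ -1 ] - 1
    return all_ids

ids = fetch_all_ids( fake_user_timeline )
print( len( ids ), ids[ 0 ], ids[ -1 ] )
# 450 450 1
```

Subtracting one from the oldest id matters because `max_id` is inclusive: without it, each page would start with the last item of the previous page.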


### Testing the Function with Donald Trump's Tweets

In [5]:
donald_tweets = get_all_tweets( 'realDonaldTrump' )

Getting Tweets for: realDonaldTrump
400 tweets downloaded so far.
600 tweets downloaded so far.
800 tweets downloaded so far.
997 tweets downloaded so far.
1197 tweets downloaded so far.
1397 tweets downloaded so far.
1596 tweets downloaded so far.
1796 tweets downloaded so far.
1996 tweets downloaded so far.
2196 tweets downloaded so far.
2396 tweets downloaded so far.
2595 tweets downloaded so far.
2795 tweets downloaded so far.
2995 tweets downloaded so far.
3195 tweets downloaded so far.
3241 tweets downloaded so far.
3241 tweets downloaded so far.


### Transforming the Tweets

__Helper Functions__ 

In [6]:
sid = SentimentIntensityAnalyzer()

def analyze_sentiment( input_tweet ):
    return sid.polarity_scores( input_tweet )

def clean_tweet( input_tweet ):
    # Replace @mentions, non-alphanumeric characters and URLs with spaces,
    # then collapse the runs of whitespace this leaves behind.
    return ' '.join(re.sub( r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)",
                            " ", 
                            input_tweet ).split() )

def get_tweet_text( input_tweet ):
    # Prefer the untruncated text when the tweet object provides it.
    if hasattr( input_tweet, 'full_text' ):
        return clean_tweet( input_tweet.full_text )

    elif hasattr( input_tweet, 'fulltext' ):
        return clean_tweet( input_tweet.fulltext )

    elif hasattr( input_tweet, 'text' ):
        return clean_tweet( input_tweet.text )

    else:
        return None
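
To see what the cleaning step keeps and drops, the sketch below applies an equivalent pattern to two made-up strings. The `clean` helper and the sample inputs are illustrative only:

```python
import re

# Equivalent to the pattern in clean_tweet above, written as a raw string.
TWEET_NOISE = re.compile( r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)" )

def clean( text ):
    # Substitute each match with a space, then collapse repeated whitespace.
    return ' '.join( TWEET_NOISE.sub( ' ', text ).split() )

print( clean( "@realDonaldTrump Great meeting today! https://t.co/abc123 #MAGA" ) )
# Great meeting today MAGA
print( clean( "@foo_bar hi" ) )
# bar hi
```

The second call shows a limitation: the mention pattern does not allow underscores, so the part of a handle after an underscore survives as an ordinary word. Punctuation and non-ASCII characters are removed as well, which may affect how VADER scores emphasis such as exclamation marks.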

__Transformation Logic__

In [7]:
def transform_tweets( input_tweets, sentiment_score_col_name ):
    output_tweets = [] 
        
    for t in input_tweets:
        compound_value = 0
        text           = get_tweet_text( t )
        
        if text is None:
            print( f'No text found for: User: {t.user.name} Tweet @ {t.created_at}')
            continue
        else:
            compound_value = analyze_sentiment( text )[ 'compound' ] 
            
        output_tweets.append( [ t.created_at.date(), compound_value ])
        
    df = DataFrame( data    = output_tweets, 
                    columns = [ "date",  
                                sentiment_score_col_name ] )
    df[ 'date' ] = to_datetime( df[ 'date' ])
    
    return df

### Testing the Transformation Functionality

In [8]:
donald_tweets_df = transform_tweets( donald_tweets, 
                                     "Trump - Vader Sentiment Score" ) 
donald_tweets_df.head()

|   | date | Trump - Vader Sentiment Score |
|---|---|---|
| 0 | 2018-04-20 | 0.6187 |
| 1 | 2018-04-20 | 0.3612 |
| 2 | 2018-04-20 | -0.5984 |
| 3 | 2018-04-20 | 0.8931 |
| 4 | 2018-04-20 | -0.7089 |


### Sentiment Score Grouped by Date

In [9]:
def create_df_grouped_by_date( tweets_df ):
    return tweets_df.groupby( Grouper( 'date' )).mean()
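
Grouping by date collapses all tweets posted on the same calendar day into one row holding the mean compound score. A small sketch on toy data; the `score` column name and the values are made up for illustration:

```python
from pandas import DataFrame, Grouper, Timestamp, to_datetime

# Three toy tweets: two on 2018-04-20 and one on 2018-04-19.
toy = DataFrame( { 'date'  : to_datetime( [ '2018-04-20', '2018-04-20', '2018-04-19' ] ),
                   'score' : [ 0.5, -0.1, 0.3 ] } )

daily = toy.groupby( Grouper( key = 'date' ) ).mean()

# One row per day; the two scores for 2018-04-20 are averaged.
print( daily.loc[ Timestamp( '2018-04-20' ), 'score' ] )
print( daily.loc[ Timestamp( '2018-04-19' ), 'score' ] )
```

A plain mean weights every tweet equally, so a day with a single tweet and a day with fifty tweets each contribute exactly one value to the result.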

### Testing the Grouping Logic

In [10]:
donald_sentiment_score_df = create_df_grouped_by_date( donald_tweets_df )
donald_sentiment_score_df.head()

| date | Trump - Vader Sentiment Score |
|---|---|
| 2018-04-20 | -0.016867 |
| 2018-04-19 | 0.507986 |
| 2018-04-18 | 0.043463 |
| 2018-04-17 | 0.153062 |
| 2018-04-16 | 0.02345 |


## 2. SPY Historical Price Ranges

### Getting the Historical Data

In [11]:
def download_historical_prices_for_instrument( ticker ):
    try:
        now_time         = datetime.now()
        print(f"Getting historical data for: {ticker}")
        start_time       = datetime(2016, 12 , 30)
        stock_df         = dr.DataReader( ticker,'iex', start_time, now_time)
        stock_df['Name'] = ticker
        return stock_df
    
    except Exception as e:
        print(f'Unable to get data for: {ticker} because of Error: {e} ')
        
spy_df = download_historical_prices_for_instrument( 'SPY' )
spy_df.head()

Getting historical data for: SPY
2y


| date | open | high | low | close | volume | Name |
|---|---|---|---|---|---|---|
| 2016-12-30 | 219.5765 | 219.6742 | 217.6223 | 218.404 | 108998328 | SPY |
| 2017-01-03 | 219.8794 | 220.6512 | 218.7496 | 220.0748 | 91366522 | SPY |
| 2017-01-04 | 220.4461 | 221.5501 | 220.4363 | 221.384 | 78744433 | SPY |
| 2017-01-05 | 221.0812 | 221.384 | 220.3093 | 221.2082 | 78379012 | SPY |
| 2017-01-06 | 221.3352 | 222.5272 | 220.7196 | 221.9996 | 71559922 | SPY |


### Adding the Price Range in the DataFrame 

In [12]:
spy_df['range'] = ( to_numeric( spy_df['high'] ) - 
                    to_numeric( spy_df['low'] ))
spy_df.sort_index( ascending=False, inplace= True ) 
spy_df.index = to_datetime( spy_df.index )
spy_df.head()

| date | open | high | low | close | volume | Name | range |
|---|---|---|---|---|---|---|---|
| 2018-04-19 | 269.65 | 269.88 | 267.72 | 268.89 | 77655909 | SPY | 2.16 |
| 2018-04-18 | 270.69 | 271.3 | 269.87 | 270.39 | 57303857 | SPY | 1.43 |
| 2018-04-17 | 269.33 | 270.87 | 268.75 | 270.19 | 64682036 | SPY | 2.12 |
| 2018-04-16 | 267.0 | 268.2 | 266.07 | 267.33 | 63405287 | SPY | 2.13 |
| 2018-04-13 | 267.41 | 267.54 | 264.01 | 265.15 | 85079176 | SPY | 3.53 |


# Combining the Two DataFrames

In [13]:
def combine_instrument_and_sentiment_dfs( instrument_df, 
                                          sentiment_score_df ):
    return merge( instrument_df, sentiment_score_df, 
                  how='inner', 
                  left_index=True, 
                  right_index=True )
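
Because the merge is an inner join on the date index, only dates present in both frames survive. Days with tweets but no trading session (weekends, holidays, or dates past the end of the price data) are dropped. A toy sketch with made-up values and a hypothetical `score` column:

```python
from pandas import DataFrame, Timestamp, merge, to_datetime

# Price data covers two trading days.
prices = DataFrame( { 'range' : [ 2.16, 1.43 ] },
                    index = to_datetime( [ '2018-04-19', '2018-04-18' ] ) )

# Sentiment data covers one extra day that has no matching price row.
scores = DataFrame( { 'score' : [ -0.02, 0.51, 0.04 ] },
                    index = to_datetime( [ '2018-04-20', '2018-04-19', '2018-04-18' ] ) )

combined = merge( prices, scores,
                  how = 'inner',
                  left_index = True,
                  right_index = True )

print( len( combined ) )                              # 2
print( Timestamp( '2018-04-20' ) in combined.index )  # False
```

This matches the notebook's outputs: the sentiment frame starts at 2018-04-20, while the combined frame starts at 2018-04-19, the last date in the price data.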

### Testing the Combination of the DataFrames

In [14]:
donald_combined_df = combine_instrument_and_sentiment_dfs( spy_df,
                                                           donald_sentiment_score_df )
donald_combined_df.head()

| date | open | high | low | close | volume | Name | range | Trump - Vader Sentiment Score |
|---|---|---|---|---|---|---|---|---|
| 2018-04-19 | 269.65 | 269.88 | 267.72 | 268.89 | 77655909 | SPY | 2.16 | 0.507986 |
| 2018-04-18 | 270.69 | 271.3 | 269.87 | 270.39 | 57303857 | SPY | 1.43 | 0.043463 |
| 2018-04-17 | 269.33 | 270.87 | 268.75 | 270.19 | 64682036 | SPY | 2.12 | 0.153062 |
| 2018-04-16 | 267.0 | 268.2 | 266.07 | 267.33 | 63405287 | SPY | 2.13 | 0.02345 |
| 2018-04-13 | 267.41 | 267.54 | 264.01 | 265.15 | 85079176 | SPY | 3.53 | -0.256667 |


### Saving Combined Dataframe

In [15]:
def save_combined_df( df_to_save, file_name ):
    df_to_save.to_csv( f'../../data/{file_name}.csv' )
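
Writing the CSV fails if the target directory does not exist yet. One possible guard, sketched with a hypothetical `csv_path_for` helper that creates the directory before returning the file path:

```python
from pathlib import Path

def csv_path_for( file_name, data_dir ):
    """Create data_dir if needed and return the path of <file_name>.csv in it."""
    data_dir = Path( data_dir )
    data_dir.mkdir( parents = True, exist_ok = True )
    return data_dir / f'{file_name}.csv'
```

The returned path could then be passed to `to_csv` in place of the formatted string, for example `df_to_save.to_csv( csv_path_for( file_name, '../../data' ) )`.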

### Testing the Saving of the Combined Dataframe

In [16]:
save_combined_df( donald_combined_df, 'realDonaldTrump' )
!ls ../../data/

realDonaldTrump.csv


# Generalizing The Pipeline

In [17]:
# Source: https://www.davemanuel.com/the-most-popular-us-politicians-by-twitter-followers-163/
political_twitter_handles = [ 'realDonaldTrump',
                              'BarackObama',
                              'SenJohnMcCain',
                              'SpeakerBoehner',
                              'JoeBiden',
                              'SpeakerRyan',
                              'NancyPelosi',
                              'RepRonPaul',
                              'MicheleBachmann',
                              'JimDeMint',
                              'GOPLeader', # Kevin McCarthy 
                              'AllenWest',
                              'VP', # Mike Pence
                              'ThadMcCotter',
                              'jaredpolis',
                              'OrrinHatch',
                              'keithellison' ]

def create_combined_df( screen_name, should_persist = False ):
    try:
        all_tweets         = get_all_tweets( screen_name )
        transformed_tweets = transform_tweets( all_tweets, 
                                               f"{screen_name} Vader Sentiment Score" )
        sentiment_score_df = create_df_grouped_by_date( transformed_tweets )
        combined_df        = combine_instrument_and_sentiment_dfs( spy_df, 
                                                                   sentiment_score_df )
        if ( should_persist ):
            save_combined_df( combined_df, screen_name )
        return combined_df
    
    except Exception as e :
        print ( f'Error while getting the sentiment score for: {screen_name}: {e}')

def get_and_save_sentiment_score_for_all():
    return {  p : create_combined_df( p, True ) for p in political_twitter_handles }
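
Since `create_combined_df` catches every exception and then falls through without a return value, the dictionary built above can contain `None` for handles that failed (for example, accounts that no longer exist). A small sketch for separating the two cases; `split_results` is a hypothetical helper, not part of the original pipeline:

```python
def split_results( results ):
    """Separate handles that produced a result from those that returned None."""
    succeeded = { k : v for k, v in results.items() if v is not None }
    failed    = [ k for k, v in results.items() if v is None ]
    return succeeded, failed

# Toy input: strings stand in for DataFrames.
ok, failed = split_results( { 'handleA' : 'df_a', 'handleB' : None } )
print( list( ok ), failed )
# ['handleA'] ['handleB']
```

Checking the `failed` list after a run makes it clear which accounts need to be retried or removed from `political_twitter_handles`.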

### Getting and Saving All Tweets 

In [18]:
all_sentiments = get_and_save_sentiment_score_for_all()

Getting Tweets for: realDonaldTrump
400 tweets downloaded so far.
600 tweets downloaded so far.
800 tweets downloaded so far.
997 tweets downloaded so far.
1197 tweets downloaded so far.
1397 tweets downloaded so far.
1596 tweets downloaded so far.
1796 tweets downloaded so far.
1996 tweets downloaded so far.
2196 tweets downloaded so far.
2396 tweets downloaded so far.
2595 tweets downloaded so far.
2795 tweets downloaded so far.
2995 tweets downloaded so far.
3195 tweets downloaded so far.
3241 tweets downloaded so far.
3241 tweets downloaded so far.
Getting Tweets for: BarackObama
400 tweets downloaded so far.
600 tweets downloaded so far.
800 tweets downloaded so far.
1000 tweets downloaded so far.
1200 tweets downloaded so far.
1400 tweets downloaded so far.
1600 tweets downloaded so far.
1800 tweets downloaded so far.
2000 tweets downloaded so far.
2200 tweets downloaded so far.
2399 tweets downloaded so far.
2599 tweets downloaded so far.
2798 tweets downloaded so far.
2996 twee

1000 tweets downloaded so far.
1200 tweets downloaded so far.
1399 tweets downloaded so far.
1599 tweets downloaded so far.
1799 tweets downloaded so far.
1999 tweets downloaded so far.
2199 tweets downloaded so far.
2399 tweets downloaded so far.
2598 tweets downloaded so far.
2797 tweets downloaded so far.
2997 tweets downloaded so far.
3196 tweets downloaded so far.
3233 tweets downloaded so far.
3233 tweets downloaded so far.
