# Data Acquisition

## Introduction

The two datasets we will be dealing with are the following:

1. __Trump's Tweets__: Starting from late 2016 till April 2018. 
2. __S&P Movements__: The movements of the S&P within the same time period. 

This notebook deals with acquiring and transforming __Trump's Tweets__. 

## Common Libraries

In [25]:
# For Trump's Tweets
import re
import tweepy
import csv
from nltk.sentiment.vader import SentimentIntensityAnalyzer    

# For S&P 500 Data
from pandas import DataFrame
import pandas_datareader.data as dr

## Trump's Tweets Over Time

In [57]:
# Variables that contains the user credentials to access Twitter API 
ACCESS_TOKEN        = 'Nothing'
ACCESS_TOKEN_SECRET = 'to'
CONSUMER_KEY        = 'see'
CONSUMER_SECRET     = 'here.'

### Authentication 

In [4]:
def auth(): 
	oauth = tweepy.OAuthHandler( CONSUMER_KEY, CONSUMER_SECRET )
	oauth.set_access_token( ACCESS_TOKEN, ACCESS_TOKEN_SECRET ) 
	return oauth

### Getting All Tweets for Trump

In [5]:
def get_all_tweets( screen_name, total_tweets = 4000 ):
    api = tweepy.API( auth() )
    all_tweets = []

    # Get the 200 Most Recent Tweets.
    new_tweets = api.user_timeline( screen_name = screen_name, 
                                    count       = 200 )
    all_tweets.extend( new_tweets )
    
    # Save the id of the oldest tweet less one
    oldest_tweet_id = all_tweets[ -1 ].id - 1
    
    tweet_counter = 0

    # Let's get the most recent 4000 tweets. 
    while tweet_counter < total_tweets:
        
        # All Subsequent requests use the max_id param to prevent duplicates
        new_tweets = api.user_timeline( screen_name = screen_name, 
                                        count       = 200,
                                        max_id      = oldest_tweet_id, 
                                        tweet_mode  = 'extended' ) 
        tweet_counter += 200
        
        all_tweets.extend( new_tweets )
        
        # Update the id of the oldest tweet less one
        oldest = all_tweets[ -1 ].id - 1
        
        print( f'{ len(all_tweets)} tweets downloaded so far.') 
        
    return all_tweets

donald_tweets = get_all_tweets( 'realDonaldTrump' )

400 tweets downloaded so far.
600 tweets downloaded so far.
800 tweets downloaded so far.
1000 tweets downloaded so far.
1200 tweets downloaded so far.
1400 tweets downloaded so far.
1600 tweets downloaded so far.
1800 tweets downloaded so far.
2000 tweets downloaded so far.
2200 tweets downloaded so far.
2400 tweets downloaded so far.
2600 tweets downloaded so far.
2800 tweets downloaded so far.
3000 tweets downloaded so far.
3200 tweets downloaded so far.
3400 tweets downloaded so far.
3600 tweets downloaded so far.
3800 tweets downloaded so far.
4000 tweets downloaded so far.
4200 tweets downloaded so far.


In [6]:
import pprint

# Let's print the text of the first 5 tweets.
pprint.pprint( [ t.text  for t in donald_tweets[ : 5 ]] )

['Looks like OPEC is at it again. With record amounts of Oil all over the '
 'place, including the fully loaded ships at… https://t.co/l5MMjmtI14',
 'Nancy Pelosi is going absolutely crazy about the big Tax Cuts given to the '
 'American People by the Republicans...got… https://t.co/0REgmJNMqT',
 'So exciting! I have agreed to be the Commencement Speaker at our GREAT Naval '
 'Academy on May 25th in Annapolis, Mary… https://t.co/L9iZ6RS3ft',
 'So General Michael Flynn’s life can be totally destroyed while Shadey James '
 'Comey can Leak and Lie and make lots of… https://t.co/q1lyKyyeYI',
 'James Comey Memos just out and show clearly that there was NO COLLUSION and '
 'NO OBSTRUCTION. Also, he leaked classif… https://t.co/YfMYBrTkza']


### Tranforming the Tweets Using Sentiment Analysis via NLTK 

__Helper Functions__ 

In [39]:
sid = SentimentIntensityAnalyzer()

def analyze_sentiment( input_tweet ):
    return sid.polarity_scores( input_tweet )

def clean_tweet( input_tweet ):
    return ' '.join(re.sub( "(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)",
                            " ", 
                            input_tweet ).split() )

def get_tweet_text( input_tweet ):
    text = ""

    if hasattr( input_tweet, 'full_text' ):
        return clean_tweet( input_tweet.full_text )

    elif hasattr( input_tweet, 'fulltext' ):
        return clean_tweet( input_tweet.fulltext )

    elif hasattr( input_tweet, 'text' ):
        return clean_tweet( input_tweet.text )

    else:
        return None

__Tranformation Logic__

In [56]:
def transform_tweets( input_tweets ):
    output_tweets = [] 
        
    for t in donald_tweets:
        compound_value = 0
        text           = get_tweet_text( t )
        
        if ( text == None ):
            print( f'No text found for: User: {t.user.name} Tweet @ {t.created_at}')
            continue
        else:
             compound_value = analyze_sentiment( text )[ 'compound' ] 
            
        output_tweets.append( [ t.created_at, text, compound_value ])
    return output_tweets
            
transformed_tweets = transform_tweets( donald_tweets ) 

### Saving the Tweets

In [55]:
def write_tweets( input_tweets, file_path ):

    with open( file_path, 'w' ) as csv_file:
        csv_writer = csv.writer( csv_file )
        csv_writer.writerow( ["Created At", "Cleaned Tweet", "Sentiment Score"])
        csv_writer.writerows( input_tweets )
        
write_tweets( transformed_tweets, '../../data/realDonaldTrump_tweets.csv' )