# Bitcoin sentiment analysis using Twitter

## Data generation

searchtweets API reference: https://twitterdev.github.io/search-tweets-python/  
Twitter API reference: https://developer.twitter.com/en/docs/tweets/search/api-reference/premium-search.html  
Twitter tweet object and dictionary: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object  

`~/.twitter_keys.yaml` contains `endpoint`, `consumer_key`, and `consumer_secret`.  
Change `yaml_key` to choose between data for the last 30 days (250 queries / month) and data going back to Twitter's inception in 2006 (50 queries / month):  
`yaml_key = "search_tweets_premium_30day"`  
`yaml_key = "search_tweets_premium_archive"`  


Each request made by a result stream counts as one query against the monthly quota.  
For example, if `results_per_call` is 100 and `max_results` is 1000, collecting the results uses 10 queries.  
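
The query arithmetic above can be sketched as a small helper. This is only an illustration: `queries_needed` is a hypothetical name, not part of searchtweets, and it assumes every page comes back full.

```python
import math

def queries_needed(max_results, results_per_call):
    """Number of API requests needed to page through max_results tweets."""
    # Each request returns at most results_per_call tweets, so round up.
    return math.ceil(max_results / results_per_call)

print(queries_needed(1000, 100))  # the example from the notes above
print(queries_needed(500, 100))   # the settings used in collect_tweets
```

A partially filled last page still costs a full query, which is why the result is rounded up.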

### Twitter metadata

 - Text
 - Date
 - User; user_name, user_screen_name, user_followers, user_friends, user_verified, user_language
 - Retweet_count
 - Favorite_count (likes)

#### Notes on metadata

* Retired `place.full_name` and `place.country`: of 120,000 tweets collected for 2018-09-16 to 2018-09-30, only 52 had a location.
* `user_verified` is likely unhelpful, though it may matter if the user is a developer.


In [94]:
from searchtweets import ResultStream, gen_rule_payload, load_credentials, collect_results

# general imports
import numpy as np
import pandas as pd
#import tweepy
from textblob import TextBlob
import re
import time
import datetime

# plotting and visualization
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [5]:
premium_search_args = load_credentials("~/.twitter_keys.yaml",
                                          yaml_key="search_tweets_premium_30day",
                                          env_overwrite=False)

Grabbing bearer token from OAUTH


In [229]:
def days_to_collect(start, end, frequency):
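    '''
    Return a list of timestamps (formatted YYYYmmddHHMM), spaced `frequency` hours apart, running
    from one step before midnight of the start date to the last step of the end date.
    start = start date, 'YYYY-mm-dd'
    end = end date, 'YYYY-mm-dd'
    frequency = number of hours to step by, as a string. For example, frequency = '12' collects twice per day: at midnight and noon
    '''
    # shift start back by one step and add one day to end to cover the border cases
    # closed='left' keeps the first timestamp and drops the extra midnight after the end date
    # (pandas >= 2.0 replaces closed= with inclusive='left')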
    #print(start, end)
    start = datetime.datetime.strptime(start, '%Y-%m-%d') - datetime.timedelta(days=0, hours=int(frequency))
    end = datetime.datetime.strptime(end, '%Y-%m-%d') + datetime.timedelta(days=1, hours=0)
    #print(start, end)
    dates = pd.date_range(start=start, end=end, freq=frequency+'H', closed='left')
    formatted_dates = [ datetime.datetime.strftime(t, '%Y%m%d%H%M') for t in dates ]
    #print(formatted_dates)
    return formatted_dates

In [235]:
test_dates = days_to_collect('2018-10-01', '2018-10-01', '12')
print("twitter recognized dates will be collected on the closed interval from", test_dates[1], "to", test_dates[-1])

twitter recognized dates will be collected on the closed interval from 201810010000 to 201810011200


In [236]:
len(test_dates)

3
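
The same timestamp grid can also be built with only the standard library, which sidesteps differences between pandas versions. The sketch below is an assumption-laden alternative, not the notebook's method: `collection_times` is a hypothetical name, and it takes the step as an integer number of hours rather than a string.

```python
from datetime import datetime, timedelta

def collection_times(start, end, frequency_hours):
    """Timestamps every frequency_hours, from one step before midnight of
    start up to (but excluding) midnight after end."""
    step = timedelta(hours=frequency_hours)
    current = datetime.strptime(start, "%Y-%m-%d") - step
    stop = datetime.strptime(end, "%Y-%m-%d") + timedelta(days=1)  # exclusive
    times = []
    while current < stop:
        times.append(current.strftime("%Y%m%d%H%M"))
        current += step
    return times

times = collection_times("2018-10-01", "2018-10-01", 12)
# Pair neighbouring timestamps into (from_date, to_date) windows.
windows = list(zip(times[:-1], times[1:]))
print(windows)
```

Pairing neighbouring timestamps gives the `(from_date, to_date)` windows that the collection loop iterates over.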

In [237]:
def collect_tweets(from_date, to_date):
    # maxResults is capped at 100 per request for a sandbox account. Although a "next" token should be
    # needed to get more, max_results=500 appears to be accepted without any extra work
    # date format: YYYY-mm-DD HH:MM
    # from_date is inclusive, to_date is exclusive. Collection appears to start at to_date and work
    # backwards towards from_date
    bitcoin_rule = gen_rule_payload("bitcoin", results_per_call=100, from_date=from_date, to_date=to_date) 
    print(bitcoin_rule)
    collected_tweets = collect_results(bitcoin_rule, max_results=500, result_stream_args=premium_search_args)
    return collected_tweets

In [238]:
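tweets = []
for i in range(len(test_dates) - 1):
    # each neighbouring pair of timestamps is one (from_date, to_date) window
    tweets.extend(collect_tweets(test_dates[i], test_dates[i+1]))
    # pause for a minute every 8 windows to stay under the rate limit
    if i % 8 == 0 and i != 0:
        print("waiting 60 seconds")
        time.sleep(60)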

{"query": "bitcoin", "maxResults": 100, "toDate": "201810010000", "fromDate": "201809301200"}
{"query": "bitcoin", "maxResults": 100, "toDate": "201810011200", "fromDate": "201810010000"}


In [239]:
print(len(tweets), tweets[0]['created_at'], tweets[-1]['created_at'])

1000 Sun Sep 30 23:59:54 +0000 2018 Mon Oct 01 11:49:39 +0000 2018
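
To work with these `created_at` strings as real timestamps, they can be parsed with `datetime.strptime`. A minimal sketch, assuming an English locale for the day and month names; `parse_created_at` and `TWITTER_TIME_FORMAT` are hypothetical names introduced here.

```python
from datetime import datetime

# Format of the created_at field, e.g. 'Sun Sep 30 23:59:54 +0000 2018'
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def parse_created_at(created_at):
    """Convert a tweet's created_at string into a timezone-aware datetime."""
    return datetime.strptime(created_at, TWITTER_TIME_FORMAT)

# The two timestamps printed above
first = parse_created_at("Sun Sep 30 23:59:54 +0000 2018")
last = parse_created_at("Mon Oct 01 11:49:39 +0000 2018")
print(last - first)  # time between the two printed tweets
```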


In [252]:
tweets[500]

{'created_at': 'Mon Oct 01 11:59:59 +0000 2018',
 'id': 1046731449645371392,
 'id_str': '1046731449645371392',
 'text': 'RT @CryptoCoinsNews: Chinese Billionaire Bitcoin Investor ‘Done’ Investing in Blockchain Projects https://t.co/oEvz97g1Cn',
 'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
 'truncated': False,
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'in_reply_to_screen_name': None,
 'user': {'id': 2596584516,
  'id_str': '2596584516',
  'name': 'Alphidius 🇺🇸  🇺🇸 🇺🇸 🇺🇸 🇺🇸',
  'screen_name': 'AlfidioValera',
  'location': 'Nueva York, USA I♥love NYC! ',
  'url': None,
  'description': 'CEO and Community Manager of\nNavo SPA \nGOD BLESS THE U.S.A.! #MAGA🇺🇸 \n#SocialMedia #Libertarian\n#Blockchain #Bitcoin \n 🇺🇸  🇺🇸 🇺🇸 🇺🇸',
  'translator_type': 'none',
  'protected': False,
  'verified': False,
  'followers_count': 7226,
  'friends_count': 7911,


### counts and limitations

A trial that collected tweets containing the string 'bitcoin', working back from the current date until a maximum of 1000 tweets was reached, covered about 15 minutes of tweets. If the maximum number of tweets is increased, the collection eventually reaches back the full 30 days. Capturing data older than that requires the full archive endpoint. However, with only 50 requests per month, the dates have to be specified very finely to stay under the limit. In other words, once a month we can collect 25,000 tweets from the last 30 days, or 5,000 tweets from some earlier period. Collecting as many tweets from the full archive as from the 30-day endpoint requires a subscription of $225/month, and collecting over a million tweets would cost thousands of dollars.
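
The monthly budget follows directly from the quota figures quoted in these notes (250 or 50 queries per month, 100 tweets per query). The helpers below are a hypothetical sketch of that arithmetic, not part of any Twitter library.

```python
def monthly_tweet_budget(queries_per_month, results_per_call=100):
    """Upper bound on tweets per month if every query returns a full page."""
    return queries_per_month * results_per_call

def windows_per_month(queries_per_month, max_results, results_per_call=100):
    """How many (from_date, to_date) windows fit in the monthly quota."""
    # Ceiling division: a partially filled page still costs one query.
    queries_per_window = -(-max_results // results_per_call)
    return queries_per_month // queries_per_window

print(monthly_tweet_budget(250))  # 30-day endpoint
print(monthly_tweet_budget(50))   # full-archive endpoint
print(windows_per_month(250, max_results=500))
```

With `max_results=500` each window costs five queries, so the number of windows per month is the quota divided by five.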

## Sentiment Analysis

In [96]:
# create a pandas df from tweets
S2 = pd.DataFrame(columns=['tweets', 'date', 'user_name', 'user_screen_name', 'user_followers', 
                           'user_friends', 'user_verified', 'user_language', 'retweet_count', 'favorite_count'])

for i, tweet in enumerate(tweets):
    S2.loc[i] = [tweet['text'], 
                 tweet['created_at'], 
                 tweet['user']['name'], 
                 tweet['user']['screen_name'], 
                 tweet['user']['followers_count'], 
                 tweet['user']['friends_count'], 
                 tweet['user']['verified'], 
                 tweet['user']['lang'], 
                 tweet['retweet_count'], 
                 tweet['favorite_count']] 
S2_tweets = S2.loc[:,['tweets']]
S2_meta = S2.drop(['tweets'], axis=1)
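
Appending rows one at a time with `.loc` gets slow as the number of tweets grows. One possible alternative, sketched below, is to flatten each tweet into a dict and build the dataframe in a single call. `tweet_to_record` is a hypothetical helper, and the two sample tweets are made-up placeholders that only mimic the fields used above.

```python
import pandas as pd

def tweet_to_record(tweet):
    """Flatten one tweet dict into the columns used in this notebook."""
    user = tweet["user"]
    return {
        "tweets": tweet["text"],
        "date": tweet["created_at"],
        "user_name": user["name"],
        "user_screen_name": user["screen_name"],
        "user_followers": user["followers_count"],
        "user_friends": user["friends_count"],
        "user_verified": user["verified"],
        "user_language": user.get("lang"),  # tolerate a missing language
        "retweet_count": tweet["retweet_count"],
        "favorite_count": tweet["favorite_count"],
    }

# Placeholder tweets (not real data), shaped like the API response
sample_tweets = [
    {"text": "bitcoin example one", "created_at": "Mon Oct 01 11:59:59 +0000 2018",
     "user": {"name": "A", "screen_name": "a", "followers_count": 1,
              "friends_count": 2, "verified": False, "lang": "en"},
     "retweet_count": 0, "favorite_count": 0},
    {"text": "bitcoin example two", "created_at": "Mon Oct 01 11:59:58 +0000 2018",
     "user": {"name": "B", "screen_name": "b", "followers_count": 3,
              "friends_count": 4, "verified": True},
     "retweet_count": 1, "favorite_count": 2},
]

df = pd.DataFrame([tweet_to_record(t) for t in sample_tweets])
print(df.shape)
```

The column order follows the order of the keys in the dict, so it matches the `columns` list used for `S2`.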

In [115]:
S2.tail()

| | tweets | created_at | user_name | user_screen_name | user_followers | user_friends | user_verified | user_language | retweet_count | favorite_count |
|---|---|---|---|---|---|---|---|---|---|---|
| 95 | @mrlbrmn83 It's inevitable. To much money to b... | Sat Sep 15 23:56:11 +0000 2018 | ArnoldSchwarzenegger | arnold_ltc | 325 | 24 | False | en | 0 | 1 |
| 96 | RT @francispouliot_: 10 years ago today Lehman... | Sat Sep 15 23:56:10 +0000 2018 | zach | UncleZachh | 520 | 647 | False | en | 0 | 0 |
| 97 | AML BitCoin cotizará en HitBTC Exchange #Bitco... | Sat Sep 15 23:56:10 +0000 2018 | SatoshiClub | SatoshiModels | 351 | 67 | False | es | 0 | 0 |
| 98 | AvailCom today not only supports these functio... | Sat Sep 15 23:56:07 +0000 2018 | Jeffrey the Great - #RAYSNetwork | JefftheGreat91 | 2543 | 4858 | False | en | 0 | 0 |
| 99 | RT @ErikVoorhees: Bitcoin: a digital currency ... | Sat Sep 15 23:56:07 +0000 2018 | KurtWallace | KurtWallace | 2700 | 1327 | False | en | 0 | 0 |


In [None]:
# save file to csv
S2_tweets.to_csv('tweets_2018-0916_2018-0930_Tweets.csv', index=False)
S2_meta.to_csv('tweets_2018-0916_2018-0930_Metadata.csv', index=False)

In [115]:
def clean_tweet(tweet):
    '''
    Utility function to clean the text in a tweet by removing 
    mentions, links, and special characters using regex.
    '''
    # raw string so that \w and \S reach the regex engine unchanged
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

def analize_sentiment(tweet):
    '''
    Utility function to classify the polarity of a tweet
    using textblob: 1 = positive, 0 = neutral, -1 = negative.
    
    textblob ships with a ready-made sentiment analyzer,
    so no model has to be trained here.
    
    Might want to train our own model
    '''
    analysis = TextBlob(clean_tweet(tweet))
    if analysis.sentiment.polarity > 0:
        return 1
    elif analysis.sentiment.polarity == 0:
        return 0
    else:
        return -1
    

def sentiment_analysis(S2):
    # We create a column with the result of the analysis
    # (the text column is named 'tweets' in the dataframe built above):
    S2['SA'] = np.array([ analize_sentiment(tweet) for tweet in S2['tweets'] ])
    
    # We construct lists with classified tweets:
    pos_tweets = [ tweet for tweet, sa in zip(S2['tweets'], S2['SA']) if sa > 0]
    neu_tweets = [ tweet for tweet, sa in zip(S2['tweets'], S2['SA']) if sa == 0]
    neg_tweets = [ tweet for tweet, sa in zip(S2['tweets'], S2['SA']) if sa < 0]

    # We print percentages:
    print("Percentage of positive tweets: {}%".format(len(pos_tweets)*100/len(S2['tweets'])))
    print("Percentage of neutral tweets: {}%".format(len(neu_tweets)*100/len(S2['tweets'])))
    print("Percentage of negative tweets: {}%".format(len(neg_tweets)*100/len(S2['tweets'])))

In [116]:
sentiment_analysis(S2)

Percentage of positive tweets: 37.037974683544306%
Percentage of neutral tweets: 52.32911392405063%
Percentage of negative tweets: 10.632911392405063%
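
The same percentages can be computed without building three separate lists, using `value_counts(normalize=True)` on the label column. This is a sketch with made-up labels, and `sentiment_percentages` is a hypothetical helper name.

```python
import pandas as pd

def sentiment_percentages(labels):
    """Share of positive (1), neutral (0) and negative (-1) labels, in percent."""
    shares = pd.Series(labels).value_counts(normalize=True) * 100
    # Fix the order and report 0.0 for any class that never occurs.
    return shares.reindex([1, 0, -1], fill_value=0.0)

labels = [1, 1, 0, -1, 0, 0, 1, 0]  # made-up labels for illustration
shares = sentiment_percentages(labels)
print(shares)
```

In the notebook, the input would be the `SA` column produced by `sentiment_analysis`.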


Because of rate limits, not all of the data could be gathered at once, which left gaps in the data. Running through the dates again and re-collecting the data gave continuous coverage for September 01-15.

In [88]:
"""
# Sep 01-00:00
S3_Date_0901A = pd.read_csv('tweets_2018-09-01-00_Date.csv', names=['Date'])
S3_Tweets_0901A = pd.read_csv('tweets_2018-09-01-00_Tweets.csv', names=['Tweets'])
# Sep 01-03
S3_Date_090103 = pd.read_csv('tweets_2018-09-01_2018-09-03_Date.csv', names=['Date'])
S3_Tweets_090103 = pd.read_csv('tweets_2018-09-01_2018-09-03_Tweets.csv', names=['Tweets'])
# Sep 03-21:00
S3_Date_0903A = pd.read_csv('tweets_2018-09-03-21_Date.csv', names=['Date'])
S3_Tweets_0903A = pd.read_csv('tweets_2018-09-03-21_Tweets.csv', names=['Tweets'])
# Sep 03-05
S3_Date_090305 = pd.read_csv('tweets_2018-09-03_2018-09-05_Date.csv', names=['Date'])
S3_Tweets_090305 = pd.read_csv('tweets_2018-09-03_2018-09-05_Tweets.csv', names=['Tweets'])
# Sep 15-21:00
S3_Date_0915A = pd.read_csv('tweets_2018-09-15-21_Date.csv', names=['Date'])
S3_Tweets_0915A = pd.read_csv('tweets_2018-09-15-21_Tweets.csv', names=['Tweets'])
# Sep 06-15
S3_Date_090615 = pd.read_csv('tweets_2018-09-06_2018-09-15_Date.csv', names=['Date'])
S3_Tweets_090615 = pd.read_csv('tweets_2018-09-06_2018-09-15_Tweets.csv', names=['Tweets'])

S3_A = pd.concat([S3_Tweets_0901A, S3_Date_0901A], axis=1)
S3_B = pd.concat([S3_Tweets_090103, S3_Date_090103], axis=1)
S3_C = pd.concat([S3_Tweets_0903A, S3_Date_0903A], axis=1)
S3_D = pd.concat([S3_Tweets_090305, S3_Date_090305], axis=1)
S3_E = pd.concat([S3_Tweets_0915A, S3_Date_0915A], axis=1)
S3_F = pd.concat([S3_Tweets_090615, S3_Date_090615], axis=1)
"""

In [70]:
#print(S3_A.head(), '\n', S3_B.head(), '\n', S3_C.head(), '\n', S3_D.head(), '\n', S3_E.head(), '\n', S3_F.head())

In [94]:
#S3 = pd.concat([S3_A, S3_B, S3_C, S3_D, S3_E, S3_F], axis=0)
#S3['Date'].to_csv('tweets_2018-08-01_2018-08-15_Date.csv', index=False)
#S3['Tweets'].to_csv('tweets_2018-08-01_2018-08-15_Tweets.csv', index=False)

### Summary so far

It's reasonable to assume that Twitter data is more interesting when viewed as a larger picture than as a collection centered around a single point in time. To get that picture, subsamples of Twitter data need to be gathered over a range of days. Tweets were gathered starting and ending on the dates listed below. The `from_date` is the listed day and the `to_date` is set to the next day. However, rate limits stop the collection early once 100 tweets have been gathered, so each three-hour step typically holds only a couple of minutes of tweets. Collecting 100 tweets at a time in this way is an efficient way to sample a fraction of Twitter data over a larger number of days.

- 1944 2018-9-1 00:00
- 2063 2018-9-15 21:00
 
Sentiment analysis uses TextBlob's built-in sentiment scoring. The data is stored in a dataframe called S2 and written to separate CSVs, one for the tweet text and one for the metadata (the text contains commas too, so rather than fight it, the two are kept separate), to be joined back into a dataframe for later use.
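
Commas in the tweet text do not necessarily require separate files, because `to_csv` quotes any field that contains the delimiter and `read_csv` undoes the quoting. The round trip below is a minimal sketch with made-up rows, using an in-memory buffer in place of a file.

```python
import io
import pandas as pd

# Made-up rows: one text with a comma, one with double quotes
df = pd.DataFrame({
    "tweets": ["Bitcoin is up, again", 'He said "hodl"'],
    "retweet_count": [0, 3],
})

buffer = io.StringIO()
df.to_csv(buffer, index=False)   # fields with commas or quotes get quoted
buffer.seek(0)
restored = pd.read_csv(buffer)   # quoting is reversed on read

print(restored["tweets"].tolist())
```

If the round trip holds for the real tweets, text and metadata could live in a single CSV.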