# Bitcoin sentiment analysis using Twitter

## Data generation

searchtweets API reference: https://twitterdev.github.io/search-tweets-python/  
Twitter API reference: https://developer.twitter.com/en/docs/tweets/search/api-reference/premium-search.html  
Twitter tweet object and dictionary: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object  

`~/.twitter_keys.yaml` contains `endpoint`, `consumer_key`, and `consumer_secret`.  
Change `yaml_key` to get data for the last 30 days (250 queries / month) or since Twitter's inception in 2006 (50 queries / month):  
`yaml_key = "search_tweets_premium_30day"`  
`yaml_key = "search_tweets_premium_archive"`  


Each API call made by a result stream counts as one query against the monthly quota.  
For example, if `results_per_call` is 100 and `max_results` is 1000, collecting the full result set takes 10 queries.  
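The query arithmetic can be sketched as a small helper. This is only an illustration: `queries_needed` is a hypothetical name, not part of searchtweets, and it assumes every call returns a full page of results.

```python
import math

def queries_needed(max_results, results_per_call):
    """Number of API calls needed to page through max_results tweets."""
    if max_results <= 0 or results_per_call <= 0:
        raise ValueError("both arguments must be positive")
    # A partially filled last page still costs one query, hence the ceiling.
    return math.ceil(max_results / results_per_call)

print(queries_needed(1000, 100))  # 10
```

With a quota of 250 queries per month, this makes it easy to check how many such streams fit before starting a collection run.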

In [2]:
from searchtweets import ResultStream, gen_rule_payload, load_credentials, collect_results

# general imports
import numpy as np
import pandas as pd
#import tweepy
from textblob import TextBlob
import re
import time

# plotting and visualization
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:
premium_search_args = load_credentials("~/.twitter_keys.yaml",
                                          yaml_key="search_tweets_premium_30day",
                                          env_overwrite=False)

Grabbing bearer token from OAUTH


In [4]:
# dates: every 3-hour timestamp in 2018, formatted "YYYY-m-d HH:MM"
months = np.arange(1,13)
days = np.arange(1,32)
# named `times` so the list does not shadow the `time` module (needed later for time.sleep)
times = [" 00:00", " 03:00", " 06:00", " 09:00", " 12:00", " 15:00", " 18:00", " 21:00"]
dates_extra = [ "2018-" + str(m) + "-" + str(d) + t for m in months for d in days for t in times ]
# drop calendar dates that do not exist (2018 is not a leap year)
spurious_dates = ['2018-2-29', '2018-2-30', '2018-2-31', '2018-4-31', '2018-6-31', '2018-9-31', '2018-11-31']
spurious_dates = [ d + t for d in spurious_dates for t in times ]
dates = [d for d in dates_extra if d not in spurious_dates]
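As an alternative sketch, the same list of three-hour timestamps can be generated with the standard library, which avoids having to filter out non-existent dates by hand. The function name `three_hour_timestamps` is hypothetical, and unlike the notebook's list the output is zero-padded (`2018-09-01` rather than `2018-9-1`).

```python
from datetime import datetime, timedelta

def three_hour_timestamps(year=2018, step_hours=3):
    """Return every timestamp of the year at step_hours spacing, as 'YYYY-mm-dd HH:MM'."""
    current = datetime(year, 1, 1)
    end = datetime(year + 1, 1, 1)
    step = timedelta(hours=step_hours)
    stamps = []
    while current < end:
        stamps.append(current.strftime("%Y-%m-%d %H:%M"))
        current += step
    return stamps

dates_alt = three_hour_timestamps()
# Same positions as the indices quoted in the notebook (1944 and 2063)
print(len(dates_alt), dates_alt[1944], dates_alt[2063])
```

Because the calendar logic comes from `datetime`, the indices line up with the hand-built list: position 1944 is September 1 at 00:00 and position 2063 is September 15 at 21:00.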

In [102]:
#[print(i, d) for i, d in enumerate(dates)]

In [100]:
# 1944 2018-9-1 00:00
# 2063 2018-9-15 21:00
test_dates = dates[1984-1:2063+1]
print("collecting from", test_dates[0], "to", test_dates[-1], "in 3 hour intervals")

collecting from 2018-9-5 21:00 to 2018-9-15 21:00


In [72]:
S2_dict = {}
def collect_tweets(from_date, to_date):
    # maxResults is capped at 100 for sandbox account
    # date format: YYYY-mm-DD HH:MM
    bitcoin_rule = gen_rule_payload("bitcoin", results_per_call=100, from_date=from_date, to_date=to_date) 
    print(bitcoin_rule)
    tweets = collect_results(bitcoin_rule, max_results=100, result_stream_args=premium_search_args)
    return tweets
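The payload printed by `collect_tweets` carries `fromDate` and `toDate` in a compact `YYYYmmddHHMM` form. A minimal sketch of that conversion, for illustration only: `to_api_timestamp` is a hypothetical helper, and searchtweets performs its own conversion inside `gen_rule_payload`.

```python
from datetime import datetime

def to_api_timestamp(date_string):
    """Convert 'YYYY-mm-DD HH:MM' into the compact 'YYYYmmddHHMM' form."""
    # strptime accepts months and days without zero padding, e.g. "2018-9-3"
    parsed = datetime.strptime(date_string, "%Y-%m-%d %H:%M")
    return parsed.strftime("%Y%m%d%H%M")

print(to_api_timestamp("2018-9-3 18:00"))  # 201809031800
```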

In [73]:
tweets = []
for i in range(len(test_dates) - 1):
    #S2_dict[i] = collect_tweets(test_dates[i], test_dates[i+1])
    tweets.extend(collect_tweets(test_dates[i], test_dates[i+1]))
    # pause for a minute every 8 requests to stay under the rate limit
    if i % 8 == 0 and i != 0:
        print("waiting 60 seconds")
        time.sleep(60)

{"query": "bitcoin", "maxResults": 100, "toDate": "201809032100", "fromDate": "201809031800"}
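The windowed collection pattern (pair each timestamp with the next one, pause periodically) can be sketched independently of the Twitter API. Everything here is an assumption for illustration: `collect_windows` and its parameters are hypothetical names, `fetch` stands in for a function like `collect_tweets`, and the pause rule (after every `pause_every` requests) is a simplification of the notebook's loop.

```python
import time

def collect_windows(timestamps, fetch, pause_every=8, pause_seconds=60, sleep=time.sleep):
    """Call fetch(from_date, to_date) for each consecutive pair of timestamps.

    Sleeps for pause_seconds after every pause_every requests, except after the last one.
    """
    results = []
    windows = list(zip(timestamps[:-1], timestamps[1:]))
    for count, (from_date, to_date) in enumerate(windows, start=1):
        results.extend(fetch(from_date, to_date))
        if count % pause_every == 0 and count < len(windows):
            sleep(pause_seconds)
    return results

# Dry run with a stand-in fetch function (no network access)
def fake_fetch(from_date, to_date):
    return [{"from": from_date, "to": to_date}]

demo = collect_windows(["2018-9-1 00:00", "2018-9-1 03:00", "2018-9-1 06:00"], fake_fetch)
print(len(demo))  # 2
```

Passing `sleep` as a parameter keeps the loop testable without actually waiting.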


In [27]:
print(len(tweets), tweets[0]['created_at'], tweets[-1]['created_at'])

100 Fri Aug 31 23:59:57 +0000 2018 Fri Aug 31 23:57:03 +0000 2018


### Counts and limitations

A trial run that collected tweets containing the string 'bitcoin', working backward from the current date until a maximum of 1,000 tweets was reached, covered only about 15 minutes of tweets. Raising the maximum reaches further back in time, up to the 30-day limit. Capturing anything older requires the full-archive endpoint, but with only 50 requests per month, the date ranges must be chosen very precisely to stay within the quota. In other words, at 100 tweets per request, each month we can collect 25,000 tweets from the last 30 days or 5,000 tweets from some earlier period. Collecting as many tweets per month from the full archive as from the 30-day endpoint requires a subscription of $225/month, and collecting over a million tweets would cost thousands of dollars.
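The budget figures can be reproduced with a short calculation. This is a sketch under the assumptions stated in the text (100 tweets per request, 250 or 50 requests per month); the function names are hypothetical, and `days_covered` additionally assumes one request per three-hour window, as in the collection loop.

```python
TWEETS_PER_REQUEST = 100  # sandbox cap noted in the collection code

def monthly_tweet_budget(requests_per_month, tweets_per_request=TWEETS_PER_REQUEST):
    """Maximum number of tweets collectable in a month."""
    return requests_per_month * tweets_per_request

def days_covered(requests_per_month, windows_per_day=8):
    """Days of coverage if each request samples one window (8 three-hour windows per day)."""
    return requests_per_month / windows_per_day

print(monthly_tweet_budget(250), monthly_tweet_budget(50))  # 25000 5000
print(days_covered(250), days_covered(50))  # 31.25 6.25
```

Under these assumptions, the 50 full-archive requests cover a little over six days when sampled every three hours, which is why the archive dates need to be picked carefully.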

## Sentiment Analysis

In [79]:
# create a pandas df from tweets
S2 = pd.DataFrame(data=[tweet.text for tweet in tweets], columns=['Tweets'])
S2['Date'] = [tweet['created_at'] for tweet in tweets]

In [80]:
S2.head()

| | Tweets | Date |
|---|---|---|
| 0 | RT @helexcorp: Our Public ICO is finished, tha... | Mon Sep 03 20:59:59 +0000 2018 |
| 1 | RT @coinbundlecom: With Bitcoin down, which to... | Mon Sep 03 20:59:58 +0000 2018 |
| 2 | Is the hype of blockchain starting to cool? An... | Mon Sep 03 20:59:52 +0000 2018 |
| 3 | RT @patestevao: I'm finally starting a new inf... | Mon Sep 03 20:59:52 +0000 2018 |
| 4 | #Bitcoin has a difficult task ahead - regardin... | Mon Sep 03 20:59:43 +0000 2018 |


In [115]:
def clean_tweet(tweet):
    '''
    Utility function to clean the text in a tweet by removing
    mentions, links, and special characters using regex.
    '''
    # raw string so the regex escapes (\w, \S) are not treated as string escapes
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

def analyze_sentiment(tweet):
    '''
    Utility function to classify the polarity of a tweet
    using textblob: 1 (positive), 0 (neutral), or -1 (negative).

    textblob ships with a ready-made sentiment analyzer,
    so no training is needed here.

    Might want to train our own model
    '''
    polarity = TextBlob(clean_tweet(tweet)).sentiment.polarity
    if polarity > 0:
        return 1
    elif polarity == 0:
        return 0
    else:
        return -1


def sentiment_analysis(S2):
    # We create a column with the result of the analysis:
    S2['SA'] = np.array([ analyze_sentiment(tweet) for tweet in S2['Tweets'] ])

    # We count the tweets in each class. Boolean masks avoid label-based lookups,
    # which break when the index has duplicate labels (e.g. after pd.concat):
    n_tweets = len(S2)
    n_pos = (S2['SA'] > 0).sum()
    n_neu = (S2['SA'] == 0).sum()
    n_neg = (S2['SA'] < 0).sum()

    # We print percentages:
    print("Percentage of positive tweets: {}%".format(n_pos*100/n_tweets))
    print("Percentage of neutral tweets: {}%".format(n_neu*100/n_tweets))
    print("Percentage of negative tweets: {}%".format(n_neg*100/n_tweets))

In [116]:
sentiment_analysis(S2)

Percentage of positive tweets: 37.037974683544306%
Percentage of neutral tweets: 52.32911392405063%
Percentage of negative tweets: 10.632911392405063%
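To make the cleaning and labeling steps concrete, here is a small standalone sketch that needs only the standard library. The sample tweet and its URL are invented for illustration, and `polarity_to_label` is a hypothetical helper that mirrors the thresholding on a plain float instead of calling TextBlob.

```python
import re

def clean_tweet(tweet):
    """Remove mentions, links, and special characters, then collapse whitespace."""
    pattern = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"
    return " ".join(re.sub(pattern, " ", tweet).split())

def polarity_to_label(polarity):
    """Map a polarity score to 1 (positive), 0 (neutral), or -1 (negative)."""
    if polarity > 0:
        return 1
    if polarity == 0:
        return 0
    return -1

sample = "RT @coingecko: Have you tried comparing coins? https://t.co/abc123 #Bitcoin"
print(clean_tweet(sample))  # RT Have you tried comparing coins Bitcoin
print(polarity_to_label(0.35), polarity_to_label(0.0), polarity_to_label(-0.2))  # 1 0 -1
```

Note that the retweet marker `RT` survives the cleaning, and that any tweet with a polarity of exactly zero lands in the neutral class.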


In [81]:
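# header=False keeps the files header-free, matching the read_csv(..., names=[...]) calls used to reload them
S2['Tweets'].to_csv('tweets_2018-09-03-21_Tweets.csv', index=False, header=False)
S2['Date'].to_csv('tweets_2018-09-03-21_Date.csv', index=False, header=False)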

Because of rate limits, the data could not all be gathered in one pass, which left gaps. Re-running the collection over the missing dates filled those gaps and produced continuous coverage for September 1-15.
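One way to find such gaps is to check which three-hour windows contain no tweets. The sketch below is an assumption about how this could be done, not the procedure used in the notebook: `missing_windows` is a hypothetical helper, the sample timestamps are illustrative, and parsing the weekday and month names assumes an English locale.

```python
from datetime import datetime, timedelta, timezone

# Format of the 'created_at' field, e.g. "Sat Sep 01 02:59:59 +0000 2018"
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def missing_windows(created_at_values, start, end, hours=3):
    """Return the start times of windows in [start, end) that contain no tweets."""
    step = timedelta(hours=hours)
    n_windows = int((end - start) / step)
    covered = set()
    for value in created_at_values:
        stamp = datetime.strptime(value, TWITTER_DATE_FORMAT)
        if start <= stamp < end:
            covered.add(int((stamp - start) / step))
    return [start + i * step for i in range(n_windows) if i not in covered]

sample = ["Sat Sep 01 02:59:59 +0000 2018", "Sat Sep 01 08:10:00 +0000 2018"]
start = datetime(2018, 9, 1, tzinfo=timezone.utc)
end = datetime(2018, 9, 1, 9, tzinfo=timezone.utc)
# The 03:00-06:00 window has no tweets in the sample
print(missing_windows(sample, start, end))
```

The returned window starts can then be fed back into the collection step as `from_date` values.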

In [88]:
# Sep 01-00:00
S3_Date_0901A = pd.read_csv('tweets_2018-09-01-00_Date.csv', names=['Date'])
S3_Tweets_0901A = pd.read_csv('tweets_2018-09-01-00_Tweets.csv', names=['Tweets'])
# Sep 01-03
S3_Date_090103 = pd.read_csv('tweets_2018-09-01_2018-09-03_Date.csv', names=['Date'])
S3_Tweets_090103 = pd.read_csv('tweets_2018-09-01_2018-09-03_Tweets.csv', names=['Tweets'])
# Sep 03-21:00
S3_Date_0903A = pd.read_csv('tweets_2018-09-03-21_Date.csv', names=['Date'])
S3_Tweets_0903A = pd.read_csv('tweets_2018-09-03-21_Tweets.csv', names=['Tweets'])
# Sep 03-05
S3_Date_090305 = pd.read_csv('tweets_2018-09-03_2018-09-05_Date.csv', names=['Date'])
S3_Tweets_090305 = pd.read_csv('tweets_2018-09-03_2018-09-05_Tweets.csv', names=['Tweets'])
# Sep 15-21:00
S3_Date_0915A = pd.read_csv('tweets_2018-09-15-21_Date.csv', names=['Date'])
S3_Tweets_0915A = pd.read_csv('tweets_2018-09-15-21_Tweets.csv', names=['Tweets'])
# Sep 06-15
S3_Date_090615 = pd.read_csv('tweets_2018-09-06_2018-09-15_Date.csv', names=['Date'])
S3_Tweets_090615 = pd.read_csv('tweets_2018-09-06_2018-09-15_Tweets.csv', names=['Tweets'])

S3_A = pd.concat([S3_Tweets_0901A, S3_Date_0901A], axis=1)
S3_B = pd.concat([S3_Tweets_090103, S3_Date_090103], axis=1)
S3_C = pd.concat([S3_Tweets_0903A, S3_Date_0903A], axis=1)
S3_D = pd.concat([S3_Tweets_090305, S3_Date_090305], axis=1)
S3_E = pd.concat([S3_Tweets_0915A, S3_Date_0915A], axis=1)
S3_F = pd.concat([S3_Tweets_090615, S3_Date_090615], axis=1)

In [89]:
print(S3_A.head(), '\n', S3_B.head(), '\n', S3_C.head(), '\n', S3_D.head(), '\n', S3_E.head(), '\n', S3_F.head())

                                              Tweets  \
0  Haha @Eminem dropped that new album and name d...   
1  RT @coingecko: Have you tried comparing coins ...   
2  RT @cryptocomicon: Chris DeRose spends an 86 m...   
3  RT @santisiri: un partido político que opera s...   
4  RT @BitcoinDood: DNA: The Safest Way to Store ...   

                             Date  
0  Fri Aug 31 23:59:57 +0000 2018  
1  Fri Aug 31 23:59:51 +0000 2018  
2  Fri Aug 31 23:59:47 +0000 2018  
3  Fri Aug 31 23:59:46 +0000 2018  
4  Fri Aug 31 23:59:45 +0000 2018   
                                               Tweets  \
0  https://t.co/yLZluuYevy DECENTRALISED ENERGY P...   
1  📉 Biggest Losers (1 hr) 📉\nNoah Coin $NOAH -3....   
2  Crypto News: Yahoo! World’s Sixth-Most Popular...   
3  RT @coingecko: Have you tried comparing coins ...   
4  Bitcoin Gets Awareness Boost From Mention On E...   

                             Date  
0  Sat Sep 01 02:59:59 +0000 2018  
1  Sat Sep 01 02:59:58 +0000 2018  


In [94]:
S3 = pd.concat([S3_A, S3_B, S3_C, S3_D, S3_E, S3_F], axis=0)
# file names use 09 because the combined data covers September 01-15
S3['Date'].to_csv('tweets_2018-09-01_2018-09-15_Date.csv', index=False, header=False)
S3['Tweets'].to_csv('tweets_2018-09-01_2018-09-15_Tweets.csv', index=False, header=False)

In [95]:
S3.head()

| | Tweets | Date |
|---|---|---|
| 0 | Haha @Eminem dropped that new album and name d... | Fri Aug 31 23:59:57 +0000 2018 |
| 1 | RT @coingecko: Have you tried comparing coins ... | Fri Aug 31 23:59:51 +0000 2018 |
| 2 | RT @cryptocomicon: Chris DeRose spends an 86 m... | Fri Aug 31 23:59:47 +0000 2018 |
| 3 | RT @santisiri: un partido político que opera s... | Fri Aug 31 23:59:46 +0000 2018 |
| 4 | RT @BitcoinDood: DNA: The Safest Way to Store ... | Fri Aug 31 23:59:45 +0000 2018 |


In [96]:
S3.tail()

| | Tweets | Date |
|---|---|---|
| 7895 | RT @bitcoincardvd: You can start your Bitcoin ... | Sat Sep 15 20:57:43 +0000 2018 |
| 7896 | RT @securixio: We at https://t.co/3OqG6HXwB0 p... | Sat Sep 15 20:57:40 +0000 2018 |
| 7897 | It doesn’t matter if Bitcoin is $6k or $50k.\n... | Sat Sep 15 20:57:39 +0000 2018 |
| 7898 | RT @iMariaJohnsen: Uncovering facts on #blockc... | Sat Sep 15 20:57:37 +0000 2018 |
| 7899 | RT @favycoin: Grab your #Favycoin and don't mi... | Sat Sep 15 20:57:33 +0000 2018 |


### Summary so far

It's reasonable to assume that Twitter data is more informative when it covers a broad span of time than when it is clustered around a single moment. To get that coverage, subsamples of tweets need to be gathered across a range of days. Tweets are collected between the two timestamps listed below (shown with their index in `dates`), in three-hour steps: each request uses one timestamp as `from_date` and the next one as `to_date`. However, each request is capped at 100 tweets, so collection stops early and each three-hour window typically contributes only a couple of minutes of tweets. Sampling 100 tweets per window is an efficient way to collect a fraction of the Twitter data over a larger number of days.

- 1944 2018-9-1 00:00
- 2063 2018-9-15 21:00
 
Sentiment analysis uses TextBlob's built-in sentiment scorer. The data is stored in a dataframe called S2, and the tweet text and dates are written to separate CSV files so they can be pasted back into a dataframe later. They are kept separate because the tweet text itself contains commas, and splitting the columns was simpler than fighting the CSV parsing.
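If a single file is ever preferred, quoting every field is one way to keep commas and line breaks inside tweet text from splitting columns. This is a sketch of an alternative, not what the notebook does: the helper names are hypothetical and the sample rows are invented for illustration.

```python
import csv
import io

def write_rows(rows):
    """Serialize (tweet, date) rows to CSV text, quoting every field."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
    writer.writerow(["Tweets", "Date"])
    writer.writerows(rows)
    return buffer.getvalue()

def read_rows(text):
    """Parse CSV text back into a header and a list of (tweet, date) tuples."""
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    return header, [tuple(row) for row in reader]

rows = [
    ("Our Public ICO is finished, thanks to everyone", "Mon Sep 03 20:59:59 +0000 2018"),
    ("Bitcoin at $6k or $50k,\nit does not matter", "Sat Sep 15 20:57:39 +0000 2018"),
]
header, restored = read_rows(write_rows(rows))
print(restored == rows)  # True
```

The round trip preserves both the embedded comma and the embedded newline, so the two columns stay aligned.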