# Bitcoin sentiment analysis using Twitter

## Data generation

searchtweets API reference: https://twitterdev.github.io/search-tweets-python/  
Twitter API reference: https://developer.twitter.com/en/docs/tweets/search/api-reference/premium-search.html  
Twitter tweet object and dictionary: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object  

The credentials file (`~/.twitter_keys`) contains `endpoint`, `consumer_key`, and `consumer_secret`.  
Change `yaml_key` to get data for the last 30 days (250 queries / month) or since Twitter's inception in 2006 (50 queries / month):  
`yaml_key = "search_tweets_premium_30day"`  
`yaml_key = "search_tweets_premium_archive"`  


Each request made by a result stream increments the query count.  
For example, if `results_per_call` is 100 and `max_results` is 1000, that is 10 queries.  
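
The arithmetic in that example can be written down as a small helper. This is only a sketch: `requests_for_stream` is a hypothetical name, and it assumes every request returns a full page of `results_per_call` tweets (a stream that runs out of matching tweets early would use fewer requests).

```python
import math


def requests_for_stream(max_results: int, results_per_call: int) -> int:
    """Estimate how many API requests one result stream consumes.

    Hypothetical helper: assumes each request returns a full page.
    """
    if max_results <= 0 or results_per_call <= 0:
        raise ValueError("both arguments must be positive")
    # Round up: a partially filled last page still costs one request.
    return math.ceil(max_results / results_per_call)


print(requests_for_stream(1000, 100))  # 10
print(requests_for_stream(500, 100))   # 5
```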

### Twitter metadata

 - Text
 - Date
 - User
 - Place.full_name, Place.country
 - Retweet_count
 - Favorite_count (likes)
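
As a rough illustration, the fields listed above could be pulled out of a tweet dictionary as follows. The key names (`text`, `created_at`, `user`, `place`, `retweet_count`, `favorite_count`) follow the tweet object dictionary linked above; `extract_metadata` and the sample values are made up for this sketch. `place` is frequently missing, so it is treated as optional.

```python
from typing import Any, Dict, Optional


def extract_metadata(tweet: Dict[str, Any]) -> Dict[str, Optional[Any]]:
    """Flatten the fields of interest from one tweet dictionary (hypothetical helper)."""
    user = tweet.get("user") or {}    # guard against a missing or None user
    place = tweet.get("place") or {}  # place is None for most tweets
    return {
        "text": tweet.get("text"),
        "date": tweet.get("created_at"),
        "user": user.get("screen_name"),
        "place_full_name": place.get("full_name"),
        "place_country": place.get("country"),
        "retweet_count": tweet.get("retweet_count"),
        "favorite_count": tweet.get("favorite_count"),
    }


# Made-up sample values, for illustration only.
sample = {
    "text": "example tweet about bitcoin",
    "created_at": "Mon Oct 01 23:59:58 +0000 2018",
    "user": {"screen_name": "example_user"},
    "place": None,
    "retweet_count": 0,
    "favorite_count": 0,
}
print(extract_metadata(sample))
```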


In [10]:
from searchtweets import ResultStream, gen_rule_payload, load_credentials, collect_results

# general imports
import numpy as np
import pandas as pd
#import tweepy
from textblob import TextBlob
import re
import time
import datetime

# plotting and visualization
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [9]:
premium_search_args = load_credentials("~/Documents/Research/Twitter/.twitter_keys.yaml",
                                          yaml_key="search_tweets_premium_30day",
                                          env_overwrite=False)

Grabbing bearer token from OAUTH


In [5]:
print(premium_search_args)

{'bearer_token': 'AAAAAAAAAAAAAAAAAAAAAMkh8wAAAAAAKjpbbf0g3sMvc5EofBU%2BTcPuado%3DGaRFNKdtFOjT3zkaTlyzkdNn7IGPx8vrAmwRHBNvSw8VFKEt87', 'endpoint': 'https://api.twitter.com/1.1/tweets/search/30day/development.json', 'extra_headers_dict': None}


In [6]:
# #dates
# months = np.arange(1,13)
# days = np.arange(1,32)
# time = [" 00:00", " 03:00", " 6:00", " 09:00", " 12:00", " 15:00", " 18:00", " 21:00"]
# dates = []
# dates_extra = [ "2018-" + str(m) + "-" + str(d) + str(t) for m in months for d in days for t in time ]
# spurious_dates = ['2018-2-29', '2018-2-30', '2018-2-31', '2018-4-31', '2018-6-31', '2018-9-31', '2018-11-31']
# spurious_dates = [ d + t for d in spurious_dates for t in time ]
# dates = [d for d in dates_extra if d not in spurious_dates]

In [23]:
def days_to_collect(start, end, frequency):
    '''
    will return a list of timestamps (formatted YYYYmmddHHMM) spaced `frequency` hours apart,
    beginning one step before midnight of the start date and finishing at the last step of the end date
    start = start date, 'YYYY-mm-dd'
    end = end date, 'YYYY-mm-dd'
    frequency = number of hours to step by, as a string. For example frequency = '12' gives two timestamps per day: midnight and noon
    '''
    # step back one interval from the start and add one day to the end for the right-side border case
    start = datetime.datetime.strptime(start, '%Y-%m-%d') - datetime.timedelta(hours=int(frequency))
    end = datetime.datetime.strptime(end, '%Y-%m-%d') + datetime.timedelta(days=1)
    # closed='left' drops the extra midnight that follows the end date
    # (pandas >= 1.4 calls this argument inclusive='left'; pandas 2.0 removed `closed`)
    dates = pd.date_range(start=start, end=end, freq=frequency+'H', closed='left')
    formatted_dates = [ datetime.datetime.strftime(t, '%Y%m%d%H%M') for t in dates ]
    return formatted_dates
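
For reference, the same list of timestamps can be produced with the standard library alone, which avoids depending on a particular pandas version. This is a sketch rather than a drop-in replacement: `boundaries` is a hypothetical name, and it takes the step as an integer number of hours instead of a string.

```python
from datetime import datetime, timedelta
from typing import List


def boundaries(start: str, end: str, step_hours: int) -> List[str]:
    """Timestamps (YYYYmmddHHMM) from one step before `start` through the end of `end`."""
    current = datetime.strptime(start, "%Y-%m-%d") - timedelta(hours=step_hours)
    stop = datetime.strptime(end, "%Y-%m-%d") + timedelta(days=1)  # exclusive bound
    step = timedelta(hours=step_hours)
    stamps = []
    while current < stop:
        stamps.append(current.strftime("%Y%m%d%H%M"))
        current += step
    return stamps


stamps = boundaries("2018-10-01", "2018-10-05", 12)
print(len(stamps), stamps[0], stamps[-1])  # 11 201809301200 201810051200
```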

In [29]:
test_dates = days_to_collect('2018-10-01', '2018-10-05', '12')
print("twitter recognized dates will be collected on the closed interval from", test_dates[1], "to", test_dates[-1])

twitter recognized dates will be collected on the closed interval from 201810010000 to 201810051200


In [30]:
len(test_dates)

11
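
Each consecutive pair of timestamps forms one `(from_date, to_date)` collection window, so a list of 11 timestamps yields 10 windows. A minimal sketch of that pairing, where `to_windows` is a hypothetical helper name:

```python
from typing import List, Tuple


def to_windows(stamps: List[str]) -> List[Tuple[str, str]]:
    """Pair each timestamp with its successor: n timestamps give n - 1 windows."""
    return list(zip(stamps[:-1], stamps[1:]))


example = ["201809301200", "201810010000", "201810011200"]
for from_date, to_date in to_windows(example):
    print(from_date, "->", to_date)
```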

In [31]:
for i, d in enumerate(test_dates):
    print(i, d)

0 201809301200
1 201810010000
2 201810011200
3 201810020000
4 201810021200
5 201810030000
6 201810031200
7 201810040000
8 201810041200
9 201810050000
10 201810051200

In [9]:
# 1944 2018-9-1 00:00
# 2063 2018-9-15 21:00
test_dates = dates[1944:2063+1] # include -1 / +1 to include bounds
print("collecting from", test_dates[0], "to", test_dates[-1], "in 3 hour intervals")

collecting from 2018-9-1 00:00 to 2018-9-15 21:00 in 3 hour intervals


In [16]:
test_dates

['201809301200', '201810010000', '201810011200']

In [32]:
def collect_tweets(from_date, to_date):
    # maxResults (results_per_call) is capped at 100 for a sandbox account. A paging ("next") step should be
    # needed to get more, but max_results=500 appears to be accepted without any extra work
    # date format: YYYY-mm-DD HH:MM, or YYYYmmDDHHMM as returned by days_to_collect
    # from_date is inclusive. to_date is non-inclusive. Collection appears to start just before to_date and
    # work backwards toward from_date
    bitcoin_rule = gen_rule_payload("bitcoin", results_per_call=100, from_date=from_date, to_date=to_date) 
    print(bitcoin_rule)
    collected_tweets = collect_results(bitcoin_rule, max_results=500, result_stream_args=premium_search_args)
    return collected_tweets

In [37]:
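tweets = []
for i in range(len(test_dates) - 1):
    # each consecutive pair of timestamps is one (from_date, to_date) window
    tweets.extend(collect_tweets(test_dates[i], test_dates[i+1]))
    # pause after windows 3, 6, 9, ... to stay under the rate limit
    if i % 3 == 0 and i != 0:
        print("waiting 60 seconds")
        time.sleep(60)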

{"query": "bitcoin", "maxResults": 100, "toDate": "201810010000", "fromDate": "201809301200"}
{"query": "bitcoin", "maxResults": 100, "toDate": "201810011200", "fromDate": "201810010000"}
{"query": "bitcoin", "maxResults": 100, "toDate": "201810020000", "fromDate": "201810011200"}
{"query": "bitcoin", "maxResults": 100, "toDate": "201810021200", "fromDate": "201810020000"}
waiting 60 seconds
{"query": "bitcoin", "maxResults": 100, "toDate": "201810030000", "fromDate": "201810021200"}
{"query": "bitcoin", "maxResults": 100, "toDate": "201810031200", "fromDate": "201810030000"}
{"query": "bitcoin", "maxResults": 100, "toDate": "201810040000", "fromDate": "201810031200"}
waiting 60 seconds
{"query": "bitcoin", "maxResults": 100, "toDate": "201810041200", "fromDate": "201810040000"}
{"query": "bitcoin", "maxResults": 100, "toDate": "201810050000", "fromDate": "201810041200"}
{"query": "bitcoin", "maxResults": 100, "toDate": "201810051200", "fromDate": "201810050000"}
waiting 60 seconds
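
The pause-every-few-windows pattern can be factored into a small helper. This is a sketch, not the notebook's code: `collect_paced`, `fetch`, and the injectable `sleep` parameter are hypothetical names, and `fetch` stands in for a function such as `collect_tweets`. It counts windows from 1, so it pauses after every `pause_every`-th window, including the last one if the count lines up.

```python
import time
from typing import Callable, Iterable, List, Tuple


def collect_paced(
    windows: Iterable[Tuple[str, str]],
    fetch: Callable[[str, str], list],
    pause_every: int = 3,
    pause_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List:
    """Call fetch(from_date, to_date) per window, pausing after every `pause_every` windows."""
    results: List = []
    for count, (from_date, to_date) in enumerate(windows, start=1):
        results.extend(fetch(from_date, to_date))
        if count % pause_every == 0:
            sleep(pause_seconds)
    return results


# Demo with stand-ins so that nothing touches the network or actually sleeps.
def fake_fetch(from_date: str, to_date: str) -> list:
    return [from_date + "-" + to_date]


naps: list = []
demo = collect_paced(
    [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")],
    fake_fetch,
    pause_every=2,
    pause_seconds=60,
    sleep=naps.append,
)
print(len(demo), naps)  # 4 [60, 60]
```

Passing `sleep` as a parameter keeps the helper testable; in real use the default `time.sleep` applies.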


In [38]:
print(len(tweets), tweets[0]['created_at'], tweets[-1]['created_at'])

5000 Sun Sep 30 23:59:54 +0000 2018 Fri Oct 05 11:49:51 +0000 2018


In [36]:
tweets[1000]

{'created_at': 'Mon Oct 01 23:59:58 +0000 2018',
 'id': 1046912641086631936,
 'id_str': '1046912641086631936',
 'text': 'RT @choicemining: #Airdrop Phase 2 Is Currently Running\nGet Free 80 CHM token   from @choicemining - The new era of crypto mining.\nRegister…',
 'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
 'truncated': False,
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'in_reply_to_screen_name': None,
 'user': {'id': 999281589946372096,
  'id_str': '999281589946372096',
  'name': 'Jeffrey Edions',
  'screen_name': 'edionwe2',
  'location': None,
  'url': None,
  'description': None,
  'translator_type': 'none',
  'protected': False,
  'verified': False,
  'followers_count': 6,
  'friends_count': 75,
  'listed_count': 0,
  'favourites_count': 38,
  'statuses_count': 80,
  'created_at': 'Wed May 23 13:31:11 +0000 2018',
  'utc_offset': None,
 

### Counts and limitations

A trial that collected tweets containing the string 'bitcoin', working back from the current date until a maximum of 1,000 tweets was reached, spanned 15 minutes. If the maximum number of tweets is increased, the collection eventually reaches back 30 days. Capturing data beyond that requires the full-archive endpoint. With only 50 requests per month, however, the date ranges must be specified very precisely to stay under the limit. That is, once a month we can collect 25,000 tweets from the last 30 days, or 5,000 tweets from some earlier period. Collecting as many tweets per month from the full archive as from the 30-day endpoint requires a $225/month subscription, and collecting over a million tweets would cost thousands of dollars.
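
The monthly figures follow directly from the request quotas and the 100-tweet page size. A small sketch of that arithmetic, with hypothetical helper names and the assumption that every request returns a full page:

```python
def monthly_tweet_capacity(requests_per_month: int, results_per_call: int = 100) -> int:
    """Upper bound on tweets per month if every request returns a full page."""
    return requests_per_month * results_per_call


def windows_affordable(requests_per_month: int, max_results: int, results_per_call: int = 100) -> int:
    """How many collection windows of `max_results` tweets fit in the monthly quota."""
    requests_per_window = -(-max_results // results_per_call)  # ceiling division
    return requests_per_month // requests_per_window


print(monthly_tweet_capacity(250))   # 25000 (30-day endpoint)
print(monthly_tweet_capacity(50))    # 5000 (full archive)
print(windows_affordable(250, 500))  # 50 windows of 500 tweets each
```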

## Sentiment Analysis

In [39]:
# create a pandas df from tweets
S2 = pd.DataFrame(data=[tweet.text for tweet in tweets], columns=['Tweets'])
S2['Date'] = [tweet['created_at'] for tweet in tweets]

In [40]:
S2.head()

| | Tweets | Date |
|---|---|---|
| 0 | RT @CryptoInvest18: Best coin still under $1? ... | Sun Sep 30 23:59:54 +0000 2018 |
| 1 | RT @kubitx: We are TRULY GLOBAL EXCHANGE... Aw... | Sun Sep 30 23:59:50 +0000 2018 |
| 2 | RT @TrezarCoin: We are proud to release Trezar... | Sun Sep 30 23:59:43 +0000 2018 |
| 3 | RT @APompliano: REMINDER: A single bank locati... | Sun Sep 30 23:59:42 +0000 2018 |
| 4 | The first ever #crypto bank opening in October... | Sun Sep 30 23:59:42 +0000 2018 |


In [41]:
def clean_tweet(tweet):
    '''
    Utility function to clean the text in a tweet by removing 
    mentions, links, and special characters using regex.
    '''
    # raw string so the regex escapes (\w, \S, \/) are not treated as string escapes
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

def analyze_sentiment(tweet):
    '''
    Utility function to classify the polarity of a tweet
    using textblob: 1 = positive, 0 = neutral, -1 = negative.
    
    textblob ships with a ready-made sentiment analyzer
    (by default the lexicon-based PatternAnalyzer),
    so no training is needed.
    
    Might want to train our own model
    '''
    polarity = TextBlob(clean_tweet(tweet)).sentiment.polarity
    if polarity > 0:
        return 1
    elif polarity == 0:
        return 0
    else:
        return -1
    

def sentiment_analysis(S2):
    # We create a column with the result of the analysis:
    S2['SA'] = np.array([ analyze_sentiment(tweet) for tweet in S2['Tweets'] ])
    
    # We count the tweets in each class:
    total = len(S2['Tweets'])
    n_pos = int((S2['SA'] > 0).sum())
    n_neu = int((S2['SA'] == 0).sum())
    n_neg = int((S2['SA'] < 0).sum())

    # We print percentages:
    print("Percentage of positive tweets: {}%".format(n_pos*100/total))
    print("Percentage of neutral tweets: {}%".format(n_neu*100/total))
    print("Percentage of negative tweets: {}%".format(n_neg*100/total))

In [42]:
sentiment_analysis(S2)

Percentage of positive tweets: 36.96%
Percentage of neutral tweets: 52.64%
Percentage of negative tweets: 10.4%
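
To make the cleaning step concrete, here is a standalone sketch that applies the same regular expression to two made-up tweets and maps a polarity score to the three classes. `polarity_to_label` is a hypothetical name, and TextBlob is deliberately left out so the snippet runs with the standard library alone.

```python
import re

# Same pattern as clean_tweet: mentions, non-alphanumeric characters, and URLs.
PATTERN = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)")


def clean_tweet(tweet: str) -> str:
    """Replace every match with a space, then collapse runs of whitespace."""
    return " ".join(PATTERN.sub(" ", tweet).split())


def polarity_to_label(polarity: float) -> int:
    """Map a polarity score to 1 (positive), 0 (neutral), or -1 (negative)."""
    if polarity > 0:
        return 1
    if polarity < 0:
        return -1
    return 0


# Made-up sample tweets, for illustration only.
print(clean_tweet("RT @coingecko: Have you tried #bitcoin? https://t.co/abc123"))
# RT Have you tried bitcoin
print(clean_tweet("@crypto_news says $BTC is up 5%!"))
# news says BTC is up 5
```

The second sample shows two side effects worth knowing about: a handle containing an underscore is only partly removed, and symbols such as `$` and `%` are dropped, which may matter for price-related tweets.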


In [47]:
# S2['Tweets'].to_csv('tweets_2018-10-01_2018-10-05_Tweets.csv', index=False)
# S2['Date'].to_csv('tweets_2018-10-01_2018-10-05_Date.csv', index=False)
S2.to_csv('tweets_2018-10-01_2018-10-05_df.csv', index=False)

In [46]:
S2.head()

| | Tweets | Date | SA |
|---|---|---|---|
| 0 | RT @CryptoInvest18: Best coin still under $1? ... | Sun Sep 30 23:59:54 +0000 2018 | 1 |
| 1 | RT @kubitx: We are TRULY GLOBAL EXCHANGE... Aw... | Sun Sep 30 23:59:50 +0000 2018 | 1 |
| 2 | RT @TrezarCoin: We are proud to release Trezar... | Sun Sep 30 23:59:43 +0000 2018 | 1 |
| 3 | RT @APompliano: REMINDER: A single bank locati... | Sun Sep 30 23:59:42 +0000 2018 | 1 |
| 4 | The first ever #crypto bank opening in October... | Sun Sep 30 23:59:42 +0000 2018 | 1 |


Due to rate limits, not all of the data could be gathered at once, which left gaps in the data. Running through the dates again and re-collecting the missing data produced a continuous collection for September 01-15.

Pausing once every 3 iterations rather than once every 10 allowed the whole collection to run in one go.
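
When re-collected batches overlap earlier ones, the same tweet can appear twice. One way to merge batches is to drop exact duplicates and sort by timestamp, sketched below with the standard library. This assumes each row is a `(text, date)` pair, as in the CSVs, and that an identical pair means the same tweet; the tweet `id` would be a more reliable key if it were stored. `merge_batches` and the sample rows are hypothetical.

```python
from datetime import datetime
from typing import Iterable, List, Tuple

# Format of the created_at field, e.g. "Sat Sep 01 02:59:59 +0000 2018".
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"

Row = Tuple[str, str]  # (tweet text, created_at)


def merge_batches(batches: Iterable[Iterable[Row]]) -> List[Row]:
    """Concatenate batches, drop exact duplicate rows, and sort oldest first."""
    seen = set()
    merged: List[Row] = []
    for batch in batches:
        for row in batch:
            if row not in seen:
                seen.add(row)
                merged.append(row)
    merged.sort(key=lambda row: datetime.strptime(row[1], TWITTER_DATE_FORMAT))
    return merged


# Made-up rows, for illustration only.
batch_a = [
    ("tweet one", "Sat Sep 01 02:59:59 +0000 2018"),
    ("tweet two", "Fri Aug 31 23:59:57 +0000 2018"),
]
batch_b = [
    ("tweet two", "Fri Aug 31 23:59:57 +0000 2018"),  # duplicate of a row in batch_a
    ("tweet three", "Sat Sep 01 02:59:58 +0000 2018"),
]
for text, created_at in merge_batches([batch_a, batch_b]):
    print(created_at, text)
```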

In [88]:
# Sep 01-00:00
S3_Date_0901A = pd.read_csv('tweets_2018-09-01-00_Date.csv', names=['Date'])
S3_Tweets_0901A = pd.read_csv('tweets_2018-09-01-00_Tweets.csv', names=['Tweets'])
# Sep 01-03
S3_Date_090103 = pd.read_csv('tweets_2018-09-01_2018-09-03_Date.csv', names=['Date'])
S3_Tweets_090103 = pd.read_csv('tweets_2018-09-01_2018-09-03_Tweets.csv', names=['Tweets'])
# Sep 03-21:00
S3_Date_0903A = pd.read_csv('tweets_2018-09-03-21_Date.csv', names=['Date'])
S3_Tweets_0903A = pd.read_csv('tweets_2018-09-03-21_Tweets.csv', names=['Tweets'])
# Sep 03-05
S3_Date_090305 = pd.read_csv('tweets_2018-09-03_2018-09-05_Date.csv', names=['Date'])
S3_Tweets_090305 = pd.read_csv('tweets_2018-09-03_2018-09-05_Tweets.csv', names=['Tweets'])
# Sep 15-21:00
S3_Date_0915A = pd.read_csv('tweets_2018-09-15-21_Date.csv', names=['Date'])
S3_Tweets_0915A = pd.read_csv('tweets_2018-09-15-21_Tweets.csv', names=['Tweets'])
# Sep 06-15
S3_Date_090615 = pd.read_csv('tweets_2018-09-06_2018-09-15_Date.csv', names=['Date'])
S3_Tweets_090615 = pd.read_csv('tweets_2018-09-06_2018-09-15_Tweets.csv', names=['Tweets'])

S3_A = pd.concat([S3_Tweets_0901A, S3_Date_0901A], axis=1)
S3_B = pd.concat([S3_Tweets_090103, S3_Date_090103], axis=1)
S3_C = pd.concat([S3_Tweets_0903A, S3_Date_0903A], axis=1)
S3_D = pd.concat([S3_Tweets_090305, S3_Date_090305], axis=1)
S3_E = pd.concat([S3_Tweets_0915A, S3_Date_0915A], axis=1)
S3_F = pd.concat([S3_Tweets_090615, S3_Date_090615], axis=1)

In [89]:
print(S3_A.head(), '\n', S3_B.head(), '\n', S3_C.head(), '\n', S3_D.head(), '\n', S3_E.head(), '\n', S3_F.head())

                                              Tweets  \
0  Haha @Eminem dropped that new album and name d...   
1  RT @coingecko: Have you tried comparing coins ...   
2  RT @cryptocomicon: Chris DeRose spends an 86 m...   
3  RT @santisiri: un partido político que opera s...   
4  RT @BitcoinDood: DNA: The Safest Way to Store ...   

                             Date  
0  Fri Aug 31 23:59:57 +0000 2018  
1  Fri Aug 31 23:59:51 +0000 2018  
2  Fri Aug 31 23:59:47 +0000 2018  
3  Fri Aug 31 23:59:46 +0000 2018  
4  Fri Aug 31 23:59:45 +0000 2018   
                                               Tweets  \
0  https://t.co/yLZluuYevy DECENTRALISED ENERGY P...   
1  📉 Biggest Losers (1 hr) 📉\nNoah Coin $NOAH -3....   
2  Crypto News: Yahoo! World’s Sixth-Most Popular...   
3  RT @coingecko: Have you tried comparing coins ...   
4  Bitcoin Gets Awareness Boost From Mention On E...   

                             Date  
0  Sat Sep 01 02:59:59 +0000 2018  
1  Sat Sep 01 02:59:58 +0000 2018  


In [94]:
S3 = pd.concat([S3_A, S3_B, S3_C, S3_D, S3_E, S3_F], axis=0)
S3['Date'].to_csv('tweets_2018-08-01_2018-08-15_Date.csv', index=False)
S3['Tweets'].to_csv('tweets_2018-08-01_2018-08-15_Tweets.csv', index=False)

In [44]:
S2.head()

| | Tweets | Date | SA |
|---|---|---|---|
| 0 | RT @CryptoInvest18: Best coin still under $1? ... | Sun Sep 30 23:59:54 +0000 2018 | 1 |
| 1 | RT @kubitx: We are TRULY GLOBAL EXCHANGE... Aw... | Sun Sep 30 23:59:50 +0000 2018 | 1 |
| 2 | RT @TrezarCoin: We are proud to release Trezar... | Sun Sep 30 23:59:43 +0000 2018 | 1 |
| 3 | RT @APompliano: REMINDER: A single bank locati... | Sun Sep 30 23:59:42 +0000 2018 | 1 |
| 4 | The first ever #crypto bank opening in October... | Sun Sep 30 23:59:42 +0000 2018 | 1 |


In [45]:
S2.tail()

| | Tweets | Date | SA |
|---|---|---|---|
| 4995 | #Earn #BITCOIN in differents ways!\n* Spin - ... | Fri Oct 05 11:49:55 +0000 2018 | 0 |
| 4996 | #XBlock #Blockchain #Crypto #ether #ethereum #... | Fri Oct 05 11:49:54 +0000 2018 | 0 |
| 4997 | RT @adam3us: On the elemental supply chain HW ... | Fri Oct 05 11:49:52 +0000 2018 | 1 |
| 4998 | RT @GRDO411: For clarity, $GRDO no longer has ... | Fri Oct 05 11:49:52 +0000 2018 | 1 |
| 4999 | Who's buying bitcoin? A lot of young, fairly a... | Fri Oct 05 11:49:51 +0000 2018 | 1 |


### Summary so far

It's reasonable to assume that Twitter data is more informative when viewed as a larger picture than as a collection centered on a single point in time. To get that picture, subsamples of Twitter data need to be gathered over a range of days. Tweets are gathered between the start and end points listed below. For each request, the from_date is the listed timestamp and the to_date is set to the next one. However, rate limits terminate each request early, after 100 tweets have been gathered, so a request typically yields only a couple of minutes of tweets out of every three hours. Collecting 100 tweets at a time in this way is an efficient method of sampling a fraction of Twitter data over a larger number of days.

- 1944 2018-9-1 00:00
- 2063 2018-9-15 21:00
 
Sentiment analysis uses TextBlob's built-in sentiment scorer. The data is then stored in a dataframe called S2 and written to individual CSVs, to be pasted back into a dataframe for later use. The tweets and dates are kept in separate files because the tweet texts themselves contain commas, and keeping them apart was simpler than fighting the delimiter.
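
Commas inside tweet text do not have to force separate files, because CSV quoting handles them. The sketch below uses only the standard-library `csv` module and made-up rows to show text and date surviving a round trip through a single CSV. To the best of my knowledge, pandas' `to_csv` and `read_csv` quote fields in the same way by default, so the single-file `S2.to_csv(...)` call used earlier should round-trip as well.

```python
import csv
import io

# Made-up rows containing commas, quotes, and a newline inside the text field.
rows = [
    ("RT @example: buy, hold, or sell?", "Sun Sep 30 23:59:54 +0000 2018"),
    ('He said "bitcoin"\nand left', "Sun Sep 30 23:59:50 +0000 2018"),
]

# Write to an in-memory buffer; a real file opened with newline="" behaves the same.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)  # QUOTE_MINIMAL by default: quotes only fields that need it
writer.writerow(["Tweets", "Date"])
writer.writerows(rows)

# Read the rows back and compare with the originals.
buffer.seek(0)
reader = csv.reader(buffer)
header = next(reader)
restored = [tuple(row) for row in reader]
print(header, restored == rows)
```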

