# Correlation analysis between the Bitcoin currency and Twitter

This project consists of a correlation analysis between the Bitcoin currency and tweets. In order to define the positiveness of a tweet (if the course of the bitcoin will go up or down), we realise a sentiment analysis of each tweet using the VADER algorithm. Finally we try to find a correlation between the two and we will make some machine learning to make predictions.

This notebook was written using Python 3.6.

## Sentiment analysis

### Import Twython
We use the *twython* package as my Python interface with the Twitter API: https://twython.readthedocs.io/en/latest/usage/starting_out.html

The twython package must be installed using *pip install twython* from the command line.

In [3]:
from twython import Twython

### OAuth2 Authentication (*app* authentication)
Here we use the method *OAuth2* along with the Twithon library to authenticate on the twitter API.

OAuth1 will give you *user* access to the API, whereas OAuth2 will give the *app* access. For academic use the rate limits are generally better for *OAuth2* (app) authentication, with a few exceptions. For a chart showing the API limits for user and app authentication for the various parts of the Twitter API, see this chart: https://dev.twitter.com/rest/public/rate-limits

Running the code block below shows that we now have a rate limit of 450 API calls. This means we can make 450 different calls to the API within the current 15-minute window. With the search API we can access 100 tweets per call. This means that, if we were downloading tweets with a specific hashtag, such as *#arnova16*, we could download 450 $\times$ 100 or 45,000 tweets per window. This is much better than the 18,000 tweets we can access using the OAuth1 or user authentication.

In [6]:
APP_KEY = 'mPQKoRwd2Pb9qpQyQmyG5s8KR'
APP_SECRET = 'HLvIhusvfzDLKaRXY8CnZGP143kp3E3f2KqQBIEMfVL5mOxZjq'
twitter = Twython(APP_KEY, APP_SECRET, oauth_version=2)
ACCESS_TOKEN = twitter.obtain_access_token()
twitter = Twython(APP_KEY, access_token=ACCESS_TOKEN)
twitter.get_application_rate_limit_status()['resources']['search']

{'/search/tweets': {'limit': 450, 'remaining': 450, 'reset': 1526460427}}

### Query the twitter API
Here we query the twitter API to get the latest tweets about bitcoin. Then we transform it to store only the useful data inside a Pandas Dataframe.

The pandas package must be installed using *pip install pandas* from the command line.

In [4]:
from time import sleep
import json
import pandas as pd
import io

In [7]:
NUMBER_OF_QUERIES = 400
data = {"statuses": []}
next_id = ""
f = open('tweets_raw.csv', 'a', encoding='utf-8')
while(True):
    for i in range(NUMBER_OF_QUERIES):
        if not next_id:
            data = twitter.search(q='#bitcoin', lang='en', result_type='recent', count="100") # Use since_id for tweets after id
        else:
            data["statuses"].extend(twitter.search(q='#bitcoin', lang='en', result_type='mixed', count="100", max_id=next_id)["statuses"])
        next_id = data["statuses"][len(data["statuses"]) - 1]['id']
    print('Retrieved {0}, waiting for 15 minutes until next queries'.format(len(data["statuses"])))

    if len(data["statuses"]) == 0:
        break
    else:
        d = pd.DataFrame([[s["id"], s["text"].replace('\n','').replace('\r',''), s["user"]["name"], s["user"]["followers_count"], s["retweet_count"], s["retweeted_status"]["favorite_count"], s["created_at"]] for s in data["statuses"]], columns=('ID', 'Text', 'UserName', "UserFollowerCount", 'RetweetCount', 'Likes', "CreatedAt"))
        d.to_csv(f, mode='a', header=True, encoding='utf-8',index=False)
        data["statuses"] = []
    sleep(910)

f.close()

{'statuses': [{'created_at': 'Wed May 16 08:32:12 +0000 2018', 'id': 996669631615131648, 'id_str': '996669631615131648', 'text': 'RT @gin_cash: GIN CASH early backers bounty! Claim 100.000 GIN CASH right now. Simply follow the 3 steps below to get your GIN CASH right a…', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'gin_cash', 'name': 'GIN CASH', 'id': 967878928017838081, 'id_str': '967878928017838081', 'indices': [3, 12]}], 'urls': []}, 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'}, 'source': '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 326046097, 'id_str': '326046097', 'name': 'Lucky Dube Lover', 'screen_name': 'luckydubep', 'location': 'South Africa', 'description': 'South Africa’s most popular and commercially su

KeyboardInterrupt: 

## Preprocessing

Now we will cleanup the data.

We already filtered tweets in english in the call to the Twitter API.
We will now filter links, @Pseudo, images, videos, unhashtag #happy -> happy
We will transform everything to lower case.

You must install `pip install tqdm`

In [2]:
import re # regular expressions
from tqdm import tnrange, tqdm_notebook, tqdm

d = pd.read_csv('tweets_raw.csv')
for i,s in enumerate(tqdm(d['Text'])):
    text = d.loc[i, 'Text']
    text = text.replace("#", "")
    text = re.sub('https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', '', text, flags=re.MULTILINE)
    text = re.sub('@\\w+ *', '', text, flags=re.MULTILINE)
    d.loc[i, 'Text'] = text
f = open('tweets_clean.csv', 'a', encoding='utf-8')
d.to_csv(f, header=True, encoding='utf-8',index=False)

100%|██████████| 481483/481483 [9:30:55<00:00, 14.06it/s]  


In [5]:
df_clean = pd.read_csv('tweets_clean.csv')
df_clean.head(5)

Unnamed: 0,ID,Text,UserName,UserFollowerCount,RetweetCount,CreatedAt
0,995987428166000642,rt : cryptocurrency is not far from mainstream...,Mark Brown,402,1,Mon May 14 11:21:22 +0000 2018
1,995987423271235585,"rt : 10,000 xrp giveaway is now on! 10 lucky w...",Billy Blocks,333,25,Mon May 14 11:21:20 +0000 2018
2,995987420058275841,want to switch your mining between different c...,MaxiMine,404,0,Mon May 14 11:21:20 +0000 2018
3,995987412550606849,day trading: 2 manuscripts: absolute beginners...,Blockchain,17876,0,Mon May 14 11:21:18 +0000 2018
4,995987411472670723,arduino: the comprehensive beginner's guide to...,Blockchain,17876,0,Mon May 14 11:21:18 +0000 2018


In [15]:
df_clean.min(axis=0)

ID                                                  992434490168496133
Text                  bitcoin sees wall street warm to trading virt...
UserFollowerCount                                                    0
RetweetCount                                                         0
CreatedAt                               Fri May 04 16:03:15 +0000 2018
dtype: object