# Bitcoin Price Prediction - Documentation
## BS Data Science Project - Recurrent Bitcoin Network

This notebook briefly documents how data is retrieved from [Messari](https://messari.io/) through its [API](https://messari.io/api/docs). Data is collected through a REST API that returns `json` responses.

### Data

The expected data is a data frame of time series, one metric per column, which is then split, normalized, and fed to the model for training and testing. This project uses only daily time series data: 2016 to 2020 serves as the training and validation set, and 2021 as the test set. Note that this notebook does not include the retrieval of the test set.
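
A minimal sketch of the split-then-normalize step described above, using only the standard library and toy values rather than real prices. The helper names are hypothetical, and z-score standardization is an assumption, since the text does not say which normalization the project uses. The point it illustrates is that scaling statistics come from the training period only and are then reused for the test period.

```python
from datetime import date
from statistics import mean, pstdev


def split_by_date(rows, test_start):
    """Split (day, value) pairs chronologically at test_start."""
    train = [(d, v) for d, v in rows if d < test_start]
    test = [(d, v) for d, v in rows if d >= test_start]
    return train, test


def fit_standardizer(values):
    """Return (mean, std) computed on the given values only."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        sigma = 1.0  # avoid division by zero for constant series
    return mu, sigma


def standardize(values, mu, sigma):
    """Apply z-score scaling with previously fitted statistics."""
    return [(v - mu) / sigma for v in values]


# Toy series (not real data): two training days and one test day
rows = [
    (date(2020, 12, 30), 10.0),
    (date(2020, 12, 31), 14.0),
    (date(2021, 1, 1), 18.0),
]
train, test = split_by_date(rows, test_start=date(2021, 1, 1))

# Statistics are fitted on the training values only (assumed z-score scaling)
mu, sigma = fit_standardizer([v for _, v in train])
train_scaled = standardize([v for _, v in train], mu, sigma)
test_scaled = standardize([v for _, v in test], mu, sigma)
```

Fitting on the training values alone keeps information from 2021 out of the preprocessing, which matters when 2021 is held out as the test set.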

### Prerequisites

Before running this notebook, several Python packages need to be installed, as shown in the import cell below. In addition, the working directory is changed to the root of the local repository.

In [1]:
%%capture
# Redirects the current working directory to the project directory
%cd ..

In [2]:
# Built-in packages
import os
import time
from datetime import datetime

# Packages to be installed
import pandas as pd
from tqdm import tqdm

# Local python functions
from src import collect

In [3]:
# Notebook Execution Time
print('Current time: {}'.format(datetime.now()))

Current time: 2021-11-14 14:05:27.992160


### Retrieve Metrics Data From Messari

This section retrieves the complete list of metrics that can be queried as time series.

To speed up data collection, it is recommended to apply for an API key from Messari, which allows more requests per minute. A key is not required, however: data can still be requested at a lower rate. To use a key, set the `MESSARI_API_KEY` environment variable.

A Messari account is required to obtain an API key.

In [4]:
# os.environ['MESSARI_API_KEY'] = <insert_key_here>

In [5]:
# Check if the API key is set
'MESSARI_API_KEY' in os.environ

True

In [6]:
api_header = {'x-messari-api-key':os.getenv('MESSARI_API_KEY')}
collector = collect.MessariCollector(headers=api_header)

In [7]:
# Check if collector read the API key
collector.headers['x-messari-api-key'] is not None

True
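
Since the API key is optional, the header construction above could also be written so that the key header is sent only when a key is actually configured. The sketch below assumes this behaviour is desired; `build_headers` is a hypothetical helper and is not part of `src.collect`.

```python
import os


def build_headers(env=os.environ):
    """Return request headers, adding the API key only when one is set."""
    key = env.get('MESSARI_API_KEY')
    return {'x-messari-api-key': key} if key else {}


# Plain dicts stand in for the environment so the example has no side effects
headers_with_key = build_headers({'MESSARI_API_KEY': 'example-key'})
headers_without_key = build_headers({})
```

Passing the environment as an argument makes the helper easy to check without touching the real `os.environ`.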

In [8]:
metrics = collector.get_metrics()

In [9]:
freemetrics = metrics.get_free_metrics(return_df=True)
freemetrics.head(10)

metric_id,name,description,values_schema,minimum_interval,role_restriction,source_attribution
act.addr.cnt,Active Addresses Count,The sum count of unique addresses that were ac...,{'active_addresses': 'The sum count of unique ...,1d,,"[{'asset_id': '', 'name': 'Coinmetrics', 'url'..."
blk.cnt,Block Count,The sum count of blocks created each day,{'block_count': 'The sum count of blocks creat...,1d,,"[{'asset_id': '', 'name': 'Coinmetrics', 'url'..."
blk.size.byte,Block Size (bytes),The sum of the size (in bytes) of all blocks c...,{'block_count': 'The sum of the size (in bytes...,1d,,"[{'asset_id': '', 'name': 'Coinmetrics', 'url'..."
blk.size.bytes.avg,Average Block Size (bytes),The mean size (in bytes) of all blocks created,{'block_count': 'The mean size (in bytes) of a...,1d,,"[{'asset_id': '', 'name': 'Coinmetrics', 'url'..."
cg.sply.circ,Circulating Supply (CoinGecko),The circulating supply acknowledges that token...,{'circulating_supply': 'The circulating supply...,1d,,"[{'asset_id': '', 'name': 'CoinGecko', 'url': ..."
daily.shp,Sharpe Ratio,The Sharpe ratio (performance of the asset com...,{'sharpe_30d': 'The asset's Sharpe ratio calcu...,1d,,"[{'asset_id': '', 'name': 'Kaiko', 'url': 'htt..."
daily.vol,Volatility,The annualized standard-deviation of daily ret...,{'volatility_30d': 'The asset's volatility cal...,1d,,"[{'asset_id': '', 'name': 'Kaiko', 'url': 'htt..."
diff.avg,Average Difficulty,The mean difficulty of finding a hash that mee...,{'mean_difficulty': 'The mean difficulty durin...,1d,,"[{'asset_id': '', 'name': 'Coinmetrics', 'url'..."
exch.flow.in.ntv,Deposits on Exchanges (Native Units),The amount of the asset sent to exchanges that...,{'flow_in': 'The amount of the asset sent to e...,1d,,"[{'asset_id': '', 'name': 'Coinmetrics', 'url'..."
exch.flow.in.ntv.incl,Deposits on Exchanges - Inclusive (Native Units),The amount of the asset sent to exchanges that...,{'flow_in': 'The amount of the asset sent to e...,1d,,"[{'asset_id': '', 'name': 'Coinmetrics', 'url'..."


In [10]:
freemetrics.to_csv('raw/freemetrics.csv')

### Retrieve Bitcoin Time Series

In this section, we retrieve time series data for the metrics retrieved above. The `tqdm` progress bar package is used to show the progress of data collection.

In [11]:
price_ts = collector.get_timeseries('BTC','price')
price_ts

<src.messari.Timeseries at 0x14a7e23b3a0>
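
For orientation, a call such as `get_timeseries('BTC', 'price')` presumably maps to a single REST request. The sketch below only builds the request URL and sends nothing. The base URL, path layout, and the `start`, `end`, and `interval` query parameters are assumptions based on Messari's v1 time series endpoint and may not match the current API or what `src.collect` does internally.

```python
from urllib.parse import urlencode

# Assumed base URL of the Messari v1 API; check the API docs before relying on it
BASE_URL = 'https://data.messari.io/api/v1'


def timeseries_url(asset_key, metric_id, start=None, end=None, interval='1d'):
    """Build a time series request URL (hypothetical helper)."""
    params = {'start': start, 'end': end, 'interval': interval}
    # Drop parameters that were not provided
    query = urlencode({k: v for k, v in params.items() if v is not None})
    path = f'{BASE_URL}/assets/{asset_key}/metrics/{metric_id}/time-series'
    return f'{path}?{query}' if query else path


url = timeseries_url('BTC', 'price', start='2016-01-01', end='2020-12-31')
```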

In [12]:
price_ts.data

Unnamed: 0,timestamp,open,high,low,close,volume
0,2021-03-04T00:00:00Z,50448.916556,51803.208006,47512.617153,48372.152036,1.020196e+10
1,2021-03-05T00:00:00Z,48462.608092,49473.313503,46283.111629,48774.489197,9.811878e+09
2,2021-03-06T00:00:00Z,48861.319414,49204.761057,47069.052607,48900.346470,5.712650e+09
3,2021-03-07T00:00:00Z,48876.110706,51467.658744,48860.882057,50984.953490,7.355125e+09
4,2021-03-08T00:00:00Z,50963.938090,52416.483527,49308.809225,52402.098930,9.195889e+09
...,...,...,...,...,...,...
250,2021-11-09T00:00:00Z,67562.349405,68551.258452,66232.561006,66952.038557,7.105142e+09
251,2021-11-10T00:00:00Z,66942.304985,69006.394874,62835.075853,64916.888377,1.153238e+10
252,2021-11-11T00:00:00Z,64996.845356,65630.080829,64125.133220,64815.879679,5.846033e+09
253,2021-11-12T00:00:00Z,64871.913631,65476.742695,62290.531668,64148.801062,7.811470e+09


In [13]:
price_ts.get_structured_data()

Unnamed: 0,timestamp,metric,submetric,value
0,2021-03-04,price,open,5.044892e+04
1,2021-03-05,price,open,4.846261e+04
2,2021-03-06,price,open,4.886132e+04
3,2021-03-07,price,open,4.887611e+04
4,2021-03-08,price,open,5.096394e+04
...,...,...,...,...
1270,2021-11-09,price,volume,7.105142e+09
1271,2021-11-10,price,volume,1.153238e+10
1272,2021-11-11,price,volume,5.846033e+09
1273,2021-11-12,price,volume,7.811470e+09


In [14]:
from src.exceptions import MessariException

btc_metrics = metrics.get_free_metrics()
data_list = list()


with tqdm(total=len(btc_metrics)) as pbar:
    pbar.set_description('Response [200]')

    for metric in btc_metrics:
        while True:
            result = collector.get_timeseries(
                assetkey='BTC', metric_id=metric,
                start='2016-01-01', end='2020-12-31'
            )

            if isinstance(result, MessariException):
                pbar.write('Response [{0}]: {1}'.format(
                    result.error_code, result.error_message
                ))

                if result.error_code == 429:
                    time.sleep(result.cooldown)
                    continue
                
                break

            data_list.append(result.get_structured_data())
            pbar.update(1)
            break


Response [200]:   7%|▋         | 4/57 [00:04<00:29,  1.81it/s]

Response [500]: Internal Server Error


Response [200]:  46%|████▌     | 26/57 [00:24<00:49,  1.59s/it]

Response [429]: Rate limit exceeded, retry in 33 seconds. To increase the limit, sign up for Pro on messari.io.


Response [200]:  63%|██████▎   | 36/57 [01:16<01:25,  4.07s/it]

Response [501]: Not Implemented


Response [200]:  95%|█████████▍| 54/57 [01:30<00:02,  1.21it/s]

Response [429]: Rate limit exceeded, retry in 28 seconds. To increase the limit, sign up for Pro on messari.io.


Response [200]:  96%|█████████▋| 55/57 [02:00<00:04,  2.20s/it]
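
The progress bar stops at 55 of 57 because two metrics failed with non-retryable errors (500 and 501), while the two 429 responses were retried after the reported cooldown. The loop above retries 429 responses indefinitely. A bounded variant is sketched below under the assumption that giving up after a fixed number of attempts is acceptable. `fetch_with_retry` and the `(status, payload, cooldown)` tuple are hypothetical and stand in for the collector and `MessariException`.

```python
import time


def fetch_with_retry(fetch, max_attempts=5, sleep=time.sleep):
    """Call fetch() and retry only on rate limiting, at most max_attempts times.

    fetch is a hypothetical callable returning (status, payload, cooldown).
    Returns the payload on status 200, otherwise None.
    """
    for _ in range(max_attempts):
        status, payload, cooldown = fetch()
        if status == 200:
            return payload
        if status != 429:
            return None  # non-retryable error such as 500 or 501
        sleep(cooldown)  # rate limited: wait for the reported cooldown
    return None  # still rate limited after max_attempts


# Simulated responses: one rate limit, then success
responses = iter([(429, None, 33), (200, 'rows', 0)])
waits = []  # records requested sleep durations instead of actually sleeping
result = fetch_with_retry(lambda: next(responses), sleep=waits.append)
```

Injecting `sleep` keeps the example fast to run and makes the waiting behaviour easy to inspect.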


In [15]:
data = pd.concat(data_list, ignore_index=False)
data

Unnamed: 0,timestamp,metric,submetric,value
0,2016-01-01,act.addr.cnt,active_addresses,3.167810e+05
1,2016-01-02,act.addr.cnt,active_addresses,4.179660e+05
2,2016-01-03,act.addr.cnt,active_addresses,3.984430e+05
3,2016-01-04,act.addr.cnt,active_addresses,4.131590e+05
4,2016-01-05,act.addr.cnt,active_addresses,4.352910e+05
...,...,...,...,...
1822,2020-12-27,txn.vol,transaction_volume_usd,1.404424e+10
1823,2020-12-28,txn.vol,transaction_volume_usd,1.915253e+10
1824,2020-12-29,txn.vol,transaction_volume_usd,1.927709e+10
1825,2020-12-30,txn.vol,transaction_volume_usd,2.131078e+10


In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 111925 entries, 0 to 1826
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   timestamp  111925 non-null  datetime64[ns]
 1   metric     111925 non-null  object        
 2   submetric  111925 non-null  object        
 3   value      111925 non-null  float64       
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 4.3+ MB


In [17]:
data.pivot_table(
    index='timestamp',
    columns=['metric','submetric'],
    values='value'
).isna().sum().sort_values(ascending=False)[:10]

metric               submetric          
reddit.active.users  active_users           1588
reddit.subscribers   subscribers            1588
act.addr.cnt         active_addresses          0
sply.total.iss       issuance_total_usd        0
new.iss.usd          issuance_usd              0
nvt.adj              nvt_adjusted              0
nvt.adj.90d.ma       nvt_adjusted_90d_ma       0
price                close                     0
                     high                      0
                     low                       0
dtype: int64

In [18]:
data[data['metric'] == 'reddit.active.users']

Unnamed: 0,timestamp,metric,submetric,value
0,2020-05-07,reddit.active.users,active_users,8122.0
1,2020-05-08,reddit.active.users,active_users,7578.0
2,2020-05-09,reddit.active.users,active_users,5185.0
3,2020-05-10,reddit.active.users,active_users,7958.0
4,2020-05-11,reddit.active.users,active_users,6038.0
...,...,...,...,...
234,2020-12-27,reddit.active.users,active_users,8173.0
235,2020-12-28,reddit.active.users,active_users,7332.0
236,2020-12-29,reddit.active.users,active_users,8200.0
237,2020-12-30,reddit.active.users,active_users,9750.0


In [19]:
data[data['metric'] == 'reddit.subscribers']

Unnamed: 0,timestamp,metric,submetric,value
0,2020-05-07,reddit.subscribers,subscribers,1405139.0
1,2020-05-08,reddit.subscribers,subscribers,1407416.0
2,2020-05-09,reddit.subscribers,subscribers,1408076.0
3,2020-05-10,reddit.subscribers,subscribers,1410046.0
4,2020-05-11,reddit.subscribers,subscribers,1411910.0
...,...,...,...,...
234,2020-12-27,reddit.subscribers,subscribers,1849399.0
235,2020-12-28,reddit.subscribers,subscribers,1852774.0
236,2020-12-29,reddit.subscribers,subscribers,1855755.0
237,2020-12-30,reddit.subscribers,subscribers,1858874.0


Both metrics have incomplete data: they are missing for the earlier timestamps, with values starting only on 2020-05-07. For consistency, we drop these metrics.

In [20]:
filter_users = data['metric'] != 'reddit.active.users'
filter_subs = data['metric'] != 'reddit.subscribers'
data = data[filter_users & filter_subs]
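
The filter above names the two metrics explicitly. A more general version would drop any metric that lacks a row for every expected day. The sketch below works on plain tuples in the same long format (`timestamp`, `metric`, `submetric`, `value`) with toy rows. The helper name and the coverage rule are assumptions, not the project's code.

```python
from collections import Counter


def metrics_with_full_coverage(rows, expected_days):
    """Return metrics whose every submetric has one row per expected day."""
    counts = Counter((metric, submetric) for _, metric, submetric, _ in rows)
    incomplete = {metric for (metric, _), n in counts.items()
                  if n < expected_days}
    return {metric for metric, _ in counts} - incomplete


# Toy rows: the second metric is missing its first day
rows = [
    ('2016-01-01', 'blk.cnt', 'block_count', 135.0),
    ('2016-01-02', 'blk.cnt', 'block_count', 145.0),
    ('2016-01-02', 'reddit.subscribers', 'subscribers', 100.0),
]
keep = metrics_with_full_coverage(rows, expected_days=2)
filtered = [row for row in rows if row[1] in keep]
```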

In [21]:
data

Unnamed: 0,timestamp,metric,submetric,value
0,2016-01-01,act.addr.cnt,active_addresses,3.167810e+05
1,2016-01-02,act.addr.cnt,active_addresses,4.179660e+05
2,2016-01-03,act.addr.cnt,active_addresses,3.984430e+05
3,2016-01-04,act.addr.cnt,active_addresses,4.131590e+05
4,2016-01-05,act.addr.cnt,active_addresses,4.352910e+05
...,...,...,...,...
1822,2020-12-27,txn.vol,transaction_volume_usd,1.404424e+10
1823,2020-12-28,txn.vol,transaction_volume_usd,1.915253e+10
1824,2020-12-29,txn.vol,transaction_volume_usd,1.927709e+10
1825,2020-12-30,txn.vol,transaction_volume_usd,2.131078e+10


In [22]:
data.to_csv('raw/data.csv', index=False)

### Creating Batch Dataset

In [23]:
from tensorflow.keras.preprocessing import timeseries_dataset_from_array as tsa
from tensorflow.data.experimental import save as tf_save

In [24]:
wide_data = data.pivot_table(index='timestamp', columns=['metric', 'submetric'], values='value')
prices = wide_data.pop('price')

In [25]:
wide_data

metric,act.addr.cnt,blk.cnt,blk.size.byte,blk.size.bytes.avg,daily.shp,daily.shp,daily.shp,daily.shp,daily.vol,daily.vol,...,txn.fee.med.ntv,txn.tfr.avg.ntv,txn.tfr.val.adj.ntv,txn.tfr.val.med,txn.tfr.val.med.ntv,txn.tfr.val.ntv,txn.tsfr.cnt,txn.tsfr.val.adj,txn.tsfr.val.avg,txn.vol
submetric,active_addresses,block_count,block_count,block_count,sharpe_1yr,sharpe_30d,sharpe_3yr,sharpe_90d,volatility_1yr,volatility_30d,...,transaction_fee_median,transfer_value_avg,transaction_volume_adjusted,transfer_value_median_usd,transfer_value_median,transaction_volume,transfers_count,adjusted_transfer_value_usd,average_transfer_value_usd,transaction_volume_usd
timestamp,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
2016-01-01,316781.0,135.0,66544854.0,4.929248e+05,0.869196,3.254284,1.604864,3.685772,0.685757,0.640308,...,0.000100,2.716966,174515.818449,5.086063,0.011701,8.710050e+05,320580.0,7.585819e+07,1181.005400,3.786067e+08
2016-01-02,417966.0,145.0,83567789.0,5.763296e+05,0.903431,3.742549,1.609291,3.635713,0.685251,0.631588,...,0.000100,1.765765,189629.972730,2.997635,0.006900,7.161979e+05,405602.0,8.238282e+07,767.118818,3.111449e+08
2016-01-03,398443.0,145.0,82041255.0,5.658018e+05,0.885630,3.463730,1.612791,3.592289,0.685339,0.633846,...,0.000100,1.845884,232482.193500,2.204723,0.005126,7.984019e+05,432531.0,9.999900e+07,793.981305,3.434215e+08
2016-01-04,413159.0,181.0,99672880.0,5.506789e+05,1.075561,3.520499,1.612907,3.596404,0.675956,0.633926,...,0.000100,2.201920,258630.297507,7.544331,0.017405,9.237892e+05,419538.0,1.121030e+08,954.420051,4.004155e+08
2016-01-05,435291.0,157.0,100644706.0,6.410491e+05,1.157387,2.312140,1.609834,3.442895,0.673562,0.592641,...,0.000100,2.872012,335747.819336,8.489105,0.019620,1.258329e+06,438135.0,1.452678e+08,1242.632112,5.444406e+08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-12-27,1049825.0,148.0,193716651.0,1.308896e+06,1.983447,8.207496,0.326527,6.766483,0.752805,0.686434,...,0.000158,0.657102,241838.497017,137.846432,0.005215,5.313577e+05,808638.0,6.391999e+09,17367.771562,1.404424e+10
2020-12-28,1154354.0,160.0,206766595.0,1.292291e+06,1.994093,7.835167,0.332238,6.742289,0.753484,0.690959,...,0.000164,0.809730,397621.711422,149.061442,0.005513,7.083207e+05,874761.0,1.075143e+10,21894.585318,1.915253e+10
2020-12-29,1146131.0,152.0,194179955.0,1.277500e+06,1.987428,6.938161,0.380683,6.615674,0.753403,0.697172,...,0.000164,0.765763,339224.096073,153.351433,0.005631,7.079045e+05,924443.0,9.237480e+09,20852.655461,1.927709e+10
2020-12-30,1221579.0,172.0,215838093.0,1.254873e+06,2.077002,7.061308,0.362164,7.173090,0.753598,0.647848,...,0.000171,0.757514,335274.336904,158.645375,0.005500,7.388130e+05,975313.0,9.670859e+09,21850.191708,2.131078e+10


In [26]:
prices

submetric,close,high,low,open,volume
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2016-01-01,433.960345,437.841344,426.068766,430.206620,2.099646e+07
2016-01-02,433.221343,436.820867,431.542401,434.015456,1.393225e+07
2016-01-03,429.132867,433.727409,422.757469,433.212063,2.396948e+07
2016-01-04,433.152291,435.231337,428.233668,429.401480,2.528594e+07
2016-01-05,432.001391,434.548442,428.386082,432.978697,1.956450e+07
...,...,...,...,...,...
2020-12-27,26250.457794,28375.198389,25748.450880,26453.928239,1.043023e+10
2020-12-28,27032.289152,27469.396137,26069.019129,26248.564564,5.653047e+09
2020-12-29,27360.005185,27387.310760,25832.269524,27036.832984,5.558736e+09
2020-12-30,28886.315853,28999.010850,27337.600922,27363.633892,7.442424e+09


In [27]:
X = wide_data.values
y = prices['open'].values

In [28]:
print(X.shape)
print(y.shape)

(1827, 56)
(1827,)


In [29]:
dataset = tsa(data=X, targets=y, sequence_length=56,
              sequence_stride=7, batch_size=1)
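
The window arithmetic behind this call can be checked without TensorFlow. The sketch below computes the window start indices for 1827 rows, a sequence length of 56, and a stride of 7. As documented for `timeseries_dataset_from_array`, the window starting at index `i` is paired with `targets[i]`, so with `targets=y` each window is paired with the opening price of its own first day. If the intent is instead to predict the opening price of the day after each window (an assumption about the goal), the inputs and targets need to be shifted as in `shift_for_forecast`. Both helper names are hypothetical.

```python
def window_starts(n_rows, sequence_length, sequence_stride):
    """Start indices of all full windows of length sequence_length."""
    last_start = n_rows - sequence_length
    return list(range(0, last_start + 1, sequence_stride))


def shift_for_forecast(X, y, sequence_length):
    """Align data so the window starting at i is paired with y[i + sequence_length]."""
    return X[:-sequence_length], y[sequence_length:]


# Values taken from the shapes printed above
starts = window_starts(n_rows=1827, sequence_length=56, sequence_stride=7)
n_windows = len(starts)
last_window = (starts[-1], starts[-1] + 56 - 1)  # first and last row index

# Small toy example of the shifted alignment
X_toy = list(range(10))
y_toy = list(range(10))
X_in, y_out = shift_for_forecast(X_toy, y_toy, sequence_length=3)
```

With the shifted arrays, the first window covers `X_in[0:3]` and its target `y_out[0]` is the value that immediately follows that window.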

In [30]:
tf_save(dataset, 'raw/datasetbatch')