# Data Extraction Documentation
## BS Data Science Project - Recurrent Bitcoin Network

This notebook briefly documents how data is retrieved from [Messari](https://messari.io/) through its [API](https://messari.io/api/docs). The data is collected through a REST API that returns `json` responses.

### Data

The expected output is a data frame of time series data, with each series identified by its columns, which is later split, normalized, and fed to the model for training and testing. This project uses only daily time series data: 2016 to 2020 serves as the training and validation set, and 2021 serves as the test set. Note that this notebook does not retrieve the test set.
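As a rough illustration of the split described above, the sketch below partitions a daily series chronologically and scales it with statistics taken from the training portion only. The toy values, the choice of 2020 as the validation year, and the use of min-max scaling are all assumptions made for the example; the project's actual split and normalization may differ.

```python
import pandas as pd

# Toy daily series standing in for one metric (values are made up).
dates = pd.date_range("2016-01-01", "2021-12-31", freq="D")
frame = pd.DataFrame({"value": [float(i) for i in range(len(dates))]}, index=dates)

# Chronological split: no shuffling, so the test year stays strictly in the future.
train_val = frame.loc["2016-01-01":"2020-12-31"]
test = frame.loc["2021-01-01":"2021-12-31"]

# Hypothetical boundary: hold out 2020 for validation.
train = train_val.loc[:"2019-12-31"]
val = train_val.loc["2020-01-01":]

# Min-max bounds are fitted on the training rows only, then reused elsewhere.
lo, hi = train["value"].min(), train["value"].max()


def scale(df):
    """Apply the training-set min-max scaling to any of the three sets."""
    return (df - lo) / (hi - lo)


train_s, val_s, test_s = scale(train), scale(val), scale(test)
print(len(train), len(val), len(test))
```

Fitting the scaling bounds on the training rows alone keeps information from the validation and test periods out of the preprocessing step, at the cost of scaled values that may fall outside the range 0 to 1 in later years.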

### Prerequisites

Before running this notebook, install the Python packages imported in the following cells. The first cell also changes the working directory to the root of the local repository.

In [1]:
%%capture
# Redirects the current working directory to the project directory
%cd ..

In [2]:
# Built-in packages
import os
import time
from datetime import datetime

# Packages to be installed
import pandas as pd
from tqdm import tqdm

# Local python functions
from src import collect

In [3]:
# Notebook Execution Time
print('Current time: {}'.format(datetime.now()))

Current time: 2021-11-10 15:47:36.875677


### Retrieve Metrics Data From Messari

This section retrieves the complete list of metrics that can be queried as time series.

In [4]:
# os.environ['MESSARI_API_KEY'] = <insert_key_here>

In [5]:
# Check if the API key is set
'MESSARI_API_KEY' in os.environ

True

In [6]:
api_header = {'x-messari-api-key':os.getenv('MESSARI_API_KEY')}
collector = collect.MessariCollector(headers=api_header)

In [7]:
# Check if collector read the API key
collector.headers['x-messari-api-key'] is not None

True
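The two checks above can be folded into a single helper that fails early when the key is missing. This is only a sketch: `build_messari_headers` is a hypothetical name and is not part of the project's `src` package; only the `x-messari-api-key` header name and the `MESSARI_API_KEY` variable come from the cells above.

```python
import os


def build_messari_headers(env_var="MESSARI_API_KEY"):
    """Return the request headers, raising early if the key is not set."""
    api_key = os.getenv(env_var)
    if not api_key:
        # An unset or empty key would otherwise only surface as a failed request.
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return {"x-messari-api-key": api_key}
```

With the key exported in the environment, `build_messari_headers()` returns a dictionary that could be passed as `headers=` in the same way as `api_header` above.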

In [8]:
metrics = collector.get_metrics()

In [9]:
freemetrics = metrics.get_free_metrics(return_df=True)
freemetrics.head(10)

metric_id,name,description,values_schema,minimum_interval,role_restriction,source_attribution
act.addr.cnt,Active Addresses Count,The sum count of unique addresses that were ac...,{'active_addresses': 'The sum count of unique ...,1d,,"[{'name': 'Coinmetrics', 'url': 'https://coinm..."
blk.cnt,Block Count,The sum count of blocks created each day,{'block_count': 'The sum count of blocks creat...,1d,,"[{'name': 'Coinmetrics', 'url': 'https://coinm..."
blk.size.byte,Block Size (bytes),The sum of the size (in bytes) of all blocks c...,{'block_count': 'The sum of the size (in bytes...,1d,,"[{'name': 'Coinmetrics', 'url': 'https://coinm..."
blk.size.bytes.avg,Average Block Size (bytes),The mean size (in bytes) of all blocks created,{'block_count': 'The mean size (in bytes) of a...,1d,,"[{'name': 'Coinmetrics', 'url': 'https://coinm..."
cg.sply.circ,Circulating Supply (CoinGecko),The circulating supply acknowledges that token...,{'circulating_supply': 'The circulating supply...,1d,,"[{'name': 'CoinGecko', 'url': 'https://coingec..."
daily.shp,Sharpe Ratio,The Sharpe ratio (performance of the asset com...,{'sharpe_30d': 'The asset's Sharpe ratio calcu...,1d,,"[{'name': 'Kaiko', 'url': 'https://www.kaiko.c..."
daily.vol,Volatility,The annualized standard-deviation of daily ret...,{'volatility_30d': 'The asset's volatility cal...,1d,,"[{'name': 'Kaiko', 'url': 'https://www.kaiko.c..."
diff.avg,Average Difficulty,The mean difficulty of finding a hash that mee...,{'mean_difficulty': 'The mean difficulty durin...,1d,,"[{'name': 'Coinmetrics', 'url': 'https://coinm..."
exch.flow.in.ntv,Deposits on Exchanges (Native Units),The amount of the asset sent to exchanges that...,{'flow_in': 'The amount of the asset sent to e...,1d,,"[{'name': 'Coinmetrics', 'url': 'https://coinm..."
exch.flow.in.ntv.incl,Deposits on Exchanges - Inclusive (Native Units),The amount of the asset sent to exchanges that...,{'flow_in': 'The amount of the asset sent to e...,1d,,"[{'name': 'Coinmetrics', 'url': 'https://coinm..."


In [10]:
freemetrics.to_csv('raw/freemetrics.csv')

### Retrieve Bitcoin Time Series

In this section, we retrieve the time series for each of the metrics listed above. The `tqdm` progress bar package is used to show the progress of the data collection.

In [11]:
price_ts = collector.get_timeseries('BTC','price')
price_ts

<src.messari.Timeseries at 0x19ffe669fa0>

In [12]:
price_ts.data

,timestamp,open,high,low,close,volume
0,2021-02-28T00:00:00Z,46138.784058,46676.901247,43030.466959,45223.360320,1.029078e+10
1,2021-03-01T00:00:00Z,45169.630736,49807.308916,45033.589248,49611.571197,1.104579e+10
2,2021-03-02T00:00:00Z,49628.987746,50256.036957,47067.573885,48500.723574,8.906216e+09
3,2021-03-03T00:00:00Z,48373.687384,52660.921279,48158.770169,50379.503546,1.084729e+10
4,2021-03-04T00:00:00Z,50448.916556,51803.208006,47512.617153,48372.152036,1.020196e+10
...,...,...,...,...,...,...
250,2021-11-05T00:00:00Z,61443.058359,62635.308486,60778.112164,61016.505544,2.599436e+09
251,2021-11-06T00:00:00Z,61038.386278,61600.546046,60102.705539,61528.835302,2.792822e+09
252,2021-11-07T00:00:00Z,61499.708756,63342.721126,61352.953115,63319.184397,3.859706e+09
253,2021-11-08T00:00:00Z,63335.587420,67804.238190,63300.511622,67553.510528,8.805560e+09


In [13]:
price_ts.get_structured_data()

,timestamp,metric,submetric,value
0,2021-02-28,price,open,4.613878e+04
1,2021-03-01,price,open,4.516963e+04
2,2021-03-02,price,open,4.962899e+04
3,2021-03-03,price,open,4.837369e+04
4,2021-03-04,price,open,5.044892e+04
...,...,...,...,...
1270,2021-11-05,price,volume,2.599436e+09
1271,2021-11-06,price,volume,2.792822e+09
1272,2021-11-07,price,volume,3.859706e+09
1273,2021-11-08,price,volume,8.805560e+09
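The long layout above (one row per timestamp and submetric) can be reproduced from a wide table with `pandas.melt`. The sketch below only illustrates the reshaping idea on made-up values; it is not the actual implementation of `get_structured_data`, which lives in the project's `src` package and may work differently.

```python
import pandas as pd

# Two made-up rows in the wide layout shown for the `price` metric.
wide = pd.DataFrame({
    "timestamp": ["2021-02-28T00:00:00Z", "2021-03-01T00:00:00Z"],
    "open": [1.0, 2.0],
    "close": [1.5, 2.5],
})

# One row per (timestamp, submetric) pair; former column names become values.
tidy = wide.melt(id_vars="timestamp", var_name="submetric", value_name="value")

# Record which metric the submetrics belong to, right after the timestamp.
tidy.insert(1, "metric", "price")

# Parse the ISO 8601 strings and drop the timezone to get naive timestamps.
tidy["timestamp"] = pd.to_datetime(tidy["timestamp"], utc=True).dt.tz_localize(None)
print(tidy)
```

Keeping every metric in the same four-column layout is what allows the per-metric frames to be stacked with a single `pd.concat` later on.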


In [14]:
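from src.exceptions import MessariException

# Free metrics to request, one time series per metric
btc_metrics = metrics.get_free_metrics()
data_list = []

with tqdm(total=len(btc_metrics)) as pbar:
    pbar.set_description('Response [200]')

    for metric in btc_metrics:
        # Retry the same metric until it succeeds or fails for a reason
        # other than rate limiting
        while True:
            result = collector.get_timeseries(
                assetkey='BTC', metric_id=metric,
                start='2016-01-01', end='2020-12-31'
            )

            # Failed requests come back as a MessariException object
            if isinstance(result, MessariException):
                pbar.write('Response [{0}]: {1}'.format(
                    result.error_code, result.error_message
                ))

                # Rate limited: wait out the window, then retry this metric
                if result.error_code == 429:
                    time.sleep(40)
                    continue

                # Any other error: skip this metric without advancing the bar
                break

            data_list.append(result.get_structured_data())
            pbar.update(1)
            break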


Response [200]:   7%|▋         | 4/57 [00:05<00:41,  1.28it/s]

Response [500]: Internal Server Error


Response [200]:  46%|████▌     | 26/57 [00:22<00:36,  1.17s/it]

Response [429]: Rate limit exceeded, retry in 35 seconds. To increase the limit, sign up for Pro on messari.io.


Response [200]:  63%|██████▎   | 36/57 [01:12<00:32,  1.56s/it]

Response [501]: Not Implemented


Response [200]:  95%|█████████▍| 54/57 [01:28<00:01,  1.52it/s]

Response [429]: Rate limit exceeded, retry in 34 seconds. To increase the limit, sign up for Pro on messari.io.


Response [200]:  96%|█████████▋| 55/57 [02:09<00:04,  2.35s/it]
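The retry-on-429 behaviour of the loop above can be isolated into a small helper. The sketch below rests on several assumptions: `fetch` is a hypothetical callable returning a `(status_code, payload)` tuple rather than the project's collector, the wait time is parsed from the "retry in N seconds" text seen in the log, and the number of attempts is capped instead of looping indefinitely.

```python
import re
import time


def retry_delay(message, default=40, margin=5):
    """Seconds to wait, read from a 'retry in N seconds' message if present."""
    match = re.search(r"retry in (\d+) seconds", message)
    return int(match.group(1)) + margin if match else default


def fetch_with_retry(fetch, max_attempts=3, sleep=time.sleep):
    """Call `fetch` until it stops returning 429 or the attempts run out.

    `fetch` is a hypothetical callable returning (status_code, payload).
    """
    for _ in range(max_attempts):
        status, payload = fetch()
        if status != 429:
            return status, payload
        # Rate limited: wait for the advertised window plus a small margin.
        sleep(retry_delay(str(payload)))
    # Still rate limited after the last attempt; hand the response back as is.
    return status, payload
```

Passing `sleep` as a parameter makes the helper easy to exercise without real waiting, and the default of 40 seconds mirrors the fixed delay used in the notebook when no hint is found in the message.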


In [15]:
data = pd.concat(data_list, ignore_index=False)
data

,timestamp,metric,submetric,value
0,2016-01-01,act.addr.cnt,active_addresses,3.167810e+05
1,2016-01-02,act.addr.cnt,active_addresses,4.179660e+05
2,2016-01-03,act.addr.cnt,active_addresses,3.984430e+05
3,2016-01-04,act.addr.cnt,active_addresses,4.131590e+05
4,2016-01-05,act.addr.cnt,active_addresses,4.352910e+05
...,...,...,...,...
1822,2020-12-27,txn.vol,transaction_volume_usd,1.404424e+10
1823,2020-12-28,txn.vol,transaction_volume_usd,1.915253e+10
1824,2020-12-29,txn.vol,transaction_volume_usd,1.927709e+10
1825,2020-12-30,txn.vol,transaction_volume_usd,2.131078e+10


In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 111925 entries, 0 to 1826
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   timestamp  111925 non-null  datetime64[ns]
 1   metric     111925 non-null  object        
 2   submetric  111925 non-null  object        
 3   value      111925 non-null  float64       
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 4.3+ MB


In [17]:
data.pivot_table(
    index='timestamp',
    columns=['metric','submetric'],
    values='value'
).isna().sum().sort_values(ascending=False)[:10]

metric               submetric          
reddit.active.users  active_users           1588
reddit.subscribers   subscribers            1588
act.addr.cnt         active_addresses          0
sply.total.iss       issuance_total_usd        0
new.iss.usd          issuance_usd              0
nvt.adj              nvt_adjusted              0
nvt.adj.90d.ma       nvt_adjusted_90d_ma       0
price                close                     0
                     high                      0
                     low                       0
dtype: int64

In [18]:
data[data['metric'] == 'reddit.active.users']

,timestamp,metric,submetric,value
0,2020-05-07,reddit.active.users,active_users,8122.0
1,2020-05-08,reddit.active.users,active_users,7578.0
2,2020-05-09,reddit.active.users,active_users,5185.0
3,2020-05-10,reddit.active.users,active_users,7958.0
4,2020-05-11,reddit.active.users,active_users,6038.0
...,...,...,...,...
234,2020-12-27,reddit.active.users,active_users,8173.0
235,2020-12-28,reddit.active.users,active_users,7332.0
236,2020-12-29,reddit.active.users,active_users,8200.0
237,2020-12-30,reddit.active.users,active_users,9750.0


In [19]:
data[data['metric'] == 'reddit.subscribers']

,timestamp,metric,submetric,value
0,2020-05-07,reddit.subscribers,subscribers,1405139.0
1,2020-05-08,reddit.subscribers,subscribers,1407416.0
2,2020-05-09,reddit.subscribers,subscribers,1408076.0
3,2020-05-10,reddit.subscribers,subscribers,1410046.0
4,2020-05-11,reddit.subscribers,subscribers,1411910.0
...,...,...,...,...
234,2020-12-27,reddit.subscribers,subscribers,1849399.0
235,2020-12-28,reddit.subscribers,subscribers,1852774.0
236,2020-12-29,reddit.subscribers,subscribers,1855755.0
237,2020-12-30,reddit.subscribers,subscribers,1858874.0


Both Reddit metrics only start on 2020-05-07, so 1588 of the daily timestamps in the 2016 to 2020 window have no value for them. For consistency, we drop these two metrics.
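The missing-value count of 1588 reported for the two Reddit metrics can be cross-checked with plain date arithmetic. The sketch assumes that both series run daily from their first timestamp (2020-05-07) through the end of the requested window (2020-12-31); `days_inclusive` is a helper name introduced only for this example.

```python
from datetime import date


def days_inclusive(start, end):
    """Number of calendar days from start to end, counting both endpoints."""
    return (end - start).days + 1


# Full requested window versus the span covered by the Reddit metrics.
expected = days_inclusive(date(2016, 1, 1), date(2020, 12, 31))
available = days_inclusive(date(2020, 5, 7), date(2020, 12, 31))
missing = expected - available

print(expected, available, missing)  # 1827 239 1588
```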

In [20]:
filter_users = data['metric'] != 'reddit.active.users'
filter_subs = data['metric'] != 'reddit.subscribers'
data = data[filter_users & filter_subs]

In [21]:
data.to_csv('raw/data.csv', index=False)