# Data Extraction Documentation
## BS Data Science Project - Recurrent Bitcoin Network

This notebook briefly documents how data is retrieved from [Messari](https://messari.io/) through its [API](https://messari.io/api/docs). The data is collected through a REST API that returns `json` responses.

### Data

The expected data is a data frame of time series, one metric per column, which is later split, normalized, and fed into model training and testing. This project uses only daily time series data: 2016 to 2020 serves as the training and validation set, while 2021 serves as the test set. Note that this notebook does not retrieve the test set.
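
The split and normalization described above could look like the following minimal sketch. The toy frame, the `value` column, the 2020 validation cutoff, and the choice of min-max scaling are all assumptions made for illustration; this notebook does not specify the project's actual split points or scaler.

```python
import pandas as pd

# Toy daily series standing in for the real data (values are made up)
idx = pd.date_range("2016-01-01", "2020-12-31", freq="D", tz="UTC")
df = pd.DataFrame({"timestamp": idx, "value": range(len(idx))})

# Chronological split (assumed cutoff): rows before 2020 are training data
cutoff = pd.Timestamp("2020-01-01", tz="UTC")
train = df[df["timestamp"] < cutoff]
valid = df[df["timestamp"] >= cutoff]

# Min-max scaling fitted on the training set only, so that no
# validation statistics leak into training
lo, hi = train["value"].min(), train["value"].max()
train_scaled = (train["value"] - lo) / (hi - lo)
valid_scaled = (valid["value"] - lo) / (hi - lo)

print(len(train), len(valid))
```

Because the scaler is fitted on the training rows only, validation values can fall outside the `[0, 1]` range, which is expected for a trending series.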

### Prerequisites

Before running this notebook, the third-party Python packages imported below must be installed. The working directory is also changed to the `src` directory of the local repository.

In [1]:
%%capture
# Redirects the current working directory to the `/src/` directory
%cd ../src

In [2]:
# Built-in packages
import time
from datetime import datetime
from functools import reduce

# Packages to be installed
import pandas as pd
import requests
from tqdm import tqdm

# Local python functions
from messari import get_asset_timeseries
from formatting import get_table

### Retrieve Metrics Data From Messari

This section retrieves the complete list of metrics that can be queried as time series.

In [3]:
# Creates a session to save cookies and headers
sess = requests.session()

In [4]:
%%time
metrics = sess.get('https://data.messari.io/api/v1/assets/metrics')
metrics

Wall time: 8.72 s


In [None]:
df_metrics = pd.DataFrame(metrics.json()['data']['metrics'])
df_metrics.head()

In the `role_restriction` column, some values are missing. A missing value indicates that the metric has no access restriction. Therefore, we select all metrics whose `role_restriction` is missing and save them to a `csv` file.

In [5]:
free_metrics = df_metrics[df_metrics['role_restriction'].isna()]
free_metrics = free_metrics.reset_index(drop=True)

In [6]:
free_metrics.head()

| | metric_id | name | description | values_schema | minimum_interval | role_restriction | source_attribution |
|---|---|---|---|---|---|---|---|
| 0 | txn.fee.med.ntv | Median Transaction Fees (Native Units) | The median fee per transaction in native units... | {'transaction_fee_median': 'The median fee per... | 1d | | [{'name': 'Coinmetrics', 'url': 'https://coinm... |
| 1 | cg.sply.circ | Circulating Supply (CoinGecko) | The circulating supply acknowledges that token... | {'circulating_supply': 'The circulating supply... | 1d | | [{'name': 'CoinGecko', 'url': 'https://coingec... |
| 2 | daily.vol | Volatility | The annualized standard-deviation of daily ret... | {'volatility_30d': 'The asset's volatility cal... | 1d | | [{'name': 'Kaiko', 'url': 'https://www.kaiko.c... |
| 3 | txn.tsfr.val.adj | Adjusted Transaction Volume | The sum USD value of all native units transfer... | {'adjusted_transfer_value_usd': 'The USD value... | 1d | | [{'name': 'Coinmetrics', 'url': 'https://coinm... |
| 4 | exch.sply | Supply on Exchanges (Native Units) | The sum of all native units held in hot or col... | {'supply_usd': 'The sum of all native units he... | 1d | | [{'name': 'Coinmetrics', 'url': 'https://coinm... |


In [7]:
free_metrics.to_csv('../raw/messari_freemetrics.csv', index=False)

### Retrieve Bitcoin Time Series

In this section, we retrieve the time series data for each of the metrics selected above. The `tqdm` progress bar package is used to show the progress of the data collection.
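
The local helper `get_asset_timeseries` is not shown in this notebook. As a rough illustration of the kind of request it presumably issues, the sketch below only builds a time series URL. The endpoint layout and the `start`, `end`, and `interval` parameter names are assumptions based on Messari's public v1 API, and `build_timeseries_url` is a hypothetical name, not the project's actual function.

```python
from urllib.parse import urlencode

# Same host as the metrics request made earlier in this notebook
BASE_URL = "https://data.messari.io/api/v1"


def build_timeseries_url(assetkey, metric_id, start, end, interval="1d"):
    """Return the URL for one metric's time series (assumed endpoint layout)."""
    path = f"{BASE_URL}/assets/{assetkey}/metrics/{metric_id}/time-series"
    query = urlencode({"start": start, "end": end, "interval": interval})
    return f"{path}?{query}"


url = build_timeseries_url("BTC", "txn.fee.med.ntv", "2016-01-01", "2020-12-31")
print(url)
```

Keeping URL construction separate from the HTTP call makes it easy to inspect the exact request before spending any of the API's rate limit.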

In [11]:
%%time
data_list = []

pbar = tqdm(free_metrics['metric_id'])
for metric in pbar:
    pbar.set_description('Now retrieving: {}'.format(metric))
    success = False
    while not success:
        res = get_asset_timeseries(
            assetkey='BTC', metric_id=metric, sess=sess
        )
        if res.status_code == 200:
            # Success: store the table and pause briefly between requests
            data_list.append(get_table(res))
            success = True
            time.sleep(1)
        elif res.status_code // 100 == 5:
            # Server error (5xx): log it and skip this metric
            pbar.write(
                '[Request {0}] at {1}'.format(
                    res.status_code, metric
                )
            )
            time.sleep(5)
            break
        else:
            # Any other status: wait, then retry the same metric
            pbar.set_description(
                '[Request {0}] | Now retrieving: {1}'.format(
                    res.status_code, metric
                )
            )
            time.sleep(10)



Now retrieving: cg.sply.circ:   2%|▏         | 1/57 [00:05<02:05,  2.24s/it]

[Request 500] at cg.sply.circ


Now retrieving: sply.liquid:  37%|███▋      | 21/57 [01:28<02:34,  4.30s/it]

[Request 501] at sply.liquid


Now retrieving: blk.size.bytes.avg: 100%|██████████| 57/57 [03:29<00:00,  3.67s/it]

Wall time: 3min 29s
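
The loop above retries indefinitely on statuses other than 200 and 5xx. A bounded variant of the same policy might look like the sketch below. `fetch_with_retry` and its parameters are hypothetical, and plain dictionaries stand in for `requests` response objects so the example runs without network access.

```python
import time


def fetch_with_retry(fetch, max_attempts=3, wait=0.0):
    """Call `fetch()` until it reports status 200 or attempts run out.

    Returns the successful response, or None when a 5xx status is seen
    (the metric is skipped) or the attempt budget is exhausted.
    """
    for _ in range(max_attempts):
        res = fetch()
        if res["status_code"] == 200:
            return res
        if res["status_code"] // 100 == 5:
            return None  # server-side error: give up on this metric
        time.sleep(wait)  # any other status: wait, then retry
    return None


# Fake responses standing in for real HTTP calls
responses = iter([{"status_code": 429}, {"status_code": 200, "data": [1, 2]}])
result = fetch_with_retry(lambda: next(responses))
print(result)
```

Capping the number of attempts trades completeness for a guaranteed finish time, which can matter when a metric is persistently unavailable.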





To combine the separately queried tables, we apply the `reduce` function to merge them pairwise on their `timestamp` column.

In [12]:
data = reduce(
    lambda left, right: pd.merge(left, right, on='timestamp', how='outer'),
    data_list
)
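
To see what the outer merge does, the same `reduce` pattern can be run on two toy tables. The frames `a` and `b` and their values are made up for illustration; only the merge call mirrors the cell above.

```python
from functools import reduce

import pandas as pd

# Two toy per-metric tables with partially overlapping timestamps
a = pd.DataFrame({"timestamp": ["2016-01-01", "2016-01-02"], "a": [1.0, 2.0]})
b = pd.DataFrame({"timestamp": ["2016-01-02", "2016-01-03"], "b": [3.0, 4.0]})

merged = reduce(
    lambda left, right: pd.merge(left, right, on="timestamp", how="outer"),
    [a, b],
)
print(merged)
```

With `how='outer'`, every timestamp from either table is kept, and a metric that has no value on a given day gets `NaN` there. This is why missing values have to be checked after merging.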

In [13]:
data.head()

| | timestamp | txn.fee.med.ntv - transaction_fee_median | daily.vol - volatility_30d | daily.vol - volatility_90d | daily.vol - volatility_1yr | daily.vol - volatility_3yr | txn.tsfr.val.adj - adjusted_transfer_value_usd | exch.sply - supply_usd | min.rev.usd - revenue_usd | diff.avg - mean_difficulty | ... | txn.vol - transaction_volume_usd | exch.flow.in.usd.incl - flow_in_usd | fees.ntv - fees_total | txn.fee.avg - transaction_fee_avg | txn.fee.avg.ntv - transaction_fee_avg_ntv | hashrate - hash_rate | txn.tfr.val.med.ntv - transfer_value_median | new.iss.usd - issuance_usd | mcap.circ - circulating_marketcap | blk.size.bytes.avg - block_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 00:00:00+00:00 | 0.0001 | 0.640308 | 0.677785 | 0.685757 | 1.028978 | 75858190.0 | 480945.387379 | 0 | 103880300000.0 | ... | 378606700.0 | 13988450.0 | 19.872695 | 0.069424 | 0.00016 | 697129.166406 | 0.011701 | 1467038.0 | 6468191000.0 | 492924.844444 |
| 1 | 2016-01-02 00:00:00+00:00 | 0.0001 | 0.631588 | 0.678005 | 0.685251 | 1.028922 | 82382820.0 | 481485.884222 | 0 | 103880300000.0 | ... | 311144900.0 | 11414710.0 | 31.235525 | 0.091773 | 0.000211 | 748768.363917 | 0.0069 | 1574845.0 | 6527223000.0 | 576329.57931 |
| 2 | 2016-01-03 00:00:00+00:00 | 0.0001 | 0.633846 | 0.67852 | 0.685339 | 1.028857 | 99999000.0 | 488982.494341 | 0 | 103880300000.0 | ... | 343421500.0 | 16981440.0 | 24.473062 | 0.072584 | 0.000169 | 748768.363917 | 0.005126 | 1559244.0 | 6516692000.0 | 565801.758621 |
| 3 | 2016-01-04 00:00:00+00:00 | 0.0001 | 0.633926 | 0.678528 | 0.675956 | 1.028857 | 112103000.0 | 487235.112021 | 0 | 103880300000.0 | ... | 400415500.0 | 19885280.0 | 30.382633 | 0.073746 | 0.00017 | 934669.474959 | 0.017405 | 1961357.0 | 6459280000.0 | 550678.895028 |
| 4 | 2016-01-05 00:00:00+00:00 | 0.0001 | 0.592641 | 0.677822 | 0.673562 | 1.028863 | 145267800.0 | 471160.674046 | 0 | 103880300000.0 | ... | 544440600.0 | 22229690.0 | 32.511258 | 0.076591 | 0.000177 | 810735.400931 | 0.01962 | 1698228.0 | 6516855000.0 | 641049.082803 |


In [14]:
topna = data.isna().sum().sort_values(ascending=False)[:10]
topna

txn.tfr.erc721.cnt - transaction_transfer_count_erc721    1827
txn.tfr.erc20.cnt - transaction_transfer_count_erc20      1827
reddit.subscribers - subscribers                          1588
reddit.active.users - active_users                        1588
exch.flow.in.usd - flow_in_usd                               0
txn.tfr.val.adj.ntv - transaction_volume_adjusted            0
txn.fee.med - transaction_fee_median_usd                     0
fees - fees_total_usd                                        0
blk.cnt - block_count                                        0
timestamp                                                    0
dtype: int64

Columns with missing values are dropped, since the model requires complete data from 2016 to 2020.

In [15]:
data2 = data.drop(columns=topna[topna > 0].index)
data2.head()

| | timestamp | txn.fee.med.ntv - transaction_fee_median | daily.vol - volatility_30d | daily.vol - volatility_90d | daily.vol - volatility_1yr | daily.vol - volatility_3yr | txn.tsfr.val.adj - adjusted_transfer_value_usd | exch.sply - supply_usd | min.rev.usd - revenue_usd | diff.avg - mean_difficulty | ... | txn.vol - transaction_volume_usd | exch.flow.in.usd.incl - flow_in_usd | fees.ntv - fees_total | txn.fee.avg - transaction_fee_avg | txn.fee.avg.ntv - transaction_fee_avg_ntv | hashrate - hash_rate | txn.tfr.val.med.ntv - transfer_value_median | new.iss.usd - issuance_usd | mcap.circ - circulating_marketcap | blk.size.bytes.avg - block_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 00:00:00+00:00 | 0.0001 | 0.640308 | 0.677785 | 0.685757 | 1.028978 | 75858190.0 | 480945.387379 | 0 | 103880300000.0 | ... | 378606700.0 | 13988450.0 | 19.872695 | 0.069424 | 0.00016 | 697129.166406 | 0.011701 | 1467038.0 | 6468191000.0 | 492924.844444 |
| 1 | 2016-01-02 00:00:00+00:00 | 0.0001 | 0.631588 | 0.678005 | 0.685251 | 1.028922 | 82382820.0 | 481485.884222 | 0 | 103880300000.0 | ... | 311144900.0 | 11414710.0 | 31.235525 | 0.091773 | 0.000211 | 748768.363917 | 0.0069 | 1574845.0 | 6527223000.0 | 576329.57931 |
| 2 | 2016-01-03 00:00:00+00:00 | 0.0001 | 0.633846 | 0.67852 | 0.685339 | 1.028857 | 99999000.0 | 488982.494341 | 0 | 103880300000.0 | ... | 343421500.0 | 16981440.0 | 24.473062 | 0.072584 | 0.000169 | 748768.363917 | 0.005126 | 1559244.0 | 6516692000.0 | 565801.758621 |
| 3 | 2016-01-04 00:00:00+00:00 | 0.0001 | 0.633926 | 0.678528 | 0.675956 | 1.028857 | 112103000.0 | 487235.112021 | 0 | 103880300000.0 | ... | 400415500.0 | 19885280.0 | 30.382633 | 0.073746 | 0.00017 | 934669.474959 | 0.017405 | 1961357.0 | 6459280000.0 | 550678.895028 |
| 4 | 2016-01-05 00:00:00+00:00 | 0.0001 | 0.592641 | 0.677822 | 0.673562 | 1.028863 | 145267800.0 | 471160.674046 | 0 | 103880300000.0 | ... | 544440600.0 | 22229690.0 | 32.511258 | 0.076591 | 0.000177 | 810735.400931 | 0.01962 | 1698228.0 | 6516855000.0 | 641049.082803 |
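
Dropping columns removes missing values, but complete data from 2016 to 2020 also implies that every calendar day is present. A minimal sketch of such a check follows. `missing_days` is a hypothetical helper, and the sketch assumes the `timestamp` column holds UTC-aware daily timestamps as in the output above.

```python
import pandas as pd


def missing_days(timestamps, start, end):
    """Return the calendar days in [start, end] absent from `timestamps`."""
    expected = pd.date_range(start, end, freq="D", tz="UTC")
    return expected.difference(pd.DatetimeIndex(timestamps))


# Toy example: 2016-01-02 is deliberately left out
ts = pd.to_datetime(["2016-01-01", "2016-01-03"], utc=True)
gaps = missing_days(ts, "2016-01-01", "2016-01-03")
print(list(gaps))
```

On the real data, a call such as `missing_days(data2['timestamp'], '2016-01-01', '2020-12-31')` would be expected to return an empty index if the 1827 rows cover every day of the period.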


In [16]:
data2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1827 entries, 0 to 1826
Data columns (total 62 columns):
 #   Column                                             Non-Null Count  Dtype              
---  ------                                             --------------  -----              
 0   timestamp                                          1827 non-null   datetime64[ns, UTC]
 1   txn.fee.med.ntv - transaction_fee_median           1827 non-null   float64            
 2   daily.vol - volatility_30d                         1827 non-null   float64            
 3   daily.vol - volatility_90d                         1827 non-null   float64            
 4   daily.vol - volatility_1yr                         1827 non-null   float64            
 5   daily.vol - volatility_3yr                         1827 non-null   float64            
 6   txn.tsfr.val.adj - adjusted_transfer_value_usd     1827 non-null   float64            
 7   exch.sply - supply_usd                             1827 non-

In [17]:
data2.to_csv('../raw/rawdata.csv', index=False)