# **Bitcoin price prediction - Data crawling**
## Big Data Computing final project - A.Y. 2022 - 2023
Prof. Gabriele Tolomei

MSc in Computer Science

La Sapienza, University of Rome

### Author: Corsi Danilo (1742375) - corsi.1742375@studenti.uniroma1.it



---


Description: crawling data on Bitcoin's blockchain by querying blockchain.com.

# Global constants, dependencies, libraries and tools

In [1]:
# Main constants
LOCAL_RUNNING = False
ROOT_DIR = "D:/Documents/Repository/BDC/project" if LOCAL_RUNNING else "/content/drive"

In [2]:
if not LOCAL_RUNNING:
    # Point Colaboratory to Google Drive
    from google.colab import drive

    # Define GDrive paths
    drive.mount(ROOT_DIR, force_remount=True)

Mounted at /content/drive


In [3]:
# Set main dir
MAIN_DIR = ROOT_DIR + "" if LOCAL_RUNNING else ROOT_DIR + "/MyDrive/BDC/project"

# Datasets dir
DATASET_RAW_DIR = MAIN_DIR + "/datasets/raw"

# Datasets name
DATASET_NAME = "bitcoin_blockchain_data_15min"

# Datasets path
DATASET_RAW = DATASET_RAW_DIR + "/" + DATASET_NAME + ".parquet"

In [4]:
# Useful imports
import pandas as pd
import functools
import plotly.io as pio
import warnings
import time
from datetime import datetime, timedelta

if not LOCAL_RUNNING: from google.colab import drive

warnings.simplefilter(action='ignore', category=FutureWarning)

if LOCAL_RUNNING: pio.renderers.default='notebook' # To correctly export the notebook in html format

# Metrics and parameters
I chose to collect data on the Bitcoin blockchain using the API of the website Blockchain.com, and price information from two well-known exchanges, Binance and Kraken. The most relevant information was retrieved for the last four years up to the present day (a period with moments of high volatility but also long stretches of sideways price movement). The procedure has been made as automatic as possible, so that the time window is derived in the same way each time the entire procedure is executed.

The features taken under consideration were divided into several categories:
- **Currency Statistics**
   - **ohlcv**: stands for “Open, High, Low, Close and Volume” and it's a list of the five types of data that are most common in financial analysis regarding price.
   - **market-price:** the average USD market price across major bitcoin exchanges.
   - **trade-volume-usd:** the total USD value of trading volume on major bitcoin exchanges.
   - **total-bitcoins:** the total number of mined bitcoin that are currently circulating on the network.
   - **market-cap:** the total USD value of bitcoin in circulation.

- **Block Details**
   - **blocks-size:** the total size of the blockchain minus database indexes in megabytes.
   - **avg-block-size:** the average block size over the past 24 hours in megabytes.
   - **n-transactions-total:** the total number of transactions on the blockchain.
   - **n-transactions-per-block:** the average number of transactions per block over the past 24 hours.

- **Mining Information**
   - **hash-rate:** the estimated number of terahashes per second the bitcoin network is performing in the last 24 hours.
   - **difficulty:** a relative measure of how difficult it is to mine a new block for the blockchain.
   - **miners-revenue:** total value of coinbase block rewards and transaction fees paid to miners.
   - **transaction-fees-usd:** the total USD value of all transaction fees paid to miners. This does not include coinbase block rewards.

- **Network Activity**
   - **n-unique-addresses:** the total number of unique addresses used on the blockchain.
   - **n-transactions:** the total number of confirmed transactions per day.
   - **estimated-transaction-volume-usd:** the total estimated value in USD of transactions on the blockchain.
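As a rough sanity check on the definitions above, market-cap should be close to market-price multiplied by total-bitcoins. The sketch below uses the 2019-11-16 values that appear in the crawled tables later in this notebook; the function name `implied_market_cap` is hypothetical, and a small gap is expected because the metrics are not sampled at exactly the same moment.

```python
def implied_market_cap(market_price_usd, total_bitcoins):
    """Approximate market cap as price times circulating supply."""
    return market_price_usd * total_bitcoins

# Values for 2019-11-16 taken from the tables shown later in this notebook
market_price = 8457.69
total_bitcoins = 18051437.50
reported_market_cap = 1.531394e11

implied = implied_market_cap(market_price, total_bitcoins)
relative_gap = abs(implied - reported_market_cap) / reported_market_cap
print(f"implied: {implied:.4e}, reported: {reported_market_cap:.4e}, gap: {relative_gap:.2%}")
```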

In [5]:
# Define the parameters
timespan = "4years" # Duration of the data
# Get current date (ending date)
end_date = datetime.today()
# Get the starting date
start_date = (datetime.today() - timedelta(days=365*4))

# Metrics considered
metrics = [
          # Currency Statistics
          "market-price",
          "trade-volume",

          # Block Details
          "blocks-size",
          "avg-block-size",
          "n-transactions-total",
          "n-transactions-per-block",

          # Mining Information
          "hash-rate",
          "difficulty",
          "miners-revenue",
          "transaction-fees-usd",

          # Network Activity
          "n-unique-addresses",
          "n-transactions",
          "estimated-transaction-volume-usd"
]
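The start date above is computed as `365*4` days before today, which falls one day short of four calendar years whenever the window contains a leap day. A minimal sketch of the difference, assuming a fixed end date for reproducibility; `pd.DateOffset(years=4)` is shown only as a possible alternative, not as what the notebook uses.

```python
import pandas as pd
from datetime import datetime, timedelta

# Fixed end date (assumption, chosen to match the run shown in this notebook)
end = datetime(2023, 11, 15)

# Approach used in the notebook: a fixed number of days
start_fixed_days = end - timedelta(days=365 * 4)

# Alternative: subtract four calendar years
start_calendar = (pd.Timestamp(end) - pd.DateOffset(years=4)).to_pydatetime()

print(start_fixed_days.date(), start_calendar.date())
```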

# Data crawling

In [6]:
# Install ccxt trading library that provides a way to connect and trade with various cryptocurrency exchanges and payment processing services worldwide
!pip3 install ccxt
import ccxt

Collecting ccxt
  Downloading ccxt-4.1.52-py2.py3-none-any.whl (4.0 MB)
Collecting aiodns>=1.1.1 (from ccxt)
  Downloading aiodns-3.1.1-py3-none-any.whl (5.4 kB)
Collecting pycares>=4.0.0 (from aiodns>=1.1.1->ccxt)
  Downloading pycares-4.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (288 kB)
Installing collected packages: pycares, aiodns, ccxt
Successfully installed aiodns-3.1.1 ccxt-4.1.52 pycares-4.4.0


In [7]:
# Create an array of dates so that the API can be contacted in 360-day (roughly one-year) increments
date_array = []

# Calculate the number of days between the start and end dates
num_days = (end_date - start_date).days

# Loop through the dates and add them to the array
for i in range(num_days + 1):
    current_date = start_date + timedelta(days=i)
    if i % 360 == 0:
        date_array.append(current_date)

# Append end_date
date_array.append(end_date)
date_array

[datetime.datetime(2019, 11, 16, 10, 11, 49, 563984),
 datetime.datetime(2020, 11, 10, 10, 11, 49, 563984),
 datetime.datetime(2021, 11, 5, 10, 11, 49, 563984),
 datetime.datetime(2022, 10, 31, 10, 11, 49, 563984),
 datetime.datetime(2023, 10, 26, 10, 11, 49, 563984),
 datetime.datetime(2023, 11, 15, 10, 11, 49, 563931)]
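The boundaries above are consumed as consecutive (start, end) pairs. The sketch below builds those pairs directly; `make_windows` and its `step_days` parameter are hypothetical names, and the dates are fixed to midnight for reproducibility rather than taken from `datetime.today()`.

```python
from datetime import datetime, timedelta

def make_windows(start, end, step_days=360):
    """Split [start, end] into consecutive (window_start, window_end) pairs."""
    windows = []
    cursor = start
    while cursor < end:
        # The last window is clipped so that it never runs past the end date
        window_end = min(cursor + timedelta(days=step_days), end)
        windows.append((cursor, window_end))
        cursor = window_end
    return windows

windows = make_windows(datetime(2019, 11, 16), datetime(2023, 11, 15))
for window_start, window_end in windows:
    print(window_start.date(), "->", window_end.date())
```

Iterating over pairs removes the need for a separate counter that tracks the end index of each window.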

In [8]:
def ohlcv_crawler(exchange_to_use, start, end):
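    exchange = exchange_to_use  # Exchange client to query (e.g. ccxt.binanceus())
    market = 'BTC/USD'  # Bitcoin market
    exchange.enableRateLimit = False  # Disable the built-in rate limiter

    # Convert dates to milliseconds
    since = exchange.parse8601(start + 'T00:00:00Z')
    till = exchange.parse8601(end + 'T00:00:00Z')

    # Fetch daily OHLCV data
    # Note: the fourth positional argument of fetch_ohlcv is `limit` (a maximum number of candles),
    # not an end timestamp, so `till` does not bound the window here; overlapping rows are
    # removed by the drop_duplicates step later on
    ohlcv = exchange.fetch_ohlcv(market, '1d', since, till)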

    # Convert to DataFrame
    dataset = pd.DataFrame(ohlcv, columns=['timestamp', 'open', 'high', 'low', 'close', 'volume'])
    dataset['timestamp'] = pd.to_datetime(dataset['timestamp'], unit='ms')

    return dataset
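One way to make each request cover only its intended window is to filter the returned candles on the client side before building the DataFrame. This is a minimal offline sketch, not the notebook's method: `ohlcv_rows_to_frame` is a hypothetical helper, and the rows are a hand-written sample in the `[timestamp_ms, open, high, low, close, volume]` layout used above.

```python
import pandas as pd

def ohlcv_rows_to_frame(rows, since_ms, till_ms):
    """Keep candles with since_ms <= timestamp < till_ms and convert them to a DataFrame."""
    columns = ["timestamp", "open", "high", "low", "close", "volume"]
    frame = pd.DataFrame(rows, columns=columns)
    in_window = (frame["timestamp"] >= since_ms) & (frame["timestamp"] < till_ms)
    frame = frame[in_window].copy()
    frame["timestamp"] = pd.to_datetime(frame["timestamp"], unit="ms")
    return frame.reset_index(drop=True)

DAY_MS = 24 * 60 * 60 * 1000
t0 = 1573862400000  # 2019-11-16T00:00:00Z in milliseconds

# Sample rows copied from the first lines of the OHLCV output in this notebook
rows = [
    [t0, 8463.79, 8528.44, 8430.77, 8484.07, 50.607771],
    [t0 + DAY_MS, 8485.83, 8627.56, 8373.96, 8508.26, 89.357152],
    [t0 + 2 * DAY_MS, 8495.72, 8501.58, 8043.00, 8175.14, 173.564990],
]

frame = ohlcv_rows_to_frame(rows, t0, t0 + 2 * DAY_MS)
print(frame)
```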

In [9]:
# Fetch OHLCV data
exchange_to_use = ccxt.binanceus()

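frames = []
for i in range(3):
  # Each request covers the window between two consecutive boundaries in date_array
  frames.append(ohlcv_crawler(exchange_to_use, date_array[i].strftime('%Y-%m-%d'), date_array[i + 1].strftime('%Y-%m-%d')))
  time.sleep(5)  # Pause between requests

# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
df0 = pd.concat(frames, ignore_index=True)
df0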

Unnamed: 0,timestamp,open,high,low,close,volume
0,2019-11-16,8463.79,8528.44,8430.77,8484.07,50.607771
1,2019-11-17,8485.83,8627.56,8373.96,8508.26,89.357152
2,2019-11-18,8495.72,8501.58,8043.00,8175.14,173.564990
3,2019-11-19,8170.52,8196.27,7993.46,8123.93,136.460384
4,2019-11-20,8123.92,8219.00,8037.77,8083.95,120.561255
...,...,...,...,...,...,...
2589,2023-07-10,27575.17,28100.00,27450.00,27707.00,84.414950
2590,2023-07-11,27707.00,27797.05,26900.00,26951.45,78.992070
2591,2023-07-12,26911.15,27500.01,26777.00,26777.00,83.336960
2592,2023-07-13,26799.97,27400.00,25450.00,25937.06,129.776650


In [10]:
# Check duplicated rows
len(df0['timestamp'].unique())

1337

In [11]:
# Drop the duplicates in column "timestamp", keep the last value
df0.drop_duplicates(subset="timestamp", keep="last", inplace=True)
df0

Unnamed: 0,timestamp,open,high,low,close,volume
0,2019-11-16,8463.79,8528.44,8430.77,8484.07,50.607771
1,2019-11-17,8485.83,8627.56,8373.96,8508.26,89.357152
2,2019-11-18,8495.72,8501.58,8043.00,8175.14,173.564990
3,2019-11-19,8170.52,8196.27,7993.46,8123.93,136.460384
4,2019-11-20,8123.92,8219.00,8037.77,8083.95,120.561255
...,...,...,...,...,...,...
2589,2023-07-10,27575.17,28100.00,27450.00,27707.00,84.414950
2590,2023-07-11,27707.00,27797.05,26900.00,26951.45,78.992070
2591,2023-07-12,26911.15,27500.01,26777.00,26777.00,83.336960
2592,2023-07-13,26799.97,27400.00,25450.00,25937.06,129.776650


In [12]:
# Check duplicated rows
len(df0['timestamp'].unique())

1337
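The duplicate check compares the number of unique timestamps with the number of rows. The same idea on toy data, as a minimal sketch: the second 2019-11-17 row is an invented value added only to create a duplicate, and `keep="last"` retains the most recently appended row for each timestamp.

```python
import pandas as pd

# Toy frame with one repeated timestamp (the repeated row is illustrative only)
frame = pd.DataFrame({
    "timestamp": pd.to_datetime(["2019-11-16", "2019-11-17", "2019-11-17", "2019-11-18"]),
    "close": [8484.07, 8508.26, 8508.30, 8175.14],
})

# Count rows whose timestamp has already appeared earlier in the frame
n_duplicates = int(frame["timestamp"].duplicated().sum())

# Keep only the last row for each timestamp
deduped = frame.drop_duplicates(subset="timestamp", keep="last").reset_index(drop=True)

print(n_duplicates, len(deduped), deduped["timestamp"].is_unique)
```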

In [13]:
# Since I cannot get all the data from the same exchange, I will get the remaining data from another one
last_date = df0['timestamp'].tail(1).values[0]

# Compare the last date with our end date
if not last_date == end_date:
  exchange_to_use = ccxt.kraken()
  # Each iteration requests the same window; the repeated rows are dropped in the next step
  for i in range(3):
    # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
    df0 = pd.concat([df0, ohlcv_crawler(exchange_to_use, pd.to_datetime(last_date).strftime('%Y-%m-%d'), end_date.strftime('%Y-%m-%d'))], ignore_index=True)
df0

Unnamed: 0,timestamp,open,high,low,close,volume
0,2019-11-16,8463.79,8528.44,8430.77,8484.07,50.607771
1,2019-11-17,8485.83,8627.56,8373.96,8508.26,89.357152
2,2019-11-18,8495.72,8501.58,8043.00,8175.14,173.564990
3,2019-11-19,8170.52,8196.27,7993.46,8123.93,136.460384
4,2019-11-20,8123.92,8219.00,8037.77,8083.95,120.561255
...,...,...,...,...,...,...
1707,2023-11-11,37311.70,37411.70,36658.00,37139.80,1725.244614
1708,2023-11-12,37139.90,37227.60,36727.30,37053.90,1172.596290
1709,2023-11-13,37053.90,37423.00,36371.00,36494.10,3275.049329
1710,2023-11-14,36494.10,36750.10,34666.00,35544.40,4657.820097


In [14]:
# Check duplicated rows
len(df0['timestamp'].unique())

1461

In [15]:
# Drop the duplicates in column "timestamp", keep the last value
df0.drop_duplicates(subset="timestamp", keep="last", inplace=True)
df0

Unnamed: 0,timestamp,open,high,low,close,volume
0,2019-11-16,8463.79,8528.44,8430.77,8484.07,50.607771
1,2019-11-17,8485.83,8627.56,8373.96,8508.26,89.357152
2,2019-11-18,8495.72,8501.58,8043.00,8175.14,173.564990
3,2019-11-19,8170.52,8196.27,7993.46,8123.93,136.460384
4,2019-11-20,8123.92,8219.00,8037.77,8083.95,120.561255
...,...,...,...,...,...,...
1707,2023-11-11,37311.70,37411.70,36658.00,37139.80,1725.244614
1708,2023-11-12,37139.90,37227.60,36727.30,37053.90,1172.596290
1709,2023-11-13,37053.90,37423.00,36371.00,36494.10,3275.049329
1710,2023-11-14,36494.10,36750.10,34666.00,35544.40,4657.820097


In [16]:
# Check duplicated rows
len(df0['timestamp'].unique())

1461

In [17]:
def blockchain_data_crawler(timespan, metrics, start, end):
    # API info
    url = f'https://api.blockchain.info/charts/{metrics}?timespan={timespan}&start={start}&format=csv'

    # Obtain data
    data = pd.read_csv(url, names=['timestamp', metrics])

    # Transform "timestamp" to datetime type
    data['timestamp'] = pd.to_datetime(data["timestamp"])

    # Select data up to the end date
    data = data[(data['timestamp'] < end)]

    return data
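The crawler can be exercised without network access by separating URL construction from CSV parsing. This is a sketch under assumptions: `build_chart_url` and `parse_chart_csv` are hypothetical helpers, and the sample text only imitates a two-column `timestamp,value` response, which may not match the live API's output exactly.

```python
import io
import pandas as pd

def build_chart_url(metric, timespan, start):
    """Build the chart URL in the same form used by the crawler above."""
    return f"https://api.blockchain.info/charts/{metric}?timespan={timespan}&start={start}&format=csv"

def parse_chart_csv(csv_text, metric, end):
    """Parse a two-column CSV and keep only rows strictly before the end date."""
    data = pd.read_csv(io.StringIO(csv_text), names=["timestamp", metric])
    data["timestamp"] = pd.to_datetime(data["timestamp"])
    return data[data["timestamp"] < pd.Timestamp(end)]

# Hand-written sample standing in for an API response (assumed format)
sample_csv = (
    "2019-11-16 00:00:00,8457.69\n"
    "2019-11-17 00:00:00,8482.70\n"
    "2019-11-18 00:00:00,8503.93\n"
)

url = build_chart_url("market-price", "4years", "2019-11-16")
parsed = parse_chart_csv(sample_csv, "market-price", "2019-11-18")
print(url)
print(parsed)
```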

In [18]:
# Merge the data
merge = functools.partial(pd.merge, on='timestamp')

# Get blockchain data from the Blockchain.com API, one metric at a time, and merge everything on "timestamp"
df1 = functools.reduce(merge, [blockchain_data_crawler(timespan, metric, start_date.strftime('%Y-%m-%d'), end_date.strftime('%Y-%m-%d')) for metric in metrics])
df1

Unnamed: 0,timestamp,market-price,trade-volume,blocks-size,avg-block-size,n-transactions-total,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd
0,2019-11-16,8457.69,1.023465e+08,249372.424907,1.171504,475069560,1915.324324,9.358260e+07,1.272001e+13,1.586438e+07,1.942917e+05,495077.0,283468.0,4.618346e+08
1,2019-11-17,8482.70,2.721844e+07,249545.811754,1.108507,475351799,1716.688312,9.737649e+07,1.272001e+13,1.658248e+07,1.795335e+05,430853.0,264370.0,3.428184e+08
2,2019-11-18,8503.93,4.276261e+07,249716.672832,1.019048,475617345,2176.221429,8.852408e+07,1.272001e+13,1.496338e+07,2.331815e+05,510494.0,304671.0,1.043918e+09
3,2019-11-19,8175.99,9.803936e+07,249859.339345,1.143817,475922138,2228.775510,9.295029e+07,1.272001e+13,1.559906e+07,2.811575e+05,561226.0,327630.0,9.233997e+08
4,2019-11-20,8120.80,8.079782e+07,250027.466421,1.025924,476249278,2261.169014,8.978871e+07,1.272001e+13,1.480709e+07,2.255931e+05,525739.0,321086.0,8.709691e+08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1447,2023-11-09,35629.52,1.569445e+08,524735.476915,1.663591,916106091,4010.251748,4.440259e+08,6.246347e+13,4.238651e+07,9.218558e+06,832393.0,573466.0,6.287697e+09
1448,2023-11-10,36696.25,4.387976e+08,524973.519610,1.621323,916679882,3401.500000,4.781817e+08,6.246347e+13,4.158990e+07,6.043066e+06,848433.0,523831.0,5.561959e+09
1449,2023-11-11,37321.65,2.492784e+08,525223.328826,1.661074,917204210,3922.056250,4.968122e+08,6.246347e+13,4.229329e+07,4.284425e+06,894325.0,627529.0,2.477252e+09
1450,2023-11-12,37140.27,9.937919e+07,525488.844578,1.738509,917830353,4368.490683,5.060836e+08,6.323395e+13,4.417493e+07,4.545268e+06,880975.0,703327.0,1.674083e+09


In [19]:
# Check duplicated rows
len(df1['timestamp'].unique())

1452
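The `functools.reduce` call folds a list of per-metric frames into one table by repeatedly merging on `timestamp`. A toy illustration with three small frames (values copied from the output above) shows that the default inner join keeps only the dates present in every frame.

```python
import functools
import pandas as pd

a = pd.DataFrame({"timestamp": pd.to_datetime(["2019-11-16", "2019-11-17"]),
                  "market-price": [8457.69, 8482.70]})
b = pd.DataFrame({"timestamp": pd.to_datetime(["2019-11-16", "2019-11-17"]),
                  "difficulty": [1.272001e13, 1.272001e13]})
# This frame deliberately lacks 2019-11-16 to show the effect of the inner join
c = pd.DataFrame({"timestamp": pd.to_datetime(["2019-11-17", "2019-11-18"]),
                  "n-transactions": [264370.0, 304671.0]})

merge_on_timestamp = functools.partial(pd.merge, on="timestamp")
merged = functools.reduce(merge_on_timestamp, [a, b, c])
print(merged)
```

Only 2019-11-17 survives, because it is the only date shared by all three frames.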

In [20]:
# Retrieving market capitalization and total circulating data
metrics = [
          # Currency Statistics
          "total-bitcoins",
          "market-cap",
  ]

df2 = functools.reduce(merge, [blockchain_data_crawler(timespan, metric, start_date.strftime('%Y-%m-%d'), end_date.strftime('%Y-%m-%d')) for metric in metrics])
df2

Unnamed: 0,timestamp,total-bitcoins,market-cap
0,2019-11-16 00:00:13,18049675.00,1.526281e+11
1,2019-11-16 22:54:41,18051437.50,1.531394e+11
2,2019-11-17 20:48:53,18053200.00,1.545444e+11
3,2019-11-18 20:22:43,18054962.50,1.477347e+11
4,2019-11-19 19:52:56,18056750.00,1.461604e+11
...,...,...,...
1500,2023-11-10 22:31:21,19538787.50,7.289140e+11
1501,2023-11-11 19:54:46,19539675.00,7.260162e+11
1502,2023-11-12 18:15:12,19540587.50,7.240765e+11
1503,2023-11-13 20:09:39,19541468.75,7.205526e+11


In [21]:
# Check duplicated rows
len(df2['timestamp'].unique())

1505

In [22]:
# Wipe off the timestamp's h:m:s.
df2['timestamp'] = pd.to_datetime(df2["timestamp"]).dt.normalize()

# Drop the duplicates in column "timestamp", keep the last value
df2.drop_duplicates(subset="timestamp", keep="last", inplace=True)
df2

Unnamed: 0,timestamp,total-bitcoins,market-cap
1,2019-11-16,18051437.50,1.531394e+11
2,2019-11-17,18053200.00,1.545444e+11
3,2019-11-18,18054962.50,1.477347e+11
4,2019-11-19,18056750.00,1.461604e+11
5,2019-11-20,18058525.00,1.458226e+11
...,...,...,...
1500,2023-11-10,19538787.50,7.289140e+11
1501,2023-11-11,19539675.00,7.260162e+11
1502,2023-11-12,19540587.50,7.240765e+11
1503,2023-11-13,19541468.75,7.205526e+11


In [23]:
# Check duplicated rows
len(df2['timestamp'].unique())

1415
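The normalization step can be checked on the first three rows of the raw output above: two samples fall on 2019-11-16, and after truncating the time of day only the later one is kept.

```python
import pandas as pd

# First three rows of the raw total-bitcoins output shown above
raw = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-11-16 00:00:13",
        "2019-11-16 22:54:41",
        "2019-11-17 20:48:53",
    ]),
    "total-bitcoins": [18049675.00, 18051437.50, 18053200.00],
})

# Truncate each timestamp to midnight, then keep the last sample of each day
raw["timestamp"] = raw["timestamp"].dt.normalize()
daily = raw.drop_duplicates(subset="timestamp", keep="last")
print(daily)
```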

In [24]:
all_data_tmp = pd.merge(df0, df1, how="inner", on='timestamp')
all_data = pd.merge(all_data_tmp, df2, how="inner", on='timestamp')
# Forward-fill any missing values (interpolate(method='ffill') is deprecated in recent pandas versions)
all_data = all_data.ffill()
all_data

Unnamed: 0,timestamp,open,high,low,close,volume,market-price,trade-volume,blocks-size,avg-block-size,...,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd,total-bitcoins,market-cap
0,2019-11-16,8463.79,8528.44,8430.77,8484.07,50.607771,8457.69,1.023465e+08,249372.424907,1.171504,...,1915.324324,9.358260e+07,1.272001e+13,1.586438e+07,1.942917e+05,495077.0,283468.0,4.618346e+08,18051437.50,1.531394e+11
1,2019-11-17,8485.83,8627.56,8373.96,8508.26,89.357152,8482.70,2.721844e+07,249545.811754,1.108507,...,1716.688312,9.737649e+07,1.272001e+13,1.658248e+07,1.795335e+05,430853.0,264370.0,3.428184e+08,18053200.00,1.545444e+11
2,2019-11-18,8495.72,8501.58,8043.00,8175.14,173.564990,8503.93,4.276261e+07,249716.672832,1.019048,...,2176.221429,8.852408e+07,1.272001e+13,1.496338e+07,2.331815e+05,510494.0,304671.0,1.043918e+09,18054962.50,1.477347e+11
3,2019-11-19,8170.52,8196.27,7993.46,8123.93,136.460384,8175.99,9.803936e+07,249859.339345,1.143817,...,2228.775510,9.295029e+07,1.272001e+13,1.559906e+07,2.811575e+05,561226.0,327630.0,9.233997e+08,18056750.00,1.461604e+11
4,2019-11-20,8123.92,8219.00,8037.77,8083.95,120.561255,8120.80,8.079782e+07,250027.466421,1.025924,...,2261.169014,8.978871e+07,1.272001e+13,1.480709e+07,2.255931e+05,525739.0,321086.0,8.709691e+08,18058525.00,1.458226e+11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1402,2023-11-09,35626.40,37971.00,35517.30,36694.00,8078.048005,35629.52,1.569445e+08,524735.476915,1.663591,...,4010.251748,4.440259e+08,6.246347e+13,4.238651e+07,9.218558e+06,832393.0,573466.0,6.287697e+09,19537018.75,7.005584e+11
1403,2023-11-10,36702.50,37500.00,36340.10,37311.70,4459.426938,36696.25,4.387976e+08,524973.519610,1.621323,...,3401.500000,4.781817e+08,6.246347e+13,4.158990e+07,6.043066e+06,848433.0,523831.0,5.561959e+09,19538787.50,7.289140e+11
1404,2023-11-11,37311.70,37411.70,36658.00,37139.80,1725.244614,37321.65,2.492784e+08,525223.328826,1.661074,...,3922.056250,4.968122e+08,6.246347e+13,4.229329e+07,4.284425e+06,894325.0,627529.0,2.477252e+09,19539675.00,7.260162e+11
1405,2023-11-12,37139.90,37227.60,36727.30,37053.90,1172.596290,37140.27,9.937919e+07,525488.844578,1.738509,...,4368.490683,5.060836e+08,6.323395e+13,4.417493e+07,4.545268e+06,880975.0,703327.0,1.674083e+09,19540587.50,7.240765e+11


In [25]:
# Check nan values
all_data[all_data.isnull().T.any()]

Unnamed: 0,timestamp,open,high,low,close,volume,market-price,trade-volume,blocks-size,avg-block-size,...,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd,total-bitcoins,market-cap


In [26]:
# Check duplicated rows
len(all_data['timestamp'].unique())

1407
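The inner merges keep only the dates present in all three sources, which is why the merged table has fewer rows than any of its inputs. A possible way to see which dates are dropped, sketched on toy frames with values copied from the tables above, is an outer merge with `indicator=True`; this audit step is an addition for illustration and is not part of the original pipeline.

```python
import pandas as pd

prices = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-11-12", "2023-11-13", "2023-11-14"]),
    "close": [37053.90, 36494.10, 35544.40],
})
chain = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-11-11", "2023-11-12"]),
    "market-price": [37321.65, 37140.27],
})

# The "_merge" column reports whether each date came from the left, right, or both frames
audit = pd.merge(prices, chain, on="timestamp", how="outer", indicator=True)
lost = audit[audit["_merge"] != "both"]
print(lost[["timestamp", "_merge"]])
```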

In [27]:
all_data

Unnamed: 0,timestamp,open,high,low,close,volume,market-price,trade-volume,blocks-size,avg-block-size,...,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd,total-bitcoins,market-cap
0,2019-11-16,8463.79,8528.44,8430.77,8484.07,50.607771,8457.69,1.023465e+08,249372.424907,1.171504,...,1915.324324,9.358260e+07,1.272001e+13,1.586438e+07,1.942917e+05,495077.0,283468.0,4.618346e+08,18051437.50,1.531394e+11
1,2019-11-17,8485.83,8627.56,8373.96,8508.26,89.357152,8482.70,2.721844e+07,249545.811754,1.108507,...,1716.688312,9.737649e+07,1.272001e+13,1.658248e+07,1.795335e+05,430853.0,264370.0,3.428184e+08,18053200.00,1.545444e+11
2,2019-11-18,8495.72,8501.58,8043.00,8175.14,173.564990,8503.93,4.276261e+07,249716.672832,1.019048,...,2176.221429,8.852408e+07,1.272001e+13,1.496338e+07,2.331815e+05,510494.0,304671.0,1.043918e+09,18054962.50,1.477347e+11
3,2019-11-19,8170.52,8196.27,7993.46,8123.93,136.460384,8175.99,9.803936e+07,249859.339345,1.143817,...,2228.775510,9.295029e+07,1.272001e+13,1.559906e+07,2.811575e+05,561226.0,327630.0,9.233997e+08,18056750.00,1.461604e+11
4,2019-11-20,8123.92,8219.00,8037.77,8083.95,120.561255,8120.80,8.079782e+07,250027.466421,1.025924,...,2261.169014,8.978871e+07,1.272001e+13,1.480709e+07,2.255931e+05,525739.0,321086.0,8.709691e+08,18058525.00,1.458226e+11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1402,2023-11-09,35626.40,37971.00,35517.30,36694.00,8078.048005,35629.52,1.569445e+08,524735.476915,1.663591,...,4010.251748,4.440259e+08,6.246347e+13,4.238651e+07,9.218558e+06,832393.0,573466.0,6.287697e+09,19537018.75,7.005584e+11
1403,2023-11-10,36702.50,37500.00,36340.10,37311.70,4459.426938,36696.25,4.387976e+08,524973.519610,1.621323,...,3401.500000,4.781817e+08,6.246347e+13,4.158990e+07,6.043066e+06,848433.0,523831.0,5.561959e+09,19538787.50,7.289140e+11
1404,2023-11-11,37311.70,37411.70,36658.00,37139.80,1725.244614,37321.65,2.492784e+08,525223.328826,1.661074,...,3922.056250,4.968122e+08,6.246347e+13,4.229329e+07,4.284425e+06,894325.0,627529.0,2.477252e+09,19539675.00,7.260162e+11
1405,2023-11-12,37139.90,37227.60,36727.30,37053.90,1172.596290,37140.27,9.937919e+07,525488.844578,1.738509,...,4368.490683,5.060836e+08,6.323395e+13,4.417493e+07,4.545268e+06,880975.0,703327.0,1.674083e+09,19540587.50,7.240765e+11


In [28]:
# Rename some columns
all_data.rename(columns={'open': 'opening-price', 'high': 'highest-price', 'low': 'lowest-price', 'close': 'closing-price', 'volume': 'trade-volume-btc', 'trade-volume': 'trade-volume-usd'}, inplace=True)

# Reorder columns: price-related columns first, then the remaining ones in their existing order
front_columns = ['timestamp', 'market-price', 'opening-price', 'highest-price', 'lowest-price', 'closing-price', 'trade-volume-btc', 'total-bitcoins', 'market-cap']
new_columns = front_columns + [col for col in all_data.columns if col not in front_columns]
all_data = all_data.reindex(columns=new_columns)
all_data

Unnamed: 0,timestamp,market-price,opening-price,highest-price,lowest-price,closing-price,trade-volume-btc,total-bitcoins,market-cap,trade-volume-usd,...,avg-block-size,n-transactions-total,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd
0,2019-11-16,8457.69,8463.79,8528.44,8430.77,8484.07,50.607771,18051437.50,1.531394e+11,1.023465e+08,...,1.171504,475069560,1915.324324,9.358260e+07,1.272001e+13,1.586438e+07,1.942917e+05,495077.0,283468.0,4.618346e+08
1,2019-11-17,8482.70,8485.83,8627.56,8373.96,8508.26,89.357152,18053200.00,1.545444e+11,2.721844e+07,...,1.108507,475351799,1716.688312,9.737649e+07,1.272001e+13,1.658248e+07,1.795335e+05,430853.0,264370.0,3.428184e+08
2,2019-11-18,8503.93,8495.72,8501.58,8043.00,8175.14,173.564990,18054962.50,1.477347e+11,4.276261e+07,...,1.019048,475617345,2176.221429,8.852408e+07,1.272001e+13,1.496338e+07,2.331815e+05,510494.0,304671.0,1.043918e+09
3,2019-11-19,8175.99,8170.52,8196.27,7993.46,8123.93,136.460384,18056750.00,1.461604e+11,9.803936e+07,...,1.143817,475922138,2228.775510,9.295029e+07,1.272001e+13,1.559906e+07,2.811575e+05,561226.0,327630.0,9.233997e+08
4,2019-11-20,8120.80,8123.92,8219.00,8037.77,8083.95,120.561255,18058525.00,1.458226e+11,8.079782e+07,...,1.025924,476249278,2261.169014,8.978871e+07,1.272001e+13,1.480709e+07,2.255931e+05,525739.0,321086.0,8.709691e+08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1402,2023-11-09,35629.52,35626.40,37971.00,35517.30,36694.00,8078.048005,19537018.75,7.005584e+11,1.569445e+08,...,1.663591,916106091,4010.251748,4.440259e+08,6.246347e+13,4.238651e+07,9.218558e+06,832393.0,573466.0,6.287697e+09
1403,2023-11-10,36696.25,36702.50,37500.00,36340.10,37311.70,4459.426938,19538787.50,7.289140e+11,4.387976e+08,...,1.621323,916679882,3401.500000,4.781817e+08,6.246347e+13,4.158990e+07,6.043066e+06,848433.0,523831.0,5.561959e+09
1404,2023-11-11,37321.65,37311.70,37411.70,36658.00,37139.80,1725.244614,19539675.00,7.260162e+11,2.492784e+08,...,1.661074,917204210,3922.056250,4.968122e+08,6.246347e+13,4.229329e+07,4.284425e+06,894325.0,627529.0,2.477252e+09
1405,2023-11-12,37140.27,37139.90,37227.60,36727.30,37053.90,1172.596290,19540587.50,7.240765e+11,9.937919e+07,...,1.738509,917830353,4368.490683,5.060836e+08,6.323395e+13,4.417493e+07,4.545268e+06,880975.0,703327.0,1.674083e+09


Once I have the daily dataset, I upsample it to a frequency of 15 minutes (15T) using the resample method.

This means that the data will be organized in 15-minute time frames. Since the source data is daily, the new intermediate rows are initially empty, and an interpolation method (linear by default) fills them in by estimating each value from the surrounding known daily values.

In [29]:
# Upsampling to 15min by interpolate
all_data.set_index('timestamp', inplace=True)
all_data_15m = all_data.resample('15T').interpolate()
all_data_15m

timestamp,market-price,opening-price,highest-price,lowest-price,closing-price,trade-volume-btc,total-bitcoins,market-cap,trade-volume-usd,blocks-size,avg-block-size,n-transactions-total,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd
2019-11-16 00:00:00,8457.690000,8463.790000,8528.440000,8430.770000,8484.070000,50.607771,1.805144e+07,1.531394e+11,1.023465e+08,249372.424907,1.171504,4.750696e+08,1915.324324,9.358260e+07,1.272001e+13,1.586438e+07,1.942917e+05,495077.000000,283468.000000,4.618346e+08
2019-11-16 00:15:00,8457.950521,8464.019583,8529.472500,8430.178229,8484.321979,51.011410,1.805146e+07,1.531540e+11,1.015639e+08,249374.231020,1.170848,4.750725e+08,1913.255199,9.362212e+07,1.272001e+13,1.587186e+07,1.941380e+05,494408.000000,283269.062500,4.605949e+08
2019-11-16 00:30:00,8458.211042,8464.249167,8530.505000,8429.586458,8484.573958,51.415050,1.805147e+07,1.531686e+11,1.007813e+08,249376.037133,1.170192,4.750754e+08,1911.186074,9.366164e+07,1.272001e+13,1.587934e+07,1.939842e+05,493739.000000,283070.125000,4.593551e+08
2019-11-16 00:45:00,8458.471563,8464.478750,8531.537500,8428.994688,8484.825937,51.818689,1.805149e+07,1.531833e+11,9.999875e+07,249377.843246,1.169535,4.750784e+08,1909.116949,9.370116e+07,1.272001e+13,1.588682e+07,1.938305e+05,493070.000000,282871.187500,4.581154e+08
2019-11-16 01:00:00,8458.732083,8464.708333,8532.570000,8428.402917,8485.077917,52.222329,1.805151e+07,1.531979e+11,9.921617e+07,249379.649359,1.168879,4.750813e+08,1907.047824,9.374068e+07,1.272001e+13,1.589430e+07,1.936768e+05,492401.000000,282672.250000,4.568756e+08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-11-12 23:00:00,37062.367083,37057.483333,37414.858333,36385.845833,36517.425000,3187.447119,1.954143e+07,7.206994e+11,6.992603e+07,525757.233966,1.674464,9.185055e+08,3575.843104,3.939149e+08,6.461839e+13,3.239613e+07,3.997268e+06,679079.083333,439957.833333,3.757942e+09
2023-11-12 23:15:00,37061.520313,37056.587500,37416.893750,36382.134375,36511.593750,3209.347671,1.954144e+07,7.206627e+11,6.960589e+07,525760.151242,1.673768,9.185128e+08,3567.227369,3.926957e+08,6.463344e+13,3.226810e+07,3.991312e+06,676884.562500,437095.125000,3.780593e+09
2023-11-12 23:30:00,37060.673542,37055.691667,37418.929167,36378.422917,36505.762500,3231.248224,1.954145e+07,7.206260e+11,6.928575e+07,525763.068518,1.673072,9.185202e+08,3558.611634,3.914764e+08,6.464849e+13,3.214007e+07,3.985355e+06,674690.041667,434232.416667,3.803244e+09
2023-11-12 23:45:00,37059.826771,37054.795833,37420.964583,36374.711458,36499.931250,3253.148776,1.954146e+07,7.205893e+11,6.896560e+07,525765.985794,1.672376,9.185275e+08,3549.995900,3.902572e+08,6.466354e+13,3.201204e+07,3.979399e+06,672495.520833,431369.708333,3.825894e+09
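The upsampling can be reproduced on two daily market-price values from the tables above. One day contains 96 intervals of 15 minutes, so linear interpolation advances each step by 1/96 of the daily change. The sketch uses the `"15min"` alias, which newer pandas versions prefer over the equivalent `"15T"`.

```python
import pandas as pd

# Two consecutive daily market-price values from the tables above
daily = pd.DataFrame(
    {"market-price": [8457.69, 8482.70]},
    index=pd.to_datetime(["2019-11-16", "2019-11-17"]),
)
daily.index.name = "timestamp"

# Upsample to 15-minute bins and fill the new rows by linear interpolation
upsampled = daily.resample("15min").interpolate()

step = (8482.70 - 8457.69) / 96  # Daily change spread over 96 intervals
print(len(upsampled), upsampled["market-price"].iloc[1], 8457.69 + step)
```

The second row comes out at about 8457.9505, in line with the 00:15 value in the output above.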


# Saving dataset

In [30]:
# Save the 15m dataset
all_data_15m.to_parquet(DATASET_RAW)

In [31]:
# Export notebook in html format (remember to save the notebook and change the model name)
if LOCAL_RUNNING:
    !jupyter nbconvert --to html 1-data-crawling.ipynb --output 1-data-crawling.ipynb --output-dir='./exports'