# **Bitcoin price prediction with PySpark - Data crawling**
## Big Data Computing final project - A.Y. 2022 - 2023
Prof. Gabriele Tolomei

MSc in Computer Science

La Sapienza, University of Rome

### Author: Corsi Danilo (1742375) - corsi.1742375@studenti.uniroma1.it



---


Description: data crawling on the Bitcoin blockchain by querying the Blockchain.com website.

# Global constants, dependencies, libraries and tools

In [28]:
# Main constants
LOCAL_RUNNING = True
MAIN_DIR = "D:/Documents/Repository/BDC/project" if LOCAL_RUNNING else "/content/drive"

In [30]:
if not LOCAL_RUNNING:
    # Point Colaboratory to Google Drive
    from google.colab import drive

    # Mount Google Drive at MAIN_DIR
    drive.mount(MAIN_DIR, force_remount=True)

Mounted at /content/drive


In [31]:
# Set main dir (on Colab, use the project folder inside the mounted Drive)
if not LOCAL_RUNNING:
    MAIN_DIR = MAIN_DIR + "/MyDrive/BDC/project"

# Datasets dir
DATASET_RAW_DIR = MAIN_DIR + "/datasets/raw"

# Datasets name
DATASET_NAME = "bitcoin_blockchain_data_15min"

# Datasets path
DATASET_RAW = DATASET_RAW_DIR + "/" + DATASET_NAME + ".parquet"

In [32]:
# Useful imports
import pandas as pd
import functools

if not LOCAL_RUNNING: from google.colab import drive

from datetime import date

# Metrics and parameters
I collected data on the Bitcoin blockchain through the API of Blockchain.com. The most relevant information was retrieved from the beginning of 2019 to the present day, a period that includes both moments of high volatility and long stretches of sideways price movement.

The features considered fall into four categories:

**Currency Statistics**

- **market-price:** the average USD market price across major bitcoin exchanges.
- **trade-volume:** the total USD value of trading volume on major bitcoin exchanges.
- **total-bitcoins:** the total number of mined bitcoin that are currently circulating on the network.
- **market-cap:** the total USD value of bitcoin in circulation.

**Block Details**

- **blocks-size:** the total size of the blockchain minus database indexes in megabytes.
- **avg-block-size:** the average block size over the past 24 hours in megabytes.
- **n-transactions-total:** the total number of transactions on the blockchain.
- **n-transactions-per-block:** the average number of transactions per block over the past 24 hours.

**Mining Information**

- **hash-rate:** the estimated number of terahashes per second the bitcoin network is performing in the last 24 hours.
- **difficulty:** a relative measure of how difficult it is to mine a new block for the blockchain.
- **miners-revenue:** total value of coinbase block rewards and transaction fees paid to miners.
- **transaction-fees-usd:** the total USD value of all transaction fees paid to miners. This does not include coinbase block rewards.

**Network Activity**

- **n-unique-addresses:** the total number of unique addresses used on the blockchain.
- **n-transactions:** the total number of confirmed transactions per day.
- **estimated-transaction-volume-usd:** the total estimated value in USD of transactions on the blockchain.
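
Each metric name above doubles as the chart identifier in the request URL. The following is a minimal sketch of how such a URL can be assembled, assuming the same host and query parameters (timespan, start, format) that the crawler in this notebook uses; the helper name `build_chart_url` is hypothetical and is not part of any library.

```python
from urllib.parse import urlencode

# Same host and path used by the crawler later in this notebook
BASE_URL = "https://api.blockchain.info/charts"


def build_chart_url(metric, timespan, start_date):
    """Hypothetical helper: build the CSV chart URL for a single metric."""
    query = urlencode({"timespan": timespan, "start": start_date, "format": "csv"})
    return f"{BASE_URL}/{metric}?{query}"


url = build_chart_url("market-price", "5years", "2019-01-01")
print(url)
```

Building the query string with `urlencode` keeps the parameters in one place and escapes any value that is not URL-safe.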

In [33]:
# Define the parameters
timespan = "5years" # Duration of the data (it was necessary to define it since it is possible to make requests for up to 6 years)
start_date = "2019-01-01"
end_date = str(date.today())

# Metrics considered
metrics = [
          # Currency Statistics
          "market-price",
          "trade-volume",

          # Block Details
          "blocks-size",
          "avg-block-size",
          "n-transactions-total",
          "n-transactions-per-block",

          # Mining Information
          "hash-rate",
          "difficulty",
          "miners-revenue",
          "transaction-fees-usd",

          # Network Activity
          "n-unique-addresses",
          "n-transactions",
          "estimated-transaction-volume-usd"
]

# Data crawling

In [34]:
def data_crawler(timespan, metric, start_date, end_date):
    """Download a single chart metric and return it as a DataFrame with columns ['timestamp', metric]."""
    # API info (one request per metric)
    url = f'https://api.blockchain.info/charts/{metric}?timespan={timespan}&start={start_date}&format=csv'

    # Obtain data, naming the two columns explicitly
    data = pd.read_csv(url, names=['timestamp', metric])

    # Transform "timestamp" to datetime type
    data['timestamp'] = pd.to_datetime(data["timestamp"])

    # Keep only the rows strictly before the end date
    data = data[(data['timestamp'] < end_date)]

    return data
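
The parsing steps of the crawler can be tried offline. The sketch below assumes the response is a two-column CSV without a header row (timestamp, value), which is what the `names=` argument above implies; the payload and the helper name `parse_chart_csv` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical payload shaped like the two-column CSV the crawler expects
sample_csv = (
    "2019-01-01 00:00:00,3737.91\n"
    "2019-01-02 00:00:00,3855.88\n"
    "2019-01-03 00:00:00,3931.31\n"
)


def parse_chart_csv(payload, metric, end_date):
    """Parse a CSV payload and keep the rows strictly before end_date."""
    data = pd.read_csv(io.StringIO(payload), names=["timestamp", metric])
    data["timestamp"] = pd.to_datetime(data["timestamp"])
    return data[data["timestamp"] < pd.Timestamp(end_date)]


parsed = parse_chart_csv(sample_csv, "market-price", "2019-01-03")
print(parsed)
```

Because the comparison is strict, the row dated exactly on the end date is excluded.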

In [35]:
# Helper that merges two DataFrames on the "timestamp" column
merge = functools.partial(pd.merge, on='timestamp')

# Get blockchain data from the Blockchain.com API (one DataFrame per metric, merged pairwise)
df1 = functools.reduce(merge, [data_crawler(timespan, metric, start_date, end_date) for metric in metrics])
df1

Unnamed: 0,timestamp,market-price,trade-volume,blocks-size,avg-block-size,n-transactions-total,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd
0,2019-01-01,3737.91,2.694888e+08,198101.219080,0.801779,369240247,1575.335570,4.161599e+07,5.618596e+12,7.012104e+06,4.154115e+04,310120.0,234725.0,2.204835e+08
1,2019-01-02,3855.88,2.419284e+08,198221.159602,0.947861,369476938,1799.311258,4.217459e+07,5.618596e+12,7.474829e+06,6.964781e+04,415822.0,271696.0,4.938601e+08
2,2019-01-03,3931.31,3.167773e+08,198364.321505,0.966222,369749053,1877.522581,4.329180e+07,5.618596e+12,7.679487e+06,7.513811e+04,450626.0,291016.0,5.863351e+08
3,2019-01-04,3822.87,1.766250e+08,198514.023883,0.959199,370039173,1891.087248,4.161599e+07,5.618596e+12,7.115351e+06,7.401997e+04,454702.0,281772.0,5.347395e+08
4,2019-01-05,3856.62,2.936471e+08,198656.957000,0.812946,370320307,1626.944785,4.552621e+07,5.618596e+12,8.052391e+06,4.937181e+04,385489.0,265192.0,3.258493e+08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1702,2023-09-06,25785.30,5.668125e+07,508846.022605,1.545214,889746462,3595.237037,3.644194e+08,5.430272e+13,2.250423e+07,7.788790e+05,757286.0,485357.0,2.224002e+09
1703,2023-09-07,25753.31,7.250844e+07,509054.562731,1.653818,890231866,3684.513699,3.930055e+08,5.415014e+13,2.440491e+07,8.159520e+05,785059.0,537939.0,2.196839e+09
1704,2023-09-08,26243.47,1.082050e+08,509296.365924,1.614361,890770975,3118.789474,3.580118e+08,5.415014e+13,2.259861e+07,9.325277e+05,694292.0,414799.0,1.920705e+09
1705,2023-09-09,25906.03,1.075289e+08,509510.723052,1.696613,891184188,3956.350993,4.064645e+08,5.415014e+13,2.527070e+07,8.420278e+05,927787.0,597409.0,1.234754e+09
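
The pairwise merge used above can be illustrated on toy data. This is a small sketch with made-up column names (`a`, `b`, `c`); since `pd.merge` defaults to an inner join, only timestamps present in every frame survive the reduction.

```python
import functools

import pandas as pd

# Three toy frames; only 2019-01-02 appears in all of them
frames = [
    pd.DataFrame({"timestamp": pd.to_datetime(["2019-01-01", "2019-01-02"]), "a": [1.0, 2.0]}),
    pd.DataFrame({"timestamp": pd.to_datetime(["2019-01-01", "2019-01-02"]), "b": [10.0, 20.0]}),
    pd.DataFrame({"timestamp": pd.to_datetime(["2019-01-02", "2019-01-03"]), "c": [200.0, 300.0]}),
]

# Fix the join key once, then fold the list of frames into a single frame
merge_on_timestamp = functools.partial(pd.merge, on="timestamp")
combined = functools.reduce(merge_on_timestamp, frames)
print(combined)
```

A consequence of this design is that a metric with a missing day removes that day from the whole merged dataset.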


In [36]:
# Check duplicated rows
len(df1['timestamp'].unique())

1707

In [37]:
# Retrieving market capitalization and total circulating data
metrics = [
          # Currency Statistics
          "total-bitcoins",                      # Total Circulating Bitcoin: The total number of mined bitcoin that are currently circulating on the network.
          "market-cap",                          # Market Capitalization (USD): The total USD value of bitcoin in circulation.
  ]

df2 = functools.reduce(merge, [data_crawler(timespan, metric, start_date, end_date) for metric in metrics])
df2

Unnamed: 0,timestamp,total-bitcoins,market-cap
0,2019-01-01 00:03:10,17455750.00,6.446408e+10
1,2019-01-02 03:05:27,17457850.00,6.637475e+10
2,2019-01-03 07:20:18,17459975.00,6.715106e+10
3,2019-01-04 08:10:45,17462075.00,6.644320e+10
4,2019-01-05 09:11:04,17464162.50,6.687028e+10
...,...,...,...
1491,2023-09-06 02:33:19,19477518.75,5.029095e+11
1492,2023-09-07 07:53:36,19478562.50,5.022353e+11
1493,2023-09-08 11:32:23,19479606.25,5.038205e+11
1494,2023-09-09 17:43:56,19480650.00,5.042761e+11


In [38]:
# Check duplicated rows
len(df2['timestamp'].unique())

1496

In [39]:
# Drop the time of day (h:m:s) from the timestamps, keeping only the date
df2['timestamp'] = pd.to_datetime(df2["timestamp"]).dt.normalize()

# Drop the duplicates in column "timestamp", keep the last value
df2.drop_duplicates(subset="timestamp", keep="last", inplace=True)
df2

Unnamed: 0,timestamp,total-bitcoins,market-cap
0,2019-01-01,17455750.00,6.446408e+10
1,2019-01-02,17457850.00,6.637475e+10
2,2019-01-03,17459975.00,6.715106e+10
3,2019-01-04,17462075.00,6.644320e+10
4,2019-01-05,17464162.50,6.687028e+10
...,...,...,...
1491,2023-09-06,19477518.75,5.029095e+11
1492,2023-09-07,19478562.50,5.022353e+11
1493,2023-09-08,19479606.25,5.038205e+11
1494,2023-09-09,19480650.00,5.042761e+11


In [40]:
# Check duplicated rows
len(df2['timestamp'].unique())

1491
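
The drop from 1496 to 1491 unique timestamps comes from days that had more than one observation. A minimal sketch of the same normalize-then-deduplicate step on made-up values:

```python
import pandas as pd

# Two observations fall on 2019-01-01, one on 2019-01-02 (values are made up)
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-01-01 00:03:10",
        "2019-01-01 18:40:00",
        "2019-01-02 03:05:27",
    ]),
    "total-bitcoins": [100.0, 112.5, 125.0],
})

# Set every timestamp to midnight, so same-day rows share one key
toy["timestamp"] = toy["timestamp"].dt.normalize()

# Keep the last observation of each day
deduped = toy.drop_duplicates(subset="timestamp", keep="last").reset_index(drop=True)
print(deduped)
```

With `keep="last"`, the value retained for a day is the latest one recorded on that day.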

In [41]:
# Add the market capitalization and total circulating data to the main dataset
all_data = pd.merge(df1, df2, how="inner", on='timestamp')
# Forward-fill any missing value with the last known one
all_data = all_data.ffill()
all_data

Unnamed: 0,timestamp,market-price,trade-volume,blocks-size,avg-block-size,n-transactions-total,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd,total-bitcoins,market-cap
0,2019-01-01,3737.91,2.694888e+08,198101.219080,0.801779,369240247,1575.335570,4.161599e+07,5.618596e+12,7.012104e+06,4.154115e+04,310120.0,234725.0,2.204835e+08,17455750.00,6.446408e+10
1,2019-01-02,3855.88,2.419284e+08,198221.159602,0.947861,369476938,1799.311258,4.217459e+07,5.618596e+12,7.474829e+06,6.964781e+04,415822.0,271696.0,4.938601e+08,17457850.00,6.637475e+10
2,2019-01-03,3931.31,3.167773e+08,198364.321505,0.966222,369749053,1877.522581,4.329180e+07,5.618596e+12,7.679487e+06,7.513811e+04,450626.0,291016.0,5.863351e+08,17459975.00,6.715106e+10
3,2019-01-04,3822.87,1.766250e+08,198514.023883,0.959199,370039173,1891.087248,4.161599e+07,5.618596e+12,7.115351e+06,7.401997e+04,454702.0,281772.0,5.347395e+08,17462075.00,6.644320e+10
4,2019-01-05,3856.62,2.936471e+08,198656.957000,0.812946,370320307,1626.944785,4.552621e+07,5.618596e+12,8.052391e+06,4.937181e+04,385489.0,265192.0,3.258493e+08,17464162.50,6.687028e+10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1480,2023-09-06,25785.30,5.668125e+07,508846.022605,1.545214,889746462,3595.237037,3.644194e+08,5.430272e+13,2.250423e+07,7.788790e+05,757286.0,485357.0,2.224002e+09,19477518.75,5.029095e+11
1481,2023-09-07,25753.31,7.250844e+07,509054.562731,1.653818,890231866,3684.513699,3.930055e+08,5.415014e+13,2.440491e+07,8.159520e+05,785059.0,537939.0,2.196839e+09,19478562.50,5.022353e+11
1482,2023-09-08,26243.47,1.082050e+08,509296.365924,1.614361,890770975,3118.789474,3.580118e+08,5.415014e+13,2.259861e+07,9.325277e+05,694292.0,414799.0,1.920705e+09,19479606.25,5.038205e+11
1483,2023-09-09,25906.03,1.075289e+08,509510.723052,1.696613,891184188,3956.350993,4.064645e+08,5.415014e+13,2.527070e+07,8.420278e+05,927787.0,597409.0,1.234754e+09,19480650.00,5.042761e+11


In [42]:
# Check nan values
all_data[all_data.isnull().T.any()]

Unnamed: 0,timestamp,market-price,trade-volume,blocks-size,avg-block-size,n-transactions-total,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd,total-bitcoins,market-cap
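
The empty table above means that no row contains a missing value. As a sketch on made-up data, this is what the same check and a forward fill look like when gaps do exist; `ffill()` is used here as the direct way to forward-fill, which should behave like the forward-fill interpolation in the merge cell.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing values in the middle
toy = pd.DataFrame({
    "timestamp": pd.date_range("2019-01-01", periods=4, freq="D"),
    "market-cap": [1.0, np.nan, np.nan, 4.0],
})

# Rows that contain at least one NaN
rows_with_nan = toy[toy.isnull().any(axis=1)]

# Propagate the last known value forward
filled = toy.ffill()
print(rows_with_nan)
print(filled)
```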


In [43]:
# Check duplicated rows
len(all_data['timestamp'].unique())

1485
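
The counts shown in this notebook go from 1707 timestamps in `df1` and 1491 in `df2` to 1485 after the inner merge, so some days exist in only one of the two frames. One way to inspect which ones, sketched here on toy data, is an outer merge with `indicator=True`; this audit step is an addition for illustration and is not part of the original pipeline.

```python
import pandas as pd

left = pd.DataFrame({
    "timestamp": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03"]),
    "market-price": [1.0, 2.0, 3.0],
})
right = pd.DataFrame({
    "timestamp": pd.to_datetime(["2019-01-01", "2019-01-03", "2019-01-04"]),
    "market-cap": [10.0, 30.0, 40.0],
})

# The "_merge" column reports where each timestamp was found
audit = pd.merge(left, right, how="outer", on="timestamp", indicator=True)
counts = audit["_merge"].value_counts()
print(audit)
print(counts)
```

Rows marked `both` are the ones an inner merge keeps; `left_only` and `right_only` rows are the ones it drops.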

In [44]:
all_data

Unnamed: 0,timestamp,market-price,trade-volume,blocks-size,avg-block-size,n-transactions-total,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd,total-bitcoins,market-cap
0,2019-01-01,3737.91,2.694888e+08,198101.219080,0.801779,369240247,1575.335570,4.161599e+07,5.618596e+12,7.012104e+06,4.154115e+04,310120.0,234725.0,2.204835e+08,17455750.00,6.446408e+10
1,2019-01-02,3855.88,2.419284e+08,198221.159602,0.947861,369476938,1799.311258,4.217459e+07,5.618596e+12,7.474829e+06,6.964781e+04,415822.0,271696.0,4.938601e+08,17457850.00,6.637475e+10
2,2019-01-03,3931.31,3.167773e+08,198364.321505,0.966222,369749053,1877.522581,4.329180e+07,5.618596e+12,7.679487e+06,7.513811e+04,450626.0,291016.0,5.863351e+08,17459975.00,6.715106e+10
3,2019-01-04,3822.87,1.766250e+08,198514.023883,0.959199,370039173,1891.087248,4.161599e+07,5.618596e+12,7.115351e+06,7.401997e+04,454702.0,281772.0,5.347395e+08,17462075.00,6.644320e+10
4,2019-01-05,3856.62,2.936471e+08,198656.957000,0.812946,370320307,1626.944785,4.552621e+07,5.618596e+12,8.052391e+06,4.937181e+04,385489.0,265192.0,3.258493e+08,17464162.50,6.687028e+10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1480,2023-09-06,25785.30,5.668125e+07,508846.022605,1.545214,889746462,3595.237037,3.644194e+08,5.430272e+13,2.250423e+07,7.788790e+05,757286.0,485357.0,2.224002e+09,19477518.75,5.029095e+11
1481,2023-09-07,25753.31,7.250844e+07,509054.562731,1.653818,890231866,3684.513699,3.930055e+08,5.415014e+13,2.440491e+07,8.159520e+05,785059.0,537939.0,2.196839e+09,19478562.50,5.022353e+11
1482,2023-09-08,26243.47,1.082050e+08,509296.365924,1.614361,890770975,3118.789474,3.580118e+08,5.415014e+13,2.259861e+07,9.325277e+05,694292.0,414799.0,1.920705e+09,19479606.25,5.038205e+11
1483,2023-09-09,25906.03,1.075289e+08,509510.723052,1.696613,891184188,3956.350993,4.064645e+08,5.415014e+13,2.527070e+07,8.420278e+05,927787.0,597409.0,1.234754e+09,19480650.00,5.042761e+11


In [45]:
# Reorder columns (timestamp, price, supply and market cap first)
first_columns = ['timestamp', 'market-price', 'total-bitcoins', 'market-cap']
new_columns = first_columns + [col for col in all_data.columns if col not in first_columns]
all_data = all_data.reindex(columns=new_columns)
all_data

Unnamed: 0,timestamp,market-price,total-bitcoins,market-cap,trade-volume,blocks-size,avg-block-size,n-transactions-total,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd
0,2019-01-01,3737.91,17455750.00,6.446408e+10,2.694888e+08,198101.219080,0.801779,369240247,1575.335570,4.161599e+07,5.618596e+12,7.012104e+06,4.154115e+04,310120.0,234725.0,2.204835e+08
1,2019-01-02,3855.88,17457850.00,6.637475e+10,2.419284e+08,198221.159602,0.947861,369476938,1799.311258,4.217459e+07,5.618596e+12,7.474829e+06,6.964781e+04,415822.0,271696.0,4.938601e+08
2,2019-01-03,3931.31,17459975.00,6.715106e+10,3.167773e+08,198364.321505,0.966222,369749053,1877.522581,4.329180e+07,5.618596e+12,7.679487e+06,7.513811e+04,450626.0,291016.0,5.863351e+08
3,2019-01-04,3822.87,17462075.00,6.644320e+10,1.766250e+08,198514.023883,0.959199,370039173,1891.087248,4.161599e+07,5.618596e+12,7.115351e+06,7.401997e+04,454702.0,281772.0,5.347395e+08
4,2019-01-05,3856.62,17464162.50,6.687028e+10,2.936471e+08,198656.957000,0.812946,370320307,1626.944785,4.552621e+07,5.618596e+12,8.052391e+06,4.937181e+04,385489.0,265192.0,3.258493e+08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1480,2023-09-06,25785.30,19477518.75,5.029095e+11,5.668125e+07,508846.022605,1.545214,889746462,3595.237037,3.644194e+08,5.430272e+13,2.250423e+07,7.788790e+05,757286.0,485357.0,2.224002e+09
1481,2023-09-07,25753.31,19478562.50,5.022353e+11,7.250844e+07,509054.562731,1.653818,890231866,3684.513699,3.930055e+08,5.415014e+13,2.440491e+07,8.159520e+05,785059.0,537939.0,2.196839e+09
1482,2023-09-08,26243.47,19479606.25,5.038205e+11,1.082050e+08,509296.365924,1.614361,890770975,3118.789474,3.580118e+08,5.415014e+13,2.259861e+07,9.325277e+05,694292.0,414799.0,1.920705e+09
1483,2023-09-09,25906.03,19480650.00,5.042761e+11,1.075289e+08,509510.723052,1.696613,891184188,3956.350993,4.064645e+08,5.415014e+13,2.527070e+07,8.420278e+05,927787.0,597409.0,1.234754e+09



Once the daily dataset is ready, it is upsampled to a frequency of 15 minutes (15T) using the resample method.

The data is thus organized in 15-minute time frames, and the new rows created between two daily observations are filled by interpolation, i.e. their values are estimated from the surrounding known values.
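
The upsampling described above can be illustrated on a toy series. This sketch uses two made-up daily values and the `"15min"` alias, which denotes the same 15-minute frequency as `15T`; `interpolate()` is left at its default linear method.

```python
import pandas as pd

# Two daily observations: 0 at midnight of day 1, 96 at midnight of day 2
daily = pd.DataFrame(
    {"market-price": [0.0, 96.0]},
    index=pd.to_datetime(["2019-01-01", "2019-01-02"]),
)
daily.index.name = "timestamp"

# resample() creates the empty 15-minute slots, interpolate() fills them linearly
upsampled = daily.resample("15min").interpolate()
print(upsampled.head())
print(len(upsampled))
```

One day contains 96 quarter-hour intervals, so the two observations become 97 rows and the value grows by one unit per row. Note that the intermediate rows are estimates derived from the daily values, not observed intraday data.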

In [46]:
# Upsample from daily to 15-minute frequency, filling the new rows by linear interpolation
all_data.set_index('timestamp', inplace=True)
all_data_15m = all_data.resample('15T').interpolate()
all_data_15m

timestamp,market-price,total-bitcoins,market-cap,trade-volume,blocks-size,avg-block-size,n-transactions-total,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd
2019-01-01 00:00:00,3737.910000,1.745575e+07,6.446408e+10,2.694888e+08,198101.219080,0.801779,3.692402e+08,1575.335570,4.161599e+07,5.618596e+12,7.012104e+06,4.154115e+04,310120.000000,234725.000000,2.204835e+08
2019-01-01 00:15:00,3739.138854,1.745577e+07,6.448399e+10,2.692018e+08,198102.468460,0.803301,3.692427e+08,1577.668651,4.162180e+07,5.618596e+12,7.016924e+06,4.183393e+04,311221.062500,235110.114583,2.233311e+08
2019-01-01 00:30:00,3740.367708,1.745579e+07,6.450389e+10,2.689147e+08,198103.717841,0.804822,3.692452e+08,1580.001731,4.162762e+07,5.618596e+12,7.021744e+06,4.212670e+04,312322.125000,235495.229167,2.261788e+08
2019-01-01 00:45:00,3741.596562,1.745582e+07,6.452379e+10,2.686276e+08,198104.967221,0.806344,3.692476e+08,1582.334811,4.163344e+07,5.618596e+12,7.026564e+06,4.241948e+04,313423.187500,235880.343750,2.290265e+08
2019-01-01 01:00:00,3742.825417,1.745584e+07,6.454370e+10,2.683405e+08,198106.216602,0.807866,3.692501e+08,1584.667891,4.163926e+07,5.618596e+12,7.031384e+06,4.271226e+04,314524.250000,236265.458333,2.318742e+08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-09-09 23:00:00,25894.798333,1.948165e+07,5.016135e+11,2.640476e+07,509756.348385,1.660303,8.917567e+08,3805.178173,3.961459e+08,5.415014e+13,2.496193e+07,1.199487e+06,850428.416667,560020.583333,1.011974e+09
2023-09-09 23:15:00,25894.676250,1.948166e+07,5.015846e+11,2.552298e+07,509759.018225,1.659908,8.917629e+08,3803.534991,3.960337e+08,5.415014e+13,2.495858e+07,1.203372e+06,849587.562500,559614.187500,1.009552e+09
2023-09-09 23:30:00,25894.554167,1.948167e+07,5.015556e+11,2.464119e+07,509761.688066,1.659514,8.917692e+08,3801.891808,3.959216e+08,5.415014e+13,2.495522e+07,1.207257e+06,848746.708333,559207.791667,1.007131e+09
2023-09-09 23:45:00,25894.432083,1.948168e+07,5.015267e+11,2.375941e+07,509764.357906,1.659119,8.917754e+08,3800.248625,3.958094e+08,5.415014e+13,2.495187e+07,1.211143e+06,847905.854167,558801.395833,1.004709e+09


# Saving dataset

In [47]:
# Save the 15m dataset
all_data_15m.to_parquet(DATASET_RAW)