# Bitcoin price forecasting with PySpark
## Big Data Computing final project - A.Y. 2022 - 2023
Prof. Gabriele Tolomei

MSc in Computer Science

La Sapienza, University of Rome

### Author
Corsi Danilo - corsi.1742375@studenti.uniroma1.it



Description: In this notebook I retrieve data from www.blockchain.com, selecting the most relevant features, build the daily (1d) dataset, upsample it to hourly (1h) frequency, and save the result to Google Drive.

# Global Constants

In [None]:
GDRIVE_DIR = "/content/drive"
GDRIVE_DATASET_RAW_DIR = GDRIVE_DIR + "/MyDrive/BDC/project/datasets/raw"

#  Import Python packages ❗

In [None]:
import pandas as pd
import functools

from google.colab import drive

#  Define metrics and parameters

In this section I define the parameters used to collect the data and the metrics to retrieve. I consider Bitcoin data from 2012-01-01 up to (but not including) 2023-06-30, i.e. roughly eleven and a half years.

Note that since the Blockchain.com API returns at most 6 years of data per request, I manually computed a continuation date (continue_date) from which a second API call retrieves the remaining data.
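The continuation date can also be derived instead of hard-coded. The following is a minimal sketch, assuming the 6-year window covers six calendar years from the start date and that the second request should begin on the last day of that window; compute_continue_date is a hypothetical helper and not part of the notebook.

```python
import pandas as pd


def compute_continue_date(start_date: str, years: int = 6) -> str:
    """Return the last day of a `years`-long window starting at `start_date`.

    Hypothetical helper: assumes the API window spans whole calendar years.
    """
    start = pd.Timestamp(start_date)
    # Jump forward by the window length, then step back one day
    last_day = start + pd.DateOffset(years=years) - pd.Timedelta(days=1)
    return last_day.strftime("%Y-%m-%d")


print(compute_continue_date("2012-01-01"))  # 2017-12-31
```

For the start date used here, this reproduces the manually computed value 2017-12-31.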

Regarding the metrics, I chose the ones that seemed most relevant, covering both price statistics and technical features of Bitcoin's blockchain.

In [None]:
# Define the parameters
timespan = "6years" # Duration of each request (the API's maximum timespan is 6 years)
start_date = "2012-01-01"
continue_date = "2017-12-31" # Start date of the second request (computed manually)
end_date = "2023-06-30"

# Metrics considered
metrics = [
          ##Currency Statistics##
            "market-price", # Market Price: The average USD market price across major bitcoin exchanges.
            "trade-volume", # Exchange Trade Volume (USD): The total USD value of trading volume on major bitcoin exchanges.

          ##Block Details##
            "blocks-size", # Blockchain Size (MB): The total size of the blockchain minus database indexes in megabytes.
            "avg-block-size", # Average Block Size (MB): The average block size over the past 24 hours in megabytes.
            "n-transactions-total", # Total Number of Transactions: The total number of transactions on the blockchain.
            "n-transactions-per-block", # Average Transactions Per Block: The average number of transactions per block over the past 24 hours.

          ##Mining Information##
            "hash-rate", # Total Hash Rate (TH/s): The estimated number of terahashes per second the bitcoin network is performing in the last 24 hours.
            "difficulty", # Network Difficulty: A relative measure of how difficult it is to mine a new block for the blockchain.
            "miners-revenue", # Miners Revenue (USD): Total value of coinbase block rewards and transaction fees paid to miners.
            "transaction-fees-usd", # Total Transaction Fees (USD): The total USD value of all transaction fees paid to miners. This does not include coinbase block rewards.

          ##Network Activity##
            "n-unique-addresses", # The total number of unique addresses used on the blockchain.
            "n-transactions", # Confirmed Transactions Per Day: The total number of confirmed transactions per day.
            "estimated-transaction-volume-usd" # Estimated Transaction Value (USD): The total estimated value in USD of transactions on the blockchain. This does not include coins returned as change.
]

# Retrieving data

In this section we call the Blockchain.com API to retrieve the data.

In [None]:
def data_crawler(timespan, metric, start_date, continue_date, end_date):
    # Build one URL per request: each call covers at most `timespan` of data for a single metric
    url1 = f'https://api.blockchain.info/charts/{metric}?timespan={timespan}&start={start_date}&format=csv'
    url2 = f'https://api.blockchain.info/charts/{metric}?timespan={timespan}&start={continue_date}&format=csv'

    # Download both chunks, naming the columns explicitly
    data1 = pd.read_csv(url1, names=['timestamp', metric])
    data2 = pd.read_csv(url2, names=['timestamp', metric])

    # Concatenate the two chunks by rows
    all_data = pd.concat([data1, data2])

    # Convert "timestamp" to datetime type
    all_data['timestamp'] = pd.to_datetime(all_data['timestamp'])

    # Keep only the rows strictly before end_date
    all_data = all_data[all_data['timestamp'] < end_date]

    return all_data
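
The request URLs used by data_crawler can be assembled without touching the network, which makes them easy to inspect before downloading anything. This is a small sketch that reuses the endpoint and query parameters shown above; build_chart_url is a hypothetical helper name introduced only for illustration.

```python
from urllib.parse import urlencode

# Same endpoint as in data_crawler
BASE_URL = "https://api.blockchain.info/charts"


def build_chart_url(metric: str, timespan: str, start: str) -> str:
    """Assemble the CSV chart URL for one metric (hypothetical helper)."""
    query = urlencode({"timespan": timespan, "start": start, "format": "csv"})
    return f"{BASE_URL}/{metric}?{query}"


url = build_chart_url("market-price", "6years", "2012-01-01")
print(url)
```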

In [None]:
# Merge helper: joins two DataFrames on 'timestamp'
merge = functools.partial(pd.merge, on='timestamp')

# Get the blockchain data from the Blockchain.com API, one metric at a time
df1 = functools.reduce(merge, [data_crawler(timespan, metric, start_date, continue_date, end_date) for metric in metrics])
df1

Unnamed: 0,timestamp,market-price,trade-volume,blocks-size,avg-block-size,n-transactions-total,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd
0,2012-01-01,5.04,0.000000e+00,861.941752,0.017073,2119853.0,32.686275,8.591401e+00,1.159929e+06,4.260652e+04,1.851638e+01,8531.0,5001.0,1.016110e+06
1,2012-01-02,5.27,0.000000e+00,864.547504,0.019121,2124845.0,35.827815,8.764382e+00,1.159929e+06,6.301249e+04,3.598932e+01,8928.0,5410.0,7.508830e+05
2,2012-01-03,5.45,0.000000e+00,867.445999,0.018212,2130220.0,36.308176,9.340986e+00,1.159929e+06,4.662806e+04,3.056013e+01,9528.0,5773.0,6.037982e+05
3,2012-01-04,5.37,0.000000e+00,870.374487,0.019351,2135991.0,38.463087,8.879703e+00,1.159929e+06,4.706558e+04,7.808277e+01,9542.0,5731.0,7.495462e+05
4,2012-01-05,5.80,0.000000e+00,873.246150,0.024677,2141802.0,47.578231,8.476080e+00,1.159929e+06,5.369470e+04,4.469720e+01,11636.0,6994.0,1.614569e+06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4181,2023-06-25,30548.59,1.125890e+08,491068.723512,1.868064,856142438.0,2501.758170,3.981602e+08,5.235044e+13,3.019582e+07,9.002866e+05,656558.0,382769.0,2.222326e+09
4182,2023-06-26,30458.97,1.152567e+08,491354.967999,1.723111,856526345.0,2766.979452,3.799438e+08,5.235044e+13,2.903694e+07,9.910232e+05,692095.0,403979.0,3.515517e+09
4183,2023-06-27,30266.70,1.641947e+08,491606.066760,1.717011,856929056.0,3106.128571,3.643296e+08,5.235044e+13,2.769509e+07,9.677887e+05,707936.0,434858.0,3.548785e+09
4184,2023-06-28,30699.11,1.862727e+08,491846.908545,1.680819,857363958.0,3067.388889,2.805460e+08,5.225576e+13,2.137802e+07,9.052404e+05,627350.0,331278.0,3.880455e+09
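
The reduce-based merge above can be illustrated on toy data. The sketch below uses made-up frames with hypothetical column names a, b and c; it shows that pd.merge defaults to an inner join, so only timestamps present in every frame survive the chain.

```python
import functools

import pandas as pd

# Three toy single-metric frames (made-up values, hypothetical column names)
frames = [
    pd.DataFrame({"timestamp": pd.to_datetime(["2012-01-01", "2012-01-02"]), "a": [1.0, 2.0]}),
    pd.DataFrame({"timestamp": pd.to_datetime(["2012-01-01", "2012-01-02"]), "b": [10.0, 20.0]}),
    pd.DataFrame({"timestamp": pd.to_datetime(["2012-01-02", "2012-01-03"]), "c": [100.0, 200.0]}),
]

# Same pattern as in the notebook: fix the join key, then fold the list pairwise
merge = functools.partial(pd.merge, on="timestamp")
merged = functools.reduce(merge, frames)

# Only 2012-01-02 appears in all three frames, so a single row remains
print(merged)
```

If a metric is missing some days, an inner join silently drops those days for every metric, so it is worth comparing the row count before and after merging.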


In [None]:
# Count the unique timestamps (compare with the number of rows to spot duplicates)
len(df1['timestamp'].unique())

4186

Due to a problem with the Blockchain.com API, I had to make an additional call to retrieve the market capitalization and the total circulating supply, which are then added to the currency statistics to obtain a single dataset.

In [None]:
# Retrieving market capitalization and total circulating data
metrics = [
  "total-bitcoins", # Total Circulating Bitcoin: The total number of mined bitcoin that are currently circulating on the network.
  "market-cap", # Market Capitalization (USD): The total USD value of bitcoin in circulation.
  ]

merge = functools.partial(pd.merge, on='timestamp')
df2 = functools.reduce(merge, [data_crawler(timespan, metric, start_date, continue_date, end_date) for metric in metrics])
df2

Unnamed: 0,timestamp,total-bitcoins,market-cap
0,2012-01-01 00:00:01,8001900.00,4.032958e+07
1,2012-01-02 13:34:31,8013350.00,4.223035e+07
2,2012-01-04 00:14:03,8025100.00,4.309479e+07
3,2012-01-05 15:23:53,8036850.00,4.661373e+07
4,2012-01-06 23:04:03,8048100.00,5.311746e+07
...,...,...,...
2970,2023-06-23 00:05:32,19409431.25,5.807108e+11
2971,2023-06-24 11:28:27,19410681.25,5.966067e+11
2972,2023-06-25 19:43:11,19411925.00,5.911125e+11
2973,2023-06-27 04:56:46,19413175.00,5.896946e+11


In [None]:
# Count the unique timestamps (compare with the number of rows to spot duplicates)
len(df2['timestamp'].unique())

2975

In [None]:
# Strip the time of day (h:m:s) from the timestamp, keeping only the date
df2['timestamp'] = pd.to_datetime(df2["timestamp"]).dt.normalize()

# Drop the duplicates in column "timestamp", keep the last value
df2.drop_duplicates(subset="timestamp", keep="last", inplace=True)
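
The effect of normalizing and then dropping duplicates can be seen on a toy frame. This is a sketch with partly made-up values (the middle row is invented): two observations fall on the same day, and keep="last" retains the later one.

```python
import pandas as pd

# Toy frame: the first two rows fall on the same calendar day (middle value is made up)
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2012-01-01 00:00:01",
        "2012-01-01 23:10:00",
        "2012-01-02 13:34:31",
    ]),
    "total-bitcoins": [8001900.0, 8007000.0, 8013350.0],
})

# Set the time of day to midnight so that rows can be compared by date
toy["timestamp"] = toy["timestamp"].dt.normalize()

# Keep only the last observation of each day
toy = toy.drop_duplicates(subset="timestamp", keep="last")
print(toy)
```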

In [None]:
df2

Unnamed: 0,timestamp,total-bitcoins,market-cap
0,2012-01-01,8001900.00,4.032958e+07
1,2012-01-02,8013350.00,4.223035e+07
2,2012-01-04,8025100.00,4.309479e+07
3,2012-01-05,8036850.00,4.661373e+07
4,2012-01-06,8048100.00,5.311746e+07
...,...,...,...
2970,2023-06-23,19409431.25,5.807108e+11
2971,2023-06-24,19410681.25,5.966067e+11
2972,2023-06-25,19411925.00,5.911125e+11
2973,2023-06-27,19413175.00,5.896946e+11


In [None]:
# Count the unique timestamps (compare with the number of rows to spot duplicates)
len(df2['timestamp'].unique())

2974

In [None]:
# Add the market capitalization and total circulating supply;
# days missing from df2 are forward-filled with the last known value
all_data = pd.merge(df1, df2, how="left", on='timestamp')
all_data = all_data.ffill()
all_data

Unnamed: 0,timestamp,market-price,trade-volume,blocks-size,avg-block-size,n-transactions-total,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd,total-bitcoins,market-cap
0,2012-01-01,5.04,0.000000e+00,861.941752,0.017073,2119853.0,32.686275,8.591401e+00,1.159929e+06,4.260652e+04,1.851638e+01,8531.0,5001.0,1.016110e+06,8001900.00,4.032958e+07
1,2012-01-02,5.27,0.000000e+00,864.547504,0.019121,2124845.0,35.827815,8.764382e+00,1.159929e+06,6.301249e+04,3.598932e+01,8928.0,5410.0,7.508830e+05,8013350.00,4.223035e+07
2,2012-01-03,5.45,0.000000e+00,867.445999,0.018212,2130220.0,36.308176,9.340986e+00,1.159929e+06,4.662806e+04,3.056013e+01,9528.0,5773.0,6.037982e+05,8013350.00,4.223035e+07
3,2012-01-04,5.37,0.000000e+00,870.374487,0.019351,2135991.0,38.463087,8.879703e+00,1.159929e+06,4.706558e+04,7.808277e+01,9542.0,5731.0,7.495462e+05,8025100.00,4.309479e+07
4,2012-01-05,5.80,0.000000e+00,873.246150,0.024677,2141802.0,47.578231,8.476080e+00,1.159929e+06,5.369470e+04,4.469720e+01,11636.0,6994.0,1.614569e+06,8036850.00,4.661373e+07
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4181,2023-06-25,30548.59,1.125890e+08,491068.723512,1.868064,856142438.0,2501.758170,3.981602e+08,5.235044e+13,3.019582e+07,9.002866e+05,656558.0,382769.0,2.222326e+09,19411925.00,5.911125e+11
4182,2023-06-26,30458.97,1.152567e+08,491354.967999,1.723111,856526345.0,2766.979452,3.799438e+08,5.235044e+13,2.903694e+07,9.910232e+05,692095.0,403979.0,3.515517e+09,19411925.00,5.911125e+11
4183,2023-06-27,30266.70,1.641947e+08,491606.066760,1.717011,856929056.0,3106.128571,3.643296e+08,5.235044e+13,2.769509e+07,9.677887e+05,707936.0,434858.0,3.548785e+09,19413175.00,5.896946e+11
4184,2023-06-28,30699.11,1.862727e+08,491846.908545,1.680819,857363958.0,3067.388889,2.805460e+08,5.225576e+13,2.137802e+07,9.052404e+05,627350.0,331278.0,3.880455e+09,19414418.75,5.873250e+11
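
The left merge followed by a forward fill can be reproduced on three days taken from the tables above. In this sketch the sparse frame has no row for 2012-01-03, so the merge produces a NaN that the forward fill replaces with the value from 2012-01-02.

```python
import pandas as pd

# Daily frame: one row per day
daily = pd.DataFrame({
    "timestamp": pd.to_datetime(["2012-01-02", "2012-01-03", "2012-01-04"]),
    "market-price": [5.27, 5.45, 5.37],
})

# Sparse frame: 2012-01-03 is missing
sparse = pd.DataFrame({
    "timestamp": pd.to_datetime(["2012-01-02", "2012-01-04"]),
    "total-bitcoins": [8013350.0, 8025100.0],
})

# A left merge keeps every daily row and leaves NaN where the sparse frame has no match
merged = pd.merge(daily, sparse, how="left", on="timestamp")
n_missing = int(merged["total-bitcoins"].isna().sum())

# Forward fill propagates the last known value into the gap
filled = merged.ffill()
print(n_missing)
print(filled)
```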


In [None]:
# Show the rows that contain at least one NaN value
all_data[all_data.isnull().any(axis=1)]

Unnamed: 0,timestamp,market-price,trade-volume,blocks-size,avg-block-size,n-transactions-total,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd,total-bitcoins,market-cap


In [None]:
# Count the unique timestamps (compare with the number of rows to spot duplicates)
len(all_data['timestamp'].unique())

4186
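
The NaN and duplicate checks above can be gathered into a single report, together with a check for missing calendar days. This is only a sketch: check_daily_frame is a hypothetical helper, and it assumes the frame is meant to hold exactly one row per day.

```python
import pandas as pd


def check_daily_frame(df: pd.DataFrame, column: str = "timestamp") -> dict:
    """Summarize NaN rows, duplicated timestamps and missing days (hypothetical helper)."""
    ts = df[column]
    # Every calendar day between the first and last timestamp
    expected = pd.date_range(ts.min(), ts.max(), freq="D")
    return {
        "nan_rows": int(df.isnull().any(axis=1).sum()),
        "duplicate_timestamps": int(ts.duplicated().sum()),
        "missing_days": expected.difference(pd.DatetimeIndex(ts)).tolist(),
    }


# Toy frame with a gap on 2012-01-03
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2012-01-01", "2012-01-02", "2012-01-04"]),
    "market-price": [5.04, 5.27, 5.37],
})
report = check_daily_frame(toy)
print(report)
```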

In [None]:
def move_columns(dataset, target_column, column_to_move):
  # Place column_to_move immediately after target_column
  cols = list(dataset.columns)
  cols.remove(column_to_move)
  cols.insert(cols.index(target_column)+1, column_to_move)
  dataset = dataset.reindex(columns=cols)

  return dataset
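
A quick way to see what move_columns does is to apply it to a one-row toy frame. The sketch below repeats the same logic so that it runs on its own; the single row of values is taken from the first row of the tables above.

```python
import pandas as pd


def move_columns(dataset, target_column, column_to_move):
    # Place column_to_move immediately after target_column
    cols = list(dataset.columns)
    cols.remove(column_to_move)
    cols.insert(cols.index(target_column) + 1, column_to_move)
    return dataset.reindex(columns=cols)


toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2012-01-01"]),
    "market-price": [5.04],
    "trade-volume": [0.0],
    "total-bitcoins": [8001900.0],
    "market-cap": [4.032958e+07],
})

# The column moved last ends up closest to the target column
toy = move_columns(toy, "market-price", "total-bitcoins")
toy = move_columns(toy, "market-price", "market-cap")
print(list(toy.columns))
# ['timestamp', 'market-price', 'market-cap', 'total-bitcoins', 'trade-volume']
```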

In [None]:
# Move the columns 'total-bitcoins' and 'market-cap' right after the column 'market-price'
all_data = move_columns(all_data, 'market-price', 'total-bitcoins')
all_data = move_columns(all_data, 'market-price', 'market-cap')
all_data

Unnamed: 0,timestamp,market-price,market-cap,total-bitcoins,trade-volume,blocks-size,avg-block-size,n-transactions-total,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd
0,2012-01-01,5.04,4.032958e+07,8001900.00,0.000000e+00,861.941752,0.017073,2119853.0,32.686275,8.591401e+00,1.159929e+06,4.260652e+04,1.851638e+01,8531.0,5001.0,1.016110e+06
1,2012-01-02,5.27,4.223035e+07,8013350.00,0.000000e+00,864.547504,0.019121,2124845.0,35.827815,8.764382e+00,1.159929e+06,6.301249e+04,3.598932e+01,8928.0,5410.0,7.508830e+05
2,2012-01-03,5.45,4.223035e+07,8013350.00,0.000000e+00,867.445999,0.018212,2130220.0,36.308176,9.340986e+00,1.159929e+06,4.662806e+04,3.056013e+01,9528.0,5773.0,6.037982e+05
3,2012-01-04,5.37,4.309479e+07,8025100.00,0.000000e+00,870.374487,0.019351,2135991.0,38.463087,8.879703e+00,1.159929e+06,4.706558e+04,7.808277e+01,9542.0,5731.0,7.495462e+05
4,2012-01-05,5.80,4.661373e+07,8036850.00,0.000000e+00,873.246150,0.024677,2141802.0,47.578231,8.476080e+00,1.159929e+06,5.369470e+04,4.469720e+01,11636.0,6994.0,1.614569e+06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4181,2023-06-25,30548.59,5.911125e+11,19411925.00,1.125890e+08,491068.723512,1.868064,856142438.0,2501.758170,3.981602e+08,5.235044e+13,3.019582e+07,9.002866e+05,656558.0,382769.0,2.222326e+09
4182,2023-06-26,30458.97,5.911125e+11,19411925.00,1.152567e+08,491354.967999,1.723111,856526345.0,2766.979452,3.799438e+08,5.235044e+13,2.903694e+07,9.910232e+05,692095.0,403979.0,3.515517e+09
4183,2023-06-27,30266.70,5.896946e+11,19413175.00,1.641947e+08,491606.066760,1.717011,856929056.0,3106.128571,3.643296e+08,5.235044e+13,2.769509e+07,9.677887e+05,707936.0,434858.0,3.548785e+09
4184,2023-06-28,30699.11,5.873250e+11,19414418.75,1.862727e+08,491846.908545,1.680819,857363958.0,3067.388889,2.805460e+08,5.225576e+13,2.137802e+07,9.052404e+05,627350.0,331278.0,3.880455e+09


Once we have the daily dataset, we upsample it to a frequency of 1 hour (1H) using the resample method. The data is reorganized into 1-hour time frames, and the values for the newly created hours are filled in by interpolation (linear by default), i.e. estimated from the surrounding known daily values.

In [None]:
# Upsample to 1 hour, filling the new rows by interpolation
all_data.set_index('timestamp', inplace=True)
all_data_1h = all_data.resample('1H').interpolate()
all_data_1h

timestamp,market-price,market-cap,total-bitcoins,trade-volume,blocks-size,avg-block-size,n-transactions-total,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd
2012-01-01 00:00:00,5.040000,4.032958e+07,8.001900e+06,0.000000e+00,861.941752,0.017073,2.119853e+06,32.686275,8.591401e+00,1.159929e+06,4.260652e+04,1.851638e+01,8531.000000,5001.000000,1.016110e+06
2012-01-01 01:00:00,5.049583,4.040878e+07,8.002377e+06,0.000000e+00,862.050325,0.017158,2.120061e+06,32.817172,8.598608e+00,1.159929e+06,4.345677e+04,1.924442e+01,8547.541667,5018.041667,1.005058e+06
2012-01-01 02:00:00,5.059167,4.048797e+07,8.002854e+06,0.000000e+00,862.158898,0.017244,2.120269e+06,32.948070,8.605816e+00,1.159929e+06,4.430701e+04,1.997246e+01,8564.083333,5035.083333,9.940074e+05
2012-01-01 03:00:00,5.068750,4.056717e+07,8.003331e+06,0.000000e+00,862.267471,0.017329,2.120477e+06,33.078967,8.613023e+00,1.159929e+06,4.515726e+04,2.070050e+01,8580.625000,5052.125000,9.829563e+05
2012-01-01 04:00:00,5.078333,4.064637e+07,8.003808e+06,0.000000e+00,862.376044,0.017414,2.120685e+06,33.209865,8.620231e+00,1.159929e+06,4.600751e+04,2.142854e+01,8597.166667,5069.166667,9.719052e+05
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-06-28 20:00:00,30186.176667,5.873250e+11,1.941442e+07,1.677668e+08,491997.874785,1.677334,8.576407e+08,3012.333080,3.530701e+08,5.091447e+13,2.765230e+07,1.090646e+06,709812.500000,420373.833333,4.968820e+09
2023-06-28 21:00:00,30160.530000,5.873250e+11,1.941442e+07,1.668415e+08,492005.423097,1.677159,8.576545e+08,3009.580289,3.566963e+08,5.084740e+13,2.796602e+07,1.099916e+06,713935.625000,424828.625000,5.023238e+09
2023-06-28 22:00:00,30134.883333,5.873250e+11,1.941442e+07,1.659163e+08,492012.971409,1.676985,8.576683e+08,3006.827499,3.603225e+08,5.078034e+13,2.827973e+07,1.109186e+06,718058.750000,429283.416667,5.077657e+09
2023-06-28 23:00:00,30109.236667,5.873250e+11,1.941442e+07,1.649910e+08,492020.519721,1.676811,8.576822e+08,3004.074708,3.639487e+08,5.071327e+13,2.859344e+07,1.118457e+06,722181.875000,433738.208333,5.132075e+09
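
The upsampling step can be checked by hand on two daily prices from the table above. This is a minimal sketch: it passes the hourly rule as a pd.Timedelta rather than the '1H' string (an assumption made here to avoid depending on frequency aliases), and it relies on the default linear interpolation.

```python
import pandas as pd

# Two consecutive daily observations of the market price
daily = pd.DataFrame(
    {"market-price": [5.04, 5.27]},
    index=pd.to_datetime(["2012-01-01", "2012-01-02"]),
)
daily.index.name = "timestamp"

# Upsample to one row per hour; the new rows start as NaN and are filled linearly
hourly = daily.resample(pd.Timedelta(hours=1)).interpolate()

# 24 hourly steps between the two days give 25 rows in total
print(len(hourly))
# Each step adds (5.27 - 5.04) / 24 to the previous value
print(round(hourly["market-price"].iloc[1], 6))
```

The second value, 5.049583, matches the 01:00 row of the hourly table above.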


# Output

In this last section we save the dataset we just created to Google Drive.

In [None]:
# Link Colab to our Google Drive
drive.mount(GDRIVE_DIR)

Mounted at /content/drive


In [None]:
def output(dataset, path):
  dataset.to_parquet(path)
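
As far as I know, to_parquet does not create missing parent folders, so it can help to build the output path and create the folder before writing. The sketch below uses a temporary directory in place of Google Drive; build_dataset_path is a hypothetical helper, and nothing is actually written.

```python
import tempfile
from pathlib import Path


def build_dataset_path(raw_dir: str, name: str, ext: str = ".parquet") -> str:
    """Create raw_dir if needed and return the full file path (hypothetical helper)."""
    directory = Path(raw_dir)
    directory.mkdir(parents=True, exist_ok=True)
    return str(directory / f"{name}{ext}")


# A temporary directory stands in for the Google Drive folder
with tempfile.TemporaryDirectory() as tmp:
    path = build_dataset_path(tmp + "/datasets/raw", "bitcoin_blockchain_data_1h")
    directory_exists = Path(path).parent.is_dir()

print(Path(path).name)  # bitcoin_blockchain_data_1h.parquet
```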

In [None]:
# Output the 1h data
GDRIVE_DATASET_NAME = "bitcoin_blockchain_data_1h"
GDRIVE_DATASET_NAME_EXT = "/" + GDRIVE_DATASET_NAME + ".parquet"
GDRIVE_DATASET = GDRIVE_DATASET_RAW_DIR + GDRIVE_DATASET_NAME_EXT
output(all_data_1h, GDRIVE_DATASET)