# Bitcoin price forecasting with PySpark
## Big Data Computing final project - A.Y. 2022 - 2023
Prof. Gabriele Tolomei

MSc in Computer Science

La Sapienza, University of Rome

### Author
Corsi Danilo - corsi.1742375@studenti.uniroma1.it



Description: In this notebook I retrieve data from www.blockchain.com, keeping the most interesting features, build a dataset from the daily (1d) data upsampled to hourly (1h) resolution, and save it to Google Drive.

# Dependencies, Libraries and Tools

In [1]:
import pandas as pd
import functools

from google.colab import drive

# Define metrics and parameters

In this section we define the parameters used to collect the data and the metrics to retrieve. I consider Bitcoin data from 2012-01-01 up to 2023-07-31, roughly eleven and a half years.

Note that since the Blockchain.com API only allows retrieving data with a maximum timespan of 6 years per call, I manually computed a continuation date (`continue_date`) so that a second API call could fetch the remaining data.

Regarding the metrics, I chose the ones that seemed most relevant to me, covering both price statistics and technical features of Bitcoin's blockchain.

In [2]:
# Define the parameters
timespan = "6years" # Window length of each API call (the maximum allowed timespan is 6 years)
start_date = "2012-01-01"
continue_date = "2017-12-31" # Start of the second API call (computed manually)
end_date = "2023-07-31"

# Metrics considered
metrics = [
          ##Currency Statistics##
            "market-price", # Market Price: The average USD market price across major bitcoin exchanges.
            "trade-volume", # Exchange Trade Volume (USD): The total USD value of trading volume on major bitcoin exchanges.

          ##Block Details##
            "blocks-size", # Blockchain Size (MB): The total size of the blockchain minus database indexes in megabytes.
            "avg-block-size", # Average Block Size (MB): The average block size over the past 24 hours in megabytes.
            "n-transactions-total", # Total Number of Transactions: The total number of transactions on the blockchain.
            "n-transactions-per-block", # Average Transactions Per Block: The average number of transactions per block over the past 24 hours.

          ##Mining Information##
            "hash-rate", # Total Hash Rate (TH/s): The estimated number of terahashes per second the bitcoin network is performing in the last 24 hours.
            "difficulty", # Network Difficulty: A relative measure of how difficult it is to mine a new block for the blockchain.
            "miners-revenue", # Miners Revenue (USD): Total value of coinbase block rewards and transaction fees paid to miners.
            "transaction-fees-usd", # Total Transaction Fees (USD): The total USD value of all transaction fees paid to miners. This does not include coinbase block rewards.

          ##Network Activity##
            "n-unique-addresses", # The total number of unique addresses used on the blockchain.
            "n-transactions", # Confirmed Transactions Per Day: The total number of confirmed transactions per day.
            "estimated-transaction-volume-usd" # Estimated Transaction Value (USD): The total estimated value in USD of transactions on the blockchain. This does not include coins returned as change.
]
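The continuation date in the parameters above was computed by hand. As a minimal sketch, it could also be derived from the start date and the 6-year window. The helper name `next_window_start` is hypothetical, and the choice to step back one day (so that the two windows share a boundary day) is an assumption: how the API treats window boundaries is not verified here.

```python
import pandas as pd

def next_window_start(start_date: str, years: int = 6) -> str:
    """Return the last day of a `years`-long window that begins at `start_date`."""
    start = pd.Timestamp(start_date)
    # Step back one day from the end of the window so that consecutive
    # windows share one boundary day (assumption: overlap is harmless).
    window_end = start + pd.DateOffset(years=years) - pd.Timedelta(days=1)
    return window_end.strftime("%Y-%m-%d")

continue_date = next_window_start("2012-01-01")
print(continue_date)  # 2017-12-31
```

With `start_date = "2012-01-01"` this reproduces the manually chosen value.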

# Retrieving data

In this section we are going to make the call to the Blockchain.com API to retrieve the data.

In [3]:
def data_crawler(timespan, metric, start_date, continue_date, end_date):
    """Download a single metric from the Blockchain.com charts API as a DataFrame."""
    # One URL per window, since a single call covers at most `timespan`
    url1 = f'https://api.blockchain.info/charts/{metric}?timespan={timespan}&start={start_date}&format=csv'
    url2 = f'https://api.blockchain.info/charts/{metric}?timespan={timespan}&start={continue_date}&format=csv'

    # Obtain the data for both windows
    data1 = pd.read_csv(url1, names=['timestamp', metric])
    data2 = pd.read_csv(url2, names=['timestamp', metric])

    # Concatenate the two windows by rows
    all_data = pd.concat([data1, data2])

    # Convert "timestamp" to datetime type
    all_data['timestamp'] = pd.to_datetime(all_data['timestamp'])

    # Keep only the rows before end_date
    all_data = all_data[all_data['timestamp'] < end_date]

    return all_data
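Because the second window starts on the day the first one may end, the concatenated frame could contain the boundary day twice. The sketch below works on small made-up frames (no network access) and shows one way to combine two windows while dropping such overlaps. The function name `combine_windows` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def combine_windows(first, second, end_date):
    """Concatenate two windows, drop overlapping timestamps, and cut at end_date."""
    combined = pd.concat([first, second], ignore_index=True)
    combined["timestamp"] = pd.to_datetime(combined["timestamp"])
    # If both windows contain the same timestamp, keep the later copy
    combined = combined.drop_duplicates(subset="timestamp", keep="last")
    # Same exclusive upper bound as in data_crawler
    combined = combined[combined["timestamp"] < end_date]
    return combined.sort_values("timestamp").reset_index(drop=True)

# Toy data: the two windows overlap on 2017-12-31
first = pd.DataFrame({"timestamp": ["2017-12-30", "2017-12-31"],
                      "market-price": [1.0, 2.0]})
second = pd.DataFrame({"timestamp": ["2017-12-31", "2018-01-01", "2018-01-02"],
                       "market-price": [2.0, 3.0, 4.0]})

result = combine_windows(first, second, end_date="2018-01-02")
print(result)
```

The overlapping day appears once in `result`, and the row dated 2018-01-02 is excluded by the strict `<` comparison.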

In [4]:
# Merge the data
merge = functools.partial(pd.merge, on='timestamp')

# Get the blockchain data from the Blockchain.com API, one metric at a time, and merge it on "timestamp"
df1 = functools.reduce(merge, [data_crawler(timespan, metric, start_date, continue_date, end_date) for metric in metrics])
df1

Unnamed: 0,timestamp,market-price,trade-volume,blocks-size,avg-block-size,n-transactions-total,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd
0,2012-01-01,5.04,0.000000e+00,861.941752,0.017073,2119853.0,32.686275,8.591401e+00,1.159929e+06,4.260652e+04,18.516384,8531.0,5001.0,1.016110e+06
1,2012-01-02,5.27,0.000000e+00,864.547504,0.019121,2124845.0,35.827815,8.764382e+00,1.159929e+06,6.301249e+04,35.989325,8928.0,5410.0,7.508830e+05
2,2012-01-03,5.45,0.000000e+00,867.445999,0.018212,2130220.0,36.308176,9.340986e+00,1.159929e+06,4.662806e+04,30.560129,9528.0,5773.0,6.037982e+05
3,2012-01-04,5.37,0.000000e+00,870.374487,0.019351,2135991.0,38.463087,8.879703e+00,1.159929e+06,4.706558e+04,78.082768,9542.0,5731.0,7.495462e+05
4,2012-01-05,5.80,0.000000e+00,873.246150,0.024677,2141802.0,47.578231,8.476080e+00,1.159929e+06,5.369470e+04,44.697203,11636.0,6994.0,1.614569e+06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4211,2023-07-26,29226.18,7.731961e+07,498719.595538,1.682859,869881904.0,3144.069444,3.811901e+08,5.325165e+13,2.692680e+07,551760.191132,737067.0,452746.0,2.649786e+09
4212,2023-07-27,29344.56,1.112411e+08,498961.880465,1.653488,870335062.0,2993.028986,3.589731e+08,5.232831e+13,2.634722e+07,619524.768989,713630.0,413038.0,3.329622e+09
4213,2023-07-28,29213.94,8.336315e+07,499189.833530,1.629711,870747604.0,3208.750000,3.641756e+08,5.232831e+13,2.623381e+07,596803.108800,720574.0,449225.0,2.970637e+09
4214,2023-07-29,29316.12,7.587452e+07,499418.145973,1.616703,871197669.0,3205.446970,3.433656e+08,5.232831e+13,2.470638e+07,474147.692736,694496.0,423119.0,1.601999e+09
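The `functools.reduce` call above folds the per-metric frames into one by merging them pairwise on `timestamp`. Since `pd.merge` defaults to an inner join, a date missing from any single metric is dropped from the result. A small sketch on toy frames (the frames and their date ranges are made up for illustration):

```python
import functools
import pandas as pd

# Toy per-metric frames; "n-transactions" lacks 2012-01-01 on purpose
price = pd.DataFrame({"timestamp": pd.to_datetime(["2012-01-01", "2012-01-02"]),
                      "market-price": [5.04, 5.27]})
volume = pd.DataFrame({"timestamp": pd.to_datetime(["2012-01-01", "2012-01-02"]),
                       "trade-volume": [0.0, 0.0]})
txs = pd.DataFrame({"timestamp": pd.to_datetime(["2012-01-02", "2012-01-03"]),
                    "n-transactions": [5410.0, 5773.0]})

# Same pattern as in the notebook: a merge function with "on" pre-filled
merge = functools.partial(pd.merge, on="timestamp")
merged = functools.reduce(merge, [price, volume, txs])
print(merged)
```

Only 2012-01-02 survives here, because it is the only date present in all three frames.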


In [5]:
# Check duplicated rows
len(df1['timestamp'].unique())

4216
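The number of unique timestamps only reveals duplicates when it is compared with the number of rows. A possible alternative, sketched on a toy frame, is to count the duplicated timestamps directly with `Series.duplicated`:

```python
import pandas as pd

# Toy frame with one repeated timestamp
frame = pd.DataFrame({"timestamp": pd.to_datetime(["2012-01-01", "2012-01-02", "2012-01-02"]),
                      "value": [1, 2, 3]})

n_rows = len(frame)
n_unique = frame["timestamp"].nunique()
# duplicated() marks every repeat after the first occurrence
n_duplicates = int(frame["timestamp"].duplicated().sum())

print(n_rows, n_unique, n_duplicates)  # 3 2 1
```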

Due to a problem with the Blockchain.com API, I was forced to make an additional call to retrieve the market capitalization and total circulating supply, which are then added to the currency statistics to obtain a single dataset.

In [6]:
# Retrieving market capitalization and total circulating data
metrics = [
  "total-bitcoins", # Total Circulating Bitcoin: The total number of mined bitcoin that are currently circulating on the network.
  "market-cap", # Market Capitalization (USD): The total USD value of bitcoin in circulation.
  ]

merge = functools.partial(pd.merge, on='timestamp')
df2 = functools.reduce(merge, [data_crawler(timespan, metric, start_date, continue_date, end_date) for metric in metrics])
df2

Unnamed: 0,timestamp,total-bitcoins,market-cap
0,2012-01-01 00:00:01,8001900.00,4.032958e+07
1,2012-01-02 13:34:31,8013350.00,4.223035e+07
2,2012-01-04 00:14:03,8025100.00,4.309479e+07
3,2012-01-05 15:23:53,8036850.00,4.661373e+07
4,2012-01-06 23:04:03,8048100.00,5.311746e+07
...,...,...,...
2993,2023-07-24 18:37:14,19438118.75,5.656298e+11
2994,2023-07-26 04:22:24,19439362.50,5.690290e+11
2995,2023-07-27 14:47:29,19440606.25,5.716705e+11
2996,2023-07-29 01:06:03,19441856.25,5.711045e+11


In [7]:
# Check duplicated rows
len(df2['timestamp'].unique())

2998

In [8]:
# Strip the time of day (h:m:s) from the timestamps, keeping only the date
df2['timestamp'] = pd.to_datetime(df2["timestamp"]).dt.normalize()

# Drop the duplicates in column "timestamp", keep the last value
df2.drop_duplicates(subset="timestamp", keep="last", inplace=True)
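To illustrate what the two steps above do, here is a small sketch on made-up rows: after `dt.normalize()` sets every time of day to midnight, two samples taken on the same date collide, and `keep="last"` retains the later one.

```python
import pandas as pd

# Toy data: two samples fall on 2012-01-01
frame = pd.DataFrame({
    "timestamp": ["2012-01-01 00:00:01", "2012-01-01 18:30:00", "2012-01-02 13:34:31"],
    "total-bitcoins": [100.0, 150.0, 200.0],
})

# Set the time of day to midnight so that only the date distinguishes rows
frame["timestamp"] = pd.to_datetime(frame["timestamp"]).dt.normalize()

# Keep the last sample of each date
deduplicated = frame.drop_duplicates(subset="timestamp", keep="last")
print(deduplicated)
```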

In [9]:
df2

Unnamed: 0,timestamp,total-bitcoins,market-cap
0,2012-01-01,8001900.00,4.032958e+07
1,2012-01-02,8013350.00,4.223035e+07
2,2012-01-04,8025100.00,4.309479e+07
3,2012-01-05,8036850.00,4.661373e+07
4,2012-01-06,8048100.00,5.311746e+07
...,...,...,...
2993,2023-07-24,19438118.75,5.656298e+11
2994,2023-07-26,19439362.50,5.690290e+11
2995,2023-07-27,19440606.25,5.716705e+11
2996,2023-07-29,19441856.25,5.711045e+11


In [10]:
# Check duplicated rows
len(df2['timestamp'].unique())

2997

In [11]:
# Add the market capitalization and total circulating data
all_data = pd.merge(df1, df2, how="left", on='timestamp')
# Forward-fill the days that are missing from df2 with the last known value
all_data = all_data.ffill()
all_data

Unnamed: 0,timestamp,market-price,trade-volume,blocks-size,avg-block-size,n-transactions-total,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd,total-bitcoins,market-cap
0,2012-01-01,5.04,0.000000e+00,861.941752,0.017073,2119853.0,32.686275,8.591401e+00,1.159929e+06,4.260652e+04,18.516384,8531.0,5001.0,1.016110e+06,8001900.00,4.032958e+07
1,2012-01-02,5.27,0.000000e+00,864.547504,0.019121,2124845.0,35.827815,8.764382e+00,1.159929e+06,6.301249e+04,35.989325,8928.0,5410.0,7.508830e+05,8013350.00,4.223035e+07
2,2012-01-03,5.45,0.000000e+00,867.445999,0.018212,2130220.0,36.308176,9.340986e+00,1.159929e+06,4.662806e+04,30.560129,9528.0,5773.0,6.037982e+05,8013350.00,4.223035e+07
3,2012-01-04,5.37,0.000000e+00,870.374487,0.019351,2135991.0,38.463087,8.879703e+00,1.159929e+06,4.706558e+04,78.082768,9542.0,5731.0,7.495462e+05,8025100.00,4.309479e+07
4,2012-01-05,5.80,0.000000e+00,873.246150,0.024677,2141802.0,47.578231,8.476080e+00,1.159929e+06,5.369470e+04,44.697203,11636.0,6994.0,1.614569e+06,8036850.00,4.661373e+07
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4211,2023-07-26,29226.18,7.731961e+07,498719.595538,1.682859,869881904.0,3144.069444,3.811901e+08,5.325165e+13,2.692680e+07,551760.191132,737067.0,452746.0,2.649786e+09,19439362.50,5.690290e+11
4212,2023-07-27,29344.56,1.112411e+08,498961.880465,1.653488,870335062.0,2993.028986,3.589731e+08,5.232831e+13,2.634722e+07,619524.768989,713630.0,413038.0,3.329622e+09,19440606.25,5.716705e+11
4213,2023-07-28,29213.94,8.336315e+07,499189.833530,1.629711,870747604.0,3208.750000,3.641756e+08,5.232831e+13,2.623381e+07,596803.108800,720574.0,449225.0,2.970637e+09,19440606.25,5.716705e+11
4214,2023-07-29,29316.12,7.587452e+07,499418.145973,1.616703,871197669.0,3205.446970,3.433656e+08,5.232831e+13,2.470638e+07,474147.692736,694496.0,423119.0,1.601999e+09,19441856.25,5.711045e+11
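The output above shows the effect of the left join followed by forward filling: 2012-01-03 has no entry in `df2`, so it inherits the values of 2012-01-02. A minimal sketch of the same mechanism on a three-row toy frame (values copied from the first rows of the output):

```python
import pandas as pd

daily = pd.DataFrame({"timestamp": pd.to_datetime(["2012-01-01", "2012-01-02", "2012-01-03"]),
                      "market-price": [5.04, 5.27, 5.45]})
# The sparser frame has no row for 2012-01-03
sparse = pd.DataFrame({"timestamp": pd.to_datetime(["2012-01-01", "2012-01-02"]),
                       "total-bitcoins": [8001900.0, 8013350.0]})

# A left join keeps every daily row and leaves NaN where `sparse` has no match
joined = pd.merge(daily, sparse, how="left", on="timestamp")
missing_before = int(joined["total-bitcoins"].isna().sum())

# Forward fill propagates the last known value into the gap
filled = joined.ffill()
print(filled)
```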


In [12]:
# Show the rows that still contain NaN values (none expected)
all_data[all_data.isnull().any(axis=1)]

Unnamed: 0,timestamp,market-price,trade-volume,blocks-size,avg-block-size,n-transactions-total,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd,total-bitcoins,market-cap


In [13]:
# Check duplicated rows
len(all_data['timestamp'].unique())

4216

In [14]:
all_data

Unnamed: 0,timestamp,market-price,trade-volume,blocks-size,avg-block-size,n-transactions-total,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd,total-bitcoins,market-cap
0,2012-01-01,5.04,0.000000e+00,861.941752,0.017073,2119853.0,32.686275,8.591401e+00,1.159929e+06,4.260652e+04,18.516384,8531.0,5001.0,1.016110e+06,8001900.00,4.032958e+07
1,2012-01-02,5.27,0.000000e+00,864.547504,0.019121,2124845.0,35.827815,8.764382e+00,1.159929e+06,6.301249e+04,35.989325,8928.0,5410.0,7.508830e+05,8013350.00,4.223035e+07
2,2012-01-03,5.45,0.000000e+00,867.445999,0.018212,2130220.0,36.308176,9.340986e+00,1.159929e+06,4.662806e+04,30.560129,9528.0,5773.0,6.037982e+05,8013350.00,4.223035e+07
3,2012-01-04,5.37,0.000000e+00,870.374487,0.019351,2135991.0,38.463087,8.879703e+00,1.159929e+06,4.706558e+04,78.082768,9542.0,5731.0,7.495462e+05,8025100.00,4.309479e+07
4,2012-01-05,5.80,0.000000e+00,873.246150,0.024677,2141802.0,47.578231,8.476080e+00,1.159929e+06,5.369470e+04,44.697203,11636.0,6994.0,1.614569e+06,8036850.00,4.661373e+07
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4211,2023-07-26,29226.18,7.731961e+07,498719.595538,1.682859,869881904.0,3144.069444,3.811901e+08,5.325165e+13,2.692680e+07,551760.191132,737067.0,452746.0,2.649786e+09,19439362.50,5.690290e+11
4212,2023-07-27,29344.56,1.112411e+08,498961.880465,1.653488,870335062.0,2993.028986,3.589731e+08,5.232831e+13,2.634722e+07,619524.768989,713630.0,413038.0,3.329622e+09,19440606.25,5.716705e+11
4213,2023-07-28,29213.94,8.336315e+07,499189.833530,1.629711,870747604.0,3208.750000,3.641756e+08,5.232831e+13,2.623381e+07,596803.108800,720574.0,449225.0,2.970637e+09,19440606.25,5.716705e+11
4214,2023-07-29,29316.12,7.587452e+07,499418.145973,1.616703,871197669.0,3205.446970,3.433656e+08,5.232831e+13,2.470638e+07,474147.692736,694496.0,423119.0,1.601999e+09,19441856.25,5.711045e+11


In [15]:
def move_columns(dataset, target_column, column_to_move):
  # Place column_to_move immediately after target_column
  cols = list(dataset.columns)
  cols.remove(column_to_move)
  cols.insert(cols.index(target_column)+1, column_to_move)
  dataset = dataset.reindex(columns=cols)

  return dataset

In [16]:
# Move the columns 'market-cap' and 'total-bitcoins' right after the column 'market-price'
all_data = move_columns(all_data, 'market-price', 'total-bitcoins')
all_data = move_columns(all_data, 'market-price', 'market-cap')
all_data

Unnamed: 0,timestamp,market-price,market-cap,total-bitcoins,trade-volume,blocks-size,avg-block-size,n-transactions-total,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd
0,2012-01-01,5.04,4.032958e+07,8001900.00,0.000000e+00,861.941752,0.017073,2119853.0,32.686275,8.591401e+00,1.159929e+06,4.260652e+04,18.516384,8531.0,5001.0,1.016110e+06
1,2012-01-02,5.27,4.223035e+07,8013350.00,0.000000e+00,864.547504,0.019121,2124845.0,35.827815,8.764382e+00,1.159929e+06,6.301249e+04,35.989325,8928.0,5410.0,7.508830e+05
2,2012-01-03,5.45,4.223035e+07,8013350.00,0.000000e+00,867.445999,0.018212,2130220.0,36.308176,9.340986e+00,1.159929e+06,4.662806e+04,30.560129,9528.0,5773.0,6.037982e+05
3,2012-01-04,5.37,4.309479e+07,8025100.00,0.000000e+00,870.374487,0.019351,2135991.0,38.463087,8.879703e+00,1.159929e+06,4.706558e+04,78.082768,9542.0,5731.0,7.495462e+05
4,2012-01-05,5.80,4.661373e+07,8036850.00,0.000000e+00,873.246150,0.024677,2141802.0,47.578231,8.476080e+00,1.159929e+06,5.369470e+04,44.697203,11636.0,6994.0,1.614569e+06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4211,2023-07-26,29226.18,5.690290e+11,19439362.50,7.731961e+07,498719.595538,1.682859,869881904.0,3144.069444,3.811901e+08,5.325165e+13,2.692680e+07,551760.191132,737067.0,452746.0,2.649786e+09
4212,2023-07-27,29344.56,5.716705e+11,19440606.25,1.112411e+08,498961.880465,1.653488,870335062.0,2993.028986,3.589731e+08,5.232831e+13,2.634722e+07,619524.768989,713630.0,413038.0,3.329622e+09
4213,2023-07-28,29213.94,5.716705e+11,19440606.25,8.336315e+07,499189.833530,1.629711,870747604.0,3208.750000,3.641756e+08,5.232831e+13,2.623381e+07,596803.108800,720574.0,449225.0,2.970637e+09
4214,2023-07-29,29316.12,5.711045e+11,19441856.25,7.587452e+07,499418.145973,1.616703,871197669.0,3205.446970,3.433656e+08,5.232831e+13,2.470638e+07,474147.692736,694496.0,423119.0,1.601999e+09
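Note the order of the two calls: each one places its column immediately after `market-price`, so the column moved last ends up first. The sketch below restates the helper (with the parameter named `target_column`) and applies it to an empty toy frame to show the resulting column order.

```python
import pandas as pd

def move_columns(dataset, target_column, column_to_move):
    """Return a copy of `dataset` with `column_to_move` right after `target_column`."""
    cols = list(dataset.columns)
    cols.remove(column_to_move)
    cols.insert(cols.index(target_column) + 1, column_to_move)
    return dataset.reindex(columns=cols)

# Toy frame with only the column layout of interest
toy = pd.DataFrame(columns=["timestamp", "market-price", "trade-volume",
                            "total-bitcoins", "market-cap"])

toy = move_columns(toy, "market-price", "total-bitcoins")
toy = move_columns(toy, "market-price", "market-cap")
print(list(toy.columns))
```

The result is `timestamp, market-price, market-cap, total-bitcoins, trade-volume`, matching the layout of the output above.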


Once we have the daily dataset, we upsample it to a frequency of 1 hour (1H) using the resample method. The data is then organized in 1-hour time frames, and interpolation fills the newly created rows by estimating their values from the surrounding known daily values.

In [17]:
# Upsample to 1 hour and fill the new rows by interpolation
all_data.set_index('timestamp', inplace=True)
all_data_1h = all_data.resample('1H').interpolate()
all_data_1h

timestamp,market-price,market-cap,total-bitcoins,trade-volume,blocks-size,avg-block-size,n-transactions-total,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd
2012-01-01 00:00:00,5.040000,4.032958e+07,8.001900e+06,0.000000e+00,861.941752,0.017073,2.119853e+06,32.686275,8.591401e+00,1.159929e+06,4.260652e+04,18.516384,8531.000000,5001.000000,1.016110e+06
2012-01-01 01:00:00,5.049583,4.040878e+07,8.002377e+06,0.000000e+00,862.050325,0.017158,2.120061e+06,32.817172,8.598608e+00,1.159929e+06,4.345677e+04,19.244424,8547.541667,5018.041667,1.005058e+06
2012-01-01 02:00:00,5.059167,4.048797e+07,8.002854e+06,0.000000e+00,862.158898,0.017244,2.120269e+06,32.948070,8.605816e+00,1.159929e+06,4.430701e+04,19.972463,8564.083333,5035.083333,9.940074e+05
2012-01-01 03:00:00,5.068750,4.056717e+07,8.003331e+06,0.000000e+00,862.267471,0.017329,2.120477e+06,33.078967,8.613023e+00,1.159929e+06,4.515726e+04,20.700502,8580.625000,5052.125000,9.829563e+05
2012-01-01 04:00:00,5.078333,4.064637e+07,8.003808e+06,0.000000e+00,862.376044,0.017414,2.120685e+06,33.209865,8.620231e+00,1.159929e+06,4.600751e+04,21.428541,8597.166667,5069.166667,9.719052e+05
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-07-29 20:00:00,29352.595000,5.707623e+11,1.944289e+07,3.926979e+07,499595.925800,1.704247,8.715501e+08,3737.882035,4.214032e+08,5.232831e+13,3.052827e+07,503742.351307,782832.666667,608731.500000,1.421208e+09
2023-07-29 21:00:00,29354.418750,5.707452e+11,1.944294e+07,3.743955e+07,499604.814792,1.708625,8.715677e+08,3764.503788,4.253051e+08,5.232831e+13,3.081936e+07,505222.084235,787249.500000,618012.125000,1.412168e+09
2023-07-29 22:00:00,29356.242500,5.707281e+11,1.944300e+07,3.560932e+07,499613.703783,1.713002,8.715854e+08,3791.125541,4.292070e+08,5.232831e+13,3.111045e+07,506701.817164,791666.333333,627292.750000,1.403129e+09
2023-07-29 23:00:00,29358.066250,5.707110e+11,1.944305e+07,3.377908e+07,499622.592775,1.717379,8.716030e+08,3817.747294,4.331089e+08,5.232831e+13,3.140155e+07,508181.550092,796083.166667,636573.375000,1.394089e+09
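The hourly values above are consistent with linear interpolation, the default of `interpolate()`: each hour adds 1/24 of the difference between two consecutive daily values. A small sketch on two made-up daily prices shows the mechanism. It uses the lowercase alias `"1h"`, on the assumption that this spelling is accepted by both older and newer pandas releases.

```python
import pandas as pd

# Two toy daily observations, indexed by timestamp
daily = pd.DataFrame({"market-price": [5.04, 5.28]},
                     index=pd.to_datetime(["2012-01-01", "2012-01-02"]))
daily.index.name = "timestamp"

# resample() creates one row per hour (NaN between the known days);
# interpolate() then fills those rows linearly
hourly = daily.resample("1h").interpolate()

print(len(hourly))                       # 25 rows: 00:00 on day 1 through 00:00 on day 2
print(hourly["market-price"].iloc[:3])   # 5.04, 5.05, 5.06
```

Interpolated hourly rows carry no information beyond the daily values; they only place intermediate points on the straight line between two days.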


# Output

In this last section we save the dataset we just created to Google Drive.

In [18]:
GDRIVE_DIR = "/content/drive"
GDRIVE_DATASET_RAW_DIR = GDRIVE_DIR + "/MyDrive/BDC/project/datasets/raw"

In [19]:
# Link Colab to our Google Drive
drive.mount(GDRIVE_DIR)

Mounted at /content/drive


In [20]:
def output(dataset, path):
  dataset.to_parquet(path)

In [22]:
# Output the 1h data
GDRIVE_DATASET_NAME = "bitcoin_blockchain_data_1h"
GDRIVE_DATASET_NAME_EXT = "/" + GDRIVE_DATASET_NAME + ".parquet"
GDRIVE_DATASET = GDRIVE_DATASET_RAW_DIR + GDRIVE_DATASET_NAME_EXT
output(all_data_1h, GDRIVE_DATASET)
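As an aside, the output path could also be assembled with `pathlib` instead of string concatenation. This is only a sketch: the helper name `dataset_path` is hypothetical, and `PurePosixPath` is used because Colab paths are POSIX-style.

```python
from pathlib import PurePosixPath

def dataset_path(raw_dir: str, name: str, extension: str = ".parquet") -> str:
    """Join the raw dataset directory and the file name into one path string."""
    return str(PurePosixPath(raw_dir) / f"{name}{extension}")

path = dataset_path("/content/drive/MyDrive/BDC/project/datasets/raw",
                    "bitcoin_blockchain_data_1h")
print(path)
```

This yields the same string as `GDRIVE_DATASET` in the cell above.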