# Bitcoin price forecasting with PySpark - Data crawling
## Big Data Computing final project - A.Y. 2022 - 2023
Prof. Gabriele Tolomei

MSc in Computer Science

La Sapienza, University of Rome

### Author
Corsi Danilo - corsi.1742375@studenti.uniroma1.it



# Dependencies, Libraries and Tools

In [1]:
import pandas as pd
import functools

from google.colab import drive

# Define metrics and parameters

In this section we define the parameters used to collect the data and the metrics to retrieve. I consider Bitcoin data from 2016-01-01 to 2023-07-31, roughly seven and a half years.

Note that since the Blockchain.com API returns at most 6 years of data per request, I computed the continuation date manually so that a second API call could retrieve the remaining data.

Regarding the metrics, I chose those that seemed most relevant, covering both price statistics and technical features of Bitcoin's blockchain.

In [2]:
# Define the parameters
timespan = "6years" # Length of each request window (the API maximum is 6 years)
start_date = "2016-01-01"
continue_date = "2021-12-31" # Start of the second request (computed manually)
end_date = "2023-07-31"

# Metrics considered
metrics = [
          ##Currency Statistics##
            "market-price", # Market Price: The average USD market price across major bitcoin exchanges.
            "trade-volume", # Exchange Trade Volume (USD): The total USD value of trading volume on major bitcoin exchanges.

          ##Block Details##
            "blocks-size", # Blockchain Size (MB): The total size of the blockchain minus database indexes in megabytes.
            "avg-block-size", # Average Block Size (MB): The average block size over the past 24 hours in megabytes.
            "n-transactions-total", # Total Number of Transactions: The total number of transactions on the blockchain.
            "n-transactions-per-block", # Average Transactions Per Block: The average number of transactions per block over the past 24 hours.

          ##Mining Information##
            "hash-rate", # Total Hash Rate (TH/s): The estimated number of terahashes per second the bitcoin network is performing in the last 24 hours.
            "difficulty", # Network Difficulty (T): A relative measure of how difficult it is to mine a new block for the blockchain.
            "miners-revenue", # Miners Revenue (USD): Total value of coinbase block rewards and transaction fees paid to miners.
            "transaction-fees-usd", # Total Transaction Fees (USD): The total USD value of all transaction fees paid to miners. This does not include coinbase block rewards.

          ##Network Activity##
            "n-unique-addresses", # The total number of unique addresses used on the blockchain.
            "n-transactions", # Confirmed Transactions Per Day: The total number of confirmed transactions per day.
            "estimated-transaction-volume-usd" # Estimated Transaction Value (USD): The total estimated value in USD of transactions on the blockchain. This does not include coins returned as change.
]
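
The continuation date hard-coded above can also be derived from the start date. The following is a minimal sketch, assuming the second request should start on the last day of a 6-year window beginning at `start_date`; the helper name `compute_continue_date` is hypothetical and not part of the original notebook.

```python
from datetime import date, timedelta

def compute_continue_date(start_date: str, years: int = 6) -> str:
    """Return the last day of a `years`-long window starting at `start_date`."""
    start = date.fromisoformat(start_date)
    try:
        window_end = start.replace(year=start.year + years)
    except ValueError:
        # start_date is Feb 29 and the target year is not a leap year
        window_end = start.replace(year=start.year + years, day=28)
    # Step back one day so the second window starts where the first one ends
    return (window_end - timedelta(days=1)).isoformat()

continue_date = compute_continue_date("2016-01-01")
print(continue_date)  # 2021-12-31
```

With `start_date = "2016-01-01"` this reproduces the manually chosen value.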

# Retrieving data

In this section we call the Blockchain.com API to retrieve the data.

In [3]:
def data_crawler(timespan, metric, start_date, continue_date, end_date):
    """Download one Blockchain.com chart metric as a DataFrame with 'timestamp' and `metric` columns."""
    # API endpoints: two requests, since a single call returns at most `timespan` of data
    url1 = f'https://api.blockchain.info/charts/{metric}?timespan={timespan}&start={start_date}&format=csv'
    url2 = f'https://api.blockchain.info/charts/{metric}?timespan={timespan}&start={continue_date}&format=csv'

    # Download both windows as CSV
    data1 = pd.read_csv(url1, names=['timestamp', metric])
    data2 = pd.read_csv(url2, names=['timestamp', metric])

    # Concatenate the two windows row-wise
    all_data = pd.concat([data1, data2])

    # Convert "timestamp" to datetime type
    all_data['timestamp'] = pd.to_datetime(all_data['timestamp'])

    # Keep only rows before end_date so that every metric shares the same end date
    all_data = all_data[all_data['timestamp'] < end_date]

    return all_data
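
To make the request format easier to inspect without touching the network, the URL construction can be isolated. This is only a sketch: `build_chart_url` and `BASE_URL` are hypothetical names, while the endpoint and the query parameters (`timespan`, `start`, `format`) are the ones used in the cell above.

```python
from urllib.parse import urlencode

# Endpoint taken from the crawler above
BASE_URL = "https://api.blockchain.info/charts"

def build_chart_url(metric: str, timespan: str, start: str) -> str:
    """Build the CSV download URL for one metric and one request window."""
    query = urlencode({"timespan": timespan, "start": start, "format": "csv"})
    return f"{BASE_URL}/{metric}?{query}"

# First and second request windows for the market price
first_url = build_chart_url("market-price", "6years", "2016-01-01")
second_url = build_chart_url("market-price", "6years", "2021-12-31")
print(first_url)
print(second_url)
```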

In [4]:
# Helper that merges two DataFrames on the "timestamp" column
merge = functools.partial(pd.merge, on='timestamp')

# Download every metric from the Blockchain.com API and merge them into a single DataFrame
df1 = functools.reduce(merge, [data_crawler(timespan, metric, start_date, continue_date, end_date) for metric in metrics])
df1

Unnamed: 0,timestamp,market-price,trade-volume,blocks-size,avg-block-size,n-transactions-total,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd
0,2016-01-01,430.89,2.860153e+06,54604.791735,0.493407,101155706,919.200000,6.971292e+05,1.038803e+11,1.554769e+06,8783.063851,263121.0,124092.0,5.834626e+07
1,2016-01-02,434.75,1.646042e+06,54670.707780,0.579744,101278339,1027.848276,7.074570e+05,1.038803e+11,1.671420e+06,13798.467773,333102.0,149038.0,5.083235e+07
2,2016-01-03,432.76,1.287046e+06,54754.876205,0.554656,101427625,983.503448,7.694240e+05,1.038803e+11,1.720316e+06,11009.108568,335666.0,142608.0,6.764693e+07
3,2016-01-04,430.78,1.967359e+06,54835.983701,0.556970,101571729,1001.955801,9.036860e+05,1.038803e+11,2.076921e+06,13811.521698,344268.0,181354.0,9.627657e+07
4,2016-01-05,434.17,2.484225e+06,54936.400034,0.641779,101752002,1161.598726,8.417189e+05,1.038803e+11,1.819808e+06,14331.031533,359763.0,182371.0,1.031559e+08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2756,2023-07-26,29226.18,7.731961e+07,498719.595538,1.682859,869881904,3144.069444,3.811901e+08,5.325165e+13,2.692680e+07,551760.191132,737067.0,452746.0,2.649786e+09
2757,2023-07-27,29344.56,1.112411e+08,498961.880465,1.653488,870335062,2993.028986,3.589731e+08,5.232831e+13,2.634722e+07,619524.768989,713630.0,413038.0,3.329622e+09
2758,2023-07-28,29213.94,8.336315e+07,499189.833530,1.629711,870747604,3208.750000,3.641756e+08,5.232831e+13,2.623381e+07,596803.108800,720574.0,449225.0,2.970637e+09
2759,2023-07-29,29316.12,7.587452e+07,499418.145973,1.616703,871197669,3205.446970,3.433656e+08,5.232831e+13,2.470638e+07,474147.692736,694496.0,423119.0,1.601999e+09
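
The `functools.reduce` pattern used to build `df1` can be illustrated on small frames. This is a toy sketch: the frames are built by hand from a few values in the output above rather than downloaded, and the variable names are only illustrative. Because `pd.merge` defaults to an inner join, a date missing from any frame is dropped from the result.

```python
import functools
import pandas as pd

# Toy per-metric frames (values copied from the first rows of the output above)
price = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-01-01", "2016-01-02"]),
    "market-price": [430.89, 434.75],
})
volume = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-03"]),
    "trade-volume": [2.860153e06, 1.646042e06, 1.287046e06],
})
difficulty = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-01-01", "2016-01-02"]),
    "difficulty": [1.038803e11, 1.038803e11],
})

# Merge the frames pairwise, left to right, on the shared "timestamp" column
merge = functools.partial(pd.merge, on="timestamp")
combined = functools.reduce(merge, [price, volume, difficulty])
print(combined)
```

Here 2016-01-03 appears only in `volume`, so the merged frame contains just the two dates shared by all three inputs.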


In [5]:
# Count distinct timestamps (compare with the row count to spot duplicated rows)
len(df1['timestamp'].unique())

2761
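
The distinct-timestamp count only reveals duplicates when compared with the number of rows. A small sketch of a more explicit check, on a toy frame with one deliberately repeated date (the frame and variable names are illustrative):

```python
import pandas as pd

# Toy frame in which 2016-01-02 appears twice
frame = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-02"]),
    "market-price": [430.89, 434.75, 434.75],
})

n_rows = len(frame)
n_unique = frame["timestamp"].nunique()
# duplicated() flags every repeat after the first occurrence
n_duplicates = int(frame["timestamp"].duplicated().sum())
print(n_rows, n_unique, n_duplicates)
```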

Due to a problem with the Blockchain.com API, I was forced to make an additional call to retrieve the market capitalization and the total circulating supply, which are then added to the currency statistics to obtain a single dataset.

In [6]:
# Retrieving market capitalization and total circulating data
metrics = [
  "total-bitcoins", # Total Circulating Bitcoin: The total number of mined bitcoin that are currently circulating on the network.
  "market-cap", # Market Capitalization (USD): The total USD value of bitcoin in circulation.
  ]

merge = functools.partial(pd.merge, on='timestamp')
df2 = functools.reduce(merge, [data_crawler(timespan, metric, start_date, continue_date, end_date) for metric in metrics])
df2

Unnamed: 0,timestamp,total-bitcoins,market-cap
0,2016-01-01 00:03:21,15029575.00,6.474140e+09
1,2016-01-02 15:54:15,15035125.00,6.499685e+09
2,2016-01-04 02:26:37,15040650.00,6.458253e+09
3,2016-01-05 09:50:06,15046150.00,6.489359e+09
4,2016-01-06 21:03:44,15051750.00,6.472554e+09
...,...,...,...
2954,2023-07-29 08:53:44,19442218.75,5.707263e+11
2955,2023-07-29 21:55:38,19442581.25,5.715924e+11
2956,2023-07-30 08:01:17,19442943.75,5.703588e+11
2957,2023-07-30 17:09:19,19443312.50,5.719056e+11


In [7]:
# Count distinct timestamps (compare with the row count to spot duplicated rows)
len(df2['timestamp'].unique())

2959

In [8]:
# Drop the time-of-day component of each timestamp, keeping only the date
df2['timestamp'] = pd.to_datetime(df2["timestamp"]).dt.normalize()

# Drop the duplicates in column "timestamp", keep the last value
df2.drop_duplicates(subset="timestamp", keep="last", inplace=True)
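
The effect of normalizing and then deduplicating can be seen on three raw rows. This sketch uses timestamps and `total-bitcoins` values copied from the `df2` output above; the variable names are illustrative.

```python
import pandas as pd

# Two observations on 2023-07-29 and one on 2023-07-30
raw = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-07-29 08:53:44",
        "2023-07-29 21:55:38",
        "2023-07-30 08:01:17",
    ]),
    "total-bitcoins": [19442218.75, 19442581.25, 19442943.75],
})

# Set every timestamp to midnight so rows from the same day share one key
raw["timestamp"] = raw["timestamp"].dt.normalize()

# Keep only the last observation of each day
daily = raw.drop_duplicates(subset="timestamp", keep="last")
print(daily)
```

With `keep="last"`, the value retained for 2023-07-29 is the later one (19442581.25), which matches the row shown for that date in the deduplicated output.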

In [9]:
df2

Unnamed: 0,timestamp,total-bitcoins,market-cap
0,2016-01-01,15029575.00,6.474140e+09
1,2016-01-02,15035125.00,6.499685e+09
2,2016-01-04,15040650.00,6.458253e+09
3,2016-01-05,15046150.00,6.489359e+09
4,2016-01-06,15051750.00,6.472554e+09
...,...,...,...
2948,2023-07-26,19440037.50,5.737533e+11
2950,2023-07-27,19440768.75,5.672233e+11
2952,2023-07-28,19441493.75,5.705884e+11
2955,2023-07-29,19442581.25,5.715924e+11


In [10]:
# Count distinct timestamps (compare with the row count to spot duplicated rows)
len(df2['timestamp'].unique())

2081

In [23]:
# Add the market capitalization and total circulating data
# (the inner join keeps only the dates present in both DataFrames)
all_data = pd.merge(df1, df2, how="inner", on='timestamp')
all_data = all_data.ffill() # Forward-fill any remaining missing values
all_data

Unnamed: 0,timestamp,market-price,trade-volume,blocks-size,avg-block-size,n-transactions-total,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd,total-bitcoins,market-cap
0,2016-01-01,430.89,2.860153e+06,54604.791735,0.493407,101155706,919.200000,6.971292e+05,1.038803e+11,1.554769e+06,8783.063851,263121.0,124092.0,5.834626e+07,15029575.00,6.474140e+09
1,2016-01-02,434.75,1.646042e+06,54670.707780,0.579744,101278339,1027.848276,7.074570e+05,1.038803e+11,1.671420e+06,13798.467773,333102.0,149038.0,5.083235e+07,15035125.00,6.499685e+09
2,2016-01-04,430.78,1.967359e+06,54835.983701,0.556970,101571729,1001.955801,9.036860e+05,1.038803e+11,2.076921e+06,13811.521698,344268.0,181354.0,9.627657e+07,15040650.00,6.458253e+09
3,2016-01-05,434.17,2.484225e+06,54936.400034,0.641779,101752002,1161.598726,8.417189e+05,1.038803e+11,1.819808e+06,14331.031533,359763.0,182371.0,1.031559e+08,15046150.00,6.489359e+09
4,2016-01-06,432.43,1.677504e+06,55036.950659,0.662380,101934175,1220.496454,7.022931e+05,1.038803e+11,1.702648e+06,14113.278937,312004.0,172090.0,1.007001e+08,15051750.00,6.472554e+09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2072,2023-07-26,29226.18,7.731961e+07,498719.595538,1.682859,869881904,3144.069444,3.811901e+08,5.325165e+13,2.692680e+07,551760.191132,737067.0,452746.0,2.649786e+09,19440037.50,5.737533e+11
2073,2023-07-27,29344.56,1.112411e+08,498961.880465,1.653488,870335062,2993.028986,3.589731e+08,5.232831e+13,2.634722e+07,619524.768989,713630.0,413038.0,3.329622e+09,19440768.75,5.672233e+11
2074,2023-07-28,29213.94,8.336315e+07,499189.833530,1.629711,870747604,3208.750000,3.641756e+08,5.232831e+13,2.623381e+07,596803.108800,720574.0,449225.0,2.970637e+09,19441493.75,5.705884e+11
2075,2023-07-29,29316.12,7.587452e+07,499418.145973,1.616703,871197669,3205.446970,3.433656e+08,5.232831e+13,2.470638e+07,474147.692736,694496.0,423119.0,1.601999e+09,19442581.25,5.715924e+11
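
Note that the inner join drops dates that are missing from `df2`: in the output above, 2016-01-03 is present in `df1` but not in the merged result. The sketch below contrasts this with a left join followed by a forward fill. This is shown only as a possible alternative, not as the approach used in this notebook, and the toy frames are built from values in the outputs above.

```python
import pandas as pd

# df1-like frame: one row per day
df1_toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-01-02", "2016-01-03", "2016-01-04"]),
    "market-price": [434.75, 432.76, 430.78],
})
# df2-like frame: 2016-01-03 is missing
df2_toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-01-02", "2016-01-04"]),
    "total-bitcoins": [15035125.00, 15040650.00],
})

# Inner join: only dates present in both frames survive
inner = pd.merge(df1_toy, df2_toy, how="inner", on="timestamp")

# Left join + forward fill: every df1 date is kept, gaps take the previous value
left = pd.merge(df1_toy, df2_toy, how="left", on="timestamp").ffill()

print(len(inner), len(left))
print(left)
```

With the inner join, the forward fill has nothing to fill, because rows with a missing counterpart are never created; the gaps are instead bridged later by the interpolation applied during resampling.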


In [24]:
# Show the rows that contain at least one NaN value
all_data[all_data.isnull().any(axis=1)]

Unnamed: 0,timestamp,market-price,trade-volume,blocks-size,avg-block-size,n-transactions-total,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd,total-bitcoins,market-cap


In [25]:
# Count distinct timestamps (compare with the row count to spot duplicated rows)
len(all_data['timestamp'].unique())

2077

In [27]:
# Move the main currency columns to the front, keeping the other columns in their original order
first_columns = ['timestamp', 'market-price', 'total-bitcoins', 'market-cap']
new_columns = first_columns + [col for col in all_data.columns if col not in first_columns]
all_data = all_data.reindex(columns=new_columns)
all_data

Unnamed: 0,timestamp,market-price,total-bitcoins,market-cap,trade-volume,blocks-size,avg-block-size,n-transactions-total,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd
0,2016-01-01,430.89,15029575.00,6.474140e+09,2.860153e+06,54604.791735,0.493407,101155706,919.200000,6.971292e+05,1.038803e+11,1.554769e+06,8783.063851,263121.0,124092.0,5.834626e+07
1,2016-01-02,434.75,15035125.00,6.499685e+09,1.646042e+06,54670.707780,0.579744,101278339,1027.848276,7.074570e+05,1.038803e+11,1.671420e+06,13798.467773,333102.0,149038.0,5.083235e+07
2,2016-01-04,430.78,15040650.00,6.458253e+09,1.967359e+06,54835.983701,0.556970,101571729,1001.955801,9.036860e+05,1.038803e+11,2.076921e+06,13811.521698,344268.0,181354.0,9.627657e+07
3,2016-01-05,434.17,15046150.00,6.489359e+09,2.484225e+06,54936.400034,0.641779,101752002,1161.598726,8.417189e+05,1.038803e+11,1.819808e+06,14331.031533,359763.0,182371.0,1.031559e+08
4,2016-01-06,432.43,15051750.00,6.472554e+09,1.677504e+06,55036.950659,0.662380,101934175,1220.496454,7.022931e+05,1.038803e+11,1.702648e+06,14113.278937,312004.0,172090.0,1.007001e+08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2072,2023-07-26,29226.18,19440037.50,5.737533e+11,7.731961e+07,498719.595538,1.682859,869881904,3144.069444,3.811901e+08,5.325165e+13,2.692680e+07,551760.191132,737067.0,452746.0,2.649786e+09
2073,2023-07-27,29344.56,19440768.75,5.672233e+11,1.112411e+08,498961.880465,1.653488,870335062,2993.028986,3.589731e+08,5.232831e+13,2.634722e+07,619524.768989,713630.0,413038.0,3.329622e+09
2074,2023-07-28,29213.94,19441493.75,5.705884e+11,8.336315e+07,499189.833530,1.629711,870747604,3208.750000,3.641756e+08,5.232831e+13,2.623381e+07,596803.108800,720574.0,449225.0,2.970637e+09
2075,2023-07-29,29316.12,19442581.25,5.715924e+11,7.587452e+07,499418.145973,1.616703,871197669,3205.446970,3.433656e+08,5.232831e+13,2.470638e+07,474147.692736,694496.0,423119.0,1.601999e+09


Once we have the daily dataset, we upsample it to a frequency of 30 minutes (30T) using the resample method. The data is then organized in 30-minute time frames, and interpolation (linear by default) fills the newly created rows by estimating the missing values from the surrounding known values.

In [28]:
# Upsample to 30 minutes and fill the new rows by interpolation
all_data.set_index('timestamp', inplace=True)
all_data_30m = all_data.resample('30T').interpolate()
all_data_30m

timestamp,market-price,total-bitcoins,market-cap,trade-volume,blocks-size,avg-block-size,n-transactions-total,n-transactions-per-block,hash-rate,difficulty,miners-revenue,transaction-fees-usd,n-unique-addresses,n-transactions,estimated-transaction-volume-usd
2016-01-01 00:00:00,430.890000,1.502958e+07,6.474140e+09,2.860153e+06,54604.791735,0.493407,1.011557e+08,919.200000,6.971292e+05,1.038803e+11,1.554769e+06,8783.063851,263121.000000,124092.000000,5.834626e+07
2016-01-01 00:30:00,430.970417,1.502969e+07,6.474672e+09,2.834859e+06,54606.164986,0.495206,1.011583e+08,921.463506,6.973443e+05,1.038803e+11,1.557199e+06,8887.551433,264578.937500,124611.708333,5.818972e+07
2016-01-01 01:00:00,431.050833,1.502981e+07,6.475204e+09,2.809565e+06,54607.538237,0.497004,1.011608e+08,923.727011,6.975595e+05,1.038803e+11,1.559629e+06,8992.039015,266036.875000,125131.416667,5.803318e+07
2016-01-01 01:30:00,431.131250,1.502992e+07,6.475736e+09,2.784271e+06,54608.911488,0.498803,1.011634e+08,925.990517,6.977747e+05,1.038803e+11,1.562059e+06,9096.526597,267494.812500,125651.125000,5.787664e+07
2016-01-01 02:00:00,431.211667,1.503004e+07,6.476268e+09,2.758977e+06,54610.284739,0.500602,1.011659e+08,928.254023,6.979898e+05,1.038803e+11,1.564490e+06,9201.014178,268952.750000,126170.833333,5.772010e+07
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-07-29 22:00:00,29356.242500,1.944358e+07,5.698396e+11,3.560932e+07,499613.703783,1.713002,8.715854e+08,3791.125541,4.292070e+08,5.232831e+13,3.111045e+07,506701.817164,791666.333333,627292.750000,1.403129e+09
2023-07-29 22:30:00,29357.154375,1.944361e+07,5.697997e+11,3.469420e+07,499618.148279,1.715190,8.715942e+08,3804.436418,4.311579e+08,5.232831e+13,3.125600e+07,507441.683628,793874.750000,631933.062500,1.398609e+09
2023-07-29 23:00:00,29358.066250,1.944363e+07,5.697599e+11,3.377908e+07,499622.592775,1.717379,8.716030e+08,3817.747294,4.331089e+08,5.232831e+13,3.140155e+07,508181.550092,796083.166667,636573.375000,1.394089e+09
2023-07-29 23:30:00,29358.978125,1.944365e+07,5.697201e+11,3.286396e+07,499627.037270,1.719568,8.716118e+08,3831.058171,4.350598e+08,5.232831e+13,3.154710e+07,508921.416556,798291.583333,641213.687500,1.389570e+09
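
The upsampling step can be checked by hand on two daily values. This is a minimal sketch using the first two `market-price` values from the dataset; it uses the alias `"30min"`, which is equivalent to `"30T"` and is the spelling preferred by newer pandas releases.

```python
import pandas as pd

# Two consecutive daily observations of the market price
daily = pd.DataFrame(
    {"market-price": [430.89, 434.75]},
    index=pd.to_datetime(["2016-01-01", "2016-01-02"]),
)

# Upsample to 30-minute steps; interpolate() defaults to linear interpolation
half_hourly = daily.resample("30min").interpolate()

print(len(half_hourly))     # 48 intervals between the two days, plus the end point
print(half_hourly.head(3))
```

One day contains 48 half-hour steps, so each step adds (434.75 - 430.89) / 48 ≈ 0.0804 to the price. The second row is therefore about 430.9704, in agreement with the 00:30 row of the output above.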


# Output

In this last section we save the dataset we just created to Google Drive.

In [29]:
GDRIVE_DIR = "/content/drive"
GDRIVE_DATASET_RAW_DIR = GDRIVE_DIR + "/MyDrive/BDC/project/datasets/raw"

In [30]:
# Link Colab to our Google Drive
drive.mount(GDRIVE_DIR)

Mounted at /content/drive


In [31]:
GDRIVE_DATASET_NAME = "bitcoin_blockchain_data_30min"
GDRIVE_DATASET_NAME_EXT = "/" + GDRIVE_DATASET_NAME + ".parquet"
GDRIVE_DATASET = GDRIVE_DATASET_RAW_DIR + GDRIVE_DATASET_NAME_EXT

In [33]:
# Save the 30-minute data as a Parquet file
all_data_30m.to_parquet(GDRIVE_DATASET)