# **Bitcoin price prediction with PySpark - Data crawling**
## Big Data Computing final project - A.Y. 2022 - 2023
Prof. Gabriele Tolomei

MSc in Computer Science

La Sapienza, University of Rome

### Author: Corsi Danilo (1742375) - corsi.1742375@studenti.uniroma1.it



---


Description: data crawling on the Bitcoin blockchain by querying the Blockchain.com website.

# Global constants, dependencies, libraries and tools

In [1]:
# Main constants
GDRIVE_DIR = "/content/drive"

In [2]:
# Datasets dir
GDRIVE_DATASET_RAW_DIR = GDRIVE_DIR + "/MyDrive/BDC/project/datasets/raw"

# Datasets name
DATASET_NAME = "bitcoin_blockchain_data_30min"

# Datasets path
GDRIVE_DATASET_RAW = GDRIVE_DATASET_RAW_DIR + "/" + DATASET_NAME + ".parquet"

In [3]:
# Useful imports
import pandas as pd
import functools

from google.colab import drive

from datetime import date

In [4]:
# Point Colaboratory to Google Drive by mounting it at GDRIVE_DIR
drive.mount(GDRIVE_DIR, force_remount=True)

# Metrics and parameters
I collected data on the Bitcoin blockchain through the API of the Blockchain.com website. The most relevant information was retrieved from 2016 to the present day, a period that includes moments of high volatility as well as long phases of sideways price movement.

The features considered are divided into the following categories:

**Currency Statistics**

- **market-price:** the average USD market price across major bitcoin exchanges.
- **trade-volume:** the total USD value of trading volume on major bitcoin exchanges.
- **total-bitcoins:** the total number of mined bitcoin that are currently circulating on the network.
- **market-cap:** the total USD value of bitcoin in circulation.

**Block Details**

- **blocks-size:** the total size of the blockchain minus database indexes in megabytes.
- **avg-block-size:** the average block size over the past 24 hours in megabytes.
- **n-transactions-total:** the total number of transactions on the blockchain.
- **n-transactions-per-block:** the average number of transactions per block over the past 24 hours.

**Mining Information**

- **hash-rate:** the estimated number of terahashes per second the bitcoin network is performing in the last 24 hours.
- **difficulty:** a relative measure of how difficult it is to mine a new block for the blockchain.
- **miners-revenue:** total value of coinbase block rewards and transaction fees paid to miners.
- **transaction-fees-usd:** the total USD value of all transaction fees paid to miners. This does not include coinbase block rewards.

**Network Activity**

- **n-unique-addresses:** the total number of unique addresses used on the blockchain.
- **n-transactions:** the total number of confirmed transactions per day.
- **estimated-transaction-volume-usd:** the total estimated value in USD of transactions on the blockchain.
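
Each metric name above doubles as the chart identifier in the request URL. The following is a minimal sketch of how such a URL can be assembled, assuming the endpoint pattern used later in this notebook; `build_chart_url` and `CHARTS_BASE_URL` are hypothetical names introduced only for illustration, and the parameter values are examples.

```python
from urllib.parse import urlencode

# Endpoint pattern taken from the crawler in this notebook; treat it as an
# assumption about the public charts API, not as a documented contract.
CHARTS_BASE_URL = "https://api.blockchain.info/charts"


def build_chart_url(metric: str, timespan: str, start: str) -> str:
    """Return the CSV chart URL for one metric (hypothetical helper)."""
    query = urlencode({"timespan": timespan, "start": start, "format": "csv"})
    return f"{CHARTS_BASE_URL}/{metric}?{query}"


example_url = build_chart_url("market-price", "6years", "2016-01-01")
print(example_url)
```

Building the query string with `urlencode` keeps the parameters in one place and escapes any value that needs it.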

In [5]:
# Define the parameters
timespan = "6years" # Duration covered by each request (a single request can span at most 6 years, so the period is fetched in two requests)
start_date = "2016-01-01"
continue_date = "2021-12-31" # Start of the second request, 6 years from start_date
end_date = str(date.today())

# Metrics considered
metrics = [
          # Currency Statistics
          "market-price",
          "trade-volume",

          # Block Details
          "blocks-size",
          "avg-block-size",
          "n-transactions-total",
          "n-transactions-per-block",

          # Mining Information
          "hash-rate",
          "difficulty",
          "miners-revenue",
          "transaction-fees-usd",

          # Network Activity
          "n-unique-addresses",
          "n-transactions",
          "estimated-transaction-volume-usd"
]
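
Because a single request covers at most 6 years, a longer period has to be split into several request windows. The sketch below shows one way to derive the window start dates; `request_windows` is a hypothetical helper, not part of this notebook, and the end date is an arbitrary example. Note that it starts the second window exactly 6 years after the first, whereas the parameters above start it one day earlier.

```python
from datetime import date


def request_windows(start: date, end: date, max_years: int = 6):
    """Yield the start date of each request window covering [start, end)."""
    current = start
    while current < end:
        yield current
        # Jump ahead by max_years; Feb 29 falls back to Feb 28 when needed.
        try:
            current = current.replace(year=current.year + max_years)
        except ValueError:
            current = current.replace(year=current.year + max_years, day=28)


starts = list(request_windows(date(2016, 1, 1), date(2023, 6, 30)))
print(starts)
```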

# Data crawling

In [6]:
def data_crawler(timespan, metric, start_date, continue_date, end_date):
    """Download a single metric in two requests and return it as one DataFrame."""
    # API info: one URL per request window
    url1 = f'https://api.blockchain.info/charts/{metric}?timespan={timespan}&start={start_date}&format=csv'
    url2 = f'https://api.blockchain.info/charts/{metric}?timespan={timespan}&start={continue_date}&format=csv'

    # Obtain data (column names are set explicitly)
    data1 = pd.read_csv(url1, names=['timestamp', metric])
    data2 = pd.read_csv(url2, names=['timestamp', metric])

    # Concat by rows
    all_data = pd.concat([data1, data2])

    # Transform "timestamp" to datetime type
    all_data['timestamp'] = pd.to_datetime(all_data["timestamp"])

    # Select data up to (but not including) the end date
    all_data = all_data[(all_data['timestamp'] < end_date)]

    return all_data
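
The parsing steps of the crawler can be tried without network access. The sketch below is only an illustration under stated assumptions: the two CSV payloads are made-up placeholders standing in for the API responses, and the final `drop_duplicates` call is an extra step that the crawler above does not perform.

```python
import io

import pandas as pd

# Two hypothetical CSV payloads standing in for the two API responses.
# The values are placeholders; the repeated 2021-12-31 row mimics a
# boundary date that appears in both request windows.
csv_first = "2021-12-30 00:00:00,1.0\n2021-12-31 00:00:00,2.0\n"
csv_second = "2021-12-31 00:00:00,2.0\n2022-01-01 00:00:00,3.0\n"

frames = [
    pd.read_csv(io.StringIO(payload), names=["timestamp", "market-price"])
    for payload in (csv_first, csv_second)
]

combined = pd.concat(frames, ignore_index=True)
combined["timestamp"] = pd.to_datetime(combined["timestamp"])

# Keep rows strictly before the end date, then drop the repeated boundary row.
combined = combined[combined["timestamp"] < "2022-01-01"]
combined = combined.drop_duplicates(subset="timestamp").reset_index(drop=True)
print(combined)
```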

In [7]:
# Merge the data
merge = functools.partial(pd.merge, on='timestamp')

# Get blockchain data from the Blockchain.com API
df1 = functools.reduce(merge, [data_crawler(timespan, metric, start_date, continue_date, end_date) for metric in metrics])
df1

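
The merge step chains one join per metric. Here is a minimal sketch of the same `functools.reduce` pattern on toy frames; the column names and values are placeholders, not real chart data. Since `pd.merge` performs an inner join by default, only timestamps present in every frame survive.

```python
import functools

import pandas as pd

timestamps = pd.to_datetime(["2022-01-01", "2022-01-02"])

# Toy per-metric frames; the third one lacks the second timestamp.
toy_frames = [
    pd.DataFrame({"timestamp": timestamps, "metric-a": [1.0, 2.0]}),
    pd.DataFrame({"timestamp": timestamps, "metric-b": [10.0, 20.0]}),
    pd.DataFrame({"timestamp": timestamps[:1], "metric-c": [100.0]}),
]

# reduce applies the merge pairwise: ((a merge b) merge c)
merge_on_timestamp = functools.partial(pd.merge, on="timestamp")
merged = functools.reduce(merge_on_timestamp, toy_frames)
print(merged)
```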

In [None]:
# Check duplicated rows (count of unique timestamps, to compare with the number of rows)
len(df1['timestamp'].unique())


In [None]:
# Retrieve market capitalization and total circulating bitcoin
metrics = [
          # Currency Statistics
          "total-bitcoins",  # Total Circulating Bitcoin: the total number of mined bitcoin that are currently circulating on the network.
          "market-cap",      # Market Capitalization (USD): the total USD value of bitcoin in circulation.
]

df2 = functools.reduce(merge, [data_crawler(timespan, metric, start_date, continue_date, end_date) for metric in metrics])
df2

In [None]:
# Check duplicated rows (count of unique timestamps, to compare with the number of rows)
len(df2['timestamp'].unique())
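
Counting unique timestamps reveals duplicates only when compared with the row count. As an alternative, the sketch below counts repeated timestamps directly with `duplicated()`, both at full resolution and per calendar day. The frame is toy data with placeholder values.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2022-01-01 00:00", "2022-01-01 12:00", "2022-01-02 00:00"]
        ),
        "value": [1.0, 2.0, 3.0],
    }
)

# At full resolution every timestamp is distinct.
exact_duplicates = int(toy["timestamp"].duplicated().sum())

# Once the time of day is dropped, two rows share the same calendar day.
same_day_duplicates = int(toy["timestamp"].dt.normalize().duplicated().sum())
print(exact_duplicates, same_day_duplicates)
```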

In [None]:
# Strip the time of day (h:m:s) from the timestamp
df2['timestamp'] = pd.to_datetime(df2["timestamp"]).dt.normalize()

# Drop the duplicates in column "timestamp", keep the last value
df2.drop_duplicates(subset="timestamp", keep="last", inplace=True)
df2
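
To make the effect of `keep="last"` concrete, here is a small sketch on toy data (placeholder values): after normalizing to midnight, the last observation of each day is the one retained.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2022-01-01 03:00", "2022-01-01 21:00", "2022-01-02 09:00"]
        ),
        "total-bitcoins": [1.0, 2.0, 3.0],
    }
)

# Set every timestamp to midnight of its day.
toy["timestamp"] = toy["timestamp"].dt.normalize()

# Keep one row per day: the last one in row order.
daily = toy.drop_duplicates(subset="timestamp", keep="last").reset_index(drop=True)
print(daily)
```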

In [None]:
# Check duplicated rows (count of unique timestamps, to compare with the number of rows)
len(df2['timestamp'].unique())

In [None]:
# Add the market capitalization and total circulating data to the main dataset
all_data = pd.merge(df1, df2, how="inner", on='timestamp')
# Forward-fill missing values (ffill() replaces the deprecated interpolate(method='ffill'))
all_data = all_data.ffill()
all_data
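
The combination of an inner join and a forward fill can be checked on a small example. This is a sketch on toy frames with placeholder values: the join keeps only timestamps present on both sides, and the fill propagates the last known value into a gap. A missing value in the very first row would stay missing, since there is nothing before it to carry forward.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03"]),
        "metric-a": [1.0, 2.0, np.nan],
    }
)
right = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2022-01-02", "2022-01-03", "2022-01-04"]),
        "metric-b": [20.0, 30.0, 40.0],
    }
)

# Inner join: only 2022-01-02 and 2022-01-03 appear in both frames.
joined = pd.merge(left, right, how="inner", on="timestamp")

# Forward fill: the gap on 2022-01-03 takes the value from 2022-01-02.
filled = joined.ffill()
print(filled)
```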

In [None]:
# Check NaN values (rows with at least one missing value)
all_data[all_data.isnull().any(axis=1)]

In [None]:
# Check duplicated rows (count of unique timestamps, to compare with the number of rows)
len(all_data['timestamp'].unique())

In [None]:
all_data

In [None]:
# Reorder columns
first_columns = ['timestamp', 'market-price', 'total-bitcoins', 'market-cap']
new_columns = first_columns + [col for col in all_data.columns if col not in first_columns]
all_data = all_data.reindex(columns=new_columns)
all_data


Once the daily dataset is ready, it is upsampled to a frequency of 30 minutes (30T, also written 30min) using the resample method.

The data is thus organized in 30-minute intervals, and interpolation fills the newly created rows by estimating each missing value from the surrounding known values.

In [None]:
# Upsample to 30 minutes with linear interpolation (the default method)
all_data.set_index('timestamp', inplace=True)
all_data_30m = all_data.resample('30min').interpolate()  # '30min' is the current spelling of the '30T' alias
all_data_30m
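
A small worked example helps to see what the upsampling produces. The sketch below uses two daily placeholder values chosen so that the arithmetic is easy to follow: one day contains 48 half-hour steps, so a rise of 48 between the two days gives an increase of 1 per step under linear interpolation.

```python
import pandas as pd

# Two daily observations (placeholder values, not real prices).
daily = pd.DataFrame(
    {"market-price": [100.0, 148.0]},
    index=pd.to_datetime(["2022-01-01", "2022-01-02"]),
)
daily.index.name = "timestamp"

# Upsample to 30-minute rows; interpolate() defaults to linear interpolation.
half_hourly = daily.resample("30min").interpolate()
print(half_hourly.head(3))
```

Every interpolated row is an estimate rather than an observed value, which is worth keeping in mind when the 30-minute dataset is used downstream.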

# Saving dataset

In [None]:
# Save the 30m dataset
all_data_30m.to_parquet(GDRIVE_DATASET_RAW)