# **Bitcoin price prediction - Data crawling**
## Big Data Computing final project - A.Y. 2022 - 2023
Prof. Gabriele Tolomei

MSc in Computer Science

La Sapienza, University of Rome

### Author: Corsi Danilo (1742375) - corsi.1742375@studenti.uniroma1.it



---


Description: Bitcoin data retrieval via API calls.

# Global constants, dependencies, libraries and tools

In [None]:
# Main constants
LOCAL_RUNNING = True
ROOT_DIR = "D:/Documents/Repository/BDC/project" if LOCAL_RUNNING else "/content/drive"

In [None]:
if not LOCAL_RUNNING:
    # Point Colaboratory to Google Drive
    from google.colab import drive

    # Define GDrive paths
    drive.mount(ROOT_DIR, force_remount=True)

## Import my utilities

In [None]:
# Set main dir
MAIN_DIR = ROOT_DIR if LOCAL_RUNNING else ROOT_DIR + "/MyDrive/BDC/project"

###################
# --- DATASET --- #
###################

# Datasets dir
DATASET_RAW_DIR = MAIN_DIR + "/datasets/raw"

# Datasets name
DATASET_NAME = "bitcoin_blockchain_data_15min"

# Datasets path
DATASET_RAW = DATASET_RAW_DIR + "/" + DATASET_NAME + ".parquet"

In [None]:
# Useful imports
import pandas as pd
import functools
import plotly.io as pio
import time
from datetime import datetime, timedelta

# Suppression of warnings for better reading
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

pio.renderers.default = 'vscode+colab' # To correctly render plotly plots

# Metrics and parameters
I chose to collect data on the Bitcoin blockchain using the API of the website Blockchain.com, and price information from two well-known exchanges, Binance and Kraken. I retrieved the most relevant information from 2019-12-07 to 2023-11-13 (~4 years, a period that includes phases of high volatility as well as long stretches of sideways price movement).

The features taken under consideration were divided into several categories:
- **Currency Statistics**
   - **ohlcv:** stands for “Open, High, Low, Close and Volume”, the five types of price data most commonly used in financial analysis.
   - **market-price:** the average USD market price across major bitcoin exchanges.
   - **trade-volume-usd:** the total USD value of trading volume on major bitcoin exchanges.
   - **total-bitcoins:** the total number of mined bitcoin that are currently circulating on the network.
   - **market-cap:** the total USD value of bitcoin in circulation.

- **Block Details**
   - **blocks-size:** the total size of the blockchain minus database indexes in megabytes.
   - **avg-block-size:** the average block size over the past 24 hours in megabytes.
   - **n-transactions-total:** the total number of transactions on the blockchain.
   - **n-transactions-per-block:** the average number of transactions per block over the past 24 hours.

- **Mining Information**
   - **hash-rate:** the estimated number of terahashes per second the bitcoin network is performing in the last 24 hours.
   - **difficulty:** a relative measure of how difficult it is to mine a new block for the blockchain.
   - **miners-revenue:** total value of coinbase block rewards and transaction fees paid to miners.
   - **transaction-fees-usd:** the total USD value of all transaction fees paid to miners. This does not include coinbase block rewards.

- **Network Activity**
   - **n-unique-addresses:** the total number of unique addresses used on the blockchain.
   - **n-transactions:** the total number of confirmed transactions per day.
   - **estimated-transaction-volume-usd:** the total estimated value in USD of transactions on the blockchain.
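
Some of these metrics are linked by simple arithmetic. For instance, the market capitalization is the circulating supply multiplied by the market price. The sketch below illustrates the relation; the figures are made up for illustration and are not values returned by the API.

```python
# Hypothetical figures, for illustration only (not real API values)
market_price_usd = 30_000.0    # "market-price": average USD price of one bitcoin
total_bitcoins = 19_500_000.0  # "total-bitcoins": circulating supply

# "market-cap": total USD value of the bitcoin in circulation
market_cap_usd = market_price_usd * total_bitcoins
print(f"{market_cap_usd:,.0f} USD")
```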

In [None]:
timespan = "4years" # Duration of the data
# end_date = datetime.today() # Get current date (ending date)
date_string = '2023-11-13'
end_date = datetime.strptime(date_string, '%Y-%m-%d') # set static date (ending date)
start_date = (end_date - timedelta(days=365*4)) # Get the starting date

# Metrics considered
metrics = [
          # Currency Statistics
          "market-price",
          "trade-volume",

          # Block Details
          "blocks-size",
          "avg-block-size",
          "n-transactions-total",
          "n-transactions-per-block",

          # Mining Information
          "hash-rate",
          "difficulty",
          "miners-revenue",
          "transaction-fees-usd",

          # Network Activity
          "n-unique-addresses",
          "n-transactions",
          "estimated-transaction-volume-usd"
]

# Data crawling

In [None]:
# Install ccxt trading library that provides a way to connect and trade with various cryptocurrency exchanges and payment processing services worldwide
!pip3 install ccxt
import ccxt

In [None]:
# Create an array of dates so that the API is contacted in roughly one-year (360-day) increments
# Calculate the number of days between the start and end dates
num_days = (end_date - start_date).days

# Keep one date every 360 days, starting from start_date
date_array = [start_date + timedelta(days=i) for i in range(0, num_days + 1, 360)]

# Append end_date
date_array.append(end_date)
date_array
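
Consecutive entries of the date array are later used as (start, end) pairs for the requests. As a minimal sketch of that idea, the hypothetical helper `build_date_windows` below (not part of the notebook) builds the pairs explicitly; unlike the cell above, it appends the end date only when it is not already the last boundary.

```python
from datetime import datetime, timedelta

def build_date_windows(start, end, step_days=360):
    """Split [start, end] into consecutive (window_start, window_end) pairs."""
    num_days = (end - start).days
    boundaries = [start + timedelta(days=d) for d in range(0, num_days + 1, step_days)]
    if boundaries[-1] != end:
        boundaries.append(end)
    # Pair each boundary with the next one
    return list(zip(boundaries, boundaries[1:]))

end = datetime(2023, 11, 13)
start = end - timedelta(days=365 * 4)
windows = build_date_windows(start, end)
for window_start, window_end in windows:
    print(window_start.date(), "->", window_end.date())
```

Each window ends on the day the next one starts, so the boundary days are requested twice; this is one reason duplicates are checked after crawling.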

In [None]:
def ohlcv_crawler(exchange_to_use, start, end):
    exchange = exchange_to_use  # Exchange client to query
    market = 'BTC/USD'  # Bitcoin market
    exchange.enableRateLimit = False  # Disable ccxt's built-in rate limiter

    # Convert dates to milliseconds
    since = exchange.parse8601(start + 'T00:00:00Z')
    till = exchange.parse8601(end + 'T00:00:00Z')

    # fetch_ohlcv takes (symbol, timeframe, since, limit): the fourth argument is
    # a number of candles, not an end timestamp, so derive it from the date range
    limit = (till - since) // (24 * 60 * 60 * 1000) + 1

    # Fetch daily OHLCV data
    ohlcv = exchange.fetch_ohlcv(market, '1d', since, limit)

    # Convert to DataFrame
    dataset = pd.DataFrame(ohlcv, columns=['timestamp', 'open', 'high', 'low', 'close', 'volume'])
    dataset['timestamp'] = pd.to_datetime(dataset['timestamp'], unit='ms')

    return dataset
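
The crawler exchanges timestamps with the API as milliseconds since the Unix epoch and converts them back with `unit='ms'`. The sketch below shows that round trip without any network call; `to_milliseconds` is a hypothetical helper that mimics, for a midnight UTC date, what `parse8601` is used for above.

```python
from datetime import datetime, timezone
import pandas as pd

def to_milliseconds(date_string):
    """Convert a 'YYYY-MM-DD' date (midnight UTC) to milliseconds since the Unix epoch."""
    moment = datetime.strptime(date_string, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(moment.timestamp() * 1000)

since = to_milliseconds("2023-11-13")

# Converting back with unit='ms' recovers the original date (timezone-naive, in UTC)
recovered = pd.to_datetime(since, unit="ms")
print(since, recovered)
```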

In [None]:
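# Fetch OHLCV data
exchange_to_use = ccxt.binanceus()

# Fetch the first three windows one at a time, pausing between requests
frames = []
for i in range(3):
    frames.append(ohlcv_crawler(exchange_to_use, date_array[i].strftime('%Y-%m-%d'), date_array[i + 1].strftime('%Y-%m-%d')))
    time.sleep(5)

# DataFrame.append was removed in pandas 2.0, so combine the chunks with pd.concat
df0 = pd.concat(frames, ignore_index=True)
df0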

In [None]:
# Check duplicated rows
len(df0['timestamp'].unique())

In [None]:
# Drop the duplicates in column "timestamp", keep the last value
df0.drop_duplicates(subset="timestamp", keep="last", inplace=True)
df0
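
Overlapping request windows return the same day more than once, which is why the unique timestamps are counted before and after dropping duplicates. A minimal sketch of the check on two toy frames (the values are made up) that share one boundary day:

```python
import pandas as pd

# Two toy windows that share the boundary day 2020-01-02 (values are made up)
first = pd.DataFrame({"timestamp": pd.to_datetime(["2020-01-01", "2020-01-02"]), "close": [1.0, 2.0]})
second = pd.DataFrame({"timestamp": pd.to_datetime(["2020-01-02", "2020-01-03"]), "close": [2.5, 3.0]})

combined = pd.concat([first, second], ignore_index=True)

# Rows minus unique timestamps = number of duplicated rows
n_duplicates = len(combined) - combined["timestamp"].nunique()

# keep="last" retains the most recently appended row for each timestamp
deduplicated = combined.drop_duplicates(subset="timestamp", keep="last")
print(n_duplicates, len(deduplicated))
```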

In [None]:
# Check duplicated rows
len(df0['timestamp'].unique())

In [None]:
# Since I cannot get all the data from the same exchange, I will get the remaining data from another
last_date = df0['timestamp'].iloc[-1]

# If the data stops before our end date, fetch the rest from Kraken (a single request is enough)
if last_date != end_date:
    exchange_to_use = ccxt.kraken()
    remaining = ohlcv_crawler(exchange_to_use, last_date.strftime('%Y-%m-%d'), end_date.strftime('%Y-%m-%d'))
    df0 = pd.concat([df0, remaining], ignore_index=True)
df0

In [None]:
# Check duplicated rows
len(df0['timestamp'].unique())

In [None]:
# Drop the duplicates in column "timestamp", keep the last value
df0.drop_duplicates(subset="timestamp", keep="last", inplace=True)
df0

In [None]:
def blockchain_data_crawler(timespan, metric, start, end):
    # Chart endpoint for a single metric, returned as CSV
    url = f'https://api.blockchain.info/charts/{metric}?timespan={timespan}&start={start}&format=csv'

    # Obtain data, naming the two columns explicitly
    data = pd.read_csv(url, names=['timestamp', metric])

    # Transform "timestamp" to datetime type
    data['timestamp'] = pd.to_datetime(data["timestamp"])

    # Select data strictly before the end date
    data = data[(data['timestamp'] < end)]

    return data
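
The parsing and filtering steps of the crawler can be tried offline. The sketch below feeds `pd.read_csv` a made-up CSV payload from memory; the two-column, headerless layout is an assumption inferred from the `names=` argument above, not a documented guarantee of the API.

```python
import io
import pandas as pd

# Made-up payload in an assumed headerless "timestamp,value" layout
csv_text = (
    "2023-11-10 00:00:00,100.0\n"
    "2023-11-11 00:00:00,110.0\n"
    "2023-11-13 00:00:00,120.0\n"
)

data = pd.read_csv(io.StringIO(csv_text), names=["timestamp", "market-price"])
data["timestamp"] = pd.to_datetime(data["timestamp"])

# Keep only rows strictly before the end date, as the crawler does
end = "2023-11-13"
data = data[data["timestamp"] < end]
print(data)
```

Because the comparison is strict, the row dated exactly on the end date is dropped.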

In [None]:
# Merge helper: join two DataFrames on the "timestamp" column
merge = functools.partial(pd.merge, on='timestamp')

# Fetch blockchain data from the Blockchain.com API (one DataFrame per metric) and merge them all
df1 = functools.reduce(merge, [blockchain_data_crawler(timespan, metric, start_date.strftime('%Y-%m-%d'), end_date.strftime('%Y-%m-%d')) for metric in metrics])
df1
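
The `functools.reduce` call folds the list of per-metric frames into one wide table by merging them pairwise on `timestamp`. A minimal sketch with three toy frames (metric names and values are made up):

```python
import functools
import pandas as pd

timestamps = pd.to_datetime(["2023-01-01", "2023-01-02"])

# One toy frame per metric (values are made up)
frames = [
    pd.DataFrame({"timestamp": timestamps, "metric-a": [1, 2]}),
    pd.DataFrame({"timestamp": timestamps, "metric-b": [3, 4]}),
    pd.DataFrame({"timestamp": timestamps, "metric-c": [5, 6]}),
]

merge = functools.partial(pd.merge, on="timestamp")

# reduce computes merge(merge(frames[0], frames[1]), frames[2])
wide = functools.reduce(merge, frames)
print(wide)
```

Since `pd.merge` defaults to an inner join, only timestamps present in every frame survive.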

In [None]:
# Check duplicated rows
len(df1['timestamp'].unique())

In [None]:
# Retrieve market capitalization and circulating supply data
metrics = [
          # Currency Statistics
          "total-bitcoins",
          "market-cap",
]

df2 = functools.reduce(merge, [blockchain_data_crawler(timespan, metric, start_date.strftime('%Y-%m-%d'), end_date.strftime('%Y-%m-%d')) for metric in metrics])
df2

In [None]:
# Check duplicated rows
len(df2['timestamp'].unique())

In [None]:
# Wipe off the timestamp's h:m:s.
df2['timestamp'] = pd.to_datetime(df2["timestamp"]).dt.normalize()

# Drop the duplicates in column "timestamp", keep the last value
df2.drop_duplicates(subset="timestamp", keep="last", inplace=True)
df2
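
Normalizing sets every timestamp to midnight, so several readings from the same day collapse onto one key and only the last one is kept; presumably this is what lets these rows line up with the daily timestamps of the other frames when merging. A small sketch with made-up values:

```python
import pandas as pd

# Toy timestamps with a time-of-day component (values are made up)
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-01-01 03:15:00", "2023-01-01 18:40:00", "2023-01-02 07:05:00"]),
    "market-cap": [10.0, 11.0, 12.0],
})

# Set every time to 00:00:00, keeping the date
df["timestamp"] = df["timestamp"].dt.normalize()

# Keep the last reading of each day
daily = df.drop_duplicates(subset="timestamp", keep="last")
print(daily)
```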

In [None]:
all_data_tmp = pd.merge(df0, df1, how="inner", on='timestamp')
all_data = pd.merge(all_data_tmp, df2, how="inner", on='timestamp')
all_data = all_data.ffill()  # Forward-fill gaps (interpolate(method='ffill') is deprecated in recent pandas)
all_data

In [None]:
# Show the rows that still contain NaN values
all_data[all_data.isnull().any(axis=1)]

In [None]:
# Check duplicated rows
len(all_data['timestamp'].unique())

Once I have the daily dataset, I upsample it to a 15-minute frequency (the `15T` offset alias, also spelled `15min`) using the `resample` method.

The data is thus organized in 15-minute intervals. The new intermediate rows have no observed values, so they are filled by interpolation: each missing value is estimated from the surrounding known values (linearly, by default).
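
As a minimal sketch of this step, the toy example below upsamples two made-up daily observations; one day contains 96 intervals of 15 minutes, so the result has 97 rows and the value rises by the same amount at every step.

```python
import pandas as pd

# Two made-up daily observations
daily = pd.DataFrame(
    {"closing-price": [100.0, 196.0]},
    index=pd.to_datetime(["2023-01-01", "2023-01-02"]),
)

# Upsample to 15-minute steps; interpolate() defaults to linear interpolation
intraday = daily.resample("15min").interpolate()
print(len(intraday))
print(intraday.head(3))
```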

In [None]:
# Rename some columns
all_data.rename(columns={'open': 'opening-price', 'high': 'highest-price', 'low': 'lowest-price', 'close': 'closing-price', 'volume': 'trade-volume-btc', 'trade-volume': 'trade-volume-usd'}, inplace=True)

# Reorder columns: price and supply columns first, then the remaining metrics
first_columns = ['timestamp', 'market-price', 'opening-price', 'highest-price', 'lowest-price', 'closing-price', 'trade-volume-btc', 'total-bitcoins', 'market-cap']
new_columns = first_columns + [col for col in all_data.columns if col not in first_columns]
all_data = all_data.reindex(columns=new_columns)

# Upsample to 15 minutes with (linear) interpolation
all_data.set_index('timestamp', inplace=True)
all_data_15m = all_data.resample('15min').interpolate()
all_data_15m

# Saving dataset

In [None]:
# Save the 15m dataset
all_data_15m.to_parquet(DATASET_RAW)

In [None]:
# Export the notebook to HTML (remember to save the notebook first and to update the file name if it changes)
if LOCAL_RUNNING:
    !jupyter nbconvert --to html 1-data-crawling.ipynb --output 1-data-crawling.ipynb --output-dir='./exports'