# **Bitcoin price prediction - Feature Engineering**
## Big Data Computing final project - A.Y. 2022 - 2023
Prof. Gabriele Tolomei

MSc in Computer Science

La Sapienza, University of Rome

### Author: Corsi Danilo (1742375) - corsi.1742375@studenti.uniroma1.it


---


Description: adding useful features regarding the price of Bitcoin, visualizing data and performing feature selection.


# Global constants, dependencies, libraries and tools

In [None]:
# Main constants
LOCAL_RUNNING = True
SLOW_OPERATIONS = True # Decide whether or not to use operations that might slow down notebook execution
ROOT_DIR = "D:/Documents/Repository/BDC/project" if LOCAL_RUNNING else "/content/drive"

In [None]:
if not LOCAL_RUNNING:
    # Point Colaboratory to Google Drive
    from google.colab import drive

    # Define GDrive paths
    drive.mount(ROOT_DIR, force_remount=True)

    # Install Spark and related dependencies
    !pip install pyspark
    !pip install -U -q PyDrive -qq
    !apt install openjdk-8-jdk-headless -qq

## Import my utilities

In [None]:
# Set main dir
MAIN_DIR = ROOT_DIR + "" if LOCAL_RUNNING else ROOT_DIR + "/MyDrive/BDC/project"

# Utilities dir
UTILITIES_DIR = MAIN_DIR + "/utilities"

# Import my utilities
import sys
sys.path.append(UTILITIES_DIR)

from imports import *
import visualization

In [None]:
###################
# --- DATASET --- #
###################

# Datasets dirs
DATASET_RAW_DIR = MAIN_DIR + "/datasets/raw"
DATASET_OUTPUT_DIR = MAIN_DIR + "/datasets/output"
DATASET_TEMP_DIR = MAIN_DIR + "/datasets/temp"

# Datasets names
DATASET_NAME = "bitcoin_blockchain_data_15min"

# Datasets paths
DATASET_RAW = DATASET_RAW_DIR + "/" + DATASET_NAME + ".parquet"

####################
# --- FEATURES --- #
####################

# Features dir
FEATURES_DIR = MAIN_DIR + "/features"

# Features names
FEATURES_CORRELATION_NAME = "features_correlation"
CURRENCY_FEATURES_NAME = "currency_features"
CURRENCY_MOST_CORR_FEATURES_NAME = "currency_most_corr_features"
CURRENCY_LEAST_CORR_FEATURES_NAME = "currency_least_corr_features"

# Features paths
FEATURES_CORRELATION = FEATURES_DIR + "/" + FEATURES_CORRELATION_NAME + ".json"
CURRENCY_FEATURES = FEATURES_DIR + "/" + CURRENCY_FEATURES_NAME + ".json"
CURRENCY_MOST_CORR_FEATURES = FEATURES_DIR + "/" + CURRENCY_MOST_CORR_FEATURES_NAME + ".json"
CURRENCY_LEAST_CORR_FEATURES = FEATURES_DIR + "/" + CURRENCY_LEAST_CORR_FEATURES_NAME + ".json"

In [None]:
# Suppression of warnings for better reading
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

if LOCAL_RUNNING: pio.renderers.default='notebook' # To correctly export the notebook in html format

# Create the pyspark session

In [None]:
# Create the session
conf = SparkConf().\
                set('spark.ui.port', "4050").\
                set('spark.executor.memory', '12G').\
                set('spark.driver.memory', '12G').\
                set('spark.driver.maxResultSize', '109G').\
                set("spark.kryoserializer.buffer.max", "1G").\
                setAppName("BitcoinPricePrediction").\
                setMaster("local[*]")

# Create the context
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()

# Loading dataset

In [None]:
# Load datasets into pyspark dataset objects
df = spark.read.load(DATASET_RAW, format="parquet") \
                     .withColumn("id", F.row_number().over(Window.orderBy(F.monotonically_increasing_id()))-1) # Adding "id" column

In [None]:
def dataset_info(dataset):
  # Print dataset
  dataset.show(3)

  # Get the number of rows
  num_rows = dataset.count()

  # Get the number of columns
  num_columns = len(dataset.columns)

  # Print the shape of the dataset
  print("Shape:", (num_rows, num_columns))

  # Print the schema of the dataset
  dataset.printSchema()

In [None]:
if SLOW_OPERATIONS:
  dataset_info(df)

# Adding useful features
Here I am going to add some features that could help us predict the Bitcoin price:

*   **next-market-price:** represents the price of Bitcoin for the next 15 minutes (this will be the target variable on which to make predictions)
*   **rate-of-change:** indicator that measures the percentage change in price over a period of time; it allows investors to spot a security's momentum and other trends
*   **sma-x-days:** indicators that calculate the average price over a specified number of days. They are commonly used by traders to identify trends and potential buy or sell signals
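As a minimal sketch of how these three features relate to each other, the plain-Python snippet below computes them on a toy price list. The helper names (`next_values`, `rate_of_change`, `trailing_sma`) and the prices are illustrative assumptions, not part of the notebook. The rate of change follows the definition used in the cells below (next price over current price), and `window` counts the rows in the average, including the current one.

```python
# Toy example (hypothetical data): one price per 15-minute row
prices = [100.0, 110.0, 99.0, 108.9]

def next_values(values):
    """Shift the series one step back: row i gets the value of row i + 1."""
    return values[1:]  # the last row has no "next" value and is dropped

def rate_of_change(current, following):
    """Percentage change from each current value to the following one."""
    return [(nxt / cur - 1) * 100 for cur, nxt in zip(current, following)]

def trailing_sma(values, window):
    """Average of the current row and up to `window - 1` previous rows."""
    result = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        result.append(sum(chunk) / len(chunk))
    return result

next_prices = next_values(prices)
roc = rate_of_change(prices, next_prices)
sma_2 = trailing_sma(next_prices, window=2)

print(next_prices)
print(roc)
print(sma_2)
```

In the notebook the same ideas are expressed with PySpark window functions, so that the computation scales to the whole dataset.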


In [None]:
# Creation of a new dataset for the new features
new_features_df = df.select("timestamp", "id", "market-price")

In [None]:
# Adding 'next-market-price' column
new_features_df = new_features_df.withColumn("next-market-price", F.lead("market-price", 1) \
        .over(Window.orderBy("id"))) \
        .dropna()

In [None]:
# Adding "rate-of-change" column
new_features_df = new_features_df.withColumn("rate-of-change", (F.col("next-market-price") / F.col("market-price") - 1) * 100)

In [None]:
def simple_moving_average(dataset, period, days, col="next-market-price", orderby="id"):
    dataset = dataset.withColumn(f"sma-{days}-days", F.avg(col) \
          .over(Window.orderBy(orderby) \
          .rowsBetween(-period,0)))
    return dataset

In [None]:
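# Moving averages days (5/7/10/20/50/100)
# The dataset has one row every 15 minutes, so one day spans 4 * 24 = 96 rows
ROWS_PER_DAY = 4 * 24
MA5 = ROWS_PER_DAY * 5
MA7 = ROWS_PER_DAY * 7
MA10 = ROWS_PER_DAY * 10
MA20 = ROWS_PER_DAY * 20
MA50 = ROWS_PER_DAY * 50
MA100 = ROWS_PER_DAY * 100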

# Computing the SMA
new_features_df = simple_moving_average(new_features_df, MA5, 5)
new_features_df = simple_moving_average(new_features_df, MA7, 7)
new_features_df = simple_moving_average(new_features_df, MA10, 10)
new_features_df = simple_moving_average(new_features_df, MA20, 20)
new_features_df = simple_moving_average(new_features_df, MA50, 50)
new_features_df = simple_moving_average(new_features_df, MA100, 100)

In [None]:
# Drop "market-price" column
new_features_df = new_features_df.drop("market-price")

# Merge original dataset with the one with the new features
merged_df = df.join(new_features_df, on=['timestamp','id'], how='inner')

In [None]:
if SLOW_OPERATIONS:
  dataset_info(merged_df)

In [None]:
# Rearranges columns
new_columns = ["timestamp", "id"] + [col for col in merged_df.columns if col not in ["timestamp", "id", "next-market-price"]] + ["next-market-price"]
merged_df = merged_df.select(*new_columns)

# Rename some columns
column_mapping = {"open": "opening-price", "high": "highest-price", "low": "lowest-price", "close": "closing-price", "volume": "trade-volume-btc", "trade-volume": "trade-volume-usd"}
for old_name, new_name in column_mapping.items():
    merged_df = merged_df.withColumnRenamed(old_name, new_name)

# Note: merged_df is kept as a PySpark dataset here; it is converted to Pandas later, for the visualizations

In [None]:
if SLOW_OPERATIONS:
  dataset_info(merged_df)

# Splitting dataset
Here we are going to split the dataset into two sets:
* **Train / validation set:** will be used to train the models and validate their performance
* **Test set:** will be used to perform price prediction on never-before-seen data (the last 3 months of the original dataset will be used).
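The chronological split described above can be sketched with the standard library only. The `subtract_months` helper and the timestamps are hypothetical; the helper is meant to mimic the month subtraction that the notebook performs with `relativedelta`, clamping the day to the end of the month when needed.

```python
import calendar
from datetime import datetime

def subtract_months(moment, months):
    """Return `moment` shifted back by `months` calendar months (day clamped to month end)."""
    month_index = moment.year * 12 + (moment.month - 1) - months
    year, month = divmod(month_index, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return moment.replace(year=year, month=month, day=min(moment.day, last_day))

# Hypothetical timestamps, already sorted in time
timestamps = [
    datetime(2023, 1, 15), datetime(2023, 2, 28), datetime(2023, 3, 31),
    datetime(2023, 5, 1), datetime(2023, 6, 30),
]

# Everything up to the split date is used for training / validation,
# everything after it is held out as the test set
split_date = subtract_months(max(timestamps), 3)
train_valid = [t for t in timestamps if t <= split_date]
test = [t for t in timestamps if t > split_date]

print(split_date, len(train_valid), len(test))
```

Splitting by date rather than at random keeps the test set strictly in the future with respect to the training data, which is what a price prediction task requires.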

In [None]:
# Retrieve the last (i.e. the most recent) timestamp value
last_value = merged_df.agg(F.max("timestamp")).collect()[0][0]

# Subtract three months from the last timestamp value
split_date = last_value - relativedelta(months=3)

# Split the dataset based on the desired date
train_valid_df = merged_df[merged_df['timestamp'] <= split_date]
test_df = merged_df[merged_df['timestamp'] > split_date]

In [None]:
if SLOW_OPERATIONS:
    visualization.dataset_visualization(train_valid_df.toPandas(), test_df.toPandas(), "Train / Validation and Test sets")

# Saving datasets

In [None]:
def output(dataset, dataset_type):
  dataset.coalesce(1).write.parquet(DATASET_TEMP_DIR, mode='overwrite') # Single partition, so that exactly one part file is written

  while True:
      parquet_files = glob.glob(os.path.join(DATASET_TEMP_DIR, "part*.parquet"))
      if len(parquet_files) > 0:
          # .parquet file found!
          file_path = parquet_files[0]
          break
      else:
          print(".parquet file not found. I'll try again after 1 second...")
          time.sleep(1)

  print(".parquet file found:", file_path)

  new_file_path = DATASET_OUTPUT_DIR + "/" + DATASET_NAME + "_" + dataset_type + ".parquet"

  # Rename and move the file
  shutil.move(file_path, new_file_path)

  print("File renamed and moved successfully!")
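
The "write to a temporary folder, then rename the part file" step can be tried without Spark. The sketch below is an assumption-laden stand-in: `move_single_part_file` is a hypothetical helper and the part file is a fake one created on the spot. It also refuses to proceed when more than one part file is present, since moving only one of them would silently drop rows.

```python
import glob
import os
import shutil
import tempfile

def move_single_part_file(temp_dir, target_path):
    """Move the only part*.parquet file found in temp_dir to target_path."""
    part_files = sorted(glob.glob(os.path.join(temp_dir, "part*.parquet")))
    if len(part_files) != 1:
        # More than one part file means the data is split across files,
        # so moving just one of them would lose part of the dataset
        raise ValueError(f"expected exactly 1 part file, found {len(part_files)}")
    shutil.move(part_files[0], target_path)
    return target_path

# Simulate a Spark output directory containing a single (fake) part file
work_dir = tempfile.mkdtemp()
temp_dir = os.path.join(work_dir, "temp")
os.makedirs(temp_dir)
with open(os.path.join(temp_dir, "part-00000-demo.snappy.parquet"), "wb") as handle:
    handle.write(b"not a real parquet file")

moved = move_single_part_file(temp_dir, os.path.join(work_dir, "dataset_test.parquet"))
print("Moved to:", moved)
```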

In [None]:
# Save the train / validation set
output(train_valid_df, "train_valid")

In [None]:
# Save the test set
output(test_df, "test")

# Data visualization

Here we are going to display the features taken into consideration, grouped by category.

In [10]:
# # Datasets dirs
# DATASET_OUTPUT_DIR = MAIN_DIR + "/datasets/output"

# # Datasets names
# DATASET_NAME = "bitcoin_blockchain_data_15min"

# # Datasets paths
# TRAIN = DATASET_OUTPUT_DIR + "/" + DATASET_NAME + "_train_valid.parquet"
# TEST = DATASET_OUTPUT_DIR + "/" + DATASET_NAME + "_test.parquet"

# # Load datasets into pyspark dataset objects
# train = spark.read.load(TRAIN,
#         format="parquet",
#         sep=",",
#         inferSchema="true",
#         header="true"
# )

# test = spark.read.load(TEST,
#         format="parquet",
#         sep=",",
#         inferSchema="true",
#         header="true"
# )

# # Define a function for merging multiple DataFrames row-wise
# def unionAll(dfs):
#     return functools.reduce(lambda df1, df2: df1.unionAll(df2), dfs)

# # Merge the two DataFrames using the unionAll() method
# merged_df = unionAll([train, test])

# merged_df.show()


In [None]:
# Convert the PySpark dataset into Pandas
merged_df_pd = merged_df.toPandas()
merged_df_pd

In [None]:
# List of features according to categories
ohlcv_statistics = {'Opening price (USD)':'opening-price', 'Highest price (USD)':'highest-price', 'Lowest price (USD)':'lowest-price', 'Closing price (USD)':'closing-price', 'Trade volume (BTC)':'trade-volume-btc'}
currency_statistics = {'Market price (USD)':'market-price', 'Market cap (USD)':'market-cap', 'N. total bitcoins':'total-bitcoins', 'Trade volume (USD)':'trade-volume-usd'}
block_details = {'Blocks size (MB)':'blocks-size', 'Avg. block size (MB)':'avg-block-size', 'N. total transactions':'n-transactions-total', 'N. transactions per block':'n-transactions-per-block'}
mining_information = {'Hash rate (TH/s)':'hash-rate', 'Difficulty (T)':'difficulty', 'Miners revenue (USD)':'miners-revenue', 'Transaction fees (USD)':'transaction-fees-usd'}
network_activity = {"N. unique addresses":'n-unique-addresses', 'N. transactions':'n-transactions', 'Estimated transaction volume (USD)':'estimated-transaction-volume-usd'}
additional_features = {"Rate of change (%)":"rate-of-change", "Simple moving avg. (5d)":"sma-5-days", "Simple moving avg. (7d)":"sma-7-days", "Simple moving avg. (10d)":"sma-10-days", "Simple moving avg. (20d)":"sma-20-days", "Simple moving avg. (50d)":"sma-50-days", "Simple moving avg. (100d)":"sma-100-days"}

In [None]:
# OHLC Statistics
ohlc_statistics = list(ohlcv_statistics.items())[:4]
print(ohlc_statistics)

# Volume Statistics
volume_statistics = list(ohlcv_statistics.items())[4:]
print(volume_statistics)

In [None]:
# OHLCV Statistics
if SLOW_OPERATIONS:
  visualization.ohlcv_visualization(merged_df_pd, ohlc_statistics, "OHLC Statistics (usd)")
  visualization.features_visualization(merged_df_pd, volume_statistics[0][0], volume_statistics[0][1])

The OHLCV statistics chart is a type of bar chart that shows the open, high, low, close and volume values for each period. It is useful because it shows the five main values of a period, with the closing price being considered the most important by many traders. Due to a bug in a trading algorithm on Binance's U.S. exchange, there is an anomalous dump in the lowest price on 21 October 2021.

Source: https://www.bloomberg.com/news/articles/2021-10-21/bitcoin-appears-to-crash-87-on-binance-in-apparent-mistake#xj4y7vzkg
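To make the five OHLCV values concrete, here is a small sketch that collapses the trades of a single period into one bar. The `to_ohlcv` helper and the trades are made up for illustration; the dataset used in this notebook already comes aggregated.

```python
def to_ohlcv(trades):
    """Collapse (price, amount) trades of one period into a single OHLCV bar."""
    prices = [price for price, _ in trades]
    return {
        "open": prices[0],       # first traded price of the period
        "high": max(prices),     # highest traded price
        "low": min(prices),      # lowest traded price
        "close": prices[-1],     # last traded price of the period
        "volume": sum(amount for _, amount in trades),
    }

# Hypothetical trades within one 15-minute period, in time order:
# (price in USD, amount in BTC)
trades = [(100.0, 0.5), (104.0, 1.0), (97.0, 0.25), (101.0, 2.0)]
bar = to_ohlcv(trades)
print(bar)
```

A single erroneous trade at a very low price is enough to drag the `low` value of its bar down, which is how an isolated glitch can show up as a sharp dump in the lowest price.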

In [None]:
# Currency Statistics
if SLOW_OPERATIONS:
  for key, value in currency_statistics.items():
    visualization.features_visualization(merged_df_pd, key, value)

Concerning currency statistics, we can see that the price of Bitcoin rose in the period from late 2020 to mid-2022, while the number of bitcoins issued is slowly approaching its cap of 21 million; the last BTC is expected to be mined around 2140.
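The 21 million figure follows from the block reward schedule. The sketch below sums the rewards over all halving periods, assuming the standard protocol constants (an initial reward of 50 BTC per block, halved every 210,000 blocks, counted in whole satoshi); the function name is illustrative.

```python
SATOSHI_PER_BTC = 100_000_000
BLOCKS_PER_HALVING = 210_000
INITIAL_SUBSIDY = 50 * SATOSHI_PER_BTC  # block reward in satoshi at launch

def total_supply_satoshi():
    """Sum the block rewards over all halving eras until the reward reaches zero."""
    total, subsidy, eras = 0, INITIAL_SUBSIDY, 0
    while subsidy > 0:
        total += BLOCKS_PER_HALVING * subsidy
        subsidy //= 2  # integer halving, as the reward is counted in whole satoshi
        eras += 1
    return total, eras

supply, eras = total_supply_satoshi()
supply_btc = supply / SATOSHI_PER_BTC
print(f"{eras} eras, {supply_btc:.4f} BTC in total")
```

Because each era issues half as much as the previous one, the total converges just below 21 million BTC, which is why the issuance curve flattens over time.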

In [None]:
# Block Details
if SLOW_OPERATIONS:
  for key, value in block_details.items():
    visualization.features_visualization(merged_df_pd, key, value)

Concerning block details, we can see that over time the number of transactions has increased exponentially, along with the size of the blocks. The peak around the end of January 2023 is due to the launch of the Ordinals protocol, which allows the creation of 'digital artefacts' on the Bitcoin network (these can include JPEG images, PDFs, and audio and video files).

In [None]:
# Mining Information
if SLOW_OPERATIONS:
  for key, value in mining_information.items():
    visualization.features_visualization(merged_df_pd, key, value)

Regarding mining information, we can see that the network difficulty and the hash rate have also increased exponentially: the greater the hashing (computing) power in the network, the greater its security and resistance to attacks. The miners' revenue, instead, more or less follows the price trend of Bitcoin itself (also thanks to the transaction fees that are distributed to the miners). The two biggest spikes in transaction fees are due to:
* **20 - 21 April 2021:** a combination of ASIC shortages, huge BTC price increases outpacing difficulty and a sudden hash rate drop, resulting in slower block times, a backlog of transactions and extra fees per block (Source: https://www.coindesk.com/markets/2021/04/21/bitcoin-transactions-are-more-expensive-than-ever/)
* **8 May 2023:** the increase in demand for block space attributed to the rise of Ordinals (Source: https://www.coindesk.com/tech/2023/05/08/ordinals-upend-bitcoin-mining-pushing-transaction-fees-above-mining-reward-for-first-time-in-years/)
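The link between difficulty, hash rate and block times mentioned above can be sketched with the usual approximation that a block at difficulty `D` takes about `D * 2**32` hashes to find, with a target of one block every 600 seconds. The difficulty value and the function names below are hypothetical.

```python
HASHES_PER_DIFFICULTY = 2 ** 32  # approx. expected hashes per block at difficulty 1
TARGET_BLOCK_TIME = 600          # seconds (10 minutes)

def expected_block_time(difficulty, hash_rate):
    """Expected seconds per block for a given difficulty and network hash rate (H/s)."""
    return difficulty * HASHES_PER_DIFFICULTY / hash_rate

def equilibrium_hash_rate(difficulty):
    """Hash rate (H/s) at which blocks arrive every TARGET_BLOCK_TIME seconds."""
    return difficulty * HASHES_PER_DIFFICULTY / TARGET_BLOCK_TIME

difficulty = 20e12  # hypothetical difficulty value
steady_rate = equilibrium_hash_rate(difficulty)
normal_time = expected_block_time(difficulty, steady_rate)
slow_time = expected_block_time(difficulty, steady_rate / 2)  # half of the hash rate disappears

print(normal_time, slow_time)
```

Since the difficulty is only adjusted periodically, a sudden drop in hash rate leaves blocks arriving more slowly for a while: fewer blocks means less block space, so transactions queue up and users bid higher fees.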

In [None]:
# Network Activity
if SLOW_OPERATIONS:
  for key, value in network_activity.items():
    visualization.features_visualization(merged_df_pd, key, value)

Regarding network activity, we can see that it also increases over time, a sign that the Bitcoin protocol is becoming more and more popular and that people are willing to pay to use it.

In [None]:
# Additional Features: Rate of change
if SLOW_OPERATIONS:
  first_pair = next(iter(additional_features.items()))
  visualization.features_visualization(merged_df_pd, first_pair[0], first_pair[1])

In [None]:
# Extract the short term SMA
short_term_sma = list(additional_features.items())[1:4]
print(short_term_sma)

# Extract the long term SMA
long_term_sma = list(additional_features.items())[-3:]
print(long_term_sma)

In [None]:
# Additional Features: Short term SMA
if SLOW_OPERATIONS:
  visualization.sma_visualization(merged_df_pd, short_term_sma, "Short term SMA (usd)")

In [None]:
# Additional Features: Long term SMA
if SLOW_OPERATIONS:
  visualization.sma_visualization(merged_df_pd,long_term_sma, "Long term SMA (usd)")

Taking into consideration the Rate of Change (which shows the price variations over a short period of time) and the Simple Moving Averages (which instead give a medium to long term view of the price), we can see that the main price variations show up in the latter. This suggests that, although Bitcoin is highly volatile, large price moves usually unfold over days or even months, except in some cases where unpredictable pumps or dumps occur due to sudden news.

# Feature selection
Here we are going to select which blockchain features, together with the features added at the beginning, can be added to the main ones (i.e. the currency features), based on their Pearson correlation with the price to predict (`next-market-price`).
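For reference, the Pearson coefficient between two columns can be computed by hand as the covariance divided by the product of the standard deviations. The snippet below is a plain-Python sketch on made-up columns (the column names and values are illustrative only); the notebook itself relies on PySpark's `Correlation.corr`.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical columns
price = [10.0, 12.0, 14.0, 16.0]
hash_rate = [1.0, 2.0, 3.0, 4.0]  # moves exactly with the price
fees = [8.0, 6.0, 4.0, 2.0]       # moves exactly against the price

r_up = pearson(price, hash_rate)
r_down = pearson(price, fees)
print(r_up, r_down)
```

The coefficient ranges from -1 (perfect inverse linear relation) to 1 (perfect direct linear relation), with values near 0 indicating no linear relation.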

In [None]:
# Prepare the dataset for feature selection
new_columns = ["next-market-price"] + [col for col in merged_df.columns if col not in ['timestamp', 'id', 'next-market-price', 'opening-price', 'highest-price', 'lowest-price', 'closing-price', 'trade-volume-btc', 'market-price', 'market-cap', 'total-bitcoins', 'trade-volume-usd']]
merged_df_only_blockchain_data = merged_df.select(*new_columns)
merged_df_only_blockchain_data.show()

In [None]:
# Assemble the data to apply PySpark methods
assembler = VectorAssembler(inputCols=merged_df_only_blockchain_data.columns, outputCol='features')
assembled_data = assembler.transform(merged_df_only_blockchain_data)

In [None]:
# Compute the correlation matrix
correlation_matrix = Correlation.corr(assembled_data, 'features').head()

# Get the highest correlated features
correlation_scores = correlation_matrix[0].toArray()
feature_names = merged_df_only_blockchain_data.columns
feature_correlations = sorted([(feature_names[i], str(correlation_scores[i][0])) for i in range(len(feature_names))], key=lambda x: float(x[1]), reverse=True) # Sort by numeric value (sorting the strings would be lexicographic)

# Print the results
for label, value in feature_correlations:
    print(f"Feature: {label}, Correlation: {value}")

Finally, I decided to divide the features into two distinct groups:
- **Currency features:** contains the currency statistics and the OHLCV statistics
- **Currency and blockchain features:** contains the currency features plus the blockchain features, divided based on their correlation value:
    - If >= 0.5, then they will be considered the **most correlated**
    - If < 0.5, then they will be considered the **least correlated**

The strategy for the next notebooks will be as follows:
- Test models with currency features
- See whether adding the most and least correlated blockchain features to them improves the results
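The grouping rule can be illustrated on a toy set of scores. The feature names and correlation values below are invented for the example and are not the notebook's results.

```python
THRESHOLD = 0.5

# Hypothetical correlation scores with the target (not the notebook's results)
scores = {"feature-a": 0.91, "feature-b": 0.07, "feature-c": -0.64, "feature-d": 0.5}

# Sort numerically, from the most to the least correlated
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Split the ranked features around the threshold
most_correlated = [name for name, value in ranked if value >= THRESHOLD]
least_correlated = [name for name, value in ranked if value < THRESHOLD]

print(most_correlated)
print(least_correlated)
```

Note that, with a threshold on the signed value, a strongly negative score such as -0.64 falls into the least correlated group; comparing `abs(value)` instead would be the alternative if inverse relations were to count as strong.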

In [None]:
currency_features = list(ohlcv_statistics.values()) + list(currency_statistics.values())
most_corr_features = [x[0] for x in feature_correlations[1:] if float(x[1]) >= 0.5]
currency_most_corr_features = currency_features + most_corr_features
discarded_features = [x[0] for x in feature_correlations[1:] if float(x[1]) < 0.5]
currency_least_corr_features = currency_features + discarded_features

In [None]:
currency_features

In [None]:
most_corr_features

In [None]:
currency_most_corr_features

In [None]:
discarded_features

In [None]:
currency_least_corr_features

# Saving selected features

In [None]:
# Save all the features and their correlation value
with open(FEATURES_CORRELATION, 'w') as file:
    json.dump(feature_correlations, file)

In [None]:
# Save currency and ohlcv features
with open(CURRENCY_FEATURES, 'w') as file:
    json.dump(currency_features, file)

In [None]:
# Save currency, ohlcv and blockchain most correlated features
with open(CURRENCY_MOST_CORR_FEATURES, 'w') as file:
    json.dump(currency_most_corr_features, file)

In [None]:
# Save currency, ohlcv and blockchain least correlated features
with open(CURRENCY_LEAST_CORR_FEATURES, 'w') as file:
    json.dump(currency_least_corr_features, file)
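
As a quick sanity check, a saved feature list can be read back with `json.load`. The sketch below uses a temporary file and an example list rather than the project paths, so it is only an illustration of the round trip.

```python
import json
import os
import tempfile

features = ["market-price", "opening-price", "closing-price"]  # example list

# Write the list to a temporary file...
path = os.path.join(tempfile.mkdtemp(), "currency_features.json")
with open(path, "w") as file:
    json.dump(features, file)

# ...and read it back as a regular Python list
with open(path) as file:
    loaded = json.load(file)

print(loaded)
```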

In [None]:
# Export notebook in html format (remember to save the notebook and change the model name)
if LOCAL_RUNNING:
    !jupyter nbconvert --to html 2-feature-engineering.ipynb --output 2-feature-engineering --output-dir='./exports'