<a href="https://colab.research.google.com/github/CorsiDanilo/big-data-computing-project/blob/main/3_Random_Forest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bitcoin price forecasting with PySpark
## Big Data Computing final project - A.Y. 2022 - 2023
Prof. Gabriele Tolomei

MSc in Computer Science

La Sapienza, University of Rome

### Author
Corsi Danilo - corsi.1742375@studenti.uniroma1.it



### Introduction

The cryptocurrency Bitcoin has attracted the attention of many people in recent years. However, it's
price fluctuation can be extremely unpredictable, which makes it difficult to predict when the right
time to buy or sell this digital currency will be. In this context, forecasting Bitcoin prices can be a
competitive advantage for investors and traders, as it could allow them to make informed decisions
on the right time to enter or exit the market. In this project, I will analyze some machine learning
techniques to understand, through the processing of historical data, how accurately the price of Bitcoin
can be predicted and whether this can provide added value to cryptocurrency investors and traders.
### Dataset
I chose to use the following dataset from Kaggle Bitcoin Historical Dataset, more specifically those
containing minute-by-minute updates of the Bitcoin price from 2017 to 2021 (period for which there
were moments of high volatility but also a lot of price lateralisation). The columns (features) contained
in it, in addition to the timestamp of each transaction, are the opening, closing, highest and lowest
price and the corresponding trading volume in Bitcoin and Dollars.
### Methods (TODO: da scegliere per bene)
The methods I will test will be Linear Regression (simple and multiple) and Random Forest. Further
comparisons with other classification models are planned in the course of development. Moreover, I
would also like to try to understand what the differences are between these methods and the imple-
mentation of a state-of-the-art neural network such as Long-Short Term Memory.
### Evaluation framework (TODO: vedi quali usare in base ai paper/esempi e ai modelli utilizzati)
As evaluation framework I will use R-square (R²), Mean Square Error (MSE) and Mean Absolute
Error (MAE) to get a complete picture of the performance of the various models.

# Global Constants


In [1]:
JAVA_HOME = "/usr/lib/jvm/java-8-openjdk-amd64"
GDRIVE_DIR = "/content/drive"

GDRIVE_DATASET_RAW_DIR = GDRIVE_DIR + "/MyDrive/BDC/project/datasets/raw"
GDRIVE_DATASET_TEMP_DIR = GDRIVE_DIR + "/MyDrive/BDC/project/datasets/temp"
GDRIVE_DATASET_OUTPUT_DIR = GDRIVE_DIR + "/MyDrive/BDC/project/datasets/output"

GDRIVE_DATASET_NAME = "bitcoin_blockchain_data_1m"
GDRIVE_DATASET_NAME_TRAIN = GDRIVE_DATASET_NAME + "_all_train"
GDRIVE_DATASET_NAME_TEST = GDRIVE_DATASET_NAME + "_all_test"

GDRIVE_DATASET_NAME_EXT_TRAIN  = "/" + GDRIVE_DATASET_NAME_TRAIN + ".parquet"
GDRIVE_DATASET_NAME_EXT_TEST = "/" + GDRIVE_DATASET_NAME_TEST + ".parquet"

GDRIVE_DATASET_TRAIN = GDRIVE_DATASET_OUTPUT_DIR + GDRIVE_DATASET_NAME_EXT_TRAIN
GDRIVE_DATASET_TEST = GDRIVE_DATASET_OUTPUT_DIR + GDRIVE_DATASET_NAME_EXT_TEST

SLOW_OPERATION = False

#  Import useful Python packages

In [2]:
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from itertools import cycle

import plotly.express as px

import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
import gc

# **Spark + Google Colab Setup**

## Install PySpark and related dependencies





In [3]:
!pip install pyspark
# Alternatively, if you want to install a specific version of pyspark:
#!pip install pyspark==3.2.1
!pip install -U -q PyDrive # To use files that are stored in Google Drive directly (e.g., without downloading them from an external URL)
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = JAVA_HOME

import pyspark
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf

from pyspark.sql import functions as F

Collecting pyspark
  Downloading pyspark-3.4.1.tar.gz (310.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m310.8/310.8 MB[0m [31m1.3 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... [?25l[?25hdone
  Created wheel for pyspark: filename=pyspark-3.4.1-py2.py3-none-any.whl size=311285398 sha256=8bbd5808b8e4490bbbeb7761b6a866fb1bd0ca7711fa159a4b4f10ad903b8bf8
  Stored in directory: /root/.cache/pip/wheels/0d/77/a3/ff2f74cc9ab41f8f594dabf0579c2a7c6de920d584206e0834
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.1
The following additional packages will be installed:
  libxtst6 openjdk-8-jre-headless
Suggested packages:
  openjdk-8-demo openjdk-8-source libnss-mdns fonts-dejavu-extra
  fonts-ipafont-gothic fonts-ipafont-mincho fonts-wqy-microhei
  fonts-wqy-zenhei fonts-indic
The follow

##  Create Spark context

In [4]:
# Create the session
conf = SparkConf().\
                set('spark.ui.port', "4050").\
                set('spark.executor.memory', '4G').\
                set('spark.driver.memory', '45G').\
                set('spark.driver.maxResultSize', '10G').\
                set("spark.kryoserializer.buffer.max", "1G").\
                setAppName("BitcoinPriceForecasting").\
                setMaster("local[*]")

# Create the context
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()

##  Link Colab to our Google Drive

In [5]:
# Point Colaboratory to our Google Drive

from google.colab import drive

drive.mount(GDRIVE_DIR, force_remount=True)

Mounted at /content/drive


# Random Forest
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.mllib.tree.RandomForest.html

In [6]:
# load dataset into pyspark dataframe objects
train_df = spark.read.load(GDRIVE_DATASET_TRAIN,
                         format="parquet",
                         sep=",",
                         inferSchema="true",
                         header="true"
                    )

test_df = spark.read.load(GDRIVE_DATASET_TEST,
                         format="parquet",
                         sep=",",
                         inferSchema="true",
                         header="true"
                    )

## Prediction ❗

In [7]:
def random_forest(train, test):
  from pyspark.ml.regression import RandomForestRegressor
  from pyspark.ml.feature import StringIndexer, VectorAssembler

  # Crea un oggetto RandomForestRegressor: Puoi impostare i parametri desiderati per il modello Random Forest come numero di alberi (numTrees), profondità massima degli alberi (maxDepth), numero massimo di bin per il partizionamento delle features (maxBins), ecc.
  rf = RandomForestRegressor(
      featuresCol="features",  # Colonna vettoriale delle features
      labelCol="market-price",  # Colonna delle etichette di output
  )

  # Addestra il modello: Utilizza il metodo fit() per addestrare il modello sulla tua dataset di addestramento.
  model = rf.fit(train)

  # Effettua le previsioni: Utilizza il modello addestrato per fare previsioni sul tuo dataset di test o su nuovi dati.
  predictions = model.transform(test)

  return predictions, model

In [8]:
predictions, model = random_forest(train_df, test_df)

In [54]:
print("The shape of the train dataset is {:d} rows by {:d} columns".format(predictions.count(), len(predictions.columns)))
predictions.show(3)
predictions.printSchema()

The shape of the train dataset is 1156608 rows by 5 columns
+-------------------+-------+--------------------+------------------+------------------+
|          timestamp|  index|            features|      market-price|        prediction|
+-------------------+-------+--------------------+------------------+------------------+
|2020-10-17 19:13:00|4626433|[2.11353216020423...|11358.749041666666|10876.359951959008|
|2020-10-17 19:14:00|4626434|[2.11354936360846...|11358.776083333334|10876.359951959008|
|2020-10-17 19:15:00|4626435|[2.11356656701269...|      11358.803125|10876.359951959008|
+-------------------+-------+--------------------+------------------+------------------+
only showing top 3 rows

root
 |-- timestamp: timestamp_ntz (nullable = true)
 |-- index: integer (nullable = true)
 |-- features: vector (nullable = true)
 |-- market-price: double (nullable = true)
 |-- prediction: double (nullable = false)



In [33]:
def output(dataset, type):
  from pyspark.sql.functions import date_format, to_timestamp, col

  dataset.write.parquet(GDRIVE_DATASET_TEMP_DIR, mode='overwrite')

  import os
  import glob
  import time

  while True:
      parquet_files = glob.glob(os.path.join(GDRIVE_DATASET_TEMP_DIR, "part*.parquet"))
      if len(parquet_files) > 0:
          # .parquet file found!
          file_path = parquet_files[0]
          break
      else:
          print(".parquet file not found. I'll try again after 1 second...")
          time.sleep(1)

  print(".parquet file found:", file_path)

  new_file_path = GDRIVE_DATASET_OUTPUT_DIR + "/" + GDRIVE_DATASET_NAME + "_" + type + ".parquet"

  import shutil

  # rename and move the file
  shutil.move(file_path, new_file_path)

  print("File renamed and moved successfully!")

In [34]:
output(predictions, "all_predictions")

.parquet file found: /content/drive/MyDrive/BDC/project/datasets/temp/part-00000-1a0111a9-cc95-4d41-ac32-8c186525ce84-c000.snappy.parquet
File renamed and moved successfully!


In [55]:
# renaming 'predictions' into 'market-price'
predictions_new = predictions.drop('market-price') \
                  .withColumnRenamed("prediction", "market-price")
predictions_new.show(3)

+-------------------+-------+--------------------+------------------+
|          timestamp|  index|            features|      market-price|
+-------------------+-------+--------------------+------------------+
|2020-10-17 19:13:00|4626433|[2.11353216020423...|10876.359951959008|
|2020-10-17 19:14:00|4626434|[2.11354936360846...|10876.359951959008|
|2020-10-17 19:15:00|4626435|[2.11356656701269...|10876.359951959008|
+-------------------+-------+--------------------+------------------+
only showing top 3 rows



In [68]:
def compute_avg_df(dataset):
  dataset = dataset.withColumn("timestamp", dataset["timestamp"].cast("timestamp"))

  dataset = dataset.withColumn("day", to_date("timestamp", "yyyy-MM-dd"))

  dataset = dataset.groupBy("day").agg(
      {"market-price": "avg"}
  )

  return dataset

In [69]:
avg_train_df = compute_avg_df(train_df)
avg_test_df = compute_avg_df(test_df)
avg_pred_df = compute_avg_df(predictions)
avg_pred_df.show(3)

+----------+------------------+
|       day| avg(market-price)|
+----------+------------------+
|2021-01-27| 31481.22202430554|
|2021-06-22|32066.805750000007|
|2021-08-27| 48009.10289583333|
+----------+------------------+
only showing top 3 rows



In [70]:
def show_results(train, test, pred):
  trace1 = go.Scatter(
      x = train['day'],
      y = train[' avg(market-price)'].astype(float),
      mode = 'lines',
      name = 'Train'
  )

  trace2 = go.Scatter(
      x = test['day'],
      y = test[' avg(market-price)'].astype(float),
      mode = 'lines',
      name = 'Test'
  )

  trace3 = go.Scatter(
      x = pred['day'],
      y = pred[' avg(market-price)'].astype(float),
      mode = 'lines',
      name = 'Prediction'
  )

  layout = dict(
      title='Train and Test set with the Slider ',
      xaxis=dict(
          rangeselector=dict(
              buttons=list([
                  #change the count to desired amount of months.
                  dict(count=1,
                      label='1m',
                      step='month',
                      stepmode='backward'),
                  dict(count=6,
                      label='6m',
                      step='month',
                      stepmode='backward'),
                  dict(count=12,
                      label='1y',
                      step='month',
                      stepmode='backward'),
                  dict(count=36,
                      label='3y',
                      step='month',
                      stepmode='backward'),
                  dict(step='all')
              ])
          ),
          rangeslider=dict(
              visible = True
          ),
          type='date'
      )
  )

  data = [trace1, trace2, trace3]
  fig = dict(data=data, layout=layout)
  iplot(fig, filename = "Train and Test set  with Rangeslider")

In [71]:
show_results(avg_train_df.toPandas(), avg_test_df.toPandas(), avg_pred_df.toPandas())

KeyError: ignored

## Evaluation ❗

In [17]:
def evaluation(pred):
  from pyspark.ml.evaluation import RegressionEvaluator

  evaluator = RegressionEvaluator(
      predictionCol="prediction",  # Colonna delle previsioni
      labelCol="market-price",  # Colonna delle etichette di output
  )

  mse = evaluator.evaluate(pred, {evaluator.metricName: "mse"})
  rmse = evaluator.evaluate(pred, {evaluator.metricName: "rmse"})
  r2 = evaluator.evaluate(pred, {evaluator.metricName: "r2"})
  mae = evaluator.evaluate(pred, {evaluator.metricName: "mae"})

  from pyspark.sql.functions import abs, col
  from pyspark.sql import functions as F
  from pyspark.ml.evaluation import RegressionEvaluator

  # Calcola il MAPE
  mape = pred.withColumn("error", abs(col("market-price") - col("prediction")) / col("market-price")) \
          .selectExpr("avg(error) * 100 as mape") \
          .collect()[0]["mape"]

  # adj_r2
  n = pred.count()  # Numero di osservazioni
  p = 1  # Numero di predittori nel modello
  adj_r2 = 1 - (1 - r2) * ((n - 1) / (n - p - 1))

  print("MSE = %s" % (mse)) # deve essere un valore non negativo, dove un valore di 0 indica una perfetta corrispondenza tra i valori predetti e quelli di riferimento
  print("RMSE = %s" % (rmse)) # dovresti considerare il valore di RMSE in relazione al range dei valori target nel tuo problema specifico
  print("R2 = %s" % (r2)) # piú é vicino ad 1 meglio é
  print("MAE = %s" % (mae)) # può essere utile confrontare il valore di MAE con quello di altri modelli o con il range dei valori target per valutare la sua precisione
  print("MAPE = %s" % (mape)) # di solito viene utilizzato come misura relativa per confrontare la precisione di modelli diversi
  print("ADJ R2 = %s" % (adj_r2)) # piú é vicino ad 1 meglio é

In [18]:
evaluation(predictions)

MSE = 768656591.017918
RMSE = 27724.656733996148
R2 = -2.6953923761047243
MAE = 24164.90514667972
MAPE = 61.3985200155669
ADJ R2 = -2.6953955711360282


In [None]:
#Training Summaries for the 3 Models
trainingSummary_reg = lr_model_reg.summary
trainingSummary_rel = lr_model_rel.summary
trainingSummary_sel = lr_model_sel.summary

#Performance Metrics for model with all-features
print("RMSE w/ All Features: %f" % trainingSummary_reg.rootMeanSquaredError)
print("r2(All Features): %f" % trainingSummary_reg.r2)
print('')

#Performance Metrics for model with relevant features
print("RMSE w/ Relevant Features: %f" % trainingSummary_rel.rootMeanSquaredError)
print("r2 (Relevant Features): %f" % trainingSummary_rel.r2)
print('')

#Performance Metrics for model with RFE-selected features
print("RMSE w/ RFE Features: %f" % trainingSummary_sel.rootMeanSquaredError)
print("r2 (RFE Features): %f" % trainingSummary_sel.r2)

In [None]:

#all features
pred_reg_df = pd.DataFrame(btc_data_df.toPandas())
pred_reg_df['timestamp'] = pd.to_datetime(pred_reg_df['timestamp'])
pred_reg_df = pred_reg_df.set_index('timestamp')
pred_reg_train_df = pred_reg_df[:valid_index]
pred_reg_test_df = pred_reg_df[valid_index:]
pred_reg_test_df = pred_reg_test_df.assign(Predictions = btc_preds_reg.toPandas()['prediction'].tolist())

#relevant features
pred_rel_df = pd.DataFrame(btc_data_df.toPandas()[[dep_var, 'timestamp'] + rel_columns ])
pred_rel_df['timestamp'] = pd.to_datetime(pred_rel_df['timestamp'])
pred_rel_df = pred_rel_df.set_index('timestamp')
pred_rel_train_df = pred_rel_df[:valid_index]
pred_rel_test_df = pred_rel_df[valid_index:]
pred_rel_test_df = pred_rel_test_df.assign(Predictions = btc_preds_rel.toPandas()['prediction'].tolist())

#RFE-selected features
pred_sel_df = pd.DataFrame(btc_data_df.toPandas()[[dep_var, 'timestamp'] + selected_features_rfe])
pred_sel_df['timestamp'] = pd.to_datetime(pred_sel_df['timestamp'])
pred_sel_df = pred_sel_df.set_index('timestamp')
pred_sel_train_df = pred_sel_df[:valid_index]
pred_sel_test_df = pred_sel_df[valid_index:]
pred_sel_test_df = pred_sel_test_df.assign(Predictions = btc_preds_sel.toPandas()['prediction'].tolist())



In [None]:
fig, ax = plt.subplots(1, 3, figsize=(18, 3))

ax[0].set_title('Forecast: All Features')
ax[0].set_ylabel(dep_var)
ax[0].set_xlabel('Year')
ax[0].plot(pred_reg_train_df[dep_var])
ax[0].plot(pred_reg_test_df[[dep_var, 'Predictions']])


ax[1].set_title('Forecast: Relevant Features')
ax[1].set_ylabel(dep_var)
ax[1].set_xlabel('Year')
ax[1].plot(pred_rel_train_df[dep_var])
ax[1].plot(pred_rel_test_df[[dep_var, 'Predictions']])


ax[2].set_title('Forecast: RFE Features')
ax[2].set_ylabel(dep_var)
ax[2].set_xlabel('Year')
ax[2].plot(pred_sel_train_df[dep_var])
ax[2].plot(pred_sel_test_df[[dep_var, 'Predictions']])
