<a href="https://colab.research.google.com/github/CorsiDanilo/big-data-computing-project/blob/main/BDC_Project_Bitcoin_price_forecasting_(Model_preparation).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bitcoin price forecasting with PySpark
## Big Data Computing final project - A.Y. 2022 - 2023
Prof. Gabriele Tolomei

MSc in Computer Science

La Sapienza, University of Rome

### Author
Corsi Danilo - corsi.1742375@studenti.uniroma1.it



### Introduction

The cryptocurrency Bitcoin has attracted the attention of many people in recent years. However, its
price fluctuations can be extremely unpredictable, which makes it difficult to identify the right
time to buy or sell this digital currency. In this context, forecasting Bitcoin prices can be a
competitive advantage for investors and traders, as it could allow them to make informed decisions
about when to enter or exit the market. In this project, I will analyze some machine learning
techniques to understand, through the processing of historical data, how accurately the price of Bitcoin
can be predicted and whether this can provide added value to cryptocurrency investors and traders.
### Dataset
I chose the Bitcoin Historical Dataset from Kaggle, and more specifically the files
containing minute-by-minute updates of the Bitcoin price from 2017 to 2021 (a period with
moments of high volatility but also long stretches of sideways price movement). In addition to
the timestamp of each transaction, the columns (features) are the opening, closing, highest and lowest
price and the corresponding trading volume in Bitcoin and dollars.
### Methods (TODO: choose carefully)
The methods I will test are Linear Regression (simple and multiple) and Random Forest. Further
comparisons with other models are planned in the course of development. Moreover, I
would also like to understand how these methods differ from the implementation
of a state-of-the-art neural network such as Long Short-Term Memory (LSTM).
### Evaluation framework (TODO: decide which metrics to use based on the papers/examples and the models used)
As an evaluation framework I will use R-squared (R²), Mean Squared Error (MSE) and Mean Absolute
Error (MAE) to get a complete picture of the performance of the various models.
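
As a minimal sketch of how the three metrics are defined, the plain-Python function below computes MSE, MAE and R² from paired true and predicted values. The function name `regression_metrics` and the toy numbers are assumptions made for illustration only; they are not part of the project code. In PySpark the same quantities are available through `pyspark.ml.evaluation.RegressionEvaluator` with `metricName` set to `"mse"`, `"mae"` or `"r2"`.

```python
from typing import Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> dict:
    """Return MSE, MAE and R² for paired true/predicted values (hypothetical helper)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n          # mean of squared errors
    mae = sum(abs(e) for e in errors) / n         # mean of absolute errors
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    # R² is undefined when the true values are constant (ss_tot == 0)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"mse": mse, "mae": mae, "r2": r2}


# Toy values (made up for illustration, not taken from the dataset)
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0])
print(metrics)
```

MSE penalises large errors more heavily than MAE, while R² relates the residual error to the variance of the target, which is why the three are usually read together.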

# **Spark + Google Colab Setup**

## Global Constants


In [1]:
JAVA_HOME = "/usr/lib/jvm/java-8-openjdk-amd64"
GDRIVE_DIR = "/content/drive"
GDRIVE_DATASET_DIR = GDRIVE_DIR + "/MyDrive/Computer_Science/BDC/project/datasets"

GDRIVE_DATASET_NAME = "BTC-2017_2021_cleaned"
# GDRIVE_DATASET_NAME = "BTC-2017_2021_1000000_cleaned"
# GDRIVE_DATASET_NAME = "BTC-2017_2021_500000_cleaned"
# GDRIVE_DATASET_NAME = "BTC-2015_2023_cleaned"
# GDRIVE_DATASET_NAME = "BTC-Hourly_cleaned"

GDRIVE_DATASET_NAME_EXT = "/" + GDRIVE_DATASET_NAME + ".csv"

GDRIVE_DATASET = GDRIVE_DATASET_DIR + GDRIVE_DATASET_NAME_EXT

SLOW_OPERATION = False

## Install PySpark and related dependencies





In [2]:
!pip install pyspark
# Alternatively, if you want to install a specific version of pyspark:
#!pip install pyspark==3.2.1
!pip install -U -q PyDrive # To use files that are stored in Google Drive directly (e.g., without downloading them from an external URL)
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = JAVA_HOME


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.4.0.tar.gz (310.8 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.0-py2.py3-none-any.whl size=311317130 sha256=48224ec99e2de043036e6b68f328434305af4adb2c15978c3f9765611afdeefe
  Stored in directory: /root/.cache/pip/wheels/7b/1b/4b/3363a1d04368e7ff0d408e57ff57966fcdf00583774e761327
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.0
The following additional packages will be installed:
  libxtst6 openjdk-8-jre-headless
Suggested packages:
  openjdk-8-demo openjdk-8-source libnss-mdns fonts-dejavu-extra
  fon

##  Import useful Python packages

In [3]:
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import pyspark
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf

from itertools import cycle

import plotly.express as px

from pyspark.sql import functions as F

import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
import gc

##  Create Spark context

In [4]:
# Create the session
conf = SparkConf().\
                set('spark.ui.port', "4050").\
                set('spark.executor.memory', '4G').\
                set('spark.driver.memory', '45G').\
                set('spark.driver.maxResultSize', '10G').\
                set("spark.kryoserializer.buffer.max", "1G").\
                setAppName("BitcoinPriceForecasting").\
                setMaster("local[*]")

# Create the context
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()

##  Create <code>ngrok</code> tunnel to check the Spark UI

In [5]:
# Install ngrok
!pip install pyngrok

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyngrok
  Downloading pyngrok-6.0.0.tar.gz (681 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyngrok
  Building wheel for pyngrok (setup.py) ... done
  Created wheel for pyngrok: filename=pyngrok-6.0.0-py3-none-any.whl size=19867 sha256=ebfe92e3352f3be713935f70cb552f8ec889f79881ff2398e1de46fde008a739
  Stored in directory: /root/.cache/pip/wheels/5c/42/78/0c3d438d7f5730451a25f7ac6cbf4391759d22a67576ed7c2c
Successfully built pyngrok
Installing collected packages: pyngrok
Successfully installed pyngrok-6.0.0


In [6]:
# Save the ngrok authtoken (replace the placeholder with your own token; never publish a real one)
!ngrok authtoken YOUR_NGROK_AUTHTOKEN

Downloading ngrok ...

In [7]:
from pyngrok import ngrok

# Open an ngrok tunnel on port 4050, where the Spark UI is running
port = '4050'
public_url = ngrok.connect(port).public_url



In [8]:
print("To access the Spark Web UI console, please click on the following link to the ngrok tunnel \"{}\" -> \"http://127.0.0.1:{}\"".format(public_url, port))

To access the Spark Web UI console, please click on the following link to the ngrok tunnel "https://cd71-35-188-242-59.ngrok-free.app" -> "http://127.0.0.1:4050"


##  Link Colab to our Google Drive

In [9]:
# Point Colaboratory to our Google Drive

from google.colab import drive

drive.mount(GDRIVE_DIR, force_remount=True)

Mounted at /content/drive


##  Check everything is ok

In [10]:
spark

In [11]:
sc._conf.getAll()

[('spark.driver.memory', '45G'),
 ('spark.kryoserializer.buffer.max', '1G'),
 ('spark.app.id', 'local-1686150487239'),
 ('spark.executor.id', 'driver'),
 ('spark.sql.warehouse.dir', 'file:/content/spark-warehouse'),
 ('spark.driver.maxResultSize', '10G'),
 ('spark.app.name', 'BitcoinPriceForecasting'),
 ('spark.driver.extraJavaOptions',
  '-Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-op

# **Model preparation**

Prepare the data: make sure the dataset is in a format suitable for model training. There should be a column of output labels (the response variable), with the features (the independent variables) in separate columns.

Create a VectorAssembler: a VectorAssembler combines the features into a single vector column. This step is necessary because PySpark requires the features to be in a single vector in order to train the Random Forest model.
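
To make the idea of a single vector column concrete, here is a plain-Python stand-in for what the assembler does to each row. It is only a conceptual sketch and not the PySpark API: the helper `assemble` and the dictionary-based rows are assumptions made for illustration, and the two closing prices are taken from the sample rows shown in this notebook.

```python
from typing import Dict, List


def assemble(rows: List[Dict[str, float]], input_cols: List[str], output_col: str) -> List[dict]:
    """Plain-Python emulation of assembling feature columns into one vector per row."""
    assembled = []
    for row in rows:
        new_row = dict(row)  # copy, so the original columns stay untouched
        # Collect the selected columns, in order, into a single list ("vector")
        new_row[output_col] = [float(row[c]) for c in input_cols]
        assembled.append(new_row)
    return assembled


rows = [{"close": 13880.0}, {"close": 13953.77}]
result = assemble(rows, input_cols=["close"], output_col="features")
print(result[0])
```

With a single input column, as in this project, every vector has length one; adding more input columns would simply lengthen each vector while keeping one vector per row.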

In [12]:
# load dataset into pyspark dataframe objects
df = spark.read.load(GDRIVE_DATASET, 
                         format="csv", 
                         sep=",", 
                         inferSchema="true", 
                         header="true"
                    )

In [13]:
dataframe = df
dataframe.printSchema()
dataframe.show(5)

root
 |-- date: timestamp (nullable = true)
 |-- close: double (nullable = true)

+-------------------+--------+
|               date|   close|
+-------------------+--------+
|2017-12-31 23:59:00| 13880.0|
|2017-12-31 23:58:00|13953.77|
|2017-12-31 23:57:00|13913.26|
|2017-12-31 23:56:00|13859.58|
|2017-12-31 23:55:00|13825.05|
+-------------------+--------+
only showing top 5 rows



In [14]:
def model_preparation(dataframe):
  from pyspark.ml.feature import StringIndexer, VectorAssembler
  from pyspark.sql import Window
  from pyspark.sql.functions import date_format, percent_rank, to_timestamp

  assembler = VectorAssembler(
      inputCols=["close"],  # Closing price column
      outputCol="features"  # Resulting vector column
  )

  dataframe = assembler.transform(dataframe)

  # Convert the date column to a string
  dataframe = dataframe.withColumn("date_str", date_format(to_timestamp("date", "yyyy-MM-dd HH:mm:ss"), "yyyy-MM-dd HH:mm:ss"))

  # Encode the date as a column of label indices
  label_stringIdx = StringIndexer(inputCol='date_str', outputCol='labelIndex')
  dataframe = label_stringIdx.fit(dataframe).transform(dataframe)

  # Split the dataset chronologically into train and test sets:
  # rows with a percent rank up to 0.8 (roughly the earliest 80%) go to the train set.
  # Note: an empty partitionBy() ranks all rows within a single partition.
  dataframe = dataframe.withColumn("rank", percent_rank().over(Window.partitionBy().orderBy("date_str")))
  train_df = dataframe.where("rank <= .8").drop("rank", "date_str")
  test_df = dataframe.where("rank > .8").drop("rank", "date_str")

  if SLOW_OPERATION:
    print("The shape of the train set is {:d} rows by {:d} columns".format(train_df.count(), len(train_df.columns)))
    train_df.printSchema()
    train_df.show(5)

    print("The shape of the test set is {:d} rows by {:d} columns".format(test_df.count(), len(test_df.columns)))
    test_df.printSchema()
    test_df.show(5)

  return train_df, test_df
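
The chronological split in `model_preparation` relies on `percent_rank`, which for a row of rank r among n ordered rows evaluates to (r - 1) / (n - 1). The plain-Python sketch below emulates that rule for unique timestamps to show which rows end up on each side of the 0.8 threshold. The helper `chronological_split` and the ten sample minutes are assumptions made for illustration; they are not part of the notebook.

```python
from typing import List, Tuple


def chronological_split(timestamps: List[str], train_fraction: float = 0.8) -> Tuple[List[str], List[str]]:
    """Emulate a percent_rank-based split for unique, sortable timestamps."""
    ordered = sorted(timestamps)  # "yyyy-MM-dd HH:mm:ss" strings sort chronologically
    n = len(ordered)
    train, test = [], []
    for i, ts in enumerate(ordered):
        # percent_rank = (rank - 1) / (n - 1); with rank = i + 1 this is i / (n - 1)
        pct = i / (n - 1) if n > 1 else 0.0
        (train if pct <= train_fraction else test).append(ts)
    return train, test


# Ten consecutive minutes (illustrative values)
minutes = [f"2017-12-31 23:{m:02d}:00" for m in range(50, 60)]
train, test = chronological_split(minutes)
print(len(train), len(test))
```

Because the rows are ordered before being split, every training timestamp precedes every test timestamp, which avoids leaking future prices into the training data.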

In [15]:
train_df, test_df = model_preparation(df)

In [16]:
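def compute_avg_train_test(dataframe):
  # Drop the model-specific columns; only date and close are needed here
  dataframe = dataframe.drop("features", "labelIndex")

  # Average the closing price per calendar day
  dataframe = dataframe.withColumn("date", F.date_format(dataframe.date, "yyyy-MM-dd")).groupBy("date").agg(
      F.avg("close").alias("avg_close")
  ).sort("date")

  # Round the daily average to two decimal places
  dataframe = dataframe.withColumn("avg_close", F.round(dataframe["avg_close"], 2))

  return dataframe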
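
The aggregation performed by `compute_avg_train_test` can be illustrated without Spark. The sketch below groups minute-level rows by calendar day and averages the closing price, rounding to two decimals. The helper `daily_average_close` is an assumption made for illustration; the 2017-12-31 rows come from the sample output shown earlier, while the 2018-01-01 rows are made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def daily_average_close(rows: List[Tuple[str, float]]) -> Dict[str, float]:
    """Average the close price per calendar day, rounded to 2 decimals."""
    closes_by_day = defaultdict(list)
    for timestamp, close in rows:
        day = timestamp[:10]  # "yyyy-MM-dd" prefix of "yyyy-MM-dd HH:mm:ss"
        closes_by_day[day].append(close)
    # Sort by day so the result is in chronological order
    return {day: round(sum(v) / len(v), 2) for day, v in sorted(closes_by_day.items())}


rows = [
    ("2017-12-31 23:59:00", 13880.0),
    ("2017-12-31 23:58:00", 13953.77),
    ("2017-12-31 23:57:00", 13913.26),
    ("2018-01-01 00:00:00", 100.0),   # made-up value
    ("2018-01-01 00:01:00", 200.0),   # made-up value
]
averages = daily_average_close(rows)
print(averages)
```

One caveat: Spark's `round` rounds half up, whereas Python's built-in `round` rounds ties to even on binary floats, so the two can differ on exact ties.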

In [17]:
def show_avg_train_test(train_df, test_df):
  avg_train_df_pandas = compute_avg_train_test(train_df).toPandas()
  avg_test_df_pandas = compute_avg_train_test(test_df).toPandas()

  trace1 = go.Scatter(
      x = avg_train_df_pandas['date'],
      y = avg_train_df_pandas['avg_close'].astype(float),
      mode = 'lines',
      name = 'Train set'
  )

  trace2 = go.Scatter(
      x = avg_test_df_pandas['date'],
      y = avg_test_df_pandas['avg_close'].astype(float),
      mode = 'lines',
      name = 'Test set'
  )
  
  layout = dict(
      title='Train and test sets with range slider',
      xaxis=dict(
          rangeselector=dict(
              buttons=list([
                  #change the count to desired amount of months.
                  dict(count=1,
                      label='1m',
                      step='month',
                      stepmode='backward'),
                  dict(count=6,
                      label='6m',
                      step='month',
                      stepmode='backward'),
                  dict(count=12,
                      label='1y',
                      step='month',
                      stepmode='backward'),
                  dict(count=36,
                      label='3y',
                      step='month',
                      stepmode='backward'),
                  dict(step='all')
              ])
          ),
          rangeslider=dict(
              visible = True
          ),
          type='date'
      )
  )

  data = [trace1,trace2]
  fig = dict(data=data, layout=layout)
  iplot(fig, filename = "Train and Test set  with Rangeslider")

In [18]:
show_avg_train_test(train_df, test_df)

Saving the final train and test datasets

In [19]:
def output(dataframe, name):
  from pyspark.sql.functions import to_timestamp, col, udf
  from pyspark.sql.types import StringType

  # Convert the date column to a string
  dataframe = dataframe.withColumn("date", to_timestamp(col("date"), "yyyy-MM-dd HH:mm:ss").cast("string"))

  # Define the Vector-to-string conversion function
  vector_to_string = udf(lambda vector: str(vector), StringType())

  # Apply the function to the features column
  dataframe = dataframe.withColumn("features", vector_to_string(dataframe["features"]))

  # Save the dataset in CSV format (Spark writes a directory containing a part-*.csv file)
  dataframe.repartition(1).write.csv(GDRIVE_DATASET_DIR + '/output', header=True, mode='overwrite')

  import os
  import glob
  import time

  while True:
      csv_files = glob.glob(os.path.join(GDRIVE_DATASET_DIR + '/output', "*.csv"))
      if len(csv_files) > 0:
          # .csv file found
          file_path = csv_files[0]  # Take the first file found
          break
      else:
          print(".csv file not found. Retrying in 1 second...")
          time.sleep(1)

  print(".csv file found:", file_path)
  new_file_name = GDRIVE_DATASET_NAME + "_" + name + ".csv"
  # Rename the file
  new_file_path = os.path.join(os.path.dirname(file_path), new_file_name)
  os.rename(file_path, new_file_path)

  # Move the file to the destination folder
  new_file_destination = os.path.join(GDRIVE_DATASET_DIR, new_file_name)
  os.rename(new_file_path, new_file_destination)

  import shutil

  # Remove the temporary output folder
  shutil.rmtree(GDRIVE_DATASET_DIR + '/output')

  print("File renamed and moved successfully!")

In [20]:
output(train_df, "train")
output(test_df, "test")

.csv file found: /content/drive/MyDrive/Computer_Science/BDC/project/datasets/output/part-00000-52251d0c-bb4c-4329-b595-ae1a6d34a5ef-c000.csv
File renamed and moved successfully!
.csv file found: /content/drive/MyDrive/Computer_Science/BDC/project/datasets/output/part-00000-bd9b30d8-d8ca-494b-9c3e-0e0bce6aee29-c000.csv
File renamed and moved successfully!
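
The rename-and-move step can be tried in isolation on a temporary directory. The sketch below is an alternative formulation, not the notebook's own code: it assumes the write has already completed, fails fast if it does not find exactly one `part-*.csv` file instead of polling, and moves the file in a single `shutil.move` call. The helper `collect_single_csv` and the file names are hypothetical.

```python
import glob
import os
import shutil
import tempfile


def collect_single_csv(output_dir: str, destination: str) -> str:
    """Move the single part-*.csv file out of output_dir, then delete the directory."""
    part_files = glob.glob(os.path.join(output_dir, "part-*.csv"))
    if len(part_files) != 1:
        raise RuntimeError(f"expected exactly one part file, found {len(part_files)}")
    shutil.move(part_files[0], destination)  # rename and move in one step
    shutil.rmtree(output_dir)                # remove leftovers such as _SUCCESS
    return destination


# Simulate a Spark-style output directory in a temporary folder (hypothetical names)
base_dir = tempfile.mkdtemp()
output_dir = os.path.join(base_dir, "output")
os.makedirs(output_dir)
with open(os.path.join(output_dir, "part-00000-demo-c000.csv"), "w") as f:
    f.write("date,close\n2017-12-31 23:59:00,13880.0\n")
open(os.path.join(output_dir, "_SUCCESS"), "w").close()

final_path = collect_single_csv(output_dir, os.path.join(base_dir, "demo_train.csv"))
print(os.path.basename(final_path))
```

Checking for exactly one part file makes a failed or multi-partition write visible immediately, rather than silently picking the first match.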
