<a href="https://colab.research.google.com/github/CorsiDanilo/big-data-computing-project/blob/main/2_Model_preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bitcoin price forecasting with PySpark
## Big Data Computing final project - A.Y. 2022 - 2023
Prof. Gabriele Tolomei

MSc in Computer Science

La Sapienza, University of Rome

### Author
Corsi Danilo - corsi.1742375@studenti.uniroma1.it



# Global Constants


In [1]:
JAVA_HOME = "/usr/lib/jvm/java-8-openjdk-amd64"
GDRIVE_DIR = "/content/drive"

GDRIVE_DATASET_RAW_DIR = GDRIVE_DIR + "/MyDrive/BDC/project/datasets/raw"
GDRIVE_DATASET_TEMP_DIR = GDRIVE_DIR + "/MyDrive/BDC/project/datasets/temp"
GDRIVE_DATASET_OUTPUT_DIR = GDRIVE_DIR + "/MyDrive/BDC/project/datasets/output"

GDRIVE_DATASET_NAME = "bitcoin_blockchain_data_1m"
GDRIVE_DATASET_NAME_EXT = "/" + GDRIVE_DATASET_NAME + ".csv"

GDRIVE_DATASET = GDRIVE_DATASET_RAW_DIR + GDRIVE_DATASET_NAME_EXT

# Import useful Python packages

In [2]:
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from itertools import cycle

import plotly.express as px

import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
import gc

# **Spark + Google Colab Setup** ❗

In [7]:
!pip install pyspark
# Alternatively, if you want to install a specific version of pyspark:
#!pip install pyspark==3.2.1
!pip install -U -q PyDrive # To use files that are stored in Google Drive directly (e.g., without downloading them from an external URL)
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = JAVA_HOME

import pyspark
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf

from pyspark.sql import functions as F

openjdk-8-jdk-headless is already the newest version (8u372-ga~us1-0ubuntu1~22.04).
0 upgraded, 0 newly installed, 0 to remove and 8 not upgraded.


In [None]:
#TODO: to be fixed ❗
#General System Utilities
import sys
from datetime import datetime
import pickle

#Data Processing Libraries
import numpy as np
import pandas as pd
from pandas import concat
import matplotlib.pyplot as plt
from fastai.tabular import *
import six

#Pyspark/SQL libs
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DateType, IntegerType, FloatType
import seaborn as sns
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

#DS/DL Libs
import sklearn
from sklearn.linear_model import LinearRegression as sklearnLR
from sklearn.feature_selection import RFE
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM, GRU
from keras import optimizers

In [4]:
# Create the session
conf = SparkConf().\
                set('spark.ui.port', "4050").\
                set('spark.executor.memory', '4G').\
                set('spark.driver.memory', '45G').\
                set('spark.driver.maxResultSize', '10G').\
                set("spark.kryoserializer.buffer.max", "1G").\
                setAppName("BitcoinPriceForecasting").\
                setMaster("local[*]")

# Create the context
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()

In [5]:
# Point Colaboratory to our Google Drive

from google.colab import drive

drive.mount(GDRIVE_DIR, force_remount=True)

Mounted at /content/drive


# **Model preparation** ❗


In [10]:
# load dataset into pyspark dataframe objects
df = spark.read.load(GDRIVE_DATASET,
                         format="csv",
                         sep=",",
                         inferSchema="true",
                         header="true"
                    )

Linear Regression models typically take in a single vector input, so we’ll need to vectorize all of our features into a single column. Thankfully, PySpark offers the VectorAssembler class to do just that.

To build and compare the performance of our three feature sets, namely 15 columns (all features, our baseline), 4 (relevant features), and 7 (RFE-selected features), we’ll start by assembling 3 independent VectorAssemblers, 1 for each feature list:

In [11]:
all_features = ['market-price', 'market-cap', 'total-bitcoins', 'trade-volume', 'blocks-size', 'avg-block-size', 'n-transactions-total', 'n-transactions-per-block', 'hash-rate', 'difficulty', 'miners-revenue', 'transaction-fees-usd', 'n-unique-addresses', 'n-transactions', 'estimated-transaction-volume-usd']
rel_columns = ['market-cap', 'estimated-transaction-volume-usd', 'blocks-size', 'n-unique-addresses']
selected_features_rfe = ['total-bitcoins', 'blocks-size', 'avg-block-size', 'n-transactions-per-block', 'miners-revenue', 'n-unique-addresses', 'n-transactions']

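# NOTE: all_features also contains the target column below, so the baseline sees the label as an input
dep_var = 'market-price'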

vectorAssembler = VectorAssembler(
    inputCols = all_features,
    outputCol = 'features')

vectorAssembler2 = VectorAssembler(
    inputCols = rel_columns,
    outputCol = 'features')

vectorAssembler3 = VectorAssembler(
    inputCols = selected_features_rfe,
    outputCol = 'features')
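
Before assembling, it can help to check the feature lists themselves. The sketch below repeats the three lists from the cell above so that it runs on its own, prints their sizes, and checks whether the target column also appears among the inputs, which would leak the label into the features. The helper name `leaks_target` is hypothetical and not part of the notebook.

```python
# Feature lists repeated from the notebook cell so this block is self-contained.
all_features = ['market-price', 'market-cap', 'total-bitcoins', 'trade-volume',
                'blocks-size', 'avg-block-size', 'n-transactions-total',
                'n-transactions-per-block', 'hash-rate', 'difficulty',
                'miners-revenue', 'transaction-fees-usd', 'n-unique-addresses',
                'n-transactions', 'estimated-transaction-volume-usd']
rel_columns = ['market-cap', 'estimated-transaction-volume-usd',
               'blocks-size', 'n-unique-addresses']
selected_features_rfe = ['total-bitcoins', 'blocks-size', 'avg-block-size',
                         'n-transactions-per-block', 'miners-revenue',
                         'n-unique-addresses', 'n-transactions']
dep_var = 'market-price'


def leaks_target(features, target):
    """Return True if the target column is also used as an input feature."""
    return target in features


feature_sets = {
    'all': all_features,
    'relevant': rel_columns,
    'rfe': selected_features_rfe,
}

# Report the size of each feature set and whether it contains the target.
for name, features in feature_sets.items():
    print(name, len(features), 'leaks target:', leaks_target(features, dep_var))
```

With the lists as written, only the full list contains the target; removing it from the inputs is one way to make the baseline comparable to the other two sets.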


To get the 3 vectorized DataFrames, we apply the transform of each assembler to the data:


In [None]:
#All columns featurized
v_df = vectorAssembler.transform(df)
v_df = v_df.select(['features', dep_var])
v_df.show(3)

#Relevant columns featurized
v_rel_df = vectorAssembler2.transform(df)
v_rel_df = v_rel_df.select(['features', dep_var])
v_rel_df.show(3)

#RFE-selected columns featurized
v_sel_df = vectorAssembler3.transform(df)
v_sel_df = v_sel_df.select(['features', dep_var])
v_sel_df.show(3)

Calling the show method on each of the resulting DataFrames prints the first rows of the vectorized inputs (X) and targets (y).

Great! Now we can move on to partitioning each of these DataFrames into training and test sets.



I created a utility function that splits a DataFrame into a training part and a test part at a given row position:



In [14]:
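def regression_data_builder(spark_df, part_index):
  # Collect to pandas once instead of once per split, to limit driver memory use
  pandas_df = spark_df.toPandas()
  # Rows before part_index form the training set, the remaining rows the test set
  train_df = spark.createDataFrame(pandas_df[:part_index])
  valid_df = spark.createDataFrame(pandas_df[part_index:])
  return train_df, valid_df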

From here, we can easily create 3 train/test pairs. We store the row position of an 80/20 split in a variable called “valid_index” and partition each of the 3 DataFrames from the previous step accordingly. The split is positional, so it assumes the rows are already in chronological order:



In [15]:
# Spark DataFrames have no pandas-style index, so no index assignment is needed;
# the split below is positional and relies on the row order of df.

# Count rows in Spark instead of collecting the whole DataFrame to the driver
valid_index = int(v_df.count() * .8)

btc_train_df, btc_test_df = regression_data_builder(v_df, valid_index)
btc_rel_train_df, btc_rel_test_df = regression_data_builder(v_rel_df, valid_index)
btc_sel_train_df, btc_sel_test_df = regression_data_builder(v_sel_df, valid_index)

# (rows, columns) of a Spark DataFrame, computed without converting to pandas
def spark_shape(spark_df):
  return (spark_df.count(), len(spark_df.columns))

print(spark_shape(btc_train_df), spark_shape(btc_test_df))
print(spark_shape(btc_rel_train_df), spark_shape(btc_rel_test_df))
print(spark_shape(btc_sel_train_df), spark_shape(btc_sel_test_df))

ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/py4j/clientserver.py", line 516, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/usr/local/lib/python3.10/dist-packages/py4j/clientserver.py", line 539, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving

Passing unit-less datetime64 dtype to .astype is deprecated and will raise in a future version. Pass 'datetime64[ns]' instead

Exception ignored in: <function _xla_gc_callback at 0x7f5811389bd0>
Traceback (most recent call last):
  File "/usr/local/lib/pytho

Py4JError: ignored

Checking the shapes of the split DataFrames verifies this step: each training set should hold 80% of the rows and each test set the remaining 20%. Once the shapes match, we can move on to fitting the models.
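
The positional 80/20 split can be sketched in plain Python, independent of Spark. This is a minimal illustration rather than the notebook's implementation: the function names `split_index` and `chronological_split` are hypothetical, the rows are toy values, and the rows are assumed to be sorted by time already.

```python
def split_index(n_rows, train_fraction=0.8):
    """Row position where the training set ends and the test set begins."""
    return int(n_rows * train_fraction)


def chronological_split(rows, train_fraction=0.8):
    """Split rows by position, keeping order: earliest rows train, latest rows test."""
    cut = split_index(len(rows), train_fraction)
    return rows[:cut], rows[cut:]


# Toy rows standing in for (time step, price) pairs that are already sorted by time.
rows = [(t, 100.0 + t) for t in range(10)]
train, test = chronological_split(rows)

# The two parts together cover every row exactly once.
print(len(train), len(test), len(train) + len(test) == len(rows))
```

Splitting by position rather than at random keeps every test row later in time than every training row, which is usually what a forecasting evaluation needs.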



## [OLD] **Model preparation**

Prepare the data: Make sure your dataset is in a format suitable for training the model. You should have a column of output labels (the response variable) and the features (independent variables) in separate columns.

Create a VectorAssembler: A VectorAssembler combines the features into a single vector column. This step is necessary because PySpark requires the features to be in a single vector in order to train the Random Forest model.
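
As a rough plain-Python analogy (not the Spark API), assembling means picking the listed input columns from each row, in order, and packing them into one vector. The function name `assemble`, the column choice, and the values below are illustrative assumptions only.

```python
def assemble(row, input_cols):
    """Collect the values of input_cols, in order, into one feature list."""
    return [float(row[col]) for col in input_cols]


# Toy rows with made-up values; the column names mirror those used in this section.
rows = [
    {'date': '2023-01-01', 'close': 101.5, 'volume_usd': 2500.0},
    {'date': '2023-01-02', 'close': 102.0, 'volume_usd': 1800.0},
]
input_cols = ['close', 'volume_usd']

# One 'features' vector per row, next to the column kept as the label.
assembled = [
    {'features': assemble(row, input_cols), 'close': row['close']}
    for row in rows
]

for record in assembled:
    print(record)
```

In Spark the same idea is expressed by passing the column names as `inputCols` and the name of the new vector column as `outputCol`.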

In [None]:
# load dataset into pyspark dataset objects
df = spark.read.load(GDRIVE_DATASET,
                         format="csv",
                         sep=",",
                         inferSchema="true",
                         header="true"
                    )

In [None]:
# def model_preparation(dataset):
#   from pyspark.ml.feature import VectorAssembler

#   assembler = VectorAssembler(
#       inputCols=["close"],
#       outputCol="features"
#   )

#   dataset = assembler.transform(dataset)

#   from pyspark.sql.functions import date_format, to_timestamp

#   # transform date column into string
#   dataset = dataset.withColumn("date_str", date_format(to_timestamp("date", "yyyy-MM-dd HH:mm:ss"), "yyyy-MM-dd HH:mm:ss"))

#   # encode the date to a column of label indices
#   from pyspark.ml.feature import StringIndexer

#   label_stringIdx = StringIndexer(inputCol = 'date_str', outputCol = 'labelIndex')
#   dataset = label_stringIdx.fit(dataset).transform(dataset)

#   # divide the dataset into train set and test set
#   from pyspark.sql.functions import percent_rank
#   from pyspark.sql import Window

#   dataset = dataset.withColumn("rank", percent_rank().over(Window.partitionBy().orderBy("date_str")))
#   train = dataset.where("rank <= .8")
#   test = dataset.where("rank > .8")

#   return train.drop("rank", "date_str"), test.drop("rank", "date_str")

In [None]:
# def model_preparation(dataset):
#   # Preprocessing: StringIndexer for categorical labels
#   stringIndexer  = StringIndexer(inputCol="date", outputCol="label")

#   # Define the feature and label columns & Assemble the feature vector
#   assembler = VectorAssembler(inputCols="close", outputCol="features")

#   return train.drop("rank", "date_str"), test.drop("rank", "date_str")

In [None]:
def model_preparation(dataset):
  from pyspark.ml.feature import VectorAssembler

  assembler = VectorAssembler(inputCols=['close', 'volume_usd'], outputCol='features')
  dataset = assembler.transform(dataset)
  dataset = dataset.select('features', 'close')

  print("The shape of the dataset is {:d} rows by {:d} columns".format(dataset.count(), len(dataset.columns)))
  dataset.printSchema()
  dataset.show(5)

  # # transform date column into string
  # dataset = dataset.withColumn("date_str", date_format(to_timestamp("date", "yyyy-MM-dd HH:mm:ss"), "yyyy-MM-dd HH:mm:ss"))

  # # encode the date to a column of label indices
  # from pyspark.ml.feature import StringIndexer

  # label_stringIdx = StringIndexer(inputCol = 'date_str', outputCol = 'labelIndex')
  # dataset = label_stringIdx.fit(dataset).transform(dataset)

  # # divide the dataset into train set and test set
  # from pyspark.sql.functions import percent_rank
  # from pyspark.sql import Window

  # dataset = dataset.withColumn("rank", percent_rank().over(Window.partitionBy().orderBy("date_str")))
  # train = dataset.where("rank <= .8")
  # test = dataset.where("rank > .8")

  # return train.drop("rank", "date_str"), test.drop("rank", "date_str")

In [None]:
# train_df, test_df = model_preparation(df)
model_preparation(df)

IllegalArgumentException: ignored

In [None]:
if SLOW_OPERATION:
  print("The shape of the train dataset is {:d} rows by {:d} columns".format(train_df.count(), len(train_df.columns)))
  train_df.show(5)
  print("The shape of the test dataset is {:d} rows by {:d} columns".format(test_df.count(), len(test_df.columns)))
  test_df.show(5)

In [None]:
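def compute_daily_df(dataset):
  # Drop model-only columns that are not needed for the daily aggregation
  dataset = dataset.drop("features", "labelIndex")

  # Truncate timestamps to the day and average the closing price per day
  dataset = dataset.withColumn("date", F.date_format(dataset.date, "yyyy-MM-dd")).groupBy("date").agg(
      F.avg("close").alias("close")
  ).sort("date")

  # Use Spark's round (F.round) explicitly rather than relying on the star import
  dataset = dataset.withColumn("close", F.round(dataset["close"], 2))

  return dataset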

In [None]:
def show_daily_train_test(train, test):
  daily_train_pandas = compute_daily_df(train).toPandas()
  daily_test_pandas = compute_daily_df(test).toPandas()

  trace1 = go.Scatter(
      x = daily_train_pandas['date'],
      y = daily_train_pandas['close'].astype(float),
      mode = 'lines',
      name = 'Train set'
  )

  trace2 = go.Scatter(
      x = daily_test_pandas['date'],
      y = daily_test_pandas['close'].astype(float),
      mode = 'lines',
      name = 'Test set'
  )

  layout = dict(
      title='Train and Test set with the Slider ',
      xaxis=dict(
          rangeselector=dict(
              buttons=list([
                  #change the count to desired amount of months.
                  dict(count=1,
                      label='1m',
                      step='month',
                      stepmode='backward'),
                  dict(count=6,
                      label='6m',
                      step='month',
                      stepmode='backward'),
                  dict(count=12,
                      label='1y',
                      step='month',
                      stepmode='backward'),
                  dict(count=36,
                      label='3y',
                      step='month',
                      stepmode='backward'),
                  dict(step='all')
              ])
          ),
          rangeslider=dict(
              visible = True
          ),
          type='date'
      )
  )

  data = [trace1,trace2]
  fig = dict(data=data, layout=layout)
  iplot(fig, filename = "Train and Test set  with Rangeslider")

In [None]:
show_daily_train_test(train_df, test_df)

# Output

Saving the final train and test datasets

In [None]:
def output(dataset, typology):
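  # import everything this function uses, rather than relying on earlier star imports
  from pyspark.sql.functions import to_timestamp, col, udf
  from pyspark.sql.types import StringType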

  # transform date column into string
  dataset = dataset.withColumn("date", to_timestamp(col("date"), "yyyy-MM-dd HH:mm:ss").cast("string"))

  # definition of Vector to String conversion function
  vector_to_string = udf(lambda vector: str(vector), StringType())

  # applying the function to the features column
  dataset = dataset.withColumn("features", vector_to_string(dataset["features"]))

  # save the dataset in CSV format
  dataset.repartition(1).write.csv(GDRIVE_DATASET_TEMP_DIR, header=True, mode='overwrite')

  import os
  import glob
  import time

  while True:
      csv_files = glob.glob(os.path.join(GDRIVE_DATASET_TEMP_DIR, "part*.csv"))
      if len(csv_files) > 0:
          # .csv file found!
          file_path = csv_files[0]
          break
      else:
          print(".csv file not found. I'll try again after 1 second...")
          time.sleep(1)

  print(".csv file found:", file_path)

  new_file_path = GDRIVE_DATASET_OUTPUT_DIR + "/" + GDRIVE_DATASET_NAME + "_" + typology + ".csv"

  import shutil

  # rename and move the file
  shutil.move(file_path, new_file_path)

  print("File renamed and moved successfully!")

In [None]:
output(train_df, "train")
output(test_df, "test")

.csv file found: /content/drive/MyDrive/Computer_Science/BDC/project/datasets/temp/part-00000-c85b88ff-d0a5-44b3-a848-1f640d5df9ec-c000.csv
File renamed and moved successfully!
.csv file found: /content/drive/MyDrive/Computer_Science/BDC/project/datasets/temp/part-00000-2f02edff-d40b-49fd-a2f8-883e4d0b3fd7-c000.csv
File renamed and moved successfully!
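
The find-and-move step can also be written with a bounded wait, so that a missing file ends in an error instead of an endless loop. The sketch below is a standalone illustration under stated assumptions: the helper name `move_part_file`, its timeout parameters, and the throwaway directories are hypothetical and stand in for the Drive folders used above.

```python
import glob
import os
import shutil
import tempfile
import time


def move_part_file(temp_dir, dest_path, timeout_s=30.0, poll_s=1.0):
    """Move the first part*.csv found in temp_dir to dest_path.

    Polls until a file appears or timeout_s elapses, then raises TimeoutError.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        csv_files = sorted(glob.glob(os.path.join(temp_dir, "part*.csv")))
        if csv_files:
            # Create the destination folder if needed, then rename and move.
            dest_dir = os.path.dirname(dest_path)
            if dest_dir:
                os.makedirs(dest_dir, exist_ok=True)
            shutil.move(csv_files[0], dest_path)
            return dest_path
        if time.monotonic() >= deadline:
            raise TimeoutError("no part*.csv file found in " + temp_dir)
        time.sleep(poll_s)


# Demo with throwaway directories and a hand-written part file.
with tempfile.TemporaryDirectory() as root:
    temp_dir = os.path.join(root, "temp")
    os.makedirs(temp_dir)
    with open(os.path.join(temp_dir, "part-00000-demo.csv"), "w") as fh:
        fh.write("date,close\n")
    moved = move_part_file(temp_dir, os.path.join(root, "output", "demo_train.csv"))
    moved_exists = os.path.exists(moved)
    print(moved_exists)
```

Since a Spark write only returns after the job has finished, the part file should normally be present on the first check; the timeout mainly guards against looking in the wrong directory.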
