# **Bitcoin price prediction - Linear Regression**
### Big Data Computing final project - A.Y. 2022 - 2023
Prof. Gabriele Tolomei

MSc in Computer Science

La Sapienza, University of Rome

### Author: Corsi Danilo (1742375) - corsi.1742375@studenti.uniroma1.it


---


Description: runs the chosen model first with default parameters, then selects the best parameters through hyperparameter tuning with cross validation and performance evaluation. Finally, the tuned model is validated and trained on the whole train / validation set.

# Global constants, dependencies, libraries and tools

In [None]:
# Main constants
LOCAL_RUNNING = True
SLOW_OPERATIONS = True # Decide whether or not to use operations that might slow down notebook execution
MODEL_NAME = "LinearRegression"
MAIN_DIR = "D:/Documents/Repository/BDC/project" if LOCAL_RUNNING else "/content/drive"

In [None]:
if not LOCAL_RUNNING: 
    # Point Colaboratory to Google Drive
    from google.colab import drive

    # Mount Google Drive at the main dir
    drive.mount(MAIN_DIR, force_remount=True)

In [None]:
# Set main dir (the Drive project path is appended only when running on Colab)
MAIN_DIR = MAIN_DIR + ("" if LOCAL_RUNNING else "/MyDrive/BDC/project")

###################
# --- DATASET --- #
###################

# Datasets dirs
DATASET_OUTPUT_DIR = MAIN_DIR + "/datasets/output"

# Datasets names
DATASET_TRAIN_VALID_NAME = "bitcoin_blockchain_data_30min_train_valid"

# Datasets paths
DATASET_TRAIN_VALID  = DATASET_OUTPUT_DIR + "/" + DATASET_TRAIN_VALID_NAME + ".parquet"

####################
# --- FEATURES --- #
####################

# Features dir
FEATURES_DIR = MAIN_DIR + "/features"

# Features labels
FEATURES_LABEL = "features"
TARGET_LABEL = "next-market-price"

# Features names
ALL_FEATURES_NAME = "all_features"
MOST_CORR_FEATURES_NAME = "most_corr_features"
LEAST_CORR_FEATURES_NAME = "least_corr_features"

# Features paths
ALL_FEATURES = FEATURES_DIR + "/" + ALL_FEATURES_NAME + ".json"
MOST_CORR_FEATURES = FEATURES_DIR + "/" + MOST_CORR_FEATURES_NAME + ".json"
LEAST_CORR_FEATURES = FEATURES_DIR + "/" + LEAST_CORR_FEATURES_NAME + ".json"

##################
# --- MODELS --- #
##################

# Model dir
MODELS_DIR = MAIN_DIR + "/models"

# Model path
MODEL = MODELS_DIR + "/" + MODEL_NAME

#####################
# --- UTILITIES --- #
#####################

# Utilities dir
UTILITIES_DIR = MAIN_DIR + "/utilities"

###################
# --- RESULTS --- #
###################

# Results dir
RESULTS_DIR = MAIN_DIR + "/results"

# Results path
MODEL_RESULTS  = RESULTS_DIR + "/" + MODEL_NAME + ".csv"

In [None]:
# Suppression of warnings for better reading
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [None]:
if not LOCAL_RUNNING:
    # Install Spark and related dependencies
    !pip install pyspark
    !pip install -U -q PyDrive -qq
    !apt install openjdk-8-jdk-headless -qq

# Import files

In [None]:
# Import my files
import sys
sys.path.append(UTILITIES_DIR)

from imports import *
import utilities, parameters

importlib.reload(utilities)
importlib.reload(parameters)

# Create the pyspark session

In [None]:
# Create the session
conf = SparkConf().\
                set('spark.ui.port', "4050").\
                set('spark.executor.memory', '12G').\
                set('spark.driver.memory', '12G').\
                set('spark.driver.maxResultSize', '109G').\
                set("spark.kryoserializer.buffer.max", "1G").\
                setAppName("BitcoinPricePrediction").\
                setMaster("local[*]")

# Create the context
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()

# Loading dataset

In [None]:
# Load train / validation set into pyspark dataset objects
# (Parquet stores its own schema, so CSV-style options such as sep / header are not needed)
df = spark.read.parquet(DATASET_TRAIN_VALID)

In [None]:
def dataset_info(dataset):
  # Print dataset
  dataset.show(3)

  # Get the number of rows
  num_rows = dataset.count()

  # Get the number of columns
  num_columns = len(dataset.columns)

  # Print the shape of the dataset
  print("Shape:", (num_rows, num_columns))

  # Print the schema of the dataset
  dataset.printSchema()

In [None]:
if SLOW_OPERATIONS:
  dataset_info(df)

# Loading features

In [None]:
# Loading all the features
with open(ALL_FEATURES, "r") as f:
    ALL_FEATURES = json.load(f)
print(ALL_FEATURES)

In [None]:
# Loading the most correlated features
with open(MOST_CORR_FEATURES, "r") as f:
    MOST_CORR_FEATURES = json.load(f)
print(MOST_CORR_FEATURES)

In [None]:
# Loading least correlated features
with open(LEAST_CORR_FEATURES, "r") as f:
    LEAST_CORR_FEATURES = json.load(f)
print(LEAST_CORR_FEATURES)

# Model train / validation ❗
In order to train and validate the model, I'll try several approaches:
- **Simple:** Make predictions using the chosen base model
- **Simple with normalization:** Like the previous one but features are normalized

The features that gave the most satisfactory results on average (for each model) are then selected, and the process continues with:

- **Hyperparameter tuning:** model validation to find the best parameters to use
- **Cross Validation:** validate the performance of the model with the chosen parameters
- **Validate final model:** validate the model with the chosen parameters
- **Train final model:** train the final model on the whole train / validation set to be ready to make predictions on market price

## Simple ❗
The train / validation set will be split so that the model performance can be seen without any tuning, using different sets of features (normalized and non-normalized).

### Simple model

In [None]:
# Define model and features type
MODEL_TYPE = "simple"
FEATURES_NORMALIZATION = False

In [None]:
# Get default parameters
params = parameters.get_defaults_model_params(MODEL_NAME)
params

# [TO DELETE] Just for testing ❗

In [None]:
# cv_info = parameters.get_cross_validation_params('multi_splits')
cv_info = parameters.get_cross_validation_params('block_splits')
# cv_info = parameters.get_cross_validation_params('walk_forward_splits')
cv_info

In [None]:
# Make predictions by using all the features
simple_res_all, simple_pred_all = utilities.cross_validation(df, params, cv_info, MODEL_NAME, MODEL_TYPE, FEATURES_NORMALIZATION, ALL_FEATURES, ALL_FEATURES_NAME, FEATURES_LABEL, TARGET_LABEL)
simple_res_all

In [None]:
utilities.show_results(simple_pred_all.toPandas(), MODEL_NAME)

In [None]:
# Make predictions by using the most correlated features
simple_res_most_corr, simple_pred_most_corr = utilities.cross_validation(df, params, cv_info, MODEL_NAME, MODEL_TYPE, FEATURES_NORMALIZATION, MOST_CORR_FEATURES, MOST_CORR_FEATURES_NAME, FEATURES_LABEL, TARGET_LABEL)
simple_res_most_corr

In [None]:
utilities.show_results(simple_pred_most_corr.toPandas(), MODEL_NAME)

In [None]:
# Make predictions by using the least correlated features
simple_res_least_corr, simple_pred_least_corr = utilities.cross_validation(df, params, cv_info, MODEL_NAME, MODEL_TYPE, FEATURES_NORMALIZATION, LEAST_CORR_FEATURES, LEAST_CORR_FEATURES_NAME, FEATURES_LABEL, TARGET_LABEL)
simple_res_least_corr

In [None]:
utilities.show_results(simple_pred_least_corr.toPandas(), MODEL_NAME)

### Simple model (with normalization)

In [None]:
# Define model and features type
MODEL_TYPE = "simple_norm"
FEATURES_NORMALIZATION = True

In [None]:
# Make predictions by using all the features
simple_norm_res_all, simple_norm_pred_all = utilities.cross_validation(df, params, cv_info, MODEL_NAME, MODEL_TYPE, FEATURES_NORMALIZATION, ALL_FEATURES, ALL_FEATURES_NAME, FEATURES_LABEL, TARGET_LABEL)
simple_norm_res_all

In [None]:
utilities.show_results(simple_norm_pred_all.toPandas(), MODEL_NAME)

In [None]:
# Make predictions by using the most correlated features
simple_norm_res_most_corr, simple_norm_pred_most_corr = utilities.cross_validation(df, params, cv_info, MODEL_NAME, MODEL_TYPE, FEATURES_NORMALIZATION, MOST_CORR_FEATURES, MOST_CORR_FEATURES_NAME, FEATURES_LABEL, TARGET_LABEL)
simple_norm_res_most_corr

In [None]:
utilities.show_results(simple_norm_pred_most_corr.toPandas(), MODEL_NAME)

In [None]:
# Make predictions by using the least correlated features
simple_norm_res_least_corr, simple_norm_pred_least_corr = utilities.cross_validation(df, params, cv_info, MODEL_NAME, MODEL_TYPE, FEATURES_NORMALIZATION, LEAST_CORR_FEATURES, LEAST_CORR_FEATURES_NAME, FEATURES_LABEL, TARGET_LABEL)
simple_norm_res_least_corr

In [None]:
utilities.show_results(simple_norm_pred_least_corr.toPandas(), MODEL_NAME)

In [None]:
# Define model information and evaluators to show
model_info = ['Model', 'Type', 'Cv', 'Features', 'Parameters']
evaluator_lst = ['RMSE', 'MSE', 'MAE', 'MAPE', 'R2', 'Adjusted_R2', 'Time']
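
The evaluator names listed above correspond to standard regression metrics. How `utilities` computes them is not shown in this notebook, so the following is only a minimal sketch in plain Python, under the assumption that the usual textbook definitions are used. The function name `regression_metrics` and the toy values are hypothetical.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """Textbook regression metrics (assumed definitions, not the project's own code)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # MAPE is undefined when a true value is 0; the toy data below avoids that case
    mape = 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_tot
    # Adjusted R2 penalizes R2 by the number of features used
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

    return {"RMSE": math.sqrt(mse), "MSE": mse, "MAE": mae,
            "MAPE": mape, "R2": r2, "Adjusted_R2": adj_r2}

# Toy values, for illustration only
metrics = regression_metrics([100.0, 200.0, 300.0, 400.0],
                             [110.0, 190.0, 310.0, 390.0],
                             n_features=1)
print(metrics)
```

RMSE and MAE are expressed in the same unit as the target, while MAPE and R2 are scale free, which makes them easier to compare across feature sets.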

In [None]:
# Define the results to show
simple_comparison_lst = [simple_res_all, simple_res_most_corr, simple_res_least_corr,simple_norm_res_all, simple_norm_res_most_corr, simple_norm_res_least_corr]

# Show the comparison table
simple_comparison_lst_df = pd.concat([utilities.model_comparison(results, model_info, evaluator_lst) for results in simple_comparison_lst])
simple_comparison_lst_df

## Tuned ❗
Once the features and execution method are selected, the model will undergo hyperparameter tuning and cross validation to find the best configuration.

### Hyperparameter tuning with cross validation
The train / validation set is divided into several splits according to a list of portions.

For each split, all combinations of the model parameters are tested and the ones that return the lowest RMSE are kept.
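
The search over parameter combinations can be sketched as follows. This is a simplified illustration, not the code in `utilities.hyperparameter_tuning`: the helper names `expand_grid` and `best_by_rmse`, the example grid, and the toy scoring function are all hypothetical, and the scoring function stands in for "fit on the training fold and compute the RMSE on the validation fold".

```python
from itertools import product

def expand_grid(grid):
    """Yield one dict per combination of the listed parameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

def best_by_rmse(grid, score_fn):
    """Return (params, rmse) for the combination with the lowest RMSE."""
    scored = [(params, score_fn(params)) for params in expand_grid(grid)]
    return min(scored, key=lambda item: item[1])

# Hypothetical grid, for illustration only
example_grid = {"maxIter": [5, 10], "regParam": [0.0, 0.8], "elasticNetParam": [0.0]}

def toy_rmse(params):
    # Stand-in for a real train / validate round on one split
    return abs(params["regParam"] - 0.8) + params["maxIter"] / 100

best_params, best_rmse = best_by_rmse(example_grid, toy_rmse)
print(best_params, best_rmse)
```

Repeating this search on every split gives one winning combination per split, which is why counting how often each combination wins is a reasonable way to pick the final parameters.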

Using the previously selected parameters, the model can be validated with three types of cross validation:

**Multiple splits**

The idea is to divide the dataset into two folds at each iteration, on condition that the validation set is always ahead of the training set. This way the temporal dependence between observations is respected.
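
A minimal sketch of this scheme on row indices is shown below. It assumes equally sized folds and a training set that grows at each iteration; the project itself drives the splits from a list of portions, so the function `multi_splits` and its arguments are hypothetical.

```python
def multi_splits(n_rows, n_splits):
    """Yield (train, valid) index ranges where validation always follows training."""
    fold = n_rows // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train = range(0, i * fold)
        valid = range(i * fold, min((i + 1) * fold, n_rows))
        yield train, valid

splits = list(multi_splits(n_rows=10, n_splits=4))
for train, valid in splits:
    print(list(train), list(valid))
```

Every validation index is larger than every training index, so the model never sees observations from the future of the fold it is evaluated on.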

**Blocked time series**

It works by adding margins at two positions. The first is between the training and validation folds, in order to prevent the model from observing lag values which are used twice, once as a regressor and once as a response. The second is between the folds used at each iteration, in order to prevent the model from memorizing patterns from one iteration to the next.
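
The two margins can be sketched on row indices as follows. The function `blocked_splits`, the 80 / 20 proportion, and the one-row margin are assumptions chosen for illustration, not the values used by `parameters.get_cross_validation_params`.

```python
def blocked_splits(n_rows, n_blocks, train_frac=0.8, gap=1):
    """Yield (train, valid) index ranges, one pair per non-overlapping block."""
    block = n_rows // n_blocks
    for b in range(n_blocks):
        start = b * block
        stop = start + block - gap            # margin before the next block
        train_end = start + int((stop - start) * train_frac)
        valid_start = train_end + gap         # margin between train and validation
        yield range(start, train_end), range(valid_start, stop)

blocks = list(blocked_splits(n_rows=40, n_blocks=2, train_frac=0.8, gap=1))
for train, valid in blocks:
    print((train.start, train.stop), (valid.start, valid.stop))
```

With these toy values, rows 15 and 35 separate training from validation inside each block, and row 19 separates the first block from the second.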

**Walk forward time series**

The basic idea behind walk-forward validation is to iteratively train and evaluate the model using a sliding window approach. Here's how it works:
* Split the time series data into a training set and a test set. The training set contains the initial portion of the data, while the test set contains the subsequent portion.
* Train the model on the training set and make predictions on the test set.
* Evaluate the performance of the model on the test set using appropriate evaluation metrics such as mean squared error (MSE), root mean squared error (RMSE), or mean absolute error (MAE).
* Move the sliding window forward by one step, incorporating the next data point into the training set and shifting the test set accordingly.
* Repeat the train, evaluate and move steps until the entire time series has been used for testing.
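
The steps above can be sketched as a small loop. Since the next data point is incorporated into the training set at each step, this sketch lets the training window grow rather than keeping a fixed length, which is an assumption. The function `walk_forward_splits`, the toy series, and the naive "last value" forecaster are hypothetical and only stand in for the real model.

```python
def walk_forward_splits(n_rows, initial_train, horizon=1, step=1):
    """Yield (train, test) index ranges, moving the split point forward by `step`."""
    train_end = initial_train
    while train_end + horizon <= n_rows:
        yield range(0, train_end), range(train_end, train_end + horizon)
        train_end += step

# Toy series, for illustration only
series = [10.0, 11.0, 13.0, 12.0, 15.0, 16.0]

errors = []
for train, test in walk_forward_splits(len(series), initial_train=3):
    prediction = series[train[-1]]            # naive forecast: repeat the last seen value
    errors.extend(abs(series[t] - prediction) for t in test)

mae = sum(errors) / len(errors)
print(errors, mae)
```

Each observation after the initial training portion is used for testing exactly once, and always by a model that has only seen earlier observations.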

In [None]:
# From now on, only selected and normalized features will be considered
MODEL_TYPE = "hyp_tuning"
CHOSEN_FEATURES = ALL_FEATURES
CHOSEN_FEATURES_LABEL = ALL_FEATURES_NAME
FEATURES_NORMALIZATION = True

In [None]:
# Get model grid parameters
params = parameters.get_model_grid_params(MODEL_NAME)
params

In [None]:
# Perform hyperparameter tuning
hyp_res = utilities.hyperparameter_tuning(df, params, cv_info, MODEL_NAME, MODEL_TYPE, FEATURES_NORMALIZATION, CHOSEN_FEATURES, CHOSEN_FEATURES_LABEL, FEATURES_LABEL, TARGET_LABEL)
hyp_res

In [None]:
# Count the occurrences of each value in the Parameters column
counts = hyp_res["Parameters"].value_counts()

# Display the counts
print(counts)

In [None]:
MODEL_TYPE = "cross_val"

# [TO DELETE] Just for testing❗

In [None]:
def get_best_model_params(model_name):
    # Hard-coded testing parameters, one single-value list per parameter
    if model_name == 'LinearRegression':
        params = {
            'maxIter' : [5],
            'regParam' : [0.8],
            'elasticNetParam' : [0.0]
        }
    elif model_name == 'GeneralizedLinearRegression':
        params = {
            'maxIter' : [5],
            'regParam' : [0.2],
            'family': ['gaussian'],
            'link': ['log']
        }
    elif model_name == 'RandomForestRegressor':
        params = {
            'numTrees' : [3],
            'maxDepth' : [10],
            'seed' : [42]
        }
    elif model_name == 'GBTRegressor':
        params = {
            'maxIter' : [30],
            'maxDepth' : [3],
            'stepSize': [0.4],
            'seed' : [42]
        }
    else:
        # Fail early instead of returning an unbound variable
        raise ValueError(f"Unknown model name: {model_name}")

    return params

In [None]:
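# Get tuned parameters
# NOTE: both assignments below are commented out, so `params` keeps the value set in the earlier cells
# params = utilities.get_best_model_params(MODEL_NAME)
# params = get_best_model_params(MODEL_NAME)
params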

In [None]:
# Perform cross validation
cv_res, cv_pred = utilities.cross_validation(df, params, cv_info, MODEL_NAME, MODEL_TYPE, FEATURES_NORMALIZATION, CHOSEN_FEATURES, CHOSEN_FEATURES_LABEL, FEATURES_LABEL, TARGET_LABEL)
cv_res

In [None]:
utilities.show_results(cv_pred.toPandas(), MODEL_NAME)

In [None]:
# Define the results to show
tuned_comparison_lst = [cv_res]

# Show the comparison table
tuned_comparison_lst_df = pd.concat([utilities.model_comparison(results, model_info, evaluator_lst) for results in tuned_comparison_lst])
tuned_comparison_lst_df

## Final ❗
Finally, the configuration found will be validated and then the model will be trained one last time on the entire train / validation set, ready to make predictions.

### Validate final model

In [None]:
MODEL_TYPE = "final_validated"

In [None]:
# Performances on validated final model
final_valid_res, final_valid_pred = utilities.cross_validation(df, params, cv_info, MODEL_NAME, MODEL_TYPE, FEATURES_NORMALIZATION, CHOSEN_FEATURES, CHOSEN_FEATURES_LABEL, FEATURES_LABEL, TARGET_LABEL)
final_valid_res

In [None]:
utilities.show_results(final_valid_pred.toPandas(), MODEL_NAME)

### Train model

In [None]:
MODEL_TYPE = "final_trained"

In [None]:
# Train the model on the whole train / validation set
final_train_res, final_train_model, final_train_pred = utilities.evaluate_trained_model(df, params, MODEL_NAME, MODEL_TYPE, FEATURES_NORMALIZATION, CHOSEN_FEATURES, CHOSEN_FEATURES_LABEL, FEATURES_LABEL, TARGET_LABEL)
final_train_res

In [None]:
utilities.show_results(final_train_pred.toPandas(), MODEL_NAME)

In [None]:
# Define the results to show
valid_comparison_lst = [final_valid_res, final_train_res]

# Show the comparison table
valid_comparison_lst_df = pd.concat([utilities.model_comparison(results, model_info, evaluator_lst) for results in valid_comparison_lst])
valid_comparison_lst_df

# Comparison table
Visualization of model performance at various stages of train / validation

In [None]:
# Concatenate simple, tuned and final results into a single Pandas DataFrame
final_comparison_lst_df = pd.concat([simple_comparison_lst_df, tuned_comparison_lst_df, valid_comparison_lst_df], ignore_index=True)
final_comparison_lst_df

# Model accuracy ❗

In [None]:
accuracy = utilities.model_accuracy(final_valid_pred)
print(f"Percentage of correct predictions for {MODEL_NAME} ({MODEL_TYPE}): {accuracy:.2f}%")

# Saving trained model


In [None]:
# Saving final model results
final_comparison_lst_df.to_csv(MODEL_RESULTS, index=False)

In [None]:
# Save the trained model
final_train_model.write().overwrite().save(MODEL)