# **Bitcoin price prediction - Random Forest Regressor**
### Big Data Computing final project - A.Y. 2022 - 2023
Prof. Gabriele Tolomei

MSc in Computer Science

La Sapienza, University of Rome

### Author: Corsi Danilo (1742375) - corsi.1742375@studenti.uniroma1.it


---


Description: runs the chosen model first with default parameters, then selects the best parameters through hyperparameter tuning with cross validation and performance evaluation. Finally, the tuned model is validated and trained on the whole train / validation set.

# Global constants, dependencies, libraries and tools

In [15]:
# Main constants
LOCAL_RUNNING = True
SLOW_OPERATIONS = True # Decide whether or not to use operations that might slow down notebook execution
MODEL_NAME = "RandomForestRegressor"
MAIN_DIR = "D:/Documents/Repository/BDC/project" if LOCAL_RUNNING else "/content/drive"

In [None]:
if not LOCAL_RUNNING: 
    # Point Colaboratory to Google Drive
    from google.colab import drive

    # Define GDrive paths
    drive.mount(MAIN_DIR, force_remount=True)

In [None]:
# Set main dir (on Colab, the project lives under the mounted Drive folder)
MAIN_DIR = MAIN_DIR if LOCAL_RUNNING else MAIN_DIR + "/MyDrive/BDC/project"

###################
# --- DATASET --- #
###################

# Datasets dirs
DATASET_OUTPUT_DIR = MAIN_DIR + "/datasets/output"

# Datasets names
DATASET_TRAIN_VALID_NAME = "bitcoin_blockchain_data_30min_train_valid"

# Datasets paths
DATASET_TRAIN_VALID  = DATASET_OUTPUT_DIR + "/" + DATASET_TRAIN_VALID_NAME + ".parquet"

####################
# --- FEATURES --- #
####################

# Features dir
FEATURES_DIR = MAIN_DIR + "/features"

# Features labels
FEATURES_LABEL = "features"
TARGET_LABEL = "next-market-price"

# Features names
ALL_FEATURES_NAME = "all_features"
MOST_CORR_FEATURES_NAME = "most_corr_features"
LEAST_CORR_FEATURES_NAME = "least_corr_features"

# Features paths
ALL_FEATURES = FEATURES_DIR + "/" + ALL_FEATURES_NAME + ".json"
MOST_CORR_FEATURES = FEATURES_DIR + "/" + MOST_CORR_FEATURES_NAME + ".json"
LEAST_CORR_FEATURES = FEATURES_DIR + "/" + LEAST_CORR_FEATURES_NAME + ".json"

##################
# --- MODELS --- #
##################

# Model dir
MODELS_DIR = MAIN_DIR + "/models"

# Model path
MODEL = MODELS_DIR + "/" + MODEL_NAME

#####################
# --- UTILITIES --- #
#####################

# Utilities dir
UTILITIES_DIR = MAIN_DIR + "/utilities"

###################
# --- RESULTS --- #
###################

# Results dir
RESULTS_DIR = MAIN_DIR + "/results"

# Results path
MODEL_RESULTS  = RESULTS_DIR + "/" + MODEL_NAME + ".csv"

In [None]:
# Suppression of warnings for better reading
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [None]:
if not LOCAL_RUNNING:
    # Install Spark and related dependencies
    !pip install pyspark
    !pip install -U -q PyDrive -qq
    !apt install openjdk-8-jdk-headless -qq

# Import files

In [None]:
# Import my files
import sys
sys.path.append(UTILITIES_DIR)

from imports import *
import utilities, parameters

importlib.reload(utilities)
importlib.reload(parameters)

# Create the pyspark session

In [None]:
# Create the session
conf = SparkConf().\
                set('spark.ui.port', "4050").\
                set('spark.executor.memory', '12G').\
                set('spark.driver.memory', '12G').\
                set('spark.driver.maxResultSize', '109G').\
                set("spark.kryoserializer.buffer.max", "1G").\
                setAppName("BitcoinPricePrediction").\
                setMaster("local[*]")

# Create the context
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()

# Loading dataset

In [None]:
# Load train / validation set into a pyspark DataFrame
# (Parquet files carry their own schema, so CSV options such as sep / header / inferSchema are not needed)
df = spark.read.load(DATASET_TRAIN_VALID, format="parquet")

In [None]:
def dataset_info(dataset):
  # Print dataset
  dataset.show(3)

  # Get the number of rows
  num_rows = dataset.count()

  # Get the number of columns
  num_columns = len(dataset.columns)

  # Print the shape of the dataset
  print("Shape:", (num_rows, num_columns))

  # Print the schema of the dataset
  dataset.printSchema()

In [None]:
if SLOW_OPERATIONS:
  dataset_info(df)

# Loading features

In [None]:
# Loading all the features
with open(ALL_FEATURES, "r") as f:
    ALL_FEATURES = json.load(f)
print(ALL_FEATURES)

In [None]:
# Loading the most correlated features
with open(MOST_CORR_FEATURES, "r") as f:
    MOST_CORR_FEATURES = json.load(f)
print(MOST_CORR_FEATURES)

In [None]:
# Loading least correlated features
with open(LEAST_CORR_FEATURES, "r") as f:
    LEAST_CORR_FEATURES = json.load(f)
print(LEAST_CORR_FEATURES)

# Model train / validation ❗
In order to train and validate the model, I'll try several approaches:
- **Simple:** Make predictions using the chosen base model
- **Simple with normalization:** Like the previous one but features are normalized

At this point, the features that gave, on average, the most satisfactory results (for each model) are chosen, and the process continues with:

- **Hyperparameter tuning:** model validation to find the best parameters to use
- **Cross Validation:** validate the performance of the model with the chosen parameters
- **Validate final model:** validate the model with the chosen parameters
- **Train final model:** train the final model on the whole train / validation set to be ready to make predictions on market price

## Simple ❗
The train / validation set will be split so that the model performance can be seen without any tuning, using different feature sets (normalized and not).

### Simple model

In [None]:
# Define model and features type
MODEL_TYPE = "simple"
FEATURES_NORMALIZATION = False

In [None]:
# Get default parameters
params = parameters.get_defaults_model_params(MODEL_NAME)
params

In [None]:
# cv_info = parameters.get_cross_validation_params('multi_splits')
cv_info = parameters.get_cross_validation_params('block_splits')
# cv_info = parameters.get_cross_validation_params('walk_forward_splits')
cv_info

In [None]:
# Takes the total number of samples, the minimum number of observations (training window size),
# and the sliding window size (validation window size, also used as the step) as input.
# Returns a pandas DataFrame with one row per walk-forward split, holding the
# 'start', 'split' and 'end' positions: train on [start, split), validate on [split, end).
# The caller then filters the dataset by its 'id' column and trains / evaluates the model on each split.
def walk_forward_splits_new(num, min_obser, sliding_window):
    split_positions = []
    start = 0
    while start + min_obser + sliding_window <= num:
        split_positions.append((start, start + min_obser, start + min_obser + sliding_window))
        start += sliding_window

    split_position_df = pd.DataFrame(split_positions, columns=['start', 'split', 'end'])

    return split_position_df

In [None]:
'''
Description: Cross validation on time series data
Args:
    dataset: The dataset which needs to be split
    params: Parameters to test
    cv_info: The type of cross validation [multi_splits | block_splits | walk_forward_splits]
    model_name: Model name selected
    model_type: Model type [simple | simple_norm | hyp_tuning | cross_val | final_validated | final_trained]
    features_normalization: Indicates whether features should be normalized (True) or not (False)
    features: Features to be used to make predictions
    features_name: Name of features used
    features_label: The column name of features
    target_label: The column name of target variable
Return: 
    results_lst_df: All the splits performances in a pandas dataset
    final_predictions: Predictions of all the splits in a pandas dataset
'''
def cross_validation(dataset, params, cv_info, model_name, model_type, features_normalization, features, features_name, features_label, target_label):
    # Select the type of features to be used
    dataset = utilities.select_features(dataset, features_normalization, features, features_label, target_label)

    # Get the number of samples
    num = dataset.count()
    
    # Save results in a list
    results_lst = []

    # Initialize an empty list to store predictions
    predictions_list = []  

    # Identify the type of cross validation 
    if cv_info['cv_type'] == 'multi_splits':
        split_position_df = utilities.multi_splits(num, cv_info['splits'])
    elif cv_info['cv_type'] == 'block_splits':
        split_position_df = utilities.block_splits(num, cv_info['splits'])
    elif cv_info['cv_type'] == 'walk_forward_splits':
        split_position_df = walk_forward_splits_new(num, cv_info['min_obser'], cv_info['sliding_window'])

    for position in split_position_df.itertuples():
        # Get the start/split/end position based on the type of cross validation
        start = getattr(position, 'start')
        splits = getattr(position, 'split')
        end = getattr(position, 'end')
        idx  = getattr(position, 'Index')
        
        # Train / validation size
        train_size = splits - start
        valid_size = end - splits

        # Get training data and validation data
        train_data = dataset.filter(dataset['id'].between(start, splits-1))
        valid_data = dataset.filter(dataset['id'].between(splits, end-1))

        # Cache them
        train_data.cache()
        valid_data.cache()
        
        # All combination of params
        param_lst = [dict(zip(params, param)) for param in product(*params.values())]

        for param in param_lst:
            # Chosen Model
            model = utilities.model_selection(model_name, param, features_label, target_label)

            # Wrap the model in a Pipeline (features are already assembled by select_features)
            pipeline = Pipeline(stages=[model])

            # Train a model and calculate running time
            start = time.time()
            pipeline_model = pipeline.fit(train_data)
            end = time.time()

            # Make predictions
            predictions = pipeline_model.transform(valid_data).select(target_label, "prediction", 'timestamp')
            
            # Append predictions to the list
            predictions_list.append(predictions)  

            # Compute validation error by several evaluators
            eval_res = utilities.model_evaluation(target_label, predictions)

            # Use dict to store each result
            results = {
                "Model": model_name,
                "Type": model_type,
                "Cv": cv_info['cv_type'],
                "Features": features_name,
                "Splits": idx + 1,
                "Train&Validation": (train_size,valid_size),                
                "Parameters": list(param.values()),
                "RMSE": eval_res['rmse'],
                "MSE": eval_res['mse'],
                "MAE": eval_res['mae'],
                "MAPE": eval_res['mape'],
                "R2": eval_res['r2'],
                "Adjusted_R2": eval_res['adj_r2'],
                "Time": end - start,
            }

            # Store results for each split
            results_lst.append(results)
            print(results)

        # Release Cache
        train_data.unpersist()
        valid_data.unpersist()

    # Transform dict to pandas dataset
    results_lst_df = pd.DataFrame(results_lst)

    # Create an empty DataFrame with the same schema as the predictions dataset
    final_predictions = spark.createDataFrame([], schema=predictions_list[0].schema)

    # Iterate over the list of DataFrames and union them with the merged DataFrame
    for pred in predictions_list:
        final_predictions = final_predictions.union(pred)

    return results_lst_df, final_predictions.toPandas()

In [None]:
# Make predictions by using all the features
simple_res_all, simple_pred_all = cross_validation(df, params, cv_info, MODEL_NAME, MODEL_TYPE, FEATURES_NORMALIZATION, ALL_FEATURES, ALL_FEATURES_NAME, FEATURES_LABEL, TARGET_LABEL)
simple_res_all

In [None]:
utilities.show_results(simple_pred_all, MODEL_NAME)

In [None]:
# Make predictions by using the most correlated features
simple_res_most_corr, simple_pred_most_corr = cross_validation(df, params, cv_info, MODEL_NAME, MODEL_TYPE, FEATURES_NORMALIZATION, MOST_CORR_FEATURES, MOST_CORR_FEATURES_NAME, FEATURES_LABEL, TARGET_LABEL)
simple_res_most_corr

In [None]:
utilities.show_results(simple_pred_most_corr, MODEL_NAME)

In [None]:
# Make predictions by using the least correlated features
simple_res_least_corr, simple_pred_least_corr = cross_validation(df, params, cv_info, MODEL_NAME, MODEL_TYPE, FEATURES_NORMALIZATION, LEAST_CORR_FEATURES, LEAST_CORR_FEATURES_NAME, FEATURES_LABEL, TARGET_LABEL)
simple_res_least_corr

In [None]:
utilities.show_results(simple_pred_least_corr, MODEL_NAME)

### Simple model (with normalization)

In [None]:
# Define model and features type
MODEL_TYPE = "simple_norm"
FEATURES_NORMALIZATION = True

In [None]:
# Make predictions by using all the features
simple_norm_res_all, simple_norm_pred_all = cross_validation(df, params, cv_info, MODEL_NAME, MODEL_TYPE, FEATURES_NORMALIZATION, ALL_FEATURES, ALL_FEATURES_NAME, FEATURES_LABEL, TARGET_LABEL)
simple_norm_res_all

In [None]:
utilities.show_results(simple_norm_pred_all, MODEL_NAME)

In [None]:
# Make predictions by using the most correlated features
simple_norm_res_most_corr, simple_norm_pred_most_corr = cross_validation(df, params, cv_info, MODEL_NAME, MODEL_TYPE, FEATURES_NORMALIZATION, MOST_CORR_FEATURES, MOST_CORR_FEATURES_NAME, FEATURES_LABEL, TARGET_LABEL)
simple_norm_res_most_corr

In [None]:
utilities.show_results(simple_norm_pred_most_corr, MODEL_NAME)

In [None]:
# Make predictions by using the least correlated features
simple_norm_res_least_corr, simple_norm_pred_least_corr = cross_validation(df, params, cv_info, MODEL_NAME, MODEL_TYPE, FEATURES_NORMALIZATION, LEAST_CORR_FEATURES, LEAST_CORR_FEATURES_NAME, FEATURES_LABEL, TARGET_LABEL)
simple_norm_res_least_corr

In [None]:
utilities.show_results(simple_norm_pred_least_corr, MODEL_NAME)

In [None]:
# Define model information and evaluators to show
model_info = ['Model', 'Type', 'Cv', 'Features', 'Parameters']
evaluator_lst = ['RMSE', 'MSE', 'MAE', 'MAPE', 'R2', 'Adjusted_R2', 'Time']
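
The evaluator names listed in `evaluator_lst` are standard regression metrics. Since `utilities.model_evaluation` is not shown in this notebook, the following is only a minimal pure-Python sketch of their textbook definitions. The function name `regression_metrics`, the `n_features` argument, and the choice to express MAPE as a fraction are assumptions made for illustration, not the project's actual implementation.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """Textbook regression metrics (illustrative helper, not the project's code)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # MAPE as a fraction (multiply by 100 for a percentage); assumes no zero targets
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n

    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R2 penalizes the number of predictors
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)

    return {"rmse": math.sqrt(mse), "mse": mse, "mae": mae,
            "mape": mape, "r2": r2, "adj_r2": adj_r2}

# Toy prices (made-up numbers, only to exercise the formulas)
metrics = regression_metrics([100.0, 200.0, 300.0, 400.0],
                             [110.0, 190.0, 310.0, 390.0],
                             n_features=1)
print(metrics)
```

Because RMSE and MAE are expressed in the unit of the target (here, a price), they are the easiest to read directly, while R2 and Adjusted R2 are unit-free and easier to compare across feature sets.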

In [None]:
# Define the results to show
simple_comparison_lst = [simple_res_all, simple_res_most_corr, simple_res_least_corr,simple_norm_res_all, simple_norm_res_most_corr, simple_norm_res_least_corr]

# Show the comparison table
simple_comparison_lst_df = pd.concat([utilities.model_comparison(results, model_info, evaluator_lst) for results in simple_comparison_lst])
simple_comparison_lst_df

## Tuned ❗
Once the features and execution method are selected, the model will undergo hyperparameter tuning and cross validation to find the best configuration.

### Hyperparameter tuning with cross validation
The train / validation set is divided into several splits according to a list of portions.

For each split, all combinations of the model parameters are tested, and the combination with the lowest RMSE is kept.
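
As a small illustration of how the parameter combinations are enumerated, here is the same `product`-based idiom used later in this notebook, applied to a toy grid. The grid values and the `fake_rmse` scoring function are hypothetical stand-ins (the real score comes from fitting the model on the training fold and evaluating it on the validation fold).

```python
from itertools import product

# Toy grid (hypothetical values, not the project's real search space)
grid = {"numTrees": [3, 5], "maxDepth": [5, 10], "seed": [42]}

# One dict per combination of parameter values
combos = [dict(zip(grid, values)) for values in product(*grid.values())]

def fake_rmse(param):
    # Stand-in for "fit on train, score on validation"; purely illustrative
    return 100.0 / (param["numTrees"] * param["maxDepth"])

# Keep the combination with the lowest (fake) RMSE
best = min(combos, key=fake_rmse)
print(len(combos), best)
```

The number of combinations is the product of the list lengths, so each extra value in the grid multiplies the number of model fits per split.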

Using the previously selected parameters, the model is then cross-validated with one of the following three schemes (chosen through `cv_info`):

**Multiple splits**

The idea is to divide the dataset into two folds at each iteration, on condition that the validation set always comes after the training set in time. This way the temporal dependence of the data is respected.
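
A minimal sketch of this idea follows. The project's `utilities.multi_splits` is not shown here and the exact format of `cv_info['splits']` is unknown, so the function name `expanding_splits` and its `n_splits` argument are assumptions: the training fold always starts at position 0 and grows, while the validation fold is the chunk right after it.

```python
def expanding_splits(num, n_splits):
    """Return (start, split, end) triples: train on [start, split), validate on [split, end)."""
    fold = num // (n_splits + 1)
    positions = []
    for i in range(1, n_splits + 1):
        split = fold * i
        # The last validation fold absorbs any leftover samples
        end = num if i == n_splits else split + fold
        positions.append((0, split, end))
    return positions

splits = expanding_splits(100, 3)
for start, split, end in splits:
    print(f"train [{start}, {split})  valid [{split}, {end})")
```

Every validation range begins exactly where its training range ends, so no future observation leaks into training.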

**Blocked time series**

It works by adding margins at two positions. The first is between the training and validation folds in order to prevent the model from observing lag values which are used twice, once as a regressor and another as a response. The second is between the folds used at each iteration in order to prevent the model from memorizing patterns from an iteration to the next.
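
The following sketch shows one possible way to lay out blocked splits. `utilities.block_splits` is not shown in this notebook, so the function `blocked_splits`, its arguments (`n_blocks`, `valid_size`, `gap`) and the four-value tuples it returns are all hypothetical. Blocks do not overlap, and an optional `gap` separates each training range from its validation range.

```python
def blocked_splits(num, n_blocks, valid_size, gap=0):
    """Return (start, train_end, valid_start, end) per block.

    Train on [start, train_end), skip `gap` samples, validate on [valid_start, end).
    """
    block = num // n_blocks
    if valid_size + gap >= block:
        raise ValueError("valid_size + gap must be smaller than the block size")

    positions = []
    for i in range(n_blocks):
        start = i * block
        end = start + block
        valid_start = end - valid_size
        train_end = valid_start - gap
        positions.append((start, train_end, valid_start, end))
    return positions

blocks = blocked_splits(num=100, n_blocks=4, valid_size=5, gap=2)
for start, train_end, valid_start, end in blocks:
    print(f"train [{start}, {train_end})  gap  valid [{valid_start}, {end})")
```

Since each block only uses its own samples, a model fitted in one iteration never sees data from the previous or the next block.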

**Walk forward time series**

The basic idea behind walk-forward validation is to iteratively train and evaluate the model using a sliding window approach. Here's how it works:
1. Split the time series data into a training set and a test set. The training set contains the initial portion of the data, while the test set contains the subsequent portion.
2. Train the model on the training set and make predictions on the test set.
3. Evaluate the performance of the model on the test set using appropriate evaluation metrics such as mean squared error (MSE), root mean squared error (RMSE), or mean absolute error (MAE).
4. Move the sliding window forward by one step, incorporating the next data point into the training set and shifting the test set accordingly.
5. Repeat steps 2-4 until the entire time series has been used for testing.
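
To make the window movement concrete, here is a pure-Python version of the loop used by `walk_forward_splits_new`, run on a toy series of 10 samples. The function name `walk_forward_positions` and the toy sizes are only for illustration. Note that in this variant the training window has a fixed length (`min_obser`) and slides forward by `sliding_window` samples at a time, rather than growing.

```python
def walk_forward_positions(num, min_obser, sliding_window):
    """Return (start, split, end) triples: train on [start, split), validate on [split, end)."""
    positions = []
    start = 0
    # Stop as soon as a full train + validation window no longer fits
    while start + min_obser + sliding_window <= num:
        split = start + min_obser
        positions.append((start, split, split + sliding_window))
        start += sliding_window
    return positions

positions = walk_forward_positions(num=10, min_obser=4, sliding_window=2)
for start, split, end in positions:
    print(f"train [{start}, {split})  valid [{split}, {end})")
```

Samples that do not fill a complete window at the end of the series are simply left out, so the window sizes are worth choosing with the dataset length in mind.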

In [None]:
# From now on, only selected and normalized features will be considered
MODEL_TYPE = "hyp_tuning"
CHOSEN_FEATURES = ALL_FEATURES
CHOSEN_FEATURES_LABEL = ALL_FEATURES_NAME
FEATURES_NORMALIZATION = True

In [None]:
# Get model grid parameters
params = parameters.get_model_grid_params(MODEL_NAME)
params

In [None]:
'''
Description: Hyperparameter tuning on time series data (for each split, keeps the parameter combination with the lowest RMSE)
Args:
    dataset: The dataset which needs to be split
    params: Parameters to test
    cv_info: The type of cross validation [multi_splits | block_splits | walk_forward_splits]
    model_name: Model name selected
    model_type: Model type [simple | simple_norm | hyp_tuning | cross_val | final_validated | final_trained]
    features_normalization: Indicates whether features should be normalized (True) or not (False)
    features: Features to be used to make predictions
    features_name: Name of features used
    features_label: The column name of features
    target_label: The column name of target variable
Return: 
    best_split_result_df: Best result (lowest RMSE) of each split in a pandas dataset
'''
def hyperparameter_tuning(dataset, params, cv_info, model_name, model_type, features_normalization, features, features_name, features_label, target_label):
    # Select the type of features to be used
    dataset = utilities.select_features(dataset, features_normalization, features, features_label, target_label)

    best_split_result = []

    # Get the number of samples
    num = dataset.count()

    # Identify the type of cross validation 
    if cv_info['cv_type'] == 'multi_splits':
        split_position_df = utilities.multi_splits(num, cv_info['splits'])
    elif cv_info['cv_type'] == 'block_splits':
        split_position_df = utilities.block_splits(num, cv_info['splits'])
    elif cv_info['cv_type'] == 'walk_forward_splits':
        split_position_df = walk_forward_splits_new(num, cv_info['min_obser'], cv_info['sliding_window'])

    for position in split_position_df.itertuples():
        best_result = {"RMSE": float('inf')}

        # Get the start/split/end position based on the type of cross validation
        start = getattr(position, 'start')
        splits = getattr(position, 'split')
        end = getattr(position, 'end')
        idx  = getattr(position, 'Index')
        
        # Train / validation size
        train_size = splits - start
        valid_size = end - splits

        # Get training data and validation data
        train_data = dataset.filter(dataset['id'].between(start, splits-1))
        valid_data = dataset.filter(dataset['id'].between(splits, end-1))

        # Cache them
        train_data.cache()
        valid_data.cache()

        # All combination of params
        param_lst = [dict(zip(params, param)) for param in product(*params.values())]

        for param in param_lst:
            # Chosen Model
            model = utilities.model_selection(model_name, param, features_label, target_label)

            # Wrap the model in a Pipeline (features are already assembled by select_features)
            pipeline = Pipeline(stages=[model])

            # Train a model and calculate running time
            start = time.time()
            pipeline_model = pipeline.fit(train_data)
            end = time.time()

            # Make predictions
            predictions = pipeline_model.transform(valid_data).select(target_label, "prediction", 'timestamp')

            # Compute validation error by several evaluators
            eval_res = utilities.model_evaluation(target_label, predictions)

            # Use dict to store each result
            results = {
                "Model": model_name,
                "Type": model_type,
                "Cv": cv_info['cv_type'],
                "Features": features_name,
                "Splits": idx + 1,
                "Train&Validation": (train_size,valid_size),                
                "Parameters": list(param.values()),
                "RMSE": eval_res['rmse'],
                "MSE": eval_res['mse'],
                "MAE": eval_res['mae'],
                "MAPE": eval_res['mape'],
                "R2": eval_res['r2'],
                "Adjusted_R2": eval_res['adj_r2'],
                "Time": end - start,
            }
            # Store the result with the lowest RMSE and the associated parameters
            if results['RMSE'] < best_result['RMSE']:
                best_result = results

        # Release Cache
        train_data.unpersist()
        valid_data.unpersist()

        best_split_result.append(best_result) 
        print(best_result)

    # Transform dict to pandas dataset
    best_split_result_df = pd.DataFrame(best_split_result)

    return best_split_result_df

In [None]:
# Perform hyperparameter tuning
hyp_res = hyperparameter_tuning(df, params, cv_info, MODEL_NAME, MODEL_TYPE, FEATURES_NORMALIZATION, CHOSEN_FEATURES, CHOSEN_FEATURES_LABEL, FEATURES_LABEL, TARGET_LABEL)
hyp_res

In [None]:
# Count the occurrences of each value in the Parameters column
# (lists are unhashable, so they are converted to tuples before counting)
counts = hyp_res["Parameters"].apply(tuple).value_counts()

# Display the counts
print(counts)
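
The same tally can be sketched without pandas, which also makes it easy to pick the combination that wins on the most splits. The `best_per_split` values below are made up for illustration and are not results from this notebook.

```python
from collections import Counter

# Hypothetical best parameters per split: [numTrees, maxDepth, seed]
best_per_split = [[3, 10, 42], [5, 10, 42], [3, 10, 42]]

# Lists are not hashable, so each one is turned into a tuple before counting
tally = Counter(tuple(p) for p in best_per_split)

# Most frequent combination and the number of splits it won
winner, votes = tally.most_common(1)[0]
print(winner, votes)
```

A majority vote like this ignores how large the RMSE differences are, so it is worth checking the per-split scores as well before fixing the final parameters.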

In [None]:
MODEL_TYPE = "cross_val"

In [None]:
def get_best_model_params(model_name):
    # Best parameters found through hyperparameter tuning (single-value grids)
    if (model_name == 'LinearRegression'):
        params = {
            'maxIter' : [5],
            'regParam' : [0.8],
            'elasticNetParam' : [0.0]
        }
    elif (model_name == 'GeneralizedLinearRegression'):
        params = {
            'maxIter' : [5],
            'regParam' : [0.2],
            'family': ['gaussian'],
            'link': ['log']
        }
    elif (model_name == 'RandomForestRegressor'):
        params = {
            'numTrees' : [3],
            'maxDepth' : [10],
            'seed' : [42]
        }
    elif (model_name == 'GBTRegressor'):
        params = {
            'maxIter' : [30],
            'maxDepth' : [3],
            'stepSize': [0.4],
            'seed' : [42]
        }
    else:
        # Fail early instead of hitting an UnboundLocalError on return
        raise ValueError(f"Unknown model name: {model_name}")

    return params

In [None]:
# Get tuned parameters
params = get_best_model_params(MODEL_NAME)
params

In [None]:
# Perform cross validation
cv_res, cv_pred = cross_validation(df, params, cv_info, MODEL_NAME, MODEL_TYPE, FEATURES_NORMALIZATION, CHOSEN_FEATURES, CHOSEN_FEATURES_LABEL, FEATURES_LABEL, TARGET_LABEL)
cv_res

In [None]:
utilities.show_results(cv_pred, MODEL_NAME)

In [None]:
# Define the results to show
tuned_comparison_lst = [cv_res]

# Show the comparison table
tuned_comparison_lst_df = pd.concat([utilities.model_comparison(results, model_info, evaluator_lst) for results in tuned_comparison_lst])
tuned_comparison_lst_df

## Final ❗
Finally, the configuration found will be validated and then the model will be trained one last time on the entire train / validation set, ready to make predictions.

### Validate final model

In [None]:
MODEL_TYPE = "final_validated"

In [None]:
# Performances on validated final model
final_valid_res, final_valid_pred = cross_validation(df, params, cv_info, MODEL_NAME, MODEL_TYPE, FEATURES_NORMALIZATION, CHOSEN_FEATURES, CHOSEN_FEATURES_LABEL, FEATURES_LABEL, TARGET_LABEL)
final_valid_res

In [None]:
utilities.show_results(final_valid_pred, MODEL_NAME)

### Train model

In [None]:
MODEL_TYPE = "final_trained"

In [None]:
'''
Description: Train the model on the whole dataset and evaluate it on the same data
Args:
    dataset: The dataset used to train the final model
    params: Parameters to use
    model_name: Model name selected
    model_type: Model type [simple | simple_norm | hyp_tuning | cross_val | final_validated | final_trained]
    features_normalization: Indicates whether features should be normalized (True) or not (False)
    features: Features to be used to make predictions
    features_name: Name of features used
    features_label: The column name of features
    target_label: The column name of target variable
Return: 
    results_df: Results obtained from the evaluation
    pipeline_model: Final trained model
    predictions: Predictions obtained from the model
'''
def evaluate_trained_model(dataset, params, model_name, model_type, features_normalization, features, features_name, features_label, target_label):    
    # Select the type of features to be used
    dataset = utilities.select_features(dataset, features_normalization, features, features_label, target_label)
  
    # All combination of params
    param_lst = [dict(zip(params, param)) for param in product(*params.values())]
    
    for param in param_lst:
        # Chosen Model
        model = utilities.model_selection(model_name, param, features_label, target_label)
        
        # Wrap the model in a Pipeline (features are already assembled by select_features)
        pipeline = Pipeline(stages=[model])

        # Train a model and calculate running time
        start = time.time()
        pipeline_model = pipeline.fit(dataset)
        end = time.time()

        # Make predictions
        predictions = pipeline_model.transform(dataset).select(target_label, "prediction", 'timestamp')

        # Compute the in-sample (training) error by several evaluators
        eval_res = utilities.model_evaluation(target_label, predictions)

        # Use dict to store each result
        results = {
            "Model": model_name,
            "Type": model_type,
            "Cv": "none",
            "Features": features_name,
            "Parameters": [list(param.values())],
            "RMSE": eval_res['rmse'],
            "MSE": eval_res['mse'],
            "MAE": eval_res['mae'],
            "MAPE": eval_res['mape'],
            "R2": eval_res['r2'],
            "Adjusted_R2": eval_res['adj_r2'],
            "Time": end - start,
        }

    # Transform dict to pandas dataset
    results_df = pd.DataFrame(results)
        
    return results_df, pipeline_model, predictions.toPandas()

In [None]:
# Train the model on the whole train / validation set
final_train_res, final_train_model, final_train_pred = evaluate_trained_model(df, params, MODEL_NAME, MODEL_TYPE, FEATURES_NORMALIZATION, CHOSEN_FEATURES, CHOSEN_FEATURES_LABEL, FEATURES_LABEL, TARGET_LABEL)
final_train_res

In [None]:
utilities.show_results(final_train_pred, MODEL_NAME)

In [None]:
# Define the results to show
valid_comparison_lst = [final_valid_res, final_train_res]

# Show the comparison table
valid_comparison_lst_df = pd.concat([utilities.model_comparison(results, model_info, evaluator_lst) for results in valid_comparison_lst])
valid_comparison_lst_df

# Comparison table
Visualization of model performance at various stages of train / validation

In [None]:
# Concatenate the results of all stages into a single pandas DataFrame
final_comparison_lst_df = pd.concat([simple_comparison_lst_df, tuned_comparison_lst_df, valid_comparison_lst_df], ignore_index=True)
final_comparison_lst_df

# Model accuracy ❗

In [16]:
# cv_info = parameters.get_cross_validation_params('multi_splits')
cv_info = parameters.get_cross_validation_params('block_splits')
# cv_info = parameters.get_cross_validation_params('walk_forward_splits')
cv_info

# From now on, only selected and normalized features will be considered
MODEL_TYPE = "hyp_tuning"
CHOSEN_FEATURES = ALL_FEATURES
CHOSEN_FEATURES_LABEL = ALL_FEATURES_NAME
FEATURES_NORMALIZATION = True

In [17]:
def get_best_model_params(model_name):
    # Best parameters found through hyperparameter tuning (single-value grids)
    if (model_name == 'LinearRegression'):
        params = {
            'maxIter' : [5],
            'regParam' : [0.8],
            'elasticNetParam' : [0.0]
        }
    elif (model_name == 'GeneralizedLinearRegression'):
        params = {
            'maxIter' : [5],
            'regParam' : [0.2],
            'family': ['gaussian'],
            'link': ['log']
        }
    elif (model_name == 'RandomForestRegressor'):
        params = {
            'numTrees' : [3],
            'maxDepth' : [10],
            'seed' : [42]
        }
    elif (model_name == 'GBTRegressor'):
        params = {
            'maxIter' : [30],
            'maxDepth' : [3],
            'stepSize': [0.4],
            'seed' : [42]
        }
    else:
        # Fail early instead of hitting an UnboundLocalError on return
        raise ValueError(f"Unknown model name: {model_name}")

    return params

In [18]:
# Get tuned parameters
params = get_best_model_params(MODEL_NAME)
params

{'numTrees': [3], 'maxDepth': [10], 'seed': [42]}

In [19]:
'''
Description: Return the dataset with the selected features
Args:
    dataset: The dataset from which to extract the features
    features_normalization: Indicates whether features should be normalized (True) or not (False)
    features: list of features to be extracted
    features_label: The column name of features
    target_label: The column name of target variable
Return: 
    dataset: Dataset with the selected features
'''
def select_features(dataset, features_normalization, features, features_label, target_label):
    if features_normalization:
        # Assemble the columns into a vector column
        assembler = VectorAssembler(inputCols = features, outputCol = "raw_features")
        df_vector  = assembler.transform(dataset).select("timestamp", "id", "market-price", "raw_features", target_label)

        # Create a Normalizer instance (rescales each row vector to unit norm, L2 by default)
        normalizer = Normalizer(inputCol="raw_features", outputCol=features_label)

        # Transform the data (Normalizer is a stateless transformer, so there is no fit step)
        dataset = normalizer.transform(df_vector).select("timestamp", "id", "market-price", features_label, target_label)
    else:
        # Assemble the columns into a vector column
        vectorAssembler = VectorAssembler(inputCols = features, outputCol = features_label)
        dataset = vectorAssembler.transform(dataset).select("timestamp", "id", "market-price", features_label, target_label)

    return dataset
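
One detail worth keeping in mind: Spark ML's `Normalizer` rescales each row (each feature vector) to unit norm, using the L2 norm by default; it does not scale each feature column independently. The pure-Python sketch below mimics that row-wise behavior on two made-up rows. The helper `l2_normalize_row` is hypothetical and only illustrates the arithmetic.

```python
import math

def l2_normalize_row(row):
    """Divide every entry of a row by the row's Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in row))
    # Leave all-zero rows untouched to avoid a division by zero
    return [x / norm for x in row] if norm > 0 else list(row)

# Two rows pointing in the same direction but with different magnitudes
rows = [[3.0, 4.0], [30.0, 40.0]]
normalized = [l2_normalize_row(r) for r in rows]
print(normalized)
```

Both rows end up with the same normalized vector, so the overall magnitude of a sample is lost. If per-feature scaling is what is wanted instead, `StandardScaler` or `MinMaxScaler` from `pyspark.ml.feature` are the usual candidates.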

In [20]:
'''
Description: Cross validation on time series data
Args:
    dataset: The dataset which needs to be split
    params: Parameters to test
    cv_info: The type of cross validation [multi_splits | block_splits | walk_forward_splits]
    model_name: Model name selected
    model_type: Model type [simple | simple_norm | hyp_tuning | cross_val | final_validated | final_trained]
    features_normalization: Indicates whether features should be normalized (True) or not (False)
    features: Features to be used to make predictions
    features_name: Name of features used
    features_label: The column name of features
    target_label: The column name of target variable
Return: 
    results_lst_df: All the splits performances in a pandas dataset
'''
def cross_validation(dataset, params, cv_info, model_name, model_type, features_normalization, features, features_name, features_label, target_label):
    # Select the type of features to be used
    dataset = select_features(dataset, features_normalization, features, features_label, target_label)

    # Get the number of samples
    num = dataset.count()
    
    # Save results in a list
    results_lst = []

    # Initialize an empty list to store predictions
    predictions_list = []  

    # Identify the type of cross validation 
    if cv_info['cv_type'] == 'multi_splits':
        split_position_df = utilities.multi_splits(num, cv_info['splits'])
    elif cv_info['cv_type'] == 'block_splits':
        split_position_df = utilities.block_splits(num, cv_info['splits'])
    elif cv_info['cv_type'] == 'walk_forward_splits':
        split_position_df = walk_forward_splits_new(num, cv_info['min_obser'], cv_info['sliding_window'])

    for position in split_position_df.itertuples():
        # Get the start/split/end position based on the type of cross validation
        start = getattr(position, 'start')
        splits = getattr(position, 'split')
        end = getattr(position, 'end')
        idx  = getattr(position, 'Index')
        
        # Train / validation size
        train_size = splits - start
        valid_size = end - splits

        # Get training data and validation data
        train_data = dataset.filter(dataset['id'].between(start, splits-1))
        valid_data = dataset.filter(dataset['id'].between(splits, end-1))

        # Cache them
        train_data.cache()
        valid_data.cache()
        
        # All combinations of params
        param_lst = [dict(zip(params, param)) for param in product(*params.values())]

        for param in param_lst:
            # Chosen Model
            model = utilities.model_selection(model_name, param, features_label, target_label)

            # Wrap the model in a Pipeline
            pipeline = Pipeline(stages=[model])

            # Train a model and measure the fitting time
            # (separate names, so the split positions start/end are not overwritten)
            fit_start = time.time()
            pipeline_model = pipeline.fit(train_data)
            fit_end = time.time()

            # Make predictions
            predictions = pipeline_model.transform(valid_data).select(target_label, "market-price", "prediction", 'timestamp')
            
            # Append predictions to the list
            predictions_list.append(predictions)  

            # Compute validation error by several evaluators
            eval_res = utilities.model_evaluation(target_label, predictions)

            # Use dict to store each result
            results = {
                "Model": model_name,
                "Type": model_type,
                "Cv": cv_info['cv_type'],
                "Features": features_name,
                "Splits": idx + 1,
                "Train&Validation": (train_size,valid_size),                
                "Parameters": list(param.values()),
                "RMSE": eval_res['rmse'],
                "MSE": eval_res['mse'],
                "MAE": eval_res['mae'],
                "MAPE": eval_res['mape'],
                "R2": eval_res['r2'],
                "Adjusted_R2": eval_res['adj_r2'],
                "Time": fit_end - fit_start,
            }

            # Store results for each split
            results_lst.append(results)
            print(results)

        # Release Cache
        train_data.unpersist()
        valid_data.unpersist()

    # Convert the list of result dicts to a pandas DataFrame
    results_lst_df = pd.DataFrame(results_lst)

    # Create an empty DataFrame with the same schema as the predictions dataset
    final_predictions = spark.createDataFrame([], schema=predictions_list[0].schema)

    # Iterate over the list of DataFrames and union them with the merged DataFrame
    for pred in predictions_list:
        final_predictions = final_predictions.union(pred)

    return results_lst_df, final_predictions.toPandas()
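
The grid expansion inside `cross_validation` can be tried in isolation. The following is a minimal sketch: the `params` dictionary and its key names (`numTrees`, `maxDepth`, `seed`) are assumptions made for illustration, since the notebook's real grid is defined elsewhere. It shows how each combination becomes one dictionary, and how `list(param.values())` yields the list stored in the "Parameters" column.

```python
from itertools import product

# Hypothetical grid: the key names are assumptions, not the notebook's actual params
params = {
    "numTrees": [3, 5],
    "maxDepth": [10],
    "seed": [42],
}

# Same expression used in cross_validation: one dict per combination of values
param_lst = [dict(zip(params, param)) for param in product(*params.values())]

for param in param_lst:
    # list(param.values()) is what ends up in the "Parameters" column
    print(param, "->", list(param.values()))
```

Because every combination is fitted once per split, the number of fits grows as the product of the list lengths times the number of splits.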

In [21]:
MODEL_TYPE = "final_validated"

# Performances on validated final model
final_valid_res, final_valid_pred = cross_validation(df, params, cv_info, MODEL_NAME, MODEL_TYPE, FEATURES_NORMALIZATION, CHOSEN_FEATURES, CHOSEN_FEATURES_LABEL, FEATURES_LABEL, TARGET_LABEL)
final_valid_res

{'Model': 'RandomForestRegressor', 'Type': 'final_validated', 'Cv': 'block_splits', 'Features': 'all_features', 'Splits': 1, 'Train&Validation': (10383, 2596), 'Parameters': [3, 10, 42], 'RMSE': 48.02398289916037, 'MSE': 2306.302933498848, 'MAE': 34.140728130413514, 'MAPE': 0.05798092849637084, 'R2': -8.852521975525253, 'Adjusted_R2': -8.86773235294791, 'Time': 5.2688515186309814}
{'Model': 'RandomForestRegressor', 'Type': 'final_validated', 'Cv': 'block_splits', 'Features': 'all_features', 'Splits': 2, 'Train&Validation': (10383, 2596), 'Parameters': [3, 10, 42], 'RMSE': 955.093230262068, 'MSE': 912203.0784924317, 'MAE': 847.3023256716388, 'MAPE': 0.3564685760850903, 'R2': -3.645366558354824, 'Adjusted_R2': -3.652538100706587, 'Time': 2.5356874465942383}
{'Model': 'RandomForestRegressor', 'Type': 'final_validated', 'Cv': 'block_splits', 'Features': 'all_features', 'Splits': 3, 'Train&Validation': (10383, 2596), 'Parameters': [3, 10, 42], 'RMSE': 1912.8355236380391, 'MSE': 3658939.7404

| | Model | Type | Cv | Features | Splits | Train&Validation | Parameters | RMSE | MSE | MAE | MAPE | R2 | Adjusted_R2 | Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RandomForestRegressor | final_validated | block_splits | all_features | 1 | (10383, 2596) | [3, 10, 42] | 48.023983 | 2306.303 | 34.140728 | 0.057981 | -8.852522 | -8.867732 | 5.268852 |
| 1 | RandomForestRegressor | final_validated | block_splits | all_features | 2 | (10383, 2596) | [3, 10, 42] | 955.09323 | 912203.1 | 847.302326 | 0.356469 | -3.645367 | -3.652538 | 2.535687 |
| 2 | RandomForestRegressor | final_validated | block_splits | all_features | 3 | (10383, 2596) | [3, 10, 42] | 1912.835524 | 3658940.0 | 1533.252833 | 0.164473 | -1.842462 | -1.84685 | 2.810883 |
| 3 | RandomForestRegressor | final_validated | block_splits | all_features | 4 | (10383, 2596) | [3, 10, 42] | 1750.903854 | 3065664.0 | 1328.745716 | 0.343583 | -0.967703 | -0.97074 | 2.871811 |
| 4 | RandomForestRegressor | final_validated | block_splits | all_features | 5 | (10383, 2596) | [3, 10, 42] | 959.056534 | 919789.4 | 836.793459 | 0.079478 | -1.366147 | -1.3698 | 3.363723 |
| 5 | RandomForestRegressor | final_validated | block_splits | all_features | 6 | (10383, 2596) | [3, 10, 42] | 1021.589306 | 1043645.0 | 859.429739 | 0.092387 | -0.238883 | -0.240796 | 2.619076 |
| 6 | RandomForestRegressor | final_validated | block_splits | all_features | 7 | (10383, 2596) | [3, 10, 42] | 8459.716422 | 71566800.0 | 6794.880209 | 0.14845 | -0.22299 | -0.224878 | 2.612045 |
| 7 | RandomForestRegressor | final_validated | block_splits | all_features | 8 | (10383, 2596) | [3, 10, 42] | 6703.82721 | 44941300.0 | 6326.364172 | 0.103287 | -3.735911 | -3.743222 | 2.387757 |
| 8 | RandomForestRegressor | final_validated | block_splits | all_features | 9 | (10383, 2596) | [3, 10, 42] | 1485.050452 | 2205375.0 | 1100.398799 | 0.048618 | -0.168768 | -0.170572 | 2.278728 |
| 9 | RandomForestRegressor | final_validated | block_splits | all_features | 10 | (10383, 2596) | [3, 10, 42] | 1138.729333 | 1296704.0 | 950.775333 | 0.033027 | 0.043796 | 0.04232 | 2.133784 |
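
Every row above reports the same (10383, 2596) train/validation sizes, which is what one would expect if the series is cut into equal, non-overlapping blocks and each block is split roughly 80/20. The sketch below only illustrates that idea: `block_splits_sketch`, the 0.8 training fraction, and the total of 129,790 rows are assumptions chosen to reproduce those sizes, not the actual `utilities.block_splits` implementation.

```python
def block_splits_sketch(num, n_splits, train_frac=0.8):
    """Hypothetical helper: cut num rows into n_splits equal blocks and
    split each block into a training part and a validation part."""
    block = num // n_splits
    rows = []
    for i in range(n_splits):
        start = i * block
        end = start + block
        split = start + int(train_frac * block)
        rows.append({"start": start, "split": split, "end": end})
    return rows

# Assumed total row count, picked so that each block holds 12,979 rows
rows = block_splits_sketch(129790, 10)

for r in rows[:2]:
    # Training ids are start..split-1, validation ids are split..end-1
    print(r, "train:", r["split"] - r["start"], "valid:", r["end"] - r["split"])
```

With this layout each validation window directly follows its own training window, so no model is evaluated on data that precedes its training data.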


---

In [22]:
'''
Description: How good the models are at predicting whether the price will go up or down

Given a dataset containing the columns timestamp, market-price, next-market-price and prediction,
the values of market-price, next-market-price and prediction are considered for each row.
If market-price < next-market-price, the actual price goes up; 
if the prediction for that row also satisfies market-price < prediction -> 1, the rise was predicted correctly (the same applies when the price goes down)

If instead market-price < next-market-price (the price goes up) but market-price > prediction -> 0, the prediction was wrong
This is repeated for the whole set under consideration; at the end, the share of 1s and 0s is reported as a percentage 
and, if the percentage of correct predictions exceeds 50%, the model could be said to have done well

Args:
    dataset: The dataset containing the market-price, next-market-price and prediction columns
    model_name: Model name selected (currently unused)
    model_type: Model type [simple | simple_norm | hyp_tuning | final_validated | final_trained] (currently unused)
Return: 
    accuracy: The percentage of correct predictions
'''
def model_accuracy(dataset, model_name, model_type):    
    # Compute the number of total rows in the DataFrame.
    total_rows = dataset.count()

    # Create a column "correct_prediction" which is worth 1 if the prediction is correct, otherwise 0
    dataset = dataset.withColumn(
        "correct_prediction",
        (
            (col("market-price") < col("next-market-price")) & (col("market-price") < col("prediction"))
        ) | (
            (col("market-price") > col("next-market-price")) & (col("market-price") > col("prediction"))
        )
    )

    # Count the number of correct predictions
    correct_predictions = dataset.filter(col("correct_prediction")).count()

    # Compute the percentage of correct predictions
    accuracy = (correct_predictions / total_rows) * 100
        
    return accuracy
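
The rule used by `model_accuracy` can be checked by hand on a few rows. The following is a minimal pure-Python sketch of the same comparison, not the Spark implementation: `directional_accuracy` and the sample prices are hypothetical and exist only for illustration. Note that a row whose price does not change matches neither strict inequality, so it counts as a wrong prediction.

```python
def directional_accuracy(rows):
    """rows: iterable of (market_price, next_market_price, prediction) tuples.
    Returns the percentage of rows whose predicted direction was correct."""
    rows = list(rows)
    if not rows:
        return 0.0  # avoid dividing by zero on an empty input
    correct = 0
    for price, next_price, pred in rows:
        predicted_rise = price < next_price and price < pred
        predicted_fall = price > next_price and price > pred
        if predicted_rise or predicted_fall:
            correct += 1
    return correct / len(rows) * 100

# Hypothetical sample rows
sample = [
    (100.0, 105.0, 103.0),  # price rises, rise predicted: correct
    (100.0, 95.0, 101.0),   # price falls, rise predicted: wrong
    (100.0, 98.0, 99.0),    # price falls, fall predicted: correct
    (100.0, 100.0, 102.0),  # price unchanged: counted as wrong
]

accuracy = directional_accuracy(sample)
print(f"Percentage of correct predictions: {accuracy:.2f}%")
```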

In [23]:
accuracy = model_accuracy(spark.createDataFrame(final_valid_pred), MODEL_NAME, MODEL_TYPE)
print(f"Percentage of correct predictions: {accuracy:.2f}%")

Percentage of correct predictions: 45.33%


# Saving trained model


In [24]:
# Saving final model results
final_comparison_lst_df.to_csv(MODEL_RESULTS, index=False)

NameError: name 'final_comparison_lst_df' is not defined

In [None]:
# Save the trained model
final_train_model.write().overwrite().save(MODEL)