# Bitcoin Price Prediction - Global Mode

* **Description**: COMP4103(Big Data)--Group Project
* **Author**: Aaron
* **Version**: 1.2 (The whole workflow with results)

**Updates:**
1. Add the final results

**Issues:**
N/A

**To be done**:
1. Statistical analysis of residuals: plot a histogram of the residuals and a Q-Q plot to check for normality (Linear Regression)

**Questions:**
N/A

**Future Work:**
1. **autoTuning()** and **tsCrossValidation()** take a long time because the search procedure itself is not distributed: parameter combinations are evaluated one after another. One possible approach is to override the following Spark classes.
    * CrossValidator: https://spark.apache.org/docs/3.1.1/api/python/_modules/pyspark/ml/tuning.html#CrossValidator
    * TrainValidationSplit: https://spark.apache.org/docs/3.1.1/api/python/_modules/pyspark/ml/tuning.html#TrainValidationSplit
2. More scikit-learn algorithms could be added through Spark UDFs.
    * Apache Arrow: https://spark.apache.org/docs/latest/api/python/user_guide/arrow_pandas.html
3. Linear model trees could be considered: https://medium.com/convoy-tech/the-best-of-both-worlds-linear-model-trees-7c9ce139767d
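
As a rough illustration of item 1, the grid could be scored concurrently from a thread pool on the driver instead of in a sequential loop. This is only a sketch under assumptions, not the notebook's method: `evaluate` is a hypothetical stand-in for "fit one pipeline and return its RMSE", and `parallel_grid_search` is a name introduced here for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def evaluate(param):
    # Hypothetical stand-in: in the notebook this would fit one pipeline
    # and return its test RMSE. Here it is a toy score for illustration.
    return abs(param["maxDepth"] - 5) + abs(param["numTrees"] - 10) / 10


def parallel_grid_search(params, score_fn, max_workers=4):
    # Expand the grid the same way autoTuning() does, then score the
    # combinations concurrently instead of one after another.
    grid = [dict(zip(params, combo)) for combo in product(*params.values())]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(score_fn, grid))
    # Pick the combination with the lowest score.
    best_idx = min(range(len(grid)), key=scores.__getitem__)
    return scores[best_idx], grid[best_idx]


best_score, best_param = parallel_grid_search(
    {"numTrees": [3, 5, 10, 20, 30], "maxDepth": [3, 5, 10]}, evaluate
)
print(best_score, best_param)
```

Spark's built-in `CrossValidator` and `TrainValidationSplit` take a similar route through their `parallelism` parameter, so overriding them would likely keep that mechanism and change only how the folds are built.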

**Unified DataSet Format**    
**TimeZone**: `UTC`
1. `id`: The sequential order of the row
2. `Timestamp`: The time of the record
3. `Close`: The original close price, without shifting
4. `NEXT_BTC_CLOSE`: The close price shifted by one step (the next record's close). This is our label.
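
To make the label definition concrete, the sketch below pairs each close with the close of the following row. It is an illustration only: the CSV used in this notebook was prepared elsewhere, `add_next_close` is a hypothetical helper, and dropping the last row (which has no successor) is an assumption.

```python
def add_next_close(closes):
    # Pair each close with the close of the following row.
    # The last row has no successor, so it is dropped here (an assumption).
    return [
        {"id": i, "Close": c, "NEXT_BTC_CLOSE": nxt}
        for i, (c, nxt) in enumerate(zip(closes, closes[1:]))
    ]


rows = add_next_close([58714.31, 58686.0, 58685.81, 58723.84])
for row in rows:
    print(row)
```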

## 1. Data loading and Preprocessing

### 1.1. Load related packages

In [1]:
# Apache Spark
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler,StandardScaler
from pyspark.ml.regression import LinearRegression, GeneralizedLinearRegression, DecisionTreeRegressor, RandomForestRegressor, GBTRegressor
from pyspark.ml.evaluation import RegressionEvaluator

# Python
import numpy as np
import pandas as pd
from itertools import product
import time

# Graph packages
# https://plotly.com/python/getting-started/#jupyterlab-support
# https://plotly.com/python/time-series/
import plotly.express as px

# Scikit-learn
from sklearn.metrics import mean_absolute_percentage_error

### 1.2. Create a SparkSession

In [2]:
# Start a SparkSession
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("Bitcoin Prediction") \
    .getOrCreate()

sc = spark.sparkContext

In [3]:
# Show the URL of SparkSession
spark

### 1.3. DataSet Checking

In [4]:
# Read csv file
filename = "bitcoin_10y_1min_interpolate.csv"

dataset = spark.read.format("csv") \
          .option("inferSchema",'True') \
          .option("header",True) \
          .load(filename)

In [5]:
# Check the schema of loaded dataSet
dataset.printSchema()

root
 |-- Timestamp: string (nullable = true)
 |-- Open: double (nullable = true)
 |-- High: double (nullable = true)
 |-- Low: double (nullable = true)
 |-- Close: double (nullable = true)
 |-- Volume_(BTC): double (nullable = true)
 |-- Volume_(Currency): double (nullable = true)
 |-- Weighted_Price: double (nullable = true)
 |-- NEXT_BTC_CLOSE: double (nullable = true)
 |-- id: integer (nullable = true)



In [6]:
# Check the num of partitions
dataset.rdd.getNumPartitions()

20

In [7]:
# Have a look at the original and shifted label columns
dataset.select("id","Timestamp","Close","NEXT_BTC_CLOSE").tail(5)

[Row(id=4856403, Timestamp='2021-03-30 23:55:00', Close=58714.31, NEXT_BTC_CLOSE=58686.0),
 Row(id=4856404, Timestamp='2021-03-30 23:56:00', Close=58686.0, NEXT_BTC_CLOSE=58685.81),
 Row(id=4856405, Timestamp='2021-03-30 23:57:00', Close=58685.81, NEXT_BTC_CLOSE=58723.84),
 Row(id=4856406, Timestamp='2021-03-30 23:58:00', Close=58723.84, NEXT_BTC_CLOSE=58760.59),
 Row(id=4856407, Timestamp='2021-03-30 23:59:00', Close=58760.59, NEXT_BTC_CLOSE=58778.18)]

In [8]:
# Check if there are any "nan", "Null" or empty in numerical columns
all_col = dataset.columns
#dataset.select([F.count(F.when(F.isnan(cols) | dataset[cols].isNull() | (dataset[cols] == ""), cols)).alias(cols) for cols in all_col]).show()

In [9]:
# Total rows
dataset.count()

4856408

## 2. Feature Engineering

### 2.1. Transform to MLlib required format

In [10]:
# labels and features
feature_cols = dataset.columns
# Gain the column list of features
non_feature_cols  = ['id',"NEXT_BTC_CLOSE",'Timestamp']
[feature_cols.remove(non_feature) for non_feature in non_feature_cols]

[None, None, None]

In [11]:
feature_cols

['Open',
 'High',
 'Low',
 'Close',
 'Volume_(BTC)',
 'Volume_(Currency)',
 'Weighted_Price']
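
The same feature list can also be built without mutating a list inside a comprehension (which is what produces the `[None, None, None]` output above). A minimal sketch, using the column names from the schema printed earlier:

```python
columns = ["Timestamp", "Open", "High", "Low", "Close", "Volume_(BTC)",
           "Volume_(Currency)", "Weighted_Price", "NEXT_BTC_CLOSE", "id"]
non_feature_cols = {"id", "NEXT_BTC_CLOSE", "Timestamp"}

# Keep every column that is not an identifier, a timestamp, or the label.
feature_cols = [c for c in columns if c not in non_feature_cols]
print(feature_cols)
```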

In [12]:
# Assemble all feature columns into a single vector column - VectorAssembler
vector_assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

### 2.2. Feature Standardization

In [13]:
# Standardize the "features" column
# Use the "scaledFeatures" column if the algorithm does not have the built-in param: standardization
standard_scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=False)
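
With `withStd=True` and `withMean=False`, each feature is divided by its standard deviation without being centered. The plain-Python sketch below shows the idea on a single column; it assumes the sample (n-1) standard deviation, and `scale_without_centering` is a name introduced here for illustration.

```python
from statistics import stdev


def scale_without_centering(values):
    # Divide by the sample standard deviation; do not subtract the mean.
    sd = stdev(values)
    return [v / sd for v in values]


scaled = scale_without_centering([2.0, 4.0, 6.0])
print(scaled)
```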

### 2.3. Train-test Split(Time series)

In [14]:
'''
Description: Split the dataSet while keeping the original time-series order
Args:
    dataSet: The dataSet to be split
    proportion: The fraction of rows assigned to the training set (e.g. 0.7)

Return: 
    train_data: The train dataSet
    test_data: The test dataSet
'''
def trainSplit(dataSet, proportion):
    # Use the dataSet argument (not the global `dataset`) so the function works on any input
    records_num = dataSet.count()
    split_point = round(records_num * proportion)
    
    train_data = dataSet.filter(F.col("id") < split_point)
    test_data = dataSet.filter(F.col("id") >= split_point)
    
    return (train_data,test_data)

In [15]:
# Split the dataSet: Train(70%), test(30%)
proportion = 0.7
train_data,test_data = trainSplit(dataset, proportion)

# Cache it
train_data.cache()
test_data.cache()

# Number of train and test dataSets
print(f"Training data: {train_data.count()}\nTest data: {test_data.count()}")

Training data: 3399486
Test data: 1456922
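
The counts above follow directly from the split arithmetic. The sketch below redoes it in plain Python, assuming `id` is contiguous and starts at 0 (consistent with the last row having `id=4856407` out of 4856408 rows); `split_point` is a helper name introduced here.

```python
def split_point(records_num, proportion):
    # Same arithmetic as trainSplit(): ids below the split point are training rows.
    return round(records_num * proportion)


records_num = 4856408
point = split_point(records_num, 0.7)
train_rows = point                 # ids 0 .. point-1
test_rows = records_num - point    # ids point .. records_num-1
print(train_rows, test_rows)
```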


## 3. Model Building

### 3.1. LinearRegression(Simple workflow demo)

In [16]:
# Train a LinearRegression model
lr = LinearRegression(featuresCol="features",labelCol="NEXT_BTC_CLOSE",maxIter=5, regParam=0.0, elasticNetParam=0.8)

In [17]:
# Chain assembler and linear regression in a Pipeline
pipeline = Pipeline(stages=[vector_assembler, lr])
pipeline_model = pipeline.fit(train_data)

In [18]:
# Make predictions
predictions = pipeline_model.transform(test_data)
predictions.select("Timestamp","NEXT_BTC_CLOSE","prediction").show(5)

+-------------------+--------------+-----------------+
|          Timestamp|NEXT_BTC_CLOSE|       prediction|
+-------------------+--------------+-----------------+
|2018-06-23 05:58:00|       6100.82| 6104.74202062437|
|2018-06-23 05:59:00|       6105.39|6101.757430469564|
|2018-06-23 06:00:00|        6085.6|6104.490407662455|
|2018-06-23 06:01:00|       6074.08|6087.143770105451|
|2018-06-23 06:02:00|        6073.6|6072.667359567475|
+-------------------+--------------+-----------------+
only showing top 5 rows



In [19]:
# Compute test error
rmse_evaluator = RegressionEvaluator(labelCol="NEXT_BTC_CLOSE", predictionCol="prediction", metricName='rmse')
mae_evaluator = RegressionEvaluator(labelCol="NEXT_BTC_CLOSE", predictionCol="prediction", metricName='mae')
r2_evaluator = RegressionEvaluator(labelCol="NEXT_BTC_CLOSE", predictionCol="prediction", metricName='r2')
var_evaluator = RegressionEvaluator(labelCol="NEXT_BTC_CLOSE", predictionCol="prediction", metricName='var')

rmse = rmse_evaluator.evaluate(predictions)
mae = mae_evaluator.evaluate(predictions)
var = var_evaluator.evaluate(predictions)
r2 = r2_evaluator.evaluate(predictions)

predictions_pd = predictions.select("NEXT_BTC_CLOSE","prediction").toPandas()
mape = mean_absolute_percentage_error(predictions_pd["NEXT_BTC_CLOSE"], predictions_pd["prediction"])

# Adjusted R-squared
# Note: p below counts every column of `predictions` (12 here), not only the 7 predictors
n = predictions.count()
p = len(predictions.columns)
adj_r2 = 1-(1-r2)*(n-1)/(n-p-1)

# Use dict to store each result
results = {
    "Model": "Linear Regression",
    "Proportion": proportion,
    "RMSE": rmse,
    "MAPE": mape,
    "MAE": mae,
    "Variance": var,
    "R2": r2,
    "Adjusted_R2": adj_r2,
}

results

{'Model': 'Linear Regression',
 'Proportion': 0.7,
 'RMSE': 23.08776704182093,
 'MAPE': 0.000680338742099268,
 'MAE': 9.368862815098803,
 'Variance': 127357630.08188389,
 'R2': 0.9999958145719638,
 'Adjusted_R2': 0.9999958145374901}
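
The adjusted R-squared used above is `1 - (1 - R2) * (n - 1) / (n - p - 1)`, where `n` is the number of evaluated rows and `p` is conventionally the number of predictors. The sketch below recomputes it for two choices of `p`: 12, which is what `len(predictions.columns)` appears to yield here (10 input columns plus `features` and `prediction`), and 7, the number of feature columns. With about 1.46 million test rows the difference is negligible.

```python
def adjusted_r2(r2, n, p):
    # n: number of evaluated rows, p: number of predictors.
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


r2 = 0.9999958145719638
n = 1456922
print(adjusted_r2(r2, n, 12))  # p as counted by len(predictions.columns)
print(adjusted_r2(r2, n, 7))   # p as the number of feature columns
```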

In [20]:
# Release Cache
train_data.unpersist()
test_data.unpersist()

DataFrame[Timestamp: string, Open: double, High: double, Low: double, Close: double, Volume_(BTC): double, Volume_(Currency): double, Weighted_Price: double, NEXT_BTC_CLOSE: double, id: int]

## 4. Parameter Tuning

In [21]:
'''
Description: Use Grid Search to tune the Model 
Args:
    dataSet: The dataSet to be split
    proportion_lst: A list of train/test split proportions to try
    feature_col: The column name of features
    label_col: The column name of label
    ml_model: The name of the model to use
    params: A dict mapping each parameter name to the list of values to test 
    assembler: An assembler to dataSet
    scaler: A scaler to dataSet
Return: 
    results_df: The best result in a pandas dataframe
'''
def autoTuning(dataSet, proportion_lst, feature_col, label_col, ml_model, params, assembler, scaler):
    
    # Initialize the best result for comparison
    result_best = {"RMSE": float('inf')}
    
    # Try different proportions 
    for proportion in proportion_lst:
        # Split the dataSet
        train_data,test_data = trainSplit(dataSet, proportion)
    
        # Cache it
        train_data.cache()
        test_data.cache()
    
        # ALL combination of params
        param_lst = [dict(zip(params, param)) for param in product(*params.values())]
    
        for param in param_lst:
            # Chosen Model
            if ml_model == "LinearRegression":
                model = LinearRegression(featuresCol=feature_col, \
                                         labelCol=label_col, \
                                         maxIter=param['maxIter'], \
                                         regParam=param['regParam'], \
                                         elasticNetParam=param['elasticNetParam'])
            
            elif ml_model == "GeneralizedLinearRegression":
                model = GeneralizedLinearRegression(featuresCol=feature_col, \
                                                    labelCol=label_col, \
                                                    maxIter=param['maxIter'], \
                                                    regParam=param['regParam'], \
                                                    family=param['family'], \
                                                    link=param['link'])
            
            elif ml_model == "DecisionTree":
                model = DecisionTreeRegressor(featuresCol=feature_col, \
                                              labelCol=label_col, \
                                              maxDepth = param["maxDepth"], \
                                              seed=0)
            
            elif ml_model == "RandomForest":
                model = RandomForestRegressor(featuresCol=feature_col, \
                                              labelCol=label_col, \
                                              numTrees = param["numTrees"], \
                                              maxDepth = param["maxDepth"], \
                                              seed=0)
            
            elif ml_model == "GBTRegression":
                model = GBTRegressor(featuresCol=feature_col, \
                                     labelCol=label_col, \
                                     maxIter = param['maxIter'], \
                                     maxDepth = param['maxDepth'], \
                                     stepSize = param['stepSize'], \
                                     seed=0)
            
            # Chain assembler and model in a Pipeline
            pipeline = Pipeline(stages=[assembler, model])
            # Train a model and calculate running time
            start = time.time()
            pipeline_model = pipeline.fit(train_data)
            end = time.time()

            # Make predictions
            predictions = pipeline_model.transform(test_data)

            # Compute test error by several evaluators
            # https://spark.apache.org/docs/3.1.1/mllib-evaluation-metrics.html#regression-model-evaluation
            # https://spark.apache.org/docs/3.1.1/api/scala/org/apache/spark/ml/evaluation/RegressionEvaluator.html
            rmse_evaluator = RegressionEvaluator(labelCol=label_col, predictionCol="prediction", metricName='rmse')
            mae_evaluator = RegressionEvaluator(labelCol=label_col, predictionCol="prediction", metricName='mae')
            r2_evaluator = RegressionEvaluator(labelCol=label_col, predictionCol="prediction", metricName='r2')
            var_evaluator = RegressionEvaluator(labelCol=label_col, predictionCol="prediction", metricName='var')
            
            predictions_pd = predictions.select("NEXT_BTC_CLOSE","prediction").toPandas()
            mape = mean_absolute_percentage_error(predictions_pd["NEXT_BTC_CLOSE"], predictions_pd["prediction"])
            
            rmse = rmse_evaluator.evaluate(predictions)
            mae = mae_evaluator.evaluate(predictions)
            var = var_evaluator.evaluate(predictions)
            r2 = r2_evaluator.evaluate(predictions)
            # Adjusted R-squared (note: p counts all columns of `predictions`, not only the predictors)
            n = predictions.count()
            p = len(predictions.columns)
            adj_r2 = 1-(1-r2)*(n-1)/(n-p-1)
        
            # Use dict to store each result
            results = {
                "Model": ml_model,
                "Proportion": proportion,
                "Parameters": [list(param.values())],
                "RMSE": rmse,
                "MAPE":mape,
                "MAE": mae,
                "Variance": var,
                "R2": r2,
                "Adjusted_R2": adj_r2,
                "Time": end - start,
                "Predictions": predictions.select("NEXT_BTC_CLOSE","prediction",'Timestamp')
            }
            
            # Only store the lowest RMSE
            if results['RMSE'] < result_best['RMSE']:
                result_best = results
                
        # Release Cache
        train_data.unpersist()
        test_data.unpersist()
        
    # Transform dict to pandas dataframe
    results_df = pd.DataFrame(result_best)
    return results_df
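
The grid expansion inside `autoTuning()` relies on `itertools.product`. A small standalone sketch of that one line, with a toy parameter dict chosen here for illustration:

```python
from itertools import product

params = {"numTrees": [3, 5], "maxDepth": [3, 5, 10]}

# One dict per combination: the keys come from `params`, the values from the
# Cartesian product of the value lists.
param_lst = [dict(zip(params, combo)) for combo in product(*params.values())]
print(len(param_lst))
print(param_lst[0])
```

The number of fits grows multiplicatively: the LinearRegression grid in section 4.1 has 5 x 5 x 5 = 125 combinations, and with four split proportions that is 500 pipeline fits.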

In [22]:
# Define a function to plot line-like graph
# https://plotly.com/python/time-series/#time-series-with-range-selector-buttons
'''
Description: Plot a line graph with plotly (custom design)
Args:
    data: The data (pandas dataframe) to plot as lines
    graph_title: The title of the graph
Return: None
'''
def line_plot(data,graph_title):
    plot = px.line(data,title=graph_title)
    plot.update_xaxes(
        rangeslider_visible=True,
        rangeselector=dict(
            buttons=list([
                dict(count=7, label="1w", step="day", stepmode="backward"),
                dict(count=1, label="1m", step="month", stepmode="backward"),
                dict(count=6, label="6m", step="month", stepmode="backward"),
                dict(count=1, label="1y", step="year", stepmode="backward"),
                dict(step="all")
            ])
        )
    )
    plot.show()

In [23]:
# Draw the Prediction Graph
'''
Description: Plot a line graph of the Prediction
Args:
    result: The result from the "autoTuning" Func
    graph_title: The title of the graph
Return: None
'''
def drawPrediction(result, graph_title):
    result_pd = result['Predictions'][0].withColumn("Time", F.to_timestamp("Timestamp", 'yyyy-MM-dd HH:mm:ss')) \
                                        .drop("Timestamp") \
                                        .toPandas() \
                                        .set_index('Time')
    # Display the info of the best trained Model
    print(result.iloc[0,:-1])
    # Draw by plotly
    line_plot(result_pd, graph_title)

In [24]:
# Parameter choosing - only use the most popular/useful params
# https://spark.apache.org/docs/3.1.1/ml-classification-regression.html#regression
# The explanations of the params are taken from the above document

In [25]:
# Split proportion list
proportion_lst = [0.6, 0.7, 0.8, 0.9]

### 4.1. Linear Regression Model - Prediction

In [26]:
# LinearRegression
lr_params = {
    'maxIter' : [5, 10, 50, 80, 100], # max number of iterations (>=0), default:100
    'regParam' : np.arange(0,1,0.2).round(decimals=2),# regularization parameter (>=0), default:0.0
    'elasticNetParam' : np.arange(0,1,0.2).round(decimals=2) # the ElasticNet mixing parameter, [0, 1], default:0.0 
}
result_lr = autoTuning(dataset, proportion_lst, "features", "NEXT_BTC_CLOSE", "LinearRegression", lr_params, vector_assembler ,standard_scaler)
# Visualization
drawPrediction(result_lr,"Predict by LinearRegression")

In [27]:
# Generalized linear regression
glr_params = {
    'maxIter' : [5, 10, 50, 80], # max number of iterations (>=0), default:25
    'regParam' : [0, 0.1, 0.2],# regularization parameter (>=0), default:0.0
    'family': ['gaussian', 'gamma'], # The name of family which is a description of the error distribution to be used in the model.
    'link': ['identity', 'inverse'] # which provides the relationship between the linear predictor and the mean of the distribution function.
}
result_glr = autoTuning(dataset, proportion_lst, "features", "NEXT_BTC_CLOSE", "GeneralizedLinearRegression", glr_params, vector_assembler ,standard_scaler)
# Visualization
drawPrediction(result_glr,"Predict by GeneralizedLinearRegression")

### 4.2. Tree based Model - Prediction

In [28]:
# RandomForest
rf_params = {
    'numTrees' : [3, 5, 10, 20, 30],# Number of trees to train, >=1, default:20
    'maxDepth' : [3, 5, 10] # Maximum depth of the tree, <=30, default:5
}
result_rf = autoTuning(dataset, proportion_lst, "features", "NEXT_BTC_CLOSE", "RandomForest", rf_params, vector_assembler ,standard_scaler)
# Visualization
drawPrediction(result_rf,"Predict by RandomForest")

Model          RandomForest
Proportion              0.7
Parameters          [5, 10]
RMSE                9335.58
MAPE                0.16048
MAE                 3553.66
Variance        2.29335e+07
R2                  0.31568
Adjusted_R2        0.315675
Time                23.4275
Name: 0, dtype: object


In [29]:
# GBTRegression
gb_params = {
    'maxIter' : [20, 40, 60], # max number of iterations (>=0), default:20
    'maxDepth' : [5, 8, 10], # Maximum depth of the tree (>=0), <=30, default:5
    'stepSize': [0.1, 0.3, 0.5, 0.7] # learning rate, [0,1], default:0.1
}
result_gb = autoTuning(dataset, proportion_lst, "features", "NEXT_BTC_CLOSE", "GBTRegression", gb_params, vector_assembler ,standard_scaler)
# Visualization
drawPrediction(result_gb,"Predict by GBTRegression")

Model          GBTRegression
Proportion               0.7
Parameters      [40, 5, 0.3]
RMSE                 8090.42
MAPE                0.128075
MAE                  2977.23
Variance         2.78414e+07
R2                  0.486053
Adjusted_R2         0.486048
Time                 168.555
Name: 0, dtype: object


## 5. Time Series Cross Validation 
> https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/     
> https://hub.packtpub.com/cross-validation-strategies-for-time-series-forecasting-tutorial/        
> https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html   

### 5.1. Multiple Splits Time Series Cross Validation

In [31]:
'''
Description: Multiple Splits Cross Validation on Time Series data
Args:
    num: Number of rows in the dataSet
    n_splits: Number of splits
Return: 
    split_position_df: The (start, split, end) positions of every split in a Pandas dataframe
'''
def mulTsCrossValidation(num, n_splits):
    split_position_lst = []
    # Calculate the split position for each time 
    for i in range(1, n_splits+1):
        # Calculate train size and test size
        train_size = i * num // (n_splits + 1) + num % (n_splits + 1)
        test_size = num //(n_splits + 1)

        # Calculate the start/split/end point for each fold
        start = 0
        split = train_size
        end = train_size + test_size
        
        # Clamp the end so it does not go beyond the size of the dataSet
        if end > num:
            end = num
        split_position_lst.append((start,split,end))
        
    # Transform the split position list to a Pandas Dataframe
    split_position_df = pd.DataFrame(split_position_lst,columns=['start','split','end'])
    return split_position_df
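
To see what these positions look like for this dataset, the sketch below restates the same arithmetic in plain Python (tuples instead of a pandas DataFrame; `expanding_splits` is a name introduced here) and prints the train and test sizes for 4856408 rows and 5 splits.

```python
def expanding_splits(num, n_splits):
    # Plain-Python restatement of mulTsCrossValidation(): returns
    # (start, split, end) tuples instead of a pandas DataFrame.
    fold = num // (n_splits + 1)
    remainder = num % (n_splits + 1)
    splits = []
    for i in range(1, n_splits + 1):
        split = i * num // (n_splits + 1) + remainder
        end = min(split + fold, num)  # clamp to the dataset size
        splits.append((0, split, end))
    return splits


sizes = [(split - start, end - split)
         for start, split, end in expanding_splits(4856408, 5)]
for train_size, test_size in sizes:
    print(f"train={train_size:>8}  test={test_size:>7}")
```

Every fold starts at row 0, so the training window grows while the test window keeps roughly the same length.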

### 5.2. Blocked Time Series Cross Validation

In [32]:
'''
Description: Blocked Time Series Cross Validation
Args:
    num: Number of rows in the dataSet
    n_splits: Number of splits (blocks)
Return: 
    split_position_df: The (start, split, end) positions of every split in a Pandas dataframe
'''
def blockedTsCrossValidation(num, n_splits):
    kfold_size = num // n_splits

    split_position_lst = []
    # Calculate the split position for each time 
    for i in range(n_splits):
        # Calculate the start/split/end point for each fold
        start = i * kfold_size
        end = start + kfold_size
        # Manually set train-test split proportion in each fold
        split = int(0.8 * (end - start)) + start
        split_position_lst.append((start,split,end))
        
    # Transform the split position list to a Pandas Dataframe
    split_position_df = pd.DataFrame(split_position_lst,columns=['start','split','end'])
    return split_position_df
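
A plain-Python restatement of the blocked scheme, for illustration only: `blocked_splits` and its `train_fraction` argument are names introduced here (the function above hard-codes 0.8).

```python
def blocked_splits(num, n_splits, train_fraction=0.8):
    # Each block is independent: it has its own train part and test part.
    block = num // n_splits
    splits = []
    for i in range(n_splits):
        start = i * block
        end = start + block
        split = start + int(train_fraction * block)
        splits.append((start, split, end))
    return splits


splits = blocked_splits(4856408, 10)
print(splits[0], splits[-1])
```

With 4856408 rows and 10 blocks, every block has 388512 training rows and 97128 test rows, and the trailing `num % n_splits` rows (8 here) fall outside every block.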

### 5.3. Walk Forward Validation

In [33]:
'''
Description: Walk Forward Validation on Time Series data
Args:
    num: Number of rows in the dataSet
    min_obser: Minimum number of observations in the first training window
    expand_window: Number of rows the training window grows by at each step (also the test size)
Return: 
    split_position_df: The (start, split, end) positions of every split in a Pandas dataframe
'''
def wfTsCrossValidation(num, min_obser, expand_window):
    split_position_lst = []
    # Calculate the split position for each time 
    for i in range(min_obser,num,expand_window):
        # Calculate the start/split/end point for each fold
        start = 0
        split = i
        end = split + expand_window
        
        # Clamp the end so it does not go beyond the size of the dataSet
        if end > num:
            end = num
        split_position_lst.append((start,split,end))
        
    # Transform the split position list to a Pandas Dataframe
    split_position_df = pd.DataFrame(split_position_lst,columns=['start','split','end'])
    return split_position_df
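
The walk-forward positions are easiest to follow on a tiny example. The sketch below restates the loop in plain Python (`walk_forward_splits` is a name introduced here) and also counts the folds produced by a setting of `min_obser=4856359` and a step of 1 on 4856408 rows.

```python
def walk_forward_splits(num, min_obser, step):
    # The training window always starts at 0 and grows by `step` rows;
    # the following `step` rows form the test window.
    splits = []
    for split in range(min_obser, num, step):
        end = min(split + step, num)
        splits.append((0, split, end))
    return splits


small = walk_forward_splits(10, 7, 1)
print(small)
n_folds = len(walk_forward_splits(4856408, 4856359, 1))
print(n_folds)
```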

### 5.4. Time Series Cross Validation 

In [34]:
'''
Description: Cross Validation on Time Series data
Args:
    dataSet: The dataSet to be split
    feature_col: The column name of features
    label_col: The column name of label
    ml_model: The name of the model to use
    params: A dict mapping each parameter name to the list of values to test 
    assembler: An assembler to dataSet
    scaler: A scaler to dataSet
    cv_info: A dict describing the type of Cross Validation and its settings
Return: 
    tsCv_df: The performance of each model on every split in a pandas dataframe
'''
def tsCrossValidation(dataSet, feature_col, label_col, ml_model, params, assembler, scaler, cv_info):
    
    # Get the number of samples
    num = dataSet.count()
    
    # Save results in a list
    result_lst = []
    
    # ALL combination of params
    param_lst = [dict(zip(params, param)) for param in product(*params.values())]

    for param in param_lst:
        # Chosen Model
        if ml_model == "LinearRegression":
            model = LinearRegression(featuresCol=feature_col, \
                                     labelCol=label_col, \
                                     maxIter=param['maxIter'], \
                                     regParam=param['regParam'], \
                                     elasticNetParam=param['elasticNetParam'])

        elif ml_model == "GeneralizedLinearRegression":
            model = GeneralizedLinearRegression(featuresCol=feature_col, \
                                                labelCol=label_col, \
                                                maxIter=param['maxIter'], \
                                                regParam=param['regParam'], \
                                                family=param['family'], \
                                                link=param['link'])

        elif ml_model == "DecisionTree":
            model = DecisionTreeRegressor(featuresCol=feature_col, \
                                          labelCol=label_col, \
                                          maxDepth = param["maxDepth"], \
                                          seed=0)

        elif ml_model == "RandomForest":
            model = RandomForestRegressor(featuresCol=feature_col, \
                                          labelCol=label_col, \
                                          numTrees = param["numTrees"], \
                                          maxDepth = param["maxDepth"], \
                                          seed=0)

        elif ml_model == "GBTRegression":
            model = GBTRegressor(featuresCol=feature_col, \
                                 labelCol=label_col, \
                                 maxIter = param['maxIter'], \
                                 maxDepth = param['maxDepth'], \
                                 stepSize = param['stepSize'], \
                                 seed=0)
            
        
        
        # Identify the type of Cross Validation 
        if cv_info['cv_type'] == 'mulTs':
            split_position_df = mulTsCrossValidation(num, cv_info['kSplits'])
        elif cv_info['cv_type'] == 'blkTs':
            split_position_df = blockedTsCrossValidation(num, cv_info['kSplits'])
        elif cv_info['cv_type'] == 'wfTs':
            split_position_df = wfTsCrossValidation(num, cv_info['min_obser'], cv_info['expand_window'])
            

        for position in split_position_df.itertuples():
            # Get the start/split/end position from a kind of Time Series Cross Validation
            start = getattr(position, 'start')
            splits = getattr(position, 'split')
            end = getattr(position, 'end')
            idx  = getattr(position, 'Index')
            
            # Train/Test size
            train_size = splits - start
            test_size = end - splits

            # Get training data and test data
            train_data = dataSet.filter(F.col("id").between(start, splits-1))
            test_data = dataSet.filter(F.col("id").between(splits, end-1))

            # Cache it
            train_data.cache()
            test_data.cache()

            # Chain assembler and model in a Pipeline
            pipeline = Pipeline(stages=[assembler, model])
            # Train a model and calculate running time
            start = time.time()
            pipeline_model = pipeline.fit(train_data)
            end = time.time()

            # Make predictions
            predictions = pipeline_model.transform(test_data)

            # Compute test error by several evaluator
            rmse_evaluator = RegressionEvaluator(labelCol=label_col, predictionCol="prediction", metricName='rmse')
            mae_evaluator = RegressionEvaluator(labelCol=label_col, predictionCol="prediction", metricName='mae')
            r2_evaluator = RegressionEvaluator(labelCol=label_col, predictionCol="prediction", metricName='r2')
            var_evaluator = RegressionEvaluator(labelCol=label_col, predictionCol="prediction", metricName='var')
            
            predictions_pd = predictions.select("NEXT_BTC_CLOSE","prediction").toPandas()
            mape = mean_absolute_percentage_error(predictions_pd["NEXT_BTC_CLOSE"], predictions_pd["prediction"])

            rmse = rmse_evaluator.evaluate(predictions)
            mae = mae_evaluator.evaluate(predictions)
            var = var_evaluator.evaluate(predictions)
            r2 = r2_evaluator.evaluate(predictions)
            # Adjusted R-squared (note: p counts all columns of `predictions`, not only the predictors)
            n = predictions.count()
            p = len(predictions.columns)
            adj_r2 = 1-(1-r2)*(n-1)/(n-p-1)

            # Use dict to store each result
            results = {
                "Model": ml_model,
                'CV_type': cv_info['cv_type'],
                "Splits": idx + 1,
                "Train&Test": (train_size,test_size),
                "Parameters": list(param.values()),
                "RMSE": rmse,
                "MAPE": mape,
                "MAE": mae,
                "Variance": var,
                "R2": r2,
                "Adjusted_R2": adj_r2,
                "Time": end - start
            }
            
            # Store each splits result
            result_lst.append(results)
            
            # Release Cache
            train_data.unpersist()
            test_data.unpersist()

    # Transform dict to pandas dataframe
    tsCv_df = pd.DataFrame(result_lst)
    return tsCv_df

In [37]:
## Cross Validation Parameter
# Multiple Splits Time Series Cross Validation
mul_cv = {'cv_type':'mulTs',
          'kSplits': 5}

# Blocked Time Series Cross Validation
blk_cv = {'cv_type':'blkTs',
          'kSplits': 10}

# Walk Forward Validation over the last 49 one-step folds (4856408 - 4856359)
wf_cv = {'cv_type':'wfTs',
         'min_obser': 4856359,
         'expand_window': 1}

#### 5.4.1. LinearRegression Validation

In [36]:
# LinearRegression
lr_params = {
    'maxIter' : [5], # max number of iterations (>=0), default:100
    'regParam' : [0.0],# regularization parameter (>=0), default:0.0
    'elasticNetParam' : [0.8] # the ElasticNet mixing parameter, [0, 1], default:0.0 
}
lr_mul_cv = tsCrossValidation(dataset, "features", "NEXT_BTC_CLOSE", "LinearRegression", lr_params, vector_assembler ,standard_scaler, mul_cv)
lr_mul_cv

| | Model | CV_type | Splits | Train&Test | Parameters | RMSE | MAPE | MAE | Variance | R2 | Adjusted_R2 | Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LinearRegression | mulTs | 1 | (809403, 809401) | [5, 0.0, 0.8] | 1.366073 | 0.001256 | 0.612516 | 52481.68 | 0.999964 | 0.999964 | 4.625429 |
| 1 | LinearRegression | mulTs | 2 | (1618804, 809401) | [5, 0.0, 0.8] | 0.935522 | 0.001143 | 0.248898 | 18639.08 | 0.999953 | 0.999953 | 4.872537 |
| 2 | LinearRegression | mulTs | 3 | (2428206, 809401) | [5, 0.0, 0.8] | 13.59195 | 0.000985 | 5.412886 | 18965150.0 | 0.99999 | 0.99999 | 5.327785 |
| 3 | LinearRegression | mulTs | 4 | (3237607, 809401) | [5, 0.0, 0.8] | 8.830537 | 0.000657 | 4.810164 | 5626686.0 | 0.999986 | 0.999986 | 6.095136 |
| 4 | LinearRegression | mulTs | 5 | (4047008, 809400) | [5, 0.0, 0.8] | 30.03242 | 0.000733 | 13.404001 | 185817900.0 | 0.999995 | 0.999995 | 6.843345 |


In [38]:
lr_blk_cv = tsCrossValidation(dataset, "features", "NEXT_BTC_CLOSE", "LinearRegression", lr_params, vector_assembler ,standard_scaler, blk_cv)
lr_blk_cv

| | Model | CV_type | Splits | Train&Test | Parameters | RMSE | MAPE | MAE | Variance | R2 | Adjusted_R2 | Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LinearRegression | blkTs | 1 | (388512, 97128) | [5, 0.0, 0.8] | 0.010439 | 0.000255 | 0.002908 | 0.467659 | 0.999767 | 0.999767 | 3.31711 |
| 1 | LinearRegression | blkTs | 2 | (388512, 97128) | [5, 0.0, 0.8] | 0.329218 | 0.000937 | 0.139805 | 945.3162 | 0.999885 | 0.999885 | 3.700262 |
| 2 | LinearRegression | blkTs | 3 | (388512, 97128) | [5, 0.0, 0.8] | 0.76736 | 0.000973 | 0.441272 | 5216.484 | 0.999887 | 0.999887 | 3.786622 |
| 3 | LinearRegression | blkTs | 4 | (388512, 97128) | [5, 0.0, 0.8] | 0.260701 | 0.000577 | 0.146041 | 642.6075 | 0.999894 | 0.999894 | 3.677978 |
| 4 | LinearRegression | blkTs | 5 | (388512, 97128) | [5, 0.0, 0.8] | 2.425555 | 0.004829 | 0.520071 | 2306.649 | 0.997453 | 0.997452 | 3.343266 |
| 5 | LinearRegression | blkTs | 6 | (388512, 97128) | [5, 0.0, 0.8] | 4.338398 | 0.001176 | 2.78738 | 84506.83 | 0.999777 | 0.999777 | 3.182063 |
| 6 | LinearRegression | blkTs | 7 | (388512, 97128) | [5, 0.0, 0.8] | 8.15655 | 0.000688 | 5.547771 | 969618.0 | 0.999931 | 0.999931 | 3.380068 |
| 7 | LinearRegression | blkTs | 8 | (388512, 97128) | [5, 0.0, 0.8] | 7.433761 | 0.000621 | 3.821281 | 1649642.0 | 0.999966 | 0.999966 | 3.441067 |
| 8 | LinearRegression | blkTs | 9 | (388512, 97128) | [5, 0.0, 0.8] | 11.280208 | 0.001 | 6.802791 | 1568764.0 | 0.999919 | 0.999919 | 3.52152 |
| 9 | LinearRegression | blkTs | 10 | (388512, 97128) | [5, 0.0, 0.8] | 68.313149 | 0.00101 | 47.46478 | 75988910.0 | 0.999939 | 0.999939 | 3.468341 |


In [39]:
lr_wf_cv = tsCrossValidation(dataset, "features", "NEXT_BTC_CLOSE", "LinearRegression", lr_params, vector_assembler ,standard_scaler, wf_cv)
lr_wf_cv

Unnamed: 0,Model,CV_type,Splits,Train&Test,Parameters,RMSE,MAPE,MAE,Variance,R2,Adjusted_R2,Time
0,LinearRegression,wfTs,1,"(4856359, 1)","[5, 0.0, 0.8]",6.410132,0.000109,6.410132,41.089788,-inf,,7.713154
1,LinearRegression,wfTs,2,"(4856360, 1)","[5, 0.0, 0.8]",2.135334,3.6e-05,2.135334,4.559651,-inf,,7.805584
2,LinearRegression,wfTs,3,"(4856361, 1)","[5, 0.0, 0.8]",28.938096,0.000493,28.938096,837.413416,-inf,,7.877965
3,LinearRegression,wfTs,4,"(4856362, 1)","[5, 0.0, 0.8]",35.871947,0.000611,35.871947,1286.79657,-inf,,8.024721
4,LinearRegression,wfTs,5,"(4856363, 1)","[5, 0.0, 0.8]",57.297181,0.000977,57.297181,3282.96692,-inf,,8.112156
5,LinearRegression,wfTs,6,"(4856364, 1)","[5, 0.0, 0.8]",12.726607,0.000217,12.726607,161.966535,-inf,,7.866153
6,LinearRegression,wfTs,7,"(4856365, 1)","[5, 0.0, 0.8]",1.909822,3.3e-05,1.909822,3.64742,-inf,,8.65425
7,LinearRegression,wfTs,8,"(4856366, 1)","[5, 0.0, 0.8]",7.355302,0.000125,7.355302,54.100463,-inf,,8.01069
8,LinearRegression,wfTs,9,"(4856367, 1)","[5, 0.0, 0.8]",6.841876,0.000117,6.841876,46.811273,-inf,,7.833155
9,LinearRegression,wfTs,10,"(4856368, 1)","[5, 0.0, 0.8]",21.152901,0.000361,21.152901,447.445234,-inf,,7.956287
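
In the walk-forward results above, every test set holds a single observation. That is why RMSE equals MAE in each row, and presumably why `R2` shows `-inf` and `Adjusted_R2` is empty: with one test point the total sum of squares around the test mean is zero, so $R^2 = 1 - SS_{res}/SS_{tot}$ is not defined. One possible workaround, sketched below, is to pool the held-out predictions of all walk-forward splits and score them once. The helper `r2` and the numbers are hypothetical and do not come from `tsCrossValidation()`.

```python
import math

def r2(y_true, y_pred):
    """Coefficient of determination; NaN when the targets have zero spread."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        # A single test point (or constant targets) leaves R2 undefined.
        return math.nan
    return 1.0 - ss_res / ss_tot

# Hypothetical (actual, predicted) pairs, one per walk-forward split.
splits = [(100.0, 101.0), (102.0, 101.5), (105.0, 104.0), (103.0, 103.5)]

# Scoring each one-point split separately gives an undefined R2 every time.
per_split = [r2([t], [p]) for t, p in splits]

# Pooling the held-out points across splits gives a usable score.
pooled = r2([t for t, _ in splits], [p for _, p in splits])

print(per_split)
print(pooled)
```

Pooling keeps the walk-forward training scheme unchanged; only the scoring step moves from per-split to once over all held-out points.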


#### 5.4.2. Generalized Linear Regression Validation

In [40]:
# Generalized linear regression
glr_params = {
    'maxIter' : [10], # max number of iterations (>=0), default:25
    'regParam' : [0], # regularization parameter (>=0), default:0.0
    'family': ['gaussian'], # error distribution assumed by the model
    'link': ['identity'] # link function relating the linear predictor to the mean of the distribution
}
glr_mul_cv = tsCrossValidation(dataset, "features", "NEXT_BTC_CLOSE", "GeneralizedLinearRegression", glr_params, vector_assembler ,standard_scaler, mul_cv)
glr_mul_cv

Unnamed: 0,Model,CV_type,Splits,Train&Test,Parameters,RMSE,MAPE,MAE,Variance,R2,Adjusted_R2,Time
0,GeneralizedLinearRegression,mulTs,1,"(809403, 809401)","[10, 0, gaussian, identity]",1.366073,0.001256,0.612516,52481.68,0.999964,0.999964,3.211656
1,GeneralizedLinearRegression,mulTs,2,"(1618804, 809401)","[10, 0, gaussian, identity]",0.935522,0.001143,0.248898,18639.08,0.999953,0.999953,4.256966
2,GeneralizedLinearRegression,mulTs,3,"(2428206, 809401)","[10, 0, gaussian, identity]",13.59195,0.000985,5.412886,18965150.0,0.99999,0.99999,4.251082
3,GeneralizedLinearRegression,mulTs,4,"(3237607, 809401)","[10, 0, gaussian, identity]",8.830537,0.000657,4.810164,5626686.0,0.999986,0.999986,4.835094
4,GeneralizedLinearRegression,mulTs,5,"(4047008, 809400)","[10, 0, gaussian, identity]",30.03242,0.000733,13.404001,185817900.0,0.999995,0.999995,4.872096


In [41]:
glr_blk_cv = tsCrossValidation(dataset, "features", "NEXT_BTC_CLOSE", "GeneralizedLinearRegression", glr_params, vector_assembler ,standard_scaler, blk_cv)
glr_blk_cv

Unnamed: 0,Model,CV_type,Splits,Train&Test,Parameters,RMSE,MAPE,MAE,Variance,R2,Adjusted_R2,Time
0,GeneralizedLinearRegression,blkTs,1,"(388512, 97128)","[10, 0, gaussian, identity]",0.010439,0.000255,0.002908,0.467659,0.999767,0.999767,2.655176
1,GeneralizedLinearRegression,blkTs,2,"(388512, 97128)","[10, 0, gaussian, identity]",0.329218,0.000937,0.139805,945.3162,0.999885,0.999885,2.912059
2,GeneralizedLinearRegression,blkTs,3,"(388512, 97128)","[10, 0, gaussian, identity]",0.76736,0.000973,0.441272,5216.484,0.999887,0.999887,3.014059
3,GeneralizedLinearRegression,blkTs,4,"(388512, 97128)","[10, 0, gaussian, identity]",0.260701,0.000577,0.146041,642.6075,0.999894,0.999894,2.916116
4,GeneralizedLinearRegression,blkTs,5,"(388512, 97128)","[10, 0, gaussian, identity]",2.425555,0.004829,0.520071,2306.649,0.997453,0.997452,2.719054
5,GeneralizedLinearRegression,blkTs,6,"(388512, 97128)","[10, 0, gaussian, identity]",4.338398,0.001176,2.78738,84506.83,0.999777,0.999777,2.672051
6,GeneralizedLinearRegression,blkTs,7,"(388512, 97128)","[10, 0, gaussian, identity]",8.15655,0.000688,5.547771,969618.0,0.999931,0.999931,2.715182
7,GeneralizedLinearRegression,blkTs,8,"(388512, 97128)","[10, 0, gaussian, identity]",7.433761,0.000621,3.821281,1649642.0,0.999966,0.999966,2.91106
8,GeneralizedLinearRegression,blkTs,9,"(388512, 97128)","[10, 0, gaussian, identity]",11.280208,0.001,6.802791,1568764.0,0.999919,0.999919,2.848171
9,GeneralizedLinearRegression,blkTs,10,"(388512, 97128)","[10, 0, gaussian, identity]",68.313149,0.00101,47.46478,75988910.0,0.999939,0.999939,2.944083


In [42]:
glr_wf_cv = tsCrossValidation(dataset, "features", "NEXT_BTC_CLOSE", "GeneralizedLinearRegression", glr_params, vector_assembler ,standard_scaler, wf_cv)
glr_wf_cv

Unnamed: 0,Model,CV_type,Splits,Train&Test,Parameters,RMSE,MAPE,MAE,Variance,R2,Adjusted_R2,Time
0,GeneralizedLinearRegression,wfTs,1,"(4856359, 1)","[10, 0, gaussian, identity]",6.410132,0.000109,6.410132,41.089788,-inf,,6.072119
1,GeneralizedLinearRegression,wfTs,2,"(4856360, 1)","[10, 0, gaussian, identity]",2.135334,3.6e-05,2.135334,4.55965,-inf,,5.927115
2,GeneralizedLinearRegression,wfTs,3,"(4856361, 1)","[10, 0, gaussian, identity]",28.938096,0.000493,28.938096,837.413416,-inf,,6.173125
3,GeneralizedLinearRegression,wfTs,4,"(4856362, 1)","[10, 0, gaussian, identity]",35.871947,0.000611,35.871947,1286.796569,-inf,,6.451129
4,GeneralizedLinearRegression,wfTs,5,"(4856363, 1)","[10, 0, gaussian, identity]",57.297181,0.000977,57.297181,3282.966922,-inf,,6.22612
5,GeneralizedLinearRegression,wfTs,6,"(4856364, 1)","[10, 0, gaussian, identity]",12.726607,0.000217,12.726607,161.966537,-inf,,6.313744
6,GeneralizedLinearRegression,wfTs,7,"(4856365, 1)","[10, 0, gaussian, identity]",1.909822,3.3e-05,1.909822,3.64742,-inf,,6.240277
7,GeneralizedLinearRegression,wfTs,8,"(4856366, 1)","[10, 0, gaussian, identity]",7.355302,0.000125,7.355302,54.100463,-inf,,6.17312
8,GeneralizedLinearRegression,wfTs,9,"(4856367, 1)","[10, 0, gaussian, identity]",6.841876,0.000117,6.841876,46.811273,-inf,,5.924188
9,GeneralizedLinearRegression,wfTs,10,"(4856368, 1)","[10, 0, gaussian, identity]",21.152901,0.000361,21.152901,447.445231,-inf,,6.179151


#### 5.4.3. RandomForest Validation

In [43]:
rf_params = {
    'numTrees' : [5],# Number of trees to train, >=1, default:20
    'maxDepth' : [10] # Maximum depth of the tree, <=30, default:5
}
rf_mul_cv = tsCrossValidation(dataset, "features", "NEXT_BTC_CLOSE", "RandomForest", rf_params, vector_assembler ,standard_scaler, mul_cv)
rf_mul_cv

Unnamed: 0,Model,CV_type,Splits,Train&Test,Parameters,RMSE,MAPE,MAE,Variance,R2,Adjusted_R2,Time
0,RandomForest,mulTs,1,"(809403, 809401)","[5, 10]",351.515235,0.504572,281.271832,79894.6,-1.354177,-1.354211,10.70024
1,RandomForest,mulTs,2,"(1618804, 809401)","[5, 10]",17.682341,0.04124,12.998234,19940.27,0.983227,0.983226,12.395513
2,RandomForest,mulTs,3,"(2428206, 809401)","[5, 10]",5258.263045,0.459848,3061.772203,9371884.0,-0.458168,-0.458189,15.274308
3,RandomForest,mulTs,4,"(3237607, 809401)","[5, 10]",1104.8611,0.119616,843.623513,6615315.0,0.783033,0.783029,20.716533
4,RandomForest,mulTs,5,"(4047008, 809400)","[5, 10]",10989.855879,0.139607,4745.627951,37645830.0,0.350035,0.350025,23.449244


In [44]:
rf_blk_cv = tsCrossValidation(dataset, "features", "NEXT_BTC_CLOSE", "RandomForest", rf_params, vector_assembler ,standard_scaler, blk_cv)
rf_blk_cv

Unnamed: 0,Model,CV_type,Splits,Train&Test,Parameters,RMSE,MAPE,MAE,Variance,R2,Adjusted_R2,Time
0,RandomForest,blkTs,1,"(388512, 97128)","[5, 10]",0.24546,0.015894,0.187042,0.6553693,0.871169,0.871153,8.148999
1,RandomForest,blkTs,2,"(388512, 97128)","[5, 10]",12.881155,0.03915,6.914689,552.1326,0.824489,0.824467,8.678176
2,RandomForest,blkTs,3,"(388512, 97128)","[5, 10]",13.150895,0.021935,9.041445,5671.155,0.966855,0.966851,9.991845
3,RandomForest,blkTs,4,"(388512, 97128)","[5, 10]",3.576121,0.008668,2.388849,703.3641,0.980104,0.980102,9.457256
4,RandomForest,blkTs,5,"(388512, 97128)","[5, 10]",74.163257,0.094393,61.015094,3694.015,-1.381557,-1.381852,8.71091
5,RandomForest,blkTs,6,"(388512, 97128)","[5, 10]",633.485863,0.226869,573.634702,330782.5,-3.748733,-3.74932,8.489157
6,RandomForest,blkTs,7,"(388512, 97128)","[5, 10]",127.021944,0.013337,104.172232,1025163.0,0.983359,0.983357,8.851895
7,RandomForest,blkTs,8,"(388512, 97128)","[5, 10]",333.969051,0.047057,258.135086,1775488.0,0.932385,0.932377,9.140229
8,RandomForest,blkTs,9,"(388512, 97128)","[5, 10]",592.967194,0.055913,325.248486,988897.2,0.775863,0.775835,9.684478
9,RandomForest,blkTs,10,"(388512, 97128)","[5, 10]",12659.077342,0.203978,10727.303241,114758600.0,-1.108708,-1.108969,9.30901


In [45]:
rf_wf_cv = tsCrossValidation(dataset, "features", "NEXT_BTC_CLOSE", "RandomForest", rf_params, vector_assembler ,standard_scaler, wf_cv)
rf_wf_cv

Unnamed: 0,Model,CV_type,Splits,Train&Test,Parameters,RMSE,MAPE,MAE,Variance,R2,Adjusted_R2,Time
0,RandomForest,wfTs,1,"(4856359, 1)","[5, 10]",10309.198323,0.175588,10309.198323,106279600.0,-inf,,25.597913
1,RandomForest,wfTs,2,"(4856360, 1)","[5, 10]",10090.824991,0.171891,10090.824991,101824700.0,-inf,,25.329065
2,RandomForest,wfTs,3,"(4856361, 1)","[5, 10]",10962.858509,0.186654,10962.858509,120184300.0,-inf,,24.51133
3,RandomForest,wfTs,4,"(4856362, 1)","[5, 10]",11895.726341,0.202668,11895.726341,141508300.0,-inf,,26.054279
4,RandomForest,wfTs,5,"(4856363, 1)","[5, 10]",11494.661753,0.196033,11494.661753,132127200.0,-inf,,28.610012
5,RandomForest,wfTs,6,"(4856364, 1)","[5, 10]",9987.145637,0.170365,9987.145637,99743080.0,-inf,,26.959577
6,RandomForest,wfTs,7,"(4856365, 1)","[5, 10]",10758.576541,0.183533,10758.576541,115747000.0,-inf,,27.444233
7,RandomForest,wfTs,8,"(4856366, 1)","[5, 10]",11738.851146,0.200235,11738.851146,137800600.0,-inf,,27.178553
8,RandomForest,wfTs,9,"(4856367, 1)","[5, 10]",11012.023003,0.187817,11012.023003,121264700.0,-inf,,28.042655
9,RandomForest,wfTs,10,"(4856368, 1)","[5, 10]",8011.526013,0.136583,8011.526013,64184550.0,-inf,,27.845516


#### 5.4.4. GBTRegression Validation

In [46]:
gb_params = {
    'maxIter' : [40], # max number of iterations (>=0), default:20
    'maxDepth' : [5], # Maximum depth of the tree (>=0), <=30, default:5
    'stepSize': [0.3] # learning rate, (0,1], default:0.1
}
gb_mul_cv = tsCrossValidation(dataset, "features", "NEXT_BTC_CLOSE", "GBTRegression", gb_params, vector_assembler ,standard_scaler, mul_cv)
gb_mul_cv

Unnamed: 0,Model,CV_type,Splits,Train&Test,Parameters,RMSE,MAPE,MAE,Variance,R2,Adjusted_R2,Time
0,GBTRegression,mulTs,1,"(809403, 809401)","[40, 5, 0.3]",304.124379,0.411859,234.294797,59469.04,-0.762192,-0.762218,52.052768
1,GBTRegression,mulTs,2,"(1618804, 809401)","[40, 5, 0.3]",16.935725,0.04017,12.673021,19746.42,0.984613,0.984613,63.991497
2,GBTRegression,mulTs,3,"(2428206, 809401)","[40, 5, 0.3]",5046.938453,0.409534,2864.826508,8310177.0,-0.343318,-0.343338,121.272183
3,GBTRegression,mulTs,4,"(3237607, 809401)","[40, 5, 0.3]",954.67573,0.100484,705.534389,6473980.0,0.838009,0.838007,176.76786
4,GBTRegression,mulTs,5,"(4047008, 809400)","[40, 5, 0.3]",11035.333384,0.133965,4677.757736,37718770.0,0.344645,0.344635,243.22043


In [47]:
gb_blk_cv = tsCrossValidation(dataset, "features", "NEXT_BTC_CLOSE", "GBTRegression", gb_params, vector_assembler ,standard_scaler, blk_cv)
gb_blk_cv

Unnamed: 0,Model,CV_type,Splits,Train&Test,Parameters,RMSE,MAPE,MAE,Variance,R2,Adjusted_R2,Time
0,GBTRegression,blkTs,1,"(388512, 97128)","[40, 5, 0.3]",0.216869,0.01345,0.157667,0.5860867,0.899433,0.899421,39.846949
1,GBTRegression,blkTs,2,"(388512, 97128)","[40, 5, 0.3]",10.149855,0.03116,5.358995,831.6581,0.891028,0.891015,41.014925
2,GBTRegression,blkTs,3,"(388512, 97128)","[40, 5, 0.3]",11.780647,0.019741,8.321268,5470.287,0.973402,0.973399,43.392517
3,GBTRegression,blkTs,4,"(388512, 97128)","[40, 5, 0.3]",3.53125,0.008578,2.361131,701.2716,0.9806,0.980598,42.093172
4,GBTRegression,blkTs,5,"(388512, 97128)","[40, 5, 0.3]",72.701613,0.092485,59.733615,3539.424,-1.288609,-1.288892,42.091254
5,GBTRegression,blkTs,6,"(388512, 97128)","[40, 5, 0.3]",628.012678,0.224381,567.701285,323768.4,-3.667031,-3.667608,43.951389
6,GBTRegression,blkTs,7,"(388512, 97128)","[40, 5, 0.3]",128.678271,0.013624,106.632673,1024729.0,0.982922,0.98292,43.213658
7,GBTRegression,blkTs,8,"(388512, 97128)","[40, 5, 0.3]",257.940747,0.034981,194.151209,1679424.0,0.959666,0.959661,43.327354
8,GBTRegression,blkTs,9,"(388512, 97128)","[40, 5, 0.3]",580.910108,0.054826,319.16439,994769.3,0.784885,0.784859,43.267266
9,GBTRegression,blkTs,10,"(388512, 97128)","[40, 5, 0.3]",11501.53676,0.183975,9681.641599,97666900.0,-0.740701,-0.740916,43.379082


In [48]:
gb_wf_cv = tsCrossValidation(dataset, "features", "NEXT_BTC_CLOSE", "GBTRegression", gb_params, vector_assembler ,standard_scaler, wf_cv)
gb_wf_cv

Unnamed: 0,Model,CV_type,Splits,Train&Test,Parameters,RMSE,MAPE,MAE,Variance,R2,Adjusted_R2,Time
0,GBTRegression,wfTs,1,"(4856359, 1)","[40, 5, 0.3]",2890.492341,0.049231,2890.492341,8354946.0,-inf,,311.532754
1,GBTRegression,wfTs,2,"(4856360, 1)","[40, 5, 0.3]",6838.085423,0.116482,6838.085423,46759410.0,-inf,,295.566624
2,GBTRegression,wfTs,3,"(4856361, 1)","[40, 5, 0.3]",4419.999577,0.075255,4419.999577,19536400.0,-inf,,303.142009
3,GBTRegression,wfTs,4,"(4856362, 1)","[40, 5, 0.3]",3971.490656,0.067662,3971.490656,15772740.0,-inf,,300.848879
4,GBTRegression,wfTs,5,"(4856363, 1)","[40, 5, 0.3]",9411.128135,0.1605,9411.128135,88569330.0,-inf,,299.834889
5,GBTRegression,wfTs,6,"(4856364, 1)","[40, 5, 0.3]",2643.75192,0.045098,2643.75192,6989424.0,-inf,,295.209719
6,GBTRegression,wfTs,7,"(4856365, 1)","[40, 5, 0.3]",7388.959484,0.12605,7388.959484,54596720.0,-inf,,294.508795
7,GBTRegression,wfTs,8,"(4856366, 1)","[40, 5, 0.3]",9479.654503,0.161699,9479.654503,89863850.0,-inf,,294.856564
8,GBTRegression,wfTs,9,"(4856367, 1)","[40, 5, 0.3]",6095.821747,0.103968,6095.821747,37159040.0,-inf,,290.341172
9,GBTRegression,wfTs,10,"(4856368, 1)","[40, 5, 0.3]",1360.662367,0.023197,1360.662367,1851402.0,-inf,,295.397956


## 6. Summary

### 6.1. Model Comparison Table

In [49]:
'''
Description: Aggregate the Time Series Cross Validation results of one model into a single row of the Model Comparison Table
Args:
    cv_result: The pandas dataframe returned by tsCrossValidation()
    model_info: The model information columns to show
    evaluator_lst: The evaluation metric columns to average over all splits
Return: 
    comparison_df: A one-row pandas dataframe for one model on one type of Time Series Cross Validation
'''
def modelComparison(cv_result, model_info, evaluator_lst):
    # Calculate the mean of each chosen metric over all splits
    col_mean_df = cv_result[evaluator_lst].mean().to_frame().T
    # Extract model info from the first split (identical across splits)
    model_info_df = cv_result[model_info][:1]
    # Concatenate side by side (column-wise) into a single row
    comparison_df = pd.concat([model_info_df,col_mean_df],axis=1)
    return comparison_df
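
A minimal usage sketch of the aggregation above, assuming a toy cross-validation result with two splits. The model name `ToyModel` and all numbers are made up for illustration; the function body is repeated only so the block runs on its own.

```python
import pandas as pd

def modelComparison(cv_result, model_info, evaluator_lst):
    # Mean of each chosen metric over all splits, as a one-row frame
    col_mean_df = cv_result[evaluator_lst].mean().to_frame().T
    # Model info taken from the first split
    model_info_df = cv_result[model_info][:1]
    # Column-wise concatenation into a single row
    return pd.concat([model_info_df, col_mean_df], axis=1)

# Toy result in the same layout as the tsCrossValidation() output (made-up values)
toy_cv = pd.DataFrame({
    "Model": ["ToyModel", "ToyModel"],
    "CV_type": ["mulTs", "mulTs"],
    "Parameters": [[1, 0.0], [1, 0.0]],
    "RMSE": [2.0, 4.0],
    "Time": [1.0, 3.0],
})

summary = modelComparison(toy_cv, ["Model", "CV_type", "Parameters"], ["RMSE", "Time"])
print(summary)
```

Because `pd.concat(..., axis=1)` aligns rows by index label, the two pieces only line up when the first row of `cv_result` has index `0`, which holds for a default pandas index.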

In [50]:
# Define which model_info columns and evaluators appear in the Model Comparison Table
model_info = ['Model','CV_type','Parameters']
evaluator_lst = ['RMSE','MAPE','MAE','Variance','R2','Adjusted_R2','Time']

# The Cross Validation results to compare
comparison_lst = [lr_mul_cv,lr_blk_cv,lr_wf_cv,glr_mul_cv,glr_blk_cv,glr_wf_cv,rf_mul_cv,rf_blk_cv,rf_wf_cv,gb_mul_cv,gb_blk_cv,gb_wf_cv]

In [51]:
# Show the Comparison Table
pd.concat([modelComparison(cv_result,model_info,evaluator_lst) for cv_result in comparison_lst])

Unnamed: 0,Model,CV_type,Parameters,RMSE,MAPE,MAE,Variance,R2,Adjusted_R2,Time
0,LinearRegression,mulTs,"[5, 0.0, 0.8]",10.951301,0.000955,4.897693,42096170.0,0.999978,0.999978,5.552847
0,LinearRegression,blkTs,"[5, 0.0, 0.8]",10.331534,0.001206,6.76741,8027055.0,0.999642,0.999642,3.48183
0,LinearRegression,wfTs,"[5, 0.0, 0.8]",22.281872,0.00038,22.281872,788.3044,-inf,,7.992324
0,GeneralizedLinearRegression,mulTs,"[10, 0, gaussian, identity]",10.951301,0.000955,4.897693,42096170.0,0.999978,0.999978,4.285379
0,GeneralizedLinearRegression,blkTs,"[10, 0, gaussian, identity]",10.331534,0.001206,6.76741,8027055.0,0.999642,0.999642,2.830701
0,GeneralizedLinearRegression,wfTs,"[10, 0, gaussian, identity]",22.281872,0.00038,22.281872,788.3044,-inf,,6.219181
0,RandomForest,mulTs,"[5, 10]",3544.43552,0.252977,1789.058747,10746570.0,0.06079,0.060776,16.507168
0,RandomForest,blkTs,"[5, 10]",1445.053828,0.072719,1206.804086,11888960.0,0.009523,0.0094,9.046196
0,RandomForest,wfTs,"[5, 10]",11352.12541,0.193476,11352.12541,135618600.0,-inf,,28.045077
0,GBTRegression,mulTs,"[40, 5, 0.3]",3471.601534,0.219202,1699.01729,10516430.0,0.212351,0.21234,131.460948
