In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [2]:
import warnings
warnings.simplefilter('ignore')

**Price Predictions for Bitcoin**

**1. Intro**


In this simple project, we will look at the FB Prophet algorithm for time series analysis. Other options include ARIMA and SARIMAX, as well as regression using algorithms not strictly designed for time series data, such as XGBoost and Random Forest. Moving past machine learning and on to deep learning, LSTM is an option, but it has a very long training time and, depending on the use case, the expected accuracy gain may be negligible or not realised at all. One major advantage of these ML and DL algorithms is that they can take multiple predictor variables much more easily than fbprophet. As we will see later, this is still possible in Prophet too.


The reason I like Prophet is that it requires little of the pre-processing that the statsmodels algorithms tend to need for best results, e.g. converting to logs, identifying potential seasonality, and separating the stationary and non-stationary parts of the data. This makes it more usable ‘out of the box’. I also appreciate the sklearn-style syntax, and that it infers potential patterns without you needing to feature engineer them. So... here we go.


First we import libraries and read in the data


In [3]:
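# Note: from version 1.0 onwards the package is published as `prophet`
# (pip install prophet; from prophet import Prophet). This notebook uses the older fbprophet name.
!pip install pystan
!pip install fbprophet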
import pandas as pd
import matplotlib.pyplot as plt
from fbprophet import Prophet
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
data = pd.read_csv("../input/bitcoin-price-usd/main.csv")

**2. Data Wrangling and Prep**

Prophet, in its default mode, takes a dataframe with exactly 2 columns: the datetime and the target variable, which in this case will be the closing price, as it is the most commonly used in financial markets. They must be labelled 'ds' and 'y'.

We will then convert the timestamps from UNIX time to something clearer. Although it does not apply to Prophet, I will include a small section on how we could then engineer some basic features which might aid prediction with XGBoost etc., including some simple moving averages, range measurements, and seasonality columns. As I said, these do not apply to the more basic Prophet model, but are there for completeness, in case you wish to run another model for comparison, or to improve the Prophet model, which we will get to later on.

In [4]:
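#Let's slice the dataframe to the 2 relevant columns, and see the data
#.copy() gives an independent frame, so the later column assignments do not trigger SettingWithCopyWarning
df = data[['Open Time','Close']].copy()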
df.head(10)
df.dtypes

In [5]:
#Convert UNIX time to pandas datetime object
df = df.rename({'Open Time': 'ds', 'Close': 'y'}, axis = 1)
df['ds']= pd.to_datetime(df['ds'],unit='ms')
df['ds'].head(5)

In [6]:
'''
#N/A in this simple case, but for other ML algos like XGBoost, we would feature engineer, e.g. as below
#(High and Low come from the full `data` frame, as df only holds ds and y):

#Add features for hour, day of week, month
df['Hour'] = df['ds'].dt.hour
df['Day'] = df['ds'].dt.dayofweek
df['Month'] = df['ds'].dt.month

#Add 9, 21, 50, 100 and 200 MAs
for n in [9,21,50, 100, 200]:
    name = "MA" + str(n)
    df[name] = df['y'].rolling(n).mean()

#Add 5, 10, 20, 50 period Ranges
for n in [5,10,20,50]:
    name = "Range" + str(n)
    df[name] = data['High'].rolling(n).max() - data['Low'].rolling(n).min()

'''

**3. We now create our model**

This involves splitting into test and train sets. sklearn's train_test_split shuffles the rows by default, which is useless for a time series, so we split by position instead. We then feed the training data into the model and fit it.

In [7]:
#Split train and test data by finding the df length and doing a 0.95:0.05 split [a 75% train share is more typical,
#but I have made the test set smaller simply to avoid excessive script run times]
print("DataFrame Length \n", df.shape[0])
print("Number of rows for train \n", int(df.shape[0] * 0.95))

In [8]:
train =df.iloc[:178902,:]
ytest = df.iloc[178902:]
#for just y values, you would use df['y'].iloc[178902:], but it will be clear further along why this wasn't done
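The hard-coded row index above could instead be derived from the split fraction. Below is a minimal sketch, assuming the frame has a 'ds' column as in this notebook; the helper name `chronological_split` and the tiny demo frame are my own illustrations and are not part of the original analysis.

```python
import pandas as pd

def chronological_split(frame: pd.DataFrame, train_frac: float = 0.95):
    """Split a time series frame into leading train rows and trailing test rows."""
    if not 0.0 < train_frac < 1.0:
        raise ValueError("train_frac must be strictly between 0 and 1")
    # Sort by time first so the test set is always the most recent data
    ordered = frame.sort_values("ds").reset_index(drop=True)
    cut = int(len(ordered) * train_frac)
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Tiny synthetic frame standing in for the real price data (hypothetical values)
demo = pd.DataFrame({
    "ds": pd.date_range("2021-01-01", periods=20, freq="min"),
    "y": range(20),
})
demo_train, demo_test = chronological_split(demo, train_frac=0.95)
print(len(demo_train), len(demo_test))
```

Because no shuffling takes place, every training timestamp comes before every test timestamp, which is the property we need for a time series.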

**4. We now use our model to do some prediction**

In fbprophet this is done through generation of a dataframe with upper and lower predicted limits, and a predicted midline value called yhat. We can also obtain a breakdown of the effect of the explanatory factors on the outcome

In [9]:
model = Prophet(daily_seasonality=True)
model.fit(train)
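forecast = model.make_future_dataframe(periods=9000)
#note: make_future_dataframe defaults to freq='D'; for intraday data, pass a freq that matches the spacing of the rows
#the parameter include_history=False would normally be used for testing only, but I have included history to allow us to visualise better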
forecast = model.predict(forecast)
forecast.head()
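The future dataframe is simply a column of timestamps named 'ds', and its spacing should match the spacing of the training data, otherwise the predictions will not line up with the test rows. As a rough sketch of that idea using pandas only, the helper below (`future_timestamps`, a hypothetical name of mine) infers the typical step from the history and extends it. This is an illustration under those assumptions, not how Prophet builds its own future frame.

```python
import pandas as pd

def future_timestamps(ds: pd.Series, periods: int) -> pd.DataFrame:
    """Build `periods` timestamps after the last one, spaced by the median step in `ds`."""
    ordered = ds.sort_values()
    step = ordered.diff().median()       # typical spacing between observations
    start = ordered.iloc[-1] + step      # first timestamp after the history ends
    return pd.DataFrame({"ds": pd.date_range(start=start, periods=periods, freq=step)})

# Hypothetical one-minute history, just for demonstration
history = pd.Series(pd.date_range("2021-01-01 00:00", periods=6, freq="min"))
future = future_timestamps(history, periods=3)
print(future)
```

A frame like this, with only a 'ds' column, is the shape of input that the model's predict step expects when no extra regressors are used.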

In [10]:
#Plot the Forecast
#plot1 = model.plot(forecast)
#This isn't very clear, let's do the first 100 results
plt.figure(figsize = (15,8))
plt.plot(forecast.ds.head(100), forecast.yhat.head(100))
plt.show()


In [11]:
#Prophet's built-in analysis of components
plot2 = model.plot_components(forecast)

**5. Evaluation**

Using this simple 'Out of the Box' method for Prophet, let's evaluate the predicted yhat values against the actual ones from our dataset. The metrics we need are those for evaluating regression problems. There are many to choose from, but I have used some of the most common ones.
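For reference, the three metrics used here can be written out directly with numpy. This is only a sketch to show what each number means; the function name `regression_metrics` and the sample values are made up for illustration.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, MAE and R squared for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)                      # penalises large misses heavily
    mae = np.mean(np.abs(errors))                   # average miss, in price units
    ss_res = np.sum(errors ** 2)                    # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": mse, "mae": mae, "r2": r2}

# Hypothetical actual and predicted prices
demo_metrics = regression_metrics([100.0, 110.0, 120.0], [102.0, 108.0, 123.0])
print(demo_metrics)
```

Note that R squared is not bounded below by zero: a model that does worse than simply predicting the mean of the actual values gets a negative score.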

In [12]:
#We use sklearn for the evaluation metrics here (Prophet's own diagnostics module is geared to cross-validation)
#For the test part of the dataframe, we do not know for sure if there are missing dates which might throw the prediction
#and actual series out of sync. A good approach is to create a joined table on 'ds', which ensures the two series
#are the same length and properly aligned

right_df = forecast[['ds','yhat']] 
new_df = pd.merge(ytest, right_df, how = 'inner', on = 'ds')

mse = mean_squared_error(new_df['y'],new_df['yhat'])
print("Mean Squared Error:", mse)

mae = mean_absolute_error(new_df['y'],new_df['yhat'])
print("Mean Absolute Error:", mae)

r2 = r2_score(new_df['y'],new_df['yhat'])
print("R Squared Score:", r2)
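Since an inner merge silently drops any timestamp that is missing from either side, it is worth checking how many test rows actually survive it before trusting the metrics. A small sketch with made-up frames (`actual` and `predicted` are hypothetical stand-ins for ytest and the forecast):

```python
import pandas as pd

# Actual values have a gap at 00:02; predictions have no row for 00:03
actual = pd.DataFrame({
    "ds": pd.to_datetime(["2021-01-01 00:00", "2021-01-01 00:01", "2021-01-01 00:03"]),
    "y": [10.0, 11.0, 13.0],
})
predicted = pd.DataFrame({
    "ds": pd.to_datetime(["2021-01-01 00:00", "2021-01-01 00:01", "2021-01-01 00:02"]),
    "yhat": [9.5, 11.5, 12.0],
})

# Only timestamps present in both frames are kept
aligned = pd.merge(actual, predicted, how="inner", on="ds")
coverage = len(aligned) / len(actual)
print(f"{len(aligned)} of {len(actual)} actual rows matched ({coverage:.0%})")
```

If the coverage is low, the scores describe only a small slice of the test period, and the timestamp spacing of the forecast is the first thing to check.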


**6. Follow-Up**

As is obvious from above, the model is pretty terrible. It's not really a surprise. Bitcoin is a very volatile and unpredictable asset, and the parameters in Prophet alone are not sufficient to predict its future state. Ideally, one needs a model that can incorporate additional factors into the regression as predictor variables.

Prophet can actually do this, by adding extra regressors with its add_regressor method. I will demonstrate this below (if it is not compiled on the notebook, it is only because it was taking too long on my PC!!)
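One thing to keep in mind is that Prophet needs a value for every extra regressor on every row it predicts, so the regressors have to be known at prediction time. One possible way to build columns that rely only on past information is to lag them by one row. The sketch below is pandas only, with a hypothetical `candles` frame and `_lag1` column names of my own choosing; it is not the approach taken in this notebook.

```python
import pandas as pd

# Hypothetical candle data, one row per minute
candles = pd.DataFrame({
    "ds": pd.date_range("2021-01-01", periods=5, freq="min"),
    "High": [11.0, 12.0, 13.0, 14.0, 15.0],
    "Low": [9.0, 10.0, 11.0, 12.0, 13.0],
    "y": [10.0, 11.0, 12.0, 13.0, 14.0],
})

lagged = candles.copy()
for col in ["High", "Low"]:
    # shift(1) moves each value down one row, so row t holds the value from t-1
    lagged[col + "_lag1"] = lagged[col].shift(1)

# Drop the same-period columns and the first row, which has no previous value
lagged = lagged.drop(columns=["High", "Low"]).dropna().reset_index(drop=True)
print(lagged)
```

The lagged columns could then be registered as regressors in the same way as the raw ones.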

One final note, is that where there are 
and see if this can improve our model

In [13]:
#First we trim our Dataframe to the relevant columns, and rename as needed
data = data[['Open Time','Open', 'High', 'Low', 'Close', 'Volume']]
data = data.rename({'Open Time': 'ds', 'Close': 'y'}, axis = 1)
data['ds']= pd.to_datetime(data['ds'],unit='ms')

In [14]:
#Now we add the needed features

#Add 9,21,50, 100 and 200 MAs
for n in [9,21,50, 100, 200]:
    name = "MA" + str(n)
    data[name] = data['y'].rolling(n).mean()

#Add 5,10,20,50 period Ranges
for n in [5,10,20,50]:
    name = "Range" + str(n)
    data[name] = data['High'].rolling(n).max() - data['Low'].rolling(n).min()

#Remove rows with NaN values created by the above (where there were insufficient observations to make the calculations)
data = data.dropna()

In [16]:
#next we divide it into train and test
train_X= data[:178902]
test_X= data[178902:]

In [17]:
#Now we add the Regressors
new_model= Prophet()
regressors = ['Open', 'High', 'Low', 'Volume', 'MA9', 'MA21', 'MA50', 'MA100', 'MA200', 'Range5', 'Range10', 'Range20', 'Range50']
for r in regressors:
    new_model.add_regressor(r)

#Fit the data to the updated model
new_model.fit(train_X)
future_data = new_model.make_future_dataframe(periods=1000)

#forecast the prices for the Test  data (a direct method vs the indirect method described above)
forecast_data = new_model.predict(test_X)
new_model.plot(forecast_data)

In [18]:
#Evaluate the new model
right_df = forecast_data[['ds','yhat']] 
new_df = pd.merge(test_X[['ds','y']], right_df, how = 'inner', on = 'ds')

mse = mean_squared_error(new_df['y'],new_df['yhat'])
print("Mean Squared Error:", mse)

mae = mean_absolute_error(new_df['y'],new_df['yhat'])
print("Mean Absolute Error:", mae)

r2 = r2_score(new_df['y'],new_df['yhat'])
print("R Squared Score:", r2)

Result! By adding regressors, we have improved our model considerably. It could still use more work, such as hyperparameter tuning with cross-validation, but we can clearly see here that we need more than just a time series of data to make predictions with Prophet. As a general algorithm, it failed to take into account the most common metrics used with financial data. This is where a data scientist with domain knowledge pays dividends over an 'out of the box' model.