The main goal of this project is to predict whether the Bitcoin price one day ahead will be higher or lower than the current price, based on the previous 30 days, using a recurrent neural network built from LSTM cells. I tried several features, such as trading volume and page views. The project is entirely my own work, but the model part is based on sentdex's video on recurrent neural networks.

Here we import the necessary modules.

In [58]:
import pandas as pd
import requests
from datetime import datetime
from sklearn import preprocessing
from collections import deque
import numpy as np
import random
import time
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM, BatchNormalization
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.callbacks import ModelCheckpoint
import talib

The Bitcoin price and volume data come from cryptocompare.com, and the page views are Wikipedia page views, obtained by searching for the Bitcoin topic.

In [61]:
df_btc=pd.read_csv('BTC_USD_Coinbase_day_2020-05-17.csv',index_col='datetime')
page_views=pd.read_csv('pageviews-20150701-20200516.csv',index_col='Date')
print(page_views)

            Bitcoin
Date               
2015-07-01    12957
2015-07-02     9802
2015-07-03     8307
2015-07-04     8947
2015-07-05     8692
...             ...
2020-05-12    10677
2020-05-13     8349
2020-05-14     8061
2020-05-15     7443
2020-05-16     7142

[1782 rows x 1 columns]


Combine the two data frames into one and create new features: RSI (relative strength index), the 30-day rolling mean and standard deviation, and the future price, which is the current price shifted by one day.

In [62]:
df = pd.DataFrame({'BTC-USD': df_btc.close,
                   'BTC_Volume': df_btc.volumefrom,
                   'views': page_views.Bitcoin
                   })
df['future']=df['BTC-USD'].shift(-1)
df['rsi'] = talib.RSI(df['BTC-USD'].values, timeperiod=14)
df['std']=df['BTC-USD'].rolling(window=30).std()
df['mean']=df['BTC-USD'].rolling(window=30).mean()
df=df[['BTC-USD','future','BTC_Volume','views','rsi','std','mean']]
# Treat zeros as missing values first, then drop every incomplete row
df=df.replace(0,np.nan)
df=df.dropna()
print(df)

            BTC-USD   future  BTC_Volume    views        rsi          std  \
2015-07-01   257.97   255.22     8150.38  12957.0  63.516932    11.714312   
2015-07-02   255.22   256.36     6288.44   9802.0  60.149131    11.729520   
2015-07-03   256.36   260.72     4850.58   8307.0  61.070622    11.719544   
2015-07-04   260.72   271.15     4119.09   8947.0  64.455850    11.753758   
2015-07-05   271.15   270.41     7902.94   8692.0  70.961211    12.366152   
...             ...      ...         ...      ...        ...          ...   
2020-05-12  8821.42  9321.26    19825.54  10677.0  56.001358  1008.175343   
2020-05-13  9321.26  9795.34    20859.19   8349.0  61.535348  1006.165262   
2020-05-14  9795.34  9312.10    27425.67   8061.0  65.914358  1019.242768   
2020-05-15  9312.10  9383.16    22369.96   7443.0  58.592039   988.772268   
2020-05-16  9383.16  9761.46    10642.35   7142.0  59.307895   978.191598   

                   mean  
2015-07-01   239.563000  
2015-07-02   240.562000
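
To make the shifted and rolling columns easier to follow, here is a minimal sketch on a toy series. The prices and the window of 3 are made up for illustration (the notebook uses 30), and the RSI column is left out because it comes from talib.

```python
import pandas as pd

# Toy closing prices (made-up values, not the Bitcoin data)
toy = pd.DataFrame({"price": [10.0, 11.0, 12.0, 11.0, 13.0]})

# The next row's price becomes this row's "future"; the last row gets NaN
toy["future"] = toy["price"].shift(-1)

# Rolling statistics need a full window, so the first WINDOW-1 rows are NaN
WINDOW = 3  # hypothetical; the notebook uses 30
toy["mean"] = toy["price"].rolling(window=WINDOW).mean()
toy["std"] = toy["price"].rolling(window=WINDOW).std()

# dropna() removes the incomplete rows at both ends
complete = toy.dropna()
print(toy)
print(complete)
```

Only the rows that have both a full rolling window and a known next-day price survive dropna(), which is why the real data frame loses rows at the start and at the end.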

Now I create the target: 1 if the future price is higher than the current price (worth buying) and 0 otherwise (not worth buying).

In [63]:
def classify(current,future):
    if float(future)>float(current):
        return 1
    else:
        return 0

Apply the function to our data frame.

In [64]:
df['target'] = list(map(classify, df['BTC-USD'], df['future']))

df=df[['BTC-USD','BTC_Volume','future','views','rsi','std','mean','target']]

print(df)

            BTC-USD  BTC_Volume   future    views        rsi          std  \
2015-07-01   257.97     8150.38   255.22  12957.0  63.516932    11.714312   
2015-07-02   255.22     6288.44   256.36   9802.0  60.149131    11.729520   
2015-07-03   256.36     4850.58   260.72   8307.0  61.070622    11.719544   
2015-07-04   260.72     4119.09   271.15   8947.0  64.455850    11.753758   
2015-07-05   271.15     7902.94   270.41   8692.0  70.961211    12.366152   
...             ...         ...      ...      ...        ...          ...   
2020-05-12  8821.42    19825.54  9321.26  10677.0  56.001358  1008.175343   
2020-05-13  9321.26    20859.19  9795.34   8349.0  61.535348  1006.165262   
2020-05-14  9795.34    27425.67  9312.10   8061.0  65.914358  1019.242768   
2020-05-15  9312.10    22369.96  9383.16   7443.0  58.592039   988.772268   
2020-05-16  9383.16    10642.35  9761.46   7142.0  59.307895   978.191598   

                   mean  target  
2015-07-01   239.563000       0  
2015-07
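
The same labels can probably be produced without a Python-level loop. This is a sketch of a vectorized alternative on made-up values; the column names current and future are placeholders for this example only.

```python
import pandas as pd

# Made-up prices; "current" and "future" are placeholder column names
toy = pd.DataFrame({
    "current": [10.0, 11.0, 12.0],
    "future": [11.0, 11.0, 10.0],
})

# True where the next price is strictly higher, then cast to 1/0.
# A tie (second row) becomes 0, matching the else branch of classify().
toy["target"] = (toy["future"] > toy["current"]).astype(int)
print(toy)
```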

This part, as mentioned before, is based on sentdex's YouTube video. Here I set the following parameters:
SEQ_LEN - the number of previous days used to make each prediction
FUTURE_PERIOD_PREDICT - the number of days into the future that we are trying to predict
EPOCHS - how many times the model goes through the training data
BATCH_SIZE - the number of training examples used in one iteration
First, I convert all the values to percentage changes, because the features differ from each other significantly in magnitude.
Second, I scale each feature (preprocessing.scale standardizes it to zero mean and unit variance rather than to the range 0 to 1), so that no feature has a bigger impact on the model than the others.
Then I shuffle the data so that the model tries to find patterns instead of memorizing the order of the data, and I balance the number of buys and sells so that the model is not biased in either direction. Finally, the data is converted from a pandas data frame to NumPy arrays.
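
As a rough illustration of the first two steps, the sketch below computes percentage changes and standardizes them with plain NumPy. The prices are made up, and standardize is a hypothetical helper that is assumed to mirror what preprocessing.scale does with its default arguments (zero mean, unit variance, population standard deviation).

```python
import numpy as np

# Made-up prices, not the Bitcoin data
prices = np.array([100.0, 110.0, 99.0, 108.9])

# Percentage change between consecutive rows (what pct_change() computes,
# minus the leading NaN): roughly +10%, -10%, +10%
changes = np.diff(prices) / prices[:-1]

def standardize(values):
    """Hypothetical helper: center to zero mean and scale to unit variance."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()  # std with ddof=0

scaled = standardize(changes)
print(changes)
print(scaled)
```

After this step the values are centered on zero and can be negative, so they are not confined to the range 0 to 1.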

In [65]:
SEQ_LEN=30
FUTURE_PERIOD_PREDICT=1
EPOCHS = 15  
BATCH_SIZE = 64 
NAME = f"{SEQ_LEN}-SEQ-{FUTURE_PERIOD_PREDICT}-PRED-{int(time.time())}"  

def preprocess_df(df_1):
    # Remove the raw future price so it cannot leak the label into the features
    df_1 = df_1.drop(columns="future")

    for col in df_1.columns:  
        if col != "target":  
            df_1[col] = df_1[col].pct_change()  
            df_1.dropna(inplace=True)  
            df_1[col] = preprocessing.scale(df_1[col].values)  

    df_1.dropna(inplace=True)  


    sequential_data = []  
    prev_days = deque(maxlen=SEQ_LEN)  

    for i in df_1.values:  
        prev_days.append([n for n in i[:-1]])  
        if len(prev_days) == SEQ_LEN:  
            sequential_data.append([np.array(prev_days), i[-1]])  

    random.shuffle(sequential_data)  

    buys = []  
    sells = []  

    for seq, target in sequential_data:  
        if target == 0: 
            sells.append([seq, target]) 
        elif target == 1:  
            buys.append([seq, target])  

    random.shuffle(buys)  
    random.shuffle(sells) 

    lower = min(len(buys), len(sells))  

    buys = buys[:lower]  
    sells = sells[:lower]  

    sequential_data = buys+sells  
    random.shuffle(sequential_data) 

    X = []
    y = []

    for seq, target in sequential_data: 
        X.append(seq)  
        y.append(target)  

    return np.array(X), y  
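
The sliding-window part of the function can be hard to picture, so here is a small sketch of just that step on made-up rows. SEQ_LEN is set to 3 for illustration (the notebook uses 30), and the shuffling and balancing steps are left out.

```python
from collections import deque
import numpy as np

SEQ_LEN = 3  # hypothetical; the notebook uses 30

# Five toy rows: two feature columns followed by the target column
rows = np.array([
    [0.1, 1.0, 0],
    [0.2, 2.0, 1],
    [0.3, 3.0, 0],
    [0.4, 4.0, 1],
    [0.5, 5.0, 1],
])

window = deque(maxlen=SEQ_LEN)  # the oldest row falls out automatically
sequences = []
for row in rows:
    window.append(row[:-1])  # keep the features only
    if len(window) == SEQ_LEN:
        # Each sample is the last SEQ_LEN rows plus the target of the newest row
        sequences.append((np.array(window), int(row[-1])))

X = np.array([seq for seq, _ in sequences])
y = np.array([target for _, target in sequences])
print(X.shape, y)
```

Each sample has shape (SEQ_LEN, number of features), and n rows give n - SEQ_LEN + 1 samples before balancing.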

Here I set aside the validation data frame: the most recent 10% of the data, which the model does not see during training (our test data).

In [66]:
times = sorted(df.index.values)
last_10pct = sorted(df.index.values)[-int(0.1*len(times))]
print(last_10pct)

validation_df = df[(df.index >= last_10pct)] 
df = df[(df.index < last_10pct)] 


2019-11-21
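
The split is chronological rather than random. The sketch below reproduces the same cutoff logic on a made-up frame of 20 daily rows, with the index stored as date strings (an assumption about how the CSV index looks).

```python
import pandas as pd

# 20 made-up daily rows indexed by date strings
dates = pd.date_range("2020-01-01", periods=20, freq="D").strftime("%Y-%m-%d")
toy = pd.DataFrame({"price": range(20)}, index=dates)

times = sorted(toy.index.values)
cutoff = times[-int(0.1 * len(times))]  # first date of the last 10%

validation = toy[toy.index >= cutoff]  # most recent rows
train = toy[toy.index < cutoff]        # everything before the cutoff
print(cutoff, len(train), len(validation))
```

Note that with fewer than 10 rows int(0.1 * len(times)) is 0, so the cutoff would be the first date and every row would end up in the validation set.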


Using my function to process all the data

In [67]:
train_x, train_y = preprocess_df(df)
validation_x, validation_y = preprocess_df(validation_df)
print(f"train data: {len(train_x)} validation: {len(validation_x)}")
print(f"Dont buys: {train_y.count(0)}, buys: {train_y.count(1)}")
print(f"VALIDATION Dont buys: {validation_y.count(0)}, buys: {validation_y.count(1)}")
train_x = np.asarray(train_x)
train_y = np.asarray(train_y)
validation_x = np.asarray(validation_x)
validation_y = np.asarray(validation_y)


train data: 1424 validation: 128
Dont buys: 712, buys: 712
VALIDATION Dont buys: 64, buys: 64


The model itself, as mentioned before, is based on sentdex's video.
The first argument of LSTM and Dense is the number of units. Dropout is the fraction of units that are randomly set to zero during training to prevent overfitting. ReLU is the activation function, which adds nonlinearity to the model.

In [68]:
model = Sequential()
model.add(LSTM(128, input_shape=(train_x.shape[1:]), return_sequences=True))
model.add(Dropout(0.2))
model.add(BatchNormalization())  #normalizes activation outputs, same reason you want to normalize your input data.

model.add(LSTM(128, return_sequences=True))
model.add(Dropout(0.2))
model.add(BatchNormalization())

model.add(LSTM(128))
model.add(Dropout(0.2))
model.add(BatchNormalization())

model.add(Dense(32, activation='relu'))
model.add(Dropout(0.2))

model.add(Dense(2, activation='softmax'))

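# learning_rate replaces the deprecated lr alias; decay may be rejected by newer Keras versions
opt = tf.keras.optimizers.Adam(learning_rate=0.001, decay=1e-6)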

# Compile model
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=opt,
    metrics=['accuracy']
)

tensorboard = TensorBoard(log_dir="logs\\{}".format(NAME))

filepath = "epoch_{epoch:02d}-val_accuracy_{val_accuracy:.3f}"
# Monitor the same metric name that the file name template uses (val_accuracy)
checkpoint = ModelCheckpoint("models\\{}_{}.model".format(NAME,filepath), monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
# Train model
history = model.fit(
    train_x, train_y,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    validation_data=(validation_x, validation_y),
    callbacks=[tensorboard, checkpoint],
)

# Score model
score = model.evaluate(validation_x, validation_y, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
# Save model
model.save("models\\{}".format(NAME))

Train on 1424 samples, validate on 128 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Test loss: 0.6996703892946243
Test accuracy: 0.5625
INFO:tensorflow:Assets written to: models\30-SEQ-1-PRED-1589742194\assets
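
One way to put the reported loss in context is to compare it with the cross-entropy of a model that always outputs 50/50 for the two classes. The sketch below does that arithmetic; mean_cross_entropy is a hypothetical helper written for this example, not a Keras function.

```python
import math

def mean_cross_entropy(probs_of_true_class):
    """Hypothetical helper: average of -log(p) over the true-class probabilities."""
    return sum(-math.log(p) for p in probs_of_true_class) / len(probs_of_true_class)

# A model that always predicts 0.5 for the true class, on 128 validation samples
chance_loss = mean_cross_entropy([0.5] * 128)

reported_loss = 0.6996703892946243  # "Test loss" from the output above
print(chance_loss, reported_loss)
```

The chance-level value is ln 2, about 0.693, so the reported loss sits very close to that baseline.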


## Conclusion
We get a test accuracy of 0.5625, which means the model predicts the direction of the next day's price correctly in 56.25% of cases. That is better than 50%, but still not great. I think more data would help, but I couldn't find larger data sets with page-view trends for Bitcoin. I could also tune some parameters or add more informative features to improve the model's performance.
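
To get a feeling for how far 56.25% is from a coin flip on only 128 balanced validation samples, a simple one-sided binomial check can be sketched as below. It assumes the predictions are independent, which is not strictly true here because the 30-day windows overlap, so treat it as a rough sanity check rather than a formal test.

```python
from math import comb

n = 128                      # validation samples (64 buys, 64 don't-buys)
correct = round(0.5625 * n)  # 72 correct predictions

# Probability of getting at least `correct` right by guessing with p = 0.5
p_value = sum(comb(n, k) for k in range(correct, n + 1)) / 2 ** n
print(correct, p_value)
```

The smaller this probability, the less likely it is that the result comes from chance alone; a larger validation set would make the comparison more reliable.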