
# 1st: structuring our data
1. We get the data
2. We merge the data
3. We create targets

In [2]:
import pandas as pd 
import os



Upload the 4 .csv files manually to Google Colab so that each one sits at a path like "/content/LTC-USD.csv" (the data can be downloaded from https://pythonprogramming.net/static/downloads/machine-learning-data/crypto_data.zip).

The .csv files have no header row, so we name the columns ourselves when reading them.

In [3]:
df = pd.read_csv("/content/LTC-USD.csv", names=["time", "low", "high", "open", "close", "volume"])
print(df.head())

         time        low       high       open      close      volume
0  1528968660  96.580002  96.589996  96.589996  96.580002    9.647200
1  1528968720  96.449997  96.669998  96.589996  96.660004  314.387024
2  1528968780  96.470001  96.570000  96.570000  96.570000   77.129799
3  1528968840  96.449997  96.570000  96.570000  96.500000    7.216067
4  1528968900  96.279999  96.540001  96.500000  96.389999  524.539978


We want the close and the volume from each of the 4 .csv files. The only column these 4 files have in common is "time", so we use it as the shared index when joining them.

In [4]:
main_df = pd.DataFrame() # begin empty

ratios = ["BTC-USD", "LTC-USD", "BCH-USD", "ETH-USD"]  # the 4 ratios we want to consider
for ratio in ratios:
  print(ratio)
  dataset = f"/content/{ratio}.csv"
  # print(dataset)

  df = pd.read_csv(dataset, names=["time", "low", "high", "open", "close", "volume"])
  # print(df.head()) we want to work with close and volume
  df.rename(columns={"close": f"{ratio}_close", "volume": f"{ratio}_volume"}, inplace=True)

  df.set_index("time", inplace=True)
  df = df[[f'{ratio}_close',f"{ratio}_volume"]] # ignore the other columns besides price and volume
  # print(df.head())

  # now we want to merge those 4
  if len(main_df) == 0:           #i.e. is empty
    main_df = df
  else:
    main_df = main_df.join(df)

main_df.ffill(inplace=True)  # if there are gaps in data, use previously known values
main_df.dropna(inplace=True)
print(main_df.head())

BTC-USD
LTC-USD
BCH-USD
ETH-USD
            BTC-USD_close  BTC-USD_volume  ...  ETH-USD_close  ETH-USD_volume
time                                       ...                               
1528968720    6487.379883        7.706374  ...      486.01001       26.019083
1528968780    6479.410156        3.088252  ...      486.00000        8.449400
1528968840    6479.410156        1.404100  ...      485.75000       26.994646
1528968900    6479.979980        0.753000  ...      486.00000       77.355759
1528968960    6480.000000        1.490900  ...      486.00000        7.503300

[5 rows x 8 columns]
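
To see what the join, forward-fill and dropna steps do when timestamps don't line up, here is a toy sketch. The two frames and their column names (`A_close`, `B_close`) are made up for illustration and are not part of the real data:

```python
import pandas as pd

# Toy per-minute closes for two made-up coins. "B" has no row at 120 s
# and no data at all before 60 s.
a = pd.DataFrame({"time": [0, 60, 120, 180],
                  "A_close": [1.0, 1.1, 1.2, 1.3]}).set_index("time")
b = pd.DataFrame({"time": [60, 180],
                  "B_close": [9.0, 9.5]}).set_index("time")

merged = a.join(b)        # left join on the shared "time" index
merged = merged.ffill()   # fill the 120 s gap with the last known B_close
merged = merged.dropna()  # drop the 0 s row, which has nothing to fill from

print(merged)
```

The row at 0 s disappears because there is no earlier value of `B_close` to carry forward, which is why the merged frame above starts one minute later than the raw LTC file.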


Next, we need to create a target. To do this, we need to know which price we're trying to predict and how far out we want to predict it. We'll go with Litecoin for now. How far out we can predict probably also depends on how long our sequences are. If our sequence length is 3 (so... 3 minutes), we probably can't easily predict 10 minutes out. If our sequence length is 300, 10 might not be as hard. I'd like to go with a sequence length of 60 and a future prediction period of 3. We could also treat the prediction as a regression problem, using a linear activation on the output layer, but instead I am going to go with binary classification.

If the price goes up in 3 minutes, then it's a buy. If it goes down (or stays the same) in 3 minutes, it's a don't buy/sell. With all of that in mind, I am going to make the following constants:

In [5]:
SEQ_LEN = 60
FUTURE_PERIOD_PREDICT = 3
RATIO_TO_PREDICT = "LTC-USD"

def classify(current, future):
  if float(future)>float(current):
    return 1                        #BUY
  else:
    return 0                        #DONT BUY
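
As a quick illustration of how the "future" price and the 0/1 target line up, here is a minimal sketch on made-up closes. It uses a vectorized comparison instead of `classify`, which gives the same labels; dropping the rows with no known future at the end is one possible choice and an assumption, not something the notebook does:

```python
import pandas as pd

FUTURE_PERIOD_PREDICT = 3  # same value as in the notebook

# Hypothetical closes, one per minute
toy = pd.DataFrame({"close": [10.0, 11.0, 9.0, 12.0, 8.0, 13.0]})

# shift(-3) pulls the value from 3 rows ahead into the current row
toy["future"] = toy["close"].shift(-FUTURE_PERIOD_PREDICT)

# 1 if the future close is strictly higher than the current close, else 0.
# Comparisons against NaN are False, so the last 3 rows get 0 by default.
toy["target"] = (toy["future"] > toy["close"]).astype(int)

# Assumption for illustration: drop rows whose future price is unknown
labeled = toy.dropna(subset=["future"])
print(labeled)
```

Note that the last `FUTURE_PERIOD_PREDICT` rows have no future price, so they end up labeled 0 whether or not the price actually went up.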

## So, knowing this, the first part of our code looks like this:

In [6]:
import pandas as pd 
import os

SEQ_LEN = 60
FUTURE_PERIOD_PREDICT = 3
RATIO_TO_PREDICT = "LTC-USD"


def classify(current, future):
  if float(future)>float(current):
    return 1                        #BUY
  else:
    return 0                        #DONT BUY



main_df = pd.DataFrame() # begin empty

ratios = ["BTC-USD", "LTC-USD", "BCH-USD", "ETH-USD"]  # the 4 ratios we want to consider
for ratio in ratios:
  # print(ratio)
  dataset = f"/content/{ratio}.csv"
  # print(dataset)

  df = pd.read_csv(dataset, names=["time", "low", "high", "open", "close", "volume"])
  # print(df.head()) we want to work with close and volume
  df.rename(columns={"close": f"{ratio}_close", "volume": f"{ratio}_volume"}, inplace=True)

  df.set_index("time", inplace=True)
  df = df[[f'{ratio}_close',f"{ratio}_volume"]]
  # print(df.head())

  # now we want to merge those 4
  if len(main_df) == 0:           #i.e. is empty
    main_df = df
  else:
    main_df = main_df.join(df)

main_df.ffill(inplace=True)  # if there are gaps in data, use previously known values
main_df.dropna(inplace=True)

# Now let's add the future price of the coin we want to predict (Litecoin) as a new column

main_df['future'] = main_df[f'{RATIO_TO_PREDICT}_close'].shift(-FUTURE_PERIOD_PREDICT) 
print(main_df.head())


#future price of litecoin(LTC-USD) 
# the 1st column is the "current" and the 2nd is the "future" after 3 periods 
print(main_df[[f'{RATIO_TO_PREDICT}_close', "future"]].head()) 


# The map part is what allows us to do this row-by-row for these columns, but also do it quite fast. 
# The list part converts the end result to a list, which we can just set as a column.
main_df['target'] = list(map(classify, main_df[f'{RATIO_TO_PREDICT}_close'], main_df['future']))
print(main_df[[f'{RATIO_TO_PREDICT}_close', "future","target" ]].head(12)) 



            BTC-USD_close  BTC-USD_volume  ...  ETH-USD_volume     future
time                                       ...                           
1528968720    6487.379883        7.706374  ...       26.019083  96.389999
1528968780    6479.410156        3.088252  ...        8.449400  96.519997
1528968840    6479.410156        1.404100  ...       26.994646  96.440002
1528968900    6479.979980        0.753000  ...       77.355759  96.470001
1528968960    6480.000000        1.490900  ...        7.503300  96.400002

[5 rows x 9 columns]
            LTC-USD_close     future
time                                
1528968720      96.660004  96.389999
1528968780      96.570000  96.519997
1528968840      96.500000  96.440002
1528968900      96.389999  96.470001
1528968960      96.519997  96.400002
            LTC-USD_close     future  target
time                                        
1528968720      96.660004  96.389999       0
1528968780      96.570000  96.519997       0
1528968840      96.50

# 2nd: normalizing and creating sequences for our cryptocurrency-predicting RNN

The first thing I would like to do is separate out our validation/out-of-sample data. In the past, all we did was shuffle the data, then slice it. Does that make sense here, though?

The problem with that method is that the data is inherently sequential, so taking validation sequences that don't come from the future is likely a mistake. This is because sequences in our case that are, for example, 1 minute apart will be almost identical. Chances are, the target is also going to be the same (buy or sell). Because of this, any overfitting is likely to pour over into the validation set. Instead, we want to slice off our validation data while it's still in order. I'd like to take the last 5% of the data. To do that, we'll do:

In [7]:
times = sorted(main_df.index.values) #.index->reference to index, .values->converts to numpy array
last_5pct = times[-int(0.05*len(times))] # the timestamp at which the last 5% of the data begins
print(last_5pct)

1534922100


So we have a timestamp (last_5pct = 1534922100), and everything at or after that timestamp is the last 5% of our data.

### Now we are going to separate our validation data (or out-of-sample data) from our training data

In [8]:
validation_main_df = main_df[(main_df.index>=last_5pct)] # make the validation data where the index is in the last 5%
main_df = main_df[(main_df.index<last_5pct)]             # now the main_df is all the data up to the last 5%
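
The same idea can be wrapped in a small helper. This is only a sketch: `chronological_split` and `holdout_frac` are names I made up, and the toy frame stands in for `main_df`:

```python
import pandas as pd

def chronological_split(df, holdout_frac=0.05):
    """Split a time-indexed frame into (train, validation) without shuffling."""
    df = df.sort_index()
    times = df.index.values
    # Guard against a holdout of 0 rows, where times[-0] would be the first row
    n_holdout = max(1, int(holdout_frac * len(times)))
    cutoff = times[-n_holdout]
    train = df[df.index < cutoff]        # everything strictly before the cutoff
    validation = df[df.index >= cutoff]  # the cutoff timestamp and everything after
    return train, validation

# Toy frame: 100 one-minute rows indexed by a Unix-style timestamp
toy = pd.DataFrame({"close": range(100)}, index=range(0, 6000, 60))
train, validation = chronological_split(toy)
print(len(train), len(validation))
```

Because the split is by timestamp rather than by shuffled position, every validation row comes strictly after every training row.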

Next, we need to balance and normalize this data. By balance, we mean making sure the classes have equal counts when training, so our model doesn't just always predict one class.

One way to counteract this is to use class weights, which let you weight the loss higher for less frequent classes. That said, I've never personally seen this really be comparable to a truly balanced dataset.
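
For reference, here is a rough sketch of how such class weights could be computed. It uses the common "balanced" heuristic, n_samples / (n_classes * class_count); the helper name `balanced_class_weights` and the 75/25 label split are made up for illustration:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to how often it appears."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical imbalanced targets: 75 "don't buy" (0) and 25 "buy" (1)
labels = [0] * 75 + [1] * 25
weights = balanced_class_weights(labels)
print(weights)
```

A dictionary like this is the kind of thing that can be passed as `class_weight` to a Keras model's `fit`, though as said above I would still prefer to balance the data itself.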

We also need to take our data and make sequences from it.

So...we've got some work to do! We'll start by making a function that will process the dataframes, so we can just do something like:

train_x, train_y = preprocess_df(main_df) 

validation_x, validation_y = preprocess_df(validation_main_df)

Let's start by removing the future column (the actual target is literally called target; we only needed the future column temporarily to create it). Then, we need to scale our data:
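
To make the scaling idea concrete, here is a toy sketch with two made-up coins at very different price levels. It standardizes by hand with pandas (subtract the mean, divide by the population standard deviation), which is my stand-in for what `preprocessing.scale` does:

```python
import numpy as np
import pandas as pd

# Made-up prices: the two coins move by the same percentages at very different levels
prices = pd.DataFrame({
    "big_coin": [6000.0, 6060.0, 6000.0, 6120.0],
    "small_coin": [100.0, 101.0, 100.0, 102.0],
})

# Percent change puts both coins on the same footing; the first row is NaN and is dropped
returns = prices.pct_change().dropna()

# Standardize each column to zero mean and unit variance (ddof=0 = population std)
scaled = (returns - returns.mean()) / returns.std(ddof=0)

print(returns)
print(scaled)
```

After `pct_change` the two columns are identical even though the raw prices differ by a factor of 60, which is exactly why we normalize this way before feeding several coins into one model.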

In [14]:
from sklearn import preprocessing
from collections import deque
import random
import numpy as np


def preprocess_df(df):
  df = df.drop(columns='future')                     # don't need this anymore.

  for col in df.columns:
    if col != "target":                              # normalize all ... except for the target itself! (it's already 0/1)
      df[col] = df[col].pct_change()                 # pct change "normalizes" the different currencies (each crypto coin has vastly different values; we're really more interested in the coins' movements)
      df.dropna(inplace=True)                        # remove the NaNs created by pct_change
      df[col] = preprocessing.scale(df[col].values)  # standardize to zero mean and unit variance
  
  df.dropna(inplace=True)                            # cleanup again. Those nasty NaNs love to creep in.

  # Alright, we've normalized and scaled the data! Next up, we need to create our actual sequences. To do this:
  sequential_data = []                               # this is a list that will CONTAIN the sequences
  prev_days = deque(maxlen=SEQ_LEN)                  # the rolling window that becomes each sequence; deque keeps at most SEQ_LEN rows by dropping the oldest as new ones come in

  for i in df.values:                                # iterate over the rows (df.values is a NumPy array of rows in index order; it no longer contains "time", BUT it does contain "target" as the last column, so we have to be careful)
    prev_days.append([n for n in i[:-1]])            # store all but the target
    if len(prev_days) == SEQ_LEN:                    # make sure the window holds 60 rows!
      sequential_data.append([np.array(prev_days), i[-1]])   # append the features (np.array(prev_days)) and the label (i[-1])

  random.shuffle(sequential_data)                    # shuffle for good measure.

  # run the 3 below lines to have a better understanding of what is going on
  # print(df.head())
  # for c in df.columns:
  #   print(c)
  


preprocess_df(main_df)

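
If the deque part is hard to picture, here is a tiny sketch of the same sliding-window idea with a sequence length of 3 and made-up rows of the form [feature, target]:

```python
from collections import deque

SEQ_LEN = 3  # shortened from 60 so the result is easy to read

# Made-up rows: one feature followed by the target in the last position
rows = [[0.1, 0], [0.2, 1], [0.3, 0], [0.4, 1], [0.5, 1]]

window = deque(maxlen=SEQ_LEN)  # oldest row falls out automatically when full
sequences = []
for *features, target in rows:
    window.append(features)              # store all but the target
    if len(window) == SEQ_LEN:           # only emit once the window is full
        sequences.append((list(window), target))

for seq, label in sequences:
    print(seq, "->", label)
```

Five rows with a window of 3 give 5 - 3 + 1 = 3 sequences, and each sequence is paired with the target of its last row.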

We've got our data, we've got sequences, we've got the data normalized, and we've got it scaled. The whole code so far:

In [15]:
import pandas as pd 
import os
from sklearn import preprocessing
from collections import deque
import random
import numpy as np

SEQ_LEN = 60
FUTURE_PERIOD_PREDICT = 3
RATIO_TO_PREDICT = "LTC-USD"


def classify(current, future):
  if float(future)>float(current):
    return 1                        #BUY
  else:
    return 0                        #DONT BUY


def preprocess_df(df):
  df = df.drop(columns='future')                     # don't need this anymore.

  for col in df.columns:
    if col != "target":                              # normalize all ... except for the target itself! (it's already 0/1)
      df[col] = df[col].pct_change()                 # pct change "normalizes" the different currencies (each crypto coin has vastly different values; we're really more interested in the coins' movements)
      df.dropna(inplace=True)                        # remove the NaNs created by pct_change
      df[col] = preprocessing.scale(df[col].values)  # standardize to zero mean and unit variance
  
  df.dropna(inplace=True)                            # cleanup again. Those nasty NaNs love to creep in.

  # Alright, we've normalized and scaled the data! Next up, we need to create our actual sequences. To do this:
  sequential_data = []                               # this is a list that will CONTAIN the sequences
  prev_days = deque(maxlen=SEQ_LEN)                  # the rolling window that becomes each sequence; deque keeps at most SEQ_LEN rows by dropping the oldest as new ones come in

  # uncomment the 3 lines below to get a better understanding of what is going on
  # print(df.head())
  # for c in df.columns:
  #   print(c)

  for i in df.values:                                # iterate over the rows (df.values is a NumPy array of rows in index order; it no longer contains "time", BUT it does contain "target" as the last column, so we have to be careful)
    prev_days.append([n for n in i[:-1]])            # store all but the target
    if len(prev_days) == SEQ_LEN:                    # make sure the window holds 60 rows!
      sequential_data.append([np.array(prev_days), i[-1]])   # append the features (np.array(prev_days)) and the label (i[-1])

  random.shuffle(sequential_data)                    # shuffle for good measure.


main_df = pd.DataFrame() # begin empty

ratios = ["BTC-USD", "LTC-USD", "BCH-USD", "ETH-USD"]  # the 4 ratios we want to consider
for ratio in ratios:
  # print(ratio)
  dataset = f"/content/{ratio}.csv"
  # print(dataset)

  df = pd.read_csv(dataset, names=["time", "low", "high", "open", "close", "volume"])
  # print(df.head()) we want to work with close and volume
  df.rename(columns={"close": f"{ratio}_close", "volume": f"{ratio}_volume"}, inplace=True)

  df.set_index("time", inplace=True)
  df = df[[f'{ratio}_close',f"{ratio}_volume"]]
  # print(df.head())

  # now we want to merge those 4
  if len(main_df) == 0:           #i.e. is empty
    main_df = df
  else:
    main_df = main_df.join(df)

main_df.ffill(inplace=True)  # if there are gaps in data, use previously known values
main_df.dropna(inplace=True)

# Now let's add the future price of the coin we want to predict (Litecoin) as a new column

main_df['future'] = main_df[f'{RATIO_TO_PREDICT}_close'].shift(-FUTURE_PERIOD_PREDICT) 
print(main_df.head())


#future price of litecoin(LTC-USD) 
# the 1st column is the "current" and the 2nd is the "future" after 3 periods 
print(main_df[[f'{RATIO_TO_PREDICT}_close', "future"]].head()) 


# The map part is what allows us to do this row-by-row for these columns, but also do it quite fast. 
# The list part converts the end result to a list, which we can just set as a column.
main_df['target'] = list(map(classify, main_df[f'{RATIO_TO_PREDICT}_close'], main_df['future']))
print(main_df[[f'{RATIO_TO_PREDICT}_close', "future","target" ]].head(12)) 

preprocess_df(main_df)

            BTC-USD_close  BTC-USD_volume  ...  ETH-USD_volume     future
time                                       ...                           
1528968720    6487.379883        7.706374  ...       26.019083  96.389999
1528968780    6479.410156        3.088252  ...        8.449400  96.519997
1528968840    6479.410156        1.404100  ...       26.994646  96.440002
1528968900    6479.979980        0.753000  ...       77.355759  96.470001
1528968960    6480.000000        1.490900  ...        7.503300  96.400002

[5 rows x 9 columns]
            LTC-USD_close     future
time                                
1528968720      96.660004  96.389999
1528968780      96.570000  96.519997
1528968840      96.500000  96.440002
1528968900      96.389999  96.470001
1528968960      96.519997  96.400002
            LTC-USD_close     future  target
time                                        
1528968720      96.660004  96.389999       0
1528968780      96.570000  96.519997       0
1528968840      96.50