# Bitcoin Historical Price Data Feature Engineering

__Niklas Gutheil__<br>
__2022-03-01__

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import datetime
import plotly.express as px
import requests
import json
# stats
from statsmodels.api import tsa # time series analysis



## Table of Contents

* [Introduction](#introduction)<br>
* [Load in Data](#load)<br>
* [On-balance Volume](#obv)<br>
* [Exponential Moving Averages](#ema)<br>
* [Relative Strength Index](#rsi)<br>
* [Fear and Greed Index](#fear_greed)<br>
    - [Download Data](#download_greed)<br>

## Introduction <a class="anchor" id="introduction" ></a>

Currently there are only 5 features and 1 target variable for implementing future models. Those features are:
- __Date__: The timestamp of each 5-minute interval (shown as a datetime string such as `2014-01-01 05:00:00` once loaded)
- __Close__: The price of Bitcoin in USD at the end of a 5-minute interval
- __High__: The highest price of Bitcoin in USD during a 5-minute interval
- __Low__: The lowest price of Bitcoin in USD during a 5-minute interval
- __Volume__: The amount of Bitcoin bought and sold in a 5-minute interval

The target variable is: 

- __Open__: The price of Bitcoin in USD at the beginning of a 5-minute interval

We want to add features that will be predictive of our target variable `Open`. Luckily, Bitcoin and financial markets as a whole have many popular indicators that traders use to anticipate whether the price of an asset will rise or fall. We will explain each feature, and then add it to the dataset.

__Features to be added:__ <br>
- On-balance-volume (OBV)
- Exponential Moving Averages (EMAs): 7-day, 14-day, 21-day, 28-day, 50-day, 100-day and 150-day
- Relative Strength Index (RSI)
- Fear and Greed Index 
- Puell Multiple (-could be important-)
- Stock to flow model (S2F Ratio = circ supply / yearly minting)

## Load in Data <a class = "anchor" id = "load"></a>

Let's load in our data and take a look at our current features.

In [2]:
bitcoin_df = pd.read_csv('data/BTCUSD_5m historical data.csv') # forward slash avoids the invalid '\B' escape and works on every OS
bitcoin_df.rename(columns = {'Unnamed: 0': 'Date'}, inplace = True)
bitcoin_df.head()

Unnamed: 0,Date,Open,Close,High,Low,Volume
0,2014-01-01 05:00:00,748.59,749.0,749.0,748.59,6.01
1,2014-01-01 05:05:00,749.0,749.0,749.0,748.95,31.30012
2,2014-01-01 05:10:00,747.0,748.93,749.96,746.28,14.990705
3,2014-01-01 05:15:00,749.89,749.89,749.89,749.89,2.241411
4,2014-01-01 05:20:00,748.91,748.89,749.93,748.89,9.664494


## On-balance-volume (OBV) <a class = "anchor" id = "obv"></a>

* On-balance volume (OBV) is a technical indicator of momentum, using volume changes to make price predictions.<br>
* OBV shows crowd sentiment that can predict a bullish or bearish outcome.

Here we are going to modify a function that calculates OBV.

***************************************************************************************
*    Title: Pandas-Technical-Indicators
*    Author: egor-bogomolov
*    Date: March 28, 2018
*    Code version: 1.0
*    Availability: https://github.com/Crypto-toolbox/pandas-technical-indicators/blob/master/technical_indicators.py
***************************************************************************************

In [3]:
def on_balance_volume(df, n=1):
    """Calculate On-Balance Volume for given data.
    
    df: dataframe containing the Close and Volume columns of the price chart
    n: the number of periods (rows) to average over; the standard is usually 1, so we set it to 1 as default 
    return: dataframe containing the OBV numbers
    """
    i = 0
    OBV = [0]
    while i < df.index[-1]:
        if df.loc[i + 1, 'Close'] - df.loc[i, 'Close'] > 0:
            OBV.append(df.loc[i + 1, 'Volume'])
        if df.loc[i + 1, 'Close'] - df.loc[i, 'Close'] == 0:
            OBV.append(0)
        if df.loc[i + 1, 'Close'] - df.loc[i, 'Close'] < 0:
            OBV.append(-df.loc[i + 1, 'Volume'])
        i = i + 1
        
    OBV = pd.DataFrame(OBV)
    OBV_ma = pd.DataFrame(OBV.rolling(n, min_periods=n).mean())
    
    return OBV_ma

In [4]:
obv_df = pd.concat([bitcoin_df['Close'], bitcoin_df['Volume']], axis = 1)


bitcoin_features_df = pd.DataFrame()

bitcoin_features_df['OBV'] = on_balance_volume(obv_df, 1)
bitcoin_features_df.head()

Unnamed: 0,OBV
0,0.0
1,0.0
2,-14.990705
3,2.241411
4,-9.664494


We have now stored our OBV value in a dataframe where we will store all of our newly created features until we are ready to add them to the original dataset.
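
The loop above visits every row with `.loc`, which is slow on roughly 840,000 rows. Below is a minimal vectorized sketch that, under the assumption of the same `Close` and `Volume` columns, produces the same per-interval signed volume as `on_balance_volume(df, 1)`. The helper names `signed_volume` and `cumulative_obv` are introduced here for illustration and are not part of the credited library. The textbook OBV is usually defined as the running total of this signed volume, which the second helper shows; the notebook itself keeps the per-interval values.

```python
import numpy as np
import pandas as pd


def signed_volume(df: pd.DataFrame) -> pd.Series:
    """Volume signed by the direction of the Close change (hypothetical helper)."""
    # +1 when Close rose, -1 when it fell, 0 when unchanged or on the first row
    direction = np.sign(df["Close"].diff()).fillna(0)
    return direction * df["Volume"]


def cumulative_obv(df: pd.DataFrame) -> pd.Series:
    """Running total of signed volume, the conventional OBV line (hypothetical helper)."""
    return signed_volume(df).cumsum()


# Toy data, not taken from the Bitcoin dataset
toy = pd.DataFrame({"Close": [10.0, 11.0, 11.0, 9.0], "Volume": [5.0, 2.0, 3.0, 4.0]})
per_interval = signed_volume(toy)
running_total = cumulative_obv(toy)
```

Whether a model benefits more from the per-interval values or the running total is an open choice; the sketch only makes the difference explicit.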

## Exponential Moving Averages <a class = "anchor" id = "ema"></a>

Next, we move on to moving averages. We will calculate the exponential moving average (EMA) rather than the simple moving average (SMA), because the EMA weights recent prices more heavily, making it more responsive to new information. <br>

* A simple moving average (SMA) is the arithmetic mean of a set of prices over a specific number of past days; for example, the previous 15, 30, 100, or 200 days.
* An exponential moving average (EMA) is a weighted average that gives greater importance to the most recent prices, making it an indicator that is more responsive to new information.
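
To make the difference concrete, here is a small sketch on made-up prices (flat, then a jump). It compares a rolling mean with the `ewm(span=n)` call used in this notebook, and reproduces the last EMA value by hand, assuming pandas' default `adjust=True` weighting in which the price `k` steps in the past receives weight `(1 - alpha)**k` with `alpha = 2 / (n + 1)`.

```python
import numpy as np
import pandas as pd

prices = pd.Series([10.0, 10.0, 10.0, 10.0, 20.0])  # toy prices: flat, then a jump
n = 4

sma = prices.rolling(window=n).mean()
ema = prices.ewm(span=n, min_periods=n).mean()

# Manual weighted mean for the last row: the newest price gets weight 1,
# the one before it (1 - alpha), then (1 - alpha)**2, and so on.
alpha = 2 / (n + 1)
weights = (1 - alpha) ** np.arange(len(prices))[::-1]
manual_last = float((weights * prices).sum() / weights.sum())
```

After the jump, the last EMA value sits above the last SMA value (12.5), which is the "more responsive" behaviour described above.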

We will start by defining a function that will calculate EMA for any given time period `n`.

In [5]:
def EMA(df, n_range):
    """Calculate exponential moving averages over several spans.
    
    df: pandas.Series containing the Close prices
    n_range: list of n time intervals to aggregate over
    return: EMA dataframe with one EMA_n column per interval in n_range
    """
    ema_df = pd.DataFrame()
    for n in n_range:
        # min_periods=n leaves the first n - 1 rows as NaN until enough data exists
        ema_df['EMA_' + str(n)] = df.ewm(span=n, min_periods=n).mean()
    
    return ema_df

Now we want to create a few EMAs and add them to our feature dataframe. We want to choose some short time frames to gain an understanding of price action over days or weeks, but we also want to include longer EMAs, as they cancel out a lot of "trade noise" and other premature signals like fake breakouts and traps.<br><br>

We will create EMAs with spans of 7, 14, 21, 28, 50, 100 and 150 periods. Since we are working with 5-minute intervals, each period here is 5 minutes rather than a day; these spans are the ones conventionally read as days, which will become more important when we feed a 1-day interval chart through this function.
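
As a quick worked example of what these spans mean on 5-minute data, the arithmetic below converts each span to its nominal window in minutes, and shows the spans that would be needed if day-length windows were wanted instead. The second list is a hypothetical alternative, not something this notebook uses.

```python
MINUTES_PER_CANDLE = 5
periods_per_day = 24 * 60 // MINUTES_PER_CANDLE  # five-minute candles in one day

spans = [7, 14, 21, 28, 50, 100, 150]

# Nominal window of each span on 5-minute data, in minutes
minutes_covered = {n: n * MINUTES_PER_CANDLE for n in spans}

# Spans that would correspond to windows n days long (hypothetical alternative)
day_length_spans = [n * periods_per_day for n in spans]
```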

In [6]:
EMA_features = EMA(bitcoin_df['Close'], [7, 14, 21, 28, 50, 100, 150])

In [7]:
EMA_features.shape

(841537, 7)

In [8]:
EMA_features.head(20)

Unnamed: 0,EMA_7,EMA_14,EMA_21,EMA_28,EMA_50,EMA_100,EMA_150
0,,,,,,,
1,,,,,,,
2,,,,,,,
3,,,,,,,
4,,,,,,,
5,,,,,,,
6,748.972287,,,,,,
7,748.771627,,,,,,
8,748.822543,,,,,,
9,748.148976,,,,,,


One issue here is that the first n - 1 entries of each EMA_n are NaN, as there are not yet enough values to produce an output. We are going to overwrite all of those NaNs with 0 so that the columns contain no missing values when we use them in models later on.

In [9]:
EMA_features.fillna(0, inplace = True)
EMA_features.isna().sum()

EMA_7      0
EMA_14     0
EMA_21     0
EMA_28     0
EMA_50     0
EMA_100    0
EMA_150    0
dtype: int64

We can now add our `EMA_features` to our `bitcoin_features_df`.

In [10]:
bitcoin_features_df = pd.concat([bitcoin_features_df, EMA_features], axis = 1)
bitcoin_features_df

Unnamed: 0,OBV,EMA_7,EMA_14,EMA_21,EMA_28,EMA_50,EMA_100,EMA_150
0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,-14.990705,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3,2.241411,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,-9.664494,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...
841532,1.340725,46719.330248,46746.749042,46760.634241,46761.261236,46728.224930,46720.803524,46797.721663
841533,6.393124,46740.497686,46754.382503,46764.576582,46764.208737,46731.196501,46722.450979,46797.804820
841534,2.147162,46760.623265,46763.264836,46769.705984,46768.125376,46734.718207,46724.402445,46798.112041
841535,-1.076775,46747.717448,46756.029524,46764.187258,46764.047764,46733.709650,46724.097446,46796.931749


One thing to note at this point is that we could create a MACD indicator, but that is simply the 12-day EMA minus the 26-day EMA. Considering we already have a variety of EMAs in our dataset, we shouldn't add features that are simply additions or subtractions of other features, since a linear model can already form those combinations itself. For feature engineering, multiplying or dividing two known features tends to have a much greater impact.
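
For reference only, the MACD line mentioned above can be sketched in a few lines. This is an illustration on toy data, assuming the conventional 12 and 26 period spans and the same `ewm` settings used elsewhere in this notebook; `macd_line` is a name introduced here, and the feature is not added to the dataset.

```python
import pandas as pd


def macd_line(close: pd.Series, fast: int = 12, slow: int = 26) -> pd.Series:
    """Fast EMA minus slow EMA (hypothetical helper, not used in the notebook)."""
    fast_ema = close.ewm(span=fast, min_periods=fast).mean()
    slow_ema = close.ewm(span=slow, min_periods=slow).mean()
    return fast_ema - slow_ema


toy_close = pd.Series(range(1, 41), dtype="float64")  # steadily rising toy prices
macd = macd_line(toy_close)
```

On steadily rising prices the fast EMA tracks the latest values more closely than the slow one, so the line is positive once both EMAs are defined.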

## Relative Strength Index (RSI) <a class = "anchor" id = "rsi"></a>

* The relative strength index (RSI) is a popular momentum oscillator developed in 1978.
* The RSI provides technical traders with signals about bullish and bearish price momentum, and it is often plotted beneath the graph of an asset’s price.
* An asset is usually considered overbought when the RSI is above 70% and oversold when it is below 30%.

Calculating the RSI is a little challenging, and computing these features by hand is not the goal of this notebook, so we will once again modify an existing function posted on GitHub. The credit is as follows: <br>

***************************************************************************************
*    Title: Pandas-Technical-Indicators
*    Author: egor-bogomolov
*    Date: March 28, 2018
*    Code version: 1.0
*    Availability: https://github.com/Crypto-toolbox/pandas-technical-indicators/blob/master/technical_indicators.py
***************************************************************************************

In [11]:
def RSI(df, n = 14):
    """Calculate Relative Strength Index(RSI) for given data.
    
    df: dataframe containing the High and Low price columns
    n: the number of periods to aggregate over; the traditional number for RSI is 14, so we set that as default 
    return: pandas.DataFrame with the RSI as a fraction between 0 and 1
    """
    i = 0
    UpI = [0]
    DoI = [0]
    while i + 1 <= df.index[-1]:
        UpMove = df.loc[i + 1, 'High'] - df.loc[i, 'High']
        DoMove = df.loc[i, 'Low'] - df.loc[i + 1, 'Low']
        if UpMove > DoMove and UpMove > 0:
            UpD = UpMove
        else:
            UpD = 0
        UpI.append(UpD)
        if DoMove > UpMove and DoMove > 0:
            DoD = DoMove
        else:
            DoD = 0
        DoI.append(DoD)
        i = i + 1
    UpI = pd.DataFrame(UpI)
    DoI = pd.DataFrame(DoI)
    PosDI = pd.DataFrame(UpI.ewm(span=n, min_periods=n).mean())
    NegDI = pd.DataFrame(DoI.ewm(span=n, min_periods=n).mean())
    RSI = pd.DataFrame(PosDI / (PosDI + NegDI))
    
    return RSI

In [12]:
RSI_feature = pd.concat([bitcoin_df['High'], bitcoin_df['Low']], axis = 1)
RSI_feature['RSI'] = RSI(RSI_feature)
RSI_feature.head(20)

Unnamed: 0,High,Low,RSI
0,749.0,748.59,
1,749.0,748.95,
2,749.96,746.28,
3,749.89,749.89,
4,749.93,748.89,
5,749.0,748.25,
6,748.9,748.25,
7,748.8,748.25,
8,748.96,748.25,
9,748.96,746.28,
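
Because the `RSI` function above loops over every row, it can take a long time on the full dataset. The following is a vectorized sketch of the same High/Low-based calculation, assuming the same column names; `rsi_vectorized` is a name introduced here and is not part of the credited library. Like the original, it returns a fraction between 0 and 1 and leaves the first `n - 1` rows as NaN.

```python
import numpy as np
import pandas as pd


def rsi_vectorized(df: pd.DataFrame, n: int = 14) -> pd.Series:
    """Vectorized variant of the loop-based RSI above (hypothetical helper)."""
    up_move = df["High"].diff()     # High[t] - High[t-1]
    down_move = -df["Low"].diff()   # Low[t-1] - Low[t]

    # Keep a move only when it is positive and larger than the opposite move.
    # Comparisons against the NaN in the first row are False, so that row becomes 0.
    up = np.where((up_move > down_move) & (up_move > 0), up_move, 0.0)
    down = np.where((down_move > up_move) & (down_move > 0), down_move, 0.0)

    pos = pd.Series(up, index=df.index).ewm(span=n, min_periods=n).mean()
    neg = pd.Series(down, index=df.index).ewm(span=n, min_periods=n).mean()
    return pos / (pos + neg)


# Toy data: highs and lows rise every step, so only upward moves are recorded
toy = pd.DataFrame({"High": [1.0, 2.0, 3.0, 4.0, 5.0], "Low": [0.5, 1.5, 2.5, 3.5, 4.5]})
toy_rsi = rsi_vectorized(toy, n=3)
```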


Once again we will have to impute 0s for our NaNs, which are the first 13 entries, as the RSI can't be defined until the 14-period moving average has enough data.

In [13]:
RSI_feature['RSI'] = RSI_feature['RSI'].fillna(0) # assign back instead of a chained in-place fill, which newer pandas versions warn about
RSI_feature.isna().sum()

High    0
Low     0
RSI     0
dtype: int64

In [14]:
bitcoin_features_df = pd.concat([bitcoin_features_df, RSI_feature['RSI']], axis = 1)
bitcoin_features_df

Unnamed: 0,OBV,EMA_7,EMA_14,EMA_21,EMA_28,EMA_50,EMA_100,EMA_150,RSI
0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,-14.990705,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3,2.241411,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,-9.664494,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...
841532,1.340725,46719.330248,46746.749042,46760.634241,46761.261236,46728.224930,46720.803524,46797.721663,0.426702
841533,6.393124,46740.497686,46754.382503,46764.576582,46764.208737,46731.196501,46722.450979,46797.804820,0.568862
841534,2.147162,46760.623265,46763.264836,46769.705984,46768.125376,46734.718207,46724.402445,46798.112041,0.624253
841535,-1.076775,46747.717448,46756.029524,46764.187258,46764.047764,46733.709650,46724.097446,46796.931749,0.424825


## Fear and Greed Index <a class = "anchor" id = "fear_greed"></a>

The Fear and Greed Index is a unique indicator created for cryptocurrencies. It is an aggregation of several metrics. <br>
The value ranges from 0 to 100, where 0 is considered __Extreme Fear__ and 100 __Extreme Greed__. Each of the following metrics contributes to the index with the weight shown.<br>
- Volatility (25%): Measures the current volatility and max. drawdowns of Bitcoin compared with the corresponding average values of the last 30 and 90 days. An unusual rise in volatility is a sign of a __fearful__ market.
- Market Momentum/Volume (25%): Measures the current volume and market momentum (in comparison to the averages of the last 30 and 90 days) and puts those two values together. High buying volumes on a daily basis in a positive market indicate __greed__.
- Social Media (15%): An analysis of Twitter hashtags for each coin, looking at how fast and how many interactions they receive in a certain time frame. An unusually high interaction rate reflects growing public interest and is an indicator of __greed__.
- Surveys (15%) (Paused early 2022): Around 2000 - 3000 votes on each poll asking investors about their market sentiment.
- Dominance (10%): Refers to the percentage of the total market cap of the cryptocurrency industry that this coin represents. When Bitcoin dominance shrinks, it is an indicator of __greed__; when it rises, it is an indicator of __fear__.
- Trends (10%): A collection of several search queries on Google Trends, focusing on changes in search volume. It also looks at the "related search queries" offered by Google, which show new search terms that have gained traction.

One problem with this indicator is that its data only starts on February 1st, 2018, whereas our data starts at the beginning of 2014. We will only be able to use this indicator for narrower time windows, but these include the most recent "bull run" that began in late 2020. We will later compare a model that includes this metric with one that doesn't.

### Download Data <a class = "anchor" id = "download_greed"></a>

To get the data we will use Alternative.me's API, which is a RESTful API. We request data with a URL in which we set specific query parameters. In our case we set `limit` to 0, meaning we want all of the entries, `format` to JSON to get a JSON response, and `date_format` to kr (Korea) so that we get our timestamps in the format YYYY-MM-DD.<br><br>

Once we have the data we can save it as a dataframe.

In [15]:
response = requests.get("https://api.alternative.me/fng/?limit=0&format=json&date_format=kr", timeout=30)
response.raise_for_status() # stop here if the request failed

In [16]:
data = response.json()

fear_greed_df = pd.DataFrame(data['data'])

fear_greed_df['timestamp'] = pd.to_datetime(fear_greed_df['timestamp'])

In [17]:
fear_greed_df.head()

Unnamed: 0,value,value_classification,timestamp,time_until_update
0,24,Extreme Fear,2022-03-16,-1647311523.0
1,21,Extreme Fear,2022-03-15,
2,23,Extreme Fear,2022-03-14,
3,21,Extreme Fear,2022-03-13,
4,22,Extreme Fear,2022-03-12,


In [18]:
fear_greed_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1502 entries, 0 to 1501
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   value                 1502 non-null   object        
 1   value_classification  1502 non-null   object        
 2   timestamp             1502 non-null   datetime64[ns]
 3   time_until_update     1 non-null      object        
dtypes: datetime64[ns](1), object(3)
memory usage: 47.1+ KB


Our next step is to upsample this daily data to our 5-minute intervals, forward-filling each day's value. We will also drop all columns except for `Date` and `value`.

In [19]:
temp_df = fear_greed_df.set_index("timestamp").resample("5min").ffill().reset_index().rename(
    {"timestamp": "Date"}, axis=1)
temp_df.drop(columns = ['value_classification', 'time_until_update'], inplace = True)

In [20]:
temp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 433153 entries, 0 to 433152
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype         
---  ------  --------------   -----         
 0   Date    433153 non-null  datetime64[ns]
 1   value   433153 non-null  object        
dtypes: datetime64[ns](1), object(1)
memory usage: 6.6+ MB
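
The upsampling step above can be easier to follow on a toy example. The sketch below uses three made-up daily values (listed newest first, like the API response) and sorts them explicitly before resampling; the explicit `sort_index()` is a precaution added here and is not part of the notebook's code.

```python
import pandas as pd

# Three made-up daily readings, newest first
daily = pd.DataFrame(
    {"value": [45, 30, 28]},
    index=pd.to_datetime(["2018-02-03", "2018-02-02", "2018-02-01"]),
).sort_index()

# Every 5-minute slot takes the most recent daily value at or before it
five_min = daily.resample("5min").ffill()
```

Each full day expands to 288 rows that repeat the day's value. The final daily timestamp only produces a single row at midnight, so intervals later on that last day are not covered.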


Now we want to add the `value` column to our `bitcoin_features_df`, but it doesn't have a date column we can use to match on. So we will also add a date column to that dataframe.

In [21]:
bitcoin_features_df['Date'] = bitcoin_df['Date']
bitcoin_features_df['Date'] = pd.to_datetime(bitcoin_features_df['Date'])
bitcoin_features_df.head()

Unnamed: 0,OBV,EMA_7,EMA_14,EMA_21,EMA_28,EMA_50,EMA_100,EMA_150,RSI,Date
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2014-01-01 05:00:00
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2014-01-01 05:05:00
2,-14.990705,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2014-01-01 05:10:00
3,2.241411,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2014-01-01 05:15:00
4,-9.664494,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2014-01-01 05:20:00


In [22]:
bitcoin_features_df = pd.merge(temp_df, bitcoin_features_df, on='Date', how = 'right')
bitcoin_features_df.rename(columns = {'value': 'fg_index'}, inplace = True)
bitcoin_features_df

Unnamed: 0,Date,fg_index,OBV,EMA_7,EMA_14,EMA_21,EMA_28,EMA_50,EMA_100,EMA_150,RSI
0,2014-01-01 05:00:00,,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,2014-01-01 05:05:00,,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,2014-01-01 05:10:00,,-14.990705,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3,2014-01-01 05:15:00,,2.241411,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,2014-01-01 05:20:00,,-9.664494,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...
841532,2022-01-01 04:40:00,21,1.340725,46719.330248,46746.749042,46760.634241,46761.261236,46728.224930,46720.803524,46797.721663,0.426702
841533,2022-01-01 04:45:00,21,6.393124,46740.497686,46754.382503,46764.576582,46764.208737,46731.196501,46722.450979,46797.804820,0.568862
841534,2022-01-01 04:50:00,21,2.147162,46760.623265,46763.264836,46769.705984,46768.125376,46734.718207,46724.402445,46798.112041,0.624253
841535,2022-01-01 04:55:00,21,-1.076775,46747.717448,46756.029524,46764.187258,46764.047764,46733.709650,46724.097446,46796.931749,0.424825


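We have successfully added our fear and greed index to our dataset. Let's replace our NaN values with 0, and move on to adding new features. Keep in mind that 0 is also a valid index value (Extreme Fear), so rows before February 2018 carry a placeholder rather than a real reading.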

In [23]:
bitcoin_features_df['fg_index'] = bitcoin_features_df['fg_index'].fillna(value = 0) # assign back instead of a chained in-place fill
bitcoin_features_df.isna().sum()

Date        0
fg_index    0
OBV         0
EMA_7       0
EMA_14      0
EMA_21      0
EMA_28      0
EMA_50      0
EMA_100     0
EMA_150     0
RSI         0
dtype: int64

## Puell Multiple

This metric looks at the supply side of Bitcoin's economy - bitcoin miners and their revenue.
It explores market cycles from a mining revenue perspective. Bitcoin miners are sometimes referred to as compulsory sellers due to their need to cover fixed costs of mining hardware in a market where price is extremely volatile. The revenue they generate can therefore influence price over time.<br>

The Puell Multiple is calculated by dividing the daily issuance value of bitcoins (in USD) by the 365-day moving average of daily issuance value.<br>

A low Puell Multiple suggests that miners are not earning very much money, so they are incentivised to hold their coins until the price increases to cover their relatively stable costs. A high Puell Multiple suggests that miners are selling their newly minted coins to cover their cost of operations and take profit.

Since this calculation depends on the daily issuance of bitcoin, we need to acquire those numbers. Downloading the entire 456GB Bitcoin blockchain could accomplish this task, but since the minting schedule follows fixed rules, we can estimate those numbers and save a lot of time and resources. 

Since this is a metric that only updates once per day (it looks at daily issuance, not hourly or by minute), we will resample our dataframe dates to 1-day intervals, compute the Puell Multiple, and then resample back to 5-minute intervals. At that point we can add it to our `bitcoin_features_df`. We will also convert our date column to the number of days since mining began on the Bitcoin blockchain. This will help us determine which block reward cycle we are currently in. Every 1,458.33 days (210,000 blocks at 144 blocks per day) the rewards for each block are halved, affecting the daily issuance.
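
The halving arithmetic described above can be sketched as a small standalone function. This is an estimate under the stated assumptions (exactly 144 blocks per day, a 50 BTC starting reward, and a halving every 210,000 blocks), not a reading of the actual blockchain; `estimated_daily_issuance` is a name introduced here.

```python
import math

BLOCKS_PER_DAY = 24 * 60 // 10         # one block every ~10 minutes
INITIAL_REWARD_BTC = 50
HALVING_INTERVAL_BLOCKS = 210_000
DAYS_PER_HALVING = HALVING_INTERVAL_BLOCKS / BLOCKS_PER_DAY  # about 1458.33 days


def estimated_daily_issuance(days_since_mining_began: float) -> float:
    """Estimated BTC minted per day, halving once per completed cycle (hypothetical helper)."""
    completed_halvings = math.floor(days_since_mining_began / DAYS_PER_HALVING)
    return BLOCKS_PER_DAY * INITIAL_REWARD_BTC / 2 ** completed_halvings


start_of_mining = estimated_daily_issuance(0)
start_of_dataset = estimated_daily_issuance(1819)   # first day index used in this notebook
end_of_dataset = estimated_daily_issuance(4740)     # last day index shown in this notebook
```

Real halving dates drift from this schedule because actual block times are not exactly 10 minutes, so the estimate can be off by some weeks around each halving.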

In [24]:
puell_df = pd.DataFrame()
puell_df['Date'] = bitcoin_df['Date']
puell_df['Date'] = pd.to_datetime(puell_df['Date'])

puell_df['Value'] = bitcoin_df['Open']
puell_df.set_index('Date', inplace = True)
puell_df = puell_df.resample('D').mean()
temp_date = puell_df.index # save the resampled date format so we can resample back again later
puell_df.sort_values(by = "Date", inplace = True) # have the dates start with the earliest first
puell_df.reset_index(inplace = True, drop = True)
puell_df.reset_index(inplace = True)


# While the first block was mined on January 3rd, 2009, proper mining didn't start until 6 days later on January 9th.
# We have to add the number of days between January 9th, 2009 and the first date of our dataset (January 1st, 2014) to
# our index column to properly reflect the days since mining began.
# The number of days between those dates is 1,818, plus one to offset for the index starting at 0

puell_df['index'] += 1819
puell_df

Unnamed: 0,index,Value
0,1819,745.649264
1,1820,757.447563
2,1821,795.871155
3,1822,810.806777
4,1823,865.487221
...,...,...
2918,4737,48878.768493
2919,4738,47597.815106
2920,4739,47109.523403
2921,4740,47297.407534


In [25]:
# Bitcoin's algorithm targets 10 minutes as the average block time, there are 1440 minutes in a day, so 144 blocks per day.
# At the beginning the daily issuance would have been 7200 BTC as the block rewards were 50 BTC at the time.
# There are about 1458.3 days between halvings (rounded to 1458 below), so we divide the days since mining began by the days
# in a halving cycle, round down to get the number of completed halvings, and divide 7200 by 2 to that power to find that day's BTC issuance.
puell_df['BTC_issuance'] = 7200 / 2**(np.floor(puell_df['index'] / 1458 )) # we round down to get the halving cycle as an integer
puell_df['USD_issuance'] = puell_df['BTC_issuance'] * puell_df['Value']
puell_df['MA_USD_issuance'] = puell_df['USD_issuance'].rolling(window = 365).mean() # calculate 365-day moving average of btc issued in USD value
puell_df['MA_USD_issuance'] = puell_df['MA_USD_issuance'].fillna(0) # fill in the NaN values that exist for the first 364 days
puell_df['puell'] = puell_df['USD_issuance'] / puell_df['MA_USD_issuance']
puell_df.replace([np.inf, -np.inf], 0, inplace = True) # replace the inf values caused by dividing by 0 in those first 364 days with 0
puell_df

Unnamed: 0,index,Value,BTC_issuance,USD_issuance,MA_USD_issuance,puell
0,1819,745.649264,3600.0,2.684337e+06,0.000000e+00,0.000000
1,1820,757.447563,3600.0,2.726811e+06,0.000000e+00,0.000000
2,1821,795.871155,3600.0,2.865136e+06,0.000000e+00,0.000000
3,1822,810.806777,3600.0,2.918904e+06,0.000000e+00,0.000000
4,1823,865.487221,3600.0,3.115754e+06,0.000000e+00,0.000000
...,...,...,...,...,...,...
2918,4737,48878.768493,900.0,4.399089e+07,4.256537e+07,1.033490
2919,4738,47597.815106,900.0,4.283803e+07,4.255125e+07,1.006740
2920,4739,47109.523403,900.0,4.239857e+07,4.259801e+07,0.995318
2921,4740,47297.407534,900.0,4.256767e+07,4.264355e+07,0.998221


Now we can resample our data back to our 5-minute intervals, and use forward fill to impute the new in-between values. 

In [26]:
puell_df['index'] = temp_date
puell_5m_df = puell_df.set_index("index").resample("5min").ffill().reset_index().rename(
    {"index": "Date"}, axis=1)
puell_5m_df

Unnamed: 0,Date,Value,BTC_issuance,USD_issuance,MA_USD_issuance,puell
0,2014-01-01 00:00:00,745.649264,3600.0,2.684337e+06,0.000000e+00,0.000000
1,2014-01-01 00:05:00,745.649264,3600.0,2.684337e+06,0.000000e+00,0.000000
2,2014-01-01 00:10:00,745.649264,3600.0,2.684337e+06,0.000000e+00,0.000000
3,2014-01-01 00:15:00,745.649264,3600.0,2.684337e+06,0.000000e+00,0.000000
4,2014-01-01 00:20:00,745.649264,3600.0,2.684337e+06,0.000000e+00,0.000000
...,...,...,...,...,...,...
841532,2021-12-31 23:40:00,47297.407534,900.0,4.256767e+07,4.264355e+07,0.998221
841533,2021-12-31 23:45:00,47297.407534,900.0,4.256767e+07,4.264355e+07,0.998221
841534,2021-12-31 23:50:00,47297.407534,900.0,4.256767e+07,4.264355e+07,0.998221
841535,2021-12-31 23:55:00,47297.407534,900.0,4.256767e+07,4.264355e+07,0.998221


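Now we can add our `puell` variable to our `bitcoin_features_df`. Note that the assignment below aligns the two dataframes by row index rather than by timestamp: `puell_5m_df` starts at 00:00 while `bitcoin_features_df` starts at 05:00, so each row receives the Puell value from five hours earlier. Because the Puell Multiple only changes once per day, this affects the first five hours of each day, which carry the previous day's value.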

In [27]:
bitcoin_features_df['puell'] = puell_5m_df['puell']
bitcoin_features_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 841537 entries, 0 to 841536
Data columns (total 12 columns):
 #   Column    Non-Null Count   Dtype         
---  ------    --------------   -----         
 0   Date      841537 non-null  datetime64[ns]
 1   fg_index  841537 non-null  object        
 2   OBV       841537 non-null  float64       
 3   EMA_7     841537 non-null  float64       
 4   EMA_14    841537 non-null  float64       
 5   EMA_21    841537 non-null  float64       
 6   EMA_28    841537 non-null  float64       
 7   EMA_50    841537 non-null  float64       
 8   EMA_100   841537 non-null  float64       
 9   EMA_150   841537 non-null  float64       
 10  RSI       841537 non-null  float64       
 11  puell     841537 non-null  float64       
dtypes: datetime64[ns](1), float64(10), object(1)
memory usage: 83.5+ MB
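
If timestamp alignment is wanted instead, one option is to merge on `Date`. The toy sketch below contrasts the two approaches using stand-in values; the frame names and values are made up for illustration, and only the 00:00 versus 05:00 start times mirror the notebook's data.

```python
import pandas as pd

# Feature rows start at 05:00, like bitcoin_features_df
features = pd.DataFrame({"Date": pd.date_range("2014-01-01 05:00", periods=3, freq="5min")})

# Puell rows start at 00:00, like puell_5m_df; the values just number the rows
puell = pd.DataFrame({
    "Date": pd.date_range("2014-01-01 00:00", periods=72, freq="5min"),
    "puell": range(72),
})

by_position = features.assign(puell=puell["puell"])          # aligns on index labels 0, 1, 2
by_timestamp = features.merge(puell, on="Date", how="left")  # aligns on the timestamp
```

Here `by_position` picks up rows 0 to 2 of the Puell frame (00:00 to 00:10), while `by_timestamp` picks up rows 60 to 62, the ones whose timestamps actually match.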


## Adding Features to Original Data <a class = "anchor" id = "adding"></a>

Before we add our features back to our original dataset, let's convert our `fg_index` to a float64 like the rest of the features. We can also drop the `Date` column now, as it already exists in our original data.

In [28]:
bitcoin_features_df['fg_index'] = bitcoin_features_df['fg_index'].astype('float64')
bitcoin_features_df.drop(columns = 'Date', inplace = True)
bitcoin_features_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 841537 entries, 0 to 841536
Data columns (total 11 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   fg_index  841537 non-null  float64
 1   OBV       841537 non-null  float64
 2   EMA_7     841537 non-null  float64
 3   EMA_14    841537 non-null  float64
 4   EMA_21    841537 non-null  float64
 5   EMA_28    841537 non-null  float64
 6   EMA_50    841537 non-null  float64
 7   EMA_100   841537 non-null  float64
 8   EMA_150   841537 non-null  float64
 9   RSI       841537 non-null  float64
 10  puell     841537 non-null  float64
dtypes: float64(11)
memory usage: 77.0 MB


Let's add our features back to our original data.

In [29]:
bitcoin_df = pd.concat([bitcoin_df, bitcoin_features_df], axis = 1)
bitcoin_df

Unnamed: 0,Date,Open,Close,High,Low,Volume,fg_index,OBV,EMA_7,EMA_14,EMA_21,EMA_28,EMA_50,EMA_100,EMA_150,RSI,puell
0,2014-01-01 05:00:00,748.590000,749.000000,749.000000,748.590000,6.010000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,2014-01-01 05:05:00,749.000000,749.000000,749.000000,748.950000,31.300120,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,2014-01-01 05:10:00,747.000000,748.930000,749.960000,746.280000,14.990705,0.0,-14.990705,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3,2014-01-01 05:15:00,749.890000,749.890000,749.890000,749.890000,2.241411,0.0,2.241411,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,2014-01-01 05:20:00,748.910000,748.890000,749.930000,748.890000,9.664494,0.0,-9.664494,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
841532,2022-01-01 04:40:00,46684.368230,46738.951211,46745.774083,46684.368230,1.340725,21.0,1.340725,46719.330248,46746.749042,46760.634241,46761.261236,46728.224930,46720.803524,46797.721663,0.426702,0.998221
841533,2022-01-01 04:45:00,46740.088356,46804.000000,46804.000000,46740.088356,6.393124,21.0,6.393124,46740.497686,46754.382503,46764.576582,46764.208737,46731.196501,46722.450979,46797.804820,0.568862,0.998221
841534,2022-01-01 04:50:00,46804.000000,46821.000000,46834.000000,46804.000000,2.147162,21.0,2.147162,46760.623265,46763.264836,46769.705984,46768.125376,46734.718207,46724.402445,46798.112041,0.624253,0.998221
841535,2022-01-01 04:55:00,46807.179936,46709.000000,46808.000000,46709.000000,1.076775,21.0,-1.076775,46747.717448,46756.029524,46764.187258,46764.047764,46733.709650,46724.097446,46796.931749,0.424825,0.998221


In [30]:
bitcoin_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 841537 entries, 0 to 841536
Data columns (total 17 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   Date      841537 non-null  object 
 1   Open      841537 non-null  float64
 2   Close     841537 non-null  float64
 3   High      841537 non-null  float64
 4   Low       841537 non-null  float64
 5   Volume    841537 non-null  float64
 6   fg_index  841537 non-null  float64
 7   OBV       841537 non-null  float64
 8   EMA_7     841537 non-null  float64
 9   EMA_14    841537 non-null  float64
 10  EMA_21    841537 non-null  float64
 11  EMA_28    841537 non-null  float64
 12  EMA_50    841537 non-null  float64
 13  EMA_100   841537 non-null  float64
 14  EMA_150   841537 non-null  float64
 15  RSI       841537 non-null  float64
 16  puell     841537 non-null  float64
dtypes: float64(16), object(1)
memory usage: 109.1+ MB


Let's also convert our Date column to the datetime dtype.

In [31]:
bitcoin_df['Date'] = pd.to_datetime(bitcoin_df['Date'])
bitcoin_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 841537 entries, 0 to 841536
Data columns (total 17 columns):
 #   Column    Non-Null Count   Dtype         
---  ------    --------------   -----         
 0   Date      841537 non-null  datetime64[ns]
 1   Open      841537 non-null  float64       
 2   Close     841537 non-null  float64       
 3   High      841537 non-null  float64       
 4   Low       841537 non-null  float64       
 5   Volume    841537 non-null  float64       
 6   fg_index  841537 non-null  float64       
 7   OBV       841537 non-null  float64       
 8   EMA_7     841537 non-null  float64       
 9   EMA_14    841537 non-null  float64       
 10  EMA_21    841537 non-null  float64       
 11  EMA_28    841537 non-null  float64       
 12  EMA_50    841537 non-null  float64       
 13  EMA_100   841537 non-null  float64       
 14  EMA_150   841537 non-null  float64       
 15  RSI       841537 non-null  float64       
 16  puell     841537 non-null  float64    

We now have a dataset ready for modelling. Let's export this dataframe to a CSV file.

In [32]:
bitcoin_df.to_csv("Bitcoin 5-minute historical data modelling.csv")