# Acquiring and Cleaning Historical Price Data for Bitcoin and other Cryptocurrencies

__Niklas Gutheil__<br>

__2022-03-01__

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import bitfinex
import time
import datetime
import plotly.express as px
# stats
from statsmodels.api import tsa # time series analysis


## Introduction

To begin any data science project, we first have to acquire our data. For our purposes, we will connect to the Bitfinex API and download Bitcoin's historical price data in 5-minute intervals, which we can later aggregate to any coarser timeframe we wish.

There is also a custom Python library for connecting to Bitfinex's public API, which we have installed into our environment using: <br> `pip install bitfinex_tencars`

Since this is a public API, we won't need to generate an API key, but this comes with limitations: we can make only 1 request per second, and each request returns at most 10,000 results.

The basic information the request needs is:
- `pair`: The trading pair, e.g. BTCUSD
- `time_interval`: The time resolution you want the data in, e.g. 1 minute, 1 hour, 1 day
- `t_start`: Start date of the requested historical price data
- `t_stop`: End date of the requested historical price data <br><br>

We will download our data in 5-minute intervals, as 1-minute intervals aren't necessary for our intended purpose of showcasing algorithms and would create far too many data entries. We will also download all the historical data from January 1st, 2014 to January 1st, 2022. We choose 2014 as our starting point because trading activity before then looked very different from today: market conditions were more akin to those of penny stocks, and prices would thus behave differently than those of high-volume assets.
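
To make the size argument concrete, the arithmetic can be sketched as follows. This is only a back-of-the-envelope estimate: `candle_count` and `MAX_ROWS_PER_REQUEST` are illustrative names, and the calculation assumes one candle exists for every interval, which real exchange data does not guarantee.

```python
import datetime
import math

# Date range used in this notebook
start = datetime.datetime(2014, 1, 1)
stop = datetime.datetime(2022, 1, 1)
span_seconds = (stop - start).total_seconds()

MAX_ROWS_PER_REQUEST = 10_000  # per-request limit of the public API stated above


def candle_count(interval_seconds):
    """Number of whole intervals that fit in the date range."""
    return int(span_seconds // interval_seconds)


for label, seconds in [("1m", 60), ("5m", 300)]:
    n = candle_count(seconds)
    n_requests = math.ceil(n / MAX_ROWS_PER_REQUEST)
    print(f"{label}: {n:,} candles, at least {n_requests} requests")
```

At one request per second, the 5-minute resolution therefore needs roughly a fifth of the download time of the 1-minute resolution.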

In [None]:
pair = 'BTCUSD'

time_interval = '5m'    # 5-minute time interval

t_start = datetime.datetime(2014, 1, 1, 0, 0)     # January 1st, 2014
t_start = time.mktime(t_start.timetuple()) * 1000    # convert starting date to milliseconds since unix epoch

t_stop = datetime.datetime(2022, 1, 1, 0, 0)    # January 1st, 2022
t_stop = time.mktime(t_stop.timetuple()) * 1000
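
One caveat: `time.mktime` interprets the time tuple in the local time zone of the machine running the notebook, so the requested window shifts with the machine's UTC offset. If a fixed UTC window is preferred, a possible alternative is sketched below. The helper name `to_epoch_ms_utc` is hypothetical and not part of the original notebook.

```python
import datetime


def to_epoch_ms_utc(year, month, day):
    """Milliseconds since the Unix epoch for midnight UTC on the given date."""
    dt = datetime.datetime(year, month, day, tzinfo=datetime.timezone.utc)
    return dt.timestamp() * 1000


t_start_utc = to_epoch_ms_utc(2014, 1, 1)
t_stop_utc = to_epoch_ms_utc(2022, 1, 1)
print(t_start_utc, t_stop_utc)
```

Because the datetime is timezone-aware, the result is the same regardless of where the code runs.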


Now we have to build a function that will collect all of our chunks of data and combine them into a single DataFrame. Since we are using the public REST API, we are limited to 10,000 entries returned per request, and 1 request per second. <br><br>

Our function will take in the parameters above plus one additional parameter, `s_inInterval`, which stores how many seconds our `time_interval` variable spans. This is important for calculating how many times we have to call the API to get our full data. <br><br>

Our function will return the combined data from all API requests.

In [None]:
def fetch_data(start, stop, symbol, interval, s_inInterval):
    limit = 10000    # We want the maximum of 10000 data points

    api = bitfinex.bitfinex_v2.api_v2() # Create API instance
    
    interval_milli = s_inInterval * 1000 # turn our seconds in interval to milliseconds
    step = interval_milli * limit # our step size (time interval for each request) will be 10,000 times the single time interval we want our data in
    data = []

    total_steps = (stop-start)/interval_milli # total number of intervals (candles) left to fetch; each request covers up to `limit` of them
    
    while total_steps > 0:
        if total_steps < limit: # recalculating ending steps
            step = total_steps * interval_milli

        end = start + step # define endpoint for this request
        
        data += api.candles(symbol=symbol, interval=interval, limit=limit, start=start, end=end)
        print(pd.to_datetime(start, unit='ms'), pd.to_datetime(end, unit='ms'), "steps left:", total_steps)
        
        start = start + step # update new start point for next request
        total_steps -= limit # update total_steps left
        
        time.sleep(1.5) # sleep for 1.5 seconds to stay under the limit of 1 request per second
    return data
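
To see how the function walks through the date range without calling the API, the windowing logic can be isolated in a small sketch. `request_windows` is a hypothetical helper written for illustration. It mirrors the start/end bookkeeping above but does not make any requests.

```python
def request_windows(start_ms, stop_ms, interval_ms, limit=10_000):
    """Split [start_ms, stop_ms] into consecutive windows of at most `limit` intervals."""
    step = interval_ms * limit
    windows = []
    cursor = start_ms
    while cursor < stop_ms:
        end = min(cursor + step, stop_ms)  # the final window may be shorter
        windows.append((cursor, end))
        cursor = end                       # next window starts where this one ended
    return windows


# Toy example: 25 five-minute intervals, at most 10 per request
windows = request_windows(0, 25 * 300_000, 300_000, limit=10)
for w in windows:
    print(w)
```

Note that each window starts at the timestamp where the previous one ended. If the API treats both ends as inclusive (an assumption not verified here), the candle on each shared boundary would be returned twice, which could be one source of duplicate rows.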

Now we can fetch the data using the variables defined earlier, along with `s_inInterval`, which is simply 300 (5 minutes × 60 seconds).

In [None]:
s_inInterval = 300

result = fetch_data(t_start, t_stop, pair, time_interval, s_inInterval)
names = ['Date', 'Open', 'Close', 'High', 'Low', 'Volume']
df = pd.DataFrame(result, columns=names)


Let's inspect our data to make sure we have our expected columns.

In [None]:
df.head(5)

The data looks to be in the form that we were looking for. The columns can be interpreted as follows:
- __Date__: Time in milliseconds since the Unix epoch
- __Open__: The price of Bitcoin in USD at the beginning of our 5-minute interval
- __Close__: The price of Bitcoin in USD at the end of our 5-minute interval
- __High__: The highest price of Bitcoin in USD during our 5-minute interval
- __Low__: The lowest price of Bitcoin in USD during our 5-minute interval
- __Volume__: The amount of Bitcoin bought and sold in our 5-minute interval


Now we can take a look at duplicate values and decide if we want to drop them.

In [None]:
print(f'{df.duplicated().sum()} Duplicate Entries')
print(f'{np.round(df.duplicated().sum()/df.shape[0]*100, decimals = 2)}% of Entries are Duplicates')

In [None]:
df[df.duplicated(keep = False)].head(50)

The entries appear to be perfect duplicates, as they share the same timestamp and all other corresponding values. They also make up only a tiny fraction of our total data, so we can safely drop them. A related concern is that some timestamps may have no entry at all. We will investigate this later on in the notebook.

In [None]:
df.drop_duplicates(inplace=True)
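
It may be worth confirming that no timestamp repeats with different values, since `drop_duplicates()` without arguments only removes rows that match in every column. The sketch below uses made-up toy values, not the Bitfinex data, to show the difference between the two checks.

```python
import pandas as pd

# Toy data: timestamp 2000 is a perfect duplicate, timestamp 3000 repeats with a different Close
toy = pd.DataFrame({
    "Date": [1000, 2000, 2000, 3000, 3000],
    "Close": [10.0, 11.0, 11.0, 12.0, 12.5],
})

full_dupes = toy.duplicated().sum()              # rows identical in every column
ts_dupes = toy.duplicated(subset="Date").sum()   # rows repeating an earlier timestamp

deduped = toy.drop_duplicates()
remaining = deduped["Date"].duplicated().sum()   # timestamps still repeated afterwards

print(full_dupes, ts_dupes, remaining)
```

If the last count is non-zero on the real data, the Date column would not be a unique index, and a later `reindex` on it would raise an error.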

We can also see that our Date column shows milliseconds since the Unix epoch. We will want to convert this column into a datetime format so we can use this dataset for time-series-specific analysis. Additionally, we will set our Date column as the index, as each timestamp is unique.

In [None]:
df['Date'] = pd.to_datetime(df['Date'], unit='ms')
df.set_index('Date', inplace=True)
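
As a quick sanity check of the unit, a single known value can be converted by hand. The constant below is the millisecond timestamp for midnight UTC on January 1st, 2014. The resulting timestamps are timezone-naive and represent UTC.

```python
import pandas as pd

ms_since_epoch = 1388534400000  # 2014-01-01 00:00:00 UTC, in milliseconds

ts = pd.to_datetime(ms_since_epoch, unit="ms")
print(ts)
```

Passing `unit="s"` for the same number would be wrong by a factor of 1,000, which is why the unit has to match what the API returns.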

In [None]:
df.head(10)

Our columns are looking much better now, but another problem we might have is missing entries for some time intervals. Let's explore whether any are missing and, if so, how many.

In [None]:
first_time = df.index.min()
last_time = df.index.max()
print(first_time, last_time)

In [None]:
full_range = pd.date_range(start=first_time, end=last_time, freq="5min")
differences = full_range.difference(df.index)

In [None]:
print(f"{100*round(len(differences)/df.shape[0], 6)}% of intervals missing")
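
Note that the percentage above divides by the number of rows we have, not by the number of intervals in the full range. The two denominators give slightly different figures, as the toy example below illustrates. The data is made up for illustration.

```python
import pandas as pd

# Toy range of ten 5-minute intervals, with three of them absent
full = pd.date_range("2014-01-01 00:00", periods=10, freq="5min")
present = full.delete([3, 4, 7])
missing = full.difference(present)

share_of_full = len(missing) / len(full)         # missing relative to all expected intervals
share_of_present = len(missing) / len(present)   # missing relative to rows we actually have

print(f"{share_of_full:.2%} of expected intervals missing")
print(f"{share_of_present:.2%} relative to existing rows")
```

For small gap rates the two numbers are close, so either works as a rough indicator.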

We can see that only 2.29% of our entries are missing, which is good, as imputing so few entries should not distort our results by much. First, let's check whether we have any NULL values in our dataset.

In [None]:
df.isna().sum()

No NULL values is great to see! <br>
We should also make sure none of the values are 0, in case Bitfinex imputed null values with 0.

In [None]:
display(np.where(df['Volume'] == 0))
display(np.where(df['Open'] == 0))
display(np.where(df['Close'] == 0))
display(np.where(df['High'] == 0))
display(np.where(df['Low'] == 0))

None of the values are 0 so we are now confident we have real values for our existing entries.
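
The same check can be written more compactly by counting zeros per column in one step. The sketch below runs on a small made-up DataFrame rather than the real data.

```python
import pandas as pd

# Toy data containing a few zeros
toy = pd.DataFrame({
    "Open": [1.0, 0.0, 3.0],
    "Close": [1.5, 2.5, 3.5],
    "Volume": [0.0, 0.0, 4.0],
})

# Boolean mask of zeros, summed column-wise
zero_counts = (toy == 0).sum()
print(zero_counts)
```

On the cleaned data, every count should be 0 if the conclusion above holds.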

Our next step is to create entries for the missing timestamps and then fill the resulting null values. We could forward-fill to repeat the last known value, but interpolating between the previous and next known values seems more appropriate for price data, since a price moving from one level to another has to pass through the values in between. <br>
We will set `method='time'` in `interpolate` so that runs of two or more consecutive missing entries are handled sensibly: instead of filling every missing row in a run with the same number, the change is spread across the missing rows in proportion to elapsed time.

In [None]:
df_clean = df.reindex(full_range)

In [None]:
df_clean = df_clean.interpolate(method='time')
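
The difference between forward filling and time-based interpolation is easiest to see on a tiny series with two consecutive gaps. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2014-01-01 00:00", periods=4, freq="5min")
prices = pd.Series([100.0, np.nan, np.nan, 130.0], index=idx)

ffilled = prices.ffill()                    # repeats the last known value
timed = prices.interpolate(method="time")   # spreads the change over the gap

print(pd.DataFrame({"ffill": ffilled, "time": timed}))
```

With forward filling the price stays flat and then jumps, while time interpolation moves in equal steps because the timestamps are evenly spaced.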

Let's check if we have any missing time intervals or NULL values.

In [None]:
first_time = df_clean.index.min()
last_time = df_clean.index.max()

full_range = pd.date_range(start=first_time, end=last_time, freq="5min")
differences = full_range.difference(df_clean.index)

df_clean.sort_index(inplace=True) # make sure all the entries are in order of date

print(f"{100*round(len(differences)/df_clean.shape[0], 6)}% of intervals missing")

display(df_clean.isna().sum())

print(df_clean.shape)

We now have a fully cleaned dataset with no missing intervals, no null values, and 841,537 entries! Let's view Bitcoin's historical price data from January 1st, 2014 to January 1st, 2022 in a graph.

In [None]:
# plot the opening price over time, using the cleaned data
fig = px.line(df_clean, x=df_clean.index, y=df_clean.Open,)

# axis labels and title
fig.update_layout(
    xaxis_title="Date", 
    yaxis_title="Price in USD",
    legend_title="", 
    title="Bitcoin Price Chart, 2014 to 2022"
)

# activate slider
fig.update_xaxes(rangeslider_visible=True)

fig.show()

Lastly, let's save this data as a CSV file so it is easier to reuse later.

In [None]:
df_clean.to_csv(f"{pair}_{time_interval} historical data.csv")
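
As mentioned in the introduction, the 5-minute candles can later be aggregated to a coarser timeframe. A minimal sketch of one way to do this with `resample` follows, using made-up toy candles with the same column names as our dataset. The aggregation rules (first open, last close, maximum high, minimum low, summed volume) are the usual convention for OHLCV data, stated here as an assumption rather than something taken from the Bitfinex documentation.

```python
import pandas as pd

# Toy data: two hours of 5-minute candles
idx = pd.date_range("2014-01-01 00:00", periods=24, freq="5min")
toy = pd.DataFrame({
    "Open": range(100, 124),
    "Close": range(101, 125),
    "High": range(102, 126),
    "Low": range(99, 123),
    "Volume": [1.0] * 24,
}, index=idx)

# Aggregate to hourly candles
hourly = toy.resample("60min").agg({
    "Open": "first",   # price at the start of the hour
    "Close": "last",   # price at the end of the hour
    "High": "max",     # highest price within the hour
    "Low": "min",      # lowest price within the hour
    "Volume": "sum",   # total volume traded within the hour
})
print(hourly)
```

The same pattern should apply to `df_clean` with any frequency that is a multiple of 5 minutes.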