<a href="https://www.kaggle.com/code/koenbotermans/kraken-proccesing-historical-trades-to-bars?scriptVersionId=172422282" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Kraken - preprocessing historical trade data for 2022.
Are you intrigued by the potential of leveraging your data science skills to generate income through the creation of models on financial data? Have you ever pondered the possibility of crafting algorithmic trading strategies that could yield profitable results? Perhaps you're simply interested in conducting basic analyses of Bitcoin exchange data. Look no further. This series of notebooks commences with the fundamental step of acquiring historical trade data, setting the stage for a comprehensive exploration into the realm of financial analytics and algorithmic trading.

This notebook continues where my [last notebook](https://www.kaggle.com/code/koenbotermans/kraken-obtaining-historical-trade-data-2022?scriptVersionId=167110759) left off, and it focuses on deriving candlestick data.

Creating bars directly from ticks (trades) data offers several benefits in financial analysis and trading:

1. **Granularity Control**: Generating bars from tick data allows traders to customize the granularity of their data, enabling them to analyze price movements at various levels of detail, such as seconds, minutes, or even ticks themselves.
2. **Accurate Volume Analysis**: By constructing bars from tick data, traders can accurately calculate volume at each price level, providing insights into buying and selling pressure and allowing for more informed trading decisions.
3. **Reduced Noise**: Bars derived from tick data can help filter out market noise and smooth price fluctuations, making it easier to identify meaningful trends and patterns in the data.
4. **Improved Backtesting**: Using bars created from tick data in backtesting allows traders to test their strategies with more accurate and realistic data, leading to more reliable performance evaluations and strategy optimizations.
5. **Enhanced Transparency**: Building bars directly from tick data provides a transparent view of market activity, enabling traders to better understand price dynamics, liquidity conditions, and market microstructure. This transparency can lead to improved trade execution and risk management strategies.

## Content of this notebook.
This notebook will answer the following questions:

 - [What is historical trade data?](#paragraph1)
 - [How to derive basic hourly sampled klines bars?](#paragraph2)
 - [What is wrong with basic hourly sampled klines bars?](#paragraph3)
 - [How to sample tick bars?](#paragraph4)
 - [What is wrong with tick bars?](#paragraph5)
 - [How to sample volume bars?](#paragraph6)
 - [How to compare basic, tick and volume bars?](#paragraph7)
 
 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.figure_factory as ff
import plotly.graph_objects as go

# What is historical trade data? <a name="paragraph1"></a>

Historical trade data, also known as historical market data or historical tick data, consists of records of all trades executed on an exchange over a specific period. It includes information such as trade prices, volumes, timestamps, and possibly additional metadata. This data is fundamental because it provides a detailed record of market activity, enabling analysis of past price movements, volume trends, liquidity, and market dynamics. Traders and analysts use historical trade data to backtest trading strategies, conduct market research, identify patterns, and make informed decisions about future trading activities. 

Simply put, it is the most atomic data available for a given exchange pair, in our case BTC-USDT on the Kraken exchange in the year 2022.


In [2]:
historical_trades_df = pd.read_csv("/kaggle/input/kraken-2022-historical-trades-btcusdt-csv/kraken_2022_historical_trades_btcusdt.csv")
historical_trades_df

Unnamed: 0,timestamp,id,order,info,datetime,symbol,type,side,takerOrMaker,price,amount,cost,fee,fees
0,2022-01-01 00:00:14.149,2602148,,"['46214.40000', '0.00115773', '1640995214.1498...",2022-01-01T00:00:14.149Z,BTC/USDT,limit,sell,,46214.4,0.001158,53.503797,,[]
1,2022-01-01 00:00:17.992,2602149,,"['46225.10000', '0.07861219', '1640995217.9920...",2022-01-01T00:00:17.992Z,BTC/USDT,limit,sell,,46225.1,0.078612,3633.856344,,[]
2,2022-01-01 00:00:17.995,2602150,,"['46225.10000', '0.05117904', '1640995217.9956...",2022-01-01T00:00:17.995Z,BTC/USDT,limit,sell,,46225.1,0.051179,2365.756242,,[]
3,2022-01-01 00:00:18.495,2602151,,"['46225.10000', '0.04520877', '1640995218.4951...",2022-01-01T00:00:18.495Z,BTC/USDT,limit,sell,,46225.1,0.045209,2089.779914,,[]
4,2022-01-01 00:00:19.039,2602152,,"['46229.90000', '0.05117904', '1640995219.0396...",2022-01-01T00:00:19.039Z,BTC/USDT,limit,sell,,46229.9,0.051179,2366.001901,,[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1457995,2023-01-01 17:53:49.043,4057532,,"['16583.10000', '0.00258002', '1672595629.0437...",2023-01-01T17:53:49.043Z,BTC/USDT,limit,buy,,16583.1,0.002580,42.784730,,[]
1457996,2023-01-01 17:53:55.005,4057533,,"['16583.50000', '0.00258022', '1672595635.0057...",2023-01-01T17:53:55.005Z,BTC/USDT,limit,buy,,16583.5,0.002580,42.789078,,[]
1457997,2023-01-01 17:53:55.007,4057534,,"['16583.50000', '0.00273437', '1672595635.0074...",2023-01-01T17:53:55.007Z,BTC/USDT,limit,buy,,16583.5,0.002734,45.345425,,[]
1457998,2023-01-01 17:54:01.015,4057535,,"['16582.90000', '0.00957066', '1672595641.0157...",2023-01-01T17:54:01.015Z,BTC/USDT,limit,buy,,16582.9,0.009571,158.709298,,[]


# How to derive basic hourly sampled klines bars? <a name="paragraph2"></a>

You can sample basic bars by combining the `.resample()` and `.ohlc()` methods provided by pandas. Together they aggregate trade prices into Open, High, Low, and Close values for a specified time interval, such as hourly, daily, or weekly. Volume and trade count are aggregated separately (with `sum` and `count`), since `.ohlc()` only covers the price columns. The result is a set of summarized bars (OHLCV) that capture the essential trading information within each time frame for further analysis and visualization.
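
As a small illustration of time-based resampling, the sketch below builds hourly bars from a handful of made-up trades. The `toy_trades` frame and its values are hypothetical and only mirror the `timestamp`, `price` and `amount` columns of the real file.

```python
import pandas as pd

# Hypothetical toy trades (not taken from the Kraken file).
toy_trades = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2022-01-01 00:05:00",
                "2022-01-01 00:40:00",
                "2022-01-01 00:55:00",
                "2022-01-01 01:10:00",
                "2022-01-01 01:30:00",
            ]
        ),
        "price": [100.0, 103.0, 101.0, 102.0, 99.0],
        "amount": [0.5, 0.2, 0.3, 1.0, 0.4],
    }
).set_index("timestamp")

# Price columns: first / max / min / last trade price per hour.
toy_ohlc = toy_trades["price"].resample("1h").ohlc()
# Volume is not part of .ohlc(); it is summed separately.
toy_volume = toy_trades["amount"].resample("1h").sum().rename("volume")

toy_bars = pd.concat([toy_ohlc, toy_volume], axis=1)
print(toy_bars)
```

The first hour opens at the first trade price (100.0) and closes at the last one (101.0), regardless of how many trades fall in between.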


In [3]:
def derive_time_bar_df(historical_trades_df: pd.DataFrame, bar_time: str = "1h") -> pd.DataFrame:
    """Derive basic time sample bar data. """
    
    # Make a copy, and use the timestamp column as index.
    trades_df = historical_trades_df.copy()
    trades_df["timestamp"] = pd.to_datetime(trades_df["timestamp"])
    
    #copy timestamp so we can use it to get the open and close time
    trades_df["timestamp_copy"] = trades_df["timestamp"].copy()
    trades_df = trades_df.set_index(keys=["timestamp"])
    
    ohlc_df = trades_df["price"].resample(rule=bar_time).ohlc()
    times_df = trades_df["timestamp_copy"].resample(rule=bar_time).agg(["first", "last"])
    times_df.columns = ["open_time", "close_time"]
    volume_count_df = trades_df["amount"].resample(rule=bar_time).agg(["sum", "count"])
    volume_count_df.columns = ["volume", "count"]
    return pd.concat(objs=[times_df, ohlc_df, volume_count_df], axis=1).reset_index(drop=True)

In [4]:
time_bar_df = derive_time_bar_df(historical_trades_df=historical_trades_df)
time_bar_df

Unnamed: 0,open_time,close_time,open,high,low,close,volume,count
0,2022-01-01 00:00:14.149,2022-01-01 00:59:52.709,46214.4,46710.0,46214.4,46671.3,17.384664,278
1,2022-01-01 01:00:26.111,2022-01-01 01:59:54.034,46658.1,46900.3,46556.4,46770.6,21.329330,218
2,2022-01-01 02:06:06.614,2022-01-01 02:50:24.206,46801.7,46857.9,46740.9,46779.6,4.405363,82
3,2022-01-01 03:00:30.096,2022-01-01 03:56:28.309,46800.9,46878.8,46769.1,46851.1,0.722208,60
4,2022-01-01 04:06:37.716,2022-01-01 04:57:32.773,46851.1,46851.1,46649.1,46727.7,19.246209,210
...,...,...,...,...,...,...,...,...
8773,2023-01-01 13:00:31.901,2023-01-01 13:56:02.789,16563.1,16565.8,16551.8,16551.9,0.815243,29
8774,2023-01-01 14:02:39.414,2023-01-01 14:57:42.597,16552.1,16555.2,16551.8,16551.9,1.303458,38
8775,2023-01-01 15:01:54.558,2023-01-01 15:59:19.501,16551.8,16555.2,16536.3,16555.2,1.589749,65
8776,2023-01-01 16:01:11.838,2023-01-01 16:59:16.873,16555.2,16581.3,16555.2,16566.2,3.124767,50


In [5]:
time_bar_df.to_csv("kraken_btcusdt_2022_time_bar.csv")

In [6]:
fig = go.Figure(data=[go.Candlestick(x=time_bar_df['open_time'],
                open=time_bar_df['open'],
                high=time_bar_df['high'],
                low=time_bar_df['low'],
                close=time_bar_df['close'])])

# Adding title
fig.update_layout(title="Hourly Sampled Candlestick Chart")

# Adding axis labels
fig.update_layout(xaxis=dict(title="Time"),
                  yaxis=dict(title="Price"))


fig.show()

# What is wrong with basic hourly sampled klines bars? <a name="paragraph3"></a>
The problem with time bars is that they don't sample based on information, but on time. 

This is easy to see in the distribution plots below: some of the sampled bars contain a lot of trades, while others contain none. The same holds for volume: some bars hold a lot of traded volume, while others hold none at all.

I plotted this for your convenience below...

In [7]:
import plotly.figure_factory as ff

fig = ff.create_distplot([time_bar_df["count"]], ["count"])
fig.update_layout(title="Distribution of Trade Counts")

# Adding axis labels
fig.update_layout(xaxis=dict(title="Trade Counts"),
                  yaxis=dict(title="Density"))

fig.show()

In [8]:
import plotly.figure_factory as ff

fig = ff.create_distplot([time_bar_df["volume"]], ["volume"])
fig.update_layout(title="Distribution of Traded Volume")

# Adding axis labels
fig.update_layout(xaxis=dict(title="Traded Volume"),
                  yaxis=dict(title="Density"))

fig.show()
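
To put a number on this unevenness, one option is to count empty bars and compute the coefficient of variation of the trade count per bar. The sketch below does this on hypothetical toy trades; `toy_trades`, `empty_bars` and `count_cv` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy trades: a busy first hour, an empty second hour,
# and a single large trade in the third hour.
toy_trades = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2022-01-01 00:01",
                "2022-01-01 00:02",
                "2022-01-01 00:03",
                "2022-01-01 00:30",
                "2022-01-01 02:15",
            ]
        ),
        "amount": [0.1, 0.2, 0.3, 0.4, 2.0],
    }
).set_index("timestamp")

per_hour = toy_trades["amount"].resample("1h").agg(["sum", "count"])
per_hour.columns = ["volume", "count"]

# Number of hourly bars that contain no trades at all.
empty_bars = int((per_hour["count"] == 0).sum())
# Coefficient of variation: std / mean, a unit-free measure of spread.
count_cv = per_hour["count"].std() / per_hour["count"].mean()

print(per_hour)
print(f"empty bars: {empty_bars}, count CV: {count_cv:.2f}")
```

Applied to `time_bar_df["count"]`, the same two statistics would summarize what the distribution plots show visually.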


# How to sample tick bars? <a name="paragraph4"></a>

The solution to the problem mentioned above is to work with tick bars: sample a bar every time a certain number of transactions (ticks) has taken place. In the example below I use `250` ticks because that gives roughly `6000` bars for 2022.

This can be done by assigning each trade a bar number based on the number of ticks that have already occurred, and then using the `.ohlc()` method to aggregate the trades within each bar.
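
The grouping idea can be shown on a toy example: integer division of the running trade number by the bar size yields a bar id for every trade. The data and the bar size of `3` below are hypothetical and chosen only to keep the output small.

```python
import numpy as np
import pandas as pd

# Hypothetical toy trades, assumed to be in time order already.
toy_trades = pd.DataFrame(
    {
        "price": [10.0, 11.0, 9.0, 12.0, 13.0, 12.5, 14.0],
        "amount": [1.0, 0.5, 0.5, 2.0, 1.0, 1.0, 0.2],
    }
)
n_ticks = 3  # illustrative bar size

# Trades 0-2 go to bar 0, trades 3-5 to bar 1, and so on.
bar_id = np.arange(len(toy_trades)) // n_ticks

toy_tick_bars = toy_trades.groupby(bar_id)["price"].ohlc()
toy_tick_bars["volume"] = toy_trades.groupby(bar_id)["amount"].sum()
toy_tick_bars["count"] = toy_trades.groupby(bar_id)["amount"].count()
print(toy_tick_bars)
```

Note that the last bar is only partially filled (one trade instead of three), which can also happen at the end of a real data set.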


In [9]:
def derive_tick_bar_df(historical_trades_df: pd.DataFrame, n_ticks: int) -> pd.DataFrame:
    """Derive tick bars: one bar per `n_ticks` consecutive trades."""

    trades_df = historical_trades_df.copy()
    trades_df["index"] = np.arange(len(trades_df)) // n_ticks
    trades_df = trades_df.reset_index()
    trades_df = trades_df.set_index(keys=["index"])

    ohlc_df = trades_df.groupby(by="index")["price"].ohlc()
    times_df = trades_df.groupby(by="index")["timestamp"].agg(["first", "last"])
    times_df.columns = ["open_time", "close_time"]
    volume_count_df = trades_df.groupby(by="index")["amount"].agg(["sum", "count"])
    volume_count_df.columns = ["volume", "count"]
    
    return pd.concat(objs=[times_df, ohlc_df, volume_count_df], axis=1).reset_index(drop=True)

In [10]:
tick_bar_df = derive_tick_bar_df(historical_trades_df=historical_trades_df, n_ticks=250)
tick_bar_df

Unnamed: 0,open_time,close_time,open,high,low,close,volume,count
0,2022-01-01 00:00:14.149,2022-01-01 02:06:08.771,46214.4,46900.3,46214.4,46801.0,19.394580,250
1,2022-01-01 02:06:59.615,2022-01-01 05:22:56.067,46826.6,47000.0,46649.1,47000.0,20.612544,250
2,2022-01-01 05:22:56.148,2022-01-01 09:24:57.582,47009.3,47531.5,46940.6,47170.0,8.344130,250
3,2022-01-01 09:24:57.894,2022-01-01 13:04:09.851,47170.0,47187.3,46741.0,47070.3,10.883168,250
4,2022-01-01 00:00:14.149,2022-01-01 02:06:08.771,46214.4,46900.3,46214.4,46801.0,19.394580,250
...,...,...,...,...,...,...,...,...
5827,2022-12-31 20:12:35.324,2022-12-31 23:09:53.795,16568.4,16570.9,16479.7,16495.5,34.958919,250
5828,2022-12-31 23:09:53.795,2023-01-01 02:09:44.876,16495.5,16550.4,16495.5,16542.9,26.495696,250
5829,2023-01-01 02:10:26.627,2023-01-01 08:19:30.177,16542.4,16558.1,16503.8,16523.1,16.680717,250
5830,2023-01-01 08:26:58.053,2023-01-01 13:16:07.635,16522.8,16571.4,16517.4,16558.0,13.316222,250
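
In the output above, bar 4 shows the same timestamps and prices as bar 0, which may indicate that the trade file contains repeated or out-of-order rows. Tick and volume bars follow row order, so a quick sanity check before sampling could be worthwhile. The sketch below uses hypothetical toy data and assumes that the `id` column uniquely identifies a trade, which is not verified here.

```python
import pandas as pd

# Hypothetical toy trades where the first two rows appear again at the end.
toy_trades = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 1, 2],
        "timestamp": pd.to_datetime(
            [
                "2022-01-01 00:00:01",
                "2022-01-01 00:00:02",
                "2022-01-01 00:00:03",
                "2022-01-01 00:00:04",
                "2022-01-01 00:00:01",
                "2022-01-01 00:00:02",
            ]
        ),
        "price": [10.0, 11.0, 12.0, 13.0, 10.0, 11.0],
    }
)

# Are the trades in non-decreasing time order, and do any ids repeat?
is_sorted = toy_trades["timestamp"].is_monotonic_increasing
n_duplicate_ids = int(toy_trades["id"].duplicated().sum())
print(f"sorted by time: {is_sorted}, duplicate ids: {n_duplicate_ids}")

# One possible clean-up (an assumption, not the notebook's method):
# keep the first copy of each trade id, then sort by time.
clean_trades = (
    toy_trades.drop_duplicates(subset="id")
    .sort_values("timestamp")
    .reset_index(drop=True)
)
print(clean_trades)
```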


In [11]:
tick_bar_df.to_csv("kraken_btcusdt_2022_tick_bar.csv")

In [12]:
fig = go.Figure(data=[go.Candlestick(x=tick_bar_df['open_time'],
                open=tick_bar_df['open'],
                high=tick_bar_df['high'],
                low=tick_bar_df['low'],
                close=tick_bar_df['close'])])
# Adding title
fig.update_layout(title="Tick Sampled Candlestick Chart")

# Adding axis labels
fig.update_layout(xaxis=dict(title="Time"),
                  yaxis=dict(title="Price"))

fig.show()

# What is wrong with tick bars? <a name="paragraph5"></a>
The problem with tick bars is that they still don't sample based on information, but on the number of ticks. Each bar represents a fixed number of trades, regardless of their size, which means the bars can be very uneven in traded volume, especially during periods of **low volume**.

This is shown below, where I plot the traded volume. 




In [13]:
import plotly.figure_factory as ff

fig = ff.create_distplot([tick_bar_df["volume"]], ["volume"])
fig.update_layout(title="Distribution of Traded Volume")

# Adding axis labels
fig.update_layout(xaxis=dict(title="Traded Volume"),
                  yaxis=dict(title="Density"))
fig.show()


# How to sample Volume Bars? <a name="paragraph6"></a>
So instead of sampling based on trades (ticks), we should sample based on a certain volume of traded Bitcoin, regardless of the number of trades required to reach that volume. This ensures more uniform sampling regardless of trading activity. In other words, sample a new bar every time a certain volume has been traded.

The disadvantage of volume bars is that they may overlook changes in market dynamics that occur with a low number of trades but high volume, such as large block trades.

They are preferred because:

1. **Consistency**: Volume bars provide more consistent sampling, making them easier to interpret and analyze compared to tick bars, especially during periods of low trading activity.
2. **More robust**: Volume bars are less sensitive to outliers or irregularities in trading activity compared to tick bars, leading to more robust analysis and trading strategies.
3. **Reflects Market Liquidity**: Volume bars better reflect market liquidity and participation, as they are directly tied to the volume of the asset traded, which is a critical aspect of market dynamics.
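
The volume-based sampling described above can be sketched with a cumulative sum: every time the running traded volume passes another multiple of the threshold, a new bar id starts. The toy data and the threshold of `1.0` are hypothetical.

```python
import pandas as pd

# Hypothetical toy trades, assumed to be in time order already.
toy_trades = pd.DataFrame(
    {
        "price": [10.0, 11.0, 9.0, 12.0, 13.0, 12.5],
        "amount": [0.5, 0.25, 0.5, 1.0, 0.25, 0.5],
    }
)
traded_volume = 1.0  # illustrative volume threshold per bar

# Running total of traded volume: 0.5, 0.75, 1.25, 2.25, 2.5, 3.0
cumulative = toy_trades["amount"].cumsum()
# Integer division by the threshold turns the running total into a bar id.
bar_id = (cumulative // traded_volume).astype(int).rename("bar_id")

toy_volume_bars = toy_trades.groupby(bar_id)["price"].ohlc()
toy_volume_bars["volume"] = toy_trades.groupby(bar_id)["amount"].sum()
toy_volume_bars["count"] = toy_trades.groupby(bar_id)["amount"].count()
print(toy_volume_bars)
```

With this scheme the volume per bar only approximates the threshold: a trade that pushes the running total past a boundary is assigned in full to the next bar, so individual bars can end up somewhat below or above the target.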



In [14]:
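def derive_volume_bar(historical_trades_df: pd.DataFrame, traded_volume: float) -> pd.DataFrame:
    """Derive volume bars: a new bar starts each time the cumulative volume passes a multiple of `traded_volume`."""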
    trades_df = historical_trades_df.copy()
    trades_df["volume_traded"] = trades_df["amount"].cumsum()
    trades_df["index"] = trades_df["volume_traded"] // traded_volume
    trades_df = trades_df.set_index(keys=["index"])

    ohlc_df = trades_df.groupby(by="index")["price"].ohlc()
    times_df = trades_df.groupby(by="index")["timestamp"].agg(["first", "last"])
    times_df.columns = ["open_time", "close_time"]
    volume_count_df = trades_df.groupby(by="index")["amount"].agg(["sum", "count"])
    volume_count_df.columns = ["volume", "count"]

    return pd.concat(objs=[times_df, ohlc_df, volume_count_df], axis=1).reset_index(drop=True)


In [15]:
volume_bars_df = derive_volume_bar(historical_trades_df=historical_trades_df, traded_volume=15.0)
volume_bars_df

Unnamed: 0,open_time,close_time,open,high,low,close,volume,count
0,2022-01-01 00:00:14.149,2022-01-01 01:04:05.263,46214.4,46710.0,46214.4,46560.7,14.330866,164
1,2022-01-01 01:04:05.264,2022-01-01 04:50:10.971,46557.8,46900.3,46556.4,46780.0,15.427954,233
2,2022-01-01 04:50:10.998,2022-01-01 05:49:02.575,46780.0,47531.5,46679.0,47195.4,14.985645,202
3,2022-01-01 05:50:35.109,2022-01-01 00:06:08.983,47237.2,47304.4,46214.4,46400.0,14.898925,420
4,2022-01-01 00:06:10.176,2022-01-01 01:09:16.851,46400.0,46710.0,46303.0,46629.2,15.114237,150
...,...,...,...,...,...,...,...,...
8055,2022-12-31 22:51:16.747,2023-01-01 00:19:06.466,16523.4,16547.6,16495.5,16522.0,13.876720,103
8056,2023-01-01 00:19:06.498,2023-01-01 06:07:53.213,16522.0,16558.1,16516.7,16534.7,16.071344,270
8057,2023-01-01 06:07:53.213,2023-01-01 08:33:43.506,16534.8,16553.1,16503.8,16526.2,14.443594,155
8058,2023-01-01 08:39:17.677,2023-01-01 15:47:16.210,16517.4,16571.4,16517.4,16553.3,15.623145,338


In [16]:
volume_bars_df.to_csv("kraken_btcusdt_2022_volume_bar.csv")

In [17]:
fig = go.Figure(data=[go.Candlestick(x=volume_bars_df['open_time'],
                open=volume_bars_df['open'],
                high=volume_bars_df['high'],
                low=volume_bars_df['low'],
                close=volume_bars_df['close'])])
# Adding title
fig.update_layout(title="Volume Sampled Candlestick Chart")

# Adding axis labels
fig.update_layout(xaxis=dict(title="Time"),
                  yaxis=dict(title="Price"))

fig.show()

# How to compare basic, tick and volume bars? <a name="paragraph7"></a>



In [18]:
volume_bars_df["return"] = volume_bars_df["close"]/volume_bars_df["open"]
tick_bar_df["return"] = tick_bar_df["close"]/tick_bar_df["open"]
time_bar_df["return"] = time_bar_df["close"]/time_bar_df["open"]

In [19]:
time_bar_df = time_bar_df.dropna()
tick_bar_df = tick_bar_df.dropna()
volume_bars_df = volume_bars_df.dropna()

In [20]:
hist_data = [df["return"].to_list() for df in [time_bar_df, tick_bar_df, volume_bars_df]]
dist_fig = ff.create_distplot(hist_data=hist_data, group_labels=["Time bars", "Tick bars", "Volume bars"])
dist_fig.update_xaxes(range=[time_bar_df["return"].min()*0.99, time_bar_df["return"].max()*1.01])

dist_fig.update_layout(title="Comparison of Return Distributions for Different Bar Types")

# Adding axis labels
dist_fig.update_layout(xaxis=dict(title="Return"),
                        yaxis=dict(title="Density"))

dist_fig.show()


In [21]:
import numpy as np
import scipy.stats as stats
import plotly.graph_objects as go

fig = go.Figure()
names = ["time bars", "tick bars", "volume bars"]
dfs = [time_bar_df, tick_bar_df, volume_bars_df]

for df, name in zip(dfs, names):
    # probplot returns ((theoretical quantiles, ordered sample values), (slope, intercept, r)).
    qq_plot_data = stats.probplot(df["return"], dist="norm")
    
    x = np.array([qq_plot_data[0][0][0], qq_plot_data[0][0][-1]])
    
    # Add scatter plot for QQ plot points
    fig.add_scatter(x=qq_plot_data[0][0], y=qq_plot_data[0][1], mode='markers', name=f"{name} QQ Plot")
    
    # Add line plot for QQ plot line
    fig.add_scatter(x=x, y=qq_plot_data[1][1] + qq_plot_data[1][0]*x, mode='lines', name=f"{name} QQ Line")

# Adding title
fig.update_layout(title="QQ Plot Comparison of Return Distributions for Different Bar Types")

# Adding axis labels
fig.update_layout(xaxis=dict(title="Theoretical quantiles"),
                  yaxis=dict(title="Sample quantiles"))

# Adding legend
fig.update_layout(legend=dict(
    orientation="h",
    yanchor="bottom",
    y=1.02,
    xanchor="right",
    x=1
))

fig.show()
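
A numerical summary can complement the QQ plot. The sketch below computes skewness, excess kurtosis and the Jarque-Bera statistic with `scipy.stats`. It runs on hypothetical simulated return series rather than on the notebook's bars, and the choice of these three statistics is an assumption about what is useful here, not the notebook's own method.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical return series standing in for df["return"] of each bar type.
toy_returns = {
    "normal-like": pd.Series(1 + 0.01 * rng.standard_normal(2000)),
    "fat-tailed": pd.Series(1 + 0.01 * rng.standard_t(df=3, size=2000)),
}

rows = []
for name, returns in toy_returns.items():
    # Jarque-Bera tests whether skewness and kurtosis match a normal distribution.
    jb_stat, jb_pvalue = stats.jarque_bera(returns)
    rows.append(
        {
            "name": name,
            "skew": stats.skew(returns),
            "excess_kurtosis": stats.kurtosis(returns),  # 0 for a normal distribution
            "jarque_bera": jb_stat,
        }
    )

summary = pd.DataFrame(rows).set_index("name")
print(summary)
```

A lower Jarque-Bera statistic means the returns sit closer to a normal distribution. Looping over `zip(dfs, names)` instead of `toy_returns` would give the same table for the time, tick and volume bars.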


In [22]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Create figure with subplots
fig = make_subplots(rows=3, cols=1, shared_xaxes=True, vertical_spacing=0.05)

# Loop through the dataframes and add candlestick traces to the subplots
for i, df in enumerate(dfs):
    temp_df = df.iloc[:100]
    fig.add_trace(go.Candlestick(
        x=temp_df['open_time'],
        open=temp_df['open'],
        high=temp_df['high'],
        low=temp_df['low'],
        close=temp_df['close'],
        name=names[i]), row=i+1, col=1)

# Update layout
fig.update_layout(title_text="Candlestick Data Comparison for Different Bar Types",
                  title_x=0.5,
                  xaxis_rangeslider_visible=False)

# Adding axis labels
fig.update_xaxes(title_text="Time", row=3, col=1)
fig.update_yaxes(title_text="Price", row=2, col=1)

# Adding legend
fig.update_layout(legend=dict(
    orientation="h",
    yanchor="bottom",
    y=1.02,
    xanchor="right",
    x=1
))

# Show plot
fig.show()
