# Accumulation Opportunity

## Imports

In [34]:
# <include-accumulation_opportunity/utils.py>

In [35]:
# <imports>
import pandas as pd
import plotly.io as pio

from accumulation_opportunity import utils

pd.options.plotting.backend = "plotly"
pio.templates.default = "none"

## Summary

For this assignment we assess the feasibility of accumulating large positions while attempting to maintain low trading
costs in an electronic market.

The objective of this analysis is to describe the dynamics of position accumulation and liquidation strategies. The analysis is based on marked millisecond tick-level data for BTC-USD from 
2018-03-31 03:34:51 to 2018-04-11 06:06:52 (579,826 obs) and 2021-04-11 02:28:52 to 2021-04-21 05:17:13 (3,777,963 obs) and a simple execution model with the following key parameters:
* Arrival time (timestamp)
* Position size (shares)
* Target participation rate (%) - target % of total market volume to trade
* Max trade participation rate (%) - maximum % of the volume of qualifying transactions to participate in
* Chunk size (shares) - increments of total market volume at which the target participation is stepped up
* Price window (ms) - duration of the look-ahead window used to enforce the constraint of transacting at the least favorable price

The model starts at an arrival time and sets a target participation level based on the number of chunks of total market volume that have traded. While our traded volume to date is below the target, we participate in market trades on the appropriate side, with size capped at a percentage of the aggregate volume traded in each millisecond and price set to the least favorable price within the look-ahead price window.
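
The mechanics described above can be sketched roughly as follows. This is a minimal illustration and not the implementation in `utils.get_accum_df`: the `Tick` type, the `simulate_accumulation` name, and the choice to round the number of completed chunks down are all assumptions, and the ticks are assumed to be pre-aggregated per millisecond and side and sorted by time.

```python
import math
from dataclasses import dataclass


@dataclass
class Tick:
    """Hypothetical market trade, pre-aggregated per millisecond and side."""
    t_ms: int      # timestamp in milliseconds
    side: int      # +1 for buy side, -1 for sell side
    size: float
    price: float


def simulate_accumulation(ticks, quantity, participation,
                          max_trade_participation, chunk_size,
                          price_window_ms, side):
    """Return a list of (t_ms, size, price) fills for one strategy run."""
    fills, filled, market_volume = [], 0.0, 0.0
    for i, tick in enumerate(ticks):
        market_volume += tick.size
        # The target steps up once per completed chunk of total market volume
        # (rounding down is an assumption of this sketch).
        chunks_done = math.floor(market_volume / chunk_size)
        target = min(quantity, participation * chunk_size * chunks_done)
        if tick.side != side or filled >= target:
            continue
        # Cap the fill at a fraction of this tick's volume and at the gap to target.
        size = min(max_trade_participation * tick.size, target - filled)
        # Least favorable same-side price within the look-ahead window:
        # the lowest price when selling, the highest when buying.
        window = []
        for later in ticks[i:]:
            if later.t_ms - tick.t_ms > price_window_ms:
                break
            if later.side == side:
                window.append(later.price)
        price = min(window) if side == -1 else max(window)
        fills.append((tick.t_ms, size, price))
        filled += size
        if filled >= quantity:
            break
    return fills
```

Because the target only moves when a full chunk of market volume has traded, a smaller `chunk_size` gives a smoother target path, while a larger one lets the strategy fall behind and then catch up in bursts.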

We then execute strategies at randomly determined arrival times and characterize their dynamics in terms of:
* Expected value of implementation shortfall (%) - shortfall in volume weighted average price as a percentage of the market price at arrival time. Note that we follow a convention that a negative IS indicates a diminution in value.
* Standard deviation of implementation shortfall (%)
* Expected value of execution duration (minutes)
* Standard deviation of execution duration (minutes)
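
As a rough illustration of the implementation shortfall metric, the sketch below computes a VWAP from fills and expresses the shortfall as a fraction of the arrival price. The function names are hypothetical, and the sign rule for a buy program is an assumption (only sell-side runs appear in this notebook); the arrival price and VWAP in the example are taken from the single-strategy result later in this notebook.

```python
def vwap(fills):
    """Volume-weighted average price of (size, price) fills."""
    total_size = sum(size for size, _ in fills)
    return sum(size * price for size, price in fills) / total_size


def implementation_shortfall(s0, vwap_price, side):
    """Signed shortfall as a fraction of the arrival price s0.

    side is -1 for a sell program and +1 for a buy program (the buy-side
    sign is an assumption). Negative means value was lost versus arrival.
    """
    return side * (s0 - vwap_price) / s0


# Selling at a VWAP below the arrival price gives a negative IS.
print(round(implementation_shortfall(6961860000, 6958961790, side=-1), 6))  # -0.000416
```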

### Constraints
* Have to trade at the least favorable price within the price window
* Must evaluate positions of consequence
* Target total execution time of 1 to 15 minutes

Trading costs are not included in the analysis as they would have been modeled as a linear function of traded value and as such would not have resulted in relative differences between the strategies evaluated. I believe Almgren and Chriss assume that there is no fixed component of trading costs, such that trading zero shares results in zero trading costs.

### Experiments
* We conduct two experiments and compare them to a baseline set of model parameters.
* The first is a reduction in execution duration effected by reducing position size, effectively executing the same initial trades as would be executed in the baseline case.
* The second is a reduction in execution duration effected by increasing the maximum allowable participation in individual trades.
* Neither experiment produced a statistically significant change in mean implementation shortfall relative to the baseline; the standard deviation of implementation shortfall was lower in both

### Further research
Without modelling the impact of trade size on price, our model fails to capture a critical driver of the real-world trade-offs between changes in execution duration and trade sizes.
* Develop a model for temporary and permanent impact à la Almgren and Chriss
    * Temporary could be just taking large trades on either side and regressing size (what should the units be?) against change in price
    * Permanent would be more complicated - something along the lines of a multiple regression of aggregate marked trades for each side against total change in price with longer aggregation periods
* Implement Almgren and Chriss optimal schedule

## Tick Data

Below is a brief exploratory analysis of the 2018 tick data for purposes of determining reasonable ranges for the parameters of our strategies.

Position size:
* A total of 1.3268e+14 billionths were traded over the 10 day period
* The median billionths traded in 15 minute intervals was 2.3571e+10, with the 25th and 75th percentiles as 1.1595e+10 and 4.7196e+10, respectively
* We evaluate strategies with position sizes ranging from 2% of the 25th percentile of the 15 minute volumes (2.3190e+8 billionths) to 10% of the 75th percentile volume (4.7195e+9 billionths)

Chunk size:
* We set chunk size based on a target number of steps over a 15 minute interval
* At the high end of the spectrum (low end of target number of steps) we consider a chunk size of 1.1785e+9 billionths, which, based on the median 15 minute volume of 2.3571e+10, would result in 20 steps
* At the low end of the spectrum (high end of target number of steps) we consider a chunk size of 1.1785e+8 billionths, which would result in 200 steps at the median 15 minute volume

Participation rate:
* We consider participation rates between one and ten percent.

Max trade participation rate:
* We consider max trade participation rates between one and ten percent.

Price window:
* We consider least favorable price windows between 50 and 500 milliseconds.
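
The position size and chunk size ranges above follow from simple arithmetic on the quoted 15 minute volume quartiles. The sketch below reproduces that arithmetic; the variable names are illustrative only, and the quartile values are the ones quoted in the text rather than recomputed from the data.

```python
# Quartiles of 15 minute traded volume quoted above, in billionths.
q25, median, q75 = 1.1595e10, 2.3571e10, 4.7196e10

# 2% of the 25th percentile up to 10% of the 75th percentile.
quantity_range = (0.02 * q25, 0.10 * q75)

# 200 steps (small chunks) down to 20 steps (large chunks) at the median volume.
chunk_size_range = (median / 200, median / 20)

print(f"quantity:   {quantity_range[0]:.4e} to {quantity_range[1]:.4e}")
print(f"chunk size: {chunk_size_range[0]:.4e} to {chunk_size_range[1]:.4e}")
```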

In [36]:
df = utils.get_trade_data("BTC-USD", "2018")
df.name = "BTC-USD"

In [37]:
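# Total traded size per side in 15 minute buckets
df_15min = df.groupby("Side").resample("15min").sum()[["SizeBillionths"]]
df_15min = df_15min.reset_index()
# Sign the sell side (Side == -1) negative so it plots below the axis
df_15min.SizeBillionths = df_15min.SizeBillionths * df_15min.Side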
df_15min = df_15min.sort_values(["Side", "timestamp_utc_nanoseconds"], ascending=[False, True])
df_15min.name = df.name
fig = df_15min.plot(
    x="timestamp_utc_nanoseconds",
    y="SizeBillionths", kind="bar",
    title=f"{df_15min.name}: Total Volume Traded in 15 Minute Intervals",
    color=df_15min.Side.astype(str), labels={"color": "Side"}
)
fig.show()

In [38]:
df_15min_g = df_15min.groupby(["Side", "timestamp_utc_nanoseconds"]).sum().unstack("Side")
# Sell-side volume is signed negative, so subtract it to get total traded volume
df_15min_g["Total"] = df_15min_g[("SizeBillionths", 1)] - df_15min_g[("SizeBillionths", -1)]
df_15min_g.describe()

Unnamed: 0,SizeBillionths (Side -1),SizeBillionths (Side 1),Total
count,1067.0,1067.0,1067.0
mean,-59744360000.0,64513310000.0,124257700000.0
std,72907400000.0,95053680000.0,139806000000.0
min,-1030764000000.0,0.0,0.0
25%,-74107960000.0,20430850000.0,46963640000.0
50%,-37808620000.0,39253460000.0,84231590000.0
75%,-20068520000.0,72408930000.0,150307900000.0
max,0.0,1703932000000.0,1855377000000.0


Here we see that there is a fair amount of variability in price changes with some positive drift (over a relatively short period of time).

In [39]:
df_15min_price = df.groupby("Side").resample("15min").mean()[["PriceMillionths"]]
df_15min_price.reset_index().plot(
    x="timestamp_utc_nanoseconds",
    y="PriceMillionths",
    title=f"{df.name}: Average Price in 15 Minute Intervals",
)

## Single Strategy

Here we describe the results of a single strategy to ensure our model functions correctly and to describe the outputs.

In [40]:
params_space = dict(
    quantity=(2.3190e+8, 4.7195e+9),
    participation=(0.01, 0.10),
    max_trade_participation=(0.01, 0.10),
    chunk_size=(1.1785e+8,1.1785e+9),
    price_window_ms=(50, 500)
)

In [41]:
base_params = {k: sum(v)/2 for k,v in params_space.items()}
base_params["quantity"] = int(base_params["quantity"])
base_params["chunk_size"] = int(base_params["chunk_size"])
base_params["side"] = -1
base_params["arrival_time"] = "2018-04-08 22:05"
pd.DataFrame([base_params])

Unnamed: 0,quantity,participation,max_trade_participation,chunk_size,price_window_ms,side,arrival_time
0,2475700000,0.055,0.055,648175000,275.0,-1,2018-04-08 22:05


In [42]:
df_accum, df_trades, result = utils.get_accum_df(df, **base_params)

Here we see that, for our baseline parameters on the sell side, the strategy completed in 51 trades over about 13.6 minutes with an implementation shortfall of -0.000416 (again, negative indicates shortfall).

In [43]:
pd.DataFrame([result])

Unnamed: 0,quantity,side,S0,VWAP,IS,n_trades,mean_trade_size,arrival_time,completion_time,execution_time,participation,max_trade_participation,chunk_size,price_window_ms
0,2475700000,-1,6961860000,6958961790,-0.000416,51,48543137,2018-04-08 22:05:00.754,2018-04-08 22:18:36.633,0 days 00:13:35.879000,0.055,0.055,648175000,275.0


The blue dots are the market buy side transactions, the purple dots are the market sell side transactions, and the orange dots are our trades. It looks like our strategy is executing correctly, trading only at the least favorable sell side prices: orange dots appear only on top of purple dots, and on the dot with the lowest price when there are multiple dots within the pricing window. You can zoom into the chart to confirm that transactions that appear to occur at the most favorable price in fact occur outside of the pricing window of the less favorable transactions (2018-04-08 22:07:23.064, for example).

In [44]:
utils.make_trade_prices_chart(df, df_accum, df_trades)

Here, we can confirm that our max trade participation constraints are being adhered to. One thing to note is that the width of the bars has been increased to 3000ms to increase legibility, which does result in some bars overlapping. It is possible that not every sell side market transaction will have a corresponding trade because of the step nature of our target participation as determined by our chunk size parameter.

In [45]:
utils.make_trade_sizes_chart(df, df_accum, df_trades)
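
The max trade participation constraint can also be checked programmatically rather than visually. The sketch below is one way to do that under stated assumptions: the function name and the two dict inputs (keyed by millisecond timestamp) are hypothetical, and are not the structure of `df_accum` or `df_trades`.

```python
def participation_violations(strategy_trades, market_volume_by_ms,
                             max_trade_participation, rel_tol=1e-9):
    """Return timestamps where a strategy trade exceeds its per-millisecond cap.

    Both inputs are hypothetical dicts keyed by millisecond timestamp:
    strategy_trades maps to our trade size, market_volume_by_ms to the
    aggregate same-side market volume in that millisecond.
    """
    violations = []
    for t_ms, size in strategy_trades.items():
        cap = max_trade_participation * market_volume_by_ms.get(t_ms, 0.0)
        # A small relative tolerance guards against floating-point noise.
        if size > cap * (1.0 + rel_tol):
            violations.append(t_ms)
    return violations
```

An empty list means every strategy trade stayed within its cap; a trade with no matching market volume is reported as a violation.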

Lastly, we look at our actual participation versus our target participation. The red line is the target and the orange line is our actual participation, both on the left axis. The green line is the total market volume, on the right axis. Early in the period, prior to 22:08, our actual participation keeps up with our target because the market volume is mostly on the sell side (as can be seen in the chart above). Our actual participation then lags behind as our target participation increases as a result of buy side volume.

In [46]:
utils.make_participation_chart(df_accum, df_trades)

## Experiments

### Baseline

To establish a baseline, we run 1000 strategies at randomly selected arrival times, discarding the runs that fail to complete before the end of the period. This results in an expected implementation shortfall of -0.000018 with a standard deviation of 0.003317. In terms of execution time, we end up with an expected value of 20.9 minutes with a standard deviation of 21.7 minutes. The implementation shortfall has positive skewness and some extreme outliers on the upside, with excess kurtosis of 6.8. Execution duration is skewed to the positive (as we would expect given that duration cannot be less than zero) and also includes some extreme outliers (excess kurtosis of 36.3).

**Numbers may vary with different sample populations.**
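
For reference, the distribution statistics quoted above (mean, standard deviation, skewness and excess kurtosis) can be computed along the following lines. This is a standard-library sketch with a hypothetical function name; it uses simple population moments for skewness and kurtosis, so its values can differ slightly from bias-corrected estimators such as those in pandas.

```python
import statistics


def moment_summary(values):
    """Mean, sample std, skewness, and excess kurtosis of a sample.

    Skewness and kurtosis use simple population moments, so they can differ
    slightly from bias-corrected estimators such as those in pandas.
    """
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "std": statistics.stdev(values),
        "skew": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,
    }
```

An excess kurtosis of zero corresponds to a normal distribution, so values such as 6.8 or 36.3 point to much heavier tails.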

In [47]:
df_results = utils.get_results_df(df, params=base_params, nobs=1000)

In [48]:
df_results.describe()

Unnamed: 0,quantity,side,S0,VWAP,IS,n_trades,mean_trade_size,execution_time,participation,max_trade_participation,chunk_size,price_window_ms
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000,1000.0,1000.0,1000.0,1000.0
mean,2475700000.0,-1.0,6936063000.0,6936723000.0,0.000103,87.078,52255380.0,0 days 00:21:05.327889,0.055,0.055,648175000.0,275.0
std,0.0,0.0,234985500.0,234714500.0,0.003488,56.903919,151091800.0,0 days 00:18:55.856910767,1.3884730000000002e-17,1.3884730000000002e-17,0.0,0.0
min,2475700000.0,-1.0,6458950000.0,6467530000.0,-0.008162,1.0,5190146.0,0 days 00:00:01.686000,0.055,0.055,648175000.0,275.0
25%,2475700000.0,-1.0,6756900000.0,6762644000.0,-0.001808,54.0,23355660.0,0 days 00:08:28.977500,0.055,0.055,648175000.0,275.0
50%,2475700000.0,-1.0,6912750000.0,6918486000.0,-0.000427,78.0,31739740.0,0 days 00:17:05.657000,0.055,0.055,648175000.0,275.0
75%,2475700000.0,-1.0,7064505000.0,7068396000.0,0.001334,106.0,45846300.0,0 days 00:28:55.461250,0.055,0.055,648175000.0,275.0
max,2475700000.0,-1.0,7507790000.0,7498187000.0,0.025833,477.0,2475700000.0,0 days 03:49:29.562000,0.055,0.055,648175000.0,275.0


In [49]:
utils.get_result_hist(df_results)

Interestingly, excluding the execution_time outliers, there appears to be a correlation consistent with what we would expect from the Almgren and Chriss model, i.e., as execution time decreases, expected value decreases. However, this must be for reasons other than those of the Almgren and Chriss model, because trade size has no direct impact on trade price in our model. Without such a factor, the higher expected value in our model must be attributable to drift over our execution period.

However, the linear regression below indicates that the relationship is weak: although the slope is positive and statistically significant in this sample (p-value of about 3.9e-06), an r-value of about 0.145 means that execution time explains only about 2% of the variance in implementation shortfall.

In [50]:
utils.make_shortfall_time_scatter(df_results, n_trend_obs=150)

LinregressResult(slope=2.6782710435085037e-05, intercept=-0.0004620355637878151, rvalue=0.14538005101193113, pvalue=3.909887114457838e-06, stderr=5.769600714193524e-06, intercept_stderr=0.00016343548804769158)


### Smaller Position Size

To further explore the relationship between execution duration and implementation shortfall, let's reduce our position size to the small end of our parameter space, so that our execution time is much shorter than it is for our baseline. To start with, we will compare the single run at the same arrival time as our baseline.

In [51]:
params_1 = base_params.copy()
params_1["quantity"] = params_space["quantity"][0]
params_1

{'quantity': 231900000.0,
 'participation': 0.055,
 'max_trade_participation': 0.055,
 'chunk_size': 648175000,
 'price_window_ms': 275.0,
 'side': -1,
 'arrival_time': numpy.datetime64('2018-04-06T23:21:06.118000000')}

#### Single Result

In [52]:
df_accum_1, df_trades_1, result_1 = utils.get_accum_df(df, **params_1)

That is perhaps not the most interesting example - it completed in a little under three minutes with an IS of 0.001098 (vs. -0.000416 in the baseline). However, this does highlight that the outcome depends on a smaller number of pricing observations. To the extent that prices closer to the arrival time have had less opportunity to change, we should certainly expect the variance to be lower than in our baseline case.

In this case, we haven't altered our individual trade sizes; we have just traded out over a smaller number of trades (13 vs. 51). Our average trade size may differ, but only because of the characteristics of the market trades in the part of the longer execution period that the shorter one does not cover. Trade sizes over the overlapping period should be the same, since we didn't change the target participation, chunk size or max trade participation parameters.

In [53]:
pd.DataFrame([result_1, result])

Unnamed: 0,quantity,side,S0,VWAP,IS,n_trades,mean_trade_size,arrival_time,completion_time,execution_time,participation,max_trade_participation,chunk_size,price_window_ms
0,231900000,-1,6618940000,6626209106,0.001098,13,17838461,2018-04-06 23:21:16.362,2018-04-06 23:23:54.640,0 days 00:02:38.278000,0.055,0.055,648175000,275.0
1,2475700000,-1,6961860000,6958961790,-0.000416,51,48543137,2018-04-08 22:05:00.754,2018-04-08 22:18:36.633,0 days 00:13:35.879000,0.055,0.055,648175000,275.0


In [54]:
df_trades_1.iloc[:5]

timestamp_utc_nanoseconds,StratTradeSize,StratTradePrice
2018-04-06 23:21:35.995,13024634,6618990000
2018-04-06 23:21:40.896,24023007,6618990000
2018-04-06 23:22:43.565,1518000,6623990000
2018-04-06 23:22:55.705,1529000,6623990000
2018-04-06 23:22:56.728,33000000,6623990000


In [55]:
df_trades.iloc[:5]

timestamp_utc_nanoseconds,StratTradeSize,StratTradePrice
2018-04-08 22:05:07.476,55000000,6961860000
2018-04-08 22:05:13.607,34980000,6961860000
2018-04-08 22:05:20.827,41737720,6961860000
2018-04-08 22:05:24.541,77929500,6961860000
2018-04-08 22:05:36.939,4416500,6961860000


In [56]:
utils.make_trade_prices_chart(df, df_accum_1, df_trades_1)

In [57]:
utils.make_trade_sizes_chart(df, df_accum_1, df_trades_1)

In [58]:
utils.make_participation_chart(df_accum_1, df_trades_1)

#### Distribution of Results

Relative to our baseline, we ended up with a higher expected value, with an IS metric of 0.000153 vs. -0.000018 (the bottom two rows in the table below relate to the baseline), and lower variance, with a standard deviation of 0.001725 vs. 0.003177. The expected value of the execution duration was 3.3 minutes with a standard deviation of 4.5 minutes. Both distributions had some extreme positive outliers. However, a t-test of the difference in IS means indicates that it is not statistically significant.

This result is consistent with what we would expect as it relates to variance: given the shorter duration in which trades are executed, there is less opportunity for prices to change. It is a little surprising as it relates to expected value. I would have thought that, since our trade sizes stayed the same (and we don't model cost directly as a function of trade size in any event), any differences in expected value could be attributed to drift over the longer execution period of the baseline.

In [59]:
df_results_1 = utils.get_results_df(df, params=params_1, nobs=1000)

In [60]:
utils.stats.ttest_ind(df_results_1.IS, df_results.IS)

Ttest_indResult(statistic=0.1887569909912746, pvalue=0.8503024522389699)
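
As a rough cross-check of the t-test, a similar statistic can be recovered from summary statistics alone. The sketch below is not the `ttest_ind` call used above: it computes Welch's unequal-variance t statistic and approximates the two-sided p-value with a normal distribution, which is reasonable only because each group has about 1000 observations. The function name is hypothetical, and the inputs are the rounded IS means and standard deviations from the summary table in this section.

```python
import math
from statistics import NormalDist


def welch_t_from_summary(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic with a normal-approximation two-sided p-value."""
    se = math.sqrt(std_a ** 2 / n_a + std_b ** 2 / n_b)
    t_stat = (mean_a - mean_b) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value


# Rounded IS mean and std for the smaller position size vs. the baseline (n = 1000 each).
t_stat, p_value = welch_t_from_summary(0.000125, 0.001477, 1000,
                                       0.000103, 0.003488, 1000)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

With these rounded inputs the statistic comes out at roughly 0.18 and the p-value at roughly 0.85, in line with the `ttest_ind` output above. The difference in means is small relative to its standard error, so the test cannot distinguish the two expected values.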

In [61]:
pd.concat([df_results_1.describe().loc[["mean", "std"]], df_results.describe().loc[["mean", "std"]]], axis=0)

Unnamed: 0,quantity,side,S0,VWAP,IS,n_trades,mean_trade_size,execution_time,participation,max_trade_participation,chunk_size,price_window_ms
mean,231900000.0,-1.0,6928894000.0,6929768000.0,0.000125,12.064,42012880.0,0 days 00:03:06.525034,0.055,0.055,648175000.0,275.0
std,0.0,0.0,226648200.0,227111900.0,0.001477,10.696837,55020910.0,0 days 00:03:43.273320357,1.3884730000000002e-17,1.3884730000000002e-17,0.0,0.0
mean,2475700000.0,-1.0,6936063000.0,6936723000.0,0.000103,87.078,52255380.0,0 days 00:21:05.327889,0.055,0.055,648175000.0,275.0
std,0.0,0.0,234985500.0,234714500.0,0.003488,56.903919,151091800.0,0 days 00:18:55.856910767,1.3884730000000002e-17,1.3884730000000002e-17,0.0,0.0


In [62]:
utils.get_result_hist(df_results_1)

In [63]:
utils.make_shortfall_time_scatter(df_results_1, n_trend_obs=40)

LinregressResult(slope=8.367979546165788e-05, intercept=-0.00013426413721550059, rvalue=0.2108772666371916, pvalue=1.6272491529793614e-11, stderr=1.2278573547278342e-05, intercept_stderr=5.945254691154876e-05)


### Higher Max Trade Participation

In contrast to the previous experiment, which shortened execution duration without altering individual trade size, we now explore a reduction in execution duration that results from increasing trade size. As previously noted, we do not model any direct increase in costs associated with increased size.

In [64]:
params_2 = base_params.copy()
params_2["max_trade_participation"] = params_space["max_trade_participation"][1]
pd.DataFrame([params_2])

Unnamed: 0,quantity,participation,max_trade_participation,chunk_size,price_window_ms,side,arrival_time
0,2475700000,0.055,0.1,648175000,275.0,-1,2018-04-06 23:21:06.118


#### Single Result

In [65]:
df_accum_2, df_trades_2, result_2 = utils.get_accum_df(df, **params_2)

Below, we can see that mean trade size is about 42% higher than in our baseline single run (68,769,444 vs. 48,543,137) as we've increased max trade participation from 5.5% to 10%.

Even though the duration is lower (about 9.3 vs. 13.6 minutes), the IS for this single run is little changed at -0.000568 (vs. -0.000416).

In [66]:
pd.DataFrame([result_2, result_1, result])

Unnamed: 0,quantity,side,S0,VWAP,IS,n_trades,mean_trade_size,arrival_time,completion_time,execution_time,participation,max_trade_participation,chunk_size,price_window_ms
0,2475700000,-1,6618940000,6615180468,-0.000568,36,68769444,2018-04-06 23:21:16.362,2018-04-06 23:30:35.336,0 days 00:09:18.974000,0.055,0.1,648175000,275.0
1,231900000,-1,6618940000,6626209106,0.001098,13,17838461,2018-04-06 23:21:16.362,2018-04-06 23:23:54.640,0 days 00:02:38.278000,0.055,0.055,648175000,275.0
2,2475700000,-1,6961860000,6958961790,-0.000416,51,48543137,2018-04-08 22:05:00.754,2018-04-08 22:18:36.633,0 days 00:13:35.879000,0.055,0.055,648175000,275.0


In [67]:
utils.make_trade_prices_chart(df, df_accum_2, df_trades_2)

In [68]:
utils.make_trade_sizes_chart(df, df_accum_2, df_trades_2)

In [69]:
utils.make_participation_chart(df_accum_2, df_trades_2)

#### Distribution of Results

In [70]:
df_results_2 = utils.get_results_df(df, params=params_2, nobs=1000)

In [71]:
utils.stats.ttest_ind(df_results_2.IS, df_results.IS)

Ttest_indResult(statistic=0.501108977384818, pvalue=0.6163496305784659)

In [72]:
pd.concat([
    df_results_2.describe().loc[["mean", "std"]],
    df_results_1.describe().loc[["mean", "std"]],
    df_results.describe().loc[["mean", "std"]]
    ], axis=0)

Unnamed: 0,quantity,side,S0,VWAP,IS,n_trades,mean_trade_size,execution_time,participation,max_trade_participation,chunk_size,price_window_ms
mean,2475700000.0,-1.0,6935814000.0,6936968000.0,0.000173,54.526,72279100.0,0 days 00:13:45.093687,0.055,0.1,648175000.0,275.0
std,0.0,0.0,223170500.0,222633600.0,0.002724,39.405839,81298350.0,0 days 00:12:58.539906994,1.3884730000000002e-17,1.3884730000000002e-17,0.0,0.0
mean,231900000.0,-1.0,6928894000.0,6929768000.0,0.000125,12.064,42012880.0,0 days 00:03:06.525034,0.055,0.055,648175000.0,275.0
std,0.0,0.0,226648200.0,227111900.0,0.001477,10.696837,55020910.0,0 days 00:03:43.273320357,1.3884730000000002e-17,1.3884730000000002e-17,0.0,0.0
mean,2475700000.0,-1.0,6936063000.0,6936723000.0,0.000103,87.078,52255380.0,0 days 00:21:05.327889,0.055,0.055,648175000.0,275.0
std,0.0,0.0,234985500.0,234714500.0,0.003488,56.903919,151091800.0,0 days 00:18:55.856910767,1.3884730000000002e-17,1.3884730000000002e-17,0.0,0.0


In [73]:
utils.get_result_hist(df_results_2)

In [74]:
utils.make_shortfall_time_scatter(df_results_2, n_trend_obs=80)

LinregressResult(slope=3.081933454928288e-05, intercept=-0.00025086875363279, rvalue=0.14682478610095984, pvalue=3.124059747453048e-06, stderr=6.572434400318846e-06, intercept_stderr=0.0001241963121355217)
