# Index Fund Arbitrage

## Introduction

An index fund holds a basket of assets chosen to express a general investment thesis. Individuals can buy and sell shares of the fund and effectively "own" the basket without having to assemble it themselves.

Every day, an index fund's Net Asset Value (NAV) is calculated; it represents the underlying value of the basket. The fund's actual share price can trade at a premium (above the NAV) or at a discount (below the NAV). Over time, however, we expect an equilibrium in which the average premium or discount is zero.
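As a small illustration of the premium and discount definitions, the sketch below expresses the gap as a fraction of the NAV. The helper name `premium_discount` and the numbers are made up for this example and are not part of the simulation that follows.

```python
def premium_discount(share_price: float, nav: float) -> float:
    """Return the premium (positive) or discount (negative) relative to NAV."""
    if nav <= 0:
        raise ValueError("NAV must be positive")
    return (share_price - nav) / nav


# Hypothetical numbers, chosen only for illustration.
print(f"{premium_discount(102.0, 100.0):+.2%}")  # share price above NAV: +2.00%
print(f"{premium_discount(99.0, 100.0):+.2%}")   # share price below NAV: -1.00%
```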

This example tracks the price of a basket of goods alongside the price of an index fund to see what dynamics arise around potential arbitrage. We will assume there are two types of actors:

1. Arbitrage Actor: This actor brings the price of the index fund in line with the basket of goods through arbitrage.
2. Momentum Trader: This actor will bet on momentum.

## Historical Set Up

Historically, we will say that the basket of goods and the index fund both gain and lose amounts that are highly correlated with one another. There are also actions taken on the index fund by the two actors. The steps for creating this series of hypothetical data are below:

1. The basket of goods grows in percentage terms by $r_b \sim N(\mu_1, \sigma^2_1)$.
2. The index fund grows in percentage terms by $r_i = \lambda r_b + (1-\lambda)\,\varepsilon$, where $\varepsilon \sim N(\mu_2, \sigma^2_2)$ is drawn independently of $r_b$.
3. The arbitrage trader brings the index fund price in line with the basket of goods (sets $P_i = P_b$) if $|P_b - P_i| > \theta_1$.
4. The momentum trader trades with the momentum if the $m$-day return of the index fund, minus the expected drift $\mu_1 \cdot m$, exceeds $\theta_2$ in absolute value. A buy grows the index fund value by 2.5% and a sell shrinks it by 2.5% to reflect market impact.

We will make the very heroic assumption that we can separate the pure price movements from the price of the stock after the actors' actions are included.
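To make "highly correlated" concrete, the sketch below works out the correlation between $r_b$ and $r_i$ implied by steps 1 and 2, using the parameter values from the simulation in this section. It assumes the two normal draws are independent, and it uses its own random generator and sample size purely as a sanity check, separate from the notebook's seeded data.

```python
import numpy as np

# Parameter values mirror the simulation in this section.
std1, std2, lambda1 = 0.05, 0.10, 0.9

# r_i = lambda1 * r_b + (1 - lambda1) * eps, with eps independent of r_b, so
#   Cov(r_i, r_b) = lambda1 * std1**2
#   Var(r_i)      = lambda1**2 * std1**2 + (1 - lambda1)**2 * std2**2
cov_ib = lambda1 * std1**2
std_i = np.sqrt(lambda1**2 * std1**2 + (1 - lambda1) ** 2 * std2**2)
theoretical_corr = cov_ib / (std1 * std_i)

# Empirical check on a large sample (illustrative seed, not the notebook's).
rng = np.random.default_rng(42)
r_b = rng.normal(0.01, std1, 200_000)
r_i = lambda1 * r_b + (1 - lambda1) * rng.normal(0.015, std2, 200_000)
empirical_corr = np.corrcoef(r_b, r_i)[0, 1]

print(round(theoretical_corr, 3))  # 0.976
print(round(empirical_corr, 3))
```

The means $\mu_1$ and $\mu_2$ do not enter the correlation; only $\lambda$ and the two standard deviations do.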

In [1]:
import sqlite3
import pandas as pd
import numpy as np
con = sqlite3.connect('arb.db')
np.random.seed(0)

mu1 = .01
std1 = .05
lambda1 = .9
mu2 = .015
std2 = .1
n = 100
r_b = np.random.normal(mu1, std1, n)
r_i = lambda1 * r_b + (1 - lambda1) * np.random.normal(mu2, std2, n)

t = 0
P_b = 100
P_i = 100
P_b_historical = [100]
P_i_historical = [100]

theta1 = 10
arbitrage_trades = []

lookback = 5
theta2 = .1
momentum_buy = []
momentum_sell = []

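for t in range(n):
    # Apply the pure (pre-trade) returns for this step
    P_b = P_b * (1 + r_b[t])
    P_i = P_i * (1 + r_i[t])

    # Arbitrage trade: snap the index price back to the basket price
    if abs(P_b - P_i) > theta1:
        arbitrage_trades.append(t)
        P_i = P_b

    # Momentum trade: lookback-period return, net of expected drift, versus theta2
    lookback_i = max(t - lookback + 1, 0)
    period_return = P_i / P_i_historical[lookback_i] - 1 - mu1 * lookback
    if abs(period_return) > theta2:
        if period_return > 0:
            momentum_buy.append(t)
            P_i = P_i * 1.025  # 2.5% market impact upward
        else:
            momentum_sell.append(t)
            P_i = P_i * (1 - .025)  # 2.5% market impact downward

    # Record the post-trade prices for this step
    P_b_historical.append(P_b)
    P_i_historical.append(P_i)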
    
pure_returns = pd.DataFrame(zip(r_i, r_b))
pure_returns.columns = ["index_return", "basket_return"]
pure_returns["t"] = pure_returns.index

prices = pd.DataFrame(zip(P_i_historical, P_b_historical))
prices.columns = ["index_price", "basket_price"]
prices["t"] = prices.index


trades = [[x, "Arbitrage"] for x in arbitrage_trades] + [[x, "Momentum Buy"] for x in momentum_buy] + [[x, "Momentum Sell"] for x in momentum_sell]
trades = pd.DataFrame(trades)
trades.columns = ["time", "trade"]

pure_returns.to_sql("pure_returns", con, index=False, if_exists='replace')
prices.to_sql("prices", con, index=False, if_exists='replace')
trades.to_sql("trades", con, index=False, if_exists='replace')

## Writing the Data Pipeline

As a modeler, your first step is to create the data pipeline.

### Data Pulls

Begin with the data pulls that are needed.

In [2]:
def pull_pure_returns(con):
    return pd.read_sql("SELECT * FROM pure_returns", con)

def pull_prices(con):
    return pd.read_sql("SELECT * FROM prices", con)

def pull_trades(con):
    return pd.read_sql("SELECT * FROM trades", con)

print(pull_pure_returns(con).head(5))
print()
print(pull_prices(con).head(5))
print()
print(pull_trades(con).head(5))

   index_return  basket_return  t
0      0.108714       0.098203  0
1      0.015029       0.030008  1
2      0.041838       0.058937  2
3      0.121034       0.122045  3
4      0.082809       0.103378  4

   index_price  basket_price  t
0   100.000000    100.000000  0
1   110.871386    109.820262  1
2   112.537726    113.115733  2
3   120.177273    119.782423  3
4   138.090899    134.401228  4

   time      trade
0     6  Arbitrage
1     9  Arbitrage
2    14  Arbitrage
3    16  Arbitrage
4    18  Arbitrage


### Data Processing

Add in the data processing.

In [3]:
def process_pure_returns(pure_returns_data):
    pure_returns_data = pure_returns_data.set_index('t')
    pure_returns_data = pure_returns_data.sort_index()
    return pure_returns_data

def process_prices(prices_data):
    prices_data = prices_data.set_index('t')
    prices_data = prices_data.sort_index()
    return prices_data

def process_trades(trades_data):
    trades_data = trades_data.rename(columns = {"time": "t"})
    trades_data["had_trade"] = True
    # Keyword arguments are required by recent pandas versions
    trades_data = trades_data.pivot(index="t", columns="trade", values="had_trade")
    trades_data = trades_data.fillna(False)
    return trades_data

pure_returns_data = pull_pure_returns(con)
prices_data = pull_prices(con)
trades_data = pull_trades(con)

pure_returns_data = process_pure_returns(pure_returns_data)
prices_data = process_prices(prices_data)
trades_data = process_trades(trades_data)

print(pure_returns_data)
print()
print(prices_data)
print()
print(trades_data)

    index_return  basket_return
t                              
0       0.108714       0.098203
1       0.015029       0.030008
2       0.041838       0.058937
3       0.121034       0.122045
4       0.082809       0.103378
..           ...            ...
95      0.040580       0.045329
96      0.018690       0.010525
97      0.099099       0.099294
98      0.037843       0.016346
99      0.041955       0.030099

[100 rows x 2 columns]

     index_price  basket_price
t                             
0     100.000000    100.000000
1     110.871386    109.820262
2     112.537726    113.115733
3     120.177273    119.782423
4     138.090899    134.401228
..           ...           ...
96    282.987936    276.085791
97    295.484022    278.991594
98    314.360994    306.693653
99    319.499415    311.706746
100   329.116177    321.088953

[101 rows x 2 columns]

trade  Arbitrage  Momentum Buy  Momentum Sell
t                                            
2          False          True         
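To show what the pivot in `process_trades` does, here is a minimal sketch on a made-up trade log in the same long format as the `trades` table. It swaps `fillna(False)` for `notna()`, which is an alternative way (not the notebook's) to get boolean columns directly.

```python
import pandas as pd

# Toy trade log in the same long format as the `trades` table (values are made up).
toy = pd.DataFrame({"t": [2, 6, 6],
                    "trade": ["Momentum Buy", "Arbitrage", "Momentum Sell"]})
toy["had_trade"] = True

# One row per time step, one column per trade type. Cells with no matching
# trade come out as NaN, so notna() turns the table into clean booleans.
wide = toy.pivot(index="t", columns="trade", values="had_trade").notna()
print(wide)
```

Time steps with no trade at all (for example `t = 3` here) do not appear in the pivoted table, which is why the later combination step still has to fill missing values.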

### Create an Aggregate Pull

Create the entry point for pulling all the data.

In [4]:
def aggregate_pull(con):
    pure_returns_data = pull_pure_returns(con)
    prices_data = pull_prices(con)
    trades_data = pull_trades(con)

    pure_returns_data = process_pure_returns(pure_returns_data)
    prices_data = process_prices(prices_data)
    trades_data = process_trades(trades_data)
    
    data = {"pure_returns": pure_returns_data,
           "prices_data": prices_data,
           "trades_data": trades_data}
    
    return data

data = aggregate_pull(con)

### Compute Input Data

In [5]:
def compute_input_data(data):
    pure_returns_data = data["pure_returns"].copy()
    prices_data = data["prices_data"].copy()
    trades_data = data["trades_data"].copy()
    
    #Grab the starting state
    starting_state = prices_data.iloc[0]
    prices_data = prices_data.iloc[1:]
    prices_data.index = prices_data.index - 1
    
    #Combine data
    historical_data = pd.concat([pure_returns_data, prices_data, trades_data], axis=1)
    historical_data[["Arbitrage", "Momentum Buy", "Momentum Sell"]] = historical_data[["Arbitrage", "Momentum Buy", "Momentum Sell"]].fillna(False)
    
    input_data = historical_data[["index_return", "basket_return"]]
    output_data = historical_data[["index_price", "basket_price", "Arbitrage", "Momentum Buy", "Momentum Sell"]]
    
    out = {"starting_state": starting_state,
          "historical_data": historical_data,
          "input_data": input_data,
          "output_data": output_data}
    
    
    return out

inputs = compute_input_data(data)
print(inputs)

{'starting_state': index_price     100.0
basket_price    100.0
Name: 0, dtype: float64, 'historical_data':     index_return  basket_return  index_price  basket_price  Arbitrage  \
t                                                                       
0       0.108714       0.098203   110.871386    109.820262      False   
1       0.015029       0.030008   112.537726    113.115733      False   
2       0.041838       0.058937   120.177273    119.782423      False   
3       0.121034       0.122045   138.090899    134.401228      False   
4       0.082809       0.103378   153.264202    148.295345      False   
..           ...            ...          ...           ...        ...   
95      0.040580       0.045329   282.987936    276.085791       True   
96      0.018690       0.010525   295.484022    278.991594      False   
97      0.099099       0.099294   314.360994    306.693653       True   
98      0.037843       0.016346   319.499415    311.706746       True   
99      0.041955 
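The index shift inside `compute_input_data` is easy to miss, so here is a minimal sketch of the same idea on made-up numbers. The first price row is kept as the starting state, and the remaining rows are relabeled so that row `t` holds the price after the return at step `t` has been applied.

```python
import pandas as pd

# Made-up numbers: two return steps and the three prices they connect.
returns = pd.DataFrame({"index_return": [0.10, -0.05]},
                       index=pd.Index([0, 1], name="t"))
prices = pd.DataFrame({"index_price": [100.0, 110.0, 104.5]},
                      index=pd.Index([0, 1, 2], name="t"))

starting_state = prices.iloc[0]          # price before any return is applied
after_step = prices.iloc[1:].copy()      # prices observed after each step
after_step.index = after_step.index - 1  # relabel so row t = price after step t

aligned = pd.concat([returns, after_step], axis=1)
print(aligned)
```

After the shift, the return and the resulting price for a step share one row, so the starting price times `1 + index_return` at `t = 0` reproduces the price stored in that row.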

### Define Types

In [6]:
from dataclasses import dataclass

share_price = float
percentage_return = float
trade_action = bool

@dataclass
class Prices():
    index_price: share_price
    basket_price: share_price
        
@dataclass
class Returns():
    index_return: percentage_return
    basket_return: percentage_return
        
@dataclass
class Trades():
    arbitrage: trade_action
    momentum_buy: trade_action
    momentum_sell: trade_action
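One point worth keeping in mind about these definitions: aliases such as `share_price = float` document intent but are not checked at runtime, since dataclasses do not validate annotations. The sketch below repeats a minimal copy of `Prices` so that it runs on its own; the values are made up.

```python
from dataclasses import dataclass, asdict

share_price = float  # plain alias: documents intent, adds no runtime check


@dataclass
class Prices:
    index_price: share_price
    basket_price: share_price


p = Prices(index_price=101.5, basket_price=100.0)
print(asdict(p))  # {'index_price': 101.5, 'basket_price': 100.0}

# Annotations are not validated, so a wrong type is accepted silently.
bad = Prices(index_price="101.5", basket_price=100.0)
print(type(bad.index_price).__name__)  # str
```

If stricter guarantees are wanted, a check in `__post_init__` is one possible option, though the notebook does not do this.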

In [7]:
print("add in a way to define and track types for the digital twin as a .types property")

add in a way to define and track types for the digital twin as a .types property


### Build Data Format Function

In [8]:
def map_price(data):
    return Prices(index_price = data["index_price"],
                 basket_price = data["basket_price"])

def map_returns(data):
    return Returns(index_return = data["index_return"],
                   basket_return = data["basket_return"])

def map_trades(data):
    return Trades(arbitrage = data["Arbitrage"],
                 momentum_buy = data["Momentum Buy"],
                 momentum_sell = data["Momentum Sell"])

def format_inputs(inputs):
    inputs_f = {}
    
    starting_state = inputs["starting_state"].copy()
    starting_state = map_price(starting_state)
    inputs_f["starting_state"] = starting_state
    
    historical_data = inputs["historical_data"].copy()
    historical_data["returns"] = historical_data.apply(lambda x: map_returns(x),axis=1)
    historical_data["prices"] = historical_data.apply(lambda x: map_price(x),axis=1)
    historical_data["trades"] = historical_data.apply(lambda x: map_trades(x),axis=1)
    historical_data = historical_data[["returns", "prices", "trades"]]
    inputs_f["historical_data"] = historical_data
    
    input_data = inputs['input_data'].copy()
    input_data["returns"] = input_data.apply(lambda x: map_returns(x),axis=1)
    input_data = input_data[["returns"]]
    inputs_f["input_data"] = input_data
    
    output_data = inputs['output_data'].copy()
    output_data["prices"] = output_data.apply(lambda x: map_price(x),axis=1)
    output_data["trades"] = output_data.apply(lambda x: map_trades(x),axis=1)
    output_data = output_data[["prices", "trades"]]
    inputs_f["output_data"] = output_data
    
    return inputs_f
format_inputs(inputs)

{'starting_state': Prices(index_price=100.0, basket_price=100.0),
 'historical_data':                                               returns  \
 t                                                       
 0   Returns(index_return=0.10871386253910743, bask...   
 1   Returns(index_return=0.015029483765100592, bas...   
 2   Returns(index_return=0.04183835929990093, bask...   
 3   Returns(index_return=0.12103416104564571, bask...   
 4   Returns(index_return=0.08280887550560694, bask...   
 ..                                                ...   
 95  Returns(index_return=0.040580329256415186, bas...   
 96  Returns(index_return=0.018690406444573593, bas...   
 97  Returns(index_return=0.0990992137653999, baske...   
 98  Returns(index_return=0.037843403664469796, bas...   
 99  Returns(index_return=0.041954800849375494, bas...   
 
                                                prices  \
 t                                                       
 0   Prices(index_price=110.87138625391076,

## Wrap All Work in a Pipeline

In [9]:
import digital_twin

In [10]:
class ArbitrageDataPipeline(digital_twin.DataPipeline):
    def pull_historical_data(self):
        con = sqlite3.connect('arb.db')
        return aggregate_pull(con)
    
    def compute_input_data(self, data):
        return compute_input_data(data)
    
    def format_input_data(self, data):
        return format_inputs(data)

In [11]:
TestDataPipeline = ArbitrageDataPipeline()

historical_data = TestDataPipeline.pull_historical_data()
input_data = TestDataPipeline.compute_input_data(historical_data)
input_data_f = TestDataPipeline.format_input_data(input_data)
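The `digital_twin` package is not shown in this section, so its interface can only be inferred from how it is used here. As a rough stand-in for readers who want to follow the three-stage flow without it, the sketch below defines a hypothetical abstract base class with the same three method names. `SketchDataPipeline`, its `run` method, and `EchoPipeline` are illustrative assumptions, and the real `digital_twin.DataPipeline` may look quite different.

```python
from abc import ABC, abstractmethod


class SketchDataPipeline(ABC):
    """Hypothetical stand-in; NOT the real digital_twin.DataPipeline."""

    @abstractmethod
    def pull_historical_data(self):
        """Fetch raw data from storage."""

    @abstractmethod
    def compute_input_data(self, data):
        """Turn raw data into model inputs and outputs."""

    @abstractmethod
    def format_input_data(self, data):
        """Map the computed data onto typed objects."""

    def run(self):
        # Assumed convenience method: chain the three stages in order.
        raw = self.pull_historical_data()
        computed = self.compute_input_data(raw)
        return self.format_input_data(computed)


class EchoPipeline(SketchDataPipeline):
    """Toy subclass with made-up data, used only to exercise the flow."""

    def pull_historical_data(self):
        return {"prices": [100.0, 110.0]}

    def compute_input_data(self, data):
        return {"starting_state": data["prices"][0], "rest": data["prices"][1:]}

    def format_input_data(self, data):
        return {key: str(value) for key, value in data.items()}


print(EchoPipeline().run())
```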

## Implementing the Digital Twin Class Part 1

We need to define two functions: one that loads the data for the first time and one that reloads previously saved data. We can test both as shown below.

In [12]:
class ArbitrageDigitalTwin(digital_twin.DigitalTwin):
    def load_data_initial(self):
        self.historical_data = self.data_pipeline.pull_historical_data()
        
        self.historical_data["pure_returns"].to_csv("pure_returns.csv")
        self.historical_data["prices_data"].to_csv("prices_data.csv")
        self.historical_data["trades_data"].to_csv("trades_data.csv")
    
    def load_data_prior(self):
        self.historical_data = {}
        
        self.historical_data["pure_returns"] = pd.read_csv("pure_returns.csv", index_col = 0)
        self.historical_data["prices_data"] = pd.read_csv("prices_data.csv", index_col = 0)
        self.historical_data["trades_data"] = pd.read_csv("trades_data.csv", index_col = 0)
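The two loaders rely on a CSV round trip preserving the `t` index and the column types. The sketch below checks that on a small made-up frame, writing to an in-memory buffer instead of a file so that nothing is left on disk.

```python
import io

import pandas as pd

# Made-up frame with a named index, a float column and a boolean column.
frame = pd.DataFrame({"index_price": [100.0, 110.5], "Arbitrage": [False, True]},
                     index=pd.Index([0, 1], name="t"))

buffer = io.StringIO()
frame.to_csv(buffer)                          # the index is written as the first column
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)   # read that column back as the index

print(restored.index.name)
print(restored.dtypes.to_dict())
```

With simple values like these the frame comes back unchanged. Floats with many digits may differ in the last decimal places after a round trip; passing `float_precision="round_trip"` to `pd.read_csv` is one option if exact values matter.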

In [13]:
arb_dt = ArbitrageDigitalTwin(name = "Test",
                    data_pipeline = TestDataPipeline)
arb_dt.load_data_initial()
del arb_dt
arb_dt = ArbitrageDigitalTwin(name = "Test",
                    data_pipeline = TestDataPipeline)
arb_dt.load_data_prior()
print(arb_dt.historical_data)

{'pure_returns':     index_return  basket_return
t                              
0       0.108714       0.098203
1       0.015029       0.030008
2       0.041838       0.058937
3       0.121034       0.122045
4       0.082809       0.103378
..           ...            ...
95      0.040580       0.045329
96      0.018690       0.010525
97      0.099099       0.099294
98      0.037843       0.016346
99      0.041955       0.030099

[100 rows x 2 columns], 'prices_data':      index_price  basket_price
t                             
0     100.000000    100.000000
1     110.871386    109.820262
2     112.537726    113.115733
3     120.177273    119.782423
4     138.090899    134.401228
..           ...           ...
96    282.987936    276.085791
97    295.484022    278.991594
98    314.360994    306.693653
99    319.499415    311.706746
100   329.116177    321.088953

[101 rows x 2 columns], 'trades_data':     Arbitrage  Momentum Buy  Momentum Sell
t                                        

We also want to make sure that computing the input data works.

In [14]:
arb_dt.compute_input_data()

In [16]:
print(arb_dt.input_data)

{'starting_state': Prices(index_price=100.0, basket_price=100.0), 'historical_data':                                               returns  \
t                                                       
0   Returns(index_return=0.1087138625391074, baske...   
1   Returns(index_return=0.0150294837651005, baske...   
2   Returns(index_return=0.0418383592999009, baske...   
3   Returns(index_return=0.1210341610456457, baske...   
4   Returns(index_return=0.0828088755056069, baske...   
..                                                ...   
95  Returns(index_return=0.0405803292564151, baske...   
96  Returns(index_return=0.0186904064445735, baske...   
97  Returns(index_return=0.0990992137653999, baske...   
98  Returns(index_return=0.0378434036644697, baske...   
99  Returns(index_return=0.0419548008493754, baske...   

                                               prices  \
t                                                       
0   Prices(index_price=110.87138625391076, basket_...   
1 