# Introduction to Testing

In [1]:
from fastcore.all import *
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from polygon import RESTClient
from utils import view_source_code, get_dollars
from datetime import datetime, timedelta, date
import math, time

In [2]:
path = Path('../data')

## Background

In Chapter 1 we built models and generated the actions we want to take under several approaches.  The question now is: how do we know if they are profitable?  How should we measure them?  How do we know if we simply got lucky, or if they are reliable?

As we mentioned in chapter 2, testing is the most important part of the process.  If done well you have a good way to determine which strategies should be implemented, and if done poorly you run the risk of implementing non-profitable strategies.  I believe you should strive to never sacrifice testing principles because effective testing is your **only** objective indication of whether you are doing something useful or not.  Without effective testing you are "flying blind".

This chapter will lay the groundwork and cover the basics of testing.  The goal of this chapter is to introduce the core concepts and the three questions that need to be carefully considered for effective testing.

1. What data should you use for testing?
1. What metric should you use for testing?
1. What test should you use?

In this chapter we are going to walk through the concepts to see each step and understand its importance.  In the next chapter we are going to apply our knowledge to build a better structured solution and framework that we can use through the remainder of the book.

## What data should you use for testing?

The first question we have to ask is what data do we use for testing?  Ideally we have 3 subsets of our data (training, validation, and test).  Let's go through what they are used for and why they are important.

### Training Set

The training set is unique because it has no restrictions on what we can do with it.  We can look at any piece of data in it.  We can normalize data using values in the training set.  We can train machine learning models on the training set.  This is often the largest subset of our data.

The training set is pretty self-explanatory - we use it for understanding our data and developing our model.

We can load it in using the same method as we did in chapter 1.

In [3]:
raw = pd.read_csv(path/'eod-quotemedia.csv',parse_dates=['date'])
df = raw.pivot(index='date', columns='ticker',values='adj_close')
train = df.loc[:pd.Timestamp('2017-1-1')]

### Validation Set

The goal of creating a trading strategy is to have it perform well on data that was not used to develop it.  We may use data from 2015 - 2020 to create a trading strategy, but the goal is to apply it to 2021 and 2022 to make a profit.

Because we want our model to perform on *unseen* data, we place some restrictions on how we use the validation set.  We do not train any models on it, and we do not use statistics or data from the validation set when creating our model.  It's data our model has never seen.  The validation set is something we can only use to see how well our strategy or model performs.

The entire purpose of the validation set is to give us unseen data to evaluate our approaches on.  By having this separate validation set we can more accurately determine what works and what doesn't.

We can get our validation set using the same method as we did in chapter 1.

In [4]:
valid = df.loc[pd.Timestamp('2017-1-1'):]
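
One detail worth checking with date-based splits: pandas label slices include both endpoints, so `df.loc[:t]` and `df.loc[t:]` share the row at `t` whenever `t` is in the index.  2017-01-01 was a Sunday, so our data has no such row, but a different split date could put the same row in both subsets.  Here is a minimal sketch with made-up prices (the `prices` frame and the split date are hypothetical, not part of our dataset):

```python
import pandas as pd

# Hypothetical prices indexed by date; the split date is itself a row in the index
prices = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0]},
    index=pd.to_datetime(["2016-12-29", "2016-12-30", "2017-01-02", "2017-01-03"]),
)
split = pd.Timestamp("2017-01-02")

# Label slices include both endpoints, so the boundary row lands in both subsets
overlap = prices.loc[:split].index.intersection(prices.loc[split:].index)
print(len(overlap))

# Strict comparisons keep the two subsets disjoint
train_part = prices.loc[prices.index < split]
valid_part = prices.loc[prices.index >= split]
print(len(train_part), len(valid_part))
```

Splitting with a strict `<` on one side is one way to guarantee that no row is shared, whatever split date you pick.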

### Test Set


The test set is very similar to the validation set, but it takes things a step further.  It has further restrictions in that it is the final step before deployment.  The main difference is how often you can use it.  You can test anything on the validation set as many times as you want.  The test set you only get to look at once for your particular approach.

For example, you may try 300 different approaches and parameter changes to your strategy to see what works best.  You can check the profitability on each of them using the validation set.  Then once you have chosen a strategy, you do a final check to ensure it also performs on the test set.  Once you have done that you need a new test set or your project is over.

The reason this is important is that you want to ensure that you didn't get lucky and find a configuration out of your 300 attempts that just happens to work on the validation set but doesn't work elsewhere.  If you try enough combinations eventually you will find something that works, but the test set gives you confidence that your model works because it's a good strategy and not because you just tried enough things to find something that works by coincidence.
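
To see why trying many configurations is risky, here is a small simulation under stated assumptions: 300 "strategies" whose daily returns are pure noise, each scored on independent validation and test samples.  The strategy count, the number of days, and the noise level are arbitrary choices for illustration, not values from our data.

```python
import random
from statistics import mean

random.seed(0)
n_strategies, n_days = 300, 50

def random_daily_returns(n):
    # A strategy with no real edge: daily returns are pure noise around zero
    return [random.gauss(0, 0.01) for _ in range(n)]

# Each "strategy" gets independent validation and test returns
valid_means, test_means = [], []
for _ in range(n_strategies):
    valid_means.append(mean(random_daily_returns(n_days)))
    test_means.append(mean(random_daily_returns(n_days)))

# Pick the winner using the validation set only
best = max(range(n_strategies), key=lambda i: valid_means[i])
print(f"best on validation: {valid_means[best]:.4%} per day")
print(f"same strategy on test: {test_means[best]:.4%} per day")
```

Because the winner is selected for its validation score, that score is biased upward even though no strategy has an edge.  The test score of the same strategy is an independent draw, which is what makes it a fair final check.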


:::{note} Many people re-use or have more lax rules on the test set.  Many people do not use one at all.  In this text I am laying out the ideal state I believe we should strive for.  If you choose to loosen these restrictions on the test set or do without one, I would strongly encourage you to think hard about it.


To get our test set, we could have split our initial data into 3.  Because we are a bit concerned about survivorship bias, let's pull a new test set that uses recent data to test how these strategies would perform over the last year and a half.

We need to get the adjusted close price.  There are a variety of services that have APIs to pull from; I have picked polygon to use here because it's free for what we need.

:::{note} We are using a free api key and putting the key in this notebook in an effort to show everything to the reader, but best practice would be to read the api key from an environment variable.  According to the polygon docs, the client pulls the api key from the `POLYGON_API_KEY` environment variable by default, so you can initiate the client as follows and no credentials are exposed.
`client = RESTClient()`
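
If you would rather read the variable yourself and fail early when it is missing, a sketch along these lines would work.  The `load_api_key` helper is hypothetical and not part of the polygon client.

```python
import os

def load_api_key(var_name="POLYGON_API_KEY"):
    # Read the key from the environment and fail loudly if it is missing or empty
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key

# Usage (not executed here): client = RESTClient(load_api_key())
```

Failing at startup with a clear message is usually easier to debug than an authentication error on the first API call.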

In [5]:
polygon_free_api_key = 'wUv2tpS05klv9ebAQKyLD610FBWllpan'
client = RESTClient(polygon_free_api_key)

In [6]:
if not (path/'polytest_eod-quotemedia.csv').exists():
    dfs = L()
    errors = L()
    for ticker in valid:
        try:
            aggs = client.get_aggs(ticker, 1, "day", "2021-01-01", "2022-05-31",adjusted=True)
            close = {ticker:[o.close for o in aggs]}
            
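            # Convert millisecond time stamp to date
            dates = L(o.timestamp/1e3 for o in aggs).map(datetime.fromtimestamp)
            dfs.append(pd.DataFrame(close,index=dates))
        except Exception as e:
            # Record the ticker and error; aggs may be unset or stale when the request itself fails
            errors.append((ticker,e))
            print(f"FAILURE: {ticker}")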
        
        # Free api gives 5 API calls / minute - so we need to pace our api calls!
        time.sleep(60/5)
    df_test = pd.concat(dfs,axis=1)
    df_test.to_csv(path/'polytest_eod-quotemedia.csv')

df_test = pd.read_csv(path/'polytest_eod-quotemedia.csv',index_col=0,parse_dates=True)
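
One caveat on the timestamp conversion: `datetime.fromtimestamp` without a `tz` argument converts to the local time zone of the machine running the notebook, so the resulting dates can depend on where the code runs.  Here is a hedged sketch of a conversion pinned to UTC; the `ms_to_utc_date` helper is hypothetical and the sample timestamp is made up for illustration, not taken from an API response.

```python
from datetime import datetime, timezone

def ms_to_utc_date(ms):
    # Interpret a millisecond Unix timestamp in UTC, independent of the machine's time zone
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).date()

sample_ms = 1609736400000  # hypothetical value: 2021-01-04 05:00:00 UTC
print(ms_to_utc_date(sample_ms))  # 2021-01-04
```

Whether UTC or the exchange's time zone is the right choice depends on how the data provider defines its timestamps, so check the provider's documentation before relying on either.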

In [7]:
df_test.iloc[:5,:5]

Unnamed: 0,A,AAL,AAP,AAPL,ABBV
2021-01-04,118.64,15.13,157.34,129.41,105.41
2021-01-05,119.61,15.43,157.17,131.01,106.5
2021-01-06,122.89,15.52,166.25,126.6,105.58
2021-01-07,126.16,15.38,167.67,130.92,106.71
2021-01-08,127.06,15.13,170.06,132.05,107.27


## What metric should you use for testing?

Now that we understand what data we will use for testing, let's start figuring out how well our first model from chapter 1 performs.

The next step is to figure out an appropriate metric.  There are a variety of ways to measure this, and we will walk through a few first steps in this section.

### Dollars

Let's take our first model from chapter 1 and measure how well it does in terms of dollars.  After all, dollars are what we want to make, so it seems like a reasonable starting point.

In [8]:
from SimpleTimeSeries import get_momentum_actions

In [9]:
valid_mom = get_momentum_actions(valid,28,0.08)

# Collect one row per non-empty action, then build the dataframe once
rows = []
for dte,vals in valid_mom.iloc[:,:5].iterrows():
    for ticker,action in vals[vals.values != ''].items():
        rows.append({'open_date':dte.date(),'ticker':ticker,'action':action})
transactions = pd.DataFrame(rows,columns=['open_date','ticker','action','close_date'])


Now we have a dataframe with all the positions we are going to take and when to take them.  But we are missing one crucial piece!  When should we close those positions?  We cannot make money by simply buying a stock (ignoring dividends for now) - the profit comes when we actually close the position and sell the stock.  Let's close the position 28 days after opening.

In [10]:
transactions['close_date'] = transactions.open_date + timedelta(28)
transactions = transactions.loc[transactions.open_date < (transactions.open_date.max() - timedelta(28))]


Next we need to get the stock price on the date of our initial action when we open the position, as well as when we close our position.  Let's start with the price on the day we open.

In [11]:
df_valid_long = valid.melt(var_name='ticker',value_name='adj_close',ignore_index=False).reset_index()
df_valid_long.columns = ['dte','ticker','adj_close']

In [12]:
transactions['open_date'] = pd.to_datetime(transactions.open_date)
df_valid_long['dte'] = pd.to_datetime(df_valid_long.dte)
pd.merge(left=transactions,left_on=['open_date','ticker'],
         right=df_valid_long,right_on=['dte','ticker'],
         how='left').head(10)

Unnamed: 0,open_date,ticker,action,close_date,dte,adj_close
0,2017-02-14,A,Buy,2017-03-14,2017-02-14,49.703267
1,2017-02-14,AAPL,Buy,2017-03-14,2017-02-14,132.397351
2,2017-02-15,AAPL,Buy,2017-03-15,2017-02-15,132.877833
3,2017-02-16,A,Buy,2017-03-16,2017-02-16,50.147135
4,2017-02-16,AAPL,Buy,2017-03-16,2017-02-16,132.716038
5,2017-02-17,AAPL,Buy,2017-03-17,2017-02-17,133.083754
6,2017-02-18,AAPL,Buy,2017-03-18,NaT,
7,2017-02-22,AAPL,Buy,2017-03-22,2017-02-22,134.446754
8,2017-02-23,AAPL,Buy,2017-03-23,2017-02-23,133.87802
9,2017-02-24,AAP,Short,2017-03-24,2017-02-24,156.842676


Uh oh - we have a join that isn't working correctly and get `NaT` and `NaN`!  We created our model assuming that we could make transactions any day we want, but the stock market is not open every day.  There are limits on when we can trade in the stock market that we need to start accounting for.

When we trade using the adjusted close price we added a day because we wouldn't be able to actually place the trade until the following day.  If that day ended up being a Saturday, in reality we would have to wait until Monday to place that trade (assuming that Monday isn't a holiday).

Let's fix that by getting the next available trading day for each date.  Because we know this same thing applies to our `close_date`, we will fix it there as well.

In [13]:
unique_dates = L(o.date() for o in valid.index)
unique_dates.sort()

In [14]:
def get_next_trading_day(dte,unique_dates):
    return unique_dates.filter(lambda x: pd.Timestamp(x) >= pd.Timestamp(dte))[0]

In [15]:
f = bind(get_next_trading_day,unique_dates=unique_dates)
transactions['open_date'] = transactions.open_date.apply(f)
transactions['close_date'] = transactions.close_date.apply(f)

In [16]:
transactions['open_date'] = pd.to_datetime(transactions.open_date)
transactions['close_date'] = pd.to_datetime(transactions.close_date)


Now we can merge in the price correctly!

In [17]:
transactions = pd.merge(left=transactions,left_on=['open_date','ticker'],
                         right=df_valid_long,right_on=['dte','ticker'],
                          how='left')
transactions = pd.merge(left=transactions,left_on=['close_date','ticker'],
                         right=df_valid_long,right_on=['dte','ticker'],
                          how='left',
                          suffixes=('_atOpen','_atClose'))

In [18]:
transactions.head(3)

Unnamed: 0,open_date,ticker,action,close_date,dte_atOpen,adj_close_atOpen,dte_atClose,adj_close_atClose
0,2017-02-14,A,Buy,2017-03-14,2017-02-14,49.703267,2017-03-14,51.498464
1,2017-02-14,AAPL,Buy,2017-03-14,2017-02-14,132.397351,2017-03-14,136.290237
2,2017-02-15,AAPL,Buy,2017-03-15,2017-02-15,132.877833,2017-03-15,137.731683


In [19]:
def f_committed(x):
    if x.action in ('Buy','Short'): return x.adj_close_atOpen  
    else: return 0
transactions['committed'] = transactions.apply(f_committed,axis=1)

def f_revenue(x):
    if x.action=="Buy": return x.adj_close_atClose
    else:               return x.adj_close_atOpen
transactions['revenue'] = transactions.apply(f_revenue,axis=1)

def f_cost(x):
    if x.action == 'Buy': return x.adj_close_atOpen  
    else:                 return x.adj_close_atClose
transactions['cost'] = transactions.apply(f_cost,axis=1)

transactions['profit'] = transactions.revenue - transactions.cost
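
The revenue and cost helpers above reduce to one rule per direction: a long position pays at open and receives at close, while a short position receives at open and pays at close.  Here is a minimal sketch of that rule, with `trade_profit` as a hypothetical helper that is not part of the notebook:

```python
def trade_profit(action, price_at_open, price_at_close):
    # Long: pay at open, receive at close. Short: receive at open, pay at close.
    if action == "Buy":
        return price_at_close - price_at_open
    if action == "Short":
        return price_at_open - price_at_close
    raise ValueError(f"unknown action: {action}")

# Prices from the first row of the transactions table above (a Buy on ticker A)
print(trade_profit("Buy", 49.703267, 51.498464))
# Hypothetical short: sell at 100, buy back at 90
print(trade_profit("Short", 100.0, 90.0))  # 10.0
```

The first result can differ from the notebook's stored value in the last decimal place because the table shows rounded prices.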


In [20]:
get_dollars(transactions[transactions.action=='Buy'].profit.sum()), \
get_dollars(transactions[transactions.action=='Short'].profit.sum()), \
get_dollars(transactions.profit.sum())

('$56.40', '$33.86', '$90.26')

Great!  So according to our validation set we made $90 profit (pre-tax).  We could buy/short in higher volumes (i.e. Buy = 10x buys, Short = 10x shorts) to make this profit larger.

However, this really isn't enough information to determine whether that is a good idea or even feasible.  I would love to loan someone \\$100 if they would give me \\$190 back a week later for a \\$90 profit.  I would hate to loan someone \\$100,000 on the promise that they would pay me \\$100,090 back in 20 years.  The reward just wouldn't be worth the risk, and I could put that money to better use over a 20 year span.

Let's see if we can come up with a better metric that accounts for this.

### Percent Return

Instead of measuring raw dollars, let's consider how much money (capital) we needed in order to make that $90 profit.  To do this we need to keep track of our money more carefully than just looking at how much we made at the end.  Let's track our financials by day instead of by transaction to calculate this.

:::{note} I am using "committed" to mean the amount we have invested plus the amount leveraged.  For now, let's assume that we won't take on debt or borrow stocks (shorting) if we do not have the capital to cover the initial price we borrowed at.

In [21]:
df = pd.DataFrame()
for cols in [['open_date','committed'],['close_date','profit'],['close_date','revenue'],['open_date','cost']]:
    _tmp = transactions[cols].groupby(cols[0]).sum()
    df = pd.merge(df,_tmp,how='outer',left_index=True,right_index=True)
df.fillna(0,inplace=True)
df.sort_index(inplace=True)
df

Unnamed: 0,committed,profit,revenue,cost
2017-02-14,182.100618,0.000000,0.000000,182.100618
2017-02-15,132.877833,0.000000,0.000000,132.877833
2017-02-16,182.863173,0.000000,0.000000,182.863173
2017-02-17,133.083754,0.000000,0.000000,133.083754
2017-02-21,134.044718,0.000000,0.000000,134.044718
...,...,...,...,...
2017-06-23,0.000000,-7.068996,251.341879,0.000000
2017-06-26,0.000000,-8.033424,250.513602,0.000000
2017-06-28,0.000000,-6.919895,251.147637,0.000000
2017-06-29,0.000000,19.157383,195.003533,0.000000


Now we can easily look at when we had the most committed.  We are subtracting revenue because once we get money back we can reinvest rather than using new money.

In [22]:
capital_needed = (df.committed.cumsum()-df.revenue.cumsum()).max()
capital_needed

4468.9311894957
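
To make the cumulative-sum logic concrete, here is a tiny worked example with made-up daily totals (the numbers and the `peak_capital` name are hypothetical).  The peak of cumulative committed minus cumulative revenue is the most money we ever had tied up at once.

```python
from itertools import accumulate

# Hypothetical daily totals: money committed to new positions and revenue from closed ones
committed = [100.0, 50.0, 0.0, 80.0, 0.0]
revenue   = [0.0,   0.0, 110.0, 0.0, 140.0]

# Net capital tied up after each day = cumulative committed - cumulative revenue
net_outlay = [c - r for c, r in zip(accumulate(committed), accumulate(revenue))]
peak_capital = max(net_outlay)

print(net_outlay)    # [100.0, 150.0, 40.0, 120.0, -20.0]
print(peak_capital)  # 150.0
```

Although 230 was committed in total, the revenue on day 3 funded part of the later position, so only 150 of our own money was ever needed.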

And of course our profit is still the same as we had before because we are just aggregating the data differently.

:::{note} This is the first time we are using [fastcore's testing framework](https://fastcore.fast.ai/test.html).  It has several handy and easy to use testing functions, such as `test_close`, which checks that 2 numbers are within a small tolerance of each other (useful for floats).

In [23]:
test_close(transactions.profit.sum(),df.profit.sum())
get_dollars(df.profit.sum())

'$90.26'

Now we see that we needed about \\$4,469 to earn the \\$90 profit.  Let's use this to figure out the percent return.

In [24]:
f"Percent Return: {(df.profit.sum() / capital_needed) * 100:.2f}%"

'Percent Return: 2.02%'

Now that we have our percent return, we need to account for the time range in some way.  A 2% return per week would be **AMAZING**; 2% per year would be underwhelming.

In [25]:
f"{df.index.min().date()} - {df.index.max().date()}"

'2017-02-14 - 2017-06-30'

For now let's compare to the percent return of the S&P 500 in that time range.  

+ On 2/14 the S&P 500 had an adjusted close price of 2,337.58
+ On 6/30 the S&P 500 had an adjusted close price of 2,423.41

If we use that to calculate the percent return between our start and end dates we get:

$$\frac{(2423.41 - 2337.58)}{2337.58} = 3.7\%$$

So if we had just invested in the S&P 500 we would have made more money!  We may have made a profit, but we could have saved time and taken on less risk while making more money by simply investing in the S&P 500.  Our investment strategy isn't looking so good after all!
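
The same comparison in code, as a quick check of the arithmetic.  The two prices are the adjusted closes quoted above and the strategy figure is the 2.02% from earlier in this section; the variable names are just for illustration.

```python
start_price, end_price = 2337.58, 2423.41  # S&P 500 adjusted closes quoted above

benchmark_return = (end_price - start_price) / start_price
strategy_return = 0.0202  # the 2.02% percent return computed earlier in this section

print(f"S&P 500: {benchmark_return:.2%}, strategy: {strategy_return:.2%}")
print(f"excess return: {strategy_return - benchmark_return:.2%}")
```

A negative excess return means the strategy trailed the benchmark over the same period.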

### Log Return

More commonly, rather than using the percent return, we want to use the log return.  Log returns have several advantages, which we will cover throughout the book because they matter for what we are doing.  For now we will cover one that is immediately useful to us.

**Symmetry / Additive**
+ Percent Return
    - Invest \\$100
    - Get a 50% percent return on investment 1 and reinvest
    - Get a -50% percent return on investment 2
    - End with **\\$75**
+ Log Return
    - Invest \\$100
    - Get a 50% log return on investment 1 and reinvest
    - Get a -50% log return on investment 2
    - End with **\\$100**

This property, where a positive return plus an equal-sized negative return equals no return, makes it much easier to look at returns and figure out if you are winning or losing.  Just add up all your log returns and you have your overall log return.  You have to be much more careful with percent returns.
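
The two cases above can be checked directly.  Percent returns compound multiplicatively, while log returns add, so equal and opposite log returns cancel exactly.  This is a minimal sketch using the \\$100 example:

```python
import math

start = 100.0

# Percent returns compound multiplicatively: +50% then -50%
after_percent = start * (1 + 0.50) * (1 - 0.50)

# Log returns add: the total log return is the sum of the per-period log returns
log_returns = [0.50, -0.50]
after_log = start * math.exp(sum(log_returns))

print(after_percent)  # 75.0
print(after_log)      # 100.0
```

Note that a log return of 0.50 is not the same move as a percent return of 50%.  It corresponds to a percent gain of about 64.9% (`math.exp(0.5) - 1`), and a log return of -0.50 to a percent loss of about 39.3%.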


In [26]:
pt = capital_needed + df.profit.sum()
pt_1 = capital_needed
log_return = math.log(pt/pt_1)
f"{math.log(pt/pt_1)*100:.2f}%"

'2.00%'

As we calculate the log return we see we get a very similar value to our percent return, but it's not exactly the same.  The advantage of using the log return, however, is that we can accurately get an estimated annualized return.

This is useful because expressing everything as an annualized return gives us a common time frame for comparing investment strategies.

In [27]:
time_period = (df.index.max().date() - df.index.min().date()).days

In [28]:
f"{(log_return/time_period * 365)*100:.2f}%"

'5.37%'

Now we can convert back to a normal (percent) return to compare directly with the S&P 500 annual return.

In [29]:
f"{(np.exp(log_return/time_period * 365)-1)*100:.2f}%"

'5.51%'
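
The whole chain from profit to annualized return fits in a few lines.  The sketch below reuses rounded values from this section (capital of about 4468.93 and profit of 90.26) and scales linearly to 365 calendar days, which assumes the strategy keeps earning at the same rate for the rest of the year.

```python
import math
from datetime import date

capital = 4468.93  # capital needed, rounded from the value above
profit = 90.26     # total profit, rounded
days = (date(2017, 6, 30) - date(2017, 2, 14)).days  # calendar days in the period

log_return = math.log((capital + profit) / capital)
annual_log_return = log_return * 365 / days             # log returns scale linearly with time
annual_percent_return = math.exp(annual_log_return) - 1  # convert back to a percent return

print(f"{log_return:.2%} {annual_log_return:.2%} {annual_percent_return:.2%}")
```

The linear scaling step is valid only because log returns are additive; applying it to percent returns directly would ignore compounding.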

Go ahead and look up S&P 500 annual returns for each year online and compare.  How does this fare?

## Statistical Tests

### Motivation

If you bought 2 \\$1.50 lottery tickets and won 10 million dollars in the lottery, what could you say about your chances to win?  Well let's calculate our rate of return.

$$\frac{10{,}000{,}000 - 3}{3} \approx 3{,}333{,}332.33$$

So based on our calculations, the rate of return for playing the lottery is fantastic.  But we know that this doesn't really reflect reality or mean that it's a good safe investment strategy.  We know that you would've just gotten lucky.

So how do we determine if our trading strategy is a good strategy, or if it just got lucky on that dataset?  This is where statistical testing comes in.

### Statistical Test

In [30]:
valid.head()

date,A,AAL,AAP,AAPL,ABBV,ABC,ABT,ACN,ADBE,ADI,...,XL,XLNX,XOM,XRAY,XRX,XYL,YUM,ZBH,ZION,ZTS
2017-01-03,45.856418,45.704924,170.071573,113.405731,59.148722,80.498934,37.810043,113.139753,103.48,70.247806,...,36.600691,57.236079,85.801688,58.078807,26.168629,48.733902,61.58491,102.066617,42.425041,53.041571
2017-01-04,46.458105,46.099783,171.467236,113.278803,59.982737,82.496547,38.110199,113.411771,104.14,70.10732,...,36.7268,56.819429,84.857671,58.554375,27.156124,49.460248,61.808997,103.005002,43.034201,53.55625
2017-01-05,45.905737,45.300194,171.347608,113.854863,60.437654,81.54159,38.439403,111.711663,105.91,69.099697,...,36.30967,56.131472,83.592688,57.791484,27.042182,49.008736,62.013598,103.66681,42.336614,53.378092
2017-01-06,47.335975,45.616081,169.104577,115.124148,60.456609,82.632969,39.48511,112.984315,108.3,69.37098,...,36.270868,57.2167,83.545488,57.751853,26.624396,48.665194,62.763801,103.676687,42.611719,53.546352
2017-01-09,47.483931,46.474899,169.004887,116.178631,60.854661,83.295592,39.44638,111.721378,108.57,69.700395,...,35.659724,57.226389,82.167223,57.930192,26.472474,48.459069,62.939174,105.691745,42.149936,53.397887


In [31]:
transactions.head()

Unnamed: 0,open_date,ticker,action,close_date,dte_atOpen,adj_close_atOpen,dte_atClose,adj_close_atClose,committed,revenue,cost,profit
0,2017-02-14,A,Buy,2017-03-14,2017-02-14,49.703267,2017-03-14,51.498464,49.703267,51.498464,49.703267,1.795196
1,2017-02-14,AAPL,Buy,2017-03-14,2017-02-14,132.397351,2017-03-14,136.290237,132.397351,136.290237,132.397351,3.892886
2,2017-02-15,AAPL,Buy,2017-03-15,2017-02-15,132.877833,2017-03-15,137.731683,132.877833,137.731683,132.877833,4.85385
3,2017-02-16,A,Buy,2017-03-16,2017-02-16,50.147135,2017-03-16,52.327016,50.147135,52.327016,50.147135,2.179881
4,2017-02-16,AAPL,Buy,2017-03-16,2017-02-16,132.716038,2017-03-16,137.957216,132.716038,137.957216,132.716038,5.241178


In [32]:
# from scipy import stats

# t_statistic, p_value = stats.ttest_1samp(expected_portfolio_returns, 0)
# t_statistic, p_value / 2 # one tailed
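
The commented-out cell above points toward a one-sample t-test of returns against zero.  As a hedged sketch of what that test measures, the statistic can be computed by hand with the standard library.  The `one_sample_t` helper and the sample returns are made up for illustration and are not results from our strategy.

```python
import math
from statistics import mean, stdev

def one_sample_t(returns, popmean=0.0):
    # t = (sample mean - hypothesized mean) / (sample std / sqrt(n))
    n = len(returns)
    return (mean(returns) - popmean) / (stdev(returns) / math.sqrt(n))

# Hypothetical per-trade log returns, made up for illustration
sample_returns = [0.02, -0.01, 0.03, 0.01, -0.02, 0.04, 0.00, 0.01]
t_stat = one_sample_t(sample_returns)
print(f"t = {t_stat:.3f} with {len(sample_returns) - 1} degrees of freedom")
```

A larger t-statistic means the average return is further from zero relative to its noise.  `scipy.stats.ttest_1samp` computes the same statistic along with a two-sided p-value, which can be halved for a one-tailed test when the statistic is positive.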
