# 3.4 Label Target Data

Up to this point, we have concerned ourselves mostly with understanding relationships between our feature space and the pseudo-target '% daily return'. In doing so, we were able to find relationships, many of them trigonometric, that could describe and fit the curve of a stock's returns. 

However, in live trading our goal is not to predict with one hundred percent accuracy the curve itself, but merely to label buy and sell points and communicate those to our broker to engage in transactions. This task is simpler in many ways. For example, it does not require us to be aware of the magnitude of impending moves, as long as we are in position to take advantage of them. Of course, if we were at the point we had to choose between two instruments, it would be nice to know which would move more significantly, but that is a problem for another period in time. 

For now, we will simply concern ourselves with labeling buy and sell points, or entries and exits, in our system. Therefore we must redefine both our target and feature space to aptly do so. 

Two simple solutions to the problem of labeling the target space present themselves. First, we could label the maximums and minimums and call it a day. This would leave us with a data sparsity problem: We would have very few buy and sell points, and many 'hold' points along the curve. In the past, we have taken such an approach and found ourselves forced to oversample data. This is a valid methodology, but the risk associated with it is that the machine will learn some very specific identifying feature of the oversampled data. Rather, we would prefer to ensure that it learns the general characteristics of the points we wish to identify. 
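
To make the sparsity problem concrete, here is a minimal sketch that labels only strict local minima and maxima and then measures how many points end up labeled. The function name `label_extrema` and the synthetic sine series are assumptions for illustration; they are not part of the pipeline used in this notebook.

```python
import numpy as np

def label_extrema(close):
    """Label strict local minima as 1 (buy), strict local maxima as -1 (sell), else 0 (hold)."""
    close = np.asarray(close, dtype=float)
    labels = np.zeros(len(close), dtype=int)
    interior = close[1:-1]
    # A point is an extremum if it is strictly below/above both neighbours
    is_min = (interior < close[:-2]) & (interior < close[2:])
    is_max = (interior > close[:-2]) & (interior > close[2:])
    labels[1:-1][is_min] = 1
    labels[1:-1][is_max] = -1
    return labels

# Synthetic example: two full sine cycles sampled at 200 points
x = np.linspace(0, 4 * np.pi, 200)
labels = label_extrema(np.sin(x))
labeled_share = np.mean(labels != 0)
print(f"labeled share: {labeled_share:.1%}")
```

On a smooth curve only the turning points themselves receive a label, so nearly every point is a 'hold'. That imbalance is what forces oversampling.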

The second easy solution to this problem is to label all points buy or sell as long as you can make some minimum gain by doing so without incurring some heavy drawdown. This method does not suffer from the problem of sparse data, because now almost every point is either buy or sell. Previous attempts to implement this solution have also been met with failure. While the system was able to learn which points were which in training, during backtesting the system failed to produce a return. Further work could be done on this second solution, but for now we have been disappointed with the results. 
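
As a rough illustration of this second approach, the sketch below labels a point 'buy' if a minimum gain is reached before an adverse move of a given size, and 'sell' in the mirrored case. The name `forward_labels` and the parameters `min_gain` and `max_adverse` are hypothetical, and the thresholds are arbitrary; this is a sketch of the idea, not the earlier implementation referred to above.

```python
import numpy as np

def forward_labels(close, min_gain=0.01, max_adverse=0.005):
    """Return 1 (buy), -1 (sell) or 0 (hold) per point, looking forward in time."""
    close = np.asarray(close, dtype=float)
    n = len(close)
    labels = np.zeros(n, dtype=int)
    for i in range(n - 1):
        can_buy = can_sell = True
        for j in range(i + 1, n):
            r = close[j] / close[i] - 1.0   # return relative to point i
            if r <= -max_adverse:
                can_buy = False             # drawdown too deep for a long entry
            if r >= max_adverse:
                can_sell = False            # rally too large for a short entry
            if can_buy and r >= min_gain:
                labels[i] = 1
                break
            if can_sell and r <= -min_gain:
                labels[i] = -1
                break
            if not (can_buy or can_sell):
                break                       # neither side can qualify any more
    return labels

demo = forward_labels([100, 102, 104, 102, 100])
print(demo)
```

In the small example every point except the last receives a buy or sell label, which shows why this scheme avoids the sparsity problem.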

Given that we have so far concerned ourselves with implementing trigonometric relationships in our data, a third solution might be a trigonometric one. We might concern ourselves with buying and selling within some range of radians measured by theta, so that we buy only after the valley has turned up and sell only after the peak has turned down. We could then calculate the values in radians at which we wish to buy and sell, and simply trade on that basis. 

The difficulty is, of course, correctly identifying where we currently are on the curve and making the correct decision from there. Since we cannot know for sure what the true angle of the function currently is, we must resort to labelling all of the entry and exit points by hand, and then building a model to relate those points to the estimated angles calculated earlier.  


#### Steps

1. Investigate optimal entry / exit points
2. Build a function to label the data
3. Iteratively load all relevant data sets and label

#### 1. Investigate optimal entry / exit points

In this case, our goal would be to buy and sell between the following ranges:

#### Buy:

$ .5\pi < x < .75\pi $

#### Sell:

$ 1.5\pi < x < 1.75\pi $

Theoretically, if the plot were a perfect sine wave, it would look like this. 

In [None]:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0,2*np.pi,100)
y = np.sin(x - np.pi)

plt.figure(figsize=(8,5))
plt.scatter(x, y)
plt.fill_betweenx(y=[-1,1.1], x1=[.5*np.pi, .5*np.pi], x2=[.75*np.pi,.75*np.pi], color='g', alpha=0.2, label='buy')
plt.fill_betweenx(y=[-1,1.1], x1=[1.5*np.pi, 1.5*np.pi], x2=[1.75*np.pi,1.75*np.pi], color='r', alpha=0.2, label='sell')
plt.title('Buy and Sell regions, sin(x - pi)')
plt.legend();

Since our goal is to buy and sell right after the trend turns, these ranges should leave enough allowance that we do not exit early and then have to reverse the position. In real markets, a wave will often appear to be coming to an end and then turn back and continue upwards for a second cycle. We want to be careful not to be fooled when that happens. 
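
If an angle estimate were available, the two ranges could be turned into labels directly. The sketch below is a minimal illustration under that assumption; `zone_label` is a hypothetical helper, and it uses the same phase convention as the plot (the valley of `sin(x - pi)` at `0.5*pi`, the peak at `1.5*pi`).

```python
import numpy as np

def zone_label(theta):
    """Map angles in radians to 1 (buy zone), -1 (sell zone) or 0 (neither)."""
    # Wrap angles into [0, 2*pi) so later cycles map onto the same zones
    phase = np.mod(np.asarray(theta, dtype=float), 2 * np.pi)
    buy = (phase > 0.5 * np.pi) & (phase < 0.75 * np.pi)
    sell = (phase > 1.5 * np.pi) & (phase < 1.75 * np.pi)
    return np.where(buy, 1, np.where(sell, -1, 0))

angles = np.array([0.6, 1.6, 0.1, 2.6]) * np.pi
zones = zone_label(angles)
print(zones)
```

The last angle, `2.6*pi`, falls in the buy zone of the second cycle because the phase is taken modulo `2*pi`.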

#### 2. Build a function to label the data

In order to label the chart by hand, I will need a function that prints the chart, waits for input, stores the input, and then moves forward. 

It would be helpful to move on without typing a label for every point, so the function should treat empty input as a repeat of the previous label. 

#### Steps to create

1. Load the y data for a stock
2. Iterate over the y data and collect values 
3. Save
4. Repeat for all 30 stocks
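
The carry-forward rule for empty input can be sketched in isolation, without the plotting loop. `parse_labels` and its `default` argument are hypothetical names used only for this illustration.

```python
def parse_labels(raw_inputs, default=0):
    """Convert typed inputs to integer labels; empty input repeats the previous label."""
    labels = []
    for raw in raw_inputs:
        text = raw.strip()
        if text == "":
            # Nothing typed: reuse the last label, or the default at the very start
            labels.append(labels[-1] if labels else default)
        else:
            labels.append(int(text))  # raises ValueError on non-numeric input
    return labels

parsed = parse_labels(["", "1", "", "", "-1", ""])
print(parsed)
```

Keeping the parsing separate from the display code makes the rule easy to check before spending time labeling a real chart.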

In [None]:
import time
from IPython import display
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
from sine_modules import load_set

In [None]:
data_dir = './data/screens/august25screen/'
fn = 'key.csv'
suffix = '.pickle'

key_df = pd.read_csv('{}{}'.format(data_dir,fn))
key_df.head()

In [None]:
top_30 = key_df.sort_values(by='R2_score', ascending=False).head(30)['SYMBOL']

sym = top_30.iloc[3]
df = load_set(sym, data_dir, suffix)
df = df[::-1].reset_index(drop=True)[::-1]
df.head()

In [None]:
for i in df.index[::-1]:
    print(i)

In [None]:
df.loc[-180:+180:-1, 'close']

In [None]:
%matplotlib inline

import time
import matplotlib.pyplot as plt
from IPython import display

def i_label(df):
    plt.figure(figsize=(20,15))
    values = []
    ind = []
    
    try:
        for i in df.index[::-1]:
            display.clear_output(wait=True)
            plt.clf()
            plt.axvline(i, c='r')
            plt.plot(df.loc[i-25:i+25:-1, 'close'])
            plt.xticks(np.arange(i-25,i+25))
            display.display(plt.gcf())
            x = input("Write something: ")
            try:
                values.append(int(x))
            except ValueError:
                # Empty or non-numeric input: repeat the previous label (0 if there is none yet)
                values.append(values[-1] if values else 0)
            ind.append(i)
            if i == df.index[-1]:
                display.clear_output(wait=True)
    except KeyboardInterrupt:
        display.clear_output(wait=True)
        return pd.Series(values, index=ind)
    
    return pd.Series(values, index=ind)

seres = i_label(df)
#print(seres)

In [None]:
print(seres)

Hand labeling may be superior to algorithmic labeling at the daily level. However, at the minute level it would entail hand labeling over 240,000 points. Even with the system above, if I can label 100 points per minute, for example, it would take 2,400 minutes. This equates to 40 hours, or an entire working week. 
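
The estimate is easy to verify. The labeling rate of 100 points per minute is the illustrative figure from the paragraph above, not a measured value.

```python
points = 240_000          # approximate number of minute-level points to label
rate_per_minute = 100     # assumed hand-labeling speed

minutes = points / rate_per_minute
hours = minutes / 60
print(f"{minutes:.0f} minutes = {hours:.0f} hours")
```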

Instead I should write an algorithm to label these points the way I would want them labeled. In order to do so I must first understand my minimum return expectation.

In [None]:
import math 
plt.figure(figsize=(80,30))

def get_ticks(close=df['close']):

    mx = math.ceil(close.max())
    mn = math.floor(close.min())

    s = mn - .5
    e = mx + .5
    ticks = []
    while s < e:
        ticks.append(s)
        s += .5

    xticks = []
    s = df.index[0] - 300
    e = df.index[-1] + 300
    while s < e:
        xticks.append(s)
        s += 100
        
    return ticks, xticks

ticks, xticks = get_ticks()

plt.plot(df['close'])
plt.yticks(ticks)
plt.xticks(xticks)
plt.grid(True)

In [None]:
math.floor(df['close'].min()), math.ceil(df['close'].max())

Let's say I wanted to design an algorithm that, given the data, labels buy and sell points every time a peak or valley turns and the move could net at least \\$0.50. So I would want to buy at the bottom and sell at the top, provided the difference to the next peak or valley is at least \\$0.50. Furthermore, I don't want to label just the highest peak and lowest valley, nor do I want to label 100% of the points. I want roughly 25% of the points labeled buy or sell, or 12.5% for each. This is still sparse. 
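
One possible way to express the "turn of at least \\$0.50" rule is a zigzag-style pass that confirms a valley once price has risen by the minimum move from the running low, and a peak once it has fallen by the minimum move from the running high. This is a sketch under those assumptions, not the labeling algorithm written later in this notebook; `zigzag_pivots` and `min_move` are hypothetical names.

```python
import numpy as np

def zigzag_pivots(prices, min_move=0.50):
    """Return a list of (position, kind) turning points, kind being 'valley' or 'peak'."""
    prices = np.asarray(prices, dtype=float)
    pivots = []
    hi_i = lo_i = 0   # positions of the running high and running low
    trend = 0         # 0 = undecided, +1 = rising leg, -1 = falling leg
    for i in range(1, len(prices)):
        p = prices[i]
        if trend >= 0 and p > prices[hi_i]:
            hi_i = i
        if trend <= 0 and p < prices[lo_i]:
            lo_i = i
        if trend <= 0 and p - prices[lo_i] >= min_move:
            pivots.append((lo_i, 'valley'))   # rise confirms the low: buy point
            trend, hi_i = 1, i
        elif trend >= 0 and prices[hi_i] - p >= min_move:
            pivots.append((hi_i, 'peak'))     # fall confirms the high: sell point
            trend, lo_i = -1, i
    return pivots

toy = [10.0, 10.2, 9.9, 10.5, 10.4, 10.6, 10.0, 10.1]
turns = zigzag_pivots(toy, min_move=0.50)
print(turns)
```

This marks only the single turning point of each swing, so reaching the 25% target would still require labeling some neighbourhood around each pivot.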

I could bring in the smoother and write the algorithm on top of its output. That seems like a bit of unnecessary abstraction, though, does it not? 

Pros: 
- May give better results because the output of the smoothing function will be directly related to previous buy / sell points

Cons: 
- A lot of work
- Extra cycles in the workflow / change to the dataset
- Why can't you just do it on the underlying data? 

In [None]:
decision_list = pd.Series(np.zeros(df.shape[0]), index = df.index)

In [None]:
decision_list

In [None]:
for loc in df.index[::-1]:
    for floc in df.index[-loc-1::-1]:
        print((loc, floc))

In [None]:
df.loc[1, 'close'] / df.loc[0, 'close'] - 1 

In [None]:
19.11 / 18.75 - 1

In [None]:
df.loc[0]

In [None]:
df.loc[1]

In [None]:
plt.plot(df['close'])

In [None]:
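def check_right(i, thres=0.01):
    """Scan forward from row i: return 1 (looks like a valley), -1 (looks like a peak) or False."""
    peak_flag = True    # no scanned close has been above the current close
    valley_flag = True  # no scanned close has been below the current close
    cp = df.loc[i, 'close']  # current price
    
    while peak_flag or valley_flag:
        try:
            i += 1
            nxt = df.loc[i, 'close']  # neighbouring price (named nxt so numpy's np is not shadowed)
            if nxt > cp:
                peak_flag = False
            if nxt < cp:
                valley_flag = False
            if nxt >= cp + thres and valley_flag:
                return 1
            if nxt <= cp - thres and peak_flag:
                return -1
        except KeyError:
            # Ran past the end of the data without a decision
            return False

    return False
    
def check_left(i, thres=0.01):
    """Scan backward from row i: return 1 (looks like a valley), -1 (looks like a peak) or False."""
    peak_flag = True
    valley_flag = True
    cp = df.loc[i, 'close'] 
    
    while peak_flag or valley_flag:
        try:
            i -= 1
            nxt = df.loc[i, 'close'] 
            if nxt > cp:
                peak_flag = False
            if nxt < cp:
                valley_flag = False
            if nxt >= cp + thres and valley_flag:
                return 1
            if nxt <= cp - thres and peak_flag:
                return -1
        except KeyError:
            # Ran past the start of the data without a decision
            return False
    
    return False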

In [None]:
locs = np.zeros(np.shape(df.index)[0])
thres= 0.25

for j in df.index[::-1]:
    r = check_right(j, thres)
    l = check_left(j, thres)
    if r == l:
        locs[j] = r
np.unique(locs, return_counts=True)
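
To see what the two-sided check does, the same logic can be run on a small array instead of the DataFrame. This is an array-based restatement for illustration; `scan` and `label_turns` are hypothetical names, and the toy prices are made up.

```python
import numpy as np

def scan(close, i, step, thres):
    """Walk from position i in direction step; return 1 (valley side), -1 (peak side) or 0."""
    cp = close[i]
    may_be_peak = may_be_valley = True
    j = i + step
    while 0 <= j < len(close) and (may_be_peak or may_be_valley):
        nxt = close[j]
        if nxt > cp:
            may_be_peak = False
        if nxt < cp:
            may_be_valley = False
        if may_be_valley and nxt >= cp + thres:
            return 1
        if may_be_peak and nxt <= cp - thres:
            return -1
        j += step
    return 0

def label_turns(close, thres):
    """Label a point only when the scans to the left and to the right agree."""
    close = np.asarray(close, dtype=float)
    labels = np.zeros(len(close), dtype=int)
    for i in range(len(close)):
        right = scan(close, i, +1, thres)
        left = scan(close, i, -1, thres)
        if right == left:
            labels[i] = right
    return labels

toy_labels = label_turns([10.0, 9.5, 10.0, 10.5, 10.0], thres=0.25)
print(toy_labels)
```

Only the low at 9.5 and the high at 10.5 are labeled; the points on the slopes get conflicting answers from the two scans and stay at 0.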

In [None]:
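b = np.where(locs == 1)[0]   # index labels marked as buys (valleys)
s = np.where(locs == -1)[0]  # index labels marked as sells (peaks)

In [None]:
plt.figure(figsize=(50,50))
plt.plot(df['close'])
# Plot each marker at its own index label so it sits on the price curve
plt.scatter(b, df.loc[b, 'close'], c='g')
plt.scatter(s, df.loc[s, 'close'], c='r')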


In [None]:
from sklearn.preprocessing import minmax_scale

def decision_label(df, thres_p = 0.002, crit_p=0.01):
    thres = thres_p #.05
    crit = crit_p #.4
    length = df.shape[0]
    decision_list = pd.Series(np.zeros(length), index = df.index)
    #df['cmxs'] = minmax_scale(df['close'])
    for loc in df.index[::-1]:
        display.clear_output()
        print('{:.2f}%'.format(loc/length*100))
        max_diff = 0 
        min_diff = 0
        sell_flag = True
        buy_flag = True 

        for floc in df.index[-loc-1::-1]:
    #         time.sleep(1)
    #         display.clear_output()
    #         print(f'loc: {loc}')
    #         print(f'Max diff: {max_diff}')
    #         print(f'Min diff: {min_diff}')
    #         print(f'Sell: {sell_flag}')
    #         print(f'Buy: {buy_flag}')


            val_diff = df.loc[floc, 'close'] / df.loc[loc, 'close'] - 1  
            if val_diff > max_diff:
                max_diff = val_diff
            elif val_diff < min_diff:
                min_diff = val_diff

            if max_diff >= thres:
                sell_flag = False
            if min_diff <= -thres:
                buy_flag = False

            if sell_flag == True and min_diff <= -crit:
                decision_list.loc[loc] = 2
                break
            if buy_flag == True and max_diff >= crit:
                decision_list.loc[loc] = 1
                break

            if sell_flag == False and buy_flag == False:
                break
    return decision_list

In [None]:
crit = 0.005
thres = crit
df['decision'] = decision_label(df, thres_p=thres, crit_p=crit)

#ticks, xticks = get_ticks(df['close'])
plt.figure(figsize=(280,30))
plt.plot(df['close'])
plt.scatter(df[df['decision'] == 1].index, df[df['decision'] == 1].close, c='g')
plt.scatter(df[df['decision'] == 2].index, df[df['decision'] == 2].close, c='r')
#plt.yticks(ticks)
#plt.xticks(xticks)
plt.grid(True);

In [None]:
df['decision'].value_counts()

In [None]:
stock

In [None]:
#df[df['decision'] == 0]

In [None]:
max_diff, min_diff

In [None]:
crit_p = 0.5 / (df['close'].max() - df['close'].min())

In [None]:
thres_p = 0.05 / (df['close'].max() - df['close'].min())

In [None]:
crit_p, thres_p

#### 3. Iteratively load all relevant data sets and label

In [None]:
from os import listdir
from os.path import isfile, join
save_dir = './data/prepared/august25screenfixed/'

stocks = [f.split('.')[0] for f in listdir(save_dir) if isfile(join(save_dir, f))]
len(stocks)

In [None]:
save_dir

In [None]:
suffix = '.pickle'
df = load_set(stocks[0], save_dir, suffix)

In [None]:
df.head()

In [None]:
print('_'.join(stocks))

In [None]:
save_dir, crit, thres

In [None]:
import matplotlib.ticker as ticker

In [None]:
fig, ax = plt.subplots(10, 3, figsize=(150,200))
fig.suptitle('Top 30 Consistent Performers w/ Decision Labels', fontsize=32)

crit = 0.05
thres = crit

for i, stock in enumerate(stocks):
    df = load_set(stock, save_dir, suffix)
    
    df['D2'] = decision_label(df, thres_p=thres, crit_p=crit)
    
    x,y = divmod(i, 3)
    
    ax[x, y].plot(df['close'])
    # Plot the labels just computed for this stock (stored in 'D2')
    ax[x, y].scatter(df[df['D2'] == 1].index, df[df['D2'] == 1].close, c='g')
    ax[x, y].scatter(df[df['D2'] == 2].index, df[df['D2'] == 2].close, c='r')
    ax[x, y].set_title(stock, fontsize=21)
    ax[x, y].set_xlabel('Minutes since beginning of period')
    ax[x, y].set_ylabel('Price ($)')
    ax[x, y].yaxis.set_major_locator(ticker.MultipleLocator(1))
    ax[x, y].xaxis.set_major_locator(ticker.MultipleLocator(100))
    ax[x, y].grid(True)
    
    df.to_pickle(f'{save_dir}{stock}{suffix}')
#plt.savefig('./data/images/top30withDecisions.png')

In [None]:
for i, stock in enumerate(stocks):
    df = load_set(stock, save_dir, suffix)
    print(i, stock)
    print(df['decision'].shape)
    

In [None]:
print('\t'.join(stocks))

In [None]:
data_dir

In [None]:
save_dir

In [None]:
# create y matrix

y = np.zeros([30,4000])

for k, stock in enumerate(stocks):
    df = load_set(stock, save_dir, suffix)
    
    df = df.dropna(axis=0)
    
    e = df.shape[0]
    l = e - 4000 - 2

    y_ = df.iloc[l:e-2]['D2']
    y[k] = y_.to_numpy() 

np.save('./data/prepared/august25screenfixed/numpy_matrices/yD2.npy', y)

In [None]:
stocks[0]

In [None]:
y = np.zeros([30,4000])

for k, stock in enumerate(stocks):
    df = load_set(stock, save_dir, suffix)
    
    df = df.dropna(axis=0)
    
    e = df.shape[0]
    l = e - 4000 - 59
    
    df['y'] = df['%close'].shift(1)
    y_ = df.iloc[l:e-59]['y']
    y[k] = y_.to_numpy() 

np.save('./data/prepared/august25screenfixed/numpy_matrices/y_br2.npy', y)

In [None]:
df= load_set(stocks[0], save_dir, suffix)

In [None]:
df = df.dropna(axis=0)
df['y_close'] = df['%close'].shift(1)

In [None]:
print('\n'.join([str(x) for x in df.columns]))

In [None]:
df[['%close','y_close']]

In [None]:
stocks