# Feature Engineering 

In this notebook I will create new features and test whether they have a significant relationship with the target, tomorrow's percentage return.

Features currently proposed for investigation include:

- Winning / Losing Streaks in # of days
- Polynomial Features, or multiplicative features as they seem appropriate
- Sin(x) and other trigonometry features
- Volume * Sin(return)
- Potentially narrow the distribution, although on significance testing many of the distribution features appeared significant
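
As a starting point for the first item, here is a minimal sketch of a signed streak counter. It assumes rows are in chronological order (oldest first) and that the returns contain no missing values; the function name `signed_streak` and the demo values are hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd

def signed_streak(returns: pd.Series) -> pd.Series:
    """Length of the current run of same-signed returns.

    Positive values count consecutive up days, negative values count
    consecutive down days, and a flat day (return == 0) gives 0.
    Assumes chronological order and no NaN values.
    """
    sign = np.sign(returns)
    # A new run starts whenever the sign differs from the previous row.
    run_id = (sign != sign.shift()).cumsum()
    # Position within each run, starting at 1.
    run_length = sign.groupby(run_id).cumcount() + 1
    return (run_length * sign).astype(int)

# Hypothetical returns, purely for illustration.
demo = pd.Series([0.01, 0.02, -0.01, -0.03, -0.02, 0.0, 0.04])
streaks = signed_streak(demo)
print(streaks.tolist())  # → [1, 2, -1, -2, -3, 0, 1]
```

To use a streak as a predictor of tomorrow's return, it would need to be computed only from rows up to and including today.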


In [None]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', 500)

In [None]:
df = pd.read_pickle('./data/GOOG081320.pickle')

In [None]:
df.head()

#### Extend basic features using base data

So far I have ignored open, high, and low and focused only on close.

Work done so far showed that most previous days' returns were not correlated with tomorrow's return, with the exception of yesterday and 30 days prior. This suggests there may be periodicity at the one-month interval. One option for feature engineering here is a sine wave. Another option is to investigate an ARIMA model. Let's leave the ARIMA model for later, perhaps even as an input to another model.
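
The lag check described above can be sketched as follows. This is an illustration on synthetic data with a deliberately built-in 30-day cycle, not a reproduction of the earlier analysis; the helper name `lagged_correlations` is hypothetical, and rows are assumed to be in chronological order.

```python
import numpy as np
import pandas as pd

def lagged_correlations(returns: pd.Series, max_lag: int = 40) -> pd.Series:
    """Pearson correlation between each return and the return `lag` rows earlier."""
    corrs = {lag: returns.corr(returns.shift(lag)) for lag in range(1, max_lag + 1)}
    return pd.Series(corrs, name="corr_with_lag")

# Synthetic returns: a 30-day sine cycle plus Gaussian noise (illustration only).
rng = np.random.default_rng(0)
t = np.arange(1000)
demo_returns = pd.Series(
    0.01 * np.sin(2 * np.pi * t / 30) + rng.normal(0, 0.005, t.size)
)

lag_corr = lagged_correlations(demo_returns)
# A periodic series shows positive correlation at the full period
# and negative correlation at the half period.
print(lag_corr.loc[[1, 15, 30]].round(2))
```

On real returns a single significant lag among many tested can also arise by chance, so a peak at lag 30 is suggestive rather than conclusive.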

##### Open, High, Low 

- Differenced open, high, and low (OHL) features
- Each value expressed as a percentage of its 30-day min / max range, 0 - 100%
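
The second item is not implemented in the cells below, so here is a minimal sketch of one way to compute it. It assumes a trailing window over rows in chronological order; the cells in this notebook reverse each column with `[::-1]` before rolling, so the same reversal would likely be needed on `df`. The function name `pct_of_rolling_range` is hypothetical.

```python
import pandas as pd

def pct_of_rolling_range(col: pd.Series, window: int = 30) -> pd.Series:
    """Position of each value inside its trailing min/max range, from 0 to 100."""
    rolling_min = col.rolling(window).min()
    rolling_max = col.rolling(window).max()
    width = rolling_max - rolling_min
    # A flat window has zero width; mark it as missing rather than divide by zero.
    return 100 * (col - rolling_min) / width.where(width != 0)

# Small hypothetical series with a 3-row window so the result is easy to check.
demo = pd.Series([10.0, 12.0, 11.0, 14.0, 13.0])
pct = pct_of_rolling_range(demo, window=3)
print(pct.round(1).tolist())  # → [nan, nan, 50.0, 100.0, 66.7]
```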

In [None]:
df['open_diff'] = df.loc[::-1, 'open'].diff(1)[::-1]

In [None]:
df['high_diff'] = df.loc[::-1, 'high'].diff(1)[::-1]
df['low_diff'] = df.loc[::-1, 'low'].diff(1)[::-1]

In [None]:
df['volume_diff'] = df.loc[::-1, 'volume'].diff(1)[::-1]

In [None]:
df['close_diff'] = df.loc[::-1, 'close'].diff(1)[::-1]

In [None]:
df.head()

In [None]:
df[['volume','volume_diff']]

In [None]:
df.loc[::-1, 'close'].rolling(30).mean()[::-1]

In [None]:
df.loc[::-1, 'close'].rolling(30).std()[::-1]

In [None]:
df['close']

In [None]:
df['stds_from_mean_close'] = (df['close'] - df.loc[::-1, 'close'].rolling(30).mean()[::-1]) / df.loc[::-1, 'close'].rolling(30).std()[::-1]

In [None]:
df.head()

In [None]:
cols = ['open','high','low','close','volume']

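def calc_stds_from_mean(col):
    """Number of 30-row rolling standard deviations each value sits from its 30-row rolling mean."""
    # The column is reversed before rolling and reversed back afterwards,
    # so each window covers the row itself and the 29 rows after it in stored order.
    rolling = col[::-1].rolling(30)
    return (col - rolling.mean()[::-1]) / rolling.std()[::-1]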
    
calc_stds_from_mean(df['open'])

In [None]:
for col in cols:
    df['stds_from_mean_'+col] = calc_stds_from_mean(df[col])
df.head()

#### Sine Wave and Multi-Features

In [None]:
df['pct_return_0'].shift()

In [None]:
df['pct_return_0'].values

In [None]:
np.sin(df['pct_return_0'].values)

In [None]:
# numpy is already imported as np above; build one full sine cycle for reference.
in_array = np.linspace(0, 2 * np.pi)
print("Input array : \n", in_array)

Sin_Values = np.sin(in_array)
print("\nSine values : \n", Sin_Values)

In [None]:
import matplotlib.pyplot as plt 

plt.plot(in_array, Sin_Values)

In [None]:
np.sin([0.017])

In [None]:
in_array = df['pct_return_0'].values * 25
print ("Input array : \n", in_array) 
  
Sin_Values = np.sin(in_array) 
print ("\nSine values : \n", Sin_Values) 

plt.scatter(in_array, Sin_Values)

In [None]:
df['pct_return_0'].describe()

In [None]:
df['pct_return_0']

In [None]:
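# start_point must be defined before it is used: plot only the last 1/64 of the rows.
start_point = int(len(df.index) - len(df.index)/64)
plt.figure(figsize=(160,10)) ## figsize x, y // horizontal, vertical
plt.plot(df.index[start_point:], df['pct_return_0'][start_point:])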

In [None]:
time_sine = np.sin((df.index-4)/2)

In [None]:
time_sine = np.sin((df.index-4)/2) / 20
start_point = int(len(df.index) - len(df.index)/64)
print(start_point)

plt.figure(figsize=(160,10)) ## figsize x, y // horizontal, vertical
plt.plot(df.index[start_point:], df['pct_return_0'][start_point:], lw=2)
plt.plot(df.index[start_point:], df['log_return'][start_point:], lw=2)
plt.plot(df.index[start_point:], time_sine[start_point:], c='y', lw=2)

In [None]:
int(len(df.index) - len(df.index)/64)

#### Thoughts on curvature of return

From a visual perspective, I notice several features of this curve.

1) A sine wave can be fit to the curve fairly easily, at least roughly.   
2) The magnitude fluctuates  
3) The frequency either changes over time or the value I chose is not quite right   

##### Hypotheses:

<u>1) Basic idea for a sine curve:</u>

$$y = \sin\left(\frac{x - b_1}{\text{period}}\right) \cdot \text{magnitude}$$

$$\text{return} = \sin\left(\frac{\text{time} - b_1}{\text{period}}\right) \cdot b_2 \cdot \text{magnitude}$$

Unknowns are magnitude and period
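
One possible way to estimate these unknowns is sketched below, using only numpy. For a fixed period, the sine curve expands to `A*sin(t/period) + B*cos(t/period)`, which is linear in `A` and `B`, so the sketch grid-searches the period and solves the rest by least squares. This is an assumption about how the fit could be done, not the method used elsewhere in this notebook; `fit_sine` and the synthetic series are hypothetical.

```python
import numpy as np

def fit_sine(t, y, periods):
    """Grid-search `period`; solve magnitude and shift by linear least squares.

    Uses the parameterization y = magnitude * sin((t - b_1) / period).
    Returns (period, magnitude, b_1) for the candidate with the lowest squared error.
    """
    best = None
    for period in periods:
        design = np.column_stack([np.sin(t / period), np.cos(t / period)])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        sse = float(np.sum((y - design @ coef) ** 2))
        if best is None or sse < best[0]:
            a, b = coef
            magnitude = float(np.hypot(a, b))
            # A = magnitude*cos(b_1/period), B = -magnitude*sin(b_1/period)
            b_1 = float(-period * np.arctan2(b, a))
            best = (sse, float(period), magnitude, b_1)
    return best[1:]

# Noise-free synthetic series with known parameters (illustration only).
t = np.arange(200, dtype=float)
true_period, true_magnitude, true_b1 = 2.0, 0.05, 4.0
y = true_magnitude * np.sin((t - true_b1) / true_period)

period_hat, magnitude_hat, b1_hat = fit_sine(t, y, periods=np.linspace(1.0, 4.0, 31))
print(round(period_hat, 3), round(magnitude_hat, 3), round(b1_hat, 3))
```

The shift `b_1` is only identified up to multiples of `2 * pi * period`, and on noisy returns the error surface over the period grid may have several near-equal minima.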

<u>2) Magnitude may be related to volume</u>

<u>3) Period can be inferred by fitting the most recent returns</u>
   
Substituting volume for magnitude (hypothesis 2) and solving for period:


$$\text{period} = \frac{\text{time} - b_1}{\sin^{-1}\left(\text{return} / (b_2 \cdot \text{volume})\right)}$$
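
A small numeric check of this inversion, with hypothetical values for `b_1`, `b_2`, `period` and `volume`. It illustrates a limitation worth keeping in mind: `arcsin` only returns angles in `[-pi/2, pi/2]`, so the formula recovers the period only while `(time - b_1) / period` stays inside that range.

```python
import numpy as np

# Hypothetical parameter values, chosen only to make the arithmetic easy to follow.
b_1, b_2, period = 4.0, 0.05, 2.0
volume = 1.5

def model_return(time):
    """Return implied by the sine model with volume standing in for magnitude."""
    return np.sin((time - b_1) / period) * b_2 * volume

def implied_period(time, ret):
    """Invert the model for period using arcsin (principal branch only)."""
    ratio = ret / (b_2 * volume)
    # arcsin is only defined on [-1, 1]; anything outside is treated as missing.
    ratio = np.where(np.abs(ratio) <= 1, ratio, np.nan)
    return (time - b_1) / np.arcsin(ratio)

# (6 - 4) / 2 = 1.0 rad, inside [-pi/2, pi/2]: the true period is recovered.
near = float(implied_period(6.0, model_return(6.0)))
# (12 - 4) / 2 = 4.0 rad, outside that range: arcsin returns a different angle.
far = float(implied_period(12.0, model_return(12.0)))
print(near, far)
```

In practice this means the inversion is only meaningful over a short stretch of time around `b_1`, unless the angle is unwrapped by some other means.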

In [None]:
df.head()

In [None]:
plt.plot(np.log(df['close'] / df['close'].shift(-1)))

In [None]:
df['zscaled_volume'] = (df['volume'] - df['volume'].mean()) / df['volume'].std() 

In [None]:
volume_sine = np.sin((df.index-4)/2) / 30 * df['zscaled_volume']
time_sine = np.sin((df.index-4)/2) / 20
start_point = int(len(df.index) - len(df.index)/64)


plt.figure(figsize=(160,10)) ## figsize x, y // horizontal, vertical
plt.plot(df.index[start_point:], df['pct_return_0'][start_point:], lw=2)
plt.plot(df.index[start_point:], time_sine[start_point:], c='y', lw=2)
plt.title('Naive Sine Wave against returns of GOOG')


plt.figure(figsize=(160,10)) ## figsize x, y // horizontal, vertical
plt.plot(df.index[start_point:], df['pct_return_0'][start_point:], lw=2)
plt.plot(df.index[start_point:], -volume_sine[start_point:], c='y', lw=2)
plt.title('Naive Sine Wave multiplied by z_scaled Volume vs. returns of GOOG')

In [None]:
df.rename({'pct_change':'return'}, inplace=True, axis=1)

In [None]:
df.head()

In [None]:
df['return_mult_volume'] = df['return'] * df['zscaled_volume']

In [None]:
# np.arcsin is only defined on [-1, 1]; products outside that range become NaN.
df['arcsin_return_mult_volume'] = np.arcsin(df['return_mult_volume'])

In [None]:
df['arcsin_return_mult_volume']

In [None]:
# The regression target is the row index, used here as the time variable.
target = df.index.values

In [None]:
X = df['arcsin_return_mult_volume']

In [None]:
import statsmodels.api as sm

X = sm.add_constant(X)
X

In [None]:
X.dropna(inplace=True)

In [None]:
target[X.index].shape, X.shape

In [None]:
target = target[X.index]

In [None]:
model = sm.OLS(target, X)
results = model.fit()
print(results.summary())

In [None]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()

lr.fit(X, target)
lr.score(X, target)

In [None]:
X.head()

In [None]:
lr.coef_

In [None]:
preds = lr.predict(X)

In [None]:
preds

In [None]:
plt.figure(figsize=(100,100))
plt.scatter(y=X['arcsin_return_mult_volume'], x=X.index)
plt.scatter(y=preds, x=X.index, c='r')

In [None]:
volume_sine = np.sin((df.index-4)/2) / 30 * df['zscaled_volume']

# Derivation notes (not executable), with phase_shift = b_1 and period = b_2:
# return = np.sin((time - b_1) / b_2) / (b_3 * volume)
#
# time = ??
#
# return * volume = np.sin((time - b_1) / b_2) / b_3
# b_3 * return * volume = np.sin((time - b_1) / b_2)
# arcsin(b_3 * return * volume) = (time - b_1) / b_2
# b_2 * arcsin(b_3 * return * volume) = (time - b_1)
# b_2 * arcsin(b_3 * return * volume) + b_1 = time
# time = b_1 * const + b_2 * arcsin(b_3 * return * volume)

In [None]:
# Same model solved for period instead of time (notes, not executable):
# return = np.sin((time - phase_shift) / period) / (const * volume)
#
# return * const * volume = np.sin((time - phase_shift) / period)
# arcsin(return * const * volume) = (time - phase_shift) / period
# period = (time - phase_shift) / arcsin(return * const * volume)



In [None]:
df['time'] = df.index
df.head()

In [None]:
df[['time', 'return', 'volume', 'arcsin_return_mult_volume']]

# period = (time - phase_shift) / arcsin(return * const * volume)
# period = (time - b_1) / arcsin(return * b_2 * volume)
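
The linearized form `time = b_1 + b_2 * arcsin(b_3 * return * volume)` from the notes above can be checked on synthetic data. The sketch below assumes `b_3` is known and that every time step lies in the range where `arcsin` inverts the sine exactly; all values are hypothetical, and `np.polyfit` stands in for the OLS fit.

```python
import numpy as np

rng = np.random.default_rng(1)
b_1, b_2, b_3 = 4.0, 2.0, 1.0            # hypothetical "true" parameters
# Keep (time - b_1) / b_2 within [-1, 1] rad so arcsin undoes the sine exactly.
time = np.linspace(2.0, 6.0, 50)
volume = rng.uniform(1.0, 2.0, time.size)

# Generate returns from the assumed model:
# return * b_3 * volume = sin((time - b_1) / b_2)
ret = np.sin((time - b_1) / b_2) / (b_3 * volume)

# Regress time on the arcsin feature; slope estimates b_2, intercept estimates b_1.
feature = np.arcsin(b_3 * ret * volume)
slope, intercept = np.polyfit(feature, time, deg=1)
print(round(slope, 6), round(intercept, 6))
```

Over the full index the angle leaves that range many times, so time is no longer a linear function of the arcsin feature, and a regression of the raw row index on it is unlikely to recover `b_1` and `b_2`.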