# Re-Engineering
- My first engineered dataset gave me both too-perfect results and model tuning that, oddly, made the models worse.
- Because PyCaret automates so much, I want to do two things: compare my engineered data to the original, and in both cases convert the price info into two signals instead of dozens.
    - Caution: I had to ditch an earlier version of this file because I used ALL the data, and running pandas_profiling on it took 8GB of memory and made Chrome crash the tab.
    - the two signals I want to try are "price change between day-before and day-of filing" and "price change between day-of and day-after filing".
- Reading back over previous project files, I am in danger of wandering away from my original question. Although, "my original question" was multifaceted and focused more on relationships than predictions.
    - I would definitely like to see a decision tree and/or other "explainable" prediction method
    - The question I will concentrate on now is, can I use SEC filing data to predict the change in stock price in the days immediately surrounding the filing date?

In [2]:
# redoing my data. I will aim for a "price change" column and I will do two frames, one for change from day before to day of,
# and one for day of to day after
import pandas as pd
import numpy as np

df = pd.read_csv('hf-3-day-prices.csv', parse_dates=['date'])
# most complete revenues data
tech_revenues = ['CSCO', 'FB', 'GOOGL', 'HPQ', 'IBM', 'ORCL']
df_tr = df[df['ticker'].isin(tech_revenues)]
# drop completely null columns
df_tr = df_tr.dropna(axis='columns', how='all')
# drop excess price columns
dropcols = ['split_coefficient', 'split_coef_minus1', 'split_coef_plus1', 'open', 'high', 'low', 'close_adjusted', 
            'volume', 'date_minus1', 'open_minus1', 'high_minus1', 'low_minus1', 'close_adj_minus1', 'volume_minus1', 
            'date_plus1', 'open_plus1', 'high_plus1', 'low_plus1', 'close_adj_plus1', 'volume_plus1']
df_tr.drop(dropcols, axis=1, inplace=True)
df_tr.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 199 entries, 713 to 1983
Data columns (total 30 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   ticker                            199 non-null    object        
 1   date                              199 non-null    datetime64[ns]
 2   commonstocksharesissued           160 non-null    float64       
 3   assetscurrent                     199 non-null    float64       
 4   accountspayablecurrent            199 non-null    float64       
 5   commonstockvalue                  19 non-null     float64       
 6   liabilities                       122 non-null    float64       
 7   liabilitiesandstockholdersequity  199 non-null    float64       
 8   stockholdersequity                199 non-null    float64       
 9   earningspersharebasic             199 non-null    float64       
 10  netincomeloss                     199 non-null 

In [3]:
# where am I missing close prices?
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(df_tr[df_tr['close'].isnull()])

Unnamed: 0,ticker,date,commonstocksharesissued,assetscurrent,accountspayablecurrent,commonstockvalue,liabilities,liabilitiesandstockholdersequity,stockholdersequity,earningspersharebasic,netincomeloss,profitloss,costofgoodssold,costsandexpenses,cash,preferredstockvalue,depreciation,operatingexpenses,revenues,land,deferredrevenue,grossprofit,sharesissued,commercialpaper,costofservices,debtcurrent,salariesandwages,close,close_minus1,close_plus1
1375,IBM,2012-10-30,2182470000.0,48141000000.0,7085000000.0,,94112000000.0,115778000000.0,21541000000.0,9.38,15000000.0,10771000000.0,10003000000.0,,,,2572000000.0,,563000000.0,,670000000.0,35131000000.0,,2458000000.0,29285000000.0,9334000000.0,,,,194.53


In [4]:
# for one row, I will just make the close price equal to the "plus1" data point because it's what I have.
# also need to fill in missing plus1 and minus1 prices (set to day-of price)
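# assign the filled column back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to the parent frame
df_tr['close'] = df_tr['close'].fillna(df_tr['close_plus1'])
df_tr['close_minus1'] = df_tr['close_minus1'].fillna(df_tr['close'])
df_tr['close_plus1'] = df_tr['close_plus1'].fillna(df_tr['close'])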
df_tr.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 199 entries, 713 to 1983
Data columns (total 30 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   ticker                            199 non-null    object        
 1   date                              199 non-null    datetime64[ns]
 2   commonstocksharesissued           160 non-null    float64       
 3   assetscurrent                     199 non-null    float64       
 4   accountspayablecurrent            199 non-null    float64       
 5   commonstockvalue                  19 non-null     float64       
 6   liabilities                       122 non-null    float64       
 7   liabilitiesandstockholdersequity  199 non-null    float64       
 8   stockholdersequity                199 non-null    float64       
 9   earningspersharebasic             199 non-null    float64       
 10  netincomeloss                     199 non-null 

In [5]:
# interpolate in ticker groups. In-place doesn't seem to work. Remember to "apply" instead of "transform" so the ticker column
# does not vanish!
df_interp = df_tr.groupby('ticker').apply(pd.DataFrame.interpolate)
df_interp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 199 entries, 713 to 1983
Data columns (total 30 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   ticker                            199 non-null    object        
 1   date                              199 non-null    datetime64[ns]
 2   commonstocksharesissued           160 non-null    float64       
 3   assetscurrent                     199 non-null    float64       
 4   accountspayablecurrent            199 non-null    float64       
 5   commonstockvalue                  53 non-null     float64       
 6   liabilities                       122 non-null    float64       
 7   liabilitiesandstockholdersequity  199 non-null    float64       
 8   stockholdersequity                199 non-null    float64       
 9   earningspersharebasic             199 non-null    float64       
 10  netincomeloss                     199 non-null 
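
A minimal sketch of one way to interpolate within ticker groups while keeping the `ticker` column: run `transform` on the numeric columns only and assign the result back into a copy. The toy frame and its values are made up for illustration and are not from `hf-3-day-prices.csv`; this is an alternative to the `apply` call above, not a claim about which one behaves better on the real data.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with gaps inside each ticker's series.
toy = pd.DataFrame({
    "ticker": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "revenues": [1.0, np.nan, 3.0, 10.0, np.nan, 30.0],
    "cash": [np.nan, 5.0, 7.0, 2.0, np.nan, 6.0],
})

# Interpolate only the numeric columns, one ticker at a time, and write the
# result back into a copy so the 'ticker' column is never touched.
num_cols = list(toy.select_dtypes(include="number").columns)
toy_interp = toy.copy()
toy_interp[num_cols] = toy.groupby("ticker")[num_cols].transform(
    lambda s: s.interpolate()
)
print(toy_interp)
```

With the default linear interpolation, a gap at the very start of a ticker's series has no earlier value to interpolate from and stays NaN. That would be consistent with `commonstockvalue` going from 19 to only 53 non-null values above, though I have not checked the real rows.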

In [6]:
# fill everything else with 0 and get ready for pycaret!
df_interp.fillna(0, inplace=True)
# Create two target dataframes: one for minus1 to day-of, one for day-of to plus1
df_minus_dayof = df_interp.copy(deep=True)
df_dayof_plus = df_interp.copy(deep=True)
df_minus_dayof['change'] = df_minus_dayof['close'] - df_minus_dayof['close_minus1']
df_dayof_plus['change'] = df_dayof_plus['close_plus1'] - df_dayof_plus['close']
# calculate percentage as well as scalar change for possible use in a percent-change transformed dataset later
df_minus_dayof['pct_change'] = df_minus_dayof['change'] / df_minus_dayof['close_minus1'] 
df_dayof_plus['pct_change'] = df_dayof_plus['change'] / df_dayof_plus['close']
# drop the actual price columns
dropprice = ['close', 'close_minus1', 'close_plus1']
df_minus_dayof.drop(dropprice, axis=1, inplace=True)
df_dayof_plus.drop(dropprice, axis=1, inplace=True)
# make sure my changes are usually different!
display(df_minus_dayof.iloc[0:5,-2:])
display(df_dayof_plus.iloc[0:5,-2:])

Unnamed: 0,change,pct_change
713,-0.09,-0.003736
714,0.07,0.002917
715,-0.42,-0.018018
716,-0.1099,-0.005053
717,-0.36,-0.018405


Unnamed: 0,change,pct_change
713,-0.32,-0.013333
714,0.21,0.008725
715,0.78,0.034076
716,0.03,0.001386
717,0.26,0.013542


In [7]:
# In order to experiment with different models, also create frames with the percent-change transform
# skip the newly calculated change columns? What will it do to do a percent-change over rows on a column that's already percent
# change from within the row? What the hell, right?
df_m_d_pctchg = pd.DataFrame()
for col in df_minus_dayof.columns:
    if df_minus_dayof[col].dtype == np.float64:
        df_m_d_pctchg[col] = df_minus_dayof.groupby('ticker')[col].pct_change()
    else:
        df_m_d_pctchg[col] = df_minus_dayof[col]
# confident in filling the NaNs here with 0, due to earlier work. The first row of each ticker has no prior row,
# so its percent changes become 0 and that row carries no information.
df_m_d_pctchg.fillna(0, inplace=True)
df_m_d_pctchg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 199 entries, 713 to 1983
Data columns (total 29 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   ticker                            199 non-null    object        
 1   date                              199 non-null    datetime64[ns]
 2   commonstocksharesissued           199 non-null    float64       
 3   assetscurrent                     199 non-null    float64       
 4   accountspayablecurrent            199 non-null    float64       
 5   commonstockvalue                  199 non-null    float64       
 6   liabilities                       199 non-null    float64       
 7   liabilitiesandstockholdersequity  199 non-null    float64       
 8   stockholdersequity                199 non-null    float64       
 9   earningspersharebasic             199 non-null    float64       
 10  netincomeloss                     199 non-null 

In [8]:
df_d_p_pctchg = pd.DataFrame()
for col in df_dayof_plus.columns:
    if df_dayof_plus[col].dtype == np.float64:
        df_d_p_pctchg[col] = df_dayof_plus.groupby('ticker')[col].pct_change()
    else:
        df_d_p_pctchg[col] = df_dayof_plus[col]
# confident in filling the NaNs here with 0, due to earlier work. The first row of each ticker has no prior row,
# so its percent changes become 0 and that row carries no information.
df_d_p_pctchg.fillna(0, inplace=True)
df_d_p_pctchg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 199 entries, 713 to 1983
Data columns (total 29 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   ticker                            199 non-null    object        
 1   date                              199 non-null    datetime64[ns]
 2   commonstocksharesissued           199 non-null    float64       
 3   assetscurrent                     199 non-null    float64       
 4   accountspayablecurrent            199 non-null    float64       
 5   commonstockvalue                  199 non-null    float64       
 6   liabilities                       199 non-null    float64       
 7   liabilitiesandstockholdersequity  199 non-null    float64       
 8   stockholdersequity                199 non-null    float64       
 9   earningspersharebasic             199 non-null    float64       
 10  netincomeloss                     199 non-null 

## Outliers! Can PyCaret help me here?
- Earlier I discovered a number of odd data points where, after the percent-change transform, the value was 100 or more, and as high as 2700. I decided to "clip" them at (+/-) 100, but I wished I had more information about whether these were likely to be real or data errors. I know it's plausible for a company to have a stellar (or terrible) result that changes its prior numbers by more than 100%, but how big a change is still plausible?
- I'm curious what PyCaret does with outliers!
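
A small sketch of how such extreme values can arise and how the clipping described above might look in pandas. The ticker and numbers are made up; the bound of 100 is taken from the note above and is applied in whatever units `pct_change` returns (fractions, not percentage points). Turning infinite values into 0 is my assumption for this sketch, not something the earlier work is confirmed to have done.

```python
import numpy as np
import pandas as pd

# Made-up quarterly values for one hypothetical ticker; the 0.0 mimics a
# missing value that was filled with 0 before the transform.
toy = pd.DataFrame({
    "ticker": ["AAA"] * 5,
    "revenues": [100.0, 0.0, 50.0, 2.0, 400.0],
})

# Row-over-row percent change within each ticker.
# A previous value of 0 gives inf; a tiny previous value gives a huge ratio.
pct = toy.groupby("ticker")["revenues"].pct_change()

limit = 100  # clipping bound mentioned in the note above
cleaned = (
    pct.replace([np.inf, -np.inf], np.nan)  # treat division by zero as missing
       .clip(lower=-limit, upper=limit)     # cap the remaining extremes
       .fillna(0)                           # then fill missing values with 0
)
print(pd.DataFrame({"raw": pct, "cleaned": cleaned}))
```

Note that `fillna(0)` on its own leaves `inf` in place, because `inf` is not treated as missing. A column can therefore report 199 non-null values in `info()` and still contain infinities.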

In [9]:
# let's save this work!
df_minus_dayof.to_csv('minus_dayof_changes.csv')
df_dayof_plus.to_csv('dayof_plus_changes.csv')
df_m_d_pctchg.to_csv('m_d_pctchg.csv')
df_d_p_pctchg.to_csv('d_p_pctchg.csv')

In [23]:
from pycaret.regression import *
print(df_minus_dayof.date.max())
# remember to hold out by date!
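# .copy() makes these independent frames, so the in-place edits below do not trigger SettingWithCopyWarning
training = df_minus_dayof[df_minus_dayof['date']<'2019'].copy()
unseen = df_minus_dayof[df_minus_dayof['date']>='2019'].copy() # this holds out ~6% of the data (12 of 199 rows) for this particular dataset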
training.reset_index(drop=True, inplace=True)
unseen.reset_index(drop=True, inplace=True)
# Need to pick one of my two targets...
training.drop('pct_change', axis=1, inplace=True)
unseen.drop('pct_change', axis=1, inplace=True)
print('training: ',training.shape)
print('unseen: ',unseen.shape)
# I will eventually be doing this and following steps... four times!

2019-09-05 00:00:00
training:  (187, 28)
unseen:  (12, 28)
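
Since the same date-based holdout gets repeated for each of the four frames, it could be wrapped in a small helper. This is only a sketch: `split_by_date` is a hypothetical name I am introducing here, and the toy dates and values are made up.

```python
import pandas as pd

def split_by_date(frame, cutoff, date_col="date"):
    """Split into (training, unseen); rows dated on or after `cutoff` are held out."""
    is_unseen = frame[date_col] >= pd.Timestamp(cutoff)
    # reset_index returns new frames, so later edits do not touch the original
    training = frame.loc[~is_unseen].reset_index(drop=True)
    unseen = frame.loc[is_unseen].reset_index(drop=True)
    return training, unseen

# Toy data (made up) to show the behavior.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2018-03-01", "2018-11-15", "2019-01-02", "2019-09-05"]),
    "change": [0.1, -0.2, 0.3, -0.4],
})
training, unseen = split_by_date(toy, "2019-01-01")
print("training:", training.shape)
print("unseen:", unseen.shape)
```

Splitting on the date rather than at random keeps every held-out filing later than every training filing, which is the point of the "hold out by date" reminder.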


- on non-transformed data, investigate "naive" PyCaret setup vs asking for transform/normalization
- profile may tell me something about the transforms I want, but caution, it takes a LOT of memory
- setting session ID the same in both setup requests will make any randomization identical
- Note the PyCaret workflow seems to be that I'll need to run all the steps on one iteration of "setup"
- I did try profiling and it says LOTS of high correlation, and skewed data all kinds of ways, which I mostly knew.
- And the profile completed but took nearly 3GB of RAM for <200 records. Onward!

In [24]:
exp_naive = setup(data = training, target = 'change', session_id=123, numeric_features=['preferredstockvalue'])
# skipping profile=True to save RAM, it tells me what I already knew

Unnamed: 0,Description,Value
0,session_id,123
1,Target,change
2,Original Data,"(187, 28)"
3,Missing Values,False
4,Numeric Features,25
5,Categorical Features,1
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(130, 48)"


In [25]:
top3 = compare_models(exclude = ['ransac'], n_select = 3)

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
huber,Huber Regressor,2.905,67.7648,5.9584,-0.2642,0.7587,3.8514,0.012
llar,Lasso Least Angle Regression,2.8614,66.9092,5.9881,-0.3052,0.8367,4.3243,0.634
knn,K Neighbors Regressor,3.0925,89.2863,6.8112,-0.9927,0.6723,2.6235,0.009
omp,Orthogonal Matching Pursuit,3.5911,91.7912,7.1238,-1.7845,0.7671,5.2362,0.006
rf,Random Forest Regressor,3.7736,130.6785,8.2968,-1.8258,0.6516,3.5694,0.082
par,Passive Aggressive Regressor,4.6169,86.4339,7.0035,-1.9818,1.0803,16.3418,0.009
catboost,CatBoost Regressor,4.0161,164.2433,9.3505,-2.138,0.512,2.5527,0.982
br,Bayesian Ridge,4.1113,96.3729,7.4144,-2.7016,0.8842,12.2466,0.007
lr,Linear Regression,4.1113,96.3729,7.4144,-2.7016,0.8842,12.2466,0.649
et,Extra Trees Regressor,4.0914,148.302,9.0325,-2.703,0.5957,3.3818,0.062


- All of my R-squared values are negative, which my class notes on metrics say means "something is very wrong". More specifically, these models fit worse than "just guess the average".
- Perhaps it's time to do my transformations and see if that works better.
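
To make the "worse than just guess the average" reading concrete, here is a worked example of the usual definition R² = 1 - SS_res / SS_tot. The numbers are made up, and `r_squared` is a helper written for this illustration rather than PyCaret's own code.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # model's squared error
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of guessing the mean
    return 1.0 - ss_res / ss_tot

y_true = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])     # made-up price changes
mean_guess = np.full_like(y_true, y_true.mean())   # always predict the average
bad_model = np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # gets every direction wrong

print(r_squared(y_true, mean_guess))  # 0.0
print(r_squared(y_true, bad_model))   # -3.0
```

Guessing the mean scores exactly 0, so any negative value means the model's squared error is larger than that baseline's. An R² of -0.26, for example, corresponds to about 26% more squared error than the mean guess on the scored data.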

In [26]:
exp_xfrm = setup(data = training, target = 'change', session_id=123, normalize = True, transformation = True, 
                 remove_multicollinearity = True, multicollinearity_threshold = 0.95, numeric_features=['preferredstockvalue']) 

Unnamed: 0,Description,Value
0,session_id,123
1,Target,change
2,Original Data,"(187, 28)"
3,Missing Values,False
4,Numeric Features,25
5,Categorical Features,1
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(130, 44)"


In [27]:
top3 = compare_models(exclude = ['ransac'], n_select = 3)

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
llar,Lasso Least Angle Regression,2.8614,66.9092,5.9881,-0.3052,0.8367,4.3243,0.642
br,Bayesian Ridge,3.0375,81.2965,6.358,-0.3515,0.7823,5.1322,0.006
lasso,Lasso Regression,2.8862,72.8207,6.1465,-0.359,0.8044,3.9559,0.006
en,Elastic Net,2.8958,74.9547,6.2415,-0.4388,0.7839,3.3498,0.007
omp,Orthogonal Matching Pursuit,3.5379,89.6932,6.9789,-1.2091,0.7096,5.3642,0.007
huber,Huber Regressor,3.1813,71.4615,6.246,-1.453,0.6897,5.3808,0.012
rf,Random Forest Regressor,3.7581,130.2929,8.2188,-1.6649,0.657,3.2727,0.082
knn,K Neighbors Regressor,3.3032,94.1669,7.191,-1.8177,0.6526,1.8144,0.009
catboost,CatBoost Regressor,4.0103,162.3325,9.3273,-2.2431,0.548,2.6155,0.95
ada,AdaBoost Regressor,4.6004,214.3343,10.5099,-2.919,0.814,9.3681,0.022


- Transforming changed the order of the "best" models but they all still have negative R-squared. It didn't actually change the LLAR numbers, but the Huber is worse and nothing appears to be "good".
- Let's try the percent-change transformed data.

In [28]:
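# .copy() makes these independent frames, so the in-place edits below do not trigger SettingWithCopyWarning
training = df_m_d_pctchg[df_m_d_pctchg['date']<'2019'].copy()
unseen = df_m_d_pctchg[df_m_d_pctchg['date']>='2019'].copy() # this holds out ~6% of the data (12 of 199 rows) for this particular dataset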
training.reset_index(drop=True, inplace=True)
unseen.reset_index(drop=True, inplace=True)
# Need to pick one of my two targets...
training.drop('pct_change', axis=1, inplace=True)
unseen.drop('pct_change', axis=1, inplace=True)
print('training: ',training.shape)
print('unseen: ',unseen.shape)

training:  (187, 28)
unseen:  (12, 28)


In [29]:
exp_naive = setup(data = training, target = 'change', session_id=123, numeric_features=['preferredstockvalue'])

Unnamed: 0,Description,Value
0,session_id,123
1,Target,change
2,Original Data,"(187, 28)"
3,Missing Values,False
4,Numeric Features,25
5,Categorical Features,1
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(119, 50)"


In [30]:
top3 = compare_models(exclude = ['ransac'], n_select = 3)

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
llar,Lasso Least Angle Regression,2.2113,19.2695,3.7488,-0.1544,0.6251,1.9971,0.006
br,Bayesian Ridge,2.2124,19.2746,3.7494,-0.1548,0.6254,2.0007,0.013
lasso,Lasso Regression,2.2532,19.3923,3.7858,-0.2428,0.6207,2.0648,0.007
ada,AdaBoost Regressor,2.5586,22.4913,4.1919,-0.5794,0.7858,1.8411,0.025
catboost,CatBoost Regressor,2.668,22.9807,4.1841,-0.636,0.816,2.2553,0.937
knn,K Neighbors Regressor,2.6972,22.0424,4.1731,-0.7961,0.7729,2.4415,0.01
rf,Random Forest Regressor,2.7558,24.075,4.368,-0.9471,0.8025,2.2525,0.074
en,Elastic Net,2.4104,21.1437,4.0689,-1.0604,0.6559,2.1991,0.006
lightgbm,Light Gradient Boosting Machine,2.8564,25.4496,4.5608,-1.0808,0.8047,1.9381,0.153
et,Extra Trees Regressor,2.8857,31.8078,4.8633,-1.8576,0.8491,2.6946,0.059


- Oh boy. This shows the least negative R-squared values so far... Time to bring in the transforms? Even on my once-transformed data?
- Let's also keep in mind going back to clipping the outliers, if this doesn't work so well.

In [31]:
exp_xfrm = setup(data = training, target = 'change', session_id=123, normalize = True, transformation = True, 
                 remove_multicollinearity = True, multicollinearity_threshold = 0.95, numeric_features=['preferredstockvalue']) 

Unnamed: 0,Description,Value
0,session_id,123
1,Target,change
2,Original Data,"(187, 28)"
3,Missing Values,False
4,Numeric Features,25
5,Categorical Features,1
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(119, 50)"


In [32]:
top3 = compare_models(n_select = 3)

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
br,Bayesian Ridge,2.21,19.2649,3.7484,-0.1539,0.625,1.9961,0.007
lasso,Lasso Regression,2.2113,19.2695,3.7488,-0.1544,0.6251,1.9971,0.007
llar,Lasso Least Angle Regression,2.2113,19.2695,3.7488,-0.1544,0.6251,1.9971,0.007
en,Elastic Net,2.2093,19.3087,3.7588,-0.1606,0.6296,1.9344,0.008
knn,K Neighbors Regressor,2.8834,25.6064,4.4434,-0.6673,0.7906,2.3786,0.009
catboost,CatBoost Regressor,2.6677,23.3623,4.211,-0.6694,0.8075,2.2854,0.948
ada,AdaBoost Regressor,2.6464,24.2025,4.3406,-0.7284,0.8239,1.9103,0.024
omp,Orthogonal Matching Pursuit,2.734,21.8907,4.2111,-0.8337,0.7585,2.2252,0.006
rf,Random Forest Regressor,2.7504,24.1633,4.3748,-0.9441,0.8043,2.2434,0.072
lightgbm,Light Gradient Boosting Machine,2.8153,25.2211,4.5444,-1.0518,0.7664,2.0057,0.191


- Something other than LLAR on top... still all negative R-squared.
- I think I'm done here. Time to try time series.