# Construct X and y Matrices

In previous work, we extracted data for the 300 top-returning stocks over the last month. Of these, 30 were selected for further study. For these 30, we engineered sinusoidal features on the Open, High, Low, and Close prices. Using the close, we also algorithmically labeled buy, sell, and hold points as 1, 2, and 0 respectively.

To train a model on these datasets, we must now ensure they are correctly constructed. The model will be a Convolutional Neural Network, so our X matrix must be constructed from 'images' composed of the features. Our y will be an array.

Since the goal is to train a single, generalized model, our X and y datasets should contain shuffled data from the first half of our time period. Data from the second half across all stocks will be held back as test data.

Each dataset will first be split into train and test. The y column will then be separated into an array of its own. The remaining X features will be transformed into images. Both the X and y sections will then be deposited into the larger X/y matrices.

#### Steps

1. Iteratively load the datasets
2. On each iteration,   
    A. Drop unnecessary columns  
    B. Clean the data by scaling the relevant columns (open, high, low, close, volume) and those columns with very small values  
    C. Remove y data and store in y Matrix  
    D. Pivot the X data  
    E. Slice the X data and store it in an array  
3. X.shape should equal (30, 8000, 30, 30) and y.shape (30, 8000)  
4. Save the base X and y  
5. Iterate over the elements of X and y together:  
    A. Split by second index (time)  
    B. Reshape data into train X (120,000, 30, 30) train y (120,000) test X (120,000, 30, 30) test y (120,000)  
    C. Save train and test data sets  
6. Use preprocessing functions:  
    A. Balance classes 33/33/33  
    B. Shuffle the data (possibly)  
7. Save   
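
The pivot and slice steps (D and E) can be sketched on a toy frame. This is a minimal illustration, not the notebook's implementation: `make_windows`, `toy`, and the window width of 4 are hypothetical names and values, and the sketch uses NumPy's `sliding_window_view` instead of an explicit loop.

```python
import numpy as np
import pandas as pd


def make_windows(features: pd.DataFrame, width: int) -> np.ndarray:
    """Return 'images' of shape (n_windows, n_features, width)."""
    # Pivot: rows become features, columns become time steps.
    pivoted = features.to_numpy().T
    # Sliding windows along the time axis: (n_features, n_windows, width).
    windows = np.lib.stride_tricks.sliding_window_view(pivoted, width, axis=1)
    # Put the window index first and copy so the result owns its memory.
    return np.ascontiguousarray(windows.transpose(1, 0, 2))


# Toy data: 10 time steps, 2 features (hypothetical values).
toy = pd.DataFrame(np.arange(20, dtype=float).reshape(10, 2), columns=["f0", "f1"])
images = make_windows(toy, width=4)
print(images.shape)
```

Each `images[j]` holds the same values as `toy.T.iloc[:, j:j + 4]`, so one image covers `width` consecutive time steps for every feature.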

### 1. & 2. Iteratively load and transform the datasets

In [1]:
import pandas as pd
import numpy as np
from os import listdir
from os.path import isfile, join
from extract import load_set
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler

pd.set_option('display.max_columns', 130)
pd.set_option('display.max_rows', 100)

In [2]:
data_dir = './data/prepared/august25screenfixed/'
suffix = ''

stocks = [f.split('.')[0] for f in listdir(data_dir) if isfile(join(data_dir, f))]
stocks

['RRR',
 'RLGY',
 'HOME',
 'FDX',
 'MIK',
 'WSM',
 'NVDA',
 'DE',
 'TGT',
 'LULU',
 'FBHS',
 'XRX',
 'CX',
 'EAT',
 'DHI',
 'ICE',
 'EXPI',
 'CLNY',
 'IMVT',
 'ELAN',
 'PACB',
 'PENN',
 'REAL',
 'NKE',
 'BBY',
 'HTHT',
 'GRPN',
 'CZR',
 'GDDY',
 'LOW']

In [3]:
df = load_set(stocks[0], data_dir, suffix)
df.head()

Unnamed: 0,open,high,low,close,volume,datetime,date,hour,minute,min_num,SYMBOL,prev_close,diff_1,pct_change,log_return,%open,mesa_open,open_amp,open_omega,open_phase,open_offset,open_freq,open_period,open_maxcov,open_sin,time,open_angle,open_rad,open_rad2,open_sin2,open_cos2,open_cos,open_tan,open_tan2,open_xsinx,open_xcosx,open_sinxcosx,open_xsinxcosx,open_xsinx2,open_xcosx2,open_sinxcosx2,open_xsinxcosx2,open_xtanx,open_xtanx2,%high,mesa_high,high_amp,high_omega,high_phase,high_offset,high_freq,high_period,high_maxcov,high_sin,high_angle,high_rad,high_rad2,high_sin2,high_cos2,high_cos,high_tan,high_tan2,high_xsinx,high_xcosx,high_sinxcosx,high_xsinxcosx,high_xsinx2,high_xcosx2,high_sinxcosx2,high_xsinxcosx2,high_xtanx,high_xtanx2,%low,mesa_low,low_amp,low_omega,low_phase,low_offset,low_freq,low_period,low_maxcov,low_sin,low_angle,low_rad,low_rad2,low_sin2,low_cos2,low_cos,low_tan,low_tan2,low_xsinx,low_xcosx,low_sinxcosx,low_xsinxcosx,low_xsinx2,low_xcosx2,low_sinxcosx2,low_xsinxcosx2,low_xtanx,low_xtanx2,%close,mesa_close,close_amp,close_omega,close_phase,close_offset,close_freq,close_period,close_maxcov,close_sin,close_angle,close_rad,close_rad2,close_sin2,close_cos2,close_cos,close_tan,close_tan2,close_xsinx,close_xcosx,close_sinxcosx,close_xsinxcosx,close_xsinx2,close_xcosx2,close_sinxcosx2,close_xsinxcosx2,close_xtanx,close_xtanx2,decision,D2
8845,17.28,17.28,17.28,17.28,100,2020-08-25 22:48:00,2020-08-25,22,48,1368,RRR,17.3,-0.02,-0.001156,-0.001157,-0.001156,0.490588,0.001894,2.99995,141.151213,-0.0004,0.477457,2.09443,222173.7,-0.001211,8845,26675.70572,3.584091,4.061547,-0.001907,-0.605856,-0.903685,0.473836,1.31314,1e-06,0.001045,0.001095,-1e-06,2.204898e-06,0.0007,0.001156,-1.335851e-06,-0.000548,-0.001518,-0.001156,0.490591,-0.00194,2.802241,-332.993391,-0.000401,0.445991,2.2422,244042.8,0.001482,24452.828207,4.954177,5.400168,0.001098,0.634822,0.239439,-4.054941,-1.217125,-2e-06,-0.000277,0.000355,-4.103498e-07,-1e-06,-0.000734,0.000697,-8.057384e-07,0.004688,0.001407,-0.001156,0.490588,0.002575,2.990564,224.104713,-0.000389,0.475963,2.101003,144034.9,-0.001352,26675.646515,3.524886,4.00085,-0.00234,-0.653,-0.927438,0.403237,1.159812,2e-06,0.001072,0.001254,-1e-06,3e-06,0.000755,0.001528,-2e-06,-0.000466,-0.001341,-0.001156,0.490591,-0.002573,2.833824,-611.936535,-0.00039,0.451017,2.217211,172665.9,0.001653,24453.239631,5.365601,5.816618,0.000767,0.893118,0.60774,-1.306702,-0.503655,-2e-06,-0.000703,0.001004,-1e-06,-8.865898e-07,-0.001033,0.000685,-7.918291e-07,0.001511,0.000582,0.0,0.0
8844,17.3,17.3,17.3,17.3,200,2020-08-25 22:44:00,2020-08-25,22,44,1364,RRR,17.44,-0.14,-0.008028,-0.00806,-0.002882,0.489934,0.001881,1.287313,-270.196033,-0.000346,0.204882,4.880852,232257.6,-0.000633,8844,11114.801657,6.130034,6.334916,-0.000249,0.998662,0.988295,-0.15436,0.051777,2e-06,-0.002848,-0.000626,2e-06,7.176933e-07,-0.002878,-0.000249,7.167332e-07,0.000445,-0.000149,-0.008028,0.488493,-0.002108,2.808354,-387.061715,-0.000335,0.446963,2.23732,181936.0,-0.002103,24450.020219,2.146189,2.593152,-0.001434,-0.853339,-0.544164,-1.541776,-0.610962,1.7e-05,0.004368,0.001145,-9.187537e-06,1.2e-05,0.00685,0.001223,-9.820883e-06,0.012377,0.004905,-0.002882,0.489934,0.002457,3.00182,124.667322,-0.000327,0.477755,2.093125,210788.0,0.001151,26672.767225,0.645596,1.123351,0.001888,0.432664,0.798741,0.753279,2.083731,-3e-06,-0.002302,0.000919,-3e-06,-5e-06,-0.001247,0.000817,-2e-06,-0.002171,-0.006005,-0.008028,0.488493,-0.002719,2.838034,-649.214703,-0.000325,0.451687,2.213922,140893.5,-0.00199,24450.356426,2.482395,2.934083,-0.000885,-0.978547,-0.790484,-0.774819,-0.210541,1.6e-05,0.006346,0.001573,-1.3e-05,7.103888e-06,0.007855,0.000866,-6.951488e-06,0.00622,0.00169,0.0,0.0
8843,17.35,17.44,17.35,17.44,25,2020-08-25 22:36:00,2020-08-25,22,36,1356,RRR,17.35,0.09,0.005187,0.005174,0.0,0.496402,0.004198,3.106833,-802.280898,-0.000187,0.494468,2.022376,6295940.0,-0.002822,8843,26671.442713,5.60427,6.098738,-0.000956,0.983038,0.778254,-0.806869,-0.186568,-0.0,0.0,-0.002197,-0.0,-0.0,0.0,-0.00094,-0.0,-0.0,-0.0,0.005187,0.497884,-0.001594,2.843266,-695.427737,-1.2e-05,0.45252,2.209848,222953.1,0.000461,24447.572453,5.981608,6.434128,-0.000252,0.98863,0.954869,-0.311065,0.1521,2e-06,0.004953,0.00044,2.284356e-06,-1e-06,0.005128,-0.000249,-1.292442e-06,-0.001614,0.000789,0.0,0.497337,0.003319,3.069251,-470.566094,-0.00019,0.488486,2.04714,1029450.0,-0.003389,26670.820335,4.981891,5.470377,-0.0026,0.687462,0.266251,-3.620276,-1.056379,-0.0,0.0,-0.000902,-0.0,-0.0,0.0,-0.001788,-0.0,-0.0,-0.0,0.005187,0.499745,0.002211,2.999563,144.588277,-2.9e-05,0.477395,2.094701,341519.5,-0.001521,26669.720405,3.881962,4.359357,-0.002104,-0.345745,-0.73822,0.913766,2.713937,-8e-06,-0.003829,0.001123,6e-06,-1.091391e-05,-0.001793,0.000727,3.773424e-06,0.00474,0.014078,0.0,0.0
8842,17.35,17.35,17.35,17.35,4,2020-08-25 22:34:00,2020-08-25,22,34,1354,RRR,17.32,0.03,0.001732,0.001731,0.001732,0.491849,0.210609,3.140843,-1102.364681,-0.000269,0.499881,2.000478,2372299000.0,0.002702,8842,26668.965927,3.127483,3.627364,-0.0986,-0.884315,-0.9999,-0.014111,0.527968,5e-06,-0.001732,-0.002702,-5e-06,-0.0001707856,-0.001532,0.087194,0.0001510283,-2.4e-05,0.000914,0.001732,0.49248,0.170091,3.14082,-1102.168494,-0.000261,0.499877,2.000492,706510600.0,0.002159,26668.965812,3.127368,3.627245,-0.079657,-0.88437,-0.999899,-0.014225,0.527817,4e-06,-0.001732,-0.002158,-3.738379e-06,-0.000138,-0.001532,0.070446,0.0001220202,-2.5e-05,0.000914,0.001732,0.491378,0.377616,3.141009,-1103.833467,-0.000275,0.499907,2.000372,172314100.0,0.003632,26668.969689,3.131245,3.631152,-0.177844,-0.88254,-0.999946,-0.010348,0.532822,6e-06,-0.001732,-0.003632,-6e-06,-0.000308,-0.001529,0.156955,0.000272,-1.8e-05,0.000923,0.001732,0.491535,0.271152,3.140784,-1101.846602,-0.000272,0.499871,2.000515,9297269000.0,0.00353,26668.966016,3.127572,3.627443,-0.126889,-0.884278,-0.999902,-0.014022,0.52807,6e-06,-0.001732,-0.003529,-6e-06,-0.0002197851,-0.001532,0.112205,0.0001943511,-2.4e-05,0.000915,0.0,0.0
8841,17.32,17.32,17.32,17.32,499,2020-08-25 22:28:00,2020-08-25,22,28,1348,RRR,17.32,0.0,0.0,0.0,0.0,0.49098,0.244868,3.140783,-1101.835132,-0.000101,0.499871,2.000516,17649810000.0,-0.003312,8841,26665.825332,6.270073,6.769944,0.114439,0.883854,0.999914,-0.013113,0.529232,-0.0,0.0,-0.003311,-0.0,0.0,0.0,0.101148,0.0,-0.0,0.0,0.0,0.49098,0.002768,1.216969,339.262618,-0.001003,0.193687,5.162981,243913.8,0.000914,11098.481518,2.376266,2.569952,0.000494,-0.841015,-0.721156,-0.960642,-0.643285,0.0,-0.0,-0.000659,-0.0,0.0,-0.0,-0.000416,-0.0,-0.0,-0.0,0.0,0.491908,0.488526,3.141052,-1104.208483,-9.2e-05,0.499914,2.000344,21822430000.0,-0.004281,26665.829868,6.27461,6.774524,0.230398,0.881702,0.999963,-0.008576,0.535108,-0.0,0.0,-0.004281,-0.0,0.0,0.0,0.203142,0.0,-0.0,0.0,0.0,0.491907,0.003094,1.213933,372.387665,-0.001011,0.193203,5.175891,247825.3,0.001121,11104.769538,2.3811,2.574303,0.000651,-0.843361,-0.724496,-0.95139,-0.637151,0.0,-0.0,-0.000812,-0.0,0.0,-0.0,-0.000549,-0.0,-0.0,-0.0,0.0,0.0


In [4]:
df.dropna(axis=0).shape
#df.shape

(8819, 129)

In [5]:
df.describe()

Unnamed: 0,open,high,low,close,volume,hour,minute,min_num,prev_close,diff_1,pct_change,log_return,%open,mesa_open,open_amp,open_omega,open_phase,open_offset,open_freq,open_period,open_maxcov,open_sin,time,open_angle,open_rad,open_rad2,open_sin2,open_cos2,open_cos,open_tan,open_tan2,open_xsinx,open_xcosx,open_sinxcosx,open_xsinxcosx,open_xsinx2,open_xcosx2,open_sinxcosx2,open_xsinxcosx2,open_xtanx,open_xtanx2,%high,mesa_high,high_amp,high_omega,high_phase,high_offset,high_freq,high_period,high_maxcov,high_sin,high_angle,high_rad,high_rad2,high_sin2,high_cos2,high_cos,high_tan,high_tan2,high_xsinx,high_xcosx,high_sinxcosx,high_xsinxcosx,high_xsinx2,high_xcosx2,high_sinxcosx2,high_xsinxcosx2,high_xtanx,high_xtanx2,%low,mesa_low,low_amp,low_omega,low_phase,low_offset,low_freq,low_period,low_maxcov,low_sin,low_angle,low_rad,low_rad2,low_sin2,low_cos2,low_cos,low_tan,low_tan2,low_xsinx,low_xcosx,low_sinxcosx,low_xsinxcosx,low_xsinx2,low_xcosx2,low_sinxcosx2,low_xsinxcosx2,low_xtanx,low_xtanx2,%close,mesa_close,close_amp,close_omega,close_phase,close_offset,close_freq,close_period,close_maxcov,close_sin,close_angle,close_rad,close_rad2,close_sin2,close_cos2,close_cos,close_tan,close_tan2,close_xsinx,close_xcosx,close_sinxcosx,close_xsinxcosx,close_xsinx2,close_xcosx2,close_sinxcosx2,close_xsinxcosx2,close_xtanx,close_xtanx2,decision
count,8846.0,8846.0,8846.0,8846.0,8846.0,8846.0,8846.0,8846.0,8843.0,8843.0,8843.0,8843.0,8845.0,8842.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8846.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8845.0,8842.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8845.0,8842.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8845.0,8842.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8821.0,8846.0
mean,14.145617,14.159394,14.13214,14.146278,5692.403572,16.267579,30.398485,1006.453199,14.145844,0.000747,5.8e-05,5.4e-05,5.5e-05,0.491104,0.003151,1.58005,-3.909109,0.003543,0.251473,17.961551,inf,6.3e-05,4422.5,6870.890667,3.19675,3.448223,0.000336,0.011294,0.011014,1.51765,-0.200557,2.675746e-06,1.3e-05,-2e-06,-4.342282e-08,-1.062358e-05,-1.5e-05,7.1e-05,3.24909e-06,-0.003242,9.1e-05,5.5e-05,0.491103,0.001029,1.46028,-2.132063,0.001876,0.232411,19.75384,inf,6.2e-05,6619.616764,3.178317,3.410728,-0.000301,0.007198,0.007982,0.476751,-4.9404,2.483268e-06,-3e-06,-8e-06,4.126265e-08,-1.071168e-05,-5.5e-05,-0.000446,-1.524439e-06,-0.007622,-0.002061,5.5e-05,0.491104,0.001177,1.531006,11.975717,-6.7e-05,0.243667,19.26526,inf,6.8e-05,6654.423104,3.171779,3.415446,9.9e-05,-0.000909,-0.00133,-0.795871,0.907205,2.639368e-06,1.7e-05,1.5e-05,-9.723026e-08,-6.009884e-06,-3.2e-05,7.4e-05,3.344522e-06,-0.004453,0.005429,5.6e-05,0.491104,0.001226,1.665249,6.570902,0.00095,0.265033,16.45772,inf,7.2e-05,7347.135854,3.200566,3.465599,7.9e-05,0.00836,0.006554,-178.9241,-0.381181,2.861761e-06,1.2e-05,1.2e-05,-3.868957e-08,-7.361938e-06,-4.4e-05,-0.000103,1.117065e-06,0.207714,-0.000569,0.640289
std,2.27303,2.273987,2.272422,2.273237,14766.985612,2.126727,17.358268,127.08006,2.273173,0.042116,0.003049,0.003047,0.002855,0.002232,0.225521,0.869312,345.779368,0.218831,0.138355,122.079573,,0.001557,2553.764574,5774.56075,1.850503,1.862397,0.025716,0.731467,0.732894,135.880215,75.949944,1.161396e-05,0.001744,0.000917,6.468894e-06,0.0002806906,0.00179,0.022619,0.0002470835,0.178835,0.426653,0.002783,0.00222,0.226497,0.847979,347.821397,0.218752,0.13496,124.151195,,0.00153,5915.253634,1.842339,1.852946,0.027697,0.728218,0.72734,144.548542,380.276711,1.201474e-05,0.001616,0.000897,6.498956e-06,0.0003366416,0.001658,0.024358,0.0002963177,0.385524,0.380053,0.002893,0.002237,0.285837,0.848974,335.978515,0.272964,0.135118,126.639685,,0.001552,5552.363518,1.833618,1.843241,0.040226,0.729256,0.730584,96.870938,191.028652,1.424685e-05,0.001588,0.000905,5.887086e-06,0.0002518417,0.001644,0.035359,0.0002214006,0.39017,0.292639,0.003053,0.002318,0.241224,0.869976,354.528467,0.233807,0.138461,114.23339,,0.001617,6043.166106,1.844477,1.855578,0.02803,0.730697,0.734692,16670.25,44.352727,1.582033e-05,0.001697,0.00094,6.810879e-06,0.0002557167,0.001733,0.024648,0.000224972,19.729646,0.102166,0.771262
min,10.065,10.0698,10.06,10.065,1.0,8.0,0.0,481.0,10.065,-0.73,-0.0584,-0.060175,-0.031048,0.464275,-7.015935,-0.294963,-3003.246379,-7.438448,-0.046945,-21.301622,0.3980933,-0.012635,0.0,11.027983,0.000653,0.042419,-0.794805,-1.0,-1.0,-4043.46833,-5179.773877,-5.138965e-05,-0.018623,-0.009616,-0.0002010213,-0.01139064,-0.0209,-0.721305,-0.008012297,-12.846586,-33.516184,-0.030272,0.465919,-8.142943,-0.332098,-2926.413908,-8.159102,-0.052855,-50.06152,0.3602391,-0.010052,11.02602,0.001626,0.041784,-0.959405,-1.0,-1.0,-10781.899742,-35477.858023,-6.954567e-05,-0.018599,-0.009287,-0.0001867459,-0.0184986,-0.029522,-0.843584,-0.01626543,-28.280393,-31.268543,-0.0584,0.460241,-10.037444,-0.331968,-2656.852158,-10.036145,-0.052834,-18.92708,0.2605054,-0.010872,11.021397,0.000731,0.032306,-1.586518,-1.0,-1.0,-8313.55384,-5661.322774,-7.588841e-05,-0.015241,-0.012052,-0.0001525924,-0.01236961,-0.01711,-1.027094,-0.007959519,-35.13759,-1.936371,-0.0584,0.460274,-10.168829,-0.381556,-3016.412987,-7.768545,-0.060727,-16.46726,0.6802084,-0.013416,11.026246,0.001745,0.040717,-0.772481,-1.0,-1.0,-1565625.0,-2009.845708,-8.19824e-05,-0.016127,-0.01119,-0.0001804857,-0.01030068,-0.021977,-0.739892,-0.009073368,-25.37233,-3.44936,0.0
25%,11.85,11.850625,11.84,11.85,1000.0,15.0,15.0,901.0,11.8475,-0.01,-0.000856,-0.000857,-0.000855,0.490239,-0.000975,0.804294,-158.386359,-0.000162,0.128007,2.718297,11874.49,-0.000663,2211.25,2292.137655,1.629089,1.867653,-0.000662,-0.734879,-0.75183,-0.896787,-0.835382,-0.0,-0.000508,-0.000366,-2.542339e-07,0.0,-0.000524,-0.000355,-2.619121e-07,-0.000643,-0.000679,-0.000831,0.490258,-0.000881,0.742043,-153.759668,-0.000169,0.1181,2.911173,11880.57,-0.000646,1980.888294,1.61848,1.8324,-0.000655,-0.74284,-0.736272,-0.921088,-0.886003,-0.0,-0.000491,-0.000357,-2.115493e-07,0.0,-0.000519,-0.00034,-2.116218e-07,-0.00054,-0.000571,-0.000814,0.490288,-0.000956,0.788026,-132.251282,-0.000158,0.125418,2.774103,12392.67,-0.000641,2289.692815,1.632666,1.849125,-0.000645,-0.759759,-0.757906,-0.929054,-0.873811,0.0,-0.000468,-0.00035,-2.5375e-07,0.0,-0.000513,-0.000351,-2.446861e-07,-0.00061,-0.000616,-0.000857,0.490237,-0.001012,0.954071,-148.275677,-0.000164,0.151845,2.564763,11583.28,-0.00067,2327.456585,1.640797,1.899645,-0.000696,-0.749827,-0.751333,-0.8965017,-0.827177,-0.0,-0.000528,-0.00038,-2.752001e-07,0.0,-0.000551,-0.000374,-2.868235e-07,-0.000672,-0.000678,0.0
50%,14.194,14.20005,14.19,14.195,2425.5,16.0,31.0,1005.0,14.195,0.0,0.0,0.0,0.0,0.491036,0.000716,1.492468,-0.20612,2.8e-05,0.237534,4.20964,48801.58,2.6e-05,4422.5,5211.835958,3.162095,3.48066,3.7e-05,0.024731,0.024595,-0.01797,0.06516,2.609076e-07,0.0,-8e-06,0.0,2.300529e-07,-0.0,2.1e-05,0.0,0.0,0.0,0.0,0.491023,0.000724,1.314349,-1.234876,2.5e-05,0.209185,4.778866,46885.22,3.4e-05,4470.979174,3.137549,3.406493,3.3e-05,0.022588,0.007703,-0.021095,0.03322,2.211522e-07,-0.0,-3e-06,-0.0,1.862203e-07,0.0,1.4e-05,0.0,0.0,0.0,0.0,0.491087,0.000719,1.43845,2.864968,3.1e-05,0.228936,4.367472,47192.22,7e-05,5012.327601,3.136434,3.429228,5.5e-05,0.010547,0.000637,-0.022686,0.045357,2.408868e-07,0.0,1.4e-05,-0.0,2.154169e-07,0.0,2.2e-05,0.0,-0.0,0.0,0.0,0.491047,0.000774,1.685137,1.745912,2.9e-05,0.268198,3.727035,49148.04,4.5e-05,5660.449389,3.185088,3.506034,4.9e-05,0.022517,0.020705,-0.01555997,0.070797,2.960088e-07,-0.0,1.2e-05,0.0,2.521848e-07,0.0,2.6e-05,0.0,0.0,0.0,0.0
75%,16.443775,16.455,16.43,16.44,5600.0,18.0,45.0,1110.0,16.44,0.015,0.000952,0.000951,0.000907,0.491913,0.001343,2.311208,146.367451,0.000261,0.36784,7.803219,114508.0,0.000734,6633.75,9993.734854,4.803746,5.025663,0.000755,0.772527,0.769551,0.8181,0.89287,1.509341e-06,0.000505,0.000371,2.678915e-07,1.398925e-06,0.000485,0.000393,2.7818e-07,0.000611,0.000657,0.000846,0.491877,0.001361,2.156606,141.856274,0.000263,0.343234,8.462443,111219.4,0.000725,10161.645781,4.770701,4.97538,0.000717,0.762891,0.753349,0.870416,0.878947,1.336708e-06,0.000439,0.000341,2.356018e-07,1.234332e-06,0.000411,0.000362,2.408212e-07,0.000612,0.00064,0.000897,0.491932,0.001356,2.264599,148.406569,0.000262,0.360422,7.972776,112449.3,0.000727,9744.54623,4.771619,4.994267,0.000749,0.74923,0.759798,0.79716,0.866161,1.409432e-06,0.000487,0.00037,2.364151e-07,1.331478e-06,0.000461,0.000382,2.571962e-07,0.000622,0.000667,0.000951,0.491944,0.001398,2.449695,155.179949,0.000268,0.389881,6.584801,114519.2,0.000768,10974.735564,4.803415,5.050583,0.000774,0.764128,0.773566,0.8113658,0.891274,1.567982e-06,0.00056,0.000391,2.974597e-07,1.489484e-06,0.000525,0.00041,2.974937e-07,0.000649,0.000719,1.0
max,17.6,17.75,17.6,17.75,869073.0,23.0,59.0,1438.0,17.75,0.72,0.061172,0.059374,0.035038,0.514647,7.874241,3.413073,2185.86947,7.872991,0.543208,2624.302652,inf,0.011664,8845.0,26675.70572,6.282494,6.781995,0.819181,1.0,1.0,10063.159136,3314.173891,0.000322441,0.028978,0.010443,0.0001195725,0.0003312906,0.030632,0.56251,0.01002165,4.206843,19.3211,0.041371,0.520566,8.159629,3.316461,3707.700834,5.720848,0.527831,2611.05107,inf,0.010576,26668.965812,6.282742,6.782497,0.779398,1.0,1.0,5384.095398,1372.366018,0.0003159965,0.026706,0.010576,0.0001720486,0.001138875,0.023529,0.353408,0.01410561,4.543365,10.809829,0.061172,0.52188,8.695888,3.421479,2079.595595,6.906745,0.544545,2515.599193,inf,0.012052,26675.646515,6.282962,6.824546,1.168605,1.0,1.0,1571.030469,16375.549196,0.0006057182,0.023116,0.010924,0.0001250892,0.004049234,0.020343,1.395051,0.01087923,5.934175,23.618581,0.061172,0.521288,7.769305,3.427589,5436.964925,10.166909,0.545518,2325.52906,inf,0.01255,26669.720405,6.283099,6.782273,0.840644,1.0,1.0,2345.075,1881.700023,0.0005987539,0.023116,0.010738,0.0001503155,0.004398432,0.020349,0.642878,0.008783418,1852.811102,6.522357,2.0


In [6]:
df['open_maxcov'].values

array([ 222173.70242215,  232257.56696686, 6295939.83291142, ...,
                    nan,              nan,              nan])
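
The `inf` means in the `describe()` output and the trailing `nan` values above suggest checking which columns contain non-finite entries before scaling. A minimal sketch on a toy frame (the frame and its values are hypothetical; only the column names echo the real data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for a loaded stock frame (hypothetical values).
toy = pd.DataFrame({
    "open": [1.0, 2.0, 3.0, 4.0],
    "open_maxcov": [2.0e5, np.inf, 3.0e5, np.nan],
})

# Count inf and NaN entries per numeric column.
numeric = toy.select_dtypes(include="number")
non_finite = (~np.isfinite(numeric)).sum()
print(non_finite[non_finite > 0])
```

Note that `dropna` removes rows with `nan` but leaves `inf` in place, so columns with `inf` need separate handling, for example by dropping them.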

In [18]:
print('\n'.join(df['open_maxcov'].values.astype('str')))

222173.70242215143
232257.5669668576
6295939.832911415
2372299301.0622935
17649813964.26809
2129393173.6066098
18168633068.746582
256633.3589228578
257406.38033527767
319907.6702010652
421745.5174003669
317201.93217751785
354169.7504004168
339895.80418645317
275317.0087003242
276704.8478375882
285092.9818454513
238259.84282341323
202470.21998356818
263346.97576999885
235493.1043254767
214877.81802646848
223583.57977778921
223998.340193155
258491.93987983337
248069.2786508555
241830.10467642112
290883.8911386318
263929.04789361585
258447.0318921178
299127.4692672525
336737.12757972925
348780.1474519252
297060.97912904073
344396.9640455302
227495.14325655677
158120.58103435655
221468485.55281866
2269273.425141904
297283.9834251032
205895.25310121244
122492.42933156829
184619.86277110453
269596.2717459314
225142.8366733965
211191.48468244806
213588.94593907314
2394245.49947217
737699.1005373549
234440.15415089423
2649945718.902009
308004.02474493516
247635.83685140798
282807.8276968841
40

#### A. Drop unnecessary columns

In [45]:
df.head(5)

Unnamed: 0,open,high,low,close,volume,datetime,date,hour,minute,min_num,SYMBOL,prev_close,diff_1,pct_change,log_return,%open,mesa_open,open_amp,open_omega,open_phase,open_offset,open_freq,open_period,open_maxcov,open_sin,time,open_angle,open_rad,open_rad2,open_sin2,open_cos2,open_cos,open_tan,open_tan2,open_xsinx,open_xcosx,open_sinxcosx,open_xsinxcosx,open_xsinx2,open_xcosx2,open_sinxcosx2,open_xsinxcosx2,open_xtanx,open_xtanx2,%high,mesa_high,high_amp,high_omega,high_phase,high_offset,high_freq,high_period,high_maxcov,high_sin,high_angle,high_rad,high_rad2,high_sin2,high_cos2,high_cos,high_tan,high_tan2,high_xsinx,high_xcosx,high_sinxcosx,high_xsinxcosx,high_xsinx2,high_xcosx2,high_sinxcosx2,high_xsinxcosx2,high_xtanx,high_xtanx2,%low,mesa_low,low_amp,low_omega,low_phase,low_offset,low_freq,low_period,low_maxcov,low_sin,low_angle,low_rad,low_rad2,low_sin2,low_cos2,low_cos,low_tan,low_tan2,low_xsinx,low_xcosx,low_sinxcosx,low_xsinxcosx,low_xsinx2,low_xcosx2,low_sinxcosx2,low_xsinxcosx2,low_xtanx,low_xtanx2,%close,mesa_close,close_amp,close_omega,close_phase,close_offset,close_freq,close_period,close_maxcov,close_sin,close_angle,close_rad,close_rad2,close_sin2,close_cos2,close_cos,close_tan,close_tan2,close_xsinx,close_xcosx,close_sinxcosx,close_xsinxcosx,close_xsinx2,close_xcosx2,close_sinxcosx2,close_xsinxcosx2,close_xtanx,close_xtanx2,decision
8845,17.28,17.28,17.28,17.28,100,2020-08-25 22:48:00,2020-08-25,22,48,1368,RRR,17.3,-0.02,-0.001156,-0.001157,-0.001156,0.490588,0.001894,2.99995,141.151213,-0.0004,0.477457,2.09443,222173.7,-0.001211,8845,26675.70572,3.584091,4.061547,-0.001907,-0.605856,-0.903685,0.473836,1.31314,1e-06,0.001045,0.001095,-1e-06,2.204898e-06,0.0007,0.001156,-1.335851e-06,-0.000548,-0.001518,-0.001156,0.490591,-0.00194,2.802241,-332.993391,-0.000401,0.445991,2.2422,244042.8,0.001482,24452.828207,4.954177,5.400168,0.001098,0.634822,0.239439,-4.054941,-1.217125,-2e-06,-0.000277,0.000355,-4.103498e-07,-1e-06,-0.000734,0.000697,-8.057384e-07,0.004688,0.001407,-0.001156,0.490588,0.002575,2.990564,224.104713,-0.000389,0.475963,2.101003,144034.9,-0.001352,26675.646515,3.524886,4.00085,-0.00234,-0.653,-0.927438,0.403237,1.159812,2e-06,0.001072,0.001254,-1e-06,3e-06,0.000755,0.001528,-2e-06,-0.000466,-0.001341,-0.001156,0.490591,-0.002573,2.833824,-611.936535,-0.00039,0.451017,2.217211,172665.9,0.001653,24453.239631,5.365601,5.816618,0.000767,0.893118,0.60774,-1.306702,-0.503655,-2e-06,-0.000703,0.001004,-1e-06,-8.865898e-07,-0.001033,0.000685,-7.918291e-07,0.001511,0.000582,0.0
8844,17.3,17.3,17.3,17.3,200,2020-08-25 22:44:00,2020-08-25,22,44,1364,RRR,17.44,-0.14,-0.008028,-0.00806,-0.002882,0.489934,0.001881,1.287313,-270.196033,-0.000346,0.204882,4.880852,232257.6,-0.000633,8844,11114.801657,6.130034,6.334916,-0.000249,0.998662,0.988295,-0.15436,0.051777,2e-06,-0.002848,-0.000626,2e-06,7.176933e-07,-0.002878,-0.000249,7.167332e-07,0.000445,-0.000149,-0.008028,0.488493,-0.002108,2.808354,-387.061715,-0.000335,0.446963,2.23732,181936.0,-0.002103,24450.020219,2.146189,2.593152,-0.001434,-0.853339,-0.544164,-1.541776,-0.610962,1.7e-05,0.004368,0.001145,-9.187537e-06,1.2e-05,0.00685,0.001223,-9.820883e-06,0.012377,0.004905,-0.002882,0.489934,0.002457,3.00182,124.667322,-0.000327,0.477755,2.093125,210788.0,0.001151,26672.767225,0.645596,1.123351,0.001888,0.432664,0.798741,0.753279,2.083731,-3e-06,-0.002302,0.000919,-3e-06,-5e-06,-0.001247,0.000817,-2e-06,-0.002171,-0.006005,-0.008028,0.488493,-0.002719,2.838034,-649.214703,-0.000325,0.451687,2.213922,140893.5,-0.00199,24450.356426,2.482395,2.934083,-0.000885,-0.978547,-0.790484,-0.774819,-0.210541,1.6e-05,0.006346,0.001573,-1.3e-05,7.103888e-06,0.007855,0.000866,-6.951488e-06,0.00622,0.00169,0.0
8843,17.35,17.44,17.35,17.44,25,2020-08-25 22:36:00,2020-08-25,22,36,1356,RRR,17.35,0.09,0.005187,0.005174,0.0,0.496402,0.004198,3.106833,-802.280898,-0.000187,0.494468,2.022376,6295940.0,-0.002822,8843,26671.442713,5.60427,6.098738,-0.000956,0.983038,0.778254,-0.806869,-0.186568,-0.0,0.0,-0.002197,-0.0,-0.0,0.0,-0.00094,-0.0,-0.0,-0.0,0.005187,0.497884,-0.001594,2.843266,-695.427737,-1.2e-05,0.45252,2.209848,222953.1,0.000461,24447.572453,5.981608,6.434128,-0.000252,0.98863,0.954869,-0.311065,0.1521,2e-06,0.004953,0.00044,2.284356e-06,-1e-06,0.005128,-0.000249,-1.292442e-06,-0.001614,0.000789,0.0,0.497337,0.003319,3.069251,-470.566094,-0.00019,0.488486,2.04714,1029450.0,-0.003389,26670.820335,4.981891,5.470377,-0.0026,0.687462,0.266251,-3.620276,-1.056379,-0.0,0.0,-0.000902,-0.0,-0.0,0.0,-0.001788,-0.0,-0.0,-0.0,0.005187,0.499745,0.002211,2.999563,144.588277,-2.9e-05,0.477395,2.094701,341519.5,-0.001521,26669.720405,3.881962,4.359357,-0.002104,-0.345745,-0.73822,0.913766,2.713937,-8e-06,-0.003829,0.001123,6e-06,-1.091391e-05,-0.001793,0.000727,3.773424e-06,0.00474,0.014078,0.0
8842,17.35,17.35,17.35,17.35,4,2020-08-25 22:34:00,2020-08-25,22,34,1354,RRR,17.32,0.03,0.001732,0.001731,0.001732,0.491849,0.210609,3.140843,-1102.364681,-0.000269,0.499881,2.000478,2372299000.0,0.002702,8842,26668.965927,3.127483,3.627364,-0.0986,-0.884315,-0.9999,-0.014111,0.527968,5e-06,-0.001732,-0.002702,-5e-06,-0.0001707856,-0.001532,0.087194,0.0001510283,-2.4e-05,0.000914,0.001732,0.49248,0.170091,3.14082,-1102.168494,-0.000261,0.499877,2.000492,706510600.0,0.002159,26668.965812,3.127368,3.627245,-0.079657,-0.88437,-0.999899,-0.014225,0.527817,4e-06,-0.001732,-0.002158,-3.738379e-06,-0.000138,-0.001532,0.070446,0.0001220202,-2.5e-05,0.000914,0.001732,0.491378,0.377616,3.141009,-1103.833467,-0.000275,0.499907,2.000372,172314100.0,0.003632,26668.969689,3.131245,3.631152,-0.177844,-0.88254,-0.999946,-0.010348,0.532822,6e-06,-0.001732,-0.003632,-6e-06,-0.000308,-0.001529,0.156955,0.000272,-1.8e-05,0.000923,0.001732,0.491535,0.271152,3.140784,-1101.846602,-0.000272,0.499871,2.000515,9297269000.0,0.00353,26668.966016,3.127572,3.627443,-0.126889,-0.884278,-0.999902,-0.014022,0.52807,6e-06,-0.001732,-0.003529,-6e-06,-0.0002197851,-0.001532,0.112205,0.0001943511,-2.4e-05,0.000915,0.0
8841,17.32,17.32,17.32,17.32,499,2020-08-25 22:28:00,2020-08-25,22,28,1348,RRR,17.32,0.0,0.0,0.0,0.0,0.49098,0.244868,3.140783,-1101.835132,-0.000101,0.499871,2.000516,17649810000.0,-0.003312,8841,26665.825332,6.270073,6.769944,0.114439,0.883854,0.999914,-0.013113,0.529232,-0.0,0.0,-0.003311,-0.0,0.0,0.0,0.101148,0.0,-0.0,0.0,0.0,0.49098,0.002768,1.216969,339.262618,-0.001003,0.193687,5.162981,243913.8,0.000914,11098.481518,2.376266,2.569952,0.000494,-0.841015,-0.721156,-0.960642,-0.643285,0.0,-0.0,-0.000659,-0.0,0.0,-0.0,-0.000416,-0.0,-0.0,-0.0,0.0,0.491908,0.488526,3.141052,-1104.208483,-9.2e-05,0.499914,2.000344,21822430000.0,-0.004281,26665.829868,6.27461,6.774524,0.230398,0.881702,0.999963,-0.008576,0.535108,-0.0,0.0,-0.004281,-0.0,0.0,0.0,0.203142,0.0,-0.0,0.0,0.0,0.491907,0.003094,1.213933,372.387665,-0.001011,0.193203,5.175891,247825.3,0.001121,11104.769538,2.3811,2.574303,0.000651,-0.843361,-0.724496,-0.95139,-0.637151,0.0,-0.0,-0.000812,-0.0,0.0,-0.0,-0.000549,-0.0,-0.0,-0.0,0.0


In [7]:
df['open'] = df['open'] - df['open'][::-1].rolling(30).mean()[::-1]
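
The double reversal in the cell above is easier to follow on a small example. The sketch below assumes, as the `head()` output suggests, that rows are stored newest-first; the series values and the window of 3 are hypothetical (the notebook uses 30).

```python
import pandas as pd

# Toy prices stored newest-first: the highest index is the latest bar.
prices = pd.Series([15.0, 14.0, 13.0, 12.0, 11.0, 10.0], index=[5, 4, 3, 2, 1, 0])

window = 3  # hypothetical; the notebook uses 30

# Reverse to oldest-first so the rolling mean only looks at past bars,
# then reverse back to the original newest-first order.
trailing_mean = prices[::-1].rolling(window).mean()[::-1]

# Subtraction aligns on the index, so each price loses its own trailing mean.
detrended = prices - trailing_mean
print(detrended)
```

The oldest `window - 1` rows have no full window and come out as `NaN`, which is why a `dropna` is needed after this step.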

In [5]:

drop_cols = ['datetime',
            'date',
            'min_num',
            'SYMBOL',
            'prev_close',
            'diff_1',
            'time',
            'decision',
            'open_maxcov',
            'high_maxcov',
            'low_maxcov',
            'close_maxcov',
            'pct_change',
            'D2',]


df.drop(drop_cols, axis=1, inplace=True)
df.shape

(8846, 116)

In [6]:
X = np.zeros([30,4000,116,60])
y = np.zeros([30,4000])

drop_cols = ['datetime',
            'date',
            'min_num',
            'SYMBOL',
            'prev_close',
            'diff_1',
            'time',
            'decision',
            'open_maxcov',
            'high_maxcov',
            'low_maxcov',
            'close_maxcov',
            'pct_change',
            'D2',
            'target']

for k, stock in enumerate(stocks):
    df = load_set(stock, data_dir, suffix)
    
    for col in ['open','high','low','close']:
        df[col] = df[col] - df[col][::-1].rolling(30).mean()[::-1]
        
    df = df.dropna(axis=0)
    
    e = df.shape[0]
    l = e - 4000 - 59

    df['target'] = df['%close'].shift(1)
    y_ = df.iloc[l:e-59]['target']
    y[k] = y_.to_numpy() 

    df.drop(drop_cols, axis=1, inplace=True)
    
    
    
    # TODO:
    #   - standard-scale the open, high, low, and close columns
    #   - consider min-max scaling trig columns whose values are very close to zero
    #     (either upon inspection or universally)
    #   - many of the interaction terms are very small because the %min change is very small;
    #     either scale the interaction terms after creation,
    #     or scale the %min change columns before creating the interaction terms
    #
    # NOTE: Originally I decided on standardizing the features. However, later in the process
    #       the features **MUST** be normalized between 0 and 1 for image recognition.
    #       Because of this contradiction, I have commented out the standardization code and
    #       instead implemented normalization on 0 - 1 from the beginning.
    
#     ss = StandardScaler()

#     ss.fit(df.iloc[l:e-59-2000])
#     scaled_features = ss.transform(df)
#     df = pd.DataFrame(scaled_features, index=df.index, columns=df.columns)

    mx = MinMaxScaler()
    mx = mx.fit(df)
    scaled_features = mx.transform(df)
    df = pd.DataFrame(scaled_features, index=df.index, columns=df.columns)

    dt = df.transpose()

    for j, i in enumerate(range(l, e-59)):
        X[k][j] = dt.iloc[:, i:i+60].to_numpy()
        
np.save('./data/prepared/august25screenfixed/numpy_matrices/X_f.npy', X)
np.save('./data/prepared/august25screenfixed/numpy_matrices/y_f.npy', y)
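
In the loop above, `MinMaxScaler` is fit on every row of each frame, including rows that later end up in the test set. One possible alternative, sketched here under assumptions, is to learn the minimum and maximum from the training rows only and apply them to all rows. The sketch uses plain pandas arithmetic, which matches what `MinMaxScaler` computes with its default (0, 1) range; `toy`, `n_train`, and the row ordering (training rows first) are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])

n_train = 50  # hypothetical: rows [0, n_train) form the training period

# Learn the scaling parameters from the training rows only.
lo = toy.iloc[:n_train].min()
hi = toy.iloc[:n_train].max()

# Apply the same transform to every row.
scaled = (toy - lo) / (hi - lo)

# Held-out rows can fall slightly outside [0, 1]; clip if the model requires it.
scaled_clipped = scaled.clip(0.0, 1.0)
```

Whether clipping is acceptable depends on the model; it discards information about test values that exceed the training range.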

In [2]:
X = np.load('./data/prepared/august25screenfixed/numpy_matrices/X.npy')
y = np.load('./data/prepared/august25screenfixed/numpy_matrices/y.npy')

In [3]:
X.shape, y.shape

((30, 4000, 116, 60), (30, 4000))

Unfortunately, the X file is about 7 GB, which is prohibitively large. That size is expected, though: the array holds 30 × 4,000 × 116 × 60 values, and `np.zeros` defaults to 8-byte `float64`, which comes to roughly 6.7 GB.
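
The size can be estimated directly from the array shape. The sketch below also shows the footprint under `float32`, which is an assumption about what precision the model would tolerate rather than something tested in this notebook.

```python
import numpy as np

shape = (30, 4000, 116, 60)
n_values = int(np.prod(shape, dtype=np.int64))

# Bytes per array for two candidate dtypes, reported in GB (1e9 bytes).
sizes = {}
for dtype in (np.float64, np.float32):
    sizes[np.dtype(dtype).name] = n_values * np.dtype(dtype).itemsize / 1e9

for name, gb in sizes.items():
    print(f"{name}: {gb:.2f} GB")
```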

In [6]:
zeros = np.where(X == 0)
zeros

(array([], dtype=int64),
 array([], dtype=int64),
 array([], dtype=int64),
 array([], dtype=int64))

In [7]:
X[0][0]

array([[-0.14538124,  0.15174104,  0.20069869, ...,  0.64655092,
         2.5136943 ,  2.81844721],
       [ 0.17827578, -0.0225963 ,  0.01856601, ...,  0.26926095,
         2.04253328,  2.78431104],
       [-0.18166262,  0.16911168,  0.36663506, ..., -0.70908412,
         1.01754274,  2.79954426],
       ...,
       [ 0.01076609,  0.01083497,  0.01071884, ...,  0.03321744,
        -0.07488926,  0.01197654],
       [ 0.01780772,  0.01655196,  0.01597589, ...,  0.21659399,
         0.06561708,  0.01615344],
       [-0.01091948, -0.02560259, -0.0490432 , ...,  3.53065162,
         0.44677296, -0.03398237]])

#### Reshape data into train X (60,000, 116, 60, 1), train y (60,000, 1), test X (60,000, 116, 60, 1), test y (60,000, 1)

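Slice the X and y arrays along the time axis so that, for each stock, the first 2,000 windows go to train and the remaining 2,000 go to test (`:2000`, `2000:`). Then flatten the stock and time axes into a single sample axis and save the results as X_train, X_test, y_train, and y_test.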
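
The split described above can be sketched on a toy array. This version assumes memory mapping (`mmap_mode="r"` in `np.load`) is acceptable, so only the slice being copied is read from disk. The file name `X_toy.npy` and all sizes are hypothetical stand-ins for the real 30 × 4000 × 116 × 60 array.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for X.npy: 3 stocks, 8 time windows, 4x5 "images"
n_stocks, n_windows, h, w = 3, 8, 4, 5
path = os.path.join(tempfile.mkdtemp(), "X_toy.npy")
toy = np.arange(n_stocks * n_windows * h * w, dtype=np.float64)
np.save(path, toy.reshape(n_stocks, n_windows, h, w))

# Memory-map the file so slices are read lazily instead of loading everything
X = np.load(path, mmap_mode="r")
half = n_windows // 2

# Copy each half into memory, then merge the stock and time axes
# and add a trailing channel axis of length 1
X_train = np.ascontiguousarray(X[:, :half]).reshape(-1, h, w, 1)
X_test = np.ascontiguousarray(X[:, half:]).reshape(-1, h, w, 1)

print(X_train.shape, X_test.shape)
```

After the reshape, samples are ordered stock by stock, so the first `half` rows of `X_train` all belong to the first stock.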

In [1]:
import numpy as np

In [2]:
X = np.load('./data/prepared/august25screenfixed/numpy_matrices/X.npy')

In [3]:
X_train = X[:, :2000, :, :]

In [4]:
del(X)

In [None]:
# Assigning to .shape fails on this non-contiguous slice, so use reshape (which copies)
X_train = X_train.reshape(60000, 116, 60, 1)

In [None]:
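X = np.load('./data/prepared/august25screenfixed/numpy_matrices/X.npy')  # reload: X was deleted in an earlier cell
X_train = X[:, :2000, :, :].reshape([60000, 116, 60, 1])  # reshape copies the non-contiguous slice
del X  # drop the full array before saving
np.save('./data/prepared/august25screenfixed/numpy_matrices/X_train.npy', X_train)  # save the reshaped train set, not X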

In [None]:
X_train = X[:, :2000, :, :].reshape([60000, 116, 60, 1])
np.save('./data/prepared/august25screenfixed/numpy_matrices/X_train.npy', X_train)
del(X_train)

In [None]:
X_test = X[:, 2000:, :, :].reshape([60000, 116, 60, 1])
np.save('./data/prepared/august25screenfixed/numpy_matrices/X_test.npy', X_test)
del(X_test)

In [12]:
y_train = y[:, :2000].reshape([60000, 1])
np.save('./data/prepared/august25screenfixed/numpy_matrices/y_train.npy', y_train)

In [13]:
y_test = y[:, 2000:].reshape([60000, 1])
np.save('./data/prepared/august25screenfixed/numpy_matrices/y_test.npy', y_test)

In [11]:
np.unique(y_train, return_counts=True), np.unique(y_test,return_counts=True)

((array([0., 1., 2.]), array([34245, 16007,  9748])),
 (array([0., 1., 2.]), array([34719, 14301, 10980])))
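
The counts above show that the hold class (0) makes up about 57% of the training labels. One common way to balance the classes is random undersampling down to the size of the smallest class. The sketch below runs on toy labels; the function name `undersample_indices` is hypothetical and is not taken from this project's preprocessing functions.

```python
import numpy as np

def undersample_indices(y, seed=0):
    """Return sorted indices that keep an equal number of samples per class."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y).ravel()
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()  # size of the smallest class
    keep = [
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(keep))

# Toy labels shaped like y_train: (n_samples, 1), classes 0, 1, 2
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2], dtype=float).reshape(-1, 1)
idx = undersample_indices(y_toy)
balanced_counts = np.unique(y_toy[idx], return_counts=True)[1]
print(balanced_counts)
```

The same index array can be applied to the matching X array (for example `X_train[idx]`) so images and labels stay aligned.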

In [14]:
y_train[:10]

array([[0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [2.]])

In [15]:
y_test[:10]

array([[1.],
       [1.],
       [1.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.]])

In [4]:
from extract import load_set
import pandas as pd
import numpy as np

data_dir = './data/prepared/august25screenfixed/'
stock = 'RRR'
suffix = ''
df = load_set(stock, data_dir, suffix)    
    
df = df.dropna(axis=0)

e = df.shape[0]
l = e - 4000 - 59

y_ = df.iloc[l:e-59]['decision']

dt = df.transpose()

for j, i in enumerate(range(l, e-59)):
    display(dt.iloc[:, i:i+60])
    break

Unnamed: 0,4084,4083,4082,4081,4080,4079,4078,4077,4076,4075,...,4034,4033,4032,4031,4030,4029,4028,4027,4026,4025
open,14.1,14.1,14.12,14.13,14.16,14.1666,14.17,14.16,14.1703,14.09,...,14.02,14.0513,14.22,14.33,14.5,14.51,14.505,14.575,14.585,14.63
high,14.12,14.13,14.12,14.15,14.2,14.18,14.18,14.17,14.2,14.14,...,14.15,14.1,14.2243,14.335,14.5,14.555,14.52,14.575,14.66,14.67
low,14.09,14.09,14.095,14.11,14.12,14.1449,14.15,14.1499,14.165,14.09,...,14.02,14.05,14.029,14.17,14.33,14.49,14.47,14.52,14.56,14.56
close,14.12,14.09,14.11,14.12,14.12,14.18,14.17,14.17,14.17,14.135,...,14.14,14.07,14.029,14.19,14.33,14.5,14.505,14.52,14.575,14.6
volume,2400,3502,7204,5459,12245,7635,3874,3790,3047,1373,...,11400,4062,3950,19368,7300,3608,11813,3425,22165,18918
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
close_xsinxcosx2,1.29604e-06,-1.04494e-06,-1.28058e-07,-0,-4.89959e-07,-3.04935e-08,-0,-0,2.80321e-06,-2.96302e-07,...,6.2469e-06,-9.47354e-06,1.15471e-05,7.12522e-06,-2.61118e-05,5.94785e-07,-1.11564e-06,-1.32737e-05,1.00653e-06,2.70799e-06
close_xtanx,0.00400648,-0.00032803,0.000174534,-0,-0.0172016,0.000142469,-0,0,0.00309142,-0.000664253,...,-0.00247104,0.00467061,0.131381,0.118296,0.0289888,-0.000270486,0.0318026,0.00480437,0.00123597,-0.000184497
close_xtanx2,0.00557486,-0.000517237,8.3285e-05,-0,-0.0357574,0.000232094,0,0,0.00396838,0.000755574,...,-0.00219489,0.00495615,0.12983,0.116848,0.0153533,-0.000469115,0.00223592,0.00208403,0.000396816,-0.00244977
decision,0,0,0,0,0,2,2,2,2,0,...,2,0,0,2,2,2,2,2,2,2
