# 1.4 Download SP500 Minute Data

In this notebook we will consider an alternative question. Instead of asking how we can maximize profit on a single asset, we will ask whether a machine can pick the best assets given many selections. 

We will consider only price history data. We will provide 90 differenced timesteps on the minute interval period. We will choose the 500 stocks of the S&P 500. 

If our network can outperform the S&P 500 over the given time, we will consider it successful. 

In order to do this, we will need to perform the following steps:

- download datasets for all the stocks in the S&P 500. 
- format the data to represent the simulataneous movement of 500 stocks
- Build an environment to represent this movement
- Train a DQN to learn on it

#### Download datasets for all the stocks in the S&P 500.

In [12]:
!pip install lxml

Collecting lxml
  Downloading lxml-4.5.2-cp37-cp37m-manylinux1_x86_64.whl (5.5 MB)
[K     |████████████████████████████████| 5.5 MB 8.7 MB/s eta 0:00:01
[?25hInstalling collected packages: lxml
Successfully installed lxml-4.5.2


In [1]:
import time
from IPython import display
import pandas as pd 
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

In [40]:
import os
import sys
from os import listdir
from os.path import isfile, join

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [3]:
module_path

'/home/jovyan/work'

In [4]:
from modules.extract import extract_stock, extract_multi_periods, load_set
from modules.transform import format_date
from modules.sine_modules import *

In [12]:
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
data_dir = '../data/sp500/'
suffix = ''

In [1]:
table=pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
df = table[0]
df['Symbol'] = df['Symbol'].str.replace('.','')
df.to_csv('../data/sp500/S&P500-Info.csv')
df.to_csv('../data/sp500/S&P500-Symbols.csv', columns=['Symbol'])
print(df.shape)
df.head()

(505, 9)


Unnamed: 0,Symbol,Security,SEC filings,GICS Sector,GICS Sub Industry,Headquarters Location,Date first added,CIK,Founded
0,MMM,3M Company,reports,Industrials,Industrial Conglomerates,"St. Paul, Minnesota",1976-08-09,66740,1902
1,ABT,Abbott Laboratories,reports,Health Care,Health Care Equipment,"North Chicago, Illinois",1964-03-31,1800,1888
2,ABBV,AbbVie Inc.,reports,Health Care,Pharmaceuticals,"North Chicago, Illinois",2012-12-31,1551152,2013 (1888)
3,ABMD,ABIOMED Inc,reports,Health Care,Health Care Equipment,"Danvers, Massachusetts",2018-05-31,815094,1981
4,ACN,Accenture plc,reports,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,1467373,1989


In [77]:
df[df['Symbol'] =='BRKB']

Unnamed: 0,Symbol,Security,SEC filings,GICS Sector,GICS Sub Industry,Headquarters Location,Date first added,CIK,Founded
66,BRKB,Berkshire Hathaway,reports,Financials,Multi-Sector Holdings,"Omaha, Nebraska",2010-02-16,1067983,1839


In [7]:
stocksdf = df

In [26]:
data_dir = '../data/sp500/'
stocks = list(stocksdf['Symbol'])
spdf = pd.DataFrame()

In [16]:
stock = stocks[0]
df = extract_multi_periods(stock, 
                       data_dir=data_dir)

drop = ['date','hour','minute','min_num','SYMBOL','prev_close','diff_1','pct_change','log_return']
df = df.drop(drop, axis=1)
df = df.set_index('datetime')
col_names = {col:f'{stock}_{col}' for col in df.columns}
df = df.rename(col_names, axis=1)
df.head()

Unnamed: 0_level_0,MMM_open,MMM_high,MMM_low,MMM_close,MMM_volume
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2020-09-15 19:47:00,166.54,166.58,166.495,166.495,4955
2020-09-15 19:46:00,166.59,166.59,166.53,166.56,2222
2020-09-15 19:45:00,166.52,166.6,166.465,166.58,5114
2020-09-15 19:44:00,166.55,166.565,166.51,166.54,2747
2020-09-15 19:43:00,166.63,166.63,166.54,166.58,2499


In [17]:
len(stocks)

505

In [27]:
com_stocks = []

In [28]:
start_time = time.time()

for i, stock in enumerate(stocks):
    if stock in [f.split('.')[0] for f in listdir(data_dir) if isfile(join(data_dir, f))]:
        continue
    df = extract_multi_periods(stock, 
                       data_dir=data_dir)
    
    drop = ['date','hour','minute','min_num','SYMBOL','prev_close','diff_1','pct_change','log_return']
    df = df.drop(drop, axis=1)
    df = df.set_index('datetime')
    col_names = {col:f'{stock}_{col}' for col in df.columns}
    df = df.rename(col_names, axis=1)

    spdf = spdf.join(df, how='outer')
    
    display.clear_output()
    print(f'{i}')
    
    time.sleep(.5)
    
spdf.head()

504


Unnamed: 0_level_0,MMM_open,MMM_high,MMM_low,MMM_close,MMM_volume,ABT_open,ABT_high,ABT_low,ABT_close,ABT_volume,ABBV_open,ABBV_high,ABBV_low,ABBV_close,ABBV_volume,ABMD_open,ABMD_high,ABMD_low,ABMD_close,ABMD_volume,ACN_open,ACN_high,ACN_low,ACN_close,ACN_volume,ATVI_open,ATVI_high,ATVI_low,ATVI_close,ATVI_volume,ADBE_open,ADBE_high,ADBE_low,ADBE_close,ADBE_volume,AMD_open,AMD_high,AMD_low,AMD_close,AMD_volume,AAP_open,AAP_high,AAP_low,AAP_close,AAP_volume,AES_open,AES_high,AES_low,AES_close,AES_volume,...,WYNN_open,WYNN_high,WYNN_low,WYNN_close,WYNN_volume,XEL_open,XEL_high,XEL_low,XEL_close,XEL_volume,XRX_open,XRX_high,XRX_low,XRX_close,XRX_volume,XLNX_open,XLNX_high,XLNX_low,XLNX_close,XLNX_volume,XYL_open,XYL_high,XYL_low,XYL_close,XYL_volume,YUM_open,YUM_high,YUM_low,YUM_close,YUM_volume,ZBRA_open,ZBRA_high,ZBRA_low,ZBRA_close,ZBRA_volume,ZBH_open,ZBH_high,ZBH_low,ZBH_close,ZBH_volume,ZION_open,ZION_high,ZION_low,ZION_close,ZION_volume,ZTS_open,ZTS_high,ZTS_low,ZTS_close,ZTS_volume
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1,Unnamed: 86_level_1,Unnamed: 87_level_1,Unnamed: 88_level_1,Unnamed: 89_level_1,Unnamed: 90_level_1,Unnamed: 91_level_1,Unnamed: 92_level_1,Unnamed: 93_level_1,Unnamed: 94_level_1,Unnamed: 95_level_1,Unnamed: 96_level_1,Unnamed: 97_level_1,Unnamed: 98_level_1,Unnamed: 99_level_1,Unnamed: 100_level_1,Unnamed: 101_level_1
2020-08-17 08:00:00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,81.5,81.5,81.5,81.5,609.0,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2020-08-17 08:01:00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2020-08-17 08:02:00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2020-08-17 08:03:00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2020-08-17 08:04:00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [29]:
spdf.shape

(19729, 2525)

In [32]:
spdf[::-1]

Unnamed: 0_level_0,MMM_open,MMM_high,MMM_low,MMM_close,MMM_volume,ABT_open,ABT_high,ABT_low,ABT_close,ABT_volume,ABBV_open,ABBV_high,ABBV_low,ABBV_close,ABBV_volume,ABMD_open,ABMD_high,ABMD_low,ABMD_close,ABMD_volume,ACN_open,ACN_high,ACN_low,ACN_close,ACN_volume,ATVI_open,ATVI_high,ATVI_low,ATVI_close,ATVI_volume,ADBE_open,ADBE_high,ADBE_low,ADBE_close,ADBE_volume,AMD_open,AMD_high,AMD_low,AMD_close,AMD_volume,AAP_open,AAP_high,AAP_low,AAP_close,AAP_volume,AES_open,AES_high,AES_low,AES_close,AES_volume,...,WYNN_open,WYNN_high,WYNN_low,WYNN_close,WYNN_volume,XEL_open,XEL_high,XEL_low,XEL_close,XEL_volume,XRX_open,XRX_high,XRX_low,XRX_close,XRX_volume,XLNX_open,XLNX_high,XLNX_low,XLNX_close,XLNX_volume,XYL_open,XYL_high,XYL_low,XYL_close,XYL_volume,YUM_open,YUM_high,YUM_low,YUM_close,YUM_volume,ZBRA_open,ZBRA_high,ZBRA_low,ZBRA_close,ZBRA_volume,ZBH_open,ZBH_high,ZBH_low,ZBH_close,ZBH_volume,ZION_open,ZION_high,ZION_low,ZION_close,ZION_volume,ZTS_open,ZTS_high,ZTS_low,ZTS_close,ZTS_volume
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1,Unnamed: 86_level_1,Unnamed: 87_level_1,Unnamed: 88_level_1,Unnamed: 89_level_1,Unnamed: 90_level_1,Unnamed: 91_level_1,Unnamed: 92_level_1,Unnamed: 93_level_1,Unnamed: 94_level_1,Unnamed: 95_level_1,Unnamed: 96_level_1,Unnamed: 97_level_1,Unnamed: 98_level_1,Unnamed: 99_level_1,Unnamed: 100_level_1,Unnamed: 101_level_1
2020-09-15 20:31:00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2020-09-15 20:30:00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2020-09-15 20:29:00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2020-09-15 20:28:00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2020-09-15 20:27:00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,162.28,162.28,162.28,162.28,3951.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-08-17 08:04:00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2020-08-17 08:03:00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2020-08-17 08:02:00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2020-08-17 08:01:00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [39]:
spdf.to_csv(f'{data_dir}sp500_minute_data.csv')

In [None]:
spdf = pd.read_csv(f'{data_dir}sp500_minute_data.csv')

In [None]:
spdf[spdf.columns[spdf.isna().sum() > 1000]]

In [None]:
df = load_set('ABMD', data_dir, '.pickle')

df['datetime'].dtype

In [None]:
df['datetime'] = pd.to_datetime(df['datetime'])

In [None]:
df['datetime'].dtype

In [None]:
df.set_index('datetime', inplace=True)

In [None]:
df.index

In [None]:
spdf = pd.DataFrame()

for stock in stocks:
    df = load_set(stock, data_dir, '.pickle')
    
    df.set_index('datetime', inplace=True)
    df[f'{stock}'] = df['close'].diff(-1) / df['close']
    
    spdf[f'{stock}'] = df[f'{stock}']
    
spdf

In [None]:
pd.set_option('display.max_rows', 1000)

In [None]:
spdf[spdf.index.hour == 13]

In [33]:
h13 = spdf.index.hour > 13
h20 = spdf.index.hour < 20
m30 = spdf.index.minute >= 30
h12 = spdf.index.hour > 12

spdfm = spdf[h20 & (h13 | (h12 & m30))]

In [34]:
spdfm.head(1000)

Unnamed: 0_level_0,MMM_open,MMM_high,MMM_low,MMM_close,MMM_volume,ABT_open,ABT_high,ABT_low,ABT_close,ABT_volume,ABBV_open,ABBV_high,ABBV_low,ABBV_close,ABBV_volume,ABMD_open,ABMD_high,ABMD_low,ABMD_close,ABMD_volume,ACN_open,ACN_high,ACN_low,ACN_close,ACN_volume,ATVI_open,ATVI_high,ATVI_low,ATVI_close,ATVI_volume,ADBE_open,ADBE_high,ADBE_low,ADBE_close,ADBE_volume,AMD_open,AMD_high,AMD_low,AMD_close,AMD_volume,AAP_open,AAP_high,AAP_low,AAP_close,AAP_volume,AES_open,AES_high,AES_low,AES_close,AES_volume,...,WYNN_open,WYNN_high,WYNN_low,WYNN_close,WYNN_volume,XEL_open,XEL_high,XEL_low,XEL_close,XEL_volume,XRX_open,XRX_high,XRX_low,XRX_close,XRX_volume,XLNX_open,XLNX_high,XLNX_low,XLNX_close,XLNX_volume,XYL_open,XYL_high,XYL_low,XYL_close,XYL_volume,YUM_open,YUM_high,YUM_low,YUM_close,YUM_volume,ZBRA_open,ZBRA_high,ZBRA_low,ZBRA_close,ZBRA_volume,ZBH_open,ZBH_high,ZBH_low,ZBH_close,ZBH_volume,ZION_open,ZION_high,ZION_low,ZION_close,ZION_volume,ZTS_open,ZTS_high,ZTS_low,ZTS_close,ZTS_volume
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1,Unnamed: 86_level_1,Unnamed: 87_level_1,Unnamed: 88_level_1,Unnamed: 89_level_1,Unnamed: 90_level_1,Unnamed: 91_level_1,Unnamed: 92_level_1,Unnamed: 93_level_1,Unnamed: 94_level_1,Unnamed: 95_level_1,Unnamed: 96_level_1,Unnamed: 97_level_1,Unnamed: 98_level_1,Unnamed: 99_level_1,Unnamed: 100_level_1,Unnamed: 101_level_1
2020-08-17 13:30:00,165.0200,165.44,164.0300,164.0800,116302.0,,,,,,95.2900,95.6200,95.29,95.3000,135895.0,307.76,308.060,307.76,308.060,3035.0,231.450,231.620,230.98,230.980,27718.0,80.97,80.9700,80.5500,80.6550,90705.0,450.6200,451.500,449.2000,449.950,32940.0,82.0900,82.5800,81.9800,82.5600,607197.0,160.08,160.255,159.915,160.1075,14078.0,17.6000,17.700,17.550,17.550,66782.0,...,85.370,85.3900,85.100,85.270,45382.0,70.9900,71.02,70.570,70.570,25672.0,18.1100,18.1700,17.970,18.1350,79889.0,106.420,106.5300,105.963,105.990,44517.0,79.590,79.590,79.5900,79.590,6445.0,92.850,92.9700,92.5500,92.5500,18846.0,288.4900,288.4900,288.49,288.490,2197.0,137.740,137.970,137.50,137.8283,15744.0,34.240,34.3800,34.210,34.3800,10764.0,157.920,158.120,157.88,158.120,39460.0
2020-08-17 13:31:00,164.1500,165.01,163.9400,164.9450,21229.0,100.30,100.5000,100.3000,100.3900,700.0,95.3100,95.4000,95.27,95.4000,24765.0,,,,,,231.210,231.440,231.11,231.440,1300.0,80.67,80.6700,80.2300,80.3300,25967.0,449.9600,451.080,449.9600,451.080,900.0,82.5500,82.6000,82.2800,82.2866,218685.0,160.31,160.320,159.800,159.8000,3121.0,17.5300,17.640,17.530,17.590,14450.0,...,85.190,85.2899,85.175,85.180,2229.0,70.5400,70.69,70.430,70.690,5399.0,18.1100,18.1450,18.060,18.0699,9083.0,106.091,106.0910,106.091,106.091,100.0,79.390,79.500,79.3900,79.500,250.0,92.620,92.7750,92.6200,92.6300,500.0,289.0941,289.0941,287.92,287.920,503.0,138.090,138.090,137.99,137.9900,200.0,34.175,34.1750,34.175,34.1750,100.0,158.150,158.250,157.96,158.120,10481.0
2020-08-17 13:32:00,164.9700,164.97,163.6500,164.5800,64456.0,100.38,100.5100,100.2100,100.3441,91714.0,95.4100,95.5100,95.29,95.4500,23669.0,308.19,308.190,308.19,308.190,100.0,231.415,231.710,231.13,231.710,1100.0,80.24,80.6300,80.1686,80.6300,16832.0,451.0400,451.900,450.5629,451.525,6829.0,82.3000,82.4000,82.0650,82.3700,202872.0,159.84,159.980,159.700,159.8200,3496.0,17.5900,17.600,17.550,17.560,11727.0,...,85.200,85.3200,85.080,85.160,7115.0,70.6300,70.67,70.570,70.600,1274.0,18.0503,18.1000,18.000,18.0000,6759.0,105.880,105.9200,105.680,105.870,3788.0,79.470,79.620,79.3901,79.490,2964.0,92.780,92.7900,92.7450,92.7900,650.0,,,,,,138.090,138.200,138.03,138.2000,400.0,34.210,34.2400,34.090,34.0900,1500.0,158.210,158.390,157.78,158.220,5503.0
2020-08-17 13:33:00,164.5353,165.45,164.5353,165.4097,12810.0,100.37,100.5500,100.2800,100.5100,4866.0,95.4550,95.4900,95.34,95.3900,16112.0,305.68,305.680,305.68,305.680,100.0,231.540,231.760,231.48,231.480,1938.0,80.68,80.6800,80.4400,80.5000,10159.0,451.5550,451.785,450.8700,451.085,2993.0,82.3600,82.4200,81.8700,81.9300,226245.0,160.10,160.110,159.810,160.0900,2300.0,17.5400,17.545,17.485,17.500,17871.0,...,85.160,85.3500,85.090,85.300,12148.0,70.6000,70.60,70.410,70.490,2312.0,17.9800,17.9974,17.910,17.9300,16201.0,105.870,105.8758,105.655,105.655,3270.0,79.580,79.830,79.5800,79.830,2425.0,92.900,93.1285,92.9000,93.1285,6112.0,,,,,,138.065,138.100,137.92,138.0002,10545.0,34.170,34.2276,34.150,34.2276,1600.0,158.210,158.210,158.04,158.165,980.0
2020-08-17 13:34:00,165.3100,166.25,165.2800,166.1460,34689.0,100.51,100.5799,100.3500,100.3900,9128.0,95.4000,95.4700,95.30,95.3100,24926.0,,,,,,231.640,231.810,231.64,231.730,1200.0,80.47,80.6201,80.4301,80.5500,31644.0,451.2900,451.420,450.7150,450.980,4123.0,81.9300,81.9300,81.5600,81.7000,201646.0,160.18,160.430,160.180,160.3620,4821.0,17.5000,17.565,17.500,17.565,8103.0,...,85.225,85.2750,85.000,85.040,9099.0,70.5700,70.58,70.520,70.580,700.0,17.9300,18.0000,17.900,17.9706,13800.0,105.610,105.6100,105.450,105.490,2154.0,79.830,79.840,79.7700,79.770,600.0,93.030,93.0500,93.0100,93.0100,700.0,,,,,,138.000,138.140,138.00,138.1400,8120.0,34.200,34.2700,34.200,34.2500,1500.0,158.175,158.540,158.03,158.080,6438.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-08-19 17:05:00,164.8740,164.91,164.8740,164.8900,1265.0,100.91,100.9300,100.9094,100.9300,1998.0,96.8966,96.9200,96.88,96.8999,9544.0,315.05,315.425,315.05,315.425,200.0,235.920,235.920,235.90,235.900,200.0,82.29,82.2950,82.2600,82.2851,2979.0,465.9900,466.116,465.8700,465.875,2110.0,80.8067,80.8091,80.7650,80.7854,29157.0,160.40,160.400,160.310,160.3200,8000.0,17.7569,17.760,17.755,17.760,1824.0,...,81.820,81.8200,81.810,81.810,504.0,70.0605,70.07,70.055,70.055,835.0,18.4800,18.4850,18.480,18.4850,2145.0,103.050,103.0600,103.000,103.000,3448.0,79.730,79.750,79.7300,79.750,700.0,94.220,94.2800,94.2200,94.2500,2216.0,,,,,,,,,,,33.100,33.1000,33.100,33.1000,100.0,159.510,159.510,159.44,159.440,1103.0
2020-08-19 17:06:00,164.8500,164.85,164.7795,164.7795,550.0,100.93,100.9300,100.9100,100.9100,1837.0,96.8900,96.8980,96.81,96.8300,13885.0,,,,,,235.940,235.940,235.94,235.940,200.0,82.29,82.3099,82.2800,82.2800,1735.0,465.8600,466.025,465.8600,465.870,900.0,80.7900,80.8300,80.7800,80.7850,28263.0,160.39,160.390,160.390,160.3900,100.0,17.7500,17.755,17.750,17.755,500.0,...,81.790,81.7900,81.730,81.790,2000.0,70.0500,70.05,70.030,70.035,627.0,,,,,,102.980,102.9800,102.920,102.960,10111.0,,,,,,94.250,94.2500,94.2500,94.2500,400.0,281.7100,281.8450,281.71,281.845,200.0,138.550,138.550,138.55,138.5500,200.0,33.075,33.0750,33.065,33.0650,879.0,159.410,159.450,159.41,159.450,700.0
2020-08-19 17:07:00,164.8050,164.82,164.8050,164.8200,700.0,100.91,100.9400,100.9100,100.9200,550.0,96.8400,96.8699,96.83,96.8590,3210.0,315.33,315.420,315.15,315.150,500.0,235.930,235.930,235.87,235.880,1200.0,82.27,82.2700,82.2200,82.2200,4197.0,465.8700,465.870,465.5100,465.510,775.0,80.7800,80.7850,80.7400,80.7510,27504.0,160.27,160.400,160.270,160.4000,1300.0,17.7600,17.765,17.755,17.760,3325.0,...,81.790,81.8100,81.790,81.790,300.0,70.0350,70.06,70.030,70.050,4965.0,18.4710,18.4710,18.471,18.4710,175.0,102.960,102.9700,102.960,102.970,300.0,79.730,79.730,79.7300,79.730,346.0,94.270,94.3000,94.2631,94.3000,1700.0,281.6800,281.6800,281.68,281.680,100.0,138.575,138.575,138.55,138.5500,200.0,,,,,,159.450,159.470,159.45,159.470,201.0
2020-08-19 17:08:00,164.8055,164.82,164.8000,164.8000,2628.0,100.92,100.9200,100.8900,100.8900,825.0,96.8500,96.8550,96.81,96.8100,5252.0,,,,,,235.850,235.865,235.85,235.865,306.0,82.23,82.2300,82.1900,82.2000,6412.0,465.4100,465.660,465.4100,465.630,2580.0,80.7500,80.7700,80.7375,80.7470,20064.0,160.38,160.380,160.380,160.3800,100.0,17.7700,17.780,17.770,17.770,1995.0,...,81.778,81.8850,81.778,81.885,2950.0,70.0500,70.05,70.050,70.050,687.0,18.4750,18.4750,18.465,18.4700,6428.0,102.970,102.9700,102.900,102.960,1269.0,79.725,79.725,79.7250,79.725,200.0,94.280,94.3100,94.2700,94.3100,969.0,281.5100,281.5100,281.51,281.510,100.0,138.610,138.610,138.61,138.6100,200.0,33.065,33.0800,33.065,33.0800,644.0,159.480,159.480,159.47,159.470,675.0


In [38]:
spdfm.isna().sum().describe()

count    2525.000000
mean      444.253465
std       811.396311
min         0.000000
25%         8.000000
50%        88.000000
75%       533.000000
max      5728.000000
dtype: float64

In [None]:
plt.figure(figsize=(30,30))
plt.plot(spdfm['TFX'])

In [None]:
spdfm['TFX'][-1000:]

In [None]:
spdfm.fillna(0.0, inplace=True)

In [None]:
spdfm.isna().sum()

In [None]:
plt.figure(figsize=(100,100))
for stock in spdfm.columns:
    plt.plot(spdfm[stock])

In [None]:
spdfm.to_pickle(f'{data_dir}spdfm.pickle')

In [None]:
f'{data_dir}spdfm.pickle'

In [None]:
spdfm = pd.read_pickle(f'{data_dir}spdfm.pickle')

In [None]:
X = spdfm.to_numpy()

In [None]:
X.shape

In [None]:
X

#### Continue downloading from loaded data

Double the amount of data by opening our local file, downloading more information, and overwriting the original. 

In [3]:
import os
import sys
from os import listdir
from os.path import isfile, join

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
import pandas as pd 
import numpy as np
from modules.extract import extract_stock, unix_time_millis
import time

data_dir = '../data/sp500/'
stock_files = [f.split('.')[0] for f in listdir(data_dir) if isfile(join(data_dir, f))]
sp_stocks = pd.read_csv('../data/sp500/S&P500-Info.csv')['Symbol']

In [4]:
stock_df = pd.read_pickle('../data/sp500/MMM.pickle')
end_stamp = stock_df['datetime'].iloc[-1]
start_stamp = stock_df['datetime'].iloc[0]

In [5]:
stock_df

Unnamed: 0,open,high,low,close,volume,datetime,date,hour,minute,min_num,SYMBOL,prev_close,diff_1,pct_change,log_return
0,166.50,166.50,166.50,166.50,8,2020-09-15 20:06:00,2020-09-15,20,6,1206,MMM,166.57,-0.07,-0.000420,-0.000420
1,166.57,166.57,166.57,166.57,501396,2020-09-15 20:03:00,2020-09-15,20,3,1203,MMM,168.81,-2.24,-0.013269,-0.013358
2,168.81,168.81,168.81,168.81,12,2020-09-15 20:02:00,2020-09-15,20,2,1202,MMM,168.20,0.61,0.003627,0.003620
3,168.40,168.40,168.20,168.20,2,2020-09-15 20:01:00,2020-09-15,20,1,1201,MMM,166.50,1.70,0.010210,0.010158
4,166.50,166.57,166.50,166.50,300,2020-09-15 20:00:00,2020-09-15,20,0,1200,MMM,166.57,-0.07,-0.000420,-0.000420
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8581,166.87,166.87,166.87,166.87,200,2020-08-17 11:51:00,2020-08-17,11,51,711,MMM,166.50,0.37,0.002222,0.002220
8582,166.50,166.50,166.50,166.50,302,2020-08-17 11:19:00,2020-08-17,11,19,679,MMM,166.41,0.09,0.000541,0.000541
8583,166.41,166.41,166.41,166.41,100,2020-08-17 11:13:00,2020-08-17,11,13,673,MMM,166.11,0.30,0.001806,0.001804
8584,166.11,166.11,166.11,166.11,200,2020-08-17 10:34:00,2020-08-17,10,34,634,MMM,167.50,-1.39,-0.008299,-0.008333


In [113]:
end_stamp > (start_stamp - pd.Timedelta('60 days'))

True

In [None]:
end_stamp - 

In [107]:
2* (end_stamp - start_stamp)

Timedelta('-59 days +01:06:00')

In [22]:
# stock ='MMM'
# periods = 10
# end_stamp = stock_df['datetime'].iloc[-1]
# start_stamp = stock_df['datetime'].iloc[0]

# endDate = end_stamp - pd.Timedelta(days=periods)
# print(end_stamp, endDate)

# endDate = unix_time_millis(endDate)
# end_stamp = unix_time_millis(end_stamp)
extract_stock(stock,
              return_df=True,
              periodType='day', 
              frequencyType='minute', 
              frequency='1', 
              periods=periods,
              endDate=end_stamp,
              startDate=endDate)

KeyError: 'candles'

In [6]:
periods = 10 # the maximum period for the minute/day history API
stock_files = ['MMM']

for stock in stock_files: 
    stock_df = pd.read_pickle(f'{data_dir}{stock}.pickle')
    
    end_stamp = stock_df['datetime'].min()
    start_stamp = stock_df['datetime'].max()
    
    while end_stamp > (start_stamp - pd.Timedelta('60 days')):
        endDate = end_stamp - pd.Timedelta(days=periods)
        print(end_stamp, endDate)
        endDate = unix_time_millis(endDate)
        startDate = unix_time_millis(end_stamp)
        stock_df = pd.concat([stock_df, extract_stock(stock,
                                                      return_df=True,
                                                      periodType='day', 
                                                      frequencyType='minute', 
                                                      frequency='1', 
                                                      periods=periods,
                                                      endDate=startDate,
                                                      startDate=endDate)])
        
        stock_df = stock_df.sort_values(by='datetime')
        stock_df.drop_duplicates(subset='datetime')
        
        end_stamp = stock_df['datetime'].min()
        start_stamp = stock_df['datetime'].max()
        
        time.sleep(.5)
        
df

2020-08-17 08:39:00 2020-08-07 08:39:00
2020-08-07 11:03:00 2020-07-28 11:03:00
2020-08-03 10:36:00 2020-07-24 10:36:00
2020-08-03 10:36:00 2020-07-24 10:36:00
2020-08-03 10:36:00 2020-07-24 10:36:00
2020-08-03 10:36:00 2020-07-24 10:36:00


KeyboardInterrupt: 

In [132]:
stock_df.shape

(25606, 15)

In [139]:
stock_df.sort_values(by='datetime')

Unnamed: 0,open,high,low,close,volume,datetime,date,hour,minute,min_num,SYMBOL,prev_close,diff_1,pct_change,log_return
0,150.65,150.65,150.65,150.65,100,2020-08-03 10:36:00,2020-08-03,10,36,636,MMM,,,,
0,150.65,150.65,150.65,150.65,100,2020-08-03 10:36:00,2020-08-03,10,36,636,MMM,,,,
0,150.65,150.65,150.65,150.65,100,2020-08-03 10:36:00,2020-08-03,10,36,636,MMM,,,,
0,150.65,150.65,150.65,150.65,100,2020-08-03 10:36:00,2020-08-03,10,36,636,MMM,,,,
0,150.65,150.65,150.65,150.65,100,2020-08-03 10:36:00,2020-08-03,10,36,636,MMM,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4,166.50,166.57,166.50,166.50,300,2020-09-15 20:00:00,2020-09-15,20,0,1200,MMM,166.57,-0.07,-0.000420,-0.000420
3,168.40,168.40,168.20,168.20,2,2020-09-15 20:01:00,2020-09-15,20,1,1201,MMM,166.50,1.70,0.010210,0.010158
2,168.81,168.81,168.81,168.81,12,2020-09-15 20:02:00,2020-09-15,20,2,1202,MMM,168.20,0.61,0.003627,0.003620
1,166.57,166.57,166.57,166.57,501396,2020-09-15 20:03:00,2020-09-15,20,3,1203,MMM,168.81,-2.24,-0.013269,-0.013358


In [135]:
end_stamp > (start_stamp - pd.Timedelta('60 days'))

False

In [137]:
start_stamp - pd.Timedelta('60 days')

Timestamp('2020-07-17 20:06:00')

In [138]:
end_stamp

Timestamp('2020-08-03 10:36:00')

In [6]:
spdf = pd.read_csv(f'{data_dir}sp500_minute_data.csv')

In [7]:
spdf

Unnamed: 0,datetime,MMM_open,MMM_high,MMM_low,MMM_close,MMM_volume,ABT_open,ABT_high,ABT_low,ABT_close,...,ZION_open,ZION_high,ZION_low,ZION_close,ZION_volume,ZTS_open,ZTS_high,ZTS_low,ZTS_close,ZTS_volume
0,2020-08-17 08:00:00,,,,,,,,,,...,,,,,,,,,,
1,2020-08-17 08:01:00,,,,,,,,,,...,,,,,,,,,,
2,2020-08-17 08:02:00,,,,,,,,,,...,,,,,,,,,,
3,2020-08-17 08:03:00,,,,,,,,,,...,,,,,,,,,,
4,2020-08-17 08:04:00,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19724,2020-09-15 20:27:00,,,,,,,,,,...,,,,,,162.28,162.28,162.28,162.28,3951.0
19725,2020-09-15 20:28:00,,,,,,,,,,...,,,,,,,,,,
19726,2020-09-15 20:29:00,,,,,,,,,,...,,,,,,,,,,
19727,2020-09-15 20:30:00,,,,,,,,,,...,,,,,,,,,,


In [10]:
stock_df['datetime'].max()

Timestamp('2020-09-15 20:06:00')

In [11]:
stock_df['datetime'].min()

Timestamp('2020-08-17 08:39:00')

In [12]:
end_stamp = stock_df['datetime'].min()
start_stamp = stock_df['datetime'].max()

end_stamp, start_stamp

(Timestamp('2020-08-17 08:39:00'), Timestamp('2020-09-15 20:06:00'))

In [15]:
end_stamp > (start_stamp - pd.Timedelta('60 days'))

True

In [16]:
(start_stamp - pd.Timedelta('60 days'))

Timestamp('2020-07-17 20:06:00')

In [14]:
extract_stock('AAPL',
                                                      return_df=True,
                                                      periodType='day', 
                                                      frequencyType='minute', 
                                                      frequency='1', 
                                                      periods=periods,
                                                      endDate=None,
                                                      startDate=1505586960000)

Unnamed: 0,open,high,low,close,volume,datetime,date,hour,minute,min_num,SYMBOL,prev_close,diff_1,pct_change,log_return
0,106.8600,106.8700,106.8600,106.8700,3348,2020-08-03 08:00:00,2020-08-03,8,0,480,AAPL,,,,
1,106.9025,107.1425,106.9025,107.1425,7684,2020-08-03 08:01:00,2020-08-03,8,1,481,AAPL,106.8700,0.2725,0.002550,0.002547
2,107.1750,107.1750,107.1750,107.1750,2764,2020-08-03 08:03:00,2020-08-03,8,3,483,AAPL,107.1425,0.0325,0.000303,0.000303
3,107.1875,107.3250,107.1875,107.3250,2456,2020-08-03 08:04:00,2020-08-03,8,4,484,AAPL,107.1750,0.1500,0.001400,0.001399
4,107.3000,107.3175,107.2550,107.2550,2716,2020-08-03 08:05:00,2020-08-03,8,5,485,AAPL,107.3250,-0.0700,-0.000652,-0.000652
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27670,110.3100,110.3600,110.3100,110.3600,2110,2020-09-17 23:55:00,2020-09-17,23,55,1435,AAPL,110.3200,0.0400,0.000363,0.000363
27671,110.3600,110.4000,110.3600,110.3900,5107,2020-09-17 23:56:00,2020-09-17,23,56,1436,AAPL,110.3600,0.0300,0.000272,0.000272
27672,110.3900,110.3900,110.3600,110.3600,1825,2020-09-17 23:57:00,2020-09-17,23,57,1437,AAPL,110.3900,-0.0300,-0.000272,-0.000272
27673,110.3700,110.3800,110.3100,110.3200,8409,2020-09-17 23:58:00,2020-09-17,23,58,1438,AAPL,110.3600,-0.0400,-0.000362,-0.000363


The data only goes back a month and a half. 

What kind of a joke api is this even? 

In [7]:
endDate, startDate

(1595586960000, 1596450960000)

In [10]:
unix_time_millis(stock_df['datetime'].min())

1596450960000

In [16]:
end_stamp = stock_df['datetime'].min()
start_stamp = stock_df['datetime'].max()

end_stamp - start_stamp

Timedelta('-44 days +14:30:00')

### Load the sp500 dataset to BigQuery for testing 

In [7]:
from google.cloud import bigquery 

client = bigquery.Client()

table_id = "sp500historical.info"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV, skip_leading_rows=1, autodetect=True,
)

with open('../data/sp500/S&P500-Info.csv', "rb") as source_file:
    job = client.load_table_from_file(source_file, table_id, job_config=job_config)

job.result()  # Waits for the job to complete.

table = client.get_table(table_id)  # Make an API request.
print(
    "Loaded {} rows and {} columns to {}".format(
        table.num_rows, len(table.schema), table_id
    )
)

Loaded 505 rows and 10 columns to sp500historical.info
