# 1.4 Download SP500 Minute Data

In this notebook we consider a different question. Instead of asking how to maximize profit on a single asset, we ask whether a machine can pick the best assets from many candidates.

We use only price history data. Each observation consists of 90 differenced timesteps at one-minute intervals, and the assets are the stocks of the S&P 500.
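
To make the shape of one observation concrete, here is a minimal sketch of cutting a price matrix into overlapping blocks of 90 differenced timesteps. The random prices, the five-stock width, and all variable names are assumptions for illustration, not the data used later in this notebook.

```python
import numpy as np

# Hypothetical stand-in for the real data: rows are minutes, columns are stocks.
n_minutes, n_stocks, window = 200, 5, 90
rng = np.random.default_rng(0)
prices = 100 + rng.normal(0, 0.1, size=(n_minutes, n_stocks)).cumsum(axis=0)

# Difference the prices along the time axis (one fewer row than prices).
diffs = np.diff(prices, axis=0)

# Each observation is a block of `window` consecutive differenced rows.
# sliding_window_view puts the window axis last, so move it next to time.
windows = np.lib.stride_tricks.sliding_window_view(diffs, window, axis=0)
windows = np.moveaxis(windows, -1, 1)  # shape: (n_windows, window, n_stocks)

print(windows.shape)
```

The same call could presumably be applied to the final array `X` from this notebook, which already holds per-minute changes rather than raw prices.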

We will consider the network successful if it outperforms the S&P 500 over the same period.
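
One way this success criterion could be checked is sketched below. It assumes an agent that holds one stock per minute and uses the equal-weighted average of all columns as a stand-in benchmark. The real S&P 500 is capitalization-weighted, so this proxy, the toy returns, and the helper name `cumulative_return` are all assumptions for illustration.

```python
import numpy as np

def cumulative_return(per_step_returns):
    """Compound a sequence of simple per-step returns into one total return."""
    return float(np.prod(1.0 + np.asarray(per_step_returns)) - 1.0)

# Hypothetical per-minute simple returns: rows are minutes, columns are stocks.
returns = np.array([
    [0.01, -0.02, 0.00],
    [0.02, 0.01, -0.01],
])

# Hypothetical agent choices: the column index held during each minute.
picks = np.array([0, 1])

agent_returns = returns[np.arange(len(picks)), picks]
benchmark_returns = returns.mean(axis=1)  # equal-weight proxy, not the real index

agent_total = cumulative_return(agent_returns)
benchmark_total = cumulative_return(benchmark_returns)
print(agent_total > benchmark_total)
```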

To do this, we need to perform the following steps:

- Download datasets for all the stocks in the S&P 500.
- Format the data to represent the simultaneous movement of 500 stocks.
- Build an environment to represent this movement.
- Train a DQN to learn on it.

#### Download datasets for all the stocks in the S&P 500.

In [None]:
import time
from IPython import display
from extract import extract_stock, extract_multi_periods, load_set
from transform import format_date
import pandas as pd 
import matplotlib.pyplot as plt
from sine_modules import *
from sklearn.linear_model import LinearRegression

In [None]:
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
data_dir = './data/sp500/'
suffix = ''

In [None]:
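# read_html returns a list with every table on the page; this code assumes
# the first one is the constituents table and that it has a 'Symbol' column.
table = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
df = table[0]
df.to_csv('S&P500-Info.csv')
df.to_csv("S&P500-Symbols.csv", columns=['Symbol'])
df.head()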

In [None]:
stocksdf = df

In [None]:
data_dir = './data/sp500/'
stocks = list(stocksdf['Symbol'])
spdf = pd.DataFrame()
suffix = ''

In [None]:
df = load_set('MMM', data_dir, '.pickle')

In [None]:
df

In [None]:
df['close'].shift(-1)

In [None]:
df['close'].diff(-1)

In [None]:
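# Manual check of one row: change divided by the next row's close
# (values presumably read from the outputs above).
-1.16 / 166.05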

In [None]:
df['close'].diff(-1) / df['close'].shift(-1)
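
The three expressions above can be followed on a toy series. This is a sketch with made-up prices: `diff(-1)` subtracts the next row from the current one, and `shift(-1)` moves the next row's value up into the current row. Whether the result reads as a forward or backward return depends on how the rows are ordered in time, which is not assumed here.

```python
import pandas as pd

# Hypothetical closing prices, three consecutive rows.
close = pd.Series([110.0, 100.0, 125.0])

next_row = close.shift(-1)    # [100.0, 125.0, NaN]
change = close.diff(-1)       # [10.0, -25.0, NaN]
relative = change / next_row  # [0.10, -0.20, NaN]

print(relative)
```

The last row is always `NaN` because it has no following row to compare against.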

In [None]:
len(stocks)

In [None]:
com_stocks = []

In [None]:
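start_time = time.time()

for i, stock in enumerate(stocks):
    # Skip symbols already processed so the cell can be re-run after a failure.
    if stock in com_stocks:
        continue
    df = extract_multi_periods(stock, data_dir=data_dir)

    # Change between each row and the next, relative to the current close.
    # Note: the exploratory cell above divides by close.shift(-1) instead.
    spdf[stock] = df['close'].diff(-1) / df['close']

    com_stocks.append(stock)
    display.clear_output()
    print(f'{i}')

    # Brief pause between symbols, presumably to go easy on the data source.
    time.sleep(.5)

spdf.head()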

In [None]:
len(stocks)

In [None]:
spdf.shape

In [None]:
spdf.to_pickle(f'{data_dir}sp500_close_data.pickle')

In [None]:
spdf = pd.read_pickle(f'{data_dir}sp500_close_data.pickle')

In [None]:
spdf[spdf.columns[spdf.isna().sum() > 1000]]
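
The selection above keeps only the columns with more than 1000 missing values. A minimal sketch of the same idea on a toy frame, with a hypothetical threshold of 2 and made-up symbols, shows the intermediate counts:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one column per symbol, NaN where a minute is missing.
frame = pd.DataFrame({
    "AAA": [0.1, 0.2, 0.3, 0.4],
    "BBB": [np.nan, np.nan, np.nan, 0.4],
    "CCC": [0.1, np.nan, 0.3, 0.4],
})

threshold = 2  # hypothetical cutoff; the notebook uses 1000 on the full data
nan_counts = frame.isna().sum()
sparse_columns = nan_counts[nan_counts > threshold].index.tolist()

print(sparse_columns)
```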

In [None]:
df = load_set('ABMD', data_dir, '.pickle')

df['datetime'].dtype

In [None]:
df['datetime'] = pd.to_datetime(df['datetime'])

In [None]:
df['datetime'].dtype

In [None]:
df.set_index('datetime', inplace=True)

In [None]:
df.index

In [None]:
spdf = pd.DataFrame()

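for stock in stocks:
    df = load_set(stock, data_dir, '.pickle')

    # Parse timestamps so the index is a DatetimeIndex, which the
    # .hour / .minute filters below rely on.
    df['datetime'] = pd.to_datetime(df['datetime'])
    df.set_index('datetime', inplace=True)

    spdf[stock] = df['close'].diff(-1) / df['close']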
    
spdf
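
Adding several hundred columns one at a time can be slow and may trigger a pandas fragmentation warning. An alternative sketch collects every column first and joins them once with `pd.concat`. The toy frames and the helper `relative_change` are hypothetical stand-ins for what `load_set` returns. Unlike the loop above, which aligns every stock to the first stock's index, `pd.concat` keeps the union of all timestamps.

```python
import pandas as pd

def relative_change(frame):
    """Same formula as the loop above: row-to-next-row change over the current close."""
    return frame["close"].diff(-1) / frame["close"]

# Hypothetical stand-ins for the per-stock frames, indexed by timestamp.
index_a = pd.to_datetime(["2021-01-04 14:30", "2021-01-04 14:31", "2021-01-04 14:32"])
index_b = pd.to_datetime(["2021-01-04 14:31", "2021-01-04 14:32"])
frames = {
    "AAA": pd.DataFrame({"close": [10.0, 8.0, 16.0]}, index=index_a),
    "BBB": pd.DataFrame({"close": [50.0, 25.0]}, index=index_b),
}

# Build every column first, then join them once on the union of timestamps.
columns = {symbol: relative_change(frame) for symbol, frame in frames.items()}
wide = pd.concat(columns, axis=1)

print(wide)
```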

In [None]:
pd.set_option('display.max_rows', 1000)

In [None]:
spdf[spdf.index.hour == 13]

In [None]:
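# Keep rows from 13:30 through 19:59 (in the index's own timezone).
hour = spdf.index.hour
minute = spdf.index.minute
after_open = (hour > 13) | ((hour == 13) & (minute >= 30))
before_close = hour < 20

# .copy() makes spdfm independent of spdf, so later edits do not raise
# chained-assignment warnings.
spdfm = spdf[after_open & before_close].copy()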
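
The filter above keeps timestamps from 13:30 through 19:59. As a sketch, the same selection can be written with `DataFrame.between_time`, which works on a `DatetimeIndex` and includes both endpoints by default. The toy index and the column name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical one-minute index covering a single day from 13:00 to 20:30.
index = pd.date_range("2021-01-04 13:00", "2021-01-04 20:30", freq="min")
frame = pd.DataFrame({"AAA": range(len(index))}, index=index)

# Boolean-mask version, equivalent to the filter above.
hour, minute = frame.index.hour, frame.index.minute
mask = (hour < 20) & ((hour > 13) | ((hour == 13) & (minute >= 30)))
by_mask = frame[mask]

# between_time includes both endpoints by default.
by_time = frame.between_time("13:30", "19:59")

print(len(by_time))
```

Both versions select the same 390 rows for this toy day.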

In [None]:
spdfm.head(1000)

In [None]:
spdfm.isna().sum()

In [None]:
plt.figure(figsize=(30,30))
plt.plot(spdfm['TFX'])

In [None]:
spdfm['TFX'][-1000:]

In [None]:
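# Treat missing minutes as "no change". Reassigning instead of using
# inplace=True avoids a SettingWithCopyWarning on a filtered frame.
spdfm = spdfm.fillna(0.0)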

In [None]:
spdfm.isna().sum()

In [None]:
plt.figure(figsize=(100,100))
for stock in spdfm.columns:
    plt.plot(spdfm[stock])

In [None]:
spdfm.to_pickle(f'{data_dir}spdfm.pickle')

In [None]:
f'{data_dir}spdfm.pickle'

In [None]:
spdfm = pd.read_pickle(f'{data_dir}spdfm.pickle')

In [None]:
X = spdfm.to_numpy()

In [None]:
X.shape

In [None]:
X