The data comes from https://firstratedata.com/. Their free samples appear to cover a window tied to the day you request them.

I tried to find stocks in different fields.

I didn't feel the need to test for trends because my window is so short (just a couple of weeks).

I chose to use a GRU based on this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9141105/#:~:text=2.4.-,Recurrent%20Neural%20Networks%20(RNNs),time%20intervals%20or%20time%20steps. Long Short-Term Memory (LSTM) networks may also work, but they may take longer to train. "GRUs are simplified version of LSTMs that use single “update gate” to control the flow of information into the memory cell. GRUs are easier to train and faster to run than LSTMs, but they may not be as effective at storing and accessing long-term dependencies."
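To make the "simplified" claim concrete, the sketch below counts trainable weights for one recurrent layer of each type. It assumes the standard formulations (three weight blocks for a GRU, four for an LSTM), 16 input features, and 64 units, matching the layer sizes used later in this notebook. The function names are illustrative and not part of any library, and the default of two bias vectors per GRU block is an assumption that mirrors Keras's `reset_after=True` layout.

```python
def lstm_params(n_features, n_units):
    # 4 blocks (input, forget, and output gates plus the candidate), each with
    # input weights, recurrent weights, and one bias vector.
    return 4 * (n_features * n_units + n_units * n_units + n_units)


def gru_params(n_features, n_units, biases_per_block=2):
    # 3 blocks (update gate, reset gate, candidate state).
    return 3 * (n_features * n_units + n_units * n_units
                + biases_per_block * n_units)


print(lstm_params(16, 64))  # 20736
print(gru_params(16, 64))   # 15744
```

Under these assumptions the GRU layer carries roughly three quarters of the weights of an LSTM layer of the same width, which is one reason it tends to train faster.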

I have access to 10 days' worth of minute-by-minute data for 7 stocks in different fields. I used a "walk forward" strategy to train my models: I gave the model two and a half days' worth of data (from 8am on day 1 to 11:59am on day 3) and asked it to predict the price at noon, at 2pm, and at the 4pm close on day 3. Once the model was tuned, I extended the training window from day 1 into the middle of day 4, and so on until the final day's worth of data (the test set) was reached.
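The walk-forward scheme can be sketched as a generator of expanding training windows. This is a minimal illustration rather than the split code used in this notebook: `walk_forward_splits` is a hypothetical helper, days are numbered from 1, and the prediction times (noon, 2pm, 4pm) come from the description above.

```python
from datetime import datetime, timedelta


def walk_forward_splits(first_day, n_days, first_predict_day=3):
    """Yield (train_start, train_end, predict_times) for each fold.

    Training always starts at 8am on day 1 and ends at 11:59am on the
    prediction day, so the window grows by one day per fold.
    """
    train_start = first_day.replace(hour=8, minute=0)
    for day in range(first_predict_day, n_days + 1):
        day_date = first_day + timedelta(days=day - 1)
        train_end = day_date.replace(hour=11, minute=59)
        predict_times = [day_date.replace(hour=h, minute=0) for h in (12, 14, 16)]
        yield train_start, train_end, predict_times


# Ten consecutive calendar days are assumed for simplicity; real data skips weekends.
for start, end, targets in walk_forward_splits(datetime(2024, 2, 26), n_days=10):
    print(start, "->", end, "predict:", [t.strftime("%H:%M") for t in targets])
```

Each fold reuses all earlier data, so the model never sees prices from after the moment it is asked to predict.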

Remember: Convert change in price to percentage
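One way to do that conversion is sketched below, with made-up prices for two hypothetical stocks. `pct_change` here is an illustrative helper, not the pandas method of the same name. Expressing moves as percentages puts a $10 stock and a $200 stock on the same scale.

```python
import numpy as np


def pct_change(prices):
    """Percentage change between consecutive prices."""
    prices = np.asarray(prices, dtype=float)
    return (prices[1:] - prices[:-1]) / prices[:-1] * 100.0


# Made-up prices: the dollar moves differ by 20x, the percentage moves match.
cheap = [10.0, 10.1, 10.0]
pricey = [200.0, 202.0, 200.0]
print(pct_change(cheap))
print(pct_change(pricey))
```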

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
#!pip install numpy==1.23.0
import numpy as np

from sklearn.metrics import mean_squared_error as mse
from sklearn.preprocessing import StandardScaler

from datetime import datetime as dt

from keras.models import Sequential
from keras.layers import InputLayer, GRU, Dense
from keras.losses import MeanSquaredError
from keras.metrics import RootMeanSquaredError
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint

import tensorflow as tf

In [2]:
print(np.__version__)

1.25.2


In [3]:
df_aal = pd.read_csv("https://raw.githubusercontent.com/FerdinandBeaman/Capstone/main/1MinSamples/AAL_1min_sample.csv")
df_fdx = pd.read_csv("https://raw.githubusercontent.com/FerdinandBeaman/Capstone/main/1MinSamples/FDX_1min_sample.csv")
df_fis = pd.read_csv("https://raw.githubusercontent.com/FerdinandBeaman/Capstone/main/1MinSamples/FIS_1min_sample.csv")
df_mcy = pd.read_csv("https://raw.githubusercontent.com/FerdinandBeaman/Capstone/main/1MinSamples/M_1min_sample.csv")
df_spr = pd.read_csv("https://raw.githubusercontent.com/FerdinandBeaman/Capstone/main/1MinSamples/S_1min_sample.csv")
df_sbx = pd.read_csv("https://raw.githubusercontent.com/FerdinandBeaman/Capstone/main/1MinSamples/SBUX_1min_sample.csv")
df_tsl = pd.read_csv("https://raw.githubusercontent.com/FerdinandBeaman/Capstone/main/1MinSamples/TSLA_1min_sample.csv")

all_dfs = [df_aal, df_fdx, df_fis, df_mcy, df_sbx, df_spr, df_tsl]

In [4]:
for df in all_dfs:
    print(len(df))

5700
4231
4396
5595
4776
5050
10005


In [5]:
for df in all_dfs:
    df['timestamp'] = pd.to_datetime(df['timestamp'])

In [6]:
# for df in all_dfs:
#     print(df.isnull().sum())
#     print("\n")

Next, I am just looking for the latest time that any of my stocks began to track their prices and the earliest time that any of them stopped tracking their prices. This way, I can make all of my data uniform in length.

In [7]:
for df in all_dfs:
    print(df["timestamp"][0])

2024-02-26 04:03:00
2024-02-26 06:09:00
2024-02-26 06:06:00
2024-02-26 04:41:00
2024-02-26 08:00:00
2024-02-26 04:00:00
2024-02-26 04:00:00


In [8]:
for df in all_dfs:
    print(df["timestamp"].iloc[-1])

2024-03-11 19:44:00
2024-03-11 18:11:00
2024-03-11 16:00:00
2024-03-11 19:39:00
2024-03-11 19:04:00
2024-03-11 19:38:00
2024-03-11 19:54:00


So the common window runs from 8am on the 26th (the latest start) to 4pm on the 11th (the earliest end).
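The same bounds can be computed rather than read off by eye. The sketch below uses the timestamps printed above, hard-coded so that it runs on its own. Inside the notebook, an equivalent (assuming the `timestamp` column is still present) would be `max(df["timestamp"].iloc[0] for df in all_dfs)` and `min(df["timestamp"].iloc[-1] for df in all_dfs)`.

```python
from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S"

# First and last timestamps as printed above, in all_dfs order.
starts = ["2024-02-26 04:03:00", "2024-02-26 06:09:00", "2024-02-26 06:06:00",
          "2024-02-26 04:41:00", "2024-02-26 08:00:00", "2024-02-26 04:00:00",
          "2024-02-26 04:00:00"]
ends = ["2024-03-11 19:44:00", "2024-03-11 18:11:00", "2024-03-11 16:00:00",
        "2024-03-11 19:39:00", "2024-03-11 19:04:00", "2024-03-11 19:38:00",
        "2024-03-11 19:54:00"]

# The shared window starts when the last stock starts and ends when the first one stops.
latest_start = max(datetime.strptime(s, fmt) for s in starts)
earliest_end = min(datetime.strptime(e, fmt) for e in ends)

print(latest_start)  # 2024-02-26 08:00:00
print(earliest_end)  # 2024-03-11 16:00:00
```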

In [9]:
for df in all_dfs:
    df.set_index('timestamp', inplace=True)

In [10]:
for i, df in enumerate(all_dfs):
    all_dfs[i] = df.resample("1min").asfreq().ffill()

In [11]:
for i, df in enumerate(all_dfs):
    all_dfs[i] = df['2024-02-26 08:00' : "2024-03-11 16:00" ]

In [12]:
for i, df in enumerate(all_dfs):
    # Each frame is a slice of the resampled data, so build a new frame
    # instead of dropping in place (which raises SettingWithCopyWarning).
    all_dfs[i] = df.drop(columns=["high", "low", "close"])


In [13]:
seven_dfs = pd.concat(all_dfs, axis=1)

In [14]:
cols = ["open_1", "volume_1", "open_2", "volume_2", "open_3",
        "volume_3", "open_4", "volume_4", "open_5", "volume_5",
        "open_6", "volume_6", "open_7", "volume_7"]

# set_axis returns a new DataFrame; its inplace argument is deprecated.
seven_dfs = seven_dfs.set_axis(cols, axis=1)


In [15]:
# Getting the initial training window (Feb 26 08:00 through Feb 28 12:00),
# then using that to scale all of my data.


train_36_hrs = seven_dfs['2024-02-26 08:00' : "2024-02-28 12:00" ]

scaler = StandardScaler()

# fit() returns the scaler itself, so its result must not be assigned
# back into the DataFrame.
scaler.fit(train_36_hrs[cols])
seven_dfs[cols] = scaler.transform(seven_dfs[cols])
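The reason for fitting on the initial training window only is to keep later prices out of the scaling statistics. The sketch below shows the same idea with plain NumPy on synthetic data. The array sizes and the 30-row training window are assumptions for illustration, not values from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the price/volume table: 100 rows, 3 columns.
data = rng.normal(loc=50.0, scale=5.0, size=(100, 3))
train = data[:30]  # the initial training window

# Fit: the statistics come from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Transform: the same statistics are applied to every row.
scaled = (data - mean) / std

print(scaled[:30].mean(axis=0))  # close to 0 on the training rows
print(scaled[:30].std(axis=0))   # close to 1 on the training rows
```

Rows outside the training window are not guaranteed to have zero mean or unit variance after scaling, which is expected: the scaler has never seen them.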


In [16]:
# The DatetimeIndex exposes these fields directly, so no row-by-row loop is needed.
seven_dfs["hour"] = seven_dfs.index.hour
seven_dfs["day"] = seven_dfs.index.dayofweek

In [None]:
seven_dfs.drop(seven_dfs[(seven_dfs["hour"] < 8) |
                        (seven_dfs["hour"] > 15)].index, inplace = True)
seven_dfs.drop(seven_dfs[seven_dfs["day"] > 4].index, inplace = True)

In [None]:
len(seven_dfs)

In [None]:
seven_dfs.open_1.plot()
seven_dfs.open_2.plot()
seven_dfs.open_3.plot()

In [None]:
prices = ["open_1", "open_2", "open_3", "open_4", "open_5", "open_6", "open_7"]
for price in prices:
  seven_dfs[price][400:460].plot()

In [None]:
for price in prices:
  seven_dfs[price][1440:1920].plot()

In [None]:
for price in prices:
  seven_dfs[price][415:445].plot()

In [None]:
seven_dfs.drop(seven_dfs[seven_dfs["hour"] < 9].index, inplace = True)

In [None]:
len(seven_dfs)

In [None]:
seven_dfs.head()

In [None]:
# Code repurposed from Greg Hogg: https://www.youtube.com/watch?v=c0k-YLQGKjY
def df_to_Xy(df, window):
  df_np = df.to_numpy()
  X = []
  y = []
  for i in range(0, len(df)-window, window):
    row = [a for a in df_np[i:i+window]]
    X.append(row)
    y.append(df_np[i+window][[0,2,4,6,8,10,12]]) # y is just the 7 price cols
  return np.array(X), np.array(y,dtype=np.float32)
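To see what shapes this windowing produces, here is a self-contained toy version. `make_windows` is a hypothetical stand-in that mirrors the logic of `df_to_Xy` but takes a NumPy array directly. The 10-row input is made up, and the 16 columns match the 14 scaled columns plus `hour` and `day`.

```python
import numpy as np


def make_windows(values, window, target_cols):
    """Non-overlapping windows; the target is the row right after each window."""
    X, y = [], []
    for i in range(0, len(values) - window, window):
        X.append(values[i:i + window])
        y.append(values[i + window, target_cols])
    return np.array(X), np.array(y, dtype=np.float32)


toy = np.arange(10 * 16, dtype=float).reshape(10, 16)  # 10 rows, 16 columns
price_cols = [0, 2, 4, 6, 8, 10, 12]                   # the 7 price columns

X, y = make_windows(toy, window=3, target_cols=price_cols)
print(X.shape, y.shape)  # (3, 3, 16) (3, 7)
```

Because the step equals the window length, the number of samples is roughly the row count divided by the window, so longer windows leave fewer samples to train on.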

In [None]:
test = seven_dfs.to_numpy()
test[0][[0,2]]

In [None]:
X34, y34 = df_to_Xy(seven_dfs, 34)
X45, y45 = df_to_Xy(seven_dfs, 45)
X60, y60 = df_to_Xy(seven_dfs, 60)
X80, y80 = df_to_Xy(seven_dfs, 80)

In [None]:
len(X34)

In [None]:
X_train34, y_train34 = X34[:95], y34[:95] #Just over 70% of the data
X_val34, y_val34 = X34[95:115], y34[95:115] # len(val) == len(test)
X_test34, y_test34 = X34[115:], y34[115:]

X_train45, y_train45 = X45[:72], y45[:72]
X_val45, y_val45 = X45[72:87], y45[72:87]
X_test45, y_test45 = X45[87:], y45[87:]

X_train60, y_train60 = X60[:55], y60[:55]
X_val60, y_val60 = X60[55:65], y60[55:65]
X_test60, y_test60 = X60[65:75], y60[65:75]

X_train80, y_train80 = X80[:40], y80[:40]
X_val80, y_val80 = X80[40:49], y80[40:49]
X_test80, y_test80 = X80[49:], y80[49:]

First, I tried a model with a high learning rate (0.01), then the default learning rate (0.001 for Keras's Adam optimizer), and finally a model with a small learning rate (0.00001).

In [None]:
model_34 = Sequential()
model_34.add(InputLayer((34,16)))
model_34.add(GRU(64))
model_34.add(Dense(16, "relu"))
model_34.add(Dense(14, "relu"))
model_34.add(Dense(7, "linear"))

cp34 = ModelCheckpoint("model_34/", save_best_only=True)

model_34.compile(loss=MeanSquaredError(), optimizer=Adam(learning_rate=.01),
                 metrics=[RootMeanSquaredError()])

In [None]:
model_45 = Sequential()
model_45.add(InputLayer((45,16)))
model_45.add(GRU(64))
model_45.add(Dense(16, "relu"))
model_45.add(Dense(14, "relu"))
model_45.add(Dense(7, "linear"))

cp45 = ModelCheckpoint("model_45/", save_best_only=True)

model_45.compile(loss=MeanSquaredError(), optimizer=Adam(learning_rate=.01),
                 metrics=[RootMeanSquaredError()])

In [None]:
model_60 = Sequential()
model_60.add(InputLayer((60,16)))
model_60.add(GRU(64))
model_60.add(Dense(16, "relu"))
model_60.add(Dense(14, "relu"))
model_60.add(Dense(7, "linear"))

cp60 = ModelCheckpoint("model_60/", save_best_only=True)

model_60.compile(loss=MeanSquaredError(), optimizer=Adam(learning_rate=.01),
                 metrics=[RootMeanSquaredError()])

In [None]:
model_80 = Sequential()
model_80.add(InputLayer((80,16)))
model_80.add(GRU(64))
model_80.add(Dense(16, "relu"))
model_80.add(Dense(14, "relu"))
model_80.add(Dense(7, "linear"))

cp80 = ModelCheckpoint("model_80/", save_best_only=True)

model_80.compile(loss=MeanSquaredError(), optimizer=Adam(learning_rate=.01),
                 metrics=[RootMeanSquaredError()])

In [None]:
model_80.summary()

In [None]:
# The preceding 4 cells were supposed to be automated by this for loop
# but it was giving me trouble. Rather than let perfect get in the way
# of great (or graduating), I left this here to debug later.

# the_X_trains = [X_train34, X_train45, X_train60, X_train80]
# the_y_trains = [y_train34, y_train45, y_train60, y_train80]

# the_X_vals = [X_val34, X_val45, X_val60, X_val80]
# the_y_vals = [y_val34, y_val45, y_val60, y_val80]

# cp34 = ModelCheckpoint("model_34/", save_best_only=True)
# cp45 = ModelCheckpoint("model_45/", save_best_only=True)
# cp60 = ModelCheckpoint("model_60/", save_best_only=True)
# cp80 = ModelCheckpoint("model_80/", save_best_only=True)

# the_cps = [cp34, cp45, cp60, cp80]

# model_34 = Sequential()
# model_45 = Sequential()
# model_60 = Sequential()
# model_80 = Sequential()

# the_models = [model_34, model_45, model_60, model_80]

# for i, n in enumerate([34, 45, 60, 80]):
#   the_models[i].add(InputLayer((n,16)))
#   the_models[i].add(GRU(64))
#   the_models[i].add(Dense(16, "relu"))
#   the_models[i].add(Dense(14, "relu"))
#   the_models[i].add(Dense(7, "linear"))

#   the_models[i].compile(loss=MeanSquaredError(),
#                         optimizer=Adam(learning_rate=.00325),
#                         metrics=[RootMeanSquaredError()])

#   the_models[i].fit(the_X_trains[i], the_y_trains[i], validation_data=(
#       the_X_vals[i], the_y_vals[i]), epochs = 100,
#       callbacks = [the_cps[i], EarlyStopping(patience=4)])

In [None]:
# Code repurposed from Greg Hogg: https://www.youtube.com/watch?v=kGdbPnMCdOg


def plot_pred(model, X, y, col):
  y = y[:,col]
  preds = model.predict(X)[:,col].flatten()
  df = pd.DataFrame(data={"Predictions":preds, "Actuals":y})
  plt.plot(df["Predictions"][:], label = "Predictions")
  plt.plot(df["Actuals"][:], label = "Actuals")
  plt.legend()
  return mse(y, preds)

In [None]:
model_34.fit(X_train34, y_train34, validation_data=(X_val34, y_val34),
            epochs = 25, callbacks = [cp34, EarlyStopping(patience=4)])

In [None]:
#An example of a promising result:
plot_pred(model_34, X_val34, y_val34, 1)

In [None]:
#...and a bad result:
plot_pred(model_34, X_val34, y_val34, 2)

In [None]:
model_45.fit(X_train45, y_train45, validation_data=(X_val45, y_val45),
            epochs = 25, callbacks = [cp45, EarlyStopping(patience=4)])

In [None]:
#All of the models were pretty rough, here
plot_pred(model_45, X_val45, y_val45, 2)

In [None]:
model_60.fit(X_train60, y_train60, validation_data=(X_val60, y_val60),
            epochs = 25, callbacks = [cp60, EarlyStopping(patience=4)])

In [None]:
# and it only gets bleaker with the larger timescales
plot_pred(model_60, X_val60, y_val60, 3)

In [None]:
model_80.fit(X_train80, y_train80, validation_data=(X_val80, y_val80),
            epochs = 25, callbacks = [cp80, EarlyStopping(patience=4)])

In [None]:
plot_pred(model_80, X_val80, y_val80, 2)

I expected the predictions to become less accurate at larger timescales, but I was surprised by the total lack of predictive power. The 34-minute model, however, showed promise. It was here that I realized that a model able to make reasonably accurate predictions even just a few minutes into the future could work.

Before I move on, though, I should at least check whether the poor performance of the longer-horizon models could be alleviated with a lower learning rate or with more layers.

In [None]:
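#First, lower learning rates
# compile() returns None, so recompile in place and then bind the second name.
# model_34_2 is the same object as model_34, so training resumes from its
# current weights.
model_34.compile(loss=MeanSquaredError(), optimizer=Adam(learning_rate=.00001),
                 metrics=[RootMeanSquaredError()])
model_34_2 = model_34
model_34_2.fit(X_train34, y_train34, validation_data=(X_val34, y_val34),
            epochs = 25, callbacks = [cp34, EarlyStopping(patience=4)])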

In [None]:
plot_pred(model_34_2, X_val34, y_val34, 3)

In [None]:
# compile() returns None, so recompile in place and then bind the second name.
model_45.compile(loss=MeanSquaredError(), optimizer=Adam(learning_rate=.00001),
                 metrics=[RootMeanSquaredError()])
model_45_2 = model_45
model_45_2.fit(X_train45, y_train45, validation_data=(X_val45, y_val45),
            epochs = 25, callbacks = [cp45, EarlyStopping(patience=4)])

In [None]:
plot_pred(model_45_2, X_val45, y_val45, 3)

In [None]:
# compile() returns None, so recompile in place and then bind the second name.
model_60.compile(loss=MeanSquaredError(), optimizer=Adam(learning_rate=.00001),
                 metrics=[RootMeanSquaredError()])
model_60_2 = model_60
model_60_2.fit(X_train60, y_train60, validation_data=(X_val60, y_val60),
            epochs = 25, callbacks = [cp60, EarlyStopping(patience=4)])

In [None]:
plot_pred(model_60_2, X_val60, y_val60, 3)

In [None]:
# compile() returns None, so recompile in place and then bind the second name.
model_80.compile(loss=MeanSquaredError(), optimizer=Adam(learning_rate=.00001),
                 metrics=[RootMeanSquaredError()])
model_80_2 = model_80
model_80_2.fit(X_train80, y_train80, validation_data=(X_val80, y_val80),
            epochs = 25, callbacks = [cp80, EarlyStopping(patience=4)])

In [None]:
plot_pred(model_80_2, X_val80, y_val80, 3)