# How can we predict the value of the S&P 500?
Jasper Hsu, Thomas Suman, Trey Hensel

## Libraries and dependencies:

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import math

## Loading the data: 
https://www.kaggle.com/datasets/gkitchen/s-and-p-500-spy

In [5]:
df = pd.read_csv("spy.csv")

First, we must decide what we are trying to predict. Our dataset supplies us with four options: open, high, low, and close. Predicting the daily high and low seems like a good idea, but those values are hard to use: they depend on the time of day, and the dataset does not record when the stock reaches either extreme. Predicting the opening price is useful, but the open tends to be driven by unpredictable overnight news and events rather than by the sentiment of the trading day as a whole. The closing price, by contrast, reflects market sentiment and demand for the stock at the end of the trading day, and it is widely used as a benchmark for measuring the performance of the stock market. We will therefore predict the closing price.

## Test/train split:

In [20]:
# shuffle the rows so that the train/test split is random;
# the fixed seed (an arbitrary choice) keeps the split reproducible across runs
df = df.sample(frac=1, random_state=0)

train_proportion = 0.8
n = len(df)
split = math.floor(n*train_proportion)

target = df["Close"]
features = ["Open", "High", "Low", "Volume", "Day", "Weekday", "Week", "Month", "Year"]
data = df.loc[:, df.columns.isin(features)]

# features of the examples in the training set (first `split` rows)
train_x = data.iloc[:split]
# features of the examples in the test set (all remaining rows, starting at row `split`)
test_x = data.iloc[split:]
# labels of the examples in the training set
train_y = target.iloc[:split]
# labels of the examples in the test set
test_y = target.iloc[split:]

Note that we effectively removed the "Date" column from our data as the same information is stored separately in the "Day", "Month", and "Year" columns.
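
As a sanity check on the split logic, here is a minimal sketch that runs the same shuffle-and-slice steps on a small synthetic table and confirms that every row lands in exactly one of the two sets. The helper name `shuffle_and_split`, the `seed` parameter, and the toy values are our own assumptions for illustration; they are not part of the dataset or of the cells above.

```python
import math

import numpy as np
import pandas as pd


def shuffle_and_split(frame, target_col, feature_cols, train_proportion=0.8, seed=0):
    """Shuffle the rows, then cut them positionally into train and test parts."""
    shuffled = frame.sample(frac=1, random_state=seed)
    split = math.floor(len(shuffled) * train_proportion)
    x = shuffled[feature_cols]
    y = shuffled[target_col]
    # Slicing at `split` on both sides leaves no row out and duplicates none.
    return x.iloc[:split], x.iloc[split:], y.iloc[:split], y.iloc[split:]


# Synthetic stand-in for spy.csv (values are made up for illustration only).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Open": rng.uniform(100, 200, size=10),
    "Volume": rng.integers(1_000, 5_000, size=10),
    "Close": rng.uniform(100, 200, size=10),
})

train_x, test_x, train_y, test_y = shuffle_and_split(toy, "Close", ["Open", "Volume"])

# Sizes of the two sets, and whether they share any row.
print(len(train_x), len(test_x))
print(train_x.index.intersection(test_x.index).empty)
```

Using `.iloc` for both the features and the labels keeps the slicing positional, so the labels stay aligned with their rows even after the index has been shuffled.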