# Introduction

What is the goal of your project?

• What is the data that you are using? What is the original data source if known?

• What does an instance in your data represent (e.g. a person, a transaction, etc.)? How many
instances are there?

• What is the target variable you are trying to predict?

• What are the features used to predict it? Give a few examples of the features.

• Provide any additional relevant information about your data if known (e.g. what is the time
period, what place is it collected from, etc.

Data will be from jan 1 2012 - jan 1 2025

# Problem Setup

In [None]:
#replace this with your own path to the CSV file
import pandas as pd
import numpy as np
df = pd.read_csv("C:/Users/miaca/OneDrive/Desktop/full_data.csv")
#This removes all t2, t1, and t0 columns besides open_t0. This is the data we would have if we predicted close_t2 at t0 
feature_cols = [col for col in df.columns if '_t2' not in col and 't_1' not in col and 't_0' not in col and col not in ['Close_t2', 'Release_Datetime', 'Date']] + ['Open_t0']
feature_cols = [ col for col in feature_cols if col not in ['Unnamed: 0', 'Release Date']]
X = df[feature_cols]
y = df['Close_t2']


Unnamed: 0,Close_t-60,High_t-60,Low_t-60,Open_t-60,Volume_t-60,Close_t-59,High_t-59,Low_t-59,Open_t-59,Volume_t-59,...,High_t1,Low_t1,Open_t1,Volume_t1,Actual,Forecast,Previous,Weekly Net Import,Weekly Production,Open_t0
0,105.46,105.56,105.44,105.56,514.0,105.46,105.52,105.45,105.46,323.0,...,105.48,105.21,105.33,2052.0,2500000.0,1100000.0,-400000.0,57869000.0,39151000.0,105.24
1,98.82,98.89,98.77,98.88,401.0,98.78,98.84,98.73,98.82,232.0,...,98.55,98.44,98.44,473.0,1700000.0,1800000.0,2500000.0,60529000.0,39137000.0,98.4
2,105.52,105.53,105.46,105.47,187.0,105.5,105.53,105.48,105.52,121.0,...,105.5,105.27,105.4,983.0,2100000.0,2000000.0,1700000.0,62671000.0,38990000.0,105.35
3,104.65,104.67,104.56,104.6,484.0,104.65,104.65,104.57,104.64,201.0,...,104.4,104.35,104.37,292.0,2900000.0,2000000.0,2100000.0,63658000.0,38976000.0,104.34
4,108.45,108.46,108.43,108.43,37.0,108.44,108.46,108.43,108.46,26.0,...,108.63,108.48,108.62,567.0,2000000.0,1300000.0,2900000.0,62412000.0,39466000.0,108.6


In [12]:
#Splitting data into train, test and validation sets. We do not use a time series split because there are no lags from previous days
import numpy as np
np.random.seed(42)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
#Scaling the data

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

# Algorithms

## Linear Regression (Mia Callahan)


Running a linear regression acts as a baseline for our other models. We expect that the linear regression will not perform well, as the data likely has non-linear relationships. The bias-variance tradeoff likely will lead the linear regression to underfit the data as it is a very basic model. However, it will be interesting to see how it performs especially in comparison to the more complex models we will use later.

In [17]:
# fitting a linear regression model
import numpy as np
np.random.seed(42)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
linmodel = LinearRegression()
linmodel.fit(X_train_scaled, y_train)
linmodel.intercept_, linmodel.coef_

(71.4874658869396,
 array([-2.00072143e+01, -6.68882640e+00, -5.35274550e+00,  9.37945751e+00,
        -1.13250550e-02,  1.52769152e+01, -6.29674700e+00, -6.03521120e+00,
         2.65086500e+01,  4.76741888e-03,  9.10818650e+00,  3.73146878e+00,
         2.12625724e+00, -6.58658545e+00,  4.82749805e-03,  2.14260603e+00,
        -2.06221345e+00,  6.19040633e+00, -1.29745818e+01,  1.73938786e-02,
         1.40787867e+01,  5.91496960e+00, -6.30662228e+00, -4.37616719e+00,
        -6.94229144e-03,  8.60196737e+00,  4.19776264e+00, -6.28167265e+00,
        -1.33836764e+01, -1.62233620e-02, -7.57823791e+00,  1.39004063e+01,
         1.76065041e+00, -1.71513787e+01,  1.00448962e-02, -8.28304982e-01,
        -1.51552968e+00,  1.00615305e+01, -5.65567905e+00, -1.62648325e-02,
        -1.63127999e+01, -1.41814038e+01,  1.09544943e+01,  4.24595929e+00,
         2.31471066e-02, -1.53045238e+01, -7.53926326e+00,  6.99862951e+00,
         1.30030901e+01,  1.02421070e-02, -7.27599060e+00, -1.0649105

Because there are so many features, we can see that the list of coefficients is very long. It is likely that the linear reegression will overfit because it will try to fit all of the features, even if they are not relevant. After seeing performance, we will reduce the number of features to see if this improves out of sample performance.

In [None]:
# see how it performs on out of sample data
y_hat = linmodel.predict(X_test_scaled)
error = np.sqrt(np.mean((y_hat - y_test) ** 2))
print(f"RMSE on test set: {error:.2f}")
#print r2 score
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_hat)
print(f"R^2 on test set: {r2:.2f}")


RMSE on test set: 0.15
R^2 on test set: 1.00


The RMSE is quite low, at .15. The r squared is also extremely high at 1.0. 

In [None]:
# keep only columns that have '_-t1' -t2 or -t3 in them
new_features = ['Open_t-1', 'Close_t-1', 'Volume_t-1', 'High_t-1', 'Low_t-1',
                'Open_t-2', 'Close_t-2', 'Volume_t-2', 'High_t-2', 'Low_t-2',
                'Open_t-3', 'Close_t-3', 'Volume_t-3', 'High_t-3', 'Low_t-3'] + ['Open_t0']
X_new_train = X_train[feature_cols]
X_new_test = X_test[feature_cols]


(110, 318)

In [31]:
#now fitting new linear regression
linmodel_new = LinearRegression()
linmodel_new.fit(X_new_train, y_train)
# see how it performs on out of sample data
y_new_hat = linmodel_new.predict(X_new_test)
new_error = np.sqrt(np.mean((y_new_hat - y_test) ** 2))
print(f"RMSE on test set with new features: {new_error:.2f}")
# print r2 score
new_r2 = r2_score(y_test, y_new_hat)
print(f"R^2 on test set with new features: {new_r2:.2f}")

RMSE on test set with new features: 0.15
R^2 on test set with new features: 1.00


## Random Forest

## Neural Network

## XGBoost

## One more model I forget

# Conclusions