# Introduction

The goal of our project is to predict the price of Oil futures two minutes after release. 

• What is the data that you are using? What is the original data source if known?

• What does an instance in your data represent (e.g. a person, a transaction, etc.)? How many
instances are there?

• What is the target variable you are trying to predict?

• What are the features used to predict it? Give a few examples of the features.

• Provide any additional relevant information about your data if known (e.g. what is the time
period, what place is it collected from, etc.

Data will be from jan 1 2012 - jan 1 2025

# Problem Setup

In [None]:
#replace this with your own path to the CSV file
import pandas as pd
import numpy as np
df = pd.read_csv("C:/Users/miaca/OneDrive/Desktop/full_data.csv")
#This removes all t2, t1, and t0 columns besides open_t0. This is the data we would have if we predicted close_t2 at t0 
feature_cols = [col for col in df.columns if '_t2' not in col and 't_1' not in col and 't_0' not in col and col not in ['Close_t2', 'Release_Datetime', 'Date']] + ['Open_t0']
feature_cols = [ col for col in feature_cols if col not in ['Unnamed: 0', 'Release Date']]
X = df[feature_cols]
y = df['Close_t2']


Unnamed: 0,Close_t-60,High_t-60,Low_t-60,Open_t-60,Volume_t-60,Close_t-59,High_t-59,Low_t-59,Open_t-59,Volume_t-59,...,High_t1,Low_t1,Open_t1,Volume_t1,Actual,Forecast,Previous,Weekly Net Import,Weekly Production,Open_t0
0,105.46,105.56,105.44,105.56,514.0,105.46,105.52,105.45,105.46,323.0,...,105.48,105.21,105.33,2052.0,2500000.0,1100000.0,-400000.0,57869000.0,39151000.0,105.24
1,98.82,98.89,98.77,98.88,401.0,98.78,98.84,98.73,98.82,232.0,...,98.55,98.44,98.44,473.0,1700000.0,1800000.0,2500000.0,60529000.0,39137000.0,98.4
2,105.52,105.53,105.46,105.47,187.0,105.5,105.53,105.48,105.52,121.0,...,105.5,105.27,105.4,983.0,2100000.0,2000000.0,1700000.0,62671000.0,38990000.0,105.35
3,104.65,104.67,104.56,104.6,484.0,104.65,104.65,104.57,104.64,201.0,...,104.4,104.35,104.37,292.0,2900000.0,2000000.0,2100000.0,63658000.0,38976000.0,104.34
4,108.45,108.46,108.43,108.43,37.0,108.44,108.46,108.43,108.46,26.0,...,108.63,108.48,108.62,567.0,2000000.0,1300000.0,2900000.0,62412000.0,39466000.0,108.6


In [55]:
#Splitting data into train, test and validation sets. Splitting the data 
import numpy as np
np.random.seed(42)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
#Scaling the data

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

# Algorithms

## Linear Regression / LASSO (Mia Callahan)


Running a linear regression acts as a baseline for our other models. We expect that the linear regression will not perform well, as the data likely has non-linear relationships. The bias-variance tradeoff likely will lead the linear regression to underfit the data as it is a very basic model. However, it will be interesting to see how it performs especially in comparison to the more complex models we will use later.

In [56]:
# fitting a linear regression model
import numpy as np
np.random.seed(42)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
linmodel = LinearRegression()
linmodel.fit(X_train_scaled, y_train)
linmodel.intercept_, linmodel.coef_

(71.71177474402737,
 array([-2.41949041e+01, -4.27718140e+00, -1.60463881e+00,  5.30702272e+00,
        -2.30654833e-04,  6.24477984e+00, -3.16396471e+00, -4.62059947e+00,
         2.71282307e+01, -1.30317313e-03,  9.98514319e+00, -5.19220377e-01,
        -7.00086529e-01,  3.43233614e-01,  1.17903759e-02,  1.20060990e+01,
        -3.40093055e+00,  3.25519133e+00, -8.73553350e+00,  1.26126065e-02,
         1.23391459e+01,  4.11215371e+00, -3.14744128e+00, -7.64907093e+00,
        -6.17136736e-03,  1.39148850e+00, -6.32265073e-01, -5.77784890e+00,
        -1.27345905e+01, -9.95758960e-03,  2.52536186e-01,  1.28642430e+01,
        -6.02304604e-01, -5.42731743e+00,  4.00473708e-04,  6.88779083e+00,
        -1.32449768e+00,  7.35996211e+00, -1.03044733e+01, -9.20374442e-03,
        -1.80850778e+01, -9.21455725e+00,  8.67656659e+00, -7.12266756e+00,
         1.31958508e-02, -1.69146534e+01, -7.69058363e+00,  1.87691512e+00,
         1.90484303e+01,  8.96623830e-03, -8.21226311e+00,  6.673701

Because there are so many features, we can see that the list of coefficients is very long. It is likely that the linear reegression will overfit because it will try to fit all of the features, even if they are not relevant. After seeing performance, we will reduce the number of features to see if this improves out of sample performance.

In [57]:
# see how it performs on out of sample data
y_hat = linmodel.predict(X_test_scaled)
error = np.sqrt(np.mean((y_hat - y_test) ** 2))
print(f"RMSE on test set: {error:.2f}")
#print r2 score
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_hat)
print(f"R^2 on test set: {r2:.2f}")


RMSE on test set: 0.16
R^2 on test set: 1.00


The RMSE is quite low, at .16. Additionally, 100 percent of the variance can be explained by the model. However, with only 300 observations, even though there was good out of sample performance, it is possible that the model is overfitting

In [None]:
#lasso regression with alpha selection using cross-validation
from sklearn.linear_model import Lasso
mses = []
best_mse = float('inf')
best_alpha = None
alphas = [10**(-x) for x in range(-60, 60)]  # This generates [0.1, 0.01, 0.001, 0.0001, 0.00001]
for alpha in alphas:
    lasso = Lasso(alpha=alpha,  random_state=42)
    lasso.fit(X_train_scaled, y_train)
    mse = mean_squared_error(y_val, lasso.predict(X_val_scaled))
    mses.append(mse)
    if mse < best_mse:
        best_mse = mse
        best_alpha = alpha
print(f"Best alpha: {best_alpha}")

  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = c

Best alpha: 0.01
[1000000000000000000000000000000000000000000000000000000000000, 100000000000000000000000000000000000000000000000000000000000, 10000000000000000000000000000000000000000000000000000000000, 1000000000000000000000000000000000000000000000000000000000, 100000000000000000000000000000000000000000000000000000000, 10000000000000000000000000000000000000000000000000000000, 1000000000000000000000000000000000000000000000000000000, 100000000000000000000000000000000000000000000000000000, 10000000000000000000000000000000000000000000000000000, 1000000000000000000000000000000000000000000000000000, 100000000000000000000000000000000000000000000000000, 10000000000000000000000000000000000000000000000000, 1000000000000000000000000000000000000000000000000, 100000000000000000000000000000000000000000000000, 10000000000000000000000000000000000000000000000, 1000000000000000000000000000000000000000000000, 100000000000000000000000000000000000000000000, 10000000000000000000000000000000000000000000, 1

  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


The best alpha is 0.01, so we will use this to fit the new lasso regression.

In [58]:
lasso_best = Lasso(alpha=best_alpha, random_state=42)
lasso_best.fit(X_train_scaled, y_train)
# see how it performs on out of sample data
y_hat_lasso = lasso_best.predict(X_test_scaled)
error_lasso = np.sqrt(np.mean((y_hat_lasso - y_test) ** 2))
print(f"RMSE on test set with Lasso: {error_lasso:.2f}")

RMSE on test set with Lasso: 0.41


  model = cd_fast.enet_coordinate_descent(


The RMSE of the lasso model is higher than the linear regression at .41. This is likely because the LASSO model reduces the number of features the model has to work with. 

## Random Forest

## Neural Network

## XGBoost

## One more model I forget

# Conclusions