In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load data from CSV
crude_data = pd.read_csv('crude_oil_5y.csv', parse_dates=['Date'])

# Ensure the 'Date' column is a timezone-aware datetime (UTC)
crude_data['Date'] = pd.to_datetime(crude_data['Date'], utc=True)

# Sort by date
crude_data.sort_values(by='Date', inplace=True)

# Drop rows with missing values, if any
crude_data.dropna(inplace=True)

In [7]:
crude_data.columns

Index(['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Dividends',
       'Stock Splits'],
      dtype='object')

In [8]:
# Display DataFrame for context
crude_data.head()

   Date                       Open       High       Low        Close      Volume  Dividends  Stock Splits
0  2019-02-05 05:00:00+00:00  54.860001  55.209999  53.470001  53.66      609756  0.0        0.0
1  2019-02-06 05:00:00+00:00  53.73      54.299999  52.860001  54.009998  606720  0.0        0.0
2  2019-02-07 05:00:00+00:00  53.939999  54.209999  51.799999  52.639999  749010  0.0        0.0
3  2019-02-08 05:00:00+00:00  52.59      52.990002  52.080002  52.720001  621003  0.0        0.0
4  2019-02-11 05:00:00+00:00  52.66      52.779999  51.23      52.41      750242  0.0        0.0


In [9]:
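# Extract the feature and the target variable.
# The only feature is the row index, which serves as a stand-in for time;
# the target is the daily closing price.
X = crude_data.index
y = crude_data['Close']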

In [10]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
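
Because the rows are ordered by date, a shuffled split places some later days in the training set and some earlier days in the test set. One alternative is to hold out the most recent rows instead. The following is a minimal sketch of that idea, not what this notebook does: `chronological_split` is a hypothetical helper and the numbers are a made-up stand-in for the real data.

```python
def chronological_split(X, y, test_size=0.2):
    """Hold out the last `test_size` fraction of rows, without shuffling."""
    n_test = int(round(len(X) * test_size))
    split = len(X) - n_test
    return X[:split], X[split:], y[:split], y[split:]

# Made-up stand-in for the row index and closing prices (not the real data)
X_demo = list(range(10))
y_demo = [50.0 + i for i in range(10)]

X_tr, X_te, y_tr, y_te = chronological_split(X_demo, y_demo)
print("train rows:", X_tr)
print("test rows:", X_te)
```

With scikit-learn, passing `shuffle=False` to `train_test_split` should give the same kind of ordered split. The metrics reported below come from the shuffled split above, so they would change under this scheme.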

In [11]:
# Instantiate the linear regression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train.values.reshape(-1, 1), y_train)
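
With a single feature, fitting a linear regression amounts to finding the slope and intercept of the least-squares line. The sketch below computes them by hand on made-up numbers to show what the fit step is doing conceptually; `fit_line` is a hypothetical helper, not part of scikit-learn, and the library's internal implementation differs.

```python
def fit_line(x, y):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance of x and y divided by the variance of x gives the slope
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up row indices and prices that lie exactly on a line
x_demo = [0, 1, 2, 3]
y_demo = [50.0, 52.0, 54.0, 56.0]

slope, intercept = fit_line(x_demo, y_demo)
print("slope:", slope, "intercept:", intercept)
```

For the fitted scikit-learn model, the corresponding values are available as `model.coef_` and `model.intercept_`.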

In [13]:
# Make predictions on the testing set
y_pred = model.predict(X_test.values.reshape(-1, 1))

Mean Squared Error (MSE): This is the average of the squared differences between the model's predicted closing prices and the actual closing prices in the test set. Squaring makes every error positive and penalizes large misses more heavily than small ones, so a lower MSE means the predictions sit closer to the actual prices. Because the errors are squared, the MSE is expressed in squared price units: an MSE of 294.14949 means the average squared error is about 294.15 dollars squared, not that the predictions are off by $294.

R-squared (R2): This is the fraction of the variance in closing prices that the model explains. An R-squared of 0.38339 means about 38.3% of the variation in the test-set closing prices is accounted for by the model. The only feature here is the row index, which acts as a stand-in for time, so the model captures a straight-line trend and nothing else; drivers such as supply, demand, or geopolitical events are not inputs. A value closer to 1 would mean the fitted line tracks the prices more closely.
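
To make the two definitions concrete, here is a small sketch that computes both metrics by hand. The helper names `mse_by_hand` and `r2_by_hand` and the four data points are made up for illustration; the notebook itself uses scikit-learn's `mean_squared_error` and `r2_score`.

```python
def mse_by_hand(y_true, y_pred):
    """Average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_by_hand(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up actual and predicted prices (not the notebook's data)
actual = [50.0, 52.0, 54.0, 56.0]
predicted = [51.0, 51.0, 55.0, 55.0]

mse_demo = mse_by_hand(actual, predicted)
r2_demo = r2_by_hand(actual, predicted)
print("MSE:", mse_demo)
print("R2:", r2_demo)
```

Every prediction here misses by exactly 1, so the MSE is 1. The actual values vary around their mean with a total sum of squares of 20 against a residual sum of 4, which gives an R-squared of 0.8.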

In [14]:
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Round both MSE and R2 to 5 decimal places
mse_rounded = round(mse, 5)
r2_rounded = round(r2, 5)

print("Mean Squared Error (MSE):", mse_rounded)
print("R-squared (R2):", r2_rounded)

Mean Squared Error (MSE): 294.14949
R-squared (R2): 0.38339


Root Mean Squared Error (RMSE): Imagine a graph where each point is an actual closing price of crude oil over the past five years, and a line through those points represents the model's predictions. For each point, measure the vertical distance to the line, square it (so every value is positive), average the squared distances, and then take the square root of that average. The result is the RMSE. Unlike the MSE, it is expressed in the same units as the prices, which makes it easier to interpret.

In this case, the RMSE of 17.15079 means the model's predictions on the test set are typically off by about $17.15. This is not a plain average of the absolute errors: because of the squaring, large misses count for more, so the RMSE is always at least as large as the mean absolute error. Lower RMSE values indicate more accurate predictions.

In [15]:
import math

# Calculate RMSE
rmse = math.sqrt(mse)

# Round RMSE to 5 decimals
rmse_rounded = round(rmse, 5)

print("Root Mean Squared Error (RMSE):", rmse_rounded)

Root Mean Squared Error (RMSE): 17.15079
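
RMSE weights large errors more heavily than a plain average of absolute errors does. The sketch below compares it with the mean absolute error (MAE) on made-up numbers in which one prediction misses badly. MAE is not computed anywhere in this notebook; it appears here only as a point of comparison.

```python
import math

# Made-up actual and predicted prices; the last prediction is off by 6
actual = [50.0, 52.0, 54.0, 56.0]
predicted = [51.0, 51.0, 55.0, 50.0]

errors = [a - p for a, p in zip(actual, predicted)]

# Mean absolute error: plain average of the error magnitudes
mae_demo = sum(abs(e) for e in errors) / len(errors)

# Root mean squared error: square, average, then take the square root
rmse_demo = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print("MAE:", mae_demo)
print("RMSE:", round(rmse_demo, 5))
```

The single large miss pulls the RMSE above the MAE, which is why an RMSE is better read as a typical error size that is sensitive to outliers than as a literal average distance.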
