# Introduction

• What is the goal of your project?

• What is the data that you are using? What is the original data source if known?

• What does an instance in your data represent (e.g. a person, a transaction, etc.)? How many
instances are there?

• What is the target variable you are trying to predict?

• What are the features used to predict it? Give a few examples of the features.

• Provide any additional relevant information about your data if known (e.g. what time period it
covers, where it was collected, etc.).

The data will cover January 1, 2012 to January 1, 2025.

# Problem Setup

# Algorithms

# Conclusions

In [None]:
import pandas as pd
df = pd.read_csv("price_window.csv")
# Keep only the columns needed downstream
df = df[['Datetime', 'Open', 'High', 'Low', 'Close', 'Volume', 'Percent_Change']]
# Add min_to_Release: signed minutes relative to 09:30:00 (negative = before, positive = after)
df['Datetime'] = pd.to_datetime(df['Datetime'])
release_time = pd.to_datetime(df['Datetime'].dt.date.astype(str) + ' 09:30:00')
df['min_to_Release'] = (df['Datetime'] - release_time).dt.total_seconds() / 60
# add a column called predict that is 1 if min_to_Release is 2, else 0
df['predict'] = (df['min_to_Release'] == 2).astype(int)
# Reduce Datetime to the calendar date so rows can be grouped per release day
df['Datetime'] = df['Datetime'].dt.date

# Identify which dates (the 'Datetime' groups) contain at least one row with predict == 1
valid_releases = df.groupby('Datetime')['predict'].max()
valid_releases = valid_releases[valid_releases == 1].index

# Keep all rows from those dates, then drop anything later than 2 minutes after the release
df = df[df['Datetime'].isin(valid_releases)]
df = df[df['min_to_Release'] <= 2]
df.to_csv("price_window_valid.csv", index=False)

# Alternative local path for the same input file:
# df = pd.read_csv("C:/Users/miaca/OneDrive/Desktop/price_window.csv")
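
To make the `min_to_Release` and `predict` columns concrete, here is a minimal sketch on a made-up three-row frame. The timestamps and prices are hypothetical, and `dt.normalize()` plus a `Timedelta` is used as an equivalent way to build the 09:30 reference time. It assumes timezone-naive timestamps.

```python
import pandas as pd

# Hypothetical toy data: three one-minute bars around a 09:30 release.
toy = pd.DataFrame({
    "Datetime": ["2024-01-03 09:28:00", "2024-01-03 09:30:00", "2024-01-03 09:32:00"],
    "Close": [72.10, 72.15, 72.40],
})
toy["Datetime"] = pd.to_datetime(toy["Datetime"])

# Midnight of each day plus 9h30m gives that day's assumed release time.
release_time = toy["Datetime"].dt.normalize() + pd.Timedelta(hours=9, minutes=30)

# Signed offset in minutes: negative before the release, positive after.
toy["min_to_Release"] = (toy["Datetime"] - release_time).dt.total_seconds() / 60

# Flag the bar two minutes after the release (the row that supplies the target).
toy["predict"] = (toy["min_to_Release"] == 2).astype(int)

print(toy[["min_to_Release", "predict"]])
```

Only the 09:32 row gets `predict = 1`. A date with no bar at exactly +2 minutes has no flagged row, which is why the cell above filters on `valid_releases`.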


In [None]:
def pivot_market_data(df, x_minutes_before, cols_to_pivot=None):
    """
    Pivot into wide format: one row per release date, with one column per
    feature and minute offset for min_to_Release in [-x_minutes_before, 2].
    """
    if cols_to_pivot is None:
        cols_to_pivot = ['Open', 'High', 'Low', 'Close', 'Volume']

    # Filter for the minutes of interest (from -x_minutes_before to +2 inclusive)
    df_filtered = df[(df['min_to_Release'] <= 2) & (df['min_to_Release'] >= -x_minutes_before)].copy()

    # Melt to long format
    df_long = df_filtered.melt(
        id_vars=['Datetime', 'min_to_Release'],
        value_vars=cols_to_pivot,
        var_name='Feature',
        value_name='Value'
    )

    # Create composite column like 'Open_t-2'
    df_long['Feature_min'] = df_long['Feature'] + '_t' + df_long['min_to_Release'].astype(int).astype(str)

    # Pivot to wide format
    df_wide = df_long.pivot_table(
        index='Datetime',
        columns='Feature_min',
        values='Value'
    ).reset_index()
    # Drop every t+2 column except the target, Close_t2, so no post-release
    # information other than the target remains in the table
    cols_to_drop = [col for col in df_wide.columns if col.endswith('_t2') and col != 'Close_t2']
    df_wide = df_wide.drop(columns=cols_to_drop)
    return df_wide
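
# Build the wide table from the 60 minutes before the release up to +2 minutes
df_wide = pivot_market_data(df, x_minutes_before=60)

# Drop release dates that are missing any minute/feature combination
df_wide.dropna(inplace=True)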

df_wide.to_csv("price_window_valid_wide.csv", index=False)
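
The melt-then-pivot step is easier to follow on a tiny example. The sketch below repeats the same sequence of calls on hypothetical data (two dates, three minute offsets, two features) to show which column names come out. It illustrates the pattern and does not replace `pivot_market_data`.

```python
import pandas as pd

# Hypothetical long-format data: two release dates, three minute offsets each.
long_df = pd.DataFrame({
    "Datetime": ["2024-01-03"] * 3 + ["2024-01-10"] * 3,
    "min_to_Release": [-1, 0, 2] * 2,
    "Close": [72.10, 72.15, 72.40, 70.00, 70.05, 69.80],
    "Volume": [100, 250, 300, 120, 260, 310],
})

# One row per (date, minute, feature).
melted = long_df.melt(
    id_vars=["Datetime", "min_to_Release"],
    value_vars=["Close", "Volume"],
    var_name="Feature",
    value_name="Value",
)

# Composite names such as 'Close_t-1', 'Close_t0', 'Close_t2'.
melted["Feature_min"] = (
    melted["Feature"] + "_t" + melted["min_to_Release"].astype(int).astype(str)
)

# One row per date, one column per feature/minute pair.
wide = melted.pivot_table(index="Datetime", columns="Feature_min", values="Value").reset_index()

# Keep Close_t2 as the target and drop every other t+2 column.
wide = wide.drop(columns=[c for c in wide.columns if c.endswith("_t2") and c != "Close_t2"])

print(list(wide.columns))
```

Note that `pivot_table` aggregates with the mean by default, so duplicate bars for the same date and minute would be silently averaged rather than raising an error.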



Columns with t2:
['Close_t2']


In [6]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Fix the NumPy seed for reproducibility
np.random.seed(42)

df_wide = pd.read_csv("price_window_valid_wide.csv")
display(df_wide)
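# Define the cutoff index for a time-based split (first 80% train, last 20% test);
# this assumes the rows are already in chronological order
cutoff = int(len(df_wide) * 0.8)
# Features: every column except the t+2 values and the index/identifier/date columns
non_feature_cols = ['Close_t2', 'Unnamed: 0', 'Datetime', 'Release_Datetime', 'Date']
feature_cols = [col for col in df_wide.columns if '_t2' not in col and col not in non_feature_cols]
X = df_wide[feature_cols]
# Target: the close two minutes after the release
y = df_wide['Close_t2']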
# Split
X_train = X.iloc[:cutoff]
X_test = X.iloc[cutoff:]
y_train = y.iloc[:cutoff]
y_test = y.iloc[cutoff:]

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')






Unnamed: 0,Release_Datetime,Close_t-1,Close_t-10,Close_t-11,Close_t-12,Close_t-13,Close_t-14,Close_t-15,Close_t-16,Close_t-17,...,Volume_t-58,Volume_t-59,Volume_t-6,Volume_t-60,Volume_t-7,Volume_t-8,Volume_t-9,Volume_t0,Volume_t1,Date
0,2011-03-09 09:30:00-06:00,105.13,105.15,105.14,105.17,105.19,105.20,105.20,105.33,105.20,...,899.0,323.0,318.0,514.0,138.0,254.0,252.0,2458.0,2052.0,2011-03-09
1,2011-03-16 08:30:00-05:00,98.38,98.44,98.35,98.39,98.35,98.35,98.43,98.53,98.63,...,127.0,232.0,275.0,401.0,491.0,262.0,190.0,588.0,473.0,2011-03-16
2,2011-03-23 08:30:00-05:00,105.36,105.26,105.21,105.19,105.15,105.12,105.02,105.03,105.04,...,350.0,121.0,111.0,187.0,187.0,466.0,412.0,325.0,983.0,2011-03-23
3,2011-03-30 08:30:00-05:00,104.34,104.35,104.36,104.36,104.42,104.39,104.40,104.40,104.40,...,119.0,201.0,133.0,484.0,382.0,548.0,188.0,280.0,292.0,2011-03-30
4,2011-04-06 08:30:00-05:00,108.62,108.48,108.52,108.55,108.52,108.54,108.53,108.59,108.59,...,29.0,26.0,265.0,37.0,199.0,182.0,86.0,556.0,567.0,2011-04-06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
726,2025-02-05 09:30:00-06:00,71.66,71.61,71.60,71.54,71.53,71.57,71.55,71.61,71.63,...,313.0,319.0,246.0,547.0,206.0,379.0,513.0,2069.0,929.0,2025-02-05
727,2025-02-12 09:30:00-06:00,72.10,72.04,72.09,72.11,72.07,72.07,72.05,72.02,72.05,...,188.0,202.0,159.0,491.0,109.0,183.0,140.0,1269.0,515.0,2025-02-12
728,2025-02-20 11:00:00-06:00,72.71,72.75,72.72,72.71,72.70,72.72,72.75,72.77,72.77,...,166.0,140.0,100.0,158.0,109.0,76.0,94.0,932.0,325.0,2025-02-20
729,2025-02-26 09:00:00-06:00,68.93,68.94,68.89,68.86,68.88,68.94,68.93,68.83,68.82,...,640.0,1314.0,160.0,989.0,226.0,263.0,260.0,477.0,349.0,2025-02-26


ValueError: Input X contains NaN.
LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
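
The traceback says that at least one feature column in `X` holds a missing value. The displayed frame also shows columns (`Unnamed: 0`, `Release_Datetime`, `Date`) that the pivot cell above does not create, so the CSV on disk may come from a different run than the one that called `dropna`. The cause is not confirmed here. The sketch below shows one possible way to locate and drop the gaps before fitting. It runs on a hypothetical stand-in frame, not the real data, and dropping rows is only one option; imputation is another.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for df_wide: one identifier column, two features, one NaN.
toy = pd.DataFrame({
    "Release_Datetime": ["2024-01-03", "2024-01-10", "2024-01-17", "2024-01-24", "2024-01-31"],
    "Close_t-1": [72.1, 70.0, np.nan, 71.2, 69.9],
    "Close_t0": [72.2, 70.1, 71.0, 71.3, 70.0],
    "Close_t2": [72.4, 69.8, 71.1, 71.5, 70.2],
})

# Keep numeric predictors only and exclude the target.
feature_cols = [c for c in toy.select_dtypes(include="number").columns if c != "Close_t2"]

# Report which columns contain gaps before deciding how to handle them.
nan_counts = toy[feature_cols].isna().sum()
print(nan_counts[nan_counts > 0])

# Simplest option: drop any release with a missing feature or target value.
clean = toy.dropna(subset=feature_cols + ["Close_t2"])

# Chronological 80/20 split (rows assumed to be sorted by date).
cutoff = int(len(clean) * 0.8)
X_train, y_train = clean[feature_cols].iloc[:cutoff], clean["Close_t2"].iloc[:cutoff]
X_test = clean[feature_cols].iloc[cutoff:]

model = LinearRegression().fit(X_train, y_train)
print(model.predict(X_test))
```

Dropping rows keeps the model simple but discards whole release dates. If many dates have gaps, the imputation tools linked in the error message may preserve more of the sample.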