'''
# Predicting the S&P 500 Index Using Machine Learning

This notebook aims to build and evaluate machine learning models to predict the S&P 500 index. We will use both deep learning (LSTM) and classical machine learning (XGBoost) approaches. The focus will be on feature engineering, model training, and evaluation, with a special emphasis on maintaining the temporal order of the data to ensure realistic forecasting.

## Methodologies:
- **LSTM (Long Short-Term Memory)**: A type of recurrent neural network well-suited for time series data.
- **XGBoost (Extreme Gradient Boosting)**: A powerful gradient boosting framework known for its performance and speed.

## Importance of Feature Engineering and Model Evaluation:
- Feature engineering is crucial in financial forecasting as it can significantly impact model performance.
- Proper model evaluation, especially with time series data, ensures that the model generalizes well to unseen future data.
'''

In [1]:
%pip install yfinance
%pip install keras-tuner
%pip install xgboost

%pip install tensorflow


In [1]:
# ----------------------------
# 2. Importing Libraries (Code Cell)
# ----------------------------
# Install the data and plotting libraries (pyplot ships inside the matplotlib package)
%pip install pandas
%pip install numpy
%pip install matplotlib seaborn scikit-learn
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# For data preprocessing and model building
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, TimeSeriesSplit, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam
# keras-tuner is imported as `keras_tuner`; the old `kerastuner` alias is deprecated
from keras_tuner import RandomSearch, HyperParameters

# For handling dates and time series
from datetime import datetime
import yfinance as yf

# For XGBoost
import xgboost as xgb
from xgboost import XGBRegressor

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

In [8]:
# ----------------------------
# 3. Data Collection (Code Cell)
# ----------------------------
# Define the ticker symbol for S&P 500
ticker = '^GSPC'

# Define the date range for historical data
start_date = '2010-01-01'
end_date = datetime.today().strftime('%Y-%m-%d')

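# Fetch historical data using yfinance
print(f"Fetching historical data for {ticker} from {start_date} to {end_date}...")
sp500 = yf.download(ticker, start=start_date, end=end_date)

# Recent yfinance releases can return MultiIndex columns such as ('Close', '^GSPC');
# keep only the price-field level so that sp500['Close'] is a Series
if isinstance(sp500.columns, pd.MultiIndex):
    sp500.columns = sp500.columns.get_level_values(0)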

# Check if data was fetched successfully
if sp500.empty:
    raise ValueError("Failed to fetch data. Please check the ticker symbol and date range.")

print("Data fetched successfully!")

# Display the first few rows of the data
sp500.head()

Fetching historical data for ^GSPC from 2010-01-01 to 2025-05-08...


[*********************100%***********************]  1 of 1 completed
ERROR:yfinance:
1 Failed download:
ERROR:yfinance:['^GSPC']: YFRateLimitError('Too Many Requests. Rate limited. Try after a while.')


ValueError: Failed to fetch data. Please check the ticker symbol and date range.
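
The run above stopped on a rate-limit response. One common way to handle this is to retry the download with exponentially growing pauses. The sketch below is a generic helper and not part of yfinance; the name `retry_with_backoff` and its parameters are assumptions made for illustration. It expects a zero-argument callable that raises on failure, so an empty download would need to be turned into an exception first.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(fetch: Callable[[], T],
                       max_attempts: int = 4,
                       base_delay: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fetch` until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as exc:  # in real use, narrow this to the rate-limit error
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc!r}); retrying in {delay:.0f}s")
            sleep(delay)
    # Only reached when max_attempts is smaller than 1
    raise RuntimeError("max_attempts must be at least 1")


# Hypothetical usage with this notebook's variables (not executed here):
# def fetch_sp500():
#     data = yf.download(ticker, start=start_date, end=end_date)
#     if data.empty:
#         raise ValueError("empty download")
#     return data
#
# sp500 = retry_with_backoff(fetch_sp500)
```

The `sleep` parameter exists only so the waiting can be replaced in tests; with the defaults the pauses are 2, 4, and 8 seconds.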

In [None]:
# ----------------------------
# 4. Data Preprocessing and Feature Engineering (Code Cell)
# ----------------------------
# Suggested Features:
# 1. Historical Prices: 'Open', 'High', 'Low', 'Close', 'Volume'
# 2. Technical Indicators:
#    - Moving Averages (e.g., 50-day, 200-day)
#    - Relative Strength Index (RSI)
#    - Moving Average Convergence Divergence (MACD)
#    - Bollinger Bands
# 3. Additional Features: Daily return and Stochastic Oscillator (optional)

# Calculate Moving Averages
sp500['MA50'] = sp500['Close'].rolling(window=50).mean()
sp500['MA200'] = sp500['Close'].rolling(window=200).mean()

# Calculate Relative Strength Index (RSI)
def calculate_rsi(data, window=14):
    delta = data.diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=window).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=window).mean()
    rs = gain / loss
    rsi = 100 - (100 / (1 + rs))
    return rsi

sp500['RSI'] = calculate_rsi(sp500['Close'])

# Calculate Moving Average Convergence Divergence (MACD)
ema_12 = sp500['Close'].ewm(span=12, adjust=False).mean()
ema_26 = sp500['Close'].ewm(span=26, adjust=False).mean()
sp500['MACD'] = ema_12 - ema_26

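# Calculate Bollinger Bands (20-day moving average plus/minus two standard deviations)
sp500['MA20'] = sp500['Close'].rolling(window=20).mean()
sp500['20d_Std'] = sp500['Close'].rolling(window=20).std()
sp500['Upper_BB'] = sp500['MA20'] + (sp500['20d_Std'] * 2)
sp500['Lower_BB'] = sp500['MA20'] - (sp500['20d_Std'] * 2)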

# Calculate Daily Return
sp500['Daily_Return'] = sp500['Close'].pct_change()

# Calculate Stochastic Oscillator
def stochastic_oscillator(df, high_col='High', low_col='Low', close_col='Close', k_period=14, d_period=3):
    df['Stoch_K'] = ((df[close_col] - df[low_col].rolling(window=k_period).min()) /
                     (df[high_col].rolling(window=k_period).max() - df[low_col].rolling(window=k_period).min())) * 100
    df['Stoch_D'] = df['Stoch_K'].rolling(window=d_period).mean()
    return df

sp500 = stochastic_oscillator(sp500)

# Drop rows with NaN values resulting from rolling calculations
sp500.dropna(inplace=True)

# Select features for the model
features = ['Open', 'High', 'Low', 'Close', 'Volume', 'MA50', 'MA200', 'RSI', 'MACD', 'Upper_BB',
            'Lower_BB', 'Daily_Return', 'Stoch_K', 'Stoch_D']

# Define the target variable
# Predict the next day's closing price
sp500['Target'] = sp500['Close'].shift(-1)

# Drop the last row as it will have a NaN target
sp500.dropna(inplace=True)
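
The `calculate_rsi` function above averages gains and losses with a simple rolling mean. Many charting tools instead use Wilder's exponential smoothing, which gives slightly different values. The sketch below shows that variant for comparison; it is a swapped-in alternative, not the method used in this notebook, and the name `wilder_rsi` is an assumption.

```python
import numpy as np
import pandas as pd


def wilder_rsi(close: pd.Series, window: int = 14) -> pd.Series:
    """RSI with Wilder's smoothing (exponential average with alpha = 1 / window)."""
    delta = close.diff()
    gain = delta.clip(lower=0.0)      # positive moves, zero otherwise
    loss = (-delta).clip(lower=0.0)   # magnitude of negative moves, zero otherwise
    avg_gain = gain.ewm(alpha=1.0 / window, adjust=False, min_periods=window).mean()
    avg_loss = loss.ewm(alpha=1.0 / window, adjust=False, min_periods=window).mean()
    rs = avg_gain / avg_loss          # becomes inf when there are no losses
    return 100.0 - 100.0 / (1.0 + rs)


# Synthetic random-walk prices, for illustration only
rng = np.random.default_rng(0)
prices = pd.Series(100.0 + rng.normal(0.0, 1.0, 250).cumsum())
rsi = wilder_rsi(prices)
print(rsi.dropna().describe())
```

If this variant were used in the notebook, it could be stored in a separate column, for example `sp500['RSI_Wilder'] = wilder_rsi(sp500['Close'])`, so that both versions can be compared as features.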


In [None]:
# ----------------------------
# 5. Feature Scaling (Code Cell)
# ----------------------------
# Scaling is deferred until after the chronological train-test split in section 6.
# Fitting MinMaxScaler on the full history would leak test-period minima and maxima
# into the training features.

# ----------------------------
# 6. Preparing the Dataset for LSTM with Revised Train-Test Split (Code Cell)
# ----------------------------
# Define the cutoff date for the train-test split
split_date = '2016-01-01'

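# Ensure the split date falls inside the data range. '2016-01-01' is a market holiday,
# so an exact-match lookup in the index would always fail; compare against the range instead
split_ts = pd.Timestamp(split_date)
if not (sp500.index.min() < split_ts <= sp500.index.max()):
    raise ValueError(f"The split date {split_date} is outside the data range. Please choose a valid date.")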

# Split the data into training and testing sets
train_df = sp500.loc[sp500.index < split_date]
test_df = sp500.loc[sp500.index >= split_date]

print(f"Training data points: {train_df.shape[0]}")
print(f"Testing data points: {test_df.shape[0]}")

# Feature scaling: fit on the training period only, then apply to both periods
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_train = pd.DataFrame(scaler.fit_transform(train_df[features]),
                            columns=features, index=train_df.index)
scaled_test = pd.DataFrame(scaler.transform(test_df[features]),
                           columns=features, index=test_df.index)
# The target stays in price units so the error metrics remain interpretable
scaled_train['Target'] = train_df['Target']
scaled_test['Target'] = test_df['Target']

# Build (samples, timesteps, features) windows from the scaled data
def create_dataset(df, features, target, timesteps=60):
    feature_values = df[features].values
    target_values = df[target].values
    X, y = [], []
    for i in range(timesteps, len(df) + 1):
        X.append(feature_values[i-timesteps:i])
        # The window ends at row i-1, whose Target is the close of the following day
        y.append(target_values[i-1])
    return np.array(X), np.array(y)

timesteps = 60
X_train, y_train = create_dataset(scaled_train, features, 'Target', timesteps)

# Prepare the testing dataset
X_test, y_test = create_dataset(scaled_test, features, 'Target', timesteps)

print(f"Training samples: {X_train.shape[0]}")
print(f"Testing samples: {X_test.shape[0]}")
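
The alignment between each input window and its label is easy to get wrong by one day, so it can help to check it on a toy array. The sketch below uses made-up data and a hypothetical helper `make_windows`; it pairs every window of `timesteps` rows with the close of the row that follows the window. Because the `Target` column is `Close` shifted by one day, the label for a window whose last row is `end - 1` is `Target[end - 1]`, which is the same value as `Close[end]`.

```python
import numpy as np


def make_windows(features: np.ndarray, close: np.ndarray, timesteps: int):
    """Pair each window of `timesteps` rows with the close of the following row."""
    X, y = [], []
    for end in range(timesteps, len(features)):
        X.append(features[end - timesteps:end])  # rows end-timesteps .. end-1
        y.append(close[end])                     # the row right after the window
    return np.array(X), np.array(y)


# Toy data: 6 days with 2 features per day; closes run from 100 to 105
features = np.arange(12, dtype=float).reshape(6, 2)
close = np.arange(100.0, 106.0)

X, y = make_windows(features, close, timesteps=3)
print("X shape:", X.shape)
print("last row of first window:", X[0][-1])
print("labels:", y)
```

With six days and three timesteps there are three windows. The first window covers days 0 to 2 and is labelled with the close of day 3, which is 103.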


In [None]:
# ----------------------------
# 7. Building the LSTM Model with Hyperparameter Tuning (Code Cell)
# ----------------------------
# Define the model-building function for Keras Tuner
def build_model(hp):
    model = Sequential()
    model.add(LSTM(units=hp.Int('units', min_value=32, max_value=512, step=32),
                   return_sequences=True,
                   input_shape=(X_train.shape[1], X_train.shape[2])))
    model.add(Dropout(rate=hp.Float('dropout1', min_value=0.1, max_value=0.5, step=0.1)))
    model.add(LSTM(units=hp.Int('units2', min_value=32, max_value=512, step=32),
                   return_sequences=False))
    model.add(Dropout(rate=hp.Float('dropout2', min_value=0.1, max_value=0.5, step=0.1)))
    model.add(Dense(units=1))
    model.compile(optimizer=Adam(
                    hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])),
                  loss='mean_squared_error')
    return model

# Initialize Keras Tuner
tuner = RandomSearch(
    build_model,
    objective='val_loss',
    max_trials=20,
    executions_per_trial=2,
    directory='lstm_tuning',
    project_name='sp500_lstm_tuning'
)

# Perform hyperparameter tuning
tuner.search(X_train, y_train,
             epochs=100,
             validation_split=0.1,
             callbacks=[EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)],
             verbose=1)

# Get the best hyperparameters
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]

print(f"""
The best number of units in the first LSTM layer is {best_hps.get('units')}
The best dropout rate in the first LSTM layer is {best_hps.get('dropout1')}
The best number of units in the second LSTM layer is {best_hps.get('units2')}
The best dropout rate in the second LSTM layer is {best_hps.get('dropout2')}
The best learning rate for the optimizer is {best_hps.get('learning_rate')}
""")

# Build the model with the best hyperparameters
model = tuner.hypermodel.build(best_hps)
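
It is worth estimating how much of the search space the tuner above actually explores and how much training it can trigger. The arithmetic below is a sketch that assumes the `hp.Int` and `hp.Float` ranges include both endpoints, which gives 16 unit sizes and 5 dropout rates per layer.

```python
from math import prod

# Candidate values implied by the ranges in build_model (endpoints assumed inclusive)
search_space = {
    "units": list(range(32, 513, 32)),
    "dropout1": [0.1, 0.2, 0.3, 0.4, 0.5],
    "units2": list(range(32, 513, 32)),
    "dropout2": [0.1, 0.2, 0.3, 0.4, 0.5],
    "learning_rate": [1e-2, 1e-3, 1e-4],
}
max_trials, executions_per_trial, max_epochs = 20, 2, 100

n_configs = prod(len(values) for values in search_space.values())
coverage = max_trials / n_configs
epoch_budget = max_trials * executions_per_trial * max_epochs

print(f"Distinct configurations: {n_configs}")
print(f"Share sampled by {max_trials} trials: {coverage:.3%}")
print(f"Upper bound on training epochs: {epoch_budget}")
```

Under these assumptions there are 19,200 configurations, so 20 trials sample roughly 0.1% of them, and the search can run up to 4,000 epochs before early stopping is taken into account. Narrowing the ranges or lowering `executions_per_trial` are two ways to reduce the cost.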


In [None]:
# ----------------------------
# 8. Training the LSTM Model with Cross-Validation (Code Cell)
# ----------------------------
# Define Early Stopping
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

# Define TimeSeriesSplit for cross-validation
tscv = TimeSeriesSplit(n_splits=5)

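# Walk-forward training over expanding folds. The same model instance keeps its weights
# from fold to fold, so this is sequential fine-tuning, not independent cross-validation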
for train_index, val_index in tscv.split(X_train):
    X_train_cv, X_val_cv = X_train[train_index], X_train[val_index]
    y_train_cv, y_val_cv = y_train[train_index], y_train[val_index]
    model.fit(X_train_cv, y_train_cv,
              epochs=100,
              batch_size=32,
              validation_data=(X_val_cv, y_val_cv),
              callbacks=[early_stop],
              verbose=0)
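
To see which samples each fold trains and validates on, the expanding-window scheme can be written out by hand. The sketch below is a plain-Python approximation of what `TimeSeriesSplit` produces with its default settings (no gap, no cap on training size), as far as I understand them; the function name `expanding_window_splits` is an assumption.

```python
from typing import Iterator, List, Tuple


def expanding_window_splits(n_samples: int, n_splits: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, validation) index lists where training always precedes validation."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    first_test_start = n_samples - n_splits * test_size
    for start in range(first_test_start, n_samples, test_size):
        yield list(range(start)), list(range(start, start + test_size))


for fold, (train_idx, val_idx) in enumerate(expanding_window_splits(12, 3), start=1):
    print(f"fold {fold}: train 0..{train_idx[-1]}, validate {val_idx[0]}..{val_idx[-1]}")
```

For 12 samples and 3 splits the folds are train 0..2 / validate 3..5, train 0..5 / validate 6..8, and train 0..8 / validate 9..11. Every validation block lies strictly after its training block, which is the property that keeps future data out of training.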


In [None]:
# ----------------------------
# 9. Evaluating the LSTM Model (Code Cell)
# ----------------------------
# Make predictions; flatten the (n, 1) output so it matches the shape of y_test
predictions = model.predict(X_test)
y_pred = predictions.ravel()
y_actual = y_test

# Calculate evaluation metrics
mse = mean_squared_error(y_actual, y_pred)
mae = mean_absolute_error(y_actual, y_pred)

print(f"LSTM Model - Mean Squared Error: {mse}")
print(f"LSTM Model - Mean Absolute Error: {mae}")

# Plot the results
plt.figure(figsize=(14,7))
plt.plot(y_actual, label='Actual S&P 500')
plt.plot(y_pred, label='Predicted S&P 500')
plt.title('S&P 500 Prediction - LSTM')
plt.xlabel('Time')
plt.ylabel('Price')
plt.legend()
plt.show()
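
Error metrics on price levels are hard to judge in isolation, because tomorrow's close is usually near today's. A simple reference point is a persistence forecast that predicts each value with the previous one. The sketch below uses made-up numbers and hypothetical helper names; in the notebook the same comparison could be run on `y_test` and `y_pred`.

```python
import numpy as np


def regression_report(y_true, y_pred) -> dict:
    """Return MSE, RMSE and MAE for two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    err = y_pred - y_true
    mse = float(np.mean(err ** 2))
    return {"mse": mse, "rmse": float(np.sqrt(mse)), "mae": float(np.mean(np.abs(err)))}


def persistence_forecast(y_true):
    """Return (actual, naive) where each actual value is predicted by its predecessor."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    return y_true[1:], y_true[:-1]


# Illustrative values only, not results from this notebook
actual_prices = [100.0, 102.0, 101.0, 105.0]
model_preds = [101.0, 101.0, 104.0]  # hypothetical forecasts for the last three prices

actual, naive = persistence_forecast(actual_prices)
print("naive:", regression_report(actual, naive))
print("model:", regression_report(actual, model_preds))
```

A model is only adding value if its errors are clearly below those of the persistence forecast on the same test period.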


In [None]:
# ----------------------------
# 10. Building the XGBoost Model (Code Cell)
# ----------------------------
# Prepare the dataset for XGBoost
# XGBoost requires 2D input, so we flatten the sequences
X_train_xgb = X_train.reshape((X_train.shape[0], -1))
X_test_xgb = X_test.reshape((X_test.shape[0], -1))

# Define the XGBoost model
xgb_model = XGBRegressor(objective='reg:squarederror', n_estimators=100, learning_rate=0.1, random_state=42)

# Define the hyperparameter grid for XGBoost
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=xgb_model,
                                   param_distributions=param_grid,
                                   n_iter=20,
                                   cv=tscv,
                                   scoring='neg_mean_squared_error',
                                   random_state=42,
                                   verbose=1)

# Perform hyperparameter tuning for XGBoost
random_search.fit(X_train_xgb, y_train)

# Best parameters from XGBoost
best_xgb_params = random_search.best_params_
print(f"Best XGBoost Parameters: {best_xgb_params}")

# Train the final XGBoost model with best parameters
final_xgb_model = XGBRegressor(**best_xgb_params, random_state=42)
final_xgb_model.fit(X_train_xgb, y_train)

# Make predictions with XGBoost
y_pred_xgb = final_xgb_model.predict(X_test_xgb)

# Calculate evaluation metrics for XGBoost
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
mae_xgb = mean_absolute_error(y_test, y_pred_xgb)

print(f"XGBoost Model - Mean Squared Error: {mse_xgb}")
print(f"XGBoost Model - Mean Absolute Error: {mae_xgb}")

# Plot the results for XGBoost
plt.figure(figsize=(14,7))
plt.plot(y_test, label='Actual S&P 500')
plt.plot(y_pred_xgb, label='Predicted S&P 500')
plt.title('S&P 500 Prediction - XGBoost')
plt.xlabel('Time')
plt.ylabel('Price')
plt.legend()
plt.show()
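
Flattening turns each 60-step window of 14 features into 840 anonymous columns, which makes feature importances hard to read. The sketch below shows one way to name the flattened columns by feature and lag; the helper `flatten_windows` and the `_lagK` naming are assumptions, with `lag0` meaning the most recent row of the window.

```python
import numpy as np


def flatten_windows(X: np.ndarray, feature_names: list):
    """Flatten (samples, timesteps, features) to 2D and name every resulting column."""
    n_samples, timesteps, n_features = X.shape
    flat = X.reshape(n_samples, timesteps * n_features)
    # reshape walks timesteps first and features second, so the names follow that order
    names = [f"{name}_lag{timesteps - 1 - step}"
             for step in range(timesteps)
             for name in feature_names]
    return flat, names


# Toy input: 2 samples, 3 timesteps, 2 features
X_toy = np.arange(12, dtype=float).reshape(2, 3, 2)
flat, names = flatten_windows(X_toy, ["Close", "RSI"])
print(names)
print(flat.shape)
```

With names attached, the importances reported by a fitted tree model can be matched back to a specific feature and day within the window.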


In [None]:
# ----------------------------
# 11. Evaluating the XGBoost Model (Code Cell)
# ----------------------------

# Make predictions with the trained XGBoost model on the test set
y_pred_xgb = final_xgb_model.predict(X_test_xgb)

# Calculate Mean Squared Error (MSE) for the XGBoost model
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
print(f"XGBoost Model - Mean Squared Error (MSE): {mse_xgb:.4f}")

# Calculate Mean Absolute Error (MAE) for the XGBoost model
mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
print(f"XGBoost Model - Mean Absolute Error (MAE): {mae_xgb:.4f}")

# Optionally, you can calculate additional metrics such as Root Mean Squared Error (RMSE)
rmse_xgb = np.sqrt(mse_xgb)
print(f"XGBoost Model - Root Mean Squared Error (RMSE): {rmse_xgb:.4f}")

# Plotting the Actual vs. Predicted values for the XGBoost model
plt.figure(figsize=(14, 7))
plt.plot(test_df.index[-len(y_test):], y_test, label='Actual S&P 500', color='blue')
plt.plot(test_df.index[-len(y_pred_xgb):], y_pred_xgb, label='Predicted S&P 500 (XGBoost)', color='red', alpha=0.7)
plt.title('S&P 500 Prediction - XGBoost Model', fontsize=16)
plt.xlabel('Date', fontsize=14)
plt.ylabel('Price', fontsize=14)
plt.legend(fontsize=12)
plt.grid(True)
plt.tight_layout()
plt.show()
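
Once both models have been evaluated, their metrics can be rendered as a markdown table instead of being typed in by hand. The sketch below uses a hypothetical helper `metrics_table`; the numbers are placeholders for illustration and are not results from this notebook.

```python
def metrics_table(results: dict) -> str:
    """Render {model_name: {'mse': ..., 'mae': ...}} as a markdown table."""
    header = "| Model | Mean Squared Error (MSE) | Mean Absolute Error (MAE) |"
    divider = "|---|---|---|"
    rows = [f"| {name} | {m['mse']:.4f} | {m['mae']:.4f} |" for name, m in results.items()]
    return "\n".join([header, divider, *rows])


# Placeholder values only; in the notebook these would be mse, mae, mse_xgb and mae_xgb
example = {
    "LSTM": {"mse": 1.0, "mae": 0.5},
    "XGBoost": {"mse": 2.0, "mae": 0.75},
}
print(metrics_table(example))
```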

'''
## Model Comparison

### Evaluation Metrics:

| Model   | Mean Squared Error (MSE) | Mean Absolute Error (MAE) |
|---------|--------------------------|---------------------------|
| LSTM    | {mse}                    | {mae}                     |
| XGBoost | {mse_xgb}                | {mae_xgb}                 |

### Analysis:

- **LSTM**: Can capture temporal dependencies across the input window, but typically needs more data and computational resources than tree-based models.
- **XGBoost**: Works well on structured, tabular features and exposes feature-importance scores, but it treats a flattened window as independent columns and may struggle with very long sequences.

### Considerations:

- **Interpretability**: XGBoost is easier to interpret because it reports feature importance.
- **Computational Efficiency**: XGBoost is generally faster to train than an LSTM.
- **Performance**: Neither model is guaranteed to win; compare both on the same test period using the metrics above.

### Recommendation:

Evaluate both models against your specific requirements and constraints. Consider ensemble methods or model stacking if a single model is not accurate enough.
'''

'''
## Conclusion and Next Steps

### Conclusion:

This notebook provided a comprehensive guide to building and evaluating machine learning models for predicting the S&P 500 index. We implemented both LSTM and XGBoost models with a revised train-test split to ensure that the models are trained on past data and tested on future data.

### Next Steps:

1. **Feature Expansion**: Incorporate additional features such as macroeconomic indicators, sentiment analysis, or other alternative data sources.
2. **Model Optimization**: Further tune model hyperparameters and explore different architectures (e.g., GRU, CNN-LSTM).
3. **Ensemble Methods**: Combine models to leverage the strengths of each approach.
4. **Model Deployment**: Deploy the model to a production environment for real-time prediction and trading.
5. **Backtesting**: Perform thorough backtesting with transaction costs and slippage to evaluate the model's profitability.
6. **Risk Management**: Integrate risk management strategies (e.g., stop-loss, position sizing) to enhance the trading strategy.
7. **Monitoring and Maintenance**: Continuously monitor model performance and implement mechanisms for model drift detection and updating.
'''
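
The backtesting step listed above can be prototyped in a few lines before moving to a dedicated framework. The sketch below is a deliberately simplified long-or-flat rule with a proportional cost charged on every position change; the function name, the cost model, and the toy numbers are all assumptions, and slippage is not modelled.

```python
import numpy as np


def backtest_long_flat(prices, predictions, cost_per_trade: float = 0.0005) -> np.ndarray:
    """Hold the index over day t+1 when the forecast for t+1 exceeds the price at t.

    predictions[t] is the forecast of prices[t + 1]; the last prediction is unused.
    Returns the equity curve, starting from 1.0 before the first day.
    """
    prices = np.asarray(prices, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    position = (predictions[:-1] > prices[:-1]).astype(float)     # decided at the close of day t
    daily_ret = prices[1:] / prices[:-1] - 1.0                    # realised from t to t+1
    trades = np.abs(np.diff(np.concatenate(([0.0], position))))  # 1.0 whenever the position flips
    net = position * daily_ret - trades * cost_per_trade
    return np.cumprod(1.0 + net)


# Toy series, for illustration only
prices = [100.0, 110.0, 99.0, 99.0]
predictions = [105.0, 100.0, 100.0, 0.0]
print("no costs:  ", backtest_long_flat(prices, predictions, cost_per_trade=0.0))
print("1% per trade:", backtest_long_flat(prices, predictions, cost_per_trade=0.01))
```

In the toy series the strategy is long on the first day, flat on the second, and long again on the third, so it trades three times. Costs alone pull the final equity from 1.10 down to about 1.068, which illustrates why a forecast with low error can still be unprofitable.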