# Sugar Prices

The sugar price data comes from [Agricultural Futures Prices (Kaggle, originally Yahoo Finance)](https://www.kaggle.com/datasets/guillemservera/agricultural-futures).

Historically, sugar production was important in the growth of slavery in Louisiana and in the U.S. annexation of Hawaii.

Sugarcane-producing states:
* Florida: The largest sugarcane-producing region in the US, with most production in Palm Beach County
* Louisiana: A major producer of sugarcane
* Texas
* Hawaii

Sugar beet-producing states:
* California
* Colorado
* Idaho
* Michigan
* Minnesota
* Montana
* Nebraska
* North Dakota
* Oregon
* Washington
* Wyoming

In [67]:
import yfinance as yf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from seaborn import set_style
import seaborn as sns
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import TimeSeriesSplit  # used by every cross-validation loop below
import time

# Data Processing 

We first need to align the three data sources on a common date index. The production data is annual, the weather data is daily, and the finance data covers business days only. We also don't have 2025 production data, and my weather data runs through March 19, 2025 (a more recent pull is possible).

In [None]:
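# Assumes the first column of the finance and weather files holds the dates,
# so both frames get a DatetimeIndex (needed for the date slicing and reindexing below).
fin = pd.read_csv('../sugar_data/finance_data.csv', index_col=0, parse_dates=True)
prod = pd.read_csv('../sugar_data/production_data.csv')
weather = pd.read_csv('../sugar_data/weather_data.csv', index_col=0, parse_dates=True)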

In [11]:
# Filter the DataFrame for the desired years.
prod_filtered = prod[(prod['year'] >= 2005) & (prod['year'] <= 2024)]

# Group by 'year' and sum the 'Value' column.
prod_grouped = prod_filtered.groupby('year', as_index=False)['Value'].sum()

# Rename the column to something more descriptive (e.g., 'production').
prod_grouped.rename(columns={'Value': 'production'}, inplace=True)

# Display the cleaned-up production data.
print(prod_grouped)

    year    production
0   2005  2.251679e+07
1   2006  1.655668e+07
2   2007  1.716306e+07
3   2008  4.837029e+07
4   2009  2.511015e+06
5   2010  5.315556e+07
6   2011  1.483909e+06
7   2012  5.988276e+08
8   2013  6.242756e+07
9   2014  1.323900e+08
10  2015  3.622623e+07
11  2016  9.081382e+07
12  2017  2.238663e+10
13  2018  1.193610e+10
14  2019  1.023751e+10
15  2020  8.103290e+08
16  2021  5.254510e+08
17  2022  1.995750e+09
18  2023  4.208000e+07
19  2024  2.446300e+07


In [None]:
weather.to_csv('../sugar_data/weather_data.csv')

In [None]:
fin.to_csv('../sugar_data/finance_data.csv')

In [34]:
# Define the period of interest.
start_date = '2005-01-01'
end_date = '2025-03-19'

# 1. Create a business day date range.
business_dates = pd.date_range(start=start_date, end=end_date, freq='B')

# 2. Impute annual production data onto the business days.
# Assume prod_grouped is a DataFrame with columns ['year', 'production'].
prod_expanded = pd.DataFrame(index=business_dates)
prod_expanded['year'] = prod_expanded.index.year

# Create a production series by setting the index to 'year'
production_series = prod_grouped.set_index('year')['production']
# Map each business day to the production value for that year.
prod_expanded['production'] = prod_expanded['year'].map(production_series)
# Optionally, drop the auxiliary 'year' column.
prod_expanded.drop(columns=['year'], inplace=True)

In [36]:
prod_expanded

,production
2005-01-03,22516789.0
2005-01-04,22516789.0
2005-01-05,22516789.0
2005-01-06,22516789.0
2005-01-07,22516789.0
...,...
2025-03-13,
2025-03-14,
2025-03-17,
2025-03-18,


In [37]:
weather_business = weather[weather.index.isin(business_dates)]
weather_business

date,FL_max_temp_mean,FL_max_temp_var,FL_min_temp_mean,FL_min_temp_var,FL_avg_temp_mean,FL_avg_temp_var,FL_precip_mean,FL_precip_var,FL_snow_mean,FL_snow_var,LA_max_temp_mean,LA_max_temp_var,LA_min_temp_mean,LA_min_temp_var,LA_avg_temp_mean,LA_avg_temp_var,LA_precip_mean,LA_precip_var,LA_snow_mean,LA_snow_var
2005-01-03,74.530074,7.791705,60.912226,2.057273,67.721526,2.682134,2.775408e-07,4.995726e-08,0.0,0.0,74.530074,7.791705,60.912226,2.057273,67.721526,2.682134,2.775408e-07,4.995726e-08,0.0,0.0
2005-01-04,76.341925,2.810333,58.742993,5.804291,67.542544,2.733289,0.000000e+00,0.000000e+00,0.0,0.0,76.341925,2.810333,58.742993,5.804291,67.542544,2.733289,0.000000e+00,0.000000e+00,0.0,0.0
2005-01-05,76.184676,2.990730,61.412292,3.618560,68.798794,2.513664,1.343158e-02,1.163041e-03,0.0,0.0,76.184676,2.990730,61.412292,3.618560,68.798794,2.513664,1.343158e-02,1.163041e-03,0.0,0.0
2005-01-06,74.942345,13.844986,54.127192,19.193575,64.534485,10.379555,1.479083e-01,1.193081e-02,0.0,0.0,74.942345,13.844986,54.127192,19.193575,64.534485,10.379555,1.479083e-01,1.193081e-02,0.0,0.0
2005-01-07,68.500560,44.847716,52.731653,14.136773,60.616413,16.100480,2.894701e-01,1.452934e-01,0.0,0.0,68.500560,44.847716,52.731653,14.136773,60.616413,16.100480,2.894701e-01,1.452934e-01,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2025-03-13,81.096189,2.651938,57.859202,9.009650,69.477348,4.458986,1.824144e-04,2.265630e-05,0.0,0.0,81.096189,2.651938,57.859202,9.009650,69.477348,4.458986,1.824144e-04,2.265630e-05,0.0,0.0
2025-03-14,80.304420,2.595409,64.408529,1.396591,72.355977,0.835358,6.121287e-07,6.121250e-08,0.0,0.0,80.304420,2.595409,64.408529,1.396591,72.355977,0.835358,6.121287e-07,6.121250e-08,0.0,0.0
2025-03-17,71.831232,8.497148,40.613640,2.985646,56.220881,2.924497,1.108208e-05,1.108086e-06,0.0,0.0,71.831232,8.497148,40.613640,2.985646,56.220881,2.924497,1.108208e-05,1.108086e-06,0.0,0.0
2025-03-18,73.685133,1.898568,44.804156,5.522434,59.243382,2.392422,0.000000e+00,0.000000e+00,0.0,0.0,73.685133,1.898568,44.804156,5.522434,59.243382,2.392422,0.000000e+00,0.000000e+00,0.0,0.0


In [39]:
weather_business = weather_business.reindex(business_dates)

In [42]:
start_date = '2005-01-01'
end_date = '2025-03-19'

# Extract rows within the specified date range
fin_small = fin.loc[start_date:end_date]

fin_small

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,Year,Month,Day,...,High-Close,Low-Close,TR,14D_ATR,Volume_Volatility_Ratio,14D_RSI,7D_MA,14D_MA,7D_EMA,14D_EMA
2005-01-03,9.080000,9.200000,9.070000,9.170000,34423,0.0,0.0,2005,1,3,...,0.160000,0.030000,0.160000,0.143572,3.060048e+06,73.076905,9.000000,8.835714,9.008788,8.914379
2005-01-04,9.190000,9.230000,9.020000,9.030000,24746,0.0,0.0,2005,1,4,...,0.059999,0.150000,0.209999,0.148571,2.000109e+06,63.793069,9.024286,8.858571,9.014091,8.929795
2005-01-05,8.920000,8.980000,8.860000,8.980000,14882,0.0,0.0,2005,1,5,...,0.050000,0.170000,0.170000,0.149286,1.271632e+06,67.889846,9.041429,8.886429,9.005568,8.936489
2005-01-06,8.980000,9.000000,8.880000,8.990000,47304,0.0,0.0,2005,1,6,...,0.020000,0.099999,0.120000,0.147143,4.111412e+06,70.754652,9.041429,8.917857,9.001676,8.943624
2005-01-07,8.990000,9.010000,8.690000,8.710000,58376,0.0,0.0,2005,1,7,...,0.020000,0.300000,0.320001,0.162143,3.989646e+06,59.055128,8.997143,8.934286,8.928757,8.912474
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2025-03-13,18.840000,19.270000,18.760000,19.250000,47012,0.0,0.0,2025,3,13,...,0.410000,0.100000,0.510000,0.630715,1.710616e+06,31.607149,18.610000,19.212857,18.880568,19.092382
2025-03-14,19.170000,19.320000,18.750000,19.190001,43753,0.0,0.0,2025,3,14,...,0.070000,0.500000,0.570000,0.641429,1.591733e+06,32.536770,18.751429,19.077143,18.957926,19.105398
2025-03-17,19.250000,20.000000,19.219999,19.969999,89334,0.0,0.0,2025,3,17,...,0.809999,0.029999,0.809999,0.655714,3.020160e+06,37.264960,19.014286,18.970714,19.210945,19.220678
2025-03-18,19.900000,20.090000,19.629999,19.990000,72006,0.0,0.0,2025,3,18,...,0.120001,0.340000,0.460001,0.622858,2.576454e+06,43.564362,19.254286,18.924286,19.405708,19.323254


In [43]:
# Use the finance DataFrame's index as the reference
fin_dates = fin_small.index

# Reindex the weather and production DataFrames to have the same dates as fin_small.
# This will leave NaN for any dates that are missing in the original data.
weather_small = weather.reindex(fin_dates)
prod_small = prod_expanded.reindex(fin_dates)

print("Weather shape:", weather_small.shape)
print("Production shape:", prod_small.shape)
print("Finance shape:", fin_small.shape)


Weather shape: (5080, 20)
Production shape: (5080, 1)
Finance shape: (5080, 27)
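
Matching shapes do not show how many values `reindex` had to fill with NaN. A minimal sketch of counting them, on toy data (the frame `toy_weather` and its `temp` column are made up for illustration, not taken from the real files):

```python
import pandas as pd

# Toy daily frame that is missing one business day (2005-01-05).
toy_weather = pd.DataFrame(
    {"temp": [70.0, 71.0, 69.0]},
    index=pd.to_datetime(["2005-01-03", "2005-01-04", "2005-01-06"]),
)

# Reference index, playing the role of fin_small.index.
toy_dates = pd.bdate_range("2005-01-03", "2005-01-06")

# Dates absent from toy_weather become rows of NaN.
toy_aligned = toy_weather.reindex(toy_dates)

# Number of missing values per column after alignment.
missing_per_column = toy_aligned.isna().sum()
print(missing_per_column)
```

Running the same `isna().sum()` on `weather_small` and `prod_small` would show how much the later forward fill has to paper over.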


In [59]:
combined_df = pd.concat([fin_small,prod_small,weather_small],axis=1)
combined_df = combined_df.ffill()
combined_df['shift_Log_Return'] = combined_df['Log_Return'].shift(-1)
# drop last row because of the shift
combined_df = combined_df.drop(combined_df.index[-1])
combined_df.to_csv("sugar_data/combined_sugar_data.csv", index=True)
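
To make the target construction concrete, here is a small sketch on toy prices. It assumes `Log_Return` is the same-day log return log(Close_t / Close_{t-1}); the notebook does not show how that column was computed, so treat the definition as an assumption.

```python
import numpy as np
import pandas as pd

# Toy closing prices; the values are arbitrary.
toy = pd.DataFrame(
    {"Close": [10.0, 11.0, 12.1, 11.0]},
    index=pd.bdate_range("2005-01-03", periods=4),
)

# Assumed definition of the same-day log return.
toy["Log_Return"] = np.log(toy["Close"] / toy["Close"].shift(1))

# Target: the next day's log return, aligned with today's features.
toy["shift_Log_Return"] = toy["Log_Return"].shift(-1)

# The final row has no "tomorrow", so it is dropped.
toy = toy.iloc[:-1]
print(toy)
```

Each remaining row pairs information available on day t with the return realized on day t+1, which is what the models below try to predict.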

# Models

* SARIMA
* Linear Regression, XGBoost
* Ridge, Lasso, Random Forest, SVR
* Neural Networks
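
All of these models are evaluated with scikit-learn's `TimeSeriesSplit`, which builds expanding training windows so that each test block lies strictly after its training data. A small sketch on twelve placeholder observations (the array `toy_X` is illustrative only) shows the fold layout:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve placeholder observations in time order.
toy_X = np.arange(12).reshape(-1, 1)

toy_tscv = TimeSeriesSplit(n_splits=5)
folds = list(toy_tscv.split(toy_X))
for fold, (train_index, test_index) in enumerate(folds):
    print(f"Fold {fold + 1}: train={train_index.tolist()}, test={test_index.tolist()}")
```

Because the first fold trains on the shortest history, its metrics tend to be the noisiest of the five.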

In [64]:
combined_df = combined_df.drop(columns=['expiry']) # we have the days to expiry already (DTE)

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [77]:
# Set up the target and features:
y = combined_df['shift_Log_Return']
X = combined_df.drop(columns=['shift_Log_Return'])
features = list(X.columns)

# Initialize TimeSeriesSplit and an empty list for errors.
tscv = TimeSeriesSplit(n_splits=5)
errors = []

# Use combined_df for splitting.
for fold, (train_index, test_index) in enumerate(tscv.split(combined_df)):
    # Get training and test sets for the fold.
    train = combined_df.iloc[train_index]
    test = combined_df.iloc[test_index]
    
    # Define the SARIMAX model.
    if features:
        model = sm.tsa.SARIMAX(train['shift_Log_Return'], exog=train[features],
                               order=(30, 1, 1), seasonal_order=(0, 1, 0, 12))
        
        
    else:
        model = sm.tsa.SARIMAX(train['shift_Log_Return'],
                               order=(30, 1, 1), seasonal_order=(0, 1, 0, 12))
    
    # Fit the model.
    results = model.fit(disp=False)
    
    # Forecast for the test period using integer indexing.
    start = len(train)
    end = len(train) + len(test) - 1
    if features:
        pred = results.predict(start=start, end=end, exog=test[features])
    else:
        pred = results.predict(start=start, end=end)
    
    # Calculate the RMSE and R^2 for this fold.
    rmse = np.sqrt(mean_squared_error(test['shift_Log_Return'], pred))
    r2 = r2_score(test['shift_Log_Return'], pred)
    errors.append((rmse, r2))
    print(f"Fold {fold+1} RMSE: {rmse}, R^2: {r2}")

# Convert errors to a NumPy array and calculate the average metrics.
errors_array = np.array(errors)
avg_rmse, avg_r2 = errors_array.mean(axis=0)
print("Average RMSE:", avg_rmse)
print("Average R^2:", avg_r2)


Fold 1 RMSE: 3769.5128695095636, R^2: -17138321376.42553
Fold 2 RMSE: 6.165600094010441, R^2: -153019.93344975275
Fold 3 RMSE: 0.6813745412024003, R^2: -1210.634682646486
Fold 4 RMSE: 2.7292430374018455, R^2: -22656.6918999003
Fold 5 RMSE: 0.5741887983873264, R^2: -1130.010502812517
Average RMSE: 755.9326551961132
Average R^2: -3427699878.739213


In [78]:
from sklearn.linear_model import LinearRegression
import xgboost as xgb

# Define target and features.
y = combined_df['shift_Log_Return']
X = combined_df.drop(columns=['shift_Log_Return'])
# List of feature names.
features = list(X.columns)

# Set up time series cross-validation.
tscv = TimeSeriesSplit(n_splits=5)

# Lists to store metrics for each model.
lr_errors = []   # For Linear Regression: list of (rmse, r2)
xgb_errors = []  # For XGBoost: list of (rmse, r2)

for fold, (train_index, test_index) in enumerate(tscv.split(combined_df)):
    # Create training and testing subsets.
    train = combined_df.iloc[train_index]
    test = combined_df.iloc[test_index]
    
    # Separate predictors and target.
    X_train = train.drop(columns=['shift_Log_Return'])
    y_train = train['shift_Log_Return']
    X_test = test.drop(columns=['shift_Log_Return'])
    y_test = test['shift_Log_Return']
    
    # ----- Linear Regression -----
    lr_model = LinearRegression()
    lr_model.fit(X_train, y_train)
    lr_pred = lr_model.predict(X_test)
    
    lr_rmse = np.sqrt(mean_squared_error(y_test, lr_pred))
    lr_r2 = r2_score(y_test, lr_pred)
    lr_errors.append((lr_rmse, lr_r2))
    
    # ----- XGBoost Regressor -----
    xgb_model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100)
    xgb_model.fit(X_train, y_train)
    xgb_pred = xgb_model.predict(X_test)
    
    xgb_rmse = np.sqrt(mean_squared_error(y_test, xgb_pred))
    xgb_r2 = r2_score(y_test, xgb_pred)
    xgb_errors.append((xgb_rmse, xgb_r2))
    
    print(f"Fold {fold+1}:")
    print(f"  Linear Regression -> RMSE: {lr_rmse:.4f}, R^2: {lr_r2:.4f}")
    print(f"  XGBoost           -> RMSE: {xgb_rmse:.4f}, R^2: {xgb_r2:.4f}")

# Convert error lists to NumPy arrays to average metrics.
lr_errors_array = np.array(lr_errors)
avg_lr_rmse, avg_lr_r2 = lr_errors_array.mean(axis=0)

xgb_errors_array = np.array(xgb_errors)
avg_xgb_rmse, avg_xgb_r2 = xgb_errors_array.mean(axis=0)

print("\nAverage Metrics:")
print(f"  Linear Regression -> Average RMSE: {avg_lr_rmse:.4f}, Average R^2: {avg_lr_r2:.4f}")
print(f"  XGBoost           -> Average RMSE: {avg_xgb_rmse:.4f}, Average R^2: {avg_xgb_r2:.4f}")


Fold 1:
  Linear Regression -> RMSE: 0.0305, R^2: -0.1186
  XGBoost           -> RMSE: 0.0373, R^2: -0.6815
Fold 2:
  Linear Regression -> RMSE: 0.0204, R^2: -0.6807
  XGBoost           -> RMSE: 0.0203, R^2: -0.6585
Fold 3:
  Linear Regression -> RMSE: 0.0207, R^2: -0.1225
  XGBoost           -> RMSE: 0.0217, R^2: -0.2249
Fold 4:
  Linear Regression -> RMSE: 0.0181, R^2: 0.0036
  XGBoost           -> RMSE: 0.0204, R^2: -0.2650
Fold 5:
  Linear Regression -> RMSE: 0.0173, R^2: -0.0303
  XGBoost           -> RMSE: 0.0211, R^2: -0.5281

Average Metrics:
  Linear Regression -> Average RMSE: 0.0214, Average R^2: -0.1897
  XGBoost           -> Average RMSE: 0.0242, Average R^2: -0.4716
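
For context on the negative scores: `r2_score` compares a model against a constant prediction equal to the test-set mean, so R^2 below zero means the model did worse than that constant. The sketch below uses synthetic returns (pure noise with a fixed seed; the 0.02 scale was picked only to resemble the RMSE values above, not estimated from the sugar data) to show what trivial baselines score:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
# Synthetic stand-in for daily log returns: zero-mean noise, std 0.02.
y_train = rng.normal(0.0, 0.02, size=1000)
y_test = rng.normal(0.0, 0.02, size=250)

# Baseline 1: always predict zero.
zero_pred = np.zeros_like(y_test)
# Baseline 2: always predict the training mean.
mean_pred = np.full_like(y_test, y_train.mean())

baseline_scores = {}
for name, pred in [("zero", zero_pred), ("train mean", mean_pred)]:
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    r2 = r2_score(y_test, pred)
    baseline_scores[name] = (rmse, r2)
    print(f"{name:>10} -> RMSE: {rmse:.4f}, R^2: {r2:.4f}")
```

Computing the same two baselines on each real fold would give a reference point: a model is only adding value if it beats them on RMSE.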


In [None]:
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# Create target and features.
y = combined_df['shift_Log_Return']
X = combined_df.drop(columns=['shift_Log_Return'])
features = list(X.columns)

# Initialize TimeSeriesSplit.
tscv = TimeSeriesSplit(n_splits=5)

# Lists to store metrics for each model.
ridge_errors = []  # (RMSE, R^2) tuples for Ridge.
lasso_errors = []  # (RMSE, R^2) tuples for Lasso.
rf_errors = []     # (RMSE, R^2) tuples for Random Forest.
svr_errors = []    # (RMSE, R^2) tuples for SVR.

# Loop over each fold.
for fold, (train_index, test_index) in enumerate(tscv.split(combined_df)):
    train = combined_df.iloc[train_index]
    test = combined_df.iloc[test_index]
    
    # Separate predictors and target.
    X_train = train.drop(columns=['shift_Log_Return'])
    y_train = train['shift_Log_Return']
    X_test = test.drop(columns=['shift_Log_Return'])
    y_test = test['shift_Log_Return']
    
    # ----- Ridge Regression -----
    ridge_model = Ridge(alpha=1.0)
    ridge_model.fit(X_train, y_train)
    ridge_pred = ridge_model.predict(X_test)
    ridge_rmse = np.sqrt(mean_squared_error(y_test, ridge_pred))
    ridge_r2 = r2_score(y_test, ridge_pred)
    ridge_errors.append((ridge_rmse, ridge_r2))
    
    # ----- Lasso Regression -----
    lasso_model = Lasso(alpha=0.1)
    lasso_model.fit(X_train, y_train)
    lasso_pred = lasso_model.predict(X_test)
    lasso_rmse = np.sqrt(mean_squared_error(y_test, lasso_pred))
    lasso_r2 = r2_score(y_test, lasso_pred)
    lasso_errors.append((lasso_rmse, lasso_r2))
    
    # ----- Random Forest Regressor -----
    rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
    rf_model.fit(X_train, y_train)
    rf_pred = rf_model.predict(X_test)
    rf_rmse = np.sqrt(mean_squared_error(y_test, rf_pred))
    rf_r2 = r2_score(y_test, rf_pred)
    rf_errors.append((rf_rmse, rf_r2))
    
    # ----- Support Vector Regression -----
    svr_model = SVR(kernel='rbf', C=1.0, epsilon=0.1)
    svr_model.fit(X_train, y_train)
    svr_pred = svr_model.predict(X_test)
    svr_rmse = np.sqrt(mean_squared_error(y_test, svr_pred))
    svr_r2 = r2_score(y_test, svr_pred)
    svr_errors.append((svr_rmse, svr_r2))
    
    print(f"Fold {fold+1}:")
    print(f"  Ridge -> RMSE: {ridge_rmse:.4f}, R^2: {ridge_r2:.4f}")
    print(f"  Lasso -> RMSE: {lasso_rmse:.4f}, R^2: {lasso_r2:.4f}")
    print(f"  RF    -> RMSE: {rf_rmse:.4f}, R^2: {rf_r2:.4f}")
    print(f"  SVR   -> RMSE: {svr_rmse:.4f}, R^2: {svr_r2:.4f}\n")

# Convert error lists to NumPy arrays and average metrics.
ridge_errors_arr = np.array(ridge_errors)
lasso_errors_arr = np.array(lasso_errors)
rf_errors_arr = np.array(rf_errors)
svr_errors_arr = np.array(svr_errors)

avg_ridge_rmse, avg_ridge_r2 = ridge_errors_arr.mean(axis=0)
avg_lasso_rmse, avg_lasso_r2 = lasso_errors_arr.mean(axis=0)
avg_rf_rmse, avg_rf_r2 = rf_errors_arr.mean(axis=0)
avg_svr_rmse, avg_svr_r2 = svr_errors_arr.mean(axis=0)

print("Average Metrics Across Folds:")
print(f"  Ridge -> Average RMSE: {avg_ridge_rmse:.4f}, Average R^2: {avg_ridge_r2:.4f}")
print(f"  Lasso -> Average RMSE: {avg_lasso_rmse:.4f}, Average R^2: {avg_lasso_r2:.4f}")
print(f"  RF    -> Average RMSE: {avg_rf_rmse:.4f}, Average R^2: {avg_rf_r2:.4f}")
print(f"  SVR   -> Average RMSE: {avg_svr_rmse:.4f}, Average R^2: {avg_svr_r2:.4f}")


Fold 1:
  Ridge -> RMSE: 0.0300, R²: -0.0843
  Lasso -> RMSE: 0.0289, R²: -0.0069
  RF    -> RMSE: 0.0318, R²: -0.2162
  SVR   -> RMSE: 0.0357, R²: -0.5360

Fold 2:
  Ridge -> RMSE: 0.0194, R²: -0.5090
  Lasso -> RMSE: 0.0159, R²: -0.0197
  RF    -> RMSE: 0.0159, R²: -0.0148
  SVR   -> RMSE: 0.0623, R²: -14.6124

Fold 3:
  Ridge -> RMSE: 0.0198, R²: -0.0255
  Lasso -> RMSE: 0.0363, R²: -2.4451
  RF    -> RMSE: 0.0199, R²: -0.0311
  SVR   -> RMSE: 0.0237, R²: -0.4719

Fold 4:
  Ridge -> RMSE: 0.0181, R²: 0.0044
  Lasso -> RMSE: 0.0182, R²: -0.0032
  RF    -> RMSE: 0.0179, R²: 0.0261
  SVR   -> RMSE: 0.0191, R²: -0.1084

Fold 5:
  Ridge -> RMSE: 0.0174, R²: -0.0348
  Lasso -> RMSE: 0.0171, R²: -0.0018
  RF    -> RMSE: 0.0178, R²: -0.0823
  SVR   -> RMSE: 0.0176, R²: -0.0673

Average Metrics Across Folds:
  Ridge -> Average RMSE: 0.0209, Average R²: -0.1299
  Lasso -> Average RMSE: 0.0233, Average R²: -0.4953
  RF    -> Average RMSE: 0.0206, Average R²: -0.0637
  SVR   -> Average RMSE: 0.

In [86]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit the scaler on the full dataset (target column included) and transform it.
scaled = scaler.fit_transform(combined_df)

# Convert the NumPy array back into a DataFrame.
scaled_df = pd.DataFrame(scaled, index=combined_df.index, columns=combined_df.columns)
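
Note that `fit_transform(combined_df)` uses every row, so statistics from the test periods leak into the scaled training data. One leak-free alternative, sketched here on synthetic data with a Ridge model standing in for any estimator (both are assumptions for illustration), is to wrap the scaler and the model in a scikit-learn pipeline so the scaler is refit on each training fold:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic predictors on very different scales, loosely mimicking
# price-like, volume-like, and production-like columns.
toy_X = rng.normal(size=(300, 3)) * np.array([1.0, 1e4, 1e9])
toy_y = rng.normal(0.0, 0.02, size=300)

fold_rmse = []
for train_index, test_index in TimeSeriesSplit(n_splits=5).split(toy_X):
    # The pipeline fits StandardScaler on the training rows only,
    # then applies the same transformation to the test rows.
    pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
    pipe.fit(toy_X[train_index], toy_y[train_index])
    pred = pipe.predict(toy_X[test_index])
    fold_rmse.append(np.sqrt(mean_squared_error(toy_y[test_index], pred)))

print("Per-fold RMSE:", np.round(fold_rmse, 4))
```

Only the predictors are scaled here, which keeps the RMSE in the original units of the target.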

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

# Set device for PyTorch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Prepare target and predictors as NumPy arrays from the scaled frame.
# Note: the cross-validation loop below slices combined_df directly and does not use these arrays.
y = scaled_df['shift_Log_Return'].values
X = scaled_df.drop(columns=['shift_Log_Return']).values

# Define a simple feed-forward neural network model.
class FeedForwardNN(nn.Module):
    def __init__(self, input_dim):
        super(FeedForwardNN, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1)
        )
    
    def forward(self, x):
        return self.model(x)

# Training function with early stopping.
def train_model(model, optimizer, criterion, X_train, y_train, X_val, y_val,
                num_epochs=100, batch_size=32, patience=10):
    model.train()
    n_train = X_train.shape[0]
    best_val_loss = np.inf
    epochs_no_improve = 0
    best_model_state = None

    # Convert all training and validation data to tensors.
    X_train_tensor = torch.tensor(X_train, dtype=torch.float32).to(device)
    y_train_tensor = torch.tensor(y_train, dtype=torch.float32).view(-1, 1).to(device)
    X_val_tensor = torch.tensor(X_val, dtype=torch.float32).to(device)
    y_val_tensor = torch.tensor(y_val, dtype=torch.float32).view(-1, 1).to(device)
    
    for epoch in range(num_epochs):
        model.train()
        permutation = torch.randperm(n_train)
        epoch_loss = 0.0

        # Mini-batch training.
        for i in range(0, n_train, batch_size):
            optimizer.zero_grad()
            indices = permutation[i:i+batch_size]
            batch_x = X_train_tensor[indices]
            batch_y = y_train_tensor[indices]
            outputs = model(batch_x)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item() * batch_x.size(0)
        
        epoch_loss /= n_train

        # Evaluate on validation data.
        model.eval()
        with torch.no_grad():
            val_outputs = model(X_val_tensor)
            val_loss = criterion(val_outputs, y_val_tensor).item()
        
        # Early stopping check.
        if val_loss < best_val_loss:
            best_val_loss = val_loss
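            # Clone the tensors: state_dict() returns references that later optimizer steps would overwrite.
            best_model_state = {k: v.clone() for k, v in model.state_dict().items()}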
            epochs_no_improve = 0
        else:
            epochs_no_improve += 1
            if epochs_no_improve >= patience:
                print(f"Early stopping at epoch {epoch+1}")
                break

    # Restore best model state.
    if best_model_state is not None:
        model.load_state_dict(best_model_state)
    
    return model

# Set up TimeSeriesSplit for cross-validation.
tscv = TimeSeriesSplit(n_splits=5)
nn_errors = []  # To store (RMSE, R^2) for each fold.

for fold, (train_index, test_index) in enumerate(tscv.split(combined_df)):
    # Get fold data.
    train_data = combined_df.iloc[train_index]
    test_data = combined_df.iloc[test_index]
    
    X_train = train_data.drop(columns=['shift_Log_Return']).values
    y_train = train_data['shift_Log_Return'].values
    X_test = test_data.drop(columns=['shift_Log_Return']).values
    y_test = test_data['shift_Log_Return'].values
    
    # Further split training data to have a validation set (e.g., 80/20 split).
    split_idx = int(0.8 * X_train.shape[0])
    X_train_part, X_val = X_train[:split_idx], X_train[split_idx:]
    y_train_part, y_val = y_train[:split_idx], y_train[split_idx:]
    
    input_dim = X_train.shape[1]
    model = FeedForwardNN(input_dim).to(device)
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    criterion = nn.MSELoss()
    
    # Train with early stopping.
    model = train_model(model, optimizer, criterion,
                        X_train_part, y_train_part, X_val, y_val,
                        num_epochs=100, batch_size=32, patience=10)
    
    # Evaluate on the test set.
    model.eval()
    X_test_tensor = torch.tensor(X_test, dtype=torch.float32).to(device)
    with torch.no_grad():
        test_pred = model(X_test_tensor).cpu().numpy().flatten()
    
    nn_rmse = np.sqrt(mean_squared_error(y_test, test_pred))
    nn_r2 = r2_score(y_test, test_pred)
    nn_errors.append((nn_rmse, nn_r2))
    
    print(f"Fold {fold+1}: NN RMSE: {nn_rmse:.4f}, NN R^2: {nn_r2:.4f}")

# Compute average metrics across folds.
nn_errors_arr = np.array(nn_errors)
avg_nn_rmse, avg_nn_r2 = nn_errors_arr.mean(axis=0)
print("\nAverage Neural Network Metrics (PyTorch):")
print(f"  Average RMSE: {avg_nn_rmse:.4f}")
print(f"  Average R^2: {avg_nn_r2:.4f}")


Early stopping at epoch 88
Fold 1: NN RMSE: 8837.9265, NN R²: -94210344665.0053
Early stopping at epoch 37
Fold 2: NN RMSE: 1548.4181, NN R²: -9651095771.6688
Early stopping at epoch 19
Fold 3: NN RMSE: 7244811.5928, NN R²: -136979028810712768.0000
Early stopping at epoch 17
Fold 4: NN RMSE: 6995.4700, NN R²: -148855552456.0731
Early stopping at epoch 14
Fold 5: NN RMSE: 26488.1559, NN R²: -2406914455840.9692

Average Neural Network Metrics (PyTorch):
  Average RMSE: 1457736.3127
  Average R²: -27396337688432300.0000


### What happened??
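
A likely explanation, judging from the code above rather than from a rerun: the cross-validation loop slices `combined_df`, not `scaled_df`, so the network is trained on raw columns whose magnitudes run from about 1e-2 (returns) to about 1e10 (production). With inputs on such different scales, a small network can produce predictions that are orders of magnitude away from the target. A quick way to check column scales is sketched below on a made-up frame (the helper `scale_report` and the toy columns are hypothetical):

```python
import numpy as np
import pandas as pd

def scale_report(df: pd.DataFrame) -> pd.Series:
    """Return each numeric column's standard deviation, largest first."""
    return df.select_dtypes("number").std().sort_values(ascending=False)

rng = np.random.default_rng(0)
# Toy columns with magnitudes loosely resembling the real ones.
toy = pd.DataFrame({
    "production": rng.normal(1e9, 5e9, size=200),
    "Volume": rng.normal(5e4, 2e4, size=200),
    "Close": rng.normal(15.0, 4.0, size=200),
    "Log_Return": rng.normal(0.0, 0.02, size=200),
})

report = scale_report(toy)
print(report)
# Ratio between the widest and narrowest column, in orders of magnitude.
spread = np.log10(report.iloc[0] / report.iloc[-1])
print("Spread (log10):", spread)
```

Running the same report on `combined_df`, then standardizing the predictors inside each fold with a scaler fitted on the training rows only, would be a natural next step. The SARIMA blow-up may have a related cause, since its exogenous regressors are unscaled as well.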