# 🎯 Benchmarking with Random Forest  

In this notebook, we’ll build a **baseline model** to predict **weekly S&P 500 closing values** (`sp500_close`) using Spotify audio features.  

Before creating a Neural Network, we first set a benchmark using a **Random Forest Regressor**.  

✅ **Why start with Random Forest?**  
- It’s a strong, non-linear model that often performs well without heavy preprocessing.  
- It gives a quick baseline to see whether there’s a meaningful relationship between music trends and the stock market.  
- We can later compare the Neural Network’s performance against this benchmark.  

We’ll evaluate the model using:  
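- **MAE (Mean Absolute Error)** → average absolute prediction error, in the same units as `sp500_close`  
- **RMSE (Root Mean Squared Error)** → like MAE, but penalizes larger errors more heavily  
- **R² score** → proportion of variance in the target explained by the model (it can be negative on held-out data when the model does worse than predicting the mean)  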
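
As a rough illustration of how these three metrics are computed, here is a minimal sketch in plain Python. The values in `y_true` and `y_pred` are made up for demonstration and are not taken from the dataset. In practice, `sklearn.metrics` provides `mean_absolute_error`, `mean_squared_error`, and `r2_score` for the same purpose.

```python
import math

# Hypothetical values for illustration only (not from the dataset)
y_true = [100.0, 110.0, 120.0, 130.0]
y_pred = [102.0, 108.0, 123.0, 127.0]

n = len(y_true)
errors = [p - t for p, t in zip(y_pred, y_true)]

# MAE: mean of the absolute errors
mae = sum(abs(e) for e in errors) / n

# RMSE: square root of the mean squared error
rmse = math.sqrt(sum(e ** 2 for e in errors) / n)

# R²: 1 - (residual sum of squares / total sum of squares)
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"R²:   {r2:.3f}")
```

Because the squared errors weight the two 3-point misses more than the two 2-point misses, RMSE comes out slightly above MAE here.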

---

## 📂 Step 1: Load Prepared Dataset  

We load the `merged_spotify_sp500_weekly.csv` file created in the previous notebook.  
This dataset contains:  
- **Spotify weekly aggregated audio features**  
- **Weekly S&P 500 closing values (`sp500_close`)**  

We’ll use this as the input for feature selection and modeling.


In [3]:
import pandas as pd

# ✅ Load the prepared merged dataset
data_path = "merged_spotify_sp500_weekly.csv"
merged = pd.read_csv(data_path)

print("✅ Dataset loaded successfully!")
print("Shape:", merged.shape)
print(merged.head())


✅ Dataset loaded successfully!
Shape: (1096, 17)
         week  popularity    duration_ms  explicit  danceability    energy  \
0  2000-01-03   35.513665  236354.200000  0.021739      0.584862  0.634662   
1  2000-01-10   24.194030  264388.388060  0.000000      0.579485  0.473000   
2  2000-01-17   35.553398  314967.650485  0.029126      0.564311  0.601623   
3  2000-01-24   29.603774  234980.283019  0.207547      0.579434  0.606309   
4  2000-01-31   33.893617  242338.063830  0.042553      0.614447  0.645043   

        key   loudness      mode  speechiness  acousticness  instrumentalness  \
0  5.288820  -8.272797  0.668944     0.063389      0.315100          0.056376   
1  5.194030 -11.943164  0.537313     0.081290      0.525509          0.114000   
2  5.388350  -9.865961  0.553398     0.090569      0.387647          0.174762   
3  5.754717  -8.492604  0.660377     0.087913      0.304237          0.034134   
4  5.595745  -8.106702  0.531915     0.045194      0.337882          0.092659

## 🏗️ Step 2: Define Features & Target + Train/Test Split  

Now we prepare the data for modeling:  

- **Target (`y`)** → `sp500_close` (weekly S&P 500 closing value)  
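- **Features (`X`)** → every column except `week` and the target: Spotify audio features (e.g., `danceability`, `energy`, `valence`) plus track metadata such as `popularity`, `duration_ms`, and `explicit`  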

We then split the dataset into:  
- **Training set (80%)** → used for fitting the model  
- **Test set (20%)** → held out for final performance evaluation  

📌 **Important:**  
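Since this is a **time-series-like problem**, we **do NOT shuffle** the data when splitting.  
Provided the rows are sorted by `week` (as the preview above suggests), this keeps the split chronological: the model is fit only on weeks that precede the test period, so no future information leaks into training.  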
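
To make the effect of an unshuffled split concrete, the following sketch reproduces it with plain list slicing. The helper name `chronological_split` is hypothetical, and rounding the test size up is an assumption chosen so the sizes line up with an 80/20 split of 1096 rows.

```python
import math

def chronological_split(rows, test_size=0.2):
    """Split an ordered sequence into (train, test) without shuffling.

    Hypothetical helper for illustration; the last `test_size` fraction
    of rows (rounded up) becomes the test set.
    """
    n_test = math.ceil(len(rows) * test_size)
    n_train = len(rows) - n_test
    return rows[:n_train], rows[n_train:]

# Stand-in for 1096 weekly rows, indexed in chronological order
weeks = list(range(1096))
train, test = chronological_split(weeks)

print(len(train), len(test))
# Every training week precedes every test week
print(max(train) < min(test))
```

The test set is simply the most recent block of weeks, which mimics how the model would be used: trained on the past, evaluated on what came after.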


In [4]:
from sklearn.model_selection import train_test_split

# ✅ Define target (y) and features (X)
target = "sp500_close"
feature_cols = [col for col in merged.columns if col not in ["week", target]]

X = merged[feature_cols]
y = merged[target]

print("Features used:", feature_cols)
print("X shape:", X.shape, "| y shape:", y.shape)

# ✅ Train/Test Split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False  # no shuffle to preserve time-series order
)

print(f"Train size: {X_train.shape[0]} | Test size: {X_test.shape[0]}")


Features used: ['popularity', 'duration_ms', 'explicit', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature']
X shape: (1096, 15) | y shape: (1096,)
Train size: 876 | Test size: 220
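
Because `shuffle=False` only preserves whatever order the rows already have, it can be worth confirming that the weeks are actually sorted before trusting the split. A minimal sketch, using the first five `week` values from the dataset preview as sample data; the helper `is_chronological` is hypothetical. On the real DataFrame, `merged["week"].is_monotonic_increasing` should give a similar check.

```python
from datetime import date

def is_chronological(week_strings):
    """Return True if ISO-formatted dates are in strictly increasing order."""
    parsed = [date.fromisoformat(w) for w in week_strings]
    return all(earlier < later for earlier, later in zip(parsed, parsed[1:]))

# First five `week` values from the dataset preview
sample_weeks = ["2000-01-03", "2000-01-10", "2000-01-17", "2000-01-24", "2000-01-31"]

print(is_chronological(sample_weeks))
print(is_chronological(list(reversed(sample_weeks))))
```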
