# ⚖️ 03_baselines.ipynb  
## Benchmark Models: RandomForest & ARIMA

### 1. Objective  
See how **simple models** perform before we try deep learning.

> _Like testing your bicycle (RF) and your skateboard (ARIMA) before buying a motorbike (LSTM)._

---

### 2. Definitions

- **MAE (Mean Absolute Error)**  
  - The average of |predicted – true|.  
  - Lower is better (0 = perfect).  
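- **R² (Coefficient of Determination)**  
  - Fraction of the target's variance explained by the model: 1 - SS_res / SS_tot.  
  - Higher is better (1 = perfect). On a held-out test set it can be negative, which means the model does worse than always predicting the test-set mean.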
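
Both definitions can be checked by hand. The sketch below uses NumPy and made-up numbers (not this notebook's data); the helper names `mae` and `r2` are illustrative only. It also shows R² dropping below 0 when predictions are worse than a constant mean prediction.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average of |predicted - true|."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs(y_pred - y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

# Made-up returns, for illustration only
y_true = np.array([0.01, -0.02, 0.03, 0.00])
good = np.array([0.01, -0.01, 0.02, 0.00])    # close to the truth
bad = np.array([-0.03, 0.04, -0.02, 0.02])    # wrong sign almost everywhere

print(f"good: MAE={mae(y_true, good):.4f}  R²={r2(y_true, good):.3f}")
print(f"bad:  MAE={mae(y_true, bad):.4f}  R²={r2(y_true, bad):.3f}")
```

The `bad` predictions give a negative R², because their squared error exceeds that of simply predicting the mean of `y_true`.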

---

### 3. Data Split  
- **Train** on the first 80% of dates.  
- **Test** on the last 20%.  
- **No shuffling**—we respect time order.
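
To see why shuffling matters, here is a minimal sketch on ten made-up calendar days (an assumption standing in for the real date index). A hand-written permutation plays the role of a random shuffle so the result is reproducible.

```python
import numpy as np

# Ten consecutive calendar days stand in for the notebook's date index.
dates = np.arange("2020-07-13", "2020-07-23", dtype="datetime64[D]")

# Chronological split: the first 80% of rows train, the last 20% test.
cut = int(len(dates) * 0.8)
train_dates, test_dates = dates[:cut], dates[cut:]
# Every training date precedes every test date.
print(train_dates.max() < test_dates.min())        # True

# A hand-written permutation stands in for a random shuffle.
perm = np.array([3, 9, 0, 7, 1, 5, 8, 2, 6, 4])
shuffled_train, shuffled_test = dates[perm[:cut]], dates[perm[cut:]]
# Training now contains dates later than some test dates (look-ahead).
print(shuffled_train.max() < shuffled_test.min())  # False
```

With the shuffled split, the model would train on days that come after the days it is tested on, which is information it could not have in practice.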

---

### 4. RandomForest  
1. Train on features `["rsi","macd"]` to predict the daily log-return (`return`).  
2. Evaluate on test set → **MAE_rf**, **R²_rf**.

---

### 5. ARIMA  
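1. Fit ARIMA(5,1,0) on raw Close prices.  
2. Predict prices for the last 20% of dates.  
3. Convert the predicted prices to log-returns.  
4. Evaluate → **MAE_ar**, **R²_ar**.

> _Caveat: the code in this notebook fits ARIMA on the full price series, test period included, so its predictions are in-sample one-step-ahead values rather than true forecasts. It also computes **R²_ar** as a squared correlation, so the ARIMA and RandomForest numbers are not strictly comparable._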
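
Step 3 can be sketched as follows, assuming the log-return definition r_t = ln(P_t / P_{t-1}). The function name `prices_to_log_returns` and all prices are made up for illustration. Anchoring on the last observed price is a design choice in this sketch, not necessarily what the notebook does.

```python
import numpy as np

def prices_to_log_returns(forecast_prices, last_observed_price):
    """Log-returns r_t = ln(P_t / P_{t-1}), anchored on the last known price."""
    path = np.concatenate(([last_observed_price],
                           np.asarray(forecast_prices, float)))
    # Differences of log-prices are exactly the log-returns.
    return np.diff(np.log(path))

# Made-up numbers: last training price and three forecast prices
forecast_prices = [101.0, 100.0, 102.0]
returns = prices_to_log_returns(forecast_prices, last_observed_price=100.0)

print(returns)
# Log-returns telescope: their sum equals ln(final price / starting price)
print(np.isclose(returns.sum(), np.log(102.0 / 100.0)))
```

Without the anchor price, the first forecast has no predecessor and its return is undefined, which is why a shift-and-drop conversion yields one fewer return than there are forecasts.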


In [3]:
import pandas as pd
from pathlib import Path

# 1) Load the features CSV
root = Path().resolve().parent
features = pd.read_csv(root/"data"/"processed"/"features.csv",
                       index_col="Date", parse_dates=True)

# 2) Separate inputs (X) and target (y)
#    We’ll predict “return”, so y = return
X = features[["rsi", "macd"]]
y = features["return"]

# 3) Preview
features.head()


| Date | Open | High | Low | Close | Volume | return | rsi | macd |
|---|---|---|---|---|---|---|---|---|
| 2020-07-13 | 3205.08 | 3235.32 | 3149.43 | 3155.22 | 2694339000.0 | -0.009407 | 55.238587 | 6.658386 |
| 2020-07-14 | 3141.11 | 3200.95 | 3127.66 | 3197.52 | 2638225000.0 | 0.013317 | 59.409451 | 7.621329 |
| 2020-07-15 | 3225.98 | 3238.28 | 3200.76 | 3226.56 | 2849504000.0 | 0.009041 | 62.025553 | 9.495176 |
| 2020-07-16 | 3208.36 | 3220.39 | 3198.59 | 3215.57 | 2214118000.0 | -0.003412 | 60.43801 | 9.265467 |
| 2020-07-17 | 3224.21 | 3233.52 | 3205.65 | 3224.73 | 2219965000.0 | 0.002845 | 61.326496 | 8.994139 |


## Train & evaluate a Random Forest

In [2]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
import numpy as np

# 1) Split chronologically: first 80% train, last 20% test (no shuffling)
split = int(len(X) * 0.8)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

# 2) Train
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# 3) Evaluate with R² and MAE
#    (for regressors, .score() returns the coefficient of determination R²)
r2 = rf.score(X_test, y_test)
mae = mean_absolute_error(y_test, rf.predict(X_test))

print(f"RandomForest R²: {r2:.3f}")
print(f"RandomForest MAE: {mae:.5f}")


RandomForest R²: 0.128
RandomForest MAE: 0.00785


## Train & evaluate an ARIMA on prices

In [4]:
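from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_absolute_error
import numpy as np
import pandas as pd

# Load raw Close prices
prices = pd.read_csv(root/"data"/"raw"/"sp500.csv",
                     index_col="Date", parse_dates=True)["Close"]

# Fit ARIMA(5,1,0)
# NOTE: the fit uses the full series, test period included, so the
# predictions below are in-sample one-step-ahead values, not true forecasts.
model = ARIMA(prices, order=(5,1,0)).fit()

# Predict prices over the last n_test days
# `split` was computed from the feature table; if that table has fewer rows
# than `prices`, this window is longer than the RandomForest test set.
n_test = len(prices) - split
# This ARIMA class predicts price levels by default, so the legacy
# `typ="levels"` argument is dropped.
pred = model.predict(start=len(prices)-n_test, end=len(prices)-1)

# Evaluate on returns: convert predicted prices to log-returns
# (the first prediction has no predecessor, so its return is dropped)
pred_ret = np.log(pred / pred.shift(1)).dropna()
# Positional alignment: assumes `y` and `prices` end on the same date
true_ret = y.iloc[-len(pred_ret):]

# Squared Pearson correlation, which is not the same quantity as rf.score's R²
r2_arima = np.corrcoef(pred_ret, true_ret)[0,1]**2
mae_arima = mean_absolute_error(true_ret, pred_ret)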

print(f"ARIMA R²: {r2_arima:.3f}")
print(f"ARIMA MAE: {mae_arima:.5f}")




ARIMA R²: 0.001
ARIMA MAE: 0.01103
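
The ARIMA cell reports a squared Pearson correlation, while `rf.score` returns the coefficient of determination. The sketch below, on made-up numbers, shows how far the two can diverge: predictions that are perfectly correlated with the truth but badly scaled and shifted get a squared correlation of about 1 and a strongly negative R².

```python
import numpy as np

# Made-up returns, for illustration only
y_true = np.array([0.01, -0.02, 0.03, 0.00])
# Perfectly correlated with y_true, but ten times too large and shifted
y_pred = 10 * y_true + 0.05

# Squared Pearson correlation ignores scale and offset
corr_sq = np.corrcoef(y_pred, y_true)[0, 1] ** 2

# Coefficient of determination penalizes scale and offset errors
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"squared correlation: {corr_sq:.3f}")
print(f"R²: {r2:.3f}")
```

For a like-for-like comparison between the two baselines, the same metric would need to be applied to both sets of predictions.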
