# Gold Price: Feature Engineering & Baseline Forecasting

This notebook builds on **01_gold_price_eda.ipynb** and implements:

1. **Temporal train/validation split** — no future leakage.
2. **Technical indicators** — RSI, MACD, ATR as candidate features.
3. **Baseline models** — naive (last value), rolling mean, and a simple ML baseline.
4. **Time-series metrics** — MAE, RMSE, MAPE for next-day Close prediction.

**Prerequisites:** Run `01_gold_price_eda.ipynb` first. Data is loaded from the same CSV.

In [1]:
from pathlib import Path
import warnings
warnings.filterwarnings("ignore")  # silences all warnings, including pandas deprecation notices

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Data path (same as EDA notebook)
_DATA_PATH = Path("gold_historical_data.csv")
DATA_PATH = _DATA_PATH if _DATA_PATH.exists() else Path("../gold_historical_data.csv")
SEED = 42
np.random.seed(SEED)

In [2]:
df = pd.read_csv(DATA_PATH, parse_dates=["Date"])
df = df.sort_values("Date").reset_index(drop=True)
df["Return"] = df["Close"].pct_change()
print(f"Loaded {len(df):,} rows. Date range: {df['Date'].min().date()} to {df['Date'].max().date()}")
df.head()

Loaded 2,510 rows. Date range: 2016-02-05 to 2026-01-30


| | Date | Adj Close | Close | High | Low | Open | Volume | Return |
|---|---|---|---|---|---|---|---|---|
| 0 | 2016-02-05 | 1157.800049 | 1157.800049 | 1174.800049 | 1146.0 | 1155.599976 | 877 | NaN |
| 1 | 2016-02-08 | 1197.900024 | 1197.900024 | 1200.400024 | 1165.0 | 1173.5 | 931 | 0.034635 |
| 2 | 2016-02-09 | 1198.699951 | 1198.699951 | 1199.199951 | 1186.699951 | 1188.699951 | 700 | 0.000668 |
| 3 | 2016-02-10 | 1194.699951 | 1194.699951 | 1197.699951 | 1182.099976 | 1189.800049 | 671 | -0.003337 |
| 4 | 2016-02-11 | 1247.900024 | 1247.900024 | 1260.800049 | 1204.599976 | 1205.599976 | 995 | 0.04453 |


## 1. Temporal train/validation split

Use a strict time-based cutoff so the validation set is always in the future. Here we reserve the last 20% of days for validation.

In [3]:
train_frac = 0.80
split_idx = int(len(df) * train_frac)
train_df = df.iloc[:split_idx].copy()
val_df = df.iloc[split_idx:].copy()
print(f"Train: {len(train_df):,} rows ({train_df['Date'].min().date()} to {train_df['Date'].max().date()})")
print(f"Val:   {len(val_df):,} rows ({val_df['Date'].min().date()} to {val_df['Date'].max().date()})")

Train: 2,008 rows (2016-02-05 to 2024-02-01)
Val:   502 rows (2024-02-02 to 2026-01-30)
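
A quick way to guard against accidental overlap is to assert that every training date precedes every validation date. The following is a minimal sketch on a toy frame; the helper name `temporal_split` and the toy data are assumptions made for illustration and are not part of the notebook's cells.

```python
import pandas as pd

def temporal_split(frame: pd.DataFrame, date_col: str = "Date", train_frac: float = 0.80):
    """Split a frame chronologically and check that train ends before validation starts."""
    frame = frame.sort_values(date_col).reset_index(drop=True)
    split_idx = int(len(frame) * train_frac)
    train = frame.iloc[:split_idx].copy()
    val = frame.iloc[split_idx:].copy()
    # Fails loudly if the two sets share or interleave dates
    assert train[date_col].max() < val[date_col].min(), "train and validation overlap in time"
    return train, val

# Toy data: 10 consecutive business days (hypothetical values)
toy = pd.DataFrame({
    "Date": pd.bdate_range("2024-01-01", periods=10),
    "Close": [float(i) for i in range(10)],
})
toy_train, toy_val = temporal_split(toy)
print(len(toy_train), len(toy_val))
```

With `train_frac = 0.80` the toy frame splits into the first 8 and last 2 rows, mirroring the proportion used above.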


## 2. Technical indicators (RSI, MACD, ATR)

- **RSI** (14): relative strength index (0–100).
- **MACD**: MACD line and signal line (12/26/9).
- **ATR** (14): average true range for volatility.
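
For reference, the three indicators can be written out as follows. This is a sketch that assumes the simple-moving-average variants of RSI and ATR (consistent with the `rolling(...).mean()` calls in this notebook; Wilder's original definitions use exponential smoothing instead). The symbols $C_t$, $H_t$, $L_t$ (close, high, low on day $t$), $U_t$, $D_t$ (up and down moves) and $n$ (window length, 14 here) are notation introduced only for this summary.

$$
\begin{aligned}
U_t &= \max(C_t - C_{t-1},\, 0), \qquad D_t = \max(C_{t-1} - C_t,\, 0) \\
\mathrm{RS}_t &= \frac{\tfrac{1}{n}\sum_{i=0}^{n-1} U_{t-i}}{\tfrac{1}{n}\sum_{i=0}^{n-1} D_{t-i}}, \qquad \mathrm{RSI}_t = 100 - \frac{100}{1 + \mathrm{RS}_t} \\
\mathrm{EMA}_s(x)_t &= (1-\alpha)\,\mathrm{EMA}_s(x)_{t-1} + \alpha\, x_t, \qquad \alpha = \frac{2}{s+1} \\
\mathrm{MACD}_t &= \mathrm{EMA}_{12}(C)_t - \mathrm{EMA}_{26}(C)_t, \qquad \mathrm{Signal}_t = \mathrm{EMA}_{9}(\mathrm{MACD})_t \\
\mathrm{TR}_t &= \max\bigl(H_t - L_t,\; |H_t - C_{t-1}|,\; |L_t - C_{t-1}|\bigr), \qquad \mathrm{ATR}_t = \frac{1}{n}\sum_{i=0}^{n-1} \mathrm{TR}_{t-i}
\end{aligned}
$$

All three depend only on values at or before day $t$, which is what makes them usable as features for a next-day forecast.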

In [4]:
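def add_rsi(series: pd.Series, period: int = 14) -> pd.Series:
    """RSI using simple rolling means of gains and losses (not Wilder's smoothing)."""
    delta = series.diff()
    gain = delta.clip(lower=0)       # upward moves, zero otherwise
    loss = (-delta).clip(lower=0)    # downward moves as positive numbers
    avg_gain = gain.rolling(period).mean()
    avg_loss = loss.rolling(period).mean()
    # A window with no losses yields NaN here rather than RSI = 100
    rs = avg_gain / avg_loss.replace(0, np.nan)
    return 100 - (100 / (1 + rs))

def add_macd(close: pd.Series, fast: int = 12, slow: int = 26, signal: int = 9):
    """Return the MACD line (fast EMA minus slow EMA) and its signal line (EMA of the MACD line)."""
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    macd_line = ema_fast - ema_slow
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()
    return macd_line, signal_line

def add_atr(high: pd.Series, low: pd.Series, close: pd.Series, period: int = 14) -> pd.Series:
    """Average true range: rolling mean of the true range over `period` days."""
    prev_close = close.shift(1)
    # True range = largest of the day's range and the two gaps from the previous close
    tr = pd.concat([high - low, (high - prev_close).abs(), (low - prev_close).abs()], axis=1).max(axis=1)
    return tr.rolling(period).mean()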

In [5]:
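# Compute indicators on the full series before splitting. Each one is causal
# (rolling/EWM over current and past rows only), so validation rows never affect training features.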
df["RSI_14"] = add_rsi(df["Close"], 14)
macd_line, signal_line = add_macd(df["Close"])
df["MACD"] = macd_line
df["MACD_signal"] = signal_line
df["ATR_14"] = add_atr(df["High"], df["Low"], df["Close"], 14)
# Rolling features for baselines
df["Rolling20"] = df["Close"].rolling(20).mean()
df["Rolling50"] = df["Close"].rolling(50).mean()
df = df.dropna(subset=["RSI_14", "MACD", "ATR_14", "Rolling20", "Rolling50"]).reset_index(drop=True)
# Re-split after dropping NaNs
split_idx = int(len(df) * train_frac)
train_df = df.iloc[:split_idx].copy()
val_df = df.iloc[split_idx:].copy()
print("Technical indicators added. Sample (last 3 rows of train):")
train_df[["Date", "Close", "RSI_14", "MACD", "MACD_signal", "ATR_14", "Rolling20"]].tail(3)

Technical indicators added. Sample (last 3 rows of train):


| | Date | Close | RSI_14 | MACD | MACD_signal | ATR_14 | Rolling20 |
|---|---|---|---|---|---|---|---|
| 1965 | 2024-02-12 | 2018.199951 | 47.3275 | 0.240414 | 1.395778 | 17.392848 | 2026.404987 |
| 1966 | 2024-02-13 | 1992.900024 | 41.131749 | -2.653072 | 0.586008 | 18.571429 | 2024.749988 |
| 1967 | 2024-02-14 | 1990.300049 | 38.728812 | -5.097219 | -0.550638 | 18.457136 | 2024.134991 |
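
Computing features on the full series is only safe when each feature is causal, meaning its value on day t does not change if later rows are removed. One way to check this is sketched below on synthetic prices. The helper `is_causal` and the three example features are hypothetical names introduced for illustration; the centered window is included deliberately as a leaky counterexample.

```python
import numpy as np
import pandas as pd

def is_causal(feature_fn, series: pd.Series, cut: int) -> bool:
    """True if feature values for the first `cut` rows are unchanged when later rows are removed."""
    from_full = feature_fn(series).iloc[:cut]
    from_truncated = feature_fn(series.iloc[:cut])
    return bool(np.allclose(from_full, from_truncated, equal_nan=True))

# Synthetic random-walk prices (not real gold data)
rng = np.random.default_rng(0)
prices = pd.Series(1800 + rng.normal(0, 10, 200).cumsum())

rolling20 = lambda s: s.rolling(20).mean()                # trailing window: causal
ema12 = lambda s: s.ewm(span=12, adjust=False).mean()     # recursive EMA: causal
centered21 = lambda s: s.rolling(21, center=True).mean()  # looks 10 rows ahead: leaky

causal_rolling = is_causal(rolling20, prices, cut=150)
causal_ema = is_causal(ema12, prices, cut=150)
causal_centered = is_causal(centered21, prices, cut=150)
print(causal_rolling, causal_ema, causal_centered)
```

The trailing rolling mean and the EMA pass the check, while the centered window fails because its last values near the cut depend on rows beyond it.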


## 3. Target and feature matrix

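Predict **next-day Close**. Features: same-day Close and Return, RSI, MACD, ATR, and the 20- and 50-day rolling means. All of them are known at the close of day t, so no future information enters the feature matrix.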
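
The alignment can be illustrated on a toy series. `shift(-1)` places the Close of day t+1 on row t, so the final row has no target and is dropped. The target column also has to exist before the frame is split, because slices taken with `.copy()` do not pick up columns added to the parent afterwards. The values and the 75% split below are assumptions chosen only to keep the example small.

```python
import pandas as pd

toy = pd.DataFrame({"Close": [100.0, 101.0, 99.0, 102.0, 103.0]})

# Row t gets the Close of row t+1; the last row has no next day and becomes NaN
toy["target"] = toy["Close"].shift(-1)
toy = toy.dropna(subset=["target"])

# Split only after the target column exists, so both pieces carry it
split_idx = int(len(toy) * 0.75)
toy_train = toy.iloc[:split_idx].copy()
toy_val = toy.iloc[split_idx:].copy()

print(toy_train)
print(toy_val)
```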

In [6]:
df["target"] = df["Close"].shift(-1)  # next-day Close
feat_cols = ["Close", "Return", "RSI_14", "MACD", "MACD_signal", "ATR_14", "Rolling20", "Rolling50"]
# train_df and val_df were copied before "target" existed, so re-split to pick up the new column
split_idx = int(len(df) * train_frac)
train_df = df.iloc[:split_idx].copy()
val_df = df.iloc[split_idx:].copy()
# Only the final row of df has no target (there is no next day); drop it and align X with y
y_train = train_df["target"].dropna()
X_train = train_df.loc[y_train.index, feat_cols]
y_val = val_df["target"].dropna()
X_val = val_df.loc[y_val.index, feat_cols]
print(f"Train X: {X_train.shape}, y: {y_train.shape}")
print(f"Val   X: {X_val.shape},   y: {y_val.shape}")

## 4. Baseline models and time-series metrics

- **Naive:** predict last known Close (persistence).
- **Rolling mean:** predict 20-day rolling mean of Close.
- **Ridge:** linear model on the engineered features.
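
To make the comparison concrete, here is a small worked sketch of the two non-ML baselines scored against a next-day target. The prices are toy values, the 3-day window stands in for the 20-day mean to keep the series short, and `toy_scores` is a hypothetical helper that is not one of the notebook's metric functions.

```python
import numpy as np
import pandas as pd

close = pd.Series([100.0, 102.0, 101.0, 105.0, 107.0, 106.0])  # toy prices

# Keep only rows where both baselines and the next-day target are defined
frame = pd.DataFrame({
    "naive": close,                        # persistence: tomorrow = today
    "rolling3": close.rolling(3).mean(),   # short stand-in for Rolling20
    "target": close.shift(-1),
}).dropna()

def toy_scores(y_true: pd.Series, y_pred: pd.Series) -> dict:
    """MAE, RMSE and MAPE (%) for one set of predictions."""
    err = y_true - y_pred
    return {
        "MAE": float(np.mean(np.abs(err))),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAPE (%)": float(np.mean(np.abs(err / y_true)) * 100),
    }

naive_scores = toy_scores(frame["target"], frame["naive"])
rolling_scores = toy_scores(frame["target"], frame["rolling3"])
print(naive_scores)
print(rolling_scores)
```

On this toy series the naive forecast has the lower MAE (7/3 against 10/3). A rolling mean lags a trending series, which is one reason persistence is a hard baseline to beat for daily prices.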

In [None]:
def mae(y_true, y_pred): return mean_absolute_error(y_true, y_pred)
def rmse(y_true, y_pred): return np.sqrt(mean_squared_error(y_true, y_pred))
def mape(y_true, y_pred): return np.mean(np.abs((y_true - y_pred) / (y_true + 1e-8))) * 100

# Naive: next Close = current Close
y_naive = X_val["Close"].values
# Rolling: use Rolling20 as proxy for "recent average" prediction
y_rolling = X_val["Rolling20"].values
# Ridge
model = Ridge(alpha=1.0, random_state=SEED)
model.fit(X_train, y_train)
y_ridge = model.predict(X_val)

results = pd.DataFrame({
    "Model": ["Naive (last Close)", "Rolling20 mean", "Ridge"],
    "MAE": [mae(y_val, y_naive), mae(y_val, y_rolling), mae(y_val, y_ridge)],
    "RMSE": [rmse(y_val, y_naive), rmse(y_val, y_rolling), rmse(y_val, y_ridge)],
    "MAPE (%)": [mape(y_val, y_naive), mape(y_val, y_rolling), mape(y_val, y_ridge)],
})
results

In [None]:
# Plot validation period: actual vs Ridge predictions
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(val_df.loc[y_val.index, "Date"], y_val.values, label="Actual Close", alpha=0.8)
ax.plot(val_df.loc[y_val.index, "Date"], y_ridge, label="Ridge pred", alpha=0.8)
ax.set_title("Validation: Actual vs Ridge baseline")
ax.set_xlabel("Date")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 5. Summary and suggested next notebook

- **Done:** Temporal split, RSI/MACD/ATR, naive + rolling + Ridge baselines, MAE/RMSE/MAPE.
- **Next (03):** Try ARIMA or Prophet for univariate Close forecasting, or tree-based models (e.g. Gradient Boosting) on the same features, and replace the single split with walk-forward or expanding-window backtesting.
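
As a starting point for the backtesting item, an expanding-window scheme can be sketched as a generator of index ranges: the training window always starts at row 0 and grows, while each test block covers the next `horizon` rows. The function name and parameters below are assumptions made for illustration. scikit-learn's `TimeSeriesSplit` provides comparable splits if a library implementation is preferred.

```python
from typing import Iterator, Tuple

def expanding_window_splits(
    n_rows: int, initial_train: int, horizon: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train_idx, test_idx) pairs where training always ends before testing begins."""
    start = initial_train
    while start + horizon <= n_rows:
        yield range(0, start), range(start, start + horizon)
        start += horizon  # the block just tested joins the next training window

# Toy example: 10 rows, first 4 for the initial fit, test blocks of 2 rows
folds = list(expanding_window_splits(n_rows=10, initial_train=4, horizon=2))
for train_idx, test_idx in folds:
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
```

Each fold would refit the model on `train_idx` and score it on `test_idx`, and the per-fold metrics can then be averaged or inspected over time.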