
# Modular Quantitative Research Workflow in Algorithmic Trading  
## End-to-End Example on SPY (Triple-Barrier + Random Forest + backtesting.py)

This notebook demonstrates a **modular research framework** for systematic / algorithmic trading using a practical example on the S&P 500 ETF (**SPY**).

We walk through:

1. **Data Curation**  
2. **Data Labeling** (simplified *triple-barrier* à la López de Prado)  
3. **Predictive Modelling** (Random Forest classifier)  
4. **Strategy Construction** (mapping predictions to positions)  
5. **Backtesting & Evaluation** using [`backtesting.py`](https://kernc.github.io/backtesting.py/)

The focus is on the **structure of the workflow**, not on building a production-ready strategy.


In [72]:

import numpy as np
import pandas as pd

import yfinance as yf

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

import matplotlib.pyplot as plt

from backtesting import Backtest, Strategy

plt.rcParams["figure.figsize"] = (12, 5)
plt.rcParams["axes.grid"] = True

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

pd.set_option("display.max_columns", 50)



## 1. Data Curation

In a real quant research environment, **data curation** is a major part of the job:

- Collecting raw market data (trades, quotes, OHLCV, fundamentals, etc.)
- Cleaning outliers and bad ticks
- Adjusting for corporate actions (splits, dividends)
- Ensuring there is **no look-ahead** and **no survivorship bias**

Here, for clarity, we:

- Use `yfinance` to download **daily OHLCV data** for SPY.
- Use `auto_adjust=True`, so the OHLC prices (including `Close`) are already adjusted for splits and dividends.
- Keep cleaning minimal (dropping missing rows).
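
Even with minimal cleaning, a few cheap checks can catch obviously broken bars before labeling. The sketch below is illustrative only: `basic_ohlcv_checks` is a hypothetical helper (not part of `yfinance` or pandas), and it assumes lower-case column names `open`, `high`, `low`, `close`, `volume`.

```python
import pandas as pd


def basic_ohlcv_checks(df: pd.DataFrame) -> dict:
    """Return a few simple data-quality indicators for a daily OHLCV frame.

    Hypothetical helper for illustration; expects columns
    open, high, low, close, volume and a DatetimeIndex.
    """
    price_cols = ["open", "high", "low", "close"]
    return {
        "index_sorted": bool(df.index.is_monotonic_increasing),
        "duplicate_dates": int(df.index.duplicated().sum()),
        "missing_values": int(df.isna().sum().sum()),
        "non_positive_prices": int((df[price_cols] <= 0).sum().sum()),
        # A bar is inconsistent if high < low, or close lies outside [low, high]
        "inconsistent_bars": int(
            ((df["high"] < df["low"])
             | (df["close"] > df["high"])
             | (df["close"] < df["low"])).sum()
        ),
    }


# Toy frame with one deliberately broken bar (high < low on the second day)
toy = pd.DataFrame(
    {
        "open": [100.0, 101.0, 102.0],
        "high": [101.0, 100.5, 103.0],
        "low": [99.0, 101.5, 101.0],
        "close": [100.5, 101.0, 102.5],
        "volume": [1_000, 1_200, 900],
    },
    index=pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]),
)
report = basic_ohlcv_checks(toy)
print(report)
```

On real data, any non-zero count would be a prompt to inspect the affected rows rather than to drop them automatically.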


In [73]:
ticker = "AMZN"  # outputs below were generated with AMZN; set to "SPY" to match the text
start_date = "2020-01-01"

# Download data
raw = yf.download(ticker, start=start_date, auto_adjust=True)

# yfinance may return MultiIndex columns, e.g. ('Close', ticker);
# keep only the first level so the columns are 'Open', 'High', 'Low', 'Close', 'Volume'
if isinstance(raw.columns, pd.MultiIndex):
    raw.columns = raw.columns.get_level_values(0)

raw = raw.dropna()

# Rename columns to lower case for convenience
data = raw.rename(
    columns={
        "Open": "open",
        "High": "high",
        "Low": "low",
        "Close": "close",
        "Volume": "volume",
    }
)

data.head()



| Date | close | high | low | open | volume |
|---|---|---|---|---|---|
| 2020-01-02 | 94.900497 | 94.900497 | 93.207497 | 93.75 | 80580000 |
| 2020-01-03 | 93.748497 | 94.309998 | 93.224998 | 93.224998 | 75288000 |
| 2020-01-06 | 95.143997 | 95.184502 | 93.0 | 93.0 | 81236000 |
| 2020-01-07 | 95.343002 | 95.694504 | 94.601997 | 95.224998 | 80898000 |
| 2020-01-08 | 94.598503 | 95.550003 | 94.321999 | 94.902 | 70160000 |



## 2. Data Labeling — Simplified Triple-Barrier Method

We now transform raw prices into **labels** for supervised learning.

Inspired by **López de Prado's triple-barrier method**, we:

1. Estimate **daily volatility** from log returns.
2. For each day:
   - Set an **upper barrier** (profit-taking level)
   - Set a **lower barrier** (stop-loss level)
   - Set a **time barrier** (maximum holding period in days)
3. Look forward in time:
   - If the **upper barrier** is hit first → label = **+1**
   - If the **lower barrier** is hit first → label = **−1**
   - If neither is hit before the time barrier → label = **0**

This gives us labels that encode a directional view with basic risk bounds.
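
To make the three outcomes concrete, here is a small worked example on a toy price series. It is a sketch under stated assumptions: `label_one` is a hypothetical single-point helper, the 1% volatility is made up, and the holding period is shortened to 3 bars so the windows are easy to read.

```python
import pandas as pd


def label_one(prices: pd.Series, start: int, vol: float,
              pt_mult: float = 1.0, sl_mult: float = 1.0,
              max_holding: int = 3) -> int:
    """Label a single entry point on a toy series (illustrative helper)."""
    entry = prices.iloc[start]
    upper = entry * (1 + pt_mult * vol)   # profit-taking barrier
    lower = entry * (1 - sl_mult * vol)   # stop-loss barrier
    window = prices.iloc[start + 1 : start + 1 + max_holding]
    for p in window:                      # walk forward, first touch wins
        if p >= upper:
            return 1
        if p <= lower:
            return -1
    return 0                              # time barrier reached first


toy = pd.Series([100.0, 100.5, 101.2, 99.0, 98.5, 98.8, 98.9, 99.1])
vol = 0.01  # assumed 1% daily volatility

label_a = label_one(toy, start=0, vol=vol)  # entry 100.0: upper barrier 101.0 is reached at 101.2
label_b = label_one(toy, start=2, vol=vol)  # entry 101.2: lower barrier (about 100.19) is reached at 99.0
label_c = label_one(toy, start=4, vol=vol)  # entry 98.5: neither barrier is reached within 3 bars
print(label_a, label_b, label_c)
```

The three entry points produce +1, −1 and 0 respectively, one for each branch of the rule.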


In [74]:

def get_daily_vol(close: pd.Series, span: int = 50) -> pd.Series:
    """Estimate daily volatility using exponentially weighted moving std of log returns."""
    log_ret = np.log(close / close.shift(1))
    vol = log_ret.ewm(span=span).std()
    return vol


def apply_triple_barrier(
    close: pd.Series,
    daily_vol: pd.Series,
    pt_mult: float = 1.0,
    sl_mult: float = 1.0,
    max_holding: int = 10,
) -> pd.Series:
    """Simplified triple-barrier labeling.

    Parameters
    ----------
    close : pd.Series
        Price series.
    daily_vol : pd.Series
        Daily volatility estimate (same index as `close`).
    pt_mult : float
        Profit-take barrier multiple of volatility.
    sl_mult : float
        Stop-loss barrier multiple of volatility.
    max_holding : int
        Maximum holding period in days (time barrier).

    Returns
    -------
    labels : pd.Series
        +1 if upper barrier hit first,
        -1 if lower barrier hit first,
         0 if neither barrier is hit before the time limit
           (the look-ahead window is truncated near the end of the series),
         NaN for the last observation (no future bar) or where vol is NaN.
    """
    close = close.copy()
    daily_vol = daily_vol.copy()
    labels = pd.Series(index=close.index, dtype="float64")

    n = len(close)
    for i in range(n):
        if i + 1 >= n:
            labels.iloc[i] = np.nan
            continue

        price_t = close.iloc[i]
        vol_t = daily_vol.iloc[i]

        if np.isnan(vol_t):
            labels.iloc[i] = np.nan
            continue

        # Set profit-take and stop-loss barriers
        pt = price_t * (1 + pt_mult * vol_t)
        sl = price_t * (1 - sl_mult * vol_t)

        # Look forward up to max_holding steps (or to end of series)
        end_idx = min(i + 1 + max_holding, n)
        future_prices = close.iloc[i + 1 : end_idx]

        hit_pt = future_prices >= pt
        hit_sl = future_prices <= sl

        hit_pt_idx = hit_pt.idxmax() if hit_pt.any() else None
        hit_sl_idx = hit_sl.idxmax() if hit_sl.any() else None

        if hit_pt_idx is not None and hit_sl_idx is not None:
            if hit_pt_idx < hit_sl_idx:
                labels.iloc[i] = 1
            elif hit_sl_idx < hit_pt_idx:
                labels.iloc[i] = -1
            else:
                labels.iloc[i] = 0
        elif hit_pt_idx is not None:
            labels.iloc[i] = 1
        elif hit_sl_idx is not None:
            labels.iloc[i] = -1
        else:
            labels.iloc[i] = 0

    return labels


In [75]:

# Use 'close' as adjusted price series for modelling
price = data["close"].copy()

daily_vol = get_daily_vol(price, span=50)
labels = apply_triple_barrier(
    close=price,
    daily_vol=daily_vol,
    pt_mult=1.0,
    sl_mult=1.0,
    max_holding=10,
)

# Keep only points where we have both vol and label
mask = labels.notna() & daily_vol.notna()
data = pd.DataFrame(
    {
        "adj_close": price[mask],
        "daily_vol": daily_vol[mask],
        "label": labels[mask].astype(int),
    }
)

print("Shape after labeling:", data.shape)
print("Label distribution:")
print(data["label"].value_counts())
data.head()


Shape after labeling: (1476, 3)
Label distribution:
label
 1    776
-1    653
 0     47
Name: count, dtype: int64


| Date | adj_close | daily_vol | label |
|---|---|---|---|
| 2020-01-06 | 95.143997 | 0.019084 | -1 |
| 2020-01-07 | 95.343002 | 0.013363 | -1 |
| 2020-01-08 | 94.598503 | 0.011853 | -1 |
| 2020-01-09 | 95.052498 | 0.010481 | -1 |
| 2020-01-10 | 94.157997 | 0.010179 | -1 |



## 3. Feature Engineering & Predictive Model

We now build a **predictive model** that tries to forecast the triple-barrier label (+1 / 0 / −1).

We use simple technical features:

- **Past returns** over 1, 5, and 10 days
- **Moving averages** (10-day and 20-day) via their ratio
- **Daily volatility**

Workflow:

1. Build feature matrix `X` and label vector `y`.
2. Split time-wise into **train** and **test** sets.
3. Train a **Random Forest classifier** on the training period.
4. Evaluate classification metrics on the test period.
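
One caveat with a plain time-based split: each label looks up to `max_holding` bars ahead, so the last few training labels depend on prices that belong to the test period. A common remedy is to drop ("purge") those rows from the training set. The sketch below shows the idea on toy data; `purged_time_split` is a hypothetical helper and is not what the notebook's split cell does.

```python
import numpy as np
import pandas as pd


def purged_time_split(X: pd.DataFrame, y: pd.Series, split_date: str, horizon: int):
    """Time-based split that drops the last `horizon` training rows.

    Hypothetical helper: rows whose label window could reach into the
    test period are removed from the training set.
    """
    split_ts = pd.Timestamp(split_date)
    train_idx = X.index[X.index < split_ts]
    test_idx = X.index[X.index >= split_ts]
    if horizon > 0:
        train_idx = train_idx[:-horizon]
    return X.loc[train_idx], y.loc[train_idx], X.loc[test_idx], y.loc[test_idx]


# Toy data: 30 business days, with the split at the 21st day
idx = pd.bdate_range("2021-12-01", periods=30)
X_toy = pd.DataFrame({"f": np.arange(30, dtype=float)}, index=idx)
y_toy = pd.Series(np.tile([1, -1, 0], 10), index=idx)

X_tr, y_tr, X_te, y_te = purged_time_split(
    X_toy, y_toy, split_date=str(idx[20].date()), horizon=10
)
print(len(X_tr), len(X_te))
```

With a 10-bar horizon, 10 of the 20 pre-split rows survive, leaving a gap between the last training label window and the first test row.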


In [76]:

# Feature engineering
data["ret_1d"] = data["adj_close"].pct_change(1)
data["ret_5d"] = data["adj_close"].pct_change(5)
data["ret_10d"] = data["adj_close"].pct_change(10)

data["ma_10"] = data["adj_close"].rolling(10).mean()
data["ma_20"] = data["adj_close"].rolling(20).mean()
data["ma_ratio_10_20"] = data["ma_10"] / data["ma_20"]

feature_cols = ["ret_1d", "ret_5d", "ret_10d", "daily_vol", "ma_ratio_10_20"]

# Drop rows with NaNs in features or labels
data = data.dropna(subset=feature_cols + ["label"])

X = data[feature_cols]
y = data["label"].astype(int)

# Time-based train/test split
split_date = "2022-01-01"
train_mask = X.index < split_date
test_mask = X.index >= split_date

X_train, y_train = X[train_mask], y[train_mask]
X_test, y_test = X[test_mask], y[test_mask]

X_train.shape, X_test.shape


((484, 5), (973, 5))

In [77]:

rf_clf = RandomForestClassifier(
    n_estimators=300,
    max_depth=6,
    min_samples_leaf=20,
    random_state=RANDOM_STATE,
    n_jobs=-1,
)

rf_clf.fit(X_train, y_train)

y_pred = rf_clf.predict(X_test)

print("Classification report (test period):")
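# zero_division=0 avoids UndefinedMetricWarning when a class (here 0) is never predicted
print(classification_report(y_test, y_pred, zero_division=0))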

print("Confusion matrix (test period):")
print(confusion_matrix(y_test, y_pred))


Classification report (test period):
              precision    recall  f1-score   support

          -1       0.48      0.29      0.36       433
           0       0.00      0.00      0.00        33
           1       0.53      0.74      0.62       507

    accuracy                           0.52       973
   macro avg       0.34      0.34      0.33       973
weighted avg       0.49      0.52      0.48       973

Confusion matrix (test period):
[[125   0 308]
 [  6   0  27]
 [130   0 377]]
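
A quick way to put the accuracy in context is to compare it with a naive baseline that always predicts the most frequent class. The snippet below only re-uses the confusion matrix printed above (rows are true labels, columns are predictions, in the order −1, 0, +1).

```python
import numpy as np

# Confusion matrix from the test period (rows: true -1, 0, +1; columns: predicted -1, 0, +1)
cm = np.array([
    [125, 0, 308],
    [6, 0, 27],
    [130, 0, 377],
])

total = cm.sum()
accuracy = np.trace(cm) / total                    # share of correctly classified days
majority_baseline = cm.sum(axis=1).max() / total   # always predict the most frequent true class

print(f"accuracy          = {accuracy:.3f}")
print(f"majority baseline = {majority_baseline:.3f}")
```

On these figures the model's accuracy (about 0.516) sits slightly below the always-predict-+1 baseline (about 0.521), and the middle column shows that class 0 is never predicted.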





## 4. Strategy Construction — From Predictions to Signals

We now turn model predictions into a **trading signal**.

Steps:

1. Use the classifier to get **class probabilities** for each day:
   \( P(y=-1), P(y=0), P(y=1) \).
2. Compute a **directional score**:
   \[ s_t = P(y=1) - P(y=-1) \]
3. Map this score into a **signal**:
   - If \( s_t > \theta \) → **+1** (bullish)
   - If \( s_t < -\theta \) → **−1** (bearish)
   - Else → **0** (no position)

We do **not** shift the signal here.  
With `trade_on_close=False` (the default), `backtesting.py`:
- Calls `next()` with information up to and including bar \( t \)
- Executes orders placed there at the **open of the next bar** (no look-ahead).
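
The score-to-signal mapping can be sketched on a handful of made-up probability rows. The column order `[-1, 0, +1]` and the values are assumptions for illustration; with a fitted classifier the order should be read from `classes_`.

```python
import numpy as np

# Toy class probabilities, columns ordered as classes [-1, 0, +1] (assumed ordering)
proba = np.array([
    [0.20, 0.10, 0.70],   # clearly bullish
    [0.45, 0.10, 0.45],   # undecided
    [0.60, 0.05, 0.35],   # clearly bearish
    [0.40, 0.15, 0.45],   # mildly bullish, but inside the no-trade band
])
theta = 0.1

score = proba[:, 2] - proba[:, 0]   # s_t = P(y=1) - P(y=-1)
signal = np.where(score > theta, 1, np.where(score < -theta, -1, 0))
print(score.round(2), signal)
```

The last row illustrates the role of \( \theta \): a small positive score is not enough to open a position.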


In [78]:

# Predict class probabilities on the full dataset (train + test);
# only the test-period rows are used in the backtest below
proba = rf_clf.predict_proba(X)
classes = rf_clf.classes_

idx_neg1 = np.where(classes == -1)[0][0]
idx_pos1 = np.where(classes == 1)[0][0]

score = proba[:, idx_pos1] - proba[:, idx_neg1]
data["score"] = pd.Series(score, index=X.index)

threshold = 0.1
signal = np.zeros(len(score), dtype=int)
signal[score > threshold] = 1
signal[score < -threshold] = -1

data["signal"] = pd.Series(signal, index=X.index)

data[["adj_close", "score", "signal"]].tail()


| Date | adj_close | score | signal |
|---|---|---|---|
| 2025-11-11 | 249.100006 | 0.163572 | 1 |
| 2025-11-12 | 244.199997 | 0.169794 | 1 |
| 2025-11-13 | 237.580002 | 0.188538 | 1 |
| 2025-11-14 | 234.690002 | 0.269767 | 1 |
| 2025-11-17 | 232.869995 | 0.280434 | 1 |



## 5. Backtesting with `backtesting.py`

We now evaluate the strategy on the **test period** only using [`backtesting.py`](https://kernc.github.io/backtesting.py/).

Steps:

1. Build a DataFrame for backtesting with columns:
   - `Open`, `High`, `Low`, `Close`, `Volume` (from `raw`)
   - `Signal` (our model-based signal, −1 / 0 / +1)
2. Implement a `Strategy` subclass that:
   - Reads the latest `Signal`
   - Goes long if signal = +1
   - Goes short if signal = −1
   - Stays flat if signal = 0
3. Run `Backtest` on the **test period** and inspect stats & equity curve.
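
As a rough cross-check on an event-driven backtest, the same long / short / flat logic can be approximated in pandas by lagging the signal one bar. This is a simplified sketch on toy numbers: it ignores commissions and uses close-to-close returns, whereas `backtesting.py` fills at the next bar's open, so the two will not match exactly.

```python
import pandas as pd

close = pd.Series(
    [100.0, 102.0, 101.0, 103.0, 104.0],
    index=pd.bdate_range("2022-01-03", periods=5),
)
signal = pd.Series([1, 1, -1, 0, 1], index=close.index)

# The position held during bar t is the signal known at the end of bar t-1
position = signal.shift(1).fillna(0)
bar_returns = close.pct_change().fillna(0.0)
strategy_returns = position * bar_returns
equity = (1 + strategy_returns).cumprod()
print(equity.round(4))
```

Dropping the `shift(1)` would let each bar's return be earned by a signal computed from that same bar, which is exactly the look-ahead the lag is meant to prevent.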


In [79]:

# Restrict to test period
test_index = X_test.index

# Build backtesting DataFrame: OHLCV + Signal
bt_data = raw.loc[test_index, ["Open", "High", "Low", "Close", "Volume"]].copy()
bt_data["Signal"] = data.loc[test_index, "signal"].astype(int)

bt_data.head()


| Date | Open | High | Low | Close | Volume | Signal |
|---|---|---|---|---|---|---|
| 2022-01-03 | 167.550003 | 170.703506 | 166.160507 | 170.404495 | 63520000 | -1 |
| 2022-01-04 | 170.438004 | 171.399994 | 166.349503 | 167.522003 | 70726000 | 0 |
| 2022-01-05 | 166.882996 | 167.126495 | 164.356995 | 164.356995 | 64302000 | 1 |
| 2022-01-06 | 163.4505 | 164.800003 | 161.936996 | 163.253998 | 51958000 | 0 |
| 2022-01-07 | 163.839005 | 165.2435 | 162.031006 | 162.554001 | 46606000 | 0 |


In [80]:

class MLSignalStrategy(Strategy):
    def init(self):
        pass

    def next(self):
        sig = int(self.data.Signal[-1])

        if sig == 1:
            # Go long (or stay long)
            if not self.position.is_long:
                self.position.close()
                self.buy()
        elif sig == -1:
            # Go short (or stay short)
            if not self.position.is_short:
                self.position.close()
                self.sell()
        else:  # sig == 0
            # Close any open position
            if self.position:
                self.position.close()


bt = Backtest(
    bt_data,
    MLSignalStrategy,
    cash=100_000,
    commission=0.0005,  # 5 bps per trade
    trade_on_close=False,
    exclusive_orders=True,
)

stats = bt.run()
stats




Start                     2022-01-03 00:00:00
End                       2025-11-17 00:00:00
Duration                   1414 days 00:00:00
Exposure Time [%]                    89.00308
Equity Final [$]                 138145.46927
Equity Peak [$]                  178395.47849
Commissions [$]                   14351.93637
Return [%]                           38.14547
Buy & Hold Return [%]                36.65719
Return (Ann.) [%]                      8.7292
Volatility (Ann.) [%]                38.19528
CAGR [%]                              5.92793
Sharpe Ratio                          0.22854
Sortino Ratio                         0.37155
Calmar Ratio                          0.16963
Alpha [%]                            16.83794
Beta                                  0.58126
Max. Drawdown [%]                   -51.46127
Avg. Drawdown [%]                    -6.43277
Max. Drawdown Duration      728 days 00:00:00
Avg. Drawdown Duration       45 days 00:00:00
# Trades                          
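
Two of the ratios in the table above can be reproduced from the other rows, which is a useful sanity check when reading the stats. The snippet assumes a zero risk-free rate and simply plugs in the printed figures.

```python
ann_return = 8.7292     # Return (Ann.) [%]
ann_vol = 38.19528      # Volatility (Ann.) [%]
max_dd = -51.46127      # Max. Drawdown [%]

sharpe = ann_return / ann_vol        # risk-free rate assumed to be zero
calmar = ann_return / abs(max_dd)    # annualized return per unit of worst drawdown
print(round(sharpe, 5), round(calmar, 5))
```

Both values agree with the reported Sharpe (0.22854) and Calmar (0.16963) ratios: the strategy earns a modest annualized return relative to a volatility near 38% and a drawdown above 51%.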

In [81]:

print("=== Backtesting.py stats (test period) ===")
print(stats)

bt.plot()




## 6. Recap & Extensions

We implemented a full **modular quant research workflow** on SPY:

1. **Data Curation**  
   - Downloaded & lightly cleaned daily OHLCV data for SPY using `yfinance`.

2. **Data Labeling (Triple-Barrier)**  
   - Estimated daily volatility.
   - Applied a simplified triple-barrier scheme to get labels (+1, 0, −1).

3. **Predictive Modelling**  
   - Engineered simple technical features (returns, moving-average ratio, volatility).
   - Trained a `RandomForestClassifier` with a time-based train/test split.

4. **Strategy Construction**  
   - Converted class probabilities into a directional score.
   - Mapped the score into a discrete signal: long / short / flat.

5. **Backtesting (`backtesting.py`)**  
   - Built a strategy that trades based on the signal.
   - Evaluated performance and plotted equity.

---

### Possible Extensions for Your Society

- Extend from 1 asset (SPY) to a **universe** and build a long-short portfolio.
- Try different **labeling schemes** (pure forward returns, meta-labeling, etc.).
- Use richer **features** and other models (GBM, XGBoost, etc.).
- Add **position sizing** logic and risk constraints inside the strategy.
- Incorporate **transaction cost modelling** and slippage experiments.

You can clone this notebook and swap individual modules (labels, model, signal mapping, backtest engine) while keeping the overall research framework intact.
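
One way to make that swapping explicit is to pass each module in as a plain function. The sketch below is a hypothetical arrangement, not how the notebook above is organized: `ResearchPipeline` and the three toy functions are made-up names, and the labeling rule is a simple next-day sign rather than the triple-barrier scheme.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np
import pandas as pd


@dataclass
class ResearchPipeline:
    """Hypothetical container wiring the swappable modules together."""
    label_fn: Callable[[pd.Series], pd.Series]        # prices -> labels
    feature_fn: Callable[[pd.Series], pd.DataFrame]   # prices -> features
    signal_fn: Callable[[np.ndarray], np.ndarray]     # scores -> {-1, 0, +1}

    def prepare(self, prices: pd.Series) -> pd.DataFrame:
        """Join features and labels and drop rows where either is missing."""
        features = self.feature_fn(prices)
        labels = self.label_fn(prices).rename("label")
        return features.join(labels).dropna()


# Toy modules: sign of the next-day return as label, 1-day return as feature
def next_day_sign(prices: pd.Series) -> pd.Series:
    return np.sign(prices.shift(-1) / prices - 1)


def one_day_return(prices: pd.Series) -> pd.DataFrame:
    return pd.DataFrame({"ret_1d": prices.pct_change()})


def threshold_signal(scores: np.ndarray, theta: float = 0.1) -> np.ndarray:
    return np.where(scores > theta, 1, np.where(scores < -theta, -1, 0))


pipe = ResearchPipeline(next_day_sign, one_day_return, threshold_signal)
prices = pd.Series([100.0, 101.0, 100.0, 102.0, 103.0])
dataset = pipe.prepare(prices)
signals = pipe.signal_fn(np.array([0.3, -0.2, 0.05]))
print(dataset)
print(signals)
```

Replacing `next_day_sign` with a triple-barrier function, or `threshold_signal` with a sizing rule, would leave the rest of the pipeline unchanged.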
