#### 03. Macro and Technical

In this module we focus on **macro data** and **technical features** to complement NLP and options signals:  

- **Technical features (small-cap biotech stocks)**:  
  We will compute a variety of **price-based indicators** (momentum, moving averages, volatility, cumulative returns, relative strength, etc.) tailored to the small-cap biotech universe under study.  
  These aim to capture short-term trends, overbought/oversold conditions, and volatility regimes that may drive returns.

- **Macro features**:  
  We will integrate **broad economic and market variables** (yield curve slopes, credit spreads, USD index, VIX, OAS, etc.).  
  Such features help contextualize biotech performance within global risk sentiment and macro-financial conditions.  

This dual approach gives the model both **stock-specific technical signals** and the **macro environment**, which is intended to improve regime detection and robustness across different market conditions.

______________________________________________________________________________________________

### Technical Indicators (T–1)

To contextualize each news event with recent market behavior, we engineer a set of **technical indicators** based on daily price and volume data, all computed **as of the day before the news (`T–1`)**.  
These indicators aim to capture **volatility, momentum, trading activity anomalies, cumulative trends, and risk exposure** in the small-cap biotech universe.

- `atr_10d_Tm1`: **Average True Range** over the last 10 days — a non-directional measure of recent price volatility, capturing intraday ranges and overnight gaps. Useful to assess **market turbulence**.

- `vol_10d_Tm1`, `vol_5d_Tm1`: **Realized volatility** over 10-day and 5-day rolling windows — standard deviation of daily returns. Provides a view of **short- vs. medium-term variability**.

- `momentum_5d_Tm1`, `momentum_20d_Tm1`: **Price momentum** over 5 and 20 trading days — signals **trend direction and strength**, often predictive of continuation or reversal around catalysts.

- `volume_5d`: **Average trading volume** over the past 5 days — baseline liquidity measure against which volume anomalies can be detected.

- `volume_spike_Tm1`: **Volume anomaly ratio**, defined as the volume on the feature date (`T–1`) divided by the average volume of the five preceding trading days (`volume_5d`). High spikes may indicate **pre-event positioning, rumors, or speculative flow**.

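- `cumret_20d_Tm1`: **Cumulative return** over the past 20 days, a proxy for **sentiment build-up** or trend exhaustion prior to the news. As implemented below, it uses the same 20-day percentage change as `momentum_20d_Tm1`, so the two columns are identical.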

- `maxdd_20d_Tm1`: **Maximum drawdown** in the past 20 days — a risk metric quantifying peak-to-trough decline, highlighting **recent fragility or stress** in the stock’s price path.
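To make the drawdown definition concrete, here is a small worked example on a made-up price path. The function name `max_drawdown` and the numbers are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

def max_drawdown(prices: pd.Series) -> float:
    """Largest peak-to-trough decline in `prices`, as a non-positive fraction."""
    running_peak = prices.cummax()          # highest price seen so far
    return float((prices / running_peak - 1).min())

# Hypothetical path: peak at 120, trough at 90, partial recovery afterwards.
path = pd.Series([100.0, 120.0, 90.0, 110.0, 99.0])
dd = max_drawdown(path)
print(round(dd, 4))  # -0.25
```

The running peak is 120 when the price reaches 90, so the maximum drawdown is 90/120 − 1 = −25%.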

---

All indicators are calculated in a **ticker-specific rolling fashion**.  
- The first 20 trading days per stock are discarded (insufficient history).  
- Every rolling feature is **lagged** (shifted by one trading day) and then joined on `feature_date`, the business day before the news, ensuring the model only has access to information that would have been available **before the news release**.  

These technical features enrich the dataset by embedding **recent market context** prior to news arrival — crucial in modeling **asymmetric reactions** in the highly volatile biotech domain.
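The ticker-specific rolling and lagging described above can be sketched as follows. This is a minimal illustration on synthetic prices, not the notebook's actual pipeline: the helper name `lagged_rolling_std`, the column layout, and the toy data are assumptions. The point is that shifting inside `groupby(...).transform` keeps one ticker's history from spilling into the next ticker's first row.

```python
import numpy as np
import pandas as pd

def lagged_rolling_std(prices: pd.DataFrame, window: int = 5) -> pd.Series:
    """Rolling std of daily returns per ticker, shifted one row within each ticker.

    Assumes columns `ticker`, `date`, `adj`, sorted by (ticker, date).
    """
    ret = prices.groupby("ticker")["adj"].pct_change()
    return ret.groupby(prices["ticker"]).transform(
        lambda s: s.rolling(window, min_periods=window).std().shift(1)
    )

# Toy panel: two tickers, ten business days each.
dates = pd.bdate_range("2024-01-01", periods=10)
rng = np.random.default_rng(0)
toy = pd.concat([
    pd.DataFrame({"ticker": t, "date": dates,
                  "adj": 10 * np.exp(rng.normal(0, 0.02, len(dates)).cumsum())})
    for t in ["AAA", "BBB"]
], ignore_index=True)

toy["vol_5d_Tm1"] = lagged_rolling_std(toy, window=5)
```

Each ticker starts with its own run of missing values (one for the first return, four more to fill the window, one for the lag), which is why a warm-up period has to be discarded per stock.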

In [10]:
import os
from pathlib import Path

import numpy as np
import pandas as pd
from dotenv import load_dotenv, find_dotenv

# Load .env first so that a DATA_DIR defined there is visible to os.getenv below.
load_dotenv(find_dotenv(usecwd=True), override=False)

ROOT = Path(__file__).resolve().parents[0] if "__file__" in globals() else Path.cwd()
DATA_DIR = Path(os.getenv("DATA_DIR", ROOT / "data"))

def p(file):
    """Return the path of `file` inside DATA_DIR."""
    return DATA_DIR / file

In [11]:
# Load the event-level panel and derive the feature date (T-1).
news = pd.read_parquet(p("final_ml.parquet"))
news["ticker"] = news["ticker"].str.upper().str.strip()
news["news_date"]    = pd.to_datetime(news["date"]).dt.normalize()
# BDay(1) skips weekends but not exchange holidays.
news["feature_date"] = news["news_date"] - pd.tseries.offsets.BDay(1)

news = news.dropna(subset=["feature_date"])
news.sort_values(["ticker", "feature_date"], inplace=True, kind="mergesort")


df = (pd.read_csv(p("underlying_prices_complete.csv"), parse_dates=["date"])
        .rename(columns={"close":"adj", "volume":"vol"}))

df["ticker"] = df["ticker"].str.upper().str.strip()
df["date"]   = df["date"].dt.normalize()
df.sort_values(["ticker","date"], inplace=True)

# True range: max of (high - low), |high - prev close|, |low - prev close|.
prev_close = df.groupby("ticker")["adj"].shift(1)
df["tr"] = pd.concat([
    df["high"] - df["low"],
    (df["high"] - prev_close).abs(),
    (df["low"]  - prev_close).abs()
], axis=1).max(axis=1)

def lagged_roll(col, window, func):
    """Rolling statistic per ticker, shifted one row inside each ticker
    so the lag never crosses a ticker boundary."""
    return df.groupby("ticker")[col].transform(
        lambda s: getattr(s.rolling(window, min_periods=window), func)().shift(1)
    )

# 10-day average true range (simple mean of the true range).
df["atr_10d_Tm1"] = lagged_roll("tr", 10, "mean")

# Realized volatility: std of daily returns over 10 and 5 days.
df["ret"] = df.groupby("ticker")["adj"].pct_change()
df["vol_10d_Tm1"] = lagged_roll("ret", 10, "std")
df["vol_5d_Tm1"]  = lagged_roll("ret", 5, "std")

# Momentum: n-day percentage change, lagged one row within each ticker.
g_adj = df.groupby("ticker")["adj"]
df["momentum_5d_Tm1"]  = g_adj.pct_change(5).groupby(df["ticker"]).shift(1)
df["momentum_20d_Tm1"] = g_adj.pct_change(20).groupby(df["ticker"]).shift(1)

# Baseline volume (mean of the five preceding days) and the ratio of the
# row's own volume to that baseline.
df["volume_5d"] = lagged_roll("vol", 5, "mean")
df["volume_spike_Tm1"] = df["vol"] / df["volume_5d"]

# NOTE: same formula as momentum_20d_Tm1, so the two columns are identical.
df["cumret_20d_Tm1"] = df["momentum_20d_Tm1"]

def max_draw(x):
    """Largest peak-to-trough decline inside the window (a value <= 0)."""
    peak = x.cummax()
    return (x / peak - 1).min()

df["maxdd_20d_Tm1"] = df.groupby("ticker")["adj"].transform(
    lambda s: s.rolling(20, min_periods=20).apply(max_draw, raw=False).shift(1)
)


tech_cols = ["vol_5d_Tm1","atr_10d_Tm1","vol_10d_Tm1",
             "momentum_5d_Tm1","momentum_20d_Tm1",
             "volume_spike_Tm1","cumret_20d_Tm1","maxdd_20d_Tm1"]


# Drop the first 20 rows of every ticker (warm-up period), then any row
# that still lacks a technical feature.
df = (
    df.groupby("ticker")
      .apply(lambda g: g.iloc[20:], include_groups=False)
      .reset_index()
      .drop(columns="level_1")
      .dropna(subset=tech_cols)
)

# Exact join on (ticker, feature_date): events whose feature_date is not a
# trading day in the price file (e.g. an exchange holiday) get NaN features.
df_tech = df.rename(columns={"date": "feature_date"})
merged  = news.merge(df_tech, on=["ticker", "feature_date"], how="left")

# Save under DATA_DIR, where the next cell reads it via p().
merged.to_parquet(p("news_with_tech_from_scratch.parquet"), index=False)
print("final rows:", merged.shape[0])

final rows: 6792


## Adding Market Context via IBB Volatility

To incorporate broader sector-level sentiment, we include macro features derived from the **iShares Biotechnology ETF (IBB)**, formerly the iShares Nasdaq Biotechnology ETF, a widely followed benchmark for **large- and mid-cap biotech stocks**.

Specifically, we compute a **10-day rolling volatility of IBB's log returns**, lagged by one day (`IBB_v_Tm1`), together with lagged 5-day and 20-day cumulative log returns (`IBB_ret_5d`, `IBB_ret_20d`) and a 20-day drawdown from the rolling high (`IBB_draw_20d`). These are aligned with our news dataset via `merge_asof` on `feature_date`, so that each news event is enriched with the most recent available estimate of market turbulence in the biotech sector at large.

While our trading universe is closer in style to **XBI** (which tilts toward small-cap biotech), IBB still provides useful macro context:

- It reflects broader investor sentiment toward biotech as a sector.
- It captures systemic volatility trends (e.g., FDA decisions, biotech selloffs) that often affect both large and small caps.
- It acts as a stabilizing signal that complements more idiosyncratic features like options or firm-level sentiment.

In the absence of high-frequency XBI volatility data, **IBB serves as a proxy** for the sector-level risk environment.
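The lagged volatility and the backward as-of alignment can be sketched on synthetic data as follows. The random price series and the two event dates are assumptions made for illustration; only the column names `IBB_v_Tm1`, `feature_date` and `etf_date` come from the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic ETF closes on business days (stand-in for downloaded IBB prices).
idx = pd.bdate_range("2024-01-01", periods=30, name="etf_date")
rng = np.random.default_rng(1)
close = pd.Series(100 * np.exp(rng.normal(0, 0.01, len(idx)).cumsum()), index=idx)

# 10-day realized volatility of log returns, lagged one trading day.
log_ret = np.log(close).diff()
vol = log_ret.rolling(10, min_periods=10).std().shift(1).rename("IBB_v_Tm1")
etf = vol.reset_index()

# Two hypothetical events; the second feature_date falls on a Saturday.
events = pd.DataFrame({"feature_date": pd.to_datetime(["2024-01-25", "2024-02-03"])})

# Backward match: each event takes the latest ETF row dated on or before it.
aligned = pd.merge_asof(events.sort_values("feature_date"), etf,
                        left_on="feature_date", right_on="etf_date",
                        direction="backward")
```

The Saturday event picks up the row dated Friday 2024-02-02, which is the behavior that makes `merge_asof` convenient for dates that are not trading days.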

In [3]:
import pandas as pd
import numpy as np
import yfinance as yf

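def flatten_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse MultiIndex columns (which yfinance can return) into flat
    names by joining the non-empty levels with an underscore."""
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = [
            "_".join([str(c) for c in col if str(c)])
            for col in df.columns.to_list()
        ]
    return df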


news = pd.read_parquet(p("news_with_tech_from_scratch.parquet"))
news["feature_date"] = pd.to_datetime(news["feature_date"])


news = flatten_columns(news)


etf = yf.download(
    "IBB",
    start="2000-01-01",
    end="2025-07-30",
    progress=False
)


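# Daily log returns and lagged cumulative log returns.
etf["ret"] = etf["Close"].pct_change().apply(lambda x: np.log(1 + x))
etf['IBB_ret_5d']  = etf['ret'].rolling(5).sum().shift(1)
etf['IBB_ret_20d'] = etf['ret'].rolling(20).sum().shift(1)

# Drawdown of the previous close from the previous 20-day rolling high.
roll_max = etf['Close'].rolling(20).max().shift(1)
etf['IBB_draw_20d'] = (etf['Close'].shift(1) - roll_max) / roll_max

# 10-day realized volatility of log returns, lagged one day.
# min_periods=10 so that every value really uses a full 10-day window.
etf["IBB_v_Tm1"] = (
    etf["ret"]
      .rolling(window=10, min_periods=10)
      .std()
      .shift(1)
)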


etf = (
    etf[["IBB_v_Tm1",'IBB_ret_5d','IBB_ret_20d','IBB_draw_20d']]
    .reset_index()
    .rename(columns={"Date": "etf_date"})
)
etf = flatten_columns(etf)


news = news.sort_values("feature_date")
etf  = etf.sort_values("etf_date")

merged = pd.merge_asof(
    news,
    etf,
    left_on="feature_date",
    right_on="etf_date",
    direction="backward",
)


merged = merged.drop(columns=["etf_date"])


merged.to_parquet("news_with_target_and_ibbetf_vol.parquet", index=False)
print("Done! final shape:", merged.shape)

YF.download() has changed argument auto_adjust default to True
Done! final shape: (6792, 42)


### Integrating Macro Market Signals

We incorporate high-level **macro-financial indicators** that reflect broad market sentiment and systemic risk conditions. These are aligned to each news observation using `merge_asof` on the `feature_date` (T–1), ensuring temporal integrity.

####  Features Added:
- **`SPY_ma5`**: 5-day moving average of the S&P 500 ETF (SPY) closing price, a smoothed proxy for the level and recent trend of the overall equity market.
- **`VIX_ma5`**: 5-day moving average of the CBOE Volatility Index (VIX), a measure of market-wide fear and risk aversion.
- **`spread_3m_10y`**: The 3-month Treasury yield (`^IRX`) minus the 10-year yield (`^TNX`), a classic gauge of the economic outlook. With this sign convention, a **positive** value indicates an inverted yield curve (recession signal).

Each of these signals captures macro sentiment that may influence how markets react to biotech news. For example:
- A high VIX might amplify the market’s response to negative news.
- A rising SPY may soften the perceived risk in biotech.
- A deeply inverted yield curve could enhance defensive positioning, including healthcare sector bias.

---
> **Note on Temporal Alignment**  
All features were constructed or aligned to **T–1**, the day *before* the news event.  
This keeps the dataset **point-in-time**: each row contains only information available before the event, which guards against **look-ahead leakage** and preserves the real-world applicability of any predictive model built on this dataset.
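The sign convention of the spread and the one-day lag can be checked on a handful of made-up yields. The numbers below are hypothetical and the extra column names (`inverted`, `spread_3m_10y_Tm1`) are introduced only for this sketch.

```python
import pandas as pd

idx = pd.bdate_range("2024-03-04", periods=6)
macro = pd.DataFrame({
    "IRX_3m":  [5.2, 5.2, 5.3, 5.3, 5.2, 5.1],   # hypothetical 3-month yields, %
    "TNX_10y": [4.2, 4.1, 4.3, 4.4, 4.3, 4.2],   # hypothetical 10-year yields, %
}, index=idx)

# Short rate minus long rate: positive means the curve is inverted.
macro["spread_3m_10y"] = macro["IRX_3m"] - macro["TNX_10y"]
macro["inverted"] = macro["spread_3m_10y"] > 0

# Lag by one trading day so the row dated t only carries the value from t-1.
macro["spread_3m_10y_Tm1"] = macro["spread_3m_10y"].shift(1)
```

In this toy sample the short rate sits above the long rate on every day, so the curve is flagged as inverted throughout, and the first lagged value is missing by construction.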

In [4]:
import yfinance as yf
import pandas as pd

macro_start = "2023-12-01"
macro_end   = "2025-07-30"


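# Daily closes from Yahoo Finance.
spy = yf.download("SPY", start=macro_start, end=macro_end, progress=False)["Close"]
vix = yf.download("^VIX", start=macro_start, end=macro_end, progress=False)["Close"]
# NOTE: the division by 10 assumes ^IRX/^TNX are quoted at 10x the yield;
# check the scale of the downloaded values before relying on the units.
irx = yf.download("^IRX", start=macro_start, end=macro_end, progress=False)["Close"] / 10
tnx = yf.download("^TNX", start=macro_start, end=macro_end, progress=False)["Close"] / 10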


macro = pd.concat([spy, vix, irx, tnx], axis=1, join='outer')
macro.columns = ["SPY", "VIX", "IRX_3m", "TNX_10y"]


macro["spread_3m_10y"] = macro["IRX_3m"] - macro["TNX_10y"]


macro["SPY_ma5"] = macro["SPY"].rolling(5).mean()
macro["VIX_ma5"] = macro["VIX"].rolling(5).mean()

# Shift by one trading day so each row carries only information up to the previous close.
macro[["SPY_ma5", "VIX_ma5", "spread_3m_10y"]] = macro[["SPY_ma5", "VIX_ma5", "spread_3m_10y"]].shift(1)


macro = macro[["SPY_ma5", "VIX_ma5", "spread_3m_10y"]]
macro.index.name = "macro_date"
macro = macro.reset_index()
news = merged  # IBB-enriched frame produced by the previous cell

news["feature_date"] = pd.to_datetime(news["feature_date"])
macro["macro_date"] = pd.to_datetime(macro["macro_date"])

news_macro = pd.merge_asof(
    news.sort_values("feature_date"),
    macro.sort_values("macro_date"),
    left_on="feature_date",
    right_on="macro_date",
    direction="backward"
).drop(columns="macro_date")

news_macro = news_macro.loc[:, ~news_macro.columns.duplicated()]
# Save under DATA_DIR, where the FRED cell below reads it via p().
news_macro.to_parquet(p("news_with_target_and_macro.parquet"), index=False)
print("✅ merged with macro features @ T-1:", news_macro.shape)

✅ merged with macro features @ T-1: (6792, 45)


# Macro Block — Data Sources, Alignment, and Engineered Features

## Goal
Enrich the ML dataset with macro/regime signals (rates, credit, volatility, USD, financial conditions) **while avoiding look-ahead**.  
Steps:
1. Download macro series from **FRED** (daily, weekly, and monthly).  
2. Align them to the **US business-day calendar**.  
3. Engineer features.  
4. Shift them to **T-1**.  
5. Merge into the **event-level panel**.

---

## 1. Inputs (FRED series)

We pull the following series via `pandas_datareader` (`DataReader(code, "fred", ...)`):

- **NFCI** — Chicago Fed National Financial Conditions Index (weekly → forward-fill to business days).  
- **DGS10, DGS2, DGS3MO** — Constant maturity US Treasury yields (10y, 2y, 3m), % p.a.  
- **T10YIE** — 10-Year Breakeven Inflation (inflation expectations), % p.a.  
- **DTWEXBGS** — Broad Trade-Weighted US Dollar Index (Goods & Services).  
- **VIXCLS** — CBOE VIX (S&P 500 implied volatility).  
- **BAA** — Moody’s Seasoned Baa Corporate Bond Yield, % p.a.  

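**Implementation details**:
- `FRED_API_KEY` is loaded from `.env` and checked up front, although the `pandas_datareader` FRED reader used in this cell does not take a key itself.  
- Each series is reindexed to a **US business-day index** (`freq="B"`, which skips weekends but not market holidays) and forward-filled. This also expands the weekly NFCI and the monthly BAA series to daily frequency.  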
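The reindex-and-forward-fill step can be sketched with a hypothetical weekly series. The values and the name `weekly_index` are made up; the pattern mirrors what the notebook applies to each FRED series.

```python
import pandas as pd

# Hypothetical weekly series stamped on Fridays (stand-in for a weekly FRED series).
weekly = pd.Series(
    [0.10, 0.20, 0.15],
    index=pd.to_datetime(["2024-01-05", "2024-01-12", "2024-01-19"]),
    name="weekly_index",
)

# Business-day calendar spanning the sample, then carry the last value forward.
bidx = pd.date_range("2024-01-05", "2024-01-19", freq="B")
daily = weekly.reindex(bidx).ffill()
```

One caveat worth checking, stated here as an assumption rather than a finding: forward-filling dates each value from its observation date. If a series is released some days after the period it describes, an additional lag may be needed to keep it strictly point-in-time.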

---

## 2. Calendar & Anti-Leakage Alignment

- Build a **business-day index** from `(min(date) − 800 days)` to `(max(date) + 5 days)`.  
- After feature engineering, **shift all macro features by +1 business day**:  
  - On day *t*, model only sees info available by **close of day t−1**.  
- Merge with ML dataset using:  
  ```python
  pd.merge_asof(..., direction="backward")
  ```

In [8]:
import os
import numpy as np
import pandas as pd
from pandas.tseries.offsets import BDay
from pandas_datareader import data as pdr
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv(), override=True)
FRED_API_KEY = os.getenv("FRED_API_KEY")
if not FRED_API_KEY:
    raise RuntimeError("Set FRED_API_KEY in your .env")

def _get_fred(series, start, end):
    """Download one FRED series as a DataFrame indexed by normalized dates."""
    df = pdr.DataReader(series, "fred", start=start, end=end)
    df.index = pd.to_datetime(df.index).normalize()
    return df

def _z(x, win=252):
    """Rolling z-score: population std (ddof=0), at least win//2 observations."""
    mu = x.rolling(win, min_periods=win//2).mean()
    sd = x.rolling(win, min_periods=win//2).std(ddof=0)
    return (x - mu) / sd

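def add_macro_block(df_ml: pd.DataFrame) -> tuple[pd.DataFrame, list]:
    """Merge lagged FRED macro features into `df_ml`; returns (frame, new feature names)."""
    d = df_ml.copy()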
    d["date"] = pd.to_datetime(d["date"]).dt.normalize()

    start = (d["date"].min() - pd.Timedelta(days=800)).date()
    end   = (d["date"].max() + pd.Timedelta(days=5)).date()

    
    ser = {}

    codes = ["NFCI", "DGS10", "DGS2", "DGS3MO", "T10YIE", "DTWEXBGS", "VIXCLS", "BAA"]
    for code in codes:
        try:
            ser[code] = _get_fred(code, start, end)
        except Exception as e:
            print(f"[WARN] FRED {code} not downloaded: {e}")
            ser[code] = None

    bidx = pd.date_range(start, end, freq="B")

   
    for k, v in ser.items():
        if v is not None and not v.empty:
            ser[k] = v.reindex(bidx).ffill()

    macro = pd.DataFrame(index=bidx)
    for k, v in ser.items():
        if v is not None and not v.empty:
            macro[k] = v[k]

    
    if {"DGS10","DGS2"}.issubset(macro.columns):
        macro["slope_2s10s"] = macro["DGS10"] - macro["DGS2"]
    if {"DGS10","DGS3MO"}.issubset(macro.columns):
        macro["slope_3m10y"] = macro["DGS10"] - macro["DGS3MO"]

    if "DGS10" in macro.columns:
        macro["rate_vol_20d"] = macro["DGS10"].diff().rolling(20, min_periods=10).std()

    if {"DGS10","T10YIE"}.issubset(macro.columns):
        macro["real_yield_proxy"] = macro["DGS10"] - macro["T10YIE"]

    if "NFCI" in macro.columns:
        macro["NFCI_z"] = _z(macro["NFCI"])
        macro["NFCI_stress"] = (macro["NFCI_z"] > 1.0).astype(int)

    if "DTWEXBGS" in macro.columns:
        macro["DXY_chg_5d"]  = macro["DTWEXBGS"].diff(5)
        macro["DXY_chg_20d"] = macro["DTWEXBGS"].diff(20)
        macro["DXY_z"]       = _z(macro["DTWEXBGS"])

    if {"BAA","DGS10"}.issubset(macro.columns):
        macro["IG_spread"] = macro["BAA"] - macro["DGS10"]

    if "VIXCLS" in macro.columns:
        macro["VIX_ma5"]        = macro["VIXCLS"].rolling(5, min_periods=3).mean()
        macro["VIX_chg_5d"]     = macro["VIXCLS"].pct_change(5)
        macro["VIX_chg_5d_d1"]  = macro["VIX_chg_5d"].diff(1)
        macro["VIX_chg_5d_d2"]  = macro["VIX_chg_5d_d1"].diff(1)
        macro["VIX_chg_5d_d2_ema3"] = macro["VIX_chg_5d_d2"].ewm(span=3, adjust=False, min_periods=3).mean()
        win = 63
        mu  = macro["VIX_chg_5d_d2"].rolling(win, min_periods=win//3).mean()
        sd  = macro["VIX_chg_5d_d2"].rolling(win, min_periods=win//3).std(ddof=0)
        macro["VIX_d2_5d_z"] = (macro["VIX_chg_5d_d2"] - mu) / sd

   
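    # Shift forward one business day: the row dated t carries values observed at t-1.
    macro = macro.reset_index().rename(columns={"index":"date"})
    macro["date"] = macro["date"] + BDay(1)

    # Drop macro columns that already exist in the panel (e.g. VIX_ma5 from the
    # Yahoo cell); otherwise merge_asof emits _x/_y suffixed duplicates.
    dup = [c for c in macro.columns if c in d.columns and c != "date"]
    macro = macro.drop(columns=dup)

    d = d.sort_values("date")
    macro = macro.sort_values("date")
    out = pd.merge_asof(d, macro, on="date", direction="backward")

    # Report only the engineered features that this call actually added.
    new_feats = [c for c in [
        "slope_2s10s","slope_3m10y","rate_vol_20d","real_yield_proxy",
        "NFCI","NFCI_z","NFCI_stress","DTWEXBGS","DXY_chg_5d","DXY_chg_20d","DXY_z",
        "IG_spread","VIXCLS","VIX_ma5","VIX_chg_5d","VIX_chg_5d_d1","VIX_chg_5d_d2","VIX_chg_5d_d2_ema3","VIX_d2_5d_z"
    ] if c in macro.columns]

    return out, new_feats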
df_ml_hy = pd.read_parquet(p("news_with_target_and_macro.parquet"))
df_ml_more , added_feats = add_macro_block(df_ml_hy)
print("Added:", added_feats)
df_ml_more.to_parquet('macro.parquet')

Added: ['slope_2s10s', 'slope_3m10y', 'rate_vol_20d', 'real_yield_proxy', 'NFCI', 'NFCI_z', 'NFCI_stress', 'DTWEXBGS', 'DXY_chg_5d', 'DXY_chg_20d', 'DXY_z', 'IG_spread', 'VIXCLS', 'VIX_chg_5d', 'VIX_chg_5d_d1', 'VIX_chg_5d_d2', 'VIX_chg_5d_d2_ema3', 'VIX_d2_5d_z']


## 3. Engineered Features

### Rates & Curve
- **slope_2s10s** = DGS10 − DGS2 (in % points).  
  *Yield curve steepness; inversions = risk-off bias.*  

- **slope_3m10y** = DGS10 − DGS3MO.  
  *Adds sensitivity to policy/front-end moves.*  

- **rate_vol_20d** = rolling 20d std of daily DGS10 changes.  
  *Rates volatility proxy, spikes = macro uncertainty.*  

- **real_yield_proxy** = DGS10 − T10YIE.  
  *Approximate real rate; higher = tighter conditions.*  

---

### Credit & Spreads
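- **IG_spread** = BAA − DGS10.  
  *Baa (BBB-equivalent) risk premium over Treasuries; widening = deteriorating credit. Because BAA is a monthly series, within a month this spread moves only with DGS10.*  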

---

### Financial Conditions
- **NFCI_z** = rolling z-score of NFCI (window 252, min 126).  
- **NFCI_stress** = 1{NFCI_z > 1.0}.  
  *Convenient regime flag for gating ON/OFF.*  

---

### US Dollar
- **DXY_chg_5d** = DTWEXBGS.diff(5).  
- **DXY_chg_20d** = DTWEXBGS.diff(20).  
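- **DXY_z** = rolling z-score of DTWEXBGS (252d window, min 126).  
  *The `DXY_` prefix is shorthand: the source is the Fed's broad trade-weighted dollar index, not the ICE U.S. Dollar Index (DXY).*  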

---

### Equity Volatility (VIX)
- **VIX_ma5** = 5d moving average of VIXCLS.  
- **VIX_chg_5d** = VIXCLS.pct_change(5).  
- **VIX_chg_5d_d1** = Δ(VIX_chg_5d).  
- **VIX_chg_5d_d2** = Δ(VIX_chg_5d_d1).  
- **VIX_chg_5d_d2_ema3** = EMA(3) of VIX_chg_5d_d2.  
- **VIX_d2_5d_z** = rolling z-score of VIX_chg_5d_d2 (63d window, min 21).  
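The chain of VIX differences listed above can be traced on a short made-up series. The levels are hypothetical; the operations mirror the definitions in the list.

```python
import pandas as pd

# Hypothetical VIX closing levels.
vix = pd.Series([14.0, 14.5, 15.0, 15.5, 16.0, 16.8, 18.0, 21.0, 20.0, 19.0])

chg_5d = vix.pct_change(5)   # 5-day percent change
d1 = chg_5d.diff(1)          # day-over-day change of that 5-day change
d2 = d1.diff(1)              # change of the change (second difference)
ema3 = d2.ewm(span=3, adjust=False, min_periods=3).mean()  # light smoothing
```

Each differencing step costs one more observation: the 5-day change first exists on the sixth row, its first difference on the seventh, and the second difference on the eighth.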

---

## 4. Rolling & Z-Score Conventions

- **Z-scores**:
  \[
  z_t = \frac{x_t - \mu_t}{\sigma_t}, \quad 
  \mu_t = \text{rolling mean}, \quad 
  \sigma_t = \text{rolling std}, \quad 
  \text{window} = 252
  \]  

  with `min_periods = window/2`.  

- **Volatility windows**:  
  - 20d (≈1m) for rates vol.  
  - 63d (≈1q) for standardizing the second difference of the VIX 5-day change (`VIX_d2_5d_z`).  
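The z-score convention can be written as a small helper and checked by hand on a linear ramp. The function name `rolling_z` and the tiny window of 4 are chosen only to keep the arithmetic visible; the notebook uses 252 and 63.

```python
import numpy as np
import pandas as pd

def rolling_z(x: pd.Series, win: int = 252) -> pd.Series:
    """z_t = (x_t - rolling mean) / rolling std, population std (ddof=0).

    The window ends at t and includes x_t itself; at least win // 2
    observations are required before a value is produced.
    """
    mu = x.rolling(win, min_periods=win // 2).mean()
    sd = x.rolling(win, min_periods=win // 2).std(ddof=0)
    return (x - mu) / sd

x = pd.Series(np.arange(10, dtype=float))  # 0, 1, ..., 9
z = rolling_z(x, win=4)
```

With a full window `[0, 1, 2, 3]` the mean is 1.5 and the population standard deviation is √1.25, so the last point scores 1.5/√1.25 ≈ 1.34. On a steady ramp every full window gives the same score.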

---

## 5. Practical Safeguards

- Missing series safely skipped.  
- Weekly/monthly series forward-filled.  
- All merges after **+1 BDay shift** → no look-ahead.  
- **Units**:  
  - Yields in %  
  - Slopes in % points  
  - VIX in level  
  - USD in index level  
  - Diffs/returns as computed  
- Rescale as needed before training.  

# Adding XBI–IBB Relative Features for Small-Cap Biotech Context

## Motivation
In the previous step, we had already introduced **IBB (iShares Biotechnology ETF)** as a proxy for large-/mid-cap biotech performance.  
Now, we enrich the dataset by adding **XBI (SPDR S&P Biotech ETF)**, which is more tilted toward **small-cap biotech stocks**.  

By combining IBB and XBI, we capture the **relative dynamics between large- and small-cap biotech segments**. This context is important because small-cap biotech companies are typically more sensitive to funding conditions, sentiment, and volatility.

---

## What the Code Does
1. **Data Loading**  
   - We fetch adjusted daily closes for `XBI` and `IBB` from Stooq via `pandas_datareader` (a Yahoo-based loader could be substituted).  
   - The two series are inner-joined on their common trading dates.

2. **Feature Engineering**
   - **Relative Spread and Z-Score:**  
     - `XBI_IBB_spread_log`: log price spread.  
     - `XBI_IBB_spread_z`: z-score of the spread.  
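   - **Relative Strength:**  
     - `XBI_over_IBB_RS20`: each ETF's price divided by its own 20-day moving average, expressed as the XBI ratio over the IBB ratio.  
   - **Return Differentials:**  
     - `XBI_IBB_ret5_diff` and `XBI_IBB_ret20_diff`: differences in rolling returns.  
   - **Beta & Residuals:**  
     - Estimate the 60-day rolling beta of IBB returns on XBI returns (`IBB_on_XBI_beta60`) and the 20-day mean of the resulting residuals (`IBB_vs_XBI_resid20`).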

3. **Single-Series Enhancements (XBI-specific)**
   - `XBI_rsi2_Tm1`: 2-day **RSI** using Wilder’s smoothing.  
   - `XBI_v5_Tm1`: 5-day realized volatility.  

4. **First and Second Derivatives (Acceleration Features)**
   - `XBI_5chg` / `IBB_5chg`: 5-day percent change in each ETF.  
   - `XBI_accel_5d` / `IBB_accel_5d`: **first derivative** of the 5-day percent change (i.e., day-over-day change in momentum).  
   - `Accel_diff_XBI_IBB`: relative acceleration between XBI and IBB.  
     This effectively captures the **second derivative of price levels** — how quickly momentum itself is changing, a useful signal for detecting inflection points.
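The rolling beta and residual construction from step 2 can be sketched on simulated returns. The return series below are synthetic, with a true slope of 0.6 chosen arbitrarily for the illustration; only the window lengths follow the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 300
r_x = pd.Series(rng.normal(0, 0.02, n))               # hypothetical XBI log returns
r_i = 0.6 * r_x + pd.Series(rng.normal(0, 0.005, n))  # hypothetical IBB log returns

# Rolling OLS slope of IBB on XBI: cov(r_i, r_x) / var(r_x) over 60 days.
cov_60 = r_i.rolling(60, min_periods=30).cov(r_x)
var_60 = r_x.rolling(60, min_periods=30).var()
beta_60 = cov_60 / var_60

# 20-day mean of the part of the IBB return not explained by XBI.
resid_20 = (r_i - beta_60 * r_x).rolling(20, min_periods=10).mean()
```

The ratio of rolling covariance to rolling variance equals the slope of a regression with intercept over the same window, so `beta_60` should hover around the simulated 0.6.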

---

## Anti-Leakage Safeguard
All engineered features are **shifted by one business day** (`+1 BDay`) before merging into the ML dataset.  
This ensures that on day *t*, the model only sees information that would have been available by the **end of day t−1**, preventing look-ahead bias.
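The effect of the `+1 BDay` shift can be seen on three made-up rows. The feature values and the event date are hypothetical; `shift(1, freq=BDay())` moves the timestamps rather than the values.

```python
import pandas as pd
from pandas.tseries.offsets import BDay

feats = pd.DataFrame(
    {"feature": [1.0, 2.0, 3.0]},
    index=pd.to_datetime(["2024-05-02", "2024-05-03", "2024-05-06"]),  # Thu, Fri, Mon
)
feats.index.name = "date"

# Move every timestamp forward one business day: Friday's value is now dated Monday.
lagged = feats.shift(1, freq=BDay()).reset_index()

# Hypothetical event on Monday 2024-05-06.
events = pd.DataFrame({"date": pd.to_datetime(["2024-05-06"])})
out = pd.merge_asof(events, lagged, on="date", direction="backward")
print(out["feature"].iloc[0])  # 2.0
```

The Monday event receives Friday's value (2.0), not Monday's own value (3.0), which is the intended T–1 behavior.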

---

## Why It Matters
- **IBB** = anchor for large/mid-cap biotech.  
- **XBI** = small-cap biotech, higher beta, more sensitive to liquidity/credit conditions.  
- Their **relative performance and accelerations** provide a **richer macro–micro context** for modeling biotech event outcomes (e.g., trial results, FDA decisions).  

In short: *by looking at both ETFs together, and by computing not just returns but also the first and second derivatives of their moves, we capture leading signals of stress or risk-on/risk-off regimes in biotech equities.*

In [14]:
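import numpy as np
import pandas as pd
from pandas.tseries.offsets import BDay
from pandas_datareader import data as pdr

def _load_stooq_close(symbol: str, start, end, label: str = None) -> pd.DataFrame:
    """Daily closes for `symbol` from Stooq, in one column named `<label>_close`."""
    df = pdr.DataReader(symbol, "stooq", start, end)
    # Sort so that dates run oldest first.
    df = df.sort_index()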
    df.index = pd.to_datetime(df.index).tz_localize(None)
    df.index.name = "date"
    colname = f"{label}_close" if label else "close"
    df = df[["Close"]].rename(columns={"Close": colname})
    return df
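def _rsi_wilder(close: pd.Series, n: int = 14) -> pd.Series:
    """RSI with Wilder smoothing (exponential average with alpha = 1/n)."""
    d = close.diff()
    up = d.clip(lower=0)     # gains
    dn = -d.clip(upper=0)    # losses, as positive numbers
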
    roll_up = up.ewm(alpha=1/n, adjust=False, min_periods=n).mean()
    roll_dn = dn.ewm(alpha=1/n, adjust=False, min_periods=n).mean()
    rs = roll_up / (roll_dn + 1e-12)
    return 100 - (100 / (1 + rs))

def _feat_relations_from_closes(xbi: pd.DataFrame, ibb: pd.DataFrame) -> pd.DataFrame:
    z = xbi.join(ibb, how="inner")
    x = z["XBI_close"]; i = z["IBB_close"]

    
    r_x = np.log(x).diff()
    r_i = np.log(i).diff()
    pct_change_ibb = i.pct_change(5)
    pct_change_xbi = x.pct_change(5)
    spread_log = np.log(x) - np.log(i)
    mu_s = spread_log.rolling(252, min_periods=60).mean()
    sd_s = spread_log.rolling(252, min_periods=60).std(ddof=0)
    spread_z = (spread_log - mu_s) / (sd_s + 1e-12)

    ma20_x = x.rolling(20, min_periods=10).mean()
    ma20_i = i.rolling(20, min_periods=10).mean()
    rs20   = (x/ma20_x) / (i/ma20_i)

    ret5_diff  = x.pct_change(5)  - i.pct_change(5)
    ret20_diff = x.pct_change(20) - i.pct_change(20)

    cov_60 = r_i.rolling(60, min_periods=30).cov(r_x)
    var_60 = r_x.rolling(60, min_periods=30).var()
    beta_60 = cov_60 / (var_60 + 1e-12)
    resid_20 = (r_i - beta_60*r_x).rolling(20, min_periods=10).mean()

   
    xbi_rsi = _rsi_wilder(x, n=2)                   
    xbi_vol = r_x.rolling(5, min_periods=5).std()  

    out = pd.DataFrame(index=z.index)
   
    out["XBI_IBB_spread_log"] = spread_log
    out["XBI_IBB_spread_z"]   = spread_z
    out["XBI_over_IBB_RS20"]  = rs20
    out["XBI_IBB_ret5_diff"]  = ret5_diff
    out["XBI_IBB_ret20_diff"] = ret20_diff
    out["IBB_on_XBI_beta60"]  = beta_60
    out["IBB_vs_XBI_resid20"] = resid_20
    
    out["XBI_rsi2_Tm1"] = xbi_rsi
    out["XBI_v5_Tm1"]   = xbi_vol
    out['XBI_5chg'] =pct_change_xbi
    out['IBB_5chg'] =pct_change_ibb
    out['XBI_accel_5d'] = out['XBI_5chg'].diff(1)
    out['IBB_accel_5d'] = out['IBB_5chg'].diff(1)
    out["Accel_diff_XBI_IBB"] = out["XBI_accel_5d"] - out["IBB_accel_5d"]
    
    

   
    # Shift the index forward one business day so the row dated t holds values from t-1.
    out = out.shift(1, freq=BDay()).reset_index().rename(columns={"index":"date"})
    return out

def add_only_xbi_ibb_relations(df_ml: pd.DataFrame) -> tuple[pd.DataFrame, list]:
    d = df_ml.copy()
    d["date"] = pd.to_datetime(d["date"]).dt.normalize()
    d = d.sort_values(["date","ticker"]).reset_index(drop=True)

    start = (d["date"].min() - pd.Timedelta(days=400)).date()
    end   = (d["date"].max() + pd.Timedelta(days=5)).date()

    xbi_px = _load_stooq_close("xbi.us", start, end, label="XBI")
    ibb_px = _load_stooq_close("ibb.us", start, end, label="IBB")

    rel = _feat_relations_from_closes(xbi_px, ibb_px)

   
    dup = [c for c in rel.columns if c in d.columns and c != "date"]
    if dup:
        rel = rel.drop(columns=dup)

    out = pd.merge_asof(d.sort_values("date"),
                        rel.sort_values("date"),
                        on="date", direction="backward")
    new_feats = [c for c in rel.columns if c != "date"]
    return out, new_feats


df_ml = pd.read_parquet('macro.parquet')
df_ml_rel, added_cols = add_only_xbi_ibb_relations(df_ml)
print("New features:", added_cols)
df_ml_rel.to_parquet('cisiamo.parquet')

New features: ['XBI_IBB_spread_log', 'XBI_IBB_spread_z', 'XBI_over_IBB_RS20', 'XBI_IBB_ret5_diff', 'XBI_IBB_ret20_diff', 'IBB_on_XBI_beta60', 'IBB_vs_XBI_resid20', 'XBI_rsi2_Tm1', 'XBI_v5_Tm1', 'XBI_5chg', 'IBB_5chg', 'XBI_accel_5d', 'IBB_accel_5d', 'Accel_diff_XBI_IBB']


______________________________________________________________________________________________

### 4. Macro Features — High Yield Credit Spreads (HY OAS, T–1)

To capture **credit market stress** and broader risk sentiment, we enrich the dataset with features derived from the **ICE BofA US High Yield Option-Adjusted Spread (HY OAS)**, sourced from the Federal Reserve (FRED).  

####  Pipeline
1. **Data ingestion**  
   - Retrieve `BAMLH0A0HYM2` series from FRED (via `fredapi` or fallback to `pandas_datareader`).  
   - Reindex to business days (`B`), forward-fill missing values, and normalize dates.  
   - Convert from **percent to basis points** (`HY_OAS_bp`).

2. **Feature engineering**  
   Using rolling windows and differences, we derive multiple indicators of **spread level, change, and stress**:
   - `HY_OAS_bp`: Raw spread in **basis points** (bps).  
   - `HY_OAS_z`: **Z-score normalized spread**, relative to a 252-day (~1 year) rolling mean and standard deviation.  
   - `HY_OAS_chg_5d`, `HY_OAS_chg_20d`: Absolute change in spreads over 5-day and 20-day horizons.  
   - `HY_OAS_ma20`: 20-day moving average of spreads — smoother representation of credit conditions.  
   - `HY_stress`: Binary indicator = 1 if spread > 1σ above its rolling mean, 0 otherwise — flags **credit stress regimes**.

3. **Alignment with ML dataset**  
   - The engineered features are **shifted by one business day forward** (`+1 BDay`) to ensure alignment at **T–1 relative to each news event**.  
   - Merged into the ML dataset via `merge_asof` (backward match on date).  
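The feature-engineering step can be sketched on a synthetic spread path with a sudden widening. The path, the random seed, and the shortened 20-day z-score window are assumptions made so the example stays small; the notebook uses 252 days.

```python
import numpy as np
import pandas as pd

# Hypothetical HY OAS in percent: 30 calm days near 3%, then a jump to about 4.5%.
idx = pd.bdate_range("2024-01-01", periods=40, name="date")
rng = np.random.default_rng(7)
level = np.array([3.0] * 30 + [4.5] * 10)
oas_pct = pd.Series(level + rng.normal(0, 0.05, 40), index=idx)

hy = pd.DataFrame({"HY_OAS_bp": oas_pct * 100.0})  # 1 percentage point = 100 bp

roll = 20  # shortened window for the sketch
mu = hy["HY_OAS_bp"].rolling(roll, min_periods=roll // 2).mean()
sd = hy["HY_OAS_bp"].rolling(roll, min_periods=roll // 2).std(ddof=0)
hy["HY_OAS_z"] = (hy["HY_OAS_bp"] - mu) / sd
hy["HY_OAS_chg_5d"] = hy["HY_OAS_bp"].diff(5)
hy["HY_stress"] = (hy["HY_OAS_z"] > 1.0).astype(int)
```

On the first day of the jump the spread sits far above its rolling mean, so the z-score exceeds 1 and `HY_stress` switches on. As the rolling window fills with post-jump values the z-score drifts back, which is why the flag marks regime shifts rather than persistently high levels.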

---

#### Intuition
- **HY OAS** is a widely used proxy for **risk appetite vs. stress** in credit markets.  
- Rising spreads typically signal **tightening financial conditions**, higher default risk, and **risk-off regimes**.  
- The z-score and binary stress indicator (`HY_stress`) capture **deviation from “normal” credit levels**, making regime shifts more visible to the model.  
- By lagging all features to **T–1**, we ensure the model only “knows” what was observable in credit markets before each news release.  

---

These macro credit features complement the **technical indicators** and **NLP/options signals**, embedding a **regime-aware macro layer** into the dataset.

In [15]:
import os
import numpy as np
import pandas as pd
from pandas.tseries.offsets import BDay
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv(), override=True)

FRED_API_KEY = os.getenv("FRED_API_KEY")
if not FRED_API_KEY:
    raise RuntimeError("Set FRED_API_KEY in your .env")

def load_hy_oas_from_fred(start_date: str, end_date: str) -> pd.DataFrame:
    
    try:
        
        from fredapi import Fred
        fred = Fred(api_key=FRED_API_KEY)
        s = fred.get_series("BAMLH0A0HYM2",
                            observation_start=start_date,
                            observation_end=end_date)
        hy = s.to_frame("HY_OAS_pct")
        hy.index = pd.to_datetime(hy.index).normalize()
    except Exception:
       
        from pandas_datareader import data as pdr
        hy = pdr.DataReader("BAMLH0A0HYM2", "fred",
                            start=start_date, end=end_date).rename(
                            columns={"BAMLH0A0HYM2": "HY_OAS_pct"})
        hy.index = pd.to_datetime(hy.index).normalize()

    
    bidx = pd.date_range(hy.index.min(), hy.index.max(), freq="B")
    hy = (hy.reindex(bidx).ffill().rename_axis("date").reset_index())
    
    hy["HY_OAS_bp"] = hy["HY_OAS_pct"] * 100.0
    return hy

def engineer_hy_features(hy: pd.DataFrame,
                         roll:int=252) -> pd.DataFrame:
    
    hy = hy.copy()
    
    mu = hy["HY_OAS_bp"].rolling(roll, min_periods=roll//2).mean()
    sd = hy["HY_OAS_bp"].rolling(roll, min_periods=roll//2).std(ddof=0)
    hy["HY_OAS_z"] = (hy["HY_OAS_bp"] - mu) / sd
    hy["HY_OAS_chg_5d"]= hy["HY_OAS_bp"].diff(5)
    hy["HY_OAS_chg_20d"]= hy["HY_OAS_bp"].diff(20)
    hy["HY_OAS_ma20"] = hy["HY_OAS_bp"].rolling(20, min_periods=10).mean()
    hy["HY_stress"]= (hy["HY_OAS_z"] > 1.0).astype(int)

    
    hy["date"] = pd.to_datetime(hy["date"]) + BDay(1)
    return hy

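def add_hy_to_dfml(df_ml: pd.DataFrame) -> pd.DataFrame:
    """Attach the lagged HY OAS features to `df_ml` via a backward as-of merge on `date`."""
    # Work on a copy of the frame passed in (it is no longer re-read from disk here).
    df_ml = df_ml.copy()
    df_ml["date"] = pd.to_datetime(df_ml["date"]).dt.normalize()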

    
    start = (df_ml["date"].min() - pd.Timedelta(days=600)).strftime("%Y-%m-%d")
    end   = (df_ml["date"].max() + pd.Timedelta(days=5)).strftime("%Y-%m-%d")

    hy = load_hy_oas_from_fred(start, end)
    hy = engineer_hy_features(hy)

    
    df_ml = df_ml.sort_values("date")
    hy    = hy.sort_values("date")
    out = pd.merge_asof(df_ml, hy, on="date", direction="backward")
    return out


df_ml = pd.read_parquet('cisiamo.parquet')
df_ml = add_hy_to_dfml(df_ml)
df_ml = df_ml.sort_values(['ticker'])

hy_feats = ["HY_OAS_bp","HY_OAS_z","HY_OAS_chg_5d","HY_OAS_chg_20d","HY_OAS_ma20","HY_stress"]

print(df_ml[hy_feats].describe().T.round(3))
print(df_ml)


                 count     mean     std      min      25%      50%      75%      max
HY_OAS_bp       6792.0  312.156  36.918  259.000  285.000  310.000  327.000  461.000
HY_OAS_z        6792.0   -0.445   1.312   -2.687   -1.452   -0.536    0.144    5.255
HY_OAS_chg_5d   6792.0   -2.116  21.508  -57.000  -11.000   -4.000    6.000  107.000
HY_OAS_chg_20d  6792.0   -7.918  42.034 -106.000  -27.000  -14.000   13.000  148.000
HY_OAS_ma20     6792.0  315.552  35.079  265.400  289.200  315.250  330.600  409.150
HY_stress       6792.0    0.115   0.319    0.000    0.000    0.000    0.000    1.000
     ticker       date  opt_avg_iv_call  opt_avg_iv_put  opt_put_call_ratio  \
3395   AARD 2025-02-12              NaN             NaN                 NaN   
5885   AARD 2025-06-24              NaN             NaN                 NaN   
3417   AARD 2025-02-13     

In [16]:
df_ml.to_parquet('cisiamo2.parquet')

### Ready for Modeling

The final table offers a **clean, leakage-safe feature block** that combines:

- **Micro**: price/volume  
- **Sector**: IBB, XBI  
- **Macro**: broad regime context  

All of these are aligned at **T–1**.

#### Suitable for:
- Downstream purged, embargoed cross-validation  
- Per-regime evaluation  
- Macro-gated live decision rules

In [17]:
df_ml.tail(5)

Unnamed: 0,ticker,date,opt_avg_iv_call,opt_avg_iv_put,opt_put_call_ratio,opt_total_option_volume,opt_avg_volume_per_contract,opt_iv_skew,finnhub_title,rss_titles,...,XBI_accel_5d,IBB_accel_5d,Accel_diff_XBI_IBB,HY_OAS_pct,HY_OAS_bp,HY_OAS_z,HY_OAS_chg_5d,HY_OAS_chg_20d,HY_OAS_ma20,HY_stress
3013,ZYME,2025-01-08,1.865318,0.0,0.2,6.0,1.2,-1.865318,Zymeworks Outlines Strategic Priorities and Ou...,,...,0.023737,0.023093,0.000644,2.79,279.0,-1.352509,-13.0,12.0,281.05,0
4953,ZYME,2025-05-14,3.008442,0.0,5.0,24.0,6.0,-3.008442,,Analysts Just Shipped A Stunning Upgrade To Th...,...,0.043642,0.038113,0.005528,3.09,309.0,-0.047584,-57.0,-100.0,372.5,0
4222,ZYME,2025-04-09,4.606494,0.0,2.0,3.0,1.5,-4.606494,,Hedge Fund Managers Are Aggressively Buying Bi...,...,-0.00696,-0.008408,0.001448,4.57,457.0,4.870908,107.0,135.0,350.05,1
2113,ZYME,2024-11-05,,,,,,-2.670033,Zymeworks Announces First Patient Dosed in Pha...,,...,-0.013998,-0.008204,-0.005794,2.87,287.0,-1.692481,5.0,-8.0,290.45,0
5335,ZYME,2025-05-29,3.535821,0.0,800000000.0,8.0,4.0,-3.535821,,"WisBusiness: the Podcast with Nikki Johnston, ...",...,-0.027041,-0.024914,-0.002127,3.23,323.0,0.287125,-2.0,-71.0,336.5,0
