# Pipeline: Conditional Return Analysis

## 1. Pull in Data (Cleaned & Normalised)
- Load price data (e.g., OHLCV).
- Compute **log returns** and standardize/clean missing or infinite values.
- (Optional) Resample to desired timeframe and align features.
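
The return and cleaning step above can be sketched as follows. This is a minimal illustration, assuming prices arrive as a pandas Series; the helper name `compute_log_returns` and the toy prices are hypothetical and not part of the original pipeline.

```python
import numpy as np
import pandas as pd

def compute_log_returns(close: pd.Series) -> pd.Series:
    """1-bar log returns with infinite and missing values removed (illustrative helper)."""
    close = close.where(close > 0)                  # non-positive prices have no logarithm: mask as NaN
    rets = np.log(close / close.shift(1))           # log(P_t / P_{t-1})
    rets = rets.replace([np.inf, -np.inf], np.nan)  # defensive: treat any infinity as missing
    return rets.dropna()

# Toy series with one bad tick (0.0) to show the cleaning
prices = pd.Series([100.0, 101.0, 0.0, 102.0, 103.0])
log_rets = compute_log_returns(prices)
```

Returns that touch the bad tick on either side are dropped, so only two returns survive in this toy example.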

## 2. Random Baseline Sample
- Randomly select a subset of timestamps.
- Compute forward returns for these timestamps to form a **random returns distribution**.
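
One possible way to build the random baseline is sketched below, assuming a Series of 1-bar log returns. The helper `forward_log_return`, the sample size of 200, and the synthetic returns are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

def forward_log_return(log_rets: pd.Series, horizon: int) -> pd.Series:
    """Cumulative log return over the next `horizon` bars, excluding the current bar."""
    # The rolling sum ends at t + horizon; shifting back places that value at time t
    return log_rets.rolling(horizon).sum().shift(-horizon)

rng = np.random.default_rng(seed=0)
log_rets = pd.Series(rng.normal(0.0, 0.001, size=1_000))   # synthetic stand-in for real returns

fwd_5 = forward_log_return(log_rets, horizon=5).dropna()    # last 5 bars have no full window
baseline_times = rng.choice(fwd_5.index.to_numpy(), size=200, replace=False)
random_returns = fwd_5.loc[baseline_times]                  # the random returns distribution
```

Sampling without replacement from timestamps that have a complete forward window keeps the baseline free of truncated returns.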

## 3. Define the Condition
- **Indicator thresholds** (example set):
  - **Volatility** (e.g., rolling σ): `σ_t > thresh_high` or `σ_t < thresh_low`
  - **RSI**: `RSI < 30` (oversold) or `RSI > 70` (overbought)
  - **MACD**: MACD cross above/below signal line
  - **ADX**: `ADX > 25` (trend) or `< 20` (range)
- **Moving average crossovers**:
  - Short-term MA crosses above/below long-term MA (e.g., `MA(20) > MA(50)`).
- **Autocorrelation & stationarity**:
  - Negative lag-1 autocorrelation (`ρ₁ < 0`) for mean reversion.
  - Stationarity tests (e.g., ADF) to confirm regime.
- **Pattern recognition**:
  - Head & Shoulders, Double Top/Bottom (rule-based or library-assisted detection).
- **ML-based directional predictions**:
  - Train/predict labels (`up`/`down`) and require model confidence > threshold.

 **Event =** condition becomes **true** at time *t*.
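
Because an event is the bar where the condition *becomes* true, a boolean condition has to be reduced to its False-to-True transitions. A minimal sketch, assuming the condition is already a boolean Series; `event_onsets` and the RSI values are hypothetical.

```python
import pandas as pd

def event_onsets(condition: pd.Series) -> pd.Series:
    """True only on bars where `condition` flips from False to True."""
    condition = condition.astype(bool)
    previous = condition.shift(1, fill_value=False)  # state on the prior bar
    return condition & ~previous

rsi = pd.Series([45, 32, 29, 27, 31, 28, 35])         # toy RSI readings
events = event_onsets(rsi < 30)                       # oversold condition from the list above
event_times = events[events].index                    # bars 2 and 5
```

With this convention a condition that is already true on the very first bar counts as an event; drop that bar if this is not wanted.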

## 4. Forward-Return Windows
- For each event time *t*, compute cumulative log return over windows:
  - **[1, 3, 5, 10, 20, 50, 100]** bars ahead.
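
The windows can be computed in one pass from log prices, since the cumulative log return from *t* to *t + h* equals `log(P_{t+h}) - log(P_t)`. The sketch below assumes a `close` Series; the function name, column names, and event times are illustrative.

```python
import numpy as np
import pandas as pd

HORIZONS = [1, 3, 5, 10, 20, 50, 100]

def forward_return_table(close: pd.Series, horizons=HORIZONS) -> pd.DataFrame:
    """One column per horizon h holding log(P_{t+h}) - log(P_t)."""
    log_price = np.log(close)
    return pd.DataFrame({f"fwd_{h}": log_price.shift(-h) - log_price for h in horizons})

rng = np.random.default_rng(1)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.001, 500))))  # synthetic prices

fwd = forward_return_table(close)
event_times = [10, 200, 450]          # hypothetical event bars
event_fwd = fwd.loc[event_times]      # NaN where the window runs past the end of the data
```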

## 5. Empirical Comparison Plots
- Compare **event-conditioned** forward returns vs **random** baseline:
  - Histograms (density overlay)
  - ECDF curves
  - Box/violin plots
  - (Optional) QQ plots and summary tables (mean, median, σ, t-test/KS/MWU p-values)
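
As a rough illustration of the comparison, the sketch below builds ECDF curves and the two-sample Kolmogorov-Smirnov statistic with NumPy only. The two samples are synthetic, and the helper names are assumptions. For p-values, `scipy.stats.ks_2samp` and `scipy.stats.mannwhitneyu` would be the usual choices.

```python
import numpy as np

def ecdf(sample):
    """Sorted values and their cumulative probabilities."""
    x = np.sort(np.asarray(sample, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

def ks_statistic(a, b):
    """Largest vertical gap between the two empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])                        # the gap can only change at data points
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(2)
random_returns = rng.normal(0.0, 0.002, 2_000)           # synthetic baseline sample
event_returns = rng.normal(0.0005, 0.002, 300)           # synthetic event-conditioned sample

x_evt, y_evt = ecdf(event_returns)                       # curve to plot against the baseline ECDF
d_stat = ks_statistic(event_returns, random_returns)
```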


In [None]:
# Pulls in price data, normalises it (rolling average), and creates features using the pandas ta-lib library
# Features to create: Bollinger Band Width, ADX, MACD, GARCH - optimise across 3x3, 
# RSI, ARIMA rolling directional prediction, price-VWAP spread, Variance AutoRegression, VIX, Volume
# daily/weekly/monthly highs and lows, distances from highs and lows

# Machine learning features:
# - NAAIM Exposure (Manager equity exposure)
# - AAII bull-bear spread - retail sentiment
# - CFTC COT Nasdaq-100 managed money net positions

# - US Interest rates
# - Foreign Interest rates
# - Yield spreads
# - FX pairs - EUR/USD, GBP/USD
# - Gold prices
# - CPI, PPI, PCE, PMI, GDP

# Pattern mining (price structure) – H&S, double top/bottom, triangles; peak/trough (ZigZag); trend strength; breakouts/false breaks; volatility contraction.
# Market regimes – clustering or HMM labels (risk-on/off).

# Text-based:
# Jerome Powell events – speeches, FOMC statements.
# U.S. Presidents' announcements – White House releases.
# Presidential tweets/posts – Trump, Biden, @POTUS, @WhiteHouse.
# Elon Musk tweets – @elonmusk timeline.
# Financial news headlines – Bloomberg, Reuters, Yahoo, etc.
# For each text: sentiment scores, topic tags, burst indicators, embeddings.

# Ensemble/meta-learning model (summary):
# Base features: all above + technicals (RSI, ATR, MAs, etc.).
# Classical signals: GARCH, ARIMA, decision tree → feed outputs as features.
# Main learner: transformer predicts NQ direction.
# Meta-filter: logistic regression (trained on backtests) decides when to trust predictions.
# Evaluation: walk-forward/backtests with realistic costs & slippage.

#### Sharpe ratio for variance adjusted returns


In [None]:
import pandas as pd
import numpy as np

In [None]:
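df = pd.read_csv(
    '/Users/damensavvasavvi/Desktop/AlgoTrading/marketDataNasdaqFutures/NQ_continuous_backadjusted_1m_cleaned.csv',
    parse_dates=['timestamp'],
).set_index('timestamp')
df = df[df.index < '2024-01-01']  # keep data before 2024
time = df.index
# Session filter: weekdays, 10:00-16:59. The index is in UTC, so these are UTC hours, not New York local time.
newyork_session = (time.dayofweek < 5) & (time.hour >= 10) & (time.hour < 17)
df = df[newyork_session]
# Downsample to roughly 5-min bars by keeping every 5th 1-min row. This does not aggregate OHLCV,
# and missing 1-min rows shift the grid off the :00/:05 marks (see the last rows of the output).
df = df[::5]
df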

timestamp,open,high,low,close,volume,symbol
2020-07-20 10:00:00+00:00,11693.734874,11697.035159,11693.184826,11693.184826,128.0,NQU0
2020-07-20 10:05:00+00:00,11692.634779,11695.660040,11692.634779,11695.109993,80.0,NQU0
2020-07-20 10:10:00+00:00,11690.984636,11694.559945,11690.984636,11692.359755,50.0,NQU0
2020-07-20 10:15:00+00:00,11689.884541,11690.159565,11689.334494,11689.609518,30.0,NQU0
2020-07-20 10:20:00+00:00,11685.759185,11687.134304,11684.109043,11686.584256,88.0,NQU0
...,...,...,...,...,...,...
2023-12-29 16:39:00+00:00,18329.573507,18337.683818,18329.032820,18334.439694,994.0,NQH4
2023-12-29 16:44:00+00:00,18333.087975,18337.143131,18330.114194,18336.602443,620.0,NQH4
2023-12-29 16:49:00+00:00,18352.012034,18359.852002,18350.660316,18359.581658,1259.0,NQH4
2023-12-29 16:54:00+00:00,18350.389972,18363.366470,18345.794129,18362.014752,1346.0,NQH4
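
The last rows above sit at :39, :44, :49 and :54 because slicing every 5th row drifts whenever a 1-minute bar is missing. A possible alternative, sketched here on toy data, is to aggregate with `resample`. The helper name `resample_ohlcv` and the toy bars are assumptions, and this is not the cell that produced the output above.

```python
import numpy as np
import pandas as pd

def resample_ohlcv(bars: pd.DataFrame, rule: str = "5min") -> pd.DataFrame:
    """Aggregate 1-minute OHLCV bars into clock-aligned bars."""
    agg = {"open": "first", "high": "max", "low": "min", "close": "last", "volume": "sum"}
    out = bars.resample(rule, label="left", closed="left").agg(agg)
    return out.dropna(subset=["close"])   # drop empty buckets (gaps, out-of-session periods)

# Toy 1-minute data with the 10:03 bar missing
idx = pd.date_range("2020-07-20 10:00", periods=12, freq="1min", tz="UTC").delete(3)
n = len(idx)
bars = pd.DataFrame(
    {
        "open": np.arange(n, dtype=float),
        "high": np.arange(n, dtype=float) + 1.0,
        "low": np.arange(n, dtype=float) - 1.0,
        "close": np.arange(n, dtype=float) + 0.5,
        "volume": 10.0,
    },
    index=idx,
)

five_min = resample_ohlcv(bars)   # bars stamped 10:00, 10:05, 10:10
naive = bars[::5]                 # rows stamped 10:00, 10:06, 10:11
```

Aggregation also keeps the true high, low, and total volume of each 5-minute interval, which row slicing discards.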


In [None]:
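df = df['close']  # keep the close series only
log_returns = np.log(df / df.shift(1)).fillna(0)  # 1-bar log returns; the first bar is set to 0
rolling_vol = log_returns.rolling(20).std()  # 20-bar rolling volatility (includes the current bar)
threshold = rolling_vol.median() * 1.5
# Keep only bars whose rolling volatility is below 1.5x its median (drops high-volatility bars)
df = df[rolling_vol < threshold]
df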

timestamp
2020-07-20 10:00:00+00:00    11693.184826
2020-07-20 10:05:00+00:00    11695.109993
2020-07-20 10:10:00+00:00    11692.359755
2020-07-20 10:15:00+00:00    11689.609518
2020-07-20 10:20:00+00:00    11686.584256
                                 ...     
2023-12-29 16:24:00+00:00    18334.439694
2023-12-29 16:29:00+00:00    18322.274227
2023-12-29 16:44:00+00:00    18336.602443
2023-12-29 16:54:00+00:00    18362.014752
2023-12-29 16:59:00+00:00    18358.229940
Name: close, Length: 55182, dtype: float64

# 📊 Feature Documentation for Market ML Model

This document explains all engineered features generated from the OHLCV time-series.  
All features are computed in **UTC** and are **lagged to avoid look-ahead bias**.

---

## ✅ Core Principles

- **All inputs remain in UTC**  
- **No leakage:** Any rolling statistic uses `.shift(1)` to ensure only past data is used  
- **Symbols supported:** Features can be computed for one or multiple tickers  
- **Returns computed on log prices**  
- **Three return variants included:** raw, rolling-z, and volatility-scaled  
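
The no-leakage rule can be illustrated with a lagged rolling statistic. This is a minimal sketch; the helper name and the toy returns are illustrative.

```python
import numpy as np
import pandas as pd

def lagged_rolling_std(x: pd.Series, window: int) -> pd.Series:
    """Rolling std that, at bar t, only uses values up to bar t-1."""
    return x.rolling(window).std().shift(1)

rets = pd.Series([0.01, -0.02, 0.015, 0.0, 0.03, -0.01])
vol_lagged = lagged_rolling_std(rets, window=3)

# The value at bar 3 depends only on bars 0..2, so changing bar 3 leaves it untouched
rets_changed = rets.copy()
rets_changed.iloc[3] = 0.5
vol_changed = lagged_rolling_std(rets_changed, window=3)
```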

---

## 🧠 Feature Categories

The features fall into several groups:

1. **Returns & Return Normalisations**
2. **Volatility & Range Measures**
3. **Momentum / Trend**
4. **Volume & Flow**
5. **VWAP & Fair-Value Distance**
6. **Microstructure Proxies (from OHLCV)**
7. **Seasonality & Time Encoding**
8. **GARCH Model-Based Volatility**

---

## 1. 🔁 Returns & Return Normalisations

| Feature | Description | Notes |
|---------|---------------|--------|
| `ret_1` | 1-bar log return | Lagged by 1 bar |
| `ret_1_z` | Rolling z-score of 1-bar return | Uses past mean & std to normalise |
| `ret_1_volsc` | 1-bar return scaled by √(rolling variance) | Makes return magnitudes comparable across volatility regimes |
| `ret_k` | k-bar cumulative log return | `'k'` in config; lagged |
| `ret_k_z` | Rolling z-score of k-bar return | Normalised cumulative return |
| `ret_k_volsc` | k-bar return scaled by √(rolling variance) | Volatility-adjusted |

<font color="green" >These help the model learn return magnitude, normalised behaviour, and regime-adjusted movement. </font>
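
A rough sketch of the three 1-bar variants from the table, assuming a `close` Series and a 20-bar window (the window length and the function name are assumptions). The rolling mean and standard deviation are shifted by one bar so each normalisation uses past data only. Whether the finished columns are lagged again before modelling is left to the caller.

```python
import numpy as np
import pandas as pd

def return_variants(close: pd.Series, window: int = 20) -> pd.DataFrame:
    """Raw, rolling z-scored, and volatility-scaled 1-bar log returns."""
    ret_1 = np.log(close).diff()
    mu = ret_1.rolling(window).mean().shift(1)            # past mean only
    sd = np.sqrt(ret_1.rolling(window).var()).shift(1)    # sqrt of past rolling variance
    return pd.DataFrame({
        "ret_1": ret_1,
        "ret_1_z": (ret_1 - mu) / sd,
        "ret_1_volsc": ret_1 / sd,
    })

rng = np.random.default_rng(3)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.001, 200))))  # synthetic prices
feats = return_variants(close)
```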

---

## 2. 📈 Volatility & Range Measures

| Feature | Description |
|---------|--------------|
| `atr` | Average True Range (Wilder's) – measures price range volatility |
| `parkinson_var` | Parkinson range-based variance using ln(high/low)² |
| `rv` | Realised volatility: √(Σ returns² over window) |
| `vol_of_vol` | Volatility of volatility (stdev of return stdev) |

<font color="green" >Captures short-term, range-based, and realised price volatility and volatility regime shifts.</font>
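
The sketch below shows one way the first three measures could be computed from OHLC columns. The 20-bar window, the function name, and the synthetic bars are assumptions. The Parkinson estimator uses the standard `1 / (4 ln 2)` scaling, and Wilder's smoothing is expressed as an exponential mean with `alpha = 1 / window`.

```python
import numpy as np
import pandas as pd

def volatility_features(ohlc: pd.DataFrame, window: int = 20) -> pd.DataFrame:
    prev_close = ohlc["close"].shift(1)
    true_range = pd.concat(
        [
            ohlc["high"] - ohlc["low"],
            (ohlc["high"] - prev_close).abs(),
            (ohlc["low"] - prev_close).abs(),
        ],
        axis=1,
    ).max(axis=1)
    atr = true_range.ewm(alpha=1 / window, adjust=False).mean()      # Wilder's smoothing

    hl = np.log(ohlc["high"] / ohlc["low"])
    parkinson_var = (hl ** 2).rolling(window).mean() / (4 * np.log(2))

    ret = np.log(ohlc["close"]).diff()
    rv = np.sqrt((ret ** 2).rolling(window).sum())                   # sqrt of summed squared returns

    feats = pd.DataFrame({"atr": atr, "parkinson_var": parkinson_var, "rv": rv})
    return feats.shift(1)                                            # lag so bar t sees data up to t-1

rng = np.random.default_rng(4)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.001, 100))))
ohlc = pd.DataFrame({"high": close * 1.01, "low": close * 0.99, "close": close})  # synthetic bars
vol_feats = volatility_features(ohlc)
```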

---

## 3. 📉 Momentum / Trend Indicators

| Feature | Description |
|---------|--------------|
| `ema_fast` | Fast EMA of close (default 12) |
| `ema_slow` | Slow EMA of close (default 26) |
| `macd` | Fast EMA – Slow EMA (trend momentum) |
| `macd_sig` | EMA of MACD (signal line) |
| `macd_div` | MACD – Signal (histogram; trend strength) |
| `ema_spread` | Close minus fast EMA (distance to trend anchor) |

<font color="green">These measure trending behavior, trend acceleration, and trend strength.</font>
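
These columns can be reproduced with pandas exponential means, as in the sketch below. The fast and slow spans follow the defaults in the table. The signal span of 9 is the conventional MACD default and is an assumption here, as is the function name.

```python
import numpy as np
import pandas as pd

def macd_features(close: pd.Series, fast: int = 12, slow: int = 26, signal: int = 9) -> pd.DataFrame:
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    macd = ema_fast - ema_slow
    macd_sig = macd.ewm(span=signal, adjust=False).mean()   # signal span is an assumed default
    return pd.DataFrame({
        "ema_fast": ema_fast,
        "ema_slow": ema_slow,
        "macd": macd,
        "macd_sig": macd_sig,
        "macd_div": macd - macd_sig,
        "ema_spread": close - ema_fast,
    })

close = pd.Series(np.linspace(100.0, 130.0, 120))   # steadily rising toy prices
macd_feats = macd_features(close)                   # not lagged here; shift(1) before modelling
```

In a steady uptrend the fast EMA sits above the slow EMA, so `macd` and `ema_spread` are both positive at the end of this toy series.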

---

## 4. 📦 Volume & Flow Indicators

| Feature | Description |
|---------|--------------|
| `vol_z` | Volume z-score (deviation from rolling norm) |
| `obv` | On-Balance Volume – cumulative volume adjusted by direction |
| `cmf` | Chaikin Money Flow – volume-weighted buying/selling pressure |


<font color="green">Captures demand/supply imbalance, momentum supported by volume, and accumulation/distribution phases.</font>
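
A compact sketch of the three volume features using their textbook definitions. The window length, function name, and toy bars are assumptions, and only the volume z-score is lagged here.

```python
import numpy as np
import pandas as pd

def volume_features(bars: pd.DataFrame, window: int = 20) -> pd.DataFrame:
    vol = bars["volume"]
    vol_mu = vol.rolling(window).mean().shift(1)
    vol_sd = vol.rolling(window).std().shift(1)
    vol_z = (vol - vol_mu) / vol_sd                               # deviation from the past norm

    direction = np.sign(bars["close"].diff()).fillna(0.0)         # +1 up bar, -1 down bar, 0 flat
    obv = (direction * vol).cumsum()

    hl_range = (bars["high"] - bars["low"]).replace(0.0, np.nan)
    mf_multiplier = ((bars["close"] - bars["low"]) - (bars["high"] - bars["close"])) / hl_range
    mf_multiplier = mf_multiplier.fillna(0.0)                     # zero-range bars contribute no flow
    cmf = (mf_multiplier * vol).rolling(window).sum() / vol.rolling(window).sum()

    return pd.DataFrame({"vol_z": vol_z, "obv": obv, "cmf": cmf})

close = pd.Series([10.0, 11.0, 10.0, 10.0, 12.0])
bars = pd.DataFrame({
    "high": close,                  # every bar closes at its high
    "low": close - 1.0,
    "close": close,
    "volume": [100.0, 200.0, 300.0, 400.0, 500.0],
})
flow = volume_features(bars, window=2)
```

Since every toy bar closes at its high, the money-flow multiplier is 1 throughout and `cmf` equals 1 wherever the window is full.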

---

## 5. ⚖️ VWAP & Fair-Value Distance

| Feature | Description |
|---------|--------------|
| `vwap_roll` | Rolling VWAP: Σ(typical price × volume) / Σ(volume) over the window |
| `dist_vwap` | Close – VWAP (raw deviation) |
| `dist_vwap_z` | Distance to VWAP scaled by ATR (vol-adjusted fair-value gap) |

<font color="green">Captures how far price has moved from its volume-weighted fair value, in raw and volatility-adjusted terms.</font>
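
The rolling VWAP and its distances might be computed as below. This sketch assumes the typical price is `(high + low + close) / 3` and takes the ATR as a precomputed input. The function name, window, and toy values are illustrative, and lagging is left to the caller.

```python
import numpy as np
import pandas as pd

def vwap_features(bars: pd.DataFrame, atr: pd.Series, window: int = 20) -> pd.DataFrame:
    typical = (bars["high"] + bars["low"] + bars["close"]) / 3.0
    pv_sum = (typical * bars["volume"]).rolling(window).sum()
    vwap_roll = pv_sum / bars["volume"].rolling(window).sum()
    dist_vwap = bars["close"] - vwap_roll
    return pd.DataFrame({
        "vwap_roll": vwap_roll,
        "dist_vwap": dist_vwap,
        "dist_vwap_z": dist_vwap / atr,        # ATR-scaled distance
    })

price = pd.Series([1.0, 2.0, 3.0, 4.0])
bars = pd.DataFrame({"high": price, "low": price, "close": price,
                     "volume": [1.0, 3.0, 1.0, 1.0]})
atr = pd.Series(0.5, index=bars.index)         # placeholder ATR for the toy example
vwap = vwap_features(bars, atr, window=2)
```

At bar 1 the window holds prices 1 and 2 with volumes 1 and 3, so the VWAP is `(1*1 + 2*3) / 4 = 1.75`, pulled toward the heavier bar.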

---

## 6. 🧩 Microstructure Proxies (from OHLCV Only)

| Feature | Description |
|---------|--------------|
| `clv` | Close Location Value: (Close – Low)/(High – Low) ∈ [0,1] |
| `gap_pct` | Gap size between prev close and open (absolute % gap) |
| `dir_persist` | Fraction of positive returns in rolling window |

<font color="green">These approximate order-flow or market microstructure signals without needing Level-2 data.</font>
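
The three proxies follow directly from their table definitions. The sketch below is an illustration on toy bars; the function name and window are assumptions, and the outputs are not lagged.

```python
import numpy as np
import pandas as pd

def microstructure_features(bars: pd.DataFrame, window: int = 20) -> pd.DataFrame:
    hl_range = (bars["high"] - bars["low"]).replace(0.0, np.nan)   # avoid division by zero
    clv = (bars["close"] - bars["low"]) / hl_range

    prev_close = bars["close"].shift(1)
    gap_pct = ((bars["open"] - prev_close) / prev_close).abs() * 100.0

    ret = bars["close"].diff()
    up = (ret > 0).astype(float).where(ret.notna())                # keep the first bar as NaN
    dir_persist = up.rolling(window).mean()

    return pd.DataFrame({"clv": clv, "gap_pct": gap_pct, "dir_persist": dir_persist})

bars = pd.DataFrame({
    "open":  [10.0, 11.0, 12.0, 12.0],
    "high":  [11.0, 12.0, 13.0, 13.0],
    "low":   [9.0, 10.0, 11.0, 11.0],
    "close": [10.5, 12.0, 11.0, 12.5],
})
micro = microstructure_features(bars, window=2)
```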

---

## 7. 🕒 Seasonality & Time Encoding (UTC)

| Feature | Description |
|---------|--------------|
| `tod_sin`, `tod_cos` | Time-of-day encoded as cyclical features |
| `dow_sin`, `dow_cos` | Day-of-week encoded cyclically |


<font color="green">Helps the model identify behaviour that depends on time of day or day of week.</font>
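
Cyclical encoding maps a periodic quantity onto the unit circle, so that 23:59 and 00:00 end up close together. A minimal sketch, assuming a UTC `DatetimeIndex`; the function name is illustrative.

```python
import numpy as np
import pandas as pd

def time_encoding(index: pd.DatetimeIndex) -> pd.DataFrame:
    minutes = (index.hour * 60 + index.minute).to_numpy()
    day = index.dayofweek.to_numpy()                    # Monday = 0
    tod_angle = 2 * np.pi * minutes / (24 * 60)
    dow_angle = 2 * np.pi * day / 7
    return pd.DataFrame(
        {
            "tod_sin": np.sin(tod_angle),
            "tod_cos": np.cos(tod_angle),
            "dow_sin": np.sin(dow_angle),
            "dow_cos": np.cos(dow_angle),
        },
        index=index,
    )

idx = pd.DatetimeIndex(
    ["2020-07-20 00:00", "2020-07-20 06:00", "2020-07-20 12:00"], tz="UTC"
)                                                       # a Monday
time_feats = time_encoding(idx)
```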

---

## 8. 📉 GARCH Model-Based Volatility

| Feature | Description |
|---------|--------------|
| `garch_sigma` | Conditional volatility (σₜ) from best-fit GARCH model |

### How it's computed:

- A small grid search is run on training data over:  
  `(p,q) ∈ {(1,1), (1,2), (2,1)}` × `{Normal, Student-T}`
- Best model is selected using AIC (or BIC if configured)
- Model is **not refit on test** – parameters are fixed and the full series is filtered
- σₜ is lagged by 1 bar to avoid leakage


<font color="green">Adds a robust statistical volatility estimate that adapts to volatility clustering.</font>
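
The "fixed parameters, filter the full series" step can be illustrated with a hand-written GARCH(1,1) recursion, `σ²ₜ = ω + α·r²ₜ₋₁ + β·σ²ₜ₋₁`. This sketch does not perform the grid search or the fitting, which would normally be delegated to a GARCH library. The parameter values, the starting variance, and the function name are all hypothetical.

```python
import numpy as np
import pandas as pd

def garch11_filter(returns, omega: float, alpha: float, beta: float, init_var: float) -> np.ndarray:
    """Run the GARCH(1,1) variance recursion with fixed parameters; returns sigma_t."""
    r = np.asarray(returns, dtype=float)
    sigma2 = np.empty_like(r)
    sigma2[0] = init_var                                  # e.g. return variance from the training set
    for t in range(1, r.size):
        sigma2[t] = omega + alpha * r[t - 1] ** 2 + beta * sigma2[t - 1]
    return np.sqrt(sigma2)

# Hypothetical parameters, standing in for values estimated on training data
OMEGA, ALPHA, BETA, INIT_VAR = 1e-7, 0.05, 0.90, 1e-6

rng = np.random.default_rng(5)
returns = pd.Series(rng.normal(0.0, 0.001, 300))          # synthetic returns
sigma = garch11_filter(returns, OMEGA, ALPHA, BETA, INIT_VAR)

# Lag by one bar, as described above, before using it as a feature
garch_sigma = pd.Series(sigma, index=returns.index).shift(1)
```

Because the parameters never change, the same function can be run over train and test data in one pass without refitting.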

---

## 🧼 Cleaning & Safety

- All time-series windows use **`.shift(1)`** to avoid forward-looking leakage
- Rows containing infinite or NaN values are dropped after feature assembly
- Index remains in **UTC at all times**

---

## 🗂 Returned Output Format

The feature builder returns:

- A DataFrame with all features + a `symbol` column
- A dictionary of best GARCH specs per symbol, e.g.:

```python
{
  "NQ": {"p": 1, "q": 2, "dist": "t", "criterion": "aic", "score": 12345.6}
}
```

In [None]:
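df_ultimate_feature_set = pd.DataFrame(index=df.index)
# DataFrame has no .concat method; use pd.concat to join the feature blocks column-wise on the index
feature_frames = [df_norm_zscore, df_returns_variants, df_volatility_features, df_momentum_features, df_vol_z, df_vwap_features, df_microstructure, df_vp, df_time, df_garch_feature, df_cdl]
df_ultimate_feature_set = pd.concat([df_ultimate_feature_set, *feature_frames], axis=1)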