## Prototype Overview

This notebook demonstrates **Signal Engine**, an end-to-end pipeline that:
1. Ingests OHLCV data for MCD & TMO (strong cases) and COP & PFE (weak cases).
2. Engineers technical indicators (momentum, trend, volatility, volume).
3. Trains rule-based models (Decision Tree & RuleFit) plus an XGBoost model explained by a surrogate tree.
4. Back-tests signals in a walk-forward fashion with realistic fees.

In [None]:
# Prototype tickers: two strong cases (MCD, TMO) and two weak cases (COP, PFE)
syms = ["MCD", "TMO", "COP", "PFE"]

### Data Fetching
- Downloads OHLCV (Open, High, Low, Close, Volume) data for each ticker with a `yfinance`-based CLI script.
- Uses `auto_adjust=True` so prices are adjusted for splits and dividends.
- Saves raw CSVs to `data/raw/{ticker}.csv`.
- Covers a 10-year history to support multi-year walk-forward tests.
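
The contents of `fetch_data.py` are not shown in this notebook, so the following is only a minimal sketch of what the download step might look like. The helper names `raw_path`, `fetch_ohlcv`, and `save_raw` are hypothetical; the sketch assumes `yfinance`'s `Ticker.history` with `period="10y"` and `auto_adjust=True`, and it needs network access to actually download anything.

```python
from pathlib import Path


def raw_path(ticker: str, root: str = "data/raw") -> Path:
    """Location of the raw CSV for one ticker: data/raw/{ticker}.csv."""
    return Path(root) / f"{ticker}.csv"


def fetch_ohlcv(ticker: str, period: str = "10y"):
    """Download split/dividend-adjusted OHLCV bars for one ticker."""
    import yfinance as yf  # imported lazily; only needed when downloading

    bars = yf.Ticker(ticker).history(period=period, auto_adjust=True)
    # history() also returns Dividends and Stock Splits columns; keep OHLCV only.
    return bars[["Open", "High", "Low", "Close", "Volume"]]


def save_raw(ticker: str, root: str = "data/raw") -> Path:
    """Fetch one ticker and write it to its raw CSV path (hypothetical helper)."""
    out = raw_path(ticker, root)
    out.parent.mkdir(parents=True, exist_ok=True)
    fetch_ohlcv(ticker).to_csv(out)
    return out
```

Calling `save_raw("MCD")` would then write `data/raw/MCD.csv`, matching the layout described above.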


In [None]:
!python ../src/ingestion/fetch_data.py --tickers MCD TMO COP PFE

### Data Cleaning
- **Forward-fill** gaps on missing business days to avoid NaNs.
- **Adjust** for splits/dividends via `yfinance`'s `auto_adjust=True` (already applied at download time).
- **Drop** duplicate timestamps and enforce the OHLCV column order.
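
As a rough illustration of these cleaning steps, here is a pandas sketch. It is not the code in `clean_data.py`; the function name `clean_bars`, the choice to keep the first of any duplicated timestamps, and the column names are assumptions made for the example.

```python
import pandas as pd

OHLCV = ["Open", "High", "Low", "Close", "Volume"]


def clean_bars(raw: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate timestamps, forward-fill business-day gaps, order columns."""
    df = raw.copy()
    df.index = pd.to_datetime(df.index)
    # Assumption: when a timestamp repeats, the first row wins.
    df = df[~df.index.duplicated(keep="first")].sort_index()
    # Reindex onto every business day in range, then carry the last bar forward.
    business_days = pd.bdate_range(df.index.min(), df.index.max())
    df = df.reindex(business_days).ffill()
    return df[OHLCV]


# Tiny made-up frame: one duplicated day and one missing business day.
raw_demo = pd.DataFrame(
    {
        "Volume": [10, 11, 12, 13],
        "Close": [1.0, 1.5, 2.0, 4.0],
        "Open": [1.0, 1.5, 2.0, 4.0],
        "High": [1.0, 1.5, 2.0, 4.0],
        "Low": [1.0, 1.5, 2.0, 4.0],
    },
    index=["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-04"],
)
clean_demo = clean_bars(raw_demo)
```

Note that forward-filling also copies the previous day's volume into the gap, which may or may not be what a volume-based feature expects.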


In [None]:
!python ../src/ingestion/clean_data.py --syms MCD TMO COP PFE


### Engineered Features
- **Trend:** `sma_5`, `sma_10`, `ema_20`, `ema_50`, `macd`, `macd_sig`, `macd_hist`
- **Momentum:** `roc_10`, `rsi_14`
- **Volatility:** `atr_14`, `vol_20d`
- **Mean-Reversion:** `bb_width`, `pct_b`
- **Volume:** `obv`, `vol_chg`
- **Strength:** `adx_14`

Each group captures a different regime signal for the rule engine.
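
A handful of the indicators listed above can be sketched with plain pandas. The exact formulas in `compute_features.py` are not shown here, so the definitions below are assumptions: RSI uses simple rolling averages of gains and losses (Wilder smoothing is a common alternative), Bollinger bands use a 20-day window with 2 standard deviations, and `bb_width` is normalized by the middle band.

```python
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add a subset of the indicators to a frame that has a 'Close' column."""
    out = df.copy()
    close = out["Close"]

    # Trend
    out["sma_5"] = close.rolling(5).mean()
    out["ema_20"] = close.ewm(span=20, adjust=False).mean()

    # Momentum: 10-day rate of change and a 14-day RSI
    out["roc_10"] = close / close.shift(10) - 1
    delta = close.diff()
    avg_gain = delta.clip(lower=0).rolling(14).mean()
    avg_loss = delta.clip(upper=0).abs().rolling(14).mean()
    out["rsi_14"] = 100 - 100 / (1 + avg_gain / avg_loss)

    # Mean reversion: Bollinger band width and %B
    mid = close.rolling(20).mean()
    sd = close.rolling(20).std()
    upper, lower = mid + 2 * sd, mid - 2 * sd
    out["bb_width"] = (upper - lower) / mid
    out["pct_b"] = (close - lower) / (upper - lower)
    return out
```

Rolling windows leave NaNs at the start of each series, so the first 19 rows of the Bollinger columns are empty and would normally be dropped before training.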


In [None]:
!python ../src/features/compute_features.py --syms MCD TMO COP PFE


## Model Engines Overview

This notebook uses **four** model configurations to generate and interpret trading signals:

1. **Decision Tree (Holdout)**  
   - Trains a single shallow `DecisionTreeClassifier` on an 80/20 holdout split.  
   - Exports human-readable “if–then” rules via `export_text`.  
   - Serves as a quick, fully interpretable baseline.

2. **Grid-Tuned Decision Tree**  
   - Performs a time-series-aware grid search over `max_depth` and `min_samples_leaf`.  
   - Selects per-symbol probability thresholds (long vs flat) to maximize validation accuracy.  
   - Retains full interpretability with optimized hyperparameters.

3. **RuleFit Classifier**  
   - Builds an ensemble of small trees, then fits an L1-penalized linear model over the rules extracted from them.  
   - Extracts a **sparse** set of high-weight rules (`max_rules`) for readability.  
   - Balances predictive power with compact rule lists.

4. **XGBoost + Surrogate**  
   - Trains a full `XGBClassifier` with randomized hyperparameter search and time-series CV.  
   - Fits a shallow `DecisionTreeRegressor` on the model’s predicted probabilities to generate surrogate rules.  
   - Leverages XGBoost’s accuracy while preserving a rule-based explanation layer.


In [None]:
!python ../src/models/train_models.py --mode holdout --syms MCD TMO COP PFE
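
For the holdout baseline, a minimal scikit-learn sketch might look like the following. It is not the code in `train_models.py`: the function name `fit_holdout_tree`, the `max_depth=3` default, and the choice to split chronologically (first 80% of rows for training) are assumptions, and the data is synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text


def fit_holdout_tree(X, y, feature_names, train_frac=0.8, max_depth=3):
    """Fit a shallow tree on the first 80% of rows and score it on the rest."""
    split = int(len(X) * train_frac)  # chronological split, no shuffling
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(X[:split], y[:split])
    accuracy = tree.score(X[split:], y[split:])
    rules = export_text(tree, feature_names=list(feature_names))
    return tree, accuracy, rules


# Synthetic stand-in data; the column names only mimic the real features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)
tree, holdout_acc, rules = fit_holdout_tree(X_demo, y_demo, ["rsi_14", "roc_10"])
print(rules)
```

The printed text is the nested "if-then" rule listing that `export_text` produces, one indented line per split or leaf.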


In [None]:
!python ../src/models/grid_tree_search.py --syms MCD TMO COP PFE
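
The grid-tuned tree can be sketched with scikit-learn's `GridSearchCV` and `TimeSeriesSplit`, which keeps every validation fold strictly after its training fold. The grid values, the candidate thresholds, and the helper names `tune_tree` and `pick_threshold` are assumptions for illustration, not the settings used in `grid_tree_search.py`.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.tree import DecisionTreeClassifier


def tune_tree(X, y, n_splits=5):
    """Grid-search tree hyperparameters with forward-chaining CV folds."""
    grid = {"max_depth": [2, 3, 4], "min_samples_leaf": [10, 25, 50]}
    search = GridSearchCV(
        DecisionTreeClassifier(random_state=0),
        param_grid=grid,
        cv=TimeSeriesSplit(n_splits=n_splits),
        scoring="accuracy",
    )
    search.fit(X, y)
    return search.best_estimator_, search.best_params_


def pick_threshold(proba_long, y_true, candidates=None):
    """Return the long-vs-flat probability cutoff with the best accuracy."""
    if candidates is None:
        candidates = np.linspace(0.40, 0.70, 7)  # assumed search range
    proba_long, y_true = np.asarray(proba_long), np.asarray(y_true)
    scores = [np.mean((proba_long >= t).astype(int) == y_true) for t in candidates]
    return float(candidates[int(np.argmax(scores))])


# Synthetic stand-in data for one symbol.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)
best_tree, best_params = tune_tree(X_demo, y_demo)
```

In a per-symbol setup, `pick_threshold` would be called on validation-period probabilities from `best_tree.predict_proba(...)[:, 1]`, one symbol at a time.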


In [None]:
!python ../src/models/train_rulefit.py --syms MCD TMO COP PFE


In [None]:
!python ../src/models/xgb_surrogate.py --syms MCD TMO COP PFE
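
The surrogate idea can be shown without XGBoost itself. In the sketch below, scikit-learn's `GradientBoostingClassifier` stands in for `XGBClassifier` (a deliberate substitution to keep the example dependency-light); any model with `predict_proba` would work the same way. The function name `fit_surrogate` and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeRegressor, export_text


def fit_surrogate(black_box, X, feature_names, max_depth=3):
    """Fit a shallow regression tree to the black-box model's P(long)."""
    proba_long = black_box.predict_proba(X)[:, 1]
    surrogate = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    surrogate.fit(X, proba_long)
    # Fidelity: R^2 of the surrogate against the black-box output, not the labels.
    fidelity = surrogate.score(X, proba_long)
    rules = export_text(surrogate, feature_names=list(feature_names))
    return surrogate, fidelity, rules


# Synthetic stand-in data with a simple two-condition target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = ((X_demo[:, 0] > 0) & (X_demo[:, 1] < 0.5)).astype(int)

black_box = GradientBoostingClassifier(random_state=0).fit(X_demo, y_demo)
surrogate, fidelity, rules = fit_surrogate(
    black_box, X_demo, ["rsi_14", "macd_hist", "adx_14"]
)
```

A low fidelity score would mean the extracted rules are a poor summary of the boosted model, so it is worth checking before reading too much into the rules.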


### Back-Test Configuration
- **Walk-forward:** 5 years train → 1 year validation → 1 year test, rolled forward one year at a time.  
- **Capital:** \$100,000.  
- **Fees:** 5 bp per side, charged on both entry and exit.  
- **Signals:** executed at the next day's open; the position is either fully invested or flat.
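
The configuration above can be sketched in plain Python. This is an illustration under stated assumptions rather than the logic in `backtest_run.py`: windows are expressed in calendar years, fees are modeled as a proportional charge on traded notional, fractional shares are allowed, and the names `Window`, `walk_forward`, and `backtest` are hypothetical.

```python
from typing import Iterator, NamedTuple, Sequence


class Window(NamedTuple):
    train: range  # calendar years used for training
    valid: int    # validation year
    test: int     # out-of-sample test year


def walk_forward(first_year: int, last_year: int, train_years: int = 5) -> Iterator[Window]:
    """Yield 5y-train / 1y-validation / 1y-test windows, rolled one year at a time."""
    start = first_year
    while start + train_years + 1 <= last_year:
        yield Window(
            train=range(start, start + train_years),
            valid=start + train_years,
            test=start + train_years + 1,
        )
        start += 1


def backtest(opens: Sequence[float], signals: Sequence[int],
             capital: float = 100_000.0, fee_bps: float = 5.0) -> float:
    """signals[t] is decided after day t closes and filled at opens[t + 1]."""
    fee = fee_bps / 10_000
    cash, shares = capital, 0.0
    for t in range(len(opens) - 1):
        price = opens[t + 1]
        if signals[t] == 1 and shares == 0.0:    # enter: go fully invested
            shares = cash * (1 - fee) / price
            cash = 0.0
        elif signals[t] == 0 and shares > 0.0:   # exit: go flat
            cash = shares * price * (1 - fee)
            shares = 0.0
    # Any open position is marked at the final open without an exit fee.
    return cash + shares * opens[-1]


windows = list(walk_forward(2015, 2024))
final_equity = backtest([100.0, 100.0, 110.0, 110.0], [1, 0, 0, 0])
```

With ten calendar years of data this yields four windows, and in the toy price path a 10% move nets slightly less than \$110,000 because 5 bp is paid on both the entry and the exit.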

In [None]:
!python ../backtest_run.py --syms MCD TMO COP PFE
