# Starter ML Notebook (Market Baseline)

This notebook walks through loading the curated market dataset (features + labels),
training a simple baseline model, and computing a basic PnL metric.

Prereqs:
- Generate a dataset first (from the repo root):

```bash
python scripts/build_market_dataset.py \
  --exchange binance \
  --symbol BTC/USDT \
  --timeframe 1m \
  --out datasets/market_btcusdt_1m.parquet
```

If the common ML libraries are not installed locally, uncomment the `%pip install` line at the top of the next cell.

In [None]:
# Uncomment if you need these locally
# %pip install -q scikit-learn matplotlib seaborn
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', 200)
DATASET = Path('datasets/market_btcusdt_1m.parquet')
assert DATASET.exists(), f'Expected dataset at {DATASET}. Run the build script first.'
df = pd.read_parquet(DATASET)
df.head()
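
Before modelling, it can help to confirm that the loaded frame has the columns the later cells rely on. The sketch below is a minimal check, assuming the column names used in this notebook (`timestamp`, `ret_next`, `y_dir`) and 0/1 labels. `check_dataset` is a hypothetical helper, and the toy frame only stands in for the real parquet file.

```python
import pandas as pd

REQUIRED_COLS = ("timestamp", "ret_next", "y_dir")  # names used by later cells

def check_dataset(frame: pd.DataFrame) -> dict:
    """Return basic sanity stats; raise if a required column is missing."""
    missing = [c for c in REQUIRED_COLS if c not in frame.columns]
    if missing:
        raise KeyError(f"Missing required columns: {missing}")
    return {
        "rows": len(frame),
        "sorted_by_time": bool(frame["timestamp"].is_monotonic_increasing),
        "duplicate_timestamps": int(frame["timestamp"].duplicated().sum()),
        # Share of rows labelled 1 (assumes y_dir is encoded as 0/1)
        "label_balance": float(frame["y_dir"].mean()),
        # Average fraction of missing values across all columns
        "nan_fraction": float(frame.isna().mean().mean()),
    }

# Toy stand-in for the real dataset
toy = pd.DataFrame({
    "timestamp": [1, 2, 3, 4],
    "ret_next": [0.01, -0.02, 0.005, None],
    "y_dir": [1, 0, 1, 0],
})
stats = check_dataset(toy)
print(stats)
```

On the real data, `check_dataset(df)` would be called right after `pd.read_parquet`. A label balance far from 0.5 or a high NaN fraction is worth investigating before training.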

In [None]:
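# Chronological train/test split: the first 80% of bars train the model and
# the last 20% test it, so no future information leaks into training
df = df.sort_values('timestamp').reset_index(drop=True)
# Exclude identifiers, the targets (ret_next, y_dir), and the raw close price
non_feature_cols = {'timestamp', 'dt', 'symbol', 'exchange', 'timeframe',
                    'feature_version', 'ret_next', 'y_dir', 'close'}
feature_cols = [c for c in df.columns if c not in non_feature_cols]
X = df[feature_cols].fillna(0.0).values  # missing feature values are imputed with 0
y = df['y_dir'].values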
split_idx = int(0.8 * len(df))
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]

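# Unscaled features can slow convergence; raise max_iter or standardize the
# features if a ConvergenceWarning appears
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # probability of clf.classes_[1]
pred = (proba >= 0.5).astype(int)  # assumes y_dir is encoded as 0/1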
print('AUC:', roc_auc_score(y_test, proba))
print(classification_report(y_test, pred, digits=4))
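
The accuracy in the report is easier to interpret next to a trivial reference. As a minimal sketch, the hypothetical helper below computes the accuracy of always predicting the most frequent training label; the toy arrays stand in for `y_train` and `y_test`.

```python
import numpy as np

def majority_baseline_accuracy(y_train: np.ndarray, y_test: np.ndarray) -> float:
    """Accuracy of always predicting the most frequent training label."""
    values, counts = np.unique(y_train, return_counts=True)
    majority = values[np.argmax(counts)]
    return float(np.mean(y_test == majority))

# Toy labels standing in for the notebook's y_train / y_test
y_train_toy = np.array([1, 1, 0, 1, 0, 1])
y_test_toy = np.array([1, 0, 0, 1])
baseline_acc = majority_baseline_accuracy(y_train_toy, y_test_toy)
print(f"Majority-class accuracy: {baseline_acc:.2f}")
```

If the model's test accuracy is not clearly above this figure, or its AUC is close to 0.5, the features are probably carrying little usable signal.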

In [None]:
# Simple PnL backtest (illustrative only)
test = df.iloc[split_idx:].copy()
test['proba'] = proba
# Map probabilities to positions: long (+1) if proba >= thr,
# short (-1) if proba <= 1 - thr, flat (0) in between
thr = 0.55
test['pos'] = 0
test.loc[test['proba'] >= thr, 'pos'] = 1
test.loc[test['proba'] <= 1 - thr, 'pos'] = -1
# PnL per bar: the previous row's position earns this row's ret_next,
# minus a cost proportional to the change in position
cost_rate = 5e-4  # 5 bps, written as a fraction, per unit of position change
test['pos_shift'] = test['pos'].shift(1).fillna(0)
test['turnover'] = (test['pos'] - test['pos_shift']).abs()
test['pnl'] = test['pos_shift'] * test['ret_next'] - cost_rate * test['turnover']
test['equity'] = (1 + test['pnl']).cumprod()

fig, ax = plt.subplots(figsize=(10,4))
ax.plot(test['equity'].values)
ax.set_title('Cumulative equity (simple)')
ax.grid(True)
plt.show()

print('Final equity:', float(test['equity'].iloc[-1]))
print('Mean PnL per bar (bps):', 1e4 * test['pnl'].mean())
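
Final equity and mean PnL say little about risk. The sketch below adds maximum drawdown and an annualized Sharpe ratio for a per-bar PnL series. `summarize_pnl` is a hypothetical helper, and `BARS_PER_YEAR` assumes 1-minute bars on a market that trades around the clock; adjust it for other timeframes.

```python
import numpy as np
import pandas as pd

BARS_PER_YEAR = 365 * 24 * 60  # assumption: 1-minute bars, market open 24/7

def summarize_pnl(pnl: pd.Series, bars_per_year: int = BARS_PER_YEAR) -> dict:
    """Summarize a series of per-bar fractional returns."""
    equity = (1.0 + pnl).cumprod()
    # Drawdown: relative drop of equity from its running peak
    drawdown = equity / equity.cummax() - 1.0
    std = float(pnl.std(ddof=1))
    sharpe = float(pnl.mean() / std * np.sqrt(bars_per_year)) if std > 0 else float("nan")
    return {
        "final_equity": float(equity.iloc[-1]),
        "max_drawdown": float(drawdown.min()),
        "sharpe": sharpe,
    }

# Toy per-bar returns standing in for test['pnl']
toy_pnl = pd.Series([0.01, -0.02, 0.03, -0.01])
summary = summarize_pnl(toy_pnl)
print(summary)
```

With the notebook's data the call would be `summarize_pnl(test['pnl'].dropna())`. Sharpe ratios annualized from minute bars are noisy, so treat the number as a rough comparison between variants rather than a forecast.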

### Next steps
- Try different thresholds / models (e.g., gradient boosting).
- Add RSS/Reddit aggregates via `scripts/build_training_matrix.py`.
- Use walk-forward evaluation (multiple windows) and include transaction costs.
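
As a starting point for the walk-forward item, the sketch below scores the same logistic-regression baseline over several expanding windows using scikit-learn's `TimeSeriesSplit`. The `StandardScaler` step and the `walk_forward_auc` helper are additions for illustration and are not part of the notebook above; the synthetic data only stands in for `X` and `y`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def walk_forward_auc(X, y, n_splits=5):
    """Fit on each expanding window and score on the block that follows it."""
    scores = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        # A fresh model per fold, so nothing carries over between windows
        model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
        model.fit(X[train_idx], y[train_idx])
        proba = model.predict_proba(X[test_idx])[:, 1]
        scores.append(roc_auc_score(y[test_idx], proba))
    return scores

# Synthetic, time-ordered stand-in for the notebook's X and y
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(600, 4))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=600) > 0).astype(int)
fold_scores = walk_forward_auc(X_toy, y_toy)
print([round(s, 3) for s in fold_scores])
```

Each test block lies strictly after its training window, which preserves the no-leakage property of the single 80/20 split. A wide spread of AUC values across folds suggests the single-split result may not be stable over time.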