# BTC-USDT Minute-Bars — **Collection & Feature-Matrix Summary**

_This document captures everything produced inside the `Collect/` directory so that the next phases (Gym environment, DQN, back-test) can plug in with zero guesswork._

---

## 1. Pipeline steps & artefacts

| Phase | Notebook / Script | Key Actions | Output |
|-------|-------------------|-------------|--------|
| **Raw download** | `Download.ipynb` | • Binance REST ⟶ 12 months of 1-minute OHLCV (525 601 rows).<br>• Retries plus `checkpoint.json` so an interrupted download can resume.<br>• Saved as a “fat” Parquet file with all columns. | `data_final/btcusdt_1m_20240511-20250511.parquet` |
| **Indicators** | `Indicators.ipynb` | • EMA (8, 21), RSI 14, Stochastic %K/%D (14, 3, 3), daily VWAP and deviation, 1-min log-return.<br>• Winsorise Volume and LogRet at ±3 σ over a 1-week rolling window.<br>• Drop initial NaNs. | `data_final/btc1m_features.parquet` |
| **Feature matrix (6 cols)** | `Indicators.ipynb` | Select ΔEMA, RSI14, StochK, StochD, VWAP\_dev, log\_return. | `data_final/btc1m_features_matrix.parquet` |
| **Normalisation** | `Indicators.ipynb` | a) Z-score (train µ/σ).<br>b) EWMA α = 0.001 (online).<br>• Store µ/σ & split indices. | `btc1m_features_zscore.parquet`, `btc1m_features_ewma.parquet`, `norm_stats.json` |
| **Low-volume flag** | `Indicators.ipynb` | • 5th percentile (`p5`) of Volume, computed on the **train** slice only.<br>• `LowVolFlag = Volume < p5`.<br>• No rows dropped. | `btc1m_features_with_lowvol.parquet` |
| **Loader helper** | any | One function merges Z-score + LowVolFlag on demand. | `utils/data_loader.py` |

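The indicator and winsorising steps in the table can be sketched in pandas as below. This is an illustrative reconstruction, not the notebook's actual code. It assumes a `DatetimeIndex` and columns named `High`, `Low`, `Close`, `Volume`, it maps "EMA n" to `span=n`, it uses the close as the VWAP price, and it lets the rolling window include the current bar. The function names `add_indicators` and `winsorise_rolling` are hypothetical.

```python
import numpy as np
import pandas as pd


def add_indicators(df: pd.DataFrame) -> pd.DataFrame:
    """Build the six feature columns from a 1-minute OHLCV frame (sketch)."""
    close = df["Close"]
    out = pd.DataFrame(index=df.index)

    # ΔEMA: fast EMA(8) minus slow EMA(21); span=n is an assumption
    out["ΔEMA"] = (close.ewm(span=8, adjust=False).mean()
                   - close.ewm(span=21, adjust=False).mean())

    # RSI 14 from exponentially smoothed gains and losses
    delta = close.diff()
    gains = delta.clip(lower=0).ewm(span=14, adjust=False).mean()
    losses = (-delta.clip(upper=0)).ewm(span=14, adjust=False).mean()
    out["RSI_14"] = 100 - 100 / (1 + gains / losses)

    # Stochastic: raw %K over 14 bars, %D as two successive 3-bar SMAs
    low14 = df["Low"].rolling(14).min()
    high14 = df["High"].rolling(14).max()
    k = 100 * (close - low14) / (high14 - low14)
    out["Stoch_%K"] = k
    out["Stoch_%D"] = k.rolling(3).mean().rolling(3).mean()

    # VWAP deviation: cumulative sums restart at each calendar day
    day = df.index.normalize()
    pv = (close * df["Volume"]).groupby(day).cumsum()
    vol = df["Volume"].groupby(day).cumsum()
    out["VWAP_Dev"] = close - pv / vol

    # 1-minute log-return
    out["LogRet_1m"] = np.log(close / close.shift(1))
    return out


def winsorise_rolling(s: pd.Series, window: int = 7 * 24 * 60,
                      n_sigma: float = 3.0) -> pd.Series:
    """Clip a series to mean ± n_sigma * std of a rolling window (sketch)."""
    mu = s.rolling(window, min_periods=1).mean()
    sd = s.rolling(window, min_periods=2).std()
    # Where the bound is undefined (too few observations), do not clip
    lower = (mu - n_sigma * sd).fillna(-np.inf)
    upper = (mu + n_sigma * sd).fillna(np.inf)
    return s.clip(lower=lower, upper=upper)
```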
---




## 2. Folder structure

<img src="Screenshot.png" alt="Folder structure" width="500"/>



---

## 3. Result files — schema & how to use

| File | Columns (dtypes) | Purpose | Typical Loader Call |
|------|------------------|---------|---------------------|
| **`btc1m_features_matrix.parquet`** | `ΔEMA, RSI_14, Stoch_%K, Stoch_%D, VWAP_Dev, LogRet_1m` — `float64` | raw numeric features | `pd.read_parquet(..., columns=[...])` |
| **`btc1m_features_zscore.parquet`** | same 6 cols, already Z-scored with μ/σ from train slice | fastest prototyping; no extra scaling | `pd.read_parquet(...)` |
| **`btc1m_features_ewma.parquet`** | same 6 cols, EWMA-scaled (α = 0.001) | research on regime-shift robustness | — |
| **`btc1m_features_with_lowvol.parquet`** | 6 raw features **+ LowVolFlag** (`int8`) | hybrid approach; feed flag to agent | — |
| **`norm_stats.json`** | `{mean:{}, std:{}, split:{train_end,val_end,rows}}` | reproducible µ/σ & split boundaries | `json.load(...)` |

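To show how the Z-scored file, `LowVolFlag` and `norm_stats.json` relate, here is a minimal sketch that fits the statistics on the train slice only and applies them to every row. It assumes the split boundaries are integer row positions and that σ is the sample standard deviation (`ddof=1`); neither is confirmed by the notebook. The helper names `fit_stats`, `apply_zscore` and `low_vol_flag` are hypothetical.

```python
import numpy as np
import pandas as pd


def fit_stats(X: pd.DataFrame, train_end: int, val_end: int) -> dict:
    """Train-only mean/std in the same layout as norm_stats.json."""
    train = X.iloc[:train_end]
    return {
        "mean": train.mean().to_dict(),
        "std": train.std().to_dict(),  # pandas default: ddof=1 (assumption)
        "split": {"train_end": train_end, "val_end": val_end, "rows": len(X)},
    }


def apply_zscore(X: pd.DataFrame, stats: dict) -> pd.DataFrame:
    """Scale every row with the train statistics (no look-ahead into val/test)."""
    mu = pd.Series(stats["mean"])
    sd = pd.Series(stats["std"])
    return (X - mu) / sd  # aligns on column names


def low_vol_flag(volume: pd.Series, train_end: int, q: float = 0.05) -> pd.Series:
    """1 where Volume is below the train-slice q-quantile, else 0 (int8)."""
    p5 = volume.iloc[:train_end].quantile(q)
    return (volume < p5).astype("int8")
```

Because the mean and standard deviation come from the train slice alone, the validation and test rows are scaled without leaking their own distribution into the features.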
### Quick loader (already saved as `utils/data_loader.py`)

```python
from utils.data_loader import load_features

# Z-scored features plus LowVolFlag -> 7-column matrix
X, split = load_features(zscore=True, add_lowvol=True)

# Chronological train / validation / test slices from the returned boundaries
train = slice(None, split['train_end'])
val   = slice(split['train_end'], split['val_end'])
test  = slice(split['val_end'], None)
```


### Indicator & normalisation formulas

| Indicator  | Formula                                                                                                                 |
| ---------- | ----------------------------------------------------------------------------------------------------------------------- |
| ΔEMA       | $\text{EMA}_8 - \text{EMA}_{21}$                                                                                        |
| RSI 14     | $\displaystyle 100 - \frac{100}{1 + \frac{\mathrm{EMA}_{14}(\text{gains})}{\mathrm{EMA}_{14}(\text{losses})}}$          |
| Stoch %K   | $100 \times \frac{C_t - L_{14}}{H_{14} - L_{14}}$                                                                       |
| Stoch %D   | 3-period SMA of %K, then another 3-period SMA (slow variant)                                                            |
| VWAP dev   | $\text{Close}_t - \frac{\sum_{i\le t}(P_i V_i)}{\sum_{i\le t} V_i}$ (reset daily)                                       |
| Log-return | $\ln \frac{C_t}{C_{t-1}}$                                                                                               |
| Z-score    | $x' = (x - \mu_{\text{train}}) / \sigma_{\text{train}}$                                                                 |
| EWMA µ/σ   | $\mu_t = \alpha x_{t-1} + (1-\alpha)\mu_{t-1}$,  $\sigma^2_t = \alpha (x_{t-1}-\mu_{t-1})^2 + (1-\alpha)\sigma^2_{t-1}$ |
| LowVolFlag | `1` if `Volume` < 5th percentile of the train slice, else `0`                                                           |

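The EWMA row can be read as an online scaler in which the statistics at time $t$ depend only on observations up to $t-1$, so a bar never normalises itself. The sketch below implements that recursion. The scaled output $x'_t = (x_t - \mu_t)/\sigma_t$, the initial values `mu0` and `var0`, and the small `eps` guard are assumptions, since the table does not specify them. The function name `ewma_scale` is hypothetical.

```python
import numpy as np


def ewma_scale(x, alpha=0.001, mu0=0.0, var0=1.0, eps=1e-12):
    """Online EWMA normalisation using lagged mean and variance (sketch)."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    mu, var = mu0, var0
    for t in range(len(x)):
        if t > 0:
            prev = x[t - 1]
            # Both updates use mu_{t-1}: the right-hand side is evaluated
            # before either name is rebound.
            mu, var = (alpha * prev + (1 - alpha) * mu,
                       alpha * (prev - mu) ** 2 + (1 - alpha) * var)
        out[t] = (x[t] - mu) / np.sqrt(var + eps)
    return out
```

With α = 0.001 the statistics have an effective memory on the order of 1/α = 1000 bars, so the choice of `mu0` and `var0` still influences roughly the first thousand outputs.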