# Dataset Generation Notebook

This notebook builds a classification dataset and a regression dataset from hourly BTCUSDT data with `generate_dataset`. Both runs log to MLflow and use the `fibonacci` window scheme.
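
The `window_scheme='fibonacci'` option is not documented in this notebook. One plausible reading is that rolling-window lengths follow the Fibonacci sequence instead of a fixed grid. The sketch below illustrates only that idea; the function name `fibonacci_windows` and its `min_window` and `max_window` parameters are hypothetical and are not taken from `utils.build_dataset`.

```python
def fibonacci_windows(min_window: int = 2, max_window: int = 144) -> list[int]:
    """Return the Fibonacci numbers in [min_window, max_window], ascending.

    Hypothetical helper: the real window scheme may differ.
    """
    windows = []
    a, b = 1, 2
    while a <= max_window:
        if a >= min_window:
            windows.append(a)
        a, b = b, a + b
    return windows


print(fibonacci_windows())
```

Spacing windows this way gives dense coverage of short lookbacks and sparse coverage of long ones, which keeps the number of rolling features manageable.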

In [5]:
# Install dependencies (skip if they are already installed)
%pip install -r requirements.txt

Collecting category_encoders (from -r requirements.txt (line 13))
  Using cached category_encoders-2.8.1-py3-none-any.whl.metadata (7.9 kB)
Collecting wandb (from -r requirements.txt (line 16))
  Using cached wandb-0.20.1-py3-none-win_amd64.whl.metadata (10 kB)
Collecting cudf-cu11 (from -r requirements.txt (line 17))
  Using cached cudf_cu11-25.4.0.tar.gz (2.7 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'error'


  error: subprocess-exited-with-error
  
  × Preparing metadata (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [68 lines of output]
      INFO:wheel-stub:Testing wheel cudf_cu11-25.4.0-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl against tag cp310-cp310-manylinux_2_24_aarch64
      INFO:wheel-stub:Testing wheel cudf_cu11-25.4.0-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl against tag cp310-cp310-manylinux_2_28_aarch64
      INFO:wheel-stub:Testing wheel cudf_cu11-25.4.0-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl against tag cp310-cp310-manylinux_2_24_x86_64
      INFO:wheel-stub:Testing wheel cudf_cu11-25.4.0-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl against tag cp310-cp310-manylinux_2_28_x86_64
      INFO:wheel-stub:Testing wheel cudf_cu11-25.4.0-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl against tag cp311-cp311-manylinux_2_24_aarch64
      INFO:wheel-stub:Testing wheel cudf_cu11
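
The log above fails while pip prepares metadata for `cudf-cu11`. The `wandb` wheel it selected is tagged `win_amd64`, whereas every `cudf_cu11` wheel listed is a `manylinux` build, so the failure is probably a platform mismatch rather than a transient error. The later cells still pass `use_gpu=True`; how `generate_dataset` behaves when cuDF is missing is not shown here. A minimal sketch of a guard, assuming a hypothetical helper `gpu_backend_available` that is not part of `utils.build_dataset`:

```python
import importlib.util


def gpu_backend_available(module_name: str = "cudf") -> bool:
    """Return True if the named module can be found by the import system.

    Hypothetical helper: it checks importability only, not that a GPU works.
    """
    return importlib.util.find_spec(module_name) is not None


# Decide the flag once, then pass it on instead of hard-coding True.
use_gpu = gpu_backend_available()
print(f"use_gpu={use_gpu}")
```

The resulting `use_gpu` value could then be passed to `generate_dataset` so the same notebook runs on machines with and without cuDF.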

In [6]:
from utils.build_dataset import generate_dataset

## Build classification dataset

In [7]:
generate_dataset(
    raw_path='data/raw/BTCUSDT_1h.csv',
    output_dir='data/processed/classification',
    version='v1',
    task='classification',
    horizon=3,
    use_gpu=True,
    ml_logger='mlflow',
    tracking_uri='file:./mlruns',
    window_scheme='fibonacci'
)

✅ Loaded 67999 rows from data/raw/BTCUSDT_1h.csv
⚠️ Market regime detection failed: Expected n_samples >= n_components but got n_components = 5, n_samples = 2
✅ Built 1168 features | 67908 samples
🧹 Removed 0 rows with NaN labels
🔍 Starting feature selection on 1164 features for classification...
🧹 Variance threshold: 169 features removed
🧹 Correlation filter: 403 features removed
📊 MI selected 100 candidate features
⏱️ Feature selection completed in 186.54s
🎯 Final feature count: 40
💾 Saved selected feature names to data/processed/classification\selected_features_v1.csv
📊 PCA reduced to 20 components (95% variance)
✅ Dataset saved to data/processed/classification
📊 Final shape: (67908, 20) features, (67908, 1) labels
📝 Saved comprehensive feature reference for deployment
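
The counts in this log can be cross-checked with a little arithmetic. The sketch below only replays numbers printed above; the variable names are illustrative. Two things the log does not explain: it reports 1168 built features but starts selection on 1164, and it does not say how the 100 MI candidates are narrowed to the final 40.

```python
# Counts copied from the classification log above.
candidates = 1164          # features entering selection
removed_variance = 169     # dropped by the variance threshold
removed_correlation = 403  # dropped by the correlation filter

# Features that survive both filters and are ranked by mutual information.
after_filters = candidates - removed_variance - removed_correlation
print(after_filters)  # 592

mi_candidates = 100   # kept after MI ranking
final_features = 40   # final selected features
pca_components = 20   # components retained by PCA

# Each stage should be no larger than the one before it.
assert after_filters >= mi_candidates >= final_features >= pca_components
```

The regression run reaches the same 592 features after filtering (1163 - 169 - 402), so both tasks rank an identically sized pool.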


## Build regression dataset

In [8]:
generate_dataset(
    raw_path='data/raw/BTCUSDT_1h.csv',
    output_dir='data/processed/regression',
    version='v1',
    task='regression',
    horizon=3,
    use_gpu=True,
    ml_logger='mlflow',
    tracking_uri='file:./mlruns',
    window_scheme='fibonacci'
)

✅ Loaded 67999 rows from data/raw/BTCUSDT_1h.csv
⚠️ Market regime detection failed: Expected n_samples >= n_components but got n_components = 5, n_samples = 2
✅ Built 1168 features | 67908 samples
🧹 Removed 1 rows with NaN labels
🔍 Starting feature selection on 1163 features for regression...
🧹 Variance threshold: 169 features removed
🧹 Correlation filter: 402 features removed
📊 MI selected 100 candidate features
⏱️ Feature selection completed in 197.62s
🎯 Final feature count: 40
💾 Saved selected feature names to data/processed/regression\selected_features_v1.csv
📊 PCA reduced to 30 components (95% variance)
✅ Dataset saved to data/processed/regression
📊 Final shape: (67907, 30) features, (67907, 3) labels
📝 Saved comprehensive feature reference for deployment
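
Both runs print a path with mixed separators, such as `data/processed/regression\selected_features_v1.csv`. That is consistent with `os.path.join` on Windows when the directory is written with forward slashes. The path still resolves on Windows, but the string is awkward to log or compare across platforms. A minimal sketch of one way to normalize it with `pathlib`; how `utils.build_dataset` builds its paths is not shown in this notebook, so treat this as an illustration only:

```python
from pathlib import PureWindowsPath

# Path string as printed in the log (mixed separators).
logged = r"data/processed/regression\selected_features_v1.csv"

# PureWindowsPath treats both "/" and "\" as separators on any OS;
# as_posix() renders the result with forward slashes only.
normalized = PureWindowsPath(logged).as_posix()
print(normalized)  # data/processed/regression/selected_features_v1.csv
```

When constructing paths rather than repairing them, joining with `pathlib.Path(output_dir) / filename` and logging `.as_posix()` avoids the mixed form in the first place.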
