# Numerai — Dataset download & quick EDA

This notebook downloads the Numerai training dataset (v5.2 by default) to a local folder, reuses the cached file on later runs, optionally exports it to CSV, and runs a lightweight exploratory pass.

**Sections**
- Setup & configuration
- Data download / load
- Quick EDA



In [None]:
# Setup & configuration
from __future__ import annotations

from pathlib import Path

DATASET_VERSION = "v5.2"
DATA_DIR = Path(DATASET_VERSION)  # local folder follows the dataset version
TRAIN_PARQUET = DATA_DIR / "train.parquet"
EXPORT_DIR = Path("./data")
EXPORT_CSV = EXPORT_DIR / "data.csv"

# NOTE: This dataset is very large. Writing CSV can be slow and create multi-GB files.
WRITE_CSV = True



In [2]:
# Data download / load
from numerapi import NumerAPI
import pandas as pd


def get_data(
    version: str = DATASET_VERSION,
    train_path: Path = TRAIN_PARQUET,
    download_if_missing: bool = True,
) -> pd.DataFrame:
    """Download (if needed) and load the Numerai training dataset."""
    train_path.parent.mkdir(parents=True, exist_ok=True)

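    if download_if_missing and not train_path.exists():
        # Pass dest_path explicitly so the file lands where read_parquet looks,
        # even when train_path is not the default "<version>/train.parquet".
        NumerAPI().download_dataset(
            f"{version}/train.parquet", dest_path=str(train_path)
        )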

    return pd.read_parquet(train_path)
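
The download-if-missing logic in `get_data` can be exercised without touching the network by separating the cache check from the downloader. The following is a minimal sketch under that assumption; `ensure_local` and `fake_fetch` are hypothetical names introduced for illustration and are not part of `numerapi`.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_local(path: Path, fetch: Callable[[Path], None]) -> bool:
    """Run fetch(path) only if path is missing; return True when a fetch happened."""
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    fetch(path)
    if not path.exists():
        # Catches a downloader that wrote somewhere other than `path`.
        raise FileNotFoundError(f"fetch did not create {path}")
    return True


# Stand-in downloader (hypothetical); a real one would call the Numerai client.
calls = []


def fake_fetch(dest: Path) -> None:
    calls.append(dest)
    dest.write_bytes(b"placeholder")


with tempfile.TemporaryDirectory() as tmp:
    cached_file = Path(tmp) / "v5.2" / "train.parquet"
    first = ensure_local(cached_file, fake_fetch)   # file missing, so fetch runs
    second = ensure_local(cached_file, fake_fetch)  # file cached, so fetch is skipped

print(first, second, len(calls))
```

The existence check after `fetch` is the useful part: it fails loudly when the downloader and the reader disagree about the file location.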



## Notes

- This notebook is self-contained (previous helper modules were merged in).
- At merge time, `preprocessing.py` and `model.py` were empty.



In [None]:
# Load data
data = get_data()

# Optional export (can be slow/large)
if WRITE_CSV:
    EXPORT_DIR.mkdir(parents=True, exist_ok=True)
    data.to_csv(EXPORT_CSV, index=False)
    print(f"Wrote: {EXPORT_CSV}")
else:
    print("WRITE_CSV is False — skipping CSV export")
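
When the full CSV is too slow or too large, one option is to export only a row sample and compress it. The sketch below assumes that a sample is enough for the downstream use; `export_sample_csv` is a hypothetical helper, and the toy frame merely stands in for the real dataset.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def export_sample_csv(
    df: pd.DataFrame, path: Path, n: int = 100_000, seed: int = 42
) -> Path:
    """Write at most n randomly sampled rows to a gzip-compressed CSV."""
    path.parent.mkdir(parents=True, exist_ok=True)
    sample = df.sample(n=min(n, len(df)), random_state=seed)
    sample.to_csv(path, index=False, compression="gzip")
    return path


# Toy frame standing in for the real dataset (column names are illustrative).
toy = pd.DataFrame(
    {"feature_a": np.arange(1_000), "target": np.linspace(0.0, 1.0, 1_000)}
)

with tempfile.TemporaryDirectory() as tmp:
    out = export_sample_csv(toy, Path(tmp) / "sample.csv.gz", n=100)
    round_trip = pd.read_csv(out)  # compression is inferred from the .gz suffix

print(round_trip.shape)
```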

## Quick EDA

The full dataset is large (millions of rows and thousands of columns). The checks below are designed to be informative without expensive full-matrix operations such as a full correlation matrix.


In [None]:
import pandas as pd

# High-level snapshot
display(pd.DataFrame({
    "rows": [len(data)],
    "cols": [data.shape[1]],
    "memory_gb": [data.memory_usage(deep=True).sum() / (1024**3)],
}))

# Peek at the schema
display(data.dtypes.value_counts().rename("dtype_count").to_frame())

# Basic preview
display(data.head(5))

# Lightweight summary on a small sample (avoids huge compute)
sample = data.sample(n=min(100_000, len(data)), random_state=42)
display(sample.describe(include="all", percentiles=[0.01, 0.5, 0.99]).T.head(30))
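
The same sampling idea gives a cheap alternative to a full correlation matrix: correlate a handful of feature columns with the target on a row sample only. This is a minimal sketch, not part of the original notebook; `sampled_target_corr` is a hypothetical helper, and the `feature_*` and `target` column names are assumptions about the dataset schema, demonstrated here on synthetic data.

```python
import numpy as np
import pandas as pd


def sampled_target_corr(
    df: pd.DataFrame,
    feature_cols: list[str],
    target_col: str = "target",
    n: int = 100_000,
    seed: int = 42,
) -> pd.Series:
    """Pearson correlation of each feature with the target, on a row sample."""
    sample = df.sample(n=min(n, len(df)), random_state=seed)
    corr = sample[feature_cols].corrwith(sample[target_col])
    # Strongest relationships first, regardless of sign.
    return corr.sort_values(key=abs, ascending=False)


# Synthetic data: feature_a tracks the target closely, feature_b is pure noise.
rng = np.random.default_rng(0)
signal = rng.normal(size=5_000)
toy = pd.DataFrame(
    {
        "feature_a": signal + 0.1 * rng.normal(size=5_000),
        "feature_b": rng.normal(size=5_000),
        "target": signal,
    }
)

ranked = sampled_target_corr(toy, ["feature_a", "feature_b"], n=2_000)
print(ranked)
```

The cost grows with the number of selected features rather than with its square, which keeps the check usable on wide frames.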


In [None]:
# Targeted check: train/validation split (if present)
if "data_type" in data.columns:
    display(data["data_type"].value_counts(dropna=False))
else:
    print("No 'data_type' column in this dataset")

data_type
train    2746270
Name: count, dtype: int64
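
Since the output above shows only `train` rows, any validation set has to be carved out of this frame. One common approach is to hold out the most recent eras. The sketch below assumes an `era` column with zero-padded string labels and `feature`/`target` column prefixes; these names, along with `split_columns` and `era_holdout_split`, are assumptions for illustration rather than something the notebook confirms.

```python
import pandas as pd


def split_columns(df: pd.DataFrame) -> tuple[list[str], list[str]]:
    """Group column names by the assumed 'feature' / 'target' prefixes."""
    features = [c for c in df.columns if c.startswith("feature")]
    targets = [c for c in df.columns if c.startswith("target")]
    return features, targets


def era_holdout_split(
    df: pd.DataFrame, era_col: str = "era", holdout_frac: float = 0.2
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Hold out the last fraction of eras (at least one) for validation."""
    eras = sorted(df[era_col].unique())  # zero-padded strings sort in time order
    n_holdout = max(1, int(len(eras) * holdout_frac))
    holdout_eras = set(eras[-n_holdout:])
    mask = df[era_col].isin(holdout_eras)
    return df[~mask], df[mask]


# Toy frame: five eras with two rows each.
toy = pd.DataFrame(
    {
        "era": ["0001", "0001", "0002", "0002", "0003",
                "0003", "0004", "0004", "0005", "0005"],
        "feature_a": list(range(10)),
        "feature_b": list(range(10, 20)),
        "target": [0.0, 0.25, 0.5, 0.75, 1.0] * 2,
    }
)

features, targets = split_columns(toy)
train_df, valid_df = era_holdout_split(toy, holdout_frac=0.2)
print(features, targets, len(train_df), len(valid_df))
```

Splitting by era rather than by row keeps all rows of an era on the same side of the split, so no era contributes to both training and validation.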

## Next steps

- Add feature/target column selection
- Train a baseline model (e.g., LightGBM) with era-based validation
- Generate predictions and (optionally) submit to Numerai

