# Numerai — Dataset download & quick EDA

This notebook downloads the Numerai training dataset (v5.2 by default), optionally caches it locally, and runs a lightweight exploratory pass.

**Sections**
- Setup & configuration
- Data download / load
- Quick EDA



In [None]:
from numerapi import NumerAPI
import pandas as pd
import json
from ydata_profiling import ProfileReport
import seaborn as sns

2025-12-29 07:49:18,073 INFO visions.backends: Pandas backend loaded 2.3.3
2025-12-29 07:49:18,077 INFO visions.backends: Numpy backend loaded 1.26.4
2025-12-29 07:49:18,078 INFO visions.backends: Pyspark backend NOT loaded
2025-12-29 07:49:18,078 INFO visions.backends: Python backend loaded


In [2]:
# Setup & configuration
DATASET_VERSION = 'v5.2'
napi = NumerAPI()

In [3]:
all_datasets = napi.list_datasets()
dataset_versions = list(set(d.split('/')[0] for d in all_datasets))
print("Available versions:\n", dataset_versions)

Available versions:
 ['v5.2', 'v5.0', 'v5.1']


In [4]:
current_version_files = [f for f in all_datasets if f.startswith(DATASET_VERSION)]
print("Available", DATASET_VERSION, "files:\n", current_version_files)

Available v5.2 files:
 ['v5.2/features.json', 'v5.2/live.parquet', 'v5.2/live_benchmark_models.parquet', 'v5.2/live_example_preds.csv', 'v5.2/live_example_preds.parquet', 'v5.2/meta_model.parquet', 'v5.2/train.parquet', 'v5.2/train_benchmark_models.parquet', 'v5.2/validation.parquet', 'v5.2/validation_benchmark_models.parquet', 'v5.2/validation_example_preds.csv', 'v5.2/validation_example_preds.parquet']
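
The "optionally caches it locally" step from the introduction could be sketched roughly as below. `ensure_local` is a hypothetical helper, not part of numerapi. It takes the downloader as a callable and assumes that callable accepts a dataset filename and a destination path.

```python
from pathlib import Path
from typing import Callable


def ensure_local(filename: str, download: Callable[[str, str], object], root: str = ".") -> Path:
    """Return the local path for `filename`, downloading it only if it is missing.

    `download` is any callable taking (filename, dest_path); this helper and its
    signature are illustrative assumptions, not numerapi API.
    """
    target = Path(root) / filename
    if not target.exists():
        # Create e.g. ./v5.2/ before the first download.
        target.parent.mkdir(parents=True, exist_ok=True)
        download(filename, str(target))
    return target
```

In this notebook it might be called as `ensure_local(f"{DATASET_VERSION}/train.parquet", napi.download_dataset)`, assuming `download_dataset` accepts the destination path as its second argument.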


In [6]:
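# Uncomment on the first run to fetch features.json; later runs reuse the local copy.
# napi.download_dataset(f"{DATASET_VERSION}/features.json")
with open(f"{DATASET_VERSION}/features.json") as f:
    feature_metadata = json.load(f)
# Print each top-level key with the number of entries it holds.
for key in feature_metadata:
    print(key, len(feature_metadata[key]))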

feature_sets 18
targets 41


In [7]:
feature_sets = feature_metadata["feature_sets"]
for feature_set in ["small", "medium", "all"]:
  print(feature_set, len(feature_sets[feature_set]))

small 42
medium 780
all 2748


In [9]:
wanted_feature_set = "medium"
# Load only era, target, and the chosen feature set to limit memory use.
data = pd.read_parquet(
    path=f"./{DATASET_VERSION}/numerai.parquet",
    columns=["era", "target"] + feature_sets[wanted_feature_set],
)
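
Before choosing a feature set, it can help to estimate how much memory the feature columns will need. The sketch below is back-of-the-envelope arithmetic only. It assumes one byte per feature value (the `int8` dtype that `data.info()` reports), uses the row count of 2,746,270 from the same output, and ignores the `era` and `target` columns and the index.

```python
# Row count as reported by data.info() for this file.
N_ROWS = 2_746_270
# Feature-set sizes printed from features.json.
FEATURE_COUNTS = {"small": 42, "medium": 780, "all": 2748}


def estimate_feature_bytes(n_rows: int, n_features: int, bytes_per_value: int = 1) -> int:
    """Approximate in-memory size of the feature columns alone."""
    return n_rows * n_features * bytes_per_value


for name, n_features in FEATURE_COUNTS.items():
    gib = estimate_feature_bytes(N_ROWS, n_features) / 2**30
    print(f"{name}: ~{gib:.2f} GiB")
```

For the `medium` set this lands close to the "2.0+ GB" that pandas reports, which suggests the estimate is a reasonable first check before trying `all`.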

In [10]:
data.info()  # prints its summary directly and returns None, so it is not wrapped in display()
display(data.describe())
display(data.value_counts())
display(data.nunique())

<class 'pandas.core.frame.DataFrame'>
Index: 2746270 entries, n0007b5abb0c3a25 to nfffed717119d633
Columns: 782 entries, era to feature_zymotic_windswept_cooky
dtypes: float32(1), int8(780), object(1)
memory usage: 2.0+ GB

statistic,target,feature_able_deprived_nona,feature_ablest_inflexional_egeria,feature_absorbable_hyperalgesic_mode,feature_accoutered_revolute_vexillology,feature_accredited_consummate_currie,feature_acetose_crackerjack_needlecraft,feature_acheulian_conserving_output,feature_acronychal_bilobate_stevenage,feature_acrylic_gallic_wine,...,feature_wrapround_chrestomathic_timarau,feature_xanthic_contending_noblewoman,feature_xanthic_transpadane_saleswoman,feature_xanthochroid_petrified_gutenberg,feature_zincy_cirrhotic_josh,feature_zippy_trine_diffraction,feature_zonal_snuffly_chemism,feature_zygotic_middlebrow_caribbean,feature_zymolytic_intertidal_privet,feature_zymotic_windswept_cooky
count,2746268.0,2746270.0,2746270.0,2746270.0,2746270.0,2746270.0,2746270.0,2746270.0,2746270.0,2746270.0,...,2746270.0,2746270.0,2746270.0,2746270.0,2746270.0,2746270.0,2746270.0,2746270.0,2746270.0,2746270.0
mean,0.4999478,1.999915,1.999925,1.999915,1.999915,1.999935,1.999915,1.999944,1.999915,1.999928,...,1.999915,1.999935,1.999915,1.999915,1.999915,1.999915,1.999928,1.999915,1.999915,1.999915
std,0.2236927,1.414359,1.373126,1.414359,1.414359,1.229121,1.414359,1.165599,1.414359,1.314663,...,1.414359,1.229121,1.414359,1.414359,1.414359,1.414359,1.314663,1.414359,1.414359,1.414359
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
50%,0.5,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,...,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
75%,0.5,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,...,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0
max,1.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,...,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0


0001  0.0  0  2  0  1  2  1  2  1  2  4  3  0  2  1  1  0  2  1  1  1  4  3  2  0  1  3  1
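
`DataFrame.value_counts()` counts distinct rows across every column, so on a 782-column frame it says little about any single variable. A per-column distribution is usually more informative. The sketch below uses a hypothetical helper, `column_distribution`, on toy data. Passing `dropna=False` keeps missing values visible, which matters here because `describe()` reports two fewer `target` values than rows.

```python
import pandas as pd


def column_distribution(df: pd.DataFrame, column: str) -> pd.Series:
    """Share of rows per distinct value in one column, with missing values included."""
    return df[column].value_counts(dropna=False, normalize=True).sort_index()


# Toy stand-in for the real frame: five target levels plus one missing value.
toy = pd.DataFrame({"target": [0.0, 0.25, 0.5, 0.5, 0.5, 0.75, 1.0, None]})
print(column_distribution(toy, "target"))
```

On the real data the call would be `column_distribution(data, "target")`.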

era                                     574
target                                    5
feature_able_deprived_nona                5
feature_ablest_inflexional_egeria         5
feature_absorbable_hyperalgesic_mode      5
                                       ... 
feature_zippy_trine_diffraction           5
feature_zonal_snuffly_chemism             5
feature_zygotic_middlebrow_caribbean      5
feature_zymolytic_intertidal_privet       5
feature_zymotic_windswept_cooky           5
Length: 782, dtype: int64
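
With 574 distinct eras, a per-era summary is a natural next check: how many rows each era holds, its mean target, and whether any targets are missing. The following is a minimal sketch on toy data. `era_summary` is a hypothetical helper, and the column names `era` and `target` are taken from the frame loaded above.

```python
import pandas as pd


def era_summary(df: pd.DataFrame, era_col: str = "era", target_col: str = "target") -> pd.DataFrame:
    """Row count, mean target, and missing-target count for each era."""
    grouped = df.groupby(era_col)[target_col]
    # "size" counts all rows per era, including those with a missing target.
    summary = grouped.agg(rows="size", target_mean="mean")
    summary["target_missing"] = df[target_col].isna().groupby(df[era_col]).sum()
    return summary


# Toy stand-in: era labels are strings, matching the object dtype in data.info().
toy_eras = pd.DataFrame(
    {
        "era": ["0001", "0001", "0002", "0002", "0002"],
        "target": [0.0, 0.5, 0.5, None, 1.0],
    }
)
print(era_summary(toy_eras))
```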

In [14]:
ProfileReport(data)

Summarize dataset:  85%|████████▍ | 668/787 [07:05<02:05,  1.06s/it, Describe variable: feature_unbending_expandable_slew]                
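
The progress bar above shows 668 of 787 steps completed after about seven minutes, so profiling the full frame is slow. One option is to profile a random sample of rows and a subset of columns instead. The sketch below covers only the sampling step. `profiling_sample` is a hypothetical helper, and the default sample size is an arbitrary assumption.

```python
import pandas as pd


def profiling_sample(df: pd.DataFrame, n_rows: int = 50_000, columns=None, seed: int = 0) -> pd.DataFrame:
    """Return at most `n_rows` randomly sampled rows, optionally restricted to `columns`."""
    subset = df if columns is None else df[list(columns)]
    n = min(n_rows, len(subset))  # never ask for more rows than exist
    return subset.sample(n=n, random_state=seed)
```

The result could then be passed to the profiler, for example `ProfileReport(profiling_sample(data, columns=["era", "target"] + feature_sets["small"]), minimal=True)`. The `minimal=True` setting skips the more expensive computations, at the cost of a less detailed report.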
