# Page Index & Bloom Filters

> **Level:** Advanced  
> **Spec:** [Page Index](https://parquet.apache.org/docs/file-format/pageindex/) · [Bloom Filter](https://parquet.apache.org/docs/file-format/bloomfilter/)  
> **PyArrow docs:** [Parquet Datasets](https://arrow.apache.org/docs/python/parquet.html#partitioned-datasets)

**What you will learn:**

1. How column statistics enable row-group skipping for range predicates
2. How the page index extends statistics to the page level for finer-grained skipping
3. How bloom filters accelerate point-lookup queries on high-cardinality columns
4. What PyArrow's Python API does and does not yet expose for writing bloom filters
5. How to verify skipping behaviour by reading individual row groups manually

In [9]:
import time
import os

import pyarrow as pa
import pyarrow.parquet as pq

---

## 1. Row-group skipping with column statistics

> **Spec:** [Metadata: statistics](https://parquet.apache.org/docs/file-format/metadata/)

Each column chunk stores `min` and `max` statistics in the file metadata.
A reader evaluating a filter like `ts >= threshold` can skip any row group where `max(ts) < threshold`, without reading a single data byte.

This is the most fundamental form of predicate pushdown in Parquet.

In [2]:
# Build a table sorted by 'ts' so that each row group covers a distinct range
N = 300_000
RG_SIZE = 100_000  # 3 row groups

table = pa.table({
    "ts":     pa.array(range(N), type=pa.int64()),        # monotone ➡️ min/max are tight
    "value":  pa.array([float(i) * 0.01 for i in range(N)]),
    "label":  pa.array([["a", "b", "c"][i % 3] for i in range(N)]),
})

path = "/tmp/predpush.parquet"
pq.write_table(table, path, row_group_size=RG_SIZE, write_statistics=True)

pf = pq.ParquetFile(path)
meta = pf.metadata

print(f"Row groups: {meta.num_row_groups}")
print()
print(f"{'RG':>3} {'ts_min':>10} {'ts_max':>10}")
print("-" * 27)
for rg_idx in range(meta.num_row_groups):
    rg = meta.row_group(rg_idx)
    for col_idx in range(meta.num_columns):
        col = rg.column(col_idx)
        if col.path_in_schema == "ts":
            stats = col.statistics
            print(f"{rg_idx:>3} {stats.min:>10} {stats.max:>10}")

Row groups: 3

 RG     ts_min     ts_max
---------------------------
  0          0      99999
  1     100000     199999
  2     200000     299999
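
The skip decision itself is a comparison against each row group's `min` and `max`. The sketch below spells that out in plain Python. `may_match` is a hypothetical helper for illustration, not a PyArrow function, and the statistics are copied by hand from the output above.

```python
# Hypothetical helper (not part of PyArrow): decide from min/max statistics
# alone whether a row group could contain a row satisfying `col <op> value`.
def may_match(stat_min, stat_max, op, value):
    if op == ">=":
        return stat_max >= value
    if op == ">":
        return stat_max > value
    if op == "<=":
        return stat_min <= value
    if op == "<":
        return stat_min < value
    if op == "==":
        return stat_min <= value <= stat_max
    raise ValueError(f"unsupported operator: {op}")

# (min, max) pairs copied from the statistics printed above
row_group_stats = [(0, 99_999), (100_000, 199_999), (200_000, 299_999)]

# Keep only the row groups that could hold rows with ts >= 200_000
survivors = [
    i for i, (lo, hi) in enumerate(row_group_stats)
    if may_match(lo, hi, ">=", 200_000)
]
print(survivors)  # [2]
```

A `True` result only means the row group cannot be ruled out; the rows still have to be filtered after decoding.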


---

## 2. Applying a filter: PyArrow's predicate pushdown

> **Spec:** [Page Index](https://parquet.apache.org/docs/file-format/pageindex/)

`pq.read_table(filters=...)` passes the predicate to the C++ Parquet reader,
which applies row-group skipping based on statistics before decoding any pages.
We can observe this by timing filtered vs full reads and by reading row groups manually.

In [3]:
RUNS = 3

def bench(fn, label):
    times = []
    for _ in range(RUNS):
        t0 = time.perf_counter()
        result = fn()
        times.append(time.perf_counter() - t0)
    avg = sum(times) / RUNS
    print(f"{label}: {avg * 1000:.1f} ms, {result.num_rows:,} rows returned")
    return avg

# Read all rows
t_full = bench(
    lambda: pq.read_table(path),
    "Full table read (no filter)  "
)

# Read only rows where ts >= 200_000, should skip the first 2 row groups
t_filtered = bench(
    lambda: pq.read_table(path, filters=[("ts", ">=", 200_000)]),
    "Filtered read (ts >= 200000)"
)

print(f"\nSpeedup from row-group skipping: {t_full / t_filtered:.1f}x")

Full table read (no filter)  : 11.6 ms, 300,000 rows returned
Filtered read (ts >= 200000): 3.8 ms, 100,000 rows returned

Speedup from row-group skipping: 3.1x


---

## 3. Manual row-group inspection to confirm skipping

> **Spec:** [Metadata: Row group statistics](https://parquet.apache.org/docs/file-format/metadata/)

We can replicate the reader's logic manually: walk row groups, check statistics,
and only call `read_row_group(i)` for groups that pass the predicate.

In [6]:
filter_value = 200_000
pf2 = pq.ParquetFile(path)
meta2 = pf2.metadata

print(f"Manual predicate pushdown for ts >= {filter_value}:\n")

batches = []
for rg_idx in range(meta2.num_row_groups):
    rg = meta2.row_group(rg_idx)
    for col_idx in range(meta2.num_columns):
        col = rg.column(col_idx)
        if col.path_in_schema == "ts":
            stats = col.statistics
            if stats.max >= filter_value:
                print(f"  Row group {rg_idx}: max={stats.max} ⏭️ READ")
                batches.append(pf2.read_row_group(rg_idx))
            else:
                print(f"  Row group {rg_idx}: max={stats.max} ⏭️ SKIP")

result_manual = pa.concat_tables(batches)
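# Apply the remaining row-level filter in memory.
# Import pyarrow.compute explicitly: `import pyarrow as pa` alone is not
# guaranteed to make `pa.compute` available.
import pyarrow.compute as pc

mask = pc.greater_equal(result_manual.column("ts"), filter_value)
result_manual = result_manual.filter(mask)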

print(f"\nResult rows: {result_manual.num_rows:,}")
print(f"Expected:    {N - filter_value:,}")

Manual predicate pushdown for ts >= 200000:

  Row group 0: max=99999 ⏭️ SKIP
  Row group 1: max=199999 ⏭️ SKIP
  Row group 2: max=299999 ⏭️ READ

Result rows: 100,000
Expected:    100,000
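
The page index applies the same min/max test one level down, to the pages inside a column chunk. The sketch below is plain Python under stated assumptions: the per-page values are invented for illustration (in a real file they would come from the ColumnIndex and OffsetIndex structures), and `pages_to_read` is a hypothetical helper, not a PyArrow API.

```python
# Hypothetical page-level statistics for the `ts` column chunk of row group 2.
# In a real file these would come from the ColumnIndex (per-page min/max)
# and the OffsetIndex (first row of each page); the values here are made up.
page_min = [200_000, 225_000, 250_000, 275_000]
page_max = [224_999, 249_999, 274_999, 299_999]
first_row = [0, 25_000, 50_000, 75_000]  # row offset within the row group
rows_in_group = 100_000

def pages_to_read(lo, hi):
    """Return (page_index, row_start, row_end) for pages overlapping [lo, hi]."""
    selected = []
    for i, (pmin, pmax) in enumerate(zip(page_min, page_max)):
        if pmax < lo or pmin > hi:
            continue  # this page cannot contain a matching value
        row_end = first_row[i + 1] if i + 1 < len(first_row) else rows_in_group
        selected.append((i, first_row[i], row_end))
    return selected

print(pages_to_read(260_000, 270_000))  # [(2, 50000, 75000)]
```

Row-group statistics alone would force a reader to decode all 100,000 rows of this group; with page-level bounds, only the 25,000 rows of one page need decoding.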


---

## 4. Bloom filters: point-lookup acceleration

> **Spec:** [Bloom Filter](https://parquet.apache.org/docs/file-format/bloomfilter/)

Column statistics (min/max) work well for range predicates on sorted data.
For **equality predicates on high-cardinality columns** (e.g., `user_id = 'abc123'`),
a bloom filter provides probabilistic membership testing:
- A **definite NO** allows the row group to be skipped entirely
- A **possible YES** means the row group must be read (false positives are possible)
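
To make the "definite no / possible yes" behaviour concrete, here is a toy bloom filter in plain Python. It is only a sketch: Parquet specifies a split-block bloom filter with xxHash, whereas this version uses a flat bit array and salted SHA-256 for simplicity. `ToyBloomFilter` and the `user…` IDs are hypothetical names for illustration.

```python
import hashlib

class ToyBloomFilter:
    """Illustrative bloom filter; NOT Parquet's split-block layout or hash."""

    def __init__(self, num_bits=1024, num_hashes=4):
        # num_bits is assumed to be a multiple of 8
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value):
        # Derive one bit position per hash by salting SHA-256 with the hash index
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present"
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )

# One filter per simulated row group of 50 user IDs each
groups = [[f"user{i}" for i in range(start, start + 50)] for start in (0, 50, 100)]
filters = []
for ids in groups:
    bf = ToyBloomFilter()
    for uid in ids:
        bf.add(uid)
    filters.append(bf)

# Row groups that cannot be ruled out for user_id == "user75"
candidates = [i for i, bf in enumerate(filters) if bf.might_contain("user75")]
print(candidates)  # always includes 1; any other index is a false positive
```

A value that was added is never reported absent, so skipping on `False` is always safe. Lowering the false-positive rate costs more bits per filter.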

> **PyArrow 23 note:** The Python bindings do not yet expose a `WriterProperties`
> class or a `bloom_filter_enabled` keyword for writing bloom filters. The Parquet
> C++ library supports them, but that functionality is not yet available from Python.
> Equality-predicate pushdown still works through **row-group min/max statistics**,
> just without the probabilistic bloom filter layer.

---

## 5. Row-group statistics confirm skipping for the equality predicate

> **Spec:** [Bloom Filter: BloomFilterHeader](https://parquet.apache.org/docs/file-format/bloomfilter/)

When bloom filters **are** written, Parquet records each filter's byte position as
`bloom_filter_offset` in `ColumnChunkMetaData`. A column chunk with this offset set has a bloom filter.

Because PyArrow 23 does not write bloom filters from Python, all offsets below
will be `None`. The cell also shows min/max statistics, confirming that the
sorted `user_id` column still enables row-group skipping via statistics alone.
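
Why sorting matters for an equality predicate can be sketched in plain Python, independent of any Parquet file. The helpers `chunk_stats` and `groups_to_read` and the generated `user_id` values are hypothetical. They simulate row groups by slicing a list and computing min/max per slice.

```python
import random

def chunk_stats(values, chunk_size):
    """Split values into fixed-size 'row groups' and return (min, max) per group."""
    return [
        (min(values[i:i + chunk_size]), max(values[i:i + chunk_size]))
        for i in range(0, len(values), chunk_size)
    ]

def groups_to_read(stats, target):
    """Indices of groups whose [min, max] range could contain `target`."""
    return [i for i, (lo, hi) in enumerate(stats) if lo <= target <= hi]

# Hypothetical IDs, zero-padded so string order matches numeric order
user_ids = [f"user{i:05d}" for i in range(30_000)]
target = "user12345"

# Sorted input: each group covers a disjoint range, so one group survives
sorted_stats = chunk_stats(user_ids, 10_000)
print(groups_to_read(sorted_stats, target))  # [1]

# Shuffled input: every group's range is wide, so statistics rule out little
shuffled = user_ids[:]
random.Random(0).shuffle(shuffled)
shuffled_stats = chunk_stats(shuffled, 10_000)
print(groups_to_read(shuffled_stats, target))
```

On the shuffled data the second call typically returns every group, which is the situation where a bloom filter would help and min/max statistics do not.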

---

## Summary

| Technique | Predicate type | Mechanism | PyArrow 23 support |
|-----------|---------------|----------|--------------------|
| Column statistics (min/max) | Range: `col >= x`, `col BETWEEN a AND b` | Stored in the FileMetaData footer; no seeks beyond reading the footer | ✅ read & write |
| Page index | Range at page granularity | Per-page min/max, finer than row-group stats; stored after the row-group data, just before the footer | ✅ write (`write_page_index=True`), ⚠️ not yet used on read |
| Bloom filter | Equality: `col == value` | Probabilistic set membership; definite-no ➡️ skip row group | ⚠️ C++ only; Python write API not yet exposed |

All three techniques skip data **before any column chunk data is decompressed or decoded**; they operate purely on metadata.

**End of the Parquet series.** Return to the index: [Intro](00_intro.ipynb)