# Metadata

> **Level:** Intermediate  
> **Spec:** [Metadata](https://parquet.apache.org/docs/file-format/metadata/)  
> **PyArrow docs:** [pyarrow.parquet.FileMetaData](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.FileMetaData.html)

**What you will learn:**

1. What `FileMetaData` contains and how to read it with PyArrow
2. Row-group and column-chunk level metadata: sizes, encodings, compression
3. Column statistics: min/max/null counts stored per column chunk
4. How statistics enable row-group skipping (predicate pushdown preview)

In [None]:
import io

import pyarrow as pa
import pyarrow.parquet as pq

---

## 1. `FileMetaData`: top-level file metadata

> **Spec:** [File metadata](https://parquet.apache.org/docs/file-format/metadata/)

The `FileMetaData` Thrift struct is serialized in the footer and contains:
- Parquet format version
- Number of rows
- Parquet schema (PyArrow also stores the serialized Arrow schema under the `ARROW:schema` key in the key-value metadata)
- One `RowGroup` entry per row group (with offsets and sizes for every column chunk)

In [None]:
N = 30_000

table = pa.table({
    "id":       pa.array(range(N), type=pa.int32()),
    "revenue":  pa.array([float(i) * 1.5 for i in range(N)], type=pa.float64()),
    "region":   pa.array([["north", "south", "east", "west"][i % 4] for i in range(N)], type=pa.string()),
    "active":   pa.array([i % 3 != 0 for i in range(N)], type=pa.bool_()),
})

buf = io.BytesIO()
pq.write_table(table, buf, row_group_size=10_000, write_statistics=True)
buf.seek(0)

pf = pq.ParquetFile(buf)
meta = pf.metadata

print(f"Format version:  {meta.format_version}")
print(f"Created by:      {meta.created_by}")
print(f"Total rows:      {meta.num_rows:,}")
print(f"Row groups:      {meta.num_row_groups}")
print(f"Columns:         {meta.num_columns}")
print()
print("Key-value metadata (first 3 entries):")
if meta.metadata:
    for k, v in list(meta.metadata.items())[:3]:
        print(f"  {k!r}: {v[:60]!r}..." if len(v) > 60 else f"  {k!r}: {v!r}")

---

## 2. Row-group metadata

> **Spec:** [RowGroup](https://parquet.apache.org/docs/file-format/metadata/)

Each `RowGroupMetaData` entry records:
- Number of rows in the row group
- Total uncompressed byte size
- One `ColumnChunkMetaData` per column

In [None]:
buf.seek(0)
pf2 = pq.ParquetFile(buf)
meta2 = pf2.metadata

for rg_idx in range(meta2.num_row_groups):
    rg = meta2.row_group(rg_idx)
    print(f"Row group {rg_idx}")
    print(f"  rows:               {rg.num_rows:,}")
    print(f"  total_byte_size:    {rg.total_byte_size:,} bytes (uncompressed)")
    print()

---

## 3. Column-chunk metadata

> **Spec:** [ColumnChunk](https://parquet.apache.org/docs/file-format/metadata/)

`ColumnChunkMetaData` includes:
- File offset and sizes (compressed / uncompressed)
- Compression codec
- Encodings used
- Per-chunk statistics (min, max, null count)

In [None]:
buf.seek(0)
pf3 = pq.ParquetFile(buf)
meta3 = pf3.metadata

rg = meta3.row_group(0)
print("Column chunk details for row group 0:")
print(f"{'Column':<12} {'Codec':<10} {'Compressed':>12} {'Uncompressed':>14} {'Encodings'}")
print("-" * 65)
for col_idx in range(meta3.num_columns):
    col = rg.column(col_idx)
    print(
        f"{col.path_in_schema:<12} "
        f"{col.compression:<10} "
        f"{col.total_compressed_size:>12,} "
        f"{col.total_uncompressed_size:>14,} "
        f"{col.encodings}"
    )

---

## 4. Column statistics and row-group skipping

> **Spec:** [Statistics](https://parquet.apache.org/docs/file-format/metadata/)

When `write_statistics=True` (the default), each column chunk stores:
- `min`: smallest value in the chunk
- `max`: largest value in the chunk
- `null_count`: number of null values
- `distinct_count`: number of distinct values (optional; many writers omit it)

A reader evaluating a filter like `revenue > 40000` can inspect the `max` of each row group's
`revenue` column chunk and **skip** any row group whose `max <= 40000` without reading its data pages.

In [None]:
buf.seek(0)
pf4 = pq.ParquetFile(buf)
meta4 = pf4.metadata

print("Statistics for 'revenue' across all row groups:")
print(f"{'RG':>3} {'min':>12} {'max':>12} {'null_count':>12}")
print("-" * 42)
for rg_idx in range(meta4.num_row_groups):
    rg = meta4.row_group(rg_idx)
    # Find the 'revenue' column
    for col_idx in range(meta4.num_columns):
        col = rg.column(col_idx)
        if col.path_in_schema == "revenue":
            stats = col.statistics
            print(f"{rg_idx:>3} {stats.min:>12.1f} {stats.max:>12.1f} {stats.null_count:>12}")

# Demonstrate predicate pushdown: for a strict `>` filter, only row groups where max > filter_val are relevant
filter_val = 29_000.0
print()
print(f"Filter: revenue > {filter_val}")
for rg_idx in range(meta4.num_row_groups):
    rg = meta4.row_group(rg_idx)
    for col_idx in range(meta4.num_columns):
        col = rg.column(col_idx)
        if col.path_in_schema == "revenue":
            stats = col.statistics
            verdict = "READ" if stats.max > filter_val else "SKIP"
            print(f"  Row group {rg_idx}: max={stats.max:.1f}  ➡️ {verdict}")

---

## Summary

| Concept | Key point |
|---------|----------|
| `FileMetaData` | Top-level: version, row count, schema, list of row groups |
| `RowGroupMetaData` | Per row group: row count, uncompressed size, list of column chunks |
| `ColumnChunkMetaData` | Per-column: codec, encodings, compressed/uncompressed sizes, file offset |
| Statistics | min/max/null_count per column chunk: enables row-group skipping |
| Row-group skipping | Filter evaluated against metadata first: for `col > value`, skip row groups whose max <= value |

**Next ⏭️** [Types](04_types.ipynb)