# Nested Encoding

> **Level:** Advanced  
> **Spec:** [Nested Encoding](https://parquet.apache.org/docs/file-format/nestedencoding/)  
> **PyArrow docs:** [Nested data](https://arrow.apache.org/docs/python/data.html#nested-data)

**What you will learn:**

1. How Parquet encodes nested data using the Dremel algorithm (definition/repetition levels)
2. What definition levels represent and how they encode nulls at arbitrary depth
3. What repetition levels represent and how they reconstruct list structure
4. How to write and read back `list` and `struct` columns with PyArrow
5. How to inspect max definition and repetition levels from the Parquet schema

In [1]:
import io

import pyarrow as pa
import pyarrow.parquet as pq

---

## 1. The Dremel encoding: definition and repetition levels

> **Spec:** [Nested Encoding](https://parquet.apache.org/docs/file-format/nestedencoding/)

Parquet uses the **Dremel shredding algorithm** to flatten nested structures into columnar form.
Every leaf column stores three parallel streams:

- **Values**: the non-null leaf values (flat)
- **Definition levels**: for each position, how many optional/repeated fields are defined  
  (`max_definition_level` = number of nullable/repeated fields on the path to the leaf)
- **Repetition levels**: indicate at which nesting level a new list element begins  
  (`0` = first value of a new top-level row, `k` = the value continues a list at nesting depth `k` within the current row)

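The maximum definition and repetition levels of each leaf column are fixed by the schema. PyArrow exposes them on the Parquet schema's column descriptors (`ParquetFile.schema.column(i)`) rather than on the column-chunk metadata.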
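
The counting rule for the maximum levels can be sketched in plain Python. This is a minimal illustration, not PyArrow's implementation: `max_levels` is a hypothetical helper, and each path is written out by hand as the repetition type of every field from the root down to the leaf (the required root `schema` group is left out).

```python
def max_levels(path):
    """Return (max_definition_level, max_repetition_level) for a leaf.

    `path` lists the repetition type ("required", "optional", or "repeated")
    of every field from the root down to the leaf.
    Hypothetical helper, for illustration only.
    """
    # Every optional or repeated field adds one definition level.
    max_def = sum(1 for rep in path if rep in ("optional", "repeated"))
    # Only repeated fields add a repetition level.
    max_rep = sum(1 for rep in path if rep == "repeated")
    return max_def, max_rep


flat = max_levels(["optional"])                              # nullable flat leaf
in_list = max_levels(["optional", "repeated", "optional"])   # nullable list of nullable elements
required_leaf = max_levels(["required"])                     # non-nullable flat leaf

print(flat)           # (1, 0)
print(in_list)        # (3, 1)
print(required_leaf)  # (0, 0)
```

A required leaf needs no level streams at all, because neither level can ever be greater than 0.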

In [2]:
# A simple nullable column: max_def_level=1, max_rep_level=0
nullable_table = pa.table({
    "score": pa.array([1.0, None, 3.0, None, 5.0], type=pa.float32())
})
buf_nullable = io.BytesIO()
pq.write_table(nullable_table, buf_nullable)
buf_nullable.seek(0)

pf = pq.ParquetFile(buf_nullable)
col_meta = pf.metadata.row_group(0).column(0)

# PyArrow exposes levels through the Parquet schema
col0 = pf.schema.column(0)

print("Nullable flat column:")
print(f"  path:              {col_meta.path_in_schema}")
print(f"  max_definition_level (schema): {col0.max_definition_level}")
print(f"  max_repetition_level:          {col0.max_repetition_level}")
print()
print("(0 def_level = null, 1 = value defined)")

Nullable flat column:
  path:              score
  max_definition_level (schema): 1
  max_repetition_level:          0

(0 def_level = null, 1 = value defined)
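
To make the definition-level stream concrete, the sketch below shreds the same five values by hand. `shred_flat` is a hypothetical helper, not a PyArrow function, and it ignores how Parquet encodes the levels on disk; it only shows which values and levels the column carries.

```python
def shred_flat(column):
    """Split a flat nullable column into (values, definition_levels).

    Hypothetical helper: nulls are dropped from the value stream and
    survive only as definition level 0.
    """
    values = [v for v in column if v is not None]
    def_levels = [0 if v is None else 1 for v in column]
    return values, def_levels


values, def_levels = shred_flat([1.0, None, 3.0, None, 5.0])
print(values)      # [1.0, 3.0, 5.0]
print(def_levels)  # [1, 0, 1, 0, 1]
```

There are five definition levels (one per row) but only three stored values.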


---

## 2. List columns: repetition levels in action

> **Spec:** [Nested Encoding: repetition levels](https://parquet.apache.org/docs/file-format/nestedencoding/)

A nullable `list<string>` column with nullable elements has:
- `max_definition_level = 3`: the list is null (level 0), the list is empty (level 1), an element is null (level 2), or an element is present (level 3)
- `max_repetition_level = 1`: one level of list nesting

Repetition level `0` marks the start of a new top-level row.
Repetition level `1` marks subsequent elements in the same list.

In [5]:
list_table = pa.table({
    "tags": pa.array(
        [["a", "b", "c"],   # row 0: 3 elements
         ["x"],             # row 1: 1 element
         None,              # row 2: null list
         [],                # row 3: empty list
         ["p", "q"]],       # row 4: 2 elements
        type=pa.list_(pa.string())
    )
})

buf_list = io.BytesIO()
pq.write_table(list_table, buf_list)
buf_list.seek(0)

pf_list = pq.ParquetFile(buf_list)
schema_list = pf_list.schema

# The list element is the leaf: tags.list.element
print("Parquet schema for the list<string> column:")
print(pf_list.schema)
print()

# Find the leaf column
for i in range(pf_list.metadata.num_columns):
    col = schema_list.column(i)
    print(f"Column path: {col.path}")
    print(f"  max_definition_level: {col.max_definition_level}")
    print(f"  max_repetition_level: {col.max_repetition_level}")

# Round-trip: read back and verify
buf_list.seek(0)
result = pq.read_table(buf_list)
print()
print("Round-tripped values:")
for i, val in enumerate(result.column("tags").to_pylist()):
    print(f"  row {i}: {val}")


Parquet schema for the list<string> column:
<pyarrow._parquet.ParquetSchema object at 0x7fb4c460cc80>
required group field_id=-1 schema {
  optional group field_id=-1 tags (List) {
    repeated group field_id=-1 list {
      optional binary field_id=-1 element (String);
    }
  }
}


Column path: tags.list.element
  max_definition_level: 3
  max_repetition_level: 1

Round-tripped values:
  row 0: ['a', 'b', 'c']
  row 1: ['x']
  row 2: None
  row 3: []
  row 4: ['p', 'q']
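
The level streams behind this round trip can be worked out by hand. The sketch below assumes the three-level list layout shown in the schema above (optional list, repeated group, optional element); `shred_list` is a hypothetical helper for illustration, not how PyArrow computes levels internally.

```python
def shred_list(rows):
    """Shred rows of a nullable list<string> column with nullable elements.

    Returns (values, definition_levels, repetition_levels).
    Hypothetical helper, for illustration only.
    """
    values, def_levels, rep_levels = [], [], []
    for row in rows:
        if row is None:            # null list: nothing on the path is defined
            def_levels.append(0)
            rep_levels.append(0)
        elif len(row) == 0:        # empty list: the list is defined, but has no element
            def_levels.append(1)
            rep_levels.append(0)
        else:
            for i, item in enumerate(row):
                # First element starts a new row (0); the rest continue the list (1).
                rep_levels.append(0 if i == 0 else 1)
                if item is None:   # element slot exists but holds null
                    def_levels.append(2)
                else:              # element present
                    def_levels.append(3)
                    values.append(item)
    return values, def_levels, rep_levels


rows = [["a", "b", "c"], ["x"], None, [], ["p", "q"]]
values, def_levels, rep_levels = shred_list(rows)
print(values)      # ['a', 'b', 'c', 'x', 'p', 'q']
print(def_levels)  # [3, 3, 3, 3, 0, 1, 3, 3]
print(rep_levels)  # [0, 1, 1, 0, 0, 0, 0, 1]
```

Each `0` in the repetition levels marks a row boundary, so the five zeros correspond to the five rows, including the null and empty lists that contribute no values.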


---

## 3. Struct columns: definition levels across fields

> **Spec:** [Nested Encoding: definition levels](https://parquet.apache.org/docs/file-format/nestedencoding/)

A `struct<name: string, age: int32>` column named `person` is shredded into two leaf columns:
`person.name` and `person.age`. Each inherits one extra definition level for the
struct wrapper itself: if the struct is null, both leaves record definition level 0.
Level 1 means the struct is present but the field is null, and level 2 means the field has a value.

In [6]:
struct_array = pa.array(
    [{"name": "Alice", "age": 30},
     None,
     {"name": "Carol", "age": None},
     {"name": None,   "age": 25}],
    type=pa.struct([("name", pa.string()), ("age", pa.int32())])
)
struct_table = pa.table({"person": struct_array})

buf_struct = io.BytesIO()
pq.write_table(struct_table, buf_struct)
buf_struct.seek(0)

pf_struct = pq.ParquetFile(buf_struct)
print("Parquet schema for the struct column:")
print(pf_struct.schema)
print()

schema_struct = pf_struct.schema
for i in range(pf_struct.metadata.num_columns):
    col = schema_struct.column(i)
    print(f"Leaf column: {col.path}")
    print(f"  max_definition_level: {col.max_definition_level}")
    print(f"  max_repetition_level: {col.max_repetition_level}")

buf_struct.seek(0)
result_struct = pq.read_table(buf_struct)
print()
print("Round-tripped values:")
for i, val in enumerate(result_struct.column("person").to_pylist()):
    print(f"  row {i}: {val}")

Parquet schema for the struct column:
<pyarrow._parquet.ParquetSchema object at 0x7fb4837e9dc0>
required group field_id=-1 schema {
  optional group field_id=-1 person {
    optional binary field_id=-1 name (String);
    optional int32 field_id=-1 age;
  }
}


Leaf column: person.name
  max_definition_level: 2
  max_repetition_level: 0
Leaf column: person.age
  max_definition_level: 2
  max_repetition_level: 0

Round-tripped values:
  row 0: {'name': 'Alice', 'age': 30}
  row 1: None
  row 2: {'name': 'Carol', 'age': None}
  row 3: {'name': None, 'age': 25}
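
The definition levels for each leaf can be derived by hand from these four rows. This is a sketch under the schema printed above (optional struct, optional fields); `struct_leaf_levels` is a hypothetical helper, not part of PyArrow.

```python
def struct_leaf_levels(rows, field):
    """Definition levels for one leaf of a nullable struct with nullable fields.

    0 = struct is null, 1 = struct present but field null, 2 = field has a value.
    Hypothetical helper, for illustration only.
    """
    levels = []
    for row in rows:
        if row is None:
            levels.append(0)
        elif row[field] is None:
            levels.append(1)
        else:
            levels.append(2)
    return levels


people = [
    {"name": "Alice", "age": 30},
    None,
    {"name": "Carol", "age": None},
    {"name": None, "age": 25},
]
name_levels = struct_leaf_levels(people, "name")
age_levels = struct_leaf_levels(people, "age")
print(name_levels)  # [2, 0, 2, 1]
print(age_levels)   # [2, 0, 1, 2]
```

Row 1 records level 0 in both leaves, which is how a reader can tell a null struct apart from a struct whose fields are all null.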


---

## 4. List of structs: full nesting

> **Spec:** [Nested Encoding](https://parquet.apache.org/docs/file-format/nestedencoding/)

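Combining lists and structs demonstrates how levels compound:
each leaf of a `list<struct<x: float32, y: float32>>` column has `max_definition_level = 4` (optional list, repeated group, optional struct element, optional field) and `max_repetition_level = 1`.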

In [7]:
point_type = pa.struct([("x", pa.float32()), ("y", pa.float32())])
nested_table = pa.table({
    "path": pa.array(
        [[{"x": 0.0, "y": 0.0}, {"x": 1.0, "y": 1.0}],
         [{"x": 2.0, "y": 2.0}],
         None,
         []],
        type=pa.list_(point_type)
    )
})

buf_nested = io.BytesIO()
pq.write_table(nested_table, buf_nested)
buf_nested.seek(0)

pf_nested = pq.ParquetFile(buf_nested)
print("Parquet schema (list of structs):")
print(pf_nested.schema)
print()

schema_nested = pf_nested.schema
for i in range(pf_nested.metadata.num_columns):
    col = schema_nested.column(i)
    print(f"Leaf: {col.path:<25} def_level={col.max_definition_level}  rep_level={col.max_repetition_level}")

buf_nested.seek(0)
result_nested = pq.read_table(buf_nested)
print()
print("Round-tripped values:")
for i, val in enumerate(result_nested.column("path").to_pylist()):
    print(f"  row {i}: {val}")

Parquet schema (list of structs):
<pyarrow._parquet.ParquetSchema object at 0x7fb4837e81c0>
required group field_id=-1 schema {
  optional group field_id=-1 path (List) {
    repeated group field_id=-1 list {
      optional group field_id=-1 element {
        optional float field_id=-1 x;
        optional float field_id=-1 y;
      }
    }
  }
}


Leaf: path.list.element.x       def_level=4  rep_level=1
Leaf: path.list.element.y       def_level=4  rep_level=1

Round-tripped values:
  row 0: [{'x': 0.0, 'y': 0.0}, {'x': 1.0, 'y': 1.0}]
  row 1: [{'x': 2.0, 'y': 2.0}]
  row 2: None
  row 3: []
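
Reading reverses the shredding. The sketch below rebuilds the `x` leaf of the four rows above from its three streams. The level streams are written out by hand from the table's contents rather than read from the file, and `assemble_x` is a hypothetical helper that handles only this one schema shape (`max_definition_level = 4`, `max_repetition_level = 1`).

```python
def assemble_x(values, def_levels, rep_levels):
    """Rebuild per-row lists of {"x": ...} from the streams of path.list.element.x.

    Definition levels: 0 = null list, 1 = empty list, 2 = null struct,
    3 = null x, 4 = x present. Hypothetical helper, for illustration only.
    """
    rows = []
    remaining = iter(values)
    for d, r in zip(def_levels, rep_levels):
        if r == 0:                      # repetition level 0 starts a new row
            if d == 0:
                rows.append(None)       # the list itself is null
                continue
            rows.append([])
            if d == 1:
                continue                # the list is empty
        if d == 2:
            rows[-1].append(None)                    # null struct element
        elif d == 3:
            rows[-1].append({"x": None})             # struct present, x null
        else:
            rows[-1].append({"x": next(remaining)})  # x present: consume one value
    return rows


# Streams for leaf `path.list.element.x`, derived by hand from the four rows.
x_values = [0.0, 1.0, 2.0]
x_def = [4, 4, 4, 0, 1]
x_rep = [0, 1, 0, 0, 0]

rebuilt = assemble_x(x_values, x_def, x_rep)
print(rebuilt)
# [[{'x': 0.0}, {'x': 1.0}], [{'x': 2.0}], None, []]
```

The `y` leaf carries identical level streams here, so a reader can zip the two leaves back into full structs.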


---

## Summary

| Concept | Key point |
|---------|----------|
| Dremel algorithm | Shreds nested structures into flat columnar streams |
| Definition level | Counts how many optional/repeated fields on the path are defined. Encodes nulls at any depth |
| Repetition level | Marks at which list level a new element begins (`0` = new row) |
| `max_definition_level` | Number of optional and repeated fields from root to leaf |
| `max_repetition_level` | Number of repeated (list) fields from root to leaf |
| PyArrow | Transparently handles encoding/decoding of nested types. Round-trips with full fidelity |

**Next ⏭️** [Page index & Bloom filter](08_page_index_bloom_filter.ipynb)