# File Format

> **Level:** Beginner  
> **Spec:** [File Format](https://parquet.apache.org/docs/file-format/)  
> **PyArrow docs:** [pyarrow.parquet.ParquetFile](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html)

**What you will learn:**

1. The physical layout of a Parquet file: magic bytes, data, footer
2. The hierarchy: File → Row Groups → Column Chunks → Pages
3. Why file metadata is written at the end (footer-first design)
4. How to navigate the hierarchy programmatically with `ParquetFile`

In [1]:
import io
import struct

import pyarrow as pa
import pyarrow.parquet as pq

---

## 1. Magic bytes: `PAR1`

> **Spec:** [File Format](https://parquet.apache.org/docs/file-format/)

Every valid Parquet file starts **and** ends with the 4-byte magic number `PAR1` (`0x50 0x41 0x52 0x31`).

Layout:
```
[ PAR1 ]  [ column data ... ]  [ File Metadata ]  [ 4-byte footer length ]  [ PAR1 ]
```

Readers locate the footer by seeking to the last 8 bytes of the file:
- bytes [-4:] must be `PAR1`
- bytes [-8:-4] hold the footer length as a little-endian `uint32`

In [None]:
# Write a minimal table to a BytesIO buffer so we can inspect the raw bytes
table = pa.table({"x": pa.array([1, 2, 3], type=pa.int32())})
buf = io.BytesIO()
pq.write_table(table, buf)
data = buf.getvalue()

magic = b"PAR1"

print(f"File size: {len(data)} bytes")
print()

# Leading magic
leading = data[:4]
print(f"First 4 bytes: {leading!r}  →  valid: {leading == magic}")

# Trailing magic
trailing = data[-4:]
print(f"Last  4 bytes: {trailing!r}  →  valid: {trailing == magic}")

# Footer length (little-endian uint32 at bytes [-8:-4])
footer_len = struct.unpack_from("<I", data, len(data) - 8)[0]
print(f"Footer length (bytes [-8:-4]): {footer_len} bytes")
print(f"Footer starts at byte offset:  {len(data) - 8 - footer_len}")

File size: 457 bytes

First 4 bytes: b'PAR1'  →  valid: True
Last  4 bytes: b'PAR1'  →  valid: True
Footer length (bytes [-8:-4]): 358 bytes
Footer starts at byte offset:  91


---

## 2. Hierarchy: Row Groups → Column Chunks → Pages

> **Spec:** [Concepts](https://parquet.apache.org/docs/concepts/)

```
File
 ├─ Row Group 0          (horizontal slice of rows)
 │   ├─ Column Chunk 0   (all values of column 0 in this row group)
 │   │   ├─ Page 0       (smallest unit: compressed + encoded)
 │   │   └─ Page 1
 │   └─ Column Chunk 1
 │       └─ Page 0
 └─ Row Group 1
     ...
```

- **Row Group**: a logical horizontal partition of the rows. It is the unit of parallelism for MapReduce and other distributed engines.
- **Column Chunk**: all values of one column within a row group. Contiguous on disk.
- **Page**: indivisible unit for compression and encoding.

In [3]:
# Write a larger table split across multiple row groups
N = 50_000
big_table = pa.table({
    "id":    pa.array(range(N), type=pa.int32()),
    "score": pa.array([float(i) / N for i in range(N)], type=pa.float32()),
    "label": pa.array(["a" if i % 2 == 0 else "b" for i in range(N)], type=pa.string()),
})

buf2 = io.BytesIO()
# row_group_size controls the max number of rows per row group
pq.write_table(big_table, buf2, row_group_size=20_000)
buf2.seek(0)

pf = pq.ParquetFile(buf2)
meta = pf.metadata

print(f"Total rows:   {meta.num_rows:,}")
print(f"Row groups:   {meta.num_row_groups}")
print(f"Columns:      {meta.num_columns}")
print()

for rg_idx in range(meta.num_row_groups):
    rg = meta.row_group(rg_idx)
    print(f"  Row group {rg_idx}: {rg.num_rows:,} rows, {rg.total_byte_size:,} bytes uncompressed")
    for col_idx in range(meta.num_columns):
        col = rg.column(col_idx)
        print(f"    Column '{col.path_in_schema}': {col.total_compressed_size:,} bytes compressed")

Total rows:   50,000
Row groups:   3
Columns:      3

  Row group 0: 20,000 rows, 237,850 bytes uncompressed
    Column 'id': 117,636 bytes compressed
    Column 'score': 117,637 bytes compressed
    Column 'label': 195 bytes compressed
  Row group 1: 20,000 rows, 237,850 bytes uncompressed
    Column 'id': 117,637 bytes compressed
    Column 'score': 117,637 bytes compressed
    Column 'label': 195 bytes compressed
  Row group 2: 10,000 rows, 116,540 bytes uncompressed
    Column 'id': 57,614 bytes compressed
    Column 'score': 57,614 bytes compressed
    Column 'label': 135 bytes compressed


---

## 3. Footer-first design

> **Spec:** [File Format: metadata after data](https://parquet.apache.org/docs/file-format/)

File metadata is written **after** all column data. This allows single-pass writing:
a writer streams column chunks without needing to know final offsets upfront,
then appends the complete metadata, including all column chunk offsets, at the end.

A reader therefore:
1. Seeks to EOF − 8 bytes to read the footer length
2. Seeks back by that length to read the full `FileMetaData` Thrift struct
3. Uses the column chunk offsets in the metadata to seek to each column chunk

Because all column chunk offsets are stored in the file footer, a random-access column read needs exactly **two seeks** (one for the footer length, one for the metadata) before it can locate any data.

In [None]:
# Verify: column chunk file offsets are stored in the metadata (not inline)
buf2.seek(0)
pf2 = pq.ParquetFile(buf2)
meta2 = pf2.metadata

file_size = len(buf2.getvalue())
footer_length = struct.unpack_from("<I", buf2.getvalue(), file_size - 8)[0]
data_region_end = file_size - 8 - footer_length

print(f"File size:               {file_size:,} bytes")
print(f"Footer length:           {footer_length:,} bytes")
print(f"Data region: bytes 4 .. {data_region_end:,}")
print()
print("Column chunk offsets (from footer metadata):")
for rg_idx in range(meta2.num_row_groups):
    rg = meta2.row_group(rg_idx)
    for col_idx in range(meta2.num_columns):
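        col = rg.column(col_idx)
        # data_page_offset locates the first data page of the chunk; the spec
        # deprecates ColumnChunk.file_offset, which recent writers may set to 0
        print(f"  RG {rg_idx} / col '{col.path_in_schema}': first data page at {col.data_page_offset:,}")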

---

## 4. Reading individual row groups

> **Spec:** [Concepts: Row group](https://parquet.apache.org/docs/concepts/)

`ParquetFile.read_row_group(i)` reads exactly one row group without touching the others.
This is the basis for parallel and incremental processing.

In [None]:
buf2.seek(0)
pf3 = pq.ParquetFile(buf2)

for rg_idx in range(pf3.metadata.num_row_groups):
    # read_row_group returns a pyarrow.Table holding only this row group
    rg_table = pf3.read_row_group(rg_idx)
    ids = rg_table.column("id")
    print(f"Row group {rg_idx}: rows {ids[0].as_py()} .. {ids[-1].as_py()} ({len(ids):,} rows)")

---

## Summary

| Concept | Key point |
|---------|----------|
| Magic bytes | Files start and end with `PAR1`, which allows a quick validity check |
| Footer-first | Metadata written after data for single-pass writing, two-seek reading |
| Row group | Horizontal partition: controls parallelism and memory during read |
| Column chunk | One per column per row group, contiguous on disk; this enables column projection |
| Page | Smallest compression/encoding unit inside a column chunk |

**Next ⏭️** [Metadata](03_metadata.ipynb)