# Compression

> **Level:** Intermediate  
> **Spec:** [Compression](https://parquet.apache.org/docs/file-format/data-pages/compression/)  
> **PyArrow docs:** [write_table - compression](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html)

**What you will learn:**

1. The compression codecs defined by Parquet: UNCOMPRESSED (`NONE` in PyArrow), SNAPPY, GZIP, ZSTD, LZ4_RAW, BROTLI, and the deprecated LZ4
2. How to compare codec performance (size vs read latency) on the same data
3. How to set per-column compression overrides with `pq.write_table`
4. When to choose which codec for analytical vs archival workloads

In [1]:
import os
import time

import pyarrow as pa
import pyarrow.parquet as pq

---

## 1. Supported codecs

> **Spec:** [Compression codecs](https://parquet.apache.org/docs/file-format/data-pages/compression/)

| Codec | Key characteristic |
|-------|-------------------|
| `UNCOMPRESSED` | No compression (`NONE` in PyArrow): no codec CPU cost, largest files |
| `SNAPPY` | Google Snappy: fast with moderate ratio |
| `GZIP` | RFC 1952: high ratio, slower |
| `ZSTD` | Zstandard: tunable ratio/speed, best modern default |
| `LZ4_RAW` | LZ4 block format: very fast, low ratio |
| `BROTLI` | RFC 7932: high ratio, slow (web-oriented) |
| `LZ4` | **Deprecated**: non-standard framing, avoid in new files |

Compression is applied to each **page** after encoding. The codec is chosen per column chunk,
so every page in a chunk uses the same codec: a single chunk cannot mix, say, SNAPPY and ZSTD pages.

In [2]:
# Build a mixed-type table suitable for compression benchmarking
N = 200_000
categories = ["north", "south", "east", "west"]

table = pa.table({
    "id":       pa.array(range(N), type=pa.int64()),
    "value":    pa.array([float(i) * 0.123 for i in range(N)], type=pa.float64()),
    "region":   pa.array([categories[i % 4] for i in range(N)], type=pa.string()),
    "note":     pa.array([f"record-{i:07d}" for i in range(N)], type=pa.string()),
})

print(f"Rows: {table.num_rows:,}")
print(f"Columns: {table.num_columns}")

Rows: 200,000
Columns: 4


---

## 2. Codec comparison: size and read speed

> **Spec:** [Compression: Overview](https://parquet.apache.org/docs/file-format/data-pages/compression/)

We write identical data with each codec and measure:
- Final file size (on-disk bytes)
- Read latency (`time.perf_counter()` average over 3 runs)

In [12]:
# "LZ4" is PyArrow's name for the block-format LZ4 codec
CODECS = ["NONE", "SNAPPY", "GZIP", "ZSTD", "LZ4", "BROTLI"]
RUNS = 3
results = []

for codec in CODECS:
    path = f"/tmp/sample_{codec.lower()}.parquet"

    # Write
    pq.write_table(table, path, compression=codec)
    size = os.path.getsize(path)

    # Read (average of RUNS)
    times = []
    for _ in range(RUNS):
        t0 = time.perf_counter()
        pq.read_table(path)
        times.append(time.perf_counter() - t0)
    avg_ms = sum(times) / RUNS * 1000

    results.append((codec, size, avg_ms))

# Baseline: NONE
base_size = results[0][1]
print("Lower is better. None is the baseline (1.00x).")
print(f"{'Codec':<12} {'Size (MB)':>10} {'Ratio':>9} {'Read (ms)':>10} {'Read Ratio':>11}")
print("-" * 56)
for codec, size, avg_ms in results:
    ratio = size / base_size
    read_ratio = avg_ms / results[0][2]
    print(f"{codec:<12} {size/1024/1024:>10.2f} {ratio:>8.2f}x {avg_ms:>10.1f} {read_ratio:>10.2f}x")

Lower is better. None is the baseline (1.00x).
Codec         Size (MB)     Ratio  Read (ms)  Read Ratio
--------------------------------------------------------
NONE               7.16     1.00x        6.5       1.00x
SNAPPY             3.32     0.46x        6.6       1.02x
GZIP               1.97     0.28x        9.8       1.50x
ZSTD               1.26     0.18x        5.6       0.86x
LZ4                3.19     0.45x        5.0       0.77x
BROTLI             1.22     0.17x        7.7       1.18x


---

## 3. Per-column compression overrides

> **Spec:** [Compression: per column](https://parquet.apache.org/docs/file-format/data-pages/compression/)

Parquet stores compression per column chunk. You can assign different codecs to
different columns, for example, use ZSTD for compressible text columns and
LZ4_RAW for numeric columns where speed matters more than ratio.

In [14]:
per_col_path = "/tmp/sample_per_col.parquet"

# id + value ➡️ LZ4  (PyArrow's name for the block-format codec; speed-first numerics)
# region + note ➡️ ZSTD (ratio-first text)
pq.write_table(
    table,
    per_col_path,
    compression={
        "id":     "LZ4",
        "value":  "LZ4",
        "region": "ZSTD",
        "note":   "ZSTD",
    },
)

pf = pq.ParquetFile(per_col_path)
rg = pf.metadata.row_group(0)

print("Per-column compression in the written file:")
for col_idx in range(pf.metadata.num_columns):
    col = rg.column(col_idx)
    print(f"  {col.path_in_schema:<10} ➡️ {col.compression}  ({col.total_compressed_size:,} bytes)")

Per-column compression in the written file:
  id         ➡️ LZ4  (1,072,935 bytes)
  value      ➡️ LZ4  (1,312,007 bytes)
  region     ➡️ ZSTD  (1,787 bytes)
  note       ➡️ ZSTD  (189,961 bytes)


---

## 4. ZSTD compression level

> **Spec:** [ZSTD](https://parquet.apache.org/docs/file-format/data-pages/compression/)

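ZSTD supports compression levels 1–22. Higher levels generally trade CPU time for smaller files,
but the gain is not guaranteed to be monotonic: in the run below, level 3 produced a slightly larger file than level 1.
Even level 1 is far smaller than SNAPPY on this data (1.26 MB vs 3.32 MB). The zstd library itself defaults to level 3,
but the default-level ZSTD file in section 2 is the same size as the level 1 file here, which suggests PyArrow defaulted to level 1 in this run.
Write times are single measurements, so treat small differences as noise.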

In [15]:
print(f"{'ZSTD level':<12} {'Size (MB)':>10} {'Write (ms)':>12}")
print("-" * 37)
for level in [1, 3, 9, 19]:
    path = f"/tmp/sample_zstd_{level}.parquet"
    t0 = time.perf_counter()
    pq.write_table(table, path, compression="ZSTD", compression_level=level)
    write_ms = (time.perf_counter() - t0) * 1000
    size = os.path.getsize(path)
    print(f"level {level:<6} {size/1024/1024:>10.2f} {write_ms:>12.1f}")

ZSTD level    Size (MB)   Write (ms)
-------------------------------------
level 1            1.26         47.2
level 3            1.33         39.1
level 9            1.25         95.5
level 19           1.13       1321.5


---

## Summary

| Codec | Use when |
|-------|----------|
| `UNCOMPRESSED` | Benchmarking / NVMe storage where CPU is the bottleneck |
| `SNAPPY` | Low-latency, high-throughput pipelines (the default in PyArrow's `write_table`) |
| `ZSTD` | General-purpose: best modern default. Tune level per workload |
| `LZ4_RAW` | Maximum decompression speed (streaming, real-time ingest) |
| `GZIP` | Archival / cold storage where size matters more than speed |
| Per-column | Mix codecs when columns have different compressibility profiles |

**Next ⏭️** [Nested encoding](07_nested_encoding.ipynb)