# Encodings

> **Level:** Intermediate  
> **Spec:** [Encodings](https://parquet.apache.org/docs/file-format/data-pages/encodings/)  
> **PyArrow docs:** [ParquetWriter](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html)

**What you will learn:**

1. How `PLAIN` encoding stores values back-to-back with no further transformation
2. How `RLE_DICTIONARY` slashes size for low-cardinality columns
3. How `DELTA_BINARY_PACKED` exploits monotone integer sequences
4. How `BYTE_STREAM_SPLIT` improves floating-point compressibility
5. How to read back the encodings actually used by PyArrow from column-chunk metadata

In [1]:
import io

import pyarrow as pa
import pyarrow.parquet as pq

---

## 1. PLAIN encoding: simple back-to-back storage

> **Spec:** [Encodings: PLAIN](https://parquet.apache.org/docs/file-format/data-pages/encodings/)

PLAIN is the baseline encoding. Values are stored consecutively with no further transformation:
- `INT32`: 4 bytes little-endian per value
- `BYTE_ARRAY`: `[4-byte length][bytes]` per value

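PLAIN must be supported for all types. PyArrow enables dictionary encoding by default and only falls back to PLAIN data pages when the dictionary page outgrows its size limit (1 MB by default). In the `encodings` tuple printed below, `PLAIN` also accounts for the dictionary page itself and `RLE` for the repetition/definition levels, so the tuple alone does not show whether a fallback happened.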
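The two layouts listed above can be reproduced by hand. The following is a minimal sketch using only the standard library; the helper names `plain_int32` and `plain_byte_array` are made up for this illustration and are not part of PyArrow.

```python
import struct


def plain_int32(values):
    """PLAIN layout for INT32: each value as 4 little-endian bytes."""
    return b"".join(struct.pack("<i", v) for v in values)


def plain_byte_array(values):
    """PLAIN layout for BYTE_ARRAY: 4-byte little-endian length, then the raw bytes."""
    return b"".join(struct.pack("<I", len(v)) + v for v in values)


ints = plain_int32([1, 2, 3])
strs = plain_byte_array([b"alpha", b"be"])
print(len(ints), ints.hex())
print(len(strs), strs.hex())
```

By the same arithmetic, an 8-character ASCII string costs 4 + 8 = 12 bytes in PLAIN, so 50,000 such strings need 600,000 bytes before page headers and compression.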

In [None]:
import random
import string

rng = random.Random(42)
N = 50_000

# High-cardinality string column, essentially random: nearly every value is distinct,
# so a dictionary saves little. PyArrow still attempts dictionary encoding by default.
random_strings = ["".join(rng.choices(string.ascii_letters, k=8)) for _ in range(N)]
table_plain = pa.table({"uuid_like": pa.array(random_strings, type=pa.string())})

buf_plain = io.BytesIO()
pq.write_table(table_plain, buf_plain)
buf_plain.seek(0)

meta = pq.ParquetFile(buf_plain).metadata
col = meta.row_group(0).column(0)
print(f"Column: {col.path_in_schema}")
print(f"Encodings: {col.encodings}")
print(f"Compressed:   {col.total_compressed_size:,} bytes")
print(f"Uncompressed: {col.total_uncompressed_size:,} bytes")

Column: uuid_like
Encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
Compressed:   598,337 bytes
Uncompressed: 697,794 bytes


---

## 2. RLE_DICTIONARY encoding: low-cardinality columns

> **Spec:** [Encodings: Dictionary Encoding](https://parquet.apache.org/docs/file-format/data-pages/encodings/)

Dictionary encoding builds a per-column-chunk lookup table of distinct values.
Data pages store integer indices into that table, encoded with RLE / bit-packing.

For a column of 50,000 rows that cycles through only 5 distinct values:
- Dictionary page: 5 entries
- Data pages: RLE-encoded integers in the range [0..4], 3 bits each
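The dictionary/index split and the 3-bit figure can be checked with a few lines of plain Python. This is a simplified sketch, not PyArrow's implementation: `dictionary_encode` and `index_bit_width` are hypothetical helpers, and the size estimate assumes pure bit-packing and ignores the RLE run headers and page headers.

```python
def dictionary_encode(values):
    """Return (dictionary, indices): distinct values in first-seen order, one index per row."""
    positions = {}
    indices = []
    for v in values:
        if v not in positions:
            positions[v] = len(positions)
        indices.append(positions[v])
    return list(positions), indices


def index_bit_width(num_entries):
    """Bits needed to represent the largest index, num_entries - 1."""
    return max(num_entries - 1, 0).bit_length()


categories = ["alpha", "beta", "gamma", "delta", "epsilon"]
column = [categories[i % 5] for i in range(50_000)]
dictionary, indices = dictionary_encode(column)
width = index_bit_width(len(dictionary))
print(dictionary, indices[:7], width)
print(f"Bit-packed indices: about {len(indices) * width // 8:,} bytes")
```

With 5 entries the largest index is 4, which needs 3 bits, so 50,000 indices take roughly 18,750 bytes when bit-packed.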

In [3]:
categories = ["alpha", "beta", "gamma", "delta", "epsilon"]
table_dict = pa.table({"category": pa.array([categories[i % 5] for i in range(N)], type=pa.string())})

buf_dict = io.BytesIO()
pq.write_table(table_dict, buf_dict, use_dictionary=True)
buf_dict.seek(0)

meta_dict = pq.ParquetFile(buf_dict).metadata
col_dict = meta_dict.row_group(0).column(0)
print(f"Column: {col_dict.path_in_schema}")
print(f"Encodings: {col_dict.encodings}")
print(f"Compressed:   {col_dict.total_compressed_size:,} bytes")
print(f"Uncompressed: {col_dict.total_uncompressed_size:,} bytes")

# Compare: plain encoding of the same data
buf_no_dict = io.BytesIO()
pq.write_table(table_dict, buf_no_dict, use_dictionary=False)
size_plain = pq.ParquetFile(buf_no_dict).metadata.row_group(0).column(0).total_compressed_size
print()
print(f"Dictionary encoding: {col_dict.total_compressed_size:,} bytes")
print(f"Plain encoding:      {size_plain:,} bytes")
print(f"Dictionary is {size_plain / col_dict.total_compressed_size:.1f}x smaller")

Column: category
Encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
Compressed:   1,226 bytes
Uncompressed: 19,060 bytes

Dictionary encoding: 1,226 bytes
Plain encoding:      22,074 bytes
Dictionary is 18.0x smaller


---

## 3. DELTA_BINARY_PACKED: monotone integer sequences

> **Spec:** [Encodings: Delta Encoding](https://parquet.apache.org/docs/file-format/data-pages/encodings/)

Delta encoding stores the **differences** between consecutive values rather than absolute values.
For a constant-step sequence like `0, 1, 2, 3, …` every delta is 1. After the minimum delta is
subtracted, every stored delta is 0, so the deltas can be packed with **0 bits per value**.

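PyArrow does not pick delta encoding on its own: with `use_dictionary=False` and no other options,
`INT32`/`INT64` columns are written as `PLAIN`, which is what the cell below reports. `DELTA_BINARY_PACKED`
has to be requested per column through the `column_encoding` argument of `pq.write_table`, with
dictionary encoding disabled for that column.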
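The delta arithmetic described above can be sketched in plain Python. This is a deliberately simplified model: it treats the whole sequence as a single block, whereas the real format splits values into blocks and miniblocks, each with its own minimum delta and bit width. The helper name `delta_bit_width` is an assumption for this illustration.

```python
def delta_bit_width(values):
    """Return (bit_width, adjusted_deltas) after subtracting the minimum delta."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    if not deltas:
        return 0, []
    min_delta = min(deltas)
    adjusted = [d - min_delta for d in deltas]
    return max(adjusted).bit_length(), adjusted


# Constant step: every adjusted delta is 0, so no bits are needed per value
print(delta_bit_width(list(range(10))))

# Growing step: deltas 3, 5, 7 become 0, 2, 4 after subtracting the minimum (3)
print(delta_bit_width([1, 4, 9, 16]))
```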

In [4]:
table_delta = pa.table({"monotone_id": pa.array(range(N), type=pa.int64())})

# use_dictionary=False alone leaves the column PLAIN-encoded; it does not switch to delta encoding
buf_delta = io.BytesIO()
pq.write_table(table_delta, buf_delta, use_dictionary=False)
buf_delta.seek(0)

meta_delta = pq.ParquetFile(buf_delta).metadata
col_delta = meta_delta.row_group(0).column(0)
print(f"Encodings: {col_delta.encodings}")
print(f"Compressed:   {col_delta.total_compressed_size:,} bytes")
print(f"Uncompressed: {col_delta.total_uncompressed_size:,} bytes")
print(f"PLAIN baseline: {N * 8:,} bytes  (8 bytes x {N:,} int64 values)")

Encodings: ('RLE', 'PLAIN')
Compressed:   201,107 bytes
Uncompressed: 400,231 bytes
PLAIN baseline: 400,000 bytes  (8 bytes x 50,000 int64 values)


---

## 4. BYTE_STREAM_SPLIT: better floating-point compression

> **Spec:** [Encodings: Byte Stream Split](https://parquet.apache.org/docs/file-format/data-pages/encodings/)

BYTE_STREAM_SPLIT **transposes** the bytes of each floating-point value.
For N `FLOAT` values, instead of `[B0 B1 B2 B3][B0 B1 B2 B3]…`, it emits
`[all B0s][all B1s][all B2s][all B3s]`. The total size is unchanged,
but each stream groups bytes of the same significance (for example, all sign/exponent bytes),
which a general-purpose compressor such as Snappy usually handles better than interleaved bytes.
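The transposition is easy to reproduce with the standard library. This is a sketch for 4-byte `FLOAT` values only; `byte_stream_split` and `byte_stream_join` are hypothetical helper names, not PyArrow functions.

```python
import struct


def byte_stream_split(values):
    """Encode float32 values so that stream k holds byte k of every value (little-endian)."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return b"".join(raw[k::4] for k in range(4))


def byte_stream_join(encoded):
    """Invert byte_stream_split: re-interleave the four streams and unpack the floats."""
    n = len(encoded) // 4
    raw = bytes(encoded[k * n + i] for i in range(n) for k in range(4))
    return list(struct.unpack(f"<{n}f", raw))


encoded = byte_stream_split([1.0, 2.0])
print(encoded.hex())              # same length as the PLAIN bytes, different order
print(byte_stream_join(encoded))  # round-trips to the original values
```

Here `1.0` is `00 00 80 3f` and `2.0` is `00 00 00 40` in little-endian order, so the first two streams are all zeros and only the last two streams carry information.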

In [9]:
import math

# Smooth float sequence: exponent byte barely changes ➡️ B3 stream is near-constant
floats = [math.sin(i * 0.001) for i in range(N)]
table_float = pa.table({"signal": pa.array(floats, type=pa.float32())})

# Default (no byte stream split, snappy compression)
buf_plain_float = io.BytesIO()
pq.write_table(table_float, buf_plain_float, use_dictionary=False, compression="snappy")

# With byte stream split + snappy
buf_bss = io.BytesIO()
pq.write_table(table_float, buf_bss,
               use_dictionary=False,
               use_byte_stream_split=True,
               compression="snappy")

size_plain = pq.ParquetFile(buf_plain_float).metadata.row_group(0).column(0).total_compressed_size
size_bss   = pq.ParquetFile(buf_bss).metadata.row_group(0).column(0).total_compressed_size

print(f"PLAIN + snappy:             {size_plain:,} bytes")
print(f"BYTE_STREAM_SPLIT + snappy: {size_bss:,} bytes")
print(f"Ratio: {size_plain / size_bss:.2f}x smaller with BYTE_STREAM_SPLIT")

PLAIN + snappy:             200,207 bytes
BYTE_STREAM_SPLIT + snappy: 166,074 bytes
Ratio: 1.21x smaller with BYTE_STREAM_SPLIT


---

## Summary

| Encoding | Physical type | Best for |
|----------|--------------|----------|
| `PLAIN` | All | High-cardinality / random data |
| `RLE_DICTIONARY` | All | Low-cardinality columns (categories, booleans, enums) |
| `DELTA_BINARY_PACKED` | INT32, INT64 | Monotone or slowly-changing integer sequences |
| `DELTA_LENGTH_BYTE_ARRAY` | BYTE_ARRAY | Variable-length strings of similar lengths |
| `DELTA_BYTE_ARRAY` | BYTE_ARRAY | Sorted strings with shared prefixes |
| `BYTE_STREAM_SPLIT` | FLOAT, DOUBLE | Smooth floating-point data (before applying a compressor) |

**Next ⏭️** [Compression](06_compression.ipynb)