# Types

> **Level:** Intermediate  
> **Spec:** [Types](https://parquet.apache.org/docs/file-format/types/) · [Logical Types](https://parquet.apache.org/docs/file-format/types/logicaltypes/)  
> **PyArrow docs:** [Schema and Field](https://arrow.apache.org/docs/python/api/datatypes.html)

**What you will learn:**

1. The seven Parquet physical types in current use (plus the deprecated `INT96`) and how they map to PyArrow types
2. Logical type annotations: how a `BYTE_ARRAY` becomes a `STRING` and an `INT32` becomes a `DATE`
3. How to read back the Parquet schema (not the Arrow schema) from a file
4. Type precision: `DECIMAL`, `TIMESTAMP`, and `UUID` as annotated physical types

In [None]:
import io
import datetime

import pyarrow as pa
import pyarrow.parquet as pq

---

## 1. Physical types

> **Spec:** [Physical Types](https://parquet.apache.org/docs/file-format/types/)

Parquet defines seven physical types in current use, plus the deprecated `INT96`. They describe the actual binary representation on disk:

| Physical type | Size | PyArrow mapping |
|---------------|------|------------------|
| `BOOLEAN` | 1 bit | `pa.bool_()` |
| `INT32` | 4 bytes | `pa.int32()`, `pa.date32()`, `pa.time32(...)` |
| `INT64` | 8 bytes | `pa.int64()`, `pa.timestamp(...)` |
| `FLOAT` | 4 bytes | `pa.float32()` |
| `DOUBLE` | 8 bytes | `pa.float64()` |
| `BYTE_ARRAY` | variable | `pa.binary()`, `pa.string()` |
| `FIXED_LEN_BYTE_ARRAY` | fixed | `pa.decimal128(...)`, `pa.binary(n)` |

In [None]:
# Build a table with one column for each physical type in current use
table_phys = pa.table({
    "bool_col":   pa.array([True, False, True], type=pa.bool_()),
    "int32_col":  pa.array([1, 2, 3], type=pa.int32()),
    "int64_col":  pa.array([10, 20, 30], type=pa.int64()),
    "float_col":  pa.array([1.1, 2.2, 3.3], type=pa.float32()),
    "double_col": pa.array([1.11, 2.22, 3.33], type=pa.float64()),
    "bytes_col":  pa.array([b"hello", b"world", b"!"], type=pa.binary()),
    "fixed_col":  pa.array([b"ab", b"cd", b"ef"], type=pa.binary(2)),  # FIXED_LEN_BYTE_ARRAY(2)
})

buf = io.BytesIO()
pq.write_table(table_phys, buf)
buf.seek(0)

pf = pq.ParquetFile(buf)
arrow_schema = pf.schema_arrow  # Arrow schema reconstructed from the Parquet schema
print("Arrow schema from file:")
print(arrow_schema)
print()

# The native Parquet schema (physical + logical annotation)
print("Parquet schema (physical types):")
print(pf.schema)

---

## 2. Logical type annotations

> **Spec:** [Logical Types](https://parquet.apache.org/docs/file-format/types/logicaltypes/)

A logical type wraps a physical type with semantic meaning.
For example, `STRING` annotates `BYTE_ARRAY` to indicate UTF-8 encoding.
This allows a `BYTE_ARRAY` column to be interpreted as text rather than raw bytes.

Common logical type annotations:

| Logical type | Physical type | PyArrow type |
|-------------|--------------|---------------|
| `STRING` | `BYTE_ARRAY` | `pa.string()` / `pa.utf8()` |
| `DATE` | `INT32` | `pa.date32()` |
| `TIME(MILLIS)` | `INT32` | `pa.time32('ms')` |
| `TIMESTAMP(MICROS, UTC)` | `INT64` | `pa.timestamp('us', tz='UTC')` |
| `DECIMAL(p, s)` | `FIXED_LEN_BYTE_ARRAY` (PyArrow default; the spec also allows `INT32`, `INT64`, `BYTE_ARRAY`) | `pa.decimal128(p, s)` |
| `UUID` | `FIXED_LEN_BYTE_ARRAY(16)` | `pa.binary(16)` |

In [None]:
import decimal

table_logical = pa.table({
    "name":      pa.array(["Alice", "Bob", "Carol"], type=pa.utf8()),
    "dob":       pa.array([datetime.date(1990, 1, 1), datetime.date(1985, 6, 15), datetime.date(2000, 12, 31)]),
    "joined_at": pa.array(
        [datetime.datetime(2020, 1, 1, tzinfo=datetime.timezone.utc),
         datetime.datetime(2021, 3, 10, tzinfo=datetime.timezone.utc),
         datetime.datetime(2022, 7, 20, tzinfo=datetime.timezone.utc)]
    ),
    "balance":   pa.array([decimal.Decimal("10.50"), decimal.Decimal("200.00"), decimal.Decimal("0.99")],
                           type=pa.decimal128(10, 2)),
})

buf2 = io.BytesIO()
pq.write_table(table_logical, buf2)
buf2.seek(0)

pf2 = pq.ParquetFile(buf2)
print("Arrow schema (logical types):")
print(pf2.schema_arrow)
print()
print("Parquet schema (physical + logical annotations):")
print(pf2.schema)

---

## 3. Round-trip type fidelity

> **Spec:** [Logical Types: TIMESTAMP](https://parquet.apache.org/docs/file-format/types/logicaltypes/)

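By default, PyArrow embeds the Arrow schema in the Parquet file's key-value metadata under the
`ARROW:schema` key, as a base64-encoded Arrow IPC schema message (a Flatbuffers structure).
The separate `pandas` key, when present, holds JSON metadata for DataFrame round trips, not the Arrow schema.
The reader uses the embedded schema to restore Arrow types even when the
Parquet logical type system doesn't have a 1:1 mapping for every Arrow type.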

In [None]:
buf2.seek(0)
result = pq.read_table(buf2)

print("Original schema:")
print(table_logical.schema)
print()
print("Round-tripped schema:")
print(result.schema)
print()
print("Schemas equal:", table_logical.schema.equals(result.schema))
print()
print("Sample row:")
for name in result.schema.names:
    print(f"  {name}: {result.column(name)[0].as_py()!r} ({result.schema.field(name).type})")

---

## 4. Unsigned integers and extended types

> **Spec:** [Logical Types: INTEGER](https://parquet.apache.org/docs/file-format/types/logicaltypes/)

Parquet's `INT32` and `INT64` physical types are always signed.
The `INT(bitWidth, isSigned)` logical type annotation records the intended width and signedness,
which is how unsigned integers (`uint8`, `uint16`, `uint32`, `uint64` in Arrow) are represented.

In [None]:
table_uint = pa.table({
    "u8":  pa.array([0, 128, 255], type=pa.uint8()),
    "u16": pa.array([0, 1000, 65535], type=pa.uint16()),
    "u32": pa.array([0, 1_000_000, 4_294_967_295], type=pa.uint32()),
    "u64": pa.array([0, 1, 2**64 - 1], type=pa.uint64()),
})

buf3 = io.BytesIO()
pq.write_table(table_uint, buf3)
buf3.seek(0)

pf3 = pq.ParquetFile(buf3)
print("Parquet schema for unsigned integer columns:")
print(pf3.schema)
print()
print("Arrow schema (unsigned types preserved):")
print(pf3.schema_arrow)

---

## Summary

| Concept | Key point |
|---------|----------|
| Physical types | 7 raw binary representations in current use: BOOLEAN, INT32, INT64, FLOAT, DOUBLE, BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY (plus the deprecated INT96) |
| Logical types | Annotate physical types with semantic meaning: STRING, DATE, TIMESTAMP, DECIMAL … |
| Arrow schema embedding | Arrow schema serialized under the `ARROW:schema` key in key-value metadata, so Arrow types can be restored on read |
| `pf.schema` | Native Parquet schema (physical + logical annotation) |
| `pf.schema_arrow` | Arrow schema reconstructed from Parquet schema + embedded metadata |

**Next ⏭️** [Encoding](05_encodings.ipynb)