# Extension Types

> **Level:** Advanced  
> **Spec:** [Extension types](https://arrow.apache.org/docs/format/Columnar.html#extension-types)  
> **PyArrow docs:** [Extension types](https://arrow.apache.org/docs/python/extending_types.html)

## What you will learn

1. What extension types are and why they matter
2. Defining a `UuidType` backed by `FixedSizeBinary(16)`
3. Parameterized extension types with JSON metadata
4. IPC round-trip: metadata keys `ARROW:extension:name` / `ARROW:extension:metadata`
5. Schema and Field level metadata

In [1]:
import uuid
import json
import io
import pyarrow as pa
import pyarrow.ipc as ipc

---
## 1. What are Extension Types?

> **Spec:** [Extension types specification](https://arrow.apache.org/docs/format/Columnar.html#extension-types)

Extension types let you annotate a **storage type** with a **semantic name and metadata**.
The memory layout is identical to that of the storage type, so consumers that don't recognize the extension
can still read the data as a valid Arrow array of the storage type.

Two reserved keys in the field metadata carry the extension information:
- `ARROW:extension:name`: unique name identifying the type (e.g. `"my_org.uuid"`)
- `ARROW:extension:metadata`: serialized parameters needed to reconstruct the type (empty for types without parameters)

In [2]:
class UuidType(pa.ExtensionType):
    """UUID stored as 16 bytes (FixedSizeBinary)."""

    def __init__(self):
        super().__init__(pa.binary(16), 'my_org.uuid')

    def __arrow_ext_serialize__(self):
        return b''  # no parameters

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()


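# Register once per process so readers can reconstruct UuidType from its name.
# Registering the same name a second time raises pa.ArrowKeyError.
pa.register_extension_type(UuidType())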

# Create an array of UUIDs
uuids = [uuid.uuid4().bytes for _ in range(4)]
storage = pa.array(uuids, type=pa.binary(16))
uuid_arr = pa.ExtensionArray.from_storage(UuidType(), storage)

print('Type       :', uuid_arr.type)
print('Storage    :', uuid_arr.type.storage_type)
print('Array[0]   :', uuid.UUID(bytes=uuid_arr[0].as_py()))

Type       : extension<my_org.uuid<UuidType>>
Storage    : fixed_size_binary[16]
Array[0]   : 537ec064-62b3-4396-9444-cc8b29de8191


---
## 2. Parameterized Extension Type

> **Spec:** [Extension type metadata](https://arrow.apache.org/docs/format/Columnar.html#extension-types)

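A parameterized extension type serializes its parameters in `__arrow_ext_serialize__` (here as JSON). The resulting bytes are stored in `ARROW:extension:metadata` and passed back to `__arrow_ext_deserialize__` to rebuild the type.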

In [3]:
class JsonStringType(pa.ExtensionType):
    """Utf8 string annotated as holding JSON (the values are not validated here)."""

    def __init__(self, schema_hint=None):
        self._schema_hint = schema_hint
        super().__init__(pa.utf8(), 'my_org.json_string')

    def __arrow_ext_serialize__(self):
        return json.dumps({'schema_hint': self._schema_hint}).encode()

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        params = json.loads(serialized)
        return cls(schema_hint=params.get('schema_hint'))


pa.register_extension_type(JsonStringType())

json_type = JsonStringType(schema_hint={'type': 'object', 'properties': {'x': {'type': 'integer'}}})
raw_strings = pa.array(['{"x": 1}', '{"x": 2}', None], type=pa.utf8())
json_arr = pa.ExtensionArray.from_storage(json_type, raw_strings)
print('Type         :', json_arr.type)
print('Metadata repr:', json_arr.type.__arrow_ext_serialize__())

Type         : extension<my_org.json_string<JsonStringType>>
Metadata repr: b'{"schema_hint": {"type": "object", "properties": {"x": {"type": "integer"}}}}'


---
## Summary

| Concept | Detail |
|---------|--------|
| Extension type | Semantic name + metadata on top of a storage type |
| `ARROW:extension:name` | Unique identifier, stored in field metadata |
| `ARROW:extension:metadata` | Serialized parameters (often JSON) |
| `pa.register_extension_type` | Makes the custom type known to PyArrow |

**Next ⏭️** [C Data Interface via CTypes](06_c_data_interface_ctypes.ipynb): zero-copy sharing via CTypes