# Arrow C Data Interface: PyCapsule Protocol

> **Level:** Advanced  
> **Spec:** [C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html) · [PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html) · [C Stream Interface](https://arrow.apache.org/docs/format/CStreamInterface.html)

## What you will learn

1. The goals and non-goals of the C Data Interface
2. The two C structs: `ArrowSchema` and `ArrowArray`
3. Memory management: `release` callbacks, moved vs released state
4. The Python PyCapsule protocol: `__arrow_c_schema__` / `__arrow_c_array__` / `__arrow_c_stream__`
5. Demonstrating zero-copy sharing between two PyArrow objects
6. Verifying buffer identity (same physical memory, no copy)
7. Interoperability with non-PyArrow libraries via the capsule protocol

---
## 1. Rationale: Why the C Data Interface?

> **Spec:** [Motivation](https://arrow.apache.org/docs/format/CDataInterface.html#motivation)

Problem: two libraries in the same Python process each ship their own Arrow implementation.  
They cannot hand Arrow data to each other without either:

- Introducing a compile-time or runtime dependency on each other, OR
- Serializing through the IPC format (FlatBuffers metadata plus a memory copy)

**C Data Interface solution:**
- ~30 lines of C struct definitions. Copy-pasteable into any project
- Zero-copy: pass pointer to the existing buffer
- ABI-stable: struct layout is frozen
- No Flatbuffers, no serialization

### Goals vs Non-goals
| Goals | Non-goals |
|-------|-----------|
| ABI-stable C interface | C API for computation |
| Zero-copy same-process sharing | Cross-process sharing |
| No dependency on Arrow libraries | Data persistence |
| Easy to copy into other codebases | Compression |

---
## 2. The Two C Structs

> **Spec:** [ArrowSchema structure](https://arrow.apache.org/docs/format/CDataInterface.html#the-arrowschema-structure) · [ArrowArray structure](https://arrow.apache.org/docs/format/CDataInterface.html#the-arrowarray-structure)

```c
struct ArrowSchema {
  const char* format;        // type format string, e.g. "i" = int32
  const char* name;          // field name
  const char* metadata;      // binary metadata (key/value pairs)
  int64_t     flags;         // NULLABLE, DICTIONARY_ORDERED, MAP_KEYS_SORTED
  int64_t     n_children;
  struct ArrowSchema** children;
  struct ArrowSchema*  dictionary;
  void (*release)(struct ArrowSchema*);  // MUST be NULL when released
  void* private_data;
};

struct ArrowArray {
  int64_t  length;
  int64_t  null_count;       // -1 if not yet computed
  int64_t  offset;           // logical offset into buffers
  int64_t  n_buffers;
  int64_t  n_children;
  const void** buffers;      // pointers to actual data buffers
  struct ArrowArray** children;
  struct ArrowArray*  dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};
```
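
The two structs can also be mirrored from Python with `ctypes`, which is a convenient way to check the layout without compiling anything. This is a minimal sketch, not PyArrow's own binding: the class and type names are chosen here for illustration, and `metadata` is declared as `void*` because it is a binary blob rather than a C string.

```python
import ctypes


class ArrowSchema(ctypes.Structure):
    pass  # fields are attached below because the struct refers to itself


class ArrowArray(ctypes.Structure):
    pass


# Function-pointer types for the two release callbacks.
SchemaRelease = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowSchema))
ArrayRelease = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))

ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_void_p),  # binary, not NUL-terminated
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", SchemaRelease),
    ("private_data", ctypes.c_void_p),
]

ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ArrayRelease),
    ("private_data", ctypes.c_void_p),
]

print("sizeof(ArrowSchema):", ctypes.sizeof(ArrowSchema))
print("sizeof(ArrowArray): ", ctypes.sizeof(ArrowArray))
```

On a typical 64-bit platform this should report 72 and 80 bytes: nine and ten 8-byte members respectively, with no padding.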

---
## 3. Format Strings

> **Spec:** [Data Type Description Format Strings](https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings)

The type is communicated as a null-terminated C string. Key examples:

| Format | Arrow type |
|--------|-----------|
| `n` | null |
| `b` | boolean |
| `c` / `C` | int8 / uint8 |
| `s` / `S` | int16 / uint16 |
| `i` / `I` | int32 / uint32 |
| `l` / `L` | int64 / uint64 |
| `f` / `g` | float32 / float64 |
| `u` / `U` | utf-8 / large utf-8 |
| `z` / `Z` | binary / large binary |
| `d:19,10` | decimal128(precision=19, scale=10) |
| `w:16` | FixedSizeBinary(16) |
| `+s` | struct |
| `+l` / `+L` | list / large list |
| `+ud:0,1` | dense union with type ids 0,1 |
| `+r` | run-end encoded |
| `tsu:UTC` | timestamp[us, UTC] |
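
To make the table concrete, here is a small decoder for the format strings listed above. It is only a sketch: `describe_format` is a hypothetical helper, it covers just the rows of the table plus the optional decimal bit width, and the wording of the returned descriptions is chosen here rather than taken from any library.

```python
# Format strings that map one-to-one onto a type name.
SIMPLE_FORMATS = {
    "n": "null", "b": "boolean",
    "c": "int8", "C": "uint8",
    "s": "int16", "S": "uint16",
    "i": "int32", "I": "uint32",
    "l": "int64", "L": "uint64",
    "f": "float32", "g": "float64",
    "u": "utf-8", "U": "large utf-8",
    "z": "binary", "Z": "large binary",
    "+s": "struct", "+l": "list", "+L": "large list",
    "+r": "run-end encoded",
}

# Third character of a "ts?:" timestamp format selects the unit.
TIMESTAMP_UNITS = {"s": "s", "m": "ms", "u": "us", "n": "ns"}


def describe_format(fmt):
    """Return a human-readable description of a C Data Interface format string."""
    if fmt in SIMPLE_FORMATS:
        return SIMPLE_FORMATS[fmt]
    if fmt.startswith("d:"):
        parts = fmt[2:].split(",")
        precision, scale = parts[0], parts[1]
        # An optional third component gives the bit width; 128 is the default.
        bitwidth = parts[2] if len(parts) == 3 else "128"
        return f"decimal{bitwidth}(precision={precision}, scale={scale})"
    if fmt.startswith("w:"):
        return f"fixed_size_binary({fmt[2:]})"
    if fmt.startswith("+ud:"):
        return f"dense union (type ids {fmt[4:]})"
    if len(fmt) >= 4 and fmt.startswith("ts") and fmt[3] == ":":
        unit = TIMESTAMP_UNITS[fmt[2]]
        timezone = fmt[4:]
        return f"timestamp[{unit}, {timezone}]" if timezone else f"timestamp[{unit}]"
    raise ValueError(f"format string not covered by this sketch: {fmt!r}")


for fmt in ["i", "d:19,10", "w:16", "+ud:0,1", "tsu:UTC"]:
    print(f"{fmt:<10} {describe_format(fmt)}")
```

Parameterized types put their parameters after a colon, which is why the decoder checks prefixes instead of doing a plain dictionary lookup for them.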

In [23]:
import pyarrow as pa

types = [
    ('int32',         pa.int32()),
    ('float64',       pa.float64()),
    ('utf8',          pa.utf8()),
    ('binary(16)',    pa.binary(16)),
    ('list(int8)',    pa.list_(pa.int8())),
    ('struct',        pa.struct([('a', pa.int32()), ('b', pa.float32())])),
    ('timestamp[us]', pa.timestamp('us', tz='UTC')),
]

for name, t in types:
    arr = pa.array([], type=t)
    # __arrow_c_array__ returns two PyCapsules (schema, array), not raw pointers
    schema_capsule, array_capsule = arr.__arrow_c_array__()
    print(f'{name:<22} ➡️ type={t}')
    print(f'  capsule: {schema_capsule}')

int32                  ➡️ type=int32
  capsule: <capsule object "arrow_schema" at 0x7fc701f27380>
float64                ➡️ type=double
  capsule: <capsule object "arrow_schema" at 0x7fc701f10270>
utf8                   ➡️ type=string
  capsule: <capsule object "arrow_schema" at 0x7fc701f27380>
binary(16)             ➡️ type=fixed_size_binary[16]
  capsule: <capsule object "arrow_schema" at 0x7fc602482a70>
list(int8)             ➡️ type=list<item: int8>
  capsule: <capsule object "arrow_schema" at 0x7fc701f10270>
struct                 ➡️ type=struct<a: int32, b: float>
  capsule: <capsule object "arrow_schema" at 0x7fc701f27380>
timestamp[us]          ➡️ type=timestamp[us, tz=UTC]
  capsule: <capsule object "arrow_schema" at 0x7fc6024839c0>


---
## 4. The PyCapsule Protocol

> **Spec:** [PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html)  
> **PyArrow API:** [`pa.Array._import_from_c_capsule`](https://arrow.apache.org/docs/python/generated/pyarrow.Array.html)

Python builds on top of the C Data Interface using **PyCapsules**: opaque Python objects that each wrap a C pointer.

Three dunder methods:

| Method | Returns | Contains |
|--------|---------|----------|
| `__arrow_c_schema__()` | `PyCapsule` (name=`"arrow_schema"`) | `ArrowSchema*` |
| `__arrow_c_array__(requested_schema=None)` | `(PyCapsule, PyCapsule)` (names=`"arrow_schema"`, `"arrow_array"`) | `(ArrowSchema*, ArrowArray*)` |
| `__arrow_c_stream__(requested_schema=None)` | `PyCapsule` (name=`"arrow_array_stream"`) | `ArrowArrayStream*` |

Any library that implements these methods can exchange Arrow data with any other library with **zero copies and no shared dependency**.
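
The capsule name is what lets a consumer tell the three pointer kinds apart before dereferencing anything. The sketch below uses CPython's capsule C API through `ctypes.pythonapi` to show that check in isolation. It assumes CPython, the 72-byte payload is only a stand-in for a real `ArrowSchema`, and `get_schema_pointer` is a hypothetical helper. A real producer would also register a capsule destructor that calls `release` if the struct was never consumed, which is omitted here.

```python
import ctypes

api = ctypes.pythonapi  # CPython's own C API

api.PyCapsule_New.restype = ctypes.py_object
api.PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
api.PyCapsule_IsValid.restype = ctypes.c_int
api.PyCapsule_IsValid.argtypes = [ctypes.py_object, ctypes.c_char_p]
api.PyCapsule_GetPointer.restype = ctypes.c_void_p
api.PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]

# The capsule stores the name pointer without copying it, so the bytes object
# must stay alive for as long as the capsule does.
SCHEMA_CAPSULE_NAME = b"arrow_schema"

# Stand-in for a real ArrowSchema struct (assumption: 72 bytes on 64-bit).
payload = ctypes.create_string_buffer(72)

# No destructor is registered (third argument is NULL) in this toy example.
capsule = api.PyCapsule_New(ctypes.addressof(payload), SCHEMA_CAPSULE_NAME, None)


def get_schema_pointer(obj):
    """Return the wrapped address, refusing anything not named 'arrow_schema'."""
    if not api.PyCapsule_IsValid(obj, SCHEMA_CAPSULE_NAME):
        raise TypeError("expected a PyCapsule named 'arrow_schema'")
    return api.PyCapsule_GetPointer(obj, SCHEMA_CAPSULE_NAME)


print(type(capsule).__name__)
print(get_schema_pointer(capsule) == ctypes.addressof(payload))
```

Passing a capsule with any other name, or an object that is not a capsule at all, makes `get_schema_pointer` raise instead of returning an address.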

In [24]:
import ctypes

original = pa.array([10, 20, 30, 40, 50], type=pa.int32())
print('=== Producer side ===')
print('Original array:', original.to_pylist())
orig_buf_ptr = original.buffers()[1].address
print(f'Original values buffer address: 0x{orig_buf_ptr:016x}')

schema_cap, array_cap = original.__arrow_c_array__()
print(f'\nCapsule type : {type(schema_cap)}')

=== Producer side ===
Original array: [10, 20, 30, 40, 50]
Original values buffer address: 0x00007fc602b00680

Capsule type : <class 'PyCapsule'>


In [25]:
print('=== Consumer side ===')
consumed = pa.Array._import_from_c_capsule(schema_cap, array_cap)
print('Consumed array:', consumed.to_pylist())
cons_buf_ptr = consumed.buffers()[1].address
print(f'Consumed values buffer address: 0x{cons_buf_ptr:016x}')
print(f'\nSame physical buffer? {orig_buf_ptr == cons_buf_ptr}')
print('➡️ True means ZERO COPY: no data was duplicated!')

=== Consumer side ===
Consumed array: [10, 20, 30, 40, 50]
Consumed values buffer address: 0x00007fc602b00680

Same physical buffer? True
➡️ True means ZERO COPY: no data was duplicated!


---
## 5. Move Semantics: One Live Copy at a Time

> **Spec:** [Memory Management](https://arrow.apache.org/docs/format/CDataInterface.html#memory-management)

The C Data Interface defines **ownership transfer** by convention (nothing in C enforces it):
1. Producer fills the `ArrowArray` struct, including its `release` callback
2. Consumer moves it by making a bitwise copy of the struct
3. The source struct is then marked **released** (`release = NULL`) without calling the callback
4. Consumer eventually calls `release()` on its copy, exactly once

Key rule: **`release` must never be called on an already-released struct** (where `release == NULL`).
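
The four steps can be simulated in pure Python with `ctypes`. This is a toy sketch under stated assumptions: the struct carries no real buffers, and `move`, `is_released`, and `release_array` are names invented for the illustration.

```python
import ctypes


class ArrowArray(ctypes.Structure):
    pass


ReleaseFn = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))
ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ReleaseFn),
    ("private_data", ctypes.c_void_p),
]

release_calls = []  # records the length of every array that gets released


@ReleaseFn
def release_array(array_ptr):
    """Producer-side callback: free resources, then mark the struct released."""
    release_calls.append(array_ptr.contents.length)
    array_ptr.contents.release = ReleaseFn()  # NULL function pointer


def is_released(array):
    return not array.release  # a NULL function pointer is falsy


def move(src):
    """Bitwise-copy src into a new struct and mark src released (no callback)."""
    dst = ArrowArray()
    ctypes.memmove(ctypes.addressof(dst), ctypes.addressof(src),
                   ctypes.sizeof(ArrowArray))
    src.release = ReleaseFn()
    return dst


produced = ArrowArray(length=3, null_count=0, release=release_array)  # step 1
consumed = move(produced)                                             # steps 2-3
state_after_move = (is_released(produced), is_released(consumed))

if not is_released(consumed):                                         # step 4
    consumed.release(ctypes.byref(consumed))

print(state_after_move, release_calls)
```

After the move only the copy is live, and the callback runs once, on the copy. Checking `is_released` before calling `release` is what guards against the double release described above.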

In [26]:
prod = pa.array([100, 200, 300], type=pa.int64())
schema_cap2, array_cap2 = prod.__arrow_c_array__()
result = pa.Array._import_from_c_capsule(schema_cap2, array_cap2)
print('Transferred array:', result.to_pylist())

try:
    result2 = pa.Array._import_from_c_capsule(schema_cap2, array_cap2)
    print('ERROR: should have failed')
except Exception as e:
    print(f'Expected error when re-importing: {type(e).__name__}: {e}')

Transferred array: [100, 200, 300]
Expected error when re-importing: ArrowInvalid: Cannot import released ArrowSchema


---
## 6. Arrow C Stream Interface

> **Spec:** [Arrow C Stream Interface](https://arrow.apache.org/docs/format/CStreamInterface.html)  
> **PyArrow API:** [`pa.RecordBatchReader._import_from_c_capsule`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html)

The stream interface exposes a sequence of array chunks that all share one schema. The consumer pulls them one at a time through callbacks:

```c
struct ArrowArrayStream {
  int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema*);
  int (*get_next  )(struct ArrowArrayStream*, struct ArrowArray*);
  const char* (*get_last_error)(struct ArrowArrayStream*);
  void (*release)(struct ArrowArrayStream*);
  void* private_data;
};
```

In Python: `__arrow_c_stream__()` returns a single capsule.
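
The pull loop a consumer runs against these callbacks can be sketched with `ctypes`. This is a toy model under several assumptions: chunks carry only a length, `get_schema` and `get_last_error` are left unset, the `ArrowSchema*` parameter is declared as `void*`, and all callback names are invented here. The one protocol detail it relies on is that a successful `get_next` that yields a released array marks the end of the stream.

```python
import ctypes


class ArrowArray(ctypes.Structure):
    pass


ArrayRelease = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))
ArrowArray._fields_ = [
    ("length", ctypes.c_int64), ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64), ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ArrayRelease), ("private_data", ctypes.c_void_p),
]


class ArrowArrayStream(ctypes.Structure):
    pass


StreamPtr = ctypes.POINTER(ArrowArrayStream)
GetSchema = ctypes.CFUNCTYPE(ctypes.c_int, StreamPtr, ctypes.c_void_p)
GetNext = ctypes.CFUNCTYPE(ctypes.c_int, StreamPtr, ctypes.POINTER(ArrowArray))
GetLastError = ctypes.CFUNCTYPE(ctypes.c_char_p, StreamPtr)
StreamRelease = ctypes.CFUNCTYPE(None, StreamPtr)
ArrowArrayStream._fields_ = [
    ("get_schema", GetSchema), ("get_next", GetNext),
    ("get_last_error", GetLastError), ("release", StreamRelease),
    ("private_data", ctypes.c_void_p),
]

pending_lengths = [2, 3]  # producer state: two chunks left to hand out
released_lengths = []


@ArrayRelease
def release_chunk(array_ptr):
    released_lengths.append(array_ptr.contents.length)
    array_ptr.contents.release = ArrayRelease()


@GetNext
def get_next(stream_ptr, out_ptr):
    out = out_ptr.contents
    if pending_lengths:
        out.length = pending_lengths.pop(0)
        out.release = release_chunk
    else:
        out.release = ArrayRelease()  # released array signals end of stream
    return 0  # 0 means success


@StreamRelease
def release_stream(stream_ptr):
    stream_ptr.contents.release = StreamRelease()


stream = ArrowArrayStream(get_next=get_next, release=release_stream)

# Consumer: pull chunks until a released one comes back, releasing each in turn.
chunk_lengths = []
while True:
    chunk = ArrowArray()
    status = stream.get_next(ctypes.byref(stream), ctypes.byref(chunk))
    if status != 0:
        raise RuntimeError(f"get_next failed with error code {status}")
    if not chunk.release:
        break
    chunk_lengths.append(chunk.length)
    chunk.release(ctypes.byref(chunk))

stream.release(ctypes.byref(stream))
print(chunk_lengths)
```

Each chunk is released independently of the stream, and the stream itself is released once at the end.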

In [27]:
table = pa.table({
    'x': pa.chunked_array([pa.array([1, 2]), pa.array([3, 4, 5])]),
    'y': pa.chunked_array([pa.array([1.1, 2.2]), pa.array([3.3, 4.4, 5.5])]),
})

print('Has __arrow_c_stream__:', hasattr(table, '__arrow_c_stream__'))
stream_cap = table.__arrow_c_stream__()
reader = pa.RecordBatchReader._import_from_c_capsule(stream_cap)

print('Batches from stream:')
for i, rb in enumerate(reader):
    print(f'  batch {i}: {rb.to_pydict()}')

Has __arrow_c_stream__: True
Batches from stream:
  batch 0: {'x': [1, 2], 'y': [1.1, 2.2]}
  batch 1: {'x': [3, 4, 5], 'y': [3.3, 4.4, 5.5]}


---
## 7. Interoperability Across Libraries

> **Spec:** [PyCapsule Interface: Consumer API](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#arrow-pycapsule-interface)

Libraries that support the PyCapsule protocol today:
- **[PyArrow](https://arrow.apache.org/docs/python/api.html)** (producer + consumer)
- **[Polars](https://docs.pola.rs/api/python/stable/reference/series/arrow.html)** (`Series.__arrow_c_stream__`)
- **[DuckDB](https://duckdb.org/docs/guides/python/arrow.html)** (`duckdb.DuckDBPyRelation` ➡️ capsule)
- **[nanoarrow](https://arrow.apache.org/nanoarrow/latest/)** (lightweight C reference impl)
- **[cuDF](https://docs.rapids.ai/api/cudf/stable/)** (GPU DataFrame library)
- **[pandas](https://pandas.pydata.org/docs/user_guide/arrow.html)** (Arrow-backed arrays via `ArrowDtype`)

In [28]:
class MyArrowProducer:
    """Hypothetical other-library object exposing the capsule protocol."""
    def __init__(self, data, arrow_type):
        self._array = pa.array(data, type=arrow_type)
    def __arrow_c_array__(self, requested_schema=None):
        return self._array.__arrow_c_array__(requested_schema)
    def __arrow_c_schema__(self):
        return self._array.__arrow_c_schema__()

producer = MyArrowProducer([1.5, 2.5, 3.5, 4.5], pa.float64())
print('Producer class  :', type(producer).__name__)
print('Is PyArrow Array:', isinstance(producer, pa.Array))

schema_cap, array_cap = producer.__arrow_c_array__()
consumed = pa.Array._import_from_c_capsule(schema_cap, array_cap)
print('Consumed by PyArrow:', consumed.to_pylist())

Producer class  : MyArrowProducer
Is PyArrow Array: False
Consumed by PyArrow: [1.5, 2.5, 3.5, 4.5]
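
A consumer normally detects the protocol by looking for the dunder methods rather than by checking the object's type. A minimal sketch of that check follows. `supported_arrow_protocols` and `ThirdPartyColumn` are hypothetical names, and the placeholder methods only exist so that the attribute lookup has something to find.

```python
ARROW_PROTOCOL_METHODS = (
    "__arrow_c_schema__",
    "__arrow_c_array__",
    "__arrow_c_stream__",
)


def supported_arrow_protocols(obj):
    """Return the PyCapsule protocol methods that obj exposes."""
    return [
        name
        for name in ARROW_PROTOCOL_METHODS
        if callable(getattr(obj, name, None))
    ]


class ThirdPartyColumn:
    """Hypothetical object from another library; the bodies are placeholders."""

    def __arrow_c_schema__(self):
        raise NotImplementedError("placeholder for illustration")

    def __arrow_c_array__(self, requested_schema=None):
        raise NotImplementedError("placeholder for illustration")


print(supported_arrow_protocols(ThirdPartyColumn()))
print(supported_arrow_protocols([1, 2, 3]))
```

The first call reports the schema and array methods, and the plain list reports none, so a consumer can fall back to another conversion path for it.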


---
## 8. Record Batches as Struct Arrays

> **Spec:** [Record Batches](https://arrow.apache.org/docs/format/CDataInterface.html#record-batches)  
> **PyArrow API:** [`pa.RecordBatch._import_from_c_capsule`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html)

The spec says: **a record batch is equivalent to a top-level struct array**.  
The `ArrowSchema.metadata` of the top-level struct carries schema-level metadata.
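
As a pure-Python model of that equivalence, the schema of a two-column batch can be pictured as a `+s` node with one child per column. `SchemaNode` and `render` are illustrative names, not part of any library, and the column types (`l` for int64, `u` for utf-8) are an assumption about what a batch with an integer column `a` and a string column `b` would export.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaNode:
    """Simplified stand-in for ArrowSchema: format, name, and children only."""
    format: str
    name: str = ""
    children: list = field(default_factory=list)


# A record batch with columns a (int64) and b (utf-8), seen as a struct.
batch_schema = SchemaNode("+s", "", [SchemaNode("l", "a"), SchemaNode("u", "b")])


def render(node, depth=0):
    """Return one indented 'name: format' line per node, depth first."""
    lines = [f"{'  ' * depth}{node.name or '<root>'}: {node.format}"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


print("\n".join(render(batch_schema)))
```

The columns of the batch are exactly the children of the root struct, which is why the same pair of capsules can carry either an array or a record batch.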

In [29]:
batch = pa.record_batch({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
schema_cap, array_cap = batch.__arrow_c_array__()
rb2 = pa.RecordBatch._import_from_c_capsule(schema_cap, array_cap)
print('Round-tripped RecordBatch:')
print(rb2.to_pandas())

Round-tripped RecordBatch:
   a  b
0  1  x
1  2  y
2  3  z


---
## Summary

| Concept | Detail |
|---------|--------|
| `ArrowSchema` | Type description: format string, flags, children, dictionary, release |
| `ArrowArray` | Data: length, null_count, offset, buffers[], children[], release |
| `release = NULL` | Released state: consumers must check before reading |
| Move semantics | Bitwise copy + mark source released, only one live copy |
| PyCapsule | Python wrapping of the C pointer, named `"arrow_schema"` / `"arrow_array"` / `"arrow_array_stream"` |
| `__arrow_c_array__` | Zero-copy export: same buffer, no serialization |
| Record batch | Top-level `+s` struct array |

**References:**
- [C Data Interface specification](https://arrow.apache.org/docs/format/CDataInterface.html)
- [PyCapsule Interface specification](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html)
- [C Stream Interface specification](https://arrow.apache.org/docs/format/CStreamInterface.html)

**Next ⏭️** [IPC](08_ipc_streaming_file.ipynb)