# Arrow C Data Interface: Low-Level ctypes Deep Dive

> **Level:** Expert  
> **Spec:** [C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html) · [Memory management](https://arrow.apache.org/docs/format/CDataInterface.html#memory-management) · [ArrowSchema](https://arrow.apache.org/docs/format/CDataInterface.html#the-arrowschema-structure) · [ArrowArray](https://arrow.apache.org/docs/format/CDataInterface.html#the-arrowarray-structure)  
> **Prerequisites:** Notebooks 01–07, familiarity with C memory management

This notebook implements the Arrow C Data Interface **from scratch** using only Python's `ctypes` module. You will:

1. Define `ArrowSchema` and `ArrowArray` as `ctypes.Structure`
2. Implement a **producer** that allocates buffers with `malloc`
3. Write a proper **release callback** (C function pointer)
4. Implement a **consumer** that reads raw buffer pointers
5. Hand the filled structs to PyArrow and confirm zero-copy import
6. Build a `struct<float32, utf8>` producer matching the spec example
7. Understand **move semantics** and the `private_data` bookkeeping pattern

In [1]:
import ctypes
import ctypes.util  # find_library() is needed when loading libc on macOS/Windows
import sys
import pyarrow as pa

---
## 1. Defining the C Structs in ctypes

> **Spec:** [ArrowSchema fields](https://arrow.apache.org/docs/format/CDataInterface.html#the-arrowschema-structure) · [ArrowArray fields](https://arrow.apache.org/docs/format/CDataInterface.html#the-arrowarray-structure)

Self-referential structs require a **two-pass** definition in ctypes:  
declare the class first, then fill `_fields_` separately.

In [2]:
class ArrowSchema(ctypes.Structure): pass
class ArrowArray(ctypes.Structure):  pass

release_schema_fn = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowSchema))
release_array_fn  = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))

NULL_RELEASE_SCHEMA = ctypes.cast(None, release_schema_fn)
NULL_RELEASE_ARRAY  = ctypes.cast(None, release_array_fn)

def make_schema_release(label='schema'):
    @release_schema_fn
    def _release(schema_ptr):
        schema = schema_ptr.contents
        schema.release = NULL_RELEASE_SCHEMA
        print(f'  [release_schema:{label}] called')
    return _release

def make_noop_array_release(label='array'):
    @release_array_fn
    def _release(arr_ptr):
        arr = arr_ptr.contents
        arr.release = NULL_RELEASE_ARRAY
        print(f'  [release_array:{label}] called')
    return _release

ArrowSchema._fields_ = [
    ('format',       ctypes.c_char_p),
    ('name',         ctypes.c_char_p),
    ('metadata',     ctypes.c_char_p),
    ('flags',        ctypes.c_int64),
    ('n_children',   ctypes.c_int64),
    ('children',     ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ('dictionary',   ctypes.POINTER(ArrowSchema)),
    ('release',      release_schema_fn),
    ('private_data', ctypes.c_void_p),
]

ArrowArray._fields_ = [
    ('length',       ctypes.c_int64),
    ('null_count',   ctypes.c_int64),
    ('offset',       ctypes.c_int64),
    ('n_buffers',    ctypes.c_int64),
    ('n_children',   ctypes.c_int64),
    ('buffers',      ctypes.POINTER(ctypes.c_void_p)),
    ('children',     ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ('dictionary',   ctypes.POINTER(ArrowArray)),
    ('release',      release_array_fn),
    ('private_data', ctypes.c_void_p),
]

# Flag constants
ARROW_FLAG_DICTIONARY_ORDERED = 1
ARROW_FLAG_NULLABLE           = 2
ARROW_FLAG_MAP_KEYS_SORTED    = 4

print('ArrowSchema size:', ctypes.sizeof(ArrowSchema), 'bytes')
print('ArrowArray  size:', ctypes.sizeof(ArrowArray),  'bytes')

ArrowSchema size: 72 bytes
ArrowArray  size: 80 bytes


---
## 2. C Memory Allocation Helpers

> **Python docs:** [ctypes: A foreign function library for Python](https://docs.python.org/3/library/ctypes.html)

We use `libc.malloc` / `libc.free` to allocate raw memory, exactly as a C producer would.

In [3]:
if sys.platform == 'linux':
    libc = ctypes.CDLL('libc.so.6')
elif sys.platform == 'darwin':
    libc = ctypes.CDLL(ctypes.util.find_library('c'))
elif sys.platform == 'emscripten':
    # Pyodide / Emscripten: malloc/free/memcpy live in the main program namespace
    libc = ctypes.CDLL(None)
else:
    libc = ctypes.CDLL(ctypes.util.find_library('msvcrt'))

libc.malloc.restype  = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.restype    = None
libc.free.argtypes   = [ctypes.c_void_p]
libc.memcpy.restype  = ctypes.c_void_p
libc.memcpy.argtypes = [ctypes.c_void_p, ctypes.c_void_p, ctypes.c_size_t]
print('libc loaded:', libc)


libc loaded: <CDLL 'libc.so.6', handle 7f63df8f95b0 at 0x7f63ac162a50>
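
The helpers above are used without checking whether `malloc` succeeded. As a minimal sketch, assuming a POSIX system, the hypothetical helpers `checked_malloc` and `copy_to_c_heap` (neither is part of this notebook or of any library) show one way to fail loudly instead of passing a NULL pointer on to a consumer:

```python
import ctypes
import ctypes.util

# Assumption: POSIX. find_library('c') can return None in minimal containers;
# CDLL(None) then resolves symbols already loaded into the process,
# which include malloc and free.
_libc = ctypes.CDLL(ctypes.util.find_library('c'))
_libc.malloc.restype = ctypes.c_void_p
_libc.malloc.argtypes = [ctypes.c_size_t]
_libc.free.restype = None
_libc.free.argtypes = [ctypes.c_void_p]


def checked_malloc(nbytes):
    """Hypothetical helper: raise MemoryError instead of returning NULL."""
    ptr = _libc.malloc(max(nbytes, 1))  # malloc(0) may legally return NULL
    if not ptr:                         # a c_void_p restype maps NULL to None
        raise MemoryError(f'malloc({nbytes}) failed')
    return ptr


def copy_to_c_heap(data):
    """Copy a bytes object into freshly malloc'ed memory; return the address."""
    ptr = checked_malloc(len(data))
    ctypes.memmove(ptr, data, len(data))
    return ptr


addr = copy_to_c_heap(b'\x0a\x00\x00\x00')
round_trip = ctypes.string_at(addr, 4)  # read the bytes back from the C heap
_libc.free(addr)
```

Every address returned this way still has to be passed to `free` exactly once, normally from the release callback.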


---
## 3. Producer: Simple `int32` Array

> **Spec:** [Producing Arrow data](https://arrow.apache.org/docs/format/CDataInterface.html#producer-requirements)

Steps:
1. Allocate data buffer (4 bytes × N)
2. Fill `buffers[] = {NULL, data_ptr}` (no validity bitmap when null_count == 0)
3. Fill `ArrowSchema` with format `b"i"`
4. Fill `ArrowArray` with length, buffers, release callback

In [4]:
_kept_alive = []

def make_int32_array(values):
    n = len(values)
    data_buf = libc.malloc(4 * n)
    int_arr  = (ctypes.c_int32 * n)(*values)
    libc.memcpy(data_buf, int_arr, 4 * n)
    buffers = (ctypes.c_void_p * 2)(None, data_buf)

    @release_array_fn
    def release_array(arr_ptr):
        arr = arr_ptr.contents
        libc.free(data_buf)
        arr.release = NULL_RELEASE_ARRAY
        print('  [release_array:int32] called')

    release_schema = make_schema_release('int32')

    schema = ArrowSchema()
    schema.format     = b'i'
    schema.name       = b'x'
    schema.flags      = ARROW_FLAG_NULLABLE
    schema.n_children = 0
    schema.release    = release_schema

    array = ArrowArray()
    array.length     = n
    array.null_count = 0
    array.offset     = 0
    array.n_buffers  = 2
    array.n_children = 0
    array.buffers    = buffers
    array.release    = release_array

    _kept_alive.extend([int_arr, buffers, release_schema, release_array, schema, array])
    return schema, array

print('Producer function defined.')

Producer function defined.


In [5]:
schema_c, array_c = make_int32_array([10, 20, 30, 40, 50])

print('=== ArrowSchema ===')
print(f'  format     : {schema_c.format}')
print(f'  name       : {schema_c.name}')
print(f'  flags      : {schema_c.flags}  (NULLABLE={bool(schema_c.flags & ARROW_FLAG_NULLABLE)})')

print('\n=== ArrowArray ===')
print(f'  length     : {array_c.length}')
print(f'  null_count : {array_c.null_count}')
data_ptr = array_c.buffers[1]
print(f'  buffers[0] : {array_c.buffers[0]}  (null = no validity bitmap)')
print(f'  buffers[1] : 0x{data_ptr:016x}  (data)')

raw_values = (ctypes.c_int32 * array_c.length).from_address(data_ptr)
print(f'  Values via ctypes: {list(raw_values)}')

=== ArrowSchema ===
  format     : b'i'
  name       : b'x'
  flags      : 2  (NULLABLE=True)

=== ArrowArray ===
  length     : 5
  null_count : 0
  buffers[0] : None  (null = no validity bitmap)
  buffers[1] : 0x00005624a0087170  (data)
  Values via ctypes: [10, 20, 30, 40, 50]


---
## 4. Hand Off to PyArrow / Zero-Copy Import

> **PyArrow API:** [`pa.Array._import_from_c(array_ptr, schema_ptr)`](https://arrow.apache.org/docs/python/generated/pyarrow.Array.html#pyarrow.Array._import_from_c)

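`_import_from_c` accepts **raw integer addresses** (the lower-level API below PyCapsules).  
The import *moves* both structs into PyArrow, which then owns the `release` callbacks: the schema is released as soon as it has been converted to a PyArrow type (hence the `[release_schema:int32]` line in the output), while the array's callback runs once the last PyArrow reference to its buffers is dropped.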
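
To make the relationship between raw addresses and PyCapsules concrete, here is a small sketch that wraps an address in a capsule and unwraps it again using the CPython C API through `ctypes.pythonapi`. The capsule name `arrow_array` is the one used by the Arrow PyCapsule Interface; the helper names and the placeholder buffer are hypothetical, and no destructor is attached, which a real exporter would need.

```python
import ctypes

api = ctypes.pythonapi  # CPython C API, exposed through ctypes
api.PyCapsule_New.restype = ctypes.py_object
api.PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
api.PyCapsule_GetPointer.restype = ctypes.c_void_p
api.PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]

# The capsule stores this pointer rather than a copy of the string, so the
# bytes object must outlive every capsule created with it.
ARRAY_CAPSULE_NAME = b'arrow_array'


def address_to_capsule(addr):
    """Wrap a struct address in a capsule (hypothetical helper).

    No destructor is passed; a real exporter would supply one that calls
    the struct's release callback if the capsule is never consumed.
    """
    return api.PyCapsule_New(addr, ARRAY_CAPSULE_NAME, None)


def capsule_to_address(capsule):
    """Recover the raw integer address stored in the capsule."""
    return api.PyCapsule_GetPointer(capsule, ARRAY_CAPSULE_NAME)


placeholder = (ctypes.c_uint8 * 80)()  # stands in for an 80-byte ArrowArray
capsule = address_to_capsule(ctypes.addressof(placeholder))
recovered = capsule_to_address(capsule)
```

The recovered integer is the same kind of value that the cell below passes to `_import_from_c`.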

In [6]:
schema_addr = ctypes.addressof(schema_c)
array_addr  = ctypes.addressof(array_c)

print(f'Schema address : 0x{schema_addr:016x}')
print(f'Array  address : 0x{array_addr:016x}')
print(f'Data   address : 0x{data_ptr:016x}')

pa_array = pa.Array._import_from_c(array_addr, schema_addr)
print('\nPyArrow array:', pa_array.to_pylist())
print('Type:', pa_array.type)

pa_buf_addr = pa_array.buffers()[1].address
print(f'\nPA buffer address : 0x{pa_buf_addr:016x}')
print(f'Our data ptr      : 0x{data_ptr:016x}')
print(f'Same memory?      : {pa_buf_addr == data_ptr}')

Schema address : 0x00007f63af356830
Array  address : 0x00007f63af3567e0
Data   address : 0x00005624a0087170
  [release_schema:int32] called

PyArrow array: [10, 20, 30, 40, 50]
Type: int32

PA buffer address : 0x00005624a0087170
Our data ptr      : 0x00005624a0087170
Same memory?      : True


---
## 5. Producer: `struct<float32, utf8>`

> **Spec:** [Producing struct arrays](https://arrow.apache.org/docs/format/CDataInterface.html#producer-requirements) · [Struct layout](https://arrow.apache.org/docs/format/Columnar.html#struct-layout)

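A struct array with two children replicates the spec's second canonical example:
- Parent schema: format `"+s"`, 2 children (`float32` = `"f"`, `utf8` = `"u"`)
- Parent array: `n_buffers=1` (one validity buffer), `n_children=2`

For brevity, the release callbacks in this example only mark each struct as released; they never free the `malloc`-ed buffers or release the children, so that memory is leaked. A production producer's parent callback must release its children and free everything it allocated.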
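
Each validity buffer mentioned here is a bit-packed bitmap using least-significant-bit numbering: slot `j` lives in byte `j // 8`, bit `j % 8`. As a rough sketch of how a consumer could test a slot through a raw address (`is_valid` is a hypothetical helper, not a PyArrow or spec function):

```python
import ctypes


def is_valid(bitmap_addr, i, offset=0):
    """Return True when logical slot i is non-null.

    bitmap_addr is the integer address from buffers[0]; None or 0 means the
    bitmap was omitted, in which case every slot is valid.
    offset is the array's `offset` field, counted in slots.
    """
    if not bitmap_addr:
        return True
    j = i + offset
    byte = ctypes.c_uint8.from_address(bitmap_addr + j // 8).value
    return bool((byte >> (j % 8)) & 1)


# One byte, bits 0, 2 and 3 set: slots 0, 2, 3 are valid, slot 1 is null.
bitmap = (ctypes.c_uint8 * 1)(0b00001101)
addr = ctypes.addressof(bitmap)
flags = [is_valid(addr, i) for i in range(4)]
```

Here `flags` evaluates to `[True, False, True, True]`, the same pattern as the `float32` child used later in this section.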

In [7]:
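def make_validity_bitmap(valid_flags, libc=libc):
    """Return a malloc'ed LSB-first bitmap; bit i is 1 when slot i is valid."""
    nbytes = (len(valid_flags) + 7) // 8
    buf = (ctypes.c_uint8 * nbytes)()  # zero-initialised staging copy
    for i, valid in enumerate(valid_flags):
        if valid:
            buf[i // 8] |= 1 << (i % 8)
    ptr = libc.malloc(max(nbytes, 1))  # avoid malloc(0), which may return NULL
    libc.memcpy(ptr, buf, nbytes)      # the C heap now holds its own copy
    return ptr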


def make_struct_float32_utf8(floats, strings):
    n = len(floats)
    alive = []

    # float32 child
    f32_valid = [v is not None for v in floats]
    f32_data  = [v or 0.0 for v in floats]
    f32_vmap  = make_validity_bitmap(f32_valid)
    f32_vals  = (ctypes.c_float * n)(*f32_data)
    f32_ptr   = libc.malloc(4 * n)
    libc.memcpy(f32_ptr, f32_vals, 4 * n)
    f32_bufs  = (ctypes.c_void_p * 2)(f32_vmap, f32_ptr)
    alive.extend([f32_vals, f32_bufs])

    f32_schema_release = make_schema_release('struct.f32_schema')
    f32_array_release  = make_noop_array_release('struct.f32_array')

    f32_schema = ArrowSchema()
    f32_schema.format = b'f'
    f32_schema.name = b'x'
    f32_schema.flags = ARROW_FLAG_NULLABLE
    f32_schema.n_children = 0
    f32_schema.release = f32_schema_release

    f32_array = ArrowArray()
    f32_array.length = n
    f32_array.null_count = f32_valid.count(False)
    f32_array.offset = 0
    f32_array.n_buffers = 2
    f32_array.n_children = 0
    f32_array.buffers = f32_bufs
    f32_array.release = f32_array_release

    # utf8 child
    str_valid = [v is not None for v in strings]
    str_data  = [v or '' for v in strings]
    raw_bytes = b''.join(s.encode() for s in str_data)
    offsets   = [0]
    for s in str_data: offsets.append(offsets[-1] + len(s.encode()))
    str_vmap  = make_validity_bitmap(str_valid)
    off_c     = (ctypes.c_int32 * (n+1))(*offsets)
    off_ptr   = libc.malloc(4 * (n+1))
    libc.memcpy(off_ptr, off_c, 4 * (n+1))
    raw_c     = (ctypes.c_char * len(raw_bytes))(*raw_bytes)
    raw_ptr   = libc.malloc(len(raw_bytes) + 1)
    libc.memcpy(raw_ptr, raw_c, len(raw_bytes))
    str_bufs  = (ctypes.c_void_p * 3)(str_vmap, off_ptr, raw_ptr)
    alive.extend([off_c, raw_c, str_bufs])

    str_schema_release = make_schema_release('struct.utf8_schema')
    str_array_release  = make_noop_array_release('struct.utf8_array')

    str_schema = ArrowSchema()
    str_schema.format = b'u'
    str_schema.name = b'y'
    str_schema.flags = ARROW_FLAG_NULLABLE
    str_schema.n_children = 0
    str_schema.release = str_schema_release

    str_array = ArrowArray()
    str_array.length = n
    str_array.null_count = str_valid.count(False)
    str_array.offset = 0
    str_array.n_buffers = 3
    str_array.n_children = 0
    str_array.buffers = str_bufs
    str_array.release = str_array_release

    # struct parent
    child_schemas = (ctypes.POINTER(ArrowSchema) * 2)(
        ctypes.pointer(f32_schema), ctypes.pointer(str_schema))
    child_arrays  = (ctypes.POINTER(ArrowArray)  * 2)(
        ctypes.pointer(f32_array),  ctypes.pointer(str_array))
    top_vmap = make_validity_bitmap([True]*n)
    top_bufs = (ctypes.c_void_p * 1)(top_vmap)
    alive.extend([f32_schema, f32_array, str_schema, str_array,
                  child_schemas, child_arrays, top_bufs])

    parent_schema_release = make_schema_release('struct.parent_schema')
    parent_array_release  = make_noop_array_release('struct.parent_array')

    parent_schema = ArrowSchema()
    parent_schema.format = b'+s'
    parent_schema.name = b''
    parent_schema.flags = 0
    parent_schema.n_children = 2
    parent_schema.children = child_schemas
    parent_schema.release = parent_schema_release

    parent_array = ArrowArray()
    parent_array.length = n
    parent_array.null_count = 0
    parent_array.offset = 0
    parent_array.n_buffers = 1
    parent_array.n_children = 2
    parent_array.buffers = top_bufs
    parent_array.children = child_arrays
    parent_array.release = parent_array_release

    alive.extend([f32_schema_release, f32_array_release,
                  str_schema_release, str_array_release,
                  parent_schema_release, parent_array_release,
                  parent_schema, parent_array])
    _kept_alive.extend(alive)
    return parent_schema, parent_array


print('Struct producer defined.')

Struct producer defined.


In [8]:
ps, pa_struct = make_struct_float32_utf8([1.5, None, 3.5, 4.0],
                                          ['hello', 'world', None, 'arrow'])

print('=== Parent ArrowSchema ===')
print(f'  format     : {ps.format}')
print(f'  n_children : {ps.n_children}')
for i in range(ps.n_children):
    child = ps.children[i].contents
    print(f'  child[{i}]  format={child.format}  name={child.name}')

result = pa.Array._import_from_c(ctypes.addressof(pa_struct), ctypes.addressof(ps))
print('\nPyArrow struct array:')
print(result)

=== Parent ArrowSchema ===
  format     : b'+s'
  n_children : 2
  child[0]  format=b'f'  name=b'x'
  child[1]  format=b'u'  name=b'y'
  [release_schema:struct.parent_schema] called

PyArrow struct array:
-- is_valid: all not null
-- child 0 type: float
  [
    1.5,
    null,
    3.5,
    4
  ]
-- child 1 type: string
  [
    "hello",
    "world",
    null,
    "arrow"
  ]
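
PyArrow decoded the `utf8` child from two raw buffers: `int32` offsets and the concatenated string bytes. As a minimal sketch of what a consumer does with those pointers (`read_utf8_slots` is a hypothetical helper; it ignores the validity bitmap, so null slots come back as empty strings):

```python
import ctypes


def read_utf8_slots(offsets_addr, data_addr, length):
    """Decode `length` strings from raw offsets/data buffer addresses."""
    offsets = (ctypes.c_int32 * (length + 1)).from_address(offsets_addr)
    return [
        ctypes.string_at(data_addr + offsets[i],
                         offsets[i + 1] - offsets[i]).decode('utf-8')
        for i in range(length)
    ]


# Build demo buffers the same way the producer does: slot 2 has zero length.
strings = ['hello', 'world', '', 'arrow']
data = b''.join(s.encode('utf-8') for s in strings)
ends = [0]
for s in strings:
    ends.append(ends[-1] + len(s.encode('utf-8')))

offsets_c = (ctypes.c_int32 * len(ends))(*ends)
data_c = ctypes.create_string_buffer(data)
decoded = read_utf8_slots(ctypes.addressof(offsets_c),
                          ctypes.addressof(data_c),
                          len(strings))
```

A full consumer would check bit `i` of `buffers[0]` first and emit a null instead of `''` for invalid slots.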


---
## 6. Move Semantics in Detail

> **Spec:** [Semantics of moving struct](https://arrow.apache.org/docs/format/CDataInterface.html#moving-the-arrowarray)

```
PRODUCER:                           CONSUMER:
┌──────────────────────┐            ┌──────────────────────┐
│  ArrowArray src      │ bitwise ─▶│  ArrowArray dst      │
│  .release = &my_fn   │  copy      │  .release = &my_fn   │
└──────────────────────┘            └──────────────────────┘
After copy: src.release = NULL    Eventually: dst.release(dst)
```

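**Key rule:** After the bitwise copy, set `src.release = NULL` to mark the source as released *without* calling the callback. From then on only the destination may release the data, and it must do so exactly once.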

In [9]:
def move_array(src, dst):
    ctypes.memmove(ctypes.addressof(dst), ctypes.addressof(src), ctypes.sizeof(ArrowArray))
    src.release = NULL_RELEASE_ARRAY  # mark source as released
    print('Moved: src.release is now NULL')

s1, a1 = make_int32_array([1, 2, 3])
data_before = a1.buffers[1]
print(f'Before move: data buffer @ 0x{data_before:016x}')

a2 = ArrowArray()
move_array(a1, a2)

data_after = a2.buffers[1]
print(f'After  move: data buffer @ 0x{data_after:016x}')
print(f'Same pointer: {data_before == data_after}')

pa_moved = pa.Array._import_from_c(ctypes.addressof(a2), ctypes.addressof(s1))
print('Moved array contents:', pa_moved.to_pylist())

Before move: data buffer @ 0x00005624a00ee340
Moved: src.release is now NULL
After  move: data buffer @ 0x00005624a00ee340
Same pointer: True
  [release_schema:int32] called
Moved array contents: [1, 2, 3]
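
Move semantics pair naturally with `private_data`: because the struct may be copied anywhere, the release callback cannot rely on Python references attached to the original object. The sketch below illustrates one possible bookkeeping approach under simplifying assumptions: it uses ctypes-owned buffers instead of `malloc`, stores an integer token in `private_data` instead of a real pointer, and every name in it (`_owned`, `release_tracked`, `make_tracked_int32`) is hypothetical.

```python
import ctypes
import itertools


class ArrowArray(ctypes.Structure):
    pass


release_array_fn = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))

ArrowArray._fields_ = [
    ('length', ctypes.c_int64),
    ('null_count', ctypes.c_int64),
    ('offset', ctypes.c_int64),
    ('n_buffers', ctypes.c_int64),
    ('n_children', ctypes.c_int64),
    ('buffers', ctypes.POINTER(ctypes.c_void_p)),
    ('children', ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ('dictionary', ctypes.POINTER(ArrowArray)),
    ('release', release_array_fn),
    ('private_data', ctypes.c_void_p),
]

_owned = {}                       # token -> objects the array's memory depends on
_next_token = itertools.count(1)  # start at 1 so a token never looks like NULL


@release_array_fn
def release_tracked(arr_ptr):
    arr = arr_ptr.contents
    _owned.pop(arr.private_data, None)  # drop the buffers of this array only
    arr.private_data = None
    arr.release = ctypes.cast(None, release_array_fn)  # mark as released


def make_tracked_int32(values):
    data = (ctypes.c_int32 * len(values))(*values)
    buffers = (ctypes.c_void_p * 2)(None, ctypes.addressof(data))
    token = next(_next_token)
    _owned[token] = (data, buffers)     # kept alive until release runs

    arr = ArrowArray()
    arr.length = len(values)
    arr.n_buffers = 2
    arr.buffers = buffers
    arr.release = release_tracked
    arr.private_data = token
    return arr


arr = make_tracked_int32([1, 2, 3])
live_before = len(_owned)
arr.release(ctypes.byref(arr))  # what a consumer does when it is finished
live_after = len(_owned)
```

Unlike a global keep-alive list that only grows, the registry entry disappears when the callback runs, so each array's memory is reclaimed independently.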


---
## 7. Schema Metadata Binary Format

> **Spec:** [Schema metadata encoding](https://arrow.apache.org/docs/format/CDataInterface.html#schema-metadata)

The `metadata` field is a binary-encoded key-value map:

```
int32  n_keys
for each pair:
    int32  key_len   + key bytes
    int32  val_len   + val bytes
```

All integers are 32-bit and stored in the host's **native endianness** (little-endian on x86-64 and most ARM systems, which is what the output below shows).

In [10]:
import struct as struct_mod

def encode_arrow_metadata(pairs):
    # '=i' packs a 4-byte int in native byte order, as the spec requires
    buf = struct_mod.pack('=i', len(pairs))
    for k, v in pairs.items():
        kb = k.encode()
        vb = v.encode()
        buf += struct_mod.pack('=i', len(kb)) + kb
        buf += struct_mod.pack('=i', len(vb)) + vb
    return buf

def decode_arrow_metadata(raw):
    if not raw:
        return {}
    off = 0
    n   = struct_mod.unpack_from('=i', raw, off)[0]  # number of key/value pairs
    off += 4
    res = {}
    for _ in range(n):
        kl = struct_mod.unpack_from('=i', raw, off)[0]
        off += 4
        k  = raw[off:off+kl].decode()
        off += kl
        vl = struct_mod.unpack_from('=i', raw, off)[0]
        off += 4
        v  = raw[off:off+vl].decode()
        off += vl
        res[k] = v
    return res

raw = encode_arrow_metadata({'ARROW:extension:name': 'my_uuid', 'version': '1'})
print('Encoded (hex):', raw.hex())
print('Decoded      :', decode_arrow_metadata(raw))

Encoded (hex): 02000000140000004152524f573a657874656e73696f6e3a6e616d65070000006d795f757569640700000076657273696f6e0100000031
Decoded      : {'ARROW:extension:name': 'my_uuid', 'version': '1'}
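
One ctypes pitfall follows from this format: `metadata` is declared as `c_char_p` in section 1, and reading a `c_char_p` stops at the first NUL byte, which here is inside the very first length field. A hedged sketch of reading the blob through its raw address instead (`read_metadata_at` is a hypothetical helper, and the demo blob is built locally rather than taken from a real schema):

```python
import ctypes
import struct


def read_metadata_at(addr):
    """Decode Arrow schema metadata starting at integer address `addr`."""
    if not addr:
        return {}

    def read_i32(a):
        # Copy 4 bytes out first so unaligned addresses are not a concern.
        return struct.unpack('=i', ctypes.string_at(a, 4))[0]

    n = read_i32(addr)
    pos = addr + 4
    out = {}
    for _ in range(n):
        klen = read_i32(pos)
        pos += 4
        key = ctypes.string_at(pos, klen)
        pos += klen
        vlen = read_i32(pos)
        pos += 4
        val = ctypes.string_at(pos, vlen)
        pos += vlen
        out[key.decode()] = val.decode()
    return out


blob = (struct.pack('=i', 1)
        + struct.pack('=i', 4) + b'key1'
        + struct.pack('=i', 6) + b'value1')
buf = (ctypes.c_char * len(blob)).from_buffer_copy(blob)
addr = ctypes.addressof(buf)

truncated = ctypes.c_char_p(addr).value  # stops at the first NUL byte
decoded = read_metadata_at(addr)         # walks the length-prefixed layout
```

With a struct instance, the address can be obtained without truncation via `ctypes.cast(schema.metadata, ctypes.c_void_p).value`.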


---
## 8. Producer Checklist

Before handing your structs to a consumer:

| Check | Detail |
|-------|--------|
| `schema.format` set | Correct format string |
| `schema.n_children` matches | For struct / list / union |
| `array.length` correct | Number of logical elements |
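| `array.null_count` ≥ -1 | Exact null count, or -1 if not yet computed |
| `array.n_buffers` correct | int32=2, utf8=3, struct=1, bool=2 |
| `array.buffers[0]` may be `NULL` when `null_count == 0` | The validity bitmap is optional when there are no nulls |
| `schema.release` and `array.release` set | The consumer calls each exactly once |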
| Set `src.release = NULL` after move | Move semantics |
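
Part of this checklist can be automated. The following is only a sketch: `checklist_problems` and `EXPECTED_N_BUFFERS` are hypothetical names, the buffer counts are the four listed in the table, and pointer-level checks are left out.

```python
# Buffer counts taken from the checklist table; extend for other formats.
EXPECTED_N_BUFFERS = {'i': 2, 'u': 3, '+s': 1, 'b': 2}


def checklist_problems(fmt, length, null_count, n_buffers, release_is_set):
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    expected = EXPECTED_N_BUFFERS.get(fmt)
    if expected is None:
        problems.append(f'format {fmt!r} is not covered by this sketch')
    elif n_buffers != expected:
        problems.append(f'n_buffers is {n_buffers}, expected {expected} for {fmt!r}')
    if length < 0:
        problems.append('length must be non-negative')
    if null_count < -1 or null_count > length:
        problems.append('null_count must be -1 (unknown) or within 0..length')
    if not release_is_set:
        problems.append('release callback is NULL, so the array looks already released')
    return problems


ok = checklist_problems('i', length=5, null_count=0,
                        n_buffers=2, release_is_set=True)
bad = checklist_problems('u', length=4, null_count=7,
                         n_buffers=2, release_is_set=False)
```

With the ctypes structs from section 1, `release_is_set` would be `bool(array.release)`.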

---
## Summary

```
Producer (ctypes)                         Consumer (pyarrow)
─────────────────                         ──────────────────
malloc(buffers)                           pa.Array._import_from_c(arr_addr, schema_addr)
fill ArrowSchema { format='i', ... } ──▶  reads schema.format
fill ArrowArray  { buffers=..., ... } ──▶  wraps buffers[1] as pa.Buffer (zero-copy)
set release callback                      calls release() on GC
src.release = NULL  (moved)
```

The entire contract fits in **two C structs and a release callback**.  
Nothing else is needed to share Arrow memory between any two implementations running in the same process.

**References:**
- [C Data Interface specification](https://arrow.apache.org/docs/format/CDataInterface.html)
- [ctypes: Python standard library](https://docs.python.org/3/library/ctypes.html)