# Variable-Length and Nested Types

> **Level:** Intermediate  
> **Spec:** [Variable-size binary layout](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) · [Nested layouts](https://arrow.apache.org/docs/format/Columnar.html#nested-layouts) · [BinaryView](https://arrow.apache.org/docs/format/Columnar.html#binary-view-layout)  
> **PyArrow docs:** [Nested arrays](https://arrow.apache.org/docs/python/data.html#nested-arrays)

## What you will learn

1. VarBinary / Utf8: offset buffer + data buffer
2. Utf8View / BinaryView: 16-byte view struct (inline vs pointer)
3. List / ListView / FixedSizeList layouts
4. Struct arrays and how struct and child validity bitmaps combine (logical AND)
5. Dense and Sparse Union arrays

In [1]:
import pyarrow as pa
import numpy as np
import struct

---
## 1. Variable-Length Binary / Utf8

> **Spec:** [Variable-size binary layout](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout)

Three buffers:
- `buffers()[0]`: validity bitmap
- `buffers()[1]`: int32 offsets (N+1 values)
- `buffers()[2]`: concatenated byte data

String `i` lives at `data[offsets[i] : offsets[i+1]]`.

In [2]:
strings = pa.array(['hello', 'Arrow', None, 'world!'], type=pa.utf8())

validity_buf = strings.buffers()[0]
offsets_buf  = strings.buffers()[1]
data_buf     = strings.buffers()[2]

offsets = np.frombuffer(offsets_buf, dtype=np.int32)
data    = data_buf.to_pybytes()

print('Offsets:', offsets.tolist())
print('Data   :', data)

print('\nDecoded strings:')
for i in range(len(strings)):
    start, end = offsets[i], offsets[i+1]
    val = data[start:end].decode() if strings[i].is_valid else None
    print(f'  [{i}] {val!r}  (offsets {start}..{end})')

Offsets: [0, 5, 10, 10, 16]
Data   : b'helloArrowworld!'

Decoded strings:
  [0] 'hello'  (offsets 0..5)
  [1] 'Arrow'  (offsets 5..10)
  [2] None  (offsets 10..10)
  [3] 'world!'  (offsets 10..16)
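
The same three buffers can be rebuilt by hand to make the offset arithmetic explicit. This is a minimal sketch in plain Python, assuming little-endian int32 offsets and LSB-first validity bits as the spec describes. `encode_utf8` and `decode_utf8` are hypothetical helper names, and the padding that real Arrow buffers usually carry is left out.

```python
import struct

def encode_utf8(values):
    """Build (validity, offsets, data) buffers for a list of str-or-None values."""
    validity = bytearray((len(values) + 7) // 8)
    offsets = [0]
    data = bytearray()
    for i, v in enumerate(values):
        if v is not None:
            validity[i // 8] |= 1 << (i % 8)  # LSB-first: bit i describes slot i
            data += v.encode("utf-8")
        offsets.append(len(data))             # a null slot repeats the previous offset
    # N+1 little-endian int32 offsets
    return bytes(validity), struct.pack(f"<{len(offsets)}i", *offsets), bytes(data)

def decode_utf8(validity, offsets_bytes, data, n):
    """Read n slots back out of the three buffers."""
    offsets = struct.unpack(f"<{n + 1}i", offsets_bytes)
    out = []
    for i in range(n):
        is_valid = (validity[i // 8] >> (i % 8)) & 1
        out.append(data[offsets[i]:offsets[i + 1]].decode("utf-8") if is_valid else None)
    return out

values = ["hello", "Arrow", None, "world!"]
validity, offsets_bytes, data = encode_utf8(values)
roundtrip = decode_utf8(validity, offsets_bytes, data, len(values))

print(struct.unpack("<5i", offsets_bytes))  # (0, 5, 10, 10, 16)
print(f"{validity[0]:08b}")                 # 00001011
print(roundtrip)
```

The null slot contributes no bytes to the data buffer, which is why its two offsets are equal. Only the validity bit distinguishes it from an empty string.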


---
## 2. Utf8View / BinaryView (Arrow columnar format 1.4+)

> **Spec:** [Binary View layout](https://arrow.apache.org/docs/format/Columnar.html#binary-view-layout)

Each element is a fixed **16-byte view struct**:
```
If length <= 12:  [ length(4) | data_inline(12) ]
If length > 12:   [ length(4) | prefix(4) | buf_index(4) | offset(4) ]
```

In [None]:
view_arr = pa.array(['hi', 'Arrow is great!', 'x'], type=pa.string_view())
print('Type:', view_arr.type)

views_buf = view_arr.buffers()[1]  # 16 bytes per element
raw = views_buf.to_pybytes()

print(f'View buffer size: {len(raw)} bytes ({len(raw)//16} x 16)')

for i in range(len(view_arr)):
    chunk = raw[i*16 : (i+1)*16]
    length = struct.unpack_from('<i', chunk, 0)[0]
    if length <= 12:
        data = chunk[4:4+length].decode()
        print(f'  [{i}] length={length} INLINE  data={data!r}')
    else:
        prefix    = chunk[4:8]
        buf_index = struct.unpack_from('<i', chunk, 8)[0]
        offset    = struct.unpack_from('<i', chunk, 12)[0]
        print(f'  [{i}] length={length} POINTER buf={buf_index} offset={offset} prefix={prefix}')

Type: string_view
View buffer size: 48 bytes (3 x 16)
  [0] length=2 INLINE  data='hi'
  [1] length=15 POINTER buf=0 offset=0 prefix=b'Arro'
  [2] length=1 INLINE  data='x'
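
To see why every view is exactly 16 bytes, the struct can be packed by hand. The sketch below uses only the standard `struct` module and assumes little-endian fields in the order shown above. `pack_view`, `unpack_view`, and `INLINE_MAX` are hypothetical names used for illustration, not PyArrow APIs.

```python
import struct

INLINE_MAX = 12  # longest value that fits inside the 16-byte view itself

def pack_view(value: bytes, buf_index: int = 0, offset: int = 0) -> bytes:
    """Pack one value into a 16-byte view (inline or pointer form)."""
    length = len(value)
    if length <= INLINE_MAX:
        # length(4) + inline data, zero-padded to 12 bytes by the 's' format
        return struct.pack("<i12s", length, value)
    # length(4) + prefix(4) + data buffer index(4) + offset into that buffer(4)
    return struct.pack("<i4sii", length, value[:4], buf_index, offset)

def unpack_view(view: bytes, data_buffers) -> bytes:
    """Resolve a 16-byte view back to its bytes."""
    length = struct.unpack_from("<i", view, 0)[0]
    if length <= INLINE_MAX:
        return view[4:4 + length]
    buf_index, offset = struct.unpack_from("<ii", view, 8)
    return data_buffers[buf_index][offset:offset + length]

# Only the long string needs space in a separate data buffer.
data_buffers = [b"Arrow is great!"]
views = [
    pack_view(b"hi"),
    pack_view(b"Arrow is great!", buf_index=0, offset=0),
    pack_view(b"x"),
]
decoded = [unpack_view(v, data_buffers) for v in views]

print([len(v) for v in views])  # [16, 16, 16]
print(decoded)
```

Because the first four bytes of a long value are copied into the view as a prefix, many comparisons can reject a mismatch without touching the data buffer at all.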


---
## 3. List / ListView / FixedSizeList

> **Spec:** [List layout](https://arrow.apache.org/docs/format/Columnar.html#list-layout)

| Type | Offsets | Child |
|------|---------|-------|
| `list<T>` | int32 offsets buffer | T array |
| `large_list<T>` | int64 offsets buffer | T array |
| `list_view<T>` | int32 offsets buffer + int32 sizes buffer | T array |
| `fixed_size_list<T>[N]` | No offset buffer | T array |


In [4]:
# Variable-length list
list_arr = pa.array([[1,2,3], [4,5], None, [6]], type=pa.list_(pa.int32()))
print('List array:', list_arr)
offsets = np.frombuffer(list_arr.buffers()[1], dtype=np.int32)
print('Offsets:', offsets.tolist())
child_vals = np.frombuffer(list_arr.values.buffers()[1], dtype=np.int32)
print('Child values:', child_vals.tolist())

# FixedSizeList: no offsets buffer (stride is implicit)
ip_arr = pa.array([[192,168,1,1], [10,0,0,1], [127,0,0,1]],
                  type=pa.list_(pa.uint8(), 4))
print('\nIP addresses (FixedSizeList[4]):', ip_arr.to_pylist())
print('Buffers:', len(ip_arr.buffers()), '(list validity + child validity + child values; no offsets)')

List array: [
  [
    1,
    2,
    3
  ],
  [
    4,
    5
  ],
  null,
  [
    6
  ]
]
Offsets: [0, 3, 5, 5, 6]
Child values: [1, 2, 3, 4, 5, 6]

IP addresses (FixedSizeList[4]): [[192, 168, 1, 1], [10, 0, 0, 1], [127, 0, 0, 1]]
Buffers: 3 (list validity + child validity + child values; no offsets)
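
The three list layouts differ only in how a slot is mapped onto the child array. The sketch below shows that mapping with plain Python lists standing in for the child values. Validity is deliberately ignored, and the helper names (`list_slots`, `list_view_slots`, `fixed_size_slots`) are hypothetical.

```python
def list_slots(offsets, child):
    """list<T>: slot i is child[offsets[i]:offsets[i+1]] (N+1 offsets)."""
    return [child[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]

def list_view_slots(offsets, sizes, child):
    """list_view<T>: slot i is child[offsets[i]:offsets[i]+sizes[i]] (N offsets, N sizes)."""
    return [child[o:o + s] for o, s in zip(offsets, sizes)]

def fixed_size_slots(n, child):
    """fixed_size_list<T>[n]: slot i is child[i*n:(i+1)*n], so no offsets are stored."""
    return [child[i:i + n] for i in range(0, len(child), n)]

child = [1, 2, 3, 4, 5, 6]

# Same offsets as the list array above; the null slot is an empty range
# until the validity bitmap is applied.
as_list = list_slots([0, 3, 5, 5, 6], child)

# With explicit sizes, slots no longer have to appear in child order.
as_view = list_view_slots([3, 0, 5], [2, 3, 1], child)

ips = fixed_size_slots(4, [192, 168, 1, 1, 10, 0, 0, 1, 127, 0, 0, 1])

print(as_list)  # [[1, 2, 3], [4, 5], [], [6]]
print(as_view)  # [[4, 5], [1, 2, 3], [6]]
print(ips)      # [[192, 168, 1, 1], [10, 0, 0, 1], [127, 0, 0, 1]]
```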


---
## 4. Struct Arrays

> **Spec:** [Struct layout](https://arrow.apache.org/docs/format/Columnar.html#struct-layout)

A struct array has:
- One validity bitmap for the struct level
- Child arrays, each with their own validity
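- A field value is null if **either** the struct bitmap **or** the child bitmap marks it null; equivalently, it is valid only when both validity bits are set (logical AND). A child slot under a null struct may still hold a placeholder value.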

In [6]:
struct_arr = pa.array(
    [{'x': 1, 'y': 'a'}, None, {'x': None, 'y': 'c'}],
    type=pa.struct([('x', pa.int32()), ('y', pa.utf8())])
)
print('Struct array:', struct_arr.to_pylist())
# Validity bits are LSB-first: bit i (counting from the right) describes slot i
print('Struct validity:', " ".join(f"{b:08b}" for b in struct_arr.buffers()[0].to_pybytes()))
print('x child:', struct_arr.field('x').to_pylist())
print('y child:', struct_arr.field('y').to_pylist())

Struct array: [{'x': 1, 'y': 'a'}, None, {'x': None, 'y': 'c'}]
Struct validity: 00000101
x child: [1, 0, None]
y child: ['a', '', 'c']
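
The way the two bitmaps combine can be sketched in plain Python. The bitmaps below are written out by hand to match the `x` field of the example above (an assumption made for illustration, not bytes read from PyArrow), and `get_bit` and `effective_validity` are hypothetical helpers.

```python
def get_bit(bitmap: bytes, i: int) -> bool:
    """Read validity bit i (LSB-first within each byte)."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)

def effective_validity(struct_bitmap: bytes, child_bitmap: bytes, n: int):
    """A field value is valid only if both the struct bit and the child bit are set."""
    return [get_bit(struct_bitmap, i) and get_bit(child_bitmap, i) for i in range(n)]

struct_bitmap = bytes([0b101])  # slots 0 and 2 valid; slot 1 is a null struct
x_bitmap = bytes([0b011])       # x child: slots 0 and 1 valid; slot 2 is null
x_values = [1, 0, None]         # physical child values, including the placeholder 0

valid = effective_validity(struct_bitmap, x_bitmap, 3)
logical_x = [v if ok else None for v, ok in zip(x_values, valid)]

# The same result as a byte-wise AND of the two bitmaps
combined = bytes(a & b for a, b in zip(struct_bitmap, x_bitmap))

print(valid)                 # [True, False, False]
print(logical_x)             # [1, None, None]
print(f"{combined[0]:08b}")  # 00000001
```

The placeholder `0` in slot 1 never surfaces when reading through the struct, because the struct-level bit for that slot is cleared.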


---
## 5. Union Arrays

> **Spec:** [Union layout](https://arrow.apache.org/docs/format/Columnar.html#union-layout)

| Mode | Type IDs buffer | Offsets buffer | Memory |
|------|----------------|---------------|---------|
| Dense | Yes | Yes | Compact: each child stores only the values of its own type |
| Sparse | Yes | No | Larger: every child is as long as the union itself |

In [8]:
# Dense union of int32 and utf8: each child stores only its own values
type_ids = pa.array([0, 1, 0, 1, 0], type=pa.int8())
offsets  = pa.array([0, 0, 1, 1, 2], type=pa.int32())  # index into the selected child
int_child = pa.array([10, 20, 30], type=pa.int32())
str_child = pa.array(['hello', 'world'], type=pa.utf8())

dense = pa.UnionArray.from_dense(type_ids, offsets, [int_child, str_child],
                                 field_names=['i', 's'], type_codes=[0, 1])
print('Dense union:', dense.to_pylist())

# Sparse union: every child is as long as the union; unused slots are placeholders
int_child = pa.array([10, None, 20, None, 30], type=pa.int32())
str_child = pa.array([None, 'hello', None, 'world', None], type=pa.utf8())
sparse = pa.UnionArray.from_sparse(type_ids, [int_child, str_child])
print('Sparse union:', sparse.to_pylist())

Dense union: [10, 'hello', 20, 'world', 30]
Sparse union: [10, 'hello', 20, 'world', 30]
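
The lookup rule for each mode can be written as a short sketch in plain Python. It assumes the type codes equal the child indices (0 and 1), as in the example above; `resolve_dense` and `resolve_sparse` are hypothetical helper names.

```python
def resolve_dense(type_ids, offsets, children):
    """Dense: slot i reads children[type_ids[i]][offsets[i]]."""
    return [children[t][o] for t, o in zip(type_ids, offsets)]

def resolve_sparse(type_ids, children):
    """Sparse: slot i reads children[type_ids[i]][i]; no offsets buffer is needed."""
    return [children[t][i] for i, t in enumerate(type_ids)]

type_ids = [0, 1, 0, 1, 0]

dense_children = [[10, 20, 30], ["hello", "world"]]
dense_values = resolve_dense(type_ids, [0, 0, 1, 1, 2], dense_children)

sparse_children = [[10, None, 20, None, 30], [None, "hello", None, "world", None]]
sparse_values = resolve_sparse(type_ids, sparse_children)

# Total child slots stored by each layout for the same five logical values
dense_slots = sum(len(c) for c in dense_children)
sparse_slots = sum(len(c) for c in sparse_children)

print(dense_values)               # [10, 'hello', 20, 'world', 30]
print(sparse_values)              # [10, 'hello', 20, 'world', 30]
print(dense_slots, sparse_slots)  # 5 10
```

The dense layout pays for one extra int32 offset per slot, while the sparse layout pays for one unused slot per slot in every child that was not selected.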


---
## Summary

| Type | Buffers |
|------|---------|
| Utf8 | validity + int32 offsets + data bytes |
| Utf8View | validity + 16-byte views + optional data buffers |
| List | validity + int32 offsets + child array |
| FixedSizeList | validity + child array (no offsets) |
| Struct | validity + child arrays |
| Dense Union | type_ids (int8) + offsets (int32) + children (no validity bitmap) |
| Sparse Union | type_ids (int8) + children, all the same length (no validity bitmap) |

**Next ⏭️** [Dictionary and REE](04_dictionary_and_ree.ipynb): efficient encoding for repeated data