# Arrays, Buffers & Nulls

> **Level:** Beginner  
> **Spec:** [Physical memory layout](https://arrow.apache.org/docs/format/Columnar.html#physical-memory-layout) · [Validity bitmaps](https://arrow.apache.org/docs/format/Columnar.html#validity-bitmaps) · [Fixed-size primitive layout](https://arrow.apache.org/docs/format/Columnar.html#fixed-size-primitive-layout)  
> **PyArrow docs:** [Data types and in-memory data model](https://arrow.apache.org/docs/python/data.html)

## What you will learn

1. How fixed-size primitive arrays lay out bytes in memory
2. How to inspect raw buffers with `arr.buffers()`
3. How validity bitmaps encode nulls (LSB-first bit packing)
4. The `null_count == 0` optimisation (no validity buffer allocated)
5. Booleans: bit-packed, 1 bit per element

In [1]:
import pyarrow as pa
import numpy as np

---
## 1. Fixed-Size Primitive Layout

> **Spec:** [Fixed-size primitive layout](https://arrow.apache.org/docs/format/Columnar.html#fixed-size-primitive-layout)

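An `int32` array with nulls has **two** buffers:
- `buffers()[0]`: validity bitmap (1 bit per element, LSB first)
- `buffers()[1]`: values (4 bytes per element, little-endian by default)

The values buffer holds a slot for every element, nulls included. The spec does not define the contents of a null slot, so the zeros printed below should not be relied on.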

In [2]:
arr = pa.array([1, None, 3, None, 5], type=pa.int32())
print(f'Type      : {arr.type}')
print(f'Length    : {len(arr)}')
print(f'Null count: {arr.null_count}')
print(f'Buffers   : {len(arr.buffers())}')

validity_buf = arr.buffers()[0]
values_buf   = arr.buffers()[1]
print(f'\nValidity buffer bits  : {" ".join(f"{b:08b}" for b in validity_buf.to_pybytes())}')
print(f'Values   buffer bytes : {values_buf.to_pybytes().hex()}')

# Decode values buffer as int32
values = np.frombuffer(values_buf, dtype=np.int32)
print(f'\nRaw int32 values (including null slots): {values}')

Type      : int32
Length    : 5
Null count: 2
Buffers   : 2

Validity buffer bits  : 00010101
Values   buffer bytes : 0100000000000000030000000000000005000000

Raw int32 values (including null slots): [1 0 3 0 5]
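
To make the byte layout concrete without PyArrow, the following minimal sketch rebuilds the logical values from the two raw buffers using only the standard library. The byte strings are copied from the output above; the names `validity_bytes`, `values_bytes`, `raw` and `logical` are illustrative, and little-endian `int32` is assumed.

```python
import struct

# Bytes copied from the printed output above (assumption: little-endian int32).
validity_bytes = bytes([0b00010101])
values_bytes = bytes.fromhex("0100000000000000030000000000000005000000")
length = 5

# Each element occupies 4 bytes, so element i starts at byte offset 4 * i.
raw = struct.unpack(f"<{length}i", values_bytes)

# Mask out the slots whose validity bit is 0.
logical = [
    raw[i] if (validity_bytes[i // 8] >> (i % 8)) & 1 else None
    for i in range(length)
]
print(logical)  # [1, None, 3, None, 5]
```

The values buffer alone cannot tell a stored `0` from a null slot; only the validity bitmap carries that information.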


---
## 2. Validity Bitmap, LSB Decoding

> **Spec:** [Validity bitmaps](https://arrow.apache.org/docs/format/Columnar.html#validity-bitmaps)

Bits are stored **Least-Significant-Bit first** within each byte.  
Element `i` is valid if `(byte[i // 8] >> (i % 8)) & 1` equals 1.

In [3]:
def decode_validity_bitmap(buf, length):
    """Return one bool per element; assumes the array has offset 0 (not a slice)."""
    if buf is None:
        return [True] * length  # no bitmap = all valid
    bitmap_bytes = buf.to_pybytes()
    result = []
    for i in range(length):
        byte_idx = i // 8
        bit_idx  = i %  8
        bit = (bitmap_bytes[byte_idx] >> bit_idx) & 1
        result.append(bool(bit))
    return result

valid = decode_validity_bitmap(arr.buffers()[0], len(arr))
print('Valid flags:', valid)
print('Expected:   ', [v is not None for v in arr.to_pylist()])

Valid flags: [True, False, True, False, True]
Expected:    [True, False, True, False, True]
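
The decoding above can be run in reverse. This is a sketch of a hypothetical helper, `pack_validity`, that packs a list of validity flags into LSB-first bytes. It emits the minimum number of bytes; real Arrow buffers may be padded beyond that, which is not modelled here.

```python
def pack_validity(flags):
    """Pack booleans into LSB-first bitmap bytes (hypothetical helper)."""
    flags = list(flags)
    bitmap = bytearray((len(flags) + 7) // 8)  # ceil(len / 8) zeroed bytes
    for i, is_valid in enumerate(flags):
        if is_valid:
            bitmap[i // 8] |= 1 << (i % 8)  # set bit (i % 8) of byte (i // 8)
    return bytes(bitmap)

packed = pack_validity([True, False, True, False, True])
print(f"{packed[0]:08b}")  # 00010101

# Ten elements spill into a second byte; the unused high bits stay 0.
two_bytes = pack_validity([True] * 10)
print(" ".join(f"{b:08b}" for b in two_bytes))  # 11111111 00000011
```

The first result matches the validity byte printed in section 1 for `[1, None, 3, None, 5]`.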


---
## 3. null_count == 0 Optimisation

> **Spec:** [null_count == 0](https://arrow.apache.org/docs/format/Columnar.html#null-count)

When an array has no nulls, the Arrow format allows the validity buffer to be **omitted entirely**, and PyArrow does so here (`buffers()[0] is None`).  
This saves memory and lets consumers skip the bitmap check.

In [4]:
no_null_arr = pa.array([1, 2, 3, 4, 5], type=pa.int32())
print('null_count  :', no_null_arr.null_count)
print('buffers()[0]:', no_null_arr.buffers()[0])

null_arr = pa.array([1, None, 3], type=pa.int32())
print('\nWith nulls: buffers()[0]:', null_arr.buffers()[0])

null_count  : 0
buffers()[0]: None

With nulls: buffers()[0]: <pyarrow.Buffer address=0x7f5844b20100 size=1 is_cpu=True is_mutable=True>
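
As an illustration of how a consumer might take advantage of a missing bitmap, here is a sketch of a hypothetical `sum_valid` function. It works on plain Python sequences rather than Arrow buffers, and it assumes `validity` is either `None` or LSB-first bitmap bytes.

```python
def sum_valid(values, validity=None):
    """Sum the non-null slots; `validity` is LSB-first bitmap bytes or None."""
    if validity is None:
        # Fast path: no bitmap means every slot is valid, so no bit tests.
        return sum(values)
    return sum(
        v for i, v in enumerate(values)
        if (validity[i // 8] >> (i % 8)) & 1
    )

print(sum_valid([1, 2, 3, 4, 5]))              # 15
print(sum_valid([1, 0, 3], validity=b"\x05"))  # 4 (slot 1 is null)
```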


---
## 4. Boolean Arrays: Bit-Packed

> **Spec:** [Boolean layout](https://arrow.apache.org/docs/format/Columnar.html#boolean)

Boolean values are **bit-packed** (1 bit per element), not byte-per-element.  
Eight booleans fit in a single byte, using the same LSB-first ordering as the validity bitmap.

In [5]:
bool_list = [True, False, True, True, False, False, True, False]
bool_arr = pa.array(bool_list)
print('Type   :', bool_arr.type)
print('Length :', len(bool_arr))

val_buf = bool_arr.buffers()[1]
print('Values buffer size (bytes):', val_buf.size)  # should be 1 for 8 booleans
raw_byte = val_buf.to_pybytes()[0]
print(f'Raw byte: 0x{raw_byte:02x} = {raw_byte:08b}')
print('\nDecoded (LSB first):', [(raw_byte >> i) & 1 for i in range(8)])
print('Expected           :', [int(v) for v in bool_list])

Type   : bool
Length : 8
Values buffer size (bytes): 1
Raw byte: 0x4d = 01001101

Decoded (LSB first): [1, 0, 1, 1, 0, 0, 1, 0]
Expected           : [1, 0, 1, 1, 0, 0, 1, 0]
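
A boolean array that contains nulls carries two bitmaps: one for validity and one for the values. The sketch below combines them in pure Python. The value byte `0x4d` is taken from the output above, while the validity byte is a made-up example that marks slot 3 as null; `get_bit` is a hypothetical helper.

```python
def get_bit(bitmap, i):
    """Return bit i of an LSB-first bitmap (hypothetical helper)."""
    return (bitmap[i // 8] >> (i % 8)) & 1

values_bitmap = bytes([0x4D])          # value bits from the cell above
validity_bitmap = bytes([0b11110111])  # assumption: slot 3 is null
length = 8

# A slot is None when its validity bit is 0; otherwise read its value bit.
logical = [
    bool(get_bit(values_bitmap, i)) if get_bit(validity_bitmap, i) else None
    for i in range(length)
]
print(logical)
# [True, False, True, None, False, False, True, False]
```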


---
## Summary

| Concept | Detail |
|---------|--------|
| Fixed-size primitive | `buffers()[0]` = validity, `buffers()[1]` = values |
| Validity bitmap | LSB-first bit packing: 1 = valid, 0 = null |
| null_count == 0 | Validity buffer may be omitted (`buffers()[0] is None`) |
| Boolean | Bit-packed in values buffer, same LSB order |

**Next ⏭️** [Variable Length and Nested](03_variable_length_and_nested.ipynb): strings, lists, structs and unions