Towards N-dimensional sparse arrays #2

@eriknw

I think N-dimensional arrays may be achievable without too much consternation. Let me share a partial proposal that I think strikes a good balance between simplicity and expressiveness. This is largely inspired by TACO formats.

For the rest of this post, I will only assume row-oriented (aka C-oriented) structures such as CSR. Specific names are also open to change.

Instead of names like col_indices and columns, let's see how far we can get with using only indices{i} and pointers{i}, such as indices0, indices1, pointers0, pointers1, etc.

  • COO format is easy. It uses indices0, indices1, ..., indicesN.
  • CSR can use pointers0 and indices1 instead of indptr and col_indices.
  • HyperCSR (DCSR) can use indices0, pointers0, and indices1 instead of rows, indptr, and col_indices.
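
To make the naming concrete, here is a minimal sketch that builds the COO, CSR, and HyperCSR arrays for a small dense matrix using only the indices{i} / pointers{i} names. The function name dense_to_formats, the dict layout, and the "values" key are assumptions made for illustration; they are not part of the proposal.

```python
def dense_to_formats(dense):
    """Build COO, CSR, and HyperCSR (DCSR) arrays from a dense list of rows.

    Hypothetical helper for illustration only; row-oriented, as assumed above.
    """
    nrows = len(dense)

    # COO: one coordinate array per dimension, all the same length as values.
    coo = {"indices0": [], "indices1": [], "values": []}
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0:
                coo["indices0"].append(i)
                coo["indices1"].append(j)
                coo["values"].append(v)

    # CSR: pointers0 replaces indptr (one entry per row plus a final end offset).
    pointers0 = [0] * (nrows + 1)
    for i in coo["indices0"]:
        pointers0[i + 1] += 1
    for i in range(nrows):
        pointers0[i + 1] += pointers0[i]
    csr = {
        "pointers0": pointers0,
        "indices1": list(coo["indices1"]),
        "values": list(coo["values"]),
    }

    # HyperCSR: indices0 lists only the non-empty rows, and pointers0 is
    # compressed to match (one entry per stored row plus a final end offset).
    rows = [i for i in range(nrows) if pointers0[i + 1] > pointers0[i]]
    dcsr = {
        "indices0": rows,
        "pointers0": [pointers0[i] for i in rows] + [pointers0[nrows]],
        "indices1": list(coo["indices1"]),
        "values": list(coo["values"]),
    }
    return coo, csr, dcsr


coo, csr, dcsr = dense_to_formats([[0, 5, 0], [0, 0, 0], [7, 0, 9]])
```

In all three cases indices1 is the final coordinate and has the same length as values; only the treatment of dimension 0 differs.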

Let's look at all options for up to rank 3:

Ranks 1-3

[Image "rank123": format options for ranks 1 to 3]

I hope this is self-explanatory (although it may take some thinking). Please ask if you would like me to clarify anything.

Observation 1

If indicesN appears without pointersN, then len(indicesN) == len(values), and all subsequent indices (such as indices{N+1}) must also appear without the corresponding pointers (such as pointers{N+1}).

Edit: upon further consideration, this is not a strict rule. A format with indices0 + indices1 + pointers1 + indices2 is a valid rank 3 array and actually not that weird. indices0 + pointers1 + indices2 is pretty weird though.

Observe how rank 2 formats all have indices1, but no pointers1, because indices1 is the final coordinate that must match with values.

Observation 2

If pointersN appears without indicesN, then pointersN is an array of pointers that can be fast-indexed with any given coordinate index. For example, this is like indptr in CSR, which lets you quickly index into it using a row id.

This case typically happens only at the beginning (as for CSR), and only once. Let's consider the two highlighted cases above that aren't typical:

  • "weird, but valid": the first index is compressed as in HyperCSR. Imagine only two rows have data. Then, the pointers1 array will be of length 2 * shape[1] and can be fast-indexed into.
  • "fast lookup": this lets us fast-index using the first two indices i and j. The pointers1 array will be of length shape[0] * shape[1].
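
Here is a rough sketch of the "fast lookup" case for a rank 3 array, storing only pointers1, indices2, and values. It assumes pointers1 carries one extra trailing end offset, as indptr does in CSR, so its length here is shape[0] * shape[1] + 1; the function names and the dict-of-coordinates input are hypothetical and chosen only for the example.

```python
def build_fast_lookup(shape, entries):
    """Build pointers1 / indices2 / values from a {(i, j, k): value} dict.

    Hypothetical helper; assumes a CSR-style trailing end offset in pointers1.
    """
    n0, n1, _ = shape
    coords = sorted(entries)  # row-major order over (i, j, k)
    pointers1 = [0] * (n0 * n1 + 1)
    for i, j, _k in coords:
        pointers1[i * n1 + j + 1] += 1  # count entries per (i, j) slot
    for slot in range(n0 * n1):
        pointers1[slot + 1] += pointers1[slot]  # prefix sum gives offsets
    return {
        "pointers1": pointers1,
        "indices2": [k for _i, _j, k in coords],
        "values": [entries[c] for c in coords],
    }


def lookup(arr, shape, i, j, k):
    """Return the value at (i, j, k), or 0 if nothing is stored there."""
    slot = i * shape[1] + j  # direct index from the first two coordinates
    start, end = arr["pointers1"][slot], arr["pointers1"][slot + 1]
    for pos in range(start, end):  # scan only the k-coordinates of this (i, j)
        if arr["indices2"][pos] == k:
            return arr["values"][pos]
    return 0


shape = (2, 3, 4)
arr = build_fast_lookup(shape, {(0, 1, 2): 1.5, (1, 2, 0): 2.5, (1, 2, 3): 3.5})
```

The lookup never searches over i or j: the pair is turned into a slot number directly, which is what makes this layout fast at the cost of a pointers array sized by shape[0] * shape[1].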

Ranks 1-5

For "easy-to-think-about" options (see typical cases from "observation 2" above):
[Image "rank12345": "easy-to-think-about" format options for ranks 1 to 5]

I think this illustrates the pattern of "reasonable" cases nicely. For rank R, there are 2*R - 1 such "reasonable" cases. Overall, though, there are 3**(R-1) total possibilities given my current rules (2**R - 1 if Observation 1 is treated as a strict rule).
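
These counts can be checked by brute force. The sketch below encodes each dimension as "I" (indices only), "P" (pointers only), or "IP" (both), with the last dimension always "I". The encoding and the exact definition of "reasonable" (Observation 1 applied strictly, and pointers-only allowed only at dimension 0) are my reading of the rules for the purpose of this example, not a settled definition.

```python
from itertools import product

LEVELS = ("I", "P", "IP")  # indices only, pointers only, indices + pointers


def all_formats(rank):
    """Every combination; the final dimension is always indices only."""
    return [combo + ("I",) for combo in product(LEVELS, repeat=rank - 1)]


def is_strict(fmt):
    """Observation 1 as a strict rule: after the first indices-only
    dimension, every later dimension is also indices only."""
    seen_indices_only = False
    for level in fmt:
        if seen_indices_only and level != "I":
            return False
        if level == "I":
            seen_indices_only = True
    return True


def is_reasonable(fmt):
    """Assumed reading of the typical cases: strict, and pointers-only
    may appear only at dimension 0."""
    return is_strict(fmt) and all(level != "P" for level in fmt[1:])


for rank in range(1, 6):
    formats = all_formats(rank)
    total = len(formats)
    strict = sum(is_strict(f) for f in formats)
    reasonable = sum(is_reasonable(f) for f in formats)
    print(rank, total, strict, reasonable)
```

Under this encoding the three columns come out as 3**(R-1), 2**R - 1, and 2*R - 1 for each rank R from 1 to 5.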

Okay, it's late. I'll continue discussing this some other day.

CC @ivirshup
