I think N-dimensional arrays may be achievable without too much consternation. Let me share a partial proposal that I think strikes a good balance between simplicity and expressiveness. This is largely inspired by TACO formats.
For the rest of this post, I will only assume row-oriented (aka C-oriented) structures such as CSR. Specific names are also open to change.
Instead of names like col_indices and columns, let's see how far we can get with using only indices{i} and pointers{i}, such as indices0, indices1, pointers0, pointers1, etc.
COO format is easy. It uses indices0, indices1, ..., indicesN.
CSR can use pointers0 and indices1 instead of indptr and col_indices.
HyperCSR (DCSR) can use indices0, pointers0, and indices1 instead of rows, indptr, and col_indices.
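To make the naming concrete, here is a minimal pure-Python sketch that stores one small matrix in all three rank 2 layouts. The helper names (to_coo, to_csr, to_dcsr) and the dict-of-lists representation are made up for illustration and are not part of the proposal; only the array names indices0, pointers0, and indices1 come from the text above.

```python
def to_coo(dense):
    """COO: one coordinate array per dimension, each as long as values."""
    indices0, indices1, values = [], [], []
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0:
                indices0.append(i)
                indices1.append(j)
                values.append(v)
    return {"indices0": indices0, "indices1": indices1, "values": values}


def to_csr(dense):
    """CSR: pointers0 plays the role of indptr, indices1 of col_indices."""
    pointers0, indices1, values = [0], [], []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                indices1.append(j)
                values.append(v)
        pointers0.append(len(values))  # end offset of this row
    return {"pointers0": pointers0, "indices1": indices1, "values": values}


def to_dcsr(dense):
    """HyperCSR (DCSR): like CSR, but only non-empty rows are recorded."""
    indices0, pointers0, indices1, values = [], [0], [], []
    for i, row in enumerate(dense):
        nonzeros = [(j, v) for j, v in enumerate(row) if v != 0]
        if not nonzeros:
            continue  # an empty row contributes nothing
        indices0.append(i)
        for j, v in nonzeros:
            indices1.append(j)
            values.append(v)
        pointers0.append(len(values))
    return {"indices0": indices0, "pointers0": pointers0,
            "indices1": indices1, "values": values}


dense = [
    [0, 5, 0, 0],
    [0, 0, 0, 0],
    [7, 0, 0, 9],
]
coo = to_coo(dense)
csr = to_csr(dense)
dcsr = to_dcsr(dense)
```

In all three results indices1 and values are identical; the layouts differ only in how the row coordinate is encoded.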
Let's look at all options for up to rank 3:
Ranks 1-3
I hope this is self-explanatory (although it may take some thinking). Please ask if you would like me to clarify anything.
Observation 1
If indicesN appears without pointersN, then len(indicesN) == len(values), and all subsequent indices (such as indices{N+1}) must also appear without the corresponding pointers (such as pointers{N+1}).
Edit: upon further consideration, this is not a strict rule. A format with indices0 + indices1 + pointers1 + indices2 is a valid rank 3 array and actually not that weird. indices0 + pointers1 + indices2 is pretty weird though.
Observe how rank 2 formats all have indices1, but no pointers1, because indices1 is the final coordinate that must match with values.
Observation 2
If pointersN appears without indicesN, then pointersN is an array of pointers that can be fast-indexed with any given coordinate index. For example, this is like indptr in CSR that lets you quickly index into it using a row id.
This case typically happens only at the beginning (as for CSR), and only once. Let's consider the two highlighted cases above that aren't typical:
- "weird, but valid": the first index is compressed as in HyperCSR. Imagine only two rows have data. Then, the pointers1 array will be of length 2 * shape[1] and can be fast-indexed into.
- "fast lookup": this lets us fast-index using the first two indices i and j. The pointers1 array will be of length shape[0] * shape[1].
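The "fast lookup" case could be sketched roughly as follows for a rank 3 array stored as pointers1 + indices2 + values. The function names and the dict input are hypothetical. One assumption to flag: this sketch gives pointers1 one extra trailing entry (shape[0] * shape[1] + 1 in total) so that every fiber has an explicit end offset, the same convention indptr uses in CSR; the lengths quoted above do not include that entry.

```python
def build_fast_lookup(entries, shape):
    """Build pointers1, indices2, values from a {(i, j, k): value} dict.

    Slot i * shape[1] + j of pointers1 marks where the k coordinates of
    fiber (i, j, :) start; the next slot marks where they end.
    """
    n_fibers = shape[0] * shape[1]
    pointers1 = [0] * (n_fibers + 1)  # assumed trailing end offset
    indices2, values = [], []
    # Sorted coordinate tuples come out in row-major (C) order.
    for (i, j, k), v in sorted(entries.items()):
        pointers1[i * shape[1] + j + 1] += 1  # count entries per fiber
        indices2.append(k)
        values.append(v)
    for f in range(n_fibers):  # prefix sum: counts become offsets
        pointers1[f + 1] += pointers1[f]
    return pointers1, indices2, values


def lookup(pointers1, indices2, values, shape, i, j, k):
    """Return the value stored at (i, j, k), or 0 if nothing is stored."""
    f = i * shape[1] + j  # direct index; no search over i or j
    for pos in range(pointers1[f], pointers1[f + 1]):
        if indices2[pos] == k:
            return values[pos]
    return 0


shape = (2, 3, 4)
entries = {(0, 1, 2): 10, (0, 1, 3): 11, (1, 2, 0): 12}
pointers1, indices2, values = build_fast_lookup(entries, shape)
found = lookup(pointers1, indices2, values, shape, 0, 1, 3)
missing = lookup(pointers1, indices2, values, shape, 1, 0, 0)
```

Only the last coordinate k needs a scan (or a binary search); i and j are resolved with a single multiplication and addition.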
Ranks 1-5
For "easy-to-think-about" options (see typical cases from "observation 2" above)
I think this illustrates the pattern of "reasonable" cases nicely. For rank R, there are 2*R - 1 such "reasonable" cases. Overall, though, there are 3**(R-1) total possibilities given my current rules.
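One way to sanity-check the total count is to enumerate the combinations directly. The sketch below assumes a reading of the rules that is not spelled out in full above: each of the first R - 1 dimensions independently uses indices only, pointers only, or both, and the final dimension always uses indices only. The function name enumerate_formats is made up for illustration.

```python
from itertools import product


def enumerate_formats(rank):
    """List every combination of array names for the given rank.

    Assumed rule: each non-final dimension has indices, pointers, or both;
    the final dimension has indices only, since it must line up with values.
    """
    formats = []
    for choice in product(("indices", "pointers", "both"), repeat=rank - 1):
        names = []
        for dim, kind in enumerate(choice):
            if kind in ("indices", "both"):
                names.append(f"indices{dim}")
            if kind in ("pointers", "both"):
                names.append(f"pointers{dim}")
        names.append(f"indices{rank - 1}")
        formats.append(tuple(names))
    return formats


counts = {rank: len(enumerate_formats(rank)) for rank in range(1, 6)}
rank2 = enumerate_formats(2)
```

Under that assumption, rank 2 yields exactly the COO, CSR, and HyperCSR layouts, and the count grows as 3**(R-1).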
Okay, it's late. I'll continue discussing this some other day.
CC @ivirshup