Skip to content

FEATURE REQUEST: Can we have native support for ragged arrays? #3326

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
jachymb opened this issue Jan 8, 2025 · 4 comments
Open

FEATURE REQUEST: Can we have native support for ragged arrays? #3326

jachymb opened this issue Jan 8, 2025 · 4 comments

Comments

@jachymb
Copy link

jachymb commented Jan 8, 2025

I think it would be great if there is a native support for ragged arrays in the datatypes. The need for this appears in practice, when we have groups of measurements, but each group has different size. Such problems are often suitable for hierarchical modelling for which STAN is often a good tool of choice.

The user guide suggests two approaches how to address this in the current state.

  1. the inner arrays having a known maximum size, using (possibly sparse) binary indicators to distinguish whether data is missing or available.
  2. using a flattened array with a list of indices where it should be separated.

It's quite obvious that 1) is a possibly huge waste of memory in many cases.
Technically, 2) works but it's still not great. The main issue I have with it is that it's at the cost of loss of semantics: It makes the code harder to write, read and maintain. Performance-wise, 2) still (according to the guide) makes a copy of parts of the array within model (using a function like block or segment), which is a bit of a waste, it's not terrible, but certainly not optimal if we could instead reference the data directly without copying.

Description:

My idea is to allow type declarations for array of arrays of unknown length (especially in the data section) such as:

int m, n; 
array[m] array[] int array_of_varying_length_int_arrays;
array[m] vector[] array_of_varying_length_vectors;
array[m] matrix[, n] array_of_varying_height_matrices;

Then I expect to be able to get a vector using vector[] v = array_of_varying_length_vectors[k] and obtaining it's length using size(v) or a similar function.

Not sure how difficult is this to implement, I'm just a measly user, seems probably like a bigger change.

@WardBrian
Copy link
Member

There is an older design document which proposes this feature at stan-dev/design-docs#7

You are correct that there are some implementation challenges, but hearing that people would really use it is always motivating!

@bob-carpenter
Copy link
Member

bob-carpenter commented Jan 8, 2025

There are two semantic issues with the current approach. One, we have to translate data types. Two, we have to apply our own constraints.

To mitigate both issues, we really should have our built-in constraints implemented to map an array, a position, and a size/constraint specification into the relevant data structure---building in that deserialization would help. I'm thinking something like:

tuple(matrix, real) read_cov_matrix(array[] real x, int start)

that returns the constrained matrix and log Jacobian. This is essentially how our built-in constraints work.

Alternatively, we could define the deserializer as

matrix read_cov_matrix_jac(array[] real x, int start)

and increment the log density for the Jacobian within the function. Our built-ins support this pattern, too. I'm not sure if we actually have the _jac suffix to parallel the _lp suffix in user-defined functions, but we should (@WardBrian will know).

Native ragged arrays would be much better, but they're hard to specify with Stan's size-based design. Please feel free to comment on the design doc. Along with closures, it's one of the two big features we're looking at implementing. Tuples was the last one.

@WardBrian
Copy link
Member

I'm not sure if we actually have the _jac suffix to parallel the _lp suffix in user-defined functions, but we should (@WardBrian will know).

It's called _jacobian, not _jac, but it does exist. We have an existing stanc3 issue to expose all the current built-in transforms as them:
stan-dev/stanc3#1436

But it hasn't been started. I was originally imagining it as wrapping the stan-dev/math versions of the functions, while your comment sounds more like wrapping the stan-dev/stan deserializer class versions, so there would have to be some call there (the main difference is for e.g. simplex_jacobian, wrapping the stan-math implementation would require the input already be a vector, not an array[] real)

@bob-carpenter
Copy link
Member

Thanks, @WardBrian. Using a vector would also work.

I just realized that what I proposed above isn't enough for a stateless implementation. We really need something like this:

tuple(matrix, real, int) read_cov_matrix(vector x, int pos, int N)

where N is the number of rows and columns in the covariance matrix, pos is the position in x to start reading, and x is a vector of unconstrained values, with the return matrix being the N x N covariance matrix, the return real being the log Jacobian adjustment, and the int being the number of unconstrained values used. Here the number of unconstrained values required is (N choose 2) + N, so the user could compute that themselves, but that would be error prone.

If we want just the plain old transforms, where we consume a whole vector, that'd be

tuple(matrix, real) to_cov_matrix(vector x);

This would require the user to pass a vector that has a size equal to N choose 2 + N for some N, and the return is just the covariance matrix and Jacobian adjustment. We can also have a slightly more efficient version without the Jacobian.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants