Skip to content

Adding a ConstantArray class#87

Merged
hpages merged 4 commits intoBioconductor:masterfrom
LTLA:master
Feb 20, 2021
Merged

Adding a ConstantArray class#87
hpages merged 4 commits intoBioconductor:masterfrom
LTLA:master

Conversation

@LTLA
Copy link
Contributor

@LTLA LTLA commented Feb 13, 2021

Nothing too fancy here, this class just represents an array with a constant value for all of its cells.

This was originally implemented as part of SingleCellExperiment, which contains a primitive implementation of what I hope will be the combineRows implementation for SummarizedExperiment objects. The idea is to create all-NA matrices for assays that are not present in some SE objects, thus allowing them to be efficiently combined (without pretending that they're zero, which they're not).

I thought about using an RleArray for this purpose but it breaks pretty quickly with:

RleArray(Rle(NA, 1e10), c(1e5, 1e5))
## Error in .make_RleArraySeed_from_Rle(data, dim, dimnames, chunksize) : 
##   Input data is too long (>= 2^31). Please supply an ordinary list of Rle
##   objects instead, or an RleList object, or a DataFrame object where all
##   the columns are Rle objects.

I guess I could make a list of Rle objects, but it's a bit of a chore to have to create the chunks manually. Probably like:

NR <- 1e5
NC <- 1e5
chunk.size <- floor(.Machine$integer.max/NR)
nchunks <- ceiling(NC/chunk.size)
chunks <- rep(list(Rle(NA, NR * chunk.size)), nchunks)
chunks[[nchunks]] <- Rle(NA, NR*(NC - chunk.size * (nchunks - 1)))
RleArray(chunks, c(NR, NC))

By comparison, the proposed class is simple to understand and use (and maintain) by just calling:

ConstantArray(c(1e5, 1e5), NA)

Data extraction also seems faster based on some cursory timings, but this isn't a major consideration here.

@hpages
Copy link
Contributor

hpages commented Feb 17, 2021

Sounds good but does this work?

> ConstantArray(5:7, value=NA)
Error in new_DelayedArray(seed, Class = "ConstantMatrix") : 
  the supplied seed must have exactly 2 dimensions when the specified
  class (ConstantMatrix) extends DelayedMatrix

Seems like some unit tests are needed.

@LTLA
Copy link
Contributor Author

LTLA commented Feb 17, 2021

... only for matrices, apparently. Oops. Fixed.

@hpages
Copy link
Contributor

hpages commented Feb 17, 2021

Thanks. A few more things:

  1. ConstantMatrix needs to contain ConstantArray so the following returns TRUE:

    > is(ConstantArray(5:6, value=NA), "ConstantArray")
    [1] FALSE
    

    Another benefit of doing this is that ConstantMatrix will then inherit the ConstantArray reprresentation so you don't need to specify it again in the ConstantMatrix class definition. It will just be:

    setClass("ConstantMatrix", contains=c("ConstantArray", "DelayedMatrix"))
    

    Unfortunately, after doing this, you need to define coercion methods from ConstantMatrix to ConstantArray and from ConstantArray to ConstantMatrix to prevent some bad things to happen. See how these coercions are defined for RleArray/RleMatrix in RleArray-class.R and do the same thing.

  2. Validity method: Use setValidity2 instead of setValidity. Use msg <- validate_dim_slot(object, "dim") to validate the dim slot (it will also make sure that the slot doesn't contain NAs). Don't try to return all the problems in msg but return only the first one. See validity method for SparseArraySeed for an example.

  3. In addition to all the atomic types, the DelayedArray framework also supports arrays of type list so this would need to be supported:

    ConstantArray(4:3, value=list(55))
    ConstantArray(4:3, value=list(letters))
    

    I suggest you replace value="ANY" with value="vector" in the class definition (that's how the nzdata slot is specified for SparseArraySeed). Then just remove the !is.atomic(object@value) test in the validity method and you should be good.

  4. The extract_array() method should just return array(x@value, get_Nindex_lengths(index, dim(x))). Your .get_constant_dim helper is not needed.

  5. What's considered a zero depends on the type of a DelayedArray object. For example, if the type is character, the zero value is the empty string. If the type is list, it's NULL. If you use:

    zero <- vector(type(x), length=1L)
    identical(x@value, zero)
    

    in your is_sparse() method then it should do the right thing whatever the type. (However I just discovered that dense2sparse() is broken on arrays of type list so avoid this in your tests.)

  6. Export the ConstantArraySeed() constructor function, especially since you document it.

@LTLA
Copy link
Contributor Author

LTLA commented Feb 17, 2021

Done... I think I got all of the above.

@hpages
Copy link
Contributor

hpages commented Feb 19, 2021

Thanks. Sorry to insist but you don't need to specify the representation in the class definition of ConstantMatrix 'cause you inherit it from ConstantArray (hey, I gave you above the exact setClass statement to use for this class). Other than that, everything looks good.

@LTLA
Copy link
Contributor Author

LTLA commented Feb 19, 2021

ooops.

@hpages hpages merged commit c313349 into Bioconductor:master Feb 20, 2021
@hpages
Copy link
Contributor

hpages commented Feb 20, 2021

Just for fun, here's one way to make a delayed constant matrix without using the new ConstantMatrix stuff:

out2 <- DelayedArray(matrix(NA_real_))[rep(1, 1e6), rep(1, 1e6)]

However, it's 10x slower than:

out <- ConstantArray(c(1e6, 1e6), value=NA_real_)

and also the memory footprint of the resulting object is 5000x bigger.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants