# CELLxGENE Discover Census Workshop - CSHL Single-Cell Analysis 2023

This notebook is a step-by-step walkthrough of the CELLxGENE Discover Census Workshop at CSHL's Single-Cell Analysis, 2023.

Original notebook: [colab.research.google.com/drive/158f6Ggl5MRxtnxC9Q01TjJMbkIPQxcim](https://colab.research.google.com/drive/158f6Ggl5MRxtnxC9Q01TjJMbkIPQxcim)

## License

MIT License

Copyright (c) 2022-2023 Chan Zuckerberg Initiative Foundation.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

## Workshop

### Installation

To install the Census API on your laptop you should follow the [installation instructions](https://chanzuckerberg.github.io/cellxgene-census/cellxgene_census_docsite_installation.html) in the documentation site.

### Census Hello World

Let's load the necessary libraries.

In [1]:
library("tiledb")
library("cellxgene.census")
library("tiledbsoma")

Let's also set some configuration settings for SOMA. They control how much data is streamed at any given time, which will be relevant when we cover SOMA iterators.

In [2]:
# Default buffer size: 1GB
# TileDB-Cloud can run with the default, but about 10MB is recommended for this workshop

# ~10MB
py.init_buffer_bytes <- 0.01 * 1024^3
# ~10MB
soma.init_buffer_bytes <- 0.01 * 1024^3

ctx <- new_SOMATileDBContext_for_census(
  py.init_buffer_bytes = py.init_buffer_bytes,
  soma.init_buffer_bytes = soma.init_buffer_bytes
)

#### Finding Census versions available

Let's first take a look at the data releases available in S3. There are two types of releases:
- **Long-term supported (LTS) data releases** published every six months to be available for up to 5 years.
- **Weekly releases** to be available for up to 6 weeks.

To see a list of all available releases and their version aliases, we can do the following:

In [3]:
get_census_version_directory()

| | release_date | release_build | soma.uri | soma.relative_uri | soma.s3_region | h5ads.uri | h5ads.relative_uri | h5ads.s3_region | do_not_delete | lts | alias |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<lgl>` | `<lgl>` | `<chr>` |
| stable | | 2024-07-01 | s3://cellxgene-census-public-us-west-2/cell-census/2024-07-01/soma/ | /cell-census/2024-07-01/soma/ | us-west-2 | s3://cellxgene-census-public-us-west-2/cell-census/2024-07-01/h5ads/ | /cell-census/2024-07-01/h5ads/ | us-west-2 | True | True | stable |
| latest | | 2024-09-02 | s3://cellxgene-census-public-us-west-2/cell-census/2024-09-02/soma/ | /cell-census/2024-09-02/soma/ | us-west-2 | s3://cellxgene-census-public-us-west-2/cell-census/2024-09-02/h5ads/ | /cell-census/2024-09-02/h5ads/ | us-west-2 | False | | latest |
| 2023-05-15 | | 2023-05-15 | s3://cellxgene-census-public-us-west-2/cell-census/2023-05-15/soma/ | /cell-census/2023-05-15/soma/ | us-west-2 | s3://cellxgene-census-public-us-west-2/cell-census/2023-05-15/h5ads/ | /cell-census/2023-05-15/h5ads/ | us-west-2 | True | True | |
| 2023-07-25 | | 2023-07-25 | s3://cellxgene-census-public-us-west-2/cell-census/2023-07-25/soma/ | /cell-census/2023-07-25/soma/ | us-west-2 | s3://cellxgene-census-public-us-west-2/cell-census/2023-07-25/h5ads/ | /cell-census/2023-07-25/h5ads/ | us-west-2 | True | True | |
| 2023-12-15 | | 2023-12-15 | s3://cellxgene-census-public-us-west-2/cell-census/2023-12-15/soma/ | /cell-census/2023-12-15/soma/ | us-west-2 | s3://cellxgene-census-public-us-west-2/cell-census/2023-12-15/h5ads/ | /cell-census/2023-12-15/h5ads/ | us-west-2 | True | True | |
| 2024-05-20 | | 2024-05-20 | s3://cellxgene-census-public-us-west-2/cell-census/2024-05-20/soma/ | /cell-census/2024-05-20/soma/ | us-west-2 | s3://cellxgene-census-public-us-west-2/cell-census/2024-05-20/h5ads/ | /cell-census/2024-05-20/h5ads/ | us-west-2 | True | | |
| 2024-07-01 | | 2024-07-01 | s3://cellxgene-census-public-us-west-2/cell-census/2024-07-01/soma/ | /cell-census/2024-07-01/soma/ | us-west-2 | s3://cellxgene-census-public-us-west-2/cell-census/2024-07-01/h5ads/ | /cell-census/2024-07-01/h5ads/ | us-west-2 | True | True | |
| 2024-08-05 | | 2024-08-05 | s3://cellxgene-census-public-us-west-2/cell-census/2024-08-05/soma/ | /cell-census/2024-08-05/soma/ | us-west-2 | s3://cellxgene-census-public-us-west-2/cell-census/2024-08-05/h5ads/ | /cell-census/2024-08-05/h5ads/ | us-west-2 | False | | |
| 2024-08-12 | | 2024-08-12 | s3://cellxgene-census-public-us-west-2/cell-census/2024-08-12/soma/ | /cell-census/2024-08-12/soma/ | us-west-2 | s3://cellxgene-census-public-us-west-2/cell-census/2024-08-12/h5ads/ | /cell-census/2024-08-12/h5ads/ | us-west-2 | False | | |
| 2024-08-19 | | 2024-08-19 | s3://cellxgene-census-public-us-west-2/cell-census/2024-08-19/soma/ | /cell-census/2024-08-19/soma/ | us-west-2 | s3://cellxgene-census-public-us-west-2/cell-census/2024-08-19/h5ads/ | /cell-census/2024-08-19/h5ads/ | us-west-2 | False | | |


In [4]:
lts_census <- get_census_version_directory()
# `lts` is NA for non-LTS builds; which() keeps only the rows where it is TRUE
lts_census[which(lts_census$lts), ]

| | release_date | release_build | soma.uri | soma.relative_uri | soma.s3_region | h5ads.uri | h5ads.relative_uri | h5ads.s3_region | do_not_delete | lts | alias |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<lgl>` | `<lgl>` | `<chr>` |
| stable | | 2024-07-01 | s3://cellxgene-census-public-us-west-2/cell-census/2024-07-01/soma/ | /cell-census/2024-07-01/soma/ | us-west-2 | s3://cellxgene-census-public-us-west-2/cell-census/2024-07-01/h5ads/ | /cell-census/2024-07-01/h5ads/ | us-west-2 | True | True | stable |
| 2023-05-15 | | 2023-05-15 | s3://cellxgene-census-public-us-west-2/cell-census/2023-05-15/soma/ | /cell-census/2023-05-15/soma/ | us-west-2 | s3://cellxgene-census-public-us-west-2/cell-census/2023-05-15/h5ads/ | /cell-census/2023-05-15/h5ads/ | us-west-2 | True | True | |
| 2023-07-25 | | 2023-07-25 | s3://cellxgene-census-public-us-west-2/cell-census/2023-07-25/soma/ | /cell-census/2023-07-25/soma/ | us-west-2 | s3://cellxgene-census-public-us-west-2/cell-census/2023-07-25/h5ads/ | /cell-census/2023-07-25/h5ads/ | us-west-2 | True | True | |
| 2023-12-15 | | 2023-12-15 | s3://cellxgene-census-public-us-west-2/cell-census/2023-12-15/soma/ | /cell-census/2023-12-15/soma/ | us-west-2 | s3://cellxgene-census-public-us-west-2/cell-census/2023-12-15/h5ads/ | /cell-census/2023-12-15/h5ads/ | us-west-2 | True | True | |
| 2024-07-01 | | 2024-07-01 | s3://cellxgene-census-public-us-west-2/cell-census/2024-07-01/soma/ | /cell-census/2024-07-01/soma/ | us-west-2 | s3://cellxgene-census-public-us-west-2/cell-census/2024-07-01/h5ads/ | /cell-census/2024-07-01/h5ads/ | us-west-2 | True | True | |
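
Weekly builds carry no `lts` flag in the directory, so that column can hold missing values. The sketch below uses a small made-up data frame (not real Census output) to show how R treats `NA` in a logical row index, and why `which()` is a safe way to keep only the flagged rows.

```r
# Toy stand-in for the release directory (values are made up).
releases <- data.frame(
  release_build = c("2024-07-01", "2024-09-02", "2023-12-15"),
  lts = c(TRUE, NA, TRUE),
  row.names = c("stable", "latest", "2023-12-15")
)

# Logical indexing keeps an all-NA placeholder row for every NA in the index.
with_na_rows <- releases[releases$lts, ]

# which() returns only the positions that are TRUE, so NA entries are dropped.
lts_only <- releases[which(releases$lts), ]

print(nrow(with_na_rows))
print(rownames(lts_only))
```

The same idea applies to any logical column that may contain `NA`.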


#### Opening a Census version

Now we can get a handle to the Census object hosted in S3. Remember that we can specify the data release to use.

In [5]:
# For the latest LTS use "stable", other options are "latest" for the latest
# weekly, or specific version

version <- "2024-09-02"

# Equivalent (at time of writing), but will emit a warning advising to pin a specific version
# version <- "latest"

census <- open_soma(census_version=version, tiledbsoma_ctx = ctx)

#### Inspecting the Census object

These are the types of SOMA objects used by Census:

- `SparseNDArray` is a sparse multi-dimensional array (the sparse counterpart of `DenseNDArray`) that supports point indexing (disjoint index access).
- `DataFrame` is a multi-column table with user-defined column names and value types, with support for point indexing.
- `Collection` is a persistent container of named SOMA objects, similar to a dictionary.
- `Experiment` is a class that represents a single-cell experiment. It always contains two objects:
   - `obs`: a `DataFrame` with primary annotations on the observation axis.
   - `ms`: a `Collection` of measurements, each composed of `X` matrices and axis annotation matrices or data frames (e.g. `var`, `varm`, `obsm`, etc).

The parent Census object is a SOMA `Collection`:

In [6]:
census


You can access the items of a collection by name.

- `"census_data"` has the Census single-cell data; we'll explore it in a moment.
- `"census_info"` has high-level summary information about Census.

Let's take a look at `"census_info"`

**🚨 NOTE:** To access elements of a SOMA collection we need to use the R6 method `$get()`


In [7]:
census$get("census_info")


There are three items in this collection:

- `"summary"`: A data frame with high-level information of this Census, e.g. build date, total cell count, etc.
- `"summary_cell_counts"`: A data frame with cell counts stratified by relevant cell metadata
- `"datasets"`: A data frame with all datasets from CELLxGENE Discover used to create the Census.

Now let's take a look at `"census_data"`.

In [None]:
census$get("census_data")

The two items in this collection, `"homo_sapiens"` and `"mus_musculus"`, are SOMA `Experiment` objects, a specialized form of `Collection`. Each of them stores a data matrix (cells by genes), cell metadata, gene metadata, and some other useful components.

### Reading Data Frames


#### Reading cell metadata

Let's take a deeper dive into the single-cell data. As mentioned earlier, an `Experiment` always has an `obs` attribute that can be accessed via `$obs`.

Let's take a look at the human `Experiment`.

In [None]:
census$get("census_data")$get("homo_sapiens")

In [None]:
census$get("census_data")$get("homo_sapiens")$obs

We can take a look at the columns available in a data frame with the `schema()` method, which shows the types of metadata available for each cell.

In [None]:
census$get("census_data")$get("homo_sapiens")$obs$schema()

Let's read two columns of the data frame.

In [None]:
obs <- census$get("census_data")$get("homo_sapiens")$obs$read(column_names=c("suspension_type", "tissue_general"))$concat()
head(as.data.frame(obs))

The line above retrieved the suspension type and tissue values for all human cells in Census. Let's dissect it step by step to see what happened:


1. `$read(column_names = c("suspension_type", "tissue_general"))` - creates an iterator of Arrow tables that can be used for chunk-based data streaming.
2. `$concat()` - retrieves all the results of the iterator and concatenates them into a single Arrow table.
3. `as.data.frame(obs)` - converts the Arrow table into a data frame.

Let's do each step one more time and inspect the intermediate objects.

In [None]:
# Create iterator of Arrow tables
iterator <- census$get("census_data")$get("homo_sapiens")$obs$read(column_names=c("suspension_type", "tissue_general"))
iterator

In [None]:
# We can get individual chunks
table_chunk <- iterator$read_next()
table_chunk

In [None]:
head(as.data.frame(table_chunk))

In [None]:
# Or concatenate the remaining results into a single Arrow Table,
# and then convert to a data frame
arrow_table <- iterator$concat()
df_obs <- as.data.frame(arrow_table)
head(df_obs)

In [None]:
# And you can perform operations useful for your analysis
table(df_obs$suspension_type)
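
When the full table is too large to concatenate, the same tally can be accumulated one chunk at a time. The sketch below is a minimal illustration under two assumptions: the iterator exposes `read_complete()` alongside `read_next()` (as tiledbsoma read iterators do), and `make_toy_iterator()` is a hypothetical in-memory stand-in used only so the example runs without a Census connection.

```r
# Tally the values of one column chunk by chunk, without concatenating.
tally_column <- function(iterator, column) {
  counts <- integer(0)
  while (!iterator$read_complete()) {
    chunk <- as.data.frame(iterator$read_next())
    chunk_counts <- table(chunk[[column]])
    for (value in names(chunk_counts)) {
      previous <- if (value %in% names(counts)) counts[[value]] else 0L
      counts[[value]] <- previous + chunk_counts[[value]]
    }
  }
  counts
}

# Hypothetical stand-in for a SOMA iterator (illustration only).
make_toy_iterator <- function(chunks) {
  position <- 0L
  list(
    read_complete = function() position >= length(chunks),
    read_next = function() {
      position <<- position + 1L
      chunks[[position]]
    }
  )
}

toy <- make_toy_iterator(list(
  data.frame(suspension_type = c("cell", "nucleus", "cell")),
  data.frame(suspension_type = c("nucleus", "nucleus"))
))
suspension_counts <- tally_column(toy, "suspension_type")
print(suspension_counts)
```

With a real Census iterator, `tally_column(iterator, "suspension_type")` would be called on the object returned by `$read()` instead of the toy.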

#### Summary info and dataset table

The same reading operations can be applied to any SOMA data frame in Census. Let's take a look back at the items of `"census_info"`.

In [None]:
census$get("census_info")

`"summary"` is a data frame with high-level information of this data release.

In [None]:
census$get("census_info")$get("summary")

In [None]:
as.data.frame(census$get("census_info")$get("summary")$read()$concat())

And `"datasets"` is a data frame listing all of the datasets whose single-cell data is contained in this Census release.

**🚨 NOTE:** the column `dataset_id` is also present in the cell metadata for joining

In [None]:
datasets <- census$get("census_info")$get("datasets")$read()$concat()
head(as.data.frame(datasets))
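
Because `dataset_id` appears in both tables, cell metadata can be joined to dataset-level information. The sketch below uses base R `merge()` on two made-up data frames; the `dataset_title` column and all values are assumptions chosen for illustration, not Census output.

```r
# Toy cell metadata: each cell records the dataset it came from.
obs_toy <- data.frame(
  soma_joinid = 0:3,
  dataset_id = c("d1", "d1", "d2", "d3"),
  cell_type = c("T cell", "B cell", "T cell", "neuron")
)

# Toy dataset table keyed by the same identifier.
datasets_toy <- data.frame(
  dataset_id = c("d1", "d2", "d3"),
  dataset_title = c("Toy dataset A", "Toy dataset B", "Toy dataset C")
)

# Left join: keep every cell and attach its dataset title.
obs_with_titles <- merge(obs_toy, datasets_toy, by = "dataset_id", all.x = TRUE)

# Count cells per dataset title.
cells_per_dataset <- table(obs_with_titles$dataset_title)
print(cells_per_dataset)
```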

#### Reading gene metadata

Reading gene metadata is similar to reading cell metadata. However, this data frame is located inside the SOMA `Measurement`. This design allows for multi-modal data, whereby the same observation (cell) can have a different set of features for each type of measurement (e.g. genes, proteins).


To read the gene metadata:

In [None]:
# Build iterator
iterator <- census$get("census_data")$get("homo_sapiens")$ms$get("RNA")$var$read()

# Grab first chunk
table_chunk <- iterator$read_next()

# Convert to data frame
head(as.data.frame(table_chunk))

#### Reading a Data Frame with row filters

SOMA makes it easy and efficient to select only a subset of rows based on a query filter. This helps when you want to grab data or metadata for a specific set of cells or genes based on the columns available in these data frames.

For example, if you want to get all the *primary cells*, you can add the following filter.

**🚨 NOTE:** cells annotated as `is_primary_data == TRUE` are those marked as the original contribution, as some cells are represented more than once in Census due to their inclusion in multiple datasets.

In [None]:
# Build iterator with a query filter
iterator <- census$get("census_data")$get("homo_sapiens")$obs$read(
    value_filter = "is_primary_data == TRUE"
)

# Grab first chunk
table_chunk <- iterator$read_next()

# Convert to data frame
head(as.data.frame(table_chunk))

The `value_filter` works similarly to the Pandas `query` interface: it takes a string that is evaluated as a boolean condition and selects the rows that meet it.

We can then use other operators to build complex queries, for example all epithelial cells from lung that are primary representations.

In [None]:
filter <- "is_primary_data == TRUE & cell_type == 'epithelial cell' & tissue_general == 'lung'"
columns <- c("assay")

# Build iterator
iterator <- census$get("census_data")$get("homo_sapiens")$obs$read(
    value_filter = filter,
    column_names = columns
)

# Grab first chunk
table_chunk <- iterator$read_next()

# Convert to data frame and get unique values
unique(as.data.frame(table_chunk))
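
Filter strings like this one can also be assembled programmatically, which helps when the values come from a vector. The helpers below (`in_condition()` and `all_conditions()`) are hypothetical names introduced for this sketch; they only build text and do not contact Census. The `%in% c(...)` form is the one used in the `ExperimentAxisQuery` filters later in this notebook.

```r
# Quote a character vector and wrap it in an `%in% c(...)` condition.
in_condition <- function(column, values) {
  quoted <- paste0("'", values, "'", collapse = ", ")
  paste0(column, " %in% c(", quoted, ")")
}

# Join any number of conditions with `&`.
all_conditions <- function(...) paste(c(...), collapse = " & ")

cell_filter_text <- all_conditions(
  "tissue_general == 'lung'",
  in_condition("cell_type", c("epithelial cell", "fibroblast"))
)
print(cell_filter_text)
```

The resulting string can be passed as `value_filter` in the same way as a hand-written one.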

#### Reading a Data Frame with coordinates

Finally, you can also read a data frame via coordinates. This is useful when testing code with a small set of data.

In [None]:
obs <- census$get("census_data")$get("homo_sapiens")$obs$read(coords=1:5)$concat()
as.data.frame(obs)

### Reading expression data

The single-cell expression data is stored as a SOMA `SparseNDArray`. This is a sparse representation of the data that enables efficient storage and access for data with a high number of missing values.

Currently Census has two expression layers:

- Raw counts.
- Normalized counts by library size.

For human, these are located in the "RNA" measurement at:

- `census$get("census_data")$get("homo_sapiens")$ms$get("RNA")$X$get("raw")`
- `census$get("census_data")$get("homo_sapiens")$ms$get("RNA")$X$get("normalized")`

Reading these data works similarly to reading data frames. The main difference is that there are different types of iterators available. In this workshop we'll focus on `Matrix::dgTMatrix` iterators.


In [None]:
# Create a reader
reader <- census$get("census_data")$get("homo_sapiens")$ms$get("RNA")$X$get("raw")$read()

# Build an iterator of Matrix::dgTMatrix objects
iterator <- reader$sparse_matrix()

# Grab first chunk
sparse_chunk <- iterator$read_next()

# Inspect the structure of the chunk
str(sparse_chunk)

This produces an expression matrix in COO sparse format using `Matrix::dgTMatrix`:

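- `i` - the row of the cell; in R's one-based indexing this is the cell ID + 1.
- `j` - the column of the gene; in R's one-based indexing this is the gene ID + 1.
- `x` - the expression value (`soma_data` in SOMA).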
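
To make the triplet layout concrete, the sketch below builds a tiny `dgTMatrix` by hand. The IDs and values are made up, and it assumes Census IDs (`soma_joinid`) are zero-based; it illustrates the COO format itself rather than how the Census reader constructs its chunks.

```r
library(Matrix)

# Hypothetical zero-based IDs and values for three non-zero entries.
cell_ids <- c(0L, 2L, 2L)
gene_ids <- c(1L, 0L, 3L)
values <- c(5, 1, 2)

# dgTMatrix stores the triplets directly; its `i` and `j` slots are zero-based.
coo <- new("dgTMatrix", i = cell_ids, j = gene_ids, x = values, Dim = c(4L, 5L))

# R indexing is one-based, so the entry for cell ID 0 and gene ID 1 is [1, 2].
print(coo[1, 2])
print(dim(coo))
```

Only the three non-zero entries are stored, even though the matrix has 20 cells in total.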


In [None]:
dim(sparse_chunk)

**🚨 NOTE:** Reading the expression matrix in isolation is usually not very useful without cell and gene metadata. We'll cover that in the next section.

**🚨 NOTE:** The shape of this matrix is 63,094,145 rows by 60,664 columns (the full size of Census), because the chunk is a "sparse" view of the whole matrix.

### Atomic reading of expression data AND metadata

SOMA provides a convenient interface to query single-cell data in a metadata-aware fashion using `ExperimentAxisQuery`.

In a previous section we covered the concept of a SOMA `Experiment` as a class that represents a single-cell experiment. It always contains two objects:
   - `obs`: a `DataFrame` with primary annotations on the observation axis.
   - `ms`: a `Collection` of measurements, each composed of `X` matrices and axis annotation matrices or data frames (e.g. `var`, `varm`, `obsm`, etc).

**🚨 NOTE:** An `ExperimentAxisQuery` enables users to query and slice an `Experiment` single-cell data and metadata using coordinates or value filters on the axes, similar to how a SOMA `DataFrame` is queried.
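
Conceptually, an axis query filters the cell axis and the gene axis independently and then slices the expression matrix with both results. The base R sketch below mimics that idea on made-up in-memory data; it is an analogy under the assumption of zero-based `soma_joinid` values, not the tiledbsoma implementation.

```r
# Toy cell metadata, gene metadata, and a dense cells-by-genes matrix.
obs <- data.frame(
  soma_joinid = 0:4,
  cell_type = c("leukocyte", "keratinocyte", "neuron", "leukocyte", "neuron"),
  tissue_general = c("tongue", "tongue", "brain", "lung", "brain")
)
var <- data.frame(
  soma_joinid = 0:3,
  feature_name = c("PECAM1", "DCN", "KRT13", "ACTB")
)
X <- matrix(1:20, nrow = 5, ncol = 4)

# Filter each axis on its own metadata.
obs_hits <- subset(
  obs,
  tissue_general == "tongue" & cell_type %in% c("leukocyte", "keratinocyte")
)
var_hits <- subset(var, feature_name %in% c("PECAM1", "DCN", "KRT13"))

# Slice the matrix with the surviving IDs (zero-based IDs + 1 for R indexing).
X_hits <- X[obs_hits$soma_joinid + 1, var_hits$soma_joinid + 1, drop = FALSE]
dimnames(X_hits) <- list(obs_hits$cell_type, var_hits$feature_name)
print(X_hits)
```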

#### Creating an `ExperimentAxisQuery`

To create an `ExperimentAxisQuery` you can call the method `$axis_query()` of a SOMA `Experiment`.

In [None]:
cell_filter <- "tissue_general == 'tongue' & cell_type %in% c('leukocyte', 'keratinocyte')"
gene_filter <- "feature_name %in% c('PECAM1', 'DCN', 'KRT13')"

query <- census$get("census_data")$get("homo_sapiens")$axis_query(
    measurement_name = "RNA",
    obs_query = SOMAAxisQuery$new(value_filter = cell_filter),
    var_query = SOMAAxisQuery$new(value_filter = gene_filter)
)

In [None]:
query

#### Inspecting the query results

Once the `ExperimentAxisQuery` is created you have access to a variety of convenient methods to fetch data or useful information about your query.

In [None]:
# Number of cells in query
query$n_obs

In [None]:
# Number of genes in query
query$n_vars

In [None]:
# Grabbing cell metadata
iterator <- query$obs(column_names = c("cell_type", "tissue_general"))
unique(as.data.frame(iterator$concat()))

In [None]:
# Grabbing gene metadata
iterator <- query$var()
as.data.frame(iterator$concat())

#### Exporting query results to `Seurat`

`ExperimentAxisQuery` has the capability to export the query to a `Seurat` object to use for downstream analysis with Seurat.

In [None]:
# Convert to Seurat
seurat <- query$to_seurat(X_layers = c(data = "normalized"), var_index = "feature_name")
seurat

In [None]:
# Example: doing an expression dot plot
Seurat::DotPlot(seurat, features = c('PECAM1', 'DCN', 'KRT13'), group.by="cell_type")

**🚨 NOTE:** The Census package provides a convenient way to get a Seurat object without creating an `ExperimentAxisQuery`:

```r
seurat_obj <- get_seurat(
   census = census,
   organism = organism,
   var_value_filter = gene_filter,
   obs_value_filter = cell_filter,
   obs_column_names = cell_columns
)
```


#### Exporting query results to `SingleCellExperiment`

`ExperimentAxisQuery` has the capability to export the query to a `SingleCellExperiment` object to use for downstream analysis.

In [None]:
# Convert to SingleCellExperiment
sce <- query$to_single_cell_experiment(X_layers = c(data = "normalized"), var_index = "feature_name")
sce

**🚨 NOTE:** The Census package provides a convenient way to get a SingleCellExperiment object without creating an `ExperimentAxisQuery`:

```r
sce_obj <- get_single_cell_experiment(
   census = census,
   organism = organism,
   var_value_filter = gene_filter,
   obs_value_filter = cell_filter,
   obs_column_names = cell_columns
)
```


#### Getting the expression data and metadata

An `ExperimentAxisQuery` has all the necessary functionality to obtain the expression matrix along with the corresponding cell and gene metadata.

`to_seurat()` and `to_single_cell_experiment()` (shown in the previous sections) use many of these methods under the hood.

Let's take a closer look. First, we can get the cell and gene metadata as follows:

In [None]:
# Get cell metadata, only cell types and SOMA IDs
obs <- query$obs(column_names = c("soma_joinid", "cell_type"))$concat()
obs <- as.data.frame(obs)
head(obs)

In [None]:
# Get gene metadata
var <- query$var()$concat()
var <- as.data.frame(var)
var

Now let's take a look at the expression matrix. There's a method `X()` that works similarly to reading a SOMA `SparseNDArray`: it returns a reader that can then be used to create matrix iterators.

Importantly `X()` will only return the rows and columns corresponding to cells and genes in the query, respectively.

In [None]:
# Get reader, results, and concatenate them.
# We need to specify the layer.
X <- query$X(layer_name = "raw")$sparse_matrix()$concat()
str(X)

In [None]:
dim(X)

**🚨 NOTE:** The shape of this matrix is 63,094,145 rows by 60,664 columns (the full size of Census).

However, we know that there are 17K cells and 3 genes in our query. The reason for this discrepancy is that we are taking a "view" of the Census matrix in sparse format.

We can re-index these values to strip away all other cells and genes not included in our query result. The R package has a convenient function to get a re-indexed, concatenated result.

In [None]:
x_reindexed <- query$to_sparse_matrix(collection = "X", layer_name = "raw", var_index="feature_name")
dim(x_reindexed)

In [None]:
x_reindexed[1:4,]
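
Conceptually, re-indexing keeps only the rows and columns whose IDs are in the query and relabels them. The sketch below shows that idea with the `Matrix` package on a small made-up matrix; the IDs, gene names, and dimensions are assumptions for illustration, and this is not the code `to_sparse_matrix()` runs internally.

```r
library(Matrix)

# A sparse "view": 10 cells by 6 genes, but only 3 stored values.
full_view <- sparseMatrix(
  i = c(3, 3, 8),
  j = c(2, 5, 5),
  x = c(1, 4, 2),
  dims = c(10, 6)
)

# Hypothetical zero-based IDs of the cells and genes matched by a query.
obs_ids <- c(2, 7)
var_ids <- c(1, 4)

# Keep only the queried rows and columns (IDs + 1 for R's one-based indexing).
compact <- full_view[obs_ids + 1, var_ids + 1, drop = FALSE]
dimnames(compact) <- list(as.character(obs_ids), c("GENE_A", "GENE_B"))

print(dim(full_view))
print(dim(compact))
```

The compact matrix has one row per queried cell and one column per queried gene, so its dimensions match the query rather than the full Census.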

**🚨 NOTE:** Just like a file, the Census should be closed.

In [None]:
census$close()
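
One way to make sure the handle is closed even when an analysis step fails is to register the close call with `on.exit()`. The wrapper below is a sketch: `with_census()` and `make_toy_census()` are hypothetical helpers, and the toy object only mimics the `$close()` method so the example runs without opening a real Census.

```r
# Open a handle, run an action on it, and always close it afterwards.
with_census <- function(open, action) {
  census <- open()
  on.exit(census$close(), add = TRUE)
  action(census)
}

# Hypothetical stand-in that records whether close() was called.
make_toy_census <- function() {
  state <- new.env()
  assign("closed", FALSE, envir = state)
  list(
    state = state,
    close = function() assign("closed", TRUE, envir = state)
  )
}

toy_census <- make_toy_census()

# The action fails, but the handle is still closed on the way out.
outcome <- tryCatch(
  with_census(function() toy_census, function(census) stop("query failed")),
  error = function(e) "recovered"
)
print(outcome)
print(get("closed", envir = toy_census$state))
```

With a real Census, `open` could be a function that calls `open_soma()` with the desired version.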

### Efficient compute capabilities of Census

Census has some methods that make use of SOMA streaming capabilities (iterators) to apply common calculations to millions of cells on an ordinary laptop.


#### Calculating average and variance across genes or cells

**🚨 NOTE:** This functionality is currently only available in the Python package `cellxgene_census`.

#### Getting highly variable genes

**🚨 NOTE:** This functionality is currently only available in the Python package `cellxgene_census`.

### Scalable modelling with PyTorch and Census

Census provides an `ExperimentDataPipe`. It is an implementation of [PyTorch's DataPipe interface](https://pytorch.org/data/main/torchdata.datapipes.iter.html), which defines a common mechanism for wrapping and accessing training data from any underlying source.

**🚨 NOTE:** This functionality is currently only available in the Python package `cellxgene_census`.