# Interoperability

## Motivation

As we have discussed in the [analysis frameworks and tools chapter], there are three main ecosystems for single-cell analysis: the [Bioconductor] and [Seurat] ecosystems in R and the Python-based [scverse] ecosystem. A common question from new analysts is which ecosystem they should focus on learning and using. While it makes sense to focus on one to start with, and a successful standard analysis can be performed in any ecosystem, we promote the idea that competent analysts should be familiar with all three ecosystems and comfortable moving between them. This approach allows analysts to use the best-performing tools and methods regardless of how they were implemented. When analysts are not comfortable moving between ecosystems, they tend to use packages that are easy to access, even when these have been shown to have shortcomings compared to packages in another ecosystem.

The ability of analysts to move between ecosystems also allows developers to take advantage of the different strengths of programming languages. For example, R has strong inbuilt support for complex statistical modelling, while the majority of deep learning libraries are focused on Python. By supporting common on-disk data formats and in-memory data structures, developers can be confident that analysts can access their package, and can focus on using the most appropriate platform for their method.

Another motivation for being comfortable with multiple ecosystems is the accessibility and availability of data, results and documentation. Often data or results are only made available in one format, and analysts will need to be familiar with that format in order to access them. A basic understanding of other ecosystems is also necessary to understand package documentation and tutorials when deciding which methods to use.

While we encourage analysts to be comfortable with all the major ecosystems, moving between them is only possible when they are interoperable. Thankfully, a lot of work has been done in this area, and in most cases moving data between ecosystems is now relatively simple using standard packages. In this chapter, we discuss the various ways data can be moved between ecosystems, via disk or in memory, the differences between them and their advantages. We focus on single-modality data and moving between R and Python as these are the most common cases, but we also touch on multimodal data and other languages.


## Disk-based interoperability

The first approach to moving between languages is via disk-based interoperability. This involves writing a file to disk in one language and then reading that file into a second language. In many cases, this approach is simpler, more reliable and more scalable than in-memory interoperability (which we discuss below), but it comes at the cost of greater storage requirements and reduced interactivity. Disk-based interoperability tends to work particularly well when there are established processes for each stage of analysis and you want to pass objects from one to the next (especially as part of a pipeline developed using a workflow manager such as [Nextflow] or [Snakemake]). However, disk-based interoperability is less convenient for interactive steps such as data exploration or experimenting with methods, as you need to write a new file whenever you want to move between languages.

### Simple formats

Before discussing file formats specifically developed for single-cell data we want to briefly mention that common simple text file formats (such as CSV, TSV, JSON etc.) can often be the answer to transferring data between languages. They work well in cases where some analysis has been performed and what you want to transfer is a subset of the information about an experiment. For example, you may want to transfer only the cell metadata but do not require the feature metadata, expression matrices etc. The advantage of using simple text formats is that they are well supported by almost any language and do not require single-cell specific packages. However, they can quickly become impractical as what you want to transfer becomes more complex.
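
As a minimal sketch of this idea, the example below writes a small table of cell metadata to a CSV file and reads it back using only the Python standard library. The cell identifiers, column names and values are hypothetical and chosen purely for illustration.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical cell metadata: one dictionary per cell (illustrative values only).
cell_metadata = [
    {"cell_id": "cell_1", "cluster": "T cell", "n_counts": 1532},
    {"cell_id": "cell_2", "cluster": "B cell", "n_counts": 987},
    {"cell_id": "cell_3", "cluster": "T cell", "n_counts": 2210},
]


def write_cell_metadata(rows, path):
    """Write a list of per-cell dictionaries to a CSV file with a header row."""
    fieldnames = list(rows[0].keys())
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def read_cell_metadata(path):
    """Read the CSV file back as a list of dictionaries (all values are strings)."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


csv_path = Path(tempfile.mkdtemp()) / "cell_metadata.csv"
write_cell_metadata(cell_metadata, csv_path)
restored = read_cell_metadata(csv_path)
```

Note that plain text formats do not store data types, so numeric columns such as `n_counts` come back as strings and need to be converted by whichever language reads the file.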

### HDF5-based formats

The most common disk formats for single-cell data are based on [Hierarchical Data Format version 5] or HDF5. This is an open-source file format designed for storing large, complex and heterogeneous data. It has a file directory type structure (similar to how files and folders are organised on your computer) which allows many different kinds of data to be stored in a single file with an arbitrarily complex hierarchy. While this format is very flexible, to properly interact with it you need to know where and how the different information is stored. For this reason, standard specifications for storing single-cell data in HDF5 files have been developed.
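
To make the directory-like structure more concrete, here is a small sketch using the **h5py** and **numpy** packages (assumed to be installed). The group and dataset names are hypothetical and do not follow any single-cell specification; they only illustrate how groups behave like folders and datasets like files.

```python
import tempfile
from pathlib import Path

import h5py
import numpy as np

h5_path = Path(tempfile.mkdtemp()) / "example.h5"

# Write: groups act like folders and datasets like files inside them.
with h5py.File(h5_path, "w") as f:
    # Intermediate groups ("expression") are created automatically.
    f.create_dataset("expression/counts", data=np.arange(6).reshape(2, 3))
    cells = f.create_group("cells")
    cells.create_dataset("n_counts", data=np.array([3, 12]))
    # Attributes attach small pieces of metadata to the file, a group or a dataset.
    f.attrs["description"] = "toy example"

# Read: walk the hierarchy to discover what is stored and where.
with h5py.File(h5_path, "r") as f:
    stored_paths = []
    f.visit(stored_paths.append)  # collects the path of every group and dataset
    counts = f["expression/counts"][:]  # load one dataset into memory
```

Without knowing in advance that the matrix lives at `expression/counts`, a reader would have to explore the file as above, which is why shared specifications for the layout are so useful.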

!!! HDF5 OVERVIEW IMAGE !!!

#### H5AD

The H5AD format is the HDF5 disk representation of the `AnnData` object used by scverse packages and is commonly used to share single-cell datasets. As it is part of the scverse ecosystem, reading and writing these files from Python is well-supported and is part of the core functionality of the **anndata** package.

In [1]:
# READING/WRITING H5AD WITH ANNDATA

Several packages exist for reading and writing H5AD files from R. While they result in a file on disk these packages usually rely on wrapping the Python **anndata** package to handle the actual reading and writing of files with an in-memory conversion step to convert between R and Python.

##### Reading/writing H5AD with Bioconductor

The [Bioconductor **{zellkonverter}** package] helps make this easier by using the [**{basilisk}** package] to manage creating an appropriate Python environment. If that all sounds a bit technical, the end result is that Bioconductor users can read and write H5AD files using commands like those below without requiring any knowledge of Python.

In [2]:
# READING/WRITING H5AD WITH ZELLKONVERTER

**{zellkonverter}** has additional options, such as allowing you to selectively read or write parts of an object. Please refer to the documentation for more details. Similar functionality for writing a `SingleCellExperiment` object to an H5AD file can be found in the [**{sceasy}** package]. While these packages are effective, wrapping Python requires some overhead, which we hope will be addressed by native R H5AD writers/readers in the future.

##### Reading/writing H5AD with Seurat

Converting between a `Seurat` object and an H5AD file is a two-step process [as suggested by this tutorial]. First, the object is written to disk as a `.h5Seurat` file (a custom HDF5 format for `Seurat` objects) using the [**{SeuratDisk}** package], and then this file is converted to an H5AD file.

In [3]:
# READING/WRITING H5AD WITH SEURAT

Note that because the structure of a `Seurat` object is quite different from `AnnData` and `SingleCellExperiment` objects the conversion process is more complex. See the [documentation of the conversion function] for more details on how this is done.

#### Loom

The [Loom file format] is an HDF5 specification for omics data. Unlike H5AD, it is not linked to a specific analysis package, although the structure is similar to `AnnData` and `SingleCellExperiment` objects. Packages implementing the Loom format exist for both [R] and [Python]. However, it is often more convenient to use the higher-level interfaces provided by the core ecosystem packages. Apart from sharing datasets, another common place Loom files are encountered is when spliced/unspliced reads are quantified using [velocyto] for [RNA velocity analysis].

### RDS files

Another file format you may see used to share single-cell datasets is the RDS format. This is a binary format used to serialise arbitrary R objects (similar to Python Pickle files). As `SingleCellExperiment` and `Seurat` objects did not always have matching on-disk representations, RDS files are sometimes used to share the results from R analyses. While this is acceptable within an analysis project, we discourage its use for sharing data publicly or with collaborators due to the lack of interoperability with other ecosystems. Instead, we recommend using one of the HDF5 formats mentioned above that can be read from multiple languages.

### New on-disk formats

While HDF5-based formats are currently the standard for on-disk representations of single-cell data other newer technologies such as [Zarr] and [TileDB] have some advantages, particularly for very large datasets and other modalities. We expect specifications to be developed for these formats in the future which may be adopted by the community (**anndata** already provides support for Zarr files).

## In-memory interoperability

The second approach to interoperability is to work on in-memory representations of an object. This approach involves active sessions from two programming languages running at the same time and either accessing the same object from both or converting between them as needed. Usually, one language acts as the main environment and there is an interface to the other language. This can be very useful for interactive analysis as it allows an analyst to work in two languages simultaneously. It is also often used when creating documents that use multiple languages (such as this book). However, in-memory interoperability has some drawbacks as it requires the analyst to be familiar with setting up and using both environments, more complex objects are often not supported by both languages and there is a greater memory overhead as data can easily become duplicated (making it difficult to use for larger datasets).

### Interoperability between R ecosystems

Before we look at in-memory interoperability between R and Python first let’s consider the simpler case of converting between the two R ecosystems. The **{Seurat}** package provides functions for performing this conversion [as described in this vignette].

In [None]:
# CONVERTING TO/FROM SINGLECELLEXPERIMENT/SEURAT

The difficult part here is due to the differences between the structures of the two objects. It is important to make sure the arguments are set correctly so that the conversion functions know which information to convert and where to place it.

In many cases, it may not be necessary to convert a `Seurat` object in order to use Bioconductor packages. This is because many of the most commonly used Bioconductor functions for single-cell analysis have been written to accept raw matrices as well as more complex objects. This means you can often provide the necessary part of a `Seurat` object directly to a Bioconductor function.


In [5]:
# USING A BIOCONDUCTOR FUNCTION ON A SEURAT OBJECT

However, it is important to be sure you are accessing the right information and storing any results in the correct place if needed.

### Accessing R from Python

The Python interface to R is provided by the [**rpy2** package]. This allows you to access R functions and objects from Python. For example:

In [4]:
# SIMPLE RPY2 USAGE

If you are using a Jupyter notebook (as we are for this book) you can use the IPython magic interface to create cells with native R code (passing objects as required).

In [6]:
# SIMPLE MAGIC CELL

This is the approach you will most commonly see in later chapters. For more information about using **rpy2** please refer to [the documentation].

To work with single-cell data in this way the [**anndata2ri**] package is especially useful. This is an extension to **rpy2** which allows R to see `AnnData` objects as `SingleCellExperiment` objects. This avoids unnecessary conversion and makes it easy to run R code on a Python object.

In [7]:
# USING AN R FUNCTION ON AN ANNDATA

Note that you will still run into issues if an object (or part of it) cannot be interfaced correctly (for example if there is an unsupported data type). In that case, you may need to modify your object first before it can be accessed.

### Accessing Python from R

Accessing Python from an R session is similar to accessing R from Python but here the interface is provided by the [**{reticulate}** package]. Once it is loaded we can access Python functions and objects from R.

In [8]:
# SIMPLE RETICULATE USAGE

If you are working in an [RMarkdown] or [Quarto] document you can also write native Python chunks using the **{reticulate}** Python engine. When we do this we can use the magic `r` and `py` variables to access objects in the other language.

In [9]:
# SIMPLE PYTHON CHUNK

Unlike **anndata2ri**, there are no R packages that provide a direct interface for Python to view `SingleCellExperiment` or `Seurat` objects as `AnnData` objects.  However, we can still access most parts of an `AnnData` using **{reticulate}**.

In [10]:
# ACCESSING ANNDATA

The R **{anndata}** package provides an R interface for `AnnData` objects, but it is not currently used by many analysis packages.

For more complex analysis that requires a whole object to work on it may be necessary to completely convert an object from R to Python. This is not memory efficient as it creates a duplicate of the data but it does provide access to a greater range of packages. The **{zellkonverter}** package provides a function for doing this conversion (note that, unlike the function for reading H5AD files, this uses the normal Python environment rather than a specially created one). 


In [11]:
# SCE2ANNDATA

The created object can then be used by Python functions and the results converted back to R.

In [12]:
# ANNDATA2SCE

The **{sceasy}** package also provides this functionality and can additionally convert between `Seurat` and `AnnData` objects.

In [13]:
# SCEASY ANNDATA <-> SEURAT

## Interoperability for multimodal data

The developers of the `MuData` object, which we introduced in the [analysis frameworks and tools chapter] as an extension of `AnnData` for multimodal datasets, have considered interoperability in their design. While the main platform for MuData is Python, the authors have provided the [MuDataSeurat R package] for reading the on-disk H5MU format as `Seurat` objects and the [MuData R package] for doing the same with Bioconductor `MultiAssayExperiment` objects. This official support is very useful but there are still some inconsistencies due to differences between the objects. The MuData authors also provide a [Julia implementation].

## Interoperability with other languages

Here we briefly list some resources and tools for the interoperability of single-cell data with languages other than R and Python.

### Julia

- [Muon.jl] provides Julia implementations of AnnData and MuData objects, as well as IO for the H5AD and H5MU formats
- [scVI.jl] provides a Julia implementation of AnnData as well as IO for the H5AD format

### JavaScript

- [Vitessce] contains loaders for `AnnData` objects stored using the Zarr format
- The [kana family] supports reading H5AD files and `SingleCellExperiment` objects saved as RDS files

### Rust

- [anndata-rs] provides a Rust implementation of AnnData as well as advanced IO support for the H5AD format


## Session information

## References

```{bibliography}
:filter: docname in docnames
:labelprefix: int
```

## Contributors

We gratefully acknowledge the contributions of:

### Authors

* Luke Zappia

### Reviewers

* Lukas Heumos
* Isaac Virshup