# Anndata
anndata is a Python package for handling annotated data matrices in memory and on disk, positioned between pandas and xarray. anndata offers a broad range of computationally efficient features including, among others, sparse data support, lazy operations, and a PyTorch interface.
https://anndata.readthedocs.io/en/latest/


In [1]:
import numpy as np
import pandas as pd
import anndata as ad
from scipy.sparse import csr_matrix
print(ad.__version__)

0.12.0


In [2]:
# Let’s start by building a basic AnnData object with some sparse count information, perhaps representing gene expression counts.
counts = csr_matrix(np.random.poisson(1, size=(100, 2000)), dtype=np.float32)
adata = ad.AnnData(counts)
adata

AnnData object with n_obs × n_vars = 100 × 2000

In [4]:
# We see that AnnData provides a representation with summary statistics of the data. The initial data we passed are accessible as a sparse matrix using adata.X.
adata.X

<Compressed Sparse Row sparse matrix of dtype 'float32'
	with 126573 stored elements and shape (100, 2000)>

In [3]:
adata.obs_names = [f"Cell_{i:d}" for i in range(adata.n_obs)]
adata.var_names = [f"Gene_{i:d}" for i in range(adata.n_vars)]
print(adata.obs_names[:10])

Index(['Cell_0', 'Cell_1', 'Cell_2', 'Cell_3', 'Cell_4', 'Cell_5', 'Cell_6',
       'Cell_7', 'Cell_8', 'Cell_9'],
      dtype='object')


## Subsetting AnnData
These index values can be used to subset the AnnData, which provides a view of the AnnData object. This is useful for restricting the AnnData to particular cell types or gene modules of interest. The rules for subsetting AnnData are quite similar to those of a Pandas DataFrame. You can use values in obs_names/var_names, boolean masks, or integer indices.

In [6]:
adata[["Cell_1", "Cell_10"], ["Gene_5", "Gene_1900"]]

View of AnnData object with n_obs × n_vars = 2 × 2

## Adding aligned metadata
### Observation/Variable level
So we have the core of our object, and now we’d like to add metadata at both the observation and variable levels. This is pretty simple with AnnData: both adata.obs and adata.var are Pandas DataFrames.

In [7]:
ct = np.random.choice(["B", "T", "Monocyte"], size=(adata.n_obs,))
adata.obs["cell_type"] = pd.Categorical(ct)  # Categoricals are preferred for efficiency
adata.obs

,cell_type
Cell_0,B
Cell_1,B
Cell_2,B
Cell_3,T
Cell_4,T
...,...
Cell_95,B
Cell_96,Monocyte
Cell_97,T
Cell_98,Monocyte


In [9]:
# We can also see now that the AnnData representation has been updated:
adata

AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'cell_type'

### Subsetting using metadata
We can also subset the AnnData using these randomly generated cell types:

In [10]:
bdata = adata[adata.obs.cell_type == "B"]
bdata

View of AnnData object with n_obs × n_vars = 35 × 2000
    obs: 'cell_type'

## Observation/variable-level matrices
We might also have metadata at either level that has many dimensions to it, such as a UMAP embedding of the data. For this type of metadata, AnnData has the .obsm/.varm attributes. We use keys to identify the different matrices we insert. The restriction on .obsm/.varm is that the length of each .obsm matrix must equal the number of observations (.n_obs) and the length of each .varm matrix must equal .n_vars. Each matrix can independently have a different number of dimensions.
Let’s start with a randomly generated matrix that we can interpret as a UMAP embedding of the data we’d like to store, as well as some random gene-level metadata:

In [11]:
adata.obsm["X_umap"] = np.random.normal(0, 1, size=(adata.n_obs, 2))
adata.varm["gene_stuff"] = np.random.normal(0, 1, size=(adata.n_vars, 5))
adata.obsm

AxisArrays with keys: X_umap

In [12]:
# Again, the AnnData representation is updated.
adata

AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'cell_type'
    obsm: 'X_umap'
    varm: 'gene_stuff'

A few more notes about .obsm/.varm

The “array-like” metadata can originate from a Pandas DataFrame, scipy sparse matrix, or numpy dense array.

When using scanpy, their values (columns) are not easily plotted, whereas items from .obs are easily plotted on, e.g., UMAP plots.

## Unstructured metadata
AnnData has .uns, which allows for any unstructured metadata. This can be anything, like a list or a dictionary with some general information that was useful in the analysis of our data.

In [13]:
adata.uns["random"] = [1, 2, 3]
adata.uns

OrderedDict([('random', [1, 2, 3])])

## Layers
Finally, we may have different forms of our original core data, perhaps one that is normalized and one that is not. These can be stored in different layers in AnnData. For example, let’s log transform the original data and store it in a layer:

In [14]:
adata.layers["log_transformed"] = np.log1p(adata.X)
adata

AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'cell_type'
    uns: 'random'
    obsm: 'X_umap'
    varm: 'gene_stuff'
    layers: 'log_transformed'

## Conversion to DataFrames
We can also ask AnnData to return us a DataFrame from one of the layers:

In [15]:
adata.to_df(layer="log_transformed")
# We see that the .obs_names/.var_names are used in the creation of this Pandas object.

,Gene_0,Gene_1,Gene_2,Gene_3,Gene_4,Gene_5,Gene_6,Gene_7,Gene_8,Gene_9,...,Gene_1990,Gene_1991,Gene_1992,Gene_1993,Gene_1994,Gene_1995,Gene_1996,Gene_1997,Gene_1998,Gene_1999
Cell_0,0.000000,0.693147,0.000000,1.098612,0.000000,0.000000,0.693147,1.098612,0.693147,1.386294,...,0.693147,1.098612,0.000000,0.000000,1.098612,0.693147,0.693147,0.000000,0.693147,0.693147
Cell_1,1.791759,0.000000,0.693147,1.386294,1.098612,0.693147,1.098612,0.693147,0.000000,0.693147,...,0.693147,1.098612,0.693147,0.000000,1.098612,1.098612,1.386294,0.000000,0.693147,0.693147
Cell_2,0.000000,0.693147,0.693147,1.098612,0.000000,1.098612,0.000000,0.693147,1.098612,0.693147,...,0.693147,0.693147,0.693147,0.693147,1.098612,0.000000,0.000000,0.693147,0.693147,0.000000
Cell_3,1.098612,1.098612,0.000000,0.693147,1.098612,0.693147,0.693147,0.693147,0.693147,0.000000,...,0.000000,1.098612,0.693147,0.000000,0.693147,1.609438,0.000000,1.098612,0.000000,0.693147
Cell_4,1.386294,0.693147,1.098612,1.098612,0.000000,0.693147,1.386294,0.000000,1.098612,0.000000,...,0.000000,0.000000,0.693147,0.693147,0.000000,0.000000,0.000000,0.693147,1.098612,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Cell_95,0.000000,0.000000,1.098612,0.693147,0.693147,0.000000,0.693147,0.693147,1.098612,0.000000,...,0.000000,0.693147,1.386294,0.693147,0.693147,0.693147,0.000000,0.693147,0.693147,0.000000
Cell_96,0.693147,1.098612,0.000000,0.000000,0.000000,1.098612,1.098612,0.000000,0.000000,0.693147,...,1.098612,1.386294,0.000000,0.000000,0.000000,1.791759,1.098612,0.693147,1.098612,0.000000
Cell_97,0.000000,1.386294,0.693147,0.693147,0.693147,0.000000,0.000000,0.693147,0.693147,0.000000,...,0.000000,0.693147,0.693147,0.000000,0.000000,0.000000,0.693147,0.000000,0.693147,0.000000
Cell_98,0.693147,0.000000,1.098612,1.098612,1.098612,1.098612,0.000000,0.693147,0.693147,0.693147,...,0.693147,0.693147,0.000000,1.609438,0.000000,0.000000,0.693147,1.609438,1.098612,0.000000


## Writing the results to disk
AnnData comes with its own persistent HDF5-based file format: h5ad. If string columns with a small number of categories aren’t yet categoricals, AnnData will automatically convert them to categoricals when writing.

In [16]:
adata.write('my_results.h5ad', compression="gzip")

In [17]:
!h5ls 'my_results.h5ad'

X                        Group
layers                   Group
obs                      Group
obsm                     Group
obsp                     Group
uns                      Group
var                      Group
varm                     Group
varp                     Group


## Views and copies
For the fun of it, let’s look at another metadata use case. Imagine that the observations come from instruments characterizing 10 readouts in a multi-year study with samples taken from different subjects at different sites. We’d typically get that information in some format and then store it in a DataFrame:

In [18]:
obs_meta = pd.DataFrame({
        'time_yr': np.random.choice([0, 2, 4, 8], adata.n_obs),
        'subject_id': np.random.choice(['subject 1', 'subject 2', 'subject 4', 'subject 8'], adata.n_obs),
        'instrument_type': np.random.choice(['type a', 'type b'], adata.n_obs),
        'site': np.random.choice(['site x', 'site y'], adata.n_obs),
    },
    index=adata.obs.index,    # these are the same IDs of observations as above!
)

# This is how we join the readout data with the metadata. Of course, the first argument of the following call for X could also just be a DataFrame.


In [19]:
adata = ad.AnnData(adata.X, obs=obs_meta, var=adata.var)
# Now we again have a single data container that keeps track of everything.

In [20]:
print(adata)

AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'time_yr', 'subject_id', 'instrument_type', 'site'


Subsetting the joint data matrix can be important to focus on subsets of variables or observations, or to define train-test splits for a machine learning model.
Similar to numpy arrays, AnnData objects can either hold actual data or reference another AnnData object. In the latter case, they are referred to as “views”.

Subsetting AnnData objects always returns views, which has two advantages:

- no new memory is allocated
- it is possible to modify the underlying AnnData object

You can get an actual AnnData object from a view by calling .copy() on the view. Usually, this is not necessary, as any modification of elements of a view (calling .[] on an attribute of the view) internally calls .copy() and makes the view an AnnData object that holds actual data. See the example below.

In [21]:
# Get access to the first 5 rows for two variables. Indexing into AnnData will assume that integer arguments to [] behave like .iloc in pandas, whereas string arguments behave like .loc. AnnData always assumes string indices.
adata[:5, ['Gene_1', 'Gene_3']]
# This is a view! If we want an AnnData that holds the data in memory, let’s call .copy()

View of AnnData object with n_obs × n_vars = 5 × 2
    obs: 'time_yr', 'subject_id', 'instrument_type', 'site'

In [22]:
adata_subset = adata[:5, ['Gene_1', 'Gene_3']].copy()

In [23]:
# For a view, we can also set the first 3 elements of a column.
print(adata[:3, 'Gene_1'].X.toarray().tolist())
adata[:3, 'Gene_1'].X = [0, 0, 0]
print(adata[:3, 'Gene_1'].X.toarray().tolist())

[[1.0], [0.0], [1.0]]
[[0.0], [0.0], [0.0]]




In [24]:
# If you try to modify parts of a view of an AnnData, the content will be auto-copied and a data-storing object will be generated.
adata_subset = adata[:3, ['Gene_1', 'Gene_2']]
adata_subset

View of AnnData object with n_obs × n_vars = 3 × 2
    obs: 'time_yr', 'subject_id', 'instrument_type', 'site'

In [25]:
adata_subset.obs['foo'] = range(3)



In [26]:
# Now adata_subset stores the actual data and is no longer just a reference to adata.
adata_subset

AnnData object with n_obs × n_vars = 3 × 2
    obs: 'time_yr', 'subject_id', 'instrument_type', 'site', 'foo'

In [27]:
# Evidently, you can use all of pandas to slice with sequences or boolean indices.
adata[adata.obs.time_yr.isin([2, 4])].obs.head()

,time_yr,subject_id,instrument_type,site
Cell_0,2,subject 1,type a,site x
Cell_2,4,subject 4,type a,site x
Cell_4,4,subject 2,type b,site y
Cell_5,4,subject 2,type a,site y
Cell_6,4,subject 2,type a,site y
