## AnnData Tutorial

AnnData is a Python package that provides a data structure for matrix-like data that needs indexed annotations as well as unstructured metadata. It is used with Scanpy for analysing scRNA-seq data.

In [1]:
import numpy as np
import pandas as pd
import anndata as ad
from scipy.sparse import csr_matrix
print(ad.__version__)

0.12.2


### Creating and reading data
Here we create some arbitrary random count data and store it as a `csr_matrix` (SciPy's compressed sparse row class). This is just a stand-in for sparse count data such as gene expression in a sequencing data set, where we have many cells/nuclei that each have a small number of counts across many genes.

We then read the data into a structured AnnData object, where each row becomes an observation and each column becomes a variable. In our example, the cells are the observations and the genes are the variables.

In [2]:
# Create some random counts and store them as a sparse csr_matrix
counts = csr_matrix(np.random.poisson(1, size=(100, 2000)), dtype=np.float32)

# Read them into an AnnData structure
adata = ad.AnnData(counts)
adata

AnnData object with n_obs × n_vars = 100 × 2000

In [3]:
# X is the part of the AnnData structure that holds the actual data matrix, the counts in this case
adata.X

<Compressed Sparse Row sparse matrix of dtype 'float32'
	with 126451 stored elements and shape (100, 2000)>

### Indices

Each dimension of the AnnData object has an index. In our case these are the cell IDs and the gene IDs. So we can reference each single cell and, for each single gene, read how many counts of that gene were found in this cell.

The observation (cell) annotations can be accessed and changed via the `.obs` attribute, and the variable (gene) annotations via the `.var` attribute.

Here we use the `.obs_names` and `.var_names` attributes to assign a name to each observation and variable in our artificial data.

In [4]:
# Making some cell and gene names
adata.obs_names = [f"Cell_{i:d}" for i in range(adata.n_obs)] # One name per observation (row of adata)
adata.var_names = [f"Gene_{i:d}" for i in range(adata.n_vars)] # One name per variable (column of adata)

print(adata.obs_names[:10])
print(adata.var_names[:10])

Index(['Cell_0', 'Cell_1', 'Cell_2', 'Cell_3', 'Cell_4', 'Cell_5', 'Cell_6',
       'Cell_7', 'Cell_8', 'Cell_9'],
      dtype='object')
Index(['Gene_0', 'Gene_1', 'Gene_2', 'Gene_3', 'Gene_4', 'Gene_5', 'Gene_6',
       'Gene_7', 'Gene_8', 'Gene_9'],
      dtype='object')


### Subsetting AnnData

We can now use the names of observations and variables as indices to subset and slice the AnnData object however we want!

This is handy for later analysis, for example to pull out all genes with certain names (if you have a list of gene names belonging to a certain family) or to reference a set of specially annotated cells (foreshadowing).

In [5]:
# Suppose we are interested in genes 5, 1501 and 7 of cells 5 and 60
adata[["Cell_5", "Cell_60"], ["Gene_5", "Gene_1501", "Gene_7"]] # First rows, then columns
# If you reference by name, the names have to go into lists

# Or cells 4 to 59 and genes 300 to 599 (the end of a slice is exclusive)
adata[4:60, 300:600]

adata[["Cell_5", "Cell_6"], ["Gene_99", "Gene_100"]]

View of AnnData object with n_obs × n_vars = 2 × 2

### Adding aligned metadata

Between the "column" that has the row names (cell IDs) and the row that has the column names (gene IDs here), there can be any number of other descriptive __Annotation Layers__ that contain metadata like alternative gene names, cell barcodes, plate origin, batch ID, or known gene origin (like mito genes for example).

These annotations are stored as columns of `.obs` and `.var`, which are regular pandas DataFrames. That is why we imported pandas, so the pandas tutorial comes in handy here.

Here we use `np.random` again to generate an array of random categorical data containing fictitious cell types.

In [6]:
# Generate random cell types, one per observation (row of adata)
cell_types = np.random.choice(["IHC", "OHC", "Pillar Cell", "Fibrocyte_3"], size= (adata.n_obs,))

adata.obs["cell_type"] = pd.Categorical(cell_types) # A pandas Categorical is more memory-efficient than plain strings for repeated labels

adata.obs

Unnamed: 0,cell_type
Cell_0,IHC
Cell_1,OHC
Cell_2,Pillar Cell
Cell_3,IHC
Cell_4,OHC
...,...
Cell_95,Pillar Cell
Cell_96,OHC
Cell_97,Fibrocyte_3
Cell_98,OHC


Eeeeh wolla! We got an extra column that tells us the cell type of each cell, sweet, neat, pristeet.

In [7]:
adata # Mind that the observations have changed! There's now a new column

AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'cell_type'

### Subsetting Metadata

Now you can also subset all the observations by any annotation layer, for example to keep only the cells belonging to a certain cell type.

This works because an AnnData object, like a pandas DataFrame or a NumPy array, accepts boolean masks for indexing. The mask is computed from the original object, which is why `adata` appears again inside the brackets.

We can put a logical expression in the brackets and filter on an annotation layer. If there is only one argument in the brackets, it always refers to rows (observations).

In [8]:
adata[adata.obs.cell_type.str.contains("HC")] # Keep only the hair cells (IHC and OHC), for example

View of AnnData object with n_obs × n_vars = 52 × 2000
    obs: 'cell_type'

In [9]:
# Or filter on rows and pick a column at the same time
adata[adata.obs.cell_type == "Pillar Cell", "Gene_9"]

View of AnnData object with n_obs × n_vars = 25 × 1
    obs: 'cell_type'

### Multidimensional Annotations

Metadata can have more than one dimension too, for example a UMAP projection with two coordinates per cell. This is handled in the `.obsm` and `.varm` attributes. Using those, you can attach whole matrices and identify them with keys.

<font color='red'> **HOWEVER!!!** </font>

The first dimension (number of rows) of anything stored in `.obsm` needs to match `n_obs`, and the first dimension of anything stored in `.varm` needs to match `n_vars`.

In [10]:
# We create one matrix per axis: 2 columns for each observation and 2 for each variable
n_dim_obs_annotation = np.random.normal(0, 1, size= (adata.n_obs, 2))
n_dim_var_annotation = np.random.normal(0, 1, size= (adata.n_vars, 2))

adata.obsm["X_umap"] = n_dim_obs_annotation
adata.varm["gene_things"] = n_dim_var_annotation

adata.varm
adata.obsm

AxisArrays with keys: X_umap

In [11]:
adata

AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'cell_type'
    obsm: 'X_umap'
    varm: 'gene_things'

### Unstructured Metadata

There's a separate little field in an AnnData object that corresponds to neither vars nor obs. It is for unstructured metadata. This can be the name of the sample or batch, it could be some experiment ID or the date or just a description of the analysis or whatever. It can have any structure, like a list or array or dataframe.

`.uns` itself behaves like a dictionary, and nested dictionaries with string keys can generally be saved to `.h5ad`. Saving fails when an entry contains something the file format cannot encode (for example arbitrary Python objects or non-string keys), so if writing throws an error, check what you put into `.uns` first.

Use the `.uns` attribute to access it.

In [12]:
# Creating an unstructured metadata entry and filling it widdabuncha crepe
adata.uns["randothing"] = ["This is the experiment A", "2025-03-11", "Johann was here"]

adata.uns

OrderedDict([('randothing',
              ['This is the experiment A', '2025-03-11', 'Johann was here'])])

### Layers

This is different from when I say "annotation layer". Annotations are just labels (2D at most) that go on the edges of our data matrix to tell us what is in those rows and columns and what their names are.
A layer here is a complete parallel version of the data matrix, with the same shape as `.X`, stored alongside it.

This is useful to track analysis steps. If you transform the data, for example, you can save the result to a new layer and preserve the pre-transformed data in `.X`.

**Note:** Not everything is stored in layers! A layer only holds a data matrix; the annotations (`.obs`, `.var` and so on) are shared, not duplicated. Indexing `adata.layers["Layer_A"]` returns that matrix directly, so there is no `.X` to call on it.

In [None]:
adata.layers["log_transform"] = np.log1p(adata.X) # IMPORTANT! A layer only holds a matrix, not the annotations
adata # See how there's a layer now!

AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'cell_type'
    uns: 'randothing'
    obsm: 'X_umap'
    varm: 'gene_things'
    layers: 'log_transform'

In [14]:
adata.to_df() # Look, the plain counts from NumPy's Poisson sampler, shown as a dense DataFrame

Unnamed: 0,Gene_0,Gene_1,Gene_2,Gene_3,Gene_4,Gene_5,Gene_6,Gene_7,Gene_8,Gene_9,...,Gene_1990,Gene_1991,Gene_1992,Gene_1993,Gene_1994,Gene_1995,Gene_1996,Gene_1997,Gene_1998,Gene_1999
Cell_0,0.0,0.0,1.0,0.0,1.0,2.0,2.0,0.0,0.0,0.0,...,1.0,1.0,2.0,1.0,0.0,1.0,1.0,1.0,0.0,3.0
Cell_1,2.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,1.0,1.0,2.0,2.0,2.0,0.0,0.0
Cell_2,1.0,1.0,2.0,2.0,1.0,1.0,3.0,1.0,0.0,1.0,...,2.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
Cell_3,0.0,1.0,1.0,2.0,0.0,1.0,1.0,0.0,4.0,1.0,...,1.0,1.0,2.0,0.0,3.0,0.0,0.0,2.0,1.0,1.0
Cell_4,0.0,1.0,0.0,0.0,1.0,1.0,2.0,1.0,0.0,1.0,...,1.0,1.0,2.0,1.0,4.0,0.0,1.0,2.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Cell_95,1.0,0.0,2.0,1.0,1.0,1.0,0.0,1.0,4.0,2.0,...,2.0,2.0,2.0,1.0,1.0,1.0,4.0,0.0,0.0,0.0
Cell_96,1.0,3.0,0.0,0.0,3.0,0.0,3.0,1.0,1.0,1.0,...,0.0,2.0,2.0,2.0,0.0,1.0,0.0,0.0,0.0,1.0
Cell_97,0.0,0.0,1.0,2.0,1.0,2.0,3.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,3.0,2.0,2.0,1.0,0.0
Cell_98,2.0,2.0,0.0,0.0,1.0,2.0,1.0,0.0,0.0,1.0,...,1.0,1.0,0.0,1.0,1.0,2.0,1.0,0.0,1.0,2.0


In [15]:
adata.to_df(layer= "log_transform") # BEHOLD
# It has the same var and obs names and everything, but the data is different

Unnamed: 0,Gene_0,Gene_1,Gene_2,Gene_3,Gene_4,Gene_5,Gene_6,Gene_7,Gene_8,Gene_9,...,Gene_1990,Gene_1991,Gene_1992,Gene_1993,Gene_1994,Gene_1995,Gene_1996,Gene_1997,Gene_1998,Gene_1999
Cell_0,0.000000,0.000000,0.693147,0.000000,0.693147,1.098612,1.098612,0.000000,0.000000,0.000000,...,0.693147,0.693147,1.098612,0.693147,0.000000,0.693147,0.693147,0.693147,0.000000,1.386294
Cell_1,1.098612,0.000000,0.000000,0.000000,0.000000,0.693147,0.693147,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.693147,0.693147,0.693147,1.098612,1.098612,1.098612,0.000000,0.000000
Cell_2,0.693147,0.693147,1.098612,1.098612,0.693147,0.693147,1.386294,0.693147,0.000000,0.693147,...,1.098612,0.000000,0.693147,0.693147,0.000000,0.000000,0.000000,0.000000,0.693147,0.000000
Cell_3,0.000000,0.693147,0.693147,1.098612,0.000000,0.693147,0.693147,0.000000,1.609438,0.693147,...,0.693147,0.693147,1.098612,0.000000,1.386294,0.000000,0.000000,1.098612,0.693147,0.693147
Cell_4,0.000000,0.693147,0.000000,0.000000,0.693147,0.693147,1.098612,0.693147,0.000000,0.693147,...,0.693147,0.693147,1.098612,0.693147,1.609438,0.000000,0.693147,1.098612,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Cell_95,0.693147,0.000000,1.098612,0.693147,0.693147,0.693147,0.000000,0.693147,1.609438,1.098612,...,1.098612,1.098612,1.098612,0.693147,0.693147,0.693147,1.609438,0.000000,0.000000,0.000000
Cell_96,0.693147,1.386294,0.000000,0.000000,1.386294,0.000000,1.386294,0.693147,0.693147,0.693147,...,0.000000,1.098612,1.098612,1.098612,0.000000,0.693147,0.000000,0.000000,0.000000,0.693147
Cell_97,0.000000,0.000000,0.693147,1.098612,0.693147,1.098612,1.386294,0.693147,0.000000,0.693147,...,0.000000,0.000000,0.000000,0.000000,0.693147,1.386294,1.098612,1.098612,0.693147,0.000000
Cell_98,1.098612,1.098612,0.000000,0.000000,0.693147,1.098612,0.693147,0.000000,0.000000,0.693147,...,0.693147,0.693147,0.000000,0.693147,0.693147,1.098612,0.693147,0.000000,0.693147,1.098612


In [16]:
adata.layers["log_transform"]

<Compressed Sparse Row sparse matrix of dtype 'float32'
	with 126451 stored elements and shape (100, 2000)>

### Saving to disk

Things like clustering take a long time and annotation is painful to repeat, so save your work!

In [18]:
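# Clear the unstructured metadata before writing (a workaround in case .uns holds something that cannot be serialised)
adata.uns.clear()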

adata.write_h5ad("15months_Analysis3.h5ad", compression= "gzip")

### Copying and Analysing Large Files

Modifications to an AnnData object happen in place: no copy is created, and things are added or overwritten on the object itself.
Furthermore, every time you access an AnnData object via indexing or slicing, you actually only get what's called **a View** of adata. It does not store any data of its own and only refers back to the original object.

If you want to retain a subset of adata for any purpose, you need to add `.copy()` to the end of the instantiation of the subset (assuming you save it to a new variable).

There is also something called **"auto-copy"** though. If you take a subset of adata and then make a change to that subset, the View turns into an actual object that holds the changed data, and the copying happens on its own (with a warning).

Also! It is possible to read a file only partially from storage by passing `backed='r'` to the read call, like `ad.read_h5ad('file.h5ad', backed='r')`. The file then stays open for reading and the data is loaded on demand, which saves memory.
If you do this, you have to close the file manually at the end using `adata.file.close()`.

### Misc

Remember, the `.obs` and `.var` annotations are just pandas DataFrames, so you can use all the pandas functions on them. You can also use pandas logical functions like `isin()` when slicing.

Okay have fun!