# AnnData (annotated data) tutorial

## Introduction

AnnData is a Python package for handling data that is sparse, loosely structured, and annotated with metadata at both the observation and the feature level. It is designed for matrix data in which rows are observations and columns are features (variables).

AnnData indexes both the rows and the columns of the matrix, which allows metadata to be stored for each. This metadata includes row and column names, among other things.



## Getting Started with AnnData

The following code demonstrates how an AnnData object can be initialised and uses examples to explain the basic structure of the object.

In [2]:
# Python imports for basic AnnData usage
import numpy as np
import pandas as pd
import anndata as ad
from scipy.sparse import csr_matrix

In [15]:
# AnnData stores matrix objects, so the easiest way to create an AnnData object
# is to create a matrix object and then create an AnnData object from it.

# 100 observations x 2000 variables of Poisson-distributed counts, stored sparsely
counts = csr_matrix(np.random.poisson(1, size=(100, 2000)), dtype=np.float32)
adata = ad.AnnData(counts)
adata


AnnData object with n_obs × n_vars = 100 × 2000

*Note: AnnData also supports reading data from files; that is covered later.*

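Now that the AnnData object has been initialised, the data matrix it stores can be accessed through the attribute `adata.X`. For a sparse matrix, displaying it gives a short summary: its shape, data type, and number of stored elements.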

In [4]:
# Display the data matrix stored in the adata object
adata.X

<100x2000 sparse matrix of type '<class 'numpy.float32'>'
	with 126606 stored elements in Compressed Sparse Row format>

An important feature of the AnnData object is its ability to store metadata about rows and columns. One simple approach is to name the observations and variables by their index in the matrix, so that the first row is 'Cell_0' and the first column is 'Gene_0'. The number of observations and variables can be accessed using `adata.n_obs` and `adata.n_vars`.

In [14]:
# Initialize observation names
adata.obs_names = ['Cell_' + str(i) for i in range(adata.n_obs)]

# Initialize variable names
adata.var_names = ['Gene_' + str(i) for i in range(adata.n_vars)]
print(adata.obs_names[:10])
print(adata.var_names[:10])
print(adata)


Index(['Cell_0', 'Cell_1', 'Cell_2', 'Cell_3', 'Cell_4', 'Cell_5', 'Cell_6',
       'Cell_7', 'Cell_8', 'Cell_9'],
      dtype='object')
Index(['Gene_0', 'Gene_1', 'Gene_2', 'Gene_3', 'Gene_4', 'Gene_5', 'Gene_6',
       'Gene_7', 'Gene_8', 'Gene_9'],
      dtype='object')
AnnData object with n_obs × n_vars = 100 × 2000


Once metadata has been added to the AnnData object it becomes very useful, not only for adding context to the data but also for making the data easier to work with and understand. For example, the AnnData object can be subset using the added names rather than index positions.

In [8]:
subsetData = adata[["Cell_1", "Cell_2", "Cell_3"], ["Gene_1", "Gene_3"]]
print(subsetData)

View of AnnData object with n_obs × n_vars = 3 × 2


Subsetting the AnnData object does not require a symmetric subset: the number of observation names selected need not match the number of variable names. In cases such as gene expression analysis, this makes it easy to reduce the size of the working data.

Subsetting can also be performed using Boolean conditions on the values in `obs`. The cell below first adds a categorical `group` column to `obs` that such conditions can use.

In [13]:
ct = np.random.choice(["A", "B", "C"], size=(adata.n_obs,))
adata.obs['group'] = pd.Categorical(ct)
print(adata)
adata.obs

AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'group'


|         | group |
|---------|-------|
| Cell_0  | A     |
| Cell_1  | B     |
| Cell_2  | A     |
| Cell_3  | B     |
| Cell_4  | A     |
| ...     | ...   |
| Cell_95 | B     |
| Cell_96 | B     |
| Cell_97 | C     |
| Cell_98 | A     |
