# **Exploring average gene expression with atlasapprox API**

When investigating cell atlases, a common approach is to explore gene expression patterns across different cell types and organs. This tutorial will demonstrate how to use the `atlasapprox` Python API to query and analyze average gene expression data for any species in our database. We'll explore a range of use cases that illustrate various methods for accessing and comparing gene expression profiles. 

## **Initialize the API**

Import `atlasapprox` in Python and create an API object:

In [None]:
import atlasapprox

api = atlasapprox.API()

For detailed initial setup instructions, refer to the [Quick Start Tutorial](link to quick_start).

## **Querying average gene expression data**

A convenient way to query gene expression data is the `average` method of the `atlasapprox` API. It retrieves the average expression of selected genes across the cell types of a specific organ in an organism.

The following example demonstrates how to retrieve the average expression of four genes (*TP53*, *KRAS*, *EGFR*, *ALK*) in the human lung:

In [3]:
avg_gene_expr_lung = api.average(
    organism = "h_sapiens", 
    organ = "lung", 
    features = ["TP53", "KRAS", "EGFR", "ALK"]
    # measurement_type = 'gene_expression'
)

# Display the result
display(avg_gene_expr_lung)

Unnamed: 0,neutrophil,basophil,monocyte,macrophage,dendritic,B,plasma,T,NK,plasmacytoid,...,capillary,CAP2,lymphatic,fibroblast,alveolar fibroblast,smooth muscle,vascular smooth muscle,pericyte,mesothelial,ionocyte
TP53,0.054815,0.119978,0.327787,0.132754,0.238697,0.178123,0.038301,0.202786,0.239634,0.074571,...,0.219169,0.126632,0.254856,0.175867,0.16092,0.110756,0.193365,0.252695,0.152536,0.227391
KRAS,1.529643,0.436303,0.977728,0.489622,0.443576,0.562167,0.243355,0.82648,0.76758,0.467747,...,1.357422,1.346849,0.794639,0.249971,0.388602,0.582125,0.251247,0.708356,0.319821,0.714599
EGFR,0.016721,0.024823,0.028325,0.011413,0.000949,0.138468,0.031597,0.064145,0.051909,0.0,...,0.579046,0.736705,0.085129,0.666291,0.720418,0.670286,0.528496,0.980535,0.2257,2.810568
ALK,0.0,0.0,0.001285,0.013633,0.0,0.0,0.0,0.002077,0.0,0.0,...,0.001188,0.0,0.0,0.000689,0.004787,0.0,0.0,0.0,0.0,0.0


#### Output

The function returns a *Pandas DataFrame* where:

* Each row represents a gene.
* Each column corresponds to a cell type.
* The values indicate the average gene expression, measured in counts per ten thousand (cptt).
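
Because the result is an ordinary DataFrame, standard pandas operations apply to it. As a minimal sketch, the snippet below rebuilds a small part of the table above by hand (four of the cell types, values copied from the output) so that it runs without a network connection. It then uses `idxmax` to find, among those four cell types only, the one with the highest average expression of each gene. The name `lung_subset` is illustrative and not part of the API.

```python
import pandas as pd

# Hand-copied subset of the lung output above (values in cptt).
# With a live connection, use avg_gene_expr_lung instead.
lung_subset = pd.DataFrame(
    {
        "neutrophil": [0.054815, 1.529643, 0.016721, 0.0],
        "macrophage": [0.132754, 0.489622, 0.011413, 0.013633],
        "plasma": [0.038301, 0.243355, 0.031597, 0.0],
        "ionocyte": [0.227391, 0.714599, 2.810568, 0.0],
    },
    index=["TP53", "KRAS", "EGFR", "ALK"],
)

# For each gene (row), find the column (cell type) with the largest value.
top_cell_type = lung_subset.idxmax(axis=1)
print(top_cell_type)
```

On the full table, the same call would compare all lung cell types rather than these four.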

## **Querying average gene expression for multiple organs**

For comprehensive analysis, you may want to explore gene expression across multiple organs within the same species. 

This example shows the average gene expression for four specified genes (*TP53*, *KRAS*, *EGFR*, *ALK*) across three *human* organs (*bladder*, *blood*, and *colon*). In this case, the measurement type is automatically set to *gene_expression*.

In [4]:
# Choose the organs to query.
# To list every available organ instead, use:
# organ_list = api.organs("h_sapiens")
organ_list = ['bladder','blood','colon']

# loop through organ_list and display the results
for organ in organ_list: 
    avg_gene_expr = api.average(
        organism = "h_sapiens", 
        organ = organ, 
        features = ["TP53", "KRAS", "EGFR", "ALK"],
    )

    print(f'Average gene expression in human {organ}:')
    display(avg_gene_expr)

Average gene expression in human bladder:


Unnamed: 0,mast,macrophage,B,plasma,T,NK,plasmacytoid,urothelial,venous,capillary,lymphatic,fibroblast,smooth muscle,pericyte
TP53,0.051055,0.18824,0.327816,0.053807,0.147462,0.314548,0.398251,0.162376,0.339704,0.217849,0.104213,0.111125,0.162751,0.112216
KRAS,0.564742,0.690973,1.319512,0.357356,1.065008,1.569626,0.354131,0.537687,0.719438,0.811878,0.722906,0.393044,0.612407,0.582806
EGFR,0.014139,0.0208,0.011188,0.006583,0.007657,0.006421,0.0,0.290386,0.041818,0.076526,0.0,0.897405,0.48993,0.349536
ALK,0.001072,0.007006,0.0,0.0,0.003232,0.0,0.0,0.000645,0.0,0.0,0.0,0.000287,0.001065,0.003441


Average gene expression in human blood:


Unnamed: 0,HSC,neutrophil,basophil,myeloid,monocyte,macrophage,dendritic,erythrocyte,B,plasma,T,NK,plasmacytoid,platelet
TP53,0.429484,0.019245,0.550442,0.757884,0.28239,0.40935,0.153117,0.004213,0.287588,0.174535,0.205015,0.241251,0.401704,0.060797
KRAS,0.701804,1.378338,1.040511,0.776177,0.804196,0.684039,1.118797,0.02061,0.494324,0.654123,0.790222,0.788223,0.67032,0.370046
EGFR,0.0,0.00019,0.0,0.0,0.000307,0.0,0.0,0.000325,0.0,0.0,0.0,1e-05,0.0,0.0
ALK,0.0,0.0,0.0,0.0,0.001128,0.008228,0.0,0.0,0.0,0.003425,0.001432,0.000239,0.0,0.0


Average gene expression in human colon:


Unnamed: 0,neutrophil,mast,monocyte,B,plasma,T,goblet,brush,crypt,transit amp,enterocyte,paneth,venous,capillary,fibroblast,enteroendocrine
TP53,0.111315,0.033383,0.085653,0.185189,0.025554,0.06861,0.063521,0.013328,0.267211,0.449279,0.089426,0.076705,0.239154,0.0,0.13657,0.236432
KRAS,0.864672,0.984021,0.556534,2.100426,0.726135,0.985572,0.522061,0.13228,0.424796,0.55747,0.619195,0.907783,0.504579,1.104401,0.388044,1.0116
EGFR,0.058211,0.0,0.0177,0.016101,0.00332,0.012897,0.183984,0.011498,0.225115,0.284618,0.174868,0.074162,0.0,0.111962,1.088221,0.146555
ALK,0.0,0.0,0.035261,0.0,0.001314,0.00154,0.0008,0.0,0.0,0.0,0.000854,0.0,0.0,0.0,0.002482,0.0


#### Output

* Each call to `average` returns one *Pandas DataFrame*, so the loop displays one table per queried organ.
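
If you would rather keep the per-organ tables in a single object, one option is to collect them in a dictionary and join them with `pandas.concat`. The sketch below assumes two hand-built stand-in tables (two genes and two cell types each, values copied from the outputs above) in place of real `api.average` results; the names `per_organ` and `combined` are illustrative.

```python
import pandas as pd

genes = ["TP53", "KRAS"]

# Stand-ins for two of the per-organ results (values copied from the output
# above). In practice each frame would come from api.average(...).
per_organ = {
    "bladder": pd.DataFrame(
        {"T": [0.147462, 1.065008], "NK": [0.314548, 1.569626]}, index=genes
    ),
    "blood": pd.DataFrame(
        {"neutrophil": [0.019245, 1.378338], "T": [0.205015, 0.790222]}, index=genes
    ),
}

# Concatenate side by side; the dict keys become the outer column level,
# so every column is addressed as (organ, cell type).
combined = pd.concat(per_organ, axis=1)
print(combined)

# Select one (organ, cell type) column.
print(combined[("blood", "neutrophil")])
```

Keeping the organ as an outer column level avoids clashes between cell types that occur in more than one organ, such as *T*.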

## **Querying average gene expression of the top 3 marker genes**

The following example uses the API's `markers` method to get the top three marker genes for *neutrophils* in the *human lung*, then passes them to `average` to display their average expression.

In [12]:
# Get markers for neutrophils in the human lung
markers_in_human_lung_neu = api.markers(
    organism="h_sapiens", 
    organ="lung", 
    cell_type="neutrophil", 
    number=3
)

# Calculate average gene expression for the markers
avg_gene_expr_markers = api.average(
    organism="h_sapiens",
    organ="lung",
    features=markers_in_human_lung_neu
)

display(avg_gene_expr_markers)

Unnamed: 0,neutrophil,basophil,monocyte,macrophage,dendritic,B,plasma,T,NK,plasmacytoid,...,capillary,CAP2,lymphatic,fibroblast,alveolar fibroblast,smooth muscle,vascular smooth muscle,pericyte,mesothelial,ionocyte
CXCR2,12.413691,0.01319,0.020326,0.068983,0.096474,0.0,0.0,0.016979,0.246515,0.188133,...,0.006449,0.0,0.0,0.0,0.005169,0.0,0.0,0.0,0.015745,0.0
FCGR3B,11.70941,0.0,0.020036,0.011601,1.3e-05,0.0,0.0,0.029604,0.023175,0.188133,...,0.006533,0.005917,0.0,0.004314,0.0,0.002679,0.0,0.012478,0.0,0.0
IL1R2,62.680073,0.008473,1.464313,0.055198,1.171577,0.0,0.034122,0.094902,0.068963,0.0,...,0.060423,0.037151,0.0,0.01131,0.011099,0.019712,0.020363,0.046019,0.0,0.22693


#### Output

* The `average` call returns a *Pandas DataFrame* with the average expression of the top three neutrophil marker genes across all available cell types in the human lung.
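
A quick sanity check on a marker table is to compare the target cell type with the best of the others. The following sketch does this on a hand-copied subset of the output above (four of the cell types); `marker_subset`, `best_other`, and `margin` are illustrative names, and the simple difference used here is only one of several possible ways to gauge specificity.

```python
import pandas as pd

# Hand-copied subset of the marker output above (four of the cell types).
marker_subset = pd.DataFrame(
    {
        "neutrophil": [12.413691, 11.70941, 62.680073],
        "monocyte": [0.020326, 0.020036, 1.464313],
        "dendritic": [0.096474, 0.000013, 1.171577],
        "NK": [0.246515, 0.023175, 0.068963],
    },
    index=["CXCR2", "FCGR3B", "IL1R2"],
)

target = "neutrophil"

# Highest average expression among the other cell types, per gene.
best_other = marker_subset.drop(columns=target).max(axis=1)

# Difference between the target cell type and the best other cell type.
margin = marker_subset[target] - best_other
print(margin)
```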

## **Exploring the API**

The API provides several functions that can show you all accessible organisms, organs, and cell types.

### Get available organisms

The following example demonstrates how to retrieve a list of available organisms from the API. The `organisms` function sets *gene expression* as the default measurement type.

In [6]:
organisms = api.organisms()

print("Available organisms for the gene expression measurement type:")
display(organisms)

Available organisms for the gene expression measurement type:


{'gene_expression': ['a_queenslandica',
  'a_thaliana',
  'c_elegans',
  'c_gigas',
  'c_hemisphaerica',
  'd_melanogaster',
  'd_rerio',
  'f_vesca',
  'h_miamia',
  'h_sapiens',
  'i_pulchra',
  'l_minuta',
  'm_leidyi',
  'm_murinus',
  'm_musculus',
  'n_vectensis',
  'o_sativa',
  'p_crozieri',
  'p_dumerilii',
  's_lacustris',
  's_mansoni',
  's_mediterranea',
  's_pistillata',
  's_purpuratus',
  't_adhaerens',
  't_aestivum',
  'x_laevis',
  'z_mays']}

#### Output

This function returns a dictionary whose keys are measurement types (here `gene_expression`) and whose values are alphabetically sorted lists of the available organisms.

You can also check whether the organism you're interested in is included in the API:

In [7]:
# organism of interest
aim_organism = 'h_sapiens'

# check if the organism is included
if aim_organism in organisms['gene_expression']:
    print(f"{aim_organism} is available.")
else:
    print(f"{aim_organism} is NOT available.")

h_sapiens is available.


#### Output

This snippet prints one of two messages, with `{aim_organism}` replaced by the name you chose:

* `"{aim_organism} is available."` if the API contains data for this organism.
* `"{aim_organism} is NOT available."` if the API does not have data for this organism.
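
If you run this check often, it can be wrapped in a small helper. The sketch below is not part of `atlasapprox`: `is_available` is a hypothetical function, `organisms_example` is a shortened stand-in for the dictionary returned by `api.organisms()`, and `"some_other_type"` is a made-up key used only to show the fallback.

```python
def is_available(organism, catalog, measurement_type="gene_expression"):
    """Return True if `organism` is listed under `measurement_type`."""
    # dict.get avoids a KeyError when the measurement type is missing.
    return organism in catalog.get(measurement_type, [])


# Shortened stand-in for the dictionary returned by api.organisms().
organisms_example = {
    "gene_expression": ["a_thaliana", "h_sapiens", "m_musculus"],
}

print(is_available("h_sapiens", organisms_example))
print(is_available("t_rex", organisms_example))
# A measurement type that is not in the dictionary simply yields False.
print(is_available("h_sapiens", organisms_example, "some_other_type"))
```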

### Get available human organs

The following example passes an organism (*human*) to the `organs` method, which returns a list of the organs available for it. The `organs` method sets *gene expression* as the default measurement type.

In [8]:
# List all available organs for human
human_organs = api.organs(organism='h_sapiens')

display(human_organs)

['bladder',
 'blood',
 'colon',
 'eye',
 'fat',
 'gut',
 'heart',
 'kidney',
 'liver',
 'lung',
 'lymphnode',
 'mammary',
 'marrow',
 'muscle',
 'pancreas',
 'prostate',
 'salivary',
 'skin',
 'spleen',
 'thymus',
 'tongue',
 'trachea',
 'uterus']

#### Output

This function returns a list of all organs available for the chosen organism, sorted alphabetically.

You can also check whether the organ you're interested in is included:

In [9]:
# organ of interest
aim_organ = 'lung'

# check whether this organ is included
if aim_organ in human_organs:
    print(f"{aim_organ} is available.")
else:
    print(f"{aim_organ} is NOT available.")

lung is available.


#### Output

This snippet prints one of two messages, with `{aim_organ}` replaced by the name you chose:

* `"{aim_organ} is available."` if the API contains data for this organ in the selected organism.
* `"{aim_organ} is NOT available."` if the API does not have data for this organ in the selected organism.
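
The same membership test extends to a whole list of organs, which is handy before looping over them with `average`. This is a minimal sketch: `human_organs_example` is a shortened stand-in for the list returned by `api.organs(organism='h_sapiens')`, and the `requested` list is an arbitrary example.

```python
# Shortened stand-in for the list returned by api.organs(organism='h_sapiens').
human_organs_example = ["bladder", "blood", "colon", "lung", "skin"]

# "brain" does not appear in the human organ list shown above.
requested = ["lung", "brain", "colon"]

# Split the request into organs that can be queried and organs that cannot.
available = [organ for organ in requested if organ in human_organs_example]
missing = [organ for organ in requested if organ not in human_organs_example]

print("Will query:", available)
if missing:
    print("Not available, skipping:", missing)
```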

### Get available human genes

The following example passes an organism (*human*) to the `features` method, which returns the available genes. The `features` method sets *gene expression* as the default measurement type.

In [13]:
human_genes = api.features(organism='h_sapiens')

display(human_genes)

Index(['A1BG', 'A1BG-AS1', 'A1CF', 'A2M', 'A2M-AS1', 'A2ML1', 'A2ML1-AS1',
       'A2ML1-AS2', 'A2MP1', 'A3GALT2',
       ...
       'ZXDA', 'ZXDB', 'ZXDC', 'ZYG11A', 'ZYG11AP1', 'ZYG11B', 'ZYX', 'ZYXP1',
       'ZZEF1', 'hsa-mir-1253'],
      dtype='object', name='features', length=58870)

#### Output

This function returns a **Pandas Index** of the available genes, sorted alphabetically. Its printed form also shows the total number of genes (`length`).

You can also check if your gene of interest is included:

In [None]:
aim_gene = 'MTRNR2L12'

if aim_gene in human_genes:
    print(f"{aim_gene} gene is available.")
else:
    print(f"{aim_gene} gene is NOT available.")

MTRNR2L12 gene is available.


#### Output

This snippet prints one of two messages, with `{aim_gene}` replaced by the name you chose:

* `"{aim_gene} gene is available."` if the API contains data for this gene in the selected organism.
* `"{aim_gene} gene is NOT available."` if the API does not have data for this gene in the selected organism.
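
To check several genes at once, a pandas Index offers `isin`, which returns a boolean mask. The sketch below uses a five-gene stand-in (`human_genes_example`) instead of the full Index returned by `api.features`, and `"NOT_A_GENE"` is a deliberately invalid, made-up name.

```python
import pandas as pd

# Small stand-in for the Index returned by api.features(organism='h_sapiens').
human_genes_example = pd.Index(
    ["A1BG", "EGFR", "KRAS", "TP53", "ZYX"], name="features"
)

# "NOT_A_GENE" is a made-up name that should be reported as absent.
requested = pd.Index(["TP53", "KRAS", "NOT_A_GENE"])

# Boolean mask: True where the requested gene exists in the Index.
mask = requested.isin(human_genes_example)

present = list(requested[mask])
absent = list(requested[~mask])
print("Present:", present)
print("Absent:", absent)
```

Filtering the request this way before calling `average` makes it clear which names were dropped and why.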

### Get available cell types

The `celltypes` method returns a list of available cell types for the chosen organism (*human*) and organ (*lung*).

In [14]:
celltypes_human_lung = api.celltypes(organism='h_sapiens', organ='lung', measurement_type='gene_expression')

display(celltypes_human_lung)

['neutrophil',
 'basophil',
 'monocyte',
 'macrophage',
 'dendritic',
 'B',
 'plasma',
 'T',
 'NK',
 'plasmacytoid',
 'goblet',
 'AT1',
 'AT2',
 'club',
 'ciliated',
 'basal',
 'serous',
 'mucous',
 'arterial',
 'venous',
 'capillary',
 'CAP2',
 'lymphatic',
 'fibroblast',
 'alveolar fibroblast',
 'smooth muscle',
 'vascular smooth muscle',
 'pericyte',
 'mesothelial',
 'ionocyte']

#### Output

This function returns a list of available cell types in the specified organism (*human*) and organ (*lung*).

You can also check if your cell type of interest is included:

In [None]:
aim_celltype = 'NK'

if aim_celltype in celltypes_human_lung:
    print(f"{aim_celltype} cell is available.")
else:
    print(f"{aim_celltype} cell is NOT available.")

NK cell is available.


#### Output

This snippet prints one of two messages, with `{aim_celltype}` replaced by the name you chose:

* `"{aim_celltype} cell is available."` if the API contains data for this cell type in the human lung.
* `"{aim_celltype} cell is NOT available."` if the API does not have data for this cell type in the human lung.
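
Cell types differ from organ to organ, so when comparing organs it helps to know which cell types they share. The sketch below uses the column headers of the bladder, blood, and colon outputs shown earlier in this tutorial as hand-copied lists; with a live connection, each list could instead come from `api.celltypes`. The names `celltypes_by_organ` and `shared` are illustrative.

```python
# Cell types copied from the column headers of the bladder, blood, and colon
# outputs earlier in this tutorial.
celltypes_by_organ = {
    "bladder": [
        "mast", "macrophage", "B", "plasma", "T", "NK", "plasmacytoid",
        "urothelial", "venous", "capillary", "lymphatic", "fibroblast",
        "smooth muscle", "pericyte",
    ],
    "blood": [
        "HSC", "neutrophil", "basophil", "myeloid", "monocyte", "macrophage",
        "dendritic", "erythrocyte", "B", "plasma", "T", "NK", "plasmacytoid",
        "platelet",
    ],
    "colon": [
        "neutrophil", "mast", "monocyte", "B", "plasma", "T", "goblet",
        "brush", "crypt", "transit amp", "enterocyte", "paneth", "venous",
        "capillary", "fibroblast", "enteroendocrine",
    ],
}

# Intersect the sets of cell types over all organs.
shared = set.intersection(*(set(names) for names in celltypes_by_organ.values()))
print(sorted(shared))
```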

### **Sorting gene expression by cell type**

The following example demonstrates how to sort `avg_gene_expr_lung` by the average gene expression in a specific cell type (*neutrophils*) in ascending order.

In [15]:
# sort_values returns a new, sorted DataFrame; the original is unchanged
sorted_by_neutrophil = avg_gene_expr_lung.sort_values(by='neutrophil')

display(sorted_by_neutrophil)

Unnamed: 0,neutrophil,basophil,monocyte,macrophage,dendritic,B,plasma,T,NK,plasmacytoid,...,capillary,CAP2,lymphatic,fibroblast,alveolar fibroblast,smooth muscle,vascular smooth muscle,pericyte,mesothelial,ionocyte
ALK,0.0,0.0,0.001285,0.013633,0.0,0.0,0.0,0.002077,0.0,0.0,...,0.001188,0.0,0.0,0.000689,0.004787,0.0,0.0,0.0,0.0,0.0
EGFR,0.016721,0.024823,0.028325,0.011413,0.000949,0.138468,0.031597,0.064145,0.051909,0.0,...,0.579046,0.736705,0.085129,0.666291,0.720418,0.670286,0.528496,0.980535,0.2257,2.810568
TP53,0.054815,0.119978,0.327787,0.132754,0.238697,0.178123,0.038301,0.202786,0.239634,0.074571,...,0.219169,0.126632,0.254856,0.175867,0.16092,0.110756,0.193365,0.252695,0.152536,0.227391
KRAS,1.529643,0.436303,0.977728,0.489622,0.443576,0.562167,0.243355,0.82648,0.76758,0.467747,...,1.357422,1.346849,0.794639,0.249971,0.388602,0.582125,0.251247,0.708356,0.319821,0.714599


#### Output

* `sort_values` returns a **Pandas DataFrame** whose rows are ordered by the average gene expression in *neutrophils*, in ascending order.

To sort in descending order instead, set the `ascending` parameter of `sort_values` to `False`:

In [16]:
# returns a new DataFrame sorted in descending order
des_sorted_by_neutrophil = avg_gene_expr_lung.sort_values(by='neutrophil', ascending=False)

display(des_sorted_by_neutrophil)

Unnamed: 0,neutrophil,basophil,monocyte,macrophage,dendritic,B,plasma,T,NK,plasmacytoid,...,capillary,CAP2,lymphatic,fibroblast,alveolar fibroblast,smooth muscle,vascular smooth muscle,pericyte,mesothelial,ionocyte
KRAS,1.529643,0.436303,0.977728,0.489622,0.443576,0.562167,0.243355,0.82648,0.76758,0.467747,...,1.357422,1.346849,0.794639,0.249971,0.388602,0.582125,0.251247,0.708356,0.319821,0.714599
TP53,0.054815,0.119978,0.327787,0.132754,0.238697,0.178123,0.038301,0.202786,0.239634,0.074571,...,0.219169,0.126632,0.254856,0.175867,0.16092,0.110756,0.193365,0.252695,0.152536,0.227391
EGFR,0.016721,0.024823,0.028325,0.011413,0.000949,0.138468,0.031597,0.064145,0.051909,0.0,...,0.579046,0.736705,0.085129,0.666291,0.720418,0.670286,0.528496,0.980535,0.2257,2.810568
ALK,0.0,0.0,0.001285,0.013633,0.0,0.0,0.0,0.002077,0.0,0.0,...,0.001188,0.0,0.0,0.000689,0.004787,0.0,0.0,0.0,0.0,0.0


#### Output

* `sort_values` returns a **Pandas DataFrame** whose rows are ordered by the average gene expression in *neutrophils*, in descending order.

### **Sorting gene expression by multiple cell types**

The following example demonstrates how to sort `avg_gene_expr_lung` first by *neutrophil* gene expression and then by *basophil* gene expression, both in ascending order.

In [17]:
# returns a new DataFrame sorted by neutrophil, then basophil
sorted_by_neutrophil_basophil = avg_gene_expr_lung.sort_values(by=['neutrophil', 'basophil'])

display(sorted_by_neutrophil_basophil)

Unnamed: 0,neutrophil,basophil,monocyte,macrophage,dendritic,B,plasma,T,NK,plasmacytoid,...,capillary,CAP2,lymphatic,fibroblast,alveolar fibroblast,smooth muscle,vascular smooth muscle,pericyte,mesothelial,ionocyte
ALK,0.0,0.0,0.001285,0.013633,0.0,0.0,0.0,0.002077,0.0,0.0,...,0.001188,0.0,0.0,0.000689,0.004787,0.0,0.0,0.0,0.0,0.0
EGFR,0.016721,0.024823,0.028325,0.011413,0.000949,0.138468,0.031597,0.064145,0.051909,0.0,...,0.579046,0.736705,0.085129,0.666291,0.720418,0.670286,0.528496,0.980535,0.2257,2.810568
TP53,0.054815,0.119978,0.327787,0.132754,0.238697,0.178123,0.038301,0.202786,0.239634,0.074571,...,0.219169,0.126632,0.254856,0.175867,0.16092,0.110756,0.193365,0.252695,0.152536,0.227391
KRAS,1.529643,0.436303,0.977728,0.489622,0.443576,0.562167,0.243355,0.82648,0.76758,0.467747,...,1.357422,1.346849,0.794639,0.249971,0.388602,0.582125,0.251247,0.708356,0.319821,0.714599


#### Output

* `sort_values` returns a **Pandas DataFrame** whose rows are ordered first by average gene expression in *neutrophils* and then by average gene expression in *basophils*, both in ascending order.
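
The second sort key only matters when two genes have the same value in the first one. In the lung output, *EGFR* and *ALK* are both 0.0 in *plasmacytoid* cells, which makes a convenient illustration. The sketch below rebuilds just those two columns by hand (values copied from the lung table); `lung_subset` is an illustrative name.

```python
import pandas as pd

# Hand-copied subset of the lung output (values in cptt).
lung_subset = pd.DataFrame(
    {
        "plasmacytoid": [0.074571, 0.467747, 0.0, 0.0],
        "macrophage": [0.132754, 0.489622, 0.011413, 0.013633],
    },
    index=["TP53", "KRAS", "EGFR", "ALK"],
)

# EGFR and ALK tie at 0.0 in plasmacytoid cells; the macrophage column
# breaks the tie, placing EGFR (0.011413) before ALK (0.013633).
sorted_with_tiebreak = lung_subset.sort_values(by=["plasmacytoid", "macrophage"])
print(sorted_with_tiebreak)
```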

### **Sorting gene expression across multiple organs**

The following example displays the average gene expression for three genes (*MTRNR2L12*, *AL512646.1*, *FP671120.5*) across three *human* organs (*bladder*, *blood*, and *colon*). The optional sorting step (by *neutrophils* and *T cells*) is left commented out because it only works for organs that contain both cell types, and the bladder data shown here has no *neutrophil* column.

In [11]:
# If you want to get avg_gene_expr for all organs in human, try:
# organ_list = api.organs(organism="h_sapiens")
organ_list = ['bladder', 'blood', 'colon']

# loop through organ_list and display the results
for organ in organ_list:
    avg_gene_expr = api.average(
        organism = "h_sapiens", 
        organ = organ, 
        features = ['MTRNR2L12', 'AL512646.1', 'FP671120.5'], 
        measurement_type = 'gene_expression'
    )

    # Display the result
    print(f'Average gene expression in human {organ}:')
    display(avg_gene_expr)
    # Optional: sort by cell types (requires both columns to exist in this organ)
    # display(avg_gene_expr.sort_values(by=['neutrophil', 'T']))

Average gene expression in human bladder:


Unnamed: 0,mast,macrophage,B,plasma,T,NK,plasmacytoid,urothelial,venous,capillary,lymphatic,fibroblast,smooth muscle,pericyte
MTRNR2L12,0.040105,0.12186,0.132304,0.016499,0.013802,0.052788,0.002277,0.059001,0.092321,0.041095,0.083313,0.058918,0.034633,0.051836
AL512646.1,0.035943,0.038304,0.022516,0.003811,3e-06,0.010509,0.009106,0.009846,0.01345,0.000868,0.019044,0.003351,0.013949,0.001608
FP671120.5,0.037544,0.052078,0.109802,0.03963,0.001136,0.003897,0.038702,0.061551,0.125526,0.004339,0.2083,0.036039,0.009927,0.02497


Average gene expression in human blood:


Unnamed: 0,HSC,neutrophil,basophil,myeloid,monocyte,macrophage,dendritic,erythrocyte,B,plasma,T,NK,plasmacytoid,platelet
MTRNR2L12,0.042015,0.020157,0.04958,0.0,0.064458,0.003231,0.035253,0.006407,0.024542,0.026737,0.02128,0.020722,0.0,0.165943
AL512646.1,0.001165,0.03667,0.0,0.0,0.145861,0.0,0.0,0.004813,0.004529,0.040637,0.002428,0.000875,0.0,0.002268
FP671120.5,0.05801,0.016955,0.0,0.0,0.06889,0.01362,0.0,0.008137,0.086548,0.006233,0.122227,0.037445,0.0,0.080209


Average gene expression in human colon:


Unnamed: 0,neutrophil,mast,monocyte,B,plasma,T,goblet,brush,crypt,transit amp,enterocyte,paneth,venous,capillary,fibroblast,enteroendocrine
MTRNR2L12,0.878699,0.0,0.025914,0.030038,0.012355,0.024335,2.114801,2.636931,0.351932,0.056754,0.674713,0.123892,0.0,0.0,0.060827,0.21881
AL512646.1,0.055621,0.0,0.0,0.0,0.0,0.0,0.035407,0.32817,0.006913,0.000898,0.014697,0.0,0.0,0.0,0.000169,0.001798
FP671120.5,0.040737,0.0,0.004727,0.003165,0.0,0.002037,0.422802,1.238326,0.056464,0.006978,0.048138,0.0,0.085853,0.0,0.005391,0.0


#### Output

* Each call to `average` returns one *Pandas DataFrame*, so the loop displays one table per queried organ.
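
Sorting by a cell type that an organ lacks raises a `KeyError` in pandas, and the bladder table above has no *neutrophil* column. One way to sort safely inside the loop is to keep only the sort keys that exist in each table. The sketch below shows the idea on a hand-copied subset of the bladder output; `bladder_subset`, `wanted`, and `sort_keys` are illustrative names.

```python
import pandas as pd

# Hand-copied subset of the bladder output above; note that this organ
# has no 'neutrophil' column.
bladder_subset = pd.DataFrame(
    {
        "macrophage": [0.12186, 0.038304, 0.052078],
        "T": [0.013802, 0.000003, 0.001136],
    },
    index=["MTRNR2L12", "AL512646.1", "FP671120.5"],
)

wanted = ["neutrophil", "T"]

# Keep only the sort keys that exist as columns in this organ's table.
sort_keys = [name for name in wanted if name in bladder_subset.columns]

# Sort only if at least one key is present; otherwise leave the table as is.
result = bladder_subset.sort_values(by=sort_keys) if sort_keys else bladder_subset
print(result)
```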

### **Filtering data by selected cell types**

The following example demonstrates how to filter `avg_gene_expr_lung` to display only the selected cell types (*neutrophil*, *macrophage*, *plasma*).

In [5]:
# Filter your data with specific cell types
chosen_cell_type = ['neutrophil', 'macrophage', 'plasma']
filtered_cell_type_df = avg_gene_expr_lung[chosen_cell_type]

display(filtered_cell_type_df)

Unnamed: 0,neutrophil,macrophage,plasma
TP53,0.054815,0.132754,0.038301
KRAS,1.529643,0.489622,0.243355
EGFR,0.016721,0.011413,0.031597
ALK,0.0,0.013633,0.0


#### Output

* Selecting columns with a list returns a **Pandas DataFrame** that contains only the chosen cell types (*neutrophil*, *macrophage*, *plasma*) from `avg_gene_expr_lung`.
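
Rows can be filtered in a similar way. As a sketch, the snippet below rebuilds the filtered table shown above by hand and keeps only the genes whose average expression exceeds a cutoff in at least one of the selected cell types. The cutoff of 0.1 cptt is an arbitrary value chosen for illustration, not a recommendation, and `filtered_example` is an illustrative name.

```python
import pandas as pd

# The filtered table shown above, rebuilt by hand (values in cptt).
filtered_example = pd.DataFrame(
    {
        "neutrophil": [0.054815, 1.529643, 0.016721, 0.0],
        "macrophage": [0.132754, 0.489622, 0.011413, 0.013633],
        "plasma": [0.038301, 0.243355, 0.031597, 0.0],
    },
    index=["TP53", "KRAS", "EGFR", "ALK"],
)

# Arbitrary illustrative cutoff, not a recommended value.
threshold = 0.1

# Keep genes that exceed the cutoff in at least one selected cell type.
expressed = filtered_example[(filtered_example > threshold).any(axis=1)]
print(expressed)
```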



## **Conclusion**


This tutorial covered the basic usage of the `average` method in *atlasapprox*. Thank you for using the *atlasapprox* API. For more detailed information, please refer to the [official documentation](https://atlasapprox.readthedocs.io/en/latest/python/index.html).