# **Average gene expression and chromatin accessibility**

## **Introduction**

The `average` function returns a *pandas DataFrame* containing the average *gene expression* or *chromatin accessibility* of selected features across cell types. This tutorial demonstrates several applications of the `average` function.

## **Prerequisite**

In [11]:
import atlasapprox

api = atlasapprox.API()

For detailed initial setup instructions, refer to the [Quick Start Tutorial](link to quick_start).

## **Get average gene expression**

This example shows a simple use of the `average` function: computing the *average gene expression* of selected genes within a specific organ of an organism.

The following example shows how to get the average expression of four genes (*TP53*, *KRAS*, *EGFR*, *ALK*) in the human lung:

In [46]:
avg_gene_expr_lung = api.average(
    organism = "h_sapiens", 
    organ = "lung", 
    features = ["TP53", "KRAS", "EGFR", "ALK"], 
    # measurement_type = 'gene_expression'
)

# Display the result
display(avg_gene_expr_lung)

Unnamed: 0,neutrophil,basophil,monocyte,macrophage,dendritic,B,plasma,T,NK,plasmacytoid,...,capillary,CAP2,lymphatic,fibroblast,alveolar fibroblast,smooth muscle,vascular smooth muscle,pericyte,mesothelial,ionocyte
TP53,0.054815,0.119978,0.327787,0.132754,0.238697,0.178123,0.038301,0.202786,0.239634,0.074571,...,0.219169,0.126632,0.254856,0.175867,0.16092,0.110756,0.193365,0.252695,0.152536,0.227391
KRAS,1.529643,0.436303,0.977728,0.489622,0.443576,0.562167,0.243355,0.82648,0.76758,0.467747,...,1.357422,1.346849,0.794639,0.249971,0.388602,0.582125,0.251247,0.708356,0.319821,0.714599
EGFR,0.016721,0.024823,0.028325,0.011413,0.000949,0.138468,0.031597,0.064145,0.051909,0.0,...,0.579046,0.736705,0.085129,0.666291,0.720418,0.670286,0.528496,0.980535,0.2257,2.810568


### **Output**

The `avg_gene_expr_lung` variable contains a Pandas DataFrame where:
* Each row represents a gene.
* Each column corresponds to a cell type.
* The values indicate the average gene expression, measured in counts per ten thousand (cptt).
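
Because rows are genes and columns are cell types, standard pandas reductions apply directly. The following is a minimal sketch, not part of the *atlasapprox* API: it rebuilds a small subset of the table above by hand (the name `avg_subset` and the choice of four cell types are illustrative assumptions) and finds, for each gene, the cell type with the highest average expression within that subset.

```python
import pandas as pd

# Values copied from the lung output above (three genes, four of the cell types).
# In practice this DataFrame would come straight from api.average(...).
avg_subset = pd.DataFrame(
    {
        "neutrophil": [0.054815, 1.529643, 0.016721],
        "macrophage": [0.132754, 0.489622, 0.011413],
        "fibroblast": [0.175867, 0.249971, 0.666291],
        "ionocyte": [0.227391, 0.714599, 2.810568],
    },
    index=["TP53", "KRAS", "EGFR"],
)

# For each gene (row), report the cell type (column) with the highest value.
top_cell_type = avg_subset.idxmax(axis=1)
print(top_cell_type)
```

The result only reflects the columns included in the subset; on the full table, `idxmax` would consider every cell type.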

## **Get gene expression across organs**

This example demonstrates how to get the average gene expression of selected features across multiple organs for particular cell types. In this case, four genes (TP53, KRAS, EGFR, and ALK) are analyzed in two cell types (neutrophils and B cells).

In [116]:
import pandas as pd

def traverse_organ(organ_list):
    chosen_organism = "h_sapiens"
    chosen_cell_type = ['neutrophil', 'B']
    chosen_genes = ["TP53", "KRAS", "EGFR", "ALK"]

    for cell_type in chosen_cell_type:
        df_combined = pd.DataFrame()
        # Display gene expression via organs
        for organ in organ_list:
            # If the chosen organ doesn't contain the specific cell type, show "/" in the corresponding cell
            if organ not in api.celltype_location(organism=chosen_organism, cell_type=cell_type):
                empty_column = ['/'] * len(chosen_genes)
                df_combined[organ] = empty_column
                df_combined = df_combined.set_index(pd.Index(chosen_genes))
                continue
            
            # Get average gene expression of each organ
            avg_expr = api.average(
                organism=chosen_organism, 
                organ=organ, 
                features=chosen_genes, 
                # measurement_type = 'gene_expression'
            )
            # Filter data with specific cell types
            filtered_cell_type_df = pd.DataFrame(avg_expr[cell_type])
        
            # Add new col with different organ to pandas.dataframe
            df_combined[organ] = filtered_cell_type_df
        
        # Rename columns with organs
        df_combined.columns = organ_list
        print(f"Current cell type: {cell_type}")
        display(df_combined)

organ_list = ['lung', 'gut', 'marrow']
# organ_list = api.organs(organism="h_sapiens")
traverse_organ(organ_list)

Current cell type: neutrophil


Unnamed: 0,lung,gut,marrow
TP53,0.054815,0.160057,0.037409
KRAS,1.529643,0.793704,1.094597
EGFR,0.016721,0.0,0.004238
ALK,0.0,0.0,0.000568


Current cell type: B


Unnamed: 0,lung,gut,marrow
TP53,0.178123,0.127678,0.355219
KRAS,0.562167,1.738706,0.663666
EGFR,0.138468,0.002238,6.6e-05
ALK,0.0,0.0,0.001408


### **Output**
The `traverse_organ` function displays one pandas DataFrame per chosen cell type, where:

* Each DataFrame shows the expression levels of the selected genes across the specified organs.
* Rows represent individual genes.
* Columns correspond to different organs.
* The values indicate the average gene expression, measured in counts per ten thousand (cptt).
* A "/" marks an organ in which the cell type is not found.
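
An alternative way to assemble such a table is to collect one Series per organ and join them with `pd.concat`. The sketch below is an illustration under stated assumptions rather than the tutorial's method: the neutrophil values are copied by hand from the output above, and `organ_without_neutrophils` is a hypothetical organ name used to show how a missing cell type can be kept as `NaN` instead of a "/" string.

```python
import pandas as pd

genes = ["TP53", "KRAS", "EGFR", "ALK"]

# Neutrophil values copied from the output above. In practice each Series
# would be api.average(...)[cell_type] for one organ.
per_organ = {
    "lung": pd.Series([0.054815, 1.529643, 0.016721, 0.0], index=genes),
    "gut": pd.Series([0.160057, 0.793704, 0.0, 0.0], index=genes),
    "marrow": pd.Series([0.037409, 1.094597, 0.004238, 0.000568], index=genes),
}

# Hypothetical organ where the cell type is absent: fill the column with NaN.
per_organ["organ_without_neutrophils"] = pd.Series(float("nan"), index=genes)

# Dictionary keys become column names; rows are aligned on the gene index.
combined = pd.concat(per_organ, axis=1)
print(combined)
```

Keeping missing entries as `NaN` leaves every column numeric, so the table can still be sorted or averaged; the "/" placeholder turns the affected column into strings.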

## **Filtering data by selected cell types**

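This example demonstrates how to filter `avg_gene_expr_lung` to display only the selected cell types (*neutrophil*, *macrophage*, *plasma*). The outputs in this and the following sections were generated with `avg_gene_expr_lung` holding five genes (*COL13A1*, *COL14A1*, *TGFBI*, *PDGFRA*, *GZMA*) rather than the four genes queried earlier.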

In [None]:
# Filter your data with specific cell types
chosen_cell_type = ['neutrophil', 'macrophage', 'plasma']
filtered_cell_type_df = avg_gene_expr_lung[chosen_cell_type]

display(filtered_cell_type_df)

Unnamed: 0,neutrophil,macrophage,plasma
COL13A1,0.0,0.000711,0.002205
COL14A1,0.0,0.001362,0.002607
TGFBI,0.06515,1.252701,0.083882
PDGFRA,0.0,0.002414,0.0
GZMA,0.013437,0.029326,0.063292


### **Output**
`filtered_cell_type_df` is a *pandas DataFrame* where:
* Each row represents a gene.
* Each column corresponds to a cell type.
* The values indicate the average gene expression (measured in counts per ten thousand, or cptt).
* Only the selected cell types (*neutrophil*, *macrophage*, *plasma*) from `avg_gene_expr_lung` are displayed.
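
Selecting columns with a list raises a `KeyError` in pandas if any requested label is missing, which can happen when a cell type is not present in the chosen organ. The following minimal sketch shows one way to guard against that. The DataFrame is a hand-copied subset of the output above, and the names `avg_subset`, `requested`, `available` and `missing` are illustrative assumptions.

```python
import pandas as pd

# Subset of the lung output above (two genes, three cell types).
avg_subset = pd.DataFrame(
    {
        "neutrophil": [0.0, 0.06515],
        "macrophage": [0.000711, 1.252701],
        "plasma": [0.002205, 0.083882],
    },
    index=["COL13A1", "TGFBI"],
)

# "hepatocyte" is deliberately not a column of this table.
requested = ["neutrophil", "hepatocyte", "plasma"]

# Split the request into cell types that exist and those that do not.
available = [ct for ct in requested if ct in avg_subset.columns]
missing = [ct for ct in requested if ct not in avg_subset.columns]

filtered = avg_subset[available]
print(filtered)
print("Skipped cell types:", missing)
```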

## **Sorting gene expression by cell type**

This example shows how to sort `avg_gene_expr_lung` by the average gene expression in a specific cell type (neutrophils), in ascending order.

In [4]:
sorted_by_neutrophil = avg_gene_expr_lung.sort_values(by='neutrophil')

display(sorted_by_neutrophil)

Unnamed: 0,neutrophil,basophil,monocyte,macrophage,dendritic,B,plasma,T,NK,plasmacytoid,...,capillary,CAP2,lymphatic,fibroblast,alveolar fibroblast,smooth muscle,vascular smooth muscle,pericyte,mesothelial,ionocyte
COL13A1,0.0,0.222863,0.0,0.000711,0.0,0.0,0.002205,0.0,0.029147,0.0,...,0.003937,0.0,0.0,0.005113,0.446961,0.0,0.131642,0.06796,0.0,0.0
COL14A1,0.0,0.0,0.001422,0.001362,0.0,0.0,0.002607,0.0,0.0,0.0,...,0.007525,0.026666,0.059648,1.110076,1.226022,1.033389,2.10846,0.03358,0.0,0.0
PDGFRA,0.0,0.0,0.000965,0.002414,0.003172,0.0,0.0,0.005035,0.0,0.0,...,0.011427,0.00292,0.0,1.772957,3.724075,0.128634,0.059852,0.0,0.332479,0.0
GZMA,0.013437,0.142837,0.174047,0.029326,0.020453,0.025113,0.063292,9.006065,19.687157,0.0,...,0.044351,0.042996,0.073877,0.029919,0.081036,0.119041,0.0,0.460141,0.044982,0.058806
TGFBI,0.06515,0.111107,1.802062,1.252701,2.190132,0.0,0.083882,0.10046,0.32661,4.492828,...,0.045932,0.06761,0.521915,0.393191,0.175393,0.311884,0.258512,0.11901,0.404976,0.032419


### **Output**
`sorted_by_neutrophil` is a *pandas DataFrame* where:
* Each row represents a gene.
* Each column corresponds to a cell type.
* The values indicate the average gene expression (measured in counts per ten thousand, or cptt).
* Rows are ordered by average gene expression in neutrophils, in ascending order.

Additionally, `sort_values()` can return the output in descending order by setting the `ascending` parameter to `False`:

In [5]:
des_sorted_by_neutrophil = avg_gene_expr_lung.sort_values(by='neutrophil', ascending=False)

display(des_sorted_by_neutrophil)

Unnamed: 0,neutrophil,basophil,monocyte,macrophage,dendritic,B,plasma,T,NK,plasmacytoid,...,capillary,CAP2,lymphatic,fibroblast,alveolar fibroblast,smooth muscle,vascular smooth muscle,pericyte,mesothelial,ionocyte
TGFBI,0.06515,0.111107,1.802062,1.252701,2.190132,0.0,0.083882,0.10046,0.32661,4.492828,...,0.045932,0.06761,0.521915,0.393191,0.175393,0.311884,0.258512,0.11901,0.404976,0.032419
GZMA,0.013437,0.142837,0.174047,0.029326,0.020453,0.025113,0.063292,9.006065,19.687157,0.0,...,0.044351,0.042996,0.073877,0.029919,0.081036,0.119041,0.0,0.460141,0.044982,0.058806
COL13A1,0.0,0.222863,0.0,0.000711,0.0,0.0,0.002205,0.0,0.029147,0.0,...,0.003937,0.0,0.0,0.005113,0.446961,0.0,0.131642,0.06796,0.0,0.0
COL14A1,0.0,0.0,0.001422,0.001362,0.0,0.0,0.002607,0.0,0.0,0.0,...,0.007525,0.026666,0.059648,1.110076,1.226022,1.033389,2.10846,0.03358,0.0,0.0
PDGFRA,0.0,0.0,0.000965,0.002414,0.003172,0.0,0.0,0.005035,0.0,0.0,...,0.011427,0.00292,0.0,1.772957,3.724075,0.128634,0.059852,0.0,0.332479,0.0


### **Output**
`des_sorted_by_neutrophil` is a *pandas DataFrame* where:
* Each row represents a gene.
* Each column corresponds to a cell type.
* The values indicate the average gene expression (measured in counts per ten thousand, or cptt).
* Rows are ordered by average gene expression in neutrophils, in descending order.
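
When only the top few genes are of interest, pandas also offers `DataFrame.nlargest`, which avoids sorting and slicing by hand. This is a minimal sketch on a hand-copied subset of the lung output above; the name `lung_subset` and the choice of two columns are illustrative assumptions.

```python
import pandas as pd

# Neutrophil and T cell values copied from the lung output above.
lung_subset = pd.DataFrame(
    {
        "neutrophil": [0.0, 0.0, 0.06515, 0.0, 0.013437],
        "T": [0.0, 0.0, 0.10046, 0.005035, 9.006065],
    },
    index=["COL13A1", "COL14A1", "TGFBI", "PDGFRA", "GZMA"],
)

# Keep the two genes with the highest average expression in neutrophils.
top2_neutrophil = lung_subset.nlargest(2, "neutrophil")
print(top2_neutrophil)
```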

## **Sorting gene expression by multiple cell types**

This example demonstrates how to sort `avg_gene_expr_lung` first by gene expression in *neutrophils* and then by gene expression in *basophils*, both in ascending order.

In [43]:
sorted_by_neutrophil_basophil = avg_gene_expr_lung.sort_values(by=['neutrophil', 'basophil'])

display(sorted_by_neutrophil_basophil)

Unnamed: 0,neutrophil,basophil,monocyte,macrophage,dendritic,B,plasma,T,NK,plasmacytoid,...,capillary,CAP2,lymphatic,fibroblast,alveolar fibroblast,smooth muscle,vascular smooth muscle,pericyte,mesothelial,ionocyte
COL14A1,0.0,0.0,0.001422,0.001362,0.0,0.0,0.002607,0.0,0.0,0.0,...,0.007525,0.026666,0.059648,1.110076,1.226022,1.033389,2.10846,0.03358,0.0,0.0
PDGFRA,0.0,0.0,0.000965,0.002414,0.003172,0.0,0.0,0.005035,0.0,0.0,...,0.011427,0.00292,0.0,1.772957,3.724075,0.128634,0.059852,0.0,0.332479,0.0
COL13A1,0.0,0.222863,0.0,0.000711,0.0,0.0,0.002205,0.0,0.029147,0.0,...,0.003937,0.0,0.0,0.005113,0.446961,0.0,0.131642,0.06796,0.0,0.0
GZMA,0.013437,0.142837,0.174047,0.029326,0.020453,0.025113,0.063292,9.006065,19.687157,0.0,...,0.044351,0.042996,0.073877,0.029919,0.081036,0.119041,0.0,0.460141,0.044982,0.058806
TGFBI,0.06515,0.111107,1.802062,1.252701,2.190132,0.0,0.083882,0.10046,0.32661,4.492828,...,0.045932,0.06761,0.521915,0.393191,0.175393,0.311884,0.258512,0.11901,0.404976,0.032419


### **Output**
`sorted_by_neutrophil_basophil` is a *pandas DataFrame* where:
* Each row represents a gene.
* Each column corresponds to a cell type.
* The values indicate the average gene expression (measured in counts per ten thousand, or cptt).
* Rows are ordered first by average gene expression in neutrophils and then, for ties, by expression in basophils, both in ascending order.
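
`sort_values()` also accepts a list for `ascending`, one entry per sort key, so each cell type can have its own direction. The sketch below uses a hand-copied subset of the lung output above (four genes, two cell types; the name `lung_subset` is an illustrative assumption) and sorts by neutrophils in ascending order while breaking ties by basophils in descending order.

```python
import pandas as pd

# Neutrophil and basophil values copied from the lung output above.
lung_subset = pd.DataFrame(
    {
        "neutrophil": [0.0, 0.0, 0.06515, 0.013437],
        "basophil": [0.222863, 0.0, 0.111107, 0.142837],
    },
    index=["COL13A1", "COL14A1", "TGFBI", "GZMA"],
)

# Ascending by neutrophil; among equal neutrophil values, descending by basophil.
mixed_order = lung_subset.sort_values(
    by=["neutrophil", "basophil"],
    ascending=[True, False],
)
print(mixed_order)
```

Here *COL13A1* and *COL14A1* tie at 0.0 in neutrophils, so the basophil column decides their order: *COL13A1* (0.222863) comes before *COL14A1* (0.0).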



## **Sorting gene expression across multiple organs**

This example calculates and displays the *average gene expression* for five specified genes (*COL13A1*, *COL14A1*, *TGFBI*, *PDGFRA*, *GZMA*) across three human organs: *lung*, *gut*, and *liver*. The gene expression data for each organ is sorted by the values in *neutrophils* and *T cells* before being displayed.

In [45]:
organ_list = ["lung", "gut", "liver"]

# loop through organ_list and display the results
for item in organ_list:
    avg_gene_expr = api.average(
        organism = "h_sapiens", 
        organ = item, 
        features = ["COL13A1", "COL14A1", "TGFBI", "PDGFRA", "GZMA"], 
        measurement_type = 'gene_expression'
    )

    # Display the result
    print(f'Average gene expression in human {item}:')
    display(avg_gene_expr.sort_values(by=['neutrophil', 'T']))

Average gene expression in human lung:


Unnamed: 0,neutrophil,basophil,monocyte,macrophage,dendritic,B,plasma,T,NK,plasmacytoid,...,capillary,CAP2,lymphatic,fibroblast,alveolar fibroblast,smooth muscle,vascular smooth muscle,pericyte,mesothelial,ionocyte
COL13A1,0.0,0.222863,0.0,0.000711,0.0,0.0,0.002205,0.0,0.029147,0.0,...,0.003937,0.0,0.0,0.005113,0.446961,0.0,0.131642,0.06796,0.0,0.0
COL14A1,0.0,0.0,0.001422,0.001362,0.0,0.0,0.002607,0.0,0.0,0.0,...,0.007525,0.026666,0.059648,1.110076,1.226022,1.033389,2.10846,0.03358,0.0,0.0
PDGFRA,0.0,0.0,0.000965,0.002414,0.003172,0.0,0.0,0.005035,0.0,0.0,...,0.011427,0.00292,0.0,1.772957,3.724075,0.128634,0.059852,0.0,0.332479,0.0
GZMA,0.013437,0.142837,0.174047,0.029326,0.020453,0.025113,0.063292,9.006065,19.687157,0.0,...,0.044351,0.042996,0.073877,0.029919,0.081036,0.119041,0.0,0.460141,0.044982,0.058806
TGFBI,0.06515,0.111107,1.802062,1.252701,2.190132,0.0,0.083882,0.10046,0.32661,4.492828,...,0.045932,0.06761,0.521915,0.393191,0.175393,0.311884,0.258512,0.11901,0.404976,0.032419


Average gene expression in human gut:


Unnamed: 0,neutrophil,mast,monocyte,B,plasma,T,goblet,brush,crypt,transit amp,enterocyte,paneth,venous,lymphatic,fibroblast,enteroendocrine
COL14A1,0.0,0.0,0.035564,0.0,0.001895,0.00032,0.001404,0.0,0.0,0.001129,0.001227,0.002958,0.0,0.0,1.486094,0.074912
PDGFRA,0.0,0.0,0.0,0.0,0.0,0.001268,0.003393,0.0,0.0,0.0,0.00301,0.0,0.0,0.0,5.021291,0.006898
COL13A1,0.014146,0.049564,0.0,0.0,0.007597,0.000415,0.003135,0.138984,0.009553,0.001848,0.002501,0.0,0.0,0.0,0.036372,0.0
GZMA,0.100076,0.014693,0.023376,0.008562,0.03212,1.13951,0.01649,0.048951,0.012393,0.012576,0.014222,0.005095,0.0,0.0,0.04264,0.032293
TGFBI,3.150326,0.0,0.0,0.029953,0.003055,0.031463,0.017436,0.0,0.0,0.006506,0.00712,0.001341,0.0,0.167873,0.163474,0.0


Average gene expression in human liver:


Unnamed: 0,neutrophil,monocyte,macrophage,dendritic,erythrocyte,plasma,T,NK,epithelial,cholangiocyte,capillary,lymphatic,fibroblast,hepatocyte
GZMA,0.0,0.125646,0.050053,0.043797,0.012339,0.131326,3.470086,14.692757,0.0,0.0,0.10265,0.0,0.0,0.0045
PDGFRA,0.003274,0.0,0.007626,0.0,0.0,0.0,0.021953,0.0,0.220303,0.0,0.129755,0.0,1.398366,0.075638
COL13A1,0.050246,0.0,0.0,0.0,0.0,0.0,0.008966,0.019678,0.0,0.0,0.008885,0.0,0.065998,0.0
COL14A1,0.065358,0.005026,0.018901,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.168168,0.000142
TGFBI,0.079988,1.447749,3.611403,0.932887,0.010905,0.017768,0.061105,0.037221,1.049998,0.095486,0.068124,2.320411,1.738326,0.187946


### **Output**
This code displays three *pandas DataFrames*, one per chosen organ. In each DataFrame:
* Each row represents a gene.
* Each column corresponds to a cell type.
* The values indicate the average gene expression (measured in counts per ten thousand, or cptt).
* Rows are ordered first by average gene expression in neutrophils and then by expression in T cells, both in ascending order.
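
The cell types differ from organ to organ, as the column headers above show, so a fixed list of sort keys raises a `KeyError` for any organ that lacks one of them. The sketch below shows one possible guard. The helper `sort_by_available` is a hypothetical name, not part of *atlasapprox* or pandas, and the small table reuses liver values from above with the *T* column left out on purpose to simulate an organ without T cells.

```python
import pandas as pd

def sort_by_available(df, sort_keys):
    """Sort rows by whichever of sort_keys exist as columns; return df unchanged if none do."""
    present = [key for key in sort_keys if key in df.columns]
    return df.sort_values(by=present) if present else df

# Liver values copied from the output above; the "T" column is omitted
# deliberately to simulate an organ without T cells.
organ_table = pd.DataFrame(
    {
        "neutrophil": [0.079988, 0.0, 0.003274],
        "NK": [0.037221, 14.692757, 0.0],
    },
    index=["TGFBI", "GZMA", "PDGFRA"],
)

# "T" is silently skipped, so the table is sorted by "neutrophil" only.
sorted_table = sort_by_available(organ_table, ["neutrophil", "T"])
print(sorted_table)
```

Skipping missing keys silently is a design choice; printing or logging the skipped keys may be preferable when the sort order matters for interpretation.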



## **Conclusion**



This tutorial covers some basic uses of `average` in *atlasapprox*. Thank you for using the *atlasapprox* API. For more detailed information, please refer to the [official documentation](https://atlasapprox.readthedocs.io/en/latest/python/index.html).