<div style="text-align: center;">
    <h1><b>AtlasApprox API (Python) Tutorial</b></h1>
    <h2><b>1 Get the average gene expression for specific features</b></h2>
</div>


This tutorial shows how to retrieve the average expression levels of selected genes with the AtlasApprox API. It walks through each step, from installing the package to retrieving, sorting, and filtering the expression data.

Throughout this walkthrough, we use the human lung as an example. The genes and organ used are specified in section 1.2.

## **1.1 Requirements and Installation**

To use the atlasapprox API, please install the following Python packages:
- `requests`

- `pandas`

To install *atlasapprox*, run the following `pip` command in your terminal.

In [40]:
pip install atlasapprox


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


## **1.2 Get started**

First, import `atlasapprox` and instantiate the `API` object.

In [41]:
import atlasapprox

# instantiate
api = atlasapprox.API()

### **parameters**

To get average gene expression for specific features, we need the following parameters:

- organism (*The organism you want to explore*).

- organ (*The organ you want to explore*).

- features (*The genes you want to explore*).

- measurement_type (*gene_expression/chromatin_accessibility*)
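As a minimal sketch, the four parameters above can be collected in a plain dictionary before the call is made, which makes them easy to reuse or change. The variable names `query` and `allowed_types` are illustrative only and are not part of the atlasapprox package; the values are the ones used in this tutorial.

```python
# Collect the four parameters in one dict so they can be reused or tweaked.
query = {
    "organism": "h_sapiens",
    "organ": "lung",
    "features": ["COL13A1", "COL14A1", "TGFBI", "PDGFRA", "GZMA"],
    "measurement_type": "gene_expression",
}

# The two measurement types listed above (a local check, not an API feature).
allowed_types = {"gene_expression", "chromatin_accessibility"}
if query["measurement_type"] not in allowed_types:
    raise ValueError(f"Unknown measurement_type: {query['measurement_type']}")

# With an API instance available, the call would then be:
# average = api.average(**query)
print(sorted(query))
```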

### **Sample Code**

This sample code retrieves the expression of five chosen genes (*"COL13A1", "COL14A1", "TGFBI", "PDGFRA", "GZMA"*) in the human lung.

The unit of measurement, or normalisation, is counts per ten thousand (cptt).

In [39]:
average = api.average(
    organism = "h_sapiens", 
    organ = "lung", 
    features = ["COL13A1", "COL14A1", "TGFBI", "PDGFRA", "GZMA"], 
    measurement_type = 'gene_expression'
)

average

Unnamed: 0,neutrophil,basophil,monocyte,macrophage,dendritic,B,plasma,T,NK,plasmacytoid,...,capillary,CAP2,lymphatic,fibroblast,alveolar fibroblast,smooth muscle,vascular smooth muscle,pericyte,mesothelial,ionocyte
COL13A1,0.0,0.222863,0.0,0.000711,0.0,0.0,0.002205,0.0,0.029147,0.0,...,0.003937,0.0,0.0,0.005113,0.446961,0.0,0.131642,0.06796,0.0,0.0
COL14A1,0.0,0.0,0.001422,0.001362,0.0,0.0,0.002607,0.0,0.0,0.0,...,0.007525,0.026666,0.059648,1.110076,1.226022,1.033389,2.10846,0.03358,0.0,0.0
TGFBI,0.06515,0.111107,1.802062,1.252701,2.190132,0.0,0.083882,0.10046,0.32661,4.492828,...,0.045932,0.06761,0.521915,0.393191,0.175393,0.311884,0.258512,0.11901,0.404976,0.032419
PDGFRA,0.0,0.0,0.000965,0.002414,0.003172,0.0,0.0,0.005035,0.0,0.0,...,0.011427,0.00292,0.0,1.772957,3.724075,0.128634,0.059852,0.0,0.332479,0.0
GZMA,0.013437,0.142837,0.174047,0.029326,0.020453,0.025113,0.063292,9.006065,19.687157,0.0,...,0.044351,0.042996,0.073877,0.029919,0.081036,0.119041,0.0,0.460141,0.044982,0.058806


**output:** The call returns a `pandas.DataFrame` with the average gene expression. Each **column** is a cell type and each **row** is a feature.
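To make the row and column layout concrete, here is a small sketch that inspects a DataFrame of the same shape. The `excerpt` frame is rebuilt by hand from three cell-type columns of the output above, so the snippet runs without calling the API. The same calls should work on the full `average` DataFrame.

```python
import pandas as pd

# A small excerpt of the output above (three cell types only), rebuilt by hand.
excerpt = pd.DataFrame(
    {
        "neutrophil": [0.0, 0.0, 0.06515, 0.0, 0.013437],
        "macrophage": [0.000711, 0.001362, 1.252701, 0.002414, 0.029326],
        "NK": [0.029147, 0.0, 0.32661, 0.0, 19.687157],
    },
    index=["COL13A1", "COL14A1", "TGFBI", "PDGFRA", "GZMA"],
)

print(excerpt.shape)          # (number of features, number of cell types)
print(list(excerpt.index))    # the features (genes) are the row index
print(list(excerpt.columns))  # the cell types are the columns

# Look up a single value: expression of GZMA in NK cells.
print(excerpt.loc["GZMA", "NK"])

# Transpose if you prefer one row per cell type.
by_cell_type = excerpt.T
```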

## **1.3 Method extension**

In this extension, we explore some useful methods that can help you get cleaner output. We cover sorting and filtering the data with pandas.

### **1.3.1 Sort your data by cell type**

We can use the pandas *sort_values()* method to sort our data by a specific cell type.

In [10]:
sorted_by_neutrophil = average.sort_values(by='neutrophil')

sorted_by_neutrophil

Unnamed: 0,neutrophil,basophil,monocyte,macrophage,dendritic,B,plasma,T,NK,plasmacytoid,...,capillary,CAP2,lymphatic,fibroblast,alveolar fibroblast,smooth muscle,vascular smooth muscle,pericyte,mesothelial,ionocyte
COL13A1,0.0,0.222863,0.0,0.000711,0.0,0.0,0.002205,0.0,0.029147,0.0,...,0.003937,0.0,0.0,0.005113,0.446961,0.0,0.131642,0.06796,0.0,0.0
COL14A1,0.0,0.0,0.001422,0.001362,0.0,0.0,0.002607,0.0,0.0,0.0,...,0.007525,0.026666,0.059648,1.110076,1.226022,1.033389,2.10846,0.03358,0.0,0.0
PDGFRA,0.0,0.0,0.000965,0.002414,0.003172,0.0,0.0,0.005035,0.0,0.0,...,0.011427,0.00292,0.0,1.772957,3.724075,0.128634,0.059852,0.0,0.332479,0.0
GZMA,0.013437,0.142837,0.174047,0.029326,0.020453,0.025113,0.063292,9.006065,19.687157,0.0,...,0.044351,0.042996,0.073877,0.029919,0.081036,0.119041,0.0,0.460141,0.044982,0.058806
TGFBI,0.06515,0.111107,1.802062,1.252701,2.190132,0.0,0.083882,0.10046,0.32661,4.492828,...,0.045932,0.06761,0.521915,0.393191,0.175393,0.311884,0.258512,0.11901,0.404976,0.032419


**output:** In this output, the original DataFrame is sorted by gene expression in neutrophils, from lowest to highest.

### **1.3.2 Sort your data by multiple cell types**

We can also sort the original data by more than one cell type.

In [12]:
sorted_by_neutrophil_basophil = average.sort_values(by=['neutrophil', 'basophil'])

sorted_by_neutrophil_basophil

Unnamed: 0,neutrophil,basophil,monocyte,macrophage,dendritic,B,plasma,T,NK,plasmacytoid,...,capillary,CAP2,lymphatic,fibroblast,alveolar fibroblast,smooth muscle,vascular smooth muscle,pericyte,mesothelial,ionocyte
COL14A1,0.0,0.0,0.001422,0.001362,0.0,0.0,0.002607,0.0,0.0,0.0,...,0.007525,0.026666,0.059648,1.110076,1.226022,1.033389,2.10846,0.03358,0.0,0.0
PDGFRA,0.0,0.0,0.000965,0.002414,0.003172,0.0,0.0,0.005035,0.0,0.0,...,0.011427,0.00292,0.0,1.772957,3.724075,0.128634,0.059852,0.0,0.332479,0.0
COL13A1,0.0,0.222863,0.0,0.000711,0.0,0.0,0.002205,0.0,0.029147,0.0,...,0.003937,0.0,0.0,0.005113,0.446961,0.0,0.131642,0.06796,0.0,0.0
GZMA,0.013437,0.142837,0.174047,0.029326,0.020453,0.025113,0.063292,9.006065,19.687157,0.0,...,0.044351,0.042996,0.073877,0.029919,0.081036,0.119041,0.0,0.460141,0.044982,0.058806
TGFBI,0.06515,0.111107,1.802062,1.252701,2.190132,0.0,0.083882,0.10046,0.32661,4.492828,...,0.045932,0.06761,0.521915,0.393191,0.175393,0.311884,0.258512,0.11901,0.404976,0.032419


**output:** In this output, we sort the original DataFrame by 2 cell types, firstly by neutrophil then by basophil, in ascending order.
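When sorting by several cell types, `sort_values()` also accepts a list for `ascending`, with one direction per column. The sketch below uses a hand-built `excerpt` containing only the neutrophil and basophil columns from the output above. The same call should work on the full `average` DataFrame.

```python
import pandas as pd

# Two cell-type columns copied from the output above.
excerpt = pd.DataFrame(
    {
        "neutrophil": [0.0, 0.0, 0.06515, 0.0, 0.013437],
        "basophil": [0.222863, 0.0, 0.111107, 0.0, 0.142837],
    },
    index=["COL13A1", "COL14A1", "TGFBI", "PDGFRA", "GZMA"],
)

# Ascending by neutrophil; ties are broken by basophil in descending order.
mixed_order = excerpt.sort_values(
    by=["neutrophil", "basophil"], ascending=[True, False]
)
print(mixed_order)
```

Among the three genes with zero neutrophil expression, COL13A1 now comes first because it has the highest basophil value.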

### **1.3.3 Sort your data in descending order**

The following code sorts the genes from highest to lowest expression in a chosen cell type.

In [13]:
sorted_by_neutrophil = average.sort_values(by='neutrophil', ascending=False)

sorted_by_neutrophil

Unnamed: 0,neutrophil,basophil,monocyte,macrophage,dendritic,B,plasma,T,NK,plasmacytoid,...,capillary,CAP2,lymphatic,fibroblast,alveolar fibroblast,smooth muscle,vascular smooth muscle,pericyte,mesothelial,ionocyte
TGFBI,0.06515,0.111107,1.802062,1.252701,2.190132,0.0,0.083882,0.10046,0.32661,4.492828,...,0.045932,0.06761,0.521915,0.393191,0.175393,0.311884,0.258512,0.11901,0.404976,0.032419
GZMA,0.013437,0.142837,0.174047,0.029326,0.020453,0.025113,0.063292,9.006065,19.687157,0.0,...,0.044351,0.042996,0.073877,0.029919,0.081036,0.119041,0.0,0.460141,0.044982,0.058806
COL13A1,0.0,0.222863,0.0,0.000711,0.0,0.0,0.002205,0.0,0.029147,0.0,...,0.003937,0.0,0.0,0.005113,0.446961,0.0,0.131642,0.06796,0.0,0.0
COL14A1,0.0,0.0,0.001422,0.001362,0.0,0.0,0.002607,0.0,0.0,0.0,...,0.007525,0.026666,0.059648,1.110076,1.226022,1.033389,2.10846,0.03358,0.0,0.0
PDGFRA,0.0,0.0,0.000965,0.002414,0.003172,0.0,0.0,0.005035,0.0,0.0,...,0.011427,0.00292,0.0,1.772957,3.724075,0.128634,0.059852,0.0,0.332479,0.0


**output:** In this output, the original DataFrame is sorted by gene expression in neutrophils in descending order.
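The examples so far sort genes (rows) within a cell type. A related question is which cell types express a given gene most strongly. A minimal sketch, using a hand-built `excerpt` with four cell types and two genes taken from the output above:

```python
import pandas as pd

# Four cell types and two genes copied from the output above.
excerpt = pd.DataFrame(
    {
        "neutrophil": [0.06515, 0.013437],
        "macrophage": [1.252701, 0.029326],
        "T": [0.10046, 9.006065],
        "NK": [0.32661, 19.687157],
    },
    index=["TGFBI", "GZMA"],
)

# Selecting one row gives a Series indexed by cell type; sort it from high to low.
gzma_ranked = excerpt.loc["GZMA"].sort_values(ascending=False)
print(gzma_ranked)

# nlargest is a shortcut when only the top few cell types matter.
top_two = excerpt.loc["TGFBI"].nlargest(2)
print(top_two)
```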

### **1.3.4 Filter your data by chosen cell type**

We can filter the data down to the cell types we are interested in. For example, to keep only *neutrophil*, *macrophage*, and *plasma*, we can select those columns.

In [24]:
# Filter your data with specific cell types
chosen_cell_type = ['neutrophil', 'macrophage', 'plasma']
filtered_cell_type_df = average[chosen_cell_type]

filtered_cell_type_df

Unnamed: 0,neutrophil,macrophage,plasma
COL13A1,0.0,0.000711,0.002205
COL14A1,0.0,0.001362,0.002607
TGFBI,0.06515,1.252701,0.083882
PDGFRA,0.0,0.002414,0.0
GZMA,0.013437,0.029326,0.063292


**output:** In this output, only the selected cell types (*'neutrophil', 'macrophage', 'plasma'*) from the original DataFrame are displayed.
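Selecting columns with a list raises a `KeyError` if any requested cell type is absent from the DataFrame. The sketch below shows one way to guard against that. The `excerpt` frame is hand-built from the output above, and `"erythrocyte"` is a made-up request used only to demonstrate a missing column.

```python
import pandas as pd

# Three cell-type columns copied from the output above.
excerpt = pd.DataFrame(
    {
        "neutrophil": [0.0, 0.0, 0.06515, 0.0, 0.013437],
        "macrophage": [0.000711, 0.001362, 1.252701, 0.002414, 0.029326],
        "plasma": [0.002205, 0.002607, 0.083882, 0.0, 0.063292],
    },
    index=["COL13A1", "COL14A1", "TGFBI", "PDGFRA", "GZMA"],
)

# "erythrocyte" is hypothetical and deliberately not in the excerpt.
wanted = ["neutrophil", "macrophage", "plasma", "erythrocyte"]

# Split the request into columns that exist and columns that do not.
present = [name for name in wanted if name in excerpt.columns]
missing = [name for name in wanted if name not in excerpt.columns]

subset = excerpt[present]
print(subset)
print("Not found:", missing)
```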

### **1.3.5 Filter your data by a threshold expression value**

Suppose we only want to focus on values above 0.06. The following code keeps only the values above 0.06 in the DataFrame and blanks out the rest.

In [30]:
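# Create a threshold
threshold = 0.06

# Keep values strictly above the threshold; all other cells become NaN
filtered_threshold_df = average[average > threshold]
# Replace NaN with empty strings for display only (the columns are no longer numeric)
filtered_threshold_df = filtered_threshold_df.fillna("")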

filtered_threshold_df

Unnamed: 0,neutrophil,basophil,monocyte,macrophage,dendritic,B,plasma,T,NK,plasmacytoid,...,capillary,CAP2,lymphatic,fibroblast,alveolar fibroblast,smooth muscle,vascular smooth muscle,pericyte,mesothelial,ionocyte
COL13A1,,0.222863,,,,,,,,,...,,,,,0.446961,,0.131642,0.06796,,
COL14A1,,,,,,,,,,,...,,,,1.110076,1.226022,1.033389,2.10846,,,
TGFBI,0.06515,0.111107,1.802062,1.252701,2.190132,,0.083882,0.10046,0.32661,4.492828,...,,0.06761,0.521915,0.393191,0.175393,0.311884,0.258512,0.11901,0.404976,
PDGFRA,,,,,,,,,,,...,,,,1.772957,3.724075,0.128634,,,0.332479,
GZMA,,0.142837,0.174047,,,,0.063292,9.006065,19.687157,,...,,,0.073877,,0.081036,0.119041,,0.460141,,


**output:** In this output, all values less than or equal to 0.06 are blanked out, so only the values greater than 0.06 from the original DataFrame are displayed.
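If the filtered data is needed for further calculations rather than display, it can help to keep the masked cells as `NaN` so the columns stay numeric. A minimal sketch on a hand-built `excerpt` (three cell types from the output above), using `DataFrame.where`:

```python
import pandas as pd

# Three cell-type columns copied from the output above.
excerpt = pd.DataFrame(
    {
        "neutrophil": [0.0, 0.0, 0.06515, 0.0, 0.013437],
        "macrophage": [0.000711, 0.001362, 1.252701, 0.002414, 0.029326],
        "plasma": [0.002205, 0.002607, 0.083882, 0.0, 0.063292],
    },
    index=["COL13A1", "COL14A1", "TGFBI", "PDGFRA", "GZMA"],
)

threshold = 0.06

# Values at or below the threshold become NaN; the columns remain numeric.
masked = excerpt.where(excerpt > threshold)

# Count, per gene, how many cell types exceed the threshold.
hits_per_gene = masked.notna().sum(axis=1)
print(hits_per_gene)

# Keep only the genes that exceed the threshold in at least one cell type.
genes_with_hits = excerpt[(excerpt > threshold).any(axis=1)]
print(genes_with_hits)
```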

### **1.3.6 Find average value among all cell types**

We can find the average value among all cell types for a chosen feature.

In [33]:
chosen_feature = "TGFBI"
mean_value = average.loc[chosen_feature][1:].mean()  # Caution: the features are the index, not a column, so [1:] skips the first cell type (neutrophil)

print(f"The average gene expression of feature {chosen_feature} is {mean_value}.")

The average gene expression of feature TGFBI is 0.47069306697311075.
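To average over every cell type for all genes at once, `DataFrame.mean(axis=1)` can be applied to the whole table without slicing. The sketch below uses a hand-built `excerpt` with only three cell types from the output above, so its numbers differ from the value printed for the full table.

```python
import pandas as pd

# Three cell-type columns copied from the output above.
excerpt = pd.DataFrame(
    {
        "neutrophil": [0.0, 0.0, 0.06515, 0.0, 0.013437],
        "macrophage": [0.000711, 0.001362, 1.252701, 0.002414, 0.029326],
        "plasma": [0.002205, 0.002607, 0.083882, 0.0, 0.063292],
    },
    index=["COL13A1", "COL14A1", "TGFBI", "PDGFRA", "GZMA"],
)

# axis=1 averages across the columns (cell types), giving one value per gene.
per_gene_mean = excerpt.mean(axis=1)
print(per_gene_mean)

# The gene with the highest mean expression across these cell types.
print(per_gene_mean.idxmax())
```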


## **1.4 Conclusion**

This tutorial covers basic and more advanced uses of the `average` method in atlasapprox. For more detailed information, please refer to the official documentation.