# Curated Atlas Query (Python)

## Importing the package

In [1]:
from curated_atlas_query_py import get_metadata, get_anndata

## Getting the metadata
The `get_metadata()` function returns two values: a database connection and a DuckDB table.
Use the table to query the metadata. The connection is there mainly so that you can close it when you are finished.

In [2]:
conn, table = get_metadata()
table

┌────────────────────┬──────────────────────┬──────────────────────┬───┬──────────────────────┬──────────────────────┐
│       .cell        │     sample_id_db     │       .sample        │ … │ n_cell_type_in_tis…  │ n_tissue_in_cell_t…  │
│      varchar       │       varchar        │       varchar        │   │        int64         │        int64         │
├────────────────────┼──────────────────────┼──────────────────────┼───┼──────────────────────┼──────────────────────┤
│ AAACCTGAGAGACGAA_1 │ 8a0fe0928684d6765d…  │ 5f20d7daf6c42f4f91…  │ … │                 NULL │                 NULL │
│ AAACCTGAGTTGTCGT_1 │ 8a0fe0928684d6765d…  │ 5f20d7daf6c42f4f91…  │ … │                 NULL │                 NULL │
│ AAACCTGCAGTCGATT_1 │ 02eb2ebcb5f802e271…  │ 5f20d7daf6c42f4f91…  │ … │                 NULL │                 NULL │
│ AAACCTGCAGTTCATG_1 │ 02eb2ebcb5f802e271…  │ 5f20d7daf6c42f4f91…  │ … │                 NULL │                 NULL │
│ AAACCTGGTCTAAACC_1 │ 8a0fe0928684d6765d…  │ 5f

### Querying the metadata
The DuckDB table can be queried with the methods listed in the [DuckDB Python API reference](https://duckdb.org/docs/api/python/reference/#duckdb.DuckDBPyRelation). In particular:
* [`.filter()`](https://duckdb.org/docs/api/python/reference/#duckdb.DuckDBPyRelation.filter): filters the metadata using a string expression
* [`.aggregate()`](https://duckdb.org/docs/api/python/reference/#duckdb.DuckDBPyRelation.aggregate): groups by one or more columns and calculates aggregate statistics such as counts
* [`.fetchdf()`](https://duckdb.org/docs/api/python/reference/#duckdb.DuckDBPyRelation.fetchdf): executes the query and returns the result as a pandas DataFrame

In [3]:
table.aggregate("tissue, file_id, COUNT(*) as n", group_expr="tissue, file_id").fetchdf()

| | tissue | file_id | n |
|---|---|---|---|
| 0 | cortex of kidney | 2977b3fa-e4d6-4929-8540-ae12d33a3c53 | 25166 |
| 1 | entorhinal cortex | 29d5d028-6f90-4943-91f7-fa3f93731de8 | 5500 |
| 2 | middle temporal gyrus | 2a689fda-d335-4ac0-81b1-a356fdf939db | 18402 |
| 3 | respiratory airway | 2c2d5bea-8be7-4227-8a56-f2a85d57fa56 | 4594 |
| 4 | thymus | 2ec94470-8171-4825-8346-34d77383438b | 18524 |
| ... | ... | ... | ... |
| 886 | cortex of kidney | f2ccb395-6800-455a-9655-476cc54ab864 | 27036 |
| 887 | lung | f4065ffa-c2d6-45bf-b596-5a69e36d8fcd | 46500 |
| 888 | heart left ventricle | f445a8b3-b17a-4cc6-b94e-ff56a180dadc | 2551 |
| 889 | lung epithelium | f498030e-246c-4376-87e3-90b28c7efb00 | 19361 |


In [4]:
table.filter("ethnicity == 'African'")

┌──────────────────────┬──────────────────────┬──────────────────────┬───┬──────────────────────┬──────────────────────┐
│        .cell         │     sample_id_db     │       .sample        │ … │ n_cell_type_in_tis…  │ n_tissue_in_cell_t…  │
│       varchar        │       varchar        │       varchar        │   │        int64         │        int64         │
├──────────────────────┼──────────────────────┼──────────────────────┼───┼──────────────────────┼──────────────────────┤
│ AGGGAGTAGCGTTTAC_S…  │ 9da02eab40e49d1a07…  │ 20071ec5a126508641…  │ … │                   28 │                   32 │
│ ATTGGACAGCCGATTT_F…  │ 89ec472baa9d514068…  │ 4fc10a6b85e5fa688b…  │ … │                   28 │                   32 │
│ CCCAGTTCATACCATG_S…  │ 26750b2a06c447f7f2…  │ 055e5172053464e8ef…  │ … │                   28 │                   32 │
│ TGGACGCAGTGATCGG_S…  │ 26750b2a06c447f7f2…  │ 055e5172053464e8ef…  │ … │                   28 │                   32 │
│ ACGGTTACAGTCTTCC_S…  │ c87e74c

In [5]:
query = table.filter("""
    ethnicity == 'African'
    AND assay LIKE '%10%'
    AND tissue == 'lung parenchyma'
    AND cell_type LIKE '%CD4%'
""")
query

┌──────────────────────┬──────────────────────┬──────────────────────┬───┬──────────────────────┬──────────────────────┐
│        .cell         │     sample_id_db     │       .sample        │ … │ n_cell_type_in_tis…  │ n_tissue_in_cell_t…  │
│       varchar        │       varchar        │       varchar        │   │        int64         │        int64         │
├──────────────────────┼──────────────────────┼──────────────────────┼───┼──────────────────────┼──────────────────────┤
│ ACAGCCGGTCCGTTAA_F…  │ 33cdeb84ae1462d723…  │ 4fc10a6b85e5fa688b…  │ … │                   28 │                   31 │
│ GGGAATGAGCCCAGCT_F…  │ 33cdeb84ae1462d723…  │ 4fc10a6b85e5fa688b…  │ … │                   28 │                   32 │
│ TCTTCGGAGTAGCGGT_F…  │ 33cdeb84ae1462d723…  │ 4fc10a6b85e5fa688b…  │ … │                   28 │                   32 │
│ CCTTACGAGAGCTGCA_F…  │ 33cdeb84ae1462d723…  │ 4fc10a6b85e5fa688b…  │ … │                   28 │                   32 │
│ ATCTACTCAATGGAAT_F…  │ 33cdeb8
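
When a filter combines many conditions, it can be easier to assemble the expression string programmatically than to edit it by hand. The sketch below shows one way to do that. `build_filter` is a hypothetical helper, not part of `curated_atlas_query_py`; it inserts column names verbatim, so it assumes they come from trusted code rather than user input.

```python
def build_filter(exact=None, like=None):
    """Combine conditions into one AND-ed SQL expression string.

    Hypothetical helper for illustration only.
    exact: mapping of column -> value compared with ==
    like:  mapping of column -> pattern compared with LIKE
    """
    def quote(value):
        # Double any single quotes so the value is a valid SQL string literal.
        return "'" + str(value).replace("'", "''") + "'"

    clauses = [f"{col} == {quote(val)}" for col, val in (exact or {}).items()]
    clauses += [f"{col} LIKE {quote(pat)}" for col, pat in (like or {}).items()]
    return " AND ".join(clauses)


expr = build_filter(
    exact={"ethnicity": "African", "tissue": "lung parenchyma"},
    like={"assay": "%10%", "cell_type": "%CD4%"},
)
print(expr)
```

The resulting string can be passed to `table.filter(expr)` in the same way as a hand-written expression. With no conditions the helper returns an empty string, which should not be passed to `.filter()`.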

## Extracting Counts

Once you're happy with your query, you can pass it into `get_anndata()` to obtain an AnnData object:

In [6]:
get_anndata(query, assays=["counts"])

AnnData object with n_obs × n_vars = 1571 × 60661
    obs: '.cell', 'sample_id_db', '.sample', '.sample_name', 'assay', 'assay_ontology_term_id', 'file_id_db', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'ethnicity', 'ethnicity_ontology_term_id', 'file_id', 'is_primary_data.x', 'organism', 'organism_ontology_term_id', 'sample_placeholder', 'sex', 'sex_ontology_term_id', 'tissue', 'tissue_ontology_term_id', 'tissue_harmonised', 'age_days', 'dataset_id', 'collection_id', 'cell_count', 'dataset_deployments', 'is_primary_data.y', 'is_valid', 'linked_genesets', 'mean_genes_per_cell', 'name', 'published', 'revision', 'schema_version', 'tombstone', 'x_normalization', 'created_at.x', 'published_at', 'revised_at', 'updated_at.x', 'filename', 'filetype', 's3_uri', 'user_submitted', 'created_at.y', 'updated_at.y', 'cell_type_harmonised', 'confidence_class', 'cell_annotation_azimuth_l2', 'cell_annotati

You can also request counts scaled to counts per million (CPM). This is helpful if only a few genes are of interest:

In [7]:
get_anndata(query, assays=['cpm'])

AnnData object with n_obs × n_vars = 1571 × 60661
    obs: '.cell', 'sample_id_db', '.sample', '.sample_name', 'assay', 'assay_ontology_term_id', 'file_id_db', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'ethnicity', 'ethnicity_ontology_term_id', 'file_id', 'is_primary_data.x', 'organism', 'organism_ontology_term_id', 'sample_placeholder', 'sex', 'sex_ontology_term_id', 'tissue', 'tissue_ontology_term_id', 'tissue_harmonised', 'age_days', 'dataset_id', 'collection_id', 'cell_count', 'dataset_deployments', 'is_primary_data.y', 'is_valid', 'linked_genesets', 'mean_genes_per_cell', 'name', 'published', 'revision', 'schema_version', 'tombstone', 'x_normalization', 'created_at.x', 'published_at', 'revised_at', 'updated_at.x', 'filename', 'filetype', 's3_uri', 'user_submitted', 'created_at.y', 'updated_at.y', 'cell_type_harmonised', 'confidence_class', 'cell_annotation_azimuth_l2', 'cell_annotati
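
For intuition, counts per million is conventionally computed by dividing each cell's gene counts by that cell's total count and multiplying by one million. The sketch below assumes that conventional definition; how the package computes its `cpm` assay is not shown in this document, and `counts_per_million` is a hypothetical helper.

```python
def counts_per_million(counts):
    """Scale one cell's raw gene counts so that they sum to one million.

    Hypothetical helper illustrating the conventional CPM definition.
    """
    total = sum(counts)
    if total == 0:
        # A cell with no counts has no meaningful scaling; return zeros.
        return [0.0 for _ in counts]
    return [c * 1_000_000 / total for c in counts]


# One cell with three genes and a total of 10 raw counts.
print(counts_per_million([1, 3, 6]))  # [100000.0, 300000.0, 600000.0]
```

Scaling by each cell's total makes values comparable across cells with different sequencing depth.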

We can also restrict the query to a subset of genes. Notice that the result has `n_vars = 1`:

In [8]:
anndata = get_anndata(query, features=['PUM1'], repository='file:///vast/projects/human_cell_atlas_py/anndata')
anndata

AnnData object with n_obs × n_vars = 1571 × 1
    obs: '.cell', 'sample_id_db', '.sample', '.sample_name', 'assay', 'assay_ontology_term_id', 'file_id_db', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'ethnicity', 'ethnicity_ontology_term_id', 'file_id', 'is_primary_data.x', 'organism', 'organism_ontology_term_id', 'sample_placeholder', 'sex', 'sex_ontology_term_id', 'tissue', 'tissue_ontology_term_id', 'tissue_harmonised', 'age_days', 'dataset_id', 'collection_id', 'cell_count', 'dataset_deployments', 'is_primary_data.y', 'is_valid', 'linked_genesets', 'mean_genes_per_cell', 'name', 'published', 'revision', 'schema_version', 'tombstone', 'x_normalization', 'created_at.x', 'published_at', 'revised_at', 'updated_at.x', 'filename', 'filetype', 's3_uri', 'user_submitted', 'created_at.y', 'updated_at.y', 'cell_type_harmonised', 'confidence_class', 'cell_annotation_azimuth_l2', 'cell_annotation_b

We can access the cell metadata using standard AnnData conventions, here through the `obs` attribute:

In [9]:
anndata.obs

Unnamed: 0,.cell,sample_id_db,.sample,.sample_name,assay,assay_ontology_term_id,file_id_db,cell_type,cell_type_ontology_term_id,development_stage,...,s3_uri,user_submitted,created_at.y,updated_at.y,cell_type_harmonised,confidence_class,cell_annotation_azimuth_l2,cell_annotation_blueprint_singler,n_cell_type_in_tissue,n_tissue_in_cell_type
0,ACAGCCGGTCCGTTAA_F02526,33cdeb84ae1462d723c19af1bea2a366,4fc10a6b85e5fa688b253db4e0db8ba0,VUHD92___lung parenchyma___55-year-old human s...,10x 5' v1,EFO:0011025,bc380dae8b14313a870973697842878b,"CD4-positive, alpha-beta T cell",CL:0000624,55-year-old human stage,...,s3://corpora-data-prod/13825e35-ea32-4104-a0b7...,1,19226.0,19227.0,cd4 tem,1.0,mait,cd4 tem,28.0,31.0
1,GGGAATGAGCCCAGCT_F02526,33cdeb84ae1462d723c19af1bea2a366,4fc10a6b85e5fa688b253db4e0db8ba0,VUHD92___lung parenchyma___55-year-old human s...,10x 5' v1,EFO:0011025,bc380dae8b14313a870973697842878b,"CD4-positive, alpha-beta T cell",CL:0000624,55-year-old human stage,...,s3://corpora-data-prod/13825e35-ea32-4104-a0b7...,1,19226.0,19227.0,cd4 tcm,4.0,cd4 tcm,cd4 tem,28.0,32.0
2,TCTTCGGAGTAGCGGT_F02526,33cdeb84ae1462d723c19af1bea2a366,4fc10a6b85e5fa688b253db4e0db8ba0,VUHD92___lung parenchyma___55-year-old human s...,10x 5' v1,EFO:0011025,bc380dae8b14313a870973697842878b,"CD4-positive, alpha-beta T cell",CL:0000624,55-year-old human stage,...,s3://corpora-data-prod/13825e35-ea32-4104-a0b7...,1,19226.0,19227.0,cd4 tcm,4.0,cd4 tcm,cd4 tem,28.0,32.0
3,CCTTACGAGAGCTGCA_F02526,33cdeb84ae1462d723c19af1bea2a366,4fc10a6b85e5fa688b253db4e0db8ba0,VUHD92___lung parenchyma___55-year-old human s...,10x 5' v1,EFO:0011025,bc380dae8b14313a870973697842878b,"CD4-positive, alpha-beta T cell",CL:0000624,55-year-old human stage,...,s3://corpora-data-prod/13825e35-ea32-4104-a0b7...,1,19226.0,19227.0,cd4 tcm,4.0,cd4 tcm,cd4 tem,28.0,32.0
4,ATCTACTCAATGGAAT_F02526,33cdeb84ae1462d723c19af1bea2a366,4fc10a6b85e5fa688b253db4e0db8ba0,VUHD92___lung parenchyma___55-year-old human s...,10x 5' v1,EFO:0011025,bc380dae8b14313a870973697842878b,"CD4-positive, alpha-beta T cell",CL:0000624,55-year-old human stage,...,s3://corpora-data-prod/13825e35-ea32-4104-a0b7...,1,19226.0,19227.0,cd4 tcm,4.0,cd4 tcm,cd4 tem,28.0,32.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1566,TACTTACGTAATAGCA_F02526,33cdeb84ae1462d723c19af1bea2a366,4fc10a6b85e5fa688b253db4e0db8ba0,VUHD92___lung parenchyma___55-year-old human s...,10x 5' v1,EFO:0011025,bc380dae8b14313a870973697842878b,"CD4-positive, alpha-beta T cell",CL:0000624,55-year-old human stage,...,s3://corpora-data-prod/13825e35-ea32-4104-a0b7...,1,19226.0,19227.0,cd4 tcm,4.0,cd4 tcm,cd4 tem,28.0,32.0
1567,AGATAGAGTGCCTTCT_SC84,21ef23ac07391c64cadc78e16511effa,13f5331436ecaeaeffada423c8dbd1ef,NU_CZI01___lung parenchyma___52-year-old human...,10x 3' v3,EFO:0009922,bc380dae8b14313a870973697842878b,"CD4-positive, alpha-beta T cell",CL:0000624,52-year-old human stage,...,s3://corpora-data-prod/13825e35-ea32-4104-a0b7...,1,19226.0,19227.0,cd4 tcm,4.0,cd4 tcm,cd4 tem,28.0,32.0
1568,CGCGGTATCCGCGCAA_SC24,9dfbd16390b119392af9406561cb664f,055e5172053464e8efc5de1b5b3a7646,Donor_06___lung parenchyma___22-year-old human...,10x 3' v2,EFO:0009899,bc380dae8b14313a870973697842878b,"CD4-positive, alpha-beta T cell",CL:0000624,22-year-old human stage,...,s3://corpora-data-prod/13825e35-ea32-4104-a0b7...,1,19226.0,19227.0,cd4 tcm,4.0,cd4 tcm,cd4 tem,28.0,32.0
1569,TACAACGTCAGCATTG_SC84,21ef23ac07391c64cadc78e16511effa,13f5331436ecaeaeffada423c8dbd1ef,NU_CZI01___lung parenchyma___52-year-old human...,10x 3' v3,EFO:0009922,bc380dae8b14313a870973697842878b,"CD4-positive, alpha-beta T cell",CL:0000624,52-year-old human stage,...,s3://corpora-data-prod/13825e35-ea32-4104-a0b7...,1,19226.0,19227.0,cd4 tcm,3.0,cd4 tcm,tregs,28.0,32.0


## Finishing Up

When you are finished, you should close the connection:

In [10]:
conn.close()
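
If an error occurs between opening the connection and calling `conn.close()`, the connection stays open. One way to guard against this is `contextlib.closing` from the standard library, which calls `close()` when the `with` block exits, even on an exception. The sketch below uses a stand-in class, `DemoConnection`, so that it runs without the package. It relies only on the connection having a `close()` method, as the one returned by `get_metadata()` does.

```python
from contextlib import closing


class DemoConnection:
    """Stand-in for a database connection; only close() matters here."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


conn = DemoConnection()
try:
    with closing(conn):
        # Any work with the table would go here; simulate a failure.
        raise RuntimeError("query failed")
except RuntimeError:
    pass

print(conn.closed)  # True: close() ran despite the exception
```

With the real package the same pattern would be `conn, table = get_metadata()` followed by `with closing(conn): ...`.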