# Enhanced Querying of the CELLxGENE Census with cxg-query-enhancer

## What is cxg-query-enhancer?
The `cxg-query-enhancer` library automatically expands CELLxGENE Census queries by including relevant subclasses and part-of relationships. Instead of manually specifying every cell type or tissue subclass, you can query at a high level and let the library handle the expansion.

**Key Benefits:**
* **Comprehensive results**: Automatically includes all relevant subtypes
* **Simplified queries**: Write queries using broad terms like "B cell" or "lung"
* **Consistent results**: Reduces the risk of missing relevant cell populations
* **Time-saving**: No need to manually research and list all subclasses

This notebook demonstrates how to use the `enhance()` function from `cxg-query-enhancer` to expand the filters of a CELLxGENE Census query so that they include relevant subclasses and part-of relationships for annotations such as cell types, tissues, development stages, and diseases.
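
To make the idea of filter expansion concrete, here is a minimal, self-contained sketch. It is not the library's implementation: the `TOY_SUBCLASSES` mapping and the `descendants` and `expand_terms` helpers are hypothetical names invented for illustration, and the hierarchy is a tiny hand-written stand-in, not an authoritative excerpt of any ontology.

```python
# Toy illustration only: a hand-written hierarchy standing in for an ontology.
# The parent -> children mapping below is a small, illustrative subset.
TOY_SUBCLASSES = {
    "B cell": ["mature B cell", "immature B cell"],
    "mature B cell": ["naive B cell", "germinal center B cell"],
}


def descendants(term, hierarchy):
    """Return all terms below `term` in the hierarchy (depth-first, no duplicates)."""
    found = []
    stack = list(hierarchy.get(term, []))
    while stack:
        current = stack.pop(0)
        if current not in found:
            found.append(current)
            # Visit the children of the current term before its siblings.
            stack = list(hierarchy.get(current, [])) + stack
    return found


def expand_terms(terms, hierarchy):
    """Return each requested term followed by its descendants."""
    expanded = []
    for term in terms:
        for candidate in [term] + descendants(term, hierarchy):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


expanded = expand_terms(["B cell"], TOY_SUBCLASSES)
value_filter = f"cell_type in {expanded}"
print(value_filter)
```

The point of the sketch is only the shape of the transformation: a short list of broad terms goes in, and a longer `in [...]` clause comes out that can be used as a regular Census value filter.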

## Getting started:

In [1]:
import cellxgene_census
from cxg_query_enhancer import enhance  # import `enhance()` from `cxg_query_enhancer`

## Querying cell metadata (obs) without the library `cxg-query-enhancer`:

In [5]:
with cellxgene_census.open_soma(census_version="latest") as census:

    cell_metadata_b_cell = cellxgene_census.get_obs(
        census,
        "homo_sapiens",
        value_filter="cell_type in ['B cell'] and tissue in ['lung'] and is_primary_data==True",    # B cells in the lung without enhancement 
        column_names=["disease"],
    )

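Without `enhance()`, the query returns about 73K B cells in the lung. Although only `disease` is requested in `column_names`, the output also shows the columns referenced in the filter (`cell_type`, `tissue`, and `is_primary_data`).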

In [6]:
print(cell_metadata_b_cell)

           disease cell_type tissue  is_primary_data
0      anencephaly    B cell   lung             True
1      anencephaly    B cell   lung             True
2      anencephaly    B cell   lung             True
3      anencephaly    B cell   lung             True
4      anencephaly    B cell   lung             True
...            ...       ...    ...              ...
72820       normal    B cell   lung             True
72821       normal    B cell   lung             True
72822       normal    B cell   lung             True
72823       normal    B cell   lung             True
72824       normal    B cell   lung             True

[72825 rows x 4 columns]


In [7]:
cell_metadata_b_cell.value_counts().reset_index()

| | disease | cell_type | tissue | is_primary_data | count |
|---|---|---|---|---|---|
| 0 | lung adenocarcinoma | B cell | lung | True | 30238 |
| 1 | squamous cell lung carcinoma | B cell | lung | True | 10048 |
| 2 | non-small cell lung carcinoma | B cell | lung | True | 7515 |
| 3 | pulmonary fibrosis | B cell | lung | True | 6798 |
| 4 | normal | B cell | lung | True | 6601 |
| 5 | lung cancer | B cell | lung | True | 2880 |
| 6 | COVID-19 | B cell | lung | True | 2510 |
| 7 | chronic obstructive pulmonary disease | B cell | lung | True | 2203 |
| 8 | lung large cell carcinoma | B cell | lung | True | 1534 |
| 9 | pleomorphic carcinoma | B cell | lung | True | 1210 |


Only cells labeled exactly 'B cell' in the tissue 'lung' are returned: the result contains a single cell type and a single tissue.

In [5]:
# Number and list of unique cell types in cell_metadata_b_cell
print(f"Unique cell types ({len(cell_metadata_b_cell['cell_type'].unique())}):", cell_metadata_b_cell["cell_type"].unique())

Unique cell types (1): ['B cell']
Categories (831, object): ['A2 amacrine cell', 'B cell', 'B-1 B cell', 'B-1a B cell', ..., 'vein endothelial cell of respiratory system', 'ventricular cardiac muscle cell', 'vestibular dark cell', 'visceromotor neuron']


In [6]:
# Number and list of unique tissues in cell_metadata_b_cell
print(f"Unique tissues ({len(cell_metadata_b_cell['tissue'].unique())}):", cell_metadata_b_cell["tissue"].unique())

Unique tissues (1): ['lung']
Categories (377, object): ['Brodmann (1909) area 17', 'Brodmann (1909) area 4', 'Brodmann (1909) area 7', 'Brodmann (1909) area 9', ..., 'white matter of parietal lobe', 'white matter of temporal lobe', 'yolk sac', 'zone of skin']


## Querying cell metadata (obs) with the library `cxg-query-enhancer`:

In [2]:
with cellxgene_census.open_soma(census_version="latest") as census:

    cell_metadata_b_cell = cellxgene_census.get_obs(
        census,
        "homo_sapiens",
        value_filter= enhance(                                            
            "cell_type in ['B cell'] and tissue in ['lung'] and is_primary_data==True",         # B cells in the lung with enhancement 
            organism="Homo sapiens"
        ),
        column_names=["disease"],
    )

With `enhance()`, the same query returns about 107K cells, because the filter now also matches the subclasses and parts of both the `cell_type` and `tissue` terms.

In [3]:
print(cell_metadata_b_cell)

            disease cell_type tissue  is_primary_data
0       anencephaly    B cell   lung             True
1       anencephaly    B cell   lung             True
2       anencephaly    B cell   lung             True
3       anencephaly    B cell   lung             True
4       anencephaly    B cell   lung             True
...             ...       ...    ...              ...
106800       normal    B cell   lung             True
106801       normal    B cell   lung             True
106802       normal    B cell   lung             True
106803       normal    B cell   lung             True
106804       normal    B cell   lung             True

[106805 rows x 4 columns]
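
The gain from enhancement can be quantified from the two row counts reported in this notebook (72825 without `enhance()` and 106805 with it). The short calculation below is only a worked example of that arithmetic; since the queries use `census_version="latest"`, the counts are specific to the run shown here and may differ in other Census releases.

```python
# Row counts reported by the two obs queries in this notebook.
rows_without_enhance = 72_825
rows_with_enhance = 106_805

# Cells that are only reachable through the expanded filter.
extra_rows = rows_with_enhance - rows_without_enhance
relative_gain = extra_rows / rows_without_enhance

print(f"Additional cells: {extra_rows}")
print(f"Relative increase: {relative_gain:.1%}")
```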


In [4]:
cell_metadata_b_cell.value_counts().reset_index()

| | disease | cell_type | tissue | is_primary_data | count |
|---|---|---|---|---|---|
| 0 | lung adenocarcinoma | B cell | lung | True | 30238 |
| 1 | squamous cell lung carcinoma | B cell | lung | True | 10048 |
| 2 | non-small cell lung carcinoma | B cell | lung | True | 7515 |
| 3 | pulmonary fibrosis | B cell | lung | True | 6798 |
| 4 | normal | B cell | lung | True | 6601 |
| 5 | lung adenocarcinoma | B cell | upper lobe of right lung | True | 5306 |
| 6 | COVID-19 | B cell | upper lobe of left lung | True | 4465 |
| 7 | normal | mature B cell | lung | True | 3940 |
| 8 | normal | B cell | lung parenchyma | True | 3092 |
| 9 | lung cancer | B cell | lung | True | 2880 |


With `enhance()`, the obs query returns B cells in the lung together with their subclasses and parts: 15 cell types and 12 tissues in total.

In [6]:
# Number and list of unique cell types in cell_metadata_b_cell
print(f"Unique cell types ({len(cell_metadata_b_cell['cell_type'].unique())}):", cell_metadata_b_cell["cell_type"].unique())

Unique cell types (15): ['B cell', 'small pre-B-II cell', 'large pre-B-II cell', 'B-1b B cell', 'immature B cell', ..., 'naive B cell', 'plasmablast', 'germinal center B cell', 'mature B cell', 'CD22-positive, CD38-low small pre-B cell']
Length: 15
Categories (831, object): ['A2 amacrine cell', 'B cell', 'B-1 B cell', 'B-1a B cell', ..., 'vein endothelial cell of respiratory system', 'ventricular cardiac muscle cell', 'vestibular dark cell', 'visceromotor neuron']


In [7]:
# Number and list of unique tissues in cell_metadata_b_cell
print(f"Unique tissues ({len(cell_metadata_b_cell['tissue'].unique())}):", cell_metadata_b_cell["tissue"].unique())

Unique tissues (12): ['lung', 'alveolus of lung', 'lung parenchyma', 'lower lobe of left lung', 'lower lobe of right lung', ..., 'upper lobe of left lung', 'left lung', 'middle lobe of right lung', 'lingula of left lung', 'segmental bronchus']
Length: 12
Categories (377, object): ['Brodmann (1909) area 17', 'Brodmann (1909) area 4', 'Brodmann (1909) area 7', 'Brodmann (1909) area 9', ..., 'white matter of parietal lobe', 'white matter of temporal lobe', 'yolk sac', 'zone of skin']


## Obtaining a slice as AnnData using the library `cxg-query-enhancer`:

In [12]:
# Open the latest version of the CELLxGENE Census
with cellxgene_census.open_soma(census_version="latest") as census:
    
    # Retrieve an AnnData object based on specific filters
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        var_value_filter="feature_id in ['ENSG00000161798', 'ENSG00000188229']",
        obs_value_filter=enhance(  # expand the obs filter with `enhance()`
            "sex == 'female' and cell_type in ['medium spiny neuron']",
            organism="Homo sapiens",
        ),
        obs_column_names=[
                "assay",
                "cell_type",
                "tissue",
                "tissue_general",
                "suspension_type",
                "disease",
            ],
    )

With `enhance()`, the result covers about 5K cells and 2 genes for medium spiny neurons and their subclasses.

In [13]:
adata

AnnData object with n_obs × n_vars = 5471 × 2
    obs: 'assay', 'cell_type', 'tissue', 'tissue_general', 'suspension_type', 'disease', 'sex'
    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_type', 'feature_length', 'nnz', 'n_measured_obs'

With `enhance()`, the `cell_type` column includes not only 'medium spiny neuron' but also its subclasses, such as 'indirect pathway medium spiny neuron' and 'direct pathway medium spiny neuron'.

In [14]:
adata.obs

| | assay | cell_type | tissue | tissue_general | suspension_type | disease | sex |
|---|---|---|---|---|---|---|---|
| 0 | 10x 3' v3 | indirect pathway medium spiny neuron | caudate nucleus | brain | nucleus | normal | female |
| 1 | 10x 3' v3 | direct pathway medium spiny neuron | caudate nucleus | brain | nucleus | normal | female |
| 2 | 10x 3' v3 | indirect pathway medium spiny neuron | caudate nucleus | brain | nucleus | normal | female |
| 3 | 10x 3' v3 | indirect pathway medium spiny neuron | caudate nucleus | brain | nucleus | normal | female |
| 4 | 10x 3' v3 | direct pathway medium spiny neuron | caudate nucleus | brain | nucleus | normal | female |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 5466 | 10x 3' v3 | medium spiny neuron | cerebral cortex | brain | cell | normal | female |
| 5467 | 10x 3' v3 | medium spiny neuron | cerebral cortex | brain | cell | normal | female |
| 5468 | 10x 3' v3 | medium spiny neuron | cerebral cortex | brain | cell | normal | female |
| 5469 | 10x 3' v3 | medium spiny neuron | cerebral cortex | brain | cell | normal | female |


## Obtaining a slice as AnnData with multiple categories (cell_type, tissue, and disease) using the library `cxg-query-enhancer`:

This example further demonstrates `enhance()` by applying it to a query involving multiple categories: `cell_type`, `tissue`, and `disease`. 

In [33]:
with cellxgene_census.open_soma(census_version="latest") as census:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        var_value_filter="feature_id in ['ENSG00000161798', 'ENSG00000188229']",
        obs_value_filter= enhance("cell_type in ['B cell'] and tissue in ['lung'] and disease in ['COVID-19'] and is_primary_data == True",
                                  organism="Homo sapiens"
                                 ),
        obs_column_names=["sex"],
    )

The result contains about 7K cells and 2 genes, with expanded categories.

In [34]:
adata

AnnData object with n_obs × n_vars = 7202 × 2
    obs: 'sex', 'cell_type', 'tissue', 'disease', 'is_primary_data'
    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_type', 'feature_length', 'nnz', 'n_measured_obs'

In [35]:
adata.obs

| | sex | cell_type | tissue | disease | is_primary_data |
|---|---|---|---|---|---|
| 0 | male | B cell | lung | COVID-19 | True |
| 1 | male | B cell | lung | COVID-19 | True |
| 2 | male | B cell | lung | COVID-19 | True |
| 3 | male | B cell | lung | COVID-19 | True |
| 4 | male | B cell | lung | COVID-19 | True |
| ... | ... | ... | ... | ... | ... |
| 7197 | unknown | B cell | lung | COVID-19 | True |
| 7198 | unknown | B cell | lung | COVID-19 | True |
| 7199 | unknown | B cell | lung | COVID-19 | True |
| 7200 | male | B cell | lung | COVID-19 | True |


`enhance()` automatically expanded the `tissue` filter to include terms such as 'upper lobe of left lung' and 'lower lobe of left lung'. Only the requested `cell_type` and `disease` labels appear in the result, which indicates that no cells annotated with their subclasses matched this particular query.
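
One way to read this result: expanding a filter only widens the set of labels the query will accept, and a label shows up in the output only if at least one cell actually carries it. The sketch below illustrates that with plain Python sets. The `accepted_labels` and `labels_in_data` values are invented for illustration and do not reflect the terms `enhance()` generated for this query.

```python
# Hypothetical labels accepted by an expanded filter (illustrative only).
accepted_labels = {"B cell", "mature B cell", "naive B cell"}

# Hypothetical labels carried by the cells that pass the other filter conditions.
labels_in_data = {"B cell", "T cell"}

# Only labels that are both accepted and present can appear in the result.
labels_in_result = accepted_labels & labels_in_data

# Accepted labels with no matching cells contribute nothing to the result.
accepted_but_absent = accepted_labels - labels_in_data

print(sorted(labels_in_result))
print(sorted(accepted_but_absent))
```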

In [36]:
for col in adata.obs.columns:
    # This loop iterates through each column in adata.obs and prints its unique values for each category
    print(f"Unique values in '{col}':", adata.obs[col].unique())

Unique values in 'sex': ['male', 'female', 'unknown']
Categories (3, object): ['female', 'male', 'unknown']
Unique values in 'cell_type': ['B cell']
Categories (831, object): ['A2 amacrine cell', 'B cell', 'B-1 B cell', 'B-1a B cell', ..., 'vein endothelial cell of respiratory system', 'ventricular cardiac muscle cell', 'vestibular dark cell', 'visceromotor neuron']
Unique values in 'tissue': ['lung', 'upper lobe of left lung', 'lower lobe of left lung']
Categories (377, object): ['Brodmann (1909) area 17', 'Brodmann (1909) area 4', 'Brodmann (1909) area 7', 'Brodmann (1909) area 9', ..., 'white matter of parietal lobe', 'white matter of temporal lobe', 'yolk sac', 'zone of skin']
Unique values in 'disease': ['COVID-19']
Categories (144, object): ['Alzheimer disease', 'B-cell acute lymphoblastic leukemia', 'B-cell non-Hodgkin lymphoma', 'Barrett esophagus', ..., 'tubulovillous adenoma', 'type 1 diabetes mellitus', 'type 2 diabetes mellitus', 'uveal melanoma']
Unique values in 'is_prima

## Querying multiple tissues (lung and kidney) with the library `cxg-query-enhancer`:

In [2]:
with cellxgene_census.open_soma(census_version="latest") as census:

    cell_metadata_b_cell = cellxgene_census.get_obs(
        census,
        "homo_sapiens",
        value_filter= enhance(                                            
            "cell_type in ['B cell'] and tissue in ['lung', 'kidney'] and is_primary_data==True",         # B cells in the lung and kidney with enhancement 
            organism="Homo sapiens"
        ),
        column_names=["disease"],
    )

In [3]:
cell_metadata_b_cell.value_counts().reset_index()

| | disease | cell_type | tissue | is_primary_data | count |
|---|---|---|---|---|---|
| 0 | lung adenocarcinoma | B cell | lung | True | 30238 |
| 1 | squamous cell lung carcinoma | B cell | lung | True | 10048 |
| 2 | non-small cell lung carcinoma | B cell | lung | True | 7515 |
| 3 | pulmonary fibrosis | B cell | lung | True | 6798 |
| 4 | normal | B cell | lung | True | 6601 |
| ... | ... | ... | ... | ... | ... |
| 82 | acute kidney failure | B cell | renal medulla | True | 6 |
| 83 | normal | unswitched memory B cell | kidney | True | 4 |
| 84 | normal | germinal center B cell | lung | True | 4 |
| 85 | normal | mature B cell | kidney | True | 1 |


In [4]:
# Number and list of unique cell types in cell_metadata_b_cell
print(f"Unique cell types ({len(cell_metadata_b_cell['cell_type'].unique())}):", cell_metadata_b_cell["cell_type"].unique())

Unique cell types (19): ['B cell', 'small pre-B-II cell', 'large pre-B-II cell', 'B-1b B cell', 'immature B cell', ..., 'germinal center B cell', 'mature B cell', 'CD22-positive, CD38-low small pre-B cell', 'B-2 B cell', 'B-1 B cell']
Length: 19
Categories (831, object): ['A2 amacrine cell', 'B cell', 'B-1 B cell', 'B-1a B cell', ..., 'vein endothelial cell of respiratory system', 'ventricular cardiac muscle cell', 'vestibular dark cell', 'visceromotor neuron']


In [5]:
# Number and list of unique tissues in cell_metadata_b_cell
print(f"Unique tissues ({len(cell_metadata_b_cell['tissue'].unique())}):", cell_metadata_b_cell["tissue"].unique())

Unique tissues (16): ['lung', 'kidney', 'cortex of kidney', 'renal pelvis', 'renal medulla', ..., 'upper lobe of left lung', 'left lung', 'middle lobe of right lung', 'lingula of left lung', 'segmental bronchus']
Length: 16
Categories (377, object): ['Brodmann (1909) area 17', 'Brodmann (1909) area 4', 'Brodmann (1909) area 7', 'Brodmann (1909) area 9', ..., 'white matter of parietal lobe', 'white matter of temporal lobe', 'yolk sac', 'zone of skin']
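
When a query combines several terms per category, as in the lung and kidney example above, the filter string can be assembled from Python lists before it is handed to `enhance()`. The helper below is a minimal sketch: `build_in_filter` is a hypothetical convenience function written for this notebook, not part of `cxg-query-enhancer` or `cellxgene_census`.

```python
def build_in_filter(**columns):
    """Join `column in [...]` clauses with `and` (hypothetical helper)."""
    clauses = [f"{name} in {list(values)!r}" for name, values in columns.items()]
    return " and ".join(clauses)


# Rebuild the multi-tissue filter used in the example above.
base_filter = build_in_filter(cell_type=["B cell"], tissue=["lung", "kidney"])
query = f"{base_filter} and is_primary_data==True"
print(query)
```

The resulting string is the same one passed to `enhance()` in the query above. Labels that themselves contain quote characters are rendered differently by `repr`, so such values are worth checking by hand before use.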


## Querying T cells in the lung 

In [2]:
import cellxgene_census  
from cxg_query_enhancer import enhance   

In [None]:
with cellxgene_census.open_soma(census_version="latest") as census:

    cell_metadata_T_cell = cellxgene_census.get_obs(
        census,
        "homo_sapiens",
        value_filter="cell_type in ['T cell'] and tissue in ['lung'] and is_primary_data==True",    # T cells in the lung without enhancement 
        column_names=["disease"],
    )

In [4]:
print(cell_metadata_T_cell)

                   disease cell_type tissue  is_primary_data
0              anencephaly    T cell   lung             True
1              anencephaly    T cell   lung             True
2              anencephaly    T cell   lung             True
3              anencephaly    T cell   lung             True
4              anencephaly    T cell   lung             True
...                    ...       ...    ...              ...
71275             COVID-19    T cell   lung             True
71276  lung adenocarcinoma    T cell   lung             True
71277  lung adenocarcinoma    T cell   lung             True
71278             COVID-19    T cell   lung             True
71279             COVID-19    T cell   lung             True

[71280 rows x 4 columns]


In [None]:
with cellxgene_census.open_soma(census_version="latest") as census:

    cell_metadata_T_cell = cellxgene_census.get_obs(
        census,
        "homo_sapiens",
        value_filter= enhance(                                            
            "cell_type in ['T cell'] and tissue in ['lung'] and is_primary_data==True",         # T cells in the lung with enhancement 
            organism="Homo sapiens"
        ),
        column_names=["disease"],
    )

In [6]:
print(cell_metadata_T_cell)

            disease                        cell_type tissue  is_primary_data
0       anencephaly                           T cell   lung             True
1       anencephaly                           T cell   lung             True
2       anencephaly                           T cell   lung             True
3       anencephaly                           T cell   lung             True
4       anencephaly                           T cell   lung             True
...             ...                              ...    ...              ...
697305       normal  CD4-positive, alpha-beta T cell   lung             True
697306       normal  CD4-positive, alpha-beta T cell   lung             True
697307       normal  CD8-positive, alpha-beta T cell   lung             True
697308       normal  CD4-positive, alpha-beta T cell   lung             True
697309       normal  CD8-positive, alpha-beta T cell   lung             True

[697310 rows x 4 columns]


In [None]:
# Number and list of unique cell types in cell_metadata_T_cell
print(f"Unique cell types ({len(cell_metadata_T_cell['cell_type'].unique())}):", cell_metadata_T_cell["cell_type"].unique())

Unique cell types (33): ['T cell', 'naive thymus-derived CD8-positive, alpha-beta..., 'mature NK T cell', 'naive thymus-derived CD4-positive, alpha-beta..., 'effector memory CD4-positive, alpha-beta T cell', ..., 'CD4-positive helper T cell', 'CD8-positive, alpha-beta memory T cell, CD45R..., 'T follicular helper cell', 'CD8-positive, alpha-beta memory T cell', 'mature gamma-delta T cell']
Length: 33
Categories (831, object): ['A2 amacrine cell', 'B cell', 'B-1 B cell', 'B-1a B cell', ..., 'vein endothelial cell of respiratory system', 'ventricular cardiac muscle cell', 'vestibular dark cell', 'visceromotor neuron']
