# Pandasaurus CxG Extension Walkthrough 

## Overview
Welcome to this Jupyter notebook walkthrough for pandasaurus_cxg! This library provides powerful tools for analyzing and enriching AnnData objects, enabling you to gain deeper insights into your single-cell RNA sequencing (scRNA-seq) data.

In this notebook, we will explore two main classes: `AnndataEnricher` and `AnndataAnalyzer`. Let's dive in and see how these classes can help us in our scRNA-seq analysis.

Now, let's get started with an example workflow that demonstrates the capabilities of these classes. We'll load an example dataset, then perform enrichment, analysis, and visualization steps to better understand our scRNA-seq data.

## Test Data
The following files are used in the walkthrough. Please download them manually to a folder of your choice. Ensure that you adjust the file paths used in the examples to match your local file paths.
- [Time-resolved Systems Immunology Reveals a Late Juncture Linked to Fatal COVID-19: Adaptive Cells](https://cellxgene.cziscience.com/collections/db14ce52-5dd6-4649-a9e9-7fb2572d0605)
- [Integrated Single-nucleus and Single-cell RNA-seq of the Adult Human Kidney](https://cellxgene.cziscience.com/collections/36b8480d-114e-42fe-b6a9-bdf79a7eb1fc)
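
Before running the cells below, it can help to confirm that the downloaded files are where the examples expect them. The following is a minimal sketch using only the standard library; the folder `test/data` and the file names match the paths used later in this walkthrough, and the helper `missing_files` is a hypothetical name introduced here, not part of pandasaurus_cxg.

```python
from pathlib import Path

# Folder used by the examples in this walkthrough; change it to your own folder.
DATA_DIR = Path("test/data")

# File names as they appear in the later cells.
EXPECTED_FILES = ["test_covid.h5ad", "human_kidney.h5ad"]


def missing_files(data_dir, names):
    """Return the expected file names that are not present in data_dir."""
    return [name for name in names if not (Path(data_dir) / name).is_file()]


missing = missing_files(DATA_DIR, EXPECTED_FILES)
if missing:
    print(f"Missing from {DATA_DIR}: {', '.join(missing)}")
else:
    print("All walkthrough files found.")
```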

## AnndataEnricher Walkthrough

### Initialization
Let's import the necessary modules and initialize our `AnndataEnricher`.

In [1]:
from pandasaurus_cxg.anndata_enricher import AnndataEnricher

In [2]:
# Using Time-resolved Systems Immunology Reveals a Late Juncture Linked to Fatal COVID-19: Adaptive Cells dataset
ade = AnndataEnricher("test/data/test_covid.h5ad")

Details of the AnnData `obs` field:

In [3]:
ade._anndata.obs

| | tissue_ontology_term_id | author_cell_type | disease_ontology_term_id | age | days_since_hospitalized | donor_id | severity | dsm_severity_score | ever_admitted_to_icu | days_since_onset | ... | organism_ontology_term_id | suspension_type | cell_type | assay | disease | organism | sex | tissue | self_reported_ethnicity | development_stage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AAACCTGAGACGACGT-1_1 | UBERON:0000178 | naive B cell | PATO:0000461 | 32.0 | | AA220907 | | | | | ... | NCBITaxon:9606 | cell | naive B cell | 10x 5' v1 | normal | Homo sapiens | female | blood | unknown | 32-year-old human stage |
| AAACCTGAGCTAGTTC-1_1 | UBERON:0000178 | naive B cell | MONDO:0100096 | 37.0 | 6.0 | HGR0000079 | Critical | | False | 30.0 | ... | NCBITaxon:9606 | cell | naive B cell | 10x 5' v1 | COVID-19 | Homo sapiens | female | blood | European | 37-year-old human stage |
| AAACCTGCATAGACTC-1_1 | UBERON:0000178 | memory B cell | PATO:0000461 | 32.0 | | AA220907 | | | | | ... | NCBITaxon:9606 | cell | memory B cell | 10x 5' v1 | normal | Homo sapiens | female | blood | unknown | 32-year-old human stage |
| AAACCTGCATTAACCG-1_1 | UBERON:0000178 | naive B cell | MONDO:0100096 | 54.0 | 1.0 | HGR0000143 | Critical | 1.674056 | True | 8.0 | ... | NCBITaxon:9606 | cell | naive B cell | 10x 5' v1 | COVID-19 | Homo sapiens | female | blood | East Asian | 54-year-old human stage |
| AAACCTGGTCATCGGC-1_1 | UBERON:0000178 | gamma-delta T cell | MONDO:0100096 | 55.0 | 1.0 | HGR0000083 | Moderate | -1.950858 | False | 6.0 | ... | NCBITaxon:9606 | cell | gamma-delta T cell | 10x 5' v1 | COVID-19 | Homo sapiens | male | blood | European | 55-year-old human stage |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| TTTGTCACAAGAGGCT-16_4 | UBERON:0000178 | memory B cell | PATO:0000461 | 73.0 | | SHD5 | | | | | ... | NCBITaxon:9606 | cell | memory B cell | 10x 5' v1 | normal | Homo sapiens | male | blood | unknown | 73-year-old human stage |
| TTTGTCACACGTCTCT-16_4 | UBERON:0000178 | naive CD8+ T cell | MONDO:0100096 | 49.0 | 0.0 | HGR0000142 | Critical | -0.617735 | False | 12.0 | ... | NCBITaxon:9606 | cell | naive thymus-derived CD8-positive, alpha-beta ... | 10x 5' v1 | COVID-19 | Homo sapiens | male | blood | European | 49-year-old human stage |
| TTTGTCACAGGTCGTC-16_4 | UBERON:0000178 | naive CD8+ T cell | MONDO:0100096 | 53.0 | 2.0 | HGR0000134 | Critical | -1.925448 | False | 16.0 | ... | NCBITaxon:9606 | cell | naive thymus-derived CD8-positive, alpha-beta ... | 10x 5' v1 | COVID-19 | Homo sapiens | male | blood | European | 53-year-old human stage |
| TTTGTCAGTAAATGAC-16_4 | UBERON:0000178 | CD4-positive, alpha-beta memory T cell | PATO:0000461 | 51.0 | | SHD1 | | | | | ... | NCBITaxon:9606 | cell | CD4-positive, alpha-beta memory T cell | 10x 5' v1 | normal | Homo sapiens | male | blood | unknown | 51-year-old human stage |


Available slims for the minimal and full slim enrichment methods:

In [4]:
ade.slim_list

[{'name': 'blood_and_immune_upper_slim',
  'description': 'a subset of general classes related to blood and the immune system, primarily of hematopoietic origin'},
 {'name': 'eye_upper_slim',
  'description': 'a subset of general classes related to specific cell types in the eye.'}]

Available contexts for the contextual enrichment method:

In [5]:
ade._AnndataEnricher__context_list

['UBERON:0000178']
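
The attribute name `_AnndataEnricher__context_list` in the cell above is the result of Python's name mangling: an attribute written with two leading underscores inside a class body is stored under `_<ClassName>__<attribute>`. The sketch below shows the mechanism with a toy class; `Enricher` is a hypothetical stand-in and says nothing about how `AnndataEnricher` is actually implemented.

```python
class Enricher:
    """Toy stand-in used only to illustrate name mangling."""

    def __init__(self, contexts):
        # Two leading underscores trigger name mangling inside the class body.
        self.__context_list = list(contexts)


toy = Enricher(["UBERON:0000178"])

# The attribute is stored under the mangled name _<ClassName>__<attribute>.
print(toy._Enricher__context_list)  # ['UBERON:0000178']

# The unmangled name does not exist on the instance.
print(hasattr(toy, "__context_list"))  # False
```

Because mangled attributes are meant to be private, code that reads them may break if the library changes its internals.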

Pandas configuration to display all rows

In [6]:
import pandas as pd

pd.set_option('display.max_rows', None)

### Enrichment

#### Simple enrichment
Returns a DataFrame enriched with synonyms and inferred relationships between terms in the seed list. Both the subject and the object of every relationship are members of the seed list.

In [7]:
ade.simple_enrichment()

| | s | s_label | p | o | o_label |
| --- | --- | --- | --- | --- | --- |
| 0 | CL:0000798 | gamma-delta T cell | rdfs:subClassOf | CL:0000084 | T cell |
| 1 | CL:0000809 | double-positive, alpha-beta thymocyte | rdfs:subClassOf | CL:0000084 | T cell |
| 2 | CL:0000813 | memory T cell | rdfs:subClassOf | CL:0000084 | T cell |
| 3 | CL:0000815 | regulatory T cell | rdfs:subClassOf | CL:0000084 | T cell |
| 4 | CL:0000895 | naive thymus-derived CD4-positive, alpha-beta ... | rdfs:subClassOf | CL:0000084 | T cell |
| 5 | CL:0000897 | CD4-positive, alpha-beta memory T cell | rdfs:subClassOf | CL:0000084 | T cell |
| 6 | CL:0000897 | CD4-positive, alpha-beta memory T cell | rdfs:subClassOf | CL:0000813 | memory T cell |
| 7 | CL:0000900 | naive thymus-derived CD8-positive, alpha-beta ... | rdfs:subClassOf | CL:0000084 | T cell |
| 8 | CL:0000909 | CD8-positive, alpha-beta memory T cell | rdfs:subClassOf | CL:0000084 | T cell |
| 9 | CL:0000909 | CD8-positive, alpha-beta memory T cell | rdfs:subClassOf | CL:0000813 | memory T cell |
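
To make the "both ends in the seed list" rule concrete, here is a small pure-Python sketch. It is not the library's implementation: the edge list is a toy set of `rdfs:subClassOf` pairs built from identifiers that appear in this walkthrough, and `within_seed` is a hypothetical helper name.

```python
# Toy subClassOf edges as (subject, object) pairs.
edges = [
    ("CL:0000798", "CL:0000084"),  # gamma-delta T cell -> T cell
    ("CL:0000897", "CL:0000813"),  # CD4-positive, alpha-beta memory T cell -> memory T cell
    ("CL:0000813", "CL:0000084"),  # memory T cell -> T cell
    ("CL:0000084", "CL:0000542"),  # T cell -> lymphocyte (lymphocyte is not in the toy seed)
]

# Toy seed list: the terms we pretend were found in the dataset.
seed = {"CL:0000084", "CL:0000798", "CL:0000813", "CL:0000897"}


def within_seed(edges, seed):
    """Keep only edges whose subject and object are both seed terms."""
    return [(s, o) for s, o in edges if s in seed and o in seed]


for s, o in within_seed(edges, seed):
    print(s, "rdfs:subClassOf", o)
```

The edge to `CL:0000542` is dropped because its object lies outside the seed list, which mirrors why only seed terms appear in the `o` column of the output above.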


#### Minimal slim enrichment
Returns a DataFrame enriched with synonyms and inferred relationships between terms in the seed list and terms in an extended seed list. The extended seed list consists of the seed terms plus the terms from the given slim lists, that is, classes tagged with the specified ‘subset’ axiom.

In [8]:
ade.minimal_slim_enrichment(["blood_and_immune_upper_slim"])

| | s | s_label | p | o | o_label |
| --- | --- | --- | --- | --- | --- |
| 0 | CL:0000084 | T cell | rdfs:subClassOf | CL:0000842 | mononuclear cell |
| 1 | CL:0000787 | memory B cell | rdfs:subClassOf | CL:0000842 | mononuclear cell |
| 2 | CL:0000787 | memory B cell | rdfs:subClassOf | CL:0000236 | B cell |
| 3 | CL:0000787 | memory B cell | rdfs:subClassOf | CL:0000145 | professional antigen presenting cell |
| 4 | CL:0000788 | naive B cell | rdfs:subClassOf | CL:0000842 | mononuclear cell |
| 5 | CL:0000788 | naive B cell | rdfs:subClassOf | CL:0000236 | B cell |
| 6 | CL:0000788 | naive B cell | rdfs:subClassOf | CL:0000145 | professional antigen presenting cell |
| 7 | CL:0000798 | gamma-delta T cell | rdfs:subClassOf | CL:0000084 | T cell |
| 8 | CL:0000798 | gamma-delta T cell | rdfs:subClassOf | CL:0000842 | mononuclear cell |
| 9 | CL:0000809 | double-positive, alpha-beta thymocyte | rdfs:subClassOf | CL:0000084 | T cell |
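
The idea of an extended seed list can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's code: the slim membership below is assumed for the example rather than read from `blood_and_immune_upper_slim`, and `minimal_slim_edges` is a hypothetical helper name.

```python
# Toy seed list: terms we pretend were found in the dataset.
seed = {"CL:0000084", "CL:0000787"}  # T cell, memory B cell

# Toy slim membership, assumed for illustration only.
slim_terms = {"CL:0000842", "CL:0000236"}  # mononuclear cell, B cell

# Toy subClassOf edges as (subject, object) pairs.
edges = [
    ("CL:0000084", "CL:0000842"),  # T cell -> mononuclear cell
    ("CL:0000787", "CL:0000236"),  # memory B cell -> B cell
    ("CL:0000787", "CL:0000842"),  # memory B cell -> mononuclear cell
    ("CL:0000787", "CL:0000785"),  # memory B cell -> mature B cell (in neither list)
]


def minimal_slim_edges(edges, seed, slim_terms):
    """Keep edges from a seed term to any term in the extended seed list."""
    extended_seed = seed | slim_terms
    return [(s, o) for s, o in edges if s in seed and o in extended_seed]


kept = minimal_slim_edges(edges, seed, slim_terms)
print(kept)
```

In this toy setup the edge to mature B cell is dropped, since that term is neither a seed term nor a member of the assumed slim.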


#### Full slim enrichment
Returns a DataFrame enriched with synonyms and inferred relationships between terms in the seed list and terms in an extended seed list. The extended seed list consists of the seed terms plus the terms from the given slim lists (classes tagged with the specified ‘subset’ axiom), together with terms inferred via transitive subClassOf queries.

In [9]:
ade.full_slim_enrichment(["blood_and_immune_upper_slim"])

| | s | s_label | p | o | o_label |
| --- | --- | --- | --- | --- | --- |
| 0 | CL:0000084 | T cell | rdfs:subClassOf | CL:0000842 | mononuclear cell |
| 1 | CL:0000084 | T cell | rdfs:subClassOf | CL:0000542 | lymphocyte |
| 2 | CL:0000787 | memory B cell | rdfs:subClassOf | CL:0000785 | mature B cell |
| 3 | CL:0000787 | memory B cell | rdfs:subClassOf | CL:0000236 | B cell |
| 4 | CL:0000787 | memory B cell | rdfs:subClassOf | CL:0000945 | lymphocyte of B lineage |
| 5 | CL:0000787 | memory B cell | rdfs:subClassOf | CL:0000785 | mature B cell |
| 6 | CL:0000787 | memory B cell | rdfs:subClassOf | CL:0001201 | B cell, CD19-positive |
| 7 | CL:0000787 | memory B cell | rdfs:subClassOf | CL:0000785 | mature B cell |
| 8 | CL:0000787 | memory B cell | rdfs:subClassOf | CL:0000236 | B cell |
| 9 | CL:0000787 | memory B cell | rdfs:subClassOf | CL:0000842 | mononuclear cell |
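
A transitive subClassOf query can be pictured as walking up the hierarchy and collecting every ancestor, not just the direct parent. The sketch below does this over a deliberately simplified, single-chain toy hierarchy; the real ontology has more parents per term, and `ancestors` is a hypothetical helper, not a pandasaurus_cxg function.

```python
# Toy direct-parent map (child -> list of direct parents), simplified to one chain.
parents = {
    "CL:0000787": ["CL:0000785"],  # memory B cell -> mature B cell
    "CL:0000785": ["CL:0000236"],  # mature B cell -> B cell
    "CL:0000236": ["CL:0000945"],  # B cell -> lymphocyte of B lineage
    "CL:0000945": ["CL:0000542"],  # lymphocyte of B lineage -> lymphocyte
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("CL:0000787", parents)))
# ['CL:0000236', 'CL:0000542', 'CL:0000785', 'CL:0000945']
```

This is why a single subject such as memory B cell can appear with several different objects in the full slim output: each inferred ancestor yields its own row.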


#### Contextual enrichment
Returns a DataFrame enriched with synonyms and inferred relationships between terms in the seed list and terms in an extended seed list. The extended seed list consists of the seed terms plus all terms that satisfy a set of existential restrictions in Ubergraph (e.g. part_of some 'kidney').

In [10]:
ade.contextual_slim_enrichment()

| | s | s_label | p | o | o_label |
| --- | --- | --- | --- | --- | --- |
| 0 | CL:0000798 | gamma-delta T cell | rdfs:subClassOf | CL:0000084 | T cell |
| 1 | CL:0000809 | double-positive, alpha-beta thymocyte | rdfs:subClassOf | CL:0000084 | T cell |
| 2 | CL:0000813 | memory T cell | rdfs:subClassOf | CL:0000084 | T cell |
| 3 | CL:0000815 | regulatory T cell | rdfs:subClassOf | CL:0000084 | T cell |
| 4 | CL:0000895 | naive thymus-derived CD4-positive, alpha-beta ... | rdfs:subClassOf | CL:0000084 | T cell |
| 5 | CL:0000897 | CD4-positive, alpha-beta memory T cell | rdfs:subClassOf | CL:0000084 | T cell |
| 6 | CL:0000897 | CD4-positive, alpha-beta memory T cell | rdfs:subClassOf | CL:0000813 | memory T cell |
| 7 | CL:0000900 | naive thymus-derived CD8-positive, alpha-beta ... | rdfs:subClassOf | CL:0000084 | T cell |
| 8 | CL:0000909 | CD8-positive, alpha-beta memory T cell | rdfs:subClassOf | CL:0000084 | T cell |
| 9 | CL:0000909 | CD8-positive, alpha-beta memory T cell | rdfs:subClassOf | CL:0000813 | memory T cell |
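
The `part_of some 'kidney'` example from the description can be sketched as a lookup over "part of" assertions. Everything in the block is a toy: the assertions are placeholders written for this illustration rather than results queried from Ubergraph, the `EX:` identifiers are hypothetical, and the helper names are not part of pandasaurus_cxg.

```python
# Toy "part_of" assertions: term -> anatomical structures it is part of.
part_of = {
    "CL:0000653": {"UBERON:0002113"},  # podocyte, placed in kidney (toy assertion)
    "CL:1000746": {"UBERON:0002113"},  # glomerular cell, placed in kidney (toy assertion)
    "EX:0000001": {"EX:0000002"},      # hypothetical term outside the context
}


def terms_in_context(part_of, contexts):
    """Return terms that are part of at least one of the context structures."""
    contexts = set(contexts)
    return {term for term, structures in part_of.items() if structures & contexts}


def extended_seed(seed, part_of, contexts):
    """Seed terms plus every term that satisfies the context restriction."""
    return set(seed) | terms_in_context(part_of, contexts)


seed = {"CL:0000653"}
result = extended_seed(seed, part_of, ["UBERON:0002113"])
print(sorted(result))  # ['CL:0000653', 'CL:1000746']
```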


### Secondary Example

In [11]:
# Using Integrated Single-nucleus and Single-cell RNA-seq of the Adult Human Kidney dataset
ade = AnndataEnricher("test/data/human_kidney.h5ad")
print(f"""Contexts of the dataset from tissue field are 
{ade._AnndataEnricher__context_list}""")
ade.contextual_slim_enrichment()

Contexts of the dataset from tissue field are 
['UBERON:0000362', 'UBERON:0001225', 'UBERON:0001228', 'UBERON:0002113']


| | s | s_label | p | o | o_label |
| --- | --- | --- | --- | --- | --- |
| 0 | CL:0000653 | podocyte | rdfs:subClassOf | CL:0002681 | kidney cortical cell |
| 1 | CL:0000653 | podocyte | rdfs:subClassOf | CL:1000450 | epithelial cell of glomerular capsule |
| 2 | CL:0000653 | podocyte | rdfs:subClassOf | CL:1000449 | epithelial cell of nephron |
| 3 | CL:0000653 | podocyte | rdfs:subClassOf | CL:0002681 | kidney cortical cell |
| 4 | CL:0000653 | podocyte | rdfs:subClassOf | CL:0002584 | renal cortical epithelial cell |
| 5 | CL:0000653 | podocyte | rdfs:subClassOf | CL:1000746 | glomerular cell |
| 6 | CL:0000653 | podocyte | rdfs:subClassOf | CL:1000746 | glomerular cell |
| 7 | CL:0000653 | podocyte | rdfs:subClassOf | CL:1000612 | kidney corpuscule cell |
| 8 | CL:0000653 | podocyte | rdfs:subClassOf | CL:1000510 | kidney glomerular epithelial cell |
| 9 | CL:0000653 | podocyte | rdfs:subClassOf | CL:1000510 | kidney glomerular epithelial cell |
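
The printed message says the contexts come from the tissue field. One plausible way to arrive at such a list, shown here only as an assumption about the approach, is to collect the distinct `tissue_ontology_term_id` values of the cells. The rows below are toy records, and `context_list` is a hypothetical helper rather than the library's method.

```python
# Toy per-cell records; a real dataset would hold these in the AnnData obs table.
obs_rows = [
    {"tissue_ontology_term_id": "UBERON:0002113"},
    {"tissue_ontology_term_id": "UBERON:0000362"},
    {"tissue_ontology_term_id": "UBERON:0001225"},
    {"tissue_ontology_term_id": "UBERON:0002113"},
    {"tissue_ontology_term_id": "UBERON:0001228"},
]


def context_list(rows, field="tissue_ontology_term_id"):
    """Return the distinct values of `field`, sorted for a stable order."""
    return sorted({row[field] for row in rows})


print(context_list(obs_rows))
# ['UBERON:0000362', 'UBERON:0001225', 'UBERON:0001228', 'UBERON:0002113']
```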


## AnndataAnalyzer Walkthrough

### Initialization
Let's import the necessary modules and initialize our `AnndataAnalyzer`.

In [12]:
from pandasaurus_cxg.anndata_analyzer import AnndataAnalyzer

In [13]:
# Temporarily using a placeholder schema for free-text cell types
ada = AnndataAnalyzer("test/data/test_covid.h5ad", "pandasaurus_cxg/schema/schema.json")

### Analyzer

#### Co-annotation report
Generates a co-annotation report based on the provided schema.

In [14]:
ada.co_annotation_report()

| | field_name1 | value1 | predicate | field_name2 | value2 |
| --- | --- | --- | --- | --- | --- |
| 0 | author_cell_type | naive B cell | cluster_matches | cell_type | naive B cell |
| 1 | author_cell_type | memory B cell | cluster_matches | cell_type | memory B cell |
| 2 | author_cell_type | gamma-delta T cell | cluster_matches | cell_type | gamma-delta T cell |
| 3 | author_cell_type | plasmablast | cluster_matches | cell_type | plasmablast |
| 4 | author_cell_type | regulatory T cell | cluster_matches | cell_type | regulatory T cell |
| 5 | author_cell_type | CD4-positive, alpha-beta memory T cell | cluster_matches | cell_type | CD4-positive, alpha-beta memory T cell |
| 6 | author_cell_type | CD8-positive, alpha-beta memory T cell | cluster_matches | cell_type | CD8-positive, alpha-beta memory T cell |
| 7 | author_cell_type | naive CD8+ T cell | cluster_matches | cell_type | naive thymus-derived CD8-positive, alpha-beta ... |
| 8 | author_cell_type | naive CD4+ T cell | cluster_matches | cell_type | naive thymus-derived CD4-positive, alpha-beta ... |
| 9 | author_cell_type | mucosal invariant T cell (MAIT) | cluster_matches | cell_type | mucosal invariant T cell |
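
One way to read a `cluster_matches` row is that the two annotation values label exactly the same set of cells. That reading is an assumption made for this sketch, not a statement of how the report is computed. The cell records are toy data, the labels starting with "toy" are hypothetical, and `exact_matches` is a helper name introduced here.

```python
from collections import defaultdict

# Toy cells, each carrying two annotation fields.
cells = [
    {"author_cell_type": "naive B cell", "cell_type": "naive B cell"},
    {"author_cell_type": "naive B cell", "cell_type": "naive B cell"},
    {"author_cell_type": "mucosal invariant T cell (MAIT)", "cell_type": "mucosal invariant T cell"},
    {"author_cell_type": "toy label A", "cell_type": "toy cell type"},
    {"author_cell_type": "toy label B", "cell_type": "toy cell type"},
]


def cells_by_value(cells, field):
    """Map each value of `field` to the set of cell indices carrying it."""
    groups = defaultdict(set)
    for index, cell in enumerate(cells):
        groups[cell[field]].add(index)
    return groups


def exact_matches(cells, field1, field2):
    """Report value pairs whose cell sets are identical."""
    groups1 = cells_by_value(cells, field1)
    groups2 = cells_by_value(cells, field2)
    return [
        (field1, value1, "cluster_matches", field2, value2)
        for value1, set1 in groups1.items()
        for value2, set2 in groups2.items()
        if set1 == set2
    ]


for row in exact_matches(cells, "author_cell_type", "cell_type"):
    print(row)
```

The two "toy label" values each cover only part of the cells labelled "toy cell type", so they produce no row here; relationships other than an exact match are deliberately left out of this sketch.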


### Secondary Example

In [15]:
ada = AnndataAnalyzer("test/data/human_kidney.h5ad", "pandasaurus_cxg/schema/schema.json")
ada.co_annotation_report()

| | field_name1 | value1 | predicate | field_name2 | value2 |
| --- | --- | --- | --- | --- | --- |
| 0 | subclass.l3 | dPT | cluster_matches | subclass.full | Degenerative Proximal Tubule Epithelial Cell |
| 1 | subclass.l3 | aPT | cluster_matches | subclass.full | Adaptive / Maladaptive / Repairing Proximal Tu... |
| 2 | subclass.l3 | M-FIB | cluster_matches | subclass.full | Medullary Fibroblast |
| 3 | subclass.l3 | MD | cluster_matches | subclass.full | Macula Densa Cell |
| 4 | subclass.l3 | NKC/T | cluster_matches | subclass.full | Natural Killer Cell / Natural Killer T Cell |
| 5 | subclass.l3 | tPC-IC | cluster_matches | subclass.full | Transitional Principal-Intercalated Cell |
| 6 | subclass.l3 | EC-DVR | cluster_matches | subclass.full | Descending Vasa Recta Endothelial Cell |
| 7 | subclass.l3 | M-TAL | cluster_matches | subclass.full | Medullary Thick Ascending Limb Cell |
| 8 | subclass.l3 | CCD-IC-A | cluster_matches | subclass.full | Cortical Collecting Duct Intercalated Cell Type A |
| 9 | subclass.l3 | dM-TAL | cluster_matches | subclass.full | Degenerative Medullary Thick Ascending Limb Cell |
