# CASSIA Analysis Tutorial

This Python notebook demonstrates a complete workflow using CASSIA for cell type annotation of single-cell RNA sequencing data. We'll analyze an intestinal cell dataset containing six distinct populations:

1. monocyte
2. plasma cell
3. cd8-positive, alpha-beta t cell
4. transit amplifying cell of large intestine
5. intestinal enteroendocrine cell
6. intestinal crypt stem cell

## Setup and Environment Preparation

First, let's install and import the required packages:

In [None]:
!pip install CASSIA

In [11]:
import CASSIA

In [25]:

# Set API keys
CASSIA.set_api_key("your-openai-key", provider="openai")
CASSIA.set_api_key("your-anthropic-key", provider="anthropic")
CASSIA.set_api_key("your-openrouter-key", provider="openrouter")
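
Hard-coding keys in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the keys have been exported as environment variables. The variable names (in particular `OPENROUTER_API_KEY`) are a common convention rather than something CASSIA requires, and `collect_api_keys` is a hypothetical helper, not part of CASSIA:

```python
import os

# Assumed environment variable names (a convention, not a CASSIA requirement).
ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}

def collect_api_keys(environ=os.environ):
    """Return {provider: key} for every provider whose variable is set and non-empty."""
    return {
        provider: environ[var]
        for provider, var in ENV_VARS.items()
        if environ.get(var)
    }

keys = collect_api_keys()
print("Providers with a key available:", sorted(keys))

# Each key could then be registered with the call shown above:
# for provider, key in keys.items():
#     CASSIA.set_api_key(key, provider=provider)
```

Providers without a key are skipped, so only the services you actually use need to be configured.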

In [None]:
processed_markers = CASSIA.loadmarker(marker_type="processed")
unprocessed_markers = CASSIA.loadmarker(marker_type="unprocessed")
subcluster_results = CASSIA.loadmarker(marker_type="subcluster_results")

# List available marker sets
available_markers = CASSIA.list_available_markers()
print(available_markers) 


## Fast Mode

In [None]:
# Run the CASSIA pipeline in fast mode
CASSIA.runCASSIA_pipeline(
    output_file_name = "FastAnalysisResults",
    tissue = "large intestine",
    species = "human",
    marker_path = unprocessed_markers,
    max_workers = 6,  # Matches the number of clusters in dataset
    annotation_model = "google/gemini-2.5-flash-preview",  # alternative: "openai/gpt-4o-2024-11-20"
    annotation_provider = "openrouter",
    score_model = "google/gemini-2.5-flash-preview",
    score_provider = "openrouter",
    score_threshold = 75,
    annotationboost_model="google/gemini-2.5-flash-preview",
    annotationboost_provider="openrouter"
)
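
The call above sets `max_workers` to the number of clusters. For your own data, that number can be derived from the marker table instead of being typed by hand. The sketch below is only an illustration: it assumes the table is a CSV with a `cluster` column (the Seurat `FindAllMarkers` convention), `count_clusters` is a hypothetical helper, and the sample values are made up:

```python
import csv
import io

def count_clusters(csv_text, cluster_column="cluster"):
    """Count distinct cluster labels in a marker table given as CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return len({row[cluster_column] for row in reader})

# Tiny illustrative table; the numeric values are made up for the example.
sample = """cluster,gene,avg_log2FC
monocyte,LYZ,2.1
monocyte,CD14,1.8
plasma cell,JCHAIN,3.0
plasma cell,MZB1,2.7
"""

n_clusters = count_clusters(sample)
max_workers = n_clusters  # one worker per cluster, as in the call above
print(max_workers)
```

If your API plan has strict rate limits, a smaller value than the cluster count may be the safer choice.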


### Step 2: Detailed Batch Analysis

In [16]:
output_name="intestine_detailed"

In [None]:

# Run batch analysis
CASSIA.runCASSIA_batch(
    marker = unprocessed_markers,
    output_name = output_name,
    model = "google/gemini-2.5-flash-preview",
    tissue = "large intestine",
    species = "human",
    max_workers = 6,  # Matching cluster count
    n_genes = 50,
    additional_info = None,
    provider = "openrouter")

### Step 3: Quality Scoring

In [None]:
# Run quality scoring
CASSIA.runCASSIA_score_batch(
    input_file = output_name + "_full.csv",
    output_file = output_name + "_scored.csv",
    max_workers = 6,
    model = "google/gemini-2.5-flash-preview",
    provider = "openrouter"
)

# Generate quality report
CASSIA.runCASSIA_generate_score_report(
    csv_path = output_name + "_scored.csv",
    index_name = output_name + "_report.html"
)
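
Once scores are available, it can help to list the clusters that fall below a cutoff such as the `score_threshold = 75` used in fast mode. This is a rough sketch, assuming low-scoring clusters are the ones worth a closer look. The column names `cluster` and `score` are placeholders (check the header of your scored CSV and adjust them), `low_scoring_clusters` is a hypothetical helper, and the sample scores are invented:

```python
import csv
import io

def low_scoring_clusters(csv_text, threshold=75,
                         cluster_column="cluster", score_column="score"):
    """Return the cluster names whose score is strictly below the threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row[cluster_column]
        for row in reader
        if float(row[score_column]) < threshold
    ]

# Invented scores, for illustration only.
sample = """cluster,score
monocyte,60
plasma cell,95
intestinal crypt stem cell,88
"""

flagged = low_scoring_clusters(sample)
print(flagged)
```

Clusters flagged this way are natural candidates for the optional steps later in this tutorial.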

### Optional Step: Uncertainty Quantification

This step is useful for studying the uncertainty of the annotation and can potentially improve its accuracy.

Note: This step can be costly, since multiple iterations are performed.


In [35]:
# Run multiple iterations
iteration_results = CASSIA.runCASSIA_batch_n_times(
    n=2,
    marker=unprocessed_markers,
    output_name=output_name + "_Uncertainty",
    model="openai/gpt-4o-2024-11-20",
    provider="openrouter",
    tissue="large intestine",
    species="human",
    max_workers=6,
    batch_max_workers=3  # Conservative setting for API rate limits
)


# Calculate similarity scores
similarity_scores = CASSIA.runCASSIA_similarity_score_batch(
    marker=unprocessed_markers,
    file_pattern=output_name + "_Uncertainty_*_full.csv",
    output_name="intestine_uncertainty",
    max_workers=6,
    model="claude-3-5-sonnet-20241022",
    provider="anthropic",
    main_weight=0.5,
    sub_weight=0.5
)


Starting batch run 1/2
Processing input dataframe to get top markers
Starting batch run 2/2
Processing input dataframe to get top markers

Analyzing cd8-positive, alpha-beta t cell...

Analyzing intestinal crypt stem cell...

Analyzing intestinal enteroendocrine cell...

Analyzing monocyte...

Analyzing cd8-positive, alpha-beta t cell...

Analyzing plasma cell...

Analyzing intestinal crypt stem cell...

Analyzing transit amplifying cell of large intestine...

Analyzing intestinal enteroendocrine cell...

Analyzing monocyte...

Analyzing plasma cell...

Analyzing transit amplifying cell of large intestine...
Analysis for transit amplifying cell of large intestine completed.

Analysis for plasma cell completed.

Analysis for monocyte completed.

Analysis for intestinal enteroendocrine cell completed.

Analysis for intestinal crypt stem cell completed.

Analysis for plasma cell completed.

Analysis for transit amplifying cell of large intestine completed.

Analysis for intestinal crypt s
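
The `file_pattern` argument of the similarity step contains a `*` wildcard. Assuming it is interpreted as a shell-style glob, the sketch below shows which file names such a pattern would select. The candidate file names are hypothetical and only serve to illustrate the matching:

```python
from fnmatch import fnmatch

output_name = "intestine_detailed"
pattern = output_name + "_Uncertainty_*_full.csv"

# Hypothetical file names, for illustration only.
candidates = [
    "intestine_detailed_Uncertainty_1_full.csv",
    "intestine_detailed_Uncertainty_2_full.csv",
    "intestine_detailed_Uncertainty_1_summary.csv",
    "intestine_detailed_full.csv",
]

# Keep only the names that match the wildcard pattern.
matched = [name for name in candidates if fnmatch(name, pattern)]
print(matched)
```

To check real files on disk before running the similarity step, `glob.glob(pattern)` from the standard library lists the matching paths in the working directory.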

### Optional Step: Annotation Boost on Selected Cluster

The monocyte cluster is sometimes annotated as a mixed population of immune cells and neuron/glia cells.

Here we use the annotation boost agent to test these hypotheses in more detail.

In [None]:
# Run annotation boost on the monocyte cluster
CASSIA.runCASSIA_annotationboost(
    full_result_path = output_name + "_full.csv",
    marker = unprocessed_markers,
    output_name = "monocyte_annotationboost",
    cluster_name = "monocyte",
    major_cluster_info = "Human Large Intestine",
    num_iterations = 5,
    model = "anthropic/claude-3.5-sonnet",
    provider = "openrouter"
)

### Optional Step: Retrieval-Augmented Generation

This is particularly useful if you have a very specific and detailed annotation to work with. It can significantly improve the granularity and accuracy of the annotation. It automatically extracts marker information and generates a report that serves as additional information for the default CASSIA pipeline.

Install the package:


In [None]:
!pip install cassia-rag


In [None]:
from cassia_rag import run_complete_analysis
import os

Set up the API keys if you have not done so.


In [None]:
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-key"
os.environ["OPENAI_API_KEY"] = "your-openai-key"
    

Run the wrapper function to trigger the multi-agent pipeline.

In [None]:
run_complete_analysis(
    tissue_type="Liver",  # tissue you are analyzing
    target_species="Tiger",  # species you are analyzing
    reference_species="Human",  # either Human or mouse; for any other species, use Human
    model_choice='claude',  # either claude or gpt; claude is highly recommended
    compare=True,  # set to True to compare with the reference species (for example, fetal vs human)
    db_path="~/Canonical_Marker (1).csv",  # path to the database
    max_workers=8
)

All outputs (intermediate and final) are saved as text files in the "TissueType_Species" folder; in our example, this is the Liver_Tiger folder.
The final output is in the summary_clean.txt file, and its content can be used as additional information in the CASSIA pipeline later.

The folder also contains several intermediate outputs.
Using the tutorial input as an example, the files are:
1. liver_tiger_marker_analysis.txt # marker analysis and interpretation from the database
2. final_ontology.txt # ontology related to the tissue type and target species
3. cell_type_patterns_claude.txt # cell type patterns analysis from the
4. summary.txt # raw summary file
5. additional_considerations.txt # additional considerations if we have different species than reference species.
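
To reuse the cleaned summary later, its text first needs to be read back into Python. A minimal sketch, assuming the folder layout described above; `load_additional_info` is a hypothetical helper, not part of CASSIA or cassia-rag:

```python
from pathlib import Path

def load_additional_info(folder, filename="summary_clean.txt"):
    """Return the cleaned summary text, or None if the file is missing or empty."""
    path = Path(folder).expanduser() / filename
    if not path.is_file():
        return None
    text = path.read_text(encoding="utf-8").strip()
    return text or None

additional_info = load_additional_info("Liver_Tiger")
print("summary found" if additional_info else "no summary yet")

# The text could then be supplied to the batch step shown earlier, e.g.
# CASSIA.runCASSIA_batch(..., additional_info=additional_info)
```

Returning `None` when the file is absent matches the `additional_info = None` default used in the batch analysis step, so the same call works with or without a summary.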

### Optional Step: Compare the Subtypes Using Multiple LLMs

This agent can be used after you finish the default CASSIA pipeline and are still unsure about a cell type. It helps you obtain a more confident subtype annotation. Here we use the plasma cell cluster as an example, to determine whether it is better described as a general plasma cell population or as one of the other candidate cell types.

In [None]:
# The markers here are copied from CASSIA's previous results.
marker = "IGLL5, IGLV6-57, JCHAIN, FAM92B, IGLC3, IGLC2, IGHV3-7, IGKC, TNFRSF17, IGHG1, AC026369.3, IGHV3-23, IGKV4-1, IGKV1-5, IGHA1, IGLV3-1, IGLV2-11, MYL2, MZB1, IGHG3, IGHV3-74, IGHM, ANKRD36BP2, AMPD1, IGKV3-20, IGHA2, DERL3, AC104699.1, LINC02362, AL391056.1, LILRB4, CCL3, BMP6, UBE2QL1, LINC00309, AL133467.1, GPRC5D, FCRL5, DNAAF1, AP002852.1, AC007569.1, CXorf21, RNU1-85P, U62317.4, TXNDC5, LINC02384, CCR10, BFSP2, APOBEC3A, AC106897.1"

CASSIA.compareCelltypes(
    tissue = "large intestine",
    celltypes = ["Plasma Cells", "IgA-secreting Plasma Cells", "IgG-secreting Plasma Cells", "IgM-secreting Plasma Cells"],
    marker_set = marker,
    species = "human",
    output_file = "plasma_cell_subtype"
)
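
Before comparing isotype-specific subtypes, a quick look at which heavy-chain constant-region genes appear in the marker string can be informative. The sketch below is a simple sanity check under that assumption, not a substitute for the comparison agent. It uses a shortened copy of the marker string, and the grouping by gene-name prefix is our own illustration:

```python
# A subset of the marker string above, to keep the example short.
marker_subset = "IGLL5, JCHAIN, IGHV3-7, IGKC, IGHG1, IGHA1, MZB1, IGHG3, IGHM, IGHA2"

# Split the comma-separated string into clean gene symbols.
genes = [g.strip() for g in marker_subset.split(",") if g.strip()]

# Heavy-chain constant-region gene prefixes for each isotype.
isotype_prefixes = {"IgA": "IGHA", "IgG": "IGHG", "IgM": "IGHM"}

isotype_genes = {
    isotype: [g for g in genes if g.startswith(prefix)]
    for isotype, prefix in isotype_prefixes.items()
}
print(isotype_genes)
```

Here genes for all three isotypes are present, which is one reason a comparison across IgA-, IgG-, and IgM-secreting plasma cells is worth running rather than assuming a single subtype.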

### Optional Step: Subclustering


This agent can be used to study a subclustered population, such as a T cell population or a fibroblast cluster. We recommend running the default CASSIA pipeline first and then, for a target cluster, applying the Seurat pipeline to subcluster it and using the FindAllMarkers results as input here. Here we present the results for the cd8-positive, alpha-beta t cell cluster as an example. This cluster is a CD8 population mixed with other cell types.

In [None]:
# The commented lines below are R (Seurat) code; the same steps can also be done in Scanpy.

# large=readRDS("/Users/xie227/Downloads/seurat_object.rds")
# # Extract CD8+ T cells
# cd8_cells <- subset(large, cell_ontology_class == "cd8-positive, alpha-beta t cell")
# # Normalize and identify variable features
# cd8_cells <- NormalizeData(cd8_cells)
# cd8_cells <- FindVariableFeatures(cd8_cells, selection.method = "vst", nfeatures = 2000)
# # Scale data and run PCA
# all.genes <- rownames(cd8_cells)
# cd8_cells <- ScaleData(cd8_cells, features = all.genes)
# cd8_cells <- RunPCA(cd8_cells, features = VariableFeatures(object = cd8_cells),npcs = 30)
# # Run clustering (adjust resolution and dims as needed based on elbow plot)
# cd8_cells <- FindNeighbors(cd8_cells, dims = 1:20)
# cd8_cells <- FindClusters(cd8_cells, resolution = 0.3)
# # Run UMAP
# cd8_cells <- RunUMAP(cd8_cells, dims = 1:20)
# # Create visualization plots
# p1 <- DimPlot(cd8_cells, reduction = "umap", label = TRUE) +
#   ggtitle("CD8+ T Cell Subclusters")
# # Find markers for each subcluster
# cd8_markers <- FindAllMarkers(cd8_cells,
#                             only.pos = TRUE,
#                             min.pct = 0.1,
#                             logfc.threshold = 0.25)
# cd8_markers=cd8_markers %>% filter(p_val_adj<0.05)
# write.csv(cd8_markers, "cd8_subcluster_markers.csv")

CASSIA.runCASSIA_subclusters(marker = subcluster_results,
    major_cluster_info = "cd8 t cell",
    output_name = "subclustering_results",
    model = "anthropic/claude-3.5-sonnet",
    provider = "openrouter")
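
The commented R code filters markers with `p_val_adj < 0.05` before writing the CSV. If a marker table was exported without that filter, the same step can be applied in Python. A minimal sketch, assuming a CSV with Seurat-style column names (`gene`, `cluster`, `avg_log2FC`, `p_val_adj`); `filter_markers` is a hypothetical helper and the sample values are made up:

```python
import csv
import io

def filter_markers(csv_text, alpha=0.05):
    """Keep rows whose adjusted p-value is below alpha; return (header, rows)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [row for row in reader if float(row["p_val_adj"]) < alpha]
    return reader.fieldnames, kept

# Made-up values, for illustration only.
sample = """gene,cluster,avg_log2FC,p_val_adj
GZMK,0,1.9,0.001
CCR7,1,1.2,0.2
KLRB1,2,2.4,0.03
"""

fieldnames, kept = filter_markers(sample)
print([row["gene"] for row in kept])
```

The returned header and rows can be written back out with `csv.DictWriter` to produce a filtered file for the subclustering step.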

It is recommended to run the CS score for the subclustering to get a more confident answer.

In [None]:
CASSIA.runCASSIA_n_subcluster(
    n=5, 
    marker=subcluster_results,
    major_cluster_info="cd8 t cell", 
    base_output_name="subclustering_results_n",
    model="anthropic/claude-3.5-sonnet",
    temperature=0,
    provider="openrouter",
    max_workers=5,
    n_genes=50
)


In [None]:
# Calculate similarity scores
CASSIA.runCASSIA_similarity_score_batch(
    marker = subcluster_results,
    file_pattern = "subclustering_results_n_*.csv",
    output_name = "subclustering_uncertainty",
    max_workers = 6,
    model = "claude-3-5-sonnet-20241022",
    provider = "anthropic",
    main_weight = 0.5,
    sub_weight = 0.5
)

### Optional Step: Annotation Boost with Additional Task

This can be used to study a specific question about a cluster, such as inferring its state. Here we use the cd8-positive, alpha-beta t cell cluster as an example. Note that the performance of this agent has not been benchmarked, so please treat the results with caution.

In [None]:
# Only OpenRouter is currently supported as the provider.

CASSIA.runCASSIA_annottaionboost_additional_task(
    full_result_path = output_name + "_full.csv",
    marker = unprocessed_markers,
    output_name = "T_cell_state",
    cluster_name = "cd8-positive, alpha-beta t cell",
    major_cluster_info = "Human Large Intestine",
    num_iterations = 5,
    model = "anthropic/claude-3.5-sonnet",
    additional_task = "infer the state of this T cell cluster"
)