# CASSIA Analysis Tutorial

This Python Notebook demonstrates a complete workflow using CASSIA for cell type annotation of single-cell RNA sequencing data. We'll analyze an intestinal cell dataset containing six distinct populations:

1. monocyte
2. plasma cells
3. cd8-positive, alpha-beta t cell
4. transit amplifying cell of large intestine
5. intestinal enteroendocrine cell
6. intestinal crypt stem cell

## Setup and Environment Preparation

First, let's install and import the required packages:

In [1]:
!pip install CASSIA==0.1.7

Collecting CASSIA==0.1.7
  Obtaining dependency information for CASSIA==0.1.7 from https://files.pythonhosted.org/packages/96/1f/bb55e307a91d58477954d8786da585b9b4c4952bb3024a04cfe9461b9be9/CASSIA-0.1.7-py3-none-any.whl.metadata
  Downloading CASSIA-0.1.7-py3-none-any.whl.metadata (4.8 kB)
Downloading CASSIA-0.1.7-py3-none-any.whl (40 kB)
Installing collected packages: CASSIA
  Attempting uninstall: CASSIA
    Found existing installation: CASSIA 0.1
    Uninstalling CASSIA-0.1:
      Successfully uninstalled CASSIA-0.1
Successfully installed CASSIA-0.1.7





In [None]:
pip install --target="D:\anaconda3\envs\cassia_env\Lib\site-packages" CASSIA==0.1.7 --no-cache-dir

In [2]:
print("All attributes and methods:")
print([item for item in dir(CASSIA) if not item.startswith('_')])

All attributes and methods:
['Anthropic', 'Counter', 'OpenAI', 'Path', 'ThreadPoolExecutor', 'agent_judgement', 'agent_judgement_claude', 'agent_judgement_single', 'agent_unification', 'agent_unification_claude', 'agent_unification_deplural', 'annotate_subclusters', 'anthropic', 'as_completed', 'check_formatted_output', 'claude_agent', 'compareCelltypes', 'consensus_similarity_flexible', 'consensus_similarity_flexible_single', 'construct_prompt_from_csv_subcluster', 'convert_markdown_to_html', 'create_and_save_results_dataframe', 'csv', 'defaultdict', 'escape', 'extract_cell_types_from_results_single', 'extract_celltypes_from_llm', 'extract_celltypes_from_llm_claude', 'extract_celltypes_from_llm_single', 'extract_score_and_reasoning', 'extract_subcluster_results_with_llm', 'extract_subcluster_results_with_llm_multiple_output', 'generate_cell_type_analysis_report', 'generate_cell_type_analysis_report_openai', 'generate_cell_type_analysis_report_openrouter', 'generate_html_report', 'gene

In [1]:
import CASSIA
import pandas as pd



In [12]:

# Set API keys
CASSIA.set_api_key("your-openai-key", provider="openai")
CASSIA.set_api_key("your-anthropic-key", provider="anthropic")
CASSIA.set_api_key("your-openrouter-key", provider="openrouter")
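
Hard-coded keys are easy to leak when a notebook is shared. A minimal sketch of reading them from environment variables instead; the variable names (`OPENAI_API_KEY` and so on) and the helper `read_api_keys` are illustrative assumptions, not part of CASSIA.

```python
import os

# Hypothetical mapping from provider name to environment variable name.
ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def read_api_keys(environ=os.environ):
    """Return {provider: key} for every provider whose variable is set and non-empty."""
    keys = {}
    for provider, var in ENV_VARS.items():
        value = environ.get(var, "").strip()
        if value:
            keys[provider] = value
    return keys


if __name__ == "__main__":
    for provider, key in read_api_keys().items():
        # With CASSIA imported, the call shown above would go here:
        # CASSIA.set_api_key(key, provider=provider)
        print(f"found a key for {provider}")
```

Providers without a key are skipped, which makes a missing variable visible before any model call is attempted.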

In [22]:
# Load different types of markers
processed_markers = CASSIA.loadmarker("processed")
markers_unprocessed = CASSIA.loadmarker("unprocessed")
subcluster_results = CASSIA.loadmarker("subcluster_results")

# List available marker sets
available_markers = CASSIA.list_available_markers()
print(available_markers) 


AttributeError: module 'CASSIA' has no attribute 'loadmarker'

## Fast Mode

In [10]:
# Run the CASSIA pipeline in fast mode
CASSIA.runCASSIA_pipeline(
    output_file_name = "FastAnalysisResults",
    tissue = "large intestine",
    species = "human",
    marker_path = markers_unprocessed,
    max_workers = 6,  # Matches the number of clusters in dataset
    annotation_model = "openai/gpt-4o-2024-11-20",
    annotation_provider = "openrouter",
    score_model = "anthropic/claude-3.5-sonnet",
    score_provider = "openrouter",
    score_threshold = 75,
    annotationboost_model="anthropic/claude-3.5-sonnet",
    annotationboost_provider="openrouter"
)



=== Starting cell type analysis ===
Processing input dataframe to get top markers

Analyzing B cell...

Analyzing CD8 T cell...

Analyzing Podocyte...
B cell generated an exception: OpenRouter API error: 401
CD8 T cell generated an exception: OpenRouter API error: 401
Podocyte generated an exception: OpenRouter API error: 401
All analyses completed. Results saved to 'FastAnalysisResults'.
Two CSV files have been created:
1. FastAnalysisResults_full.csv (full data)
2. FastAnalysisResults_summary.csv (summary data)
✓ Cell type analysis completed

=== Starting scoring process ===
Starting scoring process with 6 workers using openrouter (anthropic/claude-3.5-sonnet)...
All rows already scored!
✓ Scoring process completed

=== Generating main reports ===


NameError: name 'process_all_reports' is not defined
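
The `max_workers = 6` comment in the call above ties the worker count to the number of clusters. A small sketch of deriving that number from a marker table rather than hard-coding it, assuming the table has a `cluster` column as in Seurat's `FindAllMarkers` output. `count_clusters` is a hypothetical helper and the sample rows are made up for illustration.

```python
import csv
import io


def count_clusters(rows, cluster_col="cluster"):
    """Count distinct cluster labels in an iterable of dict-like rows."""
    return len({row[cluster_col] for row in rows})


# Tiny made-up marker table standing in for a real FindAllMarkers export.
sample_csv = """gene,cluster,avg_log2FC,p_val_adj
IGKC,plasma cell,5.1,0.0
JCHAIN,plasma cell,4.8,0.0
CD8A,cd8 t cell,3.2,0.001
LYZ,monocyte,4.0,0.0
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
n_clusters = count_clusters(rows)
max_workers = n_clusters  # one worker per cluster
print(n_clusters)  # prints 3
```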

### Step 2: Detailed Batch Analysis

In [13]:
# Base name shared by all output files
output_name="intestine_detailed"

# Run batch analysis
CASSIA.runCASSIA_batch(
    marker = markers_unprocessed,
    output_name = output_name,
    model = "openai/gpt-4o-2024-11-20",
    tissue = "large intestine",
    species = "human",
    max_workers = 6,  # Matching cluster count
    n_genes = 50,
    additional_info = None,
    provider = "openrouter")

Processing input dataframe to get top markers

Analyzing B cell...

Analyzing CD8 T cell...

Analyzing Podocyte...
Analysis for B cell completed.

Analysis for CD8 T cell completed.

Analysis for Podocyte completed.

All analyses completed. Results saved to 'intestine_detailed'.
Two CSV files have been created:
1. intestine_detailed_full.csv (full data)
2. intestine_detailed_summary.csv (summary data)


{'B cell': {'analysis_result': {'main_cell_type': 'B cells',
   'sub_cell_types': ['Plasma cells',
    'Memory B cells',
    'Germinal center B cells'],
   'possible_mixed_cell_types': [],
   'num_markers': 50,
   'iterations': 1,
   'marker_list': ['IGKC',
    'IGHG1',
    'IGLC2',
    'IGHG3',
    'IGHG2',
    'IGLC3',
    'IGHGP',
    'IGHM',
    'JCHAIN',
    'IGHG4',
    'IGHA1',
    'MZB1',
    'SDF2L1',
    'SUB1',
    'IGHA2',
    'HSP90B1',
    'AL928768.3',
    'CD27',
    'IGHD',
    'MANF',
    'PIM2',
    'SEC11C',
    'TNFRSF13B',
    'MS4A1',
    'CD79A',
    'SPCS3',
    'ISG20',
    'PPIB',
    'PDIA4',
    'CD37',
    'VPREB3',
    'POU2AF1',
    'FCRLA',
    'DDX39A',
    'FCRL1',
    'CD79B',
    'CXCR5',
    'CD19',
    'TNFRSF13C',
    'BLK',
    'CCDC167',
    'VIMP',
    'LINC00926',
    'P2RX5',
    'MYDGF',
    'FCER2',
    'RANBP1',
    'BANK1',
    'SLC30A1',
    'SMC1A']},
  'conversation_history': [('Final Annotation Agent',
    '### Step-by-Step Analysis:
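
The log line "Processing input dataframe to get top markers" and the `n_genes = 50` argument suggest that only the strongest markers per cluster are passed to the model. How CASSIA ranks them is not shown here. The sketch below assumes ranking by `avg_log2FC` (a Seurat `FindAllMarkers` column) and uses a hypothetical helper, `top_markers`, with made-up rows.

```python
from collections import defaultdict


def top_markers(rows, n_genes=50, cluster_col="cluster",
                gene_col="gene", rank_col="avg_log2FC"):
    """Return {cluster: [gene, ...]} with at most n_genes genes per cluster,
    ordered by descending rank_col."""
    by_cluster = defaultdict(list)
    for row in rows:
        by_cluster[row[cluster_col]].append((float(row[rank_col]), row[gene_col]))
    result = {}
    for cluster, pairs in by_cluster.items():
        ranked = sorted(pairs, key=lambda pair: pair[0], reverse=True)
        result[cluster] = [gene for _, gene in ranked[:n_genes]]
    return result


# Made-up rows for illustration only.
rows = [
    {"cluster": "B cell", "gene": "MS4A1", "avg_log2FC": "2.1"},
    {"cluster": "B cell", "gene": "IGKC", "avg_log2FC": "5.0"},
    {"cluster": "B cell", "gene": "CD79A", "avg_log2FC": "3.3"},
    {"cluster": "CD8 T cell", "gene": "CD8A", "avg_log2FC": "3.0"},
]
print(top_markers(rows, n_genes=2))
```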

### Step 3: Quality Scoring

In [15]:
# Run quality scoring
CASSIA.runCASSIA_score_batch(
    input_file = output_name + "_full.csv",
    output_file = output_name + "_scored.csv",
    max_workers = 6,
    model = "anthropic/claude-3.5-sonnet",
    provider = "openrouter"
)

# Generate quality report
CASSIA.runCASSIA_generate_score_report(
    csv_path = output_name + "_scored.csv",
    index_name = output_name + "_report.html"
)

Starting scoring process with 6 workers using openrouter (anthropic/claude-3.5-sonnet)...
Processed row 1: Score = 95
Processed row 2: Score = 95
Processed row 3: Score = 30

Scoring completed!

Summary:
Total rows: 3
Successfully scored: 3
Failed/Skipped: 0


TypeError: runCASSIA_generate_score_report() got an unexpected keyword argument 'output_name'
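
With the scores printed above (95, 95, 30), a natural next step is to pick out the rows that fall below a cutoff and inspect them more closely. The sketch reuses the value 75 from `score_threshold` in the fast-mode call. `rows_below_threshold` is a hypothetical helper, and treating the cutoff as "strictly below" is an assumption about how such a threshold would be applied.

```python
def rows_below_threshold(scores, threshold=75):
    """Return 1-based row numbers whose score is strictly below threshold."""
    return [i for i, score in enumerate(scores, start=1) if score < threshold]


scores = [95, 95, 30]  # values from the scoring log above
print(rows_below_threshold(scores))  # prints [3]
```

Rows flagged this way are candidates for the annotation boost step.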

### Optional Step: Annotation Boost on Selected Cluster

The monocyte cluster is sometimes annotated as a mixed population of immune cells and neurons/glial cells.

Here we use the annotation boost agent to test this hypothesis in more detail.

In [19]:
# Run annotation boost on the selected cluster
CASSIA.runCASSIA_annotationboost(
    full_result_path = output_name + "_full.csv",
    marker = markers_unprocessed,
    output_name = "monocyte_annotationboost2",
    cluster_name = "Podocyte",
    major_cluster_info = "Human Large Intestine",
    num_iterations = 5,
    model = "anthropic/claude-3.5-sonnet",
    provider = "openrouter"
)

Iteration 1 completed.
Iteration 2 completed.
Iteration 3 completed.
Iteration 4 completed.
Final annotation completed in iteration 5.
HTML report generated and saved as 'monocyte_annotationboost2.html'
Report successfully saved as 'monocyte_annotationboost2_raw.html'
Analysis completed successfully. Report saved as monocyte_annotationboost2
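
The log above names two HTML reports. A short sketch for locating one of them, assuming the files are written to the current working directory (the log does not state the location). `report_uri` is a hypothetical helper.

```python
from pathlib import Path


def report_uri(output_name, suffix=".html"):
    """Build a file:// URI for a report expected in the current working directory."""
    return (Path.cwd() / f"{output_name}{suffix}").resolve().as_uri()


uri = report_uri("monocyte_annotationboost2")
print(uri)

# To open the report in the default browser:
# import webbrowser; webbrowser.open(uri)
```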


### Optional Step: Compare the Subtypes Using Multiple LLMs

Use this agent after running the default CASSIA pipeline when you are still unsure about a cell type; comparing candidate cell types gives a more confident subtype annotation. Here we use the plasma cell cluster as an example, to determine whether it is better described as a general plasma cell or as one of the other candidate cell types.

In [21]:
# The markers here are copied from CASSIA's previous results.
marker = "IGLL5, IGLV6-57, JCHAIN, FAM92B, IGLC3, IGLC2, IGHV3-7, IGKC, TNFRSF17, IGHG1, AC026369.3, IGHV3-23, IGKV4-1, IGKV1-5, IGHA1, IGLV3-1, IGLV2-11, MYL2, MZB1, IGHG3, IGHV3-74, IGHM, ANKRD36BP2, AMPD1, IGKV3-20, IGHA2, DERL3, AC104699.1, LINC02362, AL391056.1, LILRB4, CCL3, BMP6, UBE2QL1, LINC00309, AL133467.1, GPRC5D, FCRL5, DNAAF1, AP002852.1, AC007569.1, CXorf21, RNU1-85P, U62317.4, TXNDC5, LINC02384, CCR10, BFSP2, APOBEC3A, AC106897.1"

CASSIA.compareCelltypes(
    tissue = "large intestine",
    celltypes = ["Plasma Cells", "IgA-secreting Plasma Cells", "IgG-secreting Plasma Cells", "IgM-secreting Plasma Cells"],
    marker_set = marker,
    species = "human",
    output_file = "plasma_cell_subtype"
)

Model: anthropic/claude-3.5-sonnet
Response: As a professional biologist, I'll analyze the marker set and score each plasma cell type based on known characteristics and markers. Let's evaluate each option:

IgA-secreting Plasma Cells: 85/100
Key supporting evidence:
- Strong presence of IGHA1 and IGHA2 (IgA heavy chains)
- High expression of JCHAIN (important for IgA dimerization)
- Presence of CCR10 (homing receptor for IgA plasma cells)
- Common plasma cell markers (MZB1, TNFRSF17)

IgG-secreting Plasma Cells: 60/100
Key supporting evidence:
- Presence of IGHG1 and IGHG3 (IgG heavy chains)
- Common plasma cell markers present
- However, lacks some typical IgG-specific markers

IgM-secreting Plasma Cells: 45/100
Key supporting evidence:
- Presence of IGHM (IgM heavy chain)
- JCHAIN present (relevant for IgM pentamers)
- However, overall profile less consistent with IgM-dominant cells

General Plasma Cells: 75/100
Key supporting evidence:
- Strong presence of general plasma cell marker
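
The response above scores each candidate on a 0 to 100 scale in free text. A minimal sketch for pulling those scores out and ranking them, assuming the model keeps the "Name: NN/100" line format seen in this output (it may not in other runs). `rank_celltypes` is a hypothetical helper, and the sample text is abridged from the response above.

```python
import re

# Matches lines such as "IgA-secreting Plasma Cells: 85/100".
SCORE_LINE = re.compile(
    r"^(?P<celltype>[^:\n]+):[ \t]*(?P<score>\d{1,3})/100[ \t]*$",
    re.MULTILINE,
)


def rank_celltypes(response_text):
    """Return [(celltype, score), ...] sorted from highest to lowest score."""
    found = [
        (match["celltype"].strip(), int(match["score"]))
        for match in SCORE_LINE.finditer(response_text)
    ]
    return sorted(found, key=lambda pair: pair[1], reverse=True)


response = """IgA-secreting Plasma Cells: 85/100
Key supporting evidence:
- Strong presence of IGHA1 and IGHA2 (IgA heavy chains)

IgG-secreting Plasma Cells: 60/100
IgM-secreting Plasma Cells: 45/100
General Plasma Cells: 75/100
"""

for celltype, score in rank_celltypes(response):
    print(f"{score:>3}  {celltype}")
```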

### Optional Step: Subclustering


This agent annotates subclusters within a population, such as a T cell population or a fibroblast cluster. We recommend running the default CASSIA workflow first, then subclustering the target cluster with the Seurat pipeline and passing the FindAllMarkers results to this agent. Here we show the results for the cd8-positive, alpha-beta t cell cluster as an example; this cluster is a CD8 population mixed with other cell types.

In [None]:
# The commented-out code below is R (Seurat); the same steps can also be done in Scanpy.

# large=readRDS("/Users/xie227/Downloads/seurat_object.rds")
# # Extract CD8+ T cells
# cd8_cells <- subset(large, cell_ontology_class == "cd8-positive, alpha-beta t cell")
# # Normalize and identify variable features
# cd8_cells <- NormalizeData(cd8_cells)
# cd8_cells <- FindVariableFeatures(cd8_cells, selection.method = "vst", nfeatures = 2000)
# # Scale data and run PCA
# all.genes <- rownames(cd8_cells)
# cd8_cells <- ScaleData(cd8_cells, features = all.genes)
# cd8_cells <- RunPCA(cd8_cells, features = VariableFeatures(object = cd8_cells),npcs = 30)
# # Run clustering (adjust resolution and dims as needed based on elbow plot)
# cd8_cells <- FindNeighbors(cd8_cells, dims = 1:20)
# cd8_cells <- FindClusters(cd8_cells, resolution = 0.3)
# # Run UMAP
# cd8_cells <- RunUMAP(cd8_cells, dims = 1:20)
# # Create visualization plots
# p1 <- DimPlot(cd8_cells, reduction = "umap", label = TRUE) +
#   ggtitle("CD8+ T Cell Subclusters")
# # Find markers for each subcluster
# cd8_markers <- FindAllMarkers(cd8_cells,
#                             only.pos = TRUE,
#                             min.pct = 0.1,
#                             logfc.threshold = 0.25)
# cd8_markers=cd8_markers %>% filter(p_val_adj<0.05)
# write.csv(cd8_markers, "cd8_subcluster_markers.csv")

# Load the subcluster markers written by the R code above.
# R equivalent for the bundled example data: marker_sub <- loadExampleMarkers_subcluster()
marker_sub = pd.read_csv("cd8_subcluster_markers.csv")


CASSIA.runCASSIA_subclusters(marker = marker_sub,
    major_cluster_info = "cd8 t cell",
    output_name = "subclustering_results",
    model = "anthropic/claude-3.5-sonnet",
    provider = "openrouter")

We recommend running the CS score on the subclustering results to get a more confident answer.

In [None]:
# Repeat the subcluster annotation 5 times so the runs can be compared
CASSIA.runCASSIA_n_subcluster(5, marker_sub, "cd8 t cell", "subclustering_results_n",
                              model = "anthropic/claude-3.5-sonnet", temperature = 0,
                              provider = "openrouter", max_workers = 5, n_genes = 50)




# Calculate similarity scores
CASSIA.runCASSIA_similarity_score_batch(
    marker = marker_sub,
    file_pattern = "subclustering_results_n_*.csv",
    output_name = "subclustering_uncertainty",
    max_workers = 6,
    model = "claude-3-5-sonnet-20241022",
    provider = "anthropic",
    main_weight = 0.5,
    sub_weight = 0.5
)
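
The `main_weight` and `sub_weight` arguments suggest that the score blends agreement on the main cell type with agreement on the subtype. CASSIA's actual formula is not shown in this notebook. The sketch below is only one simple reading, in which each component is the fraction of runs that agree with the most common label. `agreement`, `weighted_agreement`, and the labels are hypothetical.

```python
from collections import Counter


def agreement(labels):
    """Fraction of labels equal to the most common label."""
    if not labels:
        raise ValueError("labels must not be empty")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def weighted_agreement(main_labels, sub_labels, main_weight=0.5, sub_weight=0.5):
    """Blend main-type and subtype agreement with the given weights."""
    return main_weight * agreement(main_labels) + sub_weight * agreement(sub_labels)


# Made-up labels from five repeated runs of one subcluster.
main_labels = ["CD8 T cell"] * 5
sub_labels = ["effector memory"] * 4 + ["exhausted"]

score = weighted_agreement(main_labels, sub_labels)
print(round(score, 2))  # prints 0.9
```

Here the main type agrees in 5 of 5 runs and the subtype in 4 of 5, so the blended value is 0.5 × 1.0 + 0.5 × 0.8 = 0.9.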

### Optional Step: Annotation Boost with Additional Task

This agent can be used to study a specific question about a cluster, such as inferring its state. Here we use the cd8-positive, alpha-beta t cell cluster as an example. Note that the performance of this agent has not been benchmarked, so please treat the results with caution.

In [None]:
# Only OpenRouter is currently supported as the provider.


CASSIA.runCASSIA_annottaionboost_additional_task(
    full_result_path = output_name + "_full.csv",
    marker = markers_unprocessed,
    output_name="T_cell_state",
    cluster_name = "cd8-positive, alpha-beta t cell",
    major_cluster_info = "Human Large Intestine",
    num_iterations = 5,
    model = "anthropic/claude-3.5-sonnet",
    additional_task = "infer the state of this T cell cluster"
)
