# CyteOnto Tutorial

This is a basic tutorial on using the `cyteonto` package to compare cluster labels.

**Note**: If you plan to use `Kimi-K2` as the text model and `Qwen3-8B` as the embedding model, download the precomputed embeddings and descriptions first.

Set the following environment variables before running the notebook:

```bash
EMBEDDING_MODEL_API_KEY=deepinfra_api_key   # DeepInfra is the model provider in this tutorial
DEEPINFRA_API_KEY=deepinfra_api_key
```
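
Before making any API calls, it can help to confirm that both variables are visible to the Python process. The sketch below is not part of `cyteonto`; `missing_env_vars` is a hypothetical helper written only for this check, and it assumes the two variable names listed above.

```python
import os

# Variable names taken from the environment block above.
REQUIRED_VARS = ("EMBEDDING_MODEL_API_KEY", "DEEPINFRA_API_KEY")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```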

In [22]:
import sys
sys.path.append("..")

In [23]:
import os

from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider

import cyteonto

In [24]:
model = OpenAIModel(
    "moonshotai/Kimi-K2-Instruct",
    provider=OpenAIProvider(
        base_url="https://api.deepinfra.com/v1/openai",
        api_key=os.getenv("DEEPINFRA_API_KEY"),
    ),
)
agent = Agent(model)

In [None]:
# One-time setup
await cyteonto.setup(
    base_agent=agent,
    embedding_model="Qwen/Qwen3-Embedding-8B",
    embedding_provider="deepinfra",
)

In [26]:
author_labels = [
    "animal stem cell",
    "BFU-E",
    "CFU-M",
    "neutrophilic granuloblast"
]
labels1 = [
    "stem cell",
    "blast forming unit erythroid",
    "erythroid stem cell",
    "spermatogonium"
]
labels2 = [
    "neuronal receptor cell",
    "stem cell",
    "smooth muscle cell neural crest derived",
    "ovum"
]
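
The `pair_index` column in the results further down suggests that each algorithm label is compared with the author label at the same position. Assuming that positional pairing, a quick length check before running the comparison can catch misaligned lists early. `check_label_alignment` is a hypothetical helper for illustration, not a `cyteonto` function.

```python
def check_label_alignment(author_labels, algo_comparison_data):
    """Raise if any algorithm's label list differs in length from author_labels.

    Hypothetical helper; assumes labels are paired with author labels by position.
    """
    expected = len(author_labels)
    mismatched = {
        name: len(labels)
        for name, labels in algo_comparison_data
        if len(labels) != expected
    }
    if mismatched:
        raise ValueError(
            f"Expected {expected} labels per algorithm, got: {mismatched}"
        )
    return expected


# Same example labels as in the cell above.
author_labels = ["animal stem cell", "BFU-E", "CFU-M", "neutrophilic granuloblast"]
algo_comparison_data = [
    ("CyteType", ["stem cell", "blast forming unit erythroid",
                  "erythroid stem cell", "spermatogonium"]),
    ("CellTypist", ["neuronal receptor cell", "stem cell",
                    "smooth muscle cell neural crest derived", "ovum"]),
]

n_pairs = check_label_alignment(author_labels, algo_comparison_data)
print(f"{n_pairs} label pairs per algorithm")
```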

In [None]:
# Initialize CyteOnto instance
cyto = cyteonto.CyteOnto(
    base_agent=agent,
    embedding_model="Qwen/Qwen3-Embedding-8B", 
    embedding_provider="deepinfra"
)

In [None]:
# Compare multiple algorithms at once - no file collision!
algo_comparison_data = [
    ("CyteType", labels1),
    ("CellTypist", labels2)
    # Add more (algorithm name, labels) tuples here
]

results_df = await cyto.compare_batch(
    author_labels=author_labels,
    algo_comparison_data=algo_comparison_data,
    study_name="sample_study"
)

In [29]:
results_df

| | study_name | algorithm | pair_index | author_label | algorithm_label | author_ontology_id | author_embedding_similarity | algorithm_ontology_id | algorithm_embedding_similarity | ontology_hierarchy_similarity | similarity_method |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | sample_study | CyteType | 0 | animal stem cell | stem cell | CL:0000034 | 0.9169 | CL:0000723 | 0.8014 | 0.9868 | ontology_hierarchy |
| 1 | sample_study | CyteType | 1 | BFU-E | blast forming unit erythroid | CL:0001066 | 0.8798 | CL:0001066 | 0.8379 | 1.0 | ontology_hierarchy |
| 2 | sample_study | CyteType | 2 | CFU-M | erythroid stem cell | CL:0000049 | 0.7403 | CL:0001066 | 0.8645 | 0.9065 | ontology_hierarchy |
| 3 | sample_study | CyteType | 3 | neutrophilic granuloblast | spermatogonium | CL:0000042 | 0.9121 | CL:0000020 | 0.915 | 0.7566 | ontology_hierarchy |
| 4 | sample_study | CellTypist | 0 | animal stem cell | neuronal receptor cell | CL:0000034 | 0.9169 | CL:0000197 | 0.8493 | 0.9499 | ontology_hierarchy |
| 5 | sample_study | CellTypist | 1 | BFU-E | stem cell | CL:0001066 | 0.8798 | CL:0001024 | 0.8577 | 0.9174 | ontology_hierarchy |
| 6 | sample_study | CellTypist | 2 | CFU-M | smooth muscle cell neural crest derived | CL:0000049 | 0.7403 | CL:0000027 | 0.9157 | 0.8703 | ontology_hierarchy |
| 7 | sample_study | CellTypist | 3 | neutrophilic granuloblast | ovum | CL:0000042 | 0.9121 | CL:0000025 | 0.9137 | 0.8222 | ontology_hierarchy |


In [30]:
# Check overall performance by similarity method
print(results_df['similarity_method'].value_counts())

similarity_method
ontology_hierarchy    8
Name: count, dtype: int64


In [31]:
# Get high-quality matches (both labels matched well)
high_quality = results_df[
    (results_df['author_embedding_similarity'] > 0.7)
    & (results_df['algorithm_embedding_similarity'] > 0.7)
    & (results_df['similarity_method'] == 'ontology_hierarchy')
]
high_quality

| | study_name | algorithm | pair_index | author_label | algorithm_label | author_ontology_id | author_embedding_similarity | algorithm_ontology_id | algorithm_embedding_similarity | ontology_hierarchy_similarity | similarity_method |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | sample_study | CyteType | 0 | animal stem cell | stem cell | CL:0000034 | 0.9169 | CL:0000723 | 0.8014 | 0.9868 | ontology_hierarchy |
| 1 | sample_study | CyteType | 1 | BFU-E | blast forming unit erythroid | CL:0001066 | 0.8798 | CL:0001066 | 0.8379 | 1.0 | ontology_hierarchy |
| 2 | sample_study | CyteType | 2 | CFU-M | erythroid stem cell | CL:0000049 | 0.7403 | CL:0001066 | 0.8645 | 0.9065 | ontology_hierarchy |
| 3 | sample_study | CyteType | 3 | neutrophilic granuloblast | spermatogonium | CL:0000042 | 0.9121 | CL:0000020 | 0.915 | 0.7566 | ontology_hierarchy |
| 4 | sample_study | CellTypist | 0 | animal stem cell | neuronal receptor cell | CL:0000034 | 0.9169 | CL:0000197 | 0.8493 | 0.9499 | ontology_hierarchy |
| 5 | sample_study | CellTypist | 1 | BFU-E | stem cell | CL:0001066 | 0.8798 | CL:0001024 | 0.8577 | 0.9174 | ontology_hierarchy |
| 6 | sample_study | CellTypist | 2 | CFU-M | smooth muscle cell neural crest derived | CL:0000049 | 0.7403 | CL:0000027 | 0.9157 | 0.8703 | ontology_hierarchy |
| 7 | sample_study | CellTypist | 3 | neutrophilic granuloblast | ovum | CL:0000042 | 0.9121 | CL:0000025 | 0.9137 | 0.8222 | ontology_hierarchy |


In [32]:
# Compare algorithms
for algo in results_df['algorithm'].unique():
    algo_data = results_df[results_df['algorithm'] == algo]
    median_sim = algo_data['ontology_hierarchy_similarity'].median()
    print(f"{algo}: {median_sim:.3f} median ontology similarity")

CyteType: 0.947 median ontology similarity
CellTypist: 0.894 median ontology similarity
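
With four pairs per algorithm, the median is the mean of the two middle `ontology_hierarchy_similarity` values. The sketch below reproduces the two numbers above using only the standard library; the scores are copied by hand from the results table, so it is an illustration of the arithmetic rather than a replacement for the pandas call.

```python
from statistics import median

# (algorithm, ontology_hierarchy_similarity) pairs copied from the results table.
rows = [
    ("CyteType", 0.9868), ("CyteType", 1.0),
    ("CyteType", 0.9065), ("CyteType", 0.7566),
    ("CellTypist", 0.9499), ("CellTypist", 0.9174),
    ("CellTypist", 0.8703), ("CellTypist", 0.8222),
]

# Group the scores by algorithm name.
scores_by_algo = {}
for algo, score in rows:
    scores_by_algo.setdefault(algo, []).append(score)

# For an even number of values, statistics.median averages the two middle ones.
medians = {algo: median(scores) for algo, scores in scores_by_algo.items()}

for algo, value in medians.items():
    print(f"{algo}: {value:.3f} median ontology similarity")
```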


In [33]:
# Find problematic matches
low_embedding_matches = results_df[
    (results_df['author_embedding_similarity'] < 0.5) |
    (results_df['algorithm_embedding_similarity'] < 0.5)
]
print("Labels with poor ontology matching:")
print(low_embedding_matches[['author_label', 'algorithm_label', 'similarity_method']])

Labels with poor ontology matching:
Empty DataFrame
Columns: [author_label, algorithm_label, similarity_method]
Index: []
