# CyteOnto

Semantic Cell Type Annotation Comparison Using Large Language Models and Cell Ontology

## Advanced tutorial

This notebook demonstrates advanced usage of CyteOnto for semantic cell type annotation comparison, including custom embedding generation and ...


### Prerequisites

Before you begin, ensure you have the following:

- Python 3.12+
- UV package manager (recommended)

Navigate to the `CyteOnto` directory and install the required packages:

```bash
uv sync
```

### 1. Set API Keys as Environment Variables
 
```bash
LLM_API_KEY=your_api_key_here               # e.g. OpenAI (other providers such as groq, openrouter, google, xai, deepinfra also work)
EMBEDDING_MODEL_API_KEY=your_api_key_here   # Can be the same as above if the embedding model is from the same provider
 
# Optional: for higher rate limits
NCBI_API_KEY=your_ncbi_api_key_here         # for using pubmed tool calls
```
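
Before running the cells below, it can help to confirm that the keys are actually visible to the Python process. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` is hypothetical and not part of CyteOnto.

```python
import os
from collections.abc import Mapping


def missing_env_vars(names: list[str], env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names that are unset or empty in the given environment."""
    return [name for name in names if not env.get(name)]


# The two keys this tutorial relies on; NCBI_API_KEY is optional and not checked.
required = ["LLM_API_KEY", "EMBEDDING_MODEL_API_KEY"]
missing = missing_env_vars(required)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

If anything is reported as missing, set it in the shell that launches the notebook and restart the kernel.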

In [3]:
# Path management, only for running `cyteonto` from notebooks directory
import sys
sys.path.append("..")

In [4]:
import os

from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider


You can check out [Pydantic AI's Model Documentation](https://ai.pydantic.dev/models/) for more information on how to initialize agents using various model providers such as `OpenAI`, `Anthropic`, `Gemini`, `Groq`, `Mistral`, etc.

In addition, many providers are compatible with the OpenAI API, and can be used with `OpenAIModel` in Pydantic AI: `DeepSeek`, `Grok (xAI)`, `Ollama` (for locally hosted models), etc.

We will use `OpenAIModel` to initialize a `moonshotai/Kimi-K2-Instruct` model hosted on `deepinfra` and wrap it in an agent. `OpenAIProvider` supplies the OpenAI-compatible endpoint for `deepinfra`.

In [5]:
model = OpenAIModel(
    "moonshotai/Kimi-K2-Instruct",
    provider=OpenAIProvider(
        base_url="https://api.deepinfra.com/v1/openai",
        api_key=os.getenv("LLM_API_KEY"),
    ),
)
agent = Agent(model)

### 2. Generate Embeddings 

You can generate descriptions and embeddings for the Cell Ontology with your preferred models by calling the `cyteonto.setup` function. The function requires the `embedding_model` and `embedding_provider` parameters. Currently, only `deepinfra`, `ollama`, `openai`, and `google` are supported as providers.

This is a one-time operation that creates the necessary embeddings for your ontology at `cyteonto/data/embedding`. For more information on managing file locations, refer to the [`docs/FILE_MANAGEMENT.md`](../docs/FILE_MANAGEMENT.md) documentation.

In [8]:
import cyteonto

In [None]:
# This may take a while
await cyteonto.setup(
    base_agent=agent,
    embedding_model="Qwen/Qwen3-Embedding-8B",
    embedding_provider="deepinfra",
)

### 3. Initialize CyteOnto with LLM and Embedding model

In [11]:
# Initialize CyteOnto instance
cyto = cyteonto.CyteOnto(
    base_agent=agent,
    embedding_model="Qwen/Qwen3-Embedding-8B",
    embedding_provider="deepinfra",
)

INFO: Loading ontology embeddings from /home/oskbor/Documents/onto-package/CyteOnto/notebooks/../cyteonto/data/embedding/cell_ontology/embeddings_moonshotai-Kimi-K2-Instruct_Qwen-Qwen3-Embedding-8B.npz
INFO: Loaded 2902 embeddings from /home/oskbor/Documents/onto-package/CyteOnto/notebooks/../cyteonto/data/embedding/cell_ontology/embeddings_moonshotai-Kimi-K2-Instruct_Qwen-Qwen3-Embedding-8B.npz
INFO: Loaded 2902 ontology embeddings
INFO: Loaded success: True
INFO: CyteOntoMatcher initialized
INFO: CyteOnto initialized with models: text='moonshotai/Kimi-K2-Instruct', embedding='Qwen/Qwen3-Embedding-8B'
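
The log lines above suggest that the embedding file name combines the text model and embedding model names, with `/` replaced by `-`. The sketch below reproduces that pattern. It is inferred from the log output only, and `embedding_filename` is a hypothetical helper, not a CyteOnto function.

```python
def embedding_filename(text_model: str, embedding_model: str) -> str:
    """Build a file name matching the pattern seen in the log output (assumed)."""

    def clean(name: str) -> str:
        # Model identifiers contain "/", which cannot appear in a file name.
        return name.replace("/", "-")

    return f"embeddings_{clean(text_model)}_{clean(embedding_model)}.npz"


print(embedding_filename("moonshotai/Kimi-K2-Instruct", "Qwen/Qwen3-Embedding-8B"))
```

If the pattern holds, changing either model points to a different file, so `cyteonto.setup` would likely need to run again for the new model pair.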


In [12]:
author_labels = ["animal stem cell", "BFU-E", "CFU-M", "neutrophilic granuloblast"]
algorithm1_labels = ["stem cell", "blast forming unit erythroid", "erythroid stem cell", "spermatogonium"]
algorithm2_labels = ["neuronal receptor cell", "stem cell", "smooth muscle cell", "ovum"]
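
The three lists above appear to be paired by position: each algorithm label is compared with the author label at the same index. Assuming that is the case, a quick length check before the batch call can catch misaligned inputs early. The helper `mismatched_algorithms` is hypothetical and only illustrates the check.

```python
def mismatched_algorithms(author_labels, algo_comparison_data):
    """Return names of algorithms whose label count differs from the author labels."""
    expected = len(author_labels)
    return [name for name, labels in algo_comparison_data if len(labels) != expected]


# Small demonstration with two author labels; "algorithm2" is one label short.
demo_author = ["animal stem cell", "BFU-E"]
demo_algos = [
    ("algorithm1", ["stem cell", "blast forming unit erythroid"]),
    ("algorithm2", ["neuronal receptor cell"]),
]
print(mismatched_algorithms(demo_author, demo_algos))
```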

In [None]:
# Perform batch comparison
# This may take a while as the new embeddings are generated
results_df = await cyto.compare_batch(
    study_name="sample_study",              # Save and cache all the results to this directory. Serves as a unique run id.
    author_labels=author_labels,
    algo_comparison_data=[
        ("algorithm1", algorithm1_labels),
        ("algorithm2", algorithm2_labels)
    ],
)

The generated descriptions and embeddings are cached and stored at `cyteonto/data/user_files/descriptions/sample_study` and `cyteonto/data/user_files/embeddings/sample_study`.
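
To check what has been cached for a study, you can build the two directories named above and test whether they exist. This is a sketch based only on the default paths stated in this tutorial; `study_cache_dirs` is a hypothetical helper, and the root path is an assumption that will differ if you set a custom user data location.

```python
from pathlib import Path


def study_cache_dirs(study_name: str, user_data_root: str = "cyteonto/data/user_files") -> dict[str, Path]:
    """Return the description and embedding cache directories for a study."""
    root = Path(user_data_root)
    return {
        "descriptions": root / "descriptions" / study_name,
        "embeddings": root / "embeddings" / study_name,
    }


for kind, path in study_cache_dirs("sample_study").items():
    status = "exists" if path.exists() else "not created yet"
    print(f"{kind}: {path} ({status})")
```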

## Custom Configuration

You can customize the behavior of CyteOnto by modifying the configuration at initialization.

In [None]:
# Custom paths and settings
await cyteonto.setup(
    base_agent=agent,
    embedding_model="text-embedding-ada-002",
    embedding_provider="openai",
    base_data_path="/custom/path/to/cl-embedding/data"
)

cyto = cyteonto.CyteOnto(
    base_agent=agent,
    embedding_model="text-embedding-ada-002",
    embedding_provider="openai",
    base_data_path="/custom/path/to/cl-embedding/data",     # Custom data location
    user_data_path="/custom/path/to/user/files",            # Custom user files location
    enable_user_file_caching=True                           # Enable/disable caching
)

## Troubleshooting

### Common Issues

#### Setup Fails with Description Errors
```bash
# Error: Description Error: 'data'
```
**Solution**: Ensure you're using compatible model versions and check API connectivity.

#### ParseError during Setup
```bash
# Error: ParseError: syntax error: line 1, column 0
```
**Solution**: The LLM is not returning valid JSON. Try a different model or check your API configuration.

#### Connection Errors
```bash
# Error: ConnectionError: HTTPSConnectionPool...
```
**Solution**: Check your internet connection and API keys. CyteOnto gracefully handles PubMed API failures.
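
For transient network failures, one option is to wrap the call in a retry with exponential backoff. The sketch below is a generic standard-library pattern, not something CyteOnto provides; `retry_async` and its parameters are hypothetical. `OSError` covers the built-in `ConnectionError`; adjust `retry_on` to whatever exception your HTTP client raises.

```python
import asyncio


async def retry_async(make_call, attempts: int = 3, base_delay: float = 1.0, retry_on=(OSError,)):
    """Await make_call(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(attempts):
        try:
            # make_call must create a fresh awaitable each time it is invoked.
            return await make_call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the original error
            await asyncio.sleep(base_delay * 2 ** attempt)


# Usage in a notebook cell (assumes `cyto` and the label lists defined earlier):
# results_df = await retry_async(
#     lambda: cyto.compare_batch(
#         study_name="sample_study",
#         author_labels=author_labels,
#         algo_comparison_data=[("algorithm1", algorithm1_labels)],
#     )
# )
```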

### Debug Mode

By default, `CyteOnto` logs at the `INFO` level. You can enable more verbose output by setting the logging level to `DEBUG`.

```python
import logging
logging.basicConfig(level=logging.DEBUG)

# CyteOnto will now show detailed progress information
await cyto.compare_batch(
    study_name="debug_run",
    author_labels=author_labels,
    algo_comparison_data=[("algorithm1", algorithm1_labels)],
)
```
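
Setting the root logger to `DEBUG` also makes HTTP and other third-party libraries verbose. If the output is too noisy, you can raise the level for one logger only. The sketch below assumes CyteOnto's loggers live under the name `cyteonto` (the usual result of `logging.getLogger(__name__)` inside a package); that name is an assumption and not confirmed by this tutorial.

```python
import logging

# Keep everything else at INFO.
logging.basicConfig(level=logging.INFO)

# Assumed logger name: enable DEBUG only for loggers under "cyteonto".
logging.getLogger("cyteonto").setLevel(logging.DEBUG)
```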

---