

# Single-Cell-Fuzzy-Labels


![Single-Cell Fuzzy Labels Concept](Single_Cell_Fuzzy_Labels/logo.png)



The `Single-Cell-Fuzzy-Labels` library supports single-cell RNA-seq data analysis by transferring labels from a reference dataset to new data using a K-nearest neighbors (KNN) approach. It uses pre-trained models and large language models such as GPT-3.5 or GPT-4 to align disparate label sets, improving interpretability and integration. The library also offers metrics to evaluate how well the label transfer worked.

## Streamlining Label Equivalence and Scoring with the OpenAI API and GPT-4

To quantify the success of label transfer, we use a centroid-based consensus strategy from our nearest neighbor method. The challenge lies in the lack of a universal standard for label sets, often leading to a mismatch between manual annotations and our predicted labels. For example, manual annotations might label cells as 'neurons', while our method could predict more specific types like 'pyramidal neurons' or 'interneurons'.

We define the two label sets involved as:

- **Existing Labels (E):** The original labels in the dataset, typically from manual annotations or established knowledge.
- **Predicted Labels (P):** The labels our method predicts.

Mathematically, we represent these sets as:

$$
\text{Existing Labels (E)} = \{e_1, e_2, e_3, \ldots, e_n\}
$$

$$
\text{Predicted Labels (P)} = \{p_1, p_2, p_3, \ldots, p_m\}
$$

Comparing **Existing Labels (E)** with **Predicted Labels (P)** is challenging due to the need to categorize complex biological processes and the inconsistencies in label detail and standards. For example, a broad category like 'basal' in **E** (notated as $e_i$) may equate to multiple subtypes in **P** (notated as $\{p_j, p_k, \ldots\}$), requiring adaptable equivalence mapping strategies. Moreover, the same cell type might have different names, such as 'Type I Pneumocytes', 'Type I alveolar cells', or 'squamous alveolar cells', with variations in terms, spelling, and capitalization, complicating computational comparison despite being understandable to experts.
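
The naming variations described above can be partly reduced by normalizing label strings before any comparison. The following is a minimal sketch, not the library's implementation: the `normalize_label` helper, the `SYNONYMS` table, and the naive plural stripping are all assumptions made for illustration.

```python
import re

# Hypothetical synonym table: maps normalized name variants to one canonical name.
SYNONYMS = {
    "type i alveolar cell": "type i pneumocyte",
    "squamous alveolar cell": "type i pneumocyte",
}

def normalize_label(label: str) -> str:
    """Lowercase, unify separators, strip a trailing plural 's', apply synonyms."""
    text = label.strip().lower()
    text = re.sub(r"[\s_\-]+", " ", text)  # treat spaces, underscores, hyphens alike
    if text.endswith("s") and not text.endswith("ss"):
        text = text[:-1]  # naive singularization; real labels may need more care
    return SYNONYMS.get(text, text)

variants = ["Type I Pneumocytes", "Type I alveolar cells", "squamous alveolar cells"]
canonical = {normalize_label(v) for v in variants}
print(canonical)
```

String normalization only handles spelling, capitalization, and known synonyms. It does not resolve differences in granularity, such as one broad label in **E** corresponding to several subtypes in **P**.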


**Label Mapping and Evaluation:**

To compare the predicted label set $P$ with the existing set $E$, we define a mapping function $f: P \rightarrow E$. This function assigns each predicted label $p \in P$ to an existing label $e \in E$, allowing for a consistent comparison. For instance, 'pyramidal neuron' in $P$ might map to 'neuron' in $E$, based on criteria like hierarchy or linguistic similarity.
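
One simple way to picture such a mapping is as a lookup table from predicted labels to existing labels. The sketch below is illustrative only: the label sets, the `label_map` table, and the choice to return `None` for unmapped labels are assumptions, not part of the library's API.

```python
from typing import Dict, Optional

# Hypothetical label sets, following the neuron example used earlier in the text.
existing_labels = {"neuron", "astrocyte"}  # E
predicted_labels = {"pyramidal neuron", "interneuron", "astrocyte", "microglia"}  # P

# The mapping from P to E as an explicit table. In practice such a table might be
# curated by hand or proposed by a language model and then reviewed.
label_map: Dict[str, str] = {
    "pyramidal neuron": "neuron",
    "interneuron": "neuron",
    "astrocyte": "astrocyte",
}

def f(predicted: str) -> Optional[str]:
    """Return the existing label for a predicted label, or None if unmapped."""
    return label_map.get(predicted)

# Every mapping target must be a member of E.
assert set(label_map.values()) <= existing_labels

# Predicted labels with no counterpart in E are worth flagging for review.
unmapped = sorted(p for p in predicted_labels if f(p) is None)
print(unmapped)
```

Keeping unmapped labels visible, rather than forcing them onto the closest existing label, makes it easier to spot cell types that the existing annotation does not cover.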

The `fuzz1_score` evaluates this mapping's accuracy, akin to the F1 score, using the formula:

$$
\text{fuzz1\textunderscore score} = 2 \times \frac{(\text{Precision} \times \text{Recall})}{(\text{Precision} + \text{Recall})}
$$


Precision is the ratio of correct predictions to total predictions, while Recall is the ratio of correct predictions to total existing labels. For example, correctly mapping 'pyramidal neuron' and 'interneuron' in $P$ to 'neuron' in $E$ counts as true positives. Incorrect mappings are false positives, and missed mappings are false negatives.

The `fuzz1_score` thus quantifies the mapping's precision and recall, offering a concise measure of its effectiveness.
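
To make the arithmetic concrete, here is a small sketch that computes an F1-style score from per-cell labels after mapping predictions into the existing label set. It follows the definitions of Precision and Recall given above, but the function name `fuzz1_sketch`, its signature, and the handling of missing labels are assumptions for illustration and may differ from the library's own `fuzz1_score`.

```python
from typing import Dict, List, Optional

def fuzz1_sketch(existing: List[Optional[str]],
                 predicted: List[Optional[str]],
                 label_map: Dict[str, str]) -> float:
    """F1-style score over per-cell labels after mapping predictions into E."""
    if len(existing) != len(predicted):
        raise ValueError("existing and predicted must have one entry per cell")
    # Map each predicted label into E; unmapped or missing predictions become None.
    mapped = [label_map.get(p) if p is not None else None for p in predicted]
    correct = sum(1 for e, m in zip(existing, mapped) if e is not None and e == m)
    n_predicted = sum(1 for p in predicted if p is not None)
    n_existing = sum(1 for e in existing if e is not None)
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_existing if n_existing else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical data: four cells, one of which received no prediction.
label_map = {"pyramidal neuron": "neuron", "interneuron": "neuron"}
existing = ["neuron", "neuron", "astrocyte", "neuron"]
predicted = ["pyramidal neuron", "interneuron", "pyramidal neuron", None]

score = fuzz1_sketch(existing, predicted, label_map)
print(round(score, 4))
```

In this example two of the three predictions are correct (Precision = 2/3) and two of the four annotated cells are recovered (Recall = 1/2), so the score is 2 × (2/3 × 1/2) / (2/3 + 1/2) = 4/7.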



## Install

```sh
pip install Single_Cell_Fuzzy_Labels
```

## How to use

To use the `Single-Cell-Fuzzy-Labels` library in your single-cell RNA-seq data analysis, follow these steps:

1. Install the library using pip:
   ```sh
   pip install Single_Cell_Fuzzy_Labels
   ```

2. Import the library in your Python environment:
   ```python
   from Single_Cell_Fuzzy_Labels.core import *
   ```

Optional steps before label transfer:

3. Download pre-trained embeddings from cellxgene:
   ```python
   embeddings = download_embeddings(cellxgene_url)
   ```

   Or embed your own dataset using a foundation model such as UCE or scGPT:
   ```python
   dataset_embeddings = embed_dataset(your_dataset, model='UCE')
   ```

4. Assess embedding quality using Single-cell Integration Benchmarking (scIB):
   ```python
   quality_metrics = assess_embedding_quality(dataset_embeddings)
   ```

5. Prepare your reference dataset with well-annotated labels.

6. Use the `transfer_labels` function to transfer labels from the reference dataset to your new query data:
   ```python
   transferred_labels = transfer_labels(query_data, reference_data)
   ```

7. Evaluate the label transfer quality using the provided metrics:
   ```python
   evaluate_transfer(transferred_labels)
   ```
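
For intuition about what KNN-based label transfer does, the following standalone sketch assigns each query cell the majority label among its nearest reference cells in embedding space. It is a simplified illustration using only the Python standard library. The function `knn_transfer`, its parameters, and the toy embeddings are assumptions, and the library's `transfer_labels` may work differently.

```python
import math
from collections import Counter
from typing import List, Sequence

def knn_transfer(query: Sequence[Sequence[float]],
                 reference: Sequence[Sequence[float]],
                 reference_labels: Sequence[str],
                 k: int = 3) -> List[str]:
    """Give each query embedding the majority label of its k nearest reference embeddings."""
    if len(reference) != len(reference_labels):
        raise ValueError("reference and reference_labels must have the same length")
    transferred = []
    for q in query:
        # Reference indices sorted by Euclidean distance to the query embedding.
        order = sorted(range(len(reference)), key=lambda i: math.dist(q, reference[i]))
        votes = Counter(reference_labels[i] for i in order[:k])
        transferred.append(votes.most_common(1)[0][0])
    return transferred

# Toy 2-D embeddings: two well-separated groups of reference cells.
reference = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
reference_labels = ["neuron"] * 3 + ["astrocyte"] * 3
query = [(0.05, 0.1), (5.0, 5.1)]

labels = knn_transfer(query, reference, reference_labels, k=3)
print(labels)
```

Real embeddings have hundreds of dimensions and many more cells, where an approximate nearest-neighbor index is usually preferable to the brute-force sort used here.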

For more detailed usage instructions and examples, refer to the documentation and the tutorials included in the GitHub repository.
