# 5. Annotate cell clusters

This notebook compiles cell cluster annotations from the Cao et al. 2019 analysis to generate cluster names and tissue colors for the Piekarz et al. 2024 scRNAseq data.

## 5.1 Setup

Load the necessary libraries. The data directories are set in section 5.3, where they are first used.

In [1]:
from pathlib import Path

import zoogletools as zt



## 5.2 Load cell clusters

Load the cell clusters from the Cao et al. 2019 analysis.
This step also builds a `merging_barcode` for each cell, in the form `<stage>_<replicate>_<barcode>`, for combining with the Piekarz et al. 2024 scRNAseq data.

In [2]:
cell_clusters = zt.ciona.data_processing.load_cell_clusters()
display(cell_clusters)

Unnamed: 0,NAME,X,Y,Clusters,Tissue Type,Cao_stage,stage,replicate,stage_replicate,barcode,merging_barcode
1,C110.1_AAACCTGTCAGTTTGG,1.771697179,0.169352728,44,epidermis,C110,iniG,rep1,iniG_rep1,AAACCTGTCAGTTTGG,iniG_rep1_AAACCTGTCAGTTTGG
2,C110.1_AAACGGGTCTGTCCGT,-6.462542321,6.657707053,14,epidermis,C110,iniG,rep1,iniG_rep1,AAACGGGTCTGTCCGT,iniG_rep1_AAACGGGTCTGTCCGT
3,C110.1_AAAGATGAGTTGAGAT,-2.589877535,-1.603552418,75,epidermis,C110,iniG,rep1,iniG_rep1,AAAGATGAGTTGAGAT,iniG_rep1_AAAGATGAGTTGAGAT
4,C110.1_AAATGCCGTCGCATAT,-12.31563402,-6.27978969,44,epidermis,C110,iniG,rep1,iniG_rep1,AAATGCCGTCGCATAT,iniG_rep1_AAATGCCGTCGCATAT
5,C110.1_AAATGCCGTGTTTGGT,-2.731373449,-1.449560153,75,epidermis,C110,iniG,rep1,iniG_rep1,AAATGCCGTGTTTGGT,iniG_rep1_AAATGCCGTGTTTGGT
...,...,...,...,...,...,...,...,...,...,...,...
90575,lv.1_TCGTAGAGTACCGAGA-2,29.1525446,-6.089481097,89,endoderm,lv,larva,rep1,larva_rep1,TCGTAGAGTACCGAGA-2,larva_rep1_TCGTAGAGTACCGAGA-2
90576,lv.1_TGAGAGGCACGAGAGT-2,2.836903468,-27.77839935,67,nervous system,lv,larva,rep1,larva_rep1,TGAGAGGCACGAGAGT-2,larva_rep1_TGAGAGGCACGAGAGT-2
90577,lv.1_TGGGCGTTCGCCTGAG-2,1.730967885,-28.28614185,67,nervous system,lv,larva,rep1,larva_rep1,TGGGCGTTCGCCTGAG-2,larva_rep1_TGGGCGTTCGCCTGAG-2
90578,lv.1_TGTTCCGAGTGTCCAT-2,3.113585251,-27.65757074,67,nervous system,lv,larva,rep1,larva_rep1,TGTTCCGAGTGTCCAT-2,larva_rep1_TGTTCCGAGTGTCCAT-2
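
The `merging_barcode` column in the output above appears to be derived from the `NAME` column. The following is a minimal sketch of that derivation, not the `zoogletools` implementation. The function name `build_merging_barcode` is hypothetical, and the stage mapping contains only the two Cao stage codes visible in the output (`C110` and `lv`).

```python
# Hypothetical mapping from Cao stage codes to stage names.
# Only the two codes visible in the displayed rows are included.
CAO_STAGE_TO_STAGE = {"C110": "iniG", "lv": "larva"}


def build_merging_barcode(name, stage_map=CAO_STAGE_TO_STAGE):
    """Turn a NAME such as 'C110.1_AAACCTGTCAGTTTGG' into a merging barcode."""
    # Split the sample prefix ('C110.1') from the cell barcode.
    sample, barcode = name.split("_", 1)
    # Split the Cao stage code ('C110') from the replicate number ('1').
    cao_stage, replicate_number = sample.rsplit(".", 1)
    stage = stage_map[cao_stage]
    replicate = f"rep{replicate_number}"
    return f"{stage}_{replicate}_{barcode}"


print(build_merging_barcode("C110.1_AAACCTGTCAGTTTGG"))
# iniG_rep1_AAACCTGTCAGTTTGG
```

Any barcode suffix, such as the `-2` in the larva rows, is carried through unchanged.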


## 5.3 Quantify cluster annotations

Quantify the cluster annotations from the Cao et al. 2019 analysis. `quantify_cluster_annotations` takes the Cao et al. 2019 cell cluster annotations, a directory containing the scRNAseq data, and an output directory. For each stage, it counts the number of cells of each tissue type in each cluster.

The output is one TSV file per stage with the counts of cells of each type in each cluster.

In [3]:
input_dirpath = Path("../../data/Ciona_intestinalis_scRNAseq_data_Piekarz")
output_dirpath = Path("../../data/Ciona_intestinalis_scRNAseq_data_Piekarz/cluster_annotations")

zt.ciona.cluster_annotations.quantify_cluster_annotations(
    cell_clusters, input_dirpath=input_dirpath, output_dirpath=output_dirpath
)

Processing Cao Stage: iniG, Piekarz Stage: iniG
Processing Cao Stage: midG, Piekarz Stage: midG
Processing Cao Stage: earN, Piekarz Stage: earN
Processing Cao Stage: latN, Piekarz Stage: iniTI
Processing Cao Stage: iniTI, Piekarz Stage: earTI
Processing Cao Stage: earTI, Piekarz Stage: latN
Processing Cao Stage: midTII, Piekarz Stage: midTII
Processing Cao Stage: latTI, Piekarz Stage: latTI
Processing Cao Stage: latTII, Piekarz Stage: latTII
Processing Cao Stage: larva, Piekarz Stage: larva
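
The counting step can be illustrated with a small, self-contained sketch. This is not the `zoogletools` code. It assumes that cells are matched by merging barcode and that cells without a Cao et al. 2019 annotation are counted as `unannotated`. The function name and both input dictionaries are hypothetical.

```python
from collections import Counter, defaultdict


def count_tissue_types_per_cluster(tissue_by_barcode, cluster_by_barcode):
    """Count cells of each tissue type within each cluster.

    tissue_by_barcode: merging barcode -> Cao et al. 2019 tissue type
    cluster_by_barcode: merging barcode -> cluster ID in the new data
    """
    counts = defaultdict(Counter)
    for barcode, cluster in cluster_by_barcode.items():
        # Assumption: cells missing from the Cao annotations are "unannotated".
        tissue = tissue_by_barcode.get(barcode, "unannotated")
        counts[cluster][tissue] += 1
    return {cluster: dict(tissues) for cluster, tissues in counts.items()}


# Toy data for illustration only.
tissue_by_barcode = {
    "iniG_rep1_AAA": "epidermis",
    "iniG_rep1_CCC": "epidermis",
    "iniG_rep1_GGG": "nervous system",
}
cluster_by_barcode = {
    "iniG_rep1_AAA": 1,
    "iniG_rep1_CCC": 1,
    "iniG_rep1_GGG": 1,
    "iniG_rep1_TTT": 0,
}
toy_counts = count_tissue_types_per_cluster(tissue_by_barcode, cluster_by_barcode)
print(toy_counts)
```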


## 5.4 Compile cluster names

Compile the cluster names from the Cao et al. 2019 analysis.

`compile_all_cluster_names` takes the output directory written by `quantify_cluster_annotations` and the path where the compiled cluster names are saved.

The output is a TSV file with the formatted cluster names and the tissue colors.

Cluster names take the format `<top_cell_type>_<tissue_type_suffix>(<fraction_tissue_type>)`. For example, `epidermis_2(0.5)`.
- `<top_cell_type>` is the most abundant tissue type in the cluster.
- `<tissue_type_suffix>` is a number that distinguishes clusters sharing the same top tissue type. For example, if 2 clusters share a top tissue type, the first cluster gets the suffix `1` (as in `epidermis_1`) and the second gets the suffix `2` (as in `epidermis_2`). Tissue types with only one cluster are not given a suffix.
- `<fraction_tissue_type>` is the fraction of cells in the cluster that belong to the top tissue type.
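
The suffix rule can be sketched as follows. This is an illustration rather than the library's implementation. It assumes suffixes are numbered in the order the clusters are listed, which the notebook does not state, and the function name `assign_suffixes` is hypothetical.

```python
from collections import Counter


def assign_suffixes(top_types):
    """Return one suffix string per cluster, given each cluster's top tissue type."""
    totals = Counter(top_types)  # how many clusters share each top tissue type
    seen = Counter()             # running count per tissue type
    suffixes = []
    for tissue in top_types:
        if totals[tissue] == 1:
            # A tissue type with a single cluster gets no suffix.
            suffixes.append("")
        else:
            seen[tissue] += 1
            suffixes.append(f"_{seen[tissue]}")
    return suffixes


print(assign_suffixes(["epidermis", "epidermis", "notochord"]))
# ['_1', '_2', '']
```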

If the second most-abundant tissue type has at least half as many cells as the most-abundant tissue type, the cluster name also includes the second tissue type and its fraction.
For example, `epidermis_2(0.5)+nervous-system(0.4)` indicates that the cluster contains 50% epidermis and 40% nervous system.
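
A minimal sketch of the naming rule is shown below. It is not the `zoogletools` implementation. It assumes fractions are computed over all cells in the cluster, are printed with two decimals, and that spaces in tissue names are replaced with hyphens, as the `nervous-system` entries suggest. The function name `format_cluster_name` is hypothetical.

```python
def format_cluster_name(tissue_counts, suffix=""):
    """Build a cluster name from a dict of tissue type -> cell count."""
    total = sum(tissue_counts.values())
    # Rank tissue types from most to least abundant.
    ranked = sorted(tissue_counts.items(), key=lambda item: item[1], reverse=True)
    top_type, top_count = ranked[0]
    name = f"{top_type.replace(' ', '-')}{suffix}({top_count / total:.2f})"
    if len(ranked) > 1:
        second_type, second_count = ranked[1]
        # Include the runner-up only if it has at least half the top count.
        if second_count * 2 >= top_count:
            name += f"+{second_type.replace(' ', '-')}({second_count / total:.2f})"
    return name


print(format_cluster_name({"epidermis": 56, "nervous system": 44}, suffix="_2"))
# epidermis_2(0.56)+nervous-system(0.44)
print(format_cluster_name({"notochord": 76, "endoderm": 24}))
# notochord(0.76)
```

In the second call, the runner-up has 24 cells against 76 for the top tissue type, which is below the one-half threshold, so it is left out of the name.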

In [4]:
zt.ciona.cluster_annotations.compile_all_cluster_names(
    output_dirpath, output_dirpath / "Ciona_scRNAseq_cluster_names.tsv"
)

Unnamed: 0,index_stage,seurat_clusters,formatted_cluster_name,cluster_tissue_color,top_cluster_celltype,top_cluster_fraction,second_cluster_celltype,second_cluster_fraction,top_cluster_suffix
0,1_iniG,0,unannotated_1(0.99),#8b9aae,unannotated,0.99,epidermis,0.01,1
1,1_iniG,1,epidermis_1(0.93),#3b74b1,epidermis,0.93,nervous-system,0.07,1
2,1_iniG,2,epidermis_2(0.56)+nervous-system(0.44),#6dadde,epidermis,0.56,nervous-system,0.44,2
3,1_iniG,3,nervous-system_1(0.86),#584a90,nervous-system,0.86,epidermis,0.14,1
4,1_iniG,4,notochord(0.76),#a49c96,notochord,0.76,endoderm,0.24,
...,...,...,...,...,...,...,...,...,...
237,10_larva,14,endoderm_6(0.93),#fdcb5b,endoderm,0.93,nervous-system,0.07,6
238,10_larva,15,muscle-heart_2(0.56)+mesenchyme(0.44),#f8a6b4,muscle-heart,0.56,mesenchyme,0.44,2
239,10_larva,16,mesenchyme_7(0.98),#f7966d,mesenchyme,0.98,endoderm,0.02,7
240,10_larva,17,mesenchyme_8(0.96),#fba979,mesenchyme,0.96,muscle-heart,0.04,8
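
Downstream steps can look up a cluster's name and color by stage and cluster ID. The sketch below shows one way to do this with the standard library. It parses a small in-memory sample that reuses two rows and four column names from the output above, so it does not depend on the file on disk. The function name `load_cluster_lookup` is hypothetical.

```python
import csv
import io

# Two rows copied from the output above, restricted to four of its columns.
SAMPLE_TSV = (
    "index_stage\tseurat_clusters\tformatted_cluster_name\tcluster_tissue_color\n"
    "1_iniG\t1\tepidermis_1(0.93)\t#3b74b1\n"
    "1_iniG\t4\tnotochord(0.76)\t#a49c96\n"
)


def load_cluster_lookup(handle):
    """Map (index_stage, seurat_clusters) to (cluster name, tissue color)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        (row["index_stage"], int(row["seurat_clusters"])): (
            row["formatted_cluster_name"],
            row["cluster_tissue_color"],
        )
        for row in reader
    }


lookup = load_cluster_lookup(io.StringIO(SAMPLE_TSV))
print(lookup[("1_iniG", 4)])
# ('notochord(0.76)', '#a49c96')
```

To read the real file, the same function could be given an open file handle in place of the `io.StringIO` object.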
