# a. Disambiguation of conflicts in data between Cao et al. 2019 and the Piekarz reanalysis

When attempting to merge cell annotations from the Cao et al. 2019 dataset with the Piekarz remapping of the data to the KY2021 reference genome, we encountered a number of conflicts: some developmental stages and replicates in the Piekarz dataset shared very few cell barcodes with the stage and replicate of the same name in the Cao et al. dataset.

To resolve these conflicts, we needed to disambiguate the stage and replicate names in the Piekarz dataset.

This notebook demonstrates the disambiguation process we used to resolve the conflicts, which involved agnostically searching for overlaps in cell barcodes between the two datasets. Because cell barcodes are infrequently shared between stages, a large overlap between two stage-replicate labels indicates that they refer to the same sample. We used a Sankey diagram to visualize the resulting mapping between the two datasets.

## a.1 Loading necessary functions
The underlying code is found in the `zoogletools.ciona.disambiguation` module.

In [1]:
import zoogletools as zt



## a.2 Loading cell barcodes

Next, we load the cell barcodes from the Cao et al. 2019 dataset and the Piekarz dataset.

An idiosyncrasy of the Cao et al. dataset is that technical replicates for stages latTI and latTII have "-1" and "-2" appended to the barcode, respectively. When loading the Piekarz dataset, we account for this by appending "-1" and "-2" to the barcode for the latTI and latTII stages, respectively.

In addition, some of the column names, replicate notation, and file names in the Piekarz dataset are internally inconsistent, so we format them to be consistent with each other.
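
The suffix handling described above can be sketched as follows. This is only an illustration: `add_replicate_suffix`, the sample labels, and the barcodes are hypothetical and are not taken from `zoogletools`, whose loader may work differently.

```python
def add_replicate_suffix(barcodes, suffix):
    """Append `suffix` to each barcode unless it already ends with it."""
    return [bc if bc.endswith(suffix) else bc + suffix for bc in barcodes]


# Hypothetical mapping from a sample label to its technical-replicate suffix.
suffix_by_sample = {"sample_A": "-1", "sample_B": "-2"}

# Made-up barcodes for illustration only.
raw_barcodes = {
    "sample_A": ["AAAC", "AAAG"],
    "sample_B": ["AAAC", "TTTG-2"],
}

suffixed = {
    sample: add_replicate_suffix(barcodes, suffix_by_sample[sample])
    for sample, barcodes in raw_barcodes.items()
}
print(suffixed)
```

After suffixing, the raw barcode `AAAC` that appeared in both samples becomes two distinct keys, so a later merge on `barcode` no longer conflates the two technical replicates.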

In [2]:
cao_cell_barcodes = zt.ciona.disambiguation.load_cao_cell_barcodes()
piekarz_cell_barcodes = zt.ciona.disambiguation.load_piekarz_cell_barcodes()

display(cao_cell_barcodes)
display(piekarz_cell_barcodes)

merged_cell_barcodes = cao_cell_barcodes.merge(
    piekarz_cell_barcodes, on=["barcode"], suffixes=["_cao", "_piekarz"]
)
display(merged_cell_barcodes)

Unnamed: 0,stage_replicate,barcode
1,Cao_iniG_rep1,AAACCTGTCAGTTTGG
2,Cao_iniG_rep1,AAACGGGTCTGTCCGT
3,Cao_iniG_rep1,AAAGATGAGTTGAGAT
4,Cao_iniG_rep1,AAATGCCGTCGCATAT
5,Cao_iniG_rep1,AAATGCCGTGTTTGGT
...,...,...
90575,Cao_larva_rep1,TCGTAGAGTACCGAGA-2
90576,Cao_larva_rep1,TGAGAGGCACGAGAGT-2
90577,Cao_larva_rep1,TGGGCGTTCGCCTGAG-2
90578,Cao_larva_rep1,TGTTCCGAGTGTCCAT-2


Unnamed: 0,stage_replicate,barcode
0,Piekarz_iniG_rep1,AAACCTGAGCAGGCTA
1,Piekarz_iniG_rep1,AAACCTGCAGCGTAAG
2,Piekarz_iniG_rep1,AAACCTGCAGCTGTTA
3,Piekarz_iniG_rep1,AAACCTGGTGACGGTA
4,Piekarz_iniG_rep1,AAACCTGGTGTGGCTC
...,...,...
4008,Piekarz_larva_rep2,TTTGTCACATGGGAAC
4009,Piekarz_larva_rep2,TTTGTCAGTCCTCTTG
4010,Piekarz_larva_rep2,TTTGTCATCCGCAAGC
4011,Piekarz_larva_rep2,TTTGTCATCGGAGGTA


Unnamed: 0,stage_replicate_cao,barcode,stage_replicate_piekarz
0,Cao_iniG_rep1,AAACCTGTCAGTTTGG,Piekarz_iniG_rep1
1,Cao_iniG_rep1,AAACGGGTCTGTCCGT,Piekarz_iniG_rep1
2,Cao_iniG_rep1,AAAGATGAGTTGAGAT,Piekarz_latTI_rep2
3,Cao_iniG_rep1,AAATGCCGTCGCATAT,Piekarz_iniG_rep1
4,Cao_iniG_rep1,AACTTTCTCTTTAGGG,Piekarz_iniG_rep1
...,...,...,...
82989,Cao_larva_rep1,CAGCCGATCGTTTATC-2,Piekarz_latTI_rep1
82990,Cao_larva_rep1,CAGTAACGTTTCCACC-2,Piekarz_latTI_rep1
82991,Cao_larva_rep1,TTGACTTCAGTAACGG-2,Piekarz_latTI_rep1
82992,Cao_larva_rep1,GGCGACTTCTACCTGC-2,Piekarz_latTI_rep1
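
The merged table contains occasional stray matches (for example, a `Cao_iniG_rep1` barcode paired with `Piekarz_latTI_rep2`), so it helps to count shared barcodes per label pair before reading off a mapping. The sketch below does this on a toy stand-in for `merged_cell_barcodes`. The labels and barcodes are made up, and the real analysis may summarize the table differently.

```python
import pandas as pd

# Toy stand-in for merged_cell_barcodes; all values are made up.
merged = pd.DataFrame(
    {
        "stage_replicate_cao": ["Cao_A_rep1"] * 4 + ["Cao_B_rep1"] * 3,
        "stage_replicate_piekarz": [
            "Piekarz_B_rep1", "Piekarz_B_rep1", "Piekarz_B_rep1", "Piekarz_A_rep1",
            "Piekarz_A_rep1", "Piekarz_A_rep1", "Piekarz_A_rep1",
        ],
        "barcode": ["b1", "b2", "b3", "b4", "b5", "b6", "b7"],
    }
)

# Rows: Cao labels; columns: Piekarz labels; cells: number of shared barcodes.
counts = pd.crosstab(merged["stage_replicate_cao"], merged["stage_replicate_piekarz"])

# For each Cao label, the Piekarz label sharing the most barcodes,
# and the fraction of that Cao label's shared barcodes it accounts for.
summary = pd.DataFrame(
    {
        "best_match": counts.idxmax(axis=1),
        "share": counts.max(axis=1) / counts.sum(axis=1),
    }
)
print(summary)
```

A `share` close to 1 suggests an unambiguous relabeling, whereas a low value would call for a closer look at that stage or replicate.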


## a.3 Visualizing the mapping between the two datasets

Finally, we visualize the mapping between the two datasets using a Sankey diagram.

Based on this diagram, the following corrections to the mapping are needed:
- Piekarz's "latN" == Cao's "earTI". Replicates are 1:1.
- Piekarz's "iniTI" == Cao's "latN". Replicates are 1:1.
- Piekarz's "earTI rep1" == Cao's "iniTI rep1".
- Piekarz's "earTI rep2" == Cao's "iniTI rep3".
- Piekarz's "latTI rep2" == Cao's "latTI rep1".
- Piekarz's "latTI rep1" == Cao's "latTI rep2" (after accounting for the "-1" and "-2" technical-replicate suffixes).
- Piekarz's "latTII rep1" == Cao's "latTII rep1" (after accounting for the "-1" and "-2" technical-replicate suffixes).
- Piekarz's "latTII rep2" == Cao's "latTII rep2".
- Piekarz's "larva rep1" == Cao's "larva rep3".
- Piekarz's "larva rep2" == Cao's "larva rep4".
- Cao's "larva rep1" does not appear to be present in the Piekarz dataset.
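
The corrections listed above could be encoded as a lookup along the following lines. This is a sketch, not the project's code: `piekarz_to_cao` and both dictionaries are hypothetical names, the `Dataset_stage_repN` label format is inferred from the table output earlier in this notebook, and passing unlisted labels through unchanged is an assumption.

```python
# Stage-level renames where replicates map 1:1 (Piekarz stage -> Cao stage).
STAGE_RENAMES = {"latN": "earTI", "iniTI": "latN"}

# Explicit (stage, replicate) renames, Piekarz -> Cao.
REPLICATE_RENAMES = {
    ("earTI", 1): ("iniTI", 1),
    ("earTI", 2): ("iniTI", 3),
    ("latTI", 2): ("latTI", 1),
    ("latTI", 1): ("latTI", 2),
    ("latTII", 1): ("latTII", 1),
    ("latTII", 2): ("latTII", 2),
    ("larva", 1): ("larva", 3),
    ("larva", 2): ("larva", 4),
}


def piekarz_to_cao(label):
    """Translate a label like 'Piekarz_latN_rep1' into its Cao equivalent."""
    prefix, stage, rep = label.split("_")
    if prefix != "Piekarz" or not rep.startswith("rep"):
        raise ValueError(f"Unexpected label format: {label!r}")
    rep_number = int(rep[len("rep"):])

    if (stage, rep_number) in REPLICATE_RENAMES:
        stage, rep_number = REPLICATE_RENAMES[(stage, rep_number)]
    elif stage in STAGE_RENAMES:
        stage = STAGE_RENAMES[stage]
    # Assumption: labels not covered by either table are already consistent.
    return f"Cao_{stage}_rep{rep_number}"


print(piekarz_to_cao("Piekarz_earTI_rep2"))
```

Explicit replicate pairs are checked before stage-level renames so that a stage such as "earTI", which appears on both sides of the mapping, is resolved by its replicate rather than renamed twice.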

In [3]:
fig = zt.ciona.disambiguation.plot_stage_replicate_sankey(
    merged_cell_barcodes,
    width=400,
    height=800,
    image_filepath="figures/Stage_replicate_sankey.svg",
    html_filepath="figures/Stage_replicate_sankey.html",
)
fig.show()
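
For readers unfamiliar with Sankey inputs, the sketch below shows one way to turn label pairs into the node and link lists that such a diagram typically needs. It is not the implementation of `plot_stage_replicate_sankey`. The pairs are made up and the variable names are illustrative.

```python
from collections import Counter

# Made-up (Cao label, Piekarz label) pairs, one per shared barcode.
pairs = [
    ("Cao_A_rep1", "Piekarz_B_rep1"),
    ("Cao_A_rep1", "Piekarz_B_rep1"),
    ("Cao_A_rep1", "Piekarz_A_rep1"),
    ("Cao_B_rep1", "Piekarz_A_rep1"),
]

# Each distinct pair becomes one link whose width is its barcode count.
link_counts = Counter(pairs)

# Source nodes (Cao) come first, followed by target nodes (Piekarz).
source_labels = sorted({cao for cao, _ in pairs})
target_labels = sorted({piekarz for _, piekarz in pairs})
node_labels = source_labels + target_labels
node_index = {label: i for i, label in enumerate(node_labels)}

sources, targets, values = [], [], []
for (cao, piekarz), n_barcodes in sorted(link_counts.items()):
    sources.append(node_index[cao])
    targets.append(node_index[piekarz])
    values.append(n_barcodes)

print(node_labels, sources, targets, values)
```

Lists of this shape (node labels plus parallel `source`, `target`, and `value` lists) are what Sankey plotting APIs such as Plotly's `go.Sankey` generally expect. The dataset prefixes keep source and target labels distinct even when stage names coincide.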

