A (mosaic) collection of Python helper functions for lineage tracing data analysis, developed through the DARLIN project [L. Li, ..., S.-W. Wang, F. Camargo, Cell (2023)].
It is often used together with cospar.
To install, run

```bash
git clone https://github.com/ShouWenWang-Lab/MosaicLineage
cd MosaicLineage
pip install -r requirements.txt
python setup.py develop
```
Related resources:

- Snakemake_DARLIN, a Snakemake pipeline to automatically preprocess raw DARLIN sequencing data.
- Notebooks to reproduce Figure 4 and Figure 5 in our paper; they also illustrate how to use the MosaicLineage package.
- Raw and intermediate data for these notebooks. To download all raw or processed data, please go to GEO: GSE222486.
Since alleles are generated at different frequencies, in an experiment where we need to barcode a population of a given size we must determine which (rare) alleles can reliably label clones. The workflow of an actual lineage tracing experiment and its downstream analysis is shown below.

We carried out an experiment designed to measure the allele frequency, as below.
From this experiment, we curated a large collection of alleles from the CA, TA, and RA loci of DARLIN, with an estimated generation probability for each allele. These reference allele data, or allele bank, are located under the folder `reference`. In particular, `reference_merged_alleles_{X}_Gr.csv`, with X=CA, TA, or RA, is the reference file created using alleles profiled from granulocytes (the allele frequency experiment above), with minimal complication from heterogeneous clone proliferation.
As you can see from the above, each allele is described by `normalized_count`, a score for its generation probability, and `sample_count`, the number of mouse replicates in which this allele was observed (the maximum is 3 in this file). The normalized count is derived by normalizing against sequencing depth as well as the expression differences between the three loci. Please check the Method sections *Allele bank construction* and *Homoplasy probability inference* in our paper for details. The higher the `sample_count` or `normalized_count`, the more frequently this allele will be generated during Dox induction.
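As a minimal sketch of how such a reference table could be inspected with pandas: the column names `allele`, `normalized_count`, and `sample_count` follow the description above, but the tiny DataFrame here is a toy stand-in, not the real `reference_merged_alleles_{X}_Gr.csv` content.

```python
import pandas as pd

# Toy stand-in for a reference allele bank table; the real file holds many more alleles.
bank = pd.DataFrame(
    {
        "allele": ["A1", "A2", "A3", "A4"],
        "normalized_count": [0.5, 1e-4, 2e-3, 8e-5],  # generation-probability score
        "sample_count": [3, 1, 2, 1],                 # mouse replicates (max 3 here)
    }
)

# Alleles observed in all three mice are among the most frequently generated ones.
frequent = bank[bank["sample_count"] == 3]
print(frequent["allele"].tolist())  # → ['A1']
```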
Besides this allele bank experiment, we also have many small experiments not specifically designed to measure the allele generation frequency. In such experiments, the number of times an allele is detected depends both on its intrinsic frequency and on the extent of proliferation of its corresponding clone. Since we have several mouse samples for such experiments, we can still use `sample_count` to indicate whether an allele is more frequent, while not using the UMI count of these alleles. Therefore, we generated `reference_merged_alleles_{X}.csv` files, with X=CA, TA, or RA, where we updated `sample_count` to be the total number of mouse samples in which an allele was observed across all these experiments, but kept `normalized_count` the same as in `reference_merged_alleles_{X}_Gr.csv`.
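The idea of updating `sample_count` across experiments can be illustrated as below. This is not the authors' pipeline code, just a sketch assuming a hypothetical long-format table with one row per (allele, mouse sample) detection:

```python
import pandas as pd

# Hypothetical pooled detections across several small experiments:
# one row per (allele, mouse sample) in which the allele was observed.
detections = pd.DataFrame(
    {
        "allele": ["A1", "A1", "A2", "A1", "A3", "A2"],
        "mouse":  ["m1", "m2", "m1", "m3", "m2", "m3"],
    }
)

# sample_count = number of distinct mouse samples in which each allele appeared.
sample_count = (
    detections.groupby("allele")["mouse"]
    .nunique()
    .rename("sample_count")
    .reset_index()
)
print(sample_count.to_dict("records"))
```

Here A1 would get `sample_count` 3, A2 gets 2, and A3 gets 1, while `normalized_count` would stay untouched from the granulocyte reference.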
In practice, you can use two criteria to select reliable alleles: a cutoff on `normalized_count` and a cutoff on `sample_count`. Please check the Method section *Clone identification: practice* in our paper for more ideas on how to select reliable alleles. Also, check our notebooks to see how they were used in the applications of the DARLIN project. For how the allele generation probability cutoff is chosen, see the Method section *Clone identification: theory* in our paper for details.
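A filter combining the two criteria might look like the sketch below. The cutoff values are placeholders for illustration only, not the thresholds used in the paper; consult the Method sections for how to choose them.

```python
import pandas as pd

# Toy allele bank table with the columns described above.
bank = pd.DataFrame(
    {
        "allele": ["A1", "A2", "A3", "A4"],
        "normalized_count": [0.5, 1e-4, 2e-3, 8e-5],
        "sample_count": [3, 1, 2, 1],
    }
)

# Placeholder cutoffs (hypothetical values): keep rarely generated alleles that
# were also seen in few mouse samples, so a shared allele likely marks one clone.
NORMALIZED_COUNT_CUTOFF = 1e-3
SAMPLE_COUNT_CUTOFF = 2

reliable = bank[
    (bank["normalized_count"] < NORMALIZED_COUNT_CUTOFF)
    & (bank["sample_count"] < SAMPLE_COUNT_CUTOFF)
]
print(reliable["allele"].tolist())  # → ['A2', 'A4']
```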
Finally, historically we named CA, TA, and RA as CC, TC, and RC. We have not corrected this terminology throughout this package, nor in the raw and processed data deposited at GEO: GSE222486. In other words, these terms are equivalent: `CC` <-> `CA`, `TC` <-> `TA`, `RC` <-> `RA`.