Package to analyse amplicon sequencing data from CRISPR-Cas9 recorder lineage tracing experiments

Nowak-Lab/EvoTraceR


EvoTraceR - Evolutionary lineage Tracing in R


EvoTraceR is an R package to analyse amplicon sequencing data from CRISPR-Cas9 recorder lineage tracing experiments. The package takes paired-end FASTQ files from one or more tissues as input. The sequenced amplicon can contain one or more Cas9 cut sites. EvoTraceR trims and merges reads, collapses duplicates, calls mutations and infers a tree. The package outputs the inferred tree of relationships between Amplicon Sequence Variants (ASVs), as well as summary plots and tables of mutations.

EvoTraceR pipeline concept

Installation

The package can be installed using devtools.

library(devtools)
install_github("Nowak-Lab/EvoTraceR")

EvoTraceR can also be installed with the above method in a conda environment. Based on our experience, a clean environment installed with r-base, r-essentials, and r-devtools from the conda-forge channel works well.

conda create -n evotracer
conda activate evotracer
conda install -c conda-forge r-base
conda install -c conda-forge r-essentials
conda install -c conda-forge r-devtools

Dependencies

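EvoTraceR calls external binaries, including trimmomatic and flash, whose paths are passed to initialize_EvoTraceR in the usage example below. Binaries for these dependencies must be downloaded separately. All other dependencies are handled by devtools during the package installation.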

Usage example

A simulated toy dataset is available to test EvoTraceR. The sequences were generated from a single simulated mouse, MMUS1469, and taken from cancer populations in two tissues (labelled PRL and HMR). The CRISPR-Cas9 recorder design is designated BC10v0. The FASTQ files are therefore named following the scheme MMUS1469_PRL_BC10v0_MG_120419. Note that EvoTraceR expects input file names of the same form, with the tissue label in the same position of the naming scheme.
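
The naming scheme can be made concrete with a short sketch. The helper parse_fastq_name below is hypothetical and not part of EvoTraceR. It assumes underscore-delimited fields with the sample first, the tissue second and the recorder design third, as in the example name. The meaning of the remaining fields is not stated here, so they are returned unlabelled.

```r
# Hypothetical helper (not part of EvoTraceR): split a file name that follows
# the example scheme into its underscore-delimited fields.
parse_fastq_name <- function(file_name) {
  # Drop any directory and everything from the first "." (e.g. ".fastq.gz")
  stem <- sub("\\..*$", "", basename(file_name))
  fields <- strsplit(stem, "_", fixed = TRUE)[[1]]
  list(
    sample = fields[1],
    tissue = fields[2],    # assumed position of the tissue label
    recorder = fields[3],
    other = fields[-(1:3)] # remaining fields, meaning not documented here
  )
}

parsed <- parse_fastq_name("MMUS1469_PRL_BC10v0_MG_120419")
parsed$tissue
```

A check like this may help confirm that all files in an input directory carry the tissue label in the same field before running the pipeline.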

To get started with the analysis, we provide the directory containing the input FASTQ files. We also provide an output directory and the paths to the trimmomatic and flash binaries on our system (adjust these paths as needed for your system). Then we preprocess our input data with the initialize_EvoTraceR function.

library(EvoTraceR)
input_dir <- system.file("extdata", "input", package = "EvoTraceR")
output_dir <- "example_output"
trimmomatic_path <- "/my/path/trimmomatic.jar"
flash_path <- "/my/path/flash"

EvoTraceR_object <-
  initialize_EvoTraceR(
    input_dir = input_dir,
    output_dir = output_dir,
    trimmomatic_path = trimmomatic_path,
    flash_path = flash_path)
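
Because the binary paths differ between systems, it can be useful to confirm that they exist before starting. The following is a minimal sketch using base R only. The helper check_paths is hypothetical and not part of EvoTraceR, and the paths are the placeholders from the example above.

```r
# Hypothetical pre-flight check (not part of EvoTraceR): report which of the
# supplied paths exist on this system.
check_paths <- function(...) {
  paths <- c(...)
  found <- file.exists(paths)
  names(found) <- names(paths)
  found
}

status <- check_paths(
  trimmomatic = "/my/path/trimmomatic.jar",
  flash = "/my/path/flash"
)

missing_paths <- names(status)[!status]
if (length(missing_paths) > 0) {
  message("Not found: ", paste(missing_paths, collapse = ", "))
}

# trimmomatic is distributed as a .jar, so a Java runtime is needed as well
java_found <- nzchar(Sys.which("java"))
```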

Now that the FASTQ data has been cleaned and the paired reads merged, we align the merged reads to the reference and call indels. To do this, we use the asv_analysis function and provide our unedited template sequence as the ref_seq and the expected flanking sequences. We also provide the 1-based reference sequence coordinates of the expected Cas9 cut sites and the borders between individual target regions (e.g., guide=20bp, PAM=3bp, spacer=3bp). This information is used to filter sequences with unexpected mutation patterns. For additional options on alignment parameters, parallelization and more, see the documentation.

EvoTraceR_object <-
  asv_analysis(EvoTraceR_object = EvoTraceR_object,
               ref_name = "BC10v0",
               ref_seq = "TCTACACGCGCGTTCAACCGAGGAAAACTACACACACGTTCAACCACGGTTTTTTACACACGCATTCAACCACGGACTGCTACACACGCACTCAACCGTGGATATTTACATACTCGTTCAACCGTGGATTGTTACACCCGCGTTCAACCAGGGTCAGATACACCCACGTTCAACCGTGGTACTATACTCGGGCATTCAACCGCGGCTTTCTGCACACGCCTACAACCGCGGAACTATACACGTGCATTCACCCGTGGATC",
               ref_flank_left = "^TCTAC",
               ref_flank_right = "CCCGTGGATC$",
               ref_cut_sites = c(17, 43, 69, 95, 121, 147, 173, 199, 225, 251),
               ref_border_sites = c(1, 26, 52, 78, 104, 130, 156, 182, 208, 234))
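
The coordinate vectors above follow a regular pattern: each target region in the example spans 26 bp (20 bp guide, 3 bp PAM, 3 bp spacer), so consecutive cut sites are 26 bp apart. The sketch below regenerates the example values with seq. It assumes ten evenly spaced targets, as in the BC10v0 example, and the variable names other than ref_cut_sites and ref_border_sites are illustrative. Other recorder designs would need different values.

```r
guide_len <- 20
pam_len <- 3
spacer_len <- 3
target_len <- guide_len + pam_len + spacer_len  # 26 bp per target region
n_targets <- 10

# The first cut site in the example is at position 17; the rest follow at
# intervals of one target length.
ref_cut_sites <- seq(from = 17, by = target_len, length.out = n_targets)

# Borders in the example: position 1, then every multiple of the target length
ref_border_sites <- c(1, seq(from = target_len, by = target_len,
                             length.out = n_targets - 1))

print(ref_cut_sites)
print(ref_border_sites)
```

Generating the vectors this way makes it easier to spot a mistyped coordinate than when entering each value by hand.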

Finally, we process the mutations, infer a tree and output summaries of the results.

EvoTraceR_object <-
  analyse_mutations(EvoTraceR_object = EvoTraceR_object)

EvoTraceR_object <-
  infer_phylogeny(EvoTraceR_object = EvoTraceR_object, mutations_use = "del_ins")

EvoTraceR_object <-
  create_df_summary(EvoTraceR_object)

Output files

After running the example analysis, the following output files will be generated.

example_output/
|-- asv_analysis
|   |-- asv_filtering_freq.pdf
|   `-- asv_length_freq.pdf
|-- fastq_analysis
|   |-- fastq_flash_merged
|   |-- fastq_summary.csv
|   `-- fastq_trimmed
`-- phylogeny_analysis
    `-- phylogeny_del_ins
        |-- asv_stat.csv
        |-- tree_all_clones.csv
        `-- tree_all_clones.newick

The two key files for most follow-up analyses are as follows.

  • asv_stat.csv: Summary of all detected amplicon sequence variants (ASVs) and mutations in the input FASTQ files
  • tree_all_clones.newick: Tree of the phylogenetic relationships between ASVs based on the detected mutations
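
As a starting point for follow-up analyses, the two files can be loaded back into R. The sketch below makes some assumptions: the helper output_paths is hypothetical, the phylogeny folder name is assumed to mirror the mutations_use argument (as "phylogeny_del_ins" does in the tree above), and the ape package, which is a separate CRAN package, is used only to read the Newick file.

```r
# Hypothetical helper (not part of EvoTraceR): build paths that follow the
# example output tree shown above.
output_paths <- function(output_dir, mutations_use = "del_ins") {
  phylo_dir <- file.path(output_dir, "phylogeny_analysis",
                         paste0("phylogeny_", mutations_use))
  list(
    asv_stat = file.path(phylo_dir, "asv_stat.csv"),
    tree = file.path(phylo_dir, "tree_all_clones.newick")
  )
}

paths <- output_paths("example_output")

# Read the ASV summary table if the analysis has been run
if (file.exists(paths$asv_stat)) {
  asv_stat <- read.csv(paths$asv_stat)
  str(asv_stat)  # inspect the columns; their names are not listed here
}

# Read and plot the tree only if the file exists and ape is installed
if (file.exists(paths$tree) && requireNamespace("ape", quietly = TRUE)) {
  tree <- ape::read.tree(paths$tree)
  ape::plot.phylo(tree)
}
```

The file.exists guards let the snippet run without error before the example analysis has produced any output.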

Contact

Please feel free to contact us with feedback: Dawid Nowak, dgn2001 at med.cornell.edu or Armin Scheben, ascheben at cshl.edu.
