AG-Boerries/CAST-Seq

CAST-Seq

CAST-Seq (chromosomal aberrations analysis by single targeted LM-PCR sequencing) is a novel method capable of detecting and quantifying chromosomal aberrations derived from on- and off-target activity of CRISPR-Cas nucleases or TALEN. See Turchiano et al. for detailed information about the CAST-Seq background and its potential clinical applications.

Citation

Original CAST-Seq pipeline: Turchiano et al., Cell Stem Cell, 2021

T-CAST pipeline: Rhiel et al., Front. Genome Ed., 2023

D-CAST pipeline: Klermund et al., Molecular Therapy, 2024

Getting Started

General: The code in this repository is the official bioinformatics pipeline for processing fastq files generated by CAST-Seq, T-CAST and D-CAST.

Composition of results: The results directory is divided into 3 sub-directories: fastq_aln, guide_aln and random.

  1. fastq_aln contains all pre-processing and alignment files from fastq.gz to bam files.

  2. guide_aln contains the post-processing files from bed files to final xlsx report.

  3. random contains the information related to the random regions that are used for normalisation.
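
As a rough illustration of this layout, the following Python sketch reports which of the three sub-directories are missing from a results directory. The helper name missing_result_dirs is hypothetical and not part of the pipeline; only the three directory names come from the description above.

```python
from pathlib import Path

# Sub-directory names taken from the description above.
RESULT_SUBDIRS = ("fastq_aln", "guide_aln", "random")


def missing_result_dirs(results_dir):
    """Return the expected sub-directories that are absent from results_dir.

    Hypothetical helper for illustration; not part of the CAST-Seq scripts.
    """
    root = Path(results_dir)
    return [name for name in RESULT_SUBDIRS if not (root / name).is_dir()]
```

An empty list means all three sub-directories are present; it says nothing about whether the files inside them are complete.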

Prerequisites

Required software and databases

  1. Software

  2. Annotation for bowtie2

    • bowtie2Index
    • genome fasta
  3. R packages

    • openxlsx

    • GenomicRanges

    • Biostrings

    • data.table

    • ggplot2

    • ggseqlogo

    • textreadr

    • parallel

    • ChIPseeker

    • clusterProfiler

    • rtracklayer

    • biomaRt

    • tools

    • karyoploteR

    • UpSetR

    • circlize

    • tidyr

    • org.Hs.eg.db

    • BSgenome.Hsapiens.UCSC.hg38

    • TxDb.Hsapiens.UCSC.hg38.knownGene

  4. Additional Files (provided in annotation folder)

    • hg38_TSS_TES.txt (Bed file containing TSS and TES as start and end locations respectively)
    • chrom.sizes (chromosome size file)
    • TruSeq4-PE (adapter sequences)
    • CancerGenesList_ENTREZ.txt (OncoKB cancer related genes)

Where available, the versions used are noted.

Additional Files

Needed for epigenetic profile analysis

Histone mark bed files (must be stored in annotations/histones/). An example file is provided for H3K4me3 in primary hematopoietic stem cells (E035 from Roadmap Epigenomics, https://egg2.wustl.edu/roadmap/web_portal/processed_data.html).

Directory structure

CAST-Seq
│
├── annotations
│   └── human
│       ├── bowtie2Index
│       │   ├── genome.1.bt2
│       │   ├── genome.2.bt2
│       │   ├── genome.3.bt2
│       │   ├── genome.4.bt2
│       │   ├── genome.fa.fai
│       │   ├── genome.rev.1.bt2
│       │   ├── genome.rev.2.bt2
│       │   └── genome.fa
│       ├── histones
│       │   ├── H3K4me3.bed
│       │   └── ...
│       ├── CancerGenesList_ENTREZ.txt
│       ├── hg38_TSS_TES.txt
│       ├── chrom.sizes
│       └── TruSeq4-PE.fa
│
├── samples
│   ├── XXX
│   │   ├── data
│   │   │   ├── fastq
│   │   │   │   ├── XXX_treated_R1_001.fastq.gz
│   │   │   │   ├── XXX_treated_R2_001.fastq.gz
│   │   │   │   ├── XXX_UNtreated_R1_001.fastq.gz
│   │   │   │   └── XXX_UNtreated_R2_001.fastq.gz
│   │   │   ├── gRNA.fa
│   │   │   ├── headTOhead.fa
│   │   │   ├── linker_RC.fa
│   │   │   ├── linker.fa
│   │   │   ├── mispriming.fa
│   │   │   ├── neg.fa
│   │   │   ├── ots.bed
│   │   │   └── pos.fa
│   │   └── results
│   │       ├── fastq_aln
│   │       ├── guide_aln
│   │       └── random
│   ├── YYY
│   │   ├── data
│   │   │   ├── fastq
│   │   │   └── ...
│   │   └── ...
│   └── ...
│
└── script
    ├── run
    │   ├── XXX.R
    │   ├── YYY.R
    │   └── ...
    ├── lcs.py
    ├── annotateGenes.R
    ├── bed2sequence.R
    ├── bedTools_fct.R
    └── ...

Remark: fastq files can also be stored in separate directories; in that case, use --tfastqD and --ufastqD to set the paths.
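
The layout above can be checked before a run. The following Python sketch, offered only as an illustration, lists the input files implied by the tree for one sample and reports those that are missing. It assumes the default layout with fastq files under data/fastq (not the --tfastqD/--ufastqD variant), and it assumes every file shown in the tree is needed, which may not hold for every pipeline mode. The function names are hypothetical and not part of the repository.

```python
from pathlib import Path

# File names copied from the tree above; whether every one of them is
# mandatory for every pipeline mode is an assumption of this sketch.
DATA_FILES = (
    "gRNA.fa", "headTOhead.fa", "linker_RC.fa", "linker.fa",
    "mispriming.fa", "neg.fa", "ots.bed", "pos.fa",
)


def expected_inputs(home, sample_dir, treated, untreated):
    """List the input paths the tree above implies for one sample.

    treated / untreated are fastq prefixes such as "XXX_treated".
    """
    data = Path(home) / "samples" / sample_dir / "data"
    paths = [data / name for name in DATA_FILES]
    for prefix in (treated, untreated):
        for read in ("R1", "R2"):
            paths.append(data / "fastq" / f"{prefix}_{read}_001.fastq.gz")
    return paths


def missing_inputs(home, sample_dir, treated, untreated):
    """Return the expected input paths that do not exist on disk."""
    return [p for p in expected_inputs(home, sample_dir, treated, untreated)
            if not p.is_file()]
```

For example, missing_inputs(".", "XXX", "XXX_treated", "XXX_UNtreated") run from the repository root returns the paths that still need to be provided for sample XXX.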

Running CAST-Seq

Once all tools and databases are installed and working, the whole CAST-Seq pipeline can be run with a single command (see the examples in script/run/):

Rscript ./CAST-Seq.R --pipeline "crispr" \
                     --pname "G3_TOY_test" \
                     --sampleDname "G3_TOY" \
                     --tsamp "G3_treated" \
                     --usamp "G3_UNtreated" \
                     --homeD "../../"

--pipeline name of the pipeline you want to use. So far, "crispr", "talen" and "crispr2" are available for CAST-Seq, T-CAST and D-CAST respectively.
--pname name of current project sample
--sampleDname name of sample directory
--tsamp XXX name of the test (treated) sample. Both XXX_R1_001.fastq.gz and XXX_R2_001.fastq.gz must exist
--usamp XXX name of the control (untreated) sample. Both XXX_R1_001.fastq.gz and XXX_R2_001.fastq.gz must exist
--homeD name of home directory
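
When several samples are processed, it can be convenient to assemble this command programmatically. The sketch below is one possible way to do so in Python; build_cast_seq_command is a hypothetical helper, not part of the repository, and it only emits flags that are documented in this README (the --cpu flag used in the example is described under Parameters).

```python
import shlex


def build_cast_seq_command(pipeline, pname, sample_dname, tsamp, usamp,
                           home_d, script="./CAST-Seq.R", **extra):
    """Assemble the Rscript argument list for one run.

    extra maps further documented flag names (without the leading
    dashes) to values, e.g. cpu=4 becomes "--cpu 4".
    """
    args = ["Rscript", script,
            "--pipeline", pipeline,
            "--pname", pname,
            "--sampleDname", sample_dname,
            "--tsamp", tsamp,
            "--usamp", usamp,
            "--homeD", home_d]
    for flag, value in extra.items():
        args += [f"--{flag}", str(value)]
    return args


# Same values as the example command above, plus --cpu.
cmd = build_cast_seq_command("crispr", "G3_TOY_test", "G3_TOY",
                             "G3_treated", "G3_UNtreated", "../../", cpu=4)
print(shlex.join(cmd))  # shell-quoted form, for logging or copy-paste
```

The returned list can be passed directly to subprocess.run(cmd, check=True), which avoids shell quoting issues.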

Parameters

Additional parameters can be changed in the command above. Here is a description of these parameters:

--tfastqD name of directory containing the fastq (tsamp) files
--ufastqD name of directory containing the fastq (usamp) files
--grna name of gRNA fasta (default "gRNA.fa")
--onTarget name of ON-target bed file (default "ots.bed")
--otsDistance distance (bp) from the ON-target. Reads within +/- this distance will be removed (default 50)
--surrounding_size distance (bp) from the ON-target. Used for the scoring system (default 20000)
--flank1 name of first flanking sequence (default "flank1.fa")
--flank2 name of second flanking sequence (default "flank2.fa")
--flankingSize distance to consider for HMT (default 2500)
--random number of random sequences to generate (default 10000)
--width distance to extend the putative sites (default 250)
--distCutoff distance to merge hits together (default 1500)
--pvCutoff p-value threshold (default 0.05)
--scoreCutoff gRNA alignment score threshold (default NULL)
--hitsCutoff minimum number of hits per site (default 1)
--distCov distance from the maximum covered bin from where the gRNA will be aligned
--saveReads whether the read fastq sequences should be saved (default "no")
--species name of sample species (default "hg"). So far only hg (human) and mm (mouse) can be used
--ovl number of samples to be considered in the overlap analysis (default 1)
--signif number of significant samples to be considered in the overlap analysis (default 1)

--cpu number of CPUs (default 2); at least 4 is advised
--pythonPath python path (default "/usr/bin/python")
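
To make the role of --distCutoff more concrete, here is a minimal Python sketch of distance-based merging of hits. It is an illustration only and not the pipeline's own code: the function name merge_hits, the (chrom, start, end) tuple representation, and the choice to measure the gap from the end of one hit to the start of the next (with the cutoff inclusive) are all assumptions.

```python
def merge_hits(hits, dist_cutoff=1500):
    """Merge same-chromosome intervals whose gap is at most dist_cutoff bp.

    hits: iterable of (chrom, start, end) tuples.
    Illustrative only; the pipeline's own merging logic may differ.
    """
    merged = []
    for chrom, start, end in sorted(hits):
        if (merged and merged[-1][0] == chrom
                and start - merged[-1][2] <= dist_cutoff):
            # Close enough to the previous interval: extend it.
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged
```

With the default cutoff, two hits on chr1 at 100-200 and 1500-1600 would be merged into a single site spanning 100-1600, while a hit at 5000-5100 would stay separate.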

T-CAST and D-CAST specific parameters

These parameters are only used when --pipeline "talen" is set.
--grnaR name of gRNA (RIGHT) fasta file
--grnaL name of gRNA (LEFT) fasta file

OVERLAP specific parameters (deprecated)

These parameters are only used when --pipeline "crispr_overlap" or "talen_overlap" is set.
--ovlDname name of overlap directory within sample directory
--ovlName name of overlap sample within overlap directory
--replicates name of sample to be used in the overlap analysis
--repNames labels of the replicates to be used in the overlap analysis
--repDname name of a representative replicate (used to find the appropriate replicate files)
--ovl number of significant samples to be considered in the overlap analysis

Authors

  • Geoffroy Andrieux
  • Giandomenico Turchiano

License

This software is released under the AGPL3 license.

Acknowledgments

We thank all members of our laboratories for constructive discussions and suggestions.

References

  • Langmead, B, Salzberg, SL (2012). Fast gapped-read alignment with Bowtie 2. Nat. Methods, 9, 4:357-9.
  • Quinlan, AR, Hall, IM (2010). BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26, 6:841-2.
  • Lawrence, M, Huber, W, Pagès, H, Aboyoun, P, Carlson, M, Gentleman, R, Morgan, MT, Carey, VJ (2013). Software for computing and annotating genomic ranges. PLoS Comput. Biol., 9, 8:e1003118.
  • Yu, G, Wang, LG, He, QY (2015). ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics, 31, 14:2382-3.
  • Durinck, S, Spellman, PT, Birney, E, Huber, W (2009). Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat Protoc, 4, 8:1184-91.
