Skip to content

Contains code and instructions for simulating CRISPR regulatory screens and assessing performance of different methods

License

Notifications You must be signed in to change notification settings

patfiaux/CRSsim

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CRSsim ✨

This repository contains code and instructions for simulating data generated by CRISPR regulatory screens. It contains the following sections:

Simulations

Advanced flags

Input formats

1. Simulating data generated by CRISPR regulatory screens

This software can be used to simulate data that would be generated by a standard CRISPR regulatory screen.

The simulation parameters mimic the experimental procedure of a CRISPR regulatory screen by taking the following variables into account:

Experimental step Simulated step
NA Size and number of true enhancers
NA Enhancer strength
NA Guide efficiency
Transduction of cells with guide library Generate guide distribution
Load sorter with cells* Specify number of cells to sort
Sort cells** Specify sorting probabilities for sorting using Dirichlet-Multinomial distribution
Specify sequencing depth Specify sequencing depth
Sequence pools Specify whether PCR duplicates are accounted for

* If a selection screen is specified, the user must provide the simulation with the number of cells in the input pool prior to sorting.

** dropout rate

1.1 Installations and setup for simulations

The simulations are executed in R. Please make sure you have R version 3.5.1 or higher installed on your computer.

Clone the CRSsim repository to your computer with the following command:

git clone https://github.com/patfiaux/CRSsim.git

or manually download the repository.

To simulate data, you will need the packages listed below. If you don't have them, install them using the following commands in R. Installations should take about 5 minutes on a standard laptop.

1.1.1 R packages

  • MCMCpack
install.packages('MCMCpack')
  • transport
install.packages('transport')
  • splines
install.packages('splines')

1.1.2 Bioconductor packages

if (!requireNamespace("BiocManager", quietly = TRUE))

    install.packages("BiocManager")
  • IRanges
BiocManager::install("IRanges")
  • GenomicRanges
BiocManager::install("GenomicRanges")

1.2 Simulate a selection screen with sample data

Here is a walkthrough of simulating data for a selection screen with sample data provided in ./CRSsim/Example_data.

Running a simulation will generate three .csv files:

  1. A guide information file, containing information about all the simulated guides (chromosome, start, end, label). The output file will be named {output_name}_info.csv. A guide will have one of the following labels:
  • 'exon': guides overlapping exons of the gene of interest
  • 'neg': non-targeting negative control guides
  • 'pos': guides overlapping enhancers. These are true positives
  • 'chr': guides targeting the region of interest but not overlapping with either an exon or an enhancer Note: Guides not overlapping any regulatory element are all generated using the same probability distribution. A small, random subset (default 300) is labelled as 'neg', the rest as 'chr'. See Advanced Simulations for how to increase or decrease the number of negative controls
  1. A counts file, containing the counts for each guide in each pool. The output file will be named {output_name}_counts.csv. Each row corresponds to a guide, each column to a pool, and each number to the number of times a guide was observed in a pool.

  2. An enhancer file, containing the locations of the simulated regulatory regions (e.g. enhancers). The output file will be named {output_name}_enhancers.csv. Each row corresponds to an enhancer.

Follow the steps below to simulate data for a selection screen.

1.2.1. Source the script

We recommend that you navigate into the ./CRSsim/Example_simulations directory (included in the repository) and generate all output files there. After navigating into that directory, run the following commands in R to access all functions called by the simulation script:

source('/path/to/script/CRSsim.r')

1.2.2. Setting up simulation flags

Running a simulation will generate a guide information file, a counts file and an enhancer file (described above).

After sourcing the CRSsim script you can begin setting the option flags. There are several different flags which have no default arguments and must be supplied by the user. Below is an outline on how to set these required flags.

  1. Flags are set up as an R list object. Run the following command in R to create this object:
sim.flags <- list()
  1. Set the output name for the simulation. All output files generated by this run of the simulation will contain this prefix in the filehandle. Here, we will name all outputs Example_simulation:
sim.flags$simName <- 'Example_simulation'
  1. Provide information about the intended guide targets. Either supply them directly by giving the .csv location as a string, as is demonstrated here, or generate them within the script (see details under Advanced Simulations). For details about file format see the file format section.
sim.flags$guideFile <- '../Example_data/Example_selectionScreen_info.csv'
  1. If guide targets are provided as in step 3, then the marker gene of interest should also be provided. This assumes that some of the guides provided are targeting the gene of interest and can serve as positive controls. The input should be a string to the location of a .csv file. For file format see section file format section.
sim.flags$exon <- '../Example_data/Example_gene.csv'
  1. Provide the original guide distribution for each replicate. The guide counts can be supplied by an existing data set, as demonstrated below. Here, we supply the guide distribution data from ../Example_data/Example_selectionScreen_counts.csv. Specifically, we use the guide counts from the 'before' pools. For details about file format see the file format section. Another option is to generate the distributions using a zero-inflated negative binomial (ZINB) distribution. See Advanced Simulations for details about this option.
example.counts <- read.csv('../Example_data/Example_selectionScreen_counts.csv', stringsAsFactors = F)

# note here: 'before_1' and 'before_2' are just placeholder names. They could be called something else.
sim.flags$inputGuideDistr <- cbind(before_1 = example.counts$before_repl1, 
                                   before_2 = example.counts$before_repl2)
  1. Specify the screen type. The user can specify one of two screen types: either a selectionScreen as shown here or a FACSscreen, where cells are sorted into different pools. See FACS screen simulation for an example of simulating a FACS screen.
sim.flags$screenType <- 'selectionScreen'
  1. Provide names for the different pools in each replicate. In this case, each replicate will have a before and an after selection pool. Their names will be: before_repl1, after_repl1, before_repl2, etc.
sim.flags$poolNames <- c('before', 'after')
  1. Specify the CRISPR system used to generate the data. For CRISPRi, the default range of the perturbation effect is assumed to be 1kb. However, this can be manually specified. See the Advanced Simulations section for more details on how to simulate data generated by other CRISPR systems.
sim.flags$crisprSystem <- 'CRISPRi'
  1. Specify the number and the size of enhancers to be simulated. They will be placed at random genomic positions throughout the screen.
sim.flags$nrEnhancers <- 25
sim.flags$enhancerSize <- 50  # base pairs
  1. Specify the sequencing depth for each of the pools. Here, the parameters have been set such that the average guide count is 15. A sequencing depth for each pool in each replicate must be defined:
repl1.seqDepth <- c(nrow(example.counts) * 15, nrow(example.counts) * 15) # set depths for 'before' and 'after' pools of replicate 1
repl2.seqDepth <- c(nrow(example.counts) * 15, nrow(example.counts) * 15) # set depths for 'before' and 'after' pools of replicate 2

To set the seqDepth flag, you must provide a list object where each named list entry represents a replicate and its corresponding sequencing depth(s).

sim.flags$seqDepth <- list(repl1 = repl1.seqDepth, repl2 = repl2.seqDepth)
  1. Additionally, the simulations can simulate data sets where PCR duplicates are either accounted for or not. If you would like to generate a data set where duplicates are accounted for, set the pcrDupl flag to FALSE or F.
sim.flags$pcrDupl <- TRUE
  1. You must also specify:
  • selection strength: how strong the effect of disrupting the gene of interest is (very-high, high or low).
  • guide efficiency: what proportion of the guides successfully perturb their targets (high, medium, or low)
  • enhancer strength: how should the enhancer signal is (high, medium, or low)

Each of these parameters can be manually provided as strings. See Advanced Simulations for details.

sim.flags$selectionStrength <- 'high'
sim.flags$guideEfficiency <- 'medium'
sim.flags$enhancerStrength <- 'medium'
  1. Specify the area of effect (AoE) of guides (for details see section 1.2.4). We recommend using normal as this is more likely to reflect the behavior of CRISPR.
sim.flags$areaOfEffect_type <- 'normal'
  1. Specify the relationship between counts and dispersion (for details see section 1.2.5). We recommend either exponential (this example) or spline.
sim.flags$dispersionType <- 'exponential'
  1. Run the simulation! ✨
simulate_data(sim.flags)

1.2.3 Simulate a FACS screen with sample data

Here is a demonstration for simulating data generated by a FACS screen. Highlighted below are the main option differences from a selection screen simulation.

  1. Specify the type of screen as FACSscreen in your argument flags.
sim.flags$screenType <- 'FACSscreen'
  1. As in the selection screen example, each pool in each replicate must be named. The difference in this example is that there must be more than one pool. Here is an example where cells are sorted from an input pool into either a high, medium, or low expression pool.
sim.flags$poolNames <- c('input', 'high', 'medium', 'low')
  1. The sequencing depth must also be specified for each pool in each replicate.
sim.flags$seqDepth <- list(repl1 = rep(nrow(example.counts) * 15, 4), 
                           repl2 = rep(nrow(example.counts) * 15, 4) )

Keep in mind!

An average guide count of 15 vs. 100 vs. 500 has a major effect on power to detect true signal when everything else is held constant. Make sure to adjust seqDepth when changing the number of guides used for simulating the data.

1.2.4 Specify the area of effect (AoE)

When CRISPR is directed to disrup a target site, then guide efficiency specifies how likely this disruption is to take place. However, the size of this disruption won't always be the same. For example, the effects of a CRISPRi perturbation can be up to 1kb but it's unlikely that every CRISPRi perturbation actually results in the silencing of an entire 1kb region. It's more likely that regions close to the target site have a higher probability of being silenced then sites far away. This can be modeled with a normal distribution. For comparison we also left the option of a uniform AoE.

sim.flags$areaOfEffect_type <- 'normal' # alternatively 'uniform'

Similarly, in case of a dual-guide experiment (dualCRISPR), it's more likely that each guide creates a short InDel instead of deleting the entire region. We model this with a normal AoE for the regions immediately around the target site of the two guides as well as a deletion probabilitiy (deletionProb) across the entire targeted region. By default, deletionProb is set to 0.2

sim.flags$areaOfEffect_type <- 'normal' # alternatively 'uniform'
sim.flags$deletionProb <- 0.2

See 3. Advanced flags for details on how to change the range of the AoE.

1.2.5 Specify the dispersion type

For most biological data there's usually a relationship between guide counts and dispersion (variance). CRSsim models this using either an exponential relationship or with splines (functions defined as piecewise polynomials which allows them to have great flexibility). CRSsim has suggested distribution parameters for each (exponential parameters from Fulco et al. replicate 2, spline parameters from Simeonov et al. IL2RA replicate 1). In case of splines, the user can provide a model object for making the dispersion predictions. Alternatively, the relationship between counts and dispersion can also be removed with an independent flag at dispersionType.

sim.flags$dispersionType <- 'spline' # alternatively 'exponential' or 'uniform'
sim.flags$splineModelLoc <- 'path-to-CRSsim/Spline_models/IL2RA_r1_df3_splineModel'

2. Advanced Flags

2.1 Advanced Simulations

2.1.1 Generate guides and their targets. Both single-guide as well as dual-guide screens can be simulated. For both of them, the number of guides (nrGuides) must be specified, as well as the screen system (crisprSystem) and the step size between guides (stepSize). Additionally, if a dual CRISPR screen is selected, the deletion size must be specified (deletionSize).

Possible crisprSystem options include: CRISPRi, CRISPRa, CRISPRcas9 and dualCRISPR

If this option is chosen, all guides are arbitrarily chosen to be located on chromosome 1 and ~5% of the guides will be selected to serve as positive controls.

sim.flags$nrGuides <- 10000
sim.flags$crisprSystem <- 'dualCRISPR'
sim.flags$stepSize <- 20
sim.flags$deletionSize <- 1000 # only used if $crisprSystem is 'dualCRISPR'

2.1.2 To change the default number of negative controls set the nrNeg flag.

sim.flags$nrNeg <- 300

2.1.3 Generate guide distributions. The input count distribution for the different replicates can be taken from an existing data set. It is also possible to generalize existing distributions using the zero-inflated negative binomial distribution (ZINB). The ZINB has both a mean (rate) and a dispersion parameter, as well as a parameter describing the fraction of the distribution originating from the zero mass (eta). Below are the steps to obtain and use the parameters from a ZINB:

# obtain ZINB parameters which describe the distribution
before.repl1.par <- obtain_ZINB_pars(example.counts$before_repl1)
example.rate <- before.repl1.par$rate              # rate = 76.5
example.dispersion <- before.repl1.par$dispersion  # dispersion = 2.6
example.eta <- before.repl1.par$eta                # eta = 1e-4

# to generate a ZINB distribution with 10000 guides
before.repl1.simulated <- create_ZINB_shape(10000, example.eta, example.rate, example.dispersion)
before.repl2.simulated <- create_ZINB_shape(10000, example.eta, example.rate, example.dispersion)

# combine the two simulated replicates and set them as input distributions
sim.flags$inputGuideDistr <- cbind(before_1 = before.repl1.simulated, before_2 = before.repl2.simulated)

2.1.4 Set effect range. Currently, four different CRISPR systems can be simulated: CRISPRi, CRISPRa, CRISPRcas9, and dualCRISPR. By default, CRISPRi and CRISPRa are assumed to have an effect range of 1kb and Cas9 of 20bp. However, it is also possible to manually set this range with the crisprEffectRange flag. For dualCRISPR, the effect range is equivalent to the deletion size. The deletion size introduced by two guides must be represented by 'start' set as the target site of guide 1 and 'end' as the target site of guide 2. Use the deletionSize flag when using dualCRISPR.

# example for how to change the effect range of a CRISPR system used
sim.flags$crisprSystem <- 'CRISPRi'
sim.flags$crisprEffectRange <- 500

2.1.5 Enhancer strength and guide efficiency. Both the guide efficiency and the enhancer strength are simulated from a beta distribution. The two parameters can be specified by setting guideEfficiency and enhancerStrength to either high, medium or low. It is also possible to directly specify the two shape parameters of the beta distribution. As a general rule, if the shape parameters provided are large, the observed distribution variance is reduced. The larger shape 1 parameter is compared to shape 2 parameter, the more the distribution will be skewed towards 1. To visualize this phenomenon, you can also plot the histogram by randomly sampling from a beta-distribution and then subsequently changing the shape parameters. The default parameters are:

- high: enhancerShape1 = 7, enhancerShape2 = 2

- medium: enhancerShape1 = 5, enhancerShape2 = 5

- low: enhancerShape1 = 2, enhancerShape2 = 7

This also applies for guideShape1 and guideShape2.

hist(rbeta(10000, shape1 = 8, shape2 = 1))  # randomly generate 10000 instances of the beta distribution

sim.flags$guideEfficiency <- 'medium'
sim.flags$enhancerStrenth <- 'medium'

# the above is equivalent to what's below
sim.flags$enhancerShape1 <- 5
sim.flags$enhancerShape2 <- 5
sim.flags$guideShape1 <- 5
sim.flags$guideShape2 <- 5

2.1.6 Selection strength. To understand the details of the selection strength it is helpful to have some understanding of the Dirichlet distribution. Both for the selection screen as well as for the FACS screen the probability of each guide being either selected or sorted into a given pool is given by a Dirichlet random variate. As an example:

Cells are sorted into three pools: high, medium and low gene expression. The expected probability that a negative control will be sorted into each of the given pools are 0.48, 0.48, and 0.04, respectively. For each guide, this probability shifts slightly. The Dirichlet random variable provides this variation. All cells containing a negative control guide are subsequently assigned to each of the three pools with following probabilities: rdirichlet(1, c(48, 48, 4)).

Continuing the example above: assume the expected probabilities that a positive control is sorted into each of the three pools - high, medium and low gene expression - are 0.45, 0.45, and 0.1, respectively. All cells containing a positive control guide is subsequently assigned to these pools by observing a Dirichlet random variable as follows: rdirichlet(1, c(45, 45, 10)).

To manually set the Dirichlet probabilities, use the posSortingFrequency and the negSortingFrequency flags. To continue the example from above:

sim.flags$posSortingFrequency <- c(45, 45, 10)
sim.flags$negSortingFrequency <- c(48, 48, 4)

Note:

  1. The sum of the frequencies does not have to equal 1.

  2. The larger the numbers chosen, the less variable the sorting becomes.

The defaults for the high flags were chosen based on their ability to accurately represent the sorting parameters for either a selection screen or a FACS screen. The default low flags were chosen as an arbitrary fraction of the high selection.

The default flags used for high:

# for a selection screen:
sim.flags$posSortingFrequency <- c(1)
sim.flags$negSortingFrequency <- c(5)

# in a FACS screen sorted into 3 pools: 
# '97' is repeated for all pools except the last one
sim.flags$posSortingFrequency <- c(97, 97, 13) * 0.5
sim.flags$negSortingFrequency <- c(97, 97, 3) * 0.5

The default flags used for low:

# for a selection screen:
sim.flags$posSortingFrequency <- c(4)
sim.flags$negSortingFrequency <- c(5)

# in a FACS screen sorted into 3 pools: 
# '97' is repeated for all pools except the last one
sim.flags$posSortingFrequency <- c(97, 97, 5) * 0.5
sim.flags$negSortingFrequency <- c(97, 97, 3) * 0.5

2.1.7 Area of Effect Under default settings, the normal AoE for Cas9 assumes that 50% of all perturbation happen within 10bp and by 95% within 21bp using a standard deviation of 8.5. In the case of CRISPRi or CRISPRa 50% of the activity is within 200bp from the target site and 95% within 415bp from the target site using a standard deviation of 170. To change these settings both the crisprEffectRange and the normal_areaOfEffect_sd flags have to be adjusted. It's recommended to use the crisprEffectRange as the 95% cutoff point.

# in the case of CRISPRi or CRISPRa
sim.flags$normal_areaOfEffect_sd <- 170
sim.flags$crisprEffectRange <- 415

2.2 Advanced Analysis and performance evaluation

2.2.1 alphaRRA. To analyze data with MAGeCK as used by Diao et al. 2017, specify use of alphaRRA and the bins within which to tile the analyzed region. Note, alphaRRA will be applied to per-guide scores from all methods given by the Method flag.

analysis.specs$postScoreAlphaRRA <- TRUE
analysis.specs$binSize <- 50

2.2.2 Sliding Window. To combine guide scores using a sliding window as in Fulco et al. 2016 or Simeonov et al. 2017, specify the number of guides to include per sliding window as well as the maximum window size (if the tiled deletion generates a 10KB gap it would not make sense to consider the entire region as one score).

analysis.specs$postScoreSlidingWindow <- TRUE
analysis.specs$guidePerSlidingWindow <- 15
analysis.specs$maxWindowSize <- 8000

3. Input table formats

3.1 Simulations

3.1.1 Simulation guides

If guides are directly provided by the user, the .csv file should have columns for chromosome, start position and end position labeled chrom, start, and end, respectively. All chromosome are given as a string. For chrom the spelling of the individual chromosomes not matter as long as the same chromosome has the same spelling ('chr8' and 'Chr8' will be recognized as two different chromosomes). start and end values are given as numbers. For Cas9, CRISPRi, and CRISPRa screens, the distance between the start and end sites should be set to something small, such as start = target site - 20 and end = target site. Here the top from the file provided at Example_selectionScreen_info.csv:

chrom start end
chr8 128703371 128703391
chr8 128703511 128703531
chr8 128703521 128703541
chr8 128703539 128703559

3.1.2 Simulation exons

If guide targets are provided for the simulations, then the marker gene of interest should also be provided. This assumes that some of the guides provided are targeting the gene of interest and can serve as positive controls. The input should be a string to the location of a .csv file containing the chromosome, start, and end sites of all exons of the gene of interest. The column names of this data frame should be chrom, start, and end. All chromosome are given as a string. For chrom the spelling of the individual chromosomes not matter as long as the same chromosome has the same spelling ('chr8' and 'Chr8' will be recognized as two different chromosomes). start and end values are given as numbers. Here, we supply the exon information from ../Example_data/Example_gene.csv.

chrom start end
chr8 128748314 128748869
chr8 128750493 128751265
chr8 128752641 128753680

3.1.3 Guide counts for simulated guide distributions

The input for inputGuideDistr is a data frame object and each column will be used to generate the guide count distribution for a replicate. Here, we supply the guide distribution data from ../Example_data/Example_selectionScreen_counts.csv. Specifically, we use the guide counts from the 'before' pools.

example.counts <- read.csv('../Example_data/Example_selectionScreen_counts.csv', stringsAsFactors = F)

# note here: 'before_1' and 'before_2' are just placeholder names. They could be called something else.
sim.flags$inputGuideDistr <- cbind(before_1 = example.counts$before_repl1, 
                                   before_2 = example.counts$before_repl2)

In the case above, inputGuideDistr looks like the following:

before_1 before_2
303 342
110 149
145 178
305 248

About

Contains code and instructions for simulating CRISPR regulatory screens and assessing performance of different methods

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •  

Languages