beditor: A Computational Workflow for Designing Libraries of Guide RNAs for CRISPR-Mediated Base Editing
Rohan Dandage, Philippe C. Després, Nozomu Yachie and Christian R. Landry. GENETICS 2019
Basic requirements: Anaconda package manager. See requirements.md for the set of bash commands that install it.
- Once all the requirements are satisfied, create a python 3.6 virtual environment.
wget https://raw.githubusercontent.com/rraadd88/beditor/master/environment.yml
conda env create -f environment.yml
- Activate the virtual environment.
source activate beditor
- Install the beditor python package from PyPI.
pip install beditor
Open the GUI window from terminal.
beditor
step1: input the configuration settings.
Note: genomes listed on the gui correspond to ensembl release=95.
step2: save the configuration settings and run beditor. Outputs are stored in the same directory as the saved configuration settings file (yml file), in a folder named after the basename of that file.
Note: if /a/b/gene_ed.yml is the path of the configuration file, the output directory will be /a/b/gene_ed/. See Output format for the structure of the output directory.
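The naming rule for the output directory can be sketched in a few lines of Python. The helper name `output_dir` is hypothetical and is not part of beditor; it only illustrates how the path is derived from the configuration file path.

```python
from pathlib import PurePosixPath


def output_dir(cfg_path):
    """Return the output directory for a configuration file (hypothetical helper)."""
    # Dropping the .yml extension gives a folder next to the configuration file.
    return str(PurePosixPath(cfg_path).with_suffix(""))


print(output_dir("/a/b/gene_ed.yml"))  # /a/b/gene_ed
```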
Note: see the terminal messages in case of any issue.
- Run the analysis.
beditor --cfg configuration.yml
- Run a single step in the analysis.
beditor --step {step number} configuration.yml
Step numbers and corresponding analyses:
1: Get genomic loci flanking the target site
2: Get possible mutagenesis strategies
3: Design guides
4: Check off-target effects
- Help
beditor --help
Note: The path to this tsv (tab-separated values) file is provided in the configuration file as the value of the variable dinp, e.g. dinp: input.tsv.
The columns needed in the input depend on the mutation_format chosen in the configuration.yml file.
Example for S. cerevisiae (ensembl genome release=95), nucleotide mutation format:
genome coordinate | nucleotide mutation |
---|---|
I:147494-147494- | A |
I:143607-143607+ | A |
II:369937-369937- | C |
II:372003-372003- | C |
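As a rough illustration, the genome coordinate strings in the table can be split into their parts with a regular expression. The layout `{chromosome}:{start}-{end}{strand}` is inferred from the example rows and is an assumption; `parse_coordinate` is a hypothetical helper, not a beditor function.

```python
import re

# Assumed layout, inferred from the example rows: {chromosome}:{start}-{end}{strand}
COORD = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)(?P<strand>[+-])$")


def parse_coordinate(coord):
    """Split a genome coordinate string into (chromosome, start, end, strand)."""
    m = COORD.match(coord.strip())
    if m is None:
        raise ValueError(f"unrecognised genome coordinate: {coord!r}")
    return m["chrom"], int(m["start"]), int(m["end"]), m["strand"]


print(parse_coordinate("I:147494-147494-"))  # ('I', 147494, 147494, '-')
```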
Example for S. cerevisiae (ensembl genome release=95), amino acid mutation format:
transcript: id | aminoacid: position | amino acid mutation |
---|---|---|
YAL001C_mRNA | 18 | A |
YAL002W_mRNA | 24 | A |
YAL019W_mRNA | 24 | C |
YAL067W-A_mRNA | 13 | F |
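A minimal sketch of producing such an input file as tab-separated text with the Python standard library. The column names and rows are copied from the example table above; the helper `to_tsv` is hypothetical.

```python
import csv
import io

columns = ["transcript: id", "aminoacid: position", "amino acid mutation"]
rows = [
    ("YAL001C_mRNA", 18, "A"),
    ("YAL002W_mRNA", 24, "A"),
    ("YAL019W_mRNA", 24, "C"),
    ("YAL067W-A_mRNA", 13, "F"),
]


def to_tsv(columns, rows):
    """Serialise a header and rows as tab-separated text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buf.getvalue()


tsv_text = to_tsv(columns, rows)
# To save it under the name given to dinp in the configuration file:
# with open("input.tsv", "w") as fh:
#     fh.write(tsv_text)
```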
Note: genomes listed in the gui correspond to ensembl release=95.
[for command line usage] Configuration file: a YAML-formatted file that contains all the analysis-specific parameters, i.e. the options and the paths to files needed by the analysis.
Template: https://github.com/rraadd88/test_beditor/blob/master/common/configuration.yml
# Input: Mutation information
## Path to this tsv (tab-separated values) file
dinp: input.tsv
reverse_mutations: False
# Step 1: Extracting sequences flanking mutation site (`01_sequences/`).
## host information
host: scientific name
genomerelease: 93
# check assembly from http://useast.ensembl.org/index.html
genomeassembly: fromensembl
# Step 2: Estimating the editable mutations based on base editors chosen. (`02_mutagenesis/`).
# whether aminoacid or nucleotide mutations
mutation_format: aminoacid or nucleotide
## N: nonsynonymous (default), S: synonymous, anything else: both
mutation_type: N
## keep nonsense mutations
keep_mutation_nonsense: False
## Mutations information can be provided in 3 options:
## `mutations`: Required Mutations mentioned in input file.
## `substitutions`: Required Substitutions provided as a file (template: https://github.com/rraadd88/test_beditor/blob/master/common/dsubmap.tsv).
## `mimetic`: Carry out Mimetic substitutions (based on genome wide substitution maps). Only for human and yeast.
## input: options
## mutations, substitutions, mimetic, [no input: keeps all possible mutations (slow)]
mutations: mutations
## Parameters specific to above options
## 2. Substitutions provided as a file
dsubmap_preferred_path:
## 3. Mimetic substitutions
## mimetism level (high: only the best one, [medium: best 5], low: best 10)
mimetism_level: medium
## can not mutate between these
## if ['S','T','K'] is provided, all mutations between these amino acids are disallowed
non_intermutables: []
# Step 3: Designed guides (`03_guides/`).
## allowed nucleotide substitutions per codon
max_subs_per_codon: 1
## base editors to use (restriction max_subs_per_codon would override the choice of base editors)
BEs: ['Target-AID','ABE']
# Cas9 related options
## PAM sequence
pams: ['NGG','NG']
#------------------------------------------------
# System related options
## Number of cpus/threads
cores: 6
## Number of lines to process per cpu
chunksize: 200
## Dependencies
## by default the dependencies are installed from the conda environment.
## "optionally" paths to the dependencies could be included below.
bedtools: bedtools
bwa: bwa
samtools: samtools
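Before running the workflow, a few of the values in such a configuration can be sanity-checked. The sketch below works on a plain dict (as one would get after loading the YAML file) and only uses the allowed values named in the template comments above. `check_cfg` is a hypothetical helper; it is not beditor's own `validcfg`, whose exact checks are not described here.

```python
# Allowed values as listed in the comments of the configuration template.
ALLOWED = {
    "mutation_format": {"aminoacid", "nucleotide"},
    "mimetism_level": {"high", "medium", "low"},
}


def check_cfg(cfg):
    """Return a list of problems found in a configuration dict (hypothetical helper)."""
    problems = []
    for key, allowed in ALLOWED.items():
        if key in cfg and cfg[key] not in allowed:
            problems.append(f"{key}: {cfg[key]!r} is not one of {sorted(allowed)}")
    if "dinp" not in cfg:
        problems.append("dinp: path to the input tsv file is missing")
    return problems


cfg = {
    "dinp": "input.tsv",
    "mutation_format": "aminoacid",
    "mimetism_level": "medium",
    "BEs": ["Target-AID", "ABE"],
    "pams": ["NGG", "NG"],
}
print(check_cfg(cfg))  # []
```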
The mutation_format chosen in the configuration.yml file and the corresponding columns needed in input:
nucleotide : ['genome coordinate','nucleotide wild-type','nucleotide mutation',]
aminoacid : ['transcript: id','aminoacid: wild-type','aminoacid: position','amino acid mutation','codon: wild-type','guide: id','guide+PAM sequence','beditor score','alternate alignments count','CFD score']
Format of guide: id:
{genomic locus}|{position of mutation}|({strategy})
where,
strategy= {base editor};{strand};@{distance of mutation from PAM};{PAM};{codon wild-type}:{codon mutation};{amino acid wild-type}:{amino acid mutation};
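The format above can be unpacked with plain string operations. This is a minimal sketch: `parse_guide_id` is a hypothetical helper, and the identifier used in the example is made up to match the template, not taken from real beditor output.

```python
def parse_guide_id(guide_id):
    """Split a guide id of the form {locus}|{position}|({strategy}) into its fields."""
    # Split from the right so that only the last two '|' act as separators.
    locus, position, strategy = guide_id.rsplit("|", 2)
    fields = strategy.strip("()").split(";")
    # The template ends with ';', which leaves one empty trailing field.
    if fields and fields[-1] == "":
        fields = fields[:-1]
    base_editor, strand, distance, pam, codons, aminoacids = fields
    codon_wt, codon_mut = codons.split(":")
    aa_wt, aa_mut = aminoacids.split(":")
    return {
        "genomic locus": locus,
        "position of mutation": position,
        "base editor": base_editor,
        "strand": strand,
        "distance of mutation from PAM": distance.lstrip("@"),
        "PAM": pam,
        "codon": (codon_wt, codon_mut),
        "amino acid": (aa_wt, aa_mut),
    }


# Made-up identifier that follows the template above.
example = "I:147494-147494-|18|(Target-AID;+;@-17;NGG;GCT:GTT;A:V;)"
parsed = parse_guide_id(example)
print(parsed["base editor"], parsed["PAM"], parsed["codon"])
```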
A directory named after the basename of the configuration file (e.g. a directory called 'human' if the configuration file is 'human.yml') is created in the same folder as the configuration file. It is referred to as the 'project directory'.
Inside a project directory, the following folders are named after the corresponding steps of the analysis.
1. 01_sequences/
Stores the output of step #1. Extracting sequences flanking mutation site.
2. 02_mutagenesis/
Stores the output of step #2. Estimating the editable mutations based on base editors chosen.
3. 03_guides/
Stores the output of step #3. Designed guides.
4. 04_offtargets/
Stores the output of step #4. Offtarget effects.
5. 05_outputs/
Stores combined output and visualizations and sets of positive and negative control guides.
Positive control guides are designed to introduce a stop mutation in the genes being targeted.
Negative control guides lack an editable nucleotide in the window of maximum activity and are therefore not expected to introduce any mutation.
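The idea behind negative control guides can be illustrated with a small check for an editable nucleotide inside an activity window. Everything specific here is an assumption made for illustration: the function name, the window given as 1-based inclusive positions counted from the 5' end of the guide, and the example window and sequence. beditor's own selection of control guides may differ.

```python
def has_editable_nt(guide, nt, window):
    """Return True if nucleotide `nt` occurs inside the activity window of `guide`.

    `window` is (start, end), 1-based and inclusive, counted from the 5' end
    of the guide (an assumption made for this sketch).
    """
    start, end = window
    return nt in guide[start - 1:end]


guide = "GATTGAAGTCCATGCATGCA"  # made-up 20-nt guide sequence
window = (4, 8)                 # hypothetical window of maximum activity

# No C in the window: a C-targeting editor would not be expected to edit here.
print(has_editable_nt(guide, "C", window))  # False
# An A is present in the window.
print(has_editable_nt(guide, "A", window))  # True
```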
Also,
- 00_input/
Stores input files.
- chunks/
If parallel processing is used, this folder would store individual parts (chunks) of the analysis.
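The folder layout described above can be written down as a small lookup, for example to locate the output of a given step. The folder names are taken from the list above; `expected_layout` is a hypothetical helper and not part of beditor.

```python
from pathlib import PurePosixPath

# Folder per analysis step, as listed above.
STEP_DIRS = {
    1: "01_sequences",
    2: "02_mutagenesis",
    3: "03_guides",
    4: "04_offtargets",
}
# Folders that do not correspond to a single step.
EXTRA_DIRS = ["00_input", "05_outputs", "chunks"]


def expected_layout(cfg_path):
    """List the folders expected inside the project directory of a configuration file."""
    project = PurePosixPath(cfg_path).with_suffix("")
    names = sorted(list(STEP_DIRS.values()) + EXTRA_DIRS)
    return [str(project / name) for name in names]


for path in expected_layout("human.yml"):
    print(path)
```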
Custom base editors and PAM sequences can be incorporated in the workflow by selecting the 'Custom' option in the 1st tab of the GUI. The tab provides options to input the information about the base editor and the PAM sequence.
The sets of installed BEs and PAMs are stored in a tab-separated table located at beditor/data/dbepams.tsv (use which beditor to locate the directory of beditor).
To install a new base editor or PAM, append the relevant information to the table.
# make the input files with mock data
git clone https://github.com/rraadd88/test_beditor.git
source activate beditor;cd test_beditor;python test_datasets.py
https://github.com/openvax/pyensembl#non-ensembl-data
beditor.pipeline.collect_chunks(cfg, chunkcfgps)
Collects analysed chunks
- cfg – main configuration dict.
- chunkcfgps – paths to all configuration files of chunks
beditor.pipeline.collectchuckfiles(cfg, fpinchunk, force=False)
Collects minor chunk files
- cfg – configuration dict
- fpinchunk – path inside chunk’s project directory
- force – if True overwrites the outputs
beditor.pipeline.main()
Provides command-line inputs to the pipeline.
To check the command-line inputs:
beditor --help
beditor.pipeline.make_outputs(cfg, plot=True)
Combines stepwise analysis files into a pretty table.
- cfg – main configuration dict
- plot – if True creates visualizations
beditor.pipeline.pipeline(cfgp, step=None, test=False, force=False)
Runs steps of the analysis workflow in tandem.
- cfgp – path to configuration file
- step – step number
- test – if True uses only one core, linear processing with verbose allowed
- force – if True overwrites outputs
beditor.pipeline.pipeline_chunks(cfgp=None, cfg=None)
Runs an individual chunk.
- cfgp – path to configuration file.
- cfg – configuration dict
Returns:
beditor.pipeline.validcfg(cfg)
Checks if configuration dict is valid i.e. contains all the required fields
cfg – configuration dict
beditor.pipeline.validinput(cfg, din)
Checks if input file is valid i.e. contains all the required columns.
- cfg – configuration dict
- din – dataframe containing input data
beditor.configure.get_deps(cfg)
Installs dependencies of beditor
cfg – configuration dict
beditor.configure.get_genomes(cfg)
Installs genomes
cfg – configuration dict
beditor.lib.get_seq.din2dseq(cfg)
Wrapper for converting input data (transcript ids and positions of mutation) to sequences flanking the codon.
cfg – configuration dict
beditor.lib.get_seq.get_seq_aminoacid(cfg, din)
Fetches sequences if mutation format is amino acid
- cfg – configuration dict
- din – input data
Returns dsequences:
dataframe with sequences
beditor.lib.get_seq.get_seq_nucleotide(cfg, din)
Fetches sequences if mutation format is nucleotide
- cfg – configuration dict
- din – input data
Returns dsequences:
dataframe with sequences
beditor.lib.get_seq.t2pmapper(t, coding_sequence_positions)
Maps transcript id with protein id.
- t – pyensembl transcript object
- coding_sequence_positions – reading frames
Returns coding_sequence_positions:
dataframe with mapped positions
beditor.lib.get_seq.tboundaries2positions(t)
Fetches positions from transcript boundaries.
t – pyensembl transcript object
Returns coding_sequence_positions:
reading frames
beditor.lib.get_mutations.dseq2dmutagenesis(cfg)
Generates mutagenesis strategies from identities of reference and mutated codons (from dseq).
cfg – configurations from yml file
beditor.lib.get_mutations.filterdmutagenesis(dmutagenesis, cfg)
Filters the mutagenesis strategies by multiple options provided in configuration file (.yml).
- dmutagenesis – mutagenesis strategies (pd.DataFrame)
- cfg – configurations from yml file
beditor.lib.get_mutations.get_codon_table(aa, tax_id=None)
Gets host specific codon table.
- aa – list of amino acids
- host – name of host
Returns:
codon table (pandas dataframe)
beditor.lib.get_mutations.get_codon_usage(cuspp)
Creates codon usage table.
cuspp – path to cusp generated file
Returns:
codon usage table (pandas dataframe)
beditor.lib.get_mutations.get_possible_mutagenesis(dcodontable, dcodonusage, BEs, pos_muts, host)
Assesses possible mutagenesis strategies, given the set of Base editors and positions of mutations.
- dcodontable – Codon table
- dcodonusage – Codon usage table
- BEs – Base editors (dict), see global_vars.py
- pos_muts – positions of mutations
- host – host organism
Returns:
possible mutagenesis strategies as a pandas dataframe
beditor.lib.get_mutations.get_submap(cfg)
Fetches mimetic substitution map that would be used to filter mutagenesis strategies.
cfg – configurations from yml file.
beditor.lib.make_guides.dinnucleotide2dsequencesproper(dsequences, dmutagenesis, dbug=False)
Makes the dsequences dataframe of the nucleotide mutation format compatible with the guide design modules.
- dsequences – dsequences dataframe
- dmutagenesis – dmutagenesis dataframe
beditor.lib.make_guides.dpam2dpam_strands(dpam, pams)
Duplicates the dpam dataframe to make it compatible with searching PAMs on the - strand.
- dpam – dataframe with pam information
- pams – pams to be used for actual designing of guides.
beditor.lib.make_guides.dseq2dguides(cfg)
Wrapper around make guides function.
cfg – configuration dict.
beditor.lib.make_guides.get_pam_searches(dpam, seq, pos_codon, test=False)
Searches for PAM occurrences.
- dpam – dataframe with PAM sequences
- seq – target sequence
- pos_codon – reading frame
- test – debug mode on
Returns dpam_searches:
dataframe with positions of pams
beditor.lib.make_guides.guide2dpositions(x, dbug=False)
Gets positions of guides relative to the target site and PAM sequence.
Note: Index and flank sequence based indexing are 0-based; distances and positions from the PAM are 1-based.
x – lambda section of dguides dataframe
beditor.lib.make_guides.make_guides(cfg, dseq, dmutagenesis, dpam, test=False, dbug=False)
Wrapper around submodules that design guides by (1) searching all PAM sequences on both strands and (2) filtering guides by all possible strategies (given in dmutagenesis), e.g. activity window. Finally generates a table.
- cfg – configuration dict
- dseq – dsequences dataframe
- dmutagenesis – dmutagenesis dataframe
- dpam – dpam dataframe
- test – debug mode on
- dbug – more verbose
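Searching a PAM on both strands can be sketched with the standard library alone. This is not beditor's implementation: `find_pams` and `reverse_complement` are hypothetical helpers, the input is assumed to be upper-case A/C/G/T, and only the degenerate base N is supported in the PAM. Positions are 0-based starts on the + strand.

```python
import re

# Regex fragments for the bases used in PAMs such as NGG and NG.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_pams(seq, pam):
    """Return (strand, start) for every PAM occurrence on both strands.

    `start` is the 0-based start of the PAM footprint on the + strand.
    """
    # A lookahead makes overlapping occurrences visible to finditer.
    pattern = re.compile("(?=(" + "".join(IUPAC[b] for b in pam) + "))")
    hits = [("+", m.start()) for m in pattern.finditer(seq)]
    rc = reverse_complement(seq)
    for m in pattern.finditer(rc):
        # Convert the start on the reverse complement to + strand coordinates.
        hits.append(("-", len(seq) - m.start() - len(pam)))
    return hits


print(find_pams("AAAGGTTTCCA", "NGG"))  # [('+', 2), ('-', 8)]
```

Reporting both strands in + strand coordinates keeps later steps, such as measuring the distance between a PAM and the target codon, in a single coordinate system.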
beditor.lib.get_specificity.alignmentbed2dalignedfasta(cfg)
Gets sequences in FASTA format from the BED file (step #5).
cfg – configuration dict
beditor.lib.get_specificity.dalignbed2annotationsbed(cfg)
Gets annotations from the aligned BED file (step #3).
cfg – configuration dict
beditor.lib.get_specificity.dalignbed2dalignbedguides(cfg)
Gets guide sequences from the BED file (step #4).
cfg – configuration dict
beditor.lib.get_specificity.dalignbed2dalignbedguidesseq(cfg)
Gets sequences from the BED file (step #6).
cfg – configuration dict
beditor.lib.get_specificity.dalignbedannot2daggbyguide(cfg)
Aggregates annotations per alignment to annotations per guide (step #10).
cfg – configuration dict
beditor.lib.get_specificity.dalignbedguidesseq2dalignbedstats(cfg)
Gets scores for guides (step #7).
cfg – configuration dict
beditor.lib.get_specificity.dannots2dalignbed2dannotsagg(cfg)
Aggregates annotations per guide (step #8).
cfg – configuration dict
beditor.lib.get_specificity.dannotsagg2dannots2dalignbedannot(cfg)
Maps aggregated annotations to guides (step #9).
cfg – configuration dict
beditor.lib.get_specificity.dguides2guidessam(cfg, dguides)
Aligns guides to the genome and gets the SAM file (step #1).
- cfg – configuration dict
- dguides – dataframe of guides
beditor.lib.get_specificity.dguides2offtargets(cfg)
All the processes in offtarget detection are here.
cfg – Configuration settings provided in .yml file
beditor.lib.get_specificity.guidessam2dalignbed(cfg)
Processes the SAM file to get the genomic coordinates in BED format (step #2).
cfg – configuration dict
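The functions of beditor.lib.get_specificity are listed alphabetically above, while their descriptions carry step numbers. The lookup below simply reorders the names by those step numbers. That dguides2offtargets calls them in exactly this order is an assumption based on the numbering, not something stated in the descriptions.

```python
# Step number -> function name, as given in the descriptions above.
OFFTARGET_STEPS = {
    1: "dguides2guidessam",
    2: "guidessam2dalignbed",
    3: "dalignbed2annotationsbed",
    4: "dalignbed2dalignbedguides",
    5: "alignmentbed2dalignedfasta",
    6: "dalignbed2dalignbedguidesseq",
    7: "dalignbedguidesseq2dalignbedstats",
    8: "dannots2dalignbed2dannotsagg",
    9: "dannotsagg2dannots2dalignbedannot",
    10: "dalignbedannot2daggbyguide",
}

for step in sorted(OFFTARGET_STEPS):
    print(f"step {step}: beditor.lib.get_specificity.{OFFTARGET_STEPS[step]}")
```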