`beditor`

beditor: A Computational Workflow for Designing Libraries of Guide RNAs for CRISPR-Mediated Base Editing

Rohan Dandage, Philippe C. Després, Nozomu Yachie and Christian R. Landry. GENETICS 2019

Installation

Basic requirements: the Anaconda package manager. See requirements.md for the set of bash commands that install it.

Once all the requirements are satisfied, create a python 3.6 virtual environment.

wget https://raw.githubusercontent.com/rraadd88/beditor/master/environment.yml
conda env create -f environment.yml

Activate the virtual environment.

source activate beditor

Install beditor python package from pypi.

pip install beditor

Usage

GUI mode

Open the GUI window from terminal.

beditor

step1: input the configuration settings.

Note: genomes listed on the gui correspond to ensembl release=95.

step2: save the configuration settings and run beditor. Outputs will be stored in the same directory as the saved configuration settings file (yml file), in a folder with the same name as the basename of the configuration settings file.

Note: output directory will have the same basename as the saved configuration file. If /a/b/gene_ed.yml is the path of the configuration file, the output directory will be /a/b/gene_ed/. See Output format for structure of the output directory.
Note: see the terminal messages in case of any issue.
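
The naming rule in the note above can be sketched in Python. This only illustrates the rule as described; `project_dir_for` is a hypothetical helper and not part of beditor.

```python
from pathlib import PurePosixPath


def project_dir_for(cfg_path):
    """Return the output directory implied by a configuration file path."""
    # Dropping the extension gives the directory name:
    # /a/b/gene_ed.yml -> /a/b/gene_ed
    return PurePosixPath(cfg_path).with_suffix("")


print(project_dir_for("/a/b/gene_ed.yml"))  # /a/b/gene_ed
```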

Command-line mode

Run the analysis.

beditor --cfg configuration.yml

Run a single step in the analysis.

beditor --step {step number} configuration.yml

step number and corresponding analysis:

1: Get genomic loci flanking the target site
2: Get possible mutagenesis strategies
3: Design guides
4: Check off-target effects

Help

beditor --help

Input format

Table with mutation information.

Note: The path to this TSV (tab-separated values) file is given in the configuration file as the value of the variable dinp, e.g. dinp: input.tsv.

The columns required in the input depend on the mutation_format chosen in the configuration.yml file:

nucleotide : ['genome coordinate','nucleotide mutation'].

Example for S. cerevisiae (ensembl genome release=95):

genome coordinate	nucleotide mutation
I:147494-147494-	A
I:143607-143607+	A
II:369937-369937-	C
II:372003-372003-	C
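
Judging from the rows above, a genome coordinate appears to follow the pattern `{chromosome}:{start}-{end}{strand}`. A minimal sketch of splitting such a string, assuming that pattern holds (`parse_coordinate` is a hypothetical helper, not a beditor function):

```python
import re

# Assumed layout, inferred from the example rows: chrom:start-end followed by + or -
COORD = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)(?P<strand>[+-])$")


def parse_coordinate(text):
    """Split a coordinate string into (chromosome, start, end, strand)."""
    match = COORD.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised genome coordinate: {text!r}")
    return (
        match.group("chrom"),
        int(match.group("start")),
        int(match.group("end")),
        match.group("strand"),
    )


print(parse_coordinate("I:147494-147494-"))  # ('I', 147494, 147494, '-')
```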

aminoacid : ['transcript: id','aminoacid: position','amino acid mutation'].

Example for S. cerevisiae (ensembl genome release=95):

transcript: id	aminoacid: position	amino acid mutation
YAL001C_mRNA	18	A
YAL002W_mRNA	24	A
YAL019W_mRNA	24	C
YAL067W-A_mRNA	13	F

Note: genomes listed in the gui correspond to ensembl release=95.

[for command line usage] Configuration file. It contains all the options and paths to files needed by the analysis.

This YAML-formatted file contains all the analysis-specific parameters.

Template: https://github.com/rraadd88/test_beditor/blob/master/common/configuration.yml

# Input: Mutation information
## Path to this tsv (tab-separated values) file
dinp: input.tsv
reverse_mutations: False

# Step 1: Extracting sequences flanking mutation site (`01_sequences/`).
## host information
host: scientific name
genomerelease: 93
# check assembly from http://useast.ensembl.org/index.html
genomeassembly: fromensembl


# Step 2: Estimating the editable mutations based on base editors chosen. (`02_mutagenesis/`).
# whether aminoacid or nucleotide mutations
mutation_format: aminoacid or nucleotide
##[N nonsyn] S syn else both
mutation_type: N
## keep nonsense mutations
keep_mutation_nonsense: False

## Mutations information can be provided in 3 options: 
## `mutations`: Required Mutations mentioned in input file. 
## `substitutions`: Required Substitutions provided as a file (template: https://github.com/rraadd88/test_beditor/blob/master/common/dsubmap.tsv).
## `mimetic`: Carry out Mimetic substitutions (based on genome wide substitution maps). Only for human and yeast.
## input: options 
## mutations, substitutions, mimetic, [no input: keeps all possible mutations (slow)]
mutations: mutations

## Parameters specific to above options
## 2. Substitutions provided as a file
dsubmap_preferred_path: 
## 3. Mimetic substitutions
## mimetism level (high: only the best one, [medium: best 5], low: best 10)
mimetism_level: medium
## can not mutate between these 
## if ['S','T','K'] is provided all mutations between these amino acids are disallowed
non_intermutables: []


# Step 3: Designed guides (`03_guides/`).
## allowed nucleotide substitutions per codon
max_subs_per_codon: 1
## base editors to use (restriction max_subs_per_codon would override the choice of base editors)
BEs: ['Target-AID','ABE']
# Cas9 related options
## PAM sequence
pams: ['NGG','NG']

#------------------------------------------------
# System related options 
## Number of cpus/threads
cores: 6
## Number of lines to process per cpu
chunksize: 200
## Dependencies 
## by default the dependencies are installed from the conda environment.
## "optionally" paths to the dependencies could be included below.
bedtools: bedtools
bwa: bwa
samtools: samtools
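
Several options in the template accept only a fixed set of values, as stated in its comments. The sketch below checks a configuration that has already been loaded into a Python dict. The allowed values are taken from the template comments; `check_cfg` is a hypothetical helper and is not necessarily what `beditor.pipeline.validcfg` does.

```python
# Allowed values, taken from the comments in the configuration template.
# None stands for an option that is left empty.
ALLOWED = {
    "mutation_format": {"aminoacid", "nucleotide"},
    "mutations": {"mutations", "substitutions", "mimetic", None},
    "mimetism_level": {"high", "medium", "low"},
}


def check_cfg(cfg):
    """Return a list of messages for options whose value is not recognised."""
    problems = []
    for key, allowed in ALLOWED.items():
        if key in cfg and cfg[key] not in allowed:
            problems.append(f"{key}: unexpected value {cfg[key]!r}")
    return problems


cfg = {"mutation_format": "aminoacid", "mutations": "mimetic", "mimetism_level": "medium"}
print(check_cfg(cfg))                            # []
print(check_cfg({"mutation_format": "protein"}))  # ["mutation_format: unexpected value 'protein'"]
```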

Output format

Columns in the output, depending on the mutation_format chosen in the configuration.yml file:

nucleotide :  ['genome coordinate','nucleotide wild-type','nucleotide mutation',]
aminoacid : ['transcript: id','aminoacid: wild-type','aminoacid: position','amino acid mutation','codon: wild-type','guide: id','guide+PAM sequence','beditor score','alternate alignments count','CFD score']

Format of guide: id:

{genomic locus}|{position of mutation}|({strategy})
where,
strategy= {base editor};{strand};@{distance of mutation from PAM};{PAM};{codon wild-type}:{codon mutation};{amino acid wild-type}:{amino acid mutation};
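
The guide: id layout above can be split back into its fields. The following is a minimal sketch that assumes the fields themselves contain no `|` or `;` characters. `parse_guide_id` is a hypothetical helper, and the example id is made up for illustration rather than taken from real beditor output.

```python
FIELDS = ["base editor", "strand", "distance of mutation from PAM", "PAM"]


def parse_guide_id(guide_id):
    """Split a guide id into a dict keyed by the field names used above."""
    locus, position, strategy = guide_id.split("|", 2)
    # Drop the surrounding parentheses and the trailing ';' before splitting
    parts = strategy.strip("()").rstrip(";").split(";")
    editor, strand, distance, pam, codons, aminoacids = parts
    codon_wt, codon_mut = codons.split(":")
    aa_wt, aa_mut = aminoacids.split(":")
    parsed = dict(zip(FIELDS, [editor, strand, distance.lstrip("@"), pam]))
    parsed.update({
        "genomic locus": locus,
        "position of mutation": position,
        "codon wild-type": codon_wt,
        "codon mutation": codon_mut,
        "amino acid wild-type": aa_wt,
        "amino acid mutation": aa_mut,
    })
    return parsed


# Made-up id that follows the documented layout
example_id = "I:147494-147494-|18|(ABE;+;@-14;NGG;ACT:GCT;T:A;)"
print(parse_guide_id(example_id)["codon mutation"])  # GCT
```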

A directory named after the basename of the configuration file (e.g. a directory called 'human' if the configuration file is 'human.yml') is created in the folder where the configuration file is located. It is referred to as the 'project directory'.

Inside the project directory are the following folders, named after the corresponding steps of the analysis.

1. 01_sequences/
Stores the output of step #1. Extracting sequences flanking mutation site.
2. 02_mutagenesis/
Stores the output of step #2. Estimating the editable mutations based on base editors chosen.
3. 03_guides/
Stores the output of step #3. Designed guides.
4. 04_offtargets/
Stores the output of step #4. Off-target effects.
5. 05_outputs/
Stores the combined output, visualizations, and sets of positive and negative control guides.  
Positive control guides are designed to introduce a stop mutation in the genes being targeted.  
Negative control guides lack an editable nucleotide in the window of maximum activity and are therefore not expected to introduce any mutation.

Also,
- 00_input/
Stores input files.
- chunks/
If parallel processing is used, this folder would store individual parts (chunks) of the analysis.
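
When post-processing results, the folder for a given step can be looked up from its step number. The sketch below relies only on the folder names listed above; `step_dir` is a hypothetical helper and the project directory path is an example.

```python
from pathlib import PurePosixPath

# Folder names as listed in this section, keyed by step number
STEP_DIRS = {
    1: "01_sequences",
    2: "02_mutagenesis",
    3: "03_guides",
    4: "04_offtargets",
}


def step_dir(project_dir, step):
    """Return the folder that stores the output of the given step."""
    return PurePosixPath(project_dir) / STEP_DIRS[step]


print(step_dir("/a/b/gene_ed", 3))  # /a/b/gene_ed/03_guides
```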

How to install custom base editor and PAM

GUI mode

Custom base editor and PAM sequences can be incorporated in the workflow by selecting the 'Custom' option in the 1st tab of the GUI. Following is the layout of the options to input the information about the base editor and the PAM sequence.

Command-line mode

The installed BEs and PAMs are stored in a tab-separated table located at beditor/data/dbepams.tsv (use which beditor to locate the beditor directory). To install a new base editor or PAM, append the relevant information to this table.

How to analyze test datasets

# make the input files with mock data
git clone https://github.com/rraadd88/test_beditor.git
source activate beditor;cd test_beditor;python test_datasets.py

Working with non-ensembl genomes or arbitrary sequences

https://github.com/openvax/pyensembl#non-ensembl-data

API

`beditor.pipeline.collect_chunks`(cfg, chunkcfgps)

Collects analysed chunks

Parameters:

cfg – main configuration dict.
chunkcfgps – paths to all configuration files of chunks

`beditor.pipeline.collectchuckfiles`(cfg, fpinchunk, force=False)

Collects minor chunk files

Parameters:

cfg – configuration dict
fpinchunk – path inside chunk’s project directory
force – if True overwrites the outputs

`beditor.pipeline.main`()

Provides command-line inputs to the pipeline.

To check the command-line inputs,

beditor --help

`beditor.pipeline.make_outputs`(cfg, plot=True)

Combines stepwise analysis files into a pretty table.

Parameters:

cfg – main configuration dict
plot – if True creates visualizations

`beditor.pipeline.pipeline`(cfgp, step=None, test=False, force=False)

Runs steps of the analysis workflow in tandem.

Parameters:

cfgp – path to configuration file
step – step number
test – if True uses only one core, linear processing with verbose allowed
force – if True overwrites outputs
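
As a sketch, the analysis steps can also be driven from Python using only the signature documented above. `run_steps` is a hypothetical wrapper. It assumes that beditor is installed in the active environment and that the configuration file exists.

```python
def run_steps(cfgp, steps=(1, 2, 3, 4), force=False):
    """Run the listed steps one after another for one configuration file."""
    # Imported here so that this sketch can be loaded without beditor installed.
    from beditor.pipeline import pipeline

    for step in steps:
        pipeline(cfgp, step=step, force=force)


# Example call (not executed here):
# run_steps("configuration.yml", steps=(1, 2))
```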

`beditor.pipeline.pipeline_chunks`(cfgp=None, cfg=None)

Runs an individual chunk.

Parameters:

cfgp – path to configuration file.
cfg – configuration dict

Returns:

`beditor.pipeline.validcfg`(cfg)

Checks if configuration dict is valid i.e. contains all the required fields

Parameters:

cfg – configuration dict

`beditor.pipeline.validinput`(cfg, din)

Checks if input file is valid i.e. contains all the required columns.

Parameters:

cfg – configuration dict
din – dataframe containing input data

`beditor.configure.get_deps`(cfg)

Installs dependencies of beditor

Parameters:

cfg – configuration dict

`beditor.configure.get_genomes`(cfg)

Installs genomes

Parameters:

cfg – configuration dict

`beditor.lib.get_seq.din2dseq`(cfg)

Wrapper for converting input data (transcript ids and positions of mutation) to sequences flanking the codon.

Parameters:

cfg – configuration dict

`beditor.lib.get_seq.get_seq_aminoacid`(cfg, din)

Fetches sequences if mutation format is amino acid

Parameters:

cfg – configuration dict
din – input data

Returns dsequences:

dataframe with sequences

`beditor.lib.get_seq.get_seq_nucleotide`(cfg, din)

Fetches sequences if mutation format is nucleotide

Parameters:

cfg – configuration dict
din – input data

Returns dsequences:

dataframe with sequences

`beditor.lib.get_seq.t2pmapper`(t, coding_sequence_positions)

Maps transcript id with protein id.

Parameters:

t – pyensembl transcript object
coding_sequence_positions – reading frames

Returns coding_sequence_positions:

dataframe with mapped positions

`beditor.lib.get_seq.tboundaries2positions`(t)

Fetches positions from transcript boundaries.

Parameters:

t – pyensembl transcript object

Returns coding_sequence_positions:

reading frames

`beditor.lib.get_mutations.dseq2dmutagenesis`(cfg)

Generates mutagenesis strategies from identities of reference and mutated codons (from dseq).

Parameters:

cfg – configurations from yml file

`beditor.lib.get_mutations.filterdmutagenesis`(dmutagenesis, cfg)

Filters the mutagenesis strategies by multiple options provided in configuration file (.yml).

Parameters:

dmutagenesis – mutagenesis strategies (pd.DataFrame)
cfg – configurations from yml file

`beditor.lib.get_mutations.get_codon_table`(aa, tax_id=None)

Gets host specific codon table.

Parameters:

aa – list of amino acids
host – name of host

Returns:

codon table (pandas dataframe)

`beditor.lib.get_mutations.get_codon_usage`(cuspp)

Creates codon usage table.

Parameters:

cuspp – path to cusp generated file

Returns:

codon usage table (pandas dataframe)

`beditor.lib.get_mutations.get_possible_mutagenesis`(dcodontable, dcodonusage, BEs, pos_muts, host)

Assesses possible mutagenesis strategies, given the set of Base editors and positions of mutations.

Parameters:

dcodontable – Codon table
dcodonusage – Codon usage table
BEs – Base editors (dict), see global_vars.py
pos_muts – positions of mutations
host – host organism

Returns:

possible mutagenesis strategies as a pandas dataframe
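
To illustrate what assessing mutagenesis strategies involves, the sketch below lists the codons reachable from a wild-type codon with a single base edit, using the standard genetic code. It is not beditor's implementation. The editor names and chemistries are assumptions: a cytosine editor converting C to T and an adenine editor converting A to G, each also acting on the opposite strand.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Assumed editor chemistries. The second pair in each list is the same edit
# as seen from the opposite strand.
EDITS = {
    "cytosine_editor": [("C", "T"), ("G", "A")],
    "adenine_editor": [("A", "G"), ("T", "C")],
}


def single_edits(codon, editor):
    """List (mutated codon, amino acid) pairs reachable with one substitution."""
    results = []
    for i, base in enumerate(codon):
        for src, dst in EDITS[editor]:
            if base == src:
                mutated = codon[:i] + dst + codon[i + 1:]
                results.append((mutated, CODON_TABLE[mutated]))
    return results


print(single_edits("CAA", "cytosine_editor"))  # [('TAA', '*')]
print(single_edits("CAA", "adenine_editor"))   # [('CGA', 'R'), ('CAG', 'Q')]
```

In the first call, a single edit turns a glutamine codon into a stop codon, which is the kind of change the positive control guides are described as introducing.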

`beditor.lib.get_mutations.get_submap`(cfg)

Fetches mimetic substitution map that would be used to filter mutagenesis strategies.

Parameters:

cfg – configurations from yml file.

`beditor.lib.make_guides.dinnucleotide2dsequencesproper`(dsequences, dmutagenesis, dbug=False)

Makes the dsequences dataframe of nucleotide mutation format compatible with the guide design modules

Parameters:

dsequences – dsequences dataframe
dmutagenesis – dmutagenesis dataframe

`beditor.lib.make_guides.dpam2dpam_strands`(dpam, pams)

Duplicates the dpam dataframe so that PAMs can also be searched on the - strand

Parameters:

dpam – dataframe with pam information
pams – pams to be used for actual designing of guides.

`beditor.lib.make_guides.dseq2dguides`(cfg)

Wrapper around make guides function.

Parameters:

cfg – configuration dict.

`beditor.lib.make_guides.get_pam_searches`(dpam, seq, pos_codon, test=False)

Searches for PAM occurrences

Parameters:

dpam – dataframe with PAM sequences
seq – target sequence
pos_codon – reading frame
test – debug mode on

Returns dpam_searches:

dataframe with positions of pams
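
The general idea of a PAM search on both strands can be sketched with a regular expression and a reverse complement. This is an illustration of the technique, not the code used by `get_pam_searches`. `find_pams` and its return format are hypothetical, and only the wildcard N is supported.

```python
import re

# Regex fragment for each PAM letter; N matches any base
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_pams(seq, pam):
    """Return (start on + strand, strand) for every PAM match, overlaps included."""
    # A lookahead makes the matches zero-width, so overlapping PAMs are all found
    pattern = re.compile("(?=(" + "".join(IUPAC[base] for base in pam) + "))")
    hits = [(m.start(), "+") for m in pattern.finditer(seq)]
    for m in pattern.finditer(reverse_complement(seq)):
        # Convert the position on the reverse complement to + strand coordinates
        hits.append((len(seq) - m.start() - len(pam), "-"))
    return hits


print(find_pams("ATGGCCAAT", "NGG"))  # [(1, '+'), (4, '-')]
```

Positions are 0-based and refer to the leftmost base of the match on the + strand, so the - strand hit at position 4 covers the bases CCA.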

`beditor.lib.make_guides.guide2dpositions`(x, dbug=False)

Gets positions of guides relative to the target site and PAM sequence.

Note: Index-based and flank-sequence-based indexing are 0-based. Distances and positions from the PAM are 1-based.

Parameters:

x – lambda section of dguides dataframe

`beditor.lib.make_guides.make_guides`(cfg, dseq, dmutagenesis, dpam, test=False, dbug=False)

Wrapper around submodules that design guides by 1. searching all PAM sequences on both strands, 2. filtering guides by all possible strategies (given in dmutagenesis), e.g. activity window. Finally, it generates a table.

Parameters:

cfg – configuration dict
dseq – dsequences dataframe
dmutagenesis – dmutagenesis dataframe
dpam – dpam dataframe
test – debug mode on
dbug – more verbose

`beditor.lib.get_specificity.alignmentbed2dalignedfasta`(cfg)

Get sequences in FASTA format from BED file (step #5)

Parameters:

cfg – configuration dict

`beditor.lib.get_specificity.dalignbed2annotationsbed`(cfg)

Get annotations from the aligned BED file (step #3)

Parameters:

cfg – configuration dict

`beditor.lib.get_specificity.dalignbed2dalignbedguides`(cfg)

Get guide sequences from the BED file (step #4)

Parameters:

cfg – configuration dict

`beditor.lib.get_specificity.dalignbed2dalignbedguidesseq`(cfg)

Get sequences from BED file (step #6)

Parameters:

cfg – configuration dict

`beditor.lib.get_specificity.dalignbedannot2daggbyguide`(cfg)

Aggregate annotations per alignment to annotations per guide (step #10)

Parameters:

cfg – configuration dict

`beditor.lib.get_specificity.dalignbedguidesseq2dalignbedstats`(cfg)

Gets scores for guides (step #7)

Parameters:

cfg – configuration dict

`beditor.lib.get_specificity.dannots2dalignbed2dannotsagg`(cfg)

Aggregate annotations per guide (step #8)

Parameters:

cfg – configuration dict

`beditor.lib.get_specificity.dannotsagg2dannots2dalignbedannot`(cfg)

Map aggregated annotations to guides (step #9)

Parameters:

cfg – configuration dict

`beditor.lib.get_specificity.dguides2guidessam`(cfg, dguides)

Aligns guides to genome and gets SAM file (step #1)

Parameters:

cfg – configuration dict
dguides – dataframe of guides

`beditor.lib.get_specificity.dguides2offtargets`(cfg)

All the processes in offtarget detection are here.

Parameters:

cfg – Configuration settings provided in .yml file

`beditor.lib.get_specificity.guidessam2dalignbed`(cfg)

Processes SAM file to get the genomic coordinates in BED format (step #2)

Parameters:

cfg – configuration dict
