beditor


beditor: A Computational Workflow for Designing Libraries of Guide RNAs for CRISPR-Mediated Base Editing

Rohan Dandage, Philippe C. Després, Nozomu Yachie and Christian R. Landry. GENETICS 2019

Table of Contents

  1. Installation
  2. Configuration
  3. Input format
  4. Output format
  5. API

Installation

Basic requirements: the Anaconda package manager. See requirements.md for a set of bash commands that install it.

  1. Once all the requirements are satisfied, create a Python 3.6 virtual environment.
wget https://raw.githubusercontent.com/rraadd88/beditor/master/environment.yml
conda env create -f environment.yml
  2. Activate the virtual environment.
source activate beditor
  3. Install the beditor Python package from PyPI.
pip install beditor

Usage

GUI mode

Open the GUI window from terminal.

beditor

step1: input the configuration settings.

Note: genomes listed on the gui correspond to ensembl release=95.

step2: save the configuration settings and run beditor. Outputs will be stored in the same directory as the saved configuration settings file (yml file), in a folder with the same name as the basename of the configuration settings file.

Note: output directory will have the same basename as the saved configuration file. If /a/b/gene_ed.yml is the path of the configuration file, the output directory will be /a/b/gene_ed/. See Output format for structure of the output directory.
Note: see the terminal messages in case of any issue.
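
The output directory naming rule described above can be expressed in a few lines of Python. This is only a sketch of the documented convention, not code taken from beditor; the helper name `output_dir_for` is hypothetical.

```python
from pathlib import PurePosixPath


def output_dir_for(cfg_path):
    """Return the output directory implied by a configuration file path.

    Hypothetical helper: it strips the file extension, following the rule
    that the output folder shares the basename of the .yml file.
    """
    return str(PurePosixPath(cfg_path).with_suffix(""))


print(output_dir_for("/a/b/gene_ed.yml"))  # /a/b/gene_ed
```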

Command-line mode

  1. Run the analysis.
beditor --cfg configuration.yml
  2. Run a single step in the analysis.
beditor --step {step number} configuration.yml

Step numbers and the corresponding analyses:

1: Get genomic loci flanking the target site
2: Get possible mutagenesis strategies
3: Design guides
4: Check off-target effects
  3. Help
beditor --help

Input format

Table with mutation information.

Note: Path to this tsv (tab-separated values) file is provided in configuration file as a value for variable called dinp. E.g. dinp: input.tsv.

The columns required in the input table depend on the mutation_format chosen in the configuration.yml file:

nucleotide : ['genome coordinate','nucleotide mutation'].

Example for S. cerevisiae (ensembl genome release=95):

genome coordinate     nucleotide mutation
I:147494-147494-      A
I:143607-143607+      A
II:369937-369937-     C
II:372003-372003-     C
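
Judging from the examples above, a genome coordinate appears to follow the pattern `{chromosome}:{start}-{end}{strand}`. The sketch below splits such a string into its parts. The pattern is inferred from the examples rather than from a formal specification, and `parse_coordinate` is a hypothetical helper, not part of beditor.

```python
import re

# Pattern inferred from the examples: chromosome, start, end, strand (+ or -).
COORD_RE = re.compile(
    r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)(?P<strand>[+-])$"
)


def parse_coordinate(coord):
    """Split a 'genome coordinate' value into chromosome, start, end, strand."""
    match = COORD_RE.match(coord)
    if match is None:
        raise ValueError(f"unexpected coordinate format: {coord!r}")
    return {
        "chrom": match["chrom"],
        "start": int(match["start"]),
        "end": int(match["end"]),
        "strand": match["strand"],
    }


print(parse_coordinate("I:147494-147494-"))
```

Checking coordinates this way before a run can catch malformed rows early, since a missing strand sign is easy to overlook in a spreadsheet export.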

aminoacid : ['transcript: id','aminoacid: position','amino acid mutation'].

Example for S. cerevisiae (ensembl genome release=95):

transcript: id     aminoacid: position     amino acid mutation
YAL001C_mRNA       18                      A
YAL002W_mRNA       24                      A
YAL019W_mRNA       24                      C
YAL067W-A_mRNA     13                      F
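
Because the column names contain spaces and colons, it can help to generate the input table programmatically and check its header. The following sketch uses only the Python standard library; the column lists are copied from this section, while `missing_columns` is a hypothetical helper and not beditor's own validation.

```python
import csv
import io

# Required columns per mutation_format, as listed in this section.
REQUIRED_COLUMNS = {
    "nucleotide": ["genome coordinate", "nucleotide mutation"],
    "aminoacid": ["transcript: id", "aminoacid: position", "amino acid mutation"],
}


def missing_columns(tsv_text, mutation_format):
    """Return the required columns that are absent from the TSV header."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    return [c for c in REQUIRED_COLUMNS[mutation_format] if c not in header]


# Build a small amino acid input table in memory (tab-separated).
rows = [("YAL001C_mRNA", 18, "A"), ("YAL002W_mRNA", 24, "A")]
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(REQUIRED_COLUMNS["aminoacid"])
writer.writerows(rows)
tsv_text = buffer.getvalue()

print(missing_columns(tsv_text, "aminoacid"))   # []
print(missing_columns(tsv_text, "nucleotide"))  # the two nucleotide columns
```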

Note: genomes listed in the gui correspond to ensembl release=95.

[for command line usage] Configuration file. It contains all the options and paths to files needed by the analysis.

This YAML-formatted file contains all the analysis-specific parameters.

Template: https://github.com/rraadd88/test_beditor/blob/master/common/configuration.yml

# Input: Mutation information
## Path to this tsv (tab-separated values) file
dinp: input.tsv
reverse_mutations: False

# Step 1: Extracting sequences flanking mutation site (`01_sequences/`).
## host information
host: scientific name
genomerelease: 93
# check assembly from http://useast.ensembl.org/index.html
genomeassembly: fromensembl


# Step 2: Estimating the editable mutations based on base editors chosen. (`02_mutagenesis/`).
# whether aminoacid or nucleotide mutations
mutation_format: aminoacid or nucleotide
##[N nonsyn] S syn else both
mutation_type: N
## keep nonsense mutations
keep_mutation_nonsense: False

## Mutations information can be provided in 3 options: 
## `mutations`: Required Mutations mentioned in input file. 
## `substitutions`: Required Substitutions provided as a file (template: https://github.com/rraadd88/test_beditor/blob/master/common/dsubmap.tsv).
## `mimetic`: Carry out Mimetic substitutions (based on genome wide substitution maps). Only for human and yeast.
## input: options 
## mutations, substitutions, mimetic, [no input: keeps all possible mutations (slow)]
mutations: mutations

## Parameters specific to above options
## 2. Substitutions provided as a file
dsubmap_preferred_path: 
## 3. Mimetic substitutions
## mimetism level (high: only the best one, [medium: best 5], low: best 10)
mimetism_level: medium
## can not mutate between these 
## if ['S','T','K'] is provided all mutations between these amino acids are disallowed
non_intermutables: []


# Step 3: Designed guides (`03_guides/`).
## allowed nucleotide substitutions per codon
max_subs_per_codon: 1
## base editors to use (restriction max_subs_per_codon would override the choice of base editors)
BEs: ['Target-AID','ABE']
# Cas9 related options
## PAM sequence
pams: ['NGG','NG']

#------------------------------------------------
# System related options 
## Number of cpus/threads
cores: 6
## Number of lines to process per cpu
chunksize: 200
## Dependencies 
## by default the dependencies are installed from the conda environment.
## "optionally" paths to the dependencies could be included below.
bedtools: bedtools
bwa: bwa
samtools: samtools
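
When several analyses differ only in a few settings, the configuration file can be generated from a script. The sketch below relies on the fact that JSON scalars and lists are also valid YAML, so it needs no YAML library. Only a subset of the keys from the template is shown, and `dump_config` is a hypothetical helper; a real configuration would presumably need the remaining keys from the template as well.

```python
import json


def dump_config(cfg):
    """Serialize a flat dict as 'key: value' lines (JSON values are valid YAML)."""
    return "".join(f"{key}: {json.dumps(value)}\n" for key, value in cfg.items())


# Keys and values taken from the template above; this is not a complete file.
base_cfg = {
    "dinp": "input.tsv",
    "reverse_mutations": False,
    "mutation_format": "aminoacid",
    "mutations": "mutations",
    "max_subs_per_codon": 1,
    "BEs": ["Target-AID", "ABE"],
    "pams": ["NGG", "NG"],
    "cores": 6,
    "chunksize": 200,
}

# Derive a variant that differs in a single setting.
nucleotide_cfg = dict(base_cfg, mutation_format="nucleotide")

config_text = dump_config(nucleotide_cfg)
print(config_text)
```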

Output format

Columns of the output table, depending on the mutation_format chosen in the configuration.yml file:

nucleotide :  ['genome coordinate','nucleotide wild-type','nucleotide mutation',]
aminoacid : ['transcript: id','aminoacid: wild-type','aminoacid: position','amino acid mutation','codon: wild-type','guide: id','guide+PAM sequence','beditor score','alternate alignments count','CFD score']

Format of guide: id:

{genomic locus}|{position of mutation}|({strategy})
where,
strategy= {base editor};{strand};@{distance of mutation from PAM};{PAM};{codon wild-type}:{codon mutation};{amino acid wild-type}:{amino acid mutation};
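
The guide id format above can be taken apart with plain string operations. The sketch below follows the field order given in the format description; the example id is made up for illustration (it is not real beditor output), and `parse_guide_id` is a hypothetical helper.

```python
def parse_guide_id(guide_id):
    """Split a guide id into its documented components (all kept as strings)."""
    locus, position, strategy = guide_id.split("|", 2)
    # The strategy is wrapped in parentheses and ends with a trailing ';'.
    fields = [f for f in strategy.strip("()").split(";") if f != ""]
    base_editor, strand, distance, pam, codons, amino_acids = fields
    codon_wt, codon_mut = codons.split(":")
    aa_wt, aa_mut = amino_acids.split(":")
    return {
        "genomic_locus": locus,
        "mutation_position": position,
        "base_editor": base_editor,
        "strand": strand,
        "distance_from_pam": distance.lstrip("@"),
        "pam": pam,
        "codon_wt": codon_wt,
        "codon_mut": codon_mut,
        "aa_wt": aa_wt,
        "aa_mut": aa_mut,
    }


# Made-up id that follows the documented layout.
example_id = "I:147494-147494-|18|(ABE;+;@-5;NGG;ACT:GCT;T:A;)"
parsed = parse_guide_id(example_id)
print(parsed["base_editor"], parsed["pam"], parsed["codon_wt"], parsed["codon_mut"])
```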

A directory named after the basename of the configuration file (e.g. a directory called 'human' if the configuration file is 'human.yml') is created in the folder that contains the configuration file. It is referred to as the 'project directory'.

Inside a project directory, the following folders are created, named after the corresponding steps of the analysis.

1. 01_sequences/
Stores the output of step #1. Extracting sequences flanking mutation site.
2. 02_mutagenesis/
Stores the output of step #2. Estimating the editable mutations based on base editors chosen.
3. 03_guides/
Stores the output of step #3. Designed guides.
4. 04_offtargets/
Stores the output of step #4. Offtarget effects.
5. 05_outputs/
Stores the combined output, visualizations, and sets of positive and negative control guides.
Positive control guides are designed so that they introduce a stop mutation in the genes being targeted.
Negative control guides lack an editable nucleotide in the window of maximum activity and are therefore not expected to introduce any mutation.

Also,
- 00_input/
Stores input files.
- chunks/
If parallel processing is used, this folder would store individual parts (chunks) of the analysis. 
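
To see how far an analysis has progressed, one can check which of the folders listed above already exist in a project directory. This is a small sketch under the assumption that folders are only present once the corresponding step has produced output; `existing_step_folders` is a hypothetical helper, and the demonstration uses a temporary directory rather than a real project.

```python
import os
import tempfile

# Folder names as listed in this section.
STEP_FOLDERS = [
    "00_input",
    "01_sequences",
    "02_mutagenesis",
    "03_guides",
    "04_offtargets",
    "05_outputs",
]


def existing_step_folders(project_dir):
    """Return the documented step folders that exist inside project_dir."""
    return [
        name for name in STEP_FOLDERS
        if os.path.isdir(os.path.join(project_dir, name))
    ]


# Demonstration on a throwaway directory that mimics a partial run.
with tempfile.TemporaryDirectory() as project_dir:
    for name in ("00_input", "01_sequences"):
        os.mkdir(os.path.join(project_dir, name))
    found = existing_step_folders(project_dir)

print(found)  # ['00_input', '01_sequences']
```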

How to install custom base editor and PAM

GUI mode

Custom base editor and PAM sequences can be incorporated into the workflow by selecting the 'Custom' option in the first tab of the GUI. That tab provides the options for entering the information about the base editor and the PAM sequence.

Command-line mode

The sets of installed BEs and PAMs are stored in a tab-separated table, beditor/data/dbepams.tsv, inside the beditor installation directory (use which beditor to locate it). To install a new base editor or PAM, append the relevant information to this table.
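
Appending to a tab-separated table by hand is error-prone, so a small script can help keep the new row aligned with the existing header. The sketch below is generic: it reads the header from the table itself and refuses rows whose keys do not match. The column names in the demonstration are placeholders, not the real columns of dbepams.tsv, and `append_row` is a hypothetical helper.

```python
import csv
import os
import tempfile


def append_row(table_path, row):
    """Append `row` (a dict) to a tab-separated table, matching its header."""
    with open(table_path, newline="") as handle:
        content = handle.read()
    header = next(csv.reader(content.splitlines(), delimiter="\t"))
    if set(row) != set(header):
        raise ValueError(f"row keys must match the table columns: {header}")
    with open(table_path, "a", newline="") as handle:
        # Avoid gluing the new row onto an unterminated last line.
        if content and not content.endswith("\n"):
            handle.write("\n")
        writer = csv.DictWriter(
            handle, fieldnames=header, delimiter="\t", lineterminator="\n"
        )
        writer.writerow(row)


# Demonstration on a throwaway table with placeholder column names.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.tsv")
    with open(demo_path, "w", newline="") as handle:
        handle.write("column_a\tcolumn_b\nx\ty")  # no trailing newline on purpose
    append_row(demo_path, {"column_a": "p", "column_b": "q"})
    with open(demo_path, newline="") as handle:
        demo_content = handle.read()

print(demo_content)
```

Keeping a backup copy of the original table before editing it is a sensible precaution, since the file ships with the installed package.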

How to analyze test datasets

# make the input files with mock data
git clone https://github.com/rraadd88/test_beditor.git
source activate beditor;cd test_beditor;python test_datasets.py

Working with non-ensembl genomes or arbitrary sequences

https://github.com/openvax/pyensembl#non-ensembl-data

API

beditor.pipeline.collect_chunks(cfg, chunkcfgps)

Collects analysed chunks

Parameters:

  • cfg – main configuration dict.
  • chunkcfgps – paths to all configuration files of chunks
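
The configuration keys `cores` and `chunksize` suggest that the input is split into parts of `chunksize` lines that are processed separately and collected afterwards. The following sketch illustrates that idea only; it is not beditor's implementation, and `split_into_chunks` is a hypothetical helper.

```python
def split_into_chunks(rows, chunksize):
    """Split a list of rows into consecutive chunks of at most `chunksize` rows."""
    return [rows[i:i + chunksize] for i in range(0, len(rows), chunksize)]


rows = list(range(450))  # stand-in for 450 input lines
chunks = split_into_chunks(rows, chunksize=200)
chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)  # [200, 200, 50]
```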

beditor.pipeline.collectchuckfiles(cfg, fpinchunk, force=False)

Collects minor chunk files

Parameters:

  • cfg – configuration dict
  • fpinchunk – path inside chunk’s project directory
  • force – if True overwrites the outputs

beditor.pipeline.main()

Provides command-line inputs to the pipeline.

For checking the command-line inputs,

beditor --help

beditor.pipeline.make_outputs(cfg, plot=True)

Combines stepwise analysis files into a pretty table.

Parameters:

  • cfg – main configuration dict
  • plot – if True creates visualizations

beditor.pipeline.pipeline(cfgp, step=None, test=False, force=False)

Runs steps of the analysis workflow in tandem.

Parameters:

  • cfgp – path to configuration file
  • step – step number
  • test – if True uses only one core, linear processing with verbose allowed
  • force – if True overwrites outputs
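
Based on the signature above, the steps can also be run one by one from a Python script. The sketch below imports `beditor.pipeline.pipeline` only when no other runner is supplied, so the loop can be dry-run without beditor installed. `run_steps` and its `runner` argument are hypothetical, and it is an assumption that calling `pipeline` once per step is equivalent to the `--step` option of the command line.

```python
def run_steps(cfgp, steps=(1, 2, 3, 4), runner=None, force=False):
    """Run the given steps one after another.

    `runner` defaults to beditor.pipeline.pipeline; passing another callable
    makes it possible to dry-run the loop without beditor installed.
    """
    if runner is None:
        from beditor.pipeline import pipeline as runner
    for step in steps:
        runner(cfgp, step=step, force=force)


# Dry run with a stand-in runner that only records the calls.
calls = []
run_steps(
    "configuration.yml",
    runner=lambda cfgp, step, force: calls.append((cfgp, step, force)),
)
print(calls)
```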

beditor.pipeline.pipeline_chunks(cfgp=None, cfg=None)

Runs an individual chunk.

Parameters:

  • cfgp – path to configuration file.
  • cfg – configuration dict

Returns:

beditor.pipeline.validcfg(cfg)

Checks if configuration dict is valid i.e. contains all the required fields

Parameters:

cfg – configuration dict

beditor.pipeline.validinput(cfg, din)

Checks if input file is valid i.e. contains all the required columns.

Parameters:

  • cfg – configuration dict
  • din – dataframe containing input data

beditor.configure.get_deps(cfg)

Installs dependencies of beditor

Parameters:

cfg – configuration dict

beditor.configure.get_genomes(cfg)

Installs genomes

Parameters:

cfg – configuration dict

beditor.lib.get_seq.din2dseq(cfg)

Wrapper for converting input data (transcript ids and positions of mutation) to sequences flanking the codon.

Parameters:

cfg – configuration dict

beditor.lib.get_seq.get_seq_aminoacid(cfg, din)

Fetches sequences if mutation format is amino acid

Parameters:

  • cfg – configuration dict
  • din – input data

Returns:

dsequences – dataframe with sequences

beditor.lib.get_seq.get_seq_nucleotide(cfg, din)

Fetches sequences if mutation format is nucleotide

Parameters:

  • cfg – configuration dict
  • din – input data

Returns:

dsequences – dataframe with sequences

beditor.lib.get_seq.t2pmapper(t, coding_sequence_positions)

Maps transcript id with protein id.

Parameters:

  • t – pyensembl transcript object
  • coding_sequence_positions – reading frames

Returns:

coding_sequence_positions – dataframe with mapped positions

beditor.lib.get_seq.tboundaries2positions(t)

Fetches positions from transcript boundaries.

Parameters:

t – pyensembl transcript object

Returns:

coding_sequence_positions – reading frames

beditor.lib.get_mutations.dseq2dmutagenesis(cfg)

Generates mutagenesis strategies from identities of reference and mutated codons (from dseq).

Parameters:

cfg – configurations from yml file

beditor.lib.get_mutations.filterdmutagenesis(dmutagenesis, cfg)

Filters the mutagenesis strategies by multiple options provided in configuration file (.yml).

Parameters:

  • dmutagenesis – mutagenesis strategies (pd.DataFrame)
  • cfg – configurations from yml file
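
One of the configuration options that affects this filtering is `non_intermutables`: according to the configuration template, mutations between any two amino acids in that list are disallowed. The sketch below applies that rule to a plain list of (wild-type, mutant) pairs. It illustrates the documented rule only and is not beditor's implementation; `drop_non_intermutable` is a hypothetical helper.

```python
def drop_non_intermutable(mutations, non_intermutables):
    """Remove (wild-type, mutant) pairs where both residues are in the blocked set."""
    blocked = set(non_intermutables)
    return [
        (wt, mut) for wt, mut in mutations
        if not (wt in blocked and mut in blocked)
    ]


mutations = [("S", "T"), ("S", "A"), ("K", "S"), ("A", "T")]
allowed = drop_non_intermutable(mutations, ["S", "T", "K"])
print(allowed)  # [('S', 'A'), ('A', 'T')]
```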

beditor.lib.get_mutations.get_codon_table(aa, tax_id=None)

Gets host specific codon table.

Parameters:

  • aa – list of amino acids
  • tax_id – taxonomy id of the host

Returns:

codon table (pandas dataframe)

beditor.lib.get_mutations.get_codon_usage(cuspp)

Creates codon usage table.

Parameters:

cuspp – path to cusp generated file

Returns:

codon usage table (pandas dataframe)

beditor.lib.get_mutations.get_possible_mutagenesis(dcodontable, dcodonusage, BEs, pos_muts, host)

Assesses possible mutagenesis strategies, given the set of Base editors and positions of mutations.

Parameters:

  • dcodontable – Codon table
  • dcodonusage – Codon usage table
  • BEs – Base editors (dict), see global_vars.py
  • pos_muts – positions of mutations
  • host – host organism

Returns:

possible mutagenesis strategies as a pandas dataframe
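
To give a feel for what a mutagenesis strategy is, the sketch below enumerates the codons reachable from a wild-type codon by a single substitution of one nucleotide type, for example C to T as made by a cytidine base editor, and translates them with the standard genetic code. It is a simplified illustration under those assumptions: it ignores codon usage, strand, and activity windows, and `single_edits` is a hypothetical helper rather than a beditor function.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def single_edits(codon, ref, alt):
    """Return (mutated codon, amino acid) for each single ref-to-alt substitution."""
    edits = []
    for i, base in enumerate(codon):
        if base == ref:
            mutated = codon[:i] + alt + codon[i + 1:]
            edits.append((mutated, CODON_TABLE[mutated]))
    return edits


print(single_edits("CAA", "C", "T"))  # [('TAA', '*')]
print(single_edits("GCC", "C", "T"))  # [('GTC', 'V'), ('GCT', 'A')]
```

The first example shows a glutamine codon turned into a stop codon by one C to T change, which is the kind of edit a stop-introducing control guide relies on.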

beditor.lib.get_mutations.get_submap(cfg)

Fetches mimetic substitution map that would be used to filter mutagenesis strategies.

Parameters:

cfg – configurations from yml file.

beditor.lib.make_guides.dinnucleotide2dsequencesproper(dsequences, dmutagenesis, dbug=False)

Makes the dsequences dataframe of nucleotide mutation format compatible with the guide design modules

Parameters:

  • dsequences – dsequences dataframe
  • dmutagenesis – dmutagenesis dataframe

beditor.lib.make_guides.dpam2dpam_strands(dpam, pams)

Duplicates dpam dataframe to be compatible for searching PAMs on - strand

Parameters:

  • dpam – dataframe with pam information
  • pams – pams to be used for actual designing of guides.

beditor.lib.make_guides.dseq2dguides(cfg)

Wrapper around make guides function.

Parameters:

cfg – configuration dict.

beditor.lib.make_guides.get_pam_searches(dpam, seq, pos_codon, test=False)

Searches for PAM occurrences

Parameters:

  • dpam – dataframe with PAM sequences
  • seq – target sequence
  • pos_codon – reading frame
  • test – debug mode on

Returns:

dpam_searches – dataframe with positions of pams
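
Searching for a PAM such as NGG on both strands can be sketched with regular expressions: the degenerate base N is expanded to a character class, and the minus strand is covered by searching for the reverse complement of the PAM on the forward sequence. This is an illustration of the general technique, not beditor's implementation; `find_pam` is a hypothetical helper, only a few IUPAC codes are supported, and the reported positions are 0-based start indices on the forward strand.

```python
import re

# Minimal IUPAC expansion; extend as needed for other degenerate bases.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "N": "[ACGT]", "R": "[AG]", "Y": "[CT]"}
COMPLEMENT = str.maketrans("ACGTNRY", "TGCANYR")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_pam(seq, pam):
    """Return 0-based start positions of PAM matches on the + and - strands."""
    def search(motif):
        # A lookahead makes overlapping matches visible.
        pattern = "(?=(" + "".join(IUPAC[base] for base in motif) + "))"
        return [m.start() for m in re.finditer(pattern, seq)]

    return {"+": search(pam), "-": search(reverse_complement(pam))}


print(find_pam("ACGGTCCA", "NGG"))  # {'+': [1], '-': [5]}
```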

beditor.lib.make_guides.guide2dpositions(x, dbug=False)

Gets positions of guides relative to the target site and PAM sequence.

Note: Indices and flank-sequence-based indexing are 0-based; distances and positions from the PAM are 1-based.

Parameters:

x – lambda section of dguides dataframe

beditor.lib.make_guides.make_guides(cfg, dseq, dmutagenesis, dpam, test=False, dbug=False)

Wrapper around submodules that design guides by

1. searching all PAM sequences on both strands,
2. filtering guides by all possible strategies (given in dmutagenesis), e.g. activity window.

Finally, it generates a table.

Parameters:

  • cfg – configuration dict
  • dseq – dsequences dataframe
  • dmutagenesis – dmutagenesis dataframe
  • dpam – dpam dataframe
  • test – debug mode on
  • dbug – more verbose

beditor.lib.get_specificity.alignmentbed2dalignedfasta(cfg)

Gets sequences in FASTA format from the BED file (step #5).

Parameters:

cfg – configuration dict

beditor.lib.get_specificity.dalignbed2annotationsbed(cfg)

Gets annotations from the aligned BED file (step #3).

Parameters:

cfg – configuration dict

beditor.lib.get_specificity.dalignbed2dalignbedguides(cfg)

Gets guide sequences from the BED file (step #4).

Parameters:

cfg – configuration dict

beditor.lib.get_specificity.dalignbed2dalignbedguidesseq(cfg)

Gets sequences from the BED file (step #6).

Parameters:

cfg – configuration dict

beditor.lib.get_specificity.dalignbedannot2daggbyguide(cfg)

Aggregates annotations per alignment to annotations per guide (step #10).

Parameters:

cfg – configuration dict

beditor.lib.get_specificity.dalignbedguidesseq2dalignbedstats(cfg)

Gets scores for guides (step #7).

Parameters:

cfg – configuration dict

beditor.lib.get_specificity.dannots2dalignbed2dannotsagg(cfg)

Aggregates annotations per guide (step #8).

Parameters:

cfg – configuration dict

beditor.lib.get_specificity.dannotsagg2dannots2dalignbedannot(cfg)

Maps aggregated annotations to guides (step #9).

Parameters:

cfg – configuration dict

beditor.lib.get_specificity.dguides2guidessam(cfg, dguides)

Aligns guides to the genome and gets a SAM file (step #1).

Parameters:

  • cfg – configuration dict
  • dguides – dataframe of guides

beditor.lib.get_specificity.dguides2offtargets(cfg)

All the processes in offtarget detection are here.

Parameters:

cfg – Configuration settings provided in .yml file

beditor.lib.get_specificity.guidessam2dalignbed(cfg)

Processes the SAM file to get the genomic coordinates in BED format (step #2).

Parameters:

cfg – configuration dict