DropSynth oligo generation script
This set of scripts provides an example for how to generate DropSynth oligos for a set of libraries. We will use the 30 DHFR libraries described in the paper as an example. For updated versions of this code please check https://github.com/KosuriLab
- python 3.5 environment
- unafold-3.8 (only require hybrid-ss-min)
- extract Entrez XML files from folA_entrez.zip
The DHFR sequences used in this example are sourced from the InterPro database family IPR012259. Which have been saved in the file seq_input/InterPro_DHFR_family_proteins.fasta (12,577 seqs) Since these are UniProt IDs they should be coverted to RefSeq IDs using http://www.uniprot.org/uploadlists/ The resulting file is saved as seq_input/M20160921ACFE4208EAFA842A78A1B3BA7138A93DB4D25CK.list (6,028 seqs)
A second set of sequences is generated by doing a BLASTP of E.coli DHFR against nrproteins. This is saved in the file seq_input/BlastP_Ecoli_DHFR.fasta (14,917 seqs)
A general overview of the oligo generation procedure can be seen in the shell script protocol.sh. This file calls a number of scripts:
Used to screen assembly and amplification primers. Check which skpp-15 primers will work for amp and which skpp-20 primers for asm. For the skpp-15 check the first 200 primer pairs. The asm primers (skpp-20) are offset by 500. This also writes the Amp pairs into one single fasta for use with BLAT later on. Also outputs the individual Asm pairs.
Download records from NCBI Entrez in XML format, using the IDs in the target list: seq_input/M20160921ACFE4208EAFA842A78A1B3BA7138A93DB4D25CK.list
Sort sequences by length into those that need 4 oligo or 5 oligo for assembly. Find DHFR sequences which are trimethoprim resistant (put into a separate library later). This generates the file: db/DHFR_IP_targetlist_parsed_4_5_res_oligo.csv. This file should be manually updated to add or remove and desired sequences. The edited file is saved as: db/DHFR_IP_targetlist_parsed_4_5_res_oligo_toFASTA.csv
This script is used to look at the identity between different sequences in the library, by running pairwise alignments between all pairs. Full ~16 million pairwise calculation takes about ~20hr. Identity scores are scaled by longest sequence, to give percent identity.
These scripts are used to look at the identity between different sequences in the library and E. coli DHFR.
This takes in db/DHFR_IP_targetlist_parsed_4_5_res_oligo_toFASTA.csv and splits it into 2 libraries one for 4 oligo assemblies and another for 5 oligo assemblies. It fills up as many 384 libraries as requested. For the remaining partial library it adds the trimethoprim resistant sequences and fills the remaining spots randomly using sequences from seq_input/BlastP_Ecoli_DHFR.fasta that are not already present.
This makes sure that all of the sequences:
- Start with Met (this removes most partial records)
- Contains no ambigous X a.a.
- Are not duplicates.
- Is not on the list of sequences that can not be split db/IDs_to_remove.csv (generated by split_genes_for_oligos.py)
Enough sequences are added to make:
- 15 total libraries of 384 sequences.
- 13 libraries with 4 oligos assemblies and 2 libraries with 5 oligo assemblies.
- Lib 13 has 4 oligo assembly resistant sequences and the seq_input\BlastP_Ecoli_DHFR.fasta extra sequences as well.
- Lib 15 has 5 oligo assembly resistant sequences and the seq_input\BlastP_Ecoli_DHFR.fasta extra sequences as well.
This is the main script which generates the gene sequences and splits them. Before running check the following:
- Input library files.
- Number of oligos to split for each file.
- Generate multiple codon versions for each sequence?
- Maximum length of payload.
- Buffer size for padding. This padding is random sequence added between KpnI and the reverse assembly primer. This is used to improve gel based size selection after assembly.
- Max attempts in splitting.
- Stop codon present or added.
- Restriction sites to be used.
This will follow these processing steps:
- Open the big library protein fasta file (one 4-oligo file and another 5 oligo file)
- Split the generation into chunks of 384
- Loop and retry oligo generation until max_num_attempts is reached, if it fails choose the next sequence.
- For each protein sequence:
- Generate one or two codon sequences.
- Remove illegal restriction sites (1).
- Verify that the DNA sequences still translate correctly.
- Add gene cloning restriction sites (NdeI & KpnI) and stop codon.
- Add random sequence buffer, to pad sequence to set length. This should not have long homopolymer repeats.
- Remove illegal restriction sites (2).
- Add the assembly primers.
- Check for illegal restriction sites (3).
- Split the DNA sequence into pieces.
- Make sure the max oligo length and number of pieces is good.
- Make sure no restriction sites are present after BtsaI sites are added on.
- Write files with:
- Split oligo payloads.
- Protein sequences
- Final DNA sequences
- DNA sequences without RE or primers
- DNA sequences without primers
- Do this until each of the 384 records in the library are filled. Then go to next lib. If any sequences fail these get added to badIDarray and will not be used the next time the script is run.
Most common splitting fail errors:
- sequence is too big for maxpayload limit.
- sequence is too small for design parameters (if buffering is off)
- seq is partial and doesn't start with Met
- restriction site is introduced which can't be removed by codon mutation
Generate a CSV file with all sequences (protein and two codon versions). This also does a number of sanity checks like duplicate sequences and generates the file db/DHFR_nt_LibAll_unique.fasta which is a list of all sequences in all libraries (without RE or assembly primers), which is used in the BLAT primer screening.
Use BLAT to check for alignments between the DNA sequences and the assembly primers. If there are large matches change the primers and regenerate the corresponding oligo libraries. First check each assembly primer against each Lib. Second check each amplification primer against all Libs.
Find good amp primer pairs. This will generate skpp15-forward_select_mod.faa and skpp15-reverse_select_mod.faa in the ampprimers-skpp15 directory.
Finalize DropSynth oligos by adding amp primers, microbead barcodes, and nt.BspQI nicking sites.
Run a bunch of design rule checks on the final oligos.
make a csv file with all protein seqs (also done in make_master_CSV.py).
Make a csv file with all oligo seqs.
Check for long homopolymers and display stats.
Choose the strand with fewest Adenines for synthesis (depurination).