# Construct design meeting 6-30-21


## TODO List

1. Pick backbone to insert constructs into
2. Identify homology strategy
3. Cost of sequence synthesis


## Questions

- How would the reverse complement sequence be created from the forward
sequence? Transcribe the region in the antisense direction, then reverse transcribe
the mRNA to cDNA? How would transcription be initiated?
- Cost per bp of synthetic regions? Does cost increase linearly, or in price tiers as sequences get longer?
    - If costs get significantly higher, this could influence the homology strategy and force the use of primers. A 60 bp primer (upper end recommended for Gibson assembly) is priced at 23 USD (NEB). Assembly of one region would require 4 oligos in total, bringing the cost to about 90 - 100 USD per region, not including the cost of actually synthesizing the constructs.

- How do we confirm the presence of the insert, since the variable regions will not carry a selective marker? Add some kind of selective marker? Gel purification by size?
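A back-of-the-envelope version of the primer cost estimate from the questions above. It only uses the two figures quoted in these notes (23 USD per 60 bp oligo, 4 oligos per region); `primer_cost` and the region counts passed to it are hypothetical placeholders, not numbers from the meeting.

```python
# Rough primer cost for the primer-based homology strategy.
OLIGO_PRICE_USD = 23    # quoted price of one 60 bp primer (from the notes)
OLIGOS_PER_REGION = 4   # oligos needed to assemble one region (from the notes)


def primer_cost(n_regions):
    """Primer cost in USD, excluding synthesis of the constructs themselves."""
    return n_regions * OLIGOS_PER_REGION * OLIGO_PRICE_USD


print(primer_cost(1))    # one region
print(primer_cost(10))   # hypothetical set of 10 regions
```

One region comes out at 92 USD, which is where the "about 90 - 100 USD per region" figure comes from. The cost scales linearly with the number of regions unless primers can be shared.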

## General resources

- [NEB Gibson assembly protocol](https://www.neb.com/protocols/2012/12/11/gibson-assembly-protocol-e5510)
    - NEB recommends 15 - 20 bp of overlap for their kits, with melting temperature > 48 °C
- [AddGene Gibson assembly](https://blog.addgene.org/plasmids-101-gibson-assembly)
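A quick way to screen candidate overlaps against the NEB guideline above. This is a minimal sketch using the Wallace rule of thumb (2 °C per A/T, 4 °C per G/C), which is only a crude approximation for short oligos and is not necessarily how NEB computes melting temperature. `wallace_tm` and `overlap_ok` are hypothetical helper names.

```python
def wallace_tm(seq):
    """Approximate melting temperature in °C using the Wallace rule."""
    seq = seq.upper()
    at = seq.count('A') + seq.count('T')
    gc = seq.count('G') + seq.count('C')
    return 2 * at + 4 * gc


def overlap_ok(seq, min_len=15, max_len=20, min_tm=48):
    """Check an overlap against the 15 - 20 bp, Tm > 48 °C guideline."""
    return min_len <= len(seq) <= max_len and wallace_tm(seq) > min_tm


# The 17 bp placeholder promoter sequence used later in these notes,
# treated here purely as an example overlap.
overlap = 'ctttagtgagggttaat'
print(len(overlap), wallace_tm(overlap), overlap_ok(overlap))
```

For anything beyond a rough screen, a nearest-neighbor Tm calculation with the actual reaction conditions would be the better check.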

## Brainstorming

### Issue with Gibson assembly

The main issue is that if homologous sequences are included in the variable regions, taking the reverse complement of the construct also reverse complements those sequences. This removes the homology and prevents easy insertion into the vector.
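The problem can be seen with plain strings. In this minimal sketch both sequences are made-up placeholders: `homology` stands in for an arm that matches the vector and `variable` for a variable region.

```python
COMPLEMENT = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    return ''.join(COMPLEMENT[base] for base in reversed(seq.upper()))


homology = 'CTTTAGTGAGGGTTAAT'        # placeholder arm matching the vector
variable = 'ATGTAGTAGCCGTAGTAGCAGT'   # placeholder variable region

construct = homology + variable
rc_construct = reverse_complement(construct)

# The forward construct starts with the homology arm.
print(construct.startswith(homology))
# The reverse complement no longer contains it anywhere...
print(homology in rc_construct)
# ...because the arm now sits at the other end, itself reverse complemented.
print(rc_construct.endswith(reverse_complement(homology)))
```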

#### Reverse complement resistant sequences

If a sequence contains only a nucleotide and its partner in an alternating pattern then the reverse complement of that sequence will be the same as the original sequence.

In [1]:
from Bio.Seq import Seq

In [2]:
s = Seq('GCGCGCGCGCGC')

In [3]:
s.reverse_complement() == s

True

Pretend `i` is some initiation region, hot off the press, and `p` is the promoter sequence.

In [4]:
i = Seq('ATGTAGTAGCCGTAGTAGCAGT'.lower())
p = Seq('ctttagtgagggttaat')

Add the reverse complement resistant sequence to the variable region `i`.

In [5]:
i_h = s + i
i_h

Seq('GCGCGCGCGCGCatgtagtagccgtagtagcagt')

In [6]:
import numpy as np

In [7]:
def random_seq(length):
    return ''.join(np.random.choice(['A', 'T', 'G', 'C'], length, replace=True))
random_seq(10)

'AATTAAGTTA'

Brute-force reverse complement resistant sequence finder; there is 100% a way to directly calculate this.

In [8]:
def rcr_finder(n_attempts=10000, length=10):
    found = []
    for i in range(n_attempts):
        seq = Seq(random_seq(length))
        if seq.reverse_complement() == seq:
            found.append(seq)
    return found

In [9]:
s = rcr_finder()
s

[Seq('CTCACGTGAG'), Seq('CACAGCTGTG'), Seq('AACGCGCGTT')]

So the rule for reverse complement resistant sequences seems to be: for a sequence `s` of length `n` (0-indexed), `s[i]` = complement(`s[n-1-i]`).
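The rule also gives a direct way to enumerate these sequences instead of sampling for them. With 0-based indexing, position `i` must be the complement of position `n-1-i`. Choosing the first half freely fixes the second half, so there are 4^(n/2) such sequences for even `n`. There are none for odd `n`, since the middle base would have to be its own complement. The sketch below uses plain strings; `satisfies_rule` and `all_rcr` are hypothetical helper names.

```python
from itertools import product

COMPLEMENT = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}


def reverse_complement(seq):
    return ''.join(COMPLEMENT[base] for base in reversed(seq))


def satisfies_rule(seq):
    """Position-wise check: seq[i] is the complement of seq[n-1-i]."""
    n = len(seq)
    return all(seq[i] == COMPLEMENT[seq[n - 1 - i]] for i in range(n))


def all_rcr(length):
    """Every sequence of the given length equal to its reverse complement."""
    if length % 2:
        return []  # the middle base cannot be its own complement
    heads = (''.join(h) for h in product('ACGT', repeat=length // 2))
    return [head + reverse_complement(head) for head in heads]


seqs = all_rcr(6)
print(len(seqs))                            # 4 ** 3 = 64
print('GAATTC' in seqs)                     # the EcoRI site is one of them
print(satisfies_rule('CTCACGTGAG'))         # a hit from the brute-force search
```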

In [10]:
def rcr_generator(length=20):
    assert length % 2 == 0, 'Length must be even!'
    head = Seq(random_seq(length // 2))
    tail = head.reverse_complement()
    rcr = head + tail
    assert rcr.reverse_complement() == rcr
    return rcr

In [11]:
s = rcr_generator(30)
s

Seq('GGACCTGATCGGTTATAACCGATCAGGTCC')

In [12]:
(s + i)

Seq('GGACCTGATCGGTTATAACCGATCAGGTCCatgtagtagccgtagtagcagt')

In [13]:
(s + i).reverse_complement()

Seq('actgctactacggctactacatGGACCTGATCGGTTATAACCGATCAGGTCC')

Even then a primer would still be needed, although not one for every sequence. It would not work as hoped because adding the `i` sequence gives an orientation to `s`. If that were not the case, we could use a single primer spanning the last 10 bases of the promoter and the first bases of `s`.
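The orientation problem restated with plain strings, as a minimal sketch. `s` is the reverse complement resistant sequence generated above and `i` is the placeholder initiation region. The tag itself survives reverse complementing unchanged, but it moves to the opposite end of the construct, so a primer designed against the `s` + `i` junction does not match the reverse complement.

```python
COMPLEMENT = str.maketrans('ACGTacgt', 'TGCAtgca')


def reverse_complement(seq):
    """Reverse complement of a DNA string, preserving case."""
    return seq.translate(COMPLEMENT)[::-1]


s = 'GGACCTGATCGGTTATAACCGATCAGGTCC'   # RC-resistant tag from the cells above
i = 'atgtagtagccgtagtagcagt'           # placeholder initiation region

forward = s + i
reverse = reverse_complement(forward)

print(reverse_complement(s) == s)   # the tag alone is unchanged
print(forward.startswith(s))        # tag leads the forward construct
print(reverse.startswith(s))        # but no longer leads the reverse complement
print(reverse.endswith(s))          # it now trails it instead
```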

#### Restriction enzymes

[Addgene RE cloning guide](https://blog.addgene.org/plasmids-101-restriction-cloning)

Would it be possible to add two restriction sites to the ends of each variable region that are still recognized, by the same or a different enzyme, after taking the reverse complement? The disadvantage is that the insertion location is then more limited by the restriction sites already present in the backbone.
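A minimal sketch of what flanking sites would look like, using plain strings. The EcoRI and BamHI sites are chosen only as familiar palindromic examples and `variable` is a placeholder, so this is not a proposed design. Palindromic sites survive reverse complementing, but the two sites swap ends.

```python
COMPLEMENT = str.maketrans('ACGT', 'TGCA')


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


ECORI = 'GAATTC'   # palindromic recognition site
BAMHI = 'GGATCC'   # palindromic recognition site
variable = 'ATGTAGTAGCCGTAGTAGCAGT'   # placeholder variable region

construct = ECORI + variable + BAMHI
rc_construct = reverse_complement(construct)

# Position of each site in the forward construct and in its reverse complement.
print(construct.find(ECORI), construct.find(BAMHI))
print(rc_construct.find(ECORI), rc_construct.find(BAMHI))
```

Because the sites trade places, inserting the reverse complement with the same two enzymes would put it back in the same location, just flipped.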

In [14]:
from Bio.Restriction import *

import pandas as pd

Get list of all restriction enzymes and their target sites, then take the reverse complement of all sites and store in dictionary.

In [15]:
all_enzyme_sites = {e: Seq(e.site) for e in list(AllEnzymes)}
rc_all_enzyme_sites = {e: s.reverse_complement() for e, s in all_enzyme_sites.items()}

For each enzyme, check if it can digest its reverse complement target site.

In [16]:
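rc_digest_dict = {}
for each_e, site in all_enzyme_sites.items():
    rc_site = rc_all_enzyme_sites[each_e]
    try:
        search = each_e.search(rc_site)    # cut positions found in the RC site
        cat_rc = each_e.catalyze(rc_site)  # fragments from digesting the RC site
        cat = each_e.catalyze(site)        # fragments from digesting the site itself
        # NOTE: this tests the forward digest; test len(cat_rc) instead to
        # check the RC digest directly.
        if len(cat) > 1:  # cut was made
            rc_digest_dict[each_e] = [search, cat, site, cat_rc, rc_site]
    except Exception:
        # skip enzymes whose site cannot be searched or digested as-is
        continue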

In [17]:
print('All Enzymes:', len(AllEnzymes))
print('RC Digesters:', len(rc_digest_dict))

All Enzymes: 978
RC Digesters: 397


A number of enzymes will also cut the reverse complement of their restriction site. For many of these the target site is palindromic, so it is preserved when reverse complemented, although the listing below also includes non-palindromic sites (e.g. Bst2BI, PspFI). So if the plasmid has these restriction sites, the construct should be insertable as either the forward sequence or its reverse complement. However, we do not want to insert at the same location, since the RC would be the termination region and needs to be further downstream of the promoter. We also need to verify that the ends produced by each enzyme are not compatible with each other, to avoid circularization of the backbone. This will be especially true for cutters with promiscuous recognition sites.
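Whether a site is preserved under reverse complement can also be checked without digesting anything. A minimal sketch with plain strings, where the standard IUPAC ambiguity codes are complemented too so that sites like `GCTNAGC` are handled. `site_reverse_complement` and `is_palindromic_site` are hypothetical helper names.

```python
# Complements for the standard IUPAC nucleotide codes.
IUPAC_COMPLEMENT = {
    'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G',
    'R': 'Y', 'Y': 'R', 'S': 'S', 'W': 'W', 'K': 'M', 'M': 'K',
    'B': 'V', 'V': 'B', 'D': 'H', 'H': 'D', 'N': 'N',
}


def site_reverse_complement(site):
    return ''.join(IUPAC_COMPLEMENT[base] for base in reversed(site.upper()))


def is_palindromic_site(site):
    """True if the recognition site equals its own reverse complement."""
    return site_reverse_complement(site) == site.upper()


for site in ['GAATTC', 'GCTNAGC', 'CACGAG']:
    print(site, site_reverse_complement(site), is_palindromic_site(site))
```

The first two sites are palindromic. The third (the Bst2BI site from the listing) is not, even though the enzyme still cuts its reverse complement.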

In [18]:
rc_digest_dict

{BstAUI: [[2],
  (Seq('T'), Seq('GTACA')),
  Seq('TGTACA'),
  (Seq('T'), Seq('GTACA')),
  Seq('TGTACA')],
 BstZI: [[2],
  (Seq('C'), Seq('GGCCG')),
  Seq('CGGCCG'),
  (Seq('C'), Seq('GGCCG')),
  Seq('CGGCCG')],
 BspT104I: [[3],
  (Seq('TT'), Seq('CGAA')),
  Seq('TTCGAA'),
  (Seq('TT'), Seq('CGAA')),
  Seq('TTCGAA')],
 Bpu1102I: [[3],
  (Seq('GC'), Seq('TNAGC')),
  Seq('GCTNAGC'),
  (Seq('GC'), Seq('TNAGC')),
  Seq('GCTNAGC')],
 Bst2BI: [[2],
  (Seq('C'), Seq('ACGAG')),
  Seq('CACGAG'),
  (Seq('C'), Seq('TCGTG')),
  Seq('CTCGTG')],
 AflII: [[2],
  (Seq('C'), Seq('TTAAG')),
  Seq('CTTAAG'),
  (Seq('C'), Seq('TTAAG')),
  Seq('CTTAAG')],
 PspFI: [[2],
  (Seq('C'), Seq('CCAGC')),
  Seq('CCCAGC'),
  (Seq('G'), Seq('CTGGG')),
  Seq('GCTGGG')],
 Ppu10I: [[2],
  (Seq('A'), Seq('TGCAT')),
  Seq('ATGCAT'),
  (Seq('A'), Seq('TGCAT')),
  Seq('ATGCAT')],
 Eco32I: [[4],
  (Seq('GAT'), Seq('ATC')),
  Seq('GATATC'),
  (Seq('GAT'), Seq('ATC')),
  Seq('GATATC')],
 RigI: [[7],
  (Seq('GGCCGG'), Seq('CC'))

##### Applying restriction enzyme approach to pFC8

The actual backbone will look something like pFC8 which is shown below.

![](../files/pFC8.png)

Enzymes located downstream of the T3 promoter could be used.

In [19]:
pFC8_cutters_str = [
    'KpnI', 'Acc65I', 'XbaI', 'Acc65I', 'BmeT11oI', 'Absl', 
    'Aval', 'BsoBI', 'PaeR7I', 'PspXI', 'XhoI', 'NruI',
    'PpuMI', 'BstEII', 'PflMI', 'EcoRI', 'EagI', 'NotI',
    'SacII', 'BtgI', 'BsaBI', 'BstAPI', 'BspDI', 'ClaI',
    'HindIII', 'SacI', 'Eco53kI', 'BamHI', 'NaeI'
]
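# convert to enzyme instances if they exist; names not found in AllEnzymes
# (3 of the 29 here, likely the misspelled 'BmeT11oI', 'Absl', 'Aval') are dropped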
name_dict = {str(e): e for e in list(AllEnzymes)}
pFC8_cutters = [name_dict[c] for c in pFC8_cutters_str if c in name_dict]
len(pFC8_cutters)

26

Any cutters that also cut their reverse complement?

In [20]:
pFC8_rc_cutters = [c for c in pFC8_cutters if c in rc_digest_dict]
len(pFC8_rc_cutters)

22

Somewhat surprisingly, most of them do.

In [21]:
pFC8_rc_cutters

[KpnI,
 Acc65I,
 XbaI,
 Acc65I,
 PaeR7I,
 XhoI,
 NruI,
 BstEII,
 PflMI,
 EcoRI,
 EagI,
 NotI,
 SacII,
 BsaBI,
 BstAPI,
 BspDI,
 ClaI,
 HindIII,
 SacI,
 Eco53kI,
 BamHI,
 NaeI]

Just a quick sanity check.

In [22]:
for cutter in pFC8_rc_cutters:
    assert Seq(cutter.site) == Seq(cutter.site).reverse_complement()
print('All enzymes target site == its reverse complement')

All enzymes target site == its reverse complement


OK, but so what? In theory you could probably insert the reverse complement into the same location, but we don't actually want to do that. It needs to go further downstream. Without changing the restriction sites, this would require using cutters that cut the reverse complement and produce ends compatible with those produced by a different restriction enzyme, one that does not recognize the target sites included with the variable region.

Actually, this would be required for only one of the ends, since you could keep the most downstream site the same and move the more upstream site to alter the distance between the insert and the promoter. However, how far the termination region could be placed from the promoter would be limited by the available compatible restriction sites.

##### Check overhang compatibility using `pydna`

In [23]:
from pydna.readers import read
from pydna.assembly import Assembly
from pydna.amplicon import Amplicon
from pydna.dseqrecord import Dseqrecord

Make sure we can reassemble a single cut site correctly.

In [24]:
cut = Dseqrecord(EcoRI.site).cut(EcoRI)
cut[0].seq

Dseq(-5)
G
CTTAA

In [25]:
cut[1].seq

Dseq(-5)
AATTC
    G

In [26]:
a = Assembly(cut, limit=4).assemble_linear()[0].seq
a

Dseq(-6)
GAATTC
CTTAAG

Make sure assembly is the same as original cut site.

In [27]:
assert a == Dseqrecord(EcoRI.site).seq

In [28]:
def test_compatable_overhangs(cutter_a, cutter_b):
    # test if the ends produced by cutter_a are compatible with the ends produced by cutter_b
    a_frags = Dseqrecord(cutter_a.site).cut(cutter_a)
    b_frags = Dseqrecord(cutter_b.site).cut(cutter_b)
    all_frags = a_frags + b_frags
    return Assembly(all_frags, limit=4).assemble_linear()

In [29]:
test_compatable_overhangs(EcoRI, EcoRI)[0]

Apply function to all enzymes found to cut their RC in pFC8.

In [30]:
enzyme_compat_overhangs = {}
for enzyme_a in pFC8_rc_cutters:
    for enzyme_b in pFC8_rc_cutters:
        if enzyme_a != enzyme_b:
            assembly = test_compatable_overhangs(enzyme_a, enzyme_b)
            if assembly:
                enzyme_compat_overhangs.setdefault(enzyme_a, []).append(enzyme_b)

In [31]:
enzyme_compat_overhangs

{KpnI: [Acc65I, Acc65I],
 Acc65I: [KpnI, KpnI],
 PflMI: [BstAPI],
 EagI: [NotI],
 NotI: [EagI, SacII],
 SacII: [NotI],
 BstAPI: [PflMI],
 HindIII: [SacI],
 SacI: [HindIII]}

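This group exists, but none of the pairs seem to be that distant from each other, which is what we would need to really change the insert location. The hits also need checking by hand: KpnI and SacI leave 3' overhangs while Acc65I and HindIII leave 5' overhangs, so the KpnI/Acc65I and HindIII/SacI pairs share sequence but would not produce ligatable ends.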
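Sticky-end compatibility depends on both the overhang sequence and its polarity (5' or 3'), so a hand check of a few pairs is worthwhile. The sketch below is a minimal, independent check using plain strings, not the pydna approach above. The cut positions are the commonly listed ones (e.g. G^AATTC for EcoRI) and should be verified against REBASE or the supplier. `ENZYMES`, `sticky_end` and `compatible` are hypothetical names, and the bottom-strand cut is inferred by symmetry, which only holds for palindromic sites.

```python
# (recognition site, cut position on the top strand); assumed standard values.
ENZYMES = {
    'EcoRI':   ('GAATTC', 1),    # G^AATTC
    'KpnI':    ('GGTACC', 5),    # GGTAC^C
    'Acc65I':  ('GGTACC', 1),    # G^GTACC
    'HindIII': ('AAGCTT', 1),    # A^AGCTT
    'SacI':    ('GAGCTC', 5),    # GAGCT^C
    'EagI':    ('CGGCCG', 1),    # C^GGCCG
    'NotI':    ('GCGGCCGC', 2),  # GC^GGCCGC
}


def sticky_end(name):
    """Return (polarity, overhang sequence) for a palindromic cutter."""
    site, top_cut = ENZYMES[name]
    bottom_cut = len(site) - top_cut  # symmetry; palindromic sites only
    if top_cut == bottom_cut:
        return ('blunt', '')
    polarity = "5'" if top_cut < bottom_cut else "3'"
    lo, hi = sorted((top_cut, bottom_cut))
    return (polarity, site[lo:hi])


def compatible(a, b):
    """Ends ligate only if overhang polarity and sequence both match."""
    return sticky_end(a) == sticky_end(b)


for a, b in [('EagI', 'NotI'), ('KpnI', 'Acc65I'), ('HindIII', 'SacI')]:
    print(a, sticky_end(a), b, sticky_end(b), compatible(a, b))
```

Comparing overhang sequences directly works here because overhangs from palindromic sites are themselves palindromic. Non-palindromic or ambiguous sites (PflMI, BstAPI) would need a proper complementarity check.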

#### Double homology

What about including two homologous regions separated by a restriction site? The first region would have homology for the forward sequence (initiation) and the second region would have homology for the termination region and would be cleaved before inserting.

![](../files/RE_cloning.png)

This would have the advantage of not requiring any primers, but would add the most nucleotides to the variable region. Additionally, it would distance the initiation region further from the promoter, and it would not be possible to place it directly adjacent. If the 30 bp overlap rule were followed, this would require adding ~130 nucleotides, more than 50% of the variable region length, bringing the total length with a 200 bp variable region to somewhere in the neighborhood of 330 bp.
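One possible breakdown of the ~130 nt figure, as a sketch. The notes do not spell out the arithmetic, so the layout here is an assumption: two 30 bp homology arms separated by one 6 bp restriction site at each end of the variable region. Only the 30 bp overlap and the 200 bp variable region come from the notes.

```python
OVERLAP_BP = 30      # "30 bp overlap rule" (from the notes)
VARIABLE_BP = 200    # variable region length (from the notes)
SITE_BP = 6          # assumed length of the separating restriction site
ARMS_PER_END = 2     # assumed: initiation arm + termination arm
ENDS = 2             # assumed: both ends of the variable region are flanked

added_bp = ENDS * (ARMS_PER_END * OVERLAP_BP + SITE_BP)
total_bp = VARIABLE_BP + added_bp

print(added_bp)                  # nucleotides added by the flanks
print(total_bp)                  # total construct length
print(added_bp / VARIABLE_BP)    # added length relative to the variable region
```

Under these assumptions the flanks add 132 nt for a total of 332 bp, in line with the ~130 and ~330 estimates above.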