# Definitive guide to insert orientation and automation via the `pyDNA` package

This notebook is intended to be a distilled version of [Gibson assembly via pyDNA](gibson_via_pyDNA.ipynb), where I began looking at automating Gibson assembly protocol design via the `pyDNA` package.

Diagram of basic Gibson assembly method from AddGene
![Diagram of basic Gibson assembly method from AddGene](https://media.addgene.org/data/easy-thumbnails/filer_public/cms/filer_public/15/c4/15c45cf9-3d03-4f61-93e9-c39159f6916e/gibson_assembly_overview_1.jpg__700x351_q85_crop_subsampling-2_upscale.png)

## Backbone linearization

The first step in the Gibson protocol is linearization of the plasmid backbone that the insert will be cloned into. While the exact backbone has yet to be determined at the time of writing, it will likely be one of the `pFC` plasmids, or at least one with very similar features.

![The pFC8 plasmid](files/pFC8.png)

Here, I will be using the `pFC8` plasmid as an example, with the intention of inserting a dummy initiation and elongation region just downstream of the T3 promoter. The plasmid backbones used in the actual experiment will likely have at least one promoter already present, and it would be ideal if it could be taken advantage of.

Accordingly, the cut site for linearization should be unique and as close to the promoter as possible while respecting its orientation; the cut site should always be downstream of the promoter in order to ensure the inserted region is always in the path of transcription.

![](files/cut_site_orrientation.png)

## Insert orientation

We assume that the insert sequence is always supplied in the 5' -> 3' direction and that this is the sequence the user intends for the polymerase to produce. Under this assumption, if the promoter is oriented in the positive direction, no modification needs to be made to the insert before assembly. However, if the promoter is in the negative orientation, the sequence that is ultimately introduced into the backbone must be the reverse complement of the supplied sequence in order for the negative-orientation polymerase to produce the same RNA transcript.

![](files/sequence_orientation.png)
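The orientation rule above can be sketched without any dependencies. This is a minimal illustration, not the notebook's implementation: the helper names `reverse_complement` and `orient_insert` are hypothetical, and the strand values follow the 1 / -1 convention used for feature strands.

```python
# Translation table for complementing upper- and lower-case DNA bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orient_insert(insert: str, promoter_strand: int) -> str:
    """Return the sequence to place on the top strand of the backbone.

    promoter_strand is 1 for a sense promoter and -1 for an antisense one.
    """
    if promoter_strand not in (1, -1):
        raise ValueError("promoter_strand must be 1 or -1")
    return insert if promoter_strand == 1 else reverse_complement(insert)


supplied = "CTCGAGGGGGGGCCCGGTAC"  # example 20 bp insert, written 5' -> 3'
top_strand = orient_insert(supplied, promoter_strand=-1)
print(top_strand)  # GTACCGGGCCCCCCCTCGAG

# For an antisense promoter the coding strand is the bottom strand, whose
# 5' -> 3' sequence is the reverse complement of the top strand.
print(reverse_complement(top_strand) == supplied)  # True
```

With `promoter_strand=1` the function returns the insert unchanged, which matches the "no modification" case described above.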

## Implementing using pyDNA

With these constraints in mind we can now implement these ideas in Python using the pyDNA package with the intent to create a script for use in the plasmid design pipeline.

### Implementation pseudocode

```
read backbone file
read insert files 
determine insert orientation  # would be supplied by user in some kind of config file
determine target promotor  # also supplied in config
determine target promotor orientation  # -1 or 1
if target promotor orientation == -1:
    find closest unique cut site with start < promotor start - length cut
else:
    find closest unique cut site with start > promotor start 
cut the backbone
if target_promotor_orientation == -1:
    insert sequences = reverse complement (insert sequences)
create assembly
output genbank file
output primers
```

### Python implementation

In [40]:
import os
from pydna.readers import read
from pydna.design import primer_design
from pydna.design import assembly_fragments
from pydna.assembly import Assembly
from Bio.Restriction import Analysis, RestrictionBatch
from pydna.amplicon import Amplicon
from Bio.Seq import Seq
from Bio import SeqIO
import yaml
import pprint

In [41]:
pp = pprint.PrettyPrinter(indent=4)

Read backbone and insert files

In [42]:
pFC8_path = 'files/pFC8.gb'
init_path = 'files/test_init.gb'
exten_path = 'files/test_exten.gb'

PROMOTOR_FEATURE_INDEX = 3  # index of feature (base 0) representing the promotor the insert should be adjacent to

In [43]:
pFC8, init, exten = read(pFC8_path), read(init_path), read(exten_path)

In [44]:
pFC8.list_features()

| Ft# | Label or Note | Dir | Sta  | End  | Len | type         | orf? |
|-----|---------------|-----|------|------|-----|--------------|------|
|   0 | L:T7\promoter | --> | 11   | 33   |  22 | promoter     |  no  |
|   1 | L:T7\+1\Site  | --> | 28   | 29   |   1 | misc_feature |  no  |
|   2 | L:SNRPN       | <-- | 51   | 1032 | 981 | CDS          |  no  |
|   3 | L:T3\promoter | <-- | 1046 | 1063 |  17 | promoter     |  no  |

Determine target promoter and orientation

In [45]:
target_promotor = pFC8.features[PROMOTOR_FEATURE_INDEX]
target_promotor

SeqFeature(FeatureLocation(ExactPosition(1046), ExactPosition(1063), strand=-1), type='promoter')

In [46]:
promotor_orrientation = target_promotor.strand

Add features to label `init` as the initiation region and `exten` as the extension region. In the actual pipeline this would need to be handled with helper functions that translate fasta sequences plus user labels from config files into genbank formatted files that can then be read in as above.

In [47]:
init.add_feature(x=0, y=len(init), strand=promotor_orrientation, label='Initiation-1', type='CDS')
exten.add_feature(x=0, y=len(exten), strand=promotor_orrientation, label='Extension-1', type='CDS')

It is *critical* that a distinction is made between the feature orientation and the actual sequence orientation. Labeling a sequence with a feature on one strand or the other will not actually change the content of the sequence added to the final assembly; however, when viewed with a program like SnapGene, the arrow indicating sequence direction will correspond to the *label*.

This means that if the reverse complement is to be incorporated into the assembly (promoter orientation == -1), then the actual sequence must be reverse complemented.

In [48]:
print(init.seq[:20])

CTCGAGGGGGGGCCCGGTAC


In [49]:
# assumes insert is supplied in 5' -> 3' and as it is intended to be encountered by the polymerase
def reorient_insert_relative_to_feature(seqFeature, insert):
    seqFeat_orr = seqFeature.strand
    if seqFeat_orr == -1:
        insert.seq = insert.seq.reverse_complement()
    return insert

In [50]:
init = reorient_insert_relative_to_feature(target_promotor, init)
exten = reorient_insert_relative_to_feature(target_promotor, exten)

In [51]:
print(init.seq[:20])

TTACTCACAGGCTTTTTTCA


Locate the best restriction site

In [52]:
int(target_promotor.location.start)

1046

In [53]:
def get_closest_downstream_unique_RS(seqFeature, vector):
    rb = RestrictionBatch([], ['C'])
    cut_analysis = Analysis(rb, vector.seq, linear=False)
    unique_cutters = {enzyme: cut_loc for enzyme, cut_loc in cut_analysis.full().items() if len(cut_loc) == 1}

    # get distance relative to feature start location
    unique_cutters_dist_func = lambda loc: loc[0] - int(seqFeature.location.start)
    distances = {enzyme: unique_cutters_dist_func(loc) for enzyme, loc in unique_cutters.items()}
    
    # select enzymes correct location relative to seqFeature orientation
    distances = {enzyme: dist for enzyme, dist in distances.items() if dist * seqFeature.strand > 0}
    
    # get enzyme with RS min distance from seqFeature
    best_enzyme = min(distances, key=lambda e: abs(distances.get(e)))
    return {best_enzyme: 
                {'distance': distances[best_enzyme], 
                 'cut_site': unique_cutters[best_enzyme]
                }
           }


In [54]:
best_enzyme = get_closest_downstream_unique_RS(target_promotor, pFC8)
print(best_enzyme)

{KpnI: {'distance': -13, 'cut_site': [1033]}}
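The selection logic can be checked in isolation with plain dictionaries. The sketch below is an assumption-laden stand-in rather than the notebook's function: `closest_downstream_unique_cutter` is a hypothetical name, the enzymes other than KpnI are invented placeholders, and positions are treated as linear coordinates (no wrap-around past the origin of the plasmid). Unlike a bare `min` call, it returns `None` when no enzyme qualifies instead of raising `ValueError` on an empty collection.

```python
def closest_downstream_unique_cutter(cut_sites, feature_start, strand):
    """Pick the unique cutter nearest to, and downstream of, a feature.

    cut_sites: dict mapping enzyme name -> list of cut positions
    feature_start: start coordinate of the promoter feature
    strand: 1 or -1, the promoter orientation
    Returns (enzyme, signed_distance), or None if nothing qualifies.
    """
    candidates = {}
    for enzyme, positions in cut_sites.items():
        if len(positions) != 1:  # keep only enzymes that cut exactly once
            continue
        distance = positions[0] - feature_start
        if distance * strand > 0:  # same sign as the strand means downstream
            candidates[enzyme] = distance
    if not candidates:
        return None
    best = min(candidates, key=lambda enzyme: abs(candidates[enzyme]))
    return best, candidates[best]


# Hypothetical cut-site table; only the KpnI position comes from the output above.
example_sites = {
    "KpnI": [1033],
    "EnzymeA": [1100],      # unique, but upstream of a -1 strand promoter
    "EnzymeB": [200, 900],  # cuts twice, so it cannot be used
    "EnzymeC": [400],       # unique and downstream, but farther away
}
print(closest_downstream_unique_cutter(example_sites, feature_start=1046, strand=-1))
# ('KpnI', -13)
```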


Biopython reports a cut site as the position of the first base of the downstream segment produced by the restriction digest (the first base after the position where the cut will be made), counting the first base of the sequence as base 1.
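To make the numbering convention concrete, here is a small sketch that converts a 1-based "first base after the cut" position into a Python slice index and opens a circular sequence at that point. The function name `linearize_at` and the toy sequence are hypothetical, and the sketch treats the cut as blunt, so it ignores the overhangs a real enzyme may leave.

```python
def linearize_at(circular_seq: str, cut_position: int) -> str:
    """Open a circular sequence so it starts at the first base after the cut.

    cut_position is 1-based and names the first base of the downstream
    fragment.
    """
    if not 1 <= cut_position <= len(circular_seq):
        raise ValueError("cut_position is outside the sequence")
    index = cut_position - 1  # convert to a 0-based Python index
    return circular_seq[index:] + circular_seq[:index]


plasmid = "AAAACCCCGGGGTTTT"  # toy 16 bp circular sequence
print(linearize_at(plasmid, 5))  # CCCCGGGGTTTTAAAA
```

The linear product has the same length as the circular input; only the starting point changes.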

Linearize the plasmid at the restriction site

In [55]:
def linearize_backbone(backbone, restrictionEnzyme):
    return backbone.linearize(restrictionEnzyme)

In [56]:
pFC8_linear = linearize_backbone(pFC8, list(best_enzyme.keys()).pop())
pFC8_linear

Dseqrecord(-3593)

Convert the inserts into amplicons so they can be used in an `Assembly` object

In [57]:
def convert_inserts_to_amplicons(inserts):
    return [primer_design(each_insert) for each_insert in inserts]

The inserted regions should be passed in their final order in the construct relative to the site of linearization. This will require a helper function that gets the intended order of the inserts in their 5' -> 3' orientation and then swaps the order if inserting relative to an antisense promoter.
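One possible shape for that ordering helper is sketched below. The name `order_inserts_for_assembly` is hypothetical and the inserts are represented by plain strings; in the pipeline they would be the insert records themselves. The helper only changes the order, so each insert would still need to be reverse complemented separately for an antisense promoter.

```python
def order_inserts_for_assembly(inserts_5_to_3, promoter_strand):
    """Return inserts in the order they should be passed to the assembly.

    inserts_5_to_3: inserts in the order the polymerase should encounter them
    promoter_strand: 1 (sense) or -1 (antisense)
    """
    if promoter_strand not in (1, -1):
        raise ValueError("promoter_strand must be 1 or -1")
    ordered = list(inserts_5_to_3)  # copy so the caller's list is untouched
    if promoter_strand == -1:
        ordered.reverse()  # antisense promoter: last-transcribed insert goes first
    return ordered


print(order_inserts_for_assembly(["init", "exten"], promoter_strand=-1))
# ['exten', 'init']
```

For the antisense T3 promoter used here, this reproduces the `(exten, init)` order passed to the amplicon conversion step.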

In [58]:
insert_amplicons = convert_inserts_to_amplicons((exten, init))
insert_amplicons

[Amplicon(330), Amplicon(330)]

Create a circular assembly from the backbone and fragments.

In [59]:
print(type(pFC8_linear))
print(type(init))

<class 'pydna.dseqrecord.Dseqrecord'>
<class 'pydna.genbankfile.GenbankFile'>


In [60]:
def make_fragment_list(linear_backbone, ordered_inserts):
    fragments = assembly_fragments(
        [linear_backbone] + list(ordered_inserts) + [linear_backbone]
    )
    return fragments

In [61]:
def make_primers(fragment_list):
    amplicons = [x for x in fragment_list if isinstance(x, Amplicon)]
    primers = [(y.forward_primer, y.reverse_primer) for y in amplicons]
    return primers

In [62]:
def make_assembly(fragment_list):
    assembly_final = Assembly(fragment_list[:-1])
    return assembly_final

In [63]:
pFC8_fragments = make_fragment_list(pFC8_linear, insert_amplicons)
pFC8_assembly = make_assembly(pFC8_fragments)
pFC8_primers = make_primers(pFC8_fragments)
print(pFC8_assembly)

Assembly
fragments..: 3593bp 383bp 383bp
limit(bp)..: 25
G.nodes....: 8
algorithm..: common_sub_strings


In [64]:
len(pFC8_assembly)

TypeError: object of type 'Assembly' has no len()

Select assembly.

In [25]:
canidate = pFC8_assembly.assemble_circular()[0]
canidate

Write the assembly to a genbank formatted file and visualize it using SnapGene.

In [26]:
test_assembly = 'files/pFC8_init_exten_test_assembly.gb'

canidate.write(test_assembly)

![](files/pFC8_init_exten.png)

Searching for the first 20 bp of the initiation sequence, `CTCGAGGGGGGGCCCGGTAC`, we find it on the 3' strand at the start of Initiation region 1. Since the promoter is in the antisense orientation, the 3' strand acts as the coding strand, so the polymerase produces the originally specified sequence: it encounters exactly the same nucleotides as it would if it were transcribing the originally specified sequence in the sense direction.

![](files/pFC8_init_exten_init_start.png)

Circular view with sequence location highlighted in grey

![](files/pFC8_init_exten_init_start_circular.png)
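The same check can be done programmatically instead of by eye. The sketch below is a dependency-free approximation: `locate` is a hypothetical helper, the toy construct is a stand-in for the real assembly sequence, and matches that span the origin of a circular sequence are not considered.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-cased DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def locate(query: str, construct: str):
    """Return (strand, 0-based start on the top strand), or None if absent.

    strand is 1 when the query reads 5' -> 3' on the top strand and -1 when
    it reads 5' -> 3' on the bottom strand.
    """
    query, construct = query.upper(), construct.upper()
    position = construct.find(query)
    if position != -1:
        return 1, position
    position = construct.find(reverse_complement(query))
    if position != -1:
        return -1, position
    return None


query = "CTCGAGGGGGGGCCCGGTAC"
# Hypothetical stand-in for the assembled construct: the insert was reverse
# complemented before assembly, so the top strand carries its reverse complement.
toy_construct = "TTTT" + reverse_complement(query) + "AAAA"
print(locate(query, toy_construct))  # (-1, 4)
```

A result with strand `-1` corresponds to what the sequence view shows: the supplied sequence sits on the bottom strand, in the path of the antisense promoter.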

## Primers

In Gibson assembly PCR is used to add regions of homology between inserted fragments. 

![[OpenWetWare](https://openwetware.org/wiki/Janet_B._Matsen:Guide_to_Gibson_Assembly)](https://s3-us-west-2.amazonaws.com/oww-files-thumb/7/7b/Gibson_overview_cartoon_JM.png/900px-Gibson_overview_cartoon_JM.png)

This assumes that there are not already homologous regions on the 5' and 3' ends of the inserted sequence. However, since the variable regions that will be inserted into the plasmid backbone will be 100% synthesized, it would potentially be possible to add regions of homology as part of the synthesis. If this is done, it would be best to have some way of reflecting this within the assembly program. In the worst case, if there is flanking homology, the primers could simply be ignored or marked as not required.
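One way the assembly program could reflect pre-synthesized homology is a simple end-matching check, sketched below. Everything here is hypothetical: the name `needs_primers`, the exact-match criterion, and the toy sequences. The default of 25 bp mirrors the overlap limit reported by the `Assembly` object above.

```python
def needs_primers(insert, upstream, downstream, min_overlap=25):
    """Return True if PCR primers are still needed to add homology arms.

    The insert counts as ready when its first min_overlap bases match the end
    of the upstream fragment and its last min_overlap bases match the start of
    the downstream fragment.
    """
    if len(insert) < min_overlap:
        return True  # too short to carry a full homology arm
    has_left_arm = upstream.endswith(insert[:min_overlap])
    has_right_arm = downstream.startswith(insert[-min_overlap:])
    return not (has_left_arm and has_right_arm)


# Toy fragments with a 4 bp overlap so the example stays readable.
upstream, downstream = "GGGGACGT", "TTAACCCC"
synthesized_with_arms = "ACGT" + "CATCAT" + "TTAA"
print(needs_primers(synthesized_with_arms, upstream, downstream, min_overlap=4))  # False
print(needs_primers("CATCATCAT", upstream, downstream, min_overlap=4))  # True
```

Inserts for which this returns `False` could have their primer pairs marked as not required in the output.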

In [27]:
def print_primer_list(primers):
    for pair in primers:
        print(pair[0].format("fasta"))
        print(pair[1].format("fasta"))

In [28]:
print_primer_list(pFC8_primers)

>f330 EXTEN1
TCAGTACTCCAAGACCTCGAGGGGGGGCCCGGTACCTAAAGGGAACAAAAGCT

>r330 EXTEN1
AAAAAAGCCTGTGAGTAATGAAAAAAGCCTGTGAGT

>f330 INIT1
ACTCACAGGCTTTTTTCATTACTCACAGGCTTTTTTC

>r330 INIT1
ATTAACCCTCACTAAAGGGAACAAAAGCTGGGTACCTCGAGGGGGGGC



## Workflow implementation and helper functions

Below are thoughts on how the assembly concepts above would actually be implemented in the snakemake workflow and some drafts of helper functions that are likely to be required during implementation.

### Laying out complete plasmids; construct ymls

In order for the workflow to actually be able to produce descriptions of Gibson assembly protocols there needs to be some way to specify which variable regions get inserted where, in which order and with what.

Since most of the time the only things that vary are, well, the variable regions, I think it makes the most sense to do this in terms of "constructs", which contain a keyword marking where a variable region would be inserted. An additional column would then be added to the variable region definition tsv file to specify the construct that should be used.

Constructs can be specified in `yml` format. An example is below.

```
construct_a:
    backbone: "backbone.gb"  # path to genbank record describing plasmid backbone
    insert_downstream_of: "promotor_a"  # label of feature in backbone file to insert downstream of
    contents:
        - "VAR_REGION"  # variable region keyword. Insert variable region here.
        - "extension_region.gb"
        - "UNIQUE_CUTTER"  # restriction site keyword, insert a uniquely cutting RE here
```

In [29]:
example_construct = 'files/example_construct.yml'

def read_yaml(filepath):
    # https://stackoverflow.com/questions/1773805
    # returns the parsed contents, or None (after printing the error) if parsing fails
    with open(filepath) as handle:
        try:
            return yaml.safe_load(handle)
        except yaml.YAMLError as exc:
            print(exc)

construct_dict = read_yaml(example_construct)
pp.pprint(construct_dict)

{   'construct_a': {   'backbone': 'backbone.gb',
                       'contents': [   'VAR_REGION',
                                       'extension_region.gb',
                                       'UNIQUE_CUTTER'],
                       'insert_downstream_of': 'promotor_a'}}


During assembly, `VAR_REGION` would be replaced with the genbank file produced from a variable region, which of course implies we need some way of converting fasta records into minimal genbank files that can be read and used for assembly.
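The keyword substitution itself could look something like the sketch below. It works on the plain dictionary that results from loading the construct yml, so it needs no extra packages; the function name `fill_variable_region` and the path `var_region_1.gb` are hypothetical.

```python
VAR_REGION_KEYWORD = "VAR_REGION"


def fill_variable_region(construct, var_region_path):
    """Return a copy of a construct with the keyword replaced by a file path."""
    if VAR_REGION_KEYWORD not in construct["contents"]:
        raise ValueError("construct has no VAR_REGION placeholder")
    filled = dict(construct)  # shallow copy; the input dict is left unchanged
    filled["contents"] = [
        var_region_path if item == VAR_REGION_KEYWORD else item
        for item in construct["contents"]
    ]
    return filled


# Same structure as the parsed example construct shown above.
construct_a = {
    "backbone": "backbone.gb",
    "insert_downstream_of": "promotor_a",
    "contents": ["VAR_REGION", "extension_region.gb", "UNIQUE_CUTTER"],
}
filled = fill_variable_region(construct_a, "var_region_1.gb")  # hypothetical path
print(filled["contents"])
# ['var_region_1.gb', 'extension_region.gb', 'UNIQUE_CUTTER']
```

The `UNIQUE_CUTTER` keyword is left in place here; resolving it to an actual restriction site would be a separate step.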

### Convert fasta records + config info to genbank records

In [30]:
from pydna.genbankrecord import GenbankRecord

Pretend we have some fasta record to turn into a genbank file for input into the Gibson assembly program. Read it in using Biopython.

In [31]:
fasta_record = 'files/myRecord.fa'
record = SeqIO.read(fasta_record, 'fasta')  # in workflow should only be a single record
record

SeqRecord(seq=Seq('AGAGAGGGGGGCAGACGAAAGCAGATAGACAGATGAGACAGATGACACAGGGGA...ATA'), id='MYRECORDID_This', name='MYRECORDID_This', description='MYRECORDID_This is my record stay away!', dbxrefs=[])

Convert to a `GenbankRecord` instance.

In [32]:
record = GenbankRecord(record)

Pretend we already have some data that labels the sequence.

In [33]:
label = 'Test-init-1'

Add a feature that covers the entire record for labeling purposes only.

In [34]:
def add_labeling_feature(genbank_record, label, title='', authors='', id='', **kwargs):
    import datetime
    now = datetime.datetime.now()
    modification_date = datetime.date.strftime(now, "%m/%d/%Y")
    genbank_record.add_feature(
        x=0, y=len(genbank_record), type='CDS', label=label,
        title=title, authors=authors, modification_date=modification_date, id=id, **kwargs
    )
    return genbank_record

In [35]:
record = add_labeling_feature(record, label, 'Test Init Record', 'Donald Duck', id='test-id1')
record.locus = label

In [36]:
test_fasta_to_gb = 'files/test_fasta_to_genbank.gb'
record.write(test_fasta_to_gb)

In [37]:
vars(record.features[0])

{'location': FeatureLocation(ExactPosition(0), ExactPosition(79), strand=1),
 'type': 'CDS',
 'id': '<unknown id>',
 'qualifiers': {'label': 'Test-init-1',
  'title': 'Test Init Record',
  'authors': 'Donald Duck',
  'modification_date': '06/28/2021',
  'id': 'test-id1'}}

In [38]:
print(open(test_fasta_to_gb).read())

LOCUS       Test-init-1               79 bp    DNA     linear   UNK 01-JAN-1980
DEFINITION  MYRECORDID_This is my record stay away!.
ACCESSION   MYRECORDID_This
VERSION     MYRECORDID_This
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     CDS             1..79
                     /label="Test-init-1"
                     /title="Test Init Record"
                     /authors="Donald Duck"
                     /modification_date="06/28/2021"
                     /id="test-id1"
ORIGIN
        1 agagaggggg gcagacgaaa gcagatagac agatgagaca gatgacacag gggacaaaag
       61 atagatgaga gaacagata
//


Put this into an object so it can be subclassed to extract information in different ways: fasta headers, filepaths, etc.

In [39]:
from pathlib import Path
import datetime

def safe_dict_access(d, key):
    # return the value stored under key, or None if the key is absent
    return d.get(key)


class fastaToGenbank():
    
    def __init__(self, path, data=None):
        # data holds optional overrides ('label', 'locus', 'definition')
        # that are added to the GenbankRecord object
        self.path = path
        self.data = data if data is not None else {}  # avoid a shared mutable default
        self.record = GenbankRecord(SeqIO.read(path, 'fasta'))
        self._update_record_with_parsed_data()
    
    @property
    def label(self):  # overwrite
        label = safe_dict_access(self.data, 'label')
        if label:
            return label
        else:
            return self.record.id
    
    @property
    def locus(self):  # overwrite
        locus = safe_dict_access(self.data, 'locus')
        if locus:
            return locus
        else:
            return self.record.id
    
    @property
    def definition(self):  # overwrite
        definition = safe_dict_access(self.data, 'definition')
        if definition:
            return definition
        else:
            return self.record.description
    
    def _update_record_with_parsed_data(self):
        # setattr is used so that any property setters on the record are honored
        parsed = {
            'label': self.label,
            'locus': self.locus,
            'definition': self.definition
        }
        for attribute, value in parsed.items():
            setattr(self.record, attribute, value)

    
    def write_record(self, output_path=None):
        # default to the input fasta path with a .gb suffix
        if not output_path:
            output_path = str(Path(self.path).with_suffix('.gb'))
        self.record.write(output_path)
    
    
    def add_label_feature(self, label=None):
        mod_date = datetime.date.strftime(datetime.datetime.now(), "%m/%d/%Y")
        if not label:
            label = self.label
        self.record.add_feature(
            x=0, 
            y=len(self.record), label=label, modification_date=mod_date)