# pyDNA assembly objects

This notebook explores how best to use pyDNA `Assembly` objects for Gibson assembly when fragments
may or may not already contain homologous regions.

[Link to docs](https://pydna.readthedocs.io/#pydna.assembly.Assembly)



In [18]:
from pydna.assembly import Assembly
from pydna.dseqrecord import Dseqrecord
from pydna.readers import read

In [19]:
a = Dseqrecord("acgatgctatactgCCCCCtgtgctgtgctcta")
b = Dseqrecord("tgtgctgtgctctaTTTTTtattctggctgtatc")
c = Dseqrecord("tattctggctgtatcGGGGGtacgatgctatactg")

`limit` is the shortest shared homology (in bp) that is considered when matching fragments.

In [20]:
a = Assembly((a,b,c), limit=14)

In [21]:
a

Assembly
fragments..: 33bp 34bp 35bp
limit(bp)..: 14
G.nodes....: 6
algorithm..: common_sub_strings

In [22]:
a.assemble_circular()

[Contig(o59), Contig(o59)]
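
To make the role of `limit` concrete, here is a rough standard-library sketch that measures the longest stretch shared by each pair of neighbouring fragments. It does not use pyDNA and is not how `Assembly` works internally; the helper `longest_shared` and the `frag_*` names are made up for illustration, and only the forward strand is considered.

```python
from difflib import SequenceMatcher

def longest_shared(x: str, y: str) -> str:
    """Return the longest substring common to x and y (case-insensitive)."""
    x, y = x.lower(), y.lower()
    matcher = SequenceMatcher(None, x, y, autojunk=False)
    match = matcher.find_longest_match(0, len(x), 0, len(y))
    return x[match.a:match.a + match.size]

# Same three sequences as in the cells above.
frag_a = "acgatgctatactgCCCCCtgtgctgtgctcta"
frag_b = "tgtgctgtgctctaTTTTTtattctggctgtatc"
frag_c = "tattctggctgtatcGGGGGtacgatgctatactg"

# Neighbouring pairs, wrapping from the last fragment back to the first.
pairs = [(frag_a, frag_b), (frag_b, frag_c), (frag_c, frag_a)]
shared_lengths = [len(longest_shared(x, y)) for x, y in pairs]

# Each shared stretch is counted once in the closed circle.
circular_length = sum(len(f) for f in (frag_a, frag_b, frag_c)) - sum(shared_lengths)

print(shared_lengths)   # shared homology per junction, in bp
print(circular_length)  # expected size of the circular product
```

The shortest shared stretch here is 14 bp, which is why `limit=14` is the largest value at which all three junctions can still be found, and the computed circle size agrees with the 59 in `Contig(o59)`.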

Does assembly work for overlapping ends?

In [45]:
d = Dseqrecord('aaaaaaaaaaaaatttttttttttttttGGGGG')
e = Dseqrecord('GGGGGcccccccccccccccccccccccccccc')

a1 = Assembly((d, e), limit=5)
a1.assemble_linear()
a1.assemble_linear()[0].seq.watson
#open('test.assemble.ascii', 'w').write(repr(a1.assemble_linear()[0]))

aaaaaaaaaaaaatttttttttttttttGGGGG
                            GGGGG
                            GGGGGcccccccccccccccccccccccccccc



Next, a slightly more complex case using random sequences.

In [24]:
def assemble_records(records, limit=10):
    return Assembly(records, limit).assemble_linear()[0].seq.watson

In [25]:
import numpy as np

def random_seq(n=25):
    return ''.join(np.random.choice(['a', 't', 'g', 'c'], n, replace=True))

rand_a = random_seq()
rand_b = random_seq()

# Shared 10 bp region, upper-cased so it stands out in the printed sequences
homologous_region = random_seq(10)
homologous_region = homologous_region.upper()

rand_a += homologous_region
rand_b = homologous_region + rand_b

rand_a = Dseqrecord(rand_a)
rand_b = Dseqrecord(rand_b)

In [26]:
print('Seq A:', rand_a.seq)
print('Seq B:', rand_b.seq)

Seq A: aacttggcgctagaaataactactaGCTTCAGCAT
Seq B: GCTTCAGCATgatccgaaacttccggtcatgtgtc


In [27]:
assemble_records((rand_a, rand_b))

aacttggcgctagaaataactactaGCTTCAGCATgatccgaaacttccggtcatgtgtc

The shared homology is exactly 10 bp, so increasing the limit above 10 should result in no assembly.

In [28]:
try:
    assemble_records((rand_a, rand_b), 11)
except Exception as e:
    print(f'Assembly failed raised {e}')

Assembly failed raised list index out of range
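
The `list index out of range` message most likely comes from indexing `[0]` into an empty list of contigs rather than from the assembly step itself. As a plain-Python sketch of the same behaviour (not pyDNA code; `join_at_overlap` is a hypothetical helper and only the forward strand is checked), the join below returns `None` explicitly when the terminal overlap is shorter than `limit`:

```python
from typing import Optional

def join_at_overlap(left: str, right: str, limit: int = 10) -> Optional[str]:
    """Join left and right at the longest suffix of left that is a prefix
    of right. Return None if that overlap is shorter than limit."""
    longest = min(len(left), len(right))
    # Try the longest candidate overlap first, stop at limit (at least 1 bp).
    for size in range(longest, max(limit, 1) - 1, -1):
        if left[-size:].lower() == right[:size].lower():
            return left + right[size:]
    return None

# Sequences copied from the printed output above.
seq_a = "aacttggcgctagaaataactactaGCTTCAGCAT"
seq_b = "GCTTCAGCATgatccgaaacttccggtcatgtgtc"

product = join_at_overlap(seq_a, seq_b, limit=10)     # 10 bp overlap is enough
no_product = join_at_overlap(seq_a, seq_b, limit=11)  # overlap too short

print(product)
print(no_product)
```

Returning `None` (or checking that the contig list is non-empty before indexing) makes "no product" easier to tell apart from a genuine error than catching a broad `Exception`.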


What happens when no homology is present?

In [29]:
rand_c = Dseqrecord(random_seq(100))
rand_d = Dseqrecord(random_seq(100))

try:
    assemble_records((rand_c, rand_d), 15)
except Exception as e:
    print(f'Assembly failed raised {e}')

Assembly failed raised list index out of range


Linear assembly fails in this case, which is good: this method can be used to identify fragments that would not require primers because of
preexisting homology. We also know the order in which the fragments are to be assembled, so we can avoid a more
costly all-against-all comparison between fragments for homology.
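
Because the fragment order is known, only adjacent pairs need to be checked: n - 1 comparisons for n fragments instead of every pair. A minimal sketch of that idea in plain Python follows; the function names and the third fragment are invented for illustration, only forward-strand terminal homology is tested, and a real implementation would presumably go through `Assembly` instead.

```python
def has_terminal_homology(left: str, right: str, limit: int) -> bool:
    """True if some suffix of left, at least limit bp long, is a prefix of right."""
    left, right = left.lower(), right.lower()
    longest = min(len(left), len(right))
    return any(
        left[-size:] == right[:size]
        for size in range(max(limit, 1), longest + 1)
    )

def junctions_needing_primers(fragments, limit=10):
    """Return the indices i for which fragments[i] and fragments[i + 1]
    share no terminal homology of at least limit bp."""
    return [
        i
        for i, (left, right) in enumerate(zip(fragments, fragments[1:]))
        if not has_terminal_homology(left, right, limit)
    ]

ordered = [
    "aacttggcgctagaaataactactaGCTTCAGCAT",  # Seq A from the run above
    "GCTTCAGCATgatccgaaacttccggtcatgtgtc",  # Seq B, shares 10 bp with Seq A
    "ttttttttttccccccccccaaaaaaaaaa",       # made-up fragment, no shared end
]

needs_primers = junctions_needing_primers(ordered, limit=10)
print(needs_primers)  # junction indices that would need homology added by primers
```

Junction `i` refers to the boundary between `ordered[i]` and `ordered[i + 1]`, so the result can be used directly to decide which fragments to amplify with homology-adding primers.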

The next step is to figure out how to implement this in an actual assembly when records come in as GenBank files. Another question is whether
`Dseqrecord` objects can be assembled in this way alongside amplicons.

In [30]:
pFC8_path = 'files/pFC8.gb'
init_path = 'files/test_init.gb'
exten_path = 'files/test_exten.gb'

In [32]:
pFC8, init, exten = read(pFC8_path), read(init_path), read(exten_path)
dir(init)

['__add__',
 '__bool__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__le___',
 '__len__',
 '__lt__',
 '__module__',
 '__mul__',
 '__ne__',
 '__new__',
 '__radd__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_per_letter_annotations',
 '_repr_html_',
 '_repr_pretty_',
 '_seq',
 '_set_per_letter_annotations',
 '_set_seq',
 'accession',
 'add_colors_to_features_for_ape',
 'add_feature',
 'annotations',
 'cai',
 'cas9',
 'circular',
 'copy',
 'cseguid',
 'cut',
 'cutters',
 'dbxrefs',
 'definition',
 'description',
 'dump',
 'express',
 'extract_feature',
 'features',
 'find',
 'find_aa',
 'find_aminoacids',
 'format',
 'from_SeqRecord',
 'from_string',
 'gc',
 'id',
 'isorf',
 'lcs',
 'letter_annotations',
 'l

In [35]:
init.seq

Dseq(-330)
CTCG..GTAA
GAGC..CATT