# Combine files for TWIST order

(c) 2020 Tom Röschinger. This work is licensed under a [Creative Commons Attribution License CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/). All code contained herein is licensed under an [MIT license](https://opensource.org/licenses/MIT).

***

In this notebook we combine the individually prepared sequences for the various experiments into the final Twist order. To do so, we add orthogonal primers to the ends of all inserts. We also add an additional reverse primer for each construct within an experiment, so that the constructs can be amplified individually. Optionally, we add restriction sites to the constructs so that the primers can be removed from the transcript.

In [33]:
import wgregseq
%load_ext autoreload
%autoreload 2

import Bio
from Bio.Restriction import *

from itertools import compress

import pandas as pd
import numpy as np
import copy

import glob


### Import

Import all `csv` files in the `data/twist_order` folder in this repo.

In [169]:
file_list = sorted(glob.glob("../../../data/twist_order/*.csv"))
file_list

['../../../data/twist_order/lacI_sequences.csv',
 '../../../data/twist_order/lacUV5_tetOx_single_double_mutants.csv',
 '../../../data/twist_order/lacUV_mutants.csv',
 '../../../data/twist_order/natural_tet_promoters_mutated.csv']

Import each data frame.

In [206]:
df_list = [pd.read_csv(file, index_col=0) for file in file_list]

Do a quick check to confirm that each file has the necessary columns: the sequences (`seq`), a flag indicating whether primers have already been added (`primer_added`), and the construct each sequence belongs to (`construct`).

In [207]:
def check_file(df, filename):
    """Print a message for each required column that is missing from `df`."""
    for header in ['seq', 'primer_added', 'construct']:
        if header not in df.columns:
            print("{} not a column header in file {}.".format(header, filename))

Let's check all files in the folder.

In [208]:
for (df, name) in zip(df_list, file_list):
    check_file(df, name)
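
If the check needs to be used programmatically rather than only printed, a variant can return the missing headers instead. The following is a minimal sketch; the helper `missing_columns` and the constant `REQUIRED_COLUMNS` are names introduced here for illustration and are not part of `wgregseq`.

```python
# Hypothetical helper, not part of wgregseq.
REQUIRED_COLUMNS = ("seq", "primer_added", "construct")


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# Example with a plain list; for a data frame, pass `df.columns` instead.
print(missing_columns(["seq", "mutations", "energy", "bin", "construct"]))
```

An empty list means the file is ready for the following steps.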

Let's go through the workflow for the first experiment.

In [228]:
df_0 = df_list[0]
df_0

| | seq | mutations | energy | bin | construct | primer_added |
|---|---|---|---|---|---|---|
| 0 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAATTG... | 2 | -0.293762 | 0 | lacUV5+O1_mutant | False |
| 1 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAATTG... | 2 | -0.108405 | 0 | lacUV5+O1_mutant | False |
| 2 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAATTG... | 2 | -0.967390 | 0 | lacUV5+O1_mutant | False |
| 3 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAATTG... | 2 | -0.543543 | 0 | lacUV5+O1_mutant | False |
| 4 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAATTG... | 2 | -0.656303 | 0 | lacUV5+O1_mutant | False |
| ... | ... | ... | ... | ... | ... | ... |
| 1122 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAATTG... | 1 | 0.745840 | x | lacUV5+O1_mutant | False |
| 1123 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAATTG... | 1 | -0.506194 | x | lacUV5+O1_mutant | False |
| 1124 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAATTG... | 0 | 0.000000 | x | lacUV5+O1 | False |
| 1125 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAAATG... | 0 | 1.400000 | x | lacUV5+O2 | False |


### Restriction site

First, we add the restriction site to the sequences so that the reverse primers can be removed later on. This restriction site should leave the same sticky ends as the system into which the inserts are going to be integrated.

As an example, we add the site for `EcoRI` to the sequences.

In [212]:
EcoRI.site

'GAATTC'

Let's add the restriction site to the sequences.

In [265]:
seqs = np.array(df_0['seq'].values, dtype=object)
seqs = wgregseq.add_restriction_site(seqs, EcoRI)
seqs[0]

'TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAATTGTGAGCGGgTAACAATgGAATTC'
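
Any additional occurrence of the recognition sequence inside an insert would also be cut, so it can be worth counting how often the site appears in each sequence. The sketch below is a plain-Python check that does not use `wgregseq` or Biopython; `count_site` is a hypothetical helper introduced here. It searches only the given strand, which is sufficient for palindromic sites such as `GAATTC`, and it ignores case because mutated bases are written in lower case above.

```python
# Hypothetical helper, not part of wgregseq.
def count_site(seq, site="GAATTC"):
    """Count (possibly overlapping) occurrences of `site` in `seq`, ignoring case."""
    seq = seq.upper()
    site = site.upper()
    count = 0
    start = seq.find(site)
    while start != -1:
        count += 1
        # Continue one position further so overlapping hits are found too.
        start = seq.find(site, start + 1)
    return count


# Toy example: one internal site plus the appended one.
toy = "ACGTGAATTCACGT" + "GAATTC"
print(count_site(toy))
```

After the site has been appended, a count of exactly one per sequence indicates that the appended site is the only one.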

### Primers

Next, we add a reverse primer for each construct. For this purpose we use the `add_primers` function with the additional keyword argument `rev_only`.

In [266]:
primer = 20  # number of the first reverse primer
for construct in df_0['construct'].unique():
    # Boolean mask selecting the sequences that belong to this construct
    mask = (df_0['construct'] == construct).values
    seqs[mask] = wgregseq.add_primers(seqs[mask], primer, rev_only=True)
    primer += 1  # the next construct gets the next primer
seqs

array(['TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAATTGTGAGCGGgTAACAATgGAATTCCTGAATGTTCGGGATTTCCC',
       'TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAATTGTGAGCGtATAACAATgGAATTCCTGAATGTTCGGGATTTCCC',
       'TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAATTGTGAGCGGATAACAAacGAATTCCTGAATGTTCGGGATTTCCC',
       ...,
       'TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAATTGTGAGCGGATAACAATTGAATTCGCGGCTTGATAGTTGCATTA',
       'TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAAATGTGAGCGAGTAACAACCGAATTCTCCAATATCCTGTCCGTCTG',
       'TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGGGCAGTGAGCGCAACGCAATTGAATTCATGGGTAATTTCATGCCACG'],
      dtype=object)
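
The loop assigns consecutive primer numbers to the constructs in their order of first appearance, but it does not record which construct received which primer. A minimal sketch of such a record is shown below; `assign_primer_indices` is a hypothetical helper and the construct names in the example are placeholders, not the ones from the data.

```python
# Hypothetical helper, not part of wgregseq.
def assign_primer_indices(constructs, first_index=20):
    """Map each construct to a primer number, in order of first appearance."""
    mapping = {}
    for construct in constructs:
        if construct not in mapping:
            # The n-th new construct gets primer number first_index + n.
            mapping[construct] = first_index + len(mapping)
    return mapping


# Placeholder construct labels, one per sequence.
labels = ["construct_A", "construct_A", "construct_B", "construct_C", "construct_B"]
primer_by_construct = assign_primer_indices(labels)
print(primer_by_construct)

# Next unused primer number, analogous to `primer` after the loop above.
next_primer = 20 + len(primer_by_construct)
print(next_primer)
```

Keeping this mapping alongside the order makes it easier to look up the reverse primer needed to amplify a given construct later.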

Now we add a primer pair to both sides of all sequences. We also add random bases at the end of the sequences to reach the correct total length. (Do not execute this cell twice without re-running the two previous cells; otherwise the primers are added to the sequences a second time.)

In [267]:
seqs = wgregseq.add_primers(seqs, primer, autocomplete=True)
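
To illustrate the idea of filling a sequence up to a fixed length, here is a minimal sketch in plain Python. It is not the implementation used by `wgregseq.add_primers`; the helper `pad_to_length`, its parameters, and the choice to pad at the very end of the string are assumptions made for this example.

```python
import random


# Hypothetical helper, not the wgregseq implementation.
def pad_to_length(seq, target_len=200, rng=None):
    """Append random bases to `seq` until it is `target_len` long."""
    rng = rng or random.Random()
    missing = target_len - len(seq)
    if missing < 0:
        raise ValueError("sequence is longer than the target length")
    return seq + "".join(rng.choice("ACGT") for _ in range(missing))


# A seeded generator makes the padding reproducible.
padded = pad_to_length("TCGAGTTTACAC", target_len=20, rng=random.Random(0))
print(len(padded))
```

Passing a seeded `random.Random` instance is a design choice for reproducibility, so that the same order can be regenerated later.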

Now the sequences have a total length of 200 base pairs.

In [268]:
len(seqs[0])

200
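
The cell above only inspects the first sequence. A short sketch for summarizing the lengths of all sequences follows; `length_counts` is a hypothetical helper, and the example list stands in for `seqs`.

```python
from collections import Counter


# Hypothetical helper, not part of wgregseq.
def length_counts(seqs):
    """Return how many sequences there are of each length."""
    return Counter(len(s) for s in seqs)


# Stand-in data; in the notebook, pass `seqs` instead.
example = ["ACGT" * 50, "A" * 200, "ACGT"]
counts = length_counts(example)
print(counts)

# Sequences that deviate from the expected length of 200.
print([i for i, s in enumerate(example) if len(s) != 200])
```

If every sequence has the intended length, the counter contains a single key.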

Now we need to add these sequences to a final data frame.