# Combine files for TWIST order

(c) 2020 Tom Röschinger. This work is licensed under a [Creative Commons Attribution License CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/). All code contained herein is licensed under an [MIT license](https://opensource.org/licenses/MIT).

***

In this notebook we combine the individually prepared sequences for the various experiments into the final Twist order. To do so, we add orthogonal primers to the ends of all inserts. We also add additional reverse primers for the different constructs within each experiment, so that they can be amplified individually. Finally, we may add restriction sites to the constructs so that the primers can be removed from the transcript.

In [1]:
import wgregseq
%load_ext autoreload
%autoreload 2

import Bio
from Bio.SeqIO import parse
from Bio.Seq import Seq
from Bio.Restriction import *

from itertools import compress

import pandas as pd
import numpy as np
import copy

import glob

### Import

Import all `csv` files in the `data/twist_order` folder in this repo.

In [2]:
file_list = sorted(glob.glob("../../../data/twist_order/*.csv"))
file_list

['../../../data/twist_order/lacI_sequences.csv',
 '../../../data/twist_order/lacUV5_tetOx_single_double_mutants.csv',
 '../../../data/twist_order/lacUV_mutants.csv',
 '../../../data/twist_order/natural_tet_promoters_mutated.csv',
 '../../../data/twist_order/purR_twist_sequences.csv',
 '../../../data/twist_order/twist_orbit_TF_del_avd_ovlp_long.csv',
 '../../../data/twist_order/twist_orbit_TF_del_avd_ovlp_short.csv',
 '../../../data/twist_order/twist_orbit_TF_del_first_last_long.csv',
 '../../../data/twist_order/twist_orbit_TF_del_first_last_short.csv',
 '../../../data/twist_order/twist_site_scrambles.csv',
 '../../../data/twist_order/twist_sys_scrambles_10.csv',
 '../../../data/twist_order/twist_sys_scrambles_2_16.csv']

Import each data frame.

In [3]:
df_list = []
for file in file_list:
    df_list.append(pd.read_csv(file, index_col=0))

Do a quick check to confirm that the necessary columns are present. We need a column with the sequences (`seq`), a column indicating whether primers have already been added (`primer_added`), and a column naming the construct each sequence belongs to (`construct`).

In [4]:
def check_file(df, filename):
    """Print a message for every required column that is missing from `df`."""
    for header in ['seq', 'primer_added', 'construct']:
        if header not in df.columns:
            print("{} not a column header in file {}.".format(header, filename))

Let's check all files in the folder.

In [5]:
for (df, name) in zip(df_list, file_list):
    check_file(df, name)

primer_added not a column header in file ../../../data/twist_order/purR_twist_sequences.csv.
construct not a column header in file ../../../data/twist_order/purR_twist_sequences.csv.
seq not a column header in file ../../../data/twist_order/twist_orbit_TF_del_avd_ovlp_long.csv.
primer_added not a column header in file ../../../data/twist_order/twist_orbit_TF_del_avd_ovlp_long.csv.
construct not a column header in file ../../../data/twist_order/twist_orbit_TF_del_avd_ovlp_long.csv.
seq not a column header in file ../../../data/twist_order/twist_orbit_TF_del_avd_ovlp_short.csv.
primer_added not a column header in file ../../../data/twist_order/twist_orbit_TF_del_avd_ovlp_short.csv.
construct not a column header in file ../../../data/twist_order/twist_orbit_TF_del_avd_ovlp_short.csv.
seq not a column header in file ../../../data/twist_order/twist_orbit_TF_del_first_last_long.csv.
primer_added not a column header in file ../../../data/twist_order/twist_orbit_TF_del_first_last_long.csv.
con

Let's go through the workflow for the first experiment.

### Restriction site

First, we add a restriction site to the sequences so that the reverse primers can be removed later on. This restriction site should leave the same sticky ends as the system into which the inserts are going to be integrated.

As an example, we add the site for `EcoRI` to the sequences.

In [6]:
EcoRI.site

'GAATTC'
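
As a minimal sketch of what adding the site amounts to, the snippet below uses plain string operations. The helpers `append_site` and `count_sites` are hypothetical and are not the `wgregseq.add_restriction_site` function, whose exact behavior is not shown in this notebook; the toy sequences are made up.

```python
ECORI_SITE = "GAATTC"  # recognition sequence returned by EcoRI.site above


def append_site(seqs, site=ECORI_SITE):
    """Hypothetical helper: append a restriction site to the 3' end of each sequence."""
    return [seq + site for seq in seqs]


def count_sites(seq, site=ECORI_SITE):
    """Count (possibly overlapping) occurrences of `site` in `seq`."""
    return sum(seq.startswith(site, i) for i in range(len(seq) - len(site) + 1))


toy_seqs = ["ACGTACGTAC", "TTGAATTCAA"]  # made-up inserts, the second already contains the site
with_site = append_site(toy_seqs)
site_counts = [count_sites(s) for s in with_site]
print(with_site)
print(site_counts)
```

An insert that already contains the recognition sequence ends up with more than one site, so it would be cut inside the insert as well. This is why the constructs are scanned for restriction sites further down.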

### Primers

Next, we add a reverse primer for each construct. For this purpose we simply use the `add_primers` function with the additional keyword argument `rev_only`.

In [7]:
df_0 = df_list[0]
seqs = np.array(df_0['seq'].values, dtype=object)
#seqs = wgregseq.add_restriction_site(seqs, EcoRI)
seqs[0]

primer = 22
forward_primers_0 = [(primer, 0)] * len(seqs)
reverse_primers_0 = [(primer, len(seqs[0])+20)] * len(seqs)
seqs = wgregseq.add_primers(seqs, primer)

primer += 1

In [8]:
reverse_primers_1 = []
for construct in df_0['construct'].unique():
    indices = np.asarray((df_0['construct'] == construct).values == True).nonzero()
    reverse_primers_1.extend([(primer, len(seqs[indices][0]))] * np.size(indices[0]))
    seqs[indices] = wgregseq.add_primers(seqs[indices], primer, rev_only=True, autocomplete=True)
    primer += 1

In [9]:
len(reverse_primers_1)

1001

In [10]:
sub_df = pd.DataFrame({'seq': seqs, 'forward_primers_0': forward_primers_0, 'reverse_primers_0': reverse_primers_0, 'reverse_primers_1': reverse_primers_1})
sub_df

| | seq | forward_primers_0 | reverse_primers_0 | reverse_primers_1 |
|---|---|---|---|---|
| 0 | ATGACCGTTAGTGAGGCTAGTCGAGTTTACACTTTATGCTTCCGGC... | (22, 0) | (22, 82) | (23, 102) |
| 1 | ATGACCGTTAGTGAGGCTAGTCGAGTTTACACTTTATGCTTCCGGC... | (22, 0) | (22, 82) | (23, 102) |
| 2 | ATGACCGTTAGTGAGGCTAGTCGAGTTTACACTTTATGCTTCCGGC... | (22, 0) | (22, 82) | (23, 102) |
| 3 | ATGACCGTTAGTGAGGCTAGTCGAGTTTACACTTTATGCTTCCGGC... | (22, 0) | (22, 82) | (23, 102) |
| 4 | ATGACCGTTAGTGAGGCTAGTCGAGTTTACACTTTATGCTTCCGGC... | (22, 0) | (22, 82) | (23, 102) |
| ... | ... | ... | ... | ... |
| 996 | ATGACCGTTAGTGAGGCTAGTCGAGTTTACACTTTATGCTTCCGGC... | (22, 0) | (22, 82) | (23, 102) |
| 997 | ATGACCGTTAGTGAGGCTAGTCGAGTTTACACTTTATGCTTCCGGC... | (22, 0) | (22, 82) | (23, 102) |
| 998 | ATGACCGTTAGTGAGGCTAGTCGAGTTTACACTTTATGCTTCCGGC... | (22, 0) | (22, 82) | (24, 102) |
| 999 | ATGACCGTTAGTGAGGCTAGTCGAGTTTACACTTTATGCTTCCGGC... | (22, 0) | (22, 82) | (25, 102) |
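
Each primer entry is a `(primer index, start position)` tuple. The arithmetic behind these positions can be sketched as follows. This is an illustration under two assumptions that are inferred from the notebook rather than taken from `wgregseq`: primers are 20 bp long (the `+ 20` offset used above), and the bases after the last reverse primer are filler up to the total oligo length. The function `oligo_layout` is hypothetical.

```python
PRIMER_LEN = 20   # assumed primer length, inferred from the `+ 20` offset used above
OLIGO_LEN = 200   # total oligo length targeted in this notebook


def oligo_layout(insert_len, primer_len=PRIMER_LEN, total_len=OLIGO_LEN):
    """Return (start, end) coordinates of each segment of one oligo."""
    fwd = (0, primer_len)
    insert = (fwd[1], fwd[1] + insert_len)
    rev_0 = (insert[1], insert[1] + primer_len)   # reverse primer shared by the experiment
    rev_1 = (rev_0[1], rev_0[1] + primer_len)     # construct-specific reverse primer
    padding = (rev_1[1], total_len)               # assumed filler up to the full length
    if padding[0] > total_len:
        raise ValueError("insert too long for the requested oligo length")
    return {
        "forward_0": fwd,
        "insert": insert,
        "reverse_0": rev_0,
        "reverse_1": rev_1,
        "padding": padding,
    }


# An insert length of 62 reproduces the start positions 0, 82 and 102 listed above.
layout = oligo_layout(insert_len=62)
for name, (start, end) in layout.items():
    print(f"{name:>10}: {start:3d}-{end:3d}")
```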


In [11]:
wgregseq.check_primers_pool_df(sub_df)

True

Now we add a primer pair to both sides of all sequences. We also add random base pairs at the end of the sequences to reach the correct total length. (Don't execute this cell twice without re-running the two previous ones.)

Now the sequences have a total length of 200 base pairs.
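
The padding step can be sketched in plain Python. The helper `pad_to_length` below is hypothetical and only illustrates the idea of filling up to 200 bp with random bases; it is not the implementation behind the `autocomplete` keyword, and the toy sequence is made up.

```python
import random


def pad_to_length(seq, total_len=200, rng=None):
    """Hypothetical helper: append random bases until `seq` has `total_len` bases."""
    rng = rng or random.Random()
    if len(seq) > total_len:
        raise ValueError(f"sequence has {len(seq)} bases, more than {total_len}")
    filler = "".join(rng.choice("ACGT") for _ in range(total_len - len(seq)))
    return seq + filler


rng = random.Random(0)  # fixed seed so the toy example is reproducible
toy = "ATGACCGTTAGTGAGGCTAG" + "A" * 102  # made-up 122 bp oligo before padding
padded = pad_to_length(toy, rng=rng)
print(len(padded))
```

In this sketch, padding a sequence that is already 200 bp long returns it unchanged, so running it twice is harmless.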

In [12]:
seqs[0][:20]

'ATGACCGTTAGTGAGGCTAG'

In [13]:
seqs[0][-20:]

'AACCTATAAAACGGTTCCTG'

Now we need to add these sequences to a final data frame.

Final check function:
- Digital PCR:
    - Choose a primer pair and check what is amplified => should be the construct only
- Distance between primers consistent
- Check for restriction sites in constructs and primers
- Check total length equals 200
- Unique oligos in subpools
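
The digital PCR item in the list above could look roughly like the sketch below. It assumes that reverse primers are written 5' to 3' and therefore appear in the oligo as their reverse complement; the functions `reverse_complement` and `amplicon` are hypothetical, and all sequences are made up for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def amplicon(template, fwd_primer, rev_primer):
    """Return the region a primer pair would amplify, or None if a site is missing.

    The forward primer must match the template directly. The reverse primer
    must match as its reverse complement, downstream of the forward primer.
    """
    start = template.find(fwd_primer)
    if start == -1:
        return None
    rev_site = reverse_complement(rev_primer)
    end = template.find(rev_site, start + len(fwd_primer))
    if end == -1:
        return None
    return template[start:end + len(rev_site)]


# Made-up toy oligo: forward primer, insert, two nested reverse primers, filler.
fwd = "ACGTTGCA"
insert = "GGGCCCTTT"
rev_a = "TTAGGCAT"
rev_b = "CATCGGAA"
template = fwd + insert + reverse_complement(rev_a) + reverse_complement(rev_b) + "ACAC"

product_a = amplicon(template, fwd, rev_a)
product_b = amplicon(template, fwd, rev_b)
print(product_a)
print(product_b)
```

A primer pair that does not belong to the oligo yields `None`, which is the behavior one would want when checking that each pair amplifies only its own construct.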


In [14]:
# Import the forward and reverse primer that were added
forward = wgregseq.import_primer_fwd(22)
reverse = wgregseq.import_primer_rev(22)

In [15]:
wgregseq.check_primer_seq([sub_df.iloc[0]['seq']], forward, 0)

True

In [16]:
SgeI.site

'CNNG'

In [19]:
wgregseq.scan_enzymes_print(sub_df.seq, ["SacI", "SalI", "XhoI", "SbfI", "ApaI"])

XhoI :  21.0
SalI :  20.0
ApaI :  15.0
SacI :  15.0
SbfI :  1.0


In [31]:
def check_pool_df(df, enzymes):
    """Run the final checks on a pool: primers, oligo length, uniqueness, restriction sites."""
    if wgregseq.check_primers_pool_df(df):
        print("Primer check passed.\n")
    else:
        print("Primer check DID NOT pass!\n")

    # Every oligo in the order has to be exactly 200 bp long.
    if any(len(seq) != 200 for seq in df.seq):
        print("Oligo length check DID NOT pass!\n")
    else:
        print("Oligo length check passed.\n")

    for name, sub_df in df.groupby("construct"):
        # Primer columns hold (primer index, position) tuples; use the largest position of each kind.
        rev_primer_columns = [x for x in sub_df.columns if "reverse_primer" in x]
        end = np.max([x for rev_column in rev_primer_columns for (_, x) in sub_df[rev_column]])
        fwd_primer_columns = [x for x in sub_df.columns if "forward_primer" in x]
        start = np.max([x for fwd_column in fwd_primer_columns for (_, x) in sub_df[fwd_column]])

        # Slice from the forward primer start to 20 bases past the last reverse primer start.
        seqs = [seq[start:end+20] for seq in sub_df.seq]
        if len(np.unique(seqs)) != len(seqs):
            print("Uniqueness check DID NOT pass for construct {}!\n".format(name))
        else:
            print("Uniqueness check passed for construct {}.\n".format(name))

        print("Restriction enzyme sites for construct {}:".format(name))
        wgregseq.scan_enzymes_print(seqs, enzymes)
        print("")

In [32]:
primer = 22
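def do_all(file_list, primer):
    """Add primers to the sequences of every file and return one combined data frame.

    `primer` is the index of the first primer to use. In the returned frame,
    the `construct` column holds the index of the source file in `file_list`.
    """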
    df_list = []
    for i, file in enumerate(file_list):
        df = pd.read_csv(file, index_col=0)
        seqs = np.array(df['seq'].values, dtype=object)


        forward_primers_0 = [(primer, 0)] * len(seqs)
        reverse_primers_0 = [(primer, len(seqs[0])+20)] * len(seqs)
        seqs = wgregseq.add_primers(seqs, primer)

        primer += 1

        reverse_primers_1 = []
        for construct in df['construct'].unique():
            indices = np.asarray((df['construct'] == construct).values == True).nonzero()
            reverse_primers_1.extend([(primer, len(seqs[indices][0]))] * np.size(indices[0]))
            seqs[indices] = wgregseq.add_primers(seqs[indices], primer, rev_only=True, autocomplete=True)
            primer += 1

        df_list.append(
            pd.DataFrame(
                {'seq': seqs, 
                 'forward_primers_0': forward_primers_0, 
                 'reverse_primers_0': reverse_primers_0, 
                 'reverse_primers_1': reverse_primers_1,
                 'construct': i
                }))
         
    return pd.concat(df_list, ignore_index=True)

In [33]:
df = do_all(
    ['../../../data/twist_order/lacI_sequences.csv', 
     '../../../data/twist_order/lacUV5_tetOx_single_double_mutants.csv',
     '../../../data/twist_order/lacUV_mutants.csv',
     '../../../data/twist_order/natural_tet_promoters_mutated.csv'],
    primer
)

In [34]:
check_pool_df(df, ["SacI", "SalI", "XhoI", "SbfI", "ApaI"])

Primer check passed.

Oligo length check passed.

Uniqueness check passed for construct 0.

Restriction enzyme sites for construct 0:
ApaI :  0.0
SbfI :  0.0
XhoI :  0.0
SalI :  0.0
SacI :  0.0

Uniqueness check passed for construct 1.

Restriction enzyme sites for construct 1:
ApaI :  11.0
SbfI :  0.0
XhoI :  0.0
SalI :  0.0
SacI :  0.0

Uniqueness check passed for construct 2.

Restriction enzyme sites for construct 2:
SalI :  21.0
ApaI :  0.0
SbfI :  0.0
XhoI :  0.0
SacI :  0.0

Uniqueness check passed for construct 3.

Restriction enzyme sites for construct 3:
ApaI :  11.0
SbfI :  0.0
XhoI :  0.0
SalI :  0.0
SacI :  0.0

