# Generating LacI operator constructs

(c) 2020 Tom Röschinger. This work is licensed under a [Creative Commons Attribution License CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/). All code contained herein is licensed under an [MIT license](https://opensource.org/licenses/MIT).

***

In this notebook we generate mutants of the O1 operator sequence, which is a strong binding site for LacI.

In [96]:
import wgregseq
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import bebi103
import copy

from bokeh.plotting import figure
from bokeh.models import ColumnDataSource
from bokeh.layouts import grid

import bokeh.io

bokeh.io.output_notebook()


We load the wild-type operator sequences and the O1 energy matrix from Barnes 2019, as well as the lacUV5 promoter (RNAP binding site) sequence from Brewster 2012.

In [153]:
O1 = 'AATTGTGAGCGGATAACAATT'
O2 = 'AAATGTGAGCGAGTAACAACC'
O3 = 'GGCAGTGAGCGCAACGCAATT'

lacUV5 = 'TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGG'

lacUV5_O1 = lacUV5 + O1
lacUV5_O2 = lacUV5 + O2
lacUV5_O3 = lacUV5 + O3

O1_matrix = np.load("../../../../data/O1_matrix.npy")

Import the dictionary which transforms DNA into integers for indexing.

In [154]:
seq_dict, _ = wgregseq.choose_dict("dna")

We write a function that evaluates the energy matrix for a sequence. To do so, each letter is converted into an integer that indexes the matching column of the matrix.

In [155]:
def energy_from_sequence(sequence, matrix):
    seq_list = list(sequence.upper())
    num_seq = [seq_dict[x] for x in seq_list]
    energy = sum([matrix[i, num_seq[i]] for i in range(len(sequence))])
    return energy
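As a quick check of how this lookup works, the sketch below evaluates a small made-up matrix. The mapping `toy_dict` (A, C, G, T to 0 through 3), the matrix values, and the helper name `toy_energy` are assumptions chosen for illustration; the ordering returned by `wgregseq.choose_dict` and the values in `O1_matrix` may differ.

```python
import numpy as np

# Hypothetical letter-to-column mapping; the dictionary returned by
# wgregseq.choose_dict("dna") may use a different ordering.
toy_dict = {"A": 0, "C": 1, "G": 2, "T": 3}

# Made-up 3 x 4 energy matrix: one row per position, one column per base.
toy_matrix = np.array([
    [0.0, 1.0, 2.0, 3.0],
    [0.5, 0.0, 1.5, 2.5],
    [1.0, 2.0, 0.0, 0.5],
])


def toy_energy(sequence, matrix, mapping=toy_dict):
    """Sum the matrix entries selected by the base at each position."""
    cols = np.array([mapping[x] for x in sequence.upper()])
    rows = np.arange(len(cols))
    # Fancy indexing picks matrix[i, cols[i]] for every position i at once.
    return matrix[rows, cols].sum()


# "ACG" selects the entries 0.0, 0.0 and 0.0 of the toy matrix.
energy_ref = toy_energy("ACG", toy_matrix)
# "tCG" selects 3.0, 0.0 and 0.0; lowercase letters are upper-cased first.
energy_mut = toy_energy("tCG", toy_matrix)
```

Because the sequence is upper-cased before the lookup, mutants written with lowercase letters are scored the same way as uppercase sequences.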

Now we can generate mutants. We generate all single, double, and triple mutants using `wgregseq.mutations_det`, the function which creates all possible mutants and then chooses among them, so that no duplicates occur.

In [100]:
mutants_single = wgregseq.mutations_det(O1, mut_per_seq=1)
mutants_double = wgregseq.mutations_det(O1, mut_per_seq=2)
mutants_triple = wgregseq.mutations_det(O1, mut_per_seq=3)
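To make the size of these sets concrete, here is a minimal sketch of exhaustive enumeration. It is not the implementation of `wgregseq.mutations_det`, whose internals are not shown in this notebook; the function name `all_mutants` is hypothetical, and writing mutated bases in lowercase is an assumption that mirrors the sequences printed in the tables further down.

```python
from itertools import combinations, product


def all_mutants(sequence, mut_per_seq, alphabet="ACGT"):
    """Enumerate every sequence that differs from `sequence` at exactly
    `mut_per_seq` positions. Mutated bases are written in lowercase."""
    sequence = sequence.upper()
    mutants = []
    for positions in combinations(range(len(sequence)), mut_per_seq):
        # For each chosen position, list the bases other than the wild-type one.
        choices = [[b for b in alphabet if b != sequence[p]] for p in positions]
        for new_bases in product(*choices):
            seq = list(sequence)
            for p, b in zip(positions, new_bases):
                seq[p] = b.lower()
            mutants.append("".join(seq))
    return mutants


O1 = "AATTGTGAGCGGATAACAATT"
singles = all_mutants(O1, 1)
doubles = all_mutants(O1, 2)
```

For a sequence of length 21 there are C(21, k) · 3^k mutants with exactly k mutations: 63 single, 1890 double, and 35910 triple mutants. Enumeration of this kind visits each mutant once, so the result contains no duplicates.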

For higher-order mutants we don't have to worry as much about duplicates and can generate mutants randomly (while keeping the number of mutations fixed). We still keep only the unique sequences, since we don't want duplicates in the final order.

In [104]:
mutants_quadruple = np.unique(wgregseq.mutations_rand(O1, rate=0.2, num_mutants=100000, number_fixed=True))
mutants_quintuple = np.unique(wgregseq.mutations_rand(O1, rate=0.25, num_mutants=100000, number_fixed=True))
mutants_sextuple = np.unique(wgregseq.mutations_rand(O1, rate=0.3, num_mutants=100000, number_fixed=True))
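The idea behind random generation with a fixed number of mutations can be sketched as follows. This is an illustration under stated assumptions, not the code of `wgregseq.mutations_rand`: the function name `random_mutants`, its parameters, and the seeding are hypothetical.

```python
import numpy as np


def random_mutants(sequence, n_mut, num_mutants, seed=0, alphabet="ACGT"):
    """Draw `num_mutants` sequences with exactly `n_mut` mutations each
    and return the unique ones. Mutated bases are written in lowercase."""
    rng = np.random.default_rng(seed)
    sequence = sequence.upper()
    out = []
    for _ in range(num_mutants):
        # Choose n_mut distinct positions to mutate.
        positions = rng.choice(len(sequence), size=n_mut, replace=False)
        seq = list(sequence)
        for p in positions:
            # Replace the wild-type base with one of the three other bases.
            options = [b for b in alphabet if b != sequence[p]]
            seq[p] = options[rng.integers(len(options))].lower()
        out.append("".join(seq))
    # Independent draws can repeat, so keep each sequence only once.
    return np.unique(out)


O1 = "AATTGTGAGCGGATAACAATT"
quads = random_mutants(O1, n_mut=4, num_mutants=500)
```

For comparison, a 21 bp sequence has C(21, 4) · 3^4 = 484785 possible quadruple mutants, so 100000 independent draws are expected to contain repeats, which is why the unique filter is applied.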

Let's write all the mutants into a dataframe. We leave the single mutants out for now: we want to order all of them regardless of their energy, so we add them back to the pool at the end.

In [113]:
df_1 = pd.DataFrame({"seq": mutants_single, "mutations": 1})
df_2 = pd.DataFrame({"seq": mutants_double, "mutations": 2})
df_3 = pd.DataFrame({"seq": mutants_triple, "mutations": 3})
df_4 = pd.DataFrame({"seq": mutants_quadruple, "mutations": 4})
df_5 = pd.DataFrame({"seq": mutants_quintuple, "mutations": 5})
df_6 = pd.DataFrame({"seq": mutants_sextuple, "mutations": 6})
df = pd.concat([df_2, df_3, df_4, df_5, df_6], ignore_index=True)

Now we can compute the binding energy predicted by the energy matrix for every mutant and add it to the dataframe.

In [114]:
df["energy"] = df['seq'].apply(energy_from_sequence, args= (O1_matrix, ))
df.tail()

| | seq | mutations | energy |
|---|---|---|---|
| 327006 | ttgctTGAGCGGAgAACAATT | 6 | 12.892082 |
| 327007 | ttggGTGAGCGGATcAtAATT | 6 | 6.80303 |
| 327008 | ttggGTGAGCGaATcACAATT | 6 | 5.820889 |
| 327009 | ttggGTGAaCGGATAACAAgT | 6 | 7.533097 |
| 327010 | ttggGgGgGCGGATAACAATT | 6 | 11.719174 |


To choose which mutants to order, we first define bins of binding energy. Each bin is 1 k_BT wide, and together the bins cover the range from -1 to 8 k_BT.

In [115]:
gap=0.5
Min, Max = -0.5, 7.5
bins = [(i-gap, i+gap) for i in np.linspace(Min, Max, num=9)]
bins

[(-1.0, 0.0),
 (0.0, 1.0),
 (1.0, 2.0),
 (2.0, 3.0),
 (3.0, 4.0),
 (4.0, 5.0),
 (5.0, 6.0),
 (6.0, 7.0),
 (7.0, 8.0)]

Let's have a look at how the mutants are distributed. The orange bands mark the energy bins.

In [116]:
p_list = []
for i in range(2, 7):
    p = figure(title="mutations = {}".format(i), frame_height=200, frame_width=200,
               tools='', background_fill_color="#fafafa", y_axis_label="density",
              x_axis_label="ΔE [k_BT]")
    energies = df.loc[df.mutations == i, "energy"].values
    hist, edges = np.histogram(energies, density=True, bins=50)
    p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
          fill_color="navy", line_color="white", alpha=0.5)
    for (b1, b2) in bins:
        p.line(x=[b1, b1], y=[0, np.max(hist)], color="orange")
        p.line(x=[b2, b2], y=[0, np.max(hist)], color="orange")
        p.varea(x=[b1, b2], y1=[0, 0], y2=[np.max(hist), np.max(hist)], alpha=0.2, color="orange")
        
    p_list.append(p)
    
bokeh.io.show(
    grid(p_list, nrows=1)
)
    

Select mutants that fall within the bins.

In [117]:
df_list = []
for i, (x,y) in enumerate(bins):
    temp_df = copy.deepcopy(df.loc[[x < E < y for E in df["energy"] ], :])
    temp_df["bin"] = np.ones(len(temp_df), dtype=int) * i
    df_list.append(temp_df)
    
binned_df = pd.concat(df_list, ignore_index=True)
binned_df.head(10)

| | seq | mutations | energy | bin |
|---|---|---|---|---|
| 0 | cATTGTGAGCGGATcACAATT | 2 | -0.069192 | 0 |
| 1 | cATTGTGAGCGGATAACAAaT | 2 | -0.028948 | 0 |
| 2 | cATTGTGAGCGGATAACAATg | 2 | -0.621844 | 0 |
| 3 | AcTTGTGAGCGGATAACAATc | 2 | -0.520312 | 0 |
| 4 | AAaTGTGAGCGGATAACAATc | 2 | -0.637571 | 0 |
| 5 | AAaTGTGAGCGGATAACAATg | 2 | -0.089674 | 0 |
| 6 | AATTGTGAGCGGATcACAATg | 2 | -0.459736 | 0 |
| 7 | AATTGTGAGCGGATAcCAATc | 2 | -0.543543 | 0 |
| 8 | AATTGTGAGCGGATAACAAac | 2 | -0.96739 | 0 |
| 9 | AATTGTGAGCGGATAACAAag | 2 | -0.419492 | 0 |
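The same kind of bin assignment can be sketched with `pd.cut`, which labels every row in one pass. The toy energies below are made up. One difference is worth noting: `pd.cut` uses right-closed intervals by default, whereas the loop above uses strict inequalities on both sides, so a value lying exactly on a bin edge is treated differently.

```python
import numpy as np
import pandas as pd

# Made-up energies; the last one lies outside all bins.
toy = pd.DataFrame({"energy": [-0.62, 0.4, 1.0, 3.3, 7.9, 9.1]})

# Bin edges -1, 0, ..., 8 give nine bins that are each 1 wide.
edges = np.arange(-1.0, 9.0, 1.0)

# labels=False returns the integer bin index; values outside the edges become NaN.
toy["bin"] = pd.cut(toy["energy"], bins=edges, labels=False)

# Drop rows that fall outside all bins and store the bin index as an integer.
binned = toy.dropna(subset=["bin"]).astype({"bin": int})
```

Here the energy 1.0 is assigned to the bin (0, 1], while the strict comparison `x < E < y` would leave it out of every bin.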


Now we need to select mutants from the bins. Within each bin we would like an equal number of sequences for each number of mutations. However, some bins contain only a few mutants of a given type. We therefore set a maximum number of sequences per mutation type per bin: if a bin contains fewer sequences of a type than this maximum, we take all of them; otherwise we randomly select the maximum number.

In [118]:
def select_seqs(df, ind_bin, num_seqs):
    # Split the quota for this bin evenly across the mutation counts in df.
    mutation_counts = df["mutations"].unique()
    seqs_per_mut = int(num_seqs // len(mutation_counts))

    selected = []
    for i in mutation_counts:
        # All sequences with i mutations that fall into this bin.
        subset = df.loc[(df["mutations"] == i) & (df["bin"] == ind_bin), :]
        if len(subset) < seqs_per_mut:
            # Sparsely populated (or empty): keep every sequence.
            selected.append(subset)
        else:
            # Otherwise draw seqs_per_mut sequences without replacement.
            chosen = np.random.choice(subset.index.to_numpy(), size=seqs_per_mut, replace=False)
            # Select by label, so the result does not depend on the index layout of df.
            selected.append(df.loc[chosen])
    return pd.concat(selected, ignore_index=True)
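The capping logic can also be illustrated on a toy table. The sketch below handles all bins in one call and uses `DataFrame.sample` instead of `np.random.choice`; the helper name `cap_per_group`, the `cap` parameter, and the toy data are assumptions for illustration. In `select_seqs`, the cap corresponds to `num_seqs` divided by the number of mutation types, rounded down.

```python
import pandas as pd


def cap_per_group(df, cap, seed=0):
    """Keep at most `cap` rows for every (bin, mutations) combination."""
    parts = []
    for _, group in df.groupby(["bin", "mutations"]):
        # Take the whole group if it is small, otherwise sample `cap` rows.
        n = min(len(group), cap)
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts, ignore_index=True)


# Toy table: group sizes are 5, 1, 1 and 3.
toy = pd.DataFrame({
    "seq": [f"s{i}" for i in range(10)],
    "mutations": [2] * 6 + [3] * 4,
    "bin": [0] * 5 + [1] + [0] + [1] * 3,
})

capped = cap_per_group(toy, cap=2)
group_sizes = capped.groupby(["bin", "mutations"]).size()
```

With a cap of 2, the groups of size 5 and 3 are reduced to 2 rows each, and the two groups of size 1 are kept whole, giving 6 rows in total.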

Now we only need to apply the function to each bin and collect sequences. 

In [119]:
O1_mutants_df_list = []
for Bin in range(len(bins)):
    O1_mutants_df_list.append(select_seqs(binned_df, Bin, 110))

O1_mutants_df = pd.concat(O1_mutants_df_list, ignore_index=True)
O1_mutants_df.head()

| | seq | mutations | energy | bin |
|---|---|---|---|---|
| 0 | AATTGTGAGCGGgTAACAATg | 2 | -0.293762 | 0 |
| 1 | AATTGTGAGCGtATAACAATg | 2 | -0.108405 | 0 |
| 2 | AATTGTGAGCGGATAACAAac | 2 | -0.96739 | 0 |
| 3 | AATTGTGAGCGGATAcCAATc | 2 | -0.543543 | 0 |
| 4 | AATTGTGAGCGtATAACAATc | 2 | -0.656303 | 0 |


Let's see how many sequences we have.

In [156]:
len(O1_mutants_df.seq.values)

998

Finally, we add all single mutants back to the oligo pool.

In [157]:
df_1["energy"] = df_1['seq'].apply(energy_from_sequence, args= (O1_matrix, ))
df_1["bin"] = "x"
O1_mutants_df = pd.concat([O1_mutants_df, df_1], ignore_index=True)

Now we only need to prepend the lacUV5 sequence to each mutant to get the final constructs. We also add the three wild-type operator sequences, with the energies predicted in Hernan's paper (given relative to O1).

In [158]:
oligos = copy.deepcopy(O1_mutants_df)

oligos.seq = [lacUV5 + seq for seq in oligos.seq]
# DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions.
oligos = pd.concat(
    [
        oligos,
        pd.DataFrame(
            [[lacUV5_O1, 0, 0, "x"],
             [lacUV5_O2, 0, 1.4, "x"],
             [lacUV5_O3, 0, 5.6, "x"]],
            columns=['seq', 'mutations', 'energy', 'bin']),
    ],
    ignore_index=True
)
oligos['primer_added'] = False
oligos.head()

| | seq | mutations | energy | bin | primer_added |
|---|---|---|---|---|---|
| 0 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAATTG... | 2 | -0.293762 | 0 | False |
| 1 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAATTG... | 2 | -0.108405 | 0 | False |
| 2 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAATTG... | 2 | -0.96739 | 0 | False |
| 3 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAATTG... | 2 | -0.543543 | 0 | False |
| 4 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAATTG... | 2 | -0.656303 | 0 | False |


In [159]:
oligos.tail()

| | seq | mutations | energy | bin | primer_added |
|---|---|---|---|---|---|
| 1059 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAATTG... | 1 | 0.74584 | x | False |
| 1060 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAATTG... | 1 | -0.506194 | x | False |
| 1061 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAATTG... | 0 | 0.0 | x | False |
| 1062 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAAATG... | 0 | 1.4 | x | False |
| 1063 | TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGGGCAG... | 0 | 5.6 | x | False |
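Before ordering, a quick sanity check on the constructs can catch assembly mistakes. The sketch below is an optional check that is not part of the original workflow; it uses the three wild-type constructs, and the same test could be applied to the `seq` column of `oligos`.

```python
lacUV5 = "TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGG"
operators = {
    "O1": "AATTGTGAGCGGATAACAATT",
    "O2": "AAATGTGAGCGAGTAACAACC",
    "O3": "GGCAGTGAGCGCAACGCAATT",
}

# Build the constructs the same way as above: promoter followed by operator.
constructs = {name: lacUV5 + op for name, op in operators.items()}

# Every construct should start with the promoter and have the same length.
lengths = {name: len(seq) for name, seq in constructs.items()}
all_start_with_promoter = all(seq.startswith(lacUV5) for seq in constructs.values())
same_length = len(set(lengths.values())) == 1
```

The lacUV5 sequence is 41 bp long and each operator is 21 bp long, so every construct should be 62 bp long.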


Finally, we store the dataframe in the data folder.

In [152]:
oligos.to_csv("../../../../data/twist_order/lacI_sequences.csv")

## Computational environment

In [26]:
%load_ext watermark
%watermark -v -p numpy,pandas,wgregseq,bokeh

CPython 3.8.5
IPython 7.10.0

numpy 1.18.1
pandas 1.0.3
wgregseq 0.0.1
bokeh 2.0.2
