# Generating LacI operator constructs

(c) 2021 Tom Röschinger. This work is licensed under a [Creative Commons Attribution License CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/). All code contained herein is licensed under an [MIT license](https://opensource.org/licenses/MIT).

***

In this notebook we create mutants of the O1 sequence, which is a strong binding site for LacI. The goal is to explore the accuracy of RNA-seq and compare predicted binding energies with the ones obtained from these mutants in RNA-seq. We can also easily include plenty of sequences with many mutations and test the limits of linear binding energy matrices, i.e., up to how many mutations the energy matrix accurately predicts the energy (given that RNA-seq is precise for a small number of mutations).

To achieve reproducibility, we set the random seed before generating random mutants. Be sure to run the cell that sets the seed before any cell that uses a random number generator. If cells that use random number generators are run more than once, the results will vary.

In [3]:
import wgregseq

# Include these if package is manipulated while running the notebook
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import copy

from bokeh.plotting import figure
from bokeh.models import ColumnDataSource
from bokeh.layouts import grid

import bokeh.io

bokeh.io.output_notebook()



First we assign the sequences of interest. We'll mostly use the strongest operator, O1, but will also include the unmutated sequences of O2 and O3. The O1 mutants will be chosen based on their predicted binding energy, so we load the binding energy matrix calculated in Barnes 2019.

In [4]:
# Operator sequences
O1 = 'AATTGTGAGCGGATAACAATT'
O2 = 'AAATGTGAGCGAGTAACAACC'
O3 = 'GGCAGTGAGCGCAACGCAATT'

# Promoter
lacUV5 = 'TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGG'

# Combining promoter and operators
lacUV5_O1 = lacUV5 + O1
lacUV5_O2 = lacUV5 + O2
lacUV5_O3 = lacUV5 + O3

# Obtain binding energy matrix for O1
O1_matrix = np.load("../../../../data/O1_matrix.npy")

We write a function that evaluates the energy matrix for a sequence. To do so, each nucleotide is translated into a column index of the matrix, and the matrix entries selected at each position are summed.

In [5]:
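def energy_from_sequence(sequence, matrix):
    # Translate each nucleotide into its column index (uses the global seq_dict)
    seq_list = list(sequence.upper())
    num_seq = [seq_dict[x] for x in seq_list]
    # Sum the matrix entries selected by the nucleotide at each position
    energy = sum([matrix[i, num_seq[i]] for i in range(len(sequence))])
    return energy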

We also need to define the dictionary which translates nucleotides into indices.

In [6]:
seq_dict, _ = wgregseq.choose_dict("dna")
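
As a minimal illustration of how a linear energy matrix is evaluated, the sketch below uses a made-up 3 × 4 matrix and assumes the column order A, C, G, T. The mapping actually returned by `wgregseq.choose_dict` may differ, and all names with a `toy_` prefix are hypothetical.

```python
import numpy as np

# Hypothetical nucleotide-to-column mapping (assumed order A, C, G, T)
toy_dict = {"A": 0, "C": 1, "G": 2, "T": 3}

# Made-up energy matrix: one row per position, one column per nucleotide
toy_matrix = np.array([
    [0.0, 1.0, 2.0, 0.5],
    [1.5, 0.0, 0.3, 2.0],
    [0.2, 0.7, 0.0, 1.1],
])

def toy_energy(sequence, matrix, mapping):
    """Sum the matrix entry selected by the nucleotide at each position."""
    columns = [mapping[base] for base in sequence.upper()]
    return sum(matrix[i, col] for i, col in enumerate(columns))

# The reference sequence "ACG" selects the zero entry in every row
print(toy_energy("ACG", toy_matrix, toy_dict))

# A lowercase (mutated) base is upper-cased before the lookup
print(toy_energy("tCG", toy_matrix, toy_dict))
```

In this toy matrix the reference sequence has energy zero, so the value for a mutant can be read directly as a difference relative to the reference.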

Now we can generate mutants. We generate all single, double, and triple mutants using the function that enumerates every possible mutant for a given number of mutations. This prevents duplicates and ensures that we obtain every mutant.

In [7]:
mutants_single = wgregseq.mutations_det(O1, mut_per_seq=1)
mutants_double = wgregseq.mutations_det(O1, mut_per_seq=2)
mutants_triple = wgregseq.mutations_det(O1, mut_per_seq=3)
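
To make the idea of exhaustive enumeration concrete, here is a small sketch using only the standard library. It is not the implementation of `wgregseq.mutations_det`; the function `all_mutants` is hypothetical, and writing mutated bases in lowercase is an assumption based on the sequences shown in this notebook.

```python
from itertools import combinations, product
from math import comb

def all_mutants(sequence, num_mutations, alphabet="ACGT"):
    """Enumerate every sequence with exactly `num_mutations` substitutions."""
    sequence = sequence.upper()
    mutants = []
    for positions in combinations(range(len(sequence)), num_mutations):
        # Alternative bases at each chosen position (wild-type base excluded)
        choices = [[b for b in alphabet if b != sequence[p]] for p in positions]
        for replacement in product(*choices):
            mutant = list(sequence)
            for p, base in zip(positions, replacement):
                mutant[p] = base.lower()  # mark mutated bases in lowercase
            mutants.append("".join(mutant))
    return mutants

O1 = "AATTGTGAGCGGATAACAATT"
singles = all_mutants(O1, 1)
doubles = all_mutants(O1, 2)

# For a sequence of length L there are C(L, k) * 3**k mutants with k mutations
print(len(singles), comb(len(O1), 1) * 3)
print(len(doubles), comb(len(O1), 2) * 3 ** 2)
```

The count grows quickly with the number of mutations, which is why exhaustive enumeration is only practical for low-order mutants.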

For higher-order mutants we don't have to worry too much about duplicates and can generate mutants randomly (while keeping the number of mutations fixed). We first set the random seeds so that the draws are reproducible.

In [8]:
import random
random.seed(50937)
np.random.seed(50937)

In [9]:
mutants_quadruple = np.unique(wgregseq.mutations_rand(O1, rate=0.2, num_mutants=100000, number_fixed=True))
mutants_quintuple = np.unique(wgregseq.mutations_rand(O1, rate=0.25, num_mutants=100000, number_fixed=True))
mutants_sextuple = np.unique(wgregseq.mutations_rand(O1, rate=0.3, num_mutants=100000, number_fixed=True))
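
The following sketch shows one way to draw random mutants with a fixed number of substitutions. It is an illustration under our own assumptions, not the code behind `wgregseq.mutations_rand`; the helper `random_mutant` is hypothetical.

```python
import numpy as np

def random_mutant(sequence, num_mutations, rng, alphabet="ACGT"):
    """Substitute exactly `num_mutations` randomly chosen positions."""
    mutant = list(sequence.upper())
    # Choose distinct positions so the number of mutations is fixed
    positions = rng.choice(len(mutant), size=num_mutations, replace=False)
    for p in positions:
        # Replace with a base that differs from the wild type
        alternatives = [b for b in alphabet if b != mutant[p]]
        mutant[p] = str(rng.choice(alternatives)).lower()
    return "".join(mutant)

rng = np.random.default_rng(50937)
O1 = "AATTGTGAGCGGATAACAATT"

# Draw 1000 quadruple mutants and remove duplicates
draws = [random_mutant(O1, 4, rng) for _ in range(1000)]
unique_draws = np.unique(draws)
print(len(unique_draws))
```

Because positions are drawn without replacement and the wild-type base is excluded, every draw carries exactly the requested number of mutations, while duplicates between draws remain possible.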

Let's quickly confirm that we obtained plenty of unique mutants to choose from.

In [10]:
print("Number of unique quadruple mutants: {}".format(len(mutants_quadruple)))
print("Number of unique quintuple mutants: {}".format(len(mutants_quintuple)))
print("Number of unique sextuple mutants: {}".format(len(mutants_sextuple)))

Number of unique quadruple mutants: 90312
Number of unique quintuple mutants: 98945
Number of unique sextuple mutants: 99872
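
These counts can be compared with a simple estimate. Assuming that every draw is uniform over all possible mutants with the given number of mutations (an assumption we have not verified for `wgregseq.mutations_rand`), the expected number of distinct sequences among $n$ draws from $N$ possibilities is $N\left[1 - (1 - 1/N)^n\right]$, with $N = \binom{21}{k} 3^k$ for $k$ mutations in the 21 bp operator. The function name below is hypothetical.

```python
from math import comb, expm1, log1p

def expected_unique(seq_len, num_mutations, num_draws, alphabet_size=4):
    """Expected number of distinct mutants among uniform draws with replacement."""
    n_possible = comb(seq_len, num_mutations) * (alphabet_size - 1) ** num_mutations
    # n_possible * (1 - (1 - 1/n_possible) ** num_draws), written in a stable form
    expected = -n_possible * expm1(num_draws * log1p(-1 / n_possible))
    return n_possible, expected

for k in (4, 5, 6):
    n_possible, expected = expected_unique(21, k, 100_000)
    print(f"{k} mutations: {n_possible} possible, about {expected:.0f} unique expected")
```

Under this assumption the estimates are close to the counts printed above, which suggests that the duplicates are explained by sampling with replacement alone.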


Let's write all the mutants into a data frame. We exclude the single mutants for now, since we want to use all of them anyway, and add them back to the pool at the end.

In [11]:
df_1 = pd.DataFrame({"seq": mutants_single, "mutations": 1})
df_2 = pd.DataFrame({"seq": mutants_double, "mutations": 2})
df_3 = pd.DataFrame({"seq": mutants_triple, "mutations": 3})
df_4 = pd.DataFrame({"seq": mutants_quadruple, "mutations": 4})
df_5 = pd.DataFrame({"seq": mutants_quintuple, "mutations": 5})
df_6 = pd.DataFrame({"seq": mutants_sextuple, "mutations": 6})
df = pd.concat([df_2, df_3, df_4, df_5, df_6], ignore_index=True)

Now we can use the energy matrix to compute the energy difference for every mutant and add it to the data frame.

In [12]:
# Compute energies and add column containing values
df["energy"] = df["seq"].apply(energy_from_sequence, args=(O1_matrix,))

# Show last five rows
df.tail()

index,seq,mutations,energy
326924,ttgcGTGAGCcGATAcCAATT,6,9.383518
326925,ttgcGTaAGCGGATAACcATT,6,10.349507
326926,ttgcGTcAGCGGATtACAATT,6,9.554619
326927,ttggGTaAGCaGATAACAATT,6,13.422445
326928,ttggGcGAtCGGATAACAATT,6,12.420105


Now we have to choose which sequences to include in the experiment. We define bins of binding energy and, within each bin, choose an equal number of sequences for each number of mutations. Binning reduces the bias introduced by the uneven distribution of mutant energies and gives a more even coverage of energies.

In [13]:
# Bin width
gap = 1

# Centers of the lowest and highest energy bins
Min, Max = -0.5, 7.5

# Generate 9 bins of width `gap`, centered at evenly spaced energies
bins = [(i-gap/2, i+gap/2) for i in np.linspace(Min, Max, num=9)]

# Show bins
bins

[(-1.0, 0.0),
 (0.0, 1.0),
 (1.0, 2.0),
 (2.0, 3.0),
 (3.0, 4.0),
 (4.0, 5.0),
 (5.0, 6.0),
 (6.0, 7.0),
 (7.0, 8.0)]

Let's have a look at how the mutant energies are distributed for each number of mutations, with the bins overlaid in orange.

In [14]:
p_list = []
for i in range(2, 7):
    p = figure(title="mutations = {}".format(i), frame_height=200, frame_width=200,
               tools='', background_fill_color="#fafafa", y_axis_label="density",
              x_axis_label="ΔE [k_BT]")
    energies = df.loc[df.mutations == i, "energy"].values
    hist, edges = np.histogram(energies, density=True, bins=50)
    p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
          fill_color="navy", line_color="white", alpha=0.5)
    for (b1, b2) in bins:
        p.line(x=[b1, b1], y=[0, np.max(hist)], color="orange")
        p.line(x=[b2, b2], y=[0, np.max(hist)], color="orange")
        p.varea(x=[b1, b2], y1=[0, 0], y2=[np.max(hist), np.max(hist)], alpha=0.2, color="orange")
        
    p_list.append(p)
    
bokeh.io.show(
    grid(p_list, nrows=1)
)
    

Now we need to assign the sequences to the bins. Sequences whose energy falls outside every bin are discarded.

In [15]:
# Temporary list of sub data frames
df_list = []

# Iterate through bins
for i, (x, y) in enumerate(bins):
    # Copy the rows whose energy lies strictly within the bin
    temp_df = df.loc[(df["energy"] > x) & (df["energy"] < y), :].copy()
    
    # Append column with bin number
    temp_df["bin"] = np.ones(len(temp_df), dtype=int) * i
    
    # Append to temporary list
    df_list.append(temp_df)

# Combine data frames again
binned_df = pd.concat(df_list, ignore_index=True)

# Show last 10 rows
binned_df.tail(10)

index,seq,mutations,energy,bin
92341,ttcTGTGAGgGGcTAACAAcT,6,7.976949,8
92342,ttcTGTGAGgGGtTAACAtTT,6,7.807922,8
92343,ttcTGTGAGgGaATAAaAATT,6,7.728838,8
92344,ttcTGTGgGCGGATcACAAcT,6,7.780272,8
92345,ttccGTGAGCGaATAAtAATT,6,7.826131,8
92346,ttcgGTGAGCGGATAAaAAgT,6,7.399369,8
92347,ttgTGTGAaCGGcTAACAtTT,6,7.084705,8
92348,ttgTGaGAaCGGATAACAAgT,6,7.663886,8
92349,ttgTGcGAcCGGATAACAtTT,6,7.885602,8
92350,ttgaGTGAGCGGATAAtAcTT,6,7.411149,8
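
As an alternative to looping over the bins, the same kind of assignment can be sketched with `pandas.cut`. This is only an illustration on made-up energies. Note that `pandas.cut` uses right-closed intervals by default, so a value exactly on a bin edge is assigned to the lower bin, whereas the loop above excludes both edges.

```python
import numpy as np
import pandas as pd

# Toy energies standing in for df["energy"]; the values are made up
toy_df = pd.DataFrame({"energy": [-0.5, 0.3, 2.5, 7.9, 9.2]})

# Bin edges -1, 0, ..., 8 give nine bins of width 1
edges = np.arange(-1, 9)

# labels=False returns the integer bin index; values outside all bins become NaN
toy_df["bin"] = pd.cut(toy_df["energy"], bins=edges, labels=False)

# Drop sequences that fall outside every bin and store the index as an integer
toy_binned = toy_df.dropna(subset=["bin"]).astype({"bin": int})
print(toy_binned)
```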


Now we need to select mutants from the bins. We want to get an equal number of sequences for each number of mutations in every bin. However, some bins are only sparsely populated by some types of mutants. Therefore, we set a maximal number of sequences per number of mutations per bin. If a bin contains more sequences than that, we select sequences at random; otherwise we take all of them.

In [16]:
def select_seqs(df, ind_bin, num_seqs):
    # Target number of sequences per number of mutations
    mutation_counts = df["mutations"].unique()
    seqs_per_mut = int(num_seqs // len(mutation_counts))
    
    # Prepare data frame to return
    ret_df = pd.DataFrame(columns=["seq", "mutations", "energy", "bin"])
    
    # Iterate through numbers of mutations
    for i in mutation_counts:
        # All sequences with i mutations in this bin
        sub_df = df.loc[(df["mutations"] == i) & (df["bin"] == ind_bin), :]
        if len(sub_df) == 0:
            continue
        if len(sub_df) < seqs_per_mut:
            # If there are not enough sequences, take all
            ret_df = pd.concat([ret_df, sub_df], ignore_index=True)
        else:
            # Randomly choose mutants without replacement
            indices = sub_df.index.to_numpy(dtype=int)
            selected_indices = np.random.choice(indices, size=seqs_per_mut, replace=False)
            # Select by index label (.loc) rather than by position (.iloc)
            ret_df = pd.concat([ret_df, df.loc[selected_indices]], ignore_index=True)
    return ret_df
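
For comparison, capped sampling per group can also be sketched with pandas built-ins. This is an alternative illustration on made-up data, not the function used in this notebook. `sample_capped` is a hypothetical name, and because it draws random numbers differently it would not select the same sequences as `select_seqs`.

```python
import numpy as np
import pandas as pd

# Toy table: bin 0 holds 5 double mutants and 2 triple mutants (made-up data)
toy_df = pd.DataFrame({
    "seq": [f"seq{i}" for i in range(7)],
    "mutations": [2, 2, 2, 2, 2, 3, 3],
    "bin": [0] * 7,
})

def sample_capped(df, cap, seed=0):
    """Keep at most `cap` rows for every (bin, mutations) group."""
    rng = np.random.RandomState(seed)
    parts = [
        # Small groups are kept whole; larger groups are subsampled
        group.sample(n=min(len(group), cap), random_state=rng)
        for _, group in df.groupby(["bin", "mutations"])
    ]
    return pd.concat(parts, ignore_index=True)

toy_selected = sample_capped(toy_df, cap=3)
print(toy_selected.groupby("mutations").size())
```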

Now we only need to apply the function to each bin and collect sequences. 

In [17]:
O1_mutants_df_list = []
for Bin in range(len(bins)):
    O1_mutants_df_list.append(select_seqs(binned_df, Bin, 400))
O1_mutants_df = pd.concat(O1_mutants_df_list, ignore_index=True)
O1_mutants_df.head()

index,seq,mutations,energy,bin
0,cATTGTGAGCGGATcACAATT,2,-0.069192,0
1,cATTGTGAGCGGATAACAAaT,2,-0.028948,0
2,cATTGTGAGCGGATAACAATg,2,-0.621844,0
3,AcTTGTGAGCGGATAACAATc,2,-0.520312,0
4,AAaTGTGAGCGGATAACAATc,2,-0.637571,0


Let's see how many sequences we have.

In [18]:
len(O1_mutants_df.seq.values)

3196

Finally, we add all single mutants back to the oligo pool.

In [19]:
df_1["energy"] = df_1['seq'].apply(energy_from_sequence, args= (O1_matrix, ))
df_1["bin"] = "x"
O1_mutants_df = pd.concat([O1_mutants_df, df_1], ignore_index=True)

Now we only need to prepend the lacUV5 sequence to each mutant to get the final constructs. We also add ten copies of each wild-type operator construct, together with the energies predicted in Hernan's paper, given relative to O1.

In [20]:
oligos = copy.deepcopy(O1_mutants_df)
oligos['construct'] = "lacUV5+O1_mutant"
oligos.seq = [lacUV5 + seq for seq in oligos.seq]

# Wild-type constructs with energies relative to O1
wild_type = pd.DataFrame(
    [[lacUV5_O1, 0, 0, "x", "lacUV5+O1"],
     [lacUV5_O2, 0, 1.4, "x", "lacUV5+O2"],
     [lacUV5_O3, 0, 5.6, "x", "lacUV5+O3"]
    ], 
    columns=['seq', 'mutations', 'energy', 'bin', 'construct'])

# Add ten copies of each wild-type construct
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
oligos = pd.concat([oligos] + [wild_type] * 10, ignore_index=True)
oligos['primer_added'] = False
oligos.head()

index,seq,mutations,energy,bin,construct,primer_added
0,TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGcATTG...,2,-0.069192,0,lacUV5+O1_mutant,False
1,TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGcATTG...,2,-0.028948,0,lacUV5+O1_mutant,False
2,TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGcATTG...,2,-0.621844,0,lacUV5+O1_mutant,False
3,TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAcTTG...,2,-0.520312,0,lacUV5+O1_mutant,False
4,TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAAaTG...,2,-0.637571,0,lacUV5+O1_mutant,False


Finally, we inspect the complete table and store the data frame in the data folder.

In [21]:
oligos

index,seq,mutations,energy,bin,construct,primer_added
0,TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGcATTG...,2,-0.069192,0,lacUV5+O1_mutant,False
1,TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGcATTG...,2,-0.028948,0,lacUV5+O1_mutant,False
2,TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGcATTG...,2,-0.621844,0,lacUV5+O1_mutant,False
3,TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAcTTG...,2,-0.520312,0,lacUV5+O1_mutant,False
4,TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAAaTG...,2,-0.637571,0,lacUV5+O1_mutant,False
...,...,...,...,...,...,...
3284,TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAAATG...,0,1.400000,x,lacUV5+O2,False
3285,TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGGGCAG...,0,5.600000,x,lacUV5+O3,False
3286,TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAATTG...,0,0.000000,x,lacUV5+O1,False
3287,TCGAGTTTACACTTTATGCTTCCGGCTCGTATAATGTGTGGAAATG...,0,1.400000,x,lacUV5+O2,False


In [22]:
oligos.to_csv("../../../../data/twist_order/lacI_sequences.csv")

## Computational environment

In [23]:
%load_ext watermark
%watermark -v -p numpy,pandas,wgregseq,bokeh

CPython 3.8.5
IPython 7.19.0

numpy 1.18.1
pandas 1.2.0
wgregseq 0.0.1
bokeh 2.0.2
