In [1]:
#import logomaker as lm
from IPython.display import Image

import numpy as np
import pandas as pd

#import holoviews as hv
#import holoviews.operation.datashader

#import bokeh_catplot
#import bebi103

import bokeh.io
bokeh.io.output_notebook()

#hv.extension('bokeh')
#bebi103.hv.set_defaults()

My inspiration comes from the Yona, Alm, and Gore paper, which left me with a lingering question: how can we make sense of which promoters evolved and how they evolved? While this is clearly a small case study, it touches on the broader question of the extent to which evolution can be predicted.

Experimentally, I think their system is elegantly simple. The only modification I would want to make is to limit the size of the promoter region, to ensure that only an RNAP binding site can evolve and that no other regulation is involved. (In the one sequence with which I replicated their result, the SNP occurred outside of the RNAP site.)

As this is only a minor modification to what Gore already did, I think the main intellectual value comes from using an energy matrix to make sense of the results (rather than just counting the number of matches and mismatches to the consensus sequence). To understand **how** promoters evolve, I want to use an energy matrix to predict which mutations will most improve binding energy and, in turn, gene expression and fitness during growth on lactose.

Let's start with a random (or at least arbitrary) sequence:

In [3]:
randseq = "TTCTCAATGAGTCTAATTGCCGTTTAAAGA"
len(randseq)

30

To make sense of this sequence, let's load in Brewster's matrix, trimmed to be only the 30 bp RNAP footprint.

In [8]:
# Brewster's energy matrix in kT units
matrix = pd.read_csv("brewster_matrixS2.txt", header=None, sep=r"\s+", comment="#")
matrix.columns = ["A","C","G","T"]
matrix = matrix[5:-6].reset_index(drop=True)
matrix

position,A,C,G,T
0,0.305961,0.681616,0.36014,-0.313427
1,0.122283,0.247441,0.171605,-0.313427
2,1.500683,1.490967,-0.313427,0.633869
3,-0.313427,1.032246,-0.138758,0.699062
4,1.064641,-0.214039,1.119622,-0.313427
5,-0.313427,0.655413,0.295806,0.024221
6,-0.313427,-0.06492,-0.117951,-0.216438
7,-0.206301,-0.090452,-0.313427,-0.213711
8,-0.276635,-0.18302,-0.313427,-0.167744
9,-0.238536,-0.180056,-0.218187,-0.313427


I now create a function that computes the binding energy for a given matrix and sequence:

In [9]:
def binding_energy(seq, matrix):
    """
    Inputs:
        seq: string, 30 bp sequence
        matrix: pandas dataframe, binding energy matrix
            (one row per position, columns "A", "C", "G", "T")
    Outputs:
        binding energy: float, in the same units as the matrix (kT)
    """

    # make sure the sequence is capitalized
    seq = seq.upper()

    # initialize running tally
    tally_energy = 0

    # iterate through the base pairs of the sequence
    for i, bp in enumerate(seq):

        # look up the energy of this base at position i and add it to the tally
        tally_energy += matrix.at[i, bp]

    return tally_energy
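
As an aside, the same sum can be written as a single array lookup. The sketch below is one possible vectorized equivalent, not part of the original analysis: the name `binding_energy_vectorized` is hypothetical, the length check is an added assumption (the loop version does not enforce it), and the toy matrix holds made-up values rather than Brewster's.

```python
import numpy as np
import pandas as pd

def binding_energy_vectorized(seq, matrix):
    """Sum matrix entries for seq with one NumPy lookup instead of a loop."""
    seq = seq.upper()
    # Assumption: require one matrix row per base rather than silently truncating
    if len(seq) != len(matrix):
        raise ValueError(f"sequence length {len(seq)} != matrix length {len(matrix)}")
    # Map each base to its column number; -1 marks a base missing from the matrix
    cols = matrix.columns.get_indexer(list(seq))
    if (cols < 0).any():
        raise ValueError("sequence contains a base that is not a matrix column")
    rows = np.arange(len(seq))
    return float(matrix.to_numpy()[rows, cols].sum())

# Toy 4 bp matrix with made-up values (NOT Brewster's matrix), for illustration only
toy_matrix = pd.DataFrame(
    [[0.1, 0.2, 0.3, -0.4],
     [-0.5, 0.6, 0.7, 0.8],
     [0.9, -1.0, 1.1, 1.2],
     [1.3, 1.4, -1.5, 1.6]],
    columns=["A", "C", "G", "T"],
)
toy_energy = binding_energy_vectorized("tacg", toy_matrix)
print(toy_energy)
```

For a 30 bp matrix the speed difference is negligible; the explicit length and alphabet checks are the more useful part, since a mismatched sequence would otherwise be scored without complaint.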

Let's try this thing out on our random sequence!

In [10]:
starting_energy = binding_energy(randseq, matrix)
print(starting_energy)

-1.6263741541413144


For reference, here is the binding energy of a known, relatively strong promoter (lacUV5):

In [11]:
lacUV5="TTTACACTTTATGCTTCCGGCTCGTATAAT"
lacUV5_energy = binding_energy(lacUV5, matrix)
print(lacUV5_energy)

-4.858281604246257


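Great! So we see that the random sequence (about -1.6 kT) is, unsurprisingly, predicted to bind much more weakly than a known promoter (about -4.9 kT); a more negative energy means tighter binding.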
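
Since the goal stated at the top is to predict which mutations most improve binding energy, here is a minimal sketch of how every single-base substitution could be ranked by its effect. The function name `single_mutation_effects` is hypothetical, and the matrix is a randomly generated stand-in (not Brewster's), so the printed numbers carry no biological meaning.

```python
import numpy as np
import pandas as pd

def single_mutation_effects(seq, matrix):
    """Rank all single-base substitutions by energy change (mutant minus original)."""
    seq = seq.upper()
    records = []
    for i, wt in enumerate(seq):
        for mut in matrix.columns:
            if mut == wt:
                continue
            # Only position i changes, so the energy difference is a one-row lookup
            delta = matrix.at[i, mut] - matrix.at[i, wt]
            records.append({"position": i, "wt": wt, "mut": mut, "delta_energy": delta})
    effects = pd.DataFrame(records)
    # Most negative delta first, i.e. the largest predicted gain in binding
    return effects.sort_values("delta_energy").reset_index(drop=True)

# Stand-in matrix with random values (NOT Brewster's matrix), for illustration only
rng = np.random.default_rng(0)
toy_matrix = pd.DataFrame(rng.normal(size=(30, 4)), columns=["A", "C", "G", "T"])

randseq = "TTCTCAATGAGTCTAATTGCCGTTTAAAGA"
effects = single_mutation_effects(randseq, toy_matrix)
print(effects.head())
```

Because the matrix treats positions as independent, each substitution's effect can be read from one row, and the 90 possible single mutants of a 30 bp sequence never need to be scored in full.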

In [12]:
RandSeq3_orig = "acgctgcctgcggcatttcgtgatcataat"
RandSeq3_evol = "gacgctgcctgcggcatttcgtgatataat"

In [13]:
binding_energy(RandSeq3_orig, matrix)

-1.6530317134068513

In [14]:
binding_energy(RandSeq3_evol, matrix)

-3.3085085538423034

In [16]:
RandSeq29 = "acgacttcctggtaacatcgacgtataat"
RandSeq29_upstream = "cacgacttcctggtaacatcgacgtataat"
RandSeq29_downstream = "acgacttcctggtaacatcgacgtataatc"

RandSeq29 has evolved a promoter that is one base pair shorter than our 30-position energy matrix, so scoring it directly against the full matrix is likely to return a poor binding energy. Below, we see the range of energies the matrix returns when we delete one position at a time from the middle of the matrix, where 18 positions separate the two 6-position sites most critical to binding. (The loop runs over row indices 6 through 24, so it covers those 18 positions plus the first position of the downstream site.)

In [17]:
for i in range(6,25):
    # drop row i and renumber the rows (DataFrame.append was removed in pandas 2.0)
    matrix_trimmed = matrix.drop(index=i).reset_index(drop=True)
    be = binding_energy(RandSeq29, matrix_trimmed)
    print(be)

-4.061664798304857
-4.064392418361355
-3.9718240709037276
-3.9747881565369196
-4.218568876875903
-4.187171253994547
-4.381957567553869
-3.972036563401326
-3.985281002703586
-3.9442744953787274
-4.0075053429775975
-4.035884828288332
-4.032643985632965
-4.205837664503021
-4.008352226649914
-4.272074980887202
-4.992508978062343
-4.992508978062343
-4.8839151927516085
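
To pick out which deletion gives the best fit rather than reading it off the printed list, the scan can return its results keyed by the deleted row. This is a sketch under assumptions: `scan_row_deletions` is a hypothetical helper, and the matrix below is a random stand-in, so the index it reports says nothing about the real data.

```python
import numpy as np
import pandas as pd

def scan_row_deletions(seq, matrix, start, stop):
    """Score seq against matrix with one row removed, for each row in [start, stop)."""
    seq = seq.upper()
    energies = {}
    for i in range(start, stop):
        # Drop row i, then renumber so rows line up with sequence positions
        trimmed = matrix.drop(index=i).reset_index(drop=True)
        energies[i] = sum(trimmed.at[j, bp] for j, bp in enumerate(seq))
    return pd.Series(energies, name="binding_energy")

# Stand-in matrix with random values (NOT Brewster's matrix), for illustration only
rng = np.random.default_rng(0)
toy_matrix = pd.DataFrame(rng.normal(size=(30, 4)), columns=["A", "C", "G", "T"])

RandSeq29 = "acgacttcctggtaacatcgacgtataat"
deletion_scan = scan_row_deletions(RandSeq29, toy_matrix, 6, 25)

# Row whose removal gives the most negative (strongest) predicted energy
print(deletion_scan.idxmin(), deletion_scan.min())
```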


Now let's see how the result changes if, instead of fitting the matrix to the 29 bp sequence, we take a full 30 bp window from the RandSeq29 promoter region (extending one base upstream or downstream) and score it against the unmodified 30 bp matrix.

In [20]:
binding_energy(RandSeq29_upstream, matrix)

-0.13867004448029724

In [21]:
binding_energy(RandSeq29_downstream, matrix)

-0.3894052129383483
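
The two windows above are the only 30 bp windows in the 31 bp region formed by RandSeq29 plus one flanking base on each side. As a sketch of how this generalizes, the hypothetical helper `scan_windows` below scores every window of a longer region. The flanking bases are taken from `RandSeq29_upstream` and `RandSeq29_downstream`; the matrix is again a random stand-in, not Brewster's.

```python
import numpy as np
import pandas as pd

def scan_windows(region, matrix):
    """Score every window of len(matrix) bases along region; index = window start."""
    region = region.upper()
    width = len(matrix)
    energies = []
    for start in range(len(region) - width + 1):
        window = region[start:start + width]
        energies.append(sum(matrix.at[i, bp] for i, bp in enumerate(window)))
    return pd.Series(energies, name="binding_energy")

# Stand-in matrix with random values (NOT Brewster's matrix), for illustration only
rng = np.random.default_rng(0)
toy_matrix = pd.DataFrame(rng.normal(size=(30, 4)), columns=["A", "C", "G", "T"])

RandSeq29 = "acgacttcctggtaacatcgacgtataat"
# Window 0 equals RandSeq29_upstream, window 1 equals RandSeq29_downstream
region = "c" + RandSeq29 + "c"
window_scan = scan_windows(region, toy_matrix)
print(window_scan)
```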

In [12]:
RandSeq1_orig = "cgttcaggttctggttctccatgccatagt"
RandSeq1_evol1 = "cgttcaggttctggttctccatgcTatagt"
RandSeq1_evol2 = "TgttcaggttctggttctccatgcTatagt"

In [13]:
binding_energy(RandSeq1_orig, matrix)

-1.8063224473818504

In [14]:
binding_energy(RandSeq1_evol1, matrix)

-2.9115778711106644

In [15]:
binding_energy(RandSeq1_evol2, matrix)

-3.9066207383423035
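
The uppercase letters in the evolved RandSeq1 sequences mark the substituted bases. A small helper can list them explicitly, which makes it easier to see where in the 30 bp window each change falls. The name `list_substitutions` is hypothetical, and positions are 0-based indices into the window.

```python
def list_substitutions(original, evolved):
    """Return (position, original_base, evolved_base) for each differing position."""
    if len(original) != len(evolved):
        raise ValueError("sequences must have the same length")
    pairs = zip(original.upper(), evolved.upper())
    return [(i, a, b) for i, (a, b) in enumerate(pairs) if a != b]

RandSeq1_orig = "cgttcaggttctggttctccatgccatagt"
RandSeq1_evol1 = "cgttcaggttctggttctccatgcTatagt"
RandSeq1_evol2 = "TgttcaggttctggttctccatgcTatagt"

subs_evol1 = list_substitutions(RandSeq1_orig, RandSeq1_evol1)
subs_evol2 = list_substitutions(RandSeq1_orig, RandSeq1_evol2)
print(subs_evol1)  # [(24, 'C', 'T')]
print(subs_evol2)  # [(0, 'C', 'T'), (24, 'C', 'T')]
```

Each listed position can then be looked up in the energy matrix to see how much of the total energy change that single substitution accounts for.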