# Modeling Expression

(c) 2020 Tom Röschinger. This work is licensed under a [Creative Commons Attribution License CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/). All code contained herein is licensed under an [MIT license](https://opensource.org/licenses/MIT).

In [1]:
using jregseq, CSV, DataFrames, Statistics, DelimitedFiles, Plots
import Pandas.read_csv

plotlyjs()

Plots.PlotlyJSBackend()

Here we try to predict the changes in gene expression caused by single mutations in promoter binding sites. To do that, we use data from the original [Reg-Seq experiment](https://github.com/RPGroup-PBoC/RegSeq).

We start with an energy matrix for *ykgE*. Because of the file's unusual format, we use `Pandas` to import the table.

In [2]:
df_emat = read_csv("../../data/RegSeq/ykgE_Anaero_71_93_energymatrix", delim_whitespace=true) |> DataFrame;
head(df_emat)

Row,pos,val_A,val_C,val_G,val_T
,Int64,Float64,Float64,Float64,Float64
1,71,-0.033288,-0.014025,0.053555,-0.006242
2,72,-0.01692,0.010667,0.019179,-0.012926
3,73,-0.078306,-0.004528,-0.01223,0.095064
4,74,0.199515,-0.058073,-0.086813,-0.054629
5,75,0.033905,-0.071138,0.004644,0.032589
6,76,0.051868,-0.025173,-0.012584,-0.014111


We also import the sequence reads.

In [3]:
df_reads = read_csv("../../data/RegSeq/ykgEAnaerodataset_alldone_with_large", delim_whitespace=true) |> DataFrame;
head(df_reads, 5)

Row,ct,ct_0,ct_1,seq
,Float64,Float64,Float64,String
1,1.0,1.0,0.0,ACAATTTCACCATAAAATGTCGGCGTTGCCGAAAGAAATAAAATGAGGTATTGCATTTGACGTTTGGATGAAAGATTTTCATTTGTCCTACAATTGCGGGGTGGTATGTGGCTAGCCCATTAAAAAAGAACGCCATATTTATTGATGATTGACACCGCGGCCCAGCCAATCTATACGCCT
2,1.0,1.0,0.0,ACAATTTCACCATAAAATGTCGGCGTTGCCGAAAGAAATAAAATGAGGTATTGCATTTGACGTTTGGATGAAAGATTTTCATTTGTCCTACAATTGCGGGGTGGTATGTGGCTAGCCCATTAAAAAAGAACGCCATATTTATTGATGATTGACACCGCGGGAGAGCCTCGCGTATCCCTC
3,1.0,1.0,0.0,ACAATTTCACCATAAAATGTCGGCGTTGCCGAAAGAAATAAAATGAGGTATTGCATTTGACGTTTGGATGAAAGATTTTCATTTGTCCTACAATTGCGGGGTGGTATGTGGCTAGCCCATTAAAAAAGAACGCCATATTTATTGATGATTGACACCGCGGGGAGGCGTCGTGCCCCTAAA
4,1.0,1.0,0.0,ACAATTTCACCATAAAATGTCGGCGTTGCCGAAAGAAATAAAATGAGGTATTGCATTTGACGTTTGGATGAAAGATTTTCATTTGTCCTACAATTGCGGGGTGGTATGTGGCTAGCCCATTAAAAAAGAACGCCATATTTATTGATGATTGACACCGCGGGGGGACCCTTGCTTTCTATT
5,1.0,1.0,0.0,ACAATTTCACCATAAAATGTCGGCGTTGCCGAAAGAAATAAAATGAGGTATTGCATTTGACGTTTGGATGAAAGATTTTCATTTGTCCTACAATTGCGGGGTGGTATGTGGCTAGCCCATTAAAAAAGAACGCCATATTTATTGATGATTGACACCGCGGTGTTCGAGTAATTTCCCTGA


Looking at the information footprint, we find that this binding site is the one for the repressor.

![](footprint.PNG)

The sequences of interest contain the barcode at the end, i.e., the last 20 letters are the barcode.

In [4]:
length(df_reads.seq[1])

180
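As a minimal sketch of how a read could be separated into the promoter variant and its barcode, assuming the barcode is always the last 20 letters as stated above. The helper `split_read` and the toy read are hypothetical and not part of `jregseq`.

```julia
# Hypothetical helper: split a read into promoter variant and barcode,
# assuming the barcode occupies the last `barcode_length` letters.
function split_read(read::AbstractString; barcode_length::Int=20)
    n = length(read)
    n >= barcode_length || throw(ArgumentError("read is shorter than the barcode"))
    promoter = read[1:n-barcode_length]
    barcode = read[n-barcode_length+1:end]
    return promoter, barcode
end

# Toy read of length 180 (made-up letters, not from the dataset)
toy_read = "ACGT"^40 * "G"^20
promoter, barcode = split_read(toy_read)
length(promoter), length(barcode)   # (160, 20)
```

With reads of length 180, this leaves 160 letters for the promoter variant.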

To work with the energy matrix, we convert it to an array and forget about the positions for now.

In [5]:
# Keep only the four energy columns (val_A, val_C, val_G, val_T) as a matrix
emat = Matrix(df_emat[:, 2:5])

23×4 Array{Float64,2}:
 -0.033288  -0.014025   0.053555  -0.006242
 -0.01692    0.010667   0.019179  -0.012926
 -0.078306  -0.004528  -0.01223    0.095064
  0.199515  -0.058073  -0.086813  -0.054629
  0.033905  -0.071138   0.004644   0.032589
  0.051868  -0.025173  -0.012584  -0.014111
  0.089666  -0.029124  -0.013974  -0.046568
 -0.072368  -0.056475  -0.03225    0.161093
 -0.00548    0.010865   0.006917  -0.012302
 -0.028776   0.035087  -0.01494    0.008628
 -0.017268   0.017855   0.046747  -0.047335
  0.053466  -0.008786   0.01635   -0.06103
  0.025561  -0.013208   0.045536  -0.057889
 -0.003096   0.036267  -0.040278   0.007107
  0.044523   0.043059  -0.0035    -0.084082
  0.017119  -0.046369  -0.001753   0.031002
 -0.017918  -0.00874    0.025971   0.000687
  0.038592  -0.028786   0.009415  -0.019221
 -0.072817  -0.009082   0.060168   0.021731
  0.028964  -0.07278    0.026844   0.016972
 -0.042122   0.017107   0.027542  -0.002526
 -0.046036   0.035937   0.005203   0.004896
 -0.038327

This energy matrix is in arbitrary units, as are most energy matrices inferred from Reg-Seq. Transforming the entries to real energies seems to be a hard task, so for this toy model we simply set the minimal energy per position to 0 and multiply the remaining values by a constant such that the average energy contribution of a mutation is about 2 kBT.

In [6]:
# Subtract minimum per row
emat = emat .- minimum(emat, dims=2)

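# Scale factor: mean over all entries (including the zero per row) times 3/4.
# Note: the mean over only the three non-optimal letters is mean(emat) * 4 / 3,
# so with this factor the average mutation costs about 3.6 kBT after rescaling.
Z = mean(emat) / 4 * 3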

# Renormalize
emat = emat ./ Z * 2

23×4 Array{Float64,2}:
  0.0       1.04794   4.72439    1.47134
  0.0       1.50077   1.96384    0.217279
  0.0       4.01363   3.59463    9.43158
 15.5767    1.5635    0.0        1.75086
  5.71449   0.0       4.12265    5.6429
  4.19114   0.0       0.68486    0.601789
  7.41133   0.948979  1.77316    0.0
  0.0       0.864603  2.18248   12.7006
  0.371127  1.26032   1.04554    0.0
  0.0       3.47424   0.752699   2.03483
  1.63569   3.54643   5.1182     0.0
  6.22875   2.84215   4.20959    0.0
  4.5398    2.43071   5.62647    0.0
  2.02276   4.16416   0.0        2.57781
  6.9963    6.91666   4.38378    0.0
  3.45384   0.0       2.42718    4.2091
  0.0       0.499297  2.38763    1.01214
  3.66546   0.0       2.07819    0.52035
  0.0       3.46728   7.23458    5.14355
  5.53502   0.0       5.41969    4.88264
  0.0       3.22214   3.78982    2.15408
  0.0       4.45945   2.78748    2.77078
  0.0       3.81273   2.76055    1.76691
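As a cross-check of the normalization, the average cost of a mutation can also be computed directly over the three non-optimal letters per position. The following is a minimal sketch with made-up numbers; `mean_mutation` is a hypothetical helper, and it assumes each row has exactly one entry equal to zero after the shift.

```julia
# Toy 2×4 matrix with made-up values (not from the Reg-Seq data)
toy = [0.1 -0.2 0.3 0.0;
       0.5  0.1 0.2 0.4]

# Shift so the best letter at each position has energy 0
shifted = toy .- minimum(toy, dims=2)

# Average cost of a mutation: the single zero per row contributes nothing
# to the sum, so divide by the three non-optimal letters per row.
mean_mutation(m) = sum(m) / (3 * size(m, 1))

# Rescale so that the average mutation cost equals the target (in kBT)
target = 2.0
scaled = shifted .* (target / mean_mutation(shifted))

mean_mutation(scaled)   # ≈ 2.0
```

Calling `mean_mutation` on the rescaled matrix from the cell above gives the average mutation cost that the notebook's scaling actually produces.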

I don't know yet how close this is to the true energies, but we proceed with it for lack of a better alternative. Next, we need to generate all single mutants of the wild-type sequence. For now, let's assume that the wild-type sequence has the optimal letter at each position, although we know that this is very likely not the case.

In [7]:
wildtype = [i[2] for i in argmin(emat, dims=2)][:]

23-element Array{Int64,1}:
 1
 1
 1
 3
 2
 2
 4
 1
 4
 1
 4
 4
 4
 3
 4
 2
 1
 2
 1
 2
 1
 1
 1
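The integers above index the columns of the energy matrix. Assuming the column order A, C, G, T (as in the `val_A`, `val_C`, `val_G`, `val_T` columns of the table), a numeric sequence can be translated to letters and back. The helpers `to_letters` and `to_numeric` are hypothetical and only meant as a sketch.

```julia
# Assumed alphabet order, matching the val_A, val_C, val_G, val_T columns
alphabet = ['A', 'C', 'G', 'T']

# Hypothetical helpers, not part of jregseq
to_letters(seq::Vector{Int}) = join(alphabet[i] for i in seq)
to_numeric(s::AbstractString) = [findfirst(isequal(c), alphabet) for c in s]

# First seven positions of the wild type computed above
to_letters([1, 1, 1, 3, 2, 2, 4])   # "AAAGCCT"
```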

In [8]:
mutants = jregseq.site_single_mutations(wildtype, alph_type="Numeric")

70-element Array{Array{Int64,1},1}:
 [1, 1, 1, 3, 2, 2, 4, 1, 4, 1  …  3, 4, 2, 1, 2, 1, 2, 1, 1, 1]
 [2, 1, 1, 3, 2, 2, 4, 1, 4, 1  …  3, 4, 2, 1, 2, 1, 2, 1, 1, 1]
 [3, 1, 1, 3, 2, 2, 4, 1, 4, 1  …  3, 4, 2, 1, 2, 1, 2, 1, 1, 1]
 [4, 1, 1, 3, 2, 2, 4, 1, 4, 1  …  3, 4, 2, 1, 2, 1, 2, 1, 1, 1]
 [1, 2, 1, 3, 2, 2, 4, 1, 4, 1  …  3, 4, 2, 1, 2, 1, 2, 1, 1, 1]
 [1, 3, 1, 3, 2, 2, 4, 1, 4, 1  …  3, 4, 2, 1, 2, 1, 2, 1, 1, 1]
 [1, 4, 1, 3, 2, 2, 4, 1, 4, 1  …  3, 4, 2, 1, 2, 1, 2, 1, 1, 1]
 [1, 1, 2, 3, 2, 2, 4, 1, 4, 1  …  3, 4, 2, 1, 2, 1, 2, 1, 1, 1]
 [1, 1, 3, 3, 2, 2, 4, 1, 4, 1  …  3, 4, 2, 1, 2, 1, 2, 1, 1, 1]
 [1, 1, 4, 3, 2, 2, 4, 1, 4, 1  …  3, 4, 2, 1, 2, 1, 2, 1, 1, 1]
 [1, 1, 1, 1, 2, 2, 4, 1, 4, 1  …  3, 4, 2, 1, 2, 1, 2, 1, 1, 1]
 [1, 1, 1, 2, 2, 2, 4, 1, 4, 1  …  3, 4, 2, 1, 2, 1, 2, 1, 1, 1]
 [1, 1, 1, 4, 2, 2, 4, 1, 4, 1  …  3, 4, 2, 1, 2, 1, 2, 1, 1, 1]
 ⋮
 [1, 1, 1, 3, 2, 2, 4, 1, 4, 1  …  3, 4, 2, 1, 2, 1, 1, 1, 1, 1]
 [1, 1, 1, 3, 2, 2, 4, 1, 4, 1  …  3, 4, 2, 1, 2, 1
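The output has 70 elements: the wild type itself plus 3 alternative letters at each of the 23 positions. A minimal sketch of how such a list could be built is shown below. This is a hypothetical reimplementation for illustration and not the actual code of `jregseq.site_single_mutations`.

```julia
# Hypothetical sketch: wild type first, then every single-letter substitution
function single_mutants(wt::Vector{Int}; n_letters::Int=4)
    mutants = [copy(wt)]
    for pos in eachindex(wt)
        for letter in 1:n_letters
            if letter != wt[pos]
                m = copy(wt)
                m[pos] = letter
                push!(mutants, m)
            end
        end
    end
    return mutants
end

single_mutants([1, 3])
# 7 sequences: [1, 3], [2, 3], [3, 3], [4, 3], [1, 1], [1, 2], [1, 4]
```

For a sequence of length 23 this yields 1 + 3 × 23 = 70 sequences, which matches the length of the output above.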

Now we need to compute the binding energy for all of these mutants. Since the wild type has energy zero by construction, the energy of each single mutant is just the corresponding entry of the energy matrix. Still, let's write a function anyway, in case we want to compute the energy of any given (integer) sequence using an energy matrix.

In [9]:
# Sum the energy matrix entries selected by the letter at each position.
function get_binding_energy(sequence::Array{Int64,1}, emat::Array{Float64,2})
    energy = 0.0
    for i in eachindex(sequence)
        energy += emat[i, sequence[i]]
    end
    return energy
end



get_binding_energy (generic function with 1 method)
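To illustrate that a single mutant's energy reduces to one entry of the matrix when the wild type sits at zero, here is a small sketch on a made-up 2×4 matrix. The one-line `binding_energy` is a hypothetical compact variant of the function above, and the toy values are not from the data.

```julia
# Compact variant of the energy sum (hypothetical name)
binding_energy(seq, emat) = sum(emat[i, s] for (i, s) in enumerate(seq))

# Toy matrix: the optimal letter at each position has energy 0
toy_emat = [0.0 1.0 2.0 3.0;
            4.0 0.0 5.0 6.0]
toy_wt = [1, 2]

binding_energy(toy_wt, toy_emat)    # 0.0, wild type
binding_energy([3, 2], toy_emat)    # 2.0, equals toy_emat[1, 3]
binding_energy([1, 4], toy_emat)    # 6.0, equals toy_emat[2, 4]
```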

In [11]:
histogram([get_binding_energy(sequence, emat) for sequence in mutants])