# Generating random sequences and energy matrices

(c) 2020 Scott H Saunders. This work is licensed under a [Creative Commons Attribution License CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/). All code contained herein is licensed under an [MIT license](https://opensource.org/licenses/MIT).

In [7]:
import numpy as np
import random
import string
import pandas as pd

## Generate Random Sequence

First, let's simply write a function to generate a random sequence of a certain length.

In [8]:
def gen_rand_seq(length):
    """
    Generate a random DNA sequence of defined length
    
    Parameters
    ----------
    length : int
    
    Returns
    ---------
    seq : string
         Random DNA sequence
    """
    
    nt = ['A','T','C','G'] #nucleotides
    
    seq = ''.join(random.choice(nt) for i in range(length))
    
    return(seq)


We can test that it works as expected.

In [48]:
test_seq = gen_rand_seq(30)
test_seq

'CATTATCGATACTTAAAATCACGGGCTCAT'
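
Eyeballing one sequence is a weak test, so here is a minimal sketch of a programmatic check. The seed value and the name `check_seq` are arbitrary choices for illustration, and the function is repeated only so the cell runs on its own.

```python
import random

def gen_rand_seq(length):
    """Same function as above, repeated so this cell is self-contained."""
    nt = ['A', 'T', 'C', 'G']
    return ''.join(random.choice(nt) for _ in range(length))

random.seed(0)  # arbitrary seed, only to make the draw repeatable
check_seq = gen_rand_seq(1000)

# The sequence has the requested length and uses only the four nucleotides
assert len(check_seq) == 1000
assert set(check_seq) <= set('ATCG')

# Each base should appear roughly a quarter of the time
base_freqs = {b: check_seq.count(b) / len(check_seq) for b in 'ATCG'}
print(base_freqs)
```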

It does. So next let's generate a totally random energy matrix for a certain length of sequence. 


## Generate Random Energy Matrix

Recall that an energy matrix should have values for each possible nucleotide at each position that reflect the inferred transcriptional effect. 

Here I'll return an np.array of shape (site size in bp, 4), with one row per position and one column per nucleotide, where the random values are normally distributed around mean 1 with standard deviation 1.

In [92]:
def gen_emat_rand(site_size):
    """
    Generate a random energy matrix for a defined sequence length. Arbitrary values for each possible base.
    
    Parameters
    ----------
    site_size : int
        Length of the sequence to generate the energy matrix for, in bp.
    
    Returns
    ----------
    energy matrix : np.array
    """
    return(np.random.normal(1,1,(site_size,4)))

In [93]:
gen_emat_rand(30)

array([[ 1.11296440e+00,  2.90818236e-01,  1.05252756e+00,
         7.46399668e-01],
       [-4.10120291e-02,  1.25168818e+00, -1.88468195e-01,
         1.25083303e+00],
       [-5.57292721e-01,  8.76311593e-01, -2.36111012e-01,
         1.56101587e+00],
       [-1.06607831e-01,  6.81149197e-01,  1.67497373e+00,
         5.65421074e-01],
       [ 1.00368992e+00,  2.45311770e+00,  1.03896370e-01,
         1.02711107e+00],
       [ 1.77505356e+00,  2.29203251e+00,  1.24197999e-01,
         2.96913448e-01],
       [ 5.05696746e-01, -6.12841899e-01,  6.90741511e-01,
         1.90349828e+00],
       [ 4.37065381e-01, -5.14412009e-01,  2.92429813e-01,
         1.51798388e+00],
       [ 6.48142011e-01,  1.94341352e+00,  7.06530263e-01,
         1.33119153e+00],
       [-4.99507521e-01,  9.63079636e-01, -1.93531394e-01,
         1.52116615e+00],
       [ 7.81880341e-01,  8.99987219e-02,  1.78141236e+00,
        -1.05013477e-01],
       [ 2.21189679e+00,  8.10242078e-01,  4.77639047e-01,
      

Seems to work as expected. 
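
To go a bit beyond a visual check, the sketch below draws a large matrix and confirms its shape and summary statistics. It assumes a seeded `np.random.default_rng` generator for reproducibility, and `gen_emat_rand_seeded` is a hypothetical variant of `gen_emat_rand()` written for this example, not a function used elsewhere in the notebook.

```python
import numpy as np

def gen_emat_rand_seeded(site_size, rng):
    """Hypothetical variant of gen_emat_rand() that draws from a supplied Generator."""
    return rng.normal(loc=1, scale=1, size=(site_size, 4))

rng = np.random.default_rng(0)  # arbitrary seed, for repeatable draws
big_emat = gen_emat_rand_seeded(10000, rng)

# One row per position, one column per nucleotide
assert big_emat.shape == (10000, 4)

# With this many draws, both values should be close to 1
print(big_emat.mean(), big_emat.std())
```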

Based on the information footprints from RegSeq 1.0, we expect that for an energy matrix, TF binding sites will contain high information, while surrounding sequence will contain low information. Therefore we will establish a very simple representation of this by generating an energy matrix from a random sequence that contains a single theoretical binding site.

The function below allows you to define the start position and size of this binding site. For the site, this function generates a random energy matrix using `gen_emat_rand()` and for all other positions sets the energy matrix values to 0.

In hindsight, `gen_emat_rand()` may not need to be its own function, since its body is a single line.

In [94]:
def gen_emat_single_site(seq, site_start, site_size):
    """
    Generate energy matrix for sequence with one site. Outside of site matrix is zero.
    
    Parameters
    ----------
    seq : string
    
    site_start : int
    
    site_size : int
    
    Returns
    ---------
    seq_emat : np.array
    """
    
    seq_emat = np.zeros((len(seq),4))
    
    seq_emat[site_start:(site_start + site_size),:] = gen_emat_rand(site_size)
    
    return(seq_emat)

In [95]:
test_emat = gen_emat_single_site(test_seq,0,10)
test_emat

array([[ 1.44712205,  1.68751866,  0.12607859,  3.02358059],
       [ 2.04085929,  0.56979785,  1.34889949,  1.21991213],
       [ 0.50074316,  2.26413067,  1.52063206,  2.73110204],
       [ 0.60144128,  0.7771055 ,  0.81654915,  0.26343913],
       [ 1.07500511, -0.16006662,  0.87769648,  0.47419438],
       [ 0.95416758, -0.5440269 , -0.01282096, -0.42378289],
       [ 1.10732871,  0.19106154,  1.34899522,  1.43671531],
       [ 0.96740583,  2.8457135 ,  0.44059671,  1.04413919],
       [ 0.13954312,  1.50406139,  2.84947971,  0.12905226],
       [ 1.32109204,  1.16609012,  0.75149217,  1.91412428],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.

Ok, so now we can generate random energy matrices with obvious binding sites. 

This approach could become much more advanced, informed by actual data. For example, we could establish whether a TF binding site is an activator or repressor and that would set the effect sign. Further WT effect could be set to zero and the importance of each position and mutation could be assigned in a more rigorous way.
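
As a rough sketch of two of those ideas, the function below zeroes the wild-type entry at every position and gives all mutations inside the site a single sign. The name `gen_emat_site_wt_zero`, the `sign` and `rng` parameters, and the `NT_INDEX` mapping are all hypothetical, and which sign corresponds to an activator or a repressor is a convention I am assuming the caller decides.

```python
import numpy as np

NT_INDEX = {'A': 0, 'T': 1, 'C': 2, 'G': 3}  # column order used in this notebook

def gen_emat_site_wt_zero(seq, site_start, site_size, sign=1, rng=None):
    """
    Hypothetical variant of gen_emat_single_site(): wild-type entries are zero
    and every mutation inside the site shares the same sign.
    """
    rng = np.random.default_rng() if rng is None else rng
    seq_emat = np.zeros((len(seq), 4))
    site = slice(site_start, site_start + site_size)
    # Random magnitudes; the sign encodes the assumed direction of the effect
    seq_emat[site, :] = sign * np.abs(rng.normal(1, 1, (site_size, 4)))
    # Zero the wild-type base at every position
    wt_cols = [NT_INDEX[base] for base in seq]
    seq_emat[np.arange(len(seq)), wt_cols] = 0
    return seq_emat

wt_seq = 'CATTATCGATACTTAAAATCACGGGCTCAT'
wt_zero_emat = gen_emat_site_wt_zero(wt_seq, 0, 10, sign=-1,
                                     rng=np.random.default_rng(0))
```

With this representation the wild-type sequence sums to zero by construction, so any nonzero sum reflects mutations inside the site.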

For now let's proceed. 

## Simple additive prediction from energy matrix

Next, we should be able to take a sequence variant and look up the effect of each nucleotide in the energy matrix and return the sum of all the values. This "model" is literally that stupid, which is fine for now.

To do this, I wrote a function that loops over the sequence and uses a pandas dataframe of the energy matrix to look up, at each position, the value in the column for the corresponding nucleotide. It collects each value in a list and then sums the list.

In [114]:
def sum_emat(seq, emat):
    """
    Retrieve and sum the energy matrix values for a given sequence variant and matrix.
    
    Parameters
    ----------
    seq : string
    
    emat : pd.DataFrame with columns A T C G
    
    Returns
    ---------
    sum : float
    """
    
    mat_vals = []
    
    for ind,char in enumerate(seq, start = 0):
        mat_vals.append(emat.iloc[ind][char])
        
    return(sum(mat_vals))

To test this function, we first need to convert our energy matrix, which was an np.array, into a pandas dataframe with columns named for each nucleotide.

In [122]:
test_emat_df = pd.DataFrame(data = test_emat, columns = ('A','T','C','G'))
test_emat_df

|   | A | T | C | G |
|---|---|---|---|---|
| 0 | 1.447122 | 1.687519 | 0.126079 | 3.023581 |
| 1 | 2.040859 | 0.569798 | 1.348899 | 1.219912 |
| 2 | 0.500743 | 2.264131 | 1.520632 | 2.731102 |
| 3 | 0.601441 | 0.777106 | 0.816549 | 0.263439 |
| 4 | 1.075005 | -0.160067 | 0.877696 | 0.474194 |
| 5 | 0.954168 | -0.544027 | -0.012821 | -0.423783 |
| 6 | 1.107329 | 0.191062 | 1.348995 | 1.436715 |
| 7 | 0.967406 | 2.845713 | 0.440597 | 1.044139 |
| 8 | 0.139543 | 1.504061 | 2.84948 | 0.129052 |
| 9 | 1.321092 | 1.16609 | 0.751492 | 1.914124 |


Now, let's remind ourselves what sequence we were working off of:

In [123]:
test_seq

'CATTATCGATACTTAAAATCACGGGCTCAT'

Let's start by testing this function with a single-nucleotide sequence:

In [126]:
sum_emat(seq = 'A', emat = test_emat_df)

1.4471220458060166

OK, it returns the value for A at position 0, which is correct, so let's try a slightly more complicated sequence:


In [127]:
sum_emat(seq = 'AATT', emat = test_emat_df)

6.529217506501876

This matches the hand sum of the four matching entries (1.447122 + 2.040859 + 2.264131 + 0.777106 ≈ 6.529218). Next we can try to interface some constructs designed by this code with Tom's scramble mutation functions.
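
If the lookup loop ever becomes a bottleneck, one possible alternative is to index the NumPy array directly. The sketch below is an assumption-laden counterpart to `sum_emat()`: the name `sum_emat_np` and the `NT_INDEX` mapping are hypothetical, and it assumes the matrix columns are ordered A, T, C, G as in the rest of this notebook.

```python
import numpy as np

NT_INDEX = {'A': 0, 'T': 1, 'C': 2, 'G': 3}  # assumed column order: A, T, C, G

def sum_emat_np(seq, emat):
    """Hypothetical vectorized counterpart of sum_emat() for a NumPy energy matrix."""
    emat = np.asarray(emat)
    rows = np.arange(len(seq))                 # position of each base
    cols = [NT_INDEX[base] for base in seq]    # column of each base
    # Pick one entry per position, then sum them
    return emat[rows, cols].sum()

demo_emat = np.array([[1.0, 2.0, 3.0, 4.0],
                      [5.0, 6.0, 7.0, 8.0]])

# Row 0, column A (1.0) plus row 1, column T (6.0)
demo_sum = sum_emat_np('AT', demo_emat)
```

Because `np.asarray` also accepts a dataframe, the same call should work on a dataframe whose columns are in that order, though the column names themselves are ignored.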

## Computing Environment

In [6]:
%load_ext watermark
%watermark -d -v -p numpy,pandas

The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
2020-08-26 

CPython 3.7.6
IPython 7.12.0

numpy 1.18.1
pandas 1.0.1
