# Reading Energy Matrix

In [4]:
import pandas as pd
import numpy as np 

In [5]:
# read in the energy matrix
data = pd.read_csv("../../data/brewster_matrixS2.txt", sep=" ", comment="#", header=None)
data = data[5: -6] #trimming matrix to 30 bp
data = data.reset_index(drop=True)
data.columns = ['A','C','G','T']
data.head()

          A         C         G         T
0  0.305961  0.681616  0.360140 -0.313427
1  0.122283  0.247441  0.171605 -0.313427
2  1.500683  1.490967 -0.313427  0.633869
3 -0.313427  1.032246 -0.138758  0.699062
4  1.064641 -0.214039  1.119622 -0.313427


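We will use the first randomly generated sequence from Yona (2018) as an example for these functions. RandSeq1 is the entire 103 bp sequence, and RandSeq1_trimmed is the known 30 bp promoter region from that sequence. As defined in the next cell, RandSeq1_trimmed differs from the matching window of RandSeq1 (0-based positions 22 to 51) at one base: window position 24 is C in RandSeq1_trimmed and T in RandSeq1.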

In [6]:
RandSeq1 = "ATAGGAGCGTCATCAAACGCGCCGTTCAGGTTCTGGTTCTCCATGCTATAGTTAAGCCGCACAACGGGTACTACCACTCCCTGTAGTCCGCTTTACCGTTCTC"
RandSeq1_trimmed = 'CGTTCAGGTTCTGGTTCTCCATGCCATAGT'

### energy(sequence)

The energy(sequence) function returns the total energy of a 30 bp promoter region. For each position in the sequence, it looks up the entry for that position and base in the Brewster energy matrix and adds the entries together.

In [7]:
def energy(sequence):
    """
    Input:
        sequence: string, 30 bp promoter region
    Output:
        total_energy: the total energy for the given sequence in k_BT
    """
    # Initialize the running total of the energy.
    total_energy = 0

    # Add up the energy contribution of each base over the entire sequence.
    for position, letter in enumerate(sequence):
        # Look up the energy for this position (row) and base (column) in the energy matrix.
        energy_of_base = data.loc[position, letter]
        total_energy += energy_of_base

    return total_energy

For RandSeq1_trimmed, the function yields

In [8]:
energy(RandSeq1_trimmed)

-1.8063224473818504
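
The lookup-and-sum step can also be written with input checks, since energy(sequence) as defined assumes exactly 30 valid bases. The following is a minimal sketch, not part of the original notebook: `TOY_MATRIX` is a hypothetical 3-position matrix standing in for the 30-row Brewster matrix, a list of dictionaries stands in for the DataFrame, and `checked_energy` is an illustrative name.

```python
# Hypothetical 3-position energy matrix (values in k_BT); the real matrix has 30 rows.
TOY_MATRIX = [
    {"A": 0.5, "C": -0.25, "G": 0.25, "T": 1.0},
    {"A": -0.25, "C": 0.5, "G": 1.0, "T": 0.25},
    {"A": 0.75, "C": 0.25, "G": -0.25, "T": 0.5},
]


def checked_energy(sequence, matrix):
    """Sum the matrix entry for each base, rejecting malformed input."""
    if len(sequence) != len(matrix):
        raise ValueError(f"expected {len(matrix)} bases, got {len(sequence)}")
    total = 0.0
    # Upper-case the input so that lower-case sequences are accepted too.
    for position, base in enumerate(sequence.upper()):
        if base not in matrix[position]:
            raise ValueError(f"unexpected base {base!r} at position {position}")
        total += matrix[position][base]
    return total


print(checked_energy("CAG", TOY_MATRIX))  # -0.75
print(checked_energy("tga", TOY_MATRIX))  # 2.75
```

A sequence of the wrong length, or one containing a character such as N, raises a ValueError here instead of returning a partial sum or failing inside the matrix lookup.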

### binding_site(sequence)

The binding_site(sequence) function scans every 30 bp window of a provided biological sequence and reports the location of the window with the lowest energy, along with that energy. This function is useful for determining a potential binding site for RNA polymerase.

In [9]:
def binding_site(sequence):
    """
    Input:
        sequence: string, biological sequence (at least 30 bp)
    Outputs:
        position_of_lowest_energy: start index of the 30 bp window with the lowest energy
        lowest_energy: the energy of that window in k_BT
    """
    # Start with no best window; any real energy is lower than infinity.
    position_of_lowest_energy = None
    lowest_energy = float("inf")

    # Slide a 30 bp window along the sequence, one base at a time.
    for position in range(len(sequence) - 29):
        energy_of_position = energy(sequence[position:position + 30])
        # Keep this window if its energy is lower than the lowest energy seen so far.
        if energy_of_position < lowest_energy:
            lowest_energy = energy_of_position
            position_of_lowest_energy = position

    return position_of_lowest_energy, lowest_energy

For RandSeq1, it reports the start position of the binding site and the energy of the window at that position.

In [10]:
binding_site(RandSeq1)

(22, -2.9115778711106644)

In [11]:
binding_site(RandSeq1_trimmed)

(0, -1.8063224473818504)
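
The sliding-window scan can be sketched on a small example to show which value identifies the binding site: the start index of the best window, not the last index the loop visited. This is an illustration under stated assumptions, not the notebook's code: `TOY_MATRIX` is a hypothetical 3-position matrix, so windows are 3 bp wide instead of 30, and `window_energy` and `best_window` are illustrative names.

```python
# Hypothetical 3-position energy matrix (values in k_BT); the real matrix has 30 rows.
TOY_MATRIX = [
    {"A": 0.5, "C": -0.25, "G": 0.25, "T": 1.0},
    {"A": -0.25, "C": 0.5, "G": 1.0, "T": 0.25},
    {"A": 0.75, "C": 0.25, "G": -0.25, "T": 0.5},
]


def window_energy(window, matrix):
    """Sum the matrix entry for each base of one window."""
    return sum(row[base] for row, base in zip(matrix, window))


def best_window(sequence, matrix):
    """Return (start, energy) of the lowest-energy window in the sequence."""
    width = len(matrix)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than one window")
    # Score every window; tuples compare by energy first, then by start index,
    # so ties go to the leftmost window.
    scored = [
        (window_energy(sequence[start:start + width], matrix), start)
        for start in range(len(sequence) - width + 1)
    ]
    energy, start = min(scored)
    return start, energy


print(best_window("TTCAGTT", TOY_MATRIX))  # (2, -0.75)
print(best_window("TGA", TOY_MATRIX))      # (0, 2.75)
```

The second call shows why taking the minimum over all windows is safer than comparing against an initial value of zero: a sequence whose windows all have positive energy still has a well-defined best window.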

### firstmutation_promoter_alternate(sequence)

The firstmutation_promoter_alternate(sequence) function predicts the first mutation within a given promoter region, on the principle that the base pair most likely to change is the one whose substitution gives the greatest decrease in energy. It applies that mutation and returns the new sequence, the new letter, the position of the change, and the decrease in energy. This criterion is not always applicable: in later cases we must consider which mutation produces the **best** binding site overall, not the size of the improvement itself. Returning several pieces of information, including the letter and the energy difference, will still be helpful for later functions.

In [12]:
def firstmutation_promoter_alternate(sequence):
    """
    Input:
        sequence: string, promoter region of a biological sequence (30 bp)
    Outputs:
        final_sequence: the input sequence after its first mutation
        letter_change: the letter introduced by the mutation
        bestposition: the location of the mutation
        energy_difference: the decrease in energy from replacing the original base, in k_BT
    """
    # Start from "no mutation": the original letter at position 0 with zero energy decrease.
    # energy_difference records the largest decrease found so far.
    bestposition = 0
    letter_change = sequence[0]
    energy_difference = 0

    # Look up the matrix energy of the original base at every position.
    for position, letter in enumerate(sequence):
        original_energy = data.loc[position, letter]

        # Try every possible letter (the columns of the energy matrix) at this position.
        for current_letter in data.columns:
            # Keep the substitution that gives the largest decrease in energy so far.
            decrease = original_energy - data.loc[position, current_letter]
            if energy_difference < decrease:
                energy_difference = decrease
                letter_change = current_letter
                bestposition = position

    # Convert the sequence to a list to make it mutable.
    list_sequence = list(sequence)
    # Change the letter at bestposition to letter_change.
    list_sequence[bestposition] = letter_change
    # Convert the list back to a string.
    final_sequence = "".join(list_sequence)
    return final_sequence, letter_change, bestposition, energy_difference


This function still works for our example RandSeq1_trimmed. The mutated promoter region, the new letter, its position, and the decrease in energy are

In [13]:
firstmutation_promoter_alternate(RandSeq1_trimmed)

('CGTTCAGGTTCTGGTTCTCCATGCTATAGT', 'T', 24, 1.1052554237288137)

### firstmutation_promoter(sequence)

The firstmutation_promoter(sequence) function also predicts the first mutation within a given promoter region. Instead of tracking the largest energy decrease, it applies each possible single-base change and keeps the one that gives the lowest total energy for the whole sequence. This is the version used by later functions, and like the alternate it returns several outputs.

In [14]:

def firstmutation_promoter(sequence):
    """
    Input:
        sequence: string, promoter region of a biological sequence (30 bp)
    Outputs:
        bestseq: the sequence after its first mutation
        bestenergy: the binding energy of the mutated sequence in k_BT
        bestletter: the letter introduced by the mutation (None if no mutation lowers the energy)
        bestposition: the location of the mutation (None if no mutation lowers the energy)
    """
    # Initialize the best sequence and best energy with the unmutated input.
    bestseq = sequence
    bestenergy = energy(sequence)
    bestletter = None
    bestposition = None

    # Loop through every position in the sequence.
    for position in range(len(sequence)):

        # Loop through the possible bases: A, C, G and T.
        for current_letter in data.columns:

            # Construct a new sequence with the mutation and compute its energy once.
            new_seq = sequence[:position] + current_letter + sequence[position + 1:]
            new_energy = energy(new_seq)

            # Update the best sequence and energy if this mutation is better.
            if new_energy < bestenergy:
                bestseq = new_seq
                bestletter = current_letter
                bestposition = position
                bestenergy = new_energy

    return bestseq, bestenergy, bestletter, bestposition

For the example of RandSeq1_trimmed, its first mutation and binding energy after that mutation would be

In [15]:
firstmutation_promoter(RandSeq1_trimmed)

('CGTTCAGGTTCTGGTTCTCCATGCTATAGT', -2.9115778711106644, 'T', 24)
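
Both versions pick T at position 24 for RandSeq1_trimmed. Within a single fixed window this agreement is expected, because the energy is a sum over positions: a substitution that lowers one position's entry by some amount lowers the total by the same amount, so the largest decrease and the lowest resulting energy select the same mutation (up to ties and floating-point rounding). The sketch below checks this on a toy case. It is an illustration only: `TOY_MATRIX` is a hypothetical 3-position matrix, and `window_energy`, `largest_drop`, and `lowest_after_mutation` are illustrative names, not functions from the notebook.

```python
# Hypothetical 3-position energy matrix (values in k_BT); the real matrix has 30 rows.
TOY_MATRIX = [
    {"A": 0.5, "C": -0.25, "G": 0.25, "T": 1.0},
    {"A": -0.25, "C": 0.5, "G": 1.0, "T": 0.25},
    {"A": 0.75, "C": 0.25, "G": -0.25, "T": 0.5},
]


def window_energy(sequence, matrix):
    """Sum the matrix entry for each base of one window."""
    return sum(row[base] for row, base in zip(matrix, sequence))


def largest_drop(sequence, matrix):
    """Criterion 1: the substitution with the largest per-position energy decrease."""
    best = (None, None, 0.0)  # (position, new_base, drop)
    for position, (row, base) in enumerate(zip(matrix, sequence)):
        for new_base, value in row.items():
            drop = row[base] - value
            if drop > best[2]:
                best = (position, new_base, drop)
    return best


def lowest_after_mutation(sequence, matrix):
    """Criterion 2: the single substitution giving the lowest total energy."""
    best = (None, None, window_energy(sequence, matrix))  # (position, new_base, energy)
    for position in range(len(sequence)):
        for new_base in "ACGT":
            mutated = sequence[:position] + new_base + sequence[position + 1:]
            mutated_energy = window_energy(mutated, matrix)
            if mutated_energy < best[2]:
                best = (position, new_base, mutated_energy)
    return best


seq = "TCA"
print(largest_drop(seq, TOY_MATRIX))           # (0, 'C', 1.25)
print(lowest_after_mutation(seq, TOY_MATRIX))  # (0, 'C', 1.0)
```

The starting energy of "TCA" is 2.25, and 2.25 minus the drop of 1.25 gives the post-mutation energy of 1.0. The two criteria can disagree once several windows of a longer sequence are compared, which is the case the next function handles.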

### first_mutation(sequence)

For an entire sequence, we can predict the first mutation that will occur by finding the 30 bp promoter region that would have the lowest energy after a single base pair change. The first_mutation(sequence) function relies on firstmutation_promoter(sequence) to find the best single base pair change within each region.

In [16]:
def first_mutation(sequence):
    """
    Input:
        sequence: string, a biological sequence (at least 30 bp)
    Outputs:
        finalsequence: the full sequence with the first mutation applied
        bestsequence: the outputs of firstmutation_promoter(sequence) for the
            30 bp region with the lowest binding energy after mutation
    """
    # Start from the unmutated sequence and an infinite energy, so the first region always replaces it.
    finalsequence = sequence
    bestsequence = (None, float("inf"), None, None)

    # Loop through the sequence for potential promoter regions of 30 bp, stopping when no full region is left.
    for position in range(len(sequence) - 29):
        currentmutation = firstmutation_promoter(sequence[position:position + 30])

        # Keep the region whose best single mutation gives the lowest binding energy,
        # and splice the mutated region back into the full sequence.
        if currentmutation[1] < bestsequence[1]:
            bestsequence = currentmutation
            finalsequence = sequence[:position] + bestsequence[0] + sequence[position + 30:]

    return finalsequence, bestsequence

In [17]:
first_mutation(RandSeq1)

('ATAGGAGCGTCATCAAACGCGCCGTACAGGTTCTGGTTCTCCATGCTATAGTTAAGCCGCACAACGGGTACTACCACTCCCTGTAGTCCGCTTTACCGTTCTC',
 ('CGTACAGGTTCTGGTTCTCCATGCTATAGT', -3.924067560376202, 'A', 3))

In [18]:
RandSeq3_lower = "cgaggcgtttgacaagtactcatccactgtgggaggcgacgagagacgctgcctgcggcatttcgtgatcataatgtctgccgttaactatgaataccggccg"
RandSeq3 = RandSeq3_lower.upper()

In [19]:
RandSeq3

'CGAGGCGTTTGACAAGTACTCATCCACTGTGGGAGGCGACGAGAGACGCTGCCTGCGGCATTTCGTGATCATAATGTCTGCCGTTAACTATGAATACCGGCCG'

In [20]:
first_mutation(RandSeq3)

('CGAGGCGTTTGACAAGTACTCATCCACTGTGGGAGGCGACGAGAGACGCTGCCTGCGGCATTTCGTTATCATAATGTCTGCCGTTAACTATGAATACCGGCCG',
 ('GAGACGCTGCCTGCGGCATTTCGTTATCAT', -4.462285365186513, 'T', 24))
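
A quick consistency check on this result is to confirm that the returned sequence differs from the input at exactly one base, and that the reported window and window position line up with that base. The sketch below copies the strings from the cells above; `differences` is an illustrative helper name, not a function from the notebook.

```python
# Sequences and outputs copied from the notebook cells above.
RandSeq3 = "CGAGGCGTTTGACAAGTACTCATCCACTGTGGGAGGCGACGAGAGACGCTGCCTGCGGCATTTCGTGATCATAATGTCTGCCGTTAACTATGAATACCGGCCG"
mutated = "CGAGGCGTTTGACAAGTACTCATCCACTGTGGGAGGCGACGAGAGACGCTGCCTGCGGCATTTCGTTATCATAATGTCTGCCGTTAACTATGAATACCGGCCG"
window = "GAGACGCTGCCTGCGGCATTTCGTTATCAT"  # mutated 30 bp region reported by first_mutation
position_in_window = 24                    # mutation position reported by first_mutation


def differences(before, after):
    """List (index, old_base, new_base) wherever two equal-length sequences differ."""
    if len(before) != len(after):
        raise ValueError("sequences must have the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(before, after)) if a != b]


diffs = differences(RandSeq3, mutated)
index, old_base, new_base = diffs[0]

# The window start follows from the absolute index and the position within the window.
window_start = index - position_in_window

print(diffs)         # every differing position
print(window_start)  # start of the mutated region in the full sequence
print(mutated[window_start:window_start + 30] == window)
```

For this example the only difference is a G to T change at 0-based index 66, which places the start of the mutated region at index 42.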