# Sequence Alignment and Detecting Motifs

This work was done for the ULB course Computational biology and bioinformatics (INFO-F-439) by Thomas Van Gysegem (all rights reserved).

First, let us introduce some utility functions used throughout this project. The first one displays our dynamic programming (DP) matrix:

In [147]:
%%html
<style>
.clear td, .clear, .clear tr, .clear th {border:none!important}
</style>

In [180]:
from IPython.display import HTML, display

def display_dp_matrix(dp_matrix, x, y, backtrack_matrix):
    """Display the DP matrix, both sequences and the backtrack flags as an HTML table."""
    
    html_str = '<table>'
    
    # First row (header)
    html_str += '<tr><td></td><td></td><td>{}</td></tr>'.format('</td><td>'.join(_ for _ in y))
    
    for i in range(len(dp_matrix)):
        row = dp_matrix[i]
        
        c = '' if i <= 0 else x[i - 1]
        
        html_str += '<tr><td>{}</td>'.format(c)
        
        for j in range(len(row)):
            value = row[j]
            
            # Each cell is rendered as a 2x2 sub-table:
            #   flag [0] top-left,    flag [1] top-right,
            #   flag [2] bottom-left, score    bottom-right
            back0 = 'o' if backtrack_matrix[i][j][0] else ''
            back1 = 'o' if backtrack_matrix[i][j][1] else ''
            back2 = 'o' if backtrack_matrix[i][j][2] else ''
            
            sub_table = '<table class="clear"><tr><td>{}</td><td>{}</td></tr>'.format(back0, back1)
            sub_table += '<tr><td>{}</td><td>{}</td></tr></table>'.format(back2, int(value))
            
            html_str += '<td>{}</td>'.format(sub_table)
        
        html_str += '</tr>'
    
    html_str += '</table>'
    
    display(HTML(html_str))

The second one parses the BLOSUM and PAM matrices we use. For simplicity's sake, we store them in a two-level Python dictionary:

In [132]:
def parse_ranking_matrix(filename):
    """Parse a ranking matrix file; lines beginning with # are comments."""
    ranking_matrix = {}
    
    header = []
    metadata_parsed = False
    
    with open(filename, 'r') as file:
        for line in file:
            line = line.strip()
            
            # Ignore empty or comment lines
            if line == "" or line.startswith('#'):
                continue
            
            # First line of data: header with the column symbols
            elif not metadata_parsed:
                header = line.split()
                metadata_parsed = True
            
            # Other lines: row symbol followed by one score per column
            else:
                data = line.split()
                
                key1 = data[0]
                data = data[1:]
                
                ranking_matrix[key1] = {}
                for i in range(len(data)):
                    key2 = header[i]
                    value = int(data[i])
                    
                    ranking_matrix[key1][key2] = value
    
    return ranking_matrix
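
To make the resulting data structure concrete, here is a minimal sketch that parses a tiny matrix and looks up a score. The three-letter matrix, its values and the helper name `parse_small_matrix` are hypothetical and chosen only for illustration; the helper is a compact variant of the parser above, not a replacement for it.

```python
import os
import tempfile

# Hypothetical three-letter matrix in the usual layout (illustrative values only)
SAMPLE = """# toy substitution matrix
   A  C  G
A  3 -3  1
C -3  9 -5
G  1 -5  5
"""

def parse_small_matrix(path):
    """Compact variant of the parser: split every line on whitespace."""
    matrix = {}
    header = None
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            # Skip empty lines and comment lines
            if not fields or fields[0].startswith('#'):
                continue
            # The first data line holds the column symbols
            if header is None:
                header = fields
                continue
            # Remaining lines: row symbol, then one integer per column
            matrix[fields[0]] = {k: int(v) for k, v in zip(header, fields[1:])}
    return matrix

# Write the sample to a temporary file so the example is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(SAMPLE)

toy = parse_small_matrix(tmp.name)
os.remove(tmp.name)

print(toy['A']['C'])  # score for aligning A with C
print(toy['G'])       # full row for G
```

The outer key is the row symbol and the inner key the column symbol, so a lookup such as `q[key1][key2]` needs no index arithmetic.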

## Part 1: Sequence Alignment Algorithms

This is the first part of the assignment.

### Needleman-Wunsch (global)

First, please input your sequences here:

In [116]:
x = 'AAA'
y = 'AACCCCGGTTT'

Then, we need to load a ranking matrix and specify a gap penalty value:

In [108]:
filename = 'PAM120'
q = parse_ranking_matrix(filename)
gap_penalty = -6
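
For reference, these two inputs enter the global alignment recurrence as sketched below. The symbols are notation introduced here, not taken from the notebook: $F$ stands for `dp_matrix`, $s$ for the scores stored in `q`, and $d$ for `gap_penalty` (negative here, hence the plus signs).

```latex
F(i, 0) = i \, d, \qquad F(0, j) = j \, d

F(i, j) = \max
\begin{cases}
F(i-1, j-1) + s(x_i, y_j) & \text{align } x_i \text{ with } y_j \\
F(i-1, j) + d             & \text{align } x_i \text{ with a gap} \\
F(i, j-1) + d             & \text{align } y_j \text{ with a gap}
\end{cases}
```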

In [178]:
import numpy as np

M = len(x) + 1
N = len(y) + 1

dp_matrix = np.zeros((M, N))

# Three backtrack flags per cell, all unset initially
backtrack_matrix = [[[False, False, False] for _ in range(N)] for _ in range(M)]

#display_dp_matrix(dp_matrix, x, y, backtrack_matrix)

In [164]:
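# Initialise the first column and first row with cumulative gap penalties
# (run this after the cell that creates dp_matrix, since np.zeros resets the matrix)
for i in range(M):
    dp_matrix[i][0] = gap_penalty * i
    
for j in range(N):
    dp_matrix[0][j] = gap_penalty * j

#display_dp_matrix(dp_matrix, x, y, backtrack_matrix)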
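
The same border initialisation can also be written with NumPy slicing. This is only a sketch of an equivalent formulation; `border_demo` is a hypothetical name used so that the example does not touch `dp_matrix`.

```python
import numpy as np

x = 'AAA'
y = 'AACCCCGGTTT'
gap_penalty = -6

M, N = len(x) + 1, len(y) + 1
border_demo = np.zeros((M, N))

# Column 0: a prefix of x aligned against nothing costs one gap per residue
border_demo[:, 0] = gap_penalty * np.arange(M)
# Row 0: the same for prefixes of y
border_demo[0, :] = gap_penalty * np.arange(N)

print(border_demo[:, 0])
print(border_demo[0, :])
```

Only the first row and column are written; every inner cell stays at zero until the main loop fills it.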

In [181]:
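# Fill the DP matrix row by row
for i in range(1, M):
    for j in range(1, N):
        key1 = x[i - 1]
        key2 = y[j - 1]
        
        # Candidate scores coming from the three neighbouring cells
        no_gap = dp_matrix[i - 1][j - 1] + q[key1][key2]  # diagonal: x[i-1] aligned with y[j-1]
        gap_in_x = dp_matrix[i][j - 1] + gap_penalty      # left neighbour: y[j-1] faces a gap
        gap_in_y = dp_matrix[i - 1][j] + gap_penalty      # upper neighbour: x[i-1] faces a gap
        
        value = max(no_gap, gap_in_x, gap_in_y)
        
        dp_matrix[i][j] = value
        
        # Flag order: [diagonal, left neighbour, upper neighbour]
        backtrack_matrix[i][j] = [
            no_gap == value,
            gap_in_x == value,
            gap_in_y == value
        ]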

display_dp_matrix(dp_matrix, x, y, backtrack_matrix)

Output (flags precede each score: `o` means only `backtrack_matrix[i][j][0]` is set, `oo` means flags `[0]` and `[1]` are both set):

|   |   | A | A | C | C | C | C | G | G | T | T | T |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A | 0 | o 3 | o 3 | oo -3 | o -3 | o -3 | o -3 | o 1 | o 1 | o 1 | o 1 | o 1 |
| A | 0 | o 3 | o 6 | oo 0 | oo -6 | o -6 | o -6 | o -2 | o 2 | o 2 | o 2 | o 2 |
| A | 0 | o 3 | o 6 | o 3 | oo -3 | oo -9 | o -9 | o -5 | o -1 | o 3 | o 3 | o 3 |

### Smith-Waterman (local)

In [87]:
range(1, 10)[0]

1