The Needleman-Wunsch algorithm is used in bioinformatics to align protein or nucleotide sequences.

The algorithm uses dynamic programming: it divides a large problem (e.g. aligning the full sequences) into a series of smaller problems, and uses the solutions to these smaller problems to find an optimal solution to the larger problem.

This algorithm is still widely used for optimal global alignment, particularly when the quality of the alignment is of utmost importance. It assigns a score to every possible alignment, and its purpose is to find all alignments that achieve the highest score.
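
To make "assigning a score to an alignment" concrete, here is a minimal sketch that scores one already-aligned pair of strings column by column. The function name `score_alignment`, the use of `-` as the gap character, and the default values (1 for a match, -1 for a mismatch, -2 per gap position) are my own assumptions for illustration.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Score two gapped strings of equal length, one column at a time."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for x, y in zip(a, b):
        if x == '-' or y == '-':
            total += gap        # one sequence has a gap in this column
        elif x == y:
            total += match      # identical characters
        else:
            total += mismatch   # different characters
    return total

print(score_alignment("TGCCA", "T-CCA"))  # 1 - 2 + 1 + 1 + 1 = 2
print(score_alignment("TGCCA", "TCCA-"))  # 1 - 1 + 1 - 1 - 2 = -2
```

The two calls score different alignments of the same pair of sequences; the algorithm's job is to find the best-scoring one without enumerating them all.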

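The algorithm can be used for any two strings. The first step is to construct a grid, with the first sequence along the top and the second sequence down the side. The grid has one extra row and one extra column at the start, which represent aligning a sequence against gaps only.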
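
As a sketch of the recurrence that fills this grid, written in my own notation rather than anything taken from a reference: F(i, j) is the score in row i and column j, a is the sequence along the top, b is the sequence down the side, s(x, y) is the match or mismatch score for two characters, and g is the (negative) gap penalty.

```latex
\begin{aligned}
F(0,0) &= 0, \qquad F(i,0) = i\,g, \qquad F(0,j) = j\,g, \\
F(i,j) &= \max
\begin{cases}
F(i-1,\,j-1) + s(a_j, b_i) & \text{diagonal: align } a_j \text{ with } b_i \\
F(i,\,j-1) + g & \text{horizontal: gap in } b \\
F(i-1,\,j) + g & \text{vertical: gap in } a
\end{cases}
\end{aligned}
```

The score of the optimal global alignment is the value in the bottom-right cell of the grid.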

In [1]:
import numpy as np

In [17]:
def diagonal(n1, n2, pt):
    """Return the match score if the two characters are equal, else the mismatch score."""
    if n1 == n2:
        return pt['MATCH']
    else:
        return pt['MISMATCH']

In [18]:
def pointers(di, ho, ve):
    """Return which move gave the best score: 'D', 'H' or 'V' (ties are resolved in that order)."""
    pointer = max(di, ho, ve)

    if di == pointer:
        return 'D'
    elif ho == pointer:
        return 'H'
    else:
        return 'V'

In [35]:
def nw_simple(s1, s2, match, mismatch, gap):
    """Print the global alignment score, score matrix and pointer matrix for s1 and s2."""
    penalty = {'MATCH': match, 'MISMATCH': mismatch, 'GAP': gap} # Dictionary for scoring values
    
    n = len(s1) + 1 # Number of columns (s1 runs along the top)
    m = len(s2) + 1 # Number of rows (s2 runs down the side)
    
    f_matrix = np.zeros((m,n), dtype = int) # Initialises a zero-filled score matrix
    p_matrix = np.zeros((m,n), dtype = str) # Initialises an empty pointer matrix for backtracking
    
    # Fill first row and column with cumulative gap penalties
    for i in range(m):
        f_matrix[i][0] = penalty['GAP'] * i
        p_matrix[i][0] = 'V'
        
    for j in range(n):
        f_matrix[0][j] = penalty['GAP'] * j
        p_matrix[0][j] = 'H'
        
    p_matrix[0][0] = 0 # The origin has no predecessor; stored as '0' in the string array

    # Fill the matrix
    for i in range(1,m):
        for j in range(1,n):
            di = f_matrix[i-1][j-1] + diagonal(s1[j-1], s2[i-1], penalty)
            ho = f_matrix[i][j-1] + penalty['GAP'] # Horizontal move: from the cell to the left
            ve = f_matrix[i-1][j] + penalty['GAP'] # Vertical move: from the cell above
            f_matrix[i][j] = max(di, ho, ve)
            p_matrix[i][j] = pointers(di, ho, ve)
    
    score = f_matrix[-1][-1]
    
    print("Alignment Score: {0}".format(score))
    print("\n" + "Alignment Matrix:")
    print(f_matrix)
    print("\n" + "Pointer Matrix:")
    print(p_matrix)

In [36]:
sequence1 = 'TGCCA'
sequence2 = 'TCCA'

nw_simple(sequence1, sequence2, 1, -1, -2)

Alignment Score: 2

Alignment Matrix:
[[  0  -2  -4  -6  -8 -10]
 [ -2   1  -1  -3  -5  -7]
 [ -4  -1   0   0  -2  -4]
 [ -6  -3  -2   1   1  -1]
 [ -8  -5  -4  -1   0   2]]

Pointer Matrix:
[['0' 'H' 'H' 'H' 'H' 'H']
 ['V' 'D' 'H' 'H' 'H' 'H']
 ['V' 'V' 'D' 'D' 'D' 'H']
 ['V' 'V' 'D' 'D' 'D' 'H']
 ['V' 'V' 'D' 'V' 'D' 'D']]
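
The pointer matrix is meant for backtracking, but the cell above stops at printing it. Below is a minimal, self-contained sketch of how one optimal alignment could be recovered by walking the pointers from the bottom-right cell back to the origin. The function name `nw_align` and its default scores are my own choices. It states its own pointer convention: 'D' is a diagonal move, 'H' means the cell was reached from the left (a gap in the side sequence), and 'V' means it was reached from above (a gap in the top sequence). When several moves tie, it keeps the first in the order D, H, V, so it returns only one of possibly several optimal alignments.

```python
def nw_align(s1, s2, match=1, mismatch=-1, gap=-2):
    """Return (aligned_s1, aligned_s2, score) for one optimal global alignment."""
    n, m = len(s1) + 1, len(s2) + 1            # columns follow s1, rows follow s2
    score = [[0] * n for _ in range(m)]
    move = [[''] * n for _ in range(m)]

    # First column and first row: alignment against gaps only
    for i in range(1, m):
        score[i][0], move[i][0] = gap * i, 'V'
    for j in range(1, n):
        score[0][j], move[0][j] = gap * j, 'H'

    # Fill the grid; max() keeps the first best option, so ties prefer D, then H, then V
    for i in range(1, m):
        for j in range(1, n):
            sub = match if s1[j - 1] == s2[i - 1] else mismatch
            options = [
                (score[i - 1][j - 1] + sub, 'D'),
                (score[i][j - 1] + gap, 'H'),
                (score[i - 1][j] + gap, 'V'),
            ]
            score[i][j], move[i][j] = max(options, key=lambda o: o[0])

    # Backtrack from the bottom-right cell to the origin
    top, side = [], []
    i, j = m - 1, n - 1
    while i > 0 or j > 0:
        step = move[i][j]
        if step == 'D':                        # align s1[j-1] with s2[i-1]
            top.append(s1[j - 1]); side.append(s2[i - 1])
            i, j = i - 1, j - 1
        elif step == 'H':                      # consume s1 only: gap in s2
            top.append(s1[j - 1]); side.append('-')
            j -= 1
        else:                                  # 'V', consume s2 only: gap in s1
            top.append('-'); side.append(s2[i - 1])
            i -= 1
    return ''.join(reversed(top)), ''.join(reversed(side)), score[-1][-1]

print(nw_align('TGCCA', 'TCCA'))  # ('TGCCA', 'T-CCA', 2)
```

For the two sequences used above, the walk takes three diagonal steps, one horizontal step (the gap opposite G) and a final diagonal step, giving the score of 2.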


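Next, I will run Needleman-Wunsch with a similarity matrix for comparing sequences of amino acids. Instead of a single match and mismatch score, a similarity matrix gives a separate score for each pair of residues. I will be using the BLOSUM62 similarity matrix.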

In [39]:
# import biotite.sequence as seq
# import biotite.sequence.align as align
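
Since the library imports above are commented out, here is a library-free sketch of what changes when a similarity matrix replaces the fixed match and mismatch scores: only the diagonal term of the fill step looks up a pair-specific score. The table `TOY_SIMILARITY`, its three-letter alphabet and all of its values are invented for illustration and are not BLOSUM62 values; the names `lookup` and `nw_with_table` are likewise my own.

```python
# Invented scores for a three-letter alphabet; NOT taken from BLOSUM62.
# Each unordered pair of residues is stored once.
TOY_SIMILARITY = {
    ('A', 'A'): 3, ('G', 'G'): 3, ('S', 'S'): 3,
    ('A', 'S'): 1, ('A', 'G'): 0, ('G', 'S'): -1,
}

def lookup(x, y, table):
    """Return the similarity of residues x and y, whichever order the pair is stored in."""
    return table[(x, y)] if (x, y) in table else table[(y, x)]

def nw_with_table(s1, s2, table, gap):
    """Return the optimal global alignment score using a pairwise similarity table."""
    n, m = len(s1) + 1, len(s2) + 1            # columns follow s1, rows follow s2
    f = [[0] * n for _ in range(m)]
    for i in range(m):
        f[i][0] = gap * i
    for j in range(n):
        f[0][j] = gap * j
    for i in range(1, m):
        for j in range(1, n):
            f[i][j] = max(
                f[i - 1][j - 1] + lookup(s1[j - 1], s2[i - 1], table),  # pair-specific score
                f[i][j - 1] + gap,                                       # gap in s2
                f[i - 1][j] + gap,                                       # gap in s1
            )
    return f[-1][-1]

print(nw_with_table('AGS', 'AS', TOY_SIMILARITY, gap=-2))  # 4, from A/A, G/-, S/S
```

Swapping in a real substitution matrix such as BLOSUM62 would only change the contents of the table, not the fill logic.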