## Bioinformatics: Sequence Alignment
Sequence alignment in bioinformatics is the process of comparing DNA, RNA, or protein sequences to find similarities. It helps scientists understand relationships, functions, or evolutionary history. By lining up sequences, we can spot matches, differences, and important regions.

In [1]:
from collections import Counter
import random
import numpy as np

## Bioinformatics II: Sequence Alignment 

References:
Jones and Pevzner 2004, An Introduction to Bioinformatics Algorithms

# Edit distance
Edit distance measures how dissimilar two strings are: it is the minimum number of edits (substitutions, insertions, and deletions) needed to turn one string into the other.
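
As a rough illustration of what counts as a single edit, the sketch below applies one substitution, one insertion, and one deletion to a short DNA string. The helper names (`substitute`, `insert`, `delete`) and the example string are assumptions made for this illustration; they do not come from the referenced text.

```python
# Hypothetical helpers, each performing exactly one edit on a string.
def substitute(seq, i, base):
    """Replace the symbol at index i with base."""
    return seq[:i] + base + seq[i + 1:]

def insert(seq, i, base):
    """Insert base so that it ends up at index i."""
    return seq[:i] + base + seq[i:]

def delete(seq, i):
    """Remove the symbol at index i."""
    return seq[:i] + seq[i + 1:]

s = "ATGC"
step1 = substitute(s, 1, "C")   # ATGC  -> ACGC
step2 = insert(step1, 4, "T")   # ACGC  -> ACGCT
step3 = delete(step2, 0)        # ACGCT -> CGCT

for step in (s, step1, step2, step3):
    print(step)
```

Three edits turn `ATGC` into `CGCT`, so the edit distance between them is at most 3. Any particular chain of edits only gives an upper bound; the edit distance is the length of the shortest such chain.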

# Similarity Searches in Bioinformatics
- DNA sequence comparison: biologists infer the function of a newly sequenced gene by finding similarities with genes of known function.
- DNA mutation: errors during DNA replication lead to substitutions, insertions, and deletions of nucleotides.

- In 1984, scientists studying the v-sis oncogene, a cancer-causing gene, compared it with all genes known at the time and found that it matched a normal gene involved in growth and development. They hypothesized that cancer could be caused by a normal growth gene doing its job at the wrong time.
- Similarly, the cystic fibrosis gene was discovered through a successful similarity search. Cystic fibrosis is a fatal disease that damages body organs and produces thick mucus that clogs the lungs.

## Consensus string
The consensus string takes the most common nucleobase at each index (column) of a set of equal-length sequences and joins those bases into a single string.

In [2]:
def consensus_string(matrix):
    consensus = []
    num_columns = len(matrix[0])

    for col in range(num_columns):
        column_bases = [row[col] for row in matrix]   
        counts = Counter(column_bases)                
        most_common = counts.most_common(1)[0][0]     
        consensus.append(most_common)

    return ''.join(consensus)

In [3]:
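# generate_matrix() is not defined anywhere in this notebook, so a small
# hand-written matrix of equal-length sequences stands in for it here.
P = [
    ['A', 'T', 'G', 'C', 'A'],
    ['A', 'T', 'G', 'G', 'A'],
    ['A', 'C', 'G', 'C', 'A'],
    ['T', 'T', 'G', 'C', 'A'],
]

for row in P:
    print(" ".join(row))

consensus_string(P)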

## 1. Hamming Distance

The **Hamming Distance** is the number of positions at which the corresponding symbols are different in two strings of **equal length**.  
It is used in coding theory, genetics, and information theory.

The symbols may be **letters**, **bits**, or **decimal digits**, among other possibilities.

### Examples:
- `"Hamming"` vs `"Hammers"` → **3**
- `"Data"` vs `"data"` → **1**
- `"structures"` vs `"Structured"` → **2**
- `0000` vs `8888` → **4**
- `123456` vs `120459` → **2**
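
The examples above can be checked with a short helper. This is a minimal sketch: the function name `hamming` is an assumption for illustration, and the numeric examples are passed as strings so that each digit counts as one symbol.

```python
def hamming(s, t):
    """Count the positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        # Hamming distance is only defined for sequences of equal length.
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(s, t))

print(hamming("Hamming", "Hammers"))        # 3
print(hamming("Data", "data"))              # 1 (comparison is case-sensitive)
print(hamming("structures", "Structured"))  # 2
print(hamming("0000", "8888"))              # 4
print(hamming("123456", "120459"))          # 2
```

Raising an error on unequal lengths is a design choice made here; without the check, `zip` would silently stop at the end of the shorter sequence.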

In [7]:
a = ['A', 'T', 'T', 'G', 'T', 'C']
b = ['A', 'C', 'T', 'C', 'T', 'C']

distance = 0

print(f'Distance: {distance}')

# Count the positions where the two sequences differ
for i in range(len(a)):
    if a[i] != b[i]:
        distance += 1

print(f'Hamming distance: {distance}')

Distance: 0
Hamming distance: 2


In [5]:
# Generate a random matrix of DNA sequences that all share the same length.
# The first row is used later as the reference sequence when counting mismatches.
def generate_dna_sequences_samelength():
    bases = ['A', 'T', 'C', 'G']
    matrix_rows = random.randint(5, 10) #number of dna sequences

    sequences = []
    length = random.randint(5,8)
    for row in range(matrix_rows):
        sequence = [random.choice(bases) for column in range(length)]
        sequences.append(sequence)

    return sequences

X = generate_dna_sequences_samelength()

for row in X:
    print(" ".join(row))

T G A T A
G G A G C
G T G C A
G C G A T
G C A T T
C C C C C
A G C G G


In [6]:
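def Hamming_distance(matrix, reference_index=0):
    """Return the total number of mismatches between the reference row and every row.

    The reference row is also compared with itself, which adds 0 to the total.
    """
    reference = matrix[reference_index]  # by default, compare to the first row
    distance = 0
    for row in matrix:
        for i in range(len(reference)):
            if row[i] != reference[i]:
                distance += 1
    return distance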

Hamming_distance(X)

24
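
To see where the total of 24 comes from, the sketch below breaks it down row by row. It hard-codes the matrix printed above (the original was generated randomly, so a rerun would give different sequences). The name `hamming_per_row` is an assumption for illustration, and the helper assumes all rows have the same length.

```python
# The seven sequences printed above, copied by hand.
X_printed = [
    list("TGATA"),
    list("GGAGC"),
    list("GTGCA"),
    list("GCGAT"),
    list("GCATT"),
    list("CCCCC"),
    list("AGCGG"),
]

def hamming_per_row(matrix, reference_index=0):
    """Return one Hamming distance per row, each measured against the reference row."""
    reference = matrix[reference_index]
    return [sum(a != b for a, b in zip(row, reference)) for row in matrix]

per_row = hamming_per_row(X_printed)
print(per_row)       # [0, 3, 4, 5, 3, 5, 4]
print(sum(per_row))  # 24
```

The first entry is 0 because the reference row is compared with itself.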

However, Hamming distance is only defined for sequences of identical length.

In [4]:
def generate_random_dna_sequences():
    bases = ['A', 'T', 'C', 'G']
    num_sequences = random.randint(5, 10)

    sequences = []
    for _ in range(num_sequences):
        length = random.randint(5, 15)
        sequence = [random.choice(bases) for _ in range(length)]
        sequences.append(sequence)

    return sequences

random_dna = generate_random_dna_sequences()
for i, seq in enumerate(random_dna, 1):
    print(seq)

X = generate_random_dna_sequences()
Hamming_distance(X)


['C', 'T', 'G', 'C', 'T', 'A', 'G', 'G', 'T', 'A', 'T', 'C', 'A', 'G']
['T', 'G', 'C', 'T', 'A', 'T', 'A', 'A', 'C', 'C', 'T', 'C']
['A', 'T', 'G', 'C', 'A', 'C', 'A', 'C', 'C', 'C', 'G', 'A']
['C', 'A', 'A', 'G', 'T', 'G', 'T', 'A', 'T', 'G']
['C', 'G', 'A', 'T', 'C', 'C', 'A', 'A', 'C', 'G', 'G', 'A', 'T']
['C', 'A', 'A', 'A', 'T', 'G', 'T', 'C', 'G', 'T', 'C', 'A', 'C']
['C', 'T', 'A', 'C', 'T', 'C', 'C', 'T', 'C', 'A', 'T', 'G', 'A']


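With rows of different lengths, `Hamming_distance` cannot return a meaningful result: a row shorter than the reference raises an `IndexError`, and any bases beyond the reference's length in a longer row are ignored.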

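## 2. Levenshtein Distance
The Levenshtein edit distance also works for strings of different lengths: it is the minimum number of single-symbol insertions, deletions, and substitutions needed to turn one string into the other.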
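
Written as a recurrence, the table filled in by the `levenshtein` function in this section looks as follows. The symbols $D$, $s$, and $t$ are notation chosen here for illustration: $s$ has length $m$, $t$ has length $n$, and $D(i, j)$ is the distance between the first $i$ symbols of $s$ and the first $j$ symbols of $t$.

```latex
\begin{aligned}
D(i, 0) &= i, \qquad D(0, j) = j, \\
D(i, j) &=
\begin{cases}
D(i-1,\, j-1) & \text{if } s_i = t_j, \\
1 + \min\bigl\{\, D(i,\, j-1),\; D(i-1,\, j),\; D(i-1,\, j-1) \,\bigr\} & \text{otherwise,}
\end{cases}
\end{aligned}
```

The answer is $D(m, n)$. The three terms inside the minimum correspond to an insertion, a deletion, and a substitution, respectively.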

In [5]:
def levenshtein(string_one, string_two):
    m = len(string_one)
    n = len(string_two)

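    # dp[i][j] holds the edit distance between the first i symbols of
    # string_one and the first j symbols of string_two. The first row and
    # column cover the empty-string cases, so no early return is needed.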

    dp = [[0]*(n + 1) for _ in range(m + 1)]

    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if string_one[i - 1] == string_two[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min([
                                    dp[i][j - 1], 
                                    dp[i - 1][j], 
                                    dp[i - 1][j - 1]
                                    ])
    return dp[m][n]
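
The same table can also be traced back to recover which edits were made, which is what turns a distance into an alignment. The sketch below is one possible way to do that, not the only one. The name `levenshtein_alignment` and the `-` gap symbol are illustrative choices, and when several optimal alignments exist it simply returns one of them.

```python
def levenshtein_alignment(s, t):
    """Return (distance, aligned_s, aligned_t), using '-' to mark a gap."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete s[i - 1]
                           dp[i][j - 1] + 1,         # insert t[j - 1]
                           dp[i - 1][j - 1] + cost)  # match or substitute

    # Walk back from the bottom-right cell, preferring the diagonal move.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        cost = 1
        if i > 0 and j > 0 and s[i - 1] == t[j - 1]:
            cost = 0
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + cost:
            top.append(s[i - 1])
            bottom.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            top.append(s[i - 1])
            bottom.append('-')
            i -= 1
        else:
            top.append('-')
            bottom.append(t[j - 1])
            j -= 1
    return dp[m][n], ''.join(reversed(top)), ''.join(reversed(bottom))

distance, aligned_s, aligned_t = levenshtein_alignment("kitten", "sitting")
print(distance)
print(aligned_s)
print(aligned_t)
```

Each column of the two aligned strings is either a match, a substitution, or a gap, and the number of columns that differ equals the returned distance.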

In [6]:
L = generate_random_dna_sequences()

for i, seq in enumerate(L, 1):
    print(seq)

n = len(L)
dist_matrix = np.zeros((n, n), dtype=int)

for i in range(n):
    for j in range(n):
        dist_matrix[i, j] = levenshtein(L[i], L[j])

print("\nLevenshtein Distance Matrix:")
for row in dist_matrix:
    print(' '.join(map(str, row)))


['C', 'C', 'A', 'C', 'A', 'G', 'T', 'C', 'C', 'C']
['G', 'A', 'G', 'C', 'G']
['T', 'T', 'T', 'T', 'C', 'A', 'T', 'G', 'G', 'T', 'G', 'T']
['A', 'G', 'G', 'A', 'A', 'C', 'T', 'C', 'A', 'G', 'G', 'A', 'T']
['G', 'C', 'T', 'T', 'T', 'C', 'T', 'C']
['A', 'C', 'C', 'T', 'G', 'T', 'A', 'G']

Levenshtein Distance Matrix:
0 7 9 10 6 6
7 0 9 9 6 6
9 9 0 9 8 8
10 9 9 0 10 8
6 6 8 10 0 6
6 6 8 8 6 0


## 3. Multisequence
Description

## 4. Local Sequence Alignment (Done)


## 5. Global Sequence Alignment (Done)


## 7. BLAST 
Why BLAST is fast &
Why BLAST is incomplete