# **Session 4 - Sequence Alignment**

<u>Sequence Alignment</u>
  + A method of arranging DNA, RNA, or protein sequences to identify regions of similarity.
  + The similarity identified may be a result of functional, structural, or evolutionary relationships between the sequences.
  + Identifies similarity and homology
  + Homology: descent from a common ancestor or source.

![](https://drive.google.com/uc?export=view&id=1xjcoAfhvq0JY-Oc7EiVH-syEe5qamdrw)

+ Terms
  + Matches
  + Mismatches
  + Gap

<br></br>
<u>Alignment Types</u>
+ Global alignment: finds the best concordance/agreement between all characters in two sequences
    + Mostly from end to end
    + Implemented by EMBOSS Needle (Needleman-Wunsch algorithm)
+ Local Alignment: finds just the subsequences that align the best
    + In this method, we consider subsequences within each of the 2 sequences and try to match them to obtain the best alignment.
    + Implemented by EMBOSS Water (Smith-Waterman algorithm)
 
![](https://drive.google.com/uc?export=view&id=1NRwK49u9zjKN9KjiJZyBprlYFr6PPWe5)

In [None]:
# Install and import Bio



In [None]:
# Import the required functions (pairwise2 and format_alignment) from Bio packages
# Import Seq class



In [None]:
# create example sequences
# TCACTCGT
# ATTCG



**Global Alignment**

> When to use?
+ 2 Sequences are quite similar
+ 2 Sequences have approximately the same length

> Examples
+ EMBOSS Needle
+ Needleman-Wunsch Global Align
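
As a rough illustration of what a global aligner computes, here is a minimal pure-Python Needleman-Wunsch sketch. It is not the Biopython or EMBOSS implementation: the function name `global_align` is hypothetical, and the default scoring (match = 1, mismatch = 0, gap = 0) is an assumption chosen to resemble the simple "xx" scoring of `pairwise2.align.globalxx`. It returns the score and only one of the possibly many optimal alignments.

```python
def global_align(seq_a, seq_b, match=1, mismatch=0, gap=0):
    """Needleman-Wunsch with a linear gap penalty.

    Returns (score, aligned_a, aligned_b) for one optimal end-to-end alignment.
    """
    n, m = len(seq_a), len(seq_b)
    # score[i][j] = best score for aligning seq_a[:i] with seq_b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align the two characters
                              score[i - 1][j] + gap,     # gap in seq_b
                              score[i][j - 1] + gap)     # gap in seq_a

    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                out_a.append(seq_a[i - 1])
                out_b.append(seq_b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, aligned_a, aligned_b = global_align("TCACTCGT", "ATTCG")
print(aligned_a)
print(aligned_b)
print("Score:", score)
```

Because every character of both sequences must appear in the result, the two aligned strings always have the same length, with `-` marking gaps.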

In [None]:
# Perform Global Alignment



In [None]:
# display the alignment



In [None]:
# View all possible alignments



**Local Alignment**

> When to use?
+ 2 sequences have a small matched region
+ 2 Sequences are of different lengths
+ One sequence is a subsequence of the other

> Examples
+ BLAST
+ EMBOSS Water
+ LALIGN
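
For comparison with global alignment, the following is a minimal pure-Python Smith-Waterman sketch. It is an illustration only, not what BLAST, EMBOSS Water, or Biopython run internally. The name `local_align` and the scoring (match = 2, mismatch = -1, gap = -1) are assumptions; local alignment needs negative penalties so that poorly matching flanks are dropped.

```python
def local_align(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman with a linear gap penalty.

    Returns (score, aligned_a, aligned_b) for one best-scoring local alignment.
    """
    n, m = len(seq_a), len(seq_b)
    # H[i][j] = best score of a local alignment ending at seq_a[i-1], seq_b[j-1]
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, sub_a, sub_b = local_align("TCACTCGT", "ATTCG")
print(sub_a)
print(sub_b)
print("Score:", score)
```

Unlike the global case, only the best-matching subsequences are returned, so the flanking characters of `TCACTCGT` do not appear in the output.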

In [None]:
# Perform Local Alignment



In [None]:
# View all possible alignments



**Alignment Scores**

In [None]:
# Get Global Alignment's score



In [None]:
# Get Local Alignment's score



---

# Percentage of Similarity with Alignment




+ Check for similarity, or percentage of similarity, using an alignment
+ Percentage of similarity = (number of identical nucleotides / total number of nucleotides) × 100%
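
The formula can be sketched as a small helper that works on two already-aligned strings. The function name `percent_similarity` is hypothetical, and using the number of alignment columns as the "total number of nucleotides" is an assumption; other conventions divide by the length of the shorter or longer input sequence.

```python
def percent_similarity(aligned_a, aligned_b):
    """Percentage of alignment columns in which both rows hold the same residue.

    Assumes both strings come from the same alignment (equal length, '-' for gaps).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    # A column counts as identical only if it is not a gap and the residues agree.
    identical = sum(1 for x, y in zip(aligned_a, aligned_b)
                    if x == y and x != "-")
    return 100 * identical / len(aligned_a)


print(percent_similarity("ACTCG", "ATTCG"))   # 4 of 5 columns are identical
print(percent_similarity("AC-TG", "ACGTG"))   # the gap column does not count
```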

In [None]:
# Calculate global alignment's percentage of similarity



In [None]:
# Calculate local alignment's percentage of similarity



---
# Global Alignments with Maximum Similarity Score


Find out all the possible global alignments with the maximum similarity score
+ Matches &ensp;&ensp;&ensp;&ensp;&ensp;&ensp; : + 2 points 
+ Mismatches &ensp;&ensp;&ensp; : - 1 point
+ Opening a gap&ensp;&ensp;: - 0.5 point
+ Extending a gap : - 0.1 point
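
To make the scoring scheme concrete, the sketch below scores an alignment that has already been computed. The function name `score_alignment` is hypothetical. It assumes that the first position of a gap costs the opening penalty and that each further consecutive gap position in the same sequence costs the extension penalty; check the Biopython documentation for exactly how `globalms` applies the two penalties.

```python
def score_alignment(aligned_a, aligned_b, match=2, mismatch=-1,
                    gap_open=-0.5, gap_extend=-0.1):
    """Score two aligned strings (equal length, '-' for gaps) column by column."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0.0
    prev_gap_row = None  # 'a' or 'b' if the previous column was a gap in that row
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            # Continuing a gap in the same row is cheaper than opening a new one.
            total += gap_extend if prev_gap_row == row else gap_open
            prev_gap_row = row
        else:
            total += match if x == y else mismatch
            prev_gap_row = None
    return total


# 4 matches and 1 mismatch: 4 * 2 - 1
print(score_alignment("ACTCG", "ATTCG"))
# 3 matches and one gap of length 2: 3 * 2 - 0.5 - 0.1
print(score_alignment("AC--G", "ACTTG"))
```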

In [None]:
# Perform Global alignment with maximum similarity score
# globalms(seqA, seqB, match, mismatch, open, extend)



In [None]:
# View all possible alignments



---
# Similarity Between Sequences


+ Sequence Alignment
    - Dynamic Programming (Global/Local/(needle/water))
    - Dotplot
    
+ Similarity: the resemblance between two sequences being compared
    - Commonly measured via the edit distance: the minimal number of edit operations (inserts, deletes, and substitutions) needed to transform one sequence into an exact copy of the other sequence being aligned
+ Identity: the number of characters that match EXACTLY between two different sequences
    + Gaps are not counted
    + The measurement is relative to the shorter of the two sequences.
    + As a result, sequence identity is not transitive: if identity(A,B) = 100% and identity(B,C) = 100%, identity(A,C) is not necessarily 100%:
 
 <br></br>


```
 A: AAGGCTT
 B: AAGGC
 C: AAGGCAT
```




+ 100% identity does not mean two sequences are the same.
  + identity(A,B) = 100% *(5 identical nucleotides / min(length(A),length(B)))*
  + identity(B,C) = 100% *(5 identical nucleotides / min(length(B),length(C)))*
  + identity(A,C) = 86% *(6 identical nucleotides / 7)*
+ Sequence similarity is first of all a general description of a relationship, but it is common practice to define similarity as an optimal matching problem (for sequence alignments, unless defined otherwise).
+ The optimal matching algorithm finds the minimal number of edit operations (inserts, deletes, and substitutions) needed to transform one sequence into an exact copy of the other sequence being aligned (the edit distance).
+ Using this, the percentage sequence similarities of the examples above are sim(A,B)=60%, sim(B,C)=60%, sim(A,C)=86% (semi-global, sim = 1 - (edit distance / unaligned length of the shorter sequence)). There are other ways to define similarity between two objects (e.g. using the tertiary structure of proteins).
+ Similarity is then often used as evidence when inferring homology.
+ Read more: https://www.researchgate.net/post/Homology_similarity_and_identity-can_anyone_help_with_these_terms
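
The identity figures for A, B, and C can be reproduced with a short sketch. The function name `identity` is hypothetical, and it assumes the simplest possible comparison: the sequences are lined up from their first character without gaps, and the count of matching positions is divided by the length of the shorter sequence. A real workflow would normally take the matches from an alignment instead.

```python
def identity(seq_a, seq_b):
    """Percent identity relative to the shorter sequence (ungapped, left-aligned)."""
    shorter = min(len(seq_a), len(seq_b))
    if shorter == 0:
        return 0.0
    # zip() stops at the end of the shorter sequence, so overhangs are ignored.
    matches = sum(1 for x, y in zip(seq_a, seq_b) if x == y)
    return 100 * matches / shorter


A = "AAGGCTT"
B = "AAGGC"
C = "AAGGCAT"

print(f"identity(A,B) = {identity(A, B):.0f}%")
print(f"identity(B,C) = {identity(B, C):.0f}%")
print(f"identity(A,C) = {identity(A, C):.0f}%")
```

A and B are 100% identical, as are B and C, yet A and C are not, which is the non-transitivity described above.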

In [None]:
# Create 3 example sequences



In [None]:
# Calculate local alignment and get the scores



In [None]:
# Calculate percentage of similarity and print



In [None]:
# Check concept: does a 100% similarity score mean the sequences are exactly the same?



---
# Hamming distance

Shows in how many positions two strings differ.



+ The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different.
+ In other words, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other.
+ It is used for error detection and error correction.
+ It is used to quantify the similarity of DNA sequences.
+ It is one kind of edit distance
  - An edit distance quantifies how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other.
  - Another example is the Levenshtein distance
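
A minimal sketch of the definition follows. The function name `hamming_distance` is hypothetical, and raising `ValueError` for unequal lengths is one possible design choice; the exercise cells below may handle that case differently.

```python
def hamming_distance(seq_a, seq_b):
    """Number of positions at which two equal-length strings differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Hamming distance is only defined for strings of equal length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


print(hamming_distance("ACTAT", "ACTTA"))  # the last two positions differ

try:
    hamming_distance("ACTAT", "ACTT")
except ValueError as err:
    print("Cannot compare:", err)
```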

In [None]:
# Create example sequences
# ACTAT
# ACTTA
# ACTT



In [None]:
# Define Hamming Distance function



In [None]:
# Perform Hamming Distance calculation



---
# Levenshtein Distance

+  This method was invented in 1965 by the Russian mathematician Vladimir Levenshtein (1935-2017).
+  The distance value describes the minimal number of deletions, insertions, or substitutions that are required to transform one string (the source) into another (the target).
+  Unlike the Hamming distance, the Levenshtein distance works on strings of unequal length.
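
To show what the distance counts, here is a minimal pure-Python sketch of the standard dynamic-programming approach. It is an illustration, not the code of the library used in the cells below, and the function name `levenshtein` is hypothetical.

```python
def levenshtein(source, target):
    """Minimal number of insertions, deletions, and substitutions
    needed to turn `source` into `target`."""
    # previous[j] = distance between the processed prefix of source and target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into an empty string takes i deletions
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


print(levenshtein("ACTAT", "ACTTA"))  # equal length
print(levenshtein("ACTAT", "ACTT"))   # unequal length is fine here

# Similarity as defined earlier: 1 - (edit distance / length of the shorter sequence)
A, B = "AAGGCTT", "AAGGC"
print(1 - levenshtein(A, B) / min(len(A), len(B)))
```

Only two rows of the table are kept at a time, which is enough when just the distance, and not the sequence of edits, is needed.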

In [None]:
# Install Levenshtein Distance library



In [None]:
# Import distance function from Levenshtein



In [None]:
# Perform Levenshtein Distance calculation



---
# Dot Plot


+ A dot plot is a graphical method for comparing two biological sequences and identifying regions of close similarity between them.
+ Simplest method: put a dot wherever the sequences are identical.
+ Dot plots compare two sequences by placing one sequence on the x-axis and the other on the y-axis of a plot.
+ Wherever the residue in one sequence matches the residue in the other, a dot is drawn at the corresponding position.
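
The simplest method can be sketched in a few lines of plain Python. The names `dot_matrix` and `render_ascii` are hypothetical and are not the utility functions defined later in this session. The sketch compares single residues only, with no window or threshold.

```python
def dot_matrix(seq_x, seq_y):
    """Boolean matrix: rows follow seq_y, columns follow seq_x.

    A cell is True where the residue of seq_y equals the residue of seq_x.
    """
    return [[x == y for x in seq_x] for y in seq_y]


def render_ascii(seq_x, seq_y, dot="*", blank=" "):
    """Return the dot plot as text, with seq_x across the top and seq_y down the side."""
    lines = ["  " + seq_x]
    for residue, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        lines.append(residue + " " + "".join(dot if hit else blank for hit in row))
    return "\n".join(lines)


print(render_ascii("ACCTGAGCTC", "ACCTGAGCTC"))
```

Comparing a sequence with itself always gives an unbroken main diagonal; shorter parallel diagonals point to repeated regions.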

<br></br>
<u>Usefulness</u>


+ Dot plots can also be used to visually inspect sequences for
  - Direct or inverted repeats
  - Regions with low sequence complexity
  - Similar regions
  - Repeated sequences
  - Sequence rearrangements
  - RNA structures
  - Gene order



<u>Creating dotplot utility functions</u>

In [None]:
# Define Delta function
# ---------------------
# Takes two arguments, x and y, and returns 0 if they are equal, and 1 otherwise.
# This function is used in other functions to tell whether a pair of nucleotides/amino acids matches



In [None]:
# Define M function
# -----------------
# A utility function that sums delta over two length-k subsequences of the two input sequences.
# Because delta returns 0 for equal residues, the result is the number of mismatches (0 means a perfect match).
# 
# Param:
# seq1: a string representing the first sequence
# seq2: a string representing the second sequence
# i: an integer representing the starting index of the subsequence in seq1
# j: an integer representing the starting index of the subsequence in seq2
# k: an integer representing the length of the subsequences to be compared



In [None]:
# Define makeMatrix function
# --------------------------
# The function returns a matrix of match scores between all pairs of subsequences of the two input sequences

# Param:
# seq1: a string representing the first sequence
# seq2: a string representing the second sequence
# k: an integer representing the length of the subsequences to be compared



<u>Creating and displaying dotplot using Matplotlib</u>

In [None]:
# Create 2 DNA sequence examples
# "ACCTGAGCTCACCTGAGTTA"
# "ACCTGAGCTCACCTGAGTTA"



In [None]:
# Import Numpy & Matplotlib libraries to calculate and plot the result



In [None]:
# Create a function to display dotplot using Matplotlib



In [None]:
# Call the function



<u>Creating and displaying dotplot using ASCII characters</u>

In [None]:
# Define plotMatrix function
# --------------------------
# This function prints out the matrix of match scores in a user-friendly way.

# Param:
# Mat: a two-dimensional list representing the matrix of match scores
# t: an integer representing the threshold for determining whether a match score should be considered significant
# seq1: a string representing the first sequence
# seq2: a string representing the second sequence
# nonblank: a character that is used to represent significant match scores
# blank: a character that is used to represent insignificant match scores



In [None]:
# Define dotplot Function
# -----------------------
# The dotplot function creates a dot plot for two sequences.

# Param:
# seq1: a string representing the first sequence
# seq2: a string representing the second sequence
# k: an integer representing the length of the k-mer used to compare the sequences (default value is 1)
# t: an integer representing the threshold for determining whether a match score should be considered significant (default value is 1)
# The function first creates a match score matrix M by calling the makeMatrix function, passing in the two sequences and the k-mer length k.



In [None]:
# Run Dot Plot function



*Read more : https://stackoverflow.com/questions/40822400/how-to-create-a-dotplot-of-two-dna-sequence-in-python*