# Basic Local Alignment Search Tool
The description of BLAST given here is taken from Wikipedia. It is not my own writing; I am only using it as a guide for this notebook.

## Introduction
BLAST is an acronym for Basic Local Alignment Search Tool. BLAST is an algorithm for comparing primary biological sequence information. A BLAST search enables a researcher to compare a subject protein or nucleotide sequence (called a query) with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold.

The heuristic algorithm it uses is much faster than other approaches, such as calculating an optimal alignment. This emphasis on speed is vital to making the algorithm practical on the huge genome databases currently available.

Before fast algorithms such as BLAST were developed, searching databases for protein or nucleic acid sequences was very time-consuming because a full alignment procedure (e.g., the Needleman-Wunsch or Smith-Waterman algorithm) was used. The drawback of BLAST compared to Smith-Waterman is that it cannot guarantee optimal alignments, so we trade some accuracy and precision for speed and efficiency.

BLAST is available on the web on the NCBI website: https://blast.ncbi.nlm.nih.gov/Blast.cgi

## Method
BLAST finds similar sequences by locating short matches between the two sequences. This process of finding short initial matches is called seeding. It is after this first match that BLAST begins to make local alignments. While attempting to find similarity in sequences, sets of common letters, known as words, are very important. For example, suppose that the sequence contains the following stretch of letters: GLKFA. If a BLAST search were conducted under normal conditions, the word size would be 3 letters. In this case, using the given stretch of letters, the searched words would be GLK, LKF, and KFA.
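The word list in the example above can be reproduced with a short sliding-window function. This is only a minimal sketch; the name `query_words` is made up for illustration and is not part of any BLAST implementation.

```python
def query_words(sequence, word_size=3):
    """Return every overlapping word of length `word_size` in `sequence`."""
    return [sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)]


print(query_words("GLKFA"))  # ['GLK', 'LKF', 'KFA']
```

A sequence shorter than the word size simply yields an empty list.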

The heuristic algorithm of BLAST locates all common three-letter words between the sequence of interest and the hit sequence or sequences from the database. This result will then be used to build an alignment. After making words for the sequence of interest, the remaining words, known as neighborhood words, are also assembled. These words must have a score of at least the threshold T when compared with the query words using a scoring matrix.

One commonly used scoring matrix for BLAST searches is BLOSUM62 (the scoring matrix used with Needleman-Wunsch in this repository), even though the optimal scoring matrix depends on sequence similarity. Once both words and neighborhood words are assembled and compiled, they are compared to the existing sequences in the database in order to find matches. The threshold score T determines whether or not a particular word will be included in the alignment. 
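To make the role of the threshold T more concrete, here is a rough sketch of how neighborhood words could be enumerated for a single query word. It does not use BLOSUM62: the function `toy_score` (+5 for a match, -2 for a mismatch) is a hypothetical stand-in chosen only to keep the example self-contained, and the names `AMINO_ACIDS` and `neighborhood_words` are likewise invented for this illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b, match=5, mismatch=-2):
    """Hypothetical stand-in for a substitution-matrix lookup (not BLOSUM62)."""
    return match if a == b else mismatch


def neighborhood_words(word, threshold):
    """All words of the same length that score at least `threshold` against `word`."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            hits.append("".join(candidate))
    return hits


print(len(neighborhood_words("GLK", threshold=8)))   # 58
print(len(neighborhood_words("GLK", threshold=15)))  # 1
```

With this toy scoring, T = 8 admits the word itself plus every word with exactly one mismatch, while T = 15 admits only the exact word, so raising T shrinks the neighborhood.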

Once seeding has been conducted, the alignment, which is only 3 residues long, is extended in both directions by the algorithm used by BLAST. Each extension impacts the score of the alignment by either increasing or decreasing it. If this score is higher than a pre-determined T, the alignment will be included in the results given by BLAST. However, if this score is lower than this pre-determined T, the alignment will cease to extend, preventing the areas of poor alignment from being included in the BLAST results. Note that increasing the T score limits the amount of space available to search, decreasing the number of neighborhood words, while at the same time speeding up the process of BLAST.
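The extension step can be sketched as an ungapped walk outwards from the seed. The version below is illustrative only and rests on several assumptions that are not taken from the text: it scores +1 for a match and -1 for a mismatch instead of using a scoring matrix, and it stops extending once the running score falls more than `max_drop` below the best score seen so far. The function name `extend_seed` and all of its parameters are hypothetical.

```python
def extend_seed(query, subject, q_start, s_start, word_size=3,
                match=1, mismatch=-1, max_drop=2):
    """Ungapped extension of a seed; returns (score, q_lo, q_hi), q_hi exclusive."""
    def score_pair(a, b):
        return match if a == b else mismatch

    # Score of the seed itself.
    best = sum(score_pair(query[q_start + k], subject[s_start + k])
               for k in range(word_size))
    q_lo, q_hi = q_start, q_start + word_size

    # Extend to the right, remembering where the best score was reached.
    running = best
    i, j = q_start + word_size, s_start + word_size
    while i < len(query) and j < len(subject):
        running += score_pair(query[i], subject[j])
        if running > best:
            best, q_hi = running, i + 1
        elif running < best - max_drop:
            break  # assumed stopping rule: score dropped too far
        i, j = i + 1, j + 1

    # Extend to the left, starting again from the best score so far.
    running = best
    i, j = q_start - 1, s_start - 1
    while i >= 0 and j >= 0:
        running += score_pair(query[i], subject[j])
        if running > best:
            best, q_lo = running, i
        elif running < best - max_drop:
            break
        i, j = i - 1, j - 1

    return best, q_lo, q_hi


score, lo, hi = extend_seed("AAGLKFAWW", "CCGLKFADD", q_start=2, s_start=2)
print(score, "AAGLKFAWW"[lo:hi])  # 5 GLKFA
```

Here the seed GLK grows to GLKFA, and the mismatching flanks are left out because they never improve on the best score.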

## The Algorithm
The algorithm requires a query sequence to search with and a sequence or a database of sequences to search against. BLAST will find subsequences in the database that are similar to subsequences in the query sequence.

The main idea of BLAST is that a statistically significant alignment often contains High-scoring Segment Pairs (HSPs). BLAST searches for high-scoring sequence alignments between the query sequence and the existing sequences in the database.

## Step 1
### Remove low-complexity regions or sequence repeats in the query sequence.
For this section I will be referring to John Wootton and Scott Federhen's 1993 paper,
"Statistics of local complexity in amino acid sequences and sequence databases".

Natural protein sequences are very different from random strings of 20 amino acids. There are often regions of low complexity, for example with clusters of glycine, proline, alanine, glutamine, etc. in homopolymeric tracts or in mosaic sequence arrangements, some of which contain regular or irregular short-period tandem repeats. These low complexity segments occur disproportionately often in protein sequences, with as much as 15% of residues occurring in sections of improbably low compositional complexity.

Why are these low complexity regions a potential problem when searching for sequence alignments in the database? If the query sequence has one or more low complexity regions similar to those in the database sequences, then there will be high-scoring segment pairs that arise purely as a result of these low complexity regions. Therefore we want to filter out these low complexity regions.

### Low Complexity and Probability
Before we can filter out low complexity regions, we must first define what we mean by complexity. Let the biopolymer have N possible types of residues (N=4 for DNA, N=20 for proteins) and consider a subsequence or "window" of length L residues. Statistical properties of each window can be defined on three levels:
- Complexity State or Numerical Partition
- Composition or “Coloring”
- Sequence

#### Complexity State
Each window has a number of occurrences of each of the $N$ letters or residues. The complexity state of the window is defined by the sorted vector of these $N$ numbers, irrespective of which specific letter or residue is assigned to each number. Thus each window of length $L$ has a complexity state vector $S$, whose $N$ elements $n_i$ have the properties:

\begin{equation*}
0 \leq n_i \leq L,\ \sum_{i=1}^{N} n_i = L
\end{equation*}

and, in order to make a unique sorted vector that defines the state, $n_i \geq n_{i+1}$.

Each complexity state vector $S$ represents a different partition of the integer $L$ into $N$ non-negative integers that sum to $L$. The importance of representing sequence windows as numerical partitions is that these vectors have the property "complexity" that depends only on $N$, $L$ and the $n_i$, irrespective of the probabilities of occurrence of the states and their particular residue compositions.
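As an illustration of the definition, the complexity state of a single window can be computed by counting residues, sorting the counts in decreasing order, and padding with zeros up to $N$ entries. This is a minimal sketch; the function name `complexity_state` and the parameter `alphabet_size` are my own choices.

```python
from collections import Counter


def complexity_state(window, alphabet_size=20):
    """Sorted (descending) residue counts of `window`, zero-padded to `alphabet_size`."""
    counts = sorted(Counter(window).values(), reverse=True)
    return tuple(counts + [0] * (alphabet_size - len(counts)))


# A DNA window (N = 4) of length L = 8: A and T occur 3 times, C and G once.
print(complexity_state("ACGTTTAA", alphabet_size=4))  # (3, 3, 1, 1)
```

Note that the result records only how many times the residues occur, not which residues they are, which is exactly what makes it a partition of $L$.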

This is well illustrated by the following example of the 20-letter amino acid alphabet and a window length of 20, for which there are 627 possible states. These include the "least complex" vector:

\begin{equation}
    (20\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0)
\end{equation}

and the "most complex" vector:

\begin{equation}
    (1\ 1\ 1\ 1\ 1\ 1\ 1\ 1\ 1\ 1\ 1\ 1\ 1\ 1\ 1\ 1\ 1\ 1\ 1\ 1)
\end{equation}

These are both expected to be very improbable in typical amino acid sequences. In contrast, some of the states of intermediate complexity occur relatively frequently:

\begin{equation}
    (4\ 2\ 2\ 2\ 2\ 1\ 1\ 1\ 1\ 1\ 1\ 1\ 1\ 0\ 0\ 0\ 0\ 0\ 0\ 0)
\end{equation}

\begin{equation}
    (3\ 2\ 2\ 2\ 2\ 1\ 1\ 1\ 1\ 1\ 1\ 1\ 1\ 1\ 0\ 0\ 0\ 0\ 0\ 0)
\end{equation}

The second state above has a slightly greater complexity than the first. Two different measures that correspond to this intuitive concept of numerical complexity are defined below. These are "_Complexity_" or $K_1$ and "_Entropy_" or $K_2$.  
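The figure of 627 possible states quoted above for $N = 20$ and $L = 20$ can be checked by counting the partitions of $L$ into at most $N$ positive parts. The sketch below uses a standard dynamic-programming count; the function name `count_partitions` is hypothetical.

```python
def count_partitions(total, max_parts):
    """Number of ways to write `total` as a sum of at most `max_parts` positive integers."""
    # By conjugation, partitions into at most k parts are in bijection with
    # partitions whose parts are all <= k, which is what this loop counts.
    ways = [1] + [0] * total
    for part in range(1, max_parts + 1):
        for t in range(part, total + 1):
            ways[t] += ways[t - part]
    return ways[total]


print(count_partitions(20, 20))  # 627
```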

#### Composition or "Coloring"
Each complexity state vector has a number of different residue compositions corresponding to all possible assignments of the $N$ letters to the $N$ numbers in each vector $S_j$. These compositions are named "colorings". The number of compositions of any given complexity state, denoted $F$ here following the usage "Farben" (colours, in German), is given by:

\begin{equation}
    F = \frac{N!}{\prod_{k=0}^{L}{r_k}!}
\end{equation}

Here the values of $r_k$ are the counts of the number of occurrences of each number $k$ in the complexity state vector $S_j$. Formally,

\begin{equation}
    0 \leq r_k \leq N, \ 0 \leq k \leq L, \ \sum_{k=0}^{L}(r_k) = N
\end{equation}

In practice, because of the restricted partitioning of $L$ into $S_j$, only a few values from the possible range of $r_k$ and $k$ actually occur for any $S_j$ and the computation only uses the non-zero $r_k$ values. For example, for the vector 

\begin{equation}
(3\ 2\ 2\ 2\ 2\ 1\ 1\ 1\ 1\ 1\ 1\ 1\ 1\ 1\ 0\ 0\ 0\ 0\ 0\ 0)
\end{equation}

$F$, the number of compositions of the complexity state, is computed from the $r_k$ values (1, 4, 9, 6) corresponding to one 3, four 2s, nine 1s, and six 0s. A special case occurs for window lengths $L$ that are equal to or exact multiples of $N$, for which there is only one possible colouring of the vector of maximum complexity ($F = 1$). For example, for $L = 40$ and the $N = 20$ amino acid alphabet, this vector is:

\begin{equation}
(2\ 2\ 2\ 2\ 2\ 2\ 2\ 2\ 2\ 2\ 2\ 2\ 2\ 2\ 2\ 2\ 2\ 2\ 2\ 2)
\end{equation}

However, for most of the complexity states and window lengths encountered in practice in protein sequence analysis, very large values of $F$ are obtained from the 20 letter alphabet.
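To get a feel for the size of $F$, the formula above can be evaluated directly from a complexity state vector. This is a small sketch under the definitions given in this section; the function name `colourings` is my own.

```python
from collections import Counter
from math import factorial, prod


def colourings(state):
    """F = N! / prod(r_k!), where r_k is how often the value k appears in `state`."""
    r = Counter(state)  # maps each value k to its multiplicity r_k
    return factorial(len(state)) // prod(factorial(r_k) for r_k in r.values())


# One 3, four 2s, nine 1s and six 0s: r_k values (1, 4, 9, 6).
state = (3, 2, 2, 2, 2) + (1,) * 9 + (0,) * 6
print(colourings(state))       # 387987600
print(colourings((2,) * 20))   # 1
```

The second call is the $L = 40$ maximum-complexity vector, which has a single colouring as stated in the text.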

All colourings of any numerical state have the same local complexity value, measured as $K_1$ or $K_2$, and can be considered to inherit this property from their complexity state vector. However, the probabilities may differ between different colourings of the same complexity state, depending on the probabilities of occurrence, $p_i$, of the $N$ different letters in the alphabet (residues). Only uniform probabilities of residues give equiprobable compositions for any complexity state.

#### Sequences
For each complexity state, there exists a (usually) large number of different possible sequences. This number, $\Omega$, is the multinomial coefficient characteristic of the complexity state and is the same for all compositions (colourings) of that state and depends only on $N$, $L$, $n_i$:

\begin{equation}
\Omega = \frac{L!}{\prod_{i=1}^{N}{n_i}!}
\end{equation}

The total number of possible sequences over all the complexity states of window length $L$ is the number of permutations, $N^L$. Each sequence can be considered to inherit its attributes of complexity and probability from its complexity state and composition respectively.
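The multinomial coefficient $\Omega$ is equally easy to evaluate for a given state. The sketch below follows the formula above; the function name `omega` is chosen for illustration only.

```python
from math import factorial, prod


def omega(state):
    """Omega = L! / prod(n_i!), with L taken as the sum of the counts in `state`."""
    return factorial(sum(state)) // prod(factorial(n) for n in state)


print(omega((20,) + (0,) * 19))  # 1: a homopolymer can be written in only one way
print(omega((1,) * 20))          # 2432902008176640000, i.e. 20!
```

These two values correspond to the "least complex" and "most complex" vectors for $N = L = 20$, which bracket all other states.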

We want to take some sequence of amino acids and calculate the complexity state, the number of colourings, and the characteristic multinomial coefficient.

To calculate the complexity state, we need the window size and the size of the alphabet. Given a sequence, we must then produce a sorted vector for each windowed subsequence. With this sorted vector we can calculate the $r_k$ values, which can then be used to calculate the number of colourings. Finally, we will calculate $\Omega$ using the formula above.
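One possible way to carry out this plan is sketched below. It slides a window along the sequence and reports the complexity state, $F$ and $\Omega$ for each position. The generator name `window_statistics` and its parameters are assumptions made for this notebook, and a 4-letter alphabet is used in the example purely to keep the numbers small.

```python
from collections import Counter
from math import factorial, prod


def window_statistics(sequence, window, alphabet_size=20):
    """Yield (subsequence, state, F, Omega) for every window of length `window`."""
    for start in range(len(sequence) - window + 1):
        sub = sequence[start:start + window]
        # Complexity state: sorted residue counts, zero-padded to N entries.
        counts = sorted(Counter(sub).values(), reverse=True)
        state = tuple(counts + [0] * (alphabet_size - len(counts)))
        # r_k: how many times each count value k appears in the state vector.
        r = Counter(state)
        f = factorial(alphabet_size) // prod(factorial(v) for v in r.values())
        # Omega: number of distinct sequences sharing this composition.
        om = factorial(window) // prod(factorial(n) for n in state)
        yield sub, state, f, om


for sub, state, f, om in window_statistics("AACGT", window=4, alphabet_size=4):
    print(sub, state, f, om)
# AACG (2, 1, 1, 0) 12 12
# ACGT (1, 1, 1, 1) 1 24
```

For protein sequences the default `alphabet_size=20` applies, and a longer window would be used.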



### Entropy and Complexity