### Again about BWT

Today we will construct BWT from scratch. First we will construct a suffix array for a given string using pysuffixarray.

In [4]:
!pip install pysuffixarray

from pysuffixarray.core import SuffixArray
sa = SuffixArray('ACAACG')
print(sa.suffix_array())

[6, 2, 0, 3, 1, 4, 5]
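
As a sanity check, the same array can be reproduced with a naive construction. This is only a sketch for short strings: `naive_suffix_array` is a hypothetical helper, not part of pysuffixarray, and it assumes (as the output above suggests) that the library also reports the position of the empty suffix, index 6 here.

```python
def naive_suffix_array(text):
    """Sort all suffixes of text + '$' and return their start positions."""
    s = text + "$"  # '$' sorts before A, C, G, T
    # Comparing full suffix strings is quadratic in memory, so this is
    # only suitable for small examples, not for a whole genome.
    return sorted(range(len(s)), key=lambda i: s[i:])


print(naive_suffix_array("ACAACG"))  # [6, 2, 0, 3, 1, 4, 5]
```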


## Task 1: Create BWT using suffix array

- Using BioPython, load the SARS-CoV-2 reference genome from a FASTA file
- Construct the suffix array
- Construct the BWT from the suffix array
- Don't forget to add the special symbol `$` (but only after SA construction)

![correct](BWT_folder/BWT1.png)
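
The step from suffix array to BWT can be sketched on the small example string from the beginning of this notebook. `bwt_from_suffix_array` is a hypothetical helper name, and the suffix array is assumed to include the index of the empty suffix, as in the pysuffixarray output shown earlier.

```python
def bwt_from_suffix_array(text, suffix_array):
    """BWT of text + '$', given a suffix array that includes index len(text)."""
    # The BWT character of a suffix is the character just before it;
    # the suffix starting at position 0 is preceded by the sentinel '$'.
    return "".join(text[i - 1] if i > 0 else "$" for i in suffix_array)


text = "ACAACG"
suffix_array = [6, 2, 0, 3, 1, 4, 5]  # pysuffixarray output for 'ACAACG'
bwt = bwt_from_suffix_array(text, suffix_array)
print(bwt)  # GC$AAAC
```

The result is the last column of the sorted rotations of `ACAACG$`, which is an easy way to check a BWT by hand on short inputs.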

## Task 2: Create FM index
- Construct Occurrence array
- Construct Count dictionary
- Make a class BWTSearcher

![correct](BWT_folder/BWT2.png)
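
A minimal sketch of the two tables, built for the BWT of the toy string `ACAACG`. The function name `build_fm_index` and the inclusive convention (`occ[ch][i]` counts occurrences in `bwt[:i + 1]`) are assumptions made for this illustration; other conventions work as long as the search code matches them.

```python
def build_fm_index(bwt):
    """Return (occ, count) for a BWT string."""
    alphabet = sorted(set(bwt))
    occ = {ch: [] for ch in alphabet}
    running = {ch: 0 for ch in alphabet}
    for ch in bwt:
        running[ch] += 1
        # occ[a][i] = number of occurrences of a in bwt[:i + 1]
        for a in alphabet:
            occ[a].append(running[a])
    # count[ch] = number of characters in the text that sort before ch
    count, total = {}, 0
    for ch in alphabet:
        count[ch] = total
        total += running[ch]
    return occ, count


occ, count = build_fm_index("GC$AAAC")
print(count)     # {'$': 0, 'A': 1, 'C': 4, 'G': 6}
print(occ["A"])  # [0, 0, 0, 1, 2, 3, 3]
```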

## Task 3:
- Create pattern search function inside the class

![correct](BWT_folder/BWT3.png)
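
Backward search can be sketched without any precomputed tables, which makes the update rule easier to follow. This version is an illustration only: `backward_search` is a hypothetical standalone function, it uses half-open ranges `[lo, hi)`, and it counts characters naively with `str.count`, so it is far slower than a real FM index.

```python
def backward_search(bwt, suffix_array, pattern):
    """Return the sorted start positions of pattern in the indexed text."""
    # first[ch] = number of characters in the text that sort before ch
    first = {ch: sum(c < ch for c in bwt) for ch in set(bwt)}
    lo, hi = 0, len(bwt)  # current half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in first:
            return []
        # Rank of ch at row i = occurrences of ch in bwt[:i]
        lo = first[ch] + bwt[:lo].count(ch)
        hi = first[ch] + bwt[:hi].count(ch)
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


print(backward_search("GC$AAAC", [6, 2, 0, 3, 1, 4, 5], "AC"))  # [0, 3]
```

Replacing the two `str.count` calls with lookups into a precomputed occurrence table gives the usual constant-time step per pattern character.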

## Task 4:
- There are 100 reads that were randomly sampled from genome.fa
- Some of them are error free, some contain one mutation, and some contain 5 mutations
- Could you use your BWTSearcher class to classify them? Think about the solution and implement it. You can add any functions or class members
- How many reads of each class did you find?
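
One possible approach, offered as a sketch rather than the intended solution, is seed-and-verify based on the pigeonhole principle: if a read has at most k mismatches and is cut into k + 1 pieces, at least one piece matches the reference exactly. The code assumes that "mutation" means a substitution (no insertions or deletions) and that reads come from the forward strand. All names are hypothetical, and `find_all` is a plain-Python stand-in for an exact search such as `bwt_pattern_search`.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_all(text, pattern):
    """Stand-in for an exact index search: all match start positions."""
    positions, start = [], text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def min_mismatches(reference, read, max_mismatches, search=find_all):
    """Smallest Hamming distance (<= max_mismatches) between the read and any
    reference window of the same length, or None if no such window is found."""
    n_seeds = max_mismatches + 1
    seed_len = len(read) // n_seeds
    if seed_len == 0:
        raise ValueError("read is too short for this many seeds")
    best = None
    for s in range(n_seeds):
        offset = s * seed_len
        # The last seed absorbs the leftover characters
        end = len(read) if s == n_seeds - 1 else offset + seed_len
        for hit in search(reference, read[offset:end]):
            start = hit - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference):
                continue
            distance = hamming(read, reference[start:start + len(read)])
            if distance <= max_mismatches and (best is None or distance < best):
                best = distance
    return best


def classify_read(reference, read):
    """Label a read by the mismatch count of its best placement."""
    distance = min_mismatches(reference, read, max_mismatches=5)
    if distance is None:
        return "unclassified"
    return {0: "error free", 1: "one mutation"}.get(distance, "five mutations")
```

To use the FM index instead of `str.find`, pass a wrapper such as `search=lambda ref, seed: searcher.bwt_pattern_search(seed)`, where `searcher` is a `BWTSearcher` built on the same reference. The labels can then be tallied with `collections.Counter`.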

In [ ]:
from Bio import SeqIO
# Process each read through the BWTSearcher
with open("BWT_folder/sample_reads.fasta", "r") as file:
    for record in SeqIO.parse(file, "fasta"):
        read_sequence = str(record.seq)

        # Here is just a placeholder to demonstrate using the read with the BWTSearcher
        print("Processing read:", read_sequence)

In [1]:
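# My implementation of BWT search
from pysuffixarray.core import SuffixArray


class BWTSearcher:
    def __init__(self, reference):
        # Construct the suffix array once and cache the list of positions.
        # It includes the empty suffix, i.e. index len(reference).
        self.sa = SuffixArray(reference)
        self.suffixes = self.sa.suffix_array()

        # Construct the BWT from the suffix array: take the character that
        # precedes each suffix; the suffix at position 0 is preceded by '$'
        self.bwt_text = ''.join(reference[i - 1] if i != 0 else '$' for i in self.suffixes)

        # Occ[ch][i] = number of occurrences of ch in bwt_text[:i + 1].
        # Assumes the reference contains only the characters A, C, G, T.
        self.Occ = {ch: [0] * len(self.bwt_text) for ch in "$ACGT"}
        for i, ch in enumerate(self.bwt_text):
            if i != 0:
                # Carry over the previous counts for every symbol, '$' included
                for ch2 in "$ACGT":
                    self.Occ[ch2][i] = self.Occ[ch2][i - 1]
            if ch in self.Occ:
                self.Occ[ch][i] += 1

        # count[ch] = number of characters in bwt_text that sort before ch
        self.count = {}
        total = 0
        for ch in "$ACGT":
            self.count[ch] = total
            total += self.bwt_text.count(ch)

    def bwt_pattern_search(self, pattern):
        # Backward search: [top, bottom] is the inclusive range of suffix-array
        # rows whose suffixes start with the part of the pattern processed so far
        top = 0
        bottom = len(self.bwt_text) - 1
        for char in reversed(pattern):
            if char in self.Occ:
                top = self.count[char] + (0 if top == 0 else self.Occ[char][top - 1])
                bottom = self.count[char] + self.Occ[char][bottom] - 1
            else:
                return []

            if top > bottom:
                return []

        # Start positions of all occurrences (in suffix-array order, not sorted)
        return self.suffixes[top:bottom + 1]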

### HMMs: important tips

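Don't forget how floats are represented in memory.
They have finite precision, so every operation can introduce a small rounding error, and even the order in which the same numbers are summed can change the result, as the cells below show.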

In [24]:
import random

# Define the number of random floats to generate
num_floats = 1000

# Define the range for random floats
min_value = 0.0
max_value = 0.00001

# Generate a list of random floats
random_floats = [random.uniform(min_value, max_value) for _ in range(num_floats)]

# Print the sum of the random floats
print(sum(random_floats))

0.004909648519238187


In [25]:
random_floats = sorted(random_floats)
print(sum(random_floats))

0.0049096485192381845


In [26]:
print(sum(random_floats[::-1]))

0.004909648519238181
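
A smaller, reproducible illustration of the same effect, with the standard-library `math.fsum` as a reference. The running total is written as an explicit loop because the built-in `sum` in Python 3.12 and later already uses a compensated summation for floats and may hide the error.

```python
import math

values = [0.1] * 10

# Plain left-to-right accumulation: each addition rounds the result
naive_total = 0.0
for v in values:
    naive_total += v

print(naive_total)        # 0.9999999999999999
print(math.fsum(values))  # 1.0, fsum tracks the lost low-order bits
```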


Also, the Viterbi algorithm multiplies many small probabilities together, so the product can quickly fall below the smallest representable value and underflow to zero.
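
A minimal demonstration of the problem, using made-up probabilities: 1000 factors of 0.01 give a true product of 10^-2000, which a double cannot represent, while the sum of their logarithms is an ordinary number.

```python
import math

probs = [0.01] * 1000  # hypothetical per-step probabilities

product = 1.0
for p in probs:
    product *= p  # silently underflows to zero along the way

log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0
print(log_sum)  # about -4605.17
```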


For single-precision floats (`float` in C/C++, `numpy.float32` in Python), the range of normal positive values is approximately from 1.17549 × 10^-38 to 3.40282 × 10^38, with a precision of about 7 decimal digits.

For double-precision floats (`double` in C/C++, the built-in `float` and `numpy.float64` in Python), the range is much wider, approximately from 2.22507 × 10^-308 to 1.79769 × 10^308, with a precision of about 15-16 decimal digits.
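
The limits of Python's built-in float can be inspected with the standard-library `sys.float_info`. The last two lines are small illustrations of what happens outside the representable range.

```python
import sys

info = sys.float_info
print(info.min)  # smallest positive normal value, about 2.2250738585072014e-308
print(info.max)  # largest finite value, about 1.7976931348623157e+308
print(info.dig)  # 15 decimal digits are always preserved

print(info.max * 10)      # inf: overflow
print(info.min / 2**60)   # 0.0: underflow
```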


# Log Multiplication

The logarithm of a product equals the sum of the logarithms of its factors. For a product of three probabilities:

\[ \log P(A,B,C) = \log P(A) + \log P(B) + \log P(C) \]


where \( P(A,B,C) = P(A) \cdot P(B) \cdot P(C) \) is the product and \( P(A) \), \( P(B) \), \( P(C) \) are the individual factors.

To retrieve the product \( P(A,B,C) \) from the logarithmic sum, exponentiate the result:

\[ P(A,B,C) = e^{\log P(A,B,C)} \]

In practice this is rarely needed: comparing log-probabilities is enough to pick the most probable path, and exponentiating a very negative log-probability underflows again.


Log multiplication is particularly useful in scenarios where you need to handle very large or very small numbers, as it simplifies complex calculations involving multiplication operations. By converting multiplication into addition, log multiplication can help mitigate numerical instability and reduce the risk of overflow or underflow errors.
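
A compact sketch of how the Viterbi recursion looks when products are replaced by sums of logarithms. The two-state model, its probabilities, and all names below are made up for illustration, and the code assumes every probability is strictly positive (`math.log(0)` raises an error).

```python
import math


def viterbi_log(obs, states, start_p, trans_p, emit_p):
    """Most probable state path and its log-probability, in log space."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for symbol in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: prev[r] + math.log(trans_p[r][s]))
            score[s] = (prev[best_prev] + math.log(trans_p[best_prev][s])
                        + math.log(emit_p[s][symbol]))
            pointers[s] = best_prev
        back.append(pointers)
    # Trace the best path backwards from the best final state
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Hypothetical toy model: high-GC (H) and low-GC (L) regions
states = ["H", "L"]
start_p = {"H": 0.5, "L": 0.5}
trans_p = {"H": {"H": 0.5, "L": 0.5}, "L": {"H": 0.4, "L": 0.6}}
emit_p = {"H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
          "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

path, logp = viterbi_log("GGCA", states, start_p, trans_p, emit_p)
print(path, logp)
```

Because only sums and comparisons of logarithms are involved, the same function keeps working for sequences with thousands of observations, where the plain product would underflow to zero.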





More info here - [https://en.wikipedia.org/wiki/Log_probability](https://en.wikipedia.org/wiki/Log_probability)