## Implementation and Analysis of the Needleman-Wunsch Algorithm Using BioNumPy for Sequence Alignment

Sequence alignment is a fundamental process in bioinformatics used to identify similarities between DNA, RNA, or protein sequences. This report details the implementation of the Needleman-Wunsch algorithm using the BioNumPy library and explores its applications in biological research.

### Needleman-Wunsch Algorithm
The Needleman-Wunsch algorithm, introduced in 1970, is a dynamic programming approach used for global sequence alignment...
### BioNumPy Library
BioNumPy is a Python library designed to handle large-scale biological data with the efficiency and ease of use of NumPy.

The implementation below initializes a scoring matrix and then fills each cell from three candidate moves: match or mismatch (diagonal), delete (from the cell above), and insert (from the cell to the left).

In [2]:
import numpy as np
import bionumpy as bnp

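def needleman_wunsch(seq1, seq2, match_score=1, gap_cost=-1):
    """Return the Needleman-Wunsch score matrix for seq1 and seq2."""
    len1, len2 = len(seq1), len(seq2)
    matrix = np.zeros((len1 + 1, len2 + 1))

    # First column and first row: a prefix aligned entirely against gaps
    for i in range(len1 + 1):
        matrix[i][0] = i * gap_cost
    for j in range(len2 + 1):
        matrix[0][j] = j * gap_cost

    # Fill each remaining cell with the best of the three possible moves
    for i in range(1, len1 + 1):
        for j in range(1, len2 + 1):
            # Diagonal: a match adds match_score, a mismatch subtracts it
            match = matrix[i - 1][j - 1] + (match_score if seq1[i - 1] == seq2[j - 1] else -match_score)
            delete = matrix[i - 1][j] + gap_cost  # from above: gap in seq2
            insert = matrix[i][j - 1] + gap_cost  # from the left: gap in seq1
            matrix[i][j] = max(match, delete, insert)

    return matrix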

seq1 = "GATTACA"
seq2 = "GCATGCU"
alignment_matrix = needleman_wunsch(seq1, seq2)
print(alignment_matrix)

[[ 0. -1. -2. -3. -4. -5. -6. -7.]
 [-1.  1.  0. -1. -2. -3. -4. -5.]
 [-2.  0.  0.  1.  0. -1. -2. -3.]
 [-3. -1. -1.  0.  2.  1.  0. -1.]
 [-4. -2. -2. -1.  1.  1.  0. -1.]
 [-5. -3. -3. -1.  0.  0.  0. -1.]
 [-6. -4. -2. -2. -1. -1.  1.  0.]
 [-7. -5. -3. -1. -2. -2.  0.  0.]]


#### Function Explanation:
1. def needleman_wunsch(seq1, seq2, match_score=1, gap_cost=-1): Defines the function, which takes two sequences (seq1 and seq2), a match_score (default is 1), and a gap_cost (default is -1).
2. len1, len2 = len(seq1), len(seq2): Stores the lengths of seq1 and seq2 in len1 and len2, respectively.
3. matrix = np.zeros((len1 + 1, len2 + 1)): Creates a matrix of zeros with dimensions (len1 + 1) x (len2 + 1) to store the alignment scores. The extra row and column correspond to aligning against an empty prefix.
4. Initialization: Fills the first column with i * gap_cost and the first row with j * gap_cost. This is the cost of aligning a prefix of one sequence entirely against gaps.
5. Match: If the characters seq1[i-1] and seq2[j-1] are equal, adds match_score to the diagonal value matrix[i-1][j-1]; otherwise subtracts match_score, so a mismatch is scored as -match_score.
6. Delete: Introduces a gap in seq2, adding gap_cost to the value from the cell above, matrix[i-1][j].
7. Insert: Introduces a gap in seq1, adding gap_cost to the value from the cell to the left, matrix[i][j-1].
8. Max Value: The cell matrix[i][j] is set to the maximum of these three candidates, which is the best score for aligning the first i characters of seq1 with the first j characters of seq2. The bottom-right cell therefore holds the score of the best global alignment (0 in the output above).
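
The function above returns only the score matrix. One way to recover an actual alignment from it is a traceback from the bottom-right cell, sketched here under the same scoring scheme (mismatch scored as -match_score). The helper names fill_matrix and traceback are illustrative, not part of BioNumPy, and the tie-breaking order (diagonal, then up, then left) is an arbitrary choice, so other equally good alignments may exist.

```python
import numpy as np

def fill_matrix(seq1, seq2, match_score=1, gap_cost=-1):
    """Score matrix using the same scheme as needleman_wunsch above."""
    matrix = np.zeros((len(seq1) + 1, len(seq2) + 1))
    matrix[:, 0] = np.arange(len(seq1) + 1) * gap_cost
    matrix[0, :] = np.arange(len(seq2) + 1) * gap_cost
    for i in range(1, len(seq1) + 1):
        for j in range(1, len(seq2) + 1):
            sub = match_score if seq1[i - 1] == seq2[j - 1] else -match_score
            matrix[i, j] = max(matrix[i - 1, j - 1] + sub,
                               matrix[i - 1, j] + gap_cost,
                               matrix[i, j - 1] + gap_cost)
    return matrix

def traceback(seq1, seq2, matrix, match_score=1, gap_cost=-1):
    """Walk back from the bottom-right cell to build one optimal alignment.

    Exact equality tests are safe here only because all scores are integers.
    """
    aligned1, aligned2 = [], []
    i, j = len(seq1), len(seq2)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match_score if seq1[i - 1] == seq2[j - 1] else -match_score
            if matrix[i, j] == matrix[i - 1, j - 1] + sub:
                # Diagonal move: align the two characters
                aligned1.append(seq1[i - 1])
                aligned2.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and matrix[i, j] == matrix[i - 1, j] + gap_cost:
            # Move up: gap in seq2
            aligned1.append(seq1[i - 1])
            aligned2.append("-")
            i -= 1
        else:
            # Move left: gap in seq1
            aligned1.append("-")
            aligned2.append(seq2[j - 1])
            j -= 1
    return "".join(reversed(aligned1)), "".join(reversed(aligned2))

seq1, seq2 = "GATTACA", "GCATGCU"
matrix = fill_matrix(seq1, seq2)
aligned1, aligned2 = traceback(seq1, seq2, matrix)
print(aligned1)  # G-ATTACA
print(aligned2)  # GCA-TGCU
```

Summing the column scores of this alignment (four matches, two mismatches, two gaps) gives 0, which agrees with the bottom-right cell of the matrix.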
        



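In a real-world workflow, the sequences to align are usually read from FASTA files rather than typed in by hand. The sample code below uses Biopython's SeqIO module to read the first record from each file.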

In [1]:
from Bio import SeqIO

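def read_fasta(file_path):
    """Return the first sequence in a FASTA file as a string (None if the file has no records)."""
    with open(file_path, "r") as file:
        for record in SeqIO.parse(file, "fasta"):
            # Return on the first record; any later records are ignored
            return str(record.seq)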

# Using the specified paths
seq1 = read_fasta("C:/Users/admin/Downloads/sequence.fasta")
seq2 = read_fasta("C:/Users/admin/Downloads/ReferenceEnteroB.fasta")
print("Sequence 1:", seq1)
print("Sequence 2:", seq2)
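
Note that read_fasta returns only the first record of each file. When a file holds several sequences, all of them can be collected instead. The following is a minimal sketch that does this without Biopython, reading from an in-memory string so that it runs without any files; the helper name parse_fasta and the example records are made up for illustration.

```python
import io

def parse_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Sequence data may be wrapped over several lines
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Hypothetical two-record FASTA text standing in for a real file
example = io.StringIO(">seq1 example record\nGATT\nACA\n>seq2\nGCATGCU\n")
records = dict(parse_fasta(example))
print(records)
```

With a real file, the same generator can be fed an open file object in place of the StringIO handle.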


## Variant Calling

Objective: Detects genetic variants such as SNPs (Single Nucleotide Polymorphisms) from DNA sequence data.

Algorithm used: Implements a naive variant caller that keeps the positions whose quality score meets a threshold. No probabilistic model, such as a naive Bayes classifier, is involved.

In [14]:
import numpy as np
import bionumpy as bnp

def naive_variant_caller(ref, alt, qual, threshold=20):
    variants = []
    for r, a, q in zip(ref, alt, qual):
        if q >= threshold:
            variants.append((r, a))
    return variants

ref = "A"*50 + "C"*50
alt = "A"*49 + "T" + "C"*50
qual = np.random.randint(10, 40, size=100)
variants = naive_variant_caller(ref, alt, qual)
print(variants)

[('A', 'A'), ('A', 'A'), ('A', 'A'), ('A', 'A'), ('A', 'A'), ('A', 'A'), ('A', 'A'), ('A', 'A'), ('A', 'A'), ('A', 'A'), ('A', 'A'), ('A', 'A'), ('A', 'A'), ('A', 'A'), ('A', 'A'), ('A', 'A'), ('A', 'A'), ('A', 'A'), ('A', 'A'), ('A', 'A'), ('A', 'A'), ('A', 'A'), ('A', 'A'), ('A', 'A'), ('A', 'A'), ('A', 'A'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C')]



#### Parameters:
1. ref: A string representing the reference sequence.
2. alt: A string representing the alternate sequence.
3. qual: A NumPy array containing a quality score for each position in the sequences.
4. threshold: An integer giving the minimum quality score a position needs in order to be kept (default is 20).
#### Function Explanation:
1. variants = []: Initializes an empty list to store the results.
2. for r, a, q in zip(ref, alt, qual): Iterates over corresponding positions in ref, alt, and qual using zip.
3. if q >= threshold: Checks whether the quality score q meets or exceeds the specified threshold.
4. variants.append((r, a)): If the condition is met, adds the tuple (r, a) to variants, where r is the reference base and a is the alternate base.
5. return variants: Returns the list of tuples.

The function never compares r with a, so it returns every position that passes the quality filter, including positions where the two bases are identical. This is why the output above consists of pairs such as ('A', 'A') and ('C', 'C'). The single A-to-T difference is missing from that run because its randomly drawn quality score fell below the threshold. A true variant call would also require r != a.


#### Example Data
1. ref: Creates a reference sequence consisting of 50 'A's followed by 50 'C's (100 bases in total).
2. alt: Creates an alternate sequence of the same length in which the 50th base (index 49) is 'T' instead of 'A'.
3. qual: Generates an array of 100 random integer quality scores between 10 and 39 inclusive. No random seed is set, so the scores and the printed output differ from run to run.
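
As written, naive_variant_caller keeps every position that passes the quality filter, whether or not the bases differ. A minimal sketch of a stricter caller follows, assuming that a SNP is any position where ref and alt differ and the quality score meets the threshold. The function name call_snps and the fixed quality array are illustrative choices that make the result reproducible.

```python
import numpy as np

def call_snps(ref, alt, qual, threshold=20):
    """Return (position, ref_base, alt_base) for each confident mismatch."""
    return [
        (pos, r, a)
        for pos, (r, a, q) in enumerate(zip(ref, alt, qual))
        if r != a and q >= threshold
    ]

ref = "A" * 50 + "C" * 50
alt = "A" * 49 + "T" + "C" * 50
qual = np.full(len(ref), 30)  # fixed scores instead of random ones

print(call_snps(ref, alt, qual))  # [(49, 'A', 'T')]

# The same difference is dropped when its quality is below the threshold
low_qual = qual.copy()
low_qual[49] = 10
print(call_snps(ref, alt, low_qual))  # []
```

Reporting the position alongside the bases makes each call traceable to a location in the reference, which the (r, a) tuples alone do not allow.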

## Gene Expression Analysis
Objective: Analyzes gene expression data to identify differentially expressed genes.

Algorithm used: Implements an independent two-sample t-test (scipy.stats.ttest_ind) to compare expression levels between two groups.

In [2]:
import numpy as np
from scipy.stats import ttest_ind

def differential_expression_analysis(group1, group2):
    t_stat, p_value = ttest_ind(group1, group2)
    return t_stat, p_value

# Simulated gene expression data
group1 = np.random.normal(10, 2, 100)
group2 = np.random.normal(12, 2, 100)
t_stat, p_value = differential_expression_analysis(group1, group2)
print(f"T-statistic: {t_stat}, P-value: {p_value}")


T-statistic: -6.944554888357878, P-value: 5.323895809529731e-11


This code demonstrates how to use Python and SciPy to perform differential expression analysis using a t-test. 

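It generates simulated gene expression data for two groups (group1 and group2), calculates the t-statistic and p-value using ttest_ind, and prints the results. In the run shown above, the negative t-statistic (about -6.94) indicates that the mean of group1 is lower than the mean of group2, and the p-value (about 5.3e-11) indicates that a difference this large would be very unlikely if the two groups had equal means.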

This type of analysis is commonly used in bioinformatics and statistical genetics to identify genes that are differentially expressed between experimental conditions or groups.

#### Simulated Gene Expression Data:
group1 = np.random.normal(10, 2, 100): Generates 100 random values from a normal distribution with mean 10 and standard deviation 2. This represents simulated gene expression values for group 1.

group2 = np.random.normal(12, 2, 100): Generates 100 random values from a normal distribution with mean 12 and standard deviation 2. This represents simulated gene expression values for group 2.
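
The example above tests a single simulated gene. A differential expression analysis usually covers many genes, and ttest_ind can test them all in one call through its axis argument. The sketch below assumes an expression matrix with one row per gene and one column per sample; the matrix shape, the random seed, and the choice of which genes are shifted are all made up for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)  # seeded so the run is repeatable
n_genes, n_samples = 5, 100

# Rows are genes, columns are samples
group1 = rng.normal(10, 2, size=(n_genes, n_samples))
group2 = rng.normal(10, 2, size=(n_genes, n_samples))
group2[:2] += 2  # only genes 0 and 1 are shifted to a mean of 12

# axis=1 runs one t-test per gene across its samples
t_stats, p_values = ttest_ind(group1, group2, axis=1)

for gene, (t, p) in enumerate(zip(t_stats, p_values)):
    print(f"gene {gene}: t = {t:.2f}, p = {p:.3g}")
```

When many genes are tested this way, the p-values would normally need a multiple-testing correction before genes are declared differentially expressed; this sketch omits that step.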