## Using BioNumPy in an Existing Project

BioNumPy is a library for reading and parsing biological datasets efficiently. It supports file formats such as FASTQ, BAM, and VCF, and it can be added to an existing Python project without restructuring the surrounding code. This guide covers installation, reading files in chunks, converting chunks to Pandas DataFrames, and working with RaggedArray.

## Installation

In [1]:
pip install bionumpy


Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


## Reading Biological Data Files
BioNumPy simplifies the process of reading biological data files. It can automatically detect the file format and read the file efficiently. Here's an example of reading a VCF file:

In [2]:
import bionumpy as bnp

# Open the file; BioNumPy detects the file format automatically
f = bnp.open("C:\\Users\\admin\\Downloads\\sample.vcf")

# Read the file in chunks for efficient processing
for chunk in f.read_chunks():
    # Iterate over entries within a chunk
    for single_entry in chunk.to_iter():
        print(single_entry)
        # Access specific attributes like chromosome, position, etc.
        position = single_entry.position
        print(f"Chromosome: {single_entry.chromosome}, Position: {position}")


VCFEntry(chromosome='19', position=110, id='.', ref_seq='A', alt_seq='C', quality='9.6', filter='.', info=InfoDataclass(NS=0, AN=0, AC=[], DP=0, AF=[], AA='', DB=False, H2=False))
Chromosome: 19, Position: 110
VCFEntry(chromosome='19', position=111, id='.', ref_seq='A', alt_seq='G', quality='10', filter='.', info=InfoDataclass(NS=0, AN=0, AC=[], DP=0, AF=[], AA='', DB=False, H2=False))
Chromosome: 19, Position: 111
VCFEntry(chromosome='20', position=14369, id='rs6054257', ref_seq='G', alt_seq='A', quality='29', filter='PASS', info=InfoDataclass(NS=3, AN=0, AC=[], DP=14, AF=[0.5], AA='', DB=True, H2=True))
Chromosome: 20, Position: 14369
VCFEntry(chromosome='20', position=17329, id='.', ref_seq='T', alt_seq='A', quality='3', filter='q10', info=InfoDataclass(NS=3, AN=0, AC=[], DP=11, AF=[0.017], AA='', DB=False, H2=False))
Chromosome: 20, Position: 17329
VCFEntry(chromosome='20', position=1110695, id='rs6040355', ref_seq='A', alt_seq='G,T', quality='67', filter='PASS', info=InfoDataclass

In this example, bnp.open detects the file format automatically. Reading in chunks with f.read_chunks() means only one chunk is processed at a time rather than the whole file at once, and each chunk can be iterated entry by entry. Every entry exposes its fields as attributes, such as chromosome and position.
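To make the chunking idea concrete, here is a minimal standard-library sketch of the same pattern: parse records lazily and hand them out a few at a time. It is only an illustration of the concept, not how BioNumPy is implemented. The function names `parse_records` and `read_in_chunks` and the toy VCF lines are hypothetical, and positions are kept exactly as written in the toy input.

```python
import io
from itertools import islice


def parse_records(handle):
    """Yield (chromosome, position) for each data line, skipping headers."""
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        yield fields[0], int(fields[1])


def read_in_chunks(handle, chunk_size):
    """Yield lists of at most chunk_size records (hypothetical helper)."""
    records = parse_records(handle)
    while True:
        chunk = list(islice(records, chunk_size))
        if not chunk:
            return
        yield chunk


# Toy VCF content (made-up records, for illustration only)
sample = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\t.\tA\tC\t9.6\t.\t.\n"
    "1\t200\t.\tA\tG\t10\t.\t.\n"
    "2\t300\t.\tG\tA\t29\tPASS\t.\n"
)

chunks = list(read_in_chunks(sample, chunk_size=2))
for chunk in chunks:
    for chromosome, position in chunk:
        print(f"Chromosome: {chromosome}, Position: {position}")
```

Because the records come from a generator, only one chunk's worth of parsed entries exists at a time while iterating; the `list(...)` call here collects all chunks only so the result is easy to inspect.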

## Using BioNumPy with Pandas
BioNumPy can also convert chunks of data into Pandas DataFrames. Here's how to read a FASTQ file and convert it to a DataFrame:

In [10]:
import bionumpy as bnp

# Open the FASTQ file
f = bnp.open("C:\\Users\\admin\\Downloads\\reads_1.fq.gz")

# Read the file in chunks and convert each chunk to a DataFrame
for chunk in f.read_chunks():
    df = chunk.topandas()
    print(df)

                                                    name  \
0      read81034/ENST00000314616.11;mate1:5659-5758;m...   
1      read25907/ENST00000355968.10;mate1:575-674;mat...   
2      read67157/ENST00000390556.6;mate1:1320-1419;ma...   
3      read15543/ENST00000451562.5;mate1:83-182;mate2...   
4      read73381/ENST00000568517.1;mate1:258-357;mate...   
...                                                  ...   
19027  read96695/ENST00000276079.13;mate1:1534-1633;m...   
19028  read99554/ENST00000276079.13;mate1:2213-2312;m...   
19029  read49602/ENST00000347770.8;mate1:53-152;mate2...   
19030  read41219/ENST00000415933.5;mate1:207-306;mate...   
19031  read15906/ENST00000451562.5;mate1:185-284;mate...   

                                                sequence  \
0      TGTTTATTCAAATGACAGGCAGGAAGCGGTGGCAGCAGCAGGGGGG...   
1      TCATAATCATAAACTTAACTTNGCAATCCAGCTAGGCATGGGAGGG...   
2      GGCCGGGGCAGGGGTGTAGCTGGCTCTCGGGGAAGCATGGGAAGGA...   
3      TCTCCCCATAGATGGACTTGCCACCAGTGCCA

The resulting DataFrame will contain the data from the FASTQ file, with columns for the sequence name, sequence, and quality scores.
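Note that the loop above rebinds df on every iteration, so only the last chunk is left afterwards. If the whole file fits in memory, one option is to collect the per-chunk DataFrames and concatenate them. The sketch below uses small hand-built DataFrames as stand-ins for the per-chunk results; the column names follow the description above, while the values (and the string form of the quality column) are made-up placeholders.

```python
import pandas as pd

# Stand-ins for the DataFrames produced per chunk (toy data, not real reads)
chunk_frames = [
    pd.DataFrame(
        {
            "name": ["read1", "read2"],
            "sequence": ["ACGT", "GGCATT"],
            "quality": ["IIII", "IIIIII"],
        }
    ),
    pd.DataFrame(
        {"name": ["read3"], "sequence": ["TTGACA"], "quality": ["IIIIII"]}
    ),
]

# ignore_index=True renumbers the rows 0..n-1 across all chunks
combined = pd.concat(chunk_frames, ignore_index=True)

# Ordinary Pandas operations then apply, e.g. the length of each read
combined["length"] = combined["sequence"].str.len()
print(combined[["name", "length"]])
```

For files too large to hold in memory, it is usually better to reduce each chunk to a summary inside the loop and combine the summaries instead.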

## Working with BioNumPy RaggedArray
BioNumPy handles sequences of non-uniform length with RaggedArray, which is particularly useful for biological sequences. A list of sequences can be converted into a RaggedArray for efficient computation.

Here's an example of using RaggedArray with DNA sequences:

In [11]:
import bionumpy as bnp

# List of DNA sequences
my_sequences = [
    "TGTGCCAGCAGCGGGGATCGTAATCAGCCCCAGCATTTT",
    "TGCAGCGTCAAGGTCCAAGCTTTCTTT",
    "TGTGCCACCAGTGATTATTATTGGTACGAGCAGTACTTC"
]

# Encode the sequences as DNA; the result is a ragged array (rows of different lengths)
my_ragged_array = bnp.as_encoded_array(my_sequences, bnp.encodings.alphabet_encoding.DNAEncoding)

# Print the RaggedArray and its shape
print(my_ragged_array)
print(my_ragged_array.shape)


TGTGCCAGCAGCGGGGATCGTAATCAGCCCCAGCATTTT
TGCAGCGTCAAGGTCCAAGCTTTCTTT
TGTGCCACCAGTGATTATTATTGGTACGAGCAGTACTTC
(3, array([39, 27, 39], dtype=int64))
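The shape printed above reads as "3 rows, with lengths 39, 27, and 39". One common way to store such ragged data is a single flat buffer plus per-row offsets, which avoids padding every row to the longest length. The following is a conceptual, standard-library sketch of that layout, not BioNumPy's actual implementation; `get_row` and the variable names are hypothetical.

```python
from itertools import accumulate

sequences = [
    "TGTGCCAGCAGCGGGGATCGTAATCAGCCCCAGCATTTT",
    "TGCAGCGTCAAGGTCCAAGCTTTCTTT",
    "TGTGCCACCAGTGATTATTATTGGTACGAGCAGTACTTC",
]

# One flat buffer holding every character back to back
flat = "".join(sequences)

# Per-row lengths, and the start offset of each row in the flat buffer
row_lengths = [len(s) for s in sequences]
offsets = [0, *accumulate(row_lengths)]


def get_row(i):
    """Slice row i back out of the flat buffer (hypothetical helper)."""
    return flat[offsets[i]:offsets[i + 1]]


# A ragged "shape": number of rows plus the length of each row
ragged_shape = (len(row_lengths), row_lengths)
print(ragged_shape)
print(get_row(1))
```

With this layout, whole-dataset operations can run over the flat buffer in one pass, and the offsets are only needed when results have to be split back into rows.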
