
## BioNumPy Tutorial

### 1. Filtering FASTQ reads

Filtering reads is a crucial early step in DNA sequencing analysis. When DNA is sequenced, it is typically broken into smaller fragments called "reads." These reads can contain sequencing errors, noise, or irrelevant sequence that degrade downstream analysis. Filtering removes or adjusts the problematic reads so that later results are more accurate and reliable.
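
To make the idea concrete, here is a minimal plain-Python sketch of quality-based filtering that does not depend on any library. It assumes Phred+33 quality encoding, the usual convention in modern FASTQ files. The function names, thresholds, and toy reads are illustrative and are not taken from BioNumPy.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (assumes Phred+33)."""
    return [ord(char) - offset for char in quality_string]


def passes_quality_filter(quality_string, min_score=2, min_mean=10.0):
    """Keep a read only if its worst base and its average quality are both high enough."""
    scores = phred_scores(quality_string)
    if not scores:
        return False  # an empty read carries no usable information
    return min(scores) >= min_score and sum(scores) / len(scores) >= min_mean


# Hypothetical toy reads: (sequence, quality string)
toy_reads = [
    ("ACGT", "????"),  # every base has Phred score 30
    ("ACGT", "?!??"),  # one base has Phred score 0
]
kept = [seq for seq, qual in toy_reads if passes_quality_filter(qual)]
print(f"Kept {len(kept)} of {len(toy_reads)} reads")
```

Only the first toy read survives, because the second contains a base whose score falls below the minimum.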

In [12]:
import bionumpy as bnp

# Read FASTQ file
reads = bnp.open("C:\\Users\\admin\\Desktop\\SRR19127870.fastq").read()
reads

SequenceEntryWithQuality with 46197 entries
                     name                 sequence                  quality
  SRR19127870.1.1 M04500:  AAAAGGAAAAAAGAAACACGGAC  [30 30 30 30 30 30 30 3
  SRR19127870.1.2 M04500:  ACCCACACTCCAGTGGGCAGTCT  [30 30 30 30 30 30 30 3
  SRR19127870.2.1 M04500:  CTCTATGCAACAGAAGCAAAGAG  [30 30 30 30 30 30 30 3
  SRR19127870.2.2 M04500:  CTCCCGTGCAAGAGTAAGCATAC  [30 30 30 30 30 30 30 3
  SRR19127870.3.1 M04500:  ACCAAGTTGAAGGAATGCATGGA  [30 30 30 30 30 30 30 3
  SRR19127870.3.2 M04500:  GACCTAAGCTCATCCTTCACATA  [30 30 30 30 30 30 30 3
  SRR19127870.4.1 M04500:  GGTTTTGCGGGTCCGCCATGGCT  [30 30 30 30 30 30 30 3
  SRR19127870.4.2 M04500:  CTAACATCGCGTCTTTTCTATTC  [30 30 30 30 30 30 30 3
  SRR19127870.5.1 M04500:  GTCTGTCGTAACGGGCAACTCTG  [30 30 30 30 30 30 30 3
  SRR19127870.5.2 M04500:  CAATAGCTATATGGTAACAATCT  [30 30 30 30 30 30 30 3

In [14]:
import numpy as np
# Fraction of all bases, across all reads, that are C or G
gc_content = np.mean((reads.sequence == "C") | (reads.sequence == "G"))
gc_content

0.47412948417953354
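
For comparison, the same quantity can be written out in plain Python. This sketch assumes, as the single scalar result suggests, that the vectorized expression pools every base from every read rather than averaging per read. The helper name and the toy sequences are hypothetical.

```python
def gc_content(sequences):
    """Fraction of all bases, pooled across reads, that are G or C."""
    total_bases = sum(len(seq) for seq in sequences)
    if total_bases == 0:
        raise ValueError("no bases to count")
    gc_bases = sum(seq.count("G") + seq.count("C") for seq in sequences)
    return gc_bases / total_bases


# Hypothetical toy sequences: 6 of the 12 bases are G or C
toy_sequences = ["GGCC", "AATT", "ACGT"]
print(gc_content(toy_sequences))
```

A pooled fraction weights long reads more heavily than a mean of per-read GC fractions would, so the two can differ when read lengths vary.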

In [15]:
def filter_reads(in_filename, out_filename):
    """Write the reads that pass both quality thresholds to out_filename."""
    with bnp.open(out_filename, 'w') as out_file:
        # Process the file chunk by chunk to keep memory use low
        for reads in bnp.open(in_filename).read_chunks():
            # Keep reads whose lowest base quality is above 1 ...
            min_quality_mask = reads.quality.min(axis=-1) > 1
            # ... and whose mean base quality is above 10
            mean_quality_mask = reads.quality.mean(axis=-1) > 10
            mask = min_quality_mask & mean_quality_mask
            print(f'Filtering reads: {len(reads)} -> {mask.sum()}')
            out_file.write(reads[mask])


if __name__ == "__main__":
    input_file = "C:\\Users\\admin\\Desktop\\SRR19127870.fastq"
    output_file = "C:\\Users\\admin\\Desktop\\SRR19127870_filtered.fastq"
    filter_reads(input_file, output_file)

Filtering reads: 12606 -> 12421
Filtering reads: 12529 -> 12321
Filtering reads: 12260 -> 12036
Filtering reads: 8802 -> 8202
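
The per-chunk counts printed above can be tallied to get an overall retention rate. The snippet below simply copies those four pairs of numbers; the variable names are illustrative.

```python
# (reads before, reads after) for each chunk, copied from the output above
chunk_counts = [(12606, 12421), (12529, 12321), (12260, 12036), (8802, 8202)]

total_before = sum(before for before, _ in chunk_counts)
total_after = sum(after for _, after in chunk_counts)
removed = total_before - total_after

print(f"Kept {total_after} of {total_before} reads ({total_after / total_before:.1%})")
print(f"Removed {removed} reads")
```

The input total, 46197, matches the number of entries reported when the file was first loaded. About 97.4% of the reads pass both filters.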


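The next cell extracts k-mers and prints them as integers. As background, a k-mer over A, C, G and T can be packed into a single integer using two bits per base, so a 31-mer fits in 62 bits. The sketch below shows one such packing in plain Python. The base-to-number mapping and the bit order are assumptions chosen for illustration and are not necessarily the layout BioNumPy uses internally, so these integers need not match the library's raw values.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed mapping, for illustration


def encode_kmer(kmer):
    """Pack a DNA k-mer into an integer, two bits per base, first base most significant."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_TO_BITS[base]  # KeyError on anything but A, C, G, T
    return value


def iter_kmers(sequence, k):
    """Yield every overlapping k-mer of a sequence, left to right."""
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


# Hypothetical toy sequence: its 3-mers are ACG, CGT and GTA
encoded = [encode_kmer(kmer) for kmer in iter_kmers("ACGTA", k=3)]
print(encoded)
```

A read of length n yields n - k + 1 k-mers, so reads of different lengths produce rows of different lengths, which is why the result is a ragged array.
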
In [16]:
import bionumpy as bnp

file = bnp.open("C:\\Users\\admin\\Desktop\\SRR19127870.fastq")
# read the file chunk by chunk to keep memory low:
for chunk in file.read_chunks():
    sequences = chunk.sequence
    # Change the encoding to DNAEncoding. This works as long as the
    # sequences contain only A, C, G and T, and makes get_kmers extremely efficient
    sequences = bnp.change_encoding(sequences, bnp.DNAEncoding)
    print("Kmers:")
    kmers = bnp.get_kmers(sequences, k=31)
    print(kmers[0:3, 0:2])

    # kmers is an EncodedRaggedArray, one row for each read
    # you can get all the raw numeric kmers efficiently like this:
    print("Raw kmers:")
    numeric_kmers = kmers.raw()
    print(numeric_kmers[0:3, 0:2])

    # If you don't need the ragged structure, flatten the array with ravel():
    print("Flat raw kmers:")
    numeric_flat_kmers = numeric_kmers.ravel()
    print(numeric_flat_kmers[0:4])

Kmers:
[AAAAGGAAAAAAGAAACACGGACACCCAAAG, AAAGGAAAAAAGAAACACGGACACCCAAAGT]
[ACCCACACTCCAGTGGGCAGTCTGTCGTAAC, CCCACACTCCAGTGGGCAGTCTGTCGTAACG]
[CTCTATGCAACAGAAGCAAAGAGGGTGTTTT, TCTATGCAACAGAAGCAAAGAGGGTGTTTTC]
Raw kmers:
[2311774397737732608 4036708113254974080]
[1218144013301269588 2610379012539011349]
[4606797596207377629 2304620903658691383]
Flat raw kmers:
[2311774397737732608 4036708113254974080 1009177028313743520
 2558137266292129832]
Kmers:
[TTTGGGGGTGTCTCGGGAGCTACACTGATCT, TTGGGGGTGTCTCGGGAGCTACACTGATCTT]
[GGAGTATCATCATAGGCACAGCGTAGGGGCT, GAGTATCATCATAGGCACAGCGTAGGGGCTA]
[GCGAGTGTTCTAATTATAATATTGTTGATCA, CGAGTGTTCTAATTATAATATTGTTGATCAT]
Raw kmers:
[3975855360475638463 4452728353939450543]
[3938650837560882058  984662709390220514]
[ 517840028885839398 3588224521042000777]
Flat raw kmers:
[3975855360475638463 4452728353939450543 1113182088484862635
  278295522121215658]
Kmers:
[GTAAATCGTCTCTTAGATACTCAGGCCACCG, TAAATCGTCTCTTAGATACTCAGGCCACCGC]
[ACCATGCGCCGAGCGCTTATGGTCAACGAGG, CCAT