## Using BioNumPy to analyse k-mers and minimizers

Encoding sequences refers to the process of converting biological sequences, such as DNA or protein sequences, from their natural form of letters (A, C, G, T for DNA; A, C, G, U for RNA; and amino acid codes for proteins) into a numerical format. For DNA sequences, the letters A, C, G, and T are commonly encoded as integers, such as 0, 1, 2, and 3, respectively.
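
The mapping described above can be sketched with plain NumPy. This is only an illustration of the idea, not BioNumPy's own implementation; the helper name `encode_dna` and the use of 255 as an "invalid" marker are assumptions made for this example.

```python
import numpy as np

# Assumed mapping, taken from the text above: A=0, C=1, G=2, T=3.
ALPHABET = "ACGT"

def encode_dna(sequence: str) -> np.ndarray:
    """Map each base to its index in ALPHABET using a byte lookup table."""
    lookup = np.full(256, 255, dtype=np.uint8)  # 255 marks an invalid character
    for code, base in enumerate(ALPHABET):
        lookup[ord(base)] = code
    raw = np.frombuffer(sequence.encode("ascii"), dtype=np.uint8)
    encoded = lookup[raw]
    if (encoded == 255).any():
        raise ValueError("sequence contains characters outside ACGT")
    return encoded

print(encode_dna("ACTG"))  # [0 1 3 2]
```

Working on integers rather than letters is what lets later steps (k-mer extraction, counting) run as fast array operations.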

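A k-mer is a contiguous substring of length k; a sequence of length n contains n - k + 1 overlapping k-mers. Minimizers are a compact representation of sequences used to reduce redundancy. They are determined by sliding a window along the sequence and keeping the smallest k-mer in each window, where "smallest" is defined by some fixed ordering of k-mers (for example, alphabetical order).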

The functions used in this notebook are:

1. bnp.as_encoded_array: Converts a list of sequences into a numerically encoded array using a specified encoding (e.g., DNAEncoding).
Example: sequences = bnp.as_encoded_array(["ACTG", "GGGACT", "G"], bnp.DNAEncoding)

2. bnp.sequence.get_kmers: Extracts k-mers of a specified length from sequences.
Example: kmers = bnp.sequence.get_kmers(sequences, 3)

3. bnp.count_encoded: Counts the occurrences of each unique k-mer in the extracted k-mers.
Example: counts = bnp.count_encoded(kmers, axis=None)

4. bnp.sequence.get_minimizers: Finds minimizers within a sliding window of sequences.
Example: minimizers = bnp.sequence.get_minimizers(sequences, k=2, window_size=4)

5. bnp.open: Opens a FASTQ file or any supported sequence file for reading.
Example: file = bnp.open("example_data/big.fq.gz")

6. file.read_chunks: Reads the sequence file chunk by chunk to keep memory usage low, especially for large files.
Example: for chunk in file.read_chunks():

7. bnp.change_encoding: Converts the encoding of sequences to a specified encoding.
Example: sequences = bnp.change_encoding(sequences, bnp.DNAEncoding)

8. EncodedRaggedArray.raw: Retrieves the raw numeric representation of the k-mers or minimizers.
Example: numeric_kmers = kmers.raw()

9. numpy.ravel: Flattens a multi-dimensional array into a one-dimensional array.
Example: numeric_flat_kmers = numeric_kmers.ravel()
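
The idea behind chunked reading (item 6) can be pictured with a plain-Python analogue. This is a rough sketch and not how BioNumPy implements read_chunks; the function `read_fastq_chunks`, the chunk size, and the three made-up reads are all assumptions for illustration.

```python
import io
from itertools import islice

def read_fastq_chunks(handle, records_per_chunk=2):
    """Yield lists of (name, sequence, quality) tuples, a few records at a time."""
    while True:
        # A FASTQ record spans 4 lines, so pull 4 * records_per_chunk lines.
        lines = list(islice(handle, 4 * records_per_chunk))
        if not lines:
            return
        lines = [line.rstrip("\n") for line in lines]
        yield [
            (lines[i][1:], lines[i + 1], lines[i + 3])
            for i in range(0, len(lines) - 3, 4)
        ]

# Tiny in-memory FASTQ standing in for a real file (made-up reads).
fastq_text = "@r1\nACTG\n+\nIIII\n@r2\nGGGACT\n+\nIIIIII\n@r3\nG\n+\nI\n"
chunks = list(read_fastq_chunks(io.StringIO(fastq_text)))
for chunk in chunks:
    print([seq for _, seq, _ in chunk])
# ['ACTG', 'GGGACT']
# ['G']
```

Only one chunk is held in memory at a time inside the generator, which is why this pattern scales to files much larger than RAM.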

In [3]:
import bionumpy as bnp
sequences = bnp.as_encoded_array(["ACTG", "GGGACT", "G"], bnp.DNAEncoding)
kmers = bnp.sequence.get_kmers(sequences, 3)
kmers

encoded_ragged_array([[ACT, CTG],
                      [GGG, GGA, GAC, ACT],
                      []], 3merEncoding(AlphabetEncoding('ACGT')))

In [4]:
counts = bnp.count_encoded(kmers, axis=None)
counts["ACT"]

array(2, dtype=int64)
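
As a cross-check of the two cells above, the same 3-mers and the count for "ACT" can be reproduced in plain Python. This sketch does not use BioNumPy; `kmers_of` is a hypothetical helper written for this example, and pooling all reads into one Counter is meant to mirror the pooled count obtained with axis=None.

```python
from collections import Counter

def kmers_of(sequence: str, k: int) -> list[str]:
    """Return every overlapping substring of length k (empty if the sequence is shorter than k)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

reads = ["ACTG", "GGGACT", "G"]
per_read = [kmers_of(read, 3) for read in reads]
print(per_read)
# [['ACT', 'CTG'], ['GGG', 'GGA', 'GAC', 'ACT'], []]

# Pool the k-mers from all reads into a single tally.
totals = Counter(kmer for kmers in per_read for kmer in kmers)
print(totals["ACT"])  # 2
```

The third read, "G", is shorter than k, so it contributes no k-mers, matching the empty row in the BioNumPy output.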

In [5]:
bnp.sequence.get_minimizers(sequences, k=2, window_size=4)

encoded_ragged_array([[AC],
                      [GA, GA, GA],
                      []], 2merEncoding(AlphabetEncoding('ACGT')))

In [11]:
# Encode the DNA sequences
sequences = bnp.as_encoded_array(["ATCGATCGA", "CAGTCAGT", "TCGA"], bnp.DNAEncoding)
# Generate 3-mers from the sequences
kmers1 = bnp.sequence.get_kmers(sequences, 3)
# Output the encoded k-mers
print("K-mers:")
print(kmers1)
# Count occurrences of each k-mer across all sequences
counts = bnp.count_encoded(kmers1, axis=None)

# Output the counts of each k-mer
print("\nCounts of each 3-mer:")
print(counts)

K-mers:
[ATC, TCG, CGA, GAT, ATC, TCG, CGA]
[CAG, AGT, GTC, TCA, CAG, AGT]
[TCG, CGA]

Counts of each 3-mer:
AAA: 0
CAA: 0
GAA: 0
TAA: 0
ACA: 0
CCA: 0
GCA: 0
TCA: 1
AGA: 0
CGA: 3
GGA: 0
TGA: 0
ATA: 0
CTA: 0
GTA: 0
TTA: 0
AAC: 0
CAC: 0
GAC: 0
TAC: 0
ACC: 0
CCC: 0
GCC: 0
TCC: 0
AGC: 0
CGC: 0
GGC: 0
TGC: 0
ATC: 2
CTC: 0
GTC: 1
TTC: 0
AAG: 0
CAG: 2
GAG: 0
TAG: 0
ACG: 0
CCG: 0
GCG: 0
TCG: 3
AGG: 0
CGG: 0
GGG: 0
TGG: 0
ATG: 0
CTG: 0
GTG: 0
TTG: 0
AAT: 0
CAT: 0
GAT: 1
TAT: 0
ACT: 0
CCT: 0
GCT: 0
TCT: 0
AGT: 2
CGT: 0
GGT: 0
TGT: 0
ATT: 0
CTT: 0
GTT: 0
TTT: 0


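For example, the sequence "ATCGATCGATCG" (12 bases) contains 12 - 3 + 1 = 10 overlapping 3-mers:
ATC TCG CGA GAT ATC TCG CGA GAT ATC TCG
Assigning an illustrative numeric label to each distinct k-mer: ATC -> 1, TCG -> 2, CGA -> 3, GAT -> 4

The raw numeric representation is then:
[1, 2, 3, 4, 1, 2, 3, 4, 1, 2]

These labels are chosen only for illustration; the integers returned by raw() depend on BioNumPy's k-mer encoding.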
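
The count listing earlier is ordered AAA, CAA, GAA, TAA, ACA, ..., which is what you get if each k-mer is read as a base-4 number with the first base as the least significant digit. The sketch below computes integers under that assumption. The scheme is inferred from the listing order in this notebook, not confirmed against BioNumPy's source, and `kmer_to_int` is a hypothetical helper.

```python
ALPHABET = "ACGT"  # A=0, C=1, G=2, T=3, as in the encoding described earlier

def kmer_to_int(kmer: str) -> int:
    """Treat the k-mer as a base-4 number whose first base is the least significant digit."""
    return sum(ALPHABET.index(base) * 4 ** position for position, base in enumerate(kmer))

sequence = "ATCGATCGATCG"
kmers = [sequence[i:i + 3] for i in range(len(sequence) - 3 + 1)]
print(kmers)
# ['ATC', 'TCG', 'CGA', 'GAT', 'ATC', 'TCG', 'CGA', 'GAT', 'ATC', 'TCG']
print([kmer_to_int(kmer) for kmer in kmers])
# [28, 39, 9, 50, 28, 39, 9, 50, 28, 39]
```

Under this scheme ATC maps to 28, which matches its zero-based position in the count listing (AAA is position 0).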

As a worked example of finding minimizers, take the sequence "ACGTGTCGA" with a sliding window of size 4 (w=4) and 3-mers (k=3).

Sequence: "ACGTGTCGA"
Window size (w): 4
K-mer size (k): 3

Step-by-step process:

First window: "ACGT"
3-mers in this window: "ACG", "CGT"
Lexicographically smallest 3-mer: "ACG"

Second window: "CGTG"
3-mers in this window: "CGT", "GTG"
Lexicographically smallest 3-mer: "CGT"

The window then keeps sliding one base at a time until it reaches the end of the sequence.

Lexicographically smallest means the smallest in dictionary order. In the context of k-mers in DNA sequences, it refers to arranging the k-mers (substrings of length k) in alphabetical order and selecting the first one.
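
The walkthrough can be completed with a short plain-Python sketch. `window_minimizers` is a hypothetical helper, not a BioNumPy function; it treats the window size as a number of characters, as the walkthrough does. The optional `key` argument lets the ordering be swapped, which matters because the In [5] output for "GGGACT" ([GA, GA, GA]) is not what alphabetical order gives. It is consistent with ordering k-mers by a numeric code in which the first base is least significant, but that is an inference from the outputs in this notebook rather than documented behaviour.

```python
def window_minimizers(sequence: str, k: int, w: int, key=None) -> list[str]:
    """Return the smallest k-mer from each window of w consecutive characters."""
    minimizers = []
    for start in range(len(sequence) - w + 1):
        window = sequence[start:start + w]
        kmers = [window[i:i + k] for i in range(w - k + 1)]
        minimizers.append(min(kmers, key=key))  # key=None means alphabetical order
    return minimizers

def first_base_least_significant(kmer: str) -> int:
    """Assumed numeric ordering: base-4 value with the first base as the lowest digit."""
    return sum("ACGT".index(base) * 4 ** i for i, base in enumerate(kmer))

print(window_minimizers("ACGTGTCGA", k=3, w=4))
# ['ACG', 'CGT', 'GTG', 'GTC', 'GTC', 'CGA']

print(window_minimizers("GGGACT", k=2, w=4))
# ['GA', 'AC', 'AC']
print(window_minimizers("GGGACT", k=2, w=4, key=first_base_least_significant))
# ['GA', 'GA', 'GA']
```

Neighbouring windows often share the same minimizer (for example "GTC" above), which is the source of the redundancy reduction.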

### Retrieve the count for a specific k-mer "CAG"

In [10]:
specific_kmer = "CAG"
specific_count = counts[specific_kmer]
print(f"\nCount for the k-mer '{specific_kmer}': {specific_count}")


Count for the k-mer 'CAG': 2


In [2]:
import bionumpy as bnp

# Opens the FASTQ file (adjust this path to where the file is stored on your machine)
file_path = "C:\\Users\\admin\\Desktop\\SRR19127870.fastq"
file = bnp.open(file_path)
# Reads the file chunk by chunk to keep memory low
for chunk in file.read_chunks():
    sequences = chunk.sequence

    # Changes the encoding to DNAEncoding
    sequences = bnp.change_encoding(sequences, bnp.DNAEncoding)

    # Prints kmers
    print("Kmers:")
    kmers = bnp.sequence.get_kmers(sequences, k=2)
    print(kmers[0:3, 0:2])

    # Gets raw numeric kmers
    print("Raw kmers:")
    numeric_kmers = kmers.raw()
    print(numeric_kmers[:, :])

    # Flattens raw numeric kmers
    print("Flat raw kmers:")
    numeric_flat_kmers = numeric_kmers.ravel()
    print(numeric_flat_kmers[0:200])

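FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\admin\\Desktop\\SRR19127870.fastq'

This error means no file exists at file_path, so bnp.open failed and the loop never ran. Point file_path at an existing FASTQ file and rerun the cell.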