## Position Weight Matrix

When a motif PWM is read from a file, a PositionWeightMatrix is created using the alphabet and counts stored in that file.

import numpy as np
import bionumpy as bnp
from bionumpy.io.motifs import read_motif
from bionumpy.sequence.position_weight_matrix import get_motif_scores


def read_motif_scores(reads_filename: str, motif_filename: str) -> tuple[np.ndarray, np.ndarray]:
    # Read the alphabet and counts from jaspar file
    pwm = read_motif(motif_filename)

    # Get reads
    entries = bnp.open(reads_filename).read()

    # Calculate the motif score for each valid window
    scores = get_motif_scores(entries.sequence, pwm)

    # Get a histogram of the max-score for each read
    return bnp.histogram(scores.max(axis=-1))


if __name__ == "__main__":
    result = read_motif_scores(
        "C:/Users/admin/Downloads/big.fq.gz",
        "C:/Users/admin/Downloads/MA0080.1.jaspar"
    )
    print(result)

(array([  5,   5,  11,  29, 207, 392, 174,   0,  56, 121], dtype=int64), array([24.30869731, 24.95400731, 25.5993173 , 26.24462729, 26.88993728,
       27.53524728, 28.18055727, 28.82586726, 29.47117725, 30.11648725,
       30.76179724]))


The purpose of this code is to analyze sequencing reads (FASTQ format) and assess how well a given motif, described by a PWM, matches them. It does this by:

1. Reading the PWM from a JASPAR file, which defines the expected nucleotide frequencies at each position of the motif.
2. Computing motif scores across biological reads to quantify the similarity between each read and the motif.
3. Computing a histogram of the maximum motif score of each read, which summarizes how strongly the motif matches across the experimental or simulated sequences.
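
Step 1 depends on turning per-position counts into weights. The following is a minimal sketch of one common way to do that, assuming a pseudocount of 1 and a uniform background of 0.25; the count matrix is hypothetical, and bionumpy's own conversion may differ in its details.

```python
import math

# Hypothetical count matrix for a 4-position motif (one row per base).
# These numbers are made up for illustration; they are not from MA0080.1.
counts = {
    "A": [8, 1, 0, 1],
    "C": [1, 0, 1, 7],
    "G": [0, 1, 8, 1],
    "T": [1, 8, 1, 1],
}


def counts_to_log_odds(counts, pseudocount=1.0, background=0.25):
    """Convert per-position counts to log2-odds weights against a uniform background."""
    length = len(next(iter(counts.values())))
    pwm = {base: [] for base in counts}
    for pos in range(length):
        # Add the pseudocount so that unseen bases do not produce log(0).
        total = sum(counts[base][pos] + pseudocount for base in counts)
        for base in counts:
            prob = (counts[base][pos] + pseudocount) / total
            pwm[base].append(math.log2(prob / background))
    return pwm


pwm = counts_to_log_odds(counts)
for base, weights in pwm.items():
    print(base, [round(w, 2) for w in weights])
```

Bases that are over-represented at a position get a positive weight and under-represented bases get a negative one, so a window's score is higher the more closely it follows the motif.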

## Small Example Interpretation


#### FASTQ File (big.fq.gz)

For this small example, suppose the file contains two sequences:

- Sequence 1: "ATGCATGCATGC"
- Sequence 2: "GCTAGCTAGCTA"

#### JASPAR File (MA0080.1.jaspar)

Suppose the file contains a PWM for a motif that prefers the pattern "ATGC".

When you run the code:

PWM Reading: pwm = read_motif("MA0080.1.jaspar") reads the PWM data from MA0080.1.jaspar, defining the motif's preferences (e.g., higher weights for "ATGC" patterns).

Sequence Reading: entries = bnp.open("big.fq.gz").read() opens and reads the FASTQ file big.fq.gz, extracting the sequences.

Motif Scoring: scores = get_motif_scores(entries.sequence, pwm) computes one score for every window of motif length in each read:

- For Sequence 1, every window of "ATGCATGCATGC" is scored.
- For Sequence 2, every window of "GCTAGCTAGCTA" is scored.
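
To make the window scoring concrete, here is a sketch in plain Python for the two example sequences. It assumes a hypothetical 4-base motif "ATGC" and a simplified +1/-1 scheme (match or mismatch) in place of the weights a real PWM would hold, so the numbers are illustrative only and are not what get_motif_scores returns.

```python
MOTIF = "ATGC"  # hypothetical consensus, used only for this illustration


def window_scores(seq, motif=MOTIF):
    """Score every window of len(motif) in seq: +1 per matching base, -1 per mismatch."""
    k = len(motif)
    return [
        sum(1 if seq[start + offset] == motif[offset] else -1 for offset in range(k))
        for start in range(len(seq) - k + 1)
    ]


reads = ["ATGCATGCATGC", "GCTAGCTAGCTA"]
all_scores = [window_scores(read) for read in reads]

# One maximum per read, analogous to scores.max(axis=-1) in the code above.
max_scores = [max(scores) for scores in all_scores]

for read, scores, best in zip(reads, all_scores, max_scores):
    print(read, scores, "max =", best)
```

A 12-base read and a 4-base motif give 12 - 4 + 1 = 9 windows per read. Under this toy scheme Sequence 1 contains exact matches while Sequence 2 never matches more than two positions, so their maximum scores differ.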

Histogram Calculation: bnp.histogram(scores.max(axis=-1)) takes the best (maximum) window score of each read and bins those values into a histogram. The histogram does not report a score per sequence; it shows how the best motif match is distributed across all reads.

#### Output Interpretation

The result printed at the end (print(result)) is the histogram. As in the output shown earlier, it is a tuple of two arrays: the first gives the number of reads whose maximum motif score falls in each bin, and the second gives the bin edges, which has one more entry than there are bins.
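
As a reading aid, the sketch below pairs each count with its bin edges, using the values copied from the output printed earlier. It assumes each read contributes exactly one maximum score, so the counts sum to the number of reads.

```python
# Values copied from the printed result above: (counts, bin_edges).
counts = [5, 5, 11, 29, 207, 392, 174, 0, 56, 121]
edges = [
    24.30869731, 24.95400731, 25.5993173, 26.24462729, 26.88993728,
    27.53524728, 28.18055727, 28.82586726, 29.47117725, 30.11648725,
    30.76179724,
]

# A histogram with n bins has n + 1 edges.
assert len(edges) == len(counts) + 1

total = sum(counts)
for count, low, high in zip(counts, edges, edges[1:]):
    # Each row is one bin: its score range and how many reads fall in it.
    print(f"{low:6.2f} to {high:6.2f}: {count:4d} reads ({count / total:.1%})")

# Index of the most populated bin.
peak = max(range(len(counts)), key=counts.__getitem__)
print("total reads:", total)
print(f"most common max-score range: {edges[peak]:.2f} to {edges[peak + 1]:.2f}")
```

Read this way, most reads have a best match in the middle bins, with a smaller second group in the two highest bins.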
