## Simulating sequence datasets

The code below shows how to simulate sequence datasets with BioNumPy, run basic analyses such as computing GC content and counting a specific motif (here "AC"), and add the computed values back to the sequence data structure.

## 1. Importing necessary libraries and setting up random number generation

In [1]:
import numpy as np
rng = np.random.default_rng(seed=1)
from bionumpy.simulate import simulate_sequences
from bionumpy.sequence import match_string

1. rng: Random number generator initialized with a seed for reproducibility.

2. simulate_sequences: Function from BioNumPy to simulate sequence data.

3. match_string: Function from BioNumPy for motif matching within sequences.
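To illustrate why the seed matters, the following minimal sketch uses only NumPy (not BioNumPy) and checks that two generators created with the same seed yield identical draws. The variable names are illustrative and do not appear in the notebook.

```python
import numpy as np

# Two independent generators built from the same seed.
rng_a = np.random.default_rng(seed=1)
rng_b = np.random.default_rng(seed=1)

# Draw ten integers in [0, 4) from each, e.g. indices into the alphabet "ACGT".
draws_a = rng_a.integers(0, 4, size=10)
draws_b = rng_b.integers(0, 4, size=10)

# Same seed, same sequence of draws.
same = np.array_equal(draws_a, draws_b)
print(same)
```

Re-running a simulation with a freshly seeded generator should therefore give the same data, whereas reusing a generator that has already been drawn from will not.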

## 2. Simulating sequences

In [9]:
named_seqs = simulate_sequences('ACGT', {f's{i}': 10 + i for i in range(20)}, rng)

In [10]:
named_seqs

SequenceEntry with 20 entries
                     name                 sequence
                       s0               ACCCTCTACA
                       s1              AGACGTGGCTA
                       s2             GAGCCCTTCGAT
                       s3            TCGGTAGTGGATA
                       s4           CGCGGAATTAGGGA
                       s5          GGTCCAAACAGAGGC
                       s6         CTTCTATCGGTCTTAA
                       s7        AGCAATGACGCTCGATG
                       s8       GGAGCAACGGAACCAACA
                       s9      AACCACTTACGAGTTACAG

simulate_sequences: Generates one random sequence per entry in the name-to-length dictionary, drawing letters from the given alphabet.
'ACGT': Alphabet of nucleotides used in the sequences.
{f's{i}': 10 + i for i in range(20)}: Dictionary mapping sequence names (s0, s1, ..., s19) to their lengths (10, 11, ..., 29).
rng: The seeded random number generator, which makes the simulation reproducible.
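As a rough picture of what such a simulation involves, here is a NumPy-only sketch that draws each letter uniformly from the alphabet. The helper `simulate_plain_sequences` is hypothetical: it is not BioNumPy's implementation, it returns plain Python strings rather than a SequenceEntry, and it is not expected to reproduce the sequences shown above.

```python
import numpy as np

def simulate_plain_sequences(alphabet, lengths, rng):
    """Hypothetical helper: one uniformly random string per (name, length) pair."""
    letters = np.array(list(alphabet))
    return {
        name: "".join(rng.choice(letters, size=length))
        for name, length in lengths.items()
    }

rng = np.random.default_rng(seed=1)
lengths = {f"s{i}": 10 + i for i in range(20)}  # s0 -> 10, ..., s19 -> 29
plain_seqs = simulate_plain_sequences("ACGT", lengths, rng)

for name in ("s0", "s1", "s19"):
    print(name, len(plain_seqs[name]), plain_seqs[name])
```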

## 3. Computing GC content per sequence

In [5]:
seqs = named_seqs.sequence
gc_content_per_seq = np.mean((seqs == 'C') | (seqs == 'G'), axis=1)

In [7]:
seqs

encoded_ragged_array(['CGTTAATTAC',
                      'TCCTCCGGAAT',
                      'TTGTCCTACACT',
                      'ACCTAGCATACCC',
                      'ATGTAGCGTCGACT',
                      'CGCACGCTCGTTCAG',
                      'GTCCACGTTAGTCCTG',
                      'GGGTTAAGTAGTTTAGT',
                      'CACAATGTTTCCGCTATG',
                      'CGCTTCCAGGTTTTTAACC',
                      'TTCGGTACGCTTTCTAGCAG',
                      'TTATTCATTCAACTCAGGAGC',
                      'GAGCGCGACGTCAGGGACTTCG',
                      'ATCCTGTATTAAACCATCTTAGT',
                      'AACACCGGCAGCTGGGCCCGCAAA',
                      'ACCACGCTGATTTATGTGGCTTGCG',
                      'GAACGACATGCTTCTTTGTAATCCGC',
                      'GTTATGGATCTAATGCTTAGTGGGGCA',
                      'CGTTAATGTTCTGGCCCGGAAACGTTCG',
                      'GTCGACTCATCCTCCATAGATGGCCTTCA'], AlphabetEncoding('ACGT'))

In [8]:
gc_content_per_seq

array([0.3       , 0.54545455, 0.41666667, 0.53846154, 0.5       ,
       0.66666667, 0.5625    , 0.35294118, 0.44444444, 0.47368421,
       0.5       , 0.38095238, 0.68181818, 0.30434783, 0.66666667,
       0.52      , 0.46153846, 0.44444444, 0.53571429, 0.51724138])

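seqs: Extracts the sequences from named_seqs. The execution counts show that this cell (In [5]) ran before named_seqs was re-simulated in In [9], which is why the sequences printed here differ from those in the named_seqs tables.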
np.mean((seqs == 'C') | (seqs == 'G'), axis=1): Calculates the GC content of each sequence in two steps:

1. (seqs == 'C') | (seqs == 'G'): Creates a boolean ragged array that is True wherever a base is 'C' or 'G'.

2. np.mean(..., axis=1): Averages each row of that boolean array, giving the fraction of G and C bases in each sequence.
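The same calculation can be cross-checked on individual sequences with plain NumPy. The sketch below applies the boolean-mask-then-mean idea to the first three sequences printed above; `gc_fraction` is an illustrative helper, not a BioNumPy function.

```python
import numpy as np

def gc_fraction(seq):
    """Fraction of characters in `seq` that are 'C' or 'G'."""
    chars = np.array(list(seq))
    is_gc = (chars == "C") | (chars == "G")  # boolean mask, one entry per base
    return float(np.mean(is_gc))             # mean of booleans = fraction of True

examples = ["CGTTAATTAC", "TCCTCCGGAAT", "TTGTCCTACACT"]
gc_values = [gc_fraction(s) for s in examples]
print(gc_values)
```

These values are 3/10, 6/11 and 5/12, which agree with the first three entries of gc_content_per_seq.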

## 4. Adding computed GC content back to named_seqs

In [11]:
named_seqs = named_seqs.add_fields({'gc': gc_content_per_seq}, {'gc': float})

In [12]:
named_seqs

DynamicSequenceEntry with 20 entries
                     name                 sequence                       gc
                       s0               ACCCTCTACA                      0.3
                       s1              AGACGTGGCTA       0.5454545454545454
                       s2             GAGCCCTTCGAT       0.4166666666666667
                       s3            TCGGTAGTGGATA       0.5384615384615384
                       s4           CGCGGAATTAGGGA                      0.5
                       s5          GGTCCAAACAGAGGC       0.6666666666666666
                       s6         CTTCTATCGGTCTTAA                   0.5625
                       s7        AGCAATGACGCTCGATG      0.35294117647058826
                       s8       GGAGCAACGGAACCAACA       0.4444444444444444
                       s9      AACCACTTACGAGTTACAG      0.47368421052631576

add_fields: Method to add new fields to named_seqs.
{'gc': gc_content_per_seq}: Dictionary specifying the field name (gc) and the data (gc_content_per_seq).
{'gc': float}: Specifies the data type (float) for the new field gc.

## 5. Counting motif occurrences (e.g., "AC") in sequences

In [13]:
ac_hits = match_string(seqs, "AC")
ac_hit_sums = np.sum(ac_hits, axis=1)

In [14]:
ac_hits

ragged_array([False False False False False False False False  True]
[False False False False False False False False False False]
[False False False False False False False  True False  True False]
[ True False False False False False False False False  True False False]
[False False False False False False False False False False False  True
 False]
[False False False  True False False False False False False False False
 False False]
[False False False False  True False False False False False False False
 False False False]
[False False False False False False False False False False False False
 False False False False]
[False  True False False False False False False False False False False
 False False False False False]
[False False False False False False False False False False False False
 False False False False  True False]
[False False False False False False  True False False False False False
 False False False False False False False]
[False False False False False Fal

In [16]:
ac_hit_sums

array([1, 0, 2, 2, 1, 1, 1, 0, 1, 1, 1, 1, 2, 1, 2, 2, 2, 0, 1, 1])

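match_string(seqs, "AC"): Returns a boolean ragged array that is True at each position where the motif "AC" starts. Each row is one element shorter than its sequence, because a two-letter motif cannot start at the last base.
np.sum(ac_hits, axis=1): Sums the True values in each row (axis=1), giving the number of "AC" occurrences per sequence.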
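For a single sequence, the matching step can be pictured as a sliding-window comparison. The following NumPy-only sketch is an assumption about the general idea, not BioNumPy's actual implementation; `motif_hits` is a hypothetical helper.

```python
import numpy as np

def motif_hits(seq, motif):
    """Boolean mask that is True at each index where `motif` starts in `seq`."""
    chars = np.array(list(seq))
    k = len(motif)
    n_windows = len(chars) - k + 1
    if n_windows <= 0:
        return np.zeros(0, dtype=bool)
    hits = np.ones(n_windows, dtype=bool)
    # A window matches only if every motif letter matches at its offset.
    for offset, letter in enumerate(motif):
        hits &= chars[offset:offset + n_windows] == letter
    return hits

examples = ["CGTTAATTAC", "TCCTCCGGAAT", "TTGTCCTACACT"]
masks = [motif_hits(s, "AC") for s in examples]
counts = [int(m.sum()) for m in masks]
print(counts)
```

The counts for these three sequences are 1, 0 and 2, matching the first three entries of ac_hit_sums, and the True positions match the first three rows of ac_hits.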

## 6. Adding motif counts back to named_seqs

In [17]:
named_seqs = named_seqs.add_fields({'ac_hits': ac_hit_sums}, {'ac_hits': int})

In [18]:
named_seqs

DynamicSequenceEntry with 20 entries
                     name                 sequence                       gc                  ac_hits
                       s0               ACCCTCTACA                      0.3                        1
                       s1              AGACGTGGCTA       0.5454545454545454                        0
                       s2             GAGCCCTTCGAT       0.4166666666666667                        2
                       s3            TCGGTAGTGGATA       0.5384615384615384                        2
                       s4           CGCGGAATTAGGGA                      0.5                        1
                       s5          GGTCCAAACAGAGGC       0.6666666666666666                        1
                       s6         CTTCTATCGGTCTTAA                   0.5625                        1
                       s7        AGCAATGACGCTCGATG      0.35294117647058826                        0
                       s8       GGAGCAACGGAACCAACA    

As with the GC content, this adds the motif counts (ac_hit_sums) to named_seqs as a new field named ac_hits of type int.