# 1. Oligo Sequences Generation

The first step in building a pipeline is generating oligo sequences which can be loaded into an OligoDatabase. The `OligoSequenceGenerator` class can be used for generating new oligo sequences and allows the user to either design oligos from reference genomic sequences or create randomized sequences for experimental purposes.

In this tutorial, we show how one can:

- [Generate oligos from reference genomic sequences](#generate-oligos-from-reference-genomic-sequences)

- [Create randomized sequences](#generate-random-oligo-sequences)


**Generated output:** Both methods output the generated oligo sequences in FASTA format, which can then be loaded into the `OligoDatabase` for further filtering, scoring, and optimization (See next tutorials). 
Additional information about the sequences are stored in the FASTA header, e.g. genomic coordinates, sequences type of origin etc.

## Imports and setup

In [2]:
import os

from pathlib import Path

from oligo_designer_toolsuite.sequence_generator import OligoSequenceGenerator

In [3]:
dir_output = os.path.abspath("./results")
Path(dir_output).mkdir(parents=True, exist_ok=True)

n_jobs = 3

## Generate Oligos from Reference Genomic Sequences

Using a reference genomic FASTA file, oligos are created within a user-defined length range. The `create_sequences_sliding_window()` method facilitates this process by sliding a window of defined size along the input sequences.

Key Parameters:

- `files_fasta_in`: Input FASTA file(s) containing genomic sequences.
- `length_interval_sequences`: Tuple defining the minimum and maximum lengths for generated oligos.
- `region_ids`: Specific gene or region identifiers for which oligos are generated. If set to None, oligos are generated for all regions in the input FASTA file.
- `n_jobs`: Number of parallel jobs to speed up computation.

*Note on Reference FASTA Files: These files can be generated using the  [genomic_region_generator pipeline](https://oligo-designer-toolsuite.readthedocs.io/en/latest/_pipelines/genomic_region_generator.html) with annotations from sources like NCBI or Ensembl. This allows users to customize genomic regions of interest, such as exons or introns, ensuring the designed oligos are tailored to specific experimental requirements.*

In [4]:
gene_ids = ["AARS1", "DECR2", "PRR35"]

files_fasta_oligo_database = "../data/genomic_regions/exon_annotation_source-NCBI_species-Homo_sapiens_annotation_release-110_genome_assemly-GRCh38.fna"
probe_length_min = 40
probe_length_max = 45

filename_out="random_probe_sequences"
length_sequences=30
num_sequences=5
name_sequences="random_probe"
oligo_base_probabilities = {"A": 0.45, "C": 0.05, "G": 0.3, "T": 0.2}

## Generate Random Oligo Sequences

Randomized oligo sequences are generated based on user-defined probabilities for each nucleotide base (e.g. A, C, G, T). The `create_sequences_random()` method produces random oligos with specific per-base probabilities.

Key Parameters:

- `filename_out`: Name of the output FASTA file for the generated sequences.
- `length_sequences`: Fixed length of the random oligos.
- `num_sequences`: Total number of random oligos to generate.
- `name_sequences`: Base name assigned to each sequence in the output FASTA file.
- `base_alphabet_with_probability`: Dictionary defining the per-base generation probability, e.g., *{"A": 0.45, "C": 0.05, "G": 0.3, "T": 0.2}*.

In [5]:
oligo_sequence_generator = OligoSequenceGenerator(dir_output=dir_output)

# Generated sequences from reference genomic sequences 
oligo_genomic_fasta_file = oligo_sequence_generator.create_sequences_sliding_window(
    files_fasta_in=files_fasta_oligo_database, 
    length_interval_sequences=(probe_length_min, probe_length_max), 
    region_ids=gene_ids, 
    n_jobs=n_jobs,
)

# Generate sequences at random 
oligo_random_fasta_file = oligo_sequence_generator.create_sequences_random(
    filename_out=filename_out,
    length_sequences=length_sequences, 
    num_sequences=num_sequences, 
    name_sequences=name_sequences, 
    base_alphabet_with_probability=oligo_base_probabilities, 
)