# How to Build a Custom Oligo Design Pipeline?

The Oligo Designer Toolsuite is a collection of modules that provide basic functionalities for custom oligo design pipelines within a flexible Python framework.   
All modules share a common runtime- and memory-optimized data structure and a standardized API, which lets the user easily combine modules depending on the required processing steps.

In this notebook we will demonstrate how to use the Oligo Designer Toolsuite to develop a custom oligo design pipeline. 

## Setup

Before we can start, we first need to install the Oligo Designer Toolsuite Python package.  
A step-by-step description of the installation process is documented [here](https://oligo-designer-toolsuite.readthedocs.io/en/latest/_getting_started/installation.html).  

Once the Python package is installed, we need to load all required packages.

In [2]:
import os
import pprint

from pathlib import Path
from Bio.SeqUtils import MeltingTemp as mt

from oligo_designer_toolsuite.database import (
    OligoAttributes,
    OligoDatabase,
    ReferenceDatabase,
)
from oligo_designer_toolsuite.oligo_efficiency_filter import (
    LowestSetScoring,
    WeightedIsoformTmGCOligoScoring,
)
from oligo_designer_toolsuite.oligo_property_filter import (
    GCContentFilter,
    HardMaskedSequenceFilter,
    MeltingTemperatureNNFilter,
    PropertyFilter,
    SoftMaskedSequenceFilter,
)
from oligo_designer_toolsuite.oligo_selection import (
    GraphBasedSelectionPolicy,
    OligosetGeneratorIndependentSet,
)
from oligo_designer_toolsuite.oligo_specificity_filter import (
    BlastNFilter,
    CrossHybridizationFilter,
    ExactMatchFilter,
    RemoveByLargerRegionPolicy,
    SpecificityFilter,
)
from oligo_designer_toolsuite.sequence_generator import OligoSequenceGenerator

In [3]:
dir_output = os.path.abspath("./results")
Path(dir_output).mkdir(parents=True, exist_ok=True)

n_jobs = 3

## 1. Oligo Sequences Generation

The first step in building a pipeline is generating oligo sequences which can be loaded into an OligoDatabase. The `OligoSequenceGenerator` class can be used for generating new oligo sequences and allows the user to either design oligos from reference genomic sequences or create randomized sequences for experimental purposes.

### Generate Oligos from Reference Genomic Sequences

Using a reference genomic FASTA file, oligos are created within a user-defined length range. The `create_sequences_sliding_window()` method facilitates this process by sliding a window of defined size along the input sequences.

Key Parameters:

- `files_fasta_in`: Input FASTA file(s) containing genomic sequences.
- `length_interval_sequences`: Tuple defining the minimum and maximum lengths for generated oligos.
- `region_ids`: Specific gene or region identifiers for which oligos are generated. If set to None, oligos are generated for all regions in the input FASTA file.
- `n_jobs`: Number of parallel jobs to speed up computation.

*Note on Reference FASTA Files: These files can be generated using the  [genomic_region_generator pipeline](https://oligo-designer-toolsuite.readthedocs.io/en/latest/_pipelines/genomic_region_generator.html) with annotations from sources like NCBI or Ensembl. This allows users to customize genomic regions of interest, such as exons or introns, ensuring the designed oligos are tailored to specific experimental requirements.*
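
The sliding-window idea can be sketched in plain Python. This is only an illustration of the principle, not the toolsuite's implementation: the function name `sliding_window_oligos` and the toy sequence are hypothetical, and the real method additionally handles FASTA parsing, coordinates, and parallelization.

```python
def sliding_window_oligos(sequence, length_min, length_max):
    """Return (start, end, subsequence) for every window with a length in [length_min, length_max]."""
    oligos = []
    for length in range(length_min, length_max + 1):
        # Slide a window of the current length along the sequence, one base at a time.
        for start in range(len(sequence) - length + 1):
            oligos.append((start, start + length, sequence[start:start + length]))
    return oligos


toy_region = "ACGTACGTAC"  # hypothetical 10-nt region, far shorter than a real exon
windows = sliding_window_oligos(toy_region, length_min=4, length_max=5)
print(len(windows), windows[0])
```

A region of length `n` yields `n - length + 1` windows per oligo length, so the number of candidates grows quickly with both the region size and the width of the length interval.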

### Generate Random Oligo Sequences

Randomized oligo sequences are generated based on user-defined probabilities for each nucleotide base (e.g. A, C, G, T). The `create_sequences_random()` method produces random oligos with specific per-base probabilities.

Key Parameters:

- `filename_out`: Name of the output FASTA file for the generated sequences.
- `length_sequences`: Fixed length of the random oligos.
- `num_sequences`: Total number of random oligos to generate.
- `name_sequences`: Base name assigned to each sequence in the output FASTA file.
- `base_alphabet_with_probability`: Dictionary defining the per-base generation probability, e.g., *{"A": 0.45, "C": 0.05, "G": 0.3, "T": 0.2}*.
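
As a rough sketch of what sampling with per-base probabilities means, the standard library's `random.choices` can draw bases with weights. The helper `random_oligos` and its `seed` argument are hypothetical and only serve this illustration; they are not part of the toolsuite.

```python
import random


def random_oligos(num_sequences, length_sequences, base_probabilities, seed=0):
    """Draw random sequences in which each base is sampled with its given probability."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    bases = list(base_probabilities)
    weights = [base_probabilities[base] for base in bases]
    return [
        "".join(rng.choices(bases, weights=weights, k=length_sequences))
        for _ in range(num_sequences)
    ]


probabilities = {"A": 0.45, "C": 0.05, "G": 0.3, "T": 0.2}
random_probes = random_oligos(num_sequences=5, length_sequences=30, base_probabilities=probabilities)
for probe in random_probes:
    print(probe)
```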

### Generated Output

Both methods output the generated oligo sequences in FASTA format, which can then be loaded into the OligoDatabase for further filtering, scoring, and optimization. 
Additional information about the sequences, e.g. genomic coordinates and the type of the sequence of origin, is stored in the FASTA header.

In [4]:
gene_ids = ["AARS1", "DECR2", "PRR35"]

files_fasta_oligo_database = "../data/genomic_regions/exon_annotation_source-NCBI_species-Homo_sapiens_annotation_release-110_genome_assemly-GRCh38.fna"
probe_length_min = 40
probe_length_max = 45

filename_out="random_probe_sequences"
length_sequences=30
num_sequences=5
name_sequences="random_probe"
oligo_base_probabilities = {"A": 0.45, "C": 0.05, "G": 0.3, "T": 0.2}

In [6]:
oligo_sequence_generator = OligoSequenceGenerator(dir_output=dir_output)

# Generate sequences from reference genomic sequences
oligo_genomic_fasta_file = oligo_sequence_generator.create_sequences_sliding_window(
    files_fasta_in=files_fasta_oligo_database, 
    length_interval_sequences=(probe_length_min, probe_length_max), 
    region_ids=gene_ids, 
    n_jobs=n_jobs,
)

# Generate sequences at random 
oligo_random_fasta_file = oligo_sequence_generator.create_sequences_random(
    filename_out=filename_out,
    length_sequences=length_sequences, 
    num_sequences=num_sequences, 
    name_sequences=name_sequences, 
    base_alphabet_with_probability=oligo_base_probabilities, 
)

## 2. Setup of an OligoDatabase

The `OligoDatabase` is the core data structure for storing, organizing, and processing oligo sequences in the pipeline. It can load oligo sequences from one or more FASTA files, where each sequence is labeled with essential metadata in the header. This database provides a flexible and efficient way to manage oligos and their associated information for downstream filtering, scoring, and selection.

### Input Requirements for FASTA Files

The input FASTA file must adhere to the following structure:

**Header Format:** Each sequence must have a header starting with the **>** character.
The header contains the following fields, separated by `::`:

- ***region_id***: A unique identifier for the genomic region (e.g., gene name or ID). This field is mandatory.
- ***additional_information***: Optional metadata fields such as transcript ID or exon number, given as comma-separated `key=value` pairs.
- ***coordinates***: Optional genomic location in the format `chrom:start-end(strand)`.

**Sequence Content:** The sequence follows the header in standard FASTA format.

**Example:** 
 
*>ASR1::transcript_id=XM456,exon_number=5::16:54552-54786(+)*  
AGTTGACAGACCCCAGATTAAAGTGTGTCGCGCAACAC   

or
   
*>ASR1*  
AGTTGACAGACCCCAGATTAAAGTGTGTCGCGCAACAC
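
To make the header layout concrete, here is a minimal parser sketch for the two example headers. It assumes the fields are separated by `::` and that metadata is given as `key=value` pairs, as in the example; the function `parse_header` is hypothetical and does not reproduce the toolsuite's own parser.

```python
import re

# Coordinates of the form chrom:start-end(strand), e.g. 16:54552-54786(+)
COORDINATES = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)\((?P<strand>[+-])\)$")


def parse_header(header):
    """Split a FASTA header into region_id, additional information and coordinates."""
    fields = header.lstrip(">").strip().split("::")
    parsed = {"region_id": fields[0], "additional_information": {}, "coordinates": None}
    for field in fields[1:]:
        match = COORDINATES.match(field)
        if match:
            parsed["coordinates"] = {
                "chromosome": match["chrom"],
                "start": int(match["start"]),
                "end": int(match["end"]),
                "strand": match["strand"],
            }
        elif field:
            # Comma-separated key=value pairs
            for item in field.split(","):
                key, _, value = item.partition("=")
                parsed["additional_information"][key] = value
    return parsed


full = parse_header(">ASR1::transcript_id=XM456,exon_number=5::16:54552-54786(+)")
minimal = parse_header(">ASR1")
print(full)
print(minimal)
```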

### Creating the OligoDatabase

An OligoDatabase object is initialized with parameters that define how the oligo data will be managed and stored.

Initialization Parameters:

- `min_oligos_per_region`: Minimum number of oligos required for a region to be retained in the database. Regions with fewer oligos will be logged and excluded.
- `write_regions_with_insufficient_oligos`: Whether to log regions that do not meet the minimum oligo threshold.
- `lru_db_max_in_memory`: Determines the number of regions loaded into RAM at once, optimizing memory usage.
- `database_name`: Name assigned to the database.
- `dir_output`: Directory where the database and associated files will be saved.
- `n_jobs`: Number of parallel processes to use for database operations.


In [7]:
min_oligos_per_region = 3
write_regions_with_insufficient_oligos = True
lru_db_max_in_memory=n_jobs * 2 + 2
database_name="db_oligos"

In [8]:
oligo_database = OligoDatabase(
    min_oligos_per_region=min_oligos_per_region, 
    write_regions_with_insufficient_oligos=write_regions_with_insufficient_oligos, 
    lru_db_max_in_memory=lru_db_max_in_memory, 
    database_name=database_name, 
    dir_output=dir_output, 
    n_jobs=n_jobs,
)

### Loading Sequences into the Database

The `load_database_from_fasta()` method imports sequences into the database from a FASTA file. This method can load both:

- **Target Sequences**: Genomic sequences from which oligos will be designed.
- **Oligo Sequences**: Pre-designed or randomly generated oligos.

The `sequence_type` parameter determines whether the sequences represent target regions or oligo sequences. For target sequences, the reverse complement will be automatically generated.
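
The relationship between a target and its oligo can be checked with a few lines of plain Python. The sketch uses the `AARS1::1` target sequence that appears in the database printout of this notebook; the helper `reverse_complement` is written here for illustration and is not the toolsuite's function.

```python
# Translation table mapping each base to its complement (upper and lower case)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Complement every base, then reverse the sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


target = "GAGACACTGCTTGGCTCCTCTATGACACCTATGGGTTTCC"  # 'target' entry of AARS1::1
oligo = reverse_complement(target)
print(oligo)
```

The result equals the `oligo` entry stored for `AARS1::1`, which is consistent with the oligo being the reverse complement of its target.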

In [9]:
# Loading Target Sequences: Clears the database and loads genomic sequences as targets for oligo design.
oligo_database.load_database_from_fasta(
    files_fasta=oligo_genomic_fasta_file, 
    database_overwrite=True, 
    sequence_type="target", 
    region_ids=gene_ids, 
)

# Appending Oligo Sequences: Adds pre-generated oligos (e.g., random sequences) to the existing database without clearing it.
oligo_database.load_database_from_fasta(
    files_fasta=oligo_random_fasta_file, 
    database_overwrite=False, 
    sequence_type="oligo", 
    region_ids=None, 
)

Once the `OligoDatabase` is created, oligos are loaded from FASTA files and stored in a nested dictionary structure. This step organizes the oligos, ensuring efficient storage and enabling downstream analysis. The metadata from the FASTA headers is automatically parsed and stored as features for each oligo, enabling flexible filtering and scoring.

### Nested Dictionary Structure

The loaded oligos are stored in the `OligoDatabase` as a nested dictionary with the following structure:

``[region_id][oligo_id][oligo_features]``

- **region_id**: A unique identifier (e.g., gene name) grouping oligos that belong to the same genomic region.
- **oligo_id**: A unique identifier for each oligo within the region.
- **oligo_features**: A dictionary containing metadata such as sequence, genomic location, and other annotations.
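
The three levels can be pictured with an ordinary nested `dict`. The sketch below is a simplified stand-in for the real data structure: the `AARS1::1` entry reuses values from this notebook's printout, while the `random_probe` region and its sequences are made up for illustration.

```python
toy_database = {
    "AARS1": {
        "AARS1::1": {
            "oligo": "GGAAACCCATAGGTGTCATAGAGGAGCCAAGCAGTGTCTC",
            "chromosome": [["16"]],
            "strand": [["-"]],
        },
    },
    "random_probe": {  # hypothetical region holding two made-up oligos
        "random_probe::1": {"oligo": "AAGTAGATAGGA"},
        "random_probe::2": {"oligo": "GATAAGTAGAAT"},
    },
}

# Level 1: regions, level 2: oligos of a region, level 3: features of one oligo
oligos_per_region = {region_id: len(oligos) for region_id, oligos in toy_database.items()}
sequence = toy_database["AARS1"]["AARS1::1"]["oligo"]
print(oligos_per_region, len(sequence))
```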

In [10]:
region = list(oligo_database.database.keys())[0]
oligo_id_1 = list(oligo_database.database[region].keys())[0]
oligo_id_2 = list(oligo_database.database[region].keys())[1]

sample_oligos_DB = {region: {oligo_id_1: oligo_database.database[region][oligo_id_1], 
                           oligo_id_2: oligo_database.database[region][oligo_id_2]}}
pprint.pprint(sample_oligos_DB)

{'AARS1': {'AARS1::1': {'annotation_release': [['110']],
                        'chromosome': [['16']],
                        'end': [[70265662]],
                        'exon_number': [['10', '10']],
                        'gene_id': [['AARS1']],
                        'genome_assembly': [['GRCh38']],
                        'number_total_transcripts': [['2']],
                        'oligo': 'GGAAACCCATAGGTGTCATAGAGGAGCCAAGCAGTGTCTC',
                        'regiontype': [['exon']],
                        'source': [['NCBI']],
                        'species': [['Homo_sapiens']],
                        'start': [[70265623]],
                        'strand': [['-']],
                        'target': 'GAGACACTGCTTGGCTCCTCTATGACACCTATGGGTTTCC',
                        'transcript_id': [['NM_001605.3', 'XM_047433666.1']]},
           'AARS1::2': {'annotation_release': [['110']],
                        'chromosome': [['16']],
                        'end': [[70265661]],
    

### Pre-Filtering Oligos by Attributes

After loading the oligos, a pre-filtering step can be performed to eliminate oligos that do not meet basic criteria. This filtering uses attributes derived from the metadata in the FASTA headers or calculated using the `OligoAttributes` class.

1. **Calculating Attributes:** Additional attributes, such as isoform_consensus (the percentage of transcripts covered by the oligo for the target gene), can be computed for each oligo.
2. **Filtering by Attribute Threshold:** Oligos are filtered based on a threshold for a specific attribute. For example, a threshold of 50 retains only oligos with at least 50% isoform consensus.
3. **Removing Regions with Insufficient Oligos:** Regions with fewer than the minimum required number of oligos are removed from the database.
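
The three steps can be sketched on a toy nested dictionary. The consensus formula used here (share of a gene's transcripts covered by the oligo, in percent) follows the description above, but the exact computation inside `OligoAttributes` is an assumption; all gene, oligo, and transcript names are hypothetical.

```python
def isoform_consensus(transcript_ids, number_total_transcripts):
    """Percentage of the gene's transcripts that are covered by the oligo."""
    return 100 * len(set(transcript_ids)) / number_total_transcripts


toy_database = {
    "GENE_A": {
        "GENE_A::1": {"transcript_id": ["T1", "T2"], "number_total_transcripts": 2},
        "GENE_A::2": {"transcript_id": ["T1"], "number_total_transcripts": 2},
        "GENE_A::3": {"transcript_id": ["T1", "T2"], "number_total_transcripts": 2},
    },
    "GENE_B": {
        "GENE_B::1": {"transcript_id": ["T7"], "number_total_transcripts": 4},
        "GENE_B::2": {"transcript_id": ["T7", "T8"], "number_total_transcripts": 4},
    },
}
threshold = 50
min_oligos_per_region = 2

filtered = {}
for region_id, oligos in toy_database.items():
    # Steps 1 and 2: compute the attribute and drop oligos below the threshold
    kept = {
        oligo_id: features
        for oligo_id, features in oligos.items()
        if isoform_consensus(features["transcript_id"], features["number_total_transcripts"]) >= threshold
    }
    # Step 3: drop regions that no longer hold enough oligos
    if len(kept) >= min_oligos_per_region:
        filtered[region_id] = kept

print({region_id: sorted(oligos) for region_id, oligos in filtered.items()})
```

In this toy case `GENE_B` loses one oligo (25% consensus) and is then removed entirely because only one oligo remains.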

In [11]:
isoform_consensus = 50

In [12]:
# Calculating Attributes
oligo_attributes_calculator = OligoAttributes()
oligo_database = oligo_attributes_calculator.calculate_isoform_consensus(
    oligo_database=oligo_database
)

# Filtering by Attribute Threshold
oligo_database.filter_database_by_attribute_threshold(
    attribute_name="isoform_consensus",  # name of the attribute used for filtering
    attribute_thr=isoform_consensus,  # threshold for filtering
    remove_if_smaller_threshold=True,  # True: remove oligos whose attribute is smaller than the threshold; False: remove those greater than it
)

# Removing Regions with Insufficient Oligos
oligo_database.remove_regions_with_insufficient_oligos(pipeline_step="Pre-Filters") 

### Saving and Retrieving the Database

To preserve the state of the database at any point, the `save_database()` and `load_database()` functions can be used. This ensures that progress is not lost and enables resuming the pipeline from intermediate steps.

In [13]:
# Save Database
dir_database = oligo_database.save_database(dir_database="1_db_oligos_initial")

# Load Database
oligo_database.load_database(dir_database=dir_database, database_overwrite=True)

### Exporting the Database

The database can also be exported for analysis in other tools or formats:

- **Export as TSV:** Outputs a table of oligos and their attributes (`write_database_to_table()`)
- **Export as FASTA:** Outputs oligo sequences in FASTA format (`write_database_to_fasta()`)
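
Conceptually, both exports flatten the nested dictionary. The following sketch shows one possible layout using only the standard library; the column names, header format, and helper functions are assumptions made for illustration, and the files written by the toolsuite may be organized differently.

```python
import csv
import io

# Hypothetical two-oligo database; "GC_content" stands in for any stored attribute.
toy_database = {
    "GENE_A": {
        "GENE_A::1": {"oligo": "ACGTACGTAC", "GC_content": 50.0},
        "GENE_A::2": {"oligo": "GGGTACGTAC", "GC_content": 60.0},
    }
}


def to_fasta(database):
    """One FASTA record per oligo, with the oligo ID as header."""
    lines = []
    for oligos in database.values():
        for oligo_id, features in oligos.items():
            lines.append(f">{oligo_id}")
            lines.append(features["oligo"])
    return "\n".join(lines) + "\n"


def to_tsv(database, columns):
    """One table row per oligo, with the requested attribute columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["region_id", "oligo_id", *columns])
    for region_id, oligos in database.items():
        for oligo_id, features in oligos.items():
            writer.writerow([region_id, oligo_id, *(features[column] for column in columns)])
    return buffer.getvalue()


fasta_text = to_fasta(toy_database)
tsv_text = to_tsv(toy_database, columns=["oligo", "GC_content"])
print(fasta_text)
print(tsv_text)
```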

The setup of the OligoDatabase is critical for several reasons:

- **Centralized Data Management:** Provides a structured repository for oligos and their metadata.
- **Customizability:** Allows for filtering based on the number of oligos per region and specific target regions.
- **Scalability:** Efficiently handles large genomic datasets by managing memory usage and parallel processing.
- **Flexibility:** Supports both genomic and pre-designed oligos, enabling a wide range of experimental setups.

By defining the database structure and loading sequences with proper metadata, this step ensures that the downstream steps (e.g., filtering and scoring) are applied seamlessly and effectively.

## 3. Property Filters

Property filters are the first major step in refining the initial pool of oligos based on their intrinsic sequence properties. This step eliminates sequences that do not meet specific experimental criteria, such as GC content or melting temperature (Tm), which ensures that only the most suitable oligos are retained for subsequent analysis.

Each property filter is implemented as a class inheriting from the abstract base class `PropertyFilterBase`. This ensures all filters have a standardized `apply()` method, which takes an `OligoDatabase` object as input, applies the filter, and returns the filtered database.  
To streamline the application of multiple filters, the `PropertyFilter` wrapper class allows users to define a sequence of filters to be applied in order. Filters with lower computational cost (e.g., GC content) should be applied first to reduce the dataset size before more complex filters (e.g., Tm). A list of implemented property filters is available [here](https://oligo-designer-toolsuite.readthedocs.io/en/latest/_api_docs/oligo_designer_toolsuite.oligo_property_filter.html).


*Why Order Matters: The sequential application of filters minimizes runtime by processing smaller datasets in later, more computationally intensive steps.*

Applying property filters to the OligoDatabase is critical for several reasons:

- **Improves Experimental Suitability:** Ensures that sequences meet critical physical and chemical requirements for optimal binding and stability.
- **Reduces Computational Load:** Eliminates unsuitable sequences early, saving resources for downstream processes.
- **Modular and Extensible:** The PropertyFilterBase design makes it easy to add new filters for additional properties as needed.
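
The idea of chaining cheap filters before expensive ones can be sketched with plain predicate functions. This is not the `PropertyFilterBase` interface: the function names are hypothetical, and treating a hard-masked sequence as one containing `N` is an assumption of this sketch.

```python
def gc_content(sequence):
    """GC content in percent."""
    sequence = sequence.upper()
    return 100 * (sequence.count("G") + sequence.count("C")) / len(sequence)


def passes_hard_mask(sequence):
    # Assumption: hard-masked bases are represented as "N"
    return "N" not in sequence.upper()


def passes_gc(sequence, gc_min=40, gc_max=60):
    return gc_min <= gc_content(sequence) <= gc_max


def apply_filters(sequences, filters):
    """Apply the filters in order; each one only sees what the previous ones kept."""
    remaining = list(sequences)
    for keep in filters:
        remaining = [sequence for sequence in remaining if keep(sequence)]
    return remaining


candidates = ["ACGTACGTAC", "ACGTNNGTAC", "AAAAAAAAAT", "GCGCGCGCGC"]
kept = apply_filters(candidates, [passes_hard_mask, passes_gc])
print(kept)
```

Because each filter receives only the survivors of the previous one, placing the cheapest checks first means a costly check such as a Tm calculation runs on as few sequences as possible.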



In [14]:
oligo_GC_content_min = 40
oligo_GC_content_max = 60

oligo_Tm_min = 65 
oligo_Tm_max = 75 

Tm_parameters_oligo = {
    "check": True, #default
    "strict": True, #default
    "c_seq": None, #default
    "shift": 0, #default
    "nn_table": "DNA_NN3", # Allawi & SantaLucia (1997)
    "tmm_table": "DNA_TMM1", #default
    "imm_table": "DNA_IMM1", #default
    "de_table": "DNA_DE1", #default
    "dnac1": 50, #[nM]
    "dnac2": 0, #[nM]
    "selfcomp": False, #default
    "saltcorr": 7, # Owczarzy et al. (2008)
    "Na": 39, #[mM]
    "K": 75, #[mM]
    "Tris": 20, #[mM]
    "Mg": 10, #[mM]
    "dNTPs": 0, #[mM] default
}
Tm_parameters_oligo["nn_table"] = getattr(mt, Tm_parameters_oligo["nn_table"])
Tm_parameters_oligo["tmm_table"] = getattr(mt, Tm_parameters_oligo["tmm_table"])
Tm_parameters_oligo["imm_table"] = getattr(mt, Tm_parameters_oligo["imm_table"])
Tm_parameters_oligo["de_table"] = getattr(mt, Tm_parameters_oligo["de_table"])

Tm_chem_correction_param_oligo = {
    "DMSO": 0, #default
    "fmd": 20,
    "DMSOfactor": 0.75, #default
    "fmdfactor": 0.65, #default
    "fmdmethod": 1, #default
    "GC": None, #default
}

Tm_salt_correction_param_oligo = None # use default settings
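
To give a feeling for the chemical correction parameters, the sketch below applies the linear correction that Biopython documents for `fmdmethod=1` (Tm is lowered by `fmdfactor` per percent formamide and by `DMSOfactor` per percent DMSO). The uncorrected Tm of 83 °C is a made-up input, and how the toolsuite combines this correction with the nearest-neighbor calculation is not shown here.

```python
def chem_corrected_tm(tm, DMSO=0, fmd=0, DMSOfactor=0.75, fmdfactor=0.65):
    """Linear correction for DMSO and formamide, both given in percent."""
    return tm - DMSOfactor * DMSO - fmdfactor * fmd


uncorrected_tm = 83.0  # hypothetical nearest-neighbor Tm in degrees Celsius
corrected_tm = chem_corrected_tm(uncorrected_tm, DMSO=0, fmd=20)
print(round(corrected_tm, 2))
```

With 20% formamide the correction amounts to 0.65 × 20 = 13 °C, so this hypothetical oligo would fall inside a 65 to 75 °C window only after the correction is applied.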

In [15]:
hard_masked_sequences = HardMaskedSequenceFilter()
soft_masked_sequences = SoftMaskedSequenceFilter()
gc_content = GCContentFilter(
    GC_content_min=oligo_GC_content_min, 
    GC_content_max=oligo_GC_content_max 
)
melting_temperature = MeltingTemperatureNNFilter(
    Tm_min=oligo_Tm_min, 
    Tm_max=oligo_Tm_max, 
    Tm_parameters=Tm_parameters_oligo, 
    Tm_chem_correction_parameters=Tm_chem_correction_param_oligo, 
    Tm_salt_correction_parameters=Tm_salt_correction_param_oligo, 
)

filters = [
    hard_masked_sequences,
    soft_masked_sequences,
    gc_content,
    melting_temperature,
]

property_filter = PropertyFilter(filters=filters)

oligo_database = property_filter.apply(
    oligo_database=oligo_database,
    sequence_type="oligo",
    n_jobs=n_jobs,
)

dir_database = oligo_database.save_database(dir_database="2_db_oligos_property_filter")

## 4. Specificity Filters

Specificity filtering is critical in oligo design to prevent off-target binding, a common problem in experiments using short oligos. Off-target binding occurs when oligos hybridize to unintended genomic regions, reducing the accuracy and reliability of the experiment. Specificity filters identify and eliminate these problematic oligos by aligning them to reference genomic sequences and removing those that bind outside their intended regions. Specificity filters rely on sequence alignment methods such as BLAST or Bowtie to compare oligos against a reference genomic database.

Like property filters, specificity filters are implemented as modular classes. The `SpecificityFilter` class orchestrates the application of multiple filters sequentially, ensuring flexibility and scalability. Filters are applied in a user-defined order. Faster filters (e.g., exact matches) should be executed first to minimize computational load for subsequent, more intensive filters (e.g., BLAST).
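
The cheapest specificity check, an exact-match search, can be sketched without any alignment tool. The example below uses a deliberately simple policy that drops a shared sequence from every region; it does not reproduce the `RemoveByLargerRegionPolicy` used in this notebook, and all region names and sequences are hypothetical.

```python
from collections import defaultdict

toy_database = {
    "GENE_A": {"GENE_A::1": "ACGTACGTAC", "GENE_A::2": "TTTTGGGGCC"},
    "GENE_B": {"GENE_B::1": "ACGTACGTAC", "GENE_B::2": "CCCCAAAATT"},
}

# Map each sequence to the set of regions in which it occurs
regions_by_sequence = defaultdict(set)
for region_id, oligos in toy_database.items():
    for sequence in oligos.values():
        regions_by_sequence[sequence].add(region_id)

# Simplified policy: keep an oligo only if its sequence is unique to one region
specific = {
    region_id: {
        oligo_id: sequence
        for oligo_id, sequence in oligos.items()
        if len(regions_by_sequence[sequence]) == 1
    }
    for region_id, oligos in toy_database.items()
}
print(specific)
```

A dictionary lookup like this touches each sequence only a constant number of times, which is why it is worth running before an alignment-based search.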

In [16]:
files_fasta_reference_database = "../data/genomic_regions/exon_annotation_source-NCBI_species-Homo_sapiens_annotation_release-110_genome_assemly-GRCh38.fna"

# Specificity Filter with BlastN
specificity_blastn_search_parameters = {
  "perc_identity": 80,
  "strand": "minus", # this parameter is fixed, if reference is whole genome, consider using "both"
  "word_size": 10,
  "dust": "no",
  "soft_masking": "false",
  "max_target_seqs": 10,
  "max_hsps": 1000,
}
specificity_blastn_hit_parameters = {
  "coverage": 50 # can be turned into min_alignment_length
}

# Crosshybridization Filter with BlastN
cross_hybridization_blastn_search_parameters = {
  "perc_identity": 80,
  "strand": "minus", # this parameter is fixed
  "word_size": 10,
  "dust": "no",
  "soft_masking": "false",
  "max_target_seqs": 10,
}
cross_hybridization_blastn_hit_parameters = {
  "coverage": 80 # can be turned into min_alignment_length
}
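
The comment on `coverage` can be illustrated with a short calculation. The sketch assumes that `coverage` is the minimum percentage of the oligo length that an alignment must span to count as a hit; this interpretation and both helper functions are assumptions, not the toolsuite's code.

```python
import math


def min_alignment_length(oligo_length, coverage):
    """Smallest alignment length (in nt) that reaches the given coverage in percent."""
    return math.ceil(oligo_length * coverage / 100)


def counts_as_hit(alignment_length, oligo_length, coverage):
    return alignment_length >= min_alignment_length(oligo_length, coverage)


print(min_alignment_length(40, 50))  # 40-nt oligo, 50% coverage
print(min_alignment_length(45, 80))  # 45-nt oligo, 80% coverage
print(counts_as_hit(alignment_length=18, oligo_length=40, coverage=50))
```

Under this reading, a relative threshold adapts to oligos of different lengths, whereas a fixed `min_alignment_length` applies the same absolute cutoff to all of them.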

A reference database is essential for specificity filtering, as it serves as the alignment target. The database is typically generated from FASTA files containing genomic sequences. Tools such as the [genomic_region_generator pipeline](https://oligo-designer-toolsuite.readthedocs.io/en/latest/_pipelines/genomic_region_generator.html) can create these FASTA files from annotation sources like NCBI or Ensembl.

*Note: A single reference database can be used for multiple filters but different specificity filters can also be applied with different reference databases, offering flexibility for various experimental needs.*

In [17]:
# Define Reference Database
reference_database = ReferenceDatabase(
    database_name="db_reference", dir_output=dir_output
)
reference_database.load_database_from_fasta(
    files_fasta=files_fasta_reference_database, database_overwrite=False
)

In [18]:
# Specificity Filters
exact_matches = ExactMatchFilter(policy=RemoveByLargerRegionPolicy(), filter_name="exact_match")

cross_hybridization_aligner = BlastNFilter(
    search_parameters=cross_hybridization_blastn_search_parameters,
    hit_parameters=cross_hybridization_blastn_hit_parameters,
    filter_name="blastn_crosshybridization",
    dir_output=dir_output,
)
cross_hybridization = CrossHybridizationFilter(
    policy=RemoveByLargerRegionPolicy(),
    alignment_method=cross_hybridization_aligner,
    database_name_reference="db_reference",
    dir_output=dir_output,
)

specificity = BlastNFilter(
    search_parameters=specificity_blastn_search_parameters,
    hit_parameters=specificity_blastn_hit_parameters,
    filter_name="blastn_specificity",
    dir_output=dir_output,
)

filters = [exact_matches, specificity, cross_hybridization]
specificity_filter = SpecificityFilter(filters=filters)
oligo_database = specificity_filter.apply(
    sequence_type="oligo",
    oligo_database=oligo_database,
    reference_database=reference_database,
    n_jobs=n_jobs,
)

By eliminating off-target binding oligos, specificity filtering ensures that the remaining sequences are highly selective for their intended regions, laying the groundwork for effective experimental designs.

Applying specificity filters to the OligoDatabase is critical for several reasons:

- **Enhances Experimental Accuracy:** Removes oligos that could bind to unintended genomic regions, reducing noise and improving the reliability of experimental results.
- **Customizable Filtering:** Allows users to tailor filters to the specific requirements of their experiment, such as alignment stringency and coverage thresholds.
- **Efficient Workflow:** Sequential application of filters minimizes computational costs by progressively reducing the size of the dataset.

## 5. Oligoset Generation

The final step in designing oligos is to organize them into optimal sets that maximize experimental efficiency and reliability. This step evaluates individual oligos, groups them into sets of non-overlapping oligos and ranks the sets by their overall efficiency scores. The `OligosetGenerator` ensures that only the best-performing sets of oligos are selected for downstream experimental use.

### Key Objectives in Oligoset Generation

- **Scoring Individual Oligos:** Each oligo is assigned a score based on its theoretical efficiency in the experimental context. Scores are computed using a class derived from `OligoScoringBase`. A list of implemented oligo scores is available [here](https://oligo-designer-toolsuite.readthedocs.io/en/latest/_api_docs/oligo_designer_toolsuite.oligo_efficiency_filter.html).
- **Scoring Oligo Sets:** Once individual oligos are scored, a set generator groups them into sets. A scoring class derived from `SetScoringBase` evaluates the overall efficiency of each set. A list of implemented set scores is available [here](https://oligo-designer-toolsuite.readthedocs.io/en/latest/_api_docs/oligo_designer_toolsuite.oligo_efficiency_filter.html).
- **Generating Oligo Sets:** Oligos in each set can be selected based on their positional overlap (`OligosetGeneratorIndependentSet`) or the homogeneity of specified oligo properties (`HomogeneousPropertyOligoSetGenerator`).
- **Selection Policies:** The `OligoSelectionPolicy` classes define the strategy for selecting and optimizing non-overlapping oligo sets from a pool of candidates. These policies use greedy or graph-based approaches to navigate the large combinatorial space of possible oligo combinations. This ensures the generated sets meet experimental requirements while adhering to constraints like set size and distance between oligos.
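
One plausible way to combine Tm, GC content, and isoform consensus into a single score is a weighted sum of normalized distances from the optimum, where lower is better. The formula below is an assumption made for illustration and is not taken from `WeightedIsoformTmGCOligoScoring`; the bounds and optimum values are likewise illustrative.

```python
def distance_from_optimum(value, v_min, v_opt, v_max):
    """Normalized distance: 0 at the optimum, 1 at the lower or upper bound."""
    if value >= v_opt:
        return (value - v_opt) / (v_max - v_opt)
    return (v_opt - value) / (v_opt - v_min)


def oligo_score(tm, gc, isoform_consensus, tm_weight=1, gc_weight=1, isoform_weight=2):
    """Weighted sum of penalties; a perfect oligo scores 0."""
    tm_term = distance_from_optimum(tm, v_min=65, v_opt=70, v_max=75)
    gc_term = distance_from_optimum(gc, v_min=40, v_opt=50, v_max=60)
    isoform_term = 1 - isoform_consensus / 100  # full consensus gives no penalty
    return tm_weight * tm_term + gc_weight * gc_term + isoform_weight * isoform_term


ideal = oligo_score(tm=70, gc=50, isoform_consensus=100)
average = oligo_score(tm=72.5, gc=45, isoform_consensus=50)
print(ideal, average)
```

The weights control the trade-off: with `isoform_weight=2`, losing half of the isoform coverage costs as much as being halfway to the bound in both Tm and GC content.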

In [19]:
oligo_Tm_opt = 70
oligo_GC_content_opt = 50
oligo_isoform_weight = 2
oligo_Tm_weight = 1
oligo_GC_weight = 1

pre_filter = False 
n_attempts = 100000
heuristic = True
heuristic_n_attempts = 100

max_graph_size = 5000
distance_between_oligos = 0

oligo_size_opt = 5
oligo_size_min = 3
n_sets = 100

In [20]:
oligos_scoring = WeightedIsoformTmGCOligoScoring(
    Tm_min=oligo_Tm_min,
    Tm_opt=oligo_Tm_opt,
    Tm_max=oligo_Tm_max,
    GC_content_min=oligo_GC_content_min,
    GC_content_opt=oligo_GC_content_opt,
    GC_content_max=oligo_GC_content_max,
    Tm_parameters=Tm_parameters_oligo,
    Tm_chem_correction_parameters=Tm_chem_correction_param_oligo,
    Tm_salt_correction_parameters=Tm_salt_correction_param_oligo,
    isoform_weight=oligo_isoform_weight,
    Tm_weight=oligo_Tm_weight,
    GC_weight=oligo_GC_weight,
)
set_scoring = LowestSetScoring(ascending=True)

### Oligo Selection Policy

The `GraphBasedSelectionPolicy` uses the scoring strategies to iteratively select oligos that minimize the overall set score. Key features include:

**Pre-Filtering:** If `pre_filter=True`, oligos are pre-filtered before set selection, which improves performance for larger sets (e.g., oligo_size_opt = 50) but can slow down small set selection (e.g., oligo_size_opt = 5).

**Heuristic Search:** A heuristic approach is employed to optimize set selection within a feasible runtime:

- `heuristic`: Enables or disables heuristic optimization for faster results, which might not find the best possible set.
- `heuristic_n_attempts`: Maximum number of attempts to find optimal sets.

### Generating Oligosets

Using the `OligosetGeneratorIndependentSet`, the pipeline generates non-overlapping sets of oligos. The generator uses the scoring strategies and selection policies to create optimal sets of a user-defined size.

**Set Parameters:**

- `set_size_opt`: Optimal number of oligos per set.
- `set_size_min`: Minimum number of oligos required for a set.
- `n_sets`: Number of sets to generate.

**Graph Constraints:**

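- `max_graph_size`: Limits the size of the graph for feasible computation (passed to the generator as `max_oligos`).
- `distance_between_oligos`: Minimum distance between selected oligos; the value 0 used here only requires that the oligos do not overlap.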
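
The selection problem can be illustrated with a greedy stand-in: sort the oligos by score and add each one that keeps the required distance to all oligos already chosen. This is a deliberately simple sketch and not the graph-based policy used in this notebook; the coordinates, scores, and the choice of the worst oligo score as set score are assumptions.

```python
def too_close(a, b, distance_between_oligos=0):
    """True if two oligos (inclusive start/end) overlap or lie closer than the required distance."""
    return (
        a["start"] <= b["end"] + distance_between_oligos
        and b["start"] <= a["end"] + distance_between_oligos
    )


def greedy_set(oligos, set_size_opt, set_size_min, distance_between_oligos=0):
    """Pick the lowest-scoring oligos that are compatible with each other."""
    selected = []
    for candidate in sorted(oligos, key=lambda oligo: oligo["score"]):
        if all(not too_close(candidate, chosen, distance_between_oligos) for chosen in selected):
            selected.append(candidate)
        if len(selected) == set_size_opt:
            break
    # A set smaller than set_size_min is not reported
    return selected if len(selected) >= set_size_min else None


candidates = [
    {"id": "o1", "start": 1, "end": 40, "score": 0.2},
    {"id": "o2", "start": 20, "end": 59, "score": 0.1},
    {"id": "o3", "start": 60, "end": 99, "score": 0.3},
    {"id": "o4", "start": 41, "end": 80, "score": 0.5},
    {"id": "o5", "start": 100, "end": 139, "score": 0.4},
]
best_set = greedy_set(candidates, set_size_opt=3, set_size_min=2)
set_ids = [oligo["id"] for oligo in best_set]
set_score = max(oligo["score"] for oligo in best_set)  # assumed set score: worst oligo in the set
print(set_ids, set_score)
```

A greedy pass returns a single set and can miss better combinations, which is one reason a graph-based search over compatible oligos is attractive when several ranked sets are requested.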

In [21]:
selection_policy = GraphBasedSelectionPolicy(
    set_scoring=set_scoring,
    pre_filter=pre_filter,
    n_attempts=n_attempts,
    heuristic=heuristic,
    heuristic_n_attempts=heuristic_n_attempts,
)
probeset_generator = OligosetGeneratorIndependentSet(
    selection_policy=selection_policy,
    oligos_scoring=oligos_scoring,
    set_scoring=set_scoring,
    max_oligos=max_graph_size,
    distance_between_oligos=distance_between_oligos,
)

In [22]:
oligo_database = probeset_generator.apply(
    oligo_database=oligo_database,
    sequence_type="oligo",
    set_size_opt=oligo_size_opt,
    set_size_min=oligo_size_min,
    n_sets=n_sets,
    n_jobs=n_jobs,
)

### Output Structure

The generated sets are saved in a pandas DataFrame with the following structure:

 oligoset_id | oligo_0  | oligo_1  | oligo_2  |  ...  | oligo_n  | set_score_1 | set_score_2 |  ...  
------------ | -------- | -------- | -------- | ----- | -------- | ----------- | ----------- | ------:
 0           | AGRN_184 | AGRN_133 | AGRN_832 |  ...  | AGRN_706 | 0.3445      | 1.2332      |  ...  

- **oligoset_id**: Identifies each oligo set.
- **oligo_0, oligo_1, ...**: Oligos in the set.
- **set_score_1, set_score_2, ...**: Scores representing the set's efficiency.

Applying set selection to the OligoDatabase is critical for several reasons:

- **Ensures Experimental Efficiency:** Generates sets of high-scoring oligos, ensuring effective target binding without competition.
- **Customizable and Scalable:** Users can tailor scoring strategies and selection policies to meet specific experimental needs.
- **Optimized Workflow:** Pre-filtering approaches and heuristic methods enable efficient generation of high-quality oligosets, even for large datasets.

This step finalizes the pipeline by providing optimal, ready-to-use oligosets tailored to experimental requirements. These sets can then be directly integrated into downstream experimental protocols.