# 5. Oligoset Generation

The final step in designing oligos is to organize them into optimal sets that maximize experimental efficiency and reliability. This step evaluates individual oligos, groups them into sets of non-overlapping oligos and ranks the sets by their overall efficiency scores. The `OligosetGenerator` ensures that only the best-performing sets of oligos are selected for downstream experimental use.

## Key Objectives in Oligoset Generation

- **Scoring Individual Oligos:** Each oligo is assigned a score based on its theoretical efficiency in the experimental context. Scores are computed using a class derived from `OligoScoringBase`. A list of implemented oligo scores is available [here](https://oligo-designer-toolsuite.readthedocs.io/en/latest/_api_docs/oligo_designer_toolsuite.oligo_efficiency_filter.html).
- **Scoring Oligo Sets:** Once individual oligos are scored, they are grouped into sets of oligos based on a set generator. A scoring class derived from `SetScoringBase` evaluates the overall efficiency of each set. A list of implemented oligo scores is available [here](https://oligo-designer-toolsuite.readthedocs.io/en/latest/_api_docs/oligo_designer_toolsuite.oligo_efficiency_filter.html).
- **Generating Oligo Sets:** Oligos in each set can be selected based on their positional overlap (`OligosetGeneratorIndependentSet`) or the homogeneity of specified oligo properties (`HomogeneousPropertyOligoSetGenerator`).
- **Selection Policies:** The `OligoSelectionPolicy` classes define the strategy for selecting and optimizing non-overlapping oligo sets from a pool of candidates. These policies use greedy or graph-based approaches to navigate the large combinatorial space of possible oligo combinations. This ensures the generated sets meet experimental requirements while adhering to constraints like set size and distance between oligos.

In this tutorial we show how one can:

- [Score oligos](#scoring)

- [Generate non-everlapping sets with a graph-based approach](#generating-oligosets)

- [Get ready-to-use oligo sets](#output-structure)


## Imports and setup

In [2]:
import os

from pathlib import Path
from Bio.SeqUtils import MeltingTemp as mt

from oligo_designer_toolsuite.database import (
    OligoDatabase,
)
from oligo_designer_toolsuite.oligo_efficiency_filter import (
    LowestSetScoring,
    WeightedIsoformTmGCOligoScoring,
)

from oligo_designer_toolsuite.oligo_selection import (
    GraphBasedSelectionPolicy,
    OligosetGeneratorIndependentSet,
)

In [3]:
dir_output = os.path.abspath("./results")
Path(dir_output).mkdir(parents=True, exist_ok=True)

n_jobs = 3

### Load the database
Like in previous tutorials, we will also be working with `OligoDatabase` objects. If you don't know how they work, please check our [oligo database tutorial](2-oligo-database.ipynb). In this tutorial, we will load an existing database.

In [4]:
# Create Database object
min_oligos_per_region = 3
write_regions_with_insufficient_oligos = True
lru_db_max_in_memory=n_jobs * 2 + 2
database_name="db_oligos"

oligo_database = OligoDatabase(
    min_oligos_per_region=min_oligos_per_region, 
    write_regions_with_insufficient_oligos=write_regions_with_insufficient_oligos, 
    lru_db_max_in_memory=lru_db_max_in_memory, 
    database_name=database_name, 
    dir_output=dir_output, 
    n_jobs=n_jobs,
)

# Load Database
dir_database = os.path.abspath("./data/3_db_oligos_specificity_filter")
oligo_database.load_database(dir_database=dir_database, database_overwrite=True)

## Scoring

In [5]:
# Set parameters

oligo_Tm_opt = 50
oligo_GC_content_opt = 70
oligo_isoform_weight = 2
oligo_Tm_weight = 1
oligo_GC_weight = 1

pre_filter = False 
n_attempts = 100000
heuristic = True
heuristic_n_attempts = 100

max_graph_size = 5000
distance_between_oligos = 0

oligo_size_opt = 5
oligo_size_min = 3
n_sets = 100

oligo_GC_content_min = 40
oligo_GC_content_max = 60

oligo_Tm_min = 65 
oligo_Tm_max = 75 

Tm_parameters_oligo = {
    "check": True, #default
    "strict": True, #default
    "c_seq": None, #default
    "shift": 0, #default
    "nn_table": "DNA_NN3", # Allawi & SantaLucia (1997)
    "tmm_table": "DNA_TMM1", #default
    "imm_table": "DNA_IMM1", #default
    "de_table": "DNA_DE1", #default
    "dnac1": 50, #[nM]
    "dnac2": 0, #[nM]
    "selfcomp": False, #default
    "saltcorr": 7, # Owczarzy et al. (2008)
    "Na": 39, #[mM]
    "K": 75, #[mM]
    "Tris": 20, #[mM]
    "Mg": 10, #[mM]
    "dNTPs": 0, #[mM] default
}
Tm_parameters_oligo["nn_table"] = getattr(mt, Tm_parameters_oligo["nn_table"])
Tm_parameters_oligo["tmm_table"] = getattr(mt, Tm_parameters_oligo["tmm_table"])
Tm_parameters_oligo["imm_table"] = getattr(mt, Tm_parameters_oligo["imm_table"])
Tm_parameters_oligo["de_table"] = getattr(mt, Tm_parameters_oligo["de_table"])

Tm_chem_correction_param_oligo = {
    "DMSO": 0, #default
    "fmd": 20,
    "DMSOfactor": 0.75, #default
    "fmdfactor": 0.65, #default
    "fmdmethod": 1, #default
    "GC": None, #default
}

Tm_salt_correction_param_oligo = None # use default settings

In [6]:
# oligo scoring
oligos_scoring = WeightedIsoformTmGCOligoScoring(
    Tm_min=oligo_Tm_min,
    Tm_opt=oligo_Tm_opt,
    Tm_max=oligo_Tm_max,
    GC_content_min=oligo_GC_content_min,
    GC_content_opt=oligo_GC_content_opt,
    GC_content_max=oligo_GC_content_max,
    Tm_parameters=Tm_parameters_oligo,
    Tm_chem_correction_parameters=Tm_chem_correction_param_oligo,
    Tm_salt_correction_parameters=Tm_salt_correction_param_oligo,
    isoform_weight=oligo_isoform_weight,
    Tm_weight=oligo_Tm_weight,
    GC_weight=oligo_GC_weight,
)

# set scoring
set_scoring = LowestSetScoring(ascending=True)

## Oligo Selection Policy

The `GraphBasedSelectionPolicy` uses the scoring strategies, defined above, to iteratively select oligos that minimize the overall set score. Key features include:

**Pre-Filtering:** If `pre_filter=True`, oligos are pre-filtered before set selection, which improves performance for larger sets (e.g., oligo_size_opt >= 10) but can dramatically slow down small set selection (e.g., oligo_size_opt = 5).

**Heuristic Search:** A heuristic approach is employed to optimize set selection within a feasible runtime:

- `heuristic`: Enables or disables heuristic optimization for faster results, which might not find the best possible set.
- `heuristic_n_attempts`: Maximum number of attempts to find optimal sets.

### Generating Oligosets

Using the `OligosetGeneratorIndependentSet`, the pipeline generates non-overlapping sets of oligos. The generator uses the scoring strategies and selection policies to create optimal sets of a user-defined size.

**Set Parameters:**

- `set_size_opt`: Optimal number of oligos per set.
- `set_size_min`: Minimum number of oligos required for a set.
- `n_sets`: Number of sets to generate.

**Graph Constraints:**

- `max_graph_size`: Limits the size of the graph for feasible computation.
- `distance_between_oligos`: Ensures no overlap between selected oligos.

In [7]:
selection_policy = GraphBasedSelectionPolicy(
    set_scoring=set_scoring,
    pre_filter=pre_filter,
    n_attempts=n_attempts,
    heuristic=heuristic,
    heuristic_n_attempts=heuristic_n_attempts,
)
probeset_generator = OligosetGeneratorIndependentSet(
    selection_policy=selection_policy,
    oligos_scoring=oligos_scoring,
    set_scoring=set_scoring,
    max_oligos=max_graph_size,
    distance_between_oligos=distance_between_oligos,
)

In [None]:
oligo_database = probeset_generator.apply(
    oligo_database=oligo_database,
    sequence_type="oligo",
    set_size_opt=oligo_size_opt,
    set_size_min=oligo_size_min,
    n_sets=n_sets,
    n_jobs=n_jobs,
)

# Save Database
dir_database = "4_db_oligoset_selection"
oligo_database.save_database(dir_database=dir_database)

### Output Structure

The generated sets are saved in a pandas DataFrame with the following structure:

 oligoset_id | oligo_0  | oligo_1  | oligo_2  |  ...  | oligo_n  | set_score_1 | set_score_2 |  ...  
------------ | -------- | -------- | -------- | ----- | -------- | ----------- | ----------- | ------:
 0           | AGRN_184 | AGRN_133 | AGRN_832 |  ...  | AGRN_706 | 0.3445      | 1.2332      |  ...  

- **oligoset_id**: Identifies each oligo set.
- **oligo_0, oligo_1, ...**: Oligos in the set.
- **set_score_***: Scores representing the set's efficiency.

Applying set selection to the OligoDatabase is critical for several reasons:

- **Ensures Experimental Efficiency:** Generates sets of high-scoring oligos, ensuring effective target binding without competition.
- **Customizable and Scalable:** Users can tailor scoring strategies and selection policies to meet specific experimental needs.
- **Optimized Workflow:** Pre-filtering approaches and heuristic methods enable efficient generation of high-quality oligosets, even for large datasets.

This step finalizes the pipeline by providing optimal, ready-to-use oligosets tailored to experimental requirements. These sets can then be directly integrated into downstream experimental protocols.

In [9]:
# Show oligosets for a specific gene
oligo_database.oligosets["AARS1"]

Unnamed: 0,oligoset_id,oligo_0,oligo_1,oligo_2,oligo_3,oligo_4,set_score_worst,set_score_sum
0,0,AARS1::3736,AARS1::1245,AARS1::7449,AARS1::7047,AARS1::761,1.0677,5.3124
1,1,AARS1::3736,AARS1::1245,AARS1::8018,AARS1::7047,AARS1::761,1.0677,5.3125
2,2,AARS1::3736,AARS1::1245,AARS1::6879,AARS1::7047,AARS1::761,1.0677,5.3131
