## 4. Specificity Filters

Specificity filtering is critical in oligo design because it prevents off-target binding, a common problem in experiments that use short oligos. Off-target binding occurs when an oligo hybridizes to an unintended genomic region, which reduces the accuracy and reliability of the experiment.

Specificity filters identify these problematic oligos by aligning them against a reference genomic database with sequence alignment tools such as BLAST or Bowtie, and remove those that bind outside their intended regions. A list of implemented specificity filters can be found [here](https://oligo-designer-toolsuite.readthedocs.io/en/latest/_api_docs/oligo_designer_toolsuite.oligo_specificity_filter.html).

In this tutorial, you'll learn how to:

- [Define a reference (background) database](#define-a-reference-database)

- [Filter an oligo database by specificity](#run-specificity-filters)


## Imports and setup

In [21]:
import os

from pathlib import Path

from oligo_designer_toolsuite.database import (
    OligoDatabase,
    ReferenceDatabase,
)

from oligo_designer_toolsuite.oligo_specificity_filter import (
    BlastNFilter,
    CrossHybridizationFilter,
    ExactMatchFilter,
    RemoveByLargerRegionPolicy,
    SpecificityFilter,
)

In [22]:
dir_output = os.path.abspath("./results")
Path(dir_output).mkdir(parents=True, exist_ok=True)

n_jobs = 3

## Filtering by specificity

### Load the database
Specificity filters operate on `OligoDatabase` objects. If you are not familiar with them, see the [oligo database tutorial](2-oligo-database.ipynb). In this tutorial, we load an existing database from disk.

In [23]:
# Create the OligoDatabase object
min_oligos_per_region = 3
write_regions_with_insufficient_oligos = True
lru_db_max_in_memory = n_jobs * 2 + 2
database_name = "db_oligos"

oligo_database = OligoDatabase(
    min_oligos_per_region=min_oligos_per_region,
    write_regions_with_insufficient_oligos=write_regions_with_insufficient_oligos,
    lru_db_max_in_memory=lru_db_max_in_memory,
    database_name=database_name,
    dir_output=dir_output,
    n_jobs=n_jobs,
)

# Load the existing database from disk
dir_database = os.path.abspath("./data/2_db_oligos_property_filter")
oligo_database.load_database(dir_database=dir_database, database_overwrite=True)

### Define a reference database

A reference database is essential for specificity filtering, as it serves as the alignment target. The database is typically generated from FASTA files containing genomic sequences. Tools such as the [genomic_region_generator pipeline](https://oligo-designer-toolsuite.readthedocs.io/en/latest/_pipelines/genomic_region_generator.html) can create these FASTA files from annotation sources like NCBI or Ensembl.

*Note: A single reference database can be shared across multiple filters, but each specificity filter can also be given its own reference database, depending on the experimental needs.*

In [24]:
files_fasta_reference_database = "../data/genomic_regions/exon_annotation_source-NCBI_species-Homo_sapiens_annotation_release-110_genome_assemly-GRCh38.fna"

# Define Reference Database
reference_database = ReferenceDatabase(
    database_name="db_reference", dir_output=dir_output
)
reference_database.load_database_from_fasta(
    files_fasta=files_fasta_reference_database, database_overwrite=False
)
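
Before loading a large FASTA file, it can help to check that it parses and contains the expected number of records. The sketch below is a small standalone FASTA reader in plain Python. It is independent of `ReferenceDatabase`, and the helper name `read_fasta` and the in-memory example data are hypothetical.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical in-memory FASTA standing in for a real reference file
example = io.StringIO(">region1\nACGT\nACGT\n>region2\nGGCCNN\n")
records = list(read_fasta(example))
print(len(records), [len(seq) for _, seq in records])
```

For a real file, the same function can be called on an open file handle, for example `with open(files_fasta_reference_database) as handle: ...`.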

### Run specificity filters

Like [property filters](3-property-filters.ipynb), specificity filters are implemented as modular classes. The `SpecificityFilter` class applies a list of filters sequentially, in the order the user defines. Faster filters (e.g., exact matches) should run first so that the more computationally intensive filters (e.g., BLAST) have fewer oligos to process.
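
The effect of filter order can be illustrated with a minimal sketch in plain Python. It does not use the toolsuite: `apply_filters`, the two predicates, and the example sequences are hypothetical stand-ins, and the "expensive" check is only a placeholder for an alignment search.

```python
def apply_filters(oligos, filters):
    """Apply (name, predicate) filters in order; record how many oligos each one inspects."""
    inspected = {}
    for name, keep in filters:
        inspected[name] = len(oligos)
        oligos = [oligo for oligo in oligos if keep(oligo)]
    return oligos, inspected


# Hypothetical stand-in for a cheap exact-match lookup
known_off_targets = {"ACGTACGT", "TTTTCCCC"}


def cheap_exact_match(oligo):
    return oligo not in known_off_targets


def expensive_alignment(oligo):
    # Placeholder for an alignment search; here it simply rejects high-GC oligos
    gc_fraction = sum(base in "GC" for base in oligo) / len(oligo)
    return gc_fraction <= 0.75


oligos = ["ACGTACGT", "TTTTCCCC", "GGGGCCCC", "ATATGCAT"]

kept, inspected = apply_filters(
    oligos,
    [("exact_match", cheap_exact_match), ("alignment", expensive_alignment)],
)
print(kept, inspected)
```

With the cheap filter first, the placeholder alignment step inspects two oligos instead of four. Reversing the order yields the same surviving oligo but sends all four through the expensive step.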

In [25]:
# Specificity Filter with BlastN
specificity_blastn_search_parameters = {
    "perc_identity": 80,
    "strand": "minus",  # this parameter is fixed; if the reference is a whole genome, consider using "both"
    "word_size": 10,
    "dust": "no",
    "soft_masking": "false",
    "max_target_seqs": 10,
    "max_hsps": 1000,
}
specificity_blastn_hit_parameters = {
    "coverage": 50  # can be turned into min_alignment_length
}

# Crosshybridization Filter with BlastN
cross_hybridization_blastn_search_parameters = {
    "perc_identity": 80,
    "strand": "minus",  # this parameter is fixed
    "word_size": 10,
    "dust": "no",
    "soft_masking": "false",
    "max_target_seqs": 10,
}
cross_hybridization_blastn_hit_parameters = {
    "coverage": 80  # can be turned into min_alignment_length
}
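
The comments on `coverage` note that it can be turned into a minimum alignment length. As a rough worked example, the sketch below assumes that `coverage` is the percentage of the oligo length covered by the alignment and that the result is rounded up. Both are assumptions for illustration, and the helper `min_alignment_length` and the 30-nt oligo length are hypothetical, not the toolsuite's implementation.

```python
import math


def min_alignment_length(oligo_length: int, coverage_percent: float) -> int:
    """Smallest alignment length (nt) that reaches the given coverage of the oligo."""
    return math.ceil(oligo_length * coverage_percent / 100)


# Coverage thresholds used above, applied to a hypothetical 30-nt oligo
for coverage in (50, 80):
    print(coverage, min_alignment_length(30, coverage))
```

Under these assumptions, a 30-nt oligo needs an alignment of at least 15 nt to count as a hit at 50% coverage, and at least 24 nt at 80% coverage.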

In [None]:
# Specificity Filters
exact_matches = ExactMatchFilter(policy=RemoveByLargerRegionPolicy(), filter_name="exact_match")

cross_hybridization_aligner = BlastNFilter(
    search_parameters=cross_hybridization_blastn_search_parameters,
    hit_parameters=cross_hybridization_blastn_hit_parameters,
    filter_name="blastn_crosshybridization",
    dir_output=dir_output,
)
cross_hybridization = CrossHybridizationFilter(
    policy=RemoveByLargerRegionPolicy(),
    alignment_method=cross_hybridization_aligner,
    database_name_reference="db_reference",
    dir_output=dir_output,
)

specificity = BlastNFilter(
    search_parameters=specificity_blastn_search_parameters,
    hit_parameters=specificity_blastn_hit_parameters,
    filter_name="blastn_specificity",
    dir_output=dir_output,
)

# Filters run in list order; the fast exact-match filter comes first
filters = [exact_matches, specificity, cross_hybridization]
specificity_filter = SpecificityFilter(filters=filters)
oligo_database = specificity_filter.apply(
    sequence_type="oligo",
    oligo_database=oligo_database,
    reference_database=reference_database,
    n_jobs=n_jobs,
)

# Save Database
oligo_database.save_database("3_db_oligos_specificity_filter")
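
The cross-hybridization filter above is combined with `RemoveByLargerRegionPolicy`. The sketch below is one possible reading of that combination, written in plain Python. It assumes, from the class name only, that when two oligos from different regions can hybridize to each other, the oligo belonging to the region with more oligos is removed. It also reduces hybridization to exact reverse complementarity. The function `remove_cross_hybridizing` and the example data are hypothetical.

```python
from itertools import combinations


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def remove_cross_hybridizing(oligos_by_region):
    """Drop one oligo of each reverse-complementary pair, taking it from the larger region."""
    remaining = {region: list(oligos) for region, oligos in oligos_by_region.items()}
    for region_a, region_b in combinations(list(remaining), 2):
        for oligo in list(remaining[region_a]):
            partner = reverse_complement(oligo)
            if partner in remaining[region_b]:
                # Assumed policy: remove from the region that can better afford the loss
                if len(remaining[region_a]) >= len(remaining[region_b]):
                    remaining[region_a].remove(oligo)
                else:
                    remaining[region_b].remove(partner)
    return remaining


# Hypothetical oligo sets; "CTGTAATC" is the reverse complement of "GATTACAG"
oligos_by_region = {
    "geneA": ["AAGGCCTA", "GATTACAG", "CCGGTTAA"],
    "geneB": ["CTGTAATC", "TTTTGGGG"],
}
print(remove_cross_hybridizing(oligos_by_region))
```

Here `geneA` holds more oligos than `geneB`, so the sketch removes `GATTACAG` from `geneA` and keeps its partner in `geneB`.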

By removing oligos that bind off-target, specificity filtering leaves a set of sequences that are selective for their intended regions, which the subsequent design steps can build on.

Applying specificity filters to the `OligoDatabase` matters for several reasons:

- **Enhances Experimental Accuracy:** Removes oligos that could bind to unintended genomic regions, reducing noise and improving the reliability of experimental results.
- **Customizable Filtering:** Lets users tailor filters to the requirements of their experiment, such as alignment stringency and coverage thresholds.
- **Efficient Workflow:** Applying filters sequentially reduces computational cost, because each filter shrinks the dataset that the next one has to process.