# Padlock Probe Designer

This notebook implements a pipeline to design Padlock probes (short oligos) using the `oligo_designer_toolsuite` package.

In [1]:
# import all the necessary packages except for the oligo_designer_toolsuite
import os
import sys
import warnings
sys.path.append(os.path.dirname(os.getcwd()))

import yaml
from Bio.SeqUtils import MeltingTemp as mt


## Define the parameters

First, we need to define the parameters we want to use to generate the Padlock probes. 
A flexible and reusable way to define all the parameters is to use a configuration file. 
For this tutorial we will use the YAML file `padlock_probe_designer_custom.yaml`, which uses a custom gene annotation (GTF file; GFF3 is not supported) and genome sequence (fasta file). As an example, we use the human gene annotation and genome sequence of chromosome 16. Check out the config file to understand which parameters are required and how the configuration file is structured.
If you want to use NCBI or Ensembl gene annotation and genome sequence with automatic file download from their servers, please check out the YAML files `padlock_probe_designer_ncbi.yaml` and `padlock_probe_designer_ensembl.yaml`.

Once the configuration file has been set up we have to read its content:

In [2]:
config_file = "./configs/padlock_probe_designer_custom.yaml"
# config_file = "./configs/padlock_probe_designer_ncbi.yaml" # NCBI config
# config_file = "./configs/padlock_probe_designer_ensembl.yaml" # Ensembl config
with open(config_file, 'r') as yaml_file:
    config = yaml.safe_load(yaml_file)
dir_output = os.path.join(os.path.dirname(os.getcwd()), config["dir_output"]) # create the complete path for the output directory
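
Before building the pipeline, it can help to confirm that the loaded configuration contains the keys the following cells read. The sketch below is only an illustration and not part of `oligo_designer_toolsuite`: `missing_config_keys` and `REQUIRED_KEYS` are hypothetical names, and the list covers just a few of the keys used in this notebook.

```python
# Hypothetical helper, not part of oligo_designer_toolsuite.
# The list is a partial selection of keys read later in this notebook.
REQUIRED_KEYS = [
    "dir_output",
    "source",
    "oligo_length_min",
    "oligo_length_max",
    "min_oligos_per_gene",
    "file_genes",
    "write_intermediate_steps",
]


def missing_config_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent from the config dictionary."""
    return [key for key in required if key not in config]


# Toy configuration used only to demonstrate the check.
example_config = {"dir_output": "output", "source": "custom", "oligo_length_min": 38}
print(missing_config_keys(example_config))
```

Calling `missing_config_keys(config)` on the dictionary loaded above and stopping if the result is non-empty gives an early, readable error instead of a `KeyError` halfway through the pipeline.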

## Generate Genomic Regions for the Oligo and Reference Database

Now we can start to build the pipeline. We start by generating the fasta file used as the basis for the oligo database and as the reference database for the alignment methods. *Note: you can use different fasta files to create the oligo and reference database.* From the provided genome sequence (fasta file) and gene annotation (GTF file), specific regions can be extracted. In particular, it is possible to use: 

- the whole genome
- the transcriptome
- the coding sequence (CDS)

To create specific regions, we need a `CustomGenomicRegionGenerator`, or a class that inherits from it (e.g. `NcbiGenomicRegionGenerator` and `EnsemblGenomicRegionGenerator`), and call the method `generate_transcript_reduced_representation()` to extract the transcriptome in a reduced representation form from the given files. These classes differ in how the fasta and GTF (GFF3 not supported) files are obtained. The first one uses local files, while the others download them from the NCBI or Ensembl FTP server, respectively. *Note: the GTF file must contain coordinates as well as transcript ID and exon number information to generate a transcriptome or coding sequence.* In the output fasta file, the header of each sequence starts with '>' and contains the following information: 
region_id, additional_information and coordinates (chrom, start, end, strand)

Output format (per sequence):  
`>region_id::additional information::chromosome:start-end(strand)`  
`sequence`

Example:  
`>ASR1::transcript_id=XM456,exon_number=5::16:54552-54786(+)`  
`AGTTGACAGACCCCAGATTAAAGTGTGTCGCGCAACAC` 

To use NCBI or Ensembl files, uncomment the respective config file in the cell above.

In [3]:
from oligo_designer_toolsuite.database import CustomGenomicRegionGenerator, NcbiGenomicRegionGenerator, EnsemblGenomicRegionGenerator

# If the custom config file is selected
if config["source"] == "custom":
    region_generator = CustomGenomicRegionGenerator(
        annotation_file=config["file_annotation"], 
        sequence_file=config["file_sequence"], 
        files_source=config["files_source"], 
        species=config["species"], 
        annotation_release=config["annotation_release"], 
        genome_assembly=config["genome_assembly"],
        dir_output=dir_output
    )
# If the NCBI config file is selected
elif config["source"] == "ncbi":
    region_generator = NcbiGenomicRegionGenerator(
        taxon=config["taxon"],
        species=config["species"], 
        annotation_release=config["annotation_release"], 
        dir_output=dir_output
    )
# If the Ensembl config file is selected
elif config["source"] == "ensembl":
    region_generator = EnsemblGenomicRegionGenerator(
        species=config["species"], 
        annotation_release=config["annotation_release"], 
        dir_output=dir_output
    )
else:
    raise ValueError(f'Unknown source: {config["source"]!r}')

# the cells below refer to the generator as `region_generator_custom`, whichever source is selected
region_generator_custom = region_generator

file_transcriptome = region_generator.generate_transcript_reduced_representation(include_exon_junctions=True, exon_junction_size=2 * config["oligo_length_max"])

## Oligo sequences generation


Next, we generate all possible oligos whose length lies between the given minimum and maximum values and which belong to the genes defined in the config file. For this we initialize the class ``OligoDatabase`` and use the method ``create_database()``.  
The ``OligoDatabase`` requires a fasta file as input. This fasta file can be created using a ``GenomicRegionGenerator`` (see code in cell above) or a custom fasta file can be provided. The input fasta file needs a header for each sequence, which must start with '>' and contain the following information: 
region_id, additional_information and coordinates (chrom, start, end, strand), where the region_id is compulsory and the other fields are optional.

Input Format (per sequence):  
`>region_id::additional information::chromosome:start-end(strand)`  
`sequence`

Example:  
`>ASR1::transcript_id=XM456,exon_number=5::16:54552-54786(+)`  
`AGTTGACAGACCCCAGATTAAAGTGTGTCGCGCAACAC`  
or   
`>ASR1`  
`AGTTGACAGACCCCAGATTAAAGTGTGTCGCGCAACAC`
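
To make the header layout concrete, here is a minimal sketch of a parser for it. `parse_header` and `COORDINATES` are hypothetical names introduced for illustration; the toolsuite has its own parsing code, which may differ in details.

```python
import re

# Hypothetical parser for the header layout described above.
# Coordinates are expected as chromosome:start-end(strand).
COORDINATES = re.compile(
    r"^(?P<chromosome>[^:]+):(?P<start>\d+)-(?P<end>\d+)\((?P<strand>[+-])\)$"
)


def parse_header(header):
    """Split '>region_id::additional information::chromosome:start-end(strand)'."""
    fields = header.lstrip(">").strip().split("::")
    parsed = {"region_id": fields[0], "additional_information": None, "coordinates": None}
    if len(fields) > 1:
        parsed["additional_information"] = fields[1]
    if len(fields) > 2:
        match = COORDINATES.match(fields[2])
        if match is None:
            raise ValueError(f"Unrecognized coordinates: {fields[2]!r}")
        parsed["coordinates"] = {
            "chromosome": match["chromosome"],
            "start": int(match["start"]),
            "end": int(match["end"]),
            "strand": match["strand"],
        }
    return parsed


print(parse_header(">ASR1::transcript_id=XM456,exon_number=5::16:54552-54786(+)"))
print(parse_header(">ASR1"))
```

Both header variants are accepted: the optional fields are simply left as `None` when only the region_id is given.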

The generated probes will be saved in a nested dictionary with the following structure: 

``[gene][oligo_id][oligo_features]``

*Note: if you already have a stored database you can load it into an OligoDatabase object by using the ``load_database()`` method.*
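
The idea of generating every oligo between a minimum and maximum length can be sketched as a sliding window over one sequence. This is a simplified illustration, not the implementation of ``create_database()``, which also tracks coordinates, strands and duplicates; `enumerate_oligos` is a hypothetical name.

```python
def enumerate_oligos(sequence, length_min, length_max):
    """Yield (start, end, subsequence) for every window of an allowed length."""
    for length in range(length_min, length_max + 1):
        # range() is empty when the sequence is shorter than the window
        for start in range(len(sequence) - length + 1):
            yield start, start + length, sequence[start:start + length]


toy_sequence = "AGTTGACAGA"  # 10 nt toy sequence, not real data
windows = list(enumerate_oligos(toy_sequence, length_min=8, length_max=9))
for start, end, subsequence in windows:
    print(start, end, subsequence)
```

For a sequence of length n, each allowed length k contributes n - k + 1 windows, so the number of candidates grows quickly with the length range and with the number of genes.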

In [4]:
from oligo_designer_toolsuite.database import OligoDatabase

# define the database class
oligo_database = OligoDatabase(
    file_fasta = file_transcriptome,
    oligo_length_min = config["oligo_length_min"],
    oligo_length_max = config["oligo_length_max"],
    min_oligos_per_region = config["min_oligos_per_gene"],
    files_source = region_generator_custom.files_source,
    species = region_generator_custom.species,
    annotation_release = region_generator_custom.annotation_release,
    genome_assembly = region_generator_custom.genome_assembly,
    n_jobs = 2,
    dir_output=dir_output
)

# read the genes file
if config["file_genes"] is None:
    warnings.warn(
        "No file containing the genes was provided, so all genes are used to generate the probes. This choice can use a lot of resources."
    )
    genes = None
else:
    with open(config["file_genes"]) as handle:
        lines = handle.readlines()
        genes = [line.rstrip() for line in lines]
        
# generate the oligo sequences from gene transcripts
oligo_database.create_database(region_ids=genes) 

# alternative: load database from file
# oligo_database.load_database(file_database)

### Dictionary structure

Here is an example of how the nested dictionary is structured for one oligo.

In [5]:
gene = list(oligo_database.database.keys())[0]
oligo_id = list(oligo_database.database[gene].keys())[0]

sample_oligos_DB = {gene: {oligo_id: oligo_database.database[gene][oligo_id]}}
print(sample_oligos_DB)

{'AARS1': {'AARS1_16:70265624-70265662(-)': {'sequence': Seq('GAGACACTGCTTGGCTCCTCTATGACACCTATGGGTTT'), 'chromosome': '16', 'start': [70265624], 'end': [70265662], 'strand': '-', 'length': 38, 'additional_information_fasta': ['transcript_id=NM_001605.3,exon_number=10;transcript_id=XM_047433666.1,exon_number=10']}}}
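
The same three-level layout can be traversed with ordinary dictionary loops. The sketch below uses a toy dictionary with invented gene names, IDs and sequences (plain strings instead of `Seq` objects) rather than the real database.

```python
# Toy database mimicking the [gene][oligo_id][oligo_features] layout.
# Gene names, IDs and sequences are invented for illustration.
toy_database = {
    "GENE_A": {
        "GENE_A_1": {"sequence": "ACGTACGT", "length": 8},
        "GENE_A_2": {"sequence": "GGGTACGTA", "length": 9},
    },
    "GENE_B": {
        "GENE_B_1": {"sequence": "TTTTACGT", "length": 8},
    },
}

# Number of oligos per gene
oligos_per_gene = {gene: len(oligos) for gene, oligos in toy_database.items()}
print(oligos_per_gene)

# Visit every oligo and read one of its features
for gene, oligos in toy_database.items():
    for oligo_id, features in oligos.items():
        print(gene, oligo_id, features["length"])
```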


### Read and write

The ``OligoDatabase`` class handles everything related to the management of the database. In particular, beyond creating the database, it can also read and write the oligo sequences in **tsv** format. The methods `load_database()` and `write_database()` serve exactly this purpose. It is also possible to write the sequences as a fasta file with the method ``write_fasta_from_database()``.

This allows us to save the current state of the database during the pipeline and to retrieve it from a previous stage if an error occurred.

In [6]:
if config["write_intermediate_steps"]:
    file_database = oligo_database.write_database(filename="oligo_database_initial.txt")

## Property filters

Once all the possible sequences are created, we apply a first filtering step based on the sequence properties (e.g. melting temperature or GC content). This is useful to reduce the amount of sequences we have to deal with in the next stages and discard all the sequences that are not suited for the experiment.

Each property filter is a class that inherits from the abstract base class `PropertyFilterBase`. Each filter has a method called `apply` that takes the `OligoDatabase.database` dictionary and returns it filtered. To make this process smooth and modular, the wrapper class `PropertyFilter` applies several filters one after the other. It takes a list of filter objects and an `OligoDatabase` object, applies all the filters sequentially, and returns the final filtered version of the database. Additionally, all the necessary sequence features computed by the filters are stored in the `OligoDatabase.database` for possible later use. 

*Note: the filters are applied in the order they are given as input. Hence, filters with fast computations should be listed first, e.g. apply the GC content filter before the melting temperature filter, to reduce runtime.*

To create new property filters, follow the abstract base class requirements in `PropertyFilterBase`.
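
The pattern described above, a common base class plus a wrapper that applies filters in order, can be sketched with toy classes. All names here (`ToyFilterBase`, `ToyGCContent`, `ToyNoHomopolymer`, `apply_filters`) are hypothetical, and the sketch is simplified: each toy `apply` judges a single sequence, whereas the real filters operate on the database dictionary.

```python
from abc import ABC, abstractmethod


class ToyFilterBase(ABC):
    """Hypothetical stand-in for a property filter base class."""

    @abstractmethod
    def apply(self, sequence):
        """Return True if the sequence passes the filter."""


class ToyGCContent(ToyFilterBase):
    def __init__(self, gc_min, gc_max):
        self.gc_min = gc_min
        self.gc_max = gc_max

    def apply(self, sequence):
        # GC content in percent
        gc = 100 * sum(base in "GC" for base in sequence.upper()) / len(sequence)
        return self.gc_min <= gc <= self.gc_max


class ToyNoHomopolymer(ToyFilterBase):
    def __init__(self, max_run):
        self.max_run = max_run

    def apply(self, sequence):
        # Reject sequences with a run of one base longer than max_run
        return not any(base * (self.max_run + 1) in sequence.upper() for base in "ACGT")


def apply_filters(database, filters):
    """Keep only oligos that pass every filter, checked in the given order."""
    return {
        gene: {
            oligo_id: features
            for oligo_id, features in oligos.items()
            # all() stops at the first failing filter, so cheap filters go first
            if all(f.apply(features["sequence"]) for f in filters)
        }
        for gene, oligos in database.items()
    }


toy_database = {
    "GENE_A": {
        "a1": {"sequence": "ACGTGCAT"},
        "a2": {"sequence": "AAAAAAAT"},
        "a3": {"sequence": "GGGGGCGC"},
    }
}
result = apply_filters(toy_database, [ToyGCContent(40, 60), ToyNoHomopolymer(max_run=4)])
print(result)
```

Because `all()` short-circuits, an oligo rejected by the first filter never reaches the second one, which is the reason for listing inexpensive filters first.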

In [7]:
from oligo_designer_toolsuite.oligo_property_filter import (
    PropertyFilter,
    MaskedSequences,
    GCContent, 
    MeltingTemperatureNN, 
    PadlockArms
)

# the melting temperature params need to be preprocessed
Tm_params = config["Tm_parameters"]["shared"].copy()
Tm_params.update(config["Tm_parameters"]["property_filter"])
Tm_params["nn_table"] = getattr(mt, Tm_params["nn_table"])
Tm_params["tmm_table"] = getattr(mt, Tm_params["tmm_table"])
Tm_params["imm_table"] = getattr(mt, Tm_params["imm_table"])
Tm_params["de_table"] = getattr(mt, Tm_params["de_table"])

Tm_chem_correction_param = config["Tm_chem_correction_param"]["shared"].copy()
Tm_chem_correction_param.update(config["Tm_chem_correction_param"]["property_filter"])

# initialize the filter classes
masked_sequences = MaskedSequences()
gc_content = GCContent(GC_content_min=config["GC_content_min"], GC_content_max=config["GC_content_max"])
melting_temperature = MeltingTemperatureNN(
    Tm_min=config["Tm_min"], 
    Tm_max=config["Tm_max"], 
    Tm_parameters=Tm_params, 
    Tm_chem_correction_parameters=Tm_chem_correction_param
)
padlock_arms = PadlockArms(
    min_arm_length=config["min_arm_length"],
    max_arm_Tm_dif=config["max_arm_Tm_dif"],
    arm_Tm_min=config["arm_Tm_min"],
    arm_Tm_max=config["arm_Tm_max"],
    Tm_parameters=Tm_params,
    Tm_chem_correction_parameters=Tm_chem_correction_param,
)
# create the list of filters
filters = [masked_sequences, gc_content, melting_temperature, padlock_arms]

# initialize the property filter class
property_filter = PropertyFilter(filters=filters, write_regions_with_insufficient_oligos=config["write_removed_genes"])
# filter the database
oligo_database = property_filter.apply(oligo_database=oligo_database, n_jobs=config["n_jobs"])
# write the intermediate result in a file
if config["write_intermediate_steps"]:
    file_database = oligo_database.write_database(filename="oligo_database_property_filter.txt")

## Specificity filters

In experiments using short oligos, one of the main problems is off-target binding of the designed oligo sequences to undesired regions. To avoid this problem, we can remove all the oligos that also align to regions outside the gene they belong to.

The classes in the subpackage `oligo_specificity_filter` detect these oligos using alignment methods such as Blast and Bowtie and remove them from the database. The currently implemented classes are: `ExactMatches`, `Blastn`, `Bowtie`, `Bowtie2`, `BowtieSeedRegion`. Look at the documentation for detailed information. These filters are structured in the same way as the property filters. A second class, `SpecificityFilter`, takes a list of all the filters we want to apply and applies them sequentially to the `OligoDatabase.database`.  
*Note: the filters are applied in the order they are given as input. Hence, filters with fast computations should be listed first, e.g. apply the exact match filter before the Blastn filter, to reduce runtime.*

In addition, alignment methods need a reference fasta file to detect the off-target regions. The `CustomGenomicRegionGenerator` class can generate this reference file, as shown previously. Once the fasta file has been created, the class `ReferenceDatabase` stores the path and additional information, and can extract the regions we are interested in from the fasta file.

*Note: it is possible to apply a set of specificity filters with a reference file and a second set with a different reference file.*

For our pipeline, we will use `ExactMatches`, `Blastn`, `BowtieSeedRegion`. For the `BowtieSeedRegion` filter we need to generate the oligo seed region with `LigationRegionCreation` (look at the documentation to understand what seed region means). 
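
As an intuition for the cheapest of these filters, the sketch below removes every oligo whose exact sequence occurs in more than one gene. It is a simplified illustration on a toy dictionary, assuming this is the kind of check an exact-match filter performs; it is not the `ExactMatches` implementation, and `remove_cross_gene_duplicates` is a hypothetical name.

```python
from collections import defaultdict


def remove_cross_gene_duplicates(database):
    """Drop every oligo whose sequence occurs in more than one gene."""
    # First pass: record which genes each sequence appears in
    genes_by_sequence = defaultdict(set)
    for gene, oligos in database.items():
        for features in oligos.values():
            genes_by_sequence[features["sequence"]].add(gene)

    # Second pass: keep only sequences that are unique to a single gene
    return {
        gene: {
            oligo_id: features
            for oligo_id, features in oligos.items()
            if len(genes_by_sequence[features["sequence"]]) == 1
        }
        for gene, oligos in database.items()
    }


# Invented toy data: one sequence is shared between the two genes
toy_database = {
    "GENE_A": {"a1": {"sequence": "ACGTACGT"}, "a2": {"sequence": "TTTTCCCC"}},
    "GENE_B": {"b1": {"sequence": "ACGTACGT"}, "b2": {"sequence": "GGGGAAAA"}},
}
specific = remove_cross_gene_duplicates(toy_database)
print(specific)
```

A dictionary lookup like this is far cheaper than an alignment, which is why an exact-match step is a sensible first filter before Bowtie or Blastn.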

In [8]:
from oligo_designer_toolsuite.database import ReferenceDatabase
from oligo_designer_toolsuite.oligo_specificity_filter import (
    SpecificityFilter,
    ExactMatches,
    LigationRegionCreation,
    BowtieSeedRegion,
    Blastn,
)

dir_specificity = os.path.join(dir_output, "specificity_temporary") # folder where the temporary files will be written


reference = ReferenceDatabase(
    file_fasta = file_transcriptome,
    files_source = region_generator_custom.files_source,
    species = region_generator_custom.species,
    annotation_release = region_generator_custom.annotation_release,
    genome_assembly = region_generator_custom.genome_assembly,
    dir_output=dir_output
    )

# initialize the filter classes
exact_matches = ExactMatches(dir_specificity=dir_specificity)
seed_ligation = LigationRegionCreation(ligation_region_size=config["ligation_region_size"])
seed_region = BowtieSeedRegion(dir_specificity=dir_specificity, seed_region_creation=seed_ligation)
blastn = Blastn(
    dir_specificity=dir_specificity, 
    word_size=config["word_size"],
    percent_identity=config["percent_identity"],
    coverage=config["coverage"],
    strand=config["strand"],
)
filters = [exact_matches, seed_region, blastn]

# initialize the specificity filter class
specificity_filter = SpecificityFilter(filters=filters, write_regions_with_insufficient_oligos=config["write_removed_genes"])
# filter the database
oligo_database = specificity_filter.apply(oligo_database=oligo_database, reference_database=reference, n_jobs=config["n_jobs"])
# write the intermediate result
if config["write_intermediate_steps"]:
    file_database = oligo_database.write_database(filename="oligo_database_specificity_filter.txt")

# reads processed: 2678
# reads with at least one alignment: 2678 (100.00%)
# reads that failed to align: 0 (0.00%)
Reported 87163 alignments
# reads processed: 136
# reads with at least one alignment: 136 (100.00%)
# reads that failed to align: 0 (0.00%)
Reported 2873 alignments
# reads processed: 2117
# reads with at least one alignment: 2117 (100.00%)
# reads that failed to align: 0 (0.00%)
Reported 87883 alignments
# reads processed: 575
# reads with at least one alignment: 575 (100.00%)
# reads that failed to align: 0 (0.00%)
Reported 15202 alignments
# reads processed: 369
# reads with at least one alignment: 369 (100.00%)
# reads that failed to align: 0 (0.00%)
Reported 21850 alignments


## Oligoset generation

In the next step of the pipeline the oligos will be chosen according to their theoretical efficiency in the experiment scope (e.g. how well they bind to the target in the DNA). Each oligo will receive a score computed by a class that inherits from `OligoScoringBase`. Later, the sequences will be organized in sets and a class inheriting from `SetScoringBase` will give a general efficiency score to each set. Finally, the sets with the best scores are selected.
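
To illustrate what a per-oligo score could look like, the sketch below combines the distance of the melting temperature and GC content from their optimal values into one weighted number. This is one plausible form under stated assumptions, not the formula used by `PadlockOligoScoring`: `toy_oligo_score` is a hypothetical function, lower scores are treated as better, and the optimum is assumed to lie strictly between the minimum and maximum.

```python
def toy_oligo_score(tm, gc, tm_min, tm_opt, tm_max, gc_min, gc_opt, gc_max,
                    tm_weight=1.0, gc_weight=1.0):
    """Weighted, range-normalized distance from the optimum (lower is better)."""
    # Normalize by the distance between the optimum and the bound on the same side,
    # so a value sitting on a bound contributes exactly 1.
    tm_range = tm_max - tm_opt if tm >= tm_opt else tm_opt - tm_min
    gc_range = gc_max - gc_opt if gc >= gc_opt else gc_opt - gc_min
    tm_dif = abs(tm - tm_opt) / tm_range
    gc_dif = abs(gc - gc_opt) / gc_range
    return tm_weight * tm_dif + gc_weight * gc_dif


# Invented example values
bounds = dict(tm_min=50, tm_opt=60, tm_max=70, gc_min=40, gc_opt=50, gc_max=60)
print(toy_oligo_score(tm=65, gc=45, **bounds))
print(toy_oligo_score(tm=60, gc=50, **bounds))
```

The weights play the same role as `Tm_weight` and `GC_weight` in the configuration: they decide how strongly each property influences the ranking.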

It is required that each set of sequences contains only oligos that do not overlap. If two oligos overlapped, they would compete to bind to the same section of DNA and their efficiency would drop significantly. Therefore, we only consider sets of non-overlapping sequences.
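
The non-overlap constraint can be sketched with a simple greedy heuristic: sort the oligos by score and accept each one only if it does not overlap any oligo already selected. This is a swapped-in simplification for illustration, not the search performed by the library; `overlaps` and `greedy_non_overlapping` are hypothetical names, and lower scores are assumed to be better.

```python
def overlaps(a, b):
    """True if two closed intervals (start, end) share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def greedy_non_overlapping(oligos, set_size):
    """Pick up to set_size mutually non-overlapping oligos, best score first.

    `oligos` maps oligo_id -> (start, end, score); lower scores are better.
    """
    selected = []
    for oligo_id, (start, end, _score) in sorted(oligos.items(), key=lambda item: item[1][2]):
        if all(not overlaps((start, end), oligos[other][:2]) for other in selected):
            selected.append(oligo_id)
        if len(selected) == set_size:
            break
    return selected


# Invented toy oligos on one transcript: (start, end, score)
toy_oligos = {
    "o1": (100, 140, 0.2),
    "o2": (120, 160, 0.1),
    "o3": (150, 190, 0.3),
    "o4": (200, 240, 0.5),
}
print(greedy_non_overlapping(toy_oligos, set_size=3))
```

In this toy example only two oligos can be selected even though three were requested, because the best-scoring oligo overlaps two of the others. A greedy choice is not guaranteed to find the best set, which is why a dedicated set search is worthwhile.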

The class `OligosetGenerator` takes the scoring strategies and tries to find, among all the feasible non-overlapping sets of oligos, the sets with the best efficiency scores. These sets will be saved in a pandas DataFrame with the following structure:


 oligoset_id | oligo_0  | oligo_1  | oligo_2  |  ...  | oligo_n  | set_score_1 | set_score_2 |  ...  
------------ | -------- | -------- | -------- | ----- | -------- | ----------- | ----------- | ------:
 0           | AGRN_184 | AGRN_133 | AGRN_832 |  ...  | AGRN_706 | 0.3445      | 1.2332      |  ...  


In [9]:
from oligo_designer_toolsuite.oligo_efficiency import(
    PadlockOligoScoring,
    PadlockSetScoring,
)
from oligo_designer_toolsuite.oligo_selection import OligosetGenerator, padlock_heuristic_selection

# initialize the scoring classes
oligos_scoring = PadlockOligoScoring(
    Tm_min=config["Tm_min"],
    Tm_opt=config["Tm_opt"],
    Tm_max=config["Tm_max"],
    GC_content_min=config["GC_content_min"],
    GC_content_opt=config["GC_content_opt"],
    GC_content_max=config["GC_content_max"],
    Tm_weight=config["Tm_weight"],
    GC_weight=config["GC_weight"],
)
set_scoring = PadlockSetScoring()

# initialize the oligoset generator class
oligoset_generator = OligosetGenerator(
    oligoset_size=config["oligoset_size"], 
    min_oligoset_size=config["min_oligoset_size"],
    oligos_scoring=oligos_scoring,
    set_scoring=set_scoring,
    heurustic_selection=padlock_heuristic_selection,
    write_regions_with_insufficient_oligos=config["write_removed_genes"]
)

# generate the oligoset
oligo_database = oligoset_generator.apply(oligo_database=oligo_database, n_sets=config["n_sets"], n_jobs=config["n_jobs"])
# write the intermediate result
if config["write_intermediate_steps"]:
    file_database = oligo_database.write_database(filename="oligo_database_oligosets.txt")

## Last step

Once the best oligosets are generated, each experiment design might require an additional step. In the case of the Padlock oligo designer, the last step consists of designing the final padlock probe sequences, which contain the padlock probe and the detection probe sequence.
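
The general layout of a padlock probe can be sketched as follows: the target-binding sequence is split at the ligation site into two arms, and the arms are placed at the two ends of a backbone so that their free ends meet at the ligation site once the probe hybridizes. This is a minimal sketch under stated assumptions, not the `PadlockSequence` implementation: `assemble_padlock` is a hypothetical function, and the backbone below is a lowercase placeholder rather than a real backbone, barcode or anchor sequence.

```python
def assemble_padlock(target_binding_sequence, ligation_site, backbone):
    """Place the two arms around a backbone so their free ends meet at the ligation site."""
    # The part after the ligation site becomes the 5' arm of the padlock,
    # the part before it becomes the 3' arm.
    arm_5prime = target_binding_sequence[ligation_site:]
    arm_3prime = target_binding_sequence[:ligation_site]
    return arm_5prime + backbone + arm_3prime


# Invented toy values; the backbone is a placeholder, not a real sequence
padlock = assemble_padlock("AAAACCCC", ligation_site=4, backbone="ttgg")
print(padlock)
```

Reading the result circularly, the 3' arm is followed directly by the 5' arm, which restores the original target-binding sequence across the ligation junction.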

In [10]:
from oligo_designer_toolsuite.sequence_design import PadlockSequence

# preprocessing of the melting temperature parameters
Tm_params = config["Tm_parameters"]["shared"].copy()
Tm_params.update(config["Tm_parameters"]["detection_oligo"])
Tm_params["nn_table"] = getattr(mt, Tm_params["nn_table"])
Tm_params["tmm_table"] = getattr(mt, Tm_params["tmm_table"])
Tm_params["imm_table"] = getattr(mt, Tm_params["imm_table"])
Tm_params["de_table"] = getattr(mt, Tm_params["de_table"])

Tm_chem_correction_parameters = config["Tm_chem_correction_param"]["shared"].copy()
Tm_chem_correction_parameters.update(config["Tm_chem_correction_param"]["detection_oligo"])

# initialize the padlock sequence designer class
padlock_sequence = PadlockSequence(
    detect_oligo_length_min=config["detect_oligo_length_min"],
    detect_oligo_length_max=config["detect_oligo_length_max"],
    detect_oligo_Tm_opt=config["detect_oligo_Tm_opt"],
    Tm_parameters=Tm_params,
    Tm_chem_correction_parameters=Tm_chem_correction_parameters,
    dir_output = dir_output
)
# generate the padlock sequence
padlock_sequence.design_final_padlock_sequence(oligo_database=oligo_database)