# Padlock Probe Designer

This notebook implements a pipeline to design padlock probes (short oligos) using the `oligo_designer_toolsuite` package.

In [1]:
# import all the necessary packages except those from the oligo_designer_toolsuite
import os
import sys
sys.path.append(os.path.dirname(os.getcwd()))

import yaml
from Bio.SeqUtils import MeltingTemp as mt


## Define the parameters

The first step is to define the parameters used to generate the padlock probes. 
A flexible and reusable way to define all the parameters is to use a configuration file. 
For this tutorial we will use the YAML file `padlock_probe_designer_custom.yaml`, which uses a custom gene annotation. Have a look at it to understand which parameters are required and how the configuration file is structured.
If you want to use the NCBI or Ensembl gene annotation, with automatic download of the annotation from their servers, please check out the YAML files `padlock_probe_designer_ncbi.yaml` and `padlock_probe_designer_ensembl.yaml`.

Once the configuration file has been set up, we have to read its content:

In [2]:
config_file = "./configs/padlock_probe_designer_custom.yaml"
with open(config_file, 'r') as yaml_file:
    config = yaml.safe_load(yaml_file)
dir_output = os.path.join(os.path.dirname(os.getcwd()), config["dir_output"]) # create the complete path for the output directory
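
Since every later cell reads its parameters from `config`, it can help to check early that the expected keys are present. The following is a minimal sketch on a toy dictionary; the helper `missing_config_keys` is hypothetical and not part of `oligo_designer_toolsuite`, and the key list is only a small subset of the keys used in this notebook.

```python
# Hypothetical helper, not part of oligo_designer_toolsuite.
def missing_config_keys(config, required_keys):
    """Return the required keys that are absent from the config dictionary."""
    return [key for key in required_keys if key not in config]


# Toy configuration standing in for the parsed YAML file.
toy_config = {"source": "custom", "oligo_length_min": 38, "n_jobs": 2}
# A few of the keys that later cells of this notebook read from `config`.
required = ["source", "oligo_length_min", "oligo_length_max", "n_jobs"]

missing = missing_config_keys(toy_config, required)
print(missing)
```

With the real configuration, the same check could be run on `config` right after loading it, so that a missing parameter is reported before any long computation starts.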

## Oligo sequences generation

Now we can start building the pipeline. We begin by generating all possible oligos for the genes defined in the config file, with lengths between the minimum and maximum values given in the configuration. The oligos are saved in a nested dictionary with the following structure: 

[gene][oligo_id][oligo_feature].


For this we need a `CustomOligoDB`, or a class that inherits from it (e.g. `NcbiOligoDB` and `EnsemblOligoDB`), and we call its method `create_oligos_DB`. These classes differ in how the fasta and gtf files used to compute the sequences are obtained: the first one uses local files, while the others download them from the NCBI or Ensembl FTP server.

**Remark:** the fasta and gtf files that the custom configuration file uses by default can be downloaded from the git repository only with git-lfs.
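
To make the idea of "all possible oligos with a length between a minimum and a maximum" concrete, here is a minimal sliding-window sketch on a made-up sequence. It is only an illustration under simplifying assumptions: the function `enumerate_oligos` is hypothetical, and it ignores strand, exon junctions and the bookkeeping of transcript and exon IDs that the toolsuite classes take care of.

```python
def enumerate_oligos(sequence, length_min, length_max):
    """Yield (start, end, subsequence) for every window of an allowed length."""
    for length in range(length_min, length_max + 1):
        # the last valid start position leaves exactly `length` bases
        for start in range(len(sequence) - length + 1):
            yield start, start + length, sequence[start:start + length]


toy_transcript = "ACGTACGTAC"  # made-up 10-nt sequence
windows = list(enumerate_oligos(toy_transcript, length_min=8, length_max=9))
for start, end, subsequence in windows:
    print(start, end, subsequence)
```

The number of candidates grows quickly with the transcript length and the width of the length range, which is why the filtering steps later in the pipeline matter.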

In [3]:
from oligo_designer_toolsuite.database import CustomOligoDB, NcbiOligoDB, EnsemblOligoDB

# define the database class
if config["source"] == "ncbi":
    # download the fasta files from the NCBI server
    oligo_database = NcbiOligoDB(
        oligo_length_min=config["oligo_length_min"],
        oligo_length_max=config["oligo_length_max"],
        species=config["species"],
        annotation_release=config["annotation_release"],
        n_jobs=config["n_jobs"],
        dir_output=dir_output,
        min_oligos_per_gene=config["min_oligos_per_gene"],
        )
elif config["source"] == "ensembl":
    # download the fasta files from the Ensembl server
    oligo_database = EnsemblOligoDB(
        oligo_length_min=config["oligo_length_min"],
        oligo_length_max=config["oligo_length_max"],
        species=config["species"],
        annotation_release=config["annotation_release"],
        n_jobs=config["n_jobs"],
        dir_output=dir_output,
        min_oligos_per_gene=config["min_oligos_per_gene"],
        )
elif config["source"] == "custom":
    # use already downloaded files
    oligo_database = CustomOligoDB(
        oligo_length_min=config["oligo_length_min"],
        oligo_length_max=config["oligo_length_max"],
        species=config["species"],
        genome_assembly=config["genome_assembly"],
        annotation_release=config["annotation_release"],
        files_source=config["files_source"],
        annotation_file=config["annotation_file"],
        sequence_file=config["sequence_file"],
        n_jobs=config["n_jobs"],
        dir_output=dir_output,
        min_oligos_per_gene=config["min_oligos_per_gene"],
        )
else:
    raise ValueError("Annotation source not supported!") 

# read the genes file: one gene name per line, empty lines are skipped
with open(config["file_genes"]) as handle:
    genes = [line.strip() for line in handle if line.strip()]

# generate the oligo sequences from the gene transcripts
oligo_database.create_oligos_DB(genes=genes, region='transcripts')

### Dictionary structure

Here is an example of how the nested dictionary is structured. We print a very simple dictionary with 1 oligo.

In [4]:
gene = list(oligo_database.oligos_DB.keys())[0]
oligo_id = list(oligo_database.oligos_DB[gene].keys())[0]

sample_oligos_DB = {}
sample_oligos_DB[gene] = {}
sample_oligos_DB[gene][oligo_id] = oligo_database.oligos_DB[gene][oligo_id]
print(sample_oligos_DB)

{'CYBA': {'CYBA_1': {'sequence': Seq('ACAGCTGGGCGCTTCACCCAGTGGTACTTTGGTGCCTA'), 'transcript_id': ['NM_000101.4:XM_011522905.4', 'NM_000101.4:XM_011522905.4'], 'exon_id': ['NM_000101.4_exon3_NM_000101.4_exon2:XM_011522905.4_exon3_XM_011522905.4_exon2', 'NM_000101.4_exon2:XM_011522905.4_exon2'], 'chromosome': '16', 'start': [88647131, 88648070], 'end': [88647169, 88648108], 'strand': '-', 'length': 38}}}
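
The nesting `[gene][oligo_id][oligo_feature]` can be traversed with plain dictionary loops. The sketch below uses a toy dictionary with made-up gene names, IDs and values, and plain strings instead of `Seq` objects, so it runs without the toolsuite.

```python
# Toy database with the same [gene][oligo_id][oligo_feature] nesting.
# Gene names, IDs and values are made up for illustration.
toy_oligos_DB = {
    "GENE_A": {
        "GENE_A_1": {"sequence": "ACGTACGTAC", "start": [100], "end": [110], "length": 10},
        "GENE_A_2": {"sequence": "TTGCAAGGCT", "start": [150], "end": [160], "length": 10},
    },
    "GENE_B": {
        "GENE_B_1": {"sequence": "GGCCTTAAGG", "start": [20], "end": [30], "length": 10},
    },
}

# Count the oligos of each gene.
oligos_per_gene = {gene: len(oligos) for gene, oligos in toy_oligos_DB.items()}
print(oligos_per_gene)

# Visit every oligo and read one of its features.
for gene, oligos in toy_oligos_DB.items():
    for oligo_id, features in oligos.items():
        print(gene, oligo_id, features["sequence"])
```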


### Read and write

These classes handle everything related to the management of the dataset. In particular, beyond creating the dataset, they can also read and write the oligo sequences in **tsv** or **gtf** format, using the methods `read_oligos_DB` and `write_oligos_DB`.

Therefore, it is possible to save the current state of the dictionary during the pipeline and to resume from a previous stage if an error occurred.
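
To illustrate why a flat tsv file is enough to checkpoint the nested dictionary, here is a small round-trip sketch with the standard `csv` module. The column layout and the helpers `write_tsv` and `read_tsv` are assumptions made for this example; they are not the file format or the API used by `write_oligos_DB` and `read_oligos_DB`.

```python
import csv
import io

# Toy database with made-up content.
toy_oligos_DB = {
    "GENE_A": {"GENE_A_1": {"sequence": "ACGTACGTAC", "length": 10}},
    "GENE_B": {"GENE_B_1": {"sequence": "GGCCTTAA", "length": 8}},
}


def write_tsv(oligos_DB, handle):
    """Write one row per oligo; the column layout is an assumption of this sketch."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene", "oligo_id", "sequence", "length"])
    for gene, oligos in oligos_DB.items():
        for oligo_id, features in oligos.items():
            writer.writerow([gene, oligo_id, features["sequence"], features["length"]])


def read_tsv(handle):
    """Rebuild the nested dictionary from the rows written by write_tsv."""
    oligos_DB = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        oligos_DB.setdefault(row["gene"], {})[row["oligo_id"]] = {
            "sequence": row["sequence"],
            "length": int(row["length"]),  # numeric features must be converted back
        }
    return oligos_DB


buffer = io.StringIO()  # in-memory stand-in for a file on disk
write_tsv(toy_oligos_DB, buffer)
buffer.seek(0)
restored = read_tsv(buffer)
print(restored == toy_oligos_DB)
```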

In [5]:
if config["write_intermediate_steps"]:
    oligo_database.write_oligos_DB(format=config["file_format"], dir_oligos_DB="oligos_creation")

## Property filters

Once all the possible sequences are created, we apply a first filtering step based on sequence properties (e.g. melting temperature or GC content). This reduces the number of sequences we have to deal with in the next stages and discards all the sequences that are not suited for the experiment.

Each property filter is a class that inherits from the abstract base class `PreFilterBase`. Each one has a method called `apply` that takes the `oligos_DB` dictionary and returns it filtered. To make this process smooth and modular, the class `PropertyFilter` applies several filters one after the other: it takes as input a list of filter instances and a `CustomOligoDB`, applies all the filters sequentially and returns the final filtered version of the database. Additionally, all the sequence features computed by the filters are stored in the `oligos_DB` for possible later use. 

*Note:* the filters are applied in the order they are given as input. Hence, filters with fast computations should be listed first, e.g. apply the GC content filter before the melting temperature filter, to reduce runtime.

To create new property filters, follow the abstract base class requirements in `PreFilterBase`.
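
The sequential filtering described above can be sketched with two toy filters that share an `apply` method. This is a simplified stand-in, not the toolsuite implementation: the classes `ToyMinLengthFilter` and `ToyGCContentFilter` and the function `apply_filters` are hypothetical, they do not inherit from `PreFilterBase`, and the GC content is expressed here as a fraction between 0 and 1.

```python
def gc_fraction(sequence):
    """Fraction of G and C bases in the sequence."""
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


class ToyMinLengthFilter:
    """Cheap filter: keep oligos with at least `length_min` bases."""

    def __init__(self, length_min):
        self.length_min = length_min

    def apply(self, oligos_DB):
        return {
            gene: {oid: f for oid, f in oligos.items() if len(f["sequence"]) >= self.length_min}
            for gene, oligos in oligos_DB.items()
        }


class ToyGCContentFilter:
    """Keep oligos whose GC fraction lies in [gc_min, gc_max]."""

    def __init__(self, gc_min, gc_max):
        self.gc_min = gc_min
        self.gc_max = gc_max

    def apply(self, oligos_DB):
        filtered = {}
        for gene, oligos in oligos_DB.items():
            filtered[gene] = {}
            for oligo_id, features in oligos.items():
                gc = gc_fraction(features["sequence"])
                if self.gc_min <= gc <= self.gc_max:
                    # store the computed feature for possible later use
                    filtered[gene][oligo_id] = {**features, "GC_content": gc}
        return filtered


def apply_filters(oligos_DB, filters):
    """Apply the filters in the given order, cheapest first."""
    for property_filter in filters:
        oligos_DB = property_filter.apply(oligos_DB)
    return oligos_DB


toy_oligos_DB = {
    "GENE_A": {
        "oligo_1": {"sequence": "GCGCGCGCGC"},
        "oligo_2": {"sequence": "ACGTACGTAC"},
        "oligo_3": {"sequence": "ATATATATAT"},
        "oligo_4": {"sequence": "ACGTAC"},
    }
}
toy_filters = [ToyMinLengthFilter(length_min=8), ToyGCContentFilter(gc_min=0.4, gc_max=0.6)]
result = apply_filters(toy_oligos_DB, toy_filters)
print(result)
```

Because the length filter runs first, the GC fraction is never computed for `oligo_4`; the same reasoning motivates listing cheap filters before expensive ones.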

In [6]:
from oligo_designer_toolsuite.oligo_property_filter import (
    PropertyFilter,
    MaskedSequences,
    GCContent, 
    MeltingTemperature, 
    PadlockArms
)

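# the melting temperature parameters need to be preprocessed: the config stores the names of the
# Biopython tables as strings, so we replace them with the table objects from Bio.SeqUtils.MeltingTemp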
Tm_params = config["Tm_parameters"]["shared"].copy()
Tm_params.update(config["Tm_parameters"]["property_filter"])
Tm_params["nn_table"] = getattr(mt, Tm_params["nn_table"])
Tm_params["tmm_table"] = getattr(mt, Tm_params["tmm_table"])
Tm_params["imm_table"] = getattr(mt, Tm_params["imm_table"])
Tm_params["de_table"] = getattr(mt, Tm_params["de_table"])

Tm_correction_param = config["Tm_correction_parameters"]["shared"].copy()
Tm_correction_param.update(config["Tm_correction_parameters"]["property_filter"])

# initialize the filter classes
masked_sequences = MaskedSequences()
gc_content = GCContent(GC_content_min=config["GC_content_min"], GC_content_max=config["GC_content_max"])
melting_temperature = MeltingTemperature(
    Tm_min=config["Tm_min"], 
    Tm_max=config["Tm_max"], 
    Tm_parameters=Tm_params, 
    Tm_correction_parameters=Tm_correction_param
)
padlock_arms = PadlockArms(
    min_arm_length=config["min_arm_length"],
    max_arm_Tm_dif=config["max_arm_Tm_dif"],
    arm_Tm_min=config["arm_Tm_min"],
    arm_Tm_max=config["arm_Tm_max"],
    Tm_parameters=Tm_params,
    Tm_correction_parameters=Tm_correction_param,
)
# create the list of filters
filters = [masked_sequences, gc_content, melting_temperature, padlock_arms]

# initialize the property filter class
property_filter = PropertyFilter(filters=filters, write_genes_with_insufficient_oligos=config["write_removed_genes"])
# filter the database
oligo_database = property_filter.apply(oligo_database=oligo_database, n_jobs=config["n_jobs"])
# write the intermediate result in a file
if config["write_intermediate_steps"]:
    oligo_database.write_oligos_DB(format=config["file_format"], dir_oligos_DB="property_filter")

## Specificity filters

In experiments that use short oligos, one of the main problems is off-target binding of the designed oligo sequences to unwanted regions. To avoid this problem, we can remove all the oligos that also match regions outside the gene they belong to.

The classes in the subpackage `oligo_specificity_filter` detect these oligos using alignment methods such as Blast and Bowtie and remove them from the database. The currently implemented classes are: `ExactMatches`, `Blastn`, `Bowtie`, `Bowtie2`, `BowtieSeedRegion`. Look at the documentation for detailed information. These filters are structured in the same way as the property filters: a second class, `SpecificityFilter`, takes a list of all the filters we want to apply and applies them sequentially to the `oligos_DB`.  
**Note:** the filters are applied in the order they are given as input. Hence, filters with fast computations should be listed first, e.g. apply the exact match filter before the Blastn filter, to reduce runtime.

In addition, alignment methods need a reference fasta file to detect the off-target regions. The `CustomReferenceDB` class can generate this reference with the method `create_reference_DB`.  
**Remark:** it is possible to apply a set of specificity filters with a reference file and a second set with a different reference file. 

For our pipeline, we will use `ExactMatches`, `Blastn` and `BowtieSeedRegion`. For the `BowtieSeedRegion` filter we need to generate the oligo seed region with `LigationRegionCreation` (look at the documentation to understand what seed region means). 
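
As a rough intuition for the cheapest kind of specificity check, the sketch below removes every oligo whose exact sequence also occurs in another gene. This is one plausible reading of an exact-match filter, written on a toy dictionary; the function `remove_cross_gene_duplicates` is hypothetical and is not the implementation of `ExactMatches`. The alignment-based filters additionally catch near matches, which this sketch does not.

```python
from collections import defaultdict


def remove_cross_gene_duplicates(oligos_DB):
    """Drop oligos whose exact sequence is found in more than one gene."""
    genes_by_sequence = defaultdict(set)
    for gene, oligos in oligos_DB.items():
        for features in oligos.values():
            genes_by_sequence[features["sequence"]].add(gene)
    return {
        gene: {
            oligo_id: features
            for oligo_id, features in oligos.items()
            if len(genes_by_sequence[features["sequence"]]) == 1
        }
        for gene, oligos in oligos_DB.items()
    }


# Toy database with made-up content: one sequence is shared by both genes.
toy_oligos_DB = {
    "GENE_A": {
        "GENE_A_1": {"sequence": "ACGTACGTAC"},
        "GENE_A_2": {"sequence": "TTGCAAGGCT"},
    },
    "GENE_B": {
        "GENE_B_1": {"sequence": "ACGTACGTAC"},
        "GENE_B_2": {"sequence": "GGCCTTAAGG"},
    },
}
specific = remove_cross_gene_duplicates(toy_oligos_DB)
print(specific)
```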

In [7]:
from oligo_designer_toolsuite.database import CustomReferenceDB, NcbiReferenceDB, EnsemblReferenceDB
from oligo_designer_toolsuite.oligo_specificity_filter import (
    SpecificityFilter,
    ExactMatches,
    LigationRegionCreation,
    BowtieSeedRegion,
    Blastn,
)

dir_specificity = os.path.join(dir_output, "specificity_temporary") # folder where the temporary files will be written

# generate the reference
reference_database = CustomReferenceDB(
    species=oligo_database.species,
    genome_assembly=oligo_database.genome_assembly,
    annotation_release=oligo_database.annotation_release,
    files_source=oligo_database.files_source,
    annotation_file=oligo_database.annotation_file,
    sequence_file=oligo_database.sequence_file,
    dir_output=dir_output
)
reference_database.create_reference_DB(block_size=config["block_size"]) # leave the standard parameters

# initialize the filter classes
exact_matches = ExactMatches(dir_specificity=dir_specificity)
seed_ligation = LigationRegionCreation(ligation_region_size=config["ligation_region_size"])
seed_region = BowtieSeedRegion(dir_specificity=dir_specificity, seed_region_creation=seed_ligation)
blastn = Blastn(
    dir_specificity=dir_specificity, 
    word_size=config["word_size"],
    percent_identity=config["percent_identity"],
    coverage=config["coverage"],
)
filters = [exact_matches, seed_region, blastn]

# initialize the specificity filter class
specificity_filter = SpecificityFilter(filters=filters, write_genes_with_insufficient_oligos=config["write_removed_genes"])
# filter the database
oligo_database = specificity_filter.apply(oligo_database=oligo_database, reference_database=reference_database, n_jobs=config["n_jobs"])
# write the intermediate result
if config["write_intermediate_steps"]:
    oligo_database.write_oligos_DB(format=config["file_format"], dir_oligos_DB="specificity_filter")

# reads processed: 882
# reads with at least one alignment: 882 (100.00%)
# reads that failed to align: 0 (0.00%)
Reported 32902 alignments
# reads processed: 126
# reads with at least one alignment: 126 (100.00%)
# reads that failed to align: 0 (0.00%)
Reported 1791 alignments
# reads processed: 2196
# reads with at least one alignment: 2196 (100.00%)
# reads that failed to align: 0 (0.00%)
Reported 63683 alignments
# reads processed: 11276
# reads with at least one alignment: 11276 (100.00%)
# reads that failed to align: 0 (0.00%)
Reported 304029 alignments
# reads processed: 1211
# reads with at least one alignment: 1211 (100.00%)
# reads that failed to align: 0 (0.00%)
Reported 50523 alignments
# reads processed: 0
# reads with at least one alignment: 0 (0.00%)
# reads that failed to align: 0 (0.00%)
No alignments

## Oligoset generation

In the next step of the pipeline the oligos are chosen according to their theoretical efficiency in the scope of the experiment (e.g. how well they bind to the target in the DNA). Each oligo receives a score computed by a class that inherits from `OligoScoringBase`. Then the sequences are organized in sets, and a class inheriting from `SetScoringBase` gives an overall efficiency score to each set. At the end, the best sets are selected and stored.
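
The two levels of scoring can be pictured with a toy example. The formulas below are assumptions chosen for illustration only (a weighted distance from the optimal melting temperature and GC content, where a lower score is better); the functions `toy_oligo_score` and `toy_set_score` are hypothetical and do not reproduce `PadlockOligoScoring` or `PadlockSetScoring`.

```python
def toy_oligo_score(Tm, GC_content, Tm_opt, GC_content_opt, Tm_weight=1.0, GC_weight=1.0):
    """Weighted distance from the optimum; lower means closer to the optimum."""
    return Tm_weight * abs(Tm - Tm_opt) + GC_weight * abs(GC_content - GC_content_opt)


def toy_set_score(oligo_scores):
    """Summarize a set by its worst (largest) oligo score and by the sum of all scores."""
    return max(oligo_scores), sum(oligo_scores)


# Made-up oligos: (melting temperature in degrees Celsius, GC content in percent).
toy_oligos = {"oligo_1": (60, 50), "oligo_2": (62, 45)}
scores = {
    oligo_id: toy_oligo_score(Tm, GC, Tm_opt=60, GC_content_opt=50)
    for oligo_id, (Tm, GC) in toy_oligos.items()
}
set_scores = toy_set_score(list(scores.values()))
print(scores, set_scores)
```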

Each set of sequences must contain only oligos that do not overlap. If two oligos overlapped, they would compete to bind to the same section of DNA and their efficiency would drop significantly. Therefore, it is extremely important to consider only sets of non-overlapping sequences.
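
The non-overlap constraint can be sketched as an interval check combined with a greedy selection. This is a simple heuristic under stated assumptions, not the search performed by `OligosetGenerator`: each oligo is reduced to a single half-open interval `(start, end)` with a made-up score where lower is better, and the functions `overlaps` and `greedy_non_overlapping` are hypothetical.

```python
def overlaps(a, b):
    """True if two half-open intervals (start, end) share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def greedy_non_overlapping(oligos, set_size):
    """Pick up to `set_size` oligos, best score first, skipping overlapping ones.

    oligos maps oligo_id -> (start, end, score); a lower score is better.
    """
    chosen = []
    for oligo_id, (start, end, _score) in sorted(oligos.items(), key=lambda item: item[1][2]):
        if all(not overlaps((start, end), oligos[other][:2]) for other in chosen):
            chosen.append(oligo_id)
        if len(chosen) == set_size:
            break
    return chosen


# Made-up coordinates and scores.
toy_oligos = {
    "oligo_1": (0, 40, 0.1),
    "oligo_2": (20, 60, 0.2),   # overlaps oligo_1
    "oligo_3": (40, 80, 0.3),   # starts exactly where oligo_1 ends, so no overlap
    "oligo_4": (90, 130, 0.5),
}
selected = greedy_non_overlapping(toy_oligos, set_size=3)
print(selected)
```

A greedy choice like this is fast but can miss the best combination, which is one reason to compare several candidate sets by their set scores.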

The class `OligosetGenerator` takes the scoring strategies and tries to find, among all the feasible non-overlapping sets of oligos, the sets with the best efficiency scores. These sets are saved in a pandas DataFrame with the following structure:


 oligoset_id | oligo_0  | oligo_1  | oligo_2  |  ...  | oligo_n  | set_score_1 | set_score_2 |  ...  
------------ | -------- | -------- | -------- | ----- | -------- | ----------- | ----------- | ------:
 0           | AGRN_184 | AGRN_133 | AGRN_832 |  ...  | AGRN_706 | 0.3445      | 1.2332      |  ...  


In [8]:
from oligo_designer_toolsuite.oligo_efficiency import(
    PadlockOligoScoring,
    PadlockSetScoring,
)
from oligo_designer_toolsuite.oligo_selection import OligosetGenerator, padlock_heuristic_selection

# initialize the scoring classes
oligos_scoring = PadlockOligoScoring(
    Tm_min=config["Tm_min"],
    Tm_opt=config["Tm_opt"],
    Tm_max=config["Tm_max"],
    GC_content_min=config["GC_content_min"],
    GC_content_opt=config["GC_content_opt"],
    GC_content_max=config["GC_content_max"],
    Tm_weight=config["Tm_weight"],
    GC_weight=config["GC_weight"],
)
set_scoring = PadlockSetScoring()

# initialize the oligoset generator class
oligoset_generator = OligosetGenerator(
    oligoset_size=config["oligoset_size"], 
    min_oligoset_size=config["min_oligoset_size"],
    oligos_scoring=oligos_scoring,
    set_scoring=set_scoring,
    heurustic_selection=padlock_heuristic_selection,
    write_genes_with_insufficient_oligos=config["write_removed_genes"]
)

# generate the oligoset
oligo_database = oligoset_generator.apply(oligo_database=oligo_database, n_sets=config["n_sets"], n_jobs=config["n_jobs"])
# write the intermediate result
if config["write_intermediate_steps"]:
    oligo_database.write_oligosets(dir_oligosets="oligosets")

## Last step

Once the best oligosets are generated, each experiment design might require an additional step. In the case of the padlock oligo designer, the last step consists of designing the final padlock oligo sequences.

In [9]:
from oligo_designer_toolsuite.experiment_specific import PadlockSequenceDesigner

# preprocessing of the melting temperature parameters
Tm_params = config["Tm_parameters"]["shared"].copy()
Tm_params.update(config["Tm_parameters"]["detection_oligo"])
Tm_params["nn_table"] = getattr(mt, Tm_params["nn_table"])
Tm_params["tmm_table"] = getattr(mt, Tm_params["tmm_table"])
Tm_params["imm_table"] = getattr(mt, Tm_params["imm_table"])
Tm_params["de_table"] = getattr(mt, Tm_params["de_table"])

Tm_correction_param = config["Tm_correction_parameters"]["shared"].copy()
Tm_correction_param.update(config["Tm_correction_parameters"]["detection_oligo"])

# initialize the padlock sequence designer class
padlock_sequence_designer = PadlockSequenceDesigner(
    detect_oligo_length_min=config["detect_oligo_length_min"],
    detect_oligo_length_max=config["detect_oligo_length_max"],
    detect_oligo_Tm_opt=config["detect_oligo_Tm_opt"],
    Tm_parameters=Tm_params,
    Tm_correction_parameters=Tm_correction_param
)
# generate the padlock sequence
padlock_sequence_designer.design_padlocks(database=oligo_database)