![logo](https://user-images.githubusercontent.com/21340147/192824830-dcbe8d09-2b10-431d-bd9a-b4624192dcc9.png)
<br/>
<br/>

[Semidán Robaina](https://github.com/Robaina), September 2022.

In this notebook, we will use Pynteny through its Python API to find candidate peptide sequences belonging to the _leu_ operon of _Escherichia coli_. The same search can also be run through Pynteny's command-line interface or its web application; see the [wiki pages](https://github.com/Robaina/Pynteny/wiki) for more information. Let's start by importing the required modules.

In [1]:
from pathlib import Path
from pandas import DataFrame
from pynteny.src.filter import SyntenyHits
from pynteny import Search, Build, Download

First, we need to initialize the class `Search` with the appropriate parameters to conduct our synteny-aware search. Find more info about the parameters in the [wiki pages](https://github.com/Robaina/Pynteny/wiki/search).

Some notes:

- The only required parameters are `data`, the path to the position-labeled peptide database, and `synteny_struc`, a string containing the definition of the synteny block to search for

- Providing a path to the HMM database directory (`--hmm_dir`) is optional. If it is not provided, Pynteny downloads and stores the PGAP HMM database (only once, if not previously downloaded) and uses it to run the search. A custom HMM database provided in `--hmm_dir` overrides Pynteny's default database

- We can also manually download the PGAP HMM database with the subcommand `pynteny download`

In [3]:
# Initialize class
search = Search(
    data=Path("../pynteny/tests/test_data/MG1655.fasta"),
    synteny_struc=None,  # set later, once the gene symbols have been translated to HMM names
    hmm_dir=Path("../pynteny/tests/test_data/hmms"),
    hmm_meta=Path("/home/robaina/Documents/Pynteny/pynteny/tests/test_data/hmm_meta.tsv"),
    outdir=Path("example_api/results/"),
    prefix="",
    hmmsearch_args=None,
    gene_ids=False,
    logfile=Path("example_api/results/pynteny.log"),
    processes=None,
    unordered=False,
    )

## Search synteny structure in _E. coli_

Next, we are going to use Pynteny's `search` subcommand to search for a specific syntenic block within the peptide database. Specifically, we are interested in the following structure:

```
<leuD 0 <leuC 1 <leuA
```

With this synteny structure, we are searching for peptide sequences that match the profile HMMs corresponding to these gene symbols, that are arranged in this particular order, that are all in the negative (antisense) strand, as indicated by `<`, and that are located close to each other in the same contig. The numbers set the maximum number of ORFs allowed between consecutive genes: none between _leuD_ and _leuC_, and at most one between _leuC_ and _leuA_.
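To make the notation concrete, here is a minimal sketch of how such a string can be split into gene names, strands, and maximum in-between ORF counts. The helper `parse_structure` and the `Gene` tuple are hypothetical names introduced for illustration; they are not part of Pynteny's API, and Pynteny's own parser may behave differently. The sketch also assumes that `>` marks the positive strand, by analogy with `<`.

```python
from typing import List, NamedTuple, Optional, Tuple


class Gene(NamedTuple):
    name: str
    strand: Optional[str]  # "neg" for "<", "pos" for ">" (assumed), None if unspecified


def parse_structure(structure: str) -> Tuple[List[Gene], List[int]]:
    """Split a synteny string into genes and maximum in-between ORF counts."""
    tokens = structure.split()
    if len(tokens) % 2 == 0:
        raise ValueError("expected alternating gene and distance tokens")
    strands = {"<": "neg", ">": "pos"}
    genes = []
    # Even positions hold gene tokens, odd positions hold distances
    for tok in tokens[0::2]:
        if tok[0] in strands:
            genes.append(Gene(tok[1:], strands[tok[0]]))
        else:
            genes.append(Gene(tok, None))
    max_orfs = [int(tok) for tok in tokens[1::2]]
    return genes, max_orfs


genes, max_orfs = parse_structure("<leuD 0 <leuC 1 <leuA")
print([g.name for g in genes])    # gene symbols in order
print([g.strand for g in genes])  # strand of each gene
print(max_orfs)                   # maximum ORFs allowed between neighbours
```

Omitting the strand symbols (for example, `leuD 0 leuC 1 leuA`) leaves every `strand` as `None` in this sketch, which mirrors a strand-agnostic search.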

In [4]:
# Parse gene IDs in synteny structure according to PGAP HMM database metadata
parsed_struc = search.parseGeneIDs(synteny_struc="<leuD 0 <leuC 1 <leuA")

2022-09-30 13:29:06,310 | INFO: Translated 
 "<leuD 0 <leuC 1 <leuA" 
 to 
 "<TIGR00171.1 0 <TIGR00170.1 1 <TIGR00973.1" 
 according to provided HMM database metadata


We see that `parseGeneIDs` (like `pynteny parse` on the command line) has found one profile HMM for each gene symbol in the provided synteny structure:

`<TIGR00171.1 0 <TIGR00170.1 1 <TIGR00973.1`

When several HMMs match a single gene symbol, they are displayed as a group within parentheses and separated by "|". In such cases, `pynteny search` matches sequences by any of the HMMs in the group.
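As a rough illustration of the group notation, the sketch below expands a single token of a parsed structure into the list of HMM names it stands for. The function `expand_group` is a hypothetical helper, not a Pynteny function, and `HMM_A` and `HMM_B` are placeholder names rather than real PGAP models.

```python
from typing import List


def expand_group(token: str) -> List[str]:
    """Return the HMM names in a token such as "<(HMM_A|HMM_B)" or "<HMM_A"."""
    inner = token.strip().lstrip("<>")  # drop an optional strand symbol
    if inner.startswith("(") and inner.endswith(")"):
        inner = inner[1:-1]  # drop the surrounding parentheses
    return inner.split("|")


print(expand_group("<TIGR00171.1"))    # single HMM, as in our example
print(expand_group("<(HMM_A|HMM_B)"))  # hypothetical group of two HMMs
```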

Alright, now that we know that our HMM database contains models for all the gene symbols in our synteny structure, let's run `pynteny search` to find matches in our peptide sequence database. 

Some notes:

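- On the command line, since we are using gene symbols instead of HMM names, we would need to add the flag `--gene_ids`. Here, the symbols were already translated with `parseGeneIDs`, which is why `gene_ids` is set to `False` in the `Search` object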

- We could have directly input the synteny string composed of HMM names. In that case, we wouldn't need to provide the path to the HMM metadata file (`--hmm_meta`) and we would remove the flag `--gene_ids`

In [5]:
# Update the parsed synteny structure and run the Pynteny search
search.update("synteny_struc", parsed_struc)
synhits: SyntenyHits = search.run()

synhits_df: DataFrame = synhits.getSyntenyHits()

2022-09-30 13:30:29,871 | INFO: Searching database by synteny structure
2022-09-30 13:30:29,871 | INFO: Running Hmmer
2022-09-30 13:30:29,872 | INFO: Reusing Hmmer results for HMM: TIGR00973.1
2022-09-30 13:30:29,878 | INFO: Reusing Hmmer results for HMM: TIGR00171.1
2022-09-30 13:30:29,880 | INFO: Reusing Hmmer results for HMM: TIGR00170.1
2022-09-30 13:30:29,881 | INFO: Filtering results by synteny structure
2022-09-30 13:30:29,917 | INFO: Writing matching sequences to FASTA files
2022-09-30 13:30:29,970 | INFO: Finished!


Pynteny has generated a number of output files in the provided output directory. HMMER3 hit results are stored within the subdirectory `hmmer_outputs`. The main output file, `synteny_matched.tsv`, contains the labels of the matched sequences grouped by synteny block and sorted by gene number within their parent contig. The remaining (FASTA) files contain the retrieved peptide sequences for each gene symbol / HMM name in the synteny structure.

In [6]:
synhits_df.head()

| | contig | gene_id | gene_number | locus | strand | full_label | hmm | gene_symbol | label | product | ec_number |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | U00096 | b0071 | 71 | (78847, 79453) | neg | b0071__U00096_71_78847_79453_neg | TIGR00171.1 | leuD | leuD | 3-isopropylmalate dehydratase small subunit | 4.2.1.33 |
| 1 | U00096 | b0072 | 72 | (79463, 80864) | neg | b0072__U00096_72_79463_80864_neg | TIGR00170.1 | leuC | leuC | 3-isopropylmalate dehydratase large subunit | 4.2.1.33 |
| 2 | U00096 | b0074 | 74 | (81957, 83529) | neg | b0074__U00096_74_81957_83529_neg | TIGR00973.1 | leuA | leuA_bact | 2-isopropylmalate synthase | 2.3.3.13 |


The table above shows the synteny match found in our peptide database. All three peptides are located within the same parent contig (U00096) and respect the positional restrictions of our input synteny structure: there is no ORF between _leuD_ and _leuC_ and a single ORF between _leuC_ and _leuA_ (gene numbers 71, 72 and 74). All sequences belong to _Escherichia coli_ K-12 MG1655, the genome in our peptide database.
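The positional restrictions can be checked by hand from the gene numbers. The sketch below does this with values copied from the table above; `in_between_orfs` is a hypothetical helper written for this check, not part of Pynteny.

```python
# Gene symbols, gene numbers and strands copied from the table above
hits = [
    ("leuD", 71, "neg"),
    ("leuC", 72, "neg"),
    ("leuA", 74, "neg"),
]
max_orfs = [0, 1]  # from the structure "<leuD 0 <leuC 1 <leuA"


def in_between_orfs(hits):
    """Number of ORFs lying between each pair of consecutive hits."""
    numbers = [number for _, number, _ in hits]
    return [b - a - 1 for a, b in zip(numbers, numbers[1:])]


gaps = in_between_orfs(hits)
# Each gap must be non-negative (correct order) and within the allowed maximum
distances_ok = all(0 <= gap <= limit for gap, limit in zip(gaps, max_orfs))
strands_ok = all(strand == "neg" for _, _, strand in hits)

print(gaps)          # [0, 1]
print(distances_ok)  # True
print(strands_ok)    # True
```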

Additional notes:

- The previous results are strand-specific (all ORFs must be located in the negative or _antisense_ strand). However, we could have made them strand-agnostic by omitting the strand symbols in the synteny structure (i.e., using `leuD 0 leuC 1 leuA`)

- We could have made the search even more general by dropping the constraint on the arrangement, either with the flag `--unordered` in `pynteny search` or with `unordered=True` in the `Search` object. In that case, Pynteny would match any group of 3 ORFs corresponding to the provided HMM names and located close to each other in the same contig, but not necessarily arranged in the order displayed by the synteny structure. In other words, `--unordered` enables searching for "true" synteny, as opposed to the more restrictive collinearity.
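The difference between collinearity and order-free synteny can be sketched with two small predicates. This is only a conceptual illustration that ignores strands and distances; the function names are hypothetical, the `shuffled` arrangement is invented for the example, and none of this reflects how Pynteny implements `--unordered` internally.

```python
from collections import Counter
from typing import Sequence


def matches_collinear(observed: Sequence[str], wanted: Sequence[str]) -> bool:
    """True only if the genes appear in exactly the requested order."""
    return list(observed) == list(wanted)


def matches_unordered(observed: Sequence[str], wanted: Sequence[str]) -> bool:
    """True if the same genes are present, whatever their order."""
    return Counter(observed) == Counter(wanted)


wanted = ["leuD", "leuC", "leuA"]
shuffled = ["leuC", "leuA", "leuD"]  # hypothetical arrangement in some genome

print(matches_collinear(wanted, wanted))    # True
print(matches_collinear(shuffled, wanted))  # False
print(matches_unordered(shuffled, wanted))  # True
```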

Alright, finally, let's get the peptide sequences in our original input database that correspond to the identified synteny block displayed above:
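One way to do this, sketched below, is to scan the input FASTA file and keep the records whose header matches a `full_label` from the results table. The helper `select_fasta_records` is hypothetical (not a Pynteny function), and the sketch assumes that the first word of each FASTA header in the database is the `full_label` shown in the table.

```python
from pathlib import Path
from typing import Dict, Iterable, Optional, Set


def select_fasta_records(lines: Iterable[str], wanted: Set[str]) -> Dict[str, str]:
    """Collect sequences whose header (first word after '>') is in `wanted`."""
    records: Dict[str, str] = {}
    current: Optional[str] = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            header = parts[0] if parts else ""
            # Start collecting only if this record is one we are looking for
            current = header if header in wanted else None
            if current is not None:
                records[current] = ""
        elif current is not None:
            records[current] += line
    return records


# Labels copied from the full_label column of the results table
labels = {
    "b0071__U00096_71_78847_79453_neg",
    "b0072__U00096_72_79463_80864_neg",
    "b0074__U00096_74_81957_83529_neg",
}

fasta_path = Path("../pynteny/tests/test_data/MG1655.fasta")
if fasta_path.exists():  # skip silently when the test data is not available
    with fasta_path.open() as handle:
        block = select_fasta_records(handle, labels)
    for label, sequence in block.items():
        print(label, len(sequence))
```

Inside the notebook, the set of labels could be taken directly from the results with `set(synhits_df["full_label"])` instead of being typed by hand.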

## Get citation

We can get the citation string by calling the `cite` method:

In [2]:
Search.cite()

Semidán Robaina Estévez (2022). Pynteny: synteny-aware hmm searches made easy(Version 0.0.2). Zenodo. https://doi.org/10.5281/zenodo.7048685
