![logo](https://user-images.githubusercontent.com/21340147/192824830-dcbe8d09-2b10-431d-bd9a-b4624192dcc9.png)
<br/>
<br/>

[Semidán Robaina](https://github.com/Robaina), September 2022.

In this notebook, we will use Pynteny through its Python API to find candidate peptide sequences belonging to the _leu_ operon of _Escherichia coli_. Note that we could conduct the same search through Pynteny's command-line interface or through its web application. Find more info in the [wiki pages](https://github.com/Robaina/Pynteny/wiki). Let's start by importing some required modules.

In [1]:
from pathlib import Path
from pandas import DataFrame
from pynteny.src.filter import SyntenyHits
from pynteny import Search, Build, Download

First, we need to initialize the class `Search` with the appropriate parameters to conduct our synteny-aware search. Find more info about the parameters in the [wiki pages](https://github.com/Robaina/Pynteny/wiki/search).

Some notes:

- The only required parameters are `data`, the path to the position-labeled peptide database, and `synteny_struc`, a string containing the definition of the synteny block to search for.

- Providing a path to the HMM database directory (`--hmm_dir`) is optional. If it is not provided, Pynteny will download and store the PGAP HMM database (only once, if it has not been downloaded before) and use it to run the search. A custom HMM database provided in `--hmm_dir` will override Pynteny's default database.

- We can also manually download the PGAP HMM database with the subcommand `pynteny download`.

In [3]:
# Initialize class
search = Search(
    data=Path("../pynteny/tests/test_data/MG1655.fasta"),
    synteny_struc=None,
    hmm_dir=Path("../pynteny/tests/test_data/hmms"),
    hmm_meta=Path("/home/robaina/Documents/Pynteny/pynteny/tests/test_data/hmm_meta.tsv"),
    outdir=Path("example_api/results/"),
    prefix="",
    hmmsearch_args=None,
    gene_ids=False,
    logfile=Path("example_api/results/pynteny.log"),
    processes=None,
    unordered=False,
    )

## Search synteny structure in _E. coli_

Finally, we are going to use pynteny's `search` subcommand to search for a specific syntenic block within the previously built peptide database. Specifically, we are interested in the [_leu_ operon](https://link.springer.com/article/10.1007/s00253-016-8026-2#Fig1):

```
<leuD 0 <leuC 1 <leuA
```

With this synteny structure, we are searching for peptide sequences that match the profile HMMs corresponding to these gene symbols, that are arranged in this particular order, all on the negative (antisense) strand, as indicated by `<`, and that are located next to each other on the same contig: no ORFs are allowed between _leuD_ and _leuC_ (maximum of 0 in-between ORFs), and at most one ORF is allowed between _leuC_ and _leuA_ (maximum of 1).
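To make the notation concrete, the following is a minimal sketch that splits a synteny structure string of the simple form used here into its parts. It is not Pynteny's own parser: the function `parse_synteny_structure` and the `SyntenyElement` tuple are hypothetical names introduced for illustration, and the sketch assumes that tokens alternate between gene symbols (optionally prefixed by `>` or `<`) and integer maximum numbers of in-between ORFs.

```python
from typing import List, NamedTuple, Optional


class SyntenyElement(NamedTuple):
    """One gene of the synteny block (hypothetical helper type)."""
    gene: str
    strand: str                      # "pos", "neg" or "unspecified"
    max_orfs_before: Optional[int]   # None for the first gene in the block


def parse_synteny_structure(structure: str) -> List[SyntenyElement]:
    """Split a string such as '<leuD 0 <leuC 1 <leuA' into its elements."""
    tokens = structure.split()
    # Expected layout: gene, gap, gene, gap, ..., gene (odd number of tokens)
    if len(tokens) % 2 == 0:
        raise ValueError("Expected genes separated by maximum ORF counts")
    genes, gaps = tokens[0::2], tokens[1::2]

    elements = []
    for i, token in enumerate(genes):
        if token.startswith(">"):
            strand, gene = "pos", token[1:]
        elif token.startswith("<"):
            strand, gene = "neg", token[1:]
        else:
            strand, gene = "unspecified", token
        # The gap listed before a gene limits the ORFs between it and the previous gene
        max_gap = None if i == 0 else int(gaps[i - 1])
        elements.append(SyntenyElement(gene, strand, max_gap))
    return elements


elements = parse_synteny_structure("<leuD 0 <leuC 1 <leuA")
for element in elements:
    print(element)
```

For the structure above, this yields three elements on the negative strand, with a limit of 0 ORFs before _leuC_ and 1 ORF before _leuA_.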

In [None]:
# Parse gene IDs in synteny structure according to PGAP HMM database metadata
parsed_struc = search.parseGeneIDs(synteny_struc="<leuD 0 <leuC 1 <leuA")

In [7]:
# Update parsed synteny structure and run Pynteny search
search.update("synteny_struc", parsed_struc)
synhits: SyntenyHits = search.run()

synhits_df: DataFrame = synhits.getSyntenyHits()

2022-09-29 13:32:55,309 | INFO: Searching database by synteny structure
2022-09-29 13:32:55,310 | INFO: Running Hmmer
2022-09-29 13:32:55,312 | INFO: Reusing Hmmer results for HMM: TIGR00973.1
2022-09-29 13:32:55,316 | INFO: Reusing Hmmer results for HMM: TIGR00171.1
2022-09-29 13:32:55,320 | INFO: Reusing Hmmer results for HMM: TIGR00170.1
2022-09-29 13:32:55,324 | INFO: Filtering results by synteny structure
2022-09-29 13:32:55,401 | INFO: Writing matching sequences to FASTA files
2022-09-29 13:32:55,512 | INFO: Finished!


[INFO] 1 patterns loaded from file
[INFO] 1 patterns loaded from file
[INFO] 1 patterns loaded from file


In [6]:
synhits_df.head()

| | contig | gene_id | gene_number | locus | strand | full_label | hmm | gene_symbol | label | product | ec_number |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | U00096 | b0071 | 71 | (78847, 79453) | neg | b0071__U00096_71_78847_79453_neg | TIGR00171.1 | leuD | leuD | 3-isopropylmalate dehydratase small subunit | 4.2.1.33 |
| 1 | U00096 | b0072 | 72 | (79463, 80864) | neg | b0072__U00096_72_79463_80864_neg | TIGR00170.1 | leuC | leuC | 3-isopropylmalate dehydratase large subunit | 4.2.1.33 |
| 2 | U00096 | b0074 | 74 | (81957, 83529) | neg | b0074__U00096_74_81957_83529_neg | TIGR00973.1 | leuA | leuA_bact | 2-isopropylmalate synthase | 2.3.3.13 |
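As a sanity check, we can confirm by hand that these three hits satisfy the synteny structure. The sketch below uses only the `full_label` values shown in the output above and plain Python, not the Pynteny API. The label layout (`<gene_id>__<contig>_<gene_number>_<start>_<end>_<strand>`) is inferred from these three rows, and `parse_label` is a hypothetical helper.

```python
# full_label values copied from the search output above
full_labels = [
    "b0071__U00096_71_78847_79453_neg",  # leuD
    "b0072__U00096_72_79463_80864_neg",  # leuC
    "b0074__U00096_74_81957_83529_neg",  # leuA
]
# Maximum in-between ORFs from the structure "<leuD 0 <leuC 1 <leuA"
max_gaps = [0, 1]


def parse_label(label: str) -> dict:
    """Split a position label into its fields (layout inferred, see above)."""
    gene_id, meta = label.split("__", 1)
    # Split from the right so that underscores in the contig name are preserved
    contig, gene_number, start, end, strand = meta.rsplit("_", 4)
    return {
        "gene_id": gene_id,
        "contig": contig,
        "gene_number": int(gene_number),
        "locus": (int(start), int(end)),
        "strand": strand,
    }


hits = [parse_label(label) for label in full_labels]

# Number of ORFs lying between each pair of consecutive hits
observed_gaps = [
    b["gene_number"] - a["gene_number"] - 1 for a, b in zip(hits, hits[1:])
]
same_contig = len({hit["contig"] for hit in hits}) == 1
all_negative = all(hit["strand"] == "neg" for hit in hits)
within_limits = all(gap <= limit for gap, limit in zip(observed_gaps, max_gaps))

print(observed_gaps)  # [0, 1]
print(same_contig, all_negative, within_limits)
```

The hits lie on the same contig and strand, with no ORF between _leuD_ and _leuC_ and a single ORF (gene number 73) between _leuC_ and _leuA_, which is within the limits set by the structure.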


## Writing results to output files

In [None]:
# We can write the hits DataFrame to a TSV file
synhits.writeToTSV(output_tsv=Path())

# Write hit peptide sequences for each gene ID / HMM to FASTA files
synhits.writeHitSequencesToFASTAfiles(
    sequence_database=Path(),
    output_dir=Path(),
    output_prefix=None
    )

In [2]:
Search.cite()

Semidán Robaina Estévez (2022). Pynteny: synteny-aware hmm searches made easy (Version 0.0.2). Zenodo. https://doi.org/10.5281/zenodo.7048685
