![logo](https://user-images.githubusercontent.com/21340147/192824830-dcbe8d09-2b10-431d-bd9a-b4624192dcc9.png)
<br/>
<br/>

In this notebook, we illustrate the typical usage of Pynteny through its command-line interface.

In [7]:
from pathlib import Path
from IPython.display import display, HTML
import pandas as pd
from pynteny.src.utils import CommandArgs
from pynteny.src.subcommands import synteny_search, download_hmms, build_database

## Download PGAP profile HMM database

First, let's download [PGAP](https://academic.oup.com/nar/article/49/D1/D1020/6018440)'s profile HMM database from the NCBI webpage. To this end, we will use pynteny's `download` subcommand, which will unzip and store the files in the specified output directory. The metadata file will be parsed and filtered to remove HMM entries that are not available in the downloaded database (this avoids possible downstream errors).

In [22]:
%%bash

pynteny download --outdir data/hmms --unpack

## Build peptide sequence database

For this example we are going to use the [MAR reference](https://mmp2.sfb.uit.no/marref/) database (currently version _v7_), a collection of 970 fully sequenced prokaryotic genomes from the marine environment. Specifically, we will use the assembly data file containing the assembled nucleotide sequences.

Our final goal is to build a peptide sequence database in a single FASTA file where each record corresponds to an inferred ORF and is labelled with positional information (i.e., the ORF number within the parent contig as well as the DNA strand). To this end, we will run pynteny's `build` subcommand, which takes care of:

- Predicting and translating ORFs with prodigal
- Labelling each ORF with a unique identifier and adding positional metadata (with respect to the parent contig)

To follow this example, you should have previously downloaded the assembly data file, `assembly.fa`, from [MAR ref](https://mmp2.sfb.uit.no/marref/). Here is what the first lines of `assembly.fa` look like; each record corresponds to a single assembled genome:

In [3]:
%%bash 

head -n 4 /home/robaina/Databases/MAR_database/assembly.fa

>CP000435.1 Synechococcus sp. CC9311, complete genome
ACATCGTTTCCCCTGTTTCCACAAGACCTACTACGGCTGTTTTCGTAGTTCTTTTAAGAGAATAAAAACAGCCCTAAAGC
CGGGGAACACGAAAAAAACGTGAAACCATTGCGCTTCTCCCTTGCCTGTGAAATTGTGAGGAGAGATTTGTTCACGCCGT
TGACTCGGACCTCATGAAATTGGTCTGTTCCCAGGCAGAACTCAACGCAGCTCTGCAGTTGGTCAGTCGGGCTGTCGCCT


Let's run `pynteny build` to generate a peptide database labelled with positional information. The labels follow this structure:

```
<genome ID>__<contig ID>_<gene position>_<locus start>_<locus end>_<strand>
```

where gene position, locus start and locus end are given with respect to the contig.
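
To make the label layout concrete, here is a minimal Python sketch that splits one label into its fields. The helper `parse_label` and the `OrfLabel` tuple are hypothetical names introduced for illustration; they are not part of pynteny. The sketch assumes that `__` separates the genome ID from the positional part and that the last four underscore-separated fields are always gene position, locus start, locus end and strand.

```python
from typing import NamedTuple


class OrfLabel(NamedTuple):
    genome_id: str
    contig_id: str
    gene_position: int
    locus_start: int
    locus_end: int
    strand: str


def parse_label(label: str) -> OrfLabel:
    """Split a position-labelled record ID into its fields (hypothetical helper)."""
    label = label.lstrip(">")
    genome_id, positional = label.split("__", 1)
    # Contig IDs may themselves contain underscores, so split from the right.
    contig_id, gene_position, start, end, strand = positional.rsplit("_", 4)
    return OrfLabel(
        genome_id, contig_id, int(gene_position), int(start), int(end), strand
    )


# Label taken from the labelled database shown further down in this notebook
print(parse_label("CP000435.1_1__CP000435.1_1_174_1331_pos"))
```

Splitting from the right keeps the parser robust to contig IDs that contain underscores, since only the four trailing fields have a fixed meaning.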

In [18]:
%%bash

pynteny build \
    --data data/assembly.fa \
    --outfile data/labelled_marref.fasta

Here are some position-labelled predicted peptides corresponding to the assembled genome displayed above (`CP000435.1`):

In [5]:
%%bash

grep -A 1 "CP000435.1" /home/robaina/Databases/MAR_database/marref_prodigal_longlabels_mmp.faa | head -n 6

>CP000435.1_1__CP000435.1_1_174_1331_pos
MKLVCSQAELNAALQLVSRAVASRPTHPVLANVLLTADAGTDRLSLTGFDLNLGIQTSLAASVDTSGAVTLPARLLGEIVSKLSSDSPVSLSSDAGADQVELTSSSGSYQMRGMPADDFPELPLVENGTALRVDPASLLKALRATLFASSGDEAKQLLTGVHLRFNQKRLEAASTDGHRLAMLTVEDALQAEISAEESEPDELAVTLPARSLREVERLMASWKGGDPVSLFCERGQVVVLAADQMVTSRTLEGTYPNYRQLIPDGFSRTIDLDRRAFISALERIAVLADQHNNVVRIATEPATGLVQISADAQDVGSGSESLPAEINGDAVQIAFNARYVLDGLKAMDCDRVRLSCNAPTTPAILTPANDDPGLTYLVMPVQIRT*
>CP000435.1_2__CP000435.1_2_1435_2148_pos
MAWMHPPVHRLLGWVSRPSALRTSRDVWRLDQCRGFDDQQVFVKGAPAEADQITLDRLPTLLDADLLNADGERVGIIADLAFLPASGQISHYLVARSDPRLPGTSRWRLLPDRIVDQQPGLVSSAIHELDDLPLARASVRQDFLQRSRHWREQLQQFGDRAGERLEGWLEEPPWDEPPAVSDVASSYSSTAAPTVDPLDDWDDGDWTDAPRVERGRSVRNDPTDRNDWPDHEEDPWV*
>CP000435.1_3__CP000435.1_3_2185_4518_pos
MTQSSHAVAAFDLGAALRQEGLTETDYSEIQRRLGRDPNRAELGMFGVMWSEHCCYRNSRPLLSGFPTEGPRILVGPGENAGVVDLGEGHHLAFKVESHNHPSAVEPFQGAATGVGGILRDIFTMGARPIALLNALRFGPLDEPATRGLVEGVVAGIAHYGNCVGVPTVGGEVAFDPSYRGNPLVNAMALGLMETDEIVRSGAAGVGNPVVYVGSTTGRDGMGGASFASAELSADSLDDRPAVQVGDPF

## Search synteny structure in MAR ref

Finally, we are going to use pynteny's `search` subcommand to search for a specific syntenic block within the previously built peptide database. Specifically, we are interested in the [_sox_ operon](https://link.springer.com/article/10.1007/s00253-016-8026-2#Fig1):

```
>soxX 0 >soxY 0 >soxZ 0 >soxA 0 >soxB 0 >soxC
```

In [10]:
%%bash 

pynteny parse \
    --synteny_struc ">soxX 0 >soxY 0 >soxZ 0 >soxA 0 >soxB 0 >soxC" \
    --hmm_meta ../hmm_data/hmm_PGAP_no_missing.tsv


    ____              __                  
   / __ \__  ______  / /____  ____  __  __
  / /_/ / / / / __ \/ __/ _ \/ __ \/ / / /
 / ____/ /_/ / / / / /_/  __/ / / / /_/ / 
/_/    \__, /_/ /_/\__/\___/_/ /_/\__, /  
      /____/                     /____/   

Synteny-based Hmmer searches made easy, v0.0.2
Semidán Robaina Estévez (srobaina@ull.edu.es), 2022
 

2022-09-29 15:19:30,733 | INFO: Translated 
 ">soxX 0 >soxY 0 >soxZ 0 >soxA 0 >soxB 0 >soxC" 
 to 
 ">TIGR04485.1 0 >TIGR04488.1 0 >TIGR04490.1 0 >(TIGR01372.1|TIGR04484.1) 0 >(TIGR01373.1|TIGR04486.1) 0 >TIGR04555.1" 
 according to provided HMM database metadata
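
The structure string, in both its original and translated forms, can be tokenized with a few lines of Python. The sketch below is not pynteny's own parser; `parse_synteny_structure` and `SyntenyElement` are hypothetical names. It assumes that each element starts with a strand symbol (`>` or `<`), that alternative HMMs are grouped as `(a|b)` as in the translated string above, and that elements alternate with integers. Reading that integer as the maximum number of genes allowed between two neighbours is also an assumption here.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class SyntenyElement(NamedTuple):
    strand: str                 # ">" or "<", exactly as written in the string
    names: Tuple[str, ...]      # one name, or several alternatives from "(a|b)"
    gap_to_next: Optional[int]  # integer following this element; None for the last


def parse_synteny_structure(structure: str) -> List[SyntenyElement]:
    """Tokenize a synteny structure string (hypothetical helper)."""
    tokens = structure.split()
    elements = []
    # Elements sit at even positions, the integers between them at odd positions.
    for i in range(0, len(tokens), 2):
        match = re.fullmatch(r"([<>])\(?([^()]+)\)?", tokens[i])
        if match is None:
            raise ValueError(f"Unexpected token in synteny structure: {tokens[i]!r}")
        gap = int(tokens[i + 1]) if i + 1 < len(tokens) else None
        names = tuple(match.group(2).split("|"))
        elements.append(SyntenyElement(match.group(1), names, gap))
    return elements


translated = (
    ">TIGR04485.1 0 >TIGR04488.1 0 >TIGR04490.1 0 "
    ">(TIGR01372.1|TIGR04484.1) 0 >(TIGR01373.1|TIGR04486.1) 0 >TIGR04555.1"
)
for element in parse_synteny_structure(translated):
    print(element)
```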


We can see that...

In [6]:
%%bash

pynteny search \
    --synteny_struc ">soxX 0 >soxY 0 >soxZ 0 >soxA 0 >soxB 0 >soxC" \
    --data /home/robaina/Databases/MAR_database/marref_prodigal_longlabels_mmp.faa \
    --outdir example2/ \
    --hmm_dir ../hmm_data/hmm_PGAP \
    --hmm_meta ../hmm_data/hmm_PGAP_no_missing.tsv \
    --gene_ids


    ____              __                  
   / __ \__  ______  / /____  ____  __  __
  / /_/ / / / / __ \/ __/ _ \/ __ \/ / / /
 / ____/ /_/ / / / / /_/  __/ / / / /_/ / 
/_/    \__, /_/ /_/\__/\___/_/ /_/\__, /  
      /____/                     /____/   

Synteny-based Hmmer searches made easy, v0.0.2
Semidán Robaina Estévez (srobaina@ull.edu.es), 2022
 

2022-09-29 15:11:25,776 | INFO: Finding matching HMMs for gene symbols
2022-09-29 15:11:25,947 | INFO: Found the following HMMs in database for given structure:
>TIGR04485.1 0 >TIGR04488.1 0 >TIGR04490.1 0 >(TIGR01372.1|TIGR04484.1) 0 >(TIGR01373.1|TIGR04486.1) 0 >TIGR04555.1
2022-09-29 15:11:26,134 | INFO: Searching database by synteny structure
2022-09-29 15:11:26,135 | INFO: Running Hmmer
2022-09-29 15:14:35,607 | INFO: Filtering results by synteny structure
2022-09-29 15:15:58,818 | INFO: Writing matching sequences to FASTA files


[INFO] 46 patterns loaded from file
[INFO] 46 patterns loaded from file
[INFO] 46 patterns loaded from file
[INFO] 46 patterns loaded from file
[INFO] 46 patterns loaded from file
[INFO] 46 patterns loaded from file


2022-09-29 15:16:03,511 | INFO: Finished!


Let's find out which organisms these sequences belong to. We will extract the taxonomic information from the MAR ref metadata file.

In [60]:
# Assign species (GTDB) to each genome ID
meta = pd.read_csv("/home/robaina/Databases/MAR_database/MarRef_1.7.tsv", sep="\t")

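def assign_tax(genome_id: str) -> str:
    """Return the GTDB species for a genome ID, or "" if it cannot be resolved."""
    accession = genome_id.split(".")[0]  # drop the version suffix, e.g. ".2"
    try:
        return meta.loc[
            meta['acc:genbank'].str.contains(f'ena.embl:{accession}'), 'tax:gtdb_classification'
            ].item().split(">")[-1]
    except Exception:
        # No match, several matches, or missing metadata for this genome
        return ""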
    
df = pd.read_csv("example2/synteny_matched.tsv", sep="\t")
df["taxonomy"] = df.contig.apply(assign_tax)

In [67]:
# Display main results
till_row = 5
display_cols = ["gene_id", "gene_symbol", "gene_number", "locus", "strand", "hmm", "taxonomy"]
display(HTML(df.loc[:till_row, display_cols].to_html()))

| | gene_id | gene_symbol | gene_number | locus | strand | hmm | taxonomy |
|---|---|---|---|---|---|---|---|
| 0 | CP000031.2_985 | soxX | 985 | (1045616, 1046089) | pos | TIGR04485.1 | s__Ruegeria_B pomeroyi |
| 1 | CP000031.2_986 | soxY | 986 | (1046132, 1046548) | pos | TIGR04488.1 | s__Ruegeria_B pomeroyi |
| 2 | CP000031.2_987 | soxZ | 987 | (1046582, 1046911) | pos | TIGR04490.1 | s__Ruegeria_B pomeroyi |
| 3 | CP000031.2_988 | soxA | 988 | (1046989, 1047840) | pos | TIGR04484.1 | s__Ruegeria_B pomeroyi |
| 4 | CP000031.2_989 | soxB | 989 | (1047975, 1049645) | pos | TIGR04486.1 | s__Ruegeria_B pomeroyi |
| 5 | CP000031.2_990 | soxC | 990 | (1049724, 1051007) | pos | TIGR04555.1 | s__Ruegeria_B pomeroyi |
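
As a quick sanity check, the hits above can be tested against the queried structure in plain Python. The sketch below hard-codes the six rows displayed in the table and assumes that each `0` in the structure means "no intervening genes between neighbours"; that reading, and the helper name `satisfies_gaps`, are assumptions made for illustration. It also only covers hits on the positive strand, where gene numbers increase along the block.

```python
from typing import Sequence

# (gene_id, gene_symbol, gene_number, strand), copied from the results table
hits = [
    ("CP000031.2_985", "soxX", 985, "pos"),
    ("CP000031.2_986", "soxY", 986, "pos"),
    ("CP000031.2_987", "soxZ", 987, "pos"),
    ("CP000031.2_988", "soxA", 988, "pos"),
    ("CP000031.2_989", "soxB", 989, "pos"),
    ("CP000031.2_990", "soxC", 990, "pos"),
]


def satisfies_gaps(gene_numbers: Sequence[int], max_gaps: Sequence[int]) -> bool:
    """True if neighbouring hits have at most max_gaps[i] genes between them."""
    return all(
        0 <= (second - first - 1) <= gap
        for first, second, gap in zip(gene_numbers, gene_numbers[1:], max_gaps)
    )


gene_numbers = [number for _, _, number, _ in hits]
same_strand = len({strand for *_, strand in hits}) == 1
print(satisfies_gaps(gene_numbers, [0] * 5), same_strand)
```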


In [68]:
# Get original peptide sequences
hit_labels = df.loc[:till_row, "full_label"].values
grep_labels = "|".join(hit_labels)
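
The next cell retrieves the sequences with `grep`, which matches the labels as a regular expression anywhere in a line. As a pure-Python alternative, the sketch below streams a FASTA file and keeps the records whose ID matches one of the labels exactly. `iter_fasta` and `select_records` are hypothetical helpers, not pynteny functions, and the demo sequences are truncated to their first residues for brevity.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple


def iter_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def select_records(handle: TextIO, labels: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Keep only the records whose ID (first word of the header) is in labels."""
    wanted = set(labels)
    for header, sequence in iter_fasta(handle):
        if header.split(" ", 1)[0] in wanted:
            yield header, sequence


# Small in-memory demo; sequences are shortened versions of the real records
demo = io.StringIO(
    ">CP000031.2_985__CP000031.2_985_1045616_1046089_pos\nMKTTILTLAA\n"
    ">CP000031.2_986__CP000031.2_986_1046132_1046548_pos\nMDFSRRDTLG\n"
)
wanted_labels = ["CP000031.2_986__CP000031.2_986_1046132_1046548_pos"]
selected = list(select_records(demo, wanted_labels))
print(selected)
```

With a real file, `demo` would be replaced by an open file handle and `wanted_labels` by `hit_labels`.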

In [69]:
%%bash -s "$grep_labels"

grep -A 1 -E "$1" /home/robaina/Databases/MAR_database/marref_prodigal_longlabels_mmp.faa

>CP000031.2_985__CP000031.2_985_1045616_1046089_pos
MKTTILTLAAALISGAAWAGETAPGDVVYADGAVEASLTGTPGDAANGAMVVGSKKHGNCVACHQVGALADVPFQGEIGPALDGAGSRWSEAELRGLVANAKLTFEGSMMPSFYRIDGYIRPGDAYTGKAAKGALTPLLSAQEIEDVVAFLATLKDE*
>CP000031.2_986__CP000031.2_986_1046132_1046548_pos
MDFSRRDTLGLALGAAALTVLPFRVNAAAEDRIAEFTGGAEMGEGGLTLTAPEIAENGNTVPIEVSAPGAAAIMVLAMGNPTPGVAQFNFGPLAAAQAASTRIRLAGTQDVVAIAKMADGSFVKASSTVKVTIGGCGG*
>CP000031.2_987__CP000031.2_987_1046582_1046911_pos
MASGVKPRVKVPKSVAAGEAITIKTLISHAMESGQRKDKEGNVIPRSIINRFTCEFNGQSVIDITMEPAISTNPYFQFDATVPEAGEFVFTWYDDDGSVYNDNKSITIA*
>CP000031.2_988__CP000031.2_988_1046989_1047840_pos
MKVRAMTAIAALLAAPLAAVAGPDSDELVVNGEINMVTQTEAPAHLDGALSELYSGWRFRSDETQALQMDDFDNPAMVFVDQAQEAWDTADGTEGKSCASCHGDAADSMAGVRAVYPKWNEAAGEVRTLEAQVNDCRENRMGAKAWKYDGGDMASMTALISVQSRGLPVNVAIDGPAQATWEMGKEIYYTRYGQLELSCANCHEDNYGNMIRADHLSQGHINGFPTYRLKNAKLNTSHARFKGCVRDTRAETFNPGSPEFVALELYVASRGNGLSVEGPSVRN*
>CP000031.2_989__CP000031.2_989_1047975_1049645_pos
MAASALVGASGFGNWSRLAAQQALTQDQLLEFDTFGNLTLIHITD