
<div style="text-align:center;">
<img src="https://user-images.githubusercontent.com/21340147/227912321-f76e622a-684d-48a9-8ead-9a2ce7caebe9.png" style="width:70%;"/>
</div>
<br/>

[Semidán Robaina](https://github.com/Robaina), February 2023.

In this notebook, we will use MetaTag through its Python API to reconstruct a phylogenetic tree. To this end, we will use peptide sequences from the [MarRef database](https://mmp2.sfb.uit.no/marref/) and profile HMMs to identify sequences belonging to the gene of interest, _nosZ_.

- Note that we could have run the same analysis through MetaTag's command-line interface.

- Find more info in the [documentation pages](https://robaina.github.io/MetaTag/)!

Let's start by importing some required modules.

In [1]:
from pathlib import Path
import pandas as pd
from metatag.cli import MetaTag
from metatag.visualization import make_tree_html
from metatag.pipelines import ReferenceTreeBuilder, QueryLabeller, QueryProcessor

Let's now create a directory to store the results.

In [2]:
workdir = Path("example_api")
outdir = workdir / "results"
outdir.mkdir(exist_ok=True, parents=True)

## Download the MarRef database

Download the [MarRef](https://mmp2.sfb.uit.no/marref/) database and extract contents. We will use the `protein.faa` file, containing translated peptide sequences.

We also need to download two profile HMMs. Click on each link to download it: [TIGR04244](https://ftp.ncbi.nlm.nih.gov/hmm/current/hmm_PGAP.HMM/TIGR04244.1.HMM) and [TIGR04246](https://ftp.ncbi.nlm.nih.gov/hmm/current/hmm_PGAP.HMM/TIGR04246.1.HMM).
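The two downloads can also be scripted. The following is a minimal sketch using only the Python standard library. The helper names `hmm_url` and `download_hmms` are hypothetical (they are not part of MetaTag), and the target directory `example_api/data` is an assumption based on the paths used later in this notebook.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL taken from the download links above
NCBI_HMM_BASE = "https://ftp.ncbi.nlm.nih.gov/hmm/current/hmm_PGAP.HMM"


def hmm_url(hmm_id: str) -> str:
    """Build the download URL for a versioned HMM id such as 'TIGR04244.1'."""
    return f"{NCBI_HMM_BASE}/{hmm_id}.HMM"


def download_hmms(hmm_ids, target_dir):
    """Download each HMM into target_dir, skipping files that already exist."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for hmm_id in hmm_ids:
        destination = target_dir / f"{hmm_id}.HMM"
        if not destination.exists():
            urlretrieve(hmm_url(hmm_id), destination)
        paths.append(destination)
    return paths


# Example call (requires network access; directory name is an assumption):
# download_hmms(["TIGR04244.1", "TIGR04246.1"], Path("example_api") / "data")
```

Skipping files that are already present keeps the notebook re-runnable without downloading the profiles again.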

## Inferring a gene-specific phylogenetic tree

We will infer a phylogenetic tree for the gene _nosZ_, encoding a nitrous oxide reductase that participates in the nitrogen cycle. To this end, we will use two TIGRFAM profile HMMs: [TIGR04244.1](https://www.ncbi.nlm.nih.gov/genome/annotation_prok/evidence/TIGR04244/), which represents a TAT-dependent nitrous-oxide reductase, and [TIGR04246.1](https://www.ncbi.nlm.nih.gov/genome/annotation_prok/evidence/TIGR04246/), which represents a SEC-dependent nitrous-oxide reductase.

The class `ReferenceTreeBuilder` takes care of all the steps needed to infer the tree, namely: (i) preprocess the input MarRef database, (ii) build a reference database from the HMM hits containing a maximum of 100 representative sequences per profile HMM, using both [CD-Hit](https://github.com/weizhongli/cdhit) and [RepSet](https://onlinelibrary.wiley.com/doi/10.1002/prot.25461), (iii) align the reference sequences with [MUSCLE](https://github.com/rcedgar/muscle), and (iv) infer a phylogenetic tree from the alignment with [FastTree](https://github.com/PavelTorgashov/FastTree).
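To make the preprocessing step (i) more concrete, the sketch below shows one simple way of removing duplicated sequences from FASTA records in plain Python. It is only an illustration of the idea, not MetaTag's implementation: the function names are hypothetical, and "duplicate" is assumed here to mean an identical sequence, which may differ from what the `remove_duplicates` option does internally.

```python
def parse_fasta(text):
    """Yield (header, sequence) tuples from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def remove_duplicate_sequences(records):
    """Keep only the first record seen for each distinct sequence."""
    seen = set()
    unique = []
    for header, sequence in records:
        if sequence not in seen:
            seen.add(sequence)
            unique.append((header, sequence))
    return unique


# Toy input: seq2 repeats the sequence of seq1, seq3 spans two lines
example = """>seq1
MKTAYIAK
>seq2
MKTAYIAK
>seq3
MSLLE
QW
"""
unique_records = remove_duplicate_sequences(parse_fasta(example))
print(unique_records)  # [('seq1', 'MKTAYIAK'), ('seq3', 'MSLLEQW')]
```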

In [5]:
tree_builder = ReferenceTreeBuilder(
    input_database=Path("/home/robaina/Databases/MAR_database/protein.faa"),
    hmms=[
        workdir / "data" / "TIGR04244.1.HMM",
        workdir / "data" / "TIGR04246.1.HMM",
    ],
    maximum_hmm_reference_sizes=[100, 100],
    relabel_prefixes=["ref44_", "ref46_"],
    relabel=True,
    remove_duplicates=True,
    hmmsearch_args="--cut_ga",
    output_directory=outdir,
    msa_method="muscle",
    tree_method="fasttree",
    tree_model="JTT",
)
tree_builder.run()

2023-03-31 10:41:19,793 | INFO: Removing duplicates...
2023-03-31 10:41:41,305 | INFO: Asserting correct sequence format...
2023-03-31 10:43:10,256 | INFO: Done!
2023-03-31 10:43:10,259 | INFO: Making peptide-specific reference database...
2023-03-31 10:43:10,260 | INFO: Processing hmm TIGR04244.1 with additional arguments: --cut_ga
2023-03-31 10:43:10,261 | INFO: Running Hmmer...
2023-03-31 10:43:16,449 | INFO: Parsing Hmmer output file...
2023-03-31 10:43:16,456 | INFO: Filtering Fasta...
2023-03-31 10:43:17,364 | INFO: Filtering sequences by established length bounds...
2023-03-31 10:43:17,605 | INFO: Finding representative sequences for reference database...
2023-03-31 10:43:18,038 | INFO: Relabelling records in reference database...
2023-03-31 10:43:18,040 | INFO: Processing hmm TIGR04246.1 with additional arguments: --cut_ga
2023-03-31 10:43:18,041 | INFO: Running Hmmer...


2023-03-31 10:43:17,872 INFO:Reading PI database...
2023-03-31 10:43:17,872 INFO:Building dataframe
2023-03-31 10:43:17,875 INFO:Dataframe built
2023-03-31 10:43:17,942 INFO:Finished building database...
2023-03-31 10:43:17,942 INFO:Starting mixture of summaxacross and sumsumwithin with weight 0.5...
2023-03-31 10:43:17,942 INFO:Repset size: 100


2023-03-31 10:43:24,249 | INFO: Parsing Hmmer output file...
2023-03-31 10:43:24,295 | INFO: Filtering Fasta...
2023-03-31 10:43:25,458 | INFO: Filtering sequences by established length bounds...
2023-03-31 10:43:25,714 | INFO: Finding representative sequences for reference database...
2023-03-31 10:43:26,130 | INFO: Relabelling records in reference database...
2023-03-31 10:43:26,133 | INFO: Done!
2023-03-31 10:43:26,134 | INFO: Aligning reference database...


2023-03-31 10:43:25,987 INFO:Reading PI database...
2023-03-31 10:43:25,987 INFO:Building dataframe
2023-03-31 10:43:25,991 INFO:Dataframe built
2023-03-31 10:43:26,037 INFO:Finished building database...
2023-03-31 10:43:26,037 INFO:Starting mixture of summaxacross and sumsumwithin with weight 0.5...
2023-03-31 10:43:26,037 INFO:Repset size: 100


2023-03-31 10:43:27,276 | INFO: Inferring reference tree...
2023-03-31 10:43:34,756 | INFO: Done!
2023-03-31 10:43:34,757 | INFO: Relabelling tree...
2023-03-31 10:43:35,017 | INFO: Done!
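As a quick sanity check, we could count the leaves of the resulting tree, which should not exceed the sum of the values passed to `maximum_hmm_reference_sizes`. The sketch below uses a small regular expression on a toy Newick string. The function name and the toy labels are hypothetical, the parser only handles simple unquoted labels, and reading the file via `tree_builder.reference_tree` assumes that this attribute is a path to a Newick file.

```python
import re


def newick_leaf_labels(newick: str):
    """Return the leaf labels of a Newick string (simple unquoted labels only)."""
    # A leaf label directly follows "(" or "," and runs until ":", ",", ")" or ";".
    # Internal node labels and support values follow ")" and are therefore ignored.
    return [label.strip() for label in re.findall(r"[(,]\s*([^(),:;]+)", newick)]


# Toy tree with made-up labels and branch lengths
toy_tree = "((ref44_0:0.1,ref44_1:0.2):0.3,ref46_0:0.4);"
leaves = newick_leaf_labels(toy_tree)
print(len(leaves), leaves)  # 3 ['ref44_0', 'ref44_1', 'ref46_0']

# With the real tree (assuming reference_tree is a path to a Newick file):
# from pathlib import Path
# n_leaves = len(newick_leaf_labels(Path(tree_builder.reference_tree).read_text()))
```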


To visualize the generated tree, we will employ [empress](https://github.com/biocore/empress), which generates a web-based interactive tree. The following function calls empress and generates the html file. Click on the image to open the interactive tree in the browser.

In [6]:
make_tree_html(tree_builder.reference_tree, output_dir=outdir / "tree_plot")



<a href="file:///home/robaina/Documents/MetaTag/docs/examples/example_api/tree_plot/empress.html" target="_blank"><img src="example_api/example_tree.png" style="width:50%;"></a>

## Preprocess metagenomic data

We first need to preprocess the metagenomic data to remove low-quality reads and to prefilter sequences using the same profile HMM used to infer the phylogenetic tree. This will reduce the computational cost of the placement step. To this end, we can use the `QueryProcessor` class, which contains all the steps needed to preprocess the metagenomic data.

In [7]:
processor = QueryProcessor(
    input_query=Path("/home/robaina/Databases/Uniprot/uniprot_sprot.fasta"),
    hmms=[workdir / "data" / "TIGR04244.1.HMM"],
    hmmsearch_args="--cut_ga",
    minimum_sequence_length=30,
    output_directory=outdir,
)
processor.run()

2023-03-31 11:06:54,760 | INFO: Removing duplicates...
2023-03-31 11:06:57,267 | INFO: Asserting correct sequence format...
2023-03-31 11:07:12,543 | INFO: Done!
2023-03-31 11:07:12,545 | INFO: Making peptide-specific reference database...
2023-03-31 11:07:12,546 | INFO: Processing hmm TIGR04244.1 with additional arguments: --cut_ga
2023-03-31 11:07:12,547 | INFO: Running Hmmer...
2023-03-31 11:07:13,616 | INFO: Parsing Hmmer output file...
2023-03-31 11:07:13,619 | INFO: Filtering Fasta...
2023-03-31 11:07:13,813 | INFO: Filtering sequences by established length bounds...
2023-03-31 11:07:13,822 | INFO: No reduction algorithm has been selected.
2023-03-31 11:07:13,846 | INFO: Done!
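The log line "Filtering sequences by established length bounds" corresponds to the `minimum_sequence_length=30` argument passed above. A minimal sketch of such a filter follows. The function name is hypothetical, and treating the bounds as inclusive is an assumption rather than documented MetaTag behavior.

```python
def filter_by_length(records, min_length=None, max_length=None):
    """Keep (header, sequence) records whose length lies within the given bounds."""
    kept = []
    for header, sequence in records:
        if min_length is not None and len(sequence) < min_length:
            continue
        if max_length is not None and len(sequence) > max_length:
            continue
        kept.append((header, sequence))
    return kept


# Toy records: only sequences with at least 30 residues are kept
records = [("short", "MKT"), ("ok", "M" * 30), ("long", "M" * 45)]
kept = filter_by_length(records, min_length=30)
print([header for header, _ in kept])  # ['ok', 'long']
```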


## Place and label metagenomic data

We are now ready to place our environmental sequences onto the reference tree to infer their taxonomy and function. To this end, we will employ the `QueryLabeller` class, which takes care of all the steps needed to place and label the metagenomic data, namely: (i) preprocess the metagenomic data, (ii) place the sequences onto the reference tree with [papara](https://cme.h-its.org/exelixis/web/software/papara/index.html) and [epa-ng](https://github.com/pierrebarbera/epa-ng), and (iii) label placed sequences with [gappa](https://github.com/lczech/gappa).

In [8]:
labeller = QueryLabeller(
    input_query=processor.filtered_query,
    reference_alignment=tree_builder.reference_alignment,
    reference_tree=tree_builder.reference_tree,
    reference_labels=[
        tree_builder.reference_labels
    ],
    tree_model="JTT",
    alignment_method="papara",
    output_directory=outdir,
    maximum_placement_distance=1.0,
    distance_measure="pendant_diameter_ratio",
    minimum_placement_lwr=0.8,
)
labeller.run()

2023-03-31 11:09:45,387 | INFO: Removing duplicates...
2023-03-31 11:09:45,402 | INFO: Asserting correct sequence format...
2023-03-31 11:09:45,403 | INFO: Data already translated!
2023-03-31 11:09:45,404 | INFO: Relabelling records...
2023-03-31 11:09:45,405 | INFO: Done!
2023-03-31 11:09:45,405 | INFO: Placing reads on tree...
2023-03-31 11:09:46,986 | INFO: Writing tree with placements...
2023-03-31 11:09:46,996 | INFO: Done!
2023-03-31 11:09:46,998 | INFO: Filtering placements by maximum distance: "pendant_diameter_ratio" of 1.0
2023-03-31 11:09:47,016 | INFO: Filtering placements for tree diameter: 3.614263755
2023-03-31 11:09:47,018 | INFO: Filtering placements by minimum LWR of: 0.8
2023-03-31 11:09:47,321 | INFO: Done!
2023-03-31 11:09:47,323 | INFO: Counting labelled placements...
2023-03-31 11:09:47,946 | INFO: Done!
2023-03-31 11:09:47,948 | INFO: Relabelling tree...
2023-03-31 11:09:48,217 | INFO: Done!
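The two quality filters reported in the log can be illustrated with a small sketch. It assumes that `pendant_diameter_ratio` is the pendant branch length of a placement divided by the tree diameter, and that a placement is kept when this ratio is at most `maximum_placement_distance` and its likelihood weight ratio (LWR) is at least `minimum_placement_lwr`. The `Placement` class, the function name, and the toy values are hypothetical; only the tree diameter is taken from the log above.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    query_id: str
    lwr: float             # likelihood weight ratio of the placement
    pendant_length: float  # length of the branch leading to the query


def filter_placements(placements, tree_diameter, max_ratio=1.0, min_lwr=0.8):
    """Keep placements that pass both the distance and the LWR filter."""
    kept = []
    for placement in placements:
        ratio = placement.pendant_length / tree_diameter
        if ratio <= max_ratio and placement.lwr >= min_lwr:
            kept.append(placement)
    return kept


# Toy placements (made-up values)
placements = [
    Placement("query_0", lwr=1.0, pendant_length=0.12),   # passes both filters
    Placement("query_1", lwr=0.55, pendant_length=0.10),  # LWR too low
    Placement("query_2", lwr=0.95, pendant_length=4.20),  # pendant branch too long
]
kept = filter_placements(placements, tree_diameter=3.614263755)
print([p.query_id for p in kept])  # ['query_0']
```

Scaling the pendant length by the tree diameter makes the threshold comparable across reference trees of different sizes.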


## Display placements on the reference tree

We can now visualize the placements of the metagenomic data onto the reference tree. The tree was generated by the `QueryLabeller` using [gappa graft](https://github.com/lczech/gappa/wiki/Subcommand:-graft).

In [11]:
make_tree_html(labeller.placements_tree, output_dir=outdir / "tree_plot_placements")



## Display labelled sequences

And here are the results. The following table shows the taxonomic assignments of the query sequences that were placed onto the _nosZ_ tree and passed the applied distance filters (a quality control of the placement).

In [9]:
df = pd.read_csv(labeller.taxtable, sep="\t")
df.head()

Unnamed: 0,query_id,query_name,LWR,cluster_id,cluster_taxopath,taxopath
0,query_0,sp|P94127|NOSZ_ACHCY Nitrous-oxide reductase O...,1.0,C0,Unspecified,Unspecified
1,query_1,sp|Q89XJ6|NOSZ_BRADU Nitrous-oxide reductase O...,0.9999,C0,Unspecified,Unspecified
2,query_10,sp|Q59746|NOSZ_RHIME Nitrous-oxide reductase O...,1.0,C0,Unspecified,Unspecified
3,query_11,sp|P19573|NOSZ_STUST Nitrous-oxide reductase O...,1.0,C0,Unspecified,Unspecified
4,query_2,sp|Q8YBC6|NOSZ_BRUME Nitrous-oxide reductase O...,1.0,C0,Unspecified,Unspecified
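The `query_name` column holds the original UniProt FASTA headers, which follow the pattern `db|accession|entry_name description`. A small helper such as the one sketched below (hypothetical, not part of MetaTag) could split them into separate fields, for instance to add an accession column to the table. The example header is shortened because the descriptions are truncated in the output above.

```python
def parse_uniprot_header(header: str) -> dict:
    """Split a UniProt FASTA header of the form 'db|accession|entry_name description'."""
    identifier, _, description = header.partition(" ")
    database, accession, entry_name = identifier.split("|")
    return {
        "database": database,
        "accession": accession,
        "entry_name": entry_name,
        "description": description,
    }


info = parse_uniprot_header("sp|P94127|NOSZ_ACHCY Nitrous-oxide reductase")
print(info["accession"], info["entry_name"])  # P94127 NOSZ_ACHCY

# Possible use with the table loaded above:
# df["accession"] = df["query_name"].map(lambda h: parse_uniprot_header(h)["accession"])
```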


## Get citation

We can get the citation string by calling the `cite` method:

In [1]:
MetaTag.cite()

If you use this software, please cite it as below: 
Semidán Robaina Estévez (2022). MetaTag: Metagenome functional and taxonomical annotation through phylogenetic tree placement.(Version 0.1.0). Zenodo.
