
![logo](https://user-images.githubusercontent.com/21340147/227912321-f76e622a-684d-48a9-8ead-9a2ce7caebe9.png)
<br/>
<br/>

[Semidán Robaina](https://github.com/Robaina), February 2023.

In this notebook, we will use MetaTag through its Python API to reconstruct a phylogenetic tree. To this end, we will use peptide sequences from the [MARref database](https://mmp2.sfb.uit.no/marref/) and a profile HMM to identify sequences belonging to the X gene.

- Note that we could have conducted the same search through MetaTag's command-line interface.

- Find more info in the [documentation pages](https://robaina.github.io/MetaTag/)!

Let's start by importing some required modules.

In [2]:
from pathlib import Path
from pandas import DataFrame
from metatag.cli import MetaTag
from metatag.pipelines import ReferenceTreeBuilder, QueryLabeller, QueryProcessor

Let's now create a directory to store the results.

In [2]:
# exist_ok=True lets this cell be re-run without raising FileExistsError
Path("example_api/data").mkdir(exist_ok=True, parents=True)

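## Download UniProt reference database

We use the [UniProt Swiss-Prot](https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz) peptide sequences (`uniprot_sprot.fasta.gz`) as the reference database. Download and decompress the file before running the cells below.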
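The download can also be scripted. The following is a minimal sketch using only the Python standard library; the helper names `decompress_gzip` and `download_uniprot` are hypothetical and not part of MetaTag, and the target directory is an assumption.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path

UNIPROT_URL = (
    "https://ftp.uniprot.org/pub/databases/uniprot/current_release/"
    "knowledgebase/complete/uniprot_sprot.fasta.gz"
)


def decompress_gzip(src: Path, dst: Path) -> Path:
    """Decompress the gzip file `src` into `dst` and return `dst`."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


def download_uniprot(data_dir: Path, url: str = UNIPROT_URL) -> Path:
    """Download the compressed FASTA into `data_dir` and decompress it.

    Hypothetical helper (not a MetaTag function); requires network access.
    """
    data_dir.mkdir(parents=True, exist_ok=True)
    archive = data_dir / "uniprot_sprot.fasta.gz"
    urllib.request.urlretrieve(url, archive)
    # "uniprot_sprot.fasta.gz" -> "uniprot_sprot.fasta"
    return decompress_gzip(archive, archive.with_suffix(""))
```

Calling `download_uniprot(Path("example_api/data"))` would return the path to the decompressed FASTA, which could then be passed as `input_database` instead of a hard-coded local path.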

## Infer gene-specific phylogenetic tree



In [3]:
tests_dir = Path("/home/robaina/Documents/MetaTag/tests")
outdir = Path("example_api/results")
outdir.mkdir(exist_ok=True, parents=True)

tree_builder = ReferenceTreeBuilder(
    input_database=Path("/home/robaina/Downloads/uniprot_sprot.fasta"),
    hmms=[
        (tests_dir / "test_data" / "TIGR01287.1.HMM").as_posix(),
        (tests_dir / "test_data" / "TIGR02016.1.HMM").as_posix(),
    ],
    maximum_hmm_reference_sizes=[20, 5],
    relabel_prefixes=["ref_", "out_"],
    relabel=True,
    remove_duplicates=True,
    hmmsearch_args="None, --cut_ga",
    output_directory=outdir,
    msa_method="muscle",
    tree_method="fasttree",
    tree_model="iqtest",
)
tree_builder.run()

2023-03-27 23:50:21,320 | INFO: Removing duplicates...
2023-03-27 23:50:24,108 | INFO: Asserting correct sequence format...
2023-03-27 23:50:38,723 | INFO: Done!
2023-03-27 23:50:38,725 | INFO: Making peptide-specific reference database...
2023-03-27 23:50:38,726 | INFO: Processing hmm TIGR01287.1 with additional arguments: --cut_nc
2023-03-27 23:50:38,727 | INFO: Running Hmmer...
2023-03-27 23:50:39,703 | INFO: Parsing Hmmer output file...
2023-03-27 23:50:39,753 | INFO: Filtering Fasta...
2023-03-27 23:50:39,937 | INFO: Filtering sequences by established length bounds...
2023-03-27 23:50:40,144 | INFO: Finding representative sequences for reference database...
2023-03-27 23:50:40,564 | INFO: Relabelling records in reference database...
2023-03-27 23:50:40,566 | INFO: Processing hmm TIGR02016.1 with additional arguments: --cut_ga
2023-03-27 23:50:40,567 | INFO: Running Hmmer...


2023-03-27 23:50:40,395 INFO:Reading PI database...
2023-03-27 23:50:40,396 INFO:Building dataframe
2023-03-27 23:50:40,400 INFO:Dataframe built
2023-03-27 23:50:40,476 INFO:Finished building database...
2023-03-27 23:50:40,476 INFO:Starting mixture of summaxacross and sumsumwithin with weight 0.5...
2023-03-27 23:50:40,476 INFO:Repset size: 20


2023-03-27 23:50:41,565 | INFO: Parsing Hmmer output file...
2023-03-27 23:50:41,567 | INFO: Filtering Fasta...
2023-03-27 23:50:41,785 | INFO: Filtering sequences by established length bounds...
2023-03-27 23:50:41,938 | INFO: Finding representative sequences for reference database...
2023-03-27 23:50:42,288 | INFO: Relabelling records in reference database...
2023-03-27 23:50:42,291 | INFO: Done!
2023-03-27 23:50:42,293 | INFO: Aligning reference database...
2023-03-27 23:50:42,339 | INFO: Inferring reference tree...


2023-03-27 23:50:42,197 INFO:Reading PI database...
2023-03-27 23:50:42,197 INFO:Building dataframe
2023-03-27 23:50:42,200 INFO:Dataframe built
2023-03-27 23:50:42,201 INFO:Finished building database...
2023-03-27 23:50:42,201 INFO:Starting mixture of summaxacross and sumsumwithin with weight 0.5...
2023-03-27 23:50:42,201 INFO:Repset size: 5


2023-03-27 23:50:42,957 | INFO: Done!
2023-03-27 23:50:42,959 | INFO: Relabelling tree...
2023-03-27 23:50:43,212 | INFO: Done!
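The log shows that the string `"None, --cut_ga"` was applied per HMM: the first HMM ran with `--cut_nc` and the second with `--cut_ga`. The sketch below illustrates one way such a comma-separated string could map onto a list of HMMs. It is an assumption inferred from the log, not MetaTag's actual parsing code; `split_hmmsearch_args` is a hypothetical helper, and the `--cut_nc` default is taken only from the log line above.

```python
from typing import List, Optional


def split_hmmsearch_args(
    args: Optional[str], n_hmms: int, default: str = "--cut_nc"
) -> List[str]:
    """Map a comma-separated argument string onto `n_hmms` HMMs.

    Entries equal to "None" fall back to `default` (assumed from the log).
    """
    if args is None:
        return [default] * n_hmms
    parsed = [entry.strip() for entry in args.split(",")]
    if len(parsed) != n_hmms:
        raise ValueError(
            f"Expected {n_hmms} comma-separated entries, got {len(parsed)}"
        )
    return [default if entry == "None" else entry for entry in parsed]


print(split_hmmsearch_args("None, --cut_ga", n_hmms=2))
# ['--cut_nc', '--cut_ga']
```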


Here is the generated tree:

In [8]:
from metatag.visualization import plot_tree_in_browser, make_tree_html


# tree_path = "/home/robaina/Documents/MetaTag/docs/examples/example_api/results/ref_database.newick"
outdir = Path("/home/robaina/Documents/MetaTag/docs/examples/example_api/tree_plot")

# def plot_tree(button):
#     plot_tree_in_browser(
#         tree_builder.reference_tree,
#         output_dir=outdir,
#     )

# make_tree_html(tree_builder.reference_tree, output_dir=outdir.as_posix())

<a href="file:///home/robaina/Documents/MetaTag/docs/examples/example_api/tree_plot/empress.html" target="_blank"><img src="example_api/empress-tree.png" style="width:50%;"></a>
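Besides the interactive plot, a quick sanity check is to count the leaves of the tree per label prefix. The sketch below works on a toy Newick string and assumes the leaves carry the `ref_` and `out_` prefixes passed as `relabel_prefixes`; the final tree written by MetaTag may use different labels after the tree relabelling step. The regular expression is a simplification that does not handle quoted Newick labels.

```python
import re
from collections import Counter
from typing import Dict, List


def leaf_labels(newick: str) -> List[str]:
    """Return leaf labels of a Newick string.

    Leaves follow "(" or ","; internal node labels follow ")" and are skipped.
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


def count_by_prefix(labels: List[str], prefixes: List[str]) -> Dict[str, int]:
    """Count how many labels start with each of the given prefixes."""
    counts: Counter = Counter()
    for label in labels:
        for prefix in prefixes:
            if label.startswith(prefix):
                counts[prefix] += 1
                break
    return dict(counts)


# Toy tree for illustration only (not produced by MetaTag)
toy_tree = "((ref_1:0.1,ref_2:0.2)0.9:0.05,out_1:0.3);"
print(count_by_prefix(leaf_labels(toy_tree), ["ref_", "out_"]))
# {'ref_': 2, 'out_': 1}
```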

## Preprocess metagenomic data

We first need to preprocess the metagenomic data to remove low-quality reads and to prefilter sequences with the same profile HMM used to infer the phylogenetic tree. This reduces the computational cost of the placement step.

In [None]:
processor = QueryProcessor(
    input_query=Path("/home/robaina/Downloads/uniprot_sprot.fasta"),
    hmm=tests_dir / "test_data" / "TIGR01287.1.HMM",
    hmmsearch_args="--cut_ga",
    minimum_sequence_length=30,
    output_directory=outdir,
)
processor.run()
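To make the effect of `minimum_sequence_length` concrete, here is a minimal, standalone sketch of a length filter over FASTA records. It only illustrates the idea and is not MetaTag's implementation; `read_fasta` and `filter_by_length` are hypothetical helpers and the sequences are made up.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(
    records: Iterable[Record],
    minimum_length: int,
    maximum_length: Optional[int] = None,
) -> List[Record]:
    """Keep records whose sequence length lies within the given bounds."""
    return [
        (header, seq)
        for header, seq in records
        if len(seq) >= minimum_length
        and (maximum_length is None or len(seq) <= maximum_length)
    ]


toy_fasta = ">a\nMKV\n>b\nMKVLAAGIVGLLLAQ\nPSWA\n"
kept = filter_by_length(read_fasta(toy_fasta.splitlines()), minimum_length=10)
print([header for header, _ in kept])
# ['b']
```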


## Place and label metagenomic data

In [None]:
# Directory where the reference tree builder wrote its outputs
# (the original cell referenced an undefined `tempdir`)
results_dir = Path("example_api/results")

labeller = QueryLabeller(
    input_query=processor.filtered_query,
    reference_alignment=tree_builder._out_reference_alignment,
    reference_tree=tree_builder._out_reference_tree,
    reference_labels=[
        (results_dir / "ref_database_id_dict.pickle").as_posix()
    ],
    tree_model="JTT",
    tree_clusters=tests_dir / "test_data" / "clusters.tsv",
    tree_cluster_scores=tests_dir / "test_data" / "cluster_scores.tsv",
    tree_cluster_score_threshold=0.6,
    alignment_method="papara",
    output_directory=results_dir,
    maximum_placement_distance=1.0,
    distance_measure="pendant_diameter_ratio",
    minimum_placement_lwr=0.8,
)
labeller.run()
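The `minimum_placement_lwr` parameter sets a lower bound on the likelihood weight ratio (LWR) of a placement. The sketch below shows the idea on made-up placement records. The record layout and the `filter_by_lwr` helper are assumptions for illustration and do not reflect MetaTag's internal data structures.

```python
from typing import Dict, List


def filter_by_lwr(placements: List[Dict], minimum_lwr: float) -> List[Dict]:
    """Keep placements whose likelihood weight ratio reaches the threshold."""
    return [p for p in placements if p["lwr"] >= minimum_lwr]


# Hypothetical placement records (query name and LWR of its best placement)
toy_placements = [
    {"query": "q1", "lwr": 0.95},
    {"query": "q2", "lwr": 0.42},
    {"query": "q3", "lwr": 0.80},
]
kept = filter_by_lwr(toy_placements, minimum_lwr=0.8)
print([p["query"] for p in kept])
# ['q1', 'q3']
```

A higher threshold keeps only queries that were placed with little ambiguity, at the cost of leaving more queries unlabelled.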

## Get citation

We can get the citation string by calling the `cite` method:

In [1]:
MetaTag.cite()

If you use this software, please cite it as below: 
Semidán Robaina Estévez (2022). MetaTag: Metagenome functional and taxonomical annotation through phylogenetic tree placement.(Version 0.1.0). Zenodo.
