
![logo](https://user-images.githubusercontent.com/21340147/227912321-f76e622a-684d-48a9-8ead-9a2ce7caebe9.png)
<br/>
<br/>

[Semidán Robaina](https://github.com/Robaina), February 2023.

In this notebook, we will use MetaTag through its Python API to reconstruct a phylogenetic tree. To this end, we will use peptide sequences from the [MARref database](https://mmp2.sfb.uit.no/databases/) and a profile HMM to identify sequences belonging to the X gene.

- Note that we could have conducted the same analysis through MetaTag's command-line interface.

- Find more info in the [documentation pages](https://robaina.github.io/MetaTag/)!

Let's start by importing some required modules.

In [1]:
from pathlib import Path
from pandas import DataFrame
from metatag.cli import MetaTag
from metatag.pipelines import QueryLabeller, ReferenceTreeBuilder

Let's now create a directory to store the results:

In [2]:
# exist_ok=True lets this cell be re-run without raising FileExistsError
Path("example_api/data").mkdir(exist_ok=True, parents=True)

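## Download Uniprot reference database

Download the peptide sequences from [Uniprot](https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz) and decompress the archive to obtain `uniprot_sprot.fasta`, which serves as the input database in the next step.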
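The download and decompression step could be scripted along the following lines. This is only a minimal sketch using the Python standard library: the helper names `download_file` and `gunzip` and the destination folder `example_api/data` are illustrative assumptions, not part of MetaTag.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path

UNIPROT_URL = (
    "https://ftp.uniprot.org/pub/databases/uniprot/current_release/"
    "knowledgebase/complete/uniprot_sprot.fasta.gz"
)


def download_file(url: str, destination: Path) -> Path:
    """Stream a remote file to disk; skip the download if the file exists.

    Hypothetical helper, not a MetaTag function.
    """
    destination.parent.mkdir(parents=True, exist_ok=True)
    if not destination.exists():
        with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
            shutil.copyfileobj(response, out)
    return destination


def gunzip(archive: Path) -> Path:
    """Decompress 'name.gz' to 'name' in the same directory.

    Hypothetical helper, not a MetaTag function.
    """
    target = archive.with_suffix("")  # drops only the trailing '.gz'
    with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


# Usage (needs network access; the destination folder is an assumption):
# archive = download_file(UNIPROT_URL, Path("example_api/data/uniprot_sprot.fasta.gz"))
# fasta_path = gunzip(archive)
```

The resulting path can then be passed as `input_database` to `ReferenceTreeBuilder`.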

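## Infer gene-specific phylogenetic tree

We now build the reference tree from two profile HMMs, keeping at most 20 and 5 representative sequences for each of them, respectively. Sequences retrieved by the first HMM are relabelled with the prefix `ref_` and those retrieved by the second with `out_`.
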
In [4]:
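# The absolute paths in this cell are specific to the author's machine; adjust them to your setup
tests_dir = Path("/home/robaina/Documents/MetaTag/tests")
outdir = Path("example_api/results")
# exist_ok=True lets this cell be re-run without raising FileExistsError
outdir.mkdir(exist_ok=True, parents=True)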

tree_builder = ReferenceTreeBuilder(
    input_database=Path("/home/robaina/Downloads/uniprot_sprot.fasta"),
    hmms=[
        (tests_dir / "test_data" / "TIGR01287.1.HMM").as_posix(),
        (tests_dir / "test_data" / "TIGR02016.1.HMM").as_posix(),
    ],
    maximum_hmm_reference_sizes=[20, 5],
    relabel_prefixes=["ref_", "out_"],
    relabel=True,
    remove_duplicates=True,
    hmmsearch_args="None, --cut_ga",
    output_directory=outdir,
    msa_method="muscle",
    tree_method="fasttree",
    tree_model="iqtest",
)
tree_builder.run()

2023-03-27 12:03:13,151 | INFO: Removing duplicates...
2023-03-27 12:03:16,013 | INFO: Asserting correct sequence format...
2023-03-27 12:03:30,725 | INFO: Done!
2023-03-27 12:03:30,727 | INFO: Making peptide-specific reference database...
2023-03-27 12:03:30,728 | INFO: Processing hmm TIGR01287.1 with additional arguments: --cut_nc
2023-03-27 12:03:30,728 | INFO: Running Hmmer...
2023-03-27 12:03:31,702 | INFO: Parsing Hmmer output file...
2023-03-27 12:03:31,712 | INFO: Filtering Fasta...
2023-03-27 12:03:31,886 | INFO: Filtering sequences by established length bounds...
2023-03-27 12:03:32,106 | INFO: Finding representative sequences for reference database...
2023-03-27 12:03:32,524 | INFO: Relabelling records in reference database...
2023-03-27 12:03:32,526 | INFO: Processing hmm TIGR02016.1 with additional arguments: --cut_ga
2023-03-27 12:03:32,526 | INFO: Running Hmmer...


2023-03-27 12:03:32,358 INFO:Reading PI database...
2023-03-27 12:03:32,358 INFO:Building dataframe
2023-03-27 12:03:32,362 INFO:Dataframe built
2023-03-27 12:03:32,438 INFO:Finished building database...
2023-03-27 12:03:32,438 INFO:Starting mixture of summaxacross and sumsumwithin with weight 0.5...
2023-03-27 12:03:32,438 INFO:Repset size: 20


2023-03-27 12:03:33,516 | INFO: Parsing Hmmer output file...
2023-03-27 12:03:33,518 | INFO: Filtering Fasta...
2023-03-27 12:03:33,707 | INFO: Filtering sequences by established length bounds...
2023-03-27 12:03:33,867 | INFO: Finding representative sequences for reference database...
2023-03-27 12:03:34,208 | INFO: Relabelling records in reference database...
2023-03-27 12:03:34,214 | INFO: Done!
2023-03-27 12:03:34,215 | INFO: Aligning reference database...
2023-03-27 12:03:34,263 | INFO: Inferring reference tree...


2023-03-27 12:03:34,116 INFO:Reading PI database...
2023-03-27 12:03:34,116 INFO:Building dataframe
2023-03-27 12:03:34,119 INFO:Dataframe built
2023-03-27 12:03:34,119 INFO:Finished building database...
2023-03-27 12:03:34,119 INFO:Starting mixture of summaxacross and sumsumwithin with weight 0.5...
2023-03-27 12:03:34,119 INFO:Repset size: 5


2023-03-27 12:03:34,896 | INFO: Done!
2023-03-27 12:03:34,897 | INFO: Relabelling tree...
2023-03-27 12:03:35,151 | INFO: Done!


Here is the generated tree:

In [3]:
from metatag.visualization import make_tree_html


tree_path = "/home/robaina/Documents/MetaTag/docs/examples/example_api/results/ref_database.newick"
outdir = Path("/home/robaina/Documents/MetaTag/docs/examples/example_api/tree_plot")


make_tree_html(tree_path, output_dir=outdir.as_posix())

### Generated reference tree

<div style="text-align:center;">
<a href="example_api/tree_plot/empress.html"><img src="example_api/empress-tree.png" style="width:50%;"></a>
</div>
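As a quick sanity check on the generated tree, one could count how many leaves carry each relabelling prefix. The snippet below is a rough sketch that parses the Newick string with a regular expression instead of a tree library. It assumes that leaf labels are unquoted and that they still start with the `ref_` and `out_` prefixes in the final Newick file. The function names `leaf_labels` and `count_by_prefix` are hypothetical and not part of MetaTag.

```python
import re
from collections import Counter
from pathlib import Path

# A leaf label directly follows "(" or "," in Newick. Internal node labels
# and support values follow ")" and are therefore not matched.
# Assumption: labels are unquoted and contain no whitespace.
LEAF_PATTERN = re.compile(r"[(,]\s*([^(),:;\s]+)")


def leaf_labels(newick: str) -> list:
    """Return the leaf labels found in a Newick string."""
    return LEAF_PATTERN.findall(newick)


def count_by_prefix(labels, prefixes=("ref_", "out_")) -> Counter:
    """Count labels per prefix; labels matching no prefix go under 'other'."""
    counts = Counter()
    for label in labels:
        prefix = next((p for p in prefixes if label.startswith(p)), "other")
        counts[prefix] += 1
    return counts


# Toy tree for illustration only (not actual MetaTag output)
toy_tree = "((ref_1:0.1,ref_2:0.2)0.95:0.3,out_1:0.5);"
print(count_by_prefix(leaf_labels(toy_tree)))

# With the real tree (path relative to the notebook, as used above):
# newick = Path("example_api/results/ref_database.newick").read_text()
# print(count_by_prefix(leaf_labels(newick)))
```

If the prefixes are preserved, the counts should not exceed the values given in `maximum_hmm_reference_sizes`.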

## Preprocess metagenomic data

## Place and label metagenomic data

## Get citation

We can get the citation string by calling the `cite` method:

In [1]:
MetaTag.cite()

If you use this software, please cite it as below: 
Semidán Robaina Estévez (2022). MetaTag: Metagenome functional and taxonomical annotation through phylogenetic tree placement.(Version 0.1.0). Zenodo.
