# How to download protein sequences from metagenomes belonging to thermal environments

User request:
```
I would like to download protein sequences from metagenomes belonging to thermal environment. Is there any way that this can be acheived.
```

In [27]:
# Requirements
!pip install requests



## Obtain the studies for hydrothermal vents

In [28]:
import requests
import csv

API = "https://www.ebi.ac.uk/metagenomics/api/v1"


def fetch(data_bag, url=""):
    if not url:
        # first request - note the filters
        print("fetching: the first page")
        response = requests.get(
            API + "/studies",
            params={"lineage": "root:Environmental:Aquatic:Marine:Hydrothermal vents"},
        )
    else:
        response = requests.get(url)

    # stop early on an HTTP error instead of failing later while parsing
    response.raise_for_status()
    response_data = response.json()
    next_url = response_data.get("links", {}).get("next")

    # Example of the attributes returned for each study:
    #   "bioproject": "PRJEB22514",
    #   "samples-count": 5,
    #   "accession": "MGYS00002034",
    #   "secondary-accession": "ERP104195",
    #   "centre-name": "EMBL-EBI",
    #   "is-public": true,
    #   "public-release-date": null,
    #   "study-abstract": "The 2014 'Omics from the diffuse hydrothermal ...",
    #   "data-origination": "SUBMITTED",
    #   "last-update": "2020-05-13T17:05:03"
    data_bag.extend([entry.get("attributes") for entry in response_data.get("data")])

    # keep getting the accessions
    if next_url:
        print(f"fetching: {next_url}")
        fetch(data_bag, next_url)


# start the fetch process
data_bag = []
fetch(data_bag)

# newline="" lets the csv module control line endings, as its documentation recommends
with open("hydrothermal_vents_studies.tsv", "w", newline="") as fhandle:
    writer = csv.writer(fhandle, delimiter="\t")
    # any other piece of information you may want to include
    writer.writerow(["accession", "bioproject", "study-abstract"])
    for study in data_bag:
        writer.writerow(
            [
                study.get("accession"),
                study.get("bioproject"),
                study.get("study-abstract"),
            ]
        )


fetching: the first page
fetching: https://www.ebi.ac.uk/metagenomics/api/v1/studies?lineage=root%3AEnvironmental%3AAquatic%3AMarine%3AHydrothermal+vents&page=2
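
The recursive `fetch` above makes one nested call per page. As a minimal sketch of an alternative, the same pagination can be written as a loop that follows the `links.next` URL until it is empty. The names `get_json` and `iter_studies` are hypothetical helpers introduced here for illustration; they are not part of the MGnify API or of the original notebook. Taking the lineage as a parameter also means the same function could be reused for any other thermal biome lineage.

```python
API = "https://www.ebi.ac.uk/metagenomics/api/v1"


def get_json(url, params=None):
    """Fetch one page and decode it (hypothetical helper)."""
    # imported here so the pagination logic below can be exercised offline
    import requests

    response = requests.get(url, params=params, timeout=60)
    response.raise_for_status()
    return response.json()


def iter_studies(lineage, get_page=get_json):
    """Yield the attributes of every study under `lineage`, page by page."""
    url, params = API + "/studies", {"lineage": lineage}
    while url:
        page = get_page(url, params)
        for entry in page.get("data", []):
            yield entry.get("attributes")
        # the "next" link already carries the query string, so drop params
        url, params = page.get("links", {}).get("next"), None
```

With this sketch, `data_bag = list(iter_studies("root:Environmental:Aquatic:Marine:Hydrothermal vents"))` would collect the same study attributes as the recursive version.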


## Use the mg-toolkit to download the files

Although it's possible to download the protein files using the REST API, we recommend the mg-toolkit, as it has some handy extra features.

Follow the instructions to install it: https://github.com/EBI-Metagenomics/emg-toolkit

    $ tail -n +2 hydrothermal_vents_studies.tsv | cut -f 1 | xargs -I {} mg-toolkit -d bulk_download -a {} -g sequence_data

Breakdown of the one-liner:
- `tail -n +2` skips the header row
- `cut -f 1` extracts the first column (the study accession)
- `xargs -I {}` runs mg-toolkit once per accession, substituting it for `{}`
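
The same steps can also be sketched in Python, which may be more robust: `tail` and `cut` work line by line, so they would misread the file if a study abstract happens to contain a line break, whereas the `csv` module parses such quoted fields correctly. The function names below are hypothetical, and the mg-toolkit arguments are copied unchanged from the one-liner. The sketch only prints the commands unless `dry_run=False` is passed.

```python
import csv
import subprocess


def read_accessions(tsv_path):
    """Return the values of the accession column, skipping the header row."""
    with open(tsv_path, newline="") as fhandle:
        reader = csv.DictReader(fhandle, delimiter="\t")
        return [row["accession"] for row in reader if row.get("accession")]


def build_command(accession):
    """Build the same mg-toolkit call as the shell one-liner."""
    return ["mg-toolkit", "-d", "bulk_download", "-a", accession, "-g", "sequence_data"]


def download_all(tsv_path, dry_run=True):
    """Print (dry_run=True) or run one mg-toolkit call per study accession."""
    for accession in read_accessions(tsv_path):
        command = build_command(accession)
        if dry_run:
            print(" ".join(command))
        else:
            # check=True raises if mg-toolkit exits with an error
            subprocess.run(command, check=True)
```

Calling `download_all("hydrothermal_vents_studies.tsv")` lists the commands first; `download_all("hydrothermal_vents_studies.tsv", dry_run=False)` runs them.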

The files for each study will be stored in a folder named MGYSXXXX.
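
For completeness, the REST API route mentioned earlier could look roughly like the sketch below. It rests on several assumptions that should be checked against a real response before relying on it: that the analyses of a study are listed under `/studies/<accession>/analyses`, that each analysis lists its files under `/analyses/<accession>/downloads`, that the human-readable label of a file sits in `attributes.description.label`, and that `links.self` is the file URL. Matching on the text "predicted cds" is also only a guess at how protein files are labelled. All function names are hypothetical.

```python
API = "https://www.ebi.ac.uk/metagenomics/api/v1"


def get_all(url):
    """Collect the "data" entries from every page of a paginated endpoint."""
    # imported here so the filtering helper below can be used offline
    import requests

    entries = []
    while url:
        response = requests.get(url, timeout=60)
        response.raise_for_status()
        page = response.json()
        entries.extend(page.get("data", []))
        url = page.get("links", {}).get("next")
    return entries


def protein_urls(download_entries):
    """Return the URLs of entries whose label mentions predicted CDS.

    ASSUMPTION: label in attributes.description.label, URL in links.self.
    """
    urls = []
    for entry in download_entries:
        description = entry.get("attributes", {}).get("description", {})
        if "predicted cds" in description.get("label", "").lower():
            urls.append(entry.get("links", {}).get("self"))
    return urls


def study_protein_urls(study_accession):
    """List candidate protein file URLs for every analysis of one study."""
    urls = []
    # ASSUMPTION: both endpoint paths below exist in this form
    for analysis in get_all(f"{API}/studies/{study_accession}/analyses"):
        downloads = get_all(f"{API}/analyses/{analysis['id']}/downloads")
        urls.extend(protein_urls(downloads))
    return urls
```

It is worth printing the labels and file extensions of a few download entries first, since the label filter may also match files that are not amino acid sequences.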