<img src="../_resources/mgnify_logo.png" width="200px">

# Search for MGnify Studies or Samples, using MGnifyR

The [MGnify API](https://www.ebi.ac.uk/metagenomics/api/v1) returns data and relationships as JSON. 
[MGnifyR](https://github.com/beadyallen/MGnifyR) is a package to help you read MGnify data into your R analyses.

**This example shows you how to search MGnify Studies or Samples.**

You can find all of the other "API endpoints" using the [Browsable API interface in your web browser](https://www.ebi.ac.uk/metagenomics/api/v1).
This interface also lets you inspect the kinds of Filters that can be created for each list.

This is an interactive code notebook (a Jupyter Notebook).
To run this code, click into each cell and press the ▶ button in the top toolbar, or press `shift+enter`.

---

In [2]:
library(vegan)
library(ggplot2)
library(phyloseq)
library(MGnifyR)

mg <- mgnify_client(usecache = TRUE, cache_dir = '/tmp/mgnify_cache')

## Contents
- [Example: Find Polar Samples](#Example:-find-Polar-samples)
- [Example: Find Wastewater Studies](#Example:-find-Wastewater-studies)
- [More Sample filters](#More-Sample-filters)
- [More Study filters](#More-Study-filters)
- [Example: Filtering Samples both API-side and client-side](#Example:-adding-additional-filters-to-the-data-frame)

### Documentation for `mgnify_query`

In [24]:
?mgnify_query

mgnify_query {MGnifyR},R Documentation

Argument,Description
qtype,"Type of objects to query. One of studies, samples, runs or analyses."
accession,"Either a single known MGnify accession identifier (of type qtype), or a list/vector of accessions to query. Note that multiple values only work for samples, runs and assemblies ... not sure why."
asDataFrame,"Boolean flag to choose whether to return the results as a data.frame or leave as a nested list. In most cases, asDataFrame = TRUE will make the most sense."
maxhits,"Determines the maximum number of results to return. The actual number of results may be higher than maxhits, as clipping only occurs on pagination page boundaries. To disable the limit, set maxhits < 0."
usecache,"Whether to cache the result and reuse any existing cache entry instead of issuing a new callout. In general the use of caching for queries is discouraged, as new data is being uploaded to MGnify all the time, which might potentially be missed. However, for some purposes (such as analysis reproducibility) caching makes sense."
...,"Remaining parameter key/value pairs may be supplied to filter the returned values. Available options differ between qtypes. See discussion above for details."
mgnify_client,Client instance

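Because clipping happens on page boundaries, a query with `maxhits` set can return more rows than requested. The sketch below models that behaviour and shows how to trim a result with `head()`. The page size of 25 and the helper name `rows_returned` are assumptions for illustration, not values taken from MGnifyR, and the data frame is a made-up stand-in for a query result.

```r
# Hypothetical model of page-boundary clipping; page_size = 25 is an assumption.
rows_returned <- function(maxhits, total, page_size = 25) {
  if (maxhits < 0) return(total)          # a negative maxhits disables the limit
  pages <- ceiling(maxhits / page_size)   # whole pages needed to reach maxhits
  min(total, pages * page_size)           # never more than the rows that exist
}

rows_returned(maxhits = 30, total = 1000)  # 50: two full pages of 25
rows_returned(maxhits = -1, total = 1000)  # 1000: no limit applied

# To keep exactly `maxhits` rows, trim the returned data frame afterwards.
mock_result <- data.frame(accession = sprintf("ERS%03d", 1:50),
                          stringsAsFactors = FALSE)
trimmed <- head(mock_result, 30)
nrow(trimmed)                              # 30
```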

## Example: find Polar samples 

In [8]:
samps_np <- mgnify_query(mg, "samples", latitude_gte=88, maxhits=-1)
samps_sp <- mgnify_query(mg, "samples", latitude_lte=-88, maxhits=-1)
samps_polar <- rbind(samps_np, samps_sp)

In [12]:
head(samps_polar)

,biosample,latitude,longitude,accession,analysis-completed,collection-date,geo-loc-name,sample-desc,environment-biome,environment-feature,⋯,pcr conditions,host common name,host age,host body habitat,host diet,host genotype,host phenotype,host sex,sample volume or weight for DNA extraction,chemical administration
,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
ERS1568988,SAMEA98155168,89.9903,-89.2525,ERS1568988,2017-03-13,2015-09-07,Arctic Ocean,"Arctic seawater metagenome, 5 m",marine biome,marine water body,⋯,,,,,,,,,,
ERS1972379,SAMEA104347401,88.4072,-176.7614,ERS1972379,2017-11-20,,,"Station 31 sea ice, depth 0.1m",marine biome,marine water body,⋯,,,,,,,,,,
ERS1972376,SAMEA104347398,88.4072,-176.7614,ERS1972376,2017-11-20,,,"Station 31 sea ice, depth 0.5m",marine biome,marine water body,⋯,,,,,,,,,,
ERS1568987,SAMEA98154418,89.9903,-89.2525,ERS1568987,2017-03-13,2015-09-07,Arctic Ocean,"Arctic seawater metagenome, 1.5 m",marine biome,marine water body,⋯,,,,,,,,,,
ERS1972377,SAMEA104347399,88.4072,-176.7614,ERS1972377,2017-11-20,,,"Station 31 sea ice, depth 0.5m",marine biome,marine water body,⋯,,,,,,,,,,
ERS1972391,SAMEA104347413,89.9903,-89.2525,ERS1972391,2017-11-20,,,"Station 33 sea ice, depth 0.1m",marine biome,marine water body,⋯,,,,,,,,,,

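Base R's `rbind` needs both data frames to have the same columns. Sample metadata is author-submitted, so two queries may come back with different column sets, in which case the `rbind` call above would stop with an error. A minimal sketch of a workaround in base R follows. The helper name `rbind_fill` is hypothetical, and the two small data frames are made-up stand-ins for query results.

```r
# Hypothetical helper: row-bind two data frames whose columns differ,
# padding any column missing from one side with NA.
rbind_fill <- function(a, b) {
  all_cols <- union(names(a), names(b))
  for (col in setdiff(all_cols, names(a))) a[[col]] <- rep(NA_character_, nrow(a))
  for (col in setdiff(all_cols, names(b))) b[[col]] <- rep(NA_character_, nrow(b))
  rbind(a[, all_cols, drop = FALSE], b[, all_cols, drop = FALSE])
}

# Made-up stand-ins for two query results with different metadata columns.
north <- data.frame(accession = "ERS0000001", latitude = "89.99",
                    row.names = "ERS0000001", stringsAsFactors = FALSE)
south <- data.frame(accession = "ERS0000002", latitude = "-89.50",
                    `host sex` = "female", row.names = "ERS0000002",
                    check.names = FALSE, stringsAsFactors = FALSE)

polar <- rbind_fill(north, south)
polar
```

Packages such as dplyr offer similar behaviour, but the base-R version avoids adding a dependency to the notebook.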

## Example: find Wastewater studies

In [14]:
studies_ww <- mgnify_query(mg, "studies", biome_name="wastewater", maxhits=-1)

In [17]:
head(studies_ww)

,samples-count,accession,bioproject,secondary-accession,centre-name,is-public,study-abstract,study-name,data-origination,last-update,acc_type,type,public-release-date
,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
MGYS00005846,110,MGYS00005846,PRJEB47494,ERP131768,EMG,True,"The Third Party Annotation (TPA) assembly was derived from the primary whole genome shotgun (WGS) data set PRJEB27054, and was assembled with metaSPAdes v3.12.0. This project includes samples from the following biomes: root:Engineered:Wastewater:Water and sludge.",EMG produced TPA metagenomics assembly of PRJEB27054 data set (Global surveillance of antimicrobial resistance).,SUBMITTED,2021-11-18T06:32:39,studies,studies,
MGYS00005847,109,MGYS00005847,PRJEB27054,ERP109094,DTU-GE,True,"Antimicrobial resistance (AMR) is one of the most serious global public health threats, however, obtaining representative data on AMR for healthy human populations is difficult. We characterized the bacterial resistome from untreated sewage from 79 sites in 60 countries. We found systematic differences in abundance and diversity of AMR genes between Europe/North-America/Oceania and Africa/Asia/South-America. Antimicrobial use data only explained a minor part of the AMR variation and no evidence for cross-selection between antimicrobial classes nor effect of travel by flight between sites were found. However, AMR abundance was strongly correlated with socio-economic, health and environmental factors, which we used to predict AMR abundances in all countries in the world. Our findings suggest that the global AMR gene diversity and abundance varies by region and are caused by national circumstances. Improving sanitation and health could potentially limit the global burden of AMR. We propose to use sewage for an ethically acceptable and economically feasible continuous global surveillance and prediction of AMR.",Global surveillance of antimicrobial resistance,SUBMITTED,2021-11-17T00:54:21,studies,studies,
MGYS00005802,6,MGYS00005802,PRJEB43967,ERP127957,EMG,True,"The Third Party Annotation (TPA) assembly was derived from the primary whole genome shotgun (WGS) data set PRJEB8087, and was assembled with SPAdes v3.11.1, metaSPAdes v3.12.0. This project includes samples from the following biomes: root:Engineered:Wastewater:Nutrient removal:Biological phosphorus removal:Activated sludge.",EMG produced TPA metagenomics assembly of PRJEB8087 data set (Metagenomes of Danish EBPR WWTPs).,SUBMITTED,2021-10-08T10:16:00,studies,studies,
MGYS00005769,11,MGYS00005769,PRJEB42552,ERP126430,EMG,True,"The Third Party Annotation (TPA) assembly was derived from the primary whole genome shotgun (WGS) data set PRJNA479723, and was assembled with Flye v2.8.1. This project includes samples from the following biomes: root:Engineered:Wastewater:Water and sludge.",EMG produced TPA metagenomics assembly of PRJNA479723 data set (Nanopore metagenomics).,SUBMITTED,2021-07-28T17:07:01,studies,studies,
MGYS00005770,10,MGYS00005770,PRJNA479723,SRP152758,The University of Hong Kong,True,WWTPs nanopore metagenomics,Nanopore metagenomics,HARVESTED,2021-07-28T16:48:03,studies,studies,
MGYS00005741,76,MGYS00005741,PRJDB4240,DRP003823,"Mino/Satoh Laboratory, Department of Socio-cultural Environmental Studies, Graduate School of Frontier Sciences, The University of Tokyo",True,"The data is the whole sequence data for the ""Bacterial population dynamics in a laboratory activated sludge reactor monitored by pyrosequencing of 16S rRNA"", Satoh et al. (2013), Microbes and Environments, 28(1), 65-70. Bacterial community change in a laboratory activated sludge reactor was analyzed by NGS for around 8 months. A sudden disappearance of a major population was observed around day 50th, and periodical increase and recessions were observed for many of the observed OTUs.",WL Reactor published in ME,HARVESTED,2021-06-07T13:09:08,studies,studies,

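Every column in the result is a character vector, including `samples-count`, so sorting it directly would order the values as text. The sketch below converts the column to integers before ranking studies by size. It uses a three-row stand-in built from accessions and counts shown in the table above. Hyphenated column names need `[[ ]]` or backticks.

```r
# Stand-in for part of studies_ww, using values from the table above.
studies <- data.frame(
  accession = c("MGYS00005802", "MGYS00005846", "MGYS00005847"),
  `samples-count` = c("6", "110", "109"),
  check.names = FALSE,        # keep the hyphen in the column name
  stringsAsFactors = FALSE
)

# Convert the count from text to integer, then order largest first.
studies[["samples-count"]] <- as.integer(studies[["samples-count"]])
largest_first <- studies[order(studies[["samples-count"]], decreasing = TRUE), ]
largest_first$accession
```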

## More Sample filters

### By location

In [None]:
more_northerly_than <- mgnify_query(mg, "samples", latitude_gte=88, maxhits=-1)

more_southerly_than <- mgnify_query(mg, "samples", latitude_lte=-88, maxhits=-1)

more_easterly_than <- mgnify_query(mg, "samples", longitude_gte=170, maxhits=-1)

more_westerly_than <- mgnify_query(mg, "samples", longitude_lte=-170, maxhits=-1)

at_location <- mgnify_query(mg, "samples", geo_loc_name="usa", maxhits=-1)

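Once samples are in a data frame, latitude and longitude limits can also be combined client-side. The following is a sketch only: the helper name `in_bounding_box` is hypothetical, and the rows are a small stand-in that reuses coordinates from the polar table earlier, plus one made-up row with missing coordinates.

```r
# Hypothetical helper: keep rows whose coordinates fall inside a bounding box.
# Coordinates arrive as strings, so they are converted first; rows with
# missing or non-numeric coordinates are dropped.
in_bounding_box <- function(samples, lat_min, lat_max, lon_min, lon_max) {
  lat <- suppressWarnings(as.numeric(samples$latitude))
  lon <- suppressWarnings(as.numeric(samples$longitude))
  keep <- !is.na(lat) & !is.na(lon) &
    lat >= lat_min & lat <= lat_max &
    lon >= lon_min & lon <= lon_max
  samples[keep, , drop = FALSE]
}

mock_samples <- data.frame(
  accession = c("ERS1568988", "ERS1972379", "MOCK0000001"),
  latitude  = c("89.9903", "88.4072", ""),
  longitude = c("-89.2525", "-176.7614", ""),
  stringsAsFactors = FALSE
)

near_antimeridian <- in_bounding_box(mock_samples, 88, 90, -180, -170)
near_antimeridian$accession
```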
### By biome

In [None]:
biome_within_wastewater <- mgnify_query(mg, "samples", biome_name="wastewater", maxhits=-1)

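Biome lineages appear in this notebook as colon-separated paths, for example `root:Engineered:Wastewater:Water and sludge`. The sketch below splits a lineage into its levels and lists each ancestor lineage in base R, which may help when deciding how broad a `biome_name` filter should be. The variable names are illustrative.

```r
lineage <- "root:Engineered:Wastewater:Water and sludge"

# Split the lineage into its individual levels.
biome_levels <- strsplit(lineage, ":", fixed = TRUE)[[1]]

# Build every ancestor lineage, from the root down to the full path.
ancestors <- unlist(Reduce(
  function(parent, child) paste(parent, child, sep = ":"),
  biome_levels,
  accumulate = TRUE
))
ancestors
```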
### By metadata
There are many metadata key:value pairs, because authors submit them along with the samples to the ENA (European Nucleotide Archive).

If you know how to specify the metadata key:value query for the samples you're interested in, you can use this form to find matching Samples:

In [18]:
from_ex_smokers <- mgnify_query(mg, "samples", metadata_key="smoker", metadata_value="ex-smoker", maxhits=-1)

To find `metadata_key`s and values, it is best to browse the [interactive API Browser](https://www.ebi.ac.uk/metagenomics/api/v1/samples) and use the `Filters` button to construct queries interactively at first.

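When moving between the API Browser and R, it can help to see a filter written out as a URL. The sketch below assumes that filter arguments map one-to-one onto URL query parameters of the same name. That is an assumption about how the API and MGnifyR relate, not something stated in this notebook. The helper name `build_filter_url` is hypothetical, and no request is sent.

```r
# Hypothetical helper: write named filters as a query string on an endpoint
# of the MGnify API. It only builds the URL text.
build_filter_url <- function(endpoint, ...) {
  filters <- list(...)
  base <- paste0("https://www.ebi.ac.uk/metagenomics/api/v1/", endpoint)
  if (length(filters) == 0) return(base)
  encoded <- vapply(
    filters,
    function(v) utils::URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  paste0(base, "?", paste0(names(filters), "=", encoded, collapse = "&"))
}

smoker_url <- build_filter_url("samples",
                               metadata_key = "smoker",
                               metadata_value = "ex-smoker")
smoker_url
```
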
--- 
## More Study filters

### By Centre Name

In [21]:
from_smithsonian <- mgnify_query(mg, "studies", centre_name="Smithsonian", maxhits=-1)

---

## Example: adding additional filters to the data frame

First, fetch some samples from the Lentic biome. We can specify the entire Biome lineage, too.

In [25]:
lentic_samples <- mgnify_query(mg, "samples", biome_name="root:Environmental:Aquatic:Lentic", usecache=TRUE)

Now, also filter by depth *within* the returned results, using normal R syntax.

In [27]:
depth_numeric <- as.numeric(lentic_samples$depth)  # We must convert data from MGnifyR (always strings) to numerical format.
depth_numeric[is.na(depth_numeric)] <- 0.0  # If depth data is missing, assume it is surface-level.
lentic_subset <- lentic_samples[depth_numeric >= 25 & depth_numeric <= 50, ]  # Filter to samples collected between 25 m and 50 m down.
lentic_subset

,latitude,biosample,longitude,accession,collection-date,sample-desc,sample-name,sample-alias,last-update,geographic location (longitude),⋯,instrument model,last update date,investigation type,project name,geographic location (depth),geographic location (altitude),environmental package,sequencing method,NCBI sample classification,ENA checklist
,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
SRS992699,17.39,SAMN03860260,40.54,SRS992699,2011-10-15,12,sample03,sample03,2020-05-18T00:52:00,40.54,⋯,Illumina HiSeq 2000,,,,,,,,,
SRS992702,20.31,SAMN03860274,38.46,SRS992702,2011-10-15,91,sample17,sample17,2020-05-18T00:51:47,38.46,⋯,Illumina HiSeq 2000,,,,,,,,,
SRS992693,17.39,SAMN03860259,40.54,SRS992693,2011-10-15,12,sample02,sample02,2020-05-18T00:50:43,40.54,⋯,Illumina HiSeq 2000,,,,,,,,,
SRS992705,23.36,SAMN03860286,37.3,SRS992705,2011-10-15,149,sample29,sample29,2020-05-18T00:46:05,37.3,⋯,Illumina HiSeq 2000,,,,,,,,,
SRS992692,18.34,SAMN03860268,40.44,SRS992692,2011-10-15,34,sample11,sample11,2020-05-18T00:45:26,40.44,⋯,Illumina HiSeq 2000,,,,,,,,,
SRS992710,22.2,SAMN03860281,37.55,SRS992710,2011-10-15,108,sample24,sample24,2020-05-18T00:35:28,37.55,⋯,Illumina HiSeq 2000,,,,,,,,,
SRS992714,25.46,SAMN03860292,36.6,SRS992714,2011-10-15,169,sample35,sample35,2020-05-18T00:35:15,36.6,⋯,Illumina HiSeq 2000,,,,,,,,,
SRS992704,23.36,SAMN03860287,37.3,SRS992704,2011-10-15,149,sample30,sample30,2020-05-18T00:27:10,37.3,⋯,Illumina HiSeq 2000,,,,,,,,,
SRS992713,25.46,SAMN03860293,36.6,SRS992713,2011-10-15,169,sample36,sample36,2020-05-18T00:13:07,36.6,⋯,Illumina HiSeq 2000,,,,,,,,,
SRS992696,22.2,SAMN03860280,37.55,SRS992696,2011-10-15,108,sample23,sample23,2020-05-18T00:06:42,37.55,⋯,Illumina HiSeq 2000,,,,,,,,,

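Treating missing depths as surface-level is a modelling choice. It does not affect the 25 m to 50 m filter above, but it would change the result of any filter whose range includes 0. The sketch below, on a made-up data frame, counts the missing values and compares a shallow-water filter with and without that assumption.

```r
# Made-up stand-in for a query result; depth arrives as text and may be missing.
mock <- data.frame(
  accession = paste0("S", 1:5),
  depth = c("30", "", "12", "45", NA),
  stringsAsFactors = FALSE
)

depth <- suppressWarnings(as.numeric(mock$depth))
n_missing <- sum(is.na(depth))          # how many rows lack a usable depth

# Filter for shallow samples (10 m or less) under each treatment.
imputed <- ifelse(is.na(depth), 0, depth)
shallow_with_imputation <- mock[imputed <= 10, , drop = FALSE]
shallow_observed_only  <- mock[!is.na(depth) & depth <= 10, , drop = FALSE]

c(missing = n_missing,
  imputed = nrow(shallow_with_imputation),
  observed = nrow(shallow_observed_only))
```

Here both "shallow" rows in the imputed version are samples with no recorded depth, so checking `n_missing` first shows how much a result depends on the assumption.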