Inria-Chile/awesome-ocean-ai-data

Open Ocean Data Collection for AI and ML

A curated collection of open datasets and datasources related to the ocean, marine biology and climate change.

License: CeCILL v2.1

About

This dataset collection supports the challenge “AI methods for determining ocean ecosystems from space: Combining genomic information, microscopic and satellite imagery”, to be held in conjunction with the 31st International Joint Conference on Artificial Intelligence and the 25th European Conference on Artificial Intelligence (IJCAI-ECAI-2022) on July 23-29, 2022, at Messe Wien, Vienna, Austria.

The collection lists sources for genomic, microscopic imaging and satellite data. We also list other datasets relevant to particular papers.

This compilation falls under the activities of Inria Project OcéanIA.

Note: This document will be frequently updated. Stay tuned!

Contributing

We welcome contributions to this guide. Please read CONTRIBUTING.md to learn how to proceed.

Genomic sequence datasets

Ocean Microbial Reference Gene Catalog v2 (OM-RGC v2)

  • We highly recommend viewing the metadata (Excel format) associated with the Ocean Microbial Reference Gene Catalog v2, as it gives an overview of the available data.
  • You can access the OM-RGC v2 database directly or via oceania-fasta-query.

oceania-fasta-query

An online service to extract biological subsequences of interest from large FASTA files. It currently supports the Ocean Microbial Reference Gene Catalog v2 (OM-RGC.v2) gene catalog from the Tara Oceans Expedition. If you would like to try the service on files from other catalogs, please contact the Inria Chile engineering team at desarrollo@inria.cl.
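To make concrete what "extracting a subsequence of interest" means, here is a minimal offline sketch in plain Python. It is not the service's implementation or its client API. The function names, the 1-based inclusive coordinates and the toy records are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# Translation table for complementing nucleotide sequences (illustrative only).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_id, sequence) pairs from FASTA-formatted lines."""
    seq_id: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # The identifier is the first whitespace-delimited token after '>'.
            header = line[1:].split()
            seq_id = header[0] if header else ""
            chunks = []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def extract_subsequence(sequence: str, start: int, stop: int,
                        reverse_complement: bool = False) -> str:
    """Return positions start..stop (1-based, inclusive; an assumed convention)."""
    if not 1 <= start <= stop <= len(sequence):
        raise ValueError(f"invalid range {start}..{stop} for length {len(sequence)}")
    sub = sequence[start - 1:stop]
    if reverse_complement:
        sub = sub.translate(_COMPLEMENT)[::-1]
    return sub


# Toy FASTA content, invented for this example (not taken from OM-RGC v2).
fasta_text = ">gene_1 demo\nACGTAC\nGTTA\n>gene_2\nGGGCCC\n"
records = dict(parse_fasta(fasta_text.splitlines()))
print(extract_subsequence(records["gene_1"], 2, 5))
```

For catalog-sized files this kind of scan is slow on a laptop, which is the motivation for delegating the extraction to a hosted service.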

oceania-fasta-query consists of:

  • An online service currently provided by Inria Chile, and
  • a Python client that queries the online service and lets you work locally on the results through a Python script or a Jupyter notebook.

Installing oceania-fasta-query

To use oceania-query-fasta you only need to install the Python client, for example by running:

pip install oceania-query-fasta

Using oceania-fasta-query

From your Python code or notebook, import the oceania module and call get_sequences_from_fasta() to run a query on the server. Examples are available as Jupyter notebooks in the GitHub repo oceania-query-demo.

Furthermore, omics data (including metagenome-assembled genomes, metatranscriptomes, genes, proteins and functional annotations) from plankton communities and some single-cell assays can be downloaded directly from here.

Plankton image datasets

Ecotaxa

Credit: part of this information was provided by Fabien Lombard.

Ecotaxa (Picheral et al., 2017) contains over 188 million “objects” captured by different microscopes. Part of the dataset has been manually annotated. Ecotaxa also provides an ML-based image classifier.

Ecotaxa data is available through an API that lets you programmatically query and download the parts of the datasets that are both validated and under a CC-BY license (which is the case for the Tara Oceans data).

All datasets mentioned here are under that license.

References for datasets

  • (Sorbonne Université/CNRS-Institut De La Mer De Villefranche (IMEV) & Sorbonne Université/CNRS-Laboratoire D’Océanographie De Villefranche (LOV), 2020)
  • (Sorbonne Université/CNRS, Institut De La Mer De Villefranche-Sorbonne Université, France & Sorbonne Université/CNRS, Laboratoire D’Océanographie De Villefranche-Sorbonne Université, France, 2020d)
  • (Sorbonne Université/CNRS, Institut De La Mer De Villefranche-Sorbonne Université, France & Sorbonne Université/CNRS, Laboratoire D’Océanographie De Villefranche-Sorbonne Université, France, 2020a)
  • (Sorbonne Université/CNRS, Institut De La Mer De Villefranche-Sorbonne Université, France et al., 2020)
  • (Sorbonne Université/CNRS, Institut De La Mer De Villefranche-Sorbonne Université, France & Sorbonne Université/CNRS, Laboratoire D’Océanographie De Villefranche-Sorbonne Université, France, 2020c)
  • (Sorbonne Université/CNRS, Institut De La Mer De Villefranche-Sorbonne Université, France & Sorbonne Université/CNRS, Laboratoire D’Océanographie De Villefranche-Sorbonne Université, France, 2020b)

WHOI-Plankton: 3.5 million human annotated plankton images

Annotated Plankton Images - Data Set for Developing and Evaluating Classification Methods

The data set available here comprises > 3.5 million images of microscopic marine plankton, organized according to category labels provided by researchers at the Woods Hole Oceanographic Institution (WHOI). The images are currently placed into one of 103 categories.

The annotated image set can be downloaded at the following citable URI: http://dx.doi.org/10.1575/1912/7341.
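Before training a classifier on a labelled image set like this one, it usually helps to check how many images each category holds. The sketch below assumes the extracted archive has one sub-directory per category label with the image files inside it. That layout, the function name and the list of file suffixes are assumptions, so verify them against the actual download.

```python
from collections import Counter
from pathlib import Path
from typing import Union

# File suffixes treated as images; an assumption, adjust to the real archive.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def count_images_per_category(root: Union[str, Path]) -> Counter:
    """Count image files in each immediate sub-directory of `root`.

    Each sub-directory name is treated as a category label.
    """
    counts: Counter = Counter()
    for category_dir in Path(root).iterdir():
        if not category_dir.is_dir():
            continue
        counts[category_dir.name] = sum(
            1
            for path in category_dir.iterdir()
            if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts
```

Calling `count_images_per_category("path/to/extracted/images").most_common(10)` lists the ten largest categories, a quick way to spot class imbalance before choosing a sampling or loss-weighting strategy.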

The images are part of a much larger data set (>700 million images) collected in situ by automated submersible imaging-in-flow cytometry with an instrument called Imaging FlowCytobot (IFCB) (Olson & Sosik, 2007; Sosik & Olson, 2007) at the Martha’s Vineyard Coastal Observatory (MVCO), starting in 2006 and continuing to the present. Near-real-time image data and the complete archive can be browsed and downloaded at the IFCB Data Dashboard.

Occurrence datasets

OBIS is a global open-access data and information clearing-house on marine biodiversity for science, conservation and sustainable development. OBIS harvests occurrence records from thousands of datasets and makes them available as a single integrated dataset. There are several ways to access OBIS data:

  • The mapper allows users to visualize and inspect subsets of OBIS data. A variety of filters (taxonomic, geographic, time, data quality) is available and multiple layers can be combined in a single view. Layers can be downloaded as CSV files.
  • The OBIS R package provides direct access to the OBIS database from R.
  • Both the mapper and the R package are based on the OBIS API, which third-party developers can use as well.
  • Full exports of the quality-controlled presence records are available as CSV or Parquet files here.
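For Python users, the OBIS API can also be queried directly over HTTP. The following is a minimal sketch using only the standard library. The base URL, the `occurrence` endpoint and the `scientificname` and `size` query parameters reflect our understanding of version 3 of the API and should be treated as assumptions to check against the official OBIS API documentation. The helper names are hypothetical.

```python
import json
from typing import Any, Dict
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the OBIS API (v3); verify against the OBIS documentation.
OBIS_API_BASE = "https://api.obis.org/v3"


def build_occurrence_url(scientific_name: str, size: int = 10,
                         base: str = OBIS_API_BASE) -> str:
    """Build the URL of an occurrence query for one taxon."""
    query = urlencode({"scientificname": scientific_name, "size": size})
    return f"{base}/occurrence?{query}"


def fetch_occurrences(scientific_name: str, size: int = 10,
                      timeout: float = 30.0) -> Dict[str, Any]:
    """Run the query and return the decoded JSON response (needs network access)."""
    url = build_occurrence_url(scientific_name, size)
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Only the URL is built here; call fetch_occurrences() to actually hit the API.
print(build_occurrence_url("Calanus finmarchicus", size=5))
```

Keeping URL construction separate from the network call makes the query easy to inspect and test offline. For bulk analyses, the full CSV or Parquet exports are likely a better fit than paging through the API.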

Satellite imaging datasets

Google Earth Engine

Google Earth Engine combines a multi-petabyte catalog of satellite imagery and geospatial datasets with planetary-scale analysis capabilities. Scientists, researchers, and developers use Earth Engine to detect changes, map trends, and quantify differences on the Earth’s surface. Earth Engine is now available for commercial use, and remains free for academic and research use.

Open Data on AWS

The Registry of Open Data on AWS makes it easy to find datasets made publicly available through AWS services. Browse available data and learn how to register your own datasets.

Datasets organized by the papers that use them

References

De Vargas, C., Tara Oceans Consortium, C., & Tara Oceans Expedition, P. (2015a). Total V9 rDNA information organized at the metabarcode level (Database W4), supplement to: De Vargas, Colomban; Audic, Stephane; Henry, Nicolas; Decelle, Johan; Mahe, Jean-Claude; Logares, Ramiro; Lara, Enrique; Berney, Cédric; Le Bescot, Noan; Probert, Ian; Carmichael, Margaux; Poulain, Julie; Romac, Sarah; Colin, Sébastien; Aury, Jean-Marc; Bittner, Lucie; Chaffron, Samuel; Dunthorn, Micah; Engelen, Stefan; Flegontova, Olga; Horák, Aleš; Jaillon, Olivier; Lima-Mendez, Gipsi; Lukes, Julius; Malviya, Shruti; Morard, Raphael; Mulot, Matthieu; Scalco, Eleonora; Siano, Raffaele; Zingone, Adriana; Picheral, Marc; Searson, Sarah; Kandels-Lewis, Stefanie; Acinas, Silvia G; Gorsky, Gabriel; Grimsley, Nigel; Hingamp, Pascal; Iudicone, Daniele; Not, Fabrice; Ogata, Hiroyuki; Sieracki, Michael; Speich, Sabrina; Stemmann, Lars; Sunagawa, Shinichi; Wincker, Patrick; Karsenti, Eric (2015): First Tara Oceans V9 rDNA metabarcoding dataset. PANGAEA - Data Publisher for Earth & Environmental Science. https://doi.org/10.1594/PANGAEA.843018

De Vargas, C., Tara Oceans Consortium, C., & Tara Oceans Expedition, P. (2015b). Total V9 rDNA information organized at the OTU level (Database W5), supplement to: De Vargas, Colomban; Audic, Stephane; Henry, Nicolas; Decelle, Johan; Mahe, Jean-Claude; Logares, Ramiro; Lara, Enrique; Berney, Cédric; Le Bescot, Noan; Probert, Ian; Carmichael, Margaux; Poulain, Julie; Romac, Sarah; Colin, Sébastien; Aury, Jean-Marc; Bittner, Lucie; Chaffron, Samuel; Dunthorn, Micah; Engelen, Stefan; Flegontova, Olga; Horák, Aleš; Jaillon, Olivier; Lima-Mendez, Gipsi; Lukes, Julius; Malviya, Shruti; Morard, Raphael; Mulot, Matthieu; Scalco, Eleonora; Siano, Raffaele; Zingone, Adriana; Picheral, Marc; Searson, Sarah; Kandels-Lewis, Stefanie; Acinas, Silvia G; Gorsky, Gabriel; Grimsley, Nigel; Hingamp, Pascal; Iudicone, Daniele; Not, Fabrice; Ogata, Hiroyuki; Sieracki, Michael; Speich, Sabrina; Stemmann, Lars; Sunagawa, Shinichi; Wincker, Patrick; Karsenti, Eric (2015): First Tara Oceans V9 rDNA metabarcoding dataset. PANGAEA - Data Publisher for Earth & Environmental Science. https://doi.org/10.1594/PANGAEA.843022

Lima-Mendez, G., Faust, K., Henry, N., Decelle, J., Colin, S., Carcillo, F., Chaffron, S., Ignacio-Espinosa, J. C., Roux, S., Vincent, F., Bittner, L., Darzi, Y., Wang, J., Audic, S., Berline, L., Bontempi, G., Cabello, A. M., Coppola, L., Cornejo-Castillo, F. M., … Raes, J. (2015). Ocean plankton: Determinants of community structure in the global plankton interactome. Science, 348(6237), 1262073. https://doi.org/10.1126/science.1262073

Olson, R. J., & Sosik, H. M. (2007). A submersible imaging-in-flow instrument to analyze nano-and microplankton: Imaging FlowCytobot. Limnology and Oceanography: Methods, 5(6), 195–203. https://doi.org/10.4319/lom.2007.5.195

Picheral, M., Colin, S., & Irisson, J.-O. (2017). EcoTaxa, a tool for the taxonomic classification of images. http://ecotaxa.obs-vlfr.fr

Sorbonne Université/CNRS, Institut De La Mer De Villefranche-Sorbonne Université, France, & Sorbonne Université/CNRS, Laboratoire D’Océanographie De Villefranche-Sorbonne Université, France. (2020a). Plankton community in Bongo net, MOOSE-GE cruises, north-western Mediterranean Sea. Marine Data Archive. https://doi.org/10.14284/480

Sorbonne Université/CNRS, Institut De La Mer De Villefranche-Sorbonne Université, France, & Sorbonne Université/CNRS, Laboratoire D’Océanographie De Villefranche-Sorbonne Université, France. (2020b). Plankton community in Juday-Bogorov (330µm) net, Point B, Villefranche-sur-Mer, France. Marine Data Archive. https://doi.org/10.14284/478

Sorbonne Université/CNRS, Institut De La Mer De Villefranche-Sorbonne Université, France, & Sorbonne Université/CNRS, Laboratoire D’Océanographie De Villefranche-Sorbonne Université, France. (2020c). Plankton community in Régent (680µm) net, Point B, Villefranche-sur-Mer, France. Marine Data Archive. https://doi.org/10.14284/477

Sorbonne Université/CNRS, Institut De La Mer De Villefranche-Sorbonne Université, France, & Sorbonne Université/CNRS, Laboratoire D’Océanographie De Villefranche-Sorbonne Université, France. (2020d). Plankton community in WP2 net (200µm), DYFAMED point, Ligurian Sea, France. Marine Data Archive. https://doi.org/10.14284/479

Sorbonne Université/CNRS, Institut De La Mer De Villefranche-Sorbonne Université, France, Sorbonne Université/CNRS, Laboratoire D’Océanographie De Villefranche-Sorbonne Université, France, & Office Français De La Biodiversité (OFB). (2020). Plankton community in WP2 net (200µm), PNMIR cruises, Parc naturel marin d’Iroise, France. Marine Data Archive. https://doi.org/10.14284/481

Sorbonne Université/CNRS-Institut De La Mer De Villefranche (IMEV), & Sorbonne Université/CNRS-Laboratoire D’Océanographie De Villefranche (LOV). (2020). Plankton community in WP2 net (200µm), Point B, Villefranche-sur-Mer, France. Marine Data Archive. https://doi.org/10.14284/473

Sosik, H. M., & Olson, R. J. (2007). Automated taxonomic classification of phytoplankton sampled with imaging-in-flow cytometry. Limnology and Oceanography: Methods, 5(6), 204–216. https://doi.org/10.4319/lom.2007.5.204
