# Welcome to the OBIS biodiversity notebook 


In this notebook, we offer access to our pre-processed biodiversity dataset from OBIS (https://obis.org/), stored in our cloud storage.

<div style="text-align: justify; max-width: 50%">
OBIS (Ocean Biodiversity Information System) is a global, open-access data and information system for marine biodiversity. It provides access to various types of marine biodiversity data, including species occurrences, distribution records, and ecological data. OBIS aggregates data from numerous sources, including research institutions, government agencies, and citizen science initiatives, and makes it available through a standardized platform. Scientists, policymakers, and the general public can use OBIS to explore and analyze marine biodiversity data, track changes in marine ecosystems over time, and support conservation and management efforts.
</div>

<img src="taxonomy_structure.png" alt="Image 1" style="width: 800px; height: 400px;">

# Data available in our selection from the OBIS dataset

Taxonomic classification is a complex hierarchical structure. Below we define the main ranks (nodes) of a simplified version of that hierarchy, followed by the other columns available in the dataset.

## Year & Month:

The data cover the period from 1960 to the present.
Note: `month == 0` means the month is unknown.
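
The `month == 0` convention matters when aggregating by month or season. A minimal sketch of separating dated records from those with an unknown month, using plain dictionaries with the `year` and `month` columns (the sample records are made up for illustration):

```python
# Hypothetical sample records; only the `year` and `month` columns are used.
records = [
    {"year": 1982, "month": 10},
    {"year": 1998, "month": 0},   # month 0 -> month unknown
    {"year": 1973, "month": 4},
]

def split_by_known_month(rows):
    """Return (known, unknown) lists, treating month == 0 as unknown."""
    known = [r for r in rows if r["month"] != 0]
    unknown = [r for r in rows if r["month"] == 0]
    return known, unknown

known, unknown = split_by_known_month(records)
print(len(known), "records with a known month;", len(unknown), "with month unknown")
```

On the GeoDataFrame loaded later in this notebook, the equivalent filter would be `gdf[gdf["month"] != 0]`.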

## Kingdom:

Definition: The highest taxonomic rank in this dataset, grouping together organisms that share the same fundamental characteristics.

Example: Animalia, Plantae, Fungi, Protista, etc.

## Phylum:

Definition: A taxonomic rank below kingdom and above class, representing a group of organisms with similar body plans and structural organization.

Example: Chordata (includes vertebrates), Arthropoda (includes insects and crustaceans), Mollusca (includes snails and octopuses), etc.

## Class:

Definition: A taxonomic rank below phylum and above order, representing a group of organisms with similar characteristics, anatomical features, or developmental patterns.

Example: Mammalia (includes mammals), Aves (includes birds), Reptilia (includes reptiles), Insecta (includes insects), etc.

## Order:

Definition: A taxonomic rank below class and above family, representing a group of related families with similar characteristics and evolutionary relationships.

Example: Carnivora (includes carnivorous mammals like cats and dogs), Coleoptera (includes beetles), Rodentia (includes rodents), Primates (includes primates such as monkeys and apes), etc.

## Family:

Definition: A taxonomic rank below order and above genus, representing a group of related genera with similar characteristics and evolutionary relationships.

Example: Felidae (includes cats), Canidae (includes dogs), Hominidae (includes humans and great apes), Rosaceae (includes roses), etc.

## Genus:

Definition: A taxonomic rank below family and above species, representing a group of closely related species sharing a common ancestor and exhibiting similar morphological and genetic traits.

Example: Felis (genus of cats), Canis (genus of dogs), Homo (genus of humans), Rosa (genus of roses), etc.

## Species:

Definition: The lowest and most specific taxonomic rank, representing a group of individuals that can interbreed and produce fertile offspring in nature, and that share common characteristics, traits, and genetic material.

Example: Felis catus (domestic cat), Canis lupus (gray wolf), Homo sapiens (modern human), Rosa gallica (French rose), etc.
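
Not every record is identified down to species: in the sample output at the end of this notebook, some rows stop at genus or order and the lower columns are empty. A small sketch of finding the most specific rank available for a record, assuming missing ranks appear as `None`, an empty string, or NaN (the helper names are ours, not part of OBIS):

```python
# Ranks ordered from most general to most specific, matching the columns above.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

def is_missing(value):
    """True for None, empty strings, and NaN (NaN is the only value != itself)."""
    return value is None or value == "" or value != value

def lowest_identified_rank(record):
    """Return (rank, name) for the most specific non-missing rank, or None."""
    for rank in reversed(RANKS):
        name = record.get(rank)
        if not is_missing(name):
            return rank, name
    return None

# Example record identified only down to genus.
record = {
    "kingdom": "Animalia", "phylum": "Arthropoda", "class": "Copepoda",
    "order": "Calanoida", "family": "Calanidae", "genus": "Calanus",
    "species": None,
}
print(lowest_identified_rank(record))
```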

## Bathymetry: 

Definition: Bathymetry represents the measurement of water depth at the survey locations.

## Marine:

Definition: Flag indicating whether the organism occurs in marine environments (oceans or seas).

## Freshwater:

Definition: Flag indicating whether the organism occurs in freshwater environments, such as rivers, lakes, or streams.

## Terrestrial:

Definition: Flag indicating whether the organism occurs in terrestrial environments, that is, ecosystems or habitats on land.
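
In the sample output at the end of this notebook, these three columns hold 1, 0, or an empty value. A sketch of turning them into a list of habitat labels, assuming that 1 means the flag is set and that empty values (`None` or NaN) mean unknown rather than false; that interpretation is our assumption:

```python
HABITAT_COLUMNS = ("marine", "freshwater", "terrestrial")

def habitats(record):
    """Return the habitat labels whose flag equals 1; missing values are skipped."""
    labels = []
    for column in HABITAT_COLUMNS:
        value = record.get(column)
        # None and NaN both compare unequal to 1, so they are skipped here.
        if value == 1:
            labels.append(column)
    return labels

print(habitats({"marine": 1, "freshwater": 0, "terrestrial": 0}))
print(habitats({"marine": 1, "freshwater": 1, "terrestrial": None}))
```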

## ScientificName: 

The scientific name of the recorded taxon, following taxonomic conventions. When a record is not identified to species level, this is the name at the lowest identified rank (for example, a genus or an order).

## OriginalScientificName:

The scientific name as originally reported, which can differ from the current scientific name because of taxonomic revisions or updates.

## SST: 

Sea Surface Temperature measurements, which are important for understanding oceanic climate patterns and environmental conditions.

## SSS: 

Sea Surface Salinity measurements, which are important for studying oceanic circulation patterns and marine ecosystems.

## Geometry: 

Geometric information, such as points, lines, or polygons, representing spatial features like geographic locations, boundaries, or shapes.

# Interesting links:

OBIS: https://obis.org/

Shannon index and other diversity indices: https://en.wikipedia.org/wiki/Diversity_index
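
For reference, the Shannon index of a sample is `H = -sum(p_i * ln(p_i))`, where `p_i` is the proportion of records belonging to species `i`. A minimal sketch using only the standard library; the observation list is illustrative and not a result from the dataset:

```python
import math
from collections import Counter

def shannon_index(labels):
    """Shannon diversity H = -sum(p * ln(p)) over the distinct labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no observations, no diversity
    return -sum((n / total) * math.log(n / total) for n in counts.values())

# Illustrative observations: one species seen twice, two species seen once.
observations = [
    "Calanus helgolandicus", "Calanus helgolandicus",
    "Odontella aurita", "Gyrodinium pingue",
]
print(round(shannon_index(observations), 4))
```

The index is 0 when every record belongs to the same species and grows as records are spread more evenly over more species. With a GeoDataFrame, the `species` column could be passed in after dropping missing values.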


# Data access

A token to access our data lake has been generated for the hackathon and will expire on 21 May.

In [1]:
# Standard library
from glob import glob
from io import BytesIO

# Third-party libraries
import dask
import dask.dataframe as dd
import dask_geopandas
import geopandas as gpd
import numpy as np
import pandas as pd
import yaml
from azure.storage.blob import BlobClient, BlobServiceClient
from shapely.geometry import Point, Polygon
from tqdm import tqdm

# Local helper module (provides blob_param and download_obis_within_polygon used below)
from blob_tools import *

dask.config.set({'dataframe.query-planning-warning': False})



<dask.config.set at 0x7f8055fed690>

In [2]:
CONFIG_FILE = './config.yaml'
obis_blob = blob_param(CONFIG_FILE, 'obis-biodiversity')
vestas_blob = blob_param(CONFIG_FILE, 'vestas')

In [3]:
obis_container = obis_blob.container
vestas_container = vestas_blob.container

# Get spatial box from vestas samples

We have four Vestas samples stored in our cloud. Their SAS URLs are available from the Vestas blob object (`vestas_blob.sas_url_list`).

In [4]:
vestas_blob.sas_url_list

['https://stodpdaskuserspace.blob.core.windows.net/vestas/vcl_ts_CN.csv?si=vestas_hackathon_policy&spr=https&sv=2022-11-02&sr=c&sig=M8agHJFDQcypdAvMEy7%2FATkjDHPI7wTCmRL4pbmEgsc%3D',
 'https://stodpdaskuserspace.blob.core.windows.net/vestas/vcl_ts_NL.csv?si=vestas_hackathon_policy&spr=https&sv=2022-11-02&sr=c&sig=M8agHJFDQcypdAvMEy7%2FATkjDHPI7wTCmRL4pbmEgsc%3D',
 'https://stodpdaskuserspace.blob.core.windows.net/vestas/vcl_ts_NO.csv?si=vestas_hackathon_policy&spr=https&sv=2022-11-02&sr=c&sig=M8agHJFDQcypdAvMEy7%2FATkjDHPI7wTCmRL4pbmEgsc%3D',
 'https://stodpdaskuserspace.blob.core.windows.net/vestas/vcl_ts_US.csv?si=vestas_hackathon_policy&spr=https&sv=2022-11-02&sr=c&sig=M8agHJFDQcypdAvMEy7%2FATkjDHPI7wTCmRL4pbmEgsc%3D']

In [5]:
# Initialize the BlobClient with the blob URL
blob_client = BlobClient.from_blob_url(blob_url=vestas_blob.sas_url_list[0])

# Download the blob content
blob_content = blob_client.download_blob().readall()

# Use BytesIO to handle the byte stream, then read it into a pandas DataFrame
data = BytesIO(blob_content)
df = pd.read_csv(data, header=17)
df.head()

Unnamed: 0,timestamp,hfx,lh,pblh,psfc,rainnc,sst,swdown,t2,ust,...,wdir_25.0,wdir_50.0,wdir_75.0,wdir_100.0,wdir_150.0,wdir_200.0,wdir_250.0,wdir_300.0,wdir_400.0,wdir_500.0
0,2010-01-01 07:00:00,18.95,81.61,777.93,101534.47,1.1968,293.18,128.83,292.18,0.35,...,74.0615,74.2794,74.4538,74.6117,74.8905,75.185,75.5469,77.6991,87.7681,103.0733
1,2010-01-01 08:00:00,16.04,73.96,867.84,101526.87,1.1975,293.18,60.25,292.31,0.33,...,71.2928,71.5184,71.6998,71.8609,72.1285,72.4152,72.7708,75.0346,85.4908,100.8676
2,2010-01-01 09:00:00,15.37,72.4,778.78,101514.88,1.2013,293.18,18.07,292.35,0.34,...,70.7942,71.0364,71.2272,71.3979,71.6907,72.0205,72.4529,74.9257,86.775,102.9818
3,2010-01-01 10:00:00,14.94,72.43,1067.08,101535.66,1.2075,293.18,0.0,292.37,0.34,...,68.1704,68.3533,68.5021,68.6372,68.8779,69.1463,69.5009,72.0196,84.0302,99.4804
4,2010-01-01 11:00:00,12.61,66.93,779.91,101595.66,1.2155,293.18,0.0,292.46,0.32,...,72.6123,72.7964,72.9523,73.097,73.3658,73.6911,74.1212,76.6491,89.8833,107.1794


In [6]:
# The four selected areas are as follows.
# Corners are (longitude, latitude) pairs, because shapely reads each pair as (x, y).
# They are listed in ring order so that each polygon is valid (not self-intersecting).
Norway = [
    (4.26917, 59.44806),
    (4.67361, 59.48222),
    (4.81222, 59.10500),
    (4.40750, 59.06944),
]

Netherlands = [
    (3.66318, 52.6396),
    (3.94714, 52.6396),
    (3.96714, 52.86992),
    (3.66318, 52.85233),
]

South_China_Sea = [
    (111.444392, 21.134832),
    (111.755552, 21.134832),
    (111.755552, 21.405112),
    (111.444392, 21.405112),
]

US_east_coast = [
    (-71.245652, 41.014832),
    (-70.894292, 41.014832),
    (-70.894292, 41.285112),
    (-71.245652, 41.285112),
]
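
Each area is a list of corner coordinates. Because shapely treats the first value of each pair as x (longitude) and the second as y (latitude), and expects the corners in ring order, it can help to sanity-check an area before running an extraction. A standard-library sketch (the helper names are ours) that computes the bounding box of an area and tests whether a point falls inside it with the ray-casting rule:

```python
def bounding_box(corners):
    """Return (min_lon, min_lat, max_lon, max_lat) for (lon, lat) corners."""
    lons = [lon for lon, _ in corners]
    lats = [lat for _, lat in corners]
    return min(lons), min(lats), max(lons), max(lats)

def contains(ring, point):
    """Ray-casting point-in-polygon test; `ring` lists corners in ring order."""
    x, y = point
    inside = False
    for i in range(len(ring)):
        x1, y1 = ring[i]
        x2, y2 = ring[i - 1]  # previous corner; wraps around when i == 0
        # Does a horizontal ray going right from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Norway area as (lon, lat), corners listed clockwise from the north-west.
norway_ring = [
    (4.26917, 59.44806),
    (4.67361, 59.48222),
    (4.81222, 59.10500),
    (4.40750, 59.06944),
]

print(bounding_box(norway_ring))
print(contains(norway_ring, (4.48, 59.3567)))
```

In the notebook itself, `Polygon(...).is_valid` and `Polygon(...).bounds` from shapely give the same kind of check.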

# Spatial subset selection example for OBIS

Note: Feel free to try your own area. A subset extraction typically takes about 2 minutes. If your area is too large, the notebook may run out of memory; in that case, check whether you can increase your machine capacity ('Medium' is the largest machine available). However the area should still run even if you

In [7]:
# Create a Polygon object from the coordinates
polygon = Polygon(Norway)

In [8]:
gdf = download_obis_within_polygon(obis_blob, polygon)

  0%|          | 0/132 [00:04<?, ?it/s]


ValueError: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: Missing geo metadata in Parquet/Feather file.
            Use pandas.read_parquet/read_feather() instead.

In [11]:
gdf

Unnamed: 0,year,month,kingdom,phylum,class,order,family,genus,species,bathymetry,marine,freshwater,terrestrial,scientificName,originalScientificName,sst,sss,geometry
0,1982,10,Animalia,Arthropoda,Copepoda,Calanoida,Calanidae,Calanus,Calanus helgolandicus,265.6,1,0,0,Calanus helgolandicus,Calanus helgolandicus,10.11,32.58,POINT (4.48000 59.35670)
1,1973,4,Animalia,Arthropoda,Copepoda,Calanoida,Calanidae,Calanus,Calanus helgolandicus,259.4,1,0,0,Calanus helgolandicus,Calanus helgolandicus,10.1,32.44,POINT (4.51830 59.17500)
2,1982,3,Chromista,Ochrophyta,Bacillariophyceae,Triceratiales,Triceratiaceae,Odontella,Odontella aurita,267,1,,0,Odontella aurita,Odontella aurita,10.09,32.56,POINT (4.50170 59.43670)
3,1998,0,Chromista,Myzozoa,Dinophyceae,Noctilucales,Kofoidiniaceae,Spatulodinium,Spatulodinium pseudonoctiluca,256.6,1,,,Spatulodinium pseudonoctiluca,Spatulodinium pseudonoctiluca,10.12,32.51,POINT (4.45300 59.09200)
4,1982,10,Animalia,Arthropoda,Copepoda,Calanoida,Acartiidae,Acartia,,265.6,1,0,0,Acartia,Acartia,10.11,32.58,POINT (4.48000 59.35670)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
111,2017,4,Animalia,Arthropoda,Copepoda,Calanoida,,,,256.4,1,1,,Calanoida,Calanoida,10.12,32.51,POINT (4.45800 59.09700)
112,1973,4,Animalia,Arthropoda,Copepoda,Calanoida,Calanidae,Calanus,,259.4,1,0,0,Calanus,Calanus,10.1,32.44,POINT (4.51830 59.17500)
113,1998,0,Chromista,Myzozoa,Dinophyceae,Gymnodiniales,Gymnodiniaceae,Gyrodinium,Gyrodinium pingue,256.6,1,,,Gyrodinium pingue,Gyrodinium pingue,10.12,32.51,POINT (4.45300 59.09200)
114,1998,0,Chromista,Myzozoa,Dinophyceae,Gonyaulacales,Ostreopsidaceae,Alexandrium,,256.6,1,,,Alexandrium,Alexandrium,10.12,32.51,POINT (4.45300 59.09200)
