# 2022 EY Challenge

## Frog Data

This notebook demonstrates how to extract frog location data from the Global Biodiversity Information Facility (GBIF). The GBIF occurrence dataset combines data from a wide array of sources, including specimen-related data from natural history museums, observations from citizen science networks, and automated environmental surveys. While these data are constantly changing at GBIF.org, periodic snapshots are taken and made available on the Planetary Computer. For our purposes, we are only interested in a narrow subset of the data relating to frogs.


In [69]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Import common GIS tools
import numpy as np
import xarray as xr
import rasterio.features
import xrspatial.multispectral as ms

# Import Planetary Computer tools
import stackstac
import pystac
import pystac_client
import planetary_computer

# Plotting tools
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Data science tools
import dask.dataframe as dd
import pandas as pd

# Table visualisation tools
from IPython.display import display, HTML

### Area definition

For this demonstration, we will constrain our search to frogs in the Richmond area.

In [70]:
# Richmond
min_lon, min_lat = (150.62, -33.69)  # Lower-left corner
max_lon, max_lat = (150.83, -33.48)  # Upper-right corner
bbox = (min_lon, min_lat, max_lon, max_lat)
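
As a quick sanity check on the axis order, the sketch below tests whether a point falls inside the bounding box. The helper `in_bbox`, the name `richmond_bbox`, and the two sample points are hypothetical and not part of the original notebook; the tuple order `(min_lon, min_lat, max_lon, max_lat)` matches the `bbox` defined above.

```python
# Hypothetical helper (not in the original notebook): check whether a
# longitude/latitude pair lies strictly inside a
# (min_lon, min_lat, max_lon, max_lat) bounding box.
def in_bbox(lon, lat, bbox):
    min_lon, min_lat, max_lon, max_lat = bbox
    return (min_lon < lon < max_lon) and (min_lat < lat < max_lat)


# Same corner values as the Richmond box above
richmond_bbox = (150.62, -33.69, 150.83, -33.48)

print(in_bbox(150.75, -33.60, richmond_bbox))  # a point inside the box
print(in_bbox(151.21, -33.87, richmond_bbox))  # central Sydney, outside the box
```

Longitude comes first in the tuple, so swapping the two coordinates is an easy mistake to make; a check like this catches it early.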

### Fetch GBIF dataset

Now we query the Planetary Computer for the GBIF data. Each item in the `gbif` collection is a periodic snapshot of the whole occurrence dataset rather than a spatial tile, so the query region does not narrow the search: every snapshot is returned, and we only need to access one of them. We will choose the latest snapshot.

In [72]:
stac = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

search = stac.search(
    bbox=bbox,
    collections=["gbif"],
    # query={"order": {"eq": 'Anura'}},
    
)

gbif_items = search.get_all_items()
print('Number of GBIF scenes for given region:',len(gbif_items))
for item in gbif_items:
    print(item.id)
    
# Take latest
gbif = gbif_items[0]

Number of GBIF scenes for given region: 6
gbif-2021-10-01
gbif-2021-09-01
gbif-2021-08-01
gbif-2021-07-01
gbif-2021-06-01
gbif-2021-04-13


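It is good practice to sign data items before reading them. `planetary_computer.sign` attaches a short-lived access token to the item's assets, which avoids authentication issues when querying the Planetary Computer.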

The GBIF data is very large, and is therefore spread out over 1050 partitions. We can set up a Dask dataframe over the data asset referenced by the STAC item, which lets us build a lazy query workflow and load the data one partition at a time. The following cell defines the Dask dataframe and tells it which queries to apply as each partition is loaded: keep only frogs (`order == "Anura"`), and keep only those frogs inside the Richmond bounding box.

In [78]:
# Take most recent. Sign it too.
gbif = planetary_computer.sign(gbif_items[0])
gbif_data_asset = gbif.assets['data']


df = (
    dd.read_parquet(
        gbif_data_asset.href,
        storage_options=gbif_data_asset.extra_fields["table:storage_options"],
        dataset={"require_extension": None},
    )
    [['eventdate', 'order', 'decimallatitude', 'decimallongitude']]
    .query("order == 'Anura'")
)
# Filter for the bounding box
df = df[
    (df.decimallatitude < max_lat) & 
    (df.decimallatitude > min_lat) &
    (df.decimallongitude < max_lon) & 
    (df.decimallongitude > min_lon)
]
df

,eventdate,order,decimallatitude,decimallongitude
npartitions=1050,,,,
,object,object,float64,float64
,...,...,...,...
...,...,...,...,...
,...,...,...,...
,...,...,...,...
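
Because the Dask dataframe is lazy, the output above shows only its structure, not any rows. To see what the two filters do, here is a minimal sketch that applies the same `query` and boolean mask to a small in-memory pandas dataframe. The four rows are made up for illustration, and `sample`, `frogs_only` and `in_box` are hypothetical names; only the column names and the bounding-box values come from the notebook.

```python
import pandas as pd

# Made-up rows for illustration only; column names match the GBIF table.
sample = pd.DataFrame(
    {
        "eventdate": ["2019-09-20", "2015-01-05", "2012-03-11", "2020-06-30"],
        "order": ["Anura", "Anura", "Passeriformes", "Anura"],
        "decimallatitude": [-33.60, -37.81, -33.55, -33.50],
        "decimallongitude": [150.70, 144.96, 150.72, 150.90],
    }
)

# Richmond bounding box, as defined earlier in the notebook
min_lon, min_lat = (150.62, -33.69)
max_lon, max_lat = (150.83, -33.48)

# Step 1: keep frogs only
frogs_only = sample.query("order == 'Anura'")

# Step 2: keep rows strictly inside the bounding box
in_box = frogs_only[
    (frogs_only.decimallatitude < max_lat)
    & (frogs_only.decimallatitude > min_lat)
    & (frogs_only.decimallongitude < max_lon)
    & (frogs_only.decimallongitude > min_lon)
]

print(in_box)
```

Only the first row survives both steps: the second and fourth are frogs outside the box, and the third is inside the box but is not a frog.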


In [67]:
# Function that repeats the above cell.
# Works around an authentication issue that occurs when the extraction takes a long time.
def resign_planetary_computer():
    # Take most recent. Sign it too.
    gbif = planetary_computer.sign(gbif_items[0])
    gbif_data_asset = gbif.assets['data']


    df = (
        dd.read_parquet(
            gbif_data_asset.href,
            storage_options=gbif_data_asset.extra_fields["table:storage_options"],
            dataset={"require_extension": None},
        )
        [['eventdate', 'order', 'decimallatitude', 'decimallongitude']]
        .query("order == 'Anura'")
    )
    # Filter for the bounding box
    df = df[
        (df.decimallatitude < max_lat) & 
        (df.decimallatitude > min_lat) &
        (df.decimallongitude < max_lon) & 
        (df.decimallongitude > min_lon)
    ]
    return df

### Extract data

Finally, we are able to extract the data one partition at a time. To save time, we extract only about 10% of the partitions: each partition is selected independently with probability 0.1. When the extraction is complete, we save the result as a CSV file.

In [None]:
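np.random.seed(420)

# Collect each extracted partition and concatenate once at the end
# (DataFrame.append was removed in pandas 2.0)
frog_parts = []
n_frogs = 0
for i in range(df.npartitions):
    if np.random.random() < 0.1:
        print(f'Taking {i+1} of {df.npartitions}')
        try:
            part = df.get_partition(i).compute()
        except Exception:
            # Probably an authentication failure: re-sign and retry once
            df = resign_planetary_computer()
            part = df.get_partition(i).compute()
        frog_parts.append(part)
        n_frogs += len(part)
        print(f'Frogs found so far: {n_frogs}')

frogs = pd.concat(frog_parts) if frog_parts else pd.DataFrame()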

# Save to file
(
    frogs
    .reset_index(drop=True)
    .assign(
        occurrenceStatus = 1
    )
    .rename(columns={'eventdate':'eventDate', 'decimallatitude':'decimalLatitude', 'decimallongitude':'decimalLongitude'})
    .drop(columns='order')
    .to_csv("richmond_frogs.csv", index=False)
)
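
The partition sampling can also be looked at on its own, without touching the data. The sketch below draws the partition indices up front; `sample_partitions` and `selected` are hypothetical names, and a local `np.random.RandomState` is used here in place of the global seed, which is an assumption of this sketch rather than what the notebook does.

```python
import numpy as np

n_partitions = 1050  # partition count reported for the GBIF Dask dataframe


def sample_partitions(n, p, seed):
    """Return the indices in range(n) kept by an independent draw with probability p."""
    rng = np.random.RandomState(seed)
    # One draw per partition, in order, mirroring the loop in the notebook
    return [i for i in range(n) if rng.random_sample() < p]


selected = sample_partitions(n_partitions, 0.1, seed=420)
print(f"{len(selected)} of {n_partitions} partitions selected")
print("First few indices:", selected[:5])
```

Knowing the indices in advance shows how many partitions will be read before the slow extraction starts, and makes it possible to resume from a given index after an interruption.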

When the extraction is complete, we are left with a table containing the geolocations of frog sightings in the Richmond area. These data should be used as the ground truth for your algorithm.

In [85]:
pd.read_csv("richmond_frogs.csv")

,eventDate,decimalLatitude,decimalLongitude,occurrenceStatus
0,2019-09-20T00:00:00,-33.480633,150.699869,1
1,2008-01-29T00:00:00,-33.605402,150.661154,1
2,2011-09-18T00:00:00,-33.555530,150.723270,1
3,2019-09-20T00:00:00,-33.480633,150.699869,1
4,2018-11-12T00:00:00,-33.545000,150.809000,1
...,...,...,...,...
230,2015-04-27T00:00:00,-33.523342,150.771823,1
231,2010-08-14T00:00:00,-33.683198,150.796120,1
232,2019-09-21T00:00:00,-33.500885,150.689557,1
233,2010-12-21T00:00:00,-33.685636,150.678203,1
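
Before using the table as ground truth, it may be worth running a few basic checks. The sketch below is one possible set, run on the first five rows of the table above, which stand in for `richmond_frogs.csv`. The names `frogs_check`, `inside` and `n_dupes` are hypothetical. Rows 0 and 3 of the displayed table are identical, so the duplicate check is not purely academic.

```python
import io

import pandas as pd

# First five rows copied from the table above, standing in for richmond_frogs.csv
csv_text = """eventDate,decimalLatitude,decimalLongitude,occurrenceStatus
2019-09-20T00:00:00,-33.480633,150.699869,1
2008-01-29T00:00:00,-33.605402,150.661154,1
2011-09-18T00:00:00,-33.555530,150.723270,1
2019-09-20T00:00:00,-33.480633,150.699869,1
2018-11-12T00:00:00,-33.545000,150.809000,1
"""

# Parse eventDate as a datetime column rather than leaving it as text
frogs_check = pd.read_csv(io.StringIO(csv_text), parse_dates=["eventDate"])

# Every sighting should fall strictly inside the Richmond bounding box
inside = (
    frogs_check["decimalLatitude"].between(-33.69, -33.48, inclusive="neither")
    & frogs_check["decimalLongitude"].between(150.62, 150.83, inclusive="neither")
)
print("All rows inside bounding box:", inside.all())

# Count rows that exactly repeat an earlier row
n_dupes = int(frogs_check.duplicated().sum())
print("Exact duplicate rows:", n_dupes)
```

Whether repeated records should be dropped depends on how the ground truth is used, so this sketch only counts them.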
