## Accessing GBIF data on Azure

This notebook provides an example of accessing Global Biodiversity Information Facility (GBIF) occurrence data from blob storage on Azure.  Periodic snapshots of the data are stored in Parquet format.

This dataset is stored in the West Europe Azure region, so this notebook will run most efficiently on Azure compute located in the same region.  If you are using this data for environmental science applications, consider applying for an [AI for Earth grant](http://aka.ms/ai4egrants) to support your compute requirements.

This dataset is documented at [aka.ms/ai4edata-gbif](http://aka.ms/ai4edata-gbif).

### Imports and constants

In [None]:
import dask.dataframe as dd
from adlfs import AzureBlobFileSystem

storage_account_name = 'ai4edataeuwest'
sas_token = REDACTED
folder_name = 'gbif/occurrence'

## Listing the data files

GBIF provide an export of occurrence data under the Creative Commons Zero (CC0) and Creative Commons Attribution (CC BY) licenses.  A snapshot containing the georeferenced records available under either license is uploaded periodically.

We can use `adlfs` to view the available data exports:

In [None]:
fs = AzureBlobFileSystem(account_name=storage_account_name, sas_token=sas_token)
export_folders = fs.glob(folder_name + '/20*')
print('Found {} GBIF data exports'.format(len(export_folders)))
for export_folder in export_folders:
    print(export_folder)

We can then list the files within one of these exports, in this case the last (most recent) one:

In [None]:
data_path = export_folders[-1]

# Reuse the fs object created in the previous cell
parquet_files = fs.glob(data_path + '/occurrence.parquet/*')
print('Found {} Parquet files in export {}'.format(len(parquet_files), data_path))
# Slicing avoids an IndexError if the export holds fewer than five files
for parquet_file in parquet_files[:5]:
    print('    ' + parquet_file)
print('    …')
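
Taking the last element relies on `fs.glob` returning the folders in sorted order. A minimal sketch of selecting the newest export explicitly follows. It assumes that each export folder's final path component is an ISO date such as `2021-04-13`, which is an assumption about the naming scheme rather than something this notebook confirms. `latest_export` is a hypothetical helper, and the example paths are made up for illustration.

```python
from datetime import date


def latest_export(folders):
    """Return the folder whose last path component is the latest ISO date.

    Assumes names like 'gbif/occurrence/2021-04-13'. Folders whose last
    component is not a valid ISO date are ignored.
    """
    dated = []
    for folder in folders:
        name = folder.rstrip('/').split('/')[-1]
        try:
            dated.append((date.fromisoformat(name), folder))
        except ValueError:
            continue  # not a date-named folder, skip it
    if not dated:
        raise ValueError('no date-named export folders found')
    return max(dated)[1]


# Hypothetical folder names, deliberately out of order
example_folders = [
    'gbif/occurrence/2021-06-01',
    'gbif/occurrence/2021-04-13',
    'gbif/occurrence/readme',
]
print(latest_export(example_folders))
```

In the notebook itself, `data_path = latest_export(export_folders)` would replace the `[-1]` lookup if the folder names do follow this pattern.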

## Opening one data file

The whole occurrence dataset has hundreds of millions of records, split across around 100 Parquet files.  We will open just one.

In [None]:
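print(data_path)
storage_options = {'account_name': storage_account_name, 'sas_token': sas_token}
# Read a single Parquet file and load it into memory as a pandas DataFrame
df = dd.read_parquet('az://' + parquet_files[0], storage_options=storage_options).compute()
print(df.head())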
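
Calling `.compute()` loads every column of the file into memory. When only a few columns are needed, a lazy read restricted to those columns is one alternative. The sketch below is not part of the original notebook: `to_az_urls` and `read_columns` are hypothetical helpers, and the example path is made up. `dask` is imported inside the function so that defining the helpers does not contact Azure.

```python
def to_az_urls(paths):
    """Prefix each blob path with the 'az://' protocol used by adlfs."""
    return ['az://' + p for p in paths]


def read_columns(paths, columns, storage_options):
    """Return a lazy Dask DataFrame holding only the requested columns.

    Row data is not loaded until .compute() or .head() is called.
    """
    import dask.dataframe as dd  # imported here so defining the helper is cheap
    return dd.read_parquet(to_az_urls(paths), columns=columns,
                           storage_options=storage_options)


# Hypothetical path, for illustration only
print(to_az_urls(['container/folder/part-0.parquet']))

# Usage with the variables defined earlier in this notebook (needs network access):
# lazy_df = read_columns(parquet_files[:3],
#                        ['decimallatitude', 'decimallongitude', 'kingdom', 'month'],
#                        {'account_name': storage_account_name, 'sas_token': sas_token})
# print(lazy_df.head())
```

The column names in the commented usage are the ones this notebook uses in its plots.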

## Plot data

This is a quick plot of occurrence density by longitude and latitude, in which the shapes of the continents are visible.

In [None]:
ax = df.plot.hexbin('decimallongitude', 'decimallatitude', gridsize=(360, 180),
                    vmax=100, cmap='Greens', colorbar=False)
ax.set_xlim(-180, 180)
ax.set_ylim(-90, 90)

Another plot shows the month of observation or collection.

In [None]:
some_plants = df[df.kingdom == 'Plantae']
ax = some_plants['month'].plot.hist(bins=12)

# Set the axis labels explicitly
ax.set_xlabel('Month')
ax.set_ylabel('Count')
ax.set_xlim(1, 12)

## Citation

It's good practice to cite the data using a DOI (Digital Object Identifier); see GBIF's [Citation Guidelines](https://www.gbif.org/citation-guidelines).

GBIF have a service to create a DOI covering a subset of a cloud dataset.  We need to know the DOI of the cloud dataset, and the number of occurrences we used from each contributing dataset.

In [None]:
print(fs.cat(data_path + '/citation.txt').decode("unicode_escape"))

# Number of occurrences used from each contributing dataset in the month plot, which only used plants:
usage = some_plants.groupby('datasetkey')['gbifid'].count()
usage.reset_index(name='count').sort_values('count', ascending=False)
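
The per-dataset counts usually need to be passed on in a machine-readable form. As a sketch of the counting step using only the standard library, the following tallies dataset keys and writes a two-column CSV. The keys are made-up placeholders, `counts_to_csv` is a hypothetical helper, and the column headers are an assumption, so check GBIF's citation guidelines for the format the DOI service actually expects.

```python
import csv
import io
from collections import Counter


def counts_to_csv(dataset_keys):
    """Tally occurrences per dataset key and return them as CSV text,
    most-used dataset first (ties broken by key for a stable order)."""
    counts = Counter(dataset_keys)
    rows = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['datasetkey', 'count'])  # header names are an assumption
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder keys, one per occurrence record used in the analysis
example_keys = ['key-a', 'key-b', 'key-a', 'key-c', 'key-a', 'key-b']
print(counts_to_csv(example_keys))
```

With the DataFrame from this notebook, `counts_to_csv(some_plants['datasetkey'])` should give the same tally as the `groupby` above, since iterating a pandas Series yields its values.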