# UK Invasive Species - occurrences

### This notebook leads you through exercises to load occurrences from a GBIF DarwinCore Archive and convert them to CSV files for import into Scratchpads. You will:

- Learn about GBIF & DarwinCore Archives
- Learn how to read and manipulate data using Pandas
- Export the occurrence data, ready for import into a Scratchpad.

In [1]:
### Necessary imports for this notebook.

# Pandas is a wonderful tool for data manipulation and analysis - https://pandas.pydata.org
import pandas as pd

# Zipfile lets us read files within a ZIP archive
import zipfile

# Helper function to turn our data into downloadable records
from helpers import create_download_link

### GBIF & DarwinCore Archive

The Global Biodiversity Information Facility (GBIF) aggregates and publishes biodiversity data from institutions around the world.

The [GBIF Archive](GBIF-DwCA.zip) has been created by searching GBIF for UK occurrence records of our 8 invasive species.

Source data is available on GBIF: [DOI: 10.15468/dl.ohvz6n](http://doi.org/10.15468/dl.ohvz6n)

In [2]:
# Let's load the DarwinCore Archive (it's just a ZIP file)
zip = zipfile.ZipFile('GBIF-DwCA.zip')

# Let's look inside the DwCA: we have our occurrence records, multimedia, and information on rights and how to cite
print([f for f in zip.namelist() if not f.startswith('dataset')])

['multimedia.txt', 'citations.txt', '.DS_Store', 'verbatim.txt', 'metadata.xml', 'meta.xml', 'rights.txt', 'occurrence.txt']
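
The `meta.xml` file in the listing above describes the archive's layout, including which file holds the core records. The sketch below reads that declaration with the standard library. It runs against a tiny stand-in archive built in memory; the helper name `core_file_info` and the sample XML are illustrative assumptions, not part of the original notebook.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by Darwin Core Archive descriptor files
DWC_TEXT_NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}

def core_file_info(archive):
    """Return (file name, row type) of the core file declared in meta.xml."""
    with archive.open("meta.xml") as meta:
        root = ET.parse(meta).getroot()
    core = root.find("dwc:core", DWC_TEXT_NS)
    location = core.find("dwc:files/dwc:location", DWC_TEXT_NS)
    return location.text, core.get("rowType")

# Made-up, minimal descriptor standing in for the real meta.xml
sample_meta = """<?xml version="1.0" encoding="UTF-8"?>
<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core encoding="UTF-8" fieldsTerminatedBy="\\t" linesTerminatedBy="\\n" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
  </core>
</archive>
"""

# Build the stand-in archive in memory
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("meta.xml", sample_meta)
    demo.writestr("occurrence.txt", "gbifID\tspecies\n1\tImpatiens glandulifera\n")

with zipfile.ZipFile(buffer) as demo:
    core_name, row_type = core_file_info(demo)

print(core_name, row_type)
```

With the real archive, the same helper could be called as `core_file_info(zip)`, assuming its `meta.xml` follows the standard descriptor layout.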


In [3]:
# Load the occurrences into our Pandas DataFrame (the file is tab-separated)
f = zip.open("occurrence.txt")
df = pd.read_csv(f, sep='\t', low_memory=False)
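
The occurrence file has a very large number of columns, most of which this notebook never uses. As a possible alternative, `read_csv` can be told to keep only the columns of interest. The sketch below uses a small made-up tab-separated sample in place of `zip.open("occurrence.txt")`; the `wanted` list is an assumption based on the fields used later in this notebook.

```python
import io
import pandas as pd

# Columns used later in the notebook (an assumption about what is needed)
wanted = ["gbifID", "occurrenceID", "institutionCode", "collectionCode",
          "species", "locality", "decimalLatitude", "decimalLongitude"]

# Made-up stand-in for the real occurrence.txt
sample = io.StringIO(
    "gbifID\tabstract\tspecies\tlocality\tdecimalLatitude\tdecimalLongitude\n"
    "1\t\tImpatiens glandulifera\tRiver Wye\t52.1\t-2.7\n"
)

# A callable usecols keeps matching columns and tolerates names that are absent
slim_df = pd.read_csv(sample, sep="\t", usecols=lambda name: name in wanted)

print(list(slim_df.columns))
```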

Note: GBIF has 92,071 UK occurrence records for our 8 invasive species! It would take quite a while to import all of these into a Scratchpad, so this DarwinCore Archive contains a random 1% sample to work with.
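
The notebook does not say how the 1% sample was drawn. One plausible way to take such a sample with Pandas is `DataFrame.sample`; the frame and seed below are made up for illustration.

```python
import pandas as pd

# Stand-in for a complete download (made-up data)
full_df = pd.DataFrame({"gbifID": range(1000)})

# Draw a reproducible 1% random sample; random_state is an arbitrary seed
sample_df = full_df.sample(frac=0.01, random_state=42)

print(len(sample_df))
```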

Let's preview the data:

In [4]:
# Show the first 5 rows
df.head()

| | Unnamed: 0 | Unnamed: 0.1 | gbifID | abstract | accessRights | accrualMethod | accrualPeriodicity | accrualPolicy | alternative | audience | ... | subgenusKey | speciesKey | species | genericName | acceptedScientificName | typifiedName | protocol | lastParsed | lastCrawled | repatriated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 33036 | 1559722217 | | | | | | | | ... | | 2891770 | Impatiens glandulifera | Impatiens | Impatiens glandulifera Royle | | DWC_ARCHIVE | 2019-12-30T17:18:22.990Z | 2019-12-30T17:10:20.110Z | False |
| 1 | 1 | 33286 | 1559719420 | | | | | | | | ... | | 2891770 | Impatiens glandulifera | Impatiens | Impatiens glandulifera Royle | | DWC_ARCHIVE | 2019-12-30T17:18:19.027Z | 2019-12-30T17:10:20.110Z | False |
| 2 | 2 | 66837 | 1549816945 | | | | | | | | ... | | 2891770 | Impatiens glandulifera | Impatiens | Impatiens glandulifera Royle | | DWC_ARCHIVE | 2019-12-30T17:03:38.849Z | 2019-12-30T17:01:07.284Z | False |
| 3 | 3 | 55317 | 1559520159 | | | | | | | | ... | | 2891770 | Impatiens glandulifera | Impatiens | Impatiens glandulifera Royle | | DWC_ARCHIVE | 2019-12-30T17:11:10.987Z | 2019-12-30T17:10:20.118Z | False |
| 4 | 4 | 50985 | 1559537864 | | | | | | | | ... | | 2891770 | Impatiens glandulifera | Impatiens | Impatiens glandulifera Royle | | DWC_ARCHIVE | 2019-12-30T17:11:14.683Z | 2019-12-30T17:10:20.118Z | False |
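
The preview includes columns named `Unnamed: 0` and `Unnamed: 0.1`. These look like index columns left over from earlier CSV round trips, although that is an assumption. They are not needed for the import, and a sketch of how they could be dropped follows, using a small stand-in frame with values copied from the preview.

```python
import pandas as pd

# Stand-in frame with the same leading columns as the preview
preview = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Unnamed: 0.1": [33036, 33286],
    "gbifID": [1559722217, 1559719420],
})

# Keep every column whose name does not start with "Unnamed:"
tidy = preview.loc[:, ~preview.columns.str.startswith("Unnamed:")]

print(list(tidy.columns))
```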


Scratchpads have separate data types for observation records and their locations. This data model allows multiple observations to be attached to a single location, but it requires two separate imports: first the locations, then the observations.
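
Because observations refer to locations by GUID, every observation must point at a location that exists in the first import. The sketch below shows one way to check that link with a left merge. The two small frames are made up, with column names chosen to match the import format used later in this notebook.

```python
import pandas as pd

# Made-up locations, keyed by GUID
localities = pd.DataFrame({
    "GUID": [101, 102],
    "Title": ["River Wye", "Cardiff Bay"],
})

# Made-up observations; two share location 101, one points at a missing location
occurrences = pd.DataFrame({
    "Location (GUID)": [101, 101, 102, 999],
    "Catalogue number": ["a", "b", "c", "d"],
})

# indicator=True adds a "_merge" column showing where each row found a match
checked = occurrences.merge(
    localities, left_on="Location (GUID)", right_on="GUID",
    how="left", indicator=True,
)
orphans = checked.loc[checked["_merge"] == "left_only", "Catalogue number"].tolist()

print(orphans)
```

In this notebook each occurrence gets its own location, keyed by `gbifID`, so a check like this should find no orphans. It becomes more useful if locations are later deduplicated.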

Let's start with the locality data...

In [9]:
# Select the locality fields we want to import (copy so we can safely add columns)
locality_df = df[['gbifID', 'locality']].copy()

# Rename the columns into the format required for the Scratchpads import
locality_df = locality_df.rename(columns={'gbifID': 'GUID', 'locality': 'Title'})

# Combine the latitude and longitude fields into a coordinate string: (latitude,longitude)
locality_df['Map'] = df.apply(lambda x: f'({x["decimalLatitude"]},{x["decimalLongitude"]})', axis=1)
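
Records without coordinates would produce a `Map` value of `(nan,nan)` with the approach above. Whether the Scratchpads import accepts such values is not stated here, so the following is only a sketch of one option: keep just the rows that have both coordinates, then build the string with vectorised operations. The sample data is made up.

```python
import pandas as pd

# Made-up records; the second has no coordinates
coords = pd.DataFrame({
    "gbifID": [1, 2],
    "locality": ["River Wye", "Unknown"],
    "decimalLatitude": [52.1, None],
    "decimalLongitude": [-2.7, None],
})

# True only where both latitude and longitude are present
has_coords = coords[["decimalLatitude", "decimalLongitude"]].notna().all(axis=1)
located = coords.loc[has_coords].copy()

# Build "(latitude,longitude)" without a row-by-row apply
located["Map"] = (
    "(" + located["decimalLatitude"].astype(str)
    + "," + located["decimalLongitude"].astype(str) + ")"
)

print(located["Map"].tolist())
```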

In [6]:
# Select the occurrence fields we want to import
occurrence_df = df[['gbifID', 'occurrenceID', 'institutionCode', 'collectionCode', 'species']].copy()

# Populate required fields with a constant value for every row
# (note that this overwrites the collectionCode values from GBIF)
occurrence_df['Basis of record'] = 'Human Observation'
occurrence_df['collectionCode'] = 'GBIF'

# Rename the columns into the format required for the Scratchpads import
occurrence_df = occurrence_df.rename(columns={
    'gbifID': 'Location (GUID)',
    'occurrenceID': 'Catalogue number',
    'institutionCode': 'Institution code',
    'collectionCode': 'Collection code',
    'species': 'Taxonomic name (Name)',
})

# Preview the result
occurrence_df.head()


| | Location (GUID) | Catalogue number | Institution code | Collection code | Taxonomic name (Name) | Basis of record |
|---|---|---|---|---|---|---|
| 0 | 1559722217 | 44e4776c-1f34-4dd0-8be1-dddb6398d011 | Environment Agency (Biodiversity staff) | GBIF | Impatiens glandulifera | Human Observation |
| 1 | 1559719420 | 4aee5a43-874f-4602-acaf-553d366db27f | Environment Agency (Biodiversity staff) | GBIF | Impatiens glandulifera | Human Observation |
| 2 | 1549816945 | e4898ee8-bace-4a73-98db-db7b7f6fb0ea | SEWBReC | GBIF | Impatiens glandulifera | Human Observation |
| 3 | 1559520159 | 6a6b289b-e969-44d6-9296-c6133ad9d6e4 | Environment Agency (Biodiversity staff) | GBIF | Impatiens glandulifera | Human Observation |
| 4 | 1559537864 | 560a5f81-c6c3-4c7c-9e64-1c978a701ad5 | Environment Agency (Biodiversity staff) | GBIF | Impatiens glandulifera | Human Observation |
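
Before exporting, it can help to check for gaps and duplicates in the fields the import relies on. The sketch below counts missing values per column and lists rows that share a catalogue number. The frame is made up, and which fields Scratchpads treats as mandatory or unique is an assumption.

```python
import pandas as pd

# Made-up export data with one missing name and one repeated catalogue number
export_df = pd.DataFrame({
    "Location (GUID)": [1, 2, 3],
    "Catalogue number": ["cat-a", "cat-b", "cat-b"],
    "Taxonomic name (Name)": ["Impatiens glandulifera", None, "Impatiens glandulifera"],
})

# Number of missing values in each column
missing_per_column = export_df.isna().sum()

# All rows whose catalogue number appears more than once
duplicate_catalogue = export_df[export_df["Catalogue number"].duplicated(keep=False)]

print(missing_per_column)
print(duplicate_catalogue)
```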


In [7]:
# The data is ready for downloading - create download links
create_download_link(locality_df, title="Download localities CSV", filename="localities.csv")

In [8]:
create_download_link(occurrence_df, title="Download occurrences CSV", filename = "occurrences.csv")
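
The `helpers` module is not shown in this notebook, so how `create_download_link` works is unknown here. A common approach for this kind of helper is to embed the CSV as a base64 data URI inside an HTML link. The function below, `sketch_download_link`, is a hypothetical stand-in written for illustration and should not be read as the actual helper.

```python
import base64
import html
import pandas as pd

def sketch_download_link(frame, title="Download CSV", filename="data.csv"):
    """Return an HTML anchor that downloads the frame as a CSV file."""
    csv_bytes = frame.to_csv(index=False).encode("utf-8")
    payload = base64.b64encode(csv_bytes).decode("ascii")
    return (
        f'<a download="{html.escape(filename)}" '
        f'href="data:text/csv;base64,{payload}" target="_blank">'
        f'{html.escape(title)}</a>'
    )

# Made-up frame in the localities import format
link = sketch_download_link(
    pd.DataFrame({"GUID": [1], "Title": ["River Wye"]}),
    title="Download localities CSV",
    filename="localities.csv",
)

print(link[:60])
```

In a notebook, the returned string could be rendered by wrapping it in `IPython.display.HTML`.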