# Whip a Darwin Core Archive

This script tests whether a Darwin Core Archive conforms to defined [whip specifications](https://github.com/inbo/whip).

1. Define whip specs at `datasets/<dataset_dir>/specs/` (copy/paste and adapt from other datasets)
2. Place an unzipped Darwin Core Archive in the `data` directory
3. Set `dataset_dir` to the name of that dataset directory, so the matching specs are loaded
4. Set the flags below to indicate which core and extension files are part of the archive
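
Steps 2 and 4 can be sanity-checked before running the cells. The following is a minimal sketch, assuming the file names used further down in this notebook (`event.txt`, `occurrence.txt`, `measurementorfact.txt`). The helpers `expected_files` and `missing_files` are hypothetical and not part of the notebook or of pywhip.

```python
from pathlib import Path


def expected_files(event_core, occ_core, occ_ext, mof_ext, data_dir="../data"):
    """Return the data files the notebook would read for the given flags."""
    data_dir = Path(data_dir)
    files = []
    if event_core:
        files.append(data_dir / "event.txt")
    # The occurrence file is read whether it is the core or an extension.
    if occ_core or occ_ext:
        files.append(data_dir / "occurrence.txt")
    if mof_ext:
        files.append(data_dir / "measurementorfact.txt")
    return files


def missing_files(files):
    """Return the subset of paths that do not exist on disk."""
    return [f for f in files if not f.is_file()]


# Same flag values as in the first cell of this notebook.
files = expected_files(event_core=False, occ_core=True, occ_ext=False, mof_ext=False)
print("Expected:", [str(f) for f in files])
print("Missing:", [str(f) for f in missing_files(files)])
```

An empty "Missing" list suggests the archive was unzipped into the expected place; anything else points to a flag or path that needs adjusting.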

In [1]:
dataset_dir = "natagora-orthoptera-occurrences"
event_core = False
occ_core = True
occ_ext = False
mof_ext = False

In [2]:
import pandas as pd
import numpy as np
import yaml
from pywhip import whip_csv
from IPython.display import HTML, display_html

In [3]:
# The occurrence file is needed when it is either the core or an extension
occ_core_ext = occ_core or occ_ext

## Read data

In [4]:
event = pd.read_csv("../data/event.txt", delimiter="\t", dtype=object) if event_core else False

In [5]:
occ = pd.read_csv("../data/occurrence.txt", delimiter="\t", dtype=object) if occ_core_ext else False

In [6]:
mof = pd.read_csv("../data/measurementorfact.txt", delimiter="\t", dtype=object) if mof_ext else False

## Some stats

Number of records:

In [7]:
len(event) if event_core else False

False

In [8]:
len(occ) if occ_core_ext else False

36227

In [9]:
len(mof) if mof_ext else False

False

In [10]:
event["eventDate"].min() if event_core else occ["eventDate"].min()

'1987-08-30'

In [11]:
event["eventDate"].max() if event_core else occ["eventDate"].max()

'2018-12-05'
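
Because the files are read with `dtype=object`, `min()` and `max()` compare `eventDate` values as strings. That matches chronological order only when every value uses the same zero-padded ISO 8601 format. A small illustration with made-up values (not taken from the dataset):

```python
import pandas as pd

# Made-up dates in ISO 8601 format: string order equals chronological order.
iso = pd.Series(["2018-12-05", "1987-08-30", "2003-01-17"])
iso_min, iso_max = iso.min(), iso.max()

# The same dates written day-first: string order is no longer chronological.
day_first = pd.Series(["05/12/2018", "30/08/1987", "17/01/2003"])
string_min = day_first.min()  # compares characters, not dates

# Parsing to datetimes first restores chronological comparison.
parsed_min = pd.to_datetime(day_first, format="%d/%m/%Y").min()

print(iso_min, iso_max)
print(string_min, parsed_min.date())
```

If the whip specs enforce a single `eventDate` format, the string-based range shown above can be trusted; otherwise parsing first is the safer option.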

In [12]:
occ.groupby(["scientificName","taxonRank","vernacularName"])["occurrenceID"].count().reset_index()

| | scientificName | taxonRank | vernacularName | occurrenceID |
|---|---|---|---|---|
| 0 | Barbitistes serricauda | species | Barbitiste des bois | 136 |
| 1 | Bicolorana bicolor | species | Decticelle bicolore | 555 |
| 2 | Calliptamus italicus | species | Caloptène italien | 3 |
| 3 | Chorthippus albomarginatus | species | Criquet marginé | 70 |
| 4 | Chorthippus biguttulus | species | Criquet mélodieux | 2136 |
| 5 | Chorthippus brunneus | species | Criquet duettiste | 1964 |
| 6 | Chorthippus dorsatus | species | Criquet verte-échine | 216 |
| 7 | Chorthippus mollis | species | Criquet des jachères | 28 |
| 8 | Chorthippus vagans | species | Criquet des pins | 130 |
| 9 | Chrysochraon dispar | species | Criquet des clairières | 2522 |


## Verify data

### Relationships between files

In [13]:
occ_event = pd.merge(occ, event, how="left") if occ_ext else False
mof_event = pd.merge(mof, event, how="left") if mof_ext else False

IDs of records that have no matching event after the merge (their `type` is empty). Should be `False` or an empty array for all.

In [14]:
occ_event[occ_event["type"].isnull()]["id"].unique() if occ_ext else False

False

In [15]:
mof_event[mof_event["type"].isnull()]["id"].unique() if mof_ext else False

False
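
The check above infers a missing event from an empty `type` value after a left merge. An alternative sketch, using made-up data, makes the same test explicit with `indicator=True` and a named join key. It assumes both files share an `id` column that links extension rows to their event; adapt `on=` if the archive uses a different key.

```python
import pandas as pd

# Made-up data: three events, and occurrences that point at them via `id`.
event = pd.DataFrame({"id": ["e1", "e2", "e3"], "type": ["Event"] * 3})
occ = pd.DataFrame({"id": ["e1", "e2", "e9"], "occurrenceID": ["o1", "o2", "o3"]})

# Left merge on the shared key; the `_merge` column records where each row was found.
occ_event = pd.merge(occ, event, how="left", on="id", indicator=True)

# Rows flagged "left_only" exist in the extension but have no matching event.
orphans = occ_event.loc[occ_event["_merge"] == "left_only", "id"].unique()
print(list(orphans))
```

Naming the key with `on=` keeps the merge from silently joining on every column the two files happen to share, which is the default behaviour of `pd.merge` when no key is given.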

### Unique IDs

Number of records with duplicate IDs. Should be `False` or `0` for all.

In [16]:
event[event["eventID"].duplicated(keep=False)]["eventID"].sort_values().count() if event_core else False

False

In [17]:
occ[occ["occurrenceID"].duplicated(keep=False)]["occurrenceID"].sort_values().count() if occ_core_ext else False

0
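
The counts above use `duplicated(keep=False)`, which flags every record that shares its ID with another record, not only the second and later copies. A short sketch with made-up IDs shows the difference from the default:

```python
import pandas as pd

# Made-up identifiers: "a" occurs three times, "b" twice, "c" once.
ids = pd.Series(["a", "b", "a", "c", "b", "a"], name="occurrenceID")

# Default keep="first": the first copy of each ID is not flagged.
n_repeats = int(ids.duplicated().sum())

# keep=False: every member of a duplicated group is flagged.
n_affected = int(ids.duplicated(keep=False).sum())

# Listing the offending IDs with their frequency helps to trace them back.
dup_counts = ids[ids.duplicated(keep=False)].value_counts()

print(n_repeats, n_affected)
print(dup_counts)
```

A non-zero count from the notebook cells therefore reports all affected records, and a `value_counts()` breakdown like the one above is one way to find which IDs are involved.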

## Whip data

### Event

In [18]:
event_spec = yaml.safe_load(open("../datasets/" + dataset_dir + "/specs/dwc_event.yaml")) if event_core else False

In [19]:
event_whipped = whip_csv("../data/event.txt", event_spec, delimiter="\t") if event_core else False

In [20]:
display_html(HTML(event_whipped.get_report("html")), metadata=dict(isolated=True)) if event_core else False

False

### Occurrence

In [21]:
occ_spec = yaml.safe_load(open("../datasets/" + dataset_dir + "/specs/dwc_occurrence.yaml")) if occ_core_ext else False

In [22]:
occ_whipped = whip_csv("../data/occurrence.txt", occ_spec, delimiter="\t") if occ_core_ext else False

Hooray, your data set is according to the guidelines!


In [23]:
display_html(HTML(occ_whipped.get_report("html")), metadata=dict(isolated=True)) if occ_core_ext else False

### Measurement or fact

In [24]:
mof_spec = yaml.safe_load(open("../datasets/" + dataset_dir + "/specs/dwc_mof.yaml")) if mof_ext else False

In [25]:
mof_whipped = whip_csv("../data/measurementorfact.txt", mof_spec, delimiter="\t") if mof_ext else False

In [26]:
display_html(HTML(mof_whipped.get_report("html")), metadata=dict(isolated=True)) if mof_ext else False

False