Skip to content

Bombus/pollinator

develop
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code
This branch is 30 commits ahead, 279 commits behind VertNet:develop.

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 
 
 
 
 

What is Pollinator

This project was forked from VertNet Gulo and subsequently butchered. The goal is to experiment with constructing graph-based views of Darwin Core objects in MapReduce, enabling integration with related objects on the semantic web.

Outcomes

We modified VertNet Gulo to support generating EZIDs and triplifying Darwin Core CSV files using MapReduce. We have the Runtime wiki page that describes how to run Pollinator on your laptop using an example CSV file. We also made the EZID Java client library available in Clojure and moved it to a new project in GitHub.

Experiment

Goal 1

Take a CSV file with Darwin Core records, dwc.csv:

catalogNumber,recordedBy,scientificName,eventDate,decimalLatitude,decimalLongitude,identifiedBy
1,Rob Guralnick,Puma concolor,8/8/12,37.1,-120.1,Aaron Steele
2,John Deck,Ursus arctos horribilis,8/8/12,37.2,-120.1,Rob Guralnick
3,Aaron Steele,Bufu bufo,8/8/12,37.2,-120.2,Dave Wake
4,Nico Cellinese,Acanthella pulchra,8/8/12,37.3,-120.4,Nico Cellinese

Then we'll then MapReduce over it to create some outputs.

First output

The type output lists identifiers and what type they are. type.csv would look like this:

ID,Type
ark:/9999/fk411,dwc:Taxon
ark:/9999/fk412,dwc:Taxon
ark:/9999/fk413,dwc:Taxon
ark:/9999/fk414,dwc:Taxon
ark:/9999/fk421,dwc:Locality
ark:/9999/fk422,dwc:Locality
ark:/9999/fk423,dwc:Locality
ark:/9999/fk424,dwc:Locality
ark:/9999/fk431,dwc:Occurrence
ark:/9999/fk432,dwc:Occurrence
ark:/9999/fk433,dwc:Occurrence
ark:/9999/fk434,dwc:Occurrence

Second output

The source_of.csv describes the relationships between the identifiers. In this example, dwc:Locality source_of dwc:Occurrence source_of dwc:Taxon:

source	target
ark:/9999/fk421,ark:/9999/fk431
ark:/9999/fk431,ark:/9999/fk411
ark:/9999/fk422,ark:/9999/fk432
ark:/9999/fk432,ark:/9999/fk412
ark:/9999/fk423,ark:/9999/fk433
ark:/9999/fk433,ark:/9999/fk413
ark:/9999/fk424,ark:/9999/fk434
ark:/9999/fk434,ark:/9999/fk414

Goal 2

Bulk load files to CartoDB

Goal 3

Write a query API and client, then construct a query against that in Java against the data in the AppEngine datastore for a single UUID as a query parameter that returns an N3 file containing source_of relationships and types. For example:

public interface Query {
  public static String execute(String uuid);
}

Would return objects and derivitives.

Goal 4

Sketch out EZID use cases, implementation from context of the above exercises. Best way to assign identifiers in the above process.

Outputs: Subversion code used is from https://code.ecoinformatics.org/code/ezid/trunk/ as a maven project. We mint identifiers like:

EZIDService ezid = new EZIDService();
ezid.login(username,password);
// ark:/99999/fk4 is the temporary ark shoulder
String newId = ezid.mintIdentifier("ark:/99999/fk4", null);

About

Shredding Darwin Core Archives with ferocity, strength, and Cascalog.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Clojure 98.4%
  • Shell 1.6%