
bio-guoda/idigbio-spark

 
 


idigbio-spark

Generates taxonomic checklists and occurrence collections from biodiversity collections such as GBIF and iDigBio. Converts Darwin Core Archives (DwCA) tracked by Preston into Parquet and sequence files to enable parallel processing on a compute cluster.

This library relies on Apache Spark and Mesos/HDFS clusters to:

  1. generate checklists
  2. generate occurrence collections
  3. import Darwin Core Archives into the Apache Parquet data format
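
To make the first two tasks concrete, here is a minimal sketch in plain Scala with no Spark dependency. It is not this library's API: `Occurrence`, `BoundingBox` and `ChecklistSketch` are hypothetical names introduced for illustration, and the fields are loosely modelled on Darwin Core terms. The sketch assumes that an occurrence collection is the set of records matching a taxon within a geographic area, and that a checklist is the list of distinct scientific names in that set along with their record counts.

```scala
// Hypothetical record type for illustration; not a class from idigbio-spark.
final case class Occurrence(
  scientificName: String,
  taxonPath: Seq[String], // higher taxa, e.g. Seq("Animalia", "Aves")
  decimalLatitude: Double,
  decimalLongitude: Double
)

// Hypothetical spatial filter: a simple latitude/longitude box.
final case class BoundingBox(minLat: Double, maxLat: Double, minLng: Double, maxLng: Double) {
  def contains(o: Occurrence): Boolean =
    o.decimalLatitude >= minLat && o.decimalLatitude <= maxLat &&
      o.decimalLongitude >= minLng && o.decimalLongitude <= maxLng
}

object ChecklistSketch {
  // Occurrence collection: all records for the taxon that fall inside the area.
  def occurrenceCollection(records: Seq[Occurrence], taxon: String, area: BoundingBox): Seq[Occurrence] =
    records.filter(r => r.taxonPath.contains(taxon) && area.contains(r))

  // Checklist: distinct names with record counts, most frequent first, ties by name.
  def checklist(records: Seq[Occurrence], taxon: String, area: BoundingBox): Seq[(String, Int)] =
    occurrenceCollection(records, taxon, area)
      .groupBy(_.scientificName)
      .map { case (name, recs) => (name, recs.size) }
      .toSeq
      .sortBy { case (name, count) => (-count, name) }
}
```

On a cluster, the same filter-then-group steps would presumably be expressed over a distributed dataset rather than an in-memory `Seq`; that mapping is an assumption and is not taken from the library's source.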

At the time of writing (June 2017), this library was used by http://effechecka.org and https://gimmefreshdata.github.io . Note that the effechecka and freshdata projects are no longer active.

Funding

This work is funded in part by grant NSF OAC 1839201 from the National Science Foundation.
