idigbio-spark

Generates taxonomic checklists and occurrence collections from biodiversity collections such as GBIF and iDigBio. Converts Darwin Core Archives (DwCA) tracked by Preston into Parquet and sequence files to enable parallel processing in a compute cluster.

This library relies on Apache Spark and Mesos/HDFS clusters to:

generate checklists
generate occurrence collections
import Darwin Core Archives into the Apache Parquet data format
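
To make the idea of a checklist concrete, here is a minimal sketch in plain Scala, without Spark. It is not this library's API: `Occurrence`, `BoundingBox`, `ChecklistSketch`, the `|`-separated taxon path, and the sample records are all hypothetical and chosen only for illustration. It assumes a checklist is a count of occurrence records per taxon, restricted to a taxon of interest and a geographic area.

```scala
// Hypothetical record type: a simplified stand-in for a Darwin Core occurrence.
final case class Occurrence(taxonPath: String, lat: Double, lng: Double)

// Hypothetical bounding box used as a simple spatial filter.
final case class BoundingBox(minLat: Double, maxLat: Double, minLng: Double, maxLng: Double) {
  def contains(o: Occurrence): Boolean =
    o.lat >= minLat && o.lat <= maxLat && o.lng >= minLng && o.lng <= maxLng
}

object ChecklistSketch {
  // Counts occurrences per taxon path for records that fall inside the box
  // and whose path contains the requested taxon name.
  // Results are ordered by descending count, then by taxon path.
  def checklist(occurrences: Seq[Occurrence], taxon: String, box: BoundingBox): Seq[(String, Int)] =
    occurrences
      .filter(o => box.contains(o) && o.taxonPath.split('|').contains(taxon))
      .groupBy(_.taxonPath)
      .map { case (path, hits) => path -> hits.size }
      .toSeq
      .sortBy { case (path, count) => (-count, path) }

  // Made-up sample data, for illustration only.
  val sample: Seq[Occurrence] = Seq(
    Occurrence("Animalia|Aves|Ardea alba", 29.6, -82.3),
    Occurrence("Animalia|Aves|Ardea alba", 29.7, -82.4),
    Occurrence("Animalia|Aves|Grus canadensis", 29.5, -82.2),
    Occurrence("Animalia|Insecta|Apis mellifera", 29.6, -82.3),
    Occurrence("Animalia|Aves|Ardea alba", 45.0, -100.0)
  )

  def main(args: Array[String]): Unit = {
    val box = BoundingBox(29.0, 30.0, -83.0, -82.0)
    checklist(sample, "Aves", box).foreach { case (path, count) =>
      println(s"$path\t$count")
    }
  }
}
```

On a cluster, the same filter, group, and count steps would run over a distributed dataset rather than an in-memory `Seq`. That is the reason for first converting the archives into Parquet.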

At the time of writing (June 2017), this library was used by http://effechecka.org and https://gimmefreshdata.github.io. Note that the effechecka and freshdata projects are no longer active.

Funding

This work is funded in part by grant NSF OAC 1839201 from the National Science Foundation.

