Download service implementation for event data #30

Open
2 tasks
djtfmartin opened this issue Apr 1, 2022 · 2 comments

@djtfmartin
Member

djtfmartin commented Apr 1, 2022

Exploration required which should include:

  • Investigate the use of Spark SQL to support the download service
  • Investigate the connector between Spark and Elasticsearch (Elastic SQL) for reading Elasticsearch from Spark to produce exports. See this

There are four potential types of download we could support, each with a different implementation complexity.

a) Single dataset download
These would be full exports of the event datasets with our interpretation (taxonomy etc).
These could be pre-generated using pipelines (similar to DwCA export pipeline) and copied to S3 or FS.
These would satisfy the EcoCommons people.
Complexity: LOW

b) Multiple dataset download
Similar to the above, but with the ability to package multiple complete datasets (a zip of zips).
Complexity: MEDIUM
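
A minimal sketch of the "zip of zips" packaging, assuming each dataset export already exists as a zip archive held in memory. The function name `bundle_datasets` and the archive names are hypothetical and are not part of any existing pipeline.

```python
import io
import zipfile


def bundle_datasets(archives: dict[str, bytes]) -> bytes:
    """Pack pre-generated dataset archives into one outer zip.

    `archives` maps a file name (e.g. "dataset-a.zip") to the bytes of an
    already-built dataset export. Names and layout are illustrative only.
    """
    buffer = io.BytesIO()
    # The inner archives are already compressed, so store them unchanged.
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as outer:
        for name, payload in sorted(archives.items()):
            outer.writestr(name, payload)
    return buffer.getvalue()
```

For large datasets a real service would more likely stream the archives from S3 or the file system than hold them in memory; this only illustrates the packaging step.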

c) Query-based cross-dataset download
This would be the sort of download we are familiar with for occurrence data, but I question whether it is a good idea for event data, where the datasets are all quite different.
If AVRO-based, then events need (globally) unique eventIDs, which is something we don't have at the moment.
Complexity: HIGH
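
One possible way to get globally unique eventIDs, sketched under the assumption that every dataset has a stable key and that eventIDs are unique within a dataset. The scheme, the `global_event_id` name, and the `:` separator are all hypothetical; nothing here has been decided.

```python
def global_event_id(dataset_key: str, event_id: str) -> str:
    """Combine a dataset key and a dataset-local eventID into one identifier.

    Hypothetical scheme: prefixing with the dataset key keeps two datasets
    that reuse the same local eventID from colliding.
    """
    if not dataset_key or not event_id:
        raise ValueError("both dataset_key and event_id are required")
    if ":" in dataset_key:
        # Keeps the composite unambiguous when split on the first separator.
        raise ValueError("dataset_key must not contain ':'")
    return f"{dataset_key}:{event_id}"
```

With this, the same local eventID in two datasets (for example `site-1` in `dr1` and in `dr2`, both made-up keys) yields two distinct identifiers.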

d) Sites by species download
Elasticsearch-based, using facets
Complexity: MEDIUM
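
A rough sketch of how a sites-by-species export could be driven by facets: a terms aggregation per site with a nested terms aggregation per species, then flattened into a matrix. The field names passed in (for example `locationID` and `scientificName`) and both function names are assumptions, not the actual index mapping.

```python
def sites_by_species_query(site_field: str, species_field: str, size: int = 1000) -> dict:
    """Build a search body with a terms facet per site and a nested one per species."""
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "aggs": {
            "sites": {
                "terms": {"field": site_field, "size": size},
                "aggs": {
                    "species": {"terms": {"field": species_field, "size": size}}
                },
            }
        },
    }


def to_matrix(response: dict) -> dict:
    """Flatten the nested buckets into {site: {species: doc_count}}."""
    matrix = {}
    for site in response["aggregations"]["sites"]["buckets"]:
        matrix[site["key"]] = {
            sp["key"]: sp["doc_count"] for sp in site["species"]["buckets"]
        }
    return matrix
```

A terms aggregation only returns the top `size` buckets, so a complete export of a large index would probably need a paginated approach such as a composite aggregation.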

@javier-molina
Collaborator

The new service should be reusable; GBIF is happy to adopt it in the future.

@djtfmartin
Member Author

Current plan after discussion is to support (a) and (d) in the first instance.

@adam-collins adam-collins removed their assignment Jun 14, 2023