Dataset internal jobs, external files, & server-side view creation #878

Merged: bennybp merged 26 commits into main from create_view on Jan 14, 2025
bennybp commented Jan 10, 2025

Description

This one has been a long time coming. This fairly big PR has three main components:

1. Internal Jobs

Cleans up and expands the server's internal jobs, which are jobs that run server-side. Until now, they have been limited to periodic maintenance tasks. This PR makes them usable for long-running, one-off tasks as well.

2. External file capability

Adds the ability to store files in an S3 bucket. I have not tried it on actual S3 yet; I have only tested locally using MinIO, which we will also likely run locally in the near term.

For now, the only place these external files are used is dataset attachments. Datasets can now have these files as attachments, and downloads of them come directly from the S3 bucket.

3. Server-side view creation

Combining those additions, we can now create "views" using internal jobs and then store them in an S3 bucket. These views are basically pre-rendered cache files that you can download and then use either as a starting point for the local dataset cache or as standalone, read-only dataset views completely disconnected from the server.

The main functions are create_view, download_view, list_views, use_view_cache, and preload_cache, which are all part of the dataset interface.

The create_view function will add an internal job on the server, which runs in the background and doesn't require you to be constantly connected to the server (as these can take a while).
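The submit-and-check-back pattern described above can be illustrated with a toy, in-process job queue. This is only a sketch under stated assumptions: `InternalJobQueue`, `Job`, the status strings, and the placeholder result are all hypothetical names made up for illustration, and a background thread stands in for the server. It is not the server's actual implementation or API.

```python
import itertools
import queue
import threading
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Job:
    """One unit of background work (hypothetical structure)."""
    job_id: int
    func: Callable[[], str]
    status: str = "waiting"
    result: str = ""
    done: threading.Event = field(default_factory=threading.Event)


class InternalJobQueue:
    """Toy stand-in for a server-side job queue (illustration only)."""

    def __init__(self) -> None:
        self._jobs: Dict[int, Job] = {}
        self._pending: "queue.Queue[Job]" = queue.Queue()
        self._ids = itertools.count(1)
        # A daemon thread plays the role of the server working in the background.
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, func: Callable[[], str]) -> int:
        """Register a job and return immediately with its ID."""
        job = Job(next(self._ids), func)
        self._jobs[job.job_id] = job
        self._pending.put(job)
        return job.job_id

    def status(self, job_id: int) -> str:
        return self._jobs[job_id].status

    def result(self, job_id: int) -> str:
        return self._jobs[job_id].result

    def wait(self, job_id: int, timeout: Optional[float] = None) -> bool:
        """Block until the job finishes; returns False on timeout."""
        return self._jobs[job_id].done.wait(timeout)

    def _worker(self) -> None:
        while True:
            job = self._pending.get()
            job.status = "running"
            try:
                job.result = job.func()
                job.status = "complete"
            except Exception as exc:  # record the failure instead of crashing the worker
                job.result = repr(exc)
                job.status = "error"
            finally:
                job.done.set()


jobs = InternalJobQueue()
# The lambda is a placeholder for slow work such as rendering a view.
job_id = jobs.submit(lambda: "placeholder-view-file")
# The caller could disconnect here and come back later with only the job ID.
finished = jobs.wait(job_id, timeout=5)
print(job_id, jobs.status(job_id), jobs.result(job_id))
```

The point of the design is that `submit` returns an ID right away, so the caller only needs that ID to check on the job later.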

Some initial benchmarks on a copy of our ML server: roughly 30 minutes to create a view (including outputs) for 124,000 singlepoints (dataset 413 "SPICE PubChem Set 10 Single Points Dataset v1.0"). The resulting file was 8.4 GB.

Dataset 422 "SPICE Amino Acid Ligand OpenFF v1.0" with 388,532 singlepoints took 80 minutes; the resulting file was 21.7 GB.
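To put the two benchmarks on a comparable footing, here is a small back-of-the-envelope calculation of throughput and file size per record. It uses only the figures quoted above; the `ViewBenchmark` class is a hypothetical helper, "GB" is assumed to mean 10^9 bytes, and the 30-minute figure is approximate, so treat the results as rough estimates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewBenchmark:
    """Figures quoted in the PR description (hypothetical helper class)."""
    dataset_id: int
    records: int
    minutes: float
    size_gb: float  # assumed decimal gigabytes (1 GB = 1e9 bytes)

    @property
    def records_per_minute(self) -> float:
        return self.records / self.minutes

    @property
    def kb_per_record(self) -> float:
        # Convert GB to bytes, divide by record count, then express in kB.
        return self.size_gb * 1e9 / self.records / 1e3


benchmarks = [
    ViewBenchmark(dataset_id=413, records=124_000, minutes=30, size_gb=8.4),
    ViewBenchmark(dataset_id=422, records=388_532, minutes=80, size_gb=21.7),
]

for b in benchmarks:
    print(
        f"dataset {b.dataset_id}: "
        f"{b.records_per_minute:,.0f} records/min, "
        f"{b.kb_per_record:.1f} kB/record"
    )
```

Under these assumptions, both runs come out at roughly 4,000 to 5,000 singlepoints per minute and somewhere around 55 to 70 kB per record.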

Testing

I've tested this locally and with a copy of the ML instance. I am still working out how to test this with GHA; it's tricky because it requires an S3 server.

Future work

I hope to get to a documentation sprint next week. There is a lot to document here (and everywhere else, too).

I want to have the ability to do more with these internal jobs, including large dataset submission.

Status

  • Code base linted
  • Ready to go

bennybp commented Jan 14, 2025

I'm going to skip a lot of testing for the view creation stuff right now. It's been tested manually.

Soon I want to add some features related to ingesting pre-computed data, which will make these kinds of dataset tests much easier, so no need to waste time on working around that right now.

I will also need to set up a test S3 bucket.

bennybp merged commit 4f552af into main on Jan 14, 2025
bennybp deleted the create_view branch on January 14, 2025 at 18:54