
Link to issues #25 and #26 from design.md
JackKelly committed Jan 23, 2024
1 parent c560637 commit 069569e
Showing 1 changed file with 2 additions and 0 deletions: design.md

`light-speed-io` (or "LSIO", for short) will be a Rust library crate for loading and processing many chunks of files, as fast as the storage system will allow. **The aim is to allow users to load and process on the order of 1 million 4 kB chunks per second from a single local SSD**.

**UPDATE (2024-01-23): THE DESIGN IS LIKELY TO CHANGE A LOT! SPECIFICALLY, MY PLAN IS TO SIMPLIFY LSIO SO THAT IT IS ONLY RESPONSIBLE FOR I/O (NOT FOR PROCESSING CHUNKS). USERS WILL STILL BE ABLE TO INTERLEAVE I/O WITH PROCESSING BECAUSE LSIO WILL RETURN A Rust `Stream` (AKA `AsyncIterator`) OF CHUNKS (see [this GitHub comment](https://github.com/JackKelly/light-speed-io/issues/25#issuecomment-1900536618)). AFTER BUILDING AN MVP OF LSIO, I PLAN TO BUILD A SECOND CRATE WHICH MAKES IT EASY TO APPLY AN ARBITRARY PROCESSING FUNCTION TO A STREAM, IN PARALLEL ACROSS CPU CORES. (See [this comment](https://github.com/JackKelly/light-speed-io/issues/26#issuecomment-1902182033))**
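
To make the planned `Stream`-based design concrete, here is a minimal sketch of how a consumer might interleave I/O with processing. This is not the actual LSIO API: `lsio::read_chunks`, its signature, and the chunk type are hypothetical placeholders (the real interface is still being designed in issues #25 and #26); only `futures::StreamExt` is a real dependency.

```rust
// A minimal sketch, NOT the actual LSIO API. `lsio::read_chunks` and its
// signature are hypothetical placeholders for whatever interface emerges
// from the discussions in issues #25 and #26.
use futures::stream::StreamExt;

async fn load_and_process() -> std::io::Result<()> {
    // Hypothetical: ask LSIO for an async `Stream` of 4 kB chunks.
    let mut chunks = lsio::read_chunks("data.bin", 4_096);

    // I/O and processing interleave: each chunk is processed as soon as it
    // arrives, while LSIO keeps further reads in flight behind the scenes.
    while let Some(chunk) = chunks.next().await {
        process(&chunk?);
    }
    Ok(())
}

fn process(bytes: &[u8]) {
    // User-supplied work, e.g. decompress and copy into an array.
    let _ = bytes.len();
}
```

The planned second crate would then provide a combinator over such a stream that fans the `process` step out across CPU cores, so users get parallel processing without managing threads themselves.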

Why aim for 1 million chunks per second? See [this spreadsheet](https://docs.google.com/spreadsheets/d/1DSNeU--dDlNSFyOrHhejXvTl9tEWvUAJYl-YavUdkmo/edit#gid=0) for an ML training use-case that comfortably requires hundreds of thousands of chunks per second.

But, wait, isn't it inefficient to load tiny chunks? [Dask recommends chunk sizes between 100 MB and 1 GB](https://blog.dask.org/2021/11/02/choosing-dask-chunk-sizes)! Modern SSDs are turning the tables: they can sustain over 1 million input/output operations per second. And cloud storage looks like it is speeding up, too (for example, see the recent announcement of [Amazon S3 Express One Zone](https://aws.amazon.com/blogs/aws/new-amazon-s3-express-one-zone-high-performance-storage-class/); and there may be [ways to get high performance from existing cloud storage buckets](https://github.com/JackKelly/light-speed-io/issues/10)). One reason that Dask recommends large chunk sizes is that Dask's scheduler takes on the order of 1 ms to plan each task; LSIO's data processing should be faster (see below).
