
handle inputs/outputs which are too large to exist in memory #28

@Lance-Drane

Description

Domain scientists may want or need to send very large files as input or output. The data inside these files should not be assumed to fit in memory all at once. Supporting this will require a few significant changes to the application.

Capability API

Immediately before the user function is called, the data is deserialized against the parameter type in the user's function via Pydantic. Immediately after the user function is called, the data is serialized against the return type in the user's function via Pydantic. This is a good way to guarantee that the proper schema will always be generated, an easy way to automatically validate inbound types, and a way to enforce that users return the outbound type they say they will.
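The flow above can be sketched roughly as follows. This is a minimal illustration (not the SDK's actual internals) using Pydantic v2's `TypeAdapter`; `user_function` and the payload are hypothetical.

```python
# Sketch of the in-memory validation flow: validate inbound data against the
# parameter type before the call, serialize against the return type after.
from pydantic import TypeAdapter


def user_function(counts: list[int]) -> dict[str, int]:
    # hypothetical user-provided capability function
    return {"total": sum(counts)}


# immediately before the call: deserialize and validate the inbound
# payload against the parameter type
payload = TypeAdapter(list[int]).validate_json(b"[1, 2, 3]")

result = user_function(payload)

# immediately after the call: serialize the result against the declared
# return type, enforcing that the user returned what they promised
outbound = TypeAdapter(dict[str, int]).dump_json(result)
```

Note that every step here assumes the full payload is already in memory, which is exactly where this approach breaks down for large files.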

This approach has an obvious shortcoming when the entire input or output is too large to fit in memory, which is a use case we need to account for. It would most commonly occur when the domain science needs to read a file line by line, for example, processing only one or a few lines at a time.

Proposed solution

The type parameters should only be validated when the entire payload can be kept in memory. I propose adding the optional fields

```python
input_file_options: IntersectFileOptions | None = None
output_file_options: IntersectFileOptions | None = None
```

to the @intersect_message decorator, and including output_file_options as an IntersectEventDefinition option for the @intersect_message_events decorator. This will allow users to override their request/response types with an IntersectFileOptions definition and skip validating the data via Pydantic (which can only validate data that is 100% in memory).

An IntersectFileOptions object could look something like:

```python
from pathlib import Path
from typing import Any

from pydantic import BaseModel


class IntersectFileOptions(BaseModel):
    # where to read or write the file; this should potentially default to a
    # temporary file, or to a location under `XDG_RUNTIME_DIR`
    # (on Linux systems, `/run/user/${UID}`)
    file_destination: Path
    # whether the files should stick around after use; by default we delete
    # them, but users may want to override this
    delete_after_use: bool = True
    # REQUIRED, cannot have a default. This is just a type definition; it can
    # often be `bytes`, but large text files may benefit from other types.
    schema_type: Any
```
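A hedged sketch of how a capability author might use the proposed fields. The `intersect_message` stub below is illustrative only (the real decorator would skip Pydantic validation whenever file options are present), and the model is repeated so the example is self-contained.

```python
from pathlib import Path
from typing import Any

from pydantic import BaseModel


class IntersectFileOptions(BaseModel):
    file_destination: Path
    delete_after_use: bool = True
    schema_type: Any  # required, no default


def intersect_message(input_file_options=None, output_file_options=None):
    # stand-in for the real decorator: just records the options on the
    # function so the SDK could later pick a file-based transfer path
    def wrap(fn):
        fn._input_file_options = input_file_options
        fn._output_file_options = output_file_options
        return fn
    return wrap


@intersect_message(
    input_file_options=IntersectFileOptions(
        file_destination=Path("/tmp/large_input.txt"),
        schema_type=bytes,
    )
)
def count_lines(input_path: Path) -> int:
    # the handler streams the file rather than loading it into memory
    with open(input_path) as f:
        return sum(1 for _ in f)
```

Because `schema_type` has no default, constructing `IntersectFileOptions` without it raises a validation error, which enforces the "REQUIRED" constraint noted above.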

Data API

This also means that we need to prepare for a few alternative ways to handle sending the data over the network. At the moment, we are either directly inserting the data into the message (which keeps 100% of the data in memory) OR we are using minio.get_object() or minio.put_object() (which also keep 100% of the data in memory).

Proposed solution

Leveraging the proposed IntersectFileOptions in the INTERSECT decorators in conjunction with the already existing data_handler and content_type options should allow us to use the minio.fget_object() and minio.fput_object() APIs, which use Range requests and should never keep the entire file in memory. We would still send/receive the exact same number of messages over the message broker. This is the easiest solution to implement. It becomes a simple matter of "if the user is using FileOptions, use fget/fput; if the user is not using FileOptions, use get/put".
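The "fget/fput vs. get/put" dispatch could be sketched as below. This assumes a MinIO client and the proposed FileOptions object; the function names and bucket/object arguments are illustrative, not the SDK's real API.

```python
# minio's fget_object/fput_object stream via the filesystem, while
# get_object/put_object keep the whole payload in memory.

def choose_download_call(client, bucket, obj, file_options):
    """Pick the MinIO download API based on the presence of file options."""
    if file_options is not None:
        # stream the object to disk at file_options.file_destination
        return lambda: client.fget_object(
            bucket, obj, str(file_options.file_destination)
        )
    # read the whole object into memory
    return lambda: client.get_object(bucket, obj)


def choose_upload_call(client, bucket, obj, data, file_options):
    """Pick the MinIO upload API based on the presence of file options."""
    if file_options is not None:
        # stream the file at file_options.file_destination from disk
        return lambda: client.fput_object(
            bucket, obj, str(file_options.file_destination)
        )
    # `data` is an in-memory BytesIO stream; put_object needs its length
    return lambda: client.put_object(
        bucket, obj, data, length=len(data.getbuffer())
    )
```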

A more difficult approach would be the use case where everything still goes over the message broker, in an attempt to work Range requests into the message broker itself. I do not believe this solution is particularly promising, because the asynchronous nature of the message broker means that you cannot always guarantee the order in which messages arrive. It has the further problem of clogging up the message broker with numerous additional messages, work which would make more sense for MinIO to handle.

The FileOptions specification could also work well if we come up with alternative data management solutions (e.g. Globus) that we incorporate directly into the SDK, so the approach is extensible and keeps our options open.

A note on mixing data handlers and FileOptions

It's important to mention that "MESSAGE" does not necessarily imply "no file options" and "MINIO" does not necessarily imply "file options"; for example, some microservices could easily handle message sizes in memory which are too large to transmit over RabbitMQ (MINIO + no FileOptions).

A note on encryption

Recent discussion on encryption has proposed that this data will often be encrypted as it goes over the wire. If the data cannot be kept in memory:

- the encrypted data should be written to a temporary file which is always deleted,
- logic should perform a buffered read of the encrypted data and a buffered write to produce a file containing the decrypted data, and
- the FileOptions parameter should reference the destination of the decrypted data.
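The buffered decrypt step could look roughly like this. `decrypt_chunk` is a placeholder (a real implementation would use a streaming cipher, e.g. AES-CTR via the `cryptography` package); the chunk size and function names are assumptions.

```python
import os
from pathlib import Path

CHUNK_SIZE = 64 * 1024  # 64 KiB read buffer; never hold the whole file


def decrypt_chunk(chunk: bytes) -> bytes:
    # placeholder: identity "decryption" for illustration only
    return chunk


def decrypt_to_destination(encrypted_path: Path, destination: Path) -> None:
    """Buffered read of the encrypted temp file, buffered write of plaintext."""
    with open(encrypted_path, "rb") as src, open(destination, "wb") as dst:
        while chunk := src.read(CHUNK_SIZE):
            dst.write(decrypt_chunk(chunk))
    # the temporary encrypted file is always deleted
    os.remove(encrypted_path)
```

The FileOptions `file_destination` would then point at `destination`, so the user function only ever sees the decrypted file.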
