
handle inputs/outputs which are too large to exist in memory #28

@Lance-Drane

Description

Domain scientists may want or need to send very large files as input or output. The data inside these files should not be assumed to fit in memory all at once. Supporting this will require a few significant changes to the application.

Capability API

Immediately before the user function is called, the data is deserialized against the parameter type in the user's function via Pydantic. Immediately after the user function is called, the data is serialized against the return type in the user's function via Pydantic. This is a good way to guarantee that the proper schema will always be generated, an easy way to automatically validate inbound types, and a way to enforce that users return the outbound type they say they will.
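The flow above can be sketched roughly as follows. This is a minimal illustration (not the SDK's actual internals) using Pydantic v2's `TypeAdapter`; `user_function` and the payload are hypothetical.

```python
# Sketch of the in-memory validation flow: validate inbound data against the
# parameter type before the call, serialize against the return type after.
from pydantic import TypeAdapter


def user_function(counts: list[int]) -> dict[str, int]:
    # hypothetical user-provided capability function
    return {"total": sum(counts)}


# immediately before the call: deserialize and validate the inbound
# payload against the parameter type
payload = TypeAdapter(list[int]).validate_json(b"[1, 2, 3]")

result = user_function(payload)

# immediately after the call: serialize the result against the declared
# return type, enforcing that the user returned what they promised
outbound = TypeAdapter(dict[str, int]).dump_json(result)
```

Note that every step here assumes the full payload is already in memory, which is exactly where this approach breaks down for large files.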

This approach has an obvious shortcoming when the entire input or output is too large to fit in memory, which is a use case we need to account for. It would most commonly occur when the domain science needs to read a file line by line, for example, processing only one or a few lines at a time.

Proposed solution

The type parameters should only be validated when the entire payload can be kept in memory. I propose adding the optional fields

```python
input_file_options: IntersectFileOptions | None = None
output_file_options: IntersectFileOptions | None = None
```

to the @intersect_message decorator, and including output_file_options as an IntersectEventDefinition option for the @intersect_message_events decorator. This will allow users to override their request/response types with an IntersectFileOptions definition and skip validating the data via Pydantic (which can only validate data that is 100% in memory).

An IntersectFileOptions object could look something like:

```python
from pathlib import Path
from typing import Any

from pydantic import BaseModel


class IntersectFileOptions(BaseModel):
    # where to read or write the file; this should potentially default to a
    # temporary file, or to a location under `XDG_RUNTIME_DIR`
    # (on Linux systems, `/run/user/${UID}`)
    file_destination: Path
    # whether the files should stick around after use; by default we delete
    # them, but users may want to override this
    delete_after_use: bool = True
    # REQUIRED, cannot have a default. This is just a type definition; it can
    # often be `bytes`, but large text files may benefit from other types.
    schema_type: Any
```
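A hedged sketch of how a capability author might use the proposed fields. The `intersect_message` stub below is illustrative only (the real decorator would skip Pydantic validation whenever file options are present), and the model is repeated so the example is self-contained.

```python
from pathlib import Path
from typing import Any

from pydantic import BaseModel


class IntersectFileOptions(BaseModel):
    file_destination: Path
    delete_after_use: bool = True
    schema_type: Any  # required, no default


def intersect_message(input_file_options=None, output_file_options=None):
    # stand-in for the real decorator: just records the options on the
    # function so the SDK could later pick a file-based transfer path
    def wrap(fn):
        fn._input_file_options = input_file_options
        fn._output_file_options = output_file_options
        return fn
    return wrap


@intersect_message(
    input_file_options=IntersectFileOptions(
        file_destination=Path("/tmp/large_input.txt"),
        schema_type=bytes,
    )
)
def count_lines(input_path: Path) -> int:
    # the handler streams the file rather than loading it into memory
    with open(input_path) as f:
        return sum(1 for _ in f)
```

Because `schema_type` has no default, constructing `IntersectFileOptions` without it raises a validation error, which enforces the "REQUIRED" constraint noted above.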

Data API

This also means that we need to prepare for a few alternative ways to handle sending the data over the network. At the moment, we are either directly inserting the data into the message (which keeps 100% of the data in memory) OR we are using minio.get_object() or minio.put_object() (which also keep 100% of the data in memory).

Proposed solution

Leveraging the proposed IntersectFileOptions in the INTERSECT decorators in conjunction with the already existing data_handler and content_type options should allow us to use the minio.fget_object() and minio.fput_object() APIs, which use Range requests and should never keep the entire file in memory. We would still send/receive the exact same number of messages over the message broker. This is the easiest solution to implement. It becomes a simple matter of "if the user is using FileOptions, use fget/fput; if the user is not using FileOptions, use get/put".
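The "fget/fput vs. get/put" dispatch could be sketched as below. This assumes a MinIO client and the proposed FileOptions object; the function names and bucket/object arguments are illustrative, not the SDK's real API.

```python
# minio's fget_object/fput_object stream via the filesystem, while
# get_object/put_object keep the whole payload in memory.

def choose_download_call(client, bucket, obj, file_options):
    """Pick the MinIO download API based on the presence of file options."""
    if file_options is not None:
        # stream the object to disk at file_options.file_destination
        return lambda: client.fget_object(
            bucket, obj, str(file_options.file_destination)
        )
    # read the whole object into memory
    return lambda: client.get_object(bucket, obj)


def choose_upload_call(client, bucket, obj, data, file_options):
    """Pick the MinIO upload API based on the presence of file options."""
    if file_options is not None:
        # stream the file at file_options.file_destination from disk
        return lambda: client.fput_object(
            bucket, obj, str(file_options.file_destination)
        )
    # `data` is an in-memory BytesIO stream; put_object needs its length
    return lambda: client.put_object(
        bucket, obj, data, length=len(data.getbuffer())
    )
```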

A more difficult approach would be the use case where everything still goes over the message broker, in an attempt to work Range requests into the message broker itself. I do not believe this solution is particularly promising, because the asynchronous nature of the message broker means that you cannot always guarantee the order in which messages arrive. It has the further problem of clogging up the message broker with numerous additional messages, work which would make more sense for MinIO to handle.

The FileOptions specification could also work well if we come up with alternative data management solutions (e.g. Globus) that we incorporate directly into the SDK, so the approach is extensible and keeps our options open.

A note on mixing data handlers and FileOptions

It's important to mention that "MESSAGE" does not necessarily imply "no file options" and "MINIO" does not necessarily imply "file options"; for example, some microservices could easily handle message sizes in memory which are too large to transmit over RabbitMQ (MINIO + no FileOptions).

A note on encryption

Recent discussion on encryption has proposed that this data will often be encrypted as it goes over the wire. If the data cannot be kept in memory:

- the encrypted data should be written to a temporary file which is always deleted,
- logic should perform a buffered read of the encrypted data and a buffered write to produce a file containing the decrypted data, and
- the FileOptions parameter should reference the destination of the decrypted data.
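The buffered decrypt step could look roughly like this. `decrypt_chunk` is a placeholder (a real implementation would use a streaming cipher, e.g. AES-CTR via the `cryptography` package); the chunk size and function names are assumptions.

```python
import os
from pathlib import Path

CHUNK_SIZE = 64 * 1024  # 64 KiB read buffer; never hold the whole file


def decrypt_chunk(chunk: bytes) -> bytes:
    # placeholder: identity "decryption" for illustration only
    return chunk


def decrypt_to_destination(encrypted_path: Path, destination: Path) -> None:
    """Buffered read of the encrypted temp file, buffered write of plaintext."""
    with open(encrypted_path, "rb") as src, open(destination, "wb") as dst:
        while chunk := src.read(CHUNK_SIZE):
            dst.write(decrypt_chunk(chunk))
    # the temporary encrypted file is always deleted
    os.remove(encrypted_path)
```

The FileOptions `file_destination` would then point at `destination`, so the user function only ever sees the decrypted file.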
