feat(upload): async improved #2544

Merged 7 commits on Jun 4, 2024

Conversation

AmineDiro
Collaborator

Description

Hey,

Here's a breakdown of what I've done:

  • Reducing the number of open file descriptors and the memory footprint: Previously, for each uploaded file, we opened a NamedTemporaryFile to write the content read from Supabase. Because we depend on the langchain loader classes, we couldn't pass in-memory buffers to the loaders. With these changes, we open a single temporary file per process_file_and_notify call, which cuts down on file opens, read syscalls, and memory buffer usage. The previous behavior could cause stability issues when ingesting and processing large volumes of documents. Some code paths still reopen the temporary file; this can be improved in later work.
  • Removing the UploadFile class from File: UploadFile (a FastAPI abstraction over a SpooledTemporaryFile for multipart uploads) was redundant in our File setup, since we already download the file from remote storage, read it into memory, and write it to a temp file. Removing this abstraction simplifies the code.
  • async function adjustments: I've removed the async keyword from functions that weren't truly asynchronous. For instance, calling filter_file to process files isn't genuinely async, as async file reading isn't actually asynchronous: it uses a threadpool to read the file. Since we already rely on celery for parallelism (one worker per core), reading and processing should happen in the same thread, or at least thread spawning should be kept to a minimum. Additionally, the rest of the code isn't inherently asynchronous, so our bottleneck is CPU work rather than waiting on I/O.
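
To make the first and third points concrete, here is a minimal sketch of the pattern, not the actual code from this PR. It assumes the file content is already in memory as bytes and that each loader only accepts a file path (as the langchain loaders do). The names `process_file_once`, `count_bytes`, and `count_lines` are hypothetical stand-ins for illustration.

```python
import os
import tempfile
from typing import Callable, List


def count_bytes(path: str) -> int:
    """Hypothetical path-based 'loader': returns the file size in bytes."""
    return os.path.getsize(path)


def count_lines(path: str) -> int:
    """Hypothetical path-based 'loader': returns the number of lines."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def process_file_once(
    content: bytes, suffix: str, loaders: List[Callable[[str], int]]
) -> List[int]:
    """Write content to ONE temporary file and run every loader on its path.

    Plain synchronous function: the work is blocking file I/O and CPU,
    so marking it async would not make it concurrent.
    """
    # delete=False so the loaders can reopen the file by path on any platform.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(content)
        tmp.flush()
        path = tmp.name
    try:
        return [loader(path) for loader in loaders]
    finally:
        # Clean up the single temporary file once all loaders are done.
        os.remove(path)


if __name__ == "__main__":
    print(process_file_once(b"a\nb\n", ".txt", [count_bytes, count_lines]))
```

The loaders in this sketch still reopen the file by path, which matches the caveat in the first bullet; the saving comes from writing the content to disk only once per call.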

These changes aim to improve performance and streamline our codebase.
Let me know if you have any questions or suggestions for further improvements!

Checklist before requesting a review

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have ideally added tests that prove my fix is effective or that my feature works

@dosubot bot added the size:L label (This PR changes 100-499 lines, ignoring generated files.) on May 4, 2024

vercel bot commented May 4, 2024

Someone is attempting to deploy a commit to the Quivr-app Team on Vercel.

A member of the Team first needs to authorize it.

@dosubot bot added the area: backend label (Related to backend functionality or under the /backend directory) on May 4, 2024
@StanGirard
Collaborator

Thanks a lot! I'll review it and let you know if anything comes up.

@StanGirard
Collaborator

Thanks a lot! It works great except for when you upload URLs ;) I'll fix that

AmineDiro and others added 3 commits June 4, 2024 09:16
Signed-off-by: aminediro <aminediro@github.com>
The list_files_array in the QuivrRAG class is updated to include file URLs in addition to file names. This change allows for better handling and display of files in the application.

sonarcloud bot commented Jun 4, 2024

Quality Gate failed

Failed conditions
6.6% Duplication on New Code (required ≤ 3%)

See analysis details on SonarCloud


    def __init__(self, **data):
        super().__init__(**data)
        data["file_sha1"] = compute_sha1_from_content(data["bytes_content"])
Collaborator

Loving it <3

@StanGirard changed the title from "Refacto/file" to "feat(upload): async improved" on Jun 4, 2024
@StanGirard StanGirard merged commit 675885c into QuivrHQ:main Jun 4, 2024
1 of 4 checks passed
StanGirard added a commit that referenced this pull request Jun 5, 2024
🤖 I have created a release *beep* *boop*
---


## 0.0.259 (2024-06-04)

## What's Changed
* feat(upload): async improved by @AmineDiro in #2544

## New Contributors
* @AmineDiro made their first contribution in #2544

**Full Changelog**: v0.0.258...v0.0.259

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
Labels
area: backend Related to backend functionality or under the /backend directory size:L This PR changes 100-499 lines, ignoring generated files.
2 participants