Handle large CSV files + async preprocessing#363

Draft
MatiasArriola wants to merge 11 commits into
developmentfrom
feature/async-preprocessing

Conversation

@MatiasArriola
Contributor

@MatiasArriola MatiasArriola commented Nov 17, 2025

📌 References

📝 Implementation

  • Create new script yarn run async-preprocessing
    • replicated from async-uploads
    • only triggered for non-CSV files that exceed fileSizeLimit value stored in dataStore glass/general
    • This process validates headers, computes rows and specimen fields, updates the dataStore, and moves the file to the async-uploads queue.
  • For validation, always read CSV in chunks using papaparse instead of loading large files with the XLSX library
    • Methods included directly in CSVUtils instead of creating a custom repository object.
  • Refactor: extract types ValidationResult, ValidationResultWithSpecimens
  • Changed AsyncImportRISIndividualFungalFile to make async-uploads work for this file
    • Chunking is now the first step. Loading a 500 MB file with XLSX left the process idle and consumed a lot of memory.
    • A first pass runs the validations in chunks, and a second pass imports the records in chunks.
    • We need to evaluate the impact of not loading all rows at once on the program rules validation, and make sure no rule depends on rows outside the current chunk (I don't think any does, but just in case).
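
To make the two-pass idea concrete, here is a minimal sketch of it, not the code in this PR. It assumes a naive comma split (no quoted fields) and an in-memory string instead of a file stream; the actual implementation relies on papaparse for chunked reading. `forEachChunk`, `validateInChunks`, and the column names are hypothetical.

```typescript
type Row = Record<string, string>;

interface ValidationSummary {
    headersValid: boolean;
    rows: number;
    specimens: string[];
}

// Hands rows to `onChunk` in groups of `chunkSize`, so a caller never holds
// more than one chunk of parsed rows. Naive parsing: assumes no quoted fields.
function forEachChunk(csv: string, chunkSize: number, onChunk: (rows: Row[]) => void): void {
    const lines = csv.split(/\r?\n/).filter(line => line.length > 0);
    if (lines.length === 0) return;
    const headers = (lines[0] ?? "").split(",");
    let chunk: Row[] = [];
    for (const line of lines.slice(1)) {
        const cells = line.split(",");
        const row: Row = {};
        headers.forEach((header, i) => {
            row[header] = cells[i] ?? "";
        });
        chunk.push(row);
        if (chunk.length === chunkSize) {
            onChunk(chunk);
            chunk = [];
        }
    }
    if (chunk.length > 0) onChunk(chunk);
}

// Pass 1: check headers, count rows, and collect distinct specimen values.
function validateInChunks(
    csv: string,
    requiredHeaders: string[],
    specimenColumn: string,
    chunkSize: number
): ValidationSummary {
    const headers = (csv.split(/\r?\n/, 1)[0] ?? "").split(",");
    const headersValid = requiredHeaders.every(h => headers.indexOf(h) !== -1);
    let rows = 0;
    const seen: Record<string, true> = {};
    forEachChunk(csv, chunkSize, chunk => {
        rows += chunk.length;
        for (const row of chunk) {
            const specimen = row[specimenColumn];
            if (specimen) seen[specimen] = true;
        }
    });
    return { headersValid, rows, specimens: Object.keys(seen) };
}

// Hypothetical sample data; real files have different columns.
const sample = "COUNTRY,SPECIMEN,VALUE\nAR,BLOOD,1\nAR,URINE,2\nUY,BLOOD,3\n";
const summary = validateInChunks(sample, ["COUNTRY", "SPECIMEN"], "SPECIMEN", 2);

// Pass 2: only import when validation passed, again one chunk at a time.
const importedChunkSizes: number[] = [];
if (summary.headersValid) {
    forEachChunk(sample, 2, chunk => importedChunkSizes.push(chunk.length));
}
```

Because each pass only sees one chunk at a time, any validation that needs to compare a row against rows in other chunks would have to carry state across chunks, as `seen` does here for specimens.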

Requires dataStore changes; otherwise it falls back to defaults.
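
As an illustration of the fallback, the size check could look roughly like the sketch below. The settings shape, the default value, and the assumption that `fileSizeLimit` is expressed in bytes are all hypothetical; only the `fileSizeLimit` name comes from this PR, and only the size-threshold part of the routing is shown.

```typescript
// Hypothetical shape of the object stored under the glass/general dataStore key.
interface GeneralSettings {
    fileSizeLimit?: number;
}

// Placeholder default, assumed to be in bytes; the real default may differ.
const DEFAULT_FILE_SIZE_LIMIT = 50 * 1024 * 1024;

// Uses the stored value when it is a positive number, otherwise the default.
function resolveFileSizeLimit(settings: GeneralSettings | undefined): number {
    const value = settings?.fileSizeLimit;
    return typeof value === "number" && value > 0 ? value : DEFAULT_FILE_SIZE_LIMIT;
}

// Files above the limit go through preprocessing before async-uploads.
function exceedsFileSizeLimit(fileSizeBytes: number, settings: GeneralSettings | undefined): boolean {
    return fileSizeBytes > resolveFileSizeLimit(settings);
}
```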

TODO:

  • For non-CSV files marked to be preprocessed, handle it in the UI (show some message, check the status is correctly displayed in the files grid)
  • async-uploads and async-deletions: review performance for large CSV files and implement CSV reading in chunks if needed
  • async-uploads will fail when saving a large number of individual import reports. For example, for a file with 3M rows, trying to JSON.stringify an array of 10k validation reports will fail (not to mention the space required in the dataStore for that). We need a change here to save the summaries in another way.
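
One possible direction for the last item, sketched under assumptions: fold each report into running counters as it is produced, so only a small summary object is serialized instead of the full array. The `ImportReport` and `ImportSummary` shapes and the `maxMessages` cap are hypothetical and do not reflect the current report format.

```typescript
// Hypothetical report shape; the real import reports carry more fields.
interface ImportReport {
    status: "SUCCESS" | "WARNING" | "ERROR";
    messages: string[];
}

interface ImportSummary {
    total: number;
    byStatus: Record<string, number>;
    messages: Record<string, number>;
}

function emptySummary(): ImportSummary {
    return { total: 0, byStatus: {}, messages: {} };
}

// Folds one report into the summary. Distinct messages are capped at
// `maxMessages` so the serialized size stays bounded for any file size.
function addReport(summary: ImportSummary, report: ImportReport, maxMessages = 100): void {
    summary.total += 1;
    summary.byStatus[report.status] = (summary.byStatus[report.status] ?? 0) + 1;
    for (const message of report.messages) {
        const known = Object.prototype.hasOwnProperty.call(summary.messages, message);
        if (known) {
            summary.messages[message] = (summary.messages[message] ?? 0) + 1;
        } else if (Object.keys(summary.messages).length < maxMessages) {
            summary.messages[message] = 1;
        }
    }
}

// Usage: one report per imported chunk, folded in as soon as it is available.
const reports: ImportReport[] = [
    { status: "SUCCESS", messages: [] },
    { status: "ERROR", messages: ["Missing value"] },
    { status: "ERROR", messages: ["Missing value", "Invalid date"] },
];
const importSummary = emptySummary();
for (const report of reports) addReport(importSummary, report);
const serialized = JSON.stringify(importSummary);
```

The trade-off is that per-chunk detail is lost; if individual reports are still needed, they would have to be stored elsewhere.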

📹 Screenshots/Screen capture

🔥 Testing

  • In my local setup, I had to make the following changes to allow increasing the file size upload limit:
    • dhis_2.conf: max.file_upload_size = 5120000000
    • /usr/lib/python3.13/site-packages/d2_docker/config/nginx.conf (or check the proper path inspecting the d2-docker volume for nginx): client_max_body_size 1000m;

2 participants