Batch checking of exported data #430
Conversation
Add `pixl_cli._io.read_patient_data` function that takes either a CSV or parquet file and returns a dataframe of messages. Add `pixl_cli._message_processing.messages_from_df` to convert the df to a list of messages. Update `pixl_cli._message_processing.populate_queue_and_db` to take a df of messages and return a list of the messages added.
Also add test `test_batch_upload` to test the batch querying and uploading of data.
Co-authored-by: Miguel Xochicale m.xochicale@ucl.ac.uk
Looks nice, couple of questions about it but happy to merge
Tested on some real data and I think we might need to make sure that we make the input dataframe distinct on the joining columns. Looks like there is more than one row with the same accession number and mrn, but a different datetime. Does that make sense in terms of what we'd need? Worth adding a test case that represents this.
populate_queue_and_db(queues_to_populate, messages_df)
File "/gae/pixl_prod/PIXL/cli/src/pixl_cli/_message_processing.py", line 131, in populate_queue_and_db
messages_df = filter_exported_or_add_to_db(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gae/pixl_prod/PIXL/cli/src/pixl_cli/_database.py", line 54, in filter_exported_or_add_to_db
messages_df = _filter_exported_messages(messages_df, db_images_df)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gae/pixl_prod/PIXL/cli/src/pixl_cli/_database.py", line 78, in _filter_exported_messages
merged = messages_df.merge(
^^^^^^^^^^^^^^^^^^
File "/gae/miniforge3/envs/pixl_prod/lib/python3.12/site-packages/pandas/core/frame.py", line 10819, in merge
return merge(
^^^^^^
File "/gae/miniforge3/envs/pixl_prod/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 170, in merge
op = _MergeOperation(
^^^^^^^^^^^^^^^^
File "/gae/miniforge3/envs/pixl_prod/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 813, in __init__
self._validate_validate_kwd(validate)
File "/gae/miniforge3/envs/pixl_prod/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 1653, in _validate_validate_kwd
raise MergeError(
pandas.errors.MergeError: Merge keys are not unique in left dataset; not a one-to-one merge
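The `MergeError` above is what pandas raises when a merge is asked to validate as `one_to_one` but the left-hand keys contain duplicates. A minimal reproduction, with hypothetical column names based on the discussion, and the `drop_duplicates` fix the thread converges on:

```python
import pandas as pd

# Two rows share the same mrn + accession number but differ in datetime,
# as in the real data described above.
messages_df = pd.DataFrame(
    {
        "mrn": ["123", "123"],
        "accession_number": ["abc", "abc"],
        "study_date": ["2024-01-01", "2024-01-02"],
    }
)
db_images_df = pd.DataFrame({"mrn": ["123"], "accession_number": ["abc"]})

try:
    messages_df.merge(
        db_images_df, on=["mrn", "accession_number"], validate="one_to_one"
    )
except pd.errors.MergeError as err:
    # "Merge keys are not unique in left dataset; not a one-to-one merge"
    print(err)

# Dropping duplicates on the join columns first makes the merge valid.
deduped = messages_df.drop_duplicates(subset=["mrn", "accession_number"])
merged = deduped.merge(
    db_images_df, on=["mrn", "accession_number"], validate="one_to_one"
)
```

Deduplicating on exactly the join columns (rather than all columns) is what matters here, since the rows differ only in their timestamps.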
Indeed. So, should we test against equal vs different dataframes? It would be nice if you could share three lines of anonymised real data to help us create such a test case! Maybe it is just the same as above with the same timestamps?
Ah, you'd want something like this in a test. Same MRN and accession number but different timestamp. I think @p-j-smith might be having a look at this so worth chatting.
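A sketch of such a test — hypothetical helper and column names; the real test was added alongside `test_batch_upload`:

```python
import pandas as pd


def deduplicate_messages(messages_df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep one row per (mrn, accession_number) pair."""
    return messages_df.drop_duplicates(subset=["mrn", "accession_number"])


def test_duplicate_messages_are_dropped():
    # Same MRN and accession number, but a different timestamp.
    messages_df = pd.DataFrame(
        {
            "mrn": ["42", "42"],
            "accession_number": ["xyz", "xyz"],
            "study_datetime": ["2023-01-01 10:00", "2023-01-01 11:00"],
        }
    )
    deduped = deduplicate_messages(messages_df)
    assert len(deduped) == 1
```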
Add test for duplicate messages
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #430   +/-   ##
=======================================
+ Coverage   80.09%   84.01%   +3.92%
=======================================
  Files          75       83       +8
  Lines        3245     3528     +283
=======================================
+ Hits         2599     2964     +365
+ Misses        646      564      -82

☔ View full report in Codecov by Sentry.
Thanks for catching this! We now drop duplicates when loading the file, and we've added a test case for it. We also needed to fix how we were filtering existing images.
Ah so nice, went down from 15 minutes to 4 seconds ❤️
2024-07-25 14:14:26.176 | INFO | pixl_cli._io:read_patient_info:74 - Created 49589 messages from /*/*/*
2024-07-25 14:14:26.177 | INFO | pixl_cli._message_processing:populate_queue_and_db:128 - Filtering out exported images and uploading new ones to the database
2024-07-25 14:14:30.383 | INFO | core.patient_queue.producer:publish:38 - Publishing 42531 messages to queue: imaging
pytest-pixl/src/pytest_pixl/data/omop-resources/duplicate_input.csv
Co-authored-by: Stef Piatek <s.piatek@ucl.ac.uk>
Oh nice, that's pretty impressive.
Description
Fixes #397: Add batch querying of existing images and batch upload of new images
- Add `pixl_cli._io.read_patient_data` function that takes either a CSV or parquet file and returns a dataframe of messages
- Add `pixl_cli._message_processing.messages_from_df` to convert the df to a list of messages
- Update `pixl_cli._message_processing.populate_queue_and_db` to take a df of messages and return a list of the messages added
- Update `pixl_cli._database.filter_exported_or_add_to_db` to filter messages in memory using dfs rather than querying the db multiple times
- Use `session.bulk_save_objects(images)` to batch insert images
- Add test `test_batch_upload` to test the batch querying and uploading of data

Based on #429 (thanks @stefpiatek!), opened a new PR so @stefpiatek can review
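The batch insert via `session.bulk_save_objects(images)` can be sketched as follows, using a hypothetical `Image` model and an in-memory SQLite database; the real PIXL schema differs:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Image(Base):
    """Hypothetical stand-in for PIXL's image table."""

    __tablename__ = "image"
    id = Column(Integer, primary_key=True)
    mrn = Column(String)
    accession_number = Column(String)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

images = [
    Image(mrn="123", accession_number="abc"),
    Image(mrn="456", accession_number="def"),
]

with Session(engine) as session:
    # One bulk INSERT rather than flushing each object individually.
    session.bulk_save_objects(images)
    session.commit()

with Session(engine) as session:
    count = session.query(Image).count()
```

`bulk_save_objects` skips much of the per-object bookkeeping the ORM normally does, which is where the speed-up over many individual inserts comes from.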
Type of change
Please delete options according to the description.
Suggested Checklist
- `main` branch.
- `UCLH-Foundry/arc-dev`
- `squash and merge`