
Batch checking of exported data #430

Merged — 15 commits merged into main, Jul 25, 2024
Conversation

@p-j-smith (Contributor) commented Jul 24, 2024

Description

Fixes #397: Add batch querying of existing images and batch upload of new images

  • Add a pixl_cli._io.read_patient_data function that takes either a CSV or a Parquet file and returns a dataframe of messages
  • Add pixl_cli._message_processing.messages_from_df to convert the dataframe to a list of messages
  • Update pixl_cli._message_processing.populate_queue_and_db to take a dataframe of messages and return a list of the messages added
  • Update pixl_cli._database.filter_exported_or_add_to_db to filter messages in memory using dataframes, rather than querying the database multiple times
  • Use session.bulk_save_objects(images) to batch-insert images
  • Add a test, test_batch_upload, covering the batch querying and uploading of data

Based on #429 (thanks @stefpiatek!), opened a new PR so @stefpiatek can review
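The batched insert mentioned above can be sketched as follows. This is a minimal, self-contained illustration against an in-memory SQLite database; the Image model and its columns here are hypothetical stand-ins, not the real PIXL schema:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Image(Base):
    """Hypothetical stand-in for the PIXL Image model."""
    __tablename__ = "image"
    id = Column(Integer, primary_key=True)
    mrn = Column(String)
    accession_number = Column(String)

engine = create_engine("sqlite://")  # in-memory database for the sketch
Base.metadata.create_all(engine)

images = [Image(mrn="mrn", accession_number=str(n)) for n in (123, 234, 345)]

with Session(engine) as session:
    # One batched INSERT instead of a round trip per image
    session.bulk_save_objects(images)
    session.commit()
```

The point of `bulk_save_objects` over adding objects one at a time is that the inserts are emitted in batches, which is where the bulk of the speed-up for large exports comes from.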

Type of change

Please delete the options that do not apply to this change.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Suggested Checklist

  • I have performed a self-review of my own code.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • I have commented my code, particularly in hard-to-understand areas.
  • Tests pass on my local machine (see the CONTRIBUTING document for further details).
  • My branch is up to date with the main branch (see CONTRIBUTING for how to synchronise your branch with main).
  • I have requested a PR review from UCLH-Foundry/arc-dev.
  • I have addressed, and marked as resolved, all review comments on my PR.
  • Finally, I have selected "Squash and merge".

stefpiatek and others added 10 commits July 23, 2024 11:15
Add pixl_cli._io.read_patient_data function that takes either a CSV or parquet files and returns a dataframe of messages
Add pixl_cli._message_processing.messages_from_df to convert the df to a list of messages
Update pixl_cli._message_processing.populate_queue_and_db to take a df of messages and return a list of the messages added
…ages

Also add test test_batch_upload to test the batch querying and uploading of data
Co-authored-by: Miguel Xochicale m.xochicale@ucl.ac.uk
@stefpiatek (Contributor) left a comment:

Looks nice, couple of questions about it but happy to merge

cli/tests/conftest.py (comment resolved)
cli/src/pixl_cli/_database.py (comment resolved)
@stefpiatek (Contributor) left a comment:

Tested on some real data and I think we might need to make sure that we make the input dataframe distinct on the joining columns. Looks like there is more than one row with the same accession number and mrn, but a different datetime. Does that make sense in terms of what we'd need? Worth adding a test case that represents this

    populate_queue_and_db(queues_to_populate, messages_df)
  File "/gae/pixl_prod/PIXL/cli/src/pixl_cli/_message_processing.py", line 131, in populate_queue_and_db
    messages_df = filter_exported_or_add_to_db(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gae/pixl_prod/PIXL/cli/src/pixl_cli/_database.py", line 54, in filter_exported_or_add_to_db
    messages_df = _filter_exported_messages(messages_df, db_images_df)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gae/pixl_prod/PIXL/cli/src/pixl_cli/_database.py", line 78, in _filter_exported_messages
    merged = messages_df.merge(
             ^^^^^^^^^^^^^^^^^^
  File "/gae/miniforge3/envs/pixl_prod/lib/python3.12/site-packages/pandas/core/frame.py", line 10819, in merge
    return merge(
           ^^^^^^
  File "/gae/miniforge3/envs/pixl_prod/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 170, in merge
    op = _MergeOperation(
         ^^^^^^^^^^^^^^^^
  File "/gae/miniforge3/envs/pixl_prod/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 813, in __init__
    self._validate_validate_kwd(validate)
  File "/gae/miniforge3/envs/pixl_prod/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 1653, in _validate_validate_kwd
    raise MergeError(
pandas.errors.MergeError: Merge keys are not unique in left dataset; not a one-to-one merge
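The traceback can be reproduced in isolation: with `validate="one_to_one"`, pandas raises a `MergeError` as soon as the joining columns are duplicated in the left frame. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Two messages share the same (mrn, accession_number) but differ in timestamp,
# so the merge keys in the left frame are not unique
messages_df = pd.DataFrame({
    "mrn": ["mrn", "mrn"],
    "accession_number": ["123", "123"],
    "extract_generated_timestamp": [
        "2024-07-25 09:41:58.727406",
        "2024-07-25 09:41:58.727419",
    ],
})
db_images_df = pd.DataFrame({"mrn": ["mrn"], "accession_number": ["123"]})

try:
    messages_df.merge(
        db_images_df, on=["mrn", "accession_number"], validate="one_to_one"
    )
except pd.errors.MergeError as err:
    print(err)  # Merge keys are not unique in left dataset; not a one-to-one merge
```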

@mxochicale (Contributor) commented Jul 25, 2024

Tested on some real data and I think we might need to make sure that we make the input dataframe distinct on the joining columns. Looks like there is more than one row with the same accession number and mrn, but a different datetime. Does that make sense in terms of what we'd need? Worth adding a test case that represents this

Indeed, example_messages_df generates this kind of message (with different timestamps):

   mrn accession_number  study_date  procedure_occurrence_id    project_name      extract_generated_timestamp
0  mrn              123  2023-01-01                        1  i-am-a-project 2024-07-25 09:41:58.727406+00:00
1  mrn              234  2023-01-01                        1  i-am-a-project 2024-07-25 09:41:58.727419+00:00
2  mrn              345  2023-01-01                        1  i-am-a-project 2024-07-25 09:41:58.727422+00:00 

So, should we test against equal vs different dataframes? It would be nice if you could share three lines of anonymised real data to help us create such a test case! Maybe it is just the same as above but with the same timestamps?

@stefpiatek (Contributor) commented:

Ah you'd want something like this in a test. Same MRN and accession number but different timestamp. I think @p-j-smith might be having a look at this so worth chatting

   mrn accession_number  study_date  procedure_occurrence_id    project_name      extract_generated_timestamp
0  mrn              123  2023-01-01                        1  i-am-a-project 2024-07-25 09:41:58.727406+00:00
1  mrn              123  2023-01-01                        1  i-am-a-project 2024-07-25 09:41:58.727419+00:00
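A deduplication step on the joining columns handles this case. A minimal sketch, using hypothetical data shaped like the rows above: `drop_duplicates` keeps one row per (mrn, accession_number) pair, so later merges on those columns can be validated as one-to-one:

```python
import pandas as pd

# Two rows with the same mrn and accession_number but different timestamps
df = pd.DataFrame({
    "mrn": ["mrn", "mrn"],
    "accession_number": ["123", "123"],
    "study_date": ["2023-01-01", "2023-01-01"],
    "extract_generated_timestamp": [
        "2024-07-25 09:41:58.727406",
        "2024-07-25 09:41:58.727419",
    ],
})

# Keep only the first row for each (mrn, accession_number) pair
deduped = df.drop_duplicates(subset=["mrn", "accession_number"], keep="first")
print(len(deduped))  # 1
```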


codecov bot commented Jul 25, 2024

Codecov Report

Attention: Patch coverage is 98.02632% with 3 lines in your changes missing coverage. Please review.

Project coverage is 84.01%. Comparing base (282f64b) to head (78bc732).

Files Patch % Lines
cli/src/pixl_cli/_io.py 95.34% 2 Missing ⚠️
cli/src/pixl_cli/_message_processing.py 95.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #430      +/-   ##
==========================================
+ Coverage   80.09%   84.01%   +3.92%     
==========================================
  Files          75       83       +8     
  Lines        3245     3528     +283     
==========================================
+ Hits         2599     2964     +365     
+ Misses        646      564      -82     


@p-j-smith (Contributor, PR author) commented:

Tested on some real data and I think we might need to make sure that we make the input dataframe distinct on the joining columns. Looks like there is more than one row with the same accession number and mrn, but a different datetime.

Thanks for catching this! We now drop duplicates when loading the file, and we've added a test case for it. We also needed to fix how we were filtering existing images using df.isin: when comparing two dataframes with isin, the indices need to match.
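For context on the isin pitfall: `DataFrame.isin(DataFrame)` aligns on both index and column labels, so a row present in the other frame is not matched if it sits at a different index. A small sketch with hypothetical data, contrasting it with a column-based merge:

```python
import pandas as pd

messages = pd.DataFrame(
    {"mrn": ["a", "b"], "accession_number": ["1", "2"]}, index=[0, 1]
)
exported = pd.DataFrame(
    {"mrn": ["b"], "accession_number": ["2"]}, index=[5]  # different index
)

# isin aligns on index AND columns, so row ("b", "2") is not
# detected as exported here, because its index differs:
naive = messages.isin(exported).all(axis=1)
print(naive.tolist())  # [False, False]

# Merging on the joining columns sidesteps index alignment entirely:
merged = messages.merge(
    exported, on=["mrn", "accession_number"], how="left", indicator=True
)
unexported = messages[(merged["_merge"] == "left_only").to_numpy()]
print(unexported["mrn"].tolist())  # ['a']
```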

@stefpiatek (Contributor) left a comment:

Ah so nice, went down from 15 minutes to 4 seconds ❤️

2024-07-25 14:14:26.176 | INFO     | pixl_cli._io:read_patient_info:74 - Created 49589 messages from /*/*/*
2024-07-25 14:14:26.177 | INFO     | pixl_cli._message_processing:populate_queue_and_db:128 - Filtering out exported images and uploading new ones to the database
2024-07-25 14:14:30.383 | INFO     | core.patient_queue.producer:publish:38 - Publishing 42531 messages to queue: imaging

Co-authored-by: Stef Piatek <s.piatek@ucl.ac.uk>
@p-j-smith (Contributor, PR author) commented:

Ah so nice, went down from 15 minutes to 4 seconds ❤️

Oh nice, that's pretty impressive

@p-j-smith p-j-smith merged commit 6d55a87 into main Jul 25, 2024
10 checks passed
@p-j-smith p-j-smith deleted the miguel-paul/batch-populate-query branch July 25, 2024 14:12

Successfully merging this pull request may close these issues.

Batch query for pixl populate
3 participants