
Batch checking of exported data #430

Merged — 15 commits merged into main, Jul 25, 2024
Conversation

@p-j-smith (Contributor) commented Jul 24, 2024

Description

Fixes #397: Add batch querying of existing images and batch upload of new images

  • Add a pixl_cli._io.read_patient_data function that takes either a CSV or a Parquet file and returns a dataframe of messages
  • Add pixl_cli._message_processing.messages_from_df to convert the dataframe to a list of messages
  • Update pixl_cli._message_processing.populate_queue_and_db to take a dataframe of messages and return a list of the messages added
  • Update pixl_cli._database.filter_exported_or_add_to_db to filter messages in memory using dataframes, rather than querying the database multiple times
  • Use session.bulk_save_objects(images) to batch-insert images
  • Add a test, test_batch_upload, covering the batch querying and uploading of data

Based on #429 (thanks @stefpiatek!), opened a new PR so @stefpiatek can review
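The batched insert mentioned above can be sketched as follows. This is a minimal, self-contained illustration against an in-memory SQLite database; the Image model and its columns here are hypothetical stand-ins, not the real PIXL schema:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Image(Base):
    """Hypothetical stand-in for the PIXL Image model."""
    __tablename__ = "image"
    id = Column(Integer, primary_key=True)
    mrn = Column(String)
    accession_number = Column(String)

engine = create_engine("sqlite://")  # in-memory database for the sketch
Base.metadata.create_all(engine)

images = [Image(mrn="mrn", accession_number=str(n)) for n in (123, 234, 345)]

with Session(engine) as session:
    # One batched INSERT instead of a round trip per image
    session.bulk_save_objects(images)
    session.commit()
```

The point of `bulk_save_objects` over adding objects one at a time is that the inserts are emitted in batches, which is where the bulk of the speed-up for large exports comes from.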

Type of change

Please delete the options that do not apply to this change.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Suggested Checklist

  • I have performed a self-review of my own code.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • I have commented my code, particularly in hard-to-understand areas.
  • Tests pass on my local machine (see the CONTRIBUTING document for further details).
  • My branch is up to date with the main branch (see CONTRIBUTING for how to synchronise your branch with main).
  • I have requested a PR review from UCLH-Foundry/arc-dev.
  • I have addressed, and marked as resolved, all review comments on my PR.
  • Finally, I have selected "Squash and merge".

stefpiatek and others added 10 commits July 23, 2024 11:15
Add pixl_cli._io.read_patient_data function that takes either a CSV or parquet files and returns a dataframe of messages
Add pixl_cli._message_processing.messages_from_df to convert the df to a list of messages
Update pixl_cli._message_processing.populate_queue_and_db to take a df of messages and return a list of the messages added
…ages

Also add test test_batch_upload to test the batch querying and uploading of data
Co-authored-by: Miguel Xochicale m.xochicale@ucl.ac.uk
@stefpiatek (Contributor) left a comment:

Looks nice, couple of questions about it but happy to merge

cli/tests/conftest.py (comment resolved)
cli/src/pixl_cli/_database.py (comment resolved)
@stefpiatek (Contributor) left a comment:

Tested on some real data and I think we might need to make sure that we make the input dataframe distinct on the joining columns. Looks like there is more than one row with the same accession number and mrn, but a different datetime. Does that make sense in terms of what we'd need? Worth adding a test case that represents this

    populate_queue_and_db(queues_to_populate, messages_df)
  File "/gae/pixl_prod/PIXL/cli/src/pixl_cli/_message_processing.py", line 131, in populate_queue_and_db
    messages_df = filter_exported_or_add_to_db(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gae/pixl_prod/PIXL/cli/src/pixl_cli/_database.py", line 54, in filter_exported_or_add_to_db
    messages_df = _filter_exported_messages(messages_df, db_images_df)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gae/pixl_prod/PIXL/cli/src/pixl_cli/_database.py", line 78, in _filter_exported_messages
    merged = messages_df.merge(
             ^^^^^^^^^^^^^^^^^^
  File "/gae/miniforge3/envs/pixl_prod/lib/python3.12/site-packages/pandas/core/frame.py", line 10819, in merge
    return merge(
           ^^^^^^
  File "/gae/miniforge3/envs/pixl_prod/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 170, in merge
    op = _MergeOperation(
         ^^^^^^^^^^^^^^^^
  File "/gae/miniforge3/envs/pixl_prod/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 813, in __init__
    self._validate_validate_kwd(validate)
  File "/gae/miniforge3/envs/pixl_prod/lib/python3.12/site-packages/pandas/core/reshape/merge.py", line 1653, in _validate_validate_kwd
    raise MergeError(
pandas.errors.MergeError: Merge keys are not unique in left dataset; not a one-to-one merge
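The traceback can be reproduced in isolation: with `validate="one_to_one"`, pandas raises a `MergeError` as soon as the joining columns are duplicated in the left frame. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Two messages share the same (mrn, accession_number) but differ in timestamp,
# so the merge keys in the left frame are not unique
messages_df = pd.DataFrame({
    "mrn": ["mrn", "mrn"],
    "accession_number": ["123", "123"],
    "extract_generated_timestamp": [
        "2024-07-25 09:41:58.727406",
        "2024-07-25 09:41:58.727419",
    ],
})
db_images_df = pd.DataFrame({"mrn": ["mrn"], "accession_number": ["123"]})

try:
    messages_df.merge(
        db_images_df, on=["mrn", "accession_number"], validate="one_to_one"
    )
except pd.errors.MergeError as err:
    print(err)  # Merge keys are not unique in left dataset; not a one-to-one merge
```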

@mxochicale (Contributor) commented Jul 25, 2024

Tested on some real data and I think we might need to make sure that we make the input dataframe distinct on the joining columns. Looks like there is more than one row with the same accession number and mrn, but a different datetime. Does that make sense in terms of what we'd need? Worth adding a test case that represents this

Indeed, example_messages_df generates this kind of message (with different timestamps):

   mrn accession_number  study_date  procedure_occurrence_id    project_name      extract_generated_timestamp
0  mrn              123  2023-01-01                        1  i-am-a-project 2024-07-25 09:41:58.727406+00:00
1  mrn              234  2023-01-01                        1  i-am-a-project 2024-07-25 09:41:58.727419+00:00
2  mrn              345  2023-01-01                        1  i-am-a-project 2024-07-25 09:41:58.727422+00:00 

So, should we test against equal vs different dataframes? It would be nice if you could share three lines of anonymised real data to help us create such a test case! Maybe it is just the same as above but with the same timestamps?

@stefpiatek (Contributor) commented:

Ah you'd want something like this in a test. Same MRN and accession number but different timestamp. I think @p-j-smith might be having a look at this so worth chatting

   mrn accession_number  study_date  procedure_occurrence_id    project_name      extract_generated_timestamp
0  mrn              123  2023-01-01                        1  i-am-a-project 2024-07-25 09:41:58.727406+00:00
1  mrn              123  2023-01-01                        1  i-am-a-project 2024-07-25 09:41:58.727419+00:00
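A deduplication step on the joining columns handles this case. A minimal sketch, using hypothetical data shaped like the rows above: `drop_duplicates` keeps one row per (mrn, accession_number) pair, so later merges on those columns can be validated as one-to-one:

```python
import pandas as pd

# Two rows with the same mrn and accession_number but different timestamps
df = pd.DataFrame({
    "mrn": ["mrn", "mrn"],
    "accession_number": ["123", "123"],
    "study_date": ["2023-01-01", "2023-01-01"],
    "extract_generated_timestamp": [
        "2024-07-25 09:41:58.727406",
        "2024-07-25 09:41:58.727419",
    ],
})

# Keep only the first row for each (mrn, accession_number) pair
deduped = df.drop_duplicates(subset=["mrn", "accession_number"], keep="first")
print(len(deduped))  # 1
```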


codecov bot commented Jul 25, 2024

Codecov Report

Attention: Patch coverage is 98.02632% with 3 lines in your changes missing coverage. Please review.

Project coverage is 84.01%. Comparing base (282f64b) to head (78bc732).

Files Patch % Lines
cli/src/pixl_cli/_io.py 95.34% 2 Missing ⚠️
cli/src/pixl_cli/_message_processing.py 95.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #430      +/-   ##
==========================================
+ Coverage   80.09%   84.01%   +3.92%     
==========================================
  Files          75       83       +8     
  Lines        3245     3528     +283     
==========================================
+ Hits         2599     2964     +365     
+ Misses        646      564      -82     


@p-j-smith (Contributor, PR author) commented:

Tested on some real data and I think we might need to make sure that we make the input dataframe distinct on the joining columns. Looks like there is more than one row with the same accession number and mrn, but a different datetime.

Thanks for catching this! We now drop duplicates when loading the file, and we've added a test case for it. We also needed to fix how we were filtering existing images using df.isin: when comparing two dataframes with isin, the indices need to match.
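For context on the isin pitfall: `DataFrame.isin(DataFrame)` aligns on both index and column labels, so a row present in the other frame is not matched if it sits at a different index. A small sketch with hypothetical data, contrasting it with a column-based merge:

```python
import pandas as pd

messages = pd.DataFrame(
    {"mrn": ["a", "b"], "accession_number": ["1", "2"]}, index=[0, 1]
)
exported = pd.DataFrame(
    {"mrn": ["b"], "accession_number": ["2"]}, index=[5]  # different index
)

# isin aligns on index AND columns, so row ("b", "2") is not
# detected as exported here, because its index differs:
naive = messages.isin(exported).all(axis=1)
print(naive.tolist())  # [False, False]

# Merging on the joining columns sidesteps index alignment entirely:
merged = messages.merge(
    exported, on=["mrn", "accession_number"], how="left", indicator=True
)
unexported = messages[(merged["_merge"] == "left_only").to_numpy()]
print(unexported["mrn"].tolist())  # ['a']
```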

@stefpiatek (Contributor) left a comment:

Ah so nice, went down from 15 minutes to 4 seconds ❤️

2024-07-25 14:14:26.176 | INFO     | pixl_cli._io:read_patient_info:74 - Created 49589 messages from /*/*/*
2024-07-25 14:14:26.177 | INFO     | pixl_cli._message_processing:populate_queue_and_db:128 - Filtering out exported images and uploading new ones to the database
2024-07-25 14:14:30.383 | INFO     | core.patient_queue.producer:publish:38 - Publishing 42531 messages to queue: imaging

Co-authored-by: Stef Piatek <s.piatek@ucl.ac.uk>
@p-j-smith (Contributor, PR author) commented:

Ah so nice, went down from 15 minutes to 4 seconds ❤️

Oh nice, that's pretty impressive

@p-j-smith p-j-smith merged commit 6d55a87 into main Jul 25, 2024
10 checks passed
@p-j-smith p-j-smith deleted the miguel-paul/batch-populate-query branch July 25, 2024 14:12

Successfully merging this pull request may close these issues.

Batch query for pixl populate
3 participants