-
Notifications
You must be signed in to change notification settings - Fork 47
Support partial updates in STACAPIJobDatabase.persist #794
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
this is initial attempt to add support for partial updates in |
|
cc @HansVRP |
|
I'll take a look this week |
|
main issue seems to be related to the item_id moving from a column to the dataframe index; causing a mismatch in size. and the item_id is no longer popped out |
|
there were some inconsistencies in the test in terms of the mocks, and the string based nature of the IDs; however I am uncertain the current version would not cause regression. I'd like to test it on our cropsar stac based job manager and compare the input-output items. WEED is also using the stac based job manager, so every change needs to be validated thoroughly |
Indeed that was intentional in my initial commit to allow partial updates, where you need a meaningful index (instead of an auto-increment one). So I changed the "item_id" column to be the index. But if I understand you correctly, there are users or use cases that expect an "item_id" column in the data frame? |
|
Maybe good to discuss tomorrow; It's rather that there have already been 2 workflows build on top of the current stac based job manager; I want to avoid that their STAC collection becomes inconsistent |
|
|
||
| # Handle datetime | ||
| dt = series_dict.get("datetime") | ||
| if not dt: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the need to bring back the series_dict.pop("item_id"), but are the other changes relevant here?
I'd like to avoid that this PR review also spirals out of scope
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the need to bring back the series_dict.pop("item_id")
on further consideration: I'd like to reconsider:
"item_id" as column name has no special meaning anymore, so it should not get special treatment (meaning it should not be popped)
| if self.has_geometry: | ||
| item_dict["geometry"] = series[self.geometry_column] | ||
| else: | ||
| item_dict["geometry"] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is removal of these lines relevant here?
as noted, I'd like to keep this PR focused to avoid it strands in eternal review
| else: | ||
| # Merge data on item_id (in the index) | ||
| df_to_persist = existing_df | ||
| df_to_persist.update(df, overwrite=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While working on test coverage, it turned out that this pandas update might cause data loss:
it only updates the intersection of both dataframes, so if there is a mismatch between items in existing_df and df, there will be less updates than expected
|
as mentioned in #793 (comment) : let's move the task of merging existing data with updates to the job manager (instead of requiring each job db implementation to do this correctly. This closes this PR (without merge). |
Uh oh!
There was an error while loading. Please reload this page.