@soxofaan
Member

@soxofaan soxofaan commented Aug 8, 2025

this is an initial attempt to add support for partial updates in STACAPIJobDatabase.persist, but some tests are still failing

@soxofaan
Member Author

soxofaan commented Aug 8, 2025

cc @HansVRP

@HansVRP
Contributor

HansVRP commented Aug 11, 2025

I'll take a look this week

@HansVRP
Contributor

HansVRP commented Aug 12, 2025

main issue seems to be related to the item_id moving from a column to the dataframe index, causing a mismatch in size.

and the item_id is no longer popped out

@HansVRP
Contributor

HansVRP commented Aug 13, 2025

there were some inconsistencies in the test in terms of the mocks and the string-based nature of the IDs; however, I am not certain that the current version would not cause regressions. I'd like to test it on our cropsar STAC-based job manager and compare the input-output items.

WEED is also using the STAC-based job manager, so every change needs to be validated thoroughly

@soxofaan
Member Author

main issue seems to be related to the item_id moving from a column to the dataframe index;

Indeed, that was intentional in my initial commit: partial updates need a meaningful index (instead of an auto-increment one), so I changed the "item_id" column to be the index.

But if I understand you correctly, there are users or use cases that expect an "item_id" column in the data frame?
I wonder why, though, since the pandas dataframe is (or at least should be) internal business. If you use STACAPIJobDatabase, you want to persist your data to a STAC API and don't care about dataframes, or am I misunderstanding? Do you have more info on why STACAPIJobDatabase users interact with the pandas internals?
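
To illustrate why a partial update needs a meaningful index, here is a minimal sketch with plain pandas. The frames and values are made up for illustration and are not taken from the PR; the only assumption is that the merge relies on DataFrame.update, which aligns rows on the index.

```python
import pandas as pd

# Illustrative data only: three jobs already stored, one job to update.
existing = pd.DataFrame(
    {"item_id": ["a", "b", "c"], "status": ["created", "created", "created"]}
)
partial = pd.DataFrame({"item_id": ["c"], "status": ["finished"]})

# With the default auto-increment index, update() aligns on row label 0,
# so the first row is overwritten instead of the row for item "c".
by_position = existing.copy()
by_position.update(partial)
print(by_position["item_id"].tolist())  # ['c', 'b', 'c']

# With "item_id" as the index, update() hits the intended row.
by_item_id = existing.set_index("item_id")
by_item_id.update(partial.set_index("item_id"))
print(by_item_id["status"].to_dict())
# {'a': 'created', 'b': 'created', 'c': 'finished'}
```

The trade-off discussed in this thread follows from that: once "item_id" is the index, it is no longer available as a regular column unless it is restored explicitly (for example with reset_index()).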

@HansVRP
Contributor

HansVRP commented Aug 18, 2025

Maybe good to discuss tomorrow. It's rather that 2 workflows have already been built on top of the current STAC-based job manager, and I want to avoid their STAC collection becoming inconsistent


# Handle datetime
dt = series_dict.get("datetime")
if not dt:
Member Author

I understand the need to bring back the series_dict.pop("item_id"), but are the other changes relevant here?

I'd like to keep this PR review from spiraling out of scope as well

Member Author

@soxofaan soxofaan Aug 18, 2025

I understand the need to bring back the series_dict.pop("item_id")

on further consideration, I'd like to revisit this:
"item_id" as a column name no longer has any special meaning, so it should not get special treatment (meaning it should not be popped)

if self.has_geometry:
    item_dict["geometry"] = series[self.geometry_column]
else:
    item_dict["geometry"] = None
Member Author

is the removal of these lines relevant here?

as noted, I'd like to keep this PR focused, so that it does not get stranded in endless review

soxofaan added a commit that referenced this pull request Aug 18, 2025
Might still be in use in certain use cases

further eliminate fixture anti-patterns in tests, allowing more parameterization
else:
    # Merge data on item_id (in the index)
    df_to_persist = existing_df
    df_to_persist.update(df, overwrite=True)
Member Author

While working on test coverage, it turned out that this pandas update might cause data loss:
it only updates the intersection of both dataframes, so if there is a mismatch between items in existing_df and df, there will be fewer updates than expected
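
The data loss described above can be reproduced with a small, made-up example. The frame names mirror existing_df and df from the diff, but the contents are illustrative only. The second half shows one possible workaround (reindexing to the union of both indexes before updating); it is a sketch, not necessarily what this project ended up doing.

```python
import pandas as pd

# Illustrative data: "item-3" exists only in the incoming update.
existing_df = pd.DataFrame(
    {"status": ["created", "created"]},
    index=pd.Index(["item-1", "item-2"], name="item_id"),
)
df = pd.DataFrame(
    {"status": ["running", "queued"]},
    index=pd.Index(["item-2", "item-3"], name="item_id"),
)

# DataFrame.update() works in place and only touches labels that are
# already present in the frame being updated: "item-3" is silently dropped.
updated = existing_df.copy()
updated.update(df, overwrite=True)
print(list(updated.index))  # ['item-1', 'item-2']

# Possible workaround: extend the index to the union first, then update.
merged = existing_df.reindex(existing_df.index.union(df.index))
merged.update(df, overwrite=True)
print(merged["status"].to_dict())
# {'item-1': 'created', 'item-2': 'running', 'item-3': 'queued'}
```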

soxofaan added a commit that referenced this pull request Aug 19, 2025
soxofaan added a commit that referenced this pull request Aug 19, 2025
Might still be in use in certain use cases

further eliminate fixture anti-patterns in tests, allowing more parameterization
soxofaan added a commit that referenced this pull request Aug 20, 2025
Eliminate some fixture anti-patterns (too much abstraction and decoupling)
Based on working on #794 and #798
soxofaan added a commit that referenced this pull request Aug 20, 2025
further elimination of unnecessary fixtures
Based on working on #794 and #798
@soxofaan
Member Author

as mentioned in #793 (comment): let's move the task of merging existing data with updates to the job manager (instead of requiring each job db implementation to do this correctly).
(This requires introducing a new API, JobDatabaseInterface.get_by_indices, but that should not be too hard to implement.)

This closes this PR (without merge).
(Note that some test-related tweaks were ported to master anyway)
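
A rough sketch of the proposal above, with the merge done on the manager side, could look as follows. Only the name get_by_indices comes from this thread; its signature and semantics, the InMemoryJobDatabase class and the persist_partial helper are all hypothetical and exist purely for illustration.

```python
from typing import Iterable

import pandas as pd


class InMemoryJobDatabase:
    """Hypothetical stand-in for a job database (not the real openeo class)."""

    def __init__(self) -> None:
        self._rows: dict = {}

    def get_by_indices(self, indices: Iterable) -> pd.DataFrame:
        # Assumed semantics: return the stored rows for the given index
        # labels and ignore labels that are unknown.
        found = {i: self._rows[i] for i in indices if i in self._rows}
        return pd.DataFrame.from_dict(found, orient="index")

    def persist(self, df: pd.DataFrame) -> None:
        # Deliberately simple: overwrite complete rows, no merging here.
        for label, row in df.iterrows():
            self._rows[label] = row.to_dict()


def persist_partial(db: InMemoryJobDatabase, updates: pd.DataFrame) -> None:
    """Hypothetical manager-side merge of existing rows with partial updates."""
    existing = db.get_by_indices(updates.index)
    merged = existing.reindex(
        index=existing.index.union(updates.index),
        columns=existing.columns.union(updates.columns),
    )
    merged.update(updates)
    db.persist(merged)


db = InMemoryJobDatabase()
db.persist(
    pd.DataFrame({"status": ["created", "created"], "cost": [1.0, 2.0]}, index=["a", "b"])
)
# Partial update: only the "status" column, for one known and one new item.
persist_partial(db, pd.DataFrame({"status": ["running", "queued"]}, index=["b", "c"]))
result = db.get_by_indices(["a", "b", "c"])
```

With this split, a job database implementation only has to read and write complete rows, and the merge logic lives in one place.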

@soxofaan soxofaan closed this Aug 20, 2025
@soxofaan soxofaan deleted the issue793-stac-api-job-db-persist-partial-update branch August 20, 2025 10:21
soxofaan added a commit that referenced this pull request Sep 9, 2025
soxofaan added a commit that referenced this pull request Sep 9, 2025
Eliminate some fixture anti-patterns (too much abstraction and decoupling)
Based on working on #794 and #798
soxofaan added a commit that referenced this pull request Sep 9, 2025
further elimination of unnecessary fixtures
Based on working on #794 and #798