[Fixes #13966] Harvesters: Integrity error when a title is updated in the remote WMS service #13967
[Fixes #13966] Harvesters: Integrity error when a title is updated in the remote WMS service #13967mattiagiupponi merged 3 commits intomasterfrom
Conversation
Summary of ChangesHello @Gpetrak, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves an Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
The pull request effectively resolves the IntegrityError encountered when remote resource titles are updated by moving the title field to the defaults parameter in get_or_create. It also introduces robust handling for race conditions using transaction.atomic() and IntegrityError catching. The added test case correctly validates the fix. I have suggested a minor optimization to avoid redundant database writes for newly created resources and to ensure timestamp consistency within the batch processing.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #13967 +/- ##
==========================================
+ Coverage 74.24% 74.25% +0.01%
==========================================
Files 947 950 +3
Lines 56620 56797 +177
Branches 7675 7699 +24
==========================================
+ Hits 42038 42176 +138
- Misses 12892 12925 +33
- Partials 1690 1696 +6 🚀 New features to boost your workflow:
|
geonode/harvesting/tasks.py
Outdated
| }, | ||
| ) | ||
| try: | ||
| with transaction.atomic(): |
There was a problem hiding this comment.
why so we need transaction.atomic here @Gpetrak ?
There was a problem hiding this comment.
@giohappy I added the atomic block to handle concurrency safety for our 10 parallel harvesting workers (as an extra safety level). While the primary bug was caused by a metadata mismatch, this addition prevents the discovery session from crashing because of the multiple workers.
There was a problem hiding this comment.
There was a problem hiding this comment.
@mattiagiupponi you are right and thank you that mention it, but I don't think that it's 100% reduntant. The problem is not the Django itself but the whole database transaction inside the task execution (Celery) workflow. In a UniqueViolation error, the database aborts the current transaction. While Django's get_or_create handles the logic, it doesn't clear the 'Aborted' state from the database connection. By wrapping it in a manual atomic block, we create a savepoint. This allows us to roll back only the collision, clean the connection, and safely call .save() and continue the rest of the loop without crashing the whole Celery task. I think the problem here is not Django but Celery / PostgreSQL.
Anyway, if we want to prioritize code simplicity, I’m fine with removing it for now. We can monitor the logs for Transaction-aborted errors and re-add the savepoint if the high-concurrency proves it's necessary.
There was a problem hiding this comment.
I see you already removed the transaction.atomic() context in your last commit, but for clarity I want to highlight that the transaction is already handled inside the get_or_create method. Wrapping the method inside another atomic transaction would be useless.
def get_or_create(self, defaults=None, **kwargs):
"""
Look up an object with the given kwargs, creating one if necessary.
Return a tuple of (object, created), where created is a boolean
specifying whether an object was created.
"""
# The get() needs to be targeted at the write database in order
# to avoid potential transaction consistency problems.
self._for_write = True
try:
return self.get(**kwargs), False
except self.model.DoesNotExist:
params = self._extract_model_params(defaults, **kwargs)
# Try to create an object using passed params.
try:
with transaction.atomic(using=self.db):
params = dict(resolve_callables(params))
return self.create(**params), True
except IntegrityError:
try:
return self.get(**kwargs), False
except self.model.DoesNotExist:
pass
raise
This PR is created according to this issue #13966
Checklist
For all pull requests:
The following are required only for core and extension modules (they are welcomed, but not required, for contrib modules):
Submitting the PR does not require you to check all items, but by the time it gets merged, they should be either satisfied or inapplicable.