Conversation
@rsaksida Btw, I was expecting something like this as mentioned in the Slack thread. We need #3 and #4 from the screenshot to be reusable from Publisher, which will take over the activity of direct publishing to S3.
- Debounces S3 publishing events
- Locks the API while S3 syncing is in progress
- Uploads manifest with all uploaded S3 files
- Triggers Argo workflow for indexing
What this PR accomplishes is basically uploading a set of graphs to S3.
Ok, I was commenting earlier when this PR was in Draft state.
@rsaksida Can we also have it call an Argo workflow to copy the partial set to the corresponding S3 buckets for Graphs/Resources/Metadata? Or is this already happening? I don't see this in the documentation above.
Great! Indexing to ES is expected with Argo workflow.
Not now. We will do that after these atomic workflows are proven reliable. Otherwise, that will simply add more complexity.
@rsaksida Please deploy in the TEST environment so that @chuang-CE can test and approve the PR.
@rsaksida Question for you above, in case you missed it.
When a sync fails, next sync should pick up the leftover work
Reuse manifest key when calling Argo
Questions about your latest commit:
@rsaksida Did you see my comment above? Can we do the copying to S3 through another Argo workflow too? See screenshot here: #1022 (comment)

Ruby Registry continues to publish to PGSQL after processing the incoming JSON-LDs. No change there other than doing it in a batch. Ruby Registry also generates the ZIP or multiple ZIPs depending on whether the background job separates the input JSON-LDs into Resource, Graph and Metadata as well as CRUD operations. All other work of copying to S3 and indexing is left to Argo Workflows. I cannot stress enough how critical it is to separate these concerns to allow the new publisher to do direct publishing by calling the same Argo workflows, without more reworking/hand-holding of this Ruby Registry code base.
@chuang-CE Can you post here also what is blocking your sign-off?

Changes S3 graph publishing from per-envelope immediate writes into a debounced, community-level batch sync.

Previously, each envelope save/delete directly uploaded or deleted a single S3 object from the envelope callback. This PR replaces that with an `EnvelopeGraphSync` tracking record, a delayed ActiveJob, a batch sync service, and a partial graph indexing Argo workflow submission. The result is fewer S3/indexing operations during bursts of publishing activity, a clear sync lock while the batch is being flushed, and a manifest-driven indexing handoff.

This PR is necessary to support https://github.com/CredentialEngine/ce-registry/issues/25
Changes
- New `envelope_graph_syncs` table to track one pending S3 graph sync per envelope community.
- `Envelope` callbacks now record S3 sync activity after commit instead of directly uploading/deleting S3 objects.
- `SyncEnvelopeGraphsWithS3Job` to debounce publishing activity before syncing to S3.
- `SyncPendingEnvelopeGraphsWithS3` to batch-upload/delete graph objects from S3 based on pending `EnvelopeVersion` records.
- `ENVELOPE_GRAPH_SYNC_LOCK_TIMEOUT_SECONDS`, defaulting to 600 seconds.
- Manifest keys: `/<community>/manifests/partial-graphs/<timestamp>.json.gz` and `/<community>/manifests/partial-graphs/latest.json.gz`.
- `SubmitPartialGraphIndexWorkflow` to submit the Argo partial indexing workflow after a manifest is written.
- Uses the `s3-graphs-zip` workflow template instead of reading `ARGO_WORKFLOWS_TEMPLATE_NAME`.

New behavior
This prevents every individual publish from immediately causing its own S3 write and downstream indexing trigger. Instead, publishes within the debounce window are collapsed into a single batch. That batch syncs only the latest pending version per CTID, writes the affected graph files to S3, writes a gzipped manifest of uploaded S3 keys, and then triggers the partial indexing workflow using that manifest.
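The collapse of a burst into a single batch hinges on the debounce quiet-period check. A minimal sketch of that decision in plain Ruby (the helper names `quiet_period_elapsed?` and `next_run_at` are hypothetical; the real job reads these timestamps from the `envelope_graph_syncs` record):

```ruby
require "time"

# Assumed debounce window; the PR reads this from
# ENVELOPE_GRAPH_SYNC_DEBOUNCE_SECONDS, defaulting to 60.
DEBOUNCE_SECONDS = 60

# The batch may flush only once no new publish activity has arrived
# for a full debounce window.
def quiet_period_elapsed?(last_activity_at, now: Time.now)
  now - last_activity_at >= DEBOUNCE_SECONDS
end

# If activity is still arriving, the job re-enqueues itself for the
# end of the new quiet period instead of flushing early.
def next_run_at(last_activity_at)
  last_activity_at + DEBOUNCE_SECONDS
end

last_activity = Time.now - 10          # a publish landed 10 seconds ago
quiet_period_elapsed?(last_activity)   # => false, so the job re-enqueues
next_run_at(last_activity)             # roughly 50 seconds from now
```

Because only the first activity enqueues the job and later activity merely moves the timestamps, an arbitrarily long burst costs one job run plus re-enqueues, not one sync per publish.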
Deletes are also handled in the batch. Delete-only batches do not write a manifest or trigger Argo, because the manifest is currently an upload-input list for partial indexing.
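The manifest itself is small: a gzipped JSON list of the S3 keys uploaded in the batch, written under a timestamped key and mirrored to `latest.json.gz`. A sketch under those assumptions (the helper names and the timestamp format are hypothetical; the key layout follows the paths in the description above):

```ruby
require "json"
require "zlib"
require "stringio"

# Gzip a JSON payload in memory, suitable for an S3 put_object body.
def gzip_json(payload)
  buffer = StringIO.new
  Zlib::GzipWriter.wrap(buffer) { |gz| gz.write(JSON.generate(payload)) }
  buffer.string
end

# The timestamped key plus the `latest` alias; both receive the same body.
def manifest_keys(community, now: Time.now.utc)
  stamp = now.strftime("%Y%m%dT%H%M%SZ") # assumed format
  [
    "/#{community}/manifests/partial-graphs/#{stamp}.json.gz",
    "/#{community}/manifests/partial-graphs/latest.json.gz"
  ]
end

uploaded = ["/ce-registry/ce-1234.json", "/ce-registry/ce-5678.json"]
body = gzip_json(uploaded)
# manifest_keys("ce-registry") names the two objects to write with `body`.
```

Writing `latest.json.gz` alongside the timestamped key gives the Argo workflow a stable input key while preserving a history of per-batch manifests.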
Flow Up To Argo Workflow Trigger
1. An `Envelope` is saved or deleted.
2. The `Envelope` callback calls `EnvelopeGraphSync.record_activity!` with the envelope community and latest PaperTrail envelope version id.
3. `EnvelopeGraphSync.record_activity!` creates or updates the per-community sync tracker, setting:
   - `last_activity_at`
   - `last_activity_version_id`
   - `scheduled_for_at`
   - `last_synced_version_id`, initialized to the previous version when the sync record is first created
4. The first activity enqueues `SyncEnvelopeGraphsWithS3Job` for `now + ENVELOPE_GRAPH_SYNC_DEBOUNCE_SECONDS`, defaulting to 60 seconds.
5. Subsequent activity updates `last_activity_at` and advances `last_activity_version_id`, but does not enqueue another job.
6. When `SyncEnvelopeGraphsWithS3Job` runs, it checks whether the debounce quiet period has elapsed.
   - If not, it updates `scheduled_for_at` and re-enqueues itself for the new quiet-period end.
   - If so, it clears `scheduled_for_at`, marks the sync as `syncing`, and calls `SyncPendingEnvelopeGraphsWithS3`.
7. While `syncing` is true, the v1 and v2 publish APIs reject new publish requests for that community with HTTP `503` and the message `Publishing is temporarily locked while S3 sync is in progress`.
8. `SyncPendingEnvelopeGraphsWithS3` finds pending `EnvelopeVersion` rows for that community: `Envelope` versions newer than `last_synced_version_id`.
   - Uploads write each envelope's `processed_resource` JSON to `/<community>/<ctid>.json`.
   - Deletes remove `/<community>/<ctid>.json`.
   - Each uploaded envelope has its `s3_url` updated from the S3 object public URL.
9. The batch writes the gzipped manifest of uploaded keys, plus `latest.json.gz`.
10. `SubmitPartialGraphIndexWorkflow` submits the Argo workflow:
    - workflow template `update-index-graphs-input-file-elasticsearch`
    - workflow name prefix `<community>-partial-graph-index-`
    - parameters: `task-image`, `index-name`, `input-bucket`, `input-file-key`, `source-bucket`, `prefix-path`, `aws-region`
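The final step above can be sketched as a payload builder. This is only an illustration of how the listed parameter names might be wired together: the `submit_options` shape loosely mirrors the Argo Workflows submit API, and the `task-image` value and method name are hypothetical.

```ruby
# Assemble a submission payload for the partial graph indexing workflow
# from the manifest the batch sync just wrote. Parameter names follow
# the flow description; values here are placeholders.
def partial_index_submit_params(community:, manifest_key:, bucket:, region:)
  {
    resource_kind: "WorkflowTemplate",
    resource_name: "update-index-graphs-input-file-elasticsearch",
    submit_options: {
      # Argo appends a random suffix to generate_name for each run.
      generate_name: "#{community}-partial-graph-index-",
      parameters: [
        "task-image=registry-indexer:latest", # hypothetical image
        "index-name=#{community}",
        "input-bucket=#{bucket}",
        "input-file-key=#{manifest_key}",
        "source-bucket=#{bucket}",
        "prefix-path=#{community}",
        "aws-region=#{region}"
      ]
    }
  }
end
```

Passing the manifest key as `input-file-key` is what makes the handoff manifest-driven: the workflow only ever indexes the keys the batch actually uploaded, never the whole bucket.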