Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ETag retrieval for orchestrations with multiple executions #363

Merged
merged 4 commits into from
Jan 31, 2020

Conversation

ConnorMcMahon
Copy link
Contributor

It is possible for the history table to get into a state where the
latest execution history is first, followed by the previous execution
history, with the last entry for an instance id being the sentinal row.
The previous history parsing logic stopped processing as soon as it saw
a new execution id, meaning it failed to grab the ETag from the sentinal
row.

This, in combination with changes made when the ETag were not present,
caused UpdateState calls to fail, meaning queue messages for Activities
were being scheduled without recording that we did schedule them. This
leads to duplicate Activity executions until we retrieve the history
with an execution id (i.e. after an activity function successfully
executes).

This fix makes our ETag retrieval logic more robust.

eTagValue = entity.ETag;
if (i != (tableEntities.Count - 1))
{
AnalyticsEventSource.Log.GeneralWarning(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debatably this should be an error and an exception should be thrown here. However, I am a bit hesitant throwing an exception, as I'm not sure how it is handled here.

@cgillum
Copy link
Collaborator

cgillum commented Jan 30, 2020

I see what you're trying to do but this last refactor seems unnecessarily complicated. Here is my suggestion, which I think is simpler and does not strictly require the sentinel to be the last row:

IList<HistoryEvent> historyEvents;
string executionId;
DynamicTableEntity sentinel = null;
if (tableEntities.Count > 0)
{
    // The most recent generation will always be in the first history event.
    executionId = tableEntities[0].Properties["ExecutionId"].StringValue;

    // Convert the table entities into history events.
    var events = new List<HistoryEvent>(tableEntities.Count);

    foreach (DynamicTableEntity entity in tableEntities)
    {
        if (entity.Properties["ExecutionId"].StringValue != executionId)
        {
            // The remaining entities are from a previous generation and can be discarded.
            break;
        }

        if (entity.RowKey == SentinelRowKey)
        {
            sentinel = entity;
            continue;
        }

        // Some entity properties may be stored in blob storage.
        await this.DecompressLargeEntityProperties(entity);

        events.Add((HistoryEvent)this.tableEntityConverter.ConvertFromTableEntity(entity, GetTypeForTableEntity));
    }

    historyEvents = events;
}
else
{
    historyEvents = EmptyHistoryEventList;
    executionId = expectedExecutionId;
}

// Read the checkpoint completion time from the sentinel row, which should always be the last row.
// The only time a sentinel won't exist is if no instance of this ID has ever existed.
// The IsCheckpointCompleteProperty was newly added _after_ v1.6.4.
DateTime checkpointCompletionTime = DateTime.MinValue;
if (sentinel == null)
{
    sentinel = tableEntities.LastOrDefault(e => e.RowKey == SentinelRowKey);
}

string eTagValue = sentinel?.ETag;
if (sentinel != null &&
    sentinel.Properties.TryGetValue(CheckpointCompletedTimestampProperty, out EntityProperty timestampProperty))
{
    checkpointCompletionTime = timestampProperty.DateTime ?? DateTime.MinValue;
}

Copy link
Collaborator

@sebastianburckhardt sebastianburckhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Found one typo in comment, and have a slight suggestion on wording.

@@ -184,9 +184,11 @@ public override async Task<OrchestrationHistory> GetHistoryEventsAsync(string in
break;
}

// The sentinal row does not contain any history events, so save it for later
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sentinal => sentinel

@@ -208,9 +210,10 @@ public override async Task<OrchestrationHistory> GetHistoryEventsAsync(string in
// The only time a sentinel won't exist is if no instance of this ID has ever existed.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... or it was removed via a history purge management operation?

Connor McMahon added 4 commits January 30, 2020 17:22
It is possible for the history table to get into a state where the
latest execution history is first, followed by the previous execution
history, with the last entry for an instance id being the sentinal row.
The previous history parsing logic stopped processing as soon as it saw
a new execution id, meaning it failed to grab the ETag from the sentinal
row.

This, in combination with changes made when the ETag were not present,
caused UpdateState calls to fail, meaning queue messages for Activities
were being scheduled without recording that we did schedule them. This
leads to duplicate Activity executions until we retrieve the history
with an execution id (i.e. after an activity function successfully
executes).

This fix makes our ETag retrieval logic more robust.
Copy link
Collaborator

@cgillum cgillum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit: - LGTM!

@cgillum
Copy link
Collaborator

cgillum commented Jan 31, 2020

Though it looks like the test run timed out. Let's make sure the CI is healthy before merging.

@ConnorMcMahon
Copy link
Contributor Author

I rebased onto master just to retrigger the build, and it appears to have worked this time. The timeout last time was in a Service Bus test, which I didn't touch here, so I am going to merge now.

@ConnorMcMahon ConnorMcMahon merged commit e2d319c into master Jan 31, 2020
@ConnorMcMahon ConnorMcMahon deleted the FixDuplicateActivities branch June 4, 2020 00:11
shankarsama pushed a commit to shankarsama/durabletask that referenced this pull request Apr 27, 2021
…#363)

It is possible for the history table to get into a state where the
latest execution history is first, followed by the previous execution
history, with the last entry for an instance id being the sentinal row.
The previous history parsing logic stopped processing as soon as it saw
a new execution id, meaning it failed to grab the ETag from the sentinal
row.

This, in combination with changes made when the ETag were not present,
caused UpdateState calls to fail, meaning queue messages for Activities
were being scheduled without recording that we did schedule them. This
leads to duplicate Activity executions until we retrieve the history
with an execution id (i.e. after an activity function successfully
executes).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants