Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Structured and extensible logging #425

Merged
merged 13 commits into from
Aug 5, 2020
Merged

Structured and extensible logging #425

merged 13 commits into from
Aug 5, 2020

Conversation

cgillum
Copy link
Collaborator

@cgillum cgillum commented Jul 24, 2020

Please Read!

This PR adds basic support for .NET Core's ILogger infrastructure to DurableTask.Core and DurableTask.AzureStorage. The motivation for this work is primarily for the benefit of Durable Functions, but there are significant benefits for any consumer of DTFx. The stated goals are as follows:

  1. Enable DurableTask.Core logs to be more easily surfaced in Antares Kusto clusters (internal to Azure App Service)
  2. Enable correlating DurableTask.Core logs with DurableTask.AzureStorage traces using simple Kusto queries
  3. Enable users to export DurableTask.Core and DurableTask.AzureStorage logs to Application Insights automatically when running in Azure Functions.
  4. Enable users to capture logs in any way they choose by supporting ILogger extensibility (this is critical for supporting Durable Functions on Linux platforms).
  5. Make it easy for future transaction store providers (like DurableTask.SqlServer) to leverage the same logging infrastructure.

Before reviewing the code, I highly recommend you review this README.md file that I added. This will help you understand the design.

Change summary

  • Upgraded all projects from net451 to net461. This was necessary to support ILogger without conditional compilation. All projects should now be netstandard2.0 compatible. This will require any upstream projects that target net451 to upgrade if they want to keep using DTFx.
  • Added a NEW DurableTask-Core EventSource implementation. The old one will still be there (for now) just in case we find issues with the new implementation. The new one was meant to be compatible with our DurableTask-AzureStorage event source provider to make it easier to query in Kusto.
  • Added optional ILoggerFactory configuration to both DurableTask.Core and DurableTask.AzureStorage. For Azure Functions, we will add the Azure Functions ILoggerFactory to these DTFx classes so that Functions users can have direct access to these logs. We can also use this to do JSON logging on Linux platforms (that part is not in scope for this PR, however).
  • Added logs of new structured logging code to DurableTask.Core. This project needed an overhaul of its logging to make it more structured and query-able. I've basically added a bunch of new log events to help us with production debugging.
  • Replaced EventSource tracing with new LogHelper abstraction in DurableTask.AzureStorage. This was a very mechanical change and should not impact existing Event Source logs. However, it allows us to plug into ILogger and get all those benefits.

Pending changes:

Manual testing:

  • Capture ETW events and ensure everything flows correctly across DurableTask-Core and DurableTask-AzureStorage
  • Capture logs from a Function app and make sure they appear in Application Insights (when configured to do so)

@cgillum cgillum added enhancement durable functions This issue is an ask from the Durable Functions team. dt.azurestorage DurableTask.AzureStorage labels Jul 24, 2020
@cgillum cgillum self-assigned this Jul 24, 2020
Copy link
Collaborator Author

@cgillum cgillum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some notes about the auto-generated code so that readers don't spend too much time reviewing it (there is a lot of it, and it's likely either going to be all correct or all wrong).

using DurableTask.Core.Logging;
using Microsoft.Extensions.Logging;

static class LogEvents
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

99% of this code was auto-generated using a tool that reflects over AnalyticsEventSource and creates a class for every ETW event we were logging. I feel reasonably confident that the code accurately represents our existing ETW tracing. Reviewers shouldn't bother reviewing this in too much detail.

using DurableTask.Core.Logging;
using Microsoft.Extensions.Logging;

class LogHelper
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

99% of this code was auto-generated using a custom tool that reflects over AnalyticsEventSource. I'm fairly confident that it is accurate in how it accepts parameters and passes them to the log event class it creates. Reviewers shouldn't bother reviewing this in too much detail.

Copy link
Collaborator

@davidmrdavid davidmrdavid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with a quick glimpse, to get the gist of the changes. Will give another, deeper look, tomorrow morning.

Given that this is not a codebase I have interacted with much yet, most of my comments so far are about improving the communication of this PR and its changes. In general, I'm looking to see more summary lines on the basic gist of the new classes, their helpers, and their core mechanics. Just a line or two per each of these would suffice.

Communication aside, the changes in these PR seem fairly safe to me. Mostly a matter of switching logging backends and interfaces, so I recognize many these changes are mechanical and therefore probably ok.

I'll give this another look again tomorrow. In the meantime, it would help me if this got updated with a few more comments on the relevant places.

@@ -862,7 +793,6 @@ public void GeneralWarning(string Account, string TaskHub, string Details, strin
string ETag,
string ExtensionVersion)
{
EnsureLogicalTraceActivityId();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What were these doing before and why is it safe to remove them now?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately the answer to these questions are long and complicated. The short version is that this refactor allowed me to consolidate them into a single place, LoggingExtensions.LogDurableEvent.

The longer answer is that these EventSource APIs depend on thread static state for logging trace activity IDs (GUIDs that correlate multiple trace messages together as one logical "activity" - not to be confused with Durable Task "activities" or "activity functions"). These are the values you'll see in the ActivityId column in all (or most) of our Kusto tables. Thread-static variables don't work at all when writing async code, so I needed to invent another way of propagating this "activity" information. If you look at the top of this file, you'll see a AsyncLocal<Guid> ActivityIdState variable which is a more modern version of the thread-static variable concept that is compatible with async code. I'm using that to keep track of ambient trace activity IDs. The EnsureLogicalTraceActivityId was used simply to copy the AsyncLocal<Guid> value into the thread-static variable so that the call to WriteEvent had the correct value. So we still do this, but the refactor allowed me to do it in a single place rather than calling this method inside every one of our trace methods.


internal string HistoryTableName => this.HasTrackingStoreStorageAccount ? $"{this.TrackingStoreNamePrefix}History" : $"{this.TaskHubName}History";

internal string InstanceTableName => this.HasTrackingStoreStorageAccount ? $"{this.TrackingStoreNamePrefix}Instances" : $"{this.TaskHubName}Instances";

internal LogHelper Logger
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the LogHelper could use a some commenting for IntelliSense, it's use is not entirely obvious from its generic name :)

src/DurableTask.Core/Common/Utils.cs Outdated Show resolved Hide resolved
src/DurableTask.Core/Logging/LogEvents.cs Show resolved Hide resolved
src/DurableTask.Core/Logging/README.md Outdated Show resolved Hide resolved

// NOTE: This is intended to eventually replace the other DurableTask-Core provider
[EventSource(Name = "DurableTask-Core")]
class StructuredEventSource : EventSource
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this class is so important for this change, I would like it to also have a <summary> documentation tag

@@ -22,6 +22,12 @@ namespace DurableTask.Core
[DataContract]
public class OrchestrationInstance : IExtensibleDataObject
{
internal readonly static OrchestrationInstance Empty = new OrchestrationInstance
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want this to be a valid instantiation? What will be the use of this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand what valid instantiation means, but it looks like I'm not actually using this anywhere and can safely remove it. I can't remember exactly why I added this in the first place, but it was probably as a performance optimization somewhere at some point.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By valid instantiation I meant if it made sense to have an OrchestrationInstance object floating around that had everything as "empty" and whether or not that should be an allowed object in the first place.

But yeah, sounds like we can safely remove it :)

@@ -131,7 +144,15 @@ OrchestrationStatus[] dedupeStatuses
/// <returns>OrchestrationInstance that represents the orchestration that was created</returns>
public Task<OrchestrationInstance> CreateOrchestrationInstanceAsync(string name, string version, object input)
{
return InternalCreateOrchestrationInstanceWithRaisedEventAsync(name, version, null, input, null, null, null, null);
return this.InternalCreateOrchestrationInstanceWithRaisedEventAsync(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any chance we could comment in-place what all these null-ish values refer to :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll just update all these to use named parameters. That's a bit more future-proof.

@@ -56,6 +60,21 @@ public TaskHubWorker(IOrchestrationService orchestrationService)
{
}


/// <summary>
/// Create a new TaskHubWorker with given OrchestrationService
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: removing the whitespace in this summary :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer it be removed everywhere, but I was hesitant to create style inconsistencies within a single file, or to fix all occurrences of this because of how much it would inflate an already large PR (this codebase largely predates me).

Id = id;
DispatcherId = dispatcherId;
this.Name = name;
this.Id = id;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this Id the WorkItemId? Any thoughts on making this stand out a little more, considering that there's already another Id parameter here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately it's not even a work item ID. It's very non-obvious why there are two different "Id" properties here (this code also predates me) but I'm hesitant to change it as part of this PR. A lot of the existing DurableTask.Core could use a good refactoring. :(

Copy link
Contributor

@ConnorMcMahon ConnorMcMahon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, the approach seems solid, if a little expensive (I would want to check that this doesn't massively impact performance).

In general, I am still a little confused about what structured logs will look like on Linux when this is integrated with David's changes, and how those will be consumed by the agent in that pipeline.

Also, at some point I think it would be incredibly helpful to allow logging an additional field to DurableTask.AzureStorage (mainly app name in the case of Durable Functions). This would cut back on the number of queries we need, and could make our analyzer a bit more robust. This PR may not be the right place for it, but since we are already doing so many automated changes, I would hate to have to make them again.

[Event(EventIds.FetchingInstanceHistory, Level = EventLevel.Informational, Version = 1)]
internal void FetchingInstanceHistory(string InstanceId, string ExecutionId)
{
this.WriteEvent(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The methods after this add guards for checking if the event log level is enabled. Why do we not for these logs?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added remaining guards in the latest iteration.

};

TaskMessage eventMessage = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this logic into the if statement currently at line 556?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we can! I'm not sure why this method was previously so complicated...

src/DurableTask.Core/TaskOrchestrationDispatcher.cs Outdated Show resolved Hide resolved
- Added missing copyright headers
- Started filling in missing log messages
- Other misc. updates
@cgillum cgillum added this to the August 2020 Release milestone Aug 3, 2020
Adding support for DTFX_APP_NAME env variable
Copy link
Collaborator Author

@cgillum cgillum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a few more informational comments to this PR to help clarify things.

@@ -90,11 +57,11 @@ private static void EnsureLogicalTraceActivityId()
string TargetExecutionId,
long SequenceNumber,
int Episode,
string AppName,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added AppName columns to every log so that we can much more easily find these traces when searching in Kusto.

}

[Event(108, Level = EventLevel.Warning, Version = 2, Message = "A duplicate message was detected. This can indicate a potential performance problem. Message ID = '{2}'. DequeueCount = {3}.")]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the Message property for these events because we don't actually use them and they easily get out of date.

@@ -72,8 +74,7 @@ public void TraceProcessingMessage(MessageData data, bool isExtendedSession)
Math.Max(0, (int)DateTimeOffset.UtcNow.Subtract(queueMessage.InsertionTime.Value).TotalMilliseconds),
data.SequenceNumber,
data.Episode.GetValueOrDefault(-1),
isExtendedSession,
Utils.ExtensionVersion);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll notice that I got rid of Utils.ExtensionVersion in these method calls. We still log this information, but we just do it in a different place now inside LogEvents.cs. We only log it for ETW events. There is no actual behavior change here.

"CreateLeaseIfNotExistAsync - leaseContainerName: {0}, consumerGroupName: {1}, partitionId: {2}, blobPrefix: {3}, exception: {4}.",
this.leaseContainerName,
this.consumerGroupName,
if (se.RequestInformation.HttpStatusCode != 409)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this check simply because we were getting a lot of unnecessary noise in the logs on startup.

Utils.ExtensionVersion);
string.Empty,
result.StorageRequests,
result.InstancesDeleted,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While auditing the logs we write, I noticed that we don't actually mention how many instances were deleted. I thought this information would be useful, so I added it to this log event.

this.storageAccountName,
this.taskHubName,
executionStartedEvent.OrchestrationInstance.InstanceId,
executionStartedEvent.OrchestrationInstance.ExecutionId,
executionStartedEvent.EventType.ToString(),
OrchestrationStatus.Pending,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why I didn't make this change sooner. The intent was always to communicate the runtime status, but for some reason I was always instead logging the event type, which is not something a viewer of these logs would intuitively understand as it relates to the orchestration instance status.

this.storageAccountName,
this.taskHubName,
instanceId,
executionId,
historyEventBatch.Count,
historyEventBatch.Count - 1, // exclude sentinel from count
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an old "off by one" error I noticed while inspecting the new logs that were being emitted.

src/DurableTask.Core/Common/Utils.cs Show resolved Hide resolved
Copy link
Contributor

@ConnorMcMahon ConnorMcMahon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of small nits, but overall looks good to me!

this.consumerGroupName,
if (se.RequestInformation.HttpStatusCode != 409)
{
// eat any storage exception related to conflict
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small nit: This comment makes a bit more sense above the if statement's conditional imo. I got confused reading it down here, as I assumed it was saying we were eating the exception that we are logging.

src/DurableTask.Core/Common/Utils.cs Show resolved Hide resolved
@cgillum cgillum merged commit fea9421 into master Aug 5, 2020
@cgillum cgillum deleted the core-logging branch August 5, 2020 01:50
This was referenced Aug 5, 2020
shankarsama pushed a commit to shankarsama/durabletask that referenced this pull request Apr 27, 2021
- Updated framework dependencies from net451 to net461
- New ETW provider for DT.Core
- ILoggerFactory configuration support for DT.Core and DT.AzureStorage
- Other misc. updates
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dt.azurestorage DurableTask.AzureStorage durable functions This issue is an ask from the Durable Functions team. enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants