-
Notifications
You must be signed in to change notification settings - Fork 289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Structured and extensible logging #425
Conversation
Update test projects from net451 to net461
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some notes about the auto-generated code so that readers don't spend too much time reviewing it (there is a lot of it, and it's likely either going to be all correct or all wrong).
using DurableTask.Core.Logging; | ||
using Microsoft.Extensions.Logging; | ||
|
||
static class LogEvents |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
99% of this code was auto-generated using a tool that reflects over AnalyticsEventSource
and creates a class for every ETW event we were logging. I feel reasonably confident that the code accurately represents our existing ETW tracing. Reviewers shouldn't bother reviewing this in too much detail.
using DurableTask.Core.Logging; | ||
using Microsoft.Extensions.Logging; | ||
|
||
class LogHelper |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
99% of this code was auto-generated using a custom tool that reflects over AnalyticsEventSource
. I'm fairly confident that it is accurate in how it accepts parameters and passes them to the log event class it creates. Reviewers shouldn't bother reviewing this in too much detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with a quick glimpse, to get the gist of the changes. Will give another, deeper look, tomorrow morning.
Given that this is not a codebase I have interacted with much yet, most of my comments so far are about improving the communication of this PR and its changes. In general, I'm looking to see more summary lines on the basic gist of the new classes, their helpers, and their core mechanics. Just a line or two per each of these would suffice.
Communication aside, the changes in these PR seem fairly safe to me. Mostly a matter of switching logging backends and interfaces, so I recognize many these changes are mechanical and therefore probably ok.
I'll give this another look again tomorrow. In the meantime, it would help me if this got updated with a few more comments on the relevant places.
@@ -862,7 +793,6 @@ public void GeneralWarning(string Account, string TaskHub, string Details, strin | |||
string ETag, | |||
string ExtensionVersion) | |||
{ | |||
EnsureLogicalTraceActivityId(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What were these doing before and why is it safe to remove them now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately the answer to these questions are long and complicated. The short version is that this refactor allowed me to consolidate them into a single place, LoggingExtensions.LogDurableEvent
.
The longer answer is that these EventSource APIs depend on thread static state for logging trace activity IDs (GUIDs that correlate multiple trace messages together as one logical "activity" - not to be confused with Durable Task "activities" or "activity functions"). These are the values you'll see in the ActivityId
column in all (or most) of our Kusto tables. Thread-static variables don't work at all when writing async code, so I needed to invent another way of propagating this "activity" information. If you look at the top of this file, you'll see a AsyncLocal<Guid> ActivityIdState
variable which is a more modern version of the thread-static variable concept that is compatible with async code. I'm using that to keep track of ambient trace activity IDs. The EnsureLogicalTraceActivityId
was used simply to copy the AsyncLocal<Guid>
value into the thread-static variable so that the call to WriteEvent
had the correct value. So we still do this, but the refactor allowed me to do it in a single place rather than calling this method inside every one of our trace methods.
|
||
internal string HistoryTableName => this.HasTrackingStoreStorageAccount ? $"{this.TrackingStoreNamePrefix}History" : $"{this.TaskHubName}History"; | ||
|
||
internal string InstanceTableName => this.HasTrackingStoreStorageAccount ? $"{this.TrackingStoreNamePrefix}Instances" : $"{this.TaskHubName}Instances"; | ||
|
||
internal LogHelper Logger |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the LogHelper could use a some commenting for IntelliSense, it's use is not entirely obvious from its generic name :)
|
||
// NOTE: This is intended to eventually replace the other DurableTask-Core provider | ||
[EventSource(Name = "DurableTask-Core")] | ||
class StructuredEventSource : EventSource |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this class is so important for this change, I would like it to also have a <summary>
documentation tag
@@ -22,6 +22,12 @@ namespace DurableTask.Core | |||
[DataContract] | |||
public class OrchestrationInstance : IExtensibleDataObject | |||
{ | |||
internal readonly static OrchestrationInstance Empty = new OrchestrationInstance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want this to be a valid instantiation? What will be the use of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand what valid instantiation means, but it looks like I'm not actually using this anywhere and can safely remove it. I can't remember exactly why I added this in the first place, but it was probably as a performance optimization somewhere at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By valid instantiation I meant if it made sense to have an OrchestrationInstance
object floating around that had everything as "empty" and whether or not that should be an allowed object in the first place.
But yeah, sounds like we can safely remove it :)
@@ -131,7 +144,15 @@ OrchestrationStatus[] dedupeStatuses | |||
/// <returns>OrchestrationInstance that represents the orchestration that was created</returns> | |||
public Task<OrchestrationInstance> CreateOrchestrationInstanceAsync(string name, string version, object input) | |||
{ | |||
return InternalCreateOrchestrationInstanceWithRaisedEventAsync(name, version, null, input, null, null, null, null); | |||
return this.InternalCreateOrchestrationInstanceWithRaisedEventAsync( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any chance we could comment in-place what all these null
-ish values refer to :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll just update all these to use named parameters. That's a bit more future-proof.
@@ -56,6 +60,21 @@ public TaskHubWorker(IOrchestrationService orchestrationService) | |||
{ | |||
} | |||
|
|||
|
|||
/// <summary> | |||
/// Create a new TaskHubWorker with given OrchestrationService |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: removing the whitespace in this summary :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer it be removed everywhere, but I was hesitant to create style inconsistencies within a single file, or to fix all occurrences of this because of how much it would inflate an already large PR (this codebase largely predates me).
Id = id; | ||
DispatcherId = dispatcherId; | ||
this.Name = name; | ||
this.Id = id; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this Id
the WorkItemId
? Any thoughts on making this stand out a little more, considering that there's already another Id parameter here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately it's not even a work item ID. It's very non-obvious why there are two different "Id" properties here (this code also predates me) but I'm hesitant to change it as part of this PR. A lot of the existing DurableTask.Core could use a good refactoring. :(
…l PR feedback from David.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, the approach seems solid, if a little expensive (I would want to check that this doesn't massively impact performance).
In general, I am still a little confused about what structured logs will look like on Linux when this is integrated with David's changes, and how those will be consumed by the agent in that pipeline.
Also, at some point I think it would be incredibly helpful to allow logging an additional field to DurableTask.AzureStorage (mainly app name in the case of Durable Functions). This would cut back on the number of queries we need, and could make our analyzer a bit more robust. This PR may not be the right place for it, but since we are already doing so many automated changes, I would hate to have to make them again.
[Event(EventIds.FetchingInstanceHistory, Level = EventLevel.Informational, Version = 1)] | ||
internal void FetchingInstanceHistory(string InstanceId, string ExecutionId) | ||
{ | ||
this.WriteEvent( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The methods after this add guards for checking if the event log level is enabled. Why do we not for these logs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added remaining guards in the latest iteration.
}; | ||
|
||
TaskMessage eventMessage = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this logic into the if statement currently at line 556?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes we can! I'm not sure why this method was previously so complicated...
- Added missing copyright headers - Started filling in missing log messages - Other misc. updates
Adding support for DTFX_APP_NAME env variable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a few more informational comments to this PR to help clarify things.
@@ -90,11 +57,11 @@ private static void EnsureLogicalTraceActivityId() | |||
string TargetExecutionId, | |||
long SequenceNumber, | |||
int Episode, | |||
string AppName, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added AppName
columns to every log so that we can much more easily find these traces when searching in Kusto.
} | ||
|
||
[Event(108, Level = EventLevel.Warning, Version = 2, Message = "A duplicate message was detected. This can indicate a potential performance problem. Message ID = '{2}'. DequeueCount = {3}.")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the Message
property for these events because we don't actually use them and they easily get out of date.
@@ -72,8 +74,7 @@ public void TraceProcessingMessage(MessageData data, bool isExtendedSession) | |||
Math.Max(0, (int)DateTimeOffset.UtcNow.Subtract(queueMessage.InsertionTime.Value).TotalMilliseconds), | |||
data.SequenceNumber, | |||
data.Episode.GetValueOrDefault(-1), | |||
isExtendedSession, | |||
Utils.ExtensionVersion); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll notice that I got rid of Utils.ExtensionVersion
in these method calls. We still log this information, but we just do it in a different place now inside LogEvents.cs. We only log it for ETW events. There is no actual behavior change here.
"CreateLeaseIfNotExistAsync - leaseContainerName: {0}, consumerGroupName: {1}, partitionId: {2}, blobPrefix: {3}, exception: {4}.", | ||
this.leaseContainerName, | ||
this.consumerGroupName, | ||
if (se.RequestInformation.HttpStatusCode != 409) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this check simply because we were getting a lot of unnecessary noise in the logs on startup.
Utils.ExtensionVersion); | ||
string.Empty, | ||
result.StorageRequests, | ||
result.InstancesDeleted, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While auditing the logs we write, I noticed that we don't actually mention how many instances were deleted. I thought this information would be useful, so I added it to this log event.
this.storageAccountName, | ||
this.taskHubName, | ||
executionStartedEvent.OrchestrationInstance.InstanceId, | ||
executionStartedEvent.OrchestrationInstance.ExecutionId, | ||
executionStartedEvent.EventType.ToString(), | ||
OrchestrationStatus.Pending, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know why I didn't make this change sooner. The intent was always to communicate the runtime status, but for some reason I was always instead logging the event type, which is not something a viewer of these logs would intuitively understand as it relates to the orchestration instance status.
this.storageAccountName, | ||
this.taskHubName, | ||
instanceId, | ||
executionId, | ||
historyEventBatch.Count, | ||
historyEventBatch.Count - 1, // exclude sentinel from count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an old "off by one" error I noticed while inspecting the new logs that were being emitted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of small nits, but overall looks good to me!
this.consumerGroupName, | ||
if (se.RequestInformation.HttpStatusCode != 409) | ||
{ | ||
// eat any storage exception related to conflict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small nit: This comment makes a bit more sense above the if statement's conditional imo. I got confused reading it down here, as I assumed it was saying we were eating the exception that we are logging.
- Updated framework dependencies from net451 to net461 - New ETW provider for DT.Core - ILoggerFactory configuration support for DT.Core and DT.AzureStorage - Other misc. updates
Please Read!
This PR adds basic support for .NET Core's ILogger infrastructure to DurableTask.Core and DurableTask.AzureStorage. The motivation for this work is primarily for the benefit of Durable Functions, but there are significant benefits for any consumer of DTFx. The stated goals are as follows:
ILogger
extensibility (this is critical for supporting Durable Functions on Linux platforms).Before reviewing the code, I highly recommend you review this README.md file that I added. This will help you understand the design.
Change summary
net451
tonet461
. This was necessary to supportILogger
without conditional compilation. All projects should now benetstandard2.0
compatible. This will require any upstream projects that targetnet451
to upgrade if they want to keep using DTFx.DurableTask-Core
EventSource implementation. The old one will still be there (for now) just in case we find issues with the new implementation. The new one was meant to be compatible with ourDurableTask-AzureStorage
event source provider to make it easier to query in Kusto.ILoggerFactory
configuration to bothDurableTask.Core
andDurableTask.AzureStorage
. For Azure Functions, we will add the Azure FunctionsILoggerFactory
to these DTFx classes so that Functions users can have direct access to these logs. We can also use this to do JSON logging on Linux platforms (that part is not in scope for this PR, however).DurableTask.Core
. This project needed an overhaul of its logging to make it more structured and query-able. I've basically added a bunch of new log events to help us with production debugging.DurableTask.AzureStorage
. This was a very mechanical change and should not impact existing Event Source logs. However, it allows us to plug intoILogger
and get all those benefits.Pending changes:
Manual testing:
DurableTask-Core
andDurableTask-AzureStorage