Introduces a mechanism to batch-send telemetry events. #292

Merged
merged 1 commit into from Feb 27, 2018

Conversation

6 participants
@changsi-an
Member

changsi-an commented Feb 21, 2018

Batch-sending telemetry is necessary because some notifications from the debuggee may be too frequent for the telemetry infrastructure to handle. Take VS as an example: it throttles telemetry events if there are too many within a short period of time, and we lose important telemetry when that happens.

The BatchTelemetryReporter caches the received telemetry events and sends them out every 10 seconds.

This PR also refactors some code and migrates some general telemetry handling code from DebugSession.ts to utils.ts or telemetry.ts.
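The batching idea described above can be sketched as follows. This is a minimal illustration, not the actual BatchTelemetryReporter from this PR; the class name, event shape, and callback signature here are all assumptions for the sake of the example:

```typescript
// Hypothetical sketch of a batching telemetry reporter: events are cached in
// per-name buckets and flushed on a timer instead of being sent immediately.
type TelemetryEvent = { name: string; properties: { [key: string]: string } };

class BatchingReporterSketch {
  private buckets: { [eventName: string]: TelemetryEvent[] } = {};
  private timer: ReturnType<typeof setInterval>;

  constructor(
    // Called once per bucket with the whole batch, instead of once per event.
    private reportBatch: (name: string, batch: TelemetryEvent[]) => void,
    cadenceInMilliseconds: number = 10000 // flush every 10 seconds by default
  ) {
    this.timer = setInterval(() => this.flush(), cadenceInMilliseconds);
  }

  // Cache the event rather than sending it right away.
  public reportEvent(event: TelemetryEvent): void {
    (this.buckets[event.name] = this.buckets[event.name] || []).push(event);
  }

  // Send every non-empty bucket as a single batched payload, then reset.
  public flush(): void {
    for (const name of Object.keys(this.buckets)) {
      this.reportBatch(name, this.buckets[name]);
    }
    this.buckets = {};
  }

  public dispose(): void {
    clearInterval(this.timer);
  }
}
```

The key property is that N events with the same name within one flush window cost one outgoing message, which is what keeps a burst of debuggee notifications from overwhelming the telemetry endpoint.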

@changsi-an changsi-an requested review from roblourens and digeff Feb 21, 2018

@changsi-an

Member

changsi-an commented Feb 21, 2018

@digeff PTAL

this.runAndMeasureProcessingTime('notification/onPaused', () => {
return this.onPaused(params);
});
});
this.chrome.Debugger.onResumed(() => this.onResumed());
this.chrome.Debugger.onScriptParsed(params => this.onScriptParsed(params));


@dhanvikapila

dhanvikapila Feb 21, 2018

Member

are we measuring onScriptParsed in another PR?


@changsi-an

changsi-an Feb 22, 2018

Member

Yes, all the other notifications will be taken care of in the next PR. I'd like this PR to introduce only the "BatchTelemetryReporter" (and give one usage example).

Some of the notification handlers need small tweaks before they can be measured correctly, so I'd rather do another PR for those.

@@ -549,4 +551,23 @@ export function isNumber(num: number): boolean {

export function toVoidP(p: Promise<any>): Promise<void> {
return p.then(() => { });
}

export function calculateElapsedTime(startProcessingTime: [number, number]): number {


@dhanvikapila

dhanvikapila Feb 21, 2018

Member

Nit: Wouldn't it be better to type the tuple as "HighRes timer" or something similar?


@changsi-an

changsi-an Feb 22, 2018

Member

Sounds good. Added that.

const MillisecondsPerSecond = 1000;

const ellapsedTime = process.hrtime(startProcessingTime);
const ellapsedMilliseconds = ellapsedTime[0] * MillisecondsPerSecond + ellapsedTime[1] / NanoSecondsPerMillisecond;


@dhanvikapila

dhanvikapila Feb 21, 2018

Member

Isn't this easier to understand a different way: (ellapsedTime[0] * 1e9 + ellapsedTime[1]) / NanoSecondsPerMillisecond,
where 1e9 is the number of nanoseconds in a second?


@changsi-an

changsi-an Feb 22, 2018

Member

Sounds good. Changed.
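Putting the two suggestions from this thread together (a named alias for the process.hrtime() tuple, and the conversion written as nanoseconds divided by nanoseconds-per-millisecond), the helper would look roughly like this. This is an illustrative sketch rather than the exact merged code:

```typescript
// Nanoseconds in a millisecond, used to convert the hrtime tuple.
const NanoSecondsPerMillisecond = 1e6;

// Named alias for the [seconds, nanoseconds] tuple returned by process.hrtime(),
// as suggested in the review above.
type HighResTimer = [number, number];

function calculateElapsedTime(startProcessingTime: HighResTimer): number {
  const elapsedTime = process.hrtime(startProcessingTime);
  // 1e9 is the number of nanoseconds in a second: convert the whole tuple to
  // nanoseconds first, then to milliseconds in one division.
  return (elapsedTime[0] * 1e9 + elapsedTime[1]) / NanoSecondsPerMillisecond;
}
```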

@changsi-an changsi-an force-pushed the changsi-an:master branch from 0312499 to 060ae1c Feb 22, 2018

@changsi-an

Member

changsi-an commented Feb 22, 2018

I am merging this PR now. If there are any other issues, please point them out and I will address them later.

@auchenberg

Contributor

auchenberg commented Feb 22, 2018

@changsi-an Why is this needed to be implemented in the DA, if this is limited by VS? Shouldn't this be implemented inside VS's telemetry service to make sure that data isn't lost?

@changsi-an

Member

changsi-an commented Feb 22, 2018

@auchenberg Tweaking VS's telemetry service is a much bigger change than we have time for. Also, regardless of which underlying telemetry service is being used, overwhelming the service endpoint is not ideal. If the batching strategy works, it should be adopted as an optimization, which diminishes the value of sending each telemetry event individually.

As a side note, we only batch the events that would potentially amass in a short period of time.

@@ -363,7 +367,11 @@ export abstract class ChromeDebugAdapter implements IDebugAdapter {
* Hook up all connection events
*/
protected hookConnectionEvents(): void {
this.chrome.Debugger.onPaused(params => this.onPaused(params));
this.chrome.Debugger.onPaused((params) => {
this.runAndMeasureProcessingTime('notification/onPaused', () => {


@digeff

digeff Feb 22, 2018

Contributor

I'd add crdp to the name, so it's clear that the notification comes from Chrome


@changsi-an

changsi-an Feb 22, 2018

Member

Will change to crdpnotification/onPaused, does that sound fine?


@digeff

digeff Feb 22, 2018

Contributor

I'd put a separator so it's easier to identify the two words e.g.:
crdp/notification/onPaused
crdp-notification/onPaused
crdp_notification/onPaused
crdpNotification/onPaused

I think the only one that is allowed is: crdp/notification/onPaused


@changsi-an

changsi-an Feb 22, 2018

Member

Adopted crdp/notification/onPaused

startTime: startTime.toString()
};

return procedure().then(() => {


@digeff

digeff Feb 22, 2018

Contributor

I personally find async/await code easier to read and reason than using promises


@changsi-an

changsi-an Feb 22, 2018

Member

Agreed and changed
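With the async/await rewrite agreed above, the measurement wrapper would look roughly like the following. The property names mirror the diff fragments in this review, but the signature and the report callback are assumptions; this is a sketch, not the merged code:

```typescript
// Hypothetical async/await version of runAndMeasureProcessingTime: run the
// procedure, record success/failure and elapsed time, and report one event.
async function runAndMeasureProcessingTime(
  eventName: string, // e.g. 'crdp/notification/onPaused'
  procedure: () => Promise<void>,
  report: (name: string, properties: { [key: string]: string }) => void
): Promise<void> {
  const startProcessingTime = process.hrtime();
  // Note: some clients only accept string properties, so everything is a string.
  const properties: { [key: string]: string } = {};
  try {
    await procedure();
    properties.successful = 'true';
  } catch (e) {
    properties.successful = 'false';
  }
  // Elapsed time in milliseconds, computed from the hrtime tuple.
  const elapsed = process.hrtime(startProcessingTime);
  properties.timeTakenInMilliseconds = String((elapsed[0] * 1e9 + elapsed[1]) / 1e6);
  report(eventName, properties);
}
```

Compared with the promise-chain version quoted above, the try/catch keeps the success and failure paths next to each other and lets the timing code run once after either outcome.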

properties.successful = 'true';
}, (e) => {
properties.successful = 'false';
properties.exceptionType = "generic";


@digeff

digeff Feb 22, 2018

Contributor

Is there a more intuitive way to put this information? If I read generic in this field, I don't think I'll understand what it means


@changsi-an

changsi-an Feb 22, 2018

Member

How about not setting this field?


@changsi-an

changsi-an Feb 22, 2018

Member

Renamed to 'firstChance'

// There is an issue on some clients where reportEvent currently only accepts string properties,
// hence all the following properties must be strings.
successful?: "true" | "false";
exceptionType?: ExceptionType;


@digeff

digeff Feb 22, 2018

Contributor

Should this be called eventType or actionType rather than exceptionType?


@changsi-an

changsi-an Feb 22, 2018

Member

It is used to differentiate whether it's an uncaughtException or a promiseRejection in the exception case. Why should it be eventType or actionType?


@digeff

digeff Feb 22, 2018

Contributor

I think that people might expect to see the class name in a field called exceptionType.


@changsi-an

changsi-an Feb 22, 2018

Member

I think this word clearly conveys its meaning; neither eventType nor actionType correctly captures the intention of this field. The case you mention might be valid, but I suspect it comes from a limited group of people with a particular background. It is not convincing that exceptionType has to mean a class name, and users of the data should check its content before making any assumptions. So unless we have a clearly better wording for this concept (other than actionType and eventType), I'll leave it as-is.

@@ -36,4 +49,80 @@ export class NullTelemetryReporter implements ITelemetryReporter {

}

export const DefaultTelemetryInterval = 10000;


@digeff

digeff Feb 22, 2018

Contributor

I'd rename this to DefaultTelemetryIntervalInMilliseconds to make it clear what the unit is.


@changsi-an
private _eventBuckets: {[eventName: string]: any};
private _timer: NodeJS.Timer;

public constructor(private _telemetryReporter: TelemetryReporter, private _cadence: number = DefaultTelemetryInterval) {


@digeff

digeff Feb 22, 2018

Contributor

_cadenceInMilliseconds?


@changsi-an
}

public setupEventHandler(_sendEvent: (event: DebugProtocol.Event) => void): void {
// no-op


@digeff

digeff Feb 22, 2018

Contributor

It'll be surprising for customers of this class if this is a no-op. It might be clearer if this class doesn't implement ITelemetryReporter and doesn't have this method.

this._eventBuckets = {};
}

private send(): void {


@digeff

digeff Feb 22, 2018

Contributor

I'm finding it hard to follow what this method does. I think it'd be much easier to understand if we split it into smaller private methods with meaningful, self-documenting names. What do you think?


@changsi-an

changsi-an Feb 22, 2018

Member

It's been divided into several parts.

@changsi-an changsi-an force-pushed the changsi-an:master branch from 060ae1c to 63a8511 Feb 22, 2018

this.reset();
}
/**
* Transform the bucket of events data from the form:


@digeff

digeff Feb 22, 2018

Contributor

This is really useful :) 👍 👍 👍

let propertyNamesSet = {};
for (const entry of bucket) {
for (const key of Object.keys(entry)) {
propertyNamesSet[key] = true;


@digeff

digeff Feb 22, 2018

Contributor

What is the advantage of using a dictionary over an array for this?


@changsi-an

changsi-an Feb 22, 2018

Member

An array won't work; I am using the dictionary as a set. With an array we could only append new keys after searching with indexOf(), and a lot of entries will have the same property names.


@digeff

digeff Feb 22, 2018

Contributor

That makes sense. I completely missed that :)

Could you add a comment saying that we use the dictionary to eliminate duplicated values?

}

/**
* Get the property keys from all the entries of a event bucket:


@digeff

digeff Feb 22, 2018

Contributor

Awesome comment!

// Run through again, collect the values for each property name.
for (const entry of bucket) {
for (const key of allPropertyNamesInTheBucket) {
properties[`aggregated.${key}`].push(entry[key] || null);


@digeff

digeff Feb 22, 2018

Contributor

If entry[key] is "" or 0 entry[key] || null will return null


@changsi-an

changsi-an Feb 22, 2018

Member

Good catch. Fixed.
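Putting the pieces from this review thread together, the bucket transformation (with the falsy-value fix just discussed) could look like this. The function name and exact shape here are illustrative assumptions, not the merged code:

```typescript
// Hypothetical sketch of the bucket transformation: a list of event entries
// becomes one object with an `aggregated.<key>` array per property name.
function aggregateBucket(bucket: Array<{ [key: string]: any }>): { [key: string]: any[] } {
  // Use an object as a set to collect each property name exactly once,
  // since many entries share the same property names (an array would need
  // an indexOf() search to avoid duplicates).
  const propertyNamesSet: { [key: string]: boolean } = {};
  for (const entry of bucket) {
    for (const key of Object.keys(entry)) {
      propertyNamesSet[key] = true;
    }
  }
  const allPropertyNamesInTheBucket = Object.keys(propertyNamesSet);

  const properties: { [key: string]: any[] } = {};
  for (const key of allPropertyNamesInTheBucket) {
    properties[`aggregated.${key}`] = [];
  }

  // Second pass: collect each entry's value per property name. Use `key in
  // entry` rather than `entry[key] || null` so that present-but-falsy values
  // like "" or 0 are kept instead of being replaced with null.
  for (const entry of bucket) {
    for (const key of allPropertyNamesInTheBucket) {
      properties[`aggregated.${key}`].push(key in entry ? entry[key] : null);
    }
  }
  return properties;
}
```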

@changsi-an changsi-an force-pushed the changsi-an:master branch from 63a8511 to cf5c348 Feb 22, 2018

private send(): void {
for (const eventName in this._eventBuckets) {
const bucket = this._eventBuckets[eventName];
let properties = BatchTelemetryReporter.transfromBucketData(bucket);


@digeff

digeff Feb 22, 2018

Contributor

It'd be awesome if the name of the method would explain what kind of transformation the method is going to do (e.g. what the output is)


@changsi-an

changsi-an Feb 22, 2018

Member

Describing the transformation in the method name might be too verbose; the realistic approach is to rely on the comment on the method. The idea is: if the name is not sufficient to tell what the method does, refer to its signature comment, and if that still isn't enough, refer to its implementation.

Or, if you have a better name in mind, please suggest it.

@changsi-an changsi-an force-pushed the changsi-an:master branch from cf5c348 to 0029f27 Feb 23, 2018

@auchenberg

Contributor

auchenberg commented Feb 23, 2018

@changsi-an I'm still not sure I understand. We don't seem to have this problem today in VS Code, and this approach would require each Debug Adapter to throttle events to satisfy constraints on VS's telemetry endpoint.

Why not have the central VS telemetry service implement this, so that no DA has to deal with this VS-internal limitation?

@roblourens @dhanvikapila

@dhanvikapila

Member

dhanvikapila commented Feb 23, 2018

@auchenberg we can chat if you need more context. The telemetry we are adding is going to generate a lot of data points, so we need to be careful not to overwhelm the upstream telemetry sinks (VS Code or VS) when they receive this data. 10s was an arbitrary number we chose, which we can make either larger or smaller. There is no data loss as a result of this PR: the data you got earlier you continue to get (along with a lot more new data). The only change is that the last data set will be sent 10s later for the telemetry sinks to process.

Can you please elaborate on your concerns with this PR? This is best practice when you have the potential to generate a lot of traffic for telemetry sinks. If we don't do this, every DA using this package could end up spamming its telemetry endpoint with many requests, impacting the upstream processors (VS Code's or VS's).

@auchenberg

Contributor

auchenberg commented Feb 23, 2018

@dhanvikapila Again, I'm trying to understand why the throttling problem has to be solved inside a debugging core library that's used in 8 different debuggers across VS and VS Code, and not inside the telemetry service in VS.

In VS Code, to my knowledge, we don't implement telemetry throttling; we simply forward events to the AppInsights client (npm package), which might do some throttling when needed.

My concern is that we are adding VS-telemetry-specific logic inside a core debugging library to fix a symptom instead of fixing it centrally inside VS's telemetry abstraction/sink/service. It shouldn't be this library's concern that VS's telemetry service might drop data if the volume gets too high.

@dhanvikapila

Member

dhanvikapila commented Feb 23, 2018

@auchenberg as I mentioned earlier, the amount of telemetry is going to be large. If we don't do this, every debug adapter generating these data points can potentially impact its telemetry endpoint.
Throttling makes sense on the server end, not the client (like VS Code), unless you want to lose data. Batching is the right thing to do at the client, since you don't want to drop data on the floor.
This telemetry is going into both VS and VS Code; it is not specific to VS. Why do you think VS Code cannot use this telemetry data?

It might be better for us to jump on a call to sync. This is not about any limitation of the telemetry service: as a client of the service, we need to be sure the data we send is not dropped on the floor, resulting in missed data for us.

@isidorn

Contributor

isidorn commented Feb 23, 2018

Since @roblourens is on vacation this week and we are wrapping up our milestone (we have endgame next week), I suggest postponing this PR to the March milestone. Then @roblourens can give it a proper review.

As far as I understand, we have always been sending telemetry like this, and there is no need to rush this in.

@dhanvikapila

Member

dhanvikapila commented Feb 23, 2018

We need to get in a room so we can talk about this and close it out. Otherwise we are going to be repeating each other and not get anywhere.

@roblourens

Member

roblourens commented Feb 23, 2018

@isidorn this is intended for VS, not VS Code, so it has no impact on our release.

I have read the comments but still have the same question as Kenneth. I don't understand where the limitation is or where the data loss happens. Is the volume of telemetry really too high? You can include me on any call.

@changsi-an changsi-an force-pushed the changsi-an:master branch from 0029f27 to 04c5702 Feb 23, 2018

@changsi-an changsi-an force-pushed the changsi-an:master branch from 04c5702 to 3057e0a Feb 23, 2018

@changsi-an changsi-an changed the base branch from master to feature/telemetry Feb 27, 2018

@changsi-an

Member

changsi-an commented Feb 27, 2018

@roblourens I had a meeting with @auchenberg and Kai today. We reached an agreement to batch-send telemetry for notifications from a debuggee: we don't want telemetry messages to clog the communication channel between the DA and VS Code, so that messages of higher priority can pass through first.

Based on that outcome, we would still like you to code review this PR as-is, to clear out edge-case issues I have not noticed yet.

In order to catch up with the release schedule on our end, we will first merge this PR into the feature branch "feature/telemetry". I will work on getting it to master after your final code review.

@changsi-an changsi-an merged commit 8fe3d0a into Microsoft:feature/telemetry Feb 27, 2018

3 checks passed

  • continuous-integration/appveyor/pr: AppVeyor build succeeded
  • continuous-integration/travis-ci/pr: The Travis CI build passed
  • license/cla: All CLA requirements met
@roblourens

Member

roblourens commented Feb 28, 2018

Sure, that's fine. I'll do an in depth review later.

we don't want telemetry messages to clog the communication channel between the DA and VS Code, so that messages of higher priority can pass through first

This part still confuses me: is there evidence that there are so many telemetry messages that they slow anything down? I'd be surprised if we actually send enough to cause an issue.

@changsi-an

Member

changsi-an commented Feb 28, 2018

When we start reporting the onScriptParsed event as telemetry, it might happen. Some projects have hundreds or thousands of small script files, which might be loaded in a short period of time. What reaches the UI is the loadedScript event, and that's what benefits the customers. If in the meantime we also send telemetry (Output events) through the same channel, imagine 100 telemetry events sent out before the last loadedScript event, either alternating with loadedScript events or randomly interleaved with them: the last loadedScript event will be significantly delayed (compared to the case with no telemetry events). Telemetry events do not directly benefit the end user; they only benefit the developers or code owners, and hence have lower priority. We shouldn't let our design/mechanism compromise UX.

That being said, I don't have real evidence yet, because this is about the new telemetry events we are adding, which has not happened yet. But we do have the above theory, and it might be useful to prevent that (the 'evidence') from happening.

@roblourens

Member

roblourens commented Mar 5, 2018

Maybe we should investigate some more to determine what the capacity of the IPC channel is, and whether we need to do anything here. There are already many messages being exchanged on the websocket and adapter/client connections, many of them very large, and the telemetry seems to be a small part of that. I'm worried that this is a premature optimization - adding dozens of lines of code to do something that sounds right but we don't have data to back it up.

That being said, I don't have real evidence yet, because this is about the new telemetry events we are adding, which has not happened yet. But we do have the above theory, and it might be useful to prevent that (the 'evidence') from happening.

Given that, why was this so urgent?

@changsi-an

Member

changsi-an commented Mar 5, 2018

Here is why: before this PR, the telemetry we have been sending does not have a problem. After this PR, we will be sending new telemetry events, which can amass within a short period of time, and we don't want to lose them if they are throttled. It might not be wise to wait to see the loss first, as the so-called 'evidence', and then decide what to do about it; at that point there is no way to recover the valuable telemetry events we have lost.

On your first point: the issue we want to prevent is not whether the IPC channel is capable of sending through a large amount of data (I believe it is). It is that we don't want the telemetry to delay the real UI interactions. Telemetry is not part of the UX; it should be de-prioritized, and any UI delay, no matter how small, should be avoided.

@roblourens

Member

roblourens commented Mar 5, 2018

Can you set up a call to work this out? I think there are two points:

  • VS's rate limit on telemetry events. We should know exactly what that rate limit is and, if we will hit it, whether we can do this aggregation from PineZorro or inside VS instead of in individual adapters
  • Whether telemetry could actually cause a significant delay in the time for a user to see a loaded script. My intuition is that the time to send many events across will be milliseconds, and will be dwarfed by the time to do other things, like resolve sourcemaps for the script. I would be surprised if users are ever waiting for a loaded script to show up because of too much telemetry clogging the channel.

@roblourens roblourens added this to the March 2018 milestone Apr 3, 2018
