-
Notifications
You must be signed in to change notification settings - Fork 115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Origo: [Argus] Opentelemetry instrumentation for better metrics & tracing #4779
Origo: [Argus] Opentelemetry instrumentation for better metrics & tracing #4779
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎ 1 Ignored Deployment
|
How to Test
|
@kdembler please test this PR based on the above instructions, and let me know whether it meets your requirements for monitoring purposes. Also, feel free to ping me if there is any issue setting up the infra |
@zeeshanakram3 This looks lovely! I will be happy to test it out, but I will be unavailable next week, will be back on 12th and will check it out as soon as possible |
This looks like a really nice addition to provide visibility into how the application is performing. First observation, when starting the ES stack, the apm-server service name is coming up as Secondly the distributor is crashing, with failure to validate the I tried to change it from
This is happening on Mac Desktop Docker.. I'll try on Linux and report back. But I suspect the same is happening on Linux and that is why the Full scenario is failing integration tests at job |
Even without starting with instrumentation, the same error is occuring:
Is there something perhaps wrong in the changes in |
Was able to resolve the problem by uncommenting the |
distributor-node/config.yml
Outdated
@@ -19,6 +19,9 @@ logs: | |||
# auth: | |||
# username: username | |||
# password: password | |||
# otlp: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be uncommented as the properties are "required".
Otherwise change them to be optional and leave them commented out.
If node is started with instrumentation and endpoint/attributes not provided, exist with error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, the otlp
properties are not required in the config schema. You can see here. Also, the distributor-1
service works fine if I remove the JOYSTREAM_DISTRIBUTOR__OTLP__ENDPOINT
, JOYSTREAM_DISTRIBUTOR__OTLP__ATTRIBUTES
env vars from the service in docker-compose.yml file, which also enforces the point that otlp
properties are not required. I think there is a problem with the ValidationService
class. I tried to debug the problem by adding console.log(this.avj.errors)
at line, and got this:
[
{
keyword: 'required',
dataPath: '/otlp',
schemaPath: '#/properties/otlp/required',
params: { missingProperty: 'attributes' },
message: "should have required property 'attributes'"
}
]
/joystream/distributor-node/lib/services/parsers/ConfigParserService.js:97
throw e;
^
ValidationError: Invalid env value of JOYSTREAM_DISTRIBUTOR__OTLP__ENDPOINT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just created an issue for this. Will be investigating it #4787
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, so how do we interpret the line required: ['endpoint', 'attributes'],
in configSchema.ts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, so how do we interpret the line
required: ['endpoint', 'attributes'],
in configSchema.ts
This means otlp
is an optional object in configSchema. However, when it is present (i.e., not undefined
), it must
have endpoint
& attributes
properties.
FYI I created a PR to address this issue, please have a look #4788
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merged the fix, perhaps you want to update this from master
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks and works as expected.
Just left some comments about possible generalization and decoupling the instrumentation config from the main application.
|
||
diag.info('Starting tracing...') | ||
|
||
// Default config JSON/YAML file path (relative to current working directory) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm glad we discovered issue with parsing the config because of adding the new parameters in the config file, however I feel perhaps we can just rely on OTEL_
... env variables, since they are not really configuration parameters for the distributor-node itself. And by dropping this coupling, where we are importing from the
import { ConfigParserService } from '../services/parsers/ConfigParserService'
import { ReadonlyConfig } from '../types'
It becomes possible to break this code out of the distributor package into its own package that can be re-used?
What are your thoughts. I'm not familiar enough with Open Telemetry architecture yet to have a very strong opinion on this.
Okay it's been a while, really sorry for the delay. I'm starting testing it out right now |
distributor-node/CHANGELOG.md
Outdated
@@ -1,5 +1,6 @@ | |||
### 1.2.0 | |||
|
|||
- Integrates OpenTelemetry API/SDK with Argus for exporting improved tracing logs & metrics to Elasticsearch. Adds `./start-elasticsearch-stack.sh` script to bootstrap elasticsearch services (Elasticsearch + Kibana + APM Server) with all the required configurations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since 1.2.0 has been already released and is in use, let's bump the version please
Also seeing |
cc. @mnaamani |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work. Left a few suggested changes, otherwise it works great and seeing nice traces in APM down to the SQL queries.
|
||
################################################ | ||
# temporary patche TODO: create proper solution | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some env variables i think might be getting lost, there is a specific one i noticed which is set at top .env PROCESSOR_HOST=processor
which is the host/ip the graphql server expects from the processor when it "pings" it to update. So the graphql-server is constantly logging:
2023-07-12 19:30:30 [
2023-07-12 19:30:30 'Unauthorized access on /update-processor-state: 172.18.0.10 (expected: undefined)'
2023-07-12 19:30:30 ]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so it took a while to figure this out.
The PROCESSOR_HOST
env var is available inside the graphql-server
, how the problem is happening while dns lookup, if the graphql-server
is running with open telemetry instrumentation enabled, then the lookup
function returns an address string (response type of callback-based function) instead of expected LookupAddress object (response type of promise-based function). So, in short, open telemetry isn't applying a proper patch of the lookup function.
There are two possible solutions I can think of
- Don't instrument the
dns
package in thegraphql-server
(a much simpler solution) - Make a change in the
graphql-server
codebase to usecallback
based ordns/promise
basedlookup
method instead of promisify lookup function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Don't instrument the
dns
package in thegraphql-server
(a much simpler solution)
I think that is the best option to go with. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 7c658a8
// Disable DNS instrumentation, because the instrumentation does not correctly patches `dns.lookup` function | ||
// if the function is converted to a promise-based method using `utils.promisify(dns.lookup)` | ||
// See: https://github.com/Joystream/joystream/pull/4779#discussion_r1262515887 | ||
getNodeAutoInstrumentations({ '@opentelemetry/instrumentation-dns': { enabled: false } }), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also fixed a problem I noticed with storage and distributor nodes, not finishing reponse on their status endpoint /api/v1/status
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor fixes, I prepared them in a PR zeeshanakram3#6
Addresses #4763