
Origo: [Argus] Opentelemetry instrumentation for better metrics & tracing #4779

Merged

Conversation

zeeshanakram3
Contributor

Addresses #4763

@vercel

vercel bot commented Jun 1, 2023

The latest updates on your projects.

1 Ignored Deployment
Name             Status       Updated (UTC)
pioneer-testnet  ⬜️ Ignored   Jul 14, 2023 10:32am

@zeeshanakram3
Contributor Author

How to Test

  • Set up the Elasticsearch stack (Elasticsearch + Kibana + APM server). A utility script (./start-elasticsearch-stack.sh) has been provided. It automatically bootstraps all the relevant services and sets up all the necessary configurations.
  • After bootstrapping the ES stack using ./start-elasticsearch-stack.sh, go to the Kibana dashboard at http://localhost:5601/ and log in with the following credentials:
    • username=elastic
    • password=password
  • Go to the Kibana APM integration endpoint and click Add Elastic APM. Then click Save and Continue.
  • Next, run any application instrumented with the OpenTelemetry protocol. For reference, the distributor-1 service in the monorepo root docker-compose.yml file has been configured to export trace logs to the Elasticsearch server (see the configuration variables that need to be exported; a sketch follows these steps).
  • After the application is up and running, go to the Traces endpoint in the Kibana dashboard, and you should be able to see the trace logs for different requests. E.g., below is a trace for a GET api/v1/assets/:id request to Argus.
[Screenshot: APM trace view for a GET api/v1/assets/:id request]
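For reference, a minimal sketch of the OTLP environment variables exported for distributor-1 (the variable names and endpoint come up later in this thread; the attributes value format is an assumption):

# Hedged sketch of the OTLP env vars for the distributor-1 service in docker-compose.yml.
# The endpoint points at the APM server bootstrapped by ./start-elasticsearch-stack.sh;
# the key=value attributes format shown here is an assumption, not the documented syntax.
JOYSTREAM_DISTRIBUTOR__OTLP__ENDPOINT=http://apm-server:8200
JOYSTREAM_DISTRIBUTOR__OTLP__ATTRIBUTES=service.name=distributor-node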

@zeeshanakram3 marked this pull request as ready for review June 2, 2023 04:41
@zeeshanakram3
Contributor Author

@kdembler please test this PR based on the above instructions and let me know whether it meets your requirements for monitoring purposes. Also, feel free to ping me if there is any issue setting up the infra.

@kdembler
Member

kdembler commented Jun 2, 2023

@zeeshanakram3 This looks lovely! I will be happy to test it out, but I will be unavailable next week. I will be back on the 12th and will check it out as soon as possible.

@mnaamani
Member

mnaamani commented Jun 8, 2023

This looks like a really nice addition to provide visibility into how the application is performing.
However, I'm having issues getting distributor-1 to run successfully with instrumentation.

First observation: when starting the ES stack, the apm-server service name comes up as apm-server-1.

Secondly, the distributor is crashing with a failure to validate the JOYSTREAM_DISTRIBUTOR__OTLP__ENDPOINT env variable.

I tried changing it from http://apm-server:8200 to "http://apm-server-1:8200" (with quotes), but it still fails with:

2023-06-08 10:49:47 $ node --require ./lib/app/instrumentation.js ./bin/run start
2023-06-08 10:49:51 Starting tracing...
2023-06-08 10:49:51 /joystream/distributor-node/lib/services/parsers/ConfigParserService.js:96
2023-06-08 10:49:51                     throw e;
2023-06-08 10:49:51                     ^
2023-06-08 10:49:51 
2023-06-08 10:49:51 ValidationError: Invalid env value of JOYSTREAM_DISTRIBUTOR__OTLP__ENDPOINT
2023-06-08 10:49:51 
2023-06-08 10:49:51 
2023-06-08 10:49:51     at ConfigParserService.setConfigEnvValue (/joystream/distributor-node/lib/services/parsers/ConfigParserService.js:89:27)
2023-06-08 10:49:51     at /joystream/distributor-node/lib/services/parsers/ConfigParserService.js:109:18
2023-06-08 10:49:51     at Array.forEach (<anonymous>)
2023-06-08 10:49:51     at ConfigParserService.mergeEnvConfigWith (/joystream/distributor-node/lib/services/parsers/ConfigParserService.js:104:14)
2023-06-08 10:49:51     at ConfigParserService.parse (/joystream/distributor-node/lib/services/parsers/ConfigParserService.js:133:14)
2023-06-08 10:49:51     at Object.<anonymous> (/joystream/distributor-node/lib/app/instrumentation.js:27:31)
2023-06-08 10:49:51     at Module._compile (internal/modules/cjs/loader.js:1114:14)
2023-06-08 10:49:51     at Object.Module._extensions..js (internal/modules/cjs/loader.js:1143:10)
2023-06-08 10:49:51     at Module.load (internal/modules/cjs/loader.js:979:32)
2023-06-08 10:49:51     at Function.Module._load (internal/modules/cjs/loader.js:819:12) {
2023-06-08 10:49:51   errors: [],
2023-06-08 10:49:51   errorMessages: []
2023-06-08 10:49:51 }
2023-06-08 10:49:51 error Command failed with exit code 1.

This is happening on Docker Desktop for Mac.

I'll try on Linux and report back, but I suspect the same is happening there and that is why the Full scenario integration tests are failing at the job manage channels and videos through CLI: Failed.

@mnaamani
Member

mnaamani commented Jun 8, 2023

Even without starting with instrumentation, the same error is occurring:

2023-06-08 11:58:39 $ ./bin/run start
2023-06-08 11:57:34 
2023-06-08 11:57:34 
2023-06-08 11:57:34 error Command failed with exit code 1.
2023-06-08 11:57:45     Error: Invalid env value of JOYSTREAM_DISTRIBUTOR__OTLP__ENDPOINT
2023-06-08 11:57:45 

Is there perhaps something wrong with the changes in configSchema.ts?

@mnaamani
Member

mnaamani commented Jun 8, 2023

I was able to resolve the problem by uncommenting the otlp section in distributor-node/config.yml.
I think this is because the properties are "required", so they must exist in the default config file, even if they are overridden by the env value.
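For illustration, a minimal sketch of what the uncommented section might look like (the endpoint matches the docker-compose setup discussed above; the attribute keys are assumptions):

# Hedged sketch of the otlp section in distributor-node/config.yml (not the actual file).
otlp:
  # OTLP exporter endpoint (the APM server from ./start-elasticsearch-stack.sh)
  endpoint: http://apm-server:8200
  # Resource attributes attached to exported traces; the key below is an assumption
  attributes:
    service.name: distributor-node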

@mnaamani
Member

mnaamani commented Jun 8, 2023

I reverted to using the http://apm-server:8200 endpoint and was able to set up the APM integration; I'm now seeing:
[Screenshot: APM view in Kibana]

@mnaamani
Member

mnaamani commented Jun 8, 2023

Interestingly enough, the tracing showed a potential problem with reading the package.json file. The only code in distributor-node that does this is:

ConfigParserService.ts

public getNodeVersion(): string {
  const packageJSON = JSON.parse(fs.readFileSync(path.join(__dirname, '../../../package.json')).toString())
  return String(packageJSON.version)
}
[Screenshot: APM trace flagging the package.json read]

@@ -19,6 +19,9 @@ logs:
# auth:
# username: username
# password: password
# otlp:
Member


This needs to be uncommented as the properties are "required".
Otherwise, change them to be optional and leave them commented out.
If the node is started with instrumentation and the endpoint/attributes are not provided, should it exit with an error?

Contributor Author


Actually, the otlp properties are not required in the config schema (you can see here). Also, the distributor-1 service works fine if I remove the JOYSTREAM_DISTRIBUTOR__OTLP__ENDPOINT and JOYSTREAM_DISTRIBUTOR__OTLP__ATTRIBUTES env vars from the service in the docker-compose.yml file, which also reinforces the point that the otlp properties are not required. I think there is a problem with the ValidationService class. I tried to debug the problem by adding console.log(this.ajv.errors) at the relevant line, and got this:

[
  {
    keyword: 'required',
    dataPath: '/otlp',
    schemaPath: '#/properties/otlp/required',
    params: { missingProperty: 'attributes' },
    message: "should have required property 'attributes'"
  }
]
/joystream/distributor-node/lib/services/parsers/ConfigParserService.js:97
                    throw e;
                    ^

ValidationError: Invalid env value of JOYSTREAM_DISTRIBUTOR__OTLP__ENDPOINT

Contributor Author


Just created an issue for this and will be investigating it: #4787

Member


Okay, so how do we interpret the line required: ['endpoint', 'attributes'] in configSchema.ts?

Contributor Author


Okay, so how do we interpret the line required: ['endpoint', 'attributes'] in configSchema.ts?

This means otlp is an optional object in configSchema. However, when it is present (i.e., not undefined), it must have the endpoint & attributes properties.
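For illustration, a minimal sketch of the schema shape being described (the property names come from this thread; the surrounding structure is an assumption, not the actual configSchema.ts):

// Hedged sketch of the otlp schema shape, assuming a JSON-Schema style definition.
const configSchemaSketch = {
  type: 'object',
  // 'otlp' is not listed here, so the whole otlp section is optional
  required: [/* other top-level sections, omitted in this sketch */],
  properties: {
    otlp: {
      type: 'object',
      // but when the otlp object is present, both of these must be supplied
      required: ['endpoint', 'attributes'],
      properties: {
        endpoint: { type: 'string' },
        attributes: { type: 'object' },
      },
    },
  },
}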

FYI, I created a PR to address this issue, please have a look: #4788

Member


Merged the fix; perhaps you want to update this branch from master.

Contributor Author


Done.

@mnaamani requested a review from kdembler June 13, 2023 07:44
Member

@mnaamani left a comment


Looks and works as expected.
Just left some comments about possible generalization and decoupling the instrumentation config from the main application.


diag.info('Starting tracing...')

// Default config JSON/YAML file path (relative to current working directory)
Member


I'm glad we discovered the issue with parsing the config that came from adding the new parameters to the config file. However, I feel we could perhaps just rely on the OTEL_... env variables, since they are not really configuration parameters for the distributor-node itself. And by dropping this coupling, where we are importing:

import { ConfigParserService } from '../services/parsers/ConfigParserService'
import { ReadonlyConfig } from '../types'

it becomes possible to break this code out of the distributor package into its own package that can be re-used.

What are your thoughts? I'm not familiar enough with the OpenTelemetry architecture yet to have a very strong opinion on this.
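For reference, a minimal sketch of the standard OpenTelemetry SDK environment variables that could carry this configuration instead (the values are assumptions based on the setup in this PR):

# Hedged sketch: standard OpenTelemetry SDK environment variables.
# The endpoint assumes the APM server from ./start-elasticsearch-stack.sh;
# the service name and resource attributes are example values, not taken from this PR.
OTEL_EXPORTER_OTLP_ENDPOINT=http://apm-server:8200
OTEL_SERVICE_NAME=distributor-node
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=testnet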

@kdembler
Member

kdembler commented Jun 16, 2023

Okay, it's been a while, really sorry for the delay. I'm starting to test it out right now.

@@ -1,5 +1,6 @@
### 1.2.0

- Integrates OpenTelemetry API/SDK with Argus for exporting improved tracing logs & metrics to Elasticsearch. Adds `./start-elasticsearch-stack.sh` script to bootstrap elasticsearch services (Elasticsearch + Kibana + APM Server) with all the required configurations.
Member


Since 1.2.0 has already been released and is in use, let's bump the version, please.

@kdembler
Member

Also seeing ENOENT: no such file or directory, open '/joystream/distributor-node/src/commands/package.json'

@zeeshanakram3 added the argus (Argus distributor node) label Jun 23, 2023
@zeeshanakram3
Contributor Author

It becomes possible to break this code out of the distributor package into its own package that can be re-used?

  • I have created the @joystream/opentelemetry package in the monorepo that can be used to add OpenTelemetry tracing capabilities to arbitrary nodes & services
  • Now all the OpenTelemetry code is orthogonal to the application packages and can be imported as a dependency (a sketch of how a service might load it follows this list)
  • I tested the storage-node, distributor-node & query-node docker containers with OpenTelemetry integration using the colossus-1, distributor-1 & graphql-server services respectively in docker-compose.yml, and it seemed to work as expected
  • Building & running docker images of applications (with OTEL integration) requires publishing @joystream/opentelemetry on npm (so I could not test that)
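For illustration, a minimal sketch of how a service could load the shared package at startup, following the same --require pattern used by the earlier in-tree instrumentation (the package's entry-point path is an assumption):

# Hedged sketch: preload the shared instrumentation before the application starts.
# The lib/index.js entry point is an assumption about how @joystream/opentelemetry is built.
node --require @joystream/opentelemetry/lib/index.js ./bin/run start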

cc. @mnaamani

@zeeshanakram3
Contributor Author

@mnaamani I have updated the dockerfiles to install @joystream/opentelemetry in the images during the build step in d3a473f
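For illustration only, a rough sketch of the kind of Dockerfile step this refers to (the actual change is in d3a473f; the command below is an assumption, not the real diff):

# Hedged sketch of a build step that installs the instrumentation package into the image.
RUN yarn add @joystream/opentelemetry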

Member

@mnaamani left a comment


Nice work. Left a few suggested changes; otherwise it works great, and I'm seeing nice traces in APM, down to the SQL queries.

Resolved review threads: colossus.Dockerfile, distributor-node.Dockerfile, distributor-node/CHANGELOG.md

################################################
# temporary patch TODO: create proper solution

Member


I think some env variables might be getting lost. There is a specific one I noticed, set at the top of .env, PROCESSOR_HOST=processor, which is the host/IP the graphql-server expects from the processor when it "pings" it to update. So the graphql-server is constantly logging:

2023-07-12 19:30:30 [
2023-07-12 19:30:30   'Unauthorized access on /update-processor-state: 172.18.0.10 (expected: undefined)'
2023-07-12 19:30:30 ]

Contributor Author


Ok, so it took a while to figure this out.

The PROCESSOR_HOST env var is available inside the graphql-server; the problem happens during the DNS lookup. If the graphql-server is running with OpenTelemetry instrumentation enabled, then the lookup function returns an address string (the response type of the callback-based function) instead of the expected LookupAddress object (the response type of the promise-based function). So, in short, OpenTelemetry isn't applying a proper patch to the lookup function. A minimal sketch of the problematic pattern follows the options below.

There are two possible solutions I can think of:

  • Don't instrument the dns package in the graphql-server (a much simpler solution)
  • Change the graphql-server codebase to use the callback-based or dns/promises-based lookup method instead of a promisified lookup function
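For illustration, a minimal sketch of the problematic pattern, assuming the graphql-server promisifies dns.lookup roughly like this (the host name is the PROCESSOR_HOST value from .env):

// Hedged sketch of the pattern that breaks under DNS instrumentation
// (not the actual graphql-server code).
import * as dns from 'dns'
import { promisify } from 'util'

const lookup = promisify(dns.lookup)

async function resolveProcessorHost(): Promise<void> {
  // Without instrumentation this resolves to a LookupAddress object ({ address, family }).
  // With the OpenTelemetry dns instrumentation enabled, the patched callback path resolves
  // to just the address string, so code expecting `.address` breaks.
  const result = await lookup('processor')
  console.log(result)
}

resolveProcessorHost()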

Member


  • Don't instrument the dns package in the graphql-server (a much simpler solution)

I think that is the best option to go with. 👍

Contributor Author


Addressed in 7c658a8

Resolved review threads: start-elasticsearch-stack.sh, opentelemetry/index.ts
// Disable DNS instrumentation, because the instrumentation does not correctly patch the `dns.lookup` function
// if the function is converted to a promise-based method using `util.promisify(dns.lookup)`
// See: https://github.com/Joystream/joystream/pull/4779#discussion_r1262515887
getNodeAutoInstrumentations({ '@opentelemetry/instrumentation-dns': { enabled: false } }),
Member

@mnaamani Jul 13, 2023


This also fixed a problem I noticed with the storage and distributor nodes not finishing the response on their status endpoint /api/v1/status.

@mnaamani self-requested a review July 13, 2023 18:47
Member

@mnaamani left a comment


A few minor fixes; I prepared them in a PR: zeeshanakram3#6

@mnaamani self-requested a review July 14, 2023 13:38
@mnaamani merged commit 8a2a812 into Joystream:master Jul 14, 2023