[BUG] Intermittent UnauthorizedException for WPS processes behind Magpie 3.8.0 and matching Twitcher #433
Reproduced (after a few tries since this is intermittent) with CRIM's Jenkins against Hirondelle from CRIM as well: https://daccs-jenkins.crim.ca/job/PAVICS-e2e-workflow-tests/job/master/407/console
Latest run seems to try accessing this file:
Oh right, that data and all the Raven test data can be deployed by activating https://github.com/bird-house/birdhouse-deploy/blob/d90765acabe248e65c4899929fbe37a9e8661643/birdhouse/env.local.example#L180-L184 on Hirondelle. Just to be clear, the missing test data file is unrelated to the
… codes/times (relates to Ouranosinc/Magpie#433)
For reference, from Ouranosinc/PAVICS-e2e-workflow-tests#74 (comment). There were no errors in the Magpie logs.
## Overview
Adds a stress test notebook to evaluate a possible regression in sporadic response codes and timings for given requests.

## Changes
The new notebook uses a function with versatile inputs (configurable via environment variables) to control how strict this test should be. For the moment, it runs 100 WPS GetCapabilities requests by default against each of the `finch`, `flyingpigeon`, `raven` and `hummingbird` services, but it can be extended to basically any request. The test passes when the average request execution time stays below the expected value of 1 s and every response returns the expected 200 status code.

## Related Issues
- Relates to issue Ouranosinc/Magpie#433
- Relates to PR bird-house/birdhouse-deploy#174
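The pass/fail criteria described above (N requests per service, an expected status code, an average-time threshold) could be sketched as follows. This is a minimal illustration, not the notebook's actual code; the function name and environment-variable names are assumptions.

```python
import os
import statistics
import time

def stress_test(send_request, n=None, expected_status=None, max_avg_seconds=None):
    """Repeatedly run `send_request` and check status codes and timing.

    `send_request` is any zero-argument callable returning an HTTP status
    code, e.g. a lambda wrapping a WPS GetCapabilities request for one of
    the tested services. Defaults mirror the behavior described above and
    can be overridden through environment variables (names illustrative).
    """
    n = n if n is not None else int(os.environ.get("STRESS_N_REQUESTS", "100"))
    expected_status = expected_status if expected_status is not None else int(
        os.environ.get("STRESS_EXPECTED_STATUS", "200"))
    max_avg_seconds = max_avg_seconds if max_avg_seconds is not None else float(
        os.environ.get("STRESS_MAX_AVG_SECONDS", "1.0"))

    times, statuses = [], []
    for _ in range(n):
        start = time.perf_counter()
        statuses.append(send_request())
        times.append(time.perf_counter() - start)

    # Pass only when every response has the expected status AND the average
    # execution time stays below the threshold (1 s by default).
    avg = statistics.mean(times)
    passed = avg < max_avg_seconds and all(s == expected_status for s in statuses)
    return passed, avg, statuses
```

A real run would wrap something like a `requests.get(...)` call against each service's GetCapabilities endpoint; those wiring details are assumptions about the notebook's setup.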
FYI, we also started to see intermittent authorization failures for Thredds data as well, in our nightly prod run: http://jenkins.ouranos.ca/job/PAVICS-e2e-workflow-tests/job/master/1135/console.
The error above should not be possible, since everything is public on our Thredds for the moment. I will try to find the logs, but since this is a production server there is a lot of activity; I'm not sure I can pinpoint the relevant entries.
Matching logs for #433 (comment)
@tlvu

```
cache.enabled = false
cache.acl.enabled = false
cache.service.enabled = false
```
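For reference, these look like INI-style application settings. A minimal sketch of where they would typically be placed, assuming the standard `[app:main]` section of the Magpie and Twitcher config files (the exact section name and file layout of a given deployment may differ, so verify against your own configs):

```ini
; Assumed location: the main application section of magpie.ini and
; twitcher.ini; double-check against the deployment's actual layout.
[app:main]
cache.enabled = false
cache.acl.enabled = false
cache.service.enabled = false
```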
@fmigneault You want to add these to prod even though the PR about the cache config, bird-house/birdhouse-deploy#174, is not merged yet? You suspect portions of the caching mechanism have been enabled?
@fmigneault Done: Ouranosinc/birdhouse-deploy@5c23e8b (can you double-check that I inserted them into the proper location in those two config files?). I've done
@fmigneault FYI, I ran Jenkins on our prod after the cache configs you proposed in #433 (comment) were applied; the
@tlvu The settings look all right, but the line
indicates that somehow caching is still enabled. The error `raise NotImplementedError("Undefined 'Permission' from 'request' parameter: {!s}".format(req))` is a side effect of #439. I'm fine with keeping the cache settings disabled until bird-house/birdhouse-deploy#174 is working with all tests.
…sing db-session error during permissions resolution (relates to #433, corrects 500 error flagged by Twitcher)
@fmigneault I've destroyed and recreated the Twitcher and Magpie containers on our production to ensure the configs that force-disable the cache are active. I still have these errors in the Twitcher logs when running
So I do not think the following configs are able to disable the cache:

```
cache.enabled = false
cache.acl.enabled = false
cache.service.enabled = false
```
@fmigneault After PR bird-house/birdhouse-deploy#182 was merged, our Jenkins nightly in prod found a 408 code in the
The Magpie logs show nothing suspicious.
https://hirondelle.crim.ca/magpie/version still reports Magpie 3.12.0; has the PR bird-house/birdhouse-deploy#182 not been auto-deployed?
I think it's OK to leave it in the same issue, since the cause could be of the same nature. The 408 is generated by the stress test that defines

For Hirondelle, I'm not sure if it is auto-deployed or done manually by @matprov
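The comment above attributes the 408 to the stress test itself rather than the server. A hypothetical sketch of how a client-side timeout can be reported as a 408 status in such a test (this mapping is an assumption; the notebook's actual convention is not shown in the thread):

```python
# Hypothetical sketch: report a client-side timeout as HTTP 408
# (Request Timeout) in the stress test results. This is an assumed
# convention, not the notebook's confirmed behavior.
def classify_response(send_request, timeout_exc=TimeoutError):
    """Return the request's HTTP status, or 408 if the client-side
    timeout fires before the server answers."""
    try:
        return send_request()
    except timeout_exc:
        return 408
```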
No sorry, I cannot answer that for @matprov. Outarde seems down right now. Maybe the update didn't work well?
Outarde is also on magpie:3.12:

```
> pavics-compose ps
reading './components/monitoring/default.env'
COMPOSE_CONF_LIST=-f docker-compose.yml -f ./components/scheduler/docker-compose-extra.yml -f ./components/monitoring/docker-compose-extra.yml
ERROR: An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).
```

I've never seen that one before.
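The timeout the error message mentions can be raised before retrying; a minimal sketch, where the 120-second value is an arbitrary example:

```shell
# The error above comes from docker-compose's client-side HTTP timeout
# (default 60 s). Raising it before re-running the command can help when
# the Docker daemon responds slowly; 120 is an arbitrary example value.
export COMPOSE_HTTP_TIMEOUT=120
# DOCKER_CLIENT_TIMEOUT is commonly raised alongside it (assumption: the
# installed docker-compose version honors it).
export DOCKER_CLIENT_TIMEOUT=120
echo "COMPOSE_HTTP_TIMEOUT=${COMPOSE_HTTP_TIMEOUT}"
```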
@fmigneault Outarde is now up. Docker seems to have misbehaved and needed some love on Outarde.
What exactly are you referring to? A new version of Magpie?
@matprov
@fmigneault Usually a simple

Also, the

Following these two issues, and the fact that the staging environment was running smoothly, it's safe to say that the issue isn't stack related but Docker-operationalization related.
FYI, both https://outarde.crim.ca/magpie/version and https://hirondelle.crim.ca/magpie/version still report Magpie 3.12.0 as of this writing. Probably missing the manual pull on your production fork https://github.com/crim-ca/birdhouse-deploy? I would suggest leaving Hirondelle on the main repo and only having Outarde on your fork.
This might be a good idea, in order to test the latest changes on Hirondelle without having to update the fork. @dbyrns What's your position on this one? Staging and prod are both pointing towards our fork right now, but we might benefit from pointing staging to the official repo.
@matprov Personally, I would like to have hirondelle/staging auto-update to master for regularly testing recent changes, and leave outarde/prod with manual update.
@fmigneault Actually no. They are on the fork to prevent prod auto-update. Using the fork repo makes sure we are not updating every time we merge on bird-house/birdhouse-deploy. And yes, auto-updating staging (so pointing towards the real repo, not the fork) makes sense for QA.
I synced the fork last week.
@matprov
I agree. Our staging env (Hirondelle) following birdhouse-deploy/master and our prod env (Outarde) following our fork for deployment control makes perfect sense.
@fmigneault It's the autodeploy component from birdhouse. Autodeploy via this component is enabled on both the staging instance (every 10 minutes) and the prod instance (at midnight). Now, to avoid autodeploy, it all depends on which remote is used: either the "real" repo or the fork.
@fmigneault We want to keep the auto-deploy, but have prod synced on its own fork (see https://birdhouse-deploy.readthedocs.io/en/latest/contributing.html) instead of the bird-house/birdhouse-deploy repo. Changing staging's remote to the bird-house/birdhouse-deploy repo will do the trick.
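Changing a checkout's remote, as suggested above, can be sketched as follows. A throwaway repo stands in for the real deployment checkout, whose path is not given in the thread:

```shell
# Hypothetical sketch of repointing a checkout's "origin" remote so the
# autodeploy component pulls from the official repo instead of the fork.
# A temporary repo is used here instead of the real deployment path.
demo=$(mktemp -d)
cd "$demo"
git init -q .
git remote add origin https://github.com/crim-ca/birdhouse-deploy.git     # the fork
git remote set-url origin https://github.com/bird-house/birdhouse-deploy.git
git config --get remote.origin.url   # prints the newly configured URL
```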
Another reason for the staging env to follow the real repo is to catch errors earlier and allow for "hot fixes", avoiding a rollback like the previous Magpie one. This is what I understood: you guys prefer hot fixes over rollbacks, so the staging env will be able to test the hot fix. The fork is only needed to ensure production stability. For the record, our staging env (medus) is also on the real repo; only our production is on the fork.
@tlvu I totally agree, this is the way to go. Hirondelle now points to the real repo.
## Overview
Re-enables the caching feature of Twitcher that was disabled temporarily in #182.

## Changes
**Non-breaking changes**
- Twitcher request caching=on

**Breaking changes**
n/a

## Related Issue / Discussion
- Resolves Ouranosinc/Magpie#433
Describe the bug
Running Jenkins notebook test suite against Ouranos production PAVICS stack we randomly get
and
Full run in http://jenkins.ouranos.ca/job/PAVICS-e2e-workflow-tests/job/master/1047/console
Run the same test suite again, and the error is gone.
Note on Ouranos production server, all WPS services have full public access on anonymous group.
No error seen in `docker logs magpie` and `docker logs twitcher`.

To Reproduce
Steps to reproduce the behavior: with `--nbval-lax` activated, see the build request on the CRIM side: https://daccs-jenkins.crim.ca/job/PAVICS-e2e-workflow-tests/job/master/402/parameters/

Expected behavior
Should not get those `UnauthorizedException` errors.