Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Intermittent UnauthorizedException for WPS processes behind Magpie 3.8.0 and matching Twitcher #433

Closed
tlvu opened this issue May 10, 2021 · 31 comments · Fixed by bird-house/birdhouse-deploy#188
Assignees
Labels
bug Problem, error, or invalid behaviour

Comments

@tlvu
Copy link
Contributor

tlvu commented May 10, 2021

Describe the bug
Running Jenkins notebook test suite against Ouranos production PAVICS stack we randomly get

11:22:04  /opt/conda/envs/birdy/lib/python3.7/site-packages/owslib/util.py in openURL(url_base, data, method, cookies, username, password, timeout, headers, verify, cert, auth)
11:22:04      210     if req.status_code in [400, 401]:
11:22:04  --> 211         raise ServiceException(req.text)
11:22:04      212 
11:22:04  
11:22:04  ServiceException: <?xml version="1.0" encoding="utf-8"?>
11:22:04  <ExceptionReport version="1.0.0"
11:22:04      xmlns="http://www.opengis.net/ows/1.1"
11:22:04      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
11:22:04      xsi:schemaLocation="http://www.opengis.net/ows/1.1 http://schemas.opengis.net/ows/1.1.0/owsExceptionReport.xsd">
11:22:04      <Exception exceptionCode="NoApplicableCode" locator="AccessForbidden">
11:22:04          <ExceptionText>Not authorized to access this resource. User does not meet required permissions.</ExceptionText>
11:22:04      </Exception>
11:22:04  </ExceptionReport>
11:22:04  
11:22:04  During handling of the above exception, another exception occurred:
11:22:04  
11:22:04  UnauthorizedException                     Traceback (most recent call last)
11:22:04  <ipython-input-5-a503678973ad> in <module>
11:22:04       30     elevation=basin_elevation,
11:22:04       31     area=basin_area,
11:22:04  ---> 32     name="test_basin",
11:22:04       33 )
11:22:04       34 
11:22:04  
11:22:04  </opt/conda/envs/birdy/lib/python3.7/site-packages/birdy/client/base.py-613> in climatology_esp(self, forecast_date, ts, model_name, params, forecast_duration, name, area, latitude, longitude, elevation, nc_spec, rvc, rain_snow_fraction, output_formats)
11:22:04  
11:22:04  /opt/conda/envs/birdy/lib/python3.7/site-packages/birdy/client/base.py in _execute(self, pid, **kwargs)
11:22:04      357             if "AccessForbidden" in str(e):
11:22:04      358                 raise UnauthorizedException(
11:22:04  --> 359                     "You are not authorized to do a request of type: Execute"
11:22:04      360                 )
11:22:04      361             raise
11:22:04  
11:22:04  UnauthorizedException: You are not authorized to do a request of type: Execute

and

11:22:04  /opt/conda/envs/birdy/lib/python3.7/site-packages/owslib/util.py in openURL(url_base, data, method, cookies, username, password, timeout, headers, verify, cert, auth)
11:22:04      210     if req.status_code in [400, 401]:
11:22:04  --> 211         raise ServiceException(req.text)
11:22:04      212 
11:22:04  
11:22:04  ServiceException: <?xml version="1.0" encoding="utf-8"?>
11:22:04  <ExceptionReport version="1.0.0"
11:22:04      xmlns="http://www.opengis.net/ows/1.1"
11:22:04      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
11:22:04      xsi:schemaLocation="http://www.opengis.net/ows/1.1 http://schemas.opengis.net/ows/1.1.0/owsExceptionReport.xsd">
11:22:04      <Exception exceptionCode="NoApplicableCode" locator="AccessForbidden">
11:22:04          <ExceptionText>Not authorized to access this resource. User does not meet required permissions.</ExceptionText>
11:22:04      </Exception>
11:22:04  </ExceptionReport>
11:22:04  
11:22:04  During handling of the above exception, another exception occurred:
11:22:04  
11:22:04  UnauthorizedException                     Traceback (most recent call last)
11:22:04  <ipython-input-2-326fb088ffcd> in <module>
11:22:04        4 raven_url = os.environ.get("WPS_URL", f"{pavics_url}/twitcher/ows/proxy/raven/wps")
11:22:04        5 
11:22:04  ----> 6 raven = birdy.WPSClient(raven_url)
11:22:04  
11:22:04  /opt/conda/envs/birdy/lib/python3.7/site-packages/birdy/client/base.py in __init__(self, url, processes, converters, username, password, headers, auth, verify, cert, verbose, progress, version, caps_xml, desc_xml, language)
11:22:04      120             if "AccessForbidden" in str(e):
11:22:04      121                 raise UnauthorizedException(
11:22:04  --> 122                     "You are not authorized to do a request of type: GetCapabilities"
11:22:04      123                 )
11:22:04      124             raise
11:22:04  
11:22:04  UnauthorizedException: You are not authorized to do a request of type: GetCapabilities

Full run in http://jenkins.ouranos.ca/job/PAVICS-e2e-workflow-tests/job/master/1047/console

Run the same test suite again, and the error is gone.

Note on Ouranos production server, all WPS services have full public access on anonymous group.

No error seen in docker logs magpie and docker logs twitcher.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy birdhouse-deploy PAVICS docker-compose stack at tag https://github.com/bird-house/birdhouse-deploy/tree/1.13.0
  2. Request Jenkins build with only Raven notebooks and with --nbval-lax activated, see Build request on CRIM side https://daccs-jenkins.crim.ca/job/PAVICS-e2e-workflow-tests/job/master/402/parameters/

Expected behavior
Should not have those UnauthorizedException

@tlvu tlvu added the bug Problem, error, or invalid behaviour label May 10, 2021
@tlvu
Copy link
Contributor Author

tlvu commented May 11, 2021

Reproduced (after a few tries since this is intermittent) with CRIM's Jenkins against Hirondelle from CRIM as well:

https://daccs-jenkins.crim.ca/job/PAVICS-e2e-workflow-tests/job/master/407/console

09:05:48  _ raven-master/docs/source/notebooks/Subset_climate_data_over_watershed.ipynb::Cell 1 _
09:05:48  Notebook cell execution failed
09:05:48  Cell 1: Cell execution caused an exception
09:05:48  
09:05:48  Input:
09:05:48  # Set the links to the servers. 
09:05:48  # Note that if Finch is a remote server, Raven needs to be accessible on the next because some cells 
09:05:48  # below use the output from Raven processes to feed into Finch. 
09:05:48  
09:05:48  url_finch = os.environ.get(
09:05:48      "FINCH_WPS_URL", "https://hirondelle.crim.ca/twitcher/ows/proxy/finch/wps"
09:05:48  )
09:05:48  
09:05:48  url_raven = os.environ.get(
09:05:48      "WPS_URL", "https://hirondelle.crim.ca/twitcher/ows/proxy/raven/wps"
09:05:48  )
09:05:48  
09:05:48  # Establish client connexions to the remote servers
09:05:48  finch = birdy.WPSClient(url_finch)  
09:05:48  raven = birdy.WPSClient(url_raven)
09:05:48  
09:05:48  Traceback:
09:05:48  
09:05:48  ---------------------------------------------------------------------------
09:05:48  ServiceException                          Traceback (most recent call last)
09:05:48  /opt/conda/envs/birdy/lib/python3.7/site-packages/birdy/client/base.py in __init__(self, url, processes, converters, username, password, headers, auth, verify, cert, verbose, progress, version, caps_xml, desc_xml, language)
09:05:48      117         try:
09:05:48  --> 118             self._wps.getcapabilities(xml=caps_xml)
09:05:48      119         except ServiceException as e:
09:05:48  
09:05:48  /opt/conda/envs/birdy/lib/python3.7/site-packages/owslib/wps.py in getcapabilities(self, xml)
09:05:48      274             self._capabilities = reader.readFromUrl(
09:05:48  --> 275                 self.url, headers=self.headers)
09:05:48      276 
09:05:48  
09:05:48  /opt/conda/envs/birdy/lib/python3.7/site-packages/owslib/wps.py in readFromUrl(self, url, username, password, headers, verify, cert)
09:05:48      554                                  username=username, password=password,
09:05:48  --> 555                                  headers=headers, verify=verify, cert=cert)
09:05:48      556 
09:05:48  
09:05:48  /opt/conda/envs/birdy/lib/python3.7/site-packages/owslib/wps.py in _readFromUrl(self, url, data, timeout, method, username, password, headers, verify, cert)
09:05:48      508                         1], method='Get', username=self.auth.username, password=self.auth.password,
09:05:48  --> 509                         headers=headers, verify=self.auth.verify, cert=self.auth.cert, timeout=self.timeout)
09:05:48      510             return etree.fromstring(u.read())
09:05:48  
09:05:48  /opt/conda/envs/birdy/lib/python3.7/site-packages/owslib/util.py in openURL(url_base, data, method, cookies, username, password, timeout, headers, verify, cert, auth)
09:05:48      210     if req.status_code in [400, 401]:
09:05:48  --> 211         raise ServiceException(req.text)
09:05:48      212 
09:05:48  
09:05:48  ServiceException: <?xml version="1.0" encoding="utf-8"?>
09:05:48  <ExceptionReport version="1.0.0"
09:05:48      xmlns="http://www.opengis.net/ows/1.1"
09:05:48      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
09:05:48      xsi:schemaLocation="http://www.opengis.net/ows/1.1 http://schemas.opengis.net/ows/1.1.0/owsExceptionReport.xsd">
09:05:48      <Exception exceptionCode="NoApplicableCode" locator="AccessForbidden">
09:05:48          <ExceptionText>Not authorized to access this resource. User does not meet required permissions.</ExceptionText>
09:05:48      </Exception>
09:05:48  </ExceptionReport>
09:05:48  
09:05:48  During handling of the above exception, another exception occurred:
09:05:48  
09:05:48  UnauthorizedException                     Traceback (most recent call last)
09:05:48  <ipython-input-2-6e0bd7322976> in <module>
09:05:48       13 # Establish client connexions to the remote servers
09:05:48       14 finch = birdy.WPSClient(url_finch)
09:05:48  ---> 15 raven = birdy.WPSClient(url_raven)
09:05:48  
09:05:48  /opt/conda/envs/birdy/lib/python3.7/site-packages/birdy/client/base.py in __init__(self, url, processes, converters, username, password, headers, auth, verify, cert, verbose, progress, version, caps_xml, desc_xml, language)
09:05:48      120             if "AccessForbidden" in str(e):
09:05:48      121                 raise UnauthorizedException(
09:05:48  --> 122                     "You are not authorized to do a request of type: GetCapabilities"
09:05:48      123                 )
09:05:48      124             raise
09:05:48  
09:05:48  UnauthorizedException: You are not authorized to do a request of type: GetCapabilities

@fmigneault
Copy link
Collaborator

Latest run seems to try accessing this file:
"https://hirondelle.crim.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/raven/raven-gr4j-cemaneige/Salmon-River-Near-Prince-George_meteo_daily.nc"
but I don't see that when I go through Thredds' browser

@tlvu
Copy link
Contributor Author

tlvu commented May 11, 2021

Latest run seems to try accessing this file:
"https://hirondelle.crim.ca/twitcher/ows/proxy/thredds/fileServer/birdhouse/testdata/raven/raven-gr4j-cemaneige/Salmon-River-Near-Prince-George_meteo_daily.nc"
but I don't see that when I go through Thredds' browser

Oh right, that data and all Raven testdata can be deployed by activating https://github.com/bird-house/birdhouse-deploy/blob/d90765acabe248e65c4899929fbe37a9e8661643/birdhouse/env.local.example#L180-L184 on Hirondelle.

Just to be clear, missing that testdata file is un-related to the UnauthorizedException: You are not authorized to do a request of type: GetCapabilities intermittent error. We have this same intermittent error on our prod and our prod has all the required testdata.

@tlvu tlvu changed the title [BUG] Intermittent access control with WPS processes behind Magpie 3.8.0 and matching Twitcher [BUG] Intermittent UnauthorizedException for WPS processes behind Magpie 3.8.0 and matching Twitcher May 11, 2021
fmigneault added a commit to Ouranosinc/PAVICS-e2e-workflow-tests that referenced this issue May 20, 2021
@fmigneault
Copy link
Collaborator

For reference from Ouranosinc/PAVICS-e2e-workflow-tests#74 (comment)
Possible lead:

Below is docker logs twitcher on Medus matching the timestamp for the Flyingpigeon 500 failure in https://daccs-jenkins.crim.ca/job/PAVICS-e2e-workflow-tests/job/stress-test/10/console. There was a hit on other services during the 100 hit on Flyingpigeon. I suspect the other hits are from the Canarie-api monitoring that hits all the WPS every minutes. Otherwise, this is a staging server, much less activity than the production server.

There was no errors in Magpie logs.

2021-05-26 21:22:02,150 INFO  [TWITCHER:132][waitress-0] 'None' request 'getcapabilities' permission on '/ows/proxy/flyingpigeon/wps'                 
2021-05-26 21:22:02,174 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-05-26 21:22:02,397 INFO  [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-05-26 21:22:02,403 INFO  [TWITCHER:132][waitress-2] 'None' request 'getcapabilities' permission on '/ows/proxy/flyingpigeon/wps'                 
2021-05-26 21:22:02,430 INFO  [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-05-26 21:22:02,652 INFO  [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-05-26 21:22:02,659 INFO  [TWITCHER:132][waitress-3] 'None' request 'getcapabilities' permission on '/ows/proxy/flyingpigeon/wps'                 
2021-05-26 21:22:02,691 INFO  [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-05-26 21:22:02,737 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-05-26 21:22:02,743 INFO  [TWITCHER:132][waitress-0] 'None' request 'getcapabilities' permission on '/ows/proxy/ncWMS2/wms'                       
2021-05-26 21:22:02,761 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-05-26 21:22:02,898 INFO  [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-05-26 21:22:02,901 INFO  [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-05-26 21:22:02,907 WARNI [TWITCHER:100][waitress-1] Using cached service                                                                         
2021-05-26 21:22:02,908 ERROR [TWITCHER:33][waitress-1] unknown error                                                                                 
Traceback (most recent call last):                                                                                                                    
  File "/opt/birdhouse/src/twitcher/twitcher/tweens.py", line 27, in ows_security_tween                                                               
    security.check_request(request)                                                                                                                   
  File "/opt/local/src/magpie/magpie/adapter/magpieowssecurity.py", line 127, in check_request                                                        
    permission_requested = service_impl.permission_requested()                                                                                        
  File "/opt/local/src/magpie/magpie/services.py", line 429, in permission_requested                                                                  
    raise NotImplementedError("Undefined 'Permission' from 'request' parameter: {!s}".format(req))                                                    
NotImplementedError: Undefined 'Permission' from 'request' parameter: none                                                                            
2021-05-26 21:22:02,921 INFO  [TWITCHER:132][waitress-2] 'None' request 'browse' permission on '/ows/proxy/thredds/catalog.html'
2021-05-26 21:22:02,941 INFO  [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-05-26 21:22:03,009 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-05-26 21:22:03,016 INFO  [TWITCHER:132][waitress-0] 'None' request 'getcapabilities' permission on '/ows/proxy/flyingpigeon/wps'
2021-05-26 21:22:03,040 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-05-26 21:22:03,225 INFO  [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-05-26 21:22:03,231 INFO  [TWITCHER:132][waitress-3] 'None' request 'getcapabilities' permission on '/ows/proxy/flyingpigeon/wps'
2021-05-26 21:22:03,256 INFO  [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-05-26 21:22:03,284 INFO  [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'

fmigneault added a commit to Ouranosinc/PAVICS-e2e-workflow-tests that referenced this issue Jul 13, 2021
## Overview

Adds stress test notebook to evaluate possible regression in sporadic response codes and timings for given requests.

## Changes

The new notebook uses a function with versatile inputs (configurable in environment variables) to control how strict we want to be with this test. For the moment, it tests by default 100 times WPS GetCapabilities requests for each of `finch`, `flyingpigeon`, `raven` and `hummingbird` services, but can be extended for basically any request. Maximum Average time below the expected value of 1s for request execution as well as expected 200 status code from them will make the test pass.

## Related Issues

- Relates to issue Ouranosinc/Magpie#433
- Relates to PR bird-house/birdhouse-deploy#174
@tlvu
Copy link
Contributor Author

tlvu commented Jul 13, 2021

FYI, we also started to see intermittent Authorization failure for Thredds data as well, in our nightly prod run http://jenkins.ouranos.ca/job/PAVICS-e2e-workflow-tests/job/master/1135/console.

OSError: [Errno -78] NetCDF: Authorization failure: b'https://boreas.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/datasets/simulations/bias_adjusted/cmip5/ouranos/cb-oura-1.0/day_NorESM1-M_historical+rcp85_r1i1p1_na10kgrid_qm-moving-50bins-detrend_1950-2100.ncml'

The error above is not supposed to be able to happen since everything is public on our Thredds for the moment.

Will try to find logs but since this is a production server, there are many activities, not sure I can pinpoint the good logs.

@tlvu
Copy link
Contributor Author

tlvu commented Jul 13, 2021

Matching logs for #433 (comment)

2021-07-13 00:39:04,607 INFO  [TWITCHER:132][waitress-0] 'None' request 'getcapabilities' permission on '/ows/proxy/hummingbird'
2021-07-13 00:39:04,627 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,643 INFO  [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,647 INFO  [TWITCHER:132][waitress-1] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/datasets/simulations/bias_adjusted/cmip5/ouranos/cb-oura-1.0/day_NorESM1-M_historical+rcp85_r1i1p1_na10kgrid_qm-moving-50bins-detrend_1950-2100.ncml.dods'
2021-07-13 00:39:04,668 INFO  [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,687 INFO  [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,692 INFO  [TWITCHER:132][waitress-2] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/datasets/simulations/bias_adjusted/cmip5/ouranos/cb-oura-1.0/day_NorESM1-M_historical+rcp85_r1i1p1_na10kgrid_qm-moving-50bins-detrend_1950-2100.ncml.dods'
2021-07-13 00:39:04,714 INFO  [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,744 INFO  [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,746 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,750 WARNI [TWITCHER:100][waitress-1] Using cached service
2021-07-13 00:39:04,750 ERROR [TWITCHER:33][waitress-1] unknown error
Traceback (most recent call last):
  File "/opt/birdhouse/src/twitcher/twitcher/tweens.py", line 27, in ows_security_tween
    security.check_request(request)
  File "/opt/local/src/magpie/magpie/adapter/magpieowssecurity.py", line 127, in check_request
    permission_requested = service_impl.permission_requested()
  File "/opt/local/src/magpie/magpie/services.py", line 429, in permission_requested
    raise NotImplementedError("Undefined 'Permission' from 'request' parameter: {!s}".format(req))
NotImplementedError: Undefined 'Permission' from 'request' parameter: none
2021-07-13 00:39:04,751 INFO  [TWITCHER:132][waitress-0] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/datasets/simulations/bias_adjusted/cmip5/ouranos/cb-oura-1.0/day_NorESM1-M_historical+rcp85_r1i1p1_na10kgrid_qm-moving-50bins-detrend_1950-2100.ncml.dods'
2021-07-13 00:39:04,773 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,783 INFO  [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,789 INFO  [TWITCHER:132][waitress-2] 'None' request 'read' permission on '/ows/proxy/thredds/wms/birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc'
2021-07-13 00:39:04,794 INFO  [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-07-13 00:39:04,807 INFO  [TWITCHER:132][waitress-1] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/datasets/simulations/bias_adjusted/cmip5/ouranos/cb-oura-1.0/day_NorESM1-M_historical+rcp85_r1i1p1_na10kgrid_qm-moving-50bins-detrend_1950-2100.ncml.dods'

@fmigneault
Copy link
Collaborator

@tlvu
Can you try adding these settings to both Magpie and Twitcher INI configs and let me know if the errors can be replicated again?

cache.enabled = false
cache.acl.enabled = false
cache.service.enabled = false

@tlvu
Copy link
Contributor Author

tlvu commented Jul 13, 2021

@tlvu
Can you try adding these settings to both Magpie and Twitcher INI configs and let me know if the errors can be replicated again?

cache.enabled = false
cache.acl.enabled = false
cache.service.enabled = false

@fmigneault you want to add these to prod even if the PR about cache config bird-house/birdhouse-deploy#174 is not merged yet? You suspect portions of the caching mechanism has been enabled?

@tlvu
Copy link
Contributor Author

tlvu commented Jul 13, 2021

@tlvu
Can you try adding these settings to both Magpie and Twitcher INI configs and let me know if the errors can be replicated again?

cache.enabled = false
cache.acl.enabled = false
cache.service.enabled = false

@fmigneault done: Ouranosinc/birdhouse-deploy@5c23e8b (can you double check I've inserted into the proper location in those 2 config files). I've done ./pavics-compose.sh restart twitcher magpie for these changes to take effect. Let me know if restart is enough for I have to destroy and recreate those containers.

@tlvu
Copy link
Contributor Author

tlvu commented Jul 13, 2021

@fmigneault FYI, I ran Jenkins on our prod after those cache config you proposed in #433 (comment) are applied, the stress-test.ipynb still fail (https://daccs-jenkins.crim.ca/job/PAVICS-e2e-workflow-tests/job/fix-stress-tests.ipynb-not-showing-error-in-jenkins/6/console) and I still have the following stack trace in Twitcher logs:

2021-07-13 22:16:18,506 INFO  [TWITCHER:132][waitress-1] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/birdhouse/cccs_portal/indices/F
inal/BCCAQv2/tx_max/YS/rcp45/simulations/BCCAQv2+ANUSPLIN300_FGOALS-g2_historical+rcp45_r1i1p1_1950-2100_tx_max_YS.nc.dds'                            
2021-07-13 22:16:18,543 INFO  [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-13 22:16:18,623 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-13 22:16:18,632 INFO  [TWITCHER:132][waitress-0] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/birdhouse/cccs_portal/indices/F
inal/BCCAQv2/tx_max/YS/rcp45/simulations/BCCAQv2+ANUSPLIN300_FGOALS-g2_historical+rcp45_r1i1p1_1950-2100_tx_max_YS.nc.das'                            
2021-07-13 22:16:18,659 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-13 22:16:18,769 INFO  [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-13 22:16:18,772 INFO  [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-13 22:16:18,778 WARNI [TWITCHER:100][waitress-3] Using cached service                                                                         
2021-07-13 22:16:18,779 ERROR [TWITCHER:33][waitress-3] unknown error                                                                                 
Traceback (most recent call last):                                                                                                                    
  File "/opt/birdhouse/src/twitcher/twitcher/tweens.py", line 27, in ows_security_tween                                                               
    security.check_request(request)                                                                                                                   
  File "/opt/local/src/magpie/magpie/adapter/magpieowssecurity.py", line 127, in check_request                                                        
    permission_requested = service_impl.permission_requested()                                                                                        
  File "/opt/local/src/magpie/magpie/services.py", line 429, in permission_requested                                                                  
    raise NotImplementedError("Undefined 'Permission' from 'request' parameter: {!s}".format(req))                                                    
NotImplementedError: Undefined 'Permission' from 'request' parameter: none                                                                            
2021-07-13 22:16:18,780 INFO  [TWITCHER:132][waitress-1] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/birdhouse/cccs_portal/indices/F
inal/BCCAQv2/tx_max/YS/rcp45/simulations/BCCAQv2+ANUSPLIN300_FGOALS-g2_historical+rcp45_r1i1p1_1950-2100_tx_max_YS.nc.dds'                            
2021-07-13 22:16:18,813 INFO  [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-13 22:16:18,853 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-13 22:16:18,859 INFO  [TWITCHER:132][waitress-0] 'None' request 'getcapabilities' permission on '/ows/proxy/raven/wps'                        
2021-07-13 22:16:18,881 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-13 22:16:18,896 INFO  [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-13 22:16:18,901 INFO  [TWITCHER:132][waitress-2] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/birdhouse/cccs_portal/indices/F
inal/BCCAQv2/tx_max/YS/rcp45/simulations/BCCAQv2+ANUSPLIN300_FGOALS-g2_historical+rcp45_r1i1p1_1950-2100_tx_max_YS.nc.dods'    

@fmigneault
Copy link
Collaborator

@tlvu The settings look all right, but the line

2021-07-13 22:16:18,778 WARNI [TWITCHER:100][waitress-3] Using cached service

indicates that somehow caching is still enabled.

Error

 raise NotImplementedError("Undefined 'Permission' from 'request' parameter: {!s}".format(req))     

is a side effect of #439
This error can occur only when caching is active, otherwise the request is parsed from scratch each time.

I'm fine with having the cache settings disabled until bird-house/birdhouse-deploy#174 is working with all tests.
I'm not sure why instances see them as enabled though... toggling those worked when I was creating the stress-test script.
It is pointing to a server that applied them?

fmigneault added a commit that referenced this issue Jul 14, 2021
…sing db-session error during permissions resolution (relates to #433, corrects 500 error flaged by Twitcher)
fmigneault added a commit that referenced this issue Jul 14, 2021
…sing db-session error during permissions resolution (relates to #433, corrects 500 error flaged by Twitcher)
@tlvu
Copy link
Contributor Author

tlvu commented Jul 14, 2021

@fmigneault I've destroyed and recreated twitcher and magpie container on our production to ensure the config that force disable the cache are active. Still have these errors in the twitcher logs when running stress-test.ipynb (https://daccs-jenkins.crim.ca/job/PAVICS-e2e-workflow-tests/job/master/497/console)

2021-07-14 23:03:30,178 INFO  [TWITCHER:132][waitress-0] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/datasets/station_obs/ECCC_AHCCD
_gen3_temperature.ncml.dods'                                                                                                                          
2021-07-14 23:03:30,201 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-14 23:03:30,342 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-14 23:03:30,348 INFO  [TWITCHER:132][waitress-0] 'None' request 'getcapabilities' permission on '/ows/proxy/raven/wps'                        
2021-07-14 23:03:30,370 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       2021-07-14 23:03:30,623 INFO  [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-14 23:03:30,627 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-14 23:03:30,630 WARNI [TWITCHER:100][waitress-3] Using cached service                                                                         
2021-07-14 23:03:30,630 ERROR [TWITCHER:33][waitress-3] unknown error                                                                                 
Traceback (most recent call last):                                                                                                                    
  File "/opt/birdhouse/src/twitcher/twitcher/tweens.py", line 27, in ows_security_tween                                                               
    security.check_request(request)                                                                                                                   
  File "/opt/local/src/magpie/magpie/adapter/magpieowssecurity.py", line 127, in check_request                                      
    permission_requested = service_impl.permission_requested()                                                                                        
  File "/opt/local/src/magpie/magpie/services.py", line 429, in permission_requested                                                
    raise NotImplementedError("Undefined 'Permission' from 'request' parameter: {!s}".format(req))                                                    
NotImplementedError: Undefined 'Permission' from 'request' parameter: none                                                                            
2021-07-14 23:03:30,635 INFO  [TWITCHER:132][waitress-0] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/datasets/station_obs/ECCC_AHCC$_gen3_temperature.ncml.dods'                                                                                                        
2021-07-14 23:03:30,658 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-14 23:03:30,732 INFO  [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-14 23:03:30,737 INFO  [TWITCHER:132][waitress-3] 'None' request 'getcapabilities' permission on '/ows/proxy/raven/wps'                        
2021-07-14 23:03:30,759 INFO  [TWITCHER:60][waitress-3] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-07-14 23:03:30,993 INFO  [TWITCHER:60][waitress-0] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'       

So I do not think the following configs are able to disable the cache.

cache.enabled = false
cache.acl.enabled = false
cache.service.enabled = false

@tlvu
Copy link
Contributor Author

tlvu commented Aug 3, 2021

@fmigneault After PR bird-house/birdhouse-deploy#182 is merged, our Jenkins nightly in prod found 408 code in the stress-test.ipynb this morning. I have never seen this 408 before. Previously it was 500 and 401 only. Should I open a new different issue or keep the discussion here?

docker logs twitcher has a few security check failed but no stack-trace at all.

2021-08-03 14:01:49,296 INFO  [TWITCHER:158][waitress-2] 'None' request 'getcapabilities' permission on '/ows/proxy/flyingpigeon/wps'                 
2021-08-03 14:01:49,321 INFO  [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-08-03 14:01:49,407 WARNI [waitress.queue:117][MainThread] Task queue depth is 1                                                                  
2021-08-03 14:01:49,430 INFO  [TWITCHER:60][waitress-2] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'                                       
2021-08-03 14:01:49,437 WARNI [TWITCHER:30][waitress-2] security check failed.  

Magpie logs has nothing suspicious.

@tlvu
Copy link
Contributor Author

tlvu commented Aug 3, 2021

https://hirondelle.crim.ca/magpie/version still has Magpie 3.12.0, the PR bird-house/birdhouse-deploy#182 has not been autodeployed?

@fmigneault
Copy link
Collaborator

I think its ok to leave it in the same issue since the cause could be off same nature.

408 is generated by the stress test that defines timeout=int(os.getenv("TEST_TIMEOUT_ABORT", 5)) when submitting requests.
I don't know why that WPS was slow to answer though, especially that caching is not involved.
Maybe it was a one-off case that just so happened to reach the limit, but usually it is much lower than that (~0.1s).

For hirondelle, I'm not sure if it is auto-deployed or done manually by @matprov
I cannot see any jenkins job related to that, Maybe you know @dbyrns ?

@dbyrns
Copy link
Contributor

dbyrns commented Aug 4, 2021

No sorry, I cannot answer that for @matprov. Outarde seems down right now. Maybe the update didn't work well?

@fmigneault
Copy link
Collaborator

Outarde is also on magpie:3.12
Not sure why it is not up. It is extremely slow.
I get following error:

> pavics-compose ps
reading './components/monitoring/default.env'
COMPOSE_CONF_LIST=-f docker-compose.yml -f ./components/scheduler/docker-compose-extra.yml -f ./components/monitoring/docker-compose-extra.yml
ERROR: An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).

I've never seen that one before.

@matprov
Copy link

matprov commented Aug 19, 2021

@fmigneault Outarde is now up. Docker seems to have misbehaved and needed some love on Outarde.

For hirondelle, I'm not sure if it is auto-deployed or done manually

What exactly do you make reference to? New version of magpie?

@fmigneault
Copy link
Collaborator

@matprov
Since you were in vacation, I tried looking into re-starting pavics-compose up -d to reboot the stack, but I wasn't sure if there was some auto-deploy/update of images involved for this server with a Jenkins task or if it was done manually by ssh'ing on the server.

@matprov
Copy link

matprov commented Aug 19, 2021

@fmigneault Usually a simple pavics-compose up -d is enough. However, in this case, the Docker daemon was unreachable for an unknown reason. It had to be restarted.

Also, the docker-proxy was still proxying some non-existent containers, which led to an impossibility to restart the stack, since ports were already in use. This was the first time it happened, for an unknown reason. Maybe Docker forgot to clean itself during the restart of the service, which led to proxy issues.

Following these two issues and the fact that staging environment was running smoothly, it's safe to say that the issue isn't stack related, but Docker operationalization related.

@tlvu
Copy link
Contributor Author

tlvu commented Aug 24, 2021

FYI both https://outarde.crim.ca/magpie/version and https://hirondelle.crim.ca/magpie/version still has Magpie 3.12.0 as of this writting. Probably missing the manual pull on your production fork https://github.com/crim-ca/birdhouse-deploy? I would suggest maybe leaving hirondelle on the main repo and only have outarde on your fork?

@matprov
Copy link

matprov commented Aug 30, 2021

I would suggest maybe leaving hirondelle on the main repo and only have outarde on your fork?

This might be a good idea, in order to test the latest changes on hirondelle without having to update the fork.

@dbyrns What's your position on this one? Staging and prod are both pointing towards our fork right now, but we might benefit from pointing staging to the official repo.

@fmigneault
Copy link
Collaborator

@matprov
What is the purpose of that fork? I don't see any diff between master of both repos. Were they on a fork only because of the transition phase from the old stacks to birdhouse-deploy restructure?
As far as I'm aware, they could use bird-house repo directly now.

Personally, I would like to have hirondelle/staging auto-update to master for regularly testing recent changes, and leave outarde/prod with manual update.

@matprov
Copy link

matprov commented Aug 30, 2021

Were they on a fork only because of the transition phase from the old stacks to birdhouse-deploy restructure?

@fmigneault Actually no. They are on the fork to prevent prod auto-update. Using the fork repo makes sure we are not updating everytime we merge on birdhouse/birdhouse-deploy. And yes, auto-updating staging (so pointing towards the real repo, not the fork) makes sense for QA.

I don't see any diff between master of both repos.

I synced the fork last week.

@fmigneault
Copy link
Collaborator

@matprov
I'm not sure what is causing the auto-update you are mentioning, and why/how it avoids it.
Given the branches have no diff, whatever causes auto-update is an external cron job or a custom config not staged in the main repo.

@dbyrns
Copy link
Contributor

dbyrns commented Aug 30, 2021

I agree. Our staging env (hirondelle) following birdhouse-deploy/master and our prod env (outarde) following our fork for deployment control makes perfect sense.

@matprov
Copy link

matprov commented Aug 30, 2021

..whatever causes auto-update is an external cron job or a custom config not staged in the main repo.

@fmigneault It's the autodeploy component from birdhouse. Autodeploy via this component is enabled on both staging (every 10 minutes) and prod instance (at midnight). Now, to avoid autodeploy, it all depends which is the remote. Either the "real" repo or the fork.

@fmigneault
Copy link
Collaborator

@matprov @tlvu
Wouldn't removing the ./components/scheduler from EXTRA_CONF_DIRS disable the auto-deploy?

@matprov
Copy link

matprov commented Aug 30, 2021

@fmigneault We want to keep the auto-deploy, but having prod synced on its own fork (see https://birdhouse-deploy.readthedocs.io/en/latest/contributing.html) instead of the birdhouse/birdhouse-deploy repo. Changing staging's remote to birdhouse/birdhouse-deploy repo will do the trick.

@tlvu
Copy link
Contributor Author

tlvu commented Sep 1, 2021

Our staging env (hirondelle) following birdhouse-deploy/master and our prod env (outarde) following our fork for deployment control makes perfect sense.

Another reason for the staging env to follow the real repo is to catch errors earlier and allow for "hot fixes" to avoid the previous magpie rollback. This is what I understood. You guys prefer hot fixes than rollback so the staging env will be able to test the hot fix. The fork is only needed to ensure production stability.

For the record, our staging env (medus) is also on the real repo. Only our production is on the fork.

@matprov
Copy link

matprov commented Sep 1, 2021

@tlvu I totally agree, this is the way to go. Hirondelle now points to the real repo.

fmigneault added a commit to bird-house/birdhouse-deploy that referenced this issue Sep 1, 2021
## Overview

Re-enables the caching feature of Twitcher that was disabled temporarily in #182 

## Changes

**Non-breaking changes**
- Twitcher request caching=on

**Breaking changes**
n/a

## Related Issue / Discussion

- Resolves Ouranosinc/Magpie#433
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Problem, error, or invalid behaviour
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants