Pushing to run-integration-tests sometimes doesn't load locale data? #14396

Open
alexgibson opened this issue Apr 2, 2024 · 10 comments
Labels
Bug 🐛 Something's not working the way it should Infra Infrastructure

Comments

@alexgibson
Member

Description

We've been seeing intermittent failures for a while now where locale-specific tests fail. It looks as though the locale data is not there when the tests run. Pushing again sometimes fixes things, but the problem doesn't seem to be going away. We should try to figure out what's going on.

https://github.com/mozilla/bedrock/actions/runs/8520133305

@alexgibson alexgibson added the Bug 🐛 Something's not working the way it should label Apr 2, 2024
@alexgibson alexgibson changed the title Pushing to run-integration tests sometimes dowsn't load locale data? Pushing to run-integration-tests sometimes doesn't load locale data? Apr 2, 2024
@alexgibson alexgibson added the Infra Infrastructure label Apr 2, 2024
@stevejalim
Collaborator

My suspicion is that it's a race condition between the containers spinning up at the end of a deployment (which then hits a webhook on mozilla/bedrock that starts the integration/headless tests) and the container's own process pulling down l10n files on startup. If the container starts getting hammered (including by reruns), that might delay the l10n update for longer.
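
If the race described above is real, one way to confirm or work around it would be for the test job to wait until the locale data is actually on disk before starting. The following is only a minimal sketch under assumptions: `wait_for_locales` is a hypothetical helper, not something in bedrock, and it assumes the l10n directory holds one subdirectory per locale, which this thread does not confirm.

```python
import time
from pathlib import Path


def wait_for_locales(l10n_dir, required, timeout=60.0, interval=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll until every locale in `required` has a subdirectory in `l10n_dir`.

    Hypothetical helper: the one-directory-per-locale layout is an assumption.
    Returns the set of locales still missing when polling stops
    (an empty set means everything was found before the timeout).
    """
    root = Path(l10n_dir)
    deadline = clock() + timeout
    while True:
        # Recompute on every pass so locales that arrive late are noticed.
        missing = {loc for loc in required if not (root / loc).is_dir()}
        if not missing or clock() >= deadline:
            return missing
        sleep(interval)
```

A test runner could call this with the locales its tests depend on and fail fast with a clear message if the returned set is non-empty, which would at least separate "data never arrived" from an ordinary test failure.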

@alexgibson
Member Author

alexgibson commented Apr 2, 2024

@stevejalim how come we only see this for the tests branch and not in our regular CI for dev / stage / prod? Shouldn’t the l10n data be fully pulled down before the site is considered to be deployed?

@stevejalim
Collaborator

I'd need to check whether the test build is something different from the prod build. I'm pretty sure we don't ship an image to prod containing our dev/test deps. Will get back to you.

@stevejalim
Collaborator

It might also be that resources are allocated differently for test than dev

@stevejalim
Collaborator

And looking about a bit, I now don't think the test image is any different to the regular image we ship to dev/stage/prod. Odd.

@stevejalim
Collaborator

> Shouldn’t the l10n data be fully pulled down before the site is considered to be deployed?

Yep, and we can see that happening here (which is called by this, which is called in the Dockerfile)

All of which makes me wonder if the data/www-l10n-team directory isn't necessarily available reliably - maybe it's eventually consistent or something similar, which shows up more when the deployment is fresh. (But I'm just thinking aloud right now and need to dig more)

@stevejalim
Collaborator

So, one thing that's different on bedrock-test compared to bedrock-dev and -stage and -prod is that in test mode we run bedrock with supervisord enabled:

https://github.com/mozilla-it/webservices-infra/blob/main/bedrock/k8s/bedrock/values-test.yaml#L114

When RUN_SUPERVISOR is set to True, bedrock is booted up by running this script, which appears to fake a locale sync having happened so that bedrock will start.

It also runs a clock process that is always called with at least the 'file' arg, which means we also update files every 5 minutes (by default), and that includes updating the l10n files.

So, it's possible that either
a) bedrock can start without l10n files available and the l10n update process takes a while to complete, so we're missing locales
or
b) sometimes we catch the test server updating its l10n files and so we're missing locales
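
To make hypothesis (b) concrete: if an update rewrites the live l10n directory in place, a request that lands mid-update can see a partial set of locales. A common way to avoid that window is to build the new tree elsewhere and switch a symlink in one step. The sketch below illustrates that general technique only; `publish_l10n` and the paths are hypothetical, and nothing in this thread says bedrock's update works this way or would need to.

```python
import os
from pathlib import Path


def publish_l10n(new_tree, live_link):
    """Atomically repoint `live_link` (a symlink) at `new_tree`.

    Hypothetical helper. Readers resolving `live_link` see either the old
    tree or the new one, never a half-written directory.
    """
    live_link = Path(live_link)
    # Build the replacement symlink under a temporary name next to the live one.
    tmp_link = live_link.with_name(live_link.name + ".tmp")
    if tmp_link.is_symlink():
        tmp_link.unlink()  # clear a leftover from an interrupted run
    os.symlink(Path(new_tree).resolve(), tmp_link, target_is_directory=True)
    # On POSIX, os.replace is rename(2), which swaps the link in one step.
    os.replace(tmp_link, live_link)
```

If tests still fail with an update scheme like this in place, that would point towards (a), the fresh container starting before any l10n data has arrived, rather than (b).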

But I'd welcome a second opinion on that from @pmac as I may be misinterpreting or there may be more nuance if I dig deeper

@alexgibson
Member Author

alexgibson commented Apr 9, 2024

Nice investigation @stevejalim!

I've noticed that tests always seem to fail on first deployment (from what I can tell, it no longer seems to be intermittent). Not sure if that's useful, or whether it points to something that changed recently. Just thought I'd add it here.

@janbrasna
Contributor

This is probably unrelated, but about a week ago this started appearing in all the logs:

#39 5.149 + ./manage.py l10n_update
#39 5.810 Using SITE_MODE of 'Mozorg'
#39 6.377 System check identified some issues:
#39 6.377 WARNINGS:
#39 6.377 ?: (staticfiles.W004) The directory '/app/assets' in the STATICFILES_DIRS setting does not exist.

3 participants