Skip to content

testing: migrate to trampoline_v2#3860

Merged
tmatsuo merged 38 commits intoGoogleCloudPlatform:masterfrom
tmatsuo:trampoline_v2
May 29, 2020
Merged

testing: migrate to trampoline_v2#3860
tmatsuo merged 38 commits intoGoogleCloudPlatform:masterfrom
tmatsuo:trampoline_v2

Conversation

@tmatsuo
Copy link
Copy Markdown
Contributor

@tmatsuo tmatsuo commented May 22, 2020

This PR changes the build file to trampoline_v2.sh for all the jobs.

trampoline_v2.sh does 3 things.

  1. Prepare the Docker image for the test
  2. Run the Docker with appropriate flags to run the test
  3. Upload the newly built Docker image

in a way that is somewhat compatible with trampoline_v1.

It also behaves better in many ways.

  1. It uses the same uid as the caller, use the same gid as docker group on the host.
  2. No permission change required upon cleanup.
  3. It's bit more portable so that much easier to run somewhere other than Kokoro including local env.
  4. It can optionally build the docker image from the source then use the built image for the test, so that you can have a single PR for changing the Dockerfile and tests.
  5. Logging with timestamp. Easier to determine the bottleneck.
  6. Only passing down minimum envvars.
  7. Somewhat configurable with .trampolinerc.

Python specific info

To run this script, first decrypt our test secret by running scripts/decrypt-secrets.sh. This need secret-accessor role on our Cloud Project.

Then run the script:

.kokoro/trampoline_v2.sh

You can optionally change these environment variables:

  • TRAMPOLINE_IMAGE: The docker image to use.
  • TRAMPOLINE_IMAGE_SOURCE: The location of the Dockerfile.
  • RUN_TESTS_SESSION: The nox session to run.
  • RUN_TESTS_DIRS: colon separated values for which directory you want to run the tests.
  • TRAMPOLINE_BUILD_FILE: The script to run in the docker container.

I also added a handy script for running tests locally.

scripts/run_tests_local.sh takes only one required argument, and run test sessions for that directory. Example:

$ pwd
/home/tmatsuo/work/python-docs-samples
$ cd cdn
$ ../scripts/run_tests_local.sh .  # it will runs lint, py-3.6, and py-3.7 
# or
$ ../scripts/run_tests_local.sh . lint py-3.6  # it will only run lint and py-3.6

@tmatsuo tmatsuo added the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label May 22, 2020
@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label May 22, 2020
@tmatsuo
Copy link
Copy Markdown
Contributor Author

tmatsuo commented May 23, 2020

Currently most of the tests are passing locally with the new script.
Only failing test is iap. I think it's failing due to unrelated reasons. Let's see the periodic builds' result.

@tmatsuo
Copy link
Copy Markdown
Contributor Author

tmatsuo commented May 23, 2020

iap tests are passing in periodic builds.
I'm experimenting with running the same tests on Kokoro with trampoline_v2.

@tmatsuo tmatsuo force-pushed the trampoline_v2 branch 2 times, most recently from 426c07a to a977ec4 Compare May 23, 2020 18:03
@tmatsuo
Copy link
Copy Markdown
Contributor Author

tmatsuo commented May 23, 2020

The iap test is passing with trampoline_v2.sh:
https://source.cloud.google.com/results/invocations/acd1029b-f6fa-4aa5-940b-cfd936157e5b/targets

Somehow it's not passing locally. I'll skip this test for local runs.

@tmatsuo
Copy link
Copy Markdown
Contributor Author

tmatsuo commented May 24, 2020

I run py-3.7 for all the tests with the following command and they all passed!

$ TERM=vt100 \
  TRAMPOLINE_BUILD_FILE=.kokoro/tests/run_tests.sh \
  .kokoro/trampoline_v2.sh 2>&1 | tee /tmp/build.log

@tmatsuo tmatsuo force-pushed the trampoline_v2 branch 6 times, most recently from 8f86299 to 5fac350 Compare May 27, 2020 00:46
@tmatsuo
Copy link
Copy Markdown
Contributor Author

tmatsuo commented May 27, 2020

Rebased to the master, and experimenting Trampoline V2 with the py-3.7 build.
This should be full build and it'll take long time. I'll examine the invocation logs after the test finishes.

Fingers crossed 🤞

@tmatsuo
Copy link
Copy Markdown
Contributor Author

tmatsuo commented May 27, 2020

Doh. The regex part was wrong.

@tmatsuo tmatsuo changed the title testing: prototype trampoline_v2 testing: Migrate to trampoline_v2 May 27, 2020
@tmatsuo tmatsuo changed the title testing: Migrate to trampoline_v2 testing: migrate to trampoline_v2 May 27, 2020
@tmatsuo tmatsuo marked this pull request as ready for review May 27, 2020 18:22
@tmatsuo tmatsuo requested a review from a team as a code owner May 27, 2020 18:22
@tmatsuo tmatsuo requested review from busunkim96 and kurtisvg May 27, 2020 18:22
@tmatsuo
Copy link
Copy Markdown
Contributor Author

tmatsuo commented May 27, 2020

All the presubmit builds passed. This is amazing :)

However, I need to make some more changes.

Comment thread .kokoro/tests/run_tests.sh Outdated
Comment thread .kokoro/tests/run_tests.sh Outdated
Comment thread .kokoro/tests/run_tests.sh Outdated
Comment thread .kokoro/tests/run_tests.sh Outdated
Comment thread .kokoro/trampoline_v2.sh Outdated
Comment thread .kokoro/trampoline_v2.sh Outdated
Comment on lines 231 to 234
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why set during build-time? Why can't these just be set at execution time?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for a long explanation.

  1. To allow sudo we need a real username
  2. One of the tests (compute/oslogin iirc) needs a real username

So we need the real username for those reasons. So anyways we need to create a user and allow sudo at build time. Well, if we create a user, I can also add the user to the docker group.

Then we can pass --user={$user_uid}:${user_gid} which is better than --user={$user_uid}:${docker_gid}.

The files created by the former will be the same uid:gid as you, but the latter will create a file with different gid.

Usecase: we can use the script to generate the README:

scripts/run_tests_local.sh cdn readmegen

This will generate the README.rst in cdn directory with the same uid and gid as you.

Copy link
Copy Markdown
Contributor Author

@tmatsuo tmatsuo May 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So basically, the direct answer to your question is:

Yes, we can pass --user={$user_uid}:${docker_gid} and docker works, but some other tests will fail (compute/oslogin, and potentially other tests which need sudo access).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docker already has functionality for setting a default user/group as part of a Dockerfile. What do we gain by having trampoline manage this instead of letting the container do it itself?

I think my concern is that the container is supposed to be the encapsulation of the entire environment, including users/groups. I don't think it makes sense to have trampoline play a part in managing that. If folks want to have a different user/group by default, they should set that in their Dockerfile.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current code makes sure that, docker will run with the same uid:gid as the caller of the trampoline_v2.sh script. This is dynamic. When you run locally, it's going to be your usual uid:gid, when it's run in Kokoro, it's going to be uid:gid of kbuilder user, etc etc.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, to summarize:

  1. I want to use --user "${user_uid}:${user_gid}" for making sure the container will create files with the same uid:gid as the caller.
  2. It works for most of our tests, but some tests are failing.
  3. compute/oslogin: We need a real user name for this uid -> we need to create the user in the container OS. This is why we passing the build arg.
  4. docker in docker: We need to make sure the user in the container have a permission on /var/run/docker.sock. -> This is why we add our user to the group with the same id as the docker group in the host os.
  5. sudo: Some tests potentially need root access, so we need to add sudoers entry on the fly. Actually our run_tests.sh needs it for storing cloudsql_proxy to /bin.

So there are reasons. If you come up with a solution with the passing tests both for compute/oslogin and vision/automl, I'm open to such a solution.

You can download this branch and try out the script with the following command:

$ scripts/decrypt-secrets.sh  # Make sure we have the secrets
$ scripts/run_tests_local.sh compute/oslogin py-3.7
$ scripts/run_tests_local.sh vision/automl py-3.7
# or
$ RUN_TESTS_DIRS=compute/oslogin:vision/automl \
  TRAMPOLINE_BUILD_FILE=.kokoro/tests/run_tests.sh \
  .kokoro/trampoline_v2.sh

These commands start from building the Docker image, so you can test end 2 end for your proposed solution.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this also mean that it rebuilds (and potentially pushes a new image) every time a new user runs it?

Copy link
Copy Markdown
Contributor Author

@tmatsuo tmatsuo May 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it takes a while for the first run but after that they will be fast because docker pull can detect the local layers and docker build is using --cache-from. These command never upload to anywhere by default. You have to provide two envvars to use that option like this:

TRAMPOLINE_IMAGE=gcr.io/tmatsuo-test/my-python-multi \
  TRAMPOLINE_IMAGE_UPLOAD=true
  scripts/run_tests_local.sh compute/oslogin py-3.7

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having said that, the upload option is not intend to be used by humans.

My intention is to set that flag only in one of the periodic builds. When the tests are all green, it will push the built image. We won't need the Cloud Build hook any more (although it won't harm anything).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also hope that in the future, we can merge all the periodic builds into one. If we pass down the multiple nox sessions, it might be possible. The tests will take a very long time, so we may want to have a tool to run tests in parallel. I thought someone is developing such a tool.

Comment thread .kokoro/trampoline_v2.sh Outdated
@tmatsuo
Copy link
Copy Markdown
Contributor Author

tmatsuo commented May 28, 2020

@kurtisvg Thanks for the review!

I think I addressed your comments and answered your questions. PTAL

@tmatsuo
Copy link
Copy Markdown
Contributor Author

tmatsuo commented May 28, 2020

Don't worry about the Python 3.8 failure. That's the one I manually started and stopped once I confirmed the build cop bot is working.

@tmatsuo
Copy link
Copy Markdown
Contributor Author

tmatsuo commented May 29, 2020

lint and py-2.7 passed. I'll merge this now with the admin power.

@tmatsuo tmatsuo merged commit 89fa5fc into GoogleCloudPlatform:master May 29, 2020
@tmatsuo tmatsuo deleted the trampoline_v2 branch May 29, 2020 22:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla: yes This human has signed the Contributor License Agreement.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants