Reproducing Build Failures directly on the CI machines

kurtamohler edited this page Aug 1, 2020 · 9 revisions

Introduction

If a CircleCI build fails, you can reproduce the failure by SSH-ing directly into the machine that ran the job on CI. This requires you to be in the pytorch team on GitHub.

To SSH in, you need an SSH public key added to your GitHub user account and an SSH client that can use the matching private key. For Windows builds, you must use a personal access token instead.

To enable SSH on a given job, open the CircleCI job you'd like to re-run, click the rerun dropdown, and then click "Rerun job with SSH".

After this, the job will restart with an extra step named "Enable SSH". Expanding this step reveals instructions and an IP address/port for connecting to the CI job.
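
As a minimal sketch of the connection step, assuming the "Enable SSH" step gives you an IP address and a port: the helper name ci_ssh_command is hypothetical, and the address and port below are placeholders, not values from a real job. The helper only prints the command so you can check it before running it.

```shell
#!/bin/sh
# Hypothetical helper: print the ssh command for a CI job.
#   $1 = IP address, $2 = port (both copied from the "Enable SSH" step).
ci_ssh_command() {
    ip="$1"
    port="$2"
    # -p selects the non-default port shown in the job output.
    printf 'ssh -p %s %s\n' "$port" "$ip"
}

# Placeholder values only (203.0.113.10 is a reserved documentation address).
ci_ssh_command 203.0.113.10 64535
```

If your GitHub key is not your SSH client's default identity, you would likely also need to pass it explicitly with the client's -i option.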

Getting into the Docker container

Once the job has restarted, you may need to wait a bit before the Docker container in which the CI runs is up. Once it is, run docker ps to get the name of the container, and then run docker exec -it <container-name> /bin/bash to drop into a shell inside the Docker container. From there you can experiment and reproduce the error.
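
The two commands above can be combined into a small helper. This is a sketch only: the function name enter_ci_container is hypothetical, and it assumes the CI container is the first one listed by docker ps, which may not hold if several containers are running.

```shell
#!/bin/sh
# Hypothetical helper: open a shell in the first running container.
# Assumption: the CI container is the first entry that `docker ps` lists.
enter_ci_container() {
    # --format '{{.Names}}' prints only the container names, one per line.
    name="$(docker ps --format '{{.Names}}' | head -n 1)"
    if [ -z "$name" ]; then
        # The container may not be up yet right after the job restarts.
        echo "No running container found; wait a bit and try again." >&2
        return 1
    fi
    # -i keeps stdin open and -t allocates a terminal for the interactive shell.
    docker exec -it "$name" /bin/bash
}
```

Run enter_ci_container after SSH-ing into the machine. If it reports that no container is running, wait a little and call it again.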

Windows

Windows CircleCI jobs do not use Docker containers. Instead, you need to run a batch script to set up the environment. After you SSH into the machine, run:

> cd project  # This is where the PyTorch source tree is found
> .\build\win_tmp\ci_scripts\pytorch_env_restore.bat  # This file may not appear until the CI job has been running for several minutes

Now you can rerun tests to reproduce errors.

The Windows jobs come with vim and nano installed, but they don't seem to work properly. Instead, you can run vim from a local terminal to edit files over scp. Use the same IP address and port number that you used to SSH into the CI job.

> vim scp://{IP address}:{port number}//Users/circleci/project/{path to file}