Reproducing Build Failures directly on the CI machines
If a CircleCI build fails, you can reproduce the failure by SSH-ing directly into the machine used on CI. This requires you to be in the pytorch team on GitHub.
To SSH in, you only need an SSH public key added to your GitHub user account and an SSH client that can use that key. For Windows builds, you need to use a personal access token instead.
To enable SSH on a given job, you can go to the CircleCI job you'd like to re-run, click on the rerun dropdown, and then click Rerun job with SSH.
After this, the job will restart, and there will be an extra step in the job, named "Enable SSH". Expanding this will reveal instructions and an IP address/port for connecting to the CI job.
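As a rough illustration of the connection step, the sketch below assembles the usual `ssh -p <port> <ip>` form from an IP address and port. The helper name `ci_ssh_command` and the sample address are hypothetical; the instructions printed by the "Enable SSH" step are authoritative and should be preferred if they differ.

```shell
# Hypothetical helper (not part of the CI tooling): prints an ssh command line
# for the IP address ($1) and port ($2) shown in the "Enable SSH" step.
ci_ssh_command() {
  printf 'ssh -p %s %s\n' "$2" "$1"
}

# Example with a placeholder address and port:
#   ci_ssh_command 203.0.113.7 64535
```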
Once the job has restarted, you may need to wait a bit before the Docker container where the CI will run is up. Once it is, you can run docker ps to get the name of the container, and then run docker exec -it <container-name> /bin/bash to drop into a shell inside the Docker container. From there you can experiment and reproduce the error.
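The two Docker commands above can be combined into one small function. This is a minimal sketch that assumes the container you want is the first one listed by `docker ps`; the function name `enter_ci_container` is hypothetical and not part of the CI scripts.

```shell
# Hypothetical helper: open a shell in the first running container.
enter_ci_container() {
  # --format '{{.Names}}' prints one container name per line.
  name="$(docker ps --format '{{.Names}}' | head -n 1)"
  if [ -z "$name" ]; then
    # The container may not be up yet right after the job restarts.
    echo "no running container found; wait a bit and retry" >&2
    return 1
  fi
  docker exec -it "$name" /bin/bash
}
```

If more than one container is running, check the output of docker ps yourself and pass the right name to docker exec instead.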
Windows CircleCI jobs do not use Docker containers. Instead, you need to run a batch script to bring up the environment. After you SSH into the machine, run:
> cd project # This is where the PyTorch source tree is found
> .\build\win_tmp\ci_scripts\pytorch_env_restore.bat # This file may not appear until the CI job has been running for several minutes
Now you can rerun tests to reproduce errors.
The Windows jobs come with vim and nano installed, but they don't seem to work properly. Instead, you can run vim from a local terminal and edit files remotely through scp. The IP address and port number used here should be the same as those used to SSH into the CI job.
> vim scp://{IP address}:{port number}//Users/circleci/project/{path to file}
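To avoid retyping the URL for each file, the pattern above can be wrapped in a small local helper. This is a sketch only: the function name `ci_scp_url`, the sample address, and the example file path are hypothetical, and the `/Users/circleci/project` prefix is taken from the command above.

```shell
# Hypothetical helper: builds the scp:// URL for a file in the CI checkout.
# $1 = IP address, $2 = port number, $3 = path relative to the project root
ci_scp_url() {
  printf 'scp://%s:%s//Users/circleci/project/%s\n' "$1" "$2" "$3"
}

# Example, run on your local machine (placeholder address and file):
#   vim "$(ci_scp_url 203.0.113.7 64535 test/test_torch.py)"
```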