Add hardware-enabled copilot runner #399

Open
JoseSantosAMD wants to merge 6 commits into main from JoseSantosAMD/copilot_enabled_runner

@JoseSantosAMD
Collaborator

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

JoseSantosAMD and others added 4 commits February 25, 2026 14:16
- Introduced multiple scripts for managing GitHub Actions runners within an Apptainer container, including build, run, and cleanup functionalities.
- Added comprehensive documentation (README.md, AGENTS.md, skills.md) detailing setup, usage, and environment configuration.
- Included example environment files for both runner and container configurations to guide users in setting up their environments securely and effectively.

This update establishes a robust framework for running GitHub Actions in HPC environments, enhancing usability and maintainability.
- Introduced a new workflow file to automate the setup of a Python virtual environment for Copilot.
- The workflow includes steps for checking out the repository, creating and activating a virtual environment, installing dependencies, and verifying ROCm and GPU visibility.
- This addition enhances the CI/CD process by streamlining the environment setup for Copilot integration.
Copilot AI review requested due to automatic review settings February 26, 2026 16:17
github-actions bot added the in-progress (We are working on it) and iris (Iris project issue) labels on Feb 26, 2026
@JoseSantosAMD
Collaborator Author

@copilot run example 07_gemm_all_scatter and report back the teraflops.

Contributor

Copilot AI commented Feb 26, 2026

@JoseSantosAMD I've opened a new pull request, #400, to work on those changes. Once the pull request is ready, I'll request review from you.

Contributor

Copilot AI left a comment

Pull request overview

Adds infrastructure to run Iris workflows on an AMD/ROCm-capable self-hosted “copilot” GitHub Actions runner, with Apptainer + SLURM scripts to build/run/manage the runner container.

Changes:

  • Introduces a copilot-setup-steps workflow to bootstrap a Python venv and sanity-check ROCm visibility on the self-hosted runner.
  • Adds Apptainer/SLURM runner lifecycle scripts (build, run, start) plus configuration templates.
  • Adds documentation for setup/usage/troubleshooting of the coding-agent runner environment.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| .github/workflows/copilot-setup-steps.yml | Manual workflow to prepare venv and verify ROCm tooling on the self-hosted runner. |
| .github/scripts/github-runner-files/start.sh | Container entrypoint that installs/configures the Actions runner and starts the listener. |
| .github/scripts/github-runner-files/run-github-coding-agent-runner.sh | SLURM/standalone launcher to execute the runner inside an Apptainer container. |
| .github/scripts/github-runner-files/build-github-coding-agent-runner.sh | SLURM build script to produce the runner container SIF image. |
| .github/scripts/github-runner-files/cleanup-old-runners.sh | Utility to remove stale runner configs/overlay artifacts. |
| .github/scripts/github-runner-files/runner-container.env.example | Template for container environment overrides (paths/options). |
| .github/scripts/github-runner-files/env.example | Template for exporting runner-related env vars (token, repo, labels). |
| .github/scripts/github-runner-files/README.md | End-to-end documentation for building and running the self-hosted runner. |
| .github/scripts/github-runner-files/skills.md | Additional build/run notes (SLURM-focused). |
| .github/scripts/github-runner-files/AGENTS.md | Local conventions for editing these scripts (no secrets, no host-specific paths). |

# Point git config to a writable location (can be overridden by env file)
export GIT_CONFIG_GLOBAL="${GIT_CONFIG_GLOBAL:-${RUNNER_WORKDIR}/.gitconfig}"
mkdir -p "$(dirname "$GIT_CONFIG_GLOBAL")"
git config --global --add safe.directory '*'

Copilot AI Feb 26, 2026

git config --global --add safe.directory '*' disables Git's ownership safety checks for all directories. On a self-hosted runner (especially if it can run arbitrary PR code), this broad trust is a security risk. Prefer adding only the specific workspace directory (or runner work subdirectory) that needs to be treated as safe instead of a wildcard.

Suggested change
- git config --global --add safe.directory '*'
+ if [ -n "${RUNNER_WORKDIR:-}" ]; then
+     git config --global --add safe.directory "${RUNNER_WORKDIR}"
+ fi

Copilot uses AI. Check for mistakes.

set -e

WORK_DIR="${WORK:-/work1/amd/josantos}"

Copilot AI Feb 26, 2026

This hardcodes a host-specific absolute path (/work1/amd/josantos) as the default WORK_DIR, which conflicts with the repo’s stated convention to avoid machine-specific paths (see .github/scripts/github-runner-files/AGENTS.md:17-21). Please remove the hardcoded default and rely on env (e.g. $WORK, $HOME) or require the caller to pass a base directory.

Suggested change
- WORK_DIR="${WORK:-/work1/amd/josantos}"
+ WORK_DIR="${WORK:-$HOME}"
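
The comment's other option, requiring the caller to supply a base directory, could look roughly like the sketch below. `resolve_work_dir` is a hypothetical helper, not part of the PR, and the lookup order (explicit argument, then `$WORK`, then `$HOME`) is an assumption.

```shell
#!/usr/bin/env bash
# Sketch: resolve the base directory without a host-specific default.
# resolve_work_dir is a hypothetical helper, not taken from the PR.

resolve_work_dir() {
    # Assumed lookup order: explicit argument, then $WORK, then $HOME.
    local dir="${1:-${WORK:-${HOME:-}}}"
    if [ -z "$dir" ]; then
        echo "error: pass a base directory or set WORK or HOME" >&2
        return 1
    fi
    printf '%s\n' "$dir"
}

# Intended use near the top of the script (not executed here):
#   WORK_DIR="$(resolve_work_dir "$@")" || exit 1
```

Failing loudly when nothing is set keeps a machine-specific fallback out of the repository, at the cost of one extra step for callers whose environment defines neither variable.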

Comment on lines +182 to +185
--bind "${RUNNER_HOME_HOST}:/runner-home:rw" \
--bind "${RUNNER_WORKDIR}:${RUNNER_WORKDIR}" \
${RUNNER_TMP_BIND:+"$RUNNER_TMP_BIND"} \
--env "RUNNER_HOME=/runner-home" \

Copilot AI Feb 26, 2026

The optional /tmp bind is passed as a single argument because RUNNER_TMP_BIND contains spaces and is expanded with quotes (${RUNNER_TMP_BIND:+"$RUNNER_TMP_BIND"}). Apptainer expects --bind and its value as separate argv tokens (or --bind=...), so this likely won't bind /tmp as intended. Consider using an argv array (or conditionally appending --bind "${RUNNER_TMP}:/tmp:rw") instead of building a spaced string.
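
A minimal sketch of the array approach, assuming the launcher is a bash script. `RUNNER_TMP` is the name used in the comment; `build_bind_args`, `BIND_ARGS`, and the example paths are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: collect Apptainer bind flags in a bash array so that an optional
# bind contributes either two argv tokens ("--bind" and its value) or none.
# build_bind_args and BIND_ARGS are hypothetical names, not part of the PR.

build_bind_args() {
    BIND_ARGS=(
        --bind "${RUNNER_HOME_HOST}:/runner-home:rw"
        --bind "${RUNNER_WORKDIR}:${RUNNER_WORKDIR}"
    )
    # Append the /tmp bind only when RUNNER_TMP is set and non-empty.
    if [ -n "${RUNNER_TMP:-}" ]; then
        BIND_ARGS+=(--bind "${RUNNER_TMP}:/tmp:rw")
    fi
}

# Example values for illustration only.
RUNNER_HOME_HOST="/scratch/runner-home"
RUNNER_WORKDIR="/scratch/runner-work"
RUNNER_TMP="/scratch/runner tmp"   # a space in the path stays inside one token

build_bind_args
# Print one argv token per line to show how the flags are split.
printf '%s\n' "${BIND_ARGS[@]}"
```

The launcher would then pass `"${BIND_ARGS[@]}"` to `apptainer exec` in place of the individual `--bind` lines, so the flag and its value always arrive as separate tokens.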

Comment on lines +160 to +164
if [ -n "${USE_OVERLAY}" ] && [ "${USE_OVERLAY}" != "0" ] && [ -d "${OVERLAY_DIR}" ]; then
    apptainer exec \
        --no-home \
        --overlay "${OVERLAY_DIR}" \
        --bind "${SCRIPT_DIR}:/runner-scripts:ro" \

Copilot AI Feb 26, 2026

--overlay "${OVERLAY_DIR}" is being given a directory created via mkdir -p, but elsewhere in the repo overlays are created as an image file with apptainer overlay create (e.g. .github/scripts/container_exec.sh:71-79 uses iris_overlay_...img). Passing an empty directory here is unlikely to be a valid overlay and can cause the runner container startup to fail. Consider creating an overlay image file under RUNNER_BASE (per job) with apptainer overlay create ... overlay.img, pass that file to --overlay, and clean it up when the runner stops.
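
A rough sketch of the per-job overlay image described above, assuming a bash launcher. `RUNNER_BASE` comes from the comment; the function names, the `overlay_<job>.img` naming, and the 1024 MiB default are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: create a per-job overlay image file rather than passing a directory.
# Function names, file naming and the default size are assumptions.

create_job_overlay() {
    local base="$1" job_id="$2" size_mb="${3:-1024}"
    local img="${base}/overlay_${job_id}.img"
    mkdir -p "$base"
    # `apptainer overlay create --size N FILE` writes an ext3 image of N MiB.
    if [ ! -f "$img" ]; then
        apptainer overlay create --size "$size_mb" "$img" >&2 || return 1
    fi
    # Print the image path so the caller can pass it to --overlay.
    printf '%s\n' "$img"
}

remove_job_overlay() {
    rm -f -- "$1"
}

# Intended use inside the launcher (not executed here):
#   OVERLAY_IMG="$(create_job_overlay "$RUNNER_BASE" "${SLURM_JOB_ID:-$$}")" || exit 1
#   trap 'remove_job_overlay "$OVERLAY_IMG"' EXIT
#   apptainer exec --overlay "$OVERLAY_IMG" ...
```

Keying the file name on the job ID keeps concurrent runners from sharing an image, and the `trap` removes it when the launcher exits.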

Comment on lines +111 to +112
echo "Found existing runner configuration, removing..."
./config.sh remove --token "$REGISTRATION_TOKEN" || true

Copilot AI Feb 26, 2026

When an existing .runner config is present, config.sh remove is invoked with the registration token. The script later fetches a dedicated remove token (.../actions/runners/remove-token) for cleanup, which suggests the registration token may be the wrong credential for removal. If the remove step fails (currently ignored), the local .runner file can remain and config.sh may refuse to configure. Fetch and use a remove token for this removal (or ensure local config files are removed on failure) before calling config.sh --replace.

Suggested change
- echo "Found existing runner configuration, removing..."
- ./config.sh remove --token "$REGISTRATION_TOKEN" || true
+ echo "Found existing runner configuration, requesting remove token..."
+ PRE_REMOVE_RESPONSE=$(curl -s -X POST \
+     -H "Authorization: token $GITHUB_TOKEN" \
+     -H "Accept: application/vnd.github.v3+json" \
+     "https://api.github.com/repos/$GITHUB_REPOSITORY/actions/runners/remove-token")
+ if command -v jq >/dev/null 2>&1; then
+     PRE_REMOVE_TOKEN=$(echo "$PRE_REMOVE_RESPONSE" | jq -r .token)
+ else
+     # Fallback without jq; tolerate optional whitespace after the colon.
+     PRE_REMOVE_TOKEN=$(echo "$PRE_REMOVE_RESPONSE" | grep -o '"token": *"[^"]*"' | head -1 | cut -d'"' -f4)
+ fi
+ if [ "$PRE_REMOVE_TOKEN" != "null" ] && [ -n "$PRE_REMOVE_TOKEN" ]; then
+     echo "Removing existing runner with remove token..."
+     if ! ./config.sh remove --token "$PRE_REMOVE_TOKEN"; then
+         echo "Warning: Failed to remove existing runner with remove token; deleting local .runner config."
+         rm -f ".runner"
+     fi
+ else
+     echo "Warning: Could not obtain remove token for existing runner; deleting local .runner config."
+     rm -f ".runner"
+ fi

echo "Press Ctrl+C to stop"
echo "=========================================="
command -v rocminfo >/dev/null 2>&1 && rocminfo || true
./run.sh

Copilot AI Feb 26, 2026

GITHUB_TOKEN (a long-lived PAT with repo scope) is passed into the container and remains in the environment when ./run.sh starts, so every workflow job executed on this runner can read and exfiltrate it. An attacker who can trigger a workflow on this self-hosted runner (for example via a pull request workflow) could simply print $GITHUB_TOKEN in a job step and gain full API access to the repository outside of Actions. To avoid leaking this secret, keep using it only within start.sh/cleanup to obtain registration/remove tokens but ensure it is not present in the runner job environment (for example by removing or masking GITHUB_TOKEN from the environment before invoking ./run.sh or otherwise isolating the PAT from workflow steps).
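
One possible shape for the isolation suggested above, as a sketch only. `run_without_pat` is a hypothetical helper; `GITHUB_TOKEN` and `./run.sh` are the names used in the comment.

```shell
#!/usr/bin/env bash
# Sketch: start a command without the PAT in its environment.
# run_without_pat is a hypothetical helper, not part of the PR.

run_without_pat() {
    # The subshell keeps the unset local to the launched command, so cleanup
    # code in the calling script can still read GITHUB_TOKEN afterwards.
    (
        unset GITHUB_TOKEN
        exec "$@"
    )
}

# Intended use at the end of start.sh (not executed here):
#   run_without_pat ./run.sh
```

This only removes the variable from the environment the listener inherits. It does not cover copies of the token stored elsewhere, for example in an env file that is readable from inside the container.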

@mawad-amd mentioned this pull request on Feb 26, 2026
1 task
@mawad-amd changed the title from "Jose santos amd/copilot enabled runner" to "Add hardware-enabled copilot runner" on Feb 26, 2026
@mawad-amd
Collaborator

@copilot can you run example 7 for 10 different GEMM shapes and report back the results?

Contributor

Copilot AI commented Feb 26, 2026

@mawad-amd I've opened a new pull request, #401, to work on those changes. Once the pull request is ready, I'll request review from you.

Collaborator

@mawad-amd mawad-amd left a comment

Not sure whether we need to merge first before testing, so feel free to merge this.

- Add issue_comment trigger with types created and edited.
- Gate the job to PR comments containing @copilot.
- Keep the manual workflow_dispatch trigger.
- Target self-hosted runner labels: copilot, apptainer, iris.
@JoseSantosAMD
Collaborator Author

@copilot can you run example 7 for 10 different GEMM shapes and report back the results? @copilot

Contributor

Copilot AI commented Feb 26, 2026

@JoseSantosAMD I've opened a new pull request, #402, to work on those changes. Once the pull request is ready, I'll request review from you.

@JoseSantosAMD
Collaborator Author

@copilot please run setup and can you run example 7 for 10 different GEMM shapes and report back the results? @copilot

Contributor

Copilot AI commented Feb 26, 2026

@JoseSantosAMD I've opened a new pull request, #404, to work on those changes. Once the pull request is ready, I'll request review from you.

Labels

in-progress (We are working on it), iris (Iris project issue)

4 participants