From dfc6722f2ab0e3efb65ce5b49449a2a8b14a26b7 Mon Sep 17 00:00:00 2001
From: Stella Biderman
Date: Sun, 26 May 2024 17:46:02 -0400
Subject: [PATCH 1/4] Fix markdown formatting error (#1217)

* Update README.md

* Update NeoXArgs docs automatically

* Update NeoXArgs docs automatically

---------

Co-authored-by: github-actions
Co-authored-by: Quentin Anthony
---
 configs/README.md         | 2 +-
 configs/neox_arguments.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/configs/README.md b/configs/README.md
index d8ae81739..e14274b56 100644
--- a/configs/README.md
+++ b/configs/README.md
@@ -9,7 +9,7 @@ Below is an example configuration `.yaml` to train a ~160M parameter GPT model.

 For a detailed list of all the arguments available for neox, see [neox_arguments.md](neox_arguments.md)

-Note: yaml arguments may be formatted with either '-' or '_'. The standard separator used is a '_' as shown in the example configurations below. However, the use of '-' as a separator may be deprecated in the future.
+Note: yaml arguments may be formatted with either '-' or '\_'. The standard separator used is a '\_' as shown in the example configurations below. However, the use of '-' as a separator may be deprecated in the future.
 ```yaml
 # GPT-3 pretraining setup
 {
diff --git a/configs/neox_arguments.md b/configs/neox_arguments.md
index c6d369524..306a0da5d 100644
--- a/configs/neox_arguments.md
+++ b/configs/neox_arguments.md
@@ -111,7 +111,7 @@ Logging Arguments

 - **git_hash**: str

-    Default = 1b85a2f
+    Default = a3fb470

     current git hash of repository


From b5c0afe42851dc4878aaa1b56021dc71b3ee435d Mon Sep 17 00:00:00 2001
From: Jacob Hatef <74274091+jahatef@users.noreply.github.com>
Date: Tue, 4 Jun 2024 13:22:14 -0400
Subject: [PATCH 2/4] add workflow_dispatch to gh actions pr so we can run on command (#1233)

* add workflow_dispatch to gh actions pr so we can run on command

* Update NeoXArgs docs automatically

---------

Co-authored-by: github-actions
---
 .github/workflows/pull_request.yml | 2 +-
 configs/neox_arguments.md          | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/pull_request.yml b/.github/workflows/pull_request.yml
index 3213718df..a2b1a2fc2 100644
--- a/.github/workflows/pull_request.yml
+++ b/.github/workflows/pull_request.yml
@@ -1,6 +1,6 @@
 name: Pull Request

-on: [pull_request]
+on: [pull_request, workflow_dispatch]

 jobs:
   pre-commit:
diff --git a/configs/neox_arguments.md b/configs/neox_arguments.md
index 306a0da5d..c60d1e15f 100644
--- a/configs/neox_arguments.md
+++ b/configs/neox_arguments.md
@@ -111,7 +111,7 @@ Logging Arguments

 - **git_hash**: str

-    Default = a3fb470
+    Default = 516169c

     current git hash of repository


From 4a34e0a565f19a8578210654afcb3bb835fcc35e Mon Sep 17 00:00:00 2001
From: jaimemcc <99298642+jaimemcc-intel@users.noreply.github.com>
Date: Wed, 5 Jun 2024 15:26:54 -0700
Subject: [PATCH 3/4] init changes to README (#1232)

* init changes to README

* Update NeoXArgs docs automatically

* Update README.md

* Update NeoXArgs docs automatically

* Update README.md

* Update NeoXArgs docs automatically

* Update NeoXArgs docs automatically

---------

Co-authored-by: github-actions
Co-authored-by: Quentin Anthony
---
 configs/neox_arguments.md |  2 +-
 tests/README.md           | 77 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 77 insertions(+), 2 deletions(-)

diff --git a/configs/neox_arguments.md b/configs/neox_arguments.md
index c60d1e15f..f6c3ecde3 100644
--- a/configs/neox_arguments.md
+++ b/configs/neox_arguments.md
@@ -111,7 +111,7 @@ Logging Arguments

 - **git_hash**: str

-    Default = 516169c
+    Default = 7aa0074

     current git hash of repository

diff --git a/tests/README.md b/tests/README.md
index 316096cc5..390a52898 100644
--- a/tests/README.md
+++ b/tests/README.md
@@ -32,7 +32,7 @@ pytest --forked tests/model/test_model_generation.py

 Some tests can run on cpu only. These are marked with the decorator @pytest.mark.cpu.
 The test cases for cpu can be run with:
-````
+```
 pytest tests -m cpu
 ```

@@ -49,3 +49,78 @@ if You see this kind of error:
 RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
 ```
 It usually means that you used some pytorch.cuda function before the test creates the processes. However just importing `from torch.utils import cpp_extension` can also trigger this.
+
+
+## CPU Test Integration
+
+Tests can be run against physical CPUs through GitHub Actions. To have tests run on the physical CPU test, here is generally how the CI should be written:
+
+### runs-on
+
+The CI needs to be written to target the CPU Github Action runner. The jobs that need to run on CPU should use the hardware runner's labels:
+```yaml
+jobs:
+  cpu-test-job:
+    runs-on: [ 'self-hosted', 'aws', 'test'] # these labels tell GitHub to execute on the runner with the 'aws' and 'test' labels
+```
+
+### Software dependencies
+
+Hardware tests that need python and docker should install them as part of the test execution to make sure the tests run as expected:
+```yaml
+steps:
+  # sample syntax to setup python with pip
+  - uses: actions/setup-python@v4
+    with:
+      python-version: "3.8"
+      cache: "pip"
+
+  # sample setup of docker (there's no official Docker setup action)
+  - name: Docker setup
+    run: | # taken from Docker's installation page: https://docs.docker.com/engine/install/ubuntu/
+      # Add Docker's official GPG key:
+      sudo apt-get update
+      sudo apt-get install ca-certificates curl
+      sudo install -m 0755 -d /etc/apt/keyrings
+      sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
+      sudo chmod a+r /etc/apt/keyrings/docker.asc
+      # Add the repository to Apt sources:
+      echo \
+        "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
+        $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
+        sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
+      sudo apt-get update
+      sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -y
+```
+
+Any other software dependencies should be assumed to be missing and installed as part of the CI.
+
+### Using Docker image
+
+Using the Docker image and running tests in a container is recommended to resolve environment issues. There is a modified docker-compose.yml in tests/cpu_tests directory that is recommended to be used for CPU tests:
+
+```bash
+cp tests/cpu_tests/docker-compose.yml .
+# export any env variables here that should be used:
+export NEOX_DATA_PATH='./data/enwik8'
+docker compose run -d --build --name $CONTAINER gpt-neox tail -f /dev/null
+# then can set up and run tests in the container using docker exec
+docker exec $CONTAINER pip install -r /workspace/requirements-dev.txt
+# etc.
+# please clean up the container as part of the CI:
+docker rm $CONTAINER
+```
+
+At the time of writing there is no built-in method to provide an offline-built Docker image to `jobs..container`.
+
+### Using existing CPU test CI
+
+There is an existing CPU test workflow that can be included in existing CI:
+
+```yaml
+steps:
+  - name: Run CPU Tests
+    uses:
+      target_test_ref: $GITHUB_REF # replace with the ref/SHA that the tests should be run on
+      # have a look at the reusable workflow here: https://github.com/EleutherAI/gpt-neox/blob/main/tests/cpu_tests/action.yml
+```

From 90a6cdb35f11d3a1892da4cb242c6a2576bcfb6a Mon Sep 17 00:00:00 2001
From: dmahan93 <44207705+dmahan93@users.noreply.github.com>
Date: Thu, 6 Jun 2024 20:24:49 -0500
Subject: [PATCH 4/4] fix summed biases not being divided by mp size (#1220)

---
 tools/ckpts/convert_hf_to_sequential.py | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/tools/ckpts/convert_hf_to_sequential.py b/tools/ckpts/convert_hf_to_sequential.py
index c53f28391..55cfc6517 100644
--- a/tools/ckpts/convert_hf_to_sequential.py
+++ b/tools/ckpts/convert_hf_to_sequential.py
@@ -119,16 +119,27 @@ def shard_sequential_mp(num_mp_ranks, sequential):
     ranks = {x: dict() for x in range(num_mp_ranks)}
     for k, v in sequential.items():
         if reduce(
+            np.logical_or,
+            [
+                x in k
+                for x in [
+                    "dense_4h_to_h.bias",
+                    "attention.dense.bias",
+                ]
+            ],
+        ):
+            # Divide by tp_size since they get added together
+            for x in range(num_mp_ranks):
+                ranks[x][k] = v / num_mp_ranks
+        elif reduce(
             np.logical_or,
             [
                 x in k
                 for x in [
                     "layernorm",
                     "rotary_emb",
-                    "dense_4h_to_h.bias",
                     "norm.weight",
                     "norm.bias",
-                    "attention.dense.bias",
                 ]
             ],
         ):