Merge branch 'main' into fix-pp1
Quentin-Anthony committed Jun 7, 2024
2 parents 56aa2ba + 90a6cdb commit 8451671
Showing 5 changed files with 92 additions and 6 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/pull_request.yml
@@ -1,6 +1,6 @@
name: Pull Request

-on: [pull_request]
+on: [pull_request, workflow_dispatch]

jobs:
  pre-commit:
2 changes: 1 addition & 1 deletion configs/README.md
@@ -9,7 +9,7 @@ Below is an example configuration `.yaml` to train a ~160M parameter GPT model.

For a detailed list of all the arguments available for neox, see [neox_arguments.md](neox_arguments.md)

-Note: yaml arguments may be formatted with either '-' or '_'. The standard separator used is a '_' as shown in the example configurations below. However, the use of '-' as a separator may be deprecated in the future.
+Note: yaml arguments may be formatted with either '-' or '\_'. The standard separator used is a '\_' as shown in the example configurations below. However, the use of '-' as a separator may be deprecated in the future.
```yaml
# GPT-3 pretraining setup
{
2 changes: 1 addition & 1 deletion configs/neox_arguments.md
@@ -111,7 +111,7 @@ Logging Arguments

- **git_hash**: str

-Default = abe5c99
+Default = 7aa0074

current git hash of repository

77 changes: 76 additions & 1 deletion tests/README.md
@@ -32,7 +32,7 @@ pytest --forked tests/model/test_model_generation.py

Some tests can run on CPU only. These are marked with the decorator `@pytest.mark.cpu`.
The CPU test cases can be run with:
-````
+```
pytest tests -m cpu
```
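
As a brief illustration (the test name and body below are hypothetical, not taken from this repo), a CPU-only test simply carries the `cpu` marker so that `-m cpu` selects it:

```python
import pytest


@pytest.mark.cpu
def test_vocab_padding_math():
    # Purely CPU-side arithmetic: pad a vocab size up to a multiple of 128.
    vocab_size, multiple = 50257, 128
    padded = ((vocab_size + multiple - 1) // multiple) * multiple
    assert padded % multiple == 0 and padded >= vocab_size
```

Running `pytest tests -m cpu` then collects only tests carrying this marker.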

@@ -49,3 +49,78 @@ If you see this kind of error:
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
```
It usually means that some `torch.cuda` function was used before the test created its processes. However, just importing `from torch.utils import cpp_extension` can also trigger this.
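
For context, here is a minimal sketch (not code from this repo) of the pattern that avoids the error: keep CUDA initialization out of the parent process, or use the `spawn` start method so each child gets a fresh interpreter:

```python
import multiprocessing as mp


def worker(rank):
    # Import and touch CUDA only inside the child, after it has started.
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"rank {rank} running on {device}")


if __name__ == "__main__":
    # 'spawn' gives each child a fresh interpreter, so any CUDA state
    # initialized in the parent is never inherited through fork().
    mp.set_start_method("spawn", force=True)
    procs = [mp.Process(target=worker, args=(i,)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```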


## CPU Test Integration

Tests can be run against physical CPUs through GitHub Actions. To have tests run on the physical CPU runners, the CI should generally be written as follows:

### runs-on

The CI needs to target the CPU GitHub Actions runner. Jobs that need to run on CPU should use the hardware runner's labels:
```yaml
jobs:
  cpu-test-job:
    runs-on: [ 'self-hosted', 'aws', 'test' ] # these labels tell GitHub to execute on the runner with the 'aws' and 'test' labels
```

### Software dependencies

Hardware tests that need Python and Docker should install them as part of the test execution to make sure the tests run as expected:
```yaml
steps:
  # sample syntax to set up Python with pip
  - uses: actions/setup-python@v4
    with:
      python-version: "3.8"
      cache: "pip"

  # sample setup of Docker (there's no official Docker setup action)
  - name: Docker setup
    run: | # taken from Docker's installation page: https://docs.docker.com/engine/install/ubuntu/
      # Add Docker's official GPG key:
      sudo apt-get update
      sudo apt-get install ca-certificates curl
      sudo install -m 0755 -d /etc/apt/keyrings
      sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
      sudo chmod a+r /etc/apt/keyrings/docker.asc
      # Add the repository to Apt sources:
      echo \
        "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
        $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
        sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
      sudo apt-get update
      sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -y
```

Any other software dependencies should be assumed to be missing and installed as part of the CI.

### Using Docker image

Using the Docker image and running tests in a container is recommended to avoid environment issues. A modified `docker-compose.yml` in the `tests/cpu_tests` directory is recommended for CPU tests:

```bash
cp tests/cpu_tests/docker-compose.yml .
# export any env variables here that should be used:
export NEOX_DATA_PATH='./data/enwik8'
export CONTAINER=neox-cpu-tests # example name; pick any unused container name
docker compose run -d --build --name $CONTAINER gpt-neox tail -f /dev/null
# then set up and run tests in the container using docker exec
docker exec $CONTAINER pip install -r /workspace/requirements-dev.txt
# etc.
# please clean up the container as part of the CI:
docker rm -f $CONTAINER # -f because the container is still running
```

At the time of writing there is no built-in method to provide an offline-built Docker image to `jobs.<job-id>.container`.

### Using existing CPU test CI

There is an existing CPU test workflow that can be included in existing CI:

```yaml
steps:
  - name: Run CPU Tests
    uses: EleutherAI/gpt-neox/tests/cpu_tests@main # path assumed from the action.yml linked below; pin to the desired ref
    with:
      target_test_ref: $GITHUB_REF # replace with the ref/SHA that the tests should be run on
# have a look at the reusable workflow here: https://github.com/EleutherAI/gpt-neox/blob/main/tests/cpu_tests/action.yml
```
15 changes: 13 additions & 2 deletions tools/ckpts/convert_hf_to_sequential.py
@@ -119,16 +119,27 @@ def shard_sequential_mp(num_mp_ranks, sequential):
    ranks = {x: dict() for x in range(num_mp_ranks)}
    for k, v in sequential.items():
        if reduce(
            np.logical_or,
            [
                x in k
                for x in [
                    "dense_4h_to_h.bias",
                    "attention.dense.bias",
                ]
            ],
        ):
            # Divide by tp_size since they get added together
            for x in range(num_mp_ranks):
                ranks[x][k] = v / num_mp_ranks
        elif reduce(
            np.logical_or,
            [
                x in k
                for x in [
                    "layernorm",
                    "rotary_emb",
                    "dense_4h_to_h.bias",
                    "norm.weight",
                    "norm.bias",
                    "attention.dense.bias",
                ]
            ],
        ):
