Commit

Merge branch 'master' into bugfix/13285_disable_non_blocking_to_device_mps
justusschock committed Aug 25, 2022
2 parents a89eaeb + 424e78d commit 39af0b9
Showing 9 changed files with 85 additions and 43 deletions.
3 changes: 2 additions & 1 deletion .azure/gpu-tests.yml
@@ -73,7 +73,8 @@ jobs:
python -c "fname = 'requirements/pytorch/strategies.txt' ; lines = [line for line in open(fname).readlines() if 'horovod' not in line] ; open(fname, 'w').writelines(lines)"
python -c "fname = 'requirements/pytorch/strategies.txt' ; lines = [line for line in open(fname).readlines() if 'bagua' not in line] ; open(fname, 'w').writelines(lines)"
CUDA_VERSION_MM=$(python -c "import torch ; print(''.join(map(str, torch.version.cuda.split('.')[:2])))")
-pip install "bagua-cuda$CUDA_VERSION_MM>=0.9.0"
+CUDA_VERSION_BAGUA=$(python -c "print([ver for ver in [115,113,111,102] if $CUDA_VERSION_MM >= ver][0])")
+pip install "bagua-cuda$CUDA_VERSION_BAGUA>=0.9.0"
pip install -e .[strategies]
pip install -U deepspeed # TODO: remove when docker images are upgraded
pip install --requirement requirements/pytorch/devel.txt
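
The fallback added in this hunk appears to pick the newest Bagua CUDA build that does not exceed the installed CUDA version. A minimal Python sketch of that selection follows. The list `[115, 113, 111, 102]` is copied from the diff; the helper names `cuda_version_mm` and `pick_bagua_cuda` are hypothetical and exist only for illustration.

```python
# Supported Bagua CUDA builds, newest first (copied from the diff).
SUPPORTED_BAGUA_CUDA = [115, 113, 111, 102]


def cuda_version_mm(version: str) -> int:
    """Turn a CUDA version string such as '11.6' into the integer 116."""
    return int("".join(version.split(".")[:2]))


def pick_bagua_cuda(installed: int, supported=SUPPORTED_BAGUA_CUDA) -> int:
    """Return the newest supported build that is <= the installed version.

    Hypothetical helper: the one-liner in the diff indexes [0] directly and
    would raise IndexError if no build qualifies; here that case is explicit.
    """
    candidates = [ver for ver in supported if installed >= ver]
    if not candidates:
        raise ValueError(f"no Bagua build for CUDA {installed}")
    return candidates[0]


if __name__ == "__main__":
    print(pick_bagua_cuda(cuda_version_mm("11.6")))
```

The choice relies on the list being sorted newest first, so the first match is the closest build at or below the installed version.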
41 changes: 28 additions & 13 deletions .github/checkgroup.yml
@@ -1,20 +1,14 @@
custom_service_name: "Lightning CI required checker"
# For security reasons, configuration is only loaded from the repository's default branch,
# changes made in pull requests from different branches or forks are ignored. This means that changes to this file
# will only be used after they are merged.
subprojects:
- id: "CI: CircleCI"
paths:
- ".circleci/**"
checks:
- "test-on-tpus"

- id: "CI: Azure"
paths:
- ".azure/**"
checks:
- "pytorch-lightning (GPUs)"
- "pytorch-lightning (GPUs) (testing PyTorch - stable)"
- "pytorch-lightning (HPUs)"
- "pytorch-lightning (IPUs)"

- id: "pytorch_lightning"
paths:
# all examples don't need to be added because they aren't used in CI, but these are
@@ -52,14 +46,32 @@ subprojects:
- "mypy"
- "PR Gatekeeper (pytorch)"
- "pytorch-lightning (GPUs)"
- "pytorch-lightning (GPUs) (testing PyTorch - stable)"
- "pytorch-lightning (HPUs)"
- "pytorch-lightning (IPUs)"
- "slow (macOS-11, 3.7, 1.11)"
- "slow (ubuntu-20.04, 3.7, 1.11)"
- "slow (windows-2022, 3.7, 1.11)"
- "test-on-tpus"

- id: "pytorch_lightning: Azure GPU"
paths:
- ".azure/gpu-tests.yml"
- "tests/tests_pytorch/run_standalone_*.sh"
checks:
- "pytorch-lightning (GPUs)"

- id: "pytorch_lightning: Azure HPU"
paths:
- ".azure/hpu-tests.yml"
checks:
- "pytorch-lightning (HPUs)"

- id: "pytorch_lightning: Azure IPU"
paths:
- ".azure/ipu-tests.yml"
checks:
- "pytorch-lightning (IPUs)"

- id: "pytorch_lightning: Docs"
paths:
- "docs/source-pytorch/**"
@@ -73,7 +85,6 @@ subprojects:
- id: "pytorch_lightning: Docker"
paths:
- "dockers/**"
- "!dockers/README.md"
- "requirements.txt"
- "requirements/*.txt"
- "requirements/pytorch/*"
@@ -108,12 +119,10 @@ subprojects:

- id: "lightning_app"
paths:
- ".azure/app-cloud-e2e.yml"
- "requirements/app/**"
- "src/lightning_app/**"
- "tests/tests_app/**"
- "tests/tests_app_examples/**"
- "tests/tests_clusters/**"
# the examples are used in the app CI
- "examples/app_*"
checks:
@@ -127,6 +136,12 @@ subprojects:
- "pytest (windows-2022, 3.8, latest)"
- "pytest (windows-2022, 3.8, oldest)"

- id: "lightning_app: Azure"
paths:
- ".azure/app-cloud-e2e.yml"
checks:
- "App.cloud-e2e"

- id: "lightning_app: Docs"
paths:
- "docs/source-app/**"
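
The subprojects in this file map changed paths to the checks that must pass. A rough Python sketch of that mapping follows, using three entries copied from the diff. It assumes simple shell-style matching via `fnmatch`, which is not necessarily how the real checker matches globs (for example, `!` negation and `**` are not handled here), and the function name `required_checks` is hypothetical.

```python
from fnmatch import fnmatch

# A few subprojects copied from the diff; the real file has many more.
SUBPROJECTS = [
    {
        "id": "pytorch_lightning: Azure GPU",
        "paths": [".azure/gpu-tests.yml", "tests/tests_pytorch/run_standalone_*.sh"],
        "checks": ["pytorch-lightning (GPUs)"],
    },
    {
        "id": "pytorch_lightning: Azure HPU",
        "paths": [".azure/hpu-tests.yml"],
        "checks": ["pytorch-lightning (HPUs)"],
    },
    {
        "id": "lightning_app: Azure",
        "paths": [".azure/app-cloud-e2e.yml"],
        "checks": ["App.cloud-e2e"],
    },
]


def required_checks(changed_files, subprojects=SUBPROJECTS):
    """Collect the checks of every subproject touched by the changed files."""
    required = set()
    for sub in subprojects:
        touched = any(
            fnmatch(path, pattern)
            for path in changed_files
            for pattern in sub["paths"]
        )
        if touched:
            required.update(sub["checks"])
    return sorted(required)


if __name__ == "__main__":
    print(required_checks([".azure/gpu-tests.yml", "README.md"]))
```

Splitting the Azure jobs into their own subprojects, as this commit does, means a change to one pipeline file only requires that pipeline's check.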
15 changes: 15 additions & 0 deletions .github/workflows/probot-check-group.yml
@@ -0,0 +1,15 @@
name: Probot

on:
check_run: {}
pull_request: {}
issue_comment: {types: [created]}

jobs:
required-jobs:
runs-on: ubuntu-latest
if: github.event_name != 'issue_comment' || contains(github.event.comment.body, '@probot pls')
steps:
- uses: carmocca/probot@v2
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
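
The `if:` expression in this workflow can be read as: always run on `check_run` and `pull_request` events, and run on `issue_comment` events only when the comment mentions `@probot pls`. A small Python sketch of that predicate follows. The function name `should_run` is hypothetical, and the case-insensitive comparison reflects how GitHub documents `contains` for strings.

```python
TRIGGER_PHRASE = "@probot pls"  # copied from the workflow's `if:` expression


def should_run(event_name: str, comment_body: str = "") -> bool:
    """Mirror `event_name != 'issue_comment' || contains(body, '@probot pls')`."""
    if event_name != "issue_comment":
        return True
    # GitHub's `contains` ignores case for string arguments.
    return TRIGGER_PHRASE.lower() in comment_body.lower()


if __name__ == "__main__":
    for event, body in [
        ("pull_request", ""),
        ("issue_comment", "LGTM"),
        ("issue_comment", "@probot pls"),
    ]:
        print(event, repr(body), should_run(event, body))
```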
5 changes: 3 additions & 2 deletions dockers/base-conda/Dockerfile
@@ -141,8 +141,9 @@ RUN \
RUN \
# install Bagua
CUDA_VERSION_MM=$(python -c "print(''.join('$CUDA_VERSION'.split('.')[:2]))") && \
-pip install "bagua-cuda$CUDA_VERSION_MM==0.9.0" && \
-python -c "import bagua_core; bagua_core.install_deps()" && \
+CUDA_VERSION_BAGUA=$(python -c "print([ver for ver in [115,113,111,102] if $CUDA_VERSION_MM >= ver][0])") && \
+pip install "bagua-cuda$CUDA_VERSION_BAGUA==0.9.0" && \
+if [[ "$CUDA_VERSION_MM" = "$CUDA_VERSION_BAGUA" ]]; then python -c "import bagua_core; bagua_core.install_deps()"; fi && \
python -c "import bagua; print(bagua.__version__)"

RUN \
1 change: 1 addition & 0 deletions dockers/tpu-tests/tpu_test_cases.jsonnet
@@ -37,6 +37,7 @@ local tputests = base.BaseTest {
export XRT_TPU_CONFIG="tpu_worker;0;${KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS:7}"
export PL_RUN_TPU_TESTS=1
cd tests/tests_pytorch
set -e
coverage run --source=pytorch_lightning -m pytest -vv --durations=0 ./
echo "\n||| Running standalone tests |||\n"
bash run_standalone_tests.sh
34 changes: 34 additions & 0 deletions docs/source-pytorch/accelerators/hpu_basic.rst
@@ -47,6 +47,40 @@ It uses :class:`~pytorch_lightning.strategies.hpu_parallel.HPUParallelStrategy`
----

Scale-out on Gaudis
-------------------

To train a Lightning model on multiple HPU nodes, set the ``num_nodes`` argument of the ``Trainer`` to the number of available nodes.

.. code-block:: python

    trainer = Trainer(accelerator="hpu", devices=8, strategy="hpu_parallel", num_nodes=2)

In addition to this, the following environment variables need to be set to establish communication across nodes. Check out the documentation on :doc:`Cluster Environment <../clouds/cluster>` for more details.

- *MASTER_PORT* - required; must be a free port on the machine with NODE_RANK 0
- *MASTER_ADDR* - required (except for NODE_RANK 0); address of the node with NODE_RANK 0
- *WORLD_SIZE* - required; how many workers are in the cluster
- *NODE_RANK* - required; id of the node in the cluster

The trainer needs to be instantiated on every node participating in the training.

On Node 1:

.. code-block:: bash

    MASTER_ADDR=<MASTER_ADDR> MASTER_PORT=<MASTER_PORT> NODE_RANK=0 WORLD_SIZE=16 \
        python some_model_trainer.py (--arg1 ... train script args...)

On Node 2:

.. code-block:: bash

    MASTER_ADDR=<MASTER_ADDR> MASTER_PORT=<MASTER_PORT> NODE_RANK=1 WORLD_SIZE=16 \
        python some_model_trainer.py (--arg1 ... train script args...)

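
Before launching on each node, it may help to confirm that the variables listed above are set. The following is a minimal sketch of such a check, not part of Lightning; the helper names `missing_cluster_vars` and `expected_world_size` are hypothetical. It assumes, from the example above (``devices=8``, ``num_nodes=2``, ``WORLD_SIZE=16``), that the world size equals devices per node times the number of nodes.

```python
import os

# Variables the section lists as required on every node.
ALWAYS_REQUIRED = ("MASTER_PORT", "WORLD_SIZE", "NODE_RANK")


def missing_cluster_vars(env=None):
    """Return the names of required variables that are not set.

    MASTER_ADDR is only demanded when NODE_RANK is not "0", following the
    "required (except for NODE_RANK 0)" note in the list above.
    """
    env = os.environ if env is None else env
    missing = [name for name in ALWAYS_REQUIRED if name not in env]
    if env.get("NODE_RANK") != "0" and "MASTER_ADDR" not in env:
        missing.append("MASTER_ADDR")
    return missing


def expected_world_size(devices: int, num_nodes: int) -> int:
    """Assumed relation: total workers = devices per node * number of nodes."""
    return devices * num_nodes


if __name__ == "__main__":
    print("missing:", missing_cluster_vars())
    print("world size:", expected_world_size(devices=8, num_nodes=2))
```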
----

Select Gaudis automatically
---------------------------

1 change: 1 addition & 0 deletions requirements.txt
@@ -1,2 +1,3 @@
# the default package dependencies
-r ./requirements/app/base.txt
-r ./requirements/pytorch/base.txt
26 changes: 0 additions & 26 deletions src/lightning_app/CHANGELOG.md
@@ -9,49 +9,23 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
### Added

- Add support for `Lightning App Commands` through the `configure_commands` hook on the Lightning Flow and the `ClientCommand` ([#13602](https://github.com/Lightning-AI/lightning/pull/13602))


- Add support for Lightning AI BYOC cluster management ([#13835](https://github.com/Lightning-AI/lightning/pull/13835))


- Add support to see Lightning AI BYOC cluster logs ([#14334](https://github.com/Lightning-AI/lightning/pull/14334))


- Add support to run Lightning apps on Lightning AI BYOC clusters ([#13894](https://github.com/Lightning-AI/lightning/pull/13894))


- Add support for listing Lightning AI apps ([#13987](https://github.com/Lightning-AI/lightning/pull/13987))


- Adds `LightningTrainingComponent`. `LightningTrainingComponent` orchestrates multi-node training in the cloud ([#13830](https://github.com/Lightning-AI/lightning/pull/13830))


- Add support for printing application logs using CLI `lightning show logs <app_name> [components]` ([#13634](https://github.com/Lightning-AI/lightning/pull/13634))


- Add support for `Lightning API` through the `configure_api` hook on the Lightning Flow and the `Post`, `Get`, `Delete`, `Put` HttpMethods ([#13945](https://github.com/Lightning-AI/lightning/pull/13945))


- Added a warning when `configure_layout` returns URLs configured with http instead of https ([#14233](https://github.com/Lightning-AI/lightning/pull/14233))


### Changed

- Default values and parameter names for Lightning AI BYOC cluster management ([#14132](https://github.com/Lightning-AI/lightning/pull/14132))


### Changed

-


- Run the flow only if the state has changed from the previous execution ([#14076](https://github.com/Lightning-AI/lightning/pull/14076))

### Deprecated

-


### Fixed

- Unification of app template: moved `app.py` to root dir for `lightning init app <app_name>` template ([#13853](https://github.com/Lightning-AI/lightning/pull/13853))
2 changes: 1 addition & 1 deletion src/lightning_app/__version__.py
@@ -1 +1 @@
-version = "0.6.0dev"
+version = "0.6.0rc0"
