Guide For Multi-Node Distributed Finetuning #1477
Step 1: Enable Public Key Authentication

On each node, edit `/etc/ssh/sshd_config` and uncomment the `PubkeyAuthentication` line to enable public key authentication.

Step 2: Generate Public Key

Generate an SSH key pair:

`ssh-keygen`

Print the public key:

`cat ~/.ssh/id_rsa.pub`

Add the public key to the `authorized_keys` file on all nodes and the server:

`nano ~/.ssh/authorized_keys`

Repeat steps 1-3 on all other nodes.

Exchange public keys between nodes: add Node 1's public key to the `authorized_keys` file of Node 2, and vice versa.

Test passwordless SSH access from each node to every other node.
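The steps above can be sketched as a short shell session. This is a sketch, not part of the original guide: the username and hostname (`user@node2`) are placeholder examples you must replace with your own.

```shell
# Run on Node 1. Usernames/hostnames below are examples; adjust for your cluster.
mkdir -p ~/.ssh && chmod 700 ~/.ssh

# Generate an RSA key pair with no passphrase, if one does not exist yet
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# Show the public key so it can be appended to ~/.ssh/authorized_keys on Node 2
cat ~/.ssh/id_rsa.pub

# ssh-copy-id automates the append step:
#   ssh-copy-id user@node2
# Then confirm passwordless login works:
#   ssh user@node2 hostname
```

Repeat in the other direction (from Node 2 to Node 1) so every node can reach every other node without a password.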
Step 3: Configure Axolotl

Copy your `.yml` config files and settings to every node, and create a `deepspeed_hostfile` inside the Axolotl folder.

Step 4: Configure Accelerate
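For reference, the `deepspeed_hostfile` from Step 3 uses DeepSpeed's standard hostfile format: one line per node, giving the node's address and the number of GPU slots it offers. For the two-node, 8-GPU-per-node setup in this guide it might look like this (the IP addresses are placeholders):

```
192.168.1.10 slots=8
192.168.1.11 slots=8
```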
Follow these steps on each node to configure Accelerate (run `accelerate config` and answer the prompts):

Node 1 (Server) Configuration

- Select `This machine` for the compute environment.
- Select `multi_gpu` for the compute type.
- Enter the number of machines (`2` for two nodes).
- Enter `0` as the machine rank for the first node (the server).
- Enter the main process port (e.g. `5000`).
- Answer `no` for setting up custom environment variables.
- Select `static` for the rendezvous backend.
- Answer `yes` for running on the same network.
- Answer `no` for using a cluster.
- Answer `yes` for using DeepSpeed.
- Answer `yes` for using DeepSpeed configs, and give the path to the config file (e.g. `deepspeed_configs/zero2.json`).
- Answer `no` for using Zero 3.
- Select `pdsh` for the DeepSpeed multinode launcher.
- Enter `deepspeed_hostfile` for the DeepSpeed hostfile.
- Answer `no` for using custom launch utility options.
- Answer `no` for using a TPU.
- Enter the number of GPUs (`8` for 8 GPUs).

Node 2 Configuration

Repeat the above steps for Node 2, but change the machine rank to `1`.
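Accelerate writes the answers to a config file (by default `~/.cache/huggingface/accelerate/default_config.yaml`). For the Node 1 answers above, the resulting file should look roughly like this sketch; the main process IP is a placeholder, and the exact set of keys can vary between Accelerate versions:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_config_file: deepspeed_configs/zero2.json
  deepspeed_hostfile: deepspeed_hostfile
  deepspeed_multinode_launcher: pdsh
  zero3_init_flag: false
machine_rank: 0          # 1 on Node 2
main_process_ip: 192.168.1.10   # placeholder: Node 1's address
main_process_port: 5000
num_machines: 2
num_processes: 8         # from the GPU-count answer above
rdzv_backend: static
same_network: true
```

If a run picks up stale settings, re-running `accelerate config` (or editing this file directly) on the affected node is the quickest fix.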
Step 5: Finetuning
On Node 1 (server), run the finetuning process using Accelerate:
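The launch command depends on your config file; with Axolotl's CLI it typically takes the form below. The config path is only an example, substitute your own `.yml`:

```
accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml
```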
This will start the finetuning process across all nodes. The IP addresses printed before each step let you verify that training is actually running on every node.