diff --git a/content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-1.md b/content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-1.md
index 511dea005e..9318722279 100644
--- a/content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-1.md
+++ b/content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-1.md
@@ -22,7 +22,7 @@ You will perform these steps in this Learning Path:
 First, ensure you have permissions to access to Meta's [405B parameter llama 3.1 model](https://huggingface.co/meta-llama/Llama-3.1-405B).
 
 {{% notice Note %}}
-Remember that you will need to replicate the install steps below on each device. Do NOT replicate the download and quantization step, since that will take excessive time -- instead do an `scp` from the quantization machine to the other instances, as shown below.
+Remember that you will need to replicate the install steps below on each device. Do NOT replicate the download and quantization step; llama.cpp will send the tensors to the worker nodes, which cache them locally.
 {{% /notice %}}
 
 ##### 1. Generate a virtual environment
@@ -149,12 +149,4 @@ Allowed quantization types:
   32 or BF16 : 14.00G, -0.0050 ppl @ Mistral-7B
    0 or F32  : 26.00G            @ 7B
         COPY : only copy tensors, no quantizing
-```
-
-##### 5. Copy the quantized gguf to the other instances
-
-Ensure that your EC2 security group has an inbound rule allowing itself, copy your ssh pem file to the instance you did the requantization on, and then use `scp` to copy the quantized gguf file to your two other instances.
-
-{{% notice Note %}}
-Use the private IP of your ec2 instances for this copy operation if your SG has a self-reference.
-{{% /notice %}}
\ No newline at end of file
+```
\ No newline at end of file
diff --git a/content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-2.md b/content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-2.md
index 83192b9acc..5a4eebebc5 100644
--- a/content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-2.md
+++ b/content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-2.md
@@ -23,7 +23,7 @@ Communication between the master node and the worker nodes occurs through a sock
 {{% notice Note %}}The RPC feature in llama.cpp is not secure by default, so you should never expose it to the open internet. To mitigate this risk, ensure that the security groups for all your EC2 instances are properly configured—restricting access to only trusted IPs or internal VPC traffic. This helps prevent unauthorized access to the RPC endpoints.{{% /notice %}}
 Use the following command to start the listening on the worker nodes:
 ```bash
-bin/rpc-server -p 50052 -H 0.0.0.0 -t 64
+bin/rpc-server -c -p 50052 -H 0.0.0.0 -t 64
 ```
 Below are the available flag options that can be used with the rpc-server functionality:
diff --git a/content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-3.md b/content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-3.md
index f051735815..9cfb6a691b 100644
--- a/content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-3.md
+++ b/content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-3.md
@@ -6,11 +6,10 @@ weight: 4
 layout: learningpathall
 ---
 ## Master node setup
-In this learning path, we will use the following three IP addresses for the nodes. Replace these in the instructions with your own node IPs.
+In this learning path, we will use the following two IP addresses for the worker nodes. Replace these with your own node IPs.
 
 ```bash
-master_ip =" 172.31.110.10"
-worker_ips = "172.31.110.11,172.31.110.12"
+export worker_ips="172.31.110.11:50052,172.31.110.12:50052"
 ```
 
 You can find the IP addresses of your AWS instances in the AWS console.
@@ -28,6 +27,11 @@ Finally, you can execute the following command, to execute distributed inference
 ```bash
 bin/llama-cli -m ../../model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 999
 ```
+
+{{% notice Note %}}
+Loading the tensors onto the worker nodes takes a significant amount of time (roughly 30 minutes). Support for pre-loading tensors is an open feature request for llama.cpp.
+{{% /notice %}}
+
 Here are short definitions of the flags used in above command:
 -n => Number of maximum output tokens
 --rpc => list of backend workers
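
The `worker_ips` assignment added in how-to-3.md can be sanity-checked in a local shell before launching anything. This is a minimal sketch: the addresses are the placeholder values from the patch, not real endpoints, and must be replaced with your own instances' private IPs.

```shell
# Placeholder worker addresses from the patch; replace with your own
# instances' private IPs. Note: no spaces around "=" in a shell assignment.
export worker_ips="172.31.110.11:50052,172.31.110.12:50052"

# The master node later passes this list verbatim to llama-cli, e.g.:
#   bin/llama-cli -m ../../model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 999
echo "$worker_ips"
```

The `host:port` form matches the port that each worker's `rpc-server` listens on (`-p 50052` in how-to-2.md), so `--rpc` can consume the list directly.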