## Requirements

- An AWS account
- Quota for c8g instances in your preferred region
- Access to launch an EC2 instance of type `c8g.4xlarge` (or larger) with at least 128 GB of storage
- A Linux or macOS host

For more information about creating an EC2 instance using AWS, refer to [Getting Started with AWS](/learning-paths/servers-and-cloud-computing/csp/aws/).

## AWS Console Steps

Follow these steps to launch your EC2 instance using the AWS Management Console:
3. **Secure the Key File**

- Move the downloaded `.pem` file to the SSH configuration directory

```bash
mkdir -p ~/.ssh
mv arcee-graviton4-key.pem ~/.ssh
```

- Set proper permissions on macOS or Linux:

```bash
chmod 400 ~/.ssh/arcee-graviton4-key.pem
```

- In the dropdown list, select "My IP".

{{% notice Notes %}}
You will only be able to connect to the instance from your current host, which is the safest setting. Selecting "Anywhere" allows anyone on the Internet to attempt to connect; use at your own risk.

Although this demonstration only requires SSH access, it is possible to use one of your existing security groups as long as it allows SSH traffic.
{{% /notice %}}

5. **Configure Storage**

Keep these points in mind when you configure and launch the instance:

- **AMI Selection**: The Ubuntu 24.04 LTS AMI must be ARM64 compatible for Graviton processors

- **Security**: Think twice about allowing SSH from anywhere (0.0.0.0/0). It is strongly recommended to restrict access to your IP address.

- **Storage**: The 128GB EBS volume is sufficient for the Arcee model and dependencies
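
Once the instance is running, you can connect to it over SSH with the key file you secured earlier. The sketch below assumes the default `ubuntu` user of the Ubuntu 24.04 AMI; replace the placeholder with your instance's public IP address or DNS name:

```bash
# Connect to the instance with the key pair created earlier
ssh -i ~/.ssh/arcee-graviton4-key.pem ubuntu@<instance-public-ip>
```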


In this step, we'll set up the Graviton4 instance with all the necessary tools and dependencies required to build and run the Arcee Foundation Model. This includes installing the build tools and Python environment.
In this step, you'll set up the Graviton4 instance with all the necessary tools and dependencies required to build and run the Arcee Foundation Model. This includes installing the build tools and Python environment.

## Step 1: Update Package List
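
A representative sequence for refreshing the package index and installing the toolchain is sketched below; the libcurl development package at the end is an assumption based on Llama.cpp's typical build requirements:

```bash
# Refresh the package index
sudo apt-get update

# Install compilers, CMake, Git, and Python tooling (libcurl4-openssl-dev is assumed)
sudo apt-get install -y cmake gcc g++ git python3 python3-pip python3-virtualenv libcurl4-openssl-dev
```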


This command installs all the essential development tools and dependencies:

- **cmake**: Cross-platform build system generator used to compile Llama.cpp
- **gcc & g++**: GNU C and C++ compilers for building native code
- **git**: Version control system for cloning repositories
- **python3**: Python interpreter for running Python-based tools and scripts

The `-y` flag automatically answers "yes" to prompts, making the installation non-interactive.

## What's Ready Now?

After completing these steps, your Graviton4 instance has:

- A complete C/C++ development environment for building Llama.cpp
- Python 3 with pip for managing Python packages

In this step, you'll build Llama.cpp from source. Llama.cpp is a high-performance C++ implementation of the LLaMA model that's optimized for inference on various hardware platforms, including Arm-based processors like Graviton4.

Even though AFM-4.5B has a custom model architecture, you can use the vanilla version of Llama.cpp because the Arcee AI team has contributed the appropriate modeling code.

## Step 1: Clone the Repository
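
A typical way to fetch the source is sketched below; the repository URL is an assumption, taken from the upstream project location linked later in this Learning Path:

```bash
# Clone the upstream Llama.cpp repository
git clone https://github.com/ggml-org/llama.cpp
```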

This command clones the Llama.cpp repository from GitHub to your local machine.

## Step 2: Change to the Llama.cpp Directory

```bash
cd llama.cpp
```

Change into the llama.cpp directory to run the build process. This directory contains the `CMakeLists.txt` file and source code structure.

## Step 3: Configure the Build with CMake

```bash
cmake -B .
```

This command uses CMake to configure the build system:

- `-B .` specifies that the build files should be generated in the current directory
- CMake will detect your system's compiler, libraries, and hardware capabilities
- It will generate the appropriate build files (Makefiles on Linux) based on your system configuration

The CMake output should include the information below, indicating that the build process will leverage the Neoverse V2 architecture's specialized instruction sets designed for AI/ML workloads. These optimizations are crucial for achieving optimal performance on Graviton4:

```output
-- ARM feature DOTPROD enabled
-- ARM feature SVE enabled
-- ARM feature MATMUL_INT8 enabled
```

## Step 4: Build the Project
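
A typical build invocation is sketched below; the release configuration and parallel job count are assumptions rather than the exact flags used here:

```bash
# Compile all targets in Release mode, using every available vCPU
cmake --build . --config Release -j$(nproc)
```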

The build process will compile the C++ source code into executable binaries optimized for your ARM64 architecture. This should only take a minute.

## What is built?

After successful compilation, you'll have several key command-line executables in the `bin` directory:
- `llama-cli` - The main inference executable for running LLaMA models

In this step, you'll set up a Python virtual environment and install the required dependencies for working with Llama.cpp. This ensures you have a clean, isolated Python environment with all the necessary packages for model optimization.

## Step 1: Create a Python Virtual Environment
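
A sketch of the creation step is shown below. It assumes the `virtualenv` tool installed earlier and the environment name expected by the activation command in the next step; `python3 -m venv env-llama-cpp` would work just as well:

```bash
# Create an isolated Python environment named env-llama-cpp
virtualenv env-llama-cpp
```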

## Step 2: Activate the Virtual Environment

```bash
source env-llama-cpp/bin/activate
```

This command activates the virtual environment:
- The `source` command executes the activation script, which modifies your current shell environment
- Depending on your shell, your command prompt may change to show `(env-llama-cpp)` at the beginning, indicating the active environment. For readability, the prefix is omitted from the commands that follow.
- All subsequent `pip` commands will install packages into this isolated environment
- The `PATH` environment variable is updated to prioritize the virtual environment's Python interpreter

## Step 3: Upgrade pip to the Latest Version

```bash
pip install --upgrade pip
```

This command ensures you have the latest version of pip:
## Step 4: Install Project Dependencies

```bash
pip install -r requirements.txt
```

This command installs all the Python packages specified in the requirements.txt file:
- This ensures everyone working on the project uses the same package versions
- The installation will include packages needed for model loading, inference, and any Python bindings for Llama.cpp

## What is installed?

After successful installation, your virtual environment will contain:
- **NumPy**: For numerical computations and array operations

In this step, you'll download the AFM-4.5B model from Hugging Face, convert it to the GGUF format for use with Llama.cpp, and create quantized versions to optimize memory usage and inference speed.

The first release of the [Arcee Foundation Model](https://www.arcee.ai/blog/announcing-the-arcee-foundation-model-family) family, [AFM-4.5B](https://www.arcee.ai/blog/deep-dive-afm-4-5b-the-first-arcee-foundational-model), is a 4.5-billion-parameter frontier model that delivers excellent accuracy, strict compliance, and very high cost-efficiency. It was trained on almost 7 trillion tokens of clean, rigorously filtered data, and has been tested across a wide range of languages, including Arabic, English, French, German, Hindi, Italian, Korean, Mandarin, Portuguese, Russian, and Spanish.

Here are the steps to download and optimize the model for AWS Graviton4. Make sure to run them in the virtual environment you created in the previous step.

## Step 1: Install the Hugging Face libraries

```bash
pip install huggingface_hub hf_xet
```

This command installs the Hugging Face Hub Python library, which provides tools for downloading models and datasets from the Hugging Face platform. The library includes the `huggingface-cli` command-line interface that you can use to download the AFM-4.5B model. The `hf_xet` library adds efficient data transfer and caching when downloading large models from the Hugging Face Hub.

## Step 2: Download the AFM-4.5B Model

```bash
huggingface-cli download arcee-ai/afm-4.5B --local-dir models/afm-4-5b
```

This command downloads the AFM-4.5B model from the Hugging Face Hub:
## Step 3: Convert to GGUF Format

```bash
python3 convert_hf_to_gguf.py models/afm-4-5b
deactivate
```

The first command converts the downloaded Hugging Face model to the GGUF (GGML Universal Format) format:
- It outputs a single `afm-4-5B-F16.gguf` ~15GB file in the `models/afm-4-5b/` directory
- GGUF is the native format used by Llama.cpp and provides efficient loading and inference

Next, deactivate the Python virtual environment as future commands won't require it.

## Step 4: Create Q4_0 Quantized Version
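
A representative quantization command is sketched below; the file names follow the naming used elsewhere in this Learning Path, and the exact invocation is an assumption:

```bash
# Quantize the F16 GGUF model to 4-bit (Q4_0)
bin/llama-quantize models/afm-4-5b/afm-4-5B-F16.gguf models/afm-4-5b/afm-4-5B-Q4_0.gguf Q4_0
```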

## Step 5: Create Q8_0 Quantized Version

This command creates an 8-bit quantized version of the model:
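
A sketch of the corresponding invocation, mirroring the Q4_0 step (assumed rather than verbatim):

```bash
# Quantize the F16 GGUF model to 8-bit (Q8_0)
bin/llama-quantize models/afm-4-5b/afm-4-5B-F16.gguf models/afm-4-5b/afm-4-5B-Q8_0.gguf Q8_0
```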

**Arm Optimization**: Similar to Q4_0, Arm has contributed optimized kernels for Q8_0 quantization that take advantage of Neoverse V2 instruction sets. These optimizations provide excellent performance for 8-bit operations while maintaining higher accuracy compared to 4-bit quantization.

## What is available now?

After completing these steps, you'll have three versions of the AFM-4.5B model:
- `afm-4-5B-F16.gguf` - The original full-precision model (~15GB)

Now that you have the AFM-4.5B models in GGUF format, you can run inference using various Llama.cpp tools. In this step, you'll explore different ways to interact with the model for text generation, benchmarking, and evaluation.

## Using llama-cli for Interactive Text Generation

```bash
bin/llama-cli -m models/afm-4-5b/afm-4-5B-Q8_0.gguf -n 256 --color
```

This command starts an interactive session with the model:

- `-m models/afm-4-5b/afm-4-5B-Q8_0.gguf` specifies the model file to load
- `-n 256` sets the maximum number of tokens to generate per response
- The tool will prompt you to enter text, and the model will generate a response
In this example, `llama-cli` uses 16 vCPUs. You can try different values with the `-t` (threads) option.

Once you start the interactive session, you can have conversations like this:

```console
> Give me a brief explanation of the attention mechanism in transformer models.
In transformer models, the attention mechanism allows the model to focus on specific parts of the input sequence when computing the output. Here's a simplified explanation:

```
To exit the interactive session, type `Ctrl+C` or `/bye`.

This will display performance statistics:

```output
llama_perf_sampler_print: sampling time = 26.66 ms / 356 runs ( 0.07 ms per token, 13352.84 tokens per second)
llama_perf_context_print: load time = 782.72 ms
```

### Example Non-Interactive Session

Now, try the 4-bit model in non-interactive mode:

```bash
bin/llama-cli -m models/afm-4-5b/afm-4-5B-Q4_0.gguf -n 256 --color -no-cnv -p "Give me a brief explanation of the attention mechanism in transformer models."
```

## Using llama-server for API Inference
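
A sketch of starting the server and sending a request is shown below. The launch flags and request body are assumptions; the endpoint matches the `http://localhost:8080/v1/chat/completions` URL used in this example:

```bash
# Start an OpenAI-compatible HTTP server on port 8080 (model choice and flags are assumptions)
bin/llama-server -m models/afm-4-5b/afm-4-5B-Q8_0.gguf --port 8080 &

# Send a chat completion request to the local server
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Give me a brief explanation of the attention mechanism in transformer models."}
    ],
    "max_tokens": 256
  }'
```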

You get an answer similar to this one:

```json
{
}
```

You can also interact with the server using Python with the [OpenAI client library](https://github.com/openai/openai-python), which enables streaming responses and other features.
## Using llama-bench for Benchmarking
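
A representative `llama-bench` run for the 4-bit model on 4 threads is sketched below; the exact parameters are assumptions:

```bash
# Benchmark prompt processing and token generation with 4 threads
bin/llama-bench -m models/afm-4-5b/afm-4-5B-Q4_0.gguf -t 4
```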

It's impressive that with only 4 threads, the 4-bit model can still generate at a comfortable 15 tokens per second. You could run several copies of the model on the same instance to serve concurrent users or applications.

You can also try [`llama-batched-bench`](https://github.com/ggml-org/llama.cpp/tree/master/tools/batched-bench) to benchmark performance on batch sizes larger than 1.


## Using llama-perplexity for Model Evaluation
The `llama-perplexity` tool evaluates the model's quality on text datasets by calculating perplexity; lower values indicate that the model predicts the text better.

### Downloading a Test Dataset

First, download the WikiText-2 test dataset.

```bash
sh scripts/get-wikitext-2.sh
```

### Running Perplexity Evaluation

Next, measure perplexity on the test dataset.

```bash
bin/llama-perplexity -m models/afm-4-5b/afm-4-5B-F16.gguf -f wikitext-2-raw/wiki.test.raw
bin/llama-perplexity -m models/afm-4-5b/afm-4-5B-Q8_0.gguf -f wikitext-2-raw/wiki.test.raw
bin/llama-perplexity -m models/afm-4-5b/afm-4-5B-Q4_0.gguf -f wikitext-2-raw/wiki.test.raw
```

If you redirect the output of these commands to a log file, you can monitor progress with:

```bash
tail -f ppl.sh.log
```


Here are the full results.


| Model | Generation Speed (tokens/s, 16 vCPUs) | Memory Usage | Perplexity (WikiText-2) |
|:-------:|:----------------------:|:------------:|:----------:|
| F16 | ~15–16 | ~15 GB | TODO |
| Q8_0 | ~25 | ~8 GB | TODO |
| Q4_0 | ~40 | ~4.4 GB | TODO |


When you have finished your benchmarking and evaluation, make sure to terminate your AWS EC2 instance in the AWS Management Console to avoid incurring unnecessary charges for unused compute resources.
