In this step, you'll set up the Graviton4 instance with the tools and dependencies required to build and run the AFM-4.5B model. This includes installing system packages and a Python environment.

## Update the package list
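
The commands under this heading are collapsed in this view; on Ubuntu, refreshing the APT package index is typically:

```bash
# Refresh the package index so subsequent installs see current versions
sudo apt update
```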

## Build the Llama.cpp inference engine

In this step, you'll build Llama.cpp from source. Llama.cpp is a high-performance C++ implementation of the LLaMA model, optimized for inference on a range of hardware platforms, including Arm-based processors like AWS Graviton4.

Even though AFM-4.5B uses a custom model architecture, you can still use the standard Llama.cpp repository - Arcee AI has contributed the necessary modeling code upstream.
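
The build commands are collapsed in this view. A representative sequence, assuming the upstream repository URL and default CMake settings (the exact flags used in this learning path may differ):

```bash
# Clone the upstream repository, which already contains the AFM-4.5B modeling code
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Configure and compile a release build; CMake detects the Arm architecture
# and enables the appropriate optimizations automatically
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j $(nproc)
```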

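The next file in this diff sets up the Python virtual environment. The activation command itself is collapsed; given the environment name shown in the prompt, it is presumably:

```bash
# Activate the virtual environment created for llama.cpp's Python tooling
source env-llama-cpp/bin/activate
```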
This command does the following:

- Runs the activation script, which modifies your shell environment
- Updates your shell prompt to show `env-llama-cpp`, indicating the environment is active
- Updates `PATH` to use the environment’s Python interpreter
- Ensures all `pip` commands install packages into the isolated environment

## Upgrade pip to the latest version
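
The commands here are collapsed in this view. A typical sequence, assuming the llama.cpp repository root as the working directory and its top-level `requirements.txt` (both assumptions):

```bash
# Upgrade pip itself inside the virtual environment
pip install --upgrade pip

# Install the Python dependencies used by llama.cpp's conversion scripts
pip install -r requirements.txt
```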
After the installation completes, your virtual environment includes:
- **NumPy**: for numerical computations and array operations
- **Requests**: for HTTP operations and API calls
- **Other dependencies**: additional packages required by llama.cpp's Python bindings and utilities

Your environment is now ready to run Python scripts that integrate with the compiled Llama.cpp binaries.
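
As a quick sanity check (a minimal sketch, not part of the original instructions), you can confirm that the key packages import from the environment's interpreter:

```bash
python -c "import numpy, requests; print(numpy.__version__, requests.__version__)"
```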

{{< notice Tip >}}
Before running any Python commands, make sure your virtual environment is activated. {{< /notice >}}

In this step, you’ll download the [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) model from Hugging Face, convert it to the GGUF format for compatibility with `llama.cpp`, and generate quantized versions to optimize memory usage and improve inference speed.

{{% notice Note %}}
If you want to skip the model optimization process, [GGUF](https://huggingface.co/arcee-ai/AFM-4.5B-GGUF) versions are available. {{% /notice %}}

Make sure to activate your virtual environment before running any commands. The instructions below walk you through downloading and preparing the model for efficient use on AWS Graviton4.

```bash
pip install huggingface_hub hf_xet
```

This command installs:

- `huggingface_hub`: Python client for downloading models and datasets
- `hf_xet`: enables the Xet storage backend for faster downloads of large model files hosted on Hugging Face

These tools include the `hf` command-line interface you'll use next.

## Log in to the Hugging Face Hub

```bash
hf auth login
```
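
This command prompts for your Hugging Face access token. The download and GGUF-conversion steps that follow are collapsed in this view; the sketch below reconstructs them from the file names used on this page (the `--local-dir` path and the converter invocation are assumptions, not copied from the learning path):

```bash
# Download the model weights from the Hugging Face Hub
hf download arcee-ai/AFM-4.5B --local-dir models/afm-4-5b

# Convert the checkpoint to 16-bit GGUF using the converter that ships with llama.cpp
python convert_hf_to_gguf.py models/afm-4-5b --outfile models/afm-4-5b/afm-4-5B-F16.gguf --outtype f16

# Quantize the F16 model down to 4-bit (Q4_0)
bin/llama-quantize models/afm-4-5b/afm-4-5B-F16.gguf models/afm-4-5b/afm-4-5B-Q4_0.gguf Q4_0
```

This command creates a 4-bit quantized version of the model: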
- `llama-quantize` is the quantization tool from Llama.cpp.
- `afm-4-5B-F16.gguf` is the input GGUF model file in 16-bit precision.
- `Q4_0` applies zero-point 4-bit quantization.
- This reduces the model size by approximately 70% (from ~15GB to ~4.4GB).
- The quantized model will use less memory and run faster, though with a small reduction in accuracy.
- The output file will be `afm-4-5B-Q4_0.gguf`.

Expand All @@ -104,7 +105,7 @@ bin/llama-quantize models/afm-4-5b/afm-4-5B-F16.gguf models/afm-4-5b/afm-4-5B-Q8

This command creates an 8-bit quantized version of the model:
- `Q8_0` specifies 8-bit quantization with zero-point compression.
- This reduces the model size by approximately 45% (from ~15GB to ~8GB).
- The 8-bit version provides a better balance between memory usage and accuracy than 4-bit quantization.
- The output file is named `afm-4-5B-Q8_0.gguf`.
- It is commonly used in production scenarios where sufficient memory is available.