diff --git a/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/00_overview.md b/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/00_overview.md new file mode 100644 index 0000000000..2e3ddadafe --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/00_overview.md @@ -0,0 +1,37 @@ +--- +title: Overview +weight: 2 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## The AFM-4.5B model + +AFM-4.5B is a 4.5-billion-parameter foundation model designed to balance accuracy, efficiency, and broad language coverage. Trained on nearly 7 trillion tokens of carefully filtered data, it performs well across a wide range of languages, including Arabic, English, French, German, Hindi, Italian, Korean, Mandarin, Portuguese, Russian, and Spanish. + +In this Learning Path, you'll deploy AFM-4.5B using [Llama.cpp](https://github.com/ggerganov/llama.cpp) on an Arm-based AWS Graviton4 instance. You’ll walk through the full workflow, from setting up your environment and compiling the runtime, to downloading, quantizing, and running inference on the model. You'll also evaluate model quality using perplexity, a common metric for measuring how well a language model predicts text. + +This hands-on guide helps developers build cost-efficient, high-performance LLM applications on modern Arm server infrastructure using open-source tools and real-world deployment practices. + +### LLM deployment workflow on Arm Graviton4 + +- **Provision compute**: launch an EC2 instance using a Graviton4-based instance type (for example, `c8g.4xlarge`) + +- **Set up your environment**: install the required build tools and dependencies (such as CMake, Python, and Git) + +- **Build the inference engine**: clone the [Llama.cpp](https://github.com/ggerganov/llama.cpp) repository and compile the project for your Arm-based environment + +- **Prepare the model**: download the **AFM-4.5B** model files from Hugging Face and use Llama.cpp's quantization tools to reduce model size and optimize performance + +- **Run inference**: load the quantized model and run sample prompts using Llama.cpp. + +- **Evaluate model quality**: calculate **perplexity** or use other metrics to assess model performance + +{{< notice Note>}} +You can reuse this deployment flow with other models supported by Llama.cpp by swapping out the model file and adjusting quantization settings. 
+{{< /notice >}} + + + + diff --git a/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/01_launching_a_graviton4_instance.md b/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/01_launching_a_graviton4_instance.md index 772e4d96c5..09bbfa54ce 100644 --- a/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/01_launching_a_graviton4_instance.md +++ b/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/01_launching_a_graviton4_instance.md @@ -1,6 +1,6 @@ --- -title: Launching a Graviton4 instance -weight: 2 +title: Provision your Graviton4 environment +weight: 3 ### FIXED, DO NOT MODIFY layout: learningpathall @@ -8,164 +8,111 @@ layout: learningpathall ## Requirements - - An AWS account +Before you begin, make sure you have the following: - - Access to launch an EC2 instance of type `c8g.4xlarge` (or larger) with at least 128 GB of storage +- An AWS account +- Permission to launch a Graviton4 EC2 instance of type `c8g.4xlarge` (or larger) +- At least 128 GB of available storage -For more information about creating an EC2 instance using AWS refer to [Getting Started with AWS](/learning-paths/servers-and-cloud-computing/csp/aws/). +If you're new to EC2, check out the Learning Path [Getting Started with AWS](/learning-paths/servers-and-cloud-computing/csp/aws/). -## AWS Console Steps +## Create an SSH key pair -Follow these steps to launch your EC2 instance using the AWS Management Console: +To deploy the Arcee AFM-4.5B model, you need an EC2 instance running on Arm-based Graviton4 hardware. -### Step 1: Create an SSH Key Pair +To do this, start by signing in to the [AWS Management Console](https://console.aws.amazon.com), then navigate to the **EC2** service. -1. **Navigate to EC2 Console** +From there, you can create an SSH key pair that allows you to connect to your instance securely. - - Go to the [AWS Management Console](https://console.aws.amazon.com) +## Set up secure access - - Search for "EC2" and click on "EC2" service +Open the **Key Pairs** section under **Network & Security** in the sidebar, and create a new key pair named `arcee-graviton4-key`. -2. **Create Key Pair** +Next, select **RSA** as the key type, and **.pem** as the file format. Once you create the key, your browser will download the `.pem` file automatically. - - In the left navigation pane, click "Key Pairs" under "Network & Security" +To ensure the key remains secure and accessible, move the `.pem` file to your SSH configuration directory, and update its permissions to restrict access. - - Click "Create key pair" +To do this, on macOS or Linux, run: - - Enter name: `arcee-graviton4-key` +```bash +mkdir -p ~/.ssh +mv arcee-graviton4-key.pem ~/.ssh/ +chmod 400 ~/.ssh/arcee-graviton4-key.pem +``` +internet +## Launch and configure the EC2 instance - - Select "RSA" as the key pair type +In the left sidebar of the EC2 dashboard, select **Instances**, and then **Launch instances**. 
- - Select ".pem" as the private key file format +Use the following settings to configure your instance: - - Click "Create key pair" +- **Name**: `Arcee-Graviton4-Instance` +- **Application and OS image**: + - Select the **Quick Start** tab + - Select **Ubuntu Server 24.04 LTS (HVM), SSD Volume Type** + - Ensure the architecture is set to **64-bit (ARM)** +- **Instance type**: select `c8g.4xlarge` or larger +- **Key pair name**: select `arcee-graviton4-key` from the list - - The private key file will automatically download to your computer +## Configure network -3. **Secure the Key File** +To enable internet access, choose a VPC with at least one public subnet. - - Move the downloaded `.pem` file to the SSH configuration directory +Then select a public subnet from the list. - ```bash - mkdir -p ~/.ssh - mv arcee-graviton4-key.pem ~/.ssh - ``` +Under **Auto-assign public IP**, select **Enable**. - - Set proper permissions on macOS or Linux: +## Configure firewall - ```bash - chmod 400 ~/.ssh/arcee-graviton4-key.pem - ``` +Select **Create security group**. Then select **Allow SSH traffic from** and select **My IP**. -### Step 2: Launch EC2 Instance +{{% notice Note %}} +You'll only be able to connect to the instance from your current host, which is the most secure setting. Avoid selecting **Anywhere** unless absolutely necessary, as this setting allows anyone on the internet to attempt a connection. -1. **Start Instance Launch** - - - In the left navigation pane, click "Instances" under "Instances" - - - Click "Launch instances" button - -2. **Configure Instance Details** - - - **Name and tags**: Enter `Arcee-Graviton4-Instance` as the instance name - - - **Application and OS Images**: - - Click "Quick Start" tab - - - Select "Ubuntu" - - - Choose "Ubuntu Server 24.04 LTS (HVM), SSD Volume Type" - - - **Important**: Ensure the architecture shows "64-bit (ARM)" for Graviton compatibility - - - **Instance type**: - - Click on "Select instance type" - - - Select `c8g.4xlarge` or larger - -3. **Configure Key Pair** - - In "Key pair name", select the SSH keypair you created earlier (`Arcee-Graviton4-Instance`) - -4. **Configure Network Settings** - - - **Network**: Select a VPC with a least one public subnet. - - - **Subnet**: Select a public subnet in the VPC - - - **Auto-assign Public IP**: Enable - - - **Firewall (security groups)** - - - Click on "Create security group" - - - Click on "Allow SSH traffic from" - - - In the dropdown list, select "My IP". - - -{{% notice Notes %}} -You will only be able to connect to the instance from your current host, which is the safest setting. Selecting "Anywhere" allows anyone on the Internet to attempt to connect; use at your own risk. - -Although this demonstration only requires SSH access, it is possible to use one of your existing security groups as long as it allows SSH traffic. +You only need SSH access for this Learning Path. If you already have a security group that allows inbound SSH traffic, you can reuse it. {{% /notice %}} -5. **Configure Storage** - - - **Root volume**: - - Size: `128` GB - - - Volume type: `gp3` - -7. **Review and Launch** - - - Review all settings in the "Summary" section +## Configure storage - - Click "Launch instance" +Set the **root volume size** to `128` GB, then select **gp3** as the volume type. -### Step 3: Monitor Instance Launch +## Review and launch the instance -1. **View Launch Status** +Review all your configuration settings, and when you're ready, select **Launch instance** to create your EC2 instance. 
- After a few seconds, you should see a message similar to this one: +## Monitor the instance launch - `Successfully initiated launch of instance (i-)` +After a few seconds, you should see a confirmation message like this: - If instance launch fails, please review your settings and try again. +``` +Successfully initiated launch of instance (i-xxxxxxxxxxxxxxxxx) +``` -2. **Get Connection Information** +If the launch fails, double-check the instance type, permissions, and network settings. - - Click on the instance id, or look for the instance in the Instances list in the EC2 console. +To retrieve the connection details, go to the **Instances** list in the EC2 dashboard. - - In the "Details" tab of the instance, note the "Public DNS" host name +Then select your instance by selecting **Instance ID**. - - This is the host name you'll use to connect via SSH, aka `PUBLIC_DNS_HOSTNAME` +In the **Details** tab, copy the **Public DNS** value - you’ll use this to connect through SSH. -### Step 4: Connect to Your Instance +## Connect to your instance -1. **Open Terminal/Command Prompt** +Open a terminal and connect to the instance using the SSH key you downloaded earlier: -2. **Connect via SSH** - ```bash - ssh -i ~/.ssh/arcee-graviton4-key.pem ubuntu@ - ``` +```bash +ssh -i ~/.ssh/arcee-graviton4-key.pem ubuntu@ +``` -3. **Accept Security Warning** +When prompted, type `yes` to confirm the connection. - - When prompted about authenticity of host, type `yes` - - - You should now be connected to your Ubuntu instance - -### Important Notes - -- **Region Selection**: Ensure you're in your preferred AWS region before launching - -- **AMI Selection**: The Ubuntu 24.04 LTS AMI must be ARM64 compatible for Graviton processors - -- **Security**: Think twice about allowing SSH from anywhere (0.0.0.0/0). It is strongly recommended to restrict access to your IP address. - -- **Storage**: The 128GB EBS volume is sufficient for the Arcee model and dependencies - -- **Backup**: Consider creating AMIs or snapshots for backup purposes +You should now be connected to your Ubuntu instance running on Graviton4. +{{% notice Note %}} +**Region**: make sure you're launching in your preferred AWS region. +**AMI**: confirm that the selected AMI supports the Arm64 architecture. +**Security**: for best practice, restrict SSH access to your own IP. +**Storage**: 128 GB is sufficient for the AFM-4.5B model and dependencies. +**Backup**: consider creating an AMI or snapshot after setup is complete. +{{% /notice %}} diff --git a/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/02_setting_up_the_instance.md b/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/02_setting_up_the_instance.md index c85c8f0bc4..8b8c53c779 100644 --- a/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/02_setting_up_the_instance.md +++ b/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/02_setting_up_the_instance.md @@ -1,51 +1,58 @@ --- -title: Setting up the instance -weight: 3 +title: Configure your Graviton4 environment +weight: 4 ### FIXED, DO NOT MODIFY layout: learningpathall --- -In this step, you'll set up the Graviton4 instance with all the necessary tools and dependencies required to build and run the Arcee Foundation Model. This includes installing the build tools and Python environment. 
+In this step, you'll set up the Graviton4 instance with the tools and dependencies required to build and run the Arcee Foundation Model. This includes installing system packages and a Python environment. -## Step 1: Update Package List +## Update the package list + +Run the following command to update your local APT package index: ```bash sudo apt-get update ``` -This command updates the local package index from the repositories: +This step ensures you have the most recent metadata about available packages, including versions and dependencies. It helps prevent conflicts when installing new packages. -- Downloads the latest package lists from all configured APT repositories -- Ensures you have the most recent information about available packages and their versions -- This is a best practice before installing new packages to avoid potential conflicts -- The package index contains metadata about available packages, their dependencies, and version information +## Install system dependencies -## Step 2: Install System Dependencies +Install the build tools and Python environment: ```bash sudo apt-get install cmake gcc g++ git python3 python3-pip python3-virtualenv libcurl4-openssl-dev unzip -y ``` -This command installs all the essential development tools and dependencies: +This command installs the following tools and dependencies: + +- **CMake**: cross-platform build system generator used to compile and build Llama.cpp + +- **GCC and G++**: GNU C and C++ compilers for compiling native code + +- **Git**: version control system for cloning repositories + +- **Python 3**: Python interpreter for running Python-based tools and scripts + +- **Pip**: Python package manager + +- **Virtualenv**: tool for creating isolated Python environments + +- **libcurl4-openssl-dev**: development files for the curl HTTP library -- **cmake**: Cross-platform build system generator used to compile Llama.cpp -- **gcc & g++**: GNU C and C++ compilers for building native code -- **git**: Version control system for cloning repositories -- **python3**: Python interpreter for running Python-based tools and scripts -- **python3-pip**: Python package installer for managing Python dependencies -- **python3-virtualenv**: Tool for creating isolated Python environments -- **libcurl4-openssl-dev**: client-side URL transfer library +- **Unzip**: tool to extract `.zip` files (used in some model downloads) -The `-y` flag automatically answers "yes" to prompts, making the installation non-interactive. +The `-y` flag automatically approves the installation of all packages without prompting. -## What's Ready Now? +## Ready for build and deployment -After completing these steps, your Graviton4 instance has: +After completing the setup, your instance includes the following tools and environments: - A complete C/C++ development environment for building Llama.cpp -- Python 3 with pip for managing Python packages +- Python 3, pip, and virtualenv for managing Python tools and environments - Git for cloning repositories -- All necessary build tools for compiling optimized ARM64 binaries +- All required dependencies for compiling optimized Arm64 binaries -The system is now prepared for the next steps: building Llama.cpp and downloading the Arcee Foundation Model. +You're now ready to build Llama.cpp and download the Arcee Foundation Model. 
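+
+Before moving on, you can optionally confirm that the toolchain is in place. The version numbers you see will vary; these commands only verify that each tool is installed and on your `PATH`:
+
+```bash
+# Each command should print a version string rather than "command not found"
+cmake --version
+gcc --version
+git --version
+python3 --version
+pip3 --version
+```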
diff --git a/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/03_building_llama_cpp.md b/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/03_building_llama_cpp.md index b4cdcf0f7a..7cf16a7a2f 100644 --- a/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/03_building_llama_cpp.md +++ b/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/03_building_llama_cpp.md @@ -1,16 +1,17 @@ --- -title: Building Llama.cpp -weight: 4 +title: Build Llama.cpp +weight: 5 ### FIXED, DO NOT MODIFY layout: learningpathall --- +## Build the Llama.cpp inference engine -In this step, you'll build Llama.cpp from source. Llama.cpp is a high-performance C++ implementation of the LLaMA model that's optimized for inference on various hardware platforms, including Arm-based processors like Graviton4. +In this step, you'll build Llama.cpp from source. Llama.cpp is a high-performance C++ implementation of the LLaMA model, optimized for inference on a range of hardware platforms,including Arm-based processors like AWS Graviton4. -Even though AFM-4.5B has a custom model architecture, we're able to use the vanilla version of Llama.cpp as the Arcee AI team has contributed the appropriate modeling code. +Even though AFM-4.5B uses a custom model architecture, you can still use the standard Llama.cpp repository - Arcee AI has contributed the necessary modeling code upstream. -## Step 1: Clone the Repository +## Clone the repository ```bash git clone https://github.com/ggerganov/llama.cpp @@ -18,28 +19,28 @@ git clone https://github.com/ggerganov/llama.cpp This command clones the Llama.cpp repository from GitHub to your local machine. The repository contains the source code, build scripts, and documentation needed to compile the inference engine. -## Step 2: Navigate to the Project Directory +## Navigate to the project directory ```bash cd llama.cpp ``` -Change into the llama.cpp directory to run the build process. This directory contains the `CMakeLists.txt` file and source code structure. +Change into the llama.cpp directory to run the build process. This directory contains the `CMakeLists.txt` file and all source code. -## Step 3: Configure the Build with CMake +## Configure the build with CMake ```bash cmake -B . ``` -This command uses CMake to configure the build system: +This command configures the build system using CMake: -- `-B .` specifies that the build files should be generated in the current directory -- CMake will detect your system's compiler, libraries, and hardware capabilities -- It will generate the appropriate build files (Makefiles on Linux) based on your system configuration +- `-B .` tells CMake to generate build files in the current directory +- CMake detects your system's compiler, libraries, and hardware capabilities +- It produces Makefiles (on Linux) or platform-specific build scripts for compiling the project -The CMake output should include the information below, indicating that the build process will leverage the Neoverse V2 architecture's specialized instruction sets designed for AI/ML workloads. These optimizations are crucial for achieving optimal performance on Graviton4: +If you're running on Graviton4, the CMake output should include hardware-specific optimizations targeting the Neoverse V2 architecture. 
These optimizations are crucial for achieving high performance on Graviton4: ```output -- ARM feature DOTPROD enabled @@ -50,33 +51,40 @@ The CMake output should include the information below, indicating that the build -- Adding CPU backend variant ggml-cpu: -mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+dotprod+i8mm+sve ``` -- **DOTPROD: Dot Product** - Hardware-accelerated dot product operations for neural network computations -- **SVE: Scalable Vector Extension** - Advanced vector processing capabilities that can handle variable-length vectors up to 2048 bits, providing significant performance improvements for matrix operations -- **MATMUL_INT8: Matrix multiplication units** - Dedicated hardware for efficient matrix operations common in transformer models, accelerating the core computations of large language models -- **FMA: Fused Multiply-Add - Optimized floating-point operations that combine multiplication and addition in a single instruction -- **FP16 Vector Arithmetic - Hardware support for 16-bit floating-point vector operations, reducing memory usage while maintaining good numerical precision +These features enable advanced CPU instructions that accelerate inference performance on Arm64: -## Step 4: Compile the Project +- **DOTPROD: Dot Product**: hardware-accelerated dot product operations for neural network workloads + +- **SVE (Scalable Vector Extension)**: advanced vector processing capabilities that can handle variable-length vectors up to 2048 bits, providing significant performance improvements for matrix operations + +- **MATMUL_INT8**: integer matrix multiplication units optimized for transformers + +- **FMA**: fused multiply-add operations to speed up floating-point math + +- **FP16 vector arithmetic**: 16-bit floating-point vector operations to reduce memory use without compromising precision + +## Compile the project ```bash cmake --build . --config Release -j16 ``` -This command compiles the Llama.cpp project: -- `--build .` tells CMake to build the project using the files in the current directory -- `--config Release` specifies a Release build configuration, which enables optimizations and removes debug symbols +This command compiles the Llama.cpp source code: + +- `--build .` tells CMake to build the project in the current directory +- `--config Release` enables optimizations and strips debug symbols - `-j16` runs the build with 16 parallel jobs, which speeds up compilation on multi-core systems like Graviton4 -The build process will compile the C++ source code into executable binaries optimized for your ARM64 architecture. This should only take a minute. +The build process compiles the C++ source code into executable binaries optimized for the Arm64 architecture. Compilation typically takes under a minute. -## What is built? 
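+
+As a quick sanity check, you can list the binaries the build produced. The exact set of tools can differ between Llama.cpp versions, but you should at least see the executables used in the following steps:
+
+```bash
+# List the compiled executables in the build output directory
+ls bin/ | grep llama
+```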
+## Key binaries after compilation -After successful compilation, you'll have several key command-line executables in the `bin` directory: -- `llama-cli` - The main inference executable for running LLaMA models -- `llama-server` - A web server for serving model inference over HTTP -- `llama-quantize` - a tool for model quantization to reduce memory usage -- Various utility programs for model conversion and optimization +After compilation, you'll find several key command-line tools in the `bin` directory: +- `llama-cli`: the main inference executable for running LLaMA models +- `llama-server`: a web server for serving model inference over HTTP +- `llama-quantize`: a tool for model quantization to reduce memory usage +- Additional utilities for model conversion and optimization -You can find more information in the llama.cpp [GitHub repository](https://github.com/ggml-org/llama.cpp/tree/master/tools). +You can find more tools and usage details in the llama.cpp [GitHub repository](https://github.com/ggml-org/llama.cpp/tree/master/tools). -These binaries are specifically optimized for ARM64 architecture and will provide excellent performance on your Graviton4 instance. +These binaries are specifically optimized for Arm64 architecture and will provide excellent performance on your Graviton4 instance. diff --git a/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/04_install_python_dependencies_for_llama_cpp.md b/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/04_install_python_dependencies_for_llama_cpp.md index f21d281408..b680dcf7eb 100644 --- a/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/04_install_python_dependencies_for_llama_cpp.md +++ b/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/04_install_python_dependencies_for_llama_cpp.md @@ -1,66 +1,80 @@ --- -title: Installing Python dependencies for llama.cpp -weight: 5 +title: Install Python dependencies +weight: 6 ### FIXED, DO NOT MODIFY layout: learningpathall --- +## Overview In this step, you'll set up a Python virtual environment and install the required dependencies for working with Llama.cpp. This ensures you have a clean, isolated Python environment with all the necessary packages for model optimization. 
-## Step 1: Create a Python Virtual Environment
+## Create a Python virtual environment

 ```bash
 virtualenv env-llama-cpp
 ```

-This command creates a new Python virtual environment named `env-llama-cpp`:
-- Virtual environments provide isolated Python environments that prevent conflicts between different projects
-- The `env-llama-cpp` directory will contain its own Python interpreter and package installation space
-- This isolation ensures that the Llama.cpp dependencies won't interfere with other Python projects on your system
-- Virtual environments are essential for reproducible development environments
+This command creates a new Python virtual environment named `env-llama-cpp`, which has the following benefits:
+- Provides an isolated Python environment to prevent package conflicts between projects
+- Creates a local directory containing its own Python interpreter and installation space
+- Ensures Llama.cpp dependencies don't interfere with your global Python setup
+- Supports reproducible and portable development environments

-## Step 2: Activate the Virtual Environment
+## Activate the virtual environment
+
+Run the following command to activate the virtual environment:

 ```bash
 source env-llama-cpp/bin/activate
 ```
+This command does the following:
+
+- Runs the activation script, which modifies your shell environment
+- Updates your shell prompt to show `(env-llama-cpp)`, indicating the environment is active
+- Updates `PATH` so that the environment's Python interpreter is used first
+- Ensures all `pip` commands install packages into the isolated environment

-This command activates the virtual environment:
-- The `source` command executes the activation script, which modifies your current shell environment
-Depending on you sheel, your command prompt may change to show `(env-llama-cpp)` at the beginning, indicating the active environment. This will be reflected in the following commands.
-- All subsequent `pip` commands will install packages into this isolated environment -- The `PATH` environment variable is updated to prioritize the virtual environment's Python interpreter +## Upgrade pip to the latest version -## Step 3: Upgrade pip to the Latest Version +Before installing dependencies, it’s a good idea to upgrade pip: ```bash pip install --upgrade pip ``` +This command: -This command ensures you have the latest version of pip: -- Upgrading pip helps avoid compatibility issues with newer packages -- The `--upgrade` flag tells pip to install the newest available version -- This is a best practice before installing project dependencies -- Newer pip versions often include security fixes and improved package resolution +- Ensures you have the latest version of pip +- Helps avoid compatibility issues with modern packages +- Applies the `--upgrade` flag to fetch and install the newest release +- Brings in security patches and better dependency resolution logic -## Step 4: Install Project Dependencies +## Install project dependencies + +Use the following command to install all required Python packages: ```bash pip install -r requirements.txt ``` -This command installs all the Python packages specified in the requirements.txt file: -- The `-r` flag tells pip to read the package list from the specified file -- `requirements.txt` contains a list of Python packages and their version specifications -- This ensures everyone working on the project uses the same package versions -- The installation will include packages needed for model loading, inference, and any Python bindings for Llama.cpp +This command does the following: + +- Uses the `-r` flag to read the list of dependencies from `requirements.txt` +- Installs the exact package versions required for the project +- Ensures consistency across development environments and contributors +- Includes packages for model loading, inference, and Python bindings for `llama.cpp` + +This step sets up everything you need to run AFM-4.5B in your Python environment. + +## What the environment includes + +After the installation completes, your virtual environment includes: +- **NumPy**: for numerical computations and array operations +- **Requests**: for HTTP operations and API calls +- **Other dependencies**: additional packages required by llama.cpp's Python bindings and utilities +Your environment is now ready to run Python scripts that integrate with the compiled Llama.cpp binaries -## What is installed? +{{< notice Tip >}} +Before running any Python commands, make sure your virtual environment is activated. {{< /notice >}} -After successful installation, your virtual environment will contain: -- **NumPy**: For numerical computations and array operations -- **Requests**: For HTTP operations and API calls -- **Other dependencies**: Specific packages needed for Llama.cpp Python integration -The virtual environment is now ready for running Python scripts that interact with the compiled Llama.cpp binaries. Remember to always activate the virtual environment (`source env-llama-cpp/bin/activate`) before running any Python code related to this project. 
\ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/05_downloading_and_optimizing_afm45b.md b/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/05_downloading_and_optimizing_afm45b.md index e293e74ff7..d4850cfcc0 100644 --- a/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/05_downloading_and_optimizing_afm45b.md +++ b/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/05_downloading_and_optimizing_afm45b.md @@ -1,91 +1,97 @@ --- -title: Downloading and optimizing AFM-4.5B -weight: 6 +title: Download and optimize the AFM-4.5B model +weight: 7 ### FIXED, DO NOT MODIFY layout: learningpathall --- -In this step, you'll download the AFM-4.5B model from Hugging Face, convert it to the GGUF format for use with Llama.cpp, and create quantized versions to optimize memory usage and inference speed. +In this step, you’ll download the AFM-4.5B model from Hugging Face, convert it to the GGUF format for compatibility with `llama.cpp`, and generate quantized versions to optimize memory usage and improve inference speed. -The first release of the [Arcee Foundation Model](https://www.arcee.ai/blog/announcing-the-arcee-foundation-model-family) family, [AFM-4.5B](https://www.arcee.ai/blog/deep-dive-afm-4-5b-the-first-arcee-foundational-model) is a 4.5-billion-parameter frontier model that delivers excellent accuracy, strict compliance, and very high cost-efficiency. It was trained on almost 7 trillion tokens of clean, rigorously filtered data, and has been tested across a wide range of languages, including Arabic, English, French, German, Hindi, Italian, Korean, Mandarin, Portuguese, Russian, and Spanish +Make sure to activate your virtual environment before running any commands. The instructions below walk you through downloading and preparing the model for efficient use on AWS Graviton4. -Here are the steps to download and optimize the model for AWS Graviton4. Make sure to run them in the virtual environment you created at the previous step. - -## Step 1: Install the Hugging Face libraries +## Install the Hugging Face libraries ```bash pip install huggingface_hub hf_xet ``` -This command installs the Hugging Face Hub Python library, which provides tools for downloading models and datasets from the Hugging Face platform. The library includes the `huggingface-cli` command-line interface that you can use to download the AFM-4.5B model. +This command installs: + +- `huggingface_hub`: Python client for downloading models and datasets +- `hf_xet`: Git extension for fetching large model files stored on Hugging Face -## Step 2: Download the AFM-4.5B Model +These tools include the `huggingface-cli` command-line interface you'll use next. + +## Download the AFM-4.5B model ```bash huggingface-cli download arcee-ai/afm-4.5B --local-dir models/afm-4-5b ``` -This command downloads the AFM-4.5B model from the Hugging Face Hub: -- `arcee-ai/afm-4.5B` is the model identifier on Hugging Face Hub -- `--local-dir model/afm-4-5b` specifies the local directory where the model files will be stored -- The download includes the model weights, configuration files, and tokenizer data -- This is a 4.5 billion parameter model, so the download may take several minutes depending on your internet connection +This command downloads the model to the `models/afm-4-5b` directory: +- `arcee-ai/afm-4.5B` is the Hugging Face model identifier. 
+- The download includes the model weights, configuration files, and tokenizer data.
+- This is a 4.5 billion parameter model, so the download can take several minutes depending on your internet connection.

-## Step 3: Convert to GGUF Format
+## Convert to GGUF format

 ```bash
 python3 convert_hf_to_gguf.py models/afm-4-5b
 deactivate
 ```

-The first command converts the downloaded Hugging Face model to the GGUF (GGML Universal Format) format:
-- `convert_hf_to_gguf.py` is a conversion script that comes with Llama.cpp
-- `models/afm-4-5b` is the input directory containing the Hugging Face model files
-- The script reads the model architecture, weights, and configuration from the Hugging Face format
-- It outputs a single `afm-4-5B-F16.gguf` ~15GB file in the `models/afm-4-5b/` directory
-- GGUF is the native format used by Llama.cpp and provides efficient loading and inference
+This command converts the downloaded Hugging Face model to GGUF (GGML Universal Format):
+- `convert_hf_to_gguf.py` is a conversion script that comes with Llama.cpp.
+- `models/afm-4-5b` is the input directory containing the Hugging Face model files.
+- The script reads the model architecture, weights, and configuration from the Hugging Face format.
+- It outputs a single `afm-4-5B-F16.gguf` ~15GB file in the same `models/afm-4-5b/` directory.
+- GGUF is the native format for Llama.cpp, optimized for efficient loading and inference.

 Next, deactivate the Python virtual environment as future commands won't require it.

-## Step 4: Create Q4_0 Quantized Version
+## Create a Q4_0 quantized version

 ```bash
 bin/llama-quantize models/afm-4-5b/afm-4-5B-F16.gguf models/afm-4-5b/afm-4-5B-Q4_0.gguf Q4_0
 ```

 This command creates a 4-bit quantized version of the model:
-- `llama-quantize` is the quantization tool from Llama.cpp
-- `afm-4-5B-F16.gguf` is the input GGUF model file in 16-bit precision
-- `Q4_0` specifies 4-bit quantization with zero-point quantization
-- This reduces the model size by approximately 45% (from ~15GB to ~8GB)
-- The quantized model will use less memory and run faster, though with a small reduction in accuracy
-- The output file will be named `afm-4-5B-Q4_0.gguf`
+- `llama-quantize` is the quantization tool from Llama.cpp.
+- `afm-4-5B-F16.gguf` is the input GGUF model file in 16-bit precision.
+- `Q4_0` applies zero-point 4-bit quantization.
+- This reduces the model size by approximately 70% (from ~15GB to ~4.4GB).
+- The quantized model will use less memory and run faster, though with a small reduction in accuracy.
+- The output file will be `afm-4-5B-Q4_0.gguf`.
+
+## Arm optimization

-**ARM Optimization**: ARM has contributed highly optimized kernels for Q4_0 quantization that leverage the Neoverse v2 instruction sets. These low-level math routines accelerate typical deep learning operations, providing significant performance improvements on ARM-based processors like Graviton4.
+Arm has contributed optimized kernels for Q4_0 that use Neoverse v2 instruction sets. These low-level routines accelerate math operations, delivering strong performance on Graviton4.

-These instruction sets enable Llama.cpp to perform quantized operations much faster than generic implementations, making ARM processors highly competitive for inference workloads.
+These instruction sets allow Llama.cpp to run quantized operations significantly faster than generic implementations, making Arm processors a competitive choice for inference workloads.
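+
+At this point you can compare the files on disk. The listing below is illustrative; your exact sizes will differ slightly depending on the model revision and Llama.cpp version:
+
+```bash
+# Show the size of each GGUF file produced so far
+ls -lh models/afm-4-5b/*.gguf
+```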
-## Step 5: Create Q8_0 Quantized Version
+## Create a Q8_0 quantized version

 ```bash
 bin/llama-quantize models/afm-4-5b/afm-4-5B-F16.gguf models/afm-4-5b/afm-4-5B-Q8_0.gguf Q8_0
 ```

 This command creates an 8-bit quantized version of the model:
-- `Q8_0` specifies 8-bit quantization with zero-point quantization
-- This reduces the model size by approximately 70% (from ~15GB to ~4.4GB)
-- The 8-bit version provides a better balance between memory usage and accuracy compared to 4-bit
-- The output file will be named `afm-4-5B-Q8_0.gguf`
-- This version is often preferred for production use when memory constraints allow
+- `Q8_0` specifies 8-bit quantization with zero-point compression.
+- This reduces the model size by approximately 45% (from ~15GB to ~8GB).
+- The 8-bit version provides a better balance between memory usage and accuracy than 4-bit quantization.
+- The output file is named `afm-4-5B-Q8_0.gguf`.
+- It is commonly used in production scenarios where the extra memory is available.
+
+## Arm optimization

-**ARM Optimization**: Similar to Q4_0, ARM has contributed optimized kernels for Q8_0 quantization that take advantage of Neoverse v2 instruction sets. These optimizations provide excellent performance for 8-bit operations while maintaining higher accuracy compared to 4-bit quantization.
+Similar to Q4_0, Arm has contributed optimized kernels for Q8_0 quantization that take advantage of Neoverse v2 instruction sets. These optimizations provide excellent performance for 8-bit operations while maintaining higher accuracy compared to 4-bit quantization.

-## What is available now?
+## Model files ready for inference

 After completing these steps, you'll have three versions of the AFM-4.5B model:
 - `afm-4-5B-F16.gguf` - The original full-precision model (~15GB)
-- `afm-4-5B-Q4_0.gguf` - 4-bit quantized version (~8GB) for memory-constrained environments
-- `afm-4-5B-Q8_0.gguf` - 8-bit quantized version (~4.4GB) for balanced performance and memory usage
+- `afm-4-5B-Q4_0.gguf` - 4-bit quantized version (~4.4GB) for memory-constrained environments
+- `afm-4-5B-Q8_0.gguf` - 8-bit quantized version (~8GB) for balanced performance and memory usage

-These models are now ready to be used with the Llama.cpp inference engine for text generation and other language model tasks.
\ No newline at end of file
+These models are now ready to be used with the `llama.cpp` inference engine for text generation and other language model tasks.
diff --git a/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/06_running_inference.md b/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/06_running_inference.md
index b1c9aeb471..0e84e97f56 100644
--- a/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/06_running_inference.md
+++ b/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/06_running_inference.md
@@ -1,37 +1,39 @@
 ---
-title: Running inference with AFM-4.5B
-weight: 7
+title: Run inference with AFM-4.5B
+weight: 8

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

-Now that you have the AFM-4.5B models in GGUF format, you can run inference using various Llama.cpp tools. In this step, you'll explore different ways to interact with the model for text generation, benchmarking, and evaluation.
+Now that you have the AFM-4.5B models in GGUF format, you can run inference using various Llama.cpp tools.
In this step, you'll explore how to generate text, benchmark performance, and interact with the model through both command-line and HTTP APIs. -## Using llama-cli for Interactive Text Generation -The `llama-cli` tool provides an interactive command-line interface for text generation. This is perfect for testing the model's capabilities and having conversations with it. +## Use llama-cli for interactive text generation -### Basic Usage +The `llama-cli` tool provides an interactive command-line interface for text generation. This is ideal for quick testing and hands-on exploration of the model's behavior. + +## Basic usage ```bash bin/llama-cli -m models/afm-4-5b/afm-4-5B-Q8_0.gguf -n 256 --color ``` -This command starts an interactive session with the model: +This command starts an interactive session: -- `-m models/afm-4-5b/afm-4-5B-Q8_0.gguf` specifies the model file to load -- `-n 512` sets the maximum number of tokens to generate per response +- `-m` (model file path) specifies the model file to load +- `-n 256` sets the maximum number of tokens to generate per response +- `--color` enables colored terminal output - The tool will prompt you to enter text, and the model will generate a response -In this example, `llama-cli` uses 16 vCPUs. You can try different values with `-t `. +In this example, `llama-cli` uses 16 vCPUs. You can try different values with `-t `. -### Example Interactive Session +### Example interactive session Once you start the interactive session, you can have conversations like this: ```console -> Give me a brief explanation of the attention mechnanism in transformer models. +> Give me a brief explanation of the attention mechanism in transformer models. In transformer models, the attention mechanism allows the model to focus on specific parts of the input sequence when computing the output. Here's a simplified explanation: 1. **Key-Query-Value (K-Q-V) computation**: For each input element, the model computes three vectors: @@ -48,9 +50,9 @@ In transformer models, the attention mechanism allows the model to focus on spec The attention mechanism allows transformer models to selectively focus on specific parts of the input sequence, enabling them to better understand context and relationships between input elements. This is particularly useful for tasks like machine translation, where the model needs to capture long-range dependencies between input words. ``` -To exit the interactive session, type `Ctrl+C` or `/bye`. +To exit the session, type `Ctrl+C` or `/bye`. -This will display performance statistics: +You'll then see performance metrics like this: ```bash llama_perf_sampler_print: sampling time = 26.66 ms / 356 runs ( 0.07 ms per token, 13352.84 tokens per second) @@ -60,28 +62,28 @@ llama_perf_context_print: eval time = 13173.66 ms / 331 runs ( 39 llama_perf_context_print: total time = 129945.08 ms / 355 tokens ``` -In this example, our 8-bit model running on 16 threads generated 355 tokens, at over 25 tokens per second (`eval time`). +In this example, the 8-bit model running on 16 threads generated 355 tokens, at ~25 tokens per second (`eval time`). -### Example Non-Interactive Session +## Run a non-interactive prompt -Now, try the 4-bit model in non-interactive mode: +You can also use `llama-cli` in one-shot mode with a prompt: ```bash -bin/llama-cli -m models/afm-4-5b/afm-4-5B-Q4_0.gguf -n 256 --color -no-cnv -p "Give me a brief explanation of the attention mechnanism in transformer models." 
+bin/llama-cli -m models/afm-4-5b/afm-4-5B-Q4_0.gguf -n 256 --color -no-cnv -p "Give me a brief explanation of the attention mechanism in transformer models." ``` -This command starts an non-interactive session with the model: -- `-m models/afm-4-5b/afm-4-5B-Q4_0.gguf` specifies the model file to load -- `-no-cnv` disable the conversation mode -- `-p` sets the prompt sent to the model -- The tool will prompt you to enter text, and the model will generate a response +This command: +- Loads the 4-bit model +- Disables conversation mode using `-no-cnv` +- Sends a one-time prompt using `-p` +- Prints the generated response and exits -Here, you should see the model generating at about 40 tokens per second. This shows how a more aggressive quantization recipe helps deliver faster performance. +The 4-bit model delivers faster generation—expect around 40 tokens per second on Graviton4. This shows how a more aggressive quantization recipe helps deliver faster performance. -## Using llama-server for API Access +## Use llama-server for API access -The `llama-server` tool runs the model as a web server, allowing you to make HTTP requests for text generation. This is useful for integrating the model into applications or for batch processing. +The `llama-server` tool runs the model as a web server compatible with the OpenAI API format, allowing you to make HTTP requests for text generation. This is useful for integrating the model into applications or for batch processing. -### Starting the Server +## Start the server ```bash bin/llama-server -m models/afm-4-5b/afm-4-5B-Q4_0.gguf \ @@ -90,17 +92,17 @@ bin/llama-server -m models/afm-4-5b/afm-4-5B-Q4_0.gguf \ --ctx-size 4096 ``` -This starts a server that: +This starts a local server that: - Loads the specified model - Listens on all network interfaces (`0.0.0.0`) - Accepts connections on port 8080 -- Uses a 4096-token context window +- Supports a 4096-token context window -### Making API Requests +### Make an API request -Once the server is running, you can make requests using curl or any HTTP client. As `llama-server` is compatible with the popular OpenAI API, we'll use in the following examples. +Once the server is running, you can make requests using curl, or any HTTP client. -Open a new terminal on the AWS instance and run: +Open a new terminal on the AWS instance, and run: ```bash curl -X POST http://localhost:8080/v1/chat/completions \ @@ -118,7 +120,7 @@ curl -X POST http://localhost:8080/v1/chat/completions \ }' ``` -You get an answer similar to this one: +The response includes the model’s reply and performance metrics: ```json { @@ -155,4 +157,12 @@ You get an answer similar to this one: } ``` +## What's next? + +You’ve now successfully: + +- Run AFM-4.5B in interactive and non-interactive modes +- Tested performance with different quantized models +- Served the model as an OpenAI-compatible API endpoint + You can also interact with the server using Python with the [OpenAI client library](https://github.com/openai/openai-python), enabling streaming responses, and other features. 
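+
+If you just want to see streaming from the command line, you can also request it with curl. This is a sketch that assumes the server from the previous step is still running on port 8080; with `"stream": true` the server returns the answer incrementally as server-sent events:
+
+```bash
+# -N disables curl's output buffering so tokens appear as they are generated
+curl -N -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "messages": [
+      {"role": "user", "content": "Write one sentence about Arm-based cloud servers."}
+    ],
+    "max_tokens": 64,
+    "stream": true
+  }'
+```
+
+The response arrives as a series of `data:` lines, each holding a JSON fragment of the reply, mirroring what a streaming client library would receive.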
diff --git a/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/07_evaluating_the_quantized_models.md b/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/07_evaluating_the_quantized_models.md index bf390d985e..f23b06cc3a 100644 --- a/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/07_evaluating_the_quantized_models.md +++ b/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/07_evaluating_the_quantized_models.md @@ -1,21 +1,21 @@ --- -title: Evaluating the quantized models -weight: 8 +title: Benchmark and evaluate the quantized models +weight: 9 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Using llama-bench for Performance Benchmarking +## Benchmark performance using llama-bench -The [`llama-bench`](https://github.com/ggml-org/llama.cpp/tree/master/tools/llama-bench) tool allows you to measure the performance characteristics of your model, including inference speed and memory usage. +Use the [`llama-bench`](https://github.com/ggml-org/llama.cpp/tree/master/tools/llama-bench) tool to measure model performance, including inference speed and memory usage. -### Basic Benchmarking +## Run basic benchmarks -You can benchmark multiple model versions to compare their performance: +Benchmark multiple model versions to compare performance: ```bash -# Benchmark the full precision model +# Benchmark the full-precision model bin/llama-bench -m models/afm-4-5b/afm-4-5B-F16.gguf # Benchmark the 8-bit quantized model @@ -25,14 +25,16 @@ bin/llama-bench -m models/afm-4-5b/afm-4-5B-Q8_0.gguf bin/llama-bench -m models/afm-4-5b/afm-4-5B-Q4_0.gguf ``` -Running each model on 16 vCPUs, you should see results like: +Typical results on a 16 vCPU instance: - **F16 model**: ~15-16 tokens/second, ~15GB memory usage - **Q8_0 model**: ~25 tokens/second, ~8GB memory usage - **Q4_0 model**: ~40 tokens/second, ~4.4GB memory usage -The exact performance will depend on your specific instance configuration and load. +Your actual results might vary depending on your specific instance configuration and system load. -### Advanced Benchmarking +## Run advanced benchmarks + +Use this command to benchmark performance across prompt sizes and thread counts: ```bash bin/llama-bench -m models/afm-4-5b/afm-4-5B-Q4_0.gguf \ @@ -41,13 +43,13 @@ bin/llama-bench -m models/afm-4-5b/afm-4-5B-Q4_0.gguf \ -t 8,16,24 ``` -This command: -- Loads the model and runs inference benchmarks -- `-p`: Evaluates a random prompt of 128, and 512 tokens -- `-n`: Generates 128 tokens -- `-t`: Run the model on 4, 8, and 16 threads +This command does the following: +- Loads the 4-bit model and runs inference benchmarks +- `-p`: evaluates prompt lengths of 128, 256, and 512 tokens +- `-n`: generates 128 tokens +- `-t`: runs inference using 4, 8, and 24 threads -The results should look like this: +Here’s an example of how performance scales across threads and prompt sizes: | model | size | params | backend | threads | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: | @@ -61,28 +63,32 @@ The results should look like this: | llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 16 | pp512 | 190.18 ± 0.03 | | llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 16 | tg128 | 40.99 ± 0.36 | -It's pretty amazing to see that with only 4 threads, the 4-bit model can still generate at the very comfortable speed of 15 tokens per second. 
We could definitely run several copies of the model on the same instance to serve concurrent users or applications. +Even with just four threads, the Q4_0 model achieves comfortable generation speeds. On larger instances, you can run multiple concurrent model processes to support parallel workloads. + +To benchmark batch inference, use [`llama-batched-bench`](https://github.com/ggml-org/llama.cpp/tree/master/tools/batched-bench). -You can also try [`llama-batched-bench`](https://github.com/ggml-org/llama.cpp/tree/master/tools/batched-bench) to benchmark performance on batch sizes larger than 1. +## Evaluate model quality using llama-perplexity -## Using llama-perplexity for Model Evaluation +Use the llama-perplexity tool to measure how well each model predicts the next token in a sequence. Perplexity is a measure of how well a language model predicts text. It gives you insight into the model’s confidence and predictive ability, representing the average number of possible next tokens the model considers when predicting each word: -Perplexity is a measure of how well a language model predicts text. It represents the average number of possible next tokens the model considers when predicting each word. A lower perplexity score indicates the model is more confident in its predictions and generally performs better on the given text. For example, a perplexity of 2.0 means the model typically considers 2 possible tokens when making each prediction, while a perplexity of 10.0 means it considers 10 possible tokens on average. +- A lower perplexity score indicates the model is more confident in its predictions and generally performs better on the given text. +- For example, a perplexity of 2.0 means the model typically considers ~2 tokens per step when making each prediction, while a perplexity of 10.0 means it considers 10 possible tokens on average, indicating more uncertainty. The `llama-perplexity` tool evaluates the model's quality on text datasets by calculating perplexity scores. Lower perplexity indicates better quality. -### Downloading a Test Dataset +## Download a test dataset -First, download the Wikitest-2 test dataset. +Use the following script to download and extract the Wikitext-2 dataset: ```bash sh scripts/get-wikitext-2.sh ``` +This script downloads and extracts the dataset to a local folder named `wikitext-2-raw`. -### Running Perplexity Evaluation +## Run a perplexity evaluation -Next, measure perplexity on the test dataset. +Run the llama-perplexity tool to evaluate how well each model predicts the Wikitext-2 test set: ```bash bin/llama-perplexity -m models/afm-4-5b/afm-4-5B-F16.gguf -f wikitext-2-raw/wiki.test.raw @@ -90,9 +96,15 @@ bin/llama-perplexity -m models/afm-4-5b/afm-4-5B-Q8_0.gguf -f wikitext-2-raw/wik bin/llama-perplexity -m models/afm-4-5b/afm-4-5B-Q4_0.gguf -f wikitext-2-raw/wiki.test.raw ``` -If you want to speed things up, you can add the `--chunks` option to use a fraction of 564 chunks contained in the test dataset. +{{< notice Tip >}} +To reduce runtime, add the `--chunks` flag to evaluate a subset of the data. For example: `--chunks 50` runs the evaluation on the first 50 text blocks. +{{< /notice >}} + +## Run the evaluation as a background script + +Running a full perplexity evaluation on all three models takes about 5 hours. To avoid SSH timeouts and keep the process running after logout, wrap the commands in a shell script and run it in the background. -On the full dataset, these three commands will take about 5 hours. 
You should run them in a shell script to avoid SSH timeouts. +Create a script named ppl.sh: For example: ```bash @@ -109,7 +121,7 @@ bin/llama-perplexity -m models/afm-4-5b/afm-4-5B-Q4_0.gguf -f wikitext-2-raw/wik Here are the full results. -| Model | Generation Speed (tokens/s, 16 vCPUs) | Memory Usage | Perplexity (Wikitext-2) | +| Model | Generation speed (tokens/s, 16 vCPUs) | Memory Usage | Perplexity (Wikitext-2) | |:-------:|:----------------------:|:------------:|:----------:| | F16 | ~15–16 | ~15 GB | TODO | | Q8_0 | ~25 | ~8 GB | TODO | diff --git a/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/08_conclusion.md b/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/08_conclusion.md index a7effd0311..832d335d14 100644 --- a/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/08_conclusion.md +++ b/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/08_conclusion.md @@ -1,66 +1,66 @@ --- -title: Conclusion -weight: 9 +title: Review what you built +weight: 10 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Conclusion +## Wrap up your AFM-4.5B deployment -Congratulations! You have successfully completed the journey of deploying the Arcee AFM-4.5B foundation model on AWS Graviton4. +Congratulations! You have completed the process of deploying the Arcee AFM-4.5B foundation model on AWS Graviton4. -Here is a summary of what you learned. - -### What you built +Here’s a summary of what you built and how you can take your knowledge forward. Using this Learning Path, you have: -1. **Launched a Graviton4-powered EC2 instance** - Set up a c8g.4xlarge instance running Ubuntu 24.04 LTS, leveraging AWS's latest Arm-based processors for optimal performance and cost efficiency. +- **Launched a Graviton4-powered EC2 instance** – you set up a `c8g.4xlarge` instance running Ubuntu 24.04 LTS, leveraging Arm-based compute for optimal price–performance. -2. **Configured the development environment** - Installed essential tools and dependencies, including Git, build tools, and Python packages needed for machine learning workloads. +- **Configured the development environment** – you installed tools and dependencies, including Git, build tools, and Python packages for machine learning workloads. -3. **Built Llama.cpp from source** - Compiled the optimized inference engine specifically for Arm64 architecture, ensuring maximum performance on Graviton4 processors. +- **Built Llama.cpp from source** – you compiled the inference engine specifically for the Arm64 architecture to maximize performance on Graviton4. -4. **Downloaded and optimized AFM-4.5B** - Retrieved the 4.5-billion parameter Arcee Foundation Model and converted it to the efficient GGUF format, then created quantized versions (8-bit and 4-bit) to balance performance and memory usage. +- **Downloaded and optimized AFM-4.5B** – you retrieved the 4.5-billion-parameter Arcee Foundation Model, converted it to the GGUF format, and created quantized versions (8-bit and 4-bit) to reduce memory usage and improve speed. -5. **Ran inference and evaluation** - Tested the model's capabilities through interactive conversations, API endpoints, and comprehensive benchmarking to measure speed, memory usage, and model quality. +- **Ran inference and evaluation** – you tested the model using interactive sessions and API endpoints, and benchmarked speed, memory usage, and model quality. 
-### Key Performance Insights
+## Key performance insights
The benchmarking results demonstrate the power of quantization and Arm-based computing:
-- **Memory efficiency**: The 4-bit quantized model uses only ~4.4GB of RAM compared to ~15GB for the full precision model
-- **Speed improvements**: Quantization delivers 2-3x faster inference speeds (40+ tokens/second vs 15-16 tokens/second)
-- **Cost optimization**: Lower memory requirements enable running on smaller, more cost-effective instances
-- **Quality preservation**: The quantized models maintain excellent perplexity scores, showing minimal quality degradation
+- **Memory efficiency** – the 4-bit model uses only ~4.4 GB of RAM compared to ~15 GB for the full-precision version
+- **Speed improvements** – inference with Q4_0 is 2–3x faster (40+ tokens/sec vs. 15–16 tokens/sec)
+- **Cost optimization** – lower memory needs enable smaller, more affordable instances
+- **Quality preservation** – the quantized models maintain strong perplexity scores, showing minimal quality loss
+
+## The AWS Graviton4 advantage
-### The Graviton4 Advantage
+AWS Graviton4 processors, built on the Arm Neoverse-V2 architecture, provide:
-AWS Graviton4 processors, built on Arm Neoverse-V2 architecture, provide:
- Superior performance per watt compared to x86 alternatives
-- Cost savings of 20-40% for compute-intensive workloads
+- Cost savings of 20–40% for compute-intensive workloads
- Optimized memory bandwidth and cache hierarchy for AI/ML workloads
- Native Arm64 support for modern machine learning frameworks
-### Next Steps and Call to Action
+## Next steps for deploying AFM-4.5B on Arm
-Now that you have a fully functional AFM-4.5B deployment, here are some exciting ways to extend your learning:
+Now that you have a fully functional AFM-4.5B deployment, here are some ways to extend your learning:
-**Production Deployment**
+**Production deployment**:
- Set up auto-scaling groups for high availability
- Implement load balancing for multiple model instances
- Add monitoring and logging with CloudWatch
- Secure your API endpoints with proper authentication
-**Application Development**
-- Build a web application using the llama-server API
+**Application development**:
+- Build a web application using the `llama-server` API (see the example request after this section)
- Create a chatbot or virtual assistant
- Develop content generation tools
- Integrate with existing applications via REST APIs
-The combination of Arcee AI's efficient foundation models, Llama.cpp's optimized inference engine, and AWS Graviton4's powerful Arm processors creates a compelling platform for deploying production-ready AI applications. Whether you're building chatbots, content generators, or research tools, this stack provides the performance, cost efficiency, and flexibility needed for modern AI workloads.
+Together, Arcee AI’s foundation models, Llama.cpp’s efficient runtime, and Graviton4’s compute capabilities give you everything you need to build scalable, production-grade AI applications.
-For more information on Arcee AI and how we can help you build high-quality, secure, and cost-efficient AI, solution, please visit [www.arcee.ai](https://www.arcee.ai).
+From chatbots and content generation to research tools, this stack strikes a balance between performance, cost, and developer control.
+For more information on Arcee AI and how you can build high-quality, secure, and cost-efficient AI solutions, visit [www.arcee.ai](https://www.arcee.ai).
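As a quick illustration of the application-development ideas above, here is a minimal sketch of calling a running `llama-server` instance over HTTP. It assumes the server was started locally with one of the quantized AFM-4.5B GGUF files and is listening on the default port (8080); adjust the host, port, model path, and prompt for your setup:

```bash
# Sketch: query llama-server's OpenAI-compatible chat endpoint with curl.
# Assumes the server is already running, for example:
#   bin/llama-server -m models/afm-4-5b/afm-4-5B-Q4_0.gguf --port 8080
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Give me three reasons to run small language models on Arm servers."}
        ],
        "max_tokens": 200
      }'
```

Because the endpoint follows the OpenAI API shape, the same request also works from any OpenAI-compatible client library, which makes it straightforward to integrate the model into existing applications.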
diff --git a/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/_index.md b/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/_index.md
index 4623d917d2..0f1369264c 100644
--- a/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/_index.md
+++ b/content/learning-paths/servers-and-cloud-computing/arcee-foundation-model-on-aws/_index.md
@@ -1,5 +1,5 @@
---
-title: Deploy Arcee AFM-4.5B on AWS Graviton4
+title: Deploy Arcee AFM-4.5B on Arm-based AWS Graviton4 with Llama.cpp
draft: true
cascade:
@@ -7,23 +7,23 @@ cascade:
minutes_to_complete: 30
-who_is_this_for: This is an introductory topic for developers and engineers who want to deploy the Arcee AFM-4.5B small language model on an AWS Arm-based instance. AFM-4.5B is a 4.5-billion-parameter frontier model that delivers excellent accuracy, strict compliance, and very high cost-efficiency. It was trained on almost 7 trillion tokens of clean, rigorously filtered data, and has been tested across a wide range of languages, including Arabic, English, French, German, Hindi, Italian, Korean, Mandarin, Portuguese, Russian, and Spanish.
+who_is_this_for: This Learning Path is for developers and ML engineers who want to deploy Arcee's AFM-4.5B small language model on AWS Graviton4 instances using Llama.cpp.
learning_objectives:
- - Launch and set up an Arm-based Graviton4 virtual machine on Amazon Web Services.
- - Build Llama.cpp from source.
- - Download AFM-4.5B from Hugging Face.
- - Quantize AFM-4.5B with Llama.cpp.
- - Deploy the model and run inference with Llama.cpp.
- - Evaluate the quality of quantized models by measuring perplexity.
+ - Launch an Arm-based EC2 instance on AWS Graviton4
+ - Build and install Llama.cpp from source
+ - Download and quantize the AFM-4.5B model from Hugging Face
+ - Run inference on the quantized model using Llama.cpp
+ - Evaluate model quality by measuring perplexity
prerequisites:
- - An [AWS account](https://aws.amazon.com/) with permission to launch c8g (Graviton4) instances.
- - Basic familiarity with SSH.
+ - An [AWS account](https://aws.amazon.com/) with permission to launch Graviton4 (`c8g.4xlarge` or larger) instances
+ - At least 128 GB of available storage
+ - Basic familiarity with Linux and SSH
author: Julien Simon
-### Tags
+# Tags
# Tagging metadata, see the Learning Path guide for the allowed values
skilllevels: Introductory
subjects: ML
@@ -31,7 +31,7 @@ arm_ips:
- Neoverse
tools_software_languages:
- Amazon Web Services
- - Linux
+ - Hugging Face
- Python
- Llama.cpp
operatingsystems:
@@ -39,28 +39,28 @@ operatingsystems:
further_reading:
- - resource:
- title: Arcee AI
- link: https://www.arcee.ai
- type: Website
- - resource:
- title: Announcing Arcee Foundation Models
- link: https://www.arcee.ai/blog/announcing-the-arcee-foundation-model-family
- type: Blog
- - resource:
- title: AFM-4.5B, the First Arcee Foundation Model
- link: https://www.arcee.ai/blog/deep-dive-afm-4-5b-the-first-arcee-foundational-model
- type: Blog
- - resource:
- title: Amazon EC2 Graviton Instances
- link: https://aws.amazon.com/ec2/graviton/
- type: Documentation
- - resource:
- title: Amazon EC2 Documentation
- link: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/
- type: Documentation
+ - resource:
+ title: Arcee AI
+ link: https://www.arcee.ai
+ type: website
+ - resource:
+ title: Announcing the Arcee Foundation Model family
+ link: https://www.arcee.ai/blog/announcing-the-arcee-foundation-model-family
+ type: blog
+ - resource:
+ title: Deep Dive - AFM-4.5B, the first Arcee Foundation Model
+ link: https://www.arcee.ai/blog/deep-dive-afm-4-5b-the-first-arcee-foundational-model
+ type: blog
+ - resource:
+ title: Amazon EC2 Graviton instances
+ link: https://aws.amazon.com/ec2/graviton/
+ type: documentation
+ - resource:
+ title: Amazon EC2 User Guide
+ link: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/
+ type: documentation
-### FIXED, DO NOT MODIFY
+# FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper