diff --git a/.github/workflows/content-checks.yml b/.github/workflows/content-checks.yml index 147bda1e3b..949573cb31 100644 --- a/.github/workflows/content-checks.yml +++ b/.github/workflows/content-checks.yml @@ -81,9 +81,11 @@ jobs: - name: Run Microsoft Security DevOps uses: microsoft/security-devops-action@latest id: msdo + with: + tools: container-mapping, bandit, eslint, templateanalyzer - name: Upload results to Security tab uses: actions/upload-artifact@v4 with: path: ${{ steps.msdo.outputs.sarifFile }} - retention-days: 5 # Default is 90 days \ No newline at end of file + retention-days: 5 # Default is 90 days diff --git a/.wordlist.txt b/.wordlist.txt index 0db9d5b849..7010bf19d2 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -3245,4 +3245,25 @@ sysbox tinyml tvOS watchOS -zilliz \ No newline at end of file +zilliz +ASGI +ComputeLibrary +FastAPI +GTSRB +KubeArchInspect +MLOPs +MiniLM +OpenBLAS +Requantize +UpsideDownCake +Uvicorn +WPA +WindowsPerfGUI +ZGC +Zouaoui +gui +kubearchinspect +mlops +multithreading +preloaded +requantize \ No newline at end of file diff --git a/assets/contributors.csv b/assets/contributors.csv index ddf7d58fbe..33d3bfb507 100644 --- a/assets/contributors.csv +++ b/assets/contributors.csv @@ -41,3 +41,4 @@ Annie Tallund,Arm,annietllnd,annietallund,, Cyril Rohr,RunsOn,crohr,cyrilrohr,, Rin Dobrescu,Arm,,,, Przemyslaw Wirkus,Arm,PrzemekWirkus,przemyslaw-wirkus-78b73352,, +Nader Zouaoui,Day Devs,nader-zouaoui,nader-zouaoui,@zouaoui_nader,https://daydevs.com/ diff --git a/content/install-guides/_images/wperf-vs-extension-counting-preview.png b/content/install-guides/_images/wperf-vs-extension-counting-preview.png new file mode 100644 index 0000000000..1fabf9ae90 Binary files /dev/null and b/content/install-guides/_images/wperf-vs-extension-counting-preview.png differ diff --git a/content/install-guides/_images/wperf-vs-extension-install-page.png b/content/install-guides/_images/wperf-vs-extension-install-page.png new file mode 
100644 index 0000000000..8a5606b710 Binary files /dev/null and b/content/install-guides/_images/wperf-vs-extension-install-page.png differ diff --git a/content/install-guides/_images/wperf-vs-extension-sampling-preview.png b/content/install-guides/_images/wperf-vs-extension-sampling-preview.png new file mode 100644 index 0000000000..c87931ca08 Binary files /dev/null and b/content/install-guides/_images/wperf-vs-extension-sampling-preview.png differ diff --git a/content/install-guides/acfl.md b/content/install-guides/acfl.md index 8a00d6c7c0..e7e6bd2655 100644 --- a/content/install-guides/acfl.md +++ b/content/install-guides/acfl.md @@ -23,11 +23,11 @@ title: Arm Compiler for Linux tool_install: true weight: 1 --- -[Arm Compiler for Linux (ACfL)](https://developer.arm.com/Tools%20and%20Software/Arm%20Compiler%20for%20Linux) is a suite of tools containing Arm C/C++ Compiler (`armclang`), Arm Fortran Compiler (`armflang`), and Arm Performance Libraries (`ArmPL`). It is tailored to the development of High Performance Computing (HPC) applications. +[Arm Compiler for Linux (ACfL)](https://developer.arm.com/Tools%20and%20Software/Arm%20Compiler%20for%20Linux) is a suite of tools containing Arm C/C++ Compiler (`armclang`), Arm Fortran Compiler (`armflang`), and Arm Performance Libraries (ArmPL). It is tailored to the development of High Performance Computing (HPC) applications. -`Arm Compiler for Linux` runs on 64-bit Arm machines, it is not a cross-compiler. +Arm Compiler for Linux runs on 64-bit Arm machines; it is not a cross-compiler. -You do not require any additional license to use `Arm Compiler for Linux`. +You do not require any additional license to use Arm Compiler for Linux.
## Arm-based hardware @@ -60,7 +60,7 @@ These packages can be installed with the appropriate package manager for your OS - Amazon Linux: environment-modules glibc-devel gzip procps python3 tar - Ubuntu: environment-modules libc6-dev python3 -Note: The minimum supported version for Python is version 3.6. +The minimum supported version for Python is version 3.6. You must have at least 2 GB of free hard disk space to both download and unpack the Arm Compiler for Linux package. You must also have an additional 6 GB of @@ -77,7 +77,7 @@ You are now ready to install ACfL [manually](#manual) or with [Spack](#spack). ## Download and install using install script -Use an Arm recommended script to select, download, and install your preferred `ACfL` package. +Use an Arm recommended script to select, download, and install your preferred ACfL package. ```console bash <(curl -L https://developer.arm.com/-/cdn-downloads/permalink/Arm-Compiler-for-Linux/Package/install.sh) @@ -95,46 +95,56 @@ sudo apt install wget ### Fetch the appropriate installer -`ACfL` installation packages are available to download from [Arm Developer](https://developer.arm.com/downloads/-/arm-compiler-for-linux). Individual `Arm Performance Libraries (ArmPL)` packages are also available. +ACfL installation packages are available to download from [Arm Developer](https://developer.arm.com/downloads/-/arm-compiler-for-linux). Individual Arm Performance Libraries (ArmPL) packages are also available. 
+ +Fetch the ACfL installers: -Fetch the `ACfL` installers: #### Ubuntu Linux: ```bash { target="ubuntu:latest" } -wget https://developer.arm.com/-/cdn-downloads/permalink/Arm-Compiler-for-Linux/Version_24.04/arm-compiler-for-linux_24.04_Ubuntu-22.04_aarch64.tar +wget https://developer.arm.com/-/cdn-downloads/permalink/Arm-Compiler-for-Linux/Version_24.10/arm-compiler-for-linux_24.10_Ubuntu-22.04_aarch64.tar ``` #### Red Hat Linux: ```bash { target="fedora:latest" } -wget https://developer.arm.com/-/cdn-downloads/permalink/Arm-Compiler-for-Linux/Version_24.04/arm-compiler-for-linux_24.04_RHEL-8_aarch64.tar +wget https://developer.arm.com/-/cdn-downloads/permalink/Arm-Compiler-for-Linux/Version_24.10/arm-compiler-for-linux_24.10_RHEL-9_aarch64.tar ``` ### Install -To install the `Arm Compiler for Linux` package on your 64-bit Linux Arm machine extract the package and run the installation script. +To install Arm Compiler for Linux on your 64-bit Linux Arm machine extract the package and run the installation script. -Each command sequence includes accepting the license agreement to automate the installation and installing the `modules` software. +Each command sequence includes accepting the license agreement to automate the installation and installing Environment Modules. 
#### Ubuntu Linux: ```bash { target="ubuntu:latest", env="DEBIAN_FRONTEND=noninteractive" } sudo -E apt-get -y install environment-modules python3 libc6-dev -tar -xvf arm-compiler-for-linux_24.04_Ubuntu-22.04_aarch64.tar -cd ./arm-compiler-for-linux_24.04_Ubuntu-22.04 -sudo ./arm-compiler-for-linux_24.04_Ubuntu-22.04.sh --accept +tar -xvf arm-compiler-for-linux_24.10_Ubuntu-22.04_aarch64.tar +cd ./arm-compiler-for-linux_24.10_Ubuntu-22.04 +sudo ./arm-compiler-for-linux_24.10_Ubuntu-22.04.sh --accept ``` #### Red Hat Linux: ```bash { target="fedora:latest" } sudo yum -y install environment-modules python3 glibc-devel -tar -xvf arm-compiler-for-linux_24.04_RHEL-8_aarch64.tar -cd arm-compiler-for-linux_24.04_RHEL-8 -sudo ./arm-compiler-for-linux_24.04_RHEL-8.sh --accept +tar -xvf arm-compiler-for-linux_24.10_RHEL-9_aarch64.tar +cd arm-compiler-for-linux_24.10_RHEL-9 +sudo ./arm-compiler-for-linux_24.10_RHEL-9.sh --accept ``` +{{% notice Warning %}} +⚠️ On RPM based systems (such as Red Hat), if an alternative version of GCC +(not the GCC bundled with ACfL) is installed **after** ACfL, you will not be +able to fully uninstall ACfL. For example, installing GDB (GNU Project +Debugger) pulls in the native system GCC; if that installation takes place +**after** ACfL, a full uninstall of ACfL is no longer possible. +{{% /notice %}} + ### Set up environment -`Arm Compiler for Linux` uses environment modules to dynamically modify your user environment. Refer to the [Environment Modules documentation](https://lmod.readthedocs.io/en/latest/#id) for more information. +Arm Compiler for Linux uses environment modules to dynamically modify your user environment. Refer to the [Environment Modules documentation](https://lmod.readthedocs.io/en/latest/#id) for more information. Set up the environment, for example, in your `.bashrc` and add module files.
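The `.bashrc` addition mentioned above can be sketched as follows. This is a minimal sketch only: the `/opt/arm` prefix is an assumption based on the default installation directory, and your module file location may differ.

```shell
# Make the ACfL module files visible to Environment Modules
# (assumes the default /opt/arm installation prefix)
module use /opt/arm/modulefiles
```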
@@ -163,25 +173,26 @@ module avail To configure Arm Compiler for Linux: ```bash { env_source="~/.bashrc" } -module load acfl/24.04 +module load acfl/24.10 ``` To configure GCC: ```bash { env_source="~/.bashrc" } -module load gnu/13.2.0 +module load gnu/14.2.0 ``` -`ACfL` is now [ready to use](#armclang). + +ACfL is now [ready to use](#armclang). ## Download and install with Spack {#spack} -`Arm Compiler for Linux` is available with the [Spack](https://spack.io/) package manager. +Arm Compiler for Linux is available with the [Spack](https://spack.io/) package manager. See the [Arm Compiler for Linux and Arm PL now available in Spack](https://community.arm.com/arm-community-blogs/b/high-performance-computing-blog/posts/arm-compiler-for-linux-and-arm-pl-now-available-in-spack) blog for full details. ### Setup Spack -Clone the `Spack` repostitory and add `bin` directory to the path: +Clone the Spack repository and add `bin` directory to the path: ```console git clone -c feature.manyFiles=true https://github.com/spack/spack.git @@ -194,17 +205,17 @@ Set up shell support: . /home/ubuntu/spack/share/spack/setup-env.sh ``` -`Spack` is now ready to use. +Spack is now ready to use. ### Install ACfL -Download and install `Arm Compiler for Linux` with: +Download and install Arm Compiler for Linux with: ```console spack install acfl ``` -If you wish to install just the `Arm Performance Libraries`, use: +If you wish to install just the Arm Performance Libraries, use: ```console spack install armpl-gcc @@ -218,7 +229,7 @@ spack load acfl spack compiler find ``` -`ACfL` is now [ready to use](#armclang). +ACfL is now [ready to use](#armclang). ## Get started with Arm C/C++ compiler {#armclang} @@ -226,7 +237,7 @@ spack compiler find To get started with the Arm C/C++ Compiler and compile a simple application follow the steps below. 
Check that the correct compiler version is being used: -```bash { env_source="~/.bashrc", pre_cmd="module load acfl/24.04" } +```bash { env_source="~/.bashrc", pre_cmd="module load acfl/24.10" } armclang --version ``` @@ -244,13 +255,13 @@ int main() Build the application with: -```console { env_source="~/.bashrc", pre_cmd="module load acfl/24.04" } +```console { env_source="~/.bashrc", pre_cmd="module load acfl/24.10" } armclang hello.c -o hello ``` Run the application with: -```bash { env_source="~/.bashrc", pre_cmd="module load acfl/24.04" } +```bash { env_source="~/.bashrc", pre_cmd="module load acfl/24.10" } ./hello ``` @@ -264,7 +275,7 @@ Hello, C World! To get started with the Arm Fortran Compiler and compile a simple application follow the steps below. Check that the correct compiler version is being used: -```bash { env_source="~/.bashrc", pre_cmd="module load acfl/24.04" } +```bash { env_source="~/.bashrc", pre_cmd="module load acfl/24.10" } armflang --version ``` @@ -278,12 +289,12 @@ end program hello ``` Build the application with: -```bash { env_source="~/.bashrc", pre_cmd="module load acfl/24.04" } +```bash { env_source="~/.bashrc", pre_cmd="module load acfl/24.10" } armflang hello.f90 -o hello ``` Run the application with: -```bash { env_source="~/.bashrc", pre_cmd="module load acfl/24.04" } +```bash { env_source="~/.bashrc", pre_cmd="module load acfl/24.10" } ./hello ``` diff --git a/content/install-guides/armpl.md b/content/install-guides/armpl.md index 7dcb2a42e1..38d80d3794 100644 --- a/content/install-guides/armpl.md +++ b/content/install-guides/armpl.md @@ -70,28 +70,28 @@ For more information refer to [Get started with Arm Performance Libraries](https In a terminal, run the command shown below to download the macOS package: ```console -wget https://developer.arm.com/-/media/Files/downloads/hpc/arm-performance-libraries/24-04/macos/arm-performance-libraries_24.04_macOS.tgz +wget 
https://developer.arm.com/-/media/Files/downloads/hpc/arm-performance-libraries/24-10/macos/arm-performance-libraries_24.10_macOS.tgz ``` Use tar to extract the file: ```console -tar zxvf arm-performance-libraries_24.04_macOS.tgz +tar zxvf arm-performance-libraries_24.10_macOS.tgz ``` Output of above command: ```console -armpl_24.04_flang-new_clang_18.dmg +armpl_24.10_flang-new_clang_19.dmg ``` Mount the disk image by running from a terminal: ```console -hdiutil attach armpl_24.04_flang-new_clang_18.dmg +hdiutil attach armpl_24.10_flang-new_clang_19.dmg ``` Now run the installation script as a superuser: ```console -/Volumes/armpl_24.04_flang-new_clang_18_installer/armpl_24.04_flang-new_clang_18_install.sh -y +/Volumes/armpl_24.10_flang-new_clang_19_installer/armpl_24.10_flang-new_clang_19_install.sh -y ``` Using this command you automatically accept the End User License Agreement and the packages are installed to the `/opt/arm` directory. If you want to change the installation directory location use the `--install_dir` option with the script and provide the desired directory location. @@ -103,33 +103,33 @@ For more information refer to [Get started with Arm Performance Libraries](https ## Linux {#linux} -Arm Performance Libraries are supported on most Linux Distributions like Ubuntu, RHEL, SLES and Amazon Linux on an `AArch64` host and compatible with various versions of GCC and NVHPC. The GCC compatible releases are built with GCC 13 and tested with GCC versions 7 to 13. The NVHPC compatible releases are built and tested with NVHPC 24.1. +Arm Performance Libraries are supported on most Linux Distributions like Ubuntu, RHEL, SLES and Amazon Linux on an `AArch64` host and compatible with various versions of GCC and NVHPC. The GCC compatible releases are built with GCC 14 and tested with GCC versions 7 to 14. The NVHPC compatible releases are built and tested with NVHPC 24.7. 
[Download](https://developer.arm.com/downloads/-/arm-performance-libraries) the appropriate package for your Linux distribution. The deb based installers can be used on Ubuntu 20 and Ubuntu 22. The RPM based installers can be used on the following supported distributions: - Amazon Linux 2, Amazon Linux 2023 -- RHEL-7, RHEL-8, RHEL-9 -- SLES-15 +- RHEL-8, RHEL-9 +- SLES-15 Service Packs 5 and 6 The instructions shown below are for deb based installers for GCC users. In a terminal, run the command shown below to download the debian package: ```console -wget https://developer.arm.com/-/media/Files/downloads/hpc/arm-performance-libraries/24-04/linux/arm-performance-libraries_24.04_deb_gcc.tar +wget https://developer.arm.com/-/media/Files/downloads/hpc/arm-performance-libraries/24-10/linux/arm-performance-libraries_24.10_deb_gcc.tar ``` Use `tar` to extract the file and then change directory: ```console -tar -xf arm-performance-libraries_24.04_deb_gcc.tar -cd arm-performance-libraries_24.04_deb/ +tar -xf arm-performance-libraries_24.10_deb_gcc.tar +cd arm-performance-libraries_24.10_deb/ ``` Run the installation script as a super user: ```console -sudo ./arm-performance-libraries_24.04_deb.sh --accept +sudo ./arm-performance-libraries_24.10_deb.sh --accept ``` Using the `--accept` switch you automatically accept the End User License Agreement and the packages are installed to the `/opt/arm` directory. @@ -165,13 +165,13 @@ module avail The output should be similar to: ```output -armpl/24.04.0_gcc +armpl/24.10.0_gcc ``` Load the appropriate module: ```console -module load armpl/24.04.0_gcc +module load armpl/24.10.0_gcc ``` You can now compile and test the examples included in the `/opt/arm//examples/`, or `//examples/` directory, if you have installed to a different location than the default. 
diff --git a/content/install-guides/java.md b/content/install-guides/java.md index 22b4d19252..363bcf0daa 100644 --- a/content/install-guides/java.md +++ b/content/install-guides/java.md @@ -234,6 +234,20 @@ To print the final values of the flags after the JVM has been initialized, run: java -XX:+PrintFlagsFinal -version ``` +Generally the biggest performance improvements from JVM flags can be obtained from heap and garbage collection (GC) tuning, as long as you understand your workload well. + +Default initial heap size is 1/64th of RAM and default maximum heap size is 1/4th of RAM. If you know your memory requirements, you should set both of these flags to the same value (e.g. `-Xms12g` and `-Xmx12g` for an application that uses at most 12 GB). Setting both flags to the same value will prevent the JVM from having to periodically allocate additional memory. Additionally, for cloud workloads max heap size is often set to 75%-85% of RAM, much higher than the default setting. + +If you are deploying in a cloud scenario where you might be deploying the same stack to systems that have varying amounts of RAM, you might want to use `-XX:MaxRAMPercentage` instead of `-Xmx`, so you can specify a percentage of max RAM rather than a fixed max heap size. This setting can also be helpful in containerized workloads. + +Garbage collector choice will depend on the workload pattern for which you're optimizing. + +* If your workload is a straightforward serial single-core load with no multithreading, the `UseSerialGC` flag should be set to true. +* For multi-core small heap batch jobs (<4GB), the `UseParallelGC` flag should be set to true. +* The G1 garbage collector (`UseG1GC` flag) is better for medium to large heaps (>4GB). This is the most commonly used GC for large parallel workloads, and is the default for high-core environments. If you want to optimize throughput, use this one. +* The ZGC (`UseZGC` flag) has low pause times, which can drastically improve tail latencies. 
If you want to prioritize response time at a small cost to throughput, use ZGC. +* The Shenandoah GC (`UseShenandoahGC` flag) is still fairly niche. It has ultra low pause times and concurrent evacuation, making it ideal for low-latency applications, at the cost of increased CPU use. + ## Are there other tools commonly used in Java projects? There are a number of Java-related tools you might like to install. diff --git a/content/install-guides/streamline-cli.md b/content/install-guides/streamline-cli.md index b8ccdef52d..faa863ac7b 100644 --- a/content/install-guides/streamline-cli.md +++ b/content/install-guides/streamline-cli.md @@ -68,29 +68,73 @@ Arm recommends that you profile an optimized release build of your application, If you are using the `workflow_topdown_basic option`, ensure that your application workload is at least 20 seconds long, in order to give the core time to capture all of the metrics needed. This time increases linearly as you add more metrics to capture. -## Install Streamline CLI Tools +## Using Python scripts -1. Download and extract the Streamline CLI tools on your Arm server: +The Python scripts provided with the Streamline CLI tools require Python 3.8 or later, and depend on several third-party modules. We recommend creating a Python virtual environment containing these modules to run the tools. + +Create a virtual environment: + +```sh +# From Bash +python3 -m venv sl-venv +source ./sl-venv/bin/activate +``` + +Your terminal prompt now shows `(sl-venv)` as a prefix, indicating that the virtual environment is active. + +{{% notice Note%}} +The instructions assume that you run all Python commands from inside the virtual environment. +{{% /notice %}} + +## Installing the tools {.reference} + +The Streamline CLI tools are available as a standalone download to enable easy integration into server workflows.
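The heap-sizing rules of thumb added to the Java install guide above (initial heap defaulting to 1/64 of RAM, maximum heap to 1/4, and cloud deployments often using around 75% of RAM) can be sanity-checked with a little shell arithmetic. This is an illustrative sketch only; the 16 GiB machine size is an assumed example, not a value from this change.

```shell
# Illustrative JVM heap sizing for a machine with 16 GiB of RAM (values in MiB)
RAM_MB=16384
DEFAULT_XMS=$((RAM_MB / 64))      # JVM default initial heap: 1/64 of RAM
DEFAULT_XMX=$((RAM_MB / 4))       # JVM default maximum heap: 1/4 of RAM
CLOUD_XMX=$((RAM_MB * 75 / 100))  # common cloud setting: ~75% of RAM
echo "defaults: -Xms${DEFAULT_XMS}m -Xmx${DEFAULT_XMX}m"
echo "cloud:    -Xmx${CLOUD_XMX}m (or -XX:MaxRAMPercentage=75)"
```

On a 16 GiB machine this prints `-Xms256m -Xmx4096m` for the defaults, which shows how far below the cloud-style 75% setting they sit.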
+ +To download the latest version of the tool and extract it to the current working directory you can use our download utility script: + +```sh +wget https://artifacts.tools.arm.com/arm-performance-studio/Streamline_CLI_Tools/get-streamline-cli.py +python3 get-streamline-cli.py install +python3 -m pip install -r ./streamline_cli_tools/bin/requirements.txt +``` + +If you want to add the Streamline tools to your search path: + +```sh +export PATH=$PATH:$PWD/streamline_cli_tools/bin +``` + +The script can also be used to download a specific version, or install to a user-specified directory: + +* To list all available versions: ```sh - wget https://artifacts.tools.arm.com/arm-performance-studio/2024.3/Arm_Streamline_CLI_Tools_9.2.2_linux_arm64.tgz  - tar -xzf Arm_Streamline_CLI_Tools_9.2.2_linux_arm64.tgz  + python3 get-streamline-cli.py list ``` -1. The `sl-format.py` Python script requires Python 3.8 or later, and depends on several third-party modules. We recommend creating a Python virtual environment containing these modules to run the tools. For example: +* To download, but not install, a specific version: ```sh - # From Bash - python3 -m venv sl-venv - source ./sl-venv/bin/activate + python3 get-streamline-cli.py download --tool-version + ``` + +* To download and install a specific version: - # From inside the virtual environment - python3 -m pip install -r ./streamline_cli_tools/bin/requirements.txt + ```sh + python3 get-streamline-cli.py install --tool-version ``` - {{% notice Note%}} - The instructions in this guide assume you have added the `/bin/` directory to your `PATH` environment variable, and that you run all Python commands from inside the virtual environment. 
- {{% /notice %}} +* To download and install to a specific directory: + + ```sh + python3 get-streamline-cli.py install --install-dir + ``` + +For manual download, you can find all available releases here: + +```url +https://artifacts.tools.arm.com/arm-performance-studio/Streamline_CLI_Tools/ +``` ## Applying the kernel patch @@ -135,10 +179,10 @@ Follow these steps to integrate these patches into an RPM-based distribution's k 1. Install the RPM build tools: - ``` + ```sh sudo yum install rpm-build rpmdevtools ``` - + 1. Remove any existing `rpmbuild` directory, renaming as appropriate: ```sh @@ -150,6 +194,7 @@ Follow these steps to integrate these patches into an RPM-based distribution's k ```sh yum download --source kernel ``` + 1. Install the sources binary: ```sh @@ -157,11 +202,13 @@ Follow these steps to integrate these patches into an RPM-based distribution's k ``` 1. Enter the `rpmbuild` directory that is created: + ```sh cd rpmbuild ``` 1. Copy the patch into the correct location.
Replace the 9999 patch number with the next available patch number in the sequence: + ```sh cp vX.Y-combined.patch SOURCES/9999-strobing-patch.patch ``` diff --git a/content/install-guides/windows-perf-vs-extension.md b/content/install-guides/windows-perf-vs-extension.md new file mode 100644 index 0000000000..c55e3298f0 --- /dev/null +++ b/content/install-guides/windows-perf-vs-extension.md @@ -0,0 +1,106 @@ +--- +### Title the install tools article with the name of the tool to be installed +### Include vendor name where appropriate +title: WindowsPerf Visual Studio Extension +draft: true +cascade: + draft: true + +minutes_to_complete: 10 + +official_docs: https://github.com/arm-developer-tools/windowsperf-vs-extension + +author_primary: Nader Zouaoui + +### Optional additional search terms (one per line) to assist in finding the article +additional_search_terms: + - perf + - profiling + - profiler + - windows + - woa + - windows on arm + - visual studio +### FIXED, DO NOT MODIFY +weight: 1 # Defines page ordering. Must be 1 for first (or only) page. +tool_install: true # Set to true to be listed in main selection page, else false +multi_install: FALSE # Set to true if first page of multi-page article, else false +multitool_install_part: false # Set to true if a sub-page of a multi-page article, else false +layout: installtoolsall # DO NOT MODIFY. Always true for tool install articles +--- + +## Introduction + +WindowsPerf is a lightweight performance profiling tool inspired by Linux Perf, and specifically tailored for Windows on Arm. It leverages the ARM64 PMU (Performance Monitor Unit) and its hardware counters to offer precise profiling capabilities. + +Recognizing the complexities of command-line interaction, the WindowsPerf GUI is a Visual Studio 2022 extension created to provide a more intuitive, integrated experience within the integrated development environment (IDE). 
This tool enables developers to interact with WindowsPerf, adjust settings, and visualize performance data seamlessly in Visual Studio. + +## A glimpse of the available features + +The WindowsPerf GUI extension is composed of several key features, each designed to streamline the user experience: + +- **WindowsPerf Configuration**: Connect directly to `wperf.exe` for seamless integration. Configuration is accessible via `Tools -> Options -> Windows Perf -> WindowsPerf Path`. +- **Host Data**: Understand your environment with `Tools -> WindowsPerf Host Data`, offering insights into tests run by WindowsPerf. +- **Output Logging**: All commands executed through the GUI are logged, ensuring transparency and aiding in performance analysis. +- **Sampling UI**: Customize your sampling experience by selecting events, setting frequency and duration, choosing programs for sampling, and comprehensively analyzing results. + +![Sampling preview #center](../_images/wperf-vs-extension-sampling-preview.png "Sampling settings UI Overview") + + +- **Counting Settings UI**: Build a `wperf stat` command from scratch using the configuration interface, then view the output in the IDE or open it with Windows Performance Analyzer (WPA). + + +![Counting preview #center](../_images/wperf-vs-extension-counting-preview.png "Counting settings UI Overview") + + +## Getting Started + +### Prerequisites + +- **Visual Studio 2022**: Ensure you have Visual Studio 2022 installed on your Windows on Arm device. +- **WindowsPerf**: Download and install WindowsPerf by following the [WindowsPerf install guide](/install-guides/wperf/). +- **LLVM** (Recommended): You can install the LLVM toolchain by following the [LLVM toolchain for Windows on Arm install guide](/install-guides/llvm-woa). +{{% notice llvm-objdump %}} +The disassembly feature needs `llvm-objdump` to be available on the `%PATH%` to work properly.
+{{% /notice %}} + +### Installation from Visual Studio Extension Manager + +To install the WindowsPerf Visual Studio Extension from Visual Studio: + +1. Open Visual Studio 2022 +2. Go to the `Extensions` menu +3. Select **Manage Extensions** +4. Click the search bar (or press `Ctrl` + `L`) and type `WindowsPerf` +5. Click the install button and restart Visual Studio + +![WindowsPerf install page #center](../_images/wperf-vs-extension-install-page.png) + +### Installation from GitHub + +You can also install the WindowsPerf Visual Studio Extension from GitHub. + +Download the installation file directly from the [GitHub release page](https://github.com/arm-developer-tools/windowsperf-vs-extension/releases). + +Unzip the downloaded file and double-click the `WindowsPerfGUI.vsix` file. + +{{% notice Note %}} +Make sure that any previous version of the extension is uninstalled and that Visual Studio is closed before installing the extension. +{{% /notice %}} + +### Build from source + +To build the source code, clone [the repository](https://github.com/arm-developer-tools/windowsperf-vs-extension) and build the `WindowsPerfGUI` solution using the default configuration in Visual Studio. + +Building the source is not required, but offered as an alternative installation method if you want to customize the extension. + +### WindowsPerf Setup + +To get started, you must link the GUI with the executable file `wperf.exe` by navigating to `Tools -> Options -> WindowsPerf -> WindowsPerf Path`. This step is required; the extension does not work without it. + +## Uninstall the WindowsPerfGUI extension + +In Visual Studio, go to `Extensions` -> `Manage Extensions` -> `Installed` -> `All` -> `WindowsPerfGUI` and select "Uninstall". + +Visual Studio schedules the uninstall. You might need to close the Visual Studio instance and follow the uninstall wizard to remove the extension.
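The extension's disassembly view depends on `llvm-objdump` being discoverable on the search path, as the notice in the guide above states. A quick way to verify this from a shell before opening Visual Studio is sketched below; the check itself is an illustration, only the tool name comes from the guide.

```shell
# Check whether llvm-objdump is resolvable via the PATH search
if command -v llvm-objdump >/dev/null 2>&1; then
  echo "llvm-objdump: found"
  FOUND=1
else
  echo "llvm-objdump: not found - install LLVM and add its bin directory to PATH"
  FOUND=0
fi
```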
diff --git a/content/install-guides/wperf.md b/content/install-guides/wperf.md index 7bba607a4f..3abe9025f2 100644 --- a/content/install-guides/wperf.md +++ b/content/install-guides/wperf.md @@ -1,7 +1,7 @@ --- ### Title the install tools article with the name of the tool to be installed ### Include vendor name where appropriate -title: Perf for Windows on Arm (WindowsPerf) +title: WindowsPerf (wperf) ### Optional additional search terms (one per line) to assist in finding the article additional_search_terms: @@ -30,38 +30,36 @@ multitool_install_part: false # Set to true if a sub-page of a multi-page arti layout: installtoolsall # DO NOT MODIFY. Always true for tool install articles --- -WindowsPerf is an open-source command line tool for performance analysis on Windows on Arm devices. +WindowsPerf is a Linux Perf-inspired Windows on Arm performance profiling tool. Profiling is based on the Arm AArch64 PMU and its hardware counters. WindowsPerf supports the counting model for obtaining aggregate counts of occurrences of PMU events, and the sampling model for determining the frequencies of event occurrences produced by program locations at the function, basic block, and instruction levels. WindowsPerf is an open-source project hosted on [GitHub](https://github.com/arm-developer-tools/windowsperf). -WindowsPerf consists of a kernel-mode driver and a user-space command-line tool, or [VS Code Extension](#vscode). The command-line tool is modeled after the Linux `perf` command. +WindowsPerf consists of a kernel-mode driver and a user-space command-line tool. You can seamlessly integrate the WindowsPerf command line tool with both the [WindowsPerf Visual Studio Extension](#vs2022) and the [WindowsPerf VS Code Extension](#vscode). These extensions, which you can download from the Visual Studio Marketplace, enhance the functionality of WindowsPerf by providing a user-friendly interface, and additional features for performance analysis and debugging. 
This integration allows developers to efficiently analyze and optimize their applications directly within their preferred development environment. -WindowsPerf includes a **counting model** for counting events such as cycles, instructions, and cache events and a **sampling model** to understand how frequently events occur. -{{% notice Virtual Machines%}} -WindowsPerf cannot be used on virtual machines, such as cloud instances. +{{% notice Note%}} +You cannot use WindowsPerf on virtual machines, such as cloud instances. {{% /notice %}} -You can interact with the - ## Visual Studio and the Windows Driver Kit (WDK) -WindowsPerf relies on `dll` files installed with Visual Studio (Community Edition or higher) and (optionally) installers from the Windows Driver Kit extension. +WindowsPerf relies on `dll` files installed with Visual Studio, from the Community Edition or higher and, optionally, installers from the Windows Driver Kit extension. -[Download the Windows Driver Kit (WDK)](https://learn.microsoft.com/en-us/windows-hardware/drivers/download-the-wdk) explains the WDK installation process. +For information about the WDK installation process, see [Download the Windows Driver Kit (WDK)](https://learn.microsoft.com/en-us/windows-hardware/drivers/download-the-wdk). See also the [Visual Studio for Windows on Arm install guide](/install-guides/vs-woa/). 
## Download WindowsPerf -The latest release package `windowsperf-bin-.zip` can be downloaded from the Linaro GitLab repository: +You can download the latest release package, `windowsperf-bin-.zip`, from the Arm GitHub repository: ```url -https://gitlab.com/Linaro/WindowsPerf/windowsperf/-/releases +https://github.com/arm-developer-tools/windowsperf/releases ``` + To download directly from command prompt, use: ```console mkdir windowsperf-bin-3.8.0 cd windowsperf-bin-3.8.0 -curl https://gitlab.com/api/v4/projects/40381146/packages/generic/windowsperf/3.8.0/windowsperf-bin-3.8.0.zip --output windowsperf-bin-3.8.0.zip +curl -L -O https://github.com/arm-developer-tools/windowsperf/releases/download/3.8.0/windowsperf-bin-3.8.0.zip ``` Unzip the package: @@ -70,84 +68,58 @@ Unzip the package: tar -xmf windowsperf-bin-3.8.0.zip ``` -## Install VS Code Extension (optional) {#vscode} - -In addition to the command-line tools, `WindowsPerf` is available on the [VS Code Marketplace](https://marketplace.visualstudio.com/items?itemName=Arm.windowsperf). - -Install by opening the `Extensions` view (`Ctrl`+`Shift`+`X`) and searching for `WindowsPerf`. Click `Install`. - -Open `Settings` (`Ctrl`+`,`) > `Extensions` > `WindowsPerf`, and specify the path to the `wperf` executable. - -{{% notice Non-Windows on Arm host%}} -You can only generate reports from a Windows on Arm device. - -If using a non-Windows on Arm host, you can import and analyze `WindowsPerf` JSON reports from such devices. - -You do not need to install `wperf` on non-Windows on Arm devices. -{{% /notice %}} - - ## Install wperf driver -You can install the kernel driver using either the Visual Studio [devcon](#devcon_install) utility or the supplied [installer](#devgen_install). +You can install the kernel driver using the supplied `wperf-devgen` installer.
+ +The [wperf-devgen](https://github.com/arm-developer-tools/windowsperf/blob/main/wperf-devgen/README.md) tool has been designated as the preferred installer and uninstaller for the WindowsPerf Kernel Driver in the latest release. This tool offers a simple process for managing the installation and removal of the driver. {{% notice Note%}} You must install the driver as `Administrator`. {{% /notice %}} -Open a `Windows Command Prompt` terminal with `Run as administrator` enabled. - -Navigate to the `windowsperf-bin-` directory. -```command -cd windowsperf-bin-3.8.0 -``` +Open a **Windows Command Prompt** terminal with **Run as administrator** selected. -### Install with devcon {#devcon_install} - -Navigate into the `wperf-driver` folder, and use `devcon` to install the driver: +Make sure you are in the `windowsperf-bin-` directory: ```command -cd wperf-driver -devcon install wperf-driver.inf Root\WPERFDRIVER -``` -You will see output similar to: - -```output -Device node created. Install is complete when drivers are installed... -Updating drivers for Root\WPERFDRIVER from \wperf-driver.inf. -Drivers installed successfully. +cd windowsperf-bin-3.8.0 ``` ### Install with wperf-devgen {#devgen_install} Navigate to the `wperf-driver` folder and run the installer: + ```command cd wperf-driver wperf-devgen install ``` -You will see output similar to: -```output + +The output should be similar to: + +```output Executing command: install. Install requested. -Waiting for device creation... -Device installed successfully. -Trying to install driver... -Success installing driver. +Device installed successfully ``` + ## Verify install You can check everything is working by running the `wperf` executable. {{% notice Note%}} -Once the above driver is installed, you can use `wperf` without `Administrator` privileges. +Once you have installed the driver, you can use `wperf` without `Administrator` privileges. {{% /notice %}} For example: + ```command -cd .. 
+cd ..\wperf wperf --version ``` -You should see output similar to: + +You see output similar to: + ```output Component Version GitVer FeatureString ========= ======= ====== ============= @@ -155,43 +127,73 @@ You should see output similar to: wperf-driver 3.8.0 6d15ddfc +etw-drv ``` - ## Uninstall wperf driver -You can uninstall (aka "remove") the kernel driver using either the Visual Studio [devcon](#devcon_uninstall) utility or the supplied [installer](#devgen_uninstall). +You can uninstall (or *remove*) the kernel driver using the supplied [wperf-devgen](#devgen_uninstall) uninstaller. {{% notice Note%}} You must uninstall the driver as `Administrator`. {{% /notice %}} -### Uninstall with devcon {#devcon_uninstall} - -Below command removes the device from the device tree and deletes the device stack for the device. As a result of these actions, child devices are removed from the device tree and the drivers that support the device are unloaded. See [DevCon Remove](https://learn.microsoft.com/en-us/windows-hardware/drivers/devtest/devcon-remove) article for more details. - -```command -devcon remove wperf-driver.inf Root\WPERFDRIVER -``` -You should see output similar to: -```output -ROOT\SYSTEM\0001 : Removed -1 device(s) were removed. -``` - ### Uninstall with wperf-devgen {#devgen_uninstall} ```command +cd windowsperf-bin-3.8.0\wperf-driver wperf-devgen uninstall ``` -You should see output similar to: + +The output is similar to: + ```console Executing command: uninstall. Uninstall requested. -Waiting for device creation... -Device uninstalled successfully. -Trying to remove driver \wperf-driver.inf. -Driver removed successfully. 
+Root\WPERFDRIVER +Device found +Device uninstalled successfully ``` +## Install WindowsPerf Visual Studio Extension (optional) {#vs2022} + +WindowsPerf GUI (Graphical User Interface) is a Visual Studio 2022 extension designed to bring a seamless UI experience to WindowsPerf, the command-line performance profiling tool for Windows on Arm. It is available on the [Visual Studio Marketplace](https://marketplace.visualstudio.com/items?itemName=Arm.WindowsPerfGUI). + +Install it by opening the **Extensions** menu, clicking **Manage Extensions**, and then clicking **Browse**. Type `WindowsPerf` to find the Arm WindowsPerf GUI extension, and click **Install**. + +{{% notice How to set up wperf.exe path in the extension%}} +To set the path to the `wperf.exe` executable, go to **Tools** -> **Options** -> **WindowsPerf** -> **WindowsPerf Path**, enter the absolute path to the `wperf.exe` executable, and then click **Validate**. +{{% /notice %}} + +Also, visit the WindowsPerf GUI project on [GitHub](https://github.com/arm-developer-tools/windowsperf-vs-extension) for more details and the latest updates. + +## Install WindowsPerf VS Code Extension (optional) {#vscode} + +In addition to the command-line tools, `WindowsPerf` is available on the [VS Code Marketplace](https://marketplace.visualstudio.com/items?itemName=Arm.windowsperf). + +Install by opening the **Extensions** view (Ctrl+Shift+X) and searching for `WindowsPerf`. Click **Install**. + +Open **Settings** (Ctrl+,) > **Extensions** > **WindowsPerf**, and specify the path to the `wperf` executable. + +{{% notice Non-Windows on Arm host%}} +You can only generate reports from a Windows on Arm device. + +If you are using a non-Windows on Arm host, you can import and analyze `WindowsPerf` JSON reports from such devices. + +You do not need to install `wperf` on non-Windows on Arm devices. 
+{{% /notice %}} + ## Further reading -[Announcing WindowsPerf: Open-source performance analysis tool for Windows on Arm](https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/announcing-windowsperf) +### WindowsPerf + +- [WindowsPerf Release 3.7.2 blog post](https://www.linaro.org/blog/expanding-profiling-capabilities-with-windowsperf-372-release) +- [WindowsPerf Release 3.3.0 blog post](https://www.linaro.org/blog/windowsperf-release-3-3-0/) +- [WindowsPerf Release 3.0.0 blog post](https://www.linaro.org/blog/windowsperf-release-3-0-0/) +- [WindowsPerf Release 2.5.1 blog post](https://www.linaro.org/blog/windowsperf-release-2-5-1/) +- [WindowsPerf release 2.4.0 introduces the first stable version of sampling model support](https://www.linaro.org/blog/windowsperf-release-2-4-0-introduces-the-first-stable-version-of-sampling-model-support/) +- [Announcing WindowsPerf: Open-source performance analysis tool for Windows on Arm](https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/announcing-windowsperf) + +### WindowsPerf GUI + +- [Introducing the WindowsPerf GUI: the Visual Studio 2022 extension](https://www.linaro.org/blog/introducing-the-windowsperf-gui-the-visual-studio-2022-extension/) +- [Introducing 1.0.0-beta release of WindowsPerf Visual Studio extension](https://www.linaro.org/blog/introducing-1-0-0-beta-release-of-windowsperf-visual-studio-extension/) +- [New Release: WindowsPerf Visual Studio Extension v1.0.0](https://www.linaro.org/blog/new-release-windowsperf-visual-studio-extension-v1000/) +- [Launching WindowsPerf Visual Studio Extension v2.1.0](https://www.linaro.org/blog/launching--windowsperf-visual-studio-extension-v210/) diff --git a/content/learning-paths/laptops-and-desktops/windowsperf_sampling_cpython/_next-steps.md b/content/learning-paths/laptops-and-desktops/windowsperf_sampling_cpython/_next-steps.md index d25f2ee4cb..b46623df90 100644 --- 
a/content/learning-paths/laptops-and-desktops/windowsperf_sampling_cpython/_next-steps.md +++ b/content/learning-paths/laptops-and-desktops/windowsperf_sampling_cpython/_next-steps.md @@ -12,7 +12,7 @@ recommended_path: "/learning-paths/laptops-and-desktops/win_net/" # further_reading links to references related to this path. Can be: # Manuals for a tool / software mentioned (type: documentation) # Blog about related topics (type: blog) - # General online references (type: website) + # General online references (type: website) further_reading: - resource: @@ -32,16 +32,40 @@ further_reading: link: https://www.linaro.org/blog/windowsperf-release-3-0-0/ type: blog - resource: - title: Windows on Arm overview + title: WindowsPerf Release 3.3.0 + link: https://www.linaro.org/blog/windowsperf-release-3-3-0/ + type: blog + - resource: + title: WindowsPerf Release 3.7.2 + link: https://www.linaro.org/blog/expanding-profiling-capabilities-with-windowsperf-372-release + type: blog + - resource: + title: "Introducing the WindowsPerf GUI: the Visual Studio 2022 extension" + link: https://www.linaro.org/blog/introducing-the-windowsperf-gui-the-visual-studio-2022-extension + type: blog + - resource: + title: "Introducing 1.0.0-beta release of WindowsPerf Visual Studio extension" + link: https://www.linaro.org/blog/introducing-1-0-0-beta-release-of-windowsperf-visual-studio-extension + type: blog + - resource: + title: "New Release: WindowsPerf Visual Studio Extension v1.0.0" + link: https://www.linaro.org/blog/new-release-windowsperf-visual-studio-extension-v1000 + type: blog + - resource: + title: "Launching WindowsPerf Visual Studio Extension v2.1.0" + link: https://www.linaro.org/blog/launching--windowsperf-visual-studio-extension-v210 + type: blog + - resource: + title: "Windows on Arm overview" link: https://learn.microsoft.com/en-us/windows/arm/overview type: website - resource: - title: Linaro Windows on Arm project + title: "Linaro Windows on Arm project" link: 
https://www.linaro.org/windows-on-arm/ type: website - resource: - title: WindowsPerf releases - link: https://gitlab.com/Linaro/WindowsPerf/windowsperf/-/releases + title: "WindowsPerf releases" + link: https://github.com/arm-developer-tools/windowsperf/releases type: website # ================================================================================ # FIXED, DO NOT MODIFY diff --git a/content/learning-paths/microcontrollers/cmsis-dsp/cmsis-dsp-tests.md b/content/learning-paths/microcontrollers/cmsis-dsp/cmsis-dsp-tests.md index d321c0c8d9..3d76893764 100644 --- a/content/learning-paths/microcontrollers/cmsis-dsp/cmsis-dsp-tests.md +++ b/content/learning-paths/microcontrollers/cmsis-dsp/cmsis-dsp-tests.md @@ -7,7 +7,7 @@ weight: 5 # 1 is first, 2 is second, etc. # Do not modify these elements layout: "learningpathall" --- -The [CMSIS-DSP tests](https://github.com/ARM-software/CMSIS-DSP/blob/main/Testing) are publicly available, and are used for validation of the library. They can be run on the [Corstone-300](https://developer.arm.com/Processors/Corstone-300) Fixed Virtual Platform (FVP). +The [CMSIS-DSP tests](https://github.com/ARM-software/CMSIS-DSP/blob/main/Testing) are publicly available, and are used for validation of the library. They can be run on the [Corstone reference systems](https://www.arm.com/products/silicon-ip-subsystems/), for example [Corstone-300](https://developer.arm.com/Processors/Corstone-300) Fixed Virtual Platform (FVP). These tests are primarily for Arm internal use, but users can replicate if they wish. Else proceed to the [next step](/learning-paths/microcontrollers/cmsis-dsp/_review/). 
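If you do want to replicate a run, the sketch below shows how a test image built for the FVP can be launched from the command line. The FVP executable name matches the Corstone-300 Ecosystem FVP download; the test image path is a placeholder for whichever binary you build from the CMSIS-DSP test suite:

```console
FVP_Corstone_SSE-300_Ethos-U55 -a ./build/cmsis_dsp_test.elf
```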
diff --git a/content/learning-paths/microcontrollers/iot-sdk/_next-steps.md b/content/learning-paths/microcontrollers/iot-sdk/_next-steps.md index 99af21044b..06f884c97e 100644 --- a/content/learning-paths/microcontrollers/iot-sdk/_next-steps.md +++ b/content/learning-paths/microcontrollers/iot-sdk/_next-steps.md @@ -14,7 +14,7 @@ recommended_path: "/learning-paths/microcontrollers/tfm" # further_reading links to references related to this path. Can be: # Manuals for a tool / software mentioned (type: documentation) # Blog about related topics (type: blog) - # General online references (type: website) + # General online references (type: website) further_reading: - resource: @@ -25,6 +25,10 @@ further_reading: title: Arm Speech Recognition Total Solution example video, using the Arm Open IoT SDK, Corstone-310 and AVH link: https://devsummit.arm.com/flow/arm/devsummit22/sessions-catalog/page/sessions/session/16600464346670018mPQ type: website + - resource: + title: Learn more about the Corstone reference systems + link: https://www.arm.com/products/silicon-ip-subsystems/ + type: website # ================================================================================ # FIXED, DO NOT MODIFY diff --git a/content/learning-paths/microcontrollers/mlek/build.md b/content/learning-paths/microcontrollers/mlek/build.md index 967030ab9b..0872560bfd 100644 --- a/content/learning-paths/microcontrollers/mlek/build.md +++ b/content/learning-paths/microcontrollers/mlek/build.md @@ -9,31 +9,31 @@ layout: "learningpathall" --- The [Arm ML Evaluation Kit (MLEK)](https://review.mlplatform.org/plugins/gitiles/ml/ethos-u/ml-embedded-evaluation-kit) provides a number of ready-to-use ML applications. These allow you to investigate the embedded software stack and evaluate performance on the Cortex-M55 and Ethos-U55 processors. 
-You can use the MLEK source code to build sample applications and run them on the [Corstone-300](https://developer.arm.com/Processors/Corstone-300) Fixed Virtual Platform (FVP). +You can use the MLEK source code to build sample applications and run them on the [Corstone reference systems](https://www.arm.com/products/silicon-ip-subsystems/), for example the [Corstone-300](https://developer.arm.com/Processors/Corstone-300) Fixed Virtual Platform (FVP). -## Before you begin +## Before you begin You can use your own Ubuntu Linux host machine or use [Arm Virtual Hardware (AVH)](https://www.arm.com/products/development-tools/simulation/virtual-hardware) for this Learning Path. -The Ubuntu version should be 20.04 or 22.04. The `x86_64` architecture must be used because the Corstone-300 FVP is not currently available for the Arm architecture. You will need a Linux desktop to run the FVP because it opens graphical windows for input and output from the software applications. +The Ubuntu version should be 20.04 or 22.04. The `x86_64` architecture must be used because the Corstone-300 FVP is not currently available for the Arm architecture. You will need a Linux desktop to run the FVP because it opens graphical windows for input and output from the software applications. If you want to use Arm Virtual Hardware the [Arm Virtual Hardware install guide](/install-guides/avh#corstone) provides setup instructions. -### Compilers +### Compilers -The examples can be built with [Arm Compiler for Embedded](https://developer.arm.com/Tools%20and%20Software/Arm%20Compiler%20for%20Embedded) or [Arm GNU Toolchain](https://developer.arm.com/Tools%20and%20Software/GNU%20Toolchain). +The examples can be built with [Arm Compiler for Embedded](https://developer.arm.com/Tools%20and%20Software/Arm%20Compiler%20for%20Embedded) or [Arm GNU Toolchain](https://developer.arm.com/Tools%20and%20Software/GNU%20Toolchain). 
Use the install guides to install the compilers on your computer: - [Arm Compiler for Embedded](/install-guides/armclang/) - [Arm GNU Toolchain](/install-guides/gcc/arm-gnu) -Both compilers are pre-installed in Arm Virtual Hardware. +Both compilers are pre-installed in Arm Virtual Hardware. ### Corstone-300 FVP {#fvp} -To install the Corstone-300 FVP on your computer refer to the [install guide for Arm Ecosystem FVPs](/install-guides/fm_fvp). +To install the Corstone-300 FVP on your computer refer to the [install guide for Arm Ecosystem FVPs](/install-guides/fm_fvp). -The Corstone-300 FVP is pre-installed in Arm Virtual Hardware. +The Corstone-300 FVP is pre-installed in Arm Virtual Hardware. ## Clone the repository @@ -54,9 +54,9 @@ git submodule update --init ## Build the example applications -The default compiler is `gcc`, but `armclang` can also be used. +The default compiler is `gcc`, but `armclang` can also be used. -You can select either compiler to build applications. You can also try them both and compare the results. +You can select either compiler to build applications. You can also try them both and compare the results. - Build with Arm GNU Toolchain (`gcc`) @@ -70,6 +70,6 @@ You can select either compiler to build applications. You can also try them both ./build_default.py --toolchain arm ``` -The build will take a few minutes. +The build will take a few minutes. When the build is complete, you will find the example images (`.axf` files) in the `cmake-build-*/bin` directory. The `cmake-build` directory names are specific to the compiler used and Ethos-U55 configuration. 
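Assuming the Corstone-300 FVP from the install guide above is on your `PATH`, a sketch of locating and running a built image looks like the following; the directory and application names are illustrative and depend on your toolchain and Ethos-U55 configuration:

```console
# List the example applications produced by the build
ls cmake-build-*/bin/*.axf

# Launch one image on the Corstone-300 FVP (image name is illustrative)
FVP_Corstone_SSE-300_Ethos-U55 -a cmake-build-mps3-sse-300-ethos-u55-128-gnu-Release/bin/ethos-u-kws.axf
```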
diff --git a/content/learning-paths/microcontrollers/nav-mlek/intro.md b/content/learning-paths/microcontrollers/nav-mlek/intro.md index 155fe7c710..46e4eb6405 100644 --- a/content/learning-paths/microcontrollers/nav-mlek/intro.md +++ b/content/learning-paths/microcontrollers/nav-mlek/intro.md @@ -1,7 +1,7 @@ --- title: Overview -weight: 2 +weight: 2 # Do not modify these elements layout: "learningpathall" @@ -9,8 +9,8 @@ layout: "learningpathall" As a microcontroller software developer, you likely start projects by identifying tools and software, setting up a development environment, and gathering evaluation boards and models. -Machine Learning (ML) applications follow the same pattern, but introduce additional complexity around the inclusion of machine learning models, software libraries for ML operations, and driver software to program neural processing unit (NPU) hardware. +Machine Learning (ML) applications follow the same pattern, but introduce additional complexity around the inclusion of machine learning models, software libraries for ML operations, and driver software to program neural processing unit (NPU) hardware. -The [Corstone-300](https://developer.arm.com/Processors/Corstone-300) and [Corstone-310](https://developer.arm.com/Processors/Corstone-310) reference designs are the basis of many ML IoT solutions. These designs offer a jump start on building hardware for ML applications. There are many software tools and examples available to get started creating ML applications, but you may find it difficult to see the big picture and understand which tools and software are best for you. +The [Corstone reference systems](https://www.arm.com/products/silicon-ip-subsystems) are the basis of many ML IoT solutions. These designs offer a jump start on building hardware for ML applications. 
There are many software tools and examples available to get started creating ML applications, but you may find it difficult to see the big picture and understand which tools and software are best for you. You can review the differences between the Corstone reference systems on the [Arm Developer Homepage for Corstone](https://developer.arm.com/documentation/102801/latest/). This Learning Path is to help you get started with Cortex-M and Ethos-U machine learning application development. diff --git a/content/learning-paths/microcontrollers/nav-mlek/platforms.md b/content/learning-paths/microcontrollers/nav-mlek/platforms.md index bfa1f56046..5214a1653d 100644 --- a/content/learning-paths/microcontrollers/nav-mlek/platforms.md +++ b/content/learning-paths/microcontrollers/nav-mlek/platforms.md @@ -13,34 +13,33 @@ There are very many Cortex-M microcontrollers with available [development boards ### MPS3 FPGA prototyping board -The [Arm MPS3 FPGA Prototyping Board](https://www.arm.com/products/development-tools/development-boards/mps3/) can be programmed with [FPGA images](https://developer.arm.com/downloads/-/download-fpga-images/) for the Corstone-300 and the Corstone-310 designs. The FPGA images are good for early software development. +The [Arm MPS3 FPGA Prototyping Board](https://www.arm.com/products/development-tools/development-boards/mps3/) can be programmed with [FPGA images](https://developer.arm.com/downloads/-/download-fpga-images/) for the Corstone-300, Corstone-310, and Corstone-1000 reference packages. The FPGA images are good for early software development. MPS3 is the recommended solution for evaluating performance, but boards are in short supply and may be difficult to obtain. - ## Virtual Hardware -Virtual implementations of the Corstone-300 and Corstone-310 are also available for software development. These can be accessed locally or in the cloud. 
+Virtual implementations of the Corstone reference systems are also available for software development. These can be accessed locally or in the cloud. ### Ecosystem FVPs Ecosystem FVPs are free-of-charge and target a variety of applications. They run on Linux and Windows. -The Corstone-300 and Corstone-310 FVPs are available on the [Arm Ecosystem FVP page](https://developer.arm.com/downloads/-/arm-ecosystem-fvps/). General ecosystem FVP setup instructions are provided in the [install guide](/install-guides/fm_fvp/eco_fvp/). +The Corstone reference systems are available on the [Arm Ecosystem FVP page](https://developer.arm.com/downloads/-/arm-ecosystem-fvps/). General ecosystem FVP setup instructions are provided in the [install guide](/install-guides/fm_fvp/eco_fvp/). The Ecosystem FVP can be used in conjunction with [Keil MDK](https://developer.arm.com/Tools%20and%20Software/Keil%20MDK) or [Arm Development Studio](https://developer.arm.com/Tools%20and%20Software/Arm%20Development%20Studio). Keil MDK Professional Edition also provides these virtual platforms. -### Arm Virtual Hardware +### Arm Virtual Hardware -[Arm Virtual Hardware](https://www.arm.com/products/development-tools/simulation/virtual-hardware/) provides two cloud-based solutions to access Corstone-300 and Corstone-310 platforms. These are intended for use as software test and validation environments suitable for CI/CD integration. +[Arm Virtual Hardware](https://www.arm.com/products/development-tools/simulation/virtual-hardware/) provides two cloud-based solutions to access Corstone reference systems. These are intended for use as software test and validation environments suitable for CI/CD integration. Both versions of AVH offer FVPs. Choose the one which best matches your preferences. You can use your AWS account and pay for the compute you use or pay for the hardware-as-a-service directly using your Arm account. Both methods offer free trials. 
The marketing information provides more details about the similarities and differences. -- [Arm Virtual Hardware Corstone and CPUs](#aws) AWS AMI (Amazon Machine Image) provides Virtual Hardware Targets (`VHT`) in a cloud instance (virtual machine). The AMI is available in the [AWS marketplace](https://aws.amazon.com/marketplace/pp/prodview-urbpq7yo5va7g/). +- [Arm Virtual Hardware Corstone and CPUs](#aws) AWS AMI (Amazon Machine Image) provides Virtual Hardware Targets (`VHT`) in a cloud instance (virtual machine). The AMI is available in the [AWS marketplace](https://aws.amazon.com/marketplace/pp/prodview-urbpq7yo5va7g/). - [Arm Virtual Hardware Third-Party Hardware](#3rdparty) uses hypervisor technology to model real hardware provided by Arm’s partners. It also offers FVPs as part of the cloud service. @@ -84,13 +83,13 @@ If you can start the FVPs you are ready for ML application development. #### Arm Virtual Hardware Third-Party Hardware {#3rdparty} -Arm Virtual Hardware Third-Party Hardware is currently in public beta. +Arm Virtual Hardware Third-Party Hardware is currently in public beta. [Log in to AVH](https://app.avh.arm.com/login/) using your Arm account or create a new one using the `Create an Arm account` link. After log in, you can use the AVH console to create a new device and select `Corstone-300fvp` or `Corstone-310fvp`. -You can use the AVH console to upload software and control FVP execution. +You can use the AVH console to upload software and control FVP execution. There is also documentation available in the console you can read to continue learning about AVH. @@ -107,10 +106,10 @@ The [AVH simulation model documentation](https://arm-software.github.io/AVH/main Ethos-U55 and Ethos-U65 offer a configurable number of MACs (multiply-accumulate units). During IP evaluation and performance analysis you need to understand the numbers of MACs available in the hardware and create your software to use the same configuration. 
-| Ethos-U NPU | Number of MACs supported | -| ----------- | ----------- | -| Ethos-U55 | 32, 64, 128, 256 | -| Ethos-U65 | 256, 512 | +| Ethos-U NPU | Number of MACs supported | +| ----------- | ----------- | +| Ethos-U55 | 32, 64, 128, 256 | +| Ethos-U65 | 256, 512 | FVP and VHT platforms can be configured with: ```console @@ -118,7 +117,7 @@ FVP and VHT platforms can be configured with: ``` ### Fast mode -The Ethos-U model used in FVPs can run at a faster speed with less simulation detail. +The Ethos-U model used in FVPs can run at a faster speed with less simulation detail. Use this configuration parameter to enable fast mode: @@ -128,20 +127,24 @@ Use this configuration parameter to enable fast mode: ### Hardware memory maps -A memory map is available for each configuration of Corstone-300 and Corstone-310. For example, the Corstone-300 with Cortex-M55 and Ethos-U55 [memory map](https://developer.arm.com/documentation/100966/1118/Arm--Corstone-SSE-300-FVP/Memory-map-overview-for-Corstone-SSE-300/) describes the address ranges for memory and peripherals. +A memory map is available for each configuration of the Corstone kits. For example, the Corstone-300 with Cortex-M55 and Ethos-U55 [memory map](https://developer.arm.com/documentation/100966/1118/Arm--Corstone-SSE-300-FVP/Memory-map-overview-for-Corstone-SSE-300/) describes the address ranges for memory and peripherals. -Refer to the [Corstone-300 Reference Guide](https://developer.arm.com/documentation/100966/1118/Arm--Corstone-SSE-300-FVP/) and [Corstone-310 Reference Guide](https://developer.arm.com/documentation/100966/1118/Arm--Corstone-SSE-310-FVP/) for details about the hardware models. 
+Refer to the reference guides for details about the hardware models: +- [Corstone-300 Reference Guide](https://developer.arm.com/documentation/100966/latest/Arm--Corstone-SSE-300-FVP/) +- [Corstone-310 Reference Guide](https://developer.arm.com/documentation/100966/latest/Arm--Corstone-SSE-310-FVP/) +- [Corstone-315 Reference Guide](https://developer.arm.com/documentation/109395/latest) +- [Corstone-320 Reference Guide](https://developer.arm.com/documentation/109760/latest) The memory map of FVPs is NOT configurable. ## Arm IP Explorer -Arm IP Explorer is used by SoC architects to select IP for new designs. It includes simulation features which provide cycle accurate simulation of Arm processors for the purpose of processor selection. It covers Cortex-M and Ethos-U and can help you determine the best processor configurations for a project. +Arm IP Explorer is used by SoC architects to select IP for new designs. It includes simulation features which provide cycle accurate simulation of Arm processors for the purpose of processor selection. It covers Cortex-M and Ethos-U and can help you determine the best processor configurations for a project. -Refer to the [Arm IP Explorer install guide](/install-guides/ipexplorer/) for links to more information. +Refer to the [Arm IP Explorer install guide](/install-guides/ipexplorer/) for links to more information. ## Summary -You should have a general understanding of the hardware options for Corstone-300 and Corstone-310 for application development. You can use an MPS3 board or an FVP on your local machine or using one of the cloud solutions. +You should have a general understanding of the hardware options for Corstone-300 and Corstone-310 for application development. You can use an MPS3 board or an FVP on your local machine or using one of the cloud solutions. The next section covers similar information for software, tools, and example applications. 
diff --git a/content/learning-paths/servers-and-cloud-computing/_index.md b/content/learning-paths/servers-and-cloud-computing/_index.md index 66334d9eed..081706830e 100644 --- a/content/learning-paths/servers-and-cloud-computing/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/_index.md @@ -9,7 +9,7 @@ maintopic: true operatingsystems_filter: - Android: 2 - Baremetal: 1 -- Linux: 105 +- Linux: 107 - macOS: 9 - Windows: 12 pinned_modules: @@ -19,12 +19,12 @@ pinned_modules: - providers - migration subjects_filter: -- CI-CD: 3 +- CI-CD: 4 - Containers and Virtualization: 25 - Databases: 15 - Libraries: 6 - ML: 13 -- Performance and Architecture: 36 +- Performance and Architecture: 37 - Storage: 1 - Web: 10 subtitle: Optimize cloud native apps on Arm for performance and cost @@ -33,6 +33,7 @@ tools_software_languages_filter: - .NET: 1 - .NET SDK: 1 - 5G: 1 +- ACL: 1 - Android Studio: 2 - Ansible: 2 - Arm Development Studio: 4 @@ -72,7 +73,7 @@ tools_software_languages_filter: - gdb: 1 - Geekbench: 1 - GenAI: 4 -- GitHub: 2 +- GitHub: 3 - GitLab: 1 - Glibc: 1 - Go: 2 @@ -86,7 +87,7 @@ tools_software_languages_filter: - JAX: 1 - Kafka: 1 - Keras: 1 -- Kubernetes: 9 +- Kubernetes: 10 - Lambda: 1 - libbpf: 1 - Linaro Forge: 1 @@ -106,8 +107,8 @@ tools_software_languages_filter: - PAPI: 1 - perf: 3 - PostgreSQL: 4 -- Python: 10 -- PyTorch: 4 +- Python: 11 +- PyTorch: 5 - RAG: 1 - Redis: 3 - Remote.It: 2 diff --git a/content/learning-paths/servers-and-cloud-computing/csp/azure.md b/content/learning-paths/servers-and-cloud-computing/csp/azure.md index 44d1e8fedf..2efeebe30d 100644 --- a/content/learning-paths/servers-and-cloud-computing/csp/azure.md +++ b/content/learning-paths/servers-and-cloud-computing/csp/azure.md @@ -11,7 +11,7 @@ layout: "learningpathall" As with most cloud service providers, Azure offers a pay-as-you-use [pricing policy](https://azure.microsoft.com/en-us/pricing/), including a number of 
[free](https://azure.microsoft.com/en-us/free/) services. -This guide is to help you get started with [Virtual Machines](https://azure.microsoft.com/en-us/products/virtual-machines/), using Arm-based VMs available in Azure. Microsoft Azure offers two generations of Arm-based VMs. The latest generation is based on [Azure Cobalt 100 processors](https://techcommunity.microsoft.com/t5/azure-compute-blog/announcing-the-preview-of-new-azure-vms-based-on-the-azure/ba-p/4146353). The previous generation VMs are based on [Ampere](https://azure.microsoft.com/en-us/blog/azure-virtual-machines-with-ampere-altra-arm-based-processors-generally-available/) processors. This is a general-purpose compute platform, essentially your own personal computer in the cloud. +This guide is to help you get started with [Virtual Machines](https://azure.microsoft.com/en-us/products/virtual-machines/), using Arm-based VMs available in Azure. Microsoft Azure offers two generations of Arm-based VMs. The latest generation is based on [Azure Cobalt 100 processors](https://azure.microsoft.com/en-us/blog/azure-cobalt-100-based-virtual-machines-are-now-generally-available/). The previous generation VMs are based on [Ampere](https://azure.microsoft.com/en-us/blog/azure-virtual-machines-with-ampere-altra-arm-based-processors-generally-available/) processors. This is a general-purpose compute platform, essentially your own personal computer in the cloud. Full [documentation and quickstart guides](https://learn.microsoft.com/en-us/azure/virtual-machines/) are available. 
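As a hedged starting point, the sketch below creates an Arm-based VM with the Azure CLI. The resource group, VM name, and region are placeholders; `Standard_D4ps_v5` is one of the Ampere Altra-based Arm sizes, and the image URN should be checked against `az vm image list` before use:

```console
# Create a resource group (name and region are placeholders)
az group create --name my-arm-rg --location eastus

# Create a VM with an Arm-based size and an Arm64 Ubuntu image
az vm create \
  --resource-group my-arm-rg \
  --name my-arm-vm \
  --size Standard_D4ps_v5 \
  --image Canonical:0001-com-ubuntu-server-jammy:22_04-lts-arm64:latest \
  --admin-username azureuser \
  --generate-ssh-keys
```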
diff --git a/content/learning-paths/servers-and-cloud-computing/gh-runners/_index.md b/content/learning-paths/servers-and-cloud-computing/gh-runners/_index.md new file mode 100644 index 0000000000..3057703032 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/gh-runners/_index.md @@ -0,0 +1,44 @@ +--- +title: Optimize MLOps with Arm-hosted GitHub Runners +draft: true +cascade: + draft: true + +minutes_to_complete: 60 + +who_is_this_for: This is an introductory topic for software developers interested in automation for Machine Learning (ML) tasks. + +learning_objectives: + - Set up an Arm-hosted GitHub runner. + - Train and test a PyTorch ML model with the German Traffic Sign Recognition Benchmark (GTSRB) dataset. + - Compare the performance of two trained PyTorch ML models: one compiled with OpenBLAS (Open Basic Linear Algebra Subprograms Library) and oneDNN (Deep Neural Network Library), and the other compiled with Arm Compute Library (ACL). + - Containerize an ML model and push the container to Docker Hub. + - Automate steps in an ML workflow using GitHub Actions. + +prerequisites: + - A GitHub account with access to Arm-hosted GitHub runners. + - A Docker Hub account for storing container images. + - Familiarity with the concepts of ML and continuous integration and deployment (CI/CD). + +author_primary: Pareena Verma, Annie Tallund + +### Tags +skilllevels: Introductory +subjects: CI-CD +armips: + - Neoverse +tools_software_languages: + - Python + - PyTorch + - ACL + - GitHub +operatingsystems: + - Linux + + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. 
+--- diff --git a/content/learning-paths/servers-and-cloud-computing/gh-runners/_next-steps.md b/content/learning-paths/servers-and-cloud-computing/gh-runners/_next-steps.md new file mode 100644 index 0000000000..0ab9c525f8 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/gh-runners/_next-steps.md @@ -0,0 +1,27 @@ +--- +next_step_guidance: Thank you for completing the learning path on running MLOps with Arm-hosted GitHub runners. You might be interested in learning how to build Arm images and multi-architecture images with these Arm-hosted runners. + +recommended_path: /learning-paths/cross-platform/github-arm-runners + +further_reading: + - resource: + title: Arm64 on GitHub Actions - Powering faster, more efficient build systems + link: https://github.blog/news-insights/product-news/arm64-on-github-actions-powering-faster-more-efficient-build-systems/ + type: blog + - resource: + title: Arm Compute Library + link: https://github.com/ARM-software/ComputeLibrary + type: website + - resource: + title: Streamlining your MLOps pipeline with GitHub Actions and Arm64 runners + link: https://github.blog/enterprise-software/ci-cd/streamlining-your-mlops-pipeline-with-github-actions-and-arm64-runners/ + type: blog + + +# ================================================================================ +# FIXED, DO NOT MODIFY +# ================================================================================ +weight: 21 # set to always be larger than the content in this path, and one more than 'review' +title: "Next Steps" # Always the same +layout: "learningpathall" # All files under learning paths have this same wrapper +--- diff --git a/content/learning-paths/servers-and-cloud-computing/gh-runners/_review.md b/content/learning-paths/servers-and-cloud-computing/gh-runners/_review.md new file mode 100644 index 0000000000..0ccaffbc39 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/gh-runners/_review.md @@ -0,0 +1,43 @@ +--- 
+review: + - questions: + question: > + Can Arm-hosted runners be used with GitHub Actions? + answers: + - "Yes" + - "No" + correct_answer: 1 + explanation: > + You can use Arm-hosted runners with GitHub Actions, and they are available for both Linux and Windows. + + - questions: + question: > + What does the GTSRB dataset consist of? + answers: + - Sound files of spoken German words. + - Sound files of animal sounds. + - Images of flower petals. + - Images of German traffic signs. + correct_answer: 4 + explanation: > + GTSRB stands for German Traffic Sign Recognition Benchmark, and the dataset consists of images of German traffic signs. + + - questions: + question: > + Is ACL included in PyTorch? + answers: + - "True" + - "False" + correct_answer: 1 + explanation: > + While it is possible to use Arm Compute Library independently, the optimized kernels are built into PyTorch through the oneDNN backend. + + + +# ================================================================================ +# FIXED, DO NOT MODIFY +# ================================================================================ +title: "Review" # Always the same title +weight: 20 # Set to always be larger than the content in this path +layout: "learningpathall" # All files under learning paths have this same wrapper +--- diff --git a/content/learning-paths/servers-and-cloud-computing/gh-runners/background.md b/content/learning-paths/servers-and-cloud-computing/gh-runners/background.md new file mode 100644 index 0000000000..055846c401 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/gh-runners/background.md @@ -0,0 +1,97 @@ +--- +title: MLOps background +weight: 2 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Overview + +In this Learning Path, you will learn how to automate an MLOps workflow using Arm-hosted GitHub runners and GitHub Actions. + +You will perform the following tasks: +- Train and test a neural network model with PyTorch.
+- Compare the model inference time using two different PyTorch backends. +- Containerize the model and save it to DockerHub. +- Deploy the container image and use API calls to access the model. + +## GitHub Actions + +GitHub Actions is a platform that automates software development workflows, which includes Continuous Integration and Continuous Delivery (CI/CD). + +Every repository on GitHub has an **Actions** tab as shown below: + +![#actions-gui](images/actions-gui.png) + +GitHub Actions runs workflow files to automate processes. Workflows run when specific events occur in a GitHub repository. + +Workflows are defined in [YAML](https://yaml.org/) files. + +Workflows specify: + +* How a job is triggered. +* The running environment. +* The commands to run. + +The machine running the workflows is called a _runner_. + +## Arm-hosted GitHub runners + +Hosted GitHub runners are provided by GitHub, so you do not need to set up and manage cloud infrastructure. Arm-hosted GitHub runners use the Arm architecture, so you can build and test software without the need for cross-compilation or instruction emulation. + +Arm-hosted GitHub runners enable you to: + +* Optimize your workflows. +* Reduce cost. +* Reduce energy consumption. + +Additionally, the Arm-hosted runners are preloaded with essential tools, which makes it easier for you to develop and test your applications. + +Arm-hosted runners are available for Linux and Windows. This Learning Path uses Linux. + +{{% notice Note %}} +You must have a Team or Enterprise Cloud plan to use Arm-hosted runners. +{{% /notice %}} + +Getting started with Arm-hosted GitHub runners is straightforward. Follow the steps in [Create a new Arm-hosted runner](/learning-paths/cross-platform/github-arm-runners/runner/#how-can-i-create-an-arm-hosted-runner) to create a runner in your organization. + +Once you have created the runner, use the `runs-on` syntax in your GitHub Actions workflow file to execute the workflow on Arm.
+ +Below is an example workflow that executes on an Arm-hosted runner named `ubuntu-22.04-arm-os`: + +```yaml +name: Example workflow +on: + workflow_dispatch: +jobs: + example-job: + name: Example Job + runs-on: ubuntu-22.04-arm-os # Custom ARM64 runner + steps: + - name: Example step + run: echo "This line runs on Arm!" +``` + + +## Machine Learning Operations (MLOps) + +Machine learning use cases require reliable workflows to maintain both performance and quality of output. + +There are tasks that can be automated in the ML lifecycle, such as: +- Model training and retraining. +- Model performance analysis. +- Data storage and processing. +- Model deployment. + +DevOps (Development and Operations) refers to good practices for collaboration and automation, including CI/CD. MLOps describes the area of practice where ML application development intersects with ML system deployment and operations. + +## German Traffic Sign Recognition Benchmark (GTSRB) + +This Learning Path explains how to train and test a PyTorch model to perform traffic sign recognition. + +You will learn how to use the GTSRB dataset to train the model. The dataset is free to use under the [Creative Commons](https://creativecommons.org/publicdomain/zero/1.0/) license. It contains thousands of images of traffic signs found in Germany. It has become a well-known resource to showcase ML applications. + +The GTSRB dataset is also effective for comparing the performance and accuracy of different models and of different PyTorch backends. + +Continue to the next section to learn how to set up an end-to-end MLOps workflow using Arm-hosted GitHub runners.
diff --git a/content/learning-paths/servers-and-cloud-computing/gh-runners/compare-performance.md b/content/learning-paths/servers-and-cloud-computing/gh-runners/compare-performance.md new file mode 100644 index 0000000000..8232c66b08 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/gh-runners/compare-performance.md @@ -0,0 +1,79 @@ +--- +title: Compare the performance of PyTorch backends +weight: 5 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +Continuously monitoring the performance of your machine learning models in production is crucial to maintaining effectiveness over time. The performance of your ML model can change due to various factors, ranging from data-related issues to environmental factors. + +In this section, you will change the PyTorch backend being used to test the trained model. You will learn how to measure and continuously monitor the inference performance using your workflow. + +## oneDNN with Arm Compute Library (ACL) + +In the previous section, you used the PyTorch 2.3.0 Docker image compiled with OpenBLAS from DockerHub to run your testing workflow. PyTorch can be run with other backends. You will now modify the testing workflow to use the PyTorch 2.3.0 Docker image compiled with oneDNN and the Arm Compute Library. + +The [Arm Compute Library](https://github.com/ARM-software/ComputeLibrary) is a collection of low-level machine learning functions optimized for Arm's Cortex-A and Neoverse processors and Mali GPUs. Arm-hosted GitHub runners use Arm Neoverse CPUs, which makes it possible to optimize your neural networks to take advantage of processor features. ACL implements kernels, which are also known as operators or layers, using specific instructions that run faster on AArch64. + +ACL is integrated into PyTorch through [oneDNN](https://github.com/oneapi-src/oneDNN), an open-source deep neural network library.
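
To get an intuition for what an optimized kernel does differently, consider how convolution can be lowered to a matrix multiplication ("im2col") so that one highly tuned GEMM routine does the work — the general strategy used by optimized backends. This is a conceptual sketch in plain Python, not ACL or oneDNN code:

```python
# Conceptual illustration only: lowering a convolution to a matrix product
# (im2col) so one tuned inner loop does the work. This is the kind of
# restructuring optimized backends perform; it is not ACL source code.

def conv2d_naive(img, kernel):
    """Direct 2D convolution (valid padding, stride 1) with explicit loops."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(img) - kh + 1, len(img[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            for u in range(kh):
                for v in range(kw):
                    out[i][j] += img[i + u][j + v] * kernel[u][v]
    return out

def conv2d_im2col(img, kernel):
    """Same result: flatten each input patch so the inner step is one dot product."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(img) - kh + 1, len(img[0]) - kw + 1
    flat_k = [kernel[u][v] for u in range(kh) for v in range(kw)]
    cols = [
        [img[i + u][j + v] for u in range(kh) for v in range(kw)]
        for i in range(oh) for j in range(ow)
    ]
    dots = [sum(a * b for a, b in zip(col, flat_k)) for col in cols]
    return [dots[r * ow:(r + 1) * ow] for r in range(oh)]

img = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
kernel = [[1.0, 0.0], [0.0, -1.0]]
assert conv2d_naive(img, kernel) == conv2d_im2col(img, kernel)
```

Both functions compute the same result; the im2col form exposes the computation as one regular inner loop that a vectorized GEMM kernel can execute efficiently.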
+ +## Modify the test workflow and compare results + +Two different PyTorch Docker images for Arm Neoverse CPUs are available on [DockerHub](https://hub.docker.com/r/armswdev/pytorch-arm-neoverse). + +Up until this point, you used the `r24.07-torch-2.3.0-openblas` container image to run workflows. The oneDNN container image is also available to use in workflows. These images represent two different PyTorch backends that handle the PyTorch model execution. + +### Change the Docker image to use oneDNN + +In your browser, open and edit the file `.github/workflows/test_model.yml`. + +Update the `container.image` parameter to `armswdev/pytorch-arm-neoverse:r24.07-torch-2.3.0-onednn-acl` and save the file by committing the change to the main branch: + +```yaml +jobs: + test-model: + name: Test the Model + runs-on: ubuntu-22.04-arm-os # Custom ARM64 runner + container: + image: armswdev/pytorch-arm-neoverse:r24.07-torch-2.3.0-onednn-acl + options: --user root + # Steps omitted +``` + +### Run the test workflow + +Trigger the **Test Model** job again by clicking the **Run workflow** button on the **Actions** tab. + +The test workflow starts running. + +Navigate to the workflow run on the **Actions** tab, click into the job, and expand the **Run testing script** step. + +You see a change in the performance results with oneDNN and ACL kernels being used.
+ +The output is similar to: + +```output +Accuracy of the model on the test images: 90.48% +--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ + Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls +--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ + model_inference 4.63% 304.000us 100.00% 6.565ms 6.565ms 1 + aten::conv2d 0.18% 12.000us 56.92% 3.737ms 1.869ms 2 + aten::convolution 0.30% 20.000us 56.74% 3.725ms 1.863ms 2 + aten::_convolution 0.43% 28.000us 56.44% 3.705ms 1.853ms 2 + aten::mkldnn_convolution 47.02% 3.087ms 55.48% 3.642ms 1.821ms 2 + aten::max_pool2d 0.15% 10.000us 25.51% 1.675ms 837.500us 2 + aten::max_pool2d_with_indices 25.36% 1.665ms 25.36% 1.665ms 832.500us 2 + aten::linear 0.18% 12.000us 9.26% 608.000us 304.000us 2 + aten::clone 0.26% 17.000us 9.08% 596.000us 149.000us 4 + aten::addmm 8.50% 558.000us 8.71% 572.000us 286.000us 2 +--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ +Self CPU time total: 6.565ms +``` + +For the ACL results, notice that the **Self CPU time total** is lower compared to the OpenBLAS run in the previous section. + +The names of the layers have also changed: `aten::mkldnn_convolution` is the kernel optimized to run on the Arm architecture. That operator is the main reason the inference time is improved, which is made possible by using ACL kernels. + +In the next section, you will learn how to automate the deployment of your model.
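
The two profiler reports can also be compared programmatically. The repository's `scripts/parse_output.py` (used later in the end-to-end workflow) does something along these lines — a minimal sketch, assuming the `Self CPU time total:` line format shown above; the OpenBLAS figure below is an illustrative value, not measured output:

```python
import re

def self_cpu_total_ms(report: str) -> float:
    """Extract the 'Self CPU time total' from a PyTorch profiler report, in ms."""
    m = re.search(r"Self CPU time total:\s*([\d.]+)(us|ms|s)", report)
    if not m:
        raise ValueError("no 'Self CPU time total' line found")
    value, unit = float(m.group(1)), m.group(2)
    return value * {"us": 1e-3, "ms": 1.0, "s": 1e3}[unit]

openblas_report = "Self CPU time total: 8.412ms"  # illustrative value for the OpenBLAS run
acl_report = "Self CPU time total: 6.565ms"       # from the output above

speedup = self_cpu_total_ms(openblas_report) / self_cpu_total_ms(acl_report)
print(f"ACL speedup over OpenBLAS: {speedup:.2f}x")
```

Running this kind of comparison on every workflow run is what turns the profiler output into a continuous performance monitor.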
diff --git a/content/learning-paths/servers-and-cloud-computing/gh-runners/deploy.md b/content/learning-paths/servers-and-cloud-computing/gh-runners/deploy.md new file mode 100644 index 0000000000..8710299fdf --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/gh-runners/deploy.md @@ -0,0 +1,226 @@ +--- +title: Deploy the application as a container +weight: 6 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +After your model has been trained and validated using GitHub Actions workflows, the next step is to deploy the model into a production environment. + +In this section, you will containerize your trained model and push the container image to DockerHub. + +## Containerize the model + +You can use the Dockerfile included in the repository to create a container for the trained model and the deployment scripts. + +Review the `Dockerfile` to understand how it works: + +```console +# Use the official PyTorch image as the base image +FROM armswdev/pytorch-arm-neoverse:r24.07-torch-2.3.0-onednn-acl + +# Set the working directory +WORKDIR /app + +# Copy the necessary files +COPY models/ /app/models/ +COPY scripts/ /app/scripts/ + +# Install any additional dependencies +RUN pip install --no-cache-dir torch torchvision fastapi uvicorn python-multipart + +# Expose the port the app will run on +EXPOSE 8000 + +# Specify the command to run the model server +CMD ["uvicorn", "scripts.serve_model:app", "--host", "0.0.0.0", "--port", "8000"] +``` + +The Dockerfile uses the PyTorch image with the ACL backend as the base image for the container. + +The working directory is set to `/app` where the trained model and the scripts to deploy the model are copied. + +The container runs an application (`scripts/serve_model.py`) on port 8000. This script is called by `uvicorn` when the container is run. Uvicorn is a fast, lightweight ASGI (Asynchronous Server Gateway Interface) server, and is good for serving Python web applications. 
More information about the application is provided in the next section. + +## Serve the model using FastAPI + +[FastAPI](https://fastapi.tiangolo.com/) is an easy way to serve your trained model as an API. + +The Python application using FastAPI and PyTorch loads the pre-trained model, accepts image uploads, and makes predictions about the image. + +The code is in the file `scripts/serve_model.py` and is shown below: + +```python +import torch +import torchvision.transforms as transforms +from fastapi import FastAPI, UploadFile, File, HTTPException +from fastapi.responses import JSONResponse +from PIL import Image +import io + +# Define the model class (should match the model architecture used during training) +class TrafficSignNet(torch.nn.Module): + def __init__(self): + super(TrafficSignNet, self).__init__() + self.conv1 = torch.nn.Conv2d(3, 32, kernel_size=3) + self.conv2 = torch.nn.Conv2d(32, 64, kernel_size=3) + self.fc1 = torch.nn.Linear(64 * 6 * 6, 128) + self.fc2 = torch.nn.Linear(128, 43) # 43 classes in GTSRB dataset + + def forward(self, x): + x = torch.relu(self.conv1(x)) + x = torch.max_pool2d(x, 2) + x = torch.relu(self.conv2(x)) + x = torch.max_pool2d(x, 2) + x = torch.flatten(x, 1) + x = torch.relu(self.fc1(x)) + x = self.fc2(x) + return x + +# Load the trained model +model = TrafficSignNet() +model.load_state_dict(torch.load("/app/models/traffic_sign_net.pth", map_location=torch.device('cpu'))) +model.eval() + +# Define image transformations (should match the preprocessing used during training) +transform = transforms.Compose([ + transforms.Resize((32, 32)), + transforms.ToTensor(), + transforms.Normalize((0.5,), (0.5,)) +]) + +# Initialize FastAPI +app = FastAPI() + +# Define the prediction endpoint +@app.post("/predict/") +async def predict(file: UploadFile = File(...)): + try: + # Read the image + image_bytes = await file.read() + image = Image.open(io.BytesIO(image_bytes)).convert("RGB") + + # Preprocess the image + image = 
transform(image).unsqueeze(0) # Add batch dimension + + # Run the model on the image + with torch.no_grad(): + output = model(image) + _, predicted = torch.max(output, 1) + + # Return the prediction as a JSON response + return {"predicted_class": predicted.item()} + + except Exception as e: + raise HTTPException(status_code=500, detail=str(e)) +``` + +## Build the container image with GitHub Actions + +You are now ready to automate the build of your containerized model using GitHub Actions. + +Review the workflow file at `.github/workflows/deploy-model.yml`: + +```yaml +name: Deploy to DockerHub + +on: + workflow_dispatch: + workflow_run: + workflows: [Test Model] + types: + - completed + +jobs: + deploy-to-dockerhub: + runs-on: ubuntu-22.04-arm-os + name: Build and Push Docker Image to DockerHub + steps: + - name: Checkout code + uses: actions/checkout@v3 + + - name: Set up Docker Buildx + uses: docker/setup-buildx-action@v3 + + - name: Log in to DockerHub + uses: docker/login-action@v3 + with: + username: ${{ secrets.DOCKER_USERNAME }} + password: ${{ secrets.DOCKER_PASSWORD }} + + - name: Build and Push Docker Image + run: | + docker buildx build --platform linux/arm64 -t ${{ secrets.DOCKER_USERNAME }}/gtsrb-image:latest --push . +``` + +This workflow builds the container for the Arm architecture and pushes the container image to DockerHub. + +### Configure your DockerHub credentials + +Before you run this workflow, you need your Docker Hub username and a Personal Access Token (PAT). This enables GitHub Actions to store the container image in your DockerHub account. + +If you don't have a personal access token, log in to [DockerHub](https://hub.docker.com/), click on your profile on the top right, select `Account settings` and then select `Personal access tokens`. Use the `Generate new token` button to create the token.
+ +The credentials are passed to the workflow as secrets. + +To save your secrets, click on the `Settings` tab in the GitHub repository. Expand `Secrets and variables` on the left side and click `Actions`. + +Add two secrets using the `New repository secret` button: + + * DOCKER_USERNAME: Your DockerHub username + * DOCKER_PASSWORD: Your DockerHub Personal Access Token + +### Run the workflow + +Navigate to the `Actions` tab in your repository, and select `Deploy to DockerHub` on the left side. + +Use the `Run workflow` drop-down on the right-hand side to click `Run workflow` to start the workflow on the main branch. + +The workflow starts running. + +## Verify the deployment + +When the **Deploy to DockerHub** workflow completes, the container image is available in DockerHub and you can run it on any Arm machine. + +### Confirm the image in DockerHub + +Log in to DockerHub and see the newly created container image. + +A screenshot showing the new image in DockerHub is below: + +![dockerhub_img](images/dockerhub_img.png) + +### Run the application + +To run the application on an Arm machine, pull the Docker image and create a container. + +Make sure to substitute your DockerHub username for `<your-docker-username>` in the commands below: + +```console +docker pull <your-docker-username>/gtsrb-image +docker run -d -p 8000:8000 <your-docker-username>/gtsrb-image +``` + +### Test the API with a traffic sign image + +Test the application using an image from the repository. Download the test image named `test-img.png` from GitHub by clicking it and using the download button on the right side.
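
The prediction endpoint can also be called from a short Python script using only the standard library. This is a sketch; it assumes the container is running locally on port 8000 and that `test-img.png` is in the current directory:

```python
import urllib.request
import uuid

def build_multipart(field: str, filename: str, data: bytes, content_type: str):
    """Build a multipart/form-data body and its Content-Type header value by hand."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode() + data + f"\r\n--{boundary}--\r\n".encode()
    return body, f"multipart/form-data; boundary={boundary}"

def predict(image_path: str, url: str = "http://localhost:8000/predict/") -> str:
    """POST an image file to the prediction endpoint and return the JSON response."""
    with open(image_path, "rb") as f:
        body, ctype = build_multipart("file", image_path, f.read(), "image/png")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": ctype, "accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()

# predict("test-img.png")  # returns the JSON response from the running container
```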
+ +Run the `curl` command below to make a POST request to the predict endpoint using the image: + +```bash +curl -X 'POST' 'http://localhost:8000/predict/' -H 'accept: application/json' -H 'Content-Type: multipart/form-data' -F 'file=@test-img.png;type=image/png' +``` + +The output is: + +```output +{"predicted_class":6} +``` + +You have now successfully deployed your application, served your model as an API, and made a prediction on a test image. + +In the last section, you will learn about the complete end-to-end MLOps workflow by combining the individual workflows. diff --git a/content/learning-paths/servers-and-cloud-computing/gh-runners/e2e-workflow.md b/content/learning-paths/servers-and-cloud-computing/gh-runners/e2e-workflow.md new file mode 100644 index 0000000000..71a0743264 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/gh-runners/e2e-workflow.md @@ -0,0 +1,167 @@ +--- +title: End-to-end MLOps workflow +weight: 7 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +So far, you have run individual workflows covering these tasks in the ML lifecycle: +- Training +- Testing +- Performance monitoring +- Deployment + +With GitHub Actions, you can build an end-to-end custom MLOps workflow that combines and automates the individual workflows. + +To demonstrate this, the repository contains a workflow in `.github/workflows/train-test-deploy-model.yml` that automates the individual steps.
+ +Here is the complete **Train, Test, and Deploy Model** workflow file: + +```yaml +name: Train, Test and Deploy Model + +on: + workflow_dispatch: + push: + branches: main + +jobs: + train-model: + name: Train the Model + runs-on: ubuntu-22.04-arm-os # Custom ARM64 runner + container: + image: armswdev/pytorch-arm-neoverse:r24.07-torch-2.3.0-openblas + options: --user root + steps: + - name: Checkout code + uses: actions/checkout@v4 + - name: Run training script + run: python scripts/train_model.py + - name: Upload Artifact + uses: actions/upload-artifact@v4 + with: + name: traffic_sign_net + path: ${{ github.workspace }}/models/traffic_sign_net.pth + retention-days: 5 + test-model-openblas: + name: Test with OpenBLAS + needs: train-model + runs-on: ubuntu-22.04-arm-os + container: + image: armswdev/pytorch-arm-neoverse:r24.07-torch-2.3.0-openblas + options: --user root + steps: + - name: Checkout code + uses: actions/checkout@v4 + - name: Download model artifact + uses: actions/download-artifact@v4 + with: + name: traffic_sign_net + - name: Test with OpenBLAS + run: python scripts/test_model.py --model models/traffic_sign_net.pth | tee openblas.txt + - name: Upload Artifact + uses: actions/upload-artifact@v4 + with: + name: openblas + path: openblas.txt + retention-days: 5 + test-model-onednn-acl: + name: Test with Arm Compute Library + needs: train-model + runs-on: ubuntu-22.04-arm-os + container: + image: armswdev/pytorch-arm-neoverse:r24.07-torch-2.3.0-onednn-acl + options: --user root + steps: + - name: Checkout code + uses: actions/checkout@v4 + - name: Download model artifact + uses: actions/download-artifact@v4 + with: + name: traffic_sign_net + - name: Test with oneDNN and Arm Compute Library + run: python scripts/test_model.py --model models/traffic_sign_net.pth | tee acl.txt + - name: Upload Artifact + uses: actions/upload-artifact@v4 + with: + name: acl + path: acl.txt + retention-days: 5 + compare-results: + name: Compare Profiler Reports + needs: 
[test-model-openblas, test-model-onednn-acl] + runs-on: ubuntu-22.04-arm-os + steps: + - name: Checkout code + uses: actions/checkout@v4 + - name: Download OpenBLAS artifact + uses: actions/download-artifact@v4 + with: + name: openblas + - name: Download ACL artifact + uses: actions/download-artifact@v4 + with: + name: acl + - name: Parse output + run: python scripts/parse_output.py openblas.txt acl.txt + - name: Remove output files + run: rm -rf openblas.txt acl.txt + push-artifact: + name: Push the updated model + needs: compare-results + runs-on: ubuntu-22.04-arm-os + steps: + - name: Checkout code + uses: actions/checkout@v4 + - name: Download model artifact + uses: actions/download-artifact@v4 + with: + name: traffic_sign_net + path: models/ + - name: Push updated model to repository + run: | + git config --global user.name "GitHub Actions" + git config --global user.email "actions@users.noreply.github.com" + git add models/traffic_sign_net.pth + git commit -m "Add updated model" + git push + deploy-to-dockerhub: + name: Build and Push Docker Image to DockerHub + needs: push-artifact + runs-on: ubuntu-22.04-arm-os + steps: + - name: Checkout code + uses: actions/checkout@v3 + + - name: Set up Docker Buildx + uses: docker/setup-buildx-action@v3 + + - name: Log in to DockerHub + uses: docker/login-action@v3 + with: + username: ${{ secrets.DOCKER_USERNAME }} + password: ${{ secrets.DOCKER_PASSWORD }} + + - name: Build and Push Docker Image + run: | + docker buildx build --platform linux/arm64 -t ${{ secrets.DOCKER_USERNAME }}/gtsrb-image:latest --push . +``` + +These steps should look familiar; they are now combined into an end-to-end MLOps workflow. + +The training and testing steps run as before. The output report is saved and parsed to compare the performance changes in inference time. + +The trained model is updated in the repository.
+ +The deployment step connects to DockerHub and pushes the containerized model and scripts, which can then be downloaded and run. + +The steps depend on each other, requiring the previous one to run before the next is triggered. The entire workflow is triggered automatically any time a change is pushed to the main branch of your repository. + +Using what you have learned, navigate to the **Train, Test and Deploy Model** workflow and run it. + +The diagram below shows the end-to-end workflow, the relationship between the steps, and the time required to run each step: + +![#e2e-workflow](images/e2e-workflow.png) + +You have run an MLOps workflow using GitHub Actions with Arm-hosted runners for managing all of the steps in your ML application's lifecycle. diff --git a/content/learning-paths/servers-and-cloud-computing/gh-runners/images/actions-gui.png b/content/learning-paths/servers-and-cloud-computing/gh-runners/images/actions-gui.png new file mode 100644 index 0000000000..78a848d986 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/gh-runners/images/actions-gui.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/gh-runners/images/actions_test.png b/content/learning-paths/servers-and-cloud-computing/gh-runners/images/actions_test.png new file mode 100644 index 0000000000..05248e3da9 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/gh-runners/images/actions_test.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/gh-runners/images/actions_train.png b/content/learning-paths/servers-and-cloud-computing/gh-runners/images/actions_train.png new file mode 100644 index 0000000000..ed34762dbb Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/gh-runners/images/actions_train.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/gh-runners/images/artifact.png 
b/content/learning-paths/servers-and-cloud-computing/gh-runners/images/artifact.png new file mode 100644 index 0000000000..c483ceed4a Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/gh-runners/images/artifact.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/gh-runners/images/dockerhub_img.png b/content/learning-paths/servers-and-cloud-computing/gh-runners/images/dockerhub_img.png new file mode 100644 index 0000000000..6e485e97c9 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/gh-runners/images/dockerhub_img.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/gh-runners/images/e2e-workflow.png b/content/learning-paths/servers-and-cloud-computing/gh-runners/images/e2e-workflow.png new file mode 100644 index 0000000000..5c60873747 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/gh-runners/images/e2e-workflow.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/gh-runners/images/fork.png b/content/learning-paths/servers-and-cloud-computing/gh-runners/images/fork.png new file mode 100644 index 0000000000..46b9703c81 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/gh-runners/images/fork.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/gh-runners/images/run-id.png b/content/learning-paths/servers-and-cloud-computing/gh-runners/images/run-id.png new file mode 100644 index 0000000000..c1484bdb40 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/gh-runners/images/run-id.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/gh-runners/images/run-workflow.png b/content/learning-paths/servers-and-cloud-computing/gh-runners/images/run-workflow.png new file mode 100644 index 0000000000..e6091cb68f Binary files /dev/null and 
b/content/learning-paths/servers-and-cloud-computing/gh-runners/images/run-workflow.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/gh-runners/images/train_run.png b/content/learning-paths/servers-and-cloud-computing/gh-runners/images/train_run.png new file mode 100644 index 0000000000..9aa857f9d2 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/gh-runners/images/train_run.png differ diff --git a/content/learning-paths/servers-and-cloud-computing/gh-runners/train-test.md b/content/learning-paths/servers-and-cloud-computing/gh-runners/train-test.md new file mode 100644 index 0000000000..2cfdc3e401 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/gh-runners/train-test.md @@ -0,0 +1,192 @@ +--- +title: Understand neural network model training and testing +weight: 3 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Fork the example repository + +In this section, you will fork the example GitHub repository containing the project code. + +To get started, navigate to the repository in a web browser: + +```bash +https://github.com/Arm-Labs/gh_armrunner_mlops_gtsrb +``` + +Fork the repository using the **Fork** button: + +![#fork](images/fork.png) + +Create a fork within a GitHub Organization or Team where you have access to Arm-hosted GitHub runners. + +{{% notice Note %}} +If a repository with the name `gh_armrunner_mlops_gtsrb` already exists in your Organization or Team, you can modify the repository name to make it unique. +{{% /notice %}} + +## Learn about model training and testing + +In this section, you will inspect the Python code for training and testing a neural network model. + +Explore the repository using a browser to get familiar with the code and the workflow files. + +{{% notice Note %}} +No actions are required in the sections below.
+ +The purpose is to provide an overview of the code used for training and testing a PyTorch model on the GTSRB dataset. +{{% /notice %}} + +### Model training + +In the `scripts` directory, there is a Python script called `train_model.py`. This script loads the GTSRB dataset, defines a neural network, and trains the model on the dataset. + +#### Data preprocessing + +The first section loads the GTSRB dataset to prepare it for training. The GTSRB dataset is built into `torchvision`, which makes loading easier. + +The transformations used when loading data are part of the preprocessing step, which makes the data uniform and ready to run through the extensive math operations of the ML model. + +Following common machine learning practice, the data is separated into training and testing sets, so that the trained model can be evaluated on data it did not see during training. + +Here is the code to load the dataset: + +```python +transform = transforms.Compose([ + transforms.Resize((32, 32)), + transforms.ToTensor(), + transforms.Normalize((0.5,), (0.5,)) +]) + +train_set = torchvision.datasets.GTSRB(root='./data', split='train', download=True, transform=transform) +train_loader = DataLoader(train_set, batch_size=64, shuffle=True) +``` + +#### Model creation + +The next step is to define a class for the model, listing the layers used. + +The class defines the forward pass used to compute predictions. Additionally, the loss function and optimizer, which update the model weights during training, are defined.
+ +Here is the code that defines the model: + +```python +class TrafficSignNet(nn.Module): + def __init__(self): + super(TrafficSignNet, self).__init__() + self.conv1 = nn.Conv2d(3, 32, kernel_size=3) + self.conv2 = nn.Conv2d(32, 64, kernel_size=3) + self.fc1 = nn.Linear(64 * 6 * 6, 128) + self.fc2 = nn.Linear(128, 43) # 43 classes in GTSRB dataset + + def forward(self, x): + x = torch.relu(self.conv1(x)) + x = torch.max_pool2d(x, 2) + x = torch.relu(self.conv2(x)) + x = torch.max_pool2d(x, 2) + x = torch.flatten(x, 1) + x = torch.relu(self.fc1(x)) + x = self.fc2(x) + return x + +model = TrafficSignNet() +criterion = nn.CrossEntropyLoss() +optimizer = optim.Adam(model.parameters(), lr=0.001) +``` + +#### Model training and the model file + +A training loop performs the actual training. + +The number of epochs is arbitrarily set to 10 for this example. When the training is finished, the model weights are saved to a `.pth` file. + +Here is the code for the training loop: + +```python +num_epochs = 10 +model.train() +for epoch in range(num_epochs): + running_loss = 0.0 + for i, data in enumerate(train_loader, 0): + inputs, labels = data + optimizer.zero_grad() + + # Forward pass + outputs = model(inputs) + loss = criterion(outputs, labels) + + # Backward pass and optimization + loss.backward() + optimizer.step() + + running_loss += loss.item() + if i % 100 == 99: # Print every 100 mini-batches + print(f'Epoch [{epoch + 1}/{num_epochs}], Step [{i + 1}/{len(train_loader)}], Loss: {running_loss / 100:.4f}') + running_loss = 0.0 + +torch.save(model.state_dict(), './models/traffic_sign_net.pth') +``` + +You now have an understanding of how to load the GTSRB dataset, define a neural network, train the model on the dataset, and save the trained model. + +The next step is testing the trained model. + +### Model testing + +The `test_model.py` Python script in the `scripts` directory verifies how accurately the ML model classifies traffic signs. 
+ +It uses the PyTorch profiler to measure the CPU performance in terms of execution time. The profiler measures the model inference time when different PyTorch backends are used to test the model. + +#### Model loading and testing data + +Testing is done by loading the model that was saved after training and preparing it for evaluation on a test dataset. + +As in training, transformations are used to load the test data from the GTSRB dataset. + +Here is the code to load the model and the test data: + +```python +model_path = args.model if args.model else './models/traffic_sign_net.pth' + +model = TrafficSignNet() +model.load_state_dict(torch.load(model_path)) +model.eval() + +transform = transforms.Compose([ + transforms.Resize((32, 32)), + transforms.ToTensor(), + transforms.Normalize((0.5,), (0.5,)) +]) + +test_set = torchvision.datasets.GTSRB(root='./data', split='test', download=True, transform=transform) +test_loader = DataLoader(test_set, batch_size=64, shuffle=False) +``` + +#### Testing loop and profiling results + +The testing loop passes each batch of test data through the model and compares predictions to the actual labels to calculate accuracy. + +The accuracy is calculated as a percentage of correctly classified images. Both the accuracy and PyTorch profiler reports are printed at the end of the script. 
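Stripped of tensors, the accuracy calculation in the testing loop amounts to counting matching predictions. Here is a minimal sketch using made-up class indices, not real model output:

```python
# Illustrative accuracy calculation with hypothetical predictions and labels.
predicted = [14, 3, 7, 3, 27, 14, 1, 3]   # made-up predicted class indices
labels    = [14, 3, 7, 5, 27, 14, 2, 3]   # made-up ground-truth class indices

correct = sum(p == t for p, t in zip(predicted, labels))
total = len(labels)
print(f'Accuracy of the model on the test images: {100 * correct / total:.2f}%')
```

The real script does the same thing, except that `predicted` comes from `torch.max` over the model outputs, one batch at a time.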
+ +Here is the testing loop with profiling: + +```python +correct = 0 +total = 0 +with torch.no_grad(): + for data in test_loader: + images, labels = data + with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof: + with record_function("model_inference"): + outputs = model(images) + _, predicted = torch.max(outputs.data, 1) + total += labels.size(0) + correct += (predicted == labels).sum().item() + +print(f'Accuracy of the model on the test images: {100 * correct / total:.2f}%') +print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)) +``` + +You now have a good overview of the code for training and testing the model on the GTSRB dataset using PyTorch. + +In the next section, you will learn how to use GitHub Actions workflows to run the training and testing scripts on an Arm-hosted GitHub runner. diff --git a/content/learning-paths/servers-and-cloud-computing/gh-runners/workflows.md b/content/learning-paths/servers-and-cloud-computing/gh-runners/workflows.md new file mode 100644 index 0000000000..77519b81bd --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/gh-runners/workflows.md @@ -0,0 +1,196 @@ +--- +title: Automate training and testing with GitHub Actions +weight: 4 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Run GitHub Actions workflows + +In this section, you will use GitHub Actions to run the training and testing scripts on an Arm-hosted GitHub runner. + +### Train the model + +GitHub Actions are defined by workflows in the `.github/workflows` directory of a project. + +The workflow at `.github/workflows/train-model.yml` automates the model training. + +The training workflow uses a [PyTorch 2.3.0 Docker Image compiled with OpenBLAS from DockerHub](https://hub.docker.com/r/armswdev/pytorch-arm-neoverse) and runs the script at `scripts/train_model.py` in the container. + +When training is complete, the model is saved for future use as an artifact of the workflow. 
+
+Review the **Train Model** workflow by opening the file `.github/workflows/train-model.yml` from your fork in your browser:
+
+```yaml
+name: Train Model
+
+on:
+  workflow_dispatch:
+
+jobs:
+  train-model:
+    name: Train the Model
+    runs-on: ubuntu-22.04-arm-os # Custom ARM64 runner
+    container:
+      image: armswdev/pytorch-arm-neoverse:r24.07-torch-2.3.0-openblas
+      options: --user root
+    steps:
+    - name: Checkout code
+      uses: actions/checkout@v4
+    - name: Run training script
+      run: python scripts/train_model.py
+    - name: Upload Artifact
+      uses: actions/upload-artifact@v4
+      with:
+        name: traffic_sign_net
+        path: ${{ github.workspace }}/models/traffic_sign_net.pth
+        retention-days: 5
+```
+
+The workflow specifies one job named **Train the Model**.
+
+The job runs in the runner environment specified by `runs-on`. The value `ubuntu-22.04-arm-os` points to the Arm-hosted GitHub runner you set up in the first section.
+
+### Run the training workflow
+
+Navigate to the **Train Model** workflow under the **Actions** tab.
+
+Press the **Run workflow** button and run the workflow on the main branch.
+
+![Train_workflow](/images/train_run.png)
+
+The workflow starts running. It takes about 8 minutes to complete.
+
+Click on the workflow to see the output from each step of the workflow.
+
+![Actions_train](/images/actions_train.png)
+
+Expand the `Run training script` step to see the training loss per epoch followed by `Finished Training`.
+
+The output is similar to:
+
+```output
+(...)
+Epoch [8/10], Step [400/417], Loss: 0.0230 +Epoch [9/10], Step [100/417], Loss: 0.0193 +Epoch [9/10], Step [200/417], Loss: 0.0207 +Epoch [9/10], Step [300/417], Loss: 0.0204 +Epoch [9/10], Step [400/417], Loss: 0.0244 +Epoch [10/10], Step [100/417], Loss: 0.0114 +Epoch [10/10], Step [200/417], Loss: 0.0168 +Epoch [10/10], Step [300/417], Loss: 0.0208 +Epoch [10/10], Step [400/417], Loss: 0.0152 +Finished Training +``` + +Confirm the model is generated and saved as an artifact in the job's overview. + +![#artifact](/images/artifact.png) + +This trained model artifact is used in the next step. + +### Test the model + +The next workflow called `test-model.yml` automates running the `test_model.py` script on the Arm-hosted runner. + +The test job downloads the artifact generated by the training workflow in the previous step, and runs the inference using PyTorch with the OpenBLAS backend from the specified container image. + +Review the **Test Model** workflow by opening the file `.github/workflows/test-model.yml` in your browser: + +```yaml +name: Test Model + +on: + workflow_dispatch: + +jobs: + test-model: + name: Test the Model + runs-on: ubuntu-22.04-arm-os # Custom ARM64 runner + container: + image: armswdev/pytorch-arm-neoverse:r24.07-torch-2.3.0-openblas + options: --user root + steps: + - name: Checkout code + uses: actions/checkout@v4 + - name: Download artifacts + uses: actions/download-artifact@v4 + with: + name: traffic_sign_net + run-id: <11-digit run ID> + github-token: ${{ secrets.GITHUB_TOKEN }} + - name: Run testing script + run: python scripts/test_model.py --model traffic_sign_net.pth + +``` + +### Run the testing workflow + +{{% notice Note %}} +The `test-model.yml` file needs to be edited to be able to use the saved model from the training run. +{{% /notice %}} + +#### Modify the workflow file + +Complete the steps below to modify the testing workflow file: + +1. Navigate to the **Actions** tab on your GitHub repository. + +2. 
Click on **Train Model** on the left side of the page.
+
+3. Click on the completed **Train Model** workflow.
+
+4. Copy the 11-digit ID number from the end of the URL in your browser address bar.
+
+![#run-id](/images/run-id.png)
+
+5. Navigate back to the **Code** tab and open the file `.github/workflows/test-model.yml`.
+
+6. Click the **Edit** button, represented by a pencil icon at the top right of the file contents.
+
+7. Update the `run-id` parameter with the 11-digit ID number you copied.
+
+8. Save the file by clicking the **Commit changes** button.
+
+#### Run the workflow file
+
+You are now ready to run the **Test Model** workflow.
+
+1. Navigate to the **Actions** tab and select the **Test Model** workflow on the left side.
+
+2. Click the **Run workflow** button to run the workflow on the main branch.
+
+![#run-workflow](/images/run-workflow.png)
+
+The workflow starts running.
+
+Click on the workflow to view the output from each step.
+
+![Actions_test](/images/actions_test.png)
+
+Click on the **Run testing script** step to see the accuracy of the model and a table of the results from the PyTorch profiler.
+
+The output is similar to:
+
+```output
+Accuracy of the model on the test images: 90.48%
+------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
+                                 Name   Self CPU %     Self CPU  CPU total %    CPU total  CPU time avg   # of Calls
+------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
+                      model_inference        2.35%    332.000us      100.00%     14.141ms      14.141ms            1
+                     aten::max_pool2d        0.10%     14.000us       34.06%      4.817ms       2.409ms            2
+        aten::max_pool2d_with_indices       33.97%      4.803ms       33.97%      4.803ms       2.401ms            2
+                         aten::linear        0.08%     11.000us       32.98%      4.663ms       2.332ms            2
+                          aten::addmm       32.58%      4.607ms       32.71%      4.626ms       2.313ms            2
+                         aten::conv2d        0.08%     12.000us       22.37%      3.164ms       1.582ms            2
+                    aten::convolution        0.13%     19.000us       22.29%      3.152ms       1.576ms            2
+                   aten::_convolution        0.21%     29.000us       22.16%      3.133ms       1.567ms            2
+    aten::_nnpack_spatial_convolution       21.88%      3.094ms       21.95%      3.104ms       1.552ms            2
+                           aten::relu        0.11%     15.000us        8.17%      1.155ms     385.000us            3
+------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
+Self CPU time total: 14.141ms
+```
+
+In the next section, you will learn how to modify the testing workflow to compare the inference performance using PyTorch with two different backends.
diff --git a/content/learning-paths/servers-and-cloud-computing/kubearchinspect/_index.md b/content/learning-paths/servers-and-cloud-computing/kubearchinspect/_index.md
new file mode 100644
index 0000000000..ff83f6bff4
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/kubearchinspect/_index.md
@@ -0,0 +1,38 @@
+---
+title: Migrate containers to Arm using KubeArchInspect
+draft: true
+cascade:
+  draft: true
+
+minutes_to_complete: 15
+
+who_is_this_for: This is an introductory topic for software developers who want to know whether the container images running in a Kubernetes cluster are available for the Arm architecture.
+ +learning_objectives: + - Run KubeArchInspect to get a quick report of the containers running in a Kubernetes cluster. + - Discover which images support the Arm architecture. + - Understand common reasons for an image not supporting Arm. + - Make configuration changes to upgrade images with Arm support. + +prerequisites: + - A running Kubernetes cluster accessible with `kubectl`. + +author_primary: Jason Andrews + +### Tags +skilllevels: Introductory +subjects: Performance and Architecture +armips: + - Neoverse +tools_software_languages: + - Kubernetes +operatingsystems: + - Linux + + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. +--- diff --git a/content/learning-paths/servers-and-cloud-computing/kubearchinspect/_next-steps.md b/content/learning-paths/servers-and-cloud-computing/kubearchinspect/_next-steps.md new file mode 100644 index 0000000000..3174193d80 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/kubearchinspect/_next-steps.md @@ -0,0 +1,30 @@ +--- +next_step_guidance: Now you know how to use the KubeArchInspect tool to understand the Arm support of your Kubernetes cluster images. 
+ +recommended_path: /learning-paths/servers-and-cloud-computing/eks-multi-arch/ + +further_reading: + - resource: + title: Kubernetes documentation + link: https://kubernetes.io/docs/home/ + type: documentation + - resource: + title: Amazon Elastic Kubernetes Service + link: https://aws.amazon.com/eks/ + type: documentation + - resource: + title: Azure Kubernetes Service (AKS) + link: https://learn.microsoft.com/en-us/azure/aks/ + type: documentation + - resource: + title: Arm workloads on GKE + link: https://cloud.google.com/kubernetes-engine/docs/concepts/arm-on-gke + type: documentation + +# ================================================================================ +# FIXED, DO NOT MODIFY +# ================================================================================ +weight: 21 # set to always be larger than the content in this path, and one more than 'review' +title: "Next Steps" # Always the same +layout: "learningpathall" # All files under learning paths have this same wrapper +--- \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/kubearchinspect/_review.md b/content/learning-paths/servers-and-cloud-computing/kubearchinspect/_review.md new file mode 100644 index 0000000000..378a647733 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/kubearchinspect/_review.md @@ -0,0 +1,43 @@ +--- +review: + - questions: + question: > + Which of the following statements is true about kubearchinspect? + answers: + - KubeArchInspect displays a report of the images running in a Kubernetes cluster, but it does not identify which images support arm64. + - KubeArchInspect displays a report of the images running in a Kubernetes cluster and identifies which images support arm64. + - KubeArchInspect displays a report of the images running in a Kubernetes cluster and identifies which images are running on arm64. 
+ correct_answer: 2 + explanation: > + KubeArchInspect displays a report of the images running in a Kubernetes cluster and identifies which ones support arm64. The report is generated by connecting to the source registry for each image and checking which architectures are available. + + - questions: + question: > + True or False: KubeArchInspect automatically upgrades images to the latest version. + answers: + - "True" + - "False" + correct_answer: 2 + explanation: > + KubeArchInspect does not automatically upgrade images to the latest version. It only identifies the images that are available. + + - questions: + question: > + Which of the following is NOT a way to improve your cluster's Arm compatibility? + answers: + - Upgrade images to a newer version -- if they support arm64. + - Find an alternative image that supports arm64. + - Request that the developers of an image build and publish an arm64 version. + - Contact the Kubernetes community to upgrade your cluster. + correct_answer: 4 + explanation: > + KubeArchInspect helps you identify the available images that support arm64, but it does not upgrade the cluster. You would have to upgrade the cluster manually using the appropriate Kubernetes commands. 
+ + +# ================================================================================ +# FIXED, DO NOT MODIFY +# ================================================================================ +title: "Review" # Always the same title +weight: 20 # Set to always be larger than the content in this path +layout: "learningpathall" # All files under learning paths have this same wrapper +--- \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/kubearchinspect/analyse-results.md b/content/learning-paths/servers-and-cloud-computing/kubearchinspect/analyse-results.md new file mode 100644 index 0000000000..a3bdc0efec --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/kubearchinspect/analyse-results.md @@ -0,0 +1,41 @@ +--- +title: Analyze the results +weight: 4 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Identifying issues and opportunities + +After running KubeArchInspect, you can examine the output to determine if the cluster image architectures are suitable for your needs. + +If you want to run an all Arm cluster, you need to use images which include arm64 support. + +For example, in the previous report, you see some images of concern: + +```output +Legends: +✅ - Supports arm64, ❌ - Does not support arm64, ⬆ - Upgrade for arm64 support, ❗ - Some error occurred +------------------------------------------------------------------------------------------------ + +602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/csi-snapshotter:v6.3.2-eks-1-28-11 ❌ +... +602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/csi-node-driver-registrar:v2.9.2-eks-1-28-11 ❌ +... +sergrua/kube-tagger:release-0.1.1 ❌ +``` + +These images are identified as not supporting arm64 (`❌`). + +## Addressing issues + +The KubeArchInspect report provides valuable information for improving the cluster's performance and compatibility with the Arm architecture. 
+
+You can take several approaches to address the issues identified:
+
+* **Upgrade images:** If an image with an available arm64 version (`⬆`) is detected, consider upgrading to that version. This can be done by modifying the deployment configuration and restarting the containers using the new image tag.
+* **Find alternative images:** For images with no available arm64 version, look for alternative images that offer arm64 support. For example, instead of a specific image from the registry, try using a more general image like `busybox`, which supports multiple architectures, including arm64.
+* **Request Arm support:** If there is no suitable alternative image available, you can contact the image developers or the Kubernetes community and request that they build and publish an arm64 version of the image.
+
+KubeArchInspect provides an efficient way to understand and improve the Arm architecture support within your Kubernetes cluster, helping to ensure that your workloads run efficiently.
\ No newline at end of file
diff --git a/content/learning-paths/servers-and-cloud-computing/kubearchinspect/before-you-begin.md b/content/learning-paths/servers-and-cloud-computing/kubearchinspect/before-you-begin.md
new file mode 100644
index 0000000000..d59fe366bb
--- /dev/null
+++ b/content/learning-paths/servers-and-cloud-computing/kubearchinspect/before-you-begin.md
@@ -0,0 +1,56 @@
+---
+title: Install KubeArchInspect
+weight: 2
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+{{% notice Note %}}
+KubeArchInspect is a command-line tool which requires a running Kubernetes cluster.
+
+Make sure you can connect to your Kubernetes cluster using `kubectl`.
+{{% /notice %}}
+
+## How do I install KubeArchInspect?
+ +For Arm Linux, download the KubeArchInspect package from GitHub: + +```console +wget https://github.com/ArmDeveloperEcosystem/kubearchinspect/releases/download/v0.2.0/kubearchinspect_Linux_arm64.tar.gz +``` + +Extract the files from the release package: + +```console +tar xvfz kubearchinspect_Linux_arm64.tar.gz +``` + +The `kubearchinspect` binary is now in the current directory. + +If you are using a different platform, such as Windows or macOS, you can get other release packages from the [GitHub releases area](https://github.com/ArmDeveloperEcosystem/kubearchinspect/releases/). + +You can run `kubearchinspect` from the current location or copy it to a directory in your search path such as `/usr/local/bin`. + +## How do I verify KubeArchInspect is installed? + +Confirm KubeArchInspect works correctly by running the `kubearchinspect` command: + +```console +./kubearchinspect images --help +``` + +If KubeArchInspect is working correctly, the usage message is displayed: + +```output +Check which images in your cluster support arm64. + +Usage: + kubearchinspect images [flags] + +Flags: + -d, --debug Enable debug mode + -h, --help help for images +``` + +You are now ready to use KubeArchInspect. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/kubearchinspect/run-kubearchinspect.md b/content/learning-paths/servers-and-cloud-computing/kubearchinspect/run-kubearchinspect.md new file mode 100644 index 0000000000..61d3f4830d --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/kubearchinspect/run-kubearchinspect.md @@ -0,0 +1,73 @@ +--- +title: Run KubeArchInspect +weight: 3 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +KubeArchInspect identifies images in a Kubernetes cluster which have support for the Arm architecture. It checks each image against the image registry, checking the available architectures for each image tag. 
The results can be used to identify potential issues or opportunities for optimizing the cluster for Arm.
+
+## How do I run KubeArchInspect?
+
+To run KubeArchInspect, you need to have `kubearchinspect` installed and ensure that the `kubectl` command is configured to connect to your cluster. If `kubectl` is not already configured, set it up to connect to your cluster before continuing.
+
+Run KubeArchInspect with the following command:
+
+```console
+kubearchinspect images
+```
+
+KubeArchInspect connects to the Kubernetes cluster and generates a list of images in use.
+
+For each image found, it connects to the source registry for the image and checks which architectures are available, producing a report like the example below:
+
+```output
+Legends:
+✅ - Supports arm64, ❌ - Does not support arm64, ⬆ - Upgrade for arm64 support, ❗ - Some error occurred
+------------------------------------------------------------------------------------------------
+
+602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/coredns:v1.9.3-eksbuild.10 ❗
+602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/csi-snapshotter:v6.3.2-eks-1-28-11 ❌
+quay.io/kiwigrid/k8s-sidecar:1.21.0 ✅
+grafana/grafana:9.3.1 ✅
+redis:6.2.4-alpine ✅
+602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon/aws-network-policy-agent:v1.0.6-eksbuild.1 ❗
+registry.k8s.io/autoscaling/cluster-autoscaler:v1.25.3 ✅
+602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/csi-node-driver-registrar:v2.9.2-eks-1-28-11 ❌
+docker.io/bitnami/metrics-server:0.6.2-debian-11-r20 ⬆
+amazon/aws-for-fluent-bit:2.10.0 ✅
+quay.io/argoproj/argocd:v2.0.5 ⬆
+quay.io/prometheus/node-exporter:v1.5.0 ✅
+registry.k8s.io/ingress-nginx/controller:v1.9.4@sha256:5b161f051d017e55d358435f295f5e9a297e66158f136321d9b04520ec6c48a3 ❗
+quay.io/prometheus-operator/prometheus-operator:v0.63.0 ✅
+registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.8.1 ✅
+mirrors--ghcr-io.mirror.com/banzaicloud/vault-secrets-webhook:1.18.0 ✅
+quay.io/prometheus-operator/prometheus-config-reloader:v0.63.0 ✅
+mirrors--dockerhub.mirror.com/grafana/grafana:9.3.8 ✅
+curlimages/curl:7.85.0 ✅
+602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/csi-attacher:v4.4.2-eks-1-28-11 ❗
+602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/livenessprobe:v2.11.0-eks-1-28-11 ❗
+busybox:1.31.1 ✅
+quay.io/prometheus/prometheus:v2.42.0 ✅
+docker.io/bitnami/external-dns:0.14.0-debian-11-r2 ✅
+dsgcore--docker.mirror.com/jcaap:3.7 ❗
+602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/csi-provisioner:v3.6.2-eks-1-28-11 ❗
+602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/csi-resizer:v1.9.2-eks-1-28-11 ❗
+602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/kube-proxy:v1.25.16-minimal-eksbuild.1 ❗
+quay.io/kiwigrid/k8s-sidecar:1.22.0 ✅
+quay.io/prometheus/blackbox-exporter:v0.24.0 ✅
+amazon/cloudwatch-agent:1.247350.0b251780 ✅
+602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/aws-ebs-csi-driver:v1.26.0 ❗
+sergrua/kube-tagger:release-0.1.1 ❌
+docker.io/alpine:3.13 ✅
+quay.io/prometheus/alertmanager:v0.25.0 ✅
+602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni-init:v1.15.4-eksbuild.1 ❗
+602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni:v1.15.4-eksbuild.1 ❗
+```
+
+Each image running in the cluster appears on a separate line, including name, tag (version), and test result.
+
+A green tick indicates that the image already supports arm64, a red cross indicates that arm64 support is not available, and an upward arrow shows that arm64 support is available in a newer version.
+
+A red exclamation mark is shown when an error occurs while checking the image. This may indicate an error connecting to the image registry.
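The status symbols also make the report easy to post-process. As an illustration, here is a small Python helper (not part of KubeArchInspect) that counts images by status, assuming the report format shown above:

```python
# Summarize a kubearchinspect report by status symbol.
# Hypothetical helper, not part of KubeArchInspect; it assumes each report
# line is an image reference followed by a single status symbol.
from collections import Counter

STATUSES = {
    "✅": "supports arm64",
    "❌": "no arm64 support",
    "⬆": "upgrade for arm64",
    "❗": "error",
}

def summarize(report: str) -> Counter:
    counts = Counter()
    for line in report.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] in STATUSES:
            counts[STATUSES[parts[1]]] += 1
    return counts

sample = """grafana/grafana:9.3.1 ✅
sergrua/kube-tagger:release-0.1.1 ❌
quay.io/argoproj/argocd:v2.0.5 ⬆
amazon/aws-for-fluent-bit:2.10.0 ✅"""

print(summarize(sample))
```

A summary like this can help you track migration progress across repeated runs of the tool.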
\ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/migration/golang.md b/content/learning-paths/servers-and-cloud-computing/migration/golang.md index e56c6dd9f7..d8f5da78dc 100644 --- a/content/learning-paths/servers-and-cloud-computing/migration/golang.md +++ b/content/learning-paths/servers-and-cloud-computing/migration/golang.md @@ -18,4 +18,6 @@ Go 1.18 (released March 2022) provided a significant performance improvement. Make sure to use the latest version of the Go compiler and toolchain to get the best performance on Arm. -Visit [Go releases](https://go.dev/dl/) to see the latest versions. \ No newline at end of file +Visit [Go releases](https://go.dev/dl/) to see the latest versions. + +Refer to the [Go install guide](/install-guides/go/) for details about installing Go on Arm Linux. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/migration/java.md b/content/learning-paths/servers-and-cloud-computing/migration/java.md index 56df9ac46d..294359b90f 100644 --- a/content/learning-paths/servers-and-cloud-computing/migration/java.md +++ b/content/learning-paths/servers-and-cloud-computing/migration/java.md @@ -91,7 +91,31 @@ Depending on your application, you may want to investigate the vector processing You can try [Process Watch](https://learn.arm.com/learning-paths/servers-and-cloud-computing/processwatch/) to monitor the usage of SIMD and CRC instructions. -Refer to the [Java documentation](https://docs.oracle.com/en/java/javase/17/docs/specs/man/java.html) for more information about the flags. +Refer to the [Java documentation](https://docs.oracle.com/en/java/javase/17/docs/specs/man/java.html) for more information about the flags. + +## Memory and Garbage Collection + +The default [JVM ergonomics](https://docs.oracle.com/en/java/javase/21/gctuning/ergonomics.html) can generally be improved upon if you understand your workload well. 
+
+The default initial heap size is 1/64th of RAM, and the default maximum heap size is 1/4th of RAM. If you know your memory requirements, set both of these flags to the same value (for example, `-Xms12g` and `-Xmx12g` for an application that uses at most 12 GB). Setting both flags to the same value prevents the JVM from having to periodically allocate additional memory. Additionally, for cloud workloads the maximum heap size is often set to 75%-85% of RAM, much higher than the default setting.
+
+If you are deploying the same stack to cloud systems that have varying amounts of RAM, you might want to use `-XX:MaxRAMPercentage` instead of `-Xmx`, so you can specify a percentage of maximum RAM rather than a fixed maximum heap size. This setting can also be helpful in containerized workloads.
+
+The choice of garbage collector depends on the workload pattern you are optimizing for:
+
+* If your workload is a straightforward serial single-core load with no multithreading, set the `UseSerialGC` flag to true.
+* For multi-core batch jobs with small heaps (<4GB), set the `UseParallelGC` flag to true.
+* The G1 garbage collector (`UseG1GC` flag) is better for medium to large heaps (>4GB). It is the most commonly used GC for large parallel workloads, and is the default in high-core environments. If you want to optimize throughput, use this one.
+* The ZGC (`UseZGC` flag) has low pause times, which can drastically improve tail latencies. If you want to prioritize response time at a small cost to throughput, use ZGC.
+* The Shenandoah GC (`UseShenandoahGC` flag) is still fairly niche. It has ultra-low pause times and concurrent evacuation, making it ideal for low-latency applications, at the cost of increased CPU use.
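To make this concrete, here is an illustrative launch command for a throughput-oriented service on a machine with 16 GB of RAM. The application name and flag values are hypothetical placeholders to be tuned for your own workload, not recommendations:

```bash
# Illustrative only: pin the heap size and choose G1 for throughput.
java -Xms12g -Xmx12g -XX:+UseG1GC -jar myapp.jar

# Alternative for containers with varying RAM: size the heap as a percentage.
java -XX:MaxRAMPercentage=75 -XX:+UseG1GC -jar myapp.jar
```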
+
+If you would like to see the default JVM values for a specific processor count, you can run:
+
+```bash
+java -XX:ActiveProcessorCount=[selected processor count] -XX:+PrintFlagsFinal -version
+```
+
+Replace `[selected processor count]` with the number of processors you want to evaluate the defaults for. You can also use `-XX:ActiveProcessorCount` if you don't want to set GC and RAM sizes manually and you know which default configuration you want the JVM to use.

 ## Crypto
diff --git a/content/learning-paths/servers-and-cloud-computing/milvus-rag/_index.md b/content/learning-paths/servers-and-cloud-computing/milvus-rag/_index.md
index 0bef6ba40f..8cd15700f2 100644
--- a/content/learning-paths/servers-and-cloud-computing/milvus-rag/_index.md
+++ b/content/learning-paths/servers-and-cloud-computing/milvus-rag/_index.md
@@ -1,21 +1,17 @@
 ---
-title: Build a Retrieval-Augmented Generation (RAG) application using Zilliz Cloud on Arm servers
-
-draft: true
-cascade:
-  draft: true
+title: Build a RAG application using Zilliz Cloud on Arm servers

 minutes_to_complete: 20

-who_is_this_for: This is an introductory topic for software developers who want to create a RAG application on Arm servers.
+who_is_this_for: This is an introductory topic for software developers who want to create a Retrieval-Augmented Generation (RAG) application on Arm servers.

 learning_objectives:
-  - Create a simple RAG application using Zilliz Cloud
-  - Launch a LLM service on Arm servers
+  - Create a simple RAG application using Zilliz Cloud.
+  - Launch an LLM service on Arm servers.

 prerequisites:
-  - Basic understanding of a RAG pipeline.
-  - An AWS Graviton3 c7g.2xlarge instance, or any [Arm based instance](/learning-paths/servers-and-cloud-computing/csp) from a cloud service provider or an on-premise Arm server.
+  - A basic understanding of a RAG pipeline.
+ - An AWS Graviton3 C7g.2xlarge instance, or any [Arm-based instance](/learning-paths/servers-and-cloud-computing/csp) from a cloud service provider or an on-premise Arm server. - A [Zilliz account](https://zilliz.com/cloud), which you can sign up for with a free trial. author_primary: Chen Zhang diff --git a/content/learning-paths/servers-and-cloud-computing/milvus-rag/_review.md b/content/learning-paths/servers-and-cloud-computing/milvus-rag/_review.md index 7722e4c24c..fc44dbd984 100644 --- a/content/learning-paths/servers-and-cloud-computing/milvus-rag/_review.md +++ b/content/learning-paths/servers-and-cloud-computing/milvus-rag/_review.md @@ -12,23 +12,23 @@ review: - questions: question: > - Can Llama3.1 model run on Arm? + Can Meta Llama 3.1 run on Arm? answers: - "Yes" - "No" correct_answer: 1 explanation: > - The Llama-3.1-8B model from Meta can be used on Arm-based servers with llama.cpp. + You can use the Llama 3.1-8B model from Meta on Arm-based servers with llama.cpp. - questions: question: > - Which of the following is true about about Zilliz Cloud? + Which of the following is true about Zilliz Cloud? answers: - - "It is a fully-managed version of Milvus vector database" - - "It is a self-hosted version of Milvus vector database" + - "It is a fully managed version of Milvus vector database." + - "It is a self-hosted version of Milvus vector database." correct_answer: 1 explanation: > - Zilliz Cloud is a fully-managed version of Milvus. + Zilliz Cloud is a fully managed version of Milvus. 
diff --git a/content/learning-paths/servers-and-cloud-computing/milvus-rag/launch_llm_service.md b/content/learning-paths/servers-and-cloud-computing/milvus-rag/launch_llm_service.md index 583d1fba35..aa75e24c51 100644 --- a/content/learning-paths/servers-and-cloud-computing/milvus-rag/launch_llm_service.md +++ b/content/learning-paths/servers-and-cloud-computing/milvus-rag/launch_llm_service.md @@ -1,23 +1,23 @@ --- -title: Launch LLM Server +title: Launch the LLM Server weight: 4 ### FIXED, DO NOT MODIFY layout: learningpathall --- -In this section, you will build and run the `llama.cpp` server program using an OpenAI-compatible API on your running AWS Arm-based server instance. +### Llama 3.1 Model and Llama.cpp -### Llama 3.1 model & llama.cpp +In this section, you will build and run the `llama.cpp` server program using an OpenAI-compatible API on your AWS Arm-based server instance. The [Llama-3.1-8B model](https://huggingface.co/cognitivecomputations/dolphin-2.9.4-llama3.1-8b-gguf) from Meta belongs to the Llama 3.1 model family and is free to use for research and commercial purposes. Before you use the model, visit the Llama [website](https://llama.meta.com/llama-downloads/) and fill in the form to request access. -[llama.cpp](https://github.com/ggerganov/llama.cpp) is an open source C/C++ project that enables efficient LLM inference on a variety of hardware - both locally, and in the cloud. You can conveniently host a Llama 3.1 model using `llama.cpp`. +[Llama.cpp](https://github.com/ggerganov/llama.cpp) is an open-source C/C++ project that enables efficient LLM inference on a variety of hardware, both locally and in the cloud. You can conveniently host a Llama 3.1 model using `llama.cpp`.
-### Download and build llama.cpp +### Download and build Llama.cpp -Run the following commands to install make, cmake, gcc, g++, and other essential tools required for building llama.cpp from source: +Run the following commands to install make, cmake, gcc, g++, and other essential tools required for building Llama.cpp from source: ```bash sudo apt install make cmake -y @@ -27,13 +27,13 @@ sudo apt install build-essential -y You are now ready to start building `llama.cpp`. -Clone the source repository for llama.cpp: +Clone the source repository for Llama.cpp: ```bash git clone https://github.com/ggerganov/llama.cpp ``` -By default, `llama.cpp` builds for CPU only on Linux and Windows. You don't need to provide any extra switches to build it for the Arm CPU that you run it on. +By default, `llama.cpp` builds for CPU only on Linux and Windows. You do not need to provide any extra switches to build it for the Arm CPU that you run it on. Run `make` to build it: @@ -64,23 +64,23 @@ You can now download the model using the huggingface cli: ```bash huggingface-cli download cognitivecomputations/dolphin-2.9.4-llama3.1-8b-gguf dolphin-2.9.4-llama3.1-8b-Q4_0.gguf --local-dir . --local-dir-use-symlinks False ``` -The GGUF model format, introduced by the llama.cpp team, uses compression and quantization to reduce weight precision to 4-bit integers, significantly decreasing computational and memory demands and making Arm CPUs effective for LLM inference. +The GGUF model format, introduced by the Llama.cpp team, uses compression and quantization to reduce weight precision to 4-bit integers, significantly decreasing computational and memory demands and making Arm CPUs effective for LLM inference. 
-### Re-quantize the model weights +### Requantize the model weights -To re-quantize the model, run: +To requantize the model, run: ```bash ./llama-quantize --allow-requantize dolphin-2.9.4-llama3.1-8b-Q4_0.gguf dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf Q4_0_8_8 ``` -This will output a new file, `dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf`, which contains reconfigured weights that allow `llama-cli` to use SVE 256 and MATMUL_INT8 support. +This outputs a new file, `dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf`, which contains reconfigured weights that allow `llama-cli` to use SVE 256 and MATMUL_INT8 support. This requantization is optimal specifically for Graviton3. For Graviton2, the optimal requantization should be performed in the `Q4_0_4_4` format, and for Graviton4, the `Q4_0_4_8` format is the most suitable for requantization. ### Start the LLM Server -You can utilize the `llama.cpp` server program and send requests via an OpenAI-compatible API. This allows you to develop applications that interact with the LLM multiple times without having to repeatedly start and stop it. Additionally, you can access the server from another machine where the LLM is hosted over the network. +You can utilize the `llama.cpp` server program and send requests through an OpenAI-compatible API. This allows you to develop applications that interact with the LLM multiple times without having to repeatedly start and stop it. Additionally, you can access the server from another machine where the LLM is hosted over the network. 
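As an illustrative sketch of what an OpenAI-compatible request to this server looks like, the payload below can be built and POSTed to the `/v1/chat/completions` path, which follows the OpenAI API convention. The model name here is a hypothetical placeholder; a local `llama.cpp` server serves whichever model it was started with.

```python
import json

# Hypothetical request aimed at a llama.cpp server running on localhost.
# The endpoint path follows the OpenAI API convention.
url = "http://127.0.0.1:8080/v1/chat/completions"

payload = {
    "model": "local-llama",  # placeholder; informational for a local server
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hello."},
    ],
    "temperature": 0.7,
}

# The JSON body you would POST to the server:
body = json.dumps(payload, indent=2)
print(body)
```

Because the API shape matches OpenAI's, client libraries and tools written against the OpenAI API can talk to the local server by pointing their base URL at it.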
Start the server from the command line, and it listens on port 8080: @@ -91,10 +91,10 @@ Start the server from the command line, and it listens on port 8080: The output from this command should look like: ```output -'main: server is listening on 127.0.0.1:8080 - starting the main loop +main: server is listening on 127.0.0.1:8080 - starting the main loop ``` -You can also adjust the parameters of the launched LLM to adapt it to your server hardware to obtain ideal performance. For more parameter information, see the `llama-server --help` command. +You can also adjust the parameters of the launched LLM to adapt it to your server hardware to achieve ideal performance. For more parameter information, see the `llama-server --help` command. You have started the LLM service on your AWS Graviton instance with an Arm-based CPU. In the next section, you will directly interact with the service using the OpenAI SDK. diff --git a/content/learning-paths/servers-and-cloud-computing/milvus-rag/offline_data_loading.md b/content/learning-paths/servers-and-cloud-computing/milvus-rag/offline_data_loading.md index d69bd1ffad..0299590493 100644 --- a/content/learning-paths/servers-and-cloud-computing/milvus-rag/offline_data_loading.md +++ b/content/learning-paths/servers-and-cloud-computing/milvus-rag/offline_data_loading.md @@ -5,30 +5,31 @@ weight: 3 ### FIXED, DO NOT MODIFY layout: learningpathall --- +## Create a dedicated cluster -In this section, you will learn how to setup a cluster on Zilliz Cloud. You will then learn how to load your private knowledge database into the cluster. +In this section, you will set up a cluster on Zilliz Cloud. -### Create a dedicated cluster +Begin by [registering](https://docs.zilliz.com/docs/register-with-zilliz-cloud) for a free account on Zilliz Cloud. -You will need to [register](https://docs.zilliz.com/docs/register-with-zilliz-cloud) for a free account on Zilliz Cloud.
+After you register, [create a cluster](https://docs.zilliz.com/docs/create-cluster). -After you register, [create a cluster](https://docs.zilliz.com/docs/create-cluster) on Zilliz Cloud. In this Learning Path, you will create a dedicated cluster deployed in AWS using Arm-based machines to store and retreive the vector data as shown: +Now create a **Dedicated** cluster deployed in AWS using Arm-based machines to store and retrieve the vector data as shown: ![cluster](create_cluster.png) -When you select the `Create Cluster` Button, you should see the cluster running in your Default Project. +When you select the **Create Cluster** button, you should see the cluster running in your **Default Project**. ![running](running_cluster.png) {{% notice Note %}} -You can use self-hosted Milvus as an alternative to Zilliz Cloud. This option is more complicated to set up. We can also deploy [Milvus Standalone](https://milvus.io/docs/install_standalone-docker-compose.md) and [Kubernetes](https://milvus.io/docs/install_cluster-milvusoperator.md) on Arm-based machines. For more information about Milvus installation, please refer to the [installation documentation](https://milvus.io/docs/install-overview.md). +You can use self-hosted Milvus as an alternative to Zilliz Cloud. This option is more complicated to set up. You can also deploy [Milvus Standalone](https://milvus.io/docs/install_standalone-docker-compose.md) and [Kubernetes](https://milvus.io/docs/install_cluster-milvusoperator.md) on Arm-based machines. For more information about installing Milvus, see the [Milvus installation documentation](https://milvus.io/docs/install-overview.md). {{% /notice %}} -### Create the Collection +## Create the Collection -With the dedicated cluster running in Zilliz Cloud, you are now ready to create a collection in your cluster. +With the Dedicated cluster running in Zilliz Cloud, you are now ready to create a collection in your cluster.
-Within your activated python `venv`, start by creating a file named `zilliz-llm-rag.py` and copy the contents below into it: +Within your activated Python virtual environment `venv`, start by creating a file named `zilliz-llm-rag.py`, and copy the contents below into it: ```python from pymilvus import MilvusClient @@ -38,7 +39,7 @@ milvus_client = MilvusClient( ) ``` -Replace and with the `URI` and `Token` for your running cluster. Refer to [Public Endpoint and Api key](https://docs.zilliz.com/docs/on-zilliz-cloud-console#free-cluster-details) in Zilliz Cloud for more details. +Replace the placeholder values with the `URI` and `Token` for your running cluster. Refer to [Public Endpoint and Api key](https://docs.zilliz.com/docs/on-zilliz-cloud-console#free-cluster-details) in Zilliz Cloud for further information. Now, append the following code to `zilliz-llm-rag.py` and save the contents: @@ -56,16 +57,16 @@ milvus_client.create_collection( consistency_level="Strong", # Strong consistency level ) ``` -This code checks if a collection already exists and drops it if it does. You then, create a new collection with the specified parameters. +This code checks if a collection already exists and drops it if it does. It then creates a new collection with the specified parameters. -If you don't specify any field information, Milvus will automatically create a default `id` field for primary key, and a `vector` field to store the vector data. A reserved JSON field is used to store non-schema-defined fields and their values. -You will use inner product distance as the default metric type. For more information about distance types, you can refer to [Similarity Metrics page](https://milvus.io/docs/metric.md?tab=floating) +If you do not specify any field information, Milvus automatically creates a default `id` field for the primary key, and a `vector` field to store the vector data. A reserved JSON field is used to store non-schema-defined fields and their values.
+You can use inner product distance as the default metric type. For more information about distance types, you can refer to the [Similarity Metrics page](https://milvus.io/docs/metric.md?tab=floating). You can now prepare the data to use in this collection. -### Prepare the data +## Prepare the data -In this example, you will use the FAQ pages from the [Milvus Documentation 2.4.x](https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip) as the private knowledge that is loaded in your RAG dataset/collection. +In this example, you will use the FAQ pages from the [Milvus Documentation 2.4.x](https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip) as the private knowledge that is loaded in your RAG dataset. Download the zip file and extract documents to the folder `milvus_docs`. @@ -74,7 +75,7 @@ wget https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/m unzip -q milvus_docs_2.4.x_en.zip -d milvus_docs ``` -You will load all the markdown files from the folder `milvus_docs/en/faq` into your data collection. For each document, use "# " to separate the content in the file, which can roughly separate the content of each main part of the markdown file. +Now load all the markdown files from the folder `milvus_docs/en/faq` into your data collection. For each document, use "# " to separate the content in the file, which roughly splits the file into its main sections. Open `zilliz-llm-rag.py` and append the following code to it: @@ -91,9 +92,9 @@ for file_path in glob("milvus_docs/en/faq/*.md", recursive=True): ``` ### Insert data -You will now prepare a simple but efficient embedding model [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) that can convert the loaded text into embedding vectors.
+Now you can prepare a simple but efficient embedding model [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) that can convert the loaded text into embedding vectors. -You will iterate through the text lines, create embeddings, and then insert the data into Milvus. +You can iterate through the text lines, create embeddings, and then insert the data into Milvus. Append and save the code shown below into `zilliz-llm-rag.py`: @@ -115,10 +116,10 @@ for i, (line, embedding) in enumerate( milvus_client.insert(collection_name=collection_name, data=data) ``` -Run the python script, to check that you have successfully created the embeddings on the data you loaded into the RAG collection: +Run the Python script to check that you have successfully created the embeddings on the data you loaded into the RAG collection: ```bash -python3 python3 zilliz-llm-rag.py +python3 zilliz-llm-rag.py ``` The output should look like: diff --git a/content/learning-paths/servers-and-cloud-computing/milvus-rag/online_rag.md b/content/learning-paths/servers-and-cloud-computing/milvus-rag/online_rag.md index ced3778b20..a3622b92c1 100644 --- a/content/learning-paths/servers-and-cloud-computing/milvus-rag/online_rag.md +++ b/content/learning-paths/servers-and-cloud-computing/milvus-rag/online_rag.md @@ -5,14 +5,11 @@ weight: 5 ### FIXED, DO NOT MODIFY layout: learningpathall --- +## Prepare the Embedding Model -In this section, you will build the online RAG part of your application. +In your Python script, generate a test embedding and print its dimension and the first few elements. -### Prepare the embedding model - -In your python script, generate a test embedding and print its dimension and first few elements. - -For the LLM, you will use the OpenAI SDK to request the Llama service launched before. You don't need to use any API key because it is running locally on your machine.
+For the LLM, you will use the OpenAI SDK to request the Llama service that you launched previously. You do not need to use an API key because it is running locally on your machine. Append the code below to `zilliz-llm-rag.py`: @@ -31,7 +28,7 @@ Run the script. The output should look like: ### Retrieve data for a query -You will specify a frequent question about Milvus and then search for the question in the collection and retrieve the semantic top-3 matches. +Now specify a common question about Milvus, and search for the question in the collection, in order to retrieve the top 3 semantic matches. Append the code shown below to `zilliz-llm-rag.py`: @@ -55,7 +52,7 @@ retrieved_lines_with_distances = [ ] print(json.dumps(retrieved_lines_with_distances, indent=4)) ``` -Run the script again and the output with the top 3 matches will look like: +Run the script again, and the output with the top 3 matches should look like: ```output [ @@ -68,18 +65,18 @@ Run the script again and the output with the top 3 matches will look like: 0.5974207520484924 ], [ - "What is the maximum dataset size Milvus can handle?\n\n \nTheoretically, the maximum dataset size Milvus can handle is determined by the hardware it is run on, specifically system memory and storage:\n\n- Milvus loads all specified collections and partitions into memory before running queries. Therefore, memory size determines the maximum amount of data Milvus can query.\n- When new entities and and collection-related schema (currently only MinIO is supported for data persistence) are added to Milvus, system storage determines the maximum allowable size of inserted data.\n\n###", + "What is the maximum dataset size Milvus can handle?\n\n \nTheoretically, the maximum dataset size Milvus can handle is determined by the hardware it is run on, specifically system memory and storage:\n\n- Milvus loads all specified collections and partitions into memory before running queries. 
Therefore, memory size determines the maximum amount of data Milvus can query.\n- When new entities and collection-related schema (currently only MinIO is supported for data persistence) are added to Milvus, system storage determines the maximum allowable size of inserted data.\n\n###", 0.5833579301834106 ] ] ``` -### Use LLM to get a RAG response +### Use the LLM to obtain a RAG response You are now ready to use the LLM and obtain a RAG response. -For the LLM, you will use the OpenAI SDK to request the Llama service you launched in the previous section. You don't need to use any API key because it is running locally on your machine. +For the LLM, you will use the OpenAI SDK to request the Llama service you launched in the previous section. You do not need to use an API key because it is running locally on your machine. -You will then convert the retrieved documents into a string format. Define system and user prompts for the Language Model. This prompt is assembled with the retrieved documents from Milvus. Finally use the LLM to generate a response based on the prompts. +You will then convert the retrieved documents into a string format. Define system and user prompts for the Language Model. This prompt is assembled with the retrieved documents from Milvus. Finally, use the LLM to generate a response based on the prompts. Append the code below into `zilliz-llm-rag.py`: @@ -117,7 +114,7 @@ print(response.choices[0].message.content) ``` {{% notice Note %}} -Make sure your llama.cpp server from the previous section is running before you proceed +Make sure your llama.cpp server from the previous section is running before you proceed. {{% /notice %}} Run the script one final time with these changes using `python3 zilliz-llm-rag.py`. 
The output should look like: diff --git a/content/learning-paths/servers-and-cloud-computing/milvus-rag/prerequisite.md b/content/learning-paths/servers-and-cloud-computing/milvus-rag/prerequisite.md index 9d336b341d..1008493283 100644 --- a/content/learning-paths/servers-and-cloud-computing/milvus-rag/prerequisite.md +++ b/content/learning-paths/servers-and-cloud-computing/milvus-rag/prerequisite.md @@ -1,5 +1,5 @@ --- -title: Install dependencies +title: Overview and Install dependencies weight: 2 ### FIXED, DO NOT MODIFY @@ -8,14 +8,20 @@ layout: learningpathall ## Overview -In this Learning Path, you will learn how to build a Retrieval-Augmented Generation (RAG) application on Arm-based servers. RAG applications often use vector databases to efficiently store and retrieve high-dimensional vector representations of text data. Vector databases are optimized for similarity search and can handle large volumes of vector data, making them ideal for the retrieval component of RAG systems. In this example, you will utilize [Zilliz Cloud](https://zilliz.com/cloud), the fully-managed Milvus vector database as your vector storage. Zilliz Cloud is available on major cloud such as AWS, GCP and Azure. In this demo you will use Zilliz Cloud deployed on AWS with Arm based servers. For the LLM, you will use the `Llama-3.1-8B` model running on an AWS Arm-based server using `llama.cpp`. +In this Learning Path, you will learn how to build a Retrieval-Augmented Generation (RAG) application on Arm-based servers. + +RAG applications often use vector databases to efficiently store and retrieve high-dimensional vector representations of text data. Vector databases are optimized for similarity search and can handle large volumes of vector data, making them ideal for the retrieval component of RAG systems. + +In this Learning Path, you will use [Zilliz Cloud](https://zilliz.com/cloud) for your vector storage, which is a fully managed Milvus vector database. 
Zilliz Cloud is available on major cloud computing service providers such as AWS, GCP, and Azure. + +Here, you will use Zilliz Cloud deployed on AWS with an Arm-based server. For the LLM, you will use the Llama-3.1-8B model also running on an AWS Arm-based server, but using `llama.cpp`. ## Install dependencies -This Learning Path has been tested on an AWS Graviton3 `c7g.2xlarge` instance running Ubuntu 22.04 LTS system. -You need at least four cores and 8GB of RAM to run this example. Configure disk storage up to at least 32 GB. +This Learning Path has been tested on an AWS Graviton3 `c7g.2xlarge` instance running an Ubuntu 22.04 LTS system. +You need at least four cores and 8GB of RAM to run this example. Configure at least 32 GB of disk storage. -After you launch the instance, connect to it and run the following commands to prepare the environment. +After you have launched the instance, connect to it, and run the following commands to prepare the environment. Install python: diff --git a/content/learning-paths/servers-and-cloud-computing/profiling-for-neoverse/streamline-cli.md b/content/learning-paths/servers-and-cloud-computing/profiling-for-neoverse/streamline-cli.md index fa5f177510..ff6bc4bc99 100644 --- a/content/learning-paths/servers-and-cloud-computing/profiling-for-neoverse/streamline-cli.md +++ b/content/learning-paths/servers-and-cloud-computing/profiling-for-neoverse/streamline-cli.md @@ -26,33 +26,23 @@ Arm recommends that you profile an optimized release build of your application, ### Procedure {.section} -1. Download and extract the Streamline CLI tools on your Arm server: +1. Use the download utility script to download the latest version of the tool and extract it to the current working directory on your Arm server:
Follow the instructions in the [Install Guide](/install-guides/streamline-cli/) to ensure you have everything set up correctly. Arm recommends that you apply the kernel patch as described in this guide, to improve support for capturing function-attributed top-down metrics on Arm systems. - -1. The `sl-format.py` Python script requires Python 3.8 or later, and depends on several third-party modules. We recommend creating a Python virtual environment containing these modules to run the tools. For example: - - ```sh - # From Bash - python3 -m venv sl-venv - source ./sl-venv/bin/activate + wget https://artifacts.tools.arm.com/arm-performance-studio/Streamline_CLI_Tools/get-streamline-cli.py - # From inside the virtual environment - python3 -m pip install -r ./streamline_cli_tools/bin/requirements.txt + python3 get-streamline-cli.py install ``` - {{% notice Note%}} - The instructions in this guide assume you have added the `/bin/` directory to your `PATH` environment variable, and that you run all Python commands from inside the virtual environment. - {{% /notice %}} + The script can also be used to download a specific version, or install to a user-specified directory. Refer to the [Install Guide](/install-guides/streamline-cli/) for details on all the script options. + + {{% notice %}} + Follow the instructions in the [Install Guide](/install-guides/streamline-cli/) to ensure you have everything set up correctly. Arm recommends that you apply the kernel patch as described in this guide, to improve support for capturing function-attributed top-down metrics on Arm systems. + {{% /notice %}} 1. Use `sl-record` to capture a raw profile of your application and save the data to a directory on the filesystem. - Arm recommends making a profile of at least 20 seconds in duration, which ensures that the profiler can capture a statistically significant number of samples for all of the metrics. 
+ Arm recommends making a profile of at least 20 seconds in duration, which ensures that the profiler can capture a statistically significant number of samples for all of the metrics. ```sh sl-record -C workflow_topdown_basic -o -A @@ -110,7 +100,7 @@ Arm recommends that you profile an optimized release build of your application, ## Capturing a system-wide profile -To capture a system-wide profile, which captures all processes and threads, run `sl-record` with the `-S yes` option and omit the `-A ` application-specific option and following arguments. +To capture a system-wide profile, which captures all processes and threads, run `sl-record` with the `-S yes` option and omit the `-A` application-specific option and following arguments. In systems without the kernel patches, system-wide profiles can capture the top-down metrics. To keep the captures to a usable size, it may be necessary to limit the duration of the profiles to less than 5 minutes. @@ -118,7 +108,7 @@ In systems without the kernel patches, system-wide profiles can capture the top- To capture top-down metrics in a system without the kernel patches, there are three options available: -* To capture a system-wide profile, which captures all processes and threads, run with the `-S yes` option and omit the `-A ` application-specific option and following arguments. To keep the captures to a usable size, it may be necessary to limit the duration of the profiles to less than 5 minutes. +* To capture a system-wide profile, which captures all processes and threads, run with the `-S yes` option and omit the `-A` application-specific option and following arguments. To keep the captures to a usable size, it may be necessary to limit the duration of the profiles to less than 5 minutes. * To reliably capture single-threaded application profile, add the `--inherit no` option to the command line. 
However, in this mode metrics are only captured for the first thread in the application process and any child threads or processes are ignored. diff --git a/content/learning-paths/servers-and-cloud-computing/pytorch-llama/pytorch-llama.md b/content/learning-paths/servers-and-cloud-computing/pytorch-llama/pytorch-llama.md index 1758f71307..337b44e48b 100644 --- a/content/learning-paths/servers-and-cloud-computing/pytorch-llama/pytorch-llama.md +++ b/content/learning-paths/servers-and-cloud-computing/pytorch-llama/pytorch-llama.md @@ -50,7 +50,7 @@ wget https://raw.githubusercontent.com/ArmDeveloperEcosystem/PyTorch-arm-patches wget https://raw.githubusercontent.com/ArmDeveloperEcosystem/PyTorch-arm-patches/main/0001-Feat-Enable-int4-quantized-models-to-work-with-pytor.patch git apply 0001-Feat-Enable-int4-quantized-models-to-work-with-pytor.patch git apply --whitespace=nowarn 0001-modified-generate.py-for-cli-and-browser.patch -./install_requirements.sh +pip install -r requirements.txt ``` {{% notice Note %}} You will need Python version 3.10 to apply these patches. This is the default version of Python installed on an Ubuntu 22.04 Linux machine. {{% /notice %}} diff --git a/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/1-dev-env-setup.md b/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/1-dev-env-setup.md index bc58ac8ac4..ff573c391e 100644 --- a/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/1-dev-env-setup.md +++ b/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/1-dev-env-setup.md @@ -8,31 +8,31 @@ layout: learningpathall ## Set up your development environment -In this Learning Path, you will learn how to build and deploy a simple LLM-based chat app to an Android device using ExecuTorch and XNNPACK. 
You will learn how to build the ExecuTorch runtime for Llama models, build JNI libraries for the Android application, and use the libraries in the application. +In this Learning Path, you will learn how to build and deploy a simple LLM-based chat app to an Android device using ExecuTorch and XNNPACK with [KleidiAI](https://gitlab.arm.com/kleidi/kleidiai). Arm has worked with the Meta team to integrate KleidiAI into ExecuTorch through XNNPACK. These improvements increase the throughput of quantized LLMs running on Arm chips that contain the i8mm (8-bit integer matrix multiply) processor feature. You will learn how to build the ExecuTorch runtime for Llama models with KleidiAI, build JNI libraries for the Android application, and use the libraries in the application. The first step is to prepare a development environment with the required software: - Android Studio (latest version recommended). -- Android NDK version 25.0.8775105. +- Android NDK version 28.0.12433566. - Java 17 JDK. - Git. -- Python 3.10. +- Python 3.10 or later (these instructions have been tested with 3.10 and 3.12). -The instructions assume macOS with Apple Silicon, an x86 Debian, or Ubuntu Linux machine with at least 16GB of RAM. +The instructions assume macOS with Apple Silicon, an x86 Debian, or an Ubuntu Linux machine, with at least 16GB of RAM. ## Install Android Studio and Android NDK Follow these steps to install and configure Android Studio: -1. Download and install the latest version of [Android Studio](https://developer.android.com/studio/). +1. Download and install the latest version of [Android Studio](https://developer.android.com/studio/). -2. Start Android Studio and open the `Settings` dialog. +2. Start Android Studio and open the **Settings** dialog. -3. Navigate to `Languages & Frameworks -> Android SDK`. +3. Navigate to **Languages & Frameworks**, then **Android SDK**. -4. In the `SDK Platforms` tab, check `Android 14.0 ("UpsideDownCake")`. +4. 
In the **SDK Platforms** tab, check **Android 14.0 ("UpsideDownCake")**. -Next, install the specific version of the Android NDK that you need by first installing the Android command line tools: +Next, install the specific version of the Android NDK that you require by first installing the Android command line tools: Linux: @@ -49,53 +49,55 @@ curl https://dl.google.com/android/repository/commandlinetools-mac-11076708_late Unzip the Android command line tools: ``` -unzip commandlinetools.zip +unzip commandlinetools.zip -d android-sdk ``` -Install the NDK in the directory that Android Studio installed the SDK. This is generally `~/Library/Android/sdk` by default: +Install the NDK in the same directory that Android Studio installed the SDK. This is generally `~/Library/Android/sdk` by default. Set the required environment variables: ``` export ANDROID_HOME="$(realpath ~/Library/Android/sdk)" -./cmdline-tools/bin/sdkmanager --sdk_root="${ANDROID_HOME}" --install "ndk;25.0.8775105" +export PATH=$ANDROID_HOME/cmdline-tools/bin/:$PATH +sdkmanager --sdk_root="${ANDROID_HOME}" --install "ndk;28.0.12433566" +export ANDROID_NDK=$ANDROID_HOME/ndk/28.0.12433566/ ``` ## Install Java 17 JDK Open the [Java SE 17 Archive Downloads](https://www.oracle.com/java/technologies/javase/jdk17-archive-downloads.html) page in your browser. -Select an appropriate download for your development machine operating system. +Select an appropriate download for your development machine operating system. Downloads are available for macOS as well as Linux.
-## Install Git +## Install Git and cmake For macOS use [Homebrew](https://brew.sh/): - + ``` bash -brew install git +brew install git cmake ``` For Linux, use the package manager for your distribution: - + ``` bash -sudo apt install git-all +sudo apt install git-all cmake ``` ## Install Python 3.10 For macOS: - + ``` bash brew install python@3.10 ``` For Linux: - + ``` bash sudo apt update -udo apt install software-properties-common -y +sudo apt install software-properties-common -y sudo add-apt-repository ppa:deadsnakes/ppa -sudo apt install Python3.10 +sudo apt install Python3.10 python3.10-venv ``` You now have the required development tools installed. diff --git a/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/2-executorch-setup.md b/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/2-executorch-setup.md index 13e289cc32..bad77ccd91 100755 --- a/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/2-executorch-setup.md +++ b/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/2-executorch-setup.md @@ -19,13 +19,13 @@ python3.10 -m venv executorch source executorch/bin/activate ``` -The prompt of your terminal has (executorch) as a prefix to indicate the virtual environment is active. +The prompt of your terminal has `executorch` as a prefix to indicate the virtual environment is active. ### Option 2: Create a Conda virtual environment Install Miniconda on your development machine by following the [Installing conda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html) instructions. 
-Once `conda` is installed create the environment: +Once `conda` is installed, create the environment: ```bash conda create -yn executorch python=3.10.0 @@ -34,23 +34,16 @@ conda activate executorch ### Clone ExecuTorch and install the required dependencies -From within the conda environment, run the commands below to download the ExecuTorch repository and install the required packages: +From within the conda environment, run the commands below to download the ExecuTorch repository and install the required packages: ``` bash -# Clone the ExecuTorch repo from GitHub git clone https://github.com/pytorch/executorch.git cd executorch - -# Update and pull submodules git submodule sync git submodule update --init - -# Install ExecuTorch pip package and its dependencies, as well as -# development tools like CMake. +./install_requirements.sh ./install_requirements.sh --pybind xnnpack - -# Install a few more dependencies -./examples/models/llama2/install_requirements.sh +./examples/models/llama/install_requirements.sh ``` -You are now ready to start building the application. \ No newline at end of file +When these scripts finish successfully, ExecuTorch is set up. That means it's time to dive into the world of Llama models! diff --git a/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/3-Understanding-LLaMA-models.md b/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/3-Understanding-LLaMA-models.md index 6e64e13c05..9066a72da1 100644 --- a/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/3-Understanding-LLaMA-models.md +++ b/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/3-Understanding-LLaMA-models.md @@ -22,7 +22,7 @@ Llama models are powerful and versatile, having the ability to generate coherent * Virtual assistants. * Language translation. 
-Please note that the models are subject to the [acceptable use policy](https://github.com/facebookresearch/llama/blob/main/USE_POLICY.md) and this [responsible use guide](https://ai.meta.com/static-resource/responsible-use-guide/). +Please note that the models are subject to the [acceptable use policy](https://github.com/facebookresearch/llama/blob/main/USE_POLICY.md) and [this responsible use guide](https://ai.meta.com/static-resource/responsible-use-guide/). ## Results @@ -30,11 +30,11 @@ As Llama 2 and Llama 3 models require at least 4-bit quantization due to the con ## Quantization -One way to create models that fit in smartphone memory is to employ 4-bit groupwise per token dynamic quantization of all the linear layers of the model. *Dynamic quantization* refers to quantizing activations dynamically, such that quantization parameters for activations are calculated, from the min/max range, at runtime. Furthermore, weights are statically quantized. In this case, weights are per-channel groupwise quantized with 4-bit signed integers. +One way to create models that fit in smartphone memory is to employ 4-bit groupwise per token dynamic quantization of all the linear layers of the model. *Dynamic quantization* refers to quantizing activations dynamically, such that quantization parameters for activations are calculated, from the min/max range, at runtime. Furthermore, weights are statically quantized. In this case, weights are per-channel groupwise quantized with 4-bit signed integers. For further information, refer to [torchao: PyTorch Architecture Optimization](https://github.com/pytorch-labs/ao/). -The table below evaluates WikiText perplexity using [LM Eval](https://github.com/EleutherAI/lm-evaluation-harness). +The table below evaluates WikiText perplexity using [LM Eval](https://github.com/EleutherAI/lm-evaluation-harness). 
The results are for two different groupsizes, with max_seq_len 2048, and 1000 samples: @@ -43,9 +43,9 @@ The results are for two different groupsizes, with max_seq_len 2048, and 1000 sa |Llama 2 7B | 9.2 | 10.2 | 10.7 |Llama 3 8B | 7.9 | 9.4 | 9.7 -Note that groupsize less than 128 was not enabled, since such a model was still too large. This is because current efforts have focused on enabling FP32, and support for FP16 is under way. +Note that groupsize less than 128 was not enabled in this example, since the model was still too large. This is because current efforts have focused on enabling FP32, and support for FP16 is under way. What this implies for model size is: -1. Embedding table is in FP32. +1. Embedding table is in FP32. 2. Quantized weights scales are FP32. diff --git a/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/4-Prepare-LLaMA-models.md b/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/4-Prepare-LLaMA-models.md index 7cbe409102..116bb3f364 100755 --- a/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/4-Prepare-LLaMA-models.md +++ b/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/4-Prepare-LLaMA-models.md @@ -6,121 +6,48 @@ weight: 5 layout: learningpathall --- -## Download and export the Llama 3 8B model +## Download and export the Llama 3.2 1B model -To get started with Llama 3, you obtain the pre-trained parameters by visiting [Meta's Llama Downloads](https://llama.meta.com/llama-downloads/) page. Request the access by filling out your details and read through and accept the Responsible Use Guide. This grants you a license and a download link which is valid for 24 hours. The Llama 3 8B model is used for this part, but the same instructions apply for other options as well with minimal modification. 
+To get started with Llama 3, you can obtain the pre-trained parameters by visiting [Meta's Llama Downloads](https://llama.meta.com/llama-downloads/) page. Request access by filling out your details, and read through and accept the Responsible Use Guide. This grants you a license and a download link which is valid for 24 hours. The Llama 3.2 1B Instruct model is used for this exercise, but the same instructions apply to other options as well with minimal modification. -Install the following requirements using a package manager of your choice, for example apt-get: +Install the `llama-stack` package from `pip`. ```bash -apt-get install md5sum wget +pip install llama-stack ``` - -Clone the Llama models Git repository and install the dependencies: - -```bash -git clone https://github.com/meta-llama/llama-models -cd llama-models -pip install -e . -pip install buck -``` -Run the script to download, and paste the download link from the email when prompted. +Run the command to download, and paste the download link from the email when prompted. ```bash -cd models/llama3_1 -./download.sh +llama model download --source meta --model-id Llama3.2-1B-Instruct ``` -You will be asked which models you would like to download. Enter `meta-llama-3.1-8b`. + +When the download is finished, the installation path is printed as output. ```output - **** Model list *** - - meta-llama-3.1-405b - - meta-llama-3.1-70b - - meta-llama-3.1-8b - - meta-llama-guard-3-8b - - prompt-guard +Successfully downloaded model to //.llama/checkpoints/Llama3.2-1B-Instruct ``` -When the download is finished, you should see the following files in the new folder + +Verify by viewing the downloaded files under this path: ```bash -$ ls Meta-Llama-3.1-8B -consolidated.00.pth params.json tokenizer.model +ls $HOME/.llama/checkpoints/Llama3.2-1B-Instruct +checklist.chk consolidated.00.pth params.json tokenizer.model ``` -{{% notice Note %}} -1. 
If you encounter the error "Sorry, we could not process your request at this moment", it might mean you have initiated two license processes simultaneously. Try modifying the affiliation field to work around it. -2. You may have to run the `download.sh` script as root, or modify the execution privileges with `chmod`. +{{% notice Working Directory %}} +The rest of the instructions should be executed from the ExecuTorch base directory. {{% /notice %}} -Export model and generate `.pte` file. Run the Python command to export the model: +Export the model and generate a `.pte` file. Run the Python command to export the model to your current directory: ```bash -python -m examples.models.llama2.export_llama --checkpoint llama-models/models/llama3_1/Meta-Llama-3.1-8B/consolidated.00.pth -p llama-models/models/llama3_1/Meta-Llama-3.1-8B/params.json -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_id":128001}' --embedding-quantize 4,32 --output_name="llama3_kv_sdpa_xnn_qe_4_32.pte" +python3 -m examples.models.llama.export_llama \ +--checkpoint $HOME/.llama/checkpoints/Llama3.2-1B-Instruct/consolidated.00.pth \ +--params $HOME/.llama/checkpoints/Llama3.2-1B-Instruct/params.json \ +-kv --use_sdpa_with_kv_cache -X --xnnpack-extended-ops -qmode 8da4w \ +--group_size 64 -d fp32 \ +--metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001, 128006, 128007]}' \ +--embedding-quantize 4,32 \ +--output_name="llama3_1B_kv_sdpa_xnn_qe_4_64_1024_embedding_4bit.pte" \ +--max_seq_length 1024 ``` Due to the larger vocabulary size of Llama 3, you should quantize the embeddings with `--embedding-quantize 4,32` to further reduce the model size.
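As a mental model for what the `-qmode 8da4w --group_size 64` flags do to the weights, here is a minimal NumPy sketch of symmetric groupwise 4-bit quantization. This is an illustration only: the function names and the per-group scale rule are assumptions for the sketch, not the torchao kernels that `export_llama` actually uses.

```python
import numpy as np

def quantize_4bit_groupwise(w, group_size=64):
    """Quantize a 1-D float weight vector to signed 4-bit integers,
    with one FP32 scale per group of `group_size` values."""
    assert w.size % group_size == 0
    groups = w.reshape(-1, group_size)
    # Symmetric scale: map the max magnitude in each group onto the int4 limit (7).
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid dividing by zero for all-zero groups
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    """Reconstruct approximate FP32 weights from int4 values and group scales."""
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
q, s = quantize_4bit_groupwise(w, group_size=64)
w_hat = dequantize(q, s)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

Smaller groups give each scale less dynamic range to cover, so accuracy improves, but the FP32 scales themselves add overhead, which is the size trade-off the perplexity table earlier in this Learning Path illustrates.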
-## Optional: Evaluate Llama 3 model accuracy - -You can evaluate model accuracy using the same arguments as above: - -``` bash -python -m examples.models.llama2.eval_llama -c llama-models/models/llama3_1/Meta-Llama-3.1-8B/consolidated.00.pth -p llama-models/models/llama3_1/Meta-Llama-3.1-8B/params.json -t llama-models/models/llama3_1/Meta-Llama-3.1-8B/tokenizer.model -d fp32 --max_seq_len 2048 --limit 1000 -``` - -{{% notice Warning %}} -Model evaluation without a GPU will take a long time. On a MacBook with an M3 chip and 18GB RAM this took 10+ hours. -{{% /notice %}} - -## Validate models on the development machine - -Before running models on a smartphone, you can validate them on your development computer. - -Follow the steps below to build ExecuTorch and the Llama runner to run models. - -1. Build executorch with optimized CPU performance: - - ``` bash - cmake -DPYTHON_EXECUTABLE=python \ - -DCMAKE_INSTALL_PREFIX=cmake-out \ - -DEXECUTORCH_ENABLE_LOGGING=1 \ - -DCMAKE_BUILD_TYPE=Release \ - -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \ - -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \ - -DEXECUTORCH_BUILD_XNNPACK=ON \ - -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \ - -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \ - -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \ - -Bcmake-out . - - cmake --build cmake-out -j16 --target install --config Release - ``` - - The CMake build options are available on [GitHub](https://github.com/pytorch/executorch/blob/main/CMakeLists.txt#L59). - -2. Build the Llama runner: - -{{% notice Note %}} -For Llama 3, add `-DEXECUTORCH_USE_TIKTOKEN=ON` option. 
-{{% /notice %}} - -Run cmake: - -``` bash - cmake -DPYTHON_EXECUTABLE=python \ - -DCMAKE_INSTALL_PREFIX=cmake-out \ - -DCMAKE_BUILD_TYPE=Release \ - -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \ - -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \ - -DEXECUTORCH_BUILD_XNNPACK=ON \ - -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \ - -Bcmake-out/examples/models/llama2 \ - examples/models/llama2 - - cmake --build cmake-out/examples/models/llama2 -j16 --config Release -``` - -3. Run the model: - - ``` bash - cmake-out/examples/models/llama2/llama_main --model_path=llama3_kv_sdpa_xnn_qe_4_32.pte --tokenizer_path=llama-models/models/llama3_1/Meta-Llama-3.1-8B/tokenizer.model --prompt= - ``` - - The run options are available on [GitHub](https://github.com/pytorch/executorch/blob/main/examples/models/llama2/main.cpp#L18-L40). diff --git a/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/5-Run-Benchmark-on-Android.md b/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/5-Run-Benchmark-on-Android.md index 20d1503491..fe1bd9981e 100644 --- a/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/5-Run-Benchmark-on-Android.md +++ b/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/5-Run-Benchmark-on-Android.md @@ -12,17 +12,19 @@ Cross-compile Llama runner to run on Android using the steps below. ### 1. Set Android NDK -Set the environment variable to point to the Android NDK. +Set the environment variable to point to the Android NDK: ``` bash -export ANDROID_NDK=~/Library/Android/sdk/ndk/25.0.8775105 +export ANDROID_NDK=$ANDROID_HOME/ndk/28.0.12433566/ ``` {{% notice Note %}} Make sure you can confirm $ANDROID_NDK/build/cmake/android.toolchain.cmake is available for CMake to cross-compile. {{% /notice %}} -### 2. Build ExecuTorch and associated libraries for Android +### 2. 
Build ExecuTorch and associated libraries for Android with KleidiAI + +You are now ready to build ExecuTorch for Android by taking advantage of the performance optimization provided by the [KleidiAI](https://gitlab.arm.com/kleidi/kleidiai) kernels. Use `cmake` to cross-compile ExecuTorch: @@ -31,21 +33,26 @@ cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \ -DANDROID_ABI=arm64-v8a \ -DANDROID_PLATFORM=android-23 \ -DCMAKE_INSTALL_PREFIX=cmake-out-android \ + -DEXECUTORCH_ENABLE_LOGGING=1 \ -DCMAKE_BUILD_TYPE=Release \ - -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \ -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \ - -DEXECUTORCH_ENABLE_LOGGING=1 \ - -DPYTHON_EXECUTABLE=python \ + -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \ + -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \ -DEXECUTORCH_BUILD_XNNPACK=ON \ -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \ -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \ -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \ + -DEXECUTORCH_XNNPACK_ENABLE_KLEIDI=ON \ + -DXNNPACK_ENABLE_ARM_BF16=OFF \ -Bcmake-out-android . -cmake --build cmake-out-android -j16 --target install --config Release +cmake --build cmake-out-android -j7 --target install --config Release ``` +{{% notice Note %}} +Make sure you add the `-DEXECUTORCH_XNNPACK_ENABLE_KLEIDI=ON` option to enable support for KleidiAI kernels in ExecuTorch with XNNPACK. +{{% /notice %}} -### 3. Build Llama runner for android
Build Llama runner for Android Use `cmake` to cross-compile Llama runner: @@ -60,23 +67,21 @@ cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \ -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \ -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \ -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \ - -Bcmake-out-android/examples/models/llama2 \ - examples/models/llama2 + -DEXECUTORCH_USE_TIKTOKEN=ON \ + -Bcmake-out-android/examples/models/llama \ + examples/models/llama -cmake --build cmake-out-android/examples/models/llama2 -j16 --config Release +cmake --build cmake-out-android/examples/models/llama -j16 --config Release ``` -{{% notice Note %}} -For Llama 3, add `-DEXECUTORCH_USE_TIKTOKEN=ON` option when building the Llama runner. -{{% /notice %}} - You should now have `llama_main` available for Android. ## Run on Android via adb shell +You will need an Arm-powered smartphone with the i8mm feature running Android, with 16GB of RAM. The following steps were tested on a Google Pixel 8 Pro phone. -### 1. Connect your android phone +### 1. Connect your Android phone -Connect your phone to your computer using a USB cable. +Connect your phone to your computer using a USB cable. You need to enable USB debugging on your Android device. You can follow [Configure on-device developer options](https://developer.android.com/studio/debug/dev-options) to enable USB debugging. @@ -86,27 +91,66 @@ Once you have enabled USB debugging and connected via USB, run: adb devices ``` -You should see your device listed to confirm it is connected. +You should see your device listed to confirm it is connected. ### 2. 
Copy the model, tokenizer, and Llama runner binary to the phone ``` bash adb shell mkdir -p /data/local/tmp/llama -adb push /data/local/tmp/llama/ -adb push /data/local/tmp/llama/ -adb push cmake-out-android/examples/models/llama2/llama_main /data/local/tmp/llama/ +adb push llama3_1B_kv_sdpa_xnn_qe_4_64_1024_embedding_4bit.pte /data/local/tmp/llama/ +adb push $HOME/.llama/checkpoints/Llama3.2-1B-Instruct/tokenizer.model /data/local/tmp/llama/ +adb push cmake-out-android/examples/models/llama/llama_main /data/local/tmp/llama/ ``` -{{% notice Note %}} -For Llama 3, you can pass the original `tokenizer.model` (without converting to `.bin` file). -{{% /notice %}} ### 3. Run the model Use the Llama runner to execute the model on the phone with the `adb` command: ``` bash -adb shell "cd /data/local/tmp/llama && ./llama_main --model_path --tokenizer_path --prompt \"Once upon a time\" --seq_len 120" +adb shell "cd /data/local/tmp/llama && ./llama_main --model_path llama3_1B_kv_sdpa_xnn_qe_4_64_1024_embedding_4bit.pte --tokenizer_path tokenizer.model --prompt \"<|start_header_id|>system<|end_header_id|>\nYour name is Cookie. you are helpful, polite, precise, concise, honest, good at writing. You always give precise and brief answers up to 32 words<|eot_id|><|start_header_id|>user<|end_header_id|>\nHey Cookie! how are you today?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\" --warmup=1 --cpu_threads=5" +``` + +The output should look something like this.
+ +``` +I 00:00:00.003316 executorch:main.cpp:69] Resetting threadpool with num threads = 5 +I 00:00:00.009329 executorch:runner.cpp:59] Creating LLaMa runner: model_path=llama3_1B_kv_sdpa_xnn_qe_4_64_1024_embedding_4bit.pte, tokenizer_path=tokenizer.model +I 00:00:03.569399 executorch:runner.cpp:88] Reading metadata from model +I 00:00:03.569451 executorch:runner.cpp:113] Metadata: use_sdpa_with_kv_cache = 1 +I 00:00:03.569455 executorch:runner.cpp:113] Metadata: use_kv_cache = 1 +I 00:00:03.569459 executorch:runner.cpp:113] Metadata: get_vocab_size = 128256 +I 00:00:03.569461 executorch:runner.cpp:113] Metadata: get_bos_id = 128000 +I 00:00:03.569464 executorch:runner.cpp:113] Metadata: get_max_seq_len = 1024 +I 00:00:03.569466 executorch:runner.cpp:113] Metadata: enable_dynamic_shape = 1 +I 00:00:03.569469 executorch:runner.cpp:120] eos_id = 128009 +I 00:00:03.569470 executorch:runner.cpp:120] eos_id = 128001 +I 00:00:03.569471 executorch:runner.cpp:120] eos_id = 128006 +I 00:00:03.569473 executorch:runner.cpp:120] eos_id = 128007 +I 00:00:03.569475 executorch:runner.cpp:168] Doing a warmup run... +I 00:00:03.838634 executorch:text_prefiller.cpp:53] Prefill token result numel(): 128256 + +I 00:00:03.892268 executorch:text_token_generator.h:118] +Reached to the end of generation +I 00:00:03.892281 executorch:runner.cpp:267] Warmup run finished! +I 00:00:03.892286 executorch:runner.cpp:174] RSS after loading model: 1269.445312 MiB (0 if unsupported) +<|start_header_id|>system<|end_header_id|>\nYour name is Cookie. you are helpful, polite, precise, concise, honest, good at writing. You always give precise and brief answers up to 32 words<|eot_id|><|start_header_id|>user<|end_header_id|>\nHey Cookie! 
how are you today?<|eot_id|><|start_header_id|>assistant<|end_header_id|>I 00:00:04.076905 executorch:text_prefiller.cpp:53] Prefill token result numel(): 128256 + + +I 00:00:04.078027 executorch:runner.cpp:243] RSS after prompt prefill: 1269.445312 MiB (0 if unsupported) +I'm doing great, thanks! I'm always happy to help, communicate, and provide helpful responses. I'm a bit of a cookie (heh) when it comes to delivering concise and precise answers. What can I help you with today?<|eot_id|> +I 00:00:05.399304 executorch:text_token_generator.h:118] +Reached to the end of generation + +I 00:00:05.399314 executorch:runner.cpp:257] RSS after finishing text generation: 1269.445312 MiB (0 if unsupported) +PyTorchObserver {"prompt_tokens":54,"generated_tokens":51,"model_load_start_ms":1710296339487,"model_load_end_ms":1710296343047,"inference_start_ms":1710296343370,"inference_end_ms":1710296344877,"prompt_eval_end_ms":1710296343556,"first_token_ms":1710296343556,"aggregate_sampling_time_ms":49,"SCALING_FACTOR_UNITS_PER_SECOND":1000} +I 00:00:05.399342 executorch:stats.h:111] Prompt Tokens: 54 Generated Tokens: 51 +I 00:00:05.399344 executorch:stats.h:117] Model Load Time: 3.560000 (seconds) +I 00:00:05.399346 executorch:stats.h:127] Total inference time: 1.507000 (seconds) Rate: 33.842070 (tokens/second) +I 00:00:05.399348 executorch:stats.h:135] Prompt evaluation: 0.186000 (seconds) Rate: 290.322581 (tokens/second) +I 00:00:05.399350 executorch:stats.h:146] Generated 51 tokens: 1.321000 (seconds) Rate: 38.607116 (tokens/second) +I 00:00:05.399352 executorch:stats.h:154] Time to first generated token: 0.186000 (seconds) +I 00:00:05.399354 executorch:stats.h:161] Sampling time over 105 tokens: 0.049000 (seconds) ``` -You have successfully run a model on your Android smartphone. \ No newline at end of file +You have successfully run the Llama 3.2 1B Instruct model on your Android smartphone with ExecuTorch using KleidiAI kernels.
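The `PyTorchObserver` line in the log above is machine-readable JSON, which makes it convenient for scripting benchmark comparisons across builds. Below is a small, hypothetical Python helper (not part of ExecuTorch) that extracts that line and recomputes the decode rate the runner reports:

```python
import json
import re

def decode_rate(log_line):
    """Pull the PyTorchObserver JSON out of a runner log line and
    recompute generated tokens per second from its timestamps."""
    match = re.search(r"PyTorchObserver (\{.*\})", log_line)
    stats = json.loads(match.group(1))
    # Decode phase runs from end of prompt prefill to end of inference.
    decode_seconds = (stats["inference_end_ms"] - stats["prompt_eval_end_ms"]) / 1000.0
    return stats["generated_tokens"] / decode_seconds

# The stats line captured from the run above.
line = 'PyTorchObserver {"prompt_tokens":54,"generated_tokens":51,"model_load_start_ms":1710296339487,"model_load_end_ms":1710296343047,"inference_start_ms":1710296343370,"inference_end_ms":1710296344877,"prompt_eval_end_ms":1710296343556,"first_token_ms":1710296343556,"aggregate_sampling_time_ms":49,"SCALING_FACTOR_UNITS_PER_SECOND":1000}'
print(f"decode rate: {decode_rate(line):.2f} tokens/second")
```

For the log shown, 51 generated tokens over 1.321 seconds reproduces the runner's own "38.607116 (tokens/second)" figure.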
diff --git a/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/6-Build-Android-Chat-App.md b/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/6-Build-Android-Chat-App.md index 278a6b5863..f62301100b 100644 --- a/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/6-Build-Android-Chat-App.md +++ b/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/6-Build-Android-Chat-App.md @@ -16,21 +16,16 @@ You can use the Android demo application included in ExecuTorch repository [Llam 2. Set the following environment variables: ``` bash - export ANDROID_NDK=~/Library/Android/sdk/ndk/25.0.8775105 + export ANDROID_NDK=$ANDROID_HOME/ndk/28.0.12433566/ export ANDROID_ABI=arm64-v8a ``` {{% notice Note %}} - is the root for the NDK, which is usually under ~/Library/Android/sdk/ndk/XX.Y.ZZZZZ for macOS, and contains NOTICE and README.md. Make sure you can confirm /build/cmake/android.toolchain.cmake is available for CMake to cross-compile. + is the root for the NDK, which is usually under ~/Library/Android/sdk/ndk/XX.Y.ZZZZZ for macOS, and contains NOTICE and README.md. +Make sure you can confirm /build/cmake/android.toolchain.cmake is available for CMake to cross-compile. {{% /notice %}} -3. (Optional) If you need to use tiktoken as the tokenizer (for LLaMA 3), set `EXECUTORCH_USE_TIKTOKEN=ON` and CMake uses it as the tokenizer. If you run other models like LLaMA 2, skip this step. - - ``` bash - export EXECUTORCH_USE_TIKTOKEN=ON # Only for LLaMA3 - ``` - -4. Run the following commands to set up the required JNI library: +3. 
Run the following commands to set up the required JNI library: ``` bash pushd extension/android @@ -42,7 +37,7 @@ You can use the Android demo application included in ExecuTorch repository [Llam ``` {{% notice Note %}} -This is running the shell script setup.sh which configures and builds the required core ExecuTorch, Llama 2, and Android libraries. +This is running the shell script setup.sh which configures and builds the required core ExecuTorch, Llama, and Android libraries. {{% /notice %}} ## Getting models @@ -73,7 +68,7 @@ adb push /data/local/tmp/llama/ 2. Upload the files. -If the files are not on the device, use the device explorer to copy them. +If the files are not on the device, use the device explorer to copy them. ![Files Upload](device-explorer-upload.png "Figure 2. Android Studio upload files using Device Explorer") @@ -83,7 +78,7 @@ If the files are not on the device, use the device explorer to copy them. This is the recommended option. -1. Open Android Studio and select “Open an existing Android Studio project” and navigate to open `examples/demo-apps/android/LlamaDemo`. +1. Open Android Studio and select **Open an existing Android Studio project** and navigate to open `examples/demo-apps/android/LlamaDemo`. 2. Run the app (^R). This builds and launches the app on the phone. 
diff --git a/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/_index.md b/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/_index.md index c50c3d571a..987c6463e4 100644 --- a/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/_index.md +++ b/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/_index.md @@ -1,13 +1,13 @@ --- -title: Build an Android chat app with Llama, ExecuTorch, and XNNPACK +title: Build an Android chat app with Llama, KleidiAI, ExecuTorch, and XNNPACK minutes_to_complete: 60 -who_is_this_for: This is an introductory topic for software developers interested in learning how to build an Android chat app with Llama, ExecuTorch, and XNNPACK. +who_is_this_for: This is an introductory topic for software developers interested in learning how to build an Android chat app with Llama, KleidiAI, ExecuTorch, and XNNPACK. learning_objectives: - Set up an ExecuTorch development environment. - - Describe how ExecuTorch uses XNNPACK kernels to accelerate performance on Arm-based platforms. + - Describe how ExecuTorch uses KleidiAI kernels to accelerate performance on Arm-based platforms. - Describe how 4-bit groupwise PTQ quantization reduces model size without significantly sacrificing model accuracy. - Build and run Llama models using ExecuTorch on your development machine. - Build and run an Android Chat app with different Llama models using ExecuTorch on an Arm-based smartphone. @@ -15,13 +15,13 @@ learning_objectives: prerequisites: - An Apple M1/M2 development machine with Android Studio installed or a Linux machine with at least 16GB of RAM. - - An Arm-powered smartphone running Android, with 16GB of RAM. + - An Arm-powered smartphone with the i8mm feature running Android, with 16GB of RAM. 
- A USB cable to connect your smartphone to your development machine. - Android Debug Bridge (adb) installed on your device. Follow the steps in [adb](https://developer.android.com/tools/adb) to install Android SDK Platform Tools. The adb tool is included in this package. - Java 17 JDK. Follow the steps in [Java 17 JDK](https://www.oracle.com/java/technologies/javase/jdk17-archive-downloads.html) to download and install JDK for host. - Python 3.10. -author_primary: Varun Chari +author_primary: Varun Chari, Pareena Verma ### Tags skilllevels: Introductory diff --git a/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/_next-steps.md b/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/_next-steps.md index b22be2d593..69b9025405 100644 --- a/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/_next-steps.md +++ b/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/_next-steps.md @@ -1,5 +1,8 @@ --- -recommended_path: /learning-paths/smartphones-and-mobile/mte_on_pixel8/ +next_step_guidance: Now that you are familiar with building LLM applications with ExecuTorch, XNNPACK and KleidiAI, you are ready to incorporate LLMs into your Android applications.
+ +recommended_path: /learning-paths/cross-platform/kleidiai-explainer/ + further_reading: - resource: diff --git a/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/_review.md b/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/_review.md index 07d5f6c8cd..7fddcf0f62 100644 --- a/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/_review.md +++ b/content/learning-paths/smartphones-and-mobile/Build-Llama3-Chat-Android-App-Using-Executorch-And-XNNPACK/_review.md @@ -6,7 +6,7 @@ review: answers: - ExecuTorch is an end-to-end solution for enabling on-device inference capabilities across mobile and edge devices. - It is a Pytorch method to quantize LLMs. - - It is a program to execute pytorch models. + - It is a program to execute PyTorch models. correct_answer: 1 explanation: > ExecuTorch is part of the PyTorch Edge ecosystem and enables efficient deployment of PyTorch models to edge devices. @@ -20,7 +20,7 @@ review: - Llama is a family of large language models that uses publicly-available data for training. correct_answer: 3 explanation: > - LLaMA is a state-of-the-art foundational large language model designed to enable researchers to advance their work in this subfield of AI. + Llama is a state-of-the-art foundational large language model designed to enable researchers to advance their work in this subfield of AI. - questions: question: > diff --git a/content/migration/_index.md b/content/migration/_index.md index d84f05785c..caef862bf7 100644 --- a/content/migration/_index.md +++ b/content/migration/_index.md @@ -18,8 +18,8 @@ Below is a list of Neoverse CPUs, the architecture versions, and the key additio | ----------- | -------------------- | --------------------------------------------------------- | | Neoverse-N1 | Armv8.2-A | LSE - Large System Extensions improves multi-threaded performance.
| | Neoverse-V1 | Armv8.4-A | SVE - Scalable Vector Extension adds high performance vector processing for HPC and AI workloads. | -| Neoverse-N2 | Armv9.0-A | SVE2 and Arm CCA - Extends SVE and adds Arm Confidential Compute Architecture for hardware isolation and security. | -| Neoverse-V2 | Armv9.0-A | SVE2 and Arm CCA - Targets high single threaded performance for HPC and AI workloads. | +| Neoverse-N2 | Armv9.0-A | SVE2 - Extends SVE for improved data parallelism and wider vectors. | +| Neoverse-V2 | Armv9.0-A | SVE2 - Targets high single threaded performance for HPC and AI workloads. | ### What cloud hardware is available today? @@ -39,22 +39,21 @@ AWS offers more than [150 instance types with Graviton processors](https://aws.a {{< /tab >}} {{< tab header="Google GCP">}} -Google GCP offers a varity of [virtual machine instances with Arm processors](https://cloud.google.com/compute/docs/instances/arm-on-compute). The largest instance has 80 vCPUs and 640 Gb of RAM in the 'c3a-highmem' format. It does not offer bare-metal instances. It offers compute for general-purpose workloads (standard) and memory-optimized workloads (highmem). +Google GCP offers a variety of [virtual machine instances with Arm processors](https://cloud.google.com/compute/docs/instances/arm-on-compute). The largest instance has 48 vCPUs and 192 GB of RAM. It does not offer bare-metal instances. | Generation | Arm CPU | Instance types | Comments | | --------------|--------------|--------------------|-----------| | T2A | Neoverse-N1 | T2A-standard | Optimized for general-purpose workloads - web servers, and microservices. | -| C3A | AmpereOne | c3a-standard, c3a-highmem | Compute-optimized - large-scale databases, media transcoding, and HPC. | {{< /tab >}} {{< tab header="Microsoft Azure">}} -Microsoft Azure offers a variety of [virtual machine instances with Arm Neoverse processors](https://learn.microsoft.com/en-us/azure/virtual-machines/dpsv5-dpdsv5-series).
The largest instance has 64 vCPUs and 208 Gb of RAM in the 'D64ps_v5' format. It does not offer bare-metal instances. It offers compute for general-purpose workloads (Dps), memory-optimized workloads (Eps), compute-intensive workloads (Fsv), and high-performance (Cobalt). +Microsoft Azure offers a variety of [virtual machine instances with Arm Neoverse processors](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/general-purpose/dpsv6-series). The latest generation of Arm-based VMs is based on the Cobalt 100 CPU. The largest instance has 96 vCPUs and 384 GB of RAM in the 'D96ps_v6' format. It does not offer bare-metal instances. It offers compute for general-purpose workloads (Dps and Dpls) and memory-optimized workloads (Eps). | Generation | Arm CPU | Instance types | Comments | | --------------|--------------|--------------------|-----------| -| psv5 | Neoverse-N1 | Dpsv5, Epsv5 | General purpose and memory optimized instances. | -| psv6 | Neoverse-N2 | Dpsv6, Epsv6, Fsv6 | Cobalt processor improves performance, Dpsv6 (general purpose 4:1 mem:cpu ratio), Dplsv6 (general purpose, 2:1 mem:cpu ratio), Epsv6 (memory-optimized). | +| Dpsv5 | Neoverse-N1 | Dpsv5, Epsv5 | General purpose and memory optimized instances. | +| Dpsv6 | Neoverse-N2 | Dpsv6, Dplsv6, Epsv6 | Cobalt 100 processor improves performance, Dpsv6 (general purpose 4:1 mem:cpu ratio), Dplsv6 (general purpose, 2:1 mem:cpu ratio), Epsv6 (memory-optimized, 8:1 mem:cpu ratio). | {{< /tab >}} {{< tab header="Oracle OCI">}}