From 1019dd10feabeedbee443d4e4bf3e28ed52bac46 Mon Sep 17 00:00:00 2001
From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com>
Date: Thu, 13 Nov 2025 13:01:07 +0000
Subject: [PATCH 1/8] Refactor Android Halide documentation for clarity and consistency

- Updated titles and descriptions across multiple files for better alignment with content.
- Enhanced explanations of AOT compilation and operator fusion concepts.
- Improved code comments and formatting for readability.
- Corrected minor grammatical errors and inconsistencies in terminology.
---
 .../android_halide/_index.md                  | 12 ++---
 .../android_halide/android.md                 | 10 ++--
 .../aot-and-cross-compilation.md              | 24 ++++----
 .../android_halide/fusion.md                  | 52 +++++++++----------
 .../android_halide/intro.md                   | 42 +++++++--------
 .../android_halide/processing-workflow.md     | 28 +++++-----
 6 files changed, 84 insertions(+), 84 deletions(-)

diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/_index.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/_index.md
index b351d54846..f00c88f159 100644
--- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/_index.md
+++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/_index.md
@@ -1,9 +1,5 @@
 ---
-title: Halide Essentials From Basics to Android Integration
-
-draft: true
-cascade:
-  draft: true
+title: Build high-performance image processing with Halide on Android
 
 minutes_to_complete: 180
 
@@ -31,7 +27,11 @@ operatingsystems:
 - Android
 tools_software_languages:
 - Android Studio
-- Coding
+- Halide
+- C++
+- Kotlin
+- CMake
+
 
 further_reading:
 - resource:
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/android.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/android.md
index ba6eb63972..b0fd394f05 100644
--- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/android.md
+++ 
b/content/learning-paths/mobile-graphics-and-gaming/android_halide/android.md @@ -17,10 +17,10 @@ Kotlin, now the preferred programming language for Android development, combines ## Benefits of using Halide on mobile Integrating Halide into Android applications brings several key advantages: -1. Performance. Halide enables significant acceleration of complex image processing algorithms, often surpassing the speed of traditional Java or Kotlin implementations by leveraging optimized code generation. By generating highly optimized native code tailored for ARM CPUs or GPUs, Halide can dramatically increase frame rates and responsiveness, essential for real-time or interactive applications. -2. Efficiency. On mobile devices, resource efficiency translates directly to improved battery life and reduced thermal output. Halide's scheduling strategies (such as operation fusion, tiling, parallelization, and vectorization) minimize unnecessary memory transfers, CPU usage, and GPU overhead. This optimization substantially reduces overall power consumption, extending battery life and enhancing the user experience by preventing overheating. -3. Portability. Halide abstracts hardware-specific details, allowing developers to write a single high-level pipeline that easily targets different processor architectures and hardware configurations. Pipelines can seamlessly run on various ARM-based CPUs and GPUs commonly found in Android smartphones and tablets, enabling developers to support a wide range of devices with minimal platform-specific modifications. -4. Custom Algorithm Integration. Halide allows developers to easily integrate their bespoke image-processing algorithms that may not be readily available or optimized in common libraries, providing full flexibility and control over application-specific performance and functionality. 
+- Performance - Halide enables significant acceleration of complex image processing algorithms, often surpassing the speed of traditional Java or Kotlin implementations by leveraging optimized code generation. By generating highly optimized native code tailored for Arm CPUs or GPUs, Halide can dramatically increase frame rates and responsiveness, essential for real-time or interactive applications.
+- Efficiency - On mobile devices, resource efficiency translates directly to improved battery life and reduced thermal output. Halide's scheduling strategies (such as operation fusion, tiling, parallelization, and vectorization) minimize unnecessary memory transfers, CPU usage, and GPU overhead. This optimization substantially reduces overall power consumption, extending battery life and enhancing the user experience by preventing overheating.
+- Portability - Halide abstracts hardware-specific details, allowing developers to write a single high-level pipeline that easily targets different processor architectures and hardware configurations. Pipelines can seamlessly run on various Arm-based CPUs and GPUs commonly found in Android smartphones and tablets, enabling developers to support a wide range of devices with minimal platform-specific modifications.
+- Custom algorithm integration - Halide allows developers to easily integrate their bespoke image-processing algorithms that may not be readily available or optimized in common libraries, providing full flexibility and control over application-specific performance and functionality.
 
 In short, Halide delivers high-performance image processing without sacrificing portability or efficiency, a balance particularly valuable on resource-constrained mobile devices.
 
@@ -336,7 +336,7 @@ When the app launches, the Process Image button is disabled. When a user taps Lo
 Upon pressing the Process Image button, the following sequence occurs:
 1. Background Processing. 
A Kotlin coroutine initiates processing on a background thread, ensuring the application’s UI remains responsive. -2. Conversion to Grayscale. The loaded bitmap image is converted into a grayscale byte array using a simple RGB-average method, preparing it for processing by the native (JNI) layer. +2. Conversion to Grayscale. The loaded bitmap image is converted into a grayscale byte array using a simple RGB (Red-Green-Blue) average method, preparing it for processing by the native (JNI) layer. 3. Native Function Invocation. This grayscale byte array, along with image dimensions, is passed to a native function (blurThresholdImage) defined via JNI. This native function is implemented using the Halide pipeline, performing operations such as blurring and thresholding directly on the image data. 4. Post-processing. After the native function completes, the resulting processed grayscale byte array is converted back into a Bitmap image. 5. UI Update. The coroutine then updates the displayed image (on the main UI thread) with this newly processed bitmap, providing the user immediate visual feedback. diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md index f4003f1f51..0d2166744c 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md +++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md @@ -8,15 +8,15 @@ layout: "learningpathall" --- ## Ahead-of-time and cross-compilation -One of Halide's standout features is the ability to compile image processing pipelines ahead-of-time (AOT), enabling developers to generate optimized binary code on their host machines rather than compiling directly on target devices. 
This AOT compilation process allows developers to create highly efficient libraries that run effectively across diverse hardware without incurring the runtime overhead associated with just-in-time (JIT) compilation. +One of Halide's standout features is the ability to compile image processing pipelines ahead-of-time (AOT), enabling developers to generate optimized binary code on their host machines rather than compiling directly on target devices. This AOT compilation process allows developers to create highly efficient libraries that run effectively across diverse hardware without incurring the runtime overhead associated with just-in-time (JIT) compilation. Halide also supports robust cross-compilation capabilities. Cross-compilation means using the host version of Halide, typically running on a desktop Linux or macOS system—to target different architectures, such as ARM for Android devices. Developers can thus optimize Halide pipelines on their host machine, produce libraries specifically optimized for Android, and integrate them seamlessly into Android applications. The generated pipeline code includes essential optimizations and can embed minimal runtime support, further reducing workload on the target device and ensuring responsiveness and efficiency. ## Objective -In this section, we leverage the host version of Halide to perform AOT compilation of an image processing pipeline via cross-compilation. The resulting pipeline library is specifically tailored to Android devices (targeting, for instance, arm64-v8a ABI), while the compilation itself occurs entirely on the host system. This approach significantly accelerates development by eliminating the need to build Halide or perform JIT compilation on Android devices. It also guarantees that the resulting binaries are optimized for the intended hardware, streamlining the deployment of high-performance image processing applications on mobile platforms. 
+In this section, you'll leverage the host version of Halide to perform AOT compilation of an image processing pipeline via cross-compilation. The resulting pipeline library is specifically tailored to Android devices (targeting, for instance, arm64-v8a ABI), while the compilation itself occurs entirely on the host system. This approach significantly accelerates development by eliminating the need to build Halide or perform JIT compilation on Android devices. It also guarantees that the resulting binaries are optimized for the intended hardware, streamlining the deployment of high-performance image processing applications on mobile platforms. -## Prepare Pipeline for Android -The procedure implemented in the following code demonstrates how Halide's AOT compilation and cross-compilation features can be utilized to create an optimized image processing pipeline for Android. We will run Halide on our host machine (in this example, macOS) to generate a static library containing the pipeline function, which will later be invoked from an Android device. Below is a step-by-step explanation of this process. +## Prepare pipeline for Android +The procedure implemented in the following code demonstrates how Halide's AOT compilation and cross-compilation features can be utilized to create an optimized image processing pipeline for Android. Run Halide on your host machine (in this example, macOS) to generate a static library containing the pipeline function, which will later be invoked from an Android device. Below is a step-by-step explanation of this process. Create a new file named blur-android.cpp with the following contents: @@ -85,9 +85,9 @@ int main(int argc, char** argv) { } ``` -In the original implementation constants 128, 255, and 0 were implicitly treated as integers. Here, the threshold value (128) and output values (255, 0) are explicitly cast to uint8_t. This approach removes ambiguity and clearly specifies the types used, ensuring compatibility and clarity. 
Both approaches result in identical functionality, but explicitly casting helps emphasize the type correctness and may avoid subtle issues during cross-compilation or in certain environments. Additionally, explicit uint8_t casts help avoid implicit promotion to 32-bit integers (and the corresponding narrowings back to 8-bit) in the generated code, reducing redundant cast operations and potential vector widen/narrow overhead—especially on ARM/NEON +In the original implementation constants 128, 255, and 0 were implicitly treated as integers. Here, the threshold value (128) and output values (255, 0) are explicitly cast to uint8_t. This approach removes ambiguity and clearly specifies the types used, ensuring compatibility and clarity. Both approaches result in identical functionality, but explicitly casting helps emphasize the type correctness and may avoid subtle issues during cross-compilation or in certain environments. Additionally, explicit uint8_t casts help avoid implicit promotion to 32-bit integers (and the corresponding narrowings back to 8-bit) in the generated code, reducing redundant cast operations and potential vector widen/narrow overhead—especially on Arm/NEON. -The program takes at least one command-line argument, the output base name used to generate the files (e.g., “blur_threshold_android”). Here, the target architecture is explicitly set within the code to Android ARM64: +The program takes at least one command-line argument, the output base name used to generate the files (for example, "blur_threshold_android"). Here, the target architecture is explicitly set within the code to Android ARM64: ```cpp // Configure Halide Target for Android @@ -108,11 +108,11 @@ Notes: 1. NoRuntime — When set to true, Halide excludes its runtime from the generated code, and you must link the runtime manually during the linking step. When set to false, the Halide runtime is included in the generated library, which simplifies deployment. 2. 
ARMFp16 — Enables the use of ARM hardware support for half-precision (16-bit) floating-point operations, which can provide faster execution when reduced precision is acceptable. 3. Why the runtime choice matters - If your app links several AOT-compiled pipelines, ensure there is exactly one Halide runtime at link time: -* Strategy A (cleanest): build all pipelines with NoRuntime ON and link a single standalone Halide runtime once (matching the union of features you need, e.g., Vulkan/OpenCL/Metal or ARM options). +* Strategy A (cleanest): build all pipelines with NoRuntime ON and link a single standalone Halide runtime once (matching the union of features you need, for example, Vulkan/OpenCL/Metal or Arm options). * Strategy B: embed the runtime in exactly one pipeline (leave NoRuntime OFF only there); compile all other pipelines with NoRuntime ON. * Mixing more than one runtime can cause duplicate symbols and split global state (e.g., error handlers, device interfaces). -We declare spatial variables (x, y) and an ImageParam named “input” representing the input image data. We use boundary clamping (clamp) to safely handle edge pixels. Then, we apply a 3x3 blur with a reduction domain (RDom). The accumulated sum is divided by 9 (the number of pixels in the neighborhood), producing an average blurred image. Lastly, thresholding is applied, producing a binary output: pixels above a certain brightness threshold (128) become white (255), while others become black (0). +The code declares spatial variables (x, y) and an ImageParam named "input" representing the input image data. Boundary clamping (clamp) safely handles edge pixels. A 3×3 blur with a reduction domain (RDom) is then applied. The accumulated sum is divided by 9 (the number of pixels in the neighborhood), producing an average blurred image. Lastly, thresholding is applied, producing a binary output: pixels above a certain brightness threshold (128) become white (255), while others become black (0). 
This section intentionally reinforces previous concepts, focusing now primarily on explicitly clarifying integration details, such as type correctness and the handling of runtime features within Halide. @@ -122,7 +122,7 @@ This strategy can simplify debugging by clearly isolating computational steps an By clearly separating algorithm logic from scheduling, developers can easily test and compare different scheduling strategies,such as compute_inline, compute_root, compute_at, and more, without modifying their fundamental algorithmic code. This separation significantly accelerates iterative optimization and debugging processes, ultimately yielding better-performing code with minimal overhead. -We invoke Halide's AOT compilation function compile_to_static_library, which generates a static library (.a) containing the optimized pipeline and a corresponding header file (.h). +Halide's AOT compilation function compile_to_static_library generates a static library (.a) containing the optimized pipeline and a corresponding header file (.h). ```cpp thresholded.compile_to_static_library( @@ -134,18 +134,18 @@ thresholded.compile_to_static_library( ``` This will produce: -* A static library (blur_threshold_android.a) containing the compiled pipeline. This static library also includes Halide's runtime functions tailored specifically for the targeted architecture (arm-64-android). Thus, no separate Halide runtime needs to be provided on the Android device when linking against this library. +* A static library (blur_threshold_android.a) containing the compiled pipeline. This static library also includes Halide's runtime functions tailored specifically for the targeted architecture (arm-64-android). Thus, no separate Halide runtime needs to be provided on the Android device when linking against this library. * A header file (blur_threshold_android.h) declaring the pipeline function for use in other C++/JNI code. 
These generated files are then ready to integrate directly into an Android project via JNI, allowing efficient execution of the optimized pipeline on Android devices. The integration process is covered in the next section. -Note: JNI (Java Native Interface) is a framework that allows Java (or Kotlin) code running in a Java Virtual Machine (JVM), such as on Android, to interact with native applications and libraries written in languages like C or C++. JNI bridges the managed Java/Kotlin environment and the native, platform-specific implementations. +JNI (Java Native Interface) is a framework that allows Java (or Kotlin) code running in a Java Virtual Machine (JVM), such as on Android, to interact with native applications and libraries written in languages like C or C++. JNI bridges the managed Java/Kotlin environment and the native, platform-specific implementations. ## Compilation instructions To compile the pipeline-generation program on your host system, use the following commands (replace /path/to/halide with your Halide installation directory): ```console export DYLD_LIBRARY_PATH=/path/to/halide/lib/libHalide.19.dylib -g++ -std=c++17 blud-android.cpp -o blud-android \ +g++ -std=c++17 blur-android.cpp -o blur-android \ -I/path/to/halide/include -L/path/to/halide/lib -lHalide \ $(pkg-config --cflags --libs opencv4) -lpthread -ldl \ -Wl,-rpath,/path/to/halide/lib diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md index f10442403f..a11c7cb396 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md +++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md @@ -8,9 +8,9 @@ layout: "learningpathall" --- ## Objective -In the previous section, you explored parallelization and tiling. 
Here, you will focus on operator fusion (inlining) in Halide i.e., letting producers be computed directly inside their consumers—versus materializing intermediates with compute_root() or compute_at(). You will learn when fusion reduces memory traffic and when materializing saves recomputation (e.g., for large stencils or multi-use intermediates). You will inspect loop nests with print_loop_nest(), switch among schedules (fuse-all, fuse-blur-only, materialize, tile-and-materialize-per-tile) in a live camera pipeline, and measure the impact (ms/FPS/MPix/s). +In the previous section, you explored parallelization and tiling. Here, you will focus on operator fusion (inlining) in Halide, that is, letting producers be computed directly inside their consumers—versus materializing intermediates with compute_root() or compute_at(). You will learn when fusion reduces memory traffic and when materializing saves recomputation (for example, for large stencils or multi-use intermediates). You will inspect loop nests with print_loop_nest(), switch among schedules (fuse-all, fuse-blur-only, materialize, tile-and-materialize-per-tile) in a live camera pipeline, and measure the impact (ms/FPS/MPix/s). -This section does not cover loop fusion (the fuse directive). You will focus on operator fusion, which is Halide's default behavior. +This section doesn't cover loop fusion (the fuse directive). You will focus on operator fusion, which is Halide's default behavior. ## Code To demonstrate how fusion in Halide works create a new file `camera-capture-fusion.cpp`, and modify it as follows. This code uses a live camera pipeline (BGR → gray → 3×3 blur → threshold), adds a few schedule variants to toggle operator fusion vs. materialization, and print ms / FPS / MPix/s. So you can see the impact immediately. @@ -47,11 +47,11 @@ static const char* schedule_name(Schedule s) { } // Build the BGR->Gray -> 3x3 binomial blur -> threshold pipeline. 
-// We clamp the *ImageParam* at the borders (Func clamp of ImageParam works in Halide 19). +// Clamp the *ImageParam* at the borders (Func clamp of ImageParam works in Halide 19). Pipeline make_pipeline(ImageParam& input, Schedule schedule) { Var x("x"), y("y"); - // Assume 3-channel BGR interleaved frames (we convert if needed). + // Assume 3-channel BGR interleaved frames (converted if needed). input.dim(0).set_stride(3); // x-stride = channels input.dim(2).set_stride(1); // c-stride = 1 input.dim(2).set_bounds(0, 3); // three channels @@ -81,7 +81,7 @@ Pipeline make_pipeline(ImageParam& input, Schedule schedule) { // Final output Func output("output"); output(x, y) = thresholded(x, y); - output.compute_root(); // we always realize 'output' + output.compute_root(); // always realize 'output' // Scheduling to demonstrate OPERATOR FUSION vs MATERIALIZATION // Default in Halide = fusion/inlining (no schedule on producers). @@ -233,11 +233,11 @@ int main(int argc, char** argv) { } ``` -The main part of this program is the `make_pipeline` function. It defines the camera processing pipeline in Halide and applies different scheduling choices depending on which mode we select. +The main part of this program is the `make_pipeline` function. It defines the camera processing pipeline in Halide and applies different scheduling choices depending on which mode is selected. -You start by declaring Var x, y as our pixel coordinates. Similarly as before, the camera frames come in as 3-channel interleaved BGR, you will tell Halide how the data is laid out: the stride along x is 3 (one step moves across all three channels), the stride along c (channels) is 1, and the bounds on the channel dimension are 0–2. +Start by declaring Var x, y as pixel coordinates. 
As before, the camera frames come in as 3-channel interleaved BGR, so you tell Halide how the data is laid out: the stride along x is 3 (one step moves across all three channels), the stride along c (channels) is 1, and the bounds on the channel dimension are 0–2.
-Because you don’t want to worry about array bounds when applying filters, you will clamp the input at the borders. In Halide 19, BoundaryConditions::repeat_edge works cleanly when applied to an ImageParam, since it has .dim() information. This way, all downstream stages can assume safe access even at the edges of the image.
+Because you don't want to worry about array bounds when applying filters, clamp the input at the borders. In Halide 19, BoundaryConditions::repeat_edge works cleanly when applied to an ImageParam, since it has .dim() information. This way, all downstream stages can assume safe access even at the edges of the image.
 
 ```cpp
 Pipeline make_pipeline(ImageParam& input, Schedule schedule) {
@@ -252,9 +252,9 @@ Pipeline make_pipeline(ImageParam& input, Schedule schedule) {
 Func inputClamped = BoundaryConditions::repeat_edge(input);
 ```
 
-Next comes the gray conversion. As in previous section, you will use Rec.601 weights a 3×3 binomial blur. Instead of using a reduction domain (RDom), you unroll the sum in C++ host code with a pair of loops over the kernel. The kernel values {1, 2, 1; 2, 4, 2; 1, 2, 1} approximate a Gaussian filter. Each pixel of blur is simply the weighted sum of its 3×3 neighborhood, divided by 16.
+Next comes the gray conversion. As in the previous section, use Rec.601 weights and a 3×3 binomial blur. Instead of using a reduction domain (RDom), unroll the sum in C++ host code with a pair of loops over the kernel. The kernel values {1, 2, 1; 2, 4, 2; 1, 2, 1} approximate a Gaussian filter. Each pixel of blur is simply the weighted sum of its 3×3 neighborhood, divided by 16.
 
-You will then add a threshold stage. 
Pixels above 128 become white, and all others black, producing a binary image. Finally, define an output Func that wraps the thresholded result and call compute_root() on it so that it will be realized explicitly when you run the pipeline. +Then add a threshold stage. Pixels above 128 become white, and all others black, producing a binary image. Finally, define an output Func that wraps the thresholded result and call compute_root() on it so that it will be realized explicitly when you run the pipeline. Now comes the interesting part: the scheduling choices. Depending on the Schedule enum passed in, you instruct Halide to either fuse everything (the default), materialize some intermediates, or even tile the output. * Simple: Here you will explicitly compute and store both gray and blur across the whole frame with compute_root(). This makes them easy to reuse or parallelize, but requires extra memory traffic. @@ -299,7 +299,7 @@ return Pipeline(output); } ``` -All the camera handling is just like before: you open the default webcam with OpenCV, normalize frames to 3-channel BGR if needed, wrap each frame as an interleaved Halide buffer, run the pipeline, and show the result. You will still time only the realize() call and print ms / FPS / MPix/s, with the first frame marked as [warm-up]. +All the camera handling is just like before: open the default webcam with OpenCV, normalize frames to 3-channel BGR if needed, wrap each frame as an interleaved Halide buffer, run the pipeline, and show the result. Time only the realize() call and print ms / FPS / MPix/s, with the first frame marked as [warm-up]. The new part is that you can toggle scheduling modes from the keyboard while the application is running: 1. Keys: @@ -310,9 +310,9 @@ The new part is that you can toggle scheduling modes from the keyboard while the * q / Esc – quit Under the hood, pressing 0–3 triggers a rebuild of the Halide pipeline with the chosen schedule: -1. You map the key to a Schedule enum value. 
-2. You call make_pipeline(input, next) to construct the new scheduled pipeline. -3. You reset the warm-up flag, so the next line of stats is labeled [warm-up] (that frame includes JIT). +1. Map the key to a Schedule enum value. +2. Call make_pipeline(input, next) to construct the new scheduled pipeline. +3. Reset the warm-up flag, so the next line of stats is labeled [warm-up] (that frame includes JIT). 4. The main loop keeps grabbing frames; only the Halide schedule changes. This live switching makes fusion tangible: you can watch the loop nest printout change, see the visualization update, and compare throughput numbers in real time as you move between Simple, FuseBlurAndThreshold, FuseAll, and Tile. @@ -326,7 +326,7 @@ g++ -std=c++17 camera-capture-fusion.cpp -o camera-capture-fusion \ ./camera-capture-fusion ``` -You will see the following output: +You'll see the following output: ```output % ./camera-capture-fusion Starting with schedule: FuseAll (press 0..3 to switch; q/Esc to quit) @@ -411,7 +411,7 @@ The console output combines two kinds of information: Comparing the numbers: * FuseAll runs at ~53 FPS. It has minimal memory traffic but pays for recomputation of gray under the blur. -* FuseBlurAndThreshold jumps to over 200 FPS. By materializing gray, we avoid redundant recomputation and allow blur+threshold to stay fused. This is often the sweet spot for interleaved camera input. +* FuseBlurAndThreshold jumps to over 200 FPS. By materializing gray, redundant recomputation is avoided and blur+threshold stays fused. This is often the sweet spot for interleaved camera input. * Simple reaches ~166 FPS. Both gray and blur are materialized, so no recomputation occurs, but memory traffic is higher than in FuseBlurAndThreshold. * Tile achieves similar speed (~200 FPS). Producing gray per tile balances recomputation and memory traffic by keeping intermediates local to cache. @@ -442,27 +442,27 @@ for y: for x: gray(x,y) = ... 
// write one planar gray image for y: for x: out(x,y) = threshold( sum kernel * gray(x+i,y+j) ) ``` -The fused version eliminates buffer writes but recomputes gray under the blur stencil. The materialized version performs more memory operations but avoids recomputation, and also gives us a clean point to parallelize or vectorize the gray stage. +The fused version eliminates buffer writes but recomputes gray under the blur stencil. The materialized version performs more memory operations but avoids recomputation, and also provides a clean point to parallelize or vectorize the gray stage. -It’s worth noting that Halide also supports a loop fusion directive (fuse) that merges two loop variables together. That’s a different concept and not our focus here. In this tutorial, we’re talking specifically about operator fusion—the decision of whether to inline or materialize stages. +It's worth noting that Halide also supports a loop fusion directive (fuse) that merges two loop variables together. That's a different concept and not the focus here. In this tutorial, the focus is specifically on operator fusion—the decision of whether to inline or materialize stages. ## How this looks in the live camera demo -Our pipeline is: BGR input → gray → 3×3 blur → thresholded → output. Depending on the schedule, we see different kinds of fusion: +The pipeline is: BGR input → gray → 3×3 blur → thresholded → output. Depending on the schedule, different kinds of fusion are shown: * FuseAll. No schedules on producers. gray, blur, and thresholded are all inlined into output. This minimizes memory traffic but recomputes gray repeatedly inside the 3×3 blur. -* FuseBlurAndThreshold: We add gray.compute_root(), materializing gray once as a planar buffer. This avoids recomputation of gray and makes downstream blur and thresholded vectorize better. blur and thresholded remain fused. +* FuseBlurAndThreshold: Adding gray.compute_root() materializes gray once as a planar buffer. 
This avoids recomputation of gray and makes downstream blur and thresholded vectorize better. blur and thresholded remain fused. * Simple. Both gray and blur are materialized across the frame. This avoids recomputation entirely but increases memory traffic. -* Tile. We split the output into 64×64 tiles and compute gray per tile (compute_at(output, xo)). This keeps intermediate results local to cache while still fusing blur inside each tile. +* Tile. The output is split into 64×64 tiles and gray is computed per tile (compute_at(output, xo)). This keeps intermediate results local to cache while still fusing blur inside each tile. By toggling between these modes in the live demo, you can see how the loop nests and throughput numbers change, which makes the abstract idea of fusion much more concrete. ## When to use operator fusion -Fusion is Halide's default and usually the right place to start. It’s especially effective for: +Fusion is Halide's default and usually the right place to start. It's especially effective for: * Element-wise chains, where each pixel is transformed independently: examples include intensity scaling or offset, gamma correction, channel mixing, color-space conversions, and logical masking. * Cheap post-ops after spatial filters: -for instance, there’s no reason to materialize a blurred image just to threshold it. Fuse the threshold directly into the blur’s consumer. +for instance, there's no reason to materialize a blurred image to threshold it. Fuse the threshold directly into the blur's consumer. -In our code, FuseAll inlines gray, blur, and thresholded into output. FuseBlurAndThreshold materializes only gray, then keeps blur and thresholded fused—a common middle ground that balances memory use and compute reuse. +In the code, FuseAll inlines gray, blur, and thresholded into output. FuseBlurAndThreshold materializes only gray, then keeps blur and thresholded fused—a common middle ground that balances memory use and compute reuse. 
## When to materialize instead of fuse Fusion isn’t always best. You’ll want to materialize an intermediate (compute_root() or compute_at()) if: @@ -472,7 +472,7 @@ Fusion isn’t always best. You’ll want to materialize an intermediate (comput * You need a natural stage to apply parallelization or tiling. ### Profiling -The fastest way to check whether fusion helps is to measure it. Our demo prints timing and throughput per frame, but Halide also includes a built-in profiler that reports per-stage runtimes. To learn how to enable and interpret the profiler, see the official [Halide profiling tutorial](https://halide-lang.org/tutorials/tutorial_lesson_21_auto_scheduler_generate.html#profiling). +The fastest way to check whether fusion helps is to measure it. The demo prints timing and throughput per frame, but Halide also includes a built-in profiler that reports per-stage runtimes. To learn how to enable and interpret the profiler, see the official [Halide profiling tutorial](https://halide-lang.org/tutorials/tutorial_lesson_21_auto_scheduler_generate.html#profiling). ## Summary -In this section, you have learned about operator fusion in Halide—a powerful technique for reducing memory bandwidth and improving computational efficiency. You explored why fusion matters, looked at scenarios where it is most effective, and saw how Halide's scheduling constructs such as compute_root() and compute_at() let us control whether stages are fused or materialized. By experimenting with different schedules, including fusing the Gaussian blur and thresholding stages, we observed how fusion can significantly improve the performance of a real-time image processing pipeline +In this section, you've learned about operator fusion in Halide—a powerful technique for reducing memory bandwidth and improving computational efficiency. 
You explored why fusion matters, looked at scenarios where it's most effective, and saw how Halide's scheduling constructs such as compute_root() and compute_at() let you control whether stages are fused or materialized. By experimenting with different schedules, including fusing the Gaussian blur and thresholding stages, you observed how fusion can significantly improve the performance of a real-time image processing pipeline. diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md index 4670270833..a4ea931cbb 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md +++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md @@ -12,7 +12,7 @@ Halide is a powerful, open-source programming language specifically designed to A key advantage of Halide lies in its innovative programming model. By clearly distinguishing between algorithmic logic and scheduling decisions—such as parallelism, vectorization, memory management, and hardware-specific optimizations, developers can first focus on ensuring the correctness of their algorithms. Performance tuning can then be handled independently, significantly accelerating development cycles. This approach often yields performance that matches or even surpasses manually optimized code. As a result, Halide has seen widespread adoption across industry and academia, powering image processing systems at organizations such as Google, Adobe, and Facebook, and enabling advanced computational photography features used by millions daily. -In this learning path, you will explore Halide's foundational concepts, set up your development environment, and create your first functional Halide application. 
By the end, you will understand what makes Halide uniquely suited to efficient image processing, particularly on mobile and Arm-based hardware, and be ready to build your own optimized pipelines. +In this Learning Path, you'll explore Halide's foundational concepts, set up your development environment, and create your first functional Halide application. By the end, you'll understand what makes Halide uniquely suited to efficient image processing, particularly on mobile and Arm-based hardware, and be ready to build your own optimized pipelines. For broader or more general use cases, please refer to the official Halide documentation and tutorials available at [halide-lang.org](https://halide-lang.org). @@ -20,7 +20,7 @@ The example code for this Learning Path is available in two repositories [here]( ## Key concepts in Halide ### Separation of algorithm and schedule -At the core of Halide's design philosophy is the principle of clearly separating algorithms from schedules. Traditional image-processing programming tightly couples algorithmic logic with execution strategy, complicating optimization and portability. In contrast, Halide explicitly distinguishes these two components: +At the core of Halide's design philosophy is the principle of clearly separating algorithms from schedules. Traditional image-processing programming tightly couples algorithmic logic with execution strategy, complicating optimization and portability. In contrast, Halide explicitly distinguishes these two components: * Algorithm: Defines what computations are performed—for example, image filters, pixel transformations, or other mathematical operations on image data. * Schedule: Specifies how and where these computations are executed, addressing critical details such as parallel execution, memory usage, caching strategies, and hardware-specific optimizations. 
@@ -36,9 +36,9 @@ Halide::Func brighter("brighter"); brighter(x, y, c) = Halide::cast(Halide::min(input(x, y, c) + 50, 255)); ``` -Functions (Func) represent individual computational steps or image operations. Each Func encapsulates an expression applied to pixels, allowing concise definition of complex image processing tasks. Vars symbolically represent spatial coordinates or dimensions (e.g., horizontal x, vertical y, color channel c). They specify where computations are applied in the image data Pipelines are formed by interconnecting multiple Func objects, structuring a clear workflow where the output of one stage feeds into subsequent stages, enabling modular and structured image processing. +Functions (Func) represent individual computational steps or image operations. Each Func encapsulates an expression applied to pixels, allowing concise definition of complex image processing tasks. Vars symbolically represent spatial coordinates or dimensions (for example, horizontal x, vertical y, color channel c). They specify where computations are applied in the image data. Pipelines are formed by interconnecting multiple Func objects, structuring a clear workflow where the output of one stage feeds into subsequent stages, enabling modular and structured image processing. -Halide is a domain-specific language (DSL) tailored explicitly for image and signal processing tasks. It provides a concise set of predefined operations and building blocks optimized for expressing complex image processing pipelines. By abstracting common computational patterns into simple yet powerful operators, Halide allows developers to succinctly define their processing logic, facilitating readability, maintainability, and easy optimization for various hardware targets. +Halide is a domain-specific language (DSL) tailored explicitly for image and signal processing tasks. 
It provides a concise set of predefined operations and building blocks optimized for expressing complex image processing pipelines. By abstracting common computational patterns into powerful operators, Halide allows developers to succinctly define their processing logic, facilitating readability, maintainability, and easy optimization for various hardware targets. ### Scheduling strategies (parallelism, vectorization, tiling) Halide offers several powerful scheduling strategies designed for maximum performance: @@ -58,7 +58,7 @@ Halide can be set up using one of two main approaches: * Installing pre-built binaries - pre-built binaries are convenient, quick to install, and suitable for most beginners or standard platforms (Windows, Linux, macOS). This approach is recommended for typical use cases. * Building Halide from source is required when pre-built binaries are unavailable for your specific environment, or if you wish to experiment with the latest Halide features or LLVM versions still under active development. This method typically requires greater familiarity with build systems and may be more suitable for advanced users. -Here, you will use pre-built binaries: +Use pre-built binaries: 1. Visit the official Halide releases [page](https://github.com/halide/Halide/releases). As of this writing, the latest Halide version is v19.0.0. 2. Download and unzip the binaries to a convenient location (e.g., /usr/local/halide on Linux/macOS or C:\halide on Windows). 3. Optionally set environment variables to simplify further usage: @@ -67,9 +67,9 @@ export HALIDE_DIR=/path/to/halide export PATH=$HALIDE_DIR/bin:$PATH ``` -To proceed futher, make sure to install the following components: +To proceed further, install the following components: 1. LLVM (Halide requires LLVM to compile and execute pipelines) -2. OpenCV (for image handling in later lessons) +2. 
OpenCV (for image handling in later sections) Install with the commands for your OS: @@ -102,7 +102,7 @@ int main() { // Static path for the input image. std::string imagePath = "img.png"; - // Load the input image using OpenCV (BGR by default). + // Load the input image using OpenCV (BGR format by default, which stands for Blue-Green-Red channel order). Mat input = imread(imagePath, IMREAD_COLOR); // Alternative: Halide has a built-in IO function to directly load images as Halide::Buffer. // Example: Halide::Buffer inputBuffer = Halide::Tools::load_image(imagePath); @@ -111,7 +111,7 @@ int main() { return -1; } - // Convert RGB back to BGR for correct color display in OpenCV (optional but recommended for OpenCV visualization). + // Convert from BGR (Blue-Green-Red) to RGB (Red-Green-Blue) so the data matches the layout the Halide pipeline expects. cvtColor(input, input, COLOR_BGR2RGB); // Wrap the OpenCV Mat data in a Halide::Buffer. @@ -151,13 +151,13 @@ int main() { } ``` -This program demonstrates how to combine Halide's image processing capabilities with OpenCV’s image I/O and display functionality. It begins by loading an image from disk using OpenCV, specifically reading from a static file named `img.png` (here you use a Cameraman image). Since OpenCV loads images in BGR format by default, the code immediately converts the image to RGB format so that it is compatible with Halide's expectations. +This program demonstrates how to combine Halide's image processing capabilities with OpenCV's image I/O and display functionality. It begins by loading an image from disk using OpenCV, specifically reading from a static file named `img.png` (here you use a Cameraman image). Since OpenCV loads images in BGR (Blue-Green-Red) format by default, the code immediately converts the image to RGB (Red-Green-Blue) format so that it's compatible with Halide's expectations.
-Once the image is loaded and converted, the program wraps the raw image data into a Halide buffer, capturing the image’s dimensions (width, height, and color channels). Next, the Halide pipeline is defined through a function named invert, which specifies the computations to perform on each pixel—in this case, subtracting the original pixel value from 255 to invert the colors. The pipeline definition alone does not perform any actual computation; it only describes what computations should occur and how to schedule them. +Once the image is loaded and converted, the program wraps the raw image data into a Halide buffer, capturing the image's dimensions (width, height, and color channels). Next, the Halide pipeline is defined through a function named invert, which specifies the computations to perform on each pixel—in this case, subtracting the original pixel value from 255 to invert the colors. The pipeline definition alone doesn't perform any actual computation; it only describes what computations should occur and how to schedule them. The actual computation occurs when the pipeline is executed with the call to invert.realize(...). This is the step that processes the input image according to the defined pipeline and produces an output Halide buffer. The scheduling directive (invert.reorder(c, x, y)) ensures that pixel data is computed in an interleaved manner (channel-by-channel per pixel), aligning the resulting data with OpenCV’s expected memory layout for images. -Finally, the processed Halide output buffer is efficiently wrapped in an OpenCV Mat header without copying pixel data. For proper display in OpenCV, which uses BGR channel ordering by default, the code converts the processed image back from RGB to BGR. The program then displays the original and inverted images in separate windows, waiting for a key press before exiting. 
This approach demonstrates a streamlined integration between Halide for high-performance image processing and OpenCV for convenient input and output operations. +Finally, the processed Halide output buffer is efficiently wrapped in an OpenCV Mat header without copying pixel data. For proper display in OpenCV, which uses BGR (Blue-Green-Red) channel ordering by default, the code converts the processed image back from RGB (Red-Green-Blue) to BGR. The program then displays the original and inverted images in separate windows, waiting for a key press before exiting. This approach demonstrates a streamlined integration between Halide for high-performance image processing and OpenCV for convenient input and output operations. By default, Halide orders loops based on the order of variable declaration. In this example, the original ordering (x, y, c) implies processing the image pixel-by-pixel across all horizontal positions (x), then vertical positions (y), and finally channels (c). This ordering naturally produces a planar memory layout (e.g., processing all red pixels first, then green, then blue). @@ -184,12 +184,12 @@ Buffer inputBuffer = Buffer::make_interleaved( ``` 2. Planar Layout (RRR...GGG...BBB...): -* Preferred by certain image-processing routines or hardware accelerators (e.g., some GPU kernels or certain ML frameworks). -* Achieved naturally by Halide's default loop ordering (x, y, c). +* Preferred by certain image-processing routines or hardware accelerators (for example, some GPU kernels or certain ML frameworks). +* Achieved naturally by Halide's default loop ordering (x, y, c). -It is essential to select loop ordering based on your specific data format requirements and integration scenario. Halide provides full flexibility, allowing you to explicitly reorder loops to match the desired memory layout efficiently. +Select loop ordering based on your specific data format requirements and integration scenario. 
Halide provides full flexibility, allowing you to explicitly reorder loops to match the desired memory layout efficiently. -In Halide, two distinct concepts must be distinguished clearly: +In Halide, keep two distinct concepts clearly separate: 1. Loop execution order (controlled by reorder). Defines the nesting order of loops during computation. For example, to make the channel dimension (c) innermost during computation: ```cpp @@ -213,7 +213,7 @@ g++ -std=c++17 hello-world.cpp -o hello-world \ -Wl,-rpath,/path/to/halide/lib ``` -Note that, on Linux, you would set LD_LIBRARY_PATH instead: +On Linux, set LD_LIBRARY_PATH instead: ```console export LD_LIBRARY_PATH=/path/to/halide/lib/ ``` @@ -223,14 +223,14 @@ Run the executable: ```console ./hello-world ``` -You will see two windows displaying the original and inverted images: +You'll see two windows displaying the original and inverted images: ![img1](Figures/01.png) ![img2](Figures/02.png) ## Summary -In this section, you have learned Halide's foundational concepts, explored the benefits of separating algorithms and schedules, set up your development environment, and created your first functional Halide application integrated with OpenCV. +In this section, you've learned Halide's foundational concepts, explored the benefits of separating algorithms and schedules, set up your development environment, and created your first functional Halide application integrated with OpenCV. -While the example introduces the core concepts of Halide pipelines (such as defining computations symbolically and realizing them), it does not yet showcase the substantial benefits of explicitly separating algorithm definition from scheduling strategies. +While the example introduces the core concepts of Halide pipelines (such as defining computations symbolically and realizing them), it doesn't yet showcase the substantial benefits of explicitly separating algorithm definition from scheduling strategies.
-In subsequent sections, you will explore advanced Halide scheduling techniques, including parallelism, vectorization, tiling, and loop fusion, which will clearly demonstrate the practical advantages of separating algorithm logic from scheduling. These techniques enable fine-grained performance optimization tailored to specific hardware without modifying algorithmic correctness. +Subsequent sections explore advanced Halide scheduling techniques, including parallelism, vectorization, tiling, and loop fusion, which clearly demonstrate the practical advantages of separating algorithm logic from scheduling. These techniques enable fine-grained performance optimization tailored to specific hardware without modifying algorithmic correctness. diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md index 6d7b9ec3d9..2d5da32a05 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md +++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md @@ -128,9 +128,9 @@ int main() { } ``` -The camera delivers interleaved BGR frames. Inside Halide, we convert to grayscale (Rec.601), apply a 3×3 binomial blur (sum/16 with 16-bit accumulation), then threshold to produce a binary image. We compile once (outside the capture loop) and realize per frame for real-time processing. +The camera delivers interleaved BGR frames. Inside Halide, convert to grayscale (Rec.601), apply a 3×3 binomial blur (sum/16 with 16-bit accumulation), then threshold to produce a binary image. Compile once (outside the capture loop) and realize per frame for real-time processing. -A 3×3 filter needs neighbors (x±1, y±1). At the image edges, some taps would fall outside the valid region.
Rather than scattering manual clamps across expressions, we wrap the input once: +A 3×3 filter needs neighbors (x±1, y±1). At the image edges, some taps would fall outside the valid region. Rather than scattering manual clamps across expressions, wrap the input once: ```cpp // Wrap the input so out-of-bounds reads replicate the nearest edge pixel. @@ -139,7 +139,7 @@ Func inputClamped = BoundaryConditions::repeat_edge(input); Any out-of-bounds access replicates the nearest edge pixel. This makes the boundary policy obvious, keeps expressions clean, and ensures all downstream stages behave consistently at the edges. -Grayscale conversion happens inside Halide using Rec.601 weights. We read B, G, R from the interleaved input and compute luminance: +Grayscale conversion happens inside Halide using Rec.601 weights. Read B, G, R from the interleaved input and compute luminance: ```cpp // Grayscale (Rec.601) @@ -150,7 +150,7 @@ gray(x, y) = cast(0.114f * inputClamped(x, y, 0) + // B 0.299f * inputClamped(x, y, 2)); // R ``` -Next, the pipeline applies a Gaussian-approximate (binomial) blur using a fixed 3×3 kernel. For this learning path, we implement it with small loops and 16-bit accumulation for safety: +Next, the pipeline applies a Gaussian-approximate (binomial) blur using a fixed 3×3 kernel. For this Learning Path, implement it with small loops and 16-bit accumulation for safety: ```cpp Func blur("blur"); @@ -167,7 +167,7 @@ Why this kernel? * The weights approximate a Gaussian distribution, which reduces noise but preserves edges better than a box filter. * This is mathematically a binomial filter, a standard and efficient approximation of Gaussian blurring. -After the blur, the pipeline applies thresholding to produce a binary image. We explicitly cast constants to uint8_t to remove ambiguity and avoid redundant widen/narrow operations in generated code: +After the blur, the pipeline applies thresholding to produce a binary image. 
Explicitly cast constants to uint8_t to remove ambiguity and avoid redundant widen/narrow operations in generated code: ```cpp Func output("output"); @@ -354,7 +354,7 @@ realize: 3.98 ms | 251.51 FPS | 521.52 MPix/s This gives an FPS of 251.51, and average throughput of 521.52 MPix/s. Now you can start measuring potential improvements from scheduling. ### Parallelization -Parallelization lets Halide run independent pieces of work at the same time on multiple CPU cores. In image pipelines, rows (or row tiles) are naturally parallel once producer data is available. By distributing work across cores, we reduce wall-clock time—crucial for real-time video. +Parallelization lets Halide run independent pieces of work at the same time on multiple CPU cores. In image pipelines, rows (or row tiles) are naturally parallel once producer data is available. By distributing work across cores, wall-clock time is reduced—crucial for real-time video. With the baseline measured, apply a minimal schedule that parallelizes the loop iteration for y axis. @@ -387,10 +387,10 @@ Tiling splits the image into cache-friendly blocks (tiles). Two wins: * Partitioning: tiles are easy to parallelize across cores. * Locality: when you cache intermediates per tile, you avoid refetching/recomputing data and hit CPU L1/L2 cache more often. -Now lets look at both flavors. +Now let's look at both flavors. ### Tiling with explicit intermediate storage (best for cache efficiency) -Here you will cache gray once per tile so the 3×3 blur can reuse it instead of recomputing RGB -> gray up to 9× per output pixel. +Cache gray once per tile so the 3×3 blur can reuse it instead of recomputing RGB -> gray up to 9× per output pixel. ```cpp // Scheduling @@ -414,23 +414,23 @@ In this scheduling: * parallel(yo) distributes tiles across CPU cores where a CPU core is in charge of a row (yo) of tiles. * gray.compute_at(...).store_at(...) 
materializes a tile-local planar buffer for the grayscale intermediate so blur can reuse it within the tile. -Recompile your application as before, then run. What we observed on our machine: +Recompile your application as before, then run. Here's sample output: ```output realize: 0.98 ms | 1023.15 FPS | 2121.60 MPix/s ``` This was the fastest variant here—caching a planar grayscale per tile enabled efficient reuse. -### How we schedule -In general, there is no one-size-fits-all rule of scheduling to achieve the best performance as it depends on your pipeline characteristics and the target device architecture. So, it is recommended to explore the scheduling options and that is where Halide's scheduling API is purposed for. +### How to schedule +In general, there's no one-size-fits-all scheduling rule for best performance; it depends on your pipeline characteristics and the target device architecture. Exploring the scheduling options is recommended, and that's exactly what Halide's scheduling API is designed for. For this application, for example: * Start with parallelizing the outer-most loop. -* Add tiling + caching only if: there is a spatial filter, or the intermediate is reused by multiple consumers—and preferably after converting sources to planar (or precomputing a planar gray). +* Add tiling + caching only if: there's a spatial filter, or the intermediate is reused by multiple consumers—and preferably after converting sources to planar (or precomputing a planar gray). * From there, tune tile sizes and thread count for your target. `HL_NUM_THREADS` is the environment variable that lets you limit the number of threads in flight. ## Summary -In this section, you built a real-time Halide+OpenCV pipeline—grayscale, a 3×3 binomial blur, then thresholding—and instrumented it to measure throughput. And then, we observed that parallelization and tiling improved the performance.
+In this section, you built a real-time Halide+OpenCV pipeline—grayscale, a 3×3 binomial blur, then thresholding—and instrumented it to measure throughput. Parallelization and tiling improved the performance. * Parallelization spreads independent work across CPU cores. -* Tiling for cache efficiency helps when an expensive intermediate is reused many times per output (e.g., larger kernels, separable/multi-stage pipelines, multiple consumers) and when producers read planar data. +* Tiling for cache efficiency helps when an expensive intermediate is reused many times per output (for example, larger kernels, separable/multi-stage pipelines, multiple consumers) and when producers read planar data. From 99578aae11a3046ed04f3070580e3bd569865786 Mon Sep 17 00:00:00 2001 From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com> Date: Thu, 13 Nov 2025 22:08:05 +0000 Subject: [PATCH 2/8] Refactor Android Halide documentation for clarity and consistency --- .../android_halide/_index.md | 10 +- .../android_halide/intro.md | 143 +++++++++++------- .../android_halide/processing-workflow.md | 57 ++++--- 3 files changed, 125 insertions(+), 85 deletions(-) diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/_index.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/_index.md index f00c88f159..255b7a92d0 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/_index.md +++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/_index.md @@ -3,13 +3,13 @@ title: Build high-performance image processing with Halide on Android minutes_to_complete: 180 -who_is_this_for: This is an introductory topic for software developers interested in learning how to use Halide for image processing. +who_is_this_for: This is an introductory topic for developers interested in learning how to use Halide for image processing. 
learning_objectives: - - Understand foundational concepts of Halide and set up your development environment. - - Create a basic real-time image processing pipeline using Halide. - - Optimize image processing workflows by applying operation fusion in Halide. - - Integrate Halide pipelines into Android applications developed with Kotlin. + - Understand Halide fundamentals and set up your development environment + - Create a basic real-time image processing pipeline using Halide + - Optimize image processing workflows by applying operation fusion in Halide + - Integrate Halide pipelines into Android applications developed with Kotlin prerequisites: - Basic C++ knowledge diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md index a4ea931cbb..b6dfecb724 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md +++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md @@ -1,32 +1,45 @@ --- # User change -title: "Background and Installation" +title: "Install and configure Halide for Arm development" weight: 2 layout: "learningpathall" --- -## Introduction -Halide is a powerful, open-source programming language specifically designed to simplify and optimize high-performance image and signal processing pipelines. Initially developed by researchers at MIT and Adobe in 2012, Halide addresses a critical challenge in computational imaging: efficiently mapping image-processing algorithms onto diverse hardware architectures without extensive manual tuning. It accomplishes this by clearly separating the description of an algorithm (specifying the mathematical or logical transformations applied to images or signals) from its schedule (detailing how and where those computations execute). 
This design enables rapid experimentation and effective optimization for various processing platforms, including CPUs, GPUs, and mobile hardware. +## What is Halide? -A key advantage of Halide lies in its innovative programming model. By clearly distinguishing between algorithmic logic and scheduling decisions—such as parallelism, vectorization, memory management, and hardware-specific optimizations, developers can first focus on ensuring the correctness of their algorithms. Performance tuning can then be handled independently, significantly accelerating development cycles. This approach often yields performance that matches or even surpasses manually optimized code. As a result, Halide has seen widespread adoption across industry and academia, powering image processing systems at organizations such as Google, Adobe, and Facebook, and enabling advanced computational photography features used by millions daily. +Halide is a powerful, open-source programming language designed to simplify and optimize high-performance image and signal processing. Researchers at MIT and Adobe developed Halide in 2012 to address a critical challenge: efficiently running image-processing algorithms on different hardware architectures without extensive manual tuning. Halide separates the description of an algorithm (the mathematical or logical transformations applied to images or signals) from its schedule (how and where those computations execute). This design enables rapid experimentation and effective optimization for various platforms, including CPUs, GPUs, and mobile hardware. + +A key advantage of Halide lies in its innovative programming model. By distinguishing between algorithmic logic and scheduling decisions—such as parallelism, vectorization, memory management, and hardware-specific optimizations—you can first focus on ensuring the correctness of your algorithms. Performance tuning can then be handled independently, accelerating development cycles. 
This approach often yields performance that matches or even surpasses manually optimized code. As a result, Halide has seen widespread adoption across industry and academia, powering image processing systems at organizations such as Google, Adobe, and Facebook, and enabling advanced computational photography features used by millions daily. In this Learning Path, you'll explore Halide's foundational concepts, set up your development environment, and create your first functional Halide application. By the end, you'll understand what makes Halide uniquely suited to efficient image processing, particularly on mobile and Arm-based hardware, and be ready to build your own optimized pipelines. -For broader or more general use cases, please refer to the official Halide documentation and tutorials available at [halide-lang.org](https://halide-lang.org). +For broader use cases, see the official Halide documentation and tutorials at [the Halide website](https://halide-lang.org). -The example code for this Learning Path is available in two repositories [here](https://github.com/dawidborycki/Arm.Halide.Hello-World.git) and [here](https://github.com/dawidborycki/Arm.Halide.AndroidDemo.git) +The example code for this Learning Path is available in two GitHub repositories: [Arm.Halide.Hello-World GitHub repository](https://github.com/dawidborycki/Arm.Halide.Hello-World.git) and [Arm.Halide.AndroidDemo GitHub repository](https://github.com/dawidborycki/Arm.Halide.AndroidDemo.git). ## Key concepts in Halide + +Before building your first Halide application, you need to understand three foundational concepts that make Halide powerful for image processing: + +- Separating algorithms from schedules +- Using symbolic building blocks +- Applying scheduling strategies + +These concepts work together to enable high-performance code that's both readable and portable across different hardware architectures. 
+ ### Separation of algorithm and schedule -At the core of Halide's design philosophy is the principle of clearly separating algorithms from schedules. Traditional image-processing programming tightly couples algorithmic logic with execution strategy, complicating optimization and portability. In contrast, Halide explicitly distinguishes these two components: - * Algorithm: Defines what computations are performed—for example, image filters, pixel transformations, or other mathematical operations on image data. - * Schedule: Specifies how and where these computations are executed, addressing critical details such as parallel execution, memory usage, caching strategies, and hardware-specific optimizations. -This separation allows developers to rapidly experiment and optimize their code for different hardware architectures or performance requirements without altering the core algorithmic logic. +Halide's core design principle separates algorithms from schedules. Traditional image-processing code tightly couples algorithmic logic with execution strategy, complicating optimization and portability. Halide distinguishes these two components: + +**Algorithm** defines what computations are performed (for example, image filters, pixel transformations, or mathematical operations on image data). + +**Schedule** specifies how and where these computations execute, including parallel execution, memory usage, caching strategies, and hardware-specific optimizations. + +This separation lets you experiment and optimize code for different hardware architectures without changing the core algorithmic logic. -Halide provides three key building blocks, including Functions, Vars, and Pipelines, to simplify and structure image processing algorithms. 
Consider the following illustrative example:
+Halide provides three key building blocks to structure image processing algorithms:

```cpp
Halide::Var x("x"), y("y"), c("c");
@@ -36,40 +49,55 @@ Halide::Func brighter("brighter");
brighter(x, y, c) = Halide::cast<uint8_t>(Halide::min(input(x, y, c) + 50, 255));
```

-Functions (Func) represent individual computational steps or image operations. Each Func encapsulates an expression applied to pixels, allowing concise definition of complex image processing tasks. Vars symbolically represent spatial coordinates or dimensions (for example, horizontal x, vertical y, color channel c). They specify where computations are applied in the image data. Pipelines are formed by interconnecting multiple Func objects, structuring a clear workflow where the output of one stage feeds into subsequent stages, enabling modular and structured image processing.
+**Functions (Func)** represent individual computational steps or image operations. Each `Func` encapsulates an expression applied to pixels, enabling concise definition of complex tasks.

-Halide is a domain-specific language (DSL) tailored explicitly for image and signal processing tasks. It provides a concise set of predefined operations and building blocks optimized for expressing complex image processing pipelines. By abstracting common computational patterns into powerful operators, Halide allows developers to succinctly define their processing logic, facilitating readability, maintainability, and easy optimization for various hardware targets.
+**Vars** symbolically represent spatial coordinates or dimensions (for example, horizontal x, vertical y, color channel c), specifying where computations are applied.
-### Scheduling strategies (parallelism, vectorization, tiling) -Halide offers several powerful scheduling strategies designed for maximum performance: - * Parallelism: Executes computations concurrently across multiple CPU cores, significantly reducing execution time for large datasets. - * Vectorization: Enables simultaneous processing of multiple data elements using SIMD (Single Instruction, Multiple Data) instructions available on CPUs and GPUs, greatly enhancing performance. - * Tiling: Divides computations into smaller blocks (tiles) optimized for cache efficiency, thus improving memory locality and reducing overhead due to memory transfers. +**Pipelines** are formed by connecting multiple `Func` objects, creating a workflow where each stage's output feeds into subsequent stages. -By combining these scheduling techniques, developers can achieve optimal performance tailored specifically to their target hardware architecture. +Halide is a domain-specific language (DSL) tailored for image and signal processing. It provides predefined operations and building blocks optimized for expressing complex pipelines. By abstracting common computational patterns, Halide lets you define processing logic concisely, facilitating readability, maintainability, and optimization across hardware targets. -Beyond manual scheduling strategies, Halide also provides an Autoscheduler, a powerful tool that automatically generates optimized schedules tailored to specific hardware architectures, further simplifying performance optimization. +### Scheduling strategies + +Halide offers several powerful scheduling strategies for maximum performance: + +- Parallelism - executes computations concurrently across multiple CPU cores, reducing execution time for large datasets. + +- Vectorization - enables simultaneous processing of multiple data elements using SIMD (Single Instruction, Multiple Data) instructions, enhancing performance on CPUs and GPUs. 
+
+- Tiling - divides computations into smaller blocks optimized for cache efficiency, improving memory locality and reducing transfer overhead.
+
+Combining these techniques achieves optimal performance tailored to your target hardware architecture.
+
+Beyond manual scheduling, Halide provides an Autoscheduler that automatically generates optimized schedules for specific hardware architectures, simplifying performance optimization.

## System requirements and environment setup
-To start developing with Halide, your system must meet several requirements and dependencies.
+
+To start developing with Halide, your system needs to meet several requirements.

### Installation options
-Halide can be set up using one of two main approaches:
-* Installing pre-built binaries - pre-built binaries are convenient, quick to install, and suitable for most beginners or standard platforms (Windows, Linux, macOS). This approach is recommended for typical use cases.
-* Building Halide from source is required when pre-built binaries are unavailable for your specific environment, or if you wish to experiment with the latest Halide features or LLVM versions still under active development. This method typically requires greater familiarity with build systems and may be more suitable for advanced users.
-
-Use pre-built binaries:
- 1. Visit the official Halide releases [page](https://github.com/halide/Halide/releases). As of this writing, the latest Halide version is v19.0.0.
- 2. Download and unzip the binaries to a convenient location (e.g., /usr/local/halide on Linux/macOS or C:\halide on Windows).
- 3. Optionally set environment variables to simplify further usage:
+
+You can set up Halide using one of two approaches:
+
+**Pre-built binaries** are convenient, quick to install, and suitable for most users on standard platforms (Windows, Linux, macOS). This approach is recommended for typical use cases.
+
+**Building from source** is required when pre-built binaries aren't available for your environment, or if you want to experiment with the latest Halide features or LLVM versions under active development. This method requires familiarity with build systems.
+
+To use pre-built binaries:
+
+1. Visit the official Halide [releases page](https://github.com/halide/Halide/releases). As of this writing, the latest version is v19.0.0.
+2. Download and unzip the binaries to a convenient location (for example, `/usr/local/halide` on Linux/macOS or `C:\halide` on Windows).
+3. Set environment variables to simplify usage:
```console
export HALIDE_DIR=/path/to/halide
export PATH=$HALIDE_DIR/bin:$PATH
```

-To proceed further, install the following components:
-1. LLVM (Halide requires LLVM to compile and execute pipelines)
-2. OpenCV (for image handling in later sections)
+Next, install the following components:
+
+**LLVM** - Halide requires LLVM to compile and execute pipelines
+
+**OpenCV** - For image handling in later sections

Install with the commands for your OS:

@@ -87,7 +115,8 @@ brew install opencv pkg-config
Halide examples were tested with OpenCV 4.11.0

## Your first Halide program
-Now you’re ready to build your first Halide-based application. Save the following code in a file named `hello-world.cpp`:
+
+You're now ready to build your first Halide application. Save the following code in a file named `hello-world.cpp`:
```cpp
#include "Halide.h"
#include <opencv2/opencv.hpp>
@@ -157,24 +186,22 @@ Once the image is loaded and converted, the program wraps the raw image data int
The actual computation occurs when the pipeline is executed with the call to invert.realize(...). This is the step that processes the input image according to the defined pipeline and produces an output Halide buffer.
The scheduling directive (invert.reorder(c, x, y)) ensures that pixel data is computed in an interleaved manner (channel-by-channel per pixel), aligning the resulting data with OpenCV’s expected memory layout for images. -Finally, the processed Halide output buffer is efficiently wrapped in an OpenCV Mat header without copying pixel data. For proper display in OpenCV, which uses BGR (Blue-Green-Red) channel ordering by default, the code converts the processed image back from RGB (Red-Green-Blue) to BGR. The program then displays the original and inverted images in separate windows, waiting for a key press before exiting. This approach demonstrates a streamlined integration between Halide for high-performance image processing and OpenCV for convenient input and output operations. +Finally, the processed Halide output buffer is wrapped in an OpenCV `Mat` header without copying pixel data. For proper display in OpenCV, which uses BGR (Blue-Green-Red) channel ordering by default, the code converts the processed image back from RGB to BGR. The program then displays the original and inverted images in separate windows, waiting for a key press before exiting. This approach demonstrates integration between Halide for high-performance image processing and OpenCV for convenient input and output operations. By default, Halide orders loops based on the order of variable declaration. In this example, the original ordering (x, y, c) implies processing the image pixel-by-pixel across all horizontal positions (x), then vertical positions (y), and finally channels (c). This ordering naturally produces a planar memory layout (e.g., processing all red pixels first, then green, then blue). However, the optimal loop order depends on your intended memory layout and compatibility with external libraries: -1. Interleaved Layout (RGBRGBRGB…): -* Commonly used by libraries such as OpenCV. 
-* To achieve this, the color channel (c) should be the innermost loop, followed by horizontal (x) and then vertical (y) loops
-Specifically, call:
+**Interleaved layout (RGBRGBRGB…)** is commonly used by libraries such as OpenCV. To achieve this, the color channel (c) should be the innermost loop, followed by horizontal (x) and then vertical (y) loops.
+
+Call:
```cpp
invert.reorder(c, x, y);
```
-This changes the loop nesting to process each pixel’s channels together (R, G, B for the first pixel, then R, G, B for the second pixel, and so on), resulting in:
-* Better memory locality and cache performance when interfacing with interleaved libraries like OpenCV.
-* Reduced overhead for subsequent image-handling operations (display, saving, or further processing).
-By default, OpenCV stores images in interleaved memory layout, using the HWC (Height, Width, Channel) ordering. To correctly represent this data layout in a Halide buffer, you can also explicitly use the Buffer::make_interleaved() method, which ensures the data layout is properly specified. The code snippet would look like this:
+This changes the loop nesting to process each pixel's channels together (R, G, B for the first pixel, then R, G, B for the second pixel, and so on). This provides better memory locality and cache performance when interfacing with interleaved libraries like OpenCV, and reduces overhead for subsequent image-handling operations (display, saving, or further processing).
+
+By default, OpenCV stores images in interleaved memory layout, using the HWC (Height, Width, Channel) ordering. To correctly represent this data layout in a Halide buffer, you can use the `Buffer::make_interleaved()` method, which ensures the data layout is properly specified:

```cpp
// Wrap the OpenCV Mat data in a Halide buffer with interleaved HWC layout.
@@ -183,28 +210,29 @@ Buffer<uint8_t> inputBuffer = Buffer<uint8_t>::make_interleaved(
);
```

-2.
Planar Layout (RRR...GGG...BBB...): -* Preferred by certain image-processing routines or hardware accelerators (for example, some GPU kernels or certain ML frameworks). -* Achieved naturally by Halide's default loop ordering (x, y, c). +**Planar layout (RRR...GGG...BBB...)** is preferred by certain image-processing routines or hardware accelerators (for example, some GPU kernels or ML frameworks). This is achieved naturally by Halide's default loop ordering (x, y, c). + +Select loop ordering based on your data format requirements and integration scenario. Halide provides full flexibility, letting you reorder loops to match the desired memory layout efficiently. -Select loop ordering based on your specific data format requirements and integration scenario. Halide provides full flexibility, allowing you to explicitly reorder loops to match the desired memory layout efficiently. +In Halide, distinguish two distinct concepts: -In Halide, distinguish two distinct concepts clearly: -1. Loop execution order (controlled by reorder). Defines the nesting order of loops during computation. For example, to make the channel dimension (c) innermost during computation: +**Loop execution order** (controlled by `reorder`) defines the nesting order of loops during computation. For example, to make the channel dimension (c) innermost during computation: ```cpp invert.reorder(c, x, y); ``` -2. Memory storage layout (controlled by reorder_storage). Defines the actual order in which data is stored in memory, such as interleaved or planar: + +**Memory storage layout** (controlled by `reorder_storage`) defines the actual order in which data is stored in memory, such as interleaved or planar: ```cpp invert.reorder_storage(c, x, y); ``` -Using only reorder(c, x, y) affects the computational loop order but not necessarily the memory layout. The computed data could still be stored in planar order by default. Using reorder_storage(c, x, y) explicitly defines the memory layout as interleaved. 
+Using only `reorder(c, x, y)` affects the computational loop order but not necessarily the memory layout. The computed data could still be stored in planar order by default. Using `reorder_storage(c, x, y)` defines the memory layout as interleaved.

## Compilation instructions
-Compile the program as follows (replace /path/to/halide accordingly):
+
+Compile the program as follows (replace `/path/to/halide` with your actual path):
```console
export DYLD_LIBRARY_PATH=/path/to/halide/lib
g++ -std=c++17 hello-world.cpp -o hello-world \
@@ -218,19 +246,20 @@ On Linux, set LD_LIBRARY_PATH instead:
export LD_LIBRARY_PATH=/path/to/halide/lib/
```

-Run the executable:
+To run the executable:
```console
./hello-world
```

You'll see two windows displaying the original and inverted images:
-![img1](Figures/01.png)
-![img2](Figures/02.png)
+![Original color photograph of a cameraman on the left showing a person operating a professional camera, and inverted version on the right with reversed colors where the subject appears in negative](Figures/01.png)
+![Two side-by-side terminal windows showing compilation and execution of the Halide hello-world program, with the left window displaying g++ compilation commands and library paths, and the right window showing successful program execution with OpenCV window initialization messages](Figures/02.png)

## Summary
-In this section, you've learned Halide's foundational concepts, explored the benefits of separating algorithms and schedules, set up your development environment, and created your first functional Halide application integrated with OpenCV.
-While the example introduces the core concepts of Halide pipelines (such as defining computations symbolically and realizing them), it doesn't yet showcase the substantial benefits of explicitly separating algorithm definition from scheduling strategies.
+You've learned Halide's foundational concepts, explored the benefits of separating algorithms and schedules, set up your development environment, and created your first functional Halide application integrated with OpenCV. + +While the example introduces the core concepts of Halide pipelines (such as defining computations symbolically and realizing them), it doesn't yet showcase the benefits of separating algorithm definition from scheduling strategies. -In subsequent sections, explore advanced Halide scheduling techniques, including parallelism, vectorization, tiling, and loop fusion, which clearly demonstrate the practical advantages of separating algorithm logic from scheduling. These techniques enable fine-grained performance optimization tailored to specific hardware without modifying algorithmic correctness. +In subsequent sections, you'll explore advanced Halide scheduling techniques, including parallelism, vectorization, tiling, and loop fusion, which demonstrate the practical advantages of separating algorithm logic from scheduling. These techniques enable fine-grained performance optimization tailored to specific hardware without modifying algorithmic correctness. diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md index 2d5da32a05..c82d8d7a73 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md +++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md @@ -1,16 +1,19 @@ --- # User change -title: "Building a Simple Camera Image Processing Workflow" +title: "Build a simple camera image processing workflow" weight: 3 layout: "learningpathall" --- -## Objective -In this section, you will build a real-time camera processing pipeline using Halide. 
First, you capture video frames from a webcam using OpenCV, then implement a Gaussian (binomial) blur to smooth the captured images, followed by thresholding to create a clear binary output highlighting prominent image features. After establishing this pipeline, you will measure performance and then explore Halide's scheduling options—parallelization and tiling—to understand when they help and when they don’t. +## What you'll build -## Gaussian blur and thresholding +In this section, you will build a real-time camera processing pipeline using Halide. First, you will capture video frames from a webcam using OpenCV, then implement a Gaussian (binomial) blur to smooth the captured images, followed by thresholding to create a clear binary output highlighting prominent image features. + +You will then measure performance and explore Halide's scheduling options—parallelization and tiling—to see how each improves throughput. + +## Implement Gaussian blur and thresholding Create a new `camera-capture.cpp` file and modify it as follows: ```cpp #include "Halide.h" @@ -127,10 +130,11 @@ int main() { return 0; } ``` +The camera delivers interleaved BGR frames. You convert them to grayscale using Rec.601 weights, apply a 3×3 binomial blur (with 16-bit accumulation and division by 16), and then threshold to create a binary image. -The camera delivers interleaved BGR frames. Inside Halide, convert to grayscale (Rec.601), apply a 3×3 binomial blur (sum/16 with 16-bit accumulation), then threshold to produce a binary image. Compile once (outside the capture loop) and realize per frame for real-time processing. +Compile the pipeline once before the capture loop starts, then call `realize()` each frame for real-time processing. -A 3×3 filter needs neighbors (x±1, y±1). At the image edges, some taps would fall outside the valid region. Rather than scattering manual clamps across expressions, wrap the input once: +A 3×3 filter needs neighbors (x±1, y±1). 
At the image edges, some taps fall outside the valid region. Rather than scattering manual clamps across expressions, wrap the input once:

```cpp
// Wrap the input so out-of-bounds reads replicate the nearest edge pixel.
@@ -175,9 +179,9 @@ Func output("output");
output(x, y) = select(blur(x, y) > T, cast<uint8_t>(255), cast<uint8_t>(0));
```

-This simple but effective step emphasizes strong edges and regions of high contrast, often used as a building block in segmentation and feature extraction pipelines
+This simple but effective step emphasizes strong edges and regions of high contrast, often used as a building block in segmentation and feature extraction pipelines.

-Finally, the result is realized by Halide and displayed via OpenCV. The pipeline is built once (outside the capture loop) and then realized each frame:
+Halide generates the final output, which OpenCV then displays. Build the pipeline once (outside the capture loop), and then realize it each frame:
```cpp
// Build the pipeline once (outside the capture loop)
Buffer<uint8_t> outBuf(width, height);
@@ -192,7 +196,7 @@ imshow("Processing Workflow", view);

The main loop continues capturing frames, running the Halide pipeline, and displaying the processed output in real time until a key is pressed. This illustrates how Halide integrates cleanly with OpenCV to build efficient, interactive image-processing applications.

-## Compilation instructions
+## Compile and run the program
Compile the program as follows (replace /path/to/halide accordingly):
```console
g++ -std=c++17 camera-capture.cpp -o camera-capture \
@@ -205,16 +209,20 @@ Run the executable:
```console
./camera-capture
```
+The output should look similar to the figure below:
+![A camera viewport window titled Processing Workflow displaying a real-time binary threshold output from a webcam feed.
The image shows a person's face and shoulders rendered in stark black and white, where bright areas above the threshold value appear white and darker areas appear black, creating a high-contrast silhouette effect that emphasizes edges and prominent features.](Figures/03.webp)

-The output should look as in the figure below:
-![img3](Figures/03.webp)
+## Parallelization and tiling

-## Parallelization and Tiling
-In this section, you will explore two complementary scheduling optimizations provided by Halide: Parallelization and Tiling. Both techniques help enhance performance but achieve it through different mechanisms—parallelization leverages multiple CPU cores, whereas tiling improves cache efficiency by optimizing data locality.
+In this section, you will explore two scheduling optimizations that Halide provides: parallelization and tiling. Each technique improves performance in a different way—parallelization uses multiple CPU cores, while tiling optimizes cache efficiency through better data locality.

You'll apply each technique separately, which keeps their individual benefits clear.
-Let’s first lock in a measurable baseline before we start changing the schedule. You will create a second file, `camera-capture-perf-measurement.cpp`, that runs the same grayscale → blur → threshold pipeline but prints per-frame timing, FPS, and MPix/s around the Halide realize() call. This lets you quantify each optimization you will add next (parallelization, tiling, caching).
+
+### Measure baseline performance
+
+Before applying any scheduling optimizations, establish a measurable baseline. Create a second file, `camera-capture-perf-measurement.cpp`, that runs the same grayscale → blur → threshold pipeline but prints per-frame timing, FPS, and MPix/s around the Halide `realize()` call. This lets you quantify each optimization you add next (parallelization, tiling, caching).
+
Create `camera-capture-perf-measurement.cpp` with the following code:

```cpp
@@ -353,7 +361,8 @@ realize: 3.98 ms | 251.51 FPS | 521.52 MPix/s
This gives an FPS of 251.51, and average throughput of 521.52 MPix/s. Now you can start measuring potential improvements from scheduling.

-### Parallelization
+### Apply parallelization
+
Parallelization lets Halide run independent pieces of work at the same time on multiple CPU cores. In image pipelines, rows (or row tiles) are naturally parallel once producer data is available. By distributing work across cores, wall-clock time is reduced—crucial for real-time video.

With the baseline measured, apply a minimal schedule that parallelizes the loop iteration over the y axis.
@@ -380,17 +389,19 @@ realize: 1.16 ms | 864.15 FPS | 1791.90 MPix/s

The performance gain from parallelization depends on how many CPU cores are available to the application.

-### Tiling
+### Apply tiling for cache efficiency
+
Tiling is a scheduling technique that divides computations into smaller, cache-friendly blocks or tiles. This approach significantly enhances data locality, reduces memory bandwidth usage, and leverages CPU caches more efficiently. While tiling can also use parallel execution, its primary advantage comes from optimizing intermediate data storage.

Tiling splits the image into cache-friendly blocks (tiles). Two wins:
* Partitioning: tiles are easy to parallelize across cores.
* Locality: when you cache intermediates per tile, you avoid refetching/recomputing data and hit CPU L1/L2 cache more often.

-Now let's look at both flavors.
+Now take a look at both flavors.
+
+### Cache intermediates per tile

-### Tiling with explicit intermediate storage (best for cache efficiency)
-Cache gray once per tile so the 3×3 blur can reuse it instead of recomputing RGB -> gray up to 9× per output pixel.
+This approach caches gray once per tile so the 3×3 blur can reuse it instead of recomputing RGB to gray up to 9× per output pixel.
This provides the best cache efficiency.

```cpp
// Scheduling
@@ -410,16 +421,16 @@ Cache gray once per tile so the 3×3 blur can reuse it instead of recomputing RG
```

In this scheduling:
-* tile(...) splits the image into cache-friendly blocks and makes it easy to parallelize across tiles.
-* parallel(yo) distributes tiles across CPU cores where a CPU core is in charge of a row (yo) of tiles.
-* gray.compute_at(...).store_at(...) materializes a tile-local planar buffer for the grayscale intermediate so blur can reuse it within the tile.
+* tile(...) splits the image into cache-friendly blocks and makes it easy to parallelize across tiles
+* parallel(yo) distributes tiles across CPU cores where a CPU core is in charge of a row (yo) of tiles
+* gray.compute_at(...).store_at(...) materializes a tile-local planar buffer for the grayscale intermediate so blur can reuse it within the tile

Recompile your application as before, then run. Here's sample output:
```output
realize: 0.98 ms | 1023.15 FPS | 2121.60 MPix/s
```

-This was the fastest variant here—caching a planar grayscale per tile enabled efficient reuse.
+This is the fastest variant: caching a planar grayscale per tile enables efficient reuse.

### How to schedule
In general, there's no one-size-fits-all scheduling rule for achieving the best performance; it depends on your pipeline characteristics and the target device architecture. Exploring the scheduling options is recommended, and that's exactly what Halide's scheduling API is designed for.

From 71f6ceefe9017d8721c7d4e33b2af25ddf8bf996 Mon Sep 17 00:00:00 2001
From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com>
Date: Fri, 14 Nov 2025 12:13:04 +0000
Subject: [PATCH 3/8] Refactor Android Halide learning path content for clarity
 and consistency

- Updated learning objectives and prerequisites for the Android Halide index.
- Revised titles and section headings for better readability in Android integration and AOT compilation documents. - Enhanced descriptions and explanations throughout the Android Halide materials, focusing on clarity and user engagement. - Improved code comments and documentation in processing workflow and fusion sections to better illustrate concepts. - Streamlined the introduction to Halide, emphasizing its advantages and foundational concepts. - Added performance measurement details and clarified scheduling strategies in the camera processing workflow. --- .../android_halide/_index.md | 12 +-- .../android_halide/android.md | 37 ++++----- .../aot-and-cross-compilation.md | 24 +++--- .../android_halide/fusion.md | 22 +++--- .../android_halide/intro.md | 78 +++++++++---------- .../android_halide/processing-workflow.md | 47 +++++------ 6 files changed, 110 insertions(+), 110 deletions(-) diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/_index.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/_index.md index 255b7a92d0..1b45f5ed82 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/_index.md +++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/_index.md @@ -6,10 +6,10 @@ minutes_to_complete: 180 who_is_this_for: This is an introductory topic for developers interested in learning how to use Halide for image processing. 
learning_objectives: - - Understand Halide fundamentals and set up your development environment - - Create a basic real-time image processing pipeline using Halide - - Optimize image processing workflows by applying operation fusion in Halide - - Integrate Halide pipelines into Android applications developed with Kotlin + - Learn the basics of Halide and set up your development environment + - Build a simple real-time image processing pipeline with Halide + - Make your image processing faster by combining operations in Halide + - Use Halide pipelines in Android apps written with Kotlin prerequisites: - Basic C++ knowledge @@ -36,11 +36,11 @@ tools_software_languages: further_reading: - resource: - title: Halide 19.0.0 + title: Halide documentation link: https://halide-lang.org/docs/index.html type: website - resource: - title: Halide GitHub + title: Halide GitHub repository link: https://github.com/halide/Halide type: repository - resource: diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/android.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/android.md index b0fd394f05..7cdcae8788 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/android.md +++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/android.md @@ -1,21 +1,21 @@ --- # User change -title: "Integrating Halide into an Android (Kotlin) Project" +title: "Integrate Halide into an Android project with Kotlin" weight: 6 layout: "learningpathall" --- -## Objective -In this lesson, we’ll learn how to integrate a high-performance Halide image-processing pipeline into an Android application using Kotlin. +## What you'll build +In this section you'll integrate a high-performance Halide image-processing pipeline into an Android application using Kotlin. 
-## Overview of mobile integration with Halide +## Learn about mobile integration with Halide Android is the world’s most widely-used mobile operating system, powering billions of devices across diverse markets. This vast user base makes Android an ideal target platform for developers aiming to reach a broad audience, particularly in applications requiring sophisticated image and signal processing, such as augmented reality, photography, video editing, and real-time analytics. Kotlin, now the preferred programming language for Android development, combines concise syntax with robust language features, enabling developers to write maintainable, expressive, and safe code. It offers seamless interoperability with existing Java codebases and straightforward integration with native code via JNI, simplifying the development of performant mobile applications. -## Benefits of using Halide on mobile +## Explore the benefits of using Halide on mobile Integrating Halide into Android applications brings several key advantages: - Performance - Halide enables significant acceleration of complex image processing algorithms, often surpassing the speed of traditional Java or Kotlin implementations by leveraging optimized code generation. By generating highly optimized native code tailored for Arm CPUs or GPUs, Halide can dramatically increase frame rates and responsiveness, essential for real-time or interactive applications. - Efficiency - on mobile devices, resource efficiency translates directly to improved battery life and reduced thermal output. Halide's scheduling strategies (such as operation fusion, tiling, parallelization, and vectorization) minimize unnecessary memory transfers, CPU usage, and GPU overhead. This optimization substantially reduces overall power consumption, extending battery life and enhancing the user experience by preventing overheating. 
@@ -24,7 +24,7 @@ Integrating Halide into Android applications brings several key advantages: In short, Halide delivers high-performance image processing without sacrificing portability or efficiency, a balance particularly valuable on resource-constrained mobile devices. -### Android development ecosystem and challenges +### Navigate Android development challenges While Android presents abundant opportunities for developers, the mobile development ecosystem brings its own set of challenges, especially for performance-intensive applications: 1. Limited Hardware Resources. Unlike desktop or server environments, mobile devices have significant constraints on processing power, memory capacity, and battery life. Developers must optimize software meticulously to deliver smooth performance while carefully managing hardware resource consumption. Leveraging tools like Halide allows developers to overcome these constraints by optimizing computational workloads, making resource-intensive tasks feasible on constrained hardware. 2. Cross-Compilation Complexities. Developing native code for Android requires handling multiple hardware architectures (such as armv8-a, ARM64, and sometimes x86/x86_64). Cross-compilation introduces complexities due to different instruction sets, CPU features, and performance characteristics. Managing this complexity involves careful use of the Android NDK, understanding toolchains, and correctly configuring build systems (e.g., Gradle, CMake). Halide helps mitigate these issues by abstracting away many platform-specific optimizations, automatically generating code optimized for target architectures. @@ -37,8 +37,8 @@ Before integrating Halide into your Android application, ensure you have the nec 1. Android Studio. [Download link](https://developer.android.com/studio). 2. Android NDK (Native Development Kit). Can be easily installed from Android Studio (Tools → SDK Manager → SDK Tools → Android NDK). 
-## Setting up the Android project -### Creating the project +## Set up the Android project +### Create the project 1. Open Android Studio. 2. Select New Project > Native C++. ![img4](Figures/04.webp) @@ -152,8 +152,9 @@ dependencies { Click the Sync Now button at the top. To verify that everything is configured correctly, click Build > Make Project in Android Studio. -## UI -Now, you'll define the application's User Interface, consisting of two buttons and an ImageView. One button loads the image, the other processes it, and the ImageView displays both the original and processed images. +## Define the user interface +Define the application's user interface, consisting of two buttons and an ImageView. One button loads the image, the other processes it, and the ImageView displays both the original and processed images. + 1. Open the res/layout/activity_main.xml file, and modify it as follows: ```XML @@ -204,8 +205,8 @@ Now you can run the app to view the UI: ![img7](Figures/07.webp) -## Processing -You will now implement the image processing code. First, pick up an image you want to process. Here we use the camera man. Then, under the Arm.Halide.AndroidDemo/src/main create assets folder, and save the image under that folder as img.png. +## Implement image processing +Implement the image processing code. First, pick an image you want to process. This example uses the camera man image. Under Arm.Halide.AndroidDemo/src/main, create an assets folder and save the image as img.png. Now, open MainActivity.kt and modify it as follows: ```java @@ -330,9 +331,9 @@ class MainActivity : AppCompatActivity() { } ``` -This Kotlin Android application demonstrates integrating a Halide-generated image-processing pipeline within an Android app. The main activity (MainActivity) manages loading and processing an image stored in the application’s asset folder. +This Kotlin Android application demonstrates integrating a Halide-generated image-processing pipeline within an Android app. 
The main activity (MainActivity) manages loading and processing an image stored in the application's asset folder. -When the app launches, the Process Image button is disabled. When a user taps Load Image, the app retrieves img.png from its assets directory and displays it within the ImageView, simultaneously enabling the Process Image button for further interaction. +When the app launches, the app disables the Process Image button. When you tap Load Image, the app retrieves img.png from its assets directory and displays it within the ImageView, simultaneously enabling the Process Image button for further interaction. Upon pressing the Process Image button, the following sequence occurs: 1. Background Processing. A Kotlin coroutine initiates processing on a background thread, ensuring the application’s UI remains responsive. @@ -346,11 +347,11 @@ The code defines three utility methods: 2. extractGrayScaleBytes - converts a Bitmap into a grayscale byte array suitable for native processing. 3. createBitmapFromGrayBytes - converts a grayscale byte array back into a Bitmap for display purposes. -Note that performing the grayscale conversion in Halide allows us to exploit operator fusion, further improving performance by avoiding intermediate memory accesses. This could be done as in our examples before (processing-workflow). +Note that performing the grayscale conversion in Halide allows you to exploit operator fusion, further improving performance by avoiding intermediate memory accesses. You can do this as shown in the earlier processing-workflow examples. The JNI integration occurs through an external method declaration, blurThresholdImage, loaded via the companion object at app startup. The native library (armhalideandroiddemo) containing this function is compiled separately and integrated into the application (native-lib.cpp). -You will now need to create blurThresholdImage function. 
To do so, in Android Studio put the cursor above blurThresholdImage function, and then click Create JNI function for blurThresholdImage: +Create the blurThresholdImage function. In Android Studio, put the cursor above blurThresholdImage function, and then select Create JNI function for blurThresholdImage: ![img8](Figures/08.webp) This will generate a new function in the native-lib.cpp: @@ -404,7 +405,7 @@ This C++ function acts as a bridge between Java (Kotlin) and native code. Specif The input Java byte array (input_bytes) is accessed and pinned into native memory via GetByteArrayElements. This provides a direct pointer (inBytes) to the grayscale data sent from Kotlin. The raw grayscale byte data is wrapped into a Halide::Runtime::Buffer object (inputBuffer). This buffer structure is required by the Halide pipeline. An output buffer (outputBuffer) is created with the same dimensions as the input image. This buffer will store the result produced by the Halide pipeline. The native function invokes the Halide-generated AOT function blur_threshold, passing in both the input and output buffers. After processing, a new Java byte array (outputArray) is allocated to hold the processed grayscale data. The processed data from the Halide output buffer is copied into this Java array using SetByteArrayRegion. The native input buffer (inBytes) is explicitly released using ReleaseByteArrayElements, specifying JNI_ABORT as no changes were made to the input array. Finally, the processed byte array (outputArray) is returned to Kotlin. -Through this JNI bridge, Kotlin can invoke high-performance native code. You can now re-run the application. Click the Load Image button, and then Process Image. You will see the following results: +Through this JNI bridge, Kotlin can invoke high-performance native code. You can now re-run the application. Select the Load Image button, and then Process Image. 
You'll see the following results: ![img9](Figures/09.png) ![img10](Figures/10.png) @@ -416,4 +417,4 @@ jobject outputBuffer = env->NewDirectByteBuffer(output.data(), width * height); ``` ## Summary -In this lesson, we’ve successfully integrated a Halide image-processing pipeline into an Android application using Kotlin. We started by setting up an Android project configured for native development with the Android NDK, employing Kotlin as the primary language. We then integrated Halide-generated static libraries and demonstrated their usage through Java Native Interface (JNI), bridging Kotlin and native code. This equips developers with the skills needed to harness Halide's capabilities for building sophisticated, performant mobile applications on Android. \ No newline at end of file +You've successfully integrated a Halide image-processing pipeline into an Android application using Kotlin. You started by setting up an Android project configured for native development with the Android NDK, using Kotlin as the primary language. You then integrated Halide-generated static libraries and demonstrated their usage through Java Native Interface (JNI), bridging Kotlin and native code. You now have the skills needed to harness Halide's capabilities for building sophisticated, performant mobile applications on Android. 
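For reference, the transformation that the AOT-compiled pipeline applies to the grayscale bytes can be sketched in plain C++ (illustrative only, with clamped borders assumed; in the app, the real work is done by the Halide-generated function behind the JNI bridge):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Plain C++ stand-in for the pipeline invoked through the JNI bridge:
// a clamp-at-edges 3x3 box blur over grayscale bytes, followed by a
// binary threshold at 128.
std::vector<uint8_t> blur_threshold_ref(const std::vector<uint8_t>& in,
                                        int width, int height) {
    std::vector<uint8_t> out(width * height);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int sum = 0;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    // Clamp neighbor coordinates so border pixels stay in range.
                    int cx = std::clamp(x + dx, 0, width - 1);
                    int cy = std::clamp(y + dy, 0, height - 1);
                    sum += in[cy * width + cx];
                }
            }
            out[y * width + x] = (sum / 9) > 128 ? 255 : 0;
        }
    }
    return out;
}
```

A reference implementation like this is also handy for unit-testing the native path: run both on the same byte array and compare outputs.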
\ No newline at end of file diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md index 0d2166744c..8ec662cb62 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md +++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md @@ -7,16 +7,16 @@ weight: 5 layout: "learningpathall" --- -## Ahead-of-time and cross-compilation -One of Halide's standout features is the ability to compile image processing pipelines ahead-of-time (AOT), enabling developers to generate optimized binary code on their host machines rather than compiling directly on target devices. This AOT compilation process allows developers to create highly efficient libraries that run effectively across diverse hardware without incurring the runtime overhead associated with just-in-time (JIT) compilation. +## Learn about ahead-of-time (AOT) and cross-compilation +One of Halide's standout features is the ability to compile image processing pipelines ahead-of-time (AOT), enabling you to generate optimized binary code on your host machine rather than compiling directly on target devices. This AOT compilation process enables you to create highly efficient libraries that run effectively across diverse hardware without incurring the runtime overhead associated with just-in-time (JIT) compilation. -Halide also supports robust cross-compilation capabilities. Cross-compilation means using the host version of Halide, typically running on a desktop Linux or macOS system—to target different architectures, such as ARM for Android devices. Developers can thus optimize Halide pipelines on their host machine, produce libraries specifically optimized for Android, and integrate them seamlessly into Android applications. 
The generated pipeline code includes essential optimizations and can embed minimal runtime support, further reducing workload on the target device and ensuring responsiveness and efficiency.
+Halide also supports robust cross-compilation capabilities. Cross-compilation means using the host version of Halide, typically running on a desktop Linux or macOS system, to target different architectures, such as Arm for Android devices. You can optimize Halide pipelines on your host machine, produce libraries specifically optimized for Android, and integrate them seamlessly into Android applications. The generated pipeline code includes essential optimizations and can embed minimal runtime support, further reducing workload on the target device and ensuring responsiveness and efficiency.
-## Objective
+## What you'll build
In this section, you'll leverage the host version of Halide to perform AOT compilation of an image processing pipeline via cross-compilation. The resulting pipeline library is specifically tailored to Android devices (targeting, for instance, arm64-v8a ABI), while the compilation itself occurs entirely on the host system. This approach significantly accelerates development by eliminating the need to build Halide or perform JIT compilation on Android devices. It also guarantees that the resulting binaries are optimized for the intended hardware, streamlining the deployment of high-performance image processing applications on mobile platforms.
## Prepare pipeline for Android
-The procedure implemented in the following code demonstrates how Halide's AOT compilation and cross-compilation features can be utilized to create an optimized image processing pipeline for Android. Run Halide on your host machine (in this example, macOS) to generate a static library containing the pipeline function, which will later be invoked from an Android device. Below is a step-by-step explanation of this process.
+The following code demonstrates how to use Halide's AOT compilation and cross-compilation features to create an optimized image processing pipeline for Android. Run Halide on your host machine (in this example, macOS) to generate a static library containing the pipeline function, which you'll later invoke from an Android device. Below is a step-by-step explanation of this process. Create a new file named blur-android.cpp with the following contents: @@ -99,14 +99,14 @@ target.bits = 64; // Enable Halide runtime inclusion in the generated library (needed if not linking Halide runtime separately). target.set_feature(Target::NoRuntime, false); -// Optionally, enable hardware-specific optimizations to improve performance on ARM devices: -// - DotProd: Optimizes matrix multiplication and convolution-like operations on ARM. +// Optionally, enable hardware-specific optimizations to improve performance on Arm devices: +// - DotProd: Optimizes matrix multiplication and convolution-like operations on Arm. // - ARMFp16 (half-precision floating-point operations). ``` Notes: 1. NoRuntime — When set to true, Halide excludes its runtime from the generated code, and you must link the runtime manually during the linking step. When set to false, the Halide runtime is included in the generated library, which simplifies deployment. -2. ARMFp16 — Enables the use of ARM hardware support for half-precision (16-bit) floating-point operations, which can provide faster execution when reduced precision is acceptable. +2. ARMFp16 — Enables the use of Arm hardware support for half-precision (16-bit) floating-point operations, which improves execution speed when reduced precision is acceptable. 3. 
Why the runtime choice matters - If your app links several AOT-compiled pipelines, ensure there is exactly one Halide runtime at link time: * Strategy A (cleanest): build all pipelines with NoRuntime ON and link a single standalone Halide runtime once (matching the union of features you need, for example, Vulkan/OpenCL/Metal or Arm options). * Strategy B: embed the runtime in exactly one pipeline (leave NoRuntime OFF only there); compile all other pipelines with NoRuntime ON. @@ -120,7 +120,7 @@ Simple scheduling directives (compute_root) instruct Halide to compute intermedi This strategy can simplify debugging by clearly isolating computational steps and may enhance runtime efficiency by explicitly controlling intermediate storage locations. -By clearly separating algorithm logic from scheduling, developers can easily test and compare different scheduling strategies,such as compute_inline, compute_root, compute_at, and more, without modifying their fundamental algorithmic code. This separation significantly accelerates iterative optimization and debugging processes, ultimately yielding better-performing code with minimal overhead. +By clearly separating algorithm logic from scheduling, you can easily test and compare different scheduling strategies, such as compute_inline, compute_root, compute_at, and more, without modifying your fundamental algorithmic code. This separation significantly accelerates iterative optimization and debugging processes, ultimately yielding better-performing code with minimal overhead. Halide's AOT compilation function compile_to_static_library generates a static library (.a) containing the optimized pipeline and a corresponding header file (.h). 
@@ -141,7 +141,7 @@ These generated files are then ready to integrate directly into an Android proje JNI (Java Native Interface) is a framework that allows Java (or Kotlin) code running in a Java Virtual Machine (JVM), such as on Android, to interact with native applications and libraries written in languages like C or C++. JNI bridges the managed Java/Kotlin environment and the native, platform-specific implementations. -## Compilation instructions +## Compile the pipeline To compile the pipeline-generation program on your host system, use the following commands (replace /path/to/halide with your Halide installation directory): ```console export DYLD_LIBRARY_PATH=/path/to/halide/lib/libHalide.19.dylib @@ -160,7 +160,7 @@ This will produce two files: * blur_threshold_android.a: The static library containing your Halide pipeline. * blur_threshold_android.h: The header file needed to invoke the generated pipeline. -We will integrate these files into our Android project in the following section. +You'll integrate these files into the Android project in the following section. ## Summary -In this section, we’ve explored Halide's powerful ahead-of-time (AOT) and cross-compilation capabilities, preparing an optimized image processing pipeline tailored specifically for Android devices. By using the host-based Halide compiler, we’ve generated a static library optimized for ARM64 Android architecture, incorporating safe boundary conditions, neighborhood-based blurring, and thresholding operations. This streamlined process allows seamless integration of highly optimized native code into Android applications, ensuring both development efficiency and runtime performance on mobile platforms. \ No newline at end of file +You've explored Halide's powerful ahead-of-time (AOT) and cross-compilation capabilities, preparing an optimized image processing pipeline tailored specifically for Android devices. 
By using the host-based Halide compiler, you generated a static library optimized for 64-bit Arm Android architecture, incorporating safe boundary conditions, neighborhood-based blurring, and thresholding operations. This streamlined process allows seamless integration of highly optimized native code into Android applications, ensuring both development efficiency and runtime performance on mobile platforms. \ No newline at end of file diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md index a11c7cb396..d8e3238ee9 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md +++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md @@ -1,16 +1,16 @@ --- # User change -title: "Demonstrating Operation Fusion" +title: "Apply operator fusion in Halide for real-time image processing" weight: 4 layout: "learningpathall" --- -## Objective -In the previous section, you explored parallelization and tiling. Here, you will focus on operator fusion (inlining) in Halide, that is, letting producers be computed directly inside their consumers—versus materializing intermediates with compute_root() or compute_at(). You will learn when fusion reduces memory traffic and when materializing saves recomputation (for example, for large stencils or multi-use intermediates). You will inspect loop nests with print_loop_nest(), switch among schedules (fuse-all, fuse-blur-only, materialize, tile-and-materialize-per-tile) in a live camera pipeline, and measure the impact (ms/FPS/MPix/s). +## What you'll build +In the previous section, you explored parallelization and tiling. Here, you'll focus on operator fusion (inlining) in Halide, that is, letting producers be computed directly inside their consumers—versus materializing intermediates with compute_root() or compute_at(). 
You'll learn when fusion reduces memory traffic and when materializing saves recomputation (for example, for large stencils or multi-use intermediates). You'll inspect loop nests with print_loop_nest(), switch among schedules (fuse-all, fuse-blur-only, materialize, tile-and-materialize-per-tile) in a live camera pipeline, and measure the impact (ms/FPS/MPix/s). -This section doesn't cover loop fusion (the fuse directive). You will focus on operator fusion, which is Halide's default behavior. +This section doesn't cover loop fusion (the fuse directive). You'll focus instead on operator fusion, which is Halide's default behavior. ## Code To demonstrate how fusion in Halide works create a new file `camera-capture-fusion.cpp`, and modify it as follows. This code uses a live camera pipeline (BGR → gray → 3×3 blur → threshold), adds a few schedule variants to toggle operator fusion vs. materialization, and print ms / FPS / MPix/s. So you can see the impact immediately. @@ -257,10 +257,10 @@ Next comes the gray conversion. As in previous section, use Rec.601 weights and Then add a threshold stage. Pixels above 128 become white, and all others black, producing a binary image. Finally, define an output Func that wraps the thresholded result and call compute_root() on it so that it will be realized explicitly when you run the pipeline. Now comes the interesting part: the scheduling choices. Depending on the Schedule enum passed in, you instruct Halide to either fuse everything (the default), materialize some intermediates, or even tile the output. - * Simple: Here you will explicitly compute and store both gray and blur across the whole frame with compute_root(). This makes them easy to reuse or parallelize, but requires extra memory traffic. + * Simple: Here you'll explicitly compute and store both gray and blur across the whole frame with compute_root(). This makes them easy to reuse or parallelize, but requires extra memory traffic. 
* FuseBlurAndThreshold: You compute gray once as a planar buffer, but leave blur and thresholded fused into output. This often works well when the input is interleaved, because subsequent stages read from a planar gray. - * FuseAll: You will apply no scheduling to producers, so gray, blur, and thresholded are all inlined into output. This minimizes memory usage but can recompute gray many times inside the 3×3 stencil. - * Tile: You will split the output into 64×64 tiles. Within each tile, we materialize gray (compute_at(output, xo)), so the working set is small and stays in cache. blur remains fused within each tile. + * FuseAll: You'll apply no scheduling to producers, so gray, blur, and thresholded are all inlined into output. This minimizes memory usage but can recompute gray many times inside the 3×3 stencil. + * Tile: You'll split the output into 64×64 tiles. Within each tile, you materialize gray (compute_at(output, xo)), so the working set is small and stays in cache. blur remains fused within each tile. To help you examine what’s happening, print the loop nest Halide generates for each schedule using print_loop_nest(). This will give you a clear view of how fusion or materialization changes the structure of the computation. @@ -399,7 +399,7 @@ Simple | 6.01 ms | 166.44 FPS | 345.12 MPix/s15 MPix/s ``` The console output combines two kinds of information: -1. Loop nests – printed by print_loop_nest(). These show how Halide actually arranges the computation for the chosen schedule. They are a great “x-ray” view of fusion and materialization: +1. Loop nests – printed by print_loop_nest(). These show how Halide actually arranges the computation for the chosen schedule. They're a great "x-ray" view of fusion and materialization: * In FuseAll, the loop nest contains only output. That’s because gray, blur, and thresholded are all inlined (fused) into it. Each pixel of output recomputes its 3×3 neighborhood of gray. 
* In FuseBlurAndThreshold, there is an extra loop for gray, because we explicitly called gray.compute_root(). The blur and thresholded stages are still fused into output. This reduces recomputation of gray and makes downstream loops simpler to vectorize.
* In Simple, both gray and blur have their own loop nests, and thresholded fuses into output. This introduces two extra buffers, but each stage is computed once and can be parallelized independently.
@@ -422,8 +422,8 @@ By toggling schedules live, you can see and measure how operator fusion and mate
This demo makes these trade-offs concrete: the loop nest diagrams explain the structure, and the live FPS/MPix/s stats show the real performance impact.
-## What “fusion” means in Halide
-One of Halide's defining features is that, by default, it performs operator fusion, also called inlining. This means that if a stage produces some intermediate values, those values aren’t stored in a separate buffer and then re-read later—instead, the stage is computed directly inside the consumer’s loop. In other words, unless you tell Halide otherwise, every producer Func is fused into the next stage that uses it.
+## What "fusion" means in Halide
+One of Halide's defining features is that, by default, it performs operator fusion, also called inlining. This means that if a stage produces some intermediate values, those values aren't stored in a separate buffer and then re-read later—instead, the stage is computed directly inside the consumer's loop. In other words, unless you tell Halide otherwise, every producer Func is fused into the next stage that uses it.
Why is this important? Fusion reduces memory traffic, because Halide doesn’t need to write intermediates out to RAM and read them back again. On CPUs, where memory bandwidth is often the bottleneck, this can be a major performance win. Fusion also improves cache locality, since values are computed exactly where they are needed and the working set stays small.
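The recomputation side of this trade-off can be made concrete with a small instrumented sketch in plain C++ (a 1-D, 3-tap blur; the names are illustrative, not Halide API):

```cpp
#include <vector>

// Counts producer evaluations under two schedules for a 3-tap blur.
struct FusionDemo {
    std::vector<int> input;
    long evals = 0;

    // The producer stage (think "gray"): clamp at the edges, then scale.
    int producer(int x) {
        ++evals;
        int n = static_cast<int>(input.size());
        x = x < 0 ? 0 : (x >= n ? n - 1 : x);
        return input[x] * 2;
    }

    // Fused/inlined: the producer is recomputed inside the consumer's stencil,
    // so each input point is evaluated up to three times.
    std::vector<int> fused() {
        std::vector<int> out(input.size());
        for (int x = 0; x < static_cast<int>(input.size()); ++x)
            out[x] = (producer(x - 1) + producer(x) + producer(x + 1)) / 3;
        return out;
    }

    // Materialized (compute_root-style): the producer is computed once per
    // point into a buffer, trading recomputation for the memory traffic of tmp.
    std::vector<int> materialized() {
        int n = static_cast<int>(input.size());
        std::vector<int> tmp(n), out(n);
        for (int x = 0; x < n; ++x) tmp[x] = producer(x);
        for (int x = 0; x < n; ++x) {
            int l = tmp[x > 0 ? x - 1 : 0];
            int r = tmp[x < n - 1 ? x + 1 : n - 1];
            out[x] = (l + tmp[x] + r) / 3;
        }
        return out;
    }
};
```

Both schedules return identical results; only the evaluation count, and the presence of the tmp buffer, differs. That is exactly the trade-off Halide's scheduling primitives let you control without touching the algorithm.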
The trade-off, however, is that fusion can cause recomputation: if a consumer uses a neighborhood (like a blur that reads 3×3 or 9×9 pixels), the fused producer may be recalculated multiple times for overlapping regions. Whether fusion is faster depends on the balance between compute cost and memory traffic. @@ -444,7 +444,7 @@ for y: for x: out(x,y) = threshold( sum kernel * gray(x+i,y+j) ) The fused version eliminates buffer writes but recomputes gray under the blur stencil. The materialized version performs more memory operations but avoids recomputation, and also provides a clean point to parallelize or vectorize the gray stage. -It's worth noting that Halide also supports a loop fusion directive (fuse) that merges two loop variables together. That's a different concept and not the focus here. In this tutorial, the focus is specifically on operator fusion—the decision of whether to inline or materialize stages. +Note that Halide also supports a loop fusion directive (fuse) that merges two loop variables together. That's a different concept and not the focus here. This tutorial focuses specifically on operator fusion—the decision of whether to inline or materialize stages. ## How this looks in the live camera demo The pipeline is: BGR input → gray → 3×3 blur → thresholded → output. Depending on the schedule, different kinds of fusion are shown: diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md index b6dfecb724..78913a277b 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md +++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md @@ -9,35 +9,35 @@ layout: "learningpathall" ## What is Halide? -Halide is a powerful, open-source programming language designed to simplify and optimize high-performance image and signal processing. 
Researchers at MIT and Adobe developed Halide in 2012 to address a critical challenge: efficiently running image-processing algorithms on different hardware architectures without extensive manual tuning. Halide separates the description of an algorithm (the mathematical or logical transformations applied to images or signals) from its schedule (how and where those computations execute). This design enables rapid experimentation and effective optimization for various platforms, including CPUs, GPUs, and mobile hardware. +Halide is a powerful, open-source programming language designed to simplify and optimize high-performance image and signal processing. Researchers at MIT and Adobe developed Halide in 2012 to address the challenge of efficiently running image-processing algorithms on different hardware architectures without extensive manual tuning. -A key advantage of Halide lies in its innovative programming model. By distinguishing between algorithmic logic and scheduling decisions—such as parallelism, vectorization, memory management, and hardware-specific optimizations—you can first focus on ensuring the correctness of your algorithms. Performance tuning can then be handled independently, accelerating development cycles. This approach often yields performance that matches or even surpasses manually optimized code. As a result, Halide has seen widespread adoption across industry and academia, powering image processing systems at organizations such as Google, Adobe, and Facebook, and enabling advanced computational photography features used by millions daily. +Halide's programming model separates algorithmic logic from scheduling decisions, including parallelism, vectorization, memory management, and hardware-specific optimizations. This lets you focus on correctness first and tune performance independently, often achieving results that rival or exceed manually optimized code. 
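The algorithm/schedule split can be mimicked in plain C++ (illustrative, not Halide syntax): the per-pixel rule stays fixed, while the loop structure, which plays the role of Halide's schedule, changes freely without affecting the result:

```cpp
#include <vector>

// The "algorithm": a pure per-pixel rule, independent of execution order.
int pixel_rule(int x, int y) { return x + 2 * y; }

// Schedule 1: plain row-major traversal.
std::vector<int> row_major(int w, int h) {
    std::vector<int> out(w * h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            out[y * w + x] = pixel_rule(x, y);
    return out;
}

// Schedule 2: tiled traversal for cache locality. The algorithm is untouched;
// only the loop nest changes, so the result is bit-identical.
std::vector<int> tiled(int w, int h, int tile) {
    std::vector<int> out(w * h);
    for (int ty = 0; ty < h; ty += tile)
        for (int tx = 0; tx < w; tx += tile)
            for (int y = ty; y < ty + tile && y < h; ++y)
                for (int x = tx; x < tx + tile && x < w; ++x)
                    out[y * w + x] = pixel_rule(x, y);
    return out;
}
```

In Halide, these loop-structure decisions are single scheduling directives (for example, tile, vectorize, parallel) applied to an unchanged algorithm definition.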
In this Learning Path, you'll explore Halide's foundational concepts, set up your development environment, and create your first functional Halide application. By the end, you'll understand what makes Halide uniquely suited to efficient image processing, particularly on mobile and Arm-based hardware, and be ready to build your own optimized pipelines.
-For broader use cases, see the official Halide documentation and tutorials at [the Halide website](https://halide-lang.org).
+For broader use cases, see the official Halide documentation and tutorials on [the Halide website](https://halide-lang.org).
-The example code for this Learning Path is available in two GitHub repositories: [Arm.Halide.Hello-World GitHub repository](https://github.com/dawidborycki/Arm.Halide.Hello-World.git) and [Arm.Halide.AndroidDemo GitHub repository](https://github.com/dawidborycki/Arm.Halide.AndroidDemo.git).
+You can find the example code for this Learning Path in two GitHub repositories: [Arm.Halide.Hello-World GitHub repository](https://github.com/dawidborycki/Arm.Halide.Hello-World.git) and [Arm.Halide.AndroidDemo GitHub repository](https://github.com/dawidborycki/Arm.Halide.AndroidDemo.git).
## Key concepts in Halide
-Before building your first Halide application, you need to understand three foundational concepts that make Halide powerful for image processing:
+Before building your first Halide application, you need to understand the foundational concepts that make Halide powerful for image processing: separating algorithms from schedules, using symbolic building blocks, and applying scheduling strategies.
-- Separating algorithms from schedules
-- Using symbolic building blocks
-- Applying scheduling strategies
+These concepts work together to enable high-performance code that's both readable and portable across different hardware architectures, including Arm processors.
-These concepts work together to enable high-performance code that's both readable and portable across different hardware architectures. +## Separation of algorithm and schedule -### Separation of algorithm and schedule +Halide's core design principle separates algorithms from schedules. Traditional image-processing code tightly couples algorithmic logic with execution strategy, complicating optimization and portability. -Halide's core design principle separates algorithms from schedules. Traditional image-processing code tightly couples algorithmic logic with execution strategy, complicating optimization and portability. Halide distinguishes these two components: +Halide distinguishes these two components: -**Algorithm** defines what computations are performed (for example, image filters, pixel transformations, or mathematical operations on image data). +- **Algorithm** defines what computations are performed, such as image filters, pixel transformations, or mathematical operations on image data. -**Schedule** specifies how and where these computations execute, including parallel execution, memory usage, caching strategies, and hardware-specific optimizations. +- **Schedule** specifies how and where these computations execute, including parallel execution, memory usage, caching strategies, and hardware-specific optimizations. -This separation lets you experiment and optimize code for different hardware architectures without changing the core algorithmic logic. +This separation enables you to experiment and optimize code for different hardware architectures without changing the core algorithmic logic. + +## Halide building blocks Halide provides three key building blocks to structure image processing algorithms: @@ -49,45 +49,43 @@ Halide::Func brighter("brighter"); brighter(x, y, c) = Halide::cast(Halide::min(input(x, y, c) + 50, 255)); ``` -**Functions (Func)** represent individual computational steps or image operations. 
Each `Func` encapsulates an expression applied to pixels, enabling concise definition of complex tasks. +- **Functions (Func)** represent individual computational steps or image operations. Each `Func` encapsulates an expression applied to pixels, enabling concise definition of complex tasks. -**Vars** symbolically represent spatial coordinates or dimensions (for example, horizontal x, vertical y, color channel c), specifying where computations are applied. +- **Vars** symbolically represent spatial coordinates or dimensions (for example, horizontal x, vertical y, color channel c), specifying where computations are applied. -**Pipelines** are formed by connecting multiple `Func` objects, creating a workflow where each stage's output feeds into subsequent stages. +- **Pipelines** are formed by connecting multiple `Func` objects, creating a workflow where each stage's output feeds into subsequent stages. Halide is a domain-specific language (DSL) tailored for image and signal processing. It provides predefined operations and building blocks optimized for expressing complex pipelines. By abstracting common computational patterns, Halide lets you define processing logic concisely, facilitating readability, maintainability, and optimization across hardware targets. -### Scheduling strategies +## Scheduling strategies Halide offers several powerful scheduling strategies for maximum performance: -- Parallelism - executes computations concurrently across multiple CPU cores, reducing execution time for large datasets. +- Parallelism - executes computations concurrently across multiple CPU cores, reducing execution time for large datasets -- Vectorization - enables simultaneous processing of multiple data elements using SIMD (Single Instruction, Multiple Data) instructions, enhancing performance on CPUs and GPUs. 
+- Vectorization - enables simultaneous processing of multiple data elements using SIMD (Single Instruction, Multiple Data) instructions, such as Arm NEON, enhancing performance on Arm CPUs and GPUs -- Tiling divides computations into smaller blocks optimized for cache efficiency, improving memory locality and reducing transfer overhead. +- Tiling divides computations into smaller blocks optimized for cache efficiency, improving memory locality and reducing transfer overhead -Combining these techniques achieves optimal performance tailored to your target hardware architecture. +You can combine these techniques to achieve optimal performance tailored to your target hardware architecture. -Beyond manual scheduling, Halide provides an Autoscheduler that automatically generates optimized schedules for specific hardware architectures, simplifying performance optimization. +Beyond manual scheduling, Halide provides an Autoscheduler that automatically generates optimized schedules for specific hardware architectures, including Arm-based systems, simplifying performance optimization. -## System requirements and environment setup +## Set up your environment To start developing with Halide, your system needs to meet several requirements. -### Installation options - You can set up Halide using one of two approaches: -**Pre-built binaries** are convenient, quick to install, and suitable for most users on standard platforms (Windows, Linux, macOS). This approach is recommended for typical use cases. +- **Pre-built binaries** are convenient, quick to install, and suitable for most users on standard platforms (Windows, Linux, macOS). This approach is recommended for typical use cases. -**Building from source** is required when pre-built binaries aren't available for your environment, or if you want to experiment with the latest Halide features or LLVM versions under active development. This method requires familiarity with build systems. 
+- **Building from source** is required when pre-built binaries aren't available for your environment, or if you want to experiment with the latest Halide features or LLVM versions under active development. This method requires familiarity with build systems.

-To use pre-built binaries:
+To use pre-built binaries, follow these steps:

-1. Visit the official Halide [releases page](https://github.com/halide/Halide/releases). As of this writing, the latest version is v19.0.0.
-2. Download and unzip the binaries to a convenient location (for example, `/usr/local/halide` on Linux/macOS or `C:\halide` on Windows).
-3. Set environment variables to simplify usage:
+- Visit [the official Halide releases page](https://github.com/halide/Halide/releases). This Learning Path was tested with version v19.0.0.
+- Download and unzip the binaries to a convenient location (for example, `/usr/local/halide` on Linux/macOS or `C:\halide` on Windows).
+- Set environment variables to simplify usage:
 ```console
 export HALIDE_DIR=/path/to/halide
 export PATH=$HALIDE_DIR/bin:$PATH
@@ -182,13 +180,13 @@ int main() {
 
 This program demonstrates how to combine Halide's image processing capabilities with OpenCV's image I/O and display functionality. It begins by loading an image from disk using OpenCV, specifically reading from a static file named `img.png` (here you use a Cameraman image). Since OpenCV loads images in BGR (Blue-Green-Red) format by default, the code immediately converts the image to RGB (Red-Green-Blue) format so that it's compatible with Halide's expectations.
 
-Once the image is loaded and converted, the program wraps the raw image data into a Halide buffer, capturing the image's dimensions (width, height, and color channels). Next, the Halide pipeline is defined through a function named invert, which specifies the computations to perform on each pixel—in this case, subtracting the original pixel value from 255 to invert the colors.
The pipeline definition alone doesn't perform any actual computation; it only describes what computations should occur and how to schedule them. +Once the image is loaded and converted, the program wraps the raw image data into a Halide buffer, capturing the image's dimensions (width, height, and color channels). Next, the Halide pipeline is defined through a function named *invert*, which specifies the computations to perform on each pixel—in this case, subtracting the original pixel value from 255 to invert the colors. The pipeline definition alone doesn't perform any actual computation; it only describes what computations should occur and how to schedule them. The actual computation occurs when the pipeline is executed with the call to invert.realize(...). This is the step that processes the input image according to the defined pipeline and produces an output Halide buffer. The scheduling directive (invert.reorder(c, x, y)) ensures that pixel data is computed in an interleaved manner (channel-by-channel per pixel), aligning the resulting data with OpenCV’s expected memory layout for images. Finally, the processed Halide output buffer is wrapped in an OpenCV `Mat` header without copying pixel data. For proper display in OpenCV, which uses BGR (Blue-Green-Red) channel ordering by default, the code converts the processed image back from RGB to BGR. The program then displays the original and inverted images in separate windows, waiting for a key press before exiting. This approach demonstrates integration between Halide for high-performance image processing and OpenCV for convenient input and output operations. -By default, Halide orders loops based on the order of variable declaration. In this example, the original ordering (x, y, c) implies processing the image pixel-by-pixel across all horizontal positions (x), then vertical positions (y), and finally channels (c). 
This ordering naturally produces a planar memory layout (e.g., processing all red pixels first, then green, then blue). +By default, Halide orders loops based on the order of variable declaration. In this example, the original ordering (x, y, c) implies processing the image pixel-by-pixel across all horizontal positions (x), then vertical positions (y), and finally channels (c). This ordering naturally produces a planar memory layout (for example, processing all red pixels first, then green, then blue). However, the optimal loop order depends on your intended memory layout and compatibility with external libraries: @@ -212,11 +210,11 @@ Buffer inputBuffer = Buffer::make_interleaved( **Planar layout (RRR...GGG...BBB...)** is preferred by certain image-processing routines or hardware accelerators (for example, some GPU kernels or ML frameworks). This is achieved naturally by Halide's default loop ordering (x, y, c). -Select loop ordering based on your data format requirements and integration scenario. Halide provides full flexibility, letting you reorder loops to match the desired memory layout efficiently. +Choose your loop ordering based on how your image data is stored and which libraries you use. Halide lets you control loop order for both performance and compatibility. -In Halide, distinguish two distinct concepts: +Halide separates two important ideas: -**Loop execution order** (controlled by `reorder`) defines the nesting order of loops during computation. For example, to make the channel dimension (c) innermost during computation: +**Loop execution order** — Use `reorder` to set the order in which loops run during computation. 
For example, making the channel (`c`) the innermost loop helps match interleaved layouts like OpenCV's HWC format: ```cpp invert.reorder(c, x, y); @@ -255,11 +253,11 @@ You'll see two windows displaying the original and inverted images: ![Original color photograph of a cameraman on the left showing a person operating a professional camera, and inverted version on the right with reversed colors where the subject appears in negative](Figures/01.png) ![Two side-by-side terminal windows showing compilation and execution of the Halide hello-world program, with the left window displaying g++ compilation commands and library paths, and the right window showing successful program execution with OpenCV window initialization messages](Figures/02.png) -## Summary +## What you've accomplished and what's next -You've learned Halide's foundational concepts, explored the benefits of separating algorithms and schedules, set up your development environment, and created your first functional Halide application integrated with OpenCV. +You've learned Halide's foundational concepts, explored the benefits of separating algorithms and schedules, set up your development environment, and created your first functional Halide application integrated with OpenCV for Arm development. While the example introduces the core concepts of Halide pipelines (such as defining computations symbolically and realizing them), it doesn't yet showcase the benefits of separating algorithm definition from scheduling strategies. -In subsequent sections, you'll explore advanced Halide scheduling techniques, including parallelism, vectorization, tiling, and loop fusion, which demonstrate the practical advantages of separating algorithm logic from scheduling. These techniques enable fine-grained performance optimization tailored to specific hardware without modifying algorithmic correctness. 
+In subsequent sections, you'll explore advanced Halide scheduling techniques, including parallelism, vectorization, tiling, and loop fusion, which demonstrate the practical advantages of separating algorithm logic from scheduling. These techniques enable fine-grained performance optimization tailored to Arm processors and other hardware without modifying algorithmic correctness.
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md
index c82d8d7a73..79d905fde3 100644
--- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md
+++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md
@@ -9,12 +9,17 @@ layout: "learningpathall"
 
 ## What you'll build
 
-In this section, you will build a real-time camera processing pipeline using Halide. First, you will capture video frames from a webcam using OpenCV, then implement a Gaussian (binomial) blur to smooth the captured images, followed by thresholding to create a clear binary output highlighting prominent image features.
+In this section, you will build a real-time camera processing pipeline using Halide:
+
+- First, you will capture video frames from a webcam using OpenCV, apply a Gaussian (binomial) blur to smooth the captured images, and then threshold the result to create a clear binary output highlighting prominent image features.
+
+- Next, you will measure performance and explore Halide's scheduling options: parallelization and tiling. Each technique improves throughput in a different way.
 
-You will then measure performance and explore Halide's scheduling options—parallelization and tiling—to see how each improves throughput.
## Implement Gaussian blur and thresholding -Create a new `camera-capture.cpp` file and modify it as follows: + +To get started, create a new `camera-capture.cpp` file and copy and paste in the contents below: + ```cpp #include "Halide.h" #include @@ -143,7 +148,7 @@ Func inputClamped = BoundaryConditions::repeat_edge(input); Any out-of-bounds access replicates the nearest edge pixel. This makes the boundary policy obvious, keeps expressions clean, and ensures all downstream stages behave consistently at the edges. -Grayscale conversion happens inside Halide using Rec.601 weights. Read B, G, R from the interleaved input and compute luminance: +Halide converts the image to grayscale using Rec.601 weights. Read B, G, R from the interleaved input and compute luminance: ```cpp // Grayscale (Rec.601) @@ -154,7 +159,7 @@ gray(x, y) = cast(0.114f * inputClamped(x, y, 0) + // B 0.299f * inputClamped(x, y, 2)); // R ``` -Next, the pipeline applies a Gaussian-approximate (binomial) blur using a fixed 3×3 kernel. For this Learning Path, implement it with small loops and 16-bit accumulation for safety: +Next, the pipeline applies a Gaussian-approximate (binomial) blur using a fixed 3×3 kernel. Implement it with small loops and 16-bit accumulation for safety: ```cpp Func blur("blur"); @@ -166,10 +171,7 @@ for (int j = 0; j < 3; ++j) blur(x, y) = cast(sum / 16); ``` -Why this kernel? -* It provides effective smoothing while remaining computationally lightweight. -* The weights approximate a Gaussian distribution, which reduces noise but preserves edges better than a box filter. -* This is mathematically a binomial filter, a standard and efficient approximation of Gaussian blurring. +This binomial kernel smooths images effectively while staying lightweight. Its weights closely match a Gaussian distribution, so it reduces noise but preserves edges better than a simple box filter. This makes it a fast and practical way to approximate Gaussian blur in real-time image processing. 
After the blur, the pipeline applies thresholding to produce a binary image. Explicitly cast constants to uint8_t to remove ambiguity and avoid redundant widen/narrow operations in generated code: @@ -179,9 +181,9 @@ Func output("output"); output(x, y) = select(blur(x, y) > T, cast(255), cast(0)); ``` -This simple but effective step emphasizes strong edges and regions of high contrast, often used as a building block in segmentation and feature extraction pipelines. +This step emphasizes strong edges and regions of high contrast, providing a building block for segmentation and feature extraction pipelines. -Halide generates the final output, which OpenCV then displays. Build the pipeline once (outside the capture loop), and then realized each frame: +Halide generates the final output, and OpenCV displays it. Build the pipeline once (outside the capture loop), and then realize each frame: ```cpp // Build the pipeline once (outside the capture loop) Buffer outBuf(width, height); @@ -214,12 +216,11 @@ The output should look similar to the figure below: ## Parallelization and tiling -In this section, you will explore two scheduling optimizations that Halide provides: parallelization and tiling. Each technique improves performance in a different way—parallelization uses multiple CPU cores, while tiling optimizes cache efficiency through better data locality. - -Now you will learn how to use each technique separately for clarity and to emphasize their distinct benefits. +In this section, you will explore two scheduling optimizations that Halide provides: parallelization and tiling. Each technique improves performance in a different way. Parallelization uses multiple CPU cores, while tiling optimizes cache efficiency through better data locality. +You will learn how to use each technique separately for clarity and to emphasize their distinct benefits. 
-### Measure baseline performance +### Establish baseline performance Before applying any scheduling optimizations, establish a measurable baseline. Create a second file, `camera-capture-perf-measurement.cpp`, that runs the same grayscale → blur → threshold pipeline but prints per-frame timing, FPS, and MPix/s around the Halide `realize()` call. This lets you quantify each optimization you add next (parallelization, tiling, caching). @@ -389,7 +390,7 @@ realize: 1.16 ms | 864.15 FPS | 1791.90 MPix/s The performance gain by parallelization depends on how many CPU cores are available for this application to occupy. -### Apply tiling for cache efficiency +## Apply tiling for cache efficiency Tiling is a scheduling technique that divides computations into smaller, cache-friendly blocks or tiles. This approach significantly enhances data locality, reduces memory bandwidth usage, and leverages CPU caches more efficiently. While tiling can also use parallel execution, its primary advantage comes from optimizing intermediate data storage. @@ -397,9 +398,9 @@ Tiling splits the image into cache-friendly blocks (tiles). Two wins: * Partitioning: tiles are easy to parallelize across cores. * Locality: when you cache intermediates per tile, you avoid refetching/recomputing data and hit CPU L1/L2 cache more often. -Now have a look at both flavors. +Explore both approaches. -### Cache intermediates per tile +## Cache intermediates per tile This approach caches gray once per tile so the 3×3 blur can reuse it instead of recomputing RGB to gray up to 9× per output pixel. This provides the best cache efficiency. @@ -430,17 +431,17 @@ Recompile your application as before, then run. Here's sample output: realize: 0.98 ms | 1023.15 FPS | 2121.60 MPix/s ``` -This is the fastest variant. Caching a planar grayscale per tile enables efficient reuse, which improves performance. +Caching the grayscale image for each tile gives the best performance. 
By storing the intermediate grayscale result in a tile-local buffer, Halide can reuse it efficiently during the blur step. This reduces redundant computations and makes better use of the CPU cache, resulting in faster processing. -### How to schedule -In general, there's no one-size-fits-all rule of scheduling to achieve the best performance as it depends on your pipeline characteristics and the target device architecture. It's recommended to explore the scheduling options and that's where Halide's scheduling API is purposed for. +## Choose a scheduling strategy +There isn't a universal scheduling strategy that guarantees the best performance for every pipeline or device. The optimal approach depends on your specific image-processing workflow and the Arm architecture you're targeting. Halide's scheduling API gives you the flexibility to experiment with parallelization, tiling, and caching. Try different combinations to see which delivers the highest throughput and efficiency for your application. -For example of this application: +For the example of this application: * Start with parallelizing the outer-most loop. * Add tiling + caching only if: there's a spatial filter, or the intermediate is reused by multiple consumers—and preferably after converting sources to planar (or precomputing a planar gray). * From there, tune tile sizes and thread count for your target. `HL_NUM_THREADS` is the environmental variable which allows you to limit the number of threads in-flight. -## Summary +## What you've accomplished and what's next In this section, you built a real-time Halide+OpenCV pipeline—grayscale, a 3×3 binomial blur, then thresholding—and instrumented it to measure throughput. Parallelization and tiling improved the performance. * Parallelization spreads independent work across CPU cores. 
From e0ecee0c993f26baa80ccd6f5ce2accd5bb8b1ef Mon Sep 17 00:00:00 2001 From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com> Date: Fri, 14 Nov 2025 14:22:32 +0000 Subject: [PATCH 4/8] Pending changes exported from your codespace --- .../android_halide/aot-and-cross-compilation.md | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md index 8ec662cb62..0ff84c62dd 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md +++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md @@ -7,13 +7,16 @@ weight: 5 layout: "learningpathall" --- + +## What you'll build +In this section, you'll leverage the host version of Halide to perform AOT compilation of an image processing pipeline via cross-compilation. The resulting pipeline library is specifically tailored to Android devices (targeting, for instance, arm64-v8a ABI), while the compilation itself occurs entirely on the host system. This approach significantly accelerates development by eliminating the need to build Halide or perform JIT compilation on Android devices. It also guarantees that the resulting binaries are optimized for the intended hardware, streamlining the deployment of high-performance image processing applications on mobile platforms. + + ## Learn about ahead-of-time (AOT) and cross-compilation One of Halide's standout features is the ability to compile image processing pipelines ahead-of-time (AOT), enabling you to generate optimized binary code on your host machine rather than compiling directly on target devices. 
This AOT compilation process enables you to create highly efficient libraries that run effectively across diverse hardware without incurring the runtime overhead associated with just-in-time (JIT) compilation. Halide also supports robust cross-compilation capabilities. Cross-compilation means using the host version of Halide, typically running on a desktop Linux or macOS system—to target different architectures, such as Arm for Android devices. You can optimize Halide pipelines on your host machine, produce libraries specifically optimized for Android, and integrate them seamlessly into Android applications. The generated pipeline code includes essential optimizations and can embed minimal runtime support, further reducing workload on the target device and ensuring responsiveness and efficiency. -## What you'll build -In this section, you'll leverage the host version of Halide to perform AOT compilation of an image processing pipeline via cross-compilation. The resulting pipeline library is specifically tailored to Android devices (targeting, for instance, arm64-v8a ABI), while the compilation itself occurs entirely on the host system. This approach significantly accelerates development by eliminating the need to build Halide or perform JIT compilation on Android devices. It also guarantees that the resulting binaries are optimized for the intended hardware, streamlining the deployment of high-performance image processing applications on mobile platforms. ## Prepare pipeline for Android The following code demonstrates how to use Halide's AOT compilation and cross-compilation features to create an optimized image processing pipeline for Android. Run Halide on your host machine (in this example, macOS) to generate a static library containing the pipeline function, which you'll later invoke from an Android device. Below is a step-by-step explanation of this process. 
From a8ecb758b6a5aefcdbac9201cea617ff4fe272d2 Mon Sep 17 00:00:00 2001 From: Madeline Underwood Date: Fri, 14 Nov 2025 19:03:19 +0000 Subject: [PATCH 5/8] Update content for Hugo site --- .../android_halide/fusion.md | 7 ++- .../android_halide/processing-workflow.md | 48 ++++++++++++------- 2 files changed, 35 insertions(+), 20 deletions(-) diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md index d8e3238ee9..2ad5590ffa 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md +++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md @@ -8,7 +8,9 @@ layout: "learningpathall" --- ## What you'll build -In the previous section, you explored parallelization and tiling. Here, you'll focus on operator fusion (inlining) in Halide, that is, letting producers be computed directly inside their consumers—versus materializing intermediates with compute_root() or compute_at(). You'll learn when fusion reduces memory traffic and when materializing saves recomputation (for example, for large stencils or multi-use intermediates). You'll inspect loop nests with print_loop_nest(), switch among schedules (fuse-all, fuse-blur-only, materialize, tile-and-materialize-per-tile) in a live camera pipeline, and measure the impact (ms/FPS/MPix/s). + +In this section, you'll focus on operator fusion in Halide—where each stage is computed directly inside its consumer, instead of storing intermediate results. You'll learn how fusion can reduce memory traffic, and when materializing intermediates with `compute_root()` or `compute_at()` is better, especially for large filters or when results are reused. 
You'll use `print_loop_nest()` to see how Halide arranges the computation, switch between different scheduling modes (fuse all, fuse blur only, materialize, tile and materialize per tile) in a live camera pipeline, and measure the impact using ms, FPS, and MPix/s. + This section doesn't cover loop fusion (the fuse directive). You'll focus instead on operator fusion, which is Halide's default behavior. @@ -475,4 +477,5 @@ Fusion isn’t always best. You’ll want to materialize an intermediate (comput The fastest way to check whether fusion helps is to measure it. The demo prints timing and throughput per frame, but Halide also includes a built-in profiler that reports per-stage runtimes. To learn how to enable and interpret the profiler, see the official [Halide profiling tutorial](https://halide-lang.org/tutorials/tutorial_lesson_21_auto_scheduler_generate.html#profiling). ## Summary -In this section, you've learned about operator fusion in Halide—a powerful technique for reducing memory bandwidth and improving computational efficiency. You explored why fusion matters, looked at scenarios where it's most effective, and saw how Halide's scheduling constructs such as compute_root() and compute_at() let you control whether stages are fused or materialized. By experimenting with different schedules, including fusing the Gaussian blur and thresholding stages, you observed how fusion can significantly improve the performance of a real-time image processing pipeline. + +You've seen how operator fusion in Halide can make your image processing pipeline faster and more efficient. Fusion means Halide computes each stage directly inside its consumer, reducing memory traffic and keeping data in cache. You learned when fusion is best—like for simple pixel operations or cheap post-processing—and when materializing intermediates with `compute_root()` or `compute_at()` can help, especially for large stencils or multi-use buffers. 
By switching schedules in the live demo, you saw how fusion and materialization affect both the loop structure and real-time performance. Now you know how to choose the right approach for your own Arm-based image processing tasks. diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md index 79d905fde3..d1637bc222 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md +++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md @@ -364,7 +364,7 @@ This gives an FPS of 251.51, and average throughput of 521.52 MPix/s. Now you ca ### Apply parallelization -Parallelization lets Halide run independent pieces of work at the same time on multiple CPU cores. In image pipelines, rows (or row tiles) are naturally parallel once producer data is available. By distributing work across cores, wall-clock time is reduced—crucial for real-time video. +Parallelization allows Halide to process different parts of the image at the same time using multiple CPU cores. In image processing pipelines, each row or block of rows can be handled independently once the input data is ready. By spreading the work across several cores, you reduce the total processing time—this is especially important for real-time video applications. With the baseline measured, apply a minimal schedule that parallelizes the loop iteration for y axis. @@ -379,10 +379,15 @@ Add these lines after defining output(x, y) (and before any realize()). In this ``` This does two important things: -* compute_root() on gray divides the entire processing into two loops, one to compute the entire gray output, and the other to compute the final output. -* parallel(y) parallelizes over the pure loop variable y (rows). The rows are computed on different CPU cores in parallel. 
+* `compute_root()` on gray divides the entire processing into two loops, one to compute the entire gray output, and the other to compute the final output.
+* `parallel(y)` parallelizes over the pure loop variable y (rows). The rows are computed on different CPU cores in parallel.
+
+Now rebuild and run the application. You should see output similar to:
-Now rebuild and run the application again. The results should look like:
 ```output
 % ./camera-capture-perf-measurement
 realize: 1.16 ms | 864.15 FPS | 1791.90 MPix/s
@@ -394,11 +399,12 @@
 
 The performance gain by parallelization depends on how many CPU cores are available for this application to occupy.
 
 Tiling is a scheduling technique that divides computations into smaller, cache-friendly blocks or tiles. This approach significantly enhances data locality, reduces memory bandwidth usage, and leverages CPU caches more efficiently. While tiling can also use parallel execution, its primary advantage comes from optimizing intermediate data storage.
 
-Tiling splits the image into cache-friendly blocks (tiles). Two wins:
-* Partitioning: tiles are easy to parallelize across cores.
-* Locality: when you cache intermediates per tile, you avoid refetching/recomputing data and hit CPU L1/L2 cache more often.
+Tiling divides the image into smaller, cache-friendly blocks called tiles. This gives you two main benefits:
+
+* Partitioning: tiles are easy to process in parallel, so you can spread the work across multiple CPU cores.
+* Locality: by caching intermediate results within each tile, you avoid repeating calculations and make better use of the CPU cache.
 
-Explore both approaches.
+Try both methods to see how they improve performance.
## Cache intermediates per tile
 
@@ -422,11 +428,13 @@ This approach caches gray once per tile so the 3×3 blur can reuse it instead of
 ```
 
 In this scheduling:
-* tile(...) splits the image into cache-friendly blocks and makes it easy to parallelize across tiles
-* parallel(yo) distributes tiles across CPU cores where a CPU core is in charge of a row (yo) of tiles
-* gray.compute_at(...).store_at(...) materializes a tile-local planar buffer for the grayscale intermediate so blur can reuse it within the tile
+* `tile(...)` splits the image into cache-friendly blocks and makes it easy to parallelize across tiles
+* `parallel(yo)` distributes tiles across CPU cores where a CPU core is in charge of a row (yo) of tiles
+* `gray.compute_at(...).store_at(...)` materializes a tile-local planar buffer for the grayscale intermediate so blur can reuse it within the tile
 
-Recompile your application as before, then run. Here's sample output:
+Recompile your application as before, then run.
+
+Here's sample output:
 ```output
 realize: 0.98 ms | 1023.15 FPS | 2121.60 MPix/s
 ```
 
@@ -437,12 +445,16 @@ Caching the grayscale image for each tile gives the best performance. By storing
 There isn't a universal scheduling strategy that guarantees the best performance for every pipeline or device. The optimal approach depends on your specific image-processing workflow and the Arm architecture you're targeting. Halide's scheduling API gives you the flexibility to experiment with parallelization, tiling, and caching. Try different combinations to see which delivers the highest throughput and efficiency for your application.
 
 For the example of this application:
-* Start with parallelizing the outer-most loop.
-* Add tiling + caching only if: there's a spatial filter, or the intermediate is reused by multiple consumers—and preferably after converting sources to planar (or precomputing a planar gray).
-* From there, tune tile sizes and thread count for your target.
`HL_NUM_THREADS` is the environmental variable which allows you to limit the number of threads in-flight.
+Start by parallelizing the outermost loop to use multiple CPU cores. This is usually the simplest way to boost performance.
+
+Add tiling and caching if your pipeline includes a spatial filter (such as blur or convolution), or if an intermediate result is reused by several stages. Tiling works best after converting your source data to planar format, or after precomputing a planar grayscale image.
+
+Try parallelization first, then experiment with tiling and caching for further speedups. From there, tune tile sizes and thread count for your target. `HL_NUM_THREADS` is the environment variable that lets you limit the number of threads in flight.
 ## What you've accomplished and what's next
-In this section, you built a real-time Halide+OpenCV pipeline—grayscale, a 3×3 binomial blur, then thresholding—and instrumented it to measure throughput. Parallelization and tiling improved the performance.
+You built a real-time image processing pipeline using Halide and OpenCV. The workflow included converting camera frames to grayscale, applying a 3×3 binomial blur, and thresholding to create a binary image. You also measured performance to see how different scheduling strategies affect throughput.
+
+- Parallelization lets Halide use multiple CPU cores, speeding up processing by dividing work across rows or tiles.
+- Tiling improves cache efficiency, especially when intermediate results are reused often, such as with larger filters or multi-stage pipelines.
-* Parallelization spreads independent work across CPU cores.
-* Tiling for cache efficiency helps when an expensive intermediate is reused many times per output (for example, larger kernels, separable/multi-stage pipelines, multiple consumers) and when producers read planar data.
+By combining these techniques, you achieved faster and more efficient image processing on Arm systems.
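What `parallel(y)` amounts to can be sketched in plain C++ as well: rows are split into contiguous chunks that run on separate threads. In this sketch, `num_threads` stands in for the Halide thread-pool size that `HL_NUM_THREADS` caps; the thresholding math matches the pipeline in this section:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Illustrative stand-in for parallel(y): rows are divided into contiguous
// chunks and each chunk is processed on its own CPU thread. num_threads
// plays the role that Halide's thread pool size (capped by HL_NUM_THREADS)
// plays for a real pipeline.
std::vector<uint8_t> threshold_rows_parallel(const std::vector<uint8_t>& in,
                                             int w, int h, int num_threads) {
    std::vector<uint8_t> out(in.size());
    std::vector<std::thread> workers;
    int chunk = (h + num_threads - 1) / num_threads;     // rows per thread
    for (int t = 0; t < num_threads; ++t) {
        int y0 = t * chunk;
        int y1 = std::min(y0 + chunk, h);
        workers.emplace_back([&, y0, y1] {
            for (int y = y0; y < y1; ++y)                // this thread's rows
                for (int x = 0; x < w; ++x)
                    out[y * w + x] = in[y * w + x] > 128 ? 255 : 0;
        });
    }
    for (auto& th : workers) th.join();
    return out;
}
```

Rows are independent here, so threads never write the same output element and no locking is needed — the same property Halide relies on when you parallelize a pure loop variable.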
From 991d334e1a67861237b7fb88846d8593828bd277 Mon Sep 17 00:00:00 2001 From: Madeline Underwood Date: Sat, 15 Nov 2025 21:50:25 +0000 Subject: [PATCH 6/8] Enhance Android Halide documentation for clarity and detail in performance challenges, project setup, and operator fusion concepts --- .../android_halide/android.md | 10 +-- .../aot-and-cross-compilation.md | 2 +- .../android_halide/fusion.md | 48 ++++++++-- .../android_halide/intro.md | 87 ++++++++++--------- 4 files changed, 91 insertions(+), 56 deletions(-) diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/android.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/android.md index 7cdcae8788..9e9bb96139 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/android.md +++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/android.md @@ -26,9 +26,9 @@ In short, Halide delivers high-performance image processing without sacrificing ### Navigate Android development challenges While Android presents abundant opportunities for developers, the mobile development ecosystem brings its own set of challenges, especially for performance-intensive applications: -1. Limited Hardware Resources. Unlike desktop or server environments, mobile devices have significant constraints on processing power, memory capacity, and battery life. Developers must optimize software meticulously to deliver smooth performance while carefully managing hardware resource consumption. Leveraging tools like Halide allows developers to overcome these constraints by optimizing computational workloads, making resource-intensive tasks feasible on constrained hardware. -2. Cross-Compilation Complexities. Developing native code for Android requires handling multiple hardware architectures (such as armv8-a, ARM64, and sometimes x86/x86_64). Cross-compilation introduces complexities due to different instruction sets, CPU features, and performance characteristics. 
Managing this complexity involves careful use of the Android NDK, understanding toolchains, and correctly configuring build systems (e.g., Gradle, CMake). Halide helps mitigate these issues by abstracting away many platform-specific optimizations, automatically generating code optimized for target architectures. -3. Image-Format Conversions (Bitmap ↔ Halide Buffer). Android typically handles images through the Bitmap class or similar platform-specific constructs, whereas Halide expects image data to be in raw, contiguous buffer formats. Developers must bridge the gap between Android-specific image representations (Bitmaps, YUV images from camera APIs, etc.) and Halide's native buffer format. Proper management of these conversions—including considerations for pixel formats, stride alignment, and memory copying overhead—can significantly impact performance and correctness, necessitating careful design and efficient implementation of buffer-handling routines. +- Limited hardware resources: unlike desktop or server environments, mobile devices have significant constraints on processing power, memory capacity, and battery life. Developers must optimize software meticulously to deliver smooth performance while carefully managing hardware resource consumption. Leveraging tools like Halide allows developers to overcome these constraints by optimizing computational workloads, making resource-intensive tasks feasible on constrained hardware. +- Cross-compilation complexities: developing native code for Android requires handling multiple hardware architectures (such as Armv8-A, ARM64, and sometimes x86/x86_64). Cross-compilation introduces complexities due to different instruction sets, CPU features, and performance characteristics. Managing this complexity involves careful use of the Android NDK, understanding toolchains, and correctly configuring build systems (e.g., Gradle, CMake). 
Halide helps mitigate these issues by abstracting away many platform-specific optimizations, automatically generating code optimized for target architectures.
+- Image format conversions (Bitmap ↔ Halide Buffer): Android typically handles images through the Bitmap class or similar platform-specific constructs, whereas Halide expects image data to be in raw, contiguous buffer formats. Developers must bridge the gap between Android-specific image representations (Bitmaps, YUV images from camera APIs, etc.) and Halide's native buffer format. Proper management of these conversions—including considerations for pixel formats, stride alignment, and memory copying overhead—can significantly impact performance and correctness, necessitating careful design and efficient implementation of buffer-handling routines.
 ## Project requirements
 Before integrating Halide into your Android application, ensure you have the necessary tools and libraries.
@@ -41,7 +41,7 @@ Before integrating Halide into your Android application, ensure you have the nec
 ### Create the project
 1. Open Android Studio.
 2. Select New Project > Native C++.
-![img4](Figures/04.webp)
+![Android Studio New Project dialog showing Native C++ template selected. The dialog displays options for project name, language, and minimum SDK. The primary subject is the Native C++ template highlighted in the project creation workflow. The wider environment is a typical Android Studio interface with a neutral, technical tone. Visible text includes Native C++ and fields for configuring the new project.](Figures/04.webp)
 ### Configure the project
 1. Set the project Name to Arm.Halide.AndroidDemo.
@@ -407,7 +407,7 @@ The input Java byte array (input_bytes) is accessed and pinned into native memory
 Through this JNI bridge, Kotlin can invoke high-performance native code. You can now re-run the application. Select the Load Image button, and then Process Image.
You'll see the following results:
-![img9](Figures/09.png)
+![Android app screenshot showing the Arm Halide Android demo interface. The screen displays two buttons labeled Load Image and Process Image, with the Process Image button enabled. Below the buttons, an ImageView shows a grayscale photo of a camera man standing outdoors, holding a camera and tripod. The environment appears neutral and technical, with no visible emotional tone. The layout is centered and uses a simple vertical arrangement, making the interface easy to navigate for users with visual impairment.](Figures/09.png)
 ![img10](Figures/10.png)
 In the above code you created a new jbyteArray and copied the data explicitly, which can result in additional overhead. To optimize performance by avoiding unnecessary memory copies, you can directly wrap Halide's buffer in a Java-accessible ByteBuffer like so:
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md
index 0ff84c62dd..b74495cba1 100644
--- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md
+++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md
@@ -1,6 +1,6 @@
 ---
 # User change
-title: "Ahead-of-time and cross-compilation"
+title: "Generate optimized Halide pipelines for Android using ahead-of-time cross-compilation"

 weight: 5

diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md
index 2ad5590ffa..f94f214de1 100644
--- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md
+++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md
@@ -7,15 +7,18 @@ weight: 4

 layout: "learningpathall"
 ---

-## What you'll build

-In this section, you'll focus on
operator fusion in Halide—where each stage is computed directly inside its consumer, instead of storing intermediate results. You'll learn how fusion can reduce memory traffic, and when materializing intermediates with `compute_root()` or `compute_at()` is better, especially for large filters or when results are reused. You'll use `print_loop_nest()` to see how Halide arranges the computation, switch between different scheduling modes (fuse all, fuse blur only, materialize, tile and materialize per tile) in a live camera pipeline, and measure the impact using ms, FPS, and MPix/s. +You'll explore operator fusion in Halide, where each stage is computed inside its consumer instead of storing intermediate results. This approach reduces memory traffic and improves cache efficiency. You'll also learn when it's better to materialize intermediates using `compute_root()` or `compute_at()`, such as with large filters or when results are reused by multiple stages. By the end, you'll understand how to choose between fusion and materialization for real-time image processing on Arm devices. +You'll also use `print_loop_nest()` to see how Halide arranges the computation, switch between different scheduling modes (fuse all, fuse blur only, materialize, tile and materialize per tile) in a live camera pipeline, and measure the impact using ms, FPS, and MPix/s. -This section doesn't cover loop fusion (the fuse directive). You'll focus instead on operator fusion, which is Halide's default behavior. +{{% notice Note on scope %}} +This section doesn't cover loop fusion using the `fuse` directive. You'll focus instead on operator fusion, which is Halide's default behavior. +{{% /notice %}} ## Code -To demonstrate how fusion in Halide works create a new file `camera-capture-fusion.cpp`, and modify it as follows. This code uses a live camera pipeline (BGR → gray → 3×3 blur → threshold), adds a few schedule variants to toggle operator fusion vs. materialization, and print ms / FPS / MPix/s. 
So you can see the impact immediately.
+To explore how fusion in Halide works create a new file called `camera-capture-fusion.cpp`, and copy in the code below. This code uses a live camera pipeline (BGR → gray → 3×3 blur → threshold), adds a few schedule variants to toggle operator fusion versus materialization, and prints ms / FPS / MPix/s, so you'll be able to see the impact immediately:
 ```cpp
 #include "Halide.h"
@@ -234,12 +237,17 @@ int main(int argc, char** argv) {
     return 0;
 }
 ```
+The heart of this program is the `make_pipeline` function. This function builds the camera processing pipeline in Halide and lets you switch between different scheduling modes. Each mode changes how intermediate results are handled: either fusing stages together to minimize memory use, or materializing them to avoid recomputation. By adjusting the schedule, you can see how these choices affect both the loop structure and the real-time performance of your image processing pipeline.
-The main part of this program is the `make_pipeline` function. It defines the camera processing pipeline in Halide and applies different scheduling choices depending on which mode is selected.
-Start by declaring Var x, y as pixel coordinates. Similarly as before, the camera frames come in as 3-channel interleaved BGR, telling Halide how the data is laid out: the stride along x is 3 (one step moves across all three channels), the stride along c (channels) is 1, and the bounds on the channel dimension are 0–2.
+Start by declaring `Var x, y` to represent pixel coordinates. The camera frames use a 3-channel interleaved BGR format. This means:
+
+- The stride along the x-axis is 3, because each step moves across all three color channels.
+- The stride along the channel axis (c) is 1, so channels are stored contiguously.
+- The channel bounds are set from 0 to 2, covering the three BGR channels.
-Because you don't want to worry about array bounds when applying filters, clamp the input at the borders.
In Halide 19, BoundaryConditions::repeat_edge works cleanly when applied to an ImageParam, since it has .dim() information. This way, all downstream stages can assume safe access even at the edges of the image. +These settings tell Halide exactly how the image data is organized in memory, so it can process each pixel and channel correctly. + +To avoid errors when applying filters near the edges of an image, clamp the input at the borders. In Halide 19, you can use `BoundaryConditions::repeat_edge` directly on an `ImageParam`, because it includes dimension information. This ensures that all stages in your pipeline can safely access pixels, even at the image boundaries. ```cpp Pipeline make_pipeline(ImageParam& input, Schedule schedule) { @@ -253,10 +261,32 @@ Pipeline make_pipeline(ImageParam& input, Schedule schedule) { // (b) Border handling: clamp the *ImageParam* (works cleanly in Halide 19) Func inputClamped = BoundaryConditions::repeat_edge(input); ``` +The next stage converts the image to grayscale. Use the Rec.601 weights for BGR to gray conversion, just like in the previous section. For the blur, apply a 3×3 binomial kernel with values: + +``` +1 2 1 +2 4 2 +1 2 1 +``` -Next comes the gray conversion. As in previous section, use Rec.601 weights and a 3×3 binomial blur. Instead of using a reduction domain (RDom), unroll the sum in C++ host code with a pair of loops over the kernel. The kernel values {1, 2, 1; 2, 4, 2; 1, 2, 1} approximate a Gaussian filter. Each pixel of blur is simply the weighted sum of its 3×3 neighborhood, divided by 16. +This kernel closely approximates a Gaussian filter. Instead of using Halide's reduction domain (`RDom`), unroll the sum directly in C++ using two nested loops over the kernel values. For each pixel, calculate the weighted sum of its 3×3 neighborhood and divide by 16 to get the blurred result. This approach makes the computation straightforward and easy to follow. +Now, add a threshold stage to your pipeline. 
This stage checks each pixel value after the blur and sets it to white (255) if it's above 128, or black (0) otherwise. This produces a binary image, making it easy to see which areas are brighter than the threshold.
+
+Here's how you define the thresholded stage and the output Func:
+
+```cpp
+// Threshold (binary)
+Func thresholded("thresholded");
+Expr T = cast<uint8_t>(128);
+thresholded(x, y) = select(blur(x, y) > T, cast<uint8_t>(255), cast<uint8_t>(0));
+
+// Final output
+Func output("output");
+output(x, y) = thresholded(x, y);
+output.compute_root(); // Realize 'output' explicitly when running the pipeline
+```
-Then add a threshold stage. Pixels above 128 become white, and all others black, producing a binary image. Finally, define an output Func that wraps the thresholded result and call compute_root() on it so that it will be realized explicitly when you run the pipeline.
+This setup ensures that the output is a binary image, and Halide will compute and store the result when you run the pipeline. By calling `compute_root()` on the output Func, you tell Halide to materialize the final result, making it available for display or further processing.
 Now comes the interesting part: the scheduling choices. Depending on the Schedule enum passed in, you instruct Halide to either fuse everything (the default), materialize some intermediates, or even tile the output.
 * Simple: Here you'll explicitly compute and store both gray and blur across the whole frame with compute_root(). This makes them easy to reuse or parallelize, but requires extra memory traffic.
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md
index 78913a277b..45415cfc0f 100644
--- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md
+++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md
@@ -9,9 +9,9 @@ layout: "learningpathall"
 ---
 ## What is Halide?
-Halide is a powerful, open-source programming language designed to simplify and optimize high-performance image and signal processing. Researchers at MIT and Adobe developed Halide in 2012 to address the challenge of efficiently running image-processing algorithms on different hardware architectures without extensive manual tuning. +Halide is a powerful, open-source programming language designed to simplify and optimize high-performance image and signal processing. In 2012, researchers at MIT and Adobe developed Halide to efficiently run image-processing algorithms on different hardware architectures without extensive manual tuning. -Halide's programming model separates algorithmic logic from scheduling decisions, including parallelism, vectorization, memory management, and hardware-specific optimizations. This lets you focus on correctness first and tune performance independently, often achieving results that rival or exceed manually optimized code. +Halide makes it easy to write correct image-processing code by separating what your program does from how it runs. You first describe the algorithm, which is the steps to process each pixel, without needing to worry about performance details. You can then later choose scheduling strategies like parallelism, vectorization, and memory management to optimize for your hardware, including Arm processors. This approach helps you focus on getting the right results before tuning for speed, often matching or beating hand-optimized code. In this Learning Path, you'll explore Halide's foundational concepts, set up your development environment, and create your first functional Halide application. By the end, you'll understand what makes Halide uniquely suited to efficient image processing, particularly on mobile and Arm-based hardware, and be ready to build your own optimized pipelines. 
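The separation can be sketched in ordinary C++ (an illustrative stand-in, not Halide itself): one function fixes the algorithm — what each output pixel is — while two interchangeable "schedules" choose the traversal order, much as Halide's `reorder` directive does. Both produce identical pixels:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// The "algorithm": what each output pixel is, independent of execution order.
inline uint8_t brighten(uint8_t v) {
    return static_cast<uint8_t>(std::min(v + 50, 255));  // clamp at 255
}

// "Schedule" 1: y outer, x inner (row-major traversal).
std::vector<uint8_t> run_yx(const std::vector<uint8_t>& in, int w, int h) {
    std::vector<uint8_t> out(in.size());
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            out[y * w + x] = brighten(in[y * w + x]);
    return out;
}

// "Schedule" 2: x outer, y inner (column-major), as a reorder would arrange.
std::vector<uint8_t> run_xy(const std::vector<uint8_t>& in, int w, int h) {
    std::vector<uint8_t> out(in.size());
    for (int x = 0; x < w; ++x)
        for (int y = 0; y < h; ++y)
            out[y * w + x] = brighten(in[y * w + x]);
    return out;
}
```

In Halide, the algorithm lives in the `Func` definition and the traversal choice is a one-line scheduling directive, so switching strategies can never change the results.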
@@ -21,25 +21,21 @@ You can find the example code for this Learning Path in two GitHub repositories: ## Key concepts in Halide -Before building your first Halide application, you need to understand the foundational concepts that make Halide powerful for image processing around separating algorithms from schedules, using symbolic building blocks, and applying scheduling strategies. +Before you build your first Halide application, get familiar with the key ideas that make Halide powerful for image processing. Halide separates the steps of what your code does (the algorithm) from how it runs (the schedule). You'll use symbolic building blocks to describe image operations, then apply scheduling strategies to optimize performance for Arm processors. Understanding these concepts helps you write code that's both correct and fast. These concepts work together to enable high-performance code that's both readable and portable across different hardware architectures, including Arm processors. -These concepts work together to enable high-performance code that's both readable and portable across different hardware architectures, including Arm processors. - -## Separation of algorithm and schedule +## Separate algorithm from schedule for optimal performance Halide's core design principle separates algorithms from schedules. Traditional image-processing code tightly couples algorithmic logic with execution strategy, complicating optimization and portability. -Halide distinguishes these two components: - -- **Algorithm** defines what computations are performed, such as image filters, pixel transformations, or mathematical operations on image data. +- The algorithm defines what computations are performed, such as image filters, pixel transformations, or mathematical operations on image data. -- **Schedule** specifies how and where these computations execute, including parallel execution, memory usage, caching strategies, and hardware-specific optimizations. 
+- The schedule specifies how and where these computations execute, including parallel execution, memory usage, caching strategies, and hardware-specific optimizations.
 This separation enables you to experiment and optimize code for different hardware architectures without changing the core algorithmic logic.
-## Halide building blocks
+## Discover Halide building blocks
-Halide provides three key building blocks to structure image processing algorithms:
+Halide provides three key building blocks to structure image processing algorithms, as shown below:
 ```cpp
 Halide::Var x("x"), y("y"), c("c");
@@ -49,21 +45,21 @@ Halide::Func brighter("brighter");
 brighter(x, y, c) = Halide::cast<uint8_t>(Halide::min(input(x, y, c) + 50, 255));
 ```
-- **Functions (Func)** represent individual computational steps or image operations. Each `Func` encapsulates an expression applied to pixels, enabling concise definition of complex tasks.
+- Functions (`Func`) represent individual computational steps or image operations. Each `Func` encapsulates an expression applied to pixels, enabling concise definition of complex tasks.
-- **Vars** symbolically represent spatial coordinates or dimensions (for example, horizontal x, vertical y, color channel c), specifying where computations are applied.
+- `Var` symbolically represents spatial coordinates or dimensions (for example, horizontal x, vertical y, color channel c), specifying where computations are applied.
-- **Pipelines** are formed by connecting multiple `Func` objects, creating a workflow where each stage's output feeds into subsequent stages.
+- Pipelines are formed by connecting multiple `Func` objects, creating a workflow where each stage's output feeds into subsequent stages.
-Halide is a domain-specific language (DSL) tailored for image and signal processing. It provides predefined operations and building blocks optimized for expressing complex pipelines.
By abstracting common computational patterns, Halide lets you define processing logic concisely, facilitating readability, maintainability, and optimization across hardware targets. +Halide is a domain-specific language (DSL) tailored for image and signal processing. It provides predefined operations and building blocks optimized for expressing complex pipelines. By abstracting common computational patterns, Halide lets you define processing logic concisely, which in turn facilitates readability, maintainability, and optimization across hardware targets. -## Scheduling strategies +## Learn about scheduling strategies Halide offers several powerful scheduling strategies for maximum performance: -- Parallelism - executes computations concurrently across multiple CPU cores, reducing execution time for large datasets +- Parallelism is the execution of computations concurrently across multiple CPU cores, reducing execution time for large datasets -- Vectorization - enables simultaneous processing of multiple data elements using SIMD (Single Instruction, Multiple Data) instructions, such as Arm NEON, enhancing performance on Arm CPUs and GPUs +- Vectorization enables simultaneous processing of multiple data elements using SIMD (Single Instruction, Multiple Data) instructions, such as Arm NEON, enhancing performance on Arm CPUs and GPUs - Tiling divides computations into smaller blocks optimized for cache efficiency, improving memory locality and reducing transfer overhead @@ -73,31 +69,36 @@ Beyond manual scheduling, Halide provides an Autoscheduler that automatically ge ## Set up your environment -To start developing with Halide, your system needs to meet several requirements. - You can set up Halide using one of two approaches: -- **Pre-built binaries** are convenient, quick to install, and suitable for most users on standard platforms (Windows, Linux, macOS). This approach is recommended for typical use cases. 
+- **Use pre-built binaries** for a fast and convenient setup on Windows, Linux, and macOS. This method is recommended for most users and standard development environments.
- **Building from source** is required when pre-built binaries aren't available for your environment, or if you want to experiment with the latest Halide features or LLVM versions under active development. This method requires familiarity with build systems.
-To use pre-built binaries, follow these steps"
+To use pre-built binaries, follow these steps:
-- Visit [the official Halide releases page](https://github.com/halide/Halide/releases). This Learning Path was tested with version is v19.0.0.
-- Download and unzip the binaries to a convenient location (for example, `/usr/local/halide` on Linux/macOS or `C:\halide` on Windows).
-- Set environment variables to simplify usage:
-```console
-export HALIDE_DIR=/path/to/halide
-export PATH=$HALIDE_DIR/bin:$PATH
-```
-Next, install the following components:
+- Go to the [Halide releases page](https://github.com/halide/Halide/releases). This Learning Path uses version v19.0.0.
+- Download and unzip the binaries to a convenient location, such as `/usr/local/halide` (Linux/macOS) or `C:\halide` (Windows).
+- Set environment variables to make Halide easy to use:
+  ```console
+  export HALIDE_DIR=/path/to/halide
+  export PATH=$HALIDE_DIR/bin:$PATH
+  ```
-**LLVM** - Halide requires LLVM to compile and execute pipelines
-**OpenCV** - For image handling in later sections
+## Install LLVM and OpenCV
-Install with the commands for your OS:
+Before you can build and run Halide pipelines, you need to install two essential components:
+
+- LLVM: Halide depends on LLVM to compile and execute image processing pipelines. LLVM provides the backend that turns Halide code into optimized machine instructions for Arm processors.
+
+- OpenCV: You'll use OpenCV for image input and output in later sections.
OpenCV makes it easy to load, display, and save images, and it integrates smoothly with Halide buffers. + +Both tools are available for Arm platforms on Linux, macOS, and Windows. Make sure you install the correct versions for your operating system and architecture. + +The commands below show how to install LLVM and OpenCV: {{< tabpane code=true >}} {{< tab header="Linux/Ubuntu" language="bash">}} @@ -112,7 +113,7 @@ brew install opencv pkg-config Halide examples were tested with OpenCV 4.11.0 -## Your first Halide program +## Build your first Halide program You're now ready to build your first Halide application. Save the following code in a file named `hello-world.cpp`: ```cpp @@ -178,13 +179,17 @@ int main() { } ``` -This program demonstrates how to combine Halide's image processing capabilities with OpenCV's image I/O and display functionality. It begins by loading an image from disk using OpenCV, specifically reading from a static file named `img.png` (here you use a Cameraman image). Since OpenCV loads images in BGR (Blue-Green-Red) format by default, the code immediately converts the image to RGB (Red-Green-Blue) format so that it's compatible with Halide's expectations. +This program demonstrates how you can combine Halide's image processing capabilities with OpenCV's image I/O and display functionality. It begins by loading an image from disk using OpenCV, specifically reading from a static file named `img.png` (here you use a Cameraman image). Since OpenCV loads images in BGR (Blue-Green-Red) format by default, the code immediately converts the image to RGB (Red-Green-Blue) format so that it's compatible with Halide. + +The program wraps the raw image data into a Halide buffer, capturing the image's width, height, and color channels. It defines the Halide pipeline using a function named `invert` to specify the computation for each pixel—subtract the original pixel value from 255 to invert the colors. 
-Once the image is loaded and converted, the program wraps the raw image data into a Halide buffer, capturing the image's dimensions (width, height, and color channels). Next, the Halide pipeline is defined through a function named *invert*, which specifies the computations to perform on each pixel—in this case, subtracting the original pixel value from 255 to invert the colors. The pipeline definition alone doesn't perform any actual computation; it only describes what computations should occur and how to schedule them.
+{{% notice Note %}}
+Remember, the pipeline definition only describes the computations and scheduling; it doesn't perform any actual processing until you realize the pipeline.
+{{% /notice %}}
-The actual computation occurs when the pipeline is executed with the call to invert.realize(...). This is the step that processes the input image according to the defined pipeline and produces an output Halide buffer. The scheduling directive (invert.reorder(c, x, y)) ensures that pixel data is computed in an interleaved manner (channel-by-channel per pixel), aligning the resulting data with OpenCV’s expected memory layout for images.
+The actual computation occurs when the pipeline is executed with the call to `invert.realize(...)`. This is the step that processes the input image according to the defined pipeline and produces an output Halide buffer. The scheduling directive `invert.reorder(c, x, y)` ensures that pixel data is computed in an interleaved manner (channel-by-channel per pixel), aligning the resulting data with OpenCV’s expected memory layout for images.
-Finally, the processed Halide output buffer is wrapped in an OpenCV `Mat` header without copying pixel data. For proper display in OpenCV, which uses BGR (Blue-Green-Red) channel ordering by default, the code converts the processed image back from RGB to BGR. The program then displays the original and inverted images in separate windows, waiting for a key press before exiting.
This approach demonstrates integration between Halide for high-performance image processing and OpenCV for convenient input and output operations. +Wrap the processed Halide output buffer in an OpenCV `Mat` header without copying pixel data. Convert the processed image from RGB back to BGR for proper display in OpenCV, which uses BGR channel ordering by default. Display the original and inverted images in separate windows, and wait for a key press before exiting. Use this approach to integrate Halide for high-performance image processing with OpenCV for convenient input and output operations. By default, Halide orders loops based on the order of variable declaration. In this example, the original ordering (x, y, c) implies processing the image pixel-by-pixel across all horizontal positions (x), then vertical positions (y), and finally channels (c). This ordering naturally produces a planar memory layout (for example, processing all red pixels first, then green, then blue). @@ -228,7 +233,7 @@ invert.reorder_storage(c, x, y); Using only `reorder(c, x, y)` affects the computational loop order but not necessarily the memory layout. The computed data could still be stored in planar order by default. Using `reorder_storage(c, x, y)` defines the memory layout as interleaved. -## Compilation instructions +## Compile the program Compile the program as follows (replace `/path/to/halide` with your actual path): ```console @@ -257,7 +262,7 @@ You'll see two windows displaying the original and inverted images: You've learned Halide's foundational concepts, explored the benefits of separating algorithms and schedules, set up your development environment, and created your first functional Halide application integrated with OpenCV for Arm development. -While the example introduces the core concepts of Halide pipelines (such as defining computations symbolically and realizing them), it doesn't yet showcase the benefits of separating algorithm definition from scheduling strategies. 
+cheWhile the example introduces the core concepts of Halide pipelines (such as defining computations symbolically and realizing them), it doesn't yet showcase the benefits of separating algorithm definition from scheduling strategies. In subsequent sections, you'll explore advanced Halide scheduling techniques, including parallelism, vectorization, tiling, and loop fusion, which demonstrate the practical advantages of separating algorithm logic from scheduling. These techniques enable fine-grained performance optimization tailored to Arm processors and other hardware without modifying algorithmic correctness. From 5639dcc251124980fd0126dd4739e7321bc4b988 Mon Sep 17 00:00:00 2001 From: Madeline Underwood Date: Sat, 15 Nov 2025 21:59:14 +0000 Subject: [PATCH 7/8] Refactor Halide documentation: update section titles for clarity and remove redundant text --- .../mobile-graphics-and-gaming/android_halide/fusion.md | 2 +- .../mobile-graphics-and-gaming/android_halide/intro.md | 5 ++--- 2 files changed, 3 insertions(+), 4 deletions(-) diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md index f94f214de1..467ae467d1 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md +++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md @@ -17,7 +17,7 @@ You'll also use `print_loop_nest()` to see how Halide arranges the computation, This section doesn't cover loop fusion using the `fuse` directive. You'll focus instead on operator fusion, which is Halide's default behavior. {{% /notice %}} -## Code +## Explore the code To explore how fusion in Halide works, create a new file called `camera-capture-fusion.cpp`, and copy in the code below.
This code uses a live camera pipeline (BGR → gray → 3×3 blur → threshold), adds a few schedule variants that toggle between operator fusion and materialization, and prints ms / FPS / MPix/s per frame, so you can see the impact immediately: ```cpp diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md index 45415cfc0f..e2535f65a6 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md +++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md @@ -262,7 +262,6 @@ You'll see two windows displaying the original and inverted images: You've learned Halide's foundational concepts, explored the benefits of separating algorithms and schedules, set up your development environment, and created your first functional Halide application integrated with OpenCV for Arm development. -cheWhile the example introduces the core concepts of Halide pipelines (such as defining computations symbolically and realizing them), it doesn't yet showcase the benefits of separating algorithm definition from scheduling strategies. - -In subsequent sections, you'll explore advanced Halide scheduling techniques, including parallelism, vectorization, tiling, and loop fusion, which demonstrate the practical advantages of separating algorithm logic from scheduling. These techniques enable fine-grained performance optimization tailored to Arm processors and other hardware without modifying algorithmic correctness. +While the example introduces the core concepts of Halide pipelines (such as defining computations symbolically and realizing them), it doesn't yet showcase the benefits of separating algorithm definition from scheduling strategies.
+In subsequent sections, you'll explore advanced Halide scheduling techniques, including parallelism, vectorization, tiling, and loop fusion, which demonstrate the practical advantages of separating algorithm logic from scheduling. These techniques enable fine-grained performance optimization tailored to Arm processors and other hardware without modifying algorithmic correctness. \ No newline at end of file From c9454cf45510b93bc02c9b783424cf7d8e043f0d Mon Sep 17 00:00:00 2001 From: Madeline Underwood Date: Sat, 15 Nov 2025 22:00:45 +0000 Subject: [PATCH 8/8] Refactor fusion documentation: promote the 'Profiling' heading from a subsection ('###') to a section ('##') for consistency --- .../mobile-graphics-and-gaming/android_halide/fusion.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md index 467ae467d1..84b6ee815e 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md +++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md @@ -503,7 +503,7 @@ Fusion isn’t always best. You’ll want to materialize an intermediate (comput * The intermediate is reused by multiple consumers. * You need a natural stage to apply parallelization or tiling. -### Profiling +## Profiling The fastest way to check whether fusion helps is to measure it. The demo prints timing and throughput per frame, but Halide also includes a built-in profiler that reports per-stage runtimes. To learn how to enable and interpret the profiler, see the official [Halide profiling tutorial](https://halide-lang.org/tutorials/tutorial_lesson_21_auto_scheduler_generate.html#profiling). ## Summary