FastChwHwcConverter

Overview

Multi-Core CPU Implementation (C++Thread-OpenMP-oneTBB)

FastChwHwcConverter.hpp is a high-performance, multi-threaded, header-only C++ library for converting image data formats between HWC (Height, Width, Channels) and CHW (Channels, Height, Width). It leverages C++ STL Thread / OpenMP / Intel oneTBB for parallel processing, utilizing all CPU cores for maximum performance.

Note: If the build environment does not find OpenMP, or you set USE_OPENMP to OFF, the library uses C++ thread mode.

GPU Acceleration (NVIDIA CUDA)

FastChwHwcConverterCuda.hpp is a high-performance, GPU-accelerated library for converting image data formats between HWC and CHW, supporting CUDA 10.0 and above. It requires no installation of the CUDA SDK, header files, or static linking: it searches the system path for CUDA's dynamic link libraries, loads them at run time, and calls the functions they export.

Note: If your operating environment does not support CUDA or does not meet the conditions for using CUDA acceleration, it will automatically fall back to the CPU (OpenMP/C++ Thread/Intel oneTBB) for processing. The functions accept both CUDA device memory and host memory pointers.

GPU Acceleration (AMD ROCm)

FastChwHwcConverterROCm.hpp is a high-performance, GPU-accelerated library for converting image data formats between HWC and CHW, supporting ROCm 5.0 and above. Like the CUDA library, it does not require the ROCm (HIP) SDK, header files, or static linking, and dynamically loads ROCm libraries from the system path.

Note: If your operating environment does not support ROCm or does not meet the conditions for using ROCm acceleration, it will automatically fall back to the CPU (OpenMP/C++ Thread/Intel oneTBB) for processing. The functions accept both ROCm device memory and host memory pointers.

Similar conversion code found in other projects on GitHub will most likely only achieve performance close to that of single-threaded execution.

The difference between CHW and HWC

Let's consider a 2x2 image with three channels (RGB).

Example Image Data:
```
Pixel 1 (R, G, B)    Pixel 2 (R, G, B)
Pixel 3 (R, G, B)    Pixel 4 (R, G, B)
```
We can store this image data in two different formats: CHW (Channel-Height-Width) and HWC (Height-Width-Channel).

CHW Format

In the CHW format, the data is stored channel by channel: first all the red channel data, then all the green channel data, and finally all the blue channel data.

For example (2x2 RGB Image):

RRRRGGGGBBBB

Mapping to the actual pixel positions:

R1, R2, R3, R4, G1, G2, G3, G4, B1, B2, B3, B4

HWC Format

In the HWC format, the data is stored pixel by pixel, with each pixel's channels in sequence, so the RGB data for each pixel is stored together.

For example (2x2 RGB Image):

RGBRGBRGBRGB

Mapping to the actual pixel positions:

(R1, G1, B1), (R2, G2, B2), (R3, G3, B3), (R4, G4, B4)
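
The two layouts above differ only in how a (row, column, channel) position maps to a flat offset. The following is a minimal single-threaded sketch of that mapping, not the library's implementation; the helper names hwc_index, chw_index, and naive_hwc_to_chw are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Offset of element (row y, column x, channel k) in an HWC buffer.
inline std::size_t hwc_index(std::size_t y, std::size_t x, std::size_t k,
                             std::size_t w, std::size_t c) {
    return (y * w + x) * c + k;
}

// Offset of the same element in a CHW buffer.
inline std::size_t chw_index(std::size_t k, std::size_t y, std::size_t x,
                             std::size_t h, std::size_t w) {
    return (k * h + y) * w + x;
}

// Illustrative reference conversion (hypothetical helper, single-threaded).
template <typename T>
std::vector<T> naive_hwc_to_chw(const std::vector<T>& src,
                                std::size_t h, std::size_t w, std::size_t c) {
    assert(src.size() == h * w * c);  // buffer must hold the whole image
    std::vector<T> dst(src.size());
    for (std::size_t y = 0; y < h; ++y)
        for (std::size_t x = 0; x < w; ++x)
            for (std::size_t k = 0; k < c; ++k)
                dst[chw_index(k, y, x, h, w)] = src[hwc_index(y, x, k, w, c)];
    return dst;
}
```

For the 2x2 RGB image above, passing the RGBRGBRGBRGB buffer with h = 2, w = 2, c = 3 yields the RRRRGGGGBBBB ordering.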

Why Convert Between HWC and CHW Formats?

The conversion between HWC (Height-Width-Channel) and CHW (Channel-Height-Width) formats is crucial for optimizing image processing tasks. Different machine learning frameworks and libraries have varying data format preferences. For instance, many deep learning frameworks, such as PyTorch, prefer the CHW format, while libraries like OpenCV often use the HWC format. By converting between these formats, we ensure compatibility and efficient data handling, enabling seamless transitions between different processing pipelines and maximizing performance for specific tasks. This flexibility enhances the overall efficiency and effectiveness of image processing and machine learning workflows.

Features

High-Performance: Uses C++ Thread / OpenMP / Intel oneTBB for parallel processing to make full use of multi-core CPUs.
GPU Optimization: Fully leverages NVIDIA CUDA and AMD ROCm technologies to harness the computational power of GPUs, accelerating performance for intensive workloads.
Header-Only: Include only a single header file. Easy to integrate into your C/C++ project; see the Example section.
Flexible: Supports scaling, clamping, and normalization of image data of any data type.
Lightweight & SDK-Free: No dependency on any external SDKs like CUDA SDK or HIP SDK. The project requires no additional header files or static library linkage, making it clean and easy to deploy.
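
To illustrate the multi-core idea behind the High-Performance feature, here is a minimal sketch that splits image rows across std::thread workers. It assumes a simple even row partition; the names hwc_to_chw_rows and parallel_hwc_to_chw are hypothetical, and the library's actual scheduling may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Convert rows [row_begin, row_end) of an HWC image into a CHW buffer.
inline void hwc_to_chw_rows(const std::uint8_t* src, float* dst,
                            std::size_t h, std::size_t w, std::size_t c,
                            std::size_t row_begin, std::size_t row_end,
                            float alpha) {
    for (std::size_t y = row_begin; y < row_end; ++y)
        for (std::size_t x = 0; x < w; ++x)
            for (std::size_t k = 0; k < c; ++k)
                dst[(k * h + y) * w + x] =
                    alpha * static_cast<float>(src[(y * w + x) * c + k]);
}

// Hypothetical helper: give each worker thread a contiguous band of rows.
inline void parallel_hwc_to_chw(const std::uint8_t* src, float* dst,
                                std::size_t h, std::size_t w, std::size_t c,
                                float alpha,
                                unsigned num_threads = std::thread::hardware_concurrency()) {
    assert(src != nullptr && dst != nullptr);
    if (num_threads == 0) num_threads = 1;  // hardware_concurrency() may return 0
    const std::size_t workers =
        std::min<std::size_t>(num_threads, std::max<std::size_t>(h, 1));
    const std::size_t rows_per_worker = (h + workers - 1) / workers;

    std::vector<std::thread> pool;
    pool.reserve(workers);
    for (std::size_t t = 0; t < workers; ++t) {
        const std::size_t begin = t * rows_per_worker;
        const std::size_t end = std::min(h, begin + rows_per_worker);
        if (begin >= end) break;  // no rows left for this worker
        pool.emplace_back(hwc_to_chw_rows, src, dst, h, w, c, begin, end, alpha);
    }
    for (auto& worker : pool) worker.join();  // wait for every band to finish
}
```

Each worker writes to a disjoint set of output elements, so no locking is needed in this sketch.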

Installation

for CPU (C++ Thread)

Simply include the header file FastChwHwcConverter.hpp in your project. To build the bundled benchmark with the C++ thread backend, configure and build with CMake:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DUSE_OPENMP=OFF -DUSE_TBB=OFF -DBUILD_BENCHMARK=ON -DBUILD_CUDA_BENCHMARK=OFF -DBUILD_ROCM_BENCHMARK=OFF -DBUILD_EXAMPLE=OFF -DBUILD_EXAMPLE_OPENCV=OFF

cmake --build build --config Release

for CPU (OpenMP)

OpenMP is an API that supports multi-platform shared-memory multiprocessing programming on many platforms, instruction-set architectures, and operating systems. It uses a portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications for platforms ranging from the standard desktop computer to the supercomputer.

Option 1:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DUSE_OPENMP=ON -DUSE_TBB=OFF -DBUILD_BENCHMARK=ON -DBUILD_CUDA_BENCHMARK=OFF -DBUILD_ROCM_BENCHMARK=OFF -DBUILD_EXAMPLE=OFF -DBUILD_EXAMPLE_OPENCV=OFF

cmake --build build --config Release

Option 2:

Simply include the header file FastChwHwcConverter.hpp in your project. Before the include, define the macro: #define USE_OPENMP 1.

for CPU (oneTBB)

Intel oneTBB (Intel® oneAPI Threading Building Blocks) is an advanced threading and memory-management template library that simplifies parallelism. It is part of the Intel® oneAPI Base Toolkit.

Option 1:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DUSE_OPENMP=OFF -DUSE_TBB=ON -DTBB_DIR=D:/extlibs/oneAPI/tbb/2021.13/lib/cmake/tbb -DBUILD_BENCHMARK=ON -DBUILD_CUDA_BENCHMARK=OFF -DBUILD_ROCM_BENCHMARK=OFF -DBUILD_EXAMPLE=OFF -DBUILD_EXAMPLE_OPENCV=OFF

cmake --build build --config Release

Option 2:

Simply include the header file FastChwHwcConverter.hpp in your project. Before the include, define the macro: #define USE_TBB 1.

for GPU (CUDA or ROCm)

NVIDIA CUDA Official Website

AMD ROCm Official Website

Option 1:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DUSE_OPENMP=OFF -DUSE_TBB=OFF -DBUILD_BENCHMARK=ON -DBUILD_CUDA_BENCHMARK=ON -DBUILD_ROCM_BENCHMARK=ON -DBUILD_EXAMPLE=ON -DBUILD_EXAMPLE_OPENCV=ON

cmake --build build --config Release

Option 2:

Simply include the header file FastChwHwcConverterCuda.hpp or FastChwHwcConverterROCm.hpp in your project:
```
#include "FastChwHwcConverterCuda.hpp"
```
```
#include "FastChwHwcConverterROCm.hpp"
```

Usually you also need to copy the following runtime-compilation libraries from the CUDA/ROCm Runtime SDK to the executable program directory, or set the CUDA/ROCm SDK HOME as a system environment variable:

Windows CUDA: nvrtc64_***_0.dll, nvrtc-builtins64_***
Windows ROCm: hiprtc****.dll, hiprtc-builtins****.dll, amd_comgr_*.dll, amd_comgr****.dll
Linux CUDA: libnvrtc.so
Linux ROCm: libhiprtc.so

In addition, you need to download and install the latest version of the driver from the NVIDIA drivers website or AMD drivers website, because this project dynamically loads the driver library: nvcuda.dll (for Windows CUDA), amdhip64_6.dll (for Windows ROCm), libcuda.so (for Linux CUDA), or libamdhip64.so (for Linux ROCm).
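
The dynamic loading described above can be sketched as follows. This is a minimal probe under stated assumptions, not the library's loader: the helper names library_exports and cuda_driver_available are hypothetical, the extra Linux name libcuda.so.1 is an assumption about how drivers are commonly installed, and cuInit is the CUDA Driver API entry point used here as the symbol to look up.

```cpp
#include <cassert>
#include <string>
#if defined(_WIN32)
#include <windows.h>
#else
#include <dlfcn.h>
#endif

// Returns true if `library` can be loaded and exports `symbol`.
inline bool library_exports(const std::string& library, const std::string& symbol) {
    assert(!library.empty() && !symbol.empty());
#if defined(_WIN32)
    HMODULE handle = LoadLibraryA(library.c_str());
    if (handle == nullptr) return false;
    const bool found = GetProcAddress(handle, symbol.c_str()) != nullptr;
    FreeLibrary(handle);
#else
    void* handle = dlopen(library.c_str(), RTLD_LAZY | RTLD_LOCAL);
    if (handle == nullptr) return false;
    const bool found = dlsym(handle, symbol.c_str()) != nullptr;
    dlclose(handle);
#endif
    return found;
}

// Hypothetical probe: is an NVIDIA driver library reachable on the system path?
inline bool cuda_driver_available() {
#if defined(_WIN32)
    return library_exports("nvcuda.dll", "cuInit");
#else
    // libcuda.so.1 is an assumed fallback name, not taken from this README.
    return library_exports("libcuda.so", "cuInit") ||
           library_exports("libcuda.so.1", "cuInit");
#endif
}
```

A caller could use such a probe to decide between a GPU path and the CPU fallback. On older Linux toolchains, dlopen may require linking with -ldl.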

Requirements

C++17 or later
OpenMP support (optional, set USE_OPENMP to ON for high performance)
oneTBB support (optional, set USE_TBB to ON and set valid TBB_LIBS for Intel oneTBB's high performance)
CMake v3.10 or later (optional)
OpenCV v4.0 or later (optional, if BUILD_EXAMPLE_OPENCV is ON)
CUDA 11.2+ driver (optional, if you want to use CUDA acceleration; the NVIDIA GPU's compute capability must be > 3.5)
ROCm 5.0+ driver (optional, if you want to use ROCm acceleration; see AMD's hardware and system requirements)

API Documents

HWC to CHW Conversion (CPU)

The whyb::cpu::hwc2chw() function converts image data from HWC format to CHW format.

template <typename Stype, typename Dtype,
            bool HasAlpha = false,
            bool NeedClamp = false,
            bool NeedNormalizedMeanStds = false>
void hwc2chw(
    const size_t h, const size_t w, const size_t c,
    const Stype* src, Dtype* dst,
    const Dtype alpha = 1, 
    const Dtype min_v = 0.0, const Dtype max_v = 1.0,
    const std::array<float, 3> mean = { 0.485, 0.456, 0.406 },
    const std::array<float, 3> stds = { 0.229, 0.224, 0.225 }
);

Parameters:

h: Height of the image.
w: Width of the image.
c: Number of channels.
src: Pointer to the source data in HWC format.
dst: Pointer to the destination data in CHW format.
alpha: Scaling factor (default is 1).
min_v: Minimum value for clamping (default is 0.0).
max_v: Maximum value for clamping (default is 1.0).
mean: Array of mean values for normalization (default is {0.485, 0.456, 0.406}).
stds: Array of standard deviation values for normalization (default is {0.229, 0.224, 0.225}).
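
To make the roles of alpha, min_v, max_v, mean, and stds concrete, here is a minimal per-value sketch. The order of operations (scale, then clamp, then normalize) is an assumption made for illustration, as is the helper name preprocess_value; check the header for the order the library actually applies.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: transform one 8-bit sample of the given channel.
inline float preprocess_value(std::uint8_t value, std::size_t channel,
                              float alpha, float min_v, float max_v,
                              const std::array<float, 3>& mean,
                              const std::array<float, 3>& stds) {
    assert(channel < 3);      // mean/stds hold three channel entries
    assert(min_v <= max_v);   // std::clamp requires an ordered range
    float v = alpha * static_cast<float>(value);  // scale, e.g. 1/255 maps to [0, 1]
    v = std::clamp(v, min_v, max_v);              // clamp to [min_v, max_v]
    return (v - mean[channel]) / stds[channel];   // normalize per channel
}
```

With alpha = 1/255 and the default mean and stds, a red sample of 255 becomes (1.0 - 0.485) / 0.229 under these assumptions.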

CHW to HWC Conversion (CPU)

The whyb::cpu::chw2hwc() function converts image data from CHW format to HWC format.

template <typename Stype, typename Dtype,
            bool HasAlpha = false,
            bool NeedClamp = false>
void chw2hwc(
    const size_t c, const size_t h, const size_t w,
    const Stype* src, Dtype* dst, 
    const Dtype alpha = 1, 
    const Dtype min_v = 0, const Dtype max_v = 255
);

Parameters:

c: Number of channels.
h: Height of the image.
w: Width of the image.
src: Pointer to the source data in CHW format.
dst: Pointer to the destination data in HWC format.
alpha: Scaling factor (default is 1).
min_v: Minimum value for clamping (default is 0).
max_v: Maximum value for clamping (default is 255).
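
The reverse direction can be sketched in the same way. This single-threaded reference assumes scaling by alpha, clamping to [min_v, max_v], and truncation when narrowing to 8 bits; the helper name naive_chw_to_hwc_u8 is hypothetical and the rounding behaviour of the library is not stated here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: CHW float data to HWC 8-bit data with scale and clamp.
inline std::vector<std::uint8_t> naive_chw_to_hwc_u8(
        const std::vector<float>& src,
        std::size_t c, std::size_t h, std::size_t w,
        float alpha = 255.0f, float min_v = 0.0f, float max_v = 255.0f) {
    assert(src.size() == c * h * w);  // buffer must hold the whole image
    assert(min_v <= max_v);
    std::vector<std::uint8_t> dst(src.size());
    for (std::size_t k = 0; k < c; ++k)
        for (std::size_t y = 0; y < h; ++y)
            for (std::size_t x = 0; x < w; ++x) {
                const float scaled = alpha * src[(k * h + y) * w + x];
                const float clamped = std::clamp(scaled, min_v, max_v);
                // Truncates toward zero; rounding is an open design choice.
                dst[(y * w + x) * c + k] = static_cast<std::uint8_t>(clamped);
            }
    return dst;
}
```

Clamping before the narrowing cast matters: without it, out-of-range model outputs such as 2.0 or -1.0 would not convert to valid 8-bit values.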

HWC to CHW Conversion (CUDA)

The whyb::nvidia::hwc2chw() function converts image data from HWC format to CHW format.

void hwc2chw(
    const size_t h, const size_t w, const size_t c,
    const uint8_t* src, float* dst,
    const float alpha = 1.f/255.f
);

Parameters:

h: Height of the image.
w: Width of the image.
c: Number of channels.
src: Pointer to the source data(host memory) in HWC format.
dst: Pointer to the destination data(host memory) in CHW format.
alpha: Scaling factor (default is 1/255).

Note: Please call whyb::nvidia::init() before the first use, and call whyb::nvidia::release() to release it after confirming that it will not be used anymore.
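
One way to keep the init()/release() pairing from being forgotten is a small scope guard. The sketch below is an optional pattern, not part of the library: BackendSession is a hypothetical class, and the demo namespace contains stand-in functions so the snippet compiles on its own. It assumes only that init() returns a value convertible to bool, as the example code in this README suggests.

```cpp
#include <cassert>
#include <utility>

// Stand-ins for whyb::nvidia::init()/release() so the sketch is self-contained.
namespace demo {
inline int live = 0;
inline bool init() { ++live; return true; }
inline void release() { assert(live > 0); --live; }
}  // namespace demo

// Hypothetical guard: calls init on construction, release on scope exit.
template <typename InitFn, typename ReleaseFn>
class BackendSession {
public:
    BackendSession(InitFn init, ReleaseFn release)
        : release_(std::move(release)), ok_(static_cast<bool>(init())) {}
    ~BackendSession() { if (ok_) release_(); }  // release only if init succeeded
    BackendSession(const BackendSession&) = delete;
    BackendSession& operator=(const BackendSession&) = delete;
    explicit operator bool() const { return ok_; }

private:
    ReleaseFn release_;
    bool ok_;
};
```

With the real library, the two lambdas would wrap whyb::nvidia::init() and whyb::nvidia::release(), for example `BackendSession session([] { return whyb::nvidia::init(); }, [] { whyb::nvidia::release(); });`, followed by a check of `if (!session) return;`.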

CHW to HWC Conversion (CUDA)

The whyb::nvidia::chw2hwc() function converts image data from CHW format to HWC format.

void chw2hwc(
    const size_t c, const size_t h, const size_t w,
    const float* src, uint8_t* dst,
    const uint8_t alpha = 255.0f
);

Parameters:

c: Number of channels.
h: Height of the image.
w: Width of the image.
src: Pointer to the source data(host memory) in CHW format.
dst: Pointer to the destination data(host memory) in HWC format.
alpha: Scaling factor (default is 255).

Note: Please call whyb::nvidia::init() before the first use, and call whyb::nvidia::release() to release it after confirming that it will not be used anymore.

HWC to CHW Conversion (ROCm)

The whyb::amd::hwc2chw() function converts image data from HWC format to CHW format.

void hwc2chw(
    const size_t h, const size_t w, const size_t c,
    const uint8_t* src, float* dst,
    const float alpha = 1.f/255.f
);

Parameters:

h: Height of the image.
w: Width of the image.
c: Number of channels.
src: Pointer to the source data(host memory) in HWC format.
dst: Pointer to the destination data(host memory) in CHW format.
alpha: Scaling factor (default is 1/255).

Note: Please call whyb::amd::init() before the first use, and call whyb::amd::release() to release it after confirming that it will not be used anymore.

CHW to HWC Conversion (ROCm)

The whyb::amd::chw2hwc() function converts image data from CHW format to HWC format.

void chw2hwc(
    const size_t c, const size_t h, const size_t w,
    const float* src, uint8_t* dst,
    const uint8_t alpha = 255.0f
);

Parameters:

c: Number of channels.
h: Height of the image.
w: Width of the image.
src: Pointer to the source data(host memory) in CHW format.
dst: Pointer to the destination data(host memory) in HWC format.
alpha: Scaling factor (default is 255).

Note: Please call whyb::amd::init() before the first use, and call whyb::amd::release() to release it after confirming that it will not be used anymore.

Example

This example code (test/example.cpp) demonstrates how to use the FastChwHwcConverter, FastChwHwcConverterCuda, and FastChwHwcConverterROCm libraries to convert image data from HWC format to CHW format, and then back to HWC format after AI inference.

#include "FastChwHwcConverter.hpp"
#include "FastChwHwcConverterCuda.hpp"
#include "FastChwHwcConverterROCm.hpp"
#include <vector>
#include <cstdint>
#include <iostream>

void cpu_example()
{
    const size_t c = 3;
    const size_t w = 1920;
    const size_t h = 1080;

    // step 1. Defining input and output 
    const size_t pixel_size = h * w * c;
    std::vector<uint8_t> src_uint8(pixel_size); // Source data(hwc)
    std::vector<float> src_float(pixel_size); // Source data(chw)

    std::vector<float> out_float(pixel_size); // Inference output data(chw)
    std::vector<uint8_t> out_uint8(pixel_size); // Inference output data(hwc)

    // step 2. Load image data to src_uint8(8U3C)

    // step 3. Convert HWC(Height, Width, Channels) to CHW(Channels, Height, Width)
    whyb::cpu::hwc2chw<uint8_t, float, true>(h, w, c, (uint8_t*)src_uint8.data(), (float*)src_float.data(), 1.f/255.f);

    // step 4. Do AI inference
    // input: src_float ==infer==> output: out_float

    // step 5. Convert CHW(Channels, Height, Width) to HWC(Height, Width, Channels)
    whyb::cpu::chw2hwc<float, uint8_t, true>(c, h, w, (float*)out_float.data(), (uint8_t*)out_uint8.data(), 255.f);

    std::cout << "cpu example done" << std::endl;
}

void cuda_example()
{
    if (!whyb::nvidia::init()) { return; }
    const size_t c = 3;
    const size_t w = 1920;
    const size_t h = 1080;

    // step 1. Defining input and output 
    const size_t pixel_size = h * w * c;
    std::vector<uint8_t> src_uint8(pixel_size); // Source data(hwc)
    std::vector<float> src_float(pixel_size); // Source data(chw)

    std::vector<float> out_float(pixel_size); // Inference output data(chw)
    std::vector<uint8_t> out_uint8(pixel_size); // Inference output data(hwc)

    // step 2. Load image data to src_uint8(8U3C)

    // step 3. Convert HWC(Height, Width, Channels) to CHW(Channels, Height, Width)
    whyb::nvidia::hwc2chw(h, w, c, (uint8_t*)src_uint8.data(), (float*)src_float.data(), 1.f/255.f);

    // step 4. Do AI inference
    // input: src_float ==infer==> output: out_float

    // step 5. Convert CHW(Channels, Height, Width) to HWC(Height, Width, Channels)
    whyb::nvidia::chw2hwc(c, h, w, (float*)out_float.data(), (uint8_t*)out_uint8.data(), 255.f);

    whyb::nvidia::release();
    std::cout << "cuda example done" << std::endl;
}

void rocm_example()
{
    if (!whyb::amd::init()) { return; }
    const size_t c = 3;
    const size_t w = 1920;
    const size_t h = 1080;

    // step 1. Defining input and output 
    const size_t pixel_size = h * w * c;
    std::vector<uint8_t> src_uint8(pixel_size); // Source data(hwc)
    std::vector<float> src_float(pixel_size); // Source data(chw)

    std::vector<float> out_float(pixel_size); // Inference output data(chw)
    std::vector<uint8_t> out_uint8(pixel_size); // Inference output data(hwc)

    // step 2. Load image data to src_uint8(8U3C)

    // step 3. Convert HWC(Height, Width, Channels) to CHW(Channels, Height, Width)
    whyb::amd::hwc2chw(h, w, c, (uint8_t*)src_uint8.data(), (float*)src_float.data(), 1.f / 255.f);

    // step 4. Do AI inference
    // input: src_float ==infer==> output: out_float

    // step 5. Convert CHW(Channels, Height, Width) to HWC(Height, Width, Channels)
    whyb::amd::chw2hwc(c, h, w, (float*)out_float.data(), (uint8_t*)out_uint8.data(), 255.f);

    whyb::amd::release();
    std::cout << "rocm example done" << std::endl;
}

int main() {
    cpu_example();
    cuda_example();
    rocm_example();
    return 0;
}

If you are using OpenCV's cv::Mat, please refer to the test/example-opencv.cpp file.

Benchmark Performance Timing Results

The table below shows the benchmark performance timing for different image dimensions, channels, and processing configurations.

RAM: DDR5 2400MHz 4x32-bit channels
CPU(OpenMP): Intel(R) Core(TM) i7-13700K
GPU(CUDA): NVIDIA GeForce RTX 3060 Ti
GPU(ROCm): AMD Radeon RX 6900 XT

	CPU(Single)	CPU(Single)	CPU(OpenMP)	CPU(OpenMP)	CUDA	CUDA	ROCm	ROCm
W x H x C	hwc2chw	chw2hwc	hwc2chw	chw2hwc	hwc2chw	chw2hwc	hwc2chw	chw2hwc
426x240x1	0.097ms	0.110ms	0.113ms	0.030ms	0.022ms	0.019ms	0.059ms	0.053ms
426x240x3	0.331ms	0.314ms	0.061ms	0.068ms	0.022ms	0.019ms	0.062ms	0.059ms
426x240x4	0.439ms	0.415ms	0.082ms	0.082ms	0.020ms	0.019ms	0.062ms	0.061ms
640x360x1	0.217ms	0.236ms	0.048ms	0.052ms	0.022ms	0.021ms	0.062ms	0.061ms
640x360x3	0.743ms	0.705ms	0.147ms	0.140ms	0.036ms	0.021ms	0.060ms	0.059ms
640x360x4	0.881ms	0.921ms	0.219ms	0.203ms	0.025ms	0.021ms	0.057ms	0.053ms
854x480x1	0.393ms	0.415ms	0.094ms	0.089ms	0.025ms	0.024ms	0.063ms	0.060ms
854x480x3	1.328ms	1.269ms	0.250ms	0.232ms	0.029ms	0.024ms	0.052ms	0.052ms
854x480x4	1.717ms	1.670ms	0.263ms	0.262ms	0.034ms	0.027ms	0.054ms	0.051ms
1280x720x1	0.873ms	0.937ms	0.130ms	0.180ms	0.053ms	0.040ms	0.060ms	0.052ms
1280x720x3	2.877ms	2.828ms	0.449ms	0.457ms	0.052ms	0.042ms	0.061ms	0.056ms
1280x720x4	3.558ms	3.848ms	0.719ms	0.616ms	0.054ms	0.045ms	0.062ms	0.056ms
1920x1080x1	1.949ms	2.136ms	0.374ms	0.342ms	0.081ms	0.067ms	0.079ms	0.060ms
1920x1080x3	6.587ms	6.469ms	1.000ms	0.672ms	0.087ms	0.074ms	0.080ms	0.064ms
1920x1080x4	8.144ms	8.615ms	0.832ms	0.914ms	0.103ms	0.080ms	0.077ms	0.057ms
2560x1440x1	3.530ms	3.800ms	0.423ms	0.476ms	0.114ms	0.116ms	0.094ms	0.074ms
2560x1440x3	11.47ms	11.611ms	1.323ms	1.169ms	0.142ms	0.127ms	0.089ms	0.070ms
2560x1440x4	14.14ms	15.273ms	2.391ms	2.567ms	0.154ms	0.136ms	0.094ms	0.075ms
3840x2160x1	7.976ms	8.494ms	1.103ms	1.387ms	0.234ms	0.227ms	0.129ms	0.097ms
3840x2160x3	26.30ms	25.824ms	5.339ms	4.438ms	0.307ms	0.253ms	0.132ms	0.096ms
3840x2160x4	32.94ms	34.718ms	5.805ms	4.514ms	0.323ms	0.272ms	0.131ms	0.097ms
7680x4320x1	31.54ms	34.100ms	5.742ms	4.976ms	0.836ms	0.741ms	0.484ms	0.214ms
7680x4320x3	102.87ms	102.42ms	19.261ms	17.294ms	1.057ms	0.890ms	0.621ms	0.222ms
7680x4320x4	133.08ms	136.31ms	23.398ms	18.445ms	1.144ms	1.013ms	0.686ms	0.220ms
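
Per-call timings like those above can be collected with a small harness. The sketch below is not the project's benchmark code; average_milliseconds is a hypothetical helper that runs a few warm-up calls and then averages wall-clock time over a number of iterations.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: average wall-clock time of fn() in milliseconds.
template <typename Fn>
double average_milliseconds(Fn&& fn, std::size_t iterations, std::size_t warmup = 3) {
    assert(iterations > 0);
    for (std::size_t i = 0; i < warmup; ++i) fn();  // warm caches and thread pools
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) fn();
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

A conversion call would be wrapped in a lambda, for example `average_milliseconds([&] { whyb::cpu::hwc2chw<uint8_t, float>(h, w, c, src, dst); }, 100)`. For GPU paths, wall-clock timing is only meaningful if the call returns after the device work has finished.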

Contact

For any questions or suggestions, please open an issue or contact me.
