A "Hello World" equivalent for Intel FPGA devices using Intel oneAPI (up to version 2025.0). This example demonstrates vector addition (`C[i] = A[i] + B[i]`) on FPGA hardware using the SYCL programming model.
This example performs element-wise addition of two vectors using Intel's oneAPI SYCL framework. The kernel reads from two input arrays and writes the sum to an output array. The implementation uses manual memory management with explicit `memcpy` operations for clarity and educational purposes.
The FPGA development workflow consists of three stages with increasing compilation time:
- Emulation (seconds), for functional validation
  - CPU emulation: `make cpu`
  - FPGA emulator: `make fpga_emu`
- Report Generation (minutes)
  - `make report`: generates hardware resource utilization and performance reports
  - Requires BOARD_NAME configuration
- Hardware Compilation (hours)
  - `make fpga`: full hardware synthesis and place-and-route
  - `make recompile_fpga`: recompiles host code with existing kernel binary
  - Requires BOARD_NAME configuration
Before generating reports or executing on FPGA hardware, you must specify your target board in the Makefile:
BOARD_NAME := your_board_name
Common board names:
- `intel_a10gx_pac:pac_a10`: Intel Arria 10
- `intel_s10gx_pac:pac_s10`: Intel Stratix 10
- `ia840f:ofs_ia840f`: Intel Agilex 7 (Bittware's IA840F)
- `/path/to/IOFS_BUILD_ROOT/oneapi-asp/<folder>:<variant>`
Edit the Makefile and uncomment/modify the appropriate board configuration for your system.
# CPU execution (fastest for development)
make cpu
./vec_add.cpu
# FPGA emulator (seconds)
make fpga_emu
./vec_add.fpga_emu
# Generate hardware reports
make report
# Full hardware compilation
make fpga
./vec_add.fpga
# Recompile only host code (seconds vs hours)
make recompile_fpga
- `main.cxx`: Host code with memory allocation, data transfer, and timing
- `kernel.cxx`: Device kernel implementing vector addition
- `kernel.hpp`: Header with SYCL kernel interface
- `Makefile`: Build system with multiple targets
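This example uses manual memory allocation with explicit `memcpy` operations:
- `malloc_device<DATATYPE>()` for device memory
- `queue.memcpy()` for host-device transfers
- No implicitly migrated Unified Shared Memory (`malloc_shared`); `malloc_device` is an explicit device allocation, so every transfer appears in the host code, for educational clarity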
The kernel uses a `single_task` with optional loop unrolling:
h.single_task([=]() [[intel::kernel_args_restrict]] {
    #pragma unroll UNROLL
    for (size_t i = 0; i < N; ++i) {
        const DATATYPE a = d_A[i];
        const DATATYPE b = d_B[i];
        d_res[i] = a + b;
    }
});
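To picture what the unroll pragma asks the compiler to build, here is a minimal host-only sketch in plain C++ (no SYCL). The function names and the factor of 4 are hypothetical and chosen for illustration; on the FPGA the compiler performs this transformation itself and replicates the adder in hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain rolled loop: one addition per iteration.
template <typename T>
void vec_add_reference(const T* a, const T* b, T* res, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        res[i] = a[i] + b[i];
    }
}

// Hand-unrolled by a (hypothetical) factor of 4: four independent additions
// per iteration, followed by a remainder loop for the leftover elements.
template <typename T>
void vec_add_unrolled4(const T* a, const T* b, T* res, std::size_t n) {
    constexpr std::size_t kFactor = 4;
    const std::size_t main_end = n - (n % kFactor);
    for (std::size_t i = 0; i < main_end; i += kFactor) {
        res[i + 0] = a[i + 0] + b[i + 0];
        res[i + 1] = a[i + 1] + b[i + 1];
        res[i + 2] = a[i + 2] + b[i + 2];
        res[i + 3] = a[i + 3] + b[i + 3];
    }
    for (std::size_t i = main_end; i < n; ++i) {
        res[i] = a[i] + b[i];
    }
}
```

Both functions produce the same result; the unrolled form simply exposes four additions that can be issued in the same clock cycle, which is why unrolling costs extra FPGA resources and memory bandwidth.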
- II stands for "Initiation Interval" - the number of clock cycles between starting consecutive iterations of a loop. An II of 1 means the loop can start a new iteration every clock cycle (optimal). Higher II values indicate pipeline stalls or resource conflicts.
- Frequency: Design clock frequency read from executable file (using `aocl info`), used for II calculations. FPGA designs run at fixed frequencies (typically 200-500 MHz) determined during synthesis, unlike CPUs that can dynamically adjust clock speeds.
- Throughput: Measured memory transfer rate in GB/s, reported separately for kernel (load/store) and host-device transfers
- Loop Unrolling: Controlled by the `UNROLL` macro at compile time. Disabled by default
- Restrict pointers: the `__restrict__` keyword tells the compiler that the pointers do not alias
- Kernel arguments restrict: the `[[intel::kernel_args_restrict]]` attribute
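The II and throughput figures can be derived from the measured kernel time and the design frequency. The sketch below shows one plausible way to do it; the function names and the exact formulas are assumptions for illustration, not necessarily what `main.cxx` computes. It assumes the kernel time is dominated by the loop and that each element costs two loads and one store.

```cpp
#include <cassert>
#include <cstddef>

// Estimated II = elapsed clock cycles / loop iterations issued.
// With an unroll factor U, the loop issues ceil(n / U) iterations.
inline double estimate_ii(double kernel_seconds, double freq_mhz,
                          std::size_t n, std::size_t unroll = 1) {
    assert(n > 0 && unroll > 0);
    const double cycles = kernel_seconds * freq_mhz * 1e6;
    const double iterations = static_cast<double>((n + unroll - 1) / unroll);
    return cycles / iterations;
}

// Kernel throughput in GB/s, counting two loads (A, B) and one store (result)
// per element.
inline double kernel_throughput_gbs(double kernel_seconds, std::size_t n,
                                    std::size_t elem_bytes) {
    assert(kernel_seconds > 0.0);
    const double bytes = 3.0 * static_cast<double>(n) * static_cast<double>(elem_bytes);
    return bytes / kernel_seconds / 1e9;
}
```

For instance, 1,000,000 `double` elements processed in 4 ms at 250 MHz give an estimated II of about 1 and roughly 6 GB/s. Because the measured time also includes pipeline fill and launch overhead, the estimate tends to sit slightly above the II shown in the compiler report.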
# Use float instead of double
make USE_FLOAT=1 fpga
# Enable unrolling (modify kernel.hpp)
#define UNROLL 1
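As a rough picture of how a build flag such as `USE_FLOAT=1` could reach the code, the sketch below assumes the Makefile forwards it as a preprocessor definition (`-DUSE_FLOAT`) and that a header selects the element type from it. This is a hypothetical excerpt, not the actual contents of `kernel.hpp`.

```cpp
#include <type_traits>

// Hypothetical type selection: double unless the build defines USE_FLOAT.
#ifdef USE_FLOAT
using DATATYPE = float;
#else
using DATATYPE = double;
#endif

// Guard against accidentally selecting a non-floating-point type.
static_assert(std::is_floating_point<DATATYPE>::value,
              "DATATYPE is expected to be a floating-point type");

// Bytes moved per element by the kernel: two loads and one store.
constexpr unsigned long kBytesPerElement = 3 * sizeof(DATATYPE);
```

Switching from `double` to `float` halves `kBytesPerElement`, which is the main reason the data type affects both memory footprint and achievable throughput.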
- Separated Host/Kernel Compilation: Kernel compiled to dynamic library, enabling fast host-only recompilation
- Multiple Execution Targets: CPU, emulator, simulator, and hardware
- Comprehensive Timing: Detailed breakdown of memory transfers and computation
- Performance Analysis: Automatic II calculation and throughput estimation
- Verification: Built-in result validation against CPU reference
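The validation step can be sketched as a host-side comparison of the device result against `a[i] + b[i]` computed on the CPU. The helper below is a minimal illustration; its name and the optional tolerance parameter are hypothetical, and the example's own check may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Counts the elements where the device result differs from the CPU
// reference a[i] + b[i] by more than tol (0 means exact match required).
template <typename T>
std::size_t count_mismatches(const std::vector<T>& a, const std::vector<T>& b,
                             const std::vector<T>& device_res, T tol = T(0)) {
    assert(a.size() == b.size() && a.size() == device_res.size());
    std::size_t bad = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const T expected = a[i] + b[i];
        if (std::fabs(device_res[i] - expected) > tol) {
            ++bad;
        }
    }
    return bad;
}
```

A return value of 0 means the FPGA (or emulator) output matches the reference. For a single floating-point addition an exact comparison is usually sufficient; the tolerance is there for kernels that reorder operations.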
- Board not found: Verify `BOARD_NAME` in Makefile
- Memory insufficient: Reduce problem size or use smaller data type
- Compilation errors: Check oneAPI installation and board support
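For the "Memory insufficient" case, a quick estimate of the device memory the example needs can help pick a problem size. The sketch below assumes three device arrays (A, B, and the result) of equal length; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Device memory needed for `arrays` buffers of n elements each, in bytes.
template <typename T>
std::size_t device_footprint_bytes(std::size_t n, std::size_t arrays = 3) {
    return arrays * n * sizeof(T);
}

// Largest element count whose buffers fit into a given memory budget.
template <typename T>
std::size_t max_elements(std::size_t budget_bytes, std::size_t arrays = 3) {
    assert(arrays > 0);
    return budget_bytes / (arrays * sizeof(T));
}
```

For example, `device_footprint_bytes<double>(n)` is twice `device_footprint_bytes<float>(n)`, so switching to `float` doubles the largest problem size that fits on the board.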
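- Use loop unrolling to raise throughput (more elements per clock cycle, i.e. a lower effective II per element), at the cost of more FPGA resources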
- Consider data type impact (float vs double, or even more specific data types, using ac_int / ap_float)
- Monitor memory bandwidth utilization
- Intel oneAPI Base Toolkit (2024.0 through 2025.0)
- Intel oneAPI HPC Toolkit
- Intel FPGA Add-on for oneAPI
- Supported Intel FPGA board
MIT License - see LICENSE file for details.
This example is provided as-is for educational purposes. The MIT License allows for maximum freedom in using, modifying, and distributing the code while providing minimal liability protection.