Skip to content
This repository has been archived by the owner on May 5, 2020. It is now read-only.
/ soda-compiler Public archive

Stencil with Optimized Dataflow Architecture Compiler

License

Notifications You must be signed in to change notification settings

UCLA-VAST/soda-compiler

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

This repo is archived. For latest version, please see https://github.com/UCLA-VAST/soda.

SODA Compiler

Stencil with Optimized Dataflow Architecture Compiler

Publication

SODA DSL Example

# comments start with hashtag(#)

kernel: blur      # the kernel name, will be used as the kernel name in HLS
burst width: 512  # DRAM burst I/O width in bits, for Xilinx platform by default it's 512
unroll factor: 16 # how many pixels are generated per cycle

# specify the dram bank, type, name, and dimension of the input tile
# the last dimension is not needed and a placeholder '*' must be given
# dram bank is optional
# multiple inputs can be specified but 1 and only 1 must specify the dimensions
input dram 1 uint16: input(2000, *)

# specify an intermediate stage of computation, may appear 0 or more times
local uint16: blur_x(0, 0) = (input(0, 0) + input(0, 1) + input(0, 2)) / 3

# specify the output
# dram bank is optional
output dram 1 uint16: blur_y(0, 0) = (blur_x(0, 0) + blur_x(1, 0) + blur_x(2, 0)) / 3

# how many times the whole computation is repeated (only works if input matches output)
iterate: 1

Getting Started

Prerequisites

  • Python 3.5+ and corresponding pip
  • SDAccel 2018.3 (earlier versions might work but won't be supported)
How to install Python 3.5+ on Ubuntu 16.04+ and CentOS 7?

Ubuntu 16.04+

sudo apt install python3 python3-pip

CentOS 7

sudo yum install python36 python36-pip
sudo alternatives --install /usr/bin/python3 python3 /usr/bin/python3.6 100

Clone the Repo

git clone https://github.com/UCLA-VAST/soda-compiler.git
cd soda-compiler
python3 -m pip install --user -r requirements.txt

Parameter Setup

app=blur
platform=xilinx_u200_xdma_201830_1
# The following can be set via sourcing /path/to/xilinx/sdx/settings64.sh
XILINX_SDX=/path/to/xilinx/sdx
XILINX_VIVADO=/path/to/xilinx/vivado

Generate HLS Kernel Code

src/sodac tests/src/${app}.soda --xocl-kernel ${app}_kernel.cpp

Generate OpenCL Host Code

src/sodac tests/src/${app}.soda --xocl-header ${app}.h
src/sodac tests/src/${app}.soda --xocl-host ${app}.cpp

Create Testbench

cat >${app}_run.cpp <<EOF
#include <cstdio>
#include <cstdlib>

#include "${app}.h"                                                                            

int ${app}_test(const char* xclbin, const int dims[4]);
int main(int argc, char **argv) {
  if (argc != 4) {
    fprintf(stderr, "Usage: \n    %s <xclbin> <input width> <input height>\n", argv[0]);
    return 1;
  }
  int dims[4] = {atoi(argv[2]), atoi(argv[3]), 0, 0};
  return ${app}_test(argv[1], dims);
}
EOF

Compile OpenCL Host Executable

# Please set TILE_SIZE_DIM_0 and UNROLL_FACTOR macros to match the kernel.
g++ -std=c++11 -I${XILINX_SDX}/runtime/include -I${XILINX_VIVADO}/include ${app}.cpp ${app}_run.cpp -o ${app} \
    -lxilinxopencl -DTILE_SIZE_DIM_0=2000 -DUNROLL_FACTOR=2 -fopenmp -Wno-deprecated-declarations -Wall

Create Emulation Config

emconfigutil -f ${platform}

Software Emulation

Compile for Software Emulation

xocc -t sw_emu -f ${platform} --kernel ${app}_kernel --xp prop:kernel.${app}_kernel.kernel_flags="-std=c++0x" \
    -c ${app}_kernel.cpp -o ${app}.sw_emu.xo

Link for Software Emulation

xocc -t sw_emu -f ${platform} -l ${app}.sw_emu.xo -o ${app}.sw_emu.xclbin

Run Software Emulation

XCL_EMULATION_MODE=sw_emu ./${app} ${app}.sw_emu.xclbin 2000 100

High-Level Synthesis

xocc -t hw -f ${platform} --kernel ${app}_kernel --xp prop:kernel.${app}_kernel.kernel_flags="-std=c++0x" \
    -c ${app}_kernel.cpp -o ${app}.hw.xo

Hardware Emulation

Link for Hardware Emulation

xocc -t hw_emu -f ${platform} -l ${app}.hw.xo -o ${app}.hw_emu.xclbin

Run Hardware Emulation

# By default, kernel ports are connected via DRAM bank 1 on the xilinx_u200_xdma_201830_1 platform.
DRAM_IN=1 DRAM_OUT=1 XCL_EMULATION_MODE=hw_emu ./${app} ${app}.hw_emu.xclbin 2000 10

Hardware Deployment

Logic Synthesis, Place, and Route

xocc -t hw -f ${platform} -l ${app}.hw.xo -o ${app}.hw.xclbin

Run Bitstream on FPGA

# By default, kernel ports are connected via DRAM bank 1 on the xilinx_u200_xdma_201830_1 platform.
DRAM_IN=1 DRAM_OUT=1 ./${app} ${app}.hw.xclbin 2000 1000

Code Snippet Example

Source Code

kernel: jacobi2d
burst width: 512
unroll factor: 2
input float: t1(2000, *)
output float: t0(0, 0) = (t1(0, 1) + t1(1, 0) + t1(0, 0) + t1(0, -1) + t1(-1, 0)) * 0.2f
iterate: 1

HLS Kernel Code

Each function in the below code snippets is synthesized into an RTL module. Their arguments are all hls::stream FIFOs; Without unrolling, a simple line-buffer pipeline is generated, producing 1 pixel per cycle. With unrolling, a SODA microarchitecture pipeline is generated, procuding 2 pixeles per cycle.

Without Unrolling (--unroll-factor=1)

#pragma HLS dataflow
Module1Func(
  /*output*/ &from_t1_offset_0_to_t1_offset_1999,
  /*output*/ &from_t1_offset_0_to_t0_pe_0,
  /* input*/ &from_super_source_to_t1_offset_0);
Module2Func(
  /*output*/ &from_t1_offset_1999_to_t1_offset_2000,
  /*output*/ &from_t1_offset_1999_to_t0_pe_0,
  /* input*/ &from_t1_offset_0_to_t1_offset_1999);
Module3Func(
  /*output*/ &from_t1_offset_2000_to_t1_offset_2001,
  /*output*/ &from_t1_offset_2000_to_t0_pe_0,
  /* input*/ &from_t1_offset_1999_to_t1_offset_2000);
Module3Func(
  /*output*/ &from_t1_offset_2001_to_t1_offset_4000,
  /*output*/ &from_t1_offset_2001_to_t0_pe_0,
  /* input*/ &from_t1_offset_2000_to_t1_offset_2001);
Module4Func(
  /*output*/ &from_t1_offset_4000_to_t0_pe_0,
  /* input*/ &from_t1_offset_2001_to_t1_offset_4000);
Module5Func(
  /*output*/ &from_t0_pe_0_to_super_sink,
  /* input*/ &from_t1_offset_0_to_t0_pe_0,
  /* input*/ &from_t1_offset_1999_to_t0_pe_0,
  /* input*/ &from_t1_offset_2000_to_t0_pe_0,
  /* input*/ &from_t1_offset_4000_to_t0_pe_0,
  /* input*/ &from_t1_offset_2001_to_t0_pe_0);

In the above code snippet, Module1Func to Module4Func are forwarding modules; they constitute the data-reuse line buffer. The line buffer size is approximately two lines of pixels, i.e. 4000 pixels. Module5Func is a computing module; it implements the computation kernel. The whole design is fully pipelined; however, with only 1 computing module, it can only produce 1 pixel per cycle.

Unroll 2 Times (--unroll-factor=2)

#pragma HLS dataflow
Module1Func(
  /*output*/ &from_t1_offset_1_to_t1_offset_1999,
  /*output*/ &from_t1_offset_1_to_t0_pe_0,
  /* input*/ &from_super_source_to_t1_offset_1);
Module1Func(
  /*output*/ &from_t1_offset_0_to_t1_offset_2000,
  /*output*/ &from_t1_offset_0_to_t0_pe_1,
  /* input*/ &from_super_source_to_t1_offset_0);
Module2Func(
  /*output*/ &from_t1_offset_1999_to_t1_offset_2001,
  /*output*/ &from_t1_offset_1999_to_t0_pe_1,
  /* input*/ &from_t1_offset_1_to_t1_offset_1999);
Module3Func(
  /*output*/ &from_t1_offset_2000_to_t1_offset_2002,
  /*output*/ &from_t1_offset_2000_to_t0_pe_1,
  /*output*/ &from_t1_offset_2000_to_t0_pe_0,
  /* input*/ &from_t1_offset_0_to_t1_offset_2000);
Module4Func(
  /*output*/ &from_t1_offset_2001_to_t1_offset_4001,
  /*output*/ &from_t1_offset_2001_to_t0_pe_1,
  /*output*/ &from_t1_offset_2001_to_t0_pe_0,
  /* input*/ &from_t1_offset_1999_to_t1_offset_2001);
Module5Func(
  /*output*/ &from_t1_offset_2002_to_t1_offset_4000,
  /*output*/ &from_t1_offset_2002_to_t0_pe_0,
  /* input*/ &from_t1_offset_2000_to_t1_offset_2002);
Module6Func(
  /*output*/ &from_t1_offset_4001_to_t0_pe_0,
  /* input*/ &from_t1_offset_2001_to_t1_offset_4001);
Module7Func(
  /*output*/ &from_t0_pe_0_to_super_sink,
  /* input*/ &from_t1_offset_1_to_t0_pe_0,
  /* input*/ &from_t1_offset_2000_to_t0_pe_0,
  /* input*/ &from_t1_offset_2001_to_t0_pe_0,
  /* input*/ &from_t1_offset_4001_to_t0_pe_0,
  /* input*/ &from_t1_offset_2002_to_t0_pe_0);
Module8Func(
  /*output*/ &from_t1_offset_4000_to_t0_pe_1,
  /* input*/ &from_t1_offset_2002_to_t1_offset_4000);
Module7Func(
  /*output*/ &from_t0_pe_1_to_super_sink,
  /* input*/ &from_t1_offset_0_to_t0_pe_1,
  /* input*/ &from_t1_offset_1999_to_t0_pe_1,
  /* input*/ &from_t1_offset_2000_to_t0_pe_1,
  /* input*/ &from_t1_offset_4000_to_t0_pe_1,
  /* input*/ &from_t1_offset_2001_to_t0_pe_1);

In the above code snippet, Module1Func to Module6Func and Module8Func are forwarding modules; they constitute the reuse buffers of the SODA microarchitecture. Although unrolled, the reuse buffer size is still approximately two lines of pixels, i.e. 4000 pixels. Module7Func is a computing module; it is instanciated twice. The whole design is fully pipelined and can produce 2 pixel per cycle. In general, the unroll factor can be set to any number that satisfies the throughput requirement.

Design Considerations

  • kernel, burst width, unroll factor, input, output, and iterate keywords are mandatory
  • For non-iterative stencil, unroll factor shall be determined by the DRAM bandwidth, i.e. saturate the external bandwidth, since the resource is usually not the bottleneck
  • For iterative stencil, prefer to use more PEs in a single iteration rather than implement more iterations
  • Note that 2.0 will be a double number. To generate float, use 2.0f. This may help reduce DSP usage
  • SODA is tiling-based and the size of the tile is specified in the input keyword. The last dimension is a placeholder because it is not needed in the reuse buffer generation

Projects Using SODA

About

Stencil with Optimized Dataflow Architecture Compiler

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published