# HLSpolito/KNN forked from fahadmuslim/KNN

No description or website provided.
C++ C Tcl
This branch is 1 commit ahead of fahadmuslim:master. Fetching latest commit…
Cannot retrieve the latest commit at this time.
Type Name Latest commit message Commit time
Failed to load latest commit information. data figures knn_cpu knn_fpga README.md license

# k-Nearest Neighbor algorithm

## Overview:

This project is aimed at using SDAccel to implement the k-Nearest Neighbor algorithm onto a Xilinx FPGA. The nearest neighbor algorithm is used to find the k nearest neighbors of a specified point among a set of unstructured data points. It is widely used in a diverse range of domains and applications such as pattern recognition, machine learning, computer vision and coding theory to name a few. For a set S of n reference data points in a d-dimensional space and a query point q, the k-nearest neighbor algorithm finds the k closest points in S to the point q. This is illustrated in Figure 1 for k=3 and n=12. The red triangle represents the query point while the blue diamonds represent the points in the reference data set in Figure 1.

![alt text](./figures/knn.png) Figure 1: Pictorial demonstration of KNN search algorithm

The algorithm includes the following main steps:

1. Compute n distances between the query point q and the n reference points of the set S. The distances in our case are the squared Euclidean distances which for a given set of two 2-dimensional points (x1,y1) and (x2,y2) are as follows:
_d = (x1 - x2)2 + (y1 - y2)2_ 2. Sort the _n_ distances while preserving their original indices in _S_. 3. Return the _k_ points from _S_ corresponding to the lowest distances of the sorted distance array.

Further details can be found in the following paper. If you find this work useful for your research, please consider citing:

``````@ARTICLE{7859319,
author={F. B. Muslim and L. Ma and M. Roozmeh and L. Lavagno},
journal={IEEE Access},
title={Efficient FPGA Implementation of OpenCL High-Performance Computing Applications via High-Level Synthesis},
year={2017},
volume={5},
pages={2747-2762},
doi={10.1109/ACCESS.2017.2671881},
}
``````

## Getting Started:

The repository contains three directories, namely "data", "knn_cpu" and "knn_fpga". The "data" directory contains a file with a sample set of reference data points. The "knn_cpu" and "knn_fpga" directories contain two different implementations of the KNN algorithm, in which the nearest neighbor identification is done on the CPU and FPGA respectively. Each implementation contains the OpenCL code of the algorithm along with the required host code and the Tcl script used to run the example in SDAccel.

###File Tree

``````KNN
│
└── data
│   └─  filelist.txt
│
└── knn_cpu
│   │   main_cpu.cpp
│   │   nn_cpu.cl
│   │   params.h
│   └─  solution_cpu.tcl
│
└── knn_fpga
│   main_fpga.cpp
│   nn_fpga.cl
│   params.h
└─  solution_fpga.tcl
``````
File/Dir name Information
knn_cpu Contains the "single kernel implementation" of the algorithm. The nearest neighbors identification in this implementation is done on the host.
knn_fpga Contains the "two kernel implementation" of the algorithm, which uses a memory architecture optimization provided by SDAccel that implements the global memories used to communicate between the kernels as streaming buffers on the FPGA Block RAMs. The nearest neighbor identification in this case is done on the FPGA.
filelist.txt Contains the points of a reference data set (294912 points).
params.h Some constant parameters (see below).
main_cpu.cpp The host code for the implementation with nearest neighbor identification done on the host.
nn_cpu.cl The kernel which calculates all the distances and writes them to the host for nearest neighbors identification.
solution_cpu.tcl Script for the "single kernel implementation" in SDAccel.
main_fpga.cpp The host code for the "two kernel implementation" of the algorithm, where the distances are calculated by multiple work groups in one kernel and then the neighbors are computed by a single work-group (assuming that k is small) in another kernel.
nn_fpga.cl The OpenCL code using global memory buffers to stream data between two kernels. One of the kernels calculates the distance between the query point and all the points in the reference data set. The second kernel identifies the k nearest neighbors using a single work-group.
solution_fpga.tcl Script for the "two kernel implementation" in SDAccel.

### Parameters

The number of nearest neighbors to be returned, the required work group size and the maximum size of an input file line are defined in "params.h"

In the "two kernel" implementation, the size of the inter-kernel on-chip global memory buffer is also specified. Note that SDAccel implements it using a streaming buffer, hence its specific size does not actually matter.

### How to run an example

The code has been tested with Xilinx SDAccel 2015.1.

The script files for each test case can be used as follows:

sdaccel solution_cpu.tcl
sdaccel solution_fpga.tcl

The k nearest neighbors, the given query point, and the total number of points are printed on standard output.

### Sample Output

The sample output after a successful run would look like:

![alt text](./figures/samplerun.png) Figure 2: Sample Output after a successful run

## Performance Metrics

We considered the following performance metrics:

Initiation Interval: Indicating the number of clock cycles between two consecutive iterations of a loop
Latency: Indicating the number of clock cycles required to complete the execution of the entire kernel

Both metrics have a profound effect on the execution time and energy consumption in case of an FPGA implementation. The various optimization options applied cause a significant reduction in the overall latency for the different implementations. Both the cases use "reqd_work_group_size" attribute necessary to allow for performance optimization while generating custom logic for a kernel.
The latency after employing work-item pipeline optimization for the "knn_cpu" case goes from 7433 clock cycles down to 294 clock cycles. For the "knn_fpga" case, that uses the on-chip block RAMs to implement global memory buffers used for inter-kernel communication, the best case latency reduces to 292 clock cycles from 5901 clock cycles in the un-optimized case. This improvement in the overall latency shows up in the performance comparison tables given below.

### Performance Comparison

The FPGA vs GPU performance comparison for the "knn_cpu" and "knn_fpga" implementations in terms of execution time, power and energy consumption is described here. Note that in the "knn_cpu" case, the nearest neighbors identification time on the host (the CPU) has also been added to the distance calculation time on FPGA to get the total execution time. This "sorting and neighbors identification time" for "knn_cpu" case is also reported below. The execution time on FPGA implementation in each case has been calculated by using extrapolation to take into account the total number of work-groups. The devices used for comparison are the following:

• NVIDIA GeForce GTX 960 with 1024 cores
• NVIDIA Quadro K4200 with 1344 cores
• Xilinx Virtex 7 xc7vx690tffg1157-2

knn_cpu

Platform Total Time Sort Time (CPU) Power(W) Energy(mJ)
Virtex 7 5.24ms 4.0ms 0.422 0.523
GTX 960 3.04ms 3.0ms 120 4.4
K4200 3.05ms 3.0ms 108 5.6
Area Utilization Values
BRAMs 0
DSPs 12 (0.33%)
FFs 3109 (0.36%)
LUTs 2006 (0.46%)

knn_fpga

Platform Time Power(W) Energy(J)
Virtex 7 1.23ms 3.136 0.0039
GTX 960 0.93s 120 111.6
K4200 3.11s 108 335.88
Area Utilization Values
BRAMs 512 (34.83%)
DSPs 12 (0.33%)
FFs 23892 (2.78%)
LUTs 11838 (2.76%)