# Evaluating AIWC metrics between CUDA and OpenCL

We evaluate how AIWC metrics change between CUDA and OpenCL implementations of the same algorithm.
The OpenCL and CUDA codes are taken from the [Rodinia Benchmark Suite](https://github.com/BeauJoh/rodinia.git), which was chosen because its original goal was to compare languages on heterogeneous computing architectures. It currently provides CUDA, OpenCL, OpenMP and OpenACC versions of several application codes.

`cocl`, part of [coriander](https://github.com/hughperkins/coriander.git), is used to translate the CUDA code to OpenCL.


## Set environment variables

In [12]:
%env COCL=/root/coriander/bin/cocl
%env NVCC=/usr/local/cuda/bin/nvcc

env: COCL=/root/coriander/bin/cocl
env: NVCC=/usr/local/cuda/bin/nvcc


## Compile CUDA code using the standard NVIDIA compiler

In [20]:
! $NVCC ./gaussian_cuda_version/gaussian.cu -o gaussian_cuda

## Compile OpenCL version of the CUDA code using cocl

In [21]:
! $COCL ./gaussian_cuda_version/gaussian.cu -o gaussian_opencl


Please use: `cocl_py`, which is easier to maintain, and portable

cocl args: ./gaussian_cuda_version/gaussian.cu -o gaussian_opencl
LLVM_COMPILE_FLAGS -I/coriander/coriander/soft/llvm-4.0/include -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/coriander/coriander/soft/llvm-4.0/include -fPIC -fvisibility-inlines-hidden -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wcovered-switch-default -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wstring-conversion -Werror=date-time -std=c++11 -ffunction-sections -fdata-sections -fexceptions -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS
+ /coriander/coriander/soft/llvm-4.0/bin/clang++ -DUSE_CLEW -std=c++11 -x cuda -D__CORIANDERCC__ -D__CUDACC__ --cuda-gpu-arch=sm_30 -nocudalib -nocudainc --cuda-device-only -emit-llvm -O2 -S -Wno-gnu-anonymous-struct -Wno-nested-anon-types -I/coriander/coriander/soft/llvm-4.

## Functionality test of the OpenCL generated version

The code must produce the same results regardless of the compiler and backend before any further evaluation is performed.

In [14]:
! ./gaussian_cuda ./matrix4.txt

Matrix m is: 
    0.00     0.00     0.00     0.00 
    0.50     0.00     0.00     0.00 
    0.67     0.26     0.00     0.00 
   -0.00     0.15    -0.28     0.00 

Matrix a is: 
   -0.60    -0.50     0.70     0.30 
    0.00    -0.65    -0.05     0.55 
   -0.00     0.00    -0.75    -1.14 
    0.00    -0.00     0.00     0.50 

Array b is: 
-0.85 -0.25 0.87 -0.25 

The final solution is: 
0.70 0.00 -0.40 -0.50 


Time total (including memory transfers)	0.391966 sec
Time for CUDA kernels:	0.000064 sec


In [34]:
! ./gaussian_opencl ./matrix4.txt

Cannot choose gpu device more than 0
terminate called after throwing an instance of 'std::runtime_error'
  what():  gpu device ordinal beyond range of number of gpus
Aborted (core dumped)


## Functionality test of the generated vs hand-coded versions

Next, we test for the same functionality against a manually written version.

### Compile the hand-coded version

In [27]:
!cd gaussian_opencl_version/ && g++ gaussianElim.cpp  clutils.cpp utils.cpp -lOpenCL -std=c++11 -o gaussian_hand_opencl && cd ..
!mv gaussian_opencl_version/gaussian_hand_opencl . && cp gaussian_opencl_version/gaussianElim_kernels.cl .

[01m[Kclutils.cpp:[m[K In function '[01m[K_cl_context* cl_init(char)[m[K':
     commandQueueProf = clCreateCommandQueue(context, device,
                        ^
In file included from clutils.cpp:57:0:
/usr/include/CL/cl.h:1359:1: note: declared here
 clCreateCommandQueue(cl_context                     /* context */,
 ^
     commandQueueProf = clCreateCommandQueue(context, device,
                        ^
In file included from clutils.cpp:57:0:
/usr/include/CL/cl.h:1359:1: note: declared here
 clCreateCommandQueue(cl_context                     /* context */,
 ^
                             CL_QUEUE_PROFILING_ENABLE, &status);
                                                               ^
In file included from clutils.cpp:57:0:
/usr/include/CL/cl.h:1359:1: 

In [30]:
! ./gaussian_hand_opencl ./matrix4.txt -p 0 -d 0

Using Platform 0 	 Device No 0 
Creating CPU Context

	gaussianElim_kernels.cl
The result of matrix m is: 
    0.00     0.00     0.00     0.00 
    0.50     0.00     0.00     0.00 
    0.67     0.26     0.00     0.00 
   -0.00     0.15    -0.28     0.00 

The result of matrix a is: 
   -0.60    -0.50     0.70     0.30 
    0.00    -0.65    -0.05     0.55 
   -0.00     0.00    -0.75    -1.14 
    0.00    -0.00     0.00     0.50 

The result of array b is: 
-0.85 -0.25 0.87 -0.25 

The final solution is: 
0.70 0.00 -0.40 -0.50 
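
Comparing the printed solutions by eye works for a 4x4 matrix but not for larger inputs. The following is a minimal sketch, not part of the original notebook, of how the check could be automated. It assumes each program prints `The final solution is:` followed by one line of numbers, as in the outputs above; the helper names `final_solution` and `same_solution` and the tolerance of 0.01 (chosen because values are printed to two decimals) are illustrative assumptions.

```python
import math

MARKER = "The final solution is:"

def final_solution(output):
    """Return the floats on the first non-blank line after MARKER."""
    lines = output.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == MARKER:
            for candidate in lines[i + 1:]:
                if candidate.strip():
                    return [float(tok) for tok in candidate.split()]
    raise ValueError("no solution line found in program output")

def same_solution(out_a, out_b, tol=0.01):
    """True when both outputs report the same solution within tol (assumed tolerance)."""
    a, b = final_solution(out_a), final_solution(out_b)
    return len(a) == len(b) and all(
        math.isclose(x, y, abs_tol=tol) for x, y in zip(a, b)
    )

# Tail of the outputs shown above, copied verbatim from the two runs.
cuda_out = "Array b is: \n-0.85 -0.25 0.87 -0.25 \n\nThe final solution is: \n0.70 0.00 -0.40 -0.50 \n"
hand_out = "The result of array b is: \n-0.85 -0.25 0.87 -0.25 \n\nThe final solution is: \n0.70 0.00 -0.40 -0.50 \n"

print(same_solution(cuda_out, hand_out))  # prints True
```

A run that aborts before printing a solution, such as the `gaussian_opencl` failure above, makes `final_solution` raise `ValueError`, so a crash is not mistaken for a match.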



## Run-time comparison between all three versions
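
One possible way to collect run times for the three binaries is to measure wall-clock time around each invocation. The sketch below is an assumption about how this could be done, not the method used in this notebook. The helper `time_command` and the repeat count are hypothetical. Wall-clock time includes process start-up and memory transfers, so it is not comparable to the kernel-only time that `gaussian_cuda` reports.

```python
import subprocess
import sys
import time

def time_command(cmd, repeats=5):
    """Run cmd `repeats` times; return the wall-clock time of each run in seconds."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the program exits non-zero.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        times.append(time.perf_counter() - start)
    return times

# Stand-in command so the sketch runs anywhere; in this notebook it would be
# replaced by e.g. ["./gaussian_cuda", "./matrix4.txt"].
demo_times = time_command([sys.executable, "-c", "pass"], repeats=2)
print(len(demo_times))  # prints 2
```

Because `check=True` raises on a non-zero exit status, a binary that aborts is reported as an error rather than timed.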

## AIWC feature-space comparison between the generated and hand-coded version

Here the two feature sets should be similar enough to show that AIWC and its OpenCL back-end are suitable for language-agnostic, architecture-independent workload characterization.
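
"Similar enough" can be made concrete with a per-metric relative difference. The sketch below is only an illustration under stated assumptions: the metric names, the values and the 10% threshold are hypothetical placeholders. They are not AIWC output and not measured results.

```python
def relative_differences(generated, hand_coded):
    """Per-metric |g - h| / max(|g|, |h|) over shared metrics; 0.0 when both are zero."""
    diffs = {}
    for name in sorted(set(generated) & set(hand_coded)):
        g, h = generated[name], hand_coded[name]
        scale = max(abs(g), abs(h))
        diffs[name] = 0.0 if scale == 0 else abs(g - h) / scale
    return diffs

def similar_enough(generated, hand_coded, threshold=0.1):
    """True when at least one metric is shared and all shared metrics differ by <= threshold."""
    diffs = relative_differences(generated, hand_coded)
    return bool(diffs) and all(d <= threshold for d in diffs.values())

# Placeholder feature vectors (hypothetical names and values, for illustration only).
generated = {"total_instructions": 100.0, "branch_entropy": 0.5}
hand_coded = {"total_instructions": 110.0, "branch_entropy": 0.5}

print(similar_enough(generated, hand_coded))        # prints True
print(similar_enough(generated, hand_coded, 0.05))  # prints False
```

A relative measure is used here so that metrics on very different scales contribute equally; other distance measures would be equally defensible.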

In [None]:
 -p 0 -d 0 matrix4.txt