<a href="https://colab.research.google.com/github/AmadoMaria/hands-on-supercomputing-with-parallel-computing/blob/master/Maria_Amado_e_Fernanda_Lisboa_report_handson_6_jupyter_2022.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# Hands-on 6: Portable Parallel Programming with CUDA

M. Amado$^1$, F. Lisboa$^1$

$^1$ Department of Computer Engineering – University SENAI CIMATEC, Salvador, Bahia, Brazil  

# Abstract

To optimize computationally expensive applications that exhibit fine-grained and coarse-grained parallelism, NVIDIA developed the CUDA framework, which enables general-purpose GPU programming. Developers can thus exploit multicore and many-thread execution and convert sequential _.c_ programs into parallel _.cu_ ones with little effort. This practice explores CUDA, which we found simple to use and powerful, with the potential to reduce a program's execution time.


# Introduction

High-performance programming on GPUs (Graphics Processing Units) exploits both multicore and many-thread execution, which makes it well suited to processing large amounts of data. NVIDIA's CUDA framework is one option for implementing general-purpose parallel applications, and it keeps the implementation simple and portable [1].

Unlike CPU implementations, which work with a few cores (two or four, for example), GPU architectures are equipped with hundreds of cores that can run thousands of threads in parallel [2]. CUDA is a scalable parallel programming model because the same code can run on a different number of processors without recompilation. In addition, CUDA applications can also run on CPUs, since the same source code can be compiled for different platforms [3].

Because parallel programming with CUDA makes it possible to use many threads and processes, it covers both fine-grained and coarse-grained parallelism. This makes it well suited to computationally expensive programs in areas such as visual computing and other workloads with high arithmetic intensity [3].

Thus, this activity aims to explore CUDA by parallelizing a sequential code written in C.

# Results and Discussion

### Installing necessary libraries

In [None]:
from IPython.display import clear_output

In [None]:
!sudo apt install libomp-dev
clear_output(wait=False)

In [None]:
!sudo apt-get install openmpi-bin
clear_output(wait=False)

### SAXPY sequential implementation:


To illustrate parallelization with CUDA, we start from a sequential implementation of SAXPY (Single-Precision A·X Plus Y), a routine from the standard Basic Linear Algebra Subprograms (BLAS) that computes y = a·x + y. The version used here takes a = 1, so it reduces to an element-wise vector addition.
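
The code in this report leaves out the scalar factor of the BLAS routine. For comparison, a minimal sketch of the general form is shown below. The function name `saxpy_scaled` and the parameter `a` are illustrative assumptions and do not appear in the original code.

```c
#include <stddef.h>

/* Hypothetical general form of SAXPY: y[i] = a * x[i] + y[i].
   With a == 1.0f this reduces to the element-wise addition used in this report. */
void saxpy_scaled(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Calling `saxpy_scaled(n, 1.0f, x, y)` would give the same result as the `saxpy(n, x, y)` function in the next cell.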

#### Code

In [None]:
%%writefile saxpy.cu

#include <stdio.h>
#include <stdlib.h>

// Element-wise addition: y[i] = x[i] + y[i] (SAXPY with a = 1).
void saxpy ( int n,  float *x, float *y ) {

  for (int i=0; i < n ; ++i)
    y[i] = x[i] + y[i];

}

// Print the vector on one tab-separated line.
void printVector ( float *vector, int n ) {

  for (int i=0; i < n ; ++i)
    printf("%1.0f\t", vector[i]);

  printf("\n\n");

}

// Fill the vector with 1, 2, ..., n.
void generateVector (float *vector, int n) {

  for (int i=0; i < n ; ++i)
    vector[i] = i + 1;

}

int main (int argc, char *argv[]) {

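  // The vector length is read from the first command-line argument.
  if (argc < 2) {
    fprintf(stderr, "Usage: %s <n>\n", argv[0]);
    return 1;
  }
  int n = atoi(argv[1]);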
  float *x,*y;

  x = (float*) malloc(sizeof(float) * n);
  y = (float*) malloc(sizeof(float) * n);
 
  generateVector(x, n);
  printVector(x, n);

  generateVector(y, n);
  printVector(y, n);

  saxpy(n, x, y);
  printVector(y, n);
 
  free(x);
  free(y);

  return 0;

}

Writing saxpy.cu


#### Execution

In [None]:
!nvcc saxpy.cu -o saxpy

In [None]:
! ./saxpy 10

1	2	3	4	5	6	7	8	9	10	

1	2	3	4	5	6	7	8	9	10	

2	4	6	8	10	12	14	16	18	20	



### SAXPY CUDA implementation:

In this section, the SAXPY code developed in C is converted to CUDA. The differences between the two implementations include the inclusion of the "cuda.h" header and the change of file extension to ".cu". Besides that, the kernel function must be marked with the `__global__` qualifier, and the pointers _xd_ and _yd_ refer to copies of the vectors allocated in GPU (device) memory with _cudaMalloc_. The function _cudaMemcpy_ transfers data from host to device before the kernel runs, and from device to host after it finishes.

When calling the kernel, we need to specify the number of blocks and the number of threads per block to use in the GPU execution. Here a single block with _n_ threads is launched, one thread per vector element.
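
A single block can hold only a limited number of threads (1024 on GPUs of compute capability 2.0 and later), so a one-block launch stops working for larger vectors. A common approach is to fix the block size, round the number of blocks up, and let each thread derive a global index from its block and thread coordinates. The sketch below is plain C that emulates this index arithmetic on the host. The names `blocks_for` and `saxpy_emulated` are illustrative assumptions, and the nested loops only stand in for what the GPU would execute in parallel.

```c
#include <stddef.h>

/* Number of blocks needed to cover n elements with a given block size
   (ceiling division). */
size_t blocks_for(size_t n, size_t threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

/* Host-side emulation of a multi-block launch: the two loops play the role
   of blockIdx.x and threadIdx.x, which a GPU would run in parallel. */
void saxpy_emulated(size_t n, const float *x, float *y, size_t threads_per_block) {
    size_t num_blocks = blocks_for(n, threads_per_block);
    for (size_t block = 0; block < num_blocks; ++block) {
        for (size_t thread = 0; thread < threads_per_block; ++thread) {
            /* Same formula a kernel would use:
               blockIdx.x * blockDim.x + threadIdx.x */
            size_t i = block * threads_per_block + thread;
            if (i < n)  /* guard: the last block may extend past n */
                y[i] = x[i] + y[i];
        }
    }
}
```

In a CUDA kernel the same index would be computed as `blockIdx.x * blockDim.x + threadIdx.x`, and the launch would use, for example, `blocks_for(n, 256)` blocks of 256 threads. The block size of 256 is an assumption chosen for illustration, not a value taken from this report.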


#### Code

In [None]:
%%writefile saxpy.cu

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

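__global__ void saxpy(int n, float *x, float *y){
  // Global element index; with a single block this equals threadIdx.x.
  int i = blockIdx.x * blockDim.x + threadIdx.x;

  // Guard against threads beyond the end of the vectors.
  if(i < n)
    y[i] = x[i] + y[i];
}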

void printVector ( float *vector, int n ) {

  for (int i=0; i < n ; ++i)
    printf("%1.0f\t", vector[i]);

  printf("\n\n");

}

void generateVector (float *vector, int n) {

  for (int i=0; i < n ; ++i)
    vector[i] = i + 1;

}

int main (int argc, char *argv[]) {

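  // The vector length is read from the first command-line argument.
  if (argc < 2) {
    fprintf(stderr, "Usage: %s <n>\n", argv[0]);
    return 1;
  }
  int n = atoi(argv[1]);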
  float *x,*y, *xd, *yd;

  x = (float*) malloc(sizeof(float) * n);
  y = (float*) malloc(sizeof(float) * n);

  cudaMalloc( (void**)&xd, sizeof(float) * n );
  cudaMalloc( (void**)&yd, sizeof(float) * n );
 
  generateVector(x, n);
  printVector(x, n);

  generateVector(y, n);
  printVector(y, n);

  cudaMemcpy(xd, x, sizeof(float) * n, cudaMemcpyHostToDevice);
  cudaMemcpy(yd, y, sizeof(float) * n, cudaMemcpyHostToDevice);

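  // A single block of n threads: valid only while n does not exceed the
  // per-block thread limit (1024 on current GPUs).
  int NUMBER_OF_BLOCKS = 1;
  int NUMBER_OF_THREADS_PER_BLOCK = n;

  saxpy<<< NUMBER_OF_BLOCKS, NUMBER_OF_THREADS_PER_BLOCK >>>(n, xd, yd);

  // Wait for the kernel to finish before copying the result back.
  cudaDeviceSynchronize();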
  
  cudaMemcpy(y, yd, sizeof(float) * (n), cudaMemcpyDeviceToHost);

  printVector(y, n);
 
  // Release device memory, then the host buffers.
  cudaFree(xd);
  cudaFree(yd);
  free(x);
  free(y);

  return 0;

}

Writing saxpy.cu


#### Execution

In [None]:
!nvcc saxpy.cu -o saxpy

In [None]:
! ./saxpy 10

1	2	3	4	5	6	7	8	9	10	

1	2	3	4	5	6	7	8	9	10	

2	4	6	8	10	12	14	16	18	20	



### SAXPY CUDA implementation with unified memory:


Besides the previous implementation, we can also parallelize an application using CUDA unified memory, which is easier to implement because the runtime manages data movement between host and device. To do so, we call _cudaMallocManaged_ instead of _malloc_ and _cudaMalloc_, so the auxiliary variables _xd_ and _yd_ and the explicit _cudaMemcpy_ calls are no longer needed. Finally, before reading the results on the host, we call _cudaDeviceSynchronize_.

#### Code

In [None]:
%%writefile saxpy.cu

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

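__global__ void saxpy(int n, float *x, float *y){
    // Global element index; with a single block this equals threadIdx.x.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Guard against threads beyond the end of the vectors.
    if (i < n)
        y[i] = x[i] + y[i];
}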

void printVector(float *vector, int n)
{
    for (int i = 0; i < n; ++i)
        printf("%1.0f\t", vector[i]);
    printf("\n\n");
}

void generateVector(float *vector, int n)
{
    for (int i = 0; i < n; ++i)
        vector[i] = i + 1;
}

int main(int argc, char *argv[])
{
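    // The vector length is read from the first command-line argument.
    if (argc < 2) {
        fprintf(stderr, "Usage: %s <n>\n", argv[0]);
        return 1;
    }
    int n = atoi(argv[1]);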
    float *x, *y; 

    cudaMallocManaged(&x, sizeof(float) * n);
    cudaMallocManaged(&y, sizeof(float) * n);

    generateVector(x, n);
    printVector(x, n);

    generateVector(y, n);
    printVector(y, n);

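    // A single block of n threads: valid only while n does not exceed the
    // per-block thread limit (1024 on current GPUs).
    int NUMBER_OF_BLOCKS = 1;
    int NUMBER_OF_THREADS_PER_BLOCK = n;

    saxpy<<<NUMBER_OF_BLOCKS, NUMBER_OF_THREADS_PER_BLOCK>>>(n, x, y);

    // Wait for the kernel to finish before the host reads y.
    cudaDeviceSynchronize();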

    printVector(y, n);

    cudaFree(x);
    cudaFree(y);
    
    return 0;
}

Writing saxpy.cu


#### Execution

In [None]:
!nvcc saxpy.cu -o saxpy

In [None]:
! ./saxpy 10

1	2	3	4	5	6	7	8	9	10	

1	2	3	4	5	6	7	8	9	10	

2	4	6	8	10	12	14	16	18	20	



# Final Considerations

This practice aimed to explore code parallelization with CUDA, a tool that enables the use of multiple threads and processes. We found that turning a sequential _.c_ code into a _.cu_ implementation is very simple, especially with the unified memory functions. Furthermore, CUDA presents itself as a good solution for applications that manipulate large amounts of data.

# References

[1] Yang, C.-T., Huang, C.-L., & Lin, C.-F. (2011). Hybrid CUDA, OpenMP, and MPI parallel programming on multicore GPU clusters. Computer Physics Communications, 182(1), 266–269. doi:10.1016/j.cpc.2010.06.035 

[2] Luebke, D. (2008). CUDA: Scalable parallel programming for high-performance scientific computing. 2008 5th IEEE International Symposium on Biomedical Imaging: From Nano to Macro. doi:10.1109/isbi.2008.4541126 

[3] Nickolls, J. (2008). Scalable parallel programming with CUDA introduction. 2008 IEEE Hot Chips 20 Symposium (HCS). doi:10.1109/hotchips.2008.7476518 
