# 0. C pointers Warm-up

## Pointer and array on stack

1. Pointer to a plain variable

In [27]:
char c = 'A';
char *p2c = &c;

2. **Declare** pointers to arrays

* An array in C can be accessed through a pointer to its ```0th``` element
* ```p = &(a[0])``` is equivalent to ```p = a```

* In most expressions, the array name ```vowels``` itself converts to a pointer to the ```0th``` element

In [28]:
char vowels[] = {'A', 'E', 'I', 'O', 'U'};
char *p2v1 = vowels;
char *p2v2 = &(vowels[0]);
printf("p2v method1 = %p\np2v method2 = %p\n", (void *)p2v1, (void *)p2v2);
printf("p2v method3 = %p", (void *)vowels);

p2v method1 = 0x7f114fb8a060
p2v method2 = 0x7f114fb8a060
p2v method3 = 0x7f114fb8a060

3. **Access** arrays with pointers
* through the array name and an index
* through the dereferenced array name with an offset
* through a pointer name and an index
* through a dereferenced pointer with an offset

In [29]:
printf("vowels[0]@addr[%p] = %c\n", (void *)vowels, vowels[0]);
printf("vowels[1]@addr[%p] = %c\n", (void *)(vowels + 1), *(vowels + 1));
printf("vowels[2]@addr[%p] = %c\n", (void *)&p2v1[2], p2v1[2]);
printf("vowels[3]@addr[%p] = %c\n", (void *)(p2v1 + 3), *(p2v1 + 3));
printf("vowels[4]@addr[%p] = %c\n", (void *)&vowels[4], vowels[4]);

vowels[0]@addr[0x7f114fb8a060] = A
vowels[1]@addr[0x7f114fb8a061] = E
vowels[2]@addr[0x7f114fb8a062] = I
vowels[3]@addr[0x7f114fb8a063] = O
vowels[4]@addr[0x7f114fb8a064] = U


## 2D array on the heap

In [30]:
int nRows = 5;
int nCols = 3;

In [31]:
float** MatrixInitialize(int nrows, int ncols) {
    float** mat = (float **) malloc (nrows * sizeof(float *));
    for (int i=0; i < nrows; i++) {
        *(mat + i) = (float*) malloc(ncols * sizeof(float));
    }
    return mat;
}

In [32]:
void fillMatrix(float** mat, int nrows, int ncols) {
    for (int i=0; i < nrows; i++) {
        for (int j=0; j < ncols; j++) {
//             mat[i][j] = i + j * i;
            *(mat[i] + j) = i + j * i;
        }
    }
}

In [33]:
void printMatrix (float** mat, int nrows, int ncols) {
    for (int i=0; i < nrows; i++) {
        for (int j=0; j < ncols; j++) {
            printf("mat[%d][%d] = %f, ", i, j, mat[i][j]);
        }
        printf("\n");
    }
}

In [34]:
void test(int row, int col) {
    float** h_A = MatrixInitialize(row, col);
    fillMatrix(h_A, row, col);
    printMatrix (h_A, row, col);
    for (int i=0; i < row; i++) {
        free(h_A[i]);
    }
    free(h_A);
}


In [35]:
test(nRows, nCols);

mat[0][0] = 0.000000, mat[0][1] = 0.000000, mat[0][2] = 0.000000, 
mat[1][0] = 1.000000, mat[1][1] = 2.000000, mat[1][2] = 3.000000, 
mat[2][0] = 2.000000, mat[2][1] = 4.000000, mat[2][2] = 6.000000, 
mat[3][0] = 3.000000, mat[3][1] = 6.000000, mat[3][2] = 9.000000, 
mat[4][0] = 4.000000, mat[4][1] = 8.000000, mat[4][2] = 12.000000, 


# 1. Fundamentals of Parallel Computing

## a. Device Global Memory and Data Transfer

### Memory Allocation

```cudaMalloc(para1, para2)```
* Allocates an object in device global memory
* Two parameters:
 * **Address of a pointer** to the allocated object
 * **Size** of the allocated object in bytes
 
```c
int size = n*sizeof(float);
float *d_A;
//1. Allocate global memory on the device for A
CHECK_ERROR(cudaMalloc((void**)&d_A, size));
```

```cudaFree(para1)```
* Frees an object from device global memory
* One parameter:
    * **Pointer** to the object to free
    
```c
cudaFree(d_A);
```

### Memory Transfer

```cudaMemcpy(p1, p2, p3, p4)```

* Copies data between host and device memory
* Requires 4 parameters
 * Pointer to the **destination**
 * Pointer to the **source**
 * Number of bytes to copy
 * Type/direction of the transfer
 
```c
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
```

## b. Kernel Functions and Threading

### 1. blockDim, threadIdx, blockIdx built-in variables

<img src="resources/Fig2.11.png" alt="Drawing" style="width: 600px;"/>

When the host launches a kernel, CUDA generates a 2-level hierarchy:
* A **grid** is organized as an array of **thread blocks** (*blocks* for brevity). Each **block** consists of many threads
* The number of threads in a block is available in the built-in ```blockDim``` variable.


The ```blockDim``` variable is a struct with three unsigned integer fields: ```x```, ```y```, and ```z```.

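CUDA threads use the built-in variables ```threadIdx``` and ```blockIdx``` to distinguish themselves from one another and to determine which portion of the data each thread works on: ```blockIdx``` identifies the block within the grid, and ```threadIdx``` identifies the thread within its block.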

As shown in the example above, a unique global index ```i``` is calculated as ```i = blockIdx.x * blockDim.x + threadIdx.x```.

### 2. Function declaration keywords

<img src="resources/Fig2.13.png" alt="Drawing" style="width: 600px;"/>

* \_\_global__ declares the function as a ```CUDA C kernel function```. It runs on the device and is callable **only** from the host, except with ```dynamic parallelism```
* \_\_device__ declares the function as a ```CUDA device function```. It is callable **only** from a ```kernel function``` or another ```device function```
* \_\_host__ declares the function as a ```CUDA host function```. It is callable **only** from another ```host function```

### 3. Kernel Launch

```c
vecAdd<<<dimGrid, dimBlock>>>(p1, p2, p3);
```
The parameters between ```<<< >>>``` are called ```execution configuration parameters```.

* ```dimGrid``` defines the number of blocks in the grid
* ```dimBlock``` defines the number of threads in each block

**Code Example, say n=4000**

<img src="resources/Fig2.15.png" alt="Drawing" style="width: 600px;"/>

In the example above, each ```block``` has 256 threads. When ```n=4000```, ```ceil(4000/256) = 16``` blocks are used.

A total of ```16*256 = 4096 threads``` are launched, so the kernel needs additional code, typically a bounds check such as ```if (i < n)```, to ```disable``` the extra ```96``` threads.