# COMPUTATIONAL PHYSICS: INCLUDES PARALLEL COMPUTING/PARALLEL PROGRAMMING

#### ERNEST YEUNG ERNESTYALUMNI@GMAIL.COM

#### Contents

### Part 1. Introduction

- 1. Parallel Computing
- 1.1. Udacity Intro to Parallel Programming: Lesson 1 The GPU Programming Model
- 2. Pointers in C; Pointers in C categorified (interpreted in Category Theory)
- 3. On CUDA By Example

References

ABSTRACT. Everything about Computational Physics, including Parallel computing/ Parallel programming.

# Part 1. Introduction

#### 1. Parallel Computing

- 1.1. Udacity Intro to Parallel Programming: Lesson 1 The GPU Programming Model. Owens and Luebke pound fists at the end of this video. =)))) Intro to the class.
- 1.1.1. Running CUDA locally. Also, Intro to the class, in Lesson 1 The GPU Programming Model, has links to documentation for running CUDA locally; in particular, for Linux: http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html. That guide told me to download the NVIDIA CUDA Toolkit from https://developer.nvidia.com/cuda-downloads.

For Fedora, I chose Installer Type runfile (local).

Afterwards, installing CUDA on Fedora 23 Workstation was nontrivial. Go see either my github repository ML-grabbag (which will be updated) or my wordpress blog (which may not be updated frequently).

 $P = VI = I^2R$  heating.

1.1.2. Definitions of Latency and throughput (or bandwidth). cf. Building a Power Efficient Processor

Latency vs Bandwidth

latency [sec]. From the title "Latency vs. bandwidth", I'm thinking that throughput = bandwidth (???). throughput = jobs/time [jobs/sec].

Given a total task (e.g. a distance to cover) and velocity v,

latency = total task / v. throughput = (jobs per total task) / latency.
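As a rough numerical sketch of these two definitions, the C functions below compute latency and throughput for a trip. The function names, the units (km, km/h, passengers), and the numbers in the usage note are illustrative assumptions, not values taken from the notes.

```c
#include <assert.h>

/* Latency: time to finish the whole task once, in hours.
   distance_km and speed_kmh are hypothetical parameters for illustration. */
double latency_hours(double distance_km, double speed_kmh) {
    assert(speed_kmh > 0.0);
    return distance_km / speed_kmh;
}

/* Throughput: jobs completed per unit time = (jobs per total task) / latency. */
double throughput_per_hour(double jobs_per_trip, double latency) {
    assert(latency > 0.0);
    return jobs_per_trip / latency;
}
```

For instance, over 4500 km a car at 200 km/h carrying 2 people has the lower latency (22.5 h versus 90 h), while a bus at 50 km/h carrying 40 people has the higher throughput. That is the CPU versus GPU trade-off in miniature.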

Also, in Building a Power Efficient Processor, Owens recommends the article by David Patterson, "Latency..."

cf. GPU from the Point of View of the Developer

$n_{\rm core} \equiv$ number of cores

$n_{\rm vecop} \equiv$ width of the vector operations per core ($n_{\rm vecop}$-wide AVX vector operations/core)

$n_{\rm thread} \equiv$ threads/core (hyperthreading)

$n_{\rm core} \cdot n_{\rm vecop} \cdot n_{\rm thread}$-way parallelism

Date: 23 May 2016.

Key words and phrases. Computational Physics, Parallel Computing, Parallel Programming.

There were various websites that I looked up to try to find out the capabilities of my video card, but so far, I've only found these commands (and I'll print out the resulting output):

```
$ lspci -vnn | grep VGA -A 12
03:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX 980 Ti] [10de:17c8] (rev a1) (prog-if 00 [VC
        Subsystem: eVga.com. Corp. Device [3842:3994]
        Physical Slot: 4
        Flags: bus master, fast devsel, latency 0, IRQ 50
        Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
        Memory at e00000000 (64-bit, prefetchable) [size=256M]
        Memory at f0000000 (64-bit, prefetchable) [size=32M]
        I/O ports at e000 [size=128]
        [virtual] Expansion ROM at fb000000 [disabled] [size=512K]
        Capabilities: <access denied>
        Kernel driver in use: nvidia
        Kernel modules: nouveau, nvidia
$ lspci | grep VGA -E
03:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX 980 Ti] (rev a1)
$ grep driver /var/log/Xorg.0.log
     18.074] Kernel command line: BOOT_IMAGE=/vmlinuz-4.2.3-300.fc23.x86_64 root=/dev/mapper/fedora-root ro rd.lvm.lv=fedora
     18.087] (WW) Hotplugging is on, devices using drivers 'kbd', 'mouse' or 'vmmouse' will be disabled.
             X.Org XInput driver : 22.1
     18.192 (II) Loading /usr/lib64/xorg/modules/drivers/nvidia_drv.so
     19.088 (II) NVIDIA(GPU-0): Found DRM driver nvidia-drm (20150116)
     19.102] (II) NVIDIA(0):
                               ACPI event daemon is available, the NVIDIA X driver will
     19.174] (II) NVIDIA(0): [DRI2] VDPAU driver: nvidia
     19.284] ABI class: X. Org XInput driver, version 22.1
$ lspci -k | grep -A 8 VGA
03:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX 980 Ti] (rev a1)
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia
03:00.1 Audio device: NVIDIA Corporation GM200 High Definition Audio (rev a1)
Subsystem: eVga.com. Corp. Device 3994
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
05:00.0 USB controller: VIA Technologies, Inc. VL805 USB 3.0 Host Controller (rev 01)
```

#### CUDA Program Diagram



CPU "host" is the boss (and issues commands) -Owens.

Coprocessor : CPU "host"  $\rightarrow$  GPU "device"

Coprocessor: CPU process  $\mapsto$  (co)-process out to GPU

With

- 1 data cpu  $\rightarrow$  gpu
- 2 data gpu  $\rightarrow$  cpu (initiated by cpu host)
  - 1. and 2. use cudaMemcpy
- 3 allocate GPU memory: cudaMalloc
- 4 launch kernel on GPU

Remember that for 4., the launching of the kernel: while the kernel acts on the GPU "device" itself, the launch is initiated by the boss, the CPU "host".

Hence, cf. Quiz: What Can GPU Do in CUDA, GPUs can respond to a CPU request to receive and send data, CPU $\rightarrow$ GPU and GPU $\rightarrow$ CPU (1., 2., respectively), and compute a kernel launched by the CPU (4.).

A CUDA Program. A typical GPU program:

- cudaMalloc: CPU allocates storage on GPU
- cudaMemcpy: CPU copies input data from CPU $\rightarrow$ GPU
- kernel launch: CPU launches kernel(s) on GPU to process the data
- cudaMemcpy: CPU copies results back to CPU from GPU

Owens advises minimizing "communication" as much as possible (e.g. the cudaMemcpy between CPU and GPU), and doing a lot of computation on the CPU and on the GPU, each separately.

Defining the GPU Computation

Owens circled this:

**BIG IDEA** (This is Important)

- Kernels look like serial programs.
- Write your program as if it will run on **one** thread.
- The GPU will run that program on many threads.

# Squaring A Number on the CPU

Note

- (1) Only 1 thread of execution: ("thread" := one independent path of execution through the code) e.g. the for loop
- (2) no explicit parallelism; it's serial code e.g. the for loop through 64 elements in an array
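A minimal sketch of that serial CPU version, assuming a float array as in the course example; the function name `square_serial` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Serial squaring: a single thread of execution walks the whole array,
   one element per loop iteration. No parallelism is expressed anywhere. */
void square_serial(float *out, const float *in, size_t n) {
    assert(out != NULL && in != NULL);
    for (size_t i = 0; i < n; ++i) {
        out[i] = in[i] * in[i];
    }
}
```

Called with `n = 64`, the loop body executes 64 times in sequence, which is the point of notes (1) and (2) above.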

# GPU Code A High Level View

CPU:

- Allocate Memory
- Copy Data to/from GPU
- Launch Kernel: specifies degree of parallelism

GPU:

- Express Out = In $\cdot$ In: says *nothing* about the degree of parallelism

Owens reiterates that on the GPU, everything looks serial; it's only on the CPU that anything parallel is specified. Pseudocode for the CPU code: `squareKernel<<<64>>>(outArray, inArray)`

Squaring Numbers Using CUDA Part 3

From the example

```
// launch the kernel
square<<<1, ARRAY_SIZE>>>(d_out, d_in);
```

we're introduced to the "CUDA launch operator", initiating a kernel of 1 block of 64 threads, one per array element (ARRAY\_SIZE is 64), on the GPU. Remember that the d\_ prefix (this is a naming convention) tells us it's on the device, the GPU, solely.
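To make the "kernel looks serial, the launch sets the parallelism" idea concrete without a GPU, here is a host-only C emulation. It is a sketch, not CUDA: `launch_square_1block` is a hypothetical stand-in for the launch operator, and the loop runs the threads one after another where a GPU would run them concurrently.

```c
#include <assert.h>

#define ARRAY_SIZE 64

/* "Kernel" body, written as if for ONE thread.
   thread_idx plays the role of threadIdx.x. */
static void square_kernel(float *d_out, const float *d_in, int thread_idx) {
    float f = d_in[thread_idx];
    d_out[thread_idx] = f * f;
}

/* Hypothetical host-side stand-in for square<<<1, n_threads>>>(d_out, d_in):
   the kernel body is invoked once per thread index. */
void launch_square_1block(int n_threads, float *d_out, const float *d_in) {
    assert(n_threads >= 1 && n_threads <= ARRAY_SIZE);
    for (int t = 0; t < n_threads; ++t) {
        square_kernel(d_out, d_in, t);
    }
}
```

Note that `square_kernel` contains no loop and no thread count; only the "launch" knows there are 64 of them.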

For the CUDA launch operator `<<< >>>`, see also this explanation on stackexchange, CUDA kernel launch parameters explained right? (so surely others who are learning this are confused as well). From Eric's answer,

threads are grouped into blocks. All the threads will execute the invoked kernel function. Certainly,

$$\mathtt{<<<>>>} : (n_{\text{block}}, n_{\text{threads}}) \times \text{kernel function} \mapsto \text{kernel function} \, \mathtt{<<<} n_{\text{block}}, n_{\text{threads}} \mathtt{>>>} \in \text{End} \, Dat_{GPU}$$

$$\mathtt{<<<>>>} : \mathbb{N}^+ \times \mathbb{N}^+ \times Mor_{GPU} \to \text{End} \, Dat_{GPU}$$

where I propose that the GPU can be modeled as a category containing objects $Dat_{GPU}$, the collection of all possible data inputs and outputs into the GPU, and $Mor_{GPU}$, the collection of all kernel functions that run (exclusively, and this *must* be the case, as reiterated by Prof. Owens) on the GPU.

Next,

$$\text{kernelfunction} \, \mathtt{<<<} n_{\text{block}}, n_{\text{threads}} \mathtt{>>>} : \text{din} \mapsto \text{dout}$$

(as given in the "square" example), and so I propose

$$\text{kernelfunction} \, \mathtt{<<<} n_{\text{block}}, n_{\text{threads}} \mathtt{>>>} : (\mathbb{N}^+)^{n_{\text{threads}}} \to (\mathbb{N}^+)^{n_{\text{threads}}}$$

But keep in mind that dout, din are pointers in the C program, pointers to the place in the memory.

cudaMemcpy is a functor category, s.t. e.g. $\text{Obj}_{\text{cudaMemcpy}} \ni \text{cudaMemcpyDeviceToHost}$ where

$$\text{cudaMemcpy}(-,-,n_{\text{thread}},\text{cudaMemcpyDeviceToHost}) : \text{Memory}_{GPU} \to \text{Memory}_{CPU} \in \text{Hom}(\text{Memory}_{GPU}, \text{Memory}_{CPU})$$

#### Squaring Numbers Using CUDA 4

Note the C language construct *declaration specifier* `__global__`: it denotes that this is a kernel (for the GPU) and not CPU code. Pointers need to be allocated on the GPU (otherwise your program will crash spectacularly -Prof. Owens).

1.1.3. What are C pointers? Is $\langle \text{type} \rangle *$, a pointer, then a mapping from the category, namely the objects of types, to a mapping from the specified value type to a memory address?

e.g.

$$\langle \, \rangle * : \text{float} \mapsto \text{float} *$$

$$\text{float} * : \text{din} \mapsto \text{some memory address}$$

and then we pass in mappings, not values, and so we're actually declaring a square functor.

What is threadIdx? What is it mathematically? Consider that ∃ 3 "modules":

threadIdx.x

threadIdx.y

threadIdx.z

And then the line

```
int idx = threadIdx.x;
```

says that idx is an integer, "declares" it to be so, and then assigns threadIdx.x to idx, which surely has to have a compatible integer type (strictly, threadIdx.x is an unsigned int that gets converted to int). So (perhaps)

$$\text{idx} \equiv \text{threadIdx}.x \in \mathbb{Z}$$

is the same thing.

Then suppose threadIdx ⊂ FinSet, a subcategory of the category of all (possible) finite sets, s.t. threadIdx has 3 particular morphisms, $x, y, z \in \text{Mor}_{\text{threadIdx}}$,

$$\begin{aligned} x &: \text{threadIdx} \mapsto \text{threadIdx}.x \in \text{Obj}_{\text{FinSet}} \\ y &: \text{threadIdx} \mapsto \text{threadIdx}.y \in \text{Obj}_{\text{FinSet}} \\ z &: \text{threadIdx} \mapsto \text{threadIdx}.z \in \text{Obj}_{\text{FinSet}} \end{aligned}$$

# Configuring the Kernel Launch Parameters Part 1

 $n_{\text{blocks}}$ ,  $n_{\text{threads}}$  with  $n_{\text{threads}} \ge 1024$  (this maximum constant is GPU dependent). You should pick the ( $n_{\text{blocks}}$ ,  $n_{\text{threads}}$ ) that makes sense for your problem, says Prof. Owen.

1.1.4. Memory layout of blocks and threads. $\forall (n_{\text{blocks}}, n_{\text{threads}}) \in \mathbb{Z} \times \{1 \dots 1024\}$, $\{1 \dots n_{\text{blocks}}\} \times \{1 \dots n_{\text{threads}}\}$ is now an ordered index (with lexicographical ordering). This is just 1-dimensional (so possibly there's a 1-to-1 mapping to a finite subset of $\mathbb{Z}$).

I propose that "adding another dimension", or the 2nd dimension that Prof. Owens mentions, is being able to take the Cartesian product, up to 3 Cartesian products, of the block-thread index.
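The 1-to-1 mapping from the lexicographically ordered (block, thread) pairs to a finite subset of the integers can be sketched in plain C. The function names are hypothetical, and the indices are 0-based as in CUDA, whereas the sets above are written 1-based.

```c
#include <assert.h>

/* Map a (block, thread) pair to one flat index, lexicographically:
   all threads of block 0 come first, then block 1, and so on. */
int flat_index(int block_idx, int thread_idx, int n_threads) {
    assert(n_threads >= 1 && block_idx >= 0);
    assert(thread_idx >= 0 && thread_idx < n_threads);
    return block_idx * n_threads + thread_idx;
}

/* Inverse map: recover (block, thread) from the flat index. */
void unflatten_index(int idx, int n_threads, int *block_idx, int *thread_idx) {
    assert(idx >= 0 && n_threads >= 1);
    *block_idx = idx / n_threads;
    *thread_idx = idx % n_threads;
}
```

In a kernel the same arithmetic is usually written with the built-in variables as `blockIdx.x * blockDim.x + threadIdx.x`.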

Quiz: Configuring the Kernel Launch Parameters 2

Most general syntax:

Configuring the kernel launch

```
kernel<<<grid of blocks, block of threads>>>(...)
// for example
square<<<dim3(bx, by, bz), dim3(tx, ty, tz), shmem>>>(...)
```

where dim3(bx,by,bz) is the grid of blocks, bx $\cdot$ by $\cdot$ bz blocks; dim3(tx,ty,tz) is the block of threads, tx $\cdot$ ty $\cdot$ tz threads; and shmem is the shared memory per block in bytes.

Problem Set 1 "Also, the image is represented as an 1D array in the kernel, not a 2D array like I mentioned in the video." Here's part of that code for squaring numbers:

```
__global__ void square(float *d_out, float *d_in) {
    int idx = threadIdx.x;  // this thread's index within its block
    float f = d_in[idx];
    d_out[idx] = f * f;
}
```

1.1.5. Grid of blocks, block of threads, thread that's indexed; (mathematical) structure of it all. Let

$$grid = \prod_{I=1}^{N} (block)^{n_I^{block}}$$

where N = 1, 2, 3 (for CUDA) and by naming convention $I = 1, 2, 3 \equiv x, y, z$.

Let's try to make it explicit (as others had difficulty understanding the grid, block, thread model, cf. colored image to greyscale image using CUDA parallel processing, Cuda gridDim and blockDim) through commutative diagrams and categories (from math):



and then similar relations (i.e. arrows, i.e. relations) go for a block of threads:

$$\prod_{I=1}^{N} \mathbb{Z} \supset \prod_{I=1}^{N} \{1 \dots N_{I}^{\text{threads}}\}$$

$$\prod_{I=1}^{N} \mathbb{Z}^{+} \ni (N_{x}^{\text{threads}}, N_{y}^{\text{threads}}, N_{z}^{\text{threads}})$$

$$\text{blockDim} \in \text{dim3}, \qquad \text{blockDim} \mapsto (\text{blockDim}.x, \text{blockDim}.y, \text{blockDim}.z)$$

cf. gridsize help assignment 1: Pp explains how the number of threads per block is variable, and remember how Owens said Luebke says that a GPU doesn't get up for more than 1000 threads per block.
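The grid, block, thread structure can also be sketched in plain C. `Dim3` below is a hypothetical stand-in for CUDA's `dim3`, and the helper names are assumptions; the linear thread ID uses the x-fastest ordering, $x + y D_x + z D_x D_y$.

```c
#include <assert.h>

/* Plain-C stand-in for CUDA's dim3 (the name Dim3 is hypothetical). */
typedef struct { unsigned x, y, z; } Dim3;

/* Number of elements in the box: blocks per grid, or threads per block. */
unsigned long dim3_count(Dim3 d) {
    return (unsigned long)d.x * d.y * d.z;
}

/* Linear ID of thread (t.x, t.y, t.z) inside a block of extent block_dim,
   with x varying fastest. */
unsigned long thread_linear_id(Dim3 t, Dim3 block_dim) {
    assert(t.x < block_dim.x && t.y < block_dim.y && t.z < block_dim.z);
    return t.x + (unsigned long)t.y * block_dim.x
               + (unsigned long)t.z * block_dim.x * block_dim.y;
}
```

The total number of threads in a launch is then `dim3_count(grid) * dim3_count(block)`.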

1.1.6. Generalizing the model of an image. Consider a vector space V over a field $\mathbb{K}$, e.g. $\dim V = 4$, so $V = \mathbb{K}^{\dim V}$. Each pixel is represented by some $v \in V$.

Consider an image, or space, M, with $\dim M = 2$ (image) or $\dim M = 3$. Consider a local chart (that happens to be global in our case):

$$\varphi: M \to \mathbb{Z}^{\dim M} \supset \{1 \dots N_1\} \times \{1 \dots N_2\} \times \dots \times \{1 \dots N_{\dim M}\}$$
  
$$\varphi: x \mapsto (x^1(x), x^2(x), \dots, x^{\dim M}(x))$$





Consider a "coarsening" of the underlying M:



e.g. $N_1^{\text{thread}} = 12$, $N_2^{\text{thread}} = 12$.

Just note that in terms of syntax, you have the "block" model, in which you allocate blocks along each dimension. So in

const dim3 blockSize$(n_x^b, n_y^b, n_z^b)$

const dim3 gridSize$(n_x^{\rm gr}, n_y^{\rm gr}, n_z^{\rm gr})$

Then the condition is $n_x^b/\dim V$, $n_y^b/\dim V$, $n_z^b/\dim V \in \mathbb{Z}$, and $(n_x^{\rm gr}-1)/\dim V$, $n_y^{\rm gr}/\dim V$, $n_z^{\rm gr}/\dim V \in \mathbb{Z}$.
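One common way to choose a grid size for an image is to cover each axis with enough blocks by ceiling division, and to address each pixel through its row-major offset in the 1D array. The sketch below assumes that approach; it is not the condition stated above, and the function names are hypothetical.

```c
#include <assert.h>

/* Blocks needed to cover n pixels along one axis with n_threads threads
   per block (ceiling division). */
int blocks_to_cover(int n, int n_threads) {
    assert(n >= 0 && n_threads >= 1);
    return (n + n_threads - 1) / n_threads;
}

/* Row-major position of pixel (row, col) in the 1D array storing the image. */
int pixel_offset(int row, int col, int num_cols) {
    assert(row >= 0 && col >= 0 && col < num_cols);
    return row * num_cols + col;
}
```

With 12 threads per block along an axis, 24 pixels need 2 blocks and 25 pixels need 3. When the image size is not a multiple of the block size, the threads that fall outside the image have to be masked off inside the kernel.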

Transpose Part 1 Now

$$\operatorname{Mat}_{\mathbb{F}}(n,n) \xrightarrow{T} \operatorname{Mat}_{\mathbb{F}}(n,n)$$

$$A \mapsto A^{T} \text{ s.t. } (A^{T})_{ij} = A_{ji}$$

$$\operatorname{Mat}_{\mathbb{F}}(n,n) \to \mathbb{F}^{n^{2}}$$

$$A_{ij} \mapsto A_{in+j}$$

$$\begin{array}{ccc} \operatorname{Mat}_{\mathbb{F}}(n,n) & \longrightarrow & \mathbb{F}^{n^{2}} \\ \downarrow T & & \downarrow T \\ \operatorname{Mat}_{\mathbb{F}}(n,n) & \longrightarrow & \mathbb{F}^{n^{2}} \end{array} \qquad \begin{array}{ccc} A_{ij} & \longmapsto & A_{in+j} \\ \downarrow T & & \downarrow T \\ (A^{T})_{ij} = A_{ji} & \longmapsto & A_{jn+i} \end{array}$$
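The flattened form of the transpose, $A_{in+j} \mapsto A_{jn+i}$, reads directly as a pair of loops in C. This is a serial sketch with 0-based indices and a hypothetical function name; a parallel version would give each $(i, j)$ pair to its own thread.

```c
#include <assert.h>
#include <stddef.h>

/* Transpose an n x n matrix stored row-major in a flat array:
   element (i, j) is read from offset i*n + j and written to offset j*n + i. */
void transpose_flat(float *out, const float *in, size_t n) {
    assert(out != in);  /* out-of-place only */
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            out[j * n + i] = in[i * n + j];
        }
    }
}
```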

#### Transpose Part 2

Possibly, transpose is a functor.

Consider struct as a category. In this special case, $\text{Obj}_{\text{struct}} = \{\text{arrays}\}$ (a struct of arrays). Now this struct already has a hash table for indexing upon declaration (i.e. "creation"): so this category struct will need to be equipped with a "diagram" from the category of indices J to struct: $J \to \text{struct}$.

So possibly

Quiz: What Kind Of Communication Pattern. This quiz made a few points that clarified the characteristics of these so-called communication patterns (amongst the memory?)

- map is bijective, and map :  $Idx \rightarrow Idx$
- gather not necessarily surjective
- scatter not necessarily surjective
- stencil surjective
- transpose (see before)

# Parallel Communication Patterns Recap

- map bijective
- transpose bijective
- gather not necessarily surjective, and is many-to-one (by def.)
- scatter one-to-many (by def.) and is not necessarily surjective
- stencil several-to-one (not injective, by definition), and is surjective
- reduce all-to-one
- scan/sort all-to-all
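A few of these patterns can be sketched as serial C loops to show which indices each output touches. The specific operations (doubling, pair averages, halves, a sum) and the function names are illustrative assumptions, not the course's examples.

```c
#include <assert.h>
#include <stddef.h>

/* map: out[i] depends only on in[i] (one-to-one). */
void map_double(float *out, const float *in, size_t n) {
    assert(out != NULL && in != NULL);
    for (size_t i = 0; i < n; ++i) out[i] = 2.0f * in[i];
}

/* gather: each output reads several inputs (many-to-one); here a pair average.
   in must hold 2 * n_out elements. */
void gather_pairs(float *out, const float *in, size_t n_out) {
    assert(out != NULL && in != NULL);
    for (size_t i = 0; i < n_out; ++i) {
        out[i] = 0.5f * (in[2 * i] + in[2 * i + 1]);
    }
}

/* scatter: each input writes several outputs (one-to-many); here two halves.
   out must hold 2 * n_in elements. */
void scatter_halves(float *out, const float *in, size_t n_in) {
    assert(out != NULL && in != NULL);
    for (size_t i = 0; i < n_in; ++i) {
        out[2 * i] = 0.5f * in[i];
        out[2 * i + 1] = 0.5f * in[i];
    }
}

/* reduce: all-to-one; here a sum. */
float reduce_sum(const float *in, size_t n) {
    assert(in != NULL);
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i) total += in[i];
    return total;
}
```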

# Programmer View of the GPU

thread blocks: group of threads that cooperate to solve a (sub)problem

Thread Blocks And GPU Hardware

CUDA GPU is a bunch of SMs:

Streaming Multiprocessors (SM)s

SMs have a bunch of simple processors and memory.

Dr. Luebke:

"Let me say that again because it's really important": the GPU is responsible for allocating blocks to SMs.

Programmer only gives GPU a pile of blocks.

Quiz: What Can The Programmer Specify

I myself thought this was a revelation and was not intuitive at first:

Given a single kernel that's launched on many thread blocks, including blocks X and Y, the programmer cannot specify the order in which the blocks run (e.g. whether block X and block Y run at the same time, or one after the other), nor which SM a block will run on (the GPU does all this).

Quiz: A Thread Block Programming Example

Open up hello_blockIdx.cu in Lesson 2 Code Snippets (I got the repository from github, repo name is cs344).

At first, I thought you can do a single file compile and run in Eclipse without creating a new project. No. cf. Eclipse creating projects every time to run a single file?.

I ended up creating a new CUDA C/C++ project from File -> New Project, and then chose project type Executable, Empty Project, making sure to include Toolchain CUDA Toolkit (my version is 7.5), and chose an arbitrary project name (I chose cs344single). Then, as suggested by Kenny Nguyen, I dragged and dropped files into the folder, from my file directory program.

I ran the program with the "Play" triangle button, clicking on the green triangle button, and it ran as expected. I also turned off Build Automatically by deselecting the option (no checkmark).

# GPU Memory Model



Then consider threadblock  $\equiv$  thread block

$\text{Obj}_{\text{threadblock}} \supset \{ \text{threads} \}$

$\text{FinSet} \xrightarrow{\text{threadIdx}} \text{thread} \in \text{Mor}_{\text{threadblock}}$



 $\forall$  thread



# Synchronization - Barrier

Quiz: The Need For Barriers

3 barriers were needed (wasn't obvious to me at first). All threads need to finish the write, or initialization, so it'll need a barrier.

While `array[idx] = array[idx+1];` is 1 line, it'll actually need 2 barriers: first every thread reads, then every thread writes, with a barrier after each step.

So actually we'll need to rewrite this code:

```
int temp = array[idx+1];
__syncthreads();
array[idx] = temp;
__syncthreads();
```

Kernels have an implicit barrier between them.
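Why the read and the write must be separated can be seen in a host-only emulation. This is a sketch in plain C, not CUDA: `N_THREADS` and the function names are assumptions, and the second function simply picks one unlucky thread order to show what a missing barrier allows.

```c
#include <assert.h>
#include <stddef.h>

#define N_THREADS 8  /* illustrative block size, not from the notes */

/* Shift left by one, emulating N_THREADS threads. Phase 1 (all reads)
   completes before phase 2 (all writes) begins; that separation is what
   __syncthreads() enforces. array must hold N_THREADS + 1 elements. */
void shift_left_with_barrier(int *array) {
    assert(array != NULL);
    int temp[N_THREADS];  /* one private value per emulated thread */
    for (int idx = 0; idx < N_THREADS; ++idx) {
        temp[idx] = array[idx + 1];
    }
    /* barrier: every thread has finished reading */
    for (int idx = 0; idx < N_THREADS; ++idx) {
        array[idx] = temp[idx];
    }
}

/* Without a barrier the threads may run in any order; here the highest
   index goes first, so each thread reads an already overwritten neighbour. */
void shift_left_racy_reverse_order(int *array) {
    assert(array != NULL);
    for (int idx = N_THREADS - 1; idx >= 0; --idx) {
        array[idx] = array[idx + 1];
    }
}
```

Starting from 0, 1, ..., 8, the first version gives 1, 2, ..., 8 in the first eight slots, while the second smears the last value across the whole array.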

Writing Efficient Programs

(1) Maximize arithmetic intensity, where arithmetic intensity := $\frac{\text{math}}{\text{memory}}$

video: Minimize Time Spent On Memory

local memory is fastest; global memory is slower

kernel we know (in the code) is tagged with \_\_global\_\_

quiz: A Quiz on Coalescing Memory Access

Work it out as Dr. Luebke did to figure out if it's coalesced memory access or not.

Atomic Memory Operations

atomicAdd, atomicMin, atomicXor, atomicCAS (Compare And Swap)
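The semantics of these operations can be tried on the host with C11 `<stdatomic.h>`, which offers analogous read-modify-write primitives. This is an analogy under that assumption, not the CUDA API; the wrapper names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Atomic add: one indivisible read-modify-write; returns the old value. */
int counter_add(atomic_int *counter, int amount) {
    assert(counter != NULL);
    return atomic_fetch_add(counter, amount);
}

/* Compare-and-swap: store desired only if *target still equals expected. */
bool try_swap(atomic_int *target, int expected, int desired) {
    assert(target != NULL);
    return atomic_compare_exchange_strong(target, &expected, desired);
}

/* Atomic minimum built from CAS: retry until value is stored
   or the current content is already no larger. */
void atomic_min_via_cas(atomic_int *target, int value) {
    assert(target != NULL);
    int seen = atomic_load(target);
    while (value < seen &&
           !atomic_compare_exchange_weak(target, &seen, value)) {
        /* on failure, seen now holds the current content; re-check */
    }
}
```

The last function illustrates why compare-and-swap is the general building block: other atomic updates can be expressed as a CAS retry loop.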

#### 2. Pointers in C; Pointers in C categorified (interpreted in Category Theory)

Suppose $v \in \text{Obj}_{\text{Data}}$, category of data **Data**,

e.g. $v \in \text{Int} \in \text{Obj}_{\text{Type}}$, category of types **Type**.

$$\text{Data} \xrightarrow{\&} \text{Memory}, \qquad v \stackrel{\&}{\mapsto} \& v$$

with address $\& v \in \text{Memory}$.

With assignment pv = &v, $pv \in \text{Obj}_{\text{pointer}}$, category of pointers, **pointer**; $pv \in \text{Memory}$ (i.e. not $pv \in \text{Dat}$, i.e. $pv \notin \text{Dat}$)

$$\text{pointer} \ni pv \stackrel{*}{\mapsto} *pv \in \text{Dat}$$
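In C itself, the two arrows `&` and `*` look as follows. The function names are hypothetical; the point of the sketch is that `*` undoes `&`, and that writing through `pv` changes the data `v` it points to.

```c
#include <assert.h>
#include <stddef.h>

/* Take the address of v and dereference it again:
   & sends data to a memory address, * sends the pointer back to data. */
int roundtrip(int v) {
    int *pv = &v;
    return *pv;
}

/* Writing through the pointer modifies the variable it points to. */
void set_through_pointer(int *pv, int new_value) {
    assert(pv != NULL);
    *pv = new_value;
}
```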



Examples. Consider passfunction.c in Fitzpatrick [5].

Consider the type double, double $\in \text{Obj}_{\text{Types}}$.

fun1, fun2 $\in \text{Mor}_{\text{Types}}$, namely

$\text{fun1}, \text{fun2} \in \text{Hom}(\text{double}, \text{double}) \equiv \text{Hom}_{\text{Types}}(\text{double}, \text{double})$

Recall that

$$\text{pointer} \xrightarrow{*} \text{Dat}, \qquad \text{pointer} \xrightarrow{\&} \text{Memory}$$

\*, & are functors with domain on the category pointer.

Pointers to functions is the "extension" of functor \* to the codomain of $\text{Mor}_{\text{Types}}$:




It's unclear to me how void cube can be represented in terms of category theory, as surely it cannot be represented as a mapping (it acts upon a functor, namely the \* functor for pointers). It doesn't return a value, and so one cannot be confident to say there's explicitly a domain and codomain, or range for that matter.

But what is going on is that

$$\text{pointer}, \text{double}, \text{pointer} \xrightarrow{\text{cube}} \text{pointer}, \text{pointer}$$

$$\text{fun1}, x, \text{res1} \xrightarrow{\text{cube}} \text{fun1}, \text{res1}$$

s.t. $*\text{res1} = y^3 = (*\text{fun1}(x))^3$

So I'll speculate that in this case, cube is a functor, and in particular, is acting on \*, the so-called dereferencing operator:

$$\text{pointer} \xrightarrow{*} \text{float} \in \text{Data} \quad \xrightarrow{\text{cube}} \quad \text{pointer} \xrightarrow{\text{cube}(*)} \text{float} \in \text{Data}$$

$$\text{res1} \xrightarrow{*} *\text{res1} \xrightarrow{\text{cube}(*)} \text{cube}(*\text{res1}) = y^3$$
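A sketch in the spirit of that example, showing a `void` function that takes a pointer to a function and returns its result through a pointer. The bodies of `fun1` and `fun2` here are hypothetical stand-ins, not Fitzpatrick's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for fun1 and fun2; the bodies are illustrative only. */
double fun1(double z) { return 2.0 * z; }
double fun2(double z) { return z + 1.0; }

/* cube takes a pointer to a double -> double function, evaluates
   y = (*fun)(x), and writes y^3 through result instead of returning it. */
void cube(double (*fun)(double), double x, double *result) {
    assert(fun != NULL && result != NULL);
    double y = (*fun)(x);
    *result = y * y * y;
}
```

So although `cube` returns `void`, it still behaves like a map (fun, x, res) to \*res; its "codomain" is reached through the result pointer.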

cf. Arrays, from Fitzpatrick [5]

$$\text{Types} \xrightarrow{\text{declaration}} \text{arrays}$$

If $x \in \text{Obj}_{\text{arrays}}$,

$$\& x[0] \in \text{Memory} \xrightarrow{==} x \in \text{pointer} \text{ (to 1st element of array)}$$
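The identification of `x` with `&x[0]` can be checked directly in C: in an expression an array name decays to a pointer to its first element, and `x[i]` means `*(x + i)`. The function name below is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sum an array using pointer arithmetic only:
   *(x + i) is the same element as x[i]. */
double sum_via_pointer(const double *x, size_t n) {
    assert(x != NULL);
    double total = 0.0;
    for (size_t i = 0; i < n; ++i) {
        total += *(x + i);
    }
    return total;
}
```

Passing an array `double x[3]` as `sum_via_pointer(x, 3)` hands the function the address `&x[0]`, which is why the length has to travel as a separate argument.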

cf. Section 2.13 Character Strings from Fitzpatrick [5]

```
char word[20] = "four";
char *word = "four";
```

cf. C++ extensions for C according to Fitzpatrick [5]

- simplified syntax to pass by reference pointers into functions
- inline functions
- variable size arrays

```
int n;
double x[n];
```

- complex number class

2.0.7. Need a CUDA, C, C++, IDE? Try Eclipse! This website has a clear, lucid, and pedagogical tutorial for using Eclipse: Creating Your First C++ Program in Eclipse. But it looks like I had to pay. Other than the well-written tips on the webpage, I looked up stackexchange for my Eclipse questions (I had difficulty with the Eclipse documentation).

Others, like myself, had questions on how to use an IDE like Eclipse when learning CUDA, and on "building" (is that the same as compiling?) and running only single files.

My workflow: I keep a separate folder in my file directory that holds the local clone of my github repository.

I start a New Project, CUDA Project, in Eclipse. I type up my single file (I right click on the src folder and add a 'Source File'). I build it (with the Hammer icon; yes, there are a lot of new icons near the top) and it runs. I can then run it again with the Play, triangle, icon.

I found that if the src folder holds 2 or more files that each define a main function, it won't build right.

So once a file builds and it's good, I cp the file, in Terminal, into my local github repository. Note that I could also use the nvcc compiler to build from there, if I wanted to.

Now with my file saved (for example, helloworldkernel.cu), I can delete it without fear from my, say, cuda-workplace, via the "C/C++ Projects" window on the right side in Eclipse.

#### 3. On CUDA By Example

cf. 3.2.3 Passing Parameters of Sanders and Kandrot (2010) [6]


# References

- [1] Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer Series in Statistics, 2009 (corr. 7th printing 2013). ISBN-13: 978-0387848570. https://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII\_print4.pdf
- [2] Jared Culbertson, Kirk Sturtz. Bayesian machine learning via category theory. arXiv:1312.1445 [math.CT]. http://arxiv.org/abs/1312.1445
- [3] John Owens, David Luebke. Intro to Parallel Programming. CS344. Udacity. Also, https://github.com/udacity/cs344
- [4] CS229 Stanford University. http://cs229.stanford.edu/materials.html
- [5] Richard Fitzpatrick. "Computational Physics." http://farside.ph.utexas.edu/teaching/329/329.pdf
- [6] Jason Sanders, Edward Kandrot. CUDA by Example: An Introduction to General-Purpose GPU Programming, 1st ed. Addison-Wesley Professional, July 29, 2010. ISBN-13: 978-0131387683