<div align="center"><h1>Stream on GPU</h1></div>

---
## Vector Add

In the world of computing, the addition of two vectors is the standard "Hello World". 

![vector add](./images/vector_add.png "Vector Addition")

Given two arrays of scalar data, as in the image above, we want to compute their sum, element by element.

We start by implementing the algorithm in plain C#. 

Edit the file `01-naive-add.cs` and implement this algorithm in plain C# until it displays `OK`.

See the `01-naive-add.cs` file in the `Solutions` directory if you get stuck.
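
For reference, the element-wise sum described above could be sketched as follows. This is a minimal illustration, not the contents of the exercise or solution file; the class name `NaiveAddSketch` and the method signature are hypothetical.

```csharp
using System;

public static class NaiveAddSketch
{
    // Element-wise sum: c[i] = a[i] + b[i] for every index i.
    // Runs sequentially on a single thread.
    public static float[] Add(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        float[] c = new float[a.Length];
        for (int i = 0; i < a.Length; ++i)
        {
            c[i] = a[i] + b[i];
        }
        return c;
    }

    public static void Main()
    {
        float[] a = { 1.0f, 2.0f, 3.0f };
        float[] b = { 10.0f, 20.0f, 30.0f };
        float[] c = Add(a, b);
        Console.WriteLine(string.Join(", ", c));
    }
}
```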

!hybridizer-cuda 01-naive-add.cs -o Target/naive-add/naive-add.exe -run

---
## With Parallelism

As we can see in the [solution](../../edit/03_Streams/01-naive-add/solutions/01-naive-add.cs), a plain scalar iterative approach uses only one thread, while modern CPUs typically have several cores (for example, 4 cores and 8 hardware threads).

Fortunately, .NET and C# provide an intuitive construct to leverage parallelism: [Parallel.For](https://msdn.microsoft.com/en-us/library/dd783539.aspx).
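
As an illustration of the construct, the sketch below replaces a sequential loop with `Parallel.For`, so that the runtime distributes iterations over the available threads. The class name `ParallelAddSketch` and the `Add` signature are assumptions made for this example; they are not taken from the exercise files.

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelAddSketch
{
    // Each iteration writes a distinct dst[i], so iterations are independent
    // and can safely run on different threads without locking.
    public static void Add(float[] dst, float[] a, float[] b, int n)
    {
        Parallel.For(0, n, i =>
        {
            dst[i] = a[i] + b[i];
        });
    }

    public static void Main()
    {
        const int n = 1 << 20;
        float[] a = new float[n];
        float[] b = new float[n];
        float[] dst = new float[n];
        for (int i = 0; i < n; ++i)
        {
            a[i] = i;
            b[i] = 1.0f;
        }

        Add(dst, a, b, n);

        // Check the parallel result against a sequential recomputation.
        bool ok = true;
        for (int i = 0; i < n; ++i)
        {
            if (dst[i] != (float)(a[i] + b[i])) { ok = false; break; }
        }
        Console.WriteLine(ok ? "OK" : "ERROR");
    }
}
```

The loop body captures the arrays in a lambda; because no two iterations touch the same element, no synchronization is needed.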

Modify `01-naive-add.cs` to distribute the work among multiple threads. 

See the `02-parallel-add.cs` file in the `Solutions` directory if you get stuck.

!hybridizer-cuda 01-naive-add.cs -o Target/parallel-add/parallel-add.exe -run

---
## Run Code on the GPU

Using Hybridizer to run the above code on a GPU is quite straightforward. We need to:
- Decorate the methods we want to run on the GPU.  
  This is done by adding the `[EntryPoint]` attribute to the methods of interest.
- "Wrap" the current object in a dynamic object able to dispatch code to the GPU.  
  This is done with the following boilerplate code:
```csharp
dynamic wrapped = HybRunner.Cuda().Wrap(new Program());
wrapped.mymethod(...);
```
The `wrapped` object exposes the same method signatures (static or instance) as the current object, but dispatches calls to the GPU.

Modify `03-gpu-add.cs` so that the `Add` method runs on a GPU.

See the `03-gpu-add.cs` file in the `Solutions` directory if you get stuck.

!hybridizer-cuda 03-gpu-add.cs -o Target/gpu-add/gpu-add.exe -run

---
## Manage Memory

You can also manage memory yourself, including keeping your data on the device. Hybridizer provides everything needed to let you choose where your data is stored.

To do that, we need to:
- Allow the use of unsafe code
- Declare an `IntPtr` for the device buffer and allocate it with `cuda.Malloc`:
```csharp
IntPtr d_a;
// N is the number of elements to allocate; datatype is the element type (e.g. float)
cuda.Malloc(out d_a, N * sizeof(datatype));
```
- Use `GCHandle` to pin a C# array ([Alloc](https://msdn.microsoft.com/en-us/library/1246yz8f.aspx) & [AddrOfPinnedObject](https://msdn.microsoft.com/en-us/library/system.runtime.interopservices.gchandle.addrofpinnedobject.aspx)):
```csharp
float[] a = new float[N];
GCHandle handle_a = GCHandle.Alloc(a, GCHandleType.Pinned);
IntPtr h_a = handle_a.AddrOfPinnedObject();
```
- Copy the data to the device, using the device pointer and the pinned host pointer:
```csharp
cuda.Memcpy(d_a,
            h_a,
            N * sizeof(float),
            cudaMemcpyKind.cudaMemcpyHostToDevice);
```

- After the kernel has run, copy the device data back to the host:
```csharp
cuda.Memcpy(h_a,
            d_a,
            N * sizeof(float),
            cudaMemcpyKind.cudaMemcpyDeviceToHost);
```
- Before each copy between the host and the device, make sure the device is synchronized.

- Don't forget to release the pinned `GCHandle` when you are done ([Free](https://msdn.microsoft.com/en-us/library/system.runtime.interopservices.gchandle.free.aspx)):
```csharp
handle_a.Free();
```
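
The pin, use, and free steps above can be combined so that the handle is always released. The sketch below uses only standard .NET: it pins an array, reads one element back through the raw pointer with `Marshal.Copy`, and frees the handle in a `finally` block. The class and method names are hypothetical, and the place where the host/device copies would go is only marked with a comment, since this example does not call into CUDA.

```csharp
using System;
using System.Runtime.InteropServices;

public static class PinningSketch
{
    // Pins `data`, reads its first element through the pinned address,
    // and releases the handle even if an exception is thrown in between.
    public static float ReadFirstThroughPinnedPointer(float[] data)
    {
        GCHandle handle = GCHandle.Alloc(data, GCHandleType.Pinned);
        try
        {
            IntPtr hostPtr = handle.AddrOfPinnedObject();

            // Assumption: in the exercise, this is where the host-to-device and
            // device-to-host copies would use hostPtr as the host-side address.

            float[] first = new float[1];
            Marshal.Copy(hostPtr, first, 0, 1); // read 1 float from the pinned address
            return first[0];
        }
        finally
        {
            // A pinned array cannot be moved by the garbage collector
            // until the handle is freed.
            handle.Free();
        }
    }

    public static void Main()
    {
        float[] a = { 42.0f, 1.0f, 2.0f };
        Console.WriteLine(ReadFirstThroughPinnedPointer(a));
    }
}
```

Using `try`/`finally` is a design choice for robustness: a leaked pinned handle keeps the array fixed in memory for the lifetime of the process.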

Modify `04-malloc-add.cs` so that it allocates and uses device pointers.

See the `04-malloc-add.cs` file in the `Solutions` directory if you get stuck.

!hybridizer-cuda 04-malloc-add.cs -o Target/malloc-add/malloc-add.exe -run

---
## STREAM

The purpose of this example is to show how to use streams with Hybridizer on one very large vector, without cutting it into separate arrays. We will use 8 streams for this example.

- Declare a stream as a `cudaStream_t` and create it with `cuda.StreamCreate(out yourStream)`.
- To run a kernel on a given stream, call `SetStream(stream)` on `wrapped`:
```csharp
wrapped.SetStream(stream).mymethod(...);
```
- You can copy data asynchronously with `cuda.MemcpyAsync`:
```csharp
cuda.MemcpyAsync(IntPtr dst, IntPtr src, size_t size, cudaMemcpyKind kindOfCopy, cudaStream_t stream = 0);
```
- You can block until a stream has finished its work with `cuda.StreamSynchronize(stream)`.
- Finally, destroy the stream with `cuda.StreamDestroy(stream)`.
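
One way to process a single large vector with several streams, without splitting it into separate arrays, is to give each stream a contiguous index range of the same buffer. The sketch below shows only that partitioning arithmetic in plain C#; it makes no CUDA calls. The names `StreamChunksSketch` and `Chunk` are hypothetical, and the idea that each stream copies and processes its own `[offset, offset + count)` range is an assumption about how the exercise could be organized, not a description of the provided solution.

```csharp
using System;

public static class StreamChunksSketch
{
    // Computes the element range handled by chunk k (0 <= k < chunks) of a
    // vector of n elements. The first (n % chunks) chunks get one extra
    // element, so the ranges cover [0, n) exactly, with no gaps or overlaps.
    public static void Chunk(int n, int chunks, int k, out int offset, out int count)
    {
        int baseSize = n / chunks;
        int remainder = n % chunks;
        offset = k * baseSize + Math.Min(k, remainder);
        count = baseSize + (k < remainder ? 1 : 0);
    }

    public static void Main()
    {
        const int n = 1000003;     // deliberately not a multiple of 8
        const int streamCount = 8; // the tutorial uses 8 streams

        for (int k = 0; k < streamCount; ++k)
        {
            int offset, count;
            Chunk(n, streamCount, k, out offset, out count);

            // Assumption: stream k would asynchronously copy `count` elements
            // starting at element `offset` (byte offset: offset * sizeof(float)),
            // launch the kernel on that range, and copy the range back.
            Console.WriteLine("stream " + k + ": offset=" + offset + ", count=" + count);
        }
    }
}
```

Handling the remainder explicitly avoids silently dropping the last few elements when the vector length is not a multiple of the number of streams.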

Modify `05-stream-add.cs` so that it creates and uses multiple streams.

See the `05-stream-add.cs` file in the `Solutions` directory if you get stuck.

!hybridizer-cuda 05-stream-add.cs -o Target/stream-add/stream-add.exe -run