In [3]:
using Pkg
Pkg.activate(".")
Pkg.instantiate()

using BenchmarkTools
using OhMyThreads

### Multi-Threading
Julia's built-in [Threads](https://docs.julialang.org/en/v1/manual/multi-threading/) module provides the `Threads.@threads` macro, which turns a `for` loop with independent iterations into a parallel loop. To change the number of threads Julia has access to, set the environment variable `JULIA_NUM_THREADS`. This must be done before Julia starts, i.e. before launching the notebook kernel.

In [None]:
println("Julia has access to ", Threads.nthreads(), " threads")
# `get` avoids a KeyError when the variable has not been set
println(get(ENV, "JULIA_NUM_THREADS", "JULIA_NUM_THREADS is not set"))
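Because the variable has to be in place before the kernel starts, one option for notebook users is to register a dedicated Jupyter kernel that carries it. The sketch below assumes the `IJulia` package is installed; the kernel name and the thread count of 4 are arbitrary choices for illustration.

```julia
using IJulia

# Register a new Jupyter kernel that always starts Julia with 4 threads.
# The display name "Julia (4 threads)" is only a label; pick any name you like.
installkernel("Julia (4 threads)", env = Dict("JULIA_NUM_THREADS" => "4"))
```

After running this once, select the new kernel in Jupyter and re-run the cell above to confirm the thread count.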

In [None]:
function single_thread!(vals)
    for i in eachindex(vals)
        vals[i] = sqrt(vals[i])*sqrt(vals[i])
    end
    return vals
end

function all_threads!(vals)
    Threads.@threads :static for i in eachindex(vals)
        vals[i] = sqrt(vals[i])*sqrt(vals[i])
    end
    return vals
end

function broadcasted!(vals)
    vals .= sqrt.(vals).*sqrt.(vals)
    return vals
end

In [7]:
vals = rand(1_000_000);  # rand already returns values in [0, 1), so abs is not needed

In [None]:
@benchmark single_thread!($vals)

In [None]:
@benchmark broadcasted!($vals)

In [None]:
@benchmark all_threads!($vals)
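A timing comparison is only meaningful if the threaded loop computes the same values as the serial one. The following is a minimal sketch of such a check. It redefines both loops under hypothetical names (`serial_roundtrip!`, `threaded_roundtrip!`) so that it runs on its own, and it works on copies so the input is left untouched.

```julia
using Base.Threads

# Serial reference version of the loop benchmarked above.
function serial_roundtrip!(vals)
    for i in eachindex(vals)
        vals[i] = sqrt(vals[i]) * sqrt(vals[i])
    end
    return vals
end

# Threaded version: every iteration writes only to its own index,
# so the iterations are independent and safe to run in parallel.
function threaded_roundtrip!(vals)
    @threads :static for i in eachindex(vals)
        vals[i] = sqrt(vals[i]) * sqrt(vals[i])
    end
    return vals
end

x = rand(10_000)
a = serial_roundtrip!(copy(x))
b = threaded_roundtrip!(copy(x))

println("Serial and threaded results identical: ", a == b)
```

Exact equality is expected here because each element is computed independently with the same operations. Loops that combine values across iterations (for example a running sum) need a reduction instead and may differ by floating-point rounding.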

`Threads.@threads` does not let you set the number of tasks; to do that we can use the `OhMyThreads.jl` library. In `OhMyThreads`, the `Threads.@threads` macro is replaced with the `@tasks` macro, inside of which we can use `@set` to configure properties such as `ntasks` or the type of thread scheduling.

In [None]:
function ohmythreads!(vals, n_tasks = Threads.nthreads())
    @tasks for i in eachindex(vals)
        @set ntasks = n_tasks
        @set scheduler = :static
        vals[i] = sqrt(vals[i])*sqrt(vals[i])
    end
    return vals
end

In [None]:
@benchmark ohmythreads!($vals)
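To illustrate the two settings mentioned above, here is a hedged sketch of the same loop with dynamic scheduling and more tasks than threads, a combination that can help when iterations take uneven amounts of time. The function name `roundtrip_dynamic!` and the default of four tasks per thread are assumptions made for this example, not recommendations from the notebook.

```julia
using OhMyThreads

# Hypothetical variant of the loop above: dynamic scheduling with
# several tasks per thread so that idle threads can pick up leftover work.
function roundtrip_dynamic!(vals; n_tasks = 4 * Threads.nthreads())
    @tasks for i in eachindex(vals)
        @set ntasks = n_tasks
        @set scheduler = :dynamic
        vals[i] = sqrt(vals[i]) * sqrt(vals[i])
    end
    return vals
end

x = rand(10_000)
y = roundtrip_dynamic!(copy(x))

# Compare against a plain serial computation of the same expression.
println("Matches serial result: ", y == map(v -> sqrt(v) * sqrt(v), x))
```

For uniform work like this loop, the static scheduler used earlier is usually the better fit; the dynamic variant is shown only to make the effect of the two `@set` options concrete.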

### GPU Array Programming
With the power of dynamic dispatch, we can perform many operations on the GPU just by changing the type of our data from `Array` to `CuArray` (or the analogous type for AMD, Intel, or Apple). For the purposes of the workshop, we cannot assume everyone will have a GPU available to run on, so I will just demonstrate how easy it is to get your code running on a GPU.

In [None]:
# This example uses CUDA, but the same principle applies to the other GPU libraries
# (AMDGPU, oneAPI, Metal). It needs an NVIDIA GPU and the CUDA.jl package.
using CUDA

# Moving data from CPU to GPU (converting Float64 to Float32 on the way).
A = rand(3,3)
cu_B = CuArray{Float32}(A)

# Can also pre-allocate storage on the GPU and copy data into it.
gpu_storage = CUDA.zeros(Float32, 3, 3)
# Convert A to the storage's element type, then copy; copyto! returns gpu_storage.
gpu_array2 = copyto!(gpu_storage, Float32.(A))

If we multiply `cu_B` by itself, this multiplication will take place on the GPU automatically because the type of `cu_B` is `CuArray`!

In [None]:
cu_C = cu_B * cu_B 

There are many more high-level functions that "just work" on the GPU. For example, arithmetic operators like `+`, `-`, and `*` have GPU implementations. There are also abstractions such as `reduce` and `mapreduce` which allow you to apply an operation or function across an array. Note that all high-level GPU operations should act on an entire array and not a single element at a time, because indexing individual elements of a GPU array from the CPU is very slow.

In [None]:
cu_C = CUDA.rand(5)
sum_of_C = sum(cu_C)
sum_of_C_also = reduce(+, cu_C)

f = (x) -> x^2
C_squared = map(f, cu_C)
sum_of_C_squared = mapreduce(f, +, cu_C)
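Broadcasting is the usual way to stay at the whole-array level. The sketch below assumes CUDA.jl is installed and guards the GPU work with `CUDA.functional()`, so it simply prints a message on machines without an NVIDIA GPU. The array length and the expression `2x + y` are arbitrary choices for illustration.

```julia
using CUDA

# Only touch the GPU when a working CUDA device and driver are present.
if CUDA.functional()
    CUDA.allowscalar(false)          # raise an error on element-by-element access

    x = CUDA.rand(Float32, 1024)
    y = CUDA.rand(Float32, 1024)

    z = 2f0 .* x .+ y                # dotted operators fuse into one GPU kernel
    total = sum(z)                   # the reduction also runs on the GPU

    # Copy everything back to the CPU and repeat the computation there.
    z_cpu = 2f0 .* Array(x) .+ Array(y)
    println("GPU and CPU results agree: ", Array(z) ≈ z_cpu)
else
    println("No functional CUDA device found; skipping the GPU example.")
end
```

With `CUDA.allowscalar(false)` in effect, code such as `z[1]` raises an error instead of silently falling back to slow element-wise transfers, which makes accidental scalar indexing easy to spot.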

### Primitive Types

> **Primitive Types**: Unlike most languages, Julia lets you declare your own primitive types, rather than providing only a fixed set of built-in ones. 

```julia
primitive type Float16 <: AbstractFloat 16 end
primitive type Float32 <: AbstractFloat 32 end
primitive type Float64 <: AbstractFloat 64 end

primitive type Bool <: Integer 8 end
primitive type Char <: AbstractChar 32 end

primitive type Int8    <: Signed   8 end
primitive type UInt8   <: Unsigned 8 end
primitive type Int16   <: Signed   16 end
primitive type UInt16  <: Unsigned 16 end
primitive type Int32   <: Signed   32 end
primitive type UInt32  <: Unsigned 32 end
primitive type Int64   <: Signed   64 end
primitive type UInt64  <: Unsigned 64 end
primitive type Int128  <: Signed   128 end
primitive type UInt128 <: Unsigned 128 end
```     
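Declaring your own primitive type uses the same syntax as the standard declarations listed above. The following is a minimal sketch; `Byte8` is a hypothetical name chosen for this example, and the type is given no supertype or arithmetic, so it only demonstrates declaration and bit-level conversion.

```julia
# `Byte8` is a hypothetical 8-bit primitive type declared for illustration.
primitive type Byte8 8 end

# A new primitive type has no constructors of its own, so we create a value
# by reinterpreting the bits of an existing 8-bit value.
b = reinterpret(Byte8, 0x2a)

println(isprimitivetype(Byte8))           # is Byte8 a primitive type?
println(sizeof(Byte8))                    # size in bytes
println(reinterpret(UInt8, b) == 0x2a)    # round trip back to UInt8
```

In practice a custom primitive type is usually paired with methods that define how it is constructed, displayed, and combined with other values; those are left out here to keep the sketch short.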