# Threads use the resources of multiple CPU cores

How many threads should we have?

While you can have hundreds or thousands of tasks in your program,
you should only have a limited number of threads.
The general advice is that the number of threads should correspond directly to
the number of CPU cores you have.
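
As a quick check of this advice, the sketch below compares the thread count of the running session with the number of logical CPU threads reported by Julia. The variable names `cores` and `threads` are only illustrative, and `Sys.CPU_THREADS` counts logical (not necessarily physical) cores.

```julia
# Logical CPU threads reported by the operating system
cores = Sys.CPU_THREADS

# Threads available to this Julia session
threads = Threads.nthreads()

println("Logical CPU threads: ", cores)
println("Julia threads:       ", threads)

if threads > cores
    println("More Julia threads than logical cores; threads will compete for CPU time.")
elseif threads < cores
    println("Fewer Julia threads than logical cores; some cores may stay idle.")
else
    println("Julia threads match the number of logical cores.")
end
```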

# Check CPU and hardware information

In [1]:
versioninfo(verbose=true)

Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
      Microsoft Windows [Version 10.0.18362.900]
  CPU: Intel(R) Pentium(R) CPU 4415Y @ 1.60GHz: 
              speed         user         nice          sys         idle          irq
       #1  1608 MHz    2200125            0      3854484     90901250       966203  ticks
       #2  1608 MHz    2127234            0      2039406     92788750        92281  ticks
       #3  1608 MHz    2291031            0      2363250     92301125        41468  ticks
       #4  1608 MHz    2399625            0      2172375     92383390        27765  ticks
       
  Memory: 3.909626007080078 GB (634.43359375 MB free)
  Uptime: 1.218641e6 sec
  Load Avg:  0.0  0.0  0.0
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 2
  HOMEDRIVE = C:
  HOMEPATH = \Users\Benz
  PATH = C:\ProgramData\Anaconda3;C:\ProgramData\Anaconda3\Library\mingw-w64

# How to set the number of threads
We can check the current number of threads with:

In [1]:
nThreads=Threads.nthreads()

8

Unfortunately, we can't change the number of threads from inside a script. It has to be set before Julia starts.

For Jupyter notebook
1. Open "Anaconda Prompt" and type "set JULIA_NUM_THREADS=..."
2. cd to the directory you want and start Jupyter with "jupyter notebook"


For Juno
1. Open Juno > File > Settings > Packages
2. Find the package "julia-client" and click "Settings"
3. Expand the "Julia Options" list to find the "Number of Threads" input box, enter the number of threads, and close the settings tab.
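
After restarting Julia (or the notebook kernel), the setting can be verified from inside the session. This is a minimal sketch: the names `requested` and `active` are illustrative, and depending on how Julia was launched the environment variable may not be visible even though the thread count is set.

```julia
# Value of the environment variable as seen by this session ("not set" if absent)
requested = get(ENV, "JULIA_NUM_THREADS", "not set")

# Number of threads the session is actually running with
active = Threads.nthreads()

println("JULIA_NUM_THREADS = ", requested)
println("Active threads    = ", active)
```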

# Multi-threading with a for loop

We can easily take advantage of multiple threads by distributing the iterations of a for loop across threads with `Threads.@threads`.

### E.g. 1 Parallel for loop for independent tasks
With multiple threads, the execution time is determined by the longest task. With a single thread, the execution time is the sum of the times of all tasks.
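
The expected timings can be estimated before running anything. The sketch below uses a hypothetical helper, `estimate_time`, which assumes the loop iterations are split into contiguous, nearly equal chunks with one chunk per thread; the actual scheduling may differ between Julia versions.

```julia
taskSim = [1, 2, 1, 3, 1, 4, 4, 5]

# Hypothetical helper: estimated wall time (in seconds) when the tasks are
# split into contiguous, nearly equal chunks, one chunk per thread.
function estimate_time(tasks, nthreads)
    n = length(tasks)
    len, extra = divrem(n, nthreads)   # the first `extra` threads get one more task
    longest = 0
    start = 1
    for t in 1:nthreads
        chunk = len + (t <= extra ? 1 : 0)
        chunk == 0 && continue         # more threads than tasks: nothing to do
        stop = start + chunk - 1
        longest = max(longest, sum(tasks[start:stop]))
        start = stop + 1
    end
    return longest
end

estimate_time(taskSim, 1)   # 21: one thread runs every task in sequence
estimate_time(taskSim, 8)   # 5: one task per thread, the longest task dominates
estimate_time(taskSim, 2)   # 14: chunks [1,2,1,3] and [1,4,4,5]
```

With fewer threads than tasks, the estimate is the largest chunk sum rather than the largest single task.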

In [9]:
thid=zeros(10)
taskSim=[1,2,1,3,1,4,4,5];
sumTaskTime=sum(taskSim);
nTasks=length(taskSim);

### Single-thread run
Execution time should be about `sumTaskTime` seconds (21 s here).

In [15]:
@time begin
    for i=1:nTasks
        sleep(taskSim[i])
    end
end

 21.013439 seconds (487 allocations: 13.750 KiB)


### Multi-thread run
Execution time depends on the largest entry of `taskSim` (5 s here), provided there are at least as many threads as tasks.

In [14]:
@time begin
    Threads.@threads for i=1:nTasks
        thid[i]=Threads.threadid()
        print("i:$i Thread:$(thid[i]) Sleep:$(taskSim[i]) \n")
        sleep(taskSim[i])
    end # loop i
end

i:1 Thread:1.0 Sleep:1 
i:6 Thread:6.0 Sleep:4 
i:4 Thread:4.0 Sleep:3 
i:8 Thread:8.0 Sleep:5 
i:2 Thread:2.0 Sleep:2 
i:7 Thread:7.0 Sleep:4 
i:3 Thread:3.0 Sleep:1 
i:5 Thread:5.0 Sleep:1 
  5.019182 seconds (23.13 k allocations: 1.158 MiB)


### E.g. 2 Multi-threading inside a function

In [3]:
using Base.Threads
using BenchmarkTools

Create a normal function and multi-threaded versions of it

In [3]:
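# Single-threaded version using broadcasting
function linearSum(x,m,c)
    y=(m.*x).+c
    return y
end

# Multi-threaded version: loop iterations are split across threads
function linearSum_threads(x,m,c)
    nx=length(x)
    y=zeros(nx,1)
    Threads.@threads for ix=1:nx
        @inbounds y[ix]=m*x[ix]+c
    end

    return y
end

# Same as above, but the output uses the element type of x
# (nThreadsIn is currently unused)
function linearSum_threads2(x,m,c,nThreadsIn)
    nx=length(x)
    y=zeros(eltype(x),nx,1)
    Threads.@threads for ix=1:nx
        @inbounds y[ix]=m*x[ix]+c
    end

    return y
end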

m=3.52
c=0.78;

### Naive threading can't improve on the built-in sum

In [10]:
x=rand(10000000,1)

@btime begin
   sum($x) 
end

@btime begin
   sum($x[1:1250000,1]) 
end


  2.881 ms (0 allocations: 0 bytes)
  2.447 ms (2 allocations: 9.54 MiB)
  8.374 ms (60 allocations: 76.30 MiB)


UndefVarError: UndefVarError: elty not defined

In [11]:
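@btime begin
    @threads for it=1:nThreads
        sum($x[1:1250000,1])
    end
end

# Naive threaded sum. Note: `y += x[i]` is a data race, because every thread
# reads and writes the same `y` without synchronization, so the result can be
# wrong as well as slow. (nThrd is unused.)
function sumt(x,nThrd)
    y=zero(eltype(x))
    @threads for i=1:length(x)
        y+=x[i]
    end
    return y
end

@btime begin
    sumt($x,$nThreads)
end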

  8.393 ms (60 allocations: 76.30 MiB)
  312.783 ms (20000047 allocations: 305.18 MiB)


677407.7082735315

In [12]:
nThreads

8

Getting a benefit from multiple threads takes some care; otherwise the multi-threaded version may be slower than the single-threaded one.
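
One common way to avoid the shared-variable problem is to give each chunk of the array its own slot for a partial result and to combine the partial results afterwards. The sketch below illustrates that pattern; `sum_chunks` is a hypothetical name, not a function from this notebook or from Base, and it is not claimed to be faster than the built-in `sum`.

```julia
using Base.Threads

# Hypothetical helper: threaded sum using one partial result per chunk.
function sum_chunks(x)
    nchunks = nthreads()
    partials = zeros(eltype(x), nchunks)   # one slot per chunk, so no shared writes
    n = length(x)
    @threads for k in 1:nchunks
        # Contiguous, non-overlapping index range for chunk k
        lo = div((k - 1) * n, nchunks) + 1
        hi = div(k * n, nchunks)
        s = zero(eltype(x))                # local accumulator, private to this iteration
        @inbounds for i in lo:hi
            s += x[i]
        end
        partials[k] = s                    # indexed by chunk, not by thread id
    end
    return sum(partials)
end

x = rand(1_000_000)
sum_chunks(x) ≈ sum(x)   # true, up to floating-point rounding
```

Because each iteration writes only to its own slot, no atomic operation or lock is needed. Whether this beats the built-in `sum` depends on the array size and the machine.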

In [4]:
arrN=[100,10000,100000,1000000]

for n in arrN
    print("n:$n\n")
    print("time single")
    x=rand(n,1)
    @btime begin
        linearSum($x,$m,$c)
    end
    
    print("time Multi threads")
    @btime begin
        linearSum_threads($x,$m,$c)
    end
#     delY=sum(y1-y2)
#     print("sum(delY)=$delY\n")
end

n:100
time single  53.753 ns (1 allocation: 896 bytes)
time Multi threads  18.200 μs (44 allocations: 7.53 KiB)
n:10000
time single  3.657 μs (2 allocations: 78.20 KiB)
time Multi threads  23.200 μs (45 allocations: 84.86 KiB)
n:100000
time single  39.300 μs (2 allocations: 781.33 KiB)
time Multi threads  68.999 μs (45 allocations: 787.98 KiB)
n:1000000
time single  1.476 ms (2 allocations: 7.63 MiB)
time Multi threads  2.517 ms (45 allocations: 7.64 MiB)


### E.g. 3 Monte Carlo

In [21]:
using Random
function darts_in_circle(n, rng=Random.GLOBAL_RNG)
    inside = 0
    for i in 1:n
        if rand(rng)^2 + rand(rng)^2 < 1
            inside += 1
        end
    end
    return inside
end
function pi_serial(n)
    return (4 * darts_in_circle(n) / n)
end

pi_serial (generic function with 1 method)

In [29]:
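# One RNG per thread, so threads do not share random-number state
const rnglist = [MersenneTwister() for i in 1:Threads.nthreads()]
function pi_threads(n, loops)
    inside = zeros(Int, loops)

    Threads.@threads for i in 1:loops
        rng = rnglist[Threads.threadid()]
        # Store by loop index; indexing by thread id would overwrite results
        # whenever one thread runs more than one iteration
        inside[i] = darts_in_circle(n, rng)
    end

    return (4 * sum(inside) / (n*loops))
end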

pi_threads (generic function with 1 method)

In [31]:
@btime pi_serial(16000000)
@btime pi_threads(2000000,8)


  80.334 ms (0 allocations: 0 bytes)
  5.855 ms (44 allocations: 6.78 KiB)


3.1411385

In [32]:
diffPI=pi_serial(16000000)-pi_threads(2000000,8)

0.00036199999999997345

In [33]:
diffPI=pi_serial(16000000)-pi_threads(1000000,8)

0.0005537499999999085

In [35]:
diffPI=pi_serial(16000000)-pi_serial(16000000)

-0.0009380000000001054
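
The differences above are of the size expected from sampling noise alone. As a rough check, the sketch below computes the standard error of the estimate under the usual binomial model, assuming independent uniform samples; the names `p`, `N`, `se`, and `se_diff` are illustrative.

```julia
# Probability that a random point in the unit square lands inside the quarter circle
p = π / 4

# Number of darts used in each run above
N = 16_000_000

# Standard error of the estimate 4 * inside / N (binomial model)
se = 4 * sqrt(p * (1 - p) / N)

# Standard error of the difference between two independent estimates
se_diff = sqrt(2) * se

println("Standard error of one estimate: ", se)       # about 4.1e-4
println("Standard error of a difference: ", se_diff)  # about 5.8e-4
```

A difference of a few times 1e-4 between two runs is therefore consistent with random variation, whether the runs are serial or threaded.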


# Atomic variable
When work is split across a parallel loop but the threads need to update a shared result, we declare that variable as atomic so that each update happens as one indivisible step.
### Declare an atomic variable


In [45]:
arr=3
varAtom = Threads.Atomic{eltype(arr)}(arr)

Base.Threads.Atomic{Int64}(3)

Access the value of an atomic variable

In [46]:
varAtom[]

3

Apply an atomic operation (`atomic_add!` returns the old value)

In [47]:
Threads.atomic_add!(varAtom,5)

3

See the updated value

In [48]:
varAtom[]

8
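
To see an atomic variable used from several threads, the minimal sketch below increments a shared counter inside a threaded loop. The name `counter` is illustrative.

```julia
using Base.Threads

# Shared counter that every thread updates
counter = Atomic{Int}(0)

@threads for i in 1:10_000
    # Each increment is performed as one indivisible read-modify-write step
    atomic_add!(counter, 1)
end

counter[]   # 10000, regardless of the number of threads
```

With a plain `Int` and `counter += 1` instead, increments from different threads could overwrite each other and the final count could come out lower.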

### Sum a large number

In [5]:
# Example: n = 100 elements split across 8 threads
# nDivTask = floor(100 / 8) = 12; the remainder of 4 goes to the last thread
#     Thread    Index range
#        1       1 - 12
#        2      13 - 24
#        3      25 - 36
#        ...
#        7      73 - 84
#        8      85 - 100   (96 + 4)

function sum_threads_atomic(x,nThreads)
    nx=length(x)
    nDivTask=div(nx,nThreads)    # elements per chunk (integer division)

    # Shared accumulator; only updated through atomic_add!
    y = Threads.Atomic{Float64}(0.0)

    Threads.@threads for it=1:nThreads
        istr=(it-1)*nDivTask + 1
        if it!=nThreads
            iend=istr+nDivTask-1
        else
            iend=nx              # the last chunk also takes the leftover elements
        end

        # Sum the chunk into a local variable first, then add it atomically.
        # Writing the partial sum into y[] directly would race with other threads.
        partial=sum(x[istr:iend,1])
        Threads.atomic_add!(y,partial)
    end

    return y[]
end

sum_threads_atomic (generic function with 1 method)

In [6]:
arrN=[100,10000,100000,1000000]
nThreads=Threads.nthreads()
for n in arrN
    x=rand(n,1)
    print("n:$n\n")
    print("time single")
# #     print("x:$x\n")
    
    @btime begin
        sum($x)
    end
    
    print("time Multi threads")
    @btime begin
        sum_threads_atomic($x,$nThreads)
    end
end

# ydiff=(sum(x)-sum_threads_atomic(x,nThreads))

n:100
time single  9.609 ns (0 allocations: 0 bytes)
time Multi threads  18.800 μs (100 allocations: 8.94 KiB)
n:10000
time single  778.641 ns (0 allocations: 0 bytes)
time Multi threads  20.400 μs (115 allocations: 87.27 KiB)
n:100000
time single  11.700 μs (0 allocations: 0 bytes)
time Multi threads  43.999 μs (123 allocations: 789.89 KiB)
n:1000000
time single  126.901 μs (0 allocations: 0 bytes)
time Multi threads  775.299 μs (123 allocations: 7.64 MiB)


# Async & Sync tasks
The parallel for loops above are normally used for independent tasks, where the task in each iteration does not require results from the others.