In [10]:
using Distributed
using Base.Threads


@everywhere begin
    using BenchmarkTools
    using ParallelTemperingMonteCarlo
end


  ** incremental compilation may be fatally broken for this module **

│ - If you have ParallelTemperingMonteCarlo checked out for development and have
│   added Distributed as a dependency but haven't updated your primary
│   environment's manifest file, try `Pkg.resolve()`.
│ - Otherwise you may need to report an issue with ParallelTemperingMonteCarlo


`pmap` applies a single-argument function to each element of a collection, so to use a function of several variables with it we first fix the extra arguments with a curry function.

In [11]:
@everywhere curry(f,y) = x -> f(x,y)
@everywhere add_xy(x,y) = x + y 


In [12]:
pmap(curry(add_xy,10), 1:5)


5-element Vector{Int64}:
 11
 12
 13
 14
 15
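The currying idea can be checked without any worker processes. The sketch below redefines `curry` and `add_xy` locally (without `@everywhere`) and uses plain `map` in place of `pmap`. It also shows `Base.Fix2`, a type in Base that fixes the second argument of a two-argument function. This is an illustration only, not how ParallelTemperingMonteCarlo does it.

```julia
# Fix the second argument of a two-argument function, then map over the first.
curry(f, y) = x -> f(x, y)
add_xy(x, y) = x + y

curried = map(curry(add_xy, 10), 1:5)

# Base.Fix2(f, y) behaves like x -> f(x, y), so it gives the same result.
fixed = map(Base.Fix2(add_xy, 10), 1:5)

println(curried)
println(fixed)
```

Swapping `map` for `pmap` requires the definitions to exist on every worker, which is why the notebook uses `@everywhere`.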

NB: spawning processes is expensive and _not recommended at all_ for simple loops and functions.

# Sync macros

`@sync` waits for all tasks started inside it to complete before moving on; `@async` starts a task and moves straight on without waiting for it.

In [13]:
@sync begin
    sleep(2)
    println("slept for two")

    @async begin 
        sleep(5)
        println("nice and rested")
    end
#the async wrapper skips straight to done
    println("done")
end

slept for two
done


nice and rested
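The ordering seen in the output above can be recorded explicitly. In this minimal sketch the names `events` and `elapsed` are chosen for illustration and the sleep is shortened to half a second. The body of the `@sync` block finishes first, the `@async` task finishes later, and only then does execution continue past `@sync`.

```julia
events = String[]

elapsed = @elapsed @sync begin
    @async begin
        sleep(0.5)
        push!(events, "task finished")
    end
    # Runs immediately: the task above has been scheduled but is still sleeping.
    push!(events, "body finished")
end

# @sync has waited for the task, so this is always recorded last.
push!(events, "after @sync")

println(events)
println("elapsed: ", elapsed, " s")
```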


We'll simulate a complex process with a 2 second sleep

In [14]:
function simtest(x)
    sleep(2)
    return x
end

simtest (generic function with 1 method)

In [15]:
@time begin 
    veccy = []
    for i = 1:10

        y = simtest(i)
        push!(veccy,y)
        
    end
    println(veccy)
end

@time begin 
    veccy = []   # create the vector once, outside the loop, so it is not reset on every iteration
    @sync for i = 1:10
        @async begin 
            y = simtest(i)
            push!(veccy,y)
        end
    end
    println(veccy)
end

Any[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
 20.330085 seconds (309.83 k allocations: 16.636 MiB, 0.08% gc time, 1.53% compilation time)


Any[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
  2.015827 seconds (2.56 k allocations: 129.685 KiB, 0.62% compilation time)
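A possible alternative to the `@sync`/`@async`/`push!` pattern is `asyncmap` from Base, which runs the calls as concurrent tasks and returns the results in input order. The sketch below uses a hypothetical `simtest_short` with a 0.2 second sleep so that it runs quickly.

```julia
# Hypothetical shortened stand-in for simtest: wait, then return the input.
simtest_short(x) = (sleep(0.2); x)

t0 = time()
res = asyncmap(simtest_short, 1:10)   # the ten sleeps overlap instead of running back to back
elapsed = time() - t0

println(res)
println("elapsed: ", elapsed, " s")
```

Because the results come back as an ordered vector, there is no shared container to mutate from inside the tasks.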


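NB: asynchronous tasks are concurrent, not parallel. As demonstrated above, they do speed up operations that spend most of their time waiting (here `sleep`), but they do not run computations simultaneously. Use `@spawn`, not `@async`, for parallel operations.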
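For compute-bound work, a minimal sketch with `Threads.@spawn` might look as follows. The helper `partial_sum` and the chunking are assumptions made for illustration. The tasks only run in parallel if Julia was started with more than one thread (for example `julia -t 4`), and with a single thread the code still runs and gives the same result.

```julia
using Base.Threads

# Hypothetical CPU-bound helper: sum of squares over a range.
partial_sum(r) = sum(x -> x^2, r)

chunks = [1:250, 251:500, 501:750, 751:1000]

# Each chunk becomes a task that the scheduler may place on any available thread.
tasks = [Threads.@spawn partial_sum(c) for c in chunks]

# fetch waits for a task and returns its value; the order of `tasks` is preserved.
total = sum(fetch.(tasks))

println("threads: ", nthreads(), "  total: ", total)
```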

Let's test a data unsafe operation

In [33]:
function printtest(x::Bool = true)
    println("begin")
    if x
        @sync for i=1:10
            Distributed.@spawn println("$i $i $i $i $i ")
        end
    else
        for i=1:10
            println("$i $i $i $i $i ")
        end
    end
    println("end")
end

printtest (generic function with 2 methods)

In [34]:
@time printtest()
println("done")

@time printtest(false)
println("done")
# @time begin 
#     println("begin")
#     @threads for i=1:10
#         println("$i $i $i $i $i ")
#     end
#     println("end")
# end

begin


1 1 1 1 1 
2 2 2 2 2 
3 3 3 3 3 
4 4 4 4 4 
5 5 5 5 5 
6 6 6 6 6 
7 7 7 7 7 
8 8 8 8 8 
9 9 9 9 9 
10 10 10 10 10 


end
  0.400992 seconds (210.65 k allocations: 11.385 MiB, 99.66% compilation time)
done
begin
1 1 1 1 1 
2 2 2 2 2 
3 3 3 3 3 
4 4 4 4 4 
5 5 5 5 5 
6 6 6 6 6 
7 7 7 7 7 
8 8 8 8 8 
9 9 9 9 9 
10 10 10 10 10 
end
  0.000624 seconds (419 allocations: 12.875 KiB)
done


# NB 
Categorically do not use `@spawn` where data ordering is relevant. Additionally, it is considered bad practice to parallelise any operation faster than 100 $\mu s$ for threads or 100 ms for spawning, as the speed-up is not enough to compensate for the time taken to spawn the operation.
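Where the order of results matters, one common pattern is to preallocate the output and let each iteration write only to its own slot, rather than pushing to a shared vector. The sketch below is an illustration with made-up names (`n`, `results`), not code from the package.

```julia
using Base.Threads

n = 10
results = Vector{Int}(undef, n)

@threads for i in 1:n
    # Iteration i owns slot i, so the final order does not depend on scheduling.
    results[i] = i^2
end

println(results)
```

Printing from inside the loop would still interleave unpredictably; only the stored results are ordered.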

Below we test specific functions for this purpose.

## Onwards, time-testing some functions

In [35]:
n_atoms = 13

# temperature grid
ti = 5.
tf = 16.
n_traj = 32

temp = TempGrid{n_traj}(ti,tf) 

# MC simulation details
mc_cycles = 300000 #default 20% equilibration cycles on top
mc_sample = 1  #sample every mc_sample MC cycles

#move_atom=AtomMove(n_atoms) #move strategy (here only atom moves, n_atoms per MC cycle)
displ_atom = 0.1 # Angstrom
n_adjust = 100

max_displ_atom = [0.1*sqrt(displ_atom*temp.t_grid[i]) for i in 1:n_traj]

mc_params = MCParams(mc_cycles, n_traj, n_atoms, mc_sample = mc_sample, n_adjust = n_adjust)

#moves - allowed at present: atom, volume and rotation moves (volume,rotation not yet implemented)
move_strat = MoveStrategy(atom_moves = n_atoms)  

#ensemble
ensemble = NVT(n_atoms)

#ELJpotential for neon
#c1=[-10.5097942564988, 0., 989.725135614556, 0., -101383.865938807, 0., 3918846.12841668, 0., -56234083.4334278, 0., 288738837.441765]
#elj_ne1 = ELJPotential{11}(c1)

c=[-10.5097942564988, 989.725135614556, -101383.865938807, 3918846.12841668, -56234083.4334278, 288738837.441765]
pot = ELJPotentialEven{6}(c)

#starting configurations
#icosahedral ground state of Ne13 (from Cambridge cluster database) in Angstrom
pos_ne13 = [[2.825384495892464, 0.928562467914040, 0.505520149314310],
[2.023342172678102,	-2.136126268595355, 0.666071287554958],
[2.033761811732818,	-0.643989413759464, -2.133000349161121],
[0.979777205108572,	2.312002562803556, -1.671909307631893],
[0.962914279874254,	-0.102326586625353, 2.857083360096907],
[0.317957619634043,	2.646768968413408, 1.412132053672896],
[-2.825388342924982, -0.928563755928189, -0.505520471387560],
[-0.317955944853142, -2.646769840660271, -1.412131825293682],
[-0.979776174195320, -2.312003751825495, 1.671909138648006],
[-0.962916072888105, 0.102326392265998,	-2.857083272537599],
[-2.023340541398004, 2.136128558801072,	-0.666071089291685],
[-2.033762834001679, 0.643989905095452, 2.132999911364582],
[0.000002325340981,	0.000000762100600, 0.000000414930733]]

#convert to Bohr
AtoBohr = 1.8897259886
pos_ne13 = pos_ne13 * AtoBohr

length(pos_ne13) == n_atoms || error("number of atoms and positions not the same - check starting config")

#boundary conditions 
bc_ne13 = SphericalBC(radius=5.32*AtoBohr)   #5.32 Angstrom

#starting configuration
start_config = Config(pos_ne13, bc_ne13)

#histogram information
n_bin = 100
#en_min = -0.006    #might want to update after equilibration run if generated on the fly
#en_max = -0.001    #otherwise will be determined after run as min/max of sampled energies (ham vector)

#construct array of MCState (for each temperature)
mc_states = [MCState(temp.t_grid[i], temp.beta_grid[i], start_config, pot; max_displ=[max_displ_atom[i],0.01,1.]) for i in 1:n_traj]

#results = Output(n_bin, max_displ_vec)
results = Output{Float64}(n_bin; en_min = mc_states[1].en_tot)


Output{Float64}(100, 0.0, 0.0, Float64[], Float64[], Float64[], Vector{Float64}[], Float64[], Float64[], Float64[], Float64[], Float64[])

Above we just define the 13-atom system. Below we benchmark one `mc_step!` per trajectory, first in a serial loop and then in a `@threads` loop, to see what threading does to the time taken.

In [20]:
@benchmark begin 
    for i in 1:mc_params.n_traj
        mc_step!(mc_states[i],pot,ensemble,1,0,0);
    end
end



BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  13.400 μs …   6.396 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     20.400 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   25.756 μs ± 128.136 μs  ┊ GC (mean ± σ):  1.91% ± 1.35%

In [21]:
@benchmark begin 
    @threads for i in 1:mc_params.n_traj
        mc_step!(mc_states[i],pot,ensemble,1,0,0);
    end
end


BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  27.100 μs …  25.266 ms  ┊ GC (min … max): 0.00% … 98.97%
 Time  (median):     53.500 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   72.984 μs ± 317.122 μs  ┊ GC (mean ± σ):  3.43% ±  0.99%

In [22]:
@benchmark begin 
    for i in 1:mc_params.n_traj
        x = MCRun.atom_displacement(mc_states[i].config.pos[2],mc_states[i].max_displ[1],mc_states[i].config.bc);
    end
end


BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  12.600 μs …  4.158 ms  ┊ GC (min … max): 0.00% … 97.62%
 Time  (median):     16.400 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   21.325 μs ± 62.162 μs  ┊ GC (mean ± σ):  1.90% ±  0.98%

In [23]:
@benchmark begin 
    @threads for i in 1:mc_params.n_traj
        x = MCRun.atom_displacement(mc_states[i].config.pos[2],mc_states[i].max_displ[1],mc_states[i].config.bc);
    end
end

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  24.900 μs …  32.992 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     59.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   92.799 μs ± 505.959 μs  ┊ GC (mean ± σ):  2.50% ± 0.99%

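The expectation is that threading speeds up the `atom_displacement` step, and more so once the Metropolis condition is completed as well, in which case the dimer could be threaded by simply threading the loops. In the runs recorded above, however, the threaded loops are slower than the serial ones (median 53.5 μs against 20.4 μs for `mc_step!`, and 59.1 μs against 16.4 μs for `atom_displacement`). This is in line with the guideline above on very short operations: loops this short do not repay the threading overhead.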
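Whether a threaded loop pays off depends on the number of threads Julia was started with and on the cost of each iteration. The sketch below uses only Base and a hypothetical `work` function as a cheap stand-in for a per-trajectory step. It reports `nthreads()` alongside the best serial and threaded times, so the comparison can be repeated on any machine. The helper names (`serial!`, `threaded!`, `best_time`) are illustrative.

```julia
using Base.Threads

# Hypothetical cheap per-trajectory step, used only for illustration.
work(i) = sum(sqrt(i + k) for k in 1:100)

function serial!(out)
    for i in eachindex(out)
        out[i] = work(i)
    end
    return out
end

function threaded!(out)
    @threads for i in eachindex(out)
        out[i] = work(i)
    end
    return out
end

# Smallest wall-clock time over many repetitions.
best_time(f!, out; reps = 1000) = minimum(@elapsed(f!(out)) for _ in 1:reps)

a = serial!(zeros(32))      # first calls also trigger compilation
b = threaded!(zeros(32))

t_serial = best_time(serial!, a)
t_threaded = best_time(threaded!, b)

println("nthreads = ", nthreads())
println("serial   = ", t_serial, " s")
println("threaded = ", t_threaded, " s")
```

With a single thread, or with iterations this cheap, the threaded version is likely to be the slower of the two.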

# Time to test the writing step

Let's test writing 32 (trajectories) × 55 (atoms) lines to a file, and time this with `@sync`/`@async` and without.

In [24]:
t1 = @benchmark begin
    filetest = open("test.dat", "w+")

    @sync begin
        @async for i=1:32
            @async for j=1:55
                write(filetest,"$i $j \n")
            end
        end
    end

    close(filetest)
end

BenchmarkTools.Trial: 530 samples with 1 evaluation.
 Range (min … max):  6.418 ms … 52.703 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.797 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.422 ms ±  3.078 ms  ┊ GC (mean ± σ):  1.07% ± 5.27%

In [25]:
t2 = @benchmark begin
    filetest = open("test2.dat", "w+")

    for i=1:32
        for j=1:55
            write(filetest,"$i $j \n")
        end
    end

    close(filetest)
end

BenchmarkTools.Trial: 560 samples with 1 evaluation.
 Range (min … max):  6.241 ms … 24.304 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.197 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.907 ms ±  2.314 ms  ┊ GC (mean ± σ):  1.06% ± 5.01%

Using `@sync`/`@async` is slightly slower (median 8.8 ms against 8.2 ms). However, since the order of the atoms within a trajectory does not matter, `@spawn` or `@threads` may be more useful.

In [26]:
t3 = @benchmark begin
    filetest = open("test3.dat", "w+")

    @sync begin
        @async for i=1:32
            @threads for j=1:55
                write(filetest,"$i $j \n")
            end
        end
    end

    close(filetest)
end

BenchmarkTools.Trial: 378 samples with 1 evaluation.
 Range (min … max):   6.879 ms … 99.789 ms  ┊ GC (min … max): 0.00% … 76.25%
 Time  (median):     12.394 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.195 ms ±  6.057 ms  ┊ GC (mean ± σ):  1.53% ±  3.92%

In [36]:
t4 = @benchmark begin
    filetest = open("test4.dat", "w+")

    @sync begin
        @async for i=1:32
            for j=1:55
                Threads.@spawn write(filetest,"$i $j \n")
            end
        end
    end

    close(filetest)
end

BenchmarkTools.Trial: 474 samples with 1 evaluation.
 Range (min … max):   6.729 ms … 50.094 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     10.110 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.539 ms ±  3.559 ms  ┊ GC (mean ± σ):  3.12% ± 8.35%

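Nope. Writing shouldn't be parallelised: neither `@threads` (median 12.4 ms) nor `Threads.@spawn` (median 10.1 ms) beats the plain serial loop (median 8.2 ms). Atom moves, on the other hand, can benefit from this. Operations, and in particular for-loops for the storage steps, are next on the agenda.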
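One way to keep a single serial writer while still doing the formatting in parallel is to let each trajectory fill its own in-memory buffer and write the buffers out in order afterwards. This is a sketch under assumptions: the sizes mirror the 32 × 55 test above, the file goes to a temporary path from `tempname()`, and the names `chunks` and `path` are made up for illustration.

```julia
using Base.Threads

n_traj, n_atoms = 32, 55
chunks = Vector{String}(undef, n_traj)

@threads for i in 1:n_traj
    buf = IOBuffer()                 # private to this iteration, so no contention
    for j in 1:n_atoms
        print(buf, i, " ", j, " \n")
    end
    chunks[i] = String(take!(buf))   # slot i belongs to trajectory i
end

# A single writer keeps the file in trajectory order.
path = tempname()
open(path, "w") do io
    for chunk in chunks
        write(io, chunk)
    end
end

println("wrote ", countlines(path), " lines to ", path)
```

Whether this is any faster than the plain loop for output this small would need to be benchmarked in the same way as the cells above.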

Testing some RuNNer functionality

In [37]:
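function test_things(x,mc_states)
    file = RuNNer.writeinit(pwd())
    # `===` is needed here: `1 == true` holds in Julia, so `==` would send x = 1 down the first branch
    if x === true
        # @threads already waits for all iterations before continuing
        @threads for mc_state in mc_states
            writeconfig(file,mc_state.config,"Cu")
        end
    elseif x === false
        for mc_state in mc_states
            writeconfig(file,mc_state.config,"Cu")
        end
    elseif x === 1
        # wait for the spawned writes to finish before the file is closed
        @sync for mc_state in mc_states
            Distributed.@spawn writeconfig(file,mc_state.config,"Cu")
        end
    end
    close(file)
end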


test_things (generic function with 1 method)

In [38]:
@benchmark test_things(true,mc_states)

@benchmark test_things(1,mc_states)

@benchmark test_things(false,mc_states)

BenchmarkTools.Trial: 788 samples with 1 evaluation.
 Range (min … max):  4.828 ms … 17.895 ms  ┊ GC (min … max): 0.00% … 63.33%
 Time  (median):     5.756 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.318 ms ±  1.634 ms  ┊ GC (mean ± σ):  0.87% ±  4.40%

As expected, it is significantly faster not to thread the writing step. What we can try instead is writing several files. Next, the pmap formalism.
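The several-files idea mentioned above could be sketched as one file per trajectory, so that no two tasks share a file handle. The directory comes from `mktempdir()`, and the file names and small sizes are assumptions chosen to keep the example short. This does not use the RuNNer writer.

```julia
using Base.Threads

dir = mktempdir()
n_traj = 4

@threads for i in 1:n_traj
    # Each trajectory opens, writes and closes its own file.
    open(joinpath(dir, "traj_$i.dat"), "w") do io
        for j in 1:3
            println(io, i, " ", j)
        end
    end
end

println(sort(readdir(dir)))
```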

In [31]:
 @everywhere curry(f,y1,y2,y3,y4,y5,y6,y7,y8) = x -> f(x,y1,y2,y3,y4,y5,y6,y7,y8)
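Instead of writing a separate `curry` method for each number of fixed arguments, a variadic definition can cover all of them. The sketch below is a local illustration with a hypothetical `add3` and plain `map`. For use with `pmap` the definition would still need `@everywhere`.

```julia
# Fix any number of trailing arguments; the returned closure takes the first one.
curry(f, ys...) = x -> f(x, ys...)

# Hypothetical three-argument function for illustration.
add3(x, y, z) = x + y + z

out = map(curry(add3, 10, 100), 1:3)
println(out)
```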
 