<!DOCTYPE html>
<html lang="en"><head><meta charset="UTF-8"/><meta name="viewport" content="width=device-width, initial-scale=1.0"/><title>Timings and parallelization · DFTK.jl</title><link rel="canonical" href="https://juliamolsim.github.io/DFTK.jl/stable/guide/parallelization/"/><link href="https://fonts.googleapis.com/css?family=Lato|Roboto+Mono" rel="stylesheet" type="text/css"/><link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.0/css/fontawesome.min.css" rel="stylesheet" type="text/css"/><link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.0/css/solid.min.css" rel="stylesheet" type="text/css"/><link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.0/css/brands.min.css" rel="stylesheet" type="text/css"/><link href="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.11.1/katex.min.css" rel="stylesheet" type="text/css"/><script>documenterBaseURL="../.."</script><script src="https://cdnjs.cloudflare.com/ajax/libs/require.js/2.3.6/require.min.js" data-main="../../assets/documenter.js"></script><script src="../../siteinfo.js"></script><script src="../../../versions.js"></script><link class="docs-theme-link" rel="stylesheet" type="text/css" href="../../assets/themes/documenter-dark.css" data-theme-name="documenter-dark" data-theme-primary-dark/><link class="docs-theme-link" rel="stylesheet" type="text/css" href="../../assets/themes/documenter-light.css" data-theme-name="documenter-light" data-theme-primary/><script src="../../assets/themeswap.js"></script><link href="../../assets/favicon.ico" rel="icon" type="image/x-icon"/></head><body><div id="documenter"><nav class="docs-sidebar"><a class="docs-logo" href="../../"><img src="../../assets/logo.png" alt="DFTK.jl logo"/></a><div class="docs-package-name"><span class="docs-autofit">DFTK.jl</span></div><form class="docs-search" action="../../search/"><input class="docs-search-query" id="documenter-search-query" name="q" type="text" placeholder="Search docs"/></form><ul 
class="docs-menu"><li><a class="tocitem" href="../../">Home</a></li><li><span class="tocitem">Getting started</span><ul><li><a class="tocitem" href="../installation/">Installation</a></li><li><a class="tocitem" href="../tutorial/">Tutorial</a></li><li><a class="tocitem" href="../input_output/">Input and output formats</a></li><li class="is-active"><a class="tocitem" href>Timings and parallelization</a><ul class="internal"><li><a class="tocitem" href="#Timing-measurements"><span>Timing measurements</span></a></li><li><a class="tocitem" href="#Options-for-parallelization"><span>Options for parallelization</span></a></li><li><a class="tocitem" href="#MPI-based-parallelism"><span>MPI-based parallelism</span></a></li><li><a class="tocitem" href="#Thread-based-parallelism"><span>Thread-based parallelism</span></a></li><li><a class="tocitem" href="#Advanced-threading-tweaks"><span>Advanced threading tweaks</span></a></li></ul></li><li><a class="tocitem" href="../density_functional_theory/">Density-functional theory</a></li></ul></li><li><span class="tocitem">Examples</span><ul><li><a class="tocitem" href="../../examples/metallic_systems/">Temperature and metallic systems</a></li><li><a class="tocitem" href="../../examples/pymatgen/">Creating supercells with pymatgen</a></li><li><a class="tocitem" href="../../examples/ase/">Creating slabs with ASE</a></li><li><a class="tocitem" href="../../examples/collinear_magnetism/">Collinear spin and magnetic systems</a></li><li><a class="tocitem" href="../../examples/geometry_optimization/">Geometry optimization</a></li><li><a class="tocitem" href="../../examples/scf_callbacks/">Monitoring self-consistent field calculations</a></li><li><a class="tocitem" href="../../examples/scf_checkpoints/">Saving SCF results on disk and SCF checkpoints</a></li><li><a class="tocitem" href="../../examples/polarizability/">Polarizability by linear response</a></li><li><a class="tocitem" href="../../examples/gross_pitaevskii/">Gross-Pitaevskii 
equation in one dimension</a></li><li><a class="tocitem" href="../../examples/gross_pitaevskii_2D/">Gross-Pitaevskii equation with magnetism</a></li><li><a class="tocitem" href="../../examples/cohen_bergstresser/">Cohen-Bergstresser model</a></li><li><a class="tocitem" href="../../examples/arbitrary_floattype/">Arbitrary floating-point types</a></li><li><a class="tocitem" href="../../examples/custom_solvers/">Custom solvers</a></li><li><a class="tocitem" href="../../examples/custom_potential/">Custom potential</a></li></ul></li><li><span class="tocitem">Advanced topics</span><ul><li><a class="tocitem" href="../../advanced/conventions/">Notation and conventions</a></li><li><a class="tocitem" href="../../advanced/data_structures/">Data structures</a></li><li><a class="tocitem" href="../../advanced/useful_formulas/">Useful formulas</a></li><li><a class="tocitem" href="../../advanced/symmetries/">Crystal symmetries</a></li></ul></li><li><a class="tocitem" href="../../api/">API reference</a></li><li><a class="tocitem" href="../../publications/">Publications</a></li></ul><div class="docs-version-selector field has-addons"><div class="control"><span class="docs-label button is-static is-size-7">Version</span></div><div class="docs-selector control is-expanded"><div class="select is-fullwidth is-size-7"><select id="documenter-version-selector"></select></div></div></div></nav><div class="docs-main"><header class="docs-navbar"><nav class="breadcrumb"><ul class="is-hidden-mobile"><li><a class="is-disabled">Getting started</a></li><li class="is-active"><a href>Timings and parallelization</a></li></ul><ul class="is-hidden-tablet"><li class="is-active"><a href>Timings and parallelization</a></li></ul></nav><div class="docs-right"><a class="docs-edit-link" href="https://github.com/JuliaMolSim/DFTK.jl/blob/master/docs/src/guide/parallelization.md" title="Edit on GitHub"><span class="docs-icon fab"></span><span class="docs-label is-hidden-touch">Edit on GitHub</span></a><a 
class="docs-settings-button fas fa-cog" id="documenter-settings-button" href="#" title="Settings"></a><a class="docs-sidebar-button fa fa-bars is-hidden-desktop" id="documenter-sidebar-button" href="#"></a></div></header><article class="content" id="documenter-page"><h1 id="Timings-and-parallelization"><a class="docs-heading-anchor" href="#Timings-and-parallelization">Timings and parallelization</a><a id="Timings-and-parallelization-1"></a><a class="docs-heading-anchor-permalink" href="#Timings-and-parallelization" title="Permalink"></a></h1><p>This section summarizes the options DFTK offers to monitor and influence the performance of the code.</p><h2 id="Timing-measurements"><a class="docs-heading-anchor" href="#Timing-measurements">Timing measurements</a><a id="Timing-measurements-1"></a><a class="docs-heading-anchor-permalink" href="#Timing-measurements" title="Permalink"></a></h2><p>By default DFTK uses <a href="https://github.com/KristofferC/TimerOutputs.jl">TimerOutputs.jl</a> to record timings, memory allocations and the number of calls for selected routines inside the code. These numbers are accessible in the object <code>DFTK.timer</code>. Since timings accumulate automatically inside this data structure, reset the timer before running the calculation you want to measure.</p><p>For example, to measure the timing of an SCF:</p><pre><code class="language-julia">DFTK.reset_timer!(DFTK.timer)
# `basis` is assumed to be a PlaneWaveBasis constructed earlier (e.g. as in the tutorial)
scfres = self_consistent_field(basis, tol=1e-8)
DFTK.timer</code></pre><pre class="documenter-example-output"> ──────────────────────────────────────────────────────────────────────────────
                                        Time                   Allocations
                                ──────────────────────   ───────────────────────
        Tot / % measured:            837ms / 23.6%           88.4MiB / 39.6%
 Section                ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────
 self_consistent_field       1    197ms   100%   197ms   34.7MiB  99.0%  34.7MiB
   compute_density           6   85.9ms  43.5%  14.3ms   6.15MiB  17.6%  1.02MiB
   LOBPCG                   12   85.5ms  43.3%  7.13ms   10.6MiB  30.3%   904KiB
     Hamiltonian mu...      42   56.0ms  28.3%  1.33ms   3.49MiB  10.0%  85.0KiB
       kinetic+local        42   53.6ms  27.1%  1.28ms    753KiB  2.10%  17.9KiB
       nonlocal             42   1.35ms  0.68%  32.1μs    828KiB  2.31%  19.7KiB
     ortho!                112   13.3ms  6.74%   119μs    925KiB  2.58%  8.26KiB
     rayleigh_ritz          30   5.08ms  2.57%   169μs   1.03MiB  2.93%  35.1KiB
   energy_hamiltonian       13   21.9ms  11.1%  1.68ms   14.1MiB  40.3%  1.09MiB
     ene_ops                13   19.9ms  10.1%  1.53ms   10.4MiB  29.8%   821KiB
       ene_ops: xc          13   14.7ms  7.45%  1.13ms   3.07MiB  8.77%   242KiB
       ene_ops: har...      13   2.89ms  1.46%   222μs   5.85MiB  16.7%   460KiB
       ene_ops: non...      13    734μs  0.37%  56.4μs    152KiB  0.42%  11.7KiB
       ene_ops: local       13    555μs  0.28%  42.7μs   1.23MiB  3.50%  96.5KiB
       ene_ops: kin...      13    441μs  0.22%  33.9μs   95.0KiB  0.26%  7.31KiB
   QR orthonormaliz...      12    213μs  0.11%  17.7μs    160KiB  0.45%  13.3KiB
   guess_density             1    493μs  0.25%   493μs    370KiB  1.03%   370KiB
 ──────────────────────────────────────────────────────────────────────────────</pre><p>Printing or displaying <code>DFTK.timer</code> now shows a table summarising the total time and allocations as well as a breakdown over individual routines.</p><div class="admonition is-info"><header class="admonition-header">Timing measurements and stack traces</header><div class="admonition-body"><p>Timing measurements have the unfortunate disadvantage that they alter the way stack traces look, which can make errors harder to find when debugging. For this reason timing measurements can be disabled completely (i.e. not even compiled into the code) by setting the environment variable <code>DFTK_TIMING</code> to <code>"0"</code> or <code>"false"</code>. For this to take effect, all of DFTK needs to be recompiled (including the precompile cache).</p></div></div><div class="admonition is-info"><header class="admonition-header">Timing measurements and threading</header><div class="admonition-body"><p>Unfortunately, measuring timings in <code>TimerOutputs</code> is not yet thread-safe. Timings of threaded parts of the code are therefore disabled unless you set <code>DFTK_TIMING</code> to <code>"all"</code>. In this case you must not use Julia threading (see section below), otherwise undefined behaviour results.</p></div></div><h2 id="Options-for-parallelization"><a class="docs-heading-anchor" href="#Options-for-parallelization">Options for parallelization</a><a id="Options-for-parallelization-1"></a><a class="docs-heading-anchor-permalink" href="#Options-for-parallelization" title="Permalink"></a></h2><p>At the moment DFTK offers two ways to parallelize a calculation: shared-memory parallelism using threading, and multiprocessing using MPI (via the <a href="https://github.com/JuliaParallel/MPI.jl">MPI.jl</a> Julia interface). 
MPI-based parallelism currently distributes work over <code>k</code>-Points only, so it cannot be used for calculations with only a single <code>k</code>-Point. Otherwise both forms of parallelism can also be combined.</p><p>The scaling of both forms of parallelism for a number of test cases is demonstrated in the following figure. These values were obtained using DFTK version 0.1.17 and Julia 1.6, and the precise scalings will likely differ depending on the architecture and the DFTK or Julia version. The rough trends should, however, be similar.</p><img src="../scaling.png" width=750 /><p>The MPI-based parallelization strategy clearly shows superior scaling and should be preferred if available.</p><h2 id="MPI-based-parallelism"><a class="docs-heading-anchor" href="#MPI-based-parallelism">MPI-based parallelism</a><a id="MPI-based-parallelism-1"></a><a class="docs-heading-anchor-permalink" href="#MPI-based-parallelism" title="Permalink"></a></h2><p>Currently DFTK uses MPI to distribute over <code>k</code>-Points only. This implies that calculations with only a single <code>k</code>-Point cannot make use of it. For details on setting up and configuring MPI with Julia see the <a href="https://juliaparallel.github.io/MPI.jl/stable/configuration">MPI.jl documentation</a>.</p><ol><li><p>First disable all threading inside DFTK by adding the following to the script running the DFTK calculation:</p><pre><code class="language-julia">using DFTK
disable_threading()</code></pre></li><li><p>Run Julia in parallel using the <code>mpiexecjl</code> wrapper script from MPI.jl:</p><pre><code class="language-sh">mpiexecjl -np 16 julia -t 1 myscript.jl</code></pre><p>Here <code>-np 16</code> tells MPI to use 16 processes and <code>-t 1</code> tells Julia to use one thread only. Notice that we use <code>mpiexecjl</code> to automatically select the <code>mpiexec</code> compatible with the MPI version used by MPI.jl.</p></li></ol><p>As usual with MPI, output printed by several processes at once will be garbled. You can use</p><pre><code class="language-julia">DFTK.mpi_master() || (redirect_stdout(); redirect_stderr())</code></pre><p>at the top of your script to disable printing on all processes but one.</p><div class="admonition is-warning"><header class="admonition-header">MPI-based parallelism is experimental</header><div class="admonition-body"><p>Even though MPI-based parallelism shows the better scaling, it is still experimental and some routines (e.g. band structure and direct minimization) are not compatible with it yet.</p></div></div><h2 id="Thread-based-parallelism"><a class="docs-heading-anchor" href="#Thread-based-parallelism">Thread-based parallelism</a><a id="Thread-based-parallelism-1"></a><a class="docs-heading-anchor-permalink" href="#Thread-based-parallelism" title="Permalink"></a></h2><p>Threading in DFTK currently happens on multiple layers, distributing the workload between threads over different <span>$k$</span>-Points, over bands, or within an FFT or BLAS call. At its current stage our scaling for thread-based parallelism is worse compared to the MPI-based one, and therefore the parallelism described here should only be used if no other option exists. To use thread-based parallelism proceed as follows:</p><ol><li><p>Ensure that threading is properly set up inside DFTK by adding the following to the script running the DFTK calculation:</p><pre><code class="language-julia">using DFTK
setup_threading()</code></pre><p>This disables FFT threading and sets the number of BLAS threads to the number of Julia threads.</p></li><li><p>Run Julia passing the desired number of threads using the flag <code>-t</code>:</p><pre><code class="language-sh">julia -t 8 myscript.jl</code></pre></li></ol><p>For some cases (e.g. a single <code>k</code>-Point, few bands and a large FFT grid) it can be advantageous to add threading inside the FFTs as well. One example is the Caffeine calculation in the above scaling plot. To do so, call <code>setup_threading(n_fft=2)</code>, which will select two FFT threads. More than two FFT threads are rarely useful.</p><h2 id="Advanced-threading-tweaks"><a class="docs-heading-anchor" href="#Advanced-threading-tweaks">Advanced threading tweaks</a><a id="Advanced-threading-tweaks-1"></a><a class="docs-heading-anchor-permalink" href="#Advanced-threading-tweaks" title="Permalink"></a></h2><p>The default threading setup done by <code>setup_threading</code> is to select one FFT thread and the same number of BLAS and Julia threads. This section provides some info in case you want to change these defaults.</p><h3 id="BLAS-threads"><a class="docs-heading-anchor" href="#BLAS-threads">BLAS threads</a><a id="BLAS-threads-1"></a><a class="docs-heading-anchor-permalink" href="#BLAS-threads" title="Permalink"></a></h3><p>All BLAS calls in Julia go through a parallelized OpenBLAS or MKL (with <a href="https://github.com/JuliaComputing/MKL.jl">MKL.jl</a>). Generally threading in BLAS calls is far from optimal and the default settings can be pretty bad. For example, for CPUs with hyper-threading enabled, the default number of threads seems to equal the number of <em>virtual</em> cores. Still, BLAS calls typically take second place in terms of the share of runtime they make up (between 10% and 20%). Note that many of these do not take place on matrices of the size of the full FFT grid, but rather only in a subspace (e.g. 
orthogonalization, Rayleigh-Ritz, ...) such that parallelization is either disabled by the BLAS library anyway or not very effective. To <strong>set the number of BLAS threads</strong> use</p><pre><code class="language-none">using LinearAlgebra
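# Sketch only: N = 4 is a placeholder value (an assumption, not a DFTK
# recommendation); replace it with the number of BLAS threads you want.
N = 4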
BLAS.set_num_threads(N)</code></pre><p>where <code>N</code> is the number of threads you desire. To <strong>check the number of BLAS threads</strong> currently used, you can use</p><pre><code class="language-none">Int(ccall((BLAS.@blasfunc(openblas_get_num_threads), BLAS.libblas), Cint, ()))</code></pre><p>or (from Julia 1.6) simply <code>BLAS.get_num_threads()</code>.</p><h3 id="Julia-threads"><a class="docs-heading-anchor" href="#Julia-threads">Julia threads</a><a id="Julia-threads-1"></a><a class="docs-heading-anchor-permalink" href="#Julia-threads" title="Permalink"></a></h3><p>On top of BLAS threading DFTK uses Julia threads (<code>Threads.@threads</code>) in a couple of places to parallelize over <code>k</code>-Points (density computation) or bands (Hamiltonian application). The number of threads used for these aspects is controlled by the flag <code>-t</code> passed to Julia or the <em>environment variable</em> <code>JULIA_NUM_THREADS</code>. To <strong>check the number of Julia threads</strong> use <code>Threads.nthreads()</code>.</p><h3 id="FFT-threads"><a class="docs-heading-anchor" href="#FFT-threads">FFT threads</a><a id="FFT-threads-1"></a><a class="docs-heading-anchor-permalink" href="#FFT-threads" title="Permalink"></a></h3><p>Since FFT threading is only used in DFTK inside the regions already parallelized by Julia threads, setting FFT threads to something larger than <code>1</code> is rarely useful if a sensible number of Julia threads has been chosen. Still, to explicitly <strong>set the FFT threads</strong> use</p><pre><code class="language-none">using FFTW
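# Sketch only: N = 1 is a placeholder value (an assumption) that keeps
# FFTW single-threaded; raise it only if FFT threading helps your case.
N = 1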
FFTW.set_num_threads(N)</code></pre><p>where <code>N</code> is the number of threads you desire. By default no FFT threads are used, which is almost always the best choice.</p></article><nav class="docs-footer"><a class="docs-footer-prevpage" href="../input_output/">« Input and output formats</a><a class="docs-footer-nextpage" href="../density_functional_theory/">Density-functional theory »</a><div class="flexbox-break"></div><p class="footer-message">Powered by <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> and the <a href="https://julialang.org/">Julia Programming Language</a>.</p></nav></div><div class="modal" id="documenter-settings"><div class="modal-background"></div><div class="modal-card"><header class="modal-card-head"><p class="modal-card-title">Settings</p><button class="delete"></button></header><section class="modal-card-body"><p><label class="label">Theme</label><div class="select"><select id="documenter-themepicker"><option value="documenter-light">documenter-light</option><option value="documenter-dark">documenter-dark</option></select></div></p><hr/><p>This document was generated with <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> on <span class="colophon-date" title="Monday 19 July 2021 10:16">Monday 19 July 2021</span>. Using Julia version 1.6.2.</p></section><footer class="modal-card-foot"></footer></div></div></div></body></html>