# Using DFTK on GPUs

This example shows how to run DFTK on
graphics processing units (GPUs).
Nvidia GPUs, driven through the
[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) Julia
package, are currently the best-supported platform. Running on AMD GPUs is also possible
with the [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl) package,
albeit with lower performance.

> **GPU parallelism not supported everywhere**
>
> Not all features of DFTK are ported to the GPU. SCF and forces
> with standard Libxc functionals are fully supported. Stresses
> and response calculations are work in progress as of December 2025.
> In most cases there is no intrinsic limitation, and minor code
> modifications are typically enough to make a routine work on GPUs (with some
> extra work for optimization). If you need GPU support in a routine that does not
> yet have it, feel free to open an issue on GitHub or otherwise get in touch.

In [1]:
using AtomsBuilder
using DFTK
using PseudoPotentialData

**Model setup.** The first step is to set up a `Model` in DFTK.
This proceeds exactly as in the standard CPU case
(see also our Tutorial).

In [2]:
silicon = bulk(:Si)

model  = model_DFT(silicon;
                   functionals=PBE(),
                   pseudopotentials=PseudoFamily("dojo.nc.sr.pbe.v0_4_1.standard.upf"))
nothing  # hide

Next, select the computational architecture.
This choice determines whether the computation runs
on the CPU or on a GPU.

**Nvidia GPUs.**
Supported via [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl).
Installing the CUDA package automatically downloads all required
Nvidia CUDA libraries, so the only thing
you have to do is:

In [3]:
using CUDA
architecture = DFTK.GPU(CuArray)

DFTK.GPU{CUDA.CuArray}()

**AMD GPUs.** Supported via [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl).
Here you need to [install ROCm](https://rocm.docs.amd.com/) manually.
With that in place you can then select:

In [4]:
using AMDGPU
architecture = DFTK.GPU(ROCArray)

DFTK.GPU{AMDGPU.ROCArray}()

**Portable architecture selection.**
To make sure this script also runs on the GitHub CI (where no GPUs are
available), we check for the availability of a GPU before selecting an
architecture:

In [5]:
architecture = has_cuda() ? DFTK.GPU(CuArray) : DFTK.CPU()

DFTK.CPU()
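
The check above only covers Nvidia hardware. A minimal sketch that also tries AMD GPUs could look as follows. It assumes that both CUDA.jl and AMDGPU.jl are installed and that their `functional()` queries (`CUDA.functional()`, `AMDGPU.functional()`) report whether a usable driver stack and device were found. The helper name `select_architecture` is hypothetical and not part of DFTK.

```julia
using DFTK
using CUDA
using AMDGPU

# Hypothetical helper (not part of DFTK): prefer an Nvidia GPU, then an
# AMD GPU, and fall back to the CPU if neither backend is usable.
function select_architecture()
    if CUDA.functional()
        DFTK.GPU(CuArray)
    elseif AMDGPU.functional()
        DFTK.GPU(ROCArray)
    else
        DFTK.CPU()
    end
end

architecture = select_architecture()
```

The resulting `architecture` can be passed to `PlaneWaveBasis` in the same way as in the cells of this example.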

**Basis and SCF.**
Based on the `architecture` we construct a `PlaneWaveBasis` object
as usual:

In [6]:
basis  = PlaneWaveBasis(model; Ecut=30, kgrid=(5, 5, 5), architecture)
nothing  # hide

... and run the SCF and some post-processing:

In [7]:
scfres = self_consistent_field(basis; tol=1e-6)
compute_forces(scfres)

n     Energy            log10(ΔE)   log10(Δρ)   Diag   Δtime 
---   ---------------   ---------   ---------   ----   ------
  1   -8.457773823709                   -0.94    5.2    9.19s
  2   -8.459895393305       -2.67       -1.78    1.0    4.08s
  3   -8.460053485886       -3.80       -2.92    2.0    494ms
  4   -8.460064824878       -4.95       -3.36    3.2    1.55s
  5   -8.460064907824       -7.08       -3.91    1.7    348ms
  6   -8.460064914783       -8.16       -5.16    1.6    356ms
  7   -8.460064915259       -9.32       -5.35    3.2    508ms
  8   -8.460064915266      -11.16       -6.43    1.0    313ms


2-element Vector{StaticArraysCore.SVector{3, Float64}}:
 [-1.2568742991302967e-14, -1.7702971620062024e-14, -1.9019951759976258e-14]
 [1.5305078969393146e-14, 1.7933098284562134e-14, 2.0171795831799374e-14]

> **GPU performance**
>
> Our current (December 2025) benchmarks show DFTK to have reasonable performance
> on Nvidia / CUDA GPUs with up to a 100-fold speed-up over single-threaded
> CPU execution (SCF + forces). A lot of work has gone into stabilizing
> the AMDGPU implementation as well, but its performance is typically lower
> (roughly a 20-fold speed-up). There may still be rough edges, and we would
> appreciate hearing about your experience as well as any bug reports.
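
To get a rough idea of the speed-up on your own hardware, the CPU and GPU runs can be timed side by side. The following is a minimal sketch, not a rigorous benchmark: the helper `timed_scf` is hypothetical (not part of DFTK), the timing uses plain wall-clock time, and each configuration is run once beforehand so that Julia's compilation time is not counted. On a machine without a functional CUDA setup both runs use the CPU.

```julia
using AtomsBuilder
using DFTK
using PseudoPotentialData
using CUDA

model = model_DFT(bulk(:Si);
                  functionals=PBE(),
                  pseudopotentials=PseudoFamily("dojo.nc.sr.pbe.v0_4_1.standard.upf"))

# Hypothetical helper (not part of DFTK): run SCF + forces on the given
# architecture and return the total energy together with the wall time.
function timed_scf(model, architecture)
    basis  = PlaneWaveBasis(model; Ecut=30, kgrid=(5, 5, 5), architecture)
    start  = time()
    scfres = self_consistent_field(basis; tol=1e-6)
    compute_forces(scfres)
    (; energy=scfres.energies.total, seconds=time() - start)
end

# Use the GPU only if a functional CUDA setup is found.
gpu_arch = CUDA.functional() ? DFTK.GPU(CuArray) : DFTK.CPU()

# Warm-up runs: the first call of each code path includes compilation.
timed_scf(model, DFTK.CPU())
timed_scf(model, gpu_arch)

# Timed runs.
cpu = timed_scf(model, DFTK.CPU())
gpu = timed_scf(model, gpu_arch)

println("CPU: ", cpu.seconds, " s,  GPU: ", gpu.seconds, " s")
println("Energy difference: ", abs(cpu.energy - gpu.energy))
```

The two total energies should agree to well within the SCF tolerance. Keep in mind that the CPU timing depends on the thread settings in use (Julia, BLAS and FFTW threads), and that the speed-ups quoted above refer to single-threaded CPU execution.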