# Multi-CPU/GPU support with Horovod

> **Warning** 
> The use of multiple GPUs is under development and hasn't been thoroughly tested yet. Proceed with caution!

QMC simulations can easily be parallelized by using multiple resources to sample the wave function. Each walker is independent of the others, so multiple compute nodes can be used in parallel to obtain more samples. Each node can also use GPUs if they are available. We demonstrate here how to use the library `Horovod` (https://github.com/horovod/horovod) to leverage large compute resources for QMC.
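
Because the walkers do not interact, splitting them over workers only requires that each worker owns its own random stream. The minimal sketch below illustrates this with a toy one-dimensional density, with Python threads standing in for compute nodes. The helpers `log_density` and `run_walkers` are hypothetical names written for this illustration; they are not part of QMCTorch or Horovod, and threads are used only to show the structure, not to gain speed.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def log_density(x):
    """Log of a toy target density |psi(x)|^2 with psi(x) = exp(-x^2 / 2)."""
    return -x * x


def run_walkers(seed, nwalkers=100, nstep=200, step_size=1.0):
    """Advance independent Metropolis walkers and return their final positions."""
    rng = random.Random(seed)  # each worker owns its own random stream
    positions = [0.0] * nwalkers
    for _ in range(nstep):
        for i, x in enumerate(positions):
            trial = x + rng.uniform(-step_size, step_size)
            delta = log_density(trial) - log_density(x)
            # accept the move with probability min(1, p(trial) / p(x))
            if rng.random() < math.exp(min(0.0, delta)):
                positions[i] = trial
    return positions


# four workers, each sampling its own set of walkers with a distinct seed
seeds = [0, 1, 2, 3]
with ThreadPoolExecutor(max_workers=len(seeds)) as pool:
    chunks = list(pool.map(run_walkers, seeds))

# the samples of all workers are simply concatenated
samples = [x for chunk in chunks for x in chunk]
second_moment = sum(x * x for x in samples) / len(samples)
```

No communication is needed between the workers while sampling; the results are only merged at the end.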

Let's first create a simple system

In [7]:
import torch
from torch import optim
from qmctorch.scf import Molecule
from qmctorch.wavefunction import SlaterJastrow
from qmctorch.sampler import Metropolis
from qmctorch.utils import (plot_energy, plot_data)
from qmctorch.utils import set_torch_double_precision
set_torch_double_precision()
mol = Molecule(atom='H 0. 0. 0; H 0. 0. 1.', unit='bohr', redo_scf=True)

INFO:QMCTorch|
INFO:QMCTorch| SCF Calculation
INFO:QMCTorch|  Removing H2_pyscf_sto-3g.hdf5 and redo SCF calculations
INFO:QMCTorch|  Running scf  calculation
converged SCF energy = -1.06599946214331
INFO:QMCTorch|  Molecule name       : H2
INFO:QMCTorch|  Number of electrons : 2
INFO:QMCTorch|  SCF calculator      : pyscf
INFO:QMCTorch|  Basis set           : sto-3g
INFO:QMCTorch|  SCF                 : HF
INFO:QMCTorch|  Number of AOs       : 2
INFO:QMCTorch|  Number of MOs       : 2
INFO:QMCTorch|  SCF Energy          : -1.066 Hartree


Let's see if GPUs are available

In [8]:
use_gpu = torch.cuda.is_available()

In [9]:
wf = SlaterJastrow(mol, cuda=use_gpu).gto2sto()
sampler = Metropolis(nwalkers=100, nstep=500, step_size=0.25,
                     nelec=wf.nelec, ndim=wf.ndim,
                     init=mol.domain('atomic'),
                     move={'type': 'all-elec', 'proba': 'normal'},
                     cuda=use_gpu)

INFO:QMCTorch|
INFO:QMCTorch| Wave Function
INFO:QMCTorch|  Jastrow factor      : True
INFO:QMCTorch|  Jastrow kernel      : PadeJastrowKernel
INFO:QMCTorch|  Highest MO included : 2
INFO:QMCTorch|  Configurations      : ground_state
INFO:QMCTorch|  Number of confs     : 1
INFO:QMCTorch|  Kinetic energy      : jacobi
INFO:QMCTorch|  Number var  param   : 18
INFO:QMCTorch|  Cuda support        : False
INFO:QMCTorch|  Fit GTOs to STOs  : 
INFO:QMCTorch|
INFO:QMCTorch| Wave Function
INFO:QMCTorch|  Jastrow factor      : True
INFO:QMCTorch|  Jastrow kernel      : PadeJastrowKernel
INFO:QMCTorch|  Highest MO included : 2
INFO:QMCTorch|  Configurations      : ground_state
INFO:QMCTorch|  Number of confs     : 1
INFO:QMCTorch|  Kinetic energy      : jacobi
INFO:QMCTorch|  Number var  param   : 14
INFO:QMCTorch|  Cuda support        : False
INFO:QMCTorch|
INFO:QMCTorch| Monte-Carlo Sampler
INFO:QMCTorch|  Number of walkers   : 100
INFO:QMCTorch|  Number of steps     : 500
INFO:QMCTorch|  Step 

In [10]:
lr_dict = [{'params': wf.jastrow.parameters(), 'lr': 3E-3},
           {'params': wf.ao.parameters(), 'lr': 1E-6},
           {'params': wf.mo.parameters(), 'lr': 1E-3},
           {'params': wf.fc.parameters(), 'lr': 2E-3}]
opt = optim.Adam(lr_dict, lr=1E-3)

A dedicated QMCTorch solver has been developed to handle multiple GPUs. To use it, simply import it
and use it like the normal solver; only a few modifications are required to use Horovod:

In [11]:
import horovod.torch as hvd
from qmctorch.solver import SolverSlaterJastrowHorovod

hvd.init()
if torch.cuda.is_available():
    # pin each process to one GPU of its own node (local_rank is the rank within the node)
    torch.cuda.set_device(hvd.local_rank())

solver = SolverSlaterJastrowHorovod(wf=wf, sampler=sampler,
                                    optimizer=opt,
                                    rank=hvd.rank())

INFO:QMCTorch|
INFO:QMCTorch| Object SolverSlaterJastrowHorovod already exists in H2_pyscf_sto-3g_QMCTorch.hdf5
INFO:QMCTorch| Object name changed to SolverSlaterJastrowHorovod_2
INFO:QMCTorch|
INFO:QMCTorch|
INFO:QMCTorch| QMC Solver 
INFO:QMCTorch|  WaveFunction        : SlaterJastrow
INFO:QMCTorch|  Sampler             : Metropolis
INFO:QMCTorch|  Optimizer           : Adam


In [12]:
solver.configure(track=['local_energy'], freeze=['ao', 'mo'],
                loss='energy', grad='auto',
                ortho_mo=False, clip_loss=False,
                resampling={'mode': 'update',
                            'resample_every': 1,
                            'nstep_update': 50})

# optimize the wave function
obs = solver.run(5)


INFO:QMCTorch|
INFO:QMCTorch|  Distributed Optimization on 1 process
INFO:QMCTorch|   - Process 0 using 100 walkers
INFO:QMCTorch|
INFO:QMCTorch|  Optimization
INFO:QMCTorch|  Task                :
INFO:QMCTorch|  Number Parameters   : 2
INFO:QMCTorch|  Number of epoch     : 5
INFO:QMCTorch|  Batch size          : 100
INFO:QMCTorch|  Loss function       : energy
INFO:QMCTorch|  Clip Loss           : False
INFO:QMCTorch|  Gradients           : auto
INFO:QMCTorch|  Resampling mode     : update
INFO:QMCTorch|  Resampling every    : 1
INFO:QMCTorch|  Resampling steps    : 50
INFO:QMCTorch|  Output file         : H2_pyscf_sto-3g_QMCTorch.hdf5
INFO:QMCTorch|  Checkpoint every    : None
INFO:QMCTorch|


INFO:QMCTorch|  Sampling: 100%|██████████| 50/50 [00:00<00:00, 997.02it/s]

INFO:QMCTorch|   Acceptance rate     : 65.96 %
INFO:QMCTorch|   Timing statistics   : 969.14 steps/sec.
INFO:QMCTorch|   Total Time          : 0.05 sec.
INFO:QMCTorch|
INFO:QMCTorch|  epoch 0
INFO:QMCTorch|  energy   : -1.105308 +/- 0.040357
INFO:QMCTorch|  variance : 0.403565
INFO:QMCTorch|  epoch done in 0.16 sec.
INFO:QMCTorch|
INFO:QMCTorch|  epoch 1
INFO:QMCTorch|  energy   : -1.058329 +/- 0.037548
INFO:QMCTorch|  variance : 0.375481
INFO:QMCTorch|  epoch done in 0.19 sec.
INFO:QMCTorch|
INFO:QMCTorch|  epoch 2
INFO:QMCTorch|  energy   : -1.088986 +/- 0.039466
INFO:QMCTorch|  variance : 0.394658
INFO:QMCTorch|  epoch done in 0.20 sec.
INFO:QMCTorch|
INFO:QMCTorch|  epoch 3
INFO:QMCTorch|  energy   : -1.041960 +/- 0.038432
INFO:QMCTorch|  variance : 0.384323
INFO:QMCTorch|  epoch done in 0.18 sec.
INFO:QMCTorch|
INFO:QMCTorch|  epoch 4
INFO:QMCTorch|  energy   : -1.105442 +/- 0.034883
INFO:QMCTorch|  variance : 0.348827
INFO:QMCTorch|  epoch done in 0.18 sec.
INFO:QMCTorch|
INFO:QMCTorch| Object wf_opt already exists in H2_pyscf_sto-3g_QMCTorch.hdf5
INFO:QMCTorch| Object name changed to wf_opt_6
INFO:QMCTorch|


As you can see, some classes need the rank of the process when they are defined. This simply
ensures that only the master process generates the HDF5 files containing the information related to the calculation.
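
The idea of guarding file output by rank can be sketched as follows. This is only an illustration under simplifying assumptions: `save_if_master` is a hypothetical helper, it writes a plain text file rather than HDF5, and the ranks are simulated in a loop, whereas in a real run the rank would come from `hvd.rank()`. It is not how QMCTorch implements the check internally.

```python
import os
import tempfile


def save_if_master(rank, path, text):
    """Write `text` to `path` only on the master process (rank 0)."""
    if rank != 0:
        return False  # all other processes skip the write
    with open(path, "w") as handle:
        handle.write(text)
    return True


# simulate three processes that all try to save the same file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.txt")
    written = [save_if_master(rank, path, "placeholder\n") for rank in range(3)]
    with open(path) as handle:
        content = handle.read()
```

Only the call with rank 0 touches the file, so concurrent processes never overwrite each other's output.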

## Running parallel calculations

It is currently difficult to use Horovod on multiple nodes through a Jupyter notebook. Instead, put all the code in a Python file and execute it with the following command:

```
horovodrun -np 2 python <example>.py
```

See the Horovod documentation for more details: https://github.com/horovod/horovod


This solver distributes the `Nw` walkers over the `Np` processes. For example, specifying 2000 walkers
and using 4 processes leads to each process using only 500 walkers. During the optimization of the wave function,
each process computes the gradients of the variational parameters using its local 500 walkers.
The gradients are then averaged over all the processes before the optimization step takes place. This data-parallel
model has been highly successful in machine learning applications (http://jmlr.org/papers/volume20/18-789/18-789.pdf).
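
The walker split and the gradient averaging described above can be illustrated with a small pure-Python sketch. The functions `split_walkers`, `local_gradient` and `averaged_gradient` are hypothetical and use a toy scalar "gradient" (the mean of per-walker values); the handling of a walker count that does not divide evenly is an assumption of this sketch, not a statement about what the solver does. In Horovod the averaging itself is typically performed by an allreduce operation, which is replaced here by a plain Python average.

```python
import random


def split_walkers(nwalkers, nproc):
    """Number of walkers per process (assumption: leftovers go to the first ranks)."""
    base, extra = divmod(nwalkers, nproc)
    return [base + (1 if rank < extra else 0) for rank in range(nproc)]


def local_gradient(values):
    """Toy gradient of one process: the mean over its local walkers."""
    return sum(values) / len(values)


def averaged_gradient(all_values, nproc):
    """Average the local gradients over all processes."""
    counts = split_walkers(len(all_values), nproc)
    grads, start = [], 0
    for n in counts:
        grads.append(local_gradient(all_values[start:start + n]))
        start += n
    return sum(grads) / nproc


random.seed(0)
values = [random.gauss(0.0, 1.0) for _ in range(2000)]  # one toy value per walker

counts = split_walkers(2000, 4)            # 500 walkers on each of the 4 processes
g_parallel = averaged_gradient(values, 4)  # average of the 4 local gradients
g_serial = local_gradient(values)          # gradient from all 2000 walkers at once
```

With equal shares per process, the average of the local gradients matches the gradient computed from all walkers at once up to floating-point rounding, which is why the distributed optimization step follows the same update as a single-process run with the full set of walkers.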

A complete example can be found in `qmctorch/docs/example/horovod/h2.py`.