# GPU Simulator


## GPU Qiskit Aer Simulator Backends and Methods

The following Qiskit Aer backends currently support GPU acceleration:
* `QasmSimulator`
* `StatevectorSimulator`
* `UnitarySimulator`

To check whether GPU support is available on these backends, call `available_methods()`; the returned list includes the GPU-enabled methods.

## Introduction

This notebook shows how to accelerate Qiskit Aer simulators by using GPUs. 

To enable GPU support in Qiskit Aer, install the GPU version of the package:

`pip install qiskit-aer-gpu`


### Note 

Qiskit Aer supports only NVIDIA GPUs and requires the CUDA toolkit to be installed on the system.

In [6]:
from qiskit import *
from qiskit.circuit.library import *
from qiskit.providers.aer import *
qasm_sim = QasmSimulator()
print(qasm_sim.available_methods())

['automatic', 'statevector', 'statevector_gpu', 'density_matrix', 'density_matrix_gpu', 'stabilizer', 'matrix_product_state', 'extended_stabilizer']


If Qiskit Aer with GPU support is installed correctly, the list includes `statevector_gpu` and `density_matrix_gpu`.

In [7]:
st_sim = StatevectorSimulator()
print(st_sim.available_methods())
u_sim = UnitarySimulator()
print(u_sim.available_methods())

['automatic', 'statevector', 'statevector_gpu']
['automatic', 'unitary', 'unitary_gpu']
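Because the GPU methods only appear in these lists when the GPU build is installed, a script can check the list before requesting a GPU method. The sketch below is one way to do that in plain Python. `pick_method` is a hypothetical helper, not part of Qiskit Aer, and the fallback rule (strip the `_gpu` suffix, else use `automatic`) is an assumption made for illustration. The `cpu_build` list is also assumed: it is the list printed above with the GPU entries removed.

```python
def pick_method(available_methods, preferred="statevector_gpu"):
    """Return `preferred` if it is available, else a CPU fallback.

    Hypothetical helper: the fallback is the same method name without
    the "_gpu" suffix, or "automatic" if that is not available either.
    """
    if preferred in available_methods:
        return preferred
    if preferred.endswith("_gpu"):
        cpu_method = preferred[: -len("_gpu")]
    else:
        cpu_method = preferred
    return cpu_method if cpu_method in available_methods else "automatic"


# Method list as printed by QasmSimulator().available_methods() above.
gpu_build = ['automatic', 'statevector', 'statevector_gpu', 'density_matrix',
             'density_matrix_gpu', 'stabilizer', 'matrix_product_state',
             'extended_stabilizer']
# Assumed list for an installation without GPU support.
cpu_build = [m for m in gpu_build if not m.endswith("_gpu")]

print(pick_method(gpu_build))                 # GPU method is available
print(pick_method(cpu_build))                 # falls back to the CPU method
print(pick_method(cpu_build, "unitary_gpu"))  # no CPU counterpart in this list
```

The returned string can then be passed as the `"method"` entry of `backend_options`.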


### Simulation with GPU

Here is a simple example that runs a 20-qubit quantum volume circuit on the `QasmSimulator` backend.
To use the GPU for the simulation, set the simulation method to `statevector_gpu` in the `backend_options` parameter passed to the `QasmSimulator.run` method.

In [8]:
shots = 64
qubit = 20
depth=10
qv20 = QuantumVolume(qubit, depth, seed = 0)
qv20 = transpile(qv20, backend=qasm_sim, optimization_level=0)
qv20.measure_all()
qobj = assemble(qv20, shots=shots, memory=True)
result = qasm_sim.run(qobj, backend_options={"method" : "statevector_gpu"}).result()

counts = result.get_counts(qv20)
print(counts)

{'10001100110011011001': 1, '10111000011101110011': 1, '11011111010001101101': 1, '00000001011101110101': 1, '01011000101011101100': 1, '01101010110001110001': 1, '01000010000011000000': 1, '10010001011110000010': 1, '01100111101001100101': 1, '11100111111100011101': 1, '11010001000011001010': 1, '11010110110001010110': 1, '11000000010101000001': 1, '01110110110111000101': 1, '01111111100110110001': 1, '00111011110011110010': 1, '10010000101110000010': 1, '00101100011000000010': 1, '00111111100100010000': 1, '00000100100010001010': 1, '01110100001110111101': 1, '01111010011010010000': 1, '11111011101000010101': 1, '10010101001010001101': 1, '00111000111010110101': 1, '00111011100110010000': 1, '10001001110000100111': 1, '10101101010110011111': 1, '00001010000000100111': 1, '00101100101110111001': 1, '00110000111011010111': 1, '01101100011010001111': 1, '00000010100100001010': 1, '00100010001011010110': 1, '01100010101101010010': 1, '00011111111001011100': 1, '11101000100010000111': 1, 

The following example uses the `density_matrix_gpu` method in `QasmSimulator`.

In [9]:
qubit = 10
depth = 10
qv10 = QuantumVolume(qubit, depth, seed = 0)
qv10 = transpile(qv10, backend=qasm_sim, optimization_level=0)
qv10.measure_all()
qobj = assemble(qv10, shots=shots, memory=True)
result = qasm_sim.run(qobj, backend_options={"method" : "density_matrix_gpu"}).result()

counts = result.get_counts(qv10)
print(counts)

{'0111001111': 1, '0100011110': 1, '0001001111': 1, '1101111011': 1, '0010000000': 1, '0001111101': 1, '0100010010': 1, '1110000101': 1, '1100111001': 1, '1111110001': 1, '1011111010': 1, '1000110101': 1, '0111001101': 1, '1010001110': 1, '0111101100': 1, '1000111000': 1, '0011001011': 1, '1110011011': 1, '0110100001': 1, '0001101111': 1, '0110001101': 1, '1101111110': 1, '0000010011': 1, '1111011111': 1, '0010010101': 1, '0100001100': 1, '0011110100': 1, '0001010011': 1, '1010101011': 1, '1101111001': 1, '1110001000': 1, '0010001100': 1, '0000010000': 1, '0101010011': 1, '1100001001': 1, '0100011001': 1, '0111010010': 1, '0101111010': 1, '0110011110': 1, '1100011000': 1, '0011010110': 1, '1110000100': 1, '1000000001': 1, '0100111100': 1, '1011100111': 1, '0101010100': 1, '1100101101': 1, '0111110111': 1, '1000111110': 1, '1000100001': 1, '0001000011': 2, '1000111101': 1, '1011000110': 1, '0100010000': 1, '0010001110': 1, '0110010001': 1, '1000010111': 1, '0000101101': 1, '1001101010':

## Parallelizing Simulation by Using Multiple GPUs

In general, a GPU has less memory than the host CPU, and the largest number of qubits that can be simulated depends on the memory size. For example, in double precision on a GPU with 16 GB of memory, Qiskit Aer can simulate up to 29 qubits with the `statevector_gpu` method in the `QasmSimulator` and `StatevectorSimulator` backends, or up to 14 qubits with the `density_matrix_gpu` method in the `QasmSimulator` backend and the `unitary_gpu` method in the `UnitarySimulator` backend.
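The qubit limits quoted above follow from simple arithmetic: a double-precision complex amplitude takes 16 bytes, a statevector holds 2^n amplitudes, and a density matrix or unitary holds 4^n. The sketch below reproduces that estimate. `max_qubits` is a hypothetical helper, and it assumes the state must be strictly smaller than the GPU memory so that some room remains for other data; actual limits depend on the device and the Aer version.

```python
def max_qubits(memory_bytes, bytes_per_amplitude=16, density=False):
    """Largest qubit count whose state is strictly smaller than memory_bytes.

    A statevector stores 2**n amplitudes; a density matrix or unitary
    stores 4**n. Double precision uses 16 bytes per amplitude, single
    precision 8 bytes.
    """
    base = 4 if density else 2
    n = 0
    # Grow n while the state for n + 1 qubits would still fit.
    while bytes_per_amplitude * base ** (n + 1) < memory_bytes:
        n += 1
    return n


gpu_memory = 16 * 2**30  # 16 GiB, assumed for illustration

print(max_qubits(gpu_memory))                # statevector, double precision
print(max_qubits(gpu_memory, density=True))  # density matrix or unitary
```

With 16 GiB this gives 29 qubits for a statevector and 14 qubits for a density matrix or unitary, matching the figures in the text.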

To simulate a larger number of qubits, multiple GPUs can be used to parallelize the simulation. Parallel simulation can also speed up the simulation.

To use multiple GPUs, set the following options in the `backend_options` parameter passed to the `run` method. In the parallel simulator, the vector of quantum states is divided into sub-vectors called chunks, and the chunks are distributed across the memory of the GPUs.

The following two options should be passed:
* `blocking_enable`: Set to `True` to enable parallelization.
* `blocking_qubits`: Sets the size of the chunk that is distributed to the parallel memory space. Choose a value that satisfies `16*(2^(blocking_qubits+4)) < smallest memory size on the system (in bytes)` for double precision (use `8*` instead of `16*` for single precision).

The best value of `blocking_qubits` differs between environments, so tune it with benchmarks before running actual applications. A value between 20 and 23 usually works well in many environments.
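The inequality for `blocking_qubits` can be checked numerically before running a job. The sketch below computes the largest value that satisfies it and the resulting chunk layout. Both helpers are hypothetical, and the layout assumes that each chunk holds `2**blocking_qubits` amplitudes, which this notebook implies but does not state explicitly.

```python
def max_blocking_qubits(smallest_memory_bytes, bytes_per_amplitude=16):
    """Largest b with bytes_per_amplitude * 2**(b + 4) < smallest_memory_bytes."""
    b = 0
    # Increase b while the next value still satisfies the inequality.
    while bytes_per_amplitude * 2 ** (b + 1 + 4) < smallest_memory_bytes:
        b += 1
    return b


def chunk_layout(num_qubits, blocking_qubits, bytes_per_amplitude=16):
    """Return (number of chunks, bytes per chunk) for a statevector.

    Assumes each chunk stores 2**blocking_qubits amplitudes.
    """
    num_chunks = 2 ** (num_qubits - blocking_qubits)
    chunk_bytes = bytes_per_amplitude * 2 ** blocking_qubits
    return num_chunks, chunk_bytes


gpu_memory = 16 * 2**30  # 16 GiB per GPU, assumed for illustration

print(max_blocking_qubits(gpu_memory))  # upper bound from the inequality
print(chunk_layout(30, 23))             # 30 qubits split with blocking_qubits=23
```

Under these assumptions, the inequality allows values up to 25 on a 16 GiB device, and a 30-qubit statevector with `blocking_qubits=23` splits into 128 chunks of 128 MiB each. The upper bound is only a memory constraint; the value that performs best still has to be found by benchmarking.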

Here is an example that simulates a 30-qubit quantum volume circuit on multiple GPUs by using the `QasmSimulator` backend and the `statevector_gpu` method.

In [10]:
qubit = 30
depth = 10
qv30 = QuantumVolume(qubit, depth, seed = 0)
qv30 = transpile(qv30, backend=qasm_sim, optimization_level=0)
qv30.measure_all()
qobj = assemble(qv30, shots=shots, memory=True)
result = qasm_sim.run(qobj, backend_options={"method" : "statevector_gpu", "blocking_enable" : True, "blocking_qubits" : 23 }).result()

counts = result.get_counts(qv30)
print(counts)

{'001111011110000110011101110101': 1, '100000101001111101100110011011': 1, '001011001100011110010111111101': 1, '101111111000110011000000011000': 1, '111011110101100100011101011011': 1, '110111001001010100011000001100': 1, '110000111110000011110010000010': 1, '100011101011001110110011100001': 1, '101011001100110000001011100110': 1, '010101111011100010011100000010': 1, '010110100100110110010100000110': 1, '100000011100111100000010110011': 1, '110001001001000101010110010110': 1, '110111010001001111100010110110': 1, '000011001001110100010110111111': 1, '101111000000011111001011001101': 1, '011010110110110011100100101101': 1, '011101011010001110000100100001': 1, '010001111100100101101001101111': 1, '010011001001001010011100100111': 1, '011101000000001011010000101110': 1, '100010011000001001000111101101': 1, '010110001011001010011101001110': 1, '110011011010100111100001000110': 1, '001101001110100101011111110101': 1, '110111000101000000110010000010': 1, '100000111010000000111001010101': 1, 

### Note

Only `QasmSimulator` can be used for circuits with a large number of qubits, because the `StatevectorSimulator` and `UnitarySimulator` backends currently return snapshots of the state, which require a large amount of memory. If the CPU has enough memory to store the snapshots, these two backends can also be used with GPUs.

## Distribution of Shots by Using Multiple GPUs

GPUs can also be used to accelerate the simulation of multiple shots with noise models. If the system has multiple GPUs, shots are automatically distributed across them, provided there is enough memory to simulate one shot on a single GPU. If the system has only one GPU, multiple shots can be parallelized across the threads of that GPU.

Note that distributing multiple shots on GPUs is slower than running on the CPU when the number of qubits is small, because of the large overhead of launching GPU kernels.
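To make the distribution condition concrete, the sketch below checks whether one statevector shot fits on a single GPU and, if so, splits the shots evenly across devices. This is an illustration only: `fits_on_one_gpu` and `split_shots` are hypothetical helpers, the even split is an assumption, and Qiskit Aer's actual scheduling is not described in this notebook.

```python
def fits_on_one_gpu(num_qubits, gpu_memory_bytes, bytes_per_amplitude=16):
    """True if a statevector for num_qubits is smaller than one GPU's memory."""
    return bytes_per_amplitude * 2 ** num_qubits < gpu_memory_bytes


def split_shots(shots, num_gpus):
    """Split shots as evenly as possible; earlier GPUs take the remainder."""
    base, extra = divmod(shots, num_gpus)
    return [base + (1 if i < extra else 0) for i in range(num_gpus)]


gpu_memory = 16 * 2**30  # 16 GiB per GPU, assumed for illustration

if fits_on_one_gpu(10, gpu_memory):
    print(split_shots(1000, 3))  # shots assigned to each of 3 GPUs
```

A 10-qubit statevector needs only 16 KiB, so the memory condition is easily met; at this size the kernel-launch overhead described above dominates the run time.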

The following example runs 1000 shots of a quantum volume circuit with noise on a GPU.

In [11]:
from qiskit.providers.aer.noise import *
noise_model = NoiseModel()
error = depolarizing_error(0.05, 1)
noise_model.add_all_qubit_quantum_error(error, ['u1', 'u2', 'u3'])
shots = 1000
qobj = assemble(qv10, shots=shots, memory=True)
result = qasm_sim.run(qobj, noise_model = noise_model, backend_options={"method" : "statevector_gpu"}).result()

rdict = result.to_dict()
print("simulation time = {0}".format(rdict['time_taken']))

simulation time = 2.358891010284424


In [12]:
import qiskit.tools.jupyter
%qiskit_version_table
%qiskit_copyright

| Qiskit Software | Version |
| --- | --- |
| Qiskit | 0.23.4 |
| Terra | 0.17.0 |
| Aer | 0.8.0 |
| Ignis | 0.5.1 |
| Aqua | 0.8.1 |
| IBM Q Provider | 0.11.1 |
| **System information** | |
| Python | 3.9.1 (default, Dec 11 2020, 14:41:06) [GCC 7.3.0] |
| OS | Linux |
| CPUs | 40 |
