Add experimental support of cuQuantum #1400

Merged
merged 33 commits into from Mar 1, 2022
Changes from 21 commits
309c73d
add cuStateVec support
doichanj Dec 13, 2021
54dc128
Merge remote-tracking branch 'upstream/main' into cuStatevec
doichanj Dec 13, 2021
a5bc75e
delete space
doichanj Dec 13, 2021
b1bd96e
Merge branch 'main' into cuStatevec
chriseclectic Dec 14, 2021
a40898c
disable batched shots optimization for cuStateVec
doichanj Dec 15, 2021
adfc125
Merge branch 'cuStatevec' of github.com:doichanj/qiskit-aer into cuSt…
doichanj Dec 15, 2021
26c4538
Fix cuStateVec test fails
doichanj Dec 15, 2021
87afff5
Fix qasm_simulator.py
doichanj Dec 16, 2021
f16a35c
update for the latest cuQuantum / added diagonal matrix
doichanj Jan 4, 2022
5533b76
resolved conflict
doichanj Jan 4, 2022
0c10325
add more cuStateVec support / refactor qubitvector_thrust and chunk_c…
doichanj Jan 18, 2022
181eb2c
Merge remote-tracking branch 'upstream/main' into cuStatevec
doichanj Jan 18, 2022
54d1a68
Merge branch 'main' into cuStatevec
doichanj Jan 18, 2022
4d502ed
Merge branch 'cuStatevec' of github.com:doichanj/qiskit-aer into cuSt…
doichanj Jan 18, 2022
eba2594
Fix norm() for Thrust CPU
doichanj Jan 18, 2022
5a93807
change cuStateVec from device to option
doichanj Jan 26, 2022
983773b
Fix unchanged device=cuStateVec
doichanj Jan 26, 2022
5bea04d
Add build option to link cuStateVec statically
doichanj Jan 27, 2022
1fb5031
removed whitespace
doichanj Jan 27, 2022
1d01542
Merge remote-tracking branch 'upstream/main' into cuStatevec
doichanj Jan 27, 2022
da0f42d
Merge branch 'main' into cuStatevec
doichanj Jan 31, 2022
c781208
reflecting review comments
doichanj Feb 1, 2022
0f4a93e
added release note
doichanj Feb 1, 2022
c509131
set cuStateVec_enable to False as default, added test cases for cuSta…
doichanj Feb 3, 2022
5458b7c
Merge remote-tracking branch 'upstream/main' into cuStatevec
doichanj Feb 3, 2022
61083cb
Merge branch 'main' into cuStatevec
doichanj Feb 3, 2022
046036d
Merge branch 'cuStatevec' of github.com:doichanj/qiskit-aer into cuSt…
doichanj Feb 3, 2022
3a31cef
Fix omp setting for non-GPU / Fix omp nested loops
doichanj Feb 4, 2022
de4c978
Merge branch 'main' into cuStatevec
doichanj Feb 7, 2022
88d7d95
Implemented optimized rotation gates
doichanj Feb 14, 2022
3ffabcf
Merge branch 'cuStatevec' of github.com:doichanj/qiskit-aer into cuSt…
doichanj Feb 14, 2022
7cf50ee
Merge branch 'main' into cuStatevec
doichanj Feb 14, 2022
879a4ac
Merge branch 'main' into cuStatevec
hhorii Feb 24, 2022
9 changes: 9 additions & 0 deletions CMakeLists.txt
@@ -257,6 +257,15 @@ if(AER_THRUST_SUPPORTED)

set(AER_COMPILER_DEFINITIONS ${AER_COMPILER_DEFINITIONS} THRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA)
set(THRUST_DEPENDENT_LIBS "")
if(CUSTATEVEC_ROOT)
set(AER_COMPILER_DEFINITIONS ${AER_COMPILER_DEFINITIONS} AER_CUSTATEVEC)
set(AER_COMPILER_FLAGS "${AER_COMPILER_FLAGS} -I${CUSTATEVEC_ROOT}/include")
if(CUSTATEVEC_STATIC)
set(THRUST_DEPENDANT_LIBS "-L${CUSTATEVEC_ROOT}/lib -L${CUSTATEVEC_ROOT}/lib64 -lcustatevec_static -L${CUDA_TOOLKIT_ROOT_DIR}/lib64 -lcublas")
else()
set(THRUST_DEPENDANT_LIBS "-L${CUSTATEVEC_ROOT}/lib -L${CUSTATEVEC_ROOT}/lib64 -lcustatevec")
endif()
endif()
elseif(AER_THRUST_BACKEND STREQUAL "TBB")
message(STATUS "TBB Support found!")
set(THRUST_DEPENDENT_LIBS AER_DEPENDENCY_PKG::tbb)
23 changes: 23 additions & 0 deletions CONTRIBUTING.md
@@ -643,6 +643,29 @@ Few notes on GPU builds:
3. We don't need NVIDIA® drivers for building, but we need them for running simulations
4. Only Linux platforms are supported

Qiskit Aer now supports cuQuantum, NVIDIA®'s optimized quantum computing APIs.
The cuStateVec APIs can be used to accelerate the statevector, density_matrix and unitary methods.
Because cuQuantum is currently in beta, some operations are not yet accelerated by cuStateVec.

To build Qiskit Aer with cuStateVec support, set CUSTATEVEC_ROOT to the path of the cuQuantum root directory, as follows.

For example,

qiskit-aer$ python ./setup.py bdist_wheel -- -DAER_THRUST_BACKEND=CUDA -DCUSTATEVEC_ROOT=path_to_cuQuantum

To run with cuStateVec, set the `device='GPU'` option on AerSimulator. By default, cuStateVec is then used
when the input circuit has 22 or more qubits.
This threshold can be changed with the `cuStateVec_threshold` option.
Set `cuStateVec_enable=False` to disable cuStateVec.
The following example shows how to accelerate simulations of 10 or more qubits using cuStateVec.

```
from qiskit import execute
from qiskit.providers.aer import AerSimulator
sim = AerSimulator(method='statevector', device='GPU')
results = execute(circuit, sim, cuStateVec_enable=True, cuStateVec_threshold=10).result()
```
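
The threshold rule described above can be sketched in plain Python. This only illustrates the documented behaviour and is not Aer's implementation: the function name `custatevec_active` and its signature are hypothetical, and the default values simply mirror the text above.

```python
def custatevec_active(num_qubits, device="GPU", enable=True, threshold=22):
    """Return True if, per the rule above, cuStateVec would be used.

    Hypothetical helper: the name and signature are assumptions made for
    illustration and are not part of the Qiskit Aer API.
    """
    # cuStateVec is a GPU library, so it only applies when device='GPU',
    # and it can be switched off entirely with enable=False.
    if device != "GPU" or not enable:
        return False
    # It is used when the circuit has at least `threshold` qubits.
    return num_qubits >= threshold


for n in (10, 21, 22, 30):
    print(n, custatevec_active(n), custatevec_active(n, threshold=10))
```

Lowering the threshold, as in the `execute` call above, is what brings smaller circuits under cuStateVec.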



### Building with MPI support

Qiskit Aer can parallelize its simulation on the cluster systems by using MPI.
20 changes: 20 additions & 0 deletions qiskit/providers/aer/backends/aer_simulator.py
@@ -148,6 +148,10 @@ class AerSimulator(AerBackend):
initialization or with :meth:`set_options`. The list of supported devices
for the current system can be returned using :meth:`available_devices`.

If AerSimulator is built with cuStateVec support, cuStateVec APIs are enabled
by setting ``cuStateVec_enable=True``. This is an experimental implementation
based on cuQuantum Beta 2.

**Additional Backend Options**

The following simulator specific backend options are supported
@@ -216,6 +220,19 @@ class AerSimulator(AerBackend):
values (16 Bytes). If set to 0, the maximum will be automatically
set to the system memory size (Default: 0).

* ``cuStateVec_enable`` (bool): This option enables acceleration with the
cuStateVec library of NVIDIA's cuQuantum, which provides highly optimized
kernels for GPUs. cuStateVec is used only when the number of qubits in
the input circuit is equal to or greater than ``cuStateVec_threshold``.
Currently this option performs well only for circuits with a large
number of qubits. It is also disabled for noise simulation
(Default: True).

* ``cuStateVec_threshold`` (int): This option sets the minimum number of
qubits at which the ``cuStateVec_enable`` option takes effect.
cuStateVec is used when the number of qubits is equal to or greater
than this value (Default: 22).

* ``blocking_enable`` (bool): This option enables parallelization with
multiple GPUs or multiple processes with MPI (CPU/GPU). This option
is only available for ``"statevector"``, ``"density_matrix"`` and
@@ -514,6 +531,9 @@ def _default_options(cls):
memory=None,
noise_model=None,
seed_simulator=None,
# cuStateVec (cuQuantum) options
cuStateVec_enable=True,
cuStateVec_threshold=22,
# cache blocking for multi-GPUs/MPI options
blocking_qubits=None,
blocking_enable=False,
19 changes: 12 additions & 7 deletions qiskit/providers/aer/backends/qasm_simulator.py
@@ -339,9 +339,9 @@ class QasmSimulator(AerBackend):
}

_SIMULATION_METHODS = [
'automatic', 'statevector', 'statevector_gpu',
'automatic', 'statevector', 'statevector_gpu', 'statevector_custatevec',
'statevector_thrust', 'density_matrix',
'density_matrix_gpu', 'density_matrix_thrust',
'density_matrix_gpu', 'density_matrix_custatevec', 'density_matrix_thrust',
'stabilizer', 'matrix_product_state', 'extended_stabilizer'
]

@@ -595,7 +595,8 @@ def _basis_gates(self):
def _method_basis_gates(self):
"""Return method basis gates and custom instructions"""
method = self._options.get('method', None)
if method in ['density_matrix', 'density_matrix_gpu', 'density_matrix_thrust']:
if method in ['density_matrix', 'density_matrix_gpu',
'density_matrix_custatevec', 'density_matrix_thrust']:
return sorted([
'u1', 'u2', 'u3', 'u', 'p', 'r', 'rx', 'ry', 'rz', 'id', 'x',
'y', 'z', 'h', 's', 'sdg', 'sx', 'sxdg', 't', 'tdg', 'swap', 'cx',
@@ -628,15 +629,17 @@ def _custom_instructions(self):
return self._options_configuration['custom_instructions']

method = self._options.get('method', None)
if method in ['statevector', 'statevector_gpu', 'statevector_thrust']:
if method in ['statevector', 'statevector_gpu',
'statevector_custatevec', 'statevector_thrust']:
return sorted([
'quantum_channel', 'qerror_loc', 'roerror', 'kraus', 'snapshot', 'save_expval',
'save_expval_var', 'save_probabilities', 'save_probabilities_dict',
'save_amplitudes', 'save_amplitudes_sq', 'save_state',
'save_density_matrix', 'save_statevector', 'save_statevector_dict',
'set_statevector'
])
if method in ['density_matrix', 'density_matrix_gpu', 'density_matrix_thrust']:
if method in ['density_matrix', 'density_matrix_gpu',
'density_matrix_custatevec', 'density_matrix_thrust']:
return sorted([
'quantum_channel', 'qerror_loc', 'roerror', 'kraus', 'superop', 'snapshot',
'save_expval', 'save_expval_var', 'save_probabilities', 'save_probabilities_dict',
@@ -666,10 +669,12 @@ def _custom_instructions(self):
def _set_method_config(self, method=None):
"""Set non-basis gate options when setting method"""
# Update configuration description and number of qubits
if method in ['statevector', 'statevector_gpu', 'statevector_thrust']:
if method in ['statevector', 'statevector_gpu',
'statevector_custatevec', 'statevector_thrust']:
description = 'A C++ statevector simulator with noise'
n_qubits = MAX_QUBITS_STATEVECTOR
elif method in ['density_matrix', 'density_matrix_gpu', 'density_matrix_thrust']:
elif method in ['density_matrix', 'density_matrix_gpu',
'density_matrix_custatevec', 'density_matrix_thrust']:
description = 'A C++ density matrix simulator with noise'
n_qubits = MAX_QUBITS_STATEVECTOR // 2
elif method == 'matrix_product_state':
72 changes: 57 additions & 15 deletions src/controllers/aer_controller.hpp
@@ -377,6 +377,9 @@ class Controller {
int_t batched_shots_gpu_max_qubits_ = 16; //multi-shot parallelization is applied if qubits is less than max qubits
bool enable_batch_multi_shots_ = false; //multi-shot parallelization can be applied

//settings for cuStateVec
bool cuStateVec_enable_ = false;
int cuStateVec_threshold_ = 22;
};

//=========================================================================
@@ -466,6 +469,16 @@ void Controller::set_config(const json_t &config) {
JSON::get_value(batched_shots_gpu_max_qubits_, "batched_shots_gpu_max_qubits", config);
}

#ifdef AER_CUSTATEVEC
//cuStateVec configs
if(JSON::check_key("cuStateVec_enable", config)) {
JSON::get_value(cuStateVec_enable_, "cuStateVec_enable", config);
}
if(JSON::check_key("cuStateVec_threshold", config)) {
JSON::get_value(cuStateVec_threshold_, "cuStateVec_threshold", config);
}
#endif

// Override automatic simulation method with a fixed method
std::string method;
if (JSON::get_value(method, "method", config)) {
@@ -489,6 +502,9 @@ void Controller::set_config(const json_t &config) {
}
}

if(method_ == Method::density_matrix || method_ == Method::unitary)
batched_shots_gpu_max_qubits_ /= 2;

// Override automatic simulation method with a fixed method
if (JSON::get_value(sim_device_name_, "device", config)) {
if (sim_device_name_ == "CPU") {
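
The halving of `batched_shots_gpu_max_qubits_` for the density matrix and unitary methods presumably reflects state size: an n-qubit density matrix (or unitary) has 2^n x 2^n complex entries, the same count as a 2n-qubit statevector. A quick check of that arithmetic, assuming 16 bytes per complex value as stated in the option documentation earlier in this PR (the function names are hypothetical):

```python
BYTES_PER_COMPLEX = 16  # double-precision complex value (assumption from the docs)


def statevector_bytes(num_qubits):
    """Memory for a statevector: 2**n complex amplitudes."""
    return BYTES_PER_COMPLEX * 2 ** num_qubits


def density_matrix_bytes(num_qubits):
    """Memory for a density matrix: (2**n) x (2**n) complex entries."""
    return BYTES_PER_COMPLEX * (2 ** num_qubits) ** 2


# An 8-qubit density matrix occupies as much memory as a 16-qubit statevector.
print(density_matrix_bytes(8) == statevector_bytes(16))
```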
@@ -502,18 +518,29 @@ void Controller::set_config(const json_t &config) {
#endif
} else if (sim_device_name_ == "GPU") {
#ifndef AER_THRUST_CUDA
throw std::runtime_error(
"Simulation device \"GPU\" is not supported on this system");
throw std::runtime_error(
"Simulation device \"GPU\" is not supported on this system");
#else
int nDev;
if (cudaGetDeviceCount(&nDev) != cudaSuccess) {
cudaGetLastError();
throw std::runtime_error("No CUDA device available!");
int nDev;
if (cudaGetDeviceCount(&nDev) != cudaSuccess) {
cudaGetLastError();
throw std::runtime_error("No CUDA device available!");
}
sim_device_ = Device::GPU;

#ifdef AER_CUSTATEVEC
if(cuStateVec_enable_){
//initialize a cuStateVec handle once before the actual calculation (the first call takes a long time)
custatevecStatus_t err;
custatevecHandle_t stHandle;
err = custatevecCreate(&stHandle);
if(err == CUSTATEVEC_STATUS_SUCCESS){
custatevecDestroy(stHandle);
}

sim_device_ = Device::GPU;
#endif
}
#endif
#endif
}
else {
throw std::runtime_error(std::string("Invalid simulation device (\"") +
sim_device_name_ + std::string("\")."));
@@ -636,9 +663,19 @@ void Controller::set_parallelization_circuit(const Circuit &circ,
const Method method)
{
enable_batch_multi_shots_ = false;
if(batched_shots_gpu_ && sim_device_ == Device::GPU && circ.shots > 1 && max_batched_states_ >= num_gpus_ &&
batched_shots_gpu_max_qubits_ >= circ.num_qubits ){
enable_batch_multi_shots_ = true;
if(batched_shots_gpu_ && sim_device_ == Device::GPU &&
circ.shots > 1 && max_batched_states_ >= num_gpus_ &&
batched_shots_gpu_max_qubits_ >= circ.num_qubits ){
//batched multi-shot execution is not currently supported with cuStateVec, because cuStateVec does not handle conditional operations
if(cuStateVec_enable_ && circ.num_qubits >= cuStateVec_threshold_)
enable_batch_multi_shots_ = false;
else
enable_batch_multi_shots_ = true;
}

if(cuStateVec_enable_ && circ.num_qubits >= cuStateVec_threshold_){
parallel_shots_ = 1; //cuStateVec is currently not thread safe
return;
Collaborator

If cuStateVec_enable=True is configured in AerSimulator.run(), parallel_state_update_ is not set. This will cause a performance regression if an application accidentally sets cuStateVec_enable with device='CPU'.

Question: when enable_batch_multi_shots_=true would you create nShots copies of the statevector for parallelization? If so & IIUC, I think a proper "workaround" is to create multiple cuStateVec handles (or just retain and reuse a pool of handles at init time to reduce overhead) and use them in parallel.

IMHO though it's beyond a "workaround": even after we fix the thread safety issue, generally speaking it is still challenging for library handles to be shared by multiple host threads. For example, although cuBLAS supports this usage pattern, they explicitly recommend not doing so. Thus the handle pool approach is commonly seen in ML/DL frameworks.

Collaborator Author

enable_batch_multi_shots_=true is not currently applicable to cuStateVec. In batched mode, multiple state vectors are calculated in a single CUDA kernel and each state vector refers to classical registers to handle branch operations; this is not implemented in cuStateVec.
Multiple cuStateVec handles are required when enable_batch_multi_shots_=false and shot-level parallelization is needed. In that case, state vectors are calculated independently using OpenMP threads. (Currently cuStateVec is not thread safe, so we disable OpenMP parallelization.)

Thanks for explanation @doichanj. I understand better now. So once we fix thread safety we can unblock you for the shot-level parallelization.
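
The handle-pool pattern discussed in this thread can be sketched generically. This is a minimal sketch that uses a placeholder class in place of a real cuStateVec handle and Python threads in place of OpenMP threads; `FakeHandle`, `HandlePool` and `run_shots` are hypothetical names, and no cuStateVec API is called.

```python
import queue
import threading
from contextlib import contextmanager


class FakeHandle:
    """Placeholder for a library handle that is expensive to create."""

    def __init__(self, index):
        self.index = index


class HandlePool:
    """Create handles once, then lend each one to a single thread at a time."""

    def __init__(self, size):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(FakeHandle(i))

    @contextmanager
    def borrow(self):
        handle = self._free.get()  # blocks until a handle is free
        try:
            yield handle
        finally:
            self._free.put(handle)  # always return the handle to the pool


def run_shots(pool, num_shots):
    """Run one thread per shot; each borrows a handle for its work."""
    used = []
    lock = threading.Lock()

    def one_shot():
        with pool.borrow() as handle:
            with lock:  # protect the shared result list
                used.append(handle.index)

    threads = [threading.Thread(target=one_shot) for _ in range(num_shots)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return used


print(len(run_shots(HandlePool(2), 8)))
```

Because a handle is never held by two threads at once, this pattern does not rely on the handles themselves being thread safe.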

}

if(explicit_parallelization_)
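
The branch added in this hunk can be modelled in a few lines of Python. This is a simplified sketch of the C++ logic shown above, not a port of it: `plan_shots` is a hypothetical name, and the default argument values are chosen for the example (only the 16-qubit batching limit and the threshold of 22 come from the diff).

```python
def plan_shots(num_qubits, shots, device="GPU", batched_shots_gpu=True,
               max_batched_states=1, num_gpus=1, batched_max_qubits=16,
               custatevec_enable=False, custatevec_threshold=22):
    """Return (enable_batch_multi_shots, force_single_shot_thread).

    Hypothetical model of Controller::set_parallelization_circuit.
    """
    # cuStateVec takes over once the circuit reaches the qubit threshold.
    use_custatevec = custatevec_enable and num_qubits >= custatevec_threshold

    # Conditions under which batched multi-shot execution is considered.
    batch_ok = (batched_shots_gpu and device == "GPU" and shots > 1
                and max_batched_states >= num_gpus
                and batched_max_qubits >= num_qubits)

    # Batching is skipped whenever cuStateVec is active.
    enable_batch = batch_ok and not use_custatevec

    # With cuStateVec active, shots run one at a time (parallel_shots_ = 1).
    return enable_batch, use_custatevec


print(plan_shots(10, 100))
print(plan_shots(10, 100, custatevec_enable=True, custatevec_threshold=10))
```

As in the C++ code, the cuStateVec check here does not look at the device, which is the situation the review comment above raises for `device='CPU'`.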
@@ -785,6 +822,7 @@ size_t Controller::get_gpu_memory_mb() {
}
num_gpus_ = nDev;
#endif

#ifdef AER_MPI
// get minimum memory size per process
uint64_t locMem, minMem;
@@ -866,7 +904,6 @@ Result Controller::execute(const inputdata_t &input_qobj) {
auto time_taken =
std::chrono::duration<double>(myclock_t::now() - timer_start).count();
result.metadata.add(time_taken, "time_taken");

return result;
} catch (std::exception &e) {
// qobj was invalid, return valid output containing error message
@@ -1439,7 +1476,7 @@ void Controller::run_circuit_without_sampled_noise(Circuit &circ,
// Check if measure sampler and optimization are valid
if (can_sample) {
// Implement measure sampler
if (parallel_shots_ <= 1) {
if (parallel_shots_ <= 1 || sim_device_ == Device::GPU) {
state.set_max_matrix_qubits(max_bits);
RngEngine rng;
rng.set_seed(circ.seed);
@@ -1460,7 +1497,7 @@ void Controller::run_circuit_without_sampled_noise(Circuit &circ,
shot_state.set_parallelization(parallel_state_update_);
shot_state.set_global_phase(circ.global_phase_angle);

state.set_max_matrix_qubits(max_bits);
shot_state.set_max_matrix_qubits(max_bits);

RngEngine rng;
rng.set_seed(circ.seed + i);
@@ -1736,7 +1773,12 @@ void Controller::measure_sampler(
shots_or_index = shots;
else
shots_or_index = shot_index;

auto timer_start = myclock_t::now();
auto all_samples = state.sample_measure(meas_qubits, shots_or_index, rng);
auto time_taken =
std::chrono::duration<double>(myclock_t::now() - timer_start).count();
result.metadata.add(time_taken, "sample_measure_time");

// Make qubit map of position in vector of measured qubits
std::unordered_map<uint_t, uint_t> qubit_map;
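
The timing added around `sample_measure` follows a common pattern: read a clock before and after the call and store the difference in the result metadata. A minimal Python sketch of the same pattern, where `timed` is a hypothetical helper and `sorted` stands in for the sampling call:

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


metadata = {}
# `sorted` is only a stand-in for state.sample_measure(...).
samples, metadata["sample_measure_time"] = timed(sorted, [3, 1, 2])
print(samples, metadata)
```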
2 changes: 1 addition & 1 deletion src/simulators/density_matrix/densitymatrix_state.hpp
@@ -1344,7 +1344,7 @@ void State<densmat_t>::apply_gate_u3(const int_t iChunk, uint_t qubit, double th
template <class densmat_t>
void State<densmat_t>::apply_diagonal_unitary_matrix(const int_t iChunk, const reg_t &qubits, const cvector_t & diag)
{
if(BaseState::thrust_optimization_){
if(BaseState::thrust_optimization_ || !BaseState::multi_chunk_distribution_){
//GPU computes all chunks in one kernel, so pass qubits and diagonal matrix as is
BaseState::qregs_[iChunk].apply_diagonal_unitary_matrix(qubits,diag);
}
8 changes: 6 additions & 2 deletions src/simulators/state.hpp
@@ -342,6 +342,8 @@ class State {
complex_t global_phase_ = 1;

int_t max_matrix_qubits_ = 0;

std::string sim_device_name_; //name of device
};


@@ -355,8 +357,10 @@ State<state_t>::~State(void)
}

template <class state_t>
void State<state_t>::set_config(const json_t &config) {
(ignore_argument)config;
void State<state_t>::set_config(const json_t &config)
{
//get device name
JSON::get_value(sim_device_name_, "device", config);
}

template <class state_t>