Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Improve the internal data implementation for cudaq::state #1107

Open
10 of 18 tasks
amccaskey opened this issue Jan 18, 2024 · 4 comments
Open
10 of 18 tasks

[RFC] Improve the internal data implementation for cudaq::state #1107

amccaskey opened this issue Jan 18, 2024 · 4 comments
Assignees
Milestone

Comments

@amccaskey
Copy link
Collaborator

amccaskey commented Jan 18, 2024

Background

Currently, state just wraps a tuple containing the shape of the state data and a vector to the state data. This is sub-optimal because it forces the creation of copies, especially for state vector data managed by the GPU.

We need to rethink the internal data representation for cudaq::state to remove this tuple and replace with a new data type specific to simulation backends (i.e. some interface that backend simulators must implement) that allows cudaq::state to track the state data in a backend-native manner, i.e. holding the GPU device pointer.

Proposal

This RFC introduces the following new types and type changes:

namespace cudaq {

/// @brief state_data is a variant type
/// encoding different forms of user state vector data
/// we support.
using state_data = std::variant<
    std::vector<cudaq::complex>,
    std::pair<cudaq::complex *, std::size_t>,
    std::vector<cudaq::complex *>, std::shared_ptr<TensorNetworkState>>;

class SimulationState {
public: 

  /// @brief Simulation states are made up of a
  /// vector of Tensors. Each tensor tracks its data pointer
  /// and the tensor extents.
  struct Tensor {
    cudaq::complex *data = nullptr;

    std::vector<std::size_t> extents;
    precision fp_precision;
    std::size_t get_rank() const { return extents.size(); }
    std::size_t get_num_elements() const;
    std::size_t element_size() const ;
  };

  /// @brief Create a new subclass specific SimulationState
  /// from the user provided data set.
  virtual std::unique_ptr<cudaq::SimulationState>
  createFromData(const state_data &data) ;

  /// @brief Return the tensor at the given index. Throws
  /// for an invalid tensor index.
  virtual Tensor getTensor(std::size_t tensorIdx = 0) const = 0;

  /// @brief Return all tensors that represent this state
  virtual std::vector<Tensor> getTensors() const = 0;

  /// @brief Return the number of tensors that represent this state.
  virtual std::size_t getNumTensors() const = 0;

  /// @brief Return the number of qubits this state represents.
  virtual std::size_t getNumQubits() const = 0;

  /// @brief Compute the overlap of this state representation with
  /// the provided `other` state, e.g. `<this | other>`.
  virtual cudaq::complex overlap(const SimulationState &other) = 0;

  /// @brief Return the amplitude of the given computational
  /// basis state.
  virtual cudaq::complex
  getAmplitude(const std::vector<int> &basisState) = 0;

  // @brief Return the floating point precision used by the simulation state.
  virtual precision getPrecision() const = 0;

  /// @brief Destroy the state representation, frees all associated memory.
  virtual void destroyState() = 0;

  /// @brief Return the element from the tensor at the
  /// given tensor index and at the given indices.
  virtual cudaq::complex 
  operator()(std::size_t tensorIdx, const std::vector<std::size_t> &indices) ;

  /// @brief Return the number of elements in this state representation.
  /// Defaults to adding all shape elements.
  virtual std::size_t getNumElements() const ;

  /// @brief Return true if this `SimulationState` wraps data on the GPU.
  virtual bool isDeviceData() const ;

  /// @brief Return true if this `SimulationState` wraps contiguous memory
  /// (array-like).
  //  If true, `operator()` can be used to index elements in a multi-dimensional
  //  array manner.
  // Otherwise, `getAmplitude()` must be used.
  virtual bool isArrayLike() const;

  /// @brief Transfer data from device to host, return the data
  /// to the pointer provided by the client. Clients must specify the number of
  /// elements.
  virtual void toHost(cudaq::complex *clientAllocatedData,
                      std::size_t numElements) const;

  /// @brief Destructor
  virtual ~SimulationState() {}
};

using tensor = SimulationState::Tensor; 

class state {
private:
  std::unique_ptr<SimulationState> internal;

public:
  state(SimulationState *ptrToOwn); 
  
  /// @brief Convenience function for extracting from a known vector.
  cudaq::complex operator[](std::size_t idx);

  /// @brief Convenience function for extracting from a known matrix.
  cudaq::complex operator()(std::size_t idx, std::size_t jdx);

  /// @brief General extraction operator for state data.
  cudaq::complex operator()(const std::initializer_list<std::size_t> &,
                                  std::size_t tensorIdx = 0);

  /// @brief Return the tensor at the given index for this state representation.
  /// For state-vector and density matrix simulation states, there is just one
  /// tensor with rank 1 or 2 respectively.
  tensor get_tensor(std::size_t tensorIdx = 0) const;

  /// @brief Return all tensors that represent this simulation state.
  std::vector<tensor> get_tensors() const;

  /// @brief Return the number of tensors that represent this state.
  std::size_t get_num_tensors() const;

  /// @brief Return the underlying floating point precision for this state.
  cudaq::simulation_precision get_precision() const;

  /// @brief Return true if this a state on the GPU.
  bool is_on_gpu() const;

  /// @brief Copy this state from device to
  /// FIXME should be a vector<Tensor> right?
  void to_host(cudaq::complex *hostPtr,
               std::size_t numElements) const;

  /// @brief Create a new state from user-provided data.
  /// The data can be host or device data.
  static state from_data(const state_data &data);

  ~state();
};

cudaq::complex overlap(const state& other); 
cudaq::complex amplitude(const std::vector<int>& basisState); 

}

Concrete CircuitSimulators are then responsible for subclassing SimulationState and returning as a pointer.

Examples

    kernel = cudaq.make_kernel()
    qubits = kernel.qalloc(2)
    kernel.h(qubits[0])
    kernel.cx(qubits[0], qubits[1])

    # Get the quantum state, which should be a vector.
    got_state = cudaq.get_state(kernel)

    # Data type needs to be the same as the internal state vector
    want_state = np.array([1. / np.sqrt(2.), 0., 0., 1. / np.sqrt(2.)],
                          dtype=dtype)

    # Check the indexing operators on the State class
    # while also checking their values
    np.isclose(want_state[0], got_state[0].real)
    np.isclose(want_state[1], got_state[1].real)
    np.isclose(want_state[2], got_state[2].real)
    np.isclose(want_state[3], got_state[3].real)

    # Check the entire vector with numpy.
    got_vector = np.array(got_state, copy=False)
    for i in range(len(want_state)):
        assert np.isclose(want_state[i], got_vector[i])
        # if not np.isclose(got_vector[i], got_vector_b[i]):
        print(f"want = {want_state[i]}")
        print(f"got = {got_vector[i]}")
    assert np.allclose(want_state, np.array(got_state))

    # Check overlaps.
    # Check the overlap overload with want numpy array.
    assert np.isclose(got_state.overlap(want_state), 1.0)

    kernel = cudaq.make_kernel()
    q = kernel.qalloc(2)
    kernel.h(q[0])
    kernel.cx(q[0], q[1])

    # State is on the GPU, this is nvidia target
    state = cudaq.get_state(kernel)
    state.dump()
    # Create a state on GPU
    expected = cp.array([.707107, 0, 0, .707107], dtype=cp.complex128)
    # Compute the overlap
    result = state.overlap(expected)

    cp_data = cp.array([.707107, 0, 0, .707107], dtype=cp.complex64)
    state_from_cupy = cudaq.State.from_data(cp_data)
    state_from_cupy.dump()
    kernel = cudaq.make_kernel()
    q = kernel.qalloc(2)
    kernel.h(q[0])
    kernel.cx(q[0], q[1])
    # State is on the GPU, this is nvidia target
    state = cudaq.get_state(kernel)
    result = state.overlap(state_from_cupy)
    assert np.isclose(result, 1.0, atol=1e-3)

   @cudaq.kernel
    def bell():
        qubits = cudaq.qvector(2)
        h(qubits[0])
        x.ctrl(qubits[0], qubits[1])

    # Get the quantum state, which should be a vector.
    got_state = cudaq.get_state(bell)

    want_state = cudaq.State.from_data(np.array([1. / np.sqrt(2.), 0., 0., 1. / np.sqrt(2.)],
                          dtype=np.complex128))
    
    assert len(want_state) == 4 

    # Check the indexing operators on the State class
    # while also checking their values
    np.isclose(want_state[0], got_state[0].real)
    np.isclose(want_state[1], got_state[1].real)
    np.isclose(want_state[2], got_state[2].real)
    np.isclose(want_state[3], got_state[3].real)

    # Check the entire vector with numpy.
    got_vector = np.array(got_state, copy=False)
    for i in range(len(want_state)):
        assert np.isclose(want_state[i], got_vector[i])
        # if not np.isclose(got_vector[i], got_vector_b[i]):
        print(f"want = {want_state[i]}")
        print(f"got = {got_vector[i]}")
    assert np.allclose(want_state, np.array(got_state))
    cudaq.reset_target()

TODO

Before merging to main:

Post merging to main

  • Tests / demonstrate MQPU and copying GPU device data (MPI and single node)
  • Examples
  • Documentation
@amccaskey amccaskey added bug Something isn't working enhancement New feature or request labels Jan 18, 2024
@amccaskey amccaskey self-assigned this Jan 18, 2024
@1tnguyen
Copy link
Collaborator

virtual complex amplitude(std::size_t idx) = 0 limits the index to a 64-bit integer. We might want to cover the case with more than 64 qubits (e.g., tensornet can handle single amplitude computation for a large number of qubits).

@amccaskey
Copy link
Collaborator Author

virtual complex amplitude(std::size_t idx) = 0 limits the index to a 64-bit integer. We might want to cover the case with more than 64 qubits (e.g., tensornet can handle single amplitude computation for a large number of qubits).

Good point. How would you cover those cases? Maybe have another overload that takes the computational basis state as a string? What would be best for those types of backends?

@1tnguyen
Copy link
Collaborator

Good point. How would you cover those cases? Maybe have another overload that takes the computational basis state as a string? What would be best for those types of backends?

We may consider an 'overcomplete' representation (std::string, std::vector<int>, etc.) or Boolean-based (std::vector<bool>). With the latter, we don't need to validate that input should only be 0 or 1 for qubit cases. However, it is limited to the qubit case (e.g., with std::string or std::vector<int>, we can encode other values beyond 0/1).

An additional state accessor API for consideration:

virtual std::vector<complex> amplitude(const std::vector<std::pair<std::size_t, bool>>& bitMask) = 0;

Rationale: block-access to a contiguous section of the state-vector, e.g., mgpu.

In cases where we cannot reconstruct the full state vector, we can use this type of masking API to get sub state vector.
e.g., 35 qubits, per-GPU sub-statevector = 30 qubits.
bitMask = {{30, 0}, {31, 0}, {32, 0}, {33, 0}, {34, 0}} => GPU 0 sub-state (2^30 elements returned).
bitMask = {{30, 1}, {31, 0}, {32, 0}, {33, 0}, {34, 0}} => GPU 1 sub-state (2^30 elements returned).
...

Doing this masking with an explicit list of indices is not efficient.

@amccaskey amccaskey changed the title Improve the internal data implementation for cudaq::state [RFC] Improve the internal data implementation for cudaq::state Apr 16, 2024
@amccaskey amccaskey added the RFC Request for Comments label Apr 16, 2024
@bettinaheim bettinaheim removed bug Something isn't working enhancement New feature or request labels Apr 25, 2024
@bettinaheim bettinaheim added this to the release 0.8.0 milestone Jul 1, 2024
@bettinaheim
Copy link
Collaborator

bettinaheim commented Jul 1, 2024

Missing in the above write-up of tasks are the necessary compiler passes to support this on quantum backends (discussed offline).

@bettinaheim bettinaheim added RFC-approved and removed RFC Request for Comments labels Jul 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants