# CUDA-Q Workshop: Quantum Computing with NVIDIA CUDA-Q

Welcome! This notebook teaches **CUDA-Q** — NVIDIA's open-source platform for
hybrid quantum-classical computing — to programmers already familiar with
quantum circuits and gates.

## What is CUDA-Q?

CUDA-Q provides a unified programming model that lets quantum kernels run on:
- **CPU simulators** (statevector or density-matrix)
- **NVIDIA GPUs** via cuStateVec / cuTensorNet
- **Real QPUs** (IonQ, Quantinuum, IQM, OQC, and more)

The same kernel code targets *all* of the above: only the `cudaq.set_target(...)` call changes.

## Notebook Outline

| # | Section | Key APIs |
|---|---------|----------|
| 0 | Setup | `cudaq.__version__`, `cudaq.get_targets()` |
| 1 | Kernels & Circuits | `@cudaq.kernel`, `cudaq.sample()` |
| 2 | Simulation Backends | `cudaq.set_target()` |
| 3 | Noise Models | `cudaq.NoiseModel`, density-matrix target |
| 4 | Parameterized Circuits | kernel parameters, sweeps |
| 5 | VQE | `cudaq.observe()`, spin operators, optimizer |
| 6 | AI for Quantum | data encoding, VQC, parameter-shift rule |

> **Prerequisite**: familiarity with qubits, gates (H, X, CNOT), and measurement.


In [None]:
import cudaq
import numpy as np
import matplotlib.pyplot as plt

print(f"CUDA-Q version: {cudaq.__version__}")
print()

# List available simulation / hardware targets
print("Available targets:")
for t in cudaq.get_targets():
    print(f"  - {t.name}")


## Section 1: Quantum Kernels & Basic Circuits

In CUDA-Q, quantum programs are **kernels** decorated with `@cudaq.kernel`.
The decorator compiles the function to CUDA-Q's MLIR-based quantum IR (Quake);
calls such as `cudaq.sample()` then execute it on the active target.

### Vocabulary
| Concept | CUDA-Q API |
|---------|-----------|
| Allocate qubits | `q = cudaq.qvector(n)` |
| Single-qubit gate | `h(q[0])`, `x(q[0])`, `ry(theta, q[0])` |
| Two-qubit gate | `cx(q[0], q[1])` |
| Measure all | `mz(q)` |
| Run (sample) | `cudaq.sample(kernel, *args, shots_count=N)` |

Let's build the **Bell state** |Φ⁺⟩ = (|00⟩ + |11⟩)/√2:
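
As a sanity check on what the kernel below should produce, the two gates can be applied to a four-entry amplitude list by hand. This is a minimal pure-Python sketch, not CUDA-Q code: the helper names `apply_h_q0` and `apply_cx_01` are hypothetical, and the index convention (qubit 0 as the leftmost bit) is an assumption chosen to match the |q0 q1⟩ labels used here.

```python
import math

# Amplitudes indexed as |q0 q1>: index = 2*q0 + q1 (assumed ordering for this sketch).
state = [1.0, 0.0, 0.0, 0.0]  # |00>

def apply_h_q0(amps):
    """Hadamard on qubit 0: mixes the pair (|0 b>, |1 b>) for each value b of qubit 1."""
    s = 1 / math.sqrt(2)
    out = [0.0] * 4
    for b in (0, 1):
        a0, a1 = amps[b], amps[2 + b]
        out[b] = s * (a0 + a1)
        out[2 + b] = s * (a0 - a1)
    return out

def apply_cx_01(amps):
    """CNOT with control qubit 0 and target qubit 1: swaps |10> and |11>."""
    out = list(amps)
    out[2], out[3] = amps[3], amps[2]
    return out

bell = apply_cx_01(apply_h_q0(state))
probs = {format(i, "02b"): abs(a) ** 2 for i, a in enumerate(bell)}
print(probs)
```

Only `00` and `11` carry weight, each with probability one half, which is the distribution the sampled histogram should approach as the shot count grows.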


In [None]:
# ─── Bell State ───────────────────────────────────────────
@cudaq.kernel
def bell_state():
    q = cudaq.qvector(2)
    h(q[0])           # |+⟩ superposition on qubit 0
    cx(q[0], q[1])    # CNOT: entangle qubits
    mz(q)             # measure both

counts = cudaq.sample(bell_state, shots_count=1000)

print("Bell State  |Φ⁺⟩ = (|00⟩ + |11⟩)/√2")
print("=" * 40)
counts.dump()

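# Plot (SampleResult is not a dict, so build one to default missing outcomes to 0)
states = ["00", "01", "10", "11"]
hist   = dict(counts.items())
probs  = [hist.get(s, 0) / 1000 for s in states]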

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(states, probs, color=["steelblue", "lightgray", "lightgray", "coral"], width=0.5)
ax.set_xlabel("Measurement outcome")
ax.set_ylabel("Probability")
ax.set_title("Bell State — Measurement Distribution")
ax.set_ylim(0, 1)
for bar, p in zip(bars, probs):
    if p > 0.01:
        ax.text(bar.get_x() + bar.get_width()/2, p + 0.02,
                f"{p:.2f}", ha="center", fontweight="bold")
plt.tight_layout()
plt.show()


In [None]:
# ─── GHZ State: generalized n-qubit entanglement ─────────
# |GHZ_n⟩ = (|00...0⟩ + |11...1⟩) / √2

@cudaq.kernel
def ghz_state(n: int):
    q = cudaq.qvector(n)
    h(q[0])
    for i in range(n - 1):
        cx(q[i], q[i + 1])
    mz(q)

for n in [4, 6, 8]:
    counts = cudaq.sample(ghz_state, n, shots_count=2000)
    hist = dict(counts.items())   # convert so missing outcomes default to 0
    zero = "0" * n
    one  = "1" * n
    p0 = hist.get(zero, 0) / 2000
    p1 = hist.get(one,  0) / 2000
    print(f"{n}-qubit GHZ:  P(|{zero}⟩)={p0:.3f}   P(|{one}⟩)={p1:.3f}")


In [None]:
# ─── ASCII circuit diagrams ───────────────────────────────
print("Bell State Circuit:")
print("=" * 50)
print(cudaq.draw(bell_state))

print("4-qubit GHZ Circuit:")
print("=" * 50)
print(cudaq.draw(ghz_state, 4))


## Section 2: Simulation Backends

CUDA-Q's key promise: **write once, run anywhere**. Switching backends requires
zero changes to your kernel: a single `cudaq.set_target(...)` call before sampling is enough.

| Target | Device | Notes |
|--------|--------|-------|
| `qpp-cpu` | CPU statevector | Default when no GPU is detected |
| `nvidia` | GPU, cuStateVec FP32 | Fast single-GPU |
| `nvidia-fp64` | GPU, cuStateVec FP64 | Higher precision; recent releases select it with `cudaq.set_target("nvidia", option="fp64")` |
| `density-matrix-cpu` | CPU density matrix | Used in this notebook for noise models |
| `tensornet` | GPU, cuTensorNet | Large qubit counts via tensor-network contraction (`tensornet-mps` for MPS) |

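Below we run the same kernel on the CPU and then the GPU and compare wall-clock time.
The first call to a kernel also includes compilation, so treat single-run timings as rough.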
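
To see why precision and backend choice matter as circuits grow, it helps to estimate the memory a dense statevector needs. The sketch below is a back-of-the-envelope calculation only: `statevector_bytes` is a hypothetical helper, and the figures ignore any workspace the simulator allocates on top of the amplitudes.

```python
def statevector_bytes(n_qubits: int, bytes_per_amplitude: int) -> int:
    """Memory for a dense statevector: 2**n complex amplitudes."""
    return (2 ** n_qubits) * bytes_per_amplitude

FP32_COMPLEX = 8    # two 4-byte floats per amplitude
FP64_COMPLEX = 16   # two 8-byte floats per amplitude

for n in (22, 30, 34):
    fp32 = statevector_bytes(n, FP32_COMPLEX) / 2**30
    fp64 = statevector_bytes(n, FP64_COMPLEX) / 2**30
    print(f"{n} qubits: {fp32:8.3f} GiB (FP32)  {fp64:8.3f} GiB (FP64)")
```

Each extra qubit doubles the footprint, and a density matrix holds 4^n entries rather than 2^n, which is why the density-matrix target is practical only for small qubit counts.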


In [None]:
import time

@cudaq.kernel
def large_ghz(n: int):
    q = cudaq.qvector(n)
    h(q[0])
    for i in range(n - 1):
        cx(q[i], q[i + 1])
    mz(q)

N = 22   # number of qubits
SHOTS = 1000

# ── CPU ──────────────────────────────────────────────────
cudaq.set_target("qpp-cpu")
t0 = time.perf_counter()
counts_cpu = cudaq.sample(large_ghz, N, shots_count=SHOTS)
cpu_time = time.perf_counter() - t0
print(f"qpp-cpu   | {N}-qubit GHZ | {cpu_time:.3f} s")

# ── GPU ──────────────────────────────────────────────────
try:
    cudaq.set_target("nvidia")          # cuStateVec FP32
    t0 = time.perf_counter()
    counts_gpu = cudaq.sample(large_ghz, N, shots_count=SHOTS)
    gpu_time = time.perf_counter() - t0
    print(f"nvidia    | {N}-qubit GHZ | {gpu_time:.3f} s")
    print(f"Speedup   | {cpu_time / gpu_time:.1f}x")
except Exception as e:
    print(f"GPU not available ({e})")
    print("-> Continuing on CPU. On a machine with an NVIDIA GPU,")
    print("   set_target('nvidia') gives significant speedups for 20+ qubits.")
    cudaq.set_target("qpp-cpu")

# Both backends should agree up to shot noise: ~50/50 between all-zeros and all-ones
print("\nCPU result: ", end=""); counts_cpu.dump()


In [None]:
# ─── State vectors ────────────────────────────────────────
# cudaq.get_state() returns the full 2^n complex amplitude vector.
# Available on statevector targets (qpp-cpu, nvidia).

cudaq.set_target("qpp-cpu")

@cudaq.kernel
def uniform_superposition(n: int):
    q = cudaq.qvector(n)
    for i in range(n):
        h(q[i])    # H⊗n → uniform superposition

n = 3
state = cudaq.get_state(uniform_superposition, n)

# Convert to numpy (may need list() on older CUDA-Q versions)
sv = np.array(state)

print(f"{n}-qubit uniform superposition  |+⟩^⊗{n}")
print(f"Expected amplitude per basis state: 1/√{2**n} = {1/np.sqrt(2**n):.4f}")
print()
print(f"{'|basis⟩':<12} {'Re(amp)':>10} {'Im(amp)':>10} {'|amp|²':>10}")
print("-" * 46)
for i, amp in enumerate(sv):
    basis = format(i, f"0{n}b")
    print(f"|{basis}⟩       {amp.real:>+10.4f} {amp.imag:>+10.4f} {abs(amp)**2:>10.4f}")

print(f"\nNorm check: {np.sum(np.abs(sv)**2):.8f}  (should be 1.0)")


## Section 3: Noise Models & Density Matrix Simulation

Real quantum hardware is noisy. CUDA-Q models this with **Kraus channels**
applied per gate. The `density-matrix-cpu` target tracks full ρ evolution.

### Built-in Channels

| Class | Effect |
|-------|--------|
| `cudaq.DepolarizationChannel(p)` | Random X/Y/Z error with prob p |
| `cudaq.BitFlipChannel(p)` | X error with prob p |
| `cudaq.PhaseFlipChannel(p)` | Z error with prob p |
| `cudaq.AmplitudeDampingChannel(p)` | T₁ relaxation (\|1⟩→\|0⟩) |
| `cudaq.KrausChannel([K0, K1, ...])` | Custom Kraus operators |

Attach channels to gates with:
```python
noise.add_channel("cx", [0, 1], channel)   # after every CX on qubits 0,1
```
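
To make the Kraus picture concrete, here is a small pure-Python sketch that applies the textbook amplitude-damping operators to the state |1⟩⟨1|. It does not call CUDA-Q; the helper names are hypothetical, and it is an assumption that `cudaq.AmplitudeDampingChannel(p)` follows this same parameterization.

```python
import math

def matmul(a, b):
    """2x2 matrix product on nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def dagger(a):
    """Conjugate transpose (all entries here are real, so this is a transpose)."""
    return [[a[j][i] for j in range(2)] for i in range(2)]

def amplitude_damping_kraus(p):
    """Textbook Kraus pair for amplitude damping with decay probability p."""
    k0 = [[1.0, 0.0], [0.0, math.sqrt(1 - p)]]
    k1 = [[0.0, math.sqrt(p)], [0.0, 0.0]]
    return [k0, k1]

def apply_channel(kraus_ops, rho):
    """rho -> sum over K of K rho K^dagger."""
    out = [[0.0, 0.0], [0.0, 0.0]]
    for k in kraus_ops:
        term = matmul(matmul(k, rho), dagger(k))
        for i in range(2):
            for j in range(2):
                out[i][j] += term[i][j]
    return out

rho_one = [[0.0, 0.0], [0.0, 1.0]]   # density matrix of |1>
for p in (0.0, 0.25, 0.5, 0.95):
    rho = apply_channel(amplitude_damping_kraus(p), rho_one)
    print(f"p={p:.2f}  P(0)={rho[0][0]:.3f}  P(1)={rho[1][1]:.3f}")
```

Under this convention the population moved from |1⟩ to |0⟩ equals p, so a sweep over p starting from |1⟩ should trace a straight line.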


In [None]:
# Compare ideal vs noisy Bell state
cudaq.set_target("density-matrix-cpu")

@cudaq.kernel
def bell_noisy():
    q = cudaq.qvector(2)
    h(q[0])
    cx(q[0], q[1])
    mz(q)

# Ideal (no noise model)
ideal = cudaq.sample(bell_noisy, shots_count=2000)

# Noisy: depolarization after H and CX
noise = cudaq.NoiseModel()
noise.add_channel("h",  [0],    cudaq.DepolarizationChannel(0.03))
noise.add_channel("cx", [0, 1], cudaq.DepolarizationChannel(0.05))
noisy = cudaq.sample(bell_noisy, noise_model=noise, shots_count=2000)

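# Visualise (SampleResult is not a dict; convert so missing outcomes default to 0)
states = ["00", "01", "10", "11"]
ideal_hist = dict(ideal.items())
noisy_hist = dict(noisy.items())
ip  = [ideal_hist.get(s, 0) / 2000 for s in states]
np_ = [noisy_hist.get(s, 0) / 2000 for s in states]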

x = np.arange(len(states))
w = 0.35
fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(x - w/2, ip,  w, label="Ideal",  color="steelblue", alpha=0.85)
ax.bar(x + w/2, np_, w, label="Noisy (3-5% depol.)", color="coral",     alpha=0.85)
ax.set_xticks(x); ax.set_xticklabels([f"|{s}⟩" for s in states])
ax.set_ylabel("Probability"); ax.set_title("Bell State: Ideal vs. Noisy")
ax.legend(); ax.grid(axis="y", alpha=0.3)
plt.tight_layout(); plt.show()

print("Ideal:"); ideal.dump()
print("Noisy:"); noisy.dump()


In [None]:
# ─── Amplitude damping: T₁ relaxation model ──────────────
# Prepare |1⟩, apply damping with increasing probability.
# As p→1, state decays back to |0⟩.

cudaq.set_target("density-matrix-cpu")

@cudaq.kernel
def excite_qubit():
    q = cudaq.qvector(1)
    x(q[0])     # |0⟩ → |1⟩
    mz(q)

ps = np.linspace(0, 0.95, 20)
p_zero = []

for p in ps:
    nm = cudaq.NoiseModel()
    nm.add_channel("x", [0], cudaq.AmplitudeDampingChannel(p))
    c = cudaq.sample(excite_qubit, noise_model=nm, shots_count=2000)
    p_zero.append(dict(c.items()).get("0", 0) / 2000)

plt.figure(figsize=(7, 4))
plt.plot(ps, p_zero, "o-", color="purple", markersize=5)
plt.xlabel("Amplitude Damping Probability  p")
plt.ylabel("P(|0⟩)  after preparing |1⟩")
plt.title("T₁ Relaxation: |1⟩ decays to |0⟩")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Reset to CPU statevector for next sections
cudaq.set_target("qpp-cpu")


## Section 4: Parameterized Circuits

Parameterized circuits are the core building block for variational algorithms
and quantum machine learning. CUDA-Q supports kernel parameters natively:

```python
@cudaq.kernel
def ansatz(theta: float, phi: float):
    q = cudaq.qvector(2)
    ry(theta, q[0])
    rz(phi,   q[1])
    cx(q[0],  q[1])
```

Supported parameter types include `float`, `list[float]`, and `int`.

Sweeping a parameter lets us trace out expectation values as continuous
functions — useful for visualisation and gradient computation.
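
For a single `ry` rotation the swept curve has a closed form, which gives a reference to compare sampled points against. The sketch below is plain Python rather than CUDA-Q; `ry_on_zero`, `prob_one`, and `shot_std_error` are hypothetical helper names, and the error bar is the usual binomial standard error.

```python
import math

def ry_on_zero(theta):
    """Amplitudes of RY(theta)|0> = cos(theta/2)|0> + sin(theta/2)|1>."""
    return (math.cos(theta / 2), math.sin(theta / 2))

def prob_one(theta):
    """Probability of measuring 1 after RY(theta) acts on |0>."""
    return ry_on_zero(theta)[1] ** 2

def shot_std_error(p, shots):
    """Binomial standard error of a probability p estimated from `shots` samples."""
    return math.sqrt(p * (1 - p) / shots)

for k in range(5):
    theta = k * math.pi / 2
    p = prob_one(theta)
    print(f"theta={k}*pi/2  P(1)={p:.3f}  +/-{shot_std_error(p, 500):.3f} at 500 shots")
```

The scatter of a sampled sweep around sin²(θ/2) is largest near P(1) = 0.5 and shrinks as 1/√shots, so quadrupling the shot count halves the visible noise.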


In [None]:
cudaq.set_target("qpp-cpu")

# Single-qubit RY rotation
@cudaq.kernel
def ry_rotation(theta: float):
    q = cudaq.qvector(1)
    ry(theta, q[0])
    mz(q)

# Sweep θ ∈ [0, 4π]
thetas = np.linspace(0, 4 * np.pi, 120)
prob_one = []
for theta in thetas:
    c = cudaq.sample(ry_rotation, theta, shots_count=500)
    prob_one.append(dict(c.items()).get("1", 0) / 500)

fig, ax = plt.subplots(figsize=(9, 4))
ax.plot(thetas / np.pi, prob_one,
        "b-", linewidth=1.5, alpha=0.8, label="Measured  P(|1⟩)")
ax.plot(thetas / np.pi, np.sin(thetas / 2) ** 2,
        "r--", linewidth=1.2, alpha=0.6, label="Analytical  sin²(θ/2)")
ax.set_xlabel("θ / π"); ax.set_ylabel("P(|1⟩)")
ax.set_title("RY Rotation: Bloch-Sphere Dynamics")
ax.legend(); ax.grid(True, alpha=0.3)
plt.tight_layout(); plt.show()

# Spot-check key angles: RY(θ)|0⟩ = cos(θ/2)|0⟩ + sin(θ/2)|1⟩
print(f"{'θ':>6}  {'State':>8}  {'P(|1⟩)':>8}")
print("-" * 30)
for theta_pi, label in [(0,"→|0⟩"), (1,"→|1⟩"), (2,"→-|0⟩"), (3,"→-|1⟩"), (4,"→|0⟩")]:
    c = cudaq.sample(ry_rotation, theta_pi * np.pi, shots_count=2000)
    p1 = dict(c.items()).get("1", 0) / 2000
    print(f"{theta_pi}π     {label:>8}  {p1:.3f}")


## Section 5: Hybrid Quantum-Classical — VQE

The **Variational Quantum Eigensolver (VQE)** minimises ⟨ψ(θ)|H|ψ(θ)⟩ over
circuit parameters θ using a classical optimiser.

```
Loop:
  1. Prepare |ψ(θ)⟩ on the active target (simulator or QPU)
  2. Measure ⟨H⟩ = cudaq.observe(ansatz, H, θ).expectation()
  3. Classical optimizer proposes new θ
  4. Repeat until convergence → estimate of the ground-state energy
```

CUDA-Q spin operators (Pauli terms):
```python
from cudaq import spin
H = spin.z(0) * spin.z(1) + 0.5 * spin.x(0) + 0.5 * spin.x(1)
```
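
For a two-qubit Hamiltonian the exact ground-state energy is cheap to compute classically, which gives a reference for judging the optimizer's final value. The sketch below builds the 4×4 matrix for this `H` from Pauli matrices and finds its lowest eigenvalue by shifted power iteration. It is dependency-free illustration only: all helper names are hypothetical, and with NumPy one would more typically call `np.linalg.eigvalsh`.

```python
import math

I2 = [[1.0, 0.0], [0.0, 1.0]]
X = [[0.0, 1.0], [1.0, 0.0]]
Z = [[1.0, 0.0], [0.0, -1.0]]

def kron(a, b):
    """Kronecker product of two 2x2 matrices, giving a 4x4 matrix."""
    return [[a[i // 2][j // 2] * b[i % 2][j % 2] for j in range(4)]
            for i in range(4)]

def add(*terms):
    """Sum of (coefficient, matrix) pairs."""
    return [[sum(c * m[i][j] for c, m in terms) for j in range(4)]
            for i in range(4)]

def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(4)) for i in range(4)]

# H = Z0 Z1 + 0.5 X0 + 0.5 X1
H = add((1.0, kron(Z, Z)), (0.5, kron(X, I2)), (0.5, kron(I2, X)))

def ground_energy(h, shift=3.0, iters=500):
    """Lowest eigenvalue of h via power iteration on (shift*I - h)."""
    v = [1.0, 0.3, 0.2, 0.1]                      # arbitrary start vector
    for _ in range(iters):
        hv = matvec(h, v)
        w = [shift * v[i] - hv[i] for i in range(4)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    hv = matvec(h, v)
    return sum(v[i] * hv[i] for i in range(4))    # Rayleigh quotient

e0 = ground_energy(H)
print(f"Exact ground energy: {e0:.6f}")
```

The result matches the closed form −√2 for this Hamiltonian, so a VQE run that settles noticeably above that value has either stalled in a local minimum or is limited by the ansatz.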


In [None]:
from cudaq import spin
from scipy.optimize import minimize

cudaq.set_target("qpp-cpu")

# ── Hamiltonian: H = Z₀Z₁ + 0.5 X₀ + 0.5 X₁ ────────────
H = spin.z(0) * spin.z(1) + 0.5 * spin.x(0) + 0.5 * spin.x(1)

# ── Hardware-efficient ansatz ─────────────────────────────
@cudaq.kernel
def ansatz(params: list[float]):
    q = cudaq.qvector(2)
    rx(params[0], q[0]); rx(params[1], q[1])
    cx(q[0], q[1])
    ry(params[2], q[0]); ry(params[3], q[1])
    rz(params[4], q[0]); rz(params[5], q[1])

print("Ansatz circuit:")
print(cudaq.draw(ansatz, [0.1] * 6))

# ── Objective ─────────────────────────────────────────────
history = []

def energy(params):
    result = cudaq.observe(ansatz, H, params.tolist())
    e = result.expectation()
    history.append(e)
    return e

# ── Optimise ──────────────────────────────────────────────
np.random.seed(42)
theta0 = np.random.uniform(0, 2 * np.pi, 6)
print(f"Initial energy: {energy(theta0):.4f}")

opt = minimize(energy, theta0, method="COBYLA",
               options={"maxiter": 300, "rhobeg": 0.5})

print(f"\nConverged : {opt.success}")
print(f"VQE energy estimate : {opt.fun:.6f}  (exact ground state: {-np.sqrt(2):.6f})")

# ── Convergence plot ──────────────────────────────────────
fig, ax = plt.subplots(figsize=(9, 4))
ax.plot(history, color="royalblue", linewidth=1.5, alpha=0.9)
ax.axhline(opt.fun, color="red", linestyle="--", linewidth=1.5,
           label=f"Converged  E = {opt.fun:.4f}")
ax.set_xlabel("Optimizer iteration"); ax.set_ylabel("Energy ⟨H⟩")
ax.set_title("VQE Convergence"); ax.legend(); ax.grid(True, alpha=0.3)
plt.tight_layout(); plt.show()


## Section 6: AI for Quantum — Variational Quantum Classifiers

Quantum Machine Learning (QML) treats a parameterized quantum circuit as a
**trainable model layer**:

```
Classical data  →  Encode into quantum state
                →  Apply parameterized circuit
                →  Measure observable → model output
                →  Classical optimizer adjusts params
```

### Parameter-Shift Rule (exact gradient)
```
∂⟨H⟩/∂θⱼ = [ ⟨H⟩(θⱼ + π/2) − ⟨H⟩(θⱼ − π/2) ] / 2
```
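For gates of the form exp(−iθP/2) with P a Pauli operator (including `rx`, `ry`,
and `rz`), this gives the exact gradient **without finite-difference approximations**.
It also works on real hardware, where each term is estimated from a finite number of shots.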
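
The rule is easy to verify on a case with a known answer. After `ry(θ)` on |0⟩ the expectation ⟨Z⟩ equals cos θ, so the true derivative is −sin θ. The sketch below uses that closed form in place of a circuit evaluation (an assumption made to keep it self-contained; the function names are hypothetical) and compares the parameter-shift result with a central finite difference.

```python
import math

def expectation_z(theta):
    """<Z> after RY(theta)|0>: cos^2(theta/2) - sin^2(theta/2) = cos(theta)."""
    return math.cos(theta / 2) ** 2 - math.sin(theta / 2) ** 2

def parameter_shift(f, theta):
    """Gradient from two evaluations shifted by +/- pi/2."""
    return (f(theta + math.pi / 2) - f(theta - math.pi / 2)) / 2

def central_difference(f, theta, h=0.1):
    """Finite-difference estimate; carries O(h^2) truncation error."""
    return (f(theta + h) - f(theta - h)) / (2 * h)

theta = 0.7
exact = -math.sin(theta)
ps = parameter_shift(expectation_z, theta)
cd = central_difference(expectation_z, theta)
print(f"analytic        : {exact:+.6f}")
print(f"parameter shift : {ps:+.6f}")
print(f"central diff    : {cd:+.6f}")
```

The shifted evaluations sit far apart on the cosine curve, which also makes the estimate less sensitive to shot noise than a finite difference with a small step.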


In [None]:
# ─── Angle Encoding: classical data → quantum state ───────

@cudaq.kernel
def angle_encode(n: int, features: list[float]):
    q = cudaq.qvector(n)
    for i in range(n):
        ry(features[i], q[i])   # map xᵢ → RY(xᵢ)|0⟩
    mz(q)

data_point = [np.pi / 4, np.pi / 2, 3 * np.pi / 4]
n = len(data_point)

print("Encoding:", data_point)
print()
print("Circuit:")
print(cudaq.draw(angle_encode, n, data_point))

counts = cudaq.sample(angle_encode, n, data_point, shots_count=2000)
print("Measurement distribution:")
counts.dump()

print("\nExpected P(|1⟩) per qubit (from sin²(x/2)):")
for i, x in enumerate(data_point):
    print(f"  q[{i}]: P(|1⟩) = sin²({x/(2*np.pi):.3f}π) = {np.sin(x/2)**2:.3f}")


In [None]:
# ─── Variational Quantum Classifier (VQC) ────────────────
# 2-qubit, 2-layer VQC trained on a simple 2D dataset

from cudaq import spin

cudaq.set_target("qpp-cpu")

# ── Architecture ──────────────────────────────────────────
# 8 trainable parameters: 2 layers × 2 qubits × (Ry + Rz)
N_PARAMS = 8

@cudaq.kernel
def vqc(features: list[float], params: list[float]):
    q = cudaq.qvector(2)

    # Angle encoding
    ry(features[0], q[0])
    ry(features[1], q[1])

    # Layer 0: entangle + rotate
    cx(q[0], q[1])
    ry(params[0], q[0]); ry(params[1], q[1])
    rz(params[2], q[0]); rz(params[3], q[1])

    # Layer 1: entangle + rotate
    cx(q[0], q[1])
    ry(params[4], q[0]); ry(params[5], q[1])
    rz(params[6], q[0]); rz(params[7], q[1])

obs = spin.z(0)    # output: expectation in [-1, +1]

def predict(features, params):
    r = cudaq.observe(vqc, obs, features, params)
    return r.expectation()

# ── Dataset: two linearly separable clusters ──────────────
X = np.array([[0.2, 0.3], [0.1, 0.5], [0.3, 0.1],
              [1.8, 1.9], [2.0, 1.7], [1.9, 2.1]])
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)

# ── Training with parameter-shift gradients ───────────────
np.random.seed(42)
params = np.random.uniform(0, 2 * np.pi, N_PARAMS)
lr = 0.3
losses = []

print(f"{'Epoch':<8} {'MSE Loss':<12} {'Accuracy'}")
print("-" * 34)

for epoch in range(50):
    total_loss = 0.0
    grad = np.zeros(N_PARAMS)
    correct = 0

    for xi, yi in zip(X, y):
        pred = predict(xi.tolist(), params.tolist())
        total_loss += (pred - yi) ** 2
        correct += int((pred > 0) == (yi > 0))

        # Parameter-shift rule: exact gradient
        for j in range(N_PARAMS):
            p_plus  = params.copy(); p_plus[j]  += np.pi / 2
            p_minus = params.copy(); p_minus[j] -= np.pi / 2
            shift = (predict(xi.tolist(), p_plus.tolist()) -
                     predict(xi.tolist(), p_minus.tolist())) / 2
            grad[j] += 2 * (pred - yi) * shift

    grad /= len(X)
    params -= lr * grad
    loss = total_loss / len(X)
    losses.append(loss)

    if epoch % 10 == 0:
        print(f"{epoch:<8} {loss:<12.4f} {correct/len(X):.0%}")

# ── Evaluation ────────────────────────────────────────────
print("\nFinal predictions:")
for xi, yi in zip(X, y):
    pred = predict(xi.tolist(), params.tolist())
    ok = (pred > 0) == (yi > 0)
    symbol = "[OK]" if ok else "[X]"
    print(f"  {symbol}  features={xi}  true={int(yi):+d}  pred={pred:+.3f}")

# ── Loss curve ────────────────────────────────────────────
plt.figure(figsize=(8, 4))
plt.plot(losses, "g-", linewidth=2)
plt.xlabel("Epoch"); plt.ylabel("MSE Loss")
plt.title("VQC Training Loss (Parameter-Shift Gradients)")
plt.grid(True, alpha=0.3)
plt.tight_layout(); plt.show()


## Summary & Next Steps

Great work! Here's a quick reference for everything covered:

| Topic | Key API |
|-------|---------|
| Quantum kernels | `@cudaq.kernel` + `cudaq.qvector(n)` |
| Run circuits | `cudaq.sample(kernel, *args, shots_count=N)` |
| State vector | `cudaq.get_state(kernel, *args)` |
| Expectation value | `cudaq.observe(kernel, H, *args).expectation()` |
| Switch backend | `cudaq.set_target("nvidia")` |
| Noise simulation | `cudaq.NoiseModel()` + density-matrix target |
| Spin Hamiltonians | `from cudaq import spin; spin.z(0) * spin.z(1)` |

## Where to Go Next

- **[CUDA-Q Docs](https://nvidia.github.io/cuda-quantum/)**: Full API reference
- **Real QPU execution**: targets `ionq`, `quantinuum`, `iqm`, `oqc`
- **Multi-GPU**: `cudaq.set_target("nvidia", option="mgpu")`
- **Tensor networks**: `cudaq.set_target("tensornet")` for 50+ qubits
- **CUDA-Q Libraries**: Pre-built ansatz kernels such as UCCSD in `cudaq.kernels`; QAOA and chemistry solvers in the separate CUDA-QX Solvers library

## Suggested Exercises

1. Implement the 3-qubit bit-flip error-correcting code and test with `BitFlipChannel`
2. Build a QAOA circuit for Max-Cut on a 4-node graph
3. Extend the VQC to 3 classes using softmax over 3 observables
4. Profile CPU vs GPU simulation time as a function of qubit count (12 – 28 qubits)
5. Add shot noise to the VQE cost function and study its effect on convergence

Happy quantum computing!
