Skip to content

refactor: redesign pipeline groups 05/07/08, expand to 24 pipelines#9

Merged
George930502 merged 7 commits into
mainfrom
refactor/pipeline-redesign-24
Mar 27, 2026
Merged

refactor: redesign pipeline groups 05/07/08, expand to 24 pipelines#9
George930502 merged 7 commits into
mainfrom
refactor/pipeline-redesign-24

Conversation

@George930502
Copy link
Copy Markdown
Collaborator

@George930502 George930502 commented Mar 26, 2026

Summary

  • Group 05 (HF-only): Added hf_sqd.py SQD variant, completing the 3-mode set (classical Krylov, quantum Krylov, SQD)
  • Group 07: Rewritten from "DCI seed + separate NQS loop" → "NF training + DCI merge → iterative NQS refinement" (consistent with Group 02 basis generation)
  • Group 08: Rewritten from "DCI+PT2 seed + separate NQS loop" → "NF training + DCI merge + PT2 expansion → iterative NQS refinement" (consistent with Group 03 basis generation)
  • Cleanup: Deleted redundant experiments/methods/ (11 scripts) and experiments/configs/ (11 YAML files), fully replaced by experiments/pipelines/
  • Docs: Updated all references across README.md, AGENTS.md, experiments_guide.md, Sphinx RST docs, tutorials, and ADR-001

All 8 groups now follow a consistent pattern: each group has exactly 3 diag variants (Classical Krylov, Quantum Krylov, SQD) = 24 pipelines total.

Test plan

  • All 24 pipelines pass on WSL2 Ubuntu with CUDA GPU (run_all_pipelines.py h2 --device cuda → 24/24 OK)
  • All 24 pipelines pass in Docker GPU container (docker run --gpus all → 24/24 OK)
  • Verify no broken Sphinx doc references (make html)
  • Verify CI lint passes (ruff check)

🤖 Generated with Claude Code

Fix three bugs preventing GPU execution across all pipeline variants:

1. SQD eigensolver (krylov/circuits/sqd.py): gpu_eigsh() was called with
   invalid kwargs (which, use_gpu) and a torch tensor instead of numpy.
   Every SQD batch silently fell back to CPU. Now converts to numpy first
   and removes unsupported kwargs.

2. IBM SQD API (methods/nqs/hi_nqs_sqd.py, hi_nqs_skqd.py): Updated
   recover_configurations() and solve_fermion() calls to match
   qiskit-addon-sqd >= 0.12 API (new probabilities arg, split
   num_elec_a/num_elec_b, SCIState return type). Default use_ibm_solver
   set to False since SCIState.amplitudes shape doesn't align with input
   configs for NQS teacher training.

3. Dockerfile.gpu: Add configs extra to pip install for YAML support.

Also adds:
- 23 pipeline scripts (experiments/pipelines/) covering 8 groups × 3
  diag modes (classical Krylov, quantum Krylov, SQD)
- Pipeline runner (run_all_pipelines.py) with comparison table
- Quantum circuit method scripts and configs
- quantum_circuit subpackage (methods/quantum_circuit/)

All 23 pipelines verified passing on GPU in both WSL2 and Docker.
…elines

- Group 05 (HF-only): add hf_sqd.py SQD variant, completing the 3-mode set
- Group 07: rewrite from "DCI seed + separate NQS loop" to "NF training +
  DCI merge → iterative NQS refinement" (same as Group 02 stages 1-2 feeding
  into Group 06 stage 3)
- Group 08: rewrite from "DCI+PT2 seed + separate NQS loop" to "NF training +
  DCI merge + PT2 expansion → iterative NQS refinement" (same as Group 03
  stages 1-2.5 feeding into Group 06 stage 3)
- Delete redundant experiments/methods/ (11 scripts) and experiments/configs/
  (11 YAML files), replaced by experiments/pipelines/
- Update run_all_pipelines.py from 23 to 24 pipelines
- Update all documentation: README.md, AGENTS.md, experiments_guide.md,
  pipelines.rst, yaml_configs.rst, h2/beh2 tutorials, ADR-001

All 24 pipelines verified passing on GPU in both WSL2 and Docker.
@George930502 George930502 requested a review from thc1006 March 26, 2026 15:46
thc1006 added 2 commits March 27, 2026 07:47
1. CRITICAL: vectorized_dedup returns np.ndarray — wrap with
   torch.from_numpy before .to(device) to prevent AttributeError crash
   (hi_nqs_sqd.py:267)

2. MEDIUM: _compute_orbital_occupancies device regression — move coeffs
   to configs.device (keep GPU) instead of forcing configs to CPU
   (sqd.py:889-891)

3. MEDIUM: Replace `mol_info.get(key) or fallback` with
   `mol_info.get(key, fallback)` to avoid falsy-zero trap, plus
   getattr guard for hamiltonian.integrals on non-molecular Hamiltonians
   (hi_nqs_sqd.py:219-221, hi_nqs_skqd.py:225-227)

4. LOW: Remove redundant ground_state.to(H_work.device) since both
   branches already place it there (sqd.py:786)

Signed-off-by: thc1006 <84045975+thc1006@users.noreply.github.com>
1. run_all_pipelines.py: refactor --skip-iterative filter from fragile
   and/or chain to clear str.startswith(tuple) form
2. 04_nf_only/nf_sqd.py: add missing 5th tier (<=5000) in
   get_training_params to match all other pipeline scripts
3. 07/08 groups: fix YAML key mismatch — scripts read "n_iterations"
   but YAML defines "max_iterations". Standardize to "max_iterations"
4. 05_hf_only.yaml: add missing SQD config keys (sqd_noise_rate,
   sqd_num_batches, sqd_self_consistent_iters)

Signed-off-by: thc1006 <84045975+thc1006@users.noreply.github.com>
Copy link
Copy Markdown
Member

@thc1006 thc1006 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

Deep review of all 60 files. Found 12 issues (1 critical, 6 medium, 5 low).

Critical (will crash at runtime)

hi_nqs_sqd.py:267vectorized_dedup() returns np.ndarray, but .to(device) is called on it. This raises AttributeError whenever cumulative_basis.shape[0] > 0.

Medium

  • sqd.py:889_compute_orbital_occupancies moves configs to CPU (coeffs' device) instead of moving coeffs to GPU. Performance regression.
  • hi_nqs_sqd.py:219, hi_nqs_skqd.py:225mol_info.get(key) or fallback uses or which treats integer 0 as falsy. Should use .get(key, default).
  • Groups 07/08 (6 pipelines) — NF+DCI basis is computed in stages 1-2 but never passed to run_hi_nqs_skqd in stage 3. These pipelines are functionally identical to Group 06. Tracked in #10.
  • run_all_pipelines.py--skip-iterative filter has confusing and/or precedence (works by coincidence).
  • 04_nf_only/nf_sqd.pyget_training_params has 4 tiers (missing <=5000), all other files have 5.
  • Groups 07/08 YAML — YAML defines max_iterations but scripts read n_iterations — value silently ignored.

Low

  • 05_hf_only.yaml missing SQD keys
  • No hamiltonian.integrals guard for non-molecular Hamiltonians
  • Redundant .to(H_work.device) in sqd.py
  • compare.py incomplete config validation

Fix PR

All fixes except #4 (requires API change, tracked in #10) are in PR #11.

@thc1006
Copy link
Copy Markdown
Member

thc1006 commented Mar 26, 2026

可以直接 merge PR #11 到 PR #9 branch,然後再 merge PR #9 到 main @George930502

thc1006 and others added 2 commits March 27, 2026 08:03
When neither mol_info nor hamiltonian.integrals provides n_orbitals,
n_alpha, or n_beta, the old fallback to 0 would cause division-by-zero
in occupancy computation and misconfigure AutoregressiveTransformer.

Now raises ValueError with a clear message listing which values are
missing. Addresses Copilot review on PR #11.

Signed-off-by: thc1006 <84045975+thc1006@users.noreply.github.com>
fix: PR #9 code review findings — 10 bug fixes
@George930502 George930502 merged commit e1af2a7 into main Mar 27, 2026
5 checks passed
@thc1006 thc1006 deleted the refactor/pipeline-redesign-24 branch March 31, 2026 09:27
thc1006 added a commit to thc1006/qvartools that referenced this pull request Apr 1, 2026
Critical fixes:
- #1/#2: Use persistent prev_coeffs/prev_batch_configs for PT2 scoring
  (was using best_coeffs before it was defined in current iteration)
- QuantumNoLab#5: Eviction checks cumulative_basis size (was checking best_batch_configs)
- QuantumNoLab#7: Convert numpy indices to torch.LongTensor before indexing CUDA tensor
- QuantumNoLab#8: Default energy_weight=0.0, entropy_weight=0.0 (was 0.1/0.05,
  silently changing existing behavior)

Lower priority fixes:
- QuantumNoLab#6: CIPSI docstring mentions build_sparse_hamiltonian requirement
- QuantumNoLab#9: Warn when energy_weight>0 but hamiltonian=None

Not applicable (won't fix):
- QuantumNoLab#3/QuantumNoLab#4: config_integer_hash returns tuple for n_sites>=64; our CAS max
  is 58Q (<64), so int cast is safe

Co-authored-by: leo07010 <leo07010@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants