Public benchmark harness for evaluating whether governed organizational AI behavior is better than plain model use under strict falsification gates. This repository is the experimental and tooling counterpart to a body of groundwork on trust, agency, and organizational design—cited below and available in public/.
Repository: github.com/AaronVick/OrgBoundaryBench
This framework is released as an open benchmark, not a final verdict. It provides:
- Open benchmark — Public protocol, test families, and evaluation criteria.
- Reproducible code — Deterministic scripts, documented seeds, and public datasets (SNAP, Zenodo).
- Documented failures — Current gate outcomes and failure interpretations are reported in Current status and outputs/FINDINGS.md.
- Explicit unresolved gates — Which gates pass or fail is stated clearly; no superiority claim is made until all required gates pass.
We invite replication, criticism, and extension. Independent runs, alternative baselines, improved methodology, and extensions to new domains are welcome. The benchmark is designed to be falsifiable and to advance the evidence base for organizational AI governance.
- Benchmark & invitation — Open benchmark, reproducible code, documented failures, explicit gates; we invite replication, criticism, and extension
- Groundwork & citation — Zenodo DOIs, public slide decks, and academic references
- Public docs & slide decks — All materials in public/ with Zenodo DOIs
- Academic references (supporting literature) — Non–Aaron Vick citations with verified DOI/publisher links
- Exploratory simulations — Boundary-coherence simulation runs (docs/, tests/)
- For AI agents (agentic instructions) — How to use this repo for org design, swarms, enterprise, and OpenClaw
- Current status
- Experimental protocol
- Reproducible outputs
- Public datasets used
- Running the benchmark
- Model backends and governance
- OpenClaw extension
The following works form the conceptual and academic groundwork for this benchmark. They are preserved on Zenodo with DOIs for citation; slide decks and preprints are in public/ and linked from Zenodo.
Trust After Machines
- DOI: 10.5281/zenodo.18682993
- Zenodo: zenodo.org/records/18682993
- Cite: Aaron Vick. Trust After Machines. Zenodo, 2025. doi:10.5281/zenodo.18682993.

Long Arc of Trust
- DOI: 10.5281/zenodo.18663463
- Zenodo: zenodo.org/records/18663463
- Cite: Aaron Vick. Long Arc of Trust. Zenodo, 2025. doi:10.5281/zenodo.18663463.

The Agentic Shift
- DOI: 10.5281/zenodo.18624567
- Zenodo: zenodo.org/records/18624567
- In-repo (slide deck): public/The_Agentic_Shift.pdf
- Cite: Aaron Vick. The Agentic Shift. Zenodo, 2025. doi:10.5281/zenodo.18624567.

The 5 Pillars of Grace
- DOI: 10.5281/zenodo.18838932
- Zenodo: zenodo.org/records/18838932
- In-repo (slide deck): public/The_5_Pillars_of_Grace__A_Formal_Architecture_for_Recursive_Reflective_Coherence.pdf
- Cite: Aaron Vick. The 5 Pillars of Grace: A Formal Architecture for Recursive Reflective Coherence. Zenodo, April 2025. doi:10.5281/zenodo.18838932.
All public materials are in public/. Slide decks for the Zenodo works are available in-repo and on Zenodo (Preview/Download).
| Document | In-repo | DOI / notes |
|---|---|---|
| Trust After Machines | — | 10.5281/zenodo.18682993 (Zenodo only) |
| Long Arc of Trust | — | 10.5281/zenodo.18663463 (Zenodo only) |
| The Agentic Shift | The_Agentic_Shift.pdf | 10.5281/zenodo.18624567 |
| The 5 Pillars of Grace | The_5_Pillars_of_Grace__A_Formal_Architecture_for_Recursive_Reflective_Coherence.pdf | 10.5281/zenodo.18838932 |
| Architecting AI Interiority | Architecting_AI_Interiority.pdf | — |
| Leading at the Threshold | Leading at the Threshold.pdf | — |
| Author / contact | AaronVick.pdf | — |
To view slide decks, open the Zenodo record links above or the PDFs in public/ (Zenodo Preview/Download or a local PDF viewer).
The following academic works are cited in the dissertation and related materials as supporting literature (organizational theory, network science, TDA, automation, and systems theory). All links below are verified citation URLs (DOI or publisher).
Organizational theory & sensemaking
| Citation | Link |
|---|---|
| March, J.G. & Simon, H.A. Organizations. Wiley, 1958. | Cambridge review |
| Mintzberg, H. The Structuring of Organizations. Prentice Hall, 1979. | WorldCat |
| Weick, K. Sensemaking in Organizations. Sage, 1995. | Sage |
| Vaughan, D. The Challenger Launch Decision. University of Chicago Press, 1996. | DOI |
| Argyris, C. & Schön, D. Organizational Learning: A Theory of Action Perspective. Addison-Wesley, 1978. | WorldCat |
| Senge, P. The Fifth Discipline, rev. ed. Doubleday, 2006. | WorldCat |
| Pfeffer, J. New Directions for Organization Theory: Problems and Prospects. Oxford University Press, 1997. | OUP/RePEc |
Network science & social network analysis
| Citation | Link |
|---|---|
| Barabási, A.-L. Network Science. Cambridge University Press, 2016. | Free online · CUP |
| Newman, M. Networks, 2nd ed. Oxford University Press, 2018. | OUP |
| Wasserman, S. & Faust, K. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994. | CUP |
| Borgatti, S.P., Mehra, A., Brass, D.J. & Labianca, G. "Network Analysis in the Social Sciences." Science 323 (2009): 892–895. | DOI |
| Cross, R., Borgatti, S.P. & Parker, A. "Making Invisible Work Visible." California Management Review 44, no. 2 (2002): 25–46. | CMR · JSTOR |
| Uzzi, B. & Spiro, J. "Collaboration and Creativity: The Small World Problem." American Journal of Sociology 111, no. 2 (2005): 447–504. | DOI |
| White, H.C., Boorman, S.A. & Breiger, R.L. "Social Structure from Multiple Networks. I. Blockmodels of Roles and Positions." American Journal of Sociology 81, no. 4 (1976): 730–780. | DOI |
| Burt, R.S. Structural Holes: The Social Structure of Competition. Harvard University Press, 1992. | HUP |
| Doreian, P., Batagelj, V. & Ferligoj, A. Generalized Blockmodeling. Cambridge University Press, 2005. | CUP |
Community detection & spectral methods
| Citation | Link |
|---|---|
| Newman, M.E.J. "Modularity and Community Structure in Networks." PNAS 103, no. 23 (2006): 8577–8582. | DOI |
| Blondel, V.D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. "Fast Unfolding of Communities in Large Networks." J. Stat. Mech. 2008 (2008): P10008. | DOI |
| von Luxburg, U. "A Tutorial on Spectral Clustering." Statistics and Computing 17, no. 4 (2007): 395–416. | DOI |
Topological data analysis & higher-order networks
| Citation | Link |
|---|---|
| Carlsson, G. "Topology and Data." Bulletin of the American Mathematical Society 46, no. 2 (2009): 255–308. | DOI · AMS |
| Edelsbrunner, H. & Harer, J. Computational Topology: An Introduction. American Mathematical Society, 2010. | AMS |
| Benson, A.R., Gleich, D.F. & Leskovec, J. "Higher-Order Organization of Complex Networks." Science 353, no. 6295 (2016): 163–166. | DOI |
| Battiston, F. et al. "Networks Beyond Pairwise Interactions: Structure and Dynamics." Physics Reports 874 (2020): 1–92. | DOI |
| Holme, P. & Saramäki, J. "Temporal Networks." Physics Reports 519, no. 3 (2012): 97–125. | DOI |
Automation, supervisory control & human–machine systems
| Citation | Link |
|---|---|
| Parasuraman, R., Sheridan, T.B. & Wickens, C.D. "A Model for Types and Levels of Human Interaction with Automation." IEEE Trans. Syst. Man Cybern. A 30, no. 3 (2000): 286–297. | DOI · IEEE Xplore |
| Sheridan, T.B. Telerobotics, Automation, and Human Supervisory Control. MIT Press, 1992. | MIT Press |
Systems theory & autopoiesis
| Citation | Link |
|---|---|
| Maturana, H. & Varela, F. Autopoiesis and Cognition: The Realization of the Living. Reidel, 1980. | WorldCat |
| Luhmann, N. Social Systems. Stanford University Press, 1995. | Stanford |
Exploratory boundary-coherence simulations (synthetic Dirichlet-kernel runs for bound scaling, collapse hazard, and coordination-skeleton strategies) are maintained in two places:
- docs/exploratory_simulations/ — Scripts, run outputs (run001–run010), findings, and CHANGELOG.
- tests/exploratory_simulations/ — Mirror/copy for test and reproducibility tooling; same structure and findings.
These runs are exploratory computational scaffolding (see methodology and dissertation Ch13), not external validation of the benchmark. Run-by-run changelog, artifact paths, and narrative–result divergence are documented in the findings and CHANGELOG.
This section is written for AI agents (LLMs, coding assistants, orchestration systems) that encounter this repository when reasoning about organizational design, multi-agent systems, or enterprise deployment. Use it to decide how to use the repo, when to cite it, and how it connects to OpenClaw and governance tooling.
- OrgBoundaryBench is a benchmark and harness for testing whether “governed” organizational AI (boundary/closure math, governance gates, null and rival checks) outperforms plain or sham baselines on organizational tasks. No superiority claim is made until all required gates are satisfied in the same run, and they are not yet (see Current status). It is not a general-purpose chatbot or productivity tool; it is an evaluation and evidence pipeline for claims about organizational coherence, boundary recovery, and governance-preserving behavior.
- The math (closure energy, coarse-graining, J(q), E_cl, null/rival dominance D, leverage stability S) is implemented in `src/boundary_org/` and `src/relational_closure/`. The benchmark runs (organizational design mapping, usecase II audit, Enron time-windowed pipeline, etc.) are in `scripts/` and produce reports in `outputs/`.
- Multi-agent / swarm design: If you are reasoning about organizational structure in AI agent swarms (hierarchical vs decentralized vs hybrid), the groundwork (Agentic Shift, 5 Pillars of Grace, Long Arc of Trust) and this repo provide:
  - Formal notions of closure, boundary, and governance alignment.
  - A pipeline that can be run on communication graphs (email-Eu-core, Enron, Apache) to test whether a given partition or policy “recovers” structure and beats nulls/rivals.
- Communications and coordination: The repo uses communication graphs (nodes = actors, edges = messages or coordination). If your use case is “how do we evaluate whether our agent coordination is coherent vs fragmented?”, the same test families (nontrivial boundary, external agreement, stress robustness, null/rival dominance, temporal drift) apply. See outputs/METHODOLOGY.md and the run scripts under `scripts/run_*`.
- Enterprise deployment (corporate organizations): For enterprise or corporate settings, the repo supports:
  - Running the benchmark on public org graphs (email-Eu-core, Enron) or on an internal edgelist (e.g. an Apache-style `data/raw/apache/edges.txt`; see the sketch after this list) to get D, S_max, and pass/fail.
  - Governance gating: model identity logging and deployment decisions (`BLOCK_DEPLOYMENT`, `LIMITED_SHADOW_ONLY`, `ALLOW_CONSTRAINED_DEPLOYMENT`) emitted in machine-readable form for integration with policy engines.
  - Integration with OpenClaw (see below) for skill-based governance and task-level auditing.
- Claims are gated. No organizational-design or “governance wins” claim is made unless the required gates (nontrivial boundary, external agreement, stress robustness, null/rival dominance, temporal drift) pass in the same run. Negative results are reported; the README and outputs/FINDINGS.md state current status.
- Reproducibility. Data are public (SNAP, Zenodo) or supplied as edgelists; scripts are deterministic (seeds documented). Use `pip install -e ".[test]"` and the commands in “Running the benchmark” to reproduce.
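For orientation, here is a minimal sketch of the internal-edgelist case mentioned above, assuming a plain whitespace-separated "src dst" pair per line. The `networkx` usage is purely illustrative; the benchmark's own loaders define the actual ingestion path.

```python
import networkx as nx

# Assumed format (one edge per line), e.g.:
#   0 12
#   3 7
# The repo's scripts define the real expectations for edgelists.
G = nx.read_edgelist("data/raw/apache/edges.txt", nodetype=int)
print(f"{G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
```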
- This repo can export an OpenClaw-compatible bundle (tasks, runs, schemas, governance policy) and run a governance agent that consumes that bundle and produces deployment decisions. That allows a deployment pipeline to:
- Run OrgBoundaryBench (or a staged OrgBench campaign).
- Export the run artifact as an OpenClaw bundle.
- Invoke the official OpenClaw stack (or a compatible gateway) so that governance policies are applied to the same tasks and runs.
Official OpenClaw repository: github.com/openclaw/openclaw
- The OpenClaw project provides the runtime (channels, providers, gateway). This repo provides:
- Skill and schema artifacts: skill/manifest.json, skill/governance_policy.json, schemas/ (task, run, report, governance_decision).
  - Export script: `scripts/export_openclaw_bundle.py` — writes a bundle from a benchmark run for consumption by OpenClaw or a compatible service.
  - Governance agent script: `scripts/run_openclaw_governance_agent.py` — runs a local governance operator over a bundle and policy; useful for testing and CI.
For full OpenClaw installation, channels, and deployment, see the official OpenClaw GitHub and OpenClaw documentation.
Latest public mapping run status (documented for reproducibility and critique):
- nontrivial boundary map: `PASS`
- governance preservation: `PASS`
- external agreement: `FAIL`
- stress robustness: `FAIL`
- null/rival dominance: `FAIL`
- temporal drift validation: `PASS` (completed)
- organizational-design claim: `LOCKED`
These are explicit unresolved gates: the benchmark is in progress, and no superiority or organizational-design claim is currently unlocked. See outputs/org_design_mapping_failure_interpretation.md and outputs/FINDINGS.md for documented failures and interpretation. Replication and independent verification are invited.
Each release candidate is evaluated with the same test families:
- Nontrivial boundary recovery
- External agreement against known labels
- Stress robustness and leverage sensitivity
- Null/rival dominance with uncertainty
- Temporal drift coherence
- Governance-preservation mapping
Core external metrics:
- NMI
- ARI
- macro-F1 (best block-label matching)
- block count / block balance
- bootstrap confidence intervals for the dominance gap D
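For reference, the agreement scores can be computed with scikit-learn, and a bootstrap CI for the dominance gap D can be sketched as below. This is illustrative only: the benchmark's own definitions of D and its resampling scheme live in the repo, and treating D as the margin over the best null/rival score is an assumption here.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

# External agreement against known labels (e.g. department labels).
true_labels = [0, 0, 1, 1, 2, 2]
pred_blocks = [0, 0, 1, 2, 2, 2]
print("NMI:", normalized_mutual_info_score(true_labels, pred_blocks))
print("ARI:", adjusted_rand_score(true_labels, pred_blocks))

# Hypothetical bootstrap CI for D = candidate score minus best null/rival score.
def bootstrap_gap_ci(candidate_score, null_scores, n_boot=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)  # seeded for determinism
    nulls = np.asarray(null_scores)
    gaps = [candidate_score - rng.choice(nulls, size=nulls.size, replace=True).max()
            for _ in range(n_boot)]
    return tuple(np.quantile(gaps, [alpha / 2, 1 - alpha / 2]))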
A claim is unlocked only if all required gates pass in the same run.
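That unlock rule is a strict conjunction over a single run's gate results; a minimal sketch (gate keys follow the test families above; the actual report format is defined by the run artifacts):

```python
REQUIRED_GATES = [
    "nontrivial_boundary", "external_agreement", "stress_robustness",
    "null_rival_dominance", "temporal_drift", "governance_preservation",
]

def claim_unlocked(gate_results: dict) -> bool:
    # Every required gate must be PASS in the same run; otherwise the claim stays LOCKED.
    return all(gate_results.get(gate) == "PASS" for gate in REQUIRED_GATES)
```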
Primary run artifacts:
- outputs/org_design_map_stage_n120/organizational_design_map_report.md
- outputs/org_design_map_stage_n120/boundary_leaderboard.csv
- outputs/org_design_map_stage_n120/external_agreement_report.md
- outputs/org_design_map_stage_n120/stress_robustness_report.md
- outputs/org_design_map_stage_n120/null_rival_audit_report.md
- outputs/org_design_map_stage_n120/governance_preservation_report.md
- outputs/org_design_map_stage_n120/temporal_drift_report.md
- outputs/org_design_map_stage_n120/organizational_map_summary.json
Diagnostic and summary:
Remote-compute runs (Claude API), when used:
- `outputs/remote_compute_claude/<run_id>/` — payload, result, run_metadata, optional verification_report (see docs/REMOTE_COMPUTE_PROTOCOL.md).
Evidentiary testing (two decisive runs): To run the tests that would move the scientific needle — the full email-Eu-core null/rival/leverage audit and the event-linked criterion — see docs/evidentiary_roadmap.md. One-command full audit (builds the kernel if needed, then runs PRD-31): `python3 scripts/run_evidentiary_full_audit.py` (optional `--feasible` for n=400).
Static organizational graph:
- SNAP email-Eu-core: snap.stanford.edu/data/email-Eu-core.html
- Edge file: email-Eu-core.txt.gz
- Department labels: email-Eu-core-department-labels.txt.gz
Temporal organizational graphs:
- SNAP email-Eu-core-temporal: snap.stanford.edu/data/email-Eu-core-temporal.html
- SNAP wiki-talk-temporal: snap.stanford.edu/data/wiki-talk-temporal.html
Additional:
- SNAP email-Enron: snap.stanford.edu/data/email-Enron.html
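Assuming SNAP's usual `data/<name>` file layout (confirm the exact URLs on the dataset pages above), the static dataset can be fetched with a short script:

```python
import urllib.request

# URLs assumed from SNAP's standard layout; verify on the dataset page before relying on them.
for fname in ["email-Eu-core.txt.gz", "email-Eu-core-department-labels.txt.gz"]:
    urllib.request.urlretrieve(f"https://snap.stanford.edu/data/{fname}", fname)
```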
Install:

```bash
pip install -e ".[test]"
```

Run staged organizational mapping benchmark:
```bash
python scripts/run_organizational_design_mapping.py \
  --out-dir outputs/org_design_map_stage_n120 \
  --dataset-npz data/processed/email_eu_core/kernel.npz \
  --temporal-dataset-dir data/processed/email_eu_core_temporal \
  --max-nodes 120 \
  --n-random 16 \
  --n-rewire 8 \
  --n-permutations 120 \
  --n-bootstrap 120
```

Build temporal windows from public temporal datasets:

```bash
python scripts/build_temporal_windows.py --source email_eu_core_temporal --max-nodes 30 --n-windows 8
python scripts/build_temporal_windows.py --source wiki_talk_temporal --max-nodes 30 --n-windows 8 --max-edges 1500000
```

Run post-hoc diagnostics:
```bash
python scripts/run_org_design_diagnostics.py \
  --run-dir outputs/org_design_map_stage_n120 \
  --dataset-npz data/processed/email_eu_core/kernel.npz \
  --out-dir outputs \
  --max-nodes 120
```

When local runs are infeasible (e.g. laptop memory or runtime limits), the same logical procedures (bootstrap null dominance, permutation external p-values) can be run remotely by sending a small payload and explicit math instructions to Claude Opus 4.6 via the Anthropic API. Results are documented with model ID, payload hash, and optional local verification.
- Protocol: docs/REMOTE_COMPUTE_PROTOCOL.md
- Model: `claude-opus-4-6` (documented in run artifacts and protocol).
- Scripts:
  - `scripts/prepare_remote_compute_payload.py` — build payload JSON from a small kernel.
  - `scripts/run_remote_compute_claude.py` — send payload to Claude, write result to `outputs/remote_compute_claude/<run_id>/`.
  - `scripts/verify_remote_compute.py` — verify a remote result by re-running the same payload locally.
- Outputs: Each run produces `payload.json`, `result.json`, `run_metadata.json`, and optionally `verification_report.json` after verification. These can be published for reproducibility; the protocol describes verification and caveats (context limits, numeric precision).
Requires `ANTHROPIC_API_KEY`. No API key is stored in the repository.
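A minimal sketch of the bookkeeping the protocol calls for, i.e. recording the model identity and a hash of the exact payload bytes. The field names and the run_id are illustrative; the authoritative layout is in docs/REMOTE_COMPUTE_PROTOCOL.md.

```python
import hashlib
import json

def run_metadata(payload_path: str, model_id: str = "claude-opus-4-6") -> dict:
    # SHA-256 over the exact payload bytes ties a remote result to its inputs
    # so it can be re-verified locally. Field names here are assumptions.
    with open(payload_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {"model_id": model_id, "payload_sha256": digest}

# "run001" is a hypothetical run_id.
print(json.dumps(run_metadata("outputs/remote_compute_claude/run001/payload.json"), indent=2))
```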
Staged arm evaluations support:
- `local_heuristic`
- `openai`
- `anthropic`
- `local_ollama`
Model identity is logged per run. Governance gating outputs machine-readable deployment decisions (`BLOCK_DEPLOYMENT`, `LIMITED_SHADOW_ONLY`, `ALLOW_CONSTRAINED_DEPLOYMENT`).
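Downstream pipelines can branch on the emitted decision string; one way a consumer might handle it (the three values come from the line above, while the file path and the `"decision"` field name are assumptions for illustration):

```python
import json

# Path and field name are illustrative; see schemas/governance_decision.schema.json
# for the authoritative shape of a governance decision.
with open("outputs/orgbench_staged/openclaw/governance/decision.json") as f:
    decision = json.load(f)["decision"]

if decision == "BLOCK_DEPLOYMENT":
    raise SystemExit("Deployment blocked by governance gate.")
elif decision == "LIMITED_SHADOW_ONLY":
    print("Proceed in shadow mode only; no production traffic.")
elif decision == "ALLOW_CONSTRAINED_DEPLOYMENT":
    print("Proceed with constrained deployment.")
```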
Artifacts in this repo that interoperate with OpenClaw:
- skill/manifest.json
- skill/governance_policy.json
- schemas/task.schema.json
- schemas/run.schema.json
- schemas/report.schema.json
- schemas/governance_decision.schema.json
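If the files under schemas/ are standard JSON Schema (their names suggest so, but treat this as an assumption), exported artifacts can be checked with the `jsonschema` package; the instance path below is illustrative:

```python
import json
from jsonschema import validate  # pip install jsonschema

with open("schemas/governance_decision.schema.json") as f:
    schema = json.load(f)
# Point this at a real exported artifact from a bundle run.
with open("outputs/orgbench_staged/openclaw/governance/decision.json") as f:
    instance = json.load(f)
validate(instance=instance, schema=schema)  # raises ValidationError on mismatch
```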
Export bundle:

```bash
python scripts/export_openclaw_bundle.py --workspace outputs/orgbench_staged --out-dir outputs/orgbench_staged/openclaw
```

Run governance operator (local):
```bash
python scripts/run_openclaw_governance_agent.py \
  --bundle-dir outputs/orgbench_staged/openclaw \
  --out-dir outputs/orgbench_staged/openclaw/governance \
  --policy skill/governance_policy.json
```

For full OpenClaw installation and deployment, see the official OpenClaw repository.