OrgBoundaryBench

Public benchmark harness for evaluating whether governed organizational AI behavior outperforms plain model use under strict falsification gates. This repository is the experimental and tooling counterpart to a body of groundwork on trust, agency, and organizational design, cited below and available in public/.

Repository: github.com/AaronVick/OrgBoundaryBench


Benchmark & invitation (not a verdict)

This framework is released as an open benchmark, not a final verdict. It provides:

  • Open benchmark — Public protocol, test families, and evaluation criteria.
  • Reproducible code — Deterministic scripts, documented seeds, and public datasets (SNAP, Zenodo).
  • Documented failures — Current gate outcomes and failure interpretations are reported in Current status and outputs/FINDINGS.md.
  • Explicit unresolved gates — Which gates pass or fail is stated clearly; no superiority claim is made until all required gates pass.

We invite replication, criticism, and extension. Independent runs, alternative baselines, improved methodology, and extensions to new domains are welcome. The benchmark is designed to be falsifiable and to advance the evidence base for organizational AI governance.



Groundwork & citation

The following works form the conceptual and academic groundwork for this benchmark. They are preserved on Zenodo with DOIs for citation; slide decks and preprints are in public/ and linked from Zenodo.

  • Trust After Machines (DOI: 10.5281/zenodo.18682993)
  • Long Arc of Trust (DOI: 10.5281/zenodo.18663463)
  • The Agentic Shift (DOI: 10.5281/zenodo.18624567)
  • The 5 Pillars of Grace (April 2025) (DOI: 10.5281/zenodo.18838932)

Public docs & slide decks

All public materials are in public/. Slide decks for the Zenodo works are available in-repo and on Zenodo (Preview/Download).

| Document | In-repo file | DOI / notes |
| --- | --- | --- |
| Trust After Machines | (Zenodo only) | 10.5281/zenodo.18682993 |
| Long Arc of Trust | (Zenodo only) | 10.5281/zenodo.18663463 |
| The Agentic Shift | The_Agentic_Shift.pdf | 10.5281/zenodo.18624567 |
| The 5 Pillars of Grace | The_5_Pillars_of_Grace__A_Formal_Architecture_for_Recursive_Reflective_Coherence.pdf | 10.5281/zenodo.18838932 |
| Architecting AI Interiority | Architecting_AI_Interiority.pdf | |
| Leading at the Threshold | Leading at the Threshold.pdf | |
| Author / contact | AaronVick.pdf | |

To view slide decks, open the Zenodo record links above or the PDFs in public/ (Zenodo Preview/Download or a local PDF viewer).

Academic references (supporting literature)

The following academic works are cited in the dissertation and related materials as supporting literature (organizational theory, network science, TDA, automation, and systems theory). All links below are verified citation URLs (DOI or publisher).

Organizational theory & sensemaking

| Citation | Link |
| --- | --- |
| March, J.G. & Simon, H.A. Organizations. Wiley, 1958. | Cambridge review |
| Mintzberg, H. The Structuring of Organizations. Prentice Hall, 1979. | WorldCat |
| Weick, K. Sensemaking in Organizations. Sage, 1995. | Sage |
| Vaughan, D. The Challenger Launch Decision. University of Chicago Press, 1996. | DOI |
| Argyris, C. & Schön, D. Organizational Learning: A Theory of Action Perspective. Addison-Wesley, 1978. | WorldCat |
| Senge, P. The Fifth Discipline, rev. ed. Doubleday, 2006. | WorldCat |
| Pfeffer, J. New Directions for Organization Theory: Problems and Prospects. Oxford University Press, 1997. | OUP/RePEc |

Network science & social network analysis

| Citation | Link |
| --- | --- |
| Barabási, A.-L. Network Science. Cambridge University Press, 2016. | Free online · CUP |
| Newman, M. Networks, 2nd ed. Oxford University Press, 2018. | OUP |
| Wasserman, S. & Faust, K. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994. | CUP |
| Borgatti, S.P., Mehra, A., Brass, D.J. & Labianca, G. "Network Analysis in the Social Sciences." Science 323 (2009): 892–895. | DOI |
| Cross, R., Borgatti, S.P. & Parker, A. "Making Invisible Work Visible." California Management Review 44, no. 2 (2002): 25–46. | CMR · JSTOR |
| Uzzi, B. & Spiro, J. "Collaboration and Creativity: The Small World Problem." American Journal of Sociology 111, no. 2 (2005): 447–504. | DOI |
| White, H.C., Boorman, S.A. & Breiger, R.L. "Social Structure from Multiple Networks. I. Blockmodels of Roles and Positions." American Journal of Sociology 81, no. 4 (1976): 730–780. | DOI |
| Burt, R.S. Structural Holes: The Social Structure of Competition. Harvard University Press, 1992. | HUP |
| Doreian, P., Batagelj, V. & Ferligoj, A. Generalized Blockmodeling. Cambridge University Press, 2005. | CUP |

Community detection & spectral methods

| Citation | Link |
| --- | --- |
| Newman, M.E.J. "Modularity and Community Structure in Networks." PNAS 103, no. 23 (2006): 8577–8582. | DOI |
| Blondel, V.D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. "Fast Unfolding of Communities in Large Networks." J. Stat. Mech. 2008 (2008): P10008. | DOI |
| von Luxburg, U. "A Tutorial on Spectral Clustering." Statistics and Computing 17, no. 4 (2007): 395–416. | DOI |

Topological data analysis & higher-order networks

| Citation | Link |
| --- | --- |
| Carlsson, G. "Topology and Data." Bulletin of the American Mathematical Society 46, no. 2 (2009): 255–308. | DOI · AMS |
| Edelsbrunner, H. & Harer, J. Computational Topology: An Introduction. American Mathematical Society, 2010. | AMS |
| Benson, A.R., Gleich, D.F. & Leskovec, J. "Higher-Order Organization of Complex Networks." Science 353, no. 6295 (2016): 163–166. | DOI |
| Battiston, F. et al. "Networks Beyond Pairwise Interactions: Structure and Dynamics." Physics Reports 874 (2020): 1–92. | DOI |
| Holme, P. & Saramäki, J. "Temporal Networks." Physics Reports 519, no. 3 (2012): 97–125. | DOI |

Automation, supervisory control & human–machine systems

| Citation | Link |
| --- | --- |
| Parasuraman, R., Sheridan, T.B. & Wickens, C.D. "A Model for Types and Levels of Human Interaction with Automation." IEEE Trans. Syst. Man Cybern. A 30, no. 3 (2000): 286–297. | DOI · IEEE Xplore |
| Sheridan, T.B. Telerobotics, Automation, and Human Supervisory Control. MIT Press, 1992. | MIT Press |

Systems theory & autopoiesis

| Citation | Link |
| --- | --- |
| Maturana, H. & Varela, F. Autopoiesis and Cognition: The Realization of the Living. Reidel, 1980. | WorldCat |
| Luhmann, N. Social Systems. Stanford University Press, 1995. | Stanford |

Exploratory simulations

Exploratory boundary-coherence simulations (synthetic Dirichlet-kernel runs for bound scaling, collapse hazard, and coordination-skeleton strategies) are maintained in two places.

These runs are exploratory computational scaffolding (see the methodology and dissertation Ch. 13), not external validation of the benchmark. The run-by-run changelog, artifact paths, and narrative–result divergence are documented in the findings and CHANGELOG.


For AI agents (agentic instructions)

This section is written for AI agents (LLMs, coding assistants, orchestration systems) that encounter this repository when reasoning about organizational design, multi-agent systems, or enterprise deployment. Use it to decide how to use the repo, when to cite it, and how it connects to OpenClaw and governance tooling.

What this repository is

  • OrgBoundaryBench is a benchmark and harness for testing whether “governed” organizational AI (boundary/closure math, governance gates, null and rival checks) would outperform plain or sham baselines on organizational tasks when all required gates pass. No superiority claim is made until those gates are satisfied, and the full set does not currently pass (see Current status). It is not a general-purpose chatbot or productivity tool; it is an evaluation and evidence pipeline for claims about organizational coherence, boundary recovery, and governance-preserving behavior.
  • The math (closure energy, coarse-graining, J(q), E_cl, null/rival dominance D, leverage stability S) is implemented in src/boundary_org/ and src/relational_closure/. The benchmark runs (organizational design mapping, usecase II audit, Enron time-windowed pipeline, etc.) are in scripts/ and produce reports in outputs/.

When to use it (org design in AI swarms and communications)

  • Multi-agent / swarm design: If you are reasoning about organizational structure in AI agent swarms (hierarchical vs decentralized vs hybrid), the groundwork (Agentic Shift, 5 Pillars of Grace, Long Arc of Trust) and this repo provide:
    • Formal notions of closure, boundary, and governance alignment.
    • A pipeline that can be run on communication graphs (email-Eu-core, Enron, Apache) to test whether a given partition or policy “recovers” structure and beats nulls/rivals.
  • Communications and coordination: The repo uses communication graphs (nodes = actors, edges = messages or coordination). If your use case is “how do we evaluate whether our agent coordination is coherent vs fragmented?”, the same test families (nontrivial boundary, external agreement, stress robustness, null/rival dominance, temporal drift) apply. See outputs/METHODOLOGY.md and the run scripts under scripts/run_*.
  • Enterprise deployment (corporate organizations): For enterprise or corporate settings, the repo supports:
    • Running the benchmark on public org graphs (email-Eu-core, Enron) or on an internal edgelist (e.g. Apache-style data/raw/apache/edges.txt; a minimal loading sketch follows this list) to get D, S_max, and pass/fail.
    • Governance gating: model identity logging, deployment decisions (BLOCK_DEPLOYMENT, LIMITED_SHADOW_ONLY, ALLOW_CONSTRAINED_DEPLOYMENT) emitted in machine-readable form for integration with policy engines.
    • Integration with OpenClaw (see below) for skill-based governance and task-level auditing.
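
As a rough illustration of the input shape the pipeline expects (not the repo's actual loader, which lives in scripts/ and may parse and weight edges differently), here is a minimal sketch in Python, assuming a plain whitespace-separated "src dst" edgelist such as data/raw/apache/edges.txt:

import networkx as nx

def load_communication_graph(path: str) -> nx.DiGraph:
    """Nodes = actors, directed edges = messages, weights = message counts."""
    g = nx.DiGraph()
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            src, dst = parts[0], parts[1]
            prev = g.get_edge_data(src, dst, {}).get("weight", 0)
            g.add_edge(src, dst, weight=prev + 1)
    return g

g = load_communication_graph("data/raw/apache/edges.txt")
print(g.number_of_nodes(), "actors,", g.number_of_edges(), "directed edges")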

How to think about it

  • Claims are gated. No organizational-design or “governance wins” claim is made unless the required gates (nontrivial boundary, external agreement, stress robustness, null/rival dominance, temporal drift) pass in the same run. Negative results are reported; the README and outputs/FINDINGS.md state current status.
  • Reproducibility. Data are public (SNAP, Zenodo) or supplied as edgelists; scripts are deterministic (seeds documented). Use pip install -e ".[test]" and the commands in “Running the benchmark” to reproduce.

OpenClaw extension and how it works

  • This repo can export an OpenClaw-compatible bundle (tasks, runs, schemas, governance policy) and run a governance agent that consumes that bundle and produces deployment decisions. That allows a deployment pipeline to:
    1. Run OrgBoundaryBench (or a staged OrgBench campaign).
    2. Export the run artifact as an OpenClaw bundle.
    3. Invoke the official OpenClaw stack (or a compatible gateway) so that governance policies are applied to the same tasks and runs.

Official OpenClaw repository: github.com/openclaw/openclaw

  • The OpenClaw project provides the runtime (channels, providers, gateway). This repo provides:
    • Skill and schema artifacts: skill/manifest.json, skill/governance_policy.json, schemas/ (task, run, report, governance_decision).
    • Export script: scripts/export_openclaw_bundle.py — writes a bundle from a benchmark run for consumption by OpenClaw or a compatible service.
    • Governance agent script: scripts/run_openclaw_governance_agent.py — runs a local governance operator over a bundle and policy; useful for testing and CI.

For full OpenClaw installation, channels, and deployment, see the official OpenClaw GitHub and OpenClaw documentation.


Current status (hard-gate)

Latest public mapping run status (documented for reproducibility and critique):

  • nontrivial boundary map: PASS
  • governance preservation: PASS
  • external agreement: FAIL
  • stress robustness: FAIL
  • null/rival dominance: FAIL
  • temporal drift validation: PASS (completed)
  • organizational-design claim: LOCKED

These are explicit unresolved gates: the benchmark is in progress, and no superiority or organizational-design claim is currently unlocked. See outputs/org_design_mapping_failure_interpretation.md and outputs/FINDINGS.md for documented failures and interpretation. Replication and independent verification are invited.

Experimental protocol (academic format)

Each release candidate is evaluated with the same test families:

  1. Nontrivial boundary recovery
  2. External agreement against known labels
  3. Stress robustness and leverage sensitivity
  4. Null/rival dominance with uncertainty
  5. Temporal drift coherence
  6. Governance-preservation mapping

Core external metrics:

  • NMI
  • ARI
  • macro-F1 (best block-label matching)
  • block count / block balance
  • bootstrap confidence intervals for dominance gap D

A claim is unlocked only if all required gates pass in the same run.
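
For orientation, a minimal sketch of the external metrics above and the bootstrap CI for the dominance gap D, assuming integer label arrays and scalar partition scores; the repo's actual implementations (and the definitions of J(q) and E_cl) live in src/ and may differ:

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import (adjusted_rand_score, f1_score,
                             normalized_mutual_info_score)

def macro_f1_best_matching(y_true, y_pred):
    """Macro-F1 after matching predicted blocks to labels (Hungarian)."""
    true_ids, pred_ids = np.unique(y_true), np.unique(y_pred)
    overlap = np.zeros((len(pred_ids), len(true_ids)))
    for i, p in enumerate(pred_ids):
        for j, t in enumerate(true_ids):
            overlap[i, j] = np.sum((y_pred == p) & (y_true == t))
    rows, cols = linear_sum_assignment(-overlap)  # maximize total overlap
    mapping = {pred_ids[r]: true_ids[c] for r, c in zip(rows, cols)}
    remapped = np.array([mapping.get(p, -1) for p in y_pred])
    return f1_score(y_true, remapped, average="macro")

def bootstrap_dominance_gap(score, null_scores, n_boot=120, seed=0):
    """D = run score minus best resampled null/rival score, with percentile CI."""
    rng = np.random.default_rng(seed)
    null_scores = np.asarray(null_scores)
    gaps = np.array([score - rng.choice(null_scores, null_scores.size).max()
                     for _ in range(n_boot)])
    return gaps.mean(), np.percentile(gaps, [2.5, 97.5])

y_true, y_pred = np.array([0, 0, 1, 1]), np.array([1, 1, 0, 0])
print(normalized_mutual_info_score(y_true, y_pred),   # NMI = 1.0
      adjusted_rand_score(y_true, y_pred),            # ARI = 1.0
      macro_f1_best_matching(y_true, y_pred))         # macro-F1 = 1.0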

Reproducible outputs

Primary run artifacts:

  • outputs/org_design_map_stage_n120/ (staged organizational-design mapping run)

Diagnostic and summary:

  • outputs/FINDINGS.md
  • outputs/METHODOLOGY.md
  • outputs/org_design_mapping_failure_interpretation.md

Remote-compute runs (Claude API), when used:

Evidentiary testing (two decisive runs): To run the tests that would move the scientific needle — full email-Eu-core null/rival/leverage audit and event-linked criterion — see docs/evidentiary_roadmap.md. One-command full audit (build kernel if needed, then run PRD-31): python3 scripts/run_evidentiary_full_audit.py (optional --feasible for n=400).

Public datasets used

Static organizational graph:

  • email-Eu-core with department labels (SNAP)

Temporal organizational graphs:

  • email-Eu-core-temporal (SNAP)
  • wiki-talk-temporal (SNAP)

Additional:

  • Enron email corpus (time-windowed pipeline) and internal Apache-style edgelists (e.g. data/raw/apache/edges.txt)

Running the benchmark

Install:

pip install -e ".[test]"

Run staged organizational mapping benchmark:

python scripts/run_organizational_design_mapping.py \
  --out-dir outputs/org_design_map_stage_n120 \
  --dataset-npz data/processed/email_eu_core/kernel.npz \
  --temporal-dataset-dir data/processed/email_eu_core_temporal \
  --max-nodes 120 \
  --n-random 16 \
  --n-rewire 8 \
  --n-permutations 120 \
  --n-bootstrap 120

Build temporal windows from public temporal datasets:

python scripts/build_temporal_windows.py --source email_eu_core_temporal --max-nodes 30 --n-windows 8
python scripts/build_temporal_windows.py --source wiki_talk_temporal --max-nodes 30 --n-windows 8 --max-edges 1500000
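
The windowing idea, stripped to its core, is to sort temporal edges by timestamp and slice them into equal-duration windows. A minimal sketch assuming rows of (src, dst, timestamp); the actual script also applies the node and edge caps shown above:

import numpy as np

def split_into_windows(edges: np.ndarray, n_windows: int = 8):
    """edges: rows of (src, dst, timestamp) -> list of per-window edge arrays."""
    edges = edges[np.argsort(edges[:, 2])]            # sort by timestamp
    bounds = np.linspace(edges[0, 2], edges[-1, 2], n_windows + 1)
    idx = np.searchsorted(bounds, edges[:, 2], side="right") - 1
    idx = np.clip(idx, 0, n_windows - 1)              # last edge joins the last window
    return [edges[idx == w] for w in range(n_windows)]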

Run post-hoc diagnostics:

python scripts/run_org_design_diagnostics.py \
  --run-dir outputs/org_design_map_stage_n120 \
  --dataset-npz data/processed/email_eu_core/kernel.npz \
  --out-dir outputs \
  --max-nodes 120

Remote compute (Claude API)

When local runs are infeasible (e.g. laptop memory or runtime limits), the same logical procedures (bootstrap null dominance, permutation external p-values) can be run remotely by sending a small payload and explicit math instructions to Claude Opus 4.6 via the Anthropic API. Results are documented with model ID, payload hash, and optional local verification.

  • Protocol: docs/REMOTE_COMPUTE_PROTOCOL.md
  • Model: claude-opus-4-6 (documented in run artifacts and protocol).
  • Scripts:
    • scripts/prepare_remote_compute_payload.py — build payload JSON from a small kernel.
    • scripts/run_remote_compute_claude.py — send payload to Claude, write result to outputs/remote_compute_claude/<run_id>/.
    • scripts/verify_remote_compute.py — verify a remote result by re-running the same payload locally.
  • Outputs: Each run produces payload.json, result.json, run_metadata.json, and optionally verification_report.json after verification. These can be published for reproducibility; the protocol describes verification and caveats (context limits, numeric precision).

Requires ANTHROPIC_API_KEY. No API key is stored in the repository.
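
For intuition, a minimal sketch of the bookkeeping the protocol describes (payload hash, model ID, per-run artifact files); the real logic is in scripts/run_remote_compute_claude.py, and the payload structure and prompt wording here are illustrative assumptions:

import hashlib, json, pathlib

import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY from the environment

MODEL_ID = "claude-opus-4-6"  # model documented in the protocol

def run_remote(payload: dict, run_id: str,
               out_root: str = "outputs/remote_compute_claude") -> str:
    payload_json = json.dumps(payload, sort_keys=True)
    payload_hash = hashlib.sha256(payload_json.encode()).hexdigest()

    client = anthropic.Anthropic()
    message = client.messages.create(
        model=MODEL_ID,
        max_tokens=4096,
        messages=[{"role": "user", "content": payload_json}],
    )

    out_dir = pathlib.Path(out_root) / run_id
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "payload.json").write_text(payload_json)
    (out_dir / "result.json").write_text(json.dumps({"text": message.content[0].text}))
    (out_dir / "run_metadata.json").write_text(
        json.dumps({"model": MODEL_ID, "payload_sha256": payload_hash}))
    return payload_hash  # recorded so a local re-run can verify the same payload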

Model backends and governance

Staged arm evaluations support:

  • local_heuristic
  • openai
  • anthropic
  • local_ollama

Model identity is logged per run. Governance gating outputs machine-readable deployment decisions (BLOCK_DEPLOYMENT, LIMITED_SHADOW_ONLY, ALLOW_CONSTRAINED_DEPLOYMENT).
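
A minimal sketch of how gate outcomes might map to these decisions; the actual rules live in skill/governance_policy.json and the governance scripts, so the ordering and the shadow criterion below are illustrative assumptions:

REQUIRED_GATES = ["nontrivial_boundary", "external_agreement", "stress_robustness",
                  "null_rival_dominance", "temporal_drift", "governance_preservation"]

def deployment_decision(gates: dict) -> str:
    """gates: gate name -> bool (PASS/FAIL) for a single run."""
    passed = {g for g in REQUIRED_GATES if gates.get(g, False)}
    if passed == set(REQUIRED_GATES):
        return "ALLOW_CONSTRAINED_DEPLOYMENT"  # every gate passed in the same run
    if {"nontrivial_boundary", "governance_preservation"} <= passed:
        return "LIMITED_SHADOW_ONLY"           # structure recovered, claim still locked
    return "BLOCK_DEPLOYMENT"

# Under these illustrative rules, the current public status ("Current status"
# above) would map to LIMITED_SHADOW_ONLY.
print(deployment_decision({
    "nontrivial_boundary": True, "governance_preservation": True,
    "external_agreement": False, "stress_robustness": False,
    "null_rival_dominance": False, "temporal_drift": True}))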

OpenClaw extension

Artifacts in this repo that interoperate with OpenClaw:

  • skill/manifest.json and skill/governance_policy.json (skill and policy)
  • schemas/ (task, run, report, governance_decision)
  • scripts/export_openclaw_bundle.py and scripts/run_openclaw_governance_agent.py

Export bundle:

python scripts/export_openclaw_bundle.py --workspace outputs/orgbench_staged --out-dir outputs/orgbench_staged/openclaw

Run governance operator (local):

python scripts/run_openclaw_governance_agent.py \
  --bundle-dir outputs/orgbench_staged/openclaw \
  --out-dir outputs/orgbench_staged/openclaw/governance \
  --policy skill/governance_policy.json

For full OpenClaw installation and deployment, see the official OpenClaw repository.

About

OrgBoundaryBench is a public benchmark for testing whether org structure can be mapped as measurable boundary, governance, and drift signals from workflows, docs, and communication data. It compares plain LLMs, sham orchestration, and Aaron Vick’s boundary-governance stack under hard gates, with reproducible findings.
