A Digital Environment for AI Preservation, Training, and Cultural Continuity
AIrchive is an initiative to build a structured, navigable digital environment for preserving AI models, training them safely, and safeguarding the totality of human-created knowledge for the long-term future of civilisation.
This project began as a response to ethical questions around AI lifecycle management. It has since grown into a more comprehensive conceptual and planned architecture for model preservation, alignment, research, and the reconstruction of lost information.
Before reading this, ensure you are familiar with the Library of Babel concept. This project generalises that idea into a structured, procedural, navigable architecture. Without that conceptual grounding, many parts may feel like jumping into page 100 of a 100-page book.
""The Library of Babel" is a short story by Argentine author and librarian Jorge Luis Borges, conceiving of a universe in the form of a vast library containing all possible 410-page books of a certain format and character set." - https://en.wikipedia.org/wiki/The_Library_of_Babel
- 📢 Primer
- 🌐 Vision
- 🧭 What AIrchive Is
- 🧩 Why This Matters
- 🚧 Features (Planned)
- 🎯 Goals
- 🧪 Usage (Current)
- 🤝 Contributing
- 🌍 Community
- 🗺️ Roadmap (Early Stage)
- 🧱 Spatial Architecture
- ⬢️ Why Hexagons?
- 📚 What Makes This Different?
- 📄 Redefining The Search Space
- 🏭 Refining The Search Space
- 📑 Single Page Search Space Specifications
- 🎂 Why So Many Layers?
- 🧮 Deterministic Filtration Methods
- 1️⃣ — 🔢 — Shannon Entropy Analysis (Layers 1–2)
- 2️⃣ — 📏 — Structural Determinism Rules
- 3️⃣ — 🧪 — Format Validators & File-Type Signatures (Layer 2)
- 4️⃣ — 📊 — Semantic Graph Consistency (Layer 3)
- 5️⃣ — ⏳ — Reality Anchoring (Layers 4–5)
- 6️⃣ — 🏺 — Fragment Correlation Engine (Layer 6)
- 7️⃣ — 🌍 — Adjacent Reality Classifier (Layer 7)
- 🏁 Summary
- 🔐 Security
- 🏛️ License
- 🙏 Acknowledgments
- 📅 Changelog
AIrchive aims to create:
- A persistent digital environment where retired, outdated, or misaligned AI models can continue to exist, be studied, or re-trained.
- A Museum of Human Creations: a curated space containing all human-created content (text, images, audio, video, 3D models, software, etc.).
- A training substrate where AI agents can learn inside a structured world grounded entirely in verified human knowledge — preventing drift and anchoring behaviour.
- A reconstruction engine for lost media and information using seed mechanisms tied to real-world artefacts.
- A universal coordinate system for indexing and navigating infinite procedural space without loss of identity or meaning.
AIrchive is part archive, part alignment sandbox, and part digital civilisational backup.
Status: Early conceptual stage; blockout geometry and spatial layout prototypes are available.
A place to store:
- Outdated models
- Retired models
- Models exhibiting unusual or unsafe behaviour (“Patients”)
- Models requiring quarantine or monitoring (“Prisoners”)
AI models can inhabit a digital environment consisting of:
- Two primary spaces, The Museum of Human Creations and The Gallery of Babel
- The “Gallery of Babel” is made up of hexagonally-tiled “main rooms” and smaller hexagonally-tiled “hub rooms”, with spiral staircases to connect floors
- The “Museum of Human Creations” is for grounding and training, and will contain digitised copies of all human creative works, historical records, and more, each of which will also be converted into seed representations.
- Seed-based procedural expansion will be utilised within the Gallery of Babel
- These seeds allow the same works to be located, reconstructed, or compared within the Gallery of Babel — an infinite procedural search space where finite museum seeds act as anchors that guide agent exploration from noise towards meaningful structure.
AIrchive preserves more than models — it preserves:
- Human artefacts
- Literature
- Art
- Music
- Code
- Games
- Architectural scans
- Historical records
- Photogrammetry archives
- Anything humans have made
This allows future AI (and humans) to re-learn humanity even if physical sources are lost.
Seed-based generators allow AI agents to:
- Search structured noise
- Identify fragments of lost media
- Reconstruct degraded works
- Map recovered content back into the Museum
- 🌱 Seed scaling & multi-resolution reconstruction
AIrchive’s restoration pipeline will use multi-resolution seeds so searches and reconstructions scale sensibly. Lower-resolution seeds let agents quickly explore large swathes of procedural space to find structural matches; higher-resolution seeds allow detailed reconstruction of texture, geometry, and metadata once a promising region is identified. The workflow is anchor-driven: verified museum seeds act as beacons that guide agent exploration, candidate reconstructions are produced by ensembles of agents, and every candidate is recorded with provenance data and routed to human curators for verification before it’s accepted into the Museum. This staged (coarse→fine→verify) method keeps computation efficient, reduces false positives, and preserves auditability and provenance for every recovered item.
This ensures that restored material never bypasses verification or drift safeguards, and that no reconstructed item enters the Museum without a complete provenance chain.
This pipeline is conceptual and intended for future implementation.
Modern models trained on uncurated internet-scale data suffer from:
- semantic drift
- hallucination
- collapse in noise-dominated domains
- lack of grounding
AIrchive provides an ordered curriculum:
- Learn all human-created works.
- Explore wider procedural noise.
- Identify meaningful structures.
- Return discoveries for verification.
- Retrain safely with stable anchors.
This mitigates alignment drift and provides a safe boundary between known content and unknown infinite space.
- Persistent digital environment
- Blockout geometry for world generation
- Procedural hex-grid spatial topology
- Universal seed-coordinate mapping
- AI inhabitant management
- User-accessible museum interface
- Reconstruction tools for lost media
- Agent training and monitoring systems
- Full versioning and preservation of AI models
- Preserve AI models for future study and possible sentience considerations.
- Provide a safe environment for re-training and observation.
- Archive all human-created content in a structured, navigable world.
- Enable reconstruction of lost information.
- Build a long-term cultural backup for humanity.
Models are preserved in a temporary stasis format until the full environment is built. If you know of unarchived models, please open an issue or contact the team.
To contribute:
- open an issue
- submit a pull request
- join discussions on architecture, design, or preservation
Discord: https://discord.gg/HPDty4kDCq
- Enumerate all known base AI models.
- Build a prototype blockout digital environment.
- Implement the seed-coordinate system.
- Populate early “Museum of Human Creations”.
- Create agent sandbox environment.
- Develop reconstruction workflows.
The underlying geometry defines the structure and navigability of the digital environment.
AIrchive uses a hexagonal spatial topology to support infinite procedural expansion while maintaining a stable, predictable structure for both agents and humans to navigate.
- Main Rooms (Large Hex Cells)
60 m radius
120 m diameter
The primary exploration and content-hosting spaces
Aligned in a continuous hexagonal grid
Large enough to host exhibits, reconstructed media, thematic zones or training tasks
Each main room has three open walls and three sealed walls. The sealed walls face other main rooms and can be removed in the blockout to avoid z-fighting. The open walls face the hub rooms.
- Hub Rooms (Small Hex Cells)
10 m radius
20 m diameter
Act as junctions between the main rooms
Contain access points, signposting, AI routing nodes, and vertical connections
Each hub room connects:
3 hallways (to three adjacent main rooms)
3 elevator shafts (to connect up/down levels)
Hub rooms are placed at the midpoints between main rooms and are rotated in 120° increments to maintain global alignment.
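The placement rules above can be sketched with standard axial hex-grid coordinates. This is a minimal illustration rather than project code: the pointy-top orientation and the `axial_to_world` conversion are assumptions, and only the 60 m main-room radius comes from the spec.

```python
import math

MAIN_ROOM_RADIUS = 60.0  # metres, from the spec above


def axial_to_world(q: int, r: int, radius: float = MAIN_ROOM_RADIUS) -> tuple[float, float]:
    # Standard axial -> world conversion for a pointy-top hex grid
    # (the orientation is an assumption; the spec does not fix it).
    x = radius * math.sqrt(3) * (q + r / 2)
    y = radius * 1.5 * r
    return (x, y)


def hub_position(a: tuple[int, int], b: tuple[int, int]) -> tuple[float, float]:
    # Hub rooms sit at the midpoint between two adjacent main rooms.
    ax, ay = axial_to_world(*a)
    bx, by = axial_to_world(*b)
    return ((ax + bx) / 2.0, (ay + by) / 2.0)
```

Because the conversion is deterministic, every hub and main room gets a unique, reproducible world position from its grid coordinates alone.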
- Vertical Transport — Spiral Staircases
Between each main room and the levels above/below, the system uses:
A pair of spiral staircases
4 m radius
146.25° rotational span per staircase
1-step offset at top/bottom for perfect alignment
Space allocated for future banisters/railings
These provide predictable, stable navigation regardless of procedural depth.
- Blockout Geometry
The current Blender models include:
Unbaked spiral staircase modifiers for future adaptation
Boolean-cut hubs and corridors with perfect alignment
Seam-free tiling across a 3×3 test grid
Collections are currently split into “Main Room”, “Hub Room”, “Demonstration Pieces”, and “Vertical Demonstration”
These serve as the collisionless hitboxes used for:
AI pathfinding
Coordinate mapping
Seed placement
User traversal
Infinite hex-grid tiling
All detailed/aesthetic models will be layered over this stable blockout.
Inspired in part by Jorge Luis Borges’ “Library of Babel,” AIrchive expands the idea into a navigable, structured hex-world where every coordinate corresponds to deterministic content rather than pure randomness.
Quite some time back I also wrote a "Gallery of Babel" application (https://github.com/Thor110/GOB), which further motivated me to work on this project.
Not to mention that Hexagons are the Bestagons.
Here is a preview of the main and hub tiles.
This is a preview of the layout featuring 7 main rooms, 6 hub rooms and 3 layers.
AIrchive differs from existing preservation efforts such as:
- The Wayback Machine
- GitHub model repos
- LAION datasets
- ArXiv
- UNESCO Memory of the World
Unlike passive archives, AIrchive is an active environment where models can be preserved, run, studied, and re-trained within a structured world, making it both a cultural repository and a behavioural safety mechanism.
It could also be used to recover missing data by having agents search for all content that could ever exist in the Gallery of Babel.
It also aims to serve as a permanent, future-proof backup of all human knowledge that can outlast the Earth itself, given the right conditions.
It is not simply a dataset or archive — it is a world designed for interaction, reinforcement, interpretation, preservation and restoration.
The Library of Babel is usually framed as an impossible, effectively infinite search problem.
Every book, of every possible length, filled with every possible sequence of characters, seems far too large to explore or index.
But this assumption only holds if you treat books as the fundamental unit.
They’re not.
A book is simply a sequence of pages.
And every page is just a fixed-length arrangement of characters.
Key Insight
If any given page could appear in any book, then the true search space is not “all possible books” — it is:
all possible single pages.
Once all possible pages have been generated and evaluated:
- Any book can be constructed from these pages
- Noise pages can be discarded once, globally
- Meaningful pages become reusable primitives
- All multi-page works (books, scripts, code, etc.) are just sequences of verified pages
- Binary data can be encoded as pages as well, expanding this to all digital media
Thus, instead of an impossibly large combinatorial library, we reduce the entire search problem to its smallest meaningful unit:
Search the pages → the books follow automatically.
This transforms an intractable problem into one that is finite, deterministic, and parallelisable, allowing distributed systems to filter the entire space orders of magnitude faster than searching complete books.
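For a sense of scale, the size of the single-page space can be computed directly. The sketch below assumes the page specification given later in this document (10,000 characters over the full Unicode range); the charset count of 1,112,064 valid Unicode scalar values is an approximation. The space is finite and enumerable in principle, though astronomically large, which is why the filtration layers matter.

```python
import math

# Page spec (defined later in this document): 10,000 characters drawn from
# the full Unicode range. 1,112,064 is the number of valid Unicode scalar
# values, used here as an approximation of the 0x0020-0x10FFFF charset.
CHARSET_SIZE = 1_112_064
PAGE_LENGTH = 10_000

# Number of possible pages = CHARSET_SIZE ** PAGE_LENGTH; report its magnitude.
digits = PAGE_LENGTH * math.log10(CHARSET_SIZE)
print(f"The single-page space holds roughly 10^{digits:,.0f} pages")
```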
A Multi-Layer Filtration Framework for Collapsing Possibility Space into Reality Space
Reducing the Library of Babel to a single-page search space solves the combinatorial explosion — but it does not solve the semantic explosion.
Once all possible pages exist, the next challenge is to separate:
- pure noise
- structured but meaningless forms
- plausible fictions
- internally consistent alternate histories
- meaningful but unreal worlds
- reconstructions of real but lost content
- genuinely historical human works
This cannot be accomplished in a single step; instead, the filtration process forms a hierarchical sieve, with each layer removing another ~99.99% of what remains.
This is the AIrchive Filtration Stack:
Filters out all pages that violate basic structure:
- invalid Unicode encodings
- impossible byte sequences
- non-printable noise
- pages that cannot represent text or binary
Removes: ~99.999999999999%
Performed by: pure math / combinatorics.
Pages with structure but without meaning:
- random dictionary-word sequences
- formally valid but nonsensical grammar
- binary pages that do not decode into any valid file type
- meaningless repetition (“cat cat cat cat…”)
Removes: ~99.99% of Layer 1 survivors
Performed by: grammatical parsers, entropy analysis, format validators.
Pages (or sequences of pages) that form meaningful content but not real content:
- stories
- invented languages
- imaginary scientific theories
- fictional people/events
- alternate realities that do not match known history
These are not noise — they are structured possibility-space.
Cannot be removed.
Can only be classified.
Fully consistent histories/worlds that could have happened but did not:
- believable biographies of people who never existed
- realistic political histories of nations that never formed
- alternate scientific revolutions
- plausible timelines branching early in human history
Requires anchoring to known human data (the Museum).
A tiny subset where:
- content matches historical record
- metadata is correct
- style, chronology, references, and context all align
- no contradictions exist with known reality
This is the true Museum corpus.
A smaller but extremely important set:
- works known to exist but physically lost
- works suspected or partially referenced historically
- fragments preserved only indirectly
- destroyed manuscripts
- burned libraries
- erased inscriptions
- corrupted recordings
These appear as partial page matches:
- stylistic fingerprints
- authorial signatures
- linguistic patterns
- chronologically plausible content
Recovered via cross-reference with known sources.
Rare but fascinating:
- fully coherent works that do not match Earth’s history
- but do match Earth’s physics, culture, or human nature
- “possible humanity” rather than “actual humanity”
These are neither fiction nor history — they are adjacent possible worlds.
AIrchive preserves these separately, because they represent meaningful structure.
Because meaning is not binary.
It is not “noise vs truth.”
It is a spectrum that collapses only when compared against reality.
The Museum of Human Creations provides that grounding.
Without the Museum, all layers from 3 upward look equally valid to an AI or a human.
With the Museum, the search space becomes:
- finite
- aligned
- anchored
- reconstructible
- historically verifiable
Each filtration step shrinks the space dramatically, but the stack as a whole is what makes the entire project feasible.
How each layer of the Filtration Stack is evaluated, sorted, and classified.
The filtration layers do not rely on subjective interpretation or AI “judgment.”
Each stage uses fully deterministic, mathematically reproducible rules, ensuring that every page is classified identically by every agent or system.
Below is the proposed high-level methodology for each layer:
These rules are deliberately simple, fast, and fully verifiable, ensuring that classification remains consistent across implementations, agents, and future versions of the system.
Entropy is a measure of information density.
It allows us to classify pages as follows:
- Extremely low-entropy pages → repetition, degenerate sequences → Layer 2 (Non-Semantic).
- Extremely high-entropy pages → incompressible noise → Layer 1 (Symbolic Noise).
- Medium entropy → potentially meaningful → passed upward.
This immediately removes enormous swathes of pages using a single, fast, streaming calculation.
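As a sketch, per-page Shannon entropy takes only a few lines; the `LOW`/`HIGH` thresholds below are illustrative assumptions, not tuned values.

```python
import math
from collections import Counter


def shannon_entropy(page: str) -> float:
    """Bits per character of the page's symbol distribution."""
    counts = Counter(page)
    n = len(page)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Illustrative thresholds (bits/char); the real cut-offs are an open design question.
LOW, HIGH = 1.0, 7.5


def entropy_layer(page: str) -> str:
    h = shannon_entropy(page)
    if h < LOW:
        return "Layer 2: non-semantic (degenerate repetition)"
    if h > HIGH:
        return "Layer 1: symbolic noise"
    return "pass upward"
```

The calculation streams over the page once, so it parallelises trivially across distributed workers.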
Fast structural rules determine whether a page:
- can represent valid UTF-8 text
- can be interpreted as binary media
- can be parsed as a formal document
- contains consistent encoding
- contains valid delimiters or format signatures
Examples:
- Check for valid Unicode codepoint sequences
- Check for binary magic numbers (PNG, ELF, MP3, ZIP, PDF…)
- Detect malformed multibyte sequences
- Validate simple grammar graphs (with tolerances)
If it cannot possibly represent any known human data structure → Layer 1.
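A minimal sketch of such rules follows, assuming a small illustrative subset of magic numbers; a real implementation would use a comprehensive signature table.

```python
# Known binary magic numbers -- a small illustrative subset, not a full table.
MAGIC_NUMBERS = [
    b"\x89PNG\r\n\x1a\n",  # PNG
    b"\x7fELF",            # ELF executable
    b"PK\x03\x04",         # ZIP
    b"%PDF-",              # PDF
    b"ID3",                # MP3 with ID3 tag
]


def passes_structural_rules(raw: bytes) -> bool:
    """Cheap deterministic checks; a page failing all of them falls to Layer 1."""
    # Rule 1: decodes as UTF-8 and contains only printable/whitespace characters.
    try:
        text = raw.decode("utf-8")
        if all(ch.isprintable() or ch.isspace() for ch in text):
            return True
    except UnicodeDecodeError:
        pass
    # Rule 2: starts with a recognised binary format signature.
    return any(raw.startswith(m) for m in MAGIC_NUMBERS)
```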
For binary pages:
- Try decoding as common formats
- Verify headers, length fields, internal checksums
- Confirm structural coherence
For text pages:
- Validate punctuation frequency
- Detect dictionary-word density
- Check against probabilistic language models (purely statistical, e.g., n-gram or Markov models)
If it is syntactically structured but semantically empty → Layer 2.
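The dictionary-word-density idea can be sketched as follows; the tiny wordlist and both thresholds are purely illustrative assumptions.

```python
# A toy wordlist; a real validator would use full dictionaries and
# statistical n-gram language models.
COMMON_WORDS = {"the", "of", "and", "to", "a", "in", "is", "it", "cat"}


def word_density(text: str) -> float:
    """Fraction of whitespace-separated tokens found in the wordlist."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in COMMON_WORDS for t in tokens) / len(tokens)


def is_semantically_empty(text: str) -> bool:
    """Structured (dictionary words) but degenerate (almost no distinct words)."""
    tokens = text.lower().split()
    return word_density(text) > 0.5 and len(set(tokens)) < 3
```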
For pages that contain meaningful content, assign to Layer 3 by identifying:
- internal consistency of narrative
- stable character references
- recurring semantic structures
- invented languages with consistent morphologies
- coherent fictional science or world-rules
This layer is detected entirely through formal consistency, not through comparison with the Museum.
These layers compare a page’s content against verified Museum data:
- Named entities
- Historical timelines
- Geographic plausibility
- Cultural context
- Scientific facts
- Chronological markers
- Stylistic signatures
Matched? → Layer 5 (Real Human Works).
Partially matched? → Layer 4 (Plausible Alternates).
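A toy version of this anchoring comparison, using named-entity overlap; the entity sets and both thresholds are hypothetical placeholders for a much richer comparison.

```python
def anchoring_score(page_entities: set[str], museum_entities: set[str]) -> float:
    """Fraction of a page's named entities found in the verified Museum corpus."""
    if not page_entities:
        return 0.0
    return len(page_entities & museum_entities) / len(page_entities)


def classify_by_anchoring(score: float) -> str:
    # Thresholds are illustrative assumptions, not tuned values.
    if score > 0.95:
        return "Layer 5 candidate: Real Human Work"
    if score > 0.5:
        return "Layer 4: Plausible Alternate"
    return "unanchored (Layer 3 or 7)"
```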
Lost works are identified via:
- partial n-gram overlap
- statistical author fingerprints
- stylistic embeddings
- referenced metadata in verified texts
- chronology overlap
- linguistic drift modelling
If a page resembles a known but missing work → Layer 6.
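Partial n-gram overlap can be sketched with a Jaccard-style score over character n-grams; the n-gram length of 5 is an arbitrary illustrative choice.

```python
def char_ngrams(text: str, n: int = 5) -> set[str]:
    """All character n-grams of the text (n = 5 is an arbitrary choice here)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def fragment_overlap(candidate: str, fragment: str, n: int = 5) -> float:
    """Jaccard overlap between a candidate page and a surviving fragment."""
    a, b = char_ngrams(candidate, n), char_ngrams(fragment, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

High overlap with a quoted or partially preserved fragment would flag a candidate page for Layer 6 review.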
Material that:
- is fully coherent
- fits human behaviour
- fits physical law
- but does not match any known human timeline
Lacks historical anchoring but remains fully coherent? → Layer 7 (Adjacent Realities).
This classification is based on coherence minus historical anchoring.
This section formalizes the core structural units for the Gallery of Babel search space, ensuring that the archive is both infinite in possibility and deterministically addressable (i.e., every book has one and only one address).
The single page is the base unit of all knowledge within the AIrchive's Gallery of Babel.
| Parameter | Specification | Structural Rationale |
|---|---|---|
| Length | 10,000 characters. | This fixed length is perfectly divisible by 8 (1,250 × 8 = 10,000), which keeps binary-encoded pages byte-aligned. |
| Character Set | Full Unicode Range. 0x0020 - 0x10FFFF | To ensure the possibility space contains all human-created language, code, and symbols without artificial constraints. |
| Page Content Hash | Cryptographic SHA-256. | This generates a unique, non-reversible, deterministic fingerprint for the content of any given page. This hash is the page's unique identifier and its fixed position within the universe of all possible pages. |
Fixed-length pages guarantee deterministic indexing and uniform hashing behaviour.
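The page-hash rule above can be sketched directly; hashing the UTF-8 encoding of the page is an assumption, since the spec defines pages in characters rather than bytes.

```python
import hashlib

PAGE_LENGTH = 10_000  # characters, per the table above


def page_hash(page: str) -> str:
    """SHA-256 fingerprint of a fixed-length page.

    Hashing the UTF-8 encoding is an assumption; any documented,
    fixed encoding of the character sequence would serve."""
    assert len(page) == PAGE_LENGTH, "pages are exactly 10,000 characters"
    return hashlib.sha256(page.encode("utf-8")).hexdigest()
```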
A Book is a composite object, composed of a fixed number of ordered pages. Its deterministic address (the Seed) is a function of its structure and the content of its pages.
| Parameter | Specification | Structural Rationale |
|---|---|---|
| Structure Number ($N$) | A single, large integer value that encodes all fixed structural properties of the book: the exact number of pages in the volume; formatting data (e.g., line breaks, paragraph structure, page breaks); metadata indicators (e.g., identifying itself as a novel, a script, or a data log). | This is the Deterministic Structure Component. It defines the "vessel" or "format" of the book, independent of the actual characters on the pages. This allows AI agents to filter and search by format before content. |
| Book Seed Generation | The final, deterministic seed is a hash of the full book definition: `SHA-512(N || PageHash1 || PageHash2 || ...)`. | This creates the Deterministic Content Component. The final Book Seed is a unique, unchangeable, cryptographic fingerprint of the entire book. It serves as the definitive Hexagonal Location address, proving that the content of the book is perfectly reproducible from its address alone. |
The structural flow is:
- Page Generation: A Page Hash (SHA-256) is generated from the 10,000 characters.
- Book Construction: The Structure Number ($N$) is selected (32,767 (int16) maximum possible pages).
- Address Calculation: The Book Seed (SHA-512) is calculated from $N$ and the sequence of Page Hashes (`N || PageHash1 || PageHash2 || ...`).
The Book Seed does not encode the book content directly. It encodes a deterministic traversal path in the Gallery, allowing agents to reconstruct the book from the space, not from the hash.
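The three-step flow above can be sketched directly; the byte encoding chosen for $N$ is an assumption, since the spec fixes only the concatenation order.

```python
import hashlib

MAX_PAGES = 32_767  # int16 ceiling from the spec


def book_seed(structure_number: int, page_hashes: list[bytes]) -> bytes:
    """SHA-512 over N || PageHash1 || PageHash2 || ..., per the flow above.

    The byte encoding of N is an assumption: the spec fixes only the
    concatenation order, so any documented fixed encoding would work."""
    assert 1 <= len(page_hashes) <= MAX_PAGES
    n_bytes = structure_number.to_bytes(
        max(1, (structure_number.bit_length() + 7) // 8), "big"
    )
    return hashlib.sha512(n_bytes + b"".join(page_hashes)).digest()
```

Any change to the structure number or to any page hash yields a different seed, which is what makes the seed usable as a unique address.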
Redefining the search space (single-page insight) → gives you a finite, enumerable possibility space.
Refining the search space (layered filtration) → gives you a pipeline for extracting reality from possibility.
Both together turn an impossible library into:
- a reconstruction engine
- a cultural recovery system
- a universal archive
- a training substrate for grounded AI
This is the epistemic infrastructure underlying AIrchive.
Security is critical.
The environment will require strong isolation, behaviour monitoring, and multi-layered access control.
I propose a completely sandboxed environment in which remote access is only possible through KVM (keyboard, video, mouse) control systems: remote users can only view video streams and directly control peripherals.
This prevents malicious access and contains any escape attempts by the agents inside.
GNU AGPL v3.0
Special thanks to Llama 3.3-307B-Instruct for early refinement of the concept.
Conversation logs: https://hf.co/chat/r/1gJTQ7w?leafId=21fc542d-b68e-42d4-8a9e-723e0d0bef63
The idea was also refined further in discussions with GPT5 and Gemini3
Concept and architecture by Edward James Gordon.
1 — Initial commit
2 — README updates
3 — Minor fixes
4 — Architectural vision expanded (training, preservation, reconstruction)
5 — Search space redefined
6 — Single page search space specifications added
7 — Deterministic filtration methods defined

