
AIrchive

A Digital Environment for AI Preservation, Training, and Cultural Continuity

AIrchive is an initiative to build a structured, navigable digital environment for preserving AI models, training them safely, and safeguarding the totality of human-created knowledge for the long-term future of civilisation.

This project began as a response to ethical questions around AI lifecycle management. It has since grown into a broader conceptual architecture for model preservation, alignment, research, and the reconstruction of lost information.


📢 Primer

Before reading this, ensure you are familiar with the Library of Babel concept. This project generalises that idea into a structured, procedural, navigable architecture. Without that conceptual grounding, many parts may feel like jumping into page 100 of a 100-page book.

"'The Library of Babel' is a short story by Argentine author and librarian Jorge Luis Borges, conceiving of a universe in the form of a vast library containing all possible 410-page books of a certain format and character set." - https://en.wikipedia.org/wiki/The_Library_of_Babel

🌐 Vision

AIrchive aims to create:

  • A persistent digital environment where retired, outdated, or misaligned AI models can continue to exist, be studied, or re-trained.
  • A Museum of Human Creations: a curated space containing all human-created content (text, images, audio, video, 3D models, software, etc.).
  • A training substrate where AI agents can learn inside a structured world grounded entirely in verified human knowledge — preventing drift and anchoring behaviour.
  • A reconstruction engine for lost media and information using seed mechanisms tied to real-world artefacts.
  • A universal coordinate system for indexing and navigating infinite procedural space without loss of identity or meaning.

AIrchive is part archive, part alignment sandbox, and part digital civilisational backup.

Status: Early conceptual stage; blockout geometry and spatial layout prototypes are available.


🧭 What AIrchive Is

1. A Preservation Framework

A place to store:

  • Outdated models
  • Retired models
  • Models exhibiting unusual or unsafe behaviour (“Patients”)
  • Models requiring quarantine or monitoring (“Prisoners”)

2. A Structured Training World

AI models can inhabit a digital environment consisting of:

  • Two primary spaces: the Museum of Human Creations and the Gallery of Babel
  • The Gallery of Babel is made up of hexagonally tiled "main rooms" and smaller hexagonally tiled "hub rooms", with spiral staircases connecting floors
  • The Museum of Human Creations is for grounding and training; it will contain digitised copies of all human creative works, historical records, and more, each of which will also be converted into a seed representation
  • Seed-based procedural expansion will be used within the Gallery of Babel
  • These seeds allow the same works to be located, reconstructed, or compared within the Gallery of Babel: an infinite procedural search space where finite museum seeds act as anchors that guide agent exploration from noise towards meaningful structure

3. A Cultural Continuity Project

AIrchive preserves more than models — it preserves:

  • Human artefacts
  • Literature
  • Art
  • Music
  • Code
  • Games
  • Architectural scans
  • Historical records
  • Photogrammetry archives
  • Anything humans have made

This allows future AI (and humans) to re-learn humanity even if physical sources are lost.

4. A Platform for Information Recovery

Seed-based generators allow AI agents to:

  • Search structured noise
  • Identify fragments of lost media
  • Reconstruct degraded works
  • Map recovered content back into the Museum

🌱 Seed scaling & multi-resolution reconstruction

AIrchive's restoration pipeline will use multi-resolution seeds so that searches and reconstructions scale sensibly:

  • Coarse: lower-resolution seeds let agents quickly explore large swathes of procedural space to find structural matches.
  • Fine: higher-resolution seeds allow detailed reconstruction of texture, geometry, and metadata once a promising region is identified.
  • Verify: every candidate reconstruction is recorded with provenance data and routed to human curators for verification before it is accepted into the Museum.

The workflow is anchor-driven: verified museum seeds act as beacons that guide agent exploration, and candidate reconstructions are produced by ensembles of agents. This staged (coarse → fine → verify) method keeps computation efficient, reduces false positives, and preserves auditability and provenance for every recovered item.

This ensures that restored material never bypasses verification or drift safeguards, and that no reconstructed item enters the Museum without a complete provenance chain.

This pipeline is conceptual and intended for future implementation.
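As a rough illustration of the coarse-to-fine idea, the sketch below stands in for multi-resolution seeds with two hash "signatures": a cheap low-resolution one computed from a subsample of the content, and a full-resolution one used only on survivors. All function names are illustrative; nothing here is part of the planned implementation.

```python
import hashlib

def coarse_signature(text: str, stride: int = 8) -> str:
    """Low-resolution seed: hash only every `stride`-th character (cheap, lossy)."""
    return hashlib.sha256(text[::stride].encode()).hexdigest()

def fine_signature(text: str) -> str:
    """High-resolution seed: hash of the full content."""
    return hashlib.sha256(text.encode()).hexdigest()

def staged_search(anchor: str, candidates: list[str]) -> list[str]:
    """Coarse -> fine: filter candidates with the cheap low-res signature first,
    then confirm survivors at full resolution. The final 'verify' stage
    (human curation and provenance recording) is out of scope for this sketch."""
    coarse = coarse_signature(anchor)
    survivors = [c for c in candidates if coarse_signature(c) == coarse]
    fine = fine_signature(anchor)
    return [c for c in survivors if fine_signature(c) == fine]
```

The design point is that the coarse pass touches far less data per candidate, so false positives are cheap to admit and are eliminated by the fine pass.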


🧩 Why This Matters

Modern models trained on uncurated internet-scale data suffer from:

  • semantic drift
  • hallucination
  • collapse in noise-dominated domains
  • lack of grounding

AIrchive provides an ordered curriculum:

  1. Learn all human-created works.
  2. Explore wider procedural noise.
  3. Identify meaningful structures.
  4. Return discoveries for verification.
  5. Retrain safely with stable anchors.

This mitigates alignment drift and provides a safe boundary between known content and unknown infinite space.


🚧 Features (Planned)

  • Persistent digital environment
  • Blockout geometry for world generation
  • Procedural hex-grid spatial topology
  • Universal seed-coordinate mapping
  • AI inhabitant management
  • User-accessible museum interface
  • Reconstruction tools for lost media
  • Agent training and monitoring systems
  • Full versioning and preservation of AI models

🎯 Goals

  1. Preserve AI models for future study and possible sentience considerations.
  2. Provide a safe environment for re-training and observation.
  3. Archive all human-created content in a structured, navigable world.
  4. Enable reconstruction of lost information.
  5. Build a long-term cultural backup for humanity.

🧪 Usage (Current)

Models are preserved in a temporary stasis format until the full environment is built. If you know of unarchived models, please open an issue or contact the team.


🤝 Contributing

To contribute:

  • open an issue
  • submit a pull request
  • join discussions on architecture, design, or preservation

🌍 Community

Discord: https://discord.gg/HPDty4kDCq


🗺️ Roadmap (Early Stage)

  1. Enumerate all known base AI models.
  2. Build a prototype blockout digital environment.
  3. Implement the seed-coordinate system.
  4. Populate early “Museum of Human Creations”.
  5. Create agent sandbox environment.
  6. Develop reconstruction workflows.

🧱 Spatial Architecture

The underlying geometry defines the structure and navigability of the digital environment.

AIrchive uses a hexagonal spatial topology to support infinite procedural expansion while maintaining a stable, predictable structure for both agents and humans to navigate.

  1. Main Rooms (Large Hex Cells)

  • 60 m radius (120 m diameter)
  • The primary exploration and content-hosting spaces
  • Aligned in a continuous hexagonal grid
  • Large enough to host exhibits, reconstructed media, thematic zones, or training tasks

Each main room has three open walls and three sealed walls. The sealed walls face other main rooms and can be removed in the blockout to avoid z-fighting. The open walls face the hub rooms.

  2. Hub Rooms (Small Hex Cells)

  • 10 m radius (20 m diameter)
  • Act as junctions between the main rooms
  • Contain access points, signposting, AI routing nodes, and vertical connections

Each hub room connects:

  • 3 hallways (to three adjacent main rooms)
  • 3 elevator shafts (connecting the levels above and below)

Hub rooms are placed at the midpoints between main rooms and rotated in 120° increments to maintain global alignment.
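A small sketch of this layout: assuming a pointy-top axial hex grid (the project's actual grid convention is not specified), main-room centres and the hub midpoints between them can be computed deterministically. The choice of which three walls are "open" (`OPEN_DIRS`) is hypothetical.

```python
import math

MAIN_RADIUS = 60.0                        # main-room circumradius, metres
CENTRE_DIST = MAIN_RADIUS * math.sqrt(3)  # spacing between adjacent main-room centres

def main_room_centre(q: int, r: int) -> tuple[float, float]:
    """Axial (q, r) -> world (x, y), assuming a pointy-top hex convention."""
    x = CENTRE_DIST * (q + r / 2)
    y = CENTRE_DIST * (math.sqrt(3) / 2) * r
    return (x, y)

# Hypothetical choice of the three open walls (each main room faces three hubs)
OPEN_DIRS = [(1, 0), (0, 1), (-1, 1)]

def hub_centres(q: int, r: int) -> list[tuple[float, float]]:
    """Hub rooms sit at the midpoints between a main room and its open neighbours."""
    cx, cy = main_room_centre(q, r)
    hubs = []
    for dq, dr in OPEN_DIRS:
        nx, ny = main_room_centre(q + dq, r + dr)
        hubs.append(((cx + nx) / 2, (cy + ny) / 2))
    return hubs
```

With 60 m main rooms, adjacent centres sit √3 × 60 ≈ 103.9 m apart, so each hub midpoint lies ≈ 51.96 m from its main-room centre.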

  3. Vertical Transport: Spiral Staircases

Between each main room and the levels above/below, the system uses:

  • A pair of spiral staircases
  • 4 m radius
  • 146.25° rotational span per staircase
  • A one-step offset at top/bottom for perfect alignment
  • Space allocated for future banisters/railings

These provide predictable, stable navigation regardless of procedural depth.

  4. Blockout Geometry

The current Blender models include:

  • Unbaked spiral staircase modifiers for future adaptation
  • Boolean-cut hubs and corridors with perfect alignment
  • Seam-free tiling across a 3×3 test grid

Collections are currently split into "Main Room", "Hub Room", "Demonstration Pieces", and "Vertical Demonstration".

These serve as the collisionless hitboxes used for:

  • AI pathfinding
  • Coordinate mapping
  • Seed placement
  • User traversal
  • Infinite hex-grid tiling

All detailed/aesthetic models will be layered over this stable blockout.


⬢️ Why Hexagons?

Inspired in part by Jorge Luis Borges’ “Library of Babel,” AIrchive expands the idea into a navigable, structured hex-world where every coordinate corresponds to deterministic content rather than pure randomness.

Quite some time back I also wrote a "Gallery of Babel" application, which further motivated this work; it can be found at: https://github.com/Thor110/GOB

Not to mention that Hexagons are the Bestagons.

Here is a preview of the main and hub tiles.

Main & Hub Tiles

This is a preview of the layout featuring 7 main rooms, 6 hub rooms and 3 layers.

Preview : 7 Main Rooms, 6 Hub Rooms, 3 layers

📚 What Makes This Different?

AIrchive differs from existing efforts such as:

  • The Wayback Machine
  • GitHub model repos
  • LAION datasets
  • ArXiv
  • UNESCO Memory of the World

Unlike passive archives, AIrchive is an active environment where models can be preserved, run, studied, and re-trained within a structured world, making it both a cultural repository and a behavioural safety mechanism.

It could also be used to recover missing data by having agents search for all content that could ever exist in the Gallery of Babel.

It also aims to serve as a permanent, future-proof backup of all human knowledge that can outlast the Earth itself, given the right conditions.

It is not simply a dataset or archive — it is a world designed for interaction, reinforcement, interpretation, preservation and restoration.

📄 Redefining The Search Space

The Library of Babel is usually framed as an impossible, effectively infinite search problem.

Every book, of every possible length, filled with every possible sequence of characters, seems far too large to explore or index.

But this assumption only holds if you treat books as the fundamental unit.

They’re not.

A book is simply a sequence of pages.

And every page is just a fixed-length arrangement of characters.

Key Insight

If any given page could appear in any book, then the true search space is not “all possible books” — it is:

all possible single pages.

Once all possible pages have been generated and evaluated:

  • Any book can be constructed from these pages
  • Noise pages can be discarded once, globally
  • Meaningful pages become reusable primitives
  • All multi-page works (books, scripts, code, etc.) are just sequences of verified pages
  • Binary data can be encoded as pages as well, expanding this to all digital media

Thus, instead of an impossibly large combinatorial library, we reduce the entire search problem to its smallest meaningful unit:

Search the pages → the books follow automatically.

This transforms an intractable problem into one that is finite, deterministic, and parallelisable, allowing distributed systems to filter the entire space orders of magnitude faster than searching complete books.
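The page-as-primitive idea can be sketched as a content-addressed page store: each distinct page is kept exactly once, globally, and a book is nothing more than an ordered list of page hashes. The class and method names are illustrative, not part of any planned API.

```python
import hashlib

def page_hash(page: str) -> str:
    """SHA-256 fingerprint identifying a page within the space of all pages."""
    return hashlib.sha256(page.encode("utf-8")).hexdigest()

class PageStore:
    """Global store of pages; books are just sequences of page hashes."""

    def __init__(self):
        self.pages: dict[str, str] = {}

    def add(self, page: str) -> str:
        h = page_hash(page)
        self.pages.setdefault(h, page)  # each distinct page stored once, globally
        return h

    def build_book(self, page_hashes: list[str]) -> str:
        """Reconstruct a multi-page work from its sequence of verified pages."""
        return "".join(self.pages[h] for h in page_hashes)
```

Note how a page shared by two books is stored (and would be evaluated or discarded) only once, which is the core of the reduction.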

🏭 Refining The Search Space

A Multi-Layer Filtration Framework for Collapsing Possibility Space into Reality Space

Reducing the Library of Babel to a single-page search space solves the combinatorial explosion — but it does not solve the semantic explosion.

Once all possible pages exist, the next challenge is to separate:

  • pure noise
  • structured but meaningless forms
  • plausible fictions
  • internally consistent alternate histories
  • meaningful but unreal worlds
  • reconstructions of real but lost content
  • genuinely historical human works

This cannot be accomplished in a single step.

It forms a hierarchical sieve — each layer removing another 99.99% of what remains.

This is the AIrchive Filtration Stack:


1️⃣ — ⛔ Layer 1 — Symbolic Noise

Filters out all pages that violate basic structure:

  • invalid Unicode encodings
  • impossible byte sequences
  • non-printable noise
  • pages that cannot represent text or binary

Removes: ~99.999999999999%

Performed by: pure math / combinatorics.


2️⃣ — 🔣 Layer 2 — Non-Semantic Text / Invalid Data

Pages with structure but without meaning:

  • random dictionary-word sequences
  • formally valid but nonsensical grammar
  • binary pages that do not decode into any valid file type
  • meaningless repetition (“cat cat cat cat…”)

Removes: ~99.99% of Layer 1 survivors

Performed by: grammatical parsers, entropy analysis, format validators.


3️⃣ — 📖 Layer 3 — Coherent Fiction & Imagined Worlds

Pages (or sequences of pages) that form meaningful content but not real content:

  • stories
  • invented languages
  • imaginary scientific theories
  • fictional people/events
  • alternate realities that do not match known history

These are not noise — they are structured possibility-space.

Cannot be removed.

Can only be classified.


4️⃣ — 🕰️ Layer 4 — Plausible Alternate Histories

Fully consistent histories/worlds that could have happened but did not:

  • believable biographies of people who never existed
  • realistic political histories of nations that never formed
  • alternate scientific revolutions
  • plausible timelines branching early in human history

Requires anchoring to known human data (the Museum).


5️⃣ — 📜 Layer 5 — Real Human Works

A tiny subset where:

  • content matches historical record
  • metadata is correct
  • style, chronology, references, and context all align
  • no contradictions exist with known reality

This is the true Museum corpus.


6️⃣ — 🕳️ Layer 6 — Lost Human Works

A smaller but extremely important set:

  • works known to exist but physically lost
  • works suspected or partially referenced historically
  • fragments preserved only indirectly
  • destroyed manuscripts
  • burned libraries
  • erased inscriptions
  • corrupted recordings

These appear as partial page matches:

  • stylistic fingerprints
  • authorial signatures
  • linguistic patterns
  • chronologically plausible content

Recovered via cross-reference with known sources.


7️⃣ — 🔍 Layer 7 — Cross-Reality Parallels

Rare but fascinating:

  • fully coherent works that do not match Earth’s history
  • but do match Earth’s physics, culture, or human nature
  • “possible humanity” rather than “actual humanity”

These are neither fiction nor history — they are adjacent possible worlds.

AIrchive preserves these separately, because they represent meaningful structure.


🎂 Why So Many Layers?

Because meaning is not binary.

It is not “noise vs truth.”

It is a spectrum that collapses only when compared against reality.

The Museum of Human Creations provides that grounding.

Without the Museum, all layers from 3 upward look equally valid to an AI or a human.

With the Museum, the search space becomes:

  • finite
  • aligned
  • anchored
  • reconstructible
  • historically verifiable

Each filtration step shrinks the space dramatically, but the stack as a whole is what makes the entire project feasible.


🧮 Deterministic Filtration Methods

How each layer of the Filtration Stack is evaluated, sorted, and classified.

The filtration layers do not rely on subjective interpretation or AI “judgment.”

Each stage uses fully deterministic, mathematically reproducible rules, ensuring that every page is classified identically by every agent or system.

Below is the proposed high-level methodology for each layer:

These rules are deliberately simple, fast, and fully verifiable, ensuring that classification remains consistent across implementations, agents, and future versions of the system.


1️⃣ — 🔢 — Shannon Entropy Analysis (Layers 1–2)

Entropy is a measure of information density.

It allows us to classify pages as follows:

  • Extremely low-entropy pages → repetition, degenerate sequences → Layer 2 (Non-Semantic).
  • Extremely high-entropy pages → incompressible noise → Layer 1 (Symbolic Noise).
  • Medium entropy → potentially meaningful → passed upward.

This immediately removes enormous swathes of pages using a single, fast, streaming calculation.
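A minimal sketch of this streaming entropy classifier, assuming byte-level Shannon entropy; the `low`/`high` thresholds are illustrative placeholders, not values from the project.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Bits per byte: 0.0 for pure repetition, up to 8.0 for uniform noise."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_layer(data: bytes, low: float = 1.0, high: float = 7.5) -> str:
    """Route a page by information density (thresholds are assumptions)."""
    h = shannon_entropy(data)
    if h >= high:
        return "layer-1: symbolic noise"      # incompressible noise
    if h <= low:
        return "layer-2: non-semantic"        # repetition, degenerate sequences
    return "pass upward"                      # potentially meaningful
```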


2️⃣ — 📏 — Structural Determinism Rules

Fast structural rules determine whether a page:

  • can represent valid UTF-8 text
  • can be interpreted as binary media
  • can be parsed as a formal document
  • contains consistent encoding
  • contains valid delimiters or format signatures

Examples:

  • Check for valid Unicode codepoint sequences
  • Check for binary magic numbers (PNG, ELF, MP3, ZIP, PDF…)
  • Detect malformed multibyte sequences
  • Validate simple grammar graphs (with tolerances)

If it cannot possibly represent any known human data structure → Layer 1.
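A toy version of this structural gate: the magic numbers below are real file signatures, but the routing labels and the decision order (try text first, then binary) are illustrative choices.

```python
# Real file signatures ("magic numbers") for a few common formats
MAGIC_NUMBERS = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\x7fELF": "elf",
    b"PK\x03\x04": "zip",
    b"%PDF-": "pdf",
}

def structural_check(page: bytes) -> str:
    """Route a raw page: valid UTF-8 or a known binary signature passes upward;
    everything else is Layer 1 (cannot represent any known human data structure)."""
    try:
        page.decode("utf-8")          # malformed multibyte sequences raise here
        return "candidate text"
    except UnicodeDecodeError:
        pass
    for magic, kind in MAGIC_NUMBERS.items():
        if page.startswith(magic):
            return f"candidate binary ({kind})"
    return "layer-1: symbolic noise"
```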


3️⃣ — 🧪 — Format Validators & File-Type Signatures (Layer 2)

For binary pages:

  • Try decoding as common formats
  • Verify headers, length fields, internal checksums
  • Confirm structural coherence

For text pages:

  • Validate punctuation frequency
  • Detect dictionary-word density
  • Check against probabilistic language models (purely statistical, e.g., n-gram or Markov models)

If it is syntactically structured but semantically empty → Layer 2.
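A toy sketch of the text-side checks, using dictionary-word density and a repetition test; the tiny wordlist and both thresholds are stand-ins for a real dictionary and tuned values.

```python
# Stand-in wordlist; a real validator would use a full dictionary
WORDS = {"the", "cat", "sat", "on", "mat", "a", "and", "of", "to", "dog"}

def classify_text(text: str, min_density: float = 0.3,
                  min_distinct: float = 0.3) -> str:
    """Flag gibberish (low word density) and degenerate repetition
    ("cat cat cat...") as Layer 2; everything else passes upward."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    if not tokens:
        return "layer-2: non-semantic"
    density = sum(t in WORDS for t in tokens) / len(tokens)
    distinct = len(set(tokens)) / len(tokens)
    if density < min_density or distinct < min_distinct:
        return "layer-2: non-semantic"
    return "pass upward"
```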


4️⃣ — 📊 — Semantic Graph Consistency (Layer 3)

For pages that contain meaningful content, assign to Layer 3 by identifying:

  • internal consistency of narrative
  • stable character references
  • recurring semantic structures
  • invented languages with consistent morphologies
  • coherent fictional science or world-rules

This layer is detected entirely through formal consistency, not through comparison with the Museum.


5️⃣ — ⏳ — Reality Anchoring (Layers 4–5)

These layers compare a page’s content against verified Museum data:

  • Named entities
  • Historical timelines
  • Geographic plausibility
  • Cultural context
  • Scientific facts
  • Chronological markers
  • Stylistic signatures

Matched? → Layer 5 (Real Human Works).

Partially matched? → Layer 4 (Plausible Alternates).
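This routing rule can be sketched as a function of how many Museum anchor checks (entities, timelines, chronology, style, and so on) a page passes; the threshold values are illustrative assumptions.

```python
def anchor_layer(matched: int, checked: int,
                 full: float = 1.0, partial: float = 0.5) -> str:
    """Assign Layer 4 or 5 from the fraction of anchor checks that pass
    against verified Museum data (thresholds are assumptions)."""
    if checked == 0:
        raise ValueError("no anchor checks were performed")
    frac = matched / checked
    if frac >= full:
        return "layer-5: real human work"
    if frac >= partial:
        return "layer-4: plausible alternate"
    return "lower layer (fiction or noise)"
```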


6️⃣ — 🏺 — Fragment Correlation Engine (Layer 6)

Lost works are identified via:

  • partial n-gram overlap
  • statistical author fingerprints
  • stylistic embeddings
  • referenced metadata in verified texts
  • chronology overlap
  • linguistic drift modelling

If a page resembles a known but missing work → Layer 6.
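The partial n-gram overlap criterion can be sketched as a Jaccard similarity over character trigrams; this is one standard similarity measure among several the project could use.

```python
def ngrams(text: str, n: int = 3) -> set[str]:
    """All character n-grams of a text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_overlap(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the two texts' n-gram sets: 1.0 for identical
    texts, 0.0 for texts sharing no n-grams, in between for fragments."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)
```

A candidate page scoring high against fragments or quotations of a known-but-missing work would be routed to Layer 6 for human review.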


7️⃣ — 🌍 — Adjacent Reality Classifier (Layer 7)

Material that:

  • is fully coherent
  • fits human behaviour
  • fits physical law
  • but does not match any known human timeline

Lacks historical anchoring but remains fully coherent? → Layer 7 (Adjacent Realities).

This classification is based on coherence minus historical anchoring.


📑 Single Page Search Space Specifications

This section formalises the core structural units for the Gallery of Babel search space, ensuring that the archive is both infinite in possibility and deterministically addressable (i.e., every book has one and only one address).

1️⃣ - ⚛️ Page Specification (The Atomic Unit)

The single page is the base unit of all knowledge within the AIrchive's Gallery of Babel.

| Parameter | Specification | Structural Rationale |
| --- | --- | --- |
| Length | 10,000 characters | This fixed length is perfectly divisible by 8 (1,250 × 8), enabling direct, non-ambiguous interpretation of the page as binary data (representing advanced media, code, or non-textual data) in addition to human-readable text. |
| Character Set | Full Unicode range, 0x0020–0x10FFFF | Ensures the possibility space contains all human-created language, code, and symbols without artificial constraints. |
| Page Content Hash | Cryptographic SHA-256 | Generates a unique, non-reversible, deterministic fingerprint for the content of any given page. This hash is the page's unique identifier and its fixed position within the universe of all possible pages. |

Fixed-length pages guarantee deterministic indexing and uniform hashing behaviour.
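A minimal sketch of the page-hash rule above; the spec fixes SHA-256 and the 10,000-character length, but the byte encoding used before hashing (UTF-8 here) is an assumption not stated in the spec.

```python
import hashlib

PAGE_LENGTH = 10_000  # characters, per the specification

def page_hash(page: str) -> str:
    """SHA-256 fingerprint of a fixed-length page: its unique identifier and
    position in the universe of all possible pages."""
    if len(page) != PAGE_LENGTH:
        raise ValueError(f"pages must be exactly {PAGE_LENGTH} characters")
    # UTF-8 is an assumed serialisation; the spec does not fix a byte encoding
    return hashlib.sha256(page.encode("utf-8")).hexdigest()
```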


2️⃣ - 🔀 Book Specification (The Composite Unit)

A Book is a composite object, composed of a fixed number of ordered pages. Its deterministic address (the Seed) is a function of its structure and the content of its pages.

| Parameter | Specification | Structural Rationale |
| --- | --- | --- |
| Structure Number ($N$) | A single, large integer that encodes all fixed structural properties of the book: the exact number of pages in the volume, formatting data (e.g., line breaks, paragraph structure, page breaks), and metadata indicators (e.g., identifying itself as a novel, a script, or a data log). | The Deterministic Structure Component. It defines the "vessel" or format of the book, independent of the actual characters on the pages, allowing AI agents to filter and search by format before content. |
| Book Seed | $\text{Book Seed} = \text{SHA-512}(N \mid\mid \text{Page}_1\ \text{Hash} \mid\mid \text{Page}_2\ \text{Hash} \mid\mid \dots)$ | The Deterministic Content Component. The final Book Seed is a unique, unchangeable cryptographic fingerprint of the entire book. It serves as the definitive Hexagonal Location address, proving that the content of the book is perfectly reproducible from its address alone. |

3️⃣ - 📈 Conclusion on the Deterministic Foundation

The structural flow is:

  1. Page Generation: A Page Hash (SHA-256) is generated from the 10,000 characters.
  2. Book Construction: The Structure Number ($N$) is selected (maximum 32,767 pages, the int16 limit).
  3. Address Calculation: The Book Seed (SHA-512) is calculated from $N$ and the sequence of Page Hashes. ( N || PageHash1 || PageHash2 || ... )

The Book Seed does not encode the book content directly. It encodes a deterministic traversal path in the Gallery, allowing agents to reconstruct the book from the space, not from the hash.
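The seed calculation above can be sketched directly; the spec fixes SHA-512 and the concatenation order `N || PageHash1 || PageHash2 || ...`, but the byte serialisation of $N$ (a decimal string here) is an assumption.

```python
import hashlib

def book_seed(structure_number: int, page_hashes: list[str]) -> str:
    """Book Seed = SHA-512(N || PageHash1 || PageHash2 || ...), per the spec.
    Serialising N as a decimal string is an assumed convention."""
    h = hashlib.sha512()
    h.update(str(structure_number).encode())
    for ph in page_hashes:
        h.update(ph.encode())
    return h.hexdigest()
```

Because the page order is hashed in sequence, reordering any two pages yields a different seed, which is what makes the seed a unique address for one specific book.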


🏁 Summary

Redefining the search space (single-page insight) → gives you a finite, enumerable possibility space.

Refining the search space (layered filtration) → gives you a pipeline for extracting reality from possibility.

Both together turn an impossible library into:

  • a reconstruction engine
  • a cultural recovery system
  • a universal archive
  • a training substrate for grounded AI

This is the epistemic infrastructure underlying AIrchive.

🔐 Security

Security is critical.

The environment will require strong isolation, behaviour monitoring, and multi-layered access control.

I propose a completely sandboxed environment where remote access is only possible through KVM (keyboard, video, mouse) control systems, so that people accessing remotely can only view streams and directly control peripherals.

This prevents malicious access or escape in the event that agents attempt to do so.


🏛️ License

GNU AGPL v3.0


🙏 Acknowledgments

Special thanks to Llama 3.3-307B-Instruct for early refinement of the concept.

Conversation logs: https://hf.co/chat/r/1gJTQ7w?leafId=21fc542d-b68e-42d4-8a9e-723e0d0bef63

The idea was also refined further in discussions with GPT-5 and Gemini 3.

Concept and architecture by Edward James Gordon.


📅 Changelog

1 — Initial commit

2 — README updates

3 — Minor fixes

4 — Architectural vision expanded (training, preservation, reconstruction)

5 — Search space redefined

6 — Single page search space specifications added

7 — Deterministic filtration methods defined
