SHA888/beyond-chatbot

beyond-chatbot

LLMs are one branch of a much larger tree. This repo is a growing knowledge graph of the statistics, data science, and AI that lives beyond — and beneath — the chatbot.

The idea

Every loss function, every optimizer, every probabilistic primitive, every classical ML method, every signal-processing trick a modern chatbot stands on top of has a name, a history, and a place in a larger map. Most of that map is invisible in the day-to-day discourse around LLMs.

This project makes the map navigable, one atomic node at a time:

  • A node is a single concept — an algorithm, model class, framework, method, system, or mathematical construct.
  • Each node has a one-sentence technical descriptor and a list of concrete real-world deployments. No hand-waving.
  • Edges between nodes are deliberately omitted. You draw the graph. Different consumers want different topologies — taxonomic, historical, dependency, pedagogical — and a fixed edge set would foreclose that.
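Because edges are left to the consumer, one way to picture a taxonomic topology is to derive it from the branch field alone. The following is a minimal TypeScript sketch, not code from this repo: the ConceptNode and DerivedEdge shapes, the "branch:" hub prefix, the "contains" edge type, and the sample ids and branch slugs are all assumptions made for illustration.

```typescript
// Hypothetical shapes: only the fields needed for a taxonomic view.
interface ConceptNode {
  id: string;
  branch: string;
}

interface DerivedEdge {
  source: string;
  target: string;
  type: string;
}

// Link one synthetic hub per branch to each of its member nodes.
function taxonomicEdges(nodes: ConceptNode[]): DerivedEdge[] {
  return nodes.map((n) => ({
    source: `branch:${n.branch}`,
    target: n.id,
    type: "contains",
  }));
}

// Illustrative input; these ids and branch slugs are not taken from ai-nodes.yaml.
const derived = taxonomicEdges([
  { id: "q-learning", branch: "reinforcement-learning" },
  { id: "expert-system", branch: "symbolic-ai" },
]);
```

A historical or dependency view would be a different function over the same nodes, which is the point of not shipping a fixed edge set.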

What's in the repo today

File                     What it is
ai-nodes.yaml            The catalog. 165 nodes across 11 branches, from symbolic AI through reinforcement learning, with statistics, optimization, information theory, and signal processing as substrates.
dist/data/graph.json     Browser artifact (generated by build). Immutable snapshot of nodes + edges consumed by the visualization.
schema/graph.cypher      LadybugDB schema (Node and Edge tables).
scripts/import-yaml.ts   YAML → LadybugDB importer. Validates all nodes and edges.
scripts/export-json.ts   LadybugDB → JSON exporter. Produces deterministic dist/data/graph.json.
scripts/export-yaml.ts   LadybugDB → YAML exporter. Proves the round trip: import → export → diff shows no semantic difference.
graph.html               Cytoscape.js visualization (WIP: will fetch dist/data/graph.json instead of YAML at runtime).

The catalog is the substance; the build pipeline (YAML → LadybugDB → JSON) is the machinery.

Building the project

Requirements

  • Node.js ≥14.15.0
  • pnpm ≥8.0.0 (enforced via .npmrc)

Build steps

# Install dependencies
pnpm install

# Compile TypeScript and generate dist/data/graph.json
pnpm run build

What the build does:

  1. Compile TypeScript → JavaScript (scripts/, test/)
  2. Import ai-nodes.yaml + ai-edges.yaml → .ladybugdb/ (embedded columnar DB)
  3. Export LadybugDB → dist/data/graph.json (browser artifact)

Output artifacts:

  • dist/data/graph.json — nodes (with degree), edges, metadata; ~1.3 KB
  • .ladybugdb/ — transient database; gitignored
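The artifact stores a degree with each node, but this README does not define how it is computed. As a rough sketch under the assumption that degree means the undirected count of edge endpoints touching a node, it could be derived as follows. EdgeRecord and computeDegrees are hypothetical names, not the exporter's actual API.

```typescript
// Hypothetical edge shape: only the endpoints matter for degree.
interface EdgeRecord {
  source: string;
  target: string;
}

// Assumption: degree = number of edge endpoints touching the node (undirected),
// so a self-loop contributes 2.
function computeDegrees(nodeIds: string[], edges: EdgeRecord[]): Record<string, number> {
  // Null-prototype object so ids such as "constructor" cannot collide with inherited keys.
  const degree: Record<string, number> = Object.create(null);
  for (const id of nodeIds) degree[id] = 0; // isolated nodes still appear, with degree 0
  for (const edge of edges) {
    for (const endpoint of [edge.source, edge.target]) {
      // Skip endpoints that are not known node ids.
      if (endpoint in degree) degree[endpoint] = (degree[endpoint] ?? 0) + 1;
    }
  }
  return degree;
}

// Placeholder ids, chosen only to show the counting.
const degrees = computeDegrees(
  ["a", "b", "c", "d"],
  [
    { source: "a", target: "b" },
    { source: "a", target: "c" },
  ],
);
```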

Performance: <30 seconds from clean checkout.

Development commands

pnpm test             # Run test suite (46 tests: schema, importer, round-trip validation)
pnpm test:watch       # Watch mode
pnpm test:coverage    # Coverage report (v8 provider)
pnpm run import       # Just YAML → LadybugDB
pnpm run export       # Just LadybugDB → JSON
pnpm run export:yaml  # LadybugDB → YAML (round-trip validation)

The build pipeline

YAML → LadybugDB → JSON

  1. YAML (ai-nodes.yaml, ai-edges.yaml) is the human-editable source of truth. All changes land here.
  2. LadybugDB (embedded columnar graph DB) is the canonical storage. Validation happens at import time.
  3. JSON (dist/data/graph.json) is the immutable artifact consumed by the browser. Byte-stable for identical input; sorted deterministically (nodes by ID, edges by source→target→type).
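The byte-stability claim in step 3 rests on two things: a total order over nodes and edges, and a fixed key order when serializing. The sketch below illustrates that idea; it is not the code in scripts/export-json.ts. The GraphNode and GraphEdge shapes, the serializeGraph function, the "related" edge type, and the two-space indent are assumptions.

```typescript
// Hypothetical shapes: a real node carries more fields than id and name.
interface GraphNode {
  id: string;
  name: string;
}

interface GraphEdge {
  source: string;
  target: string;
  type: string;
}

// Compare by code unit rather than by locale, so the order is the same on every machine.
const cmp = (x: string, y: string): number => (x < y ? -1 : x > y ? 1 : 0);

function serializeGraph(nodes: GraphNode[], edges: GraphEdge[]): string {
  // Copy before sorting so the caller's arrays are left untouched.
  const sortedNodes = [...nodes].sort((p, q) => cmp(p.id, q.id));
  const sortedEdges = [...edges].sort(
    (p, q) => cmp(p.source, q.source) || cmp(p.target, q.target) || cmp(p.type, q.type),
  );
  // Rebuild each object with a fixed key order so JSON.stringify emits keys consistently.
  return JSON.stringify(
    {
      nodes: sortedNodes.map((n) => ({ id: n.id, name: n.name })),
      edges: sortedEdges.map((e) => ({ source: e.source, target: e.target, type: e.type })),
    },
    null,
    2,
  );
}

// The same content in a different input order should yield identical output.
const first = serializeGraph(
  [
    { id: "svm", name: "Support Vector Machine" },
    { id: "adam", name: "Adam" },
  ],
  [
    { source: "svm", target: "adam", type: "related" },
    { source: "adam", target: "svm", type: "related" },
  ],
);
const second = serializeGraph(
  [
    { name: "Adam", id: "adam" },
    { name: "Support Vector Machine", id: "svm" },
  ],
  [
    { type: "related", target: "svm", source: "adam" },
    { type: "related", target: "adam", source: "svm" },
  ],
);
```

Comparing first and second with === is then a cheap regression check for determinism.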

See docs/spec/00-project-spec.md for the full spec (§3: Storage Model, §4: Pipeline Contracts).

The node schema

The full schema is documented in the header of ai-nodes.yaml. Briefly:

Field        Notes
id           kebab-case slug, stable identifier
name         canonical display name
aliases      (optional) common alternative names
branch       one of 11 fixed groupings (see file header)
type         algorithm · method · model-class · system · framework · math-construct
era          year or decade of significant introduction
status       foundational · active · legacy · emerging · dormant
descriptor   one technical sentence
anchors      list of concrete, verifiable real-world uses
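To make the field table concrete, here is a hedged sketch of the checks it implies. validateNode is a hypothetical helper, not the repo's importer. The rule that anchors must be non-empty is an assumption, the branch check is limited to "non-empty string" because the 11 branch names are listed only in the ai-nodes.yaml header, and the sample node is illustrative rather than copied from the catalog.

```typescript
// Allowed values copied from the field table.
const NODE_TYPES = ["algorithm", "method", "model-class", "system", "framework", "math-construct"];
const NODE_STATUSES = ["foundational", "active", "legacy", "emerging", "dormant"];
const KEBAB_CASE = /^[a-z0-9]+(-[a-z0-9]+)*$/;

function isNonEmptyString(value: unknown): value is string {
  return typeof value === "string" && value.trim().length > 0;
}

// Hypothetical helper: returns a list of problems, empty when the node looks well-formed.
function validateNode(raw: Record<string, unknown>): string[] {
  const errors: string[] = [];

  const id = raw["id"];
  if (!isNonEmptyString(id) || !KEBAB_CASE.test(id)) errors.push("id must be a kebab-case slug");

  for (const field of ["name", "branch", "descriptor"]) {
    if (!isNonEmptyString(raw[field])) errors.push(`${field} must be a non-empty string`);
  }

  const nodeType = raw["type"];
  if (typeof nodeType !== "string" || NODE_TYPES.indexOf(nodeType) === -1) {
    errors.push("type is not an allowed value");
  }

  const nodeStatus = raw["status"];
  if (typeof nodeStatus !== "string" || NODE_STATUSES.indexOf(nodeStatus) === -1) {
    errors.push("status is not an allowed value");
  }

  // A year may parse from YAML as a number, a decade as a string such as "1990s".
  const era = raw["era"];
  if (typeof era !== "string" && typeof era !== "number") errors.push("era must be a year or decade");

  // Assumption: at least one anchor is required.
  const anchors = raw["anchors"];
  if (!Array.isArray(anchors) || anchors.length === 0 || !anchors.every((a) => isNonEmptyString(a))) {
    errors.push("anchors must be a non-empty list of strings");
  }

  const aliases = raw["aliases"];
  if (aliases !== undefined && !(Array.isArray(aliases) && aliases.every((a) => isNonEmptyString(a)))) {
    errors.push("aliases, when present, must be a list of strings");
  }

  return errors;
}

// Illustrative node; the id and branch slug are assumptions, not catalog entries.
const sampleNode = {
  id: "adam-optimizer",
  name: "Adam",
  branch: "optimization",
  type: "algorithm",
  era: 2014,
  status: "active",
  descriptor:
    "Adaptive first-order optimizer that scales each parameter's step using running estimates of the gradient's first and second moments.",
  anchors: ["torch.optim.Adam in PyTorch"],
};
const sampleErrors = validateNode(sampleNode);
```

Returning a list of messages rather than throwing on the first problem lets a contributor fix every issue in a node in one pass.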

Contributing

Adding or improving nodes

PRs are welcome. The bar for node inclusion:

  1. Atomic. One concept per node. If you find yourself writing "…and also…", it's probably two nodes.
  2. Concrete anchors. Named systems, products, papers, or deployments — not "used in industry" or "widely applied".
  3. One-sentence descriptor. Resist the urge to expand.
  4. Existing branches only. If a concept doesn't obviously fit, it belongs in cross-cutting, not a new branch.
  5. YAML must parse. Quick check:
    python3 -c "import yaml; yaml.safe_load(open('ai-nodes.yaml'))"

Good PRs to open:

  • Missing nodes within an existing branch.
  • Better anchors for an existing node (more concrete, more verifiable).
  • Tightening a descriptor that has drifted into two sentences.
  • New aliases that people actually use in the wild.

Contributing edges (Phase 2)

Not yet open. Edges are populated via a three-stage pipeline (bulk-seed from Wikidata → LLM-densify → human curation). Details will be documented in Phase 2.

Structural changes

Open an issue first if you want to argue for:

  • A new branch
  • A schema change
  • Removing or merging an existing node

Status

Version 0.1.0. Early, growing, opinionated about conciseness.

License

Dual:

  • Content: ai-nodes.yaml and any future data / docs are licensed under CC BY 4.0. See LICENSE-CC-BY-4.0. Use and adapt freely with attribution.
  • Code: all code in this repo (for example scripts/ and test/) is licensed under MIT. See LICENSE-MIT.
