genesis


In the beginning, there was no transformer. No attention mechanism. No skip connections. No architecture papers. There were only random neurons, a dataset, and the pressure to predict the next token. 25 generations later, evolution had built its own architecture from scratch — and it looked nothing like what we expected. — @MimiTechAI, March 2026.

The idea

Take a pool of random neural architectures — no templates, no human priors — and let evolution discover what works for language modeling. Each architecture trains for 30 seconds on the same task. The best survive, mutate, reproduce. The question: will evolution reinvent attention? Or will it find something else entirely?

After 25 generations on a single NVIDIA GB10 GPU, the answer surprised us. Evolution didn't pick attention. It didn't pick convolution, or recurrence, or mixture-of-experts. It built a deep normalized MLP — stacked linear layers with heavy layer normalization and residual connections. The simplest possible thing that works.

This repo is deliberately kept minimal (under 600 lines total) and uses the same data pipeline and evaluation metric (val_bpb) as autoresearch and nanochat, so results are directly comparable.

What evolution found

| Generation | Best val_bpb | Architecture | What happened |
|---|---|---|---|
| 0 | 10.481 | linear→gate→norm→moe→conv | Random architectures. MoE and gating appear. |
| 6 | 9.791 | linear→norm→gate→identity→norm→linear | Norm + gate pattern emerges. Conv dropped. |
| 14 | 9.727 | linear→norm→norm→identity→norm→linear³ | Gate dropped. Deep normalization takes over. |
| 24 | 9.484 | linear→norm→norm→identity→norm→linear⁴ | Final form: 9 layers, 7.4M params. Pure norm + linear. |

Key finding: At this scale (≤10M parameters, 30s training budget), evolution consistently converges on deep normalized MLPs. Attention, convolution, GRUs, and MoE all appeared in early generations but were eliminated by selection pressure.
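The winning pattern, stacked linear layers interleaved with LayerNorm and residual connections, can be sketched roughly as follows. This is a hypothetical reconstruction for illustration; the actual evolved layer order and dimensions live in best_genome.json.

```python
import torch
import torch.nn as nn

class NormLinearBlock(nn.Module):
    """One evolved-style block: LayerNorm -> Linear -> activation, with a residual path."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.linear = nn.Linear(dim, dim)
        self.act = nn.GELU()

    def forward(self, x):
        # Skip connection around the normalized linear transform.
        return x + self.act(self.linear(self.norm(x)))

class DeepNormalizedMLP(nn.Module):
    """Token embeddings -> stacked norm+linear blocks -> LM head. No attention anywhere."""
    def __init__(self, vocab_size: int, dim: int = 256, depth: int = 9):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.blocks = nn.Sequential(*[NormLinearBlock(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, idx):  # idx: (batch, seq) token ids
        return self.head(self.blocks(self.embed(idx)))  # (batch, seq, vocab) logits

model = DeepNormalizedMLP(vocab_size=512)
logits = model(torch.randint(0, 512, (2, 16)))
print(logits.shape)
```

Note that such a block operates on each token position independently, which is part of why the result is surprising for a sequence task.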

Quick start

Requirements: A single NVIDIA GPU, Python 3.10+, uv.

```sh
# 1. Install uv (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Install dependencies
uv sync

# 3. Download data and train tokenizer (one-time, ~2 min)
uv run prepare.py

# 4. Run evolution (let it go overnight for best results)
uv run evolve.py
```

Results are logged to results.tsv and the best architecture is saved to best_genome.json. A progress.png plot is updated after each generation.

How it works

Three files:

  • prepare.py — data prep + evaluation utilities (from nanochat). Do not modify. Downloads FineWeb-Edu and trains a BPE tokenizer.
  • genome.py — encodes neural architectures as evolvable "DNA." Each genome is a sequence of genes, where each gene specifies an operation (linear, conv1d, attention, gru, gate, norm, identity, moe_router), dimensions, activation function, and whether to use a skip connection. Supports mutation (add/remove/modify genes) and crossover (combine two parent genomes).
  • evolve.py — the evolution engine. For each generation: build each genome into a PyTorch model → train for a fixed time budget → evaluate val_bpb → select survivors → produce offspring via mutation and crossover → repeat.
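The generational loop can be sketched with toy stand-ins. Everything below (OPS, fitness, mutate, crossover, evolve) is illustrative and not the repo's actual API: a "genome" here is just a list of op names, and "fitness" is a toy score standing in for train-30-seconds-then-measure-val_bpb.

```python
import random

OPS = ["linear", "norm", "identity", "gate", "conv1d", "attention", "gru", "moe_router"]

def fitness(genome):
    # Toy stand-in for training + val_bpb: lower is better, rewards norm/linear-heavy genomes.
    return -sum(op in ("linear", "norm", "identity") for op in genome)

def mutate(genome):
    # Add, remove, or modify one gene.
    g = list(genome)
    i = random.randrange(len(g))
    roll = random.random()
    if roll < 0.33 and len(g) > 2:
        del g[i]
    elif roll < 0.66:
        g.insert(i, random.choice(OPS))
    else:
        g[i] = random.choice(OPS)
    return g

def crossover(a, b):
    # Single-point splice of two parent gene sequences.
    return a[:random.randrange(1, len(a))] + b[random.randrange(1, len(b)):]

def evolve(pop_size=12, survivors_k=3, crossover_rate=0.5, generations=25):
    population = [[random.choice(OPS) for _ in range(5)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness)         # evaluate: lower "val_bpb" first
        elite = population[:survivors_k]     # elitism: top-k survive unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            if random.random() < crossover_rate:
                children.append(crossover(*random.sample(elite, 2)))
            else:
                children.append(mutate(random.choice(elite)))
        population = elite + children
    return min(population, key=fitness)

best = evolve()
```

The real evolve.py additionally enforces the parameter cap and time budget and logs each generation to results.tsv.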

The fitness metric is val_bpb (validation bits per byte). Lower is better, and because bits per byte is independent of the tokenizer's vocabulary size, architectural changes can be compared fairly.
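Bits per byte can be derived from an ordinary cross-entropy loss: convert the per-token loss from nats to bits, then normalize by the number of raw UTF-8 bytes the tokens cover. A minimal sketch (function name and signature are illustrative, not the prepare.py API):

```python
import math

def val_bpb(mean_loss_nats: float, total_tokens: int, total_bytes: int) -> float:
    """Bits per byte from a mean per-token cross-entropy loss (in nats)."""
    total_bits = (mean_loss_nats / math.log(2)) * total_tokens  # nats -> bits, summed
    return total_bits / total_bytes                             # normalize by raw byte count

# Example: 2.0 nats/token, each token covering 4 bytes on average.
print(round(val_bpb(2.0, total_tokens=1000, total_bytes=4000), 4))
```

Because the denominator is bytes rather than tokens, a genome cannot improve its score simply by pairing with a tokenizer that emits fewer, longer tokens.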

Configuration

Edit the constants at the top of evolve.py:

```python
POPULATION_SIZE = 12       # genomes per generation
SURVIVORS = 3              # top-k survive (elitism)
MUTATION_RATE = 0.3        # probability of mutation per gene
CROSSOVER_RATE = 0.5       # probability of crossover vs mutation
TIME_BUDGET = 30           # seconds of training per genome
MAX_PARAMS = 10_000_000    # parameter limit
MAX_GENERATIONS = 25       # stop after N generations
```

With these defaults, one generation takes ~6 minutes, so 25 generations complete in about 2.5 hours. For overnight runs, set MAX_GENERATIONS = 200 and TIME_BUDGET = 60.

The building blocks

Evolution can combine these operations in any order, depth, and dimension:

| Operation | What it does | Did evolution use it? |
|---|---|---|
| linear | Dense layer (matrix multiply) | ✅ Yes — dominant in final architecture |
| norm | Layer normalization | ✅ Yes — heavily used (3 norm layers) |
| identity | Skip / residual connection | ✅ Yes — one identity layer for residual path |
| gate | Gated linear unit (GLU-style) | ⚠️ Used mid-evolution, then dropped |
| conv1d | Causal 1D convolution | ❌ Dropped after generation 5 |
| attention | Multi-head causal self-attention | ❌ Never in top architectures |
| gru | Gated recurrent unit | ❌ Never in top architectures |
| moe_router | Soft mixture-of-experts routing | ❌ Dropped after generation 2 |
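As a concrete example of one building block, a GLU-style gate (the op evolution used mid-run and later dropped) can be sketched as follows. This is an illustrative sketch, not the repo's genome.py implementation:

```python
import torch
import torch.nn as nn

class GLUGate(nn.Module):
    """Gated linear unit: one projection produces values, a second one a sigmoid gate."""
    def __init__(self, dim: int):
        super().__init__()
        self.value = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        # Element-wise gating: the gate branch decides how much of each value passes through.
        return self.value(x) * torch.sigmoid(self.gate(x))

x = torch.randn(2, 16, 64)   # (batch, seq, dim)
print(GLUGate(64)(x).shape)
```

Like the linear and norm ops, this gate acts on each position independently; only conv1d, attention, and gru in the table above mix information across the sequence.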

Important caveats

  • Scale matters. These results are for small models (≤10M params) with short training budgets (30s). At larger scales, the optimal architecture may be very different — attention's advantages grow with sequence length and model size.
  • This is not NAS. Neural Architecture Search is a mature field with sophisticated methods (see The Evolved Transformer, ENAS, Microsoft NNI). genesis is a minimal educational tool, not a competitor to those systems.
  • Evolution is stochastic. Different random seeds will produce different results. Run multiple times for robust conclusions.
  • The search space constrains the results. The 8 available operations define what evolution can find. A richer search space might yield different architectures.

Related work

  • autoresearch — AI agents optimizing training code (hyperparameters, tricks). genesis is complementary: autoresearch optimizes how you train, genesis optimizes what you train.
  • The Evolved Transformer (Google Brain, 2019) — evolutionary search seeded with the Transformer architecture. genesis differs by starting from random architectures with no human prior.
  • Mamba / RWKV — human-designed alternatives to transformers, showing that attention isn't the only viable approach for language modeling.

License

MIT
