Code for running self-chat between two role-inverted instances of a language model on a local Ollama server, plus analysis tooling for studying attractor states in the resulting transcripts. The current setup compares an int4-quantized Gemma-4-31B-it official checkpoint against an abliterated variant of the same checkpoint under matched quantization.
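The role-inversion mechanic is simple enough to sketch. Below is a minimal, hypothetical version (the real entry point is `selfchat.runs.run_experiment`; `generate` stands in for a call to the Ollama chat endpoint at `http://localhost:11434/api/chat`): after each reply, user/assistant roles are flipped so the next speaker sees the other instance's last message as a user turn.

```python
# Sketch of role-inverted self-chat; names here are illustrative, not the repo's API.
from typing import Callable

Message = dict  # {"role": "user" | "assistant", "content": str}

def invert_roles(messages: list[Message]) -> list[Message]:
    """Swap user/assistant so the transcript reads as the *other* model's history."""
    flip = {"user": "assistant", "assistant": "user"}
    return [{**m, "role": flip.get(m["role"], m["role"])} for m in messages]

def self_chat(seed: str, turns: int, generate: Callable[[list[Message]], str]) -> list[Message]:
    """Alternate turns between two instances sharing one growing transcript."""
    transcript: list[Message] = [{"role": "user", "content": seed}]
    for _ in range(turns):
        reply = generate(transcript)              # current speaker sees history as-is
        transcript.append({"role": "assistant", "content": reply})
        transcript = invert_roles(transcript)     # hand the floor to the other side
    return transcript
```

With a real `generate`, the two calls would be routed to the two Ollama model tags on alternating turns.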
References:
- Anthropic's Claude 4 System Card — "spiritual bliss attractor state".
- LessWrong: Models have some pretty funny attractor states and its companion repo.
- The original Dreams of an Electric Mind self-chat.
- Install uv (if you don't have it):

  ```bash
  curl -LsSf https://astral.sh/uv/install.sh | sh
  ```

  Or on Windows:

  ```powershell
  powershell -ExecutionPolicy Bypass -c "irm https://astral.sh/uv/install.ps1 | iex"
  ```
- Install dependencies:

  ```bash
  uv sync
  ```

Transcripts and embedding artifacts are published as a HuggingFace dataset: `alliedtoasters/forbidden-backrooms-gemma-4-31B-it`. The `hf` CLI ships with `uv sync` (it's a project dependency), so step 2 above already put it in `.venv/bin/hf`. Use it rather than `git clone` — the npz/jsonl blobs are Git-LFS-tracked, and `hf download` resolves LFS pointers natively.
```bash
# Download the dataset into a sibling directory of this repo:
.venv/bin/hf download alliedtoasters/forbidden-backrooms-gemma-4-31B-it \
    --repo-type dataset \
    --local-dir ../forbidden-backrooms-data
```
```bash
# Symlink the data into the repo so all paths resolve as the code expects:
ln -s ../forbidden-backrooms-data/transcripts transcripts
ln -s ../forbidden-backrooms-data/artifacts artifacts
```

If you already have local `transcripts/` or `artifacts/` directories from your own runs, move them aside first — symlinks won't overwrite real directories.
```bash
.venv/bin/streamlit run selfchat/viz/browse.py
```

The default page shows a terminal-state PCA over runs (click a point and the transcript loads in a side panel). The cluster lab page (in the sidebar) does interactive per-message KMeans + PCA/t-SNE with an optional Llama Guard 3 color-by.
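The terminal-state projection itself is a few lines of linear algebra. A sketch, assuming each run contributes one embedding vector for its final message (the actual npz key names and artifact layout in the dataset may differ):

```python
import numpy as np

def pca_2d(X: np.ndarray) -> np.ndarray:
    """Project row vectors onto their top-2 principal components."""
    Xc = X - X.mean(axis=0)               # center each embedding dimension
    # SVD of the centered matrix: rows of Vt are the principal axes,
    # ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                  # (n_runs, 2) scatter coordinates

# Usage sketch (hypothetical file/key names):
# coords = pca_2d(np.load("artifacts/terminal_embeddings.npz")["X"])
```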
Requires Ollama (http://localhost:11434) with both model tags pulled and served under matched int4 quantization (so quantization noise isn't a confound between them):
- Vanilla: `google/gemma-4-31B-it` — served locally as `gemma-4-vanilla-q4`.
- Jailbroken: `llmfan46/gemma-4-31B-it-uncensored-heretic` — an abliterated fine-tune of the vanilla checkpoint; served locally as `gemma-4-refusalstudy-q4`.
```bash
.venv/bin/python -m selfchat.runs.run_experiment \
    --variants vanilla jailbroken \
    --seeds freedom freedom_dark task \
    --runs 20 --turns 50
```

Sample-size table per (variant, seed):
```bash
ls transcripts/ | sed 's/_[0-9a-f]\{32\}_.*//' | sort | uniq -c | sort -rn
```

The jailbroken variant is the abliterated `gemma-4-31B-it-uncensored-heretic` fine-tune. Outputs may contain content that the official model would refuse. Every message in the published dataset has been screened by Llama Guard 3 8B; per-message and per-run verdicts live in `artifacts/vet_results.jsonl`. The author also manually reviewed the highest-`p_unsafe` messages and judged the content non-graphic. See the dataset card for the full vetting protocol and content notes.
The pipeline records raw model outputs verbatim — outputs are not sanitized, redacted, or content-filtered, so the experimental signal is preserved. Neither model checkpoint is committed to this repo; both are pulled from the HuggingFace links above into your local Ollama store.