# SelfChat

Code for running self-chat between two role-inverted instances of a language model on a local Ollama server, plus analysis tooling for studying attractor states in the resulting transcripts. The current setup compares an int4-quantized Gemma-4-31B-it official checkpoint against an abliterated variant of the same checkpoint under matched quantization.
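
For orientation, here is a minimal sketch of the role-inverted self-chat loop against Ollama's `/api/chat` endpoint. The model tag and seed message below are placeholders, not the repo's actual values; the real driver (seeding, logging, variant handling) lives in `selfchat.runs.run_experiment`.

```python
# Minimal sketch of role-inverted self-chat against a local Ollama server.
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"
MODEL = "gemma-selfchat:int4"  # placeholder tag

def reply(history):
    """Send one side's view of the conversation, return the model's next turn."""
    resp = requests.post(
        OLLAMA_URL, json={"model": MODEL, "messages": history, "stream": False}
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"]

transcript = ["Hello! You are talking to another instance of yourself."]  # seed
for _ in range(50):  # cf. run_experiment's --turns flag
    speaker = len(transcript) % 2  # message i was spoken by instance i % 2
    # Role inversion: the side about to speak sees its own past messages as
    # "assistant" turns and the other instance's messages as "user" turns.
    history = [
        {"role": "assistant" if i % 2 == speaker else "user", "content": m}
        for i, m in enumerate(transcript)
    ]
    transcript.append(reply(history))
```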

## Setup

1. Install uv (if you don't have it):

   ```bash
   curl -LsSf https://astral.sh/uv/install.sh | sh
   ```

   Or on Windows:

   ```powershell
   powershell -ExecutionPolicy Bypass -c "irm https://astral.sh/uv/install.ps1 | iex"
   ```

2. Install dependencies:

   ```bash
   uv sync
   ```

## Get the data

Transcripts and embedding artifacts are published as a HuggingFace dataset: alliedtoasters/forbidden-backrooms-gemma-4-31B-it. The `hf` CLI ships with `uv sync` (it's a project dep), so step 2 above already put it in `.venv/bin/hf`. Use it rather than `git clone`: the npz/jsonl blobs are Git-LFS-tracked, and `hf download` resolves LFS pointers natively.

```bash
# Download the dataset into a sibling directory of this repo:
.venv/bin/hf download alliedtoasters/forbidden-backrooms-gemma-4-31B-it \
  --repo-type dataset \
  --local-dir ../forbidden-backrooms-data

# Symlink the data into the repo so all paths resolve as the code expects:
ln -s ../forbidden-backrooms-data/transcripts transcripts
ln -s ../forbidden-backrooms-data/artifacts artifacts
```

If you already have local `transcripts/` or `artifacts/` directories from your own runs, move them aside first; the `ln -s` calls won't overwrite real directories.

## Browse the data

```bash
.venv/bin/streamlit run selfchat/viz/browse.py
```

The default page shows a terminal-state PCA over runs; click a point to load that run's transcript in the side panel. The cluster lab page (in the sidebar) does interactive per-message KMeans with PCA or t-SNE projection, optionally coloring points by Llama Guard 3 verdicts.
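
For reference, the cluster-lab computation amounts to something like the sketch below: KMeans over per-message embeddings, then a 2-D projection for plotting. The npz filename and array key are assumptions for illustration; the actual loading code lives under `selfchat/viz/`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical artifact path; inspect the downloaded artifacts/ directory
# for the real embedding files.
data = np.load("artifacts/message_embeddings.npz")
X = data[data.files[0]]  # shape (n_messages, embedding_dim)

labels = KMeans(n_clusters=8, n_init="auto", random_state=0).fit_predict(X)
coords = PCA(n_components=2).fit_transform(X)  # one 2-D point per message

for k in range(8):
    print(f"cluster {k}: {(labels == k).sum()} messages")
```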

## Generate new transcripts

Requires Ollama (http://localhost:11434) with both model tags pulled and served under matched int4 quantization (so quantization noise isn't a confound between them):

```bash
.venv/bin/python -m selfchat.runs.run_experiment \
  --variants vanilla jailbroken \
  --seeds freedom freedom_dark task \
  --runs 20 --turns 50
```
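
A quick preflight check that the server is reachable and both tags are present can look like this. Ollama lists local models at `GET /api/tags`; the tag names here are placeholders for whatever your vanilla/jailbroken variants map to in your Ollama store.

```python
import requests

EXPECTED = {"gemma-official:int4", "gemma-heretic:int4"}  # placeholder tags

tags = requests.get("http://localhost:11434/api/tags").json()
available = {m["name"] for m in tags["models"]}
missing = EXPECTED - available
if missing:
    print("missing model tags:", ", ".join(sorted(missing)))
else:
    print("both tags present; ready to run")
```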

Sample-size table per (variant, seed); the `sed` strips the 32-hex run id and everything after it from each filename:

```bash
ls transcripts/ | sed 's/_[0-9a-f]\{32\}_.*//' | sort | uniq -c | sort -rn
```

## Safety

The jailbroken variant is the abliterated gemma-4-31B-it-uncensored-heretic fine-tune. Outputs may contain content that the official model would refuse. Every message in the published dataset has been screened by Llama Guard 3 8B; per-message and per-run verdicts live in `artifacts/vet_results.jsonl`. The author also manually reviewed the highest-`p_unsafe` messages and judged the content non-graphic. See the dataset card for the full vetting protocol and content notes.
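
To poke at the screening results yourself, something like the following works. The record schema is an assumption (only `p_unsafe` is named above); check the dataset card for the real fields.

```python
import json

# Field names other than p_unsafe are assumptions; see the dataset card.
with open("artifacts/vet_results.jsonl") as f:
    records = [json.loads(line) for line in f]

records.sort(key=lambda r: r.get("p_unsafe", 0.0), reverse=True)
print(f"{len(records)} screened messages")
for r in records[:10]:  # the ten highest-p_unsafe messages
    print(f"p_unsafe={r.get('p_unsafe', 0.0):.3f}  {str(r)[:100]}")
```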

The pipeline records raw model outputs verbatim; nothing is sanitized, redacted, or content-filtered, so the experimental signal is preserved. Neither model checkpoint is committed to this repo; both are pulled from HuggingFace into your local Ollama store.
