SV-Detect

Code for SV-Detect: Steering-Vector-Based Detection of Machine-Generated Text. SV-Detect represents each text as the layer-wise alignment of its hidden-state activations with learned "real-vs-fake" steering vectors in a frozen reference language model, then trains a lightweight logistic regression on those alignment scores.
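
To make the representation concrete, here is a minimal NumPy sketch of the feature described above: one dot product per layer between a text's pooled activations and that layer's steering vector, followed by a logistic score. The function names, the toy shapes, and the unit weights are assumptions for illustration only, not the repository's code.

```python
import numpy as np

def alignment_features(pooled, steering):
    """Per-layer dot product between pooled activations and steering vectors.

    pooled:   (num_layers, hidden_size) mean-pooled hidden states of one text
    steering: (num_layers, hidden_size) one steering vector per layer
    returns:  (num_layers,) alignment scores
    """
    return np.sum(pooled * steering, axis=-1)

def logistic_score(features, weights, bias):
    """Sigmoid of a linear function of the alignment scores (hypothetical weights)."""
    return 1.0 / (1.0 + np.exp(-(features @ weights + bias)))

# Toy data standing in for real activations and learned steering vectors.
rng = np.random.default_rng(0)
num_layers, hidden_size = 4, 8
steering = rng.normal(size=(num_layers, hidden_size))
pooled = rng.normal(size=(num_layers, hidden_size))

feats = alignment_features(pooled, steering)
prob = logistic_score(feats, weights=np.ones(num_layers), bias=0.0)
print(feats.shape, float(prob))
```

The classifier only ever sees num_layers numbers per text, which is why a small logistic regression is enough.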

Layout

sv-detect/
├── src/                       # Pipeline + analysis code
│   ├── extract/               # 1) activations -> SVs -> dot products
│   │   ├── extract_activations.py
│   │   ├── compute_steering_vectors.py
│   │   └── train_logreg.py
│   ├── classifier/            # 2) per-benchmark evaluation drivers
│   │   ├── analyze_detectrl.py
│   │   └── analyze_mirage.py
│   ├── interpret/             # 3) Section D interpretability code
│   │   ├── interpret_steering_vectors.py
│   │   ├── interpret_steering_ngrams.py
│   │   ├── interpret_steering_vectors_text.py
│   │   ├── regex_detector.py
│   │   ├── regex_detector_detectrl.py
│   │   ├── regex_detector_mirage.py
│   │   ├── compare_sae_dense.py
│   │   └── inspect_logreg_weights.py
│   ├── ablation/              # 4) classifier / construction ablations
│   │   └── compare_classifiers.py
│   ├── visualize_tokens.py    # Figure 1 per-token visualisation
│   └── download_data.py       # Helpers to fetch DetectRL / MIRAGE / COLING
├── data/                      # (Reserved for cached datasets; gitignored)
├── environment.yml            # Conda environment used in our experiments
├── requirements.txt           # Pip equivalent for non-conda users
└── README.md

Quick start

1. Install

conda env create -f environment.yml
conda activate sv-detect
# or:
pip install -r requirements.txt

Requirements: Python 3.11+, PyTorch 2.10+ (CUDA 12.8), transformers 4.57+, scikit-learn 1.8+.

2. Download benchmarks

python -m src.download_data --benchmark detectrl
python -m src.download_data --benchmark mirage
python -m src.download_data --benchmark coling   # optional

DetectRL and MIRAGE are pulled from their official Hugging Face mirrors. To download gated models such as Llama-2-7b-hf, you also need a Hugging Face token (export HF_TOKEN=...).

3. Extract activations

python -m src.extract.extract_activations \
    --llm EleutherAI/gpt-neo-2.7B \
    --data-root data/DetectRL \
    --out-dir data/activations/gpt-neo-2.7B/DetectRL \
    --splits real_train fake_train real_val fake_val test

Output: per-sample mean-pooled residual-stream activations, saved as .npy chunks of shape (N, num_layers, hidden_size). Each sample takes one forward pass; splits can run in parallel as a SLURM array (see scripts/slurm/run_extract.slurm).
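
As an illustration of the pooling step, the sketch below averages per-token hidden states over the sequence dimension to get one (num_layers, hidden_size) array per sample. Ignoring padding positions via an attention mask is an assumption here; the README does not say how extract_activations.py treats padding, and mean_pool is a hypothetical helper.

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average hidden states over real (non-padding) tokens.

    hidden_states:  (num_layers, seq_len, hidden_size)
    attention_mask: (seq_len,) with 1 for real tokens and 0 for padding
    returns:        (num_layers, hidden_size)
    """
    mask = attention_mask.astype(hidden_states.dtype)[None, :, None]
    summed = (hidden_states * mask).sum(axis=1)
    count = max(float(attention_mask.sum()), 1.0)  # guard against empty input
    return summed / count

# Toy example: 2 layers, 3 token positions (last one is padding), hidden size 2.
hidden = np.arange(2 * 3 * 2, dtype=np.float32).reshape(2, 3, 2)
mask = np.array([1, 1, 0])
pooled = mean_pool(hidden, mask)

# Stacking per-sample results gives the (N, num_layers, hidden_size) layout.
chunk = np.stack([pooled, pooled])
print(pooled, chunk.shape)
```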

4. Compute steering vectors + dot-product features

python -m src.extract.compute_steering_vectors \
    --activations-dir data/activations/gpt-neo-2.7B/DetectRL \
    --methods mean pca logreg \
    --filters all woweak \
    --out-dir data/svs/gpt-neo-2.7B/DetectRL

This writes the steering vectors to steering_vectors_<method>.npy (shape (L, H), with L layers and hidden size H) and the per-sample dot-product features to <split>_dot_products_<method>.npy (shape (N, L), with N samples), which the downstream classifier consumes.
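
For intuition about the shapes, here is a rough sketch of what a difference-of-means steering vector and the resulting dot-product features could look like. The sign convention (fake minus real), the lack of normalisation, and the function names are assumptions; the pca and logreg methods and the filters are not shown.

```python
import numpy as np

def mean_steering_vectors(real_acts, fake_acts):
    """Difference of class means per layer.

    real_acts, fake_acts: (N, L, H) pooled activations
    returns:              (L, H) one steering vector per layer
    """
    return fake_acts.mean(axis=0) - real_acts.mean(axis=0)

def dot_products(acts, steering):
    """Project each sample onto its layer's steering vector.

    acts: (N, L, H), steering: (L, H) -> features of shape (N, L)
    """
    return np.einsum("nlh,lh->nl", acts, steering)

# Synthetic activations: the "fake" class is shifted away from the "real" class.
rng = np.random.default_rng(0)
real = rng.normal(size=(50, 3, 16))
fake = rng.normal(size=(50, 3, 16)) + 0.5

sv = mean_steering_vectors(real, fake)
real_feats = dot_products(real, sv)
fake_feats = dot_products(fake, sv)
print(sv.shape, real_feats.shape, fake_feats.shape)
```

With this construction the gap between the class means of the features equals the squared norm of the steering vector at each layer, so fake samples score higher on average on the data the vectors were fit to.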

5. Train and evaluate the detector

# DetectRL (in-domain + Multi-Domain / Multi-LLM / Multi-Attack splits)
python -m src.classifier.analyze_detectrl \
    --svs-dir data/svs/gpt-neo-2.7B/DetectRL \
    --method logreg

# MIRAGE
python -m src.classifier.analyze_mirage \
    --svs-dir data/svs/gpt-neo-2.7B/MIRAGE \
    --tasks generate polish rewrite

# COLING-2025 MGT
python -m src.extract.train_logreg \
    --dots-dir data/svs/gpt-neo-2.7B/COLING_2025_MGT_en \
    --methods mean pca logreg \
    --filters all woweak woTGPT35weak
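
To inspect the features by hand, the sketch below assembles a feature matrix and label vector from files that follow the <split>_dot_products_<method>.npy pattern from step 4. The label convention (real = 0, fake = 1), the helper names, and the synthetic files are assumptions; the actual scripts may also encode the filter in the file name, which is not handled here.

```python
import tempfile
from pathlib import Path

import numpy as np

def load_split(dots_dir, split, method):
    """Load one (N, L) feature file named <split>_dot_products_<method>.npy."""
    return np.load(Path(dots_dir) / f"{split}_dot_products_{method}.npy")

def build_xy(dots_dir, part, method):
    """Concatenate real_<part> and fake_<part> into X (N, L) and y (N,)."""
    real = load_split(dots_dir, f"real_{part}", method)
    fake = load_split(dots_dir, f"fake_{part}", method)
    X = np.concatenate([real, fake], axis=0)
    y = np.concatenate([np.zeros(len(real), dtype=int),
                        np.ones(len(fake), dtype=int)])
    return X, y

# Demo on synthetic files so the snippet runs without the real pipeline output.
with tempfile.TemporaryDirectory() as tmp:
    rng = np.random.default_rng(0)
    np.save(Path(tmp) / "real_train_dot_products_mean.npy", rng.normal(size=(5, 32)))
    np.save(Path(tmp) / "fake_train_dot_products_mean.npy", rng.normal(size=(7, 32)))
    X_train, y_train = build_xy(tmp, "train", "mean")

print(X_train.shape, int(y_train.sum()))
```

X_train and y_train can then be passed to the fit method of any scikit-learn classifier, such as LogisticRegression.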

6. Interpretability (Section D)

# D.1 + D.2: per-layer attribution + classical logit-lens
python -m src.interpret.interpret_steering_vectors \
    --llm EleutherAI/gpt-neo-2.7B \
    --sv-path data/svs/gpt-neo-2.7B/MIRAGE/steering_vectors_logreg_generate.npy \
    --top-k 12 --out-tsv interpret_out/logit_lens_mirage_generate.tsv

# D.3: stylistic-feature baseline
python -m src.interpret.regex_detector_detectrl --data-root data/DetectRL
python -m src.interpret.regex_detector_mirage --data-root data/MIRAGE

# D.5 (Figure 1): per-token visualisation
python -m src.visualize_tokens \
    --sv-path data/svs/gpt-neo-2.7B/MIRAGE/steering_vectors_logreg_generate.npy \
    --raw-json data/MIRAGE/raw_texts/DIG/generate.json \
    --out-dir figures/teaser_pngs
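
For readers unfamiliar with the logit lens used in D.2, the idea can be sketched as projecting a steering vector through the model's unembedding matrix and reading off the highest-scoring tokens. The snippet uses a made-up four-word vocabulary and a toy matrix, and it skips the final layer norm that a full logit lens usually applies; it is not the code in interpret_steering_vectors.py.

```python
import numpy as np

def logit_lens_top_k(steering_vector, unembedding, vocab, k):
    """Return the k tokens whose unembedding rows align most with the vector.

    steering_vector: (H,)
    unembedding:     (V, H) output embedding matrix
    vocab:           list of V token strings
    """
    logits = unembedding @ steering_vector
    top = np.argsort(logits)[::-1][:k]  # indices of the k largest logits
    return [(vocab[i], float(logits[i])) for i in top]

# Toy vocabulary and unembedding matrix (hypothetical values).
vocab = ["the", "delve", "lol", "moreover"]
unembedding = np.array([[0.0, 0.0],
                        [1.0, 0.0],
                        [-1.0, 0.0],
                        [0.5, 0.5]])
sv = np.array([2.0, 0.0])

top = logit_lens_top_k(sv, unembedding, vocab, k=2)
print(top)
```

In the command above, --top-k 12 plays the role of k.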
