HistSAE is a toolkit for historical semantic drift analysis with sparse autoencoders (SAEs). It supports concept-centered analysis over Chinese historical corpora, from target-sentence activation extraction to drift-base discovery, evidence sentence export, visualization, and non-target analysis.
- SAE activation extraction for concept-bearing sentences
- yearly center construction and inter-year distance measurement
- drift-base ranking by cumulative change across time
- peak evidence sentence export and top-sentence ranking
- wordcloud generation for highly activated bases
- non-target analysis for sentences that do not explicitly mention the target concept
- run manifests and structured output directories for reproducibility
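The center, distance, and drift-ranking steps above can be sketched as follows. This is a minimal illustration, not the toolkit's actual implementation: the array shapes, function names, and the choice of cosine distance are all assumptions.

```python
# Hypothetical sketch of yearly-center construction, inter-year distance
# measurement, and drift-base ranking; not HistSAE's real API.
import numpy as np

def yearly_centers(acts_by_year):
    """Mean SAE activation per year: {year: (n_sents, n_bases)} -> {year: (n_bases,)}."""
    return {year: acts.mean(axis=0) for year, acts in acts_by_year.items()}

def interyear_distances(centers):
    """Cosine distance between consecutive yearly centers (metric is an assumption)."""
    years = sorted(centers)
    out = {}
    for y0, y1 in zip(years, years[1:]):
        a, b = centers[y0], centers[y1]
        cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
        out[(y0, y1)] = 1.0 - cos
    return out

def rank_drift_bases(centers, top_k=10):
    """Rank bases by cumulative absolute change of their center value across years."""
    years = sorted(centers)
    stacked = np.stack([centers[y] for y in years])            # (n_years, n_bases)
    cumulative = np.abs(np.diff(stacked, axis=0)).sum(axis=0)  # (n_bases,)
    return np.argsort(cumulative)[::-1][:top_k]
```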
The repository is organized as follows:

- src/pipeline/: main workflow stages
- src/non_target/: non-target activation analysis
- src/analysis/: visualization and reporting utilities
- src/configs/: config loading, schema validation, and run-root resolution
- configs/: experiment and local environment config files
- scripts/run_pipeline.sh: shell wrapper for the pipeline runner
- tests/: lightweight smoke tests for config loading and post-processing logic
HistSAE expects sentence files and metadata in a corpus-oriented directory structure:
```
sentences/
  example_corpus/
    *.txt
metadata/
  example_corpus/
    documents.jsonl
```
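Assuming this layout, a corpus could be loaded like so. This is a hypothetical sketch: the function name, the one-sentence-per-line format of the *.txt files, and the keying of metadata by filename are assumptions.

```python
# Minimal, illustrative loader for the directory layout above.
import json
from pathlib import Path

def load_corpus(root, corpus="example_corpus"):
    root = Path(root)
    # Index metadata records by filename (assumed join key).
    meta = {}
    with open(root / "metadata" / corpus / "documents.jsonl", encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            meta[rec["filename"]] = rec
    docs = []
    for txt in sorted((root / "sentences" / corpus).glob("*.txt")):
        rec = meta.get(txt.name, {})
        year = rec.get("pub_date", "")[:4]  # "1915-09-15" -> "1915"
        sents = txt.read_text(encoding="utf-8").splitlines()
        docs.append({"filename": txt.name, "year": year, "sentences": sents})
    return docs
```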
Each metadata record should include at least:
- filename
- pub_date
An example record shape is:
```json
{"doc_id": "...", "filename": "...txt", "pub_date": "1915-09-15", "...": "..."}
```

- Create a Python environment.
- Install Python dependencies:
```
pip install -r requirements.txt
```

- Install OpenSAE in the environment you use for extraction. The extraction scripts import:
```
opensae
opensae.transformer_with_sae
opensae.config_utils
```
- Copy the example environment file and fill in local paths:
```
cp configs/env/local.example.yaml configs/env/local.yaml
```

configs/env/local.yaml should define:
- paths.data_root, or explicit sentences_dir / metadata_dir / output_root
- paths.sae_ckpt
- paths.llama_path
- paths.font_path
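For reference, a hypothetical configs/env/local.yaml might look like this. Every path below is a placeholder to replace with your own; only the key names come from the list above.

```yaml
paths:
  # Either set data_root (from which sentences_dir/metadata_dir/output_root
  # are derived), or set those three directories explicitly instead.
  data_root: /data/histsae
  sae_ckpt: /models/opensae/checkpoint
  llama_path: /models/llama
  font_path: /usr/share/fonts/NotoSansCJK-Regular.ttc
```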
The example experiment config is provided in configs/exp/default.yaml.
Run the default pipeline:
```
python src/pipeline/run.py --config configs/exp/default.yaml
```

Run selected stages only:

```
python src/pipeline/run.py --config configs/exp/default.yaml --stages extract,centers,drift
```

Run non-target analysis after the main drift outputs exist:

```
python src/pipeline/run.py --config configs/exp/default.yaml --stages non_target,non_target_visualize
```

Preview the execution plan without running computation:

```
python src/pipeline/run.py --config configs/exp/default.yaml --dry-run
```

You can also use the shell wrapper:

```
bash scripts/run_pipeline.sh --config configs/exp/default.yaml
```

Each run is organized under:
```
output/{corpus}/{model_name}/layer_{layer}/sae_{sae_id}/{experiment_name}/{concept}/
```
Typical per-concept outputs include:
- activations/
- yearly_centers.json
- yearly_distances.json
- top_drift_bases.json
- key_bases_peak_change.json
- sentences/
- top30_sentences/
- wordclouds/
- target_sorted_sentences_by_year/
- visualizations/
- non_target_analysis/
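As a hypothetical example of post-processing these outputs, the top drift bases could be read back like this. The schema assumed here (a list of {"base": ..., "score": ...} records) is a guess, not the toolkit's documented format.

```python
# Illustrative reader for top_drift_bases.json; schema is assumed.
import json
from pathlib import Path

def top_drift_bases(concept_dir, k=5):
    """Return the k highest-scoring drift-base records from a concept output directory."""
    path = Path(concept_dir) / "top_drift_bases.json"
    records = json.loads(path.read_text(encoding="utf-8"))
    return sorted(records, key=lambda r: r["score"], reverse=True)[:k]
```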
The central code paths are:
- src/pipeline/run.py
- src/pipeline/extract_activations.py
- src/pipeline/compute_centers.py
- src/pipeline/identify_drift.py
- src/non_target/analyze_non_target_activations.py
- src/analysis/visualize.py
- src/analysis/target_vs_non_target.py
- src/analysis/visualize_non_target.py
The smoke tests cover config loading and pure post-processing stages, and do not require model weights.
Run them with:
```
python -m unittest discover -s tests -v
```