This repository contains the complete experimental code, data, and results for the paper:
"Evaluating Humanities Theory Alignment in LLMs: Incremental Prompting and Statistical Assessment"
We evaluate whether large language models (LLMs) exhibit behavioral alignment with three humanities theories—Davidson's truth-conditional semantics, Lewis's truth in fiction, and Iser's concept of textual gaps—using a top-down, theory-driven black-box framework. Core assumptions are reconstructed into testable behavioral rules and assessed via controlled classification tasks (binary for Davidson/Lewis; three-way for Iser) with systematic prompt comparisons and significance testing.
Our experiments show that theory-uninformed classification prompts generally outperform theory-enriched prompts in Lewis and Iser settings, while theory prompts help in Davidson tasks. Removing formal logical notation unexpectedly degrades performance, indicating symbols act as structural scaffolding rather than mere formal noise. Gemini Flash consistently achieves the highest scores across tasks and corpora, while the Iser gap detection task remains substantially harder than binary truth-conditional judgments. Statistical tests confirm robust prompt effects and the failure of basic prompts; however, model behavior under incremental theory exposure is unstable and architecture-dependent.
```
.
├── Notebooks/                 # Jupyter notebooks for experiments
│   ├── davidson_extended_nosymbol.ipynb
│   ├── interannotator_agreement_multiple_files.ipynb
│   ├── multimodel_evaluation_iser_3class_Borchert.ipynb
│   ├── multimodel_evaluation_iser_3class_Schnitzler.ipynb
│   ├── multimodel_evaluation_with_stats_Check_Davidson.ipynb
│   ├── multimodel_evaluation_with_stats_Check_Lewis_Borchert.ipynb
│   └── multimodel_evaluation_with_stats_Check_Lewis_Schnitzler.ipynb
│
├── Reference Data/            # Gold standard test sets
│   ├── agreement_Prinzip_1_lewis_borchert.xlsx
│   ├── agreement_Prinzip_1_lewis_schnitzler.xlsx
│   ├── agreement_Prinzip_2_lewis_borchert.xlsx
│   ├── agreement_Prinzip_2_lewis_schnitzler.xlsx
│   ├── agreement_Prinzip_3_lewis_borchert.xlsx
│   ├── agreement_Prinzip_3_lewis_schnitzler.xlsx
│   ├── davidson_goldstandard.xlsx
│   ├── Iser_Testdaten_Borchert_final.xlsx
│   └── Iser_Testdaten_Schnitzler_final.xlsx
│
└── Result Tables/             # Experimental results
    ├── davidson_multimodel_evaluation_results.xlsx
    ├── davidson_multimodel_evaluation_summary.xlsx
    ├── davidson_results_mcnemar_tests_3runs.xlsx
    ├── davidson_results_metrics_3runs.xlsx
    ├── iser_borchert_results_cochrans_q_3runs.xlsx
    ├── iser_borchert_results_metrics_3runs.xlsx
    ├── iser_borchert_results_per_class_metrics.xlsx
    ├── iser_schnitzler_results_cochrans_q_3runs.xlsx
    ├── iser_schnitzler_results_metrics_3runs.xlsx
    ├── iser_schnitzler_results_per_class_metrics.xlsx
    ├── lewis_borchert_multimodel_evaluation_results.xlsx
    ├── lewis_borchert_multimodel_evaluation_summary.xlsx
    ├── lewis_borchert_results_mcnemar_tests_2runs.xlsx
    ├── lewis_borchert_results_metrics_2runs.xlsx
    ├── lewis_borchert_results_paired_test_2runs.xlsx
    ├── lewis_borchert_results_wilcoxon_tests_2runs.xlsx
    ├── lewis_schnitzler_multimodel_evaluation_results.xlsx
    ├── lewis_schnitzler_multimodel_evaluation_summary.xlsx
    ├── lewis_schnitzler_results_mcnemar_tests_2runs.xlsx
    ├── lewis_schnitzler_results_metrics_2runs.xlsx
    ├── lewis_schnitzler_results_paired_test_2runs.xlsx
    └── lewis_schnitzler_results_wilcoxon_tests_2runs.xlsx
```
- Davidson (Truth-Conditional Semantics): Binary classification determining whether sentences are true or false relative to literary contexts (170 instances)
- Lewis (Truth in Fiction): Binary classification across Lewis's three principles of truth in fiction on two corpora (Schnitzler, Borchert)
- Iser (Narrative Gaps): Three-way classification (true/false/gap) identifying textual indeterminacies across two corpora (60 instances each)
- Google Gemini 3 Flash Preview
- Anthropic Claude Sonnet 4.5
- Qwen 3 Next 80B A3B Instruct
- Mistral Ministral 14B 2512
- Basic: Minimal instruction (incomplete output specification)
- Theory-uninformed classification: Prompts with explicit classification task
- Theory-enriched: Prompts enriched with theoretical frameworks
- Theory-enriched (no symbols): Natural language equivalents of formal notation (Davidson only)
- McNemar's Test: Pairwise prompt comparisons (binary tasks)
- Cochran's Q Test: Multi-classifier comparisons (multi-class tasks)
- Inter-annotator Agreement: Cohen's Kappa
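The pairwise prompt comparison above can be illustrated with a small sketch (not the repository's code, and the predictions below are made up): an exact McNemar test counts the items on which exactly one of two prompts is correct against the gold labels, and tests whether those discordant outcomes split evenly.

```python
# Hedged sketch of an exact McNemar test for comparing two prompts' binary
# predictions against the same gold labels; data below is illustrative only.
from scipy.stats import binomtest

def mcnemar_exact(gold, pred_a, pred_b):
    """Exact (binomial) McNemar test on the discordant pairs."""
    # b: A correct while B wrong; c: A wrong while B correct
    b = sum(1 for g, a, p in zip(gold, pred_a, pred_b) if a == g and p != g)
    c = sum(1 for g, a, p in zip(gold, pred_a, pred_b) if a != g and p == g)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    return binomtest(b, n, 0.5).pvalue

gold     = [1, 0, 1, 1, 0, 1, 0, 0]
prompt_a = [1, 0, 1, 1, 0, 0, 0, 0]  # hypothetical predictions under prompt A
prompt_b = [1, 1, 0, 1, 1, 0, 0, 0]  # hypothetical predictions under prompt B
p = mcnemar_exact(gold, prompt_a, prompt_b)
```

Because only disagreements between the two prompts carry information, the test is well suited to paired designs where both prompts see the same test instances.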
```bash
pip install pandas numpy scikit-learn openai statsmodels scipy seaborn scikit-posthocs openpyxl
```

Set your OpenRouter API key:

```bash
export MY_OpenRouter="your-api-key-here"
```

Each notebook is self-contained and runs independently:
```bash
# Davidson truth-conditional task
jupyter notebook Notebooks/multimodel_evaluation_with_stats_Check_Davidson.ipynb

# Lewis truth in fiction
jupyter notebook Notebooks/multimodel_evaluation_with_stats_Check_Lewis_Schnitzler.ipynb
jupyter notebook Notebooks/multimodel_evaluation_with_stats_Check_Lewis_Borchert.ipynb

# Iser gap detection
jupyter notebook Notebooks/multimodel_evaluation_iser_3class_Schnitzler.ipynb
jupyter notebook Notebooks/multimodel_evaluation_iser_3class_Borchert.ipynb
```

Note: Experiments require API access to the evaluated models via OpenRouter. Temperature is set to 0.1 for quasi-deterministic behavior with controlled stochasticity.
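For orientation, a request to an OpenRouter-hosted model follows the OpenAI-compatible chat-completion schema; the sketch below only builds the payload (the model slug and prompt are illustrative, not the repository's actual values) and shows where the temperature of 0.1 is set.

```python
# Hedged sketch of an OpenRouter-style chat-completion payload.
# Model name and prompt text are illustrative placeholders.
import json

def build_request(model: str, prompt: str) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.1,  # quasi-deterministic decoding, as noted above
    }

payload = build_request("google/gemini-flash", "Classify: true or false?")
body = json.dumps(payload)  # serialized request body
```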
All notebooks implement checkpoint-based execution:
- Predictions saved incrementally to avoid re-running completed experiments
- Checkpoints stored in `checkpoints_[task]/` directories
- Resume capability after interruptions
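The checkpoint pattern can be sketched as follows (a minimal illustration, not the notebooks' actual code; directory and function names are hypothetical): each prediction is written to disk as soon as it is obtained, and a rerun skips any item whose checkpoint file already exists.

```python
# Hedged sketch of checkpoint-based execution with resume support.
import json
from pathlib import Path

def run_with_checkpoints(items, predict, ckpt_dir="checkpoints_demo"):
    ckpt = Path(ckpt_dir)
    ckpt.mkdir(exist_ok=True)
    results = {}
    for i, item in enumerate(items):
        path = ckpt / f"item_{i}.json"
        if path.exists():
            # already computed in an earlier (possibly interrupted) run
            results[i] = json.loads(path.read_text())
        else:
            results[i] = predict(item)
            path.write_text(json.dumps(results[i]))  # save incrementally
    return results

# usage with a stand-in "model"
out = run_with_checkpoints(["a", "b"], lambda s: {"label": s.upper()})
```

Keeping one file per item makes partial progress durable: an interruption loses at most the in-flight prediction, and completed API calls are never repeated.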
Gold-standard annotations created by humanities scholars:
- Davidson: 170 sentence-context pairs from German literature
- Lewis: 3 principles × 2 corpora with principle-specific annotations
- Iser: 60 instances per corpus with three-class labels (true/false/gap)
Complete experimental outputs including:
- Per-model, per-prompt, per-run metrics (F1, precision, recall, accuracy)
- Statistical test results (p-values, contingency tables)
- Aggregated summaries for all experiments
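For the three-way Iser setting, the per-class metrics above can be computed as in this sketch (labels and predictions are made-up examples, not repository data): precision, recall, and F1 per class, with macro-F1 as their unweighted mean.

```python
# Hedged sketch: per-class precision/recall/F1 for a three-way
# (true/false/gap) task; the gold/pred lists below are illustrative.
def per_class_metrics(gold, pred):
    metrics = {}
    for c in set(gold):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        metrics[c] = {"precision": prec, "recall": rec, "f1": f1}
    return metrics

gold = ["true", "false", "gap", "gap", "true"]
pred = ["true", "gap", "gap", "false", "true"]
m = per_class_metrics(gold, pred)
macro_f1 = sum(v["f1"] for v in m.values()) / len(m)
```

Macro averaging weights the rare "gap" class equally with the binary truth labels, which matters when gaps are the hardest and least frequent category.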
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or issues, please open an issue on GitHub or contact [contact information].
Parts of the experimental code in this repository were developed with assistance from Claude Sonnet 4.5 (Anthropic). All code was thoroughly reviewed, tested, and verified by the authors. The authors take full responsibility for:
- All experimental designs and methodological decisions
- Verification of code correctness and statistical validity
- All results, analyses, and interpretations presented in the paper
- The scientific conclusions drawn from the experiments
AI assistance was used as a development tool to accelerate implementation, similar to using code libraries or IDEs. All outputs were validated against theoretical expectations and statistical best practices by domain experts.
Repository maintained by: [Author Names]
Last updated: January 2026