This repository contains the complete experimental code, data, and results for the paper:
"Evaluating Humanities Theory Alignment in LLMs: Incremental Prompting and Statistical Assessment"
We evaluate whether large language models (LLMs) exhibit behavioral alignment with three humanities theories—Davidson's truth-conditional semantics, Lewis's truth in fiction, and Iser's concept of textual gaps—using a top-down, theory-driven black-box framework. Core assumptions are reconstructed into testable behavioral rules and assessed via controlled classification tasks (binary for Davidson/Lewis; three-way for Iser) with systematic prompt comparisons and significance testing.
Our experiments show that theory-uninformed classification prompts generally outperform theory-enriched prompts in Lewis and Iser settings, while theory prompts help in Davidson tasks. Removing formal logical notation unexpectedly degrades performance, indicating symbols act as structural scaffolding rather than mere formal noise. Gemini Flash consistently achieves the highest scores across tasks and corpora, while the Iser gap detection task remains substantially harder than binary truth-conditional judgments. Statistical tests confirm robust prompt effects and the failure of basic prompts; however, model behavior under incremental theory exposure is unstable and architecture-dependent.
```
.
├── Notebooks/                 # Jupyter notebooks for experiments
│   ├── davidson_extended_nosymbol.ipynb
│   ├── interannotator_agreement_multiple_files.ipynb
│   ├── multimodel_evaluation_iser_3class_Borchert.ipynb
│   ├── multimodel_evaluation_iser_3class_Schnitzler.ipynb
│   ├── multimodel_evaluation_with_stats_Check_Davidson.ipynb
│   ├── multimodel_evaluation_with_stats_Check_Lewis_Borchert.ipynb
│   └── multimodel_evaluation_with_stats_Check_Lewis_Schnitzler.ipynb
│
├── Reference Data/            # Gold standard test sets
│   ├── agreement_Prinzip_1_lewis_borchert.xlsx
│   ├── agreement_Prinzip_1_lewis_schnitzler.xlsx
│   ├── agreement_Prinzip_2_lewis_borchert.xlsx
│   ├── agreement_Prinzip_2_lewis_schnitzler.xlsx
│   ├── agreement_Prinzip_3_lewis_borchert.xlsx
│   ├── agreement_Prinzip_3_lewis_schnitzler.xlsx
│   ├── davidson_goldstandard.xlsx
│   ├── Iser_Testdaten_Borchert_final.xlsx
│   └── Iser_Testdaten_Schnitzler_final.xlsx
│
└── Result Tables/             # Experimental results
    ├── davidson_multimodel_evaluation_results.xlsx
    ├── davidson_multimodel_evaluation_summary.xlsx
    ├── davidson_results_mcnemar_tests_3runs.xlsx
    ├── davidson_results_metrics_3runs.xlsx
    ├── iser_borchert_results_cochrans_q_3runs.xlsx
    ├── iser_borchert_results_metrics_3runs.xlsx
    ├── iser_borchert_results_per_class_metrics.xlsx
    ├── iser_schnitzler_results_cochrans_q_3runs.xlsx
    ├── iser_schnitzler_results_metrics_3runs.xlsx
    ├── iser_schnitzler_results_per_class_metrics.xlsx
    ├── lewis_borchert_multimodel_evaluation_results.xlsx
    ├── lewis_borchert_multimodel_evaluation_summary.xlsx
    ├── lewis_borchert_results_mcnemar_tests_2runs.xlsx
    ├── lewis_borchert_results_metrics_2runs.xlsx
    ├── lewis_borchert_results_paired_test_2runs.xlsx
    ├── lewis_borchert_results_wilcoxon_tests_2runs.xlsx
    ├── lewis_schnitzler_multimodel_evaluation_results.xlsx
    ├── lewis_schnitzler_multimodel_evaluation_summary.xlsx
    ├── lewis_schnitzler_results_mcnemar_tests_2runs.xlsx
    ├── lewis_schnitzler_results_metrics_2runs.xlsx
    ├── lewis_schnitzler_results_paired_test_2runs.xlsx
    └── lewis_schnitzler_results_wilcoxon_tests_2runs.xlsx
```
- Davidson (Truth-Conditional Semantics): Binary classification determining whether sentences are true or false relative to literary contexts (170 instances)
- Lewis (Truth in Fiction): Binary classification across Lewis's three principles of truth in fiction on two corpora (Schnitzler, Borchert)
- Iser (Narrative Gaps): Three-way classification (true/false/gap) identifying textual indeterminacies across two corpora (60 instances each)
- Google Gemini 3 Flash Preview
- Anthropic Claude Sonnet 4.5
- Qwen 3 Next 80B A3B Instruct
- Mistral Ministral 14B 2512
- Basic: Minimal instruction (incomplete output specification)
- Theory-uninformed classification: Prompts with explicit classification task
- Theory-enriched: Prompts enriched with theoretical frameworks
- Theory-enriched (no symbols): Natural language equivalents of formal notation (Davidson only)
- McNemar's Test: Pairwise prompt comparisons (binary tasks)
- Cochran's Q Test: Multi-classifier comparisons (multi-class tasks)
- Inter-annotator Agreement: Cohen's Kappa
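The pairwise prompt comparison above can be illustrated with a small sketch (not the repository's code, and the predictions below are made up): an exact McNemar test counts the items on which exactly one of two prompts is correct against the gold labels, and tests whether those discordant outcomes split evenly.

```python
# Hedged sketch of an exact McNemar test for comparing two prompts' binary
# predictions against the same gold labels; data below is illustrative only.
from scipy.stats import binomtest

def mcnemar_exact(gold, pred_a, pred_b):
    """Exact (binomial) McNemar test on the discordant pairs."""
    # b: A correct while B wrong; c: A wrong while B correct
    b = sum(1 for g, a, p in zip(gold, pred_a, pred_b) if a == g and p != g)
    c = sum(1 for g, a, p in zip(gold, pred_a, pred_b) if a != g and p == g)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    return binomtest(b, n, 0.5).pvalue

gold     = [1, 0, 1, 1, 0, 1, 0, 0]
prompt_a = [1, 0, 1, 1, 0, 0, 0, 0]  # hypothetical predictions under prompt A
prompt_b = [1, 1, 0, 1, 1, 0, 0, 0]  # hypothetical predictions under prompt B
p = mcnemar_exact(gold, prompt_a, prompt_b)
```

Because only disagreements between the two prompts carry information, the test is well suited to paired designs where both prompts see the same test instances.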
```bash
pip install pandas numpy scikit-learn openai statsmodels scipy seaborn scikit-posthocs openpyxl
```

Set your OpenRouter API key:

```bash
export MY_OpenRouter="your-api-key-here"
```

Each notebook is self-contained and runs independently:
```bash
# Davidson truth-conditional task
jupyter notebook Notebooks/multimodel_evaluation_with_stats_Check_Davidson.ipynb

# Lewis truth in fiction
jupyter notebook Notebooks/multimodel_evaluation_with_stats_Check_Lewis_Schnitzler.ipynb
jupyter notebook Notebooks/multimodel_evaluation_with_stats_Check_Lewis_Borchert.ipynb

# Iser gap detection
jupyter notebook Notebooks/multimodel_evaluation_iser_3class_Schnitzler.ipynb
jupyter notebook Notebooks/multimodel_evaluation_iser_3class_Borchert.ipynb
```

Note: Experiments require API access to the evaluated models via OpenRouter. Temperature is set to 0.1 for quasi-deterministic behavior with controlled stochasticity.
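For orientation, a request to an OpenRouter-hosted model follows the OpenAI-compatible chat-completion schema; the sketch below only builds the payload (the model slug and prompt are illustrative, not the repository's actual values) and shows where the temperature of 0.1 is set.

```python
# Hedged sketch of an OpenRouter-style chat-completion payload.
# Model name and prompt text are illustrative placeholders.
import json

def build_request(model: str, prompt: str) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.1,  # quasi-deterministic decoding, as noted above
    }

payload = build_request("google/gemini-flash", "Classify: true or false?")
body = json.dumps(payload)  # serialized request body
```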
All notebooks implement checkpoint-based execution:
- Predictions saved incrementally to avoid re-running completed experiments
- Checkpoints stored in `checkpoints_[task]/` directories
- Resume capability after interruptions
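The checkpoint pattern can be sketched as follows (a minimal illustration, not the notebooks' actual code; directory and function names are hypothetical): each prediction is written to disk as soon as it is obtained, and a rerun skips any item whose checkpoint file already exists.

```python
# Hedged sketch of checkpoint-based execution with resume support.
import json
from pathlib import Path

def run_with_checkpoints(items, predict, ckpt_dir="checkpoints_demo"):
    ckpt = Path(ckpt_dir)
    ckpt.mkdir(exist_ok=True)
    results = {}
    for i, item in enumerate(items):
        path = ckpt / f"item_{i}.json"
        if path.exists():
            # already computed in an earlier (possibly interrupted) run
            results[i] = json.loads(path.read_text())
        else:
            results[i] = predict(item)
            path.write_text(json.dumps(results[i]))  # save incrementally
    return results

# usage with a stand-in "model"
out = run_with_checkpoints(["a", "b"], lambda s: {"label": s.upper()})
```

Keeping one file per item makes partial progress durable: an interruption loses at most the in-flight prediction, and completed API calls are never repeated.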
Gold-standard annotations created by humanities scholars:
- Davidson: 170 sentence-context pairs from German literature
- Lewis: 3 principles × 2 corpora with principle-specific annotations
- Iser: 60 instances per corpus with three-class labels (true/false/gap)
Complete experimental outputs including:
- Per-model, per-prompt, per-run metrics (F1, precision, recall, accuracy)
- Statistical test results (p-values, contingency tables)
- Aggregated summaries for all experiments
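For the three-way Iser setting, the per-class metrics above can be computed as in this sketch (labels and predictions are made-up examples, not repository data): precision, recall, and F1 per class, with macro-F1 as their unweighted mean.

```python
# Hedged sketch: per-class precision/recall/F1 for a three-way
# (true/false/gap) task; the gold/pred lists below are illustrative.
def per_class_metrics(gold, pred):
    metrics = {}
    for c in set(gold):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        metrics[c] = {"precision": prec, "recall": rec, "f1": f1}
    return metrics

gold = ["true", "false", "gap", "gap", "true"]
pred = ["true", "gap", "gap", "false", "true"]
m = per_class_metrics(gold, pred)
macro_f1 = sum(v["f1"] for v in m.values()) / len(m)
```

Macro averaging weights the rare "gap" class equally with the binary truth labels, which matters when gaps are the hardest and least frequent category.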
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or issues, please open an issue on GitHub or contact [contact information].
Parts of the experimental code in this repository were developed with assistance from Claude Sonnet 4.5 (Anthropic). All code was thoroughly reviewed, tested, and verified by the authors. The authors take full responsibility for:
- All experimental designs and methodological decisions
- Verification of code correctness and statistical validity
- All results, analyses, and interpretations presented in the paper
- The scientific conclusions drawn from the experiments
AI assistance was used as a development tool to accelerate implementation, similar to using code libraries or IDEs. All outputs were validated against theoretical expectations and statistical best practices by domain experts.
Repository maintained by: [Author Names]
Last updated: January 2026