Agentic statistical analysis with data structure inference, built to detect repeated measures before selecting tests.
Most LLM-based statistical agents jump straight to naming a test. They often skip explicit assumption checking and, more importantly, skip data structure inference.
That misses the most damaging error pattern in biomedical and neuroscience analysis: pseudoreplication. In the sleepstudy validation, treating repeated measurements as independent inflates the apparent sample size from 18 subjects to 180 observations (a 10x inflation), which understates standard errors and can produce statistically invalid conclusions.
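To see why the inflation matters, the standard design-effect approximation for equally sized clusters, n_eff = N / (1 + (m - 1) * ICC), gives a rough sense of how much independent information 180 correlated rows carry. The sketch below is illustrative only: the function name and the ICC values are hypothetical and are not estimated from the sleepstudy data.

```python
def effective_sample_size(n_obs: int, cluster_size: int, icc: float) -> float:
    """Design-effect approximation for equally sized clusters.

    n_obs        total number of rows (N)
    cluster_size rows per subject (m)
    icc          intraclass correlation, between 0 and 1
    """
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must be between 0 and 1")
    design_effect = 1.0 + (cluster_size - 1) * icc
    return n_obs / design_effect


# 18 subjects x 10 days = 180 rows, as in the sleepstudy example.
for icc in (0.0, 0.5, 1.0):  # hypothetical ICC values
    print(icc, round(effective_sample_size(180, 10, icc), 1))
```

When measurements within a subject are perfectly correlated (ICC = 1), the 180 rows are worth 18 independent observations. Only when they are completely uncorrelated (ICC = 0) do they count as 180.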
- Structure inference (Layer 0): inspects columns and row patterns to detect independent data, repeated measures, time-series signals, nested structure, and wide-format repeated measurements.
- Assumption checking (Layer 1): checks normality (including within-subject differences for repeated measures), variance homogeneity, sample adequacy, and independence consistency.
- Test selection (Layer 2): applies a deterministic matrix from structure + assumptions to select a valid test.
- Methods paragraph: generates deterministic methods text directly from pipeline outputs, with no LLM required.
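The Layer 0 idea can be illustrated with a minimal sketch: look for an identifier column whose values repeat across rows. This is not the AStats implementation. The function name, the candidate column names, and the returned keys are assumptions made for illustration, and the sketch covers only the long-format repeated-measures case.

```python
from collections import Counter


def infer_structure(rows, id_columns=("Subject", "subject", "subject_id", "participant")):
    """Return a coarse structure label for long-format rows (a list of dicts)."""
    if not rows:
        return {"structure": "unknown", "id_column": None, "n_units": 0, "n_rows": 0}
    for col in id_columns:
        if col in rows[0]:
            counts = Counter(row[col] for row in rows)
            # An identifier that appears on more than one row signals repeated measures.
            if max(counts.values()) > 1:
                return {
                    "structure": "repeated_measures",
                    "id_column": col,
                    "n_units": len(counts),
                    "n_rows": len(rows),
                }
    # No repeating identifier found: treat each row as an independent unit.
    return {"structure": "independent", "id_column": None,
            "n_units": len(rows), "n_rows": len(rows)}


# Synthetic table shaped like sleepstudy: 18 subjects x 10 days (values are placeholders).
rows = [{"Subject": s, "Days": d, "Reaction": 0.0} for s in range(18) for d in range(10)]
print(infer_structure(rows))
```

On this synthetic table the sketch reports 18 units across 180 rows, which is the distinction a downstream test selector needs before it counts observations.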
| Data structure + intent | Assumption outcome | Selected test |
|---|---|---|
| Independent, 2 groups | Normal + equal variance | Independent t-test |
| Independent, 2 groups | Normal + unequal variance | Welch's t-test |
| Independent, 2 groups | Non-normal | Mann-Whitney U |
| Repeated measures, 2 conditions | Normal differences | Paired t-test |
| Repeated measures, 2 conditions | Non-normal differences | Wilcoxon signed-rank |
| Independent, 3+ groups | Normal + equal variance | One-way ANOVA |
| Independent, 3+ groups | Normal + unequal variance | Welch's ANOVA |
| Independent, 3+ groups | Non-normal | Kruskal-Wallis H |
| Repeated measures, 3+ conditions | n/a | Friedman test |
| Correlation | Both variables normal | Pearson's r |
| Correlation | Either variable non-normal | Spearman's rho |
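The matrix above is deterministic, so it can be expressed as a plain lookup function. The following is a sketch of that mapping, not the project's actual code: the function name, argument names, and structure labels are hypothetical.

```python
def select_test(structure: str, n_levels: int = 2,
                normal: bool = True, equal_variance: bool = True) -> str:
    """Map (structure, number of groups/conditions, assumption results) to a test name.

    For repeated measures, `normal` refers to the within-subject differences.
    For correlation, `normal` means both variables are normal.
    """
    if structure == "correlation":
        return "Pearson's r" if normal else "Spearman's rho"
    if n_levels < 2:
        raise ValueError("need at least 2 groups or conditions")
    if structure == "repeated_measures":
        if n_levels >= 3:
            return "Friedman test"  # the matrix lists no assumption branch here
        return "Paired t-test" if normal else "Wilcoxon signed-rank"
    if structure == "independent":
        if not normal:
            return "Mann-Whitney U" if n_levels == 2 else "Kruskal-Wallis H"
        if not equal_variance:
            return "Welch's t-test" if n_levels == 2 else "Welch's ANOVA"
        return "Independent t-test" if n_levels == 2 else "One-way ANOVA"
    raise ValueError(f"unknown structure: {structure!r}")


print(select_test("repeated_measures", n_levels=10, normal=False))
print(select_test("independent", n_levels=2, normal=True, equal_variance=False))
```

Keeping the mapping in ordinary code rather than in a prompt means the same inputs always yield the same test, which is what makes the selection auditable.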
Example methods output:
"Data were collected in a repeated-measures design with 18 participants, each providing measurements under 10 conditions. Normality of within-subject differences was assessed using the Shapiro-Wilk test (W=0.917, p<0.001). As data were paired and differences were non-normally distributed, a Friedman test was conducted. A statistically significant difference was observed (chi-square=86.085, p<0.001). Effect size was large (Kendall's W=0.531)."
pip install -r requirements.txt
ollama pull qwen2.5
python -m astats.cli analyze path/to/scenario1.csv "compare score across groups" --no-llm

Expected output (benchmark scenario 1):
- Selected test: Independent t-test
- Statistic: -3.8597
- p-value: 0.0003
- Effect size: -0.9966 (cohen_d)
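For readers unfamiliar with the effect size in this output, the sketch below computes Cohen's d with a pooled standard deviation. It assumes the `cohen_d` label refers to this common definition; AStats may compute it differently (for example through a library). The function name and the toy data are illustrative only.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b) -> float:
    """Cohen's d for two independent groups using the pooled sample variance."""
    n1, n2 = len(group_a), len(group_b)
    pooled_var = ((n1 - 1) * variance(group_a) + (n2 - 1) * variance(group_b)) / (n1 + n2 - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Toy data (not the benchmark scenario): group means differ by one pooled SD.
print(cohens_d([1, 2, 3], [2, 3, 4]))
```

The sign follows the order of the groups: a negative value, as in the output above, means the first group's mean is lower than the second's.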
| Scenario Group | AStats | GPT-4.1 |
|---|---|---|
| Standard (assumption checking) | 90% | 40% |
| Structure traps (repeated measures) | 90% | 55% |
| Edge cases | 100% | 30% |
| Overall | 92.5% | 45% |
The GPT-4.1 baseline was run through the GitHub Models API on identical scenario prompts.
On the published sleepstudy dataset (18 subjects, 10 repeated days), AStats inferred a repeated-measures structure from the Subject column and selected a Friedman test. The forced-independent baseline treated all 180 rows as independent and selected a different independent-groups path, which is statistically invalid for this design. AStats reported a significant repeated-measures effect (chi-square=86.085, p<0.001) and generated methods-ready text directly from deterministic outputs.
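Both the effect size and the methods text for this validation can be reproduced from the reported numbers without an LLM. The sketch below assumes Kendall's W is derived from the Friedman statistic as W = chi-square / (n(k - 1)) and that the paragraph is filled from a fixed template. The function names are hypothetical, the p-value arguments are placeholders (the source reports only p<0.001), and the template covers only the non-normal repeated-measures branch.

```python
def kendalls_w(chi_square: float, n_subjects: int, n_conditions: int) -> float:
    """Kendall's W from a Friedman chi-square: W = chi2 / (n * (k - 1))."""
    return chi_square / (n_subjects * (n_conditions - 1))


def format_p(p: float) -> str:
    """Render a p-value the way the methods text does."""
    return "p<0.001" if p < 0.001 else f"p={p:.3f}"


def friedman_methods_text(n_subjects, n_conditions, shapiro_w, shapiro_p,
                          chi_square, p_value, effect_label, alpha=0.05) -> str:
    """Fill a fixed template for the Friedman branch; no LLM involved."""
    w = kendalls_w(chi_square, n_subjects, n_conditions)
    finding = ("A statistically significant difference was observed"
               if p_value < alpha
               else "No statistically significant difference was observed")
    return (
        f"Data were collected in a repeated-measures design with {n_subjects} "
        f"participants, each providing measurements under {n_conditions} conditions. "
        f"Normality of within-subject differences was assessed using the "
        f"Shapiro-Wilk test (W={shapiro_w:.3f}, {format_p(shapiro_p)}). "
        f"As data were paired and differences were non-normally distributed, "
        f"a Friedman test was conducted. "
        f"{finding} (chi-square={chi_square:.3f}, {format_p(p_value)}). "
        f"Effect size was {effect_label} (Kendall's W={w:.3f})."
    )


print(round(kendalls_w(86.085, 18, 10), 3))  # 0.531

# p-values below are placeholders under 0.001; the effect label is passed in, not derived.
print(friedman_methods_text(18, 10, 0.917, 1e-4, 86.085, 1e-4, "large"))
```

With 18 subjects and 10 conditions, 86.085 / (18 * 9) gives the reported W of 0.531, so the effect size in the methods text is consistent with the test statistic.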
Known misroutings in the benchmark:
- Scenario 9: small-sample two-group normal case still routes to Independent t-test instead of Mann-Whitney U.
- Scenario 12: one repeated-measures borderline case routes to Wilcoxon signed-rank instead of Paired t-test.
- Scenario 28: unequal-size repeated-measures downgrade still routes to Welch's t-test instead of Mann-Whitney U.
This is a proof-of-concept for the AStats GSoC 2026 project under INCF, mentored by Suresh Krishna (McGill), Jonathan Morris (UW-Madison), and Yohai-Eliel Berreby (McGill).
Project idea: Neurostars link
- Python 3.10+
- Ollama
- Python packages: pandas, numpy, scipy, statsmodels, pingouin, click, rich, ollama, openai, pytest