AStats

Agentic statistical analysis with data structure inference, built to detect repeated measures before selecting tests.

The Problem

Most LLM-based statistical agents jump straight to naming a test. They often skip explicit assumption checking and, more importantly, skip data structure inference.

That misses the most damaging error pattern in biomedical and neuroscience analysis: pseudoreplication. In the sleepstudy validation, treating repeated measurements as independent inflates the apparent sample size from 18 subjects to 180 observations (10x inflation), which can produce statistically invalid conclusions.
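The inflation is easy to see by counting rows against independent units. The sketch below builds a placeholder table with the same shape as sleepstudy (18 subjects, 10 days each); the column names and the table itself are assumptions for illustration, not the real data.

```python
# Hypothetical long-format layout: one row per (subject, day) measurement.
# 18 subjects x 10 days mirrors the sleepstudy shape; no real values are used.
rows = [{"Subject": s, "Days": d} for s in range(1, 19) for d in range(10)]

n_rows = len(rows)                              # what a naive analysis counts
n_subjects = len({r["Subject"] for r in rows})  # the independent units
inflation = n_rows / n_subjects

print(n_rows, n_subjects, inflation)  # 180 18 10.0
```

A test that takes n = 180 treats every row as a separate participant, which is the error structure inference is meant to catch.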

How AStats Works

Structure inference (Layer 0): inspects columns and row patterns to detect independent data, repeated measures, time-series signals, nested structure, and wide-format repeated measurements.
Assumption checking (Layer 1): checks normality (including within-subject differences for repeated measures), variance homogeneity, sample adequacy, and independence consistency.
Test selection (Layer 2): applies a deterministic matrix from structure + assumptions to select a valid test.
Methods paragraph: generates deterministic methods text directly from pipeline outputs, with no LLM required.
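As a rough illustration of the Layer 0 idea, the sketch below flags repeated measures when an ID-like column contains the same value on more than one row. The function name `infer_structure`, the `ID_COLUMNS` set, and the two return labels are hypothetical; the heuristics AStats actually uses are not shown here and are likely broader (time series, nesting, wide format).

```python
from collections import Counter

# Hypothetical set of ID-like column names, matched case-insensitively.
ID_COLUMNS = {"subject", "participant", "subject_id", "participant_id", "id"}


def infer_structure(rows):
    """Return 'repeated_measures' if an ID-like column repeats, else 'independent'."""
    if not rows:
        return "independent"
    for column in rows[0]:
        if column.lower() not in ID_COLUMNS:
            continue
        counts = Counter(row[column] for row in rows)
        # An ID seen on more than one row means several rows share one unit.
        if max(counts.values()) > 1:
            return "repeated_measures"
    return "independent"


# Placeholder tables: only the layout matters, the values are not real data.
long_rows = [{"Subject": s, "Days": d} for s in range(1, 19) for d in range(10)]
flat_rows = [{"group": g, "score": 0.0} for g in ("a", "b") for _ in range(5)]

print(infer_structure(long_rows))  # repeated_measures
print(infer_structure(flat_rows))  # independent
```

Running this check before any test is named is what keeps the later layers from treating the 180 rows as 180 participants.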

Layer 2 Decision Matrix

Data structure + intent	If assumptions hold	If assumptions fail	Selected test
Independent, 2 groups	Normal + equal variance	Normal + unequal variance / non-normal	Independent t-test / Welch's t-test / Mann-Whitney U
Repeated measures, 2 conditions	Normal differences	Non-normal differences	Paired t-test / Wilcoxon signed-rank
Independent, 3+ groups	Normal + equal variance	Normal + unequal variance / non-normal	One-way ANOVA / Welch's ANOVA / Kruskal-Wallis H
Repeated measures, 3+ conditions	n/a	n/a	Friedman test
Correlation	Both variables normal	Either non-normal	Pearson's r / Spearman's rho
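One way to read the matrix is as a pure function from structure and assumption flags to a test name. The sketch below encodes that reading; the function name `select_test` and its parameters are hypothetical, and the pairing of each slash-separated condition with its test follows the order in which the table lists them, which is an interpretation rather than something the table states outright.

```python
def select_test(structure, n_levels, normal, equal_variance=True):
    """Map structure, number of groups/conditions, and assumption flags to a test name."""
    if structure == "correlation":
        return "Pearson's r" if normal else "Spearman's rho"
    if structure == "repeated_measures":
        if n_levels >= 3:
            return "Friedman test"  # the table lists no parametric branch here
        return "Paired t-test" if normal else "Wilcoxon signed-rank"
    if structure == "independent":
        two_groups = n_levels == 2
        if not normal:
            return "Mann-Whitney U" if two_groups else "Kruskal-Wallis H"
        if equal_variance:
            return "Independent t-test" if two_groups else "One-way ANOVA"
        return "Welch's t-test" if two_groups else "Welch's ANOVA"
    raise ValueError(f"unknown structure: {structure!r}")


print(select_test("repeated_measures", 10, normal=False))  # Friedman test
print(select_test("independent", 2, normal=True))          # Independent t-test
```

Because the mapping has no randomness and no model call, the same inputs always give the same test, which is what makes the selection auditable.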

Example methods output:

"Data were collected in a repeated-measures design with 18 participants, each providing measurements under 10 conditions. Normality of within-subject differences was assessed using the Shapiro-Wilk test (W=0.917, p<0.001). As data were paired and differences were non-normally distributed, a Friedman test was conducted. A statistically significant difference was observed (chi-square=86.085, p<0.001). Effect size was large (Kendall's W=0.531)."

Quick Start

pip install -r requirements.txt
ollama pull qwen2.5
python -m astats.cli analyze path/to/scenario1.csv "compare score across groups" --no-llm

Expected output (benchmark scenario 1):

Selected test: Independent t-test
Statistic: -3.8597
p-value: 0.0003
Effect size: -0.9966 (cohen_d)
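Output like this can be turned into methods text with plain string templating, which is presumably how the pipeline avoids needing an LLM for that step. The sketch below is an assumption about the approach, not AStats' actual template: the function names, the dict keys, and the sentence wording are all hypothetical, and only the four values come from the output above.

```python
def format_p(p_value):
    """Report small p-values as a bound rather than a long decimal."""
    return "p<0.001" if p_value < 0.001 else f"p={p_value:.3f}"


def results_sentence(result):
    """Render one results sentence from a dict of pipeline outputs (hypothetical keys)."""
    return (
        f"Groups were compared using the {result['test']} "
        f"(statistic={result['statistic']:.3f}, {format_p(result['p_value'])}; "
        f"{result['effect_name']}={result['effect_size']:.3f})."
    )


# Values copied from the scenario 1 output above.
scenario_1 = {
    "test": "Independent t-test",
    "statistic": -3.8597,
    "p_value": 0.0003,
    "effect_name": "cohen_d",
    "effect_size": -0.9966,
}
print(results_sentence(scenario_1))
```

Since every word comes from a fixed template and every number from the computed result, the text cannot drift from the statistics it reports.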

Benchmark Results

Scenario Group	AStats	GPT-4.1
Standard (assumption checking)	90%	40%
Structure traps (repeated measures)	90%	55%
Edge cases	100%	30%
Overall	92.5%	45%

The GPT-4.1 baseline was run through the GitHub Models API on identical scenario prompts.
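The overall figures are consistent with a scenario-weighted average of the three groups. The group sizes are not given in the table, so the 10/20/10 split below is an assumption; it was chosen because it reproduces both overall percentages and matches three failures out of 40 scenarios.

```python
# ASSUMPTION: group sizes are not stated in the table; 10/20/10 is a guess
# that happens to reproduce the reported overall figures.
group_sizes = {"standard": 10, "structure_traps": 20, "edge_cases": 10}

# Per-group accuracies copied from the table above.
astats_acc = {"standard": 0.90, "structure_traps": 0.90, "edge_cases": 1.00}
gpt41_acc = {"standard": 0.40, "structure_traps": 0.55, "edge_cases": 0.30}


def overall(accuracy):
    """Scenario-weighted mean accuracy across groups."""
    total = sum(group_sizes.values())
    return sum(accuracy[g] * group_sizes[g] for g in group_sizes) / total


print(round(overall(astats_acc) * 100, 1))  # 92.5
print(round(overall(gpt41_acc) * 100, 1))   # 45.0
```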

Real Dataset Validation

On the published sleepstudy dataset (18 subjects, 10 repeated days), AStats inferred a repeated-measures structure from the Subject column and selected a Friedman test. The forced-independent baseline treated all 180 rows as independent and selected a different independent-groups path, which is statistically invalid for this design. AStats reported a significant repeated-measures effect (chi-square=86.085, p<0.001) and generated methods-ready text directly from deterministic outputs.
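The reported effect size can be checked against the Friedman statistic by hand, since Kendall's W is the chi-square statistic divided by n(k - 1) for n subjects and k conditions. A short check using only the figures reported here:

```python
# Figures reported for the sleepstudy validation.
chi_square = 86.085   # Friedman statistic
n_subjects = 18
k_conditions = 10

# Kendall's W = chi-square / (n * (k - 1)); 0 means no agreement, 1 perfect agreement.
kendalls_w = chi_square / (n_subjects * (k_conditions - 1))
print(round(kendalls_w, 3))  # 0.531
```

The result agrees with the Kendall's W of 0.531 given in the generated methods text.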

Quoted methods paragraph:

"Data were collected in a repeated-measures design with 18 participants, each providing measurements under 10 conditions. Normality of within-subject differences was assessed using the Shapiro-Wilk test (W=0.917, p<0.001). As data were paired and differences were non-normally distributed, a Friedman test was conducted. A statistically significant difference was observed (chi-square=86.085, p<0.001). Effect size was large (Kendall's W=0.531)."

Known Limitations

Scenario 9: small-sample two-group normal case still routes to Independent t-test instead of Mann-Whitney U.
Scenario 12: one repeated-measures borderline case routes to Wilcoxon instead of Paired t-test.
Scenario 28: unequal-size repeated-measures downgrade still routes to Welch's t-test instead of Mann-Whitney U.

GSoC 2026

This is a proof-of-concept for the AStats GSoC 2026 project under INCF, mentored by Suresh Krishna (McGill), Jonathan Morris (UW-Madison), and Yohai-Eliel Berreby (McGill).

Project idea: Neurostars link

Requirements

Python 3.10+
Ollama
Python packages:
- pandas
- numpy
- scipy
- statsmodels
- pingouin
- click
- rich
- ollama
- openai
- pytest

