Ashisane/astats

AStats

Agentic statistical analysis with data structure inference, built to detect repeated measures before selecting tests.

The Problem

Most LLM-based statistical agents jump straight to naming a test. They often skip explicit assumption checking and, more importantly, skip data structure inference.

That omission invites the most damaging error pattern in biomedical and neuroscience analysis: pseudoreplication. In the sleepstudy validation, treating repeated measurements as independent inflates the apparent sample size from 18 subjects to 180 observations (a 10x inflation), which typically understates standard errors and can produce statistically invalid conclusions.
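The inflation figure above is simple arithmetic, and a minimal sketch makes it concrete. The long-format layout and the `subject` and `day` field names are assumptions chosen for illustration; they are not taken from the AStats code.

```python
# Hypothetical long-format layout: one row per (subject, day) measurement,
# mirroring the sleepstudy shape of 18 subjects measured on 10 days each.
rows = [{"subject": s, "day": d} for s in range(1, 19) for d in range(10)]

n_observations = len(rows)                       # what a naive analysis counts
n_subjects = len({r["subject"] for r in rows})   # the independent units
inflation = n_observations / n_subjects

print(n_observations, n_subjects, inflation)
```

Counting distinct subjects rather than rows is the check that separates the two readings of the same table.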

How AStats Works

  1. Structure inference (Layer 0): inspects columns and row patterns to detect independent data, repeated measures, time-series signals, nested structure, and wide-format repeated measurements.
  2. Assumption checking (Layer 1): checks normality (including within-subject differences for repeated measures), variance homogeneity, sample adequacy, and independence consistency.
  3. Test selection (Layer 2): applies a deterministic matrix from structure + assumptions to select a valid test.
  4. Methods paragraph: generates deterministic methods text directly from pipeline outputs, with no LLM required.
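As a rough illustration of the Layer 0 idea, the sketch below flags repeated measures when an identifier column contains the same value on more than one row. The function name `infer_structure`, its return labels, and the single-column heuristic are hypothetical; the actual AStats inference also covers time-series, nested, and wide-format cases, none of which are handled here.

```python
from collections import Counter

def infer_structure(rows, id_column):
    """Return a (label, n_units) pair from a list of row dicts.

    Assumption: a repeated identifier means repeated measures. This is a
    simplified stand-in, not the heuristic AStats itself uses.
    """
    counts = Counter(row[id_column] for row in rows)
    label = "repeated_measures" if any(c > 1 for c in counts.values()) else "independent"
    return label, len(counts)

# Two made-up datasets: one with repeated subjects, one without.
repeated = [{"Subject": s, "Days": d} for s in (308, 309) for d in range(3)]
independent = [{"Subject": s, "Days": 0} for s in (1, 2, 3, 4)]

print(infer_structure(repeated, "Subject"))
print(infer_structure(independent, "Subject"))
```

Returning the number of distinct units alongside the label keeps the true sample size available to later stages.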

Layer 2 Decision Matrix

| Data structure + intent | If assumptions hold | If assumptions fail | Selected test |
| --- | --- | --- | --- |
| Independent, 2 groups | Normal + equal variance | Normal + unequal variance / non-normal | Independent t-test / Welch's t-test / Mann-Whitney U |
| Repeated measures, 2 conditions | Normal differences | Non-normal differences | Paired t-test / Wilcoxon signed-rank |
| Independent, 3+ groups | Normal + equal variance | Normal + unequal variance / non-normal | One-way ANOVA / Welch's ANOVA / Kruskal-Wallis H |
| Repeated measures, 3+ conditions | n/a | n/a | Friedman test |
| Correlation | Both variables normal | Either non-normal | Pearson's r / Spearman's rho |
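The matrix can be read as a pure lookup from structure and assumption flags to a test name. The sketch below encodes one reading of it, in which the slash-separated tests correspond in order to the slash-separated conditions. The function `select_test` and its parameter names are hypothetical and do not reflect the AStats API.

```python
def select_test(structure, n_levels=2, normal=True, equal_variance=True):
    """Map structure + assumption flags to a test name (illustrative only).

    structure: "independent", "repeated", or "correlation".
    n_levels:  number of groups or conditions (ignored for correlation).
    normal:    for "repeated", refers to the within-subject differences.
    """
    if structure == "correlation":
        return "Pearson's r" if normal else "Spearman's rho"
    if structure == "repeated":
        if n_levels >= 3:
            return "Friedman test"  # the matrix lists no assumption branch here
        return "Paired t-test" if normal else "Wilcoxon signed-rank"
    if structure == "independent":
        two_groups = n_levels == 2
        if not normal:
            return "Mann-Whitney U" if two_groups else "Kruskal-Wallis H"
        if not equal_variance:
            return "Welch's t-test" if two_groups else "Welch's ANOVA"
        return "Independent t-test" if two_groups else "One-way ANOVA"
    raise ValueError(f"unknown structure: {structure!r}")

print(select_test("repeated", n_levels=10, normal=False))
print(select_test("independent", n_levels=2, equal_variance=False))
```

Because the function has no hidden state, the same inputs always give the same test, which is the property the deterministic matrix is meant to provide.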

Example methods output:

"Data were collected in a repeated-measures design with 18 participants, each providing measurements under 10 conditions. Normality of within-subject differences was assessed using the Shapiro-Wilk test (W=0.917, p<0.001). As data were paired and differences were non-normally distributed, a Friedman test was conducted. A statistically significant difference was observed (chi-square=86.085, p<0.001). Effect size was large (Kendall's W=0.531)."
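The effect size in this paragraph can be checked by hand. Kendall's W is related to the Friedman chi-square by W = chi-square / (n(k - 1)), where n is the number of subjects and k the number of conditions. The helper name `kendalls_w` below is hypothetical; the numbers are the ones quoted in the methods text.

```python
def kendalls_w(chi_square, n_subjects, n_conditions):
    """Kendall's W from a Friedman chi-square: W = chi2 / (n * (k - 1))."""
    return chi_square / (n_subjects * (n_conditions - 1))

# Values quoted in the example methods output: 18 participants, 10 conditions.
w = kendalls_w(86.085, n_subjects=18, n_conditions=10)
print(round(w, 3))  # 0.531
```

W ranges from 0 (no agreement in rankings across subjects) to 1 (identical rankings), so the reported statistic and effect size are mutually consistent.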

Quick Start

pip install -r requirements.txt
ollama pull qwen2.5
python -m astats.cli analyze path/to/scenario1.csv "compare score across groups" --no-llm

Expected output (benchmark scenario 1):

  • Selected test: Independent t-test
  • Statistic: -3.8597
  • p-value: 0.0003
  • Effect size: -0.9966 (cohen_d)
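For readers who want to see how a statistic and a `cohen_d` value of this kind relate, here is a minimal standard-library sketch. It assumes the pooled-standard-deviation form of Cohen's d and Student's equal-variance t; whether AStats uses exactly this definition is an assumption. The data are made up and do not reproduce benchmark scenario 1.

```python
import math
import statistics

def cohen_d(a, b):
    """Cohen's d with a pooled standard deviation (sign follows mean(a) - mean(b))."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * statistics.variance(a)
                  + (n2 - 1) * statistics.variance(b)) / (n1 + n2 - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)

def student_t(a, b):
    """Equal-variance t statistic, written in terms of d: t = d / sqrt(1/n1 + 1/n2)."""
    return cohen_d(a, b) / math.sqrt(1 / len(a) + 1 / len(b))

# Made-up scores for two independent groups.
group_a = [4.1, 5.0, 4.6, 5.3, 4.8]
group_b = [5.9, 6.4, 5.6, 6.1, 6.6]

d = cohen_d(group_a, group_b)
t = student_t(group_a, group_b)
print(f"d={d:.4f}, t={t:.4f}")
```

A negative sign on both values simply means the first group's mean is lower than the second's.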

Benchmark Results

| Scenario Group | AStats | GPT-4.1 |
| --- | --- | --- |
| Standard (assumption checking) | 90% | 40% |
| Structure traps (repeated measures) | 90% | 55% |
| Edge cases | 100% | 30% |
| Overall | 92.5% | 45% |

The GPT-4.1 baseline was run through the GitHub Models API on identical scenario prompts.

Real Dataset Validation

On the published sleepstudy dataset (18 subjects, 10 repeated days), AStats inferred a repeated-measures structure from the Subject column and selected a Friedman test. The forced-independent baseline treated all 180 rows as independent and selected a different independent-groups path, which is statistically invalid for this design. AStats reported a significant repeated-measures effect (chi-square=86.085, p<0.001) and generated methods-ready text directly from deterministic outputs.

Quoted methods paragraph:

"Data were collected in a repeated-measures design with 18 participants, each providing measurements under 10 conditions. Normality of within-subject differences was assessed using the Shapiro-Wilk test (W=0.917, p<0.001). As data were paired and differences were non-normally distributed, a Friedman test was conducted. A statistically significant difference was observed (chi-square=86.085, p<0.001). Effect size was large (Kendall's W=0.531)."
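The repeated-measures path described in this section can be sketched in two steps: group the long-format rows by subject, then rank the conditions within each subject and compute the Friedman chi-square. This is a standard-library illustration on made-up data, not the AStats implementation. The function names are hypothetical, and tie correction is omitted, so it assumes no tied values within a subject (`scipy.stats.friedmanchisquare` handles ties).

```python
from collections import defaultdict

def to_blocks(rows, subject_key, condition_key, value_key):
    """Group long-format rows into one {condition: value} dict per subject."""
    blocks = defaultdict(dict)
    for row in rows:
        blocks[row[subject_key]][row[condition_key]] = row[value_key]
    return blocks

def friedman_chi2(blocks):
    """Friedman chi-square without tie correction (assumes no ties per subject)."""
    conditions = sorted(next(iter(blocks.values())))
    n, k = len(blocks), len(conditions)
    rank_sums = dict.fromkeys(conditions, 0)
    for measurements in blocks.values():
        # Rank conditions within this subject from smallest to largest value.
        ordered = sorted(conditions, key=lambda c: measurements[c])
        for rank, condition in enumerate(ordered, start=1):
            rank_sums[condition] += rank
    total = sum(r * r for r in rank_sums.values())
    return 12.0 * total / (n * k * (k + 1)) - 3.0 * n * (k + 1)

# Made-up data: 3 subjects, 3 days, reaction time rising every day for everyone.
rows = [{"Subject": s, "Days": d, "Reaction": 250 + 5 * s + 10 * d}
        for s in (1, 2, 3) for d in (0, 1, 2)]

chi2 = friedman_chi2(to_blocks(rows, "Subject", "Days", "Reaction"))
print(chi2)
```

In this toy case every subject ranks the days identically, so the statistic reaches its maximum of n(k - 1) and Kendall's W would equal 1.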

Known Limitations

  • Scenario 9: small-sample two-group normal case still routes to Independent t-test instead of Mann-Whitney U.
  • Scenario 12: one repeated-measures borderline case routes to Wilcoxon instead of Paired t-test.
  • Scenario 28: unequal-size repeated-measures downgrade still routes to Welch's t-test instead of Mann-Whitney U.

GSoC 2026

This is a proof-of-concept for the AStats GSoC 2026 project under INCF, mentored by Suresh Krishna (McGill), Jonathan Morris (UW-Madison), and Yohai-Eliel Berreby (McGill).

Project idea: Neurostars link

Requirements

  • Python 3.10+
  • Ollama
  • Python packages:
    • pandas
    • numpy
    • scipy
    • statsmodels
    • pingouin
    • click
    • rich
    • ollama
    • openai
    • pytest
