Author: Eitan Hemed
robusta is a statistics package in Python3 providing an interface to many common statistical analyses, performed using through R and RPY2.
PLEASE NOTE robusta is under active development and is supplied as-is with no guarantees.
Install with pip using pip install robusta-stats
, see also Installation.
See here.
For the most recent, thorough tutorial in the different features of robusta, head on to Google Colab.
Some of the features are shown below.
This could take ~15 seconds as many R libraries are imported under the hood. If you begin with an empty R environment the first you import robusta should take at least a couple of minutes, as R dependencies will be installed.
import robusta as rst
First, define a helper function used to pretty-print output of dataframes when converting the notebook to .md (credit).
from tabulate import tabulate
import IPython.display as d
def md_print_df(df):
md = tabulate(df, headers='keys', tablefmt='pipe')
md = md.replace('| |','| %s |' % (df.index.name if df.index.name else ''))
return d.Markdown(md)
First off, we need data. Using robusta we can import R built-in and some imported datasets. You can get a full list of the datasets, similarly to calling to data()
with no input arguments in R.
md_print_df(rst.get_available_datasets().tail())
Package | Item | Description | |
---|---|---|---|
284 | ARTool | Higgins1990Table5 | Split-plot Experiment Examining Effect of Moisture and Fertilizer on Dry Matter in Peat Pots |
285 | ARTool | Higgins1990Table1.art | Aligned Rank Transformed Version of Higgins1990Table1 |
286 | ARTool | Higgins1990Table1 | Synthetic 3x3 Factorial Randomized Experiment |
287 | ARTool | ElkinABC | Synthetic 2x2x2 Within-Subjects Experiment |
288 | ARTool | ElkinAB | Synthetic 2x2 Within-Subjects Experiment |
We can import a dataset using rst.load_dataset
iris = rst.load_dataset('iris')
md_print_df(iris.head())
dataset_rownames | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|---|
0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 2 | 4.9 | 3 | 1.4 | 0.2 | setosa |
2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5 | 5 | 3.6 | 1.4 | 0.2 | setosa |
Analyses are performed through using designated model objects that also store the . The model objects are returned through calls to the function API. In this example we create a model (m
) object by calling t2samples
. m
will be used to fit the statistical model, returning the results
object.
Here is a paired-samples t-test using the Students' sleep dataset previously loaded:
# Create the model
m = rst.groupwise.T2Samples(
data=rst.load_dataset('sleep'), independent='group',
dependent='extra', subject='ID', paired=True, tail='less')
# Dataframe format of the results
md_print_df(m.report_table())
t | df | p-value | Cohen-d Low | Cohen-d | Cohen-d High | |
---|---|---|---|---|---|---|
1 | -4.06213 | 9 | 0.00141645 | -2.11801 | -1.28456 | -0.414622 |
# Textual report of the results - copy and paste into your results section!
m.report_text()
't(9) = -4.06, p = 0.001'
We can reset the models in order to update the model parameters and re-fit it. In this example, we run the same model an an independent samples t-test:
m.reset(paired=False, assume_equal_variance=True, refit=True)
md_print_df(m.report_table())
t | df | p-value | Cohen-d Low | Cohen-d | Cohen-d High | |
---|---|---|---|---|---|---|
1 | -1.86081 | 18 | 0.0395934 | -1.73882 | -0.832181 | 0.0954595 |
bayes_t2samples
and bayes_t1sample
allow you to calculate Bayes factors or sample from the posterior distribution:
m = rst.groupwise.BayesT2Samples(
data=rst.load_dataset('mtcars'), subject='dataset_rownames',
dependent='mpg', independent='am', prior_scale=0.5,
paired=False)
md_print_df(m.report_table())
model | bf | error | |
---|---|---|---|
0 | Alt., r=0.5 | 71.3861 | 7.97835e-07 |
# Test different null intervals and prior values:
m.reset(prior_scale=0.1, null_interval=[0, 0.5], refit=True)
print(f'{m.report_text()}\n\n')
md_print_df(m.report_table())
Alt., r=0.1 [BF1:0 = 18.64, Error = 0.001%]
model | bf | error | |
---|---|---|---|
0 | Alt., r=0.1 | 18.6411 | 2.33663e-05 |
use Anova
to run between, within or mixed-design ANOVA, we load the anxiety dataset for the next demonstrations.
For non-parametric ANOVAs see KruskalWallisTest
, FriedmanTest
and AlignedRanksTest
# Load the dataset and modify it from a 'wide' to 'long' format dataframe
anxiety = rst.load_dataset('anxiety').set_index(['id', 'group']
).filter(regex='^t[1-3]$').stack().reset_index().rename(
columns={0: 'score',
'level_2': 'time'})
md_print_df(anxiety.head())
id | group | time | score | |
---|---|---|---|---|
0 | 1 | grp1 | t1 | 14.1 |
1 | 1 | grp1 | t2 | 14.4 |
2 | 1 | grp1 | t3 | 14.1 |
3 | 2 | grp1 | t1 | 14.5 |
4 | 2 | grp1 | t2 | 14.6 |
m = rst.groupwise.Anova(
data=anxiety, subject='id',
dependent='score', between='group', within='time')
md_print_df(m.report_table())
R[write to console]: Contrasts set to contr.sum for the following variables: group
Term | p-value | Partial Eta-Squared | F | df1 | df2 | |
---|---|---|---|---|---|---|
1 | group | 0.019 | 0.172 | 4.35 | 2 | 42 |
2 | time | 0.001 | 0.904 | 394.91 | 1.79 | 75.24 |
3 | group:time | 0.001 | 0.84 | 110.19 | 3.58 | 75.24 |
Similarly, we run the model usign only the between subject term (group
). As the model was already generated we can simpyl drop the within-subject term:
m.reset(within=None, refit=True)
md_print_df(m.report_table())
R[write to console]: Contrasts set to contr.sum for the following variables: group
Term | p-value | Partial Eta-Squared | F | df1 | df2 | |
---|---|---|---|---|---|---|
1 | group | 0.019 | 0.172 | 4.35 | 2 | 42 |
R and many other statistical packages (e.g., statsmodels support a formula interface to fit statistical models. Here it is shown that a model can also be specified by the formula kwargs rather than specifying dependent
, between
etc. The formula indicates that the score column is regressed by the time variable, with observations nested within the id column.
m.reset(formula='score~time|id', refit=True)
md_print_df(m.report_table())
Term | p-value | Partial Eta-Squared | F | df1 | df2 | |
---|---|---|---|---|---|---|
1 | time | 0.001 | 0.601 | 66.23 | 1.15 | 50.55 |
We can also run a similar, bayesian ANOVA using BayesAnova
comparing the
specified terms to the null model:
m = rst.groupwise.BayesAnova(data=anxiety, within='time',
dependent='score', subject='id')
md_print_df(m.report_table())
model | bf | error | |
---|---|---|---|
0 | time | 496.129 | 7.82496e-05 |
robusta includes several other features that are either under development or planned for the future.
Currently under work
- Regressions and correlations modules
Planned
- Sequential analysis plots (inspired by JASP)
All help is welcome, but currently there are no specific guidelines. Please contact Eitan Hemed