In [None]:
ind_t_params = []
for i in xrange(100):
    # Defines the parameters for the populations
    mu1 = np.random.randint(0, 20)
    mu2 = np.random.randint(0, 20)
    sigma1 = np.random.randint(5, 30)
    sigma2 = np.random.randint(5, 30)
    base_num = np.random.randint(25, 250)
    
    # Draws the samples
    sample1 = np.random.randn(base_num) * sigma1 + mu1
    sample2 = np.random.randn(base_num) * sigma2 + mu2
    
    # Summarizes the samples
    ind_t_params.append({'mu1': mu1,
                        'mean1': sample1.mean(),
                        'mu2': mu2,
                        'mean2': sample2.mean(),
                        'sigma1': sigma1,
                        'std1': sample1.std(),
                        'sigma2': sigma2,
                        'std2': sample2.std(),
                        'base_num': base_num,
                        'sample1': sample1,
                        'sample2': sample2,
                        'test_p': t_test2([sample1, sample2])
                        })

**Author**: J W Debelius<br/>
**Date**: 22 June 2015<br/>
**virtualenv**: power play

In [1]:
%%javascript
IPython.load_extensions('calico-spell-check', 'calico-document-tools')

<a id="#top"></a>
# Table of Contents
* [1. Introduction](#1.-Introduction)
* [2. Case I t test](#2.-Case-I-t-test)
* [3. Case II T test](#3.-Case-II-T-test)


# 1. Introduction

...

In [1]:
from __future__ import division

import numpy as np
import skbio
import scipy.stats
import matplotlib.pyplot as plt

from skbio.stats.power import subsample_power
from statsmodels.stats.power import FTestAnovaPower, TTestIndPower

ft = FTestAnovaPower()
tt = TTestIndPower()

%matplotlib inline

In [2]:
def vital_stats(dist):
    """Returns a distribution summary"""
    return dist.mean(), dist.std(), len(dist)

<a href="#top">Return to the top</a>

# T tests

... discussion of the test and distribution...

## One Sample T test (Case I)
...describes the test...
Lets you compare a value against a sample or population...

The test statistic for a case I t test is given as

$t = \frac{(\bar{x} - x)\sqrt{n}}{s} \tag{1.1}$

where $\bar{x}$ is the sample mean, $x$ is the value the sample is compared against, $s$ is the sample standard deviation, and $n$ is the sample size. The test statistic, $t$, follows a t distribution with $(n-1)$ degrees of freedom. The SciPy function `scipy.stats.ttest_1samp` performs the one sample t test.

Practically, we'll test the hypothesis that our distribution is centered around 0. When we simulate the distributions, we will set the means between 1 and 15, so that this null hypothesis is false and there is an effect to detect.
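As a quick check of equation (1.1), the statistic can be computed by hand and compared with `scipy.stats.ttest_1samp`. This is a minimal sketch: the sample, its parameters, and the names `t_manual` and `p_manual` are made up for illustration. NumPy's `std` defaults to `ddof=0`, so `ddof=1` is passed to get the sample standard deviation.

```python
import numpy as np
import scipy.stats

# Hypothetical sample: 50 draws with mean 3 and standard deviation 5
rng = np.random.RandomState(0)
sample = rng.randn(50) * 5 + 3
x0 = 0  # value the sample is compared against

# Equation (1.1), using the sample standard deviation (ddof=1)
n = len(sample)
t_manual = (sample.mean() - x0) * np.sqrt(n) / sample.std(ddof=1)

# Two-sided p value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * scipy.stats.t.sf(np.absolute(t_manual), n - 1)

t_scipy, p_scipy = scipy.stats.ttest_1samp(sample, x0)
print("t: %f (manual) vs %f (scipy)" % (t_manual, t_scipy))
print("p: %g (manual) vs %g (scipy)" % (p_manual, p_scipy))
```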

In [3]:
def t_test1(x):
    """Returns the p value for a one sample t test against a mean of 0"""
    return scipy.stats.ttest_1samp(x, 0)[1]

For the case I t test, the effect size is given as
$\lambda = \frac{(\bar{x} - x)}{s} \tag{1.2}$

which allows us to express power as

$\begin{align*}
1-\beta &= \Phi_{T} \left(-T(1-\alpha/2, n-1) + \lambda\sqrt{n}, n-1 \right )\\
&= \Phi_{T} \left(-T(1-\alpha/2, n-1) + \left(\frac{\bar{x} - x}{s}\right)\sqrt{n}, n - 1 \right )
\end{align*} \tag{1.3}$

where $\Phi_{T}(\cdot, n-1)$ is the cumulative distribution function of the t distribution with $n-1$ degrees of freedom and $T(1-\alpha/2, n-1)$ is its $1-\alpha/2$ quantile.
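Equation (1.3) shifts a central t distribution, which approximates the noncentral t distribution the statistic follows when the null hypothesis is false. The sketch below compares the two using `scipy.stats.nct`. The helper names `approx_power1` and `exact_power1` and the effect size of 0.5 are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import scipy.stats


def approx_power1(eff, n, alpha=0.05):
    """Central t approximation of power, following equation (1.3)."""
    crit = scipy.stats.t.ppf(1 - alpha / 2, n - 1)
    return scipy.stats.t.cdf(-crit + eff * np.sqrt(n), n - 1)


def exact_power1(eff, n, alpha=0.05):
    """Two-sided power from the noncentral t distribution."""
    crit = scipy.stats.t.ppf(1 - alpha / 2, n - 1)
    nc = eff * np.sqrt(n)  # noncentrality parameter
    # Probability of landing in either rejection region
    return (scipy.stats.nct.sf(crit, n - 1, nc) +
            scipy.stats.nct.cdf(-crit, n - 1, nc))


for n in [10, 30, 100]:
    print("n=%i: approx=%f, exact=%f"
          % (n, approx_power1(0.5, n), exact_power1(0.5, n)))
```

For moderate sample sizes the two values should be close, which is why the simpler central t form is a reasonable working approximation.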

In [4]:
def trad_t_power1(counts, dist, x, alpha=0.05):
    """Estimates power for a case I t test at each sample size in counts"""
    # Summarizes the distribution
    [x1, s1, n1] = vital_stats(dist)
    
    # Calculates the effect size
    eff = np.absolute(x1 - x)/s1
    
    # Calculates the power, scaling the effect size by sqrt(n) as in equation (1.3)
    pwr = np.array([scipy.stats.t.cdf(-scipy.stats.t.ppf(1-alpha/2, c-1) + eff*np.sqrt(c), c-1)
                    for c in counts])
    
    return pwr

## Independent Sample Distributions (Case II)

...Some happy discussion of the test and distribution...

The test statistic for a case II t test is given as

$t = \frac{\bar{x}_{1} - \bar{x}_{2}}{\sqrt{\frac{s_{1}^{2}}{n_{1}} + \frac{s_{2}^{2}}{n_{2}}}}\tag{2.1}$

The t statistic follows a t distribution with $df$ degrees of freedom, where $df$ is given as
$df = \frac{\left(s_{1}^{2}/n_{1} + s_{2}^{2}/n_{2}\right)^{2}}{\left(s_{1}^{2}/n_{1}\right)^{2}/(n_{1}-1) + \left(s_{2}^{2}/n_{2}\right)^{2}/(n_{2}-1)} \tag{2.2}$

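SciPy has a built-in function, `scipy.stats.ttest_ind`, to perform this test. By default it assumes equal variances; passing `equal_var=False` selects the unequal-variance form that matches equations (2.1) and (2.2).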
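Equations (2.1) and (2.2) can be checked by computing the statistic and degrees of freedom by hand and comparing the resulting p value with `scipy.stats.ttest_ind` called with `equal_var=False`. This is a minimal sketch with made-up samples; the variable names are illustrative.

```python
import numpy as np
import scipy.stats

# Hypothetical samples with different sizes and spreads
rng = np.random.RandomState(1)
sample1 = rng.randn(40) * 5 + 10
sample2 = rng.randn(60) * 12 + 13
n1, n2 = len(sample1), len(sample2)

# Squared standard errors, s^2 / n, using the sample variance (ddof=1)
v1 = sample1.var(ddof=1) / n1
v2 = sample2.var(ddof=1) / n2

# Equation (2.1): the test statistic
t_manual = (sample1.mean() - sample2.mean()) / np.sqrt(v1 + v2)

# Equation (2.2): the degrees of freedom
df = np.square(v1 + v2) / (np.square(v1) / (n1 - 1) + np.square(v2) / (n2 - 1))

# Two-sided p value
p_manual = 2 * scipy.stats.t.sf(np.absolute(t_manual), df)

t_scipy, p_scipy = scipy.stats.ttest_ind(sample1, sample2, equal_var=False)
print("t: %f (manual) vs %f (scipy)" % (t_manual, t_scipy))
print("p: %g (manual) vs %g (scipy)" % (p_manual, p_scipy))
```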

In [5]:
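def t_test2(x):
    """Returns the p value for a case II t test on the two samples in x"""
    # equal_var=False matches the unequal-variance statistic in equation (2.1)
    return scipy.stats.ttest_ind(*x, equal_var=False)[1]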

For the sake of simplicity, we'll assume that $n_{1} = n_{2} = n$, which allows us to redefine equation (2.1) as
$t = \frac{\sqrt{n}(\bar{x}_{1} - \bar{x}_{2})}{\sqrt{s_{1}^{2} + s_{2}^{2}}} \tag{2.3}$
which means the test statistic is now drawn from a t distribution with $df$ degrees of freedom, where
$df$ is defined as
$df = \left (n-1 \right ) \left (\frac{\left (s_{1}^{2} + s_{2}^{2}  \right )^{2}}{\left (s_{1}^{2} \right)^{2} + \left (s_{2}^{2}  \right )^{2}} \right ) \tag{2.4}$

In [6]:
def get_t_df(n, s1, s2):
    """Returns the degrees of freedom from equation (2.4) for equal sample sizes"""
    modifier = (np.square(np.square(s1) + np.square(s2)) / (np.power(s1, 4) + np.power(s2, 4)))
    return (n-1)*modifier
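A numerical check helps confirm that the equal-size form in equation (2.4) agrees with the general form in equation (2.2). The sketch below uses two hypothetical helpers, `welch_df` and `equal_n_df`, and arbitrary standard deviations.

```python
import numpy as np


def welch_df(s1, n1, s2, n2):
    """Degrees of freedom from the general form, equation (2.2)."""
    v1 = np.square(s1) / n1
    v2 = np.square(s2) / n2
    return np.square(v1 + v2) / (np.square(v1) / (n1 - 1) +
                                 np.square(v2) / (n2 - 1))


def equal_n_df(n, s1, s2):
    """Degrees of freedom from the equal-size form, equation (2.4)."""
    modifier = (np.square(np.square(s1) + np.square(s2)) /
                (np.power(s1, 4) + np.power(s2, 4)))
    return (n - 1) * modifier


n, s1, s2 = 25, 5.0, 12.0
print("general: %f, equal-size: %f" % (welch_df(s1, n, s2, n), equal_n_df(n, s1, s2)))
```

When the two standard deviations are equal, the modifier is 2 and the result reduces to $2(n-1)$, the degrees of freedom of the pooled test.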

The effect size for this test is given as
$\begin{align*}
\lambda &= \frac{t}{\sqrt{n}}\\
&=\frac{(\bar{x}_{1} - \bar{x}_{2})}{\sqrt{s_{1}^{2} + s_{2}^{2}}}
\end{align*} \tag{2.5}$

Following the same reasoning as equation (1.3), power is defined as
$\begin{align*} 
1 - \beta &= \Phi_{T} \left(-T(1-\alpha/2, df) + \lambda\sqrt{n}, df \right )\\
&= \Phi_{T} \left(-T(1-\alpha/2, df) + \frac{\sqrt{n}(\bar{x}_{1} - \bar{x}_{2})}{\sqrt{s_{1}^{2} + s_{2}^{2}}}, df \right )
\end{align*} \tag{2.6}$

In [19]:
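def trad_t_power2(counts, dist1, dist2, alpha=0.05):
    """Estimates power for a case II t test at each sample size in counts"""
    # Summarizes the distributions
    [x1, s1, n1] = vital_stats(dist1)
    [x2, s2, n2] = vital_stats(dist2)
    
    # Gets the effect size
    eff = np.absolute(x1 - x2)/np.sqrt(np.square(s1) + np.square(s2))
    
    # Calculates the modifier
    mod = (np.square(np.square(s1) + np.square(s2)) / (np.power(s1, 4) + np.power(s2, 4)))
    
    # Calculates the power
    pwr = np.array([scipy.stats.t.cdf(-scipy.stats.t.ppf(1-alpha/2, mod*(c - 1)) + np.sqrt(c)*eff, mod*(c-1))
                    for c in counts])
    
    return pwr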
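One way to sanity check the analytic power from equation (2.6) is to compare it with the rejection rate in repeated simulations. The sketch below assumes arbitrary population parameters and 2000 simulated experiments. It works from the population values directly rather than from observed samples, so it illustrates the formula and does not replace `trad_t_power2`.

```python
import numpy as np
import scipy.stats

# Hypothetical population parameters and a per-group sample size
mu1, mu2, sigma1, sigma2 = 0.0, 5.0, 10.0, 10.0
n, alpha = 50, 0.05

# Analytic power: effect size (2.5), degrees of freedom (2.4), power (2.6)
eff = np.absolute(mu1 - mu2) / np.sqrt(sigma1**2 + sigma2**2)
df = (n - 1) * (sigma1**2 + sigma2**2)**2 / (sigma1**4 + sigma2**4)
crit = scipy.stats.t.ppf(1 - alpha / 2, df)
analytic = scipy.stats.t.cdf(-crit + np.sqrt(n) * eff, df)

# Empirical power: fraction of simulated experiments that reject the null
rng = np.random.RandomState(42)
num_sims = 2000
rejected = 0
for _ in range(num_sims):
    a = rng.randn(n) * sigma1 + mu1
    b = rng.randn(n) * sigma2 + mu2
    if scipy.stats.ttest_ind(a, b, equal_var=False)[1] < alpha:
        rejected += 1
empirical = rejected / float(num_sims)

print("analytic: %f, empirical: %f" % (analytic, empirical))
```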

<a href="#top">Return to the top</a>

# Pearson's r
...discussion of the correlation coefficient...

The Pearson correlation coefficient can be tested using a t statistic:
$T = \frac{r_{xy}\sqrt{n-2}}{\sqrt{1-r_{xy}^{2}}} \tag{3.1}$
where $T$ follows a t distribution with $n-2$ degrees of freedom. SciPy's `scipy.stats.pearsonr` calculates the correlation coefficient *and* a p value for the coefficient. However, the function's notes suggest that the p value may not be accurate for smaller sample sizes, so we're going to try an initial test and compare the derived p value, based on [1], with the SciPy p value.
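The comparison described above can be sketched as follows: compute $T$ from equation (3.1), convert it to a two-sided p value with `scipy.stats.t`, and set it beside the p value returned by `scipy.stats.pearsonr`. The data here are made up, and the two values are expected to agree closely because both rest on the same normality assumption.

```python
import numpy as np
import scipy.stats

# Hypothetical data: a noisy linear relationship with 10 observations
rng = np.random.RandomState(5)
x = rng.randint(0, 25, 10).astype(float)
y = 3 * x + 5 + 10 * rng.randn(10)

r, p_scipy = scipy.stats.pearsonr(x, y)
n = len(x)

# Equation (3.1): t statistic for the correlation coefficient
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - np.square(r))

# Two-sided p value from the t distribution with n - 2 degrees of freedom
p_manual = 2 * scipy.stats.t.sf(np.absolute(t_stat), n - 2)

print("r = %f" % r)
print("p: %g (derived) vs %g (scipy)" % (p_manual, p_scipy))
```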

In [None]:
def test_pearson(dists):
    return scipy.stats.pearsonr(*dists)[1]

The effect size for this test is then given as
$\lambda = \frac{r_{xy}\sqrt{n}}{\sqrt{1 - r_{xy}^{2}}} \tag{3.2}$
which means that when we solve for power, we can start by defining power as
$1 - \beta = 1 - P\left (T(\lambda, n-2) < t(1-\alpha/2, n-2) \right) \tag{3.3}$
where $T(\lambda, n-2)$ follows a noncentral t distribution with noncentrality parameter $\lambda$ and $n-2$ degrees of freedom.

In [12]:
x = np.random.randint(0, 25, 10)
y = 3*x + 5 + 10*np.random.randn(10)
r, p = scipy.stats.pearsonr(x, y)
print r, p

0.965112818738 6.21359356276e-06


In [14]:
eff = r/np.sqrt(1-np.square(r))
eff

3.6859666276681282

In [18]:
print scipy.stats.t.pdf(eff*np.sqrt(5), 3)
print scipy.stats.t.ppf(0.975, 3)

0.000657477139229
3.18244630528


In [None]:
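def trad_pearson_power(counts, dist1, dist2, alpha=0.05):
    """Estimates power for Pearson's r at each sample size in counts"""
    # Calculates the effect size; the sign does not matter for a two-sided test
    r, _ = scipy.stats.pearsonr(dist1, dist2)
    eff = np.absolute(r)/np.sqrt(1-np.square(r))
    
    # Calculates the power from the noncentral t distribution, following equation (3.3)
    pwr = np.array([1 - scipy.stats.nct.cdf(scipy.stats.t.ppf(1-alpha/2, c-2), c-2, eff*np.sqrt(c))
                    for c in counts])
    
    return pwr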

In [None]:
def trad_t_power2(counts, dist1, dist2, alpha=0.05):
    """..."""
    # Summarizes the distributions
    [x1, s1, n1] = vital_stats(dist1)
    [x2, s2, n2] = vital_stats(dist2)
    
    # Calculates the effect size
    eff = np.absolute(x1 - x2)/np.sqrt(np.square(s1) + np.square(s2))
    
    # Calculates the modifier
    mod = (np.square(np.square(s1) + np.square(s2)) / (np.power(s1, 4) + np.power(s2, 4)))
    
    # Calculates the power
    pwr = np.array([scipy.stats.t.cdf(-scipy.stats.t.ppf(1-alpha/2, mod*(c - 1)) + np.sqrt(c)*eff, mod*(c-1)) for c in counts])
    
    return pwr

<a href="#top">Return to the top</a>

# Analysis of Variance
...

The total sum of squares is given by

$SST = \sum_{j=1}^{J}\sum_{i=1}^{n_{j}}{\left (x_{i,j} - \bar{x}_{..} \right )^{2}}$

The sums of squares between groups and within groups are

$SSR = \sum_{j=1}^{J}\sum_{i=1}^{n_{j}}{\left (\bar{x}_{.j} - \bar{x}_{..} \right )^{2}}$

$SSE = \sum_{j=1}^{J}\sum_{i=1}^{n_{j}}{\left (x_{i,j} - \bar{x}_{.j} \right )^{2}}$

where

$\bar{x}_{.j} = \frac{1}{n_{j}}\sum_{i=1}^{n_j}{x_{i,j}}$

and

$\bar{x}_{..} = \frac{1}{J}\sum_{j=1}^{J}\left({\frac{1}{n_{j}}\sum_{i=1}^{n_j}{x_{i,j}}}\right)$

The test statistic, F, is then given by

$F = \frac{SSR/(J-1)}{SSE/(N-J)}$

where $N = \sum_{j=1}^{J}{n_{j}}$ is the total number of observations. Under the null hypothesis, F follows an F distribution with $J-1$ and $N-J$ degrees of freedom.
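As a sanity check, the sums of squares above can be computed directly and the ratio of mean squares, $F = \frac{SSR/(J-1)}{SSE/(N-J)}$, compared with `scipy.stats.f_oneway`. This is a minimal sketch with made-up groups of equal size, for which the mean of the group means equals the mean of all observations.

```python
import numpy as np
import scipy.stats

# Hypothetical data: three groups of 30 observations each
rng = np.random.RandomState(3)
groups = [rng.randn(30) * 4 + mu for mu in (0, 2, 5)]

J = len(groups)
N = sum(len(g) for g in groups)

# Group means and the grand mean (mean of the group means)
group_means = np.array([g.mean() for g in groups])
grand_mean = group_means.mean()

# Between-group and within-group sums of squares
ssr = sum(len(g) * np.square(m - grand_mean)
          for g, m in zip(groups, group_means))
sse = sum(np.square(g - m).sum() for g, m in zip(groups, group_means))

# Ratio of mean squares and its upper-tail p value
f_manual = (ssr / (J - 1)) / (sse / (N - J))
p_manual = scipy.stats.f.sf(f_manual, J - 1, N - J)

f_scipy, p_scipy = scipy.stats.f_oneway(*groups)
print("F: %f (manual) vs %f (scipy)" % (f_manual, f_scipy))
print("p: %g (manual) vs %g (scipy)" % (p_manual, p_scipy))
```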


In [63]:

def anova(x):
    """Returns the p value for a one way ANOVA on the samples in x"""
    return scipy.stats.f_oneway(*x)[1]

In an ANOVA, the effect size for the $j$th group is given by 
$\lambda_{j} = \frac{\sqrt{n_{j}}\left(\bar{x}_{j} - \bar{x}\right)}{s} \tag{4.1}$
where $\bar{x}_{j}$ is the mean of group $j$, $\bar{x}$ is the grand mean, and $s$ is the within-group standard deviation. The overall effect size is
$\begin{align*}
\lambda &= \sum_{j=1}^{J}{\lambda_{j}^{2}}\\
&= \sum_{j=1}^{J}{\left (\frac{\sqrt{n_{j}}(\bar{x}_{j} - \bar{x})}{s} \right )^2}\\
&= \sum_{j=1}^{J}{\frac{n_{j}}{s^{2}}\left (\bar{x}_{j} - \bar{x} \right )^{2}}
\end{align*} \tag{4.2}$

Power is then defined in terms of the F distribution. Because the F test rejects only for large values of the statistic, the critical value is the $1-\alpha$ quantile. So,
$1-\beta = P[F'(\lambda, J-1, N-J) \geq F(1-\alpha, J-1, N-J)] \tag{4.3}$
where $F'$ is the noncentral F distribution with noncentrality parameter $\lambda$. By analogy with equations (1.3) and (2.6), shifting a central F distribution gives the rough approximation
$\begin{align*}
1-\beta &\approx \Phi_{F} \left( -F(1-\alpha, J - 1, N-J) + \lambda, J -1, N-J \right)\\
&= \Phi_{F} \left(-F(1-\alpha, J-1, N-J) + \sum_{j=1}^{J}{\frac{n_{j}(\bar{x}_{j} - \bar{x})^{2}}{s^{2}}}, J-1, N-J \right )
\end{align*} \tag{4.4}$
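The noncentral F definition of power can be evaluated directly with `scipy.stats.ncf` and checked against simulation. The sketch below assumes arbitrary population means, a common standard deviation, equal group sizes, and the upper-tail critical value $F(1-\alpha, J-1, N-J)$ that the p value from `scipy.stats.f_oneway` corresponds to.

```python
import numpy as np
import scipy.stats

# Hypothetical population means, common standard deviation, and group size
mus = np.array([0.0, 3.0, 6.0])
sigma, n, alpha = 10.0, 30, 0.05
J = len(mus)
N = n * J

# Noncentrality parameter: sum of n_j * (mean_j - grand mean)^2 / sigma^2
nc = (n * np.square(mus - mus.mean())).sum() / sigma**2

# Analytic power from the noncentral F distribution
crit = scipy.stats.f.ppf(1 - alpha, J - 1, N - J)
analytic = scipy.stats.ncf.sf(crit, J - 1, N - J, nc)

# Empirical power: fraction of simulated experiments that reject the null
rng = np.random.RandomState(7)
num_sims = 2000
rejected = 0
for _ in range(num_sims):
    groups = [rng.randn(n) * sigma + mu for mu in mus]
    if scipy.stats.f_oneway(*groups)[1] < alpha:
        rejected += 1
empirical = rejected / float(num_sims)

print("analytic: %f, empirical: %f" % (analytic, empirical))
```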

In [None]:
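def trad_anova_power(dists, alpha=0.05):
    """Estimates power for a one way ANOVA from the noncentral F distribution"""
    # Summarizes the distributions
    vitals = [vital_stats(dist) for dist in dists]
    xs, ss, ns = [np.array(v) for v in zip(*vitals)]
    
    # Calculates the grand mean as the mean of the group means
    grand_mean = xs.mean()
    
    # Calculates the effect size; s is assumed to be the pooled within-group deviation
    s2 = (ns * np.square(ss)).sum() / (ns.sum() - len(ns))
    eff = (ns * np.square(xs - grand_mean)).sum() / s2
    
    # Calculates the power using the upper-tail critical value of the central F
    dfn, dfd = len(ns) - 1, ns.sum() - len(ns)
    return scipy.stats.ncf.sf(scipy.stats.f.ppf(1 - alpha, dfn, dfd), dfn, dfd, eff)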

 
