**Author**: J W Debelius<br/>
**Date**: 23 August 2015<br/>
**virtualenv**: power play

In [1]:
%%javascript
IPython.load_extensions('calico-spell-check', 'calico-document-tools')

The purpose of this notebook is to test the effect of outliers on traditional power calculations, on empirical power calculations, and on extrapolated power.

This notebook uses a Case I (one-sample) t test as a model for introducing outliers. We will test the hypotheses
<center><strong>H</strong><sub>0</sub>: $\bar{x} = 0$<br>
<strong>H</strong><sub>1</sub>: $\bar{x} \neq 0$</center>
for some sample with mean $\bar{x}$ and standard deviation, $s$, drawn from an underlying population with mean, $\mu$ ($\mu \neq 0$) and variance $\sigma^{2}$. 

In [18]:
from __future__ import division

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

import absloute_power.utils as ap

from absloute_power.traditional import calc_ttest_1
from skbio.stats.power import subsample_power

We can test these hypotheses using the scipy function `scipy.stats.ttest_1samp`.

In [3]:
def emp_ttest_1(samples):
    return scipy.stats.ttest_1samp(samples[0], 0)[1]

In the traditional model of power, the effect size for a sample tested against a hypothesized mean $x_{0} = 0$ is given by
$\begin{align*}
\lambda &=\frac{(\bar{x} - x_{0})}{s}\\
&= \frac{\bar{x}}{s}\\
&= \frac{t}{\sqrt{n}}
\end{align*}\tag{1}$
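
As a quick numerical check of equation (1), the sketch below computes the effect size both directly and from the t statistic for an arbitrary simulated sample. The seed, mean and standard deviation are illustrative assumptions, not values used elsewhere in this notebook.

```python
import numpy as np
import scipy.stats

# Illustrative sample (assumed parameters): mean 3, standard deviation 12, n = 50
rng = np.random.RandomState(0)
x = 3 + 12 * rng.randn(50)
n = len(x)

# Effect size from the sample mean and the sample standard deviation (ddof=1)
lam_direct = x.mean() / x.std(ddof=1)

# Effect size recovered from the one-sample t statistic against 0
t_stat = scipy.stats.ttest_1samp(x, 0)[0]
lam_from_t = t_stat / np.sqrt(n)

print(np.isclose(lam_direct, lam_from_t))
```

The two values agree because the t statistic is the sample mean divided by the standard error, $s / \sqrt{n}$.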

We're going to compare the ability of traditional post-hoc power, empirical post-hoc power and extrapolated post-hoc power to recapitulate the traditional power calculation. We'll approach this by simulating a random population with mean $\mu$ and standard deviation $\sigma$, and then drawing a random sample of size $n$ with mean $\bar{x}$ and standard deviation $s$.
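
For reference, the traditional power of a two-sided one-sample t test can be sketched with the noncentral t distribution. This is a minimal illustration under the usual normality assumption. The helper name `sketch_ttest_1_power` is hypothetical, and it is not claimed to match the implementation of `calc_ttest_1`.

```python
import numpy as np
import scipy.stats


def sketch_ttest_1_power(effect, counts, alpha=0.05):
    """Power of a two-sided one-sample t test (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    df = counts - 1
    # Noncentrality parameter: the effect size scaled by sqrt(n)
    nc = effect * np.sqrt(counts)
    # Two-sided critical value under the null hypothesis
    t_crit = scipy.stats.t.ppf(1 - alpha / 2., df)
    # Probability that |T| exceeds the critical value under the alternative
    return (scipy.stats.nct.sf(t_crit, df, nc) +
            scipy.stats.nct.cdf(-t_crit, df, nc))


# Assumed effect size of 0.3, evaluated at 10, 20, ..., 100 observations
power = sketch_ttest_1_power(0.3, np.arange(10, 101, 10))
```

For a fixed effect size, the power rises with the number of observations, which is the curve the post-hoc methods are trying to recover.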

In [4]:
def ttest_1_simulate(mu_lim, sigma_lim, count_lims):
    # Gets the distribution parameters
    mu = np.random.randint(*mu_lim)
    sigma = np.random.randint(*sigma_lim)
    n = np.random.randint(*count_lims)
    
    dist = mu + np.random.randn(n) * sigma

    # Draws a sample that fits the parameters
    return [mu, sigma, n], dist

We can design a function that will let us "spike in" outliers: points drawn from a distribution `offset` units above or below the central mean. 

In [5]:
def add_outliers(dist, num_out, offset=50):
    # Shifts `num_out` randomly chosen points by `offset` (modifies `dist` in place)
    if num_out > 0:
        # Selects the points to be outliers
        outlier_pos = np.random.choice(np.arange(0, len(dist)), num_out, replace=False)
        # Updates the distribution
        dist[outlier_pos] = dist[outlier_pos] + offset
    
    return [num_out, offset], [dist]
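
To see why a few spiked points matter, the sketch below shifts a fixed number of points and compares the sample effect size before and after. `spike_outliers` is a hypothetical, non-mutating variant of the function above, and the distribution parameters are illustrative assumptions.

```python
import numpy as np


def spike_outliers(dist, num_out, offset=50, rng=np.random):
    """Returns a copy of `dist` with `num_out` points shifted by `offset`."""
    spiked = np.array(dist, dtype=float)
    if num_out > 0:
        pos = rng.choice(len(spiked), num_out, replace=False)
        spiked[pos] += offset
    return spiked


rng = np.random.RandomState(42)
dist = 5 + 15 * rng.randn(200)      # assumed mean 5, standard deviation 15
spiked = spike_outliers(dist, 10, offset=50, rng=rng)

# Each outlier raises the mean by offset / n; the spread typically grows too
effect_before = dist.mean() / dist.std(ddof=1)
effect_after = spiked.mean() / spiked.std(ddof=1)
```

Because the mean and the standard deviation both change, the effect size of the spiked sample can move in either direction relative to the original.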

We're going to use the function to simulate several distributions with varying numbers of outliers. We'll draw distributions with integer means from 1 to 9 and integer standard deviations from 10 to 24 (`np.random.randint` excludes the upper limit), each with 200 points.

In [6]:
mu_lims = [1, 10]
sigma_lims = [10, 25]
num_counts = 200
count_lims = [num_counts, num_counts+1]

We'll simulate the distributions so they have no outliers, a single outlier, and 1%, 2%, 5%, 10% and 20% outliers. Outliers are shifted by a fixed offset of 50 units from the original distribution.

In [11]:
num_outliers = (np.concatenate((np.array([0, 1]), np.array([0.01, 0.02, 0.05, 0.1, 0.2])*num_counts))).astype(int)
len_out = len(num_outliers)
offset=50
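
Evaluating that expression for 200 points gives the outlier counts directly. This only restates the cell above so the values are visible.

```python
import numpy as np

num_counts = 200
fractions = np.array([0.01, 0.02, 0.05, 0.1, 0.2])
num_outliers = np.concatenate((np.array([0, 1]),
                               fractions * num_counts)).astype(int)
print(num_outliers.tolist())  # [0, 1, 2, 4, 10, 20, 40]
```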

When we calculate traditional and empirical power, we'll start at 10 samples and go up to 100 samples, counting by 10s. We draw 100 subsamples to calculate power at each point, and repeat this three times at each point along the curve. We'll use an alpha value of 0.05, and we'll simulate 100 distributions for each number of outliers.

In [12]:
counts = np.arange(10, 101, 10)
alpha=0.05
subsample_params = {'min_counts': 10,
                    'max_counts': 101,
                    'counts_interval': 10,
                    'num_runs': 3,
                    'num_iter': 100,
                    'alpha_pwr': alpha}
num_reps = 100
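
The parameters above describe a Monte Carlo estimate of power. A minimal sketch of that idea follows, assuming subsamples are drawn without replacement. The helper `sketch_empirical_power` is hypothetical and is not the `subsample_power` implementation.

```python
import numpy as np
import scipy.stats


def sketch_empirical_power(sample, counts, num_iter=100, num_runs=3,
                           alpha=0.05, seed=None):
    """Fraction of random subsamples rejecting the null (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    sample = np.asarray(sample)
    power = np.zeros((num_runs, len(counts)))
    for run in range(num_runs):
        for j, n in enumerate(counts):
            # Draws `num_iter` subsamples of size n and tests each against 0
            pvals = np.array([
                scipy.stats.ttest_1samp(
                    rng.choice(sample, n, replace=False), 0)[1]
                for _ in range(num_iter)])
            # Power is the fraction of subsamples that are significant
            power[run, j] = (pvals < alpha).mean()
    return power


rng = np.random.RandomState(1)
sample = 5 + 15 * rng.randn(200)    # assumed mean 5, standard deviation 15
counts = np.arange(10, 101, 10)
empr = sketch_empirical_power(sample, counts, seed=2)
```

Each row of `empr` is one run and each column is one subsample size, so `empr.mean(0)` gives a single averaged power curve.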

Let's build the simulated populations.

In [13]:
watch = {o: {'pop_params': [], 'sample_params': [], 'pop__power': [], 'base_power': [], 'samp_power': [], 
             'empr_power': [], 'extr_power': []} for o in num_outliers}
for o in num_outliers:
    v = watch[o]
    for i in xrange(num_reps):
        # Draws a sample distribution
        params, sample = ttest_1_simulate(mu_lims, sigma_lims, count_lims)
        # Calculates the traditional power before any outliers are added
        base_power = calc_ttest_1(sample, x0=0, counts=counts)
        # Spikes in outliers, if necessary
        (num_out, offset), dist = add_outliers(sample, o, offset)
        params.append(num_out)
        params.append(offset)
        # Calculates the power of the spiked sample using the traditional method
        samp_power = calc_ttest_1(dist[0], 0, counts)
        # Calculates the empirical power
        empr_power, empr_counts = subsample_power(emp_ttest_1,
                                                  dist,
                                                  **subsample_params)
        # Calculates the extrapolated power
        extr_power = ap.extrapolate_f(counts, empr_power, empr_counts).flatten()
        
        # Updates watch
        v['pop_params'].append(params)
        v['base_power'].append(base_power)
        v['samp_power'].append(samp_power)
        v['empr_power'].append(empr_power.mean(0))
        v['extr_power'].append(extr_power)
        
    # Summarizes the power estimates once all replicates have been collected
    v['df'] = pd.DataFrame(data=np.vstack((np.concatenate(v['base_power']),
                                           np.concatenate(v['samp_power']),
                                           np.concatenate(v['extr_power']))
                                          ).transpose(),
                           columns=['base', 'samp', 'extr'])
       
    watch[o] = v
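
The internals of `ap.extrapolate_f` are not shown in this notebook. One plausible approach, sketched below purely as an assumption, is to invert the power function at each empirical point to get an effect size, average those effect sizes, and then predict power at the requested counts. The names `t_power` and `sketch_extrapolate` are hypothetical.

```python
import numpy as np
import scipy.optimize
import scipy.stats


def t_power(effect, n, alpha=0.05):
    """Two-sided one-sample t test power from the noncentral t distribution."""
    df = n - 1.
    t_crit = scipy.stats.t.ppf(1 - alpha / 2., df)
    nc = effect * np.sqrt(n)
    return (scipy.stats.nct.sf(t_crit, df, nc) +
            scipy.stats.nct.cdf(-t_crit, df, nc))


def sketch_extrapolate(counts, empr_power, empr_counts, alpha=0.05):
    """Hypothetical stand-in: fits one effect size, then predicts power."""
    effects = []
    for p, n in zip(empr_power, empr_counts):
        # Power close to alpha or to 1 says little about the effect size
        if alpha + 0.01 < p < 0.99:
            effects.append(scipy.optimize.brentq(
                lambda e: t_power(e, n, alpha) - p, 1e-6, 5))
    effect = np.mean(effects)
    return effect, t_power(effect, np.asarray(counts, dtype=float), alpha)


# Checks the inversion on noise-free power values from a known effect size
counts = np.arange(10, 101, 10)
observed = t_power(0.3, counts[:5].astype(float))
effect, extr = sketch_extrapolate(counts, observed, counts[:5])
```

With noise-free inputs the fitted effect size matches the value used to generate them. With empirical power estimates, the averaged effect size absorbs the sampling noise at each point.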

Let's set up a table so we can perform an ANCOVA.

In [187]:
stack = df.stack()
stack.name = 'value'
stack = pd.DataFrame(stack)
stack['source'] = [c[1] for c in stack.index]
stack['base'] = [df.loc[c[0], 'base'] for c in stack.index]


In [204]:
c = stack.reindex(check)

In [207]:
c.loc[c.source.apply(lambda x: x in {'samp', 'extr', 'empr'})]

| | | value | source | base |
|---|---|---|---|---|
| 158 | samp | 0.984006 | samp | 0.985093 |
| 881 | extr | 0.157202 | extr | 0.155425 |
| 447 | extr | 0.969258 | extr | 0.937909 |
| 122 | empr | 0.280000 | empr | 0.295549 |
| 267 | extr | 0.605438 | extr | 0.567116 |
| 536 | samp | 0.172497 | samp | 0.147675 |
| 18 | samp | 0.115207 | samp | 0.093897 |
| 885 | empr | 0.400000 | empr | 0.401440 |
| 678 | empr | 0.816667 | empr | 0.729932 |
| 171 | empr | 0.096667 | empr | 0.120879 |


In [170]:
s = pd.DataFrame(df.stack())
help(df.stack)

Help on method stack in module pandas.core.frame:

stack(self, level=-1, dropna=True) method of pandas.core.frame.DataFrame instance
    Pivot a level of the (possibly hierarchical) column labels, returning a
    DataFrame (or Series in the case of an object with a single level of
    column labels) having a hierarchical index with a new inner-most level
    of row labels.
    The level involved will automatically get sorted.
    
    Parameters
    ----------
    level : int, string, or list of these, default last level
        Level(s) to stack, can pass level name
    dropna : boolean, default True
        Whether to drop rows in the resulting Frame/Series with no valid
        values
    
    Examples
    ----------
    >>> s
         a   b
    one  1.  2.
    two  3.  4.
    
    >>> s.stack()
    one a    1
        b    2
    two a    3
        b    4
    
    Returns
    -------
    stacked : DataFrame or Series



In [98]:
v = watch[1]

In [128]:
df = pd.DataFrame(data=np.vstack((np.concatenate(v['base_power']),
                                  np.concatenate(v['samp_power']),
                                  np.concatenate(v['extr_power']),
                                  np.concatenate(v['empr_power']))).transpose(),
                  columns=['base', 'samp', 'extr', 'empr'])

base_samp_reg = ols('base ~ samp', data=df).fit()
base_extr_reg = ols('base ~ extr', data=df).fit()
samp_extr_reg = ols('samp ~ extr', data=df).fit()

In [131]:
# Copies each slice so the new column is not set on a view of `df`
df2 = df[['base', 'samp']].copy()
df2.columns = ['x', 'y']
df2['z'] = 0

df3 = df[['base', 'extr']].copy()
df3.columns = ['x', 'y']
df3['z'] = 1

df4 = df[['base', 'empr']].copy()
df4.columns = ['x', 'y']
df4['z'] = 2

# ignore_index gives every stacked row a unique label
# (the earlier `reindex` calls discarded their result and had no effect)
df2 = pd.concat((df2, df3, df4), ignore_index=True)


In [154]:
lm = ols('y ~ x * ((z == 1) + (z == 2))', df2).fit()

In [155]:
sm.stats.anova_lm(lm)

| | df | sum_sq | mean_sq | F | PR(>F) |
|---|---|---|---|---|---|
| z == 1 | 1 | 0.009055 | 0.009055 | 13.347843 | 0.0002631573 |
| z == 2 | 1 | 0.000398 | 0.000398 | 0.58606 | 0.4440071 |
| x | 1 | 411.085493 | 411.085493 | 605951.008371 | 0.0 |
| x:z == 1 | 1 | 0.003111 | 0.003111 | 4.585141 | 0.03233082 |
| x:z == 2 | 1 | 0.19289 | 0.19289 | 284.324596 | 5.142209e-61 |
| Residual | 2994 | 2.031171 | 0.000678 | | |


In [142]:
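lm.model.data.frame[:5]
lm1 = ols('base ~ samp', df).fit()
lm2 = ols('base ~ extr', df).fit()
# NOTE: lm3 repeats the formula used for lm2, so rows 0 and 2 of the output match.
# None of these models is nested in another (df_diff is 0), so the F tests are undefined.
lm3 = ols('base ~ extr', df).fit()
print sm.stats.anova_lm(lm2, lm1, lm3, typ=1)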

   df_resid       ssr  df_diff   ss_diff    F  Pr(>F)
0       998  0.713127        0       NaN  NaN     NaN
1       998  0.192449       -0  0.520678 -inf     NaN
2       998  0.713127       -0 -0.520678  inf     NaN


In [146]:
sm.stats.anova_lm(lm, typ=3)

| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| Intercept | 0.104102 | 1 | 153.449163 | 2.097586e-34 |
| C(z) | 0.145107 | 2 | 106.945479 | 1.374679e-45 |
| x | 130.373709 | 1 | 192174.332505 | 0.0 |
| x:C(z) | 0.196 | 2 | 144.454869 | 1.286182e-60 |
| Residual | 2.031171 | 2994 | | |
