We can examine the distance-based effects tested with PERMANOVA and the Mantel test. Because the observations here are not independent, permutative testing is required. We have also chosen to subsample without replacement, because drawing the same sample twice puts a distance between a sample and itself into the subsampled matrix.

This notebook first builds functions that facilitate the power calculations and then applies them. The format will feel similar to previous notebooks, where data was presented and then analyzed, although this notebook is set up to allow parallel processing to limit run time. We ran this notebook using supercomputers, including the Knight Lab supercomputer and UC San Diego Jupyterhub. We recommend this notebook be used for review and not be run on a local computer. The estimated run time for serial processing is at least 4 hours, and it may take longer depending on the speed of your system.

You can download the precalculated files from XXX.

In [1]:
from multiprocessing import Pool
import os
import pickle

import numpy as np
import pandas as pd
import scipy
import skbio

from emp_power.power import subsample_power

To ensure consistency with the simulations, we'll set a seed.

In [2]:
np.random.seed(25) 

# Simulation Parameters

This notebook is intended to be run in parallel, so we'll set the number of worker processes. By default, we'll use 1. On a system with more cores, a larger number lets more power calculations run at the same time, which limits runtime.

In [3]:
num_cpus = 1

A second way to limit runtime is the overwrite variable. When it is set to False, power is calculated only for simulations whose power file does not already exist. When it is set to True, every power file is recalculated.

In [4]:
overwrite = False
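
The skip logic this flag controls can be sketched as follows. The helper name `needs_power_calculation` is hypothetical and not part of this notebook. It only mirrors the condition checked in the main loop further down, and the file names are created in a temporary directory purely for illustration.

```python
import os
import tempfile


def needs_power_calculation(power_fp, overwrite=False):
    """Returns True when the power file should be (re)calculated.

    Hypothetical helper, for illustration only.
    """
    return overwrite or not os.path.exists(power_fp)


with tempfile.TemporaryDirectory() as tmp_dir:
    existing_fp = os.path.join(tmp_dir, 'simulation_0.p')
    missing_fp = os.path.join(tmp_dir, 'simulation_1.p')
    # Creates an empty placeholder standing in for a finished power file
    open(existing_fp, 'wb').close()

    skip_existing = needs_power_calculation(existing_fp, overwrite=False)
    run_missing = needs_power_calculation(missing_fp, overwrite=False)
    rerun_existing = needs_power_calculation(existing_fp, overwrite=True)
```

With `overwrite = False`, only the missing file triggers a calculation, so an interrupted run can be resumed without repeating finished work.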

We'll loop over the 100 simulations, which should be placed in a simulations folder under the current directory. You can download the precalculated simulations [here].

In [5]:
num_rounds = 100

sim_location = './simulations/'
if not os.path.exists(sim_location):
    raise ValueError('The simulations do not exist. '
                     'Go back and simulate some data!')

For this type of permutative testing, we'll select somewhat stringent parameters. We'll use the common biological threshold of 0.05 as our critical value (`alpha`). We'll perform 99 permutations (`depth`), which gives a minimum p-value of 0.01.

In [6]:
depth = 99
alpha = 0.05
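
The minimum p-value quoted above follows from how permutation p-values are commonly computed. The sketch below assumes the convention p = (b + 1) / (permutations + 1), where b is the number of permuted statistics at least as extreme as the observed one. This assumption is consistent with the 0.01 stated above, and the function name is hypothetical.

```python
def minimum_p_value(permutations):
    """Smallest p-value a permutation test can report.

    Assumes p = (b + 1) / (permutations + 1); the minimum occurs when
    no permuted statistic is as extreme as the observed one (b = 0).
    """
    return 1 / (permutations + 1)


min_p = minimum_p_value(99)
```

Under this assumption, 99 permutations cannot report a p-value below 0.01, which is still below the critical value of 0.05.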

Finally, we'll define a series of file paths where we can save the output data, and we'll make sure the directories exist.

In [7]:
distributions = {
    'mantel': {'sim_dir': os.path.join(sim_location, 'data/mantel'),
               'power_dir': os.path.join(sim_location, 'power/mantel/'),
                },
    'permanova': {'sim_dir': os.path.join(sim_location, 'data/permanova'),
                  'power_dir': os.path.join(sim_location, 'power/permanova/'),
                  }
    }

for test, dirs in distributions.items():
    # Creates each directory if it does not already exist
    for dir_ in [dirs['sim_dir'], dirs['power_dir']]:
        os.makedirs(dir_, exist_ok=True)

# Power Calculation

To appropriately handle the parallel processing, we'll write a wrapper function, and then use this in the parallel processing method.

## PERMANOVA

To look at the power associated with a test for a categorical variable on a distance matrix, we'll pass in a simulation, the critical value, the number of permutations to perform, and a filepath where we'll save the results.

We'll retrieve the distance matrix and grouping series from the samples object we built during the simulation. Because the power calculation method draws observations from each group independently, we need to partition the observations into groups.

We then define the counts. Because the observations in a distance matrix are not independent (the distance between a sample and itself is 0, for instance), bootstrapping from the distance matrix alone becomes more of a challenge. To address this, we perform subsampling without replacement of the existing observations.
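
A small illustration of why drawing with replacement is awkward for distance matrices is sketched below. The toy data (six samples on a line, with absolute differences as distances) and the helper names are made up for this example and do not come from the simulations.

```python
import numpy as np

rng = np.random.RandomState(25)

# Toy data: 6 samples on a line, with absolute differences as distances
positions = np.arange(6, dtype=float)
dm = np.abs(positions[:, None] - positions[None, :])


def sub_matrix(dm, idx):
    """Extracts the distances among the drawn positions"""
    return dm[np.ix_(idx, idx)]


def off_diagonal_zeros(mat):
    """Counts zero distances between distinct positions in the draw"""
    off_diag = ~np.eye(len(mat), dtype=bool)
    return int((mat[off_diag] == 0).sum())


# Subsampling without replacement: every drawn sample is unique
no_replace = rng.choice(len(positions), size=4, replace=False)

# A bootstrap-style draw; the repeat is written out by hand so that the
# problem shows up deterministically
with_replace = np.array([0, 0, 2, 5])

zeros_without = off_diagonal_zeros(sub_matrix(dm, no_replace))
zeros_with = off_diagonal_zeros(sub_matrix(dm, with_replace))
```

The repeated sample contributes off-diagonal zeros (`zeros_with` is 2, one for each triangle of the matrix), while the draw without replacement contributes none.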

We then define the test, which will take a list of the observations and return a p value.

We then calculate power as the fraction of times there is a significant difference in 100 tests at the specified depth. To account for variability in the data, we'll perform this power calculation 5 times.
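
The idea of power as a fraction of significant tests can be sketched as below. This is a simplified stand-in written for illustration, not the `emp_power` implementation. The function `empirical_power` and the two stub tests are hypothetical, and the stubs return fixed p-values so the behavior is easy to follow.

```python
import numpy as np


def empirical_power(test, samples, count, num_iter=100, num_runs=5,
                    alpha=0.05, rng=None):
    """Fraction of subsampled tests with p < alpha, for each run.

    Hypothetical stand-in, for illustration only.
    """
    rng = np.random.RandomState() if rng is None else rng
    power = np.zeros(num_runs)
    for run in range(num_runs):
        p_values = np.array([
            # Draws `count` observations per group without replacement
            test([rng.choice(s, size=count, replace=False) for s in samples])
            for _ in range(num_iter)
        ])
        power[run] = (p_values < alpha).mean()
    return power


def always_significant(draws):
    """Stub test that always returns a p-value below alpha"""
    return 0.01


def never_significant(draws):
    """Stub test that always returns a p-value above alpha"""
    return 0.50


rng = np.random.RandomState(25)
samples = [np.arange(0, 50), np.arange(50, 100)]
power_hi = empirical_power(always_significant, samples, count=10, rng=rng)
power_lo = empirical_power(never_significant, samples, count=10, rng=rng)
```

Each entry of the returned array is one run's power estimate, so the spread across the 5 runs gives a sense of the variability in the estimate.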

The output information will be put in a summary dictionary and saved to a binary pickle file.

In [8]:
def calculate_permanova_power(simulation, alpha, depth, power_fp):
    """A helper function for handling power simulations"""

    # Draws the groups and distance matrix and identifies the samples in each group
    dm, groups = simulation['samples']
    samples = [groups.loc[groups == i].index for i in [0, 1]]
    
    # Sets up the counts because we cannot bootstrap
    counts = np.arange(5, min([len(s) for s in samples]) - 10, 10)

    # Defines the statistical test
    def test(ids):
        obs = np.hstack(ids)
        res = skbio.stats.distance.permanova(
            distance_matrix=dm.filter(obs),
            grouping=groups.loc[obs],
            permutations=depth,
        )
        return res['p-value']
    
    # Calculates power
    power = subsample_power(test=test,
                            samples=samples,
                            counts=counts,
                            num_iter=100,
                            num_runs=5,
                            alpha=alpha,
                            bootstrap=False,
                            )
    
    # Generates the summary dictionary
    power_summary = {'emperical_power': power,
                     'traditional_power': None,
                     'original_p': test(samples),
                     'num_obs': len(samples[0]),
                     'counts': counts,
                     }
    
    with open(power_fp, 'wb') as f_:
        pickle.dump(power_summary, f_)

We'll add this function to our distributions dictionary.

In [9]:
distributions['permanova']['function'] = calculate_permanova_power

## Mantel Calculation

The function we'll use for the Mantel test is similar to the one used for PERMANOVA. Here, however, both distance matrices describe the same samples, so we subsample a single set of sample IDs and filter both matrices to that set before testing the correlation between them.
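
The matched draw can be illustrated with plain NumPy arrays. The sketch below uses made-up toy matrices rather than the simulated data, and it computes only the Mantel statistic (the Pearson correlation between the upper triangles), not the permutation p-value.

```python
import numpy as np

rng = np.random.RandomState(25)

# Toy data: two distance matrices over the same 8 samples
n = 8
pos_x = rng.rand(n)
pos_y = pos_x + 0.1 * rng.rand(n)
x = np.abs(pos_x[:, None] - pos_x[None, :])
y = np.abs(pos_y[:, None] - pos_y[None, :])

# One draw of sample positions, without replacement, shared by both matrices
idx = rng.choice(n, size=5, replace=False)
x_sub = x[np.ix_(idx, idx)]
y_sub = y[np.ix_(idx, idx)]

# Mantel statistic: Pearson correlation between the upper triangles
upper = np.triu_indices(len(idx), k=1)
r = np.corrcoef(x_sub[upper], y_sub[upper])[0, 1]
```

Because the same indices filter both matrices, each entry of `x_sub` is still paired with the distance between the same two samples in `y_sub`.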

In [10]:
def calculate_mantel_power(simulation, power_fp, alpha, depth):
    """Wrapper to calculate power for the Mantel test"""

    # Draws the two distance matrices and identifies the samples
    x, y = simulation['samples']
    samples = [np.array(x.ids)]
    
    # Sets up the counts vector
    counts = np.arange(5, len(samples[0]) - 10, 10)
    
    def test(samples):
        obs = samples[0]
        res = skbio.stats.distance.mantel(
            x.filter(obs),
            y.filter(obs),
            permutations=depth
        )
        return res[1]
    
    power = subsample_power(test=test,
                            samples=samples,
                            counts=counts,
                            num_iter=100,
                            num_runs=5,
                            alpha=alpha,
                            bootstrap=False,
                            draw_mode='matched'
                            )
    # Generates the summary dictionary
    power_summary = {'emperical_power': power,
                     'traditional_power': None,
                     'original_p': test(samples),
                     'num_obs': len(samples[0]),
                     'depth': depth,
                     'alpha': alpha,
                     'counts': counts,
                     }
    
    # Saves the file
    with open(power_fp, 'wb') as f_:
        pickle.dump(power_summary, f_)

In [11]:
distributions['mantel']['function'] = calculate_mantel_power

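With both functions added to the distributions dictionary, we'll loop over the simulations and calculate power for each test, skipping any results that already exist unless overwrite is set.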

In [12]:
p = Pool(num_cpus)
results = []
for i in range(num_rounds):
    # Loads the simulation
    for test, dirs in distributions.items():
        sim_dir = dirs['sim_dir']
        power_dir = dirs['power_dir']
        sim_func = dirs['function']

        simulation_fp = os.path.join(sim_dir, 'simulation_%i.p' % i)
        with open(simulation_fp, 'rb') as f_:
            simulation = pickle.load(f_)
        # Generates the power calculation if appropriate
        power_fp = os.path.join(power_dir, 'simulation_%i.p' % i)
        if (overwrite or (not os.path.exists(power_fp))):
            print('%s sim is a go!' % test)
            sim_kwargs = {'simulation': simulation,
                          'alpha': alpha,
                          'depth': depth,
                          'power_fp': power_fp
                          }
            # Pool.apply blocks until each job finishes, which makes the
            # loop serial; apply_async submits the job and moves on
            results.append(p.apply_async(sim_func, kwds=sim_kwargs))

p.close()
p.join()

# Re-raises any exception encountered in a worker process
for res in results:
    res.get()

In the [next notebook](4-Comparisons%20of%20Power%20Calculations.ipynb), we'll explore the empirical power and how it compares to the power fit from the empirical results.