<strong>License</strong>: BSD<br/>
<strong>Copyright</strong>: Copyright The American Gut Project, 2014

In [23]:
%%javascript
IPython.load_extensions('calico-spell-check', 'calico-document-tools')

<IPython.core.display.Javascript object>

In [24]:
# # This cell allows us to render the notebook in the way we wish no matter where
# # the notebook is rendered.
# from IPython.core.display import HTML
# css_file = '../ag.css'
# HTML(open(css_file, "r").read())

<a id="intro"></a>
___
*Note*: this notebook will likely require significant manual interaction and is intended more to document exploratory analysis than to be adapted for other projects.
___

# [Title]

**[text about the topic goes here.]**

<a id="intro_sub1"></a>
## [Subsection 1]
**[Text about optional subsections.]**

<a href="#top">Return to the Table of Contents</a>

<a id="requirements"></a>
## Notebook Requirements
* [Python 2.7.3](https://www.python.org/download/releases/2.7/)
* [Numpy 1.9](http://www.numpy.org)
* [Qiime 1.9](https://www.qiime.org/install/install.html)
* [hdf5](http://www.hdfgroup.org/HDF5/) and [h5py](http://www.h5py.org). This is required to read the American Gut biom tables in Qiime.
* [Jinja2](http://jinja.pocoo.org/docs/dev/), [pyzmq](https://learning-0mq-with-pyzmq.readthedocs.org/en/latest/) and [tornado](http://www.tornadoweb.org/en/stable/). These are required to open a local IPython notebook instance on your machine. They are not automatically installed with IPython or Qiime.
* [Statsmodels 0.6.0](http://statsmodels.sourceforge.net)
* [American Gut Python Library](https://github.com/biocore/American-Gut)
* $\LaTeX$. [TeX Live](http://www.tug.org/texlive/) offers one installation solution.

<a id="top"></a>
## Table of contents
<ul><li><a href="#intro">Introduction</a>
<ul><li><a href="#intro_sub1">Subsection 1</a>
</li></ul>
</li><li><a href="#requirements">Notebook Requirements</a>
</li><li><a href="#imports">Function Import</a>
</li><li><a href="#params">Analysis parameters</a>
<ul><li><a href="#params_data">Dataset Selection</a>
</li><li><a href="#params_save">File Saving Parameters</a>
</li><li><a href="#params_text">Text File and Metadata Parameters</a>
</li><li><a href="#params_cat">Analysis Category Parameters</a>
</li><li><a href="#params_alpha">Alpha Diversity Parameters</a>
</li><li><a href="#params_beta">Beta Diversity Parameters</a>
</li><li><a href="#params_gs">Group Significance Parameters</a>
</li><li><a href="#params_figs">Plotting Parameters</a>
</li></ul>
</li><li><a href="#dir">Files and Directories</a>
<ul><li><a href="#dir_base">Base Directory</a>
</li><li><a href="#dir_data">Sample Directory and Files</a>
</li><li><a href="#dir_bdiv">Beta Diversity Analysis Directories and Files</a>
</li><li><a href="#dir_gs">Group Significance Analysis Directories and Files</a>
</li><li><a href="#dir_image">Image Directories and Files</a>
</li></ul>
</li><li><a href="#download">Data Download</a>
</li><li><a href="#alpha">Alpha Diversity</a>
</li><li><a href="#beta">Beta Diversity</a>
</li><li><a href="#group">Group Significance</a>
</li><li><a href="#discussion">Discussion</a>
</li><li><a href="#refs">References</a>
</li></ul>

<a id="imports"></a>
## Function Import
We start by importing necessary functions, and determining if files should be overwritten.

In [49]:
import os
import shutil
import time

import scipy
import skbio
import biom
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
import statsmodels.api as sms
import statsmodels.formula.api as smf
import americangut.diversity_analysis as div

from matplotlib import rcParams

We will also set up some plotting parameters so the generated figures use Helvetica or Arial as their default font. For more on font properties, see the matplotlib documentation on [text objects](http://matplotlib.org/api/text_api.html?highlight=font#matplotlib.text.Text.set_fontproperties) and [rendering text with LaTeX](http://matplotlib.org/users/pgf.html?highlight=font). We will also prompt the IPython notebook to display the images we generate live in the notebook.

In [3]:
# Displays images inline
%matplotlib inline

# Sets up plotting parameters so that the default font in plots is
# Helvetica, with Arial as a fallback
rcParams['font.family'] = 'sans-serif'
rcParams['font.sans-serif'] = ['Helvetica', 'Arial']
rcParams['text.usetex'] = True

<a href=#top>Return to the top</a>

<a id="params"></a>
## Analysis Parameters
We can also set some necessary parameters for handling files and this analysis. It’s easier to set them as a block here, so that our settings stay consistent, than to modify each variable later in the notebook if our needs or our data change.

<a id="params_data"></a>
### Dataset Selection
We will start by selecting which dataset we’d like to use for this analysis. We can select to work with the full OTU table or focus on a single body site. We can also choose which grouping of this data we’d like to use, limiting the analysis to a certain subset of the American Gut Population.

<table style="width:90%;
              border-style:hidden;
              borders-collapse:collapse;
              line-height:120%">
    <tr>
        <td style="width: 30%;
                   text-align:left; 
                   vertical-align:top;
                   background-color:#D0D0D0;
                   border-right:hidden; 
                   border-bottom: 10px solid white;
                   padding:10px">
            <strong>site</strong><br>
(<code style="color:Firebrick;background-color:#D0D0D0">"all"</code>, 
<code style="color:Firebrick;background-color:#D0D0D0">"fecal"</code>, 
<code style="color:Firebrick;background-color:#D0D0D0">"oral"</code>, 
<code style="color:Firebrick;background-color:#D0D0D0">"skin"</code>)
        </td>
        <td style="width: 60%
                   text-align: left;
                   vertical-align: top;
                   border-left:hidden;
                   border-top:hidden;
                   border-bottom:hidden;
                   padding:10px;
                   ">
            This identifies the body site to be analyzed. It is recommended that categorical analysis focus on a single body site, since location on the human body has the largest effect on the microbial communities at those sites in relatively healthy adults and children four years of age and older [<a href="#22699611">22699611</a>; <a href="#22699609">22699609</a>].
        </td>
    </tr>
    <tr>
        <td style="width: 30%;
                   text-align:left; 
                   vertical-align:top;
                   background-color:#D0D0D0;
                   border-right:hidden; 
                   border-bottom: 10px solid white;
                   padding:10px">
            <strong>dataset</strong><br>(
            <code style="color:Firebrick;background-color:#D0D0D0">""</code>,
            <code style="color:Firebrick;background-color:#D0D0D0">"all_participants_all_samples"</code>,
            <code style="color:Firebrick;background-color:#D0D0D0">"all_participants_one_sample"</code>,
            <code style="color:Firebrick;background-color:#D0D0D0">"sub_participants_all_samples"</code>,
            <code style="color:Firebrick;background-color:#D0D0D0">"sub_participants_one_sample"</code>)
        </td>
        <td style="width: 60%
                   text-align: left;
                   vertical-align: top;
                   border-left:hidden;
                   border-top:hidden;
                   border-bottom:hidden;
                   padding:10px;
                   ">
            <p><strong><code>dataset</code></strong> identifies the subset of samples to be used 
            for analysis.</p> 
            <p>If the <strong><code>site</code></strong> is <code><font color="Firebrick">"all"</font></code>, the <strong><code>dataset</code></strong> should be <code><font color="Firebrick">""</font></code>. <br>
            Multiple subsets of samples are not available for the combined data if it was generated 
            by the Preprocessing Notebook.</p>
            <p>For site-specific analyses, every site has data for all participants and all 
            samples.
            Each individual’s microbiome is correlated with itself 
            [<a href="#21624126">21624126</a>, <a href="#21885731">21885731</a>], so allowing 
            multiple samples per individual violates the assumption of independence used in many statistical tests. Therefore, the 
            Preprocessing Notebook draws a single sample for each individual at each body site.</p>
            <p>We may also choose to work with a subset of the data. The preprocessing notebook 
            selects a healthy subset of adult participants. This includes individuals between 20 
            and 70 who have a BMI between 18.5 and 30. Additionally, these individuals cannot have 
            been diagnosed with IBD or diabetes and do not report using antibiotics in the past
            year.</p>
        </td>
    </tr>
</table>

In [4]:
site = 'fecal'
dataset = 'all_participants_one_sample'
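The healthy-subset criteria described in the table above could be expressed as a pandas filter along the following lines. This is only a sketch: the column names and category labels are borrowed from the variables used later in this notebook, the example table is made up, treating the age and BMI bounds as inclusive is an assumption, and the Preprocessing Notebook may apply the criteria differently.

```python
import pandas as pd

# Made-up metadata standing in for a real American Gut mapping file.
example_map = pd.DataFrame(
    {'AGE': [35, 72, 45, 28],
     'BMI': [22.0, 24.0, 33.5, 19.0],
     'IBD': ['I do not have IBD'] * 4,
     'DIABETES': ['I do not have diabetes', 'I do not have diabetes',
                  'I do not have diabetes', 'Type II'],
     'ANTIBIOTIC_SELECT': ['Not in the last year'] * 4},
    index=['s1', 's2', 's3', 's4'])

# Assumption: the age and BMI bounds are treated as inclusive.
healthy = (
    example_map['AGE'].between(20, 70) &
    example_map['BMI'].between(18.5, 30) &
    (example_map['IBD'] == 'I do not have IBD') &
    (example_map['DIABETES'] == 'I do not have diabetes') &
    (example_map['ANTIBIOTIC_SELECT'] == 'Not in the last year')
)

# Keeps only the rows which satisfy every criterion.
healthy_map = example_map.loc[healthy]
```

In this made-up table only the first sample passes: the others fail on age, BMI and diabetes status respectively.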

<a id="params_save"></a>
### File Saving Parameters

In the course of this analysis, a series of files can be generated. The File Saving Parameters determine if new files are saved.

<table style="width:90%;
              border-style:hidden;
              borders-collapse:collapse;
              line-height:120%">
    <tr>
        <td style="width: 30%;
                   text-align:left; 
                   vertical-align:top;
                   background-color:#D0D0D0;
                   border-right:hidden; 
                   border-bottom: 10px solid white;
                   padding:10px">
            <strong>overwrite</strong><br />(boolean)
        </td>
        <td style="width: 60%
                   text-align: left;
                   vertical-align: top;
                   border-left:hidden;
                   border-top:hidden;
                   border-bottom:hidden;
                   padding:10px;
                   ">
            <p>When <strong><code>overwrite</code></strong> is 
            <code><font color="228B22">True</font></code>, new files will
            be generated and saved during data processing. <br>It is 
            recommended that overwrite be set to 
            <code><font color="228B22">False</font></code>, in which case 
            new files will only be generated when the file does not exist. 
            This substantially decreases analysis time.</p>
            <p><strong><code>overwrite</code></strong> will also cause the 
            notebook to generate new post-hoc beta diversity comparisons, 
            even if the files exist. This can be computationally 
            expensive, and scales with the number of groups in a metadata 
            category and the number of samples.</p>
        </td>
    </tr>
    <tr>
        <td style="width: 30%;
                   text-align:left; 
                   vertical-align:top;
                   background-color:#D0D0D0;
                   border-right:hidden; 
                   border-bottom: 10px solid white;
                   padding:10px">
            <strong>save_images</strong><br>(boolean)
        </td>
        <td style="width: 60%
                   text-align: left;
                   vertical-align: top;
                   border-left:hidden;
                   border-top:hidden;
                   border-bottom:hidden;
                   padding:10px;
                   ">
            This notebook will generate images of the power curves. By 
            default, these will be displayed inside the notebook. However, 
            some users also find it advantageous to save the images.
            The file format can be set with the 
            <a href="#dir_analysis"><strong><code>image_pattern</code></strong></a>.
        </td>
    </tr>
</table>

In [5]:
overwrite = False
save_images = True
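The behaviour described for `overwrite` amounts to a guard around each expensive step. The sketch below illustrates the pattern; `write_if_needed` is a hypothetical helper written for this example, not a function from the American Gut library, and it writes to a temporary directory.

```python
import os
import tempfile


def write_if_needed(path, make_text, overwrite=False):
    """Writes make_text() to path unless the file exists and overwrite is False.

    Returns True if the file was (re)generated, and False if it was skipped.
    """
    if os.path.exists(path) and not overwrite:
        return False
    with open(path, 'w') as f_:
        f_.write(make_text())
    return True


# Uses a temporary directory so the example does not touch real results.
example_dir = tempfile.mkdtemp()
example_fp = os.path.join(example_dir, 'results.txt')

# The first call generates the file, the second skips it, and the third
# regenerates it because overwrite is set.
first = write_if_needed(example_fp, lambda: 'expensive result\n')
second = write_if_needed(example_fp, lambda: 'expensive result\n')
forced = write_if_needed(example_fp, lambda: 'expensive result\n',
                         overwrite=True)
```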

<a href="#top">Return to the top</a>

<a id="params_text"></a>
### Text File and Metadata Parameters
Qiime-formatted metadata and results files are frequently tab-separated text (.txt) files. These files can be opened in Excel or other spreadsheet programs. You can learn more about Qiime mapping files [here](http://qiime.org/documentation/file_formats.html). We use the Pandas library to read most of our text files, which provides some spreadsheet-like functionality.

<table style="width:90%;
              border-style:hidden;
              borders-collapse:collapse;
              line-height:120%">
    <tr>
        <td style="width: 30%;
                   text-align:left; 
                   vertical-align:top;
                   background-color:#D0D0D0;
                   border-right:hidden; 
                   border-bottom: 10px solid white;
                   padding:10px">
            <strong>txt_delim</strong><br />(string)
        </td>
        <td style="width: 60%
                   text-align: left;
                   vertical-align: top;
                   border-left:hidden;
                   border-top:hidden;
                   border-bottom:hidden;
                   padding:10px;
                   ">
            <strong><code>txt_delim</code></strong> specifies the way columns are separated in the files. Qiime typically consumes and produces tab-delimited (<code><font color="FireBrick">"\t"</font></code>) text files (.txt) for metadata and results generation.
        </td>
    </tr>
    <tr>
        <td style="width: 30%;
                   text-align:left; 
                   vertical-align:top;
                   background-color:#D0D0D0;
                   border-right:hidden; 
                   border-bottom: 10px solid white;
                   padding:10px">
              <strong>map_index</strong><br />(string)
        </td>
        <td style="width: 60%
                   text-align: left;
                   vertical-align: top;
                   border-left:hidden;
                   border-top:hidden;
                   border-bottom:hidden;
                   padding:10px;
                   ">
            The name of the column containing the sample names. In Qiime, this column is called <code><font color="FireBrick">#SampleID</font></code>.
        </td>
    </tr>

    <tr>
        <td style="width: 30%;
                   text-align:left; 
                   vertical-align:top;
                   background-color:#D0D0D0;
                   border-right:hidden; 
                   border-bottom: 10px solid white;
                   padding:10px">
            <strong>map_nas</strong><br />(list of strings)
        </td>
        <td style="width: 60%
                   text-align: left;
                   vertical-align: top;
                   border-left:hidden;
                   border-top:hidden;
                   border-bottom:hidden;
                   padding:10px;
                   ">
            It is possible a mapping file may be missing values, since American Gut participants are free to skip any question. The pandas package is able to omit these missing values from analysis. In raw American Gut files, missing values are typically denoted as <code><font color="FireBrick">"NA"</font></code>, <code><font color="FireBrick">"no_data"</font></code>, <code><font color="FireBrick">"unknown"</font></code>, and empty strings (<code><font color="FireBrick">""</font></code>).
        </td>
    </tr>

    <tr>
        <td style="width: 30%;
                   text-align:left; 
                   vertical-align:top;
                   background-color:#D0D0D0;
                   border-right:hidden; 
                   border-bottom: 10px solid white;
                   padding:10px">
            <strong>write_na</strong><br /> (string)
        </td>
        <td style="width: 60%
                   text-align: left;
                   vertical-align: top;
                   border-left:hidden;
                   border-top:hidden;
                   border-bottom:hidden;
                   padding:10px;
                   ">
            The value used to denote missing values when text files are written from Pandas data frames. Using an empty string (<code><font color="FireBrick">""</font></code>) will allow certain Qiime scripts, like <a href="http://qiime.org/scripts/group_significance.html">group_significance.py</a>, to ignore the missing values.
        </td>
    </tr>

    <tr>
        <td style="width: 30%;
                   text-align:left; 
                   vertical-align:top;
                   background-color:#D0D0D0;
                   border-right:hidden; 
                   border-bottom: 10px solid white;
                   padding:10px">
            <strong>date_cols</strong><br /> (list of strings)
        </td>
        <td style="width: 60%
                   text-align: left;
                   vertical-align: top;
                   border-left:hidden;
                   border-top:hidden;
                   border-bottom:hidden;
                   padding:10px;
                   ">
            Temporal data can be identified using <strong><code>date_cols</code></strong>, allowing Pandas to do time-based analysis. In the American Gut dataset, there are four columns we identify initially: 
            <ul><li><em>BIRTH_DATE</em>          (the participant’s birthdate)
            </li><li><em>COLLECTION_DATE</em>    (the day the sample was collected)
            </li><li><em>SAMPLE_TIME</em>        (the time the sample was collected)
            </li><li><em>RUN_DATE</em>           (the day the samples were sequenced)
            </li></ul>
        </td>
    </tr>
</table>

In [6]:
# Sets parameters for file handling and reading tables
# into pandas
txt_delim = '\t'
map_index = '#SampleID'
map_nas = ['NA', 'no_data', 'unknown', '']
write_na = ''
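As an illustration of how these parameters fit together, the sketch below reads a tiny made-up mapping file with pandas and writes it back out. The example table and the variable names `example_map`, `map_` and `written` are invented for this illustration; a real analysis would pass a file path rather than an in-memory string.

```python
import io

import pandas as pd

txt_delim = '\t'
map_index = '#SampleID'
map_nas = ['NA', 'no_data', 'unknown', '']
write_na = ''

# A tiny, made-up mapping file standing in for a real American Gut map.
example_map = (u'#SampleID\tAGE\tSEX\n'
               u's1\t34\tfemale\n'
               u's2\tunknown\tmale\n'
               u's3\tno_data\tNA\n')

# Reads the table, treating every string in map_nas as a missing value
# and using the sample ID column as the index.
map_ = pd.read_csv(io.StringIO(example_map),
                   sep=txt_delim,
                   na_values=map_nas,
                   index_col=map_index)

# Writes the table back to text, representing missing values with write_na.
written = map_.to_csv(sep=txt_delim, na_rep=write_na, index_label=map_index)
```

Because missing ages are read as `NaN`, pandas stores the `AGE` column as floating point values rather than integers.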

<a href="#top">Return to the top</a>

<a id="params_alpha"></a>
### Alpha Diversity Parameters

<p>This notebook will compare alpha diversity and beta diversity for the category selected. Alpha diversity measures the diversity within a single community. When alpha diversity values are compared, the comparison does not take into account the community structure, so two communities which share no species can have the same alpha diversity. American Gut analyses primarily focus on an alpha diversity metric called PD Whole Tree Diversity [<a href="#15831718">15831718</a>]. PD Whole Tree is phylogenetically aware, meaning that it takes into account shared evolutionary history.</p>
<p>We will compare alpha diversity using a <a href="http://en.wikipedia.org/wiki/Kruskal–Wallis_one-way_analysis_of_variance">Kruskal-Wallis test</a>, and we will plot the results as a <a href="http://en.wikipedia.org/wiki/Box_plot">boxplot</a>. We can set parameters for the way we make the comparison and how the figure will look.</p>


<table style="width:90%;
              border-style:hidden;
              borders-collapse:collapse;
              line-height:120%">
    <tr>
        <td style="width: 30%;
                   text-align:left; 
                   vertical-align:top;
                   background-color:#D0D0D0;
                   border-right:hidden; 
                   border-bottom: 10px solid white;
                   padding:10px">
            <strong>a_div_metric</strong><br />(string)
        </td>
        <td style="width: 60%
                   text-align: left;
                   vertical-align: top;
                   border-left:hidden;
                   border-top:hidden;
                   border-bottom:hidden;
                   padding:10px;
                   ">
            <p>The alpha diversity metric to be used in the analysis. Mapping files generated by the Preprocessing Notebook have a set of mapping columns appended which provide the mean for several metrics. These are labeled as the metric name with <font color="firebrick"><code>“_mean”</code></font> appended to the end, to indicate the values are the mean of 10 rarefactions.</p>
<p>There are multiple alpha diversity metrics which can be used. The preprocessing notebook calculates four possible alpha diversity metrics for the data (PD Whole Tree Diversity [<a href="#15831718">15831718</a>], Shannon Diversity [<a href="#shannon">Shannon</a>], Chao1 diversity [<a href="#chao1">Chao</a>], and Observed Species diversity). The default value used here, <code><font color="Firebrick">“PD_whole_tree_mean”</font></code>, is the only metric which takes into account the evolutionary relationship between organisms in a sample.</p>
        </td>
    </tr>
    <tr>
        <td style="width: 30%;
                   text-align:left; 
                   vertical-align:top;
                   background-color:#D0D0D0;
                   border-right:hidden; 
                   border-bottom: 10px solid white;
                   padding:10px">
            <strong>a_ylabel</strong><br />(string)
        </td>
        <td style="width: 60%
                   text-align: left;
                   vertical-align: top;
                   border-left:hidden;
                   border-top:hidden;
                   border-bottom:hidden;
                   padding:10px;
                   ">
            This y-label appears on the boxplot to help clarify the information presented there. Alpha diversity is a unitless quantity.
        </td>
    </tr>
    <tr>
        <td style="width: 30%;
                   text-align:left; 
                   vertical-align:top;
                   background-color:#D0D0D0;
                   border-right:hidden; 
                   border-bottom: 10px solid white;
                   padding:10px">
            <strong>a_ylim</strong><br />(2 element list of numbers)
        </td>
        <td style="width: 60%
                   text-align: left;
                   vertical-align: top;
                   border-left:hidden;
                   border-top:hidden;
                   border-bottom:hidden;
                   padding:10px;
                   ">
            This specifies the limits for the y axis. Limits should be set as a function of the diversity metric. With PD whole tree diversity, limits of <code>[5, 55]</code> are suggested. For Shannon diversity, <code>[0, 8]</code> should be used as limits. Chao1 diversity has a larger range, and <code>[100, 1000]</code> can be used as limits. Observed Species also has a larger range, and <code>[0, 800]</code> may be an appropriate starting place.
        </td>
    </tr>
</table>


In [22]:
# Alpha Diversity Parameters
a_div_metric = 'PD_whole_tree_mean'
a_ylabel = 'PD Whole Tree Mean'
a_ylim = [5, 55]
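A Kruskal-Wallis comparison of the kind described above can be sketched with `scipy.stats.kruskal`. The data frame, the `GROUP` column and the simulated values are made up for illustration, and the notebook's own analysis may use a different helper function.

```python
import numpy as np
import pandas as pd
from scipy import stats

a_div_metric = 'PD_whole_tree_mean'

# Simulated alpha diversity values for two made-up groups of samples.
rng = np.random.RandomState(0)
example = pd.DataFrame({
    a_div_metric: np.concatenate([rng.normal(20, 3, 30),
                                  rng.normal(25, 3, 30)]),
    'GROUP': ['a'] * 30 + ['b'] * 30,
    })

# Splits the alpha diversity values by group, dropping missing values.
groups = [group[a_div_metric].dropna().values
          for _, group in example.groupby('GROUP')]

# The test compares the rank distributions of the groups.
h_stat, p_value = stats.kruskal(*groups)
```

The test is rank-based, so it does not assume the diversity values are normally distributed within each group.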

<a href="#top">Return to the top</a>

<a id="params_beta"></a>
### Beta Diversity Parameters

Beta diversity looks at the difference in community structure between two communities. Each metric calculates a distance between the communities, which reflects how different they are. American Gut Analyses have calculated weighted and unweighted UniFrac distance for the communities [<a href="#16332807">16332807</a>]. UniFrac distance takes into account the evolutionary relationship between samples, by determining what fraction of evolutionary history is different between two samples. Weighted UniFrac also takes into account the relative abundance of each taxon, while unweighted UniFrac distance only considers presence and absence. We will compare intra- and intergroup UniFrac distances using permutation tests. We will plot the distance between samples as a bar chart. 

<table style="width:90%;
              border-style:hidden;
              borders-collapse:collapse;
              line-height:120%">
    <tr>
        <td style="width: 30%;
                   text-align:left; 
                   vertical-align:top;
                   background-color:#D0D0D0;
                   border-right:hidden; 
                   border-bottom: 10px solid white;
                   padding:10px">
            <strong>b_div_metric</strong><br />(string)
        </td>
        <td style="width: 60%
                   text-align: left;
                   vertical-align: top;
                   border-left:hidden;
                   border-top:hidden;
                   border-bottom:hidden;
                   padding:10px;
                   ">
            The beta diversity metric to be used in the analysis. This name will appear at the beginning of the distance matrix file.
        </td>
    </tr>
</table>

In [8]:
# Beta Diversity Parameters
b_div_metric = 'unweighted_unifrac'
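The idea behind a permutation test on within-group and between-group distances can be sketched with numpy alone. Everything in this example is an assumption made for illustration: the distance matrix is Euclidean distance between simulated points rather than UniFrac, the function names are hypothetical, and the test statistic (mean between-group distance minus mean within-group distance) is one simple choice, not necessarily the one used in this analysis.

```python
import numpy as np


def between_minus_within(dm, labels):
    """Mean between-group distance minus mean within-group distance."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    # Only the upper triangle is used, so each pair is counted once.
    upper = np.triu(np.ones(dm.shape, dtype=bool), k=1)
    return dm[upper & ~same].mean() - dm[upper & same].mean()


def permutation_p(dm, labels, n_perm=999, seed=0):
    """Compares the observed statistic to label-shuffled statistics."""
    rng = np.random.RandomState(seed)
    observed = between_minus_within(dm, labels)
    count = 0
    for _ in range(n_perm):
        if between_minus_within(dm, rng.permutation(labels)) >= observed:
            count += 1
    # One is added to both counts so the p-value is never exactly zero.
    return observed, (count + 1.0) / (n_perm + 1.0)


# Simulated samples: two clusters of points, standing in for two groups.
rng = np.random.RandomState(1)
points = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])
labels = ['a'] * 10 + ['b'] * 10

# Euclidean distance stands in for a UniFrac distance matrix here.
dm = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1))

observed, p_value = permutation_p(dm, labels)
```

Shuffling the group labels breaks any real association between group and distance, so the shuffled statistics approximate what would be seen if group membership did not matter.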

### Regression Parameters

We'll start by setting up the parameters for our regression and providing several pieces of information about the variables we'll be working with. [More text to be added later...]

<table style="width:90%;
              border-style:hidden;
              borders-collapse:collapse;
              line-height:120%">
    <tr>
        <td style="width: 30%;
                   text-align:left; 
                   vertical-align:top;
                   background-color:#D0D0D0;
                   border-right:hidden; 
                   border-bottom: 10px solid white;
                   padding:10px">
            <strong>all_variables</strong><br />(list of strings)
        </td>
        <td style="width: 60%
                   text-align: left;
                   vertical-align: top;
                   border-left:hidden;
                   border-top:hidden;
                   border-bottom:hidden;
                   padding:10px;
                   ">
            The names of any variables which will be used in the regression.
        </td>
    </tr>
    <tr>
        <td style="width: 30%;
                   text-align:left; 
                   vertical-align:top;
                   background-color:#D0D0D0;
                   border-right:hidden; 
                   border-bottom: 10px solid white;
                   padding:10px">
            <strong>con_predictors</strong><br />(list of strings)
        </td>
        <td style="width: 60%
                   text-align: left;
                   vertical-align: top;
                   border-left:hidden;
                   border-top:hidden;
                   border-bottom:hidden;
                   padding:10px;
                   ">
            The names of the continuous variables which will be used as predictors in the regression.
        </td>
    </tr>
</table>

In [25]:
all_variables = ['AGE', 'BMI', 'IBD', 'DIABETES', 'ANTIBIOTIC_SELECT', 
                 'ALCOHOL_FREQUENCY', 'TYPES_OF_PLANTS', 'COLLECTION_MONTH',
                 'EXERCISE_FREQUENCY', 'EXERCISE_LOCATION', 'SLEEP_DURATION',
                 'PD_whole_tree_mean', 'SEX', 'BMI_CAT']
response = 'PD_whole_tree_mean'

con_predictors = ['AGE', 'BMI']

cat_predictors = ['IBD', 'DIABETES', 'ANTIBIOTIC_SELECT', 
                  'ALCOHOL_FREQUENCY', 'TYPES_OF_PLANTS', 'COLLECTION_MONTH',
                  'EXERCISE_FREQUENCY', 'EXERCISE_LOCATION', 'SLEEP_DURATION',
                  'SEX', 'BMI_CAT']
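To show how the response and continuous predictors relate to a regression model, the sketch below builds a Patsy-style formula string of the form accepted by the statsmodels formula interface, then fits the same linear model by ordinary least squares. The data are simulated, and `numpy.linalg.lstsq` is used here only as a self-contained stand-in for the regression itself.

```python
import numpy as np
import pandas as pd

response = 'PD_whole_tree_mean'
con_predictors = ['AGE', 'BMI']

# A formula string with the response on the left and predictors on the right.
formula = '%s ~ %s' % (response, ' + '.join(con_predictors))

# Simulated data with a known relationship, invented for this example.
rng = np.random.RandomState(0)
example = pd.DataFrame({'AGE': rng.uniform(20, 70, 50),
                        'BMI': rng.uniform(18.5, 30, 50)})
example[response] = (10 + 0.1 * example['AGE'] - 0.2 * example['BMI'] +
                     rng.normal(0, 0.01, 50))

# Design matrix: a column of ones for the intercept, then each predictor.
design = np.column_stack([np.ones(len(example)),
                          example[con_predictors].values])

# Least squares estimates, ordered as intercept, AGE, BMI.
coefs = np.linalg.lstsq(design, example[response].values, rcond=None)[0]
```

Because the simulated noise is small, the fitted slopes land close to the values used to generate the data.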

We define our own categorization functions so we can pick our own reference category for each variable. The [Patsy coding system](http://statsmodels.sourceforge.net/devel/contrasts.html) is another option for handling categorical data.

In [96]:
def categorize_ibd(x):
    if x == 'I do not have IBD':
        return 0
    elif x in {"Ulcerative colitis", "Crohn's disease"}:
        return 1
    else:
        return np.nan

def categorize_diabetes(x):
    if x == 'I do not have diabetes':
        return 0
    elif x in {'Type I', 'Type II'}:
        return 1
    else:
        return np.nan

def categorize_antibiotics(x):
    if x == 'Not in the last year':
        return np.array([0, 0, 0])
    elif x in {'In the past week', 'In the past month'}:
        return np.array([1, 0, 0])
    elif x == 'In the past 6 months':
        return np.array([0, 1, 0])
    elif x == 'In the past year':
        return np.array([0, 0, 1])
    else:
        return np.array([np.nan]*3)
    
def categorize_frequency(x):
    if x == 'Never':
        return np.array([0, 0, 0, 0])
    elif x in {'Rarely', 'Rarely (few times/month)'}:
        return np.array([1, 0, 0, 0])
    elif x in {'Occasionally', 'Occasionally (1-2 times/week)'}:
        return np.array([0, 1, 0, 0])
    elif x in {'Regularly', 'Regularly (3-5 times/week)'}:
        return np.array([0, 0, 1, 0])
    elif x == 'Daily':
        return np.array([0, 0, 0, 1])
    else:
        return np.array([np.nan]*4)

def categorize_plants(x):
    if x == 'Less than 5':
        return np.array([0, 0, 0, 0])
    elif x == '6 to 10':
        return np.array([1, 0, 0, 0])
    elif x == '11 to 20':
        return np.array([0, 1, 0, 0])
    elif x in {'28', '21 to 30'}:
        return np.array([0, 0, 1, 0])
    elif x == 'More than 30':
        return np.array([0, 0, 0, 1])
    else:
        return np.array([np.nan]*4)
    

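def descritize_month(x):
    # Converts a full month name (e.g. 'March') to its number (1-12).
    # Values which cannot be parsed are treated as missing.
    try:
        t = time.strptime(x, '%B')
        return t.tm_mon
    except (TypeError, ValueError):
        return np.nan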
    
def categorize_season(x):
    if x == 'Winter':
        return np.array([0, 0, 0])
    elif x == 'Spring':
        return np.array([1, 0, 0])
    elif x == 'Summer':
        return np.array([0, 1, 0])
    elif x == 'Fall':
        return np.array([0, 0, 1])
    else:
        return np.array([np.nan]*3)
    
def descritize_season(x):
    if x == "Winter":
        return 0
    elif x == 'Spring':
        return 1
    elif x == 'Summer':
        return 2
    elif x == 'Fall':
        return 3
    else:
        return np.nan
    
def categorize_location(x):
    if x == 'Indoors':
        return np.array([0, 0, 0, 0])
    elif x == 'Outdoors':
        return np.array([1, 0, 0, 0])
    elif x == 'Depends on the Season':
        return np.array([0, 1, 0, 0])
    elif x == 'Both':
        return np.array([0, 0, 1, 0])
    elif x == 'None of the above':
        return np.array([0, 0, 0, 1])
    else:
        return np.array([np.nan]*4)

def categorize_sleep(x):
    if x == 'Less than 6 hours':
        return np.array([0, 0, 0])
    elif x == '6-7 hours':
        return np.array([1, 0, 0])
    elif x == '7-8 hours':
        return np.array([0, 1, 0])
    elif x == '8 or more hours':
        return np.array([0, 0, 1])
    else:
        return np.array([np.nan]*3)
    
def categorize_sex(x):
    if x == 'female':
        return np.array([0, 0])
    elif x == 'male':
        return np.array([1, 0])
    elif x == 'other':
        return np.array([0, 1])
    else:
        return np.array([np.nan, np.nan])

def categorize_bmi(x):
    if x == 'Normal':
        return np.array([0, 0, 0])
    elif x == "Underweight":
        return np.array([1, 0, 0])
    elif x == 'Overweight':
        return np.array([0, 1, 0])
    elif x == 'Obese':
        return np.array([0, 0, 1])
    else:
        return np.array([np.nan]*3)
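Most of the functions above follow the same pattern: the reference category maps to all zeros, each other category sets a single position to one, and unrecognized values map to missing. A generic version of that pattern might look like the sketch below. `make_dummy_coder` is a hypothetical helper, not part of this notebook or the American Gut library, and unlike the functions above it always returns floating point arrays and does not handle synonyms such as `'Rarely (few times/month)'`.

```python
import numpy as np


def make_dummy_coder(reference, others):
    """Builds a function which dummy-codes a value against a reference.

    reference is coded as all zeros, each value in others is coded with a
    single one, and anything else is coded as all NaN.
    """
    others = list(others)

    def coder(x):
        if x == reference:
            return np.zeros(len(others))
        if x in others:
            coded = np.zeros(len(others))
            coded[others.index(x)] = 1
            return coded
        return np.full(len(others), np.nan)

    return coder


# Mirrors the categories used for BMI_CAT, with 'Normal' as the reference.
code_bmi = make_dummy_coder('Normal', ['Underweight', 'Overweight', 'Obese'])
```

Writing the functions out by hand, as this notebook does, keeps special cases visible at the cost of some repetition.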

Now, we'll create a dictionary of categorical variables, mapping each variable to the column names it produces when converted and to its conversion function.

In [103]:
conversion = {'IBD': (['ibd_case'], categorize_ibd),
              'DIABETES': (['diabetes_case'], categorize_diabetes),
              'ANTIBIOTIC_SELECT': (['ABX_past_month', 'ABX_past_6_months', 'ABX_past_year'], categorize_antibiotics),
              'TYPES_OF_PLANTS': (['PLANTS_6_10', 'PLANTS_11_20', 'PLANTS_21_30', 'PLANTS_30+'], categorize_plants),
              'EXERCISE_LOCATION': (['EX_LOC_out', 'EX_LOC_depends', 'EX_LOC_both', 'EX_LOC_none'], categorize_location),
              'SLEEP_DURATION': (['SLEEP_6', 'SLEEP_7', 'SLEEP_8'], categorize_sleep),
              'SEX': (['SEX_m', 'SEX_o'], categorize_sex),
              'BMI_CAT': (['BMI_under', 'BMI_over', 'BMI_obese'], categorize_bmi),
              'COLLECTION_MONTH': (['COLLECTION_MONTH'], descritize_month),
              'EXERCISE_FREQUENCY': (['EX_FREQ_rare', 'EX_FREQ_occ', 'EX_FREQ_regular', 'EX_FREQ_daily'], categorize_frequency),
              'ALCOHOL_FREQUENCY': (['ETOH_FREQ_rare', 'ETOH_FREQ_occ', 'ETOH_regular', 'ETOH_FREQ_daily'], categorize_frequency)
              }

<a href="#top">Return to the top</a>

<a id="dir"></a>
## Files and Directories

We need to import working OTU data for analysis and set up a location where results from our analysis can be saved. This notebook consumes pre-processed tables (OTU tables, mapping files and distance matrices) produced by the Preprocessing Notebook. These can be downloaded individually, or the whole set is available [here](https://www.dropbox.com/s/q7wrf4tme2mrt0p/all_samples.tgz).

As we set up directories, we’ll make use of the **check_dir** function. This will create the directories we identify if they do not exist.

<a id="dir_base"></a>

### Base Directory

We need a general location to do all our analysis; this is the **base_dir**. All our other directories will exist within the **base_dir**. The working directory is a directory within the base directory where we’ll find the files we need.

<table style="width:90%;
              border-style:hidden;
              borders-collapse:collapse;
              line-height:120%">
    <tr>
        <td style="width: 30%;
                   text-align:left; 
                   vertical-align:top;
                   background-color:#D0D0D0;
                   border-right:hidden; 
                   border-bottom: 10px solid white;
                   padding:10px">
            <strong>base_dir</strong><br />(string)
        </td>
        <td style="width: 60%
                   text-align: left;
                   vertical-align: top;
                   border-left:hidden;
                   border-top:hidden;
                   border-bottom:hidden;
                   padding:10px;
                   ">
            The filepath for the directory where any files associated with the analysis should be saved. It is suggested this be a directory called <strong>agp_analysis</strong>, and be located in the same directory as the IPython notebooks.
        </td>
    </tr>
    <tr>
        <td style="width: 30%;
                   text-align:left; 
                   vertical-align:top;
                   background-color:#D0D0D0;
                   border-right:hidden; 
                   border-bottom: 10px solid white;
                   padding:10px">
            <strong>working_dir</strong><br />(string)
        </td>
        <td style="width: 60%
                   text-align: left;
                   vertical-align: top;
                   border-left:hidden;
                   border-top:hidden;
                   border-bottom:hidden;
                   padding:10px;
                   ">
            The file path for the directory where all data files associated with this analysis have been stored. This should contain the results of the Preprocessing Notebook.<br>
The working_dir is expected to be a directory called <strong>sample_data</strong> in the <strong><code>base_dir</code></strong>.
        </td>
    </tr>
    <tr>
        <td style="width: 30%;
                   text-align:left; 
                   vertical-align:top;
                   background-color:#D0D0D0;
                   border-right:hidden; 
                   border-bottom: 10px solid white;
                   padding:10px">
            <strong>analysis_dir</strong><br />(string)
        </td>
        <td style="width: 60%
                   text-align: left;
                   vertical-align: top;
                   border-left:hidden;
                   border-top:hidden;
                   border-bottom:hidden;
                   padding:10px;
                   ">
            The file path where analysis results should be stored. This is expected to be a folder in the <strong><code>base_dir</code></strong>.
        </td>
    </tr>
</table>

In [59]:
# Sets up the base directory
# base_dir = os.path.join(os.path.abspath('.'), 'agp_analysis')
base_dir = '/Users/jwdebelius/Desktop/agp_analysis'
div.check_dir(base_dir)

# Sets up data directory
working_dir = os.path.join(base_dir, 'sample_data')
div.check_dir(working_dir)

# Sets up the analysis directory
analysis_dir = os.path.join(base_dir, 'analysis_results')
div.check_dir(analysis_dir)
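
The directory set-up above relies on `div.check_dir`, which is described as creating a directory when it does not already exist. As a rough sketch of that behavior, the same pattern could be written with the standard library. The helper name `ensure_dir` and the temporary paths are hypothetical, and this is an assumption about the behavior, not the American Gut library's implementation.

```python
import os
import tempfile


def ensure_dir(dir_path):
    """Creates dir_path (and any missing parents) if it does not exist.

    Hypothetical stand-in for `div.check_dir`; an assumption about its
    behavior, not the library's code.
    """
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)
    return dir_path


# Builds the same nested layout used above inside a throwaway directory
demo_base = os.path.join(tempfile.mkdtemp(), 'agp_analysis')
demo_working = ensure_dir(os.path.join(demo_base, 'sample_data'))
demo_analysis = ensure_dir(os.path.join(demo_base, 'analysis_results'))
```

Calling the helper a second time on the same path leaves the existing directory untouched, so the cell can be re-run safely.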

<a href="#top">Return to the top</a>
<a id="dir_data"></a>
### Sample Directory and Files

We’ll focus our analysis on fecal samples, which we set with the <a href="#params_data"><strong><code>site</code></strong></a> variable. We’ve chosen to focus on a single sample from a healthy subset of adults in the American Gut population (set with the <strong><code><a href="#params_data">dataset</a></code></strong> variable). To be included in this group, a sample must come from a donor between the ages of 20 and 69 (inclusive) who has a BMI between 18.5 and 30 and does not report having IBD or diabetes.  
Our analysis will use a mapping file, OTU table and unweighted UniFrac distance matrix.

<table style="width:90%;
              border-style:hidden;
              borders-collapse:collapse;
              line-height:120%">
    <tr>
        <td style="width: 30%;
                   text-align:left; 
                   vertical-align:top;
                   background-color:#D0D0D0;
                   border-right:hidden; 
                   border-bottom: 10px solid white;
                   padding:10px">
            <strong>site_dir</strong><br>(string)
        </td>
        <td style="width: 60%
                   text-align: left;
                   vertical-align: top;
                   border-left:hidden;
                   border-top:hidden;
                   border-bottom:hidden;
                   padding:10px;
                   ">
            The filepath for the directory where data sets from fecal samples are stored. This should be a directory in the <strong><code>working_dir</code></strong>.
        </td>
    </tr>
    <tr>
        <td style="width: 30%;
                   text-align:left; 
                   vertical-align:top;
                   background-color:#D0D0D0;
                   border-right:hidden; 
                   border-bottom: 10px solid white;
                   padding:10px">
            <strong>data_dir</strong><br>(string)
        </td>
        <td style="width: 60%
                   text-align: left;
                   vertical-align: top;
                   border-left:hidden;
                   border-top:hidden;
                   border-bottom:hidden;
                   padding:10px;
                   ">
            The filepath for the directory holding a single sample per participant from the selected subset of participants. This should be a folder in the <strong><code>site_dir</code></strong>.
        </td>
    </tr>
    <tr>
        <td style="width: 30%;
                   text-align:left; 
                   vertical-align:top;
                   background-color:#D0D0D0;
                   border-right:hidden; 
                   border-bottom: 10px solid white;
                   padding:10px">
            <strong>data_map_fp</strong><br>(string)
        </td>
        <td style="width: 60%
                   text-align: left;
                   vertical-align: top;
                   border-left:hidden;
                   border-top:hidden;
                   border-bottom:hidden;
                   padding:10px;
                   ">
            <p>This specifies the filepath for the metadata file associated with the fecal samples. This should be a text file (.txt) in the <strong><code>data_dir</code></strong>.</p>
<p>A mapping file allows us to relate information about the sample to information about the microbiome. This contains a barcode used to identify each sample, and information about the participants from the survey, such as age, diet, or disease status. This cannot be used to identify participants and does not contain data like names or physical or email addresses. In the rarefied mapping file (the filenames contain even10k), the mapping file also contains alpha diversity results for each sample. </p>
<p>The notebook expects the metadata to be processed through the Preprocessing notebook, which involved converting continuous categories to categorical data. The rarefied file (<code>AGP_100nt_even10k…</code>), which contains alpha diversity results, is required.</p>
        </td>
    </tr>
    <tr>
        <td style="width: 30%;
                   text-align:left; 
                   vertical-align:top;
                   background-color:#D0D0D0;
                   border-right:hidden; 
                   border-bottom: 10px solid white;
                   padding:10px">
            <strong>data_otu_fp</strong><br>(string)
        </td>
        <td style="width: 60%
                   text-align: left;
                   vertical-align: top;
                   border-left:hidden;
                   border-top:hidden;
                   border-bottom:hidden;
                   padding:10px;
                   ">
    
<p>The filepath for the OTU table file associated with the fecal samples. This should be a <a href="http://www.biom-format.org">biom-format</a> file (.biom) in the <strong><code>data_dir</code></strong> [<a href="#23587224">23587224</a>].</p>
<p>The OTU table is assumed to be rarefied to an even depth. This is designated in the filename with the phrase, “even10k”, indicating the OTU table has been rarefied to 10,000 sequences per sample.</p>
<p>An OTU table gives the bacterial counts in each sample. An OTU, or operational taxonomic unit, is technically a cluster of sequences at a certain level of similarity. We use sequence clustering to account for PCR and read error. The level of similarity used here, 97%, gives approximately genus level resolution [<a href="#17586664">17586664</a>]. Multiple OTUs may map to a single bacterial taxon.</p>
        </td>
    </tr>
    <tr>
        <td style="width: 30%;
                   text-align:left; 
                   vertical-align:top;
                   background-color:#D0D0D0;
                   border-right:hidden; 
                   border-bottom: 10px solid white;
                   padding:10px">
            <strong>data_ubd_fp</strong><br>(string)
        </td>
        <td style="width: 60%
                   text-align: left;
                   vertical-align: top;
                   border-left:hidden;
                   border-top:hidden;
                   border-bottom:hidden;
                   padding:10px;
                   ">
            <p>The filepath for the unweighted UniFrac distance matrix file associated with the fecal samples. This should be a text file (.txt) in the <strong><code>data_dir</code></strong>.</p>
The distance matrix relates the microbiome composition in each sample community to every other community. Unweighted UniFrac distance considers shared evolutionary history and the presence or absence of OTUs in this calculation [<a href="#16332807">16332807</a>]. Identical communities have a UniFrac Distance of 0, while communities which have no shared history have a UniFrac distance of 1.
        </td>
    </tr>
</table>

In [60]:
# Sets up OTU path directories
site_dir = os.path.join(working_dir, site)
div.check_dir(site_dir)

data_dir = os.path.join(site_dir, dataset)

# Sets the subset filepath for all samples
data_otu_fp = os.path.join(data_dir, 'AGP_100nt_even10k_fecal.biom')
data_map_fp = os.path.join(data_dir, 'AGP_100nt_even10k_fecal.txt')
data_ubd_fp = os.path.join(data_dir, '%s_AGP_100nt_even10k_fecal.txt' % b_div_metric)

<a href="#top">Return to the top</a>
<a id="dir_image"></a>
### Image Directories and Files

We can save the graphical results of our analysis, in addition to displaying them in the notebook. We will use a file structure similar to the way we’ve saved our OTU tables to keep track of the images generated.

<table style="width:90%;
              border-style:hidden;
              borders-collapse:collapse;
              line-height:120%">
    <tr>
        <td style="width: 30%;
                   text-align:left; 
                   vertical-align:top;
                   background-color:#D0D0D0;
                   border-right:hidden; 
                   border-bottom: 10px solid white;
                   padding:10px">
            <strong>image_dir</strong><br>(string)
        </td>
        <td style="width: 60%
                   text-align: left;
                   vertical-align: top;
                   border-left:hidden;
                   border-top:hidden;
                   border-bottom:hidden;
                   padding:10px;
                   ">
            The parent directory for the images generated by the notebook. This is expected to be a directory in the <strong><code>analysis_dir</code></strong>.
        </td>
    </tr>
    <tr>
        <td style="width: 30%;
                   text-align:left; 
                   vertical-align:top;
                   background-color:#D0D0D0;
                   border-right:hidden; 
                   border-bottom: 10px solid white;
                   padding:10px">
            <strong>site_image_dir</strong><br>(string)
        </td>
        <td style="width: 60%
                   text-align: left;
                   vertical-align: top;
                   border-left:hidden;
                   border-top:hidden;
                   border-bottom:hidden;
                   padding:10px;
                   ">
    
A body-site specific directory for the result images. This is expected to be a directory in the <strong><code>image_dir</code></strong>.
        </td>
    </tr>
    <tr>
        <td style="width: 30%;
                   text-align:left; 
                   vertical-align:top;
                   background-color:#D0D0D0;
                   border-right:hidden; 
                   border-bottom: 10px solid white;
                   padding:10px">
            <strong>data_image_dir</strong><br>(string)
        </td>
        <td style="width: 60%
                   text-align: left;
                   vertical-align: top;
                   border-left:hidden;
                   border-top:hidden;
                   border-bottom:hidden;
                   padding:10px;
                   ">
            The directory for the specific dataset used in the analysis. For example, a comparison based on age could be performed on a single sample per participant among the healthy subset of adults (one_sample_sub_participants) and a single sample per participant for all participants (one_sample_all_participants) to see if the trends seen in the healthy subset hold true as the data set expands.
This is expected to be a directory in <strong><code>site_image_dir</code></strong>.
        </td>
    </tr>
    <tr>
        <td style="width: 30%;
                   text-align:left; 
                   vertical-align:top;
                   background-color:#D0D0D0;
                   border-right:hidden; 
                   border-bottom: 10px solid white;
                   padding:10px">
            <strong>image_pattern</strong><br>(string)
        </td>
        <td style="width: 60%
                   text-align: left;
                   vertical-align: top;
                   border-left:hidden;
                   border-top:hidden;
                   border-bottom:hidden;
                   padding:10px;
                   ">
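            The filename pattern used for saving the result images we generate. This uses string replacement: we can generate a dictionary object that identifies the value to fill in each named blank. The code, <code><font color="Firebrick">"%s"</font></code>, inserts a string; the form <code><font color="Firebrick">"%(name)s"</font></code> inserts the dictionary value stored under <code>name</code>.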
        </td>
    </tr>
</table>

In [61]:
image_dir = os.path.join(analysis_dir, 'images')
div.check_dir(image_dir)

site_image_dir = os.path.join(image_dir, site)
div.check_dir(site_image_dir)

data_image_dir = os.path.join(site_image_dir, dataset)
div.check_dir(data_image_dir)

image_pattern = os.path.join(data_image_dir, '%(div_metric)s_%(image_type)s_%(category)s.png')
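
To make the string replacement concrete, here is a small sketch of how a pattern like `image_pattern` can be filled in from a dictionary. The directory and the three values used here (`PD_whole_tree`, `boxplot`, `AGE_CAT`) are hypothetical placeholders, not settings taken from this notebook.

```python
import os

# Hypothetical values standing in for the notebook's data_image_dir and
# analysis settings
demo_image_dir = os.path.join('analysis_results', 'images', 'fecal')
demo_pattern = os.path.join(demo_image_dir,
                            '%(div_metric)s_%(image_type)s_%(category)s.png')

# Each key in the dictionary fills in the placeholder with the same name
demo_fp = demo_pattern % {'div_metric': 'PD_whole_tree',
                          'image_type': 'boxplot',
                          'category': 'AGE_CAT'}
```

Because the placeholders are named, the same dictionary can carry extra keys without affecting the result, and the order of the keys does not matter.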

<a href="#top">Return to the top</a>
<a id="download"></a>
## Data Download

We will start our analysis using the clean, rarefied tables generated by the Preprocessing Notebook. If necessary, these files can be downloaded. The necessary files are then loaded into the notebook for analysis and processing.

In [62]:
# Loads the files into the notebook
data_otu = biom.load_table(data_otu_fp)
data_map = pd.read_csv(data_map_fp,
                       sep=txt_delim, 
                       na_values=map_nas,
                       index_col=False)
data_map.index = data_map[map_index]
del data_map[map_index]
data_ubd = skbio.DistanceMatrix.read(data_ubd_fp)

# Function definitions

# Metadata Conversion and Massage

We'll start by creating a dataframe that looks at only the predictor and response variables. We'll use this to re-code the categorical predictor variables for the regression. There are several discussions of this approach which I should probably cite at some point...

In [141]:
data_in = data_map[con_predictors]
data_in = data_in.join(data_map[response])

for cat, (columns, f) in conversion.iteritems():
    # Checks we should operate on this category
    if cat not in cat_predictors:
        continue
    # Converts each category value to dummy variables, one row per sample
    discrete = pd.DataFrame(data=np.vstack(data_map[cat].apply(f).values),
                            index=data_map.index, columns=columns)
    data_in = data_in.join(discrete)
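
The re-coding step can be easier to follow on a toy example. The sketch below repeats the `categorize_sex` coding from earlier in the notebook and applies it to four hypothetical metadata values, using plain numpy rather than the notebook's dataframes.

```python
import numpy as np


def categorize_sex(x):
    # Same coding as the function defined earlier in the notebook:
    # 'female' is the reference level, coded as all zeros
    if x == 'female':
        return np.array([0, 0])
    elif x == 'male':
        return np.array([1, 0])
    elif x == 'other':
        return np.array([0, 1])
    else:
        return np.array([np.nan, np.nan])


# Hypothetical metadata values for four samples
sex_values = ['female', 'male', 'other', 'unknown']

# Stacks one coded row per sample; columns correspond to SEX_m and SEX_o
coded = np.vstack([categorize_sex(v) for v in sex_values])

# Flags the rows where every coded value is defined
defined = ~np.isnan(coded).any(axis=1)
```

A category with $k$ levels becomes $k - 1$ columns, and any unrecognized value becomes a row of undefined values that can be filtered out before the regression.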

Now, we'll remove any rows which contain undefined values, so our regression looks only at defined data.

In [152]:
data_in = data_in.dropna()

Finally, we'll identify an initial predictor and response variable. I'd like to try using `AGE` as the first variable. So, if we call our PD whole tree diversity $y$, then we are evaluating the following relationship:

$$y = \beta_{0} + \beta_{age}*(AGE) + \epsilon \tag{r1}$$

We will solve for the intercept, $\beta_{0}$, and the linear coefficient for age, $\beta_{age}$. Our residuals will represent the error term, $\epsilon$.

There are several options for permutative implementations of linear regressions. Coefficients can be estimated using bootstrapping ... [ter Braak]. The most robust method for permutative multivariate linear regression is that proposed by Freedman and Lane [Anderson, Freedman].

What follows is a brief mathematical summary of this method, required mostly because I don't understand it well, and rephrasing things in math helps me understand...

### Freedman and Lane Definition of Simple Permutative Regression
Assume there exist two vectors, $x$ and $y$ of length, $n$ where $\bar{x}$ and $\bar{y}$ are the respective means, and $s(x)$ and $s(y)$ are the respective standard deviations.

We are testing the assumption that $x$ and $y$ are related according to equation 1.1:
$$y = \beta_{0} + \beta_{1}x + \epsilon \tag{1.1}$$

Let $r(x,y)$ be the correlation between $x$ and $y$, given by

$$r(x, y) = \frac{1}{n(s(x))(s(y))}\sum_{m=1}^{n} (x_{m} - \bar{x})(y_{m} - \bar{y}) \tag{1.2}$$
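
As a quick numerical check of equation 1.2, the sketch below computes the correlation directly from the definition and compares it to numpy's built-in routine. The data values are hypothetical, and the sketch assumes $s(x)$ and $s(y)$ are population standard deviations (dividing by $n$), which is what the $1/n$ factor in the equation implies.

```python
import numpy as np

# Hypothetical paired observations
x = np.array([21., 34., 45., 52., 67.])
y = np.array([25.1, 27.3, 26.8, 29.5, 30.2])
n = len(x)

# np.std defaults to the population standard deviation (divides by n),
# which is the form assumed by the 1/n factor in equation 1.2
r = ((x - x.mean()) * (y - y.mean())).sum() / (n * x.std() * y.std())

# Cross-check against numpy's built-in correlation coefficient
r_check = np.corrcoef(x, y)[0, 1]
```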

Let $I_{n}$ be the index of a vector of length $n$, and let $\pi$ be one of the $n!$ equally weighted possible permutations of $I_{n}$, such that for the $\pi$th permutation, $i \in I_{n}$ is transformed to $i\pi \in I_{n}$. If we apply this to one of the earlier vectors, $x$, the permuted vector is $x_{\pi} = (x_{1\pi}, x_{2\pi}, ... x_{n\pi})$.

***
**Freedman and Lane's Theorem 1**: Let $K$ be a finite, positive constant such that

$$|x_{m} - \bar{x}| < Ks(x) \textrm{ and } |y_{m} - \bar{y}| < Ks(y) \tag{1.3}$$

for $m \in \mathbb{N}$ and $m \leq n$.
***
Let 

$$R(\pi) = \sqrt{n}r(x_{\pi}, y) \tag{1.4}$$

where, as $n \to \infty$, $R$ converges to a normal distribution with mean 0 and a variance of 1.

We can apply Theorem 1 to a permutative approach to simple linear regression, as described in equation 1.1. In this case, $\beta_{0}$ and $\beta_{1}$ are chosen to minimize the error sum of squares,

$$\sum_{i = 1}^{n}(y_{i} - \beta_{0} - \beta_{1}x_{i})^{2} \tag{1.5}$$

The null hypothesis, that $\beta_{1} = 0$, can be tested with the statistic, $t$,
$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^{2}}} \tag{1.6}$$
where $t$ is drawn from a Student's t distribution with $n - 2$ degrees of freedom.
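
It may help to see that the statistic in equation 1.6 is the same $t$ reported for the slope in an ordinary least squares fit. The sketch below computes it both ways on hypothetical data; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical paired observations
x = np.array([21., 34., 45., 52., 67., 38., 59.])
y = np.array([25.1, 27.3, 26.8, 29.5, 30.2, 26.0, 30.9])
n = len(x)

# Test statistic from the correlation, following equation 1.6
r = np.corrcoef(x, y)[0, 1]
t_from_r = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)

# The same statistic from the least squares fit: slope / standard error
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)
s2 = (residuals ** 2).sum() / (n - 2)
slope_se = np.sqrt(s2 / ((x - x.mean()) ** 2).sum())
t_from_fit = slope / slope_se
```

The two values agree up to floating point error, which is why a permutation test on $r$ and one on the slope's $t$ lead to the same conclusion in the simple case.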

Let's assume the test statistic, $t$, in eq (1.6) is calculated for $k$ random permutations of $y$, where for the $\pi$th permutation, we fit the relationship
$$y_{\pi} = \beta_{0\pi} + \beta_{1\pi}x \tag{1.7}$$

If the data are stochastic, we treat the error term in the regression model given by eq (1.1) as data, there are no outliers in the data, and $k$ is a large number, then we can start to estimate a $p$ value for our null hypothesis test.
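
A minimal sketch of this permutation procedure for the simple case is given below. It uses hypothetical data, and the helper names `t_statistic` and `permutation_p` are illustrative. Counting the observed statistic as one of the permutations, so that $p = (hits + 1) / (k + 1)$, is an assumed convention rather than something stated in the text.

```python
import numpy as np


def t_statistic(x, y):
    """Test statistic from equation 1.6 for two equal-length vectors."""
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    return r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)


def permutation_p(x, y, k=999, seed=0):
    """Estimates a two-sided p value by shuffling y k times.

    Counting the observed statistic as one of the permutations, i.e.
    (hits + 1) / (k + 1), is an assumed convention.
    """
    rng = np.random.RandomState(seed)
    t_obs = abs(t_statistic(x, y))
    hits = 0
    for _ in range(k):
        t_perm = t_statistic(x, rng.permutation(y))
        if abs(t_perm) >= t_obs:
            hits += 1
    return (hits + 1.) / (k + 1.)


# Hypothetical data with a clear linear trend plus noise
rng = np.random.RandomState(42)
x = np.arange(30, dtype=float)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=30)
p_trend = permutation_p(x, y)

# Hypothetical data with no relationship between x and y
p_noise = permutation_p(x, rng.normal(size=30))
```

With $k$ permutations the smallest $p$ value this estimate can report is $1 / (k + 1)$, which is one reason $k$ needs to be large.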


### Freedman and Lane Definition of Multivariate Regression

Now, let's consider a case where we're testing whether $p + q$ predictors, $\{x_{1}, x_{2}, ..., x_{p}, y_{1}, y_{2}, ... , y_{q}\}$, are correlated with a response variable, $z$, assuming that the other variables are held constant. (Again, this is really clunky notation).
We can express this mathematically as

$$z = \alpha_{0} + \alpha_{1}x_{1} + \alpha_{2}x_{2} + ... + \alpha_{p}x_{p} + \beta_{1}y_{1} + \beta_{2}y_{2} + ... + \beta_{q}y_{q} + \epsilon \tag{2.1}$$

A multivariate regression can also be represented in linear algebra, where the response variable, $Z$, is a 1 x $n$ row vector; $X$ is a $(p + 1)$ x $n$ matrix where $X^{0} = 1$ and $X^{i} = x_{i}$; and $Y$ is a $q$ x $n$ matrix where $Y^{j} = y_{j}$ ($i, j \in \mathbb{N}$, $i \leq p$ and $j \leq q$).
The regression coefficients can be represented as 1 x $(p + 1)$ and 1 x $q$ row vectors called $A$ and $B$, respectively, where $A = (\alpha_{0}, \alpha_{1}, \alpha_{2}, ... \alpha_{p})$ and $B = (\beta_{1}, \beta_{2}, ... \beta_{q})$. The error terms, $\epsilon_{1}, \epsilon_{2}, ... \epsilon_{n}$, are collected into a 1 x $n$ row vector, $E$.

We can represent equation (2.1) using linear algebra as 
$$Z = A \cdot X + B \cdot Y + E \tag{2.2}$$
$A$ and $B$ are chosen to minimize the sum of squares.


For the $i$th coefficient in B ($B^{i}$ or $\beta_{i}$), we can test the following alternative hypotheses:
<center><strong>H<sub>0</sub></strong>: $\beta_{i} = 0$<br/>
<strong>H<sub>1</sub></strong>: $\beta_{i} \neq 0$
</center>

We can use a conventional $F$ statistic to test this hypothesis, where $F(Z, X, Y)$ has $q$ degrees of freedom in the numerator and $n - p - q$ in the denominator. We define $P$ as the level of significance, or the probability of finding a value as extreme as our test $F$.

If we consider only the regression with regard to $X$, the vector of coefficients, $B$, is a 1 x $(p + 1)$ vector, where $B = (\beta_{0}, \beta_{1}, \beta_{2}, ... , \beta_{p})$. The error terms, or residuals, are represented as a 1 x $n$ vector, where $E = (\epsilon_{1}, \epsilon_{2}, ... \epsilon_{n})$. This terminology allows us to write the regression equation as
$$Y = B \cdot X + E \tag{2.2}$$

So, given this context, we can solve for B by minimizing the sum of squares, according to a modified version of equation (1.5):

$$\sum_{i = 1}^{n}(y_{i} - B \cdot X_{i})^{2} \tag{2.3}$$

Or, in matrix terms,

$$B = (X'X)^{-1}(X'Y) \tag{2.4}$$
If you're interested in the derivation of this, [resource] is recommended.
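
Equation 2.4 can be tried out numerically. The sketch below uses hypothetical simulated data and assumes observations are stored in rows of $X$ (an $n$ x $p$ layout), which is the orientation equation 2.4 needs. It solves the linear system rather than forming the inverse explicitly, which is a numerical convenience and not part of the derivation.

```python
import numpy as np

# Hypothetical design: an intercept column plus two predictors
# (observations in rows, the layout assumed for equation 2.4)
rng = np.random.RandomState(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
true_B = np.array([1.5, -2.0, 0.75])
Y = X.dot(true_B) + rng.normal(scale=0.1, size=n)

# B = (X'X)^{-1} (X'Y); solve() avoids forming the inverse explicitly
B = np.linalg.solve(X.T.dot(X), X.T.dot(Y))

# At the least squares solution the residuals are orthogonal to every
# column of X, so this vector should be numerically zero
orthogonality = X.T.dot(Y - X.dot(B))
```

The orthogonality check is a restatement of the normal equations, $X'(Y - XB) = 0$, which is the condition equation 2.4 solves.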

We are testing the null hypothesis that $\beta_{j} = 0$ 

Now, let's consider a case where we have a response variable, $Y$, a vector of shape 1 x $n$, and we're looking at its relationship to a matrix of $p$ predictor variables, $X$, where $X$ has dimensions $p$ x $n$, and individual predictor variables are denoted as $X^{1}, X^{2}, ... X^{p}$. The value of the $i$th predictor variable for the $j$th observation is given by $X_{i,j}$. Let us also assume that $X^{1} = 1$.

If we're not good at linear algebra, we can fit the relationship between $y$ and its predictor variables as

$$y = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + ... + \beta_{p}x_{p} + \epsilon \tag{2.1}$$

In matrix space,
$$Y = B \cdot X + E$$

If we were using linear algebra, we could represent the regression coefficients, $\{\beta_{0}, \beta_{1}, \beta_{2}, ... \beta_{p}\}$, as a $p$ x 1 row vector, $B$.

We can fit the regression by minimizing the sum of squares,

$$\sum_{i = 1}^{n}(y_{i} - B \cdot X_{i})^{2}$$

In [150]:
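def permute_linear_regression(df, response_name, predictor_name, num_iter=99, regression_params=None):
    """Compares an OLS regression to regressions on a shuffled response

    Parameters
    ----------
    df : DataFrame
        The data containing the response and predictor columns.
    response_name : str
        The column in `df` containing the response variable.
    predictor_name : str or list
        The column(s) in `df` containing the predictor variable(s).
    num_iter : int, optional
        The number of permutations to perform.
    regression_params : dict, optional
        Keyword arguments passed to the OLS model.

    Returns
    -------
    The original regression results and a Series of permutative p values.
    """
    if regression_params is None:
        regression_params = {}
    # Draws the predictor and response columns
    y = df[response_name]
    X = sms.add_constant(df[predictor_name])

    # Calculates the original linear regression
    ori_model = sms.OLS(y, X, **regression_params)
    ori_results = ori_model.fit()
    ori_p = ori_results.pvalues

    y_index = y.index

    # Holds the p values for each coefficient (rows) and permutation (columns)
    p_vector = np.ones((ori_p.shape[0], num_iter))

    # Calculates the permutation
    for i in xrange(num_iter):
        # Shuffles the labels on the response variable
        y_shuffle = pd.Series(np.random.permutation(y.values),
                              name=response_name, index=y_index)
        # Performs the regression
        shuffle_model = sms.OLS(y_shuffle, X, **regression_params)
        shuffle_results = shuffle_model.fit()
        p_vector[:, i] = shuffle_results.pvalues

    # Calculates the permutative p. Assumption: this is taken as the fraction
    # of regressions (counting the original) with a p value at least as small
    # as the original p value.
    extreme = (p_vector <= ori_p.values[:, np.newaxis]).sum(axis=1)
    perm_p = pd.Series((extreme + 1.) / (num_iter + 1.), index=ori_p.index)

    return ori_results, perm_p
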

In [151]:
# Gets predictor and responses
y = data_in[response]
X = sms.add_constant(data_in['AGE'])

# Solves the regression
ori_model = sms.OLS(y, X)
ori_results = ori_model.fit()
print ori_results.summary()

                            OLS Regression Results                            
Dep. Variable:     PD_whole_tree_mean   R-squared:                       0.038
Model:                            OLS   Adj. R-squared:                  0.037
Method:                 Least Squares   F-statistic:                     54.10
Date:                Tue, 07 Apr 2015   Prob (F-statistic):           3.25e-13
Time:                        16:01:34   Log-Likelihood:                -4426.6
No. Observations:                1386   AIC:                             8857.
Df Residuals:                    1384   BIC:                             8868.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         27.6559      0.438     63.106      0.0

In [118]:
# df = data_map['BMI']
# # Removes samples with missing data
# df = data_map[variables].dropna()
# model = smf.ols(formula="PD_whole_tree_mean ~ AGE", data=df)
# res = model.fit()
# print res.summary()