# Tutorial 2a: Tidy data and split-apply-combine
(c) 2017 Justin Bois. This work is licensed under a [Creative Commons Attribution License CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/). All code contained herein is licensed under an [MIT license](https://opensource.org/licenses/MIT).

In [38]:
import numpy as np
import pandas as pd

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

Note: The code in this tutorial comes from the tutorial by Justin Bois, which can be found [here](http://bebi103.caltech.edu.s3-website-us-east-1.amazonaws.com/2017/tutorials/t2a_tidy_data.html).
## Introduction
The data we are investigating come from David Prober's lab. A description of their work on the genetic regulation of sleep can be found on the research page of the lab website. There is a movie of the moving/sleeping larvae similar to the one used to produce the data set we are using in this tutorial. The work based on this data set was published in (Gandhi et al., *Neuron*, 85, 1193–1199, 2015).

## The genotype data
First, we will load the genotype file.

In [39]:
with open('data/130315_1A_genotypes.txt', 'r') as f:
    for _ in range(30):
        print(next(f), end='')

# Genotype data from the Gandhi, et al. experiment ending March 13, 2013
#
# The experiment was performed in a 96 well plate with zebrafish
# embryos.  Gene sequencing was later used to identify the genotype
# of the fish in each well.  Not all fish could be genotyped.
#
# The mutants being studied have deletions in the gene coding for
# arylalkylamine N-acetyltransferase (aanat), which is a key enzyme
# in the rhythmic production of melatonin.  Melatonin is a hormone
# responsible for regulation of circadian rhythms.  It is often taken
# as a drug to treat sleep disorders.  The goal of this study is to
# investigate the effects of aanat deletion on sleep pattern in
# 5+ day old zebrafish larvae.
#
# Each column lists the wells corresponding to the genotype of
# each fish.  If a number is missing (between 1 and 96), the 
# genotype of that fish is not known.
# 
# These data were kindly provided by Avni Gandhi and Audrey Chen
# from David Prober's lab.  They were part of the paper Gandh

Each column in this file contains a list of wells in the 96-well plate corresponding to each genotype. Apparently, the columns are tab delimited. We can parse this using the `delimiter` keyword argument of `pd.read_csv()`. When specifying delimiters, tabs are denoted as `\t`. 
<br>
<br>
There are two header rows, one starting with `Genotype1` and the other starting with `WT 17`. The first is redundant and could be skipped. Nevertheless, we will read both in as headers using the `header` kwarg of `pd.read_csv()`.

In [40]:
# Load in the genotype file, call it df_gt for genotype DataFrame
df_gt = pd.read_csv('data/130315_1A_genotypes.txt',
                    delimiter='\t', 
                    comment='#',
                    header=[0, 1])

# Let's look at it
df_gt

Unnamed: 0_level_0,Genotype1,Genotype2,Genotype3
Unnamed: 0_level_1,WT 17,Het 34,Mut 22
0,2.0,1,4.0
1,14.0,3,11.0
2,18.0,5,12.0
3,24.0,6,13.0
4,28.0,8,20.0
5,29.0,10,21.0
6,30.0,15,23.0
7,54.0,19,27.0
8,58.0,22,35.0
9,61.0,33,39.0


Now we have two header rows for the columns. If we look at the column names, we see that they are a `MultiIndex` instance.

In [41]:
df_gt.columns

MultiIndex(levels=[['Genotype1', 'Genotype2', 'Genotype3'], ['Het 34', 'Mut 22', 'WT 17']],
           labels=[[0, 1, 2], [2, 0, 1]])

This is part of Pandas's cool multi-indexing functionality, which is not something we need if we have tidy data. However, messy data with multi-indexing may give performance boosts when accessing data. This can be important for very large data sets. But in general, tidy data are easier to conceptualize and syntactically much simpler to access.
<br>
<br>
In this case, we do not need the multi-indexing or the level-zero index (`Genotype1`, etc.), so we will keep only the level-one index. This can be accomplished using the `get_level_values` method of Pandas `MultiIndex` objects.

In [42]:
# Reset the columns to be the second level of indexing
df_gt.columns = df_gt.columns.get_level_values(1)

# Check out the new columns
df_gt.columns

# Let's actually clean up the column names
df_gt.columns = ['wt', 'het', 'mut']

The above approach to renaming is a shortcut for the `rename()` method that we have previously encountered. It could actually have been done before messing around with the multi-indexing, but why not learn something new? We need to know how to tidy data sets in this format anyway.
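
For comparison, here is a minimal sketch of the same renaming done with `rename()`. The toy `DataFrame` and the names `df_toy` and `df_renamed` are made up for illustration; the values are the first two rows of the genotype table shown earlier.

```python
import pandas as pd

# Toy stand-in for the genotype table (first two rows only)
df_toy = pd.DataFrame({'WT 17': [2, 14], 'Het 34': [1, 3], 'Mut 22': [4, 11]})

# Explicit mapping from old to new names; unlisted columns are left alone
df_renamed = df_toy.rename(columns={'WT 17': 'wt',
                                    'Het 34': 'het',
                                    'Mut 22': 'mut'})

print(list(df_renamed.columns))
```

Unlike assigning a list to `df.columns`, the mapping does not depend on column order, which can be safer when the layout of the input file is not guaranteed.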

## Tidying the genotype data
As they are, the data are not tidy. For tidy data, we would have two columns, `location`, which gives the well number for each larva, and `genotype`, which is `wt` for wild type, `het` for heterozygote, or `mut` for mutant.
<br>
<br>
A useful data tidying tool is the `pd.melt()` function. For this simple data set, it takes the column headings and makes them into a column (with repeated entries) and puts the data accordingly in the correct order. We just need to specify the name of the "variable" column, in this case the genotype, and the name of the "value" column, in this case, the fish ID.

In [43]:
# Tidy the DataFrame
df_gt = pd.melt(df_gt, var_name='genotype', value_name='location')

# Take a look
df_gt

Unnamed: 0,genotype,location
0,wt,2.0
1,wt,14.0
2,wt,18.0
3,wt,24.0
4,wt,28.0
5,wt,29.0
6,wt,30.0
7,wt,54.0
8,wt,58.0
9,wt,61.0


We still have some unhelpful `NaN` entries, so we drop them using the `dropna()` method of `DataFrame`s.

In [44]:
# Drop NaNs
df_gt = df_gt.dropna()

# Take a look
df_gt

Unnamed: 0,genotype,location
0,wt,2.0
1,wt,14.0
2,wt,18.0
3,wt,24.0
4,wt,28.0
5,wt,29.0
6,wt,30.0
7,wt,54.0
8,wt,58.0
9,wt,61.0


We can see that the indices have some skips because we dropped the NaNs. This is not really of concern, since we will not use them. Still, we can use the `reset_index()` method to reset the indexing of the `DataFrame`.

In [45]:
# Reset the indexing of the DataFrame
df_gt = df_gt.reset_index(drop=True)

We now have a tidy `DataFrame`. However, the fish numbers are floats. This happens because NaN is a float, so the column got converted to floats. We can reset the data type of the column to `int`.

In [46]:
# Set data types to be integers
df_gt['location'] = df_gt['location'].astype(int)

We now have a beautiful, tidy `DataFrame` of genotypes. The `pd.melt()` function may well be the most useful function for tidying messy data sets. Now let's take a look at the behavioral data.
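
The tidying steps of this section can also be chained into one expression. The following is a sketch on a made-up wide table (`df_wide` and its well numbers are hypothetical), not the tutorial's actual data.

```python
import numpy as np
import pandas as pd

# Toy wide table: one column per genotype, padded with NaN (made-up wells)
df_wide = pd.DataFrame({'wt': [2.0, 14.0, np.nan],
                        'het': [1.0, 3.0, 5.0],
                        'mut': [4.0, np.nan, np.nan]})

# Melt to long form, drop the padding, renumber rows, restore integer wells
df_tidy = (pd.melt(df_wide, var_name='genotype', value_name='location')
             .dropna()
             .reset_index(drop=True)
             .astype({'location': int}))

print(df_tidy)
```

Each row of `df_tidy` is now a single observation: one well and its genotype.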

## The behavioral data
We will first take a look at the behavioral data set. It has been preprocessed by Justin Bois (thanks, Justin!)  to some degree, and we will work with the original raw data later on. Let's look at the contents of the data file.

In [47]:
with open('data/130315_1A_aanat2.csv', 'r') as f:
    for _ in range(40):
        print(next(f), end='')

# Lightly processed data from VideoTracker from the Gandhi, et al. experiment
# concluding on March 15, 2013.
#
# The experiment was performed in a 96 well plate with zebrafish
# embryos. Gene sequencing was later used to identify the genotype
# of the fish in each well. Not all fish could be genotyped.
#
# The mutants being studied have deletions in the gene coding for
# arylalkylamine N-acetyltransferase (aanat), which is a key enzyme
# in the rhythmic production of melatonin. Melatonin is a hormone
# responsible for regulation of circadian rhythms. It is often taken
# as a drug to treat sleep disorders. The goal of this study is to
# investigate the effects of aanat deletion on sleep pattern in
# 5+ day old zebrafish embryos.
#
# Activity is defined as the number of seconds over the one-minute interval 
# in which a given larva was moving.
#
# The column 'zeit' contains the so-called Zeitgeber time in units of hours.
# The Zeitgeber time is zero when the lights come on on the first 

The provided data set is tidy. Each row corresponds to the measured seconds of activity of a single fish for a single minute. Other information includes the time of the measurement and the day in the life of the fish. The Zeitgeber time gives us a reference time for the experiment (`Zeit` is German for "time").
<br>
<br>
Let's load the data set.

In [48]:
df = pd.read_csv('data/130315_1A_aanat2.csv', comment='#')

# Let's take a look
df.head()

Unnamed: 0,location,activity,time,zeit,zeit_ind,day
0,1,0.6,2013-03-15 18:31:09,-14.480833,-869,4
1,2,1.4,2013-03-15 18:31:09,-14.480833,-869,4
2,3,0.0,2013-03-15 18:31:09,-14.480833,-869,4
3,4,0.0,2013-03-15 18:31:09,-14.480833,-869,4
4,5,0.0,2013-03-15 18:31:09,-14.480833,-869,4


## Adding the genotype information
First, we add the genotype of the fish at each location to the behavioral data. The genotype `DataFrame`, `df_gt`, has two columns, `location` and `genotype`. The `location` column is shared with our behavior `DataFrame`, so we can use `pd.merge()` to add the genotype information as a new column.

In [49]:
df = pd.merge(df, df_gt)

# Take a look
df.head()

Unnamed: 0,location,activity,time,zeit,zeit_ind,day,genotype
0,1,0.6,2013-03-15 18:31:09,-14.480833,-869,4,het
1,1,1.9,2013-03-15 18:32:09,-14.464167,-868,4,het
2,1,1.9,2013-03-15 18:33:09,-14.4475,-867,4,het
3,1,13.4,2013-03-15 18:34:09,-14.430833,-866,4,het
4,1,15.4,2013-03-15 18:35:09,-14.414167,-865,4,het
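
By default, `pd.merge()` joins on the columns the two `DataFrame`s share and performs an inner join, so rows for wells without a known genotype are dropped. The sketch below makes both choices explicit on toy tables; `df_behavior` and `df_genotype` are hypothetical names with made-up rows.

```python
import pandas as pd

# Toy behavior and genotype tables; location 3 has no known genotype
df_behavior = pd.DataFrame({'location': [1, 2, 3],
                            'activity': [0.6, 1.4, 0.0]})
df_genotype = pd.DataFrame({'location': [1, 2],
                            'genotype': ['het', 'wt']})

# Inner join (the default) keeps only locations present in both tables
df_inner = pd.merge(df_behavior, df_genotype, on='location', how='inner')

# Left join keeps every behavior row and fills missing genotypes with NaN
df_left = pd.merge(df_behavior, df_genotype, on='location', how='left')

print(df_inner)
print(df_left)
```

Which join is appropriate depends on whether the un-genotyped fish should remain in the analysis.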


## Light or dark?
We can annotate our `DataFrame` to indicate whether it is light or dark. Then we can easily separate the diurnal and nocturnal activity. So let's make a column `light` which will have the entry `True` if the lights are on, and `False` if the lights are off. Referring again to the comments in the header of the data set, the lights come on at 9AM and turn off at 11PM. We can use the `time` column to determine this.
<br>
We should first look at the time column and determine its data type.

In [50]:
df['time'].dtype

dtype('O')

The result `dtype('O')` means that the data type is "object", which is a generic catch-all for unknown data types. However, we know that the `time` column contains actual clock times. We can tell this to Pandas and unleash its data processing power on our data set. We do this using the `pd.to_datetime()` function.

In [51]:
df['time'] = pd.to_datetime(df['time'])

# What is the data type now?
df['time'].dtype

dtype('<M8[ns]')

The data type is `<M8[ns]`, which is essentially saying that Pandas is aware that this is a point in time, and it is stored with nanosecond precision. This has some implications: <br> 
* If data are sampled at intervals shorter than a nanosecond, Pandas's datetime utility will not help. In such a case, we would simply use a number (say, of femtoseconds) as the time variable.
* With nanosecond precision, the representable dates run from about the year 1677 to about 2262, so time points cannot lie more than a few hundred years in the past. This can come up in geology or fields like that. 
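
The range of dates that nanosecond-precision timestamps can represent can be inspected directly. This short sketch uses only the `pd.Timestamp.min` and `pd.Timestamp.max` attributes.

```python
import pandas as pd

# Earliest and latest instants representable with nanosecond precision
print(pd.Timestamp.min)
print(pd.Timestamp.max)
```
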
<br>
<br>
We can do lots of time-based things with this datetime format, like extraction of the time from the datetime. We use `.dt.time` to do this.

In [52]:
df['time'].dt.time.head()

0    18:31:09
1    18:32:09
2    18:33:09
3    18:34:09
4    18:35:09
Name: time, dtype: object

We can also compare times. For example, let's ask if the time is after 9AM.

In [53]:
(df['time'].dt.time > pd.to_datetime('9:00:00').time()).head()

0    True
1    True
2    True
3    True
4    True
Name: time, dtype: bool

We had to convert the string `9:00:00` to a datetime and then extract the time using the `time()` method. <br>
Now we can just make a column of booleans reporting whether the time is greater than or equal to 9:00:00 and less than 23:00:00.

In [54]:
df['light'] = ( (df['time'].dt.time >= pd.to_datetime('9:00:00').time())
               & (df['time'].dt.time < pd.to_datetime('23:00:00').time()))

# Take a look
df.head()

Unnamed: 0,location,activity,time,zeit,zeit_ind,day,genotype,light
0,1,0.6,2013-03-15 18:31:09,-14.480833,-869,4,het,True
1,1,1.9,2013-03-15 18:32:09,-14.464167,-868,4,het,True
2,1,1.9,2013-03-15 18:33:09,-14.4475,-867,4,het,True
3,1,13.4,2013-03-15 18:34:09,-14.430833,-866,4,het,True
4,1,15.4,2013-03-15 18:35:09,-14.414167,-865,4,het,True
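
Because both switching times fall on whole hours, the same Boolean column could also be sketched from the hour of day alone. The time stamps below are made up to probe the boundaries; this is an alternative under that assumption, not the approach used above.

```python
import pandas as pd

# Made-up time stamps on either side of the lights-on and lights-off times
times = pd.Series(pd.to_datetime(['2013-03-15 08:59:59',
                                  '2013-03-15 09:00:00',
                                  '2013-03-15 22:59:59',
                                  '2013-03-15 23:00:00']))

# Lights are on from 9:00:00 (inclusive) to 23:00:00 (exclusive)
hour = times.dt.hour
light = (hour >= 9) & (hour < 23)

print(light.tolist())
```

If the lights switched at, say, 9:30, the hour alone would not suffice and the comparison of `dt.time` values would be the safer choice.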


We make a quick plot of the Zeitgeber time versus light. We only need to do this for a single location (single well).

In [55]:
p = bokeh.plotting.figure(plot_height=259,
                          plot_width=700,
                          x_axis_label='Zeitgeber time (hours)',
                          y_axis_label='light')

inds = df['location'] == 1
p.circle(df.loc[inds, 'zeit'], df.loc[inds, 'light'])

bokeh.io.show(p)

The lights switch on and off as they should.
## Split-apply-combine
Let's now compute some more things. For instance, let's say we want to compute the average per-minute activity of each fish over the course of the experiment. To do this, we must:
1. **Split** the data set according to the `location` field, i.e., split it up so we have a separate data set for each fish.
2. **Apply** an averaging function to the activity in these split data sets.
3. **Combine** the results of these averages on the split data set into a new, summary data set that contains the locations and means for each location.

The strategy we want is a **split-apply-combine** strategy. This idea was put forward by Hadley Wickham in [this paper](https://www.jstatsoft.org/article/view/v040i01) and is used quite often:
Split the data according to some criterion (maybe according to genotype). Apply some function to the split-up data. Combine the results into a new data frame. 
<br>
If the data are tidy, this makes a lot of sense. Choose the column by which we want to split the data. All rows with like entries in the splitting column are then grouped into a new data set. 
<br>
Pandas's split-apply-combine operations are achieved using the `groupby()` method. It can be thought of as the splitting part. We can then apply functions to the resulting `DataFrameGroupBy` object.

In [56]:
gb = df.groupby('location')

# Let's look at it
gb

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x10feebe48>

There is really nothing to see in the `DataFrameGroupBy` object. It is a split-up data set. We will now apply the `np.mean()` function to this `DataFrameGroupBy` object. To apply a function, we use the `agg()` method. (There is also an `apply()` method that is more generic at the cost of performance. For computing summary statistics, we are aggregating, hence the use of the `agg()` method, which gives better performance.)

In [57]:
# Compute the mean of the data grouped by location
df_mean = gb.agg(np.mean)

# Take a look
df_mean.head()

Unnamed: 0_level_0,activity,zeit,zeit_ind,day,light
location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,2.684318,30.263574,1815.818199,5.743241,0.552489
2,3.296122,30.263574,1815.818199,5.743241,0.552489
3,2.030076,30.263574,1815.818199,5.743241,0.552489
4,4.527876,30.263574,1815.818199,5.743241,0.552489
5,4.255473,30.263574,1815.818199,5.743241,0.552489


We got back a `DataFrame`. So the `agg()` method of the `DataFrameGroupBy` object did both the **apply** and **combine** steps. This makes sense - we would never want to apply, and then not combine it into something useful.
<br>
So in Pandas, split-apply-combine to compute summary statistics is achieved by first doing a `groupby()` operation to get a `DataFrameGroupBy` object and then an `agg()` operation on it. The result is a `DataFrame`.
<br>
The index of the `DataFrame` is now called `location`, as the original `DataFrame` was grouped by `'location'`. The `np.mean()` function was then applied to every column of the original `DataFrame`. However, we only wanted it to be applied to `activity`. Let's try the operation again, this time slicing out the `activity` column before doing the `agg()` step.

In [58]:
df_mean = gb['activity'].agg(np.mean)

# Take a look
df_mean.head()

location
1    2.684318
2    3.296122
3    2.030076
4    4.527876
5    4.255473
Name: activity, dtype: float64

We now have a Pandas `Series`, where the index is `location`. Because we like to work with datasets with Boolean indexing, we would prefer a `DataFrame` with two columns, `location` and `activity`. We can get this using the `reset_index()` method.

In [59]:
df_mean = df_mean.reset_index()

# Take a look
df_mean.head()

Unnamed: 0,location,activity
0,1,2.684318
1,2,3.296122
2,3,2.030076
3,4,4.527876
4,5,4.255473


The above `DataFrame` gives us the average seconds of activity per minute for each fish.
<br>
Quick note: Some of the functions we apply are very common, like the mean. For the functions of this type, Pandas often has built-in methods. Let's try using a `DataFrameGroupBy` object's `mean()` method.

In [60]:
df_mean = gb['activity'].mean().reset_index()

# Take a look
df_mean.head()

Unnamed: 0,location,activity
0,1,2.684318
1,2,3.296122
2,3,2.030076
3,4,4.527876
4,5,4.255473


We got exactly the same result!
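
To make the three steps concrete, here is a sketch that performs split, apply, and combine by hand on a toy `DataFrame` and compares the result with the built-in method. The data and the names `df_toy`, `df_manual`, and `df_builtin` are made up for illustration.

```python
import pandas as pd

df_toy = pd.DataFrame({'location': [1, 1, 2, 2, 2],
                       'activity': [0.0, 2.0, 1.0, 3.0, 5.0]})

# Split: one sub-DataFrame per location; apply: mean of activity
rows = []
for location, df_sub in df_toy.groupby('location'):
    rows.append({'location': location,
                 'activity': df_sub['activity'].mean()})

# Combine: assemble the per-group results into a summary DataFrame
df_manual = pd.DataFrame(rows)

# Same computation with the built-in method
df_builtin = df_toy.groupby('location')['activity'].mean().reset_index()

print(df_manual)
print(df_builtin)
```

The loop is useful for understanding what happens, but the built-in methods are shorter and generally faster.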

## Grouping by multiple columns
What we may *really* want is the mean activity for each genotype for each day and for each night. For this, we want to split the data set according to the `genotype`, `day`, and `light` columns. To do this, we simply pass a list of columns as the argument to the `groupby()` method.

In [61]:
# Group by three columns
gb = df.groupby(['genotype', 'day', 'light'])

# Apply the mean and reset index
df_mean = gb['activity'].mean().reset_index()

# Take a look
df_mean

Unnamed: 0,genotype,day,light,activity
0,het,4,False,0.447843
1,het,4,True,0.488257
2,het,5,False,0.900623
3,het,5,True,4.453268
4,het,6,False,1.125245
5,het,6,True,5.994772
6,het,7,False,1.305324
7,het,7,True,6.69987
8,het,8,True,7.9513
9,mut,4,False,0.66353


Now we can compute other summary statistics, such as the standard deviation. The `agg()` method allows passing multiple functions.


In [62]:
df_summary = gb['activity'].agg([np.mean, np.std])

# Take a look
df_summary

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,mean,std
genotype,day,light,Unnamed: 3_level_1,Unnamed: 4_level_1
het,4,False,0.447843,1.131747
het,4,True,0.488257,2.256494
het,5,False,0.900623,1.551801
het,5,True,4.453268,3.687598
het,6,False,1.125245,1.650415
het,6,True,5.994772,4.409796
het,7,False,1.305324,1.941781
het,7,True,6.69987,4.547888
het,8,True,7.9513,4.640795
mut,4,False,0.66353,1.418944


Columns are automatically named with the functions we chose to compute. We can also rename these columns and `reset_index()` them to get the results we want.

In [63]:
df_summary = (gb['activity'].agg([np.mean, np.std])
                            .reset_index()
                            .rename(columns={'mean': 'mean activity',
                                             'std': 'std activity'}))

# Take a look
df_summary

Unnamed: 0,genotype,day,light,mean activity,std activity
0,het,4,False,0.447843,1.131747
1,het,4,True,0.488257,2.256494
2,het,5,False,0.900623,1.551801
3,het,5,True,4.453268,3.687598
4,het,6,False,1.125245,1.650415
5,het,6,True,5.994772,4.409796
6,het,7,False,1.305324,1.941781
7,het,7,True,6.69987,4.547888
8,het,8,True,7.9513,4.640795
9,mut,4,False,0.66353,1.418944


The above example shows that, with the split-apply-combine approach, we can rapidly compute and then compare summary statistics, even for large data sets.
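
The aggregation and the renaming can also be done in one step with named aggregation, assuming pandas 0.25 or later. The sketch below uses a made-up toy `DataFrame`; because the new column names are given as keyword arguments, they must be valid Python identifiers (hence the underscores).

```python
import pandas as pd

df_toy = pd.DataFrame({'genotype': ['het', 'het', 'mut', 'mut'],
                       'light': [True, True, False, False],
                       'activity': [1.0, 3.0, 2.0, 6.0]})

# Each keyword names an output column: (input column, aggregation)
df_summary_toy = (df_toy.groupby(['genotype', 'light'])
                        .agg(mean_activity=('activity', 'mean'),
                             std_activity=('activity', 'std'))
                        .reset_index())

print(df_summary_toy)
```
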

## Resampling
Let's look at a single plot of the activity of a fish.

In [64]:
p = bokeh.plotting.figure(plot_height=250,
                          plot_width=700,
                          x_axis_label='Zeitgeber time (hours)',
                          y_axis_label='activity (sec/min)')

inds = df['location'] == 1
p.line(df.loc[inds, 'zeit'], df.loc[inds, 'activity'], line_join='bevel')

bokeh.io.show(p)

We used the `line_join='bevel'` argument. This ensures that we do not get weird point artifacts in line plots with a highly variable signal. This plot might look a little nicer if we colored it differently for night versus day. 
<br>
So we will loop through each day, plotting a line with a light color when the lights were on and a line with a dark color when the lights were off ("days" begin and end at lights on, not at midnight). For coloring, we will use one of Bokeh's built-in color palettes, based on the work of Cynthia Brewer. Since we might need to do this over and over again, let's write a function for this.

In [65]:
def plot_trace(df, p=None, activity='activity', y_axis_label='activity (sec/min)',
               colors=bokeh.palettes.brewer['Paired'][4]):
    """Plot a time trace of fish activity with differently colored
    light and dark periods.
    """

    # Build a new figure only if the caller did not supply one
    if p is None:
        p = bokeh.plotting.figure(plot_height=250,
                                  plot_width=700,
                                  x_axis_label='Zeitgeber time (hours)',
                                  y_axis_label=y_axis_label)

    for day in df['day'].unique():
        # Light color lines for day
        inds = (df['day'] == day) & (df['light'])
        p.line(df.loc[inds, 'zeit'],
               df.loc[inds, activity],
               line_join='bevel',
               color=colors[0])

        # Dark lines for night
        inds = (df['day'] == day) & (~df['light'])
        p.line(df.loc[inds, 'zeit'],
               df.loc[inds, activity],
               line_join='bevel',
               color=colors[1])

    return p
            


Let's use the function.

In [66]:
p = plot_trace(df.loc[(df['location'] == 1), :])
bokeh.io.show(p)

The data are quite noisy, as the fish are sometimes quiescent even when they are awake and darting around the well. In such a case, we can consider a lower sampling frequency for the activity. We can sum the number of seconds of movement (what we call activity) over a longer time window than the default one minute, say ten minutes. 
<br>
We could do it by hand, looping over each location and then performing the sum over each ten minute interval. It would be quite lengthy, though. Instead, we will do it using built-in Pandas functions.

### Using rolling aggregation
We first need to group the data set by `location`, as we want to resample the activity for each fish. The `DataFrameGroupBy` object has a `rolling()` method that will create an object over which you may compute rolling averages. Before we build this object, it is convenient to have the `DataFrame` sorted by `location` and then by `zeit_ind`, which is the index of the time point. This is a nice ordering, as each fish has its time trace in order, and the fish are in order too. We can accomplish this sorting using the `DataFrame`'s `sort_values()` method along with the `by` kwarg.

In [67]:
df = df.sort_values(by=['location', 'zeit_ind'])

# Take a look
df.head()

Unnamed: 0,location,activity,time,zeit,zeit_ind,day,genotype,light
0,1,0.6,2013-03-15 18:31:09,-14.480833,-869,4,het,True
1,1,1.9,2013-03-15 18:32:09,-14.464167,-868,4,het,True
2,1,1.9,2013-03-15 18:33:09,-14.4475,-867,4,het,True
3,1,13.4,2013-03-15 18:34:09,-14.430833,-866,4,het,True
4,1,15.4,2013-03-15 18:35:09,-14.414167,-865,4,het,True


Now that the data are in order, let's create the `DataFrameGroupBy` object and then generate a `RollingGroupby` object. We only need the `zeit_ind` and `activity` columns. To make the `RollingGroupby` object, we use the `on` keyword argument to indicate that we are going to use the `zeit_ind` column instead of the index to determine the window.

In [68]:
# Create GroupBy object
gb = df.groupby('location')[['zeit_ind', 'activity']]

# Make a RollingGroupby with window size of 10
rolling = gb.rolling(window=10, on='zeit_ind')

# Look at rolling object
rolling

RollingGroupby [window=10,center=False,axis=0,on=zeit_ind]

Having the `RollingGroupby` object, we can use its `sum()` method to compute a rolling sum.

In [69]:
df_rolling = rolling.sum()

# Take a look
df_rolling.head(n=20)

Unnamed: 0_level_0,Unnamed: 1_level_0,zeit_ind,activity
location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,0,-869,
1,1,-868,
1,2,-867,
1,3,-866,
1,4,-865,
1,5,-864,
1,6,-863,
1,7,-862,
1,8,-861,
1,9,-860,80.5


We have two index levels, `location` and the original row index. We want `location` to become a column again, so we use the `level=0` kwarg of `reset_index()` to indicate that only the location level should be reset.


In [70]:
df_rolling = df_rolling.reset_index(level=0)

# Take a look
df_rolling.head(n=20)

Unnamed: 0,location,zeit_ind,activity
0,1,-869,
1,1,-868,
2,1,-867,
3,1,-866,
4,1,-865,
5,1,-864,
6,1,-863,
7,1,-862,
8,1,-861,
9,1,-860,80.5


The first nine entries are `NaN`. This is because we cannot compute anything for a window of length 10 when we have fewer than 10 entries.
We can now insert the rolling summed activity into the original `DataFrame` because we have taken care to have it properly sorted.


In [71]:
df['rolling activity'] = df_rolling['activity']

# Take a look
df.head(n=20)

Unnamed: 0,location,activity,time,zeit,zeit_ind,day,genotype,light,rolling activity
0,1,0.6,2013-03-15 18:31:09,-14.480833,-869,4,het,True,
1,1,1.9,2013-03-15 18:32:09,-14.464167,-868,4,het,True,
2,1,1.9,2013-03-15 18:33:09,-14.4475,-867,4,het,True,
3,1,13.4,2013-03-15 18:34:09,-14.430833,-866,4,het,True,
4,1,15.4,2013-03-15 18:35:09,-14.414167,-865,4,het,True,
5,1,12.7,2013-03-15 18:36:09,-14.3975,-864,4,het,True,
6,1,11.4,2013-03-15 18:37:09,-14.380833,-863,4,het,True,
7,1,11.6,2013-03-15 18:38:09,-14.364167,-862,4,het,True,
8,1,8.4,2013-03-15 18:39:09,-14.3475,-861,4,het,True,
9,1,3.2,2013-03-15 18:40:09,-14.330833,-860,4,het,True,80.5
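
As a quick check of what the rolling sum does, the sketch below recomputes it for the ten activity values shown in the table above. The `min_periods` variant is included only to illustrate how the leading `NaN`s could be avoided; it is not used in this tutorial.

```python
import pandas as pd

# First ten activity values for location 1, copied from the table above
activity = pd.Series([0.6, 1.9, 1.9, 13.4, 15.4, 12.7, 11.4, 11.6, 8.4, 3.2])

# Full windows only: the first nine entries are NaN
rolling_sum = activity.rolling(window=10).sum()

# With min_periods=1, partial windows are summed instead of yielding NaN
partial_sum = activity.rolling(window=10, min_periods=1).sum()

print(rolling_sum.isna().sum(), round(rolling_sum.iloc[-1], 1))
print(round(partial_sum.iloc[0], 1))
```

The tenth entry is the sum of all ten values, which matches the 80.5 in the `rolling activity` column.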


Now let's plot the rolling activity against the Zeitgeber time.

In [73]:
p = plot_trace(df.loc[df['location'] == 1, :],
               activity='rolling activity',
               y_axis_label='activity (sec/10 min)')

bokeh.io.show(p)

The plot with 10-minute windows definitely has less noise, but there is still an entry for each time point. This is because `rolling()` uses overlapping windows. We therefore want to keep only every tenth entry for each fish in the `DataFrame`. We can build a new, resampled `DataFrame` by taking every tenth time point using `zeit_inds[9::10]`. We can check whether a given `zeit_ind` is in the list of `zeit_ind`s we want to keep using the `isin()` method.

In [78]:
# Get all zeit_inds, sorted
zeit_inds = np.sort(df['zeit_ind'].unique())

# Select every tenth, starting at tenth
zeit_inds = zeit_inds[9::10]

# Keep all entries matching the zeit_inds in the DataFrame
# (copy so that later edits do not touch df)
df_resampled = df.loc[df['zeit_ind'].isin(zeit_inds), :].copy()

# Drop original activity column
del df_resampled['activity']

# Rename the rolling activity column to activity
df_resampled = df_resampled.rename(columns={'rolling activity': 'activity'})

# Take a look
df_resampled.head(n=20)

Unnamed: 0,location,time,zeit,zeit_ind,day,genotype,light,activity
9,1,2013-03-15 18:40:09,-14.330833,-860,4,het,True,80.5
19,1,2013-03-15 18:50:09,-14.164167,-850,4,het,True,1.3
29,1,2013-03-15 19:00:09,-13.9975,-840,4,het,True,6.883383e-15
39,1,2013-03-15 19:10:09,-13.830833,-830,4,het,True,6.883383e-15
49,1,2013-03-15 19:20:09,-13.664167,-820,4,het,True,6.883383e-15
59,1,2013-03-15 19:30:09,-13.4975,-810,4,het,True,6.883383e-15
69,1,2013-03-15 19:40:09,-13.330833,-800,4,het,True,6.883383e-15
79,1,2013-03-15 19:50:09,-13.164167,-790,4,het,True,6.883383e-15
89,1,2013-03-15 20:00:09,-12.9975,-780,4,het,True,6.883383e-15
99,1,2013-03-15 20:10:09,-12.830833,-770,4,het,True,1.1


This looks good! Now let's make a plot of the trace for the fish in location 1.

In [80]:
p = plot_trace(df_resampled.loc[df_resampled['location']==1, :],
               activity='activity',
               y_axis_label='activity (sec/10 min)')

bokeh.io.show(p)

### Resampling using the resample method
There are several problems with this approach:
1. We might not want to resample past boundaries between light and dark (because then we get time points that contain both light and dark moments).
2. If the time points are not evenly sampled, by using a fixed window of 10 time points, we may be sampling time windows of varying lengths.
<br>
For the second point, we can check the differences between successive time points to see if the data are evenly sampled.

In [85]:
# Compute difference between time points in units of minutes
delta_t = (np.diff(df.loc[df['location']==1, 'time'].values)
           / np.timedelta64(1, 'm'))

# Plot the difference
p = bokeh.plotting.figure(plot_height=200,
                          plot_width=500,
                          x_axis_label='observation number',
                          y_axis_label='Δt (min)')
p.circle(np.arange(len(delta_t)), delta_t)
bokeh.io.show(p)

HA! We can see a gap of almost 5 minutes between consecutive time points. This is because the experimenter, in this case, had to check the instrument and momentarily pause acquisition. Here is a big statement:

        Always validate the dataset.

We need to know about possible discrepancies in the data **before** performing data analysis. If we are assuming equal time steps, and they are not equal, we need to know that before doing the analysis.
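
A check of this kind can also be done numerically rather than by eye. The sketch below flags time steps that deviate from one minute; the time stamps are made up, and the tolerance is an arbitrary choice for illustration.

```python
import pandas as pd

# Toy time stamps: one per minute, with a pause before the last one
times = pd.Series(pd.to_datetime(['2013-03-15 18:31:09',
                                  '2013-03-15 18:32:09',
                                  '2013-03-15 18:33:09',
                                  '2013-03-15 18:38:03']))

# Time step in minutes between consecutive observations (first is NaN)
delta_t = times.diff().dt.total_seconds() / 60

# Flag steps that deviate from one minute by more than a tolerance
tolerance = 0.05  # minutes; an arbitrary choice for this sketch
irregular = delta_t[(delta_t - 1).abs() > tolerance]

print(irregular)
```

An empty `irregular` would indicate evenly sampled data within the chosen tolerance.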
<br>
Now, we can alleviate these problems by using Pandas's built-in `resample()` method, which works on `DataFrameGroupBy` objects. It uses datetime columns to resample data to different time windows. Just as we created a `RollingGroupby` object before, we can make an analogous resampling object.

In [87]:
gb = df.groupby('location')
resampler = gb.resample('10min', on='time')

# Take a look
resampler

DatetimeIndexResamplerGroupby [freq=<10 * Minutes>, axis=0, closed=left, label=left, convention=e, base=0]

We provided the `resample()` method with the argument `10min` which tells it we want to resample to ten-minute intervals and the `on='time'` kwarg says to use the `time` column for resampling. For instructions on the strings for specifying the resampling frequency, we can look [here](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases).
<br>
Now that we have a resampler, we can aggregate. Let's start by summing, which is our goal; we want to know how many seconds of activity the fish had in each 10-minute interval.

In [88]:
resampler.sum()['activity'].head()

location  time               
1         2013-03-15 18:30:00    77.3
          2013-03-15 18:40:00     4.5
          2013-03-15 18:50:00     0.0
          2013-03-15 19:00:00     0.0
          2013-03-15 19:10:00     0.0
Name: activity, dtype: float64

Apparently, the resampler uses exact 10-minute intervals; that is, it gives the result at 18:30, 18:40, etc. This alleviates the problem of uneven sampling, in that it only aggregates over real time. However, there are some possible problems with this.
1. Forcing the intervals onto whole ten-minute boundaries causes some clipping, as the measured minutes start and end on fractional minutes. *It is important to understand how this rounding works, and what possible effects it might have on the subsequent analyses.*
2. Some time intervals have fewer observations than others. If we are looking at the mean activity per minute on those intervals, things are OK (though some are undersampled compared to others), but if we sum, some intervals will show lower activity when the fish are just as active. To obtain seconds of activity per 10 minutes, we should adjust the sums to scale with the number of measurements contained in the interval.
3. Because the lights-on and lights-off times, 9:00:00 and 23:00:00, fall on ten-minute boundaries, no resampled interval straddles a lights on/off event. If this were not the case, this strategy would not deal with that.

To fix problem (2), we can take the sum and then rescale it according to how many observations there were, using the `count()` method of the resampler.

In [90]:
# Resample with summing
activity_resampled = resampler.sum()['activity']

# Rescale sum with count
activity_resampled *= 10 / resampler.count()['activity']

# Reset the index
activity_resampled = activity_resampled.reset_index()

# Take a look
activity_resampled.head()

Unnamed: 0,location,time,activity
0,1,2013-03-15 18:30:00,85.888889
1,1,2013-03-15 18:40:00,4.5
2,1,2013-03-15 18:50:00,0.0
3,1,2013-03-15 19:00:00,0.0
4,1,2013-03-15 19:10:00,0.0
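
The first row can be checked by hand. The raw sum for the 18:30 interval was 77.3, and the rescaled value of 85.888889 implies that nine measurements fell in that interval, since 77.3 × 10 / 9 ≈ 85.89. The sketch below applies the same rescaling to a small synthetic series for a single fish. The data are made up, and a plain `DataFrame.resample()` is used instead of the grouped resampler to keep the example short.

```python
import pandas as pd

# Made-up data: constant activity of 2 sec/min, first bin is one minute short
toy = pd.DataFrame({
    'time': pd.date_range('2013-03-15 18:31:09', periods=19, freq='1min'),
    'activity': 2.0,
})

binned = toy.resample('10min', on='time')['activity']

# Raw sums differ only because the bins hold different numbers of points
raw_sum = binned.sum()

# Rescale to seconds of activity per ten minutes
rescaled = raw_sum * 10 / binned.count()

print(raw_sum.tolist())
print(rescaled.tolist())
```

After rescaling, both bins report the same activity, as they should for a fish that is equally active throughout.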


Activity is now approximately resampled to ten-minute intervals. "Approximately," because there is still issue (1) above: we are forcing whole-minute time stamps, resulting in some errors around the times of the lights switching on and off. However, not much can be done about this, as the experiment was not started and ended on whole-minute intervals, whereas lights on and off are exactly on a minute.

While we have the `location`, `time`, and `activity` columns properly resampled, we need to get the other columns as well. We can use the `first()` method of the resampler to get those entries (since we do not want to sum over them). The `first()` method gives the first entry in the resampled interval.

In [102]:
# Get a new DataFrame, resampled without summing.
df_resampled = resampler.first()

# The index now holds the resampled location and time; delete the stale columns of the same names
del df_resampled['time']
del df_resampled['location']

# Reset index so resampled location and time indices become columns
df_resampled = df_resampled.reset_index()

# Add in the properly resampled activity
df_resampled['activity'] = activity_resampled['activity']

# Take a look 
df_resampled.head()

Unnamed: 0,location,time,activity,zeit,zeit_ind,day,genotype,light,rolling activity
0,1,2013-03-15 18:30:00,85.888889,-14.480833,-869,4,het,True,
1,1,2013-03-15 18:40:00,4.5,-14.330833,-860,4,het,True,80.5
2,1,2013-03-15 18:50:00,0.0,-14.164167,-850,4,het,True,1.3
3,1,2013-03-15 19:00:00,0.0,-13.9975,-840,4,het,True,6.883383e-15
4,1,2013-03-15 19:10:00,0.0,-13.830833,-830,4,het,True,6.883383e-15


The `rolling activity` column is a remnant from the earlier analysis, so we should delete it. Also, the `zeit` column no longer corresponds to the `time` column, since that has been resampled. When we applied the `first()` method of the resampler, it did not know that `zeit` is connected to the time. So, we need to recompute the `zeit` column.

In [103]:
# Set the Zeitgeber time
zeitgeber_0 = pd.to_datetime('2013-03-16 9:00:00')
df_resampled['zeit'] = (df_resampled['time'] - zeitgeber_0).dt.total_seconds() / 3600

# Delete rolling activity
del df_resampled['rolling activity']

# Take a look
df_resampled.head()

Unnamed: 0,location,time,activity,zeit,zeit_ind,day,genotype,light
0,1,2013-03-15 18:30:00,85.888889,-14.5,-869,4,het,True
1,1,2013-03-15 18:40:00,4.5,-14.333333,-860,4,het,True
2,1,2013-03-15 18:50:00,0.0,-14.166667,-850,4,het,True
3,1,2013-03-15 19:00:00,0.0,-14.0,-840,4,het,True
4,1,2013-03-15 19:10:00,0.0,-13.833333,-830,4,het,True
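
As a quick check of the Zeitgeber arithmetic, the sketch below repeats the same calculation on three hand-picked time stamps (chosen here for illustration). An interval starting at 18:30 on March 15 is 14.5 hours before 9:00 on March 16, which agrees with the first row above.

```python
import pandas as pd

zeitgeber_0 = pd.to_datetime('2013-03-16 9:00:00')

# Hand-picked time stamps for illustration
times = pd.Series(pd.to_datetime(['2013-03-15 18:30:00',
                                  '2013-03-15 18:40:00',
                                  '2013-03-16 09:00:00']))

# Hours relative to Zeitgeber time zero; negative values are before lights on
zeit = (times - zeitgeber_0).dt.total_seconds() / 3600

print(zeit.round(6).tolist())
```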


We can now make a plot of these resampled data.

In [104]:
p = plot_trace(df_resampled.loc[df_resampled['location']==1, :],
               activity='activity',
               y_axis_label='activity (sec/10 min)')
bokeh.io.show(p)

This seems to be similar to what we have done before, but we took better care of missing time points.

## Saving work and pipeline
When wrangling data as we have done, it is important to have code and documentation that will perform *exactly* the same steps as we did to organize the data. This is true of the entire data analysis process, from notes on acquisition to validation to wrangling to inference and reporting. Jupyter notebooks can help, but it is often useful to have functions that carry out the analysis pipeline we would typically use. For example, we could write a function to generate the resampled `DataFrame` we just generated from the original data.

In [106]:
def load_and_resample(activity_file, genotype_file, zeitgeber_0, resample_r=None):
    """
    Load and resample activity data.
    Assumes genotype file has columns corresponding
    to wild type, heterozygote, and mutant.
    """
    
    # Load in the genotype file, call it df_gt for genotype DataFrame
    df_gt = pd.read_csv(genotype_file,
                        delimiter='\t',
                        comment='#',
                        header=[0, 1])
    
    # Rename columns
    df_gt.columns = ['wt', 'het', 'mut']
    
    # Melt to tidy
    df_gt = pd.melt(df_gt, var_name='genotype', value_name='location').dropna()
    
    # Reset index
    df_gt = df_gt.reset_index(drop=True)
    
    # Integer location names
    df_gt['location'] = df_gt['location'].astype(int)
    
    # Read in activity data
    df = pd.read_csv(activity_file, comment='#')
    
    # Merge with genotype data
    df = pd.merge(df, df_gt)
    
    # Convert time to datetime
    df['time'] = pd.to_datetime(df['time'])
    
    # Column for light or dark
    df['light'] = (  (df['time'].dt.time >= pd.to_datetime('9:00:00').time())
                   & (df['time'].dt.time < pd.to_datetime('23:00:00').time()))
    
    if resample_r is None:
        df_resampled = df
    else:
        # Group by location
        gb = df.groupby('location')
        
        # Make resampler
        resampler = gb.resample(resample_r, on='time')
        
        # Resample with summing
        activity_resampled = resampler.sum()['activity']
        
        # Rescale sum with count (the factor of 10 assumes ten-minute intervals)
        activity_resampled *= 10 / resampler.count()['activity']
        
        # Reset the index
        activity_resampled = activity_resampled.reset_index()
        
        # Get a new DataFrame, resampled without summing.
        df_resampled = resampler.first()
        
        # The index now holds the resampled location and time; delete the stale columns of the same names
        del df_resampled['time']
        del df_resampled['location']
        
        # Reset index so resampled location and time indices become columns
        df_resampled = df_resampled.reset_index()
        
        # Add in the properly resampled activity
        df_resampled['activity'] = activity_resampled['activity']
        
    # Set the Zeitgeber time
    zeitgeber_0 = pd.to_datetime(zeitgeber_0)
    df_resampled['zeit'] = (df_resampled['time'] - zeitgeber_0).dt.total_seconds() / 3600
    
    return df_resampled

This function makes strong assumptions about the structure of the data files, which need to be verified before the function is used. Also, it should have a much more descriptive docstring explaining these things.
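
One such assumption is the factor of 10 in the rescaling step, which matches a ten-minute interval but not other values of `resample_r`. A possible way to relax it is sketched below. The helper `minutes_per_interval()` is hypothetical (it is not part of the original tutorial), and it assumes that `resample_r` is a fixed-width string with an explicit number, such as `'10min'`, that `pd.Timedelta` can parse.

```python
import pandas as pd

def minutes_per_interval(resample_r):
    """Length of a fixed-width resampling interval, in minutes.

    Hypothetical helper; assumes resample_r is a string such as
    '10min' that pd.Timedelta can parse.
    """
    return pd.Timedelta(resample_r) / pd.Timedelta(minutes=1)

# Interval lengths for two example resampling rules
print(minutes_per_interval('10min'))
print(minutes_per_interval('30min'))
```

Inside `load_and_resample()`, the constant `10` could then be replaced by `minutes_per_interval(resample_r)`. This still assumes roughly one measurement per minute, as in this data set.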

Now that we have a function, we can use it to load in the original data sets and then generate the resampled tidy data frame.

In [109]:
df = load_and_resample('data/130315_1A_aanat2.csv', 
                       'data/130315_1A_genotypes.txt', 
                       '2013-03-16 9:00:00',
                       resample_r='10min')
# Take a look
df.head()

Unnamed: 0,location,time,activity,zeit,zeit_ind,day,genotype,light
0,1,2013-03-15 18:30:00,85.888889,-14.5,-869,4,het,True
1,1,2013-03-15 18:40:00,4.5,-14.333333,-860,4,het,True
2,1,2013-03-15 18:50:00,0.0,-14.166667,-850,4,het,True
3,1,2013-03-15 19:00:00,0.0,-14.0,-840,4,het,True
4,1,2013-03-15 19:10:00,0.0,-13.833333,-830,4,het,True


To save the processed `DataFrame` to a CSV file, we can use the `to_csv()` method of a `DataFrame`.

In [112]:
df.to_csv('data/130315_1A_aanat2_resampled.csv', index=False)
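
When the file is read back in, the `time` column is not parsed as a datetime unless we ask for it. The sketch below illustrates this with a small made-up `DataFrame` and an in-memory buffer standing in for the file on disk; the `parse_dates` kwarg of `pd.read_csv()` restores the datetime type.

```python
import io
import pandas as pd

# Small made-up frame standing in for the resampled data
toy = pd.DataFrame({
    'location': [1, 1],
    'time': pd.to_datetime(['2013-03-15 18:30:00', '2013-03-15 18:40:00']),
    'activity': [85.9, 4.5],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Read back without and with date parsing
buffer.seek(0)
as_text = pd.read_csv(buffer)
buffer.seek(0)
as_datetime = pd.read_csv(buffer, parse_dates=['time'])

print(as_text['time'].dtype)
print(as_datetime['time'].dtype)
```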


## Conclusions (and a lesson about validation)
* We learned about the power of tidy data and split-apply-combine.
* The "apply" step can be an aggregation (such as computing summary statistics like the mean or standard deviation) or a transformation, like a resampling or rolling mean. Pandas provides useful tools to enable these kinds of calculations.
* We learned that knowing the structure of the data set is imperative. We still made some key assumptions about our tidy activity data:
    1. The time samples are evenly spaced.
    2. There are no missing data points.
    3. Each location has the same time points.
* We relied on these assumptions as we did resampling and computed summary statistics about the data.
* There are many hidden assumptions, for instance the assumption that values of the activity are all between zero and sixty. If not, something went wrong.
* This tutorial underscores the importance of **data validation** *before* analysis.
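
The assumptions listed above can be turned into explicit checks. The following is a minimal sketch, not a complete validation suite. The function name `check_assumptions()` and the toy data are hypothetical, and the zero-to-sixty range applies to the original one-minute activity data, not to the ten-minute resampled values.

```python
import numpy as np
import pandas as pd

def check_assumptions(df, max_activity=60):
    """Check assumptions about a tidy activity DataFrame (hypothetical helper)."""
    # Sorted time points for each location
    times = {loc: g['time'].sort_values().reset_index(drop=True)
             for loc, g in df.groupby('location')}
    reference = next(iter(times.values()))

    return {
        'no_missing': not df[['location', 'time', 'activity']].isna().any().any(),
        # Evenly spaced: at most one distinct gap between successive times
        'evenly_spaced': all(t.diff().dropna().nunique() <= 1
                             for t in times.values()),
        'same_time_points': all(t.equals(reference) for t in times.values()),
        'activity_in_range': bool(df['activity'].between(0, max_activity).all()),
    }

# Made-up data: two locations with three one-minute measurements each
good = pd.DataFrame({
    'location': np.repeat([1, 2], 3),
    'time': list(pd.date_range('2013-03-15 18:31:09', periods=3, freq='1min')) * 2,
    'activity': [0.0, 1.5, 60.0, 3.0, 0.0, 12.2],
})

# Break two assumptions: drop a row and add an impossible value
bad = good.drop(index=4).copy()
bad.loc[0, 'activity'] = 75.0

print(check_assumptions(good))
print(check_assumptions(bad))
```

Returning a dictionary of named results, instead of raising on the first failure, makes it easy to see every violated assumption at once.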