# Tutorial 1b: Data frames

(c) 2017 Justin Bois. This work is licensed under a [Creative Commons Attribution License CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/). All code contained herein is licensed under an [MIT license](https://opensource.org/licenses/MIT).

In [4]:
import numpy as np
import pandas as pd

import bokeh.io
import bokeh.layouts
import bokeh.plotting

from bokeh.models import Legend
from bokeh.plotting import figure, show, output_file

bokeh.io.output_notebook()

In this tutorial, we will learn how to load data stored on disk into a Python data structure. We will use Pandas to read in CSV (comma-separated value) files and store the results in the very handy Pandas `DataFrame`. This flexible and powerful data structure will be a centerpiece of the rest of this course and beyond. Along the way, we will learn what a data frame is and how to use it.
<br>
<br>
The data set we will use comes from a fun paper about the adhesive properties of frog tongues. The reference is Kleinteich and Gorb, Tongue adhesion in the horned frog Ceratophrys sp., *Sci. Rep.*, 4, 5225, 2014. 

## The data file
The data are contained in the file `frog_tongue_adhesion.csv`. Let's look at its contents:

In [5]:
with open('data/frog_tongue_adhesion.csv', 'r') as f:
    for _ in range(20):
        print(next(f), end='')

# These data are from the paper,
#   Kleinteich and Gorb, Sci. Rep., 4, 5225, 2014.
# It was featured in the New York Times.
#    http://www.nytimes.com/2014/08/25/science/a-frog-thats-a-living-breathing-pac-man.html
#
# The authors included the data in their supplemental information.
#
# Importantly, the ID refers to the identifites of the frogs they tested.
#   I:   adult, 63 mm snout-vent-length (SVL) and 63.1 g body weight,
#        Ceratophrys cranwelli crossed with Ceratophrys cornuta
#   II:  adult, 70 mm SVL and 72.7 g body weight,
#        Ceratophrys cranwelli crossed with Ceratophrys cornuta
#   III: juvenile, 28 mm SVL and 12.7 g body weight, Ceratophrys cranwelli
#   IV:  juvenile, 31 mm SVL and 12.7 g body weight, Ceratophrys cranwelli
date,ID,trial number,impact force (mN),impact time (ms),impact force / body weight,adhesive force (mN),time frog pulls on target (ms),adhesive force / body weight,adhesive impulse (N-s),total contact area (mm2),contact area without mucus (m

## Loading a data set
We use `pd.read_csv()` to load the data set into a Pandas **DataFrame**. The `comment='#'` keyword argument tells it to skip the lines that begin with `#`.

In [6]:
df = pd.read_csv('data/frog_tongue_adhesion.csv', comment='#')

# Look at the DataFrame
df.head()

Unnamed: 0,date,ID,trial number,impact force (mN),impact time (ms),impact force / body weight,adhesive force (mN),time frog pulls on target (ms),adhesive force / body weight,adhesive impulse (N-s),total contact area (mm2),contact area without mucus (mm2),contact area with mucus / contact area without mucus,contact pressure (Pa),adhesive strength (Pa)
0,2013_02_26,I,3,1205,46,1.95,-785,884,1.27,-0.29,387,70,0.82,3117,-2030
1,2013_02_26,I,4,2527,44,4.08,-983,248,1.59,-0.181,101,94,0.07,24923,-9695
2,2013_03_01,I,1,1745,34,2.82,-850,211,1.37,-0.157,83,79,0.05,21020,-10239
3,2013_03_01,I,2,1556,41,2.51,-455,1025,0.74,-0.17,330,158,0.52,4718,-1381
4,2013_03_01,I,3,493,36,0.8,-974,499,1.57,-0.423,245,216,0.12,2012,-3975
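
To see what the `comment='#'` keyword argument does in isolation, here is a minimal sketch that parses a small in-memory CSV rather than a file on disk. The text, the column names, and the name `df_small` are made up for illustration; only `pd.read_csv()` and its `comment` argument come from the cell above.

```python
import io

import pandas as pd

# Hypothetical file contents: two comment lines, then a header and two rows
csv_text = """# This line is a comment
# So is this one
ID,force (mN)
I,1205
II,493
"""

# Lines starting with '#' are skipped, so the header is the first line read
df_small = pd.read_csv(io.StringIO(csv_text), comment='#')

print(df_small.columns.tolist())
print(len(df_small))
```

Without `comment='#'`, the parser would try to treat the first comment line as the header.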


In [7]:
# We can access a column of data (slice the data) using the column name
df.head()['impact force (mN)']

0    1205
1    2527
2    1745
3    1556
4     493
Name: impact force (mN), dtype: int64

Note that the row indices are preserved in the sliced column. The data were interpreted as integers (`dtype: int64`), so we convert them to floats using the `.astype()` method.

In [8]:
# Use the astype() method to convert the column to the NumPy float64 data type.
df['impact force (mN)'] = df['impact force (mN)'].astype(float)

# Let's check if it worked well
df['impact force (mN)'].dtype

dtype('float64')

Now let's select only the strikes with an impact force of at least two Newtons. A Pandas `DataFrame` can be conveniently sliced with Booleans.

In [9]:
# Generate True/False array of rows for indexing 
inds = df['impact force (mN)'] >= 2000.0

# Take a look
inds.head()

0    False
1     True
2    False
3    False
4    False
Name: impact force (mN), dtype: bool

We now have an array of Booleans with the same length as the `DataFrame` itself. We can use the `.loc` feature of a `DataFrame` to slice out the rows we want.

In [10]:
# Slice out rows we want (with big force)
df_big_force = df.loc[inds,:]

# Let's look at the sliced rows; 
# there will be just a couple of high-force values
df_big_force

Unnamed: 0,date,ID,trial number,impact force (mN),impact time (ms),impact force / body weight,adhesive force (mN),time frog pulls on target (ms),adhesive force / body weight,adhesive impulse (N-s),total contact area (mm2),contact area without mucus (mm2),contact area with mucus / contact area without mucus,contact pressure (Pa),adhesive strength (Pa)
1,2013_02_26,I,4,2527.0,44,4.08,-983,248,1.59,-0.181,101,94,0.07,24923,-9695
5,2013_03_01,I,4,2276.0,31,3.68,-592,969,0.96,-0.176,341,106,0.69,6676,-1737
8,2013_03_05,I,3,2641.0,50,4.27,-690,491,1.12,-0.239,269,224,0.17,9824,-2568
17,2013_03_15,I,3,2032.0,60,3.28,-652,486,1.05,-0.257,147,134,0.09,13784,-4425


Using `.loc` allows us to index by row and column. We chose all columns (using `:`) and supplied an array of Booleans for the rows. We get back a `DataFrame` containing the rows for which the Boolean array is `True`.
<br>
<br>

Now we only have the strikes of high force. Note that the original row indices were retained. This is generally a good idea, since the indices do not have to be integers and may carry meaning.
<br>
If we instead want to index by sequential integer position, we can use the `iloc` attribute of a `DataFrame`.

In [11]:
# Indexing using iloc, which enables indexing by the corresponding
# integer sequence
df_big_force['impact force (mN)'].iloc[3]

2032.0
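
The difference between label-based and position-based indexing can be sketched on a toy `DataFrame`. The values and the names `df_toy` and `df_big` below are made up for illustration and are not part of the frog data set.

```python
import pandas as pd

# Toy data; the values are made up for illustration
df_toy = pd.DataFrame({'force': [1205.0, 2527.0, 1745.0, 2276.0]})

# Boolean indexing keeps the original row labels (here 1 and 3)
df_big = df_toy.loc[df_toy['force'] >= 2000.0, :]

# .loc selects by label, .iloc selects by integer position
by_label = df_big.loc[3, 'force']
by_position = df_big['force'].iloc[1]

print(df_big.index.tolist())
print(by_label, by_position)
```

Both lookups return the same entry: the row labeled `3` is the second row (position `1`) of the sliced `DataFrame`.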

## Tidy data
The data in our `DataFrame` are tidy. This concept comes from the development of databases, but has been generalized to data processing in recent years. Tidy data are data in tabular form that satisfy the following conditions: <br>
<br>
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a separate table.
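
As a rough illustration of these rules, the sketch below converts a small "wide" table, which is not tidy because each row holds several observations, into tidy form using the `melt()` method of a `DataFrame`. The table and its values are hypothetical and are not taken from the frog data set.

```python
import pandas as pd

# Hypothetical wide table: one row per frog, one column per trial
df_wide = pd.DataFrame({'ID': ['I', 'II'],
                        'trial 1': [1745, 1612],
                        'trial 2': [1556, 605]})

# Melt into tidy form: each row is now a single (frog, trial) observation
df_tidy = df_wide.melt(id_vars='ID',
                       var_name='trial',
                       value_name='impact force (mN)')

print(df_tidy)
```

In the tidy version, `ID`, `trial`, and `impact force (mN)` are each a column, and each of the four strikes is its own row.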

## More data extraction
To extract a single observation (i.e., a single experiment), we can use `.loc` to pull out a row and see all of the measured quantities for a given strike.

In [12]:
# Slice out experiment with index 42
df.loc[42,:]

date                                                    2013_05_27
ID                                                             III
trial number                                                     3
impact force (mN)                                              324
impact time (ms)                                               105
impact force / body weight                                    2.61
adhesive force (mN)                                           -172
time frog pulls on target (ms)                                 619
adhesive force / body weight                                  1.38
adhesive impulse (N-s)                                      -0.079
total contact area (mm2)                                        55
contact area without mucus (mm2)                                23
contact area with mucus / contact area without mucus          0.37
contact pressure (Pa)                                         5946
adhesive strength (Pa)                                       -

The result is a Pandas `Series` object. Conveniently, its index consists of the column headings of the `DataFrame`, so each element of the row is labeled with the quantity it describes.
<br>
<br>
Slicing out a single index is not very meaningful, because the indices are arbitrary. We can instead consult our lab notebook and look up trial number 3 on May 27, 2013, of frog III. This is a more common, and indeed more meaningful, use case.

In [13]:
# Set up Boolean slicing
date = df['date'] == '2013_05_27'
trial = df['trial number'] == 3
ID = df['ID'] == 'III'

# Slice out the row
df.loc[date & trial & ID]

Unnamed: 0,date,ID,trial number,impact force (mN),impact time (ms),impact force / body weight,adhesive force (mN),time frog pulls on target (ms),adhesive force / body weight,adhesive impulse (N-s),total contact area (mm2),contact area without mucus (mm2),contact area with mucus / contact area without mucus,contact pressure (Pa),adhesive strength (Pa)
42,2013_05_27,III,3,324.0,105,2.61,-172,619,1.38,-0.079,55,23,0.37,5946,-3149


The difference is that the returned slice is a `DataFrame` instead of a `Series`. We can easily make it a `Series` using `iloc`. Note that in most use cases, this is just a matter of convenience for viewing and nothing more.


In [14]:
df.loc[date & trial & ID, :].iloc[0]

date                                                    2013_05_27
ID                                                             III
trial number                                                     3
impact force (mN)                                              324
impact time (ms)                                               105
impact force / body weight                                    2.61
adhesive force (mN)                                           -172
time frog pulls on target (ms)                                 619
adhesive force / body weight                                  1.38
adhesive impulse (N-s)                                      -0.079
total contact area (mm2)                                        55
contact area without mucus (mm2)                                23
contact area with mucus / contact area without mucus          0.37
contact pressure (Pa)                                         5946
adhesive strength (Pa)                                       -

## Renaming columns
The lengthy column names can be annoying to type, so it can be useful to rename the columns. For instance, to access the ratio of the contact area with mucus to the contact area without mucus, we currently have to do the following.

In [15]:
# Set up criteria for our search in the DataFrame
date = df['date'] == '2013_05_27'
trial = df['trial number'] == 3
ID = df['ID'] == 'III'

# When indexing DataFrames, use & for Boolean and (and | for or; ~ for not)
df.loc[date & trial & ID, 'contact area with mucus / \
contact area without mucus']


42    0.37
Name: contact area with mucus / contact area without mucus, dtype: float64
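
Elementwise Boolean logic on Pandas `Series` uses the operators `&` (and), `|` (or), and `~` (not). Here is a small sketch on made-up data; the names `df_toy`, `is_frog_I`, and `is_trial_3` are hypothetical.

```python
import pandas as pd

# Toy data; the values are made up for illustration
df_toy = pd.DataFrame({'ID': ['I', 'I', 'II', 'III'],
                       'trial': [1, 3, 3, 3]})

is_frog_I = df_toy['ID'] == 'I'
is_trial_3 = df_toy['trial'] == 3

both = df_toy.loc[is_frog_I & is_trial_3, :]      # and
either = df_toy.loc[is_frog_I | is_trial_3, :]    # or
not_frog_I = df_toy.loc[~is_frog_I, :]            # not

print(len(both), len(either), len(not_frog_I))
```

Python's `and`, `or`, and `not` keywords do not work elementwise on a `Series`, which is why the operators are needed.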

The reason for the verbose column headings is to avoid ambiguity. Also, many plotting packages, including *HoloViews*, can automatically label axes based on the headers of a `DataFrame`. To use shorter names, we do this.

In [16]:
# Make a dictionary to rename columns
rename_dict = {'trial number' : 'trial',
               'contact area with mucus / contact area without mucus' :
               'ca_ratio'}

# Rename the columns
df = df.rename(columns=rename_dict)

# Try out the new column name
df.loc[date & trial & ID, 'ca_ratio']

# Indexing of dictionaries looks syntactically similar to cols in DataFrames
rename_dict['trial number']

'trial'

## Computing with DataFrames and adding columns
`DataFrame`s are convenient for organizing and selecting data, but what about doing computations and storing the results in new columns?
<br>
As a simple example, let's say we want a column with the impact force in units of Newtons instead of milliNewtons. We can divide the `impact force (mN)` column elementwise by 1000, just as with Numpy arrays.

In [17]:
# Add a new column with impact force in units of Newtons
df['impact force (N)'] = df['impact force (mN)'] / 1000

# Take a look
df.head()

Unnamed: 0,date,ID,trial,impact force (mN),impact time (ms),impact force / body weight,adhesive force (mN),time frog pulls on target (ms),adhesive force / body weight,adhesive impulse (N-s),total contact area (mm2),contact area without mucus (mm2),ca_ratio,contact pressure (Pa),adhesive strength (Pa),impact force (N)
0,2013_02_26,I,3,1205.0,46,1.95,-785,884,1.27,-0.29,387,70,0.82,3117,-2030,1.205
1,2013_02_26,I,4,2527.0,44,4.08,-983,248,1.59,-0.181,101,94,0.07,24923,-9695,2.527
2,2013_03_01,I,1,1745.0,34,2.82,-850,211,1.37,-0.157,83,79,0.05,21020,-10239,1.745
3,2013_03_01,I,2,1556.0,41,2.51,-455,1025,0.74,-0.17,330,158,0.52,4718,-1381,1.556
4,2013_03_01,I,3,493.0,36,0.8,-974,499,1.57,-0.423,245,216,0.12,2012,-3975,0.493


We created the new column by assigning the Pandas `Series` we calculated to it.
<br>
We can do other calculations on `DataFrame`s besides elementwise calculations. For example, if we want the mean impact force in units of milliNewtons, we can do this.

In [18]:
df['impact force (mN)'].mean()

801.6875

To compute a set of useful summary statistics about the `DataFrame` all at once, we can use the `describe()` method, which gives the count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum for each numerical column.

In [19]:
df.describe()

Unnamed: 0,trial,impact force (mN),impact time (ms),impact force / body weight,adhesive force (mN),time frog pulls on target (ms),adhesive force / body weight,adhesive impulse (N-s),total contact area (mm2),contact area without mucus (mm2),ca_ratio,contact pressure (Pa),adhesive strength (Pa),impact force (N)
count,80.0,80.0,80.0,80.0,80.0,80.0,80.0,80.0,80.0,80.0,80.0,80.0,80.0,80.0
mean,2.4,801.6875,39.0625,2.920375,-397.7625,1132.45,1.444875,-0.187462,166.475,61.4,0.569,6073.1625,-3005.875,0.801687
std,1.164887,587.836143,19.558639,1.581092,228.675562,747.172695,0.659858,0.134058,98.696052,58.532821,0.341862,5515.265706,2525.468421,0.587836
min,1.0,22.0,6.0,0.17,-983.0,189.0,0.22,-0.768,19.0,0.0,0.01,397.0,-17652.0,0.022
25%,1.0,456.0,29.75,1.47,-567.75,682.25,0.99,-0.27725,104.75,16.75,0.28,2579.25,-3443.25,0.456
50%,2.0,601.0,34.0,3.03,-335.0,927.0,1.32,-0.165,134.5,43.0,0.665,4678.0,-2185.5,0.601
75%,3.0,1005.0,42.0,4.2775,-224.5,1381.25,1.7725,-0.08125,238.25,92.5,0.885,7249.75,-1736.0,1.005
max,5.0,2641.0,143.0,6.49,-92.0,4251.0,3.4,-0.001,455.0,260.0,1.0,28641.0,-678.0,2.641


Now let's say we want to compute the mean impact force of each frog. We can combine Boolean indexing with looping to do that.

In [20]:
for frog in df['ID'].unique():
    inds = df['ID'] == frog
    mean_impact_force = df.loc[inds, 'impact force (mN)'].mean()
    print('{0:s}: {1:.1f} mN'.format(frog, mean_impact_force))

I: 1530.2 mN
II: 707.4 mN
III: 550.1 mN
IV: 419.1 mN


We used the `.unique()` method to get the unique entries in the `ID` column. Pandas's `groupby()` method can do this much more cleanly and efficiently, but for now, it is good to appreciate the ability to pull out the data needed and do computations with it.
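
For reference, a minimal sketch of the `groupby()` approach mentioned above might look like the following. It uses a toy `DataFrame` with made-up values rather than the frog data, and the names `df_toy` and `mean_forces` are hypothetical.

```python
import pandas as pd

# Toy data; the values are made up for illustration
df_toy = pd.DataFrame({'ID': ['I', 'I', 'II', 'II'],
                       'impact force (mN)': [1000.0, 2000.0, 400.0, 600.0]})

# Split by frog ID, select one column, and take the mean of each group
mean_forces = df_toy.groupby('ID')['impact force (mN)'].mean()

for frog, mean_impact_force in mean_forces.items():
    print('{0:s}: {1:.1f} mN'.format(frog, mean_impact_force))
```

The result is a `Series` indexed by the group labels, so no explicit loop over `unique()` values or Boolean indexing is needed.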
<br>
## Creating a DataFrame from scratch
Let's construct our own `DataFrame` from scratch, containing information about each frog.
<br>
One way to do this is to first create a dictionary with the respective fields, and then convert it into a `DataFrame` by instantiating the `pd.DataFrame` class with it.


In [21]:
# Create a dictionary with the appropriate fields
data_dict = {'ID': ['I', 'II', 'III', 'IV'],
             'age': ['adult', 'adult', 'juvenile', 'juvenile'],
             'SVL (mm)': [63, 70, 28, 31],
             'weight (g)': [63.1, 72.7, 12.7, 12.7],
             'species': ['cross', 'cross', 'cranwelli', 'cranwelli']}

# Make it into a DataFrame
df_frog_info = pd.DataFrame(data=data_dict)

# Take a look
df_frog_info



Unnamed: 0,ID,age,SVL (mm),weight (g),species
0,I,adult,63,63.1,cross
1,II,adult,70,72.7,cross
2,III,juvenile,28,12.7,cranwelli
3,IV,juvenile,31,12.7,cranwelli


Often, a data set is too large to enter into a dictionary by hand. Instead, we may have a two-dimensional array of data that we want to make into a `DataFrame`. For instance, we might have a Numpy array with two columns, one for the snout-vent length and one for the weight.

In [22]:
data = np.array([[63, 70, 28, 31],
                 [63.1, 72.7, 12.7, 12.7]]).transpose()

# Let's verify this
data

array([[63. , 63.1],
       [70. , 72.7],
       [28. , 12.7],
       [31. , 12.7]])

To turn the data above into a `DataFrame`, we also specify the `columns` keyword argument.

In [23]:
df_demo = pd.DataFrame(data=data, columns=['SVL (mm)', 'weight (g)'])

# Take a look
df_demo

Unnamed: 0,SVL (mm),weight (g)
0,63.0,63.1
1,70.0,72.7
2,28.0,12.7
3,31.0,12.7


In general, any two-dimensional Numpy array can be converted into a `DataFrame` in this way. We only need to supply column names.
<br>
## Merging DataFrames
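The frog information is now in a separate `DataFrame`, and we would like to add it to the main `DataFrame` of strikes. One way is to go row by row: for each strike, we look up the frog's `ID` and fill in the relevant value in each new column. We will do these operations on a copy of `df`, made using the `copy()` method.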

In [24]:
# Make a copy of df
df_copy = df.copy()

# Build each column
for col in df_frog_info.columns[df_frog_info.columns != 'ID']:
    # Make a new column with empty values
    df_copy[col] = np.empty(len(df_copy))
    
    # Add in each entry, row by row
    for i, r in df_copy.iterrows():
        ind = df_frog_info['ID'] == r['ID']
        df_copy.loc[i, col] = df_frog_info.loc[ind, col].iloc[0]
        
# Take a look at the updated DataFrame
df_copy.head()

Unnamed: 0,date,ID,trial,impact force (mN),impact time (ms),impact force / body weight,adhesive force (mN),time frog pulls on target (ms),adhesive force / body weight,adhesive impulse (N-s),total contact area (mm2),contact area without mucus (mm2),ca_ratio,contact pressure (Pa),adhesive strength (Pa),impact force (N),age,SVL (mm),weight (g),species
0,2013_02_26,I,3,1205.0,46,1.95,-785,884,1.27,-0.29,387,70,0.82,3117,-2030,1.205,adult,63.0,63.1,cross
1,2013_02_26,I,4,2527.0,44,4.08,-983,248,1.59,-0.181,101,94,0.07,24923,-9695,2.527,adult,63.0,63.1,cross
2,2013_03_01,I,1,1745.0,34,2.82,-850,211,1.37,-0.157,83,79,0.05,21020,-10239,1.745,adult,63.0,63.1,cross
3,2013_03_01,I,2,1556.0,41,2.51,-455,1025,0.74,-0.17,330,158,0.52,4718,-1381,1.556,adult,63.0,63.1,cross
4,2013_03_01,I,3,493.0,36,0.8,-974,499,1.57,-0.423,245,216,0.12,2012,-3975,0.493,adult,63.0,63.1,cross


We used the `iterrows()` method of the `df_copy` data frame. The iterator gives an index (which we called `i`) and a row of the `DataFrame` (which we called `r`). This method, and the analogous one for iterating over columns, `iteritems()` (named `items()` in recent Pandas versions), can be useful.
<br>
However, a much better approach is to use Pandas's built-in `merge()` method. Called with all the default kwargs, it finds the columns that the two `DataFrame`s have in common (in this case, the `ID` column) and uses them to merge the two, filling in the values that match in the common columns. This is exactly what we want.

In [25]:
df = df.merge(df_frog_info)

# Check it 
df.head()

Unnamed: 0,date,ID,trial,impact force (mN),impact time (ms),impact force / body weight,adhesive force (mN),time frog pulls on target (ms),adhesive force / body weight,adhesive impulse (N-s),total contact area (mm2),contact area without mucus (mm2),ca_ratio,contact pressure (Pa),adhesive strength (Pa),impact force (N),age,SVL (mm),weight (g),species
0,2013_02_26,I,3,1205.0,46,1.95,-785,884,1.27,-0.29,387,70,0.82,3117,-2030,1.205,adult,63,63.1,cross
1,2013_02_26,I,4,2527.0,44,4.08,-983,248,1.59,-0.181,101,94,0.07,24923,-9695,2.527,adult,63,63.1,cross
2,2013_03_01,I,1,1745.0,34,2.82,-850,211,1.37,-0.157,83,79,0.05,21020,-10239,1.745,adult,63,63.1,cross
3,2013_03_01,I,2,1556.0,41,2.51,-455,1025,0.74,-0.17,330,158,0.52,4718,-1381,1.556,adult,63,63.1,cross
4,2013_03_01,I,3,493.0,36,0.8,-974,499,1.57,-0.423,245,216,0.12,2012,-3975,0.493,adult,63,63.1,cross
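
The defaults worked here because `ID` is the only shared column. As a hedged sketch, the merge can also be written with the key column and join type spelled out using the `on` and `how` keyword arguments of `merge()`. The two toy `DataFrame`s below, `df_strikes` and `df_info`, hold made-up values for illustration.

```python
import pandas as pd

# Toy data; the values are made up for illustration
df_strikes = pd.DataFrame({'ID': ['I', 'I', 'II'],
                           'impact force (mN)': [1205.0, 2527.0, 1612.0]})
df_info = pd.DataFrame({'ID': ['I', 'II'],
                        'age': ['adult', 'adult']})

# Name the key column explicitly; how='left' keeps every strike row
df_merged = df_strikes.merge(df_info, on='ID', how='left')

print(df_merged)
```

By default, `merge()` performs an inner join, which drops rows whose key has no match in the other `DataFrame`. With `how='left'`, unmatched strikes would be kept and filled with `NaN` instead.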


## Plotting how impact forces correlate with other metrics
Let's say that we want to know how the impact forces correlate with other measures about the impact. Let's make a scatter plot of adhesion forces vs. impact forces.

In [26]:
# Set up the plot
p = bokeh.plotting.figure(plot_height=300,
                          plot_width=500,
                          x_axis_label='impact force (mN)',
                          y_axis_label='adhesive force (mN)')

# Add a scatter plot
p.circle(df['impact force (mN)'], df['adhesive force (mN)'])

bokeh.io.show(p)

We can see some correlation between the adhesive forces and the impact forces. The stronger the impact force, the stronger the adhesive force, with some exceptions.
<br>
## Making subplots
Now let's learn how to make subplots in Bokeh. Let's say we want to plot the adhesive force, total contact area, impact time, and contact pressure against impact force. To lay these out as subplots, we make the respective `bokeh.plotting.figure.Figure` objects one by one and put them in a list. We then use the utilities from `bokeh.layouts` to arrange them. Let's start by making the list of figure objects. It will help to write a quick function to generate a plot.

In [27]:
def scatter(df, x, y, p=None, color='#30a2da',
            plot_height=300, plot_width=500):
    """Populate a figure with a scatter plot."""
    # Create a new figure only if one was not passed in
    if p is None:
        p = bokeh.plotting.figure(plot_height=plot_height,
                                  plot_width=plot_width,
                                  x_axis_label=x,
                                  y_axis_label=y)
    
    p.circle(df[x], df[y], color=color)
    
    return p

Now we can build a list of plots.

In [28]:
# To be plotted
cols = ['impact time (ms)',
        'adhesive force (mN)',
        'total contact area (mm2)',
        'contact pressure (Pa)']

plots = []
for col in cols:
    plots.append(scatter(df,
                         'impact force (mN)',
                         col,
                         plot_height=200,
                         plot_width=400))
    
# Line up the plots in a column
bokeh.io.show(bokeh.layouts.column(plots))

There are a few things to clean up: <br>
* The plots share the same x-axis, so we want to link the axes. This is accomplished by setting the `x_range` attribute of the respective `bokeh.plotting.figure.Figure` objects.
* All the x-axes are the same, so we do not need to label every one of them.

In [29]:
for p in plots[:-1]:
    # Set each plot's x_range to be that of the last plot
    p.x_range = plots[-1].x_range
    
    # Only have x_axis label on bottom plot
    p.xaxis.axis_label = None
    
# Use gridplot() to show a single toolbar for the whole set of subplots
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=1))

To arrange the plots in a 2×2 grid, we use the `ncols=2` kwarg. However, we then have to supply an x-axis label for the plot in the lower left corner, since we removed it above.

In [30]:
plots[2].xaxis.axis_label = 'impact force (mN)'
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))

To leave blank spaces in the layout, we can pass a two-dimensional list of plots to `bokeh.layouts.gridplot()`, using `None` wherever no plot should appear. Let's leave out the contact pressure plot.

In [31]:
# Make 2D list of plots (put None for no plot)
plots = [[plots[0], plots[1]],
         [plots[2], None]]

# Show using gridplot without an ncols kwarg
bokeh.io.show(bokeh.layouts.gridplot(plots))

## Plotting the distribution of impact forces
To simply plot the distribution of impact forces, we can do a couple of things. Probably the most familiar thing is to plot them as a histogram. To do that with Bokeh, we can first compute the edges and heights of the bars of the histogram using Numpy, and then add them to the plot using the `quad()` method of Bokeh figures, which plots the filled rectangles. 

In [32]:
# Compute the histogram
heights, edges = np.histogram(df['impact force (mN)'])

# Set up the plot
p = bokeh.plotting.figure(plot_height=300,
                          plot_width=500,
                          x_axis_label='impact force (mN)',
                          y_axis_label='count')

# Generate the histogram
p.quad(top=heights, bottom=0, left=edges[:-1], right=edges[1:])

bokeh.io.show(p)

Looking at the histogram, there might be some bimodality. However, there is a better way to visualize the data: the ECDF.

## Empirical cumulative distribution functions (ECDFs)
The primary reason a histogram might not be the best way to display a distribution of measurements is **binning bias**. To create a histogram, we necessarily have to place the measurements in bins rather than consider their exact values. We do not plot all of the data, and the choice of bins can change what we infer from the plot.
<br>
<br>
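
Binning bias can be illustrated with a small sketch that histograms the same made-up measurements with two different bin counts, using the `bins` keyword argument of `np.histogram()`. The data and variable names are hypothetical.

```python
import numpy as np

# Made-up measurements for illustration
data = np.array([1.0, 1.1, 1.2, 2.4, 2.6, 3.8, 3.9, 4.0])

# Same data, two different choices of bins
counts_3, edges_3 = np.histogram(data, bins=3)
counts_2, edges_2 = np.histogram(data, bins=2)

print(counts_3)
print(counts_2)
```

With three bins the counts dip in the middle, hinting at two clusters, while with two bins they are flat. The data are identical in both cases; only the binning changed.
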
Instead, we should look at the **empirical cumulative distribution function** (or ECDF). The ECDF evaluated at $x$ for a set of measurements is defined as 
\begin{align}
\mathrm{ECDF}(x) = \text{fraction of measurements} \leq x.
\end{align}
Let's write a function that returns the $x$ and $y$ values of the ECDF for a set of measurements.

In [33]:
def ecdf(data):
    """Return the ECDF of the data."""
    x = np.sort(data)
    y = np.arange(1, len(data)+1) / len(data)
    
    return x, y
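
As a quick worked example, applying this function to four made-up measurements shows what it returns. The function is repeated here so that the snippet runs on its own.

```python
import numpy as np

def ecdf(data):
    """Return the ECDF of the data (same definition as above)."""
    x = np.sort(data)
    y = np.arange(1, len(data) + 1) / len(data)
    return x, y

# Four made-up measurements with no repeated values
x, y = ecdf(np.array([3.0, 1.0, 4.0, 2.0]))

print(x)  # sorted measurements: 1, 2, 3, 4
print(y)  # fraction of measurements <= each x: 0.25, 0.5, 0.75, 1
```

Each point $(x_i, y_i)$ says that a fraction $y_i$ of the measurements is less than or equal to $x_i$, and every measurement appears in the plot.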

In [34]:
# Get the values of the impact force for ECDF calculation
# For adult frogs
inds_ad = df['age'] == 'adult'
vals_ad = df.loc[inds_ad,'impact force (mN)']

# For juvenile frogs
inds_juv = df['age'] == 'juvenile'
vals_juv = df.loc[inds_juv,'impact force (mN)']

# Calculate the ECDFs for both adult and juvenile frogs
xad, yad = ecdf(vals_ad)
xjuv, yjuv = ecdf(vals_juv)

# Plot the ECDFs for the adult and juvenile frogs
# Set up the plot
p = bokeh.plotting.figure(plot_height=300,
                          plot_width=600,
                          x_axis_label='impact force (mN)',
                          y_axis_label='ECDF',
                          toolbar_location='above')

# Add the ECDFs as scatter plots

p1 = p.circle(xad, yad, color='blue')
p2 = p.circle(xjuv, yjuv, color='red')

legend = Legend(items=[
    ('adult', [p1]),
    ('juvenile', [p2])
    ])
# Add legend
p.add_layout(legend)

p.xaxis.axis_label = 'impact force (mN)'
p.yaxis.axis_label = 'ECDF'

bokeh.io.show(p)

From the ECDF plot above, it is apparent that the impact forces of the adult frogs differ from those of the juvenile frogs, which explains the slight bimodality of the histogram.

## Conclusions: what we learned
* How to load data from CSV files into Pandas `DataFrame`s.
* `DataFrame`s are useful objects for looking at data.
