## Groupby: Understanding grouping in pandas dataframes
Let's say we need some summary statistics for each type of iris in our iris dataset. We can use pandas to group our data by flower class, so that we can then apply calculations and functions to each iris class as a set. 

This would allow us to ask questions like: 
- What is the average petal size for each iris class? 
- What is the largest petal width for Iris setosa? 
- What is the correlation between length and width for each class? 
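
As a rough preview, each of these questions maps onto a short groupby expression. The sketch below uses a tiny hand-made `sample` DataFrame (hypothetical values, not the real iris data) so that it runs on its own. The variable names are illustrative only, and "petal size" is taken to mean petal length here.

```python
import pandas as pd

# Tiny hand-made sample (NOT the real iris data), just to show the calls
sample = pd.DataFrame({
    "PetalLength": [1.4, 1.3, 1.5, 4.7, 4.5, 4.9],
    "PetalWidth":  [0.2, 0.2, 0.3, 1.4, 1.5, 1.6],
    "Class": ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 3,
})

by_class = sample.groupby("Class")

# Average petal length for each class
mean_length = by_class["PetalLength"].mean()

# Largest petal width for a single class
max_width_setosa = by_class["PetalWidth"].max()["Iris-setosa"]

# Correlation between petal length and petal width within each class
corr_by_class = by_class[["PetalLength", "PetalWidth"]].apply(
    lambda g: g["PetalLength"].corr(g["PetalWidth"])
)

print(mean_length)
print(max_width_setosa)
print(corr_by_class)
```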

### Summary stats
Here's a quick summary of some of the most common group stats available to us: 

<img src="pandas_stats.png" align="left">

In [18]:
import pandas as pd

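# URL of the raw iris data file
url = "http://mlr.cs.umass.edu/ml/machine-learning-databases/iris/iris.data"

# Read the file into a DataFrame. The file has no header row, so we pass
# header=None and supply the column names ourselves (header=0 would discard
# the first observation).
iris = pd.read_csv(url,
                   header=None,
                   names=['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Class'])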

In [12]:
iris.head()

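|  | SepalLength | SepalWidth | PetalLength | PetalWidth | Class |
| --- | --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |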


### Grouping Data
Let's say we want to find out the average petal width for Iris-setosa. 

First we will need to group our data together by the class column.

In [2]:
iris_classes = iris.groupby("Class")

This gives us a groupby object: 

In [3]:
iris_classes

<pandas.core.groupby.DataFrameGroupBy object at 0x11237e908>
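
The groupby object does not display its contents, but it can be inspected directly. The following is a minimal sketch on a small hand-made `sample` DataFrame (hypothetical values, not the real iris data). It uses the `ngroups` and `groups` attributes and the `get_group()` method.

```python
import pandas as pd

# Small hand-made sample (NOT the real iris data)
sample = pd.DataFrame({
    "PetalWidth": [0.2, 0.4, 1.3, 1.5, 2.1],
    "Class": ["Iris-setosa", "Iris-setosa",
              "Iris-versicolor", "Iris-versicolor",
              "Iris-virginica"],
})
grouped = sample.groupby("Class")

# Number of distinct groups
n_groups = grouped.ngroups

# Group labels (groups maps each label to the row index labels in that group)
group_names = sorted(grouped.groups)

# All rows belonging to one group, returned as a regular DataFrame
setosa_rows = grouped.get_group("Iris-setosa")

print(n_groups)
print(group_names)
print(setosa_rows)
```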

## Viewing Groupby Information

Iterating over a groupby object yields (group name, group DataFrame) pairs, so converting it to a list shows the rows that belong to each class: 

In [21]:
list(iris_classes)

[('Iris-setosa',
      SepalLength  SepalWidth  PetalLength  PetalWidth        Class
  0           5.1         3.5          1.4         0.2  Iris-setosa
  1           4.9         3.0          1.4         0.2  Iris-setosa
  2           4.7         3.2          1.3         0.2  Iris-setosa
  3           4.6         3.1          1.5         0.2  Iris-setosa
  4           5.0         3.6          1.4         0.2  Iris-setosa
  5           5.4         3.9          1.7         0.4  Iris-setosa
  6           4.6         3.4          1.4         0.3  Iris-setosa
  7           5.0         3.4          1.5         0.2  Iris-setosa
  8           4.4         2.9          1.4         0.2  Iris-setosa
  9           4.9         3.1          1.5         0.1  Iris-setosa
  10          5.4         3.7          1.5         0.2  Iris-setosa
  11          4.8         3.4          1.6         0.2  Iris-setosa
  12          4.8         3.0          1.4         0.1  Iris-setosa
  13          4.3         3.0  
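
The same pairs can be consumed in an ordinary `for` loop, which is often easier to read than a full list dump. This is a small sketch on a hand-made `sample` DataFrame (hypothetical values), and `rows_per_class` is just an illustrative name.

```python
import pandas as pd

# Small hand-made sample (NOT the real iris data)
sample = pd.DataFrame({
    "PetalLength": [1.4, 1.3, 4.7, 4.5, 6.0],
    "Class": ["Iris-setosa", "Iris-setosa",
              "Iris-versicolor", "Iris-versicolor",
              "Iris-virginica"],
})

rows_per_class = {}
for name, group in sample.groupby("Class"):
    # name is the class label; group is a DataFrame with only that class's rows
    rows_per_class[name] = len(group)
    print(name, len(group), group["PetalLength"].max())
```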

### View Summary Stats

In [23]:
iris_classes.describe()

| Class | PetalLength count | PetalLength mean | PetalLength std | PetalLength min | PetalLength 25% | PetalLength 50% | PetalLength 75% | PetalLength max | PetalWidth count | PetalWidth mean | ... | SepalLength 75% | SepalLength max | SepalWidth count | SepalWidth mean | SepalWidth std | SepalWidth min | SepalWidth 25% | SepalWidth 50% | SepalWidth 75% | SepalWidth max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Iris-setosa | 50.0 | 1.464 | 0.173511 | 1.0 | 1.4 | 1.5 | 1.575 | 1.9 | 50.0 | 0.244 | ... | 5.2 | 5.8 | 50.0 | 3.418 | 0.381024 | 2.3 | 3.125 | 3.4 | 3.675 | 4.4 |
| Iris-versicolor | 50.0 | 4.26 | 0.469911 | 3.0 | 4.0 | 4.35 | 4.6 | 5.1 | 50.0 | 1.326 | ... | 6.3 | 7.0 | 50.0 | 2.77 | 0.313798 | 2.0 | 2.525 | 2.8 | 3.0 | 3.4 |
| Iris-virginica | 50.0 | 5.552 | 0.551895 | 4.5 | 5.1 | 5.55 | 5.875 | 6.9 | 50.0 | 2.026 | ... | 6.9 | 7.9 | 50.0 | 2.974 | 0.322497 | 2.2 | 2.8 | 3.0 | 3.175 | 3.8 |


### Calculate summary stats on each group
Now that we have a groupby object, we can calculate statistics on each flower class. 

In [17]:
# Apply mean function to PetalWidth column
iris_classes['PetalWidth'].mean()

Class
Iris-setosa        0.244
Iris-versicolor    1.326
Iris-virginica     2.026
Name: PetalWidth, dtype: float64

How about finding the longest petal for each class? 

In [18]:
iris_classes["PetalLength"].max()

Class
Iris-setosa        1.9
Iris-versicolor    5.1
Iris-virginica     6.9
Name: PetalLength, dtype: float64
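
Several statistics can also be computed in one call with `agg()`, rather than one method call per statistic. The sketch below uses a small hand-made `sample` DataFrame (hypothetical values). The output column names `mean_width` and `max_length` are illustrative choices, not anything defined elsewhere in this notebook.

```python
import pandas as pd

# Small hand-made sample (NOT the real iris data)
sample = pd.DataFrame({
    "PetalLength": [1.4, 1.3, 4.7, 4.5],
    "PetalWidth":  [0.2, 0.4, 1.3, 1.5],
    "Class": ["Iris-setosa", "Iris-setosa",
              "Iris-versicolor", "Iris-versicolor"],
})
by_class = sample.groupby("Class")

# Several statistics for one column: one row per class, one column per statistic
width_summary = by_class["PetalWidth"].agg(["min", "max", "mean"])

# Named aggregation: choose the output column name, source column, and function
named_summary = by_class.agg(
    mean_width=("PetalWidth", "mean"),
    max_length=("PetalLength", "max"),
)

print(width_summary)
print(named_summary)
```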

How many rows are in each class? 

In [25]:
iris_classes.size()

Class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
dtype: int64

How many non-null entries are in each column for every group? 

In [19]:
iris_classes.count()

| Class | SepalLength | SepalWidth | PetalLength | PetalWidth |
| --- | --- | --- | --- | --- |
| Iris-setosa | 50 | 50 | 50 | 50 |
| Iris-versicolor | 50 | 50 | 50 | 50 |
| Iris-virginica | 50 | 50 | 50 | 50 |
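
On the iris data `size()` and `count()` agree because no values are missing. The difference only shows up when a column contains NaN: `size()` counts rows, while `count()` counts non-null values per column. Here is a minimal sketch with a hand-made `sample` DataFrame in which one value has been deliberately set to NaN (hypothetical data, not the real iris file).

```python
import numpy as np
import pandas as pd

# Hand-made sample with one deliberately missing value (NOT the real iris data)
sample = pd.DataFrame({
    "PetalWidth": [0.2, np.nan, 1.3, 1.5],
    "Class": ["Iris-setosa", "Iris-setosa",
              "Iris-versicolor", "Iris-versicolor"],
})
by_class = sample.groupby("Class")

# size() counts rows per group, including rows with missing values
sizes = by_class.size()

# count() counts non-null values per column within each group
counts = by_class.count()

print(sizes)
print(counts)
```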


## Applying Functions to Groupby Groups
We can apply a function to each group using the groupby object's apply() method: 

In [26]:
def get_stats(group):
    return {'min': group.min(), 
            'max': group.max(), 
            'count': group.count(), 
            'mean': group.mean()}

In [27]:
iris_classes.apply(get_stats)

Class
Iris-setosa        {'min': [1.0, 0.1, 4.3, 2.3], 'max': [1.9, 0.6...
Iris-versicolor    {'min': [3.0, 1.0, 4.9, 2.0], 'max': [5.1, 1.8...
Iris-virginica     {'min': [4.5, 1.4, 4.9, 2.2], 'max': [6.9, 2.5...
dtype: object
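
Returning a dict leaves one opaque object per class, which is awkward to work with afterwards. One alternative, sketched here under the assumption that a single column's statistics are wanted, is to return a `pd.Series` from the applied function so that pandas assembles the results into a table with one row per class. The `sample` DataFrame and the `width_stats` helper are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Small hand-made sample (NOT the real iris data)
sample = pd.DataFrame({
    "PetalLength": [1.4, 1.3, 4.7, 4.5],
    "PetalWidth":  [0.2, 0.4, 1.3, 1.5],
    "Class": ["Iris-setosa", "Iris-setosa",
              "Iris-versicolor", "Iris-versicolor"],
})

def width_stats(group):
    """Hypothetical helper: summary statistics of PetalWidth for one group."""
    width = group["PetalWidth"]
    return pd.Series({
        "min": width.min(),
        "max": width.max(),
        "count": width.count(),
        "mean": width.mean(),
    })

# Select the numeric columns first so the grouping column is not passed to the function
stats = sample.groupby("Class")[["PetalLength", "PetalWidth"]].apply(width_stats)

print(stats)
```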


To keep only the numeric measurements, we can drop the Class column in place: 

In [16]:
iris.drop("Class", axis=1, inplace=True)

In [17]:
iris.head()

|  | SepalLength | SepalWidth | PetalLength | PetalWidth |
| --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


