In [1]:
import pandas as pd

## Splitting data
Like a SQL GROUP BY statement, a Pandas groupby operation is used for things like computing aggregate statistics within groups, but the Pandas implementation goes considerably beyond what is easily possible in SQL. It offers much greater flexibility in both the shape of the output and the types of computation that can be done.

To do group-based computation in Pandas, use the .groupby() method of the DataFrame. In its simplest form, this method takes a column name or a list of column names whose unique values will form the groups (similar to SQL). It returns a DataFrameGroupBy object, which gives us an interface for applying functions to the subsets of our data created during splitting.

In [2]:
grocery = pd.DataFrame({'category':['produce', 'produce', 'meat', 'meat', 'meat', 'cheese', 'cheese'],
                        'item': ['celery', 'apple', 'ham', 'turkey', 'lamb', 'cheddar', 'brie'],
                        'price':[.99, .49, 1.89, 4.34, 9.50, 6.25, 8.0]})


In [3]:
grocery

,category,item,price
0,produce,celery,0.99
1,produce,apple,0.49
2,meat,ham,1.89
3,meat,turkey,4.34
4,meat,lamb,9.5
5,cheese,cheddar,6.25
6,cheese,brie,8.0


In [4]:
grocery.describe()

,price
count,7.0
mean,4.494286
std,3.548478
min,0.49
25%,1.44
50%,4.34
75%,7.125
max,9.5


In [5]:
grocery.groupby('category')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7ff914db4250>
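Printing the DataFrameGroupBy object only shows its memory address. As a minimal sketch (the grocery data is rebuilt here so the block runs on its own, and `group_sizes` and `meat` are names introduced just for illustration), iterating over the object makes the split step visible, since each iteration yields a group key together with that group's sub-DataFrame:

```python
import pandas as pd

grocery = pd.DataFrame({
    'category': ['produce', 'produce', 'meat', 'meat', 'meat', 'cheese', 'cheese'],
    'item': ['celery', 'apple', 'ham', 'turkey', 'lamb', 'cheddar', 'brie'],
    'price': [.99, .49, 1.89, 4.34, 9.50, 6.25, 8.0],
})

grouped = grocery.groupby('category')

# Iterating yields (group key, sub-DataFrame) pairs, sorted by key by default.
group_sizes = {}
for name, group in grouped:
    group_sizes[name] = len(group)
    print(name, list(group['item']))

# get_group pulls out the rows of a single group by its key.
meat = grouped.get_group('meat')
```

Nothing is computed on the groups at this point; the object only records which rows belong to which key.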

## Applying calculations to groups
There are several methods on the DataFrameGroupBy object. These methods are all similar in that they have access to some subset of the data determined by the split, but they differ in the flexibility of return values they allow (for reasons having to do with efficiency).

* aggregate: This method is for operations that return a single value for each group, like mean, or max. aggregate iterates over groups and performs its operation on the group as a whole.


* filter: This method is for doing group (not individual row) filtration. For example, if you want to eliminate any groups that have too few observations, you would first .groupby your DataFrame into groups, then check each group, removing groups that fall below a certain threshold. filter will then return every item in the groups that passed.


* transform: This method is for operations that return values with the same indices as the original data. For example, if you want to center a series by subtracting its mean, you compute the mean, then return the original series minus that value. transform passes each column of each group to the function and expects back a result of the same length as the group (or a single value, which is broadcast across the group).


* apply: This method can stand in for any of the above operations, but is less efficient. It places no constraints on the type of data returned.


All of these methods take a function that is then applied to each subset of data. Pandas takes care of the combining part of "split-apply-combine" for you based on which method is called on the DataFrameGroupBy object.
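The cells below cover aggregate, transform, and filter, so here is a small sketch of apply standing in for an aggregation. The price range per category is an example chosen for illustration, and the names `price_range` and `price_range_agg` are assumptions of this sketch rather than part of the original notebook:

```python
import pandas as pd

grocery = pd.DataFrame({
    'category': ['produce', 'produce', 'meat', 'meat', 'meat', 'cheese', 'cheese'],
    'item': ['celery', 'apple', 'ham', 'turkey', 'lamb', 'cheddar', 'brie'],
    'price': [.99, .49, 1.89, 4.34, 9.50, 6.25, 8.0],
})

# apply receives each group's price Series and may return anything;
# here it returns one number per group, so the result looks like an aggregation.
price_range = grocery.groupby('category')['price'].apply(lambda s: s.max() - s.min())

# The same computation with aggregate, the more specific tool for this job.
price_range_agg = grocery.groupby('category')['price'].aggregate(lambda s: s.max() - s.min())
```

Both calls should give one value per category; preferring aggregate here signals the intent (one value per group) more clearly.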

More details and examples of each of these methods are available in the Pandas groupby docs.

In [8]:
import numpy as np

In [9]:
grouped = grocery.groupby('category')
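# Select the numeric column explicitly; recent pandas versions raise an error
# when asked for the mean of the string column 'item'
grouped[['price']].aggregate(['count', 'mean'])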

category,price count,price mean
cheese,2,7.125
meat,3,5.243333
produce,2,0.74


In [10]:
# Restrict to the numeric column; the string column 'item' has no mean
grouped[['price']].transform(lambda x: x - x.mean())

,price
0,0.25
1,-0.25
2,-3.353333
3,-0.903333
4,4.256667
5,-0.875
6,0.875


In [11]:
grouped.filter(lambda x: len(x)>2)

,category,item,price
2,meat,ham,1.89
3,meat,turkey,4.34
4,meat,lamb,9.5


In the examples above, the index of the result depends on the method applied: aggregate returns one row per group, indexed by the group keys, while transform and filter keep the original row index. This is logical behavior, but it's sometimes helpful to be able to treat the group keys as regular columns. This comes up most often when grouping on more than one column, where the result has a hierarchical index, or MultiIndex. When dealing with indexes gets complicated, a good method to keep in mind is .reset_index(), which replaces the DataFrame's index with the default range of integers and moves whatever had been in the index into new column(s).
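As a rough sketch of that pattern, the block below groups on two columns and then flattens the result. The `sales` frame and its `store` column are hypothetical, made up only to provide a second grouping key:

```python
import pandas as pd

# 'store' is a hypothetical column added only to have a second grouping key.
sales = pd.DataFrame({
    'store': ['north', 'north', 'north', 'south', 'south'],
    'category': ['meat', 'meat', 'cheese', 'meat', 'cheese'],
    'price': [1.89, 4.34, 6.25, 9.50, 8.0],
})

# Grouping on two columns gives a result indexed by a two-level MultiIndex.
by_store = sales.groupby(['store', 'category'])['price'].mean()

# reset_index moves both index levels back into ordinary columns
# and restores the default integer index.
flat = by_store.reset_index()
```

After the reset, `store` and `category` can be filtered, merged, or plotted like any other column.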

### Perform the following operations using split-apply-combine.

* Remove all items in categories where the mean price in that category is less than $3.00.

* Find the maximum values in each category for all features. (What does Pandas take to be the maximum value of the 'item' column?)

* If the maximum price in a category is more than $3.00, reduce all prices in that category by 10%. Return a Series of the new price column.

In [12]:
import pandas as pd
import numpy as np

grocery = pd.DataFrame({'category':['produce', 'produce', 'meat', 'meat', 'meat', 'cheese', 'cheese'],
                        'item':['celery', 'apple', 'ham', 'turkey',  'lamb', 'cheddar', 'brie'],
                        'price':[.99, .49, 1.89, 4.34, 9.50, 6.25, 8.0]})


grouped_grocery = grocery.groupby('category')

# filter needs a single True/False per group, so take the mean of the price column
one_mean = grouped_grocery.filter(lambda x: x['price'].mean() > 3.00)

two_max = grouped_grocery.max()

# transform keeps the original row index, so the result lines up with grocery
three_round = grouped_grocery['price'].transform(lambda x: x - (x*.1) if x.max() > 3.00 else x)

In [17]:
grouped = grocery.groupby('category')

In [19]:
one_mean = grouped.filter(lambda x: x['price'].mean() > 3.0)

In [20]:
one_mean

,category,item,price
2,meat,ham,1.89
3,meat,turkey,4.34
4,meat,lamb,9.5
5,cheese,cheddar,6.25
6,cheese,brie,8.0


In [21]:
two_max = grouped.aggregate('max')

In [22]:
two_max

category,item,price
cheese,cheddar,8.0
meat,turkey,9.5
produce,celery,0.99
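For the 'item' column, the maximum is the string that sorts last alphabetically. Each column's maximum is also computed independently, so 'cheddar' appearing next to 8.0 does not mean cheddar costs 8.0 (that price belongs to brie). If the goal were the complete row of the most expensive item in each category, one possible approach, sketched here with the illustrative names `top_rows` and `priciest`, is to use idxmax:

```python
import pandas as pd

grocery = pd.DataFrame({
    'category': ['produce', 'produce', 'meat', 'meat', 'meat', 'cheese', 'cheese'],
    'item': ['celery', 'apple', 'ham', 'turkey', 'lamb', 'cheddar', 'brie'],
    'price': [.99, .49, 1.89, 4.34, 9.50, 6.25, 8.0],
})

# idxmax returns, for each category, the row label of the highest price.
top_rows = grocery.groupby('category')['price'].idxmax()

# Looking those labels up gives complete rows whose item and price belong together.
priciest = grocery.loc[top_rows]
```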


In [23]:
three_round = grouped['price'].transform(lambda x: 0.9*x if x.max() > 3.00 else x)

In [24]:
three_round

0    0.990
1    0.490
2    1.701
3    3.906
4    8.550
5    5.625
6    7.200
Name: price, dtype: float64