# Data Aggregation and Group Operations

In [1]:
# Categorizing a dataset and applying a function to each group, whether an aggregation or a transformation, is often a critical component of a data analysis workflow.
# After loading, merging, and preparing a dataset, you may need to compute group statistics or possibly pivot tables for reporting or visualization purposes.
# pandas provides a flexible, high-performance groupby facility that lets you slice, dice, and summarize datasets in a natural way.

# GroupBy mechanics

In [6]:
import pandas as pd 
import numpy as np

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

In [7]:
df.head()

  key1 key2     data1     data2
0    a  one -0.249953 -0.323921
1    a  two -1.096372 -0.040417
2    b  one -0.070876  0.416844
3    b  two -0.226155 -0.679555
4    a  one  0.640174  0.385047


In [12]:
# Suppose you want to compute the mean of the data1 column using the labels from key1.
# There are a number of ways to do this. One is to access data1 and call groupby with the column (a Series) at key1:
grouped = df['data1'].groupby(df['key1']) # group data1 by key1
grouped 

<pandas.core.groupby.generic.SeriesGroupBy object at 0x158b40bd0>

In [15]:
# This "grouped" variable is now a GroupBy object. It has not actually computed anything yet, except for some intermediate data about the group key df['key1'].
# The idea is that this object has all of the information needed to then apply some operation to each of the groups. 
# For example, to compute group means we can call the GroupBy's mean method: 
grouped.mean()

key1
a   -0.235384
b   -0.148516
Name: data1, dtype: float64
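
To make the split-then-aggregate idea concrete, here is a minimal sketch that rebuilds the frame with the fixed values shown in the df.head() output above (instead of fresh random numbers) and checks the group means against a hand-rolled boolean-mask computation. The names `group_sizes`, `manual_means`, and `by_column_name` are illustrative, not part of the original notebook. The last line passes the column name to `DataFrame.groupby`, which is one of the other ways to get the same result.

```python
import pandas as pd

# Fixed values copied from the df.head() output, so the result is reproducible.
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [-0.249953, -1.096372, -0.070876, -0.226155, 0.640174],
    'data2': [-0.323921, -0.040417, 0.416844, -0.679555, 0.385047],
})

grouped = df['data1'].groupby(df['key1'])

# The GroupBy object knows which rows belong to which key.
group_sizes = grouped.size()
print(group_sizes)

# Compute each group's mean by hand with a boolean mask on key1.
manual_means = {
    key: df.loc[df['key1'] == key, 'data1'].mean()
    for key in df['key1'].unique()
}
print(manual_means)

# The GroupBy mean method produces the same figures in one call.
group_means = grouped.mean()
print(group_means)

# Grouping the whole frame by column name and then selecting data1
# is another route to the same Series.
by_column_name = df.groupby('key1')['data1'].mean()
print(by_column_name)
```

With this data, the manual and GroupBy results should agree with the values printed in the cell above, up to floating-point rounding.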

In [16]:
# If we instead pass multiple arrays as a list, the result is a Series with a hierarchical index made up of the unique key pairs observed:

means = df['data1'].groupby([df['key1'], df['key2']]).mean() # group data1 by key1 and key2 and compute mean 
means

key1  key2
a     one     0.195111
      two    -1.096372
b     one    -0.070876
      two    -0.226155
Name: data1, dtype: float64
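
The two-key result is an ordinary Series whose index has two levels. As a rough sketch, again using the fixed values from the df.head() output rather than new random data, the snippet below shows how that index can be inspected and how individual group means can be pulled out. The variable names `level_names`, `means_for_a`, and `mean_a_one` are illustrative only.

```python
import pandas as pd

# Fixed values copied from the df.head() output, so the result is reproducible.
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [-0.249953, -1.096372, -0.070876, -0.226155, 0.640174],
    'data2': [-0.323921, -0.040417, 0.416844, -0.679555, 0.385047],
})

means = df['data1'].groupby([df['key1'], df['key2']]).mean()

# The index levels are named after the Series used as group keys.
level_names = list(means.index.names)
print(level_names)

# Selecting on the outer level returns a Series indexed by key2.
means_for_a = means['a']
print(means_for_a)

# A full (key1, key2) tuple selects a single group mean.
mean_a_one = means.loc[('a', 'one')]
print(mean_a_one)
```

Only key pairs that actually occur in the data appear in the index; here all four combinations of key1 and key2 are present.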