Aggregation and Grouping: 
Planets Data: Let's use the Planets dataset, which is available in the Seaborn package. It gives information on
planets that astronomers have discovered around other stars.

In [3]:
import seaborn as sb
planets = sb.load_dataset('planets')

planets.shape

(1035, 6)

In [4]:
planets.head()

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


In [5]:
#Simple Aggregation in Pandas

import numpy as np
import pandas as pd
rng = np.random.RandomState(42)
# Series
ser = pd.Series(rng.rand(5))
ser

0    0.374540
1    0.950714
2    0.731994
3    0.598658
4    0.156019
dtype: float64

In [6]:
ser.sum()

2.811925491708157

In [7]:
ser.mean()

0.5623850983416314

In [8]:
# DataFrame
df = pd.DataFrame({'A': rng.rand(5),
                   'B': rng.rand(5)})
df

Unnamed: 0,A,B
0,0.155995,0.020584
1,0.058084,0.96991
2,0.866176,0.832443
3,0.601115,0.212339
4,0.708073,0.181825


In [9]:
df.mean()

A    0.477888
B    0.443420
dtype: float64

In [11]:
# mean by rows
df.mean(axis=1)

0    0.088290
1    0.513997
2    0.849309
3    0.406727
4    0.444949
dtype: float64

In [12]:
df.mean(axis='columns')   #alternatively

0    0.088290
1    0.513997
2    0.849309
3    0.406727
4    0.444949
dtype: float64

In [13]:
planets.describe()    #computes several common aggregates for each column and returns the result

Unnamed: 0,number,orbital_period,mass,distance,year
count,1035.0,992.0,513.0,808.0,1035.0
mean,1.785507,2002.917596,2.638161,264.069282,2009.070531
std,1.240976,26014.728304,3.818617,733.116493,3.972567
min,1.0,0.090706,0.0036,1.35,1989.0
25%,1.0,5.44254,0.229,32.56,2007.0
50%,1.0,39.9795,1.26,55.25,2010.0
75%,2.0,526.005,3.04,178.5,2012.0
max,7.0,730000.0,25.0,8500.0,2014.0


In [14]:
planets.dropna().describe() # drop the NA's and compute the aggregates

Unnamed: 0,number,orbital_period,mass,distance,year
count,498.0,498.0,498.0,498.0,498.0
mean,1.73494,835.778671,2.50932,52.068213,2007.37751
std,1.17572,1469.128259,3.636274,46.596041,4.167284
min,1.0,1.3283,0.0036,1.35,1989.0
25%,1.0,38.27225,0.2125,24.4975,2005.0
50%,1.0,357.0,1.245,39.94,2009.0
75%,2.0,999.6,2.8675,59.3325,2011.0
max,6.0,17337.5,25.0,354.0,2014.0


In [15]:
# GroupBy: Split, Apply, Combine

import pandas as pd
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data': range(6)}, columns=['key', 'data'])
df

Unnamed: 0,key,data
0,A,0
1,B,1
2,C,2
3,A,3
4,B,4
5,C,5


In [16]:
df_group_by = df.groupby('key')
type(df_group_by)

pandas.core.groupby.generic.DataFrameGroupBy

In [17]:
df_group_by.sum()

Unnamed: 0_level_0,data
key,Unnamed: 1_level_1
A,3
B,5
C,7


In [18]:
df_group_by.max()

Unnamed: 0_level_0,data
key,Unnamed: 1_level_1
A,3
B,4
C,5
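The split, apply, combine steps named above can also be spelled out by hand. The following is only a minimal sketch to show what groupby() does conceptually; the names df_demo, pieces, sums, manual and builtin are introduced here for illustration, and pandas does not implement groupby this way internally.

```python
import pandas as pd

# Same toy data as in the cell above
df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data': range(6)})

# Split: one sub-DataFrame per distinct key
pieces = {key: group for key, group in df_demo.groupby('key')}

# Apply: reduce each piece to a single number
sums = {key: group['data'].sum() for key, group in pieces.items()}

# Combine: gather the per-group results into one Series indexed by key
manual = pd.Series(sums, name='data')
manual.index.name = 'key'

# The built-in aggregate performs all three steps in one call
builtin = df_demo.groupby('key')['data'].sum()
print(manual)
print(builtin)
```

Both printed Series should hold the same per-key sums as the df_group_by.sum() table above.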


The GroupBy object
• Column indexing
The GroupBy object supports column indexing in the same way as the DataFrame, and returns a
modified GroupBy object.
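As a small illustration of column indexing, the sketch below groups a toy table by method and then selects one column. The table toy is a hypothetical stand-in for planets with made-up values (so that the example runs without downloading the dataset); the same pattern should apply to planets.groupby('method')['orbital_period'].

```python
import pandas as pd

# Hypothetical miniature of the planets table; values are made up for illustration
toy = pd.DataFrame({
    'method': ['Radial Velocity', 'Transit', 'Radial Velocity', 'Transit'],
    'orbital_period': [269.3, 3.9, 874.8, 2.6],
    'year': [2006, 2006, 2008, 2007],
})

grouped = toy.groupby('method')

# Indexing with a column name returns a SeriesGroupBy; no computation happens yet
period_by_method = grouped['orbital_period']
print(type(period_by_method).__name__)

# The work is done only when an aggregate is called on the selection
median_period = period_by_method.median()
print(median_period)
```

Selecting the column before aggregating avoids computing the median for columns that are not needed.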


In [19]:
planets

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.300000,7.10,77.40,2006
1,Radial Velocity,1,874.774000,2.21,56.95,2008
2,Radial Velocity,1,763.000000,2.60,19.84,2011
3,Radial Velocity,1,326.030000,19.40,110.62,2007
4,Radial Velocity,1,516.220000,10.50,119.47,2009
...,...,...,...,...,...,...
1030,Transit,1,3.941507,,172.00,2006
1031,Transit,1,2.615864,,148.00,2007
1032,Transit,1,3.191524,,174.00,2007
1033,Transit,1,4.125083,,293.00,2008


Aggregate, filter, transform, apply: 
GroupBy objects provide aggregate(), filter(), transform(), and apply() methods that efficiently implement a variety of useful operations before combining the grouped
data.

In [23]:
import numpy as np
import pandas as pd
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data1': range(6),
'data2': rng.randint(0, 10, 6)},
columns = ['key', 'data1', 'data2'])
df


Unnamed: 0,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


In [24]:
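# Function names passed as strings avoid the FutureWarning that newer pandas
# versions emit for the builtins min/max and np.median
df.groupby('key').aggregate(['min', 'median', 'max'])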


Unnamed: 0_level_0,data1,data1,data1,data2,data2,data2
Unnamed: 0_level_1,min,median,max,min,median,max
key,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
A,0,1.5,3,3,4.0,5
B,1,2.5,4,0,3.5,7
C,2,3.5,5,3,6.0,9


In [25]:
df.groupby('key').aggregate({'data1': 'min',
'data2': 'max'})

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,0,5
B,1,7
C,2,9


Filtering lets us drop data based on group properties.
For example, to keep only the groups whose standard deviation of data2 exceeds some
critical value (here 4), we can write the following code:

In [27]:
def filter_func(x):
  return x['data2'].std() > 4
print(df)
print('-'*20)
print(df.groupby('key').std())
print('-'*20)
print(df.groupby('key').filter(filter_func))

  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9
--------------------
       data1     data2
key                   
A    2.12132  1.414214
B    2.12132  4.949747
C    2.12132  4.242641
--------------------
  key  data1  data2
1   B      1      0
2   C      2      3
4   B      4      7
5   C      5      9


Transformation: 
While aggregation returns a reduced version of the data, transformation returns a transformed version of the full data to recombine.
For such a transformation, the output has the same shape as the input.
A common example is centering the data by subtracting the group-wise mean.

In [28]:
df.groupby('key').transform(lambda x: x - x.mean())

Unnamed: 0,data1,data2
0,-1.5,1.0
1,-1.5,-3.5
2,-1.5,-3.0
3,1.5,-1.0
4,1.5,3.5
5,1.5,3.0
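Because a transform keeps the original row index, its result can be attached straight back to the source table. The sketch below assumes the same toy values as the df shown above, rebuilt here under the hypothetical name df_demo so that the snippet runs on its own; the column name data2_centered is also introduced only for illustration.

```python
import pandas as pd

# Same values as the df used above, rebuilt so the snippet is self-contained
df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data1': range(6),
                        'data2': [5, 0, 3, 3, 7, 9]})

# transform('mean') broadcasts each group's mean back to every row of that group
group_mean = df_demo.groupby('key')['data2'].transform('mean')

# Same length and index as the input, so it lines up row by row
df_demo['data2_centered'] = df_demo['data2'] - group_mean
print(df_demo)
```

The new column should match the data2 column of the transform output above.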


In [29]:
"""The apply() method lets us apply an arbitrary function to the group results. The function takes
a DataFrame and returns either a Pandas object (e.g. a DataFrame or Series) or a scalar; the
combine operation is then tailored to the type of output returned.
Below is an example of apply() that normalizes data1 by the sum of data2 within each group."""

In [31]:
def norm_by_data2(x):
  # x is a DataFrame of group values
  x['data1'] /= x['data2'].sum()
  return x

print(df)
print('-'*20)
print(df.groupby('key').apply(norm_by_data2))

  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9
--------------------
      key     data1  data2
key                       
A   0   A  0.000000      5
    3   A  0.375000      3
B   1   B  0.142857      0
    4   B  0.571429      7
C   2   C  0.166667      3
    5   C  0.416667      9
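The apply() description above notes that the combine step depends on what the function returns. The example above returns a DataFrame per group; the sketch below returns a scalar instead, in which case the results are expected to come back as a Series with one entry per key. The function data1_share and the name df_demo are hypothetical and exist only for this illustration; the data columns are selected explicitly before apply() so that the grouping column is not passed to the function.

```python
import pandas as pd

# Same values as the df used above, rebuilt so the snippet is self-contained
df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data1': range(6),
                        'data2': [5, 0, 3, 3, 7, 9]})

def data1_share(group):
    # group is a DataFrame holding one key's rows; the return value is a scalar
    return group['data1'].sum() / group['data2'].sum()

# One scalar per group is combined into a Series indexed by key
ratio = df_demo.groupby('key')[['data1', 'data2']].apply(data1_share)
print(ratio)
```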



In [32]:
planets

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.300000,7.10,77.40,2006
1,Radial Velocity,1,874.774000,2.21,56.95,2008
2,Radial Velocity,1,763.000000,2.60,19.84,2011
3,Radial Velocity,1,326.030000,19.40,110.62,2007
4,Radial Velocity,1,516.220000,10.50,119.47,2009
...,...,...,...,...,...,...
1030,Transit,1,3.941507,,172.00,2006
1031,Transit,1,2.615864,,148.00,2007
1032,Transit,1,3.191524,,174.00,2007
1033,Transit,1,4.125083,,293.00,2008


In [34]:
decade = 10 * (planets['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'
decade

0       2000s
1       2000s
2       2010s
3       2000s
4       2000s
        ...  
1030    2000s
1031    2000s
1032    2000s
1033    2000s
1034    2000s
Name: decade, Length: 1035, dtype: object