### Aggregating and Grouping

Aggregation is an essential part of analyzing large datasets. An efficient way to summarize data is to compute aggregates such as sum(), mean(), median(), max(), and min(), which reduce many values to a single number.

We will explore aggregations in Pandas, from simple operations to the more complex groupby concept.

In [4]:
# we will use the planets dataset from seaborn, which contains data on planets discovered around other stars

import seaborn as sns

planets = sns.load_dataset('planets')
planets.shape

(1035, 6)

In [6]:
# view the first 5 rows of the dataset
planets.head()

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


In [27]:
# simple aggregation in pandas

import pandas as pd
import numpy as np

ran_num = np.random.RandomState(40)
ser = pd.Series(ran_num.rand(5))
ser

0    0.407687
1    0.055366
2    0.788535
3    0.287305
4    0.450351
dtype: float64

In [34]:
ser.sum(), ser.mean(), ser.max(), ser.min()

(1.989243717435516,
 0.3978487434871032,
 0.7885348774867527,
 0.05536604011186008)

In [35]:
# for a dataframe

df = pd.DataFrame({'A': ran_num.rand(5),
                  'B': ran_num.rand(5)})

df

Unnamed: 0,A,B
0,0.14499,0.036655
1,0.548892,0.568545
2,0.187481,0.922765
3,0.398981,0.179055
4,0.240038,0.703078


In [39]:
# for a dataframe, aggregations are computed within each column by default

df.mean(), df.max(), df.min(), df.sum()

(A    0.304077
 B    0.482019
 dtype: float64,
 A    0.548892
 B    0.922765
 dtype: float64,
 A    0.144990
 B    0.036655
 dtype: float64,
 A    1.520383
 B    2.410097
 dtype: float64)

In [40]:
# use the axis argument to aggregate within each row instead

df.mean(axis='columns')

0    0.090823
1    0.558718
2    0.555123
3    0.289018
4    0.471558
dtype: float64

In [43]:
# these common aggregations work the same way on a pandas Series and DataFrame.
# Another useful one is the describe method,
# which summarizes the data in each column

# let us first check for missing data
planets.isnull().any()


method            False
number            False
orbital_period     True
mass               True
distance           True
year              False
dtype: bool

In [48]:
# some columns have missing data; show the year and method of rows with no orbital_period
planets.loc[planets['orbital_period'].isnull(), ['year','method']].head()

Unnamed: 0,year,method
29,2005,Imaging
30,2007,Imaging
31,2004,Imaging
33,2008,Imaging
34,2013,Imaging


In [75]:
# let us drop our missing data and then summarize our data
# it is a useful way of understanding the whole properties of a dataset


planets.dropna().describe()

Unnamed: 0,number,orbital_period,mass,distance,year
count,498.0,498.0,498.0,498.0,498.0
mean,1.73494,835.778671,2.50932,52.068213,2007.37751
std,1.17572,1469.128259,3.636274,46.596041,4.167284
min,1.0,1.3283,0.0036,1.35,1989.0
25%,1.0,38.27225,0.2125,24.4975,2005.0
50%,1.0,357.0,1.245,39.94,2009.0
75%,2.0,999.6,2.8675,59.3325,2011.0
max,6.0,17337.5,25.0,354.0,2014.0


In [78]:
# let us use other built-in pandas aggregation methods
# count() returns the number of non-null values per column, which shows how much data is missing

planets.count()

method            1035
number            1035
orbital_period     992
mass               513
distance           808
year              1035
dtype: int64

In [86]:
planets.mean(numeric_only=True), planets.median(numeric_only=True)

(number               1.785507
 orbital_period    2002.917596
 mass                 2.638161
 distance           264.069282
 year              2009.070531
 dtype: float64,
 number               1.0000
 orbital_period      39.9795
 mass                 1.2600
 distance            55.2500
 year              2010.0000
 dtype: float64)

Other built-in aggregation methods include min(), max(), std(), var(), prod(), sum(), and mad() (mean absolute deviation; deprecated in pandas 1.5 and removed in pandas 2.0).
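The planets averages above were computed even though several columns contain missing values. A minimal sketch of why, using a small hypothetical Series (`s` is not part of the planets data): the built-in aggregations skip NaN by default, and `skipna=False` switches that off.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value
s = pd.Series([1.0, 2.0, np.nan, 5.0])

print(s.count())             # number of non-null values
print(s.sum())               # NaN is skipped by default
print(s.mean())              # mean over the non-null values only
print(s.mean(skipna=False))  # NaN propagates when skipna=False
```

This is why count() is a useful companion to mean() or median(): it shows how many values each aggregate is actually based on.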

### Groupby: Split, Apply and Combine

The more complex form of aggregation is the groupby operation, which lets us compute aggregates conditionally on some label or index.

Groupby can be seen as the combination of three steps: split, apply, and combine. The name comes from the GROUP BY command in SQL.

The split step breaks up and groups a DataFrame depending on the value of the specified key.

The apply step computes some function, usually an aggregate, transformation, or filter, within the individual groups.

The combine step merges the results of these operations into an output array.
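To make the three steps concrete, here is a rough sketch that performs them by hand and compares the result with a single groupby call. The names `df_demo`, `pieces`, `sums`, and `manual` are illustrative only; pandas does not expose the steps this way internally.

```python
import pandas as pd

df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data': range(6)})

# split: one sub-frame per distinct key
pieces = {key: df_demo[df_demo['key'] == key]
          for key in df_demo['key'].unique()}

# apply: reduce each piece to a single number
sums = {key: piece['data'].sum() for key, piece in pieces.items()}

# combine: gather the per-group results into one Series
manual = pd.Series(sums, name='data').sort_index()
manual.index.name = 'key'

# groupby performs all three steps in one call
grouped = df_demo.groupby('key')['data'].sum()

print(manual)
print(grouped)
```

Both printouts should show the same per-key totals; the groupby version avoids building the intermediate pieces explicitly.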

In [91]:
# creating a sample dataframe

df = pd.DataFrame({'key':['A','B','C','A','B','C'],
                 'data':range(6)}, columns=['key','data'])
df

Unnamed: 0,key,data
0,A,0
1,B,1
2,C,2
3,A,3
4,B,4
5,C,5


In [92]:
# let us apply the basic groupby(split, apply and combine) method

df.groupby('key') # grouping by the key column

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000257888668C8>

In [93]:
# notice that no result was computed and a DataFrameGroupBy object was returned
# instead. This means that we must apply a further function to
# specify the apply and combine steps

df.groupby('key').sum() # adding the sum() function

Unnamed: 0_level_0,data
key,Unnamed: 1_level_1
A,3
B,5
C,7


### GroupBy Object

In [94]:
# column indexing
# the groupby supports column indexing

planets.groupby('method') # no computation will be done here except when specified

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000025788906108>

In [95]:
planets.groupby('method')['orbital_period']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000257888633C8>

In [97]:
planets.groupby('method').median()

Unnamed: 0_level_0,number,orbital_period,mass,distance,year
method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Astrometry,1.0,631.18,,17.875,2011.5
Eclipse Timing Variations,2.0,4343.5,5.125,315.36,2010.0
Imaging,1.0,27500.0,,40.395,2009.0
Microlensing,1.0,3300.0,,3840.0,2010.0
Orbital Brightness Modulation,2.0,0.342887,,1180.0,2011.0
Pulsar Timing,3.0,66.5419,,1200.0,1994.0
Pulsation Timing Variations,1.0,1170.0,,,2007.0
Radial Velocity,1.0,360.2,1.26,40.445,2009.0
Transit,1.0,5.714932,1.47,341.0,2012.0
Transit Timing Variations,2.0,57.011,,855.0,2012.5


In [96]:
# specifying computation
# orbital_period is in days

planets.groupby('method')['orbital_period'].median()

method
Astrometry                         631.180000
Eclipse Timing Variations         4343.500000
Imaging                          27500.000000
Microlensing                      3300.000000
Orbital Brightness Modulation        0.342887
Pulsar Timing                       66.541900
Pulsation Timing Variations       1170.000000
Radial Velocity                    360.200000
Transit                              5.714932
Transit Timing Variations           57.011000
Name: orbital_period, dtype: float64

In [104]:
# iteration over groups
# the groupby object supports direct iteration over its groups
# each iteration yields the group key and that group's Series or DataFrame

for (method, group) in planets.groupby('method'):
    print("{0:30s} shape={1}".format(method, group.shape))

Astrometry                     shape=(2, 6)
Eclipse Timing Variations      shape=(9, 6)
Imaging                        shape=(38, 6)
Microlensing                   shape=(23, 6)
Orbital Brightness Modulation  shape=(3, 6)
Pulsar Timing                  shape=(5, 6)
Pulsation Timing Variations    shape=(1, 6)
Radial Velocity                shape=(553, 6)
Transit                        shape=(397, 6)
Transit Timing Variations      shape=(4, 6)
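When only the group sizes or one particular group are needed, the loop can be avoided. The sketch below uses a small hypothetical frame (`df_demo`) rather than the planets data so that it runs without downloading anything; `size()` and `get_group()` are standard GroupBy methods.

```python
import pandas as pd

df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data': range(6)})
gb = df_demo.groupby('key')

# size() counts the rows in each group without an explicit loop
sizes = gb.size()

# get_group() returns the sub-frame for a single key
group_a = gb.get_group('A')

print(sizes)
print(group_a)
```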


In [123]:
# instead of iterating, we can use dispatch methods: any method called on
# the groupby object is applied to each individual group, and the results
# are then combined

planets.groupby('method')['year'].describe()

# this gives a clearer view of the data: the count column shows that
# Radial Velocity accounts for the majority of the recorded planets


Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Astrometry,2.0,2011.5,2.12132,2010.0,2010.75,2011.5,2012.25,2013.0
Eclipse Timing Variations,9.0,2010.0,1.414214,2008.0,2009.0,2010.0,2011.0,2012.0
Imaging,38.0,2009.131579,2.781901,2004.0,2008.0,2009.0,2011.0,2013.0
Microlensing,23.0,2009.782609,2.859697,2004.0,2008.0,2010.0,2012.0,2013.0
Orbital Brightness Modulation,3.0,2011.666667,1.154701,2011.0,2011.0,2011.0,2012.0,2013.0
Pulsar Timing,5.0,1998.4,8.38451,1992.0,1992.0,1994.0,2003.0,2011.0
Pulsation Timing Variations,1.0,2007.0,,2007.0,2007.0,2007.0,2007.0,2007.0
Radial Velocity,553.0,2007.518987,4.249052,1989.0,2005.0,2009.0,2011.0,2014.0
Transit,397.0,2011.236776,2.077867,2002.0,2010.0,2012.0,2013.0,2014.0
Transit Timing Variations,4.0,2012.5,1.290994,2011.0,2011.75,2012.5,2013.25,2014.0


In [214]:
planets.groupby('method')['year'].describe().unstack()

       method                       
count  Astrometry                          2.0
       Eclipse Timing Variations           9.0
       Imaging                            38.0
       Microlensing                       23.0
       Orbital Brightness Modulation       3.0
                                         ...  
max    Pulsar Timing                    2011.0
       Pulsation Timing Variations      2007.0
       Radial Velocity                  2014.0
       Transit                          2014.0
       Transit Timing Variations        2014.0
Length: 80, dtype: float64

### Aggregate, filter, transform and apply

In [215]:
rng = np.random.RandomState(0)
df = pd.DataFrame({'key':['A','B','C','B','C','A'],
                  'data1':range(6),
                  'data2': rng.randint(0,10, 6)},
                 columns=['key','data1','data2'])

df

Unnamed: 0,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,B,3,3
4,C,4,7
5,A,5,9


In [216]:
# the aggregate() method allows for more flexibility: it accepts a list of
# function names and computes all of them at once

df.groupby('key').aggregate(['min', 'median', 'max'])



Unnamed: 0_level_0,data1,data1,data1,data2,data2,data2
Unnamed: 0_level_1,min,median,max,min,median,max
key,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
A,0,2.5,5,5,7.0,9
B,1,2.0,3,0,1.5,3
C,2,3.0,4,3,5.0,7


In [217]:
# let us pass a dictionary mapping

df.groupby('key').aggregate({'data1':'min',
                            'data2':'max'})

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,0,9
B,1,3
C,2,7
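The dictionary form allows one aggregate per column with fixed output names. If several aggregates of the same column or custom output names are wanted, named aggregation is one option. The sketch below rebuilds the same small frame under the hypothetical name `df_demo`; the output column names (`data1_min` and so on) are arbitrary choices.

```python
import pandas as pd

df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'B', 'C', 'A'],
                        'data1': range(6),
                        'data2': [5, 0, 3, 3, 7, 9]})

# each keyword is an output column: (input column, aggregation name)
summary = df_demo.groupby('key').agg(
    data1_min=('data1', 'min'),
    data2_max=('data2', 'max'),
    data2_mean=('data2', 'mean'),
)

print(summary)
```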


In [218]:
# filtering allows us to drop whole groups based on a group property
# let us keep only groups whose data2 standard deviation exceeds 4;
# no group meets that threshold here, so the result is an empty DataFrame

def filter_func(x):
    return x['data2'].std() > 4

print(df)
print(df.groupby('key').std())
print(df.groupby('key').filter(filter_func))


  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   B      3      3
4   C      4      7
5   A      5      9
        data1     data2
key                    
A    3.535534  2.828427
B    1.414214  2.121320
C    1.414214  2.828427
Empty DataFrame
Columns: [key, data1, data2]
Index: []
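Because every group's data2 standard deviation is below 4, the filter above removes everything. A sketch with a lower, arbitrarily chosen threshold of 2.5 shows the more typical case where some groups survive (`df_demo` and `keep_spread_out` are illustrative names):

```python
import pandas as pd

df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'B', 'C', 'A'],
                        'data1': range(6),
                        'data2': [5, 0, 3, 3, 7, 9]})

def keep_spread_out(group, threshold=2.5):
    # True keeps every row of the group, False drops the whole group
    return group['data2'].std() > threshold

kept = df_demo.groupby('key').filter(keep_spread_out)
print(kept)
```

Groups A and C (standard deviation about 2.83) are kept, while group B (about 2.12) is dropped; the surviving rows keep their original index.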


In [219]:
# transformation returns the transformed version of the data 

df.groupby('key').transform(lambda x: x - x.mean())

Unnamed: 0,data1,data2
0,-2.5,-2.0
1,-1.0,-1.5
2,-1.0,-2.0
3,1.0,1.5
4,1.0,2.0
5,2.5,2.0
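The difference between aggregating and transforming is the shape of the result. A brief sketch, again on a rebuilt copy of the frame (`df_demo` is a hypothetical name): an aggregate returns one value per group, while transform broadcasts the group value back to every original row.

```python
import pandas as pd

df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'B', 'C', 'A'],
                        'data1': range(6),
                        'data2': [5, 0, 3, 3, 7, 9]})
gb = df_demo.groupby('key')

reduced = gb['data1'].mean()               # one value per group
broadcast = gb['data1'].transform('mean')  # one value per original row

# subtracting the broadcast mean centers data1 within each group
centered = df_demo['data1'] - broadcast

print(reduced)
print(broadcast)
print(centered)
```

The `centered` values should match the data1 column produced by the lambda-based transform above.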


In [220]:
# the apply() method lets you apply an arbitrary function to each group
# let us normalize the first column by the group sum of the second

def norm_by_data2(x):
    # x is a DataFrame holding the values of one group
    x = x.copy()  # work on a copy so the original rows are not modified
    x['data1'] = x['data1'] / x['data2'].sum()
    return x

print(df)
print(df.groupby('key').apply(norm_by_data2))

  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   B      3      3
4   C      4      7
5   A      5      9
  key     data1  data2
0   A  0.000000      5
1   B  0.333333      0
2   C  0.200000      3
3   B  1.000000      3
4   C  0.400000      7
5   A  0.357143      9

In [221]:
# specifying the split key
# the key can be any list or array whose length matches the DataFrame

L = [0,1,0,1,2,0]
print(df)
print(df.groupby(L)[['data1', 'data2']].sum())

  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   B      3      3
4   C      4      7
5   A      5      9
   data1  data2
0      7     17
1      4      3
2      4      7

In [222]:
df.groupby(df['key']).sum()

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,5,14
B,4,3
C,6,10


In [223]:
# mapping index values to groups through a dictionary

df2 = df.set_index('key')
mapping = {'A':'vowel','B':'consonant','C':'consonant'}
print(df2)
print(df2.groupby(mapping).sum())

     data1  data2
key              
A        0      5
B        1      0
C        2      3
B        3      3
C        4      7
A        5      9
           data1  data2
consonant     10     13
vowel          5     14

In [224]:
# we can also pass any Python function; it is applied to the index values

print(df2.groupby(str.lower).mean())

   data1  data2
a    2.5    7.0
b    2.0    1.5
c    3.0    5.0


In [225]:
df2.groupby([str.lower, mapping]).mean()

Unnamed: 0,Unnamed: 1,data1,data2
a,vowel,2.5,7.0
b,consonant,2.0,1.5
c,consonant,3.0,5.0


In [226]:
# let us work with our planets dataset by finding discovered planets
# by method and by decade

planets.count()

method            1035
number            1035
orbital_period     992
mass               513
distance           808
year              1035
dtype: int64

In [236]:
decade = 10 * (planets['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'
planets.groupby(['method', decade])['number'].sum().unstack().fillna(0)

decade,1980s,1990s,2000s,2010s
method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Astrometry,0.0,0.0,0.0,2.0
Eclipse Timing Variations,0.0,0.0,5.0,10.0
Imaging,0.0,0.0,29.0,21.0
Microlensing,0.0,0.0,12.0,15.0
Orbital Brightness Modulation,0.0,0.0,0.0,5.0
Pulsar Timing,0.0,9.0,1.0,1.0
Pulsation Timing Variations,0.0,0.0,1.0,0.0
Radial Velocity,1.0,52.0,475.0,424.0
Transit,0.0,0.0,64.0,712.0
Transit Timing Variations,0.0,0.0,0.0,9.0
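The same kind of method-by-decade table can also be produced with `pivot_table`. The sketch below uses a tiny made-up frame (`toy`, with invented rows that only mimic the planets columns) so that it runs offline, and checks that both routes give the same numbers.

```python
import pandas as pd

# Hypothetical rows; not taken from the real planets dataset
toy = pd.DataFrame({
    'method': ['Transit', 'Transit', 'Radial Velocity',
               'Radial Velocity', 'Imaging'],
    'number': [1, 2, 1, 3, 1],
    'year': [2008, 2012, 1995, 2011, 2008],
})

decade = (10 * (toy['year'] // 10)).astype(str) + 's'
decade.name = 'decade'

# route 1: groupby on two keys, then unstack the decade level
via_groupby = (toy.groupby(['method', decade])['number']
                  .sum().unstack().fillna(0))

# route 2: pivot_table with the decade added as a regular column
via_pivot = toy.assign(decade=decade).pivot_table(
    index='method', columns='decade', values='number',
    aggfunc='sum', fill_value=0)

print(via_groupby)
print(via_pivot)
```

Which form reads better is a matter of taste; the groupby chain makes the split, apply, and combine steps explicit, while `pivot_table` states the desired layout directly.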


In [232]:
decade


0       2000s
1       2000s
2       2010s
3       2000s
4       2000s
        ...  
1030    2000s
1031    2000s
1032    2000s
1033    2000s
1034    2000s
Name: decade, Length: 1035, dtype: object