## Aggregation and Grouping

This notebook covers the essentials of aggregation and grouping in pandas. We use the penguins dataset from the seaborn library. Takeaways from this notebook:

1. How to summarize large data by computing aggregations like sum(), mean(), median(), etc.
2. How to group data and apply aggregate functions on groups.
3. How to filter data based on group characteristics.

In [1]:
##loading penguins data from seaborn package
import seaborn as sns

penguin_df = sns.load_dataset('penguins')

In [2]:
##check the first few rows
penguin_df.head()

,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,Torgersen,NaN,NaN,NaN,NaN,NaN
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE


### Aggregate functions for a dataframe

In [3]:
##aggregate functions like sum(), mean(), etc. return one result per column
##numeric_only=True skips text columns such as species (needed in pandas 2.0 and later)
penguin_df.mean(numeric_only=True)

bill_length_mm         43.921930
bill_depth_mm          17.151170
flipper_length_mm     200.915205
body_mass_g          4201.754386
dtype: float64
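
Two defaults shape the result above: column-wise aggregates skip missing values (`skipna=True`), and text columns have to be left out for a mean to make sense. The sketch below illustrates both on a small hand-made frame. `toy_df` and its values are made up for illustration and are not part of the penguins data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the penguins dataset): one text column,
# one numeric column containing a missing value.
toy_df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "body_mass_g": [3750.0, np.nan, 5000.0],
})

# numeric_only=True restricts the aggregate to numeric columns;
# the NaN is skipped by default, so the mean is (3750 + 5000) / 2.
col_means = toy_df.mean(numeric_only=True)

# skipna=False propagates the missing value instead of skipping it.
strict_mean = toy_df["body_mass_g"].mean(skipna=False)

print(col_means)
print(strict_mean)
```

Skipping missing values is usually what you want for a quick summary, but `skipna=False` is a simple way to notice that a column has gaps.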

In [4]:
##describe(): pandas convenience method that computes several common aggregates at once
penguin_df.describe()

,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
count,342.0,342.0,342.0,342.0
mean,43.92193,17.15117,200.915205,4201.754386
std,5.459584,1.974793,14.061714,801.954536
min,32.1,13.1,172.0,2700.0
25%,39.225,15.6,190.0,3550.0
50%,44.45,17.3,197.0,4050.0
75%,48.5,18.7,213.0,4750.0
max,59.6,21.5,231.0,6300.0
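
The `count` row of `describe()` reports the number of non-missing values in each column, so it can be smaller than the number of rows in the frame. A minimal sketch on a made-up frame (`toy_df` is a hypothetical name, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three rows, one of them missing.
toy_df = pd.DataFrame({"flipper_length_mm": [181.0, np.nan, 195.0]})

summary = toy_df.describe()

# count ignores the NaN row; len() counts every row.
n_non_missing = summary.loc["count", "flipper_length_mm"]
n_rows = len(toy_df)

print(n_non_missing, n_rows)
```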


### GroupBy

In [6]:
##grouping penguins by species
penguin_df.groupby('species')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f02663a9310>

In [7]:
##selecting the body_mass_g column from the groupby object gives a SeriesGroupBy
penguin_df.groupby('species')['body_mass_g']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f02663a9130>
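
The two objects printed above can be thought of as a recipe for splitting the rows: nothing is aggregated until a method such as `mean()` is called. The sketch below peeks inside a groupby object on a made-up frame; `toy_df` and its values are hypothetical and chosen only to keep the arithmetic easy to check.

```python
import pandas as pd

# Hypothetical toy data (not the penguins dataset).
toy_df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Gentoo"],
    "body_mass_g": [3700.0, 5000.0, 3800.0, 5200.0],
})

grouped = toy_df.groupby("species")

# .groups maps each group name to the index labels of its rows.
group_labels = {name: list(idx) for name, idx in grouped.groups.items()}

# .get_group() returns the sub-dataframe for a single group.
adelie_rows = grouped.get_group("Adelie")

# Aggregating that sub-dataframe by hand matches the per-group result.
manual_mean = adelie_rows["body_mass_g"].mean()
grouped_mean = grouped["body_mass_g"].mean()["Adelie"]

print(group_labels)
print(manual_mean, grouped_mean)
```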

In [8]:
##calling an aggregate function computes the statistic for each group
penguin_df.groupby('species')['body_mass_g'].mean()

species
Adelie       3700.662252
Chinstrap    3733.088235
Gentoo       5076.016260
Name: body_mass_g, dtype: float64

In [9]:
##iterating over a groupby object yields (group name, sub-dataframe) pairs
for species_name, group in penguin_df.groupby('species'):
    print("{0:30s} shape={1}".format(species_name, group.shape))

Adelie                         shape=(152, 7)
Chinstrap                      shape=(68, 7)
Gentoo                         shape=(124, 7)


In [10]:
##computing summary statistics on groups
penguin_df.groupby('species')['body_mass_g'].describe()

species,count,mean,std,min,25%,50%,75%,max
Adelie,151.0,3700.662252,458.566126,2850.0,3350.0,3700.0,4000.0,4775.0
Chinstrap,68.0,3733.088235,384.335081,2700.0,3487.5,3700.0,3950.0,4800.0
Gentoo,123.0,5076.01626,504.116237,3950.0,4700.0,5000.0,5500.0,6300.0


In [11]:
##computing different aggregates on different columns using aggregate() method
penguin_df.groupby('species').aggregate({
    'body_mass_g': 'mean',
    'flipper_length_mm': 'median'    
})

species,body_mass_g,flipper_length_mm
Adelie,3700.662252,190.0
Chinstrap,3733.088235,196.0
Gentoo,5076.01626,216.0
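
With the dictionary form, the output columns keep the input column names, which hides which statistic each one holds. One alternative is pandas' named aggregation, where each output column gets an explicit name. This is a sketch on a made-up frame; `toy_df`, `mean_mass` and `median_flipper` are hypothetical names chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data (not the penguins dataset).
toy_df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "body_mass_g": [3700.0, 3800.0, 5000.0, 5200.0, 5100.0],
    "flipper_length_mm": [190.0, 192.0, 215.0, 217.0, 220.0],
})

# Named aggregation: output_name=(input_column, aggregate_function)
summary = toy_df.groupby("species").agg(
    mean_mass=("body_mass_g", "mean"),
    median_flipper=("flipper_length_mm", "median"),
)

print(summary)
```

The result is indexed by `species`, just like the dictionary version, but each column name now states what was computed.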


### Adding filters on groups
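
A group filter keeps or drops whole groups: the function passed to `filter()` receives one group at a time and returns a single `True` or `False`, and the result contains the original rows of every group that passed. The sketch below shows this on a made-up frame, together with an equivalent boolean mask built with `transform()`. `toy_df`, its values and the 4000 threshold are hypothetical.

```python
import pandas as pd

# Hypothetical toy data (not the penguins dataset).
toy_df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Gentoo"],
    "body_mass_g": [3700.0, 5000.0, 3800.0, 5200.0],
})

# Keep only the rows of species whose mean body mass exceeds 4000.
kept = toy_df.groupby("species").filter(
    lambda g: g["body_mass_g"].mean() > 4000
)

# Same selection via a mask: transform("mean") broadcasts each
# group's mean back to every row of that group.
group_means = toy_df.groupby("species")["body_mass_g"].transform("mean")
kept_via_mask = toy_df[group_means > 4000]

print(kept)
```

Note that the filtered result is made of rows, not of one summary line per group, and the rows keep their original index labels.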

In [12]:
##printing the average body mass for each species
print(penguin_df.groupby('species')['body_mass_g'].mean())

##keeping only the rows of species whose average body mass is greater than 3710 g
print(penguin_df.groupby('species').filter(lambda x: x['body_mass_g'].mean() > 3710))

species
Adelie       3700.662252
Chinstrap    3733.088235
Gentoo       5076.016260
Name: body_mass_g, dtype: float64
       species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
152  Chinstrap   Dream            46.5           17.9              192.0   
153  Chinstrap   Dream            50.0           19.5              196.0   
154  Chinstrap   Dream            51.3           19.2              193.0   
155  Chinstrap   Dream            45.4           18.7              188.0   
156  Chinstrap   Dream            52.7           19.8              197.0   
..         ...     ...             ...            ...                ...   
339     Gentoo  Biscoe             NaN            NaN                NaN   
340     Gentoo  Biscoe            46.8           14.3              215.0   
341     Gentoo  Biscoe            50.4           15.7              222.0   
342     Gentoo  Biscoe            45.2           14.8              212.0   
343     Gentoo  Biscoe            49.9         