# GroupBy: Split-Apply-Combine

### Essential Imports

In [1]:
import numpy as np
import pandas as pd 


### Sample DataFrame

In [8]:
df = pd.DataFrame(
    [
        ("bird", "Falconiformes", 389.0, 1.2),
        ("bird", "Psittaciformes", 24.0, 0.25),
        ("mammal", "Carnivora", 80.2, 400),
        ("mammal", "Primates", np.nan, 25),
        ("mammal", "Carnivora", 58, 80),
        ("Fruit", "Vegetaria", 0.0, 0.20),
        ("Fruit", "Vegetaria", 0.0, 0.15)
    ],
    index=["falcon", "parrot", "lion", "monkey", "leopard", "Apple", "Orange"],
    columns=("class", "order", "max_speed", "weight"),
)

df

,class,order,max_speed,weight
falcon,bird,Falconiformes,389.0,1.2
parrot,bird,Psittaciformes,24.0,0.25
lion,mammal,Carnivora,80.2,400.0
monkey,mammal,Primates,,25.0
leopard,mammal,Carnivora,58.0,80.0
Apple,Fruit,Vegetaria,0.0,0.2
Orange,Fruit,Vegetaria,0.0,0.15


### GroupBy: "Split" Into Groups

<img src="test.JPG"  />

GroupBy is very similar to SQL's `GROUP BY` clause:
```
SELECT Column1, Column2, AVG(Column3), SUM(Column4)
FROM SomeTable
GROUP BY Column1, Column2
```
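
As a rough pandas counterpart to the query above, the sketch below groups by two key columns and applies a different aggregation to each value column. The table `some_table` and its values are made up for illustration; only the column names come from the SQL snippet.

```python
import pandas as pd

# Hypothetical stand-in for SomeTable; the values are invented for this sketch.
some_table = pd.DataFrame({
    "Column1": ["a", "a", "b", "b"],
    "Column2": ["x", "x", "y", "z"],
    "Column3": [1.0, 3.0, 5.0, 7.0],
    "Column4": [10, 20, 30, 40],
})

# as_index=False keeps the grouping keys as ordinary columns,
# which resembles the flat result set a SQL query returns.
sql_like = (
    some_table
    .groupby(["Column1", "Column2"], as_index=False)
    .agg({"Column3": "mean", "Column4": "sum"})
)
print(sql_like)
```

Without `as_index=False`, the grouping keys become the index of the result, which is what the cells below show.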

In [9]:
grouped = df.groupby('class')
# grouped is of type DataFrameGroupBy. 
type(grouped)

pandas.core.groupby.generic.DataFrameGroupBy

In [10]:
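# We need to apply some function to the grouped object to get results.  More on this later.
# numeric_only=True skips the string column 'order'; newer pandas versions no longer drop it automatically.
grouped.sum(numeric_only=True)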

class,max_speed,weight
Fruit,0.0,0.35
bird,413.0,1.45
mammal,138.2,505.0


In [11]:
# Multi-Level Grouping. 
multiGroup = df.groupby(['class', 'order'])
multiGroup.sum()

class,order,max_speed,weight
Fruit,Vegetaria,0.0,0.35
bird,Falconiformes,389.0,1.2
bird,Psittaciformes,24.0,0.25
mammal,Carnivora,138.2,480.0
mammal,Primates,0.0,25.0
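
Note that Primates shows a `max_speed` of 0.0 even though its only value is NaN: `sum` skips NaN, and the sum of nothing is 0. If a missing result is preferred in that case, one option is the `min_count` argument, sketched here on the same sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ("bird", "Falconiformes", 389.0, 1.2),
        ("bird", "Psittaciformes", 24.0, 0.25),
        ("mammal", "Carnivora", 80.2, 400),
        ("mammal", "Primates", np.nan, 25),
        ("mammal", "Carnivora", 58, 80),
        ("Fruit", "Vegetaria", 0.0, 0.20),
        ("Fruit", "Vegetaria", 0.0, 0.15),
    ],
    index=["falcon", "parrot", "lion", "monkey", "leopard", "Apple", "Orange"],
    columns=["class", "order", "max_speed", "weight"],
)

# min_count=1 requires at least one non-NaN value per group;
# groups with none get NaN instead of 0.
sums = df.groupby(["class", "order"]).sum(min_count=1)
print(sums)
```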


In [12]:
# "groups" returns a dictionary {groups: [index values for the group]}. 
print("Original DataFrame:\n", df)
d = df.groupby('class').groups
d

Original DataFrame:
           class           order  max_speed  weight
falcon     bird   Falconiformes      389.0    1.20
parrot     bird  Psittaciformes       24.0    0.25
lion     mammal       Carnivora       80.2  400.00
monkey   mammal        Primates        NaN   25.00
leopard  mammal       Carnivora       58.0   80.00
Apple     Fruit       Vegetaria        0.0    0.20
Orange    Fruit       Vegetaria        0.0    0.15


{'Fruit': ['Apple', 'Orange'], 'bird': ['falcon', 'parrot'], 'mammal': ['lion', 'monkey', 'leopard']}
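
Besides inspecting `.groups`, the split itself can be examined by iterating over the GroupBy object or by pulling out one group with `get_group`. A minimal sketch on the same sample data (the names `sizes` and `mammals` are just illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ("bird", "Falconiformes", 389.0, 1.2),
        ("bird", "Psittaciformes", 24.0, 0.25),
        ("mammal", "Carnivora", 80.2, 400),
        ("mammal", "Primates", np.nan, 25),
        ("mammal", "Carnivora", 58, 80),
        ("Fruit", "Vegetaria", 0.0, 0.20),
        ("Fruit", "Vegetaria", 0.0, 0.15),
    ],
    index=["falcon", "parrot", "lion", "monkey", "leopard", "Apple", "Orange"],
    columns=["class", "order", "max_speed", "weight"],
)

grouped = df.groupby("class")

# Iterating yields (group name, sub-DataFrame) pairs.
sizes = {name: len(sub_df) for name, sub_df in grouped}
print(sizes)

# get_group returns the rows of a single group as a regular DataFrame.
mammals = grouped.get_group("mammal")
print(mammals)
```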

### GroupBy: "Apply": Level I

- Apply a function which can act on an entire group's columns. 
  - Aggregation:  sum, mean, min, max, etc. 
  - Transformation:  Filling NaN values with a value derived for each group. 
  - Filtration:  Discard some groups based on some condition. 


See the pandas documentation for the full list of functions.
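
The cells that follow focus on aggregation. For the other two kinds of operation, here is a minimal sketch on the same sample data: `transform` fills the missing `max_speed` with the mean of that row's own group, and `filter` discards groups that fail a condition. The threshold of 1 and the variable names are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ("bird", "Falconiformes", 389.0, 1.2),
        ("bird", "Psittaciformes", 24.0, 0.25),
        ("mammal", "Carnivora", 80.2, 400),
        ("mammal", "Primates", np.nan, 25),
        ("mammal", "Carnivora", 58, 80),
        ("Fruit", "Vegetaria", 0.0, 0.20),
        ("Fruit", "Vegetaria", 0.0, 0.15),
    ],
    index=["falcon", "parrot", "lion", "monkey", "leopard", "Apple", "Orange"],
    columns=["class", "order", "max_speed", "weight"],
)

grouped = df.groupby("class")

# Transformation: the result has the same index as df, one value per row.
# NaN entries are replaced by the mean max_speed of the row's own class.
filled = grouped["max_speed"].transform(lambda s: s.fillna(s.mean()))
print(filled)

# Filtration: keep only the rows of groups whose total weight exceeds 1.
heavy = grouped.filter(lambda g: g["weight"].sum() > 1)
print(heavy)
```

Here the monkey's missing speed is filled from the mammal group only, and the Fruit group (total weight 0.35) is dropped by the filter.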

In [88]:
# We can select the columns we want before applying an aggregation function. 
grouped['max_speed'].mean()
# Selecting a single column returns a SERIES

class
Fruit       0.0
bird      206.5
mammal     69.1
Name: max_speed, dtype: float64

In [89]:
# We could convert the result above to a DataFrame: pd.DataFrame( grouped['max_speed'].sum() )
# But a shorter/cleaner approach is to use [[DOUBLE-BRACKETS]]: 
grouped[['max_speed', 'weight']].sum()

# The double brackets pass a LIST of columns, so the result is a DATA-FRAME

class,max_speed,weight
Fruit,0.0,0.35
bird,413.0,1.45
mammal,138.2,505.0


In [90]:
# We can even apply different aggregation functions to different columns
df2 = df.groupby(['class', 'order'])\
        .agg({  'max_speed': 'max', 
                'weight':    'mean'   })

df2

class,order,max_speed,weight
Fruit,Vegetaria,0.0,0.175
bird,Falconiformes,389.0,1.2
bird,Psittaciformes,24.0,0.25
mammal,Carnivora,80.2,240.0
mammal,Primates,,25.0
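
A variation on the dictionary form is named aggregation, where each keyword names an output column and maps to a `(source column, function)` pair. The output names `top_speed` and `mean_weight` below are invented for this sketch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ("bird", "Falconiformes", 389.0, 1.2),
        ("bird", "Psittaciformes", 24.0, 0.25),
        ("mammal", "Carnivora", 80.2, 400),
        ("mammal", "Primates", np.nan, 25),
        ("mammal", "Carnivora", 58, 80),
        ("Fruit", "Vegetaria", 0.0, 0.20),
        ("Fruit", "Vegetaria", 0.0, 0.15),
    ],
    index=["falcon", "parrot", "lion", "monkey", "leopard", "Apple", "Orange"],
    columns=["class", "order", "max_speed", "weight"],
)

# Each keyword becomes an output column: name=(source column, aggregation).
named = df.groupby(["class", "order"]).agg(
    top_speed=("max_speed", "max"),
    mean_weight=("weight", "mean"),
)
print(named)
```

This gives the same numbers as the dictionary version, with column names that say which aggregation produced them.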


In [91]:
# What if we are only interested in the mammal sub-group?  Use the '.xs' (cross-section) method. 
multiGroupSum = df.groupby(['class', 'order']).sum()
multiGroupSum.xs('mammal', level='class')

order,max_speed,weight
Carnivora,138.2,480.0
Primates,0.0,25.0
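
For the outermost index level, `.loc` offers a shorter route to the same rows, and `xs` can keep the selected level if `drop_level=False` is passed. A small sketch comparing the three (variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ("bird", "Falconiformes", 389.0, 1.2),
        ("bird", "Psittaciformes", 24.0, 0.25),
        ("mammal", "Carnivora", 80.2, 400),
        ("mammal", "Primates", np.nan, 25),
        ("mammal", "Carnivora", 58, 80),
        ("Fruit", "Vegetaria", 0.0, 0.20),
        ("Fruit", "Vegetaria", 0.0, 0.15),
    ],
    index=["falcon", "parrot", "lion", "monkey", "leopard", "Apple", "Orange"],
    columns=["class", "order", "max_speed", "weight"],
)

multi_sum = df.groupby(["class", "order"]).sum()

# Cross-section on the 'class' level; that level is dropped from the result.
via_xs = multi_sum.xs("mammal", level="class")

# .loc with a first-level label selects the same rows.
via_loc = multi_sum.loc["mammal"]

# drop_level=False keeps 'class' in the index of the result.
kept = multi_sum.xs("mammal", level="class", drop_level=False)

print(via_xs, via_loc, kept, sep="\n\n")
```

`xs` is the more general tool, since it can also select on an inner level such as `order`.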


### GroupBy: "Apply": Level II

In [98]:
# Instead of the usual stuff that pandas provides (min, max, sum, mean etc.),
# We could CUSTOM DEFINE our own functions: 

# Here we define a super-complicated function
def sum_and_square(grp):
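    s = grp.sum()   # Sum each column of the group
    return s * s    # Square each column's sum

# Select the numeric columns first so the function never sees the string columns.
df.groupby(['class', 'order'])[['max_speed', 'weight']].apply(sum_and_square)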


class,order,max_speed,weight
Fruit,Vegetaria,0.0,0.1225
bird,Falconiformes,151321.0,1.44
bird,Psittaciformes,576.0,0.0625
mammal,Carnivora,19099.24,230400.0
mammal,Primates,0.0,625.0
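
When a custom function reduces each column to a single value, it can also be passed to `agg`, which calls it once per column per group. A minimal sketch, using a hypothetical helper `value_range` that is not part of pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ("bird", "Falconiformes", 389.0, 1.2),
        ("bird", "Psittaciformes", 24.0, 0.25),
        ("mammal", "Carnivora", 80.2, 400),
        ("mammal", "Primates", np.nan, 25),
        ("mammal", "Carnivora", 58, 80),
        ("Fruit", "Vegetaria", 0.0, 0.20),
        ("Fruit", "Vegetaria", 0.0, 0.15),
    ],
    index=["falcon", "parrot", "lion", "monkey", "leopard", "Apple", "Orange"],
    columns=["class", "order", "max_speed", "weight"],
)

def value_range(col):
    """Spread of one column within a group: largest minus smallest value."""
    return col.max() - col.min()

# agg receives each selected column of each group as a Series.
ranges = df.groupby("class")[["max_speed", "weight"]].agg(value_range)
print(ranges)
```

`apply` remains the more flexible choice when the function needs the whole group at once or returns something other than one value per column.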
