# Groupby

A groupby operation involves one or more of the following steps on the original object: <br> 
1. Splitting the data into groups based on some criteria <br> 
2. Applying a function to each group independently <br> 
3. Combining the results  <br> 

In many situations, we split the data into sets and apply a function to each subset. The apply step can perform one of the following operations:  <br> 
1. `Aggregation`: computing a summary statistic for each group <br> 
2. `Transformation`: performing a group-specific computation that returns a result indexed like the original <br> 
3. `Filtration`: discarding entire groups that fail some condition
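
The split, apply, and combine steps can also be written out by hand, which makes it easier to see what `groupby` automates. The sketch below is only an illustration: the frame `toy` and its column names `key` and `val` are hypothetical and are not the data used in the rest of this notebook.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
toy = pd.DataFrame({'key': ['x', 'y', 'x', 'y'], 'val': [1, 2, 3, 4]})

# 1. Split: build one sub-frame per distinct key
pieces = {k: toy[toy['key'] == k] for k in toy['key'].unique()}

# 2. Apply: compute a summary statistic on each piece independently
sums = {k: piece['val'].sum() for k, piece in pieces.items()}

# 3. Combine: collect the per-group results into a single Series
manual = pd.Series(sums, name='val').rename_axis('key')

# groupby performs the same three steps in one call
auto = toy.groupby('key')['val'].sum()

print(manual)
print(auto)
```

Both Series hold one sum per key, so the manual version and the `groupby` version should agree.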

In [1]:
# !pip install numpy
# !pip install pandas

In [1]:
import numpy as np
import pandas as pd

In [2]:
data = {'Team': ['A','A','B','B','C','C','C','C','A','D','D','A'],
   'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(data)
df

Unnamed: 0,Team,Rank,Year,Points
0,A,1,2014,876
1,A,2,2015,789
2,B,2,2014,863
3,B,3,2015,673
4,C,3,2014,741
5,C,4,2015,812
6,C,1,2016,756
7,C,1,2017,788
8,A,2,2016,694
9,D,4,2014,701
10,D,1,2015,804
11,A,2,2017,690


Splitting data by Team:

In [3]:
# groupby() returns a DataFrameGroupBy object; no result is computed until a method is called on it
df.groupby('Team')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000275F521DE80>

View groups:

In [4]:
df.groupby('Team').groups

{'A': [0, 1, 8, 11], 'B': [2, 3], 'C': [4, 5, 6, 7], 'D': [9, 10]}

Getting the size of the groups:

In [5]:
df.groupby('Team').size()

Team
A    4
B    2
C    4
D    2
dtype: int64

Groupby with multiple columns:

In [6]:
df.groupby(['Team','Year']).groups

{('A', 2014): [0], ('A', 2015): [1], ('A', 2016): [8], ('A', 2017): [11], ('B', 2014): [2], ('B', 2015): [3], ('C', 2014): [4], ('C', 2015): [5], ('C', 2016): [6], ('C', 2017): [7], ('D', 2014): [9], ('D', 2015): [10]}
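
Grouping by several columns creates one group per distinct combination of keys, and aggregating such a grouping returns a result indexed by a MultiIndex. The following is a minimal sketch that rebuilds the same data so it runs on its own; the variable names `by_team_year` and `flat` are illustrative choices. It also shows `as_index=False`, which keeps the keys as ordinary columns.

```python
import pandas as pd

data = {'Team': ['A','A','B','B','C','C','C','C','A','D','D','A'],
        'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
        'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
        'Points': [876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(data)

# Aggregating a two-key grouping gives a Series indexed by (Team, Year)
by_team_year = df.groupby(['Team', 'Year'])['Points'].sum()
print(by_team_year.loc[('A', 2014)])  # select one combination with a tuple key

# as_index=False returns the keys as regular columns instead of an index
flat = df.groupby(['Team', 'Year'], as_index=False)['Points'].sum()
print(flat.head())
```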

Iterating through groups:

In [7]:
t = df.groupby('Team')
for name,group in t:
    print(name)
    print(group)

A
   Team  Rank  Year  Points
0     A     1  2014     876
1     A     2  2015     789
8     A     2  2016     694
11    A     2  2017     690
B
  Team  Rank  Year  Points
2    B     2  2014     863
3    B     3  2015     673
C
  Team  Rank  Year  Points
4    C     3  2014     741
5    C     4  2015     812
6    C     1  2016     756
7    C     1  2017     788
D
   Team  Rank  Year  Points
9     D     4  2014     701
10    D     1  2015     804


**get_group**: selects a single group by its key and returns it as a DataFrame

In [8]:
t.get_group('A')

Unnamed: 0,Team,Rank,Year,Points
0,A,1,2014,876
1,A,2,2015,789
8,A,2,2016,694
11,A,2,2017,690


In [9]:
t.get_group('D')

Unnamed: 0,Team,Rank,Year,Points
9,D,4,2014,701
10,D,1,2015,804


`aggregations`: An aggregation function returns a single value for each group. Once the groupby object is created, several aggregation operations can be performed on the grouped data.

In [10]:
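# Mean of every numeric column per team; recent pandas versions prefer the string alias, t.agg('mean')
t.agg(np.mean)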

Unnamed: 0_level_0,Rank,Year,Points
Team,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
A,1.75,2015.5,762.25
B,2.5,2014.5,768.0
C,2.25,2015.5,774.25
D,2.5,2014.5,752.5


In [11]:
t.agg([np.size, np.min, np.mean, np.max])

Unnamed: 0_level_0,Rank,Rank,Rank,Rank,Year,Year,Year,Year,Points,Points,Points,Points
Unnamed: 0_level_1,size,amin,mean,amax,size,amin,mean,amax,size,amin,mean,amax
Team,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2
A,4,1,1.75,2,4,2014,2015.5,2017,4,690,762.25,876
B,2,2,2.5,3,2,2014,2014.5,2015,2,673,768.0,863
C,4,1,2.25,4,4,2014,2015.5,2017,4,741,774.25,812
D,2,1,2.5,4,2,2014,2014.5,2015,2,701,752.5,804


In [12]:
t['Points'].agg([np.size, np.sum, np.mean, np.std])

Unnamed: 0_level_0,size,sum,mean,std
Team,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A,4,3049,762.25,88.567771
B,2,1536,768.0,134.350288
C,4,3097,774.25,31.899582
D,2,1505,752.5,72.831998


In [13]:
t['Rank'].sum()

Team
A    7
B    5
C    9
D    5
Name: Rank, dtype: int64
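
When different columns need different statistics, pandas also supports named aggregation, where each output column is given as `name=(column, function)`. The sketch below rebuilds the relevant columns so it runs on its own; the output names `total_points`, `mean_points`, and `best_rank` are arbitrary labels chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['A','A','B','B','C','C','C','C','A','D','D','A'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Points': [876,789,863,673,741,812,756,788,694,701,804,690],
})

# Each keyword names an output column; the tuple is (input column, aggregation)
summary = df.groupby('Team').agg(
    total_points=('Points', 'sum'),
    mean_points=('Points', 'mean'),
    best_rank=('Rank', 'min'),
)
print(summary)
```

This avoids the two-level column header produced when a list of functions is passed to `agg`.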

`Transformation methods`: return an object indexed like the original (one row per original row), with each value replaced by the result of a group-wise computation. 

In [14]:
t = df.groupby('Team')

In [15]:
# Standardize within each group: subtract the group mean, divide by the group standard deviation
score = lambda x: (x - x.mean()) / x.std()

In [16]:
t.transform(score)

Unnamed: 0,Rank,Year,Points
0,-1.5,-1.161895,1.284327
1,0.5,-0.387298,0.302029
2,-0.707107,-0.707107,0.707107
3,0.707107,0.707107,-0.707107
4,0.5,-1.161895,-1.042333
5,1.166667,-0.387298,1.183401
6,-0.833333,0.387298,-0.572108
7,-0.833333,1.161895,0.43104
8,0.5,0.387298,-0.770596
9,0.707107,-0.707107,-0.707107


In [17]:
t.groups

{'A': [0, 1, 8, 11], 'B': [2, 3], 'C': [4, 5, 6, 7], 'D': [9, 10]}

The mean of Points is calculated per group and repeated for every row of that group. For instance, indexes 2 and 3 belong to group B, and both receive the same mean: 768.00. The result has the same length and index as the original.

In [18]:
t['Points'].transform(lambda x: x.mean())

0     762.25
1     762.25
2     768.00
3     768.00
4     774.25
5     774.25
6     774.25
7     774.25
8     762.25
9     752.50
10    752.50
11    762.25
Name: Points, dtype: float64
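
Because the transformed result is aligned with the original index, it can be assigned straight back as a new column. A minimal sketch, rebuilding the relevant columns so it runs on its own; the column names `TeamMean` and `DiffFromMean` are hypothetical and introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['A','A','B','B','C','C','C','C','A','D','D','A'],
    'Points': [876,789,863,673,741,812,756,788,694,701,804,690],
})

# One value per original row: the mean Points of that row's team
df['TeamMean'] = df.groupby('Team')['Points'].transform('mean')

# Row-level comparison against the group statistic
df['DiffFromMean'] = df['Points'] - df['TeamMean']
print(df)
```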

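`filter()`: keeps or discards entire groups. The function passed to it is called once per group and must return True (keep all rows of the group) or False (drop them).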

In [19]:
t.groups

{'A': [0, 1, 8, 11], 'B': [2, 3], 'C': [4, 5, 6, 7], 'D': [9, 10]}

In [20]:
t.filter(lambda x: len(x)==2)

Unnamed: 0,Team,Rank,Year,Points
2,B,2,2014,863
3,B,3,2015,673
9,D,4,2014,701
10,D,1,2015,804


In [21]:
t.filter(lambda x: x['Points'].min() > 700)

Unnamed: 0,Team,Rank,Year,Points
4,C,3,2014,741
5,C,4,2015,812
6,C,1,2016,756
7,C,1,2017,788
9,D,4,2014,701
10,D,1,2015,804


In [22]:
t.filter(lambda x: x['Rank'].max() == 4)

Unnamed: 0,Team,Rank,Year,Points
4,C,3,2014,741
5,C,4,2015,812
6,C,1,2016,756
7,C,1,2017,788
9,D,4,2014,701
10,D,1,2015,804


In [23]:
# Keep only the teams that have exactly two Year entries
tf = t.filter(lambda x: x['Year'].count() == 2)
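
A group-level filter can also be expressed as a boolean mask built with `transform`, which some readers find easier to combine with other row conditions. The sketch below is one possible equivalent of the `Points` filter shown earlier, not the only way to write it; it rebuilds the relevant columns so it runs on its own, and the names `kept`, `mask`, and `kept_via_mask` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['A','A','B','B','C','C','C','C','A','D','D','A'],
    'Points': [876,789,863,673,741,812,756,788,694,701,804,690],
})

# filter(): the lambda is evaluated once per group and keeps or drops the whole group
kept = df.groupby('Team').filter(lambda g: g['Points'].min() > 700)

# Same selection as a row mask: broadcast each group's minimum to its rows, then compare
mask = df.groupby('Team')['Points'].transform('min') > 700
kept_via_mask = df[mask]

print(kept)
print(kept_via_mask)
```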

In summary:
- If you want to get a single value for each group, use aggregate() (or one of its shortcuts). 
- If you want to get a new value for each original row, use transform().
- If you want to get a subset of the original rows, use filter(). 
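
The difference between the three operations shows up in the size of what they return. A minimal sketch under the same data, with variable names chosen only for this example:

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['A','A','B','B','C','C','C','C','A','D','D','A'],
    'Points': [876,789,863,673,741,812,756,788,694,701,804,690],
})
points_by_team = df.groupby('Team')['Points']

agg_result = points_by_team.agg('max')        # one value per group
transformed = points_by_team.transform('max')  # one value per original row
filtered = df.groupby('Team').filter(lambda g: len(g) > 2)  # rows of the groups that pass

print(len(agg_result), len(transformed), len(filtered))
```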