# Pandas Groupby

## Objectives

- Master the use of Pandas' groupby for data aggregation, transformation, and filtration.
- Learn to apply various functions to grouped data for comprehensive analysis.

## Background

This notebook covers the groupby functionality in Pandas: splitting data into groups, applying an operation to each group independently, and combining the results into a single object for analysis.

## Datasets Used

The notebook does not reference external datasets. It uses a fictitious dataset containing teams, their rankings, years, and points to demonstrate the splitting, applying, and combining phases of groupby.

## groupby Method

Any groupby operation involves one or more of the following steps on the original object: <br> 
1. Splitting the data into groups based on some criteria <br> 
2. Applying a function to each group independently <br> 
3. Combining the results into a single data structure <br> 

In many situations, we split the data into sets and apply some function to each subset. The apply step typically performs one of the following operations:  <br> 
1. `Aggregation`: computing a summary statistic for each group <br> 
2. `Transformation`: performing a group-specific computation that returns an object indexed like the original <br> 
3. `Filtration`: discarding entire groups that fail some condition
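
The three steps can be reproduced by hand to see what `groupby` does in a single call. The sketch below uses a small hypothetical frame (`toy`, with made-up values) and is only an illustration of the idea, not a description of how Pandas implements it internally.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration.
toy = pd.DataFrame({'key': ['x', 'y', 'x', 'y'],
                    'value': [1, 2, 3, 4]})

# 1. Split: one sub-DataFrame per distinct key.
pieces = {k: toy[toy['key'] == k] for k in toy['key'].unique()}

# 2. Apply: compute a summary statistic on each piece independently.
sums = {k: piece['value'].sum() for k, piece in pieces.items()}

# 3. Combine: assemble the per-group results into a single Series.
manual = pd.Series(sums, name='value').rename_axis('key').sort_index()

# groupby performs the same three steps in one expression.
grouped = toy.groupby('key')['value'].sum()

print(manual)
print(grouped)
```

Both printed Series should hold the same per-key sums, which is the point of the comparison.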

In [1]:
import numpy as np
import pandas as pd

In [2]:
data = {'Team': ['A','A','B','B','C','C','C','C','A','D','D','A'],
   'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
   'Year': [2018,2019,2018,2019,2018,2019,2020,2021,2020,2018,2019,2021],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(data)
df

Unnamed: 0,Team,Rank,Year,Points
0,A,1,2018,876
1,A,2,2019,789
2,B,2,2018,863
3,B,3,2019,673
4,C,3,2018,741
5,C,4,2019,812
6,C,1,2020,756
7,C,1,2021,788
8,A,2,2020,694
9,D,4,2018,701
10,D,1,2019,804
11,A,2,2021,690


Splitting data by Team:

In [3]:
df.groupby('Team')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001697CD2F150>

View groups:

In [4]:
df.groupby('Team').groups

{'A': [0, 1, 8, 11], 'B': [2, 3], 'C': [4, 5, 6, 7], 'D': [9, 10]}

Getting the size of the groups:

In [5]:
df.groupby('Team').size()

Team
A    4
B    2
C    4
D    2
dtype: int64

Getting the number of groups (here, the number of teams):

In [6]:
df.groupby('Team').ngroups

4

Getting the keys of the groups:

In [7]:
df.groupby('Team').groups.keys()

dict_keys(['A', 'B', 'C', 'D'])

Groupby with multiple columns:

In [8]:
df.groupby(['Team','Year']).groups

{('A', 2018): [0], ('A', 2019): [1], ('A', 2020): [8], ('A', 2021): [11], ('B', 2018): [2], ('B', 2019): [3], ('C', 2018): [4], ('C', 2019): [5], ('C', 2020): [6], ('C', 2021): [7], ('D', 2018): [9], ('D', 2019): [10]}

In [9]:
df.groupby(['Team','Year']).ngroups

12

Getting the keys of the multi-column groups (each key is a tuple):

In [10]:
df.groupby(['Team','Year']).groups.keys()

dict_keys([('A', 2018), ('A', 2019), ('A', 2020), ('A', 2021), ('B', 2018), ('B', 2019), ('C', 2018), ('C', 2019), ('C', 2020), ('C', 2021), ('D', 2018), ('D', 2019)])

Iterating through groups:

In [11]:
t = df.groupby('Team')
for name, group in t:
    print(name)
    print(group)

A
   Team  Rank  Year  Points
0     A     1  2018     876
1     A     2  2019     789
8     A     2  2020     694
11    A     2  2021     690
B
  Team  Rank  Year  Points
2    B     2  2018     863
3    B     3  2019     673
C
  Team  Rank  Year  Points
4    C     3  2018     741
5    C     4  2019     812
6    C     1  2020     756
7    C     1  2021     788
D
   Team  Rank  Year  Points
9     D     4  2018     701
10    D     1  2019     804
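
When grouping by several columns, the name yielded at each iteration is a tuple with one entry per grouping key. A minimal sketch, rebuilding the same `df` so that it runs on its own (the `names` list is only there for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['A','A','B','B','C','C','C','C','A','D','D','A'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2018,2019,2018,2019,2018,2019,2020,2021,2020,2018,2019,2021],
    'Points': [876,789,863,673,741,812,756,788,694,701,804,690],
})

# Each name is a (Team, Year) tuple; each group is the matching sub-DataFrame.
names = []
for (team, year), group in df.groupby(['Team', 'Year']):
    names.append((team, year))
    print(team, year, group['Points'].tolist())
```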


`get_group`: selects a single group by its key and returns it as a DataFrame.

In [12]:
t.get_group('A')

Unnamed: 0,Team,Rank,Year,Points
0,A,1,2018,876
1,A,2,2019,789
8,A,2,2020,694
11,A,2,2021,690


In [13]:
t.get_group('D')

Unnamed: 0,Team,Rank,Year,Points
9,D,4,2018,701
10,D,1,2019,804
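
For a single grouping column, `get_group` should return the same rows as a boolean mask on that column. The sketch below checks this for team `A`, rebuilding `df` so that it is self-contained; the names `by_key` and `by_mask` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['A','A','B','B','C','C','C','C','A','D','D','A'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2018,2019,2018,2019,2018,2019,2020,2021,2020,2018,2019,2021],
    'Points': [876,789,863,673,741,812,756,788,694,701,804,690],
})

# Select group 'A' through the GroupBy object...
by_key = df.groupby('Team').get_group('A')

# ...and through an ordinary boolean mask on the grouping column.
by_mask = df[df['Team'] == 'A']

# Both selections keep the original row labels of team A.
print(by_key.index.tolist())
print(by_key.equals(by_mask))
```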


`Aggregations`: an aggregation function returns a single value for each group. Once the GroupBy object is created, several aggregation operations can be performed on the grouped data, either through `agg()` or through shortcut methods such as `sum()`, `mean()`, and `median()`.

In [14]:
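# Pass the function name as a string; recent pandas releases emit a FutureWarning for NumPy callables such as np.mean here
t[['Rank','Points']].agg('mean')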

Unnamed: 0_level_0,Rank,Points
Team,Unnamed: 1_level_1,Unnamed: 2_level_1
A,1.75,762.25
B,2.5,768.0
C,2.25,774.25
D,2.5,752.5


In [15]:
# Several aggregations at once: the result has one column level per function
t[['Rank','Points']].agg(['size', 'mean', 'sum'])

Unnamed: 0_level_0,Rank,Rank,Rank,Points,Points,Points
Unnamed: 0_level_1,size,mean,sum,size,mean,sum
Team,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
A,4,1.75,7,4,762.25,3049
B,2,2.5,5,2,768.0,1536
C,4,2.25,9,4,774.25,3097
D,2,2.5,5,2,752.5,1505


In [16]:
# 'std' is the sample standard deviation (ddof=1), which matches the values shown below
t['Points'].agg(['size', 'mean', 'std'])

Unnamed: 0_level_0,size,mean,std
Team,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
A,4,762.25,88.567771
B,2,768.0,134.350288
C,4,774.25,31.899582
D,2,752.5,72.831998


In [17]:
t[['Rank','Points']].sum()

Unnamed: 0_level_0,Rank,Points
Team,Unnamed: 1_level_1,Unnamed: 2_level_1
A,7,3049
B,5,1536
C,9,3097
D,5,1505


In [18]:
t[['Rank','Points']].mean()

Unnamed: 0_level_0,Rank,Points
Team,Unnamed: 1_level_1,Unnamed: 2_level_1
A,1.75,762.25
B,2.5,768.0
C,2.25,774.25
D,2.5,752.5


In [19]:
t[['Rank','Points']].median()

Unnamed: 0_level_0,Rank,Points
Team,Unnamed: 1_level_1,Unnamed: 2_level_1
A,2.0,741.5
B,2.5,768.0
C,2.0,772.0
D,2.5,752.5
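
When different columns need different statistics, named aggregation gives each output column an explicit name. A minimal sketch on the same data; the output names `seasons`, `mean_points`, and `best_rank` are chosen here for illustration and are not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['A','A','B','B','C','C','C','C','A','D','D','A'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2018,2019,2018,2019,2018,2019,2020,2021,2020,2018,2019,2021],
    'Points': [876,789,863,673,741,812,756,788,694,701,804,690],
})

# Each keyword is an output column: (input column, aggregation name).
summary = df.groupby('Team').agg(
    seasons=('Year', 'count'),        # number of rows per team
    mean_points=('Points', 'mean'),   # average points per team
    best_rank=('Rank', 'min'),        # lowest (best) rank per team
)
print(summary)
```

The result has one row per team and a flat set of column names, which is often easier to work with than the two-level columns produced by passing a list of functions.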


`Transformation methods`: return an object with the same shape and index as the input, but with values computed within each group. The cells below standardize each numeric column within its team (a group-wise z-score).

In [20]:
t = df.groupby('Team')

In [21]:
score = lambda x: (x - x.mean()) / x.std()

In [22]:
t.transform(score)

Unnamed: 0,Rank,Year,Points
0,-1.5,-1.161895,1.284327
1,0.5,-0.387298,0.302029
2,-0.707107,-0.707107,0.707107
3,0.707107,0.707107,-0.707107
4,0.5,-1.161895,-1.042333
5,1.166667,-0.387298,1.183401
6,-0.833333,0.387298,-0.572108
7,-0.833333,1.161895,0.43104
8,0.5,0.387298,-0.770596
9,0.707107,-0.707107,-0.707107
10,-0.707107,0.707107,0.707107
11,0.5,1.161895,-0.815759


In [23]:
t.groups

{'A': [0, 1, 8, 11], 'B': [2, 3], 'C': [4, 5, 6, 7], 'D': [9, 10]}

In the next cell, the mean of Points is computed per group and broadcast back to every row of that group. For instance, indexes 2 and 3 belong to group B, so both receive the same mean: 768.00. The result has the same shape and index as the original column.

In [24]:
t['Points'].transform(lambda x: x.mean())

0     762.25
1     762.25
2     768.00
3     768.00
4     774.25
5     774.25
6     774.25
7     774.25
8     762.25
9     752.50
10    752.50
11    762.25
Name: Points, dtype: float64
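
Because the transformed result is aligned with the original index, it can be assigned directly as a new column. The sketch below does this on a fresh copy of the data; the column names `TeamMean` and `DiffFromTeamMean` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['A','A','B','B','C','C','C','C','A','D','D','A'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2018,2019,2018,2019,2018,2019,2020,2021,2020,2018,2019,2021],
    'Points': [876,789,863,673,741,812,756,788,694,701,804,690],
})

# Broadcast each team's mean back to its rows ('mean' is equivalent to the lambda above).
df['TeamMean'] = df.groupby('Team')['Points'].transform('mean')

# Row-level deviation from the team mean.
df['DiffFromTeamMean'] = df['Points'] - df['TeamMean']

print(df[['Team', 'Points', 'TeamMean', 'DiffFromTeamMean']])
```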

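`filter()`: keeps or drops whole groups. The function passed to it receives each group as a DataFrame and must return a single `True` or `False`; the rows of the groups that return `True` are kept.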

In [25]:
t.groups

{'A': [0, 1, 8, 11], 'B': [2, 3], 'C': [4, 5, 6, 7], 'D': [9, 10]}

In [26]:
t.filter(lambda x: len(x)==2)

Unnamed: 0,Team,Rank,Year,Points
2,B,2,2018,863
3,B,3,2019,673
9,D,4,2018,701
10,D,1,2019,804


In [27]:
t.filter(lambda x: x['Points'].min() > 700)

Unnamed: 0,Team,Rank,Year,Points
4,C,3,2018,741
5,C,4,2019,812
6,C,1,2020,756
7,C,1,2021,788
9,D,4,2018,701
10,D,1,2019,804


In [28]:
t.filter(lambda x: x['Rank'].max() == 4)

Unnamed: 0,Team,Rank,Year,Points
4,C,3,2018,741
5,C,4,2019,812
6,C,1,2020,756
7,C,1,2021,788
9,D,4,2018,701
10,D,1,2019,804


In [29]:
tf = t.filter(lambda x: x['Year'].count() == 2)
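
A group-wise filter can also be written as a boolean mask built with `transform`, which may be preferable on large data because it avoids calling a Python function once per group. This is a sketch of that alternative under the same data, not a claim about which form the original notebook intends; `by_filter` and `by_mask` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['A','A','B','B','C','C','C','C','A','D','D','A'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2018,2019,2018,2019,2018,2019,2020,2021,2020,2018,2019,2021],
    'Points': [876,789,863,673,741,812,756,788,694,701,804,690],
})
t = df.groupby('Team')

# Keep only teams with exactly two rows, using filter().
by_filter = t.filter(lambda g: len(g) == 2)

# Same selection: broadcast each group's size to its rows, then mask.
mask = t['Points'].transform('size') == 2
by_mask = df[mask]

print(by_filter.index.tolist())
print(by_filter.equals(by_mask))
```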

In summary:
- If you want to get a single value for each group, use `aggregate()` (or one of its shortcuts). 
- If you want to get a new value for each original row, use `transform()`.
- If you want to get a subset of the original rows, use `filter()`. 
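
The difference between the three operations shows up most clearly in the size of what they return. A short sketch on the same data; the threshold of 765 is an arbitrary value chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['A','A','B','B','C','C','C','C','A','D','D','A'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2018,2019,2018,2019,2018,2019,2020,2021,2020,2018,2019,2021],
    'Points': [876,789,863,673,741,812,756,788,694,701,804,690],
})
points = df.groupby('Team')['Points']

agg_out = points.agg('mean')          # one value per group
trans_out = points.transform('mean')  # one value per original row
# Subset of the original rows: teams whose mean Points exceeds the threshold.
filt_out = df.groupby('Team').filter(lambda g: g['Points'].mean() > 765)

print(len(agg_out), len(trans_out), len(filt_out))
```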

## Conclusions

Key Takeaways:
- The `groupby` method is essential for handling and analyzing grouped data, supporting aggregation, transformation, and filtration operations.
- Aggregation operations compute summary statistics for each group, offering insights like mean or sum.
- Transformation operations allow for group-specific computations, maintaining the shape of the original DataFrame.
- Filtration operations enable the exclusion of data based on a group-wise condition, such as filtering groups based on their size or a specific value threshold.

## References

- VanderPlas, J. (2017) Python Data Science Handbook: Essential Tools for Working with Data. USA: O’Reilly Media, Inc. chapter 3