<a href="https://colab.research.google.com/github/OptimalDecisions/sports-analytics-foundations/blob/main/pandas-basics/Pandas_Intermediate_2_10_GroupBy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

  ## Pandas Basics 2.10



  # GroupBy

  <img src = "../img/sa_logo.png" width="100" align="left">

  Ram Narasimhan

  <br><br><br>

  << [2.9 Merging Dataframes](Pandas_Intermediate_2_9_Merging_DataFrames.ipynb) | [2.10 GroupBy](Pandas_Intermediate_2_10_GroupBy.ipynb)  >>




Concepts covered in this notebook.

1. Split-Apply-Combine (the concept)
2. How to Create Groups in Pandas
3. Groupby and then calculating aggregations
4. Grouping by multiple columns
5. Flattening Hierarchical Indices

## 1. The Split-Apply-Combine Concept

- Split: Divide the data into groups.
- Apply: Perform a specific operation on each group.
- Combine: Bring the results back together into a single dataset.

The "Split-Apply-Combine" approach is powerful and flexible, allowing you to perform complex analyses efficiently. It is commonly used for tasks such as group-wise summary statistics and data transformation.

The `groupby` object in Pandas facilitates this process by providing a convenient interface for working with grouped data.


Split:

The first step involves dividing the dataset into groups based on some criteria. This could be a column's unique values, a combination of columns, or any other condition. Pandas provides the `groupby` method for this purpose.

```
grouped_data = df.groupby('Category')
```

Apply:

Once the dataset is split into groups, we can apply a specific operation or function to each group independently. Common operations include aggregation, transformation, or filtering. We can use methods like `sum()`, `mean()`, `apply()`, etc.

```
result_per_group = grouped_data['Value'].sum()
```
Here, the sum of the 'Value' column is calculated for each group.

Combine:

After applying the operation to each group, the results are combined back into a single dataset. This could be a new DataFrame, Series, or a summary statistic.
```
final_result = result_per_group.reset_index()
```

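Pandas combines the per-group results automatically, here into a Series indexed by `Category`. Calling `reset_index()` then turns that index into an ordinary column, which gives a new DataFrame.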
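The three snippets above refer to a `df` that is never defined. As a minimal end-to-end sketch, here they are run on a small made-up DataFrame (the data is hypothetical; only the column names `Category` and `Value` come from the snippets):

```python
import pandas as pd

# Hypothetical toy data, used only to make the three steps runnable.
df = pd.DataFrame({
    "Category": ["A", "B", "A", "B", "A"],
    "Value": [10, 20, 30, 40, 50],
})

grouped_data = df.groupby("Category")            # Split: one group per category
result_per_group = grouped_data["Value"].sum()   # Apply: sum 'Value' within each group
final_result = result_per_group.reset_index()    # Combine: back to a flat DataFrame

print(final_result)
```

The result has one row per category, with `Category` as a regular column again.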

## 2. Create Groups

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline


In [2]:
url = "https://raw.githubusercontent.com/OptimalDecisions/sports-analytics-foundations/main/data/nba_games_with_names.csv"
games = pd.read_csv(url)

In [3]:
games.sample(3)

Unnamed: 0,GAME_DATE_EST,GAME_ID,GAME_STATUS_TEXT,HOME_TEAM_ID,VISITOR_TEAM_ID,SEASON,TEAM_ID_home,PTS_home,FG_PCT_home,FT_PCT_home,...,FG3_PCT_away,AST_away,REB_away,HOME_TEAM_WINS,ABBREVIATION_h,NICKNAME_h,CITY_h,ABBREVIATION_a,NICKNAME_a,CITY_a
8512,2011-02-13,21000805,Final,1610612753,1610612747,2010,1610612753,89.0,0.487,0.8,...,0.125,19.0,34.0,1,ORL,Magic,Orlando,LAL,Lakers,Los Angeles
10562,2009-11-20,20900174,Final,1610612755,1610612763,2009,1610612755,97.0,0.457,0.737,...,0.273,22.0,48.0,0,PHI,76ers,Philadelphia,MEM,Grizzlies,Memphis
5468,2013-10-09,11300025,Final,1610612757,1610612756,2013,1610612757,98.0,0.467,0.677,...,0.529,22.0,43.0,0,POR,Trail Blazers,Portland,PHX,Suns,Phoenix


In [4]:
games.groupby('NICKNAME_h')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7b212143f490>
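As the output above shows, `groupby` on its own returns a GroupBy object rather than a table: nothing is computed until we ask for something. One way to peek inside is to iterate over it. The sketch below uses a tiny hypothetical stand-in for `games` (the rows are invented; only the column names match the dataset):

```python
import pandas as pd

# Hypothetical miniature version of the games table.
toy_games = pd.DataFrame({
    "NICKNAME_h": ["Heat", "Bulls", "Heat", "Spurs", "Bulls"],
    "PTS_home": [103, 91, 111, 120, 99],
})

grps_toy = toy_games.groupby("NICKNAME_h")

# Number of distinct groups (one per home team).
print(grps_toy.ngroups)

# Iterating yields (group key, sub-DataFrame) pairs, in sorted key order by default.
group_shapes = {name: group.shape for name, group in grps_toy}
print(group_shapes)
```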

In [5]:
games.groupby('NICKNAME_h').size()

NICKNAME_h
76ers            878
Bucks            883
Bulls            899
Cavaliers        921
Celtics          950
Clippers         903
Grizzlies        882
Hawks            883
Heat             959
Hornets          798
Jazz             874
Kings            840
Knicks           840
Lakers           969
Magic            876
Mavericks        907
Nets             879
Nuggets          873
Pacers           891
Pelicans         851
Pistons          903
Raptors          891
Rockets          900
Spurs            942
Suns             890
Thunder          893
Timberwolves     847
Trail Blazers    871
Warriors         907
Wizards          851
dtype: int64

### 2.1 Examine the First and Last row in each group

In [6]:
grps = games.groupby('NICKNAME_h')
grps.first()
#grps.last()

NICKNAME_h,GAME_DATE_EST,GAME_ID,GAME_STATUS_TEXT,HOME_TEAM_ID,VISITOR_TEAM_ID,SEASON,TEAM_ID_home,PTS_home,FG_PCT_home,FT_PCT_home,...,FT_PCT_away,FG3_PCT_away,AST_away,REB_away,HOME_TEAM_WINS,ABBREVIATION_h,CITY_h,ABBREVIATION_a,NICKNAME_a,CITY_a
76ers,2022-12-21,22200467,Final,1610612755,1610612765,2022,1610612755,113.0,0.441,0.909,...,0.735,0.261,15.0,46.0,1,PHI,Philadelphia,DET,Pistons,Detroit
Bucks,2022-12-17,22200442,Final,1610612749,1610612762,2022,1610612749,123.0,0.533,0.818,...,0.5,0.34,24.0,29.0,1,MIL,Milwaukee,UTA,Jazz,Utah
Bulls,2022-12-16,22200434,Final,1610612741,1610612752,2022,1610612741,91.0,0.468,0.875,...,0.719,0.386,19.0,50.0,0,CHI,Chicago,NYK,Knicks,New York
Cavaliers,2022-12-21,22200466,Final,1610612739,1610612749,2022,1610612739,114.0,0.482,0.786,...,0.682,0.433,20.0,46.0,1,CLE,Cleveland,MIL,Bucks,Milwaukee
Celtics,2022-12-21,22200469,Final,1610612738,1610612754,2022,1610612738,112.0,0.386,0.84,...,0.778,0.462,27.0,47.0,0,BOS,Boston,IND,Pacers,Indiana
Clippers,2022-12-21,22200476,Final,1610612746,1610612766,2022,1610612746,126.0,0.506,0.913,...,0.759,0.29,25.0,40.0,1,LAC,Los Angeles,CHA,Hornets,Charlotte
Grizzlies,2022-12-15,22200425,Final,1610612763,1610612749,2022,1610612763,142.0,0.549,0.783,...,0.667,0.26,22.0,39.0,1,MEM,Memphis,MIL,Bucks,Milwaukee
Hawks,2022-12-21,22200468,Final,1610612737,1610612741,2022,1610612737,108.0,0.429,1.0,...,0.773,0.292,20.0,47.0,0,ATL,Atlanta,CHI,Bulls,Chicago
Heat,2022-12-20,22200462,Final,1610612748,1610612741,2022,1610612748,103.0,0.469,0.706,...,0.833,0.419,24.0,39.0,0,MIA,Miami,CHI,Bulls,Chicago
Hornets,2022-12-16,22200428,Final,1610612766,1610612737,2022,1610612766,106.0,0.398,0.684,...,0.824,0.517,25.0,41.0,0,CHA,Charlotte,ATL,Hawks,Atlanta
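One caveat worth knowing: `first()` and `last()` return the first (or last) non-null value in each column, so when a group contains missing values the result may mix values from different rows. To get the literal first row of each group, `head(1)` is an alternative. A small sketch on hypothetical data (the column names `team`, `pts` and `ast` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing value in the first Heat row.
toy = pd.DataFrame({
    "team": ["Heat", "Heat", "Bulls", "Bulls"],
    "pts": [np.nan, 111.0, 91.0, 99.0],
    "ast": [24.0, 27.0, 19.0, 22.0],
})
g = toy.groupby("team")

first_non_null = g.first()  # first non-null value per column, per group
first_rows = g.head(1)      # the actual first row of each group, original index kept

print(first_non_null)
print(first_rows)
```

For the Heat, `first()` pairs `pts` from the second row with `ast` from the first, while `head(1)` keeps the first row as it is, including the missing value.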


### 2.2 Fetch a Single Group

In [7]:
grps.get_group('Heat')

Unnamed: 0,GAME_DATE_EST,GAME_ID,GAME_STATUS_TEXT,HOME_TEAM_ID,VISITOR_TEAM_ID,SEASON,TEAM_ID_home,PTS_home,FG_PCT_home,FT_PCT_home,...,FG3_PCT_away,AST_away,REB_away,HOME_TEAM_WINS,ABBREVIATION_h,NICKNAME_h,CITY_h,ABBREVIATION_a,NICKNAME_a,CITY_a
14,2022-12-20,22200462,Final,1610612748,1610612741,2022,1610612748,103.0,0.469,0.706,...,0.419,24.0,39.0,0,MIA,Heat,Miami,CHI,Bulls,Chicago
84,2022-12-10,22200387,Final,1610612748,1610612759,2022,1610612748,111.0,0.481,0.714,...,0.333,24.0,46.0,0,MIA,Heat,Miami,SAS,Spurs,San Antonio
102,2022-12-08,22200374,Final,1610612748,1610612746,2022,1610612748,115.0,0.511,0.786,...,0.472,24.0,43.0,1,MIA,Heat,Miami,LAC,Clippers,Los Angeles
117,2022-12-06,22200361,Final,1610612748,1610612765,2022,1610612748,96.0,0.429,0.769,...,0.463,27.0,40.0,0,MIA,Heat,Miami,DET,Pistons,Detroit
198,2022-11-25,22200277,Final,1610612748,1610612764,2022,1610612748,110.0,0.472,0.826,...,0.282,27.0,45.0,1,MIA,Heat,Miami,WAS,Wizards,Washington
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26592,2014-10-17,11400071,Final,1610612748,1610612744,2014,1610612748,115.0,0.557,0.806,...,0.290,26.0,29.0,1,MIA,Heat,Miami,GSW,Warriors,Golden State
26606,2014-10-14,11400052,Final,1610612748,1610612737,2014,1610612748,103.0,0.439,0.690,...,0.382,27.0,45.0,0,MIA,Heat,Miami,ATL,Hawks,Atlanta
26619,2014-10-11,11400033,Final,1610612748,1610612739,2014,1610612748,119.0,0.465,0.737,...,0.429,23.0,44.0,0,MIA,Heat,Miami,CLE,Cavaliers,Cleveland
26639,2014-10-07,11400012,Final,1610612748,1610612753,2014,1610612748,101.0,0.394,0.815,...,0.350,24.0,56.0,0,MIA,Heat,Miami,ORL,Magic,Orlando
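`get_group` selects the same rows a boolean filter on the grouping column would. The sketch below checks this on a hypothetical miniature table (invented rows, real column names); it is an illustration, not a statement about which approach is faster:

```python
import pandas as pd

# Hypothetical miniature version of the games table.
toy_games = pd.DataFrame({
    "NICKNAME_h": ["Heat", "Bulls", "Heat", "Spurs"],
    "PTS_home": [103, 91, 111, 120],
})

heat_via_group = toy_games.groupby("NICKNAME_h").get_group("Heat")
heat_via_mask = toy_games[toy_games["NICKNAME_h"] == "Heat"]

# Both keep the original row labels and the same values.
print(heat_via_group.index.tolist())
print(heat_via_group["PTS_home"].tolist() == heat_via_mask["PTS_home"].tolist())
```

`get_group` is convenient when the GroupBy object already exists; a plain filter is simpler when only one team is ever needed.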



## 3. Group by One Column and Calculate Stats on Another Column


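In the example below, the *grouping column* is `NICKNAME_h`.
But we are interested in the average (mean) of another *numeric* column called `HOME_TEAM_WINS`. Because that column is 1 for a home win and 0 otherwise, its mean is each team's home win rate.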

In [8]:
games.groupby('NICKNAME_h')['HOME_TEAM_WINS'].mean()

NICKNAME_h
76ers            0.541002
Bucks            0.577576
Bulls            0.575083
Cavaliers        0.601520
Celtics          0.634737
Clippers         0.595792
Grizzlies        0.597506
Hawks            0.570781
Heat             0.659020
Hornets          0.491228
Jazz             0.656751
Kings            0.496429
Knicks           0.471429
Lakers           0.591331
Magic            0.521689
Mavericks        0.654906
Nets             0.509670
Nuggets          0.674685
Pacers           0.619529
Pelicans         0.546416
Pistons          0.566999
Raptors          0.593715
Rockets          0.634444
Spurs            0.727176
Suns             0.577528
Thunder          0.600224
Timberwolves     0.472255
Trail Blazers    0.601607
Warriors         0.668137
Wizards          0.529965
Name: HOME_TEAM_WINS, dtype: float64

In [9]:
games.groupby('NICKNAME_h').agg({'HOME_TEAM_WINS': [len, sum, 'mean']})

,HOME_TEAM_WINS,HOME_TEAM_WINS,HOME_TEAM_WINS
NICKNAME_h,len,sum,mean
76ers,878,475,0.541002
Bucks,883,510,0.577576
Bulls,899,517,0.575083
Cavaliers,921,554,0.60152
Celtics,950,603,0.634737
Clippers,903,538,0.595792
Grizzlies,882,527,0.597506
Hawks,883,504,0.570781
Heat,959,632,0.65902
Hornets,798,392,0.491228
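The two-level column header above (`HOME_TEAM_WINS` over `len`, `sum`, `mean`) can be awkward to work with. One alternative, sketched here on hypothetical data, is pandas' named aggregation, where each output column gets an explicit name. The output names `games`, `wins` and `win_rate` are my own choices, and `"count"` (which counts non-null values) stands in for `len`:

```python
import pandas as pd

# Hypothetical miniature version of the games table.
toy_games = pd.DataFrame({
    "NICKNAME_h": ["Heat", "Heat", "Heat", "Bulls", "Bulls"],
    "HOME_TEAM_WINS": [1, 0, 1, 0, 1],
})

# Each keyword is an output column: name=(input column, aggregation).
summary = toy_games.groupby("NICKNAME_h").agg(
    games=("HOME_TEAM_WINS", "count"),
    wins=("HOME_TEAM_WINS", "sum"),
    win_rate=("HOME_TEAM_WINS", "mean"),
)

print(summary)
```

The result has a single, flat level of column names, so no flattening step is needed afterwards.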


### 3.1 Perform multiple Aggregations across multiple columns after grouping

In [10]:
games.groupby('NICKNAME_h').agg({
    'FG_PCT_home': 'mean',
    'FG_PCT_away': 'mean'})

NICKNAME_h,FG_PCT_home,FG_PCT_away
76ers,0.456315,0.449137
Bucks,0.461247,0.450475
Bulls,0.446363,0.441517
Cavaliers,0.457421,0.449975
Celtics,0.463992,0.440027
Clippers,0.465658,0.446419
Grizzlies,0.456776,0.446858
Hawks,0.45998,0.453045
Heat,0.472661,0.443626
Hornets,0.444371,0.451909
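The cell above applies one aggregation per column. The same dictionary form also accepts a list of aggregations for each column, in which case the result has two-level column names that are addressed with a tuple. A sketch on hypothetical values (the numbers are invented):

```python
import pandas as pd

# Hypothetical shooting percentages for two home teams.
toy = pd.DataFrame({
    "NICKNAME_h": ["Heat", "Heat", "Bulls", "Bulls"],
    "FG_PCT_home": [0.50, 0.40, 0.45, 0.47],
    "FG_PCT_away": [0.42, 0.44, 0.48, 0.40],
})

# A different list of aggregations for each column.
stats = toy.groupby("NICKNAME_h").agg({
    "FG_PCT_home": ["mean", "max"],
    "FG_PCT_away": ["mean", "min"],
})

# Columns are now (column, aggregation) pairs.
heat_home_mean = stats.loc["Heat", ("FG_PCT_home", "mean")]
print(stats)
print(heat_home_mean)
```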


In [11]:
games.columns

Index(['GAME_DATE_EST', 'GAME_ID', 'GAME_STATUS_TEXT', 'HOME_TEAM_ID',
       'VISITOR_TEAM_ID', 'SEASON', 'TEAM_ID_home', 'PTS_home', 'FG_PCT_home',
       'FT_PCT_home', 'FG3_PCT_home', 'AST_home', 'REB_home', 'TEAM_ID_away',
       'PTS_away', 'FG_PCT_away', 'FT_PCT_away', 'FG3_PCT_away', 'AST_away',
       'REB_away', 'HOME_TEAM_WINS', 'ABBREVIATION_h', 'NICKNAME_h', 'CITY_h',
       'ABBREVIATION_a', 'NICKNAME_a', 'CITY_a'],
      dtype='object')

### 3.2 How to Apply a custom function inside `agg` after grouping

Let's use a Basketball example to understand how we could write our own function and apply it to each group.

Let's say that we want to know, for each home team, the minimum and the maximum number of points that the away team (the opponent) has scored against it, that is, the range of `PTS_away`.

Let's write a custom function and find out.


In [12]:
def PTS_Range(points):
    # `points` is a Series holding one group's values for the chosen column
    mini = points.min()
    maxi = points.max()
    return f'Lowest {mini} Highest {maxi}'

NUMERIC_COLUMN = 'PTS_away'

grps.agg({NUMERIC_COLUMN: PTS_Range})


NICKNAME_h,PTS_away
76ers,Lowest 61.0 Highest 141.0
Bucks,Lowest 68.0 Highest 153.0
Bulls,Lowest 58.0 Highest 149.0
Cavaliers,Lowest 66.0 Highest 148.0
Celtics,Lowest 33.0 Highest 141.0
Clippers,Lowest 64.0 Highest 142.0
Grizzlies,Lowest 63.0 Highest 145.0
Hawks,Lowest 64.0 Highest 168.0
Heat,Lowest 63.0 Highest 144.0
Hornets,Lowest 63.0 Highest 141.0
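The function above returns a formatted string, which reads well but cannot be sorted or plotted. A variation, sketched on hypothetical data, returns a number instead. The name `pts_spread` is my own; note that the function receives one group's column as a Series:

```python
import pandas as pd

def pts_spread(points):
    """Return max minus min for one group's values (a pandas Series)."""
    return points.max() - points.min()

# Hypothetical away-team scores for two home teams.
toy = pd.DataFrame({
    "NICKNAME_h": ["Heat", "Heat", "Heat", "Bulls", "Bulls"],
    "PTS_away": [88, 101, 95, 110, 90],
})

spread = toy.groupby("NICKNAME_h")["PTS_away"].agg(pts_spread)

# The built-in aggregations give the two endpoints as separate numeric columns.
low_high = toy.groupby("NICKNAME_h")["PTS_away"].agg(["min", "max"])

print(spread)
print(low_high)
```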


### 3.3 Use `transform` to add new columns to your df based on groups

In [13]:
games.groupby('NICKNAME_h')['FG3_PCT_home'].transform("mean")


0        0.353038
1        0.360905
2        0.358482
3        0.346777
4        0.351469
           ...   
26646    0.351469
26647    0.358065
26648    0.341751
26649    0.359061
26650    0.356917
Name: FG3_PCT_home, Length: 26651, dtype: float64

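Unlike an aggregation, `transform` returns one value per original row (26,651 here) rather than one per group, so the output above can be assigned directly as a new column, named something like 'FG3_PCT_home_mean'.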

```
games['FG3_PCT_home_mean'] = games.groupby('NICKNAME_h')['FG3_PCT_home'].transform("mean")
```

In fact, we could create a whole set of new columns based on grouping. Each one needs a numeric column (our statistic of interest) and an aggregation method (such as sum, mean, count, or median).

In [14]:
NUMERIC_COLUMN = 'PTS_away'
AGG_METHODS = ["sum", "count", "median", "mean", "std"]
for v in AGG_METHODS:
    games[NUMERIC_COLUMN + '_' + v] = games.groupby('NICKNAME_h')[NUMERIC_COLUMN].transform(v)


In [15]:
[x for x in games.columns if x.startswith('PTS_away')]


['PTS_away',
 'PTS_away_sum',
 'PTS_away_count',
 'PTS_away_median',
 'PTS_away_mean',
 'PTS_away_std']
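A typical reason to broadcast a group statistic back onto every row is to compare each game against its team's norm. The sketch below does this on hypothetical data; the column name `PTS_away_vs_team_mean` is made up for illustration:

```python
import pandas as pd

# Hypothetical away-team scores for two home teams.
toy = pd.DataFrame({
    "NICKNAME_h": ["Heat", "Heat", "Bulls", "Bulls"],
    "PTS_away": [90.0, 110.0, 100.0, 80.0],
})

# transform returns one value per original row, aligned on the original index.
team_mean = toy.groupby("NICKNAME_h")["PTS_away"].transform("mean")

# Points allowed in this game relative to the home team's average.
toy["PTS_away_vs_team_mean"] = toy["PTS_away"] - team_mean

print(toy)
```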


## 4. Group by Multiple Columns

- The groupby operation in Pandas allows you to group data based on one or more columns.
- Grouping by multiple columns creates a hierarchical index, providing a more detailed level of grouping.
- We can calculate various summary statistics for each group, such as mean, sum, median, etc.


In [19]:
grps = games.groupby(['SEASON', 'NICKNAME_h'])


See how easy it is to get all the home games of the Chicago Bulls in a given season (say, 2003). With two grouping columns, the group key is a tuple with one value per column.

In [28]:
grps.get_group((2003, 'Bulls'))
#grps.groups.keys()

Unnamed: 0,GAME_DATE_EST,GAME_ID,GAME_STATUS_TEXT,HOME_TEAM_ID,VISITOR_TEAM_ID,SEASON,TEAM_ID_home,PTS_home,FG_PCT_home,FT_PCT_home,...,NICKNAME_h,CITY_h,ABBREVIATION_a,NICKNAME_a,CITY_a,PTS_away_sum,PTS_away_count,PTS_away_median,PTS_away_mean,PTS_away_std
18009,2004-04-12,20301168,Final,1610612741,1610612753,2003,1610612741,84.0,0.358,0.692,...,Bulls,Chicago,ORL,Magic,Orlando,88282.0,895,97.0,98.639106,13.801016
18036,2004-04-09,20301142,Final,1610612741,1610612737,2003,1610612741,101.0,0.463,0.765,...,Bulls,Chicago,ATL,Hawks,Atlanta,88282.0,895,97.0,98.639106,13.801016
18073,2004-04-03,20301103,Final,1610612741,1610612748,2003,1610612741,83.0,0.345,0.833,...,Bulls,Chicago,MIA,Heat,Miami,88282.0,895,97.0,98.639106,13.801016
18080,2004-04-02,20301096,Final,1610612741,1610612746,2003,1610612741,114.0,0.413,0.8,...,Bulls,Chicago,LAC,Clippers,Los Angeles,88282.0,895,97.0,98.639106,13.801016
18132,2004-03-26,20301045,Final,1610612741,1610612749,2003,1610612741,105.0,0.452,0.5,...,Bulls,Chicago,MIL,Bucks,Milwaukee,88282.0,895,97.0,98.639106,13.801016
18154,2004-03-23,20301022,Final,1610612741,1610612751,2003,1610612741,81.0,0.4,0.722,...,Bulls,Chicago,BKN,Nets,Brooklyn,88282.0,895,97.0,98.639106,13.801016
18177,2004-03-20,20301000,Final,1610612741,1610612752,2003,1610612741,87.0,0.464,0.677,...,Bulls,Chicago,NYK,Knicks,New York,88282.0,895,97.0,98.639106,13.801016
18225,2004-03-13,20300952,Final,1610612741,1610612747,2003,1610612741,81.0,0.383,0.516,...,Bulls,Chicago,LAL,Lakers,Los Angeles,88282.0,895,97.0,98.639106,13.801016
18256,2004-03-09,20300922,Final,1610612741,1610612755,2003,1610612741,81.0,0.375,0.706,...,Bulls,Chicago,PHI,76ers,Philadelphia,88282.0,895,97.0,98.639106,13.801016
18313,2004-03-01,20300865,Final,1610612741,1610612739,2003,1610612741,92.0,0.378,0.793,...,Bulls,Chicago,CLE,Cavaliers,Cleveland,88282.0,895,97.0,98.639106,13.801016


In [29]:
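grps['FT_PCT_home'].mean()  # evaluated, but a notebook cell displays only its last expression
grps['FT_PCT_home'].sum()   # this is the result shown below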

SEASON  NICKNAME_h   
2003    76ers            30.632
        Bucks            32.540
        Bulls            29.449
        Cavaliers        31.496
        Celtics          33.619
                          ...  
2022    Thunder          13.272
        Timberwolves     14.123
        Trail Blazers    10.993
        Warriors         13.489
        Wizards          12.793
Name: FT_PCT_home, Length: 599, dtype: float64
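The result above is indexed by two levels, `SEASON` and `NICKNAME_h`. A few ways to pull values out of such a result are sketched below on hypothetical data (the percentages are invented):

```python
import pandas as pd

# Hypothetical free-throw percentages across two seasons.
toy = pd.DataFrame({
    "SEASON": [2003, 2003, 2003, 2004, 2004],
    "NICKNAME_h": ["Bulls", "Bulls", "Heat", "Bulls", "Heat"],
    "FT_PCT_home": [0.70, 0.80, 0.75, 0.90, 0.60],
})

ft_mean = toy.groupby(["SEASON", "NICKNAME_h"])["FT_PCT_home"].mean()

# One (season, team) pair: pass the full key as a tuple.
bulls_2003 = ft_mean.loc[(2003, "Bulls")]

# Everything for one season: pass only the outer level.
season_2003 = ft_mean.loc[2003]

# One team across all seasons: take a cross-section on the inner level.
bulls_all = ft_mean.xs("Bulls", level="NICKNAME_h")

print(bulls_2003)
print(season_2003)
print(bulls_all)
```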

## 5. Flatten Indices

- After performing a GroupBy operation, you might end up with a hierarchical index.
- To flatten the index and convert it back to a regular DataFrame, use the reset_index() method.
- The reset_index() method replaces the index with a default integer index and moves the existing index levels back into ordinary columns.
- This makes the DataFrame more straightforward for further analysis or visualization.

In [30]:
grps['FT_PCT_home'].mean().reset_index()

Unnamed: 0,SEASON,NICKNAME_h,FT_PCT_home
0,2003,76ers,0.747122
1,2003,Bucks,0.756744
2,2003,Bulls,0.718268
3,2003,Cavaliers,0.768195
4,2003,Celtics,0.764068
...,...,...,...
594,2022,Thunder,0.780706
595,2022,Timberwolves,0.784611
596,2022,Trail Blazers,0.785214
597,2022,Warriors,0.749389
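`reset_index()` flattens the row index, but when `agg` is given a list of functions the column names become hierarchical as well. One common way to flatten those, sketched here on hypothetical data, is to join the two levels with an underscore. The sketch also shows `as_index=False`, which keeps the grouping keys as columns from the start:

```python
import pandas as pd

# Hypothetical free-throw percentages across two seasons.
toy = pd.DataFrame({
    "SEASON": [2003, 2003, 2003, 2004, 2004],
    "NICKNAME_h": ["Bulls", "Bulls", "Heat", "Bulls", "Heat"],
    "FT_PCT_home": [0.70, 0.80, 0.75, 0.90, 0.60],
})

stats = toy.groupby(["SEASON", "NICKNAME_h"]).agg({"FT_PCT_home": ["mean", "max"]})

# Flatten the two-level column names: ('FT_PCT_home', 'mean') -> 'FT_PCT_home_mean'.
stats.columns = ["_".join(col) for col in stats.columns]

# Flatten the two-level row index into ordinary columns.
flat = stats.reset_index()

# Alternative: ask groupby not to build the index in the first place.
flat_direct = toy.groupby(["SEASON", "NICKNAME_h"], as_index=False)["FT_PCT_home"].mean()

print(flat)
print(flat_direct)
```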



