In [1]:
import pandas as pd
import numpy as np

# Group by

This refers to a process with one or more of three steps:

- **Splitting** data into groups based on some criteria
- **Applying** a function to each group independently
- **Combining** results into a data structure

Several things could happen in Applying: 

- Aggregation -- Computing summary stats on each group
- Transformation -- Performing a computation on each group to return a like-indexed but changed group, e.g. normalizing data or filling in NA's
- Filtration -- Discarding some groups according to a computation that evaluates True or False 
- Some combination of these 
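
The three kinds of Applying can be sketched on a small frame. This is a minimal illustration, not the notebook's data: `toy`, `key`, and `val` are made-up names, chosen so that the results are deterministic.

```python
import pandas as pd

# Small hypothetical frame (not the notebook's data) so the results are deterministic.
toy = pd.DataFrame({'key': ['a', 'a', 'b', 'b', 'b'],
                    'val': [1.0, 3.0, 2.0, 4.0, 6.0]})
g = toy.groupby('key')['val']

# Aggregation: one summary value per group.
agg_result = g.mean()

# Transformation: same index as the input, values changed group by group
# (here each value is centered on its own group's mean).
centered = g.transform(lambda s: s - s.mean())

# Filtration: keep or drop whole groups based on a True/False test
# (here only groups with at least three rows survive).
big_groups = toy.groupby('key').filter(lambda grp: len(grp) >= 3)

print(agg_result)
print(centered)
print(big_groups)
```

Aggregation shrinks the data to one row per group, transformation keeps the original shape, and filtration returns a subset of the original rows.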


## Splitting an object into groups 

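pandas objects can be split on any of their axes. The examples below split a DataFrame's rows by the values in one or more columns.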

In [2]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8), 
                   'D' : np.random.randn(8)})
df

     A      B          C          D
0  foo    one    0.55382   0.633548
1  bar    one   0.024034   0.104126
2  foo    two   1.238403  -0.184995
3  bar  three  -0.491016  -1.616213
4  foo    two  -0.099737  -0.174135
5  bar    two   0.481952  -0.978214
6  foo    one   0.560719   0.912002
7  foo  three   -0.26657   0.255826


In [3]:
grouped1 = df.groupby('A')
grouped1

<pandas.core.groupby.DataFrameGroupBy object at 0x7fbfce08f550>

In [4]:
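# Hmm. The repr above doesn't show what actually happened when we grouped.
# head() returns the first five rows of each group, in the original row order.
# No group here has more than five rows, so the whole frame comes back.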

grouped1.head()

     A      B          C          D
0  foo    one    0.55382   0.633548
1  bar    one   0.024034   0.104126
2  foo    two   1.238403  -0.184995
3  bar  three  -0.491016  -1.616213
4  foo    two  -0.099737  -0.174135
5  bar    two   0.481952  -0.978214
6  foo    one   0.560719   0.912002
7  foo  three   -0.26657   0.255826
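
Neither the repr nor `head()` makes the groups themselves visible. One way to see them is to iterate over the GroupBy object or pull out a single group with `get_group`. A minimal sketch on a small hypothetical frame (`toy`, `pieces`, `bar_rows`, and `first_rows` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical data with fixed values so the output is reproducible.
toy = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo'],
                    'C': [1, 2, 3, 4, 5]})
grouped = toy.groupby('A')

# Iterating yields (group name, sub-DataFrame) pairs, sorted by key by default.
pieces = {name: group for name, group in grouped}
for name, group in pieces.items():
    print(name)
    print(group)

# get_group pulls out a single group by name.
bar_rows = grouped.get_group('bar')

# head(1) keeps the first row of each group, in the original row order.
first_rows = grouped.head(1)
print(first_rows)
```

Each piece is an ordinary DataFrame holding only the rows whose `A` value matches the group name.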


In [5]:
# Another example from https://www.tutorialspoint.com/python_pandas/python_pandas_groupby.htm 

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
         'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
         'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df2 = pd.DataFrame(ipl_data)

print(df2)

    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
2      863     2  Devils  2014
3      673     3  Devils  2015
4      741     3   Kings  2014
5      812     4   kings  2015
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
9      701     4  Royals  2014
10     804     1  Royals  2015
11     690     2  Riders  2017


In [6]:
print(df2.groupby('Team').groups)

{'Devils': Int64Index([2, 3], dtype='int64'), 'Kings': Int64Index([4, 6, 7], dtype='int64'), 'kings': Int64Index([5], dtype='int64'), 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'), 'Royals': Int64Index([9, 10], dtype='int64')}


OK, so this is what the documentation meant when it said 

>The abstract definition of grouping is to provide a mapping of labels to group names. 

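When you `groupby` a column, it creates an object that can be thought of as a dictionary mapping each distinct value in that column (the group name) to the index labels of the rows holding that value. Above, "Riders" is the group defined by those entries in the data frame with indices 0, 1, 8, 11. It's like a coloring of the rows. Note that the match is case-sensitive, so "Kings" and "kings" end up as separate groups.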
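
One way to check this picture is to build the mapping by hand and compare it with `.groups`. A minimal sketch on a small hypothetical frame (`toy`, `manual`, and `from_groupby` are illustrative names):

```python
import pandas as pd

# Hypothetical four-row frame; the values are borrowed from the example above.
toy = pd.DataFrame({'Team': ['Riders', 'Devils', 'Riders', 'Kings'],
                    'Points': [876, 863, 789, 741]})

# Build the value -> row-labels mapping by hand ...
manual = {}
for label, team in toy['Team'].items():
    manual.setdefault(team, []).append(label)

# ... and compare it with what groupby reports.
from_groupby = {team: labels.tolist()
                for team, labels in toy.groupby('Team').groups.items()}

print(manual)
print(from_groupby)
print(manual == from_groupby)
```

The two dictionaries hold the same keys and row labels; only the key order differs, since `groupby` sorts group names by default.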

Let's go back and look at the first DataFrame.

In [7]:
df

     A      B          C          D
0  foo    one    0.55382   0.633548
1  bar    one   0.024034   0.104126
2  foo    two   1.238403  -0.184995
3  bar  three  -0.491016  -1.616213
4  foo    two  -0.099737  -0.174135
5  bar    two   0.481952  -0.978214
6  foo    one   0.560719   0.912002
7  foo  three   -0.26657   0.255826


In [8]:
print(df.groupby('A').groups)

{'bar': Int64Index([1, 3, 5], dtype='int64'), 'foo': Int64Index([0, 2, 4, 6, 7], dtype='int64')}


In [10]:
# Grouping using two columns: 

print(df.groupby(['A','B']).groups)

# Each group key is an (A, B) pair of values.
# Only pairs that actually occur in the data become groups;
# here all six possible pairs happen to occur.

{('bar', 'two'): Int64Index([5], dtype='int64'), ('bar', 'one'): Int64Index([1], dtype='int64'), ('bar', 'three'): Int64Index([3], dtype='int64'), ('foo', 'two'): Int64Index([2, 4], dtype='int64'), ('foo', 'one'): Int64Index([0, 6], dtype='int64'), ('foo', 'three'): Int64Index([7], dtype='int64')}
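
With string columns like these, a two-column `groupby` creates a group only for value pairs that actually appear in some row, which need not be the full Cartesian product. A minimal sketch on a small hypothetical frame where one pair is absent (`toy`, `observed`, `product`, `missing`, and `sizes` are illustrative names):

```python
import itertools
import pandas as pd

# Hypothetical frame in which the pair ('bar', 'two') never occurs.
toy = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                    'B': ['one', 'one', 'two', 'one'],
                    'C': [1, 2, 3, 4]})

# Keys of a two-column groupby are (A value, B value) tuples.
observed = set(toy.groupby(['A', 'B']).groups)

# The full Cartesian product of the distinct values in each column.
product = set(itertools.product(toy['A'].unique(), toy['B'].unique()))

# Pairs in the product that never occur in the data get no group.
missing = product - observed

# size() counts the rows in each observed group.
sizes = toy.groupby(['A', 'B']).size()

print(observed)
print(missing)
print(sizes)
```
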
