# The four ways to do groupby

The mental model is split, apply, combine. For instance, to find the average salary by state, we first split the dataframe by state, apply the operation (taking the mean) to each piece, and combine the results into one object. There are four ways to instantiate a groupby object. Under the hood there is only one; the other three are derived from it.
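
To make the three steps concrete, here is a minimal sketch that does them by hand and then compares against groupby. The `salaries` frame and its `state` and `salary` columns are made up for illustration; they are not part of the examples that follow.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
salaries = pd.DataFrame({'state': ['NY', 'CA', 'NY', 'CA'],
                         'salary': [100, 80, 120, 90]})

# Split: one sub-frame per state.
pieces = {state: salaries[salaries['state'] == state]
          for state in salaries['state'].unique()}

# Apply: compute the mean salary of each piece.
means = {state: piece['salary'].mean() for state, piece in pieces.items()}

# Combine: assemble the per-state results into one Series.
manual = pd.Series(means).sort_index()

# groupby performs the same three steps in one call.
auto = salaries.groupby('state')['salary'].mean()

print(manual)
print(auto)
```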

First, to debunk a myth: I used to think the groupby key had to come FROM the dataframe. That is not true. The key can be completely external, and that is where we start.

In [123]:
import pandas as pd 
df = pd.DataFrame({'internal_key': ['a', 'b','b', 'a'],
                   'values_1': [1,2,3,4],
                   'values_2': [24,5,5, 9]})

In [124]:
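# numeric_only=True skips the string column; pandas 2.0+ raises a TypeError without it
df.groupby(['g1', 'g2','g1', 'g2']).mean(numeric_only=True)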

Unnamed: 0,values_1,values_2
g1,2.0,14.5
g2,3.0,7.0


In [11]:
df.groupby(['g1', 'g2','g1', 'g2']).max()

Unnamed: 0,internal_key,values_1,values_2
g1,b,3,24
g2,b,4,9


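This is the most fundamental way to do groupby: pass one label per row and group the dataframe by those labels. Note that this information was not in the dataframe at all before grouping. Also note that internal_key is missing from the mean output: older pandas versions silently dropped non-numeric columns for numeric operations such as mean, whereas pandas 2.0 and later raise a TypeError unless you pass numeric_only=True or select the numeric columns first.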
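
Since newer pandas releases are stricter about non-numeric columns in numeric aggregations, it can be safer to say explicitly which columns to aggregate. A small sketch on the same data; `external_key` is just a name I introduce here for the list of row labels.

```python
import pandas as pd

df = pd.DataFrame({'internal_key': ['a', 'b', 'b', 'a'],
                   'values_1': [1, 2, 3, 4],
                   'values_2': [24, 5, 5, 9]})

# One label per row; the labels live outside the dataframe.
external_key = ['g1', 'g2', 'g1', 'g2']

# Option 1: select the numeric columns before aggregating.
by_selection = df.groupby(external_key)[['values_1', 'values_2']].mean()

# Option 2: ask mean() to skip non-numeric columns.
by_flag = df.groupby(external_key).mean(numeric_only=True)

print(by_selection)
print(by_flag)
```

Both spellings should give the same table as the mean cell above.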

Of course, you can also group by the internal key column. The logic is identical: we pass a Series, and the grouping is done on that Series of row labels.

In [19]:
df.groupby(df['internal_key'], as_index=False).max()

Unnamed: 0,internal_key,values_1,values_2
0,a,4,24
1,b,3,5


However, if the grouping key is already a column of the dataframe, we do not even have to pass the array; the column name is enough.

In [125]:
df.groupby('internal_key', as_index=False).max()

Unnamed: 0,internal_key,values_1,values_2
0,a,4,24
1,b,3,5
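
To check that the column name really is just a shorthand, this sketch groups the same frame by the name, by the Series, and by a plain array of the same labels. I select the value columns explicitly so the three results are directly comparable; `value_cols` is a helper name of my own.

```python
import pandas as pd

df = pd.DataFrame({'internal_key': ['a', 'b', 'b', 'a'],
                   'values_1': [1, 2, 3, 4],
                   'values_2': [24, 5, 5, 9]})
value_cols = ['values_1', 'values_2']

# Three spellings of the same key: column name, Series, plain array.
by_name = df.groupby('internal_key')[value_cols].max()
by_series = df.groupby(df['internal_key'])[value_cols].max()
by_array = df.groupby(df['internal_key'].to_numpy())[value_cols].max()

print(by_name)
print(by_series)
print(by_array)
```

The only visible difference should be the index name, which the plain array cannot supply.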


Note that as_index controls whether the grouping column becomes the index of the result. If it is True (the default), the column is removed from the columns and used as the index instead.

In [126]:
df.groupby('internal_key').max()

internal_key,values_1,values_2
a,4,24
b,3,5
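
Another way to look at as_index: for a key that is a column, as_index=False should give the same table as grouping with the default and then calling reset_index. A quick sketch to compare the two (the names `flat` and `via_reset` are mine):

```python
import pandas as pd

df = pd.DataFrame({'internal_key': ['a', 'b', 'b', 'a'],
                   'values_1': [1, 2, 3, 4],
                   'values_2': [24, 5, 5, 9]})

# Key kept as a regular column.
flat = df.groupby('internal_key', as_index=False).max()

# Key used as the index first, then moved back into the columns.
via_reset = df.groupby('internal_key').max().reset_index()

print(flat)
print(via_reset)
```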


We can also group by more than one key; just put all the keys in a list. The following is an example:

In [28]:
df.groupby([['japan', 'usa','usa', 'usa'], ['hardware', 'hardware','software', 'software']]).mean(numeric_only=True)

Unnamed: 0,Unnamed: 1,values_1,values_2
japan,hardware,1.0,24.0
usa,hardware,2.0,5.0
usa,software,3.5,7.0


Do not set as_index=False in this case. The arrays were never in the dataframe in the first place, so if they are not kept as the index either, the group labels have nowhere to go and are lost:

In [127]:
df.groupby([['japan', 'usa','usa', 'usa'], ['hardware', 'hardware','software', 'software']],
           as_index=False).mean(numeric_only=True)

Unnamed: 0,values_1,values_2
0,1.0,24.0
1,2.0,5.0
2,3.5,7.0
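
If you do want flat output with the external keys as ordinary columns, two workarounds come to mind. This is a sketch, not the only way to do it; the names `countries`, `sectors`, `country` and `sector` are ones I made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'internal_key': ['a', 'b', 'b', 'a'],
                   'values_1': [1, 2, 3, 4],
                   'values_2': [24, 5, 5, 9]})
value_cols = ['values_1', 'values_2']

countries = ['japan', 'usa', 'usa', 'usa']
sectors = ['hardware', 'hardware', 'software', 'software']

# Option 1: keep the keys as the index, name the levels, then flatten.
flattened = (df.groupby([countries, sectors])[value_cols].mean()
               .rename_axis(['country', 'sector'])
               .reset_index())

# Option 2: attach the keys as columns first, then group by column name.
with_keys = df.assign(country=countries, sector=sectors)
from_columns = with_keys.groupby(['country', 'sector'],
                                 as_index=False)[value_cols].mean()

print(flattened)
print(from_columns)
```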


Of course, we don't always have to keep the nested index that two keys produce; just unstack it if you are not comfortable with it. The rule is that unstack moves the innermost index level into the columns, where it becomes the innermost column level, and the original column labels, if there are any, move up one level. Simply put, inner to inner.

In [49]:
# use a new name so the original df is not overwritten
stacked = df.groupby([['japan', 'usa','usa', 'usa'], ['hardware', 'hardware','software', 'software']]).mean(numeric_only=True)
stacked.unstack()

Unnamed: 0_level_0,values_1,values_1,values_2,values_2
Unnamed: 0_level_1,hardware,software,hardware,software
japan,1.0,,24.0,
usa,2.0,3.5,5.0,7.0
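
The inner level is only the default. Passing level to unstack picks a different index level to move into the columns. A short sketch on the same grouping; the variable names are mine.

```python
import pandas as pd

df = pd.DataFrame({'internal_key': ['a', 'b', 'b', 'a'],
                   'values_1': [1, 2, 3, 4],
                   'values_2': [24, 5, 5, 9]})
value_cols = ['values_1', 'values_2']

countries = ['japan', 'usa', 'usa', 'usa']
sectors = ['hardware', 'hardware', 'software', 'software']

stacked = df.groupby([countries, sectors])[value_cols].mean()

# Default (level=-1): the inner level, sectors, moves into the columns.
sectors_as_columns = stacked.unstack()

# level=0: the outer level, countries, moves into the columns instead.
countries_as_columns = stacked.unstack(level=0)

print(sectors_as_columns)
print(countries_as_columns)
```

Combinations that never occurred, such as japan with software, show up as NaN in either layout.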


Two more ways to create a groupby object: passing a dictionary and passing a function. The idea is the same though: the dictionary or function is applied to each index label, and the results are used as the group keys.

This shortcut works ONLY on the index. Hence, if the dataframe has no meaningful index to begin with, it may not be terribly useful. Nevertheless, here is a demo.

In [111]:
df3 = pd.DataFrame({'internal_key': ['ab', 'ab','b', 'a'],
                   'values_1': [1,2,3,4],
                   'values_2': [24,5,5, 9]})

df3.groupby(len).max() raises a TypeError saying that an object of type int has no len(). This is because the function is applied to the index labels, and the default index of this dataframe is a range from 0 to n-1 (integers). Here is how to fix it.

In [122]:
df3 = pd.DataFrame({'internal_key': ['ab', 'ab','b', 'a'],
                   'values_1': [1,2,3,4],
                   'values_2': [24,5,5, 9]})
df3.set_index(['internal_key'],inplace=True)
df3.groupby(len).max()

Unnamed: 0,values_1,values_2
1,4,9
2,2,24


Above, 1 is the group of index labels of length 1 and 2 is the group of index labels of length 2. In other words, the return values of the function become the new index. Dictionaries work in a similar way.

In [104]:
df4 = pd.DataFrame({'internal_key': ['ab', 'ab','b', 'a'],
                   'values_1': [1,2,3,4],
                   'values_2': [24,5,5, 9]})
df4.set_index(['internal_key'], inplace=True)
mapping_ = {'ab':'G','b':'G', 'a': 'SK'}
df4.groupby(mapping_).max()

Unnamed: 0,values_1,values_2
G,3,24
SK,4,9
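
If you would rather not touch the index, both shortcuts can be reproduced by building the key Series yourself from the column. The sketch below first shows the failure on the default integer index, then the two manual equivalents. The names `key_length` and `group_label` are hypothetical labels I chose for the derived keys.

```python
import pandas as pd

df4 = pd.DataFrame({'internal_key': ['ab', 'ab', 'b', 'a'],
                    'values_1': [1, 2, 3, 4],
                    'values_2': [24, 5, 5, 9]})
value_cols = ['values_1', 'values_2']

# With the default index, the function is called on 0, 1, 2, 3 and fails.
error_message = None
try:
    df4.groupby(len)[value_cols].max()
except TypeError as err:
    error_message = str(err)
print(error_message)

# Function shortcut by hand: compute the key from the column.
key_length = df4['internal_key'].str.len().rename('key_length')
by_length = df4.groupby(key_length)[value_cols].max()

# Dictionary shortcut by hand: map the column through the dict.
mapping_ = {'ab': 'G', 'b': 'G', 'a': 'SK'}
group_label = df4['internal_key'].map(mapping_).rename('group_label')
by_mapping = df4.groupby(group_label)[value_cols].max()

print(by_length)
print(by_mapping)
```

This is the first method again (pass one label per row), which is why I said the other three are derived from it.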


An obvious but useful point: all four methods also apply if we want to group columns rather than rows. Just add axis=1 (this argument is deprecated as of pandas 2.1, where transposing the dataframe and grouping its rows is the suggested replacement). The latter two ways are more useful here, since most dataframes come with column names.
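
Here is a minimal sketch of column-wise grouping written with a transpose, which should behave the same across pandas versions. The `column_groups` dictionary and its `total` label are made up for the example; mapping both value columns to one label simply adds them up row by row.

```python
import pandas as pd

df = pd.DataFrame({'internal_key': ['a', 'b', 'b', 'a'],
                   'values_1': [1, 2, 3, 4],
                   'values_2': [24, 5, 5, 9]})

# Keep only the numeric columns so the transpose stays numeric.
numeric = df[['values_1', 'values_2']]

# Hypothetical mapping from column name to group label.
column_groups = {'values_1': 'total', 'values_2': 'total'}

# Transpose so the columns become the index, group, then transpose back.
row_totals = numeric.T.groupby(column_groups).sum().T

print(row_totals)
```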

# Some other stuff

The groupby object is iterable. This should not come as a surprise given the split-apply-combine design, since the operation has to be applied piece by piece. By piece, I mean the sub-dataframes chopped out of the original.

The iterator yields two values at a time: the key and the corresponding dataframe. Most of the time this functionality won't be needed, since pandas provides extensive aggregation functions (discussed later) and there is no need to implement them from scratch.

In [132]:
df2 = pd.DataFrame({'internal_key': ['a', 'b','b', 'a'],
                   'values_1': [1, 2, 3, 4],
                   'values_2': [24, 5, 5, 9]})
grouped = df2.groupby(['g1', 'g2','g1', 'g2'])

In [133]:
for i, j in grouped:
    print(i)
    print(j.shape)
    print(type(j))

g1
(2, 3)
<class 'pandas.core.frame.DataFrame'>
g2
(2, 3)
<class 'pandas.core.frame.DataFrame'>
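
When you only need one piece, or want all pieces at hand for inspection, you do not have to write the loop every time. A small sketch using get_group and a dict built from the iterator; `pieces` and `g1_piece` are names of my own.

```python
import pandas as pd

df2 = pd.DataFrame({'internal_key': ['a', 'b', 'b', 'a'],
                    'values_1': [1, 2, 3, 4],
                    'values_2': [24, 5, 5, 9]})
grouped = df2.groupby(['g1', 'g2', 'g1', 'g2'])

# Collect every (key, sub-dataframe) pair into a dict.
pieces = {key: piece for key, piece in grouped}

# Or fetch a single group directly by its key.
g1_piece = grouped.get_group('g1')

print(sorted(pieces))
print(g1_piece)
```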


Iteration also supports multiple keys. Essentially, each combination of keys is yielded as a tuple, as long as its group is not empty.

In [130]:
df2 = pd.DataFrame({'internal_key': ['a', 'b','b', 'a'],
                   'values_1': [1,2,3,4],
                   'values_2': [24,5,5, 9]})
grouped2 = df2.groupby([['japan', 'usa','usa', 'usa'], ['hardware', 'hardware','software', 'software']])

In [131]:
for (k1, k2), group in grouped2:  # named group so the loop does not shadow df
    print((k1, k2))
    print(group)

('japan', 'hardware')
  internal_key  values_1  values_2
0            a         1        24
('usa', 'hardware')
  internal_key  values_1  values_2
1            b         2         5
('usa', 'software')
  internal_key  values_1  values_2
2            b         3         5
3            a         4         9
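
To see which key combinations actually exist without printing every piece, the groupby object can be inspected directly. A sketch using ngroups, groups and size on the same two-key grouping; `countries` and `sectors` are names I introduce for the two key lists.

```python
import pandas as pd

df2 = pd.DataFrame({'internal_key': ['a', 'b', 'b', 'a'],
                    'values_1': [1, 2, 3, 4],
                    'values_2': [24, 5, 5, 9]})

countries = ['japan', 'usa', 'usa', 'usa']
sectors = ['hardware', 'hardware', 'software', 'software']
grouped2 = df2.groupby([countries, sectors])

# Number of non-empty key combinations.
print(grouped2.ngroups)

# The combinations themselves, as tuples.
combos = list(grouped2.groups)
print(combos)

# Rows per combination.
sizes = grouped2.size()
print(sizes)
```

Out of the four possible country and sector pairs, only the three that occur in the data appear; japan with software is never created.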
