### Grouping

We will use the groupby() function as part of what is known as a split-apply-combine procedure, which consists of the following three steps:
* first, split the data into groups
* then, apply a function to each group
* finally, combine the per-group results back into a DataFrame
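
The three steps above can be sketched by hand in plain Python, before involving pandas at all. This is only an illustration of the idea, not how pandas implements it; the (team, score1) pairs are copied from the DataFrame defined later in this section, and the names `records`, `groups` and `team_means` are chosen for this sketch.

```python
from collections import defaultdict
from statistics import mean

# (team, score1) pairs copied from the DataFrame used in this section.
records = [
    ("Ten Snakes", 16), ("Ten Snakes", 35), ("Ten Snakes", 55), ("Ten Snakes", 29),
    ("Nine Monkeys", 2), ("Nine Monkeys", 61), ("Nine Monkeys", 68), ("Nine Monkeys", 41),
    ("Eight Eagles", 94), ("Eight Eagles", 18),
]

# Split: collect the scores of each team in its own list.
groups = defaultdict(list)
for team, score in records:
    groups[team].append(score)

# Apply: compute one summary value (here the mean) per group.
# Combine: gather the per-group results into a single mapping.
team_means = {team: mean(scores) for team, scores in groups.items()}
print(team_means)
```

With groupby(), the split and combine steps are handled for us, so we only need to say which column to group by and which function to apply.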

Let’s implement these three steps in turn. We start by defining the DataFrame, which you have seen before in this subject:

In [2]:
import pandas as pd
import numpy as np

raw_data = {'team': ['Ten Snakes', 'Ten Snakes', 'Ten Snakes', 'Ten Snakes', 
                     'Nine Monkeys', 'Nine Monkeys', 'Nine Monkeys', 'Nine Monkeys', 
                     'Eight Eagles', 'Eight Eagles'], 
        'rank': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '2nd'], 
        'name': ['James', 'Allen', 'Matthew', 'James', 'Devon', 'Sam', 'Justin', 'Sam', 'Paul', 'Ross'], 
        'score1': [16,35,55,29,2,61,68,41,94,18],
        'score2': [81,65,54,44,28,93,2,5,53,99]}
df = pd.DataFrame(raw_data, columns = ['team', 'rank', 'name', 'score1', 'score2'])
df

,team,rank,name,score1,score2
0,Ten Snakes,1st,James,16,81
1,Ten Snakes,1st,Allen,35,65
2,Ten Snakes,2nd,Matthew,55,54
3,Ten Snakes,2nd,James,29,44
4,Nine Monkeys,1st,Devon,2,28
5,Nine Monkeys,1st,Sam,61,93
6,Nine Monkeys,2nd,Justin,68,2
7,Nine Monkeys,2nd,Sam,41,5
8,Eight Eagles,1st,Paul,94,53
9,Eight Eagles,2nd,Ross,18,99


#### Grouping by a single variable

The function groupby() splits the data into groups according to a variable of our choice. This variable can be one or more column labels or index levels. For our example, we will split by the column label ‘team’. Here is the command:



In [4]:
grouped = df.groupby("team")
# The variable grouped does not hold the grouped data directly. Instead, it is a DataFrameGroupBy object
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000242F0AC2B50>

In [5]:
# To have a look at what the grouping actually looks like, we can pass the object grouped to the function list() as follows:
list(grouped)

[('Eight Eagles',
             team rank  name  score1  score2
  8  Eight Eagles  1st  Paul      94      53
  9  Eight Eagles  2nd  Ross      18      99),
 ('Nine Monkeys',
             team rank    name  score1  score2
  4  Nine Monkeys  1st   Devon       2      28
  5  Nine Monkeys  1st     Sam      61      93
  6  Nine Monkeys  2nd  Justin      68       2
  7  Nine Monkeys  2nd     Sam      41       5),
 ('Ten Snakes',
           team rank     name  score1  score2
  0  Ten Snakes  1st    James      16      81
  1  Ten Snakes  1st    Allen      35      65
  2  Ten Snakes  2nd  Matthew      55      54
  3  Ten Snakes  2nd    James      29      44)]

This lists all of the groups and the items in each group. We can also get some descriptive statistics by group, by calling the method describe() on our object grouped:



In [6]:
grouped.describe()

,score1,score1,score1,score1,score1,score1,score1,score1,score2,score2,score2,score2,score2,score2,score2,score2
team,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
Eight Eagles,2.0,56.0,53.740115,18.0,37.0,56.0,75.0,94.0,2.0,76.0,32.526912,53.0,64.5,76.0,87.5,99.0
Nine Monkeys,4.0,43.0,29.631065,2.0,31.25,51.0,62.75,68.0,4.0,32.0,42.292631,2.0,4.25,16.5,44.25,93.0
Ten Snakes,4.0,33.75,16.23525,16.0,25.75,32.0,40.0,55.0,4.0,61.0,15.853496,44.0,51.5,59.5,69.0,81.0


In [8]:
# We can also apply the function mean() to get each group’s mean value per numeric column:
grouped.mean(numeric_only=True)
# We can find out the number of rows in each group by using:
grouped.size()
# Or, if we want the number of non-missing entries in each column of each group
# (the notebook displays only this last expression):
grouped.count()

team,rank,name,score1,score2
Eight Eagles,2,2,2,2
Nine Monkeys,4,4,4,4
Ten Snakes,4,4,4,4
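
In our example every cell is filled in, so size() and count() report the same numbers. The difference only shows up when values are missing. Here is a minimal sketch, assuming a small hypothetical DataFrame `demo` (not part of the data above) in which one score is NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one score is missing (NaN) for team "B".
demo = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "score": [10.0, 20.0, np.nan, 40.0],
})
demo_grouped = demo.groupby("team")

# size() counts the rows of each group, missing values included (a Series).
sizes = demo_grouped.size()
# count() counts the non-missing entries per column of each group (a DataFrame).
counts = demo_grouped.count()

print(sizes)
print(counts)
```

For team "B", size() reports two rows while count() reports a single non-missing score.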


If we want to get a specific group, we call the get_group() method and pass it a group name. The groups are labeled according to the entries of the variable that we decided to group by:

In [9]:
grouped.get_group("Ten Snakes")

,team,rank,name,score1,score2
0,Ten Snakes,1st,James,16,81
1,Ten Snakes,1st,Allen,35,65
2,Ten Snakes,2nd,Matthew,55,54
3,Ten Snakes,2nd,James,29,44
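
To make the link between group labels and column entries concrete, the sketch below compares get_group() with an ordinary boolean mask. It assumes a shortened, hypothetical copy of our data called `df_demo`; the other variable names are also chosen just for this illustration.

```python
import pandas as pd

# Shortened copy of the example data, for illustration only.
df_demo = pd.DataFrame({
    "team": ["Ten Snakes", "Ten Snakes", "Nine Monkeys", "Eight Eagles"],
    "name": ["James", "Allen", "Devon", "Paul"],
    "score1": [16, 35, 2, 94],
})
by_team = df_demo.groupby("team")

# The group labels are the distinct values of the grouping column.
labels = sorted(by_team.groups)

# get_group() returns the rows of one group, keeping their original index...
snakes = by_team.get_group("Ten Snakes")
# ...which are the same rows that a boolean mask on the column selects.
snakes_mask = df_demo[df_demo["team"] == "Ten Snakes"]

print(labels)
print(snakes.index.equals(snakes_mask.index))
```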


#### Grouping by multiple index levels
The hierarchical indexing we saw earlier is also useful when grouping data. Let’s recreate the two-level index for the DataFrame above:



In [10]:
df2 = df.set_index(["team", "rank"])
df2

team,rank,name,score1,score2
Ten Snakes,1st,James,16,81
Ten Snakes,1st,Allen,35,65
Ten Snakes,2nd,Matthew,55,54
Ten Snakes,2nd,James,29,44
Nine Monkeys,1st,Devon,2,28
Nine Monkeys,1st,Sam,61,93
Nine Monkeys,2nd,Justin,68,2
Nine Monkeys,2nd,Sam,41,5
Eight Eagles,1st,Paul,94,53
Eight Eagles,2nd,Ross,18,99


We can now group by the different index levels. We do this by passing the level names in a list to the level argument of groupby(). Let’s give it a try:

In [12]:
grouped2 = df2.groupby(level=["team", "rank"])
grouped2

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000242AE893280>

In [13]:
# Number of items in each group
grouped2.size()

team          rank
Eight Eagles  1st     1
              2nd     1
Nine Monkeys  1st     2
              2nd     2
Ten Snakes    1st     2
              2nd     2
dtype: int64

We can see that instead of the three groups we had before, we now have six, because each team is split once more according to the two possible ranks.

We can access a certain group now by passing a tuple containing the two indices which define the group. For example:



In [14]:
grouped2.get_group(("Eight Eagles", "1st"))

team,rank,name,score1,score2
Eight Eagles,1st,Paul,94,53
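
We do not have to use every level of the index. The sketch below, which assumes a shortened hypothetical copy of our data (`df_demo`, `indexed`), groups by a single level and then checks that grouping by both index levels gives the same group sizes as grouping the original columns directly with a list of column labels.

```python
import pandas as pd

# Shortened copy of the example data, for illustration only.
df_demo = pd.DataFrame({
    "team": ["Ten Snakes", "Ten Snakes", "Eight Eagles", "Eight Eagles"],
    "rank": ["1st", "2nd", "1st", "2nd"],
    "score1": [16, 55, 94, 18],
})
indexed = df_demo.set_index(["team", "rank"])

# Grouping by a single level ignores the other one: one group per team.
per_team = indexed.groupby(level="team").size()

# Grouping by both levels of the index...
per_team_rank = indexed.groupby(level=["team", "rank"]).size()
# ...yields the same group sizes as grouping by the two columns.
per_team_rank_cols = df_demo.groupby(["team", "rank"]).size()

print(per_team)
print(per_team_rank.to_dict() == per_team_rank_cols.to_dict())
```

Which form to use is mostly a matter of whether the grouping keys already live in the index or are still ordinary columns.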


#### Applying a function

In the second step of the split-apply-combine process, we apply a function of our choice to each group. We do this with the agg() method, passing it the function we wish to apply. Let’s try summing the entries per group:

In [15]:
# Select the numeric columns first, so that only the scores are summed
grouped[["score1", "score2"]].agg(np.sum)

team,score1,score2
Eight Eagles,112,152
Nine Monkeys,172,128
Ten Snakes,135,244


In [17]:
# We can also try this with the object grouped2 that has six groups,
# again selecting the numeric columns so that only the scores are summed:
grouped2[["score1", "score2"]].agg(np.sum)

team,rank,score1,score2
Eight Eagles,1st,94,53
Eight Eagles,2nd,18,99
Nine Monkeys,1st,63,121
Nine Monkeys,2nd,109,7
Ten Snakes,1st,51,146
Ten Snakes,2nd,84,98
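
agg() is not limited to a single function. As a minimal sketch, assuming a shortened hypothetical copy of our data (`df_demo`), we can pass a dict to use a different aggregation per column, or a list to apply several aggregations at once:

```python
import pandas as pd

# Shortened copy of the example data, for illustration only.
df_demo = pd.DataFrame({
    "team": ["Ten Snakes", "Ten Snakes", "Eight Eagles", "Eight Eagles"],
    "score1": [16, 35, 94, 18],
    "score2": [81, 65, 53, 99],
})
by_team = df_demo.groupby("team")

# A dict maps each column to its own aggregation.
mixed = by_team.agg({"score1": "sum", "score2": "max"})

# A list applies several aggregations to the selected column.
several = by_team["score1"].agg(["min", "max", "mean"])

print(mixed)
print(several)
```

In both cases the result has one row per group, just like the single-function call above.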


#### Filtering by groups

Once we have split the data and applied certain functions on each group separately, we arrive at the last part of the process, which is putting the data back together again. Now, the pandas GroupBy object has a very useful function called filter() which allows us to decide whether to include a certain group or not in the final combination.

This is how it works:

* first, we define a function that when passed a group returns either True or False
* and then, we pass this function to filter()

We will then get back the groups for which the function we defined returned True. Let’s try this on our example grouped. Suppose we are interested in keeping only those groups that have an average value greater than 50 in both score1 and score2. We already saw that we can get average values using the mean() method. Inside our function, the argument is the DataFrame of a single group, so calling mean() on its score columns returns the mean of each column for that group. So we can define our function as follows:


In [18]:
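def f(x):
    # x is the DataFrame of one group; average only the numeric score columns
    m = x[["score1", "score2"]].mean()
    # keep the group only if both averages exceed 50
    return (m.score1 > 50) & (m.score2 > 50)
grouped.filter(f)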



,team,rank,name,score1,score2
8,Eight Eagles,1st,Paul,94,53
9,Eight Eagles,2nd,Ross,18,99


The first thing to notice is that the object returned is a subset of the original DataFrame df. Next, observe that the rows returned are those from the team Eight Eagles. This is because this was the only group that satisfied our condition of having an average greater than 50 in both score1 and score2. In the next unit, you can test your understanding of grouping by completing a short exercise.
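
As a cross-check of what filter() does, the same selection can be reproduced by hand: compute the per-group means, pick the group labels that satisfy the condition, and keep the matching rows. The sketch below assumes a shortened hypothetical copy of our data (`df_demo`); it is one possible way to verify the result, not a replacement for filter().

```python
import pandas as pd

# Shortened copy of the example data, for illustration only.
df_demo = pd.DataFrame({
    "team": ["Ten Snakes", "Ten Snakes", "Eight Eagles", "Eight Eagles"],
    "score1": [16, 35, 94, 18],
    "score2": [81, 65, 53, 99],
})
score_cols = ["score1", "score2"]

# filter(): keep every row of each group whose score means both exceed 50.
kept = df_demo.groupby("team").filter(
    lambda g: bool((g[score_cols].mean() > 50).all())
)

# The same selection by hand: per-group means, then a row mask.
means = df_demo.groupby("team")[score_cols].mean()
good_teams = means[(means > 50).all(axis=1)].index
kept_by_hand = df_demo[df_demo["team"].isin(good_teams)]

print(kept)
print(list(kept.index) == list(kept_by_hand.index))
```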