# Grouping

As in many database languages, grouping is one of the most commonly used operations in pandas. It lets you quickly apply a function to each group within the data. Let's begin with an example dataset of sales with four columns: the day of the sale, the store of the sale, the product number, and the amount of sales.

In [1]:
import pandas as pd

#Create the data
sales = pd.DataFrame([[1, 1, 1, 10],
 [1, 1, 2, 74],
 [1, 1, 3, 27],
 [1, 2, 1, 41],
 [1, 2, 2, 66],
 [1, 2, 3, 95],
 [2, 1, 1, 1],
 [2, 1, 2, 23],
 [2, 1, 3, 67],
 [2, 2, 1, 86],
 [2, 2, 2, 87],
 [2, 2, 3, 19],
 [3, 1, 1, 6],
 [3, 1, 2, 30],
 [3, 1, 3, 55],
 [3, 2, 1, 68],
 [3, 2, 2, 32],
 [3, 2, 3, 50]],
columns=['Day', 'Store', 'Product', 'Sales'])

print(sales)

    Day  Store  Product  Sales
0     1      1        1     10
1     1      1        2     74
2     1      1        3     27
3     1      2        1     41
4     1      2        2     66
5     1      2        3     95
6     2      1        1      1
7     2      1        2     23
8     2      1        3     67
9     2      2        1     86
10    2      2        2     87
11    2      2        3     19
12    3      1        1      6
13    3      1        2     30
14    3      1        3     55
15    3      2        1     68
16    3      2        2     32
17    3      2        3     50


To use groupby, we call groupby() on a dataframe and pass in the column to group by as an argument. On its own, groupby() only returns a grouping object, so we then need to call some sort of function on it. In this case we are going to work with sum(). The following syntax gives us the total of every column for each day. The grouping column also becomes the index of the result.

In [2]:
#Get the sum by day
print(sales.groupby('Day').sum())

     Store  Product  Sales
Day                       
1        9       12    313
2        9       12    283
3        9       12    241


The sales data makes sense, but for the product and store columns we aren't getting meaningful data: those numbers are just the sums of the store and product identifiers. So we may want to limit the result to the sales data. In that case, we can index into the grouped object before calling sum() so that only sales data is shown.

In [3]:
#Get the sum by day only for the sales
print(sales.groupby('Day')['Sales'].sum())

Day
1    313
2    283
3    241
Name: Sales, dtype: int64


Likewise, any of the other columns could be used instead!

In [4]:
#Get sales by product
print(sales.groupby('Product')['Sales'].sum())

#Get sales by store
print(sales.groupby('Store')['Sales'].sum())

Product
1    212
2    312
3    313
Name: Sales, dtype: int64
Store
1    293
2    544
Name: Sales, dtype: int64
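
To make it concrete what groupby is doing in the examples above, the sketch below iterates over the groups by hand and sums each one, then compares that with the built-in sum(). Iterating over a groupby object yields pairs of group key and sub-dataframe. The name manual_totals is only used for this illustration.

```python
import pandas as pd

# Same data as in the tutorial, written more compactly
sales = pd.DataFrame(
    [[1, 1, 1, 10], [1, 1, 2, 74], [1, 1, 3, 27],
     [1, 2, 1, 41], [1, 2, 2, 66], [1, 2, 3, 95],
     [2, 1, 1, 1], [2, 1, 2, 23], [2, 1, 3, 67],
     [2, 2, 1, 86], [2, 2, 2, 87], [2, 2, 3, 19],
     [3, 1, 1, 6], [3, 1, 2, 30], [3, 1, 3, 55],
     [3, 2, 1, 68], [3, 2, 2, 32], [3, 2, 3, 50]],
    columns=['Day', 'Store', 'Product', 'Sales'])

# Each iteration gives the group key and the rows belonging to that group
manual_totals = {}
for day, frame in sales.groupby('Day'):
    manual_totals[int(day)] = int(frame['Sales'].sum())

print(manual_totals)

# The built-in aggregation produces the same totals in one call
built_in = sales.groupby('Day')['Sales'].sum()
print(built_in.to_dict())
```

In practice the built-in version is the one to use; the loop is only there to show which rows end up in which group.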


## Example of Using Groupby and Join

You can combine this with a join to do something like find the percent of daily sales that each row is contributing. The steps are the following:

1. Find the total sales by day

In [5]:
#Find the daily sales
daily_sales = sales.groupby('Day')['Sales'].sum()
print(daily_sales)

Day
1    313
2    283
3    241
Name: Sales, dtype: int64


2. Give this series a name to be called when we join it.

In [6]:
#Give the name to the series
daily_sales.name = "Total Daily Sales"
print(daily_sales)

Day
1    313
2    283
3    241
Name: Total Daily Sales, dtype: int64


3. Join this data into our main dataset. Keep in mind that daily_sales is indexed by day, while our main data only has the default integer index. Instead of setting the index on our main data, we can use the keyword on to name the column that should be matched against the index of daily_sales.

In [7]:
#Join the daily sales data
sales = sales.join(daily_sales, on="Day")
print(sales)

    Day  Store  Product  Sales  Total Daily Sales
0     1      1        1     10                313
1     1      1        2     74                313
2     1      1        3     27                313
3     1      2        1     41                313
4     1      2        2     66                313
5     1      2        3     95                313
6     2      1        1      1                283
7     2      1        2     23                283
8     2      1        3     67                283
9     2      2        1     86                283
10    2      2        2     87                283
11    2      2        3     19                283
12    3      1        1      6                241
13    3      1        2     30                241
14    3      1        3     55                241
15    3      2        1     68                241
16    3      2        2     32                241
17    3      2        3     50                241


4. Create a new column which divides sales by total daily sales.

In [8]:
sales['Percent of Daily Sales'] = sales['Sales'] / sales['Total Daily Sales']
print(sales)

    Day  Store  Product  Sales  Total Daily Sales  Percent of Daily Sales
0     1      1        1     10                313                0.031949
1     1      1        2     74                313                0.236422
2     1      1        3     27                313                0.086262
3     1      2        1     41                313                0.130990
4     1      2        2     66                313                0.210863
5     1      2        3     95                313                0.303514
6     2      1        1      1                283                0.003534
7     2      1        2     23                283                0.081272
8     2      1        3     67                283                0.236749
9     2      2        1     86                283                0.303887
10    2      2        2     87                283                0.307420
11    2      2        3     19                283                0.067138
12    3      1        1      6        
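
The four steps above can be condensed into a short, self-contained sketch. It assumes the same sales data as before, uses rename() to name the series instead of setting the name attribute, and stores the result in a separate variable (joined, a name chosen here for illustration) so the original dataframe stays untouched. The last line is a sanity check: the shares within each day should add up to 1.

```python
import pandas as pd

# Same data as in the tutorial, written more compactly
sales = pd.DataFrame(
    [[1, 1, 1, 10], [1, 1, 2, 74], [1, 1, 3, 27],
     [1, 2, 1, 41], [1, 2, 2, 66], [1, 2, 3, 95],
     [2, 1, 1, 1], [2, 1, 2, 23], [2, 1, 3, 67],
     [2, 2, 1, 86], [2, 2, 2, 87], [2, 2, 3, 19],
     [3, 1, 1, 6], [3, 1, 2, 30], [3, 1, 3, 55],
     [3, 2, 1, 68], [3, 2, 2, 32], [3, 2, 3, 50]],
    columns=['Day', 'Store', 'Product', 'Sales'])

# Steps 1 and 2: total sales per day, named so the joined column gets a label
daily_sales = sales.groupby('Day')['Sales'].sum().rename('Total Daily Sales')

# Step 3: match the 'Day' column against the index of daily_sales
joined = sales.join(daily_sales, on='Day')

# Step 4: each row's share of its day's total
joined['Percent of Daily Sales'] = joined['Sales'] / joined['Total Daily Sales']

# Sanity check: shares within each day should sum to 1
print(joined.groupby('Day')['Percent of Daily Sales'].sum())
```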

## Multi-index Grouping

You are not limited to a single column for grouping; you can also pass in a list to group by multiple columns. Let's take an example by bringing back the original dataset. One thing to keep in mind is that you will get back data with a multi-index based on these groupings.

In [9]:
#Create the data
sales = pd.DataFrame([[1, 1, 1, 10],
 [1, 1, 2, 74],
 [1, 1, 3, 27],
 [1, 2, 1, 41],
 [1, 2, 2, 66],
 [1, 2, 3, 95],
 [2, 1, 1, 1],
 [2, 1, 2, 23],
 [2, 1, 3, 67],
 [2, 2, 1, 86],
 [2, 2, 2, 87],
 [2, 2, 3, 19],
 [3, 1, 1, 6],
 [3, 1, 2, 30],
 [3, 1, 3, 55],
 [3, 2, 1, 68],
 [3, 2, 2, 32],
 [3, 2, 3, 50]],
columns=['Day', 'Store', 'Product', 'Sales'])

print(sales)

    Day  Store  Product  Sales
0     1      1        1     10
1     1      1        2     74
2     1      1        3     27
3     1      2        1     41
4     1      2        2     66
5     1      2        3     95
6     2      1        1      1
7     2      1        2     23
8     2      1        3     67
9     2      2        1     86
10    2      2        2     87
11    2      2        3     19
12    3      1        1      6
13    3      1        2     30
14    3      1        3     55
15    3      2        1     68
16    3      2        2     32
17    3      2        3     50


Now what if we wanted to find the total sales by day and product? We can use both of them in grouping like below:

In [10]:
#Group by day and product
group_data = sales.groupby(['Day', 'Product'])['Sales'].sum()
print(group_data)

Day  Product
1    1           51
     2          140
     3          122
2    1           87
     2          110
     3           86
3    1           74
     2           62
     3          105
Name: Sales, dtype: int64


You might also want something like average sales for each store-day combo.

In [11]:
group_data = sales.groupby(['Store', 'Day'])['Sales'].mean()
print(group_data)

Store  Day
1      1      37.000000
       2      30.333333
       3      30.333333
2      1      67.333333
       2      64.000000
       3      50.000000
Name: Sales, dtype: float64
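
Since the result carries a multi-index, it helps to know a few ways of getting values back out of it. The sketch below, which assumes the same sales data, shows three common options: selecting with loc, pivoting one level into columns with unstack(), and flattening back to ordinary columns with reset_index(). The names wide and flat are just illustrative.

```python
import pandas as pd

# Same data as in the tutorial, written more compactly
sales = pd.DataFrame(
    [[1, 1, 1, 10], [1, 1, 2, 74], [1, 1, 3, 27],
     [1, 2, 1, 41], [1, 2, 2, 66], [1, 2, 3, 95],
     [2, 1, 1, 1], [2, 1, 2, 23], [2, 1, 3, 67],
     [2, 2, 1, 86], [2, 2, 2, 87], [2, 2, 3, 19],
     [3, 1, 1, 6], [3, 1, 2, 30], [3, 1, 3, 55],
     [3, 2, 1, 68], [3, 2, 2, 32], [3, 2, 3, 50]],
    columns=['Day', 'Store', 'Product', 'Sales'])

group_data = sales.groupby(['Day', 'Product'])['Sales'].sum()

# A tuple selects one (Day, Product) combination: product 2 on day 1
print(group_data.loc[(1, 2)])

# Giving only the first level returns every product for that day
print(group_data.loc[1])

# Move the Product level into columns: one row per day, one column per product
wide = group_data.unstack('Product')
print(wide)

# Turn the index levels back into regular columns
flat = group_data.reset_index()
print(flat)
```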


## Using Groupby to Apply a Function

You are not limited to the built-in functions; you can also specify a function of your own to apply to each group. This is a very powerful piece of functionality that I recommend mastering, as it can save lots of time. Take the example that we did for percent of daily sales: there is a way to do this using the apply function. To begin with, we need to create a function that can carry out the calculation.

The first step is to get an example of one slice of the group. Define the grouping object below.

In [12]:
#Create the group_obj
group_obj = sales.groupby('Day')

There is an attribute, groups, within the object that denotes which indices belong to which group.

In [13]:
#Print the groups
print(group_obj.groups)

{1: Int64Index([0, 1, 2, 3, 4, 5], dtype='int64'), 2: Int64Index([6, 7, 8, 9, 10, 11], dtype='int64'), 3: Int64Index([12, 13, 14, 15, 16, 17], dtype='int64')}


By calling get_group() with the key for the group that you want, you can get a group sample!

In [14]:
#Get the sample group
sample_group = group_obj.get_group(1)
print(sample_group)

   Day  Store  Product  Sales
0    1      1        1     10
1    1      1        2     74
2    1      1        3     27
3    1      2        1     41
4    1      2        2     66
5    1      2        3     95


Now to build the function. It needs to take one parameter, which is the data passed in for each group. Let's call it g. From there we can index into the Sales column and divide it by its sum!

In [15]:
#Create the function
def find_pct(g):
    pct = g['Sales'] / g['Sales'].sum()
    return pct

#Test our function
print(find_pct(sample_group))

0    0.031949
1    0.236422
2    0.086262
3    0.130990
4    0.210863
5    0.303514
Name: Sales, dtype: float64


Finally, to run this, we do groupby and then call apply passing this function.

In [16]:
#Apply the function
print(sales.groupby("Day").apply(find_pct))

Day    
1    0     0.031949
     1     0.236422
     2     0.086262
     3     0.130990
     4     0.210863
     5     0.303514
2    6     0.003534
     7     0.081272
     8     0.236749
     9     0.303887
     10    0.307420
     11    0.067138
3    12    0.024896
     13    0.124481
     14    0.228216
     15    0.282158
     16    0.132780
     17    0.207469
Name: Sales, dtype: float64


One modification you may want to make is to get rid of the grouping column as the index. This can be done by setting group_keys=False in the groupby.

In [17]:
#Apply the function without the day index
print(sales.groupby("Day", group_keys=False).apply(find_pct))

0     0.031949
1     0.236422
2     0.086262
3     0.130990
4     0.210863
5     0.303514
6     0.003534
7     0.081272
8     0.236749
9     0.303887
10    0.307420
11    0.067138
12    0.024896
13    0.124481
14    0.228216
15    0.282158
16    0.132780
17    0.207469
Name: Sales, dtype: float64


Because the result keeps the original row index, you can add it to the dataframe by creating a column and setting its values equal to it.

In [18]:
#Create the new column
sales["Percent Daily Sales"] = sales.groupby("Day", group_keys=False).apply(find_pct)
print(sales)

    Day  Store  Product  Sales  Percent Daily Sales
0     1      1        1     10             0.031949
1     1      1        2     74             0.236422
2     1      1        3     27             0.086262
3     1      2        1     41             0.130990
4     1      2        2     66             0.210863
5     1      2        3     95             0.303514
6     2      1        1      1             0.003534
7     2      1        2     23             0.081272
8     2      1        3     67             0.236749
9     2      2        1     86             0.303887
10    2      2        2     87             0.307420
11    2      2        3     19             0.067138
12    3      1        1      6             0.024896
13    3      1        2     30             0.124481
14    3      1        3     55             0.228216
15    3      2        1     68             0.282158
16    3      2        2     32             0.132780
17    3      2        3     50             0.207469
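
For this particular calculation there is also a shorter route that skips the custom function. As a sketch, assuming the same sales data, transform('sum') returns the group total repeated on every row of that group, aligned with the original index, so the division can be done directly. This is an alternative to the apply approach above, not a replacement for learning it; apply remains the more general tool.

```python
import pandas as pd

# Same data as in the tutorial, written more compactly
sales = pd.DataFrame(
    [[1, 1, 1, 10], [1, 1, 2, 74], [1, 1, 3, 27],
     [1, 2, 1, 41], [1, 2, 2, 66], [1, 2, 3, 95],
     [2, 1, 1, 1], [2, 1, 2, 23], [2, 1, 3, 67],
     [2, 2, 1, 86], [2, 2, 2, 87], [2, 2, 3, 19],
     [3, 1, 1, 6], [3, 1, 2, 30], [3, 1, 3, 55],
     [3, 2, 1, 68], [3, 2, 2, 32], [3, 2, 3, 50]],
    columns=['Day', 'Store', 'Product', 'Sales'])

# One total per row, repeated within each day and aligned to the original index
daily_total = sales.groupby('Day')['Sales'].transform('sum')

# Divide row by row; no join or custom function needed
sales['Percent Daily Sales'] = sales['Sales'] / daily_total
print(sales)
```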
