groupby() takes a df, splits it into chunks based on key values, applies a computation to each chunk, then combines the results back together into another df. In pandas this is referred to as the split-apply-combine pattern.
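
As a minimal sketch of the three steps, the toy frame below (hypothetical data, with made-up column names `state` and `population`) computes a per-state mean twice: once by looping over the groups and collecting the results by hand, and once with a single groupby() call.

```python
import pandas as pd

# Hypothetical toy data: three counties spread over two states
toy = pd.DataFrame({
    "state": ["A", "A", "B"],
    "population": [10, 30, 50],
})

# Split: one sub-frame per state. Apply: take the mean of each sub-frame.
# Combine: collect the per-state results into a single Series.
manual = pd.Series(
    {key: frame["population"].mean() for key, frame in toy.groupby("state")}
)

# The same three steps expressed as one chained call
grouped = toy.groupby("state")["population"].mean()

print(manual)
print(grouped)
```

Both results hold one value per state, which is the combine step the rest of this notebook builds on.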

# Splitting

In [1]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('path/to/dataset.csv')
# Exclude state level summarizations which have a sum level value of 40
df = df[df['SUMLEV']==50]
df.head()

In [None]:
# For the first example, without groupby(), let's use census data. Get a list of unique states, then iterate over
# the states and for each state we reduce the df and calculate the average.

In [None]:
for state in df['STNAME'].unique():
    # We'll just calculate the avg using np
    # Select only this state's rows with a boolean mask, then average the population column
    avg = np.average(df.loc[df['STNAME'] == state, 'CENSUS2010POP'])
    print('Counties in state ' + state + ' have an average population of ' + str(avg))

In [2]:
# Let's try a second approach using groupby()

In [None]:
for group, frame in df.groupby('STNAME'):
    # 1. Split step
    # Iterating over the groupby object yields tuples, where the first value is the key we are grouping by,
    # in this case a specific state name, and the second is the subset of the df that belongs to that group.
    
    # 2. Apply step
    # Calculate the avg of CENSUS2010POP
    avg = np.average(frame['CENSUS2010POP'])
    
    # And print the result
    print('Counties in state ' + group + ' have an average population of ' + str(avg))
    
    # 3. Combine step
    # We don't have to worry about the combine step in this case because the only thing we do with each
    # result is print it.
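
To inspect the split step without writing a loop, a groupby object can also be queried directly. The sketch below assumes a small made-up frame that reuses the census column names; the county names and populations are placeholders, not real data.

```python
import pandas as pd

# Hypothetical toy frame reusing the column names from the census example
toy = pd.DataFrame({
    "STNAME": ["Alabama", "Alabama", "Wyoming"],
    "CTYNAME": ["County 1", "County 2", "County 3"],
    "CENSUS2010POP": [100, 300, 50],
})

grouped = toy.groupby("STNAME")

# Number of groups produced by the split
n_groups = grouped.ngroups

# Pull out the sub-frame for a single key without looping
alabama = grouped.get_group("Alabama")

# Iterating yields (key, sub-frame) pairs, as in the loop above
sizes = {key: len(frame) for key, frame in grouped}

print(n_groups)
print(alabama)
print(sizes)
```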

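Most of the time you'll use groupby() on one or more columns. But you can also pass a function to groupby() and use that to segment your data: the function is called on each index label, and its return value decides which group the row falls into.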

In [None]:
df = df.set_index('STNAME')

def set_batch_number(item):
    if item[0] < 'M':
        return 0
    if item[0] < 'Q':
        return 1
    return 2

# Group the df by batch number and loop through each batch group.
for group, frame in df.groupby(set_batch_number):
    print('There are ' + str(len(frame)) + ' records in group ' + str(group) + ' for processing.')
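
A self-contained version of the same idea, small enough to check by hand, might look like the sketch below. The four index labels and the helper name `first_letter_batch` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy frame indexed by state name
toy = pd.DataFrame(
    {"CENSUS2010POP": [10, 20, 30, 40]},
    index=pd.Index(["Alabama", "Montana", "Texas", "Ohio"], name="STNAME"),
)

def first_letter_batch(label):
    # groupby() calls this once per index label (here, a state name)
    if label[0] < "M":
        return 0
    if label[0] < "Q":
        return 1
    return 2

# size() counts the rows that landed in each batch
batch_sizes = toy.groupby(first_letter_batch).size()
print(batch_sizes)
```

Alabama falls in batch 0, Montana and Ohio in batch 1, and Texas in batch 2.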

One more example of groupby(), this time using a dataset of housing data from Airbnb. In this dataset there are two columns of interest, cancellation_policy and review_scores_value.

In [None]:
df = pd.read_csv('path/to/dataset.csv')
# So, how would I groupby both of the columns? A first approach might be to promote them to a multi-index
# and just call groupby()
df = df.set_index(['cancellation_policy', 'review_scores_value'])

# When we have a multi-index we need to pass in the levels we are interested in grouping by.
for group, frame in df.groupby(level=(0,1)):
    print(group)

In [None]:
# This works ok. But what if we wanted to groupby the cancellation policy and review scores, but
# separate out all the 10's from those under 10? In this case we could use a function to manage the 
# groupings.
def grouping_fun(item):
    # Check the "review_scores_value" portion of the index. 
    # item is in the format of (cancellation_policy, review_scores_value), a tuple
    if item[1] == 10.0:
        return (item[0], "10.0")
    else:
        return (item[0], "not 10.0")
    
for group, frame in df.groupby(by=grouping_fun):
    print(group)
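
An alternative that avoids both the multi-index and the custom function is to group by the column together with a derived boolean key. This is only a sketch on made-up rows, and the key name `is_ten` is an assumption; it relies on groupby() accepting a list that mixes column names and Series.

```python
import pandas as pd

# Hypothetical toy rows with the two columns of interest
toy = pd.DataFrame({
    "cancellation_policy": ["flexible", "flexible", "strict", "strict"],
    "review_scores_value": [10.0, 9.0, 10.0, 10.0],
})

# Boolean key: True where the score is exactly 10, False otherwise
is_ten = toy["review_scores_value"].eq(10.0).rename("is_ten")

# Group by the policy column and the derived key at the same time
counts = toy.groupby(["cancellation_policy", is_ten]).size()
print(counts)
```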
        

# Aggregation

The most straightforward apply step is aggregating the data, which is done with the agg() method on the groupby object.
Above, we simply iterated over the groupby object, unpacking each item into a label (the group name) and a df. With agg() we can instead pass in a dict that maps each column we want to aggregate to the function (or functions) to apply to it.

In [None]:
df = df.reset_index()

# Let's group by cancellation policy and find the avg review scores value by group
df.groupby('cancellation_policy').agg({'review_scores_value':np.nanmean})

In [None]:
# We can just extend this dictionary to aggregate by multiple functions or multiple columns
df.groupby('cancellation_policy').agg({'review_scores_value':(np.nanmean,np.nanstd),
                                       'reviews_per_month':np.nanmean})
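
The dict form produces nested column labels when several functions are applied to one column. As a hedged alternative, agg() also accepts keyword arguments of the form `new_name=(column, function)`, which gives flat, readable column names. The toy values and output names (`mean_score`, `n_scores`, `mean_reviews`) below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, including one missing review score
toy = pd.DataFrame({
    "cancellation_policy": ["flexible", "flexible", "strict", "strict"],
    "review_scores_value": [10.0, 8.0, 9.0, np.nan],
    "reviews_per_month": [1.0, 3.0, 2.0, 4.0],
})

# Each keyword names an output column: (input column, aggregation)
summary = toy.groupby("cancellation_policy").agg(
    mean_score=("review_scores_value", "mean"),    # skips NaN by default
    n_scores=("review_scores_value", "count"),     # counts non-missing values
    mean_reviews=("reviews_per_month", "mean"),
)
print(summary)
```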

# Transformation

Transformation is different from aggregation. Where agg() returns a single value per column, so one row per group, transform() returns an object that is the same size as the group. Essentially, it broadcasts the function you supply over the grouped df, returning a new df. This makes combining data later quite easy. 

For instance, suppose we wanted to include the avg rating values in a given group by cancellation policy, but preserve the df shape so that we could generate a difference between an individual observation and the group mean.

In [None]:
# First let's define some subset of the columns we're interested in
cols = ['cancellation_policy', 'review_scores_value']

# Now, let's transform it. I'll store this in its own df.
transform_df = df[cols].groupby('cancellation_policy').transform(np.nanmean)
transform_df.head()

# We can join in this df since its index is the same as the original df.
# Before we do that, let's rename the column in the transformed version
transform_df.rename({'review_scores_value': 'mean_review_scores'}, axis='columns', inplace=True)

df = df.merge(transform_df, left_index=True, right_index=True)
df.head()

# Our new column, the mean review scores, is in place.
# So now we could create, for instance, the difference between a given row and
# its group (the cancellation policy) mean.

df['mean_diff'] = np.absolute(df['review_scores_value'] - df['mean_review_scores'])
df['mean_diff'].head()
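
Because transform() returns a result aligned to the original index, the rename and merge can also be skipped by assigning the result straight to a new column. A minimal sketch on made-up rows (the column name `group_mean` is an assumption):

```python
import pandas as pd

# Hypothetical toy rows, deliberately interleaved across the two policies
toy = pd.DataFrame({
    "cancellation_policy": ["flexible", "strict", "flexible", "strict", "strict"],
    "review_scores_value": [10.0, 9.0, 8.0, 9.0, 6.0],
})

# One value per original row: the mean of that row's group
toy["group_mean"] = (
    toy.groupby("cancellation_policy")["review_scores_value"].transform("mean")
)

# Absolute difference between each observation and its group mean
toy["mean_diff"] = (toy["review_scores_value"] - toy["group_mean"]).abs()
print(toy)
```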

# Filtering

The groupby object has built-in support for filtering groups as well. Often you'll want to group by some feature, then make some transformation to the groups, then drop certain groups as part of your cleaning routine. 

filter() takes a function, applies it to each group df, and expects it to return either True or False depending on whether that group should be included in the results. 

In [3]:
# For instance, if we wanted only those groups that had a mean rating above 9.2 included in our results
df.groupby('cancellation_policy').filter(lambda x: np.nanmean(x['review_scores_value']) > 9.2)

# The results are still indexed, but any of the results which were in a group with a mean review score
# of less than or equal to 9.2 were not copied over.
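
The same filter can be checked by hand on a small made-up frame. In this sketch only the 'flexible' group has a mean above 9.2, so only its rows survive, and they keep their original index labels.

```python
import pandas as pd

# Hypothetical toy rows, interleaved across the two policies
toy = pd.DataFrame({
    "cancellation_policy": ["flexible", "strict", "flexible", "strict"],
    "review_scores_value": [10.0, 9.0, 9.5, 8.0],
})

# Keep every row of a group only if that group's mean score exceeds 9.2
kept = toy.groupby("cancellation_policy").filter(
    lambda g: g["review_scores_value"].mean() > 9.2
)
print(kept)
```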


# Applying

One of the most common operations invoked on groupby objects is apply(). It lets you apply an arbitrary function to each group and stitches the per-group results back together into a single df.

In [None]:
# Let's look at an example using the Airbnb dataset
df = pd.read_csv('path/to/dataset.csv')

# Let's include some of the columns we're interested in.
df = df[['cancellation_policy', 'review_scores_value']]
df.head()

# In previous work we wanted to find the average review score of a listing and its deviation from the group mean. 
# This was a two step process. First we used transform() on the groupby object and then we had to broadcast to 
# create a new column. With apply we could wrap this logic in one place. 
def calc_mean_review_scores(group):
    # group is a df holding the rows for one value of the grouping key, e.g. one 'cancellation_policy',
    # so we can treat it as a complete df.
    avg = np.nanmean(group['review_scores_value'])
    # Despite its name, this column holds the absolute deviation from the group mean
    group['review_scores_mean'] = np.abs(avg - group['review_scores_value'])
    return group

# Now apply this to all of the groups
df.groupby('cancellation_policy').apply(calc_mean_review_scores).head()
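
A hedged variation on the same idea: selecting the score column before calling apply() means the function receives a Series per group and never touches the grouping column, which, as far as I know, sidesteps differences between pandas versions in how apply() treats that column. The toy rows and the helper name `abs_dev_from_mean` are assumptions; `group_keys=False` is used so the result keeps the original row index.

```python
import pandas as pd

# Hypothetical toy rows, interleaved across the two policies
toy = pd.DataFrame({
    "cancellation_policy": ["flexible", "strict", "flexible", "strict"],
    "review_scores_value": [10.0, 9.0, 6.0, 7.0],
})

def abs_dev_from_mean(scores):
    # scores is the Series of review scores for one cancellation policy
    return (scores - scores.mean()).abs()

deviation = (
    toy.groupby("cancellation_policy", group_keys=False)["review_scores_value"]
    .apply(abs_dev_from_mean)
)

# Assignment aligns on the index, so each row gets its own deviation
toy["mean_diff"] = deviation
print(toy)
```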

groupby is a powerful tool commonly used for data cleaning and data analysis. Once you have grouped the data by some category, you have a df of just those values and you can conduct aggregate analysis on the segments you're interested in. The groupby function follows a split-apply-combine approach.