# Grouped operations are a powerful way to aggregate, transform, and filter data. They rely on the mantra of “split–apply–combine”:
# 
# 1. Data is split into separate parts based on key(s).
# 2. A function is applied to each part of the data.
# 3. The results from each part are combined to create a new data set.

# The split–apply–combine concept is also heavily used in “big data” systems built on distributed computing: the data is split into independent parts, each part is dispatched to a separate server where a function is applied, and the results are then combined.

# The techniques shown in this chapter can all be done without using the .groupby() method. For example:
# 
# - Aggregation can be done by using conditional subsetting on a dataframe
# - Transformation can be done by passing a column into a separate function
# - Filtering can be done with conditional subsetting
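
The three split–apply–combine steps can be sketched by hand with conditional subsetting and compared against .groupby(). This is a minimal illustration on a small made-up dataframe (the name toy and its values are assumptions, not taken from the gapminder data).

```python
import pandas as pd

# toy data (hypothetical values, not from gapminder)
toy = pd.DataFrame({
    "year": [1952, 1952, 1957, 1957],
    "lifeExp": [30.0, 50.0, 40.0, 60.0],
})

# split: one subset per key; apply: take the mean of each subset
parts = {}
for yr in toy["year"].unique():
    subset = toy.loc[toy["year"] == yr, "lifeExp"]
    parts[int(yr)] = subset.mean()

# combine: collect the per-key results into a single Series
manual = pd.Series(parts, name="lifeExp")
manual.index.name = "year"

# the same calculation in one step with .groupby()
grouped = toy.groupby("year")["lifeExp"].mean()

print(manual)
print(grouped)
```

Both printouts should show the same per-year means; .groupby() simply saves writing the loop.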

# Aggregate

# Basic One-Variable Grouped Aggregation

In [1]:
import pandas as pd
df = pd.read_csv('gapminder.tsv', sep='\t')

# calculate the average life expectancy for each year
avg_life_exp_by_year = df.groupby('year')["lifeExp"].mean()

print(avg_life_exp_by_year)

year
1952    49.057620
1957    51.507401
1962    53.609249
1967    55.678290
1972    57.647386
1977    59.570157
1982    61.533197
1987    63.212613
1992    64.160338
1997    65.014676
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64


In [2]:
# get a list of unique years in the data
years = df.year.unique()
print(years)

[1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007]


In [3]:
# subset the data for the year 1952
y1952 = df.loc[df.year == 1952, :]
print(y1952)

                 country continent  year  lifeExp       pop    gdpPercap
0            Afghanistan      Asia  1952   28.801   8425333   779.445314
12               Albania    Europe  1952   55.230   1282697  1601.056136
24               Algeria    Africa  1952   43.077   9279525  2449.008185
36                Angola    Africa  1952   30.015   4232095  3520.610273
48             Argentina  Americas  1952   62.485  17876956  5911.315053
...                  ...       ...   ...      ...       ...          ...
1644             Vietnam      Asia  1952   40.412  26246839   605.066492
1656  West Bank and Gaza      Asia  1952   43.160   1030585  1515.592329
1668         Yemen, Rep.      Asia  1952   32.548   4963829   781.717576
1680              Zambia    Africa  1952   42.038   2672000  1147.388831
1692            Zimbabwe    Africa  1952   48.451   3080907   406.884115

[142 rows x 6 columns]


In [4]:
y1952_mean = y1952["lifeExp"].mean()
print(y1952_mean)

49.057619718309866


In [5]:
# group by continent and describe each group
continent_describe = df.groupby('continent')["lifeExp"].describe()
print(continent_describe)

           count       mean        std     min       25%      50%       75%  \
continent                                                                     
Africa     624.0  48.865330   9.150210  23.599  42.37250  47.7920  54.41150   
Americas   300.0  64.658737   9.345088  37.579  58.41000  67.0480  71.69950   
Asia       396.0  60.064903  11.864532  28.801  51.42625  61.7915  69.50525   
Europe     360.0  71.903686   5.433178  43.585  69.57000  72.2410  75.45050   
Oceania     24.0  74.326208   3.795611  69.120  71.20500  73.6650  77.55250   

              max  
continent          
Africa     76.442  
Americas   80.653  
Asia       82.603  
Europe     81.757  
Oceania    81.235  


# Aggregation Functions
The .agg() method is an alias for .aggregate(). The Pandas documentation suggests you use the alias, .agg(), over the fully spelled out method.

In [6]:
import numpy as np

# calculate the average life expectancy by continent
# but use the np.mean function
cont_le_agg = df.groupby('continent')["lifeExp"].agg(np.mean)

print(cont_le_agg)

continent
Africa      48.865330
Americas    64.658737
Asia        60.064903
Europe      71.903686
Oceania     74.326208
Name: lifeExp, dtype: float64



# Note
When we pass a function to .agg(), we need only the function object itself; we do not “call” the function. That’s why we write np.mean and not np.mean(). This is similar to when we called .apply() in
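
As a rough sketch of the point in the note, the example below passes an uncalled function object to .agg(), then shows two other forms that .agg() accepts: a string naming a built-in aggregation and a list of functions. The dataframe toy and the helper value_range are hypothetical names made up for this illustration.

```python
import pandas as pd

# toy data (hypothetical values)
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "lifeExp": [40.0, 60.0, 70.0, 80.0],
})


def value_range(values):
    """Hypothetical helper: spread between the largest and smallest value."""
    return values.max() - values.min()


by_cont = toy.groupby("continent")["lifeExp"]

# pass the function object itself (no parentheses)
ranges = by_cont.agg(value_range)

# built-in aggregations can also be named with a string
means = by_cont.agg("mean")

# a list of functions returns one column per function
summary = by_cont.agg(["mean", value_range])
print(summary)
```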

# Transform
When we transform data, we pass values from our dataframe into a function. The function then “transforms” the data. Unlike .agg(), which can take multiple values and return a single (aggregated) value, .transform() takes multiple values and returns a one-to-one transformation of the values. That is, it does not reduce the amount of data.
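
The difference in output size can be seen in a small sketch. The dataframe toy below is made up for illustration; the same built-in function is used once with .agg() and once with .transform().

```python
import pandas as pd

# toy data (hypothetical values)
toy = pd.DataFrame({
    "year": [1952, 1952, 1957, 1957, 1957],
    "lifeExp": [30.0, 50.0, 40.0, 60.0, 80.0],
})
by_year = toy.groupby("year")["lifeExp"]

agg_mean = by_year.agg("mean")        # one value per group
tr_mean = by_year.transform("mean")   # one value per original row

print(agg_mean.shape, tr_mean.shape)
```

The transformed result keeps the index of the original dataframe, which is why it can be assigned straight back as a new column.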

# Z-Score Example
# 
# Let’s calculate the z-score of our life expectancy data by year. The z-score of a value is the number of standard deviations that value lies from the mean of the data. Converting values to z-scores centers the data around 0, with a standard deviation of 1. This standardization makes it easier to compare variables that are measured in different units.
# 
# Here’s the formula for calculating z-score:

In [7]:
# - x is a data point in our data set
# - µ is the average of our data set, as calculated by Equation 8.1
# - σ is the standard deviation, as calculated by Equation 8.3

![Text](img.png)

![Text2](img_1.png)
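
In text form, the z-score described above is conventionally written as shown below. This is the standard definition, using the symbols x, µ, and σ from the list above; it is offered as a reading aid and is not a transcription of the equation images.

```latex
z = \frac{x - \mu}{\sigma}
```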

In [8]:
def my_zscore(x):
  '''Calculates the z-score of provided data
  'x' is a vector or series of values
  '''
  return (x - x.mean()) / x.std()

In [9]:
transform_z = df.groupby('year')["lifeExp"].transform(my_zscore)

print(transform_z)

0      -1.656854
1      -1.731249
2      -1.786543
3      -1.848157
4      -1.894173
          ...   
1699   -0.081621
1700   -0.336974
1701   -1.574962
1702   -2.093346
1703   -1.948180
Name: lifeExp, Length: 1704, dtype: float64


In [10]:
# note the number of rows in our data
print(df.shape)

(1704, 6)


In [11]:
# note the number of values in our transformation
print(transform_z.shape)

(1704,)


In [12]:
from scipy.stats import zscore

# calculate a grouped zscore

sp_z_grouped = df.groupby('year')["lifeExp"].transform(zscore)

# calculate a nongrouped zscore
sp_z_nogroup = zscore(df["lifeExp"])

In [13]:
# grouped z-score
print(transform_z.head())

0   -1.656854
1   -1.731249
2   -1.786543
3   -1.848157
4   -1.894173
Name: lifeExp, dtype: float64


In [14]:
# grouped z-score using scipy
print(sp_z_grouped.head())

0   -1.662719
1   -1.737377
2   -1.792867
3   -1.854699
4   -1.900878
Name: lifeExp, dtype: float64


In [15]:
# nongrouped z-score
print(sp_z_nogroup[:5])

0   -2.375334
1   -2.256774
2   -2.127837
3   -1.971178
4   -1.811033
Name: lifeExp, dtype: float64
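
The small gap between the two grouped results (In [13] and In [14]) comes from the degrees of freedom used for the standard deviation: the pandas .std() method defaults to ddof=1, while scipy.stats.zscore defaults to ddof=0. The sketch below reproduces the effect on a made-up series using pandas and NumPy only; x is an assumed toy input.

```python
import numpy as np
import pandas as pd

# toy data (hypothetical values)
x = pd.Series([2.0, 4.0, 4.0, 6.0])

# pandas .std() defaults to ddof=1 (sample standard deviation)
z_sample = (x - x.mean()) / x.std()

# ddof=0 (population standard deviation) is the scipy.stats.zscore default
z_population = (x - x.mean()) / x.std(ddof=0)

# the two versions differ by a constant factor of sqrt(n / (n - 1))
n = len(x)
print(np.allclose(z_population, z_sample * np.sqrt(n / (n - 1))))
```

With the 142 rows in each year of the gapminder data, that factor is about 1.0035, which matches the difference between the two grouped outputs above.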


# Missing Value Example

In [21]:
import seaborn as sns
import numpy as np

# set the seed so results are deterministic
np.random.seed(42)

# sample 10 rows from tips
tips_10 = sns.load_dataset("tips").sample(10)

# randomly pick 4 'total_bill' values and turn them into missing
tips_10.loc[
    np.random.permutation(tips_10.index)[:4],
    "total_bill"
] = np.nan

print(tips_10)

     total_bill   tip     sex smoker   day    time  size
24        19.82  3.18    Male     No   Sat  Dinner     2
6          8.77  2.00    Male     No   Sun  Dinner     2
153         NaN  2.00    Male     No   Sun  Dinner     4
211         NaN  5.16    Male    Yes   Sat  Dinner     4
198         NaN  2.00  Female    Yes  Thur   Lunch     2
176         NaN  2.00    Male    Yes   Sun  Dinner     2
192       28.44  2.56    Male    Yes  Thur   Lunch     2
124       12.48  2.52  Female     No  Thur   Lunch     2
9         14.78  3.23    Male     No   Sun  Dinner     2
101       15.38  3.00  Female    Yes   Fri  Dinner     2


In [22]:
count_sex = tips_10.groupby('sex').count()
print(count_sex)

        total_bill  tip  smoker  day  time  size
sex                                             
Male             4    7       7    7     7     7
Female           2    3       3    3     3     3




In [23]:
def fill_na_mean(x):
  """Fills missing values in a vector with the vector's mean"""
  avg = x.mean()
  return x.fillna(avg)


# fill missing 'total_bill' values with the mean of each 'sex' group
total_bill_group_mean = (
  tips_10
  .groupby("sex")
  .total_bill
  .transform(fill_na_mean)
)

# assign to a new column in the original data
# you can also replace the original column by assigning to 'total_bill'
tips_10["fill_total_bill"] = total_bill_group_mean

print(tips_10[['sex', 'total_bill', 'fill_total_bill']])

        sex  total_bill  fill_total_bill
24     Male       19.82          19.8200
6      Male        8.77           8.7700
153    Male         NaN          17.9525
211    Male         NaN          17.9525
198  Female         NaN          13.9300
176    Male         NaN          17.9525
192    Male       28.44          28.4400
124  Female       12.48          12.4800
9      Male       14.78          14.7800
101  Female       15.38          15.3800




# Filter

In [24]:
# load the tips data set
tips = sns.load_dataset('tips')

# note the number of rows in the original data
print(tips.shape)

(244, 7)


In [25]:
# look at the frequency counts for the table size
print(tips['size'].value_counts())

size
2    156
3     38
4     37
5      5
1      4
6      4
Name: count, dtype: int64


In [26]:
# filter the data such that each group has at least 30 observations
tips_filtered = (
  tips
  .groupby("size")
  .filter(lambda x: x["size"].count() >= 30)
)

In [27]:
print(tips_filtered.shape)

(231, 7)


In [28]:
print(tips_filtered['size'].value_counts())

size
2    156
3     38
4     37
Name: count, dtype: int64
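
As a sketch of how the group-level condition works, the example below applies .filter() to a made-up dataframe, then builds the same result with a boolean mask from .transform("size"). The name toy, its values, and the threshold of 2 are assumptions chosen to keep the example small.

```python
import pandas as pd

# toy data (hypothetical values)
toy = pd.DataFrame({
    "size": [2, 2, 2, 3, 3, 6],
    "tip": [2.0, 3.0, 2.5, 4.0, 3.5, 5.0],
})

# keep only the groups that have at least 2 rows
kept = toy.groupby("size").filter(lambda g: len(g) >= 2)

# alternative: compute each row's group length, then subset with a mask
group_len = toy.groupby("size")["tip"].transform("size")
kept_mask = toy[group_len >= 2]

print(kept)
print(kept_mask)
```

The function passed to .filter() receives each group as a dataframe and must return a single True or False for the whole group.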


# The pandas.core.groupby.DataFrameGroupBy object

# Groups

In [41]:
tips_10 = sns.load_dataset('tips').sample(10, random_state=42)
print(tips_10)

     total_bill   tip     sex smoker   day    time  size
24        19.82  3.18    Male     No   Sat  Dinner     2
6          8.77  2.00    Male     No   Sun  Dinner     2
153       24.55  2.00    Male     No   Sun  Dinner     4
211       25.89  5.16    Male    Yes   Sat  Dinner     4
198       13.00  2.00  Female    Yes  Thur   Lunch     2
176       17.89  2.00    Male    Yes   Sun  Dinner     2
192       28.44  2.56    Male    Yes  Thur   Lunch     2
124       12.48  2.52  Female     No  Thur   Lunch     2
9         14.78  3.23    Male     No   Sun  Dinner     2
101       15.38  3.00  Female    Yes   Fri  Dinner     2


In [53]:
# save just the grouped object
grouped = tips_10.groupby('sex', observed=True)

# note that we just get back the object and its memory location
print(grouped)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000280662A9040>


In [54]:
# see the actual groups of the groupby
# it returns only the index
print(grouped.groups)

{'Male': [24, 6, 153, 211, 176, 192, 9], 'Female': [198, 124, 101]}


# Even when we ask for the groups from our grouped object, we get only the index of the dataframe back. Think of this index as indicating the row numbers. It is intended mainly to optimize performance. Again, we haven’t calculated anything yet.
# 
# This approach does allow you to save just the grouped result. You could then perform multiple .agg(), .transform(), or .filter() operations without having to process the .groupby() statement again.
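
A minimal sketch of reusing a saved grouped object, on a made-up dataframe (toy and its values are assumptions). It also shows that iterating over the grouped object yields each key together with its part of the data.

```python
import pandas as pd

# toy data (hypothetical values)
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female"],
    "tip": [3.0, 2.0, 5.0, 4.0],
})

# create the grouped object once
grouped = toy.groupby("sex")

# reuse it for several operations
tip_mean = grouped["tip"].agg("mean")
tip_centered = grouped["tip"].transform(lambda s: s - s.mean())

# iterating yields (key, sub-dataframe) pairs
for key, part in grouped:
    print(key, part.shape)
```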

# Group Calculations Involving Multiple Variables

In [55]:
# calculate the mean on relevant columns
avgs = grouped.mean(numeric_only=True)
print(avgs)

        total_bill       tip      size
sex                                   
Male         20.02  2.875714  2.571429
Female       13.62  2.506667  2.000000


In [56]:
# get the 'Female' group
female = grouped.get_group('Female')
print(female)

     total_bill   tip     sex smoker   day    time  size
198       13.00  2.00  Female    Yes  Thur   Lunch     2
124       12.48  2.52  Female     No  Thur   Lunch     2
101       15.38  3.00  Female    Yes   Fri  Dinner     2


# Multiple Groups

In [59]:
# mean by sex and time
bill_sex_time = tips_10.groupby(['sex', 'time'], observed=True)

group_avg = bill_sex_time.mean(numeric_only=True)

In [60]:
# type of the group_avg
print(type(group_avg))

<class 'pandas.core.frame.DataFrame'>


In [61]:
print(group_avg.columns)

Index(['total_bill', 'tip', 'size'], dtype='object')


In [62]:
print(group_avg.index)

MultiIndex([(  'Male',  'Lunch'),
            (  'Male', 'Dinner'),
            ('Female',  'Lunch'),
            ('Female', 'Dinner')],
           names=['sex', 'time'])


In [65]:
group_method = (
  tips_10
  .groupby(['sex', 'time'], observed=False)
  .mean(numeric_only=True)
  .reset_index()
)
print(group_method)

      sex    time  total_bill       tip      size
0    Male   Lunch   28.440000  2.560000  2.000000
1    Male  Dinner   18.616667  2.928333  2.666667
2  Female   Lunch   12.740000  2.260000  2.000000
3  Female  Dinner   15.380000  3.000000  2.000000
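
Besides calling .reset_index() after the aggregation, .groupby() also accepts as_index=False, which keeps the grouping keys as regular columns from the start. The sketch below compares the two on a made-up dataframe (toy and its values are assumptions).

```python
import pandas as pd

# toy data (hypothetical values)
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female"],
    "time": ["Lunch", "Dinner", "Lunch", "Lunch"],
    "tip": [2.0, 3.0, 1.0, 3.0],
})

# option 1: aggregate, then move the index levels back into columns
flat_a = toy.groupby(["sex", "time"])["tip"].mean().reset_index()

# option 2: ask groupby not to build the index in the first place
flat_b = toy.groupby(["sex", "time"], as_index=False)["tip"].mean()

print(flat_a)
print(flat_b)
```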


# Working With a MultiIndex

In [66]:
intv_df = pd.read_csv('epi_sim.zip')

print(intv_df)

         ig_type  intervened        pid  rep  sid        tr
0              3          40  294524448    1  201  0.000135
1              3          40  294571037    1  201  0.000135
2              3          40  290699504    1  201  0.000135
3              3          40  288354895    1  201  0.000135
4              3          40  292271290    1  201  0.000135
...          ...         ...        ...  ...  ...       ...
9434648        2          87  345636694    2  201  0.000166
9434649        3          87  295125214    2  201  0.000166
9434650        2          89  292571119    2  201  0.000166
9434651        3          89  292528142    2  201  0.000166
9434652        2          95  291956763    2  201  0.000166

[9434653 rows x 6 columns]


In [67]:
# The data set includes six columns:
# 
# - ig_type: edge type (type of relationship between two nodes in the network, such as “school” and “work”)
# - intervened: time in the simulation at which an intervention occurred for a given person (pid)
# - pid: simulated person’s ID number
# - rep: replication run (each set of simulation parameters was run multiple times)
# - sid: simulation ID
# - tr: transmissibility value of the influenza virus

In [69]:
count_only = (
  intv_df
  .groupby(["rep", "intervened", "tr"])
  ["ig_type"]
  .count()
)

print(count_only)

rep  intervened  tr      
0    8           0.000166    1
     9           0.000152    3
                 0.000166    1
     10          0.000152    1
                 0.000166    1
                            ..
2    193         0.000135    1
                 0.000152    1
     195         0.000135    1
     198         0.000166    1
     199         0.000135    1
Name: ig_type, Length: 1196, dtype: int64
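
A result with a MultiIndex like the one above can be handled in a few common ways: selecting by an outer-level key, flattening the index into columns, or pivoting an inner level into columns. The sketch below uses a small made-up dataframe with some of the same column names (toy and its values are assumptions, not the epi_sim data).

```python
import pandas as pd

# toy data (hypothetical values)
toy = pd.DataFrame({
    "rep": [0, 0, 0, 1, 1],
    "intervened": [8, 9, 9, 8, 8],
    "ig_type": [3, 2, 3, 3, 2],
})

counts = toy.groupby(["rep", "intervened"])["ig_type"].count()

# select one outer-level key; the result is indexed by 'intervened'
rep0 = counts.loc[0]

# flatten the index levels into regular columns
flat = counts.reset_index()

# pivot the inner level into columns (missing combinations become NaN)
wide = counts.unstack("intervened")
print(wide)
```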