Data aggregation refers to summarizing data with statistics such as the sum, count, average, maximum and minimum to provide a high-level view of the data. Often the data contain mutually exclusive groups that are of interest, and we want these statistics separately for each group. The Pandas DataFrame method `groupby()` splits the data into groups, after which the desired function(s) are applied to each group for groupwise data aggregation. However, `groupby()` is not limited to groupwise data aggregation; it can also be used for several other kinds of groupwise operations.

**Groupby mechanics:** (Source: https://pandas.pydata.org/docs/user_guide/groupby.html)

Group by: split-apply-combine
By *'group by'* we are referring to a process involving one or more of the following steps:

1. Splitting the data into groups based on some criteria.

2. Applying a function to each group independently.

3. Combining the results into a data structure.

Out of these, the split step is the most straightforward. In fact, in many situations we may wish to split the data set into groups and do something with those groups. In the apply step, we may wish to do one of the following:

**1. Aggregation:** compute a summary statistic (or statistics) for each group. Some examples:

  - Compute group sums or means. 
  - Compute group sizes / counts.

**2. Transformation:** perform some group-specific computations and return a like-indexed object. Some examples:

   - Standardize data *(zscore)* within a group. 
   - Fill *NAs* within groups with a value derived from each group.

**3. Filtration:** discard some groups, according to a group-wise computation that evaluates True or False. Some examples:

   - Discard data that belongs to groups with only a few members. 
   - Filter out data based on the group *sum* or *mean*.

Some combination of the above: *GroupBy* will examine the results of the *apply* step and try to return a sensibly combined result if it doesn’t fit into any of the above three categories.
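The three kinds of apply step can be sketched on a small made-up DataFrame. The `team` and `score` columns below are hypothetical and not part of the dataset used in this chapter; the sketch only assumes the `mean()`, `transform()` and `filter()` methods of the *GroupBy* object:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b", "c"],
    "score": [1.0, 3.0, 2.0, 4.0, 6.0, 5.0],
})
g = df.groupby("team")["score"]

# Aggregation: one summary value per group
group_means = g.mean()

# Transformation: result has the same index as the input
# (here a z-score computed within each group)
zscores = g.transform(lambda s: (s - s.mean()) / s.std())

# Filtration: keep only the rows of groups with at least 2 members
big_groups = df.groupby("team").filter(lambda grp: len(grp) >= 2)

print(group_means)
print(zscores)
print(big_groups)
```

Note that the aggregation returns one row per group, the transformation returns one row per original observation, and the filtration returns a subset of the original rows.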

We'll use Pandas to group the data and perform *GroupBy* operations.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## The GroupBy object

### Creating a GroupBy object: [`groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)

The Pandas DataFrame method `groupby()` is used to create a `GroupBy` object.

A string passed to `groupby()` may refer to either a column or an index level. If a string matches both a column name and an index level name, a ValueError will be raised.
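As a minimal sketch of this rule, the toy DataFrame below (hypothetical data, with an index level named `label`) is grouped once by a column name and once by an index level name. Renaming the index level to `continent` then makes the string ambiguous:

```python
import pandas as pd

# Hypothetical toy data with a named index level
df = pd.DataFrame(
    {"continent": ["Asia", "Asia", "Europe"], "lifeExp": [60.0, 70.0, 80.0]},
    index=pd.Index(["x", "y", "x"], name="label"),
)

# The string refers to a column ...
by_column = df.groupby("continent")["lifeExp"].mean()

# ... or to an index level
by_index = df.groupby("label")["lifeExp"].mean()

# If the same name is both a column and an index level, the key is ambiguous
ambiguous = df.rename_axis("continent")
try:
    ambiguous.groupby("continent")
    raised = False
except ValueError as err:
    raised = True
    print(err)
```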

**Example**: Consider the life expectancy dataset, *gdp_lifeExpectancy.csv*. Suppose we want to group the observations by `continent`.

In [3]:
data = pd.read_csv('./Datasets/gdp_lifeExpectancy.csv')
data.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106


We will pass the column `continent` as an argument to the `groupby()` method.

In [5]:
#Creating a GroupBy object
grouped = data.groupby('continent')
#This will split the data into groups that correspond to values of the column 'continent'

The `groupby()` method returns a *GroupBy* object.

In [7]:
#A 'GroupBy' object is created with the `groupby()` method
type(grouped)

pandas.core.groupby.generic.DataFrameGroupBy

The GroupBy object `grouped` stores how the observations are distributed among the groups. Each observation is assigned to a group according to its value(s) in the column(s) used for grouping. However, note that the dataset is not physically split into different DataFrames. For example, in the above case, each observation is assigned to a group depending on its value of `continent`, but all the observations are still in the same DataFrame `data`.
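One way to see the groups is to iterate over the *GroupBy* object, which yields pairs of a group name and the corresponding sub-DataFrame. The sketch below uses a hypothetical four-row DataFrame rather than the chapter's dataset:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "continent": ["Asia", "Europe", "Asia", "Africa"],
    "lifeExp": [60.0, 75.0, 65.0, 50.0],
})
grouped = df.groupby("continent")

# Iterating yields (group name, sub-DataFrame) pairs
pieces = {name: group for name, group in grouped}

for name, group in pieces.items():
    print(name, list(group.index))

# The original DataFrame still holds all the observations
print(len(df))
```

The sub-DataFrames keep the row labels of the original DataFrame, so each observation can be traced back to `df`.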

### Attributes and methods of the *GroupBy* object

#### `keys`

The object(s) grouping the data are called *key(s)*. Here `continent` is the group key. The keys of the *GroupBy* object can be seen using its `keys` attribute.

In [8]:
#Key(s) of the GroupBy object
grouped.keys

'continent'

#### [`ngroups`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.ngroup.html)
The number of groups into which the data is split based on the keys can be seen with the `ngroups` attribute.

In [9]:
#The number of groups based on the key(s)
grouped.ngroups

5

#### [`groups`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.groups.html)
The `groups` attribute of the *GroupBy* object contains the group labels (or names) and the row labels of the observations in each group, as a dictionary.

In [10]:
#The groups (in the dictionary format)
grouped.groups

{'Africa': [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, ...], 'Americas': [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 432, 433, 434, 435, ...],

The group names are the *keys* of the dictionary, while the row labels are the corresponding *values*.

In [11]:
#Group names
grouped.groups.keys()

dict_keys(['Africa', 'Americas', 'Asia', 'Europe', 'Oceania'])
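Since the `groups` dictionary maps each group name to row labels, those labels can be passed to `.loc` to retrieve the rows of a group from the original DataFrame. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "continent": ["Asia", "Europe", "Asia", "Africa"],
    "lifeExp": [60.0, 75.0, 65.0, 50.0],
})
grouped = df.groupby("continent")

# Row labels of the observations in the 'Asia' group
asia_labels = grouped.groups["Asia"]

# Use the labels to pull the rows from the original DataFrame
asia_rows = df.loc[asia_labels]
print(asia_rows)

# The number of dictionary entries equals the number of groups
print(len(grouped.groups), grouped.ngroups)
```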

In [12]:
#Group values are the row labels corresponding to a particular group
grouped.groups.values()

dict_values([Int64Index([  24,   25,   26,   27,   28,   29,   30,   31,   32,   33,
            ...
            1694, 1695, 1696, 1697, 1698, 1699, 1700, 1701, 1702, 1703],
           dtype='int64', length=624), Int64Index([  48,   49,   50,   51,   52,   53,   54,   55,   56,   57,
            ...
            1634, 1635, 1636, 1637, 1638, 1639, 1640, 1641, 1642, 1643],
           dtype='int64', length=300), Int64Index([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9,
            ...
            1670, 1671, 1672, 1673, 1674, 1675, 1676, 1677, 1678, 1679],
           dtype='int64', length=396), Int64Index([  12,   13,   14,   15,   16,   17,   18,   19,   20,   21,
            ...
            1598, 1599, 1600, 1601, 1602, 1603, 1604, 1605, 1606, 1607],
           dtype='int64', length=360), Int64Index([  60,   61,   62,   63,   64,   65,   66,   67,   68,   69,   70,
              71, 1092, 1093, 1094, 1095, 1096, 1097, 1098, 1099, 1100, 1101,
            1102, 1103],
      

#### [`size()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.size.html)
The `size()` method of the *GroupBy* object returns the number of observations in each group.

In [13]:
#Number of observations in each group
grouped.size()

continent
Africa      624
Americas    300
Asia        396
Europe      360
Oceania      24
dtype: int64
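Note that `size()` counts rows, including those with missing values, whereas the `count()` method counts only the non-missing values of each column. The difference can be seen on a hypothetical toy DataFrame with one missing `lifeExp` value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
df = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe"],
    "lifeExp": [60.0, np.nan, 75.0],
})
grouped = df.groupby("continent")

# Number of rows in each group (missing values included)
sizes = grouped.size()

# Number of non-missing 'lifeExp' values in each group
counts = grouped["lifeExp"].count()

print(sizes)
print(counts)
```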

#### [`first()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.first.html)
The `first()` method of the *GroupBy* object returns the first non-missing value of each column within each group.

In [15]:
#The first non-missing value of each column in each group is returned by the first() method
grouped.first()

Unnamed: 0_level_0,country,year,lifeExp,pop,gdpPercap
continent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Africa,Algeria,1952,43.077,9279525,2449.008185
Americas,Argentina,1952,62.485,17876956,5911.315053
Asia,Afghanistan,1952,28.801,8425333,779.445314
Europe,Albania,1952,55.23,1282697,1601.056136
Oceania,Australia,1952,69.12,8691212,10039.59564
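Because `first()` skips missing values column by column, its result need not coincide with the first row of each group. The sketch below, on hypothetical toy data, contrasts it with `head(1)`, which returns the first row of each group as it is:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the first 'Asia' row has a missing lifeExp
df = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe"],
    "country": ["A", "B", "C"],
    "lifeExp": [np.nan, 62.0, 75.0],
})
grouped = df.groupby("continent")

# First non-missing value of each column, per group
first_values = grouped.first()

# First row of each group, missing values kept
first_rows = grouped.head(1)

print(first_values)
print(first_rows)
```

For the `Asia` group, `first()` combines `country` from the first row with `lifeExp` from the second row, since the first `lifeExp` value is missing.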


#### [`get_group()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.get_group.html)
This method returns the observations for a particular group of the *GroupBy* object.

In [16]:
#Observations for individual groups can be obtained using the get_group() method
grouped.get_group('Asia')

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
...,...,...,...,...,...,...
1675,"Yemen, Rep.",Asia,1987,52.922,11219340,1971.741538
1676,"Yemen, Rep.",Asia,1992,55.599,13367997,1879.496673
1677,"Yemen, Rep.",Asia,1997,58.020,15826497,2117.484526
1678,"Yemen, Rep.",Asia,2002,60.308,18701257,2234.820827
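For a single key, `get_group()` selects the same rows as a Boolean filter on the key column. When the data is grouped by several keys, the group name is a tuple. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "continent": ["Asia", "Europe", "Asia"],
    "country": ["India", "France", "Japan"],
    "lifeExp": [60.0, 75.0, 80.0],
})

# One key: the group name is the key value
asia = df.groupby("continent").get_group("Asia")

# The same rows obtained with a Boolean filter
same_rows = df[df["continent"] == "Asia"]

# Several keys: the group name is a tuple of key values
japan = df.groupby(["continent", "country"]).get_group(("Asia", "Japan"))

print(asia)
print(japan)
```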


## Data aggregation with `groupby()` methods

### [`mean()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.mean.html)

This method returns the mean of each group of the *GroupBy* object.

#### Grouping observations

**Example:** Find the mean life expectancy, population and GDP per capita for each country since 1952.

First, we'll group the data such that all observations corresponding to a country make a unique group.

In [22]:
#Grouping the observations by 'country'
grouped = data.groupby('country')

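Now, we'll find the mean of each numeric column for each group with the `mean()` method. The argument `numeric_only=True` restricts the computation to the numeric columns; without it, recent versions of Pandas raise an error because the text column `continent` cannot be averaged.

In [25]:
#Finding the mean of all numeric columns of the DataFrame for all groups
grouped.mean(numeric_only=True)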

Unnamed: 0_level_0,year,lifeExp,pop,gdpPercap
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Afghanistan,1979.5,37.478833,1.582372e+07,802.674598
Albania,1979.5,68.432917,2.580249e+06,3255.366633
Algeria,1979.5,59.030167,1.987541e+07,4426.025973
Angola,1979.5,37.883500,7.309390e+06,3607.100529
Argentina,1979.5,69.060417,2.860224e+07,8955.553783
...,...,...,...,...
Vietnam,1979.5,57.479500,5.456857e+07,1017.712615
West Bank and Gaza,1979.5,60.328667,1.848606e+06,3759.996781
"Yemen, Rep.",1979.5,46.780417,1.084319e+07,1569.274672
Zambia,1979.5,45.996333,6.353805e+06,1358.199409


Note that if we wish to retain `continent` in the above result, we can group the data by both `continent` and `country`. To group the data by multiple columns, we pass them as a list:

In [27]:
#Grouping the observations by 'continent' and 'country'
grouped = data.groupby(['continent','country'])

#Finding the mean statistic of all columns of the DataFrame and all groups
grouped.mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,year,lifeExp,pop,gdpPercap
continent,country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Africa,Algeria,1979.5,59.030167,1.987541e+07,4426.025973
Africa,Angola,1979.5,37.883500,7.309390e+06,3607.100529
Africa,Benin,1979.5,48.779917,4.017497e+06,1155.395107
Africa,Botswana,1979.5,54.597500,9.711862e+05,5031.503557
Africa,Burkina Faso,1979.5,44.694000,7.548677e+06,843.990665
...,...,...,...,...,...
Europe,Switzerland,1979.5,75.565083,6.384293e+06,27074.334405
Europe,Turkey,1979.5,59.696417,4.590901e+07,4469.453380
Europe,United Kingdom,1979.5,73.922583,5.608780e+07,19380.472986
Oceania,Australia,1979.5,74.662917,1.464931e+07,19980.595634


Here the data has been aggregated according to the group keys - `continent` and `country`, and a new DataFrame is created that is indexed by the unique values of `continent`-`country`.
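If the group keys are preferred as ordinary columns rather than hierarchical row labels, one option is to call `reset_index()` on the result, and another is to pass `as_index=False` to `groupby()`. The sketch below uses hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "country": ["India", "India", "France", "France"],
    "lifeExp": [60.0, 62.0, 75.0, 77.0],
})

# Result indexed by (continent, country)
means = df.groupby(["continent", "country"]).mean()

# A single group is looked up with a tuple of key values
india_mean = means.loc[("Asia", "India"), "lifeExp"]

# Two ways of getting the keys back as columns
flat = means.reset_index()
flat_direct = df.groupby(["continent", "country"], as_index=False).mean()

print(means)
print(flat)
```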

For large datasets, it may be desirable to aggregate only a few columns. For example, if we wish to compute the means of only `lifeExp` and `gdpPercap`, then we can select those columns in the *GroupBy* object *(just like we select columns in a DataFrame)*, and then apply the `mean()` method:

In [28]:
grouped[['lifeExp','gdpPercap']].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,lifeExp,gdpPercap
continent,country,Unnamed: 2_level_1,Unnamed: 3_level_1
Africa,Algeria,59.030167,4426.025973
Africa,Angola,37.883500,3607.100529
Africa,Benin,48.779917,1155.395107
Africa,Botswana,54.597500,5031.503557
Africa,Burkina Faso,44.694000,843.990665
...,...,...,...
Europe,Switzerland,75.565083,27074.334405
Europe,Turkey,59.696417,4469.453380
Europe,United Kingdom,73.922583,19380.472986
Oceania,Australia,74.662917,19980.595634
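As with a DataFrame, selecting a single column with a string gives a Series result, while selecting with a list of column names gives a DataFrame result. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "country": ["India", "India", "France", "France"],
    "lifeExp": [60.0, 62.0, 75.0, 77.0],
    "gdpPercap": [1000.0, 1200.0, 30000.0, 32000.0],
})
grouped = df.groupby("country")

# A string selects one column: the aggregated result is a Series
as_series = grouped["lifeExp"].mean()

# A list selects several columns: the aggregated result is a DataFrame
as_frame = grouped[["lifeExp", "gdpPercap"]].mean()

print(type(as_series).__name__)
print(type(as_frame).__name__)
```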


#### Grouping columns

By default, rows are grouped. However, as with several other Pandas methods, columns can be grouped instead by using the `axis = 1` argument.

**Example:** Suppose we have the above dataset in the wide format, as follows.

In [30]:
data_wide = data.pivot(index = ['continent','country'],columns = 'year')
data_wide

Unnamed: 0_level_0,Unnamed: 1_level_0,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,lifeExp,...,gdpPercap,gdpPercap,gdpPercap,gdpPercap,gdpPercap,gdpPercap,gdpPercap,gdpPercap,gdpPercap,gdpPercap
Unnamed: 0_level_1,year,1952,1957,1962,1967,1972,1977,1982,1987,1992,1997,...,1962,1967,1972,1977,1982,1987,1992,1997,2002,2007
continent,country,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2
Africa,Algeria,43.077,45.685,48.303,51.407,54.518,58.014,61.368,65.799,67.744,69.152,...,2550.816880,3246.991771,4182.663766,4910.416756,5745.160213,5681.358539,5023.216647,4797.295051,5288.040382,6223.367465
Africa,Angola,30.015,31.999,34.000,35.985,37.928,39.483,39.942,39.906,40.647,40.963,...,4269.276742,5522.776375,5473.288005,3008.647355,2756.953672,2430.208311,2627.845685,2277.140884,2773.287312,4797.231267
Africa,Benin,38.223,40.358,42.618,44.885,47.014,49.190,50.904,52.337,53.919,54.777,...,949.499064,1035.831411,1085.796879,1029.161251,1277.897616,1225.856010,1191.207681,1232.975292,1372.877931,1441.284873
Africa,Botswana,47.622,49.618,51.520,53.298,56.024,59.319,61.484,63.622,62.745,52.556,...,983.653976,1214.709294,2263.611114,3214.857818,4551.142150,6205.883850,7954.111645,8647.142313,11003.605080,12569.851770
Africa,Burkina Faso,31.975,34.906,37.814,40.697,43.591,46.137,48.122,49.557,50.260,50.324,...,722.512021,794.826560,854.735976,743.387037,807.198586,912.063142,931.752773,946.294962,1037.645221,1217.032994
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Europe,Switzerland,69.620,70.560,71.320,72.770,73.780,75.390,76.210,77.410,78.030,79.370,...,20431.092700,22966.144320,27195.113040,26982.290520,28397.715120,30281.704590,31871.530300,32135.323010,34480.957710,37506.419070
Europe,Turkey,43.585,48.079,52.098,54.336,57.005,59.507,61.036,63.108,66.146,68.835,...,2322.869908,2826.356387,3450.696380,4269.122326,4241.356344,5089.043686,5678.348271,6601.429915,6508.085718,8458.276384
Europe,United Kingdom,69.180,70.420,70.760,71.360,72.010,72.760,74.040,75.007,76.420,77.218,...,12477.177070,14142.850890,15895.116410,17428.748460,18232.424520,21664.787670,22705.092540,26074.531360,29478.999190,33203.261280
Oceania,Australia,69.120,70.330,70.930,71.100,71.930,73.490,74.740,76.320,77.560,78.830,...,12217.226860,14526.124650,16788.629480,18334.197510,19477.009280,21888.889030,23424.766830,26997.936570,30687.754730,34435.367440


Now, find the mean GDP per capita, life expectancy and population for each country.

Here, we can group by the outer level column labels to obtain the means. Also, we need to use the argument `axis=1` to indicate that we intend to group columns, instead of rows.

In [31]:
data_wide.groupby(axis=1,level=0).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,gdpPercap,lifeExp,pop
continent,country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Africa,Algeria,4426.025973,59.030167,1.987541e+07
Africa,Angola,3607.100529,37.883500,7.309390e+06
Africa,Benin,1155.395107,48.779917,4.017497e+06
Africa,Botswana,5031.503557,54.597500,9.711862e+05
Africa,Burkina Faso,843.990665,44.694000,7.548677e+06
...,...,...,...,...
Europe,Switzerland,27074.334405,75.565083,6.384293e+06
Europe,Turkey,4469.453380,59.696417,4.590901e+07
Europe,United Kingdom,19380.472986,73.922583,5.608780e+07
Oceania,Australia,19980.595634,74.662917,1.464931e+07
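Note that Pandas releases from 2.1 onward emit a deprecation warning when `axis=1` is passed to `groupby()`. An alternative that avoids the argument is to transpose the DataFrame, group its rows, and transpose the result back. The sketch below assumes a small hypothetical wide table with the same column structure (variable on the outer level, year on the inner level):

```python
import pandas as pd

# Hypothetical wide table: outer column level = variable, inner level = year
columns = pd.MultiIndex.from_product([["lifeExp", "pop"], [1952, 1957]])
wide = pd.DataFrame(
    [[60.0, 62.0, 100.0, 120.0],
     [70.0, 74.0, 50.0, 54.0]],
    index=["India", "France"],
    columns=columns,
)

# Transpose so that the column labels become row labels, group the rows
# by the outer level, take the mean, and transpose back
col_means = wide.T.groupby(level=0).mean().T

print(col_means)
```

The result has one column per variable, holding the mean over the years for each row of the original table.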


### Practice exercise 1

Read the table consisting of GDP per capita of countries from the webpage: https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita . 

To only read the relevant table, read the tables that contain the word *'Country'*. 

Estimate the GDP per capita of each country as the average of the estimates of the three agencies - IMF, United Nations and World Bank.

We need to do a bit of data cleaning before we can use the `groupby()` method directly. Follow the steps below to answer this question:

1. Set the first 3 columns containing country, sub-region and region as hierarchical row labels. 

2. Apply the following function on all the columns to convert them to numeric: `f = lambda x:pd.to_numeric(x,errors = 'coerce')`

3. Now use `groupby()` to estimate the GDP per capita for each country.
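As a hint for step 2, the effect of `errors = 'coerce'` can be seen on a small made-up table (the column names and values below are hypothetical, not taken from the webpage). Entries that cannot be parsed as numbers become `NaN`, which the `mean()` method then skips:

```python
import pandas as pd

# Hypothetical raw table in which every entry is a string
raw = pd.DataFrame({
    "IMF": ["1000", "2500", "n/a"],
    "World Bank": ["900", "n/a", "3100"],
})

# Function from step 2: unparseable entries are coerced to NaN
f = lambda x: pd.to_numeric(x, errors="coerce")
numeric = raw.apply(f)

print(numeric)
print(numeric.dtypes)
```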

In [None]:
#| echo: false
#| eval: false

dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita', match = 'Country')
gdp_per_capita = dfs[0]
gdp_per_capita_reindexed = gdp_per_capita.set_index([('Country/Territory','Country/Territory'),
                                                     ('UN Region','UN Region')])
gdp_per_capita_numeric=gdp_per_capita_reindexed.apply(lambda x:pd.to_numeric(x,errors = 'coerce'))
gdp_per_capita_numeric.groupby(axis=1,level=1).mean().drop(columns='Year')