Data aggregation refers to summarizing data with statistics such as the sum, count, average, maximum, and minimum to provide a high-level view of the data. Often there are mutually exclusive groups in the data that are of interest. In such cases, we may be interested in finding these statistics separately for each group. The Pandas method *groupby()* is used to split the data into groups, and then the desired function(s) are applied to each of these groups for groupwise data aggregation. However, *groupby()* is not limited to groupwise data aggregation; it can also be used for several other kinds of groupwise operations.

**Groupby mechanics:** (Source: https://pandas.pydata.org/docs/user_guide/groupby.html)

**Group by: split-apply-combine**

By “group by” we are referring to a process involving one or more of the following steps:

- Splitting the data into groups based on some criteria.
- Applying a function to each group independently.
- Combining the results into a data structure.
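The three steps can be sketched by hand and compared with a single `groupby()` call. This is only a minimal illustration on a hypothetical toy DataFrame (`toy`, with made-up columns `team` and `score`), not the dataset used later in this section.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "team": ["a", "b", "a", "b", "a"],
    "score": [1, 2, 3, 4, 5],
})

# Split: one sub-DataFrame per distinct value of 'team'
pieces = {key: toy[toy["team"] == key] for key in toy["team"].unique()}

# Apply: compute a statistic on each piece independently
sums = {key: piece["score"].sum() for key, piece in pieces.items()}

# Combine: assemble the per-group results into a single Series
manual = pd.Series(sums).sort_index()

# The same three steps expressed with groupby()
auto = toy.groupby("team")["score"].sum()

print(manual.to_dict() == auto.to_dict())
```

The manual version makes each step explicit; `groupby()` performs the same work without materializing the intermediate dictionary of pieces.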

Out of these, the split step is the most straightforward. In fact, in many situations we may wish to split the data set into groups and do something with those groups. In the apply step, we might wish to do one of the following:

**Aggregation:** compute a summary statistic (or statistics) for each group. Some examples:

Compute group sums or means. \
Compute group sizes / counts.

**Transformation:** perform some group-specific computations and return a like-indexed object. Some examples:

Standardize data (zscore) within a group. \
Filling NAs within groups with a value derived from each group.

**Filtration:** discard some groups, according to a group-wise computation that evaluates True or False. Some examples:

Discard data that belongs to groups with only a few members. \
Filter out data based on the group sum or mean.

Some combination of the above: *GroupBy* will examine the results of the *apply* step and try to return a sensibly combined result if it doesn’t fit into any one of the above categories.
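The three kinds of apply step can be contrasted on a small example. The sketch below assumes a hypothetical toy DataFrame (`toy`, with made-up columns `team` and `score`); the lambdas are illustrative choices, not the only way to express these operations.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "team": ["a", "a", "b", "b", "c"],
    "score": [1.0, 3.0, 10.0, 20.0, 5.0],
})
scores = toy.groupby("team")["score"]

# Aggregation: one summary value per group
means = scores.mean()

# Transformation: a result indexed like the original data
# (here, each score minus its own group's mean)
centered = scores.transform(lambda s: s - s.mean())

# Filtration: keep only the rows of groups with at least two members
kept = toy.groupby("team").filter(lambda grp: len(grp) >= 2)

print(means)
print(centered)
print(kept)
```

Note how the shapes differ: aggregation returns one row per group, transformation returns one value per original row, and filtration returns a subset of the original rows.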

We'll use Pandas to perform *GroupBy* operations.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## The GroupBy object

### Creating a GroupBy object: [`groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)

The Pandas DataFrame method `groupby()` is used to create a `GroupBy` object.

A string passed to `groupby()` may refer to either a column or an index level. If a string matches both a column name and an index level name, a ValueError will be raised.
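The following sketch illustrates both cases on a hypothetical toy DataFrame (the names `toy`, `key`, `val`, and `level` are made up for this example). It assumes a reasonably recent pandas version, where the ambiguous case raises a `ValueError` as described above.

```python
import pandas as pd

# Hypothetical toy data: a column named 'key' and an index level named 'level'
toy = pd.DataFrame(
    {"key": ["x", "y", "x"], "val": [1, 2, 3]},
    index=pd.Index(["p", "q", "p"], name="level"),
)

# The string may name a column ...
by_column = toy.groupby("key")["val"].sum()

# ... or an index level
by_level = toy.groupby("level")["val"].sum()

# If the index level is renamed to 'key', the string matches both
# a column and an index level, so the grouping is ambiguous
ambiguous = toy.rename_axis("key")
try:
    ambiguous.groupby("key")
    raised = False
except ValueError:
    raised = True

print(by_column.to_dict(), by_level.to_dict(), raised)
```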

**Example**: Consider the life expectancy dataset, *gdp_lifeExpectancy.csv*. Suppose we want to group the observations by `continent`.

In [3]:
data = pd.read_csv('./Datasets/gdp_lifeExpectancy.csv')
data.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106


We will pass the column `continent` as an argument to the `groupby()` method.

In [5]:
#Creating a GroupBy object
grouped = data.groupby('continent')
#This will split the data into groups that correspond to values of the column 'continent'

The `groupby()` method returns a *GroupBy* object.

In [7]:
#A 'GroupBy' object is created with the `groupby()` method
type(grouped)

pandas.core.groupby.generic.DataFrameGroupBy

The GroupBy object `grouped` contains the information about the groups into which the data is distributed. Each observation is assigned to a group according to its value(s) in the column(s) used for grouping. However, note that the dataset is not physically split into different DataFrames. For example, in the above case, each observation is assigned to a group depending on its value of `continent`, but all the observations are still in the same DataFrame `data`.
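One way to see the group assignment is to iterate over a *GroupBy* object, which yields pairs of a group name and the rows belonging to that group. The sketch below uses a hypothetical four-row stand-in (`toy`) rather than the full dataset.

```python
import pandas as pd

# Hypothetical toy stand-in for the life expectancy data
toy = pd.DataFrame({
    "continent": ["Asia", "Europe", "Asia", "Africa"],
    "lifeExp": [60.0, 75.0, 65.0, 50.0],
})
toy_grouped = toy.groupby("continent")

# Iterating yields (group name, sub-DataFrame) pairs
shapes = {name: group.shape for name, group in toy_grouped}
print(shapes)

# The original DataFrame is unchanged and still holds every observation
print(toy.shape)
```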

### Attributes and methods of the *GroupBy* object

#### `keys`

The object(s) grouping the data are called *key(s)*. Here `continent` is the group key. The keys of the *GroupBy* object can be seen using its `keys` attribute.

In [8]:
#Key(s) of the GroupBy object
grouped.keys

'continent'

#### [`ngroups`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.ngroup.html)
The number of groups in which the data is distributed based on the keys can be seen with the `ngroups` attribute.

In [9]:
#The number of groups based on the key(s)
grouped.ngroups

5
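As a sanity check, when grouping by a single column, `ngroups` should agree with the number of distinct non-missing values in that column, since rows with a missing key are excluded from the groups by default. A minimal sketch on a hypothetical toy DataFrame (`toy`):

```python
import pandas as pd

# Hypothetical toy data; one row has a missing continent
toy = pd.DataFrame({
    "continent": ["Asia", "Europe", "Asia", None, "Africa"],
    "lifeExp": [60.0, 75.0, 65.0, 70.0, 50.0],
})

n_groups = toy.groupby("continent").ngroups

# Number of distinct non-missing values in the key column
n_unique = toy["continent"].nunique()

print(n_groups, n_unique)
```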

#### [`groups`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.groups.html)
The `groups` attribute of the *GroupBy* object contains the group labels (or names) and the row labels of the observations in each group, as a dictionary.

In [10]:
#The groups (in the dictionary format)
grouped.groups

{'Africa': [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, ...], 'Americas': [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 432, 433, 434, 435, ...],

The group names are the *keys* of the dictionary, while the row labels are the corresponding *values*.
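Because the values are row labels, they can be passed to `.loc` to retrieve the rows of a group. The sketch below uses a hypothetical toy DataFrame (`toy`) with a non-default index to emphasize that the dictionary stores labels rather than positions.

```python
import pandas as pd

# Hypothetical toy data with row labels 10..13
toy = pd.DataFrame(
    {"continent": ["Asia", "Europe", "Asia", "Africa"],
     "lifeExp": [60.0, 75.0, 65.0, 50.0]},
    index=[10, 11, 12, 13],
)
toy_groups = toy.groupby("continent").groups

# Row labels of the observations in the 'Asia' group
asia_labels = list(toy_groups["Asia"])

# The labels select the corresponding rows of the original DataFrame
asia_rows = toy.loc[toy_groups["Asia"]]

print(asia_labels)
print(asia_rows)
```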

In [11]:
#Group names
grouped.groups.keys()

dict_keys(['Africa', 'Americas', 'Asia', 'Europe', 'Oceania'])

In [12]:
#Group values are the row labels corresponding to a particular group
grouped.groups.values()

dict_values([Int64Index([  24,   25,   26,   27,   28,   29,   30,   31,   32,   33,
            ...
            1694, 1695, 1696, 1697, 1698, 1699, 1700, 1701, 1702, 1703],
           dtype='int64', length=624), Int64Index([  48,   49,   50,   51,   52,   53,   54,   55,   56,   57,
            ...
            1634, 1635, 1636, 1637, 1638, 1639, 1640, 1641, 1642, 1643],
           dtype='int64', length=300), Int64Index([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9,
            ...
            1670, 1671, 1672, 1673, 1674, 1675, 1676, 1677, 1678, 1679],
           dtype='int64', length=396), Int64Index([  12,   13,   14,   15,   16,   17,   18,   19,   20,   21,
            ...
            1598, 1599, 1600, 1601, 1602, 1603, 1604, 1605, 1606, 1607],
           dtype='int64', length=360), Int64Index([  60,   61,   62,   63,   64,   65,   66,   67,   68,   69,   70,
              71, 1092, 1093, 1094, 1095, 1096, 1097, 1098, 1099, 1100, 1101,
            1102, 1103],
      

#### [`size()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.size.html)
The `size()` method of the *GroupBy* object returns the number of observations in each group.

In [13]:
#Number of observations in each group
grouped.size()

continent
Africa      624
Americas    300
Asia        396
Europe      360
Oceania      24
dtype: int64
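Note that `size()` counts rows, including rows that contain missing values, whereas the related `count()` method counts only the non-missing values of each column. A minimal sketch on a hypothetical toy DataFrame (`toy`) with one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; one lifeExp value is missing
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe"],
    "lifeExp": [60.0, np.nan, 75.0],
})
toy_grouped = toy.groupby("continent")

# Number of rows in each group (missing values included)
sizes = toy_grouped.size()

# Number of non-missing lifeExp values in each group
counts = toy_grouped["lifeExp"].count()

print(sizes.to_dict(), counts.to_dict())
```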

#### [`first()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.first.html)
The `first()` method of the *GroupBy* object returns, for each group, the first non-missing value of each column.

In [15]:
#The first non-missing value of each column, for each group, is returned by the first() method
grouped.first()

continent,country,year,lifeExp,pop,gdpPercap
Africa,Algeria,1952,43.077,9279525,2449.008185
Americas,Argentina,1952,62.485,17876956,5911.315053
Asia,Afghanistan,1952,28.801,8425333,779.445314
Europe,Albania,1952,55.23,1282697,1601.056136
Oceania,Australia,1952,69.12,8691212,10039.59564
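Since `first()` skips missing values column by column, the values it returns for a group may come from different rows. The sketch below contrasts it with `head(1)`, which returns the first row of each group unchanged. The toy DataFrame (`toy`) is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the first 'Asia' row has a missing lifeExp
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe"],
    "country": ["A", "B", "C"],
    "lifeExp": [np.nan, 61.0, 75.0],
})
toy_grouped = toy.groupby("continent")

# first(): first non-missing value of each column within each group
firsts = toy_grouped.first()

# head(1): the first row of each group, missing values included
first_rows = toy_grouped.head(1)

print(firsts)
print(first_rows)
```

For the `Asia` group, `first()` combines the country from the first row with the life expectancy from the second row, while `head(1)` keeps the first row with its missing value.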


#### [`get_group()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.get_group.html)
This method returns the observations for a particular group of the *GroupBy* object.
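On a small example, the result of `get_group()` can be checked against an ordinary boolean filter on the grouping column. The sketch below uses a hypothetical toy DataFrame (`toy`) and also shows that requesting a group name that does not exist raises a `KeyError`.

```python
import pandas as pd

# Hypothetical toy stand-in for the life expectancy data
toy = pd.DataFrame({
    "continent": ["Asia", "Europe", "Asia", "Africa"],
    "lifeExp": [60.0, 75.0, 65.0, 50.0],
})
toy_grouped = toy.groupby("continent")

# Rows of the 'Asia' group
asia = toy_grouped.get_group("Asia")

# The same rows selected with a boolean mask
same = asia.equals(toy[toy["continent"] == "Asia"])

# A group name that is not present raises a KeyError
try:
    toy_grouped.get_group("Antarctica")
    missing_raises = False
except KeyError:
    missing_raises = True

print(same, missing_raises)
```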

In [16]:
#Observations for individual groups can be obtained using the get_group() method
grouped.get_group('Asia')

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
...,...,...,...,...,...,...
1675,"Yemen, Rep.",Asia,1987,52.922,11219340,1971.741538
1676,"Yemen, Rep.",Asia,1992,55.599,13367997,1879.496673
1677,"Yemen, Rep.",Asia,1997,58.020,15826497,2117.484526
1678,"Yemen, Rep.",Asia,2002,60.308,18701257,2234.820827
