# Grouping and Aggregating with Pandas

## Grouping and Aggregation:

### Grouping:

  - We are usually interested in information about groups rather than the whole population, so we need a way to group rows.

  - **Grouping means collecting rows into groups based on the values of a column**
  
  - **Grouping is done with the __groupby()__ method**
  
  - After grouping, an __aggregate function__ is applied to each group
  
### Aggregation:

 - Aggregation refers to any data transformation that produces a single value from a collection of values.

### Aggregate Function

  - A function that takes __many values__ as input and __returns one value__ as a result is called an **aggregate function**. For example, __sum__, __mean__, and __std__ are aggregate functions.
 
 - An aggregate function does not have to be built in; it can also be user defined.
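As a minimal sketch of this definition, the snippet below reduces a short list to single values with two built-in aggregates and one user-defined aggregate. The name `value_range` and the sample list are assumptions made for illustration.

```python
from statistics import mean

quantities = [1200, 3000, 2300]  # many values go in

def value_range(values):
    """User-defined aggregate: spread between the largest and smallest value."""
    return max(values) - min(values)

total = sum(quantities)            # built-in aggregate
average = mean(quantities)         # aggregate from the standard library
spread = value_range(quantities)   # user-defined aggregate

print(total, round(average, 2), spread)  # 6500 2166.67 1800
```

Each call returns exactly one value, which is what makes these functions usable after a grouping step.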

### Setting up the environment

In [1]:
import pandas as pd
import numpy as np
from random import choices, seed, sample
from IPython.display import display
pd.options.display.float_format = '{:,.2f}'.format

## Groupby Function

   - The __groupby__ method has a set of options:
   
      - **by**: a column name or a list of column names to group by.
      - **axis**: By default, grouping takes place by rows (axis=0). For columns, set axis to 1. (Recent pandas versions deprecate this argument.)
      - **level**: This is useful for a MultiIndex DataFrame, so grouping will occur by one or more levels.
      - **sort**: True by default, which means the group keys are returned in sorted order.
      - **dropna**: True by default, which means rows whose group key is missing (NaN) are left out.

For more information about groupby, check the documentation by typing ```pd.DataFrame.groupby?``` in a Jupyter notebook.

The syntax:

```python 
df.groupby(by='column-label')          # groups using the method defaults
# or multiple columns
df.groupby(by=['col_1', 'col_2'])

# for the docs (IPython/Jupyter only)
pd.DataFrame.groupby?
```
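The effect of the __sort__ and __dropna__ options can be sketched on a tiny frame. The DataFrame `demo` and its column names are made up for this illustration, and the `dropna` argument assumes pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row has a missing group key.
demo = pd.DataFrame({'region': ['West', 'East', np.nan, 'East'],
                     'qty': [5, 1, 7, 2]})

default_sum = demo.groupby('region')['qty'].sum()                 # sort=True, dropna=True
unsorted_sum = demo.groupby('region', sort=False)['qty'].sum()    # keep first-seen order
with_nan_sum = demo.groupby('region', dropna=False)['qty'].sum()  # keep the NaN key

print(list(default_sum.index))    # ['East', 'West']
print(list(unsorted_sum.index))   # ['West', 'East']
print(len(with_nan_sum))          # 3
```

With the defaults, the row with the missing key is silently left out, which is worth remembering when group totals do not add up to the overall total.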


- It is crucial to understand that __groupby()__ returns a DataFrameGroupBy object, not a result table. Thus, when we call __groupby__ without an aggregate function, we only see the object's representation, not any computed values.


- To get results, there are always at least two steps involved.

   1. **Grouping**
   2. **Applying aggregate functions**

### DataFrame Example

In [2]:
df = pd.DataFrame({'Regions':['East', 'East', 'East', 'West', 'West', 
                             'South', 'South', 'South', 'South', 'North'], 
                  'Company':['C1', 'C2', 'C3', 'C1', 'C3', 'C4', 'C2','C3', 'C1', 'C2'], 
                  'Quantity': [1200, 3000, 2300, 5400, 2200, 1300, 2700, 6400, 7200, 
                               10000]})
df

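| | Regions | Company | Quantity |
|---|---|---|---|
| 0 | East | C1 | 1200 |
| 1 | East | C2 | 3000 |
| 2 | East | C3 | 2300 |
| 3 | West | C1 | 5400 |
| 4 | West | C3 | 2200 |
| 5 | South | C4 | 1300 |
| 6 | South | C2 | 2700 |
| 7 | South | C3 | 6400 |
| 8 | South | C1 | 7200 |
| 9 | North | C2 | 10000 |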


### Grouping Example: 

  - We are going to group by the regions column

In [3]:
df.groupby('Regions')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f89ef852cd0>

Calling groupby on the Regions column returned a DataFrameGroupBy object. Since we will reuse this object, we assign it to a name.

In [4]:
by_region = df.groupby('Regions')

In [5]:
# Check the type
type(by_region)

pandas.core.groupby.generic.DataFrameGroupBy

As already mentioned, grouping is the first step toward doing calculations on each group. Thus, we need to apply an aggregate function to get results.

However, a person with an inquisitive mind always seeks to know how things work internally. If you are one of those people, this section is for you.

**Grouping** in pandas follows a principle called **split-apply-combine**. You can think of it like this:

   - Splitting the data into small subsets, each corresponds to a unique group defined by the key. 
    
   - Applying an aggregate function on each subset.
   - Combining the result of aggregation in a convenient object.
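The three steps can be sketched by hand in plain Python. This is only an illustration of the idea, not how pandas implements it internally; the rows are a subset of the example data, and the mean is chosen as the aggregate.

```python
from collections import defaultdict

rows = [('East', 1200), ('East', 3000), ('East', 2300),
        ('West', 5400), ('West', 2200)]

# Split: collect the quantities that belong to each region.
groups = defaultdict(list)
for region, quantity in rows:
    groups[region].append(quantity)

# Apply: reduce each group to a single value (here, the mean).
# Combine: gather the per-group results into one dict keyed by region.
means = {region: sum(values) / len(values) for region, values in groups.items()}

print(means)
```

The __groupby__ method performs the split for us, and the aggregate method that follows handles the apply and combine steps.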
   
To see the full picture, note that a __GroupBy__ object can be iterated over. First, pass it to the __list__ function to see its contents.

In [6]:
list(by_region)

[('East',
    Regions Company  Quantity
  0    East      C1      1200
  1    East      C2      3000
  2    East      C3      2300),
 ('North',
    Regions Company  Quantity
  9   North      C2     10000),
 ('South',
    Regions Company  Quantity
  5   South      C4      1300
  6   South      C2      2700
  7   South      C3      6400
  8   South      C1      7200),
 ('West',
    Regions Company  Quantity
  3    West      C1      5400
  4    West      C3      2200)]

In [7]:
list(by_region)[0]

('East',
   Regions Company  Quantity
 0    East      C1      1200
 1    East      C2      3000
 2    East      C3      2300)

We see it is a list of tuples, and each tuple has two components: the group name and the group data. So, the iteration looks like this:

In [8]:
for name, group in by_region:
    print(name)
    print(group)

East
  Regions Company  Quantity
0    East      C1      1200
1    East      C2      3000
2    East      C3      2300
North
  Regions Company  Quantity
9   North      C2     10000
South
  Regions Company  Quantity
5   South      C4      1300
6   South      C2      2700
7   South      C3      6400
8   South      C1      7200
West
  Regions Company  Quantity
3    West      C1      5400
4    West      C3      2200


What is the benefit of iterating over the **GroupBy** object? One scenario is that you want to extract one or more groups for further analysis. This can be achieved by converting the **GroupBy** object into a dict.

In [9]:
pieces = dict(list(by_region))
pieces

{'East':   Regions Company  Quantity
 0    East      C1      1200
 1    East      C2      3000
 2    East      C3      2300,
 'North':   Regions Company  Quantity
 9   North      C2     10000,
 'South':   Regions Company  Quantity
 5   South      C4      1300
 6   South      C2      2700
 7   South      C3      6400
 8   South      C1      7200,
 'West':   Regions Company  Quantity
 3    West      C1      5400
 4    West      C3      2200}

In [10]:
pieces.keys()

dict_keys(['East', 'North', 'South', 'West'])

Now we can access each group separately, then decide what to do with it.

In [11]:
east = pieces['East']
east

| | Regions | Company | Quantity |
|---|---|---|---|
| 0 | East | C1 | 1200 |
| 1 | East | C2 | 3000 |
| 2 | East | C3 | 2300 |


We just extracted one group and saved it as a DataFrame. This is the magic of the __groupby__ method. We did that manually; however, there is a __groups__ attribute and a __get_group__ method to get information about the groups or retrieve the one we need.

In [12]:
by_region.groups

{'East': [0, 1, 2], 'North': [9], 'South': [5, 6, 7, 8], 'West': [3, 4]}

In [13]:
south = by_region.get_group('South')
south

| | Regions | Company | Quantity |
|---|---|---|---|
| 5 | South | C4 | 1300 |
| 6 | South | C2 | 2700 |
| 7 | South | C3 | 6400 |
| 8 | South | C1 | 7200 |


## Applying aggregate functions Examples

Suppose we want to know the average quantity by region, or the sum, the minimum, the maximum, etc. All these functions return one value for each group.

### Mean Function Example

In [14]:
by_region.mean(numeric_only=True)  # skip the non-numeric Company column

| Regions | Quantity |
|---|---|
| East | 2166.67 |
| North | 10000.0 |
| South | 4400.0 |
| West | 3800.0 |


### Note:

  - After aggregating, the column we grouped by becomes the index of the result. The __indices__ attribute of the GroupBy object shows which row positions belong to each group key.

In [15]:
by_region.indices

{'East': array([0, 1, 2]),
 'North': array([9]),
 'South': array([5, 6, 7, 8]),
 'West': array([3, 4])}

### Sum Function Example

In [16]:
by_region.sum(numeric_only=True)  # skip the non-numeric Company column

| Regions | Quantity |
|---|---|
| East | 6500 |
| North | 10000 |
| South | 17600 |
| West | 7600 |


We are grouping, saving an object, and then applying an aggregate function. This takes many steps and makes the code less readable. Thus, the steps are usually chained together, which is the common way to do the analysis.

Here is the syntax:
```python
df.groupby('col-name').agg_function()
```

### Note: 

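  - The aggregate method called after __groupby__ is applied to all remaining columns, not just one.
  
  
  - Not every aggregate function works on every data type; for example, there is no mean or median for a string variable. Older pandas versions silently drop such columns, while recent versions raise an error unless you pass __numeric_only=True__ or select the numeric columns first.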
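The following sketch shows two ways of restricting an aggregate to numeric data, plus one aggregate that does work on strings. The frame `sales` is a small stand-in for the example data, not the full DataFrame used above.

```python
import pandas as pd

sales = pd.DataFrame({'Regions': ['East', 'East', 'West'],
                      'Company': ['C1', 'C2', 'C1'],
                      'Quantity': [1200, 3000, 5400]})

# Option 1: select the numeric column before aggregating.
mean_qty = sales.groupby('Regions')['Quantity'].mean()

# Option 2: ask the aggregate to skip non-numeric columns.
mean_numeric = sales.groupby('Regions').mean(numeric_only=True)

# max() is defined for strings (alphabetical order), so this works.
last_company = sales.groupby('Regions')['Company'].max()

print(mean_qty['East'])            # 2100.0
print(list(mean_numeric.columns))  # ['Quantity']
print(last_company['East'])        # C2
```

Selecting the column first is usually the more explicit choice, because the code states which variable is being summarized.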

### The Sum function with groupby

In [17]:
df.groupby('Regions').sum(numeric_only=True)

| Regions | Quantity |
|---|---|
| East | 6500 |
| North | 10000 |
| South | 17600 |
| West | 7600 |


We get only the sum of the numeric variable Quantity.

### The median

In [18]:
df.groupby('Regions').median(numeric_only=True)

| Regions | Quantity |
|---|---|
| East | 2300.0 |
| North | 10000.0 |
| South | 4550.0 |
| West | 3800.0 |


### Grouping with more variables

  - It suffices to pass a list of labels to group by 

In [19]:
df.groupby(['Regions', 'Company']).median()

| Regions | Company | Quantity |
|---|---|---|
| East | C1 | 1200.0 |
| East | C2 | 3000.0 |
| East | C3 | 2300.0 |
| North | C2 | 10000.0 |
| South | C1 | 7200.0 |
| South | C2 | 2700.0 |
| South | C3 | 6400.0 |
| South | C4 | 1300.0 |
| West | C1 | 5400.0 |
| West | C3 | 2200.0 |


Because no company is repeated within a region, each group holds a single row, so the median equals the original value. In effect, __groupby__ only sorted the DataFrame by the two columns (a bonus from groupby!).
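Grouping by two columns produces a result whose index has two levels. The sketch below, on a made-up frame `sales` in which one pair does repeat, shows three common ways of working with such a result.

```python
import pandas as pd

sales = pd.DataFrame({'Regions': ['East', 'East', 'West', 'West'],
                      'Company': ['C1', 'C2', 'C1', 'C1'],
                      'Quantity': [1200, 3000, 5400, 2200]})

by_two = sales.groupby(['Regions', 'Company'])['Quantity'].sum()

west_c1 = by_two.loc[('West', 'C1')]   # one (region, company) pair
flat = by_two.reset_index()            # group keys back as ordinary columns
wide = by_two.unstack('Company')       # one column per company

print(west_c1)              # 7600
print(list(flat.columns))   # ['Regions', 'Company', 'Quantity']
```

In `wide`, combinations that never occur (here West with C2) appear as NaN.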

### The count Function Example

In [20]:
df.groupby('Regions').count()

| Regions | Company | Quantity |
|---|---|---|
| East | 3 | 3 |
| North | 1 | 1 |
| South | 4 | 4 |
| West | 2 | 2 |


The __count__ method also works on __categorical (string) variables__, because it counts the non-missing values in each column. Hence, we see how many company entries each region has: three in the East, four in the South, and only one in the North. If this were real data, it would suggest studying the northern market, since a company might start selling its products there. Keep in mind, though, that this data is made up.
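Because __count__ skips missing values, it can differ from the number of rows in a group. The sketch below contrasts it with `size` and `nunique` on a made-up frame that contains one missing quantity and one repeated company.

```python
import numpy as np
import pandas as pd

sales = pd.DataFrame({'Regions': ['East', 'East', 'East', 'West'],
                      'Company': ['C1', 'C1', 'C2', 'C3'],
                      'Quantity': [1200, np.nan, 2300, 5400]})

grouped = sales.groupby('Regions')

non_null = grouped['Quantity'].count()      # non-missing values per group
n_rows = grouped.size()                     # rows per group, NaN included
n_companies = grouped['Company'].nunique()  # distinct companies per group

print(non_null['East'], n_rows['East'], n_companies['East'])  # 2 3 2
```

If the question is how many different companies operate in a region, `nunique` is the closer match; `count` answers how many non-missing entries there are.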

### **What is the returned object?**

  - We always seek to know the type of the returned object, because that helps us to decide what we do with it.

In [21]:
byReg = df.groupby('Regions').median(numeric_only=True)
type(byReg)

pandas.core.frame.DataFrame

It is a DataFrame object. Thus, we can apply everything we learned before to it.

If we are interested only in __the eastern region__, for example, __loc__ is the solution.

In [22]:
df.groupby('Regions').median(numeric_only=True).loc['East']

Quantity   2,300.00
Name: East, dtype: float64

> **Congrats!!!** You have reached a high level in Python, because chaining all the steps in one expression is a common way of programming with pandas.

### Describe method with Grouping:

  - Getting more information at once is preferable. Hence, we use the __describe__ method.

In [23]:
df.groupby('Regions').describe()

| Regions | Quantity, count | Quantity, mean | Quantity, std | Quantity, min | Quantity, 25% | Quantity, 50% | Quantity, 75% | Quantity, max |
|---|---|---|---|---|---|---|---|---|
| East | 3.0 | 2166.67 | 907.38 | 1200.0 | 1750.0 | 2300.0 | 2650.0 | 3000.0 |
| North | 1.0 | 10000.0 | NaN | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| South | 4.0 | 4400.0 | 2848.39 | 1300.0 | 2350.0 | 4550.0 | 6600.0 | 7200.0 |
| West | 2.0 | 3800.0 | 2262.74 | 2200.0 | 3000.0 | 3800.0 | 4600.0 | 5400.0 |


> We can transpose the result if we want the statistics displayed vertically.

In [24]:
df.groupby('Regions').describe().transpose()

| | Regions | East | North | South | West |
|---|---|---|---|---|---|
| Quantity | count | 3.0 | 1.0 | 4.0 | 2.0 |
| Quantity | mean | 2166.67 | 10000.0 | 4400.0 | 3800.0 |
| Quantity | std | 907.38 | NaN | 2848.39 | 2262.74 |
| Quantity | min | 1200.0 | 10000.0 | 1300.0 | 2200.0 |
| Quantity | 25% | 1750.0 | 10000.0 | 2350.0 | 3000.0 |
| Quantity | 50% | 2300.0 | 10000.0 | 4550.0 | 3800.0 |
| Quantity | 75% | 2650.0 | 10000.0 | 6600.0 | 4600.0 |
| Quantity | max | 3000.0 | 10000.0 | 7200.0 | 5400.0 |


Often, we are interested in only one __group__, so we want to select just that group. We do that just like subsetting a normal DataFrame.

- Here, we are interested in the __eastern region__.

In [25]:
df.groupby('Regions').describe().transpose()['East']

Quantity  count       3.00
          mean    2,166.67
          std       907.38
          min     1,200.00
          25%     1,750.00
          50%     2,300.00
          75%     2,650.00
          max     3,000.00
Name: East, dtype: float64

We selected the group with bracket notation because, after transposing the result, the group key became a column. Without transposing, we need to use __loc__ because the group key is an __index__ label.

In [26]:
df.groupby('Regions').describe().loc['East']

Quantity  count       3.00
          mean    2,166.67
          std       907.38
          min     1,200.00
          25%     1,750.00
          50%     2,300.00
          75%     2,650.00
          max     3,000.00
Name: East, dtype: float64
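The same idea extends to picking a single statistic out of the __describe__ result, whose columns have two levels (variable, statistic). The sketch below uses a made-up frame `sales`; the tuple `('Quantity', 'mean')` addresses one column.

```python
import pandas as pd

sales = pd.DataFrame({'Regions': ['East', 'East', 'West', 'West'],
                      'Quantity': [1000, 3000, 2000, 6000]})

summary = sales.groupby('Regions').describe()

east_mean = summary.loc['East', ('Quantity', 'mean')]  # one cell
all_means = summary[('Quantity', 'mean')]              # one statistic, all groups

print(east_mean)          # 2000.0
print(all_means['West'])  # 4000.0
```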

## Aggregating Aggregates

While reading this tutorial, you may have asked yourself:

  - **What if we want to apply many aggregate functions in one step?**
  - **What if we want to apply a different aggregate function to each column at once?**
    
Well, pandas provides the solution: the __agg__ (or __aggregate__) method.

### The aggregate function AGG

  - __agg__ is an alias for `aggregate`. It is recommended to use the alias __agg__.
  
  - __agg__ has a __func__ option, which can be:
      - Any aggregate function.
      - List of aggregate functions.
      - Dict of column-label: agg-function. 
  - **axis**: By default, the function is applied to each column (axis=0); set __axis to 1__ to apply the function to each row.

The Syntax:
```python
df.agg(func=['sum', 'min', 'max'])       # a list of functions
df.agg(func={'col_1': 'mean', 'col_2': 'median'})

# Or even more functions on each column
df.agg({'col_1': ['min', 'max'],
        'col_2': ['mean', 'median']})

# The Docs (IPython/Jupyter only)
df.agg?
```  
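A runnable sketch of the three forms on an ungrouped DataFrame follows. The frame `scores` and its values are invented for illustration; only the shape of the calls matters.

```python
import pandas as pd

scores = pd.DataFrame({'col_1': [1, 2, 3], 'col_2': [10, 20, 60]})

several = scores.agg(['sum', 'min', 'max'])                      # list of functions
per_column = scores.agg({'col_1': 'mean', 'col_2': 'median'})    # dict per column
row_totals = scores.agg('sum', axis=1)                           # one value per row

print(several.loc['sum', 'col_2'])  # 90
print(per_column['col_1'])          # 2.0
print(list(row_totals))             # [11, 22, 63]
```

With a list, the function names become the index of the result; with a dict of single functions, the result is a Series indexed by column name.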

In [27]:
seed(234)
df = pd.DataFrame({'var_1': choices(range(10, 300, 30),  k = 6),
                     'var_2': sample(range(1, 999, 100), k = 6),
                     'key_1': ['a', 'a', 'c', 'c', 'd', 'e'], 
                     'key_2': ['one', 'two', 'one', 'two', 'one', 'one']})
df

| | var_1 | var_2 | key_1 | key_2 |
|---|---|---|---|---|
| 0 | 100 | 301 | a | one |
| 1 | 250 | 401 | a | two |
| 2 | 250 | 501 | c | one |
| 3 | 280 | 201 | c | two |
| 4 | 160 | 901 | d | one |
| 5 | 130 | 801 | e | one |


In [28]:
grouped = df.groupby('key_1')

In [29]:
list(grouped)

[('a',
     var_1  var_2 key_1 key_2
  0    100    301     a   one
  1    250    401     a   two),
 ('c',
     var_1  var_2 key_1 key_2
  2    250    501     c   one
  3    280    201     c   two),
 ('d',
     var_1  var_2 key_1 key_2
  4    160    901     d   one),
 ('e',
     var_1  var_2 key_1 key_2
  5    130    801     e   one)]

### User Defined Function to Agg method

We can create a function and then pass it to the __agg__ method. Here, __min_max__ returns the range (maximum minus minimum) of each group.

In [30]:
def min_max(obj):
    return obj.max() - obj.min()

In [31]:
grouped[['var_1', 'var_2']].agg(min_max)

| key_1 | var_1 | var_2 |
|---|---|---|
| a | 150 | 100 |
| c | 30 | 300 |
| d | 0 | 0 |
| e | 0 | 0 |
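A user-defined function can also be mixed with built-in function names in one list. In the sketch below, the frame `data` and the function `value_range` are assumptions for illustration; the column label of the custom aggregate is taken from the function's name.

```python
import pandas as pd

data = pd.DataFrame({'key': ['a', 'a', 'c', 'c'],
                     'value': [100, 250, 250, 280]})

def value_range(series):
    """Spread between the largest and smallest value of one group."""
    return series.max() - series.min()

result = data.groupby('key')['value'].agg(['mean', value_range])

print(list(result.columns))          # ['mean', 'value_range']
print(result.loc['a', 'value_range'])  # 150
```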


### A List of Functions with Agg Method

   - Passing a list of functions or function names will apply each function on each group. The result will be a DataFrame with the functions as column names.

In [32]:
grouped[['var_1', 'var_2']].agg(['size', 'mean', 'min', 'max', 'std'])

| key_1 | var_1, size | var_1, mean | var_1, min | var_1, max | var_1, std | var_2, size | var_2, mean | var_2, min | var_2, max | var_2, std |
|---|---|---|---|---|---|---|---|---|---|---|
| a | 2 | 175.0 | 100 | 250 | 106.07 | 2 | 351.0 | 301 | 401 | 70.71 |
| c | 2 | 265.0 | 250 | 280 | 21.21 | 2 | 351.0 | 201 | 501 | 212.13 |
| d | 1 | 160.0 | 160 | 160 | NaN | 1 | 901.0 | 901 | 901 | NaN |
| e | 1 | 130.0 | 130 | 130 | NaN | 1 | 801.0 | 801 | 801 | NaN |


We don't have to accept the function name as the column label; we can pass a tuple (name, function) to get a meaningful name.

In [33]:
grouped[['var_1', 'var_2']].agg([('N', 'size'), ('Average', 'mean')])

| key_1 | var_1, N | var_1, Average | var_2, N | var_2, Average |
|---|---|---|---|---|
| a | 2 | 175.0 | 2 | 351.0 |
| c | 2 | 265.0 | 2 | 351.0 |
| d | 1 | 160.0 | 1 | 901.0 |
| e | 1 | 130.0 | 1 | 801.0 |
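A related option is named aggregation, where each output column is given as a keyword whose value is a (column, function) pair. This sketch assumes pandas 0.25 or later; the frame `data` and the output names are made up for illustration.

```python
import pandas as pd

data = pd.DataFrame({'key_1': ['a', 'a', 'c', 'c'],
                     'var_1': [100, 250, 250, 280],
                     'var_2': [301, 401, 501, 201]})

summary = data.groupby('key_1').agg(
    n=('var_1', 'count'),
    avg_var_1=('var_1', 'mean'),
    median_var_2=('var_2', 'median'),
)

print(list(summary.columns))  # ['n', 'avg_var_1', 'median_var_2']
```

The result has flat, single-level column names, which can be easier to work with than the two-level header produced by a list of tuples.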


So far, we applied the same aggregate functions to every column. But what if we want to apply a different function, or set of functions, to different columns? In that case, we pass a dict that maps each column name to a function to the __agg__ method. Of course, the function part of the dict can be a list.

In [34]:
grouped.agg({'var_1': 'mean', 'var_2': 'median'})

| key_1 | var_1 | var_2 |
|---|---|---|
| a | 175.0 | 351.0 |
| c | 265.0 | 351.0 |
| d | 160.0 | 901.0 |
| e | 130.0 | 801.0 |


In [35]:
grouped.agg({'var_1': ['mean', 'max'], 'var_2': 'median'})

| key_1 | var_1, mean | var_1, max | var_2, median |
|---|---|---|---|
| a | 175.0 | 250 | 351.0 |
| c | 265.0 | 280 | 351.0 |
| d | 160.0 | 160 | 901.0 |
| e | 130.0 | 130 | 801.0 |


### Index Suppression 

   - The unique group keys are returned as the index by default; we turn that off by passing __as_index=False__.

In [36]:
df.groupby(['key_1', 'key_2'], as_index = False).agg({'var_1': ['mean', 'max'],\
                                                      'var_2': 'median'})

| | key_1 | key_2 | var_1, mean | var_1, max | var_2, median |
|---|---|---|---|---|---|
| 0 | a | one | 100.0 | 100 | 301.0 |
| 1 | a | two | 250.0 | 250 | 401.0 |
| 2 | c | one | 250.0 | 250 | 501.0 |
| 3 | c | two | 280.0 | 280 | 201.0 |
| 4 | d | one | 160.0 | 160 | 901.0 |
| 5 | e | one | 130.0 | 130 | 801.0 |
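For a simple aggregation, passing __as_index=False__ gives the same table as aggregating first and then calling `reset_index()`. The sketch below checks this on a made-up frame `data`.

```python
import pandas as pd

data = pd.DataFrame({'key_1': ['a', 'a', 'c'],
                     'var_1': [100, 250, 280]})

indexed = data.groupby('key_1')['var_1'].mean()               # key_1 is the index
flat = data.groupby('key_1', as_index=False)['var_1'].mean()  # key_1 is a column
same = indexed.reset_index()                                  # index moved back to a column

print(list(flat.columns))                            # ['key_1', 'var_1']
print(flat.to_dict('list') == same.to_dict('list'))  # True
```

Which form to use is mostly a matter of taste; the flat form is convenient when the result is merged with other tables or written to a file.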


### Practical Example of Grouping (Iris Dataset)

The Iris dataset comes from an external package, __rdatasets__, so we need to install it beforehand.

```python 
# 1. Installing from a Jupyter notebook
#    !pip install rdatasets

# 2. Installing from the command line (Terminal or cmd)
#    pip install rdatasets
```

In [37]:
#!pip install rdatasets

In [38]:
# import the data
from rdatasets import data

In [39]:
iris = data('iris')

In [40]:
type(iris)

pandas.core.frame.DataFrame

> The iris data is a pandas DataFrame. Thus, we can apply what we learned in this tutorial to it.

In [41]:
iris.head()

| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


### Task: 

The dataset has five variables: four are numeric and one is categorical. We want to answer the following questions:

   - How many observations are there in each category?
   
   - What is the mean of each variable for each species?
   
   - What are the median, variance, standard deviation, min value, and max value?
   
   - Compute the summary statistics for the variables **Sepal.Length and Petal.Length** (not all the variables)

#### A1: The count method counts how many observations are in each group:

 - We can subset the results to get the results of the desired variable
 
```python 

  df.groupby('col-to-group-by')['desired-column'].agg_function()
```

In [42]:
iris.groupby('Species').count()

| Species | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|---|
| setosa | 50 | 50 | 50 | 50 |
| versicolor | 50 | 50 | 50 | 50 |
| virginica | 50 | 50 | 50 | 50 |


- We see that the data is balanced: each species has 50 observations. We can subset the result to keep only __Sepal.Length__ like this.


- Note that this is a very common task in data analysis

In [43]:
iris.groupby('Species')['Sepal.Length'].count()

Species
setosa        50
versicolor    50
virginica     50
Name: Sepal.Length, dtype: int64

#### A2: The mean of each species

In [44]:
iris.groupby('Species').mean()

| Species | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|---|
| setosa | 5.01 | 3.43 | 1.46 | 0.25 |
| versicolor | 5.94 | 2.77 | 4.26 | 1.33 |
| virginica | 6.59 | 2.97 | 5.55 | 2.03 |
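The remaining statistics from the task (median, variance, standard deviation, min, and max) can be requested in one call by passing a list to __agg__. The sketch below uses a tiny hand-made stand-in called `flowers`, not the real iris data, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in with the same column names as iris.
flowers = pd.DataFrame({
    'Species': ['setosa', 'setosa', 'virginica', 'virginica'],
    'Sepal.Length': [5.0, 5.2, 6.4, 6.8],
    'Petal.Length': [1.4, 1.6, 5.5, 5.9],
})

stats = flowers.groupby('Species').agg(['median', 'var', 'std', 'min', 'max'])
n_species = flowers['Species'].nunique()   # number of distinct categories

print(stats.shape)   # (2, 10)
print(n_species)     # 2
```

On the real data, replacing `flowers` with `iris` should give one row per species and five statistics per numeric variable.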


> The other aggregate functions are done the same way

#### The Summary Statistics for Sepal.Length and Petal.Length

In [45]:
iris.groupby('Species')[['Sepal.Length', 'Petal.Length']].describe()

| Species | Sepal.Length, count | Sepal.Length, mean | Sepal.Length, std | Sepal.Length, min | Sepal.Length, 25% | Sepal.Length, 50% | Sepal.Length, 75% | Sepal.Length, max | Petal.Length, count | Petal.Length, mean | Petal.Length, std | Petal.Length, min | Petal.Length, 25% | Petal.Length, 50% | Petal.Length, 75% | Petal.Length, max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| setosa | 50.0 | 5.01 | 0.35 | 4.3 | 4.8 | 5.0 | 5.2 | 5.8 | 50.0 | 1.46 | 0.17 | 1.0 | 1.4 | 1.5 | 1.58 | 1.9 |
| versicolor | 50.0 | 5.94 | 0.52 | 4.9 | 5.6 | 5.9 | 6.3 | 7.0 | 50.0 | 4.26 | 0.47 | 3.0 | 4.0 | 4.35 | 4.6 | 5.1 |
| virginica | 50.0 | 6.59 | 0.64 | 4.9 | 6.23 | 6.5 | 6.9 | 7.9 | 50.0 | 5.55 | 0.55 | 4.5 | 5.1 | 5.55 | 5.88 | 6.9 |


#### Here it is best to transpose the results

In [46]:
iris.groupby('Species')[['Sepal.Length', 'Petal.Length']].describe().transpose()

| | Species | setosa | versicolor | virginica |
|---|---|---|---|---|
| Sepal.Length | count | 50.0 | 50.0 | 50.0 |
| Sepal.Length | mean | 5.01 | 5.94 | 6.59 |
| Sepal.Length | std | 0.35 | 0.52 | 0.64 |
| Sepal.Length | min | 4.3 | 4.9 | 4.9 |
| Sepal.Length | 25% | 4.8 | 5.6 | 6.23 |
| Sepal.Length | 50% | 5.0 | 5.9 | 6.5 |
| Sepal.Length | 75% | 5.2 | 6.3 | 6.9 |
| Sepal.Length | max | 5.8 | 7.0 | 7.9 |
| Petal.Length | count | 50.0 | 50.0 | 50.0 |
| Petal.Length | mean | 1.46 | 4.26 | 5.55 |


# Congratulations on completing this tutorial