
# Aggregation and Grouping

In [None]:
import numpy as np
import pandas as pd

Aggregation functions such as `min()`, `max()`, `sum()`, `mean()` and `median()` are simple NumPy functions that summarize a large dataset with a single value.

## Planets data

We import the planets data from the seaborn library.

In [None]:
import seaborn as sns

In [None]:
planet_df=sns.load_dataset('planets')

This planets DataFrame contains information on exoplanets discovered by astronomers.

In [None]:
planet_df.shape

(1035, 6)

It records 1035 planets discovered up to 2014.

In [None]:
planet_df.head()

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


## Simple aggregation in Pandas

When we perform an aggregation on a Series, it returns a single value, whereas on a DataFrame it returns one value per column.

In [None]:
ser=pd.Series(np.random.rand(10))
ser

0    0.236299
1    0.904838
2    0.205369
3    0.584562
4    0.136947
5    0.863125
6    0.474971
7    0.672614
8    0.587219
9    0.142691
dtype: float64

In [None]:
ser.min()

0.13694679731327675

In [None]:
df=pd.DataFrame({'A':np.random.rand(5),'B':np.random.rand(5)})

In [None]:
df

Unnamed: 0,A,B
0,0.624083,0.307107
1,0.493245,0.133262
2,0.145676,0.903586
3,0.158091,0.646418
4,0.502726,0.426681


By default, computing the mean gives the mean of each column:

In [None]:
df.mean()

A    0.384764
B    0.483411
dtype: float64

By specifying `axis=1`, we can instead compute the mean within each row:

In [None]:
df.mean(axis=1)

0    0.465595
1    0.313253
2    0.524631
3    0.402254
4    0.464704
dtype: float64

Pandas has all the aggregates discussed in NumPy, plus a convenient method, `describe()`, which lists several summary statistics for every numeric column.

In [None]:
planet_df.describe()

Unnamed: 0,number,orbital_period,mass,distance,year
count,1035.0,992.0,513.0,808.0,1035.0
mean,1.785507,2002.917596,2.638161,264.069282,2009.070531
std,1.240976,26014.728304,3.818617,733.116493,3.972567
min,1.0,0.090706,0.0036,1.35,1989.0
25%,1.0,5.44254,0.229,32.56,2007.0
50%,1.0,39.9795,1.26,55.25,2010.0
75%,2.0,526.005,3.04,178.5,2012.0
max,7.0,730000.0,25.0,8500.0,2014.0


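Pandas aggregates are:
1. `count()`: number of non-null items
2. `min()`, `max()`: minimum and maximum
3. `std()`, `var()`: standard deviation and variance
4. `mad()`: mean absolute deviation (removed in Pandas 2.0)
5. `prod()`: product of all items
6. `sum()`: sum of all items
7. `mean()`, `median()`: mean and median
8. `first()`, `last()`: first and last item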

These are the aggregates on DataFrames and Series.
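As a quick check of a few of these aggregates, here is a minimal sketch on a hand-made Series (`values` is an illustrative name). The mean absolute deviation is written out by hand, on the assumption that `mad()` may not be available in the installed Pandas version:

```python
import pandas as pd

# Small hand-made Series so every result can be checked by eye
values = pd.Series([2.0, 4.0, 4.0, 10.0])

total = values.sum()       # 2 + 4 + 4 + 10
average = values.mean()
middle = values.median()
product = values.prod()    # 2 * 4 * 4 * 10
n_items = values.count()   # number of non-null items

# Mean absolute deviation: average distance of each item from the mean
mean_abs_dev = (values - values.mean()).abs().mean()

print(total, average, middle, product, n_items, mean_abs_dev)
# 20.0 5.0 4.0 320.0 4 2.5
```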

The groupby operation is used to quickly and efficiently compute aggregates on subsets of the data.

## Groupby : Split, Apply and Combine

Simple aggregates give only a brief idea of the dataset; often we prefer to aggregate conditionally on some label or index using the *groupby* operation. The name comes from the `GROUP BY` command in SQL.

### Split, Apply and Combine

The following image gives the description of groupby operation. Here the Apply is summation.

[Groupby operation](https://github.com/Shuraimi/DataScience-Handbook-Notes/blob/main/Screenshot_2023-11-01-11-52-14-392-edit_com.foxit.mobile.pdf.lite.jpg)

The steps in the image are:
1. The *split* step breaks up and groups a DataFrame based on the values of the specified key.
2. The *apply* step computes some function, usually an aggregate, transformation or filter, within each group.
3. The *combine* step merges the results of these operations into the output.

This can be achieved with a combination of masking, aggregation and merging commands, but *we do not need to perform the intermediate splits explicitly*.
Rather, *groupby* can do this in a single pass over the data, updating the mean, min, count or other aggregate for each group along the way.

The power of groupby is that it abstracts away these steps: the user need not worry about how these operations work under the hood and can think of the operation as a whole.
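To make the contrast concrete, the sketch below performs the split, apply and combine steps by hand with boolean masks and then repeats the computation with `groupby`. The names `toy`, `manual` and `grouped` are illustrative only:

```python
import pandas as pd

toy = pd.DataFrame({'keys': ['A', 'B', 'A', 'C', 'B'],
                    'data': [0, 1, 2, 3, 4]})

# Manual version: one boolean mask per key
manual = {}
for key in toy['keys'].unique():
    mask = toy['keys'] == key                  # split
    manual[key] = toy.loc[mask, 'data'].sum()  # apply
manual = pd.Series(manual).sort_index()        # combine

# groupby version: the same three steps in one expression
grouped = toy.groupby('keys')['data'].sum()

print(manual.to_dict())   # {'A': 2, 'B': 5, 'C': 3}
print(grouped.to_dict())  # {'A': 2, 'B': 5, 'C': 3}
```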

As an example, create a DataFrame and apply the simplest groupby operation on its `keys` column.

In [None]:
dff=pd.DataFrame({'keys':['A','B','A','C','B'],'data':np.arange(5)})

We can perform the most basic split-apply-combine operation by passing the desired key column to `groupby()`.

In [None]:
print(dff)
dff.groupby('keys')

  keys  data
0    A     0
1    B     1
2    A     2
3    C     3
4    B     4


<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7d6ce2620e20>

Notice that this result is not a DataFrame but a `DataFrameGroupBy` object. This object is where the magic is: you can think of it as a special view of the DataFrame which is poised to dig into the groups but does no actual computation until an aggregation is applied.

This "lazy evaluation" approach means that common aggregates can be implemented very efficiently, in a way that is almost transparent to the user.

To produce a result, we apply an aggregation to this object, which performs the appropriate apply and combine steps.

In [None]:
dff.groupby('keys').sum()

Unnamed: 0_level_0,data
keys,Unnamed: 1_level_1
A,2
B,5
C,3


We can apply any Pandas or NumPy aggregation function in place of `sum()`.

### The GroupBy object

The GroupBy object is a very flexible abstraction. In many ways, it can be thought of as a collection of DataFrames and does complex things under the hood.

The most important operations are *Aggregate, Filter, Transform and Apply* which will be discussed later.

### Column indexing

The GroupBy object supports column indexing in the same way as a DataFrame and returns a modified GroupBy object.

In [None]:
planet_df.groupby('method')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7d6ce2620400>

In [None]:
planet_df.groupby('method')['orbital_period']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7d6ce2621e40>

Here we've selected a particular Series group from the original DataFrame group by reference to its column name. No computation is performed until an aggregation is applied.

In [None]:
planet_df.groupby('method')['orbital_period'].mean()

method
Astrometry                          631.180000
Eclipse Timing Variations          4751.644444
Imaging                          118247.737500
Microlensing                       3153.571429
Orbital Brightness Modulation         0.709307
Pulsar Timing                      7343.021201
Pulsation Timing Variations        1170.000000
Radial Velocity                     823.354680
Transit                              21.102073
Transit Timing Variations            79.783500
Name: orbital_period, dtype: float64

This gives an idea of the general scale of orbital periods (in days) that each method is sensitive to.

### Iteration over groups

The GroupBy object supports direct iteration over the groups, returning each group as a Series or DataFrame.

In [None]:
for (method,group) in planet_df.groupby('method'):
    print("{0:30s} shape={1}".format(method, group.shape))

Astrometry                     shape=(2, 6)
Eclipse Timing Variations      shape=(9, 6)
Imaging                        shape=(38, 6)
Microlensing                   shape=(23, 6)
Orbital Brightness Modulation  shape=(3, 6)
Pulsar Timing                  shape=(5, 6)
Pulsation Timing Variations    shape=(1, 6)
Radial Velocity                shape=(553, 6)
Transit                        shape=(397, 6)
Transit Timing Variations      shape=(4, 6)


This is helpful for doing certain things manually, though the built-in `apply` functionality is often more convenient.
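As a small illustration of working with groups by hand, the sketch below collects the size of each group in a loop and pulls out a single group with `get_group()`. The DataFrame `toy` and the variable names are illustrative:

```python
import pandas as pd

toy = pd.DataFrame({'keys': ['A', 'B', 'A', 'C', 'B'],
                    'data': [0, 1, 2, 3, 4]})

# Each iteration yields the group label and the sub-DataFrame for that label
group_sizes = {}
for key, group in toy.groupby('keys'):
    group_sizes[key] = len(group)

# get_group() returns the rows of one group without looping
group_b = toy.groupby('keys').get_group('B')

print(group_sizes)               # {'A': 2, 'B': 2, 'C': 1}
print(group_b['data'].tolist())  # [1, 4]
```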

### Dispatch methods

Through some Python magic, any method not explicitly implemented by the GroupBy object will be passed through and called on the groups, whether they are DataFrame or Series objects. For example, we can use the `describe()` method to perform a set of aggregations on each group.

In [None]:
planet_df.groupby('method')['year'].describe().unstack()

       method                       
count  Astrometry                          2.0
       Eclipse Timing Variations           9.0
       Imaging                            38.0
       Microlensing                       23.0
       Orbital Brightness Modulation       3.0
                                         ...  
max    Pulsar Timing                    2011.0
       Pulsation Timing Variations      2007.0
       Radial Velocity                  2014.0
       Transit                          2014.0
       Transit Timing Variations        2014.0
Length: 80, dtype: float64

Notice that the method is applied to each group and the results are then combined within the GroupBy object and returned. Any valid DataFrame or Series method can be dispatched this way.
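The same dispatch can be checked on a DataFrame small enough to verify by hand. This is a sketch with illustrative names (`toy`, `summary`); each row of the result holds the `describe()` statistics for one group:

```python
import pandas as pd

toy = pd.DataFrame({'keys': ['A', 'B', 'A', 'C', 'B'],
                    'data': [0, 1, 2, 3, 4]})

# describe() is called on each group's 'data' values and the results are stacked
summary = toy.groupby('keys')['data'].describe()

print(summary[['count', 'mean', 'max']])
```

Group A holds the values 0 and 2, so its row shows a count of 2 and a mean of 1.0.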

### Aggregate, filter, transform and apply

In the preceding discussion, we used aggregation for the combine operation, but there are more options available. In particular, GroupBy objects have the methods *aggregate()*, *filter()*, *transform()* and *apply()*, which efficiently implement a variety of operations before combining the grouped data.

In [None]:
df1=pd.DataFrame({'keys':['A','B','A','C','B'],'data1':np.random.randint(0,10,5),'data2':np.random.randint(0,50,5)})

In [None]:
df1

Unnamed: 0,keys,data1,data2
0,A,0,42
1,B,4,0
2,A,5,14
3,C,3,25
4,B,1,40


### Aggregations

We're now familiar with aggregations like `sum()`, `mean()` and `median()`, but `aggregate()` allows even more flexibility. It can take a string, a function, or a list thereof, and compute all the aggregates at once.

In [None]:
df1.groupby('keys').aggregate(['min',np.median,'max'])

Unnamed: 0_level_0,data1,data1,data1,data2,data2,data2
Unnamed: 0_level_1,min,median,max,min,median,max
keys,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
A,0,2.5,5,14,28.0,42
B,1,2.5,4,0,20.0,40
C,3,3.0,3,25,25.0,25


Another useful pattern is to pass a dictionary mapping column names to the operations to apply to them.

In [None]:
df1.groupby('keys').aggregate({'data1':'min','data2':'max'})

Unnamed: 0_level_0,data1,data2
keys,Unnamed: 1_level_1,Unnamed: 2_level_1
A,0,42
B,1,40
C,3,25
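A related pattern, not used in this notebook, is named aggregation, where each keyword names an output column and maps to a `(column, function)` pair. The sketch below assumes a Pandas version that supports it (0.25 or later) and rebuilds the values shown for `df1` as a fixed DataFrame called `toy`:

```python
import pandas as pd

# Same values as the df1 printed above, fixed so results are reproducible
toy = pd.DataFrame({'keys': ['A', 'B', 'A', 'C', 'B'],
                    'data1': [0, 4, 5, 3, 1],
                    'data2': [42, 0, 14, 25, 40]})

# keyword = (source column, aggregation) gives flat, descriptive column names
summary = toy.groupby('keys').agg(
    data1_min=('data1', 'min'),
    data2_max=('data2', 'max'),
    data2_mean=('data2', 'mean'),
)
print(summary)
```

This avoids the two-level column header produced when a list of functions is passed.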


### Filtering

In [None]:
def filter_func(x):
    return x['data2'].std()>4

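A filtering operation lets us drop data based on group properties. Here `filter_func` keeps a group only if the standard deviation of its `data2` values is greater than 4.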

In [None]:
df1.groupby('keys').std()

Unnamed: 0_level_0,data1,data2
keys,Unnamed: 1_level_1,Unnamed: 2_level_1
A,3.535534,19.79899
B,2.12132,28.284271
C,,


In [None]:
df1.groupby('keys').filter(filter_func)

Unnamed: 0,keys,data1,data2
0,A,0,42
1,B,4,0
2,A,5,14
4,B,1,40


The function passed to `filter()` returns a boolean value specifying whether the group passes the filter.

Group C has a single row, so its standard deviation is NaN and the comparison `std > 4` is False; hence it is dropped from the result.
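The behaviour for single-row groups can be confirmed with fixed data. This sketch rebuilds the values shown for `df1` as `toy` (an illustrative name) and checks both the NaN standard deviation and the rows that survive the filter:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'keys': ['A', 'B', 'A', 'C', 'B'],
                    'data1': [0, 4, 5, 3, 1],
                    'data2': [42, 0, 14, 25, 40]})

# Sample standard deviation per group; a one-row group gives NaN
stds = toy.groupby('keys')['data2'].std()

# NaN > 4 evaluates to False, so group C fails the test and is dropped
kept = toy.groupby('keys').filter(lambda g: g['data2'].std() > 4)

print(np.isnan(stds['C']))   # True
print(kept.index.tolist())   # [0, 1, 2, 4]
```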

### Transformation

While aggregation must return a reduced version of the data, transformation can return a transformed version of the full data to recombine. The output of a transformation has the same shape as the input.

For example, we can center the data by subtracting the group-wise mean.

In [None]:
df1.groupby('keys').transform((lambda x: x-x.mean()))

Unnamed: 0,data1,data2
0,-2.5,14.0
1,1.5,-20.0
2,2.5,-14.0
3,0.0,0.0
4,-1.5,20.0
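To see where the centered values come from, the sketch below splits the computation in two: `transform('mean')` broadcasts each group's mean back to the original rows, and the subtraction then centers the column. The names `toy`, `group_means` and `centered` are illustrative, and the values are those shown for `df1`:

```python
import pandas as pd

toy = pd.DataFrame({'keys': ['A', 'B', 'A', 'C', 'B'],
                    'data1': [0, 4, 5, 3, 1],
                    'data2': [42, 0, 14, 25, 40]})

# One group mean per original row, in the original row order
group_means = toy.groupby('keys')['data1'].transform('mean')

# Subtracting the broadcast means centers each group around zero
centered = toy['data1'] - group_means

print(group_means.tolist())  # [2.5, 2.5, 2.5, 3.0, 2.5]
print(centered.tolist())     # [-2.5, 1.5, 2.5, 0.0, -1.5]
```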


### The apply() method

The `apply()` method lets us apply an arbitrary function to each group. The function should take a DataFrame and return a Pandas object (Series or DataFrame) or a scalar; the combine step is then tailored to the type of output returned.

For example, here we normalize `data1` by dividing it by `data2`, row by row within each group.

In [None]:
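def norm_data1(x):
    # x is the sub-DataFrame holding the rows of one group
    x = x.copy()  # work on a copy so the original rows are left untouched
    x['data1'] = x['data1'] / x['data2']
    return x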

In [None]:
df1.groupby('keys', group_keys=False).apply(norm_data1)

Unnamed: 0,keys,data1,data2
0,A,0.0,42
1,B,inf,0
2,A,0.357143,14
3,C,0.12,25
4,B,0.025,40


The `apply()` method within GroupBy is quite flexible. The only criterion is that the function takes a DataFrame and returns a Pandas object or a scalar.
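A variant that actually depends on the grouping is to divide `data1` by the total of `data2` within each group. The sketch below is one way to write it, under a few assumptions: the helper name `normalize_by_group_total` is hypothetical, `group_keys=False` is passed so the result keeps the original row labels, and only the numeric columns are selected before calling `apply()`:

```python
import pandas as pd

toy = pd.DataFrame({'keys': ['A', 'B', 'A', 'C', 'B'],
                    'data1': [0, 4, 5, 3, 1],
                    'data2': [42, 0, 14, 25, 40]})

def normalize_by_group_total(group):
    # group holds the data1/data2 rows of one key; return a modified copy
    out = group.copy()
    out['data1'] = out['data1'] / out['data2'].sum()
    return out

result = (toy.groupby('keys', group_keys=False)[['data1', 'data2']]
             .apply(normalize_by_group_total)
             .sort_index())
print(result)
```

Group B has a `data2` total of 40, so its `data1` values 4 and 1 become 0.1 and 0.025.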

### Specifying the split key

In the simple examples discussed, we split on a single column name. But this is one of the many options available.

The key can be any list, array, Series or Index whose length matches that of the DataFrame.

In [None]:
L=[0,1,0,1,2]
# numeric_only=True keeps the string column 'keys' out of the sum
df1.groupby(L).sum(numeric_only=True)


Unnamed: 0,data1,data2
0,5,56
1,7,25
2,1,40


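**How is the sum calculated?** Rows that share a label in `L` are added together: label 0 covers rows 0 and 2 (data1: 0+5=5, data2: 42+14=56), label 1 covers rows 1 and 3 (data1: 4+3=7, data2: 0+25=25), and label 2 covers row 4 alone.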

In [None]:
df1

Unnamed: 0,keys,data1,data2
0,A,0,42
1,B,4,0
2,A,5,14
3,C,3,25
4,B,1,40


The same grouping as before, `df1.groupby('keys')`, can also be written more verbosely by passing the column itself.

In [None]:
df1.groupby(df1['keys']).sum()

Unnamed: 0_level_0,data1,data2
keys,Unnamed: 1_level_1,Unnamed: 2_level_1
A,5,56
B,5,40
C,3,25


### A dictionary or Series mapping index to group

Another way is to provide a dictionary that maps index values to group keys.

In [None]:
df2=df1.set_index('keys')
map={'A':'vowel','B':'consonat','C':'consonat'}
df2.groupby({'A':'vowel','B':'consonat','C':'consonat'}).sum()

Unnamed: 0_level_0,data1,data2
keys,Unnamed: 1_level_1,Unnamed: 2_level_1
consonat,8,65
vowel,5,56


### Any Python function

We can pass any Python function that takes an index value and returns the group it belongs to.

In [None]:
df2.groupby(str.lower).sum()

Unnamed: 0_level_0,data1,data2
keys,Unnamed: 1_level_1,Unnamed: 2_level_1
a,5,56
b,5,40
c,3,25
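The function does not have to be a built-in. As a sketch, the hypothetical rule `before_c` below places index labels that sort before `'C'` in one group and everything else in another; the DataFrame `toy` reuses the values shown for `df1`, indexed by key:

```python
import pandas as pd

toy = pd.DataFrame({'data1': [0, 4, 5, 3, 1],
                    'data2': [42, 0, 14, 25, 40]},
                   index=['A', 'B', 'A', 'C', 'B'])

def before_c(label):
    # Hypothetical rule: called once per index label, returns the group name
    return 'A-B' if label < 'C' else 'C+'

result = toy.groupby(before_c).sum()
print(result)
```

Rows labelled A and B are summed together (data1: 0+4+5+1=10, data2: 42+0+14+40=96), while the single C row forms its own group.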


### A list of valid keys

Any of the preceding key choices can be combined in a list to group on a MultiIndex.

In [None]:
df2.groupby([str.lower,map]).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,data1,data2
keys,keys,Unnamed: 2_level_1,Unnamed: 3_level_1
a,vowel,5,56
b,consonat,5,40
c,consonat,3,25


### Grouping example

In [None]:
decade=10*(planet_df['year']//10)
decade

0       2000
1       2000
2       2010
3       2000
4       2000
        ... 
1030    2000
1031    2000
1032    2000
1033    2000
1034    2000
Name: year, Length: 1035, dtype: int64

In [None]:
de=planet_df['year']//10 # integer division by 10 drops the last digit
de

0       200
1       200
2       201
3       200
4       200
       ... 
1030    200
1031    200
1032    200
1033    200
1034    200
Name: year, Length: 1035, dtype: int64

In [None]:
# multiply by 10 to get the first year of the decade
de=10*de
de

0       2000
1       2000
2       2010
3       2000
4       2000
        ... 
1030    2000
1031    2000
1032    2000
1033    2000
1034    2000
Name: year, Length: 1035, dtype: int64

In [None]:
de=de.astype(str)
de

0       2000
1       2000
2       2010
3       2000
4       2000
        ... 
1030    2000
1031    2000
1032    2000
1033    2000
1034    2000
Name: year, Length: 1035, dtype: object

In [None]:

de=de+'s'
de

0       2000s
1       2000s
2       2010s
3       2000s
4       2000s
        ...  
1030    2000s
1031    2000s
1032    2000s
1033    2000s
1034    2000s
Name: year, Length: 1035, dtype: object

In [None]:
de.name='decade'

In [None]:
planet_df.groupby(['method',de])['number'].sum().unstack().fillna(0)

decade,1980s,1990s,2000s,2010s
method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Astrometry,0.0,0.0,0.0,2.0
Eclipse Timing Variations,0.0,0.0,5.0,10.0
Imaging,0.0,0.0,29.0,21.0
Microlensing,0.0,0.0,12.0,15.0
Orbital Brightness Modulation,0.0,0.0,0.0,5.0
Pulsar Timing,0.0,9.0,1.0,1.0
Pulsation Timing Variations,0.0,0.0,1.0,0.0
Radial Velocity,1.0,52.0,475.0,424.0
Transit,0.0,0.0,64.0,712.0
Transit Timing Variations,0.0,0.0,0.0,9.0


The complete code in a few lines:

In [None]:
decade = 10 * (planet_df['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'
planet_df.groupby(['method', decade])['number'].sum().unstack().fillna(0)
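
The same recipe can be checked by hand on a few rows. The sketch below uses made-up data (`toy_planets` is not taken from the seaborn dataset) so that each cell of the final table is easy to verify:

```python
import pandas as pd

# Made-up rows for illustration only
toy_planets = pd.DataFrame({
    'method': ['Transit', 'Transit', 'Imaging', 'Transit'],
    'number': [1, 2, 1, 3],
    'year': [2008, 2011, 2004, 2013],
})

# 2008 -> 2000 -> '2000s', 2011 -> 2010 -> '2010s', and so on
decade = (10 * (toy_planets['year'] // 10)).astype(str) + 's'
decade.name = 'decade'

# Sum 'number' per (method, decade), then pivot decades into columns
table = (toy_planets.groupby(['method', decade])['number']
                    .sum().unstack().fillna(0))
print(table)
```

Imaging has no rows in the 2010s, so `unstack()` leaves a missing value there and `fillna(0)` turns it into 0.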