<h1>Grouper and Agg functions </h1>

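1. Grouper groups rows by a time frequency (for example monthly or yearly) inside a groupby call, which makes time-based summaries straightforward.
2. The agg function applies one or more summary operations (sum, mean, custom functions) to one or more columns in a single call.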

<h1>Grouping Time Series Data </h1>

In [1]:
# %load libs.py
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime, date, time

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity='all'

%config InlineBackend.figure_format='svg'
# plt.rcParams['figure.dpi']=120

pd.options.display.float_format='{:,.2f}'.format
pd.set_option('display.max_colwidth', None)


In [2]:
import pandas as pd
df = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=True")
df.head()
df.shape

Unnamed: 0,account number,name,sku,quantity,unit price,ext price,date
0,740150,Barton LLC,B1-20000,39,86.69,3380.91,2014-01-01 07:21:51
1,714466,Trantow-Barrows,S2-77896,-1,63.16,-63.16,2014-01-01 10:00:47
2,218895,Kulas Inc,B1-69924,23,90.7,2086.1,2014-01-01 13:24:58
3,307599,"Kassulke, Ondricka and Metz",S1-65481,41,21.05,863.05,2014-01-01 15:05:22
4,412290,Jerde-Hilpert,S2-34077,6,83.21,499.26,2014-01-01 23:26:55


(1500, 7)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   account number  1500 non-null   int64  
 1   name            1500 non-null   object 
 2   sku             1500 non-null   object 
 3   quantity        1500 non-null   int64  
 4   unit price      1500 non-null   float64
 5   ext price       1500 non-null   float64
 6   date            1500 non-null   object 
dtypes: float64(2), int64(2), object(3)
memory usage: 82.2+ KB


In [4]:
# change the date column to datetime data type

df["date"] = pd.to_datetime(df['date'])
df.head()

df.info()

Unnamed: 0,account number,name,sku,quantity,unit price,ext price,date
0,740150,Barton LLC,B1-20000,39,86.69,3380.91,2014-01-01 07:21:51
1,714466,Trantow-Barrows,S2-77896,-1,63.16,-63.16,2014-01-01 10:00:47
2,218895,Kulas Inc,B1-69924,23,90.7,2086.1,2014-01-01 13:24:58
3,307599,"Kassulke, Ondricka and Metz",S1-65481,41,21.05,863.05,2014-01-01 15:05:22
4,412290,Jerde-Hilpert,S2-34077,6,83.21,499.26,2014-01-01 23:26:55


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   account number  1500 non-null   int64         
 1   name            1500 non-null   object        
 2   sku             1500 non-null   object        
 3   quantity        1500 non-null   int64         
 4   unit price      1500 non-null   float64       
 5   ext price       1500 non-null   float64       
 6   date            1500 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(2), int64(2), object(2)
memory usage: 82.2+ KB


To summarize all of the sales by month, the resample function is available. However, <b>by default it operates on the index.</b> To use it here, either make the date column the index with set_index and then resample, or name the column through resample's on parameter.

In [5]:
df.set_index('date').resample('M')['ext price'].sum()

date
2014-01-31   185,361.66
2014-02-28   146,211.62
2014-03-31   203,921.38
2014-04-30   174,574.11
2014-05-31   165,418.55
2014-06-30   174,089.33
2014-07-31   191,662.11
2014-08-31   153,778.59
2014-09-30   168,443.17
2014-10-31   171,495.32
2014-11-30   119,961.22
2014-12-31   163,867.26
Freq: M, Name: ext price, dtype: float64
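The two ways of pointing resample at the dates can be compared on a small frame. This is only a sketch: the `sales` data is made up for illustration, and `'MS'` (month start) is used instead of `'M'` because that alias is spelled the same across pandas versions.

```python
import pandas as pd

# Hypothetical toy data standing in for the sales file
sales = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-05", "2014-01-20", "2014-02-03", "2014-02-15"]),
    "ext price": [100.0, 50.0, 30.0, 20.0],
})

# Option 1: move the date column into the index, then resample
by_index = sales.set_index("date").resample("MS")["ext price"].sum()

# Option 2: leave the frame untouched and name the date column with on=
by_column = sales.resample("MS", on="date")["ext price"].sum()

print(by_index)
print(by_column)
```

Both calls should give the same monthly totals; the second simply skips the set_index step.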

Summarizing monthly sales per customer with resample is a little more involved, because resample has to be chained after a groupby on name:

In [6]:
df.set_index('date').groupby('name')['ext price'].resample('M').sum()

name        date      
Barton LLC  2014-01-31    6,177.57
            2014-02-28   12,218.03
            2014-03-31    3,513.53
            2014-04-30   11,474.20
            2014-05-31   10,220.17
                            ...   
Will LLC    2014-08-31    1,439.82
            2014-09-30    4,345.99
            2014-10-31    7,085.33
            2014-11-30    3,210.44
            2014-12-31   12,561.21
Name: ext price, Length: 240, dtype: float64

`Grouper` streamlines this process: it takes the date column through its key parameter, so no set_index call is needed. The monthly totals become:

In [7]:
df.groupby(pd.Grouper(key='date', freq='M'))['ext price'].sum()

date
2014-01-31   185,361.66
2014-02-28   146,211.62
2014-03-31   203,921.38
2014-04-30   174,574.11
2014-05-31   165,418.55
2014-06-30   174,089.33
2014-07-31   191,662.11
2014-08-31   153,778.59
2014-09-30   168,443.17
2014-10-31   171,495.32
2014-11-30   119,961.22
2014-12-31   163,867.26
Freq: M, Name: ext price, dtype: float64
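A Grouper can also sit in a list next to other keys, which covers the per-customer case. The sketch below uses a made-up `sales` frame (not the real file) and the `'MS'` alias, and puts the chained resample next to the Grouper version for comparison.

```python
import pandas as pd

# Hypothetical toy data: two customers, a few invoices each
sales = pd.DataFrame({
    "name": ["Barton LLC", "Barton LLC", "Kulas Inc", "Kulas Inc"],
    "date": pd.to_datetime(["2014-01-05", "2014-01-20", "2014-01-07", "2014-02-15"]),
    "ext price": [100.0, 50.0, 30.0, 20.0],
})

# Chained version: set the index, group by customer, then resample
via_resample = (
    sales.set_index("date").groupby("name")["ext price"].resample("MS").sum()
)

# Grouper version: one groupby call, no index juggling
via_grouper = sales.groupby(
    ["name", pd.Grouper(key="date", freq="MS")]
)["ext price"].sum()

print(via_resample)
print(via_grouper)
```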

A benefit of Grouper is that the data can be summarized over a different time frame simply by changing the freq parameter to another valid offset alias. For instance, an annual summary per customer, with December as the last month of the year, looks like this:

In [8]:
df.groupby(['name', pd.Grouper(key='date', freq='A-DEC')])['ext price'].sum()

name                             date      
Barton LLC                       2014-12-31   109,438.50
Cronin, Oberbrunner and Spencer  2014-12-31    89,734.55
Frami, Hills and Schmidt         2014-12-31   103,569.59
Fritsch, Russel and Anderson     2014-12-31   112,214.71
Halvorson, Crona and Champlin    2014-12-31    70,004.36
Herman LLC                       2014-12-31    82,865.00
Jerde-Hilpert                    2014-12-31   112,591.43
Kassulke, Ondricka and Metz      2014-12-31    86,451.07
Keeling LLC                      2014-12-31   100,934.30
Kiehn-Spinka                     2014-12-31    99,608.77
Koepp Ltd                        2014-12-31   103,660.54
Kuhn-Gusikowski                  2014-12-31    91,094.28
Kulas Inc                        2014-12-31   137,351.96
Pollich LLC                      2014-12-31    87,347.18
Purdy-Kunde                      2014-12-31    77,898.21
Sanford and Sons                 2014-12-31    98,822.98
Stokes LLC                       2014-12-31 
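To see how little changes between time frames, the same groupby can be run in a loop over several aliases. This is a sketch on made-up data; it uses the period-start aliases `'MS'`, `'QS'` and `'YS'` on the assumption that they are accepted by both older and newer pandas, whereas the period-end aliases were, as far as I know, renamed in pandas 2.2 (`'M'` to `'ME'`, `'A'` to `'YE'`).

```python
import pandas as pd

# Hypothetical toy data spread over one year
sales = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-05", "2014-02-20", "2014-04-07", "2014-11-15"]),
    "ext price": [100.0, 50.0, 30.0, 20.0],
})

# Same groupby call for every time frame: only freq changes
totals = {
    freq: sales.groupby(pd.Grouper(key="date", freq=freq))["ext price"].sum()
    for freq in ("MS", "QS", "YS")
}

for freq, series in totals.items():
    print(freq)
    print(series)
```

Empty periods inside the date range show up with a total of 0 because sum is applied to an empty bin.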

<h1> agg function </h1>

In [9]:
df[["ext price", "quantity"]].sum()

ext price   2,018,784.32
quantity       36,463.00
dtype: float64

In [10]:
df["unit price"].mean()

55.00752666666659

agg combines the two calls above into one, applying every listed function to every selected column:

In [11]:
df[['ext price', 'quantity', 'unit price']].agg(['sum', 'mean'])

Unnamed: 0,ext price,quantity,unit price
sum,2018784.32,36463.0,82511.29
mean,1345.86,24.31,55.01


We can pass a dictionary to agg to specify which operations to apply to each column. Combinations that were not requested are left empty (NaN) in the result.

In [12]:
df.agg({'ext price':['sum', 'mean'], 'quantity':['sum', 'mean'], 'unit price':['mean']})

Unnamed: 0,ext price,quantity,unit price
sum,2018784.32,36463.0,
mean,1345.86,24.31,55.01
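The NaN behaviour is easier to see on a tiny frame. A minimal sketch, with a made-up `toy` DataFrame standing in for the sales data:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"quantity": [1, 2, 3], "unit price": [10.0, 20.0, 30.0]})

# sum and mean for quantity, but only mean for unit price
summary = toy.agg({"quantity": ["sum", "mean"], "unit price": ["mean"]})
print(summary)
```

The sum row for unit price is NaN because that operation was never requested for that column.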


We can also define our own function and pass it to agg. Setting __name__ on the lambda controls the row label in the result:

In [13]:
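# value_counts sorts by count (descending), so the first index entry is the most frequent value
get_max = lambda x: x.value_counts().index[0]

# a lambda's default name is '<lambda>'; rename it so the result row is labeled 'mode'
get_max.__name__ = 'mode'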

In [14]:
df.agg({'ext price':['sum', 'mean'], 'quantity':['sum', 'mean'], 'unit price':['mean'], 'sku':get_max})

Unnamed: 0,ext price,quantity,unit price,sku
sum,2018784.32,36463.0,,
mean,1345.86,24.31,55.01,
mode,,,,S2-77896
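An alternative to renaming a lambda is a regular def, whose name becomes the row label directly. This is a sketch on made-up data; `most_common` is a hypothetical helper name, not something from the original notebook.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "sku": ["S1", "S2", "S2", "B1"],
    "quantity": [1, 2, 3, 4],
})

def most_common(values):
    """Return the most frequent value in a Series."""
    return values.value_counts().index[0]

# The function name 'most_common' is used as the row label for the sku result
result = toy.agg({"quantity": ["sum", "mean"], "sku": [most_common]})
print(result)
```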


In [15]:
df['sku'].value_counts() # value_counts returns a Series: skus as the index, counts as the values

S2-77896    73
S1-82801    60
S2-10342    59
S1-47412    58
S1-93683    57
B1-38851    56
S2-82423    56
S1-30248    55
S1-50961    55
S1-06532    53
S1-27722    53
B1-53636    53
B1-20000    53
S2-83881    51
S2-34077    51
B1-53102    49
B1-05914    49
S2-16558    49
B1-33364    49
B1-04202    48
B1-65551    47
S2-78676    46
B1-69924    44
S2-11481    44
B1-33087    44
S2-23246    41
B1-50809    39
S2-00301    39
B1-86481    35
S1-65481    34
Name: sku, dtype: int64

In [16]:
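# count of the most frequent sku; [0] relies on positional fallback, so .iloc[0] is the safer spelling
df['sku'].value_counts()[:1][0]
df['sku'].value_counts()[:1].iloc[0]
print()
# label of the most frequent sku, which is what get_max returns
df['sku'].value_counts().index[0]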

73

73




'S2-77896'

We can make sure that the columns come out in a specific order using an OrderedDict (on Python 3.7+ a plain dict preserves insertion order as well):

In [17]:
import collections

f=collections.OrderedDict([('ext price', ['sum', 'mean']), ('quantity', ['sum', 'mean']), ('sku', [get_max])])

df.agg(f)

Unnamed: 0,ext price,quantity,sku
sum,2018784.32,36463.0,
mean,1345.86,24.31,
mode,,,S2-77896
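On a recent Python the same ordering can be had from a plain dict. A minimal sketch on made-up data, assuming agg lays out the result columns in the order of the dictionary keys, as the output above suggests:

```python
import pandas as pd

# Hypothetical toy data; the frame's own column order is sku, quantity, ext price
toy = pd.DataFrame({
    "sku": ["S1", "S2", "S2"],
    "quantity": [1, 2, 3],
    "ext price": [10.0, 20.0, 30.0],
})

# Plain dict: keys are listed in the order the result columns should appear
spec = {"ext price": ["sum"], "sku": ["count"], "quantity": ["sum"]}

ordered = toy.agg(spec)
print(list(ordered.columns))
```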
