# Data Aggregation and Group Operations

In [4]:
import pandas as pd
import numpy as np

## GroupBy mechanics

This section reviews the pandas `groupby` operation, which is similar to `GROUP BY` in SQL.

In [7]:
df = pd.DataFrame({'key1': list('aabba'),
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'balance': np.random.randn(5) * 10,
                   'income': np.random.randn(5) + 2})

df

    balance    income key1 key2
0  6.657970  1.399476    a  one
1  0.621236  4.367726    a  two
2 -8.940374  1.175115    b  one
3  5.069804  2.965204    b  two
4  6.427537  0.558563    a  one


In [8]:
df.mean(numeric_only=True)  # skip the string columns key1 and key2

balance    1.967235
income     2.093217
dtype: float64

In [10]:
df.groupby('key1')

<pandas.core.groupby.DataFrameGroupBy object at 0x7f260dcfca90>

This returns a lazy `GroupBy` object. Nothing is computed until a method such as `mean()` is called on it.
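To make the laziness concrete, here is a minimal sketch on a small hand-made frame. The name `demo` and its fixed values are illustrative assumptions, not the random data above. Creating the GroupBy object only records how the rows are split; the values are computed once a method such as `sum` is called.

```python
import pandas as pd

# Illustrative data with fixed values so the result is reproducible
demo = pd.DataFrame({
    "key1": list("aabba"),
    "balance": [10.0, 20.0, 30.0, 40.0, 60.0],
})

grouped = demo.groupby("key1")  # no aggregation has run yet

# The object knows how the rows are partitioned
n_groups = grouped.ngroups
row_labels = {key: list(idx) for key, idx in grouped.groups.items()}

# Values are only computed when an aggregation method is called
totals = grouped["balance"].sum()

print(n_groups)
print(row_labels)
print(totals)
```

The `groups` attribute maps each key to the row labels that belong to it, which is a quick way to check a grouping before aggregating.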

In [11]:
df.groupby('key1').mean(numeric_only=True)  # key2 is text, so it is left out

       balance    income
key1
a     4.568914  2.108589
b    -1.935285  2.070160


In [13]:
mean_key = df.groupby('key1')['balance'].mean()
mean_key

key1
a    4.568914
b   -1.935285
Name: balance, dtype: float64

In [14]:
mean_key['a']

4.5689141007453786

In [16]:
df.groupby(['key1','key2']).agg(['mean','count'])

            balance          income
               mean count      mean count
key1 key2
a    one   6.542753     2  0.979020     2
     two   0.621236     1  4.367726     1
b    one  -8.940374     1  1.175115     1
     two   5.069804     1  2.965204     1


In [19]:
help(mean_key.agg)

Help on method aggregate in module pandas.core.series:

aggregate(func, axis=0, *args, **kwargs) method of pandas.core.series.Series instance
    Aggregate using callable, string, dict, or list of string/callables
    
    .. versionadded:: 0.20.0
    
    Parameters
    ----------
    func : callable, string, dictionary, or list of string/callables
        Function to use for aggregating the data. If a function, must either
        work when passed a Series or when passed to Series.apply. For
        a DataFrame, can pass a dict, if the keys are DataFrame column names.
    
        Accepted Combinations are:
    
        - string function name
        - function
        - list of functions
        - dict of column names -> functions (or list of functions)
    
    Notes
    -----
    Numpy functions mean/median/prod/sum/std/var are special cased so the
    default behavior is applying the function along axis=0
    (e.g., np.mean(arr_2d, axis=0)) as opposed to
    mimicking the default

In [22]:
df.groupby('key1')['key2'].agg(lambda strseries: strseries.str.len().sum())

key1
a    9
b    6
Name: key2, dtype: int64
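The help text above lists four accepted forms for `func`. The following sketch runs each of them on a small illustrative frame (`demo` and its values are assumptions chosen so the results are easy to check by hand).

```python
import pandas as pd

# Illustrative data, not the random frame used above
demo = pd.DataFrame({
    "key1": list("aabba"),
    "balance": [10.0, 20.0, 30.0, 40.0, 60.0],
    "income": [1.0, 2.0, 3.0, 4.0, 6.0],
})
grouped = demo.groupby("key1")

# 1. String function name
by_name = grouped["balance"].agg("mean")

# 2. Callable that reduces each group's Series to a scalar
value_range = grouped["balance"].agg(lambda s: s.max() - s.min())

# 3. List of functions: one output column per function
several = grouped["balance"].agg(["min", "max"])

# 4. Dict of column name -> function (or list of functions)
per_column = grouped.agg({"balance": "sum", "income": ["mean", "count"]})

print(by_name)
print(value_range)
print(several)
print(per_column)
```

Because the dict in the last call contains a list, the result has two-level column labels such as `('income', 'mean')`.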

### Iterating over groups

In [25]:
for item in df.groupby('key1'):
    print(type(item))
    print(item)

<class 'tuple'>
('a',     balance    income key1 key2
0  6.657970  1.399476    a  one
1  0.621236  4.367726    a  two
4  6.427537  0.558563    a  one)
<class 'tuple'>
('b',     balance    income key1 key2
2 -8.940374  1.175115    b  one
3  5.069804  2.965204    b  two)


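Iterating over a GroupBy object yields `(key, group)` tuples, where `group` is the sub-DataFrame holding the rows for that key. The tuple can be unpacked directly in the `for` statement.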

In [26]:
for value, group in df.groupby('key1'):
    print(value)
    print(group)

a
    balance    income key1 key2
0  6.657970  1.399476    a  one
1  0.621236  4.367726    a  two
4  6.427537  0.558563    a  one
b
    balance    income key1 key2
2 -8.940374  1.175115    b  one
3  5.069804  2.965204    b  two


In [27]:
a = list(df.groupby('key1'))  # list() forces the lazy GroupBy object to produce its (key, group) pairs
a

[('a',     balance    income key1 key2
  0  6.657970  1.399476    a  one
  1  0.621236  4.367726    a  two
  4  6.427537  0.558563    a  one), ('b',     balance    income key1 key2
  2 -8.940374  1.175115    b  one
  3  5.069804  2.965204    b  two)]

In [29]:
dict_groupby = dict(a)
dict_groupby

{'a':     balance    income key1 key2
 0  6.657970  1.399476    a  one
 1  0.621236  4.367726    a  two
 4  6.427537  0.558563    a  one, 'b':     balance    income key1 key2
 2 -8.940374  1.175115    b  one
 3  5.069804  2.965204    b  two}
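When only one group is needed, building a dict of every group is optional. A minimal sketch, again on an illustrative frame named `demo`, compares the dict-of-groups approach with the `get_group` accessor:

```python
import pandas as pd

# Illustrative data, not the random frame used above
demo = pd.DataFrame({
    "key1": list("aabba"),
    "balance": [10.0, 20.0, 30.0, 40.0, 60.0],
})
grouped = demo.groupby("key1")

# Dict of key -> sub-DataFrame, equivalent to dict(list(grouped))
pieces = {key: group for key, group in grouped}

# get_group fetches a single group by its key
group_b = grouped.get_group("b")

print(group_b)
print(pieces["b"].equals(group_b))
```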

## Data aggregation

In [31]:
import requests

url = 'https://raw.githubusercontent.com/wesm/pydata-book/1st-edition/ch08/tips.csv'
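response = requests.get(url)
response.raise_for_status()  # stop here if the download failed

# The context manager closes the file even if the write fails
with open('tips.csv', 'wb') as out_file:
    out_file.write(response.content)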

In [32]:
tips = pd.read_csv('tips.csv')
tips.head()

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4


Do women and men tip differently? To compare them, compute the tip as a fraction of the bill and summarize it by sex.

In [52]:
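tips['perc_tip'] = tips['tip'] / tips['total_bill']
# Select the column before aggregating so the text columns are not averaged
values_for_z_stat = tips.groupby('sex')['perc_tip'].agg(['mean', 'std', 'count'])
mean_diff = values_for_z_stat['mean']['Male'] - values_for_z_stat['mean']['Female']
# Standard error of the overall mean of perc_tip
std_err = tips['perc_tip'].std() / np.sqrt(tips['perc_tip'].count())
z_stat = mean_diff / std_err
z_stat
values_for_z_stat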

            mean       std  count
sex
Female  0.166491  0.053632     87
Male    0.157651  0.064778    157
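The statistic in the cell above divides the difference in group means by the standard error of the overall mean. A common alternative, shown here only as a sketch, is the unpooled two-sample statistic `(mean_1 - mean_2) / sqrt(std_1**2 / n_1 + std_2**2 / n_2)`, which uses each group's own spread and size. The helper name `two_sample_z` and the toy data are assumptions for illustration; the values are not taken from the tips data.

```python
import numpy as np
import pandas as pd


def two_sample_z(frame, group_col, value_col, first, second):
    """Unpooled two-sample z statistic for mean(first) - mean(second)."""
    stats = frame.groupby(group_col)[value_col].agg(["mean", "std", "count"])
    diff = stats.loc[first, "mean"] - stats.loc[second, "mean"]
    # Standard error of the difference between two independent means
    std_err = np.sqrt(
        stats.loc[first, "std"] ** 2 / stats.loc[first, "count"]
        + stats.loc[second, "std"] ** 2 / stats.loc[second, "count"]
    )
    return diff / std_err


# Toy data (hypothetical values, chosen to be easy to verify by hand)
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "perc_tip": [0.10, 0.12, 0.14, 0.16, 0.18, 0.20],
})

z = two_sample_z(toy, "sex", "perc_tip", "Male", "Female")
print(z)
```

With the real data the call would be `two_sample_z(tips, 'sex', 'perc_tip', 'Male', 'Female')`. A negative value means the first group has the lower mean.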


In [56]:
stacked = df.groupby(['key1','key2']).mean()
stacked

            balance    income
key1 key2
a    one   6.542753  0.979020
     two   0.621236  4.367726
b    one  -8.940374  1.175115
     two   5.069804  2.965204


In [60]:
stacked.unstack()

       balance              income
key2       one       two       one       two
key1
a     6.542753  0.621236  0.979020  4.367726
b    -8.940374  5.069804  1.175115  2.965204


In [61]:
# 'balance' is a column label, not an index level, so the second unstack takes no level argument
stacked.unstack('key1').unstack()

         key1  key2
balance  a     one     6.542753
               two     0.621236
         b     one    -8.940374
               two     5.069804
income   a     one     0.979020
               two     4.367726
         b     one     1.175115
               two     2.965204
dtype: float64
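`unstack` moves an index level into the columns, and `stack` moves a column level back into the index. The sketch below shows both directions on a small illustrative frame (`demo` and its values are assumptions, picked so each cell can be traced by eye).

```python
import pandas as pd

# Illustrative data: one row per (key1, key2) pair
demo = pd.DataFrame({
    "key1": list("aabb"),
    "key2": ["one", "two", "one", "two"],
    "balance": [1.0, 2.0, 3.0, 4.0],
})
stacked = demo.groupby(["key1", "key2"]).mean()

# Move the key2 index level into the columns
wide = stacked.unstack("key2")

# Move the key1 index level into the columns instead
wide_by_key1 = stacked.unstack("key1")

# Stacking key2 again returns to the long layout
round_trip = wide.stack("key2")

print(wide)
print(wide_by_key1)
print(round_trip)
```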

In [63]:
tips.pivot(columns='sex')

     total_bill            tip        smoker           day          time            size        perc_tip
sex      Female   Male  Female  Male  Female  Male  Female  Male  Female    Male  Female  Male    Female      Male
0         16.99    NaN    1.01   NaN      No   NaN     Sun   NaN  Dinner     NaN     2.0   NaN  0.059447       NaN
1           NaN  10.34     NaN  1.66     NaN    No     NaN   Sun     NaN  Dinner     NaN   3.0       NaN  0.160542
2           NaN  21.01     NaN  3.50     NaN    No     NaN   Sun     NaN  Dinner     NaN   3.0       NaN  0.166587
3           NaN  23.68     NaN  3.31     NaN    No     NaN   Sun     NaN  Dinner     NaN   2.0       NaN  0.139780
4         24.59    NaN    3.61   NaN      No   NaN     Sun   NaN  Dinner     NaN     4.0   NaN  0.146808       NaN
5           NaN  25.29     NaN  4.71     NaN    No     NaN   Sun     NaN  Dinner     NaN   4.0       NaN  0.186240
6           NaN   8.77     NaN  2.00     NaN    No     NaN   Sun     NaN  Dinner     NaN   2.0       NaN  0.228050
7           NaN  26.88     NaN  3.12     NaN    No     NaN   Sun     NaN  Dinner     NaN   4.0       NaN  0.116071
8           NaN  15.04     NaN  1.96     NaN    No     NaN   Sun     NaN  Dinner     NaN   2.0       NaN  0.130319
9           NaN  14.78     NaN  3.23     NaN    No     NaN   Sun     NaN  Dinner     NaN   2.0       NaN  0.218539


## Group-wise operations and transformations

### Apply: General split-apply-combine

#### Suppressing the group keys

### Quantile and bucket analysis

### Example: Filling missing values with group-specific values

## Pivot tables and Cross-tabulation