# Grouping and aggregation: group operations

### `groupby` operations

Aggregation with `groupby` follows three sequential steps:
1. **Split** - Split the DataFrame into groups according to the values in the key columns.
2. **Apply** - Apply a function to each group produced by the split, usually an aggregating, filtering, or transforming function.
3. **Combine** - Combine the per-group results into a new DataFrame.
<img src='groupby.png' width=600>
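
The three steps can be spelled out by hand to see what `groupby` does in a single call. This is a minimal sketch on a small made-up frame; `df_demo`, `pieces`, `sums`, `manual` and `auto` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Small deterministic frame (illustrative data, not the notebook's random values)
df_demo = pd.DataFrame({"key": ["a", "b", "a", "b", "a"],
                        "value": [1, 2, 3, 4, 5]})

# 1. Split: one sub-frame per distinct key value
pieces = {k: df_demo[df_demo["key"] == k] for k in df_demo["key"].unique()}

# 2. Apply: reduce each piece to a single number
sums = {k: piece["value"].sum() for k, piece in pieces.items()}

# 3. Combine: gather the per-group results into one Series
manual = pd.Series(sums, name="value").sort_index()
manual.index.name = "key"

# groupby performs the same three steps in one call
auto = df_demo.groupby("key")["value"].sum()

print(manual)
print(auto)
```

Both printed Series should hold the same per-key totals.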

In [1]:
import numpy as np
import pandas as pd

class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

In [2]:
df=pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),'data2' : np.random.randn(5)})
df

Unnamed: 0,data1,data2,key1,key2
0,-0.767451,1.186281,a,one
1,0.234882,0.982056,a,two
2,0.209633,1.797629,b,one
3,1.232699,-1.03656,b,two
4,0.739736,0.615933,a,one


We group the `data1` column by the values of the `'key1'` column. This produces a grouped object to which functions can be applied:

In [3]:
grouped = df['data1'].groupby(df['key1'])
groupedall=df.groupby(df['key1'])
grouped

<pandas.core.groupby.SeriesGroupBy object at 0x7f0e7082b390>

In [4]:
grouped.sum()  # sums the values of column data1, grouped by key1

key1
a    0.207167
b    1.442332
Name: data1, dtype: float64

In [5]:
groupedall.sum()

         data1     data2
key1
a     0.207167   2.78427
b     1.442332  0.761069


In [6]:
grouped.count()  # counts the non-null values of data1 for each value of key1

key1
a    3
b    2
Name: data1, dtype: int64

In [7]:
groupedall.describe()  # summary statistics of data1 and data2, grouped by key1

               data1     data2
key1
a    count       3.0       3.0
     mean   0.069056   0.92809
     std    0.767155  0.288979
     min   -0.767451  0.615933
     25%   -0.266285  0.798994
     50%    0.234882  0.982056
     75%    0.487309  1.084169
     max    0.739736  1.186281
b    count       2.0       2.0
     mean   0.721166  0.380534


In [8]:
grouped.mean()

key1
a    0.069056
b    0.721166
Name: data1, dtype: float64

In [9]:
grouped.max()

key1
a    0.739736
b    1.232699
Name: data1, dtype: float64

In [10]:
grouped.min()

key1
a   -0.767451
b    0.209633
Name: data1, dtype: float64

Iterating over a `groupby` object yields pairs made of the group name and the group itself:

In [11]:
list(groupedall)

[('a',       data1     data2 key1 key2
  0 -0.767451  1.186281    a  one
  1  0.234882  0.982056    a  two
  4  0.739736  0.615933    a  one), ('b',       data1     data2 key1 key2
  2  0.209633  1.797629    b  one
  3  1.232699 -1.036560    b  two)]

In [12]:
list(groupedall)[0][0]

'a'

In [13]:
list(groupedall)[0][1]

Unnamed: 0,data1,data2,key1,key2
0,-0.767451,1.186281,a,one
1,0.234882,0.982056,a,two
4,0.739736,0.615933,a,one
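
Converting the whole object to a list is not the only way to inspect it. As a hedged sketch on made-up data (`df_demo`, `g` and `labels` are illustrative names), the `groups` attribute and the `get_group` method give direct access to the same information:

```python
import pandas as pd

df_demo = pd.DataFrame({"key1": ["a", "a", "b", "b", "a"],
                        "data1": [10, 20, 30, 40, 50]})
g = df_demo.groupby("key1")

# .groups maps each group name to the row labels that belong to it
labels = {name: list(idx) for name, idx in g.groups.items()}
print(labels)

# .get_group fetches a single group by name
print(g.get_group("a"))
```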


Statistics can be computed over groups defined by several columns, and the result can then be reshaped with `unstack`:

In [14]:
medianas=df['data1'].groupby([df['key1'], df['key2']]).median()
medianas

key1  key2
a     one    -0.013858
      two     0.234882
b     one     0.209633
      two     1.232699
Name: data1, dtype: float64

In [15]:
medianas.unstack()

key2       one       two
key1
a    -0.013858  0.234882
b     0.209633  1.232699
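
`unstack` and `stack` are inverse reshaping operations: the first moves the innermost row level into the columns, the second moves it back. A minimal sketch with made-up values (`df_demo`, `medians`, `wide` and `long_again` are illustrative names):

```python
import pandas as pd

df_demo = pd.DataFrame({"key1": ["a", "a", "b", "b", "a"],
                        "key2": ["one", "two", "one", "two", "one"],
                        "data1": [1.0, 2.0, 3.0, 4.0, 5.0]})

medians = df_demo.groupby(["key1", "key2"])["data1"].median()

wide = medians.unstack()     # key2 moves from the row index to the columns
long_again = wide.stack()    # key2 moves back into the row index

print(wide)
print(long_again)
```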


In [16]:
df

Unnamed: 0,data1,data2,key1,key2
0,-0.767451,1.186281,a,one
1,0.234882,0.982056,a,two
2,0.209633,1.797629,b,one
3,1.232699,-1.03656,b,two
4,0.739736,0.615933,a,one


In [17]:
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
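
`size` and `count` look alike but differ when data is missing: `size` counts rows per group, while `count` counts only non-null values. A small sketch on made-up data containing one `NaN` (`df_demo`, `sizes` and `counts` are illustrative names):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"key": ["a", "a", "b"],
                        "x": [1.0, np.nan, 3.0]})
g = df_demo.groupby("key")["x"]

sizes = g.size()     # rows per group, missing values included
counts = g.count()   # non-missing values per group

print(sizes)
print(counts)
```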

### Iterating with `groupby`

Each element yielded by a `groupby` object has two parts, conventionally unpacked as `name` and `group`:

In [18]:
df

Unnamed: 0,data1,data2,key1,key2
0,-0.767451,1.186281,a,one
1,0.234882,0.982056,a,two
2,0.209633,1.797629,b,one
3,1.232699,-1.03656,b,two
4,0.739736,0.615933,a,one


In [19]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)
    print('-'*60)

a
      data1     data2 key1 key2
0 -0.767451  1.186281    a  one
1  0.234882  0.982056    a  two
4  0.739736  0.615933    a  one
------------------------------------------------------------
b
      data1     data2 key1 key2
2  0.209633  1.797629    b  one
3  1.232699 -1.036560    b  two
------------------------------------------------------------


In [20]:
# grouping by several columns yields tuples as group names
for (n1,n2), group in df.groupby(['key1','key2']):
    print(n1,n2)
    print(group)
    print('-'*60)

a one
      data1     data2 key1 key2
0 -0.767451  1.186281    a  one
4  0.739736  0.615933    a  one
------------------------------------------------------------
a two
      data1     data2 key1 key2
1  0.234882  0.982056    a  two
------------------------------------------------------------
b one
      data1     data2 key1 key2
2  0.209633  1.797629    b  one
------------------------------------------------------------
b two
      data1    data2 key1 key2
3  1.232699 -1.03656    b  two
------------------------------------------------------------


The grouped data can be collected into a dictionary:

In [21]:
pieces = dict(list(df.groupby(['key1','key2'])))
pieces

{('a', 'one'):       data1     data2 key1 key2
 0 -0.767451  1.186281    a  one
 4  0.739736  0.615933    a  one, ('a', 'two'):       data1     data2 key1 key2
 1  0.234882  0.982056    a  two, ('b', 'one'):       data1     data2 key1 key2
 2  0.209633  1.797629    b  one, ('b', 'two'):       data1    data2 key1 key2
 3  1.232699 -1.03656    b  two}

In [23]:
pieces[('b','one')]

Unnamed: 0,data1,data2,key1,key2
2,0.209633,1.797629,b,one


Columns can be selected from the grouped object:

In [34]:
df.groupby('key1')['data1'].min()

key1
a   -0.767451
b    0.209633
Name: data1, dtype: float64

In [35]:
df.groupby('key1')[['data2']].size()

key1
a    3
b    2
dtype: int64
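
The kind of brackets used for the selection determines the type of the result. A hedged sketch on made-up data (`df_demo`, `as_series` and `as_frame` are illustrative names): a single label gives a Series per aggregation, a list of labels gives a DataFrame.

```python
import pandas as pd

df_demo = pd.DataFrame({"key1": ["a", "a", "b"],
                        "data1": [1.0, 2.0, 3.0],
                        "data2": [4.0, 5.0, 6.0]})

as_series = df_demo.groupby("key1")["data1"].mean()    # single label -> Series
as_frame = df_demo.groupby("key1")[["data1"]].mean()   # list of labels -> DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
```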

### Grouping with dictionaries and Series

In [37]:
people = pd.DataFrame(np.random.randn(5, 5),
                   columns=['a', 'b', 'c', 'd', 'e'],
                   index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people

Unnamed: 0,a,b,c,d,e
Joe,0.214048,-0.195566,0.245641,-1.287328,-1.005389
Steve,0.13634,0.648962,0.062273,0.504002,-1.149791
Wes,-1.191105,-0.068661,0.990163,-2.781301,-0.984173
Jim,0.852847,0.008926,-1.114196,1.076928,0.746466
Travis,0.924171,-1.426877,-0.266817,-0.262114,-0.575073


In [38]:
agrupamiento={'a': 'red', 'b': 'red', 'c': 'blue',
              'd': 'blue', 'e': 'red', 'f' : 'orange'}

In [40]:
people.groupby(agrupamiento, axis=1).var()

Unnamed: 0,blue,red
Joe,1.174998,0.385104
Steve,0.097562,0.858737
Wes,7.111969,0.356811
Jim,2.400512,0.211247
Travis,1.1e-05,1.416789


In [41]:
agrupamientoserie=pd.Series(agrupamiento)
agrupamientoserie

a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

In [42]:
# same functionality using a Series
people.groupby(agrupamientoserie, axis=1).var()

Unnamed: 0,blue,red
Joe,1.174998,0.385104
Steve,0.097562,0.858737
Wes,7.111969,0.356811
Jim,2.400512,0.211247
Travis,1.1e-05,1.416789
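
Recent pandas releases (2.1 onwards, as far as I know) warn that `axis=1` in `groupby` is deprecated. A sketch of an equivalent formulation is to transpose, group the rows, and transpose back. The data below is made up, and `people_demo`, `mapping` and `by_color` are illustrative names:

```python
import pandas as pd

people_demo = pd.DataFrame([[1.0, 2.0, 3.0, 5.0],
                            [2.0, 4.0, 6.0, 6.0]],
                           columns=["a", "b", "c", "d"],
                           index=["Joe", "Wes"])
# Keys that match no column ('f' here) are simply ignored
mapping = {"a": "red", "b": "red", "c": "blue", "d": "blue", "f": "orange"}

# Transpose so the columns become rows, group those rows, then transpose back
by_color = people_demo.T.groupby(mapping).var().T
print(by_color)
```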


###  Grouping with functions

This gives complete freedom in defining the groups. The function is called on each index label, and the entries whose labels return the same value fall into the same group. By default it is applied to the row index.

In [52]:
people.rename(columns={'a':'aa','b':'bbb'},inplace=True)

# group people by the length of their name
longnomfila=people.groupby(len).sum()
longnomcol=people.groupby(len,axis=1).sum()

display('people','longnomfila','longnomcol')

Unnamed: 0,aa,bbb,c,d,e
Joe,0.214048,-0.195566,0.245641,-1.287328,-1.005389
Steve,0.13634,0.648962,0.062273,0.504002,-1.149791
Wes,-1.191105,-0.068661,0.990163,-2.781301,-0.984173
Jim,0.852847,0.008926,-1.114196,1.076928,0.746466
Travis,0.924171,-1.426877,-0.266817,-0.262114,-0.575073

Unnamed: 0,aa,bbb,c,d,e
3,-0.124209,-0.2553,0.121608,-2.991701,-1.243096
5,0.13634,0.648962,0.062273,0.504002,-1.149791
6,0.924171,-1.426877,-0.266817,-0.262114,-0.575073

Unnamed: 0,1,2,3
Joe,-2.047076,0.214048,-0.195566
Steve,-0.583516,0.13634,0.648962
Wes,-2.77531,-1.191105,-0.068661
Jim,0.709197,0.852847,0.008926
Travis,-1.104003,0.924171,-1.426877


In [63]:
def agrupa(x):
    if 'e' in x:
        return len(x)+1
    else:
        return x[-1]

In [64]:
agrupa('aa'),agrupa('Steve')

('a', 6)

In [66]:
[agrupa(s) for s in people.index]

[4, 6, 4, 'm', 's']

In [67]:
people.groupby(agrupa).count()

Unnamed: 0,aa,bbb,c,d,e
4,2,2,2,2,2
6,1,1,1,1,1
m,1,1,1,1,1
s,1,1,1,1,1


# Applying different functions column by column

The `agg` method (alias `aggregate`) applies a set of functions to the columns of a grouped object.

**The functions must be reducers, that is, take a vector of values and return a single number.**
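
As an illustration of what a reducer looks like, here is a minimal sketch with a hypothetical `peak_to_peak` function applied to made-up data (`df_demo` and `ranges` are also illustrative names). It receives the values of one group and returns one number:

```python
import pandas as pd

df_demo = pd.DataFrame({"key": ["a", "a", "b", "b"],
                        "x": [1.0, 4.0, 10.0, 12.0]})

def peak_to_peak(values):
    """Reduce a group's values to one number: max minus min."""
    return values.max() - values.min()

ranges = df_demo.groupby("key")["x"].agg(peak_to_peak)
print(ranges)
```

Functions that return one value per input row are not reducers. Those are usually passed to `transform` or `apply` instead of `agg`.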

For the examples we load the *tips* dataset through `Seaborn`:

In [93]:
# !pip install seaborn

In [94]:
import seaborn as sns

tips = sns.load_dataset("tips")
print(tips.shape)
tips.head()

(244, 7)


Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [95]:
# add a column with the tip as a percentage of
# the total bill
tips['tip_pct'] = tips['tip'] / tips['total_bill']*100
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_pct
0,16.99,1.01,Female,No,Sun,Dinner,2,5.944673
1,10.34,1.66,Male,No,Sun,Dinner,3,16.054159
2,21.01,3.5,Male,No,Sun,Dinner,3,16.658734
3,23.68,3.31,Male,No,Sun,Dinner,2,13.978041
4,24.59,3.61,Female,No,Sun,Dinner,4,14.680765


In [76]:
grouped=tips.groupby(['sex', 'smoker'])

In [110]:
def reducecos(x):
    return np.prod(1.5*np.cos(x))

In [111]:
grouped['tip_pct'].agg([np.mean,np.sum,reducecos])

                    mean          sum     reducecos
sex    smoker
Male   Yes     15.277118   916.627051  3.742158e-09
       No      16.066872  1558.486537  3.916244e-13
Female Yes     18.215035   601.096164     0.1234156
       No      15.692097   847.373242 -1.111092e-14


The resulting columns can be named by passing the functions as `(name, function)` tuples:

In [112]:
grouped['tip_pct'].agg([('media',np.mean),
                        ('suma',np.sum),('función rara creada',reducecos)])

                   media         suma  función rara creada
sex    smoker
Male   Yes     15.277118   916.627051         3.742158e-09
       No      16.066872  1558.486537         3.916244e-13
Female Yes     18.215035   601.096164            0.1234156
       No      15.692097   847.373242        -1.111092e-14
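
On a single selected column, the output names can also be given as keyword arguments to `agg`, which avoids building the list of tuples. A hedged sketch on a tiny made-up frame (`tips_demo` and `summary` are illustrative names, and the keyword names `average` and `total` are arbitrary):

```python
import pandas as pd

tips_demo = pd.DataFrame({"sex": ["Male", "Male", "Female", "Female"],
                          "tip_pct": [10.0, 20.0, 15.0, 25.0]})

# keyword = output column name, value = aggregation to apply
summary = tips_demo.groupby("sex")["tip_pct"].agg(average="mean", total="sum")
print(summary)
```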


A list of functions can be applied to a subset of columns. The result is a DataFrame with MultiIndex columns, where the extra top level indicates the column each function was applied to.

In [113]:
functions = ['count', 'mean', 'max']

In [114]:
result = grouped[['tip_pct', 'total_bill']].agg(functions)
result

              tip_pct                       total_bill
                count       mean        max      count       mean    max
sex    smoker
Male   Yes         60  15.277118  71.034483         60    22.2845  50.81
       No          97  16.066872  29.198966         97  19.791237  48.33
Female Yes         33  18.215035  41.666667         33  17.977879   44.3
       No          54  15.692097   25.26725         54  18.105185  35.83


In [115]:
result['total_bill']

               count       mean    max
sex    smoker
Male   Yes        60    22.2845  50.81
       No         97  19.791237  48.33
Female Yes        33  17.977879   44.3
       No         54  18.105185  35.83


Different sets of functions can be applied to different columns by passing `agg` a dictionary that maps each column name to the functions to apply to it.

In [116]:
grouped.agg({'tip_pct' : ['min', 'max', 'mean', 'std'],'size' : [reducecos,'sum']})

                tip_pct                                         size
                    min        max       mean       std     reducecos  sum
sex    smoker
Male   Yes     3.563814  71.034483  15.277118  9.058794 -7.949107e-09  150
       No      7.180385  29.198966  16.066872  4.184875 -4.628385e-10  263
Female Yes     5.643341  41.666667  18.215035  7.159451  4.301846e-05   74
       No      5.679667   25.26725  15.692097  3.642118 -3.125443e-06  140
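
When a two-level column header is not wanted, one alternative is the keyword form of `agg`, where each keyword names an output column and its value is a `(column, function)` pair. This is a sketch on made-up data; `tips_demo`, `flat` and the output names `pct_min`, `pct_max` and `size_sum` are illustrative:

```python
import pandas as pd

tips_demo = pd.DataFrame({"sex": ["Male", "Male", "Female", "Female"],
                          "tip_pct": [10.0, 20.0, 15.0, 25.0],
                          "size": [2, 3, 2, 4]})

flat = tips_demo.groupby("sex").agg(
    pct_min=("tip_pct", "min"),   # output name = (input column, function)
    pct_max=("tip_pct", "max"),
    size_sum=("size", "sum"),
)
print(flat)
```

The result has a single level of column names, at the cost of spelling out every output column explicitly.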


### Returning the grouped data without an index

Sometimes we want to apply functions by group without moving the keys into the index, so the result keeps the flat layout of the original table.

In [118]:
tips.groupby(['sex', 'smoker'],as_index=False).max()

Unnamed: 0,sex,smoker,size,tip,tip_pct,total_bill
0,Male,Yes,5,10.0,71.034483,50.81
1,Male,No,6,9.0,29.198966,48.33
2,Female,Yes,4,6.5,41.666667,44.3
3,Female,No,6,5.2,25.26725,35.83
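
For a simple aggregation like this one, a similar flat layout can be obtained by grouping normally and calling `reset_index` afterwards. A minimal sketch on made-up data (`tips_demo`, `flat` and `via_reset` are illustrative names):

```python
import pandas as pd

tips_demo = pd.DataFrame({"sex": ["Male", "Male", "Female"],
                          "smoker": ["Yes", "No", "No"],
                          "tip": [3.0, 5.0, 4.0]})

# Keys stay as ordinary columns
flat = tips_demo.groupby(["sex", "smoker"], as_index=False).max()

# Keys go to the index first and are moved back to columns afterwards
via_reset = tips_demo.groupby(["sex", "smoker"]).max().reset_index()

print(flat)
print(via_reset)
```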
