# Grouping and aggregations
pg. 158

In [1]:
import numpy as np
import pandas as pd

In [2]:
import seaborn as sns

In [3]:
planets = sns.load_dataset('planets')

In [4]:
planets.head()

,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


In [5]:
planets.shape

(1035, 6)

### Aggregations in Pandas

For a Series, as with a NumPy array, an aggregation returns a single value.

In [6]:
planets['mass'].mean()

2.6381605847953216

For a DataFrame, the aggregation is computed separately for each column.

In [7]:
planets.mean(numeric_only=True)  # skip the non-numeric 'method' column

number               1.785507
orbital_period    2002.917596
mass                 2.638161
distance           264.069282
year              2009.070531
dtype: float64

Passing `axis=1` (or `axis='columns'`) aggregates across the columns instead, returning one value per row.

In [8]:
planets.head().mean(axis=1, numeric_only=True)  # skip the non-numeric 'method' column

0    472.1600
1    588.5868
2    559.4880
3    492.8100
4    531.2380
dtype: float64

A very useful method is `describe()`, which computes summary statistics for each numeric column.

In [9]:
planets.describe()

,number,orbital_period,mass,distance,year
count,1035.0,992.0,513.0,808.0,1035.0
mean,1.785507,2002.917596,2.638161,264.069282,2009.070531
std,1.240976,26014.728304,3.818617,733.116493,3.972567
min,1.0,0.090706,0.0036,1.35,1989.0
25%,1.0,5.44254,0.229,32.56,2007.0
50%,1.0,39.9795,1.26,55.25,2010.0
75%,2.0,526.005,3.04,178.5,2012.0
max,7.0,730000.0,25.0,8500.0,2014.0


### GroupBy: Split, Apply, Combine

#### Split, apply, combine
- `split`: break up and group the DataFrame according to the value of the specified key.
- `apply`: compute a function (an aggregation, transformation, or filter) within each individual group.
- `combine`: merge the per-group results into an output array.

This could be done manually with some combination of masking, aggregation, and merging, but the intermediate splits would have to be instantiated explicitly, which is less efficient. The power of GroupBy is that it abstracts these steps away.
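To make the contrast concrete, here is a minimal sketch of the manual route next to the `groupby` route. It assumes the same small key/data frame that the next cell builds; the names `manual`, `manual_result`, and `groupby_result` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

# Split: build one boolean mask per distinct key value.
# Apply: sum the 'data' column within each masked subset.
manual = {}
for key in df['key'].unique():
    mask = df['key'] == key
    manual[key] = df.loc[mask, 'data'].sum()

# Combine: gather the per-group results into a single Series.
manual_result = pd.Series(manual, name='data')

# The same computation expressed with groupby.
groupby_result = df.groupby('key')['data'].sum()

print(manual_result)
print(groupby_result)
```

Both routes give the same per-key sums; the manual version has to create and hold every mask and subset explicitly.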

In [10]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
df

,key,data
0,A,0
1,B,1
2,C,2
3,A,3
4,B,4
5,C,5


In [11]:
df.groupby('key')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fd832883c70>

The result is a `DataFrameGroupBy` object, ready to operate on the groups. No computation is done until an aggregation is applied ("lazy evaluation"), which keeps it efficient. When an aggregation is applied, it is performed on each group and the results are combined.

In [12]:
df.groupby('key').sum()

key,data
A,3
B,5
C,7


### The GroupBy object

It is an abstraction that can be thought of as a collection of DataFrames.

#### Column indexing
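As a rough illustration of the "collection of DataFrames" view, the sketch below inspects a `GroupBy` object built from a small toy frame (the frame and the variable names are assumptions for the example). It uses `ngroups`, `groups`, and `get_group()` to look at the pieces without aggregating anything.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})
grouped = df.groupby('key')

# Number of groups and the row labels that belong to each one.
print(grouped.ngroups)
print(dict(grouped.groups))

# Pull out the sub-DataFrame for a single group.
group_a = grouped.get_group('A')
print(group_a)
```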

In [13]:
planets.groupby('method')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fd8318c6730>

In [14]:
planets.groupby('method')['orbital_period']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fd8318c6430>

In [15]:
planets.groupby('method')['orbital_period'].median()

method
Astrometry                         631.180000
Eclipse Timing Variations         4343.500000
Imaging                          27500.000000
Microlensing                      3300.000000
Orbital Brightness Modulation        0.342887
Pulsar Timing                       66.541900
Pulsation Timing Variations       1170.000000
Radial Velocity                    360.200000
Transit                              5.714932
Transit Timing Variations           57.011000
Name: orbital_period, dtype: float64

#### Iteration over groups

In [44]:
for method, group in planets.groupby('method'):
    print("{0:30} shape={1}".format(method, group.shape))

Astrometry                     shape=(2, 6)
Eclipse Timing Variations      shape=(9, 6)
Imaging                        shape=(38, 6)
Microlensing                   shape=(23, 6)
Orbital Brightness Modulation  shape=(3, 6)
Pulsar Timing                  shape=(5, 6)
Pulsation Timing Variations    shape=(1, 6)
Radial Velocity                shape=(553, 6)
Transit                        shape=(397, 6)
Transit Timing Variations      shape=(4, 6)


#### Dispatch methods

Any method not explicitly implemented by the `GroupBy` object is passed through and called on each group.

In [49]:
planets.groupby('method')['year'].describe().unstack()

       method                       
count  Astrometry                          2.0
       Eclipse Timing Variations           9.0
       Imaging                            38.0
       Microlensing                       23.0
       Orbital Brightness Modulation       3.0
                                         ...  
max    Pulsar Timing                    2011.0
       Pulsation Timing Variations      2007.0
       Radial Velocity                  2014.0
       Transit                          2014.0
       Transit Timing Variations        2014.0
Length: 80, dtype: float64

### Aggregate, filter, transform, apply

In [50]:
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                   columns = ['key', 'data1', 'data2'])
df

,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


#### Aggregation

`aggregate()` is a more flexible method: it accepts strings, functions, lists, dictionaries, and so on.

In [52]:
df.groupby('key').aggregate(['min', np.median, max])

,data1,data1,data1,data2,data2,data2
key,min,median,max,min,median,max
A,0,1.5,3,3,4.0,5
B,1,2.5,4,0,3.5,7
C,2,3.5,5,3,6.0,9


In [53]:
df.groupby('key').aggregate({
    'data1': 'min',
    'data2': 'max'
})

key,data1,data2
A,0,5
B,1,7
C,2,9
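Another form `aggregate()` (alias `agg()`) accepts is keyword-based "named aggregation", where each keyword names an output column and maps to a `(column, function)` pair. The sketch below is an assumption-labeled example: the output names `data1_min`, `data2_max`, and `data2_range` are made up for illustration, and `data2` is hard-coded to the values shown in the table above.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# Each keyword becomes an output column: (source column, aggregation).
summary = df.groupby('key').agg(
    data1_min=('data1', 'min'),
    data2_max=('data2', 'max'),
    data2_range=('data2', lambda s: s.max() - s.min()),  # custom function
)
print(summary)
```

This keeps the result columns flat instead of producing a two-level column index.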


#### Filtering

In [63]:
def filter_func(x):
    return x['data2'].std() > 4

In [55]:
df.groupby('key').std()

key,data1,data2
A,2.12132,1.414214
B,2.12132,4.949747
C,2.12132,4.242641


In [64]:
# Each group (A, B, or C) is passed to filter_func; it is kept only if data2.std() > 4
df.groupby('key').filter(filter_func)

,key,data1,data2
1,B,1,0
2,C,2,3
4,B,4,7
5,C,5,9


#### Transformation

In [67]:
df.groupby('key').transform(lambda x: x - x.mean())

,data1,data2
0,-1.5,1.0
1,-1.5,-3.5
2,-1.5,-3.0
3,1.5,-1.0
4,1.5,3.5
5,1.5,3.0
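Because `transform()` returns a result with the same shape and index as its input, it can also be used for group-wise imputation. The following is a small sketch on a made-up frame (the `value` column and its numbers are assumptions for the example) that fills each missing entry with the mean of its own group.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'A', 'B', 'A', 'B'],
                   'value': [1.0, np.nan, 3.0, 4.0, np.nan, 8.0]})

# For each group, replace NaN with that group's mean (NaN is skipped by mean()).
filled = df.groupby('key')['value'].transform(lambda s: s.fillna(s.mean()))
print(filled)
```

The result lines up row by row with the original frame, so it can be assigned straight back as a column.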


#### Apply

`apply()` applies an arbitrary function to each group. The function should take a DataFrame and return a Series or a DataFrame.

In [73]:
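def norm_by_data2(x):
    # x is a DataFrame of group values
    print(x)
    x = x.copy()  # work on a copy so the group is not modified in place
    # normalize data1 by the sum of data2 within the group
    x['data1'] = x['data1'] / x['data2'].sum()
    return x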

In [75]:
df

,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


In [74]:
df.groupby('key').apply(norm_by_data2)

  key  data1  data2
0   A      0      5
3   A      3      3
  key  data1  data2
1   B      1      0
4   B      4      7
  key  data1  data2
2   C      2      3
5   C      5      9


,key,data1,data2
0,A,0.0,5
1,B,0.142857,0
2,C,0.166667,3
3,A,0.375,3
4,B,0.571429,7
5,C,0.416667,9


### Specifying the split key

#### List, Series, index...

In [82]:
df

,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


In [104]:
# One label per row: rows labeled 0 go to group 0, rows labeled 1 go to group 1
L = [0, 0, 0, 0, 1, 1]

In [105]:
df.groupby(L).sum(numeric_only=True)  # leave out the string column 'key'

,data1,data2
0,6,11
1,9,16


In [106]:
df2 = df.set_index('key')
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}

#### Dictionary

In [107]:
df2.groupby(mapping).sum()

,data1,data2
consonant,12,19
vowel,3,8


#### Function

In [108]:
df2.groupby(str.lower).mean()

,data1,data2
a,1.5,4.0
b,2.5,3.5
c,3.5,6.0
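These key types can also be combined in a list, which groups on a multi-level index. The sketch below rebuilds the same `df2` and `mapping` as above (with `data2` hard-coded to the values shown) and groups by both the lowercased index and the vowel/consonant mapping; `combined` is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})
df2 = df.set_index('key')
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}

# Both keys are evaluated against the index: a function and a dictionary.
combined = df2.groupby([str.lower, mapping]).mean()
print(combined)
```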


### Example

In [110]:
planets.head()

,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


In [116]:
2006 // 10

200

In [117]:
decade = 10 * (planets['year'] // 10)  # floor-divide by 10, then multiply back to get the decade
decade = decade.astype(str) + 's'
decade.name = 'decade'
decade

0       2000s
1       2000s
2       2010s
3       2000s
4       2000s
        ...  
1030    2000s
1031    2000s
1032    2000s
1033    2000s
1034    2000s
Name: decade, Length: 1035, dtype: object

In [121]:
planets.groupby(['method', decade])['number'].sum().unstack().fillna(0)

decade,1980s,1990s,2000s,2010s
method,,,,
Astrometry,0.0,0.0,0.0,2.0
Eclipse Timing Variations,0.0,0.0,5.0,10.0
Imaging,0.0,0.0,29.0,21.0
Microlensing,0.0,0.0,12.0,15.0
Orbital Brightness Modulation,0.0,0.0,0.0,5.0
Pulsar Timing,0.0,9.0,1.0,1.0
Pulsation Timing Variations,0.0,0.0,1.0,0.0
Radial Velocity,1.0,52.0,475.0,424.0
Transit,0.0,0.0,64.0,712.0
Transit Timing Variations,0.0,0.0,0.0,9.0
