# Project: Using `.groupby()` and advanced functions

In this exercise we will see how to group data using the `.groupby()` method.

We will also see how to apply user-defined functions and filters, specifying which columns they apply to.

### Some functions used in this project:

`.nunique()` <- Returns the number of distinct values in each variable (column) of a DataFrame

`.loc[]` <- Selects subsets of a DataFrame by row and column labels

`.aggregate([func_1,...,func_X])` <- Applies one or more functions (built-in or user-defined) to the grouped DataFrame

`.filter()` <- Applies a user-defined filter to the grouped DataFrame and keeps the groups that pass it

In [1]:
import pandas as pd

import numpy as np

import seaborn as sns

In [2]:
# load the example data from seaborn

df = sns.load_dataset('diamonds')
df

index,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.20,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
...,...,...,...,...,...,...,...,...,...,...
53935,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.50
53936,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
53937,0.70,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
53938,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   carat    53940 non-null  float64 
 1   cut      53940 non-null  category
 2   color    53940 non-null  category
 3   clarity  53940 non-null  category
 4   depth    53940 non-null  float64 
 5   table    53940 non-null  float64 
 6   price    53940 non-null  int64   
 7   x        53940 non-null  float64 
 8   y        53940 non-null  float64 
 9   z        53940 non-null  float64 
dtypes: category(3), float64(6), int64(1)
memory usage: 3.0 MB


In [4]:
# Number of distinct values in each column
df.nunique()

carat        273
cut            5
color          7
clarity        8
depth        184
table        127
price      11602
x            554
y            552
z            375
dtype: int64

We can see that there are 5 distinct cuts, 7 colors, 8 clarity grades, etc.
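As a small illustration of what `.nunique()` counts, the sketch below uses a hypothetical four-row frame (`toy`, not the diamonds data) and compares the count of distinct values with the values themselves, obtained with `.unique()`.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "cut": ["Ideal", "Good", "Ideal", "Fair"],
    "price": [326, 327, 326, 337],
})

# Number of distinct values per column
n_distinct = toy.nunique()
print(n_distinct)

# The distinct values themselves, for a single column
cut_values = toy["cut"].unique()
print(cut_values)
```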

## The `.groupby()` method for grouping data

Usage:

`df.groupby([ 'column_X' ])[['column_1',...,'column_N']].funcion()`

1. Select the DataFrame: `df`

2. Select the column(s) to group by: `.groupby( ['column_X'] )`

3. Select the columns to analyze: `[['column_1',...,'column_N']]`

4. Apply the `.funcion()` that computes the statistic of interest. Commonly used functions are: `.count()`, `.mean()`, `.std()`, `.median()`
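The four steps above can be sketched one at a time on a hypothetical toy frame (`toy` and its values are assumptions made for illustration, not part of the diamonds data):

```python
import pandas as pd

# Step 1: the DataFrame (hypothetical toy data)
toy = pd.DataFrame({
    "cut": ["Ideal", "Good", "Ideal", "Good"],
    "carat": [0.2, 0.3, 0.4, 0.5],
    "price": [300, 400, 500, 600],
})

# Step 2: choose the grouping column
grouped = toy.groupby(["cut"])

# Step 3: choose the columns to analyze
selected = grouped[["carat", "price"]]

# Step 4: apply the summary function
result = selected.mean()
print(result)
```

Chaining the steps as `toy.groupby(["cut"])[["carat", "price"]].mean()` gives the same result; the group labels become the index of the output.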

## `.groupby()` on a single variable

In [5]:
# Mean of all numeric variables (columns) grouped by 'cut'
# numeric_only=True skips the categorical columns, which have no mean
df.groupby(['cut']).mean(numeric_only=True)

cut,carat,depth,table,price,x,y,z
Ideal,0.702837,61.709401,55.951668,3457.54197,5.507451,5.52008,3.401448
Premium,0.891955,61.264673,58.746095,4584.257704,5.973887,5.944879,3.647124
Very Good,0.806381,61.818275,57.95615,3981.759891,5.740696,5.770026,3.559801
Good,0.849185,62.365879,58.694639,3928.864452,5.838785,5.850744,3.639507
Fair,1.046137,64.041677,59.053789,4358.757764,6.246894,6.182652,3.98277


In [6]:
# Mean of a single variable (column) grouped by 'cut'
df.groupby(['cut'])[['price']].mean()

cut,price
Ideal,3457.54197
Premium,4584.257704
Very Good,3981.759891
Good,3928.864452
Fair,4358.757764


In [7]:
# Mean of selected variables (columns) grouped by 'cut'
T = df.groupby(['cut'])[['carat','price']].mean()
T

cut,carat,price
Ideal,0.702837,3457.54197
Premium,0.891955,4584.257704
Very Good,0.806381,3981.759891
Good,0.849185,3928.864452
Fair,1.046137,4358.757764


In [10]:
# extract a row with '.loc[ ['index_name'] ,:]'
T.loc[ ['Good'] ,:]

cut,carat,price
Good,0.849185,3928.864452


In [None]:
# extract a column with '.loc[ : , ['column_name'] ]'
T.loc[ :,['price'] ]

In [11]:
# extract a specific cell with '.loc[ ['index_name'] , ['column_name'] ]'
T.loc[ ['Good'],['price'] ]

cut,price
Good,3928.864452
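Note that passing lists to `.loc[]`, as in the cells above, always returns a DataFrame, even for a single cell. The sketch below, built on a hypothetical two-row table `T_toy` with made-up values, contrasts list labels with plain labels, which return a Series or a scalar instead.

```python
import pandas as pd

# Hypothetical stand-in for the grouped table T (values are made up)
T_toy = pd.DataFrame(
    {"carat": [0.85, 0.70], "price": [3900.0, 3450.0]},
    index=pd.Index(["Good", "Ideal"], name="cut"),
)

# List labels on both axes: a 1x1 DataFrame
as_frame = T_toy.loc[["Good"], ["price"]]

# Plain row label: a Series holding that row
as_series = T_toy.loc["Good", :]

# Plain labels on both axes: the scalar value itself
as_scalar = T_toy.loc["Good", "price"]

print(type(as_frame).__name__, type(as_series).__name__, as_scalar)
```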


## `.groupby()` on several variables

In [13]:
df.groupby(['color','cut'])[ ['price','carat'] ].mean()

color,cut,price,carat
D,Ideal,2629.094566,0.565766
D,Premium,3631.292576,0.721547
D,Very Good,3470.467284,0.696424
D,Good,3405.382175,0.744517
D,Fair,4291.06135,0.920123
E,Ideal,2597.55009,0.578401
E,Premium,3538.91442,0.717745
E,Very Good,3214.652083,0.676317
E,Good,3423.644159,0.745134
E,Fair,3682.3125,0.856607
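Grouping by several columns produces a result whose index has one level per grouping column. As a hedged sketch on hypothetical toy data, the example below shows two common ways to reshape such a result: `.reset_index()` turns the index levels back into ordinary columns, and `.unstack()` moves one level into the columns.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "color": ["D", "D", "E", "E", "E"],
    "cut": ["Ideal", "Good", "Ideal", "Ideal", "Good"],
    "price": [100, 200, 300, 500, 600],
})

# The result has a two-level index: (color, cut)
by_color_cut = toy.groupby(["color", "cut"])[["price"]].mean()
print(by_color_cut)

# Flatten the index levels into ordinary columns
flat = by_color_cut.reset_index()
print(flat)

# Move the 'cut' level into the columns: one row per color
wide = by_color_cut["price"].unstack("cut")
print(wide)
```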


## Applying user-defined functions: the same functions for every column

In [17]:
def mean_kilo(x):
    return np.mean(x)/1000

In [18]:
# compute the minimum, mean, maximum and 'mean_kilo' of the grouped data:
P = df.groupby(['cut','color'])[ ['price','carat'] ].aggregate([np.min,np.mean,np.max,mean_kilo])
P

,,price,price,price,price,carat,carat,carat,carat
cut,color,amin,mean,amax,mean_kilo,amin,mean,amax,mean_kilo
Ideal,D,367,2629.094566,18693,2.629095,0.2,0.565766,2.75,0.000566
Ideal,E,326,2597.55009,18729,2.59755,0.2,0.578401,2.28,0.000578
Ideal,F,408,3374.939362,18780,3.374939,0.23,0.655829,2.45,0.000656
Ideal,G,361,3720.706388,18806,3.720706,0.23,0.700715,2.54,0.000701
Ideal,H,357,3889.334831,18760,3.889335,0.23,0.799525,3.5,0.0008
Ideal,I,348,4451.970377,18779,4.45197,0.23,0.913029,3.22,0.000913
Ideal,J,340,4918.186384,18508,4.918186,0.23,1.063594,3.01,0.001064
Premium,D,367,3631.292576,18575,3.631293,0.2,0.721547,2.57,0.000722
Premium,E,326,3538.91442,18477,3.538914,0.2,0.717745,3.05,0.000718
Premium,F,342,4324.890176,18791,4.32489,0.2,0.827036,3.01,0.000827


In [19]:
P.loc[ ['Good'],: ]

,,price,price,price,price,carat,carat,carat,carat
cut,color,amin,mean,amax,mean_kilo,amin,mean,amax,mean_kilo
Good,D,361,3405.382175,18468,3.405382,0.23,0.744517,2.04,0.000745
Good,E,327,3423.644159,18236,3.423644,0.23,0.745134,3.0,0.000745
Good,F,357,3495.750275,18686,3.49575,0.23,0.77593,2.67,0.000776
Good,G,394,4123.482204,18788,4.123482,0.23,0.850896,2.8,0.000851
Good,H,368,4276.254986,18640,4.276255,0.25,0.914729,3.01,0.000915
Good,I,351,5078.532567,18707,5.078533,0.3,1.057222,3.01,0.001057
Good,J,335,4574.172638,18325,4.574173,0.28,1.099544,3.0,0.0011


In [20]:
P.loc[ ['Good'],['G'],: ]

,,price,price,price,price,carat,carat,carat,carat
cut,color,amin,mean,amax,mean_kilo,amin,mean,amax,mean_kilo
Good,G,394,4123.482204,18788,4.123482,0.23,0.850896,2.8,0.000851


In [21]:
P.loc[ ['Good'],['G'],:]['price']

cut,color,amin,mean,amax,mean_kilo
Good,G,394,4123.482204,18788,4.123482


In [22]:
P.loc[ ['Good'],['G'],:]['price']['mean_kilo']

cut   color
Good  G        4.123482
Name: mean_kilo, dtype: float64
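When both the rows and the columns have several levels, a tuple can address each axis explicitly, which avoids chaining several `[...]` lookups. The following is a minimal sketch on a hypothetical table `P_toy` with made-up values, not the table `P` computed above.

```python
import pandas as pd

# Hypothetical table with two row levels and two column levels
idx = pd.MultiIndex.from_tuples(
    [("Good", "D"), ("Good", "G"), ("Ideal", "D"), ("Ideal", "G")],
    names=["cut", "color"],
)
cols = pd.MultiIndex.from_product([["price"], ["mean", "mean_kilo"]])
P_toy = pd.DataFrame(
    [[3400.0, 3.4], [4100.0, 4.1], [2600.0, 2.6], [3700.0, 3.7]],
    index=idx,
    columns=cols,
)

# A list containing one row tuple: a one-row DataFrame
row = P_toy.loc[[("Good", "G")], :]
print(row)

# One row tuple and one column tuple: a single scalar
value = P_toy.loc[("Good", "G"), ("price", "mean_kilo")]
print(value)
```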

## Applying user-defined functions: different functions per column

In [23]:
# We can define different functions and decide which columns to apply them to:
dict_func={'carat':[np.min , np.max], 'price':[np.mean , mean_kilo] }
dict_func

# This will apply
# [np.min, np.max] to the 'carat' column and
# [np.mean, mean_kilo] to the 'price' column


{'carat': [<function numpy.amin(a, axis=None, out=None, keepdims=<no value>, initial=<no value>, where=<no value>)>,
  <function numpy.amax(a, axis=None, out=None, keepdims=<no value>, initial=<no value>, where=<no value>)>],
 'price': [<function numpy.mean(a, axis=None, dtype=None, out=None, keepdims=<no value>)>,
  <function __main__.mean_kilo(x)>]}

In [24]:
df.groupby(['cut','color'])[ ['price','carat'] ].aggregate( dict_func )

,,carat,carat,price,price
cut,color,amin,amax,mean,mean_kilo
Ideal,D,0.2,2.75,2629.094566,2.629095
Ideal,E,0.2,2.28,2597.55009,2.59755
Ideal,F,0.23,2.45,3374.939362,3.374939
Ideal,G,0.23,2.54,3720.706388,3.720706
Ideal,H,0.23,3.5,3889.334831,3.889335
Ideal,I,0.23,3.22,4451.970377,4.45197
Ideal,J,0.23,3.01,4918.186384,4.918186
Premium,D,0.2,2.57,3631.292576,3.631293
Premium,E,0.2,3.05,3538.91442,3.538914
Premium,F,0.2,3.01,4324.890176,4.32489
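An alternative to the dictionary form is pandas' named aggregation, where each output column is given as `name=(column, function)`. This yields flat, explicitly named columns instead of a two-level header. The sketch below uses hypothetical toy data, and the output names (`carat_min`, `price_mean_kilo`, etc.) are illustrative choices.

```python
import numpy as np
import pandas as pd

def mean_kilo(x):
    # Mean of the values, expressed in thousands
    return np.mean(x) / 1000

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "cut": ["Ideal", "Ideal", "Good", "Good"],
    "carat": [0.2, 0.4, 0.3, 0.5],
    "price": [1000, 3000, 2000, 6000],
})

# Each keyword is an output column: name=(input column, function)
summary = toy.groupby(["cut"]).agg(
    carat_min=("carat", "min"),
    carat_max=("carat", "max"),
    price_mean=("price", "mean"),
    price_mean_kilo=("price", mean_kilo),
)
print(summary)
```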


## Building and applying user-defined filters

In [25]:
df

index,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.20,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
...,...,...,...,...,...,...,...,...,...,...
53935,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.50
53936,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
53937,0.70,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
53938,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74


In [26]:
# The following filter keeps the groups that satisfy:
# mean_kilo(x[ selected_cols ]) > 4
# for the columns listed in 'selected_cols'

# Define the filter condition (x is the sub-DataFrame of one group):
def my_filter(x):
    selected_cols = ['price']
    return mean_kilo(x[ selected_cols ]) > 4

In [27]:
# Keep all rows of every group (cut) that satisfies the filter condition defined above:
df.groupby(['cut']).filter(my_filter)

index,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.20,4.23,2.63
8,0.22,Fair,E,VS2,65.1,61.0,337,3.87,3.78,2.49
12,0.22,Premium,F,SI1,60.4,61.0,342,3.88,3.84,2.33
14,0.20,Premium,E,SI2,60.2,62.0,345,3.79,3.75,2.27
...,...,...,...,...,...,...,...,...,...,...
53928,0.79,Premium,E,SI2,61.4,58.0,2756,6.03,5.96,3.68
53930,0.71,Premium,E,SI1,60.5,55.0,2756,5.79,5.74,3.49
53931,0.71,Premium,F,SI1,59.8,62.0,2756,5.74,5.73,3.43
53934,0.72,Premium,D,SI1,62.7,59.0,2757,5.69,5.73,3.58
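The output above contains only `Premium` and `Fair` diamonds, including individual rows priced well below 4000. This is because `.filter()` evaluates the condition once per group and keeps or drops the whole group. The sketch below reproduces that behavior on hypothetical toy data, with a filter function (`expensive_group`, an illustrative name) that returns a single boolean per group.

```python
import numpy as np
import pandas as pd

def mean_kilo(x):
    # Mean of the values, expressed in thousands
    return np.mean(x) / 1000

def expensive_group(group):
    # One boolean per group: is the group's mean price above 4 (thousands)?
    return mean_kilo(group["price"]) > 4

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "cut": ["Premium", "Ideal", "Premium", "Ideal"],
    "price": [3000, 1000, 6000, 2000],
})

# Premium has mean 4500 and is kept whole; Ideal has mean 1500 and is dropped
kept = toy.groupby("cut").filter(expensive_group)
print(kept)
```

The Premium row priced at 3000 survives because the decision is made for the group, not for each row.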
