# The GroupBy Function

The groupby function in Python's pandas library groups the rows of a DataFrame by the unique values in one or more columns. It is an essential operation for analyzing and summarizing data by category or group.

groupby splits a DataFrame into groups based on the unique values in one or more key columns and then lets you apply aggregate operations to those groups. It is similar to the "GROUP BY" operation in SQL.

The basic groupby syntax is:

    grouped = df.groupby(columna_clave)

Where:

    df: The DataFrame on which you want to run the groupby operation.
    columna_clave: The column (or list of columns) by which you want to group the data.

Once you have created a groupby object, you can apply aggregation operations such as sum(), mean(), count(), max(), and min() to compute statistics for each group.
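As a minimal sketch of the aggregation step described above, the following example groups a small, made-up DataFrame and applies three of the listed operations. The column names `team` and `points` and their values are invented for illustration and do not come from the datasets used later.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
df = pd.DataFrame({
    "team": ["A", "A", "B", "B", "B"],
    "points": [10, 20, 5, 15, 25],
})

grouped = df.groupby("team")

# Each aggregation returns one value per group, indexed by the group key
totals = grouped["points"].sum()
averages = grouped["points"].mean()
counts = grouped["points"].count()

print(totals)
print(averages)
print(counts)
```

Selecting the `points` column before aggregating keeps each result a Series indexed by `team`.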

## What does the groupby function return?

The groupby function in pandas returns a GroupBy object (a DataFrameGroupBy when it is called on a DataFrame), which is a special representation of the groups produced by the grouping operation. This object is neither a DataFrame nor a Series; it is a structure that holds information about the groups and lets you run aggregate operations on them.

You can think of this object as a "map" that relates each unique value in the grouping column(s) to the subset of rows of the original DataFrame that belongs to that group. From the GroupBy object you can then run aggregate operations on each of the groups.
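The "map" idea can be inspected directly. The sketch below uses a small invented DataFrame (the `state` and `population` values are placeholders, not census data) to look at the object's type, its key-to-rows mapping, and the rows of a single group.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
df = pd.DataFrame({
    "state": ["Ohio", "Texas", "Ohio", "Texas", "Utah"],
    "population": [100, 200, 300, 400, 500],
})

grouped = df.groupby("state")

# The result is a GroupBy object, not a DataFrame
print(type(grouped).__name__)

# .groups maps each group key to the row labels that belong to that group
group_labels = {key: list(labels) for key, labels in grouped.groups.items()}
print(group_labels)

# get_group returns the sub-DataFrame for a single key
ohio = grouped.get_group("Ohio")
print(ohio)
```

No rows are copied or aggregated until you ask for a group or call an aggregation, which is why the object is best read as a description of the grouping rather than as data.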

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv('../data/census.csv')
df

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,...,0.002295,-0.193196,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.592270,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.832960,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.500690,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3188,50,4,8,56,37,Wyoming,Sweetwater County,43806,43806,43593,...,1.072643,16.243199,-5.339774,-14.252889,-14.248864,1.255221,16.243199,-5.295460,-14.075283,-14.070195
3189,50,4,8,56,39,Wyoming,Teton County,21294,21294,21297,...,-1.589565,0.972695,19.525929,14.143021,-0.564849,0.654527,2.408578,21.160658,16.308671,1.520747
3190,50,4,8,56,41,Wyoming,Uinta County,21118,21118,21102,...,-17.755986,-4.916350,-6.902954,-14.215862,-12.127022,-18.136812,-5.536861,-7.521840,-14.740608,-12.606351
3191,50,4,8,56,43,Wyoming,Washakie County,8533,8533,8545,...,-11.637475,-0.827815,-2.013502,-17.781491,1.682288,-11.990126,-1.182592,-2.250385,-18.020168,1.441961


In [3]:
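# Note: census.csv also holds one state-level summary row per state (SUMLEV 40, see row 0),
# so each mean printed below includes that row alongside the county rows.
for group, frame in df.groupby('STNAME'):
    print('Counties in state ' + group +
          ' have an average population of ' + str(frame['CENSUS2010POP'].mean()))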

Counties in state Alabama have an average population of 140580.4705882353
Counties in state Alaska have an average population of 47348.73333333333
Counties in state Arizona have an average population of 799002.125
Counties in state Arkansas have an average population of 76734.68421052632
Counties in state California have an average population of 1262845.9661016949
Counties in state Colorado have an average population of 154744.4923076923
Counties in state Connecticut have an average population of 794243.7777777778
Counties in state Delaware have an average population of 448967.0
Counties in state District of Columbia have an average population of 601723.0
Counties in state Florida have an average population of 552979.7058823529
Counties in state Georgia have an average population of 121095.6625
Counties in state Hawaii have an average population of 453433.6666666667
Counties in state Idaho have an average population of 69670.3111111111
Counties in state Illinois have an average populat

In [12]:
import pandas as pd
import numpy as np

df = pd.read_csv('../data/census.csv')
df = df.set_index('STNAME')

def set_batch_number(item):
    # item is an index label (here, a state name); batch by its first letter
    if item[0] < 'M':
        return 0
    if item[0] < 'Q':
        return 1
    return 2

for group, frame in df.groupby(set_batch_number):
    print('There are ' + str(len(frame)) + ' records in group ' + str(group) + ' for processing.')

There are 1196 records in group 0 for processing.
There are 1154 records in group 1 for processing.
There are 843 records in group 2 for processing.


As the code above shows, we did not pass a column name as the grouping criterion. When groupby receives a function, it calls that function on each value of the DataFrame's index (here STNAME, set with set_index) and groups the rows by the value the function returns.
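A minimal sketch of this behavior on a tiny invented DataFrame: the function passed to groupby receives index labels, not rows. The helper name `first_letter_bucket` and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; the index plays the role of STNAME
df = pd.DataFrame(
    {"population": [10, 20, 30, 40]},
    index=["Alabama", "Maine", "Ohio", "Texas"],
)

def first_letter_bucket(label):
    # Called once per index label, not once per row of data
    return "A-L" if label[0] < "M" else "M-Z"

# Rows are grouped by the value the function returns for their index label
sizes = df.groupby(first_letter_bucket).size()
print(sizes)
```

If the values you want to group by live in a regular column, either pass the column name or move that column into the index first, as the cell above does with set_index.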

# Grouping by a multi-index and a custom function

In [31]:
import pandas as pd
import numpy as np

df = pd.read_csv('../data/listings.csv')
df = df.set_index(["cancellation_policy", "review_scores_value"])

def grouping_fun(item):
    # With a multi-index, item is a tuple: (cancellation_policy, review_scores_value)
    if item[1] == 10.0:
        return (item[0], "10.0")
    else:
        return (item[0], "not 10.0")
    
for group, frame in df.groupby(by=grouping_fun):
    print(group)

('flexible', '10.0')
('flexible', 'not 10.0')
('moderate', '10.0')
('moderate', 'not 10.0')
('strict', '10.0')
('strict', 'not 10.0')
('super_strict_30', '10.0')
('super_strict_30', 'not 10.0')


# Aggregation
Aggregating groups in pandas generally starts with groupby(), which groups the rows of a DataFrame by one or more columns or by a function. Once the data is grouped, you can run aggregation operations such as sum, mean, maximum, and minimum on each group.

In [35]:
import pandas as pd
import numpy as np

df = pd.read_csv('../data/listings.csv')
df.groupby("cancellation_policy").agg({"review_scores_value": np.nanmean})

Unnamed: 0_level_0,review_scores_value
cancellation_policy,Unnamed: 1_level_1
flexible,9.237421
moderate,9.307398
strict,9.081441
super_strict_30,8.537313


In [37]:
df.groupby("cancellation_policy").agg({
    "review_scores_value": (np.nanmean, np.nanstd),
    "reviews_per_month": np.nanmean
})

Unnamed: 0_level_0,review_scores_value,review_scores_value,reviews_per_month
Unnamed: 0_level_1,nanmean,nanstd,nanmean
cancellation_policy,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
flexible,9.237421,1.096271,1.82921
moderate,9.307398,0.859859,2.391922
strict,9.081441,1.040531,1.873467
super_strict_30,8.537313,0.840785,0.340143
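As an alternative sketch, agg also accepts keyword arguments of the form new_name=(column, function), which avoids the multi-level column header seen above. The data below is a small invented stand-in for listings.csv, and the output names `mean_score` and `n_scores` are hypothetical. The string aggregations "mean" and "count" skip NaN values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for listings.csv, invented for illustration only
listings = pd.DataFrame({
    "cancellation_policy": ["flexible", "flexible", "strict", "strict", "strict"],
    "review_scores_value": [9.0, 10.0, 8.0, np.nan, 10.0],
})

# Named aggregation: each keyword becomes a flat output column
summary = listings.groupby("cancellation_policy").agg(
    mean_score=("review_scores_value", "mean"),
    n_scores=("review_scores_value", "count"),
)
print(summary)
```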


# Transformation
Transformation is another common operation on groups in pandas. After a groupby, a transformation modifies the data within each group without changing the shape of the original DataFrame. It is often used to normalize data or to compute per-group statistics that are then applied back to each row of the group. For example, you might want to subtract each group's mean from every element in that group. Because a transformation returns one value per original row, the result can be combined with the original DataFrame if you wish.
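The mean-subtraction example mentioned above can be sketched on a small invented DataFrame. The column names `group` and `value` are placeholders. Because transform returns a result aligned with the original index, it can be assigned directly as a new column.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 20.0],
})

# transform returns one value per original row, aligned with df's index
group_mean = df.groupby("group")["value"].transform("mean")

df["group_mean"] = group_mean
df["demeaned"] = df["value"] - df["group_mean"]
print(df)
```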

In [54]:
import pandas as pd
import numpy as np

df = pd.read_csv('../data/listings.csv')
cols = ["cancellation_policy","review_scores_value"]
transform_df = df[cols].groupby('cancellation_policy').transform(np.nanmean)
transform_df.rename({'review_scores_value': 'mean_review_scores'}, axis='columns', inplace=True)

df = df.merge(transform_df, left_index=True, right_index=True)

df['mean_diff']=np.absolute(df['review_scores_value'] - df['mean_review_scores'])
df

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month,mean_review_scores,mean_diff
0,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,,,f,moderate,f,f,1,,9.307398,
1,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,none,"The room is in Roslindale, a diverse and prima...",...,,,t,moderate,f,f,1,1.30,9.307398,0.307398
2,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",none,The LOCATION: Roslindale is a safe and diverse...,...,,,f,moderate,t,f,1,0.47,9.307398,0.692602
3,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,none,Roslindale is a lovely little neighborhood loc...,...,,,f,moderate,f,f,1,1.00,9.307398,0.692602
4,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",none,"I love the proximity to downtown, the neighbor...",...,,,f,flexible,f,f,1,2.25,9.237421,0.762579
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3580,8373729,https://www.airbnb.com/rooms/8373729,20160906204935,2016-09-07,Big cozy room near T,5 min walking to Orange Line subway with 2 sto...,,5 min walking to Orange Line subway with 2 sto...,none,,...,,,t,strict,f,f,8,0.34,9.081441,0.081441
3581,14844274,https://www.airbnb.com/rooms/14844274,20160906204935,2016-09-07,BU Apartment DexterPark Bright room,"Most popular apartment in BU, best located in ...",Best location in BU,"Most popular apartment in BU, best located in ...",none,,...,,,f,strict,f,f,2,,9.081441,
3582,14585486,https://www.airbnb.com/rooms/14585486,20160906204935,2016-09-07,Gorgeous funky apartment,Funky little apartment close to public transpo...,Modern and relaxed space with many facilities ...,Funky little apartment close to public transpo...,none,"Cambridge is a short walk into Boston, and set...",...,,,f,flexible,f,f,1,,9.237421,
3583,14603878,https://www.airbnb.com/rooms/14603878,20160906204935,2016-09-07,Great Location; Train and Restaurants,"My place is close to Taco Loco Mexican Grill, ...",,"My place is close to Taco Loco Mexican Grill, ...",none,,...,,,f,strict,f,f,1,2.00,9.081441,2.081441


# Filtering
Filtering in the context of groupby selects whole groups based on a condition. After grouping a DataFrame with groupby, you can apply a filter that keeps only the groups meeting certain criteria. For example, you might want to keep only the groups whose mean sales exceed a given value. The filter function receives each group as a DataFrame and returns True or False, and the result contains every row of the groups that returned True.
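The sales example mentioned above might look like the following sketch. The `store` and `amount` columns and the threshold of 150 are invented for illustration.

```python
import pandas as pd

# Hypothetical sales data, invented for illustration only
sales = pd.DataFrame({
    "store": ["north", "north", "south", "south", "east"],
    "amount": [100, 300, 20, 40, 500],
})

# Keep every row of the groups whose mean amount exceeds 150
kept = sales.groupby("store").filter(lambda g: g["amount"].mean() > 150)
print(kept)
```

The filter decides per group, not per row: the 100 row from `north` is kept because its group mean passes the threshold.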

In [55]:
import pandas as pd
import numpy as np

df = pd.read_csv('../data/listings.csv')
df.groupby('cancellation_policy').filter(lambda x: np.nanmean(x['review_scores_value']) > 9.2)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,,f,,,f,moderate,f,f,1,
1,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,none,"The room is in Roslindale, a diverse and prima...",...,9.0,f,,,t,moderate,f,f,1,1.30
2,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",none,The LOCATION: Roslindale is a safe and diverse...,...,10.0,f,,,f,moderate,t,f,1,0.47
3,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,none,Roslindale is a lovely little neighborhood loc...,...,10.0,f,,,f,moderate,f,f,1,1.00
4,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",none,"I love the proximity to downtown, the neighbor...",...,10.0,f,,,f,flexible,f,f,1,2.25
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3576,14689681,https://www.airbnb.com/rooms/14689681,20160906204935,2016-09-07,Beautiful loft style bedroom with large bathroom,You'd be living on the top floor of a four sto...,,You'd be living on the top floor of a four sto...,none,,...,,f,,,f,flexible,f,f,1,
3577,13750763,https://www.airbnb.com/rooms/13750763,20160906204935,2016-09-07,Comfortable Space in the Heart of Brookline,"Our place is close to Coolidge Corner, Allston...",This space consists of 2 Rooms and a private b...,"Our place is close to Coolidge Corner, Allston...",none,Brookline is known for being an excellent and ...,...,,f,,,f,flexible,f,f,1,
3579,14852179,https://www.airbnb.com/rooms/14852179,20160906204935,2016-09-07,Spacious Queen Bed Room Close to Boston Univer...,- Grocery: A full-size Star market is 2 minute...,,- Grocery: A full-size Star market is 2 minute...,none,,...,,f,,,f,flexible,f,f,1,
3582,14585486,https://www.airbnb.com/rooms/14585486,20160906204935,2016-09-07,Gorgeous funky apartment,Funky little apartment close to public transpo...,Modern and relaxed space with many facilities ...,Funky little apartment close to public transpo...,none,"Cambridge is a short walk into Boston, and set...",...,,f,,,f,flexible,f,f,1,


# Apply
The apply method of a DataFrameGroupBy object is extremely versatile. It applies an arbitrary function to each group of data. The function must take a DataFrame (representing the group) as its argument and return a scalar value, a Series, or a DataFrame.

In [56]:
import pandas as pd

# Create an example DataFrame
df = pd.DataFrame({
    'animal': ['gato', 'perro', 'gato', 'perro', 'gato', 'perro'],
    'peso': [4, 25, 3.5, 30, 3.9, 20],
    'altura': [25, 50, 22, 55, 23, 45]
})

print("Original DataFrame:")
print(df)

# Group by the 'animal' column
grouped = df.groupby('animal')

# Define a function that computes the sum of the 'peso' column for each group
def sumar_peso(group):
    return group['peso'].sum()

# Use apply to run the function on each group
resultado = grouped.apply(sumar_peso)

print("\nResult of applying the sum to each group:")
print(resultado)


Original DataFrame:
  animal  peso  altura
0   gato   4.0      25
1  perro  25.0      50
2   gato   3.5      22
3  perro  30.0      55
4   gato   3.9      23
5  perro  20.0      45

Result of applying the sum to each group:
animal
gato     11.4
perro    75.0
dtype: float64
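The function passed to apply can also return a Series, in which case pandas assembles one row per group. The sketch below reuses the animal data from the cell above. The helper `summarize` and the output labels `peso_max` and `altura_media` are hypothetical names chosen for this example. Selecting the value columns before calling apply is one way to keep the grouping column out of each group.

```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["gato", "perro", "gato", "perro", "gato", "perro"],
    "peso": [4, 25, 3.5, 30, 3.9, 20],
    "altura": [25, 50, 22, 55, 23, 45],
})

def summarize(group):
    # Return several statistics at once; each label becomes a column
    return pd.Series({
        "peso_max": group["peso"].max(),
        "altura_media": group["altura"].mean(),
    })

# Select the value columns first so each group holds only 'peso' and 'altura'
resultado = df.groupby("animal")[["peso", "altura"]].apply(summarize)
print(resultado)
```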
