### __Python Data Wrangling: Data Grouping__

Data grouping has three phases, often called split-apply-combine:

1. Split: divide the data into groups according to a given criterion.
2. Apply: run a calculation on each group.
3. Combine: store the results in a new data structure.
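
To make the three phases concrete, here is a minimal sketch that performs them by hand and compares the result with the one-line groupby call. The small `df` and the names `totals`, `manual` and `direct` are illustrative assumptions, not part of any dataset used later.

```python
import pandas as pd

# Illustrative data (assumed for this sketch)
df = pd.DataFrame({
    "store": ["A", "A", "B", "B", "B", "C"],
    "sales": [100, 200, 50, 300, 150, 400],
})

# 1. Split: iterating over a GroupBy object yields (key, sub-DataFrame) pairs.
# 2. Apply: compute the total of each group.
totals = {}
for store, group in df.groupby("store"):
    totals[store] = group["sales"].sum()

# 3. Combine: collect the per-group results in a new Series.
manual = pd.Series(totals, name="sales")

# The usual one-line form does all three phases at once.
direct = df.groupby("store")["sales"].sum()

print(manual)
print(direct)
```

Both Series hold the same totals; in practice the one-line form is the one to use, and the loop is only there to show what happens under the hood.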

##### _.groupby()_

The groupby() method allows you to group your data and execute functions on these groups.

_dataframe.groupby(by, axis, level, as_index, sort, group_keys, observed, dropna)_

| Parameter | Values | Description |
| --- | --- | --- |
| _by_ | label, list of labels, mapping, or function | Specifies how to group the DataFrame. Required unless _level_ is given. |
| _axis_ | 0, 1, 'index', 'columns' | Optional, default 0. The axis to group along. Deprecated since pandas 2.1. |
| _level_ | level name or number, None | Optional, default None. Group by a particular level of a MultiIndex. |
| _as_index_ | True, False | Optional, default True. Set to False if the result should NOT use the group labels as the index. |
| _sort_ | True, False | Optional, default True. Set to False if the result should NOT sort the group keys (for better performance). |
| _group_keys_ | True, False | Optional, default True. Set to False if the result of apply() should NOT add the group keys to the index. |
| _observed_ | True, False | Optional. Applies only to categorical group keys: set to True to show only the category values that actually occur in the data. |
| _dropna_ | True, False | Optional, default True. Set to False if the result should include the rows/columns where the group key is a NULL value. |
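
The effect of _as_index_, _sort_ and _dropna_ is easier to see side by side. The following is a small sketch on made-up data (the DataFrame and the variable names are assumptions for illustration only); one group key is deliberately missing.

```python
import numpy as np
import pandas as pd

# Illustrative data: note the missing store in the third row
df = pd.DataFrame({
    "store": ["B", "A", np.nan, "A"],
    "sales": [50, 100, 70, 200],
})

# Defaults: group labels become the index, keys are sorted, NaN keys are dropped
default = df.groupby("store")["sales"].sum()

# as_index=False: the group key stays a regular column of a DataFrame
flat = df.groupby("store", as_index=False)["sales"].sum()

# sort=False: groups appear in order of first appearance (B before A)
unsorted = df.groupby("store", sort=False)["sales"].sum()

# dropna=False: rows whose key is missing form their own group
with_nan = df.groupby("store", dropna=False)["sales"].sum()

for name, result in [("default", default), ("as_index=False", flat),
                     ("sort=False", unsorted), ("dropna=False", with_nan)]:
    print(name)
    print(result)
    print()
```

With the defaults the row without a store silently disappears from the result, so _dropna=False_ is worth considering whenever missing keys are meaningful.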

_df.groupby('column_to_group_by')['column_to_aggregate'].aggregation_function()_

__Common Aggregation Functions__

You can use:

- .sum(): total
- .mean(): average
- .count(): number of non-null values (use .size() for the number of rows)
- .max() / .min(): highest / lowest value
- .median(), .std(), .nunique(), etc.
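
The counting functions are easy to confuse when the data has missing values. As a hedged illustration on made-up data (the DataFrame and the `summary` name are assumptions), the sketch below puts `.size()`, `.count()` and `.nunique()` next to each other.

```python
import numpy as np
import pandas as pd

# Illustrative data: store A has one missing sale, store B has a repeated value
df = pd.DataFrame({
    "store": ["A", "A", "B", "B", "B"],
    "sales": [100, np.nan, 50, 50, 150],
})

grouped = df.groupby("store")["sales"]

summary = pd.DataFrame({
    "size": grouped.size(),        # rows per group, missing values included
    "count": grouped.count(),      # non-null values per group
    "nunique": grouped.nunique(),  # distinct non-null values per group
})
print(summary)
```

For store A the size is 2 but the count is 1, because the missing sale is a row but not a value.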

In [1]:
import pandas as pd

data = {
    'store': ['A', 'A', 'B', 'B', 'B', 'C'],
    'month': ['january', 'january', 'october', 'june', 'february', 'december'],
    'sales': [100, 200, 50, 300, 150, 400]
}

df = pd.DataFrame(data)

print(df)
print()

# Total sales per store
print(df.groupby('store')['sales'].sum())
print()
# Total sales per store and month (the result has a two-level index)
print(df.groupby(['store', 'month'])['sales'].sum())
print()
# Several aggregations in one call
print(df.groupby('store')['sales'].agg(['sum', 'mean', 'count']))
print()
# Mean sales per month; as_index=False keeps 'month' as a regular column
df.groupby("month", as_index=False)["sales"].mean()

  store     month  sales
0     A   january    100
1     A   january    200
2     B   october     50
3     B      june    300
4     B  february    150
5     C  december    400

store
A    300
B    500
C    400
Name: sales, dtype: int64

store  month   
A      january     300
B      february    150
       june        300
       october      50
C      december    400
Name: sales, dtype: int64

       sum        mean  count
store                        
A      300  150.000000      2
B      500  166.666667      3
C      400  400.000000      1

      month  sales
0  december  400.0
1  february  150.0
2   january  150.0
3      june  300.0
4   october   50.0


In [3]:
data = {
  'co2': [95, 90, 99, 104, 105, 94, 99, 104],
  'model': ['Citigo', 'Fabia', 'Fiesta', 'Rapid', 'Focus', 'Mondeo', 'Octavia', 'B-Max'],
  'car': ['Skoda', 'Skoda', 'Ford', 'Skoda', 'Ford', 'Ford', 'Skoda', 'Ford']
}

df = pd.DataFrame(data)

result = df.groupby(["car"]).agg({
    'co2': 'mean',  # Calculate mean for 'co2'
    'model': 'first'  # Take the first value for 'model'
})

print(result)

         co2   model
car                 
Ford   100.5  Fiesta
Skoda   97.0  Citigo
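
When the same column needs more than one aggregation, or the result columns should get descriptive names, pandas also supports named aggregation: keyword arguments of the form `new_name=(column, function)`. The sketch below reuses the car data from the cell above; the output column names `mean_co2`, `max_co2` and `n_models` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "co2": [95, 90, 99, 104, 105, 94, 99, 104],
    "model": ["Citigo", "Fabia", "Fiesta", "Rapid", "Focus", "Mondeo", "Octavia", "B-Max"],
    "car": ["Skoda", "Skoda", "Ford", "Skoda", "Ford", "Ford", "Skoda", "Ford"],
})

# Each keyword names an output column: (input column, aggregation function)
result = df.groupby("car").agg(
    mean_co2=("co2", "mean"),
    max_co2=("co2", "max"),
    n_models=("model", "count"),
)
print(result)
```

Unlike the dictionary form, this allows two different aggregations of `co2` in the same call without producing a multi-level column index.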


##### __Example 1__

For the exoplanets dataset, count the planets in each discovery year (discovered), based on their non-null radius values.

In [17]:
import pandas as pd

df_exoplanets = pd.read_csv('DataSets/exoplanet.csv')

print(df_exoplanets)
print()

# Printing a GroupBy object only shows its repr; nothing is computed until an aggregation is applied
print(df_exoplanets.groupby(by='discovered'))
print()

# Count the non-null radius values in each discovery year
df_exo_number = df_exoplanets.groupby(by='discovered')["radius"].count()
print(df_exo_number)

    num        name  Property      mass    radius  discovered
0     0   1RXS1609b         0  14.00000  19.04000        2008
1     1  2M0122-24b         1  20.00000  11.20000        2013
2     2  2M0219-39b         2  13.90000  16.12800        2015
3     3  2M0746+20b         3  12.21000  10.86400        2010
4     4  2M2140+16b         4  20.00000  10.30400        2010
5     5  2M2206-20b         5  30.00000  14.56000        2010
6     6      51Erib         6   9.10000  12.43200        2015
7     7      51Pegb         7   0.47000  21.28000        1995
8     8      55Cnce         8   0.02703   1.94544        2004
9     9   BD+20594b         9   0.05130   2.22880        2016
10   10  BD-103166b        10   0.46000  11.53600        2000
11   11      CTChab        11  17.00000  24.64000        2008
12   12     CVSO30b        12   6.20000  21.39200        2012
13   13    CoRoT-1b        13   1.03000  16.68800        2007
14   14   CoRoT-10b        14   2.75000  10.86400        2010
15   15 

For the exoplanets dataset, sum the radius values within each discovery year (discovered).

In [18]:
df_exo_radius_sum = df_exoplanets.groupby('discovered')['radius'].sum()
print(df_exo_radius_sum)

discovered
1995     21.28000
2000     11.53600
2004      1.94544
2007     33.09600
2008     43.68000
2010    137.92800
2011     73.75648
2012     21.39200
2013     11.20000
2015     28.56000
2016      2.22880
Name: radius, dtype: float64


In [19]:
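# Mean radius per discovery year, computed as sum / count (the two Series align on the 'discovered' index)
df_exo_radius_mean = df_exo_radius_sum / df_exo_number
print(df_exo_radius_mean)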

discovered
1995    21.280000
2000    11.536000
2004     1.945440
2007    16.548000
2008    21.840000
2010    12.538909
2011    10.536640
2012    21.392000
2013    11.200000
2015    14.280000
2016     2.228800
Name: radius, dtype: float64
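
Dividing the sum by the count reproduces what `.mean()` computes directly, as long as the count covers the same non-null values that went into the sum. The sketch below checks this on a small sample; the eight rows are copied from the first rows of the printed DataFrame above and are an assumption standing in for the full CSV file.

```python
import numpy as np
import pandas as pd

# A few (discovered, radius) pairs taken from the printed rows; not the full dataset
sample = pd.DataFrame({
    "discovered": [2008, 2013, 2015, 2010, 2010, 2010, 2015, 2008],
    "radius": [19.04, 11.2, 16.128, 10.864, 10.304, 14.56, 12.432, 24.64],
})

grouped = sample.groupby("discovered")["radius"]

manual_mean = grouped.sum() / grouped.count()  # the two-step approach
direct_mean = grouped.mean()                   # the built-in aggregation

print(direct_mean)
print(np.allclose(manual_mean, direct_mean))
```

The direct call is shorter and avoids keeping two intermediate Series in sync.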


##### __Example 2__

Group the Digimon by their evolution level (Stage), then apply different aggregation methods to this grouping to obtain the following information:

- The total number of Digimon by level (Stage).

- The sum of the health values (Lv 50 HP) by level.

- The average of the speed values (Lv50 Spd) by level.

In [21]:
import pandas as pd

# Load the CSV file
digimon_data = pd.read_csv('DataSets/DigiDB_digimonlist.csv')

# One aggregation per question, all grouped by evolution stage
grouped_stage_count = digimon_data.groupby("Stage")["Digimon"].count()
grouped_stage_sum = digimon_data.groupby("Stage")["Lv 50 HP"].sum()
grouped_stage_mean = digimon_data.groupby("Stage")["Lv50 Spd"].mean()


print('Digimon count by stage', '\n', grouped_stage_count, '\n')
print('Total health', '\n', grouped_stage_sum, '\n')
print('Average speed', '\n', grouped_stage_mean)

Digimon count by stage 
 Stage
Armor           3
Baby            5
Champion       54
In-Training    11
Mega           74
Rookie         38
Ultimate       58
Ultra           6
Name: Digimon, dtype: int64 

Total health 
 Stage
Armor            3510
Baby             3640
Champion        58700
In-Training      9290
Mega           107700
Rookie          34980
Ultimate        74640
Ultra            9050
Name: Lv 50 HP, dtype: int64 

Average speed 
 Stage
Armor          128.666667
Baby            77.000000
Champion       103.259259
In-Training     81.181818
Mega           152.486486
Rookie          90.236842
Ultimate       122.586207
Ultra          152.833333
Name: Lv50 Spd, dtype: float64
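
The three results can also be produced by a single `.agg()` call that returns one table. Because the CSV file is not reproduced here, the sketch below uses a made-up four-row DataFrame with the same column names; the rows, the values and the output column names (`total`, `hp_sum`, `spd_mean`) are all assumptions for illustration.

```python
import pandas as pd

# Made-up rows with the same column names as the Digimon file; values are illustrative only
toy = pd.DataFrame({
    "Digimon": ["Digimon 1", "Digimon 2", "Digimon 3", "Digimon 4"],
    "Stage": ["Baby", "Baby", "Rookie", "Rookie"],
    "Lv 50 HP": [500, 700, 900, 1100],
    "Lv50 Spd": [80, 90, 100, 120],
})

# One groupby, three named aggregations: (input column, function)
summary = toy.groupby("Stage").agg(
    total=("Digimon", "count"),
    hp_sum=("Lv 50 HP", "sum"),
    spd_mean=("Lv50 Spd", "mean"),
)
print(summary)
```

Applied to `digimon_data`, the same call would give one row per stage with the count, the health total and the average speed side by side.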
