### __Python Data Wrangling: Data Grouping__

[_.groupby('column_to_group_by')['column_to_aggregate'].aggregation_function()_](#groupby)

[_.agg(func)_](#agg)

Data grouping has three phases:

1. Split: divide the data into groups according to a given criterion.
2. Apply: apply a calculation to each group.
3. Combine: store the results in a new data structure.
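The three phases can be spelled out by hand, which may help to see what a one-line groupby expression does. This is a minimal sketch on a small made-up DataFrame (the `store` and `sales` values are illustrative assumptions, not a real dataset), not a description of how pandas implements grouping internally:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({
    'store': ['A', 'A', 'B', 'B', 'B', 'C'],
    'sales': [100, 200, 50, 300, 150, 400],
})

# 1. Split: iterating over a GroupBy object yields (key, sub-DataFrame) pairs
totals = {}
for store, group in df.groupby('store'):
    # 2. Apply: compute one value for each group
    totals[store] = group['sales'].sum()

# 3. Combine: collect the per-group results into a new Series
manual = pd.Series(totals, name='sales')
manual.index.name = 'store'

# The usual one-line expression performs the same three steps
builtin = df.groupby('store')['sales'].sum()

print(manual)
print(builtin)
```

Both printed Series should hold the same totals per store; the explicit loop is only for illustration and is normally slower than the built-in aggregation.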

##### _.groupby()_ <a id='groupby'></a>

The groupby() method allows you to group your data and execute functions on these groups.

_dataframe.groupby(by, axis, level, as_index, sort, group_keys, observed, dropna)_

| Parameter | Values | Description |
|---|---|---|
| _by_ | label, list of labels, or function | Specifies how to group the DataFrame. Required unless _level_ is given. |
| _axis_ | 0, 1, 'index', 'columns' | Optional, default 0. The axis to group along. Deprecated since pandas 2.1. |
| _level_ | level or None | Optional, default None. Group by a specific level of a MultiIndex. |
| _as_index_ | True, False | Optional, default True. Set to False if the result should NOT use the group labels as its index. |
| _sort_ | True, False | Optional, default True. Set to False if the result should NOT sort the group keys (better performance). |
| _group_keys_ | True, False | Optional, default True. Set to False if the result of apply() should NOT add the group keys to the index. |
| _observed_ | True, False | Optional. Only relevant when grouping by categorical columns: if True, only the category values that actually appear in the data are shown. |
| _dropna_ | True, False | Optional, default True. Set to False if the result should include the rows where the group key is a missing (NaN) value. |
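To make the effect of some of these parameters concrete, here is a short sketch that compares `as_index`, `sort`, and `dropna` on a made-up DataFrame. The data, including the missing store label, is an assumption chosen for illustration:

```python
import pandas as pd

# Hypothetical data; one row has a missing group key
df = pd.DataFrame({
    'store': ['B', 'A', None, 'B'],
    'sales': [50, 100, 25, 300],
})

# Defaults: keys become the index, keys are sorted, missing keys are dropped
default = df.groupby('store')['sales'].sum()

# as_index=False: the group labels stay in a regular column
flat = df.groupby('store', as_index=False)['sales'].sum()

# sort=False: groups keep the order in which their keys first appear
unsorted = df.groupby('store', sort=False)['sales'].sum()

# dropna=False: rows with a missing key form their own group
with_na = df.groupby('store', dropna=False)['sales'].sum()

for name, result in [('default', default), ('as_index=False', flat),
                     ('sort=False', unsorted), ('dropna=False', with_na)]:
    print(name)
    print(result)
    print()
```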

_df.groupby('column_to_group_by')['column_to_aggregate'].aggregation_function()_

__Common Aggregation Functions__

You can use:

- .sum(): total

- .mean(): average

- .count(): number of non-null values

- .max() / .min(): highest/lowest

- .median(), .std(), .nunique(), etc.
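The counting functions are easy to confuse when a column has missing values. The sketch below, on a made-up DataFrame (the values are assumptions for illustration), contrasts `.count()`, `.size()`, and `.nunique()`:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing sales value in store A
df = pd.DataFrame({
    'store': ['A', 'A', 'B', 'B'],
    'sales': [100.0, np.nan, 50.0, 50.0],
})

grouped = df.groupby('store')['sales']

counts = grouped.count()     # non-null values per group
sizes = grouped.size()       # rows per group, missing values included
uniques = grouped.nunique()  # distinct non-null values per group

print(pd.DataFrame({'count': counts, 'size': sizes, 'nunique': uniques}))
```

Store A has two rows but only one non-null sale, so its count and size differ; store B has two identical sales, so its nunique is 1.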

In [1]:
import pandas as pd

data = {
    'store': ['A', 'A', 'B', 'B', 'B', 'C'],
    'month': ['january', 'january', 'october', 'june', 'february', 'december'],
    'sales': [100, 200, 50, 300, 150, 400]
}

df = pd.DataFrame(data)

print(df)
print()

print(df.groupby('store')['sales'].sum())
print()
print(df.groupby(['store', 'month'])['sales'].sum())
print()
print(df.groupby('store')['sales'].agg(['sum', 'mean', 'count']))
print()
df.groupby("month", as_index=False)["sales"].mean()

  store     month  sales
0     A   january    100
1     A   january    200
2     B   october     50
3     B      june    300
4     B  february    150
5     C  december    400

store
A    300
B    500
C    400
Name: sales, dtype: int64

store  month   
A      january     300
B      february    150
       june        300
       october      50
C      december    400
Name: sales, dtype: int64

       sum        mean  count
store                        
A      300  150.000000      2
B      500  166.666667      3
C      400  400.000000      1

      month  sales
0  december  400.0
1  february  150.0
2   january  150.0
3      june  300.0
4   october   50.0


In [3]:
data = {
  'co2': [95, 90, 99, 104, 105, 94, 99, 104],
  'model': ['Citigo', 'Fabia', 'Fiesta', 'Rapid', 'Focus', 'Mondeo', 'Octavia', 'B-Max'],
  'car': ['Skoda', 'Skoda', 'Ford', 'Skoda', 'Ford', 'Ford', 'Skoda', 'Ford']
}

df = pd.DataFrame(data)

result = df.groupby(["car"]).agg({
    'co2': 'mean',  # Calculate mean for 'co2'
    'model': 'first'  # Take the first value for 'model'
})

print(result)

         co2   model
car                 
Ford   100.5  Fiesta
Skoda   97.0  Citigo


##### __Example 1__

Group the exoplanets by discovery year ('discovered') and count the 'radius' values in each group.

In [17]:
import pandas as pd

df_exoplanets = pd.read_csv('DataSets/exoplanet.csv')

print(df_exoplanets)
print()

# Printing the GroupBy object itself shows only its type and memory address
print(df_exoplanets.groupby(by='discovered'))
print()

# Count the non-null 'radius' values for each discovery year
df_exo_number = df_exoplanets.groupby(by='discovered')["radius"].count()
print(df_exo_number)

    num        name  Property      mass    radius  discovered
0     0   1RXS1609b         0  14.00000  19.04000        2008
1     1  2M0122-24b         1  20.00000  11.20000        2013
2     2  2M0219-39b         2  13.90000  16.12800        2015
3     3  2M0746+20b         3  12.21000  10.86400        2010
4     4  2M2140+16b         4  20.00000  10.30400        2010
5     5  2M2206-20b         5  30.00000  14.56000        2010
6     6      51Erib         6   9.10000  12.43200        2015
7     7      51Pegb         7   0.47000  21.28000        1995
8     8      55Cnce         8   0.02703   1.94544        2004
9     9   BD+20594b         9   0.05130   2.22880        2016
10   10  BD-103166b        10   0.46000  11.53600        2000
11   11      CTChab        11  17.00000  24.64000        2008
12   12     CVSO30b        12   6.20000  21.39200        2012
13   13    CoRoT-1b        13   1.03000  16.68800        2007
14   14   CoRoT-10b        14   2.75000  10.86400        2010
15   15 

Group the exoplanets by discovery year ('discovered') and sum the 'radius' values in each group.

In [18]:
df_exo_radius_sum = df_exoplanets.groupby('discovered')['radius'].sum()
print(df_exo_radius_sum)

discovered
1995     21.28000
2000     11.53600
2004      1.94544
2007     33.09600
2008     43.68000
2010    137.92800
2011     73.75648
2012     21.39200
2013     11.20000
2015     28.56000
2016      2.22880
Name: radius, dtype: float64


In [19]:
df_exo_radius_mean = df_exo_radius_sum / df_exo_number 
print(df_exo_radius_mean)

discovered
1995    21.280000
2000    11.536000
2004     1.945440
2007    16.548000
2008    21.840000
2010    12.538909
2011    10.536640
2012    21.392000
2013    11.200000
2015    14.280000
2016     2.228800
Name: radius, dtype: float64
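Dividing the per-group sum by the per-group count, as in the cell above, gives the per-group mean, so the same result should be available directly from `.mean()`. The sketch below checks this on a small stand-in DataFrame (a handful of illustrative rows, assumed here because the CSV file is not part of this text):

```python
import pandas as pd

# Small stand-in for the exoplanet data (illustrative rows only)
df = pd.DataFrame({
    'discovered': [2008, 2008, 2010, 2010, 2010],
    'radius': [19.04, 24.64, 10.864, 10.304, 14.56],
})

grouped = df.groupby('discovered')['radius']

# Mean computed by hand: sum of each group divided by its non-null count
manual_mean = grouped.sum() / grouped.count()

# Mean computed directly
direct_mean = grouped.mean()

print(manual_mean)
print(direct_mean)
```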


##### __Example 2__

Group the Digimon by their evolution level (Stage), then apply different aggregation methods to the groups to obtain the following information:

- The total number of Digimon by level (Stage).

- The sum of the health values (Lv 50 HP) by level.

- The average of the speed values (Lv50 Spd) by level.

In [21]:
import pandas as pd

# Load the CSV file
digimon_data = pd.read_csv('DataSets/DigiDB_digimonlist.csv')


grouped_stage_count =  digimon_data.groupby("Stage")["Digimon"].count()
grouped_stage_sum =  digimon_data.groupby("Stage")["Lv 50 HP"].sum()
grouped_stage_mean =  digimon_data.groupby("Stage")["Lv50 Spd"].mean()


print('Digimon count by stage', '\n', grouped_stage_count, '\n')
print('Total health', '\n', grouped_stage_sum, '\n')
print('Average speed', '\n', grouped_stage_mean)

Digimon count by stage 
 Stage
Armor           3
Baby            5
Champion       54
In-Training    11
Mega           74
Rookie         38
Ultimate       58
Ultra           6
Name: Digimon, dtype: int64 

Total health 
 Stage
Armor            3510
Baby             3640
Champion        58700
In-Training      9290
Mega           107700
Rookie          34980
Ultimate        74640
Ultra            9050
Name: Lv 50 HP, dtype: int64 

Average speed 
 Stage
Armor          128.666667
Baby            77.000000
Champion       103.259259
In-Training     81.181818
Mega           152.486486
Rookie          90.236842
Ultimate       122.586207
Ultra          152.833333
Name: Lv50 Spd, dtype: float64


##### _Processing grouped data with agg()_

In [4]:
import pandas as pd

df = pd.read_csv('DataSets/vg_sales.csv')
print(df)
print()
df.dropna(inplace=True)

print(df)

                                name platform  year_of_release         genre  \
0                         Wii Sports      Wii           2006.0        Sports   
1                  Super Mario Bros.      NES           1985.0      Platform   
2                     Mario Kart Wii      Wii           2008.0        Racing   
3                  Wii Sports Resort      Wii           2009.0        Sports   
4           Pokemon Red/Pokemon Blue       GB           1996.0  Role-Playing   
...                              ...      ...              ...           ...   
16712  Samurai Warriors: Sanada Maru      PS3           2016.0        Action   
16713               LMA Manager 2007     X360           2006.0        Sports   
16714        Haitaka no Psychedelica      PSV           2016.0     Adventure   
16715               Spirits & Spells      GBA           2003.0      Platform   
16716            Winning Post 8 2016      PSV           2016.0    Simulation   

          publisher developer  na_sales

Let's say we need the average critic score for each genre:

In [5]:
mean_score = df.groupby('genre')['critic_score'].mean()
print(mean_score)

genre
Action          66.701897
Adventure       65.229299
Fighting        69.158416
Misc            66.608696
Platform        68.173824
Puzzle          67.152778
Racing          68.068245
Role-Playing    72.655267
Shooter         70.260022
Simulation      68.567723
Sports          71.972318
Strategy        72.282313
Name: critic_score, dtype: float64


The index of the mean_score Series holds the groupby() keys, in this case the unique values in the 'genre' column. A groupby() followed by an aggregation replaces the original row index with the keys we grouped by.
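Because the group keys end up in the index, they can be used as labels for lookups, or moved back into a regular column with `reset_index()`. A minimal sketch on made-up scores (the genres and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    'genre': ['Sports', 'Racing', 'Sports', 'Racing'],
    'critic_score': [76.0, 82.0, 80.0, 80.0],
})

mean_score = df.groupby('genre')['critic_score'].mean()

# The unique genres are now the index labels
print(mean_score.index.tolist())

# so a single group's result can be looked up by its key
print(mean_score['Sports'])

# reset_index() turns the keys back into an ordinary column
as_column = mean_score.reset_index()
print(as_column)
```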

In [6]:
grp = df.groupby(['platform', 'genre'])
print(grp['critic_score'].mean())

platform  genre       
3DS       Action          62.982759
          Adventure       67.500000
          Fighting        68.857143
          Misc            69.100000
          Platform        72.444444
                            ...    
XOne      Role-Playing    80.777778
          Shooter         77.656250
          Simulation      59.000000
          Sports          71.093750
          Strategy        70.000000
Name: critic_score, Length: 197, dtype: float64


Now we have the average critic score for each genre on each platform. Since we grouped by two columns, the result is a Series with a MultiIndex: each average score is labeled by two index levels, 'platform' and 'genre'.
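A result indexed by two levels can be queried with a tuple of keys, sliced by the first level, or pivoted into a table with `unstack()`. The following is a sketch on a tiny made-up DataFrame (platforms, genres, and scores are illustrative assumptions):

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    'platform': ['Wii', 'Wii', 'NES', 'Wii'],
    'genre': ['Sports', 'Racing', 'Platform', 'Sports'],
    'critic_score': [76.0, 82.0, 90.0, 80.0],
})

scores = df.groupby(['platform', 'genre'])['critic_score'].mean()

# One value: pass both keys as a tuple
wii_sports = scores.loc[('Wii', 'Sports')]
print(wii_sports)

# All genres for one platform: pass only the first-level key
wii_only = scores.loc['Wii']
print(wii_only)

# Pivot the inner level into columns; missing combinations become NaN
wide = scores.unstack('genre')
print(wide)
```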

Here, the grp variable holds a DataFrameGroupBy object: the grouped DataFrame before we process each group with the mean() method. It is a lazy, intermediate object, so no aggregation has been computed yet. If we try to print grp, we only see a text representation of the object.

In [7]:
grp = df.groupby(['platform', 'genre'])
print(grp)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000026BD9B98910>


When we print df.groupby('column_name'), we don't see a table as we would if we printed df. Instead, we see the type of the grouped object (DataFrameGroupBy) and a hexadecimal string (0x0000026BD9B98910 in the output above) representing the location in the computer's memory where the object is stored; this address changes from run to run. No data is displayed until we process the groups.
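Although printing the grouped object shows no data, it can still be inspected before any aggregation. This sketch uses a made-up DataFrame (an assumption for illustration) and a few standard GroupBy attributes and methods:

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    'platform': ['Wii', 'NES', 'Wii'],
    'critic_score': [76.0, 90.0, 80.0],
})

grp = df.groupby('platform')

print(type(grp).__name__)  # the class of the grouped object
print(grp.ngroups)         # how many groups were formed
print(grp.size())          # number of rows in each group

# Pull out the rows that belong to a single group
wii = grp.get_group('Wii')
print(wii)

# Iterate over (key, sub-DataFrame) pairs
for platform, rows in grp:
    print(platform, len(rows))
```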

##### __Split-Apply-Combine__

The DataFrameGroupBy object is part of a data processing pattern called split-apply-combine:

1. Split the data into groups.

2. Apply a statistical aggregation function to each group.

3. Combine the results for each group.

The code below illustrates each of the three components of split-apply-combine:

In [8]:
grp = df.groupby(['platform', 'genre'])
mean_scores = grp['critic_score'].mean()
print(mean_scores)

platform  genre       
3DS       Action          62.982759
          Adventure       67.500000
          Fighting        68.857143
          Misc            69.100000
          Platform        72.444444
                            ...    
XOne      Role-Playing    80.777778
          Shooter         77.656250
          Simulation      59.000000
          Sports          71.093750
          Strategy        70.000000
Name: critic_score, Length: 197, dtype: float64


We split the data into groups using df.groupby(['platform', 'genre']), apply the mean() method to the 'critic_score' column of each group, and pandas combines the per-group results into a single Series object, mean_scores.

Of course, we can skip creating the grp and mean_scores objects and have pandas perform all three steps in a single line of code:

In [9]:
print(df.groupby(['platform', 'genre'])['critic_score'].mean())

platform  genre       
3DS       Action          62.982759
          Adventure       67.500000
          Fighting        68.857143
          Misc            69.100000
          Platform        72.444444
                            ...    
XOne      Role-Playing    80.777778
          Shooter         77.656250
          Simulation      59.000000
          Sports          71.093750
          Strategy        70.000000
Name: critic_score, Length: 197, dtype: float64


##### _.agg()_ <a id='agg'></a>

The agg() method applies a function, or a list of functions, along one axis of the DataFrame. The default is axis 0, the index (row) axis.

_dataframe.agg(func, axis, args, kwargs)_

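| Parameter | Values | Description |
|---|---|---|
| _func_ | function, function name, list, or dictionary | Required. A function, a function name, a list of functions or function names, or a dictionary that maps column names to functions. |
| _axis_ | 0, 1, 'index', 'columns' | Optional, default 0. The axis to apply the function along. |
| _args_ | | Optional. Positional arguments to pass to the function. |
| _kwargs_ | | Optional. Keyword arguments to pass to the function. |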
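As a quick illustration of the parameters above, the sketch below calls agg() directly on a DataFrame, without any grouping. The column names and values are made up for the example:

```python
import pandas as pd

# Hypothetical sales figures, for illustration only
df = pd.DataFrame({
    'na_sales': [1.0, 2.0, 3.0],
    'jp_sales': [0.5, 0.5, 2.0],
})

# A list of function names: one result row per function, one column per column
summary = df.agg(['sum', 'mean'])
print(summary)

# axis=1 applies the function across the columns of each row instead
row_totals = df.agg('sum', axis=1)
print(row_totals)
```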

So far, we've only applied a single function to our groups. But what if we want to calculate different summary statistics for different columns? For example, both the average review score and total sales in Japan for each group? We can do this using the agg() method, which is short for "aggregate."

The agg() method accepts a dictionary whose keys are column names and whose values are the aggregation functions to apply to those columns:

In [10]:
import pandas as pd

df = pd.read_csv('DataSets/vg_sales.csv')
df.dropna(inplace=True)

agg_dict = {'critic_score': 'mean', 'jp_sales': 'sum'}

grp = df.groupby(['platform', 'genre'])
print(grp.agg(agg_dict))

                       critic_score  jp_sales
platform genre                               
3DS      Action           62.982759      6.60
         Adventure        67.500000      0.66
         Fighting         68.857143      0.46
         Misc             69.100000      1.22
         Platform         72.444444      5.94
...                             ...       ...
XOne     Role-Playing     80.777778      0.01
         Shooter          77.656250      0.13
         Simulation       59.000000      0.00
         Sports           71.093750      0.02
         Strategy         70.000000      0.00

[197 rows x 2 columns]


In [11]:
# Custom aggregation function: receives the 'jp_sales' values of one group
def double_it(sales):
    # Sum the group's sales, then multiply the total by 2
    return sales.sum() * 2

agg_dict = {'jp_sales': double_it}

grp = df.groupby(['platform', 'genre'])
print(grp.agg(agg_dict))

                       jp_sales
platform genre                 
3DS      Action           13.20
         Adventure         1.32
         Fighting          0.92
         Misc              2.44
         Platform         11.88
...                         ...
XOne     Role-Playing      0.02
         Shooter           0.26
         Simulation        0.00
         Sports            0.04
         Strategy          0.00

[197 rows x 1 columns]
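The values of the dictionary can also be lists, which is one way to compute several statistics for the same column and to mix built-in function names with custom functions. The sketch below uses made-up data and a hypothetical helper, `sales_range`, introduced only for this example:

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    'genre': ['Sports', 'Sports', 'Racing'],
    'critic_score': [76.0, 80.0, 82.0],
    'jp_sales': [1.0, 3.0, 2.0],
})

# Hypothetical custom aggregation: spread between a group's best and worst sales
def sales_range(sales):
    return sales.max() - sales.min()

agg_dict = {
    'critic_score': ['mean', 'max'],     # two built-in functions
    'jp_sales': ['sum', sales_range],    # a built-in and a custom function
}

result = df.groupby('genre').agg(agg_dict)
print(result)

# With lists as values, the result columns are (column, function) pairs
print(result.columns.tolist())
```

A single statistic can then be selected with a tuple, for example `result[('jp_sales', 'sales_range')]`.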


##### __Exercise 01__

The precode creates a 'total_sales' column by adding the 'na_sales', 'eu_sales', and 'jp_sales' columns. You'll use these columns, so make a note of their names.

The precode then groups the DataFrame df by the 'genre' column and assigns the resulting grouped object to the grp variable.

Now do the following:

- Create a dictionary, assigned to a variable called agg_dict, that calculates for each genre:
    - Sum of total sales.
    - Average NA (North America) sales.
    - Average EU (Europe) sales.
    - Average JP (Japan) sales.
- Assign the result of agg() to a variable called genre.
- Print genre.

In [15]:
import pandas as pd

df = pd.read_csv('DataSets/vg_sales.csv')
df['total_sales'] = df['na_sales'] + df['eu_sales'] + df['jp_sales']

grp = df.groupby('genre')

agg_dict = {'total_sales':'sum', 'na_sales':'mean', 'eu_sales':'mean', 'jp_sales':'mean'}

genre = grp.agg(agg_dict)

print(genre)

              total_sales  na_sales  eu_sales  jp_sales
genre                                                  
Action            1559.58  0.260834  0.154045  0.047905
Adventure          221.10  0.080783  0.048764  0.040138
Fighting           411.17  0.263086  0.118174  0.103039
Misc               728.12  0.232726  0.121566  0.061777
Platform           776.68  0.501689  0.225619  0.147331
Puzzle             230.19  0.211845  0.086224  0.098810
Racing             652.57  0.287710  0.189359  0.045404
Role-Playing       874.98  0.220540  0.125807  0.236973
Shooter            948.34  0.447649  0.239864  0.029297
Simulation         359.51  0.208455  0.129886  0.072998
Sports            1196.76  0.291495  0.160473  0.057726
Strategy           163.38  0.100366  0.066135  0.072709
