## Pandas `groupby()` vs `groupby().transform()`

### Warm-up

- `groupby()` splits a DataFrame into groups (categories) so that each group can be analyzed separately.
- Applying an aggregate statistic to a `groupby()` object returns one value of that statistic per group.
- This notebook looks at other ways to structure the result of a `groupby()` during analysis.
- Run the code and answer the questions; we will discuss them as a class.

In [1]:
import pandas as pd
import seaborn as sns

In [2]:
penguins = sns.load_dataset('penguins')

In [3]:
penguins.head()

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |


### 1. Calculate the mean of `bill_length_mm` in the dataset

In [4]:
penguins['bill_length_mm'].mean()

np.float64(43.9219298245614)

### 2. Calculate the mean of `bill_length_mm` per species

In [5]:
penguins.groupby('species')['bill_length_mm'].mean()

species
Adelie       38.791391
Chinstrap    48.833824
Gentoo       47.504878
Name: bill_length_mm, dtype: float64
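The aggregated result above is a Series with one entry per species, indexed by the group key. As a minimal sketch of another way to structure it, the snippet below turns that index back into a regular column with `reset_index()`. The `toy` DataFrame and its values are hypothetical stand-ins, not the penguins data.

```python
import pandas as pd

# Hypothetical toy data standing in for the penguins dataset
toy = pd.DataFrame({
    "species": ["A", "A", "B", "B", "B"],
    "bill_length_mm": [38.0, 40.0, 46.0, 48.0, 50.0],
})

# Default result: a Series indexed by the group key
as_series = toy.groupby("species")["bill_length_mm"].mean()

# reset_index() moves the group key into an ordinary column of a DataFrame
as_frame = as_series.reset_index()

print(as_series)
print(as_frame)
```

Either form holds one row per group; the DataFrame form is often more convenient for merging or plotting.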

### 3. What is the difference between the result in 2. and the code snippet below?

In [6]:
penguins.groupby('species')['bill_length_mm'].transform('mean')

0      38.791391
1      38.791391
2      38.791391
3      38.791391
4      38.791391
         ...    
339    47.504878
340    47.504878
341    47.504878
342    47.504878
343    47.504878
Name: bill_length_mm, Length: 344, dtype: float64
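One way to explore this question is to compare the length and index of the two results. The sketch below does so on a small hypothetical DataFrame (`toy` is an assumption for illustration, not the penguins data).

```python
import pandas as pd

# Hypothetical toy data standing in for the penguins dataset
toy = pd.DataFrame({
    "species": ["A", "A", "B", "B", "B"],
    "bill_length_mm": [38.0, 40.0, 46.0, 48.0, 50.0],
})

grouped = toy.groupby("species")["bill_length_mm"]
per_group = grouped.mean()            # one value per species
per_row = grouped.transform("mean")   # one value per original row

print(len(per_group), len(per_row))       # 2 5
print(per_row.index.equals(toy.index))    # True
print(per_row.tolist())                   # [39.0, 39.0, 48.0, 48.0, 48.0]
```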

### 4. How could we add this to the dataframe?

In [7]:
penguins['species_mean'] = penguins.groupby('species')['bill_length_mm'].transform('mean')
penguins

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | species_mean |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male | 38.791391 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female | 38.791391 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female | 38.791391 |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 38.791391 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female | 38.791391 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN | 47.504878 |
| 340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female | 47.504878 |
| 341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male | 47.504878 |
| 342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female | 47.504878 |
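Note that rows with a missing `bill_length_mm` still receive their species mean. As a rough sketch of why, the snippet below reproduces the same column by computing the per-group means first and then looking them up by species with `Series.map`. The `toy` data is hypothetical and includes one missing value on purpose.

```python
import pandas as pd

# Hypothetical toy data; the third row has a missing measurement
toy = pd.DataFrame({
    "species": ["A", "A", "A", "B", "B"],
    "bill_length_mm": [38.0, 40.0, float("nan"), 46.0, 50.0],
})

# Step 1: aggregate to one mean per species (NaN is skipped by default)
group_means = toy.groupby("species")["bill_length_mm"].mean()

# Step 2: look up each row's species in the aggregated result
via_map = toy["species"].map(group_means)

# transform("mean") produces the same values in a single step
via_transform = toy.groupby("species")["bill_length_mm"].transform("mean")

print(via_transform.tolist())  # [39.0, 39.0, 39.0, 48.0, 48.0]
print(via_map.tolist() == via_transform.tolist())  # True
```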


### 5. What exactly does `.transform()` do?

In [8]:
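# Note: this overwrites the species_mean column from step 4,
# now with the per-species mean of bill_depth_mm
penguins['species_mean'] = penguins.groupby('species')['bill_depth_mm'].transform('mean')

# Display all rows instead of the truncated view
pd.set_option('display.max_rows', None)
penguins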


|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | species_mean |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male | 18.346358 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female | 18.346358 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female | 18.346358 |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 18.346358 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female | 18.346358 |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male | 18.346358 |
| 6 | Adelie | Torgersen | 38.9 | 17.8 | 181.0 | 3625.0 | Female | 18.346358 |
| 7 | Adelie | Torgersen | 39.2 | 19.6 | 195.0 | 4675.0 | Male | 18.346358 |
| 8 | Adelie | Torgersen | 34.1 | 18.1 | 193.0 | 3475.0 | NaN | 18.346358 |
| 9 | Adelie | Torgersen | 42.0 | 20.2 | 190.0 | 4250.0 | NaN | 18.346358 |
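Because the transformed column lines up row by row with the original data, it can be combined directly with other columns. The sketch below, on hypothetical toy data, computes each row's deviation from its own species mean in two equivalent ways; the column name `depth_vs_species` is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the penguins dataset
toy = pd.DataFrame({
    "species": ["A", "A", "B", "B", "B"],
    "bill_depth_mm": [18.0, 19.0, 14.0, 15.0, 16.0],
})

# Option 1: broadcast the group mean to every row, then subtract
species_mean = toy.groupby("species")["bill_depth_mm"].transform("mean")
toy["depth_vs_species"] = toy["bill_depth_mm"] - species_mean

# Option 2: pass a function that is applied to each group's values
centered = toy.groupby("species")["bill_depth_mm"].transform(lambda s: s - s.mean())

print(toy["depth_vs_species"].tolist())  # [-0.5, 0.5, -1.0, 0.0, 1.0]
```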


In [9]:
# Restore the default row limit for displayed DataFrames
pd.reset_option('display.max_rows')

### 6. BONUS: Can `.transform()` be used without `groupby()`?

In [10]:
def mean_diff(series):
    """Subtract the mean of the whole Series from each of its values."""
    return series - series.mean()

# Equivalent to .transform(lambda x: x - x.mean())
penguins['bill_length_mm'].transform(mean_diff)

0     -4.82193
1     -4.42193
2     -3.62193
3          NaN
4     -7.22193
        ...   
339        NaN
340    2.87807
341    6.47807
342    1.27807
343    5.97807
Name: bill_length_mm, Length: 344, dtype: float64

In [11]:
import numpy as np

penguins['bill_length_mm'].transform(np.sqrt)

0      6.252999
1      6.284903
2      6.348228
3           NaN
4      6.058052
         ...   
339         NaN
340    6.841053
341    7.099296
342    6.723095
343    7.063993
Name: bill_length_mm, Length: 344, dtype: float64

In [12]:
df = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'score': [80, 85, 90, 95, 100],
})

In [13]:
df_transformed = df.transform(lambda x: x - x.mean())

In [14]:
df_transformed

|   | age | score |
|---|---|---|
| 0 | -10.0 | -10.0 |
| 1 | -5.0 | -5.0 |
| 2 | 0.0 | 0.0 |
| 3 | 5.0 | 5.0 |
| 4 | 10.0 | 10.0 |


In [15]:
df['age'].mean()

np.float64(35.0)
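A related point for the bonus question: without `groupby()`, `.transform()` expects a result with the same index as its input, so a plain aggregation such as `'mean'` should be rejected. The sketch below rebuilds the same small `df` and illustrates this; the exact error message may vary between pandas versions, so it is printed rather than assumed.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 30, 35, 40, 45],
    "score": [80, 85, 90, 95, 100],
})

# A function that keeps the shape is accepted
centered = df.transform(lambda col: col - col.mean())

# A function that reduces each column to a single value is rejected
try:
    df.transform("mean")
    reduced_ok = True
except ValueError as exc:
    reduced_ok = False
    print("transform rejected the aggregation:", exc)

# Reductions belong in .agg() instead
means = df.agg("mean")
print(means)
```

With `groupby()`, the same `'mean'` is allowed because the per-group result is broadcast back to every row of the group.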