https://towardsdatascience.com/when-to-use-pandas-transform-function-df8861aa0dcf

A follow-along notebook where I work through this article by B. Chen and noodle around with the given code to really understand it.

# When to use `.transform` in Pandas
- Transforming values
- Combining `groupby()` results
- Filtering data
- Handling missing values at the group level

In [22]:
import pandas as pd
import numpy as np

## 1. Transforming values
`DataFrame.transform(func, axis=0)` applies `func` and returns a result with the same index as the input.

In [2]:
df = pd.DataFrame({'A': [1,2,3], 'B': [10,20,30]})
df

Unnamed: 0,A,B
0,1,10
1,2,20
2,3,30


In [3]:
# add 10 to each number, using a lambda
df.transform(lambda x: x+10)

Unnamed: 0,A,B
0,11,20
1,12,30
2,13,40
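A quick sketch of what sets `transform()` apart from a plain aggregation: the result keeps the input's index, and a function that reduces each column to a single value is rejected. The names `df_demo`, `shifted`, and `reduction_rejected` are made up for this illustration.

```python
import pandas as pd

# Illustrative copy of the small frame used above.
df_demo = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# An element-wise function keeps the original index and shape.
shifted = df_demo.transform(lambda x: x + 10)
same_index = shifted.index.equals(df_demo.index)

# A reducing function such as 'sum' does not "transform", so pandas raises.
try:
    df_demo.transform('sum')
    reduction_rejected = False
except ValueError:
    reduction_rejected = True
```

For reductions, `agg()` or the method itself (`df.sum()`) is the usual tool; `transform()` is for results that line up row by row with the input.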


Take the square root of each number.
A string works too, as long as pandas can resolve it to a function (here, `'sqrt'` resolves to NumPy's `sqrt`).

In [4]:
df.transform('sqrt')

Unnamed: 0,A,B
0,1.0,3.162278
1,1.414214,4.472136
2,1.732051,5.477226
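As a sanity check, and assuming the string `'sqrt'` is resolved to the NumPy function of the same name, passing `np.sqrt` directly should give the same frame. `df_demo`, `by_name`, and `by_callable` are illustrative names, not part of the article.

```python
import numpy as np
import pandas as pd

# Illustrative copy of the small frame used above.
df_demo = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

by_name = df_demo.transform('sqrt')        # function given as a string
by_callable = df_demo.transform(np.sqrt)   # function given as a callable

# Raises AssertionError if the two results differ.
pd.testing.assert_frame_equal(by_name, by_callable)
```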


You can also supply a list of functions; the result gets one column per (column, function) pair.

In [5]:
df.transform(['sqrt', 'exp'])

Unnamed: 0_level_0,A,A,B,B
Unnamed: 0_level_1,sqrt,exp,sqrt,exp
0,1.0,2.718282,3.162278,22026.47
1,1.414214,7.389056,4.472136,485165200.0
2,1.732051,20.085537,5.477226,10686470000000.0
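With a list of functions the result has two-level column labels, which is easy to miss in the printed output. A small sketch of how to inspect and select from it (`df_demo`, `both`, and `sqrt_of_a` are illustrative names):

```python
import pandas as pd

# Illustrative copy of the small frame used above.
df_demo = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
both = df_demo.transform(['sqrt', 'exp'])

# Each column label is an (original column, function name) pair.
column_pairs = list(both.columns)

# Select a single result with a tuple key...
sqrt_of_a = both[('A', 'sqrt')]

# ...or every function applied to one original column.
all_for_b = both['B']
```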


You can also supply a dictionary that maps each column to its own function.

In [6]:
df.transform({
    'A': 'sqrt',
    'B': lambda x: x/10
})

Unnamed: 0,A,B
0,1.0,1.0
1,1.414214,2.0
2,1.732051,3.0


---

## 2. Combining `groupby()` results
This is one of the most compelling reasons to use `transform()`.

In [7]:
df = pd.DataFrame({
  'restaurant_id': [101,102,103,104,105,106,107],
  'address': ['A','B','C','D', 'E', 'F', 'G'],
  'city': ['London','London','London','Oxford','Oxford', 'Durham', 'Durham'],
  'sales': [10,500,48,12,21,22,14]
})
df

Unnamed: 0,restaurant_id,address,city,sales
0,101,A,London,10
1,102,B,London,500
2,103,C,London,48
3,104,D,Oxford,12
4,105,E,Oxford,21
5,106,F,Durham,22
6,107,G,Durham,14


Each city has multiple restaurants.  
**We want to know what percentage of its city's sales each restaurant represents.**  
Add a column showing that calculation.

#### Solution 1: `groupby()` + `apply()` + `merge()`
Split the data with `groupby()`, aggregate each group (with `apply()` or a built-in such as `sum()`), then merge the totals back into the df with `merge()`.

In [8]:
# %timeit df.groupby('city')['sales'].sum().rename('city_total_sales').reset_index()

504 µs ± 8.16 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


The cell above uses pandas' built-in `sum()`. In this run it is about twice as fast as the `apply(sum)` version below.

In [9]:
# %timeit df.groupby('city')['sales'].apply(sum).rename('city_total_sales').reset_index()

972 µs ± 53.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [10]:
# df we need to merge:
city_sales = df.groupby('city')['sales'].sum().rename('city_total_sales').reset_index()
city_sales

Unnamed: 0,city,city_total_sales
0,Durham,36
1,London,558
2,Oxford,33


In [11]:
df_new = pd.merge(df, city_sales, how='left')
df_new

Unnamed: 0,restaurant_id,address,city,sales,city_total_sales
0,101,A,London,10,558
1,102,B,London,500,558
2,103,C,London,48,558
3,104,D,Oxford,12,33
4,105,E,Oxford,21,33
5,106,F,Durham,22,36
6,107,G,Durham,14,36


In [12]:
df_new['pct'] = df_new['sales'] / df_new['city_total_sales']
df_new['pct'] = df_new['pct'].apply(lambda x: format(x, '.2%'))
df_new

Unnamed: 0,restaurant_id,address,city,sales,city_total_sales,pct
0,101,A,London,10,558,1.79%
1,102,B,London,500,558,89.61%
2,103,C,London,48,558,8.60%
3,104,D,Oxford,12,33,36.36%
4,105,E,Oxford,21,33,63.64%
5,106,F,Durham,22,36,61.11%
6,107,G,Durham,14,36,38.89%


That was a lot of work. Can we do less?  
`.transform()` provides the way:

#### Solution 2: `.groupby()` + `.transform()`

In [13]:
# see this again
df

Unnamed: 0,restaurant_id,address,city,sales
0,101,A,London,10
1,102,B,London,500
2,103,C,London,48
3,104,D,Oxford,12
4,105,E,Oxford,21
5,106,F,Durham,22
6,107,G,Durham,14


The key is that `transform()` returns a result with the same number of rows as the original dataset. That saves us from building a separate aggregated DataFrame and left-merging it back.

In [14]:
df.groupby('city')['sales'].transform('sum')

0    558
1    558
2    558
3     33
4     33
5     36
6     36
Name: sales, dtype: int64

In [15]:
# compare to what we did above:
df.groupby('city')['sales'].sum()

city
Durham     36
London    558
Oxford     33
Name: sales, dtype: int64
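One way to think about the difference between the two outputs: `transform('sum')` behaves like computing the per-city totals and then looking up each row's city in them. A minimal sketch of that idea, using `Series.map` as the lookup (`df_demo`, `totals`, `mapped`, and `broadcast` are illustrative names):

```python
import pandas as pd

# Illustrative copy of the relevant columns from the restaurant data.
df_demo = pd.DataFrame({
    'city': ['London', 'London', 'London', 'Oxford', 'Oxford', 'Durham', 'Durham'],
    'sales': [10, 500, 48, 12, 21, 22, 14],
})

# One total per city (3 rows).
totals = df_demo.groupby('city')['sales'].sum()

# Look up each row's city in the totals (7 rows).
mapped = df_demo['city'].map(totals)

# transform does the same broadcast in one step (7 rows).
broadcast = df_demo.groupby('city')['sales'].transform('sum')

assert (mapped.to_numpy() == broadcast.to_numpy()).all()
```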

In [16]:
# Good to go as our new column:
df['city_total_sales'] = df.groupby('city')['sales'].transform('sum')
df

Unnamed: 0,restaurant_id,address,city,sales,city_total_sales
0,101,A,London,10,558
1,102,B,London,500,558
2,103,C,London,48,558
3,104,D,Oxford,12,33
4,105,E,Oxford,21,33
5,106,F,Durham,22,36
6,107,G,Durham,14,36


Now we just do the same simple division as above to get each restaurant's share of its city's sales:

In [17]:
df['pct'] = df['sales'] / df['city_total_sales']
df['pct'] = df['pct'].apply(lambda x: format(x, '.2%'))
df

Unnamed: 0,restaurant_id,address,city,sales,city_total_sales,pct
0,101,A,London,10,558,1.79%
1,102,B,London,500,558,89.61%
2,103,C,London,48,558,8.60%
3,104,D,Oxford,12,33,36.36%
4,105,E,Oxford,21,33,63.64%
5,106,F,Durham,22,36,61.11%
6,107,G,Durham,14,36,38.89%
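To convince myself the two solutions really are interchangeable, here is a small side-by-side check. It rebuilds the data under the illustrative name `df_demo` (leaving out `address`, which plays no role) and compares the merged frame with the transformed one.

```python
import pandas as pd

# Illustrative copy of the restaurant data.
df_demo = pd.DataFrame({
    'restaurant_id': [101, 102, 103, 104, 105, 106, 107],
    'city': ['London', 'London', 'London', 'Oxford', 'Oxford', 'Durham', 'Durham'],
    'sales': [10, 500, 48, 12, 21, 22, 14],
})

# Solution 1: aggregate into a separate frame, then left-merge it back.
city_sales = (
    df_demo.groupby('city')['sales'].sum()
    .rename('city_total_sales')
    .reset_index()
)
merged = pd.merge(df_demo, city_sales, how='left')

# Solution 2: let transform broadcast the totals onto the original rows.
transformed = df_demo.assign(
    city_total_sales=df_demo.groupby('city')['sales'].transform('sum')
)

# Raises AssertionError if the two frames differ.
pd.testing.assert_frame_equal(merged, transformed)
```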


---

## 3. Filtering Data
Use `transform()` to keep only the records whose city's total sales exceed 40.
Because the transformed Series lines up with the original rows, comparing it to 40 gives a boolean mask:

In [19]:
df[df.groupby('city')['sales'].transform('sum') > 40]

Unnamed: 0,restaurant_id,address,city,sales,city_total_sales,pct
0,101,A,London,10,558,1.79%
1,102,B,London,500,558,89.61%
2,103,C,London,48,558,8.60%


---

## 4. Handling missing values at the group level
Here we impute each missing value with the mean of the available values in its `name` group, using `fillna()`.

In [24]:
df = pd.DataFrame({
    'name': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'value': [1, np.nan, np.nan, 2, 8, 2, np.nan, 3]
})
df

Unnamed: 0,name,value
0,A,1.0
1,A,
2,B,
3,B,2.0
4,B,8.0
5,C,2.0
6,C,
7,C,3.0


In [25]:
# avg values of each group
df.groupby('name')['value'].mean()

name
A    1.0
B    5.0
C    2.5
Name: value, dtype: float64

In [29]:
# select the 'value' column so transform returns a Series, not a DataFrame
df['value'] = df.groupby('name')['value'].transform(lambda x: x.fillna(x.mean()))
df

Unnamed: 0,name,value
0,A,1.0
1,A,1.0
2,B,5.0
3,B,2.0
4,B,8.0
5,C,2.0
6,C,2.5
7,C,3.0


Nice, we took care of them nans. 
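The same imputation can also be written without a lambda: broadcast the group means with `transform('mean')`, then pass that Series to `fillna()`. This is a sketch of an alternative, not the article's version; `df_demo`, `group_means`, and `filled` are illustrative names.

```python
import numpy as np
import pandas as pd

# Illustrative copy of the data with missing values.
df_demo = pd.DataFrame({
    'name': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'value': [1, np.nan, np.nan, 2, 8, 2, np.nan, 3],
})

# Mean of the available values in each group, repeated on every row of that group.
group_means = df_demo.groupby('name')['value'].transform('mean')

# fillna with a Series fills each NaN from the matching index label.
filled = df_demo['value'].fillna(group_means)
```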

---

## Next: 
https://towardsdatascience.com/difference-between-apply-and-transform-in-pandas-242e5cf32705