# When to use Pandas transform()

This notebook accompanies the Medium article [When to use Pandas transform() function](https://medium.com/@bindiatwork/when-to-use-pandas-transform-function-df8861aa0dcf).

Please see the article for the full walkthrough.

**License**: [BSD 2-Clause](https://opensource.org/licenses/BSD-2-Clause)

In [1]:
import pandas as pd
import numpy as np

## 1. Transform values

### A function

In [2]:
df = pd.DataFrame({'A': [1,2,3], 'B': [10,20,30] })
df

Unnamed: 0,A,B
0,1,10
1,2,20
2,3,30


In [3]:
def plus_10(x):
    return x+10

df.transform(plus_10)

Unnamed: 0,A,B
0,11,20
1,12,30
2,13,40


In [4]:
df.transform(lambda x: x+10)

Unnamed: 0,A,B
0,11,20
1,12,30
2,13,40
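
The examples above all return a frame with the same index as the input. A minimal sketch of that property, contrasted with `agg()`, is shown here; the variable names `transformed` and `aggregated` are illustrative only and are not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# transform() produces one output row per input row
transformed = df.transform(lambda x: x + 10)
print(transformed.shape == df.shape)
print(transformed.index.equals(df.index))

# agg() with a reducing function collapses each column to a single value
aggregated = df.agg('sum')
print(aggregated)
```

Because the index is preserved, the result of `transform()` can be assigned straight back to the original frame as a new column.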


### A string function

In [5]:
df.transform('sqrt')

Unnamed: 0,A,B
0,1.0,3.162278
1,1.414214,4.472136
2,1.732051,5.477226


### A list of functions

In [6]:
df.transform([np.sqrt, np.exp])

Unnamed: 0_level_0,A,A,B,B
Unnamed: 0_level_1,sqrt,exp,sqrt,exp
0,1.0,2.718282,3.162278,22026.47
1,1.414214,7.389056,4.472136,485165200.0
2,1.732051,20.085537,5.477226,10686470000000.0


### A dict of axis labels -> functions

In [7]:
df.transform({
    'A': np.sqrt,
    'B': np.exp,
})

Unnamed: 0,A,B
0,1.0,22026.47
1,1.414214,485165200.0
2,1.732051,10686470000000.0


## 2. Combining groupby results

In [9]:
# a dataset about a restaurant chain
df = pd.DataFrame({
  'restaurant_id': [101,102,103,104,105,106,107],
  'address': ['A','B','C','D', 'E', 'F', 'G'],
  'city': ['London','London','London','Oxford','Oxford', 'Durham', 'Durham'],
  'sales': [10,500,48,12,21,22,14]
})
df

Unnamed: 0,restaurant_id,address,city,sales
0,101,A,London,10
1,102,B,London,500
2,103,C,London,48
3,104,D,Oxford,12
4,105,E,Oxford,21
5,106,F,Durham,22
6,107,G,Durham,14


### Solution 1: `groupby()`, `apply()`, and `merge()`

In [10]:
# Step 1: group by city and sum the sales with apply()
city_sales = df.groupby('city')['sales'].apply(sum).rename('city_total_sales').reset_index()
city_sales

Unnamed: 0,city,city_total_sales
0,Durham,36
1,London,558
2,Oxford,33


In [11]:
# Step 1 (alternative): the built-in pandas sum() gives the same result
city_sales = df.groupby('city')['sales'].sum().rename('city_total_sales').reset_index()
city_sales

Unnamed: 0,city,city_total_sales
0,Durham,36
1,London,558
2,Oxford,33


In [54]:
# Step 2: merge the city totals back onto the original rows
df_new = pd.merge(df, city_sales, how='left', on='city')
df_new

Unnamed: 0,restaurant_id,address,city,sales,city_total_sales
0,101,A,London,10,558
1,102,B,London,500,558
2,103,C,London,48,558
3,104,D,Oxford,12,33
4,105,E,Oxford,21,33
5,106,F,Durham,22,36
6,107,G,Durham,14,36


In [55]:
# Step 3: compute each restaurant's share of its city's sales, formatted as a percentage
df_new['pct'] = df_new['sales'] / df_new['city_total_sales']
df_new['pct'] = df_new['pct'].apply(lambda x: format(x, '.2%'))

df_new

Unnamed: 0,restaurant_id,address,city,sales,city_total_sales,pct
0,101,A,London,10,558,1.79%
1,102,B,London,500,558,89.61%
2,103,C,London,48,558,8.60%
3,104,D,Oxford,12,33,36.36%
4,105,E,Oxford,21,33,63.64%
5,106,F,Durham,22,36,61.11%
6,107,G,Durham,14,36,38.89%


### Solution 2: `groupby()` and `transform()`

In [12]:
# Step 1: broadcast each city's total sales to every row of that city
df['city_total_sales'] = df.groupby('city')['sales'].transform('sum')
df

Unnamed: 0,restaurant_id,address,city,sales,city_total_sales
0,101,A,London,10,558
1,102,B,London,500,558
2,103,C,London,48,558
3,104,D,Oxford,12,33
4,105,E,Oxford,21,33
5,106,F,Durham,22,36
6,107,G,Durham,14,36


In [13]:
# Step 2: compute each restaurant's share of its city's sales, formatted as a percentage
df['pct'] = df['sales'] / df['city_total_sales']
df['pct'] = df['pct'].apply(lambda x: format(x, '.2%'))
df

Unnamed: 0,restaurant_id,address,city,sales,city_total_sales,pct
0,101,A,London,10,558,1.79%
1,102,B,London,500,558,89.61%
2,103,C,London,48,558,8.60%
3,104,D,Oxford,12,33,36.36%
4,105,E,Oxford,21,33,63.64%
5,106,F,Durham,22,36,61.11%
6,107,G,Durham,14,36,38.89%
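
Both solutions produce the same `city_total_sales` column. The following is a small self-contained check of that, using a trimmed copy of the restaurant data (the `address` column is dropped for brevity); the names `via_merge`, `via_transform`, and `same` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'restaurant_id': [101, 102, 103, 104, 105, 106, 107],
    'city': ['London', 'London', 'London', 'Oxford', 'Oxford', 'Durham', 'Durham'],
    'sales': [10, 500, 48, 12, 21, 22, 14],
})

# Solution 1: aggregate per city, then merge the totals back on 'city'
city_sales = df.groupby('city')['sales'].sum().rename('city_total_sales').reset_index()
via_merge = pd.merge(df, city_sales, how='left', on='city')['city_total_sales']

# Solution 2: transform() broadcasts the group total to each row directly
via_transform = df.groupby('city')['sales'].transform('sum')

same = via_merge.tolist() == via_transform.tolist()
print(same)  # True
```

The `transform()` version skips the intermediate `city_sales` frame and the merge step entirely.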


## 3. Filtering data

In [25]:
# keep only restaurants in cities whose total sales exceed 40
df[df.groupby('city')['sales'].transform('sum') > 40]

Unnamed: 0,restaurant_id,address,city,sales
0,101,A,London,10
1,102,B,London,500
2,103,C,London,48
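
For comparison, pandas also offers `groupby().filter()`, which keeps or drops whole groups based on a predicate. The sketch below is not from the original article; it assumes the same restaurant data and shows that both approaches select the same rows. The names `mask_result` and `filter_result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'restaurant_id': [101, 102, 103, 104, 105, 106, 107],
    'address': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
    'city': ['London', 'London', 'London', 'Oxford', 'Oxford', 'Durham', 'Durham'],
    'sales': [10, 500, 48, 12, 21, 22, 14],
})

# Boolean mask built with transform(): one True/False per row
mask_result = df[df.groupby('city')['sales'].transform('sum') > 40]

# Group-level predicate with filter(): one True/False per group
filter_result = df.groupby('city').filter(lambda g: g['sales'].sum() > 40)

print(mask_result['restaurant_id'].tolist())
print(filter_result['restaurant_id'].tolist())
```

The `transform()` mask has the advantage that it can be combined with other row-level conditions using `&` and `|`.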


## 4. Handling missing values at the group level

In [15]:
df = pd.DataFrame({
    'name': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'value': [1, np.nan, np.nan, 2,8,2,np.nan, 3]
})
df

Unnamed: 0,name,value
0,A,1.0
1,A,
2,B,
3,B,2.0
4,B,8.0
5,C,2.0
6,C,
7,C,3.0


In [16]:
df.groupby('name')['value'].mean()

name
A    1.0
B    5.0
C    2.5
Name: value, dtype: float64

In [17]:
# fill each missing value with the mean of its own group
df['value'] = df.groupby('name')['value'].transform(lambda x: x.fillna(x.mean()))
df

Unnamed: 0,name,value
0,A,1.0
1,A,1.0
2,B,5.0
3,B,2.0
4,B,8.0
5,C,2.0
6,C,2.5
7,C,3.0
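
An equivalent formulation, sketched here as an alternative rather than the article's method, computes the per-row group mean with `transform('mean')` first and passes it to `fillna()`. The names `group_mean` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'value': [1, np.nan, np.nan, 2, 8, 2, np.nan, 3],
})

# Group mean aligned to the original rows (NaN values are skipped by mean)
group_mean = df.groupby('name')['value'].transform('mean')

# fillna() with a Series fills each gap from the matching index position
filled = df['value'].fillna(group_mean)
print(filled.tolist())  # [1.0, 1.0, 5.0, 2.0, 8.0, 2.0, 2.5, 3.0]
```

Passing the string `'mean'` instead of a lambda lets pandas use its optimized group implementation, which may matter on larger frames.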
