### About
This notebook contains the code discussed in the post [*Pandas: apply, map or transform?*](https://towardsdatascience.com/pandas-apply-map-or-transform-dd931659e9cf), a guide to these Pandas functions and the scenarios they are meant to be used in.

### Imports

In [None]:
import pandas as pd
import numpy as np
import random

In [None]:
print("Pandas: v{}".format(pd.__version__))
print("Numpy: v{}".format(np.__version__))

Pandas: v1.3.5
Numpy: v1.21.6


## Creating our dataframe

In [None]:
df_english = pd.DataFrame(
    {
        "student": ["John", "James", "Jennifer"],
        "gender": ["male", "male", "female"],
        "score": [20, 30, 30],
        "subject": "english"
    }
)

df_math = pd.DataFrame(
    {
        "student": ["John", "James", "Jennifer"],
        "gender": ["male", "male", "female"],
        "score": [90, 100, 95],
        "subject": "math"
    }
)


In [None]:
df = pd.concat(
    [df_english, df_math],
    ignore_index=True
)

In [None]:
df

,student,gender,score,subject
0,John,male,20,english
1,James,male,30,english
2,Jennifer,female,30,english
3,John,male,90,math
4,James,male,100,math
5,Jennifer,female,95,math


## Comparing map, transform, apply and agg

### `Map` vs `Apply`

In [None]:
GENDER_ENCODING = {
    "male": 0,
    "female": 1
}

In [None]:
df["gender"].map(GENDER_ENCODING)

0    0
1    0
2    1
3    0
4    0
5    1
Name: gender, dtype: int64

In [None]:
df["gender"].apply(lambda x:
    GENDER_ENCODING.get(x, np.nan)
)

0    0
1    0
2    1
3    0
4    0
5    1
Name: gender, dtype: int64
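The two cells above agree because `Series.map` with a dict turns values that are missing from the dict into `NaN`, which is what `GENDER_ENCODING.get(x, np.nan)` does explicitly. A minimal sketch, assuming a hypothetical series that contains a value (`"unknown"`) with no entry in the encoding:

```python
import numpy as np
import pandas as pd

GENDER_ENCODING = {"male": 0, "female": 1}

# Hypothetical series: "unknown" is not a key of GENDER_ENCODING.
genders = pd.Series(["male", "female", "unknown"])

# Dict lookup via map: values without a matching key become NaN.
mapped = genders.map(GENDER_ENCODING)

# Element-wise lookup via apply, with an explicit NaN default.
applied = genders.apply(lambda x: GENDER_ENCODING.get(x, np.nan))

print(mapped.isna().tolist())
print(applied.isna().tolist())
```

Note that the presence of `NaN` means the result can no longer be stored as plain integers.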

#### Performance comparison
In the timings below, `map` ran about 10x as fast as `apply` (42.4 ms vs. 417 ms on one million rows).

In [None]:
random_gender_series = pd.Series([
    random.choice(["male", "female"]) for _ in range(1_000_000)
])

In [None]:
random_gender_series.value_counts()

female    500094
male      499906
dtype: int64

In [None]:
%%timeit
random_gender_series.map(GENDER_ENCODING)

42.4 ms ± 4.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [None]:
%%timeit
random_gender_series.apply(lambda x:
    GENDER_ENCODING.get(x, np.nan)
)

417 ms ± 5.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
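`%%timeit` is an IPython magic, so it only works inside a notebook or IPython session. A rough equivalent for a plain script can be sketched with the standard library's `timeit` module. The helper `time_call` and the smaller series size are assumptions made for this sketch, and absolute timings will differ by machine and Pandas version.

```python
import random
import timeit

import numpy as np
import pandas as pd

GENDER_ENCODING = {"male": 0, "female": 1}

# Smaller than the notebook's one million rows so the script finishes quickly.
series = pd.Series(random.choices(["male", "female"], k=100_000))


def time_call(func, repeat=3, number=3):
    """Return the best average time per call, in seconds."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


map_seconds = time_call(lambda: series.map(GENDER_ENCODING))
apply_seconds = time_call(
    lambda: series.apply(lambda x: GENDER_ENCODING.get(x, np.nan))
)

print(f"map:   {map_seconds * 1e3:.2f} ms")
print(f"apply: {apply_seconds * 1e3:.2f} ms")
```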


#### Setting gender

In [None]:
df["gender"] = df["gender"].map(GENDER_ENCODING)

In [None]:
df

,student,gender,score,subject
0,John,0,20,english
1,James,0,30,english
2,Jennifer,1,30,english
3,John,0,90,math
4,James,0,100,math
5,Jennifer,1,95,math


### `Applymap` vs `Apply`
`applymap` works just like `map`, but on every element of the dataframe. Since it's implemented internally using `apply`, there isn't much to discuss here.

In [None]:
df.applymap(type)

,student,gender,score,subject
0,<class 'str'>,<class 'int'>,<class 'int'>,<class 'str'>
1,<class 'str'>,<class 'int'>,<class 'int'>,<class 'str'>
2,<class 'str'>,<class 'int'>,<class 'int'>,<class 'str'>
3,<class 'str'>,<class 'int'>,<class 'int'>,<class 'str'>
4,<class 'str'>,<class 'int'>,<class 'int'>,<class 'str'>
5,<class 'str'>,<class 'int'>,<class 'int'>,<class 'str'>


Unlike `map`, `applymap` doesn't accept non-callables such as a dict.

In [None]:
try:
    # A dict is not callable, so applymap rejects it.
    df.applymap(dict())
except TypeError as e:
    print("Only callables are valid! Error:", e)

Only callables are valid! Error: the first argument must be callable
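If a dict-based lookup is needed across a whole dataframe, one possible workaround is to apply `Series.map` column by column. The sketch below assumes a hypothetical frame (`answers`) in which every column shares the same encoding. It is an illustration, not something covered in the post.

```python
import pandas as pd

# Hypothetical frame in which every column uses the same yes/no encoding.
answers = pd.DataFrame({"q1": ["yes", "no"], "q2": ["no", "no"]})
ENCODING = {"yes": 1, "no": 0}

# DataFrame.apply passes each column as a Series, and Series.map accepts a dict.
encoded = answers.apply(lambda column: column.map(ENCODING))

print(encoded)
```

Passing the dict's `get` method, which is a callable, is another option.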


### `Transform` vs `Apply`

In [None]:
df

,student,gender,score,subject
0,John,0,20,english
1,James,0,30,english
2,Jennifer,1,30,english
3,John,0,90,math
4,James,0,100,math
5,Jennifer,1,95,math


In [None]:
df.groupby("subject")["score"] \
    .transform(
        lambda x: (x - x.mean()) / x.std()
    )

0   -1.154701
1    0.577350
2    0.577350
3   -1.000000
4    1.000000
5    0.000000
Name: score, dtype: float64

In [None]:
df.groupby("subject")["score"] \
    .apply(
        lambda x: (x - x.mean()) / x.std()
    )

0   -1.154701
1    0.577350
2    0.577350
3   -1.000000
4    1.000000
5    0.000000
Name: score, dtype: float64

In [None]:
df.groupby("subject")["score"] \
    .transform(
        sum
    )

0     80
1     80
2     80
3    285
4    285
5    285
Name: score, dtype: int64

In [None]:
df.groupby("subject")["score"] \
    .apply(
        sum
    )

subject
english     80
math       285
Name: score, dtype: int64

In [None]:
try:
    df["score"].transform("mean")
except ValueError as e:
    print("Aggregation doesn't work with transform. Error:", e)

Aggregation doesn't work with transform. Error: Function did not transform


In [None]:
df["score"].apply("mean")

60.833333333333336
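The cells above come down to output shape: `transform` must return a result aligned with its input, while `apply` and `agg` may reduce it. On a groupby, `transform` therefore broadcasts each group's aggregate back to every row of that group. A small sketch that rebuilds the scores from this notebook to compare the two shapes:

```python
import pandas as pd

df = pd.DataFrame({
    "subject": ["english"] * 3 + ["math"] * 3,
    "score": [20, 30, 30, 90, 100, 95],
})

grouped = df.groupby("subject")["score"]

# agg reduces each group to a single value: one row per subject.
per_group = grouped.agg("mean")

# transform broadcasts the group value back: one row per original row.
per_row = grouped.transform("mean")

print(len(per_group), len(per_row))  # prints: 2 6
```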

#### Performance comparison

In [None]:
random_score_df = pd.DataFrame({
    "subject": random.choices(["english", "math", "science", "history"], k=1_000_000),
    "score": random.choices(list(np.arange(1, 100)), k=1_000_000)
})

In [None]:
random_score_df

,subject,score
0,math,97
1,english,85
2,science,52
3,science,10
4,science,8
...,...,...
999995,history,82
999996,english,46
999997,history,90
999998,history,69


In [None]:
%%timeit
random_score_df.groupby("subject")["score"] \
    .transform(
        lambda x: (x - x.mean()) / x.std()
    )

206 ms ± 5.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
%%timeit
random_score_df.groupby("subject")["score"] \
    .apply(
        lambda x: (x - x.mean()) / x.std()
    )

371 ms ± 124 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


#### Setting standardized scores

In [None]:
df["subject_wise_standardized"] = df.groupby("subject")["score"] \
    .transform(
        lambda x: (x - x.mean()) / x.std()
    )
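As a quick sanity check on the new column, each subject's standardized scores should have a mean of about 0 and a sample standard deviation of about 1. The following sketch rebuilds the relevant columns so it can run on its own. The name `check` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "subject": ["english"] * 3 + ["math"] * 3,
    "score": [20, 30, 30, 90, 100, 95],
})

df["subject_wise_standardized"] = df.groupby("subject")["score"].transform(
    lambda x: (x - x.mean()) / x.std()
)

# Each subject should now have mean ~0 and (sample) standard deviation ~1.
check = df.groupby("subject")["subject_wise_standardized"].agg(["mean", "std"])
print(check.round(6))
```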

### `Agg` vs `Apply`

#### Use cases

##### Simple aggregation

In [None]:
df.groupby("subject")["score"].agg("mean").round(2)

subject
english    26.67
math       95.00
Name: score, dtype: float64

In [None]:
df.groupby("subject")["score"].apply(lambda x: x.mean()).round(2)

subject
english    26.67
math       95.00
Name: score, dtype: float64

##### Named aggregation

In [None]:
df.groupby("subject")["score"].agg(mean_score="mean").round(2)

subject,mean_score
english,26.67
math,95.0


In [None]:
df.groupby("subject")["score"].apply(lambda x: x.mean()).to_frame().round(2)

subject,score
english,26.67
math,95.0


##### Multiple aggregations

In [None]:
df.groupby("subject")["score"].agg(
    ["min", "mean", "max"]
).round(2)

subject,min,mean,max
english,20,26.67,30
math,90,95.0,100


In [None]:
df.groupby("subject")["score"].apply(
    lambda x: pd.DataFrame(
        {"min": [x.min()], "mean": [x.mean()], "max": [x.max()]}
    )
).round(2)

subject,,min,mean,max
english,0,20,26.67,30
math,0,90,95.0,100


In [None]:
df.groupby("subject")["score"].apply(
    lambda x: pd.Series(
        {"min": x.min(), "mean": x.mean(), "max": x.max()}
    )
).round(2).unstack()

subject,min,mean,max
english,20.0,26.67,30.0
math,90.0,95.0,100.0


#### Performance

In [None]:
random_score_df = pd.DataFrame({
    "subject": random.choices(["english", "math", "science", "history"], k=1_000_000),
    "score": random.choices(list(np.arange(1, 100)), k=1_000_000)
})

In [None]:
random_score_df

,subject,score
0,history,99
1,science,45
2,english,24
3,history,55
4,english,48
...,...,...
999995,math,98
999996,science,40
999997,science,47
999998,english,72


##### Single aggregation

In [None]:
%%timeit
random_score_df.groupby("subject")["score"].agg("mean")

74.2 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [None]:
%%timeit
random_score_df.groupby("subject")["score"].apply(lambda x: x.mean())

99.1 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


##### Multiple aggregations

In [None]:
%%timeit
random_score_df.groupby("subject")["score"].agg(
    ["min", "mean", "max"]
)

90.5 ms ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
%%timeit
random_score_df.groupby("subject")["score"].apply(
    lambda x: pd.Series(
        {"min": x.min(), "mean": x.mean(), "max": x.max()}
    )
).unstack()

104 ms ± 5.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


### Unexpected behaviour with `apply`

In [None]:
df

,student,gender,score,subject,subject_wise_standardized
0,John,0,20,english,-1.154701
1,James,0,30,english,0.57735
2,Jennifer,1,30,english,0.57735
3,John,0,90,math,-1.0
4,James,0,100,math,1.0
5,Jennifer,1,95,math,0.0


#### Reproducing the single group issue

In [None]:
df_single_group = df.copy()

In [None]:
df_single_group["city"] = "Boston"

In [None]:
df_single_group

,student,gender,score,subject,subject_wise_standardized,city
0,John,0,20,english,-1.154701,Boston
1,James,0,30,english,0.57735,Boston
2,Jennifer,1,30,english,0.57735,Boston
3,John,0,90,math,-1.0,Boston
4,James,0,100,math,1.0,Boston
5,Jennifer,1,95,math,0.0,Boston


Selecting each group's scores works as expected when there are multiple groups, as is the case with the `subject` column: the result is a Series indexed by group and original row label.

In [None]:
df_single_group.groupby("subject").apply(lambda x: x["score"])

subject   
english  0     20
         1     30
         2     30
math     3     90
         4    100
         5     95
Name: score, dtype: int64

But it returns an unstacked result, with one row per group and one column per original row, when there's only one group (the city "Boston" in this case).

In [None]:
df_single_group.groupby("city").apply(lambda x: x["score"])

score,0,1,2,3,4,5
city,,,,,,
Boston,20,30,30,90,100,95


With a stack operation, we can convert the single group result to what we expect.

In [None]:
df_single_group.groupby("city").apply(lambda x: x["score"]).stack()

city    score
Boston  0         20
        1         30
        2         30
        3         90
        4        100
        5         95
dtype: int64
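Another way to get the same shape regardless of how many groups there are is to skip `apply` and concatenate the per-group pieces directly. This is only a sketch of one possible workaround, not something taken from the post, and the helper name `scores_by` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Boston"] * 6,
    "subject": ["english"] * 3 + ["math"] * 3,
    "score": [20, 30, 30, 90, 100, 95],
})


def scores_by(frame, key):
    """Return scores as a Series indexed by (group, original row label)."""
    pieces = {name: group["score"] for name, group in frame.groupby(key)}
    # Concatenating a dict uses its keys as the outer index level.
    return pd.concat(pieces, names=[key])


multi = scores_by(df, "subject")  # two groups
single = scores_by(df, "city")    # one group

print(single)
```

Because `pd.concat` always stacks the pieces, the single-group and multi-group results are both Series with a two-level index.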