# <center><div style="width: 370px;"> ![Panel Data](pictures/Panel_Data.jpg)

# <center> Function Application

In [1]:
import numpy as np
import pandas as pd

To apply your own or another library’s functions to pandas objects,
you should be aware of the methods below. The appropriate
method to use depends on whether your function expects to operate
on an entire `DataFrame` or `Series`, row- or column-wise, or elementwise.

- Tablewise Function Application: `pipe()`
- Row or Column-wise Function Application: `apply()`
- Aggregation API: `agg()` and `transform()`
- Applying Elementwise Functions: `applymap()`

## Tablewise function application

`DataFrames` and `Series` can be passed into functions.
However, if the function needs to be called in a chain, consider using the [`pipe()`](../reference/api/pandas.DataFrame.pipe.html#pandas.DataFrame.pipe "pandas.DataFrame.pipe") method.

First some setup:

In [2]:
def extract_city_name(df):
    """
    Chicago, IL -> Chicago for city_name column
    """
    # Keep the text before the first comma.
    df["city_name"] = df["city_and_code"].str.split(",").str.get(0)
    return df

In [3]:
def add_country_name(df, country_name=None):
    """
    Chicago -> Chicago - US for city_and_country column
    """
    col = "city_name"
    # country_name must be a string; the default of None would fail here.
    df["city_and_country"] = df[col] + ' - ' + country_name
    return df

In [4]:
df_p = pd.DataFrame({"city_and_code": ["Chicago, IL"]})

In [5]:
df_p

Unnamed: 0,city_and_code
0,"Chicago, IL"


`extract_city_name` and `add_country_name` are functions taking and returning `DataFrames`.

Now compare the following:

In [7]:
add_country_name(extract_city_name(df_p), country_name="US")

Unnamed: 0,city_and_code,city_name,city_and_country
0,"Chicago, IL",Chicago,Chicago - US


This is equivalent to:

In [8]:
df_p.pipe(extract_city_name).pipe(add_country_name, country_name="US")

Unnamed: 0,city_and_code,city_name,city_and_country
0,"Chicago, IL",Chicago,Chicago - US


pandas encourages the second style, which is known as method chaining.
`pipe` makes it easy to use your own or another library’s functions
in method chains, alongside pandas’ methods.

In the example above, the functions `extract_city_name` and `add_country_name` each expected a `DataFrame` as the first positional argument.
What if the function you wish to apply takes its data as, say, the second argument?
In this case, provide `pipe` with a tuple of `(callable, data_keyword)`.
`.pipe` will route the `DataFrame` to the argument specified in the tuple.
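
The tuple form can be sketched as follows. The helper `tag_with_country`, its parameter names, and the sample frame `cities` are hypothetical and exist only for this illustration; the point is that the `DataFrame` is routed to the keyword named in the tuple rather than to the first positional argument.

```python
import pandas as pd


def tag_with_country(country_name, frame):
    """Hypothetical helper: the DataFrame arrives as the second argument, `frame`."""
    out = frame.copy()  # avoid mutating the caller's DataFrame
    out["country"] = country_name
    return out


cities = pd.DataFrame({"city_name": ["Chicago", "Boston"]})

# The tuple tells pipe to pass the DataFrame as the keyword argument `frame`;
# the remaining keyword arguments are forwarded unchanged.
result = cities.pipe((tag_with_country, "frame"), country_name="US")
print(result)
```

Because the helper copies its input, `cities` itself is left without the new column.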

The `pipe` method is inspired by Unix pipes and, more recently, by [dplyr](https://github.com/tidyverse/dplyr) and [magrittr](https://github.com/tidyverse/magrittr), which
introduced the popular `(%>%)` (read "pipe") operator for [R](https://www.r-project.org).
The implementation of `pipe` here is quite clean and feels right at home in Python.
We encourage you to view the source code of [`pipe()`](../reference/api/pandas.DataFrame.pipe.html#pandas.DataFrame.pipe "pandas.DataFrame.pipe").

## Row or column-wise function application

Arbitrary functions can be applied along the axes of a DataFrame
using the [`apply()`](../reference/api/pandas.DataFrame.apply.html#pandas.DataFrame.apply "pandas.DataFrame.apply") method, which, like the descriptive
statistics methods, takes an optional `axis` argument:

In [13]:
df = pd.DataFrame(
    {
        "one": pd.Series(np.random.randn(3), index=list('abc')),
        "two": pd.Series(np.random.randn(4), index=list('abcd')),
        "three": pd.Series(np.random.randn(3), index=list('bcd'))
    }
)

In [14]:
df

Unnamed: 0,one,two,three
a,2.362582,0.198185,
b,0.396149,0.643059,-1.193403
c,0.82314,1.321589,-0.622044
d,,0.489348,0.538136


In [16]:
df.apply(np.mean)

one      1.193957
two      0.663045
three   -0.425771
dtype: float64

In [17]:
df.apply(np.mean, axis=1)

a    1.280384
b   -0.051398
c    0.507562
d    0.513742
dtype: float64

In [18]:
df.apply(lambda x: x.max() - x.min())

one      1.966434
two      1.123404
three    1.731538
dtype: float64

In [19]:
df.apply(np.cumsum)

Unnamed: 0,one,two,three
a,2.362582,0.198185,
b,2.758731,0.841245,-1.193403
c,3.581871,2.162834,-1.815447
d,,2.652182,-1.277312


In [21]:
df.apply(np.exp)

Unnamed: 0,one,two,three
a,10.618333,1.219188,
b,1.48609,1.902292,0.303188
c,2.27764,3.749375,0.536846
d,,1.631252,1.71281


The `apply()` method will also dispatch on a string method name.

In [22]:
df.apply("mean")

one      1.193957
two      0.663045
three   -0.425771
dtype: float64

In [23]:
df.apply("mean", axis=1)

a    1.280384
b   -0.051398
c    0.507562
d    0.513742
dtype: float64

The return type of the function passed to `apply()` affects the
type of the final output from `DataFrame.apply` for the default behaviour:

- If the applied function returns a Series, the final output is a DataFrame. The columns match the index of the Series returned by the applied function.
- If the applied function returns any other type, the final output is a Series.

This default behaviour can be overridden using the `result_type` argument, which
accepts three options: `reduce`, `broadcast`, and `expand`.
These determine how list-like return values expand (or not) to a `DataFrame`.
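
As a minimal sketch of how `result_type` changes the output shape, the snippet below applies list-returning functions row-wise. The frame `frame` and the lambdas are made up for this illustration.

```python
import pandas as pd

frame = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

# Default: a list returned per row is stored as-is, giving a Series of lists.
as_series = frame.apply(lambda row: [row["x"], row["x"] + row["y"]], axis=1)

# result_type="expand": each list element becomes its own column (labelled 0, 1).
expanded = frame.apply(
    lambda row: [row["x"], row["x"] + row["y"]], axis=1, result_type="expand"
)

# result_type="broadcast": the list is broadcast back onto the original shape,
# so the original index and column labels are kept.
broadcast = frame.apply(lambda row: [0, 1], axis=1, result_type="broadcast")

print(as_series)
print(expanded)
print(broadcast)
```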

`apply()` combined with some cleverness can be used to answer many questions
about a data set. For example, suppose we wanted to extract the date where the
maximum value for each column occurred:

In [30]:
tsdf = pd.DataFrame(
    np.random.randn(1000, 3),
    columns = list("ABC"),
    index = pd.date_range('1/1/2000', periods=1000)
)

In [32]:
tsdf

Unnamed: 0,A,B,C
2000-01-01,0.473652,0.712442,2.060165
2000-01-02,1.560232,0.830718,-2.194184
2000-01-03,-0.005763,0.658921,-0.597842
2000-01-04,-0.009619,0.118682,1.448215
2000-01-05,-0.202776,0.844060,-1.174555
...,...,...,...
2002-09-22,-0.443882,-0.183434,0.578656
2002-09-23,-0.030276,0.869985,-0.522726
2002-09-24,-1.108633,-1.045384,-2.313175
2002-09-25,-0.180611,-1.047725,0.082766


In [34]:
tsdf.apply(lambda x: x.idxmax())

A   2001-04-06
B   2002-03-16
C   2001-08-23
dtype: datetime64[ns]

You may also pass additional arguments and keyword arguments to the `apply()`
method. For instance, consider the following function you would like to apply:

In [35]:
def sub_test(x, sub, divide=1):
    return (x - sub) / divide

You may then apply this function as follows:

In [37]:
df.apply(sub_test, args=(5, 4))

Unnamed: 0,one,two,three
a,-0.659354,-1.200454,
b,-1.150963,-1.089235,-1.548351
c,-1.044215,-0.919603,-1.405511
d,,-1.127663,-1.115466


In [40]:
df.apply(sub_test, args=(5,), divide=3)

Unnamed: 0,one,two,three
a,-0.879139,-1.600605,
b,-1.534617,-1.452314,-2.064468
c,-1.392287,-1.226137,-1.874015
d,,-1.503551,-1.487288


In [41]:
df.apply(sub_test, sub=5, divide=3)

Unnamed: 0,one,two,three
a,-0.879139,-1.600605,
b,-1.534617,-1.452314,-2.064468
c,-1.392287,-1.226137,-1.874015
d,,-1.503551,-1.487288


Another useful feature is the ability to pass Series methods to carry out some
Series operation on each column or row:

In [45]:
!pip install scipy


In [47]:
s = pd.Series([0, 2, np.nan, 8])
s.interpolate(method='polynomial', order=2)

0    0.000000
1    2.000000
2    4.666667
3    8.000000
dtype: float64

Let's apply interpolation on tsdf:

In [48]:
# First let's add some null values
tsdf.iloc[3] = np.nan

In [50]:
tsdf.head()

Unnamed: 0,A,B,C
2000-01-01,0.473652,0.712442,2.060165
2000-01-02,1.560232,0.830718,-2.194184
2000-01-03,-0.005763,0.658921,-0.597842
2000-01-04,,,
2000-01-05,-0.202776,0.84406,-1.174555


In [54]:
tsdf.apply(pd.Series.interpolate, method='linear').head()

Unnamed: 0,A,B,C
2000-01-01,0.473652,0.712442,2.060165
2000-01-02,1.560232,0.830718,-2.194184
2000-01-03,-0.005763,0.658921,-0.597842
2000-01-04,-0.104269,0.75149,-0.886198
2000-01-05,-0.202776,0.84406,-1.174555


Finally, `apply()` takes an argument `raw`, which is `False` by default. With the
default, each row or column is converted into a `Series` before the function is
applied. When `raw=True`, the passed function instead receives an ndarray object,
which can improve performance if you do not need the indexing functionality.
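
The effect of `raw` can be sketched with a small, made-up frame. The helper `value_range` is hypothetical; it uses only operations that work on a NumPy array, because label-based access is not available when `raw=True`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=["a", "b"])


def value_range(values):
    # With raw=True, `values` is a NumPy ndarray, so only array operations
    # are used here; something like values["a"] would not work.
    return values.max() - values.min()


by_column = frame.apply(value_range, raw=True)
by_row = frame.apply(value_range, axis=1, raw=True)

# Check which type the function receives in each mode.
saw_ndarray = frame.apply(lambda v: isinstance(v, np.ndarray), raw=True)
saw_series = frame.apply(lambda v: isinstance(v, pd.Series))

print(by_column)
print(by_row)
```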

## Aggregation API

The aggregation API lets you express possibly multiple aggregation operations in a single concise way.
This API is similar across pandas objects; see the groupby API, the
window API, and the resample API.
The entry point for aggregation is `DataFrame.aggregate()`, or its alias
`DataFrame.agg()`.

We will use a frame similar to the one above:

In [55]:
tsdf = pd.DataFrame(
    np.random.randn(10, 3),
    columns=["A", "B", "C"],
    index=pd.date_range("1/1/2000", periods=10),
)

In [56]:
tsdf.iloc[3:7] = np.nan

In [58]:
tsdf

Unnamed: 0,A,B,C
2000-01-01,-0.442268,-1.441069,-1.685952
2000-01-02,0.781424,1.343911,1.302184
2000-01-03,0.640035,-0.12764,-0.386197
2000-01-04,,,
2000-01-05,,,
2000-01-06,,,
2000-01-07,,,
2000-01-08,-1.898662,0.323027,0.390243
2000-01-09,0.242443,0.598272,-0.784099
2000-01-10,-0.065355,0.953811,-0.567138


Using a single function is equivalent to `apply()`. You can also
pass named methods as strings. These will return a `Series` of the aggregated
output:

In [59]:
tsdf.agg(np.sum)

A   -0.742384
B    1.650312
C   -1.730959
dtype: float64

In [60]:
tsdf.agg("sum")

A   -0.742384
B    1.650312
C   -1.730959
dtype: float64

In [61]:
tsdf.sum()

A   -0.742384
B    1.650312
C   -1.730959
dtype: float64

A single aggregation on a `Series` returns a scalar value:

In [63]:
tsdf["A"].agg("sum")

-0.7423840766079615

### Aggregating with multiple functions

You can pass multiple aggregation arguments as a list.
The results of each of the passed functions will be a row in the resulting `DataFrame`.
These are naturally named from the aggregation function.

In [64]:
tsdf.agg(["sum"])

Unnamed: 0,A,B,C
sum,-0.742384,1.650312,-1.730959


Multiple functions yield multiple rows:

In [65]:
tsdf.agg(["sum", "mean"])

Unnamed: 0,A,B,C
sum,-0.742384,1.650312,-1.730959
mean,-0.123731,0.275052,-0.288493


On a `Series`, multiple functions return a `Series`, indexed by the function names:

In [66]:
tsdf["A"].agg(["sum", "mean"])

sum    -0.742384
mean   -0.123731
Name: A, dtype: float64

Passing a `lambda` function will yield a `<lambda>` named row:

In [67]:
tsdf["A"].agg(["sum", lambda x: x.mean()])

sum        -0.742384
<lambda>   -0.123731
Name: A, dtype: float64

Passing a named function will yield that name for the row:

In [68]:
def mymean(x):
    return x.mean()

tsdf["A"].agg(["sum", mymean])

sum      -0.742384
mymean   -0.123731
Name: A, dtype: float64

### Aggregating with a dict

Passing a dictionary that maps column names to a function name or a list of function names to `DataFrame.agg`
lets you customize which functions are applied to which columns. Cells for which a function was not
requested are filled with `NaN`. Since Python 3.7 a plain `dict` preserves insertion order, so an `OrderedDict` is not needed to keep the keys in the order given.

In [69]:
tsdf.agg({"A": ["mean", "min"], "B": "sum"})

Unnamed: 0,A,B
mean,-0.123731,
min,-1.898662,
sum,,1.650312


### Mixed dtypes

Deprecated since version 1.4.0: Attempting to determine which columns cannot be aggregated and silently dropping them from the results is deprecated and will be removed in a future version. If any portion of the columns or operations provided fail, the call to `.agg` will raise.

In versions that still apply the deprecated behaviour, `.agg` presented with mixed dtypes that cannot aggregate takes only the valid
aggregations, similar to how `.groupby.agg` works. The example below instead drops the datetime column `D`, which does not support `sum`, before aggregating, so the call does not depend on that behaviour.

In [70]:
mdf = pd.DataFrame(
    {
        "A": [1, 2, 3],
        "B": [1.0, 2.0, 3.0],
        "C": ["foo", "bar", "baz"],
        "D": pd.date_range("20130101", periods=3),
    }
)

In [76]:
mdf.dtypes

A             int64
B           float64
C            object
D    datetime64[ns]
dtype: object

In [77]:
mdf

Unnamed: 0,A,B,C,D
0,1,1.0,foo,2013-01-01
1,2,2.0,bar,2013-01-02
2,3,3.0,baz,2013-01-03


In [75]:
mdf.drop('D', axis='columns').agg(["min", "sum"])

Unnamed: 0,A,B,C
min,1,1.0,bar
sum,6,6.0,foobarbaz
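
The example above drops column `D` by name. An alternative sketch, assuming you only want the numeric columns, is to select them by dtype with `select_dtypes` before aggregating. The frame `mixed` below is a copy of the one above under a different, hypothetical name.

```python
import pandas as pd

mixed = pd.DataFrame(
    {
        "A": [1, 2, 3],
        "B": [1.0, 2.0, 3.0],
        "C": ["foo", "bar", "baz"],
        "D": pd.date_range("20130101", periods=3),
    }
)

# Keep only the integer and float columns, then aggregate.
numeric_summary = mixed.select_dtypes(include="number").agg(["min", "sum"])
print(numeric_summary)
```

Unlike dropping `D` alone, this also excludes the string column `C`.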


### Custom describe

With `.agg()` it is possible to easily create a custom describe function, similar
to the built-in `describe` function:

In [79]:
from functools import partial
q_25 = partial(pd.Series.quantile, q=0.25)
q_25.__name__ = "25%"
q_75 = partial(pd.Series.quantile, q=0.75)
q_75.__name__ = "75%"

In [80]:
tsdf.agg(["count", "mean", "std", "min", q_25, "median", q_75, "max"])

Unnamed: 0,A,B,C
count,6.0,6.0,6.0
mean,-0.123731,0.275052,-0.288493
std,0.978977,0.981548,1.027291
min,-1.898662,-1.441069,-1.685952
25%,-0.34804,-0.014974,-0.729858
median,0.088544,0.460649,-0.476667
75%,0.540637,0.864927,0.196133
max,0.781424,1.343911,1.302184


## Transform API

The `transform()` method returns an object that is indexed the same (same size)
as the original. This API allows you to provide *multiple* operations at the same
time rather than one-by-one. Its API is quite similar to the `.agg` API.

We create a frame similar to the one used in the above sections.

In [81]:
tsdf = pd.DataFrame(
    np.random.randn(10, 3),
    columns=["A", "B", "C"],
    index=pd.date_range("1/1/2000", periods=10),
)

tsdf.iloc[3:7] = np.nan
tsdf

Unnamed: 0,A,B,C
2000-01-01,-1.906021,0.553495,-0.00122
2000-01-02,-0.714361,-1.803843,-0.690069
2000-01-03,0.699666,1.148266,0.870129
2000-01-04,,,
2000-01-05,,,
2000-01-06,,,
2000-01-07,,,
2000-01-08,-0.774317,-0.433686,0.568519
2000-01-09,-1.130911,-0.491484,0.10905
2000-01-10,-0.009735,-1.341069,1.260359


Transform the entire frame. `.transform()` accepts a NumPy function, a string
function name, or a user-defined function.

In [83]:
tsdf.transform(np.abs)

Unnamed: 0,A,B,C
2000-01-01,1.906021,0.553495,0.00122
2000-01-02,0.714361,1.803843,0.690069
2000-01-03,0.699666,1.148266,0.870129
2000-01-04,,,
2000-01-05,,,
2000-01-06,,,
2000-01-07,,,
2000-01-08,0.774317,0.433686,0.568519
2000-01-09,1.130911,0.491484,0.10905
2000-01-10,0.009735,1.341069,1.260359


In [84]:
tsdf.transform("abs")

Unnamed: 0,A,B,C
2000-01-01,1.906021,0.553495,0.00122
2000-01-02,0.714361,1.803843,0.690069
2000-01-03,0.699666,1.148266,0.870129
2000-01-04,,,
2000-01-05,,,
2000-01-06,,,
2000-01-07,,,
2000-01-08,0.774317,0.433686,0.568519
2000-01-09,1.130911,0.491484,0.10905
2000-01-10,0.009735,1.341069,1.260359


In [85]:
tsdf.transform(lambda x: x.abs())

Unnamed: 0,A,B,C
2000-01-01,1.906021,0.553495,0.00122
2000-01-02,0.714361,1.803843,0.690069
2000-01-03,0.699666,1.148266,0.870129
2000-01-04,,,
2000-01-05,,,
2000-01-06,,,
2000-01-07,,,
2000-01-08,0.774317,0.433686,0.568519
2000-01-09,1.130911,0.491484,0.10905
2000-01-10,0.009735,1.341069,1.260359


Here `transform()` received a single function; this is equivalent to a ufunc application.

In [86]:
np.abs(tsdf)

Unnamed: 0,A,B,C
2000-01-01,1.906021,0.553495,0.00122
2000-01-02,0.714361,1.803843,0.690069
2000-01-03,0.699666,1.148266,0.870129
2000-01-04,,,
2000-01-05,,,
2000-01-06,,,
2000-01-07,,,
2000-01-08,0.774317,0.433686,0.568519
2000-01-09,1.130911,0.491484,0.10905
2000-01-10,0.009735,1.341069,1.260359


Passing a single function to `.transform()` with a `Series` will yield a single `Series` in return.

In [87]:
tsdf["A"].transform(np.abs)

2000-01-01    1.906021
2000-01-02    0.714361
2000-01-03    0.699666
2000-01-04         NaN
2000-01-05         NaN
2000-01-06         NaN
2000-01-07         NaN
2000-01-08    0.774317
2000-01-09    1.130911
2000-01-10    0.009735
Freq: D, Name: A, dtype: float64

### Transform with multiple functions

Passing multiple functions will yield a column MultiIndexed DataFrame.
The first level will be the original frame column names; the second level
will be the names of the transforming functions.

In [88]:
tsdf.transform([np.abs, lambda x: x + 1])

Unnamed: 0_level_0,A,A,B,B,C,C
Unnamed: 0_level_1,absolute,<lambda>,absolute,<lambda>,absolute,<lambda>
2000-01-01,1.906021,-0.906021,0.553495,1.553495,0.00122,0.99878
2000-01-02,0.714361,0.285639,1.803843,-0.803843,0.690069,0.309931
2000-01-03,0.699666,1.699666,1.148266,2.148266,0.870129,1.870129
2000-01-04,,,,,,
2000-01-05,,,,,,
2000-01-06,,,,,,
2000-01-07,,,,,,
2000-01-08,0.774317,0.225683,0.433686,0.566314,0.568519,1.568519
2000-01-09,1.130911,-0.130911,0.491484,0.508516,0.10905,1.10905
2000-01-10,0.009735,0.990265,1.341069,-0.341069,1.260359,2.260359


Passing multiple functions to a Series will yield a DataFrame. The
resulting column names will be the transforming functions.

In [89]:
tsdf["A"].transform([np.abs, lambda x: x + 1])

Unnamed: 0,absolute,<lambda>
2000-01-01,1.906021,-0.906021
2000-01-02,0.714361,0.285639
2000-01-03,0.699666,1.699666
2000-01-04,,
2000-01-05,,
2000-01-06,,
2000-01-07,,
2000-01-08,0.774317,0.225683
2000-01-09,1.130911,-0.130911
2000-01-10,0.009735,0.990265


### Transforming with a dict

Passing a dict of functions will allow selective transforming per column.

In [90]:
tsdf.transform({"A": np.abs, "B": lambda x: x + 1})

Unnamed: 0,A,B
2000-01-01,1.906021,1.553495
2000-01-02,0.714361,-0.803843
2000-01-03,0.699666,2.148266
2000-01-04,,
2000-01-05,,
2000-01-06,,
2000-01-07,,
2000-01-08,0.774317,0.566314
2000-01-09,1.130911,0.508516
2000-01-10,0.009735,-0.341069


Passing a dict of lists will generate a MultiIndexed DataFrame with these
selective transforms.

In [91]:
tsdf.transform({"A": np.abs, "B": [lambda x: x + 1, "sqrt"]})

Unnamed: 0_level_0,A,B,B
Unnamed: 0_level_1,absolute,<lambda>,sqrt
2000-01-01,1.906021,1.553495,0.743972
2000-01-02,0.714361,-0.803843,
2000-01-03,0.699666,2.148266,1.071572
2000-01-04,,,
2000-01-05,,,
2000-01-06,,,
2000-01-07,,,
2000-01-08,0.774317,0.566314,
2000-01-09,1.130911,0.508516,
2000-01-10,0.009735,-0.341069,


**Note:** There are two major differences between the `transform` and `apply` methods of a `groupby` object.

- **Input:**
    - `apply` implicitly passes all the columns for each group as a `DataFrame` to the custom function.
    - `transform` passes each column for each group individually as a `Series` to the custom function.
- **Output:**
    - The custom function passed to `apply` can return a scalar, a `Series`, or a `DataFrame` (or a NumPy array or even a list).
    - The custom function passed to `transform` must return a sequence (a one-dimensional `Series`, array, or list) the same length as the group.

So `transform` works on just one `Series` at a time, while `apply` works on the entire `DataFrame` at once.
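
The groupby case described in the note can be sketched as follows. The frame `sales` and its column names are invented for this illustration. The `transform` function is written to work on a single column, while the `apply` function combines two columns of each group.

```python
import pandas as pd

sales = pd.DataFrame(
    {
        "store": ["x", "x", "y", "y"],
        "units": [1, 3, 10, 30],
        "price": [2.0, 2.0, 5.0, 5.0],
    }
)
grouped = sales.groupby("store")[["units", "price"]]

# transform: the function handles one column of one group and returns
# a result of the same length, so the output keeps the original index.
centered = grouped.transform(lambda col: col - col.mean())

# apply: the function receives each group's selected columns as a DataFrame
# and may reduce them to a scalar, here the revenue per store.
revenue = grouped.apply(lambda g: (g["units"] * g["price"]).sum())

print(centered)
print(revenue)
```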

In [92]:
# df.transform(np.sum) --> raises ValueError: Function did not transform
df.apply(np.sum)

one      3.581871
two      2.652182
three   -1.277312
dtype: float64

In [93]:
def add_two_columns(df):
    return df['one'] + df['two']

In [94]:
# df.transform(add_two_columns, axis='columns') --> raises ValueError: Function did not transform
df.apply(add_two_columns, axis='columns')

a    2.560767
b    1.039208
c    2.144729
d         NaN
dtype: float64

In [95]:
def add_1(s):
    return s + 1

In [96]:
df.transform(add_1)

Unnamed: 0,one,two,three
a,3.362582,1.198185,
b,1.396149,1.643059,-0.193403
c,1.82314,2.321589,0.377956
d,,1.489348,1.538136


In [97]:
df.apply(add_1)

Unnamed: 0,one,two,three
a,3.362582,1.198185,
b,1.396149,1.643059,-0.193403
c,1.82314,2.321589,0.377956
d,,1.489348,1.538136


In [98]:
def mysum(s):
    return sum(s)

In [99]:
# df.transform(mysum) --> raises ValueError: Function did not transform
df.apply(mysum)

one           NaN
two      2.652182
three         NaN
dtype: float64

## Applying elementwise functions

Since not all functions can be vectorized (accept NumPy arrays and return
another array or value), the methods `applymap()` on DataFrame
and analogously `map()` on Series accept any Python function taking
a single value and returning a single value. Note that pandas 2.1 deprecates
`DataFrame.applymap()` in favor of the equivalent `DataFrame.map()`. For example:

In [100]:
df

Unnamed: 0,one,two,three
a,2.362582,0.198185,
b,0.396149,0.643059,-1.193403
c,0.82314,1.321589,-0.622044
d,,0.489348,0.538136


In [101]:
def f(x):
    return len(str(x))

In [102]:
df["one"].map(f)

a    17
b    18
c    18
d     3
Name: one, dtype: int64

In [103]:
df.applymap(f)

Unnamed: 0,one,two,three
a,17,19,3
b,18,18,18
c,18,18,19
d,3,19,18


`Series.map()` has an additional feature; it can be used to easily
“link” or “map” values defined by a secondary series:

In [104]:
s = pd.Series(
    ["six", "seven", "six", "seven", "six"], index=["a", "b", "c", "d", "e"]
)

In [105]:
t = pd.Series({"six": 6.0, "seven": 7.0})

In [106]:
s

a      six
b    seven
c      six
d    seven
e      six
dtype: object

In [107]:
s.map(t)

a    6.0
b    7.0
c    6.0
d    7.0
e    6.0
dtype: float64
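
As a small extension of the lookup above, `Series.map()` also accepts a plain `dict`, and values with no matching key come back as `NaN`. The names `words` and `numbers` are hypothetical.

```python
import pandas as pd

words = pd.Series(["six", "seven", "eight"], index=["a", "b", "c"])

# A plain dict works as the lookup table; "eight" has no entry,
# so its result is NaN.
numbers = words.map({"six": 6.0, "seven": 7.0})
print(numbers)
```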