# Daily-Dose-of-Data-Science

[Daily Dose of Data Science](https://avichawla.substack.com) is a publication on Substack that brings together intriguing frameworks, libraries, technologies, and tips that make the life cycle of a Data Science project effortless. 

Author: Avi Chawla

[Medium](https://medium.com/@avi_chawla) | [LinkedIn](https://www.linkedin.com/in/avi-chawla/)

# The Best Way to Use Apply() in Pandas

Post Link: [Substack](https://avichawla.substack.com/p/the-best-way-to-use-apply-in-pandas)

LinkedIn Post: [LinkedIn](https://www.linkedin.com/posts/avi-chawla_python-datascience-pandas-activity-7007298172301520896-5xHq?utm_source=share&utm_medium=member_desktop)

## Resources

Swifter: [https://github.com/jmcarpenter2/swifter](https://github.com/jmcarpenter2/swifter)

Pandarallel: [https://github.com/nalepae/pandarallel](https://github.com/nalepae/pandarallel)

Parallel Pandas: [https://pypi.org/project/parallel-pandas/](https://pypi.org/project/parallel-pandas/)

Mapply: [https://pypi.org/project/mapply/](https://pypi.org/project/mapply/)

In [None]:
![ ! -f "pip_installed" ] && pip install -q --upgrade pandarallel mapply parallel-pandas swifter && touch pip_installed

In [3]:
import mapply
import pandas as pd
import swifter
import numpy as np
from time import perf_counter
from pandarallel import pandarallel
from parallel_pandas import ParallelPandas

In [4]:
# 10 million rows of random integers in [1, 10**6), four columns A-D
df = pd.DataFrame(np.random.randint(1, 10**6, size=(10**7, 4)), columns=list("ABCD"))
df.head()

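|   | A | B | C | D |
|---|--------|--------|--------|--------|
| 0 | 694195 | 867235 | 625585 | 499243 |
| 1 | 829266 | 925135 | 347425 | 698796 |
| 2 | 296075 | 722698 | 465227 | 220427 |
| 3 | 871312 | 193671 | 464802 | 270806 |
| 4 | 752223 | 333601 | 904237 | 650913 |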


In [5]:
def sum_row(row):
    return sum(row)

## Pandas Apply

In [6]:
start = perf_counter()
a = df.apply(sum_row, axis = 1)
print(perf_counter()-start)

41.18502562499998
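
Every benchmark in this notebook repeats the same `perf_counter()` start/print pair. As a minimal sketch, that pattern could be wrapped in a small context manager; `timed` and `timings` are illustrative names that do not appear in the original notebook, and the timed workload here is a stand-in so the block runs without pandas.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timed(label, results=None):
    """Time the enclosed block with perf_counter and print the elapsed seconds.

    `timed` is a hypothetical helper, not part of the original notebook.
    """
    start = perf_counter()
    try:
        yield
    finally:
        elapsed = perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep the number for later comparison
        print(f"{label}: {elapsed:.3f} s")


timings = {}
with timed("builtin sum", timings):
    total = sum(range(1_000_000))  # stand-in workload
```

In the notebook, the same helper would be used as `with timed("pandas apply", timings): a = df.apply(sum_row, axis=1)`, and likewise for the other libraries.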


## Pandarallel

In [7]:
pandarallel.initialize(progress_bar=True)

INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
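
Roughly speaking, the parallel libraries compared in this notebook split the DataFrame into chunks, run an ordinary `apply` on each chunk in a separate worker, and concatenate the partial results. The sketch below illustrates that general idea with the standard library only. It is not how any of these libraries is actually implemented, and `chunked_apply`, `apply_chunk`, and `n_chunks` are hypothetical names introduced for illustration.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import pandas as pd


def sum_row(row):
    return sum(row)


def apply_chunk(chunk):
    # Runs inside a worker: a plain row-wise apply on one slice of the frame.
    return chunk.apply(sum_row, axis=1)


def chunked_apply(df, executor, n_chunks=4):
    """Split df by rows, apply sum_row to each slice via executor, and rejoin.

    Hypothetical helper: `executor` is any concurrent.futures executor.
    """
    bounds = np.linspace(0, len(df), n_chunks + 1, dtype=int)
    chunks = [
        df.iloc[lo:hi]
        for lo, hi in zip(bounds[:-1], bounds[1:])
        if hi > lo  # skip empty slices when n_chunks exceeds the row count
    ]
    # executor.map preserves input order, so concat restores the original row order.
    parts = list(executor.map(apply_chunk, chunks))
    return pd.concat(parts)


if __name__ == "__main__":
    # Small frame so the demo finishes quickly; the notebook uses 10**7 rows.
    small = pd.DataFrame(
        np.random.randint(1, 10**6, size=(10_000, 4)), columns=list("ABCD")
    )
    with ProcessPoolExecutor(max_workers=4) as pool:
        result = chunked_apply(small, pool)
    # The parallel result should match the serial apply exactly.
    assert result.equals(small.apply(sum_row, axis=1))
```

Each chunk has to be serialized and sent to its worker, and the result sent back, which is likely one reason the measured speedups stay below the number of workers.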


In [8]:
start = perf_counter()
a = df.parallel_apply(sum_row, axis = 1)
print(perf_counter()-start)

10.799626000000018


## Swifter Apply

In [9]:
start = perf_counter()
a = df.swifter.apply(sum_row, axis = 1)
print(perf_counter()-start)

Dask Apply:   0%|          | 0/16 [00:00<?, ?it/s]

21.07202774999999


## Parallel-Pandas Apply

In [10]:
ParallelPandas.initialize(n_cpu=16, split_factor=4, disable_pr_bar=True)

In [11]:
start = perf_counter()
a = df.p_apply(sum_row, axis = 1)
print(perf_counter()-start)

13.201775666000003


## Mapply Apply

In [12]:
mapply.init(
    n_workers=-1,
    chunk_size=100,
    max_chunks_per_worker=8,
    progressbar=True
)

In [13]:
start = perf_counter()
a = df.mapply(sum_row, axis = 1)
print(perf_counter()-start)

  0%|                                                    | 0/64 [00:00<?, ?it/s]

9.712804667
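
To compare the runs at a glance, the timings printed above can be converted into speedups relative to plain `apply`. The numbers are copied from the outputs in this notebook. Each comes from a single run on one machine, so the ranking should be read as indicative rather than definitive.

```python
# Wall-clock seconds copied from the notebook outputs above (one run each).
timings = {
    "Pandas apply": 41.18502562499998,
    "Pandarallel": 10.799626000000018,
    "Swifter": 21.07202774999999,
    "Parallel-Pandas": 13.201775666000003,
    "Mapply": 9.712804667,
}

baseline = timings["Pandas apply"]
# Speedup = time of plain apply divided by time of the alternative.
speedups = {name: baseline / seconds for name, seconds in timings.items()}

# Print fastest first.
for name, factor in sorted(speedups.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<16} {timings[name]:7.2f} s  {factor:4.2f}x")
```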
