# How to use the Pandas apply method more efficiently


This example comes from my earlier code at [snippet-optimization](https://github.com/DiegoEliasCosta/snippet-optimization)

Here are three ways of using apply:
1. Passing the entire DataFrame
2. Passing a Series and returning a Series
3. Passing a Series and returning a Python list (fastest)


In [51]:
import pandas as pd
import os

%load_ext line_profiler



In [52]:
data = '../data/pandas-posts-dataset.csv'

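# error_bad_lines=False was removed in pandas 2.0; on_bad_lines='skip' (pandas >= 1.3) is the equivalent
df = pd.read_csv(data, encoding='ISO-8859-1', on_bad_lines='skip', sep=";")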


In [53]:
from bs4 import BeautifulSoup

def extract_tagged_code(text, tag):
    soup = BeautifulSoup(text, "html.parser")
    # Get the tag
    tagged = soup.find_all(tag)
    # Format from HTML to text
    return [i.get_text() for i in tagged]

### Original (Bad) Apply Implementation

The entire DataFrame is passed to `apply` with `axis=1`, so the function receives every full row

#### Why? 
The function populates multiple columns at once


In [54]:
def extract_code(df):
    # Called with axis=1, so `df` is actually one row (a Series), not the whole frame
    codelist = extract_tagged_code(df['Body'], 'pre')
    # Code as a list (optional)
    df['CodeList'] = codelist
    # Concatenated code
    df['Code'] = [''.join(codelist)]
    return df


In [84]:
%%timeit    

# Extract Code
new_df1 = df.apply(extract_code, axis=1)

6.18 s ± 249 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
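
To make the overhead concrete, here is a minimal sketch on a made-up two-row frame. The `toy` frame and the `inspect_*` helpers are illustrative names, not part of the original code. It records what the applied function receives in each calling style: with `axis=1` pandas hands over a whole row wrapped in a Series, while applying on the `Body` column hands over just the cell value.

```python
import pandas as pd

# Toy frame standing in for the posts dataset (values are made up).
toy = pd.DataFrame({
    "Id": [1, 2],
    "Body": ["<pre>a = 1</pre>", "<p>no code</p>"],
})

received_row_types = []
received_cell_types = []

def inspect_row(row):
    # With axis=1 each call receives a whole row packed into a Series.
    received_row_types.append(type(row).__name__)
    return row

def inspect_cell(body):
    # Applying on a single column passes only that cell's value.
    received_cell_types.append(type(body).__name__)
    return body

toy.apply(inspect_row, axis=1)
toy["Body"].apply(inspect_cell)

print(received_row_types)
print(received_cell_types)
```

Building a Series for every row is work the extraction never needs, which likely contributes to the slower timing of this variant.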


### Proper Apply Implementation (Series)

Pass only the column the function needs (here `Body`) rather than the whole row

#### How to deal with multiple returns?

1. Return a Series whose index labels become the column names


In [72]:
def extract_code_seriesapply(body):
    codelist = extract_tagged_code(body, 'pre')
    # Concatenated code
    code = [''.join(codelist)]
    return pd.Series({'Code': code, 'Codelist': codelist})

In [67]:
df2 = df.copy() # Just making sure not to mess with df

In [73]:
%%timeit

# Extract Code
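# join adds the extracted columns side by side; append would stack them as extra rows (and was removed in pandas 2.0)
new_df2 = df.join(df.Body.apply(extract_code_seriesapply))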

3.3 s ± 599 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [91]:
new_df2.iloc[0].Code == new_df1.iloc[0].Code

True
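
The Series-returning pattern can be sketched on toy data as follows. This is an illustration under assumptions: `extract_pre` is a hypothetical regex stand-in for `extract_tagged_code(body, 'pre')` so the snippet runs without BeautifulSoup or the dataset, and `DataFrame.join` is used here to attach the new columns.

```python
import re
import pandas as pd

def extract_pre(body):
    # Hypothetical regex stand-in for extract_tagged_code(body, 'pre').
    return re.findall(r"<pre>(.*?)</pre>", body, flags=re.S)

def to_series(body):
    codelist = extract_pre(body)
    code = ["".join(codelist)]
    # The Series index labels become the column names of the result.
    return pd.Series({"Code": code, "Codelist": codelist})

toy = pd.DataFrame(
    {"Body": ["<pre>a = 1</pre><pre>b = 2</pre>", "<p>no code</p>"]}
)

# Series.apply with a Series-returning function yields a DataFrame.
extracted = toy["Body"].apply(to_series)

# join aligns on the index and places the new columns next to the originals.
result = toy.join(extracted)
print(result)
```

The convenience has a price: a new Series object is constructed for every row, which is consistent with this variant being slower than the list-based one.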

### Proper Apply Implementation (Lists)

#### How to deal with multiple returns?

1. Return a list, then use zip to unpack the results into named columns when assigning

In [76]:
def extract_code_listapply(body):
    codelist = extract_tagged_code(body, 'pre')
    # Concatenated code
    code = [''.join(codelist)]
    return [code, codelist]

In [94]:
%%timeit

new_df3 = df.copy()
# Extract Code
new_df3['Code'], new_df3['Codelist'] = zip(*new_df3.Body.apply(extract_code_listapply))

1.83 s ± 84.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [93]:
new_df2.iloc[0].Code == new_df3.iloc[0].Code

True
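
The `zip(*...)` step is easier to follow when broken apart. The sketch below does that on toy data. As an assumption for illustration, `extract_pre` is a hypothetical regex stand-in for `extract_tagged_code(body, 'pre')`, so nothing here depends on BeautifulSoup or the dataset.

```python
import re
import pandas as pd

def extract_pre(body):
    # Hypothetical regex stand-in for extract_tagged_code(body, 'pre').
    return re.findall(r"<pre>(.*?)</pre>", body, flags=re.S)

def to_list(body):
    codelist = extract_pre(body)
    code = ["".join(codelist)]
    # A plain Python list: no per-row Series is constructed.
    return [code, codelist]

toy = pd.DataFrame(
    {"Body": ["<pre>a = 1</pre><pre>b = 2</pre>", "<p>no code</p>"]}
)

# A Series whose cells are two-item lists: [code, codelist].
pairs = toy["Body"].apply(to_list)

# zip(*pairs) transposes the per-row pairs into one tuple per output column.
codes, codelists = zip(*pairs)

toy["Code"] = codes
toy["Codelist"] = codelists
print(toy)
```

Each output column is assigned as a whole, so the column names appear only once, at the assignment, rather than inside the applied function.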

### Vectorization

The function does more than arithmetic: it parses the HTML of each post with BeautifulSoup.
**It cannot be vectorized.**
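
For contrast, this is roughly what a vectorizable operation looks like. The `Score` column is a hypothetical numeric column used only for illustration. Arithmetic can be expressed on the whole column at once and gives the same result as the element-wise `apply`.

```python
import pandas as pd

# Hypothetical numeric column, used only to illustrate vectorization.
toy = pd.DataFrame({"Score": [1, 5, 10]})

# Vectorized: a single expression operates on the whole column.
doubled_vectorized = toy["Score"] * 2

# Element-wise: the Python function runs once per value.
doubled_apply = toy["Score"].apply(lambda score: score * 2)

same_result = doubled_vectorized.equals(doubled_apply)
print(same_result)
```

The code extraction has no such whole-column form, since each `Body` value has to be parsed separately, so `apply` (or an explicit loop) remains necessary.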