# A Beginner's Guide to Optimizing Pandas Code for Speed


[Source Material](https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6)

A useful first look at speeding up Pandas code purely at the API level. The same Haversine distance calculation is timed five ways, from slowest to fastest:

1. For loops
2. Iterrows
3. Apply
4. Vectorization (Pandas)
5. Vectorization (NumPy)



In [2]:
import pandas as pd
import os
import numpy as np

%load_ext line_profiler

In [4]:
# Define a basic Haversine distance formula
def haversine(lat1, lon1, lat2, lon2):
    MILES = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1 
    dlon = lon2 - lon1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 
    total_miles = MILES * c
    return total_miles
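
A quick sanity check of the formula, as a minimal sketch. The function is repeated so the snippet runs on its own, and the two test inputs are illustrative choices, not taken from the hotel dataset: a point compared with itself should be 0 miles apart, and a quarter of the equator should be `MILES * pi / 2`.

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    MILES = 3959  # Earth radius in miles, as used in the notebook
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    return MILES * c

# Illustrative inputs (assumptions, not rows from the dataset)
same_point = haversine(40.671, -73.985, 40.671, -73.985)
quarter_equator = haversine(0.0, 0.0, 0.0, 90.0)

print(same_point)
print(quarter_equator, 3959 * np.pi / 2)
```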

In [3]:
dataset = '../data/new_york_hotels.csv'
# The path is already a single string, so os.path.join is not needed
df = pd.read_csv(dataset, encoding='cp1252')

1631

### Haversine Loop 

In [21]:
# Define a function that manually loops over all rows and returns a list of distances
def haversine_looping(df):
    distance_list = []
    for i in range(len(df)):
        # Each df.iloc[i] builds a row Series, and it is done twice per pass
        d = haversine(40.671, -73.985, df.iloc[i]['latitude'], df.iloc[i]['longitude'])
        distance_list.append(d)
    return distance_list

In [22]:
%%timeit

# Run the haversine looping function
df['distance'] = haversine_looping(df)

621 ms ± 4.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Iterrows()

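If every row has to be visited anyway:
- The explicit index loop is no longer needed; `iterrows()` yields `(index, row)` pairs directly.
- It is a simple swap: only the loop header changes, the loop body stays the same.
- The gain comes from using the Pandas API for iteration, not from changing the calculation (621 ms down to 168 ms here, about 3.7x).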
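
To make the swap concrete, this small sketch shows what `iterrows()` hands to the loop body. The `toy` DataFrame and its values are made up for illustration; only the column names match the notebook.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the hotel data
toy = pd.DataFrame(
    {'latitude': [40.75, 40.76], 'longitude': [-73.99, -73.98]},
    index=['a', 'b'],
)

# iterrows() yields (index label, row as a Series) for each row
pairs = list(toy.iterrows())
first_index, first_row = pairs[0]

print(first_index)
print(type(first_row).__name__)
print(first_row['latitude'], first_row['longitude'])
```

Because each row arrives as a Series, values are read by column label, which is why the loop body in `haversine_iterrows` can use `row['latitude']` directly.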


In [23]:
def haversine_iterrows(df):
    distance_list = []
    for index, row in df.iterrows():
        d = haversine(40.671, -73.985, row['latitude'], row['longitude'])
        distance_list.append(d)
    return distance_list

In [24]:
%%timeit

# Run the haversine iterrows function
df['distance'] = haversine_iterrows(df)

168 ms ± 3.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Apply Method

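From `iterrows()` to `apply()`: instead of writing the row loop by hand, pass a function to `apply()` with `axis=1` and let Pandas drive the row iteration. In this notebook that roughly halves the time again (168 ms down to 83.1 ms).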
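
A minimal sketch of what `apply(..., axis=1)` does, on a made-up `toy` DataFrame and with a trivial stand-in function instead of `haversine`: the function receives one row at a time as a Series, and the results come back as a Series that shares the DataFrame's index.

```python
import pandas as pd

# Hypothetical data; the lambda is a stand-in for the real distance function
toy = pd.DataFrame({'latitude': [40.0, 41.0], 'longitude': [-74.0, -73.0]})

# axis=1 means "call the function once per row"
row_sums = toy.apply(lambda row: row['latitude'] + row['longitude'], axis=1)

print(row_sums)
```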




In [6]:
def haversine_apply(df):
    return df.apply(lambda row: haversine(40.671, -73.985, row['latitude'], row['longitude']), axis=1)

In [33]:
%%timeit

# Timing apply on the Haversine function
df['distance'] = haversine_apply(df)

83.1 ms ± 3.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [9]:
# Haversine applied on rows with line profiler
%lprun -f haversine haversine_apply(df)

### Vectorization over Pandas series

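Huge improvement: about 50 times faster than `apply()` (83.1 ms down to 1.62 ms).

- Instead of calling the function once per row, whole columns (Series) of the dataframe are passed to it, so each operation inside `haversine` runs over all rows at once.
- This works because `haversine` only uses arithmetic and NumPy functions, which accept a Series as well as a scalar.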
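
As a hedged check that the column-wise call computes the same numbers as the row-wise one, the sketch below runs both on a made-up `toy` DataFrame (the coordinates are illustrative, not from the dataset). `haversine` is repeated so the snippet is self-contained.

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    MILES = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return MILES * 2 * np.arcsin(np.sqrt(a))

# Hypothetical coordinates near the reference point used in the notebook
toy = pd.DataFrame({
    'latitude': [40.75, 40.70, 40.80],
    'longitude': [-73.99, -74.01, -73.95],
})

# Row by row: one Python-level call per row
row_wise = toy.apply(
    lambda row: haversine(40.671, -73.985, row['latitude'], row['longitude']),
    axis=1,
)

# Column-wise: a single call, with the scalars broadcast against the Series
vectorized = haversine(40.671, -73.985, toy['latitude'], toy['longitude'])

print(np.allclose(row_wise, vectorized))
```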


In [47]:
%%timeit 

# Vectorized implementation of Haversine applied on Pandas series
df['distance'] = haversine(40.671, -73.985, df['latitude'], df['longitude'])

1.62 ms ± 20.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


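### Vectorization over NumPy arrays

Similar to the vectorization on Pandas, it requires:

- A function that can be applied to whole columns at once
- Passing the underlying NumPy arrays (`.values`) instead of the Series, which skips Pandas overhead such as index handling (1.62 ms down to 311 µs here, roughly 5x)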
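
A small sketch of the difference between passing a Series and passing its underlying array, using a made-up `toy` DataFrame. The same NumPy function returns a Series in one case and a plain `ndarray` in the other, and the array can still be assigned back as a column because its length matches. The `lat_rad` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a non-default index
toy = pd.DataFrame({'latitude': [40.75, 40.70]}, index=[10, 20])

as_series = np.deg2rad(toy['latitude'])         # stays a pandas Series
as_array = np.deg2rad(toy['latitude'].values)   # plain NumPy ndarray, no index

# Assigning the array back works positionally
toy['lat_rad'] = as_array

print(type(as_series).__name__, type(as_array).__name__)
```

The notebook uses `.values`; the pandas documentation recommends `Series.to_numpy()` for the same purpose in newer code.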

In [46]:
%%timeit

# Vectorized implementation of Haversine applied on NumPy arrays
df['distance'] = haversine(40.671, -73.985, df['latitude'].values, df['longitude'].values)

311 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
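
To put the five measurements side by side, this short sketch computes each approach's speedup relative to the manual loop. The timings are the mean values reported by `%%timeit` in the cells above; the dictionary keys are just labels chosen here.

```python
# Mean timings from the notebook outputs, converted to milliseconds
timings_ms = {
    'loop with iloc': 621.0,
    'iterrows': 168.0,
    'apply': 83.1,
    'pandas vectorized': 1.62,
    'numpy vectorized': 0.311,
}

baseline = timings_ms['loop with iloc']
speedups = {name: baseline / t for name, t in timings_ms.items()}

for name, factor in speedups.items():
    print(f'{name}: {factor:.1f}x')
```

On these numbers the NumPy version is close to 2000 times faster than the manual loop, and most of that gain comes from the step from `apply()` to vectorization.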
